Allan's Library: 2007

Tuesday, December 25, 2007

Happy Holidays and Seasons Greetings

Seasons Greetings to all. It is indeed a wonderful holiday, as the Google Scholar has published an important piece in the Semantic Web literature. He's done it again, writing a concise and cogent piece on the key elements which differentiate Web 3.0 from Web 2.0. In other news, a reader recently made a comment on a previous entry which I found to be very interesting. Here's what he said:

I (as a librarian) found the article and the whole topic very important. I especially enjoyed the conclusion. You wrote that "Web 3.0 is about bringing the miscellaneous back together meaningfully after it's been fragmented into a billion pieces." I was wondering if in your opinion this means that the semantic web may turn a folksonomy into some kind of structured taxonomy. We all know the advantages and disadvantages of a folksonomy. Is it possible for web 3.0 to minimize those disadvantages and maybe even make good use out of them?

My response? It'll sound cliched and tired: it's really too early to tell. But although it's murky as to what the Semantic Web will look like, all directions point to the possibility that folksonomies will play a key role. Here's why:

(1) Underneath the messiness of the Web is a fairly organized latent structure, whose backbones are web threads. A scale-free network is significantly dominated by a few highly connected hubs.

(2) What this means is that folksonomies and tagging are in fact controlled vocabularies in their own right. Much has been written about this. Recent studies have shown that the frequency distribution of tags in folksonomies tends to stabilize into power-law distributions. When a substantial number of users tag content for a long period of time, stable tags start appearing in the resulting folksonomy.

(3) Such a use of folksonomies could help overcome some of the inherent difficulties in ontology construction, thus potentially bridging Web 2.0 and the Semantic Web. By using folksonomies' collective categorization scheme as an initial knowledge base for constructing ontologies, the ontology author could then use the tagging distribution's most common tags as concepts, relations, or instances. Folksonomies do not a Semantic Web make -- but it's a good start.
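To make point (3) a little more concrete, here is a minimal sketch of how the most common tags in a folksonomy could be pulled out as candidate concepts for an ontology author to review. The tagging data, the function names, and the `min_count` threshold are all invented for illustration; a real folksonomy would be far larger and the cut-off would need human judgment.

```python
from collections import Counter

# Hypothetical tagging data: (user, resource, tag) triples, invented for illustration.
TAGGINGS = [
    ("ann", "page1", "semanticweb"),
    ("bob", "page1", "semanticweb"),
    ("cat", "page1", "rdf"),
    ("ann", "page2", "semanticweb"),
    ("bob", "page2", "ontology"),
    ("cat", "page2", "rdf"),
    ("dan", "page3", "semanticweb"),
    ("dan", "page3", "toread"),
]


def tag_frequencies(taggings):
    """Count how often each tag was applied across all users and resources."""
    return Counter(tag for _user, _resource, tag in taggings)


def candidate_concepts(taggings, min_count=2):
    """Return tags applied at least min_count times, most frequent first.

    min_count is an arbitrary illustrative threshold, not a recommended value.
    """
    freqs = tag_frequencies(taggings)
    return [tag for tag, count in freqs.most_common() if count >= min_count]


if __name__ == "__main__":
    print(tag_frequencies(TAGGINGS).most_common())
    print(candidate_concepts(TAGGINGS))
```

The rarely used tags (such as the personal "toread") drop out, which is the sense in which the stable head of the distribution behaves like a vocabulary. Deciding whether a surviving tag is a concept, a relation, or an instance is still left to the ontology author.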

Thursday, December 20, 2007

Information Science As Web 3.0?

In the early and mid-1950s, scientists, engineers, librarians, and entrepreneurs started working enthusiastically on the problem and solution defined by Vannevar Bush. There were heated debates about the “best” solution, technique, or system. What ultimately ensued became information retrieval (IR), a major subfield of Information Science.

In his article Information Science, Tefko Saracevic makes a bold prediction:

fame awaits the researcher(s) who devises a formal theoretical work, bolstered by experimental evidence, that connects the two largely separated clusters i.e. connecting basic phenomena (information seeking behaviour) in the retrieval world (information retrieval). A best seller awaits the author that produces an integrative text in information science. Information Science will not become a full-fledged discipline until the two ends are connected successfully.

As Saracevic puts it, IR is one of the most widely spread applications of any information system worldwide. So how come Information Science has yet to produce a Nobel Prize winner?

But the World Wide Web changed everything, particularly IR. Because the Web is a mess, everybody is interested in some form of IR as a solution to fix it. A number of academic-based efforts were initiated to develop mechanisms, such as search engines, “intelligent” agents and crawlers. Some of those were IR scaled, and adapted to the problem; others were a variety of extensions of IR.

Out of all this emerged commercial ventures, such as Yahoo!, whose basic objective was to provide search mechanisms for finding something of relevance for users on demand. Not to mention making lots of money. Disconcertingly, the connection to the information science community is tenuous, and almost non-existent – the flow of knowledge is one-sided, from IR research results into proprietary search engines. The reverse contribution to public knowledge is zero. A number of evaluations of these search engines have been undertaken simply by comparing some results between them or comparing their retrieval against some benchmarks.

As I've opined before, LIS will play a prominent role in the next stage of the Web. So who's it gonna be?

Tuesday, December 18, 2007

The Semantic Solution - A Browser?

In a recent discussion with colleagues about Web 2.0, we ran into the conundrum of what lies beyond Web 2.0 that would solve some of its limitations. I offered the idea of an automated Web browser - a portal - one that would not be unlike an Internet Explorer browser with which a user could just sign in, enter his or her password, and then freely surf the Semantic Web (or whatever parts of it exist). It would be an exciting journey. Dennis Quan and David Karger's How to Make a Semantic Web Browser proposes the following:

Semantic Web browser—an end user application that automatically locates metadata and assembles point-and-click interfaces from a combination of relevant information, ontological specifications, and presentation knowledge, all described in RDF and retrieved dynamically from the Semantic Web. With such a tool, naïve users can begin to discover, explore, and utilize Semantic Web data and services. Because data and services are accessed directly through a standalone client and not through a central point of access . . . . new content and services can be consumed as soon as they become available. In this way we take advantage of an important sociological force that encourages the production of new Semantic Web content by remaining faithful to the decentralized nature of the Web

I like this idea of a portal. To have everyone agree about how to implement W3C standards - RDF, SPARQL, OWL - is unrealistic. Not everyone will accept the extra work for no real sustainable incentive. That is perhaps why companies and private investors currently have no real vested interest in channelling funding to Semantic Web research. However, the Semantic Web portal is one method to combat the malaise. In many ways, it resembles the birth of Web 1.0, before Yahoo!'s remarkable directory and search engines. All we need is one Jim Clark and one Marc Andreessen, I guess.

(Maybe a librarian and an information scientist, or two?)

Friday, December 14, 2007

"Web 3.0" AND OR the "Semantic Web"

Although I have worked in health research centres and medical libraries, I have never worked professionally as a librarian in a health setting. That is why I have great admiration for health librarians such as The Google Scholar, who can multitask, working as a top-notch librarian while at the same time keeping up with cutting edge technology. The Google Scholar recently made a wonderful entry about Web 3.0 and the Semantic Web:

In medicine, there is virtually no discussion about web 3.0 (see this PubMed search for web 3.0 (zero results) and most of the discussion on the semantic web (see this PubMed search - ~100 results) is from the perspective of biology/ bioinformatics.
The dichotomy in the literature is both perplexing and unsurprising. On the one hand, semanticists are looking at a new intelligent web that has 'added meaning' to documents, and machine interoperability. On the other, web 3.0 advocates use '3.0' to be trendy, hip or to market themselves or their websites. That said, I prefer the web 3.0 label to the semantic web because it follows web 2.0 and suggests continuity.

I find it perplexing, too, that academics tend to subscribe to the term "Semantic Web" whereas practitioners and technology experts tend to refer to "Web 3.0." For example, the Journal of Cataloging and Classification recently had an entire issue devoted to the Semantic Web - without one mention of the term "Web 3.0."

Although the dichotomy in the literature is apparent, it's interesting that most of us associate Web 3.0 with the Semantic Web. It's not unlike a decade ago when we used the terms "Internet" and "Web" interchangeably -- even though they are not the same thing.

Tim Berners-Lee and the W3C envisioned the Web eventually progressing to become the Semantic Web. Standards such as RDF and DAML+OIL emerged as early as 1998, long before Web 2.0. Web 2.0 is not even mentioned by the W3C because it has no standards. In my opinion, Web 3.0 and the Semantic Web are separate entities. Web 3.0 goes one step further in that it will extend beyond the web browser and will not be limited to just the personal computer.

It is important that medical librarians -- all librarians for that matter -- join in (and even lead) the discourse, particularly since the Semantic Web & Web 3.0 will be based heavily on the principles of knowledge and information organization. Whereas Web 1.0 and 2.0 could not distinguish among Acetaminophen, Paracetamol, and Tylenol -- Web 3.0 will.
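As a rough illustration of what "distinguishing" those three names could mean in practice, here is a tiny sketch in which each surface term points to one shared concept identifier. The identifiers and the extra ibuprofen entry are invented for contrast, and treating a brand name and its active ingredient as a single concept is a deliberate simplification; a real system would draw on a maintained vocabulary rather than a hand-written dictionary.

```python
# Hypothetical mini-vocabulary: every surface term maps to one concept identifier.
# The "drug:" identifiers are made up for this sketch.
TERM_TO_CONCEPT = {
    "acetaminophen": "drug:acetaminophen",
    "paracetamol": "drug:acetaminophen",
    "tylenol": "drug:acetaminophen",  # brand name, simplified to the same concept
    "ibuprofen": "drug:ibuprofen",    # added only for contrast
}


def concept_for(term):
    """Look up the concept identifier for a term, ignoring case and outer spaces."""
    return TERM_TO_CONCEPT.get(term.strip().lower())


def same_concept(term_a, term_b):
    """True only if both terms are known and resolve to the same concept."""
    concept_a = concept_for(term_a)
    concept_b = concept_for(term_b)
    return concept_a is not None and concept_a == concept_b


if __name__ == "__main__":
    print(same_concept("Tylenol", "Paracetamol"))
    print(same_concept("Tylenol", "Ibuprofen"))
```

A keyword search sees three unrelated strings; a lookup like this sees one concept with three labels, which is the kind of knowledge organization librarians already practise.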

Tuesday, December 11, 2007

Google and End of Web 2.0

Google Scholar recently celebrated its third birthday. There were some old friends who showed up at the party (the older brother Google arrived a bit late though) -- but overall, it was a fairly quiet evening atop Mountain View. So where are we now with Google Scholar? Has the tool lived up to its early hype? What improvements have been made to Scholar in the past year? In a series of fascinating postings, my colleague, The Google Scholar, made some insightful comments, particularly when he argues:

What Google scholar has done is bring scholars and academics onto the web for their work in a way that Google alone did not. This has led to a greater use of social software and the rise of Web 2.0. For all its benefits, Web 2.0 has given us extreme info-glut which, in turn, will make Web 3.0 (and the semantic web) necessary.

I agree. Google Scholar (and Google) are very much Web 2.0 products. As I elaborated in my previous entry, AJAX (which is Web 2.0-based) produced many remarkable programs such as Gmail and Google Earth.

Was this destiny? Not really. As Yihong Ding proposes, Web 2.0 did not choose Google; rather, it was Google that had decided to follow Web 2.0. If Yahoo had only known about the politics of the Web a little earlier, it might have precluded Google. (But that's for historians to analyze). Yahoo! realized the potential of Web 2.0 too late; it purchased Flickr without really understanding how to fit it into Yahoo!'s Web 1.0 universe.

Back to Dean's point. Google's strength might ultimately lead to its own demise. The PageRank algorithm might have a drawback similar to Yahoo!'s once dominant directory. Just as Yahoo! failed to catch up with the explosion of the Web, Google's PageRank will slowly lose its dominance due to the explosion caused by Web 2.0. With richer semantics, Google might not be willing to drastically alter its algorithm since it is Google's bread-and-butter. So that is why Google and Web 2.0 might be feeling the weight of the future fall too heavily on their shoulders.

Sunday, December 09, 2007

AJAX'ing our way to Web 2.0

Part of my day job entails analyzing technologies and how they better serve users. But one thing we seem to forget when promoting Web 2.0 is the flaws it brings with it. Because one of the core technologies of Web 2.0 is AJAX, I've been looking around for a good analysis of it. David Best's Web 2.0: Next Big Thing or Next Big Internet Bubble seems to do the job. AJAX is a core component of Web 2.0, as it introduces an engine that runs on the client side - the Web browser. Certain actions can be carried out in the engine and need no data transfer to the server; thus, they are carried out only on the client's computer and are quite fast, comparable to desktop applications. In the HTML-world of Web 1.0, a Web page has to completely reload after a user action, such as clicking on links, or entering data in a form.

Gmail, Google Maps, and Flickr are all AJAX (and therefore Web 2.0) applications. Yet, just because it's got the Web 2.0 label does not necessarily mean it is "better." Why? Let's take a look at Gmail and Flickr, and see the advantages and disadvantages of their reliance on AJAX-technology:

(1) Rich User Experience - Fast! Responses to user actions are quick and the Web applications behave like desktop applications (e.g. drag and drop).

(2) JavaScript - AJAX is built on JavaScript. Unlike Web 1.0 applications, applications that require JavaScript exclude ten percent of all Web users, an issue the W3C is concerned about. Without going into the technology, JavaScript and the components it relies on (such as ActiveX, a known security problem in Internet Explorer) bar many users from AJAX use.

(3) The Back Button - In Web 1.0, browsers keep a history of whole Web pages, so many users are surprised that the back button does not work this way in Gmail: because it is an AJAX application, single actions are not cacheable by the browser.

(4) Bookmarking - Web 2.0 is based on rich user experience; unfortunately, this means that as with many dynamically generated pages, bookmarking or linking to a certain state of such a page is nearly impossible, as those states are not uniquely identifiable by URL. (Try bookmarking on Flickr!)

Thursday, December 06, 2007

Are You Ready For Library 3.0?

Are you ready for Library 2.0? We might just be too late because Library 3.0 is just around the corner according to some observers. How can libraries learn from the other service industries? How will librarians keep up with subject-specific skills (evidence-based medicine, law, problem-based learning)? Are librarian skills out of alignment with these trends? As Saw and Todd point out in Library 3.0: Where Are Our Skills, the future of academic libraries will be a digital one, where the successful librarian will be flexible, adaptable, and multi-skilled in order to survive in an environment of constant and rapid change. Drivers for change will require this new generation of librarians not only to navigate new technologies and understand their users’ behaviour, but ultimately to understand themselves (Generation X and Y’s). So what are some attributes of Librarian 3.0?

(1) Institutionalization – Creating the right culture. Flexible hours and attractive salaries, without micromanagement, while encouraging teamwork and giving individual praise and recognition for accomplishments. The key to retaining these employees is the quality of relationships they have with their managers - Gen X and Y's demand a better balance between their work and personal lives.

(2) Innovation – Doing things differently – Innovative services will mean taking-the-service to the clients. An example would be the “Librarian With a Latte” program from the University of Michigan at Ann Arbor.

(3) Imagination – Changing the rules. Collaboration with a wide range of information providers, where rethinking of the catalogue means it is no longer relevant in its current form – the catalogue should be a “one-stop shop” for searching resources, providing access beyond local collections, and to different types of resources in a seamless way

(4) Ideation – A Culture that encourages ideas – In creating the appropriate working environment, it is necessary to be also supported by professional associations.

(5) Inspiration – Doing things differently – As competition increases for the future workforce, ongoing professional development as opposed to formal training in a library school is necessary. Free web-based instruction similar to the popular Five Weeks to a Social Library is already popping up.

So what does this all mean? It might sound like an eye-rolling cliche: information professionals of the future will have to be prepared for lifelong learning. This is a challenge for many professionals, who argue that their plates are already full to the brim. What to do? The authors leave us with a daunting reference from Charles Darwin:

It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change

Tuesday, December 04, 2007

I See No Forests But the Trees . . .

"So where is it?" is the question that most information professionals and scholars ask when they approach the topic of the Semantic Web. Everyone's favourite computer scientist, Yihong Ding, makes an interesting observation in Web Evolution Theory and The Next Stage: Part 2, one with which I agree wholeheartedly:

The transition from Web 1.0 to Web 2.0 is not supervised. W3C had not launched a special group for a plot of Web 2.0; and neither did Tim O'Reilly though he was one of the most insightful observers who caught and named this transition and one of the most anxious advocates of Web 2.0. In comparison, W3C did have launched a special group about Semantic Web that was engaged by hundreds of brilliant web researchers all over the world. The progress of WWW in the past several years, however, shows that the one lack of supervision (Web 2.0) advanced faster than the one with lots of supervision (Semantic Web). This phenomenon suggests the existence of web evolution laws that is objective to individual willingness.

Even Tim O'Reilly pointed out that Web 2.0 largely came out of a conference when exhausted software engineers and computer programmers from the dot.com disaster saw common trends happening on the Web. Nothing is scripted in Web 2.0. Perhaps that's why there can never be a definitive agreement on what it constitutes. As I give instructional sessions and presentations on Web 2.0 tools, sometimes I wonder what wikis, blogs, social bookmarking, and RSS feeds will look like two years from now. Will they be relevant? Or will they transmute into something entirely different? Or will we continue on with the status quo?

Is Web 2.0 merely an interim to the next planned stage of the Web? Are we seeing trees, but missing the forest?

Friday, November 30, 2007

Digital Libraries in the Semantic Age

Brian Matthews of CCLRC Appleton Laboratory offers some interesting insights in Semantic Web Technologies. In particular, he argues that libraries are increasingly converting themselves to digital libraries. A key aspect for the Digital library is the provision of shared catalogues which can be published and browsed. This requires the use of common metadata to describe the fields of the catalogue (such as author, title, date, and publisher), and common controlled vocabularies to allow subject identifiers to be assigned to publications.

As Matthews proposes, by publishing controlled vocabularies in one place, which can then be accessed by all users across the Web, library catalogues can use the same Web-accessible vocabularies for cataloguing, marking up items with the most relevant terms for the domain of interest. Therefore, search engines can use the same vocabularies in their search to ensure that the most relevant items of information are returned.

The Semantic Web opens up the possibility to take such an approach. It offers open standards that can enable vendor-neutral solutions, with a useful flexibility (allowing structured and semi-structured data, formal and informal descriptions, and an open and extensible architecture) and it helps to support decentralized solutions where that is appropriate. In essence, RDF can be used as this common interchange for catalogue metadata and shared vocabulary, which can then be used by all libraries and search engines across the Web.

But in order to use the Semantic Web to its best effect, metadata needs to be published in RDF formats. There are several initiatives involved with defining metadata standards, and some of them are well known to librarians:

(1) Dublin Core Metadata Initiative

(2) MARC

(3) ONIX

(4) PRISM
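For a sense of what publishing catalogue metadata in RDF might look like, here is a small sketch that turns one catalogue record into N-Triples lines using Dublin Core element names for the fields mentioned above (author, title, date, publisher, subject). The record, the item URI under example.org, and the function names are made up for illustration; a real catalogue would more likely use an established RDF library and richer typing.

```python
# Dublin Core element set namespace.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical catalogue record and item URI, invented for this sketch.
ITEM_URI = "http://example.org/catalogue/item/1"
RECORD = {
    "title": "An Example Title",
    "creator": "Example, Author",
    "date": "2007",
    "publisher": "Example Press",
    "subject": "Semantic Web",
}


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal.

    Only these two characters are handled here; newlines and other control
    characters would need escaping too in a complete serializer.
    """
    return value.replace("\\", "\\\\").replace('"', '\\"')


def record_to_ntriples(item_uri, record):
    """Serialize each field of the record as one plain-literal N-Triples line."""
    lines = []
    for field, value in record.items():
        lines.append(f'<{item_uri}> <{DC}{field}> "{escape_literal(value)}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    print(record_to_ntriples(ITEM_URI, RECORD))
```

Because every statement names its property with a shared URI, a second library or a search engine can read the same lines without knowing anything about the first library's local catalogue format.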

Wednesday, November 21, 2007

Postmodern Librarian - Part Two

To continue where we had left off. True, Digital Libraries and the Future of the Library Profession intimates that libraries and perhaps librarianship have entered the postmodern age. But Joint isn't the first to make such an argument; many others have argued likewise. In fact, I had written about it before, too. But I believe that stopping at the modernist-postmodernist dichotomy misses the point.

In my opinion, perhaps this is where Web 2.0 comes in. Although the postmodern information order is not clear to us, it seems to be the dynamic behind Web 2.0, in which interactive tools such as blogs, wikis, and RSS facilitate social networking and the anarchic storage and unrestrained distribution of content. According to Joint, much of our professional effort to impose a realist-modernist model on our libraries will fail. The old LIS model needs to be re-theorized, just as Newtonian Physics had to evolve into Quantum Theory, in recognition of the fact that super-small particles simply were not physically located where Newtonian Physics said they should be. In this light, perhaps this is where we can start to understand what exactly Web 2.0 is. And beyond.

Friday, November 16, 2007

Semantic Web: A McCool Way of Explaining It

Yahoo's Rob McCool argues in Rethinking the Semantic Web, Part 1 that the Semantic Web will never happen. Why? Because the Semantic Web has three fundamental parts, and they just don't fit together based on current technologies. Here is what we have. The foundation is the set of data models and formats that provide semantics to applications that use them (RDF, RDF Schema, OWL). The second layer is composed of services - purely machine-accessible programs that answer Web requests and perform actions in response. At the top are the intelligent agents, or applications.

Reason? Knowledge representation is a technique with mathematical roots in the work of Edgar Codd, widely known as the one whose original paper using set theory and predicate calculus led to the relational database revolution in the 1980s. Knowledge representation uses the fundamental mathematics of Codd's theory to translate information, which humans represent with natural language, into sets of tables that use well-defined schema to define what can be entered in the rows and columns.

The problem is that this creates a fundamental barrier, in terms of richness of representation as well as creation and maintenance, compared to the written language that people use. Logic, which forms the basis of OWL, suffers from an inability to represent exceptions to rules and the contexts in which they're valid.

Databases are deployed only by corporations whose information-management needs require them or by hobbyists who believe they can make some money from creating and sharing their databases. Because information theory removes nearly all context from information, both knowledge representation and relational databases represent only facts. Complex relationships, exceptions to rules, and ideas that resist simplistic classifications pose significant design challenges to information bases. Adding semantics only increases the burden exponentially.

Because it's a complex format and requires users to sacrifice expressivity and pay enormous costs in translation and maintenance, McCool believes the Semantic Web will not achieve widespread support. Never? Not until another Edgar Codd comes our way. So we wait.

Wednesday, November 14, 2007

The Postmodern Librarian?

Are we in the postmodern era? Nicholas Joint's Digital Libraries and the Future of the Library Profession seems to think so. In it, he argues that unique contemporary cultural shifts are leading to a new form of librarianship that can be characterized as "postmodern" in nature, and that this form of professional specialism will be increasingly influential in the decades to come.

According to Joint, the idea of the postmodern digital library is clearly very different from the interim digital library. In the summer of 2006, at a workshop at the eLit conference in Loughborough on the cultural impact of mobile communication technologies, there emerged the Five Theses of Loughborough. Here they are:

(1) There are no traditional information objects on the internet with determinate formats or determinate qualities: the only information object and information format on the internet is "ephemera"

(2) The only map of the internet is the internet itself, it cannot be described

(3) A hypertext collection cannot be selectively collected because each information object is infinite and infinity cannot be contained

(4) The problem of digital preservation is like climate change; it is man-made and irreversible, and means that much digital data is ephemeral; but unlike climate change, it is not necessarily catastrophic

(5) Thus, there is no such thing as a traditional library in a postmodern world. Postmodern information sets are just as accessible as traditional libraries: there are no formats, no descriptions, no hope of collection management, no realistic possibility of preservation. And they work fine.

Monday, November 12, 2007

New York City In a Semantic Web

Tim Krichel in The Semantic Web and an Introduction to Resource Description Framework makes a very astute analogy for understanding the technology behind the Semantic Web, particularly the nuances of XML and RDF, where the goal is to move away from the present Web - where pages are essentially constructed for human consumption - to a Web where more information can be understood and treated by machines. The analogy goes like this:

We fit each car in New York City with a device that lets a reverse geographical position system reads its movements. Suppose, in addition, that another machine can predict the weather or some other phenomenon that impacts traffic. Assume that a third kind of device has the public transport timetables. Then, data from a collaborative knowledge picture of these machines can be used to advise on the best means of transportation for reaching a certain destination within the next few hours.

The computer systems doing the calculations required for the traffic advisory are likely to be controlled by different bodies, such as the city authority or the national weather service. Therefore, there must be a way for software agents to process the information from the machine where it resides, to proceed with further processing of that information to a form in which a software agent of the final user can be used to query the dataset.

Wednesday, November 07, 2007

Genre Searching

At today's SLAIS colloquium, Dr. Luanne Freund gave a presentation on Genre Searching: A Pragmatic Approach to Information Retrieval. Freund argues for taking a pragmatic approach in genre searching and genre classification. But there are two perspectives on pragmatics: socio-pragmatic and cognitive-pragmatic. Using a high-tech firm as a case study, Freund and her colleagues built a unique search engine called X-Cite, which pulls together documents from the corporate intranet (which include anything from FAQs to specialized manuals) with tags. In ranking documents based on title, abstract, and keywords as part of the search engine, the algorithm uniquely cuts down on the ambiguity and guesswork of searching. Using a software engineering workplace domain as its starting point, Freund believes that genre searching has the potential to make a significant contribution to the effectiveness of workplace search systems, by incorporating genre weights into the ranking algorithm.
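The idea of folding genre weights into a ranking can be sketched in a few lines. This is only my own illustration, not X-Cite's actual formula: the genres, the weights, the text scores, and the choice to multiply the two together are all assumptions made for the example.

```python
# Hypothetical genre weights for one kind of workplace task (values invented).
GENRE_WEIGHTS = {"manual": 1.5, "faq": 1.2, "presentation": 0.8}

# Hypothetical documents with a precomputed text-match score (title, abstract,
# keywords) and a genre label assigned during classification.
DOCS = [
    {"id": "d1", "genre": "presentation", "text_score": 0.9},
    {"id": "d2", "genre": "manual", "text_score": 0.7},
    {"id": "d3", "genre": "faq", "text_score": 0.8},
]


def rank(documents, genre_weights, default_weight=1.0):
    """Sort documents by text score multiplied by their genre's weight.

    Genres without a weight fall back to default_weight, so an empty weight
    table reduces to plain text-score ranking.
    """
    def final_score(doc):
        return doc["text_score"] * genre_weights.get(doc["genre"], default_weight)

    return sorted(documents, key=final_score, reverse=True)


if __name__ == "__main__":
    print([doc["id"] for doc in rank(DOCS, {})])             # text score only
    print([doc["id"] for doc in rank(DOCS, GENRE_WEIGHTS)])  # with genre weights
```

With these made-up numbers the manual overtakes the presentation once genre is taken into account, even though the presentation matched the query text more closely.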

In genre analysis, three steps must be taken:

(1) Identify - The core genre repertoire of the work domain

(2) Develop - A standard taxonomy to represent it

(3) Develop - Operational definitions of the genre classes in the taxonomy, including identifying features in terms of form, function and content to facilitate manual and automatic genre classification.

Throughout the entire presentation, my mind kept returning to the question: is this not another specialized form of social searching? A tailored search engine which narrows its search to a specific genre? Although the two are entirely different things, I keep thinking that creating your own search engine is certainly much easier.

Simple Knowledge Organization System (SKOS) & Librarians

Miles and Perez-Aguera's SKOS: Simple Knowledge Organization for the Web introduces SKOS, a Semantic Web language for representing structured vocabularies, including thesauri, classification schemes, subject heading systems, and taxonomies -- tools that cataloguers and librarians use every day in their line of work.

It's interesting that the very essence of librarianship and cataloging will play a vital role in the upcoming version of the Web. It's hard to fathom how this works: how can MARC records and the DDC have anything to do with the intelligent agents which form the layers of architecture of the Semantic Web and Web 3.0? The answer: metadata.

And even more importantly: the messiness and disorganization of the Web will require information professionals with the techniques and methods to reorganize everything coherently. Web 1.0 and 2.0 were about creating -- but the Semantic Web will be about orderliness and regulating. Take a closer look at Miles & Perez-Aguera's article -- it's well worth a read. SKOS is designed to represent the following kinds of controlled, structured vocabularies:

(1) Thesauri - Broadly conforming to the ISO 2788:1986 guidelines such as the UK Archival Thesaurus (UKAT, 2004), the General Multilingual Environmental Thesaurus (GEMET), and the Art and Architecture Thesaurus

(2) Classification Schemes - Such as the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC), and the Bliss Classification (BC2)

(3) Subject Heading Systems - The Library of Congress Subject Headings (LCSH) and the Medical Subject Headings (MeSH)
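To give a feel for how a vocabulary of this kind looks once it is expressed with SKOS properties, here is a small sketch using three SKOS terms: skos:prefLabel, skos:altLabel, and skos:broader. The three concepts and their "ex:" identifiers are invented for illustration and do not come from any of the vocabularies listed above; the code keeps everything in plain Python structures rather than using an RDF library.

```python
# SKOS core namespace.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical three-concept vocabulary (identifiers and labels invented).
CONCEPTS = {
    "ex:information-science": {
        "prefLabel": "Information science",
        "altLabel": [],
        "broader": [],
    },
    "ex:information-retrieval": {
        "prefLabel": "Information retrieval",
        "altLabel": ["IR"],
        "broader": ["ex:information-science"],
    },
    "ex:search-engines": {
        "prefLabel": "Search engines",
        "altLabel": ["Web search engines"],
        "broader": ["ex:information-retrieval"],
    },
}


def broader_transitive(concept_id, concepts=CONCEPTS):
    """Follow skos:broader links upward and return all ancestors, nearest first."""
    ancestors = []
    queue = list(concepts[concept_id]["broader"])
    while queue:
        current = queue.pop(0)
        if current not in ancestors:
            ancestors.append(current)
            queue.extend(concepts[current]["broader"])
    return ancestors


def find_by_label(label, concepts=CONCEPTS):
    """Return the concept whose preferred or alternative label matches, if any."""
    wanted = label.strip().lower()
    for concept_id, data in concepts.items():
        labels = [data["prefLabel"], *data["altLabel"]]
        if wanted in (candidate.lower() for candidate in labels):
            return concept_id
    return None


def to_triples(concepts=CONCEPTS):
    """Flatten the vocabulary into (subject, property URI, value) triples."""
    triples = []
    for concept_id, data in concepts.items():
        triples.append((concept_id, SKOS + "prefLabel", data["prefLabel"]))
        for alt in data["altLabel"]:
            triples.append((concept_id, SKOS + "altLabel", alt))
        for parent in data["broader"]:
            triples.append((concept_id, SKOS + "broader", parent))
    return triples


if __name__ == "__main__":
    print(find_by_label("IR"))
    print(broader_transitive("ex:search-engines"))
    for triple in to_triples():
        print(triple)
```

Preferred and alternative labels are the "use" and "used for" references of a thesaurus, and the broader links are its BT relationships, which is why this part of the Semantic Web should feel familiar to any cataloguer.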

Friday, November 02, 2007

New Librarians, New Possibilities?

Are newer, incoming librarians changing the profession? Maybe. But not yet. University Affairs has published an article called The New Librarians, which highlights some of the new ideas that newer librarians are bringing into academic libraries. Everyone's favourite University Librarian (at least for me), Jeff Trzeciak, who has his own blog, is featured in the piece. In it, he describes how he has swiftly hired new Library 2.0-ready librarians and overturned the traditional decor and culture of McMaster Library: "café, diner-style booths, stand-up workstations, oversized ottomans, and even coffee tables with pillows on the floor will take their place, all equipped for online access. Interactive touch-screen monitors will line the wall."

University of Guelph Chief Librarian Michael Ridley similarly sees a future where the university library serves as an “academic town square,” a place that "brings people and ideas together in an ever-bigger and more diffuse campus. Services in the future will include concerts, lectures, art shows – anything that trumpets the joy of learning."

Is this the future of libraries? Yes, it's a matter of time. That's where we're heading -- that's where we'll end up. Change is difficult, particularly in larger academic institutions where bureaucracy and politics play an essential role in all aspects of operations. There is great skepticism towards Jeff Trzeciak's drastic changes to McMaster Library -- he's either a pioneer if he succeeds, or an opportunist if he fails. A lot is riding on Jeff's shoulders.

Tuesday, October 30, 2007

Introducing Semantic Searching

Just as we had Google and Web 2.0 nearly figured out, the Semantic Web is just around the corner. Introducing hakia, one of the first truly Semantic Web search engines. As we had argued, the Semantic Web is a digital catalogue, and one of its key components is an understanding of ontologies and taxonomies. Built on Semantic Web technologies, hakia is a new "meaning-based" (semantic) search engine with the purpose of improving search relevancy and interactivity -- the potential benefits for end users are search efficiency, richness of information, and saving time. Will the hakia team be the next Brin and Page? Why don't you try it? Here are the elements which make up hakia:

(1) Ontological Semantics (OntoSem) - A formal and comprehensive linguistic theory of meaning in natural language. As such, it bears significantly on philosophy of language, mathematical logic, and cognitive science

(2) Query Detection and Extraction (QDEX) - A system invented to bypass the limitations of the inverted index approach when dealing with semantically rich data

(3) SemanticRank algorithm - Deploys a collection of methods to score and rank paragraphs that are retrieved from the QDEX system for a given query. The process includes query analysis, best sentence analysis, and other pertinent operations

(4) Dialogue - In order to establish a human-like dialogue with the user, the dialogue algorithm's goal is to convert the search engine's role into a computerized assistant with advanced communication skills while utilizing the largest amount of information resources in the world.

(5) Search mission - Google's mission was to organize the world's information and make it universally accessible and useful. hakia's mission is to search for better search.

Monday, October 22, 2007

A Definition of the Semantic Web

Parker, Nitse, and Flowers' Libraries as Knowledge Management Centers makes a good point about special libraries. Libraries need to be at the forefront of technology, or else they'll be an endangered species. As libraries struggle with the fallout of the digital age, they must find a creative way to remain relevant to the twenty-first-century user who has the ability and means of finding vast amounts of information without even setting foot in a library. The authors go on to suggest that an understanding of the Semantic Web is necessary for those working in libraries. They offer an excellent definition of the Semantic Web -- one of the best I've seen so far:

Today's web pages are designed for human use, and human interpretation is required to understand the content. Because the content is not machine-interpretable, any type of automation is difficult. The Semantic Web augments today's web to eliminate the need for human reasoning in determining the meaning of web-based data. The Semantic Web is based on the concept that documents can be annotated in such a way that their semantic content will be optimally accessible and comprehensible to automated software agents and other computerized tools that function without human guidance. Thus, the Semantic Web might have a more significant impact in integrating resources that are not in a traditional catalog system than in changing bibliographic databases.

Thursday, October 11, 2007

Three Perspectives of the Semantic Web

Catherine Marshall and Frank Shipman have an interesting insight in Which Semantic Web? In it, they argue that the plethora of interpretations of the Semantic Web can be traced back to three different perspectives. Here they are:

(1) A Universal Library - Readily accessed and used by humans in a variety of information-use contexts. This perspective arose as a reaction to the disorder of the Web, which had no ordered categorization until search engines came along. Metadata, cataloguing, and schemas were seen as the answer.

(2) Computational Agents - Completing sophisticated activities on behalf of their human counterparts. Tim Berners-Lee envisioned an infrastructure for knowledge acquisition, representation, and utilization across diverse use contexts. This global knowledge base will be used by personal agents to collect and reason about information, assisting people with tasks common to everyday life.

(3) Federated Data and Knowledge Base - In this vision, federated components are developed with some knowledge of one another, or at least with a shared anticipation of the type of applications that will use the data. In essence, this Web encompasses languages used for syntactically sharing data rather than having to write specialized converters for each pair of languages.

Wednesday, October 10, 2007

Knowledge Management 3.0

Michael Koenig and T. Kanti Srikantaiah proffer the idea that Knowledge Management is in its third phase. Here are the different stages:

Stage 1 - Internet of Intellectual Capital - this initial stage of KM was driven primarily by IT. In this stage, organizations realized that their stock in trade was information and knowledge -- yet the left hand rarely knew what the right hand did. When the Internet emerged, KM was about how to deploy the new technology to accomplish those goals.

Stage 2 - Human & Cultural dimensions - the hallmark phrase is communities of practice. KM during this stage was about knowledge creation as well as knowledge sharing and communication.

Stage 3 - Content & Retrievability - consists of structuring content and assigning descriptors (index terms). In content management and taxonomies, KM is about the arrangement, description, and structure of that content. Interestingly, taxonomies are perceived by the KM community as emanating from natural scientists, when in fact they are the domain of librarians and information scientists. To take this one step further, the Semantic Web is also built on taxonomies and ontologies. Anyone see a trend? Perhaps a convergence?

Monday, October 08, 2007

When is an Apple, an Apple?

In Linked: How Everything Is Connected to Everything Else and What It Means, Albert-Laszlo Barabasi proposes that the ultimate search engine is one that can tap into the input of every person here on Earth. Although no such search engine exists, he argues that Google is the closest we have to a “worldly” search engine because of its PageRank algorithm.

I argue that we can go one step further because with the advent of Web 2.0, social search is actually the closest that we have to gathering input from all of the world’s users. How? Why? Let me explain with an analogy.

It’s not a matter of how, but a matter of when. Web 2.0 is very much like an apple. An apple can be food, a paperweight, a target, or a weapon if needed. It can be whatever you want it to be when you want it to be. The same goes for social searching. It is not limited to search engines.

Del.icio.us is a social bookmarking web service. But it can be a powerful search tool if used properly; essentially, it taps into the social preferences of other users. The same goes for YouTube: it’s a video sharing website, but what’s to say that it can’t be used for searching videos on relevant topics? What’s to say that you can’t search for related videos based on videos bookmarked by others? Social search is not based on a program; it is a mindset, a metaphorical sweet fruit, if you will.

In many ways, social searching is not unlike what librarians did (and still do) in the print-based world, where an elegant craft of creativity and perseverance was required to find the right materials and put them into the hands of the patron; the only difference is that the search has become digital.
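To make the idea of searching through other users' preferences concrete, here is a minimal sketch in Python. It is not how Del.icio.us or any other service actually works; the bookmark data, the user names, and the social_search function are all invented for illustration. It simply ranks pages by how many different people saved them under a given tag.

```python
from collections import Counter

# Hypothetical sample data: (user, url, tag) bookmarks, invented for illustration.
BOOKMARKS = [
    ("ana", "http://example.org/rdf-primer", "semanticweb"),
    ("ben", "http://example.org/rdf-primer", "semanticweb"),
    ("cho", "http://example.org/rdf-primer", "rdf"),
    ("ana", "http://example.org/owl-guide", "semanticweb"),
    ("ben", "http://example.org/cats", "funny"),
]


def social_search(tag, bookmarks):
    """Rank URLs by how many distinct users saved them under `tag`."""
    users_by_url = {}
    for user, url, bookmark_tag in bookmarks:
        if bookmark_tag == tag:
            users_by_url.setdefault(url, set()).add(user)
    counts = Counter({url: len(users) for url, users in users_by_url.items()})
    # most_common() sorts from the most-bookmarked URL to the least.
    return [url for url, _ in counts.most_common()]
```

Calling social_search("semanticweb", BOOKMARKS) puts the page saved by two people ahead of the page saved by one, which is the "social preference" at work.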

Friday, October 05, 2007

YouTube University

UC Berkeley has become the first university to formally offer videos of full course lectures via YouTube. Two hundred clips, representing eight full classes, have been uploaded so far. Here is "SIMS 141 - Search, Google, and Life: Sergey Brin - Google." Enjoy.

Wednesday, October 03, 2007

Of Ontologies + Taxonomies

In 2002 -- two years before Tim O'Reilly's famous coining of the term "Web 2.0" -- Katherine Adams of the Los Angeles Public Library had already argued that librarians will be an essential piece of the Semantic Web equation. In The Semantic Web: Differentiating Between Taxonomies and Ontologies, Adams makes a few strong arguments that are strikingly ahead of their time. Long before wikis, blogs, and RSS feeds had come to prominence (5 years ago!), Adams had the foresight to point out the importance of librarians in reply to Berners-Lee et al's vision. Here are Adams's main points, all of which I find fascinating given that they are based on pre-Web 2.0 knowledge:

(1) Taxonomies: An Important Part of the Semantic Web - The new Web entails adding an extra layer of infrastructure to the current HTML Web - metadata in the form of vocabularies and the relationships that exist between selected terms will make it possible for machines to understand conceptual relationships as humans do.

(2) Defining Ontologies and Taxonomies - Ontologies and taxonomies are used synonymously -- Computer Scientists refer to hierarchies of structured vocabularies as "ontology" while librarians call them "taxonomy."

(3) Standardized Language and Conceptual Relationships - Both taxonomies and ontologies consist of a structured vocabulary that identifies a single key term to represent a concept that could be described using several words.

(4) Different Points of Emphasis - Computer Science is concerned with how software and associated machines interact with ontologies; librarians are concerned with how patrons retrieve information with the aid of taxonomies. However, they're essentially different sides of the same coin.

(5) Topic Maps As New Web Infrastructure - Topic maps will ultimately point the way to the next stage of the Web's development. They represent a new international standard (ISO 13250). In fact, even the OCLC is looking to topic maps in its Dublin Core Initiative to organize the Web by subject.
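Points (2) and (3) above can be sketched in a few lines of Python. This is only an illustration: the vocabulary, the broader-term hierarchy, and the function names are made up, and a real taxonomy or ontology is far richer. The sketch shows one preferred term standing in for several variant words, plus a simple hierarchy of broader terms.

```python
# Hypothetical structured vocabulary: each preferred term lists its variants.
VOCABULARY = {
    "automobiles": ["car", "cars", "auto", "motor vehicle"],
    "libraries": ["library", "bibliotheque"],
}

# Hypothetical hierarchy: each term points to its broader term.
BROADER = {"automobiles": "vehicles", "vehicles": "transportation"}

# Build a lookup from every variant (and each preferred term) to the preferred term.
LOOKUP = {
    variant: preferred
    for preferred, variants in VOCABULARY.items()
    for variant in variants
}
LOOKUP.update({preferred: preferred for preferred in VOCABULARY})


def preferred_term(word):
    """Return the single key term for a word, or None if it is not in the vocabulary."""
    return LOOKUP.get(word.strip().lower())


def broader_terms(term):
    """Walk up the hierarchy and list every broader term, nearest first."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain
```

Whether one calls this structure a taxonomy or an ontology, the lookup is the same: a patron's word (or a program's) is mapped to the agreed term, and the hierarchy supplies the conceptual relationships.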

Monday, October 01, 2007

Web 3.0 Librarian

My colleague Dean Giustini and I have collaborated on an article, The Semantic Web as a large, searchable catalogue: a librarian’s perspective. In it, we argue that librarians will play a prominent role in Web 3.0. The current Web is disjointed and disorganized, and searching is much like looking for a needle in a haystack.

It's not unlike the library before Melvil Dewey introduced the idea of organizing and cataloguing books in a classification system. In many ways, we see the parallels here 130 years later. It's not surprising at all to see the OCLC at the forefront in developing Semantic Web technologies. Many of the same techniques of bibliographic control apply to the possibilities of the Semantic Web. It was the computer scientists and computer engineers who had created Web 1.0 and 2.0, but it will ultimately be individuals from library science and information science who will play a prominent role in the evolution of organizing the messiness into a coherent whole for users. Are we saying that Web 2.0 is irrelevant? Of course not. Web 2.0 is an intermediary stage. Folksonomies, social tagging, wikis, blogs, podcasts, mashups, etc -- all of these things are essential basic building blocks to the Semantic Web.

Thursday, September 27, 2007

Libraries and the Semantic Web

Interestingly, not much has been talked about in terms of librarianship and Semantic Web technologies. It's as if there's a gap that can never be bridged: the rustic gatekeeper of books and high-end cutting edge programmer-speak. Quite recently, Jane Greenberg, professor of Library and Information Science at the University of North Carolina at Chapel Hill, has pointed out in Advancing the Semantic Web via Library Functions that there are many similarities between the library and Semantic Web. Here are some:

(1) Each has developed as a response to an abundance of information

(2) Both have mission statements grounded in service, information access, and knowledge discovery

(3) Both have advanced as a result of international and national standards

(4) Both have grown due to a collaborative spirit

(5) Both have become a part of society's fabric (although not so much yet for the Semantic Web)

Monday, September 24, 2007

Four Ways to Look at the Web

The Semantic Web is far from the monolithic, artificially intelligent machine which could seemingly process the whim of a user's thoughts. Cade Metz's Web 3.0: Tomorrow's Web, Today offers an excellent and concise glimpse into the multitude of possibilities of this new Web. Although still in its hyper-conceptual stages, Metz envisions four directions which Web 3.0 could take:

(1) The Semantic Web - A Web where machines can read sites as easily as humans read them. You ask your machine to check your schedule against the schedules of all the dentists and doctors within a 10-mile radius—and it obeys.

(2) The 3D Web - A Web you can walk through. Without leaving your desk, you can go house hunting across town or take a tour of Europe. Or you can walk through a Second Life–style virtual world, surfing for data and interacting with others in 3D.

(3) The Media-Centric Web - A Web where you can find media using other media—not just keywords. You supply, say, a photo of your favorite painting and your search engines turn up hundreds of similar paintings.

(4) The Pervasive Web - A Web that's everywhere. On your PC. On your cell phone. On your clothes and jewelry. Spread throughout your home and office. Even your bedroom windows are online, checking the weather, so they know when to open and close.

Tuesday, September 18, 2007

The Seminal on The Semantic

Before Tim O'Reilly, there was Sir Tim Berners-Lee, who is often credited as the creator of the World Wide Web. However, what many do not know is that Berners-Lee also preceded many so-called Web 2.0 experts when he envisioned the Semantic Web (which many refer to synonymously as "Web 3.0"). While O'Reilly came along in 2004 to coin Web 2.0, Berners-Lee had long before laid the conceptual foundations in an article co-written with James Hendler and Ora Lassila, titled The Semantic Web, in Scientific American in 2001. Although librarians and information professionals don't need to know the specifics of the coding technology behind the Semantic Web (that would be asking too much, for much of it is still in development), it is important to have a good grasp of the concepts and a strong understanding of the history and evolution of the Web. Thus, it is important to know that the Semantic Web will be defined by five concepts:

(1) Expressing Meaning - Bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

(2) Knowledge Representation - For Web 3.0 to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning: this is where XML and RDF come in, but are they only preliminary languages?

(3) Ontologies - A program that wants to compare or combine information across two databases has to know when two terms are being used to mean the same thing. This means that the program must have a way to discover common meanings for whatever database it encounters. Hence, an ontology has a taxonomy and a set of inference rules.

(4) Agents - The real power of the Semantic Web will be the programs that actually collect Web content from diverse sources, process the information and exchange the results with other programs. Thus, whereas Web 2.0 is about applications, the Semantic Web will be about services.

(5) Evolution of Knowledge - The Semantic Web is not merely a tool for conducting individual tasks; rather, its ultimate goal is to advance the evolution of human knowledge as a whole. Whereas human endeavour is caught between the eternal struggle of small groups acting independently and the need to mesh with the greater community, the Semantic Web is a process of joining together subcultures when a wider common language is needed.
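Concept (3) above, an ontology as a taxonomy plus inference rules, can be illustrated with a small Python sketch. Everything in it is an assumption made for the example: the class names, the zip_code and postal_code field names, and the helper functions are hypothetical, and real ontology languages express far more than this.

```python
# Taxonomy: each class points to its parent class (hypothetical classes).
SUBCLASS_OF = {"Librarian": "Person", "Person": "Agent"}

# Terms from two hypothetical databases that mean the same thing.
SAME_AS = {"zip_code": "postal_code"}


def canonical(term):
    """Map a database-specific term to the shared term, if one is declared."""
    return SAME_AS.get(term, term)


def is_a(cls, ancestor):
    """Inference rule: class membership is inherited up the taxonomy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def merge_records(first, second):
    """Combine two records after translating field names to shared terms."""
    merged = {}
    for record in (first, second):
        for field, value in record.items():
            merged[canonical(field)] = value
    return merged
```

With these pieces, a program can conclude that a Librarian is also an Agent without being told so directly, and can combine a record that says zip_code with one that says postal_code.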

Saturday, September 15, 2007

Web 3.0 & the Sem-antic Web

Ready or not, like it or not, Web 3.0 is around the corner. It's coming - so it's best to understand the technologies. Particularly for librarians, we need to understand the intricate technologies behind what the semantic web will look like, how it runs, and what to expect from its much anticipated (although still hyper-theoretical) features.

Ora Lassila and James Hendler, who co-authored with Tim Berners-Lee the 2001 article which predicted what the semantic web would look like, argue in their most recent article, Embracing "Web 3.0", that the technologies that make the semantic web possible are slowly but surely maturing. In particular,

As RDF acceptance has grown, the need has become clear for a standard query language to be for RDF what SQL is for relational data. The SPARQL Protocol and RDF Query Language (SPARQL), now under standardization at the W3C, is designed to be that language.

But that doesn't mean that Web 2.0 technologies are obsolete. Rather, they are an intermediate stage in the evolution to Web 3.0. In particular, it is interesting that the authors note:

(1) Folksonomies - tagging provides an organic, community-driven means of creating structure and classification vocabularies.

(2) Microformats - the use of HTML markup to encode structured data is a step toward "semantic data." Of course, although not in Semantic Web formats, microformatted data is easy to transform into something like RDF or OWL.

As you can see, we're moving along. Take a look at this: on the surface, Yahoo Food looks just like any Web service; underneath, it is made from SPARQL which really does "sparkle."
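For readers who have never seen RDF-style data, here is a toy sketch in Python of what querying it involves. It is not SPARQL itself and says nothing about how Yahoo Food is built; the triples, the food: vocabulary, and the match function are invented for illustration. Data is stored as (subject, predicate, object) triples, and a query is a pattern in which names starting with "?" are variables.

```python
# Hypothetical triples in (subject, predicate, object) form.
TRIPLES = [
    ("recipe:1", "rdf:type", "food:Recipe"),
    ("recipe:1", "food:ingredient", "apple"),
    ("recipe:2", "rdf:type", "food:Recipe"),
    ("recipe:2", "food:ingredient", "flour"),
]


def match(pattern, triples):
    """Return one dict of variable bindings for each triple that fits the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable must keep the same value if it appears twice.
                if binding.get(want, have) != have:
                    break
                binding[want] = have
            elif want != have:
                break
        else:
            results.append(binding)
    return results


# Roughly analogous in spirit to the SPARQL query:
#   SELECT ?r WHERE { ?r food:ingredient "apple" }
apple_recipes = match(("?r", "food:ingredient", "apple"), TRIPLES)
```

Just as SQL asks a relational database for rows that satisfy a condition, a query language for RDF asks a collection of triples for the values that make a pattern true.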

Monday, September 10, 2007

Six Kinds of (Social) Searching

Librarians need to be aware of social searching. It's important and it's here to stay. What makes social searching so integral to librarians' information retrieval skills is that it requires knowledge of Web 2.0 (mashups, wisdom of crowds, long tail, etc.). It doesn't mean that "traditional" search skills are obsolete. Far from it. Rather, social searching just adds another layer to the librarian's toolkit. Here are some of my favourites.

1. Social Q&A sites - Cha Cha, Live QnA, Yahoo! Answers, Answer Bag, Wondir

2. Shared bookmarks and web pages - Del.icio.us, Shadows, Yahoo's MyWeb, Furl, Diigo, Connotea

3. Collaborative directories - Open Directory Project, Prefound, Zimbio, Wikipedia

4. Taggregators - Technorati, Bloglines, Wikipedia

5. Personalized verticals - PogoFrog, Eurekster, Rollyo

6. Collaborative harvesters - iRazoo, Digg, Flickr, Youtube, Netscape, Reddit, Tailrank, popurls.com

Saturday, September 01, 2007

Top 25 Definitions for Web 2.0

Summer has gone by so quickly. What happened to June? I've been culling readings from all over, aggregating the best definitions of Web 2.0. Notice there are a lot: twenty-five in all. I tried making sense of everything, even trying to arrange and shuffle for a catchy acronym (think ROY G. BIV). I challenge all librarians and other information professionals interested in Web 2.0 to do the same: find a catchy acronym and share it with us all. I will share my own in one month's time.

(1) Social Networks - The content of a site should comprise user-provided information that attracts members of an ever-expanding network. (example: Facebook)

(2) Wisdom of Crowds - Group judgments are surprisingly accurate, and the aggregation of input is facilitated by the ready availability of social networking sites. (example: Wikipedia)

(3) Loosely Coupled APIs - Short for "Application Programming Interface," an API provides a set of instructions (messages) that a programmer can use to communicate between applications, thus allowing programmers to incorporate one piece of software directly into another. (example: Google Maps)

(4) Mashups - They are combinations of APIs and data that result in new information resources and services. (example: Calgary Mapped)

(5) Permanent Betas - The idea is that no software is ever truly complete so long as the user community is still commenting upon it, and thus, improving it. (example: Google Labs)

(6) Software Gets Better the More People Use It - Because all social networking sites seek to capitalize on user input, the true value of each site is defined by the number of people it can bring together. (example: Windows Live Messenger)

(7) Folksonomies - It's a classification system created in a bottom-up fashion and with no central coordination. Entirely differing from the traditional classification schemes such as the Dewey Decimal or Library of Congress Classifications, folksonomies allow any user to "social tag" whatever phrase they deem necessary for an object. (example: Flickr and Youtube)

(8) Individual Production and User Generated Content - Free social software tools such as blogs and wikis have lowered the barrier to entry, following the same footsteps as the 1980s self-publishing revolution sparked by the advent of the office laser printer and desktop publishing software. In the world of Web 2.0, with a few clicks of the mouse, a user can upload videos or photos from their digital cameras and into their own media space, tag it with keywords and make the content available for everyone in the world.

(9) Harness the Power of the Crowd - Harnessing not the "intellectual" power, but the power of the "wisdom of the crowds," "crowd-sourcing" and "folksonomies."

(10) Data on an Epic Scale - Google has a total database measured in hundreds of petabytes (a petabyte is a million billion bytes), which is swelled each day by terabytes of new information. Much of this is collected indirectly from users and aggregated as a side effect of the ordinary use of major Internet services and applications such as Google, Amazon, and eBay. In a sense these services are 'learning' every time they are used by mining and sifting data for better services.

(11) Architecture of Participation - Through the use of the application or service, the service itself gets better. Simply argued, the more you use it - and the more other people use it - the better it gets. Web 2.0 technologies are designed to take the user interactions and utilize them to improve themselves. (e.g. Google search).

(12) Network Effects - It is a general economic term often used to describe the increase in value to the existing users of a service in which there is some form of interaction with others, as more and more people start to use it. As the Internet is, at heart, a telecommunications network, it is therefore subject to the network effect. In Web 2.0, new software services are being made available which, due to their social nature, rely a great deal on the network effect for their adoption. eBay is one example of how the application of this concept works so successfully.

(13) Openness - Web 2.0 places an emphasis on making use of the information in vast databases that the services help to populate. This means Web 2.0 is about working with open standards, using open source software, making use of free data, re-using data and working in a spirit of open innovation.

(14) The Read/Write Web - A term given to describe the main differences between Old Media (newspaper, radio, and TV) and New Media (e.g. blogs, wikis, RSS feeds), the new Web is dynamic in that it allows consumers of the web to alter and add to the pages they visit - information flows in all directions.

(15) The Web as a Platform - Better known as "perpetual beta," the idea behind Web 2.0 services is that they need to be constantly updated. Thus, this includes experimenting with new features in a live environment to see how customers react.

(16) The Long Tail - The new Web lowers the barriers for publishing anything (including media) related to a specific interest because it empowers writers to connect directly with international audiences interested in extremely narrow topics, whereas originally it was difficult to publish a book related to a very specific interest because its audience would be too limited to justify the publisher's investment.

(17) Harnessing Collective Intelligence - Google, Amazon, and Wikipedia are good examples of how successful Web 2.0-centric companies use the collective intelligence of users in order to continually improve services based on user contributions. Google's PageRank examines how many links point to a page, and from what sites those links come, in order to determine its relevancy instead of evaluating the relevance of websites based solely on their content.

(18) Science of Networks - To truly understand Web 2.0, one must not only understand web networks, but also human and scientific networks. Ever heard of six degrees of separation and the small world phenomenon? Knowing how to open up a Facebook account isn't good enough; we must know what goes on behind the scene in the interconnectedness of networks - socially and scientifically.

(19) Core Datasets from User Contributions - One way that Web 2.0 companies collect unique datasets is through user contributions. However, collecting is only half the picture; using the datasets is the key. These contributions are organized into databases and analyzed to extract the collective intelligence hidden in the data. This extracted knowledge can then be applied to the direct improvement of the website or web service.

(20) Lightweight Programming Models - The move toward database driven web services has been accompanied by new software development models that often lead to greater flexibility. In sharing and processing datasets between partners, this enables mashups and remixes of data. Google Maps is a common example as it allows people to combine its data and application with other geographic datasets and applications.

(21) The Wisdom of the Crowds - Not only has it blurred the boundary between amateur and professional status; in a connected world, ordinary people often have access to better information than officials do. As an example, the collective intelligence of the evacuees of the towers saved numerous lives when they disobeyed the authorities who told them to stay put.

(22) Digital Natives - Because a generation (mostly the under 25's) has grown up surrounded by developing technologies, those fully at home in a digital environment aren't worried about information overload; rather, they crave it.

(23) Internet Economics - Small is the new big. Unlike the past when publishing was controlled by publishers, Web 2.0's read/write web has opened up markets to a far bigger range of supply and demand. The amateur who writes one book has access to the same shelf space as the professional author.

(24) "Wirelessness" - Digital natives are less attached to computers and are more interested in accessing information through mobile devices, when and where they need it. Hence, traditional client applications designed to run on a specific platform will struggle, if not disappear, in the long run.

(25) Who Will Rule? - This will be the ultimate question (and prize). As Sharon Richardson argues, whoever rules "may not even exist yet."
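Definition (17) above describes PageRank as looking at how many links point to a page and where those links come from. The Python sketch below is a simplified textbook-style version of that idea, not Google's production algorithm; the four-page link graph, the damping value of 0.85, and the number of iterations are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: `links` maps each page to the pages it links to.

    Assumes every linked-to page also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b, and d all link to c; c links back to a.
EXAMPLE_LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
example_rank = pagerank(EXAMPLE_LINKS)
```

In this small graph, page c comes out on top because three pages link to it, and page a outranks b and d because its single incoming link comes from the highly ranked c. The links act as votes, and votes from well-regarded pages count for more.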
