Friday, November 30, 2007

Digital Libraries in the Semantic Age

Brian Matthews of CCLRC Appleton Laboratory offers some interesting insights in Semantic Web Technologies. In particular, he argues that libraries are increasingly converting themselves to digital libraries. A key aspect of the digital library is the provision of shared catalogues which can be published and browsed. This requires the use of common metadata to describe the fields of the catalogue (such as author, title, date, and publisher), and common controlled vocabularies to allow subject identifiers to be assigned to publications.

As Matthews proposes, by publishing controlled vocabularies in one place, where they can be accessed by all users across the Web, library catalogues can use the same Web-accessible vocabularies for cataloguing, marking up items with the most relevant terms for the domain of interest. Search engines can then use the same vocabularies in their searches to ensure that the most relevant items of information are returned.

The Semantic Web opens up the possibility to take such an approach. It offers open standards that can enable vendor-neutral solutions, with a useful flexibility (allowing structured and semi-structured data, formal and informal descriptions, and an open and extensible architecture) and it helps to support decentralized solutions where that is appropriate. In essence, RDF can be used as this common interchange for catalogue metadata and shared vocabulary, which can then be used by all libraries and search engines across the Web.

But in order to use the Semantic Web to its best effect, metadata needs to be published in RDF formats. There are several initiatives involved with defining metadata standards, and some of them are well known to librarians:

(1) Dublin Core Metadata Initiative

(2) MARC

(3) ONIX

(4) PRISM
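
To make the idea of RDF as a common interchange for catalogue metadata a little more concrete, here is a minimal sketch in plain Python. It is only an illustration, not how any particular library system works: the Dublin Core element namespace is real, but the item and subject URIs under example.org, and the helper names catalogue_triples and to_ntriples, are made up for this example.

```python
# Dublin Core element namespace (real). Everything under example.org is hypothetical.
DC = "http://purl.org/dc/elements/1.1/"


def catalogue_triples(item_uri, fields):
    """Turn a dict of catalogue fields into (subject, predicate, object) triples."""
    return [(item_uri, DC + name, value) for name, value in fields.items()]


def to_ntriples(triples):
    """Serialize triples as N-Triples lines.

    Assumption for this sketch: objects that start with http:// or https://
    are URIs, and everything else is a plain string literal.
    """
    lines = []
    for s, p, o in triples:
        if o.startswith("http://") or o.startswith("https://"):
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


record = catalogue_triples(
    "http://example.org/catalogue/item/1",
    {
        "title": "Linked",
        "creator": "Albert-Laszlo Barabasi",
        # The subject points at a shared, Web-accessible vocabulary term.
        "subject": "http://example.org/vocab/networks",
    },
)
print(to_ntriples(record))
```

Because the subject is a URI rather than a free-text string, two catalogues that use the same vocabulary end up pointing at exactly the same term, which is the kind of sharing Matthews describes.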

Wednesday, November 21, 2007

Postmodern Librarian - Part Two

To continue where we left off. True, Digital Libraries and the Future of the Library Profession intimates that libraries, and perhaps librarianship, have entered the postmodern age. But Joint isn't the first to make such an argument; many others have argued likewise. In fact, I had written about it before, too. But I believe that stopping at the modernist-postmodernist dichotomy misses the point.

In my opinion, perhaps this is where Web 2.0 comes in. Although the postmodern information order is not clear to us, it seems to be the dynamic behind Web 2.0, in which interactive tools such as blogs, wikis, and RSS facilitate social networking and the anarchic storage and unrestrained distribution of content. According to Joint, many of our professional efforts to impose a realist-modernist model on our library will fail. The old LIS model needs to be re-theorized, just as Newtonian Physics had to evolve into Quantum Theory, in recognition of the fact that super-small particles simply were not physically located where Newtonian Physics said they should be. In this light, perhaps this is where we can start to understand what exactly Web 2.0 is. And beyond.

Friday, November 16, 2007

Semantic Web: A McCool Way of Explaining It

Yahoo's Rob McCool argues in Rethinking the Semantic Web, Part 1 that the Semantic Web will never happen. Why? Because the Semantic Web has three fundamental parts, and they just don't fit together based on current technologies. Here is what we have. The foundation is the set of data models and formats that provide semantics to applications that use them (RDF, RDF Schema, OWL). The second layer is composed of services - purely machine-accessible programs that answer Web requests and perform actions in response. At the top are the intelligent agents, or applications.

Reason? Knowledge representation is a technique with mathematical roots in the work of Edgar Codd, widely known as the one whose original paper using set theory and predicate calculus led to the relational database revolution in the 1980s. Knowledge representation uses the fundamental mathematics of Codd's theory to translate information, which humans represent with natural language, into sets of tables that use well-defined schema to define what can be entered in the rows and columns.

The problem is that this creates a fundamental barrier, in terms of richness of representation as well as creation and maintenance, compared to the written language that people use. Logic, which forms the basis of OWL, suffers from an inability to represent exceptions to rules and the contexts in which they're valid.

Databases are deployed only by corporations whose information-management needs require them or by hobbyists who believe they can make some money from creating and sharing their databases. Because information theory removes nearly all context from information, both knowledge representation and relational databases represent only facts. Complex relationships, exceptions to rules, and ideas that resist simplistic classifications pose significant design challenges to information bases. Adding semantics only increases the burden exponentially.

Because it's a complex format and requires users to sacrifice expressivity and pay enormous costs in translation and maintenance, McCool believes the Semantic Web will not achieve widespread support. Never? Not until another Edgar Codd comes along. So we wait.

Wednesday, November 14, 2007

The Postmodern Librarian?

Are we in the postmodern era? Nicholas Joint's Digital Libraries and the Future of the Library Profession seems to think so. In it, he argues that unique contemporary cultural shifts are leading to a new form of librarianship that can be characterized as "postmodern" in nature, and that this form of professional specialism will be increasingly influential in the decades to come.

According to Joint, the idea of the postmodern digital library is clearly very different from the interim digital library. In the summer of 2006, at a workshop on the cultural impact of mobile communication technologies at the eLit conference in Loughborough, there emerged the Five Theses of Loughborough. Here they are:

(1) There are no traditional information objects on the internet with determinate formats or determinate qualities: the only information object and information format on the internet is "ephemera"

(2) The only map of the internet is the internet itself, it cannot be described

(3) A hypertext collection cannot be selectively collected because each information object is infinite and infinity cannot be contained

(4) The problem of digital preservation is like climate change; it is man-made and irreversible, and means that much digital data is ephemeral; but unlike climate change, it is not necessarily catastrophic

(5) Thus, there is no such thing as a traditional library in a postmodern world. Postmodern information sets are just as accessible as traditional libraries: there are no formats, no descriptions, no hope of collection management, no realistic possibility of preservation. And they work fine.

Monday, November 12, 2007

New York City In a Semantic Web

Tim Krichel in The Semantic Web and an Introduction to Resource Description Framework makes a very astute analogy for understanding the technology behind the Semantic Web, particularly the nuances of XML and RDF, where the goal is to move away from the present Web - where pages are essentially constructed for use by human consumption - to a Web where more information can be understood and treated by machines. The analogy goes like this:
We fit each car in New York City with a device that lets a reverse geographical position system read its movements. Suppose, in addition, that another machine can predict the weather or some other phenomenon that impacts traffic. Assume that a third kind of device has the public transport timetables. Then, data from a collaborative knowledge picture of these machines can be used to advise on the best means of transportation for reaching a certain destination within the next few hours.
The computer systems doing the calculations required for the traffic advisory are likely to be controlled by different bodies, such as the city authority or the national weather service. Therefore, there must be a way for software agents to process the information from the machine where it resides, to proceed with further processing of that information to a form in which a software agent of the final user can be used to query the dataset.

Wednesday, November 07, 2007

Genre Searching

At today's SLAIS colloquium, Dr. Luanne Freund gave a presentation on Genre Searching: A Pragmatic Approach to Information Retrieval. Freund argues for taking a pragmatic approach to genre searching and genre classification. But there are two perspectives on pragmatics: socio-pragmatic and cognitive-pragmatic. Using a high-tech firm as a case study, Freund and her colleagues built a unique search engine called X-Cite, which culls together documents from the corporate intranet (anything from FAQs to specialized manuals) with tags. By ranking documents based on title, abstract, and keywords, the algorithm cuts down on the ambiguity and guesswork of searching. Taking a software engineering workplace as her starting point, Freund believes that genre searching has the potential to make a significant contribution to the effectiveness of workplace search systems, by incorporating genre weights into the ranking algorithm.

In genre analysis, three steps must be taken:

(1) Identify - The core genre repertoire of the work domain

(2) Develop - A standard taxonomy to represent it

(3) Develop - Operational definitions of the genre classes in the taxonomy, including identifying features in terms of form, function and content to facilitate manual and automatic genre classification.
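
To picture what "incorporating genre weights into the ranking algorithm" could look like, here is a rough sketch. I don't know how X-Cite actually combines its scores, so the multiplicative weighting, the weight values, and the function name rank_with_genre are all my own assumptions.

```python
def rank_with_genre(results, genre_weights, default_weight=1.0):
    """Re-rank (doc_id, genre, text_score) tuples by text_score * genre weight.

    The multiplication is an assumption made for illustration; a real system
    could combine the two signals in many other ways.
    """
    scored = [
        (doc_id, text_score * genre_weights.get(genre, default_weight))
        for doc_id, genre, text_score in results
    ]
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights for an engineer who is troubleshooting a problem.
weights = {"faq": 1.5, "manual": 1.2, "press_release": 0.5}

# Hypothetical search results: (document id, genre class, text relevance score).
results = [
    ("doc-a", "press_release", 0.9),
    ("doc-b", "faq", 0.7),
    ("doc-c", "manual", 0.8),
]

ranked = rank_with_genre(results, weights)
print([doc_id for doc_id, _ in ranked])
```

The press release has the best text match, but once genre is taken into account the FAQ and the manual move ahead of it. That only works if the genre classes have been identified and defined first, which is what the three steps above are for.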

Throughout the entire presentation, my mind kept returning to the question: is this not another specialized form of social searching? A tailored search engine which narrows its search to a specific genre? Although the two are entirely different things, I keep thinking that creating your own search engine is certainly much easier.

Simple Knowledge Organization System (SKOS) & Librarians

Miles and Perez-Aguera's SKOS: Simple Knowledge Organization for the Web introduces SKOS, a Semantic Web language for representing structured vocabularies, including thesauri, classification schemes, subject heading systems, and taxonomies -- tools that cataloguers and librarians use every day in their line of work.

It's interesting that the very essence of librarianship and cataloging will play a vital role in the upcoming version of the Web. It's hard to fathom how this works: how can MARC records and the DDC have anything to do with the intelligent agents which form the layers of architecture of the Semantic Web and Web 3.0? The answer: metadata.

And even more importantly: the messiness and disorganization of the Web will require information professionals with the techniques and methods to reorganize everything coherently. Web 1.0 and 2.0 were about creating -- but the Semantic Web will be about orderliness and regulating. Take a closer look at Miles & Perez-Aguera's article -- it's well worth a read. SKOS is built to represent the following kinds of controlled, structured vocabulary:

(1) Thesauri - Broadly conforming to the ISO 2788:1986 guidelines, such as the UK Archival Thesaurus (UKAT, 2004), the General Multilingual Environmental Thesaurus (GEMET), and the Art and Architecture Thesaurus

(2) Classification Schemes - Such as the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC), and the Bliss Classification (BC2)

(3) Subject Heading Systems - The Library of Congress Subject Headings (LCSH) and the Medical Subject Headings (MeSH)
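
What all three kinds of vocabulary share is a structure of concepts linked to broader and narrower concepts, which SKOS expresses with properties such as skos:broader. Here is a small sketch of why that matters for searching. It is plain Python rather than real SKOS/RDF, and the mini-vocabulary and function names are hypothetical.

```python
# Hypothetical mini-vocabulary: concept -> its broader concepts,
# in the spirit of skos:broader.
BROADER = {
    "semantic web": ["world wide web"],
    "world wide web": ["internet"],
    "digital libraries": ["libraries"],
}


def broader_transitive(concept, broader=BROADER):
    """Return every ancestor of a concept by following broader links."""
    seen = []
    stack = list(broader.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(broader.get(current, []))
    return seen


def expand_query(concept, broader=BROADER):
    """Return the concept plus every known concept that is narrower than it."""
    candidates = set(broader) | {concept}
    return sorted(
        c for c in candidates
        if c == concept or concept in broader_transitive(c, broader)
    )


print(broader_transitive("semantic web"))
print(expand_query("internet"))
```

A search for the broad term can then also pick up items catalogued under its narrower terms, which is the sort of thing cataloguers have always built into their thesauri.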

Friday, November 02, 2007

New Librarians, New Possibilities?

Are newer, incoming librarians changing the profession? Maybe. But not yet. University Affairs has published an article called The New Librarians, which highlights some of the new ideas that newer librarians are bringing into academic libraries. Everyone's favourite University Librarian (at least for me), Jeff Trzeciak, who has his own blog, is featured in the piece. In it, he describes how he has swiftly hired new Library 2.0-ready librarians and overturned the traditional decor and culture of McMaster Library: a "café, diner-style booths, stand-up workstations, oversized ottomans, and even coffee tables with pillows on the floor will take their place, all equipped for online access. Interactive touch-screen monitors will line the wall."

University of Guelph Chief Librarian Michael Ridley similarly sees a future where the university library serves as an “academic town square,” a place that "brings people and ideas together in an ever-bigger and more diffuse campus. Services in the future will include concerts, lectures, art shows – anything that trumpets the joy of learning."

Is this the future of libraries? Yes, it's a matter of time. That's where we're heading -- that's where we'll end up. Change is difficult, particularly in larger academic institutions where bureaucracy and politics play an essential role in all aspects of operations. There is great skepticism towards Jeff Trzeciak's drastic changes to McMaster Library -- he's either a pioneer if he succeeds, or an opportunist if he fails. A lot is riding on Jeff's shoulders.

Tuesday, October 30, 2007

Introducing Semantic Searching

Just as we had Google and Web 2.0 nearly figured out, the Semantic Web is just around the corner. Introducing hakia, one of the first truly Semantic Web search engines. As we had argued, the Semantic Web is a digital catalogue, and one of its key components is an understanding of ontologies and taxonomies. Built on Semantic Web technologies, hakia is a new "meaning-based" (semantic) search engine with the purpose of improving search relevancy and interactivity -- the potential benefits for end users are search efficiency, richness of information, and time savings. Will this hakia team be the next Brin and Page? Why don't you try it? Here are the elements that make up hakia:

(1) Ontological Semantics (OntoSem) - A formal and comprehensive linguistic theory of meaning in natural language. As such, it bears significantly on philosophy of language, mathematical logic, and cognitive science

(2) Query Detection and Extraction (QDEX) - A system invented to bypass the limitations of the inverted index approach when dealing with semantically rich data

(3) SemanticRank algorithm - Deploys a collection of methods to score and rank paragraphs that are retrieved from the QDEX system for a given query. The process includes query analysis, best sentence analysis, and other pertinent operations

(4) Dialogue - In order to establish a human-like dialogue with the user, the dialogue algorithm's goal is to convert the search engine's role into a computerized assistant with advanced communication skills while utilizing the largest amount of information resources in the world.

(5) Search mission - Google's mission was to organize the world's information and make it universally accessible and useful. hakia's mission is to search for better search.
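
For anyone wondering what the "inverted index approach" mentioned under QDEX actually is, here is a toy version. To be clear, this is the conventional keyword technique that hakia says it is trying to get past. It says nothing about how QDEX itself works, and the documents and function names are invented for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip surrounding punctuation so "dessert." matches "dessert".
            index[word.strip(".,;:!?\"'()")].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query word (boolean AND)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents.
docs = {
    1: "Apple pie is a sweet fruit dessert.",
    2: "Apple released a new laptop.",
    3: "The fruit stand sells pears.",
}
index = build_inverted_index(docs)
print(search(index, "apple"))
print(search(index, "apple fruit"))
```

A search for "apple" returns both the dessert and the laptop, because the index only knows which words appear where, not what they mean. That gap is what a "meaning-based" engine is claiming to close.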

Monday, October 22, 2007

A Definition of the Semantic Web

Parker, Nitse, and Flowers' Libraries as Knowledge Management Centers makes a good point about special libraries. Libraries need to be at the forefront of technology, or else they'll be an endangered species. As libraries struggle with the fallout of the digital age, they must find a creative way to remain relevant to the twenty-first-century user who has the ability and means of finding vast amounts of information without even setting foot in a library. The authors go on to suggest that an understanding of the Semantic Web is necessary for those working in libraries. They offer an excellent definition of the Semantic Web -- one of the best I've seen so far:

Today's web pages are designed for human use, and human interpretation is required to understand the content. Because the content is not machine-interpretable, any type of automation is difficult. The Semantic Web augments today's web to eliminate the need for human reasoning in determining the meaning of web-based data. The Semantic Web is based on the concept that documents can be annotated in such a way that their semantic content will be optimally accessible and comprehensible to automated software agents and other computerized tools that function without human guidance. Thus, the Semantic Web might have a more significant impact in integrating resources that are not in a traditional catalog system than in changing bibliographic databases.

Thursday, October 11, 2007

Three Perspectives of the Semantic Web

Catherine Marshall and Frank Shipman have some interesting insights in Which Semantic Web? In it, they argue that the plethora of interpretations of the Semantic Web can be traced back to three different perspectives. Here they are:

(1) A Universal Library - Readily accessed and used by humans in a variety of information use and contexts. This perspective arose as a reaction to the disorder of the Web, which was not ordered in categorization until search engines came along. Metadata, cataloguing, and schemas were seen as the answer.

(2) Computational Agents - Completing sophisticated activities on behalf of their human counterparts. Tim Berners-Lee envisioned an infrastructure for knowledge acquisition, representation, and utilization across diverse use contexts. This global knowledge base will be used by personal agents to collect and reason about information, assisting people with tasks common to everyday life.

(3) Federated Data and Knowledge Base - In this vision, federated components are developed with some knowledge of one another, or at least with a shared anticipation of the type of applications that will use the data. In essence, this Web encompasses languages used for syntactically sharing data rather than having to write specialized converters for each pair of languages.

Wednesday, October 10, 2007

Knowledge Management 3.0

Michael Koenig and T. Kanti Srikantaiah proffer the idea that Knowledge Management is in its third phase. Here are the different stages:

Stage 1 - Internet of Intellectual Capital - this initial stage of KM was driven primarily by IT. In this stage, organizations realized that their stock in trade was information and knowledge -- yet the left hand rarely knew what the right hand did. When the Internet emerged, KM was about how to deploy the new technology to accomplish those goals.

Stage 2 - Human & Cultural dimensions - the hallmark phrase is communities of practice. KM during this stage was about knowledge creation as well as knowledge sharing and communication.

Stage 3 - Content & Retrievability - consists of structuring content and assigning descriptors (index terms). In content management and taxonomies, KM is about the arrangement, description, and structure of that content. Interestingly, taxonomies are perceived by the KM community as emanating from natural scientists, when in fact they are the domain of librarians and information scientists. To take this one step further, the Semantic Web is also built on taxonomies and ontologies. Anyone see a trend? Perhaps a convergence?

Monday, October 08, 2007

When is an Apple, an Apple?

In Linked: How Everything Is Connected to Everything Else and What It Means, Albert-Laszlo Barabasi proposes that the ultimate search engine is one that can tap into the input of every person here on Earth. Although no such search engine exists, he argues that Google is the closest we have to a “worldly” search engine because of its PageRank algorithm.
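
For the curious, the core idea behind PageRank can be sketched in a few lines: every link counts as a kind of vote, and votes from highly ranked pages count for more. What follows is the simplified textbook power-iteration version with a damping factor, not Google's production system. The four-page web and the parameter values are made up.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict of page -> list of outbound links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page with no outbound links: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical web: three pages link to "c", and "c" links back to "a".
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page "c" comes out on top because three pages point to it, and "a" is close behind because its single inbound link comes from the highest-ranked page. In that sense, each link is a small piece of human input.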

I argue that we can go one step further because with the advent of Web 2.0, social search is actually the closest that we have to gathering input from all of the world’s users. How? Why? Let me explain with an analogy.

It’s not a matter of how, but a matter of when. Web 2.0 is very much like an apple. An apple can be food, a paperweight, a target, or a weapon if needed. It can be whatever you want it to be when you want it to be. The same goes for social searching. It is not about search engines.

Del.icio.us is a social bookmarking web service. But it can be a powerful search tool if used properly; essentially, it taps into the social preferences of other users. Same goes for Youtube: it’s a video sharing website, but what’s to say that it can’t be used for searching videos for relevant topics, what’s to say that you can’t search related videos based on videos bookmarked by others? Social search is not based on a program; it is a mindset, a metaphorical sweet fruit, if you will.

In many ways, social searching is not unlike what librarians did (and still do) in the print-based world, where an elegant craft of creativity and perseverance was required to find the right materials and put them into the hands of the patron; the only difference is that the search has become digital.

Friday, October 05, 2007

Youtube University

UC Berkeley has become the first university to formally offer videos of full course lectures via YouTube. Two hundred clips, representing eight full classes, have been uploaded so far. Here is "SIMS 141 - Search, Google, and Life: Sergey Brin - Google." Enjoy.


Wednesday, October 03, 2007

Of Ontologies + Taxonomies

In 2002 -- two years before Tim O'Reilly's famous coining of the term "Web 2.0" -- Katherine Adams of the Los Angeles Public Library had already argued that librarians will be an essential piece of the Semantic Web equation. In The Semantic Web: Differentiating Between Taxonomies and Ontologies, Adams makes a few strong arguments that are strikingly ahead of their time. Long before wikis, blogs, and RSS feeds had come to prominence (5 years ago!), Adams had the foresight to point out the importance of librarians in reply to Berners-Lee et al.'s vision. Here are Adams's main points, all of which I find fascinating based on pre-Web 2.0 knowledge:

(1) Taxonomies: An Important Part of the Semantic Web - The new Web entails adding an extra layer of infrastructure to the current HTML Web - metadata in the form of vocabularies and the relationships that exist between selected terms will make this possible for machines to understand conceptual relationships as humans do.

(2) Defining Ontologies and Taxonomies - Ontologies and taxonomies are used synonymously -- Computer Scientists refer to hierarchies of structured vocabularies as "ontology" while librarians call them "taxonomy."

(3) Standardized Language and Conceptual Relationships - Both taxonomies and ontologies consist of a structured vocabulary that identifies a single key term to represent a concept that could be described using several words.

(4) Different Points of Emphasis - Computer Science is concerned with how software and associated machines interact with ontologies; librarians are concerned with how patrons retrieve information with the aid of taxonomies. However, they're essentially different sides of the same coin.

(5) Topic Maps As New Web Infrastructure - Topic maps will ultimately point the way to the next stage of the Web's development. They represent a new international standard (ISO 13250). In fact, even the OCLC is looking to topic maps in its Dublin Core Initiative to organize the Web by subject.
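
Point (3) above, a single key term standing in for a concept that could be described using several words, is easy to show in miniature. This is only a sketch of the idea: the vocabulary entries and the function names are hypothetical, and real taxonomies and ontologies carry far more structure than a lookup table.

```python
# Hypothetical entries: preferred term -> alternative wordings for the same concept.
VOCABULARY = {
    "automobiles": ["cars", "motor cars", "autos"],
    "libraries": ["library", "athenaeums"],
}


def build_lookup(vocabulary):
    """Index every preferred and alternative wording by its lowercase form."""
    lookup = {}
    for preferred, alternatives in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for alt in alternatives:
            lookup[alt.lower()] = preferred
    return lookup


def preferred_term(word, lookup):
    """Return the single key term for a wording, or None if it is not in the vocabulary."""
    return lookup.get(word.strip().lower())


lookup = build_lookup(VOCABULARY)
print(preferred_term("Cars", lookup))
print(preferred_term("motor cars", lookup))
```

Whether a patron types "cars" or a piece of software sends "autos", both land on the same key term, which is the common ground Adams sees between the librarian's taxonomy and the computer scientist's ontology.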