Creating order from information chaos with the help of the Semantic Web.
The organization and accessibility of the massive amount of knowledge on the Internet can only be described as chaotic and haphazard. Most people who spend a little time on the Web will eventually home in on the good sites (e.g., amazon.com for books, ebay.com for auctions, and google.com for searches). However, massive amounts of interesting, valuable, and accurate information are often missed or only found "accidentally." Perhaps this is the natural outgrowth of a system that involves the independent publication of documents (a.k.a. Web pages) by anyone, for any purpose. These Web pages only become visible by word of mouth, advertisement, or placement in Web search engines.
There is another way of organizing and retrieving information on the Web, known as the Semantic Web. It is the brainchild of Tim Berners-Lee, the person who conceived of the Web and built the first Web browser. Before diving in with definitions and details, it is worth discussing the organization and retrieval of information on the Web to date.
Any person or entity can publish a Web site, which is typically a network of interacting hyperlinked documents with links to programs and databases. Internally, each Web site has an organization that may be viewed in a site map of page titles and/or descriptions. Although there is no standard for the organization of a site, good sites post a site map and provide search capabilities for easy navigation.
Each Web site is identified by a URL, whose host name is mapped to a numeric IP address (e.g., 18.104.22.168). There is no structure or organization of IP addresses or URLs that relates to the content of information on the site, or type of site. All that a URL defines is a method for a Web browser to find and display a particular site. Therefore, as the number of Web sites grew beyond a handful, it became necessary to be able to categorize and search for them. The idea of categorization and search had already been in place on the Internet in the pre-WWW days, when the targets were ftp and telnet sites and the search engines were known as gopher, archie, and veronica.
The first attempts at categorization were hierarchical directories, by companies like Yahoo. These categorizations still exist, and are useful for particular kinds of searches. They are based on an "ontology" of the Web: categories and subcategories of Web sites organized into a hierarchical directory. Directories are limited in their usefulness, however, because they can't possibly keep up with the proliferation of Web sites. In addition, the user must "guess" the ontology of the organizers, which is very imprecise when categorizing Web sites. For example, in the current Google directory, the company SAS Institute is listed as /Computers/Programming/Languages/SAS, which is certainly correct, but /Business/Biotechnology_and_Pharmaceuticals/Pharmaceuticals/Outsourcing/Data_Management or /Business/Biotechnology_and_Pharmaceuticals/Pharmaceuticals/Software/ might make more sense to many in the pharmaceutical industry.
With continued growth, the Web has moved toward a search model based on visible and hidden content on the Web page (e.g., Altavista). Most users have now settled on one search engine, Google, which ranks pages via an algorithm relating to the number and quality of referral links for a particular page. While Google gives astoundingly good results, the searcher often has to sift through a fair number of unrelated and unhelpful Web pages. In addition, it can often be difficult to find rare and interesting pages that are tucked into a "corner" of the Web.
An interesting new model is developing for sharing Web pages and searches. This is the concept of "tagging" Web pages for sharing. A number of Web sites such as del.icio.us and stumbleupon.com have appeared that provide this service. Any member of one of these services can "tag" a Web site that they find useful. The Web site then becomes available through a browsable directory of "tagged" sites, through searches, or through "random access" to such sites.
The tagging serves two purposes: first, to provide an objective, third-party assignment of ontologic tagging, which allows for categorization of the Web site, and second, to provide a "pre-qualified" set of Web sites for targeted searching. Finally, the StumbleUpon model allows for immediate access to random, highly rated Web sites within a category, or any number of categories.
Collaborative tagging with review works for searching and browsing because of the cooperative nature of the Web community, especially those who would get involved in such a "techie" enterprise. The collaborative tag is an incremental improvement on the Web. Some predict that the next real paradigm shift in the Web is going to arrive with the implementation of the Semantic Web.
What is the Semantic Web, and why is it so important? Let's answer these questions and see how Semantic Web concepts can be useful for organizing medical, scientific, research—or for that matter any—information.
In the words of Tim Berners-Lee, "The Semantic Web is a web of data, in some ways like a global database,"1 and the Semantic Web effort is developing "languages for expressing information in a machine processable form." The Semantic Web makes use of structured text, rather than natural language, to identify knowledge and its relationship with other knowledge or data. Berners-Lee's original vision involves the action of intelligent software "agents" on computers and handheld devices that would act autonomously to both retrieve data and interact with Web sites through specialized tagging of data and content on these sites.
The Semantic Web functions because of a highly specialized type of data and information tagging that can be implanted within Web pages. Fundamentally, each Semantic Web tag, known as a triplet, links together a subject, verb, and object, creating a relationship between them. Three simple examples of a semantic tag might be: <Boston> <is in the state of> <Massachusetts>; <Beacon Hill> <is a neighborhood in> <Boston>; <I> <like> <Boston>. As you can see, these three statements each consist of two nouns separated by a verb and are interlinked with one another through Boston.
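As a minimal sketch of the idea, the three example statements can be written as plain Python tuples. The tuple layout and the helper name statements_about are illustrative assumptions, not part of any Semantic Web standard; the point is only to show how the statements interlink through a shared node.

```python
# Each statement is a (subject, verb, object) triplet, as described above.
triplets = [
    ("Boston", "is in the state of", "Massachusetts"),
    ("Beacon Hill", "is a neighborhood in", "Boston"),
    ("I", "like", "Boston"),
]


def statements_about(node, triplets):
    """Return every triplet in which `node` appears as subject or object."""
    return [t for t in triplets if node in (t[0], t[2])]


# All three statements mention Boston, which is what links them together.
boston_statements = statements_about("Boston", triplets)
for subject, verb, obj in boston_statements:
    print(f"<{subject}> <{verb}> <{obj}>")
```

Asking for the statements about Massachusetts instead would return only the first triplet, since that node appears in just one statement.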
The Semantic Web is conceived of as a huge mesh of nodes (nouns) linked by arcs (verbs). Each node can be extended infinitely by adding more verbs or by making it the object of another triplet. The URIs (uniform resource identifiers) specified in a triplet don't have to reside on the external Web; any network that is addressable through a URI, perhaps internal to an organization or group of organizations, could also work.
The subjects and verbs (and sometimes the objects) in actual Semantic Web implementations aren't simple words but rather URIs. A URI is the general term for any Web locating designation, such as a URL (beginning with http://), an FTP address (beginning with ftp://), and many others. If a document or piece of data has a URI, it can be found on the Web and has a unique label that is shared uniformly by everyone on the Web. The tagging will build on existing Web standards such as HTML and XML, and tags are put in specialized files that follow agreed-upon semantics, which keeps an uncontrolled proliferation of verbs from turning order back into chaos.
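To make the role of URIs concrete, here is a hedged sketch that replaces the simple words with URIs and writes each statement on one line, in the style of the W3C N-Triples notation. The base address http://example.org/ (a domain reserved for illustrations), the verb names, and both helper functions are assumptions made for this example only.

```python
# Hypothetical base URI; example.org is a domain reserved for illustrations.
BASE = "http://example.org/"


def uri(name):
    """Build an illustrative URI from a short name (spaces become underscores)."""
    return BASE + name.replace(" ", "_")


def to_statement_line(subject, verb, obj):
    """Format one triplet in N-Triples style: three bracketed URIs and a period."""
    return f"<{subject}> <{verb}> <{obj}> ."


# The verb names below are invented for this sketch, not taken from a real vocabulary.
triplets = [
    (uri("Boston"), uri("isInStateOf"), uri("Massachusetts")),
    (uri("Beacon Hill"), uri("isNeighborhoodIn"), uri("Boston")),
]

lines = [to_statement_line(*t) for t in triplets]
print("\n".join(lines))
```

Because every subject, verb, and object is now a globally unique label, two independent publishers who both use http://example.org/Boston are unambiguously talking about the same node.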
Once data on the Web is tagged, a Semantic Web application can follow the tags and relationships to answer a question or perform a search. For example, the query "find neighborhoods in Massachusetts" could locate and act on our aforementioned Semantic Web tags and give Beacon Hill as an answer. This not only follows the defined relationships in the nodes and arcs, it also involves the application of logic (If A is in B and B is in C, then A is in C), a powerful tool.
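The query above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how a real Semantic Web application works: the triplets are plain tuples, and the sketch assumes that both verbs express containment ("A is in B"), which is what allows the chained reasoning.

```python
triplets = [
    ("Boston", "is in the state of", "Massachusetts"),
    ("Beacon Hill", "is a neighborhood in", "Boston"),
    ("I", "like", "Boston"),
]

# Assumption for this sketch: both of these verbs mean "A is in B".
CONTAINMENT_VERBS = {"is in the state of", "is a neighborhood in"}


def is_within(place, region, triplets):
    """True if `place` lies in `region`, directly or through a chain of triplets."""
    frontier = [place]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for subject, verb, obj in triplets:
            if subject == current and verb in CONTAINMENT_VERBS:
                if obj == region:
                    return True
                # Follow the chain: if A is in B and B is in C, then A is in C.
                frontier.append(obj)
    return False


def neighborhoods_in(region, triplets):
    """List every declared neighborhood that lies within `region`."""
    declared = {s for s, v, _ in triplets if v == "is a neighborhood in"}
    return sorted(n for n in declared if is_within(n, region, triplets))


print(neighborhoods_in("Massachusetts", triplets))  # prints ['Beacon Hill']
```

No triplet states that Beacon Hill is in Massachusetts; the answer comes from chaining the two containment statements through Boston.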
If a "neighborhood rating" organization rated Beacon Hill as safe, someone could ask the question, "show me neighborhoods rated as safe by a trusted organization in cities that Paul likes." The answer, again Beacon Hill, would involve a number of features of the envisioned Semantic Web, including trust, logic, and the ability to codify personal relationships with any object. One of the most powerful features of Semantic Webs is that they can be combined as long as there are common nodes, creating an infinitely expandable web of information.
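The combination of separately published webs can be sketched as a simple set union. Everything named here is hypothetical: the rating organizations, the neighborhood "Example Heights," the verb "rates as safe," and the hard-coded set of trusted raters are assumptions introduced only to illustrate the question posed above.

```python
# Three independently published sets of triplets (all contents are illustrative).
geography = {
    ("Beacon Hill", "is a neighborhood in", "Boston"),
    ("Example Heights", "is a neighborhood in", "Boston"),
    ("Boston", "is in the state of", "Massachusetts"),
}
preferences = {("Paul", "likes", "Boston")}
ratings = {
    ("Neighborhood Raters", "rates as safe", "Beacon Hill"),
    ("Unvetted Blog", "rates as safe", "Example Heights"),
}

# Assumption: trust is modeled as a plain set of rater names.
trusted_raters = {"Neighborhood Raters"}

# The webs can be combined because they share common nodes such as Boston.
combined = geography | preferences | ratings


def safe_neighborhoods_in_liked_cities(person, triplets, trusted):
    """Neighborhoods rated safe by a trusted rater, in cities `person` likes."""
    liked = {o for s, v, o in triplets if s == person and v == "likes"}
    safe = {o for s, v, o in triplets if v == "rates as safe" and s in trusted}
    return sorted(
        s for s, v, o in triplets
        if v == "is a neighborhood in" and o in liked and s in safe
    )


print(safe_neighborhoods_in_liked_cities("Paul", combined, trusted_raters))
# prints ['Beacon Hill']
```

The untrusted rating of "Example Heights" is present in the combined web but is filtered out by the trust check, which is the behavior the question calls for.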
The following example, again from Berners-Lee, is instructive:2
"The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. Such an agent coming to a physical therapy clinic's Web page will know not just that the page has keywords such as 'treatment,' 'medicine,' 'physical,' and 'therapy' (as might be encoded today) but also that Dr. Hartman works at this clinic on Mondays, Wednesdays, and Fridays and that an appointment script takes a date range in yyyy-mm-dd format and returns appointment times."
The Semantic Web can also be very meaningful for data, information, and knowledge that we now consider the realm of the traditional database. This usage requires an agreed upon ontology for the particular data. The ontology would be a shared ontology for all expected users of the data.
Health care, pharmaceutical, and biological data are particularly rich areas for the Semantic Web, and an interest group of the W3C (World Wide Web Consortium) has been set up to develop standards and work toward integrating "people, data, software, publications, and clinical trials."
The Semantic Web is early on in its development, but seems to be picking up steam. If you are interested in learning more or getting involved, you know how to find it—go to Google, enter "Semantic Web," and click "I'm feeling lucky." The rest is up to you.
Paul Bleicher MD, PhD, is the founder and chairman of Phase Forward, 880 Winter Street, Waltham, MA 02451, (888) 703-1122, firstname.lastname@example.org
He is a member of the Applied Clinical Trials Editorial Advisory Board.