METHOD AND APPARATUS FOR UNIFIED QUERY INTERFACE FOR NETWORK INFORMATION
TECHNICAL FIELD
The present invention relates to methods and systems for finding or searching information available on a network, in different information formats, and in some embodiments including access over a public network.
BACKGROUND ART
As the number of information sources on the web continues to grow, the need for good information integration technology increases. For this reason, the problem of information integration has received a great deal of attention from the research community and from the commercial community. The differences between the approaches taken by the research community and the commercial community are striking. At the risk of oversimplifying the issue, the research community has focused on the semantic power of the integration technology at the expense of the scalability of the approach, while the commercial community has focused on scalability at the expense of semantic power.
Research systems such as TSIMMIS, the Information Manifold, and Infomaster provide a general query facility over the integrated view of a number of data sources. They are able to support powerful queries and infer implicit joins through the use of a view containment test. This provides the very real advantage that a user can get an answer to a query without knowing that a join is required to construct this answer (since the system deduces the implicit join "under the covers"). Unfortunately, it also renders query planning and execution expensive, because the size of the plan space that must be searched, and the size of the generated query plan, grow quickly with the number of sources wrapped. For this reason, these systems are most effective at evaluating complex queries over a relatively small number of sites.
In the commercial world, the information integration space is dominated by comparison shopping services. In contrast to the research systems, these systems do not provide general purpose querying; their goal is to be able to evaluate a small number of canned queries (expressed by forms presented to the user) over a large number of sites.
Another very real barrier to scalability that is largely orthogonal to the semantic power of an integration system is the difficulty of "wrapping" new sites, that is, how hard it is to add new information sources to the system. The research community has not really dealt with this scalability issue (there is little call to wrap hundreds of sites in a research prototype), while the commercial community "solves" this problem by employing a small army of programmers to write wrappers, usually aided by some sort of wrapper generation toolkit.
DISCLOSURE OF THE INVENTION
The present invention, in various aspects, involves a method and/or system and/or apparatus for providing a scalable, unified view or search over large numbers of queryable information sources. In specific embodiments, the invention accomplishes this, in part by sacrificing some expressive power in the set of queries supported.
A system according to one embodiment of the invention provides scalability through three main techniques. First, it uses a collection of ontologies organized into hierarchical namespaces as a medium for expressing data semantics. Second, it employs a declarative query language to describe information sources so that source descriptions can be "executed" at run time instead of being pre-compiled into the system. Third, it utilizes inverted-index style operations to identify a subset of information sources that are relevant to a particular user query.
A further understanding of the invention can be had from the detailed discussion of specific embodiments below. For purposes of clarity, this discussion refers to devices, methods, and concepts in terms of specific examples. However, the method of the present invention may operate in a wide variety of applications. It is therefore intended that the invention not be limited except as provided in the attached claims.
Furthermore, it is well known in the art that computer systems can include a wide variety of different components and different functions in a modular fashion. Different embodiments of the present invention can include different mixtures of elements and functions and may group various functions as parts of various elements. For purposes of clarity, the invention is described in terms of systems that include different innovative components and innovative combinations of components. No inference should be taken to limit the invention to combinations containing all of the innovative components listed in any illustrative embodiment in this specification. Furthermore, it is well known in the art of internet applications and software systems that particular file formats, languages, and underlying methods of operation may vary. The disclosure of a particular implementation language or format of an element should not be taken to limit the invention to that particular implementation unless so provided in the attached claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes. The invention will be better understood with reference to the following drawings and detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a functional block diagram of a system overview according to specific embodiments of the invention.
FIG. 2 illustrates two examples of namespaces, according to specific embodiments of the invention.
FIG. 3 illustrates an example syntax for a source description language according to specific embodiments of the invention.
FIG. 4 is a block diagram showing a snapshot of the inverted index after registering sample sources.
FIG. 5 is a block diagram showing a representative example logic device in which various aspects of the present invention may be embodied.
BEST MODE FOR CARRYING OUT THE INVENTION
1. System Overview
In particular embodiments, the present invention involves a search system and/or method that employs namespaces in its query facility and "soft-wraps" information sources. Figure 1 illustrates the main components of an example system according to the present invention. (One implementation of a system according to the invention is referred to in some associated documents as IDB.) A system according to the present invention aims to provide a general query facility by using a collection of ontologies organized into a hierarchical namespace. Each ontology in a namespace defines a set of terms that describe common concepts. A namespace is used as the medium for expressing data semantics. Both user queries and source descriptions are written using terms in the namespace.
According to further aspects of the invention, a query language is provided (sometimes referred to as IDBQL) that is based on SQL. Queries are expressed using terms from a namespace. When writing a query, users do not need to know about the exported views of each individual information source. Instead, the query engine will identify a set of relevant information sources by using terms that appear in the query to probe an inverted index.
Unlike prior art systems (such as TSIMMIS and the Information Manifold), in specific embodiments, the present invention does not generally infer implicit joins. This implies that the invention can answer only a subset of the queries handled by systems that infer such joins. However, this also implies that query planning according to the present invention is much simpler than under these prior systems (it requires only simple inverted list lookup operations) and scales to large numbers of sites.
2. Soft-Wrapping
The present invention in further aspects utilizes a novel approach referred to as soft-wrapping to wrap information sources. According to the present invention, a "wrapper" is a declarative query evaluated at runtime. Source descriptions can be executed or evaluated at run time instead of being pre-compiled into the system (or "hard-wrapped"). The advantages of soft-wrapping over hard-wrapping are many. First, it is more flexible and portable, because the writing of source descriptions is independent of any run-time environment. Second, soft-wrappers can be tested and registered dynamically at runtime through a Web interface, without having to restart the system. Third, it is easy to adapt to dynamically changing Web data sources, as recompilation is not needed. Finally, soft wrapping is more secure in that what is registered is a declarative query, and not a pre-compiled wrapper program that must be trusted by whoever executes the wrapper.
3. Namespaces
The present invention, in particular embodiments, uses a collection of ontologies organized into a hierarchical namespace as a medium for expressing data semantics. An ontology according to the invention is a grouping of terms describing a concept. The terms in the ontologies are reusable. When defining an ontology, one can borrow existing terms from other ontologies in the namespace as well as create new terms. An ontology can selectively inherit (or reuse) any subset of the parent ontology. Inheritance from multiple ontologies is also allowed.
The IDB namespace functions as a global schema that provides a uniform view over information on the Web. It is an a priori schema as opposed to the a posteriori schema of some prior art systems. In TSIMMIS, for example, user queries are formulated over the view exported by a mediator. The mediated view is, in turn, generated by integrating views of lower level mediators or data sources. As a result, any source level change, such as adding a new source or dropping an existing source, may affect the upper level mediated view on which user queries are formulated. A namespace according to the current invention is defined independently from the views of data sources. In fact, the source view is defined using the terms in the namespace. Because of this, information source level changes do not affect the global view.
In particular embodiments, the invention uses a simple collection of terms as the global schema. In the future, if XML namespaces become prevalent, these could be used in place of the IDB ontology. By adopting XML namespaces, the invention would then be able to reuse a large number of widely used namespaces as schema without having to reinvent them.
FIG. 2 illustrates two examples of namespaces, according to specific embodiments of the invention. The movie ontology consists of terms that may be useful to describe movies. The term product#name from the product namespace is reused in the movie namespace as movie#title. It is advantageous to reuse existing terms, as this increases the number of information sources that can contribute to a given query. For instance, if a user queries on the name of a product using the product ontology, then information sources belonging to the book and movie ontologies are also queried, in addition to sources directly belonging to the product ontology. This is because the book#title and movie#title terms are inherited from the product#name term in the product namespace.
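By way of illustration only, the effect of term reuse on query coverage can be sketched as follows. The sketch assumes terms are written as "ontology#name" strings and that inheritance is recorded in a simple table; the table name PARENTS and the function expand_term are hypothetical and are not taken from FIG. 2.

```python
# Hypothetical inheritance table: each inherited term maps to the term(s) it reuses.
# A tuple is used because inheritance from multiple ontologies is allowed.
PARENTS = {
    "book#title": ("product#name",),
    "movie#title": ("product#name",),
}

def expand_term(term):
    """Return the term together with every term that inherits from it,
    directly or transitively."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parents in PARENTS.items():
            if child not in result and any(p in result for p in parents):
                result.add(child)
                changed = True
    return result

# A query on product#name would also reach sources described with
# book#title or movie#title.
print(sorted(expand_term("product#name")))
```

A query on book#title, by contrast, expands only to itself in this sketch, since no other term inherits from it.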
4. Source Description Language According to Specific Embodiments of the Invention
A query system according to the invention interacts with information sources using source descriptions. The role of source descriptions is twofold: (1) They export the views and capabilities of information sources; (2) They extract and map local data in the described source to the exported view of the source. Unlike traditional "hard wrapping," the present invention, according to specific embodiments, uses a "soft wrapping" scheme that allows source descriptions to be executed at query evaluation time. The source description is, in fact, a query language that "queries" a remote document or database. As a result, IDB does not require hard-coded or compiled wrappers to communicate with sources. Prior art "hard wrappers" generally require recompilation each time an information source changes its data presentation.
An example syntax of the source description language is as follows:
SELECT list-of-terms FROM url [post | get] [html | xml] WHERE mapping-rule [[and | or] mapping-rule] ...
The SELECT clause defines the exported view; the FROM clause specifies the location of the remote database and its query capability; and the WHERE clause defines the mapping rules.
As a further example, Figure 3 shows a source description for amazon.com. After evaluating the source description of amazon.com, an eight-column table of vendor, title, etc. will be generated.
The execution starts by evaluating the FROM clause. The FROM clause specifies the location of amazon.com's book database and the query binding that it accepts. Amazon.com's book database is published on the Web through a front-end form interface. This form interface accepts user inputs on the title and author fields, and this information is encoded in the url string in the FROM clause. In the case where the target information source is a document, the url of the document can simply be placed in the FROM clause without any query binding encoding.
Once IDB has rewritten the user query into local queries, the placeholders $book#title$ and $book#author$ will be replaced with the corresponding values from the user query. After it opens a url connection and sends the query string, the query result will be returned from the source in an HTML page. The HTML page is parsed into a DOM tree [DOM98]. If a source returns an XML page, then IDB will invoke an XML parser instead to generate the DOM tree. After this parsing step, the remaining query processing steps are transparent to both XML and HTML since the DOM interface is generic to both markup languages.
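By way of illustration only, the placeholder replacement step can be sketched as follows. The url template shown is hypothetical (the actual FROM clause for amazon.com appears in FIG. 3), and the treatment of an unbound placeholder as an empty string is an assumption of this sketch.

```python
import re
from urllib.parse import quote_plus

# Hypothetical url template with $ontology#term$ placeholders.
TEMPLATE = "http://books.example.com/search?title=$book#title$&author=$book#author$"

def bind_url(template, bindings):
    """Replace each $ontology#term$ placeholder with its URL-encoded value.
    Placeholders with no binding become empty strings (assumption)."""
    def substitute(match):
        return quote_plus(bindings.get(match.group(1), ""))
    return re.sub(r"\$(\w+#\w+)\$", substitute, template)

url = bind_url(TEMPLATE, {"book#title": "Database Systems"})
print(url)  # http://books.example.com/search?title=Database+Systems&author=
```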
The WHERE clause consists of a set of path expressions and perl-style text operations. The path expressions are evaluated over the DOM tree generated from the result page. The syntax of our path expression is like that of HEL [SA99a, SA99b] and WIDL [All97]. HEL also supports perl-style pattern matching. The IDB source description language, however, allows direct mapping from path expressions to the exported view and provides a larger set of text operations. Further, it allows the conjunction and disjunction of path expressions. For instance, depending on the user query binding, the amazon.com database returns two different types of HTML pages. In case the user query binding results in exactly one book entry, it directly returns the HTML page that contains the full book description. Otherwise, it returns an HTML page that contains a list of matching book entries, where each book entry has a short description and a url to the book page. We need different path expressions for each of these cases, as shown in Figure 3. A source description language according to specific embodiments supports popular perl regular expression operations such as match, substitute, join, and split, as well as a custom-designed switch operator. The switch operator is used to normalize the irregularity of output data across multiple sources. For example, some sources represent product availability in graphical symbols, and these must be transformed into their text equivalents. The dot (.) in the path expression denotes the direct path from a parent element to a child, and the arrow (->) denotes that zero or more steps exist in between.
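By way of illustration only, the dot and arrow operators described above can be sketched over a parsed document tree as follows. The sketch uses Python's standard ElementTree in place of a DOM parser, and the function name evaluate and the sample markup are hypothetical; it is not the path expression evaluator of FIG. 3.

```python
import re
import xml.etree.ElementTree as ET

def evaluate(root, path):
    """Evaluate a path such as 'html.body->b' against an element tree.
    '.' selects direct children; '->' selects descendants at any depth."""
    steps = re.findall(r"(->|\.)?(\w+)", path)
    _, first_tag = steps[0]
    current = [root] if root.tag == first_tag else []
    for axis, tag in steps[1:]:
        found = []
        for node in current:
            if axis == ".":
                matches = [child for child in node if child.tag == tag]
            else:
                # iter() also yields the node itself, so exclude it.
                matches = [d for d in node.iter(tag) if d is not node]
            for m in matches:
                if all(m is not f for f in found):  # avoid duplicates
                    found.append(m)
        current = found
    return current

# Hypothetical result page.
page = ET.fromstring(
    "<html><body><div><b>Title A</b></div><b>Title B</b></body></html>"
)
direct = [e.text for e in evaluate(page, "html.body.b")]
anywhere = [e.text for e in evaluate(page, "html->b")]
```

In this sketch, direct holds only the b element that is an immediate child of body, while anywhere holds both b elements in document order.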
The SELECT clause provides a global semantics for the local data. It defines the schema of the table that is generated by the execution of the source description. Note that the constant value 'amazon' is materialized into the book#vendor term, as all book entries are coming from the same source, amazon.com. The plus (+) sign at the end of an attribute is shorthand for 'IS NOT NULL'.
Although the exported view of Figure 3 consists of terms only from the book ontology, this is not a requirement of the IDB approach. The IDB source description can choose terms selectively from one or more ontologies. The source description need not conform to any namespace nor have any restriction on choosing sets of terms from various ontologies. This allows the source description language to describe sources using terms that are as close as possible to the original semantics of the data. As is the case in amazon.com, data extracted from the result page can potentially have some nested structure. To map this nested data to a flat output table, IDB employs a set of special iterators that are associated with each output attribute.
As we pointed out earlier, since IDB uses a declarative query language for its source descriptions, the traditional pre-compiled "wrapper" is no longer needed. This "soft wrapper" approach is more scalable since the process of writing, testing, and registering wrappers is not dependent on the hardware and software development environment; thus it can be completely decentralized. In fact, the source descriptions can be tested and registered at runtime through the Internet without bringing down the system. Anyone can write and register the source descriptions from anywhere on the Internet. Also, with the soft wrapper approach it is easier to adjust to dynamically changing Web sources, as wrappers need not be recompiled each time the source changes. Finally, it is more secure in that what is registered is a declarative query and not a pre-compiled wrapper program. That is, using the soft-wrapper approach, a "buggy" wrapper may cause the data from the wrapped source to be mapped incorrectly, but since it is just a declarative query, it does not pose a security risk to the site executing the soft wrapper query.
5. Query Language According to Specific Embodiments of the Invention
This section discusses a query language according to a specific embodiment of the invention (at times referred to herein as IDBQL), primarily through examples. In a particular embodiment, a query language may be understood as a subset of SQL, with additional keyword predicates. The keyword predicates are added to support a keyword match operation that is perhaps the most popular operation in real-world web queries.
A query is formulated using the terms defined in the ontology. The query writer does not need to know about the exported views of each of the individual underlying information sources. A query processor according to specific embodiments of the invention will identify a set of relevant information sources by probing an inverted index using the terms used in the query as described in the next section. A first example query illustrates a basic structure of a query language and the use of the keyword predicate.
SELECT B.vendor, B.title, B.author, B.price, B.year
FROM book B
WHERE B.title ~= 'Database Systems'
Example Query 1
This query uses the "book" ontology and retrieves vendor, title, author, price, and year information of books whose title contains the keywords 'Database' and 'Systems'. The result table will include book entries with titles such as 'Database Management Systems', 'Readings in Database Systems', etc. The partial output of the query is shown below.

vendor: Bookpool
title: Database Management Systems
author: Raghu Ramakrishnan, et al
price: 55.95
year: 1999

vendor: Amazon
title: Fundamentals of Database Systems
author: Ramez A. Elmasri/Shamkant B. Navathe
price: 79.75
year: 1999

vendor: Barnes and Noble
title: A First Course in Database Systems
author: Jeffrey D. Ullman, With Jennifer Widom
price: 59.75
year: 1997

Example Results 1
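By way of illustration only, the keyword predicate used in Example Query 1 can be sketched as follows. The sketch covers only the ~= behavior described for that query (every keyword must appear in the attribute value); matching on whole words and ignoring case are assumptions of this sketch, and the function name keyword_match is hypothetical.

```python
import re

def keyword_match(value, keywords):
    """True if every whitespace-separated keyword occurs as a word in value.
    Word-level, case-insensitive matching is an assumption of this sketch."""
    words = set(re.findall(r"\w+", value.lower()))
    return all(k in words for k in keywords.lower().split())

titles = [
    "Database Management Systems",
    "Readings in Database Systems",
    "Operating Systems",
]
matching = [t for t in titles if keyword_match(t, "Database Systems")]
```

Because the match does not depend on word order, a value such as "Ramakrishnan, Raghu" would also satisfy the keywords "Raghu Ramakrishnan" in this sketch.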
The keyword operators are especially useful because data may come from autonomous information sources. The presentation format of the data may differ across information sources, and even the data within one source may have different presentation formats over time. One common example is the format of a person's name. Some sources put the last name before the first name, and others put the first name first. The query language supports three keyword operators, with semantics as defined in Table 1.
Table 1. Keyword Operator Semantics
Table 2. Data Type Coercion Rules
The first example query was not very selective, as it returned more than 200 entries from various online book vendors. A second example query adds two more selection conditions to the first query and retrieves book availability information along with the original attributes.
SELECT B.vendor, B.title, B.author, B.price, B.year, B.stock
FROM book B
WHERE B.title ~= 'Database Systems' AND B.author ~= 'Ramakrishnan' AND B.year > 1998
Example Query 2
This query illustrates the use of numeric order predicates and data type coercion. A data model according to the invention is essentially type free. Attribute values are treated as string literals. To evaluate an order predicate (<, >, >=, <=, =), a system uses the Lore [MAG+97] coercion rules as shown in Table 2. For instance, in the above query, if the year attribute is not null and can be parsed into a number, the predicate will be evaluated over two numeric values. In a join predicate, such as book.year = movie.year, both operands are attributes. In this case, one of the attributes is first coerced into an appropriate type before the predicate is evaluated using the rules in Table 2. Part of the result table for the second query is shown below.

vendor: Fatbrain
title: Database Management Systems, Second Edition
author: Ramakrishnan, Raghu / Gehrke, Johannes
price: 53.25
year: 1999
stock: Ships same day

vendor: Bookpool
title: Database Management Systems
author: Raghu Ramakrishnan, et al
price: 55.95
year: 1999
stock: In Stock!

vendor: Amazon
title: Database Management Systems
author: Raghu Ramakrishnan, Johannes Gehrke
price: 75.60
year: 1999
stock: Usually ships in 24 hours

Example Results 2
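By way of illustration only, the evaluation of an order predicate over string-valued attributes can be sketched as follows. The numeric case follows the description above (a non-null value that parses as a number is compared numerically); treating a null operand as failing the predicate and falling back to string comparison otherwise are assumptions of this sketch and are not a restatement of the rules in Table 2.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "=": operator.eq}

def to_number(value):
    """Return value as a float, or None if it cannot be parsed as a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def compare(left, op, right):
    """Evaluate 'left op right' after attempting numeric coercion."""
    if left is None or right is None:
        return False  # assumption: a null operand never satisfies the predicate
    l, r = to_number(left), to_number(right)
    if l is not None and r is not None:
        return OPS[op](l, r)  # both operands parse as numbers
    return OPS[op](str(left), str(right))  # assumption: string comparison

rows = [{"year": "1999"}, {"year": "1997"}, {"year": None}]
recent = [row for row in rows if compare(row["year"], ">", 1998)]
```

Under these assumptions, only the row whose year is "1999" satisfies the predicate year > 1998.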
Example Query 3 is a simple explicit join query. It is provided to illustrate the case where more than one ontology is involved in a query. It retrieves title, actor of movies and vendor, url, format, price of books where the movies are directed by 'Steven Spielberg', books are written by 'Michael Crichton', and both movie and book have the same title. Part of the result table for this query follows the example.
SELECT M.title, M.actor, B.vendor, B.url, B.format, B.price
FROM book B, movie M
WHERE B.author ~= 'Michael Crichton' AND
M.director ~= 'Steven Spielberg' AND
B.title = M.title
Example Query 3
title: Jurassic Park
actor: Morgan Freeman / Nigel Hawthorne / Anthony Hopkins / Sir Anthony Hopkines
vendor: Amazon
url: http://www.amazon.com/exec/obidos/ASIN/0394588169/ ...
format: Hardcover
price: 18.87

title: Jurassic Park
actor: Morgan Freeman / Nigel Hawthorne / Anthony Hopkins / Sir Anthony Hopkines
vendor: Barnes and Noble
url: http://shop.barnesandnoble.com/booksearch/isbnInquiry.asp?isbn=0345370775..
format: Mass Market Paperback
price: 6.39

title: Jurassic Park
actor: Morgan Freeman / Nigel Hawthorne / Anthony Hopkins / Sir Anthony Hopkines
vendor: Borders
url: http://search.borders.com/fcgi-bin/db2www/search/search.d2w/Details?prodID...
format: Paperback
price: 6.39

Example Results 3
6. Example Query Processing
The following are the steps of query processing according to specific embodiments of the invention.
• A user query is formulated using terms in one or more ontologies.
• The query engine identifies base tables for each ontology used in the query. A base table is determined by identifying the minimum subset of terms in a given ontology that is required to evaluate the user query.
• The query engine retrieves a set of source descriptions for each base table from the source description index. This index is, in fact, an inverted index that associates terms in the ontology to the relevant source descriptions.
• The query engine translates the original user query into local queries using the views exported from the set of source descriptions that were identified in the previous step.
• The query engine materializes local views at each source, unions results by base tables, and processes remaining predicates (e.g. joins between base tables).
These steps are illustrated below using a query example. The following are the two ontologies (book and review) that the example query references. The list of terms in each is shown only for illustration. An ontology is a collection of terms that describe a concept, e.g. book and review in the example; a namespace is the collection of all those ontologies organized into a hierarchical semantic graph.
book (vendor, title, url, author, year, price, format, stock, publisher, isbn)
review (isbn, title, author, year, review)
Example Ontology For Example Query 4
An example query is shown below. It retrieves the vendor, title, author, price, and review attributes of books that have the keywords 'Database' and 'Systems' in their title.
SELECT B.vendor, B.title, B.author, B.price, R.review
FROM book B, review R
WHERE B.title ~= 'Database Systems' AND B.title = R.title
Example Query 4
The first step of the query processing is to identify the base tables and the predicate binding implied for the base tables. To illustrate this process, we represent the above query in rules with adornments.
query4^fbfff (vendor, title, author, price, review) :- book^fbff (vendor, title, author, price), review^bf (title, review), title ~= 'Database Systems'
Query 4 in Rules with Adornments
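By way of illustration only, the derivation of base tables from the terms of Example Query 4 can be sketched as follows: the terms used in the SELECT and WHERE clauses are gathered and grouped by the ontology named in the FROM clause. The function name base_tables and the representation of the query as lists of alias-qualified terms are assumptions of this sketch; no query parser is shown.

```python
def base_tables(select_terms, where_terms, aliases):
    """Group alias-qualified terms (e.g. 'B.title') by the ontology
    that each alias names in the FROM clause."""
    tables = {}
    for qualified in list(select_terms) + list(where_terms):
        alias, term = qualified.split(".")
        columns = tables.setdefault(aliases[alias], [])
        if term not in columns:
            columns.append(term)
    return tables

# Terms of Example Query 4, written out by hand.
aliases = {"B": "book", "R": "review"}
select_terms = ["B.vendor", "B.title", "B.author", "B.price", "R.review"]
where_terms = ["B.title", "R.title"]

tables = base_tables(select_terms, where_terms, aliases)
```

In this sketch the book base table contains vendor, title, author, and price, and the review base table contains title and review; the column order within a table is not significant.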
Predicate adornment is used to illustrate how the binding pattern serves as a filter for pruning out irrelevant sources. As shown above, the title is the only variable that is bound in query4. The base tables in this query are book^fbff (vendor, title, author, price) and review^bf (title, review). The way base tables are identified is straightforward: all terms used in the SELECT and WHERE clauses are gathered and grouped into the ontologies that appear in the FROM clause. Note that the base table is different from the global predicates as discussed, for example, with regard to the Information Manifold [LRO96a, LRO96b]. Unlike the predefined set of predicates in the Information Manifold, a base table according to specific embodiments of the invention is dynamically generated by projecting out terms from the particular user query. Also, compared to other systems, this aspect of the invention uses a much simpler way to identify information sources that are relevant to a particular user query. Previous systems, including the Information Manifold and Infomaster [DG97, GKD97], identify information sources through a query rewriting scheme based on the view containment test in their query processing (see [Ull97] for an overview). In contrast, in this aspect, embodiments of the invention utilize inverted index style operations in query processing. To illustrate this, assume the following information sources.
amazon^fbbff (vendor, title, author, year, price)
borders^fbff (vendor, title, author, price)
book3^fbff (vendor, title, price, isbn)
book4^fb (author, isbn)
nytimes^bff (title, author, review)
wpost^bbf (title, isbn, review)
Due to the space limitation, only the list of terms that each source exports, instead of source descriptions, is shown. For an example of a full source description see FIG. 3.
The adornment here has a slightly different meaning than the adornment in the query. It is used here to specify the query capability of a particular source. For instance, the adornment of amazon, fbbff, means that amazon can accept queries on either title or author and returns a table of columns including vendor, title, author, year, and price. Similarly, nytimes (with adornment bff) can answer queries on title and returns a table of title, author, and review.
When a source description is registered, a system according to a specific embodiment of the invention indexes the source into two inverted indexes. The first inverted index, A, maintains the relation between all terms used in the source description and the identifier of the source description itself. The second index, B, indexes only bound variables (terms). A snapshot of the two example indexes is shown in FIG. 4.
For each base table in the query, a method according to this aspect of the invention identifies a subset of source descriptions using the indexes. In the example query, index A is first probed using all terms in the book base table, namely vendor, title, author, and price. The result, amazon and borders, is obtained by intersecting the four resulting inverted lists. Second, index B is probed using the bound variables in the book base table. Here, title is the only bound variable in the user query. This probe returns amazon, borders, and book3. Finally, the two results are intersected to get the subset of sources that are relevant to the user query, specifically for the book base table. The same process is repeated to get nytimes and wpost for the review base table.
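By way of illustration only, the registration and probing of the two inverted indexes can be sketched as follows, using the sample sources listed above. The assignment of each sample source to the book or review ontology, the "ontology#term" index keys, and the names register and relevant_sources are assumptions of this sketch; FIG. 4 is not reproduced.

```python
from collections import defaultdict

# (source, ontology, exported terms, adornment); 'b' marks a term on which
# the source accepts a query binding.
SOURCES = [
    ("amazon",  "book",   ["vendor", "title", "author", "year", "price"], "fbbff"),
    ("borders", "book",   ["vendor", "title", "author", "price"], "fbff"),
    ("book3",   "book",   ["vendor", "title", "price", "isbn"], "fbff"),
    ("book4",   "book",   ["author", "isbn"], "fb"),
    ("nytimes", "review", ["title", "author", "review"], "bff"),
    ("wpost",   "review", ["title", "isbn", "review"], "bbf"),
]

index_a = defaultdict(set)  # term -> sources that export the term
index_b = defaultdict(set)  # term -> sources that can be queried on the term

def register(name, ontology, terms, adornment):
    """Index one source description into both inverted indexes."""
    for term, flag in zip(terms, adornment):
        key = f"{ontology}#{term}"
        index_a[key].add(name)
        if flag == "b":
            index_b[key].add(name)

def relevant_sources(ontology, needed_terms, bound_terms):
    """Intersect the inverted lists of index A for all needed terms, then
    intersect with the lists of index B for the bound terms."""
    lists_a = [index_a.get(f"{ontology}#{t}", set()) for t in needed_terms]
    result = set.intersection(*lists_a) if lists_a else set()
    for t in bound_terms:
        result &= index_b.get(f"{ontology}#{t}", set())
    return result

for source in SOURCES:
    register(*source)

book_sources = relevant_sources("book", ["vendor", "title", "author", "price"], ["title"])
review_sources = relevant_sources("review", ["title", "review"], ["title"])
```

With the sample sources, book_sources contains amazon and borders, and review_sources contains nytimes and wpost, matching the walk-through above. Planning therefore consists only of set lookups and intersections.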
All sources in the result have the query capability on the title attribute and can produce a table of the projected columns in each base table. The remaining steps of the query processing are straightforward. The system groups the source descriptions on the base tables and executes them. When executing source descriptions, the placeholders in the FROM clause are replaced by the query binding and encoded into a legal url query string. In the example, the source descriptions of amazon and borders are executed and generate a book table by unioning the results from both sources. Similarly, the source descriptions of nytimes and wpost are executed to generate the review table.
Earlier integration systems based on the view containment test may find more results than this procedure. (For instance, the Information Manifold would generate tuples from the sources book3 and book4 by joining on the attribute isbn.) The present invention, because it does not do such inference, would not find this implicit join. However, by giving up some semantic power, the present invention gains flexibility and scalability.
The last step of query processing is to evaluate join predicates across the ontologies. In the example, the book and review tables are joined on title attribute.
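By way of illustration only, the final steps (unioning the per-source results into base tables and then joining the base tables on title) can be sketched as follows. The rows shown are hypothetical sample data, and the function names union and join_on are assumptions of this sketch; a nested-loop join is used only for brevity.

```python
def union(*tables):
    """Concatenate the rows returned by several sources into one base table."""
    rows = []
    for table in tables:
        rows.extend(table)
    return rows

def join_on(left, right, term):
    """Nested-loop equijoin of two base tables on a shared term.
    Columns of the right row other than the join term are added to the left row."""
    joined = []
    for l in left:
        for r in right:
            if l[term] == r[term]:
                merged = dict(l)
                merged.update({k: v for k, v in r.items() if k != term})
                joined.append(merged)
    return joined

# Hypothetical rows standing in for materialized source results.
amazon_rows = [{"vendor": "amazon", "title": "T1", "price": "10.00"}]
borders_rows = [{"vendor": "borders", "title": "T2", "price": "12.00"}]
nytimes_rows = [{"title": "T1", "review": "R1"}]

book = union(amazon_rows, borders_rows)
review = union(nytimes_rows)
result = join_on(book, review, "title")
```

Here only the amazon row finds a matching review, so result holds a single combined row.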
By way of further example, consider a further embodiment in which the first inverted index indexes Source Descriptions (SDs) based on their exported terms and the second inverted index indexes SDs based on their input terms. As an example, assume a user query has an input on term1 and requires term2 and term3 exported, and that all terms come from a single ontology for simplicity. To identify relevant sources, the first inverted index is probed with term1, term2, and term3, and the resulting lists of sources are intersected. This step identifies all sources that export all three terms that are needed. Then, the second inverted index is probed with term1 only, to identify all sources that are capable of answering queries on term1. Finally, the results from the first index and the second index are intersected to get the sources that can answer queries on term1 and export term2 and term3.
7. Implementation Issues
From the teachings provided herein, it will be seen that wrapping a new source in some implementations can be done very efficiently, in one implementation taking just 10 minutes on average. The query planning stage is effectively instantaneous, with the delay in query evaluation due to waiting for the information sources to respond. With multithreading of the query engine in a system according to the invention, the delays in waiting for the individual sites to respond are overlapped.
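By way of illustration only, the overlapping of source response delays can be sketched with a thread pool as follows. The function fetch is a stand-in that sleeps to simulate a slow source rather than contacting any site, and the names fetch and fetch_all and the delay values are assumptions of this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(source):
    """Stand-in for executing one source description; the sleep simulates
    the time spent waiting for the remote site to respond."""
    name, delay_seconds = source
    time.sleep(delay_seconds)
    return [{"vendor": name}]

def fetch_all(sources):
    """Query all sources concurrently so that their waiting times overlap."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        per_source = list(pool.map(fetch, sources))
    return [row for rows in per_source for row in rows]

start = time.perf_counter()
rows = fetch_all([("amazon", 0.2), ("borders", 0.2), ("book3", 0.2)])
elapsed = time.perf_counter() - start
```

Because the three simulated waits run in parallel, the total elapsed time in this sketch is expected to be close to the slowest single source rather than the sum of all three.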
8. Embodiment in a Programmed Digital Apparatus
The invention or aspects thereof may be embodied in a fixed media or transmissible program component containing logic instructions and/or data that, when loaded into an appropriately configured computing device, cause that device to operate according to the invention. FIG. 5 is a block diagram showing a representative example logic device in which various aspects of the present invention may be embodied.
FIG. 5 shows digital device 700 that may be understood as a logical apparatus that can read instructions from media 717 and/or network port 719. Apparatus 700 can thereafter use those instructions to direct a method according to the invention. One type of logical apparatus that may embody the invention is a computer system as illustrated in 700, containing CPU 707, optional input devices 709 and 711, disk drives 715 and optional monitor 705. Fixed media 717 may be used to program such a system and could represent a disk-type optical or magnetic media or a memory. Communication port 719 may also be used to program such a system and could represent any type of communication connection.
The invention also may be embodied within the circuitry of an application specific integrated circuit (ASIC) or a programmable logic device (PLD). In such a case, the invention may be embodied in a computer understandable descriptor language which may be used to create an ASIC or PLD that operates as herein described.
The invention also may be embodied within the circuitry or logic processes of other digital apparatus, such as cameras, displays, image editing equipment, etc.
9. Conclusion
The invention has now been explained with regard to specific embodiments. Variations on these embodiments and other embodiments will be apparent to those of skill in the art. The invention therefore should not be limited except as provided in the attached claims.
It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.