WO2003042873A1 - Method and system for indexing and searching of semi-structured data - Google Patents

Method and system for indexing and searching of semi-structured data

Info

Publication number
WO2003042873A1
WO2003042873A1 PCT/US2002/036240 US0236240W WO03042873A1 WO 2003042873 A1 WO2003042873 A1 WO 2003042873A1 US 0236240 W US0236240 W US 0236240W WO 03042873 A1 WO03042873 A1 WO 03042873A1
Authority
WO
WIPO (PCT)
Prior art keywords
index
person
data
semi
path
Prior art date
Application number
PCT/US2002/036240
Other languages
English (en)
Inventor
Joseph Ellsworth
Chetan Patel
Original Assignee
Coherity, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Coherity, Inc.
Publication of WO2003042873A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Definitions

  • the present invention relates generally to data processing, and, more particularly, to a method and system for indexing and searching of semi-structured data.
  • Extensible Markup Language (XML) is a meta language for exchanging content among different platforms such as the World Wide Web.
  • XML allows data to have nested substructures with free text ("semi-structured data").
  • XML allows for user-defined tags ("meta data") to structure data in which the data can have any number of nested or hierarchical levels (“hierarchical data structure”).
  • the flexibility of XML to define tags allows XML data to have a wide variety of formats that can evolve with changing conditions (“evolving XML data”). As such, XML is popular with business partners or customers allowing them to exchange evolving XML data over the Internet.
  • Entity A may want to partner with entity B to exchange data regarding their employees.
  • Entity A may store, e.g., a record of <person> with a description of programming skill sets using an XML format as illustrated below. <person> <name>John</name> <skill>
  • Entity B may store, e.g., a record of <person> with a description of programming skill sets using a different XML format as illustrated below.
  • entities A and B both use a <person> record
  • entity A has a more sophisticated nested structure for the <person> record than entity B.
  • the above records can evolve if the person acquires new and different types of skills.
  • the underlying data structure for the <person> records may need modification.
  • a standard XML format is needed. Updating XML data into a standard format, however, is time-consuming. Furthermore, each change to XML data may require updating the underlying schema or format of the XML data.
  • RDBMS Relational Database Management System
  • For a hierarchical data structure such as XML, an RDBMS splits levels or substructures of XML data into separate sections of tables. This requires a mapping to define relationships between data in each substructure and a particular section of tables.
  • one limitation of an RDBMS is that a new mapping process is required when adding new substructures to XML data. This requires a time-consuming mapping process each time a new substructure is added to XML data.
  • RDBMS is not well suited for storing and querying data in hierarchical, sub-structured data such as XML.
  • Another system for handling XML data uses a "free-text" unstructured text search engine ("free-text system").
  • a free-text system typically stores data as documents.
  • the free-text system indexes documents based on content using an inverted tree indexing scheme. This type of indexing does not allow for searching in a specific substructure within hierarchical, semi-structured data such as XML.
  • a free-text system does not differentiate, e.g., between text strings in underlying substructures of XML data.
  • a user may want to search for the text string "Java programming" in the skill sets category of an employee record and not for the text string in other substructures.
  • the free-text system, however, would search for the text string in all substructures.
  • the data retrieval and querying abilities of a free-text system do not effectively leverage hierarchical data structures to search for particular data within a specific substructure.
  • Another system for handling XML data is a hybrid of RDBMS and the free-text system ("hybrid system").
  • the hybrid system typically splits unstructured XML data (e.g., free text) into a free-text system and structured XML data (e.g., substructure data such as a path) into a RDBMS.
  • the hybrid system splits data retrieval between the two different systems. For example, a query for a keyword is performed in the free-text system and a query for structured data is performed in the RDBMS.
  • a method for indexing semi-structured data includes determining an element within the semi-structured data, the element being associated with a path; and creating a plurality of indices for the element, the indices including an index by keyword or phrase for the element, an index by the path for the element, an index to support word stems for the element, a keys index for path manipulation of the element, a soundex index for the element, and a synonym index for the element.
  • a method for processing a query request regarding semi-structured data includes parsing the query request into query components, each query component representing a search request for an element within the semi-structured data; and processing the query components to search for the associated elements using a plurality of indices including an index by keyword or phrase for the element, an index by the path for the element, an index to support word stems for the element, a keys index for path manipulation of the element, a soundex index for the element, and a synonym index for the element.
  • a method for request and index management in a system for processing index and modification requests based on defined rules and conditions includes determining to process at least one of the subsequent processing steps based on the defined rules and conditions of the system, the step of determining including: processing at least one modification request into new index files if at least one pending modification request exists and a number of index files in the system is less than a first threshold limit; merging a first size of index files into a first merged file if no modification request is pending or the number of index files in the system is larger than a second threshold limit; merging a second size of index files into a second merged file if the step of merging the first size of index files is not being processed and the number of index files is larger than the second threshold limit; reclaiming freed memory space by deleting the merged first size and second size of index files; and optimizing the system by building additional index files if space has been reclaimed and optimization is required.
  • a method for processing an event trigger on semi-structured data includes parsing the event trigger into components; processing the components to determine at least one path used to invoke the event trigger; creating a segmented index based on each determined path; and if a modification is performed on the semi-structured data, determining at least one path affected by the request; determining if the path affected by the modification is in the segmented index; and notifying at least one party of the modification if the affected path is in the segmented index.
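  • The event-trigger method above can be sketched roughly as follows. This is a minimal illustration only: the class name `TriggerRegistry`, its methods, and the callback signature are hypothetical, and a plain dictionary stands in for the "segmented index" of watched paths, whose real layout the text does not specify.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Callback = Callable[[str, str], None]


class TriggerRegistry:
    """Toy stand-in for a segmented index of the paths that triggers watch."""

    def __init__(self) -> None:
        # watched path -> callbacks to invoke when that path is modified
        self._by_path: Dict[str, List[Callback]] = {}

    def register(self, paths: Iterable[str], callback: Callback) -> None:
        """Record every path that should invoke the trigger."""
        for path in paths:
            self._by_path.setdefault(path, []).append(callback)

    def on_modification(self, record_id: str, affected_paths: Iterable[str]) -> int:
        """Notify parties whose watched paths were affected; return the count."""
        notified = 0
        for path in affected_paths:
            for callback in self._by_path.get(path, []):
                callback(record_id, path)
                notified += 1
        return notified


events: List[Tuple[str, str]] = []
registry = TriggerRegistry()
registry.register(["person.skill"], lambda rid, path: events.append((rid, path)))
registry.on_modification("0000000126", ["person.skill", "person.email"])
```

  • Only the modification to `person.skill` produces a notification here, because `person.email` is not in the registry.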
  • a method for merging semi-structured data includes parsing a declarative request into components; identifying primary and secondary semi-structured record types being requested and at least one relationship between the requested semi-structured record types using the parsed components; building a temporary index based on the identified secondary record types; and merging the primary record types with the secondary record types based on the temporary index.
  • FIG. 1A is a block diagram of an exemplary system architecture in which methods and systems consistent with the invention may be implemented;
  • FIG. 1B is an internal block diagram of an exemplary computer system in which methods and systems consistent with the invention may be implemented;
  • FIG. 2 is a flow diagram of a method for indexing semi-structured data
  • FIG. 3 is a flow diagram of a method for querying semi-structured data
  • FIG. 4A is a flow diagram of a method for evaluating a WHERE clause in FIG. 3;
  • FIG. 4B is a flow diagram of a method for evaluating group of expressions in FIG. 4A;
  • FIG. 4C is a flow diagram of a method for loading and reconstructing results in FIG. 3;
  • FIG. 5A is a flow diagram of a method for processing a query request by a query engine
  • FIG. 5B is a flow diagram of a method for an index engine to process queued index or modification requests;
  • FIG. 6 is an exemplary XML event trigger in which methods consistent with the invention may be implemented
  • FIG. 7A is a flow diagram of a method for processing an XML event trigger
  • FIG. 7B is a flow diagram of a method for executing an XML event trigger.
  • FIG. 8 is a flow diagram of a method for merging semi-structured data.
  • FIG. 1A is a block diagram of an exemplary system architecture 100 in which methods and systems consistent with the invention may be implemented.
  • System architecture 100 includes clients 104 and 106 connected to a query server 110 via a network 102.
  • Query server 110 is connected to an index server 112 and an XML repository 120.
  • XML repository 120 stores XML data and index files consistent with the invention.
  • XML repository 120 is a database system including one or more storage devices.
  • XML repository 120 may store other types of information such as, for example, configuration or storage use information.
  • Network 102 may be the Internet, a local area network (LAN), or a wide area network (WAN).
  • System architecture 100 is suitable for use with the Java™, Python, C++, and SQL™ programming languages, and other like programming languages.
  • Clients 104 and 106 include user interfaces such as, for example, a web browser 103 and client application 105, respectively, to send query or index requests to query engine 111 operating in query server 110.
  • a query request is a search request for desired data in XML repository 120.
  • An index request (“modification request”) is a modification request to XML repository 120.
  • a modification request is used for an update, delete, or insert to XML repository 120.
  • Clients 104 and 106 can send a query or modification request to query engine 111 of query server 110 using standard protocols such as Hypertext Transfer Protocol (HTTP) or a Structured Query Language (SQL) protocol.
  • Query engine 111 determines whether a request from clients 104 or 106 is a query or modification request. Query engine 111 processes a query request from clients 104 and 106 by parsing the query request for execution of a search consistent with the invention. Query engine 111 may use index files in XML repository 120. Query engine 111 loads search results of records that match the query request and returns the results to requesting clients 104 or 106. In one embodiment, query engine 111 stores modification requests in an index request queue 115 ("modification request queue 115") for an index engine 113. Query engine 111 notifies index engine 113 of an inbound modification request placed in the modification request queue 115 via a "notification" communication path.
  • Index engine 113 creates indices related to particular elements of XML data or documents consistent with the invention. In one embodiment, index engine 113 stores indices in index files consistent with the invention. Index engine 113 also loads and processes index requests from the index request queue 115. After processing an index request, index engine 113 updates or creates indices in XML repository 120 related to an update, delete, or insert request. In one embodiment, index engine 113 periodically polls modification request queue 115 to determine if any new pending modification requests exist. Index engine 113 notifies query engine 111 of completed processed modification requests and of updates to indices in XML repository 120 such that query engine 111 can use updated indices for query searching in XML repository 120.
  • FIG. 1B is an internal block diagram of an exemplary computer system 150 in which methods and system consistent with the invention may be implemented.
  • Computer system 150 may represent the internal components of the clients or servers of exemplary system architecture 100 in FIG. 1A.
  • a query engine or an index engine, consistent with the invention, may be implemented in computer system 150.
  • both a query engine and an index engine may be implemented in computer system 150.
  • Computer system 150 includes several components all interconnected via a system bus 160.
  • Bus 160 may be, for example, a bidirectional system bus that connects the components of computer system 150.
  • bus 160 may contain thirty-two address lines for addressing a memory 165 and thirty-two bit data lines for transferring data among the components.
  • Computer system 150 may communicate with other computing systems on network 102 via network interface 185, examples of which include Ethernet or dial-up telephone connections.
  • Computer system 150 contains a central processing unit (CPU) 155 connected to a memory 165.
  • CPU 155 may be a microprocessor such as the Pentium ® family microprocessors manufactured by Intel Corporation. However, any other suitable microprocessor, micro-, mini-, or mainframe computer, may be used.
  • Memory 165 may include a random access memory (RAM), a read-only memory (ROM), a video memory, or mass storage.
  • the mass storage may include both fixed and removable media (e.g., magnetic, optical, or magnetic optical storage systems or other available mass storage technology).
  • Memory 165 may contain a program, an application programming interface (API), and other instructions for performing the methods consistent with the invention.
  • the query engine 111 and index engine 113 may be implemented as software programs in memory 165 executed by CPU 155. In one embodiment, query engine 111 and index engine 113 are computer programs suitable for the Python programming language.
  • Computer system 150 may also receive input via input/output (I/O) devices 170, which may include a keyboard, pointing device, or other like input devices. Computer system 150 may also present information and interfaces via display 180.
  • FIG. 2 is a flow diagram of a method for indexing semi-structured data.
  • the method provides indices for flexible path searching of semi-structured data.
  • the method refers to the following exemplary XML document.
  • the above exemplary XML document represents exemplary semi-structured data.
  • the exemplary XML document includes a person/user's profile with elements or sections describing the person's name, email, phone number, interest, skill, etc.
  • the exemplary XML document includes three nested levels (person > skill > programming) and includes fields with data types.
  • the method does not require XML documents to conform to any particular schema or format. In one embodiment, the method builds index files in an ASCII
  • CISAM Contiguous Index Sequential Access Method
  • a keyword or phrase index file is built by performing an inverted tree algorithm on one or more elements in the XML document.
  • Each element is associated with a path.
  • the element "sky diving & bungee jumping" is associated with a path
  • the keyword or phrase index file stores indices of records to keywords or phrases for each element within the XML document. Each index relates to a word or phrase that occurs in an element of a structure within the
  • </interest> would translate into the following indices for the keyword or phrase index file.
      bungee^person.interest^0000000126
      diving^person.interest^0000000126
      jumping^person.interest^0000000126
      sky^person.interest^0000000126
  • where the first entry "bungee" occurs in the path "person.interest" of record "0000000126", which represents the ID of the record ("record ID") containing the keyword "bungee."
  • an entire phrase that occurs in a given element can be indexed, e.g., "sky diving & bungee jumping" and stored in the keyword and phrase index file.
  • if a phrase exceeds a pre-defined threshold (e.g., 1 K), the phrase can be hashed to store a hash code, which may be used in place of the phrase. In this case, a hash code to phrase mapping is maintained in a separate index file.
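  • The hash substitution for long phrases might look like the sketch below. It assumes that "1 K" means 1024 bytes of UTF-8 text and uses SHA-1 as the hash function; the source names neither, and `index_term` and `PHRASE_THRESHOLD` are hypothetical names.

```python
import hashlib
from typing import Dict

PHRASE_THRESHOLD = 1024  # assumed reading of the "1 K" threshold, in bytes


def index_term(phrase: str, hash_to_phrase: Dict[str, str]) -> str:
    """Return the term to store in the index for this phrase.

    Short phrases are stored as-is. Long phrases are replaced by a hash
    code, and the code-to-phrase mapping is kept in a separate table.
    """
    encoded = phrase.encode("utf-8")
    if len(encoded) <= PHRASE_THRESHOLD:
        return phrase
    code = hashlib.sha1(encoded).hexdigest()  # hash choice is an assumption
    hash_to_phrase[code] = phrase
    return code


mapping: Dict[str, str] = {}
short_term = index_term("sky diving & bungee jumping", mapping)
long_term = index_term("x" * 2000, mapping)
```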
  • the keyword and phrase index file may store individual words and their associated paths, e.g., bungee^person.interest. This allows for advanced parametric and step searching capabilities. In particular, such indices can be used to find all paths where the word "bungee" occurs. Additionally, such indices are useful for determining the actual structure where certain words or phrases occur.
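  • A keyword and phrase index of this kind can be approximated in memory as a mapping from each word or phrase to the (path, record ID) pairs where it occurs. The sketch below is an illustration under assumptions: the function names are hypothetical, the tokenizer is a simple regular expression, and an in-memory dictionary replaces the on-disk index file format.

```python
import re
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Set, Tuple

Posting = Tuple[str, str]  # (path, record ID)


def build_keyword_index(
    record_id: str, elements: Iterable[Tuple[str, str]]
) -> DefaultDict[str, Set[Posting]]:
    """Index every word, and every whole phrase, of each (path, text) element."""
    index: DefaultDict[str, Set[Posting]] = defaultdict(set)
    for path, text in elements:
        phrase = text.lower()
        index[phrase].add((path, record_id))  # whole-phrase entry
        for word in re.findall(r"[a-z0-9]+", phrase):
            index[word].add((path, record_id))  # per-word entry
    return index


def paths_containing(index: DefaultDict[str, Set[Posting]], word: str) -> List[str]:
    """Return all paths in which the given word occurs."""
    return sorted({path for path, _ in index.get(word.lower(), ())})


keyword_index = build_keyword_index(
    "0000000126",
    [
        ("person.interest", "programming"),
        ("person.interest", "sky diving & bungee jumping"),
    ],
)
```

  • With this structure, `paths_containing(keyword_index, "bungee")` answers the "find all paths where the word occurs" question directly.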
  • an index by path file is built by performing btree algorithms on one or more elements in the XML document. Appropriate btree algorithms may be found in the Knuth reference.
  • the index by path file stores indices of records based on the path for each element within the XML document. Each index refers to the path where the element is found in the XML document. For the exemplary XML document, indices by path for elements in the XML document would translate into the following index by path entries.
      person.email^gill@killroy.com^0000000126
      person.interest^programming^0000000126
      person.interest^sky diving & bungee jumping^0000000126
      person.name.
  • an index to "gill@killroy.com" is designated by its path "person.email" for the record "0000000126."
  • An index by path provides useful indices to search for elements within a hierarchical data structure, such as an XML document, that is changing and evolving.
  • a query request may use special semantics for a query search, e.g., SELECT WHERE person*skill CONTAINS "Java". This query looks for the text string "Java" at the path "person*skill", where "*" indicates that there can be zero or more undetermined nested levels between the person and the skill attribute.
  • an index by path is useful for semi-structured data.
  • the index by path also indicates the occurrence of a specific path for a given word or phrase.
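  • One way to resolve a path expression such as person*skill against the set of known paths is to translate it into a regular expression. The sketch below assumes that "." means a direct child, that "*" means zero or more intermediate levels, and that a match must end exactly at the last named element; the source does not say whether descendants of that element also match. The function names and the path `person.profile.skill` are hypothetical.

```python
import re
from typing import Iterable, List


def compile_path_pattern(pattern: str):
    """Translate a path expression into a compiled regular expression.

    Each '*' becomes "zero or more intermediate levels, then a dot".
    Patterns that start or end with '*' are not handled by this sketch.
    """
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"(?:\.[^.]+)*\.".join(parts))


def resolve_paths(pattern: str, known_paths: Iterable[str]) -> List[str]:
    """Return the known paths that the expression resolves to."""
    regex = compile_path_pattern(pattern)
    return [path for path in known_paths if regex.fullmatch(path)]


known = [
    "person.skill",
    "person.profile.skill",  # hypothetical path with one intermediate level
    "person.skills",
    "person.name",
    "person.skill.programming.languages",
]
resolved = resolve_paths("person*skill", known)
```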
  • repeating elements with different values e.g., <languages>
  • entries that are similar e.g., "bungee”
  • an index file to support word stems is built by performing word stemming algorithms for one or more elements in the XML document. Appropriate word stemming algorithms may be found in Martin
  • This index file stores indices that refer to records containing variations of keywords or phrases related to an element of the XML document (e.g., rates - rate).
  • One or more elements may or may not have a corresponding index to a word stem record.
  • a stemming algorithm can be applied to determine if a keyword or phrase occurs in the beginning, end, or in the middle of the actual word, thus providing very advanced query capabilities, e.g., select * where person*skill CONTAINS "Java" may match "javascript", "hotjava", "Java" and other variations of the word Java that occur in the <person> ... <skill> section.
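  • The beginning, end, or middle matching described above can be illustrated with a plain case-insensitive substring test. This is a simplification and not a stemming algorithm in the usual sense; `match_position` is a hypothetical helper, and "python" and "hotjavabeans" are made-up terms added for contrast.

```python
from typing import Optional


def match_position(needle: str, word: str) -> Optional[str]:
    """Report where needle occurs inside word, ignoring case.

    Returns "exact", "prefix", "suffix", "middle", or None if absent.
    """
    n, w = needle.lower(), word.lower()
    if n not in w:
        return None
    if w == n:
        return "exact"
    if w.startswith(n):
        return "prefix"
    if w.endswith(n):
        return "suffix"
    return "middle"


terms = ["javascript", "hotjava", "Java", "python", "hotjavabeans"]
positions = {term: match_position("Java", term) for term in terms}
```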
  • a keys index file is built for path manipulation by maintaining a stack of previously processed paths on one or more elements in the XML document.
  • the keys index file stores keys referring to resolved paths for a particular element within an XML document.
  • the keys for resolved paths of the element "person” are provided below.
  • the keys index file is useful to resolve query searches such as, for example, retrieving all elements that contain a certain word, retrieving all elements that begin with a certain word, retrieving all elements that are children of a certain element within a given path by using the key for the given path.
  • a soundex index file is built by performing soundex algorithms on one or more elements in the XML document. Appropriate soundex algorithms may be found in the Knuth reference.
  • a soundex index file stores indices to records having similar sounding words of one or more elements in an XML document. For example, the query command SELECT * WHERE person*name like "Javier" using a soundex index file may return records for a person having a name Xavier, Javier, Havier, and similar sounding variations.
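  • For reference, the classic American Soundex code can be computed as in the sketch below. Note one caveat: standard Soundex keeps the first letter, so Javier, Xavier, and Havier receive different codes (J160, X160, H160). Matching them as the text describes would require comparing only the numeric part, which is an assumption here and not something the source states.

```python
_CODES = {}
for _letters, _digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the four-character American Soundex code for a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")  # vowels and y map to "" and reset the run
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first.upper() + "".join(digits) + "000")[:4]


def sounds_alike_ignoring_initial(a: str, b: str) -> bool:
    """Assumed looser comparison: match on the numeric part only."""
    return soundex(a)[1:] == soundex(b)[1:]
```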
  • a synonym index file is built by performing a synonym algorithm on one or more elements in the XML document. Appropriate synonym algorithms may be found in the Knuth reference.
  • a synonym index file stores indices of records having synonyms to the element in the XML document.
  • an identification (ID) file is built to maintain the records of the index files.
  • the ID file is a component of an XML repository.
  • the ID file stores record IDs and all paths for that record ID.
  • the ID file represents a master record file for the complete set of elements in an XML document. If an index has been used to resolve a matching record, the complete record is loaded from the ID file and reconstructed before delivery to a client.
  • the ID file may include meta data information about an XML repository or database (e.g., number of fields/elements within a given document/record, etc.) This information is useful for optimized query searches and for building alternative query execution plans.
  • the ID file may also include an array index that lists a plurality of values for an element, which are represented with their literal array path. For example, person.skill.programming.languages has more than one entry and can be distinguished using an array index ".1" as shown below in example entries of an ID file for the exemplary XML document.
  • 0000000126^person.skill.programming.languages^person.skill.programming.languages.^java
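  • The idea of storing both a generic path and a literal array path per value can be sketched as follows. Several details are assumptions: the "^" field separator, 1-based array positions, and the entry layout (record ID, generic path, literal path, value) are a reading of the garbled example, and the second language value "python" is a made-up placeholder.

```python
from typing import Any, Iterator, List, Tuple


def flatten(node: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (literal_path, value) pairs for a nested dict/list record."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        # Repeating elements get a numeric position appended (assumed 1-based).
        for position, child in enumerate(node, start=1):
            yield from flatten(child, f"{prefix}.{position}")
    else:
        yield prefix, str(node)


def id_file_entries(record_id: str, record: Any) -> List[str]:
    """Build one entry per value: id ^ generic path ^ literal path ^ value."""
    entries = []
    for literal_path, value in flatten(record):
        generic = ".".join(
            part for part in literal_path.split(".") if not part.isdigit()
        )
        entries.append("^".join((record_id, generic, literal_path, value)))
    return entries


record = {
    "person": {
        "email": "gill@killroy.com",
        "skill": {"programming": {"languages": ["java", "python"]}},
    }
}
entries = id_file_entries("0000000126", record)
```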
  • index file representations can be used for storing and searching of other types of structured or semi-structured data such as HTML data. Furthermore, index files can be compressed for memory optimization.
  • FIG. 3 is a flow diagram of a method for querying semi-structured data consistent with the invention.
  • a query request is received.
  • the query request is provided in the form of SQL type statements.
  • a query request may be provided in other types of query command formats.
  • the request may include a "dynamic query request” or a "server-side query request.” Exemplary dynamic and server-side query requests are illustrated below for querying semi-structured data such as XML data.
  • the requests include keywords such as "SELECT" and "WHERE.”
  • the "SELECT" keyword sets forth the substructures or path for the query search.
  • the "WHERE” keyword sets forth the specific text string(s) or data for the desired search.
  • WHERE person*skill contains "Java" </xsql:query>
  • unique extensions are used such as, for example, " * " and " . " in the query command format. Other symbols may also be used for extensions.
  • the extensions indicate substructures/semi-structures for the specific type of information requested, e.g., SELECT person*skill WHERE person*skill contains "Java", in XML data.
  • the extensions allow for specific queries to any number of paths within XML data using any number of types of query command statements.
  • the query request is parsed into query elements based on keywords or clauses in the query request such as, for example, based on the "SELECT" clause and the "WHERE” clause.
  • the elements related to the SELECT clause are parsed into their components, e.g., (person*skill, person.name).
  • the WHERE clause is parsed into components based on expressions and operators within the expressions.
  • An expression includes operators such as, for example,
  • “Java”) includes the operator CONTAINS having a LHS component
  • a WHERE clause may be associated with any number of expressions.
  • An example model of a WHERE clause having a number of expressions is illustrated below.
  • i. Expression 1: LHS operator RHS (e.g., person*skill contains "Java")
       AND
  • ii. Expression 2: LHS operator RHS
       OR
  • iii. Expression 3: LHS operator RHS
  • As shown in the above example, the WHERE clause is associated with three expressions (Expressions 1-3) connected by AND and OR logical connectors for specific search requirements.
  • At stage 306, a check is made to determine if there is a "WHERE" clause. If there is no "WHERE" clause, at stage 307, all the record IDs associated with the "SELECT" clause are retrieved from a master ID file. Although not shown, from stage 307, the method can continue to stage 310. At stage 308, if there is a WHERE clause, the WHERE expressions are evaluated in a manner described below to obtain the desired search results. The search results include all or parts of the records that match the request. At stage 310, the results are loaded and reconstructed for the query request as described below with regard to FIG. 4C. At stage 312, the results are returned or delivered to the requesting client(s).
  • FIG. 4A is a flow diagram of a method consistent with the invention for evaluating a WHERE clause at stage 308 in FIG. 3.
  • the various WHERE clause expressions are split into groups at each "OR" connector. For example, in the above example regarding FIG. 3, expressions (i) and (ii) would be grouped together while expression (iii) would be in a separate group.
  • each expression within a group, where the expressions are connected by "AND," is evaluated as described below.
  • a union of record IDs is performed if multiple groups were separated by "OR” to obtain a complete set of record IDs.
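  • The grouping logic above amounts to intersecting record IDs within each AND group and taking the union across the groups separated by OR. A minimal sketch, assuming each expression can be looked up to a set of record IDs; `evaluate_where`, the expression labels, and the numeric IDs are hypothetical.

```python
from typing import Callable, Hashable, List, Optional, Set


def evaluate_where(
    groups: List[List[Hashable]],
    lookup: Callable[[Hashable], Set[int]],
) -> Set[int]:
    """Evaluate OR-separated groups of AND-connected expressions.

    groups: one inner list per group; expressions inside are ANDed.
    lookup: returns the record IDs that satisfy a single expression.
    """
    result: Set[int] = set()
    for group in groups:
        matched: Optional[Set[int]] = None
        for expression in group:
            ids = lookup(expression)
            matched = ids if matched is None else matched & ids
            if not matched:
                break  # an empty intersection cannot recover
        result |= matched or set()
    return result


# Hypothetical per-expression results, keyed by expression label.
ids_by_expression = {"i": {1, 2, 3}, "ii": {2, 3}, "iii": {9}}
matches = evaluate_where([["i", "ii"], ["iii"]], ids_by_expression.__getitem__)
```

  • With expressions (i) and (ii) in one group and (iii) in another, the result is the intersection of the first two sets combined with the third.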
  • each expression is tokenized or parsed into components such as {LHS (left-hand side), operator (e.g., LIKE, CONTAINS, EQ, BETWEEN, etc.), RHS (right-hand side)}.
  • the various paths are resolved for the LHS (e.g., person*skill can be based on the input data).
  • an index/search method is determined based on the type of operator. For example, if the operator is "LIKE,” a soundex index can be used to perform a soundex method consistent with the invention. If the operator is "EQ,” an index by keyword or phrase can be used to perform a keyword search. Any combination of indices may be used for the query request as described above consistent with the invention.
  • the record IDs that match one of the resolved paths and satisfy the current expression in the WHERE clause are loaded.
  • if the search fails for the expression, the method exits the expression.
  • all record IDs that do not exist in the current expression are removed.
  • FIG. 4C is a flow diagram of a method consistent with the invention for loading and reconstructing results of FIG. 3.
  • the following method is iterated for each record ID in the matching result set.
  • all the fields/ elements associated with the record ID e.g., from an
  • the SELECT clause is resolved to determine the substructures that need to be projected in the returned result; e.g., "select person*id, person.name" would translate into:
  • a projection is performed from the set of elements and fragments obtained at stage 422 and the results are filtered that match the above expressions.
  • the record IDs that match the expressions can be stored in another index file for later use.
  • FIG. 5A is a flow diagram of a method for a query engine to process a query request.
  • a request is received from a client application.
  • the type of the request is determined, i.e., is it a query request or an index request (insert, update, or delete). Exemplary index requests for an insert, update, and delete are illustrated below.
  • a search is made through all index files and associated index extent files to process the query request.
  • the results are returned and delivered to the client.
  • a query request is given priority over a modification request and processed before a modification request.
  • the query requests can be processed using the methods described above with respect to FIGS. 3, 4A, 4B, and 4C.
  • the query engine queues the modification requests in modification request queue for an index engine.
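  • The routing performed by the query engine can be sketched as below: queries are answered immediately, while insert, update, and delete requests are queued for the index engine. The request dictionary shape, `handle_request`, and the `run_query` callback are hypothetical; the source does not define a request format.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Optional

Request = Dict[str, Any]

# Stand-in for the modification request queue read by the index engine.
modification_queue: Deque[Request] = deque()


def handle_request(
    request: Request, run_query: Callable[[Request], List[str]]
) -> Optional[List[str]]:
    """Answer query requests directly; queue modification requests."""
    kind = request["type"]
    if kind == "query":
        return run_query(request)
    if kind in ("insert", "update", "delete"):
        modification_queue.append(request)
        return None
    raise ValueError(f"unknown request type: {kind!r}")


def fake_search(request: Request) -> List[str]:
    """Placeholder for the real index search."""
    return ["0000000126"]


query_result = handle_request({"type": "query"}, fake_search)
queued_result = handle_request({"type": "insert"}, fake_search)
```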
  • FIG. 5B is a flow diagram of a method for an index engine to process queued modification requests.
  • the method is used to maintain and optimize the indices while there are query requests to process based on defined rules and conditions in a system.
  • new rules can be added and the order changed.
  • the following method is an exemplary embodiment to perform rules based processing of queued modification requests.
  • Exemplary rules are defined at stages 526, 530, 534, 538, and 542 (i.e., the conditions to perform those stages).
  • a first rule is evaluated to determine if there are modification requests present in the modification request queue and the size of the index files is less than a small merge threshold (i.e., the number of index files that is considered a "small number" of index files).
  • the modification request is processed and a set of new index files are generated to represent the modification request. Once the request has been processed, the method returns to stage 524.
  • a second rule is evaluated to determine if there are no pending modification requests or the size of the index files is greater than the small merge threshold.
  • At stage 532, if the second rule is met, a small merge is performed on the index files resulting from stage 528. The small index files are merged into larger index files to improve query performance. The method continues back to stage 524.
  • At stage 534, if the second rule is not met, a third rule is evaluated to determine if the number of index files is greater than a large merge threshold limit.
  • stage 530 if the third rule is met, the small index files generated by the small merge at stage 532 are merged into larger index files. In one embodiment, the large merge is performed on hard disk. The method continues back to stage 524.
  • a fourth rule is evaluated to determine if the small merge, large merge, and pending requests are completed.
  • unused memory space is reclaimed by deleting the index files that were merged into larger index files.
  • a fifth rule is evaluated to determine if optimization of the indices is required.
  • an optimization process is performed by building secondary indices (e.g., bitmap index, segmented index).
  • If the fifth rule is not met, the system is quiescent and remains in silent mode, performing no activity. The method then continues to stage 524. It should be noted that additional rules may be defined and existing rules modified to implement the method.
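The rule-based loop of FIG. 5B can be sketched as an ordered list of (name, condition, action) entries, which also shows why rules can be added or reordered easily. This is a sketch under stated assumptions: the threshold values, the state fields, and the exact conditions (for example, using "greater than or equal" for the small merge and requiring at least one small file) are illustrative choices, not taken from the source.

```python
SMALL_MERGE_THRESHOLD = 3   # assumed value
LARGE_MERGE_THRESHOLD = 2   # assumed value

def process_request(s):
    s["queue"].pop(0)
    s["small_files"] += 1            # each request yields new small index files

def small_merge(s):
    s["stale_files"] += s["small_files"]
    s["small_files"] = 0
    s["merged_files"] += 1

def large_merge(s):
    s["stale_files"] += s["merged_files"]
    s["merged_files"] = 1

def reclaim(s):
    s["stale_files"] = 0             # delete files that were merged elsewhere

def optimize(s):
    s["needs_optimization"] = False  # e.g. build secondary indices

# Rules are evaluated in order; the first one whose condition holds fires.
RULES = [
    ("process_request",
     lambda s: s["queue"] and s["small_files"] < SMALL_MERGE_THRESHOLD,
     process_request),
    ("small_merge",
     lambda s: s["small_files"] > 0
     and (not s["queue"] or s["small_files"] >= SMALL_MERGE_THRESHOLD),
     small_merge),
    ("large_merge",
     lambda s: s["merged_files"] > LARGE_MERGE_THRESHOLD,
     large_merge),
    ("reclaim",
     lambda s: not s["queue"] and s["small_files"] == 0 and s["stale_files"] > 0,
     reclaim),
    ("optimize", lambda s: s["needs_optimization"], optimize),
]

def step(state, rules=RULES):
    """Apply the first matching rule and return its name, or 'idle' if none match."""
    for name, condition, action in rules:
        if condition(state):
            action(state)
            return name
    return "idle"

state = {"queue": ["a", "b", "c", "d"], "small_files": 0, "merged_files": 0,
         "stale_files": 0, "needs_optimization": True}
trace = []
while True:
    trace.append(step(state))
    if trace[-1] == "idle":
        break
```

Because the rules are plain data, inserting a new rule or changing the evaluation order only requires editing the `RULES` list.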
  • FIG. 6 is an exemplary XML event trigger.
  • An XML event trigger is a mechanism to notify interested parties if a specific change occurs to an XML document.
  • the XML event trigger allows for notifications regarding specific changes to a substructure in an XML document. The following define nomenclature for the XML event trigger.
  • the XML event trigger definition includes a "TRIGGER” section 602, a "PERFORM” section 604, and "NOTIFY” sections 606 and 608.
  • the XML event trigger, consistent with the present invention, follows certain syntax rules. For example, commas are used to separate key-value attribute specifications in the various sections. In addition, the ordering of the keywords and sections, as described below, is required.
  • FIG. 6 is reproduced below.
  • TRIGGER section 602 the "TRIGGER” keyword is reserved to specify an XML event trigger.
  • the keywords "AFTER INSERT” indicate the type of modification, i.e., INSERT, to evaluate the XML event trigger. Specifically, in the above section, the XML trigger is evaluated after an INSERT modification.
  • Other keywords may be used such as, for example, "AFTER UPDATE," "AFTER DELETE," "BEFORE INSERT," "BEFORE UPDATE," and "BEFORE DELETE," to specify other types of conditions for evaluating the XML event trigger.
  • the trigger name is “at_risk_sales_over_50K.”
  • the trigger name is globally unique for all triggers used in a system.
  • the "WHEN” keyword specifies a clause that sets forth the conditions to invoke the event trigger.
  • the event trigger is invoked WHEN the substructure "service_request.sales_manager" EQ "john" AND the substructure "service_request.company_id" EQ "ZIPPY" are both true.
  • When these conditions are met, the operations specified in the PERFORM section 604 and NOTIFY sections 606 and 608 are performed.
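As an illustration of how such a WHEN clause might be evaluated against an inserted record, the sketch below resolves dotted paths through nested dictionaries and ANDs the EQ conditions together. The helper names `resolve` and `when_matches` and the dictionary representation of the XML record are assumptions made for this example.

```python
def resolve(record, path):
    """Walk a dotted path such as 'service_request.company_id' through nested dicts."""
    node = record
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def when_matches(record, conditions):
    """True when every (path, expected) EQ condition holds; conditions are ANDed."""
    return all(resolve(record, path) == expected for path, expected in conditions)

WHEN = [("service_request.sales_manager", "john"),
        ("service_request.company_id", "ZIPPY")]

inserted = {"service_request": {"sales_manager": "john", "company_id": "ZIPPY"}}
other = {"service_request": {"sales_manager": "mary", "company_id": "ZIPPY"}}
```

Here `when_matches(inserted, WHEN)` holds, so the PERFORM and NOTIFY sections would run, while `other` would not invoke the trigger.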
  • PERFORM section 604 of FIG. 6 is reproduced below.
  • PERFORM section 604 is optional for an XML event trigger. If a PERFORM section 604 is present, the PERFORM section must occur after the WHEN clause and before any NOTIFY clauses.
  • the "PERFORM" executes an xsql file "get_open_bids_over_one_million.xsql.” Executing an xsql file performs a query that returns a set of XML records, which are sent to parties specified in the NOTIFY sections.
  • the PERFORM transforms results from above into a default data structure as "d.result.1" where .1 indicates the results of the first perform that occurs in the XML trigger event and .2 would be the results of the second perform and so on.
  • the default data structure for a PERFORM can be an XML structure or similar semi-structured data, such as HTML, used in the NOTIFY sections. If there are no results, subsequent PERFORM or NOTIFY sections that attempt to reference this result will be skipped. For example, if AS is used to set the results to d.pr2 and a subsequent NOTIFY specifies d.pr.2 as a data parameter, that NOTIFY would be skipped. If AS is specified, the results will not be the default input to the NOTIFY clauses, as specified below. Attributes that are not reserved words in the PERFORM section 604 are passed as query parameters to a query engine.
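The result-naming behavior of PERFORM sections can be sketched as follows. This is an illustrative model only: `run_performs`, `fake_query`, and the file names `a.xsql`, `b.xsql`, and `c.xsql` are hypothetical, and storing nothing for an empty result is one assumed way to make later references skip.

```python
def run_performs(performs, run_query):
    """performs: list of (xsql_file, alias_or_None) in trigger order.

    Returns (event_data, last_default), where last_default names the most
    recent result that was not redirected with AS.
    """
    event_data = {}
    last_default = None
    for position, (xsql_file, alias) in enumerate(performs, start=1):
        records = run_query(xsql_file)
        if not records:
            continue  # nothing stored, so later references to it are skipped
        key = alias if alias else f"d.result.{position}"
        event_data[key] = records
        if alias is None:
            last_default = key
    return event_data, last_default

def fake_query(xsql_file):
    # Stand-in for executing an xsql file against the query engine.
    canned = {"a.xsql": [{"id": 1}], "b.xsql": [], "c.xsql": [{"id": 3}]}
    return canned[xsql_file]

event_data, last_default = run_performs(
    [("a.xsql", None), ("b.xsql", None), ("c.xsql", "d.pr2")], fake_query)
```

The first PERFORM lands in `d.result.1`, the second returns no records and stores nothing, and the third is redirected to `d.pr2`, so it does not become the default input for NOTIFY.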
  • NOTIFY sections 606 and 608 of FIG. 6 are reproduced below.
  • NOTIFY sections must occur after all PERFORM sections. NOTIFY sections dictate the manner in which to transmit a notice based on an XML event trigger.
  • an HTTP post or an SMTP email is used to provide the notice.
  • the "type" attribute sets the notice type. The number and use of additional attributes varies depending on the type of delivery desired. For example, the HTTP post supports a content-type attribute which will be sent as part of the header for the post.
  • the input data for the NOTIFY sections is specified using the keyword "DATA.” If there is no DATA attribute set, it will default to the output of the last perform that was not redirected using the AS clause or possibly input data if there have been no un-redirected PERFORM clauses executed. If the default or specified data parameter was the result of a previous perform and that perform did not find any records in its nested queries, this perform will be skipped.
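One way to read the defaulting rules for the NOTIFY input is sketched below. The function name, the `SKIP` sentinel, and the representation of results as a dictionary of record lists are assumptions; the fallback to the input data reflects the source's "possibly input data" wording and is not guaranteed behavior.

```python
SKIP = object()  # sentinel meaning "do not send this notification"

def resolve_notify_data(data_attr, results, last_unredirected, input_data):
    """Pick the payload for one NOTIFY section.

    results maps result names to record lists; a PERFORM that found
    nothing is represented by an empty list or a missing name.
    """
    name = data_attr or last_unredirected
    if name is None:
        # No DATA attribute and no un-redirected PERFORM ran.
        return input_data
    records = results.get(name)
    return records if records else SKIP

results = {"d.result.1": [{"bid": 1}], "d.pr2": []}
default_payload = resolve_notify_data(None, results, "d.result.1", {"x": 1})
skipped = resolve_notify_data("d.pr2", results, "d.result.1", {"x": 1})
fallback = resolve_notify_data(None, {}, None, {"x": 1})
```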
  • attributes that are not considered reserved words for the selected type of handler agent will be passed as query parameters wherever possible. Common input attributes are provided below.
  • TYPE: This field sets forth the type of delivery agent chosen for the notification.
  • DATA: This field sets forth the main data structure to be sent in the notification.
  • XSL: This field sets forth the XSL template with which the data structure is transformed prior to sending it in the HTTP post.
  • the XSL sheet may produce HTML, XML or other data types as allowed by the processor.
  • This field sets forth the port numbers for sending the post.
  • proxy: This field sets forth the proxy server used to proxy the request.
  • content-type: This field sets forth the typical content type for delivery, e.g., text/xml.
  • sendto: This field sets forth the email address of the recipient of the notification.
  • from: This field sets forth the email address of the sender.
  • account: This field sets forth the user id to use to access the smtp server.
  • server: This field sets forth the name or IP address of the smtp server to send the email to.
  • HTTP posts can deliver results back on their own and it is possible to add those results to the internal event data model using the AS keyword as detailed below.
  • AS keyword for a HTTP post type of notification.
  • the AS clause places the result in d.NRIX as an element that can be drawn off of by name by subsequent notifications.
  • NOTIFY operations can be considered un-ordered in that they are dumped into a queue and, as the remote resources necessary to deliver the notification become available, the NOTIFY operations are delivered.
  • When the AS clause is introduced in this context, it becomes necessary for the event notification engine to stop what it is doing, perform the post, and receive the results before continuing to process the subsequent NOTIFY operations.
  • the AS feature can be very powerful since it effectively enables the XML event trigger to gather data and interact with distributed resources before sending the final notification to the recipient. That means that if some of the needed data is present on, for example, a server which has XSQL installed, it becomes relatively easy to fetch that data as part of processing the XML event trigger. If the content-type of the data returned from the notify action is text/xml or xml/xml, then it will be parsed and stored in the AS parameter in a form that allows access to subcomponents.
  • FIG. 7A is a flow diagram of a method for processing an XML event trigger.
  • a trigger definition is parsed into components as explained in the TRIGGER section above.
  • the various paths of the "WHEN" clause are determined. In the exemplary XML event trigger described above, the paths are "service_request.sales_manager” and “service_request.company_id.”
  • a segmented index entry for determined paths is created.
  • the segmented index refers to the path, the word or phrase being referenced at that path, and the ID of the trigger that contains the path, e.g., the path definition in the trigger service_request.sales_manager EQ "John" would translate into the index entry of
  • FIG. 7B is a flow diagram of a method for executing an XML event trigger.
  • a search is performed to determine whether there are any triggers associated with the above determined paths. If there are no triggers, the method ends. If a trigger matching a determined path is found, the trigger definition is loaded at stage 712. At stage 714, a determination is made whether the input data structure of the loaded trigger definition matches the "WHEN" clause.
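The two methods of FIGS. 7A and 7B can be sketched together: triggers are indexed by (path, value), and an incoming record is matched against that index before the full WHEN clause is confirmed. The exact index entry format is not shown in the text above, so the `(path, value) -> trigger IDs` layout here is an assumption, as are the helper names and the case-insensitive comparison.

```python
from collections import defaultdict

def build_trigger_index(triggers):
    """Map (path, lower-cased value) to the IDs of triggers whose WHEN clause uses it."""
    index = defaultdict(set)
    for trigger_id, conditions in triggers.items():
        for path, value in conditions:
            index[(path, value.lower())].add(trigger_id)
    return index

def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, str(value)

def fired_triggers(record, triggers, index):
    """Find candidate triggers through the index, then confirm the whole WHEN clause."""
    leaves = dict(flatten(record))
    candidates = set()
    for path, value in leaves.items():
        candidates |= index.get((path, value.lower()), set())
    return sorted(
        t for t in candidates
        if all(leaves.get(p, "").lower() == v.lower() for p, v in triggers[t])
    )

triggers = {
    "at_risk_sales_over_50K": [
        ("service_request.sales_manager", "John"),
        ("service_request.company_id", "ZIPPY"),
    ]
}
index = build_trigger_index(triggers)
match = {"service_request": {"sales_manager": "john", "company_id": "ZIPPY"}}
partial = {"service_request": {"sales_manager": "john", "company_id": "ACME"}}
```

The `partial` record hits the index on one path but fails the full WHEN check, so no trigger fires for it.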
  • FIG. 8 is a flow diagram of a method for merging semi-structured data.
  • the method allows a user to merge XML records into a separate, merged XML record for a high-speed query search.
  • the method provides a declarative approach to specify a relational merge or join of XML records.
  • the following is an exemplary query command to specify a relational merge or join to assist in explaining the method.
  • the above query command is to retrieve/create a birthday wishlist for a person called "joe" by merging information from three different XML document types (person, book, boot). Specifically, the method retrieves all person records where the person's name section contains "joe". A birthday wish list section is added to the retrieved person records.
  • the birthday wishlist can be created by matching the person's interests against any section called “category” defined in any other documents in an XML repository. Alternatively, the birthday wishlist can contain all data found by matching the person's interests to any section in any document called "class,” where the manufacturer is "redwing", having attributes type of "hiking", size of "10" and class of "$A”.
  • the command may use the following syntax rules.
  • the identifier "//” is used as a global or wildcard identifier.
  • the identifier " is used to indicate the ending of one element and the beginning of another element.
  • the parentheses after the keywords "OPTIONAL", "MANDATORY", and "MERGE" are mandatory.
  • the use of "$A" or any variable prefixed with "$" in the context of a WHERE clause pulls that value from a previously resolved master record and substitutes it into the WHERE clause. In one embodiment, only scalar items or simple arrays can be specified.
  • each value from the array will be interpolated.
  • $A is bound to the previously defined Alias A and indicates that the result of the nested SELECT will be merged into and will become part of the record originally named by the ALIAS A.
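The substitution of "$A" values into a WHERE clause can be sketched as expanding one term into one term per value of the master record's field. The function name `interpolate` and the tuple representation of a WHERE term are hypothetical.

```python
def interpolate(path, op, master, field):
    """Expand one WHERE term that references $A into one term per master value."""
    values = master.get(field, [])
    if not isinstance(values, list):
        values = [values]  # a scalar behaves like a one-element array
    return [(path, op, value) for value in values]

master = {"name": "joe", "interest": ["web script", "Java", "shoe"]}
terms = interpolate("//category", "contains", master, "interest")
scalar_terms = interpolate("//class", "EQ", {"interest": "shoe"}, "interest")
```

With an array-valued field, each value is interpolated in turn, so a single term such as `//category contains $A//category` becomes three concrete search terms for this record.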
  • the following are sample data records to explain a relational join method. If more than one record matches the join criteria, the matching records will be returned as array elements in the output/container object (temporary storage data structure). This ability to create an array of return values by merging data from multiple records is useful for joining or merging hierarchical data structures.
  • the query request is parsed to generate a standard parse tree.
  • the different types of discrete records are identified by analyzing the parse tree.
  • the type of records is determined by Aliases.
  • a record Type 1 is identified by the alias A as a master record (select person * ALIAS A where person//name// CONTAINS "JOE").
  • a record Type 2 is identified by the alias B as a merge type record (select * AS $A//birthday/wishlist ALIAS B WHERE //category contains $A//category).
  • a record Type 3 is identified by the alias C as a merge type record (select * AS $A//birthday/wishlist ALIAS C ).
  • the relations and merge criteria are identified between various requested records.
  • the mapping field for Record Type 1 is person//interest.
  • the criteria for mapping Record Type 1 to Record Type 2 is //category contains person//interest.
  • the criteria for mapping Record Type 1 to Record Type 3 is //class EQ //interest.
  • the primary record type is identified.
  • the primary record type is the base record type into which the secondary records are merged to generate the final output record.
  • the primary record type is identified based on the cardinality of data for each record type. For example, the record type with the lowest cardinality is chosen as the primary record type.
  • the primary record type will be Record Type 1 because it has the lowest cardinality, i.e., 10.
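Choosing the primary record type by lowest cardinality reduces to a minimum over per-type counts. In the sketch below only the cardinality of 10 for Record Type 1 comes from the text; the other two counts and the function name are made up for illustration.

```python
def choose_primary(cardinalities):
    """Return the record type with the fewest matching records."""
    return min(cardinalities, key=cardinalities.get)

# Only the value 10 is from the description; the others are hypothetical.
cardinalities = {"Record Type 1": 10, "Record Type 2": 4000, "Record Type 3": 250}
primary = choose_primary(cardinalities)
```

Starting from the smallest set keeps the number of merge searches generated in the later stages low.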
  • the primary records are loaded. In one embodiment, all records are searched for primary record types that match the specified WHERE clause (e.g., SELECT person * WHERE person//name// contains "joe"). A primary result set is built out of these records. These records can be stored as the result set in a temporary file in memory.
  • merge specifications for a merge are built by evaluating each primary record at stage 808.
  • a merge search is built based on the merge criteria defined at stages 802 and 804.
  • the values for each record are extracted into a match set based on the mapping built at stage 804.
  • for record #R1, all possible values are extracted from the field "person/interest" ("web script", "Java", "shoe"). Each value is added to the merge search.
  • the result is a complete set of discrete search values which represents a logical join of those data values across the entire set, e.g., {'person/interest' : 'recs' : {'web_script' : [#R1], 'Java' : [#R1], 'shoe' : ['#R1']}, match : [{'op' : 'contains', 'path' : '//class'}, {'op' : 'contains', 'path' : '//category'}]}.
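Building the merge search from the primary records can be sketched as grouping record IDs by each extracted value. The layout below loosely follows the example structure in the text, but the key names `field`, `recs`, and `match`, the function name, and the second record `#R2` are assumptions added for illustration.

```python
def build_merge_search(primary_records, field, match):
    """Group primary record IDs by each value found in the mapping field."""
    recs = {}
    for rec_id, record in primary_records.items():
        for value in record.get(field, []):
            recs.setdefault(value, []).append(rec_id)
    return {"field": field, "recs": recs, "match": match}

primary_records = {
    "#R1": {"name": "joe", "interest": ["web script", "Java", "shoe"]},
    "#R2": {"name": "joe b", "interest": ["Java"]},  # hypothetical second record
}
match = [{"op": "contains", "path": "//class"},
         {"op": "contains", "path": "//category"}]
spec = build_merge_search(primary_records, "interest", match)
```

Each distinct value appears once in `spec["recs"]`, so a value shared by several primary records (here "Java") needs only one secondary search.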
  • a super set is built of all secondary records that match any of the merge criteria for primary records by evaluating each of the merge search specifications built at stage 806. For example, record #R1 - it translates into the following specific searches:
  • Each secondary record is stored in a set of secondary result set files as temp files on disk or in memory.
  • the secondary result sets are stored by search phrase.
  • a query optimization can be performed to build the secondary result sets by evaluating the other WHERE clause criteria (e.g. type EQ "hiking” and size EQ "10" and manufact EQ "redwing") and determining whether that returns fewer records.
  • the objective is to build the smallest possible secondary result set, since the cost in execution time and memory of merging the primary and secondary result set can be very high.
  • the primary and secondary records are merged. Each primary record at stage 806 is evaluated.
  • the merge criteria are extracted and identified (e.g., //category and //class).
  • the candidate values are extracted (e.g., web script, Java, shoe).
  • the intermediate structure/place in the output record is then created to store the merged records (e.g. person/birthday/wishlist).
  • Auxiliary records which match the merge criteria and the candidate values identified above are located and retrieved from the secondary result set. These records are evaluated against the WHERE clause (if any) associated with the auxiliary record (e.g., where type EQ "hiking" AND size EQ "10" AND manufact EQ "redwing").
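The final merge step can be sketched as follows: for each primary record, auxiliary records are looked up by candidate value, filtered by any WHERE clause tied to their document type, and stored under person/birthday/wishlist. The record contents, the `doc` field used to tell books from boots, and the function name are all hypothetical.

```python
def merge_records(primary_records, secondary_by_value, filters):
    """Attach matching auxiliary records to each primary record's wishlist."""
    merged = []
    for record in primary_records:
        wishlist = []
        for value in record.get("interest", []):
            for aux in secondary_by_value.get(value, []):
                keep = filters.get(aux["doc"], lambda r: True)
                if keep(aux) and aux not in wishlist:
                    wishlist.append(aux)
        out = dict(record)  # leave the primary record itself untouched
        out["birthday"] = {"wishlist": wishlist}
        merged.append(out)
    return merged

joe = {"name": "joe", "interest": ["web script", "Java", "shoe"]}
secondary = {
    "Java": [{"doc": "book", "title": "Java basics", "category": "Java"}],
    "shoe": [
        {"doc": "boot", "class": "shoe", "type": "hiking", "size": "10",
         "manufact": "redwing"},
        {"doc": "boot", "class": "shoe", "type": "dress", "size": "9",
         "manufact": "acme"},
    ],
}
# WHERE clause associated with the boot records only.
filters = {
    "boot": lambda r: (r["type"], r["size"], r["manufact"])
    == ("hiking", "10", "redwing")
}
result = merge_records([joe], secondary, filters)
```

Multiple matches end up as array elements in the wishlist, which mirrors the array-of-return-values behavior described for the relational join.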
  • the foregoing description is based on XML data; however, other types of semi-structured data such as HTML may be used to implement the invention.
  • the foregoing description is based on a client-server architecture, but other types of architectures may be employed such as a peer-to-peer architecture consistent with the invention.
  • Although the described implementations include software, the invention may be implemented as a combination of hardware and software or in hardware alone.
  • Although aspects of the invention are described as being stored in memory, other types of computer-readable media may be used, such as secondary storage devices, like hard disks, floppy disks, or compact disc ROM (CD-ROM); a carrier wave from the Internet; or other forms of RAM or ROM. The scope of the invention is therefore defined by the claims and their equivalents.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Implementations are disclosed for indexing and searching semi-structured data. Indexing and searching can be performed for an item associated with a path within semi-structured data (120), such as XML (Extensible Markup Language) data. Clients (104 and 106) access a search server (110) through web browsers and the Internet (102). The search server is connected to an index server and to a repository (102).
PCT/US2002/036240 2001-11-13 2002-11-13 Procede et systeme d'indexation et de recherche de donnees semi-structurees WO2003042873A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US99385401A 2001-11-13 2001-11-13
US09/993,854 2001-11-13

Publications (1)

Publication Number Publication Date
WO2003042873A1 true WO2003042873A1 (fr) 2003-05-22

Family

ID=25540003

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/036240 WO2003042873A1 (fr) 2001-11-13 2002-11-13 Procede et systeme d'indexation et de recherche de donnees semi-structurees

Country Status (1)

Country Link
WO (1) WO2003042873A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6076087A (en) * 1997-11-26 2000-06-13 At&T Corp Query evaluation on distributed semi-structured data
US6240407B1 (en) * 1998-04-29 2001-05-29 International Business Machines Corp. Method and apparatus for creating an index in a database system
US6282537B1 (en) * 1996-05-30 2001-08-28 Massachusetts Institute Of Technology Query and retrieving semi-structured data from heterogeneous sources by translating structured queries

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BADR Y. ET AL.: "Transformation rules from semi-structured XML documents to database model", ACS/IEEE INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS, June 2001 (2001-06-01), pages 181 - 184, XP010551208 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100561474C (zh) * 2006-01-17 2009-11-18 鸿富锦精密工业(深圳)有限公司 远程多点文件索引同步系统及方法
US7792821B2 (en) 2006-06-29 2010-09-07 Microsoft Corporation Presentation of structured search results
WO2014123529A1 (fr) * 2013-02-07 2014-08-14 Hewlett-Packard Development Company, L.P. Formatage de données semi-structurées dans une base de données
CN104969221A (zh) * 2013-02-07 2015-10-07 惠普发展公司,有限责任合伙企业 格式化数据库中的半结构化数据
EP2954433A4 (fr) * 2013-02-07 2016-08-31 Hewlett Packard Entpr Dev Lp Formatage de données semi-structurées dans une base de données
CN104969221B (zh) * 2013-02-07 2018-05-11 慧与发展有限责任合伙企业 格式化数据库中的半结构化数据
US11126656B2 (en) 2013-02-07 2021-09-21 Micro Focus Llc Formatting semi-structured data in a database
WO2015052584A1 (fr) * 2013-10-10 2015-04-16 Calgary Scientific Inc. Procédés et systèmes de recherche intelligente dans des archives dans de multiples systèmes de dépôt

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP