WO2006059425A1 - Database construction apparatus, database retrieval apparatus, database apparatus, database construction method, and database retrieval method - Google Patents
- Publication number
- WO2006059425A1 (PCT/JP2005/017696)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- name
- appearance information
- ancestor path
- ancestor
- attribute
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/83—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/123—Storage facilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
Definitions
- Database construction apparatus, database retrieval apparatus, database apparatus, database construction method, and database retrieval method
- The present invention relates to a database apparatus that manages structured documents having a logical structure, such as XML documents. In particular, it relates to a database construction apparatus that accumulates and manages a large number of structured documents, and to a database search device that efficiently searches the structured documents stored in the database.
- Japanese Patent Application Laid-Open No. 2002-202973 discloses a structured document management apparatus that registers structured documents based on their logical structure and performs full-text search with the logical structure specified.
- FIG. 33 is a configuration diagram of a conventional structured document management apparatus.
- a structured document input unit 2402 inputs a structured document to be registered.
- the structure analysis unit 2407 analyzes the input structured document into a tree structure.
- the structure information creation unit 2408 allocates a name ID to the tag name (element name) of each element and stores it in the name ID table storage unit 2418 in the data storage unit 2406.
- a path name ID is assigned to a path name of each element, that is, a character string in which tag names are described in order from the highest layer, and is stored in the path name index storage unit 2416.
- A path hierarchy ID is assigned to the character string that describes the path hierarchy of each element, that is, the order of appearance at each level of the path name, and is stored in the path hierarchy index storage unit 2417.
- The order of appearance at each level of the path name indicates the position at which the element appears among the elements that have the same tag name and the same parent element.
- A code that uniquely represents the search unit (hereinafter referred to as “search unit identifier”) is assigned to each element entity.
- Information keyed by the search unit identifier is stored in the element management table storage unit 2415.
- FIG. 34 is a diagram showing an example of an element management table in a conventional structured document management apparatus.
- The element management table 2501 consists of records, each a set of a document number 2503, a path name ID 2504, a path hierarchy ID 2505, and a name ID 2506, keyed by the search unit identifier 2502.
- The character string index creation unit 2409 pre-processes the character string that forms the content of each element entity and extracts character chains of a predetermined number of characters. For each character chain, it stores the corresponding search unit identifier and a number indicating where the chain appears in the element content (hereinafter referred to as “character position number”) in the character string index storage unit 2419.
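The conventional character string index described above can be pictured as a map from each character chain to the places where it occurs. The following is a minimal sketch, assuming two-character chains and 1-based character position numbers; the function name `build_char_chain_index` and the sample texts are hypothetical and not taken from the cited publication.

```python
from collections import defaultdict

def build_char_chain_index(elements, n=2):
    """Map each n-character chain to (search_unit_id, character_position_number) pairs.

    `elements` maps a search unit identifier to the text content of that element.
    Positions are 1-based: a chain at the very start of the element has position 1.
    """
    index = defaultdict(list)
    for unit_id, text in elements.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].append((unit_id, i + 1))
    return dict(index)

# Hypothetical element contents keyed by search unit identifier.
index = build_char_chain_index({1: "structured document", 8: "a structured text"})
print(index["st"])
```

Each entry answers "which search unit contains this chain, and where", which is the information the later structure matching step filters.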
- Figure 35A shows an example of a structured document.
- FIG. 35B is a diagram showing an example of a character string index in the conventional structured document management apparatus. In FIG. 35B, the record 2606 of the character string index 2602 indicates that the character chain 2603 “structure” is contained in the character string of the element whose search unit identifier 2604 is “1”, and that its character position number 2605 is “1”. In other words, the chain appears at the first character from the beginning of the element.
- FIG. 36A is a diagram showing an example of setting search conditions.
- The search condition 2701 specifying the structure means “documents in which the element whose path name is ‘/paper/bibliography/title’ contains the character string ‘structured’”.
- the search condition analysis unit 2410 refers to the path name index storage unit 2416 and converts the path name of the search condition into a path name ID “N2” (2702).
- The character string index search unit 2411 extracts the overlapping two-character chains from the search string “structured”, refers to the character string index, and obtains the search unit identifiers of the entries in which those chains appear consecutively under the same search unit identifier (2703).
- the search unit identifier “1” or “8” is obtained as the character string index search result group as shown in FIG. 36C.
- the structure matching unit 2412 obtains a search result that satisfies the structure specification of the search conditions 2702 and 2703.
- The structure matching unit 2412 searches the element management table 2501 shown in FIG. 36B using each search unit identifier obtained as the character string index search result group as a key, and takes the entries whose path name ID matches “N2” as the search result.
- the search result is shown in FIG. 36C.
- If the search condition specifies a tag name, the structure matching unit 2412 takes as the search result the entries whose name ID in the element management table matches the name ID of the specified tag name. If the search condition specifies both a path name and a path hierarchy, the structure matching unit 2412 takes as the search result the entries whose path name ID matches the path name ID of the specified path name and whose path hierarchy ID matches the path hierarchy ID of the specified path hierarchy.
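The two-stage conventional search described above (character string index first, element management table second) can be sketched as follows. This is an illustrative sketch only: the index contents, the table rows, and the function names `phrase_hits` and `match_structure` are hypothetical.

```python
def phrase_hits(index, phrase, n=2):
    """Return the search unit ids in which all n-character chains of `phrase`
    occur consecutively (chain k must sit at start position + k)."""
    chains = [phrase[i:i + n] for i in range(len(phrase) - n + 1)]
    if not chains:
        return set()
    hits = set()
    for unit_id, pos in index.get(chains[0], []):
        if all((unit_id, pos + k) in index.get(c, []) for k, c in enumerate(chains)):
            hits.add(unit_id)
    return hits

def match_structure(element_table, unit_ids, path_name_id):
    """Keep only the search units whose table entry has the requested path name ID."""
    return sorted(u for u in unit_ids if element_table[u]["path_name_id"] == path_name_id)

# Hypothetical character chain index: chain -> [(search unit id, position)].
index = {"st": [(1, 1), (8, 1), (5, 2)], "tr": [(1, 2), (8, 2)]}
# Hypothetical element management table keyed by search unit identifier.
element_table = {
    1: {"path_name_id": "N2"},
    5: {"path_name_id": "N2"},
    8: {"path_name_id": "N5"},
}

candidates = phrase_hits(index, "str")
result = match_structure(element_table, candidates, "N2")
```

Note that `match_structure` only filters candidates produced by the text lookup. With a structural condition alone there are no candidates to start from, so every row of the table would have to be examined.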
- Japanese Patent Laid-Open No. 2004-310607 discloses a document management apparatus that generates an index linking each element included in a structured document with its position in the hierarchical structure. Even for elements that share the same search path to a position in the hierarchical structure, that is, multiple child nodes of the same kind under one parent node, this document management apparatus can identify and manage each element individually.
- The above-described conventional structured document management apparatus first obtains the search unit identifiers in which a specified character string appears by referring to the character string index, and then determines whether each search unit identifier satisfies the specified structural condition by referring to the element management table. A character string search condition must therefore be specified for every document search, and a search specifying only a structural condition is not supported. In other words, to search by structural condition alone, the entire element management table must be scanned to check whether each search unit identifier satisfies the condition, which is very inefficient.
- the data structure is such that the logical structure data is added to the search index data for full-text search. For this reason, it is not possible to construct search data having a structure that enables an efficient search for a search that specifies only the structural conditions.
- The database construction device of the present invention comprises: an input document analysis unit that assigns a unique document number to a structured document and analyzes its structure; an element name registration unit that, based on the analysis result of the input document analysis unit, assigns a unique element name ID to each element name appearing in the structured document and registers it in an element name dictionary; an ancestor path name registration unit that, based on the analysis result of the input document analysis unit, assigns a unique ancestor path name ID to each ancestor path name appearing in the structured document and registers it in an ancestor path name dictionary; and an appearance information registration unit that, based on the analysis result of the input document analysis unit, registers element appearance information including at least the document number, character position, ancestor path name ID, and branching order information of the element of interest in an element appearance information storage unit using the element name ID as a key, and registers ancestor path appearance information including at least the document number, character position, element name ID, and branching order information in an ancestor path appearance information storage unit using the ancestor path name ID as a key.
- This database construction device generates appropriate appearance information indexes based on the appearance information of elements when registering and storing a structured document. The database construction device of the present invention can therefore construct search data whose structure allows a desired document to be found efficiently, not only when both a character string search condition and a structural condition are specified, but also under various search conditions in which only a structural condition is specified.
- FIG. 1 is a block diagram showing a configuration of a database apparatus according to Embodiment 1 of the present invention.
- FIG. 2 is a flowchart showing a procedure of document registration processing in Embodiment 1 of the present invention.
- FIG. 3 is a diagram showing an example of a structured document to be registered and searched in Embodiment 1 of the present invention.
- FIG. 4 is a diagram showing an example of the result of analyzing the logical structure of the structured document in the first embodiment of the present invention.
- FIG. 5 is a diagram for explaining ancestor path names in Embodiment 1 of the present invention.
- FIG. 6 is a diagram showing an example of the contents of an element name dictionary in the first embodiment of the present invention.
- FIG. 7 is a diagram showing an example of the contents of an ancestor path name dictionary in the first embodiment of the present invention.
- FIG. 8 is a diagram showing an example of the contents of an attribute name dictionary in the first embodiment of the present invention.
- FIG. 9 is a diagram for explaining character positions in the first embodiment of the present invention.
- FIG. 10A is a diagram for explaining element appearance information according to Embodiment 1 of the present invention.
- FIG. 10B is a diagram for explaining element appearance information according to Embodiment 1 of the present invention.
- FIG. 11 is a diagram for explaining ancestor path appearance information in Embodiment 1 of the present invention.
- FIG. 12A is a diagram for explaining attribute appearance information in Embodiment 1 of the present invention.
- FIG. 12B is a diagram for explaining attribute appearance information in Embodiment 1 of the present invention.
- FIG. 13 is a diagram for explaining text appearance information in Embodiment 1 of the present invention.
- FIG. 14 is a diagram showing an example of a search expression in the first embodiment of the present invention.
- FIG. 15 is a flowchart showing a procedure of a search process of the database device in the first embodiment of the present invention.
- FIG. 16A is a diagram for explaining an example of a search condition in the first embodiment of the present invention.
- FIG. 16B is a diagram for explaining the search operation of the database apparatus in the first embodiment of the present invention.
- FIG. 16C is a diagram for explaining a search result in the first embodiment of the present invention.
- FIG. 17A is a diagram for explaining an example of search conditions in the first embodiment of the present invention.
- FIG. 17B is a diagram for explaining the search operation of the database device in the first embodiment of the present invention.
- FIG. 17C is a diagram for explaining a search result in the first embodiment of the present invention.
- FIG. 18A is a diagram for explaining an example of a search condition in the first embodiment of the present invention.
- FIG. 18B is a diagram for explaining a search operation of the database apparatus in the first embodiment of the present invention.
- FIG. 18C is a diagram for explaining a search result in the first embodiment of the present invention.
- FIG. 19A is a diagram for explaining an example of search conditions in the first embodiment of the present invention.
- FIG. 19B is a diagram for explaining the search operation of the database device in the first embodiment of the present invention.
- FIG. 19C is a diagram for explaining a search result in the first embodiment of the present invention.
- FIG. 20A is a diagram for explaining an example of search conditions in the first embodiment of the present invention.
- FIG. 20B is a diagram for explaining a search operation of the database device in the first embodiment of the present invention.
- FIG. 20C is a diagram for explaining a search result in the first embodiment of the present invention.
- FIG. 21A is a diagram for explaining an example of a search condition in the first embodiment of the present invention.
- FIG. 21B is a diagram for explaining a search operation of the database device in the first embodiment of the present invention.
- FIG. 21C is a diagram for explaining a search result in the first embodiment of the present invention.
- FIG. 22A is a diagram for explaining an example of the search condition in the first embodiment of the present invention.
- FIG. 22B is a diagram for explaining the search operation of the database device in the first embodiment of the present invention.
- FIG. 22C is a diagram for explaining a search result in the first embodiment of the present invention.
- FIG. 23A is a diagram for explaining an example of a search condition in Embodiment 1 of the present invention.
- FIG. 23B is a diagram for explaining a search operation of the database device in Embodiment 1 of the present invention.
- FIG. 23C is a diagram for explaining a search result in the first embodiment of the present invention.
- FIG. 24 is a diagram used for explaining the order of empty elements in the second embodiment of the present invention.
- FIG. 25A is a diagram illustrating a partial ancestor path name according to the second embodiment of the present invention.
- FIG. 25B is a diagram showing the contents of the ancestor path name dictionary according to Embodiment 2 of the present invention.
- FIG. 25C is a diagram for explaining an ancestor path name ID string in the second embodiment of the present invention.
- FIG. 26 is a diagram for explaining element appearance information in the second embodiment of the present invention.
- FIG. 27 is a diagram for explaining ancestor path appearance information according to the second embodiment of the present invention.
- FIG. 28 is a diagram showing an example of a search expression in the second embodiment of the present invention.
- FIG. 29A is a diagram for explaining a search operation according to the second embodiment of the present invention.
- FIG. 29B is a diagram for explaining a search result in the second embodiment of the present invention.
- FIG. 30 is a block diagram showing a configuration of the database apparatus in the third embodiment of the present invention.
- FIG. 31 is a flowchart showing a procedure for document registration processing of the database apparatus in the third embodiment of the present invention.
- FIG. 32 is a diagram for explaining grouped element appearance information according to the third embodiment of the present invention.
- FIG. 33 is a block diagram of a conventional structured document management apparatus.
- FIG. 34 is a diagram showing an example of an element management table in the conventional structured document management apparatus.
- FIG. 35A is a diagram showing an example of a structured document processed by a conventional structured document management apparatus.
- FIG. 35B is a diagram showing an example of a character string index in the conventional structured document management apparatus.
- FIG. 36A is a diagram for explaining an example of search conditions in the conventional structured document management apparatus.
- FIG. 36B is a diagram for explaining a search operation in the conventional structured document management apparatus.
- FIG. 36C is a diagram for explaining a search result in the conventional structured document management apparatus.
- FIG. 1 is a block diagram showing the configuration of the database device according to Embodiment 1 of the present invention.
- The database apparatus includes an input document analysis unit 102 that receives a structured document group 101 to be registered in the database, assigns a unique document number to each document in the group, and analyzes its logical structure.
- Based on the analysis result of the input document analysis unit 102, an element name registration unit 103 assigns a unique identifier (hereinafter referred to as “element name ID”) to each element name appearing in the document and registers it in the element name dictionary 107.
- Based on the same analysis result, an ancestor path name registration unit 104 assigns a unique identifier (hereinafter referred to as “ancestor path name ID”) to each ancestor path name appearing in the document and registers it in the ancestor path name dictionary 108. An ancestor path name is the sequence of element names of the ancestor elements of the element of interest, listed in order from the top level and separated by slashes; it does not include the element name of the element of interest itself.
- Based on the same analysis result, an attribute name registration unit 105 assigns a unique identifier (hereinafter referred to as “attribute name ID”) to each attribute name appearing in the document and registers it in the attribute name dictionary 109.
- An appearance information registration unit 106 registers four types of appearance information in the element appearance information storage unit 111, the ancestor path appearance information storage unit 112, the attribute appearance information storage unit 113, and the text appearance information storage unit 114 of the appearance position index 110.
- The database device includes an element name dictionary 107 in which element name IDs and the corresponding element names are recorded, an ancestor path name dictionary 108 in which ancestor path name IDs and the corresponding ancestor path names are recorded, an attribute name dictionary 109 in which attribute name IDs and the corresponding attribute names are recorded, and an appearance position index 110 in which the four types of appearance information are stored.
- the appearance position index 110 includes an element appearance information storage unit 111, an ancestor path appearance information storage unit 112, an attribute appearance information storage unit 113, and a text appearance information storage unit 114.
- The element appearance information storage unit 111 stores the document number, character position, number of characters, ancestor path name ID, and branch order information of each occurrence of an element, using the element name ID as a key. The ancestor path appearance information storage unit 112 stores the document number, character position, number of characters, element name ID, and branch order information of each element, using the ancestor path name ID of that element as a key. The attribute appearance information storage unit 113 stores the document number, character position, number of characters, element name ID, ancestor path name ID, and branch order information, using the attribute name ID as a key.
- The text appearance information storage unit 114 stores the document number, character position, ancestor path name ID, element name ID, attribute name ID, and branch order information at which each partial character string extracted from the text in an element appears, using the partial character string as a key.
- The database device also includes a search condition input unit 116 that receives a search expression 115; a search condition analysis unit 117 that analyzes the search condition given to the search condition input unit 116, converts it into an internal condition, and outputs it to the appearance information acquisition unit 118; an appearance information acquisition unit 118 that, according to the output of the search condition analysis unit 117, selects and acquires the appropriate information from the four types of appearance information stored in the appearance position index 110 and obtains the result data set that matches the search condition; and a search result output unit 119 that outputs the result data set in an appropriate format as a search result 120.
- FIG. 2 is a flowchart showing the procedure of the document registration process in the first embodiment of the present invention.
- In step 2201, the input document analysis unit 102 reads one structured document from the structured document group 101 and assigns it a unique document number.
- FIG. 3 is a diagram showing an example of a structured document to be registered and searched in Embodiment 1 of the present invention.
- The structured document 101a shown in FIG. 3 has a book element at the highest level, and the book element includes a title element and two chapter elements.
- The title element contains the character string “document search” as its element entity, and the first chapter element contains another title element, two section elements, and a keyword attribute whose value is “history”.
- FIG. 4 shows the result of the input document analysis unit 102 analyzing the structured document 101a into a tree structure.
- FIG. 4 is a diagram showing a result of analyzing the logical structure of the structured document in the first embodiment of the present invention.
- a square frame of the tree structure 300 represents elements 301 to 303, and a character string written in the frame represents an element name 304.
- An elliptical dotted line frame indicates the attribute 305, and a character string written in the frame indicates the attribute name 306 (update).
- FIG. 5 is a diagram for explaining ancestor path names in Embodiment 1 of the present invention.
- The path name 701 of the element 302 shaded in FIG. 4 is composed of an ancestor path name 702 and an element name 703.
- The branch order 307 of the element 302 is “1/2/3”.
- The branch order is a sequence of numbers that indicates, for each element in the path name, its order of appearance among the elements that have the same parent element and the same element name.
- the element 302 shaded in FIG. 4 and the element 303 to the left of the element 302 have the same path name, but branch orders 307 and 308 are different.
- the notation method of the branch order is not limited to this. For example, the depth of a hierarchy having a value other than 1 and its value may be arranged. In this way, branch order 307 is expressed as “2: 2, 3: 3”.
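The relationship between path name, ancestor path name, and branch order can be illustrated with a short sketch. It assumes Python's standard `xml.etree.ElementTree` parser and a small hypothetical document in which the third section of the second chapter plays the role of element 302; representing the root's ancestor path as an empty string is also an assumption, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

def describe_elements(root):
    """Return (element name, ancestor path name, branch order) per element, in document order."""
    rows = []

    def visit(elem, ancestor_path, order):
        rows.append((elem.tag, ancestor_path, "/".join(str(n) for n in order)))
        seen = {}  # how many same-named siblings have appeared so far
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            visit(child, ancestor_path + "/" + elem.tag, order + (seen[child.tag],))

    visit(root, "", (1,))
    return rows

def compact_branch_order(order):
    """Alternative notation: keep only 'depth:value' pairs whose value is not 1."""
    values = order.split("/")
    return ",".join(f"{depth}:{v}" for depth, v in enumerate(values, start=1) if v != "1")

# Hypothetical document, not the one in FIG. 3.
doc = ET.fromstring(
    "<book><title/><chapter><section/></chapter>"
    "<chapter><section/><section/><section/></chapter></book>"
)
rows = describe_elements(doc)
print(rows[-1])                       # the last section element
print(compact_branch_order("1/2/3"))  # compact form of its branch order
```

The two section elements directly before the last one share its ancestor path name and element name, and differ only in the final number of the branch order.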
- The element name registration unit 103 checks whether the element name of the element of interest has already been registered in the element name dictionary 107. If it is registered, the corresponding element name ID is obtained. If it is not registered, a new element name ID (> 0) is assigned, and the element name and element name ID are registered in the element name dictionary 107.
- FIG. 6 shows an example (407) of the contents of the element name dictionary 107 after the structured document 101a shown in FIG. 3 is registered.
- The ancestor path name registration unit 104 checks whether the ancestor path name of the element of interest has already been registered in the ancestor path name dictionary 108. If it is registered, the corresponding ancestor path name ID is acquired. If it is not registered, a new ancestor path name ID (> 0) is assigned, and the ancestor path name is registered in the ancestor path name dictionary 108.
- FIG. 7 shows an example (408) of the contents of the ancestor path name dictionary 108 after the registration process of the structured document 101a shown in FIG. 3.
- In step 2205, if the element of interest has an attribute, the process proceeds to step 2206; if not, the process proceeds to step 2207.
- the attribute name registration unit 105 checks whether or not the attribute name of each attribute of the element of interest has been registered in the attribute name dictionary 109. If it is registered, the corresponding attribute name ID is acquired. If it is not registered, a new attribute name ID (> 0) is assigned and the attribute name is registered in the attribute name dictionary 109.
- FIG. 8 shows an example (409) of the contents of the attribute name dictionary 109 after the structured document 101a shown in FIG. 3 is registered.
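The element name, ancestor path name, and attribute name registration steps above share one look-up-or-assign pattern. A minimal sketch of that pattern follows; the class name `NameDictionary` and the sequential numbering from 1 are assumptions (the text only requires that new IDs be greater than 0).

```python
class NameDictionary:
    """Assigns a positive ID to each distinct name the first time it is seen."""

    def __init__(self):
        self._ids = {}

    def get_or_register(self, name):
        # Already registered: return the existing ID unchanged.
        if name not in self._ids:
            # Not registered: assign the next ID, starting from 1 so that IDs are > 0.
            self._ids[name] = len(self._ids) + 1
        return self._ids[name]

element_names = NameDictionary()
ids = [element_names.get_or_register(n) for n in ["book", "title", "chapter", "title"]]
```

Keeping 0 unassigned matters later: the text appearance information uses an attribute name ID of 0 to mean "not an attribute value".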
- The appearance information registration unit 106 registers element appearance information on the element of interest in the element appearance information storage unit 111 using the element name ID as a key.
- The element appearance information consists of a set of the following five types of values: the document number, the first character position and the number of characters of the text (other than tags) contained in the element of interest (including descendant elements), the ancestor path name ID, and the branch order.
- FIG. 9 is a diagram for explaining how to count character positions in the database apparatus according to the present embodiment.
- a table 410 indicates the character position 412 of each character 411 in a character string in which all texts in the document excluding tags are connected.
- the first character position is “0”.
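The character position counting described above can be sketched by walking the parsed tree while tracking an offset into the tag-free text. This sketch assumes Python's standard `xml.etree.ElementTree` and counts every text character, including any whitespace between tags; how the patent treats such whitespace is not stated here, so that is an assumption, as are the function name and sample document.

```python
import xml.etree.ElementTree as ET

def element_spans(root):
    """Return [(tag, first_char_position, num_chars)] in document order.

    Positions index into the concatenation of all text with tags removed,
    starting at 0. The span of an element covers its descendants' text too.
    """
    spans = []

    def visit(elem, pos):
        start = pos
        slot = len(spans)
        spans.append(None)               # reserve the slot to keep document order
        pos += len(elem.text or "")      # text before the first child
        for child in elem:
            pos = visit(child, pos)
            pos += len(child.tail or "")  # text between this child and the next
        spans[slot] = (elem.tag, start, pos - start)
        return pos

    visit(root, 0)
    return spans

# Hypothetical document whose tag-free text is "abcdefgh".
doc = ET.fromstring("<book><title>ab</title><chapter>cde<section>fg</section>h</chapter></book>")
spans = element_spans(doc)
```

The first character position and number of characters produced this way are the two values that the appearance information records for each element.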
- FIGS. 10A-10B are diagrams for explaining element appearance information in Embodiment 1 of the present invention.
- the character position of the first character 321 is “115”, and the number of characters of the entire element entity 322 is “40”.
- Element appearance information 501 regarding the section element 302 is shown in FIG. 10A.
- the element name ID (502) of the section element 302 is “4”, and the document number (503) is “1”.
- The section element 302 includes an element entity with a length of “40” characters (number of characters 505) starting at character “115” (character position 504).
- The ancestor path name ID (506) of the section element 302 is “3”, and the branch order (507) is “1/2/3”.
- The ancestor path name with the ancestor path name ID 506 of “3” is “/book/chapter”.
- the appearance information registration unit 106 registers the ancestor path appearance information related to the element of interest in the ancestor path appearance information storage unit 112 using the ancestor path name ID as a key.
- This ancestor path appearance information consists of a set of the following five types of values: the document number, the first character position and the number of characters of the text (other than tags) included in the element of interest (including descendant elements), the element name ID, and the branch order.
- FIG. 11 is a diagram for explaining ancestor path appearance information according to Embodiment 1 of the present invention.
- FIG. 11 shows the contents 511 of the ancestor path appearance information related to the element 302 shown with shading in FIG. 4.
- The element appearance information and the ancestor path appearance information related to the same element differ only in whether the key item is the element name ID 502 or the ancestor path name ID 506.
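Because the two kinds of appearance information differ only in their key, both can be pictured as two indexes built from one occurrence record. The sketch below uses the values given for the section element 302; the in-memory dictionaries, the `Occurrence` tuple, and the `register` function are hypothetical stand-ins for the storage units 111 and 112.

```python
from collections import defaultdict, namedtuple

Occurrence = namedtuple(
    "Occurrence",
    "doc_no char_pos num_chars element_name_id ancestor_path_id branch_order",
)

element_index = defaultdict(list)        # key: element name ID
ancestor_path_index = defaultdict(list)  # key: ancestor path name ID

def register(occ):
    """Store the same occurrence under both keys; the key field is left out of the value."""
    element_index[occ.element_name_id].append(
        (occ.doc_no, occ.char_pos, occ.num_chars, occ.ancestor_path_id, occ.branch_order)
    )
    ancestor_path_index[occ.ancestor_path_id].append(
        (occ.doc_no, occ.char_pos, occ.num_chars, occ.element_name_id, occ.branch_order)
    )

# Values described for the section element 302.
register(Occurrence(1, 115, 40, 4, 3, "1/2/3"))
```

With these indexes, a condition on an element name or on an ancestor path alone is answered by a single key lookup, with no character string condition required.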
- In step 2209, if the element of interest has an attribute, the process proceeds to step 2210; if not, the process proceeds to step 2211.
- The appearance information registration unit 106 registers attribute appearance information related to each attribute of the element of interest in the attribute appearance information storage unit 113 using the attribute name ID as a key.
- The attribute appearance information consists of a set of the following six types of values: the document number, the first character position and the number of characters of the attribute value, the ancestor path name ID, the element name ID, and the branch order.
- FIGS. 12A-12B are diagrams for explaining attribute appearance information in Embodiment 1 of the present invention.
- the section element 302 shaded in FIG. 4 includes the update attribute 305.
- The attribute value 350 of the update attribute 305 has its first character 351 at character position “115”, and the total number of characters 352 of the attribute value is “6”.
- As shown in FIG. 12B, the character position of the first character of the attribute value is virtually taken to be that of the first character of the text 322 (other than tags) included in the element of interest (including descendant elements).
- the attribute appearance information 521 regarding the update attribute 305 of the section element 302 is shown in FIG. 12A.
- The attribute name ID (522) is “2” and the document number (503) is “1”.
- The update attribute 305 has an attribute value of “6” characters (number of characters 505) starting at character “115” (character position 504).
- The ancestor path name ID (506) of the element to which the update attribute 305 belongs is “3”, the element name ID (502) is “4”, and the branch order (507) is “1/2/3”.
- The attribute name with the attribute name ID of “2” is “update”, the ancestor path name with the ancestor path name ID 506 of “3” is “/book/chapter”, and the element name with the element name ID 502 of “4” is “section”.
- The appearance information registration unit 106 cuts out partial character strings from the text of the entity content of the element of interest and registers text appearance information in the text appearance information storage unit 114 with each extracted partial character string as a key. At this time, 0 is always stored as the attribute name ID to distinguish element text from attribute values.
- The text appearance information consists of a set of the following six types of values: the document number, the first character position of the extracted partial character string, the ancestor path name ID, the element name ID, the attribute name ID, and the branch order.
- In step 2212, if the element of interest has an attribute, the process proceeds to step 2213; if it does not, the process proceeds to step 2214.
- In step 2213, the appearance information registration unit 106 cuts out partial character strings from the attribute value character string of each attribute of the element of interest, and registers the text appearance information in the text appearance information storage unit 114 using each partial character string as a key. The attribute value is treated as appearing virtually at the position shown in FIG. 11, and its character position is calculated in the same way as for the attribute appearance information.
- In step 2213, unlike the processing in step 2211, the attribute name ID (> 0) of the attribute of interest is stored in the attribute name ID.
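The registration of text appearance information for element text and for attribute values can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the two-character substring length, the `TextAppearance` tuple, the `register_text` helper, and the sample attribute value are hypothetical. Only the six stored values and the convention that attribute name ID 0 marks element text are taken from the description.

```python
from collections import defaultdict
from typing import NamedTuple


class TextAppearance(NamedTuple):
    doc_no: int
    char_pos: int          # first character position of the substring
    ancestor_path_id: int
    element_id: int
    attr_id: int           # 0 = element text, > 0 = attribute value
    branch_order: str


def register_text(index, text, doc_no, start_pos, ancestor_path_id,
                  element_id, branch_order, attr_id=0, n=2):
    """Register every n-character substring of `text`, keyed by the substring."""
    for offset in range(len(text) - n + 1):
        key = text[offset:offset + n]
        index[key].append(TextAppearance(
            doc_no, start_pos + offset, ancestor_path_id,
            element_id, attr_id, branch_order))


index = defaultdict(list)
# Hypothetical 6-character attribute value starting at character 115,
# belonging to attribute name ID 2 (assumed sample data).
register_text(index, "200410", doc_no=1, start_pos=115,
              ancestor_path_id=3, element_id=4,
              branch_order="1/2/3", attr_id=2)
```

Because element text and attribute values share one index, the attribute name ID is the only field that tells the two kinds of entries apart at search time.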
- FIG. 13 is a diagram for explaining text appearance information in Embodiment 1 of the present invention.
- the text appearance information 531 (— portion) is the text appearance information about the element entity (text) of the section element 302 shaded in FIG.
- the appearance information record 1201 shows an example of the element entity of the section element 302.
- The partial character string (532) “Maximum” of the element entity of the section element 302 appears at character “118” (character position 504) of the document whose document number (503) is “1”.
- For the element containing the partial character string, that is, the section element 302, the ancestor path name ID (506) is “3”, the element name ID (502) is “4”, and the branch order (507) is “1/2/3”.
- The ancestor path name with the ancestor path name ID 506 of 3 is “/book/section”, and the element name with the element name ID 502 of 4 is “chapter”.
- whether or not the partial character string 532 is an attribute value can be determined according to the attribute name ID 522.
- the attribute name ID is “0”
- An appearance information record 1202 shows an example of an attribute value of the update attribute 305 in the section element 302.
- the substring (532) “00” of the attribute value of the update attribute 305 appears at the “116” character (character position 504) of the document whose document number (503) is “1”.
- For the element to which the attribute containing the partial character string belongs, that is, the section element 302, the ancestor path name ID is “3”
- the element name ID (502) is “4”
- the branch order (507) is “1/2/3”.
- the attribute name ID (522) belonging to the element is “2”.
- the ancestor path name ID “3” has an ancestor path name “/book/section”
- the element name ID “4” has an element name “chapter”
- the attribute name ID “2” has an attribute name “update”.
- In step 2214, it is checked whether or not the processing has been completed for all elements appearing in this document. If there are any unprocessed elements, the process returns to step 2203 and is repeated.
- In step 2215, it is checked whether or not processing has been completed for all input documents. If unprocessed documents remain, the processing returns to step 2201 and is repeated.
- the database apparatus registers a document and completes the database construction process. Next, a description will be given of processing for searching for a registered document group by the database apparatus according to the present embodiment.
- FIG. 14 is a diagram showing an example of a search expression in the first embodiment of the present invention.
- These search expressions 2101-2107 are described in the XPath language published as a recommendation of the World Wide Web Consortium (W3C). The detailed specifications of the XPath language are described at the URL “http://www.w3.org/TR/xpath”.
- a search expression 2101 represents “a title element that is a child of a chapter element that is a child of a book element in the highest hierarchy”.
- the search expression 2102 represents “any child element of a chapter element that is a child of a book element in the highest hierarchy”.
- the search expression 2103 represents “Title elements in any hierarchy”.
- Retrieval expression 2104 represents “second section element child of chapter element child of top-level book element”.
- the search expression 2105 represents “update attribute of a section element child of a chapter element child of a book element of the highest hierarchy”.
- the search expression 2106 represents “an element that is a section element that is a child of a chapter element that is a child of a book element in the highest hierarchy and includes the character string“ maximum word ”in the element entity content”.
- the search expression 2107 represents “an attribute that is an update attribute of a section element that is a child of a chapter element that is a child of a book element of the highest hierarchy, and that the attribute value includes
- FIG. 15 is a flowchart showing the procedure of the search process of the database device in the first embodiment of the present invention.
- In step 2301, the search condition input unit 116 inputs the search expression 2101.
- In step 2305, the appearance information acquisition unit 118 compares the acquired number of entries N with the number of entries M. If N ≤ M, the process proceeds to step 2306; otherwise, it proceeds to step 2310.
- the element appearance information storage unit 111 in FIG. 16B is selected.
- In step 2307, the appearance information acquisition unit 118 checks whether the ancestor path name ID of this entry is 3. If the ancestor path name ID is 3, the process proceeds to step 2308; otherwise, it proceeds to step 2309.
- In step 2308, the appearance information acquisition unit 118 adds the data of this entry to the result data set 1302.
- Figure 16C shows the result data set.
- Each data of the result data set 1302 is stored in a format such as (document number, ancestor path name ID, element name ID, attribute name ID, branch order).
- In step 2309, the appearance information acquisition unit 118 checks whether all N entries have been processed. If there is an entry that has not yet been processed, the process returns to step 2306 and is repeated.
- In step 2314, the appearance information acquisition unit 118 outputs the obtained result data set to the search result output unit 119.
- The search result output unit 119 outputs the search results in an appropriate format, for example, by acquiring the document entity of the obtained result data set.
- For the search expression 2101, the database apparatus selects either the process of selecting the entries having the designated ancestor path name ID from the entries of the designated element name ID in the element appearance information storage unit 111, or the process of selecting the entries having the designated element name ID from the entries of the designated ancestor path name ID in the ancestor path appearance information storage unit 112. Therefore, the amount of processing can be suppressed according to the characteristics of the logical structure of the structured documents to be searched, and a desired document can be searched efficiently.
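The choice between the two lookup strategies can be sketched as follows. This is a minimal illustration, assuming that entries are plain dictionaries and that both indexes hold the same appearance entries; `build_indexes` and `search` are hypothetical helper names, and scanning the element side on a tie is an assumption.

```python
def build_indexes(entries):
    """Index the same appearance entries by element name ID and by ancestor path name ID."""
    element_index, path_index = {}, {}
    for e in entries:
        element_index.setdefault(e["element_id"], []).append(e)
        path_index.setdefault(e["ancestor_path_id"], []).append(e)
    return element_index, path_index


def search(element_index, path_index, element_id, path_id):
    """Scan whichever entry list is shorter and filter on the other ID."""
    by_element = element_index.get(element_id, [])   # N entries
    by_path = path_index.get(path_id, [])            # M entries
    if len(by_element) <= len(by_path):
        return [e for e in by_element if e["ancestor_path_id"] == path_id]
    return [e for e in by_path if e["element_id"] == element_id]


entries = [
    {"doc_no": 1, "ancestor_path_id": 3, "element_id": 4, "branch_order": "1/2/3"},
    {"doc_no": 1, "ancestor_path_id": 2, "element_id": 4, "branch_order": "1/1"},
    {"doc_no": 2, "ancestor_path_id": 3, "element_id": 5, "branch_order": "1/1/1"},
]
element_index, path_index = build_indexes(entries)
hits = search(element_index, path_index, element_id=4, path_id=3)
```

Either branch returns the same entries; only the number of entries scanned differs, which is the point of comparing the two counts first.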
- the result data set 1502 is output to the search result output unit 119 as shown in FIG. 18C.
- the search result output unit 119 outputs the search results in an appropriate format, for example, by acquiring the document entity of the obtained result data set 1502.
- For the search expression 2102, the database apparatus only needs to acquire the entries of the designated ancestor path name ID in the ancestor path appearance information storage unit 112, so a desired document can be searched efficiently.
- the search result output unit 119 outputs the search results in an appropriate format, for example, by acquiring the document entity of the obtained result data set 1602.
- For the search expression 2103, the database apparatus only needs to acquire the entries of the specified element name ID in the element appearance information storage unit 111, so a desired document can be searched efficiently.
- the search result output unit 119 outputs a search result in an appropriate format, for example, by acquiring a document entity of the obtained result data set.
- For the search expression 2104, the database apparatus selects, according to which has fewer entries, either the process of selecting the entries having the designated ancestor path name ID and branch order from the entries of the designated element name ID in the element appearance information storage unit 111, or the process of selecting the entries having the specified element name ID and branch order from the entries of the specified ancestor path name ID in the ancestor path appearance information storage unit 112. As a result, it is possible to reduce the amount of search processing and to efficiently search for a desired document.
- The appearance information acquisition unit 118 produces the search result as a result data set 1802 in a format such as (document number, ancestor path name ID, element name ID, attribute name ID, branch order), for example.
- the search result output unit 119 outputs the search result in an appropriate format by acquiring the document entity of the obtained result data set.
- For the search expression 2105, the database apparatus can search for a desired document by selecting, from the entries of the specified attribute name ID in the attribute appearance information storage unit 113, those having the specified ancestor path name ID and element name ID.
- the appearance information acquisition unit 118 refers to the appearance position index 110 and performs a concatenation operation on the “maximum” entry 1901 and the “word” entry 1902 in the text appearance information storage unit 114, as shown in FIG. 22B.
- Specifically, it checks whether the document numbers are the same, “word” is positioned exactly two characters after “maximum”, the ancestor path name ID is 3, the element name ID is 4, the attribute name ID is 0, and the branch orders are the same, and obtains the entries satisfying these conditions.
- The result is output to the search result output unit 119 as a result data set 1903 in a format such as (document number, ancestor path name ID, element name ID, attribute name ID, branch order).
- the search result output unit 119 outputs a search result in an appropriate format, for example, by obtaining a document entity of the obtained result data set.
- For the search expression 2106, when performing the concatenation operation between the partial character string entries in the text appearance information storage unit 114, the database apparatus can search for a desired document by selecting the entries (1904, 1905) in which the ancestor path name ID and the element name ID have the specified values, the branch orders are the same, and the attribute name ID is 0.
- The appearance information acquisition unit 118 refers to the appearance position index 110 and performs a concatenation operation between the entry 2001 of “20” and the entry 2002 of “04” in the text appearance information storage unit 114, as shown in FIG. 23B.
- The appearance information acquisition unit 118 checks whether the document numbers are the same, “04” is positioned exactly two characters after “20”, the ancestor path name ID is 3, the element name ID is 4, the attribute name ID is 2, and the branch orders are the same, and obtains the entries satisfying these conditions. Then, as shown in FIG. 23C, the appearance information acquisition unit 118 outputs the result data set 2003 to the search result output unit 119 in a format such as (document number, ancestor path name ID, element name ID, attribute name ID, branch order).
- the search result output unit 119 outputs a search result in an appropriate format, for example, by acquiring a document entity of the obtained result data set.
- For the search expression 2107, when performing the concatenation operation between the partial character string entries in the text appearance information storage unit 114, the database apparatus can search for a desired document by selecting the entries (2004, 2005) in which the ancestor path name ID and the element name ID have the specified values, the branch orders are the same, and the attribute name ID is the specified value (> 0).
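The concatenation operation used for the search expressions 2106 and 2107 can be sketched as a positional join over two substring entry lists. This is an illustrative sketch only: the dictionary-based entry layout and the `join_adjacent` helper are assumptions, and the sample entries are hypothetical.

```python
def join_adjacent(first_entries, second_entries, gap,
                  path_id, element_id, attr_id):
    """Pair entries where the second substring starts `gap` characters after
    the first, within the same document, element and (for attr_id > 0) attribute."""
    def matches(e):
        return (e["ancestor_path_id"] == path_id
                and e["element_id"] == element_id
                and e["attr_id"] == attr_id)

    # Look up candidates for the second substring by (document, position, branch order).
    second = {(e["doc_no"], e["char_pos"], e["branch_order"])
              for e in second_entries if matches(e)}

    results = []
    for e in first_entries:
        key = (e["doc_no"], e["char_pos"] + gap, e["branch_order"])
        if matches(e) and key in second:
            results.append((e["doc_no"], path_id, element_id,
                            attr_id, e["branch_order"]))
    return results


# Hypothetical entries for the substrings "20" and "04" of an attribute value.
entries_20 = [{"doc_no": 1, "char_pos": 115, "ancestor_path_id": 3,
               "element_id": 4, "attr_id": 2, "branch_order": "1/2/3"}]
entries_04 = [{"doc_no": 1, "char_pos": 117, "ancestor_path_id": 3,
               "element_id": 4, "attr_id": 2, "branch_order": "1/2/3"}]
found = join_adjacent(entries_20, entries_04, gap=2,
                      path_id=3, element_id=4, attr_id=2)
```

Passing `attr_id=0` restricts the same join to element text, which corresponds to the search expression 2106 case.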
- The database apparatus includes an element appearance information storage unit that stores element appearance information using the element name ID as a key, an ancestor path appearance information storage unit that stores element appearance information using the ancestor path name ID as a key, and an attribute appearance information storage unit that stores attribute appearance information using the attribute name ID as a key. Therefore, this database device can efficiently search for a desired document even with a search expression that specifies only a structural condition.
- The database device further includes a text appearance information storage unit that stores the appearance information of partial character strings extracted from the text of element entities and from the attribute values of elements. For this reason, this database device can perform a character string search not only on the text of the element entity but also on the attribute value.
- Although the search condition expression has been described as being given as an XPath expression in the database search process, the present invention can also be applied when the search condition is given in another query language having an equivalent meaning.
- When registering a structured document, the database apparatus generates a list of the element names, ancestor path names, and attribute names that indicate the document structure contained in the structured document, together with an index of their appearance position information in the structured document. For this reason, a database can be built that efficiently searches for documents having a desired logical structure, not only for search conditions that specify both character string conditions and structure conditions, but also for various search conditions that specify only structure.
- In the present embodiment, a configuration that analyzes the document structure, constructs dictionary data and appearance position index data, and registers the structured document is realized together with a configuration that efficiently searches, based on the dictionary data and the appearance position index data, for the documents indicated by a received search expression describing a document structure.
- However, a configuration with only the registration function may be implemented as a database construction device, and a configuration with only the search function may be implemented as a database retrieval device.
- In the present embodiment, the configuration that generates and registers dictionary data and appearance position index data for elements and ancestor paths, the configuration that additionally generates and registers dictionary data and appearance position index data for attributes, and the configuration that additionally generates and registers appearance position index data for the text of elements or attribute values are realized together. However, the apparatus may also be realized as a configuration that registers only elements and ancestor paths, a configuration that adds attributes to this, or a configuration that adds text to this.
- the database apparatus in the present embodiment has almost the same configuration as that of the first embodiment shown in FIG. However, this database device is different from the first embodiment in the following points.
- The ancestor path name registration unit 104 assigns a unique ancestor path name ID to each partial ancestor path name obtained by dividing the ancestor path names that appear in the document, and registers it in the ancestor path name dictionary 108.
- The appearance information registration unit 106 stores the document number, character position, number of characters, ancestor path name ID string, branch order, and empty element order with which each element appears in the element appearance information storage unit 111, using the element name ID as a key.
- This database device stores the document number, character position, number of characters, element name ID, branch order, and empty element order with which each element appears in the ancestor path appearance information storage unit 112, using the ancestor path name ID column as a key. In addition, this database device stores the document number, character position, number of characters, element name ID, ancestor path name ID column, branch order, and empty element order with which each attribute appears in the attribute appearance information storage unit 113, using the attribute name ID as a key. This database device also stores, for each partial character string extracted from the text in an element or from the attribute values of an element, the document number, character position, ancestor path name ID string, element name ID, attribute name ID, branch order, and empty element order with which it appears in the text appearance information storage unit 114, using the partial character string as a key.
- In step 2201, the input document analysis unit 102 reads one structured document and assigns a unique document number.
- In step 2202, the logical structure of the structured document is analyzed.
- An empty element is an element that does not have any element entity text, including in its descendant elements.
- The empty element order is obtained as follows. At each hierarchy level, the value is 1 if the element is the first among the sibling elements that have the same parent element, or if the immediately preceding sibling element is not an empty element; otherwise (the immediately preceding sibling element is an empty element), the value is that of the immediately preceding sibling element plus 1. The values obtained in this way are arranged for each level from the highest level down to the element.
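The rule above can be sketched as follows. This is one reading of the rule, written as a minimal illustration: `sibling_empty_orders` and `empty_element_order` are hypothetical helper names, the “/” separator follows the notation used in this description, and the sample sibling sequence and ancestor values are assumed data.

```python
def sibling_empty_orders(is_empty):
    """Per-level value for each sibling, given a flag per sibling telling
    whether that sibling is an empty element."""
    values = []
    for i in range(len(is_empty)):
        if i == 0 or not is_empty[i - 1]:
            # First sibling, or the preceding sibling has text: restart at 1.
            values.append(1)
        else:
            # The preceding sibling is empty: continue counting from it.
            values.append(values[-1] + 1)
    return values


def empty_element_order(ancestor_values, own_value):
    """Arrange the per-level values from the highest level down to the element."""
    return "/".join(str(v) for v in [*ancestor_values, own_value])


# Assumed sample: a text element, two empty elements, then a text element.
siblings = [False, True, True, False]
orders = sibling_empty_orders(siblings)
labels = [empty_element_order([1, 2], v) for v in orders]
```

Under this reading, a run of empty elements and the element that follows them receive increasing values, which is what allows them to be ordered even though they share one character position.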
- FIG. 24 is a diagram for explaining the order of empty elements in the second embodiment of the present invention.
- Figure 24 shows an example of the document tree structure 310 and the empty element order.
- A square frame with diagonal lines indicates the elements 2801, 2804, and 2805, which contain element entity text, and a plain square frame indicates the empty elements 2802 and 2803, which contain no element entity.
- the character string written in the form of “” represents the information of the empty element order 2806 of each element.
- The first two numbers “1/2” in the empty element order of the sibling elements 2801 to 2804 correspond to the empty element order of the ancestor elements.
- The notation of the empty element order is not limited to this.
- For example, only the depths of the hierarchy levels having a value other than 1 may be listed together with their values. If the empty element order 2806 (“1/2/3”) is expressed in this way, “2: 2, 3: 3” is obtained.
- That is, the value at depth 1 is “1” and is therefore omitted, the value at depth 2 is “2”, and the value at depth 3 is “3”. For this reason, when dealing with documents in which empty elements rarely appear, that is, documents whose empty element order values are almost always “1”, the latter notation can reduce the size of the appearance position index file.
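The alternative notation can be sketched as a small conversion. This is an illustrative sketch; the function name `compact_empty_order` is hypothetical, and the output spacing simply mirrors the “2: 2, 3: 3” example in the text.

```python
def compact_empty_order(order):
    """Keep only the levels whose value is not 1, written as 'depth: value'."""
    parts = order.split("/")
    return ", ".join(f"{depth}: {value}"
                     for depth, value in enumerate(parts, start=1)
                     if value != "1")


compact = compact_empty_order("1/2/3")
all_ones = compact_empty_order("1/1/1")
```

When every level is 1, the compact form is an empty string, which illustrates why this notation is smaller for documents in which empty elements are rare.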
- In step 2203, as in the first embodiment, the element name registration unit 103 performs a registration process on the element name dictionary 107 for the element name of the element of interest.
- The ancestor path name registration unit 104 divides the ancestor path name of the element of interest every three levels, and checks whether each partial ancestor path name after the division is registered in the ancestor path name dictionary 108. If it is registered, the corresponding ancestor path name ID is acquired. If it is not registered, a new ancestor path name ID (> 0) is assigned and registered in the ancestor path name dictionary 108. If the depth of the ancestor path name is three levels or less, the ancestor path name ID column becomes a single ancestor path name ID as in the first embodiment.
- FIG. 25A is a diagram for explaining a partial ancestor path name in Embodiment 2 of the present invention
- FIG. 25B is a diagram showing the contents of an ancestor path name dictionary
- FIG. 25C is a diagram for explaining an ancestor path name ID column.
- In FIG. 25A, the ancestor path name 2901, obtained by excluding the element name 2911 from the path name 2900 “/A/B/C/A/B/C/A/B/C”, can be further decomposed into the partial path names “/A/B/C” (2913, 2914) and “/A/B” (2915).
- The ancestor path name IDs 2904 of the ancestor path names 2905 “/A/B/C” and “/A/B” are registered in the contents 2903 of the ancestor path name dictionary 108 as “83” and “25”, respectively.
- Accordingly, the ancestor path name 2901 can be expressed as the ancestor path name ID column 2902 “83:83:25”, using the ancestor path name ID 2904 of each decomposed ancestor path name 2905 and the symbol “:”.
- By dividing the ancestor path name 2901 and assigning an ancestor path name ID 2904 to each partial ancestor path name 2905 in this way, the ancestor path name IDs 2904 already registered for the ancestor elements of the element and for other elements can be used in common.
- As a result, the number of overlapping ancestor path name IDs can be reduced, and the size of the ancestor path name dictionary 108 can be reduced. Note that although an example in which the ancestor path name is divided every three layers has been described in the present embodiment, the dividing method is not limited to this. For example, it is possible to divide every four layers, or to change the division width according to the depth of the layer. Also, although the symbol “:” is used as the delimiter of the ancestor path name ID string, another delimiter may be used.
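The division of an ancestor path name and the construction of the ancestor path name ID column can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the dictionary is modeled as a plain Python dict, and assigning new IDs as "largest existing ID plus one" is an assumption (the description only requires a new ID greater than 0).

```python
def split_ancestor_path(path, width=3):
    """Split an ancestor path name into partial paths of `width` levels each."""
    names = [name for name in path.split("/") if name]
    return ["/" + "/".join(names[i:i + width])
            for i in range(0, len(names), width)]


def ancestor_path_id_string(path, dictionary, width=3, sep=":"):
    """Look up (or register) each partial path and join the IDs with `sep`."""
    ids = []
    for part in split_ancestor_path(path, width):
        if part not in dictionary:
            # Assumed ID policy: next unused positive integer.
            dictionary[part] = max(dictionary.values(), default=0) + 1
        ids.append(str(dictionary[part]))
    return sep.join(ids)


# Dictionary contents taken from the example in the text.
path_dictionary = {"/A/B/C": 83, "/A/B": 25}
id_string = ancestor_path_id_string("/A/B/C/A/B/C/A/B", path_dictionary)
```

The same function can serve both registration and search, since the search side only needs to divide the queried ancestor path name in the same way before comparing ID columns.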
- In steps 2205 to 2206, as in the first embodiment, the attribute name registration unit 105 performs the registration process on the attribute name dictionary 109 for each attribute of the element of interest.
- the appearance information registration unit 106 registers the element appearance information related to the element of interest in the element appearance information storage unit 111 using the element name ID as a key.
- The element appearance information consists of a set of the following six values: the document number, the first character position and the number of characters of the text (other than tags) included in the element of interest (including descendant elements), the ancestor path name ID string, the branch order, and the empty element order. “Character position” represents the position of the first character within the character string formed by connecting all the text in the document excluding the tags. If the element of interest is an empty element, the first character position of the text (other than tags) that first appears after the element of interest is regarded as the first character position of the element of interest.
- FIG. 26 shows an example of element appearance information.
- FIG. 26 is a diagram for explaining element appearance information according to the second embodiment of the present invention.
- The differences from Embodiment 1 are that the ancestor path name ID 506 of the element appearance information 541 records an ancestor path name ID column in which one or more ancestor path name IDs are connected by a delimiter instead of a single ancestor path name ID, and that the information on the empty element order 548 is included.
- the appearance information registration unit 106 registers the ancestor path appearance information related to the element of interest in the ancestor path appearance information storage unit 112 using the ancestor path name ID column as a key.
- The ancestor path appearance information consists of a set of the following six values: the document number, the first character position and the number of characters of the text (other than tags) included in the element of interest (including descendant elements), the element name ID, the branch order, and the empty element order.
- An example of ancestor path appearance information is shown in FIG. 27.
- FIG. 27 is a diagram for explaining ancestor path appearance information according to Embodiment 2 of the present invention.
- The differences from Embodiment 1 are that the ancestor path appearance information 551 includes the information on the empty element order 548, and that the ancestor path name ID 506 holds an ancestor path name ID column of one or more ancestor path name IDs instead of a single ancestor path name ID.
- The appearance information registration unit 106 registers the attribute appearance information for each attribute of the element of interest in the attribute appearance information storage unit 113, using the attribute name ID as a key.
- The attribute appearance information consists of a set of the following seven values: the document number, the first character position and the number of characters of the attribute value, the ancestor path name ID string, the element name ID, the branch order, and the empty element order.
- The differences from Embodiment 1 are that the ancestor path name ID of the attribute appearance information records an ancestor path name ID column in which one or more ancestor path name IDs are connected by a delimiter instead of a single ancestor path name ID, and that the information on the empty element order is included.
- The appearance information registration unit 106 cuts out partial character strings from the text of the entity content of the element of interest, and registers the text appearance information in the text appearance information storage unit 114 using each extracted partial character string as a key.
- Since this text is not an attribute value, the value “0” is always stored in the attribute name ID.
- The text appearance information consists of a set of the following seven values: the document number, the first character position of the extracted partial character string, the ancestor path name ID column, the element name ID, the attribute name ID, the branch order, and the empty element order.
- The differences from Embodiment 1 are that the ancestor path name ID of the text appearance information records an ancestor path name ID column in which one or more ancestor path name IDs are connected by a delimiter instead of a single ancestor path name ID, and that the information on the empty element order is included.
- The appearance information registration unit 106 cuts out partial character strings from the attribute value character string of each attribute of the element of interest, and registers the text appearance information in the text appearance information storage unit 114 using each partial character string as a key.
- the difference from the first embodiment is that an ancestor path name ID column in which one or more ancestor path name IDs are separated by a delimiter instead of a single ancestor path name ID is recorded in the text appearance information. And including information on the order of empty elements.
- steps 2214 to 2215 are executed in the same manner as in the first embodiment, and a document is registered to construct a database.
- The search process can be realized by changing the process in which the search condition analysis unit 117 obtains an ancestor path name ID from the ancestor path name and converts it into an internal condition, so that it instead obtains an ancestor path name ID column from the ancestor path name. In other words, the search condition analysis unit 117 divides the ancestor path name every three layers, refers to the ancestor path name dictionary 108, obtains the ancestor path name IDs corresponding to the divided partial ancestor path names, and obtains the ancestor path name ID column by joining those IDs in order with a delimiter.
- the format of the ancestor path name ID column is the same as the example shown in Figs. 25A-25C in the description of the document registration process.
- The appearance information acquisition unit 118 can obtain the search result by changing each process that collates against the ancestor path name ID into a process that collates against the ancestor path name ID column.
- FIG. 28 is a diagram showing an example of a search expression in Embodiment 2 of the present invention.
- the search expression 3201 shown in FIG. 28 represents “a Y element that is a sibling element of the X element of the B element child of the A element child of the highest hierarchy and appears after the X element”.
- The search expression 3201 is input from the search condition input unit 116.
- the search condition analysis unit 117 analyzes the search expression 3201, converts it into an internal condition with reference to the element name dictionary 107 and the ancestor path name dictionary 108, and outputs it to the appearance information acquisition unit 118.
- the ancestor path name ID corresponding to the ancestor path name “/A/B” is “25”
- the element name ID corresponding to the element name “X” is “10”
- the element name ID corresponding to the element name “Y” is “14”.
- Condition C3 is necessary in the internal condition because an empty element and the element located immediately after it have the same character position, so the empty element order values must be compared to determine which of the two comes first.
- The appearance information acquisition unit 118 refers to the appearance position index 110 and, as shown in FIG. 29A, obtains from the entries whose ancestor path name ID is 25 in the ancestor path appearance information storage unit 112 those whose element name ID is 10 (Cx) and those whose element name ID is 14 (Cy). Next, it obtains the pairs of Cx and Cy entries 3301 and 3302 that satisfy C1 and (C2 or C3). Then, as shown in FIG. 29B, the appearance information acquisition unit 118 outputs the result data set 3303 to the search result output unit 119 in a format such as (document number, ancestor path name ID, element name ID, attribute name ID, branch order, empty element order).
- the search result output unit 119 outputs a search result in an appropriate format, for example, by acquiring a document entity of the obtained result data set.
- The number of entries of the designated ancestor path name ID in the ancestor path appearance information storage unit 112 is compared with the number of entries of the specified element name ID in the element appearance information storage unit 111, and the search is performed on whichever has fewer entries.
- For the search expression 3201, even when the appearance positions of the two elements obtained by referring to the ancestor path appearance information storage unit 112 or the element appearance information storage unit 111 are the same, that is, when the two elements are an empty element and the element following it, the database device compares the empty element order information to eliminate the ambiguity in their ordering and can obtain the correct result.
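The ordering test between two element entries can be sketched as follows. The conditions C1 to C3 are not spelled out in this part of the description, so this sketch only assumes what the text states: positions are compared first, and the empty element order is compared when the positions are equal. The entry layout and the names `parse_order` and `appears_after` are hypothetical.

```python
def parse_order(order):
    """Turn an empty element order such as '1/1/2' into a comparable tuple."""
    return tuple(int(v) for v in order.split("/"))


def appears_after(x, y):
    """True if entry y appears after entry x in the same document."""
    if x["doc_no"] != y["doc_no"]:
        return False
    if x["char_pos"] != y["char_pos"]:
        # Different start positions: the character position decides.
        return x["char_pos"] < y["char_pos"]
    # Same start position (an empty element and a following element):
    # fall back to the empty element order.
    return parse_order(x["empty_order"]) < parse_order(y["empty_order"])


# Assumed sample: an empty X element immediately followed by a Y element.
x_entry = {"doc_no": 1, "char_pos": 10, "empty_order": "1/1/1"}
y_entry = {"doc_no": 1, "char_pos": 10, "empty_order": "1/1/2"}
y_follows_x = appears_after(x_entry, y_entry)
```

A full implementation of the search expression 3201 would also need the sibling condition, which is left out here because its exact form is not given in this section.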
- the ancestor path name registration unit 104 divides the ancestor path name and is unique to each partial ancestor path name after the division. An ID is assigned and registered in the ancestor path name dictionary 108. Therefore, the size of the ancestor path name dictionary can be reduced.
- the appearance information registration unit 106 also stores information in the order of empty elements in the element appearance information storage unit 111, the ancestor path appearance information storage unit 112, the attribute appearance information storage unit 113, and the text appearance information storage unit 114. To do. Therefore, the database device according to the present embodiment can eliminate the ambiguity of the context that the start character position of the empty element and the element immediately after it are the same, and can obtain the correct LV and the search result.
- the database device allows the first character of the text to appear first after the element of interest when the element of the structural document is an empty element that does not contain any text.
- the position is regarded as the first character position of the element of interest. Therefore, the appearance order of empty elements is generated as an appearance position index, and not only when the structured document contains empty elements but also when consecutive empty elements are included, the entire structure of the document structure is displayed.
- sentence search it is possible to efficiently search for a document indicated by a search expression indicating a document structure including empty elements.
- the database device registers an ancestor path string based on partial path names obtained by dividing ancestor path names under certain conditions. Therefore, the database device in the present embodiment can reduce the size of the ancestor nose dictionary without accumulating partial paths, and even a structured document containing many structured objects can be obtained. It is possible to efficiently search for the document indicated by the search formula indicating the document structure.
- The database apparatus simultaneously realizes two configurations: one that, when registering a structured document, analyzes the document structure, constructs dictionary data and appearance position index data, and registers the structured document; and one that efficiently retrieves the document indicated by a received search expression indicating a document structure, based on the dictionary data and the appearance position index data.
- It may also be realized as a configuration with only the function for registering structured documents, or as a configuration with only the search function.
- When registering a structured document, the database apparatus according to the present embodiment generates and registers appearance position index data corresponding to an empty element having no text element, and an ancestor path name.
- FIG. 30 is a block diagram showing a configuration of the database device according to Embodiment 3 of the present invention.
- The database device in the third embodiment has almost the same configuration as that in the second embodiment.
- This database device differs from Embodiment 2 in the following point.
- An appearance information grouping unit 3401, which groups overlapping information stored in the element appearance information storage unit 111, the ancestor path appearance information storage unit 112, the attribute appearance information storage unit 113, and the text appearance information storage unit 114, is added.
- FIG. 31 is a flowchart showing a procedure for document registration processing of the database apparatus according to the third embodiment of the present invention.
- The processing of steps 2201 to 2215 is the same as in the second embodiment, so its description is omitted.
- From the group of entries registered in the element appearance information storage unit 111 under the same element name ID as the key, the appearance information grouping unit 3401 collects the entries that have the same values for all four information items other than the document number and character position (number of characters, ancestor path name ID, branch order, empty element order). If the number of those entries reaches a threshold (for example, 10 entries), it groups those entries.
- Next, from the remaining entries, the appearance information grouping unit 3401 similarly creates groups of entries in which the values of any three of the four information items other than the document number and character position (number of characters, ancestor path name ID, branch order, empty element order) are common.
- The appearance information grouping unit 3401 similarly creates groups of entries in which the values of any two of the information items are common.
- The appearance information grouping unit 3401 then creates groups of entries in which the value of one information item is common, and registers the entries that remain at the end as a group having no common information item.
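The grouping steps above can be sketched as follows. This is a hedged illustration, not the patented procedure: it assumes the threshold applies at every level (four, three, two, then one common item), tries the item combinations in a fixed order, and uses hypothetical field names (`doc_no`, `char_pos`, `num_chars`, `path_id`, `branch_order`, `empty_order`) and a hypothetical function `group_entries`.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

ITEMS = ("num_chars", "path_id", "branch_order", "empty_order")
Entry = Dict[str, int]                      # doc_no, char_pos, plus ITEMS
Group = Tuple[Dict[str, int], List[Entry]]  # (common values, member entries)


def group_entries(entries: List[Entry], threshold: int = 10) -> List[Group]:
    remaining = list(entries)
    groups: List[Group] = []
    for k in range(len(ITEMS), 0, -1):       # 4, 3, 2, 1 common items
        for keys in combinations(ITEMS, k):
            buckets = defaultdict(list)
            for e in remaining:
                buckets[tuple(e[key] for key in keys)].append(e)
            kept = []
            for values, members in buckets.items():
                if len(members) >= threshold:
                    common = dict(zip(keys, values))
                    # Members keep only what the group does not already hold.
                    slim = [{f: v for f, v in m.items() if f not in common}
                            for m in members]
                    groups.append((common, slim))
                else:
                    kept.extend(members)
            remaining = kept
    if remaining:
        groups.append(({}, remaining))       # no common information item
    return groups


sample = [
    {"doc_no": 1, "char_pos": 0, "num_chars": 5, "path_id": 7, "branch_order": 1, "empty_order": 0},
    {"doc_no": 1, "char_pos": 9, "num_chars": 5, "path_id": 7, "branch_order": 1, "empty_order": 0},
    {"doc_no": 2, "char_pos": 3, "num_chars": 8, "path_id": 7, "branch_order": 1, "empty_order": 0},
    {"doc_no": 2, "char_pos": 6, "num_chars": 2, "path_id": 7, "branch_order": 1, "empty_order": 0},
    {"doc_no": 3, "char_pos": 4, "num_chars": 4, "path_id": 9, "branch_order": 2, "empty_order": 1},
]
groups = group_entries(sample, threshold=2)
```

With the threshold lowered to 2 for the sample, the first two entries form a group sharing all four items, the next two share three items and keep only their own `num_chars`, and the last entry falls into the group with no common item.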
- FIG. 32 is a diagram for explaining grouped element appearance information according to Embodiment 3 of the present invention.
- The element appearance information whose element name ID is 14 has been grouped and consists of group information and individual entries.
- The group information 3601 to 3604 stores the values of the information items common to the entries 3605 to 3608 belonging to each group, together with links to the individual entries.
- Each entry 3605 belonging to the group stores only the document number and character position.
- Each individual entry 3606 stores the number of characters together with the document number and character position.
- Each entry 3607 stores the branch order together with the document number and character position.
- The fourth group information 3604 represents a group having no common information item, and all information items are stored in each entry 3608.
- Similarly, for the information stored in the ancestor path appearance information storage unit 112, the attribute appearance information storage unit 113, and the text appearance information storage unit 114, entries whose information items other than the document number and character position have common values are grouped together, which completes the database construction process for registering documents.
- In the process of searching the registered documents, the appearance information acquisition unit 118 of the database apparatus obtains all registered information items from the contents of the grouped entries and their group information.
- The search result is then obtained in the same manner as in the second embodiment.
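A minimal sketch of how complete entries might be recovered at search time from grouped data, assuming each group is held as a pair of common values and slimmed-down member entries. The representation, the field names, and the function `restore_entries` are hypothetical and are used only for illustration.

```python
from typing import Dict, Iterator, List, Tuple

# (values common to the group, individual entries without those values)
Group = Tuple[Dict[str, int], List[Dict[str, int]]]


def restore_entries(groups: List[Group]) -> Iterator[Dict[str, int]]:
    """Yield complete entries by merging group values into each member."""
    for common, members in groups:
        for member in members:
            yield {**common, **member}


grouped: List[Group] = [
    # All four information items are common; members hold only the position.
    ({"num_chars": 5, "path_id": 7, "branch_order": 1, "empty_order": 0},
     [{"doc_no": 1, "char_pos": 0}, {"doc_no": 1, "char_pos": 9}]),
    # Three items are common; each member also keeps its own num_chars.
    ({"path_id": 7, "branch_order": 1, "empty_order": 0},
     [{"doc_no": 2, "char_pos": 3, "num_chars": 8}]),
]
full = list(restore_entries(grouped))
```

Because every restored entry again carries all six fields, the later search steps can treat it exactly like an ungrouped entry.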
- The appearance information grouping unit 3401 of the database apparatus groups the entries stored in the appearance position index 110 and factors the values of the information items that are common within a group out of the individual entries. Therefore, the database apparatus in the present embodiment can reduce the index size.
- For the appearance position information of each element, ancestor path, and so on, the database apparatus groups, under certain conditions, the parts whose information item values are common and stores them in a structure separate from the parts that cannot be shared. This reduces the index size because the common parts are not stored in duplicate.
- The database construction device can construct retrieval data with a structure that allows efficient retrieval of structured documents, and is useful for database devices that retrieve data efficiently.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/587,770 US20070168363A1 (en) | 2004-11-30 | 2005-09-27 | Database constructing apparatus, database search apparatus, database apparatus, method of constructing database, and method of searching database |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2004345392 | 2004-11-30 | ||
JP2004-345392 | 2004-11-30 | ||
JP2005131992A JP2006185408A (ja) | 2004-11-30 | 2005-04-28 | データベース構築装置及びデータベース検索装置及びデータベース装置 |
JP2005-131992 | 2005-04-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006059425A1 true WO2006059425A1 (ja) | 2006-06-08 |
Family
ID=36564865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2005/017696 WO2006059425A1 (ja) | 2004-11-30 | 2005-09-27 | データベース構築装置、データベース検索装置、データベース装置、データベース構築方法、及びデータベース検索方法 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20070168363A1 (ja) |
JP (1) | JP2006185408A (ja) |
WO (1) | WO2006059425A1 (ja) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4860416B2 (ja) * | 2006-09-29 | 2012-01-25 | 株式会社ジャストシステム | 文書検索装置、文書検索方法および文書検索プログラム |
JP4770694B2 (ja) * | 2006-10-18 | 2011-09-14 | セイコーエプソン株式会社 | デバイスと接続される装置、データ内を検索する方法、コンピュータプログラム、および、インデックスデータ |
JP4445509B2 (ja) | 2007-03-20 | 2010-04-07 | 株式会社東芝 | 構造化文書検索システム及びプログラム |
US20120284661A1 (en) * | 2010-04-05 | 2012-11-08 | Makoto Mikuriya | Map information processing device |
US11487707B2 (en) * | 2012-04-30 | 2022-11-01 | International Business Machines Corporation | Efficient file path indexing for a content repository |
WO2013175524A1 (ja) * | 2012-05-22 | 2013-11-28 | 株式会社 東芝 | 構造文書管理システム、構造文書管理方法及びプログラム |
US9104730B2 (en) | 2012-06-11 | 2015-08-11 | International Business Machines Corporation | Indexing and retrieval of structured documents |
US8914356B2 (en) | 2012-11-01 | 2014-12-16 | International Business Machines Corporation | Optimized queries for file path indexing in a content repository |
US9323761B2 (en) | 2012-12-07 | 2016-04-26 | International Business Machines Corporation | Optimized query ordering for file path indexing in a content repository |
JP6212639B2 (ja) * | 2014-06-30 | 2017-10-11 | 株式会社日立製作所 | 検索方法 |
JP6841322B2 (ja) | 2017-04-06 | 2021-03-10 | 富士通株式会社 | インデックス生成プログラム、インデックス生成装置、インデックス生成方法、検索プログラム、検索装置および検索方法 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001331490A (ja) * | 2000-03-17 | 2001-11-30 | Fujitsu Ltd | 構造化文書格納装置、構造化文書検索装置、構造化文書格納検索装置及びプログラム並びにプログラム記録媒体 |
JP2002202973A (ja) * | 2000-10-25 | 2002-07-19 | Matsushita Electric Ind Co Ltd | 構造化文書管理装置 |
JP2003067403A (ja) * | 2001-08-24 | 2003-03-07 | Fuji Xerox Co Ltd | 構造化文書管理装置及び構造化文書管理方法、検索装置、検索方法 |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3887867B2 (ja) * | 1997-02-26 | 2007-02-28 | 株式会社日立製作所 | 構造化文書の登録方法 |
CA2242158C (en) * | 1997-07-01 | 2004-06-01 | Hitachi, Ltd. | Method and apparatus for searching and displaying structured document |
JP4141556B2 (ja) * | 1998-12-18 | 2008-08-27 | 株式会社日立製作所 | 構造化文書管理方法及びその実施装置並びにその処理プログラムを記録した媒体 |
JP3754253B2 (ja) * | 1999-11-19 | 2006-03-08 | 株式会社東芝 | 構造化文書検索方法、構造化文書検索装置及び構造化文書検索システム |
US6721727B2 (en) * | 1999-12-02 | 2004-04-13 | International Business Machines Corporation | XML documents stored as column data |
JP2001167087A (ja) * | 1999-12-14 | 2001-06-22 | Fujitsu Ltd | 構造化文書検索装置,構造化文書検索方法,構造化文書検索用プログラム記録媒体および構造化文書検索用インデックス作成方法 |
US6804677B2 (en) * | 2001-02-26 | 2004-10-12 | Ori Software Development Ltd. | Encoding semi-structured data for efficient search and browsing |
JP3692054B2 (ja) * | 2001-05-21 | 2005-09-07 | 株式会社東芝 | 文書構造変換方法および文書構造変換装置およびプログラム |
US7249133B2 (en) * | 2002-02-19 | 2007-07-24 | Sun Microsystems, Inc. | Method and apparatus for a real time XML reporter |
JP4267336B2 (ja) * | 2003-01-30 | 2009-05-27 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 構造パターン候補を生成する方法、システムおよびプログラム |
US20060053122A1 (en) * | 2004-09-09 | 2006-03-09 | Korn Philip R | Method for matching XML twigs using index structures and relational query processors |
JP2006127235A (ja) * | 2004-10-29 | 2006-05-18 | Toshiba Corp | 構造化文書管理システム、構造化文書管理方法及びプログラム |
2005
- 2005-04-28 JP JP2005131992A patent/JP2006185408A/ja active Pending
- 2005-09-27 WO PCT/JP2005/017696 patent/WO2006059425A1/ja not_active Application Discontinuation
- 2005-09-27 US US10/587,770 patent/US20070168363A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
US20070168363A1 (en) | 2007-07-19 |
JP2006185408A (ja) | 2006-07-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2006059425A1 (ja) | データベース構築装置、データベース検索装置、データベース装置、データベース構築方法、及びデータベース検索方法 | |
US10169354B2 (en) | Indexing and search query processing | |
CN109492077B (zh) | 基于知识图谱的石化领域问答方法及系统 | |
US6853992B2 (en) | Structured-document search apparatus and method, recording medium storing structured-document searching program, and method of creating indexes for searching structured documents | |
US8504553B2 (en) | Unstructured and semistructured document processing and searching | |
US8005819B2 (en) | Indexing and searching product identifiers | |
US20170235841A1 (en) | Enterprise search method and system | |
JP2005092889A (ja) | ウェブページのための情報ブロック抽出装置及び情報ブロック抽出方法 | |
US7124147B2 (en) | Data structures related to documents, and querying such data structures | |
CN102254014A (zh) | 一种网页特征自适应的信息抽取方法 | |
Chen et al. | BibPro: A citation parser based on sequence alignment | |
US8214403B2 (en) | Structured document management device and method | |
CN115358200A (zh) | 一种基于SysML元模型的模板化文档自动生成方法 | |
JP5225021B2 (ja) | 全文検索方法及び装置及びプログラム | |
KR101174184B1 (ko) | 통계에 의한 시소러스 데이터베이스 구축 방법 및 시소러스 데이터 구축 시스템 | |
JP3719089B2 (ja) | 文書処理装置 | |
JP7371989B1 (ja) | 検索サーバー、検索システム、及び検索プログラム | |
EP1072986A2 (en) | System and method for extracting data from semi-structured text | |
Xu et al. | N-gram index structure study for semantic based mathematical formula | |
JP5225022B2 (ja) | Xmlデータ検索方法及び装置及びプログラム | |
CN118051619A (zh) | 一种基于网络嵌入和语义表征的作者姓名消歧方法 | |
Min et al. | Method of Understanding Structure and Building Database with Material Experiment Data | |
Han et al. | Wrapping Data into XML | |
JPH09282326A (ja) | 文書高速構造検索方式 | |
Baggam et al. | DATA MINING TECHNOQUES AND ITS APPLICATIONS A SURVEY |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS KE KG KM KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2007168363 Country of ref document: US Ref document number: 10587770 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 200580003630.3 Country of ref document: CN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWP | Wipo information: published in national office |
Ref document number: 10587770 Country of ref document: US |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 05787950 Country of ref document: EP Kind code of ref document: A1 |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 5787950 Country of ref document: EP |