WO2007119567A1 - Document processing apparatus and document processing method - Google Patents
Document processing apparatus and document processing method
- Publication number
- WO2007119567A1 (PCT/JP2007/056690)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- pair
- node
- structured document
- document file
- value
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/137—Hierarchical processing, e.g. outlines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/143—Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
Definitions
- the present invention relates to a document file search technique.
- Patent Document 1: Japanese Patent Laid-Open No. 2006-048536
- HTML: HyperText Markup Language
- XML: Extensible Markup Language
- Authors can freely design the tag structure of an XML document, but the tag structure is often patterned to some extent according to the document content. For example, sales documents have much in common with one another in the tag set used (vocabulary) and in tag structure, whereas the tag sets and tag structures of sales documents and legal documents have little in common.
- The present invention was made based on this observation by the inventor, and its main purpose is to provide a technique for selecting highly related structured document files based on the tag structure of the structured document files.
- One embodiment of the present invention is a document processing apparatus.
- This apparatus detects, from a structured document file described in a predetermined tag set, a pair of tags having a predetermined positional relationship as a node pair.
- It indexes the appearance mode of the node pair in the structured document file as an attribute value according to a predetermined rule, and generates index information in which the node pair is associated with the attribute value.
- It detects, as a common pair, a node pair common to the node pair group detected from a first structured document file and the node pair group detected from a second structured document file.
- By referring to the index information of the first structured document file and the index information of the second structured document file, it indexes, as a similarity value, the similarity between the attribute value of the common pair in the first structured document file and the attribute value of the common pair in the second structured document file.
- FIG. 1 is a schematic diagram for explaining the principle of similar document search based on a tag structure.
- FIG. 2 is a schematic diagram for explaining a parent-child relationship.
- FIG. 3 is a schematic diagram for explaining a repetitive relationship.
- FIG. 4 is a schematic diagram for explaining a sibling relationship.
- FIG. 5 is a functional block diagram of the document processing apparatus.
- FIG. 6 is a screen diagram showing node similarity values.
- FIG. 7 is a diagram showing the result of investigating a node pair for a certain drug information database.
- FIG. 8 is a table for obtaining distribution approximate values.
- 100 document processing apparatus, 110 user interface processing unit, 120 data processing unit, 130 data holding unit, 132 input unit, 134 document acquisition unit, 136 display unit, 140 index processing unit, 142 node pair detection unit, 144 attribute value acquisition unit, 146 index information generation unit, 150 similarity determination unit, 152 common pair detection unit, 154 node similarity value calculation unit, 156 correction unit, 158 rare value calculation unit, 160 distribution approximate value acquisition unit, 162 document similarity value calculation unit, 170 document holding unit, 172 index information holding unit.
- FIG. 1 is a schematic diagram for explaining the principle of similar document search based on a tag structure.
- This figure shows a case in which it is determined which of the structured document 52 and the structured document 54 is more similar to the structured document 50.
- Hereinafter, a structured document file for which similar documents are sought, such as the structured document 50, is referred to as a “query document”, and a structured document file compared against it is referred to as a “document to be examined”.
- The <report> tag and the <problem> tag are in an upper-lower relationship. Since the <problem> tag and the <countermeasures> tag are also in an upper-lower relationship, the <report> tag and the <countermeasures> tag are indirectly in an upper-lower relationship as well.
- The <report> tag and the <mathematics> tag, and the <report> tag and the <science> tag, are in an upper-lower relationship. Since the <mathematics> tag and the <problem> tag are also in an upper-lower relationship, the <report> tag and the <problem> tag are indirectly in an upper-lower relationship as well.
- Where no tag intervenes, the <report> tag and the <problem> tag are in a direct upper-lower relationship.
- Where the <mathematics> tag lies between them, the tags are still in an upper-lower relationship, but not a direct one.
- The <report> tag and the <countermeasures> tag are in an upper-lower relationship.
- Even when the <problem> tag is sandwiched between the <report> tag and the <countermeasures> tag, these tags are in an upper-lower relationship.
- the structured document 54 there is a tag> tag itself.
- The structured document 54 can therefore be said to be structurally more similar to the structured document 50 than the structured document 52 is.
- This embodiment proposes a method for quantifying the similarity between the query document and the document to be examined based on the common tag structure of structured document files, as shown in FIG. 1.
- a similar document search based on the tag structure is referred to as a “structure similarity search” and is distinguished from a “content similarity search” which is a similar document search based on a word group included in the document.
- an inspected document similar to a query document may be selected by narrowing down candidates from a large number of inspected documents by a structure similarity search and then executing a content similarity search.
- the document processing apparatus 100 detects a pair of tags included in a structured document file, and executes a structural similarity search using the pair (hereinafter referred to as “node pair”) as a basic unit.
- A tag pair detected as a node pair is required to have a predetermined positional relationship in the structured document file.
- The following describes three relationships, “parent-child”, “repetition”, and “sibling”, as positional relationships to be detected as node pairs.
- FIG. 2 is a schematic diagram for explaining the parent-child relationship.
- the parent-child relationship means that the two tags are in the upper-lower relationship in the structure file.
- B tag 12 is below A tag 10.
- A tag 10 and B tag 12 are in a parent-child relationship.
- the parent-child relationship may be a direct upper-lower relationship, or may be a relationship that reaches B tag 12 with several tag layers sandwiched between A tag 10.
- the appearance mode of the node pair in the structure document file is indexed as an attribute value.
- Attribute values are index values for three items: “depth”, “distance”, and “frequency”. Hereinafter, “attribute value” refers to the set of these three index values.
- “Depth” for a node pair in a parent-child relationship indicates how many levels the tag corresponding to the parent is from the root tag. In the case of the figure, A tag 10 is two levels below the root tag, so the depth is “2”.
- “Distance” for a node pair in a parent-child relationship is the number of layers from the parent tag to the child tag. In the case of the figure, since the A tag 10 and the B tag 12 are separated by three layers, the distance is “3”.
- “Frequency” for a node pair in a parent-child relationship is the number of times such a combination of A tag and B tag, at a depth of “2” and a distance of “3”, appears in the structured document file.
- a node pair in a parent-child relationship is called a “parent-child pair”.
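- The depth, distance, and frequency of parent-child pairs described above can be sketched in Python as follows. This is a minimal illustration, not the apparatus itself: the function name `parent_child_pairs` is hypothetical, the root tag is assumed to be at depth 0, and every ancestor-descendant combination is counted without separating out repetitive pairs.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def parent_child_pairs(xml_text):
    """Count (parent tag, child tag, depth, distance) combinations.

    depth    = number of levels from the root tag down to the parent tag
    distance = number of levels from the parent tag down to the child tag
    The counter value is the frequency of that combination in the document.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(elem, ancestors):
        # ancestors holds the tags on the path from the root to elem's parent
        for depth, parent_tag in enumerate(ancestors):
            distance = len(ancestors) - depth
            counts[(parent_tag, elem.tag, depth, distance)] += 1
        for child in elem:
            walk(child, ancestors + [elem.tag])

    walk(root, [])
    return counts


# Assumed sample: tag "a" sits two levels below the root, "b" three levels below "a".
sample = "<r><x><a><p><q><b/><b/></q></p></a></x></r>"
pairs = parent_child_pairs(sample)
```

- In this sample, the key `("a", "b", 2, 3)` corresponds to the A tag and B tag of the figure, and its count is the frequency of that combination.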
- FIG. 3 is a schematic diagram for explaining the repetitive relationship.
- a repeated relationship is a relationship in which a parent tag is shared and child tags with the same content appear multiple times. This is a special form of parent-child relationship.
- Here, not only A tag 10 and B tag 12 but also A tag 10 and B tag 14, and A tag 10 and B tag 16, have a parent-child relationship of depth “2” and distance “3”. In such a case, the first pair, A tag 10 and B tag 12, is regarded as being in a parent-child relationship, while the second and subsequent pairs, A tag 10 and B tag 14, and A tag 10 and B tag 16, are said to be in a repetitive relationship.
- A tag 10, B tag 14, and B tag 16 have a repetitive relationship of frequency “2”, and the frequency in the repetitive relationship is always 2 or more. The depth and distance in the repetitive relationship are obtained in the same manner as for the parent-child relationship.
- a node pair having a repetitive relationship is referred to as a “repetitive pair”.
- FIG. 4 is a schematic diagram for explaining the sibling relationship.
- a sibling relationship is a relationship in which a parent tag is shared and child tags with different contents appear multiple times.
- In the figure, A tag 10 and B tag 12, A tag 10 and C tag 18, and A tag 10 and D tag 20 are each in a parent-child relationship.
- the A tag 10, the B tag 14, and the B tag 16 have a repetition relationship of frequency “2”.
- the relationship between B tag 16 and C tag 18, B tag 16 and D tag 20, and C tag 18 and D tag 20 is a sibling relationship.
- the distance between sibling node pairs (hereinafter referred to as “sibling pairs”) is calculated as the distance between the same level of one tag and the other tag.
- the distance between B tag 16 and C tag 18 is “1”, the distance between B tag 16 and D tag 20 is “2”, and the distance between C tag 18 and D tag 20 is “1”.
- the average distance between B tag 12, B tag 14, and B tag 16 may be calculated as the distance of the sibling pair when the B tag is the partner.
- “Depth” in the sibling pair indicates the number of layers from the root tag. In the case of the figure, the sibling pair depth is all “5”.
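- The sibling distances given for the figure can be reproduced with a short Python sketch. The function name `sibling_distances` is hypothetical, and taking the smallest position gap between occurrences of two different tags is an assumption chosen to match the example values; the text also allows an average over the repeated B tags instead.

```python
from itertools import combinations


def sibling_distances(child_tags):
    """Return the distance for each pair of different tags on one level.

    child_tags lists the tags of one level in document order.
    Assumption: when a tag repeats, the smallest gap in position is used.
    """
    positions = {}
    for index, tag in enumerate(child_tags):
        positions.setdefault(tag, []).append(index)

    distances = {}
    for tag_a, tag_b in combinations(positions, 2):
        distances[(tag_a, tag_b)] = min(
            abs(i - j) for i in positions[tag_a] for j in positions[tag_b]
        )
    return distances


# Level of the figure: B tag 12, B tag 14, B tag 16, C tag 18, D tag 20
level = ["B", "B", "B", "C", "D"]
result = sibling_distances(level)
```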
- a tag pair corresponding to any of a parent-child pair, a repetitive pair, and a sibling pair is detected as a node pair.
- The relationships shown in FIG. 2 to FIG. 4 are examples of node pair definitions that characterize the tag structure of a structured document file. The user of the document processing apparatus 100 may arbitrarily decide which positional relationships between tags are defined as node pairs. In this embodiment, the description focuses on the simplest one, the parent-child relationship.
- FIG. 5 is a functional block diagram of the document processing apparatus 100.
- the document processing apparatus 100 includes a user interface processing unit 110, a data processing unit 120, and a data holding unit 130.
- the user interface processing unit 110 is in charge of processing related to the user interface in general, such as input processing from the user and information display to the user.
- the user interface processing unit 110 will be described as providing the user interface service of the document processing apparatus 100.
- the user may operate the document processing apparatus 100 via the Internet.
- a communication unit (not shown) receives operation instruction information from the user terminal, and transmits processing result information executed based on the operation instruction to the user terminal.
- the data processing unit 120 executes various data processing based on the data acquired from the user interface processing unit 110.
- The data processing unit 120 also serves as an interface between the user interface processing unit 110 and the data holding unit 130.
- the data holding unit 130 stores various data such as setting data prepared in advance and data received from the data processing unit 120.
- the user interface processing unit 110 includes an input unit 132 and a display unit 136.
- the input unit 132 receives an input operation from the user.
- the display unit 136 displays various information to the user.
- the input unit 132 includes a document acquisition unit 134 for acquiring a structure document file from outside.
- the data holding unit 130 includes a document holding unit 170 and an index information holding unit 172.
- the document holding unit 170 holds the structured document file acquired by the document acquisition unit 134.
- the index information holding unit 172 holds index information generated by an index information generating unit 146 described later.
- the data processing unit 120 includes an index processing unit 140 and a similarity determination unit 150.
- the index processing unit 140 generates index information in which a node pair is associated with its attribute value for each structure document file.
- the index processing unit 140 includes a node pair detection unit 142, an attribute value acquisition unit 144, and an index information generation unit 146.
- the node pair detection unit 142 detects a node pair from the structured document file.
- the attribute value acquisition unit 144 calculates an attribute value for each of the depth, distance, and frequency for each detected node pair.
- the index information generation unit 146 generates index information that associates the document ID, node pair, and attribute value for specifying the structured document file, and records the index information in the index information holding unit 172.
- the similarity determination unit 150 performs a structural similarity search by comparing the index information of the query document and the index information of the subject document.
- the similarity determination unit 150 includes a common pair detection unit 152, a node similarity value calculation unit 154, a correction unit 156, a rare value calculation unit 158, a distribution approximate value acquisition unit 160, and a document similarity value calculation unit 162.
- the common pair detection unit 152 detects a node pair included in both the node pair group included in the query document and the node pair group included in the check target document.
- Such a node pair is referred to as a “common pair”.
- For example, if a parent-child pair of tag <A> and tag <B> exists in the query document and a parent-child pair of tag <A> and tag <B> also exists in the document to be examined, tag <A> and tag <B> are detected as a common pair of the query document and the document to be examined.
- Tag names need not match completely.
- Suppose that in one document the <report> tag and the <date> tag form a parent-child pair,
- and that in the other document the <rep> tag and the <date> tag are in a parent-child relationship.
- The tag names “report” and “rep” share the three characters “r”, “e”, and “p”, so the names have some similarity.
- In such a case, these node pairs may be treated as a common pair. In this way, two tag names being compared may be judged to match when they overlap by a predetermined number of characters or more, or when one tag name includes the other.
- synonym dictionary data in which similar relationships between words are defined in advance may be prepared, and the common pair detection unit 152 may determine whether two tag names to be compared are in a similar relationship.
- a document creator can arbitrarily set a tag name. For this reason, the tag name of the query document and the tag name of the document to be inspected are often not the same, but are often similar. If a common pair is detected in consideration of the similarity of tag names, a more realistic structural similarity search is possible for structured document files such as XML documents.
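- The tag name matching rules described above (character overlap, inclusion, and a synonym dictionary) could be sketched as follows. The function name `tag_names_similar`, the default threshold of three characters, the reading of “overlap” as a shared leading run of characters, and the dictionary layout are all assumptions made for illustration.

```python
def tag_names_similar(name_a, name_b, min_overlap=3, synonyms=None):
    """Decide whether two tag names are close enough to be treated as the same."""
    a, b = name_a.lower(), name_b.lower()
    if a == b:
        return True
    # One tag name includes the other, e.g. "rep" inside "report".
    if a in b or b in a:
        return True
    # Count the shared leading characters (assumed meaning of "overlap").
    common = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        common += 1
    if common >= min_overlap:
        return True
    # Optional synonym dictionary: {word: set of similar words}.
    if synonyms and (b in synonyms.get(a, ()) or a in synonyms.get(b, ())):
        return True
    return False


example_synonyms = {"problem": {"issue"}}
```

- For instance, `tag_names_similar("report", "rep")` is true through inclusion, while `"problem"` and `"issue"` match only when the synonym dictionary is supplied.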
- the node similarity value calculation unit 154 calculates the similarity between the attribute value of the common pair in the query document and the attribute value of the common pair in the document to be examined as a node similarity value. The calculation formula for calculation will be described later. Node similarity values are calculated for all common pairs in the query document node pair group.
- the rare value calculation unit 158 calculates a rare value for each common pair.
- the rare value is a numerical value indicating the appearance frequency of the common pair that is the target of the adjustment in the structured document file group (hereinafter simply referred to as “corpus”) included in the document holding unit 170. Node pairs with fewer occurrences in the corpus have a higher rarity value.
- Distribution approximate value acquisition section 160 calculates a distribution approximate value for each common pair.
- The attribute value of a node pair that becomes a common pair varies within the corpus. For example, a certain parent-child pair may appear with a distance of “3” in one structured document and with a distance of “8” in another. On the other hand, the distance of a different parent-child pair may vary within the range of “3” to “5” in the corpus.
- the distribution approximate value is an index value for correcting the node similarity value in consideration of such variation in the attribute value of the common pair. The distribution approximation will be described in detail with reference to Figs.
- the correction unit 156 corrects the node similarity value based on the rare value and the distribution approximate value. A specific correction method will also be described later.
- The document similarity value calculation unit 162 calculates, as a document similarity value, the similarity of the tag structure between the query document and the document to be examined from the node similarity value of each common pair detected between them. For example, when the query document and the document to be examined share a plurality of common pairs, the total value or the average value of the node similarity values for these common pairs may be calculated as the document similarity value. In this embodiment, the total value of the node similarity values is calculated as the document similarity value. The more common pairs there are and the larger their node similarity values, the larger the document similarity value.
- the document similarity value is a numerical value indicating the similarity of the tag structure between the query document and the document to be inspected.
- Expressions (1) to (3) are expressions for calculating a node similarity value for a node pair C that is a parent-child pair and a common pair of a query document A and a document to be examined B.
- Expression (1) is an expression for calculating the rare value of the node pair C.
- “documentCount” is the number of structured document files held in the document holding unit 170. That is, the number of documents included in the corpus.
- the document holding unit 170 may calculate a rare value for a document group included in a predetermined external database.
- “distribution” indicates the total number of occurrences of node pair C in the corpus.
- the rare value increases as the number of appearances decreases with respect to the number of documents in the corpus.
- the rare value calculation unit 158 calculates the rare value using the calculation formula shown in the formula (1).
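- Expression (1) itself is not reproduced in this text, so the following Python sketch only illustrates the stated property that the rare value grows as the number of occurrences falls relative to the number of documents. The logarithmic form `log(1 + documentCount / distribution)` is an assumption borrowed from inverse-document-frequency weighting, not the patent's formula, and the function name `rare_value` is hypothetical.

```python
import math


def rare_value(document_count, distribution):
    """Illustrative rare value: larger when a node pair occurs less often.

    document_count: number of structured documents in the corpus
    distribution:   total number of occurrences of the node pair in the corpus
    Assumed form (not taken from the source): log(1 + document_count / distribution)
    """
    if document_count <= 0 or distribution <= 0:
        raise ValueError("document_count and distribution must be positive")
    return math.log(1 + document_count / distribution)


# Corpus size and highest occurrence count quoted later in the text for FIG. 7
rare_for_scarce_pair = rare_value(11682, 10)
rare_for_frequent_pair = rare_value(11682, 13749)
```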
- Expression (2) is a calculation expression for indexing, as a Difference value, the difference between the attribute value of the node pair C in the query document and the attribute value of the node pair C in the document to be examined. For example, if the distance of node pair C in the query document is 3 and the distance of node pair C in the document to be examined is 10, the node pair C is a common pair, but its appearance mode differs greatly between the two documents. In such a case, the Difference value becomes large.
- qDistance in Expression (2) is an attribute value related to the distance of node pair C in the query document.
- dDistance is an attribute value related to the distance of node pair C in the document to be inspected. If there are multiple node pairs C in the document to be inspected, indicate the average distance between them.
- maxDistance indicates the maximum distance of node pair C in the corpus. When the maximum distance exceeds a predetermined value, for example, “10”, it is uniformly “10”.
- qFrequency indicates the “frequency” of node pair C in the query document
- dFrequency indicates the “frequency” of node pair C in the document to be examined
- maxFrequency indicates the maximum frequency of the node pair in the corpus.
- the upper limit of the maximum frequency is also set to “10” as a predetermined value.
- qDepth is the “depth” of node pair C in the query document
- dDepth is the “depth” of node pair C in the document being examined
- maxDepth is the maximum depth of node pair C in the corpus.
- the upper limit of the maximum depth is also set to “10” as a predetermined value.
- the first term in the square root of Equation (2) is a term that indexes the difference in the distance between the node pair C in the query document and the test subject document.
- the second term is used to index frequency differences
- the third term is used to index depth differences. The smaller the differences in the three elements of distance, frequency, and depth calculated in the first to third terms, the smaller the Difference value.
- α, β, and γ are weighting coefficients for the elements of distance, frequency, and depth, respectively.
- For parent-child pairs, a difference in distance is considered to reflect a larger difference in tag structure than a difference in frequency or depth.
- A difference in depth is considered to reflect a smaller difference in tag structure than a difference in distance or frequency. Therefore, in this embodiment, α is set to 0.7, β to 0.2, and γ to 0.1 so that α > β > γ. Provided that the sum of α, β, and γ is 1, suitable values of α, β, and γ may be found through experiments according to the corpus.
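- Expression (2) is not reproduced in this text. The description (three weighted terms under a square root, each built from a query value, a document value, and a capped corpus maximum) suggests a weighted, normalized root, which the following Python sketch assumes. The squared and normalized form of each term, the function name `difference_value`, and the dictionary keys are assumptions; the weights 0.7, 0.2, 0.1 and the cap of 10 are taken from the text.

```python
import math

CAP = 10  # upper limit applied to maxDistance, maxFrequency, and maxDepth


def difference_value(query, doc, corpus_max, weights=(0.7, 0.2, 0.1)):
    """Illustrative Difference value between two appearances of one node pair.

    query, doc, corpus_max: dicts with keys "distance", "frequency", "depth".
    Assumed term shape: weight * ((q - d) / min(max, CAP)) ** 2
    """
    total = 0.0
    for weight, key in zip(weights, ("distance", "frequency", "depth")):
        limit = min(corpus_max[key], CAP)
        total += weight * ((query[key] - doc[key]) / limit) ** 2
    return math.sqrt(total)


# Example from the text: distance 3 in the query document, 10 in the other
query_attrs = {"distance": 3, "frequency": 1, "depth": 2}
doc_attrs = {"distance": 10, "frequency": 1, "depth": 2}
maxima = {"distance": 10, "frequency": 83.75, "depth": 9}
diff = difference_value(query_attrs, doc_attrs, maxima)
```

- Identical attribute values give a Difference value of 0, and the value grows as distance, frequency, or depth diverge, with distance weighted most heavily.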
- Equation (3) is a calculation equation for correcting the node similarity value obtained from Equation (2) with the rare value obtained from Equation (1).
- the correction unit 156 corrects the node similarity value by multiplying the rare value by the node similarity value.
- the corrected node similarity value indicates the similarity between the appearance mode of the node pair C in the query document and the appearance mode of the node pair C in the test document. In the two documents to be compared, when a rare node pair appears as a common pair, the node similarity value is large. Such a node pair can be said to be an important node pair showing the similarity of the tag structure between the query document and the document to be inspected.
- FIG. 6 is a screen diagram that displays node similarity values.
- the display unit 136 arranges a plurality of display areas (hereinafter referred to as “pair boxes”) in a matrix corresponding to the parent-child pair of the query document.
- the node similarity value is displayed in the box.
- the node pair detection unit 142 scans the tag structure of the query document and detects a total of 22 parent-child pairs.
- the attribute value acquisition unit 144 detects attribute values for distance, frequency, and depth for each parent-child pair.
- the index information generation unit 146 generates index information and records it in the index information holding unit 172.
- the query document is held in the document holding unit 170.
- the common pair detection unit 152 sequentially selects documents to be inspected from the document holding unit 170. In some cases, the user may explicitly specify an inspected document to be compared via the input unit 132.
- the common pair detection unit 152 detects a common pair by referring to the index information of the query document and the index information of the document to be inspected.
- The parent-child pair of the <body> tag and the <output> tag and the parent-child pair of the <this-week> tag and the <output> tag were not detected in the document to be examined, but the other parent-child pairs were detected. That is, of the 22 parent-child pairs in the query document, the 20 parent-child pairs other than these two are common pairs.
- the node similarity value calculation unit 154 calculates a node similarity value for these 20 common pairs, and the correction unit 156 corrects each node similarity value with a rare value.
- the display unit 136 displays the node similarity value in the pair box for each parent-child pair of the query document.
- The node similarity value of the common pair of the <schedule> tag and the <term> tag is the highest, at 5.33.
- The display unit 136 displays the pair box of a common pair whose node similarity value is a predetermined value or more, for example 5.00 or more, in a color different from that of the pair boxes of other common pairs. For example, the pair box is displayed in dark red.
- The node similarity value of the common pair of the <progress> tag and the <term> tag is 4.32,
- and the node similarity value of the common pair of the <body> tag and the <term> tag is 4.38.
- These common pairs are similar in appearance mode, but not as similar as the common pair of the <schedule> tag and the <term> tag.
- The display unit 136 displays a pair box having a node similarity value of 4.00 or more in light red. Pair boxes with node similarity values less than 4.00 are displayed in white. With such a display method, when a query document and a document to be examined are compared, node pairs whose appearance modes are particularly similar are easy to identify visually.
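- The color rule for pair boxes can be written as a small Python function. The thresholds 5.00 and 4.00 and the three colors come from the text; the function name `pair_box_color` and the parameter names are hypothetical.

```python
def pair_box_color(node_similarity, strong=5.00, moderate=4.00):
    """Map a node similarity value to the display color of its pair box."""
    if node_similarity >= strong:
        return "dark red"
    if node_similarity >= moderate:
        return "light red"
    return "white"


# Values quoted in the text for three common pairs
colors = [pair_box_color(value) for value in (5.33, 4.32, 4.38)]
```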
- the document similarity value calculation unit 162 calculates the total value of the node similarity values as the document similarity value.
- the similarity determination unit 150 performs a structure similarity search by calculating the document similarity value of the document to be inspected with respect to the query document. For example, a predetermined number of documents to be examined are selected as a structured document similar to a query document in descending order of document similarity values.
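- A minimal Python sketch of this selection step follows: it sums the node similarity values of each document to be examined and returns the highest-scoring documents. The function name `rank_documents`, the input layout, and the sample values are assumptions for illustration.

```python
def rank_documents(node_similarities_by_doc, top_n=20):
    """Rank documents by document similarity value (sum of node similarity values).

    node_similarities_by_doc: {document id: {node pair: node similarity value}}
    Returns up to top_n (document id, document similarity value) tuples,
    in descending order of document similarity value.
    """
    totals = {
        doc_id: sum(values.values())
        for doc_id, values in node_similarities_by_doc.items()
    }
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical node similarity values for three documents to be examined
scores = {
    "doc1": {("schedule", "term"): 5.0, ("body", "term"): 4.0},
    "doc2": {("schedule", "term"): 1.5},
    "doc3": {},
}
top_two = rank_documents(scores, top_n=2)
```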
- the display unit 136 may further include a ranking display unit (not shown).
- the ranking display unit selects a predetermined number, for example, 20 documents to be inspected in descending order of the document similarity value calculated for a query document, and displays the titles in a list format. Alternatively, the document similarity value is displayed in a rank order in descending order of the document similarity value. According to such a display method, it becomes easy to comprehensively recognize a test document having a tag structure similar to the query document.
- A fuzzy search using an XPath expression is also possible. For example, suppose the corresponding position is searched for in the document to be examined with the XPath expression “/body/note/chapter/para” as the search expression. In an ordinary XPath search, a tag at the position “/body/a/note/chapter/para” is not hit, because it contains the tag “a”, which is not in the search condition. By obtaining node similarity values for node pairs such as “body/note” and “note/chapter”, an XPath search can find positions close to the search expression even if they do not match it completely.
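- The idea of matching a path through its node pairs rather than as a whole can be sketched in Python as follows. This is a simplified illustration: the function names are hypothetical, only plain slash-separated paths are handled (no predicates or axes), and the score is simply the fraction of query node pairs found as ancestor and descendant in the candidate path rather than a node similarity value.

```python
def path_to_pairs(path):
    """Split a simple slash-separated path into adjacent (parent, child) pairs."""
    steps = [step.strip() for step in path.split("/") if step.strip()]
    return list(zip(steps, steps[1:]))


def pair_match_ratio(query_path, candidate_path):
    """Fraction of query node pairs that occur, in order, in the candidate path."""
    candidate = [step.strip() for step in candidate_path.split("/") if step.strip()]
    pairs = path_to_pairs(query_path)
    if not pairs:
        return 0.0

    def is_ancestor(parent, child):
        # True if child appears somewhere below an occurrence of parent
        return any(
            step == parent and child in candidate[index + 1:]
            for index, step in enumerate(candidate)
        )

    hits = sum(1 for parent, child in pairs if is_ancestor(parent, child))
    return hits / len(pairs)


ratio = pair_match_ratio("/body/note/chapter/para", "/body/a/note/chapter/para")
```

- Here the extra “a” level does not prevent a match, because every node pair of the query path is still present as an ancestor and a descendant in the candidate path.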
- FIG. 7 is a diagram showing a result of investigating a node pair for a certain medicine information database.
- The structured documents studied were XML documents, 11682 in number, with a total size of about 400 megabytes. From this database, 2020 parent-child pairs, 1548 repetitive pairs, and 1044 sibling pairs were detected. Of the 2020 parent-child pairs, the one that appeared most often appeared 13749 times. The average number of occurrences of a parent-child pair in the document group was 2335. Across the 2020 parent-child pairs, the maximum distance was 10 and the average distance was 2.72; note that the upper limit of the parent-child pair distance is set to 10. Similarly, the maximum frequency of a parent-child pair was 83.75, the average frequency was 1.31, the maximum depth was 9.00, and the average depth was 2.43.
- The maximum standard deviation indicating variation in distance was 1.55, and the average standard deviation was 0.20. That is, the distance of a parent-child pair varies by a standard deviation of at most about 1.55. Since the average standard deviation is about 0.20, it can be seen that the distance of a parent-child pair does not vary very much.
- For frequency, the maximum standard deviation was 46.40 and the average standard deviation was 0.40.
- For depth, the maximum standard deviation was 1.65 and the average standard deviation was 0.10. Results as shown in the figure were also obtained for repetitive pairs and sibling pairs.
- The distribution approximate value acquisition unit 160 calculates the distribution approximate value as a variable for correcting the node similarity value in consideration of the variation in the attribute value of a node pair. If the attribute values of a node pair follow a normal distribution, about 68% of the occurrences of that node pair detected from the corpus fall within the range of the average value μ of the attribute value ± the standard deviation σ. In addition, about 95% fall within the range of μ ± 2σ.
- Suppose that the distance of the common pair C in the query document A corresponds to μ - 2.5σ,
- and that the distance of the common pair C in the document to be examined corresponds to μ + 1.8σ.
- The common pair C appears in both the query document and the document to be examined, but its attribute values in the two documents are statistically far apart. In such a case, the distribution approximate value is reduced and the node similarity value is corrected to be small.
- FIG. 8 is a table for obtaining distribution approximate values.
- The distribution approximate value for the distance of a node pair is 1.0
- when the attribute value of the common pair in the query document and the attribute value of the common pair in the document to be examined are statistically close.
- As the two values move further apart, the distribution approximate value becomes 0.5;
- it is 0.3 if the separation is 2σ or more and less than 3σ, 0.2 if it is 3σ or more and less than 4σ, and 0.1 if it is 4σ or more.
- The correction unit 156 corrects the node similarity value by multiplying the result of Equation (3) by the distribution approximate value. For example, a final node similarity value that takes the standard deviation into account may be obtained by multiplying the corrected node similarity value of Equation (3) by the distribution approximate values for distance, frequency, and depth. With such a processing method, when the attribute values of the common pair in the query document and in the document to be examined are statistically far apart, the node similarity value is strongly suppressed.
- Alternatively, the (qDistance - dDistance) portion of Equation (3) may be divided by the distribution approximate value for distance, becoming (qDistance - dDistance) / (distribution approximate value for distance).
- the settings of the distribution approximate value shown in FIG. 8 are merely an example, and suitable settings may be determined according to the corpus.
- the document processing apparatus 100 compares the tag structure of the query document and the tag structure of the document to be examined, and can quantify the structural similarity as a node similarity value or a document similarity value in units of node pairs. Since structural similarity search can be realized with a simple algorithm, high-speed search is possible.
- the process for acquiring the attribute value is simplified.
- a node pair that is characteristic in the corpus is corrected by the rare value so that its node similarity value becomes high. The search can therefore distinguish node pairs that are useful for determining the similarity between the query document and the document to be examined from node pairs that are not.
- the node similarity value is corrected in consideration of the dispersion of attribute values. Therefore, even if a node pair is detected as a common pair, its node similarity value becomes small when its attribute values are statistically distant, so that the accuracy of the structural similarity search can be further improved.
- a more practical structural similarity search becomes possible.
- the function of the rare correction unit described in the claims is realized by the node similarity value calculation unit 154 and the correction unit 156 in the present embodiment.
- the function of the distribution correction unit described in the claims is realized by the node similarity value calculation unit 154 and the correction unit 156 in the present embodiment.
- the function of the node similarity value display unit described in the claim is realized by the display unit 136 in this embodiment.
- the present invention can be used in a search device for a structured document file.
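The distribution correction described above (the FIG. 8 lookup followed by multiplication of the node similarity value) can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment. The function names `distribution_approximate_value` and `corrected_similarity` are hypothetical, and the way the gap between the two attribute values is measured (absolute difference divided by σ) is an assumption. The rows for 2σ, 3σ and 4σ follow the text; the boundary at 1σ between the values 1.0 and 0.5 is also an assumption, because the text does not state it. Equation (3) itself is not reproduced here; the sketch takes its result as an input.

```python
# Upper bounds (in units of sigma) paired with the distribution approximate value.
# The 2-sigma, 3-sigma and 4-sigma rows follow the text for FIG. 8.
# ASSUMPTION: the 1.0 / 0.5 boundary lies at 1 sigma (not stated in the text).
FIG8_TABLE = [
    (1.0, 1.0),  # gap < 1 sigma        -> 1.0 (assumed boundary)
    (2.0, 0.5),  # 1 sigma <= gap < 2   -> 0.5 (assumed boundary)
    (3.0, 0.3),  # 2 sigma <= gap < 3   -> 0.3
    (4.0, 0.2),  # 3 sigma <= gap < 4   -> 0.2
]
FIG8_FLOOR = 0.1  # gap >= 4 sigma


def distribution_approximate_value(q_value, d_value, sigma):
    """Return the weight for one attribute (distance, frequency or depth).

    q_value: attribute value of the common pair in the query document
    d_value: attribute value of the common pair in the document to be examined
    sigma:   standard deviation of that attribute in the corpus
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    gap = abs(q_value - d_value) / sigma  # ASSUMPTION: gap measured in sigmas
    for upper_bound, weight in FIG8_TABLE:
        if gap < upper_bound:
            return weight
    return FIG8_FLOOR


def corrected_similarity(node_similarity, weights):
    """Multiply an already computed node similarity value by each weight."""
    result = node_similarity
    for weight in weights:
        result *= weight
    return result


# Common pair C from the text: mu - 2.5 sigma versus mu + 1.8 sigma.
mu, sigma = 10.0, 2.0  # illustrative numbers, not taken from the source
weight_c = distribution_approximate_value(mu - 2.5 * sigma, mu + 1.8 * sigma, sigma)
```

With these assumed settings, the common pair C is 4.3σ apart between the two documents, so `weight_c` is 0.1 and the node similarity value of C is reduced to one tenth. A different corpus may call for different thresholds, as noted for FIG. 8.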
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/294,135 US20090132566A1 (en) | 2006-03-31 | 2007-03-28 | Document processing device and document processing method |
JP2008510879A JP4878624B2 (ja) | 2006-03-31 | 2007-03-28 | 文書処理装置および文書処理方法 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006099800 | 2006-03-31 | ||
JP2006-099800 | 2006-03-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007119567A1 (ja) | 2007-10-25 |
Family
ID=38609344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2007/056690 WO2007119567A1 (ja) | 2006-03-31 | 2007-03-28 | 文書処理装置および文書処理方法 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20090132566A1 (ja) |
JP (1) | JP4878624B2 (ja) |
WO (1) | WO2007119567A1 (ja) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013038519A1 (ja) * | 2011-09-14 | 2013-03-21 | 株式会社マイニングブラウニー | ウェブページ解析装置およびウェブページ解析用プログラム |
CN103500219A (zh) * | 2013-10-12 | 2014-01-08 | 翔傲信息科技(上海)有限公司 | 一种标签自适应精准匹配的控制方法 |
JP2014081958A (ja) * | 2014-01-20 | 2014-05-08 | Fujitsu Ltd | アノテーション付与方法、アノテーション復元方法、アノテーション付与装置及びアノテーション復元装置 |
JP2014102624A (ja) * | 2012-11-19 | 2014-06-05 | Nippon Telegr & Teleph Corp <Ntt> | キーワード関連度スコア算出装置、キーワード関連度スコア算出方法、及びプログラム |
CN115495554A (zh) * | 2022-09-23 | 2022-12-20 | 深圳今日人才信息科技有限公司 | 一种简历信息模块化的评估方法 |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8527522B2 (en) * | 2008-09-05 | 2013-09-03 | Ramp Holdings, Inc. | Confidence links between name entities in disparate documents |
US20100228738A1 (en) * | 2009-03-04 | 2010-09-09 | Mehta Rupesh R | Adaptive document sampling for information extraction |
US8983980B2 (en) * | 2010-11-12 | 2015-03-17 | Microsoft Technology Licensing, Llc | Domain constraint based data record extraction |
US9558185B2 (en) * | 2012-01-10 | 2017-01-31 | Ut-Battelle Llc | Method and system to discover and recommend interesting documents |
JP5784196B2 (ja) * | 2014-08-06 | 2015-09-24 | 株式会社東芝 | 文書マークアップ支援装置、方法、及びプログラム |
US10643031B2 (en) | 2016-03-11 | 2020-05-05 | Ut-Battelle, Llc | System and method of content based recommendation using hypernym expansion |
CN107491547B (zh) * | 2017-08-28 | 2020-11-10 | 北京百度网讯科技有限公司 | 基于人工智能的搜索方法和装置 |
US20210303773A1 (en) * | 2020-03-30 | 2021-09-30 | Oracle International Corporation | Automatic layout of elements in a process flow on a 2-d canvas based on representations of flow logic |
KR102248294B1 (ko) * | 2020-11-05 | 2021-05-04 | 주식회사 해시스크래퍼 | 동일 구조의 데이터를 추출하는 방법 및 그를 이용한 장치 |
US11934362B2 (en) * | 2021-07-22 | 2024-03-19 | EMC IP Holding Company LLC | Granular data migration |
US11809449B2 (en) | 2021-09-20 | 2023-11-07 | EMC IP Holding Company LLC | Granular data replication |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001014326A (ja) * | 1999-06-29 | 2001-01-19 | Hitachi Ltd | 構造指定による類似文書の検索装置及び検索方法 |
JP2003162518A (ja) * | 2001-11-26 | 2003-06-06 | Canon Inc | 文書種別判定方法 |
JP2003242167A (ja) * | 2002-02-19 | 2003-08-29 | Nippon Telegr & Teleph Corp <Ntt> | 構造化文書の変換ルール作成方法および装置と変換ルール作成プログラムおよび該プログラムを記録したコンピュータ読取り可能な記録媒体 |
JP2005149236A (ja) * | 2003-11-17 | 2005-06-09 | Nippon Telegr & Teleph Corp <Ntt> | ブロック自動抽出装置、ブロック自動抽出方法およびプログラム |
JP2005326970A (ja) * | 2004-05-12 | 2005-11-24 | Mitsubishi Electric Corp | 構造化文書曖昧検索装置及びそのプログラム |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7813915B2 (en) * | 2000-09-25 | 2010-10-12 | Fujitsu Limited | Apparatus for reading a plurality of documents and a method thereof |
JP2004348341A (ja) * | 2003-05-21 | 2004-12-09 | Toshiba Corp | 構造化文書処理システム、構造化文書処理方法及びプログラム |
US20050060643A1 (en) * | 2003-08-25 | 2005-03-17 | Miavia, Inc. | Document similarity detection and classification system |
-
2007
- 2007-03-28 WO PCT/JP2007/056690 patent/WO2007119567A1/ja active Application Filing
- 2007-03-28 JP JP2008510879A patent/JP4878624B2/ja not_active Expired - Fee Related
- 2007-03-28 US US12/294,135 patent/US20090132566A1/en not_active Abandoned
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013038519A1 (ja) * | 2011-09-14 | 2013-03-21 | 株式会社マイニングブラウニー | ウェブページ解析装置およびウェブページ解析用プログラム |
JP2014102624A (ja) * | 2012-11-19 | 2014-06-05 | Nippon Telegr & Teleph Corp <Ntt> | キーワード関連度スコア算出装置、キーワード関連度スコア算出方法、及びプログラム |
CN103500219A (zh) * | 2013-10-12 | 2014-01-08 | 翔傲信息科技(上海)有限公司 | 一种标签自适应精准匹配的控制方法 |
CN103500219B (zh) * | 2013-10-12 | 2017-08-15 | 翔傲信息科技(上海)有限公司 | 一种标签自适应精准匹配的控制方法 |
JP2014081958A (ja) * | 2014-01-20 | 2014-05-08 | Fujitsu Ltd | アノテーション付与方法、アノテーション復元方法、アノテーション付与装置及びアノテーション復元装置 |
CN115495554A (zh) * | 2022-09-23 | 2022-12-20 | 深圳今日人才信息科技有限公司 | 一种简历信息模块化的评估方法 |
Also Published As
Publication number | Publication date |
---|---|
US20090132566A1 (en) | 2009-05-21 |
JPWO2007119567A1 (ja) | 2009-08-27 |
JP4878624B2 (ja) | 2012-02-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07740128; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2008510879; Country of ref document: JP |
| | WWE | Wipo information: entry into national phase | Ref document number: 12294135; Country of ref document: US |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 07740128; Country of ref document: EP; Kind code of ref document: A1 |