JP2001084252A - System and method for retrieving similar document and computer-readable recording medium with similar document retrieval program recorded thereon - Google Patents

System and method for retrieving similar document and computer-readable recording medium with similar document retrieval program recorded thereon

Info

Publication number
JP2001084252A
Authority
JP
Japan
Prior art keywords
cluster
document
sentence
structure
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP25716799A
Other languages
Japanese (ja)
Inventor
Takeyuki Aikawa
Katsushi Suzuki
Yasuhiro Takayama
勇之 相川
克志 鈴木
泰博 高山
Original Assignee
Mitsubishi Electric Corp
三菱電機株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corp, 三菱電機株式会社 filed Critical Mitsubishi Electric Corp
Priority to JP25716799A priority Critical patent/JP2001084252A/en
Publication of JP2001084252A publication Critical patent/JP2001084252A/en

Abstract

PROBLEM TO BE SOLVED: To perform similar document retrieval while taking a statement structure into consideration, even when a large scale document set is defined as a retrieval object. SOLUTION: This system is provided with an inputting means 101 for inputting a retrieval statement, a statement structure analyzing means 102 which refers to a word dictionary 103 and analyzes the structure of an input retrieval statement, a document database 105 in which a clustered document is stored, a similar statement collating means 108 which refers to an ontology 109 and calculates similarity between analysis results of the input retrieval statement and cluster structure information used as an index, when retrieving a document included in a cluster of the document database, and a cluster retrieving means 104, which retrieves cluster structure information being most similar to the input retrieval statement, on the basis of the similarity calculated by the means 108 and retrieves a similar document from the clustered document in the document database which is made to correspond to the cluster structure information.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

1. Field of the Invention The present invention relates to a similar document search system, a similar document search method, and a computer-readable recording medium storing a similar document search program, for use in a help desk support system or the like.
In particular, the present invention relates to a similar document search system and the like that searches a large set of documents that have relatively short descriptions, are highly specialized, and tend to use similar words, such as dialogue records in a help desk support system.

[0002]

2. Description of the Related Art A well-known technique for finding documents similar to an input document extracts keywords by dividing each document into words and calculates the similarity between documents by a statistical method using the frequency of occurrence of the keywords. A search method built on this technique clusters a large number of documents in advance, first searches for the cluster similar to the input search sentence, and then performs a detailed search on the documents in that cluster ("Large-scale document clustering for document retrieval", Iwayama et al., The 3rd Annual Meeting of the Language Processing Society, pp. 245-248 (1997); hereinafter abbreviated as "Reference 1").

In such searches, independent words are often used as keywords. However, when the search target is a set of documents that have relatively short descriptions, are highly specialized, and tend to use similar words, such as dialogue records in a help desk support system, conventional statistical methods based on keyword occurrence frequency and the like cannot achieve sufficient search accuracy.

In particular, because the similarity between documents (and between documents and clusters) is calculated based on the frequency of appearance of independent words and the like, two sentences such as "It does not start up even when the power is turned on, and the LED lamp remains blinking." and "The LED lamp blinks and the tape does not enter." share a similar keyword set {LED, lamp, blinking, ...} yet differ in meaning, and could not be distinguished.
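
The limitation can be illustrated with a small sketch. The keyword sets below are hand-written approximations of what a keyword extractor might return for the two example sentences (an assumption, not the output of any real tokenizer), and Jaccard overlap merely stands in for the unspecified statistical similarity.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two keyword sets; word order and sentence structure are ignored."""
    return len(a & b) / len(a | b)

# Hypothetical keyword sets for the two example sentences.
s1 = {"power", "start", "LED", "lamp", "blinking"}  # "...does not start up... LED lamp remains blinking"
s2 = {"LED", "lamp", "blinking", "tape", "enter"}   # "LED lamp blinks and the tape does not enter"

shared = s1 & s2
score = jaccard(s1, s2)
print(sorted(shared), round(score, 2))
```

The shared keywords dominate the score, while nothing in the sets records which component fails to start or enter; that relation lives only in the sentence structure.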

To solve such a problem, Japanese Patent Laid-Open No. 10-171803 (hereinafter abbreviated as "Reference 2") discloses a similarity calculation method that takes sentence structure into consideration.

The similarity calculation method described in Reference 2 will be described with reference to the drawings. FIG. 20 is a diagram illustrating the conventional similarity calculation method disclosed in Reference 2.

First, an original sentence and a reference sentence are analyzed by the syntactic structure analysis unit 2001. Single sentences or clauses are read out sequentially from the syntax analysis result of each sentence by the syntax element extraction unit 2002, and their similarity is calculated by the single sentence similarity calculation unit 2003. Next, the similarity accumulating unit 2004 sequentially accumulates the similarities obtained by the above processing to obtain the sentence similarity between the original sentence and the reference sentence.

According to the similarity calculation method described above, similarity calculation can be performed in consideration of the sentence structure, but there is a problem that the calculation cost is large.

[0009]

In a conventional full-text search based on independent-word keywords, the similarity between documents (and between a document and a cluster) is calculated based on the appearance frequency of independent words and the like. Consequently, when searching a set of documents that have relatively short descriptions, are highly specialized, and tend to use similar words, such as dialogue records in a help desk support system, the search accuracy was sometimes insufficient.

[0010] In particular, two sentences that use similar keywords but differ in meaning, such as "It does not start up even when the power is turned on, and the LED lamp keeps blinking." and "The LED lamp is blinking and the tape cannot be inserted.", could not be distinguished.

Further, according to Reference 2, the similarity is calculated in consideration of the sentence structure, so the above two sentences can be distinguished. However, similarity calculation that considers sentence structure has a large calculation cost. Therefore, there is a problem that it takes a long time to search for a similar document among a large number of search target documents.

Moreover, unlike methods based on keyword frequency and the like, there was no means for calculating the similarity between a document cluster and the input search sentence, so a speed-up technique based on prior clustering, as in Reference 1, could not be applied.

SUMMARY OF THE INVENTION The present invention has been made to solve the above-described problems. Its object is to provide a similar document search system and method, and a computer-readable recording medium storing a similar document search program, that use a cluster representative structure as an index for document clusters clustered in advance and can thereby perform similar document search in consideration of sentence structure even when a large document set is the search target.

[0014]

According to a first aspect of the present invention, there is provided a similar document search system comprising: an input means for inputting a search sentence; a word dictionary for sentence structure analysis; a sentence structure analysis means for analyzing the structure of the input search sentence with reference to the word dictionary; a document database storing clustered documents; an ontology storing knowledge about concepts; a similar sentence matching means that, with reference to the ontology, calculates the similarity between the analysis result of the input search sentence and cluster structure information used as an index when searching the documents included in the clusters of the document database; and a cluster search means that searches for the cluster structure information most similar to the input search sentence based on the similarity calculated by the similar sentence matching means, and searches for a similar document in the document cluster of the document database associated with that cluster structure information.

A similar document search system according to a second aspect of the present invention further comprises index data generation means for generating, from the documents included in the clusters of the document database, the cluster structure information used as an index at the time of search, based on the result of sentence structure analysis by the sentence structure analysis means.

In a similar document search system according to a third aspect of the present invention, the index data generating means generates, as the cluster structure information, a cluster representative structure, which is a structure obtained by superimposing the analysis results of the sentence structure analyzing means for the documents included in a cluster.

A similar document search method according to a fourth aspect of the present invention includes: an input step of inputting a search sentence; a sentence structure analyzing step of analyzing the structure of the input search sentence with reference to a word dictionary for sentence structure analysis; and a cluster search step of calculating, with reference to an ontology storing knowledge about concepts, the similarity between the analysis result of the input search sentence and cluster structure information generated from the documents included in the clusters of a document database storing clustered documents, searching for the cluster structure information most similar to the input search sentence based on the calculated similarity, and searching for a similar document in the document cluster of the document database associated with that cluster structure information.

In a similar document search method according to a fifth aspect of the present invention, the clustering of the document database is performed hierarchically, and the index based on the cluster structure information forms a tree structure corresponding to the hierarchical structure of the document cluster. The cluster search step is configured to search for a similar document cluster while sequentially searching the tree structure of the index.

[0019] In a similar document search method according to a sixth aspect of the present invention, the similarity calculation process in the cluster search step includes an inter-phrase similarity calculating step of calculating the similarity of the clause nodes in the dependency structure of the analysis result, and a dependency information similarity calculating step of calculating the similarity of the dependency information between the clauses.

In a similar document search method according to a seventh aspect of the present invention, the inter-phrase similarity calculation step calculates inter-phrase similarity in consideration of modal expression.

[0021] In a similar document search method according to claim 8 of the present invention, the ontology is an IS-A dictionary that describes a higher-order relationship between concepts.

In a similar document search method according to a ninth aspect of the present invention, the ontology is a HAS-A dictionary that describes part-whole relationships between concepts.

In a similar document search method according to a tenth aspect of the present invention, the ontology is a case dictionary describing case relationships between concepts.

[0024] In a similar document search method according to claim 11 of the present invention, the ontology is a paraphrase dictionary that describes paraphrasable equivalent expressions.

According to a twelfth aspect of the present invention, there is provided a computer-readable recording medium storing a similar document search program that causes a computer to execute: an input procedure for inputting a search sentence; a sentence structure analysis procedure for analyzing the structure of the input search sentence with reference to a word dictionary for sentence structure analysis; and a cluster search procedure for calculating, with reference to an ontology storing knowledge about concepts, the similarity between the analysis result of the input search sentence and cluster structure information generated from the documents included in the clusters of a document database storing clustered documents, searching for the cluster structure information most similar to the input search sentence based on the calculated similarity, and searching for a similar document in the document cluster of the document database associated with that cluster structure information.

[0026]

DESCRIPTION OF THE PREFERRED EMBODIMENTS Embodiment 1 A similar document search system according to Embodiment 1 of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing a configuration of a similar document search system according to Embodiment 1 of the present invention. In the drawings, the same reference numerals indicate the same or corresponding parts.

In FIG. 1, reference numeral 101 denotes an input means for inputting a search sentence by keyboard input, handwritten character recognition, voice recognition, or the like. Reference numeral 102 denotes a sentence structure analysis unit that analyzes the structure of the input search sentence, and 103 is a word dictionary referred to in the analysis processing. Reference numeral 104 denotes a cluster search unit that searches the search target document set 105 for documents similar to the input search sentence.

In FIG. 2, reference numeral 106 denotes index data generating means, which generates index data 107 from the analysis result of each document in the document set 105. It is assumed that the search target document set 105 is hierarchically clustered in advance like the document cluster 1, the document cluster 2, and the document cluster 3. The index data 107 has a tree structure having a shape corresponding to the cluster hierarchy.

Further, in the figure, reference numeral 108 denotes a similar sentence matching means, which refers to an ontology 109 describing the relations between concepts and calculates the similarity between the analysis result of the input sentence and the cluster representative structures (107x to 107z) in the index data 107. The cluster search means 104 searches for the cluster representative structure most similar to the input search sentence by using the similar sentence matching means 108 while tracing the tree structure of the index data 107, and then searches for similar documents in the document cluster associated with that cluster representative structure. Reference numeral 110 denotes an output means for outputting a search result.

Next, the operation of the similar document search system according to the first embodiment will be described with reference to the drawings.

FIG. 2 is a flowchart showing a similar document search process according to the first embodiment of the present invention. Hereinafter, each step of FIG. 2 will be described with reference to FIG. 1 and other detailed drawings as appropriate.

First, in step S201, a search sentence is input from the input means 101. Here, it is assumed that the sentence "LED sometimes lights up" is input. Next, in step S202, the sentence structure analysis unit 102 analyzes the structure of the input search sentence. Hereinafter, first, the sentence structure analysis processing will be described with reference to FIGS.

FIG. 3 is a flowchart showing details of the Japanese sentence structure analysis processing. First, in step S301, a morphological analysis process is performed. This morphological analysis is performed according to a minimum cost method (Reference 3: "Morphological analysis of Japanese sentence including unregistered words", Yoshimura et al., Transactions of Information Processing Society of Japan, Vol. 30, No. 3, pp. 294-301 (1989)). Other known morphological analysis methods may also be used.

FIG. 4 shows an example of the configuration of the word dictionary 103 used for analysis. The dictionary is configured to include at least the headline 103a, the part of speech information 103b, and the concept information 103c.

An example of the result of the morphological analysis is shown at 301 in FIG. In 301, “/” (slash) indicates a morpheme delimiter. Although the actual morphological analysis result includes detailed information such as part of speech information, it is omitted here for the sake of simplicity.

Next, in step S302 of FIG. 3, unknown word processing is performed on the result of the morphological analysis. Since the analysis result 301 does not include an unknown word, the process proceeds to the next step without performing any processing. If the morphological analysis result 301 includes an unknown word or a series of single kanji, a process of estimating an unknown word range and grouping a plurality of morphemes is performed.

Next, in step S303 of FIG. 3, a phrase structure 501 is generated from the morphological analysis result 301.

FIG. 5 shows an example of a phrase structure. The phrase structure 501 includes at least an attribute list 501a indicating the properties of the phrase, independent word information 501b, and attached word information 501c. The attribute list 501a includes a dependency attribute 501d indicating a grammatical property, a receiving attribute 501e, and the like. The attached word information 501c is a list whose elements correspond to the attached word string included in the phrase. If a prefix or suffix is attached to the independent word, it is handled as an element of the attribute list 501a of the phrase structure.

FIG. 3 shows an example in which the independent word information and the attached word information are held as pointers into the dictionary 103. Alternatively, the necessary information may be extracted from the word dictionary information 103a to 103c and stored as part of the phrase structure.
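
As a rough sketch of these two data structures, the following models a dictionary entry (103a to 103c) and a phrase structure (501a to 501e). The class names, field names, and sample values are hypothetical; only the grouping of the fields follows the description above.

```python
from dataclasses import dataclass, field

@dataclass
class DictEntry:
    headline: str        # 103a: surface form of the word
    part_of_speech: str  # 103b
    concept: str         # 103c: concept label used for ontology lookups

@dataclass
class Phrase:
    independent: DictEntry                                     # 501b
    attached: list[DictEntry] = field(default_factory=list)    # 501c: attached word string
    attributes: dict[str, str] = field(default_factory=dict)   # 501a, including 501d and 501e

# Hypothetical entries for the phrase "LED is" (LED + case particle "ga").
led = DictEntry("LED", "noun", "LED")
ga = DictEntry("ga", "case particle", "")
phrase = Phrase(led, [ga], {"dependency": "ga", "receiving": "noun"})
```

Holding `DictEntry` objects by reference corresponds to the pointer variant; copying the needed fields into `Phrase` would correspond to the alternative configuration.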

Next, in step S304 of FIG. 3, dependency analysis is performed using the phrase structure 501 as an input. FIG. 6 is a detailed flowchart showing the contents of the dependency analysis process. For simplicity, a simple algorithm is shown here, but describing a more elaborate grammar and analyzing it with the CYK method (Reference 4: "Natural Language Understanding", edited by Tanaka and Tsujii, Ohmsha, 1989, Chapter 3, Syntactic Analysis Methods, [1] CYK method) does not impair the features of the present invention at all. In addition, dependency relations are generally ambiguous, and many methods have been proposed to resolve this ambiguity using a case dictionary or the like. Since the ontology in the present system includes a case dictionary and an IS-A hierarchy, these may be used for dependency disambiguation.

Hereinafter, the processing of each step in FIG. 6 will be described using a specific example shown in FIG. In step S601, the analysis stack S and the analysis buffer A are initialized. Since the specific example 501 of FIG. 3 is input, S = {} and A = {LED is, sometimes, lights up}.

Next, the flow advances to step S602 to determine the termination condition of the loop processing. Here, since the number of elements of A is 3, the process proceeds to step S603.

In step S603, the state of the stack S is determined. Here, since S = {}, the process proceeds to step S608.

In step S608, the dependency between the two leftmost elements of the analysis buffer A is determined. This determination is made according to the combination of the dependency attribute of the left clause (501d in FIG. 5) and the receiving attribute of the right clause (501e in FIG. 5). Here, the dependency attribute of "LED is" is "ga", whereas "sometimes" is an adverb, so the determination result is false and the process proceeds to step S610.

In step S610, the leftmost phrase "LED is" is pushed onto the stack, giving S = {LED is} and A = {sometimes, lights up}. The process then proceeds to step S604 via steps S602 and S603.

In step S604, a dependency determination is made between the element of the stack S and the leftmost clause of the analysis buffer A. Here, the dependency between "LED is" and "sometimes" is determined, and the result is false as in the previous step S608, so the process proceeds to step S606. Since A has 2 elements, the process proceeds from step S606 to step S608.

In step S608, the dependency between "sometimes" and "lights up" is determined; the result is true, and the flow advances to step S609.

In step S609, a dependency structure is created in which "sometimes" is a child node of "lights up". The process then passes through steps S602 and S603 again and proceeds to step S604. This time, the dependency between "LED is" on the stack and "lights up (← sometimes)" is determined, and since the result is true, the process proceeds to step S605. Thereafter, the process returns to step S602, where the termination condition is true, and the dependency analysis process ends. The dependency structure 302 in FIG. 3 is generated in this way.
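
The walkthrough above can be reproduced with a short sketch. The step comments map onto FIG. 6 as far as the trace allows; the rule table `ALLOWED`, the attribute values, the loop condition inferred for step S602, and the final fallback branch are all assumptions made for illustration, and the input is assumed to be non-empty.

```python
from dataclasses import dataclass, field

@dataclass
class Clause:
    text: str
    dep: str    # dependency attribute (501d), e.g. "ga" or "adverb"
    recv: str   # receiving attribute (501e), e.g. "verb"
    children: list["Clause"] = field(default_factory=list)

# Hypothetical rule table: (dependency attribute, receiving attribute) pairs that may attach.
ALLOWED = {("ga", "verb"), ("adverb", "verb")}

def can_depend(left: Clause, right: Clause) -> bool:
    return (left.dep, right.recv) in ALLOWED

def analyze(clauses: list[Clause]) -> Clause:
    stack: list[Clause] = []                          # S601: analysis stack S
    buf = list(clauses)                               # S601: analysis buffer A
    while len(buf) > 1 or stack:                      # S602: end when one clause is left and S is empty
        if stack and can_depend(stack[-1], buf[0]):   # S603 / S604
            buf[0].children.append(stack.pop())       # S605
        elif len(buf) >= 2:                           # S606
            if can_depend(buf[0], buf[1]):            # S608
                buf[1].children.append(buf.pop(0))    # S609
            else:
                stack.append(buf.pop(0))              # S610
        else:
            # Assumption: force-attach leftovers so the loop always terminates.
            buf[0].children.append(stack.pop())
    return buf[0]

root = analyze([
    Clause("LED is", "ga", "noun"),
    Clause("sometimes", "adverb", "adverb"),
    Clause("lights up", "end", "verb"),
])
print(root.text, [c.text for c in root.children])
```

Running it attaches "sometimes" first and "LED is" second, both as children of "lights up", in the same order as the trace.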

This completes the description of the sentence structure analysis processing in step S202 in FIG. 2. In the above description, the input is a single sentence for the sake of simplicity. However, even when the input consists of a plurality of sentences, it can be analyzed in the same way by performing the dependency determination between the final clause of one sentence and the clauses of the next sentence.

Next, the process proceeds to the cluster search step of step S203 in FIG. 2. Before describing the cluster search step, the index generation processing will be described. This index generation process is a process in which the index data generation unit 106 of FIG. 1 generates index data 107 from the search target document set 105 that has been clustered in advance. Hereinafter, the index generation processing will be described with reference to FIGS.

FIG. 7 is a flowchart showing details of the index generation processing. FIG. 8 is an example of a document cluster in the search target document set 105 shown in FIG. It is assumed that the document cluster 2 (105b) and the document cluster 3 (105c) are hierarchically configured by subdividing the document cluster 1 (105a). Although not shown in the figure, there are many other hierarchical clusters similar to the document cluster 1. Here, a virtual document cluster including the entire document set 105 is defined as a root cluster.

In FIG. 8, documents 1 to 6 (801 and 802) are the search target documents included in each document cluster. For simplicity, some of the sentences in documents 1 to 6 are omitted and indicated by "...". In addition, the sentence attached to each document cluster in FIG. 8 is shown only for explanation; it indicates the semantic content of the document cluster and plays no part in the following processing.

In the present invention, the clustering method does not matter. Strict bottom-up clustering may be performed by a well-known clustering method using the similar sentence matching described below, or top-down clustering may be performed by setting heuristics suited to the search target. Clustering may also be performed manually rather than mechanically. Hereinafter, the index generation processing will be described using the document clusters of FIG. 8 as an example.

First, in steps S701 to S704, the index structure of the most subdivided document clusters (hereinafter referred to as leaf clusters) of the hierarchical document cluster structure is generated. Step S701 determines the end condition of the repetitive processing. In step S702, each document in the leaf cluster is analyzed, using the sentence structure analyzing means 102 of FIG. 1 described above. In step S703, the analysis results of the documents are superimposed to create a cluster representative structure. FIG. 9 shows an example in which documents 4 to 6 (802 in FIG. 8) included in the document cluster 3 (105c) in FIG. 8 are superimposed.

In FIG. 9, the analysis result 901 of document 4, the analysis result 902 of document 5, and the analysis result 903 of document 6 are combined into the cluster representative structure 107z, in which common clause nodes are overlapped and attribute information such as the type of dependency is merged. In FIG. 9, "φ" as dependency information indicates that no case particle is expressed, either because a compound word is formed or because the case particle is omitted.

FIG. 10 shows a detailed configuration example of the cluster representative structure. The cluster representative structure 107z includes at least the independent word information 107a, the attribute list 107b, the dependency destination information 107c, and the dependency source information 107d. Each piece of information has frequency information recording the number of times it was superimposed from the analysis results. For example, the frequency information 107e in the figure records two appearances. The dependency destination information 107c and the dependency source information 107d each include the related clause, the dependency type, and frequency information (107f in FIG. 9). The attribute information also has frequency information 107g for each attribute.
These pieces of frequency information are used as weighting coefficients for similarity calculation in the similar sentence matching process described later. This concludes the description of the superposition processing.
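
A minimal sketch of the superposition, assuming each analysis result is given as a flat list of `(word, attributes, head word, dependency type)` tuples. That input format, the class `RepNode`, and the sample documents are hypothetical; the counters play the role of the frequency information 107e to 107g. Only dependency destinations are recorded here, since the source side can be derived from them.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RepNode:
    word_freq: int = 0                               # cf. 107e: how often the word was superimposed
    attrs: Counter = field(default_factory=Counter)  # cf. 107g: frequency per attribute
    heads: Counter = field(default_factory=Counter)  # cf. 107c/107f: (head word, dependency type)

def superimpose(analyses):
    """Merge dependency analyses, overlapping nodes that share an independent word."""
    rep: dict[str, RepNode] = {}
    for analysis in analyses:
        for word, attrs, head, dep_type in analysis:
            node = rep.setdefault(word, RepNode())
            node.word_freq += 1
            node.attrs.update(attrs)
            if head is not None:
                node.heads[(head, dep_type)] += 1
    return rep

# Hypothetical analyses of two short documents.
doc_a = [("LED", [], "blink", "ga"), ("blink", ["continuation"], None, None)]
doc_b = [("LED", [], "blink", "φ"), ("blink", ["negation"], None, None)]
rep = superimpose([doc_a, doc_b])
```

Here `rep["LED"].word_freq` is 2 because both documents contribute an "LED" node, while the two dependency types "ga" and "φ" are kept apart with a count of 1 each.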

Returning to FIG. 7, in step S704, the cluster representative structure created as described above is associated with the original cluster and used as an index. FIG. 11 is an example of index information generated from each of the leaf clusters 2 and 3. In FIG. 11, part of each document is abbreviated as "...", and the corresponding part of the analysis result is likewise abbreviated as an empty node and an unlabeled arrow. After all the leaf clusters have been indexed, the process proceeds from step S701 to step S705.

In steps S705 to S707, a cluster representative structure is created for document clusters other than leaf clusters, and the cluster representative structures are associated with each other. The associating process is performed in a bottom-up manner from the subdivided cluster (leaf cluster) side of the search target document set 105 to the composite cluster (root direction) in which a plurality of clusters are integrated.

In step S706, the cluster representative structures of the child clusters are superimposed according to the cluster hierarchy. In the case of the cluster hierarchy shown in FIG. 11, the cluster representative structure 107y, which is the index information of the document cluster 2, and the cluster representative structure 107z, which is the index information of the document cluster 3, are superimposed to create the cluster representative structure 107x (FIG. 1), which serves as the index information of the document cluster 1.

Further, in step S707, the cluster representative structure 107x is associated with the document cluster 1 as its index, and the tree structure of the index is created by linking it to 107y and 107z. The index data 107 shown in FIG. 1 is created as described above.
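
The bottom-up construction of the index tree might look like the following sketch. A `Counter` of dependency edges is used as a simplified stand-in for a cluster representative structure, recursion replaces the explicit loops of FIG. 7, and the nested-dict input format and cluster names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class IndexNode:
    name: str
    rep: Counter                     # simplified stand-in for a cluster representative structure
    children: list["IndexNode"] = field(default_factory=list)

def build_index(cluster: dict) -> IndexNode:
    """cluster is {"name", "docs"} for a leaf cluster or {"name", "children"} otherwise."""
    if "docs" in cluster:                                  # leaf cluster: S701 to S704
        rep = Counter()
        for doc in cluster["docs"]:
            rep.update(doc)                                # superimpose analysis results
        return IndexNode(cluster["name"], rep)
    kids = [build_index(c) for c in cluster["children"]]   # S705 to S707
    rep = Counter()
    for kid in kids:
        rep.update(kid.rep)                                # superimpose the child structures
    return IndexNode(cluster["name"], rep, kids)

# Hypothetical hierarchy mirroring clusters 1 to 3; each document is a list of dependency edges.
hierarchy = {"name": "cluster1", "children": [
    {"name": "cluster2", "docs": [[("LED", "blink")], [("LED", "blink"), ("tape", "enter")]]},
    {"name": "cluster3", "docs": [[("power", "turn on")]]},
]}
index = build_index(hierarchy)
```

The parent node ends up with the summed frequencies of its children, so the index tree has the same shape as the cluster hierarchy.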

Conventionally, because such index data generating means 106 and index data 107 were not provided, similarity had to be calculated against every document in order to perform a similar document search that takes sentence structure into account, which was extremely inefficient. Hereinafter, the operation of the cluster search means 104 of FIG. 1 using the above-described index structure will be described.

Returning to FIG. 2, the cluster search processing in step S203 will be described. FIG. 12 is a flowchart illustrating details of the cluster search process. First, in step S1201, Q is initialized with the input sentence analysis result (302 in FIG. 3), and N is initialized with the root node of the index tree structure.

Next, in step S1202, the similar sentence matching means 108 in FIG. 1 calculates the similarity between Q and each cluster representative structure that is a child node of N in the index. The similarity calculation will be described later in detail.

Next, in step S1204, the cluster representative structure N′ having the highest similarity to Q among these cluster representative structures is set as the new N, and the processing from step S1202 is repeated. In the document clusters and index shown in FIGS. 1 and 11, the index information 107x of the document cluster 1 is first found as a child node of the root cluster. Next, the index information 107y of the document cluster 2 is found as a child node of the index information 107x. In this way, the tree structure of the index information is searched sequentially from the root. Finally, a leaf cluster is reached, and the process ends according to the end condition in step S1202.
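
The descent through the index tree can be sketched as a greedy loop. The function `similarity` below is only a placeholder (edge-set overlap) for the similar sentence matching means 108, and the node class and sample clusters are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class IndexNode:
    name: str
    rep: set
    children: list["IndexNode"] = field(default_factory=list)

def similarity(query: set, rep: set) -> float:
    """Placeholder for the similar sentence matching means 108 (here: edge overlap)."""
    return len(query & rep) / len(query | rep) if query | rep else 0.0

def cluster_search(root: IndexNode, query: set) -> IndexNode:
    node = root                                  # S1201: N = root of the index tree
    while node.children:                         # end once a leaf cluster is reached
        # S1202 / S1204: move to the child whose representative structure is most similar
        node = max(node.children, key=lambda c: similarity(query, c.rep))
    return node

leaf2 = IndexNode("cluster2", {("LED", "blink"), ("sometimes", "blink")})
leaf3 = IndexNode("cluster3", {("power", "turn on")})
c1 = IndexNode("cluster1", leaf2.rep | leaf3.rep, [leaf2, leaf3])
root = IndexNode("root", set(), [c1])
hit = cluster_search(root, {("LED", "blink")})
```

Only one similarity calculation per child is needed at each level, which is where the saving over comparing against every document comes from.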

The operation of the similar sentence matching means 108 of FIG. 1, which calculates the similarity, will be described below with reference to the drawings. As an example, the description uses the analysis result of the input search sentence "LED blinks occasionally" and the cluster representative structure 107y (FIG. 11).

FIG. 13 is a detailed flowchart showing the contents of the similar sentence matching process. First, in step S1301, the inter-phrase similarity is calculated, and the inter-phrase similarity correspondence table shown in FIG. 14 is created. The table 1400 shown in FIG. 14 is created by calculating the similarity 1403 for every pairing (round robin) of each phrase structure (1401) in the analysis result of the input search sentence "LED blinks occasionally" with each phrase structure (1402) in the cluster representative structure 107y.

The inter-phrase similarity is calculated as follows. First, the similarity SimJ of the independent word information is calculated. For identical independent words, SimJ is set to 1. Hereinafter, unless otherwise specified, a similarity is a real number from 0 to 1, and a similarity of 1 indicates identical information.

For independent words that are not the same, the IS-A dictionary (1501) or the HAS-A dictionary (1502) shown in FIG. 15 is referenced to calculate the distance d between the independent word concepts (103c in FIG. 5), and, for example, 0.9^d is set as the value of the independent word similarity SimJ. For example, for the similarity between “LED” and “power lamp”, the distance in the dictionary of FIG. 15 is 2, so a value of 0.81 is set as SimJ.
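
The distance-based rule above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the dictionary is modeled as an undirected graph of concept links, the function names `concept_distance` and `independent_word_similarity` are hypothetical, the intermediate concept “lamp” is an assumed entry chosen so that “LED” and “power lamp” are two links apart, and returning 0 for unconnected concepts is also an assumption.

```python
from collections import deque

def concept_distance(links, source, target):
    """Breadth-first search for the number of links between two concepts."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in links.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the concepts are not connected

def independent_word_similarity(links, word_a, word_b, base=0.9):
    """SimJ: base ** d, which is 1 for identical words (d = 0).

    Unconnected concepts get 0.0 (an assumption, not stated in the text).
    """
    d = concept_distance(links, word_a, word_b)
    return 0.0 if d is None else base ** d

# Hypothetical IS-A fragment: both concepts are linked to "lamp".
IS_A = {
    "LED": ["lamp"],
    "power lamp": ["lamp"],
    "lamp": ["LED", "power lamp"],
}

sim_j = independent_word_similarity(IS_A, "LED", "power lamp")
```

With a distance of 2, the sketch yields 0.9 squared, matching the value 0.81 given in the text.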

Further, the attribute similarity SimA is obtained by referring to the attribute information of the phrase structure. SimA is assumed to be defined in advance according to the type of attribute. For example, for the “negation” attribute, which is modality information, SimA is set to 1 when the attribute is common to both phrases and to 0.1 when it is not. When there are a plurality of attributes, their attribute similarities are averaged. In this averaging, weighting is applied as appropriate using the frequency information 107g. That is, when a phrase included in the input search sentence has attribute information that appears frequently in a certain document cluster, the weight of that attribute information in the attribute similarity SimA is set higher.
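
The weighted averaging of attribute similarities might look like the following sketch. The function name `attribute_similarity`, the use of raw frequency counts as weights, and the default weight of 1 for attributes without frequency information are all assumptions made for illustration; the text does not specify the weighting formula.

```python
def attribute_similarity(per_attribute_sim, frequency=None):
    """Weighted average of per-attribute similarities (SimA).

    per_attribute_sim: {attribute name: similarity in [0, 1]}
    frequency: {attribute name: count within the document cluster};
               attributes without a count get weight 1 (an assumption).
    """
    frequency = frequency or {}
    weights = {name: frequency.get(name, 1) for name in per_attribute_sim}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(per_attribute_sim[n] * weights[n] for n in per_attribute_sim) / total

# "negation" matches (1.0); "continuation" does not (0.1).
sims = {"negation": 1.0, "continuation": 0.1}
plain = attribute_similarity(sims)                      # unweighted mean
weighted = attribute_similarity(sims, {"negation": 3})  # "negation" is frequent
```

Raising the weight of a frequent attribute pulls SimA toward that attribute's similarity, which is the behavior the paragraph describes.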

The prefix and suffix that the phrase structure has as attribute information are also reflected in the calculation of the attribute similarity. For example, a phrase carrying a corresponding suffix has a high attribute similarity with a phrase having a “continuation” attribute represented by an adjunct word string.

The inter-phrase similarity 1403 shown in FIG. 14 is the product of the independent word similarity SimJ and the attribute similarity SimA described above. Further, in creating the table 1400 shown in FIG. 14, when there is an expression that can be paraphrased between the analysis result (1401) of the input search sentence and the cluster representative structure (1402), a virtual phrase composed of the plurality of phrases corresponding to that expression is created and added to the table 1400 with an inter-phrase similarity of 1.
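
A minimal sketch of the round-robin table construction follows. The function name `phrase_similarity_table`, the toy similarity functions, and the sample values are hypothetical; real SimJ and SimA values would come from the dictionary lookup and attribute comparison described above, and paraphrase handling is omitted.

```python
from itertools import product

def phrase_similarity_table(query_phrases, cluster_phrases, sim_j, sim_a):
    """Return {(query phrase, cluster phrase): SimJ * SimA} for all pairs."""
    return {
        (q, c): sim_j(q, c) * sim_a(q, c)
        for q, c in product(query_phrases, cluster_phrases)
    }

# Toy stand-ins for the two component similarities (assumed values).
SIM_J = {("LED", "LED"): 1.0, ("blinks", "blink"): 1.0, ("blinks", "light on"): 0.81}

def toy_sim_j(q, c):
    return SIM_J.get((q, c), 0.0)

def toy_sim_a(q, c):
    return 1.0  # pretend every attribute matches

table = phrase_similarity_table(
    ["LED", "blinks"], ["LED", "light on", "blink"], toy_sim_j, toy_sim_a
)
```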

FIG. 16 is a diagram showing a configuration example of the paraphrase dictionary. This paraphrase dictionary 1600 consists of at least a paraphrase source expression 1601 and a paraphrase destination expression 1602. The paraphrase source expression 1601 includes at least a heading 1601a, a part of speech 1601b, and child node information 1601c. Similarly, the paraphrase destination expression 1602 includes a heading 1602a, a part of speech 1602b, and child node information 1602c.

What is shown in FIG. 16 is that “turn on the power” can be paraphrased into the expression “turn on P-ON” (the upper entry). Such paraphrases are often found in texts that need to be entered in a short time, such as conversation logs in help desk operations. By using the paraphrase dictionary 1600 to cope with such paraphrase expressions, it is possible to improve the accuracy of similar sentence matching.
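
One way to picture a paraphrase dictionary entry is the sketch below. The class and field names mirror the heading, part of speech, and child node information listed above, but the concrete decomposition of “turn on the power” into a verb node with a noun child is an assumption made for illustration; the actual layout of FIG. 16 is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expression:
    heading: str
    part_of_speech: str
    child_nodes: tuple = ()  # nested Expression objects

@dataclass(frozen=True)
class ParaphraseEntry:
    source: Expression       # paraphrase source expression
    destination: Expression  # paraphrase destination expression

def paraphrases_of(dictionary, expression):
    """Return every destination expression registered for a source expression."""
    return [e.destination for e in dictionary if e.source == expression]

# Hypothetical entry: "turn on the power" -> "turn on P-ON".
power_on = Expression("turn on", "verb", (Expression("power", "noun"),))
p_on = Expression("turn on", "verb", (Expression("P-ON", "noun"),))
dictionary = [ParaphraseEntry(power_on, p_on)]

found = paraphrases_of(dictionary, power_on)
```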

Returning to FIG. 13, the creation of a similar phrase correspondence tree structure in step S1302 will be described. The inter-phrase similarity correspondence table created in step S1301 generally admits a plurality of interpretations. For example, in the example shown in FIG. 14, two phrases in the cluster representative structure, “light on” and “blink”, are candidates for the phrase “blinks” in the input sentence. It is therefore necessary to enumerate all phrase pairs whose inter-phrase similarity is at or above a predetermined threshold, for example 0.5, and to select the combination that has the highest similarity. For this purpose, in step S1302, a similar phrase correspondence tree structure as shown in FIG. 17 is created.

FIG. 17 is a diagram showing an example of a similar phrase correspondence tree structure. Part (a) of the figure is the similar phrase correspondence tree structure, and part (b) is a conceptual explanatory diagram of the correspondence. Each node of the similar phrase correspondence tree structure corresponds to a correspondence between a phrase in the analysis result of the input search sentence and a phrase of the cluster representative structure. That is, the correspondence 1702a between the phrase “LED is” and the phrase “LED [Ga]” corresponds to the node 1701a of the similar phrase correspondence tree structure. Because there are two phrases of the cluster representative structure that correspond to the phrase “blinks”, namely “light on” and “blink”, the nodes 1701b and 1701c are created.

The similar phrase correspondence tree structure of FIG. 17 is created from the table 1400 of FIG. 14 by the following procedure. First, items having a similarity greater than or equal to a predetermined threshold are searched sequentially, starting from the first row. In the case of FIG. 14, the correspondence between “LED is” and “LED [Ga]” is found, so the node 1701a is created. Next, the same processing is repeated for the second and subsequent rows, expanding the possible correspondences into a tree structure. When the expansion is complete, the similarity information 1701d and 1701e is set at the respective leaf nodes.
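
The row-by-row expansion can be sketched by enumerating the root-to-leaf paths of the tree directly. The function name `expand_correspondences` is hypothetical, the similarity of 0.6 for “light on” is an assumed value, and two behaviors are assumptions not settled by the text: a query phrase with no candidate above the threshold is skipped, and the same cluster phrase may be chosen for more than one query phrase.

```python
def expand_correspondences(table, query_phrases, threshold=0.5):
    """Enumerate root-to-leaf paths of the correspondence tree.

    Each path is a list of (query phrase, cluster phrase, similarity)
    with one entry per query phrase that has at least one candidate
    at or above the threshold.
    """
    paths = [[]]
    for q in query_phrases:
        candidates = [
            (q, c, s) for (q2, c), s in table.items()
            if q2 == q and s >= threshold
        ]
        if not candidates:
            continue  # no counterpart for this phrase (assumed to be skipped)
        # Branch: every existing path is extended by every candidate.
        paths = [path + [cand] for path in paths for cand in candidates]
    return paths

table = {
    ("LED is", "LED [Ga]"): 1.0,
    ("LED is", "light on"): 0.0,
    ("LED is", "blink"): 0.0,
    ("blinks", "LED [Ga]"): 0.0,
    ("blinks", "light on"): 0.6,  # assumed value
    ("blinks", "blink"): 1.0,
}
paths = expand_correspondences(table, ["LED is", "blinks"])
```

Each returned path corresponds to one leaf of the tree, so the two paths here play the role of the leaves that carry the similarity information 1701d and 1701e.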

The similarity information stores a phrase similarity average value simN and a dependency similarity average value simL. The phrase similarity average value is the average of the inter-phrase similarities from the root node to the leaf node on the similar phrase correspondence tree structure. In this averaging, weighting is applied as appropriate using the frequency information 107e. That is, when a phrase included in the input search sentence appears frequently in a certain document cluster, the weight of that phrase in simN is increased.

Returning to FIG. 13, in step S1303, the dependency similarity average value simL is calculated by the following procedure. In the analysis result of the input search sentence, the dependency destination of each phrase is uniquely determined. Therefore, at each node of the similar phrase correspondence tree structure of FIG. 17, it is checked whether dependency information similar to the dependency information of the phrase of the input search sentence exists in the cluster representative structure. When it exists, the dependency similarity simL0 of the node is set, weighted by the frequency information 107f. On the other hand, when there is no similar dependency information, 1 is set as the dependency dissimilarity simL1 of the node. The dependency similarity average value simL is calculated from the dependency similarity simL0 and the dependency dissimilarity simL1 from the root node to the leaf node on the similar phrase correspondence tree structure by the following equation (A).

simL = (average of dependency similarity simL0) / {(number of nodes where simL0 is not 0) + (number of nodes where simL1 is not 0)} Expression (A)

For example, in the example of FIG. 17, simL0 = 1 is set at the node 1701a because dependency information similar to the dependency information “Ga” from the phrase “LED is” to “blinks” is also included in the cluster representative structure. On the other hand, at nodes 1701b and 1701c, since there is no end-of-sentence word, neither simL0 nor simL1 is set.

FIG. 18 is a diagram showing an example in which a value is set for the dependency dissimilarity simL1. FIG. 18 shows a case where the input search sentence is “LED blinks orange”. In FIG. 18, at the node 1801b, the phrase “orange” in the input search sentence depends on “blinks”, whereas in the cluster representative structure the phrase “orange” depends on “LED”. Since the destination phrases differ, 1 is set for simL1.

When calculating the average of the dependency similarity simL0, weighting is performed using the case dictionary 1900 shown in FIG. 19. This case dictionary 1900 describes case relations between concepts. A dependency relation described in the case dictionary 1900 is judged to be important, and the weight of its dependency similarity in simL0 is increased. Each entry of the case dictionary 1900 may also be given a degree of importance, and the dependency similarity simL0 may be weighted using that degree of importance.

Finally, in step S1304 of FIG. 13, the maximum similarity SimMAX is obtained. From the phrase similarity average value simN and the dependency similarity average value simL calculated above, the similarity for each piece of similarity information is calculated by the following equation (B).

Similarity = α × Phrase similarity + β × Dependency similarity Expression (B)

The similarity is calculated for every piece of similarity information in the similar phrase correspondence tree structure, and the largest value is adopted as the output SimMAX of similar sentence matching. This concludes the description of the operation of the similar sentence matching means 108 of FIG. 1.
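
Expression (B) and the selection of SimMAX can be sketched as below. The function names are hypothetical, and the coefficient values α = β = 0.5 are an assumption chosen only so the example runs; the text does not give values for α and β.

```python
def combined_similarity(sim_n, sim_l, alpha=0.5, beta=0.5):
    """Expression (B): alpha * phrase similarity + beta * dependency similarity."""
    return alpha * sim_n + beta * sim_l

def max_similarity(leaf_infos, alpha=0.5, beta=0.5):
    """SimMAX: the largest combined similarity over all leaf nodes.

    leaf_infos: iterable of (simN, simL) pairs, one per leaf.
    """
    return max(combined_similarity(n, l, alpha, beta) for n, l in leaf_infos)

# Two leaves with assumed (simN, simL) values.
leaves = [(0.8, 1.0), (1.0, 1.0)]
sim_max = max_similarity(leaves)
```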

Note that the above-described cluster search method is effective even when the search target document set 105 in FIG. 1 is not hierarchically clustered. That is, when there are document clusters 1 to 3 having no hierarchical structure, the entire document set can be considered as virtual document cluster 0. Then, since the child clusters of the document cluster 0 can be regarded as the document clusters 1 to 3, the above-described search method for the hierarchically clustered document set can be applied as it is.
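
The virtual root idea can be illustrated with a short sketch. The `Cluster` class, the function names, and the greedy rule of always following the best-scoring child until a leaf is reached are assumptions made for illustration; the `score` argument stands in for similar sentence matching against each cluster's index information, and the actual end condition of step S1202 is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    children: list = field(default_factory=list)

def with_virtual_root(flat_clusters):
    """Wrap non-hierarchical clusters under a virtual document cluster 0."""
    return Cluster("document cluster 0", list(flat_clusters))

def search_cluster(root, score):
    """Descend from the root, always following the best-scoring child."""
    current = root
    while current.children:
        current = max(current.children, key=score)
    return current

flat = [Cluster(f"document cluster {i}") for i in (1, 2, 3)]
# Assumed similarity scores for the input search sentence.
scores = {"document cluster 1": 0.4, "document cluster 2": 0.9, "document cluster 3": 0.2}

best = search_cluster(with_virtual_root(flat), lambda c: scores[c.name])
```

Because the same `search_cluster` routine handles both the flat and the hierarchical case, no separate search path is needed for document sets that were not clustered hierarchically.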

Returning to FIG. 2, in step S204, the search result is output by the output means 110 of FIG. 1. Since the cluster search means 104 has sufficiently narrowed down the search target documents and reduced their number, the similar sentence matching means 108 can calculate the similarity for all of the remaining documents, and the documents can be displayed in a predetermined format in descending order of similarity.

As described above, in the first embodiment, the index data generating means 106 generates the cluster representative structure as the index data 107, and the similar sentence matching means 108 can calculate the similarity between the cluster representative structure and the input search sentence. The cluster search means 104 can therefore narrow down the search target document clusters while sequentially referring to the index tree structure. Conventional similar document search that takes sentence structure into account provided no such index or search means, so similar sentence matching had to be performed against the entire document set, which was time-consuming. According to the first embodiment, the search target can be narrowed down using the tree-structured index, so that high-speed processing is possible.

As described above, the similar document search system according to the first aspect of the present invention comprises: input means for inputting a search sentence; a word dictionary for sentence structure analysis; sentence structure analysis means for analyzing the structure of the input search sentence with reference to the word dictionary; a document database storing clustered documents; an ontology storing knowledge about concepts; similar sentence matching means for calculating, with reference to the ontology, the similarity between the analysis result of the input search sentence and cluster structure information that is used as an index when searching the documents included in a cluster of the document database; and cluster search means for searching, based on the similarity calculated by the similar sentence matching means, for the cluster structure information most similar to the input search sentence, and for searching for similar documents in the document cluster in the document database corresponding to that cluster structure information. Because the cluster structure information is used as an index for document clusters that have been clustered in advance, a similar document search that takes sentence structure into account can be performed even when a large document set is to be searched.

As described above, the similar document search system according to the second aspect of the present invention further comprises index data generating means for generating, from the documents included in a cluster of the document database and based on the sentence structure analysis result of the sentence structure analysis means, cluster structure information that is used as an index at search time. By using the cluster structure information as an index, a similar document search that takes sentence structure into account can be performed even when a large document set is to be searched.

As described above, in the similar document search system according to the third aspect of the present invention, the index data generating means generates, as the cluster structure information, a cluster representative structure, which is a structure obtained by superimposing the analysis results of the sentence structure analysis means for the documents included in a cluster. By using the cluster structure information as an index, a similar document search that takes sentence structure into account can be performed even when a large document set is to be searched.

As described above, the similar document search method according to the fourth aspect of the present invention comprises: an input step of inputting a search sentence; a sentence structure analysis step of analyzing the structure of the input search sentence with reference to a word dictionary for sentence structure analysis; and a cluster search step of calculating, with reference to an ontology storing knowledge about concepts, the similarity between the analysis result of the input search sentence and cluster structure information generated from the documents included in a cluster of a document database storing clustered documents, searching, based on the calculated similarity, for the cluster structure information most similar to the input search sentence, and searching for similar documents in the document cluster in the document database associated with that cluster structure information. By using the cluster structure information as an index, a similar document search that takes sentence structure into account can be performed even when a large document set is to be searched.

In the similar document search method according to the fifth aspect of the present invention, as described above, the clustering of the document database is performed hierarchically, the index based on the cluster structure information forms a tree structure corresponding to the hierarchical structure of the document clusters, and the cluster search step searches for similar document clusters while sequentially searching the tree structure of the index. Therefore, even when a large document set is to be searched, a similar document search that takes sentence structure into account can be performed.

As described above, in the similar document search method according to the sixth aspect of the present invention, the similarity calculation processing in the cluster search step includes an inter-phrase similarity calculation step of calculating the similarity of phrase nodes in the dependency structure of the analysis result and a dependency information similarity calculation step of calculating the similarity of dependency information between phrases. Therefore, even when a large document set is to be searched, a similar document search that takes sentence structure into account can be performed.

In the similar document search method according to the seventh aspect of the present invention, as described above, since the inter-phrase similarity calculation step calculates inter-phrase similarity in consideration of modal expression, a large-scale document Even when a set is to be searched, it is possible to perform a similar document search in consideration of the sentence structure.

In the similar document search method according to claim 8 of the present invention, as described above, since the ontology is an IS-A dictionary that describes the upper-lower relationships between concepts, even when a large document set is to be searched, it is possible to perform a similar document search in consideration of the sentence structure.

In the similar document search method according to the ninth aspect of the present invention, as described above, since the ontology is a HAS-A dictionary that describes the whole-part relationships between concepts, even when a large document set is to be searched, it is possible to perform a similar document search in consideration of the sentence structure.

In the similar document search method according to the tenth aspect of the present invention, as described above, since the ontology is a case dictionary describing case relations between concepts, a large document set is searched. Even in this case, there is an effect that a similar document search can be performed in consideration of the sentence structure.

In the similar document search method according to the eleventh aspect of the present invention, as described above, since the ontology is a paraphrase dictionary in which paraphrasable equivalent expressions are described, even when a large document set is to be searched, it is possible to perform a similar document search in consideration of the sentence structure.

As described above, the computer-readable recording medium according to the twelfth aspect of the present invention records a similar document search program comprising: an input procedure of inputting a search sentence; a sentence structure analysis procedure of analyzing the structure of the input search sentence with reference to a word dictionary for sentence structure analysis; and a cluster search procedure of calculating, with reference to an ontology storing knowledge about concepts, the similarity between the analysis result of the input search sentence and cluster structure information generated from the documents included in a cluster of a document database storing clustered documents, searching, based on the calculated similarity, for the cluster structure information most similar to the input search sentence, and searching for similar documents in the document cluster in the document database associated with that cluster structure information. Because the cluster structure information is used as an index for the document clusters, a similar document search that takes sentence structure into account can be performed even when a large document set is to be searched.

[Brief description of the drawings]

FIG. 1 is a diagram showing an overall configuration of a similar document search system according to Embodiment 1 of the present invention.

FIG. 2 is a flowchart showing an operation of the similar document search system according to the first embodiment of the present invention.

FIG. 3 is a flowchart showing details of a Japanese sentence analysis process of the similar document search system according to the first embodiment of the present invention.

FIG. 4 is a diagram showing a configuration of a word dictionary of the similar document search system according to the first embodiment of the present invention.

FIG. 5 is a diagram showing an example of a phrase structure in the similar document search system according to the first embodiment of the present invention.

FIG. 6 is a flowchart showing details of a dependency analysis process of the similar document search system according to the first embodiment of the present invention.

FIG. 7 is a flowchart showing details of an index generation process of the similar document search system according to the first embodiment of the present invention.

FIG. 8 is a diagram showing an example of a document cluster in the similar document search system according to the first embodiment of the present invention.

FIG. 9 is a diagram showing an example of a superposition structure in the similar document search system according to the first embodiment of the present invention.

FIG. 10 is a diagram showing a detailed configuration example of a cluster representative structure in the similar document search system according to the first embodiment of the present invention.

FIG. 11 is a diagram showing an example of index data in the similar document search system according to the first embodiment of the present invention.

FIG. 12 is a flowchart showing details of a cluster search process of the similar document search system according to the first embodiment of the present invention.

FIG. 13 is a flowchart showing details of a similar sentence matching process of the similar document search system according to the first embodiment of the present invention.

FIG. 14 is a diagram showing an example of a phrase similarity correspondence table in the similar document search system according to the first embodiment of the present invention.

FIG. 15 is a diagram showing a configuration example of the ontology IS-A dictionary and HAS-A dictionary of the similar document search system according to the first embodiment of the present invention.

FIG. 16 is a diagram showing a configuration example of an ontology paraphrase dictionary of the similar document search system according to the first embodiment of the present invention.

FIG. 17 is a diagram showing an example of a similar phrase correspondence tree structure of the similar document search system according to the first embodiment of the present invention.

FIG. 18 is a diagram showing an example of a similar phrase correspondence tree structure (simL1 = 1).

FIG. 19 is a diagram showing an example of an ontology case dictionary of the similar document search system according to the first embodiment of the present invention.

FIG. 20 is a diagram showing a conventional similarity calculation method.

[Explanation of symbols]

101 input means, 102 sentence structure analyzing means, 103 word dictionary, 104 cluster search means, 105 search document set, 106 index data generation means, 107 index data, 108 similar sentence matching means, 109 ontology, 110 output means.

Continued from the front page
(72) Inventor: Katsushi Suzuki, 2-3-2 Marunouchi 2-chome, Chiyoda-ku, Tokyo, Mitsubishi Electric Corporation
F-term (reference): 5B075 ND03 NR02 NR12 NS01 PR06 QM08 QP03 QS01

Claims (12)

    [Claims]
  1. A similar document search system comprising: input means for inputting a search sentence; a word dictionary for sentence structure analysis; sentence structure analysis means for analyzing the structure of the input search sentence with reference to the word dictionary; a document database storing clustered documents; an ontology storing knowledge about concepts; similar sentence matching means for calculating, with reference to the ontology, a similarity between the analysis result of the input search sentence and cluster structure information used as an index when searching the documents included in a cluster of the document database; and cluster search means for searching, based on the similarity calculated by the similar sentence matching means, for the cluster structure information most similar to the input search sentence, and for searching for a similar document in the document cluster in the document database associated with that cluster structure information.
  2. The similar document search system according to claim 1, further comprising index data generating means for generating, from the documents included in a cluster of the document database, cluster structure information used as an index at search time, based on the sentence structure analysis result of the sentence structure analysis means.
  3. The similar document search system according to claim 2, wherein the index data generating means generates, as the cluster structure information, a cluster representative structure that is a structure obtained by superimposing the analysis results of the sentence structure analysis means for the documents included in a cluster.
  4. A similar document search method comprising: an input step of inputting a search sentence; a sentence structure analysis step of analyzing the structure of the input search sentence with reference to a word dictionary for sentence structure analysis; and a cluster search step of calculating, with reference to an ontology storing knowledge about concepts, a similarity between the analysis result of the input search sentence and cluster structure information generated from the documents included in a cluster of a document database storing clustered documents, searching, based on the calculated similarity, for the cluster structure information most similar to the input search sentence, and searching for a similar document in the document cluster in the document database associated with that cluster structure information.
  5. The similar document search method according to claim 4, wherein the clustering of the document database is performed hierarchically, the index based on the cluster structure information is configured to form a tree structure corresponding to the hierarchical structure of the document clusters, and the cluster search step searches for similar document clusters while sequentially searching the tree structure of the index.
  6. The similar document search method according to claim 5, wherein the similarity calculation processing of the cluster search step includes an inter-phrase similarity calculation step of calculating the similarity of phrase nodes in the dependency structure of the analysis result, and a dependency information similarity calculation step of calculating the similarity of dependency information between phrases.
  7. The similar document search method according to claim 6, wherein the inter-phrase similarity calculation step calculates the inter-phrase similarity in consideration of modal expressions.
  8. The similar document search method according to claim 4, wherein the ontology is an IS-A dictionary describing upper-lower relationships between concepts.
  9. The similar document search method according to claim 4, wherein the ontology is a HAS-A dictionary describing whole-part relationships between concepts.
  10. The similar document search method according to claim 4, wherein the ontology is a case dictionary describing case relations between concepts.
  11. The similar document search method according to claim 4, wherein the ontology is a paraphrase dictionary that describes paraphrasable equivalent expressions.
  12. A computer-readable recording medium on which is recorded a similar document search program comprising: an input procedure of inputting a search sentence; a sentence structure analysis procedure of analyzing the structure of the input search sentence with reference to a word dictionary for sentence structure analysis; and a cluster search procedure of calculating, with reference to an ontology storing knowledge about concepts, a similarity between the analysis result of the input search sentence and cluster structure information generated from the documents included in a cluster of a document database storing clustered documents, searching, based on the calculated similarity, for the cluster structure information most similar to the input search sentence, and searching for a similar document in the document cluster in the document database associated with that cluster structure information.
JP25716799A 1999-09-10 1999-09-10 System and method for retrieving similar document and computer-readable recording medium with similar document retrieval program recorded thereon Pending JP2001084252A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP25716799A JP2001084252A (en) 1999-09-10 1999-09-10 System and method for retrieving similar document and computer-readable recording medium with similar document retrieval program recorded thereon

Publications (1)

Publication Number Publication Date
JP2001084252A (en) 2001-03-30

Family

ID=17302628

Family Applications (1)

Application Number Title Priority Date Filing Date
JP25716799A Pending JP2001084252A (en) 1999-09-10 1999-09-10 System and method for retrieving similar document and computer-readable recording medium with similar document retrieval program recorded thereon

Country Status (1)

Country Link
JP (1) JP2001084252A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7272595B2 (en) 2002-09-03 2007-09-18 International Business Machines Corporation Information search support system, application server, information search method, and program product
US7779024B2 (en) * 2005-05-26 2010-08-17 International Business Machines Corporation Using ontological relationships in a computer database
JP2007334402A (en) * 2006-06-12 2007-12-27 Hitachi Ltd Server, system and method for retrieving clustered vector data
JP2008134954A (en) * 2006-11-29 2008-06-12 Canon Inc Information processing device, its control method, and program
US8001122B2 (en) * 2007-12-12 2011-08-16 Sun Microsystems, Inc. Relating similar terms for information retrieval
WO2016006276A1 (en) * 2014-07-10 2016-01-14 日本電気株式会社 Index generation device and index generation method
US10437803B2 (en) 2014-07-10 2019-10-08 Nec Corporation Index generation apparatus and index generation method
JP2017201478A (en) * 2016-05-06 2017-11-09 日本電信電話株式会社 Keyword evaluation device, similarity evaluation device, search device, evaluate method, search method, and program
