WO2019230465A1 - Similarity assessment device, method therefor, and program - Google Patents

Similarity assessment device, method therefor, and program Download PDF

Info

Publication number
WO2019230465A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
distance score
distance
words
Prior art date
Application number
PCT/JP2019/019829
Other languages
French (fr)
Japanese (ja)
Inventor
Katsuhito Bessho (別所 克人)
Hisako Asano (浅野 久子)
Junji Tomita (富田 準二)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Publication of WO2019230465A1 publication Critical patent/WO2019230465A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/90 Details of database functions independent of the retrieved data types

Definitions

  • the present invention relates to a similarity evaluation apparatus that evaluates the similarity between two texts A and B, a method thereof, and a program.
  • there are methods described in Non-Patent Document 1 and Non-Patent Document 2 for generating a concept base, that is, a set of pairs of a word and a vector representing the concept of that word.
  • both of these methods generate word vectors from a corpus as input, arranging the vectors so that semantically close words have close vectors.
  • the generation algorithms are based on the distributional hypothesis that the concept of each word can be estimated from the appearance pattern (surrounding distribution) of the words around that word in the corpus.
  • the distance representing the similarity between the texts can be calculated using the concept base generated by these methods.
  • a vector of the text is generated by synthesizing a vector of words in the text (for example, taking the centroid of the word vector).
  • the distance between texts is calculated as the distance between corresponding text vectors.
  • for example, for text A "I lost my mobile phone at the company." and text B "I dropped my commuter pass at the station.", the set of words in A becomes {company, mobile phone, lose} (FIG. 3),
  • and the set of words in B becomes {station, commuter pass, drop} (FIG. 4).
  • pairs consisting of a word in A and a word in B (e.g., (company, station), (mobile phone, commuter pass), (lose, drop)) are far apart in meaning,
  • but the relationships between the words in A and the relationships between the words in B are close. As shown in FIG. 2, it is then useful to judge the similarity between A and B to be high, for example when searching a database listing pairs of a "problem" and a "solution".
  • the present invention has been made to solve the above problem.
  • an object of the present invention is to provide a similarity evaluation apparatus, a method, and a program that evaluate the similarity between A and B as high when the relationships between the words in A and the relationships between the words in B are close, even if the meanings of the words in A and the words in B are far apart.
  • to solve the above problem, according to one aspect of the present invention, the similarity evaluation apparatus includes: a concept base storing a set of pairs of a word and a vector representing the concept of the word; word division means for dividing a text into words; word set specifying means that, of the word sets of two texts, takes the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}; injection determining means for determining an injection φ from {X_1, ..., X_m} to {Y_1, ..., Y_n} that maps each element X_i of {X_1, ..., X_m} to Y_φ_i; injection distance score calculating means that, with V(Z) denoting the vector in the concept base corresponding to a word Z, calculates, for every element pair X_i, X_j (i < j) in {X_1, ..., X_m}, the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates the sum of these distances over all element pairs as the distance score of the injection φ; and inter-text distance score calculating means that takes the minimum of all the distance scores corresponding to all the injections as the distance score of the two texts.
  • according to another aspect, the similarity evaluation apparatus includes: a concept base storing a set of pairs of a word and a vector representing the concept of the word; word division means for dividing a text into words; word set specifying means that, of the word set of a query text and the word set of each of one or more search target texts, takes the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}; injection determining means for determining an injection φ from {X_1, ..., X_m} to {Y_1, ..., Y_n} that maps each element X_i of {X_1, ..., X_m} to Y_φ_i; injection distance score calculating means that calculates, for every element pair X_i, X_j (i < j), the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates the sum of these distances divided by the number of element pairs X_i, X_j (i < j) as the distance score of the injection φ; and inter-text distance score calculating means that takes the minimum of all the distance scores corresponding to all the injections for a search target text as the distance score between the query text and that search target text, and generates an evaluation result using the distance scores between the query text and each of the one or more search target texts.
  • FIG. 1 and FIG. 2 are diagrams showing examples of target texts whose similarity is evaluated.
  • FIG. 6 is a functional block diagram of the similarity evaluation apparatus according to the first embodiment, and FIG. 7 is a diagram showing an example of a concept base.
  • FIG. 9 is a diagram showing an example of an injection, and FIG. 10 is a diagram showing an example of the processing flow of the pre-processing of the similarity evaluation apparatus according to the second embodiment.
  • a certain class of concept bases has the property that the difference vectors between the word vectors of word pairs having the same relationship are almost identical. That is, with V(Z) denoting the vector of a word Z, for a word pair (a, b) and a word pair (c, d) having the same relationship, V(b) - V(a) ≈ V(d) - V(c) holds.
  • in general there are several correspondences φ between the elements of the word set {X_1, ..., X_m} of one text and the elements of the word set {Y_1, ..., Y_n} of the other text.
  • writing Y_φ_i for the element of the word set {Y_1, ..., Y_n} that corresponds to the element X_i of {X_1, ..., X_m} under a given φ (the subscript notation A_B means A with subscript B), even if V(X_i) and V(Y_φ_i) are far apart, when the relationship between any element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are close, the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i) is small, and hence the distance score of φ is small.
  • therefore, if the inter-text distance score, that is, the minimum of the distance scores of φ, is small, then under the φ attaining that minimum the relationship between any element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are close, so the similarity between the texts can be evaluated as high.
  • in that case, the list of word vectors V(X_1), ..., V(X_m) can be translated so as to almost overlap the list of word vectors V(Y_φ_1), ..., V(Y_φ_m).
  • for the example texts A and B, the vectors of the words in A and the vectors of the words in B are far from each other, as shown in FIG. 5. Under the injection φ: company → station, mobile phone → commuter pass, lose → drop, the distance between V(mobile phone) - V(company) and V(commuter pass) - V(station), the distance between V(lose) - V(company) and V(drop) - V(station), and the distance between V(lose) - V(mobile phone) and V(drop) - V(commuter pass) are all small, so the distance score of φ is small. As a result, the inter-text distance score becomes small, and the similarity between texts A and B can be evaluated as high.
  • FIG. 6 is a configuration example of the similarity evaluation apparatus according to the present embodiment.
  • the similarity evaluation apparatus includes a concept base 106, a word dividing unit 101, a word set specifying unit 102, an injection determination unit 103, an injection distance score calculation unit 104, and an inter-text distance score calculation unit 105.
  • the similarity evaluation device takes two texts as input, evaluates the similarity between the two texts, and outputs an evaluation result.
  • the similarity evaluation device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU: Central Processing Unit), a main storage device (RAM: Random Access Memory), and the like.
  • the similarity evaluation device executes each process under the control of the central processing unit. Data input to the similarity evaluation device and data obtained in each process are stored in, for example, a main storage device, and the data stored in the main storage device is read out to the central processing unit as necessary. Used for other processing.
  • Each processing means of the similarity evaluation apparatus may be at least partially configured by hardware such as an integrated circuit.
  • Each storage unit included in the similarity evaluation device can be configured by a main storage device such as RAM (Random Access Memory) or middleware such as a relational database or a key-value store.
  • each storage unit does not necessarily have to be provided inside the similarity evaluation device; it may be configured by an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, and provided outside the similarity evaluation device.
  • the concept base 106 stores a set of pairs of a word and a vector representing the concept of the word.
  • FIG. 7 is an example of the concept base 106.
  • the concept base 106 is generated by, for example, the method of Non-Patent Document 1 or Non-Patent Document 2.
  • the vector of each word is a p-dimensional vector, and vectors of words that are semantically close are arranged nearby.
  • “near” and “far” mean the distance between vectors (for example, Euclidean distance or its square).
  • in the concept base 106, only content words such as nouns, verbs, and adjectives may be registered, or words of other parts of speech may also be registered. In the present embodiment, only content words are registered.
  • words may be registered in the concept base 106 in their base (dictionary) form and looked up by the base form, or all inflected forms may be registered and looked up by the inflected form appearing in the text. In the present embodiment, lookup is performed using the base form.
  • FIG. 8 is a diagram illustrating an example of a processing routine of the similarity evaluation apparatus.
  • each means of the similarity evaluation apparatus will be described by describing the processing contents of each step in FIG.
  • the processing routine of FIG. 8 is a routine for evaluating the similarity between A and B with two texts A and B as inputs. As an example, take texts A and B mentioned in the problem to be solved by the invention.
  • in the processing target text G determination step S11, the word division means 101 takes the input texts A and B as input; if an unprocessed text remains among the input texts A and B, it selects the text to be processed, sets the selected text as G, and the process proceeds to S12. If there is no unprocessed text, the process proceeds to S13.
  • in the word division step S12, the word division means 101 divides the text G into words and outputs the result. Specifically, it performs morphological analysis of the text G and obtains the set of distinct words constituting the text G (each word counts as a single element no matter how many times it appears in the text G). Only content words such as nouns, verbs, and adjectives may be used as words, or words of other parts of speech may be added; in the present embodiment, only content words are used. Further, in the present embodiment, inflected forms are converted into the base form before being used as elements of the word set. After the process is completed, the process proceeds to S11.
  • when the text G is "I lost my mobile phone at the company.", the result of the word division step S12 is {company, mobile phone, lose}.
  • when the text G is "I dropped my commuter pass at the station.", the result of the word division step S12 is {station, commuter pass, drop}.
  • in the word set specifying step S13, the word set specifying means 102 takes the word sets of the two texts obtained in S12 as input, outputs the one whose number of elements is not larger as {X_1, ..., X_m}, and outputs the other as {Y_1, ..., Y_n}. After the process is completed, the process proceeds to S14.
  • since both word sets have three elements, either one may be taken as {X_1, ..., X_m}. Here, X_1 = company, X_2 = mobile phone, X_3 = lose, and Y_1 = station, Y_2 = commuter pass, Y_3 = drop.
  • in the injection φ determination step S14, the injection determining means 103 takes the word sets {X_1, ..., X_m} and {Y_1, ..., Y_n} as input; among the injections from {X_1, ..., X_m} to {Y_1, ..., Y_n} that map each element X_i to Y_φ_i, if an unprocessed injection remains, it selects the injection to be processed, outputs the selected injection as φ, and the process proceeds to S15. If there is no unprocessed injection, the process proceeds to S16.
  • in the injection distance score calculation step S15, the injection distance score calculation means 104 takes the injection φ determined as the processing target by the injection determining means 103 as input; with V(Z) denoting the vector in the concept base 106 corresponding to a word Z, it retrieves from the concept base 106 the vectors V(X_1), ..., V(X_m) corresponding to {X_1, ..., X_m} and the vectors V(Y_φ_1), ..., V(Y_φ_m) corresponding to {Y_φ_1, ..., Y_φ_m}.
  • in the inter-text distance score calculation step S16, the inter-text distance score calculating means 105 takes as input all the distance scores corresponding to all the injections calculated by the injection distance score calculating means 104 (in the case of FIG. 9, the six distance scores corresponding to the six injections), evaluates the minimum of all the distance scores as the distance score of the two texts A and B, and outputs an evaluation result. For example, (i) the distance score itself may be output as the evaluation result, or (ii) if the distance score of texts A and B is below (or not above) a certain threshold, an evaluation result indicating that texts A and B are similar may be output, and otherwise an evaluation result indicating that they are not similar may be output. After the processing is completed, the processing routine of FIG. 8 ends.
  • the distance score of the example texts A and B is close to 0, and the texts A and B are evaluated to be similar.
  • FIG. 10 is a diagram illustrating an example of the pre-processing routine of the similarity evaluation device.
  • FIG. 11 is a diagram illustrating an example of the search processing routine of the similarity evaluation device. FIGS. 10 and 11 show the processing for the case where, given a database containing a list of pairs of a "problem" and a "solution" as in FIG. 2, the text in each row of the "problem" column is the search target,
  • a text corresponding to a "problem", such as the one in FIG. 1, is input as a query,
  • and search target texts having high similarity to the query text are to be obtained.
  • when such a search target text is found, that search target text and the corresponding "solution" text are returned.
  • FIG. 10 is a search pre-processing routine performed using a list of texts to be searched as input
  • FIG. 11 is a search process routine performed using query text as an input.
  • in the processing target text H determination step S21, the word division means 101 takes the list of search target texts (for example, the list of "problems" in FIG. 2) as input; if an unprocessed search target text remains among the search target texts,
  • it selects the search target text to be processed from the unprocessed search target texts, sets the selected text as H, and the process proceeds to S22. If there is no unprocessed search target text, the processing routine of FIG. 10 ends.
  • in the word division step S22, the word division means 101 divides the search target text H determined in S21 into words and adds the resulting word set to the aforementioned list in association with the search target text H.
  • the list is stored in a storage unit (not shown).
  • the processing contents are the same as the processing contents of the word dividing means 101 in the word dividing step S12 of FIG. After the process is completed, the process proceeds to S21.
  • the word division means 101 receives the query text as input, divides the query text into words, and outputs it.
  • the processing contents are the same as the processing contents of the word dividing means 101 in the word dividing step S12 of FIG. After the processing is completed, the process proceeds to S32.
  • in the processing target text H determination step S32, the word set specifying means 102 takes the word set of the query text as input and refers to the list of search target texts stored in a storage unit (not shown); if an unprocessed search target text remains, it selects the search target text to be processed from the unprocessed search target texts, sets the selected text as H, and the process proceeds to S33. If there is no unprocessed search target text, the process proceeds to S37.
  • in the word set specifying step S33, the word set specifying means 102 retrieves the word set of the search target text H obtained in S22 from a storage unit (not shown); of the word set of the search target text H and the word set of the query text obtained in S31, it outputs the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}. After the process is completed, the process proceeds to S34.
  • in the injection φ determination step S34, the injection determining means 103 takes the word sets {X_1, ..., X_m} and {Y_1, ..., Y_n} as input; among the injections from {X_1, ..., X_m} to {Y_1, ..., Y_n} that map each element X_i to Y_φ_i, if an unprocessed injection remains, it selects the injection to be processed, outputs the selected injection as φ, and the process proceeds to S35. If there is no unprocessed injection, the process proceeds to S36.
  • in the injection distance score calculation step S35, the injection distance score calculation means 104 takes the injection φ determined as the processing target by the injection determining means 103 as input and performs the same processing as in the injection distance score calculation step S15 of FIG. 8.
  • alternatively, it may calculate and output, as the distance score of the injection φ, the sum of the distances between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i) over all element pairs X_i, X_j (i < j) in {X_1, ..., X_m}, divided by the number of all element pairs X_i, X_j (i < j) in {X_1, ..., X_m}.
  • in the inter-text distance score calculation step S36, the inter-text distance score calculation means 105 takes as input all the distance scores corresponding to all the injections for the search target text H calculated by the injection distance score calculation means 104, and evaluates their minimum as the distance score between the query text and the search target text H. After the processing is completed, the process proceeds to S32.
  • the inter-text distance score calculation means 105 generates and outputs an evaluation result based on the distance score between the query text and each search target text.
  • the following can be considered as evaluation results.
  • (1) the search target text having the minimum distance score among the distance scores between the query text and all the search target texts, together with that distance score; (2) a list of pairs of a search target text whose distance score is below (or not above) a certain threshold and its distance score; (3) a list in which the search target texts are ranked in ascending order of their distance score with the query text, given as pairs of a search target text and its distance score arranged in ranking order.
  • in case (3), the evaluation result may be limited to the pairs up to a certain rank, or to the pairs whose distance score is below (or not above) a certain threshold.
  • the corresponding “solution” text is output together with the evaluation result.
  • <Modification> In evaluating the similarity between arbitrary texts A and B, in addition to the inter-text distance score described in the first and second embodiments, an inter-text distance may be calculated as described in the background art, based on the distances between the vectors of the words in A and the vectors of the words in B, and a weighted linear combination of the two calculated distances may be used as the final inter-text distance on which the similarity is evaluated.
  • the program describing the processing contents can be recorded on a computer-readable recording medium.
  • a computer-readable recording medium for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.
  • this program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.
  • a computer that executes such a program first stores a program recorded on a portable recording medium or a program transferred from a server computer in its storage unit. When executing the process, this computer reads the program stored in its own storage unit and executes the process according to the read program.
  • a computer may read the program directly from a portable recording medium and execute processing according to the program. Further, each time a program is transferred to the computer from the server computer, processing according to the received program may be executed sequentially. Alternatively, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service that realizes the processing function only through execution instructions and result acquisition, without transferring the program from the server computer to the computer.
  • the program includes information provided for processing by an electronic computer and equivalent to the program (data that is not a direct command to the computer but has a property that defines the processing of the computer).
  • although each device described above is configured by executing a predetermined program on a computer, at least a part of the processing contents may be realized by hardware.
  • the present invention can be applied to similarity evaluation techniques that evaluate whether the similarity between texts A and B is high.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Provided is a similarity assessment device that assesses the similarity between texts A and B as being high when the relationships between words in A are similar to the relationships between words in B, even if the meanings of the words in A are far from the meanings of the words in B. This similarity assessment device is provided with a concept base that stores a set of pairs of a word and a vector representing the concept of the word, and the device performs: determination of an injection φ from X to Y that maps each element of X to an element of Y, where X is the word set, of the two word sets of the texts, whose number of elements is not larger, and Y is the other; calculation, for each element pair in X, of a distance between the difference vector of that pair and the difference vector of the corresponding element pair in Y under the injection φ; calculation of the sum of these distances over all element pairs as a distance score of the injection φ; and setting of the minimum value of all the distance scores corresponding to all injections as the distance score of the texts.

Description

Similarity evaluation apparatus, method thereof, and program
The present invention relates to a similarity evaluation apparatus that evaluates the similarity between two texts A and B, a method thereof, and a program.
There are methods described in Non-Patent Document 1 and Non-Patent Document 2 for generating a concept base, that is, a set of pairs of a word and a vector representing the concept of that word.
Both of these methods generate word vectors from a corpus as input, arranging the vectors so that semantically close words have close vectors. The generation algorithms are based on the distributional hypothesis that the concept of each word can be estimated from the appearance pattern (surrounding distribution) of the words around that word in the corpus.
Using a concept base generated by these methods, a distance representing the similarity between texts can be calculated. For a given text, a vector of the text is generated by combining the vectors of the words in the text (for example, taking the centroid of the word vectors). The distance between two texts is then calculated as the distance between the corresponding text vectors.
For two texts A and B, there are cases where the meanings of the words in A and the words in B are far apart, yet the similarity should be high because the relationships between the words in A and the relationships between the words in B are close. That is, there exist pairs of texts whose contents themselves are far apart but whose similarity is high because the relationships between the events in each text are similar.
For example, consider the text A "I lost my mobile phone at the company." in FIG. 1 and the text B "I dropped my commuter pass at the station." in the first row of the "problem" column in FIG. 2. The set of words in A is {company, mobile phone, lose} in FIG. 3, and the set of words in B is {station, commuter pass, drop} in FIG. 4. Pairs consisting of a word in A and a word in B (e.g., (company, station), (mobile phone, commuter pass), (lose, drop)) are far apart in meaning. However, the corresponding pairs of relationships between words in A and between words in B (e.g., ((company ⇔ mobile phone), (station ⇔ commuter pass)), ((company ⇔ lose), (station ⇔ drop)), ((mobile phone ⇔ lose), (commuter pass ⇔ drop))) are close. As shown in FIG. 2, when the user enters the problem they are facing, "I lost my mobile phone at the company.", against a database listing pairs of a "problem" and a "solution", and the "problem" text "I dropped my commuter pass at the station.", whose content itself is distant but whose relationships between events are similar, is hit, the corresponding "solution" text "Inquire at the station office." can be obtained. Using the problem "I dropped my commuter pass at the station." and its solution "Inquire at the station office." as reference information, the user can infer by analogy that a possible solution to their own problem, "I lost my mobile phone at the company.", is "Inquire at the company's administration office.". In this way, even if the meanings of the words in A and the words in B are far apart, it is useful to judge the similarity between A and B to be high when the relationships between the words in A and the relationships between the words in B are close.
However, at present, when the similarity between texts A and B is evaluated, the evaluation is based on the closeness between the vectors of the words in A and the vectors of the words in B. Therefore, when the meanings of the words in A and the words in B are far apart but the relationships between the words in A and the relationships between the words in B are close, the similarity between A and B cannot be evaluated as high.
The present invention has been made to solve the above problem, and an object thereof is to provide a similarity evaluation apparatus, a method, and a program that evaluate the similarity between texts A and B as high when the relationships between the words in A and the relationships between the words in B are close, even if the meanings of the words in A and the words in B are far apart.
In order to solve the above problem, according to one aspect of the present invention, a similarity evaluation apparatus includes: a concept base storing a set of pairs of a word and a vector representing the concept of the word; word division means for dividing a text into words; word set specifying means that, of the word sets of two texts, takes the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}; injection determining means for determining an injection φ from {X_1, ..., X_m} to {Y_1, ..., Y_n} that maps each element X_i of {X_1, ..., X_m} to Y_φ_i; injection distance score calculating means that, with V(Z) denoting the vector in the concept base corresponding to a word Z, calculates, for every element pair X_i, X_j (i < j) in {X_1, ..., X_m}, the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates the sum of these distances over all element pairs X_i, X_j (i < j) as the distance score of the injection φ; and inter-text distance score calculating means that takes the minimum of all the distance scores corresponding to all the injections calculated by the injection distance score calculating means as the distance score of the two texts.
In order to solve the above problem, according to another aspect of the present invention, a similarity evaluation apparatus includes: a concept base storing a set of pairs of a word and a vector representing the concept of the word; word division means for dividing a text into words; word set specifying means that, of the word set of a query text and the word set of each of one or more search target texts, takes the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}; injection determining means for determining an injection φ from {X_1, ..., X_m} to {Y_1, ..., Y_n} that maps each element X_i of {X_1, ..., X_m} to Y_φ_i; injection distance score calculating means that, with V(Z) denoting the vector in the concept base corresponding to a word Z, calculates, for every element pair X_i, X_j (i < j) in {X_1, ..., X_m}, the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates the sum of these distances over all element pairs divided by the number of element pairs X_i, X_j (i < j) as the distance score of the injection φ; and inter-text distance score calculating means that takes the minimum of all the distance scores corresponding to all the injections for a search target text, calculated by the injection distance score calculating means, as the distance score between the query text and that search target text, wherein the inter-text distance score calculating means generates an evaluation result using the distance scores between the query text and each of the one or more search target texts.
According to the present invention, even if the meanings of the words in text A and the words in text B are far apart, the similarity between texts A and B can be evaluated as high when the relationships between the words in text A and the relationships between the words in text B are close.
FIG. 1 is a diagram showing an example of a target text whose similarity is evaluated.
FIG. 2 is a diagram showing an example of a target text whose similarity is evaluated.
FIG. 3 is a diagram showing an example of a word set.
FIG. 4 is a diagram showing an example of a word set.
FIG. 5 is a diagram showing an example in which the relationships between words are close even though the meanings of the words in the two texts are far apart.
FIG. 6 is a functional block diagram of the similarity evaluation apparatus according to the first embodiment.
FIG. 7 is a diagram showing an example of a concept base.
FIG. 8 is a diagram showing an example of the processing flow of the similarity evaluation apparatus according to the first embodiment.
FIG. 9 is a diagram showing an example of an injection.
FIG. 10 is a diagram showing an example of the processing flow of the pre-processing of the similarity evaluation apparatus according to the second embodiment.
FIG. 11 is a diagram showing an example of the processing flow of the search processing of the similarity evaluation apparatus according to the second embodiment.
Hereinafter, embodiments of the present invention will be described. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. In the following description, processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix unless otherwise specified.
<Key idea of the first embodiment>
A certain class of concept bases has the property that the difference vectors between the word vectors of word pairs having the same relationship are almost identical. That is, with V(Z) denoting the vector of a word Z, for a word pair (a, b) and a word pair (c, d) having the same relationship,

    V(b) - V(a) ≈ V(d) - V(c)

holds. This means that the relationship of the word pair (a, b) can be captured as V(b) - V(a).
In the processing of the present invention, there are in general several correspondences φ between the elements of the word set {X_1, ..., X_m} of one text and the elements of the word set {Y_1, ..., Y_n} of the other text.
Under a given φ, write Y_φ_i for the element of the word set {Y_1, ..., Y_n} that corresponds to the element X_i of the word set {X_1, ..., X_m} (here the subscript notation A_B means A with subscript B). Then, even if V(X_i) and V(Y_φ_i) are far apart, when the relationship between any element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are close, the above property of the concept base implies
    V(X_j) - V(X_i) ≈ V(Y_φ_j) - V(Y_φ_i),

so the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i) becomes small, and the distance score of φ, which is the sum of these distances, becomes small. Consequently, the inter-text distance score, which is the minimum of the distance scores of φ, also becomes small.
Conversely, under a given φ, when the relationship between some element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are far apart, in general
    V(X_j) - V(X_i) ≈ V(Y_φ_j) - V(Y_φ_i)

does not hold, so the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i) becomes large, and the distance score of φ, which is the sum of these distances, becomes large.
Therefore, if the inter-text distance score, which is the minimum of the distance scores of φ, is small, then under the φ attaining that minimum the relationship between any element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are close, so the similarity between the texts can be evaluated as high.
Moreover, when, under a given φ, for every element pair X_i, X_j (i < j),
    V(X_j) - V(X_i) ≈ V(Y_φ_j) - V(Y_φ_i)

holds, the list of word vectors V(X_1), ..., V(X_m) can be translated so as to almost overlap the list of word vectors V(Y_φ_1), ..., V(Y_φ_m).
For the example texts A and B mentioned in the problem to be solved by the invention, the vectors of the words in A and the vectors of the words in B are far from each other, as shown in FIG. 5. Take the injection φ:
    φ: company → station, mobile phone → commuter pass, lose → drop.
Then the distance between V(mobile phone) - V(company) and V(commuter pass) - V(station), the distance between V(lose) - V(company) and V(drop) - V(station), and the distance between V(lose) - V(mobile phone) and V(drop) - V(commuter pass) are all small, so the distance score of φ is small. As a result, the inter-text distance score becomes small, and the similarity between texts A and B can be evaluated as high.
<First embodiment>
FIG. 6 shows a configuration example of the similarity evaluation apparatus according to the present embodiment.
The similarity evaluation apparatus includes a concept base 106, word division means 101, word set specifying means 102, injection determining means 103, injection distance score calculating means 104, and inter-text distance score calculating means 105.
The similarity evaluation apparatus takes two texts as input, evaluates the similarity between the two texts, and outputs an evaluation result.
The similarity evaluation apparatus is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU: Central Processing Unit) and a main storage device (RAM: Random Access Memory). The similarity evaluation apparatus executes each process under the control of the central processing unit, for example. Data input to the similarity evaluation apparatus and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used in other processing. At least a part of each processing means of the similarity evaluation apparatus may be configured by hardware such as an integrated circuit. Each storage unit of the similarity evaluation apparatus can be configured, for example, by a main storage device such as a RAM, or by middleware such as a relational database or a key-value store. However, each storage unit does not necessarily have to be provided inside the similarity evaluation apparatus; it may be configured by an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, and provided outside the similarity evaluation apparatus.
Each component is described below.
<Concept base 106>
The concept base 106 stores a set of pairs of a word and a vector representing the concept of that word. FIG. 7 shows an example of the concept base 106. The concept base 106 is generated, for example, by the method of Non-Patent Document 1 or Non-Patent Document 2.
No word appears more than once in the concept base 106.
The vector of each word is a p-dimensional vector, and the vectors of semantically close words are placed close to each other. Here, "close" and "far" refer to the distance between vectors (for example, the Euclidean distance or its square).
Only content words such as nouns, verbs, and adjectives may be registered in the concept base 106, or words of other parts of speech may also be registered; in the present embodiment, only content words are registered. Words may be registered in the concept base 106 in their base (dictionary) form and looked up by the base form, or all inflected forms may be registered and looked up by the inflected form appearing in the text; in the present embodiment, lookup is performed using the base form.
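As an illustration only, such a concept base could be held in memory as a simple word-to-vector mapping. The sketch below assumes a plain text file of pre-trained vectors in the common "word v1 ... vp" format; the file name and format are assumptions and are not specified by this embodiment.

```python
import numpy as np

def load_concept_base(path):
    """Load a word -> p-dimensional vector mapping from a whitespace-separated file."""
    concept_base = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip header or blank lines
            concept_base[parts[0]] = np.array([float(v) for v in parts[1:]])
    return concept_base

# concept_base = load_concept_base("vectors.txt")  # hypothetical file of pre-trained vectors
```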
FIG. 8 is a diagram showing an example of the processing routine of the similarity evaluation apparatus. In the following, each means of the similarity evaluation apparatus is described by describing the processing contents of each step in FIG. 8.
The processing routine of FIG. 8 takes two texts A and B as input and evaluates the similarity between A and B. As an example, take the texts A and B mentioned in the problem to be solved by the invention.
<Word division means 101>
In the processing target text G determination step S11, the word division means 101 takes the input texts A and B as input; if an unprocessed text remains among the input texts A and B, it selects the text to be processed from the unprocessed texts, sets the selected text as G, and the process proceeds to S12. If there is no unprocessed text, the process proceeds to S13.
In the word division step S12, the word division means 101 divides the text G into words and outputs the result. Specifically, it performs morphological analysis of the text G and obtains the set of distinct words constituting the text G (each word counts as a single element of the set no matter how many times it appears in the text G). Only content words such as nouns, verbs, and adjectives may be used as words, or words of other parts of speech may be added; in the present embodiment, only content words are used. Further, in the present embodiment, inflected forms are converted into the base form before being used as elements of the word set. After the process is completed, the process proceeds to S11.
When the text G is "I lost my mobile phone at the company.", the result of the word division step S12 is {company, mobile phone, lose}. When the text G is "I dropped my commuter pass at the station.", the result is {station, commuter pass, drop}.
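The sketch below illustrates one possible shape of this step; the `tokenize` function is a hypothetical placeholder for whatever morphological analyzer is used, and the part-of-speech labels are assumptions.

```python
CONTENT_POS = {"noun", "verb", "adjective"}  # assumed part-of-speech labels

def tokenize(text):
    # Placeholder for a real morphological analyzer that lemmatizes each token
    # and tags its part of speech; it should return (base_form, pos) pairs.
    raise NotImplementedError

def word_set(text):
    """Set of distinct content words in base form, as in step S12."""
    return {base for base, pos in tokenize(text) if pos in CONTENT_POS}

# With a suitable analyzer:
# word_set("I lost my mobile phone at the company.") -> {"company", "mobile phone", "lose"}
```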
<Word set specifying means 102>
In the word set specifying step S13, the word set specifying means 102 takes the word sets of the two texts obtained in S12 as input, outputs the one whose number of elements is not larger as {X_1, ..., X_m}, and outputs the other as {Y_1, ..., Y_n}. After the process is completed, the process proceeds to S14.
Since the word sets {company, mobile phone, lose} and {station, commuter pass, drop} obtained in S12 both have three elements, either one may be taken as {X_1, ..., X_m}. Here, let X_1 = company, X_2 = mobile phone, X_3 = lose, and Y_1 = station, Y_2 = commuter pass, Y_3 = drop.
<Injection determining means 103>
In the injection φ determination step S14, the injection determining means 103 takes the word sets {X_1, ..., X_m} and {Y_1, ..., Y_n} as input; among the injections from {X_1, ..., X_m} to {Y_1, ..., Y_n} that map each element X_i to Y_φ_i, if an unprocessed injection remains, it selects the injection to be processed, outputs the selected injection as φ, and the process proceeds to S15. If there is no unprocessed injection, the process proceeds to S16.
There are six injections from {X_1, X_2, X_3} to {Y_1, Y_2, Y_3}, as shown in FIG. 9. Here, take the injection in the first row of FIG. 9, "X_1 → Y_1, X_2 → Y_2, X_3 → Y_3", that is, "company → station, mobile phone → commuter pass, lose → drop", as the injection φ to be processed.
<Injection distance score calculating means 104>
In the injection distance score calculation step S15, the injection distance score calculating means 104 takes the injection φ determined as the processing target by the injection determining means 103 as input. With V(Z) denoting the vector in the concept base 106 corresponding to a word Z, it retrieves from the concept base 106 the vectors V(X_1), ..., V(X_m) corresponding to {X_1, ..., X_m} and the vectors V(Y_φ_1), ..., V(Y_φ_m) corresponding to {Y_φ_1, ..., Y_φ_m}. For every element pair X_i, X_j (i < j) in {X_1, ..., X_m}, it calculates the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates and outputs the sum of these distances over all element pairs X_i, X_j (i < j) as the distance score of the injection φ. The Euclidean distance or the squared Euclidean distance may be used as the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i). After the process is completed, the process proceeds to S14.
For the injection φ "X_1 → Y_1, X_2 → Y_2, X_3 → Y_3", the distance between V(X_2) - V(X_1) and V(Y_2) - V(Y_1), the distance between V(X_3) - V(X_1) and V(Y_3) - V(Y_1), and the distance between V(X_3) - V(X_2) and V(Y_3) - V(Y_2) are calculated, and their sum is the distance score of the injection φ. That is, for the injection φ "company → station, mobile phone → commuter pass, lose → drop", the distance between V(mobile phone) - V(company) and V(commuter pass) - V(station), the distance between V(lose) - V(company) and V(drop) - V(station), and the distance between V(lose) - V(mobile phone) and V(drop) - V(commuter pass) are calculated, and their sum is the distance score of the injection φ. When the word vectors are arranged as in FIG. 5, the distance score of the injection φ is close to 0.
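A sketch of the distance score of a single injection φ, assuming `concept_base` is a mapping from words to numpy vectors and `phi` is a tuple aligned with `X` as in the enumeration sketch above (the Euclidean distance is used; the squared variant would drop the square root):

```python
import numpy as np

def injection_distance_score(X, phi, concept_base):
    """Sum over i < j of the distance between V(X_j)-V(X_i) and V(Y_phi_j)-V(Y_phi_i)."""
    V = concept_base
    score = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            dx = V[X[j]] - V[X[i]]
            dy = V[phi[j]] - V[phi[i]]
            score += np.linalg.norm(dx - dy)  # Euclidean distance between the differences
    return score
```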
<Inter-text distance score calculating means 105>
In the inter-text distance score calculation step S16, the inter-text distance score calculating means 105 takes as input all the distance scores corresponding to all the injections calculated by the injection distance score calculating means 104 (in the case of FIG. 9, the six distance scores corresponding to the six injections), evaluates the minimum of all the distance scores as the distance score of the two texts A and B, and outputs an evaluation result. For example, (i) the distance score itself may be output as the evaluation result, or (ii) if the distance score of texts A and B is below (or not above) a certain threshold, an evaluation result indicating that texts A and B are similar may be output, and otherwise an evaluation result indicating that they are not similar may be output. After the processing is completed, the processing routine of FIG. 8 ends.
The distance score of the example texts A and B is close to 0, and texts A and B are evaluated as being similar.
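Combining steps S13 to S16, a sketch of the inter-text distance score, reusing `word_set` and `injection_distance_score` from the earlier sketches (the threshold in the comment is an arbitrary assumption):

```python
from itertools import permutations

def inter_text_distance_score(text_a, text_b, concept_base):
    """Minimum injection distance score over all injections (steps S13 to S16)."""
    wa, wb = word_set(text_a), word_set(text_b)
    X, Y = (list(wa), list(wb)) if len(wa) <= len(wb) else (list(wb), list(wa))
    return min(
        injection_distance_score(X, phi, concept_base)
        for phi in permutations(Y, len(X))
    )

# Example of evaluation (ii), with an arbitrarily assumed threshold:
# similar = inter_text_distance_score(text_a, text_b, concept_base) < 1.0
```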
<Effects>
With the above configuration, even if the meanings of the words in text A and the words in text B are far apart, the similarity between texts A and B can be evaluated as high when the relationships between the words in text A and the relationships between the words in text B are close.
<Second embodiment>
The description below focuses on the parts that differ from the first embodiment.
FIG. 10 is a diagram showing an example of the pre-processing routine of the similarity evaluation apparatus, and FIG. 11 is a diagram showing an example of the search processing routine of the similarity evaluation apparatus. FIGS. 10 and 11 are the routines for the case where, given a database containing a list of pairs of a "problem" and a "solution" as in FIG. 2, the text in each row of the "problem" column is the search target, a text corresponding to a "problem" such as the one in FIG. 1 is input as a query, and the search target texts having high similarity to the query text are to be obtained. When a search target text with high similarity is found, that search target text and the corresponding "solution" text are returned. FIG. 10 is the search pre-processing routine, which takes the list of search target texts as input, and FIG. 11 is the search processing routine, which takes the query text as input.
<Pre-processing>
The processing routine of FIG. 10 is described.
<Word division means 101>
In the processing target text H determination step S21, the word division means 101 takes the list of search target texts (for example, the list of "problems" in FIG. 2) as input; if an unprocessed search target text remains, it selects the search target text to be processed from the unprocessed search target texts, sets the selected text as H, and the process proceeds to S22. If there is no unprocessed search target text, the processing routine of FIG. 10 ends.
In the word division step S22, the word division means 101 divides the search target text H determined in S21 into words and adds the resulting word set to the aforementioned list in association with the search target text H. The list is stored in a storage unit (not shown). The processing contents are the same as those of the word division means 101 in the word division step S12 of FIG. 8. After the process is completed, the process proceeds to S21.
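A sketch of this pre-processing, with an in-memory list standing in for the storage unit and `word_set` taken from the earlier sketch:

```python
def preprocess_search_targets(search_target_texts):
    """Steps S21 and S22: word-divide every search target text and keep the pairs."""
    return [(text, word_set(text)) for text in search_target_texts]

# preprocessed = preprocess_search_targets(problem_column_texts)  # the "problem" column of FIG. 2
```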
<Search process>
The processing routine of FIG. 11 is described below.
<Word division means 101>
In the word division step S31, the word division means 101 takes the query text as input, divides the query text into words, and outputs the result. The processing is the same as that of the word division means 101 in the word division step S12 of FIG. 8. After the processing is completed, the process proceeds to S32.
<Word set specifying means 102>
In the processing target text H determination step S32, the word set specifying means 102 takes the word set of the query text as input and refers to the list of search target texts stored in the storage unit (not shown). If there is an unprocessed search target text, it selects the search target text to process from the unprocessed ones, sets the selected text as H, and proceeds to S33. If there is no unprocessed search target text, the process proceeds to S37.
In the word set specification step S33, the word set specifying means 102 retrieves the word set of the search target text H obtained in S22 from the storage unit (not shown). Of the word set of the search target text H and the word set of the query text obtained in S31, the one whose number of elements is not larger is output as {X1,…,Xm} and the other as {Y1,…,Yn}. After the processing is completed, the process proceeds to S34.
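A minimal sketch of the size comparison performed in S33, under the assumption that the word sets are held as lists (the function name is illustrative):

def specify_word_sets(query_words, target_words):
    # Return ({X1,...,Xm}, {Y1,...,Yn}): the set whose number of elements is
    # not larger comes first; when the sizes are equal, either choice is valid.
    if len(target_words) <= len(query_words):
        return target_words, query_words
    return query_words, target_words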
<Injection determining means 103>
In the injection φ determination step S34, the injection determining means 103 takes the word sets {X1,…,Xm} and {Y1,…,Yn} as input and considers the injections from {X1,…,Xm} to {Y1,…,Yn} that map each element Xi in {X1,…,Xm} to Yφ_i. If there is an unprocessed injection, it selects the injection to process from the unprocessed ones, outputs the selected injection as φ, and proceeds to S35. If there is no unprocessed injection, the process proceeds to S36.
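The injections considered in S34 can be enumerated as ordered selections of m distinct elements of {Y1,…,Yn}; a sketch under that reading (names are illustrative):

from itertools import permutations

def enumerate_injections(m, n):
    # Yield every injection phi from {X1,...,Xm} into {Y1,...,Yn}, represented
    # as a tuple in which phi[i] is the index of the element of Y that Xi maps to.
    yield from permutations(range(n), m)

# For m = n = 3 this yields the 3! = 6 injections mentioned for FIG. 9.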
<Injection distance score calculation means 104>
In the injection distance score calculation step S35, the injection distance score calculation means 104 takes as input the injection φ determined as the processing target by the injection determining means 103 and performs the same processing as the injection distance score calculation means 104 in the injection distance score calculation step S15 of FIG. 8. Alternatively, the sum of the distances between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) over all element pairs Xi, Xj (i<j) in {X1,…,Xm} may be divided by the number of all element pairs Xi, Xj (i<j) in {X1,…,Xm}, and the resulting value may be calculated and output as the distance score of the injection φ. This is a corrective measure for the fact that, when the plain sum of distances is used, the inter-text distance score of the corresponding search target text tends to become smaller as the number of elements of the word set {X1,…,Xm} becomes smaller. Therefore, when the number of elements of the smaller word set {X1,…,Xm} does not vary from one search target text to another, the calculation method of the first embodiment may be adopted; when it does vary, the corrective measure described above should be adopted. After the processing is completed, the process proceeds to S34.
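A sketch of the normalized variant described above; injection_distance_score denotes the plain-sum score of S15, as in the sketch given for the first embodiment, and the function name here is illustrative.

def normalized_injection_distance_score(X_vecs, Y_vecs, phi):
    # Sum of the pairwise difference-vector distances divided by the number of
    # element pairs, so that search target texts with fewer words are not
    # systematically favored.
    m = len(X_vecs)
    num_pairs = m * (m - 1) // 2
    if num_pairs == 0:
        return 0.0  # a word set with fewer than two elements has no pairs
    return injection_distance_score(X_vecs, Y_vecs, phi) / num_pairs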
<Inter-text distance score calculation means 105>
In the inter-text distance score calculation step S36, the inter-text distance score calculation means 105 takes as input all the distance scores corresponding to all the injections for the search target text H calculated by the injection distance score calculation means 104, and evaluates the minimum value of all the distance scores as the distance score between the query text and the search target text H. After the processing is completed, the process proceeds to S32.
As described above, when there is no unprocessed search target text in S32, the process proceeds to S37. In the evaluation result generation step S37, the inter-text distance score calculation means 105 generates and outputs an evaluation result based on the distance scores between the query text and the respective search target texts. The following are possible forms of the evaluation result (a sketch of form (3) follows this list).
(1) The search target text that takes the minimum distance score among all the distance scores between the query text and all the search target texts, together with that distance score.
(2) A list of pairs of a search target text whose distance score is at or below (or below) a certain threshold and that distance score.
(3) A list of pairs of a search target text and the corresponding distance score, obtained by ranking the search target texts in ascending order of their distance score to the query text and arranging them in that order. Here, the evaluation result may be limited to the pairs up to a certain rank in the list, or to the pairs whose distance score is at or below (or below) a certain threshold.
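A minimal sketch of evaluation result (3), with optional limitation to the top k results or to scores at or below a threshold (the parameter names are illustrative):

def generate_ranking(scored_targets, top_k=None, threshold=None):
    # scored_targets: list of (search_target_text, distance_score) pairs from S36.
    ranked = sorted(scored_targets, key=lambda pair: pair[1])  # ascending distance
    if threshold is not None:
        ranked = [pair for pair in ranked if pair[1] <= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked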
When the query text of FIG. 1 is input against the database of FIG. 2, the search target text "I dropped my commuter pass at the station." is output as the search target text taking the minimum inter-text distance score.
In this embodiment, as described above, the corresponding "solution" text is output together with the evaluation result.
<Effect>
With this configuration, the same effect as in the first embodiment can be obtained.
<Modification>
When evaluating the similarity between arbitrary texts A and B, in addition to the inter-text distance score described in the first and second embodiments, another inter-text distance based on the distances between the vectors of the words in A and the vectors of the words in B, such as the inter-text distance described in the background art, may be calculated. The value obtained by a weighted linear combination of the two calculated distances may then be taken as the final inter-text distance, and the similarity may be evaluated based on this inter-text distance.
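A sketch of this modification, taking for illustration the distance between the centroids of the word vectors as the conventional word-vector-based inter-text distance; text_distance_score is the injection-based score from the earlier sketch, and the weights w1 and w2 are assumptions to be chosen by the implementer.

import numpy as np

def combined_text_distance(words_a, words_b, concept_base, w1=0.5, w2=0.5):
    # Injection-based inter-text distance score (first/second embodiment).
    structural = text_distance_score(words_a, words_b, concept_base)
    # Conventional distance based directly on the word vectors: here, the
    # distance between the centroids of the word vectors of the two texts.
    centroid_a = np.mean([concept_base[w] for w in words_a if w in concept_base], axis=0)
    centroid_b = np.mean([concept_base[w] for w in words_b if w in concept_base], axis=0)
    baseline = float(np.linalg.norm(centroid_a - centroid_b))
    # Weighted linear combination as the final inter-text distance.
    return w1 * structural + w2 * baseline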
<Other modifications>
The present invention is not limited to the above-described embodiments and modifications. For example, the various processes described above may be executed not only in time series according to the description but also in parallel or individually, depending on the processing capability of the apparatus that executes them or as required. Other changes may be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing contents of the functions that each device should have are described by a program, and by executing this program on a computer, the various processing functions of each device are realized on the computer.
The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.
A computer that executes such a program, for example, first stores the program recorded on the portable recording medium or transferred from the server computer in its own storage unit. When executing the processing, the computer reads the program stored in its own storage unit and executes the processing according to the read program. As another embodiment of the program, the computer may read the program directly from the portable recording medium and execute the processing according to the program. Furthermore, every time the program is transferred from the server computer to the computer, the computer may sequentially execute the processing according to the received program. Alternatively, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. Note that the program includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has a property that defines the processing of the computer).
Although each device is configured by executing a predetermined program on a computer as described above, at least part of these processing contents may be realized by hardware.
The present invention is applicable to a similarity evaluation technique that, for two texts A and B, evaluates the similarity between A and B as high if the relationships among the words in A and the relationships among the words in B are close, even if the meanings of the words in A and the words in B are distant.

Claims (5)

  1.  A similarity evaluation device comprising:
     a concept base in which a set of pairs of a word and a vector representing the concept of the word is stored;
     word dividing means for dividing a text into words;
     word set specifying means for setting, of the respective word sets of two texts, the one whose number of elements is not larger as {X1,…,Xm} and the other as {Y1,…,Yn};
     injection determining means for determining an injection φ from {X1,…,Xm} to {Y1,…,Yn} that maps each element Xi in {X1,…,Xm} to Yφ_i;
     injection distance score calculating means for calculating, where V(Z) denotes the vector in the concept base corresponding to a word Z, the distance between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) for every element pair Xi, Xj (i<j) in {X1,…,Xm}, and calculating the sum of the distances over all element pairs Xi, Xj (i<j) as the distance score of the injection φ; and
     inter-text distance score calculating means for setting the minimum value of all the distance scores corresponding to all the injections calculated by the injection distance score calculating means as the distance score of the two texts.
  2.  A similarity evaluation device comprising:
     a concept base in which a set of pairs of a word and a vector representing the concept of the word is stored;
     word dividing means for dividing a text into words;
     word set specifying means for setting, of the word set of a query text and the word set of each of one or more search target texts, the one whose number of elements is not larger as {X1,…,Xm} and the other as {Y1,…,Yn};
     injection determining means for determining an injection φ from {X1,…,Xm} to {Y1,…,Yn} that maps each element Xi in {X1,…,Xm} to Yφ_i;
     injection distance score calculating means for calculating, where V(Z) denotes the vector in the concept base corresponding to a word Z, the distance between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) for every element pair Xi, Xj (i<j) in {X1,…,Xm}, and calculating, as the distance score of the injection φ, the value obtained by dividing the sum of the distances over all element pairs Xi, Xj (i<j) by the number of all element pairs Xi, Xj (i<j); and
     inter-text distance score calculating means for setting the minimum value of all the distance scores corresponding to all the injections for the search target text calculated by the injection distance score calculating means as the distance score between the query text and the search target text,
     wherein the inter-text distance score calculating means generates an evaluation result using the distance scores between the query text and each of the one or more search target texts.
  3.  A similarity evaluation method, wherein a concept base stores a set of pairs of a word and a vector representing the concept of the word, the method comprising:
     a word division step in which word dividing means divides a text into words;
     a word set specification step in which word set specifying means sets, of the respective word sets of two texts, the one whose number of elements is not larger as {X1,…,Xm} and the other as {Y1,…,Yn};
     an injection determination step in which injection determining means determines an injection φ from {X1,…,Xm} to {Y1,…,Yn} that maps each element Xi in {X1,…,Xm} to Yφ_i;
     an injection distance score calculation step in which injection distance score calculating means calculates, where V(Z) denotes the vector in the concept base corresponding to a word Z, the distance between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) for every element pair Xi, Xj (i<j) in {X1,…,Xm}, and calculates the sum of the distances over all element pairs Xi, Xj (i<j) as the distance score of the injection φ; and
     an inter-text distance score calculation step in which inter-text distance score calculating means sets the minimum value of all the distance scores corresponding to all the injections calculated in the injection distance score calculation step as the distance score of the two texts.
  4.  A similarity evaluation method, wherein a concept base stores a set of pairs of a word and a vector representing the concept of the word, the method comprising:
     a word division step in which word dividing means divides a text into words;
     a word set specification step in which word set specifying means sets, of the word set of a query text and the word set of each of one or more search target texts, the one whose number of elements is not larger as {X1,…,Xm} and the other as {Y1,…,Yn};
     an injection determination step in which injection determining means determines an injection φ from {X1,…,Xm} to {Y1,…,Yn} that maps each element Xi in {X1,…,Xm} to Yφ_i;
     an injection distance score calculation step in which injection distance score calculating means calculates, where V(Z) denotes the vector in the concept base corresponding to a word Z, the distance between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) for every element pair Xi, Xj (i<j) in {X1,…,Xm}, and calculates, as the distance score of the injection φ, the value obtained by dividing the sum of the distances over all element pairs Xi, Xj (i<j) by the number of all element pairs Xi, Xj (i<j);
     an inter-text distance score calculation step in which inter-text distance score calculating means sets the minimum value of all the distance scores corresponding to all the injections for the search target text calculated in the injection distance score calculation step as the distance score between the query text and the search target text; and
     an evaluation result generation step in which the inter-text distance score calculating means generates an evaluation result using the distance scores between the query text and each of the one or more search target texts.
  5.  A program for causing a computer to function as the similarity evaluation device according to claim 1 or claim 2.
PCT/JP2019/019829 2018-05-31 2019-05-20 Similarity assessment device, method therefor, and program WO2019230465A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-104292 2018-05-31
JP2018104292A JP2019211808A (en) 2018-05-31 2018-05-31 Similarity evaluation apparatus, method thereof and program

Publications (1)

Publication Number Publication Date
WO2019230465A1 true WO2019230465A1 (en) 2019-12-05

Family

ID=68696696

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/019829 WO2019230465A1 (en) 2018-05-31 2019-05-20 Similarity assessment device, method therefor, and program

Country Status (2)

Country Link
JP (1) JP2019211808A (en)
WO (1) WO2019230465A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102352481B1 (en) * 2019-12-27 2022-01-18 동국대학교 산학협력단 Sentence analysis device using morpheme analyzer built on machine learning and operating method thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012003333A (en) * 2010-06-14 2012-01-05 Nippon Telegr & Teleph Corp <Ntt> Similar document retrieval device, similar document retrieval method, its program and recording medium
JP2015005174A (en) * 2013-06-21 2015-01-08 日本放送協会 Content retrieval system, method, and program

Also Published As

Publication number Publication date
JP2019211808A (en) 2019-12-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19812172

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19812172

Country of ref document: EP

Kind code of ref document: A1