WO2019230465A1 - Similarity assessment device, method therefor, and program - Google Patents

Similarity assessment device, method therefor, and program Download PDF

Info

Publication number
WO2019230465A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
word
distance score
distance
words
Prior art date
Application number
PCT/JP2019/019829
Other languages
French (fr)
Japanese (ja)
Inventor
Katsuhito Bessho (別所 克人)
Hisako Asano (浅野 久子)
Junji Tomita (富田 準二)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Publication of WO2019230465A1 publication Critical patent/WO2019230465A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/90 Details of database functions independent of the retrieved data types

Definitions

  • the present invention relates to a similarity evaluation apparatus that evaluates the similarity between two texts A and B, a method thereof, and a program.
  • there are methods described in Non-Patent Document 1 and Non-Patent Document 2 for generating a concept base, that is, a set of pairs of a word and a vector representing the concept of that word.
  • both of these methods generate word vectors from a corpus as input, arranging the vectors so that semantically close words have close vectors.
  • the generation algorithms are based on the distributional hypothesis that the concept of each word can be estimated from the appearance pattern (surrounding distribution) of the words around that word in the corpus.
  • the distance representing the similarity between the texts can be calculated using the concept base generated by these methods.
  • a vector of the text is generated by synthesizing a vector of words in the text (for example, taking the centroid of the word vector).
  • the distance between texts is calculated as the distance between corresponding text vectors.
  • for example, for text A "I lost my mobile phone at the company." and text B "I dropped my commuter pass at the station.", the set of words in A becomes {company, mobile phone, lose} (FIG. 3),
  • and the set of words in B becomes {station, commuter pass, drop} (FIG. 4).
  • pairs consisting of a word in A and a word in B (e.g., (company, station), (mobile phone, commuter pass), (lose, drop)) are far apart in meaning,
  • but the relationships between the words in A and the relationships between the words in B are close. As shown in FIG. 2, it is then useful to judge the similarity between A and B to be high, for example when searching a database listing pairs of a "problem" and a "solution".
  • the present invention has been made to solve the above problem.
  • an object of the present invention is to provide a similarity evaluation apparatus, a method, and a program that evaluate the similarity between A and B as high when the relationships between the words in A and the relationships between the words in B are close, even if the meanings of the words in A and the words in B are far apart.
  • to solve the above problem, according to one aspect of the present invention, the similarity evaluation apparatus includes: a concept base storing a set of pairs of a word and a vector representing the concept of the word; word division means for dividing a text into words; word set specifying means that, of the word sets of two texts, takes the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}; injection determining means for determining an injection φ from {X_1, ..., X_m} to {Y_1, ..., Y_n} that maps each element X_i of {X_1, ..., X_m} to Y_φ_i; injection distance score calculating means that, with V(Z) denoting the vector in the concept base corresponding to a word Z, calculates, for every element pair X_i, X_j (i < j) in {X_1, ..., X_m}, the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates the sum of these distances over all element pairs as the distance score of the injection φ; and inter-text distance score calculating means that takes the minimum of all the distance scores corresponding to all the injections as the distance score of the two texts.
  • according to another aspect, the similarity evaluation apparatus includes: a concept base storing a set of pairs of a word and a vector representing the concept of the word; word division means for dividing a text into words; word set specifying means that, of the word set of a query text and the word set of each of one or more search target texts, takes the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}; injection determining means for determining an injection φ from {X_1, ..., X_m} to {Y_1, ..., Y_n} that maps each element X_i of {X_1, ..., X_m} to Y_φ_i; injection distance score calculating means that calculates, for every element pair X_i, X_j (i < j), the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates the sum of these distances divided by the number of element pairs X_i, X_j (i < j) as the distance score of the injection φ; and inter-text distance score calculating means that takes the minimum of all the distance scores corresponding to all the injections for a search target text as the distance score between the query text and that search target text, and generates an evaluation result using the distance scores between the query text and each of the one or more search target texts.
  • FIG. 1 and FIG. 2 are diagrams showing examples of target texts whose similarity is evaluated.
  • FIG. 6 is a functional block diagram of the similarity evaluation apparatus according to the first embodiment, and FIG. 7 is a diagram showing an example of a concept base.
  • FIG. 9 is a diagram showing an example of an injection, and FIG. 10 is a diagram showing an example of the processing flow of the pre-processing of the similarity evaluation apparatus according to the second embodiment.
  • a certain class of concept bases has the property that the difference vectors between the word vectors of word pairs having the same relationship are almost identical. That is, with V(Z) denoting the vector of a word Z, for a word pair (a, b) and a word pair (c, d) having the same relationship, V(b) - V(a) ≈ V(d) - V(c) holds.
  • in general there are several correspondences φ between the elements of the word set {X_1, ..., X_m} of one text and the elements of the word set {Y_1, ..., Y_n} of the other text.
  • writing Y_φ_i for the element of the word set {Y_1, ..., Y_n} that corresponds to the element X_i of {X_1, ..., X_m} under a given φ (the subscript notation A_B means A with subscript B), even if V(X_i) and V(Y_φ_i) are far apart, when the relationship between any element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are close, the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i) is small, and hence the distance score of φ is small.
  • therefore, if the inter-text distance score, that is, the minimum of the distance scores of φ, is small, then under the φ attaining that minimum the relationship between any element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are close, so the similarity between the texts can be evaluated as high.
  • in that case, the list of word vectors V(X_1), ..., V(X_m) can be translated so as to almost overlap the list of word vectors V(Y_φ_1), ..., V(Y_φ_m).
  • for the example texts A and B, the vectors of the words in A and the vectors of the words in B are far from each other, as shown in FIG. 5. Under the injection φ: company → station, mobile phone → commuter pass, lose → drop, the distance between V(mobile phone) - V(company) and V(commuter pass) - V(station), the distance between V(lose) - V(company) and V(drop) - V(station), and the distance between V(lose) - V(mobile phone) and V(drop) - V(commuter pass) are all small, so the distance score of φ is small. As a result, the inter-text distance score becomes small, and the similarity between texts A and B can be evaluated as high.
  • FIG. 6 is a configuration example of the similarity evaluation apparatus according to the present embodiment.
  • the similarity evaluation apparatus includes a concept base 106, a word dividing unit 101, a word set specifying unit 102, an injection determination unit 103, an injection distance score calculation unit 104, and an inter-text distance score calculation unit 105.
  • the similarity evaluation device takes two texts as input, evaluates the similarity between the two texts, and outputs an evaluation result.
  • the similarity evaluation device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU: Central Processing Unit), a main storage device (RAM: Random Access Memory), and the like.
  • the similarity evaluation device executes each process under the control of the central processing unit. Data input to the similarity evaluation device and data obtained in each process are stored in, for example, a main storage device, and the data stored in the main storage device is read out to the central processing unit as necessary. Used for other processing.
  • Each processing means of the similarity evaluation apparatus may be at least partially configured by hardware such as an integrated circuit.
  • Each storage unit included in the similarity evaluation device can be configured by a main storage device such as RAM (Random Access Memory) or middleware such as a relational database or a key-value store.
  • each storage unit does not necessarily have to be provided inside the similarity evaluation device; it may be configured by an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, and provided outside the similarity evaluation device.
  • the concept base 106 stores a set of pairs of a word and a vector representing the concept of the word.
  • FIG. 7 is an example of the concept base 106.
  • the concept base 106 is generated by, for example, the method of Non-Patent Document 1 or Non-Patent Document 2.
  • the vector of each word is a p-dimensional vector, and vectors of words that are semantically close are arranged nearby.
  • “near” and “far” mean the distance between vectors (for example, Euclidean distance or its square).
  • in the concept base 106, only content words such as nouns, verbs, and adjectives may be registered, or words of other parts of speech may also be registered. In the present embodiment, only content words are registered.
  • words may be registered in the concept base 106 in their base (dictionary) form and looked up by the base form, or all inflected forms may be registered and looked up by the inflected form appearing in the text. In the present embodiment, lookup is performed using the base form.
  • FIG. 8 is a diagram illustrating an example of a processing routine of the similarity evaluation apparatus.
  • each means of the similarity evaluation apparatus will be described by describing the processing contents of each step in FIG.
  • the processing routine of FIG. 8 is a routine for evaluating the similarity between A and B with two texts A and B as inputs. As an example, take texts A and B mentioned in the problem to be solved by the invention.
  • in the processing target text G determination step S11, the word division means 101 takes the input texts A and B as input; if an unprocessed text remains among the input texts A and B, it selects the text to be processed, sets the selected text as G, and the process proceeds to S12. If there is no unprocessed text, the process proceeds to S13.
  • in the word division step S12, the word division means 101 divides the text G into words and outputs the result. Specifically, it performs morphological analysis of the text G and obtains the set of distinct words constituting the text G (each word counts as a single element no matter how many times it appears in the text G). Only content words such as nouns, verbs, and adjectives may be used as words, or words of other parts of speech may be added; in the present embodiment, only content words are used. Further, in the present embodiment, inflected forms are converted into the base form before being used as elements of the word set. After the process is completed, the process proceeds to S11.
  • when the text G is "I lost my mobile phone at the company.", the result of the word division step S12 is {company, mobile phone, lose}.
  • when the text G is "I dropped my commuter pass at the station.", the result of the word division step S12 is {station, commuter pass, drop}.
  • in the word set specifying step S13, the word set specifying means 102 takes the word sets of the two texts obtained in S12 as input, outputs the one whose number of elements is not larger as {X_1, ..., X_m}, and outputs the other as {Y_1, ..., Y_n}. After the process is completed, the process proceeds to S14.
  • since both word sets have three elements, either one may be taken as {X_1, ..., X_m}. Here, X_1 = company, X_2 = mobile phone, X_3 = lose, and Y_1 = station, Y_2 = commuter pass, Y_3 = drop.
  • in the injection φ determination step S14, the injection determining means 103 takes the word sets {X_1, ..., X_m} and {Y_1, ..., Y_n} as input; among the injections from {X_1, ..., X_m} to {Y_1, ..., Y_n} that map each element X_i to Y_φ_i, if an unprocessed injection remains, it selects the injection to be processed, outputs the selected injection as φ, and the process proceeds to S15. If there is no unprocessed injection, the process proceeds to S16.
  • in the injection distance score calculation step S15, the injection distance score calculation means 104 takes the injection φ determined as the processing target by the injection determining means 103 as input; with V(Z) denoting the vector in the concept base 106 corresponding to a word Z, it retrieves from the concept base 106 the vectors V(X_1), ..., V(X_m) corresponding to {X_1, ..., X_m} and the vectors V(Y_φ_1), ..., V(Y_φ_m) corresponding to {Y_φ_1, ..., Y_φ_m}.
  • in the inter-text distance score calculation step S16, the inter-text distance score calculating means 105 takes as input all the distance scores corresponding to all the injections calculated by the injection distance score calculating means 104 (in the case of FIG. 9, the six distance scores corresponding to the six injections), evaluates the minimum of all the distance scores as the distance score of the two texts A and B, and outputs an evaluation result. For example, (i) the distance score itself may be output as the evaluation result, or (ii) if the distance score of texts A and B is below (or not above) a certain threshold, an evaluation result indicating that texts A and B are similar may be output, and otherwise an evaluation result indicating that they are not similar may be output. After the processing is completed, the processing routine of FIG. 8 ends.
  • the distance score of the example texts A and B is close to 0, and the texts A and B are evaluated to be similar.
  • FIG. 10 is a diagram illustrating an example of the pre-processing routine of the similarity evaluation device.
  • FIG. 11 is a diagram illustrating an example of the search processing routine of the similarity evaluation device. FIGS. 10 and 11 show the processing for the case where, given a database containing a list of pairs of a "problem" and a "solution" as in FIG. 2, the text in each row of the "problem" column is the search target,
  • a text corresponding to a "problem", such as the one in FIG. 1, is input as a query,
  • and search target texts having high similarity to the query text are to be obtained.
  • when such a search target text is found, that search target text and the corresponding "solution" text are returned.
  • FIG. 10 is a search pre-processing routine performed using a list of texts to be searched as input
  • FIG. 11 is a search process routine performed using query text as an input.
  • in the processing target text H determination step S21, the word division means 101 takes the list of search target texts (for example, the list of "problems" in FIG. 2) as input; if an unprocessed search target text remains among the search target texts,
  • it selects the search target text to be processed from the unprocessed search target texts, sets the selected text as H, and the process proceeds to S22. If there is no unprocessed search target text, the processing routine of FIG. 10 ends.
  • in the word division step S22, the word division means 101 divides the search target text H determined in S21 into words and adds the resulting word set to the aforementioned list in association with the search target text H.
  • the list is stored in a storage unit (not shown).
  • the processing contents are the same as the processing contents of the word dividing means 101 in the word dividing step S12 of FIG. After the process is completed, the process proceeds to S21.
  • the word division means 101 receives the query text as input, divides the query text into words, and outputs it.
  • the processing contents are the same as the processing contents of the word dividing means 101 in the word dividing step S12 of FIG. After the processing is completed, the process proceeds to S32.
  • in the processing target text H determination step S32, the word set specifying means 102 takes the word set of the query text as input and refers to the list of search target texts stored in a storage unit (not shown); if an unprocessed search target text remains, it selects the search target text to be processed from the unprocessed search target texts, sets the selected text as H, and the process proceeds to S33. If there is no unprocessed search target text, the process proceeds to S37.
  • in the word set specifying step S33, the word set specifying means 102 retrieves the word set of the search target text H obtained in S22 from a storage unit (not shown); of the word set of the search target text H and the word set of the query text obtained in S31, it outputs the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}. After the process is completed, the process proceeds to S34.
  • in the injection φ determination step S34, the injection determining means 103 takes the word sets {X_1, ..., X_m} and {Y_1, ..., Y_n} as input; among the injections from {X_1, ..., X_m} to {Y_1, ..., Y_n} that map each element X_i to Y_φ_i, if an unprocessed injection remains, it selects the injection to be processed, outputs the selected injection as φ, and the process proceeds to S35. If there is no unprocessed injection, the process proceeds to S36.
  • in the injection distance score calculation step S35, the injection distance score calculation means 104 takes the injection φ determined as the processing target by the injection determining means 103 as input and performs the same processing as in the injection distance score calculation step S15 of FIG. 8.
  • alternatively, it may calculate and output, as the distance score of the injection φ, the sum of the distances between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i) over all element pairs X_i, X_j (i < j) in {X_1, ..., X_m}, divided by the number of all element pairs X_i, X_j (i < j) in {X_1, ..., X_m}.
  • in the inter-text distance score calculation step S36, the inter-text distance score calculation means 105 takes as input all the distance scores corresponding to all the injections for the search target text H calculated by the injection distance score calculation means 104, and evaluates their minimum as the distance score between the query text and the search target text H. After the processing is completed, the process proceeds to S32.
  • the inter-text distance score calculation means 105 generates and outputs an evaluation result based on the distance score between the query text and each search target text.
  • the following can be considered as evaluation results.
  • (1) the search target text having the minimum distance score among the distance scores between the query text and all the search target texts, together with that distance score; (2) a list of pairs of a search target text whose distance score is below (or not above) a certain threshold and its distance score; (3) a list in which the search target texts are ranked in ascending order of their distance score with the query text, given as pairs of a search target text and its distance score arranged in ranking order.
  • in case (3), the evaluation result may be limited to the pairs up to a certain rank, or to the pairs whose distance score is below (or not above) a certain threshold.
  • the corresponding “solution” text is output together with the evaluation result.
  • <Modification> In evaluating the similarity between arbitrary texts A and B, in addition to the inter-text distance score described in the first and second embodiments, an inter-text distance may be calculated as described in the background art, based on the distances between the vectors of the words in A and the vectors of the words in B, and a weighted linear combination of the two calculated distances may be used as the final inter-text distance on which the similarity is evaluated.
  • the program describing the processing contents can be recorded on a computer-readable recording medium.
  • a computer-readable recording medium for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.
  • this program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.
  • a computer that executes such a program first stores a program recorded on a portable recording medium or a program transferred from a server computer in its storage unit. When executing the process, this computer reads the program stored in its own storage unit and executes the process according to the read program.
  • a computer may read the program directly from a portable recording medium and execute processing according to the program. Further, each time a program is transferred to the computer from the server computer, processing according to the received program may be executed sequentially. Alternatively, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service that realizes the processing function only through execution instructions and result acquisition, without transferring the program from the server computer to the computer.
  • the program includes information provided for processing by an electronic computer and equivalent to the program (data that is not a direct command to the computer but has a property that defines the processing of the computer).
  • although each device described above is configured by executing a predetermined program on a computer, at least a part of the processing contents may be realized by hardware.
  • the present invention can be applied to similarity evaluation techniques that evaluate whether the similarity between texts A and B is high.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Provided is a similarity assessment device that assesses the similarity between texts A and B as being high when the relationships between words in A are similar to the relationships between words in B, even if the meanings of the words in A are far from the meanings of the words in B. This similarity assessment device is provided with a concept base that stores a set of pairs of a word and a vector representing the concept of the word, and the device performs: determination of an injection φ from X to Y that maps each element of X to an element of Y, where X is the word set, of the two word sets of the texts, whose number of elements is not larger, and Y is the other; calculation, for each element pair in X, of a distance between the difference vector of that pair and the difference vector of the corresponding element pair in Y under the injection φ; calculation of the sum of these distances over all element pairs as a distance score of the injection φ; and setting of the minimum value of all the distance scores corresponding to all injections as the distance score of the texts.

Description

Similarity evaluation apparatus, method thereof, and program
The present invention relates to a similarity evaluation apparatus that evaluates the similarity between two texts A and B, a method thereof, and a program.
There are methods described in Non-Patent Document 1 and Non-Patent Document 2 for generating a concept base, that is, a set of pairs of a word and a vector representing the concept of that word.
Both of these methods generate word vectors from a corpus as input, arranging the vectors so that semantically close words have close vectors. The generation algorithms are based on the distributional hypothesis that the concept of each word can be estimated from the appearance pattern (surrounding distribution) of the words around that word in the corpus.
Using a concept base generated by these methods, a distance representing the similarity between texts can be calculated. For a given text, a vector of the text is generated by combining the vectors of the words in the text (for example, taking the centroid of the word vectors). The distance between two texts is then calculated as the distance between the corresponding text vectors.
For two texts A and B, there are cases where the meanings of the words in A and the words in B are far apart, yet the similarity should be high because the relationships between the words in A and the relationships between the words in B are close. That is, there exist pairs of texts whose contents themselves are far apart but whose similarity is high because the relationships between the events in each text are similar.
For example, consider the text A "I lost my mobile phone at the company." in FIG. 1 and the text B "I dropped my commuter pass at the station." in the first row of the "problem" column in FIG. 2. The set of words in A is {company, mobile phone, lose} in FIG. 3, and the set of words in B is {station, commuter pass, drop} in FIG. 4. Pairs consisting of a word in A and a word in B (e.g., (company, station), (mobile phone, commuter pass), (lose, drop)) are far apart in meaning. However, the corresponding pairs of relationships between words in A and between words in B (e.g., ((company ⇔ mobile phone), (station ⇔ commuter pass)), ((company ⇔ lose), (station ⇔ drop)), ((mobile phone ⇔ lose), (commuter pass ⇔ drop))) are close. As shown in FIG. 2, when the user enters the problem they are facing, "I lost my mobile phone at the company.", against a database listing pairs of a "problem" and a "solution", and the "problem" text "I dropped my commuter pass at the station.", whose content itself is distant but whose relationships between events are similar, is hit, the corresponding "solution" text "Inquire at the station office." can be obtained. Using the problem "I dropped my commuter pass at the station." and its solution "Inquire at the station office." as reference information, the user can infer by analogy that a possible solution to their own problem, "I lost my mobile phone at the company.", is "Inquire at the company's administration office.". In this way, even if the meanings of the words in A and the words in B are far apart, it is useful to judge the similarity between A and B to be high when the relationships between the words in A and the relationships between the words in B are close.
However, at present, when the similarity between texts A and B is evaluated, the evaluation is based on the closeness between the vectors of the words in A and the vectors of the words in B. Therefore, when the meanings of the words in A and the words in B are far apart but the relationships between the words in A and the relationships between the words in B are close, the similarity between A and B cannot be evaluated as high.
The present invention has been made to solve the above problem, and an object thereof is to provide a similarity evaluation apparatus, a method, and a program that evaluate the similarity between texts A and B as high when the relationships between the words in A and the relationships between the words in B are close, even if the meanings of the words in A and the words in B are far apart.
In order to solve the above problem, according to one aspect of the present invention, a similarity evaluation apparatus includes: a concept base storing a set of pairs of a word and a vector representing the concept of the word; word division means for dividing a text into words; word set specifying means that, of the word sets of two texts, takes the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}; injection determining means for determining an injection φ from {X_1, ..., X_m} to {Y_1, ..., Y_n} that maps each element X_i of {X_1, ..., X_m} to Y_φ_i; injection distance score calculating means that, with V(Z) denoting the vector in the concept base corresponding to a word Z, calculates, for every element pair X_i, X_j (i < j) in {X_1, ..., X_m}, the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates the sum of these distances over all element pairs X_i, X_j (i < j) as the distance score of the injection φ; and inter-text distance score calculating means that takes the minimum of all the distance scores corresponding to all the injections calculated by the injection distance score calculating means as the distance score of the two texts.
In order to solve the above problem, according to another aspect of the present invention, a similarity evaluation apparatus includes: a concept base storing a set of pairs of a word and a vector representing the concept of the word; word division means for dividing a text into words; word set specifying means that, of the word set of a query text and the word set of each of one or more search target texts, takes the one whose number of elements is not larger as {X_1, ..., X_m} and the other as {Y_1, ..., Y_n}; injection determining means for determining an injection φ from {X_1, ..., X_m} to {Y_1, ..., Y_n} that maps each element X_i of {X_1, ..., X_m} to Y_φ_i; injection distance score calculating means that, with V(Z) denoting the vector in the concept base corresponding to a word Z, calculates, for every element pair X_i, X_j (i < j) in {X_1, ..., X_m}, the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates the sum of these distances over all element pairs divided by the number of element pairs X_i, X_j (i < j) as the distance score of the injection φ; and inter-text distance score calculating means that takes the minimum of all the distance scores corresponding to all the injections for a search target text, calculated by the injection distance score calculating means, as the distance score between the query text and that search target text, wherein the inter-text distance score calculating means generates an evaluation result using the distance scores between the query text and each of the one or more search target texts.
According to the present invention, even if the meanings of the words in text A and the words in text B are far apart, the similarity between texts A and B can be evaluated as high when the relationships between the words in text A and the relationships between the words in text B are close.
FIG. 1 is a diagram showing an example of a target text whose similarity is evaluated.
FIG. 2 is a diagram showing an example of a target text whose similarity is evaluated.
FIG. 3 is a diagram showing an example of a word set.
FIG. 4 is a diagram showing an example of a word set.
FIG. 5 is a diagram showing an example in which the relationships between words are close even though the meanings of the words in the two texts are far apart.
FIG. 6 is a functional block diagram of the similarity evaluation apparatus according to the first embodiment.
FIG. 7 is a diagram showing an example of a concept base.
FIG. 8 is a diagram showing an example of the processing flow of the similarity evaluation apparatus according to the first embodiment.
FIG. 9 is a diagram showing an example of an injection.
FIG. 10 is a diagram showing an example of the processing flow of the pre-processing of the similarity evaluation apparatus according to the second embodiment.
FIG. 11 is a diagram showing an example of the processing flow of the search processing of the similarity evaluation apparatus according to the second embodiment.
Hereinafter, embodiments of the present invention will be described. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. In the following description, processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix unless otherwise specified.
<Key idea of the first embodiment>
A certain class of concept bases has the property that the difference vectors between the word vectors of word pairs having the same relationship are almost identical. That is, with V(Z) denoting the vector of a word Z, for a word pair (a, b) and a word pair (c, d) having the same relationship,

    V(b) - V(a) ≈ V(d) - V(c)

holds. This means that the relationship of the word pair (a, b) can be captured as V(b) - V(a).
In the processing of the present invention, there are in general several correspondences φ between the elements of the word set {X_1, ..., X_m} of one text and the elements of the word set {Y_1, ..., Y_n} of the other text.
Under a given φ, write Y_φ_i for the element of the word set {Y_1, ..., Y_n} that corresponds to the element X_i of the word set {X_1, ..., X_m} (here the subscript notation A_B means A with subscript B). Then, even if V(X_i) and V(Y_φ_i) are far apart, when the relationship between any element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are close, the above property of the concept base implies
    V(X_j) - V(X_i) ≈ V(Y_φ_j) - V(Y_φ_i),

so the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i) becomes small, and the distance score of φ, which is the sum of these distances, becomes small. Consequently, the inter-text distance score, which is the minimum of the distance scores of φ, also becomes small.
Conversely, under a given φ, when the relationship between some element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are far apart, in general
    V(X_j) - V(X_i) ≈ V(Y_φ_j) - V(Y_φ_i)

does not hold, so the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i) becomes large, and the distance score of φ, which is the sum of these distances, becomes large.
Therefore, if the inter-text distance score, which is the minimum of the distance scores of φ, is small, then under the φ attaining that minimum the relationship between any element pair X_i, X_j (i < j) and the relationship between the corresponding element pair Y_φ_i, Y_φ_j are close, so the similarity between the texts can be evaluated as high.
Moreover, when, under a given φ, for every element pair X_i, X_j (i < j),
    V(X_j) - V(X_i) ≈ V(Y_φ_j) - V(Y_φ_i)

holds, the list of word vectors V(X_1), ..., V(X_m) can be translated so as to almost overlap the list of word vectors V(Y_φ_1), ..., V(Y_φ_m).
For the example texts A and B mentioned in the problem to be solved by the invention, the vectors of the words in A and the vectors of the words in B are far from each other, as shown in FIG. 5. Take the injection φ:
    φ: company → station, mobile phone → commuter pass, lose → drop.
Then the distance between V(mobile phone) - V(company) and V(commuter pass) - V(station), the distance between V(lose) - V(company) and V(drop) - V(station), and the distance between V(lose) - V(mobile phone) and V(drop) - V(commuter pass) are all small, so the distance score of φ is small. As a result, the inter-text distance score becomes small, and the similarity between texts A and B can be evaluated as high.
<First embodiment>
FIG. 6 shows a configuration example of the similarity evaluation apparatus according to the present embodiment.
The similarity evaluation apparatus includes a concept base 106, word division means 101, word set specifying means 102, injection determining means 103, injection distance score calculating means 104, and inter-text distance score calculating means 105.
The similarity evaluation apparatus takes two texts as input, evaluates the similarity between the two texts, and outputs an evaluation result.
The similarity evaluation apparatus is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU: Central Processing Unit) and a main storage device (RAM: Random Access Memory). The similarity evaluation apparatus executes each process under the control of the central processing unit, for example. Data input to the similarity evaluation apparatus and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device are read out to the central processing unit as needed and used in other processing. At least a part of each processing means of the similarity evaluation apparatus may be configured by hardware such as an integrated circuit. Each storage unit of the similarity evaluation apparatus can be configured, for example, by a main storage device such as a RAM, or by middleware such as a relational database or a key-value store. However, each storage unit does not necessarily have to be provided inside the similarity evaluation apparatus; it may be configured by an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, and provided outside the similarity evaluation apparatus.
Each component is described below.
<Concept base 106>
The concept base 106 stores a set of pairs of a word and a vector representing the concept of that word. FIG. 7 shows an example of the concept base 106. The concept base 106 is generated, for example, by the method of Non-Patent Document 1 or Non-Patent Document 2.
No word appears more than once in the concept base 106.
The vector of each word is a p-dimensional vector, and the vectors of semantically close words are placed close to each other. Here, "close" and "far" refer to the distance between vectors (for example, the Euclidean distance or its square).
Only content words such as nouns, verbs, and adjectives may be registered in the concept base 106, or words of other parts of speech may also be registered; in the present embodiment, only content words are registered. Words may be registered in the concept base 106 in their base (dictionary) form and looked up by the base form, or all inflected forms may be registered and looked up by the inflected form appearing in the text; in the present embodiment, lookup is performed using the base form.
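As an illustration only, such a concept base could be held in memory as a simple word-to-vector mapping. The sketch below assumes a plain text file of pre-trained vectors in the common "word v1 ... vp" format; the file name and format are assumptions and are not specified by this embodiment.

```python
import numpy as np

def load_concept_base(path):
    """Load a word -> p-dimensional vector mapping from a whitespace-separated file."""
    concept_base = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip header or blank lines
            concept_base[parts[0]] = np.array([float(v) for v in parts[1:]])
    return concept_base

# concept_base = load_concept_base("vectors.txt")  # hypothetical file of pre-trained vectors
```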
FIG. 8 is a diagram showing an example of the processing routine of the similarity evaluation apparatus. In the following, each means of the similarity evaluation apparatus is described by describing the processing contents of each step in FIG. 8.
The processing routine of FIG. 8 takes two texts A and B as input and evaluates the similarity between A and B. As an example, take the texts A and B mentioned in the problem to be solved by the invention.
<Word division means 101>
In the processing target text G determination step S11, the word division means 101 takes the input texts A and B as input; if an unprocessed text remains among the input texts A and B, it selects the text to be processed from the unprocessed texts, sets the selected text as G, and the process proceeds to S12. If there is no unprocessed text, the process proceeds to S13.
In the word division step S12, the word division means 101 divides the text G into words and outputs the result. Specifically, it performs morphological analysis of the text G and obtains the set of distinct words constituting the text G (each word counts as a single element of the set no matter how many times it appears in the text G). Only content words such as nouns, verbs, and adjectives may be used as words, or words of other parts of speech may be added; in the present embodiment, only content words are used. Further, in the present embodiment, inflected forms are converted into the base form before being used as elements of the word set. After the process is completed, the process proceeds to S11.
When the text G is "I lost my mobile phone at the company.", the result of the word division step S12 is {company, mobile phone, lose}. When the text G is "I dropped my commuter pass at the station.", the result is {station, commuter pass, drop}.
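The sketch below illustrates one possible shape of this step; the `tokenize` function is a hypothetical placeholder for whatever morphological analyzer is used, and the part-of-speech labels are assumptions.

```python
CONTENT_POS = {"noun", "verb", "adjective"}  # assumed part-of-speech labels

def tokenize(text):
    # Placeholder for a real morphological analyzer that lemmatizes each token
    # and tags its part of speech; it should return (base_form, pos) pairs.
    raise NotImplementedError

def word_set(text):
    """Set of distinct content words in base form, as in step S12."""
    return {base for base, pos in tokenize(text) if pos in CONTENT_POS}

# With a suitable analyzer:
# word_set("I lost my mobile phone at the company.") -> {"company", "mobile phone", "lose"}
```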
<Word set specifying means 102>
In the word set specifying step S13, the word set specifying means 102 takes the word sets of the two texts obtained in S12 as input, outputs the one whose number of elements is not larger as {X_1, ..., X_m}, and outputs the other as {Y_1, ..., Y_n}. After the process is completed, the process proceeds to S14.
Since the word sets {company, mobile phone, lose} and {station, commuter pass, drop} obtained in S12 both have three elements, either one may be taken as {X_1, ..., X_m}. Here, let X_1 = company, X_2 = mobile phone, X_3 = lose, and Y_1 = station, Y_2 = commuter pass, Y_3 = drop.
<Injection determining means 103>
In the injection φ determination step S14, the injection determining means 103 takes the word sets {X_1, ..., X_m} and {Y_1, ..., Y_n} as input; among the injections from {X_1, ..., X_m} to {Y_1, ..., Y_n} that map each element X_i to Y_φ_i, if an unprocessed injection remains, it selects the injection to be processed, outputs the selected injection as φ, and the process proceeds to S15. If there is no unprocessed injection, the process proceeds to S16.
There are six injections from {X_1, X_2, X_3} to {Y_1, Y_2, Y_3}, as shown in FIG. 9. Here, take the injection in the first row of FIG. 9, "X_1 → Y_1, X_2 → Y_2, X_3 → Y_3", that is, "company → station, mobile phone → commuter pass, lose → drop", as the injection φ to be processed.
<Injection distance score calculating means 104>
In the injection distance score calculation step S15, the injection distance score calculating means 104 takes the injection φ determined as the processing target by the injection determining means 103 as input. With V(Z) denoting the vector in the concept base 106 corresponding to a word Z, it retrieves from the concept base 106 the vectors V(X_1), ..., V(X_m) corresponding to {X_1, ..., X_m} and the vectors V(Y_φ_1), ..., V(Y_φ_m) corresponding to {Y_φ_1, ..., Y_φ_m}. For every element pair X_i, X_j (i < j) in {X_1, ..., X_m}, it calculates the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i), and calculates and outputs the sum of these distances over all element pairs X_i, X_j (i < j) as the distance score of the injection φ. The Euclidean distance or the squared Euclidean distance may be used as the distance between V(X_j) - V(X_i) and V(Y_φ_j) - V(Y_φ_i). After the process is completed, the process proceeds to S14.
For the injection φ "X_1 → Y_1, X_2 → Y_2, X_3 → Y_3", the distance between V(X_2) - V(X_1) and V(Y_2) - V(Y_1), the distance between V(X_3) - V(X_1) and V(Y_3) - V(Y_1), and the distance between V(X_3) - V(X_2) and V(Y_3) - V(Y_2) are calculated, and their sum is the distance score of the injection φ. That is, for the injection φ "company → station, mobile phone → commuter pass, lose → drop", the distance between V(mobile phone) - V(company) and V(commuter pass) - V(station), the distance between V(lose) - V(company) and V(drop) - V(station), and the distance between V(lose) - V(mobile phone) and V(drop) - V(commuter pass) are calculated, and their sum is the distance score of the injection φ. When the word vectors are arranged as in FIG. 5, the distance score of the injection φ is close to 0.
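A sketch of the distance score of a single injection φ, assuming `concept_base` is a mapping from words to numpy vectors and `phi` is a tuple aligned with `X` as in the enumeration sketch above (the Euclidean distance is used; the squared variant would drop the square root):

```python
import numpy as np

def injection_distance_score(X, phi, concept_base):
    """Sum over i < j of the distance between V(X_j)-V(X_i) and V(Y_phi_j)-V(Y_phi_i)."""
    V = concept_base
    score = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            dx = V[X[j]] - V[X[i]]
            dy = V[phi[j]] - V[phi[i]]
            score += np.linalg.norm(dx - dy)  # Euclidean distance between the differences
    return score
```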
<Inter-text distance score calculating means 105>
In the inter-text distance score calculation step S16, the inter-text distance score calculating means 105 takes as input all the distance scores corresponding to all the injections calculated by the injection distance score calculating means 104 (in the case of FIG. 9, the six distance scores corresponding to the six injections), evaluates the minimum of all the distance scores as the distance score of the two texts A and B, and outputs an evaluation result. For example, (i) the distance score itself may be output as the evaluation result, or (ii) if the distance score of texts A and B is below (or not above) a certain threshold, an evaluation result indicating that texts A and B are similar may be output, and otherwise an evaluation result indicating that they are not similar may be output. After the processing is completed, the processing routine of FIG. 8 ends.
The distance score of the example texts A and B is close to 0, and texts A and B are evaluated as being similar.
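Combining steps S13 to S16, a sketch of the inter-text distance score, reusing `word_set` and `injection_distance_score` from the earlier sketches (the threshold in the comment is an arbitrary assumption):

```python
from itertools import permutations

def inter_text_distance_score(text_a, text_b, concept_base):
    """Minimum injection distance score over all injections (steps S13 to S16)."""
    wa, wb = word_set(text_a), word_set(text_b)
    X, Y = (list(wa), list(wb)) if len(wa) <= len(wb) else (list(wb), list(wa))
    return min(
        injection_distance_score(X, phi, concept_base)
        for phi in permutations(Y, len(X))
    )

# Example of evaluation (ii), with an arbitrarily assumed threshold:
# similar = inter_text_distance_score(text_a, text_b, concept_base) < 1.0
```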
<Effects>
With the above configuration, even if the meanings of the words in text A and the words in text B are far apart, the similarity between texts A and B can be evaluated as high when the relationships between the words in text A and the relationships between the words in text B are close.
<Second embodiment>
The description below focuses on the parts that differ from the first embodiment.
FIG. 10 is a diagram showing an example of the pre-processing routine of the similarity evaluation apparatus, and FIG. 11 is a diagram showing an example of the search processing routine of the similarity evaluation apparatus. FIGS. 10 and 11 are the routines for the case where, given a database containing a list of pairs of a "problem" and a "solution" as in FIG. 2, the text in each row of the "problem" column is the search target, a text corresponding to a "problem" such as the one in FIG. 1 is input as a query, and the search target texts having high similarity to the query text are to be obtained. When a search target text with high similarity is found, that search target text and the corresponding "solution" text are returned. FIG. 10 is the search pre-processing routine, which takes the list of search target texts as input, and FIG. 11 is the search processing routine, which takes the query text as input.
<Pre-processing>
The processing routine of FIG. 10 is described.
<Word division means 101>
In the processing target text H determination step S21, the word division means 101 takes the list of search target texts (for example, the list of "problems" in FIG. 2) as input; if an unprocessed search target text remains, it selects the search target text to be processed from the unprocessed search target texts, sets the selected text as H, and the process proceeds to S22. If there is no unprocessed search target text, the processing routine of FIG. 10 ends.
In the word division step S22, the word division means 101 divides the search target text H determined in S21 into words and adds the resulting word set to the aforementioned list in association with the search target text H. The list is stored in a storage unit (not shown). The processing contents are the same as those of the word division means 101 in the word division step S12 of FIG. 8. After the process is completed, the process proceeds to S21.
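A sketch of this pre-processing, with an in-memory list standing in for the storage unit and `word_set` taken from the earlier sketch:

```python
def preprocess_search_targets(search_target_texts):
    """Steps S21 and S22: word-divide every search target text and keep the pairs."""
    return [(text, word_set(text)) for text in search_target_texts]

# preprocessed = preprocess_search_targets(problem_column_texts)  # the "problem" column of FIG. 2
```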
<Search process>
The processing routine of FIG. 11 is described below.
<Word division means 101>
In the word division step S31, the word division means 101 takes the query text as input, divides the query text into words, and outputs the result. The processing is the same as that of the word division means 101 in the word division step S12 of FIG. 8. After the processing is completed, the process proceeds to S32.
<Word set specifying means 102>
In the processing target text H determination step S32, the word set specifying means 102 takes the word set of the query text as input and refers to the list of search target texts stored in the storage unit (not shown). If there is an unprocessed search target text, it selects the search target text to process from the unprocessed ones, sets the selected text as H, and proceeds to S33. If there is no unprocessed search target text, the process proceeds to S37.
In the word set specification step S33, the word set specifying means 102 retrieves the word set of the search target text H obtained in S22 from the storage unit (not shown). Of the word set of the search target text H and the word set of the query text obtained in S31, the one whose number of elements is not larger is output as {X1,…,Xm} and the other as {Y1,…,Yn}. After the processing is completed, the process proceeds to S34.
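A minimal sketch of the size comparison performed in S33, under the assumption that the word sets are held as lists (the function name is illustrative):

def specify_word_sets(query_words, target_words):
    # Return ({X1,...,Xm}, {Y1,...,Yn}): the set whose number of elements is
    # not larger comes first; when the sizes are equal, either choice is valid.
    if len(target_words) <= len(query_words):
        return target_words, query_words
    return query_words, target_words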
<Injection determining means 103>
In the injection φ determination step S34, the injection determining means 103 takes the word sets {X1,…,Xm} and {Y1,…,Yn} as input and considers the injections from {X1,…,Xm} to {Y1,…,Yn} that map each element Xi in {X1,…,Xm} to Yφ_i. If there is an unprocessed injection, it selects the injection to process from the unprocessed ones, outputs the selected injection as φ, and proceeds to S35. If there is no unprocessed injection, the process proceeds to S36.
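The injections considered in S34 can be enumerated as ordered selections of m distinct elements of {Y1,…,Yn}; a sketch under that reading (names are illustrative):

from itertools import permutations

def enumerate_injections(m, n):
    # Yield every injection phi from {X1,...,Xm} into {Y1,...,Yn}, represented
    # as a tuple in which phi[i] is the index of the element of Y that Xi maps to.
    yield from permutations(range(n), m)

# For m = n = 3 this yields the 3! = 6 injections mentioned for FIG. 9.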
<Injection distance score calculation means 104>
In the injection distance score calculation step S35, the injection distance score calculation means 104 takes as input the injection φ determined as the processing target by the injection determining means 103 and performs the same processing as the injection distance score calculation means 104 in the injection distance score calculation step S15 of FIG. 8. Alternatively, the sum of the distances between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) over all element pairs Xi, Xj (i<j) in {X1,…,Xm} may be divided by the number of all element pairs Xi, Xj (i<j) in {X1,…,Xm}, and the resulting value may be calculated and output as the distance score of the injection φ. This is a corrective measure for the fact that, when the plain sum of distances is used, the inter-text distance score of the corresponding search target text tends to become smaller as the number of elements of the word set {X1,…,Xm} becomes smaller. Therefore, when the number of elements of the smaller word set {X1,…,Xm} does not vary from one search target text to another, the calculation method of the first embodiment may be adopted; when it does vary, the corrective measure described above should be adopted. After the processing is completed, the process proceeds to S34.
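A sketch of the normalized variant described above; injection_distance_score denotes the plain-sum score of S15, as in the sketch given for the first embodiment, and the function name here is illustrative.

def normalized_injection_distance_score(X_vecs, Y_vecs, phi):
    # Sum of the pairwise difference-vector distances divided by the number of
    # element pairs, so that search target texts with fewer words are not
    # systematically favored.
    m = len(X_vecs)
    num_pairs = m * (m - 1) // 2
    if num_pairs == 0:
        return 0.0  # a word set with fewer than two elements has no pairs
    return injection_distance_score(X_vecs, Y_vecs, phi) / num_pairs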
<Inter-text distance score calculation means 105>
In the inter-text distance score calculation step S36, the inter-text distance score calculation means 105 takes as input all the distance scores corresponding to all the injections for the search target text H calculated by the injection distance score calculation means 104, and evaluates the minimum value of all the distance scores as the distance score between the query text and the search target text H. After the processing is completed, the process proceeds to S32.
As described above, when there is no unprocessed search target text in S32, the process proceeds to S37. In the evaluation result generation step S37, the inter-text distance score calculation means 105 generates and outputs an evaluation result based on the distance scores between the query text and the respective search target texts. The following are possible forms of the evaluation result (a sketch of form (3) follows this list).
(1) The search target text that takes the minimum distance score among all the distance scores between the query text and all the search target texts, together with that distance score.
(2) A list of pairs of a search target text whose distance score is at or below (or below) a certain threshold and that distance score.
(3) A list of pairs of a search target text and the corresponding distance score, obtained by ranking the search target texts in ascending order of their distance score to the query text and arranging them in that order. Here, the evaluation result may be limited to the pairs up to a certain rank in the list, or to the pairs whose distance score is at or below (or below) a certain threshold.
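A minimal sketch of evaluation result (3), with optional limitation to the top k results or to scores at or below a threshold (the parameter names are illustrative):

def generate_ranking(scored_targets, top_k=None, threshold=None):
    # scored_targets: list of (search_target_text, distance_score) pairs from S36.
    ranked = sorted(scored_targets, key=lambda pair: pair[1])  # ascending distance
    if threshold is not None:
        ranked = [pair for pair in ranked if pair[1] <= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return ranked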
When the query text of FIG. 1 is input against the database of FIG. 2, the search target text "I dropped my commuter pass at the station." is output as the search target text taking the minimum inter-text distance score.
In this embodiment, as described above, the corresponding "solution" text is output together with the evaluation result.
<Effect>
With this configuration, the same effect as in the first embodiment can be obtained.
<Modification>
When evaluating the similarity between arbitrary texts A and B, in addition to the inter-text distance score described in the first and second embodiments, another inter-text distance based on the distances between the vectors of the words in A and the vectors of the words in B, such as the inter-text distance described in the background art, may be calculated. The value obtained by a weighted linear combination of the two calculated distances may then be taken as the final inter-text distance, and the similarity may be evaluated based on this inter-text distance.
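A sketch of this modification, taking for illustration the distance between the centroids of the word vectors as the conventional word-vector-based inter-text distance; text_distance_score is the injection-based score from the earlier sketch, and the weights w1 and w2 are assumptions to be chosen by the implementer.

import numpy as np

def combined_text_distance(words_a, words_b, concept_base, w1=0.5, w2=0.5):
    # Injection-based inter-text distance score (first/second embodiment).
    structural = text_distance_score(words_a, words_b, concept_base)
    # Conventional distance based directly on the word vectors: here, the
    # distance between the centroids of the word vectors of the two texts.
    centroid_a = np.mean([concept_base[w] for w in words_a if w in concept_base], axis=0)
    centroid_b = np.mean([concept_base[w] for w in words_b if w in concept_base], axis=0)
    baseline = float(np.linalg.norm(centroid_a - centroid_b))
    # Weighted linear combination as the final inter-text distance.
    return w1 * structural + w2 * baseline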
<Other modifications>
The present invention is not limited to the above-described embodiments and modifications. For example, the various processes described above may be executed not only in time series according to the description but also in parallel or individually, depending on the processing capability of the apparatus that executes them or as required. Other changes may be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing contents of the functions that each device should have are described by a program, and by executing this program on a computer, the various processing functions of each device are realized on the computer.
The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any type, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.
A computer that executes such a program, for example, first stores the program recorded on the portable recording medium or transferred from the server computer in its own storage unit. When executing the processing, the computer reads the program stored in its own storage unit and executes the processing according to the read program. As another embodiment of the program, the computer may read the program directly from the portable recording medium and execute the processing according to the program. Furthermore, every time the program is transferred from the server computer to the computer, the computer may sequentially execute the processing according to the received program. Alternatively, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. Note that the program includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has a property that defines the processing of the computer).
Although each device is configured by executing a predetermined program on a computer as described above, at least part of these processing contents may be realized by hardware.
The present invention is applicable to a similarity evaluation technique that, for two texts A and B, evaluates the similarity between A and B as high if the relationships among the words in A and the relationships among the words in B are close, even if the meanings of the words in A and the words in B are distant.

Claims (5)

  1.  A similarity evaluation device comprising:
     a concept base in which a set of pairs of a word and a vector representing the concept of the word is stored;
     word dividing means for dividing a text into words;
     word set specifying means for setting, of the respective word sets of two texts, the one whose number of elements is not larger as {X1,…,Xm} and the other as {Y1,…,Yn};
     injection determining means for determining an injection φ from {X1,…,Xm} to {Y1,…,Yn} that maps each element Xi in {X1,…,Xm} to Yφ_i;
     injection distance score calculating means for calculating, where V(Z) denotes the vector in the concept base corresponding to a word Z, the distance between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) for every element pair Xi, Xj (i<j) in {X1,…,Xm}, and calculating the sum of the distances over all element pairs Xi, Xj (i<j) as the distance score of the injection φ; and
     inter-text distance score calculating means for setting the minimum value of all the distance scores corresponding to all the injections calculated by the injection distance score calculating means as the distance score of the two texts.
  2.  A similarity evaluation device comprising:
     a concept base in which a set of pairs of a word and a vector representing the concept of the word is stored;
     word dividing means for dividing a text into words;
     word set specifying means for setting, of the word set of a query text and the word set of each of one or more search target texts, the one whose number of elements is not larger as {X1,…,Xm} and the other as {Y1,…,Yn};
     injection determining means for determining an injection φ from {X1,…,Xm} to {Y1,…,Yn} that maps each element Xi in {X1,…,Xm} to Yφ_i;
     injection distance score calculating means for calculating, where V(Z) denotes the vector in the concept base corresponding to a word Z, the distance between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) for every element pair Xi, Xj (i<j) in {X1,…,Xm}, and calculating, as the distance score of the injection φ, the value obtained by dividing the sum of the distances over all element pairs Xi, Xj (i<j) by the number of all element pairs Xi, Xj (i<j); and
     inter-text distance score calculating means for setting the minimum value of all the distance scores corresponding to all the injections for the search target text calculated by the injection distance score calculating means as the distance score between the query text and the search target text,
     wherein the inter-text distance score calculating means generates an evaluation result using the distance scores between the query text and each of the one or more search target texts.
  3.  A similarity evaluation method, wherein a concept base stores a set of pairs of a word and a vector representing the concept of the word, the method comprising:
     a word division step in which word dividing means divides a text into words;
     a word set specification step in which word set specifying means sets, of the respective word sets of two texts, the one whose number of elements is not larger as {X1,…,Xm} and the other as {Y1,…,Yn};
     an injection determination step in which injection determining means determines an injection φ from {X1,…,Xm} to {Y1,…,Yn} that maps each element Xi in {X1,…,Xm} to Yφ_i;
     an injection distance score calculation step in which injection distance score calculating means calculates, where V(Z) denotes the vector in the concept base corresponding to a word Z, the distance between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) for every element pair Xi, Xj (i<j) in {X1,…,Xm}, and calculates the sum of the distances over all element pairs Xi, Xj (i<j) as the distance score of the injection φ; and
     an inter-text distance score calculation step in which inter-text distance score calculating means sets the minimum value of all the distance scores corresponding to all the injections calculated in the injection distance score calculation step as the distance score of the two texts.
  4.  A similarity evaluation method, wherein a concept base stores a set of pairs of a word and a vector representing the concept of the word, the method comprising:
     a word division step in which word dividing means divides a text into words;
     a word set specification step in which word set specifying means sets, of the word set of a query text and the word set of each of one or more search target texts, the one whose number of elements is not larger as {X1,…,Xm} and the other as {Y1,…,Yn};
     an injection determination step in which injection determining means determines an injection φ from {X1,…,Xm} to {Y1,…,Yn} that maps each element Xi in {X1,…,Xm} to Yφ_i;
     an injection distance score calculation step in which injection distance score calculating means calculates, where V(Z) denotes the vector in the concept base corresponding to a word Z, the distance between V(Xj)-V(Xi) and V(Yφ_j)-V(Yφ_i) for every element pair Xi, Xj (i<j) in {X1,…,Xm}, and calculates, as the distance score of the injection φ, the value obtained by dividing the sum of the distances over all element pairs Xi, Xj (i<j) by the number of all element pairs Xi, Xj (i<j);
     an inter-text distance score calculation step in which inter-text distance score calculating means sets the minimum value of all the distance scores corresponding to all the injections for the search target text calculated in the injection distance score calculation step as the distance score between the query text and the search target text; and
     an evaluation result generation step in which the inter-text distance score calculating means generates an evaluation result using the distance scores between the query text and each of the one or more search target texts.
  5.  A program for causing a computer to function as the similarity evaluation device according to claim 1 or claim 2.
PCT/JP2019/019829 2018-05-31 2019-05-20 Similarity assessment device, method therefor, and program WO2019230465A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-104292 2018-05-31
JP2018104292A JP2019211808A (en) 2018-05-31 2018-05-31 Similarity evaluation apparatus, method thereof and program

Publications (1)

Publication Number Publication Date
WO2019230465A1 true WO2019230465A1 (en) 2019-12-05

Family

ID=68696696

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/019829 WO2019230465A1 (en) 2018-05-31 2019-05-20 Similarity assessment device, method therefor, and program

Country Status (2)

Country Link
JP (1) JP2019211808A (en)
WO (1) WO2019230465A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102352481B1 (en) * 2019-12-27 2022-01-18 동국대학교 산학협력단 Sentence analysis device using morpheme analyzer built on machine learning and operating method thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012003333A (en) * 2010-06-14 2012-01-05 Nippon Telegr & Teleph Corp <Ntt> Similar document retrieval device, similar document retrieval method, its program and recording medium
JP2015005174A (en) * 2013-06-21 2015-01-08 日本放送協会 Content retrieval system, method, and program

Also Published As

Publication number Publication date
JP2019211808A (en) 2019-12-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19812172

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19812172

Country of ref document: EP

Kind code of ref document: A1