JP6740877B2 - Similarity calculation program, similarity calculation method, and similarity calculation device

Info

Publication number: JP6740877B2
Application number: JP2016229208A
Authority: JP
Inventors: Kensuke Baba (謙介馬場)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2016-11-25
Filing date: 2016-11-25
Publication date: 2020-08-19
Anticipated expiration: 2036-11-25
Also published as: JP2018085051A

Description

The present invention relates to a similarity calculation program, a similarity calculation method, and a similarity calculation device.

In recent years, documents have been compared with one another to extract similar portions. For example, a technique is known in which the character strings of two texts are arranged in rows and columns and similar portions are extracted based on the similarity between characters. In that technique, the number of consecutive mismatched or skipped characters counted from the end point is limited, and extraction is reset and restarted, so that representative local correspondences are extracted exhaustively.

Techniques have also been proposed that display, for the character strings in a document, the distribution of character strings that match or resemble a search string, together with their degree of matching.

Patent documents:
JP 2012-59100 A
JP-A-7-146872
JP-A-8-249445

Non-patent documents:
1. M. J. Fischer and M. S. Paterson: String-matching and other products, Complexity of Computation (Proceedings of the SIAM-AMS Applied Mathematics Symposium, New York, 1973), pp. 113-125, 1974
2. D. Gusfield: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997
3. M. J. Atallah et al.: A randomized algorithm for approximate string matching, Algorithmica, 29:468-486, 2001
4. K. Baba et al.: A Note on Randomized Algorithm for String Matching with Mismatches, Nordic Journal of Computing, 10(1):2-12, 2003
5. T. Schoenmeyr and D. Yu-Zhang: FFT-based Algorithms for the String Matching with Mismatches Problem, Journal of Algorithms, 57:130-139, 2005

The techniques described above calculate the similarity between two documents at mutually corresponding positions, or extract character strings that match or resemble a search string given for the document being searched. Consequently, when the similarity between two documents is calculated for every positional shift, the processing load grows with the number of character types in the documents and with the length of the character strings obtained by splitting the documents at delimiters or the like.

Therefore, in one aspect, an object of the present invention is to reduce the amount of calculation required when documents are compared character by character.

According to one aspect, there is provided a similarity calculation program that causes a computer to execute a process of: generating, for each character or character string contained in a plurality of pieces of document data, a plurality of vectors by vectorizing the pieces of document data; calculating the element-wise product of the results of Fourier-transforming each of the plurality of vectors; generating a summed vector by adding the products element by element over the respective characters or character strings; and generating correlation values between the pieces of document data from the summed vector by using an inverse Fourier transform.

The above problem can also be solved by a similarity calculation method and a similarity calculation device.

The amount of calculation required when documents are compared character by character can be reduced.

FIG. 1 is a diagram for explaining positional shifts.
FIG. 2 is a diagram showing an example of the correlation between documents.
FIG. 3 is a diagram for explaining the relationship between calculation and calculation time.
FIG. 4 is a conceptual diagram of the calculation method.
FIG. 5 is a diagram for explaining the reduction in the amount of calculation.
FIG. 6 is a diagram showing the hardware configuration of the similarity calculation device.
FIG. 7 is a diagram showing an example functional configuration of the similarity calculation device.
FIG. 8 is a flowchart for explaining a first similarity calculation process that obtains the correlation between two documents.
FIG. 9 is a diagram showing a processing example for the flowchart of FIG. 8.
FIG. 10 is a flowchart for explaining a second similarity calculation process that obtains 1-to-N correlations.
FIG. 11 is a diagram comparing execution time against vocabulary size.
FIG. 12 is a diagram comparing execution time against document length.
FIG. 13 is a diagram comparing the vocabulary size of a document and processing time.

Embodiments of the present invention are described below with reference to the drawings. First, consider the problem of taking the character strings of two documents as input and calculating the number of character matches (the correlation) for every positional shift. Here, the "correlation" is the number of matches between the words that are aligned at each positional shift.

As a related technique for obtaining the correlation between texts, there is a fast algorithm called the FFT-based algorithm. FFT stands for Fast Fourier Transform.

The FFT-based algorithm repeats a convolution operation that takes O(n log n) time σ times. Here, σ is the number of types of characters or character strings, and n is the length of the character string of each text.

Next, consider the time required to calculate the correlation. As examples, the amount of calculation is compared between a naive calculation that compares words one after another and a calculation that uses the FFT-based algorithm.

Basically, when σ ≤ n can be assumed, the FFT-based algorithm is faster than the naive calculation. That is, when σ is constant and small, the comparison reduces to O(n²) versus O(n log n). However, when σ is variable and small, the advantage of O(σ n log σ) over O(n² log σ) is small. In other words, a large σ is not assumed; a constant, small σ is assumed and the calculation time is treated as O(n log n) (Non-Patent Documents 1 and 2). As a result, there is a problem in that the number of FFT executions becomes large.

To address this problem, it has been proposed to approximate the result by using only k of the repeated convolutions (Non-Patent Documents 3, 4, and 5). However, because the convolution operation is treated as a single processing unit, the improvement is limited to reducing the number of convolution operations.

Furthermore, such approximate methods (Non-Patent Documents 3, 4, and 5) output an approximate value rather than the exact value. For a document of length n, the variance of the estimate of the correlation c_i obtained with k repetitions is (n - c_i)/k. This variance expresses how widely the approximate values scatter around the correct value c_i. The error is large for large n, so the methods are not suited to long documents. The error is also large for small c_i, which means that small correlation values are estimated with low accuracy. Consequently, these methods cannot be applied when a moving average is calculated or when the entire correlation is used as a vector for machine learning or the like.

The convolution operation has long been a common concept in fields such as signal processing. In signal processing, data is processed in units of the data length determined by the communication between sender and receiver, so σ is small.

Convolution operations are also available as existing functions in programming languages. In such development environments, the processing load inside the convolution operation has not been analyzed in detail. Documents such as academic journals and papers are far longer than the data lengths used in communication.

The present embodiment further speeds up processing when σ is large, and thereby meets the recent demand for obtaining correlations over large volumes of data with a large σ, for example when documents are compared.

The issue with the correlation calculation time is reducing the practical execution time. Specifically, the aim is to reduce the coefficient of σ n log σ, that is, the number of FFT executions. The inventor focused on the fact that the convolution operation is carried out by a discrete Fourier transform and an inverse discrete Fourier transform, and achieved faster correlation calculation for large volumes of data with a large σ. In the following description, documents are used as an example of such data.

When the correlation between two documents s and t is obtained, similar portions consisting of two or more consecutive similar characters do not necessarily occur at the same position in the character strings of s and t. Cases where the positions of the similar portions are shifted must also be considered. The positional shifts taken into account by the correlation are described next.

First, the notation for the correlation calculation between two character strings is defined as follows.

The character strings of document s and document t are represented as elements of the set Σⁿ of all character strings of length n.

The correlation c(s, t) between document s and document t is represented by a (2n-1)-dimensional vector whose i-th element is given by

[Expression 3: equation image not reproduced in this text]

Here, for

[equation image not reproduced in this text]

the character strings are written in formula-like notation as in

[equation image not reproduced in this text]

which expresses the addition of dummy words for the formal comparisons that fall outside the range. In this way, the correlation can be calculated with positional shifts taken into account.
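The equation images for this definition did not survive in this text. As a hedged reconstruction only, the i-th element can be written as below. The match function δ, the dummy symbol ⊥, and the exact index convention are assumptions chosen to agree with the worked example of FIG. 2; they are not taken from the original notation.

```latex
c_i(s,t) = \sum_{j=1}^{n} \delta\bigl(s_j,\, t_{i-n+j}\bigr), \qquad 1 \le i \le 2n-1,
\qquad
\delta(x,y) =
\begin{cases}
1 & \text{if } x = y \in \Sigma \\
0 & \text{otherwise}
\end{cases},
\qquad
t_k = \bot \notin \Sigma \quad \text{for } k < 1 \text{ or } k > n .
```

Under this convention, i = n corresponds to the two documents aligned without any shift, and the dummy symbol never contributes a match.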

FIG. 1 is a diagram for explaining positional shifts. It is a conceptual diagram in which dummy words that belong to neither document s nor document t are added before and after document t so that the positional shift of the words of document s relative to document t can be taken into account.

Document s is shifted one word at a time, and at each shifted position the correlation with the aligned words of document t is calculated. A shift position corresponds to a case in which the word order in document s differs from that in document t. Calculating the correlation at every one-word shift yields the per-shift correlations c₁, c₂, ..., c_{2n-1}.

If the correlation c(s, t) over all positional shifts is taken to be the number of character matches at each shift, a naive calculation performs O(n²) character comparisons. The time for one character comparison depends on log σ.

FIG. 2 is a diagram showing an example of the correlation between documents, for the case where document s is the character string "abbacab" and document t is the character string "ababbac". In FIG. 2, the dummy word x is shown as a blank.

The head of document s is aligned with the head of document t', which is document t with n-1 dummy words added before and after it, and the number of matching characters is counted while document s is shifted one word at a time. In this example, when the head of s is aligned with the head of t', no characters of s and t match, so the correlation c₁ is 0. After a shift of one word, the last two characters of s match the first two characters of t, so the correlation c₂ is 2.

After a further shift of one word, the correlation c₃ is 0. Continuing in order, c₄ is 3, c₅ is 1, c₆ is 2, c₇ is 3, c₈ is 1, c₉ is 5, c₁₀ is 1, c₁₁ is 0, c₁₂ is 1, and c₁₃ is 0.
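The shift-by-shift counting described for FIG. 2 can be sketched as follows. This is a minimal illustration and not the patented program: the function name `naive_correlation` and the index convention (element i aligns s_j with t_{i-n+j}, with positions outside t acting as dummy words) are assumptions chosen to agree with the values c₁ to c₁₃ given in the text.

```python
def naive_correlation(s: str, t: str) -> list:
    """Count character matches for every shift of s against t.

    Hypothetical helper: performs O(n^2) character comparisons.
    Element i (1-based) aligns s_j with t_{i-n+j}; positions that fall
    outside t behave like dummy words and never match.
    """
    n = len(s)
    assert len(t) == n, "this sketch assumes two documents of equal length n"
    c = []
    for i in range(1, 2 * n):           # i = 1 .. 2n-1
        matches = 0
        for j in range(1, n + 1):       # j = 1 .. n
            k = i - n + j               # position in t aligned with s_j
            if 1 <= k <= n and s[j - 1] == t[k - 1]:
                matches += 1
        c.append(matches)
    return c


# The FIG. 2 strings.
print(naive_correlation("abbacab", "ababbac"))
```

Calling it on the FIG. 2 strings reproduces the thirteen values listed in the text, and the double loop makes the O(n²) comparison count explicit.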

Next, the FFT-based convolution calculation performed for each type of character or character string is described. The cyclic convolution r of two n-dimensional vectors u and v is given by

[equation image not reproduced in this text]

where v_i = v_{n+i} for -n+1 ≤ i ≤ 0.

Let R, U, and V be the discrete Fourier transforms of r, u, and v, respectively, and let ∘ denote the element-wise product. Then

R = U ∘ V

holds. It follows that r can be calculated from u and v in O(n log n) time by FFT. The relationship between the calculation routes and the calculation time is explained with reference to FIG. 3.

FIG. 3 is a diagram for explaining the relationship between calculation and calculation time. In FIG. 3, n represents the length of the text. Calculating r directly from u and v, that is, computing the comparison r of word u and word v, takes O(n²) time.

On the other hand, calculating the discrete Fourier transforms U and V of u and v takes O(n log n) time, and calculating the element-wise product R of U and V takes O(n) time. Calculating the inverse discrete Fourier transform from R to r also takes O(n log n) time.
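The two calculation routes of FIG. 3 can be compared with a short NumPy sketch. The function names are hypothetical, and the code uses the standard 0-based cyclic convolution r[i] = Σ_j u[j]·v[(i-j) mod n], which may differ from the original equation image by an index offset.

```python
import numpy as np


def cyclic_convolution_direct(u, v):
    """Direct route: O(n^2) multiplications."""
    n = len(u)
    return np.array(
        [sum(u[j] * v[(i - j) % n] for j in range(n)) for i in range(n)],
        dtype=float,
    )


def cyclic_convolution_fft(u, v):
    """FFT route: two FFTs, one element-wise product, one inverse FFT."""
    U = np.fft.fft(u)            # O(n log n)
    V = np.fft.fft(v)            # O(n log n)
    R = U * V                    # O(n), element-wise product
    return np.fft.ifft(R).real   # O(n log n); imaginary part is rounding noise


u = [1, 0, 0, 1]
v = [0, 0, 0, 1]
print(cyclic_convolution_direct(u, v))
print(np.round(cyclic_convolution_fft(u, v), 6))
```

Both routes return the same vector up to floating-point rounding; only the operation count differs.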

Next, an overview of the FFT-based algorithm is given. The FFT-based algorithm converts characters into numerical values and calculates the correlation between character strings by convolution of vectors. If a particular character a in the set Σ is replaced with 1 and every other character with 0, the correlation that counts only matches of the character a can be obtained by a convolution operation, in O(n log n) time. Replacing a document with a numerical sequence takes O(n) time. Because a match or mismatch can be expressed by multiplication, the convolution operation is applicable.

Because a convolution operation taking O(n log n) time is performed for each type of character or character string that is an element of the set Σ, it is repeated σ times, once per type. The correlation is then obtained by summing the resulting vectors element by element. This yields the correlation c_i at each positional shift.

More specifically, the calculation formulas of the FFT-based algorithm are as follows.

For

[equation image not reproduced in this text]

let φ_a be the function that maps a to 1 and every other character to 0. From the definition (Expression 3), for n ≤ i ≤ 2n-1,

[equation image not reproduced in this text]

holds. Exchanging the order of the summations gives

[Expression 10: equation image not reproduced in this text]

The vectors (u_{a,1}, u_{a,2}, ..., u_{a,2n-1}) and (v_{a,1}, v_{a,2}, ..., v_{a,2n-1}) are given by

[Expression 11: equation image not reproduced in this text]

In Expression 11, one of the two vectors is reversed for 1 ≤ i ≤ n, and zeros are filled in for n ≤ i ≤ 2n-1. With these (2n-1)-dimensional vectors, Expression 10 is written as

[Expression 12: equation image not reproduced in this text]

The term

[equation image not reproduced in this text]

in Expression 12 is a cyclic convolution.
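Because the equation images are missing here, the derivation can only be sketched under assumptions. The following is one self-consistent reconstruction, not the original expressions: it assumes c_i = Σ_j δ(s_j, t_{i-n+j}) with out-of-range positions of t treated as a dummy symbol for which φ_a is 0, and it reverses the vector for t, which is the choice that reproduces the vectors and the output order of the FIG. 5 example.

```latex
\begin{aligned}
c_i &= \sum_{j=1}^{n} \sum_{a \in \Sigma} \varphi_a(s_j)\,\varphi_a(t_{i-n+j})
     = \sum_{a \in \Sigma} \sum_{j=1}^{n} \varphi_a(s_j)\,\varphi_a(t_{i-n+j}), \\[4pt]
u_{a,j} &=
\begin{cases}
\varphi_a(s_j) & 1 \le j \le n \\
0 & n < j \le 2n-1
\end{cases},
\qquad
v_{a,j} =
\begin{cases}
\varphi_a(t_{n+1-j}) & 1 \le j \le n \\
0 & n < j \le 2n-1
\end{cases}, \\[4pt]
r_{a,k} &= \sum_{j=1}^{2n-1} u_{a,j}\, v_{a,\,k+1-j}
\quad (\text{index of } v \text{ taken modulo } 2n-1 \text{ into } 1,\dots,2n-1), \\[4pt]
c_i &= \sum_{a \in \Sigma} r_{a,\,2n-i}.
\end{aligned}
```

The zero padding to length 2n-1 ensures that the wrapped-around terms of the cyclic convolution are all zero, so the cyclic result equals the linear one. With this indexing the summed vector Σ_a r_a lists the correlations in the order c_{2n-1}, ..., c_1.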

When the positional shifts of similar portions are taken into account, the convolution operation is repeated as many times as there are types of characters or character strings (elements of the set Σ), which lengthens the calculation time.
・One convolution operation involves two FFTs, one element-wise product of vectors, and one inverse FFT. The calculation time of the inverse FFT is considered equivalent to that of the FFT.
・For a single execution, the FFT takes O(n log n) time and the element-wise product takes O(n) time, so the FFT is dominant.
・Therefore, one correlation calculation requires 3σ FFTs.
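The per-character procedure summarized in the list above can be sketched in Python with NumPy. This is an illustration under assumptions rather than the implementation of the related technique: the names `indicator` and `correlation_per_character_fft`, the zero padding to length 2n-1, and the reversal of the indicator vector of t are choices made here so that the result matches the FIG. 5 example. The counter tallies three FFT-type transforms per character type.

```python
import numpy as np


def indicator(text, ch):
    """1.0 where text has character ch, 0.0 elsewhere (hypothetical helper)."""
    return np.array([1.0 if x == ch else 0.0 for x in text])


def correlation_per_character_fft(s, t):
    """Related-art style: one full convolution (FFT, FFT, inverse FFT) per character."""
    n = len(s)
    assert len(t) == n, "this sketch assumes two documents of equal length n"
    m = 2 * n - 1                       # padded length avoids wrap-around
    total = np.zeros(m)
    fft_calls = 0
    for ch in sorted(set(s) | set(t)):  # sigma character types
        u = np.zeros(m)
        u[:n] = indicator(s, ch)
        v = np.zeros(m)
        v[:n] = indicator(t, ch)[::-1]  # t is reversed, then zero-padded
        U = np.fft.fft(u)
        V = np.fft.fft(v)
        r = np.fft.ifft(U * V).real     # inverse FFT for every character
        fft_calls += 3
        total += r                      # element-wise sum in the original domain
    return np.rint(total).astype(int), fft_calls


print(correlation_per_character_fft("abca", "abcb"))
```

For the strings "abca" and "abcb" there are three character types, so the counter reports 9 transforms, in line with the 3σ figure above.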

The number of repetitions σ of the convolution operation is the logical minimum for an alphabet of size σ.

In the related technique, a convolution operation is performed for each type of character or character string and the results are then aggregated. The inventor found that the number of Fourier transforms can be reduced by aggregating before the inverse Fourier transform (inverse FFT), which is the final step of the convolution operation. The transformation of the calculation formula on which the inventor focused in order to reduce the number of Fourier transforms is described below.

Let r_a be the cyclic convolution of u_a and v_a. The correlation c(s, t) is then obtained from the vector

[equation image not reproduced in this text]

This vector is written as

[Expression 15: equation image not reproduced in this text]

Letting f denote the discrete Fourier transform and R_a = f(r_a), Expression 15 can be rewritten as

[Expression 16: equation image not reproduced in this text]

The right-hand side of Expression 16 shows that the inverse FFT only needs to be performed once, after the vectors have been added element by element. The inverse FFT after the element-wise addition can be counted as one FFT.
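The original expressions are not available in this text, but the step they describe follows from the linearity of the inverse discrete Fourier transform. A hedged restatement, using f⁻¹ for the inverse transform (notation assumed here), is:

```latex
\sum_{a \in \Sigma} r_a
  \;=\; \sum_{a \in \Sigma} f^{-1}(R_a)
  \;=\; f^{-1}\!\Bigl(\,\sum_{a \in \Sigma} R_a\Bigr),
\qquad
R_a = f(r_a) = f(u_a) \circ f(v_a).
```

The middle form needs σ inverse transforms, one per character type, while the right-hand form needs only one, applied to the element-wise sum of the R_a.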

FIG. 4 is a conceptual diagram of the calculation method. In FIG. 4, the convolution operation 2p, and the corresponding calculation in the naive method that does not use the convolution operation 2p, are the parts in which processing is repeated for each element (type of character or character string) of the set Σ, that is, σ times. The σ vectors thus obtained are then summed element by element to obtain the correlation c(s, t) between document s and document t.

In the related technique, the convolution operation 2p performs the FFT calculation 2σ times for the discrete Fourier transforms and σ times for the inverse discrete Fourier transforms, 3σ times in total.

In the present embodiment, by contrast, within the convolution operation 2p the vectors are added element by element before the inverse FFT is executed, and the inverse FFT is performed on the result. This reduces the number of inverse FFT calculations from σ to one. The larger σ is, the greater the effect: compared with the related technique, the correlation between two documents can be obtained in about two-thirds of the time.

When σ is small, an application may need the correlation for each individual character. When σ is large, however, even if the correlation for a specific character is needed, a large σ still has to be considered for the remaining characters, and from that viewpoint the present embodiment is the more suitable.

FIG. 5 shows a calculation example for the related technique and the present embodiment of FIG. 4, and the reduction in the amount of calculation is described further with reference to it. FIG. 5 is a diagram for explaining the reduction in the amount of calculation.

FIG. 5 shows an example of the calculation up to obtaining the correlation c(s, t) between two documents s and t, where document s is the character string "abca" and document t is the character string "abcb". Examples of the vectors are shown on the left, and examples of their discrete Fourier transforms on the right.

Documents s and t are converted into vectors for each type of word they contain. Three characters, a, b, and c, occur in documents s and t. First, for each of the characters a, b, and c, each document is converted into a vector that has 1 at the positions where the character occurs and 0 elsewhere. Document s is converted into the vectors
u_a=(1,0,0,1),
u_b=(0,1,0,0),
u_c=(0,0,1,0)
and document t is converted into the vectors
v_a=(0,0,0,1),
v_b=(1,0,1,0),
v_c=(0,1,0,0)
The vectors for document t are listed in reversed character order, in line with the reversal described for Expression 11.
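The conversion of the two example documents into per-character vectors can be sketched as follows. The function name `indicator_vectors` and its `reverse` flag are hypothetical; reversing document t is an assumption made to reproduce the v vectors exactly as they are listed above.

```python
def indicator_vectors(text, reverse=False):
    """Map each character type in text to a 0/1 vector marking where it occurs.

    Hypothetical helper: with reverse=True the positions are listed in
    reversed character order, as the v vectors of the example are.
    """
    seq = text[::-1] if reverse else text
    return {ch: [1 if x == ch else 0 for x in seq] for ch in sorted(set(text))}


u = indicator_vectors("abca")                 # document s
v = indicator_vectors("abcb", reverse=True)   # document t, reversed
for ch in "abc":
    print(ch, u[ch], v[ch])
```

Each document is scanned once per character type, so the conversion costs O(σn) in this simple form; a single pass that fills all vectors at once would also work.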

For document s, each of the vectors u_a, u_b, and u_c is subjected to a discrete Fourier transform to obtain

[equation image not reproduced in this text]

Similarly, for document t, each of the vectors v_a, v_b, and v_c is subjected to a discrete Fourier transform to obtain

[equation image not reproduced in this text]

That is, the discrete Fourier transform is performed 6 (2×σ) times.

Next, the element-wise product R_x of U_x and V_x (x = a, b, c) is calculated, which gives

[equation image not reproduced in this text]

In the related technique, an inverse discrete Fourier transform is performed on each of R_a, R_b, and R_c to obtain
r_a=(0,0,0,1),
r_b=(1,0,1,0),
r_c=(0,1,0,0)
In the related technique, the inverse discrete Fourier transform is thus performed 3 (σ) times. The element-wise sum is then taken to obtain the correlation values of the two documents s and t,
(0,1,0,3,0,0,1)

In the present embodiment, by contrast, after the element-wise product R_x of U_x and V_x (x = a, b, c) has been obtained, the element-wise sum is taken without performing an inverse discrete Fourier transform on each R_x. Performing an inverse discrete Fourier transform on the resulting sum

[equation image not reproduced in this text]

yields the same correlation values as the related technique,
(0,1,0,3,0,0,1)
That is, the correlation between the two documents s and t is obtained with a single inverse discrete Fourier transform. In this example σ is 3, so the processing time for the inverse discrete Fourier transform falls to 1/3. Overall, the related technique performs 3σ FFTs (= 2σ + σ), that is, 9 (= 6 + 3), whereas the present embodiment performs 2σ + 1 FFTs, that is, 7 (= 6 + 1). Although the case of σ = 3 has been described, the effect of the present embodiment grows as σ increases.

次に、上述した処理を行う類似度算出装置１００（図６）のハードウェア構成について説明する。図６は、類似度算出装置のハードウェア構成を示す図である。図６において、類似度算出装置１００は、コンピュータによって制御される情報処理装置であって、ＣＰＵ（Central Processing Unit）１１と、主記憶装置１２と、補助記憶装置１３と、入力装置１４と、表示装置１５と、通信Ｉ／Ｆ（インターフェース）１７と、ドライブ装置１８とを有し、バスＢに接続される。 Next, the hardware configuration of the similarity calculation device 100 (FIG. 6) that performs the above-described processing will be described. FIG. 6 is a diagram showing a hardware configuration of the similarity calculation device. 6, the similarity calculation device 100 is an information processing device controlled by a computer, and includes a CPU (Central Processing Unit) 11, a main storage device 12, an auxiliary storage device 13, an input device 14, and a display. It has a device 15, a communication I/F (interface) 17, and a drive device 18, and is connected to the bus B.

ＣＰＵ１１は、主記憶装置１２に格納されたプログラムに従って類似度算出装置１００を制御するプロセッサに相当する。主記憶装置１２には、ＲＡＭ（Random Access Memory）、ＲＯＭ（Read Only Memory）等が用いられ、ＣＰＵ１１にて実行されるプログラム、ＣＰＵ１１での処理に必要なデータ、ＣＰＵ１１での処理にて得られたデータ等を記憶又は一時保存する。 The CPU 11 corresponds to a processor that controls the similarity calculation device 100 according to a program stored in the main storage device 12. A RAM (Random Access Memory), a ROM (Read Only Memory), or the like is used for the main storage device 12, and a program executed by the CPU 11, data required for processing by the CPU 11, and data obtained by the processing by the CPU 11 are obtained. Stored or temporarily saved the data etc.

補助記憶装置１３には、ＨＤＤ（Hard Disk Drive）等が用いられ、各種処理を実行するためのプログラム等のデータを格納する。補助記憶装置１３に格納されているプログラムの一部が主記憶装置１２にロードされ、ＣＰＵ１１に実行されることによって、各種処理が実現される。 An HDD (Hard Disk Drive) or the like is used as the auxiliary storage device 13, and stores data such as programs for executing various processes. Various processes are realized by loading a part of the program stored in the auxiliary storage device 13 into the main storage device 12 and executing it in the CPU 11.

入力装置１４は、マウス、キーボード等を有し、ユーザが類似度算出装置１００による処理に必要な各種情報を入力するために用いられる。表示装置１５は、ＣＰＵ１１の制御のもとに必要な各種情報を表示する。入力装置１４と表示装置１５とは、一体化したタッチパネル等によるユーザインタフェースであってもよい。通信Ｉ／Ｆ１７は、有線又は無線などのネットワークを通じて通信を行う。通信Ｉ／Ｆ１７による通信は無線又は有線に限定されるものではない。
類似度算出装置１００によって行われる処理を実現するプログラムは、例えば、ＣＤ−ＲＯＭ（Compact Disc Read-Only Memory）等の記憶媒体１９によって類似度算出装置１００に提供される。 The input device 14 has a mouse, a keyboard, and the like, and is used by the user to input various information necessary for the processing by the similarity calculation device 100. The display device 15 displays various information required under the control of the CPU 11. The input device 14 and the display device 15 may be a user interface such as an integrated touch panel. The communication I/F 17 communicates via a wired or wireless network. The communication by the communication I/F 17 is not limited to wireless or wired.
A program that implements the processing performed by the similarity calculation device 100 is provided to the similarity calculation device 100 by a storage medium 19 such as a CD-ROM (Compact Disc Read-Only Memory).

ドライブ装置１８は、ドライブ装置１８にセットされた記憶媒体１９（例えば、ＣＤ−ＲＯＭ等）と類似度算出装置１００とのインターフェースを行う。 The drive device 18 interfaces the storage medium 19 (for example, a CD-ROM or the like) set in the drive device 18 and the similarity calculation device 100.

また、記憶媒体１９に、後述される本実施の形態に係る種々の処理を実現するプログラムを格納し、この記憶媒体１９に格納されたプログラムは、ドライブ装置１８を介して類似度算出装置１００にインストールされる。インストールされたプログラムは、類似度算出装置１００により実行可能となる。 Further, the storage medium 19 stores a program that implements various processes according to the present embodiment described later, and the program stored in the storage medium 19 is stored in the similarity calculation device 100 via the drive device 18. Installed. The installed program can be executed by the similarity calculation device 100.

The storage medium 19 that stores the program is not limited to a CD-ROM; it may be any one or more non-transitory, tangible, computer-readable media having a structure as data. Besides a CD-ROM, the computer-readable storage medium may be a portable recording medium such as a DVD (Digital Versatile Disk) or a USB memory, or a semiconductor memory such as a flash memory.

The similarity calculation device 100 may be a laptop, a tablet terminal, or the like. In that case, the storage medium 19 is an SD (Secure Digital) memory card or the like, and the drive device 18 serves as an interface between the storage medium 19 set in the drive device 18 and the similarity calculation device 100.

FIG. 7 is a diagram illustrating an example of the functional configuration of the similarity calculation device. In FIG. 7, the similarity calculation device 100 includes a vector conversion unit 61, a discrete Fourier transform unit 62, an element multiplication unit 63, an element addition unit 64, and an inverse discrete Fourier transform unit 65. The storage unit 130 stores document data 70, vector data 71, transformed vector data 72, product vector data 73, a sum vector 74, a correlation result 79, and the like.

The vector conversion unit 61 converts each of the documents s, t, and so on in the document data 70 stored in the storage unit 130 into vectors. As described above, the conversion represents, for each character type, a match as 1 and a mismatch as 0. The vector data 71 including the vectors u, v, and so on obtained by the vector conversion unit 61 is stored in the storage unit 130.
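The 0/1 conversion can be sketched in a few lines of Python with NumPy. This is only an illustration: the function name `indicator_vectors` and the explicit `alphabet` argument are assumptions made for this sketch and do not appear in the specification.

```python
import numpy as np

def indicator_vectors(doc, alphabet):
    """Map each character type c to a 0/1 vector over the positions of doc.

    Entry j is 1 where doc[j] matches c and 0 where it does not.
    """
    symbols = np.array(list(doc))
    return {c: (symbols == c).astype(float) for c in alphabet}

# Hypothetical example document over the character types a, b, c.
u = indicator_vectors("abca", "abc")
print(u["a"])  # 1 at positions 0 and 3, 0 elsewhere
```

One vector is produced per character type, so a document over σ types yields σ vectors of the document's length.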

The discrete Fourier transform unit 62 performs a discrete Fourier transform on each of the vectors u, v, and so on in the vector data 71 stored in the storage unit 130. The discrete Fourier transform is executed (number of documents) × (number σ of types of characters or character strings) times. The transformed vector data 72 including the transformed vectors U, V, and so on is stored in the storage unit 130.

The element multiplication unit 63 computes the element-wise product (Hadamard product) over all the vectors in the transformed vector data 72. The product vector data 73 including the vector R obtained for each of the character types is stored in the storage unit 130.

The element addition unit 64 obtains the sum vector 74 by adding, element by element, the vectors R for the respective types of characters or character strings in the product vector data 73. The resulting sum vector 74 is stored in the storage unit 130.

The inverse discrete Fourier transform unit 65 performs an inverse discrete Fourier transform on the sum vector 74 to obtain a correlation result 79 indicating the correlation values for the documents s, t, and so on. The correlation result 79 is stored in the storage unit 130. The correlation result 79 may also be displayed on the display device 15.

The document data 70 includes two or more documents, such as the documents s and t, and is supplied by the user. The user may prepare a plurality of documents in advance as target documents whose correlation is to be examined, and give a specific document (for example, an original document) to the similarity calculation device 100 as input.

The vector data 71 holds the result of the vector conversion for each of the documents s, t, and so on in the document data 70. For example, the document s corresponds to the vector u, and the document t corresponds to the vector v. Vectors converted in the same way are held for the other documents. In the present embodiment, a "document" corresponds to any of various kinds of data file.

The transformed vector data 72 includes the vectors U, V, and so on for the documents s, t, and so on, respectively, and each of these vectors exists for each character type. The transformed vector data 72 therefore includes (number of documents) × (number σ of character types) vectors.

The product vector data 73 holds, for each type of character or character string, a vector R of element-wise products. The product vector data 73 includes as many vectors R as there are types of characters or character strings. In the case of 1-to-N correlation, in which the correlation of a specific document with each of a plurality of target documents is calculated, as many vectors R as there are character types are generated for each target document.

The sum vector 74 is the element-wise sum of the vectors R for the respective character types. For the correlation between two documents s and t, one sum vector 74 is generated. In the present embodiment, the inverse discrete Fourier transform is performed only on this sum vector 74. In the case of 1-to-N correlation, N sum vectors 74 are generated, so the inverse discrete Fourier transform is performed N times; even so, compared with the related art that performs a convolution operation, the number of inverse transforms can be reduced to 1/(number of types of characters or character strings).
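The reason a single inverse transform suffices is the linearity of the inverse discrete Fourier transform. The identity below is a sketch in notation introduced here for illustration (Σ for the set of character types, with |Σ| = σ, and R_c for the product vector of type c); it is not a formula quoted from the specification.

```latex
\sum_{c \in \Sigma} \mathcal{F}^{-1}\!\left(R_c\right)
  \;=\; \mathcal{F}^{-1}\!\Bigl(\,\sum_{c \in \Sigma} R_c\Bigr),
\qquad R_c = U_c \odot V_c .
```

Under this reading, comparing two documents needs 2σ forward transforms in either approach, but the number of inverse transforms drops from σ (one per character type) to 1 (one for the sum vector).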

The correlation result 79 holds the correlation between documents obtained by performing the inverse Fourier transform on the sum vector 74. For the correlation between two documents s and t, one correlation value is given. In 1-to-N correlation, a correlation value with each target document is output, so the correlation result 79 holds N correlation values.

In the present embodiment, the processing time can be reduced even when an arbitrary function is used to convert the character strings of a document into vectors. As one example, each element of the output correlation may be a sum of inner products of vectors rather than a count of matching characters. An array is created whose elements are processing units extracted from the document by word segmentation at spaces, division at punctuation marks, and the like, and the sum of the inner products of the vectors representing the features of the elements is taken. When the units are converted into vectors that express similarity between them (for example, distributed representations of words learned from document data), this sum of inner products can be regarded as a weighted correlation.
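A minimal sketch of such a weighted correlation follows, assuming NumPy. The embedding table `EMBED`, the function name, the whitespace tokenization, and the use of zero-padding with a complex conjugate to align shifts are all assumptions made for this illustration; this part of the description does not fix those details.

```python
import numpy as np

# Toy 2-dimensional word vectors, invented purely for illustration.
EMBED = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}

def weighted_correlation(s_units, t_units, embed):
    """Return r where r[k] = sum_j <embed[s[j]], embed[t[j + k]]>.

    Negative shifts k are stored at index n + k. One forward transform is
    taken per embedding dimension and per document, the spectra are
    multiplied and summed over dimensions, and a single inverse transform
    is applied to the sum.
    """
    n = len(s_units) + len(t_units) - 1          # zero-pad to avoid wrap-around
    A = np.array([embed[w] for w in s_units], dtype=float)
    B = np.array([embed[w] for w in t_units], dtype=float)
    FA = np.fft.rfft(A, n, axis=0)
    FB = np.fft.rfft(B, n, axis=0)
    total = (np.conj(FA) * FB).sum(axis=1)       # element-wise product, summed
    return np.fft.irfft(total, n)

r = weighted_correlation("cat car".split(), "dog car".split(), EMBED)
print(r)
```

With one-hot vectors in place of the toy embeddings, each inner product is 1 for a match and 0 for a mismatch, so the same routine reduces to a plain match count.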

Next, a first similarity calculation process for obtaining the correlation between two documents s and t will be described. FIG. 8 is a flowchart for explaining the first similarity calculation process for obtaining the correlation between two documents.

In FIG. 8, the vector conversion unit 61 reads the document s and the document t from the document data 70 (step S301), and converts each of the documents s and t into vectors for each character type (step S302). The document s is converted into the vector u, and the document t is converted into the vector v. The vector data 71 including the vector u and the vector v is stored in the storage unit 130.

The discrete Fourier transform unit 62 performs a discrete Fourier transform on the vector u to generate the vector U, and performs a discrete Fourier transform on the vector v to generate the vector V (step S303). The transformed vector data 72 including the vector U and the vector V is stored in the storage unit 130.

The element multiplication unit 63 calculates, for each type of character or character string, the product of the corresponding elements of the vectors (that is, the Hadamard product), and obtains a vector R for each character type (step S304). As many vectors R are generated as there are character types. The product vector data 73 including these vectors R is stored in the storage unit 130.

The element addition unit 64 adds the elements of the vectors R for the respective types of characters or character strings to obtain the sum vector 74 (step S305). The sum vector 74 is stored in the storage unit 130.

The inverse discrete Fourier transform unit 65 performs an inverse discrete Fourier transform on the sum vector 74 to obtain the correlation result 79 (step S306). The correlation result 79 is stored in the storage unit 130. The correlation result 79 may also be displayed on the display device 15.

FIG. 9 is a diagram showing an example of the processing in the flowchart of FIG. 8. The steps in FIG. 9 are labeled with the corresponding step numbers of FIG. 8. The description starts from step S302, after the vector conversion unit 61 has read the documents s and t.

In FIG. 9, a plurality of character types appear in at least one of the documents s and t, and the character types are denoted a, b, ..., z. Each character of the alphanumeric characters, of Japanese (hiragana, katakana, and kanji), and of other languages corresponds to a character type.

The present embodiment is also applicable when words, phrases, and the like are regarded as characters. In that case, before the processing units regarded as characters in each of the documents s and t are converted into vectors, the processing units are extracted by word segmentation at spaces, division at punctuation marks, and the like, and an array whose elements are the processing units is created. The processing unit may be a part of speech, a character, or a phrase. A processing unit is a type of character or character string in the document. Treating each array of processing units as the document s or t, the processing proceeds as follows.

The vector conversion unit 61 converts the document s into a vector for each of the character types a, b, ..., z, obtaining the vectors u_a, u_b, ..., u_z (step S302s). Likewise, the vector conversion unit 61 converts the document t into a vector for each of the character types a, b, ..., z, obtaining the vectors v_a, v_b, ..., v_z (step S302t).

The discrete Fourier transform unit 62 performs a discrete Fourier transform on each of the vectors u_a, u_b, ..., u_z generated from the document s to obtain the vectors U_a, U_b, ..., U_z (steps S303ua, S303ub, ..., and S303uz). The discrete Fourier transform unit 62 also performs a discrete Fourier transform on each of the vectors v_a, v_b, ..., v_z generated from the document t to obtain the vectors V_a, V_b, ..., V_z (steps S303va, S303vb, ..., and S303vz).

The element multiplication unit 63 calculates, between the documents s and t and for each character type, the product of the corresponding elements of the vectors, and obtains the vectors R_a, R_b, ..., R_z for the respective character types (steps S304a, S304b, ..., S304z).

The element addition unit 64 then adds the values of the elements at the same shift position across the vectors R_a, R_b, ..., R_z to obtain one sum vector 74 (step S305). After that, the inverse discrete Fourier transform unit 65 performs an inverse discrete Fourier transform on the sum vector 74 to obtain a correlation result indicating the correlation r(s, t) (step S306).
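Steps S302 to S306 can be sketched end to end as follows, assuming NumPy. The function name `correlation` is hypothetical, and the zero-padding to length len(s) + len(t) - 1 and the complex conjugate of one spectrum are assumptions of this sketch: they are the usual way to make the element-wise product of spectra yield a match count per shift, but this section of the description does not spell out how shifts are aligned.

```python
import numpy as np

def correlation(s, t):
    """Count matching positions between s and t for every shift.

    r[k] is the number of j with s[j] == t[j + k]; negative shifts k are
    stored at index n + k. Only one inverse transform is performed.
    """
    n = len(s) + len(t) - 1                       # pad so shifts do not wrap
    s_arr, t_arr = np.array(list(s)), np.array(list(t))
    total = np.zeros(n // 2 + 1, dtype=complex)   # running sum vector (S305)
    for c in set(s) | set(t):                     # one pass per character type
        u = (s_arr == c).astype(float)            # S302: 0/1 vectors
        v = (t_arr == c).astype(float)
        U = np.fft.rfft(u, n)                     # S303: forward transforms
        V = np.fft.rfft(v, n)
        total += np.conj(U) * V                   # S304: element-wise product
    r = np.fft.irfft(total, n)                    # S306: single inverse transform
    return np.rint(r).astype(int)

print(correlation("abab", "abab"))
```

The accumulation into `total` happens in the frequency domain, which is what removes the per-type inverse transform. The real-input transforms `rfft` and `irfft` are a convenience of the sketch, not something the description specifies.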

Next, a second similarity calculation process for obtaining a 1-to-N correlation will be described. FIG. 10 is a flowchart for explaining the second similarity calculation process for obtaining the 1-to-N correlation. In FIG. 10, the correlation between the specific document s and each of the target documents t_1 to t_n is obtained.

In FIG. 10, the vector conversion unit 61 reads the specific document s and the n target documents t_1 to t_n from the document data 70 (step S401), and converts each of the documents s and t_1 to t_n into vectors for each character type (step S402). The specific document s is converted into the vector u, and the target documents t_1 to t_n are converted into the vectors v_1 to v_n, respectively. The vector data 71 including the vector u and the vectors v_1 to v_n is stored in the storage unit 130.

The discrete Fourier transform unit 62 performs a discrete Fourier transform on the vector u to generate the vector U, and performs a discrete Fourier transform on each of the vectors v_1 to v_n to generate the vectors V_1 to V_n (step S403). The transformed vector data 72 including the vector U and the vectors V_1 to V_n is stored in the storage unit 130.

The element multiplication unit 63 selects one vector V_i from among the vectors V_1 to V_n of the target documents t_1 to t_n (step S404-1), calculates, for each character type, the product of the corresponding elements of the vector U of the specific document s and the selected vector V_i (that is, the Hadamard product), and obtains a vector R_i for each character type (step S404-2). As many vectors R_i are generated as there are character types. The product vector data 73 including these vectors R_i is stored in the storage unit 130.

The element addition unit 64 adds the elements of the vectors R_i for the respective character types to obtain a sum vector 74_i (step S405). The sum vector 74_i is stored in the storage unit 130. The inverse discrete Fourier transform unit 65 performs an inverse discrete Fourier transform on the sum vector 74_i and adds the result to the correlation result 79 in the storage unit 130 (step S406).

The second similarity calculation process then determines whether an unprocessed vector V_i remains in the transformed vector data 72 (step S407). If one remains, the process returns to step S404-1 and repeats the processing described above.

If none remains, the second similarity calculation process ends. The correlation result 79 in the storage unit 130 then holds the correlation values r_1 to r_n between the specific document s and each of the target documents t_1 to t_n. When no unprocessed vector V_i remains, the correlation result 79 may be displayed on the display device 15 before the second similarity calculation process ends.
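The loop of steps S404-1 to S407 can be sketched as below, assuming NumPy. The names `spectra` and `one_to_n`, the common padded length, and the complex conjugate used to align shifts are assumptions of this illustration rather than details taken from the description.

```python
import numpy as np

def spectra(doc, alphabet, n):
    """Forward transforms of the 0/1 vectors of doc, one row per character type."""
    arr = np.array(list(doc))
    return np.stack([np.fft.rfft((arr == c).astype(float), n) for c in alphabet])

def one_to_n(s, targets):
    """Per-shift match counts between s and each target document.

    The spectra of s are computed once and reused for every target; each
    target then costs one inverse transform instead of one per character type.
    """
    alphabet = sorted(set(s).union(*targets))
    n = len(s) + max(len(t) for t in targets) - 1   # common padded length
    U = np.conj(spectra(s, alphabet, n))            # transformed once
    results = []
    for t in targets:                               # S404-1: pick the next target
        V = spectra(t, alphabet, n)
        total = (U * V).sum(axis=0)                 # S404-2 and S405
        r = np.fft.irfft(total, n)                  # S406: one inverse transform
        results.append(np.rint(r).astype(int))
    return results                                  # S407: stop when none remain

rs = one_to_n("abab", ["abab", "bbbb", "xyz"])
print([int(r.max()) for r in rs])
```

Each returned array holds one value per shift; taking its maximum, as in the last line, is one possible way to reduce it to a single score per target, and is a choice made for this sketch only.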

Here, the approximate range of the size of the actual data of the document 80 is shown.

For σ:
・Nucleobases: 4 or more
・Alphabet: 26
・Double-byte characters: 65536
・Vocabulary size of a single language (treating words as characters): thousands to tens of thousands
For n:
・Academic papers (number of words): thousands to tens of thousands
・Genome sequences: hundreds of thousands to tens of millions
・Sensor data: hundreds of millions or more
In an execution example using the related art, calculating the correlations among several hundred academic papers takes several days. Moreover, a single research institution publishes tens of thousands of papers or more in its repository. A comparison of experimental results between the present embodiment and the related art is shown below.

FIG. 11 is a diagram showing a comparison of execution time against vocabulary size. In FIG. 11, the horizontal axis represents the vocabulary size, and the vertical axis represents the execution time (msec). The vocabulary size corresponds to the number of character types, denoted σ above. The graph shows the execution time for document lengths of 1024 words, 2048 words, and 4096 words.

FIG. 11 shows that, for both the present embodiment and the related art, the execution time becomes longer as the vocabulary size increases. In every case, however, the execution time of the present embodiment is shorter than that of the related art. In particular, when the document length is 4096 words, once the vocabulary size exceeds 800 words the execution time of the present embodiment is about 20% shorter than that of the related art.

FIG. 12 is a diagram showing a comparison of execution time against document length. In FIG. 12, the horizontal axis represents the document length, and the vertical axis represents the execution time (msec). The graph shows the execution time for vocabulary sizes of 256 words, 512 words, and 1024 words.

FIG. 12 shows that, for both the present embodiment and the related art, the execution time becomes longer as the document length increases. In every case, however, the execution time of the present embodiment is shorter than that of the related art. In particular, when the vocabulary size is 1024 words, the execution time of the present embodiment is about 20% shorter than that of the related art at a document length of around 3500 words.

FIG. 13 is a diagram showing a comparison of processing time together with the vocabulary size of a document. In FIG. 13, the horizontal axis represents the document length, the left vertical axis represents the execution time (msec), and the right vertical axis represents the vocabulary size. The vocabulary size on the right vertical axis is a statistical value of the vocabulary size as a function of document length.

FIG. 13 shows that the vocabulary size increases in proportion to the document length. In the present embodiment, there is little difference in execution time from the related art up to a document length of about 2000 words. As the document length approaches 4000 words, that is, as σ becomes larger, the difference in execution time reaches about 20%, which shows that the present embodiment achieves a speedup. This speedup of about 20% results from reducing the amount of computation required for the convolution.

From the inventor's experimental results described above, it can be said that the present embodiment keeps the amount of computation down even when the number of character types and the length of the character strings increase.

In the embodiment described above, the vector conversion unit 61, the discrete Fourier transform unit 62, and the element multiplication unit 63 correspond to the first conversion unit, the element addition unit 64 corresponds to the addition unit, and the inverse discrete Fourier transform unit 65 corresponds to the second conversion unit.

The present invention is not limited to the specifically disclosed embodiments, and various modifications and changes can be made without departing from the scope of the claims.

The following supplementary notes are further disclosed regarding the embodiments including the above examples.
(Appendix 1)
A similarity calculation program that causes a computer to perform processing of:
generating a plurality of vectors by vectorizing a plurality of document data for each character or character string included in the plurality of document data;
calculating an element-wise product of the results of Fourier-transforming each of the plurality of vectors;
generating, with respect to each of the characters or character strings, a sum vector in which the products are added for each element; and
generating correlation values between the plurality of document data from the sum vector by using an inverse Fourier transform.
(Appendix 2)
The similarity calculation program according to Appendix 1, which causes the computer to divide the plurality of documents by any of parts of speech, characters, phrases, or words delimited by spaces, and to generate the plurality of vectors, for each divided word, using as the value of each element the match or mismatch with that word at each shift position of the plurality of document data.
(Appendix 3)
The similarity calculation program according to Appendix 1 or 2, wherein each element corresponds to a shift position between the plurality of document data.
(Appendix 4)
The similarity calculation program according to any one of Appendices 1 to 3, wherein the product is a Hadamard product.
(Appendix 5)
A similarity calculation method in which a computer performs processing of:
generating a plurality of vectors by vectorizing a plurality of document data for each character or character string included in the plurality of document data;
calculating an element-wise product of the results of Fourier-transforming each of the plurality of vectors;
generating, with respect to each of the characters or character strings, a sum vector in which the products are added for each element; and
generating correlation values between the plurality of document data from the sum vector by using an inverse Fourier transform.
(Appendix 6)
A similarity calculation device comprising:
a vector conversion unit that generates a plurality of vectors by vectorizing a plurality of document data for each character or character string included in the plurality of document data;
a first conversion unit that calculates an element-wise product of the results of Fourier-transforming each of the plurality of vectors;
an addition unit that generates, with respect to each of the characters or character strings, a sum vector in which the products are added for each element; and
a second conversion unit that generates correlation values between the plurality of document data from the sum vector by using an inverse Fourier transform.

61 vector conversion unit
62 discrete Fourier transform unit
63 element multiplication unit
64 element addition unit
65 inverse discrete Fourier transform unit
70 document data
71 vector data
72 transformed vector data
73 product vector data
74 sum vector
79 correlation result

Claims

A similarity calculation program that causes a computer to perform processing of:
generating a plurality of vectors by vectorizing a plurality of document data for each character or character string included in the plurality of document data;
calculating an element-wise product of the results of Fourier-transforming each of the plurality of vectors;
generating, with respect to each of the characters or character strings, a sum vector in which the products are added for each element; and
generating correlation values between the plurality of document data from the sum vector by using an inverse Fourier transform.

The similarity calculation program according to claim 1, which causes the computer to divide the plurality of documents by any of parts of speech, characters, phrases, or words delimited by spaces, and to generate the plurality of vectors, for each divided word, using as the value of each element the match or mismatch with that word at each shift position of the plurality of document data.

The similarity calculation program according to claim 1, wherein each element corresponds to a shift position between the plurality of document data.

A similarity calculation method in which a computer performs processing of:
generating a plurality of vectors by vectorizing a plurality of document data for each character or character string included in the plurality of document data;
calculating an element-wise product of the results of Fourier-transforming each of the plurality of vectors;
generating, with respect to each of the characters or character strings, a sum vector in which the products are added for each element; and
generating correlation values between the plurality of document data from the sum vector by using an inverse Fourier transform.

A similarity calculation device comprising:
a vector conversion unit that generates a plurality of vectors by vectorizing a plurality of document data for each character or character string included in the plurality of document data;
a first conversion unit that calculates an element-wise product of the results of Fourier-transforming each of the plurality of vectors;
an addition unit that generates, with respect to each of the characters or character strings, a sum vector in which the products are added for each element; and
a second conversion unit that generates correlation values between the plurality of document data from the sum vector by using an inverse Fourier transform.