JP2014085854A

JP2014085854A - Similarity evaluation system, similarity evaluation device, user terminal, similarity evaluation method, and program

Info

Publication number: JP2014085854A
Application number: JP2012234519A
Authority: JP
Inventors: Shinya Takada; Toshihiro Motoda; Shinichi Nakahara
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-10-24
Filing date: 2012-10-24
Publication date: 2014-05-12

Abstract

PROBLEM TO BE SOLVED: To mechanically perform a finer similarity evaluation when the entropy values of entire files are close to each other. SOLUTION: Each of user terminals 2₁ to 2_N transmits a comparison source file and a comparison target file to a similarity evaluation device 3. The similarity evaluation device 3 divides each of the comparison source file and the comparison target file into a plurality of sections by a prescribed division method, calculates a prescribed entropy value for each section of the divided comparison source file and the divided comparison target file to produce comparison source section feature quantities and comparison target section feature quantities, and obtains, on the basis of those feature quantities, a determination result showing whether the comparison source file and the comparison target file are similar. Each of user terminals 2₁ to 2_N displays the determination result received from the similarity evaluation device 3.

Description

The present invention relates to a similarity evaluation technique for evaluating the similarity of files using entropy values.

In the fields of malware detection and DLP (Data Loss Prevention), entropy values are computed as a way of measuring file similarity. When ordering is not taken into account, the entropy value is calculated by equation (1). When ordering is taken into account, it is calculated by equation (2). Hereinafter, the entropy value that takes ordering into account is referred to as M1 entropy. In equations (1) and (2), P_i is the probability that a value in the target file is i.
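Equations (1) and (2) are not reproduced in this text, so the following is only a minimal sketch under stated assumptions. It assumes equation (1) is the usual Shannon entropy over byte values (in bits per byte) and reads "M1 entropy" as a first-order conditional entropy over adjacent byte pairs. The function names `shannon_entropy` and `m1_entropy` are illustrative and do not come from the patent.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Order-insensitive entropy in bits per byte (assumed form of equation (1))."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def m1_entropy(data: bytes) -> float:
    """First-order conditional entropy H(next | current).

    This is an assumed reading of the order-sensitive "M1 entropy";
    the patent's equation (2) may differ in detail.
    """
    if len(data) < 2:
        return 0.0
    pair_counts = Counter(zip(data, data[1:]))  # counts of adjacent byte pairs
    first_counts = Counter(data[:-1])           # counts of the leading byte of each pair
    total_pairs = len(data) - 1
    h = 0.0
    for (first, _second), c in pair_counts.items():
        p_pair = c / total_pairs
        p_cond = c / first_counts[first]
        h -= p_pair * math.log2(p_cond)
    return h
```

Under these assumptions, two inputs with the same byte histogram but a different byte order (for example `b"abababab"` and `b"aaaabbbb"`) give the same `shannon_entropy` but different `m1_entropy` values, which is the distinction the text draws between the two measures.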

Expressing the state of data as an entropy value is useful for classifying large amounts of information on a storage device and for verifying how closely specific pieces of data approximate each other. In particular, because some malware and anti-forensic techniques evade detection by intentionally modifying the original data, expressing the closeness to the original file objectively and quantitatively is of great value to forensic investigations.

The entropy value expresses the state of the data, and when part of the target data is changed, the amount of change in the entropy value relative to the original data correlates with the change made. By contrast, with the so-called hash functions used to identify data in the forensic field, there is no correlation between the amount of change in the data and the amount of change in the hash value, so no objective judgment can be made about how much the original file has been modified. This property is also a strength of hash functions, in that it makes them robust against attacks on the calculated hash value, but entropy values are more advantageous for verifying how closely files approximate each other. In addition, because calculating an entropy value costs less than calculating a hash, similarity evaluation using entropy values is also better suited to quickly verifying large amounts of data.

To make similarity evaluation using entropy values more accurate, a method of weighting the entropy value by the file size has been proposed (see, for example, Non-Patent Document 1). The weighted entropy value is calculated by equation (3).

The similarity using weighted entropy values is calculated by equation (4), where E₁ is the weighted entropy value of the comparison source file, E₂ is the weighted entropy value of the comparison destination file, S₁ is the file size of the comparison source file, and S₂ is the file size of the comparison destination file.

Equation (4) is one calculation example and does not yield meaningful values when applied to ordinary files. For this reason, the similarity may instead be calculated, for example, as in equation (5).

Guidance Software, "Utilizing Entropy to Identify Undetected Malware", [online], [retrieved October 15, 2012], Internet <URL:http://www.guidancesoftware.com/DocumentRegistration.aspx?did=1000017288>

Because an entropy value merely represents the randomness of the data, data with entirely different content can happen to have close entropy values, whether plain or weighted entropy values are calculated. Conventionally, a supplementary similarity evaluation was performed by visually inspecting the data, but such visual evaluation becomes difficult when there are many files to compare. Another means of further evaluating the similarity between files with close entropy values has therefore been needed.

The present invention has been made in view of the above, and its object is to provide a similarity evaluation technique that can mechanically perform a more detailed similarity evaluation when the entropy values of entire files are close to each other.

To solve the above problem, a similarity evaluation system according to one aspect of the present invention includes a user terminal and a similarity evaluation device, and determines whether a comparison source file and a comparison destination file are similar. The user terminal includes an input unit that transmits the comparison source file and the comparison destination file to the similarity evaluation device, and a display unit that displays a determination result, received from the similarity evaluation device, indicating whether the comparison source file and the comparison destination file are similar. The similarity evaluation device includes a section division unit that divides each of the comparison source file and the comparison destination file into a plurality of sections by a predetermined division method to generate a comparison source divided file and a comparison destination divided file, a section feature quantity calculation unit that calculates a predetermined entropy value for each section of the comparison source divided file and the comparison destination divided file to generate a comparison source section feature quantity and a comparison destination section feature quantity, and a section similarity evaluation unit that obtains the determination result based on the comparison source section feature quantity and the comparison destination section feature quantity.

A similarity evaluation system according to another aspect of the present invention includes a user terminal and a similarity evaluation device, and determines whether a comparison source file and a comparison destination file are similar. The user terminal includes a section division unit that divides each of the comparison source file and the comparison destination file into a plurality of sections by a predetermined division method to generate a comparison source divided file and a comparison destination divided file, a section feature quantity calculation unit that calculates a predetermined entropy value for each section of the comparison source divided file and the comparison destination divided file to generate a comparison source section feature quantity and a comparison destination section feature quantity, an input unit that transmits the comparison source section feature quantity and the comparison destination section feature quantity to the similarity evaluation device, and a display unit that displays a determination result, received from the similarity evaluation device, indicating whether the comparison source file and the comparison destination file are similar. The similarity evaluation device includes a section similarity evaluation unit that obtains the determination result based on the comparison source section feature quantity and the comparison destination section feature quantity.

According to the similarity evaluation technique of the present invention, the files to be compared are divided into a plurality of sections and file similarity is evaluated based on the entropy value of each section, so a more detailed similarity evaluation can be performed mechanically even when the entropy values of the entire files are close to each other. This improves the accuracy of the similarity evaluation between two files.

A diagram illustrating the functional configuration of the similarity evaluation system according to the first embodiment.
(A) A diagram illustrating the functional configuration of a user terminal. (B) A diagram illustrating the functional configuration of the similarity evaluation device.
A diagram illustrating the processing flow of the similarity evaluation method.
A diagram illustrating calculation results for the overall feature quantity.
A diagram illustrating evaluation results for the overall similarity.
A diagram illustrating the processing flow of the rolling hash.
A diagram illustrating calculation results for the section feature quantity.
A diagram illustrating evaluation results for the section similarity.
A diagram illustrating the functional configuration of the similarity evaluation system according to the second embodiment.
(A) A diagram illustrating the functional configuration of a user terminal. (B) A diagram illustrating the functional configuration of the similarity evaluation device.
A diagram illustrating the processing flow of the similarity evaluation method.
A diagram illustrating the processing flow of the similarity evaluation method.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate descriptions are omitted.

[First embodiment]
<Configuration>
A configuration example of the similarity evaluation system 1 of this embodiment is described with reference to FIG. 1. The similarity evaluation system 1 includes N (≥ 1) user terminals 2₁, ..., 2_N and a similarity evaluation device 3. The N user terminals 2₁, ..., 2_N and the similarity evaluation device 3 are connected to a network 4. The network 4 need only be configured so that the connected devices can communicate with one another, and can be, for example, the Internet, a LAN (Local Area Network), or a WAN (Wide Area Network). The devices do not necessarily have to be able to communicate online via a network. For example, the information output by a user terminal 2_n (1 ≤ n ≤ N) may be stored on a portable recording medium such as magnetic tape or a USB memory and input offline from that medium to the similarity evaluation device 3.

A configuration example of the user terminal 2_n included in the similarity evaluation system 1 is described with reference to FIG. 2(A). The user terminal 2_n includes a control unit 201, a memory 202, a file storage unit 21, an input unit 22, and a display unit 23. The user terminal 2_n is, for example, a special device configured by loading a special program into a known or dedicated computer having a CPU (Central Processing Unit), RAM (Random Access Memory), and the like. The user terminal 2_n executes each process under the control of the control unit 201. Data input to the user terminal 2_n and data obtained in each process are stored in the memory 202 and are read out as needed for use in other processes. The file storage unit 21 can be configured from, for example, a main storage device such as RAM, or an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as flash memory.

A configuration example of the similarity evaluation device 3 included in the similarity evaluation system 1 is described with reference to FIG. 2(B). The similarity evaluation device 3 includes a control unit 301, a memory 302, an input unit 31, an overall feature quantity calculation unit 32, an overall similarity evaluation unit 33, a section division unit 34, a section feature quantity calculation unit 35, a section similarity evaluation unit 36, an output unit 37, and a feature quantity storage unit 38. The similarity evaluation device 3 is, for example, a special device configured by loading a special program into a known or dedicated computer having a CPU (Central Processing Unit), RAM (Random Access Memory), and the like. The similarity evaluation device 3 executes each process under the control of the control unit 301. Data input to the similarity evaluation device 3 and data obtained in each process are stored in the memory 302 and are read out as needed for use in other processes. The feature quantity storage unit 38 can be configured from, for example, a main storage device such as RAM, an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store.

<Processing>
An operation example of the similarity evaluation system 1 of this embodiment is described with reference to FIG. 3.

The file storage unit 21 of the user terminal 2_n (1 ≤ n ≤ N) stores at least two files whose similarity is to be evaluated.

In accordance with the user's operation, the input unit 22 of the user terminal 2_n transmits to the similarity evaluation device 3 a group of comparison target files, selected from the files stored in the file storage unit 21, whose similarity is to be evaluated. The group may contain two files or three or more. The following description covers the case of evaluating the similarity of two files and notes any differences for three or more files where they arise. The two files being compared are called the comparison source file and the comparison destination file, respectively.

The comparison target file group transmitted by the user terminal 2_n is input to the similarity evaluation device 3 via the input unit 31 (step S31). The overall feature quantity calculation unit 32 of the similarity evaluation device 3 calculates a predetermined entropy value from each of the comparison source file and the comparison destination file (step S32). The entropy value calculated from the comparison source file is called the comparison source overall feature quantity, and the entropy value calculated from the comparison destination file is called the comparison destination overall feature quantity. Various kinds of entropy values can be used. For example, the entropy value calculated by equation (1) above, the M1 entropy value calculated by equation (2) above, or the file-size-weighted entropy value calculated by equation (3) above may be used. N-gram entropy may also be calculated. N-gram entropy is an entropy value calculated from the number of occurrences of each bit pattern in the file, counted in units of a fixed number of bytes (N bytes). When the comparison target file group contains three or more files, the predetermined entropy value is calculated for every combination of two files.
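The N-gram entropy mentioned above can be sketched as follows. This is an illustration only: the text does not say whether the N-byte units overlap, so the sketch assumes a window that slides one byte at a time, and the name `ngram_entropy` is hypothetical.

```python
import math
from collections import Counter


def ngram_entropy(data: bytes, n: int = 2) -> float:
    """Entropy in bits per n-gram over overlapping n-byte patterns.

    Assumption: patterns are taken with a one-byte stride. A stride of
    n bytes (non-overlapping units) would be an equally plausible reading.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(data) < n:
        return 0.0
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With `n=1` this reduces to ordinary byte entropy. Larger `n` makes the value sensitive to local byte order at the cost of a much larger pattern space.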

FIG. 4 shows an example of the calculation results for the feature quantities of entire files. The "File Name" column is the physical file name of the comparison target file. The "No." column is a number that uniquely identifies the comparison target file. The "Entropy" column is the entropy value calculated from the comparison target file by equation (1). The "M1Entropy" column is the M1 entropy value calculated from the comparison target file by equation (2). The "WEntropy" column is the file-size-weighted entropy value calculated from the comparison target file by equation (3). The "File size" column is the size of the comparison target file in bytes. In this example, 15 comparison target files have been submitted. For example, No. 1, ae.bmp, has an entropy value of 7.36303, an M1 entropy value of 4.758377, a file-size-weighted entropy value of 87.8226, and a file size of 151,374 bytes.
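Equation (3) is not reproduced in this text. As a rough arithmetic check only, the values quoted for ae.bmp are numerically consistent with one simple weighting, namely the entropy multiplied by the natural logarithm of the file size in bytes. That form is an assumption inferred from this single row and is not confirmed by the source.

```python
import math

# Values quoted in the text for No. 1 (ae.bmp) in FIG. 4.
entropy = 7.36303
file_size = 151_374
reported_weighted = 87.8226

# Assumption (not stated in the source): weight = ln(file size in bytes).
candidate_weighted = entropy * math.log(file_size)
difference = abs(candidate_weighted - reported_weighted)

print(f"candidate {candidate_weighted:.4f}, reported {reported_weighted}, "
      f"difference {difference:.4f}")
```

A close match on one row does not establish the formula. Checking the remaining rows of FIG. 4 would be needed before relying on it.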

The calculated comparison source overall feature quantity and comparison destination overall feature quantity are stored in the feature quantity storage unit 38. The overall similarity evaluation unit 33 calculates the overall similarity from the comparison source overall feature quantity and the comparison destination overall feature quantity read from the feature quantity storage unit 38 (step S331). The overall similarity can be calculated, for example, by equation (5) above. When the comparison target file group contains three or more files, the overall similarity is calculated for every combination of two files.

The overall similarity evaluation unit 33 compares the calculated overall similarity with a predetermined threshold and obtains a provisional determination result indicating whether the comparison source file and the comparison destination file are similar. The provisional determination result is set to a value indicating that the files are similar when the overall similarity is equal to or greater than the threshold, and to a value indicating that the files are not similar when the overall similarity is less than the threshold (step S332). When the comparison target file group contains three or more files, the result is set to a value indicating that similar files exist if any combination of files has an overall similarity equal to or greater than the threshold, and to a value indicating that no similar files exist otherwise.
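The thresholding in step S332 can be sketched as below for a group of files. Equation (5) is not reproduced in this text, so the similarity function is passed in as a parameter. `ratio_similarity` is a hypothetical stand-in used only to make the example runnable, and every name here is illustrative.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def provisional_similar_pairs(
    features: Dict[str, float],
    similarity: Callable[[float, float], float],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Return every pair of files whose overall similarity is >= threshold."""
    pairs = []
    for (name_a, feat_a), (name_b, feat_b) in combinations(features.items(), 2):
        if similarity(feat_a, feat_b) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


def ratio_similarity(a: float, b: float) -> float:
    """Hypothetical stand-in for equation (5): smaller/larger value on a 0-100 scale.

    Assumes non-negative feature values, as entropy values are.
    """
    if a == b:
        return 100.0
    return 100.0 * min(a, b) / max(a, b)
```

An empty result corresponds to the "no similar files exist" outcome. A non-empty result lists the pairs that would proceed to the section-level evaluation.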

FIG. 5 shows an example of the evaluation results for the similarity of entire files. The "File Name" column is the physical file name of the comparison target file. The "No." column is a number that uniquely identifies the comparison source file. Columns "1" to "15" indicate the comparison destination files. In other words, FIG. 5 is a matrix of the overall similarity between comparison source files and comparison destination files. For example, the overall similarity between No. 1, ae.bmp, and No. 2, af.bmp, is 89. If the threshold for determining similarity is set to 99, the files are determined to be similar for the file combinations whose values are shown in bold in the matrix of FIG. 5. For example, No. 1, ae.bmp, is determined to be similar to No. 5, imagesCA6MA9NM.bmp, No. 6, imagesCA83PY61.bmp, and No. 8, imagesCAQI7U2A.bmp.

When the provisional determination result indicates that the comparison source file and the comparison destination file are not similar, a determination result indicating that they are not similar is output and the processing ends (step S372). For example, in FIG. 5, if No. 1, ae.bmp, and No. 2, af.bmp, are submitted as the comparison target files, the overall similarity is 89, which is below the threshold of 99, so the two files are determined not to be similar. In this case, the fact that the two files are not similar is output and the processing ends.

When the provisional determination result indicates that the comparison source file and the comparison destination file are similar, the subsequent processing continues. When the comparison target file group contains three or more files and more than one combination of files is determined to be similar, the subsequent processing is repeated for every similar combination.

The comparison source file and the comparison destination file are input to the section division unit 34. The section division unit 34 divides each of the comparison source file and the comparison destination file into a plurality of sections by a predetermined division method (step S34). The file obtained by dividing the comparison source file is called the comparison source divided file, and the file obtained by dividing the comparison destination file is called the comparison destination divided file. One division method is to divide at a predetermined fixed size, for example every 10 KB. Alternatively, the division size may be determined by dividing the file evenly in proportion to its size, which avoids leaving a small section at the end. As a further alternative, the division points may be determined by a rolling hash.
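The first two division methods can be sketched as follows. The function names and the way the even split derives its section count from a target size are assumptions made for illustration. The text only states that an even split avoids a small trailing section.

```python
import math
from typing import List


def split_fixed(data: bytes, size: int) -> List[bytes]:
    """Split into consecutive sections of `size` bytes; the last may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


def split_even(data: bytes, target_size: int) -> List[bytes]:
    """Pick a section count from `target_size`, then spread the bytes evenly.

    Section lengths differ by at most one byte, so no short tail is produced.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    if not data:
        return []
    count = math.ceil(len(data) / target_size)
    base, extra = divmod(len(data), count)
    sections, start = [], 0
    for i in range(count):
        length = base + (1 if i < extra else 0)
        sections.append(data[start:start + length])
        start += length
    return sections
```

For a 25-byte input and a size of 10, `split_fixed` yields sections of 10, 10, and 5 bytes, whereas `split_even` yields 9, 8, and 8 bytes.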

FIG. 6 shows an operation example of section division by rolling hash. When a file to be divided is input, m bytes are read from the binary sequence of the file (step S341). This m-byte binary sequence is called a window. The value m, which determines the window size, can be chosen freely. For example, m = 7 may be used. Next, the hash value of the window that was read is calculated (step S342). Possible hash functions include, for example, a function that adds the values of all the characters making up the string and a function that multiplies together the ASCII codes of all the characters. The lower t bits of the calculated hash value are then evaluated (step S343). If the lower t bits match a predetermined bit pattern, the position of the window is stored as a division point (step S344). Any bit pattern may be used. For example, the pattern may be that the lower t bits are all 1 or that they are all 0. The value t may be determined, for example, by the following equation, where L is the file size and N is the total number of sections.

Here, ⌊x⌋ is the floor function, which gives the largest integer not exceeding x.

Next, it is checked whether the window position is at the end of the file (step S345). If it is not, the window is advanced by one byte toward the end of the file and the processing of steps S341 to S345 is repeated. This is repeated until the window position reaches the end of the file.

When files are divided into sections using a rolling hash, it is highly likely that the parts the two compared files have in common, excluding the parts where they differ, can be extracted as sections.
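Steps S341 to S345 can be sketched as below, using the additive hash the text offers as one example and the "lower t bits all 1" pattern. The equation for t is not reproduced in this text, so `choose_t` assumes t = ⌊log₂(L / N)⌋, which makes a match occur roughly once every L / N bytes for a well-mixed hash. That rule, the use of the window's start offset as the division point, and all names are assumptions.

```python
import math
from typing import List


def choose_t(file_size: int, section_count: int) -> int:
    """Assumed rule for t: floor(log2(L / N)), clamped at zero."""
    if file_size <= 0 or section_count <= 0:
        raise ValueError("file_size and section_count must be positive")
    return max(0, math.floor(math.log2(file_size / section_count)))


def rolling_split_points(data: bytes, m: int = 7, t: int = 4) -> List[int]:
    """Return window start offsets whose hash has its lower t bits all set to 1."""
    if m <= 0 or t < 0:
        raise ValueError("m must be positive and t non-negative")
    if len(data) < m:
        return []
    mask = (1 << t) - 1
    points = []
    window_hash = sum(data[:m])  # additive hash of the first window (S341, S342)
    for pos in range(len(data) - m + 1):
        if pos > 0:
            # Slide one byte: add the byte entering, subtract the byte leaving.
            window_hash += data[pos + m - 1] - data[pos - 1]
        if (window_hash & mask) == mask:  # S343, S344
            points.append(pos)
    return points
```

Because each decision depends only on the m bytes inside the window, inserting bytes at the front of a file shifts the later division points by the inserted length instead of changing which content triggers them. This is why shared content tends to fall into matching sections. An additive hash over 7 bytes mixes poorly, so a stronger rolling hash may be preferable in practice.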

The comparison source divided file and the comparison destination divided file are input to the section feature quantity calculation unit 35. The section feature quantity calculation unit 35 calculates a predetermined entropy value for each section of the comparison source divided file and the comparison destination divided file (step S35). The spectrum of per-section entropy values of the comparison source divided file is called the comparison source section feature quantity, and the spectrum of per-section entropy values of the comparison destination divided file is called the comparison destination section feature quantity. As with the overall feature quantity, various kinds of entropy values can be used. For example, the entropy value calculated by equation (1) above, the M1 entropy value calculated by equation (2) above, or the file-size-weighted entropy value calculated by equation (3) above may be used. N-gram entropy may also be calculated.

FIG. 7(A) shows an example of the section feature values calculated for the comparison source file, and FIG. 7(B) shows an example for the comparison destination file. The "File Name" column is the name of each file produced by section division. The "Entropy" column is the entropy value calculated from that divided file. The "M1Entropy" column is the M1 entropy value calculated from that divided file. The "WEntropy" column is the file-size-weighted entropy value calculated from that divided file. The "File Size" column is the size of that divided file in bytes. In this example, No. 1 ae.bmp is the comparison source file and No. 5 imagesCA6MA9NM.bmp is the comparison destination file, and each is divided into fixed-size sections of 100,000 bytes. In FIG. 7(A), No. 1 ae.bmp is divided into 16 files, ae.001 to ae.016, and the figure shows the entropy value, M1 entropy value, and file-size-weighted entropy value calculated for each of them. In FIG. 7(B), No. 5 imagesCA6MA9NM.bmp is divided into 16 files, IMAGES~1.001 to IMAGES~1.016, and the figure shows the entropy value, M1 entropy value, and file-size-weighted entropy value calculated for each of them.

比較元区間特徴量と比較先区間特徴量は区間類似度評価部３６へ入力される。区間類似度評価部３６は、比較元区間特徴量と比較先区間特徴量とに基づいて比較元ファイルと比較先ファイルが類似するか否かを示す判定結果を求める（ステップＳ３６）。判定の方法は、下記の式（６）に示すように、比較元区間特徴量に含まれる区間ごとのエントロピー値と、比較先区間特徴量に含まれる区間ごとのエントロピー値との差をそれぞれ算出し、その平均を計算すればよい。 The comparison source section feature value and the comparison destination section feature value are input to the section similarity evaluation unit 36. The section similarity evaluation unit 36 obtains a determination result indicating whether the comparison source file and the comparison destination file are similar based on the comparison source section feature quantity and the comparison destination section feature quantity (step S36). As shown in the following equation (6), the determination method calculates the difference between the entropy value for each section included in the comparison source section feature and the entropy value for each section included in the comparison target section feature. Then, the average may be calculated.

Here, n is the total number of sections, H1_i is the entropy value of section i of the comparison source file, and H2_i is the entropy value of section i of the comparison destination file.
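Equation (6) itself is not reproduced in this text. From the description of the determination method and the symbol definitions, it presumably takes the following form (a reconstruction, not the original figure):

```latex
D = \frac{1}{n} \sum_{i=1}^{n} \left( H1_i - H2_i \right) \qquad (6)
```

Whether the original takes the absolute value of each difference cannot be recovered from this text. The FIG. 8 example lists signed differences (No. 1 minus No. 5).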

式（６）で求めた差の平均Dがあらかじめ定めた閾値以上であれば、比較元ファイルと比較先ファイルは類似の関係にないと判断する。差の平均により類似するか否かを判定する場合の閾値は、例えば0.25などに設定すればよい。 If the average D of the differences obtained by Equation (6) is equal to or greater than a predetermined threshold value, it is determined that the comparison source file and the comparison destination file are not in a similar relationship. The threshold for determining whether or not they are similar by the average of the differences may be set to 0.25, for example.
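The threshold decision can be sketched as follows. This assumes that the average D is the plain mean of the per-section differences, as the text describes. The `absolute` option is an added assumption for cases where positive and negative differences should not cancel. The function names are hypothetical.

```python
def mean_section_difference(h1: list[float], h2: list[float], absolute: bool = False) -> float:
    """Average of the per-section entropy differences (H1_i - H2_i)."""
    if not h1 or len(h1) != len(h2):
        raise ValueError("both files must have the same non-zero number of sections")
    diffs = [a - b for a, b in zip(h1, h2)]
    if absolute:  # assumption: not stated in the text
        diffs = [abs(d) for d in diffs]
    return sum(diffs) / len(diffs)


def is_similar(h1: list[float], h2: list[float], threshold: float = 0.25) -> bool:
    """Files are judged not similar when D is at or above the threshold."""
    return mean_section_difference(h1, h2) < threshold
```

For instance, a single pair of sections with entropy values 7.190017 and 6.508847 gives a difference of about 0.68117, which exceeds the example threshold of 0.25.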

FIG. 8 shows an example of the section similarity evaluation result. FIG. 8(A) shows the result of calculating the average of the differences between the comparison source section feature and the comparison destination section feature. The "Section" column is the number of each divided section. The "Section entropy value" columns give the entropy value of each section of the compared files: the "No.1" column holds the per-section entropy values of the comparison source file, No. 1 ae.bmp, and the "No.5" column holds the per-section entropy values of the comparison destination file, No. 5 imagesCA6MA9NM.bmp. The "Difference" column is the value obtained by subtracting the No. 5 entropy value from the No. 1 entropy value for each section. For example, the section entropy value of ae.001, a divided file of No. 1 ae.bmp, is 7.190017, the section entropy value of IMAGES~1.001, a divided file of No. 5 imagesCA6MA9NM.bmp, is 6.508847, and their difference is 0.68117. The "Average" row at the bottom of FIG. 8(A) is the mean of the "Difference" column, that is, the average D of the differences in equation (6). In this example, the average D is 0.472823375, so with the threshold set to 0.25 as in the example above, No. 1 ae.bmp and No. 5 imagesCA6MA9NM.bmp are judged not to be similar. FIG. 8(B) plots the comparison source section feature and the comparison destination section feature on a graph. The horizontal axis is the section number, and the vertical axis is the section entropy value. The plot shows that the section entropy values differ greatly in several sections.

Whether the files are similar may also be determined by applying various statistical tests to the comparison source section feature and the comparison destination section feature. For example, the standard deviation may be used: the standard deviation of each of the comparison source section feature and the comparison destination section feature is calculated, and the files are judged similar if the difference between the standard deviations is less than a threshold. As another example, a correlation coefficient may be used: the correlation coefficient between the comparison source section feature and the comparison destination section feature is obtained, and the files are judged similar if it is equal to or greater than a threshold. As another example, longest matching sequence comparison may be used: the length of the longest run of consecutive sections whose corresponding section entropy values match between the two features is obtained, and the files are judged similar if that length is equal to or greater than a threshold. As another example, Fourier analysis may be used: each of the comparison source section feature and the comparison destination section feature is Fourier transformed, the resulting power spectrum series are compared, and the files are judged similar if the number of matching elements is equal to or greater than a threshold. As another example, a chi-square test may be used: each of the comparison source section feature and the comparison destination section feature is fitted to a predetermined function to calculate a test statistic, and a hypothesis test of whether the files are similar is performed with a predetermined threshold as the significance level.
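Three of the statistical alternatives above can be sketched as follows. The function names are hypothetical, and the tolerance used to decide that two section entropy values "match" is an assumption, since the text does not say how equality of values is judged.

```python
import math
import statistics


def stdev_gap(h1: list[float], h2: list[float]) -> float:
    """Absolute difference between the standard deviations of the two features."""
    return abs(statistics.pstdev(h1) - statistics.pstdev(h2))


def pearson(h1: list[float], h2: list[float]) -> float:
    """Pearson correlation coefficient between the two section features."""
    m1, m2 = statistics.fmean(h1), statistics.fmean(h2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    v1 = sum((a - m1) ** 2 for a in h1)
    v2 = sum((b - m2) ** 2 for b in h2)
    if v1 == 0 or v2 == 0:
        return 0.0  # undefined for a constant series; treated as uncorrelated (assumption)
    return cov / math.sqrt(v1 * v2)


def longest_matching_run(h1: list[float], h2: list[float], tol: float = 1e-6) -> int:
    """Length of the longest run of consecutive sections whose values agree within tol."""
    best = run = 0
    for a, b in zip(h1, h2):
        run = run + 1 if abs(a - b) <= tol else 0
        best = max(best, run)
    return best
```

Each helper returns a raw statistic. The comparison against a threshold (less than for `stdev_gap`, equal to or greater than for the other two) is left to the caller, following the rules in the paragraph above.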

The determination result is transmitted to the user terminal 2_n via the output unit 37 of the similarity evaluation device 3 (step S371). The information transmitted is whatever the user terminal 2_n needs for display. For example, it may be selected freely from the overall feature values of the compared files, the overall similarity between the compared files, the section feature values of the compared files, the similarity determination result for the compared files, and any of the values calculated in order to determine similarity, such as the average of the differences between section entropy values.

The display unit 23 of the user terminal 2_n displays the determination result received from the similarity evaluation device 3. Any display method may be used. For example, the result may be formatted and shown on the display of the user terminal 2_n, or it may be output in a predetermined format to a predetermined printer configured for the user terminal 2_n.

<Effect>
With conventional similarity evaluation techniques using entropy values, the entropy values of two entire files could happen to be close even though the contents of the two files were completely different. In such cases, a supplementary similarity evaluation was conventionally performed by visually inspecting the data.

この実施形態の類似度評価システム１は、ファイル全体のエントロピー値が近い値となった場合に、機械的により詳細な類似度評価を行うことができる。したがって、２つのファイルの類似度評価の精度が向上する。なお、この発明の類似度評価技術によっても計算量の増大は大きくなく、実用的なレベルの利便性を維持している。 The similarity evaluation system 1 of this embodiment can perform a more detailed similarity evaluation mechanically when the entropy value of the whole file becomes a close value. Therefore, the accuracy of similarity evaluation between two files is improved. Note that the amount of calculation is not greatly increased even by the similarity evaluation technique of the present invention, and a practical level of convenience is maintained.

[Second Embodiment]
<Configuration>
A configuration example of the similarity evaluation system 5 of this embodiment will be described with reference to FIG. 9. The similarity evaluation system 5 includes N (≧1) user terminals 6_1, ..., 6_N and a similarity evaluation device 7. The N user terminals 6_1, ..., 6_N and the similarity evaluation device 7 are connected to a network 4. The network 4 only needs to be configured so that the connected devices can communicate with each other, and can be, for example, the Internet, a LAN (Local Area Network), or a WAN (Wide Area Network). The devices do not necessarily need to be able to communicate online via a network. For example, the information output by a user terminal 6_n (1≦n≦N) may be stored on a portable recording medium such as a magnetic tape or a USB memory and input offline from that medium to the similarity evaluation device 7.

図１０（Ａ）を参照して、類似度評価システム５に含まれる利用者端末６_nの構成例を説明する。利用者端末６_nは、制御部６０１、メモリ６０２、ファイル記憶部２１、投入部２２、表示部２３、全体特徴量算出部３２、区間分割部３４、区間特徴量算出部３５を備える。利用者端末６_nは、例えば、ＣＰＵ（Central Processing Unit）、ＲＡＭ（Random Access Memory）等を有する公知又は専用のコンピュータに特別なプログラムが読み込まれて構成された特別な装置である。利用者端末６_nは制御部６０１の制御のもとで各処理を実行する。利用者端末６_nに入力されたデータや各処理で得られたデータはメモリ６０２に格納され、メモリ６０２に格納されたデータは必要に応じて読み出されて他の処理に利用される。ファイル記憶部２１は、例えば、ＲＡＭ（Random Access Memory）などの主記憶装置、ハードディスクや光ディスクもしくはフラッシュメモリなどの半導体メモリ素子により構成される補助記憶装置、などにより構成することができる。 With reference to FIG. 10A, a configuration example of the user terminal 6 _n included in the similarity evaluation system 5 will be described. The user terminal 6 _n includes a control unit 601, a memory 602, a file storage unit 21, an input unit 22, a display unit 23, an overall feature amount calculation unit 32, a section division unit 34, and a section feature amount calculation unit 35. The user terminal 6 _n is a special device configured by reading a special program into a known or dedicated computer having a CPU (Central Processing Unit), a RAM (Random Access Memory), and the like. The user terminal 6 _n executes each process under the control of the control unit 601. Data input to the user terminal 6 _n and data obtained in each process are stored in the memory 602, and the data stored in the memory 602 is read out as necessary and used for other processes. The file storage unit 21 can be configured by, for example, a main storage device such as a RAM (Random Access Memory), an auxiliary storage device including a semiconductor memory element such as a hard disk, an optical disk, or a flash memory.

図１０（Ｂ）を参照して、類似度評価システム５に含まれる類似度評価装置７の構成例を説明する。類似度評価装置７は、制御部７０１、メモリ７０２、入力部３１、全体類似度評価部３３、区間類似度評価部３６、出力部３７、特徴量記憶部３８を備える。類似度評価装置７は、例えば、ＣＰＵ（Central Processing Unit）、ＲＡＭ（Random Access Memory）等を有する公知又は専用のコンピュータに特別なプログラムが読み込まれて構成された特別な装置である。類似度評価装置７は制御部７０１の制御のもとで各処理を実行する。類似度評価装置７に入力されたデータや各処理で得られたデータはメモリ７０２に格納され、メモリ７０２に格納されたデータは必要に応じて読み出されて他の処理に利用される。特徴量記憶部３８は、例えば、ＲＡＭ（Random Access Memory）などの主記憶装置、ハードディスクや光ディスクもしくはフラッシュメモリなどの半導体メモリ素子により構成される補助記憶装置、リレーショナルデータベースやキーバリューストアなどのミドルウェア、などにより構成することができる。 With reference to FIG. 10B, a configuration example of the similarity evaluation device 7 included in the similarity evaluation system 5 will be described. The similarity evaluation device 7 includes a control unit 701, a memory 702, an input unit 31, an overall similarity evaluation unit 33, a section similarity evaluation unit 36, an output unit 37, and a feature amount storage unit 38. The similarity evaluation device 7 is a special device configured by loading a special program into a known or dedicated computer having, for example, a CPU (Central Processing Unit), a RAM (Random Access Memory), and the like. The similarity evaluation device 7 executes each process under the control of the control unit 701. Data input to the similarity evaluation device 7 and data obtained in each process are stored in the memory 702, and the data stored in the memory 702 is read out as necessary and used for other processes. The feature amount storage unit 38 includes, for example, a main storage device such as a RAM (Random Access Memory), an auxiliary storage device including a semiconductor memory element such as a hard disk, an optical disk, or a flash memory, middleware such as a relational database and a key / value store, Etc. can be configured.

Thus, the similarity evaluation system 5 of the second embodiment differs from the similarity evaluation system 1 of the first embodiment in that the overall feature amount calculation unit 32, the section dividing unit 34, and the section feature amount calculation unit 35, which belonged to the similarity evaluation device 3 in the similarity evaluation system 1, are instead included in the user terminal 6_n.

<Processing>
An operation example of the user terminal 6 included in the similarity evaluation system 5 of this embodiment will be described with reference to FIG. 11.

利用者端末６_n(1≦n≦N)の備えるファイル記憶部２１には類似度を評価する少なくとも２つのファイルを含む比較対象ファイル群が記憶されている。比較対象ファイル群に含まれるファイルの数は２つであってもよいし、３つ以上であってもよい。以降の説明では、２つのファイルの類似度を評価する場合について説明し、３つ以上の場合に相違する点があれば適宜説明する。また、比較対象の２つのファイルは、それぞれ比較元ファイル、比較先ファイルと呼ぶ。 The file storage unit 21 included in the user terminal 6 _n (1 ≦ n ≦ N) stores a comparison target file group including at least two files whose similarity is evaluated. The number of files included in the comparison target file group may be two, or may be three or more. In the following description, a case where the similarity between two files is evaluated will be described, and if there are differences in three or more cases, they will be described as appropriate. The two files to be compared are called a comparison source file and a comparison destination file, respectively.

The overall feature amount calculation unit 32 of the user terminal 6_n calculates a predetermined entropy value from each of the comparison source file and the comparison destination file (step S32). The entropy value calculated from the comparison source file is called the comparison source overall feature amount, and the entropy value calculated from the comparison destination file is called the comparison destination overall feature amount. The entropy value to be calculated is the same as in the first embodiment, so its description is omitted. When the comparison target file group includes three or more files, the predetermined entropy value is calculated for every pair of files.

The comparison source file and the comparison destination file are input to the section dividing unit 34. The section dividing unit 34 divides each of the comparison source file and the comparison destination file into a plurality of sections by a predetermined method (step S34). The file obtained by dividing the comparison source file is called the comparison source divided file, and the file obtained by dividing the comparison destination file is called the comparison destination divided file. The section division method is the same as in the first embodiment, so its description is omitted.

比較元分割ファイルと比較先分割ファイルは区間特徴量算出部３５へ入力される。区間特徴量算出部３５は、比較元分割ファイルと比較先分割ファイルそれぞれから区間ごとに所定のエントロピー値を算出する（ステップＳ３５）。比較元分割ファイルの区間ごとのエントロピー値のスペクトルを比較元区間特徴量と呼び、比較先分割ファイルの区間ごとのエントロピー値のスペクトルを比較先区間特徴量と呼ぶ。算出するエントロピー値は、全体特徴量と同様に、第一実施形態と同様であるので説明を省略する。 The comparison source division file and the comparison destination division file are input to the section feature amount calculation unit 35. The section feature amount calculation unit 35 calculates a predetermined entropy value for each section from each of the comparison source split file and the comparison target split file (step S35). The spectrum of the entropy value for each section of the comparison source divided file is called a comparison source section feature, and the spectrum of the entropy value for each section of the comparison target divided file is called a comparison destination section feature. The entropy value to be calculated is the same as that of the first embodiment, as is the case with the entire feature amount, and thus description thereof is omitted.

利用者端末６_nの備える投入部２２は、比較元全体特徴量と比較先全体特徴量と比較元区間特徴量と比較先区間特徴量とを類似度評価装置７へ送信する。 The input unit 22 included in the user terminal 6 _n transmits the comparison source overall feature quantity, the comparison destination overall feature quantity, the comparison source section feature quantity, and the comparison destination section feature quantity to the similarity evaluation device 7.
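As an illustration of what the input unit 22 might send, the sketch below packs the four feature values into a single message. The JSON encoding and the field names are assumptions made for this example. The text does not specify a wire format.

```python
import json


def build_feature_payload(
    source_overall: float,
    target_overall: float,
    source_sections: list[float],
    target_sections: list[float],
) -> bytes:
    """Serialize the overall and per-section features (hypothetical field names)."""
    payload = {
        "source_overall": source_overall,
        "target_overall": target_overall,
        "source_sections": source_sections,
        "target_sections": target_sections,
    }
    return json.dumps(payload).encode("utf-8")
```

With 16 sections per file, as in the FIG. 7 example, such a message is a few hundred bytes, whereas sending the two files themselves would mean transmitting 16 sections of up to 100,000 bytes each per file.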

図１２を参照して、この実施形態の類似度評価システム５に含まれる類似度評価装置７の動作例を説明する。 With reference to FIG. 12, the operation example of the similarity evaluation apparatus 7 contained in the similarity evaluation system 5 of this embodiment is demonstrated.

利用者端末６_nが送信した比較元全体特徴量と比較先全体特徴量と比較元区間特徴量と比較先区間特徴量とは、入力部３１を介して類似度評価装置７へ入力される（ステップＳ３１）。入力された比較元全体特徴量と比較先全体特徴量と比較元区間特徴量と比較先区間特徴量とは特徴量記憶部３８に記憶される。全体類似度評価部３３は特徴量記憶部３８から読み出した比較元全体特徴量と比較先全体特徴量から全体類似度を算出する（ステップＳ３３１）。全体類似度の算出は、第一実施形態と同様であるので説明を省略する。比較対象ファイル群が３つ以上のファイルを含む場合には、すべての２つのファイルの組み合わせについて、全体類似度を算出すればよい。 The comparison source overall feature quantity, comparison destination overall feature quantity, comparison source section feature quantity, and comparison destination section feature quantity transmitted by the user terminal 6 _n are input to the similarity evaluation device 7 via the input unit 31 ( Step S31). The input comparison source overall feature quantity, comparison destination overall feature quantity, comparison source section feature quantity, and comparison destination section feature quantity are stored in the feature quantity storage unit 38. The overall similarity evaluation unit 33 calculates the overall similarity from the comparison source overall feature amount and the comparison destination overall feature amount read from the feature amount storage unit 38 (step S331). Since the calculation of the overall similarity is the same as that in the first embodiment, the description thereof is omitted. When the comparison target file group includes three or more files, the overall similarity may be calculated for the combination of all two files.

暫定判定結果が比較元ファイルと比較先ファイルとが類似しないことを示す場合には、比較元ファイルと比較先ファイルとが類似しないことを示す判定結果を出力し、処理を終了する（ステップＳ３７２）。 If the provisional determination result indicates that the comparison source file and the comparison destination file are not similar, a determination result indicating that the comparison source file and the comparison destination file are not similar is output, and the process ends (step S372). .

暫定判定結果が比較元ファイルと比較先ファイルとが類似することを示す場合には、以降の処理を継続する。複数のファイルの組み合わせでファイルが類似すると判定された場合には、類似するファイルの組み合わせすべてについて以降の処理を繰り返し実行する。 If the provisional determination result indicates that the comparison source file and the comparison destination file are similar, the subsequent processing is continued. When it is determined that the files are similar in a combination of a plurality of files, the subsequent processing is repeatedly executed for all the similar file combinations.

比較元区間特徴量と比較先区間特徴量は区間類似度評価部３６へ入力される。区間類似度評価部３６は、比較元区間特徴量と比較先区間特徴量とに基づいて比較元ファイルと比較先ファイルが類似するか否かを示す判定結果を求める（ステップＳ３６）。判定の方法は、第一実施形態と同様であるので説明を省略する。 The comparison source section feature value and the comparison destination section feature value are input to the section similarity evaluation unit 36. The section similarity evaluation unit 36 obtains a determination result indicating whether the comparison source file and the comparison destination file are similar based on the comparison source section feature quantity and the comparison destination section feature quantity (step S36). Since the determination method is the same as that of the first embodiment, description thereof is omitted.
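The two-stage flow on the similarity evaluation device 7 (a provisional determination from the overall features, followed by the section-level determination only when the provisional result is "similar") can be sketched as follows. The overall-stage rule used here, comparing the absolute difference of the two overall entropy values with `overall_threshold`, is an assumption, because the overall similarity calculation is described in the first embodiment and not in this passage. All names and the default `overall_threshold` are hypothetical.

```python
def evaluate(
    source_overall: float,
    target_overall: float,
    source_sections: list[float],
    target_sections: list[float],
    overall_threshold: float = 0.1,
    section_threshold: float = 0.25,
) -> bool:
    """Return True when the two files are judged similar."""
    # Stage 1: provisional determination from the whole-file features (assumed rule).
    if abs(source_overall - target_overall) >= overall_threshold:
        return False  # not similar; the section stage is skipped

    # Stage 2: section-level determination using the average difference D.
    n = len(source_sections)
    if n == 0 or n != len(target_sections):
        raise ValueError("both files must have the same non-zero number of sections")
    d = sum(a - b for a, b in zip(source_sections, target_sections)) / n
    return d < section_threshold
```

The second stage only ever runs for pairs whose whole-file entropy values are already close, which is the case the section-level evaluation is intended to resolve.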

The determination result is transmitted to the user terminal 6_n via the output unit 37 of the similarity evaluation device 7 (step S371). The information transmitted is the same as in the first embodiment, so its description is omitted.

利用者端末６_nの備える表示部２３は、類似度評価装置７から受信した判定結果を表示する。表示の方法は、第一実施形態と同様であるので説明を省略する。 The display unit 23 included in the user terminal 6 _n displays the determination result received from the similarity evaluation device 7. Since the display method is the same as in the first embodiment, the description is omitted.

<Effect>
With the above configuration, in the similarity evaluation system 5 of this embodiment, the data transmitted from the user terminals 6_1, ..., 6_N to the similarity evaluation device 7 via the network 4 can be limited to the feature values, which are small, rather than the files being evaluated, which are large. This reduces the traffic on the network 4 and improves response time.

[Program, recording medium]
The present invention is not limited to the embodiments described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments may be executed not only in time series in the order described, but also in parallel or individually, according to the processing capability of the apparatus that executes them or as needed.

また、上記実施形態で説明した各装置における各種の処理機能をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、上記各装置における各種の処理機能がコンピュータ上で実現される。 When various processing functions in each device described in the above embodiment are realized by a computer, the processing contents of the functions that each device should have are described by a program. Then, by executing this program on a computer, various processing functions in each of the above devices are realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Furthermore, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

また、この形態では、コンピュータ上で所定のプログラムを実行させることにより、本装置を構成することとしたが、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 In this embodiment, the present apparatus is configured by executing a predetermined program on a computer. However, at least a part of these processing contents may be realized by hardware.

DESCRIPTION OF SYMBOLS
1, 5 Similarity evaluation system
2, 6 User terminal
3, 7 Similarity evaluation device
4 Network
21 File storage unit
22 Input unit
23 Display unit
31 Input unit
32 Overall feature amount calculation unit
33 Overall similarity evaluation unit
34 Section dividing unit
35 Section feature amount calculation unit
36 Section similarity evaluation unit
37 Output unit
38 Feature amount storage unit
201, 301, 601, 701 Control unit
202, 302, 602, 702 Memory

Claims

A similarity evaluation system that includes a user terminal and a similarity evaluation device, and determines whether the comparison source file and the comparison destination file are similar,
The user terminal is
An input unit that transmits the comparison source file and the comparison destination file to the similarity evaluation device;
A display unit that displays a determination result received from the similarity evaluation device and indicating whether or not the comparison source file and the comparison destination file are similar;
Including
The similarity evaluation device includes:
A section dividing unit that divides each of the comparison source file and the comparison destination file into a plurality of sections by a predetermined division method, and generates a comparison source split file and a comparison destination split file;
A section feature quantity calculation unit that calculates a predetermined entropy value for each section from each of the comparison source division file and the comparison destination division file, and generates a comparison source section feature quantity and a comparison destination section feature quantity;
A section similarity evaluation unit for obtaining the determination result based on the comparison source section feature quantity and the comparison destination section feature quantity;
A similarity evaluation system characterized by including

A similarity evaluation system that includes a user terminal and a similarity evaluation device and determines whether a comparison source file and a comparison destination file are similar, wherein
the user terminal includes:
a section dividing unit that divides each of the comparison source file and the comparison destination file into a plurality of sections by a predetermined division method, and generates a comparison source division file and a comparison destination division file;
a section feature quantity calculation unit that calculates a predetermined entropy value for each section of each of the comparison source division file and the comparison destination division file, and generates a comparison source section feature quantity and a comparison destination section feature quantity;
an input unit that transmits the comparison source section feature quantity and the comparison destination section feature quantity to the similarity evaluation device; and
a display unit that displays a determination result, received from the similarity evaluation device, indicating whether or not the comparison source file and the comparison destination file are similar,
and the similarity evaluation device includes:
a section similarity evaluation unit that obtains the determination result based on the comparison source section feature quantity and the comparison destination section feature quantity.
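
In the arrangement of the claim above, the user terminal computes the section feature quantities and only those values, not the files, reach the similarity evaluation device. The following Python sketch illustrates that split under assumptions of my own: the JSON message layout, the equal-size division, the byte-level Shannon entropy, the `tolerance` threshold, and the function names are all hypothetical, and the network transfer is replaced by passing a string.

```python
import json
import math
from collections import Counter


def entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def section_entropies(data: bytes, num_sections: int) -> list[float]:
    """Assumed division method: contiguous sections of near-equal size."""
    bounds = [len(data) * k // num_sections for k in range(num_sections + 1)]
    return [entropy(data[bounds[k]:bounds[k + 1]]) for k in range(num_sections)]


def terminal_build_request(src: bytes, dst: bytes, num_sections: int = 4) -> str:
    """User terminal side: compute features locally and serialize only the features."""
    return json.dumps({
        "source": section_entropies(src, num_sections),
        "destination": section_entropies(dst, num_sections),
    })


def device_evaluate(request: str, tolerance: float = 0.1) -> str:
    """Evaluation device side: judge similarity from the received features alone."""
    message = json.loads(request)
    src, dst = message["source"], message["destination"]
    similar = len(src) == len(dst) and all(
        abs(a - b) <= tolerance for a, b in zip(src, dst)
    )
    return json.dumps({"similar": similar})


if __name__ == "__main__":
    source = b"\x00" * 512 + bytes(range(256)) * 2
    destination = bytes(range(256)) * 2 + b"\x00" * 512
    print(device_evaluate(terminal_build_request(source, destination)))
```

One consequence of this design is that the message size depends on the number of sections rather than on the file size, and the file contents never leave the terminal.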

The similarity evaluation system according to claim 1, wherein
the similarity evaluation device further includes:
an overall feature quantity calculation unit that calculates the entropy value from the whole of each of the comparison source file and the comparison destination file, and generates a comparison source overall feature quantity and a comparison destination overall feature quantity; and
an overall similarity evaluation unit that obtains a provisional determination result indicating whether or not the comparison source file and the comparison destination file are similar, based on the comparison source overall feature quantity and the comparison destination overall feature quantity,
and the section dividing unit, the section feature quantity calculation unit, and the section similarity evaluation unit are executed when the provisional determination result indicates that the comparison source file and the comparison destination file are similar.
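
The two-stage flow of the claim above, in which the section-level comparison runs only after a whole-file comparison provisionally reports "similar", can be sketched as follows. This is a hedged illustration, not the patented implementation: the byte-level Shannon entropy, the equal-size division, the shared `tolerance` threshold, and the names `entropy`, `section_entropies`, and `two_stage_evaluate` are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def section_entropies(data: bytes, num_sections: int) -> list[float]:
    """Assumed division method: contiguous sections of near-equal size."""
    bounds = [len(data) * k // num_sections for k in range(num_sections + 1)]
    return [entropy(data[bounds[k]:bounds[k + 1]]) for k in range(num_sections)]


def two_stage_evaluate(
    src: bytes, dst: bytes, num_sections: int = 4, tolerance: float = 0.1
) -> tuple[bool, str]:
    """Return (similar, stage), where stage names the step that decided the result."""
    # Stage 1: provisional determination from the whole-file entropy values.
    if abs(entropy(src) - entropy(dst)) > tolerance:
        return False, "overall"
    # Stage 2: reached only when stage 1 says "similar"; compare section by section.
    src_features = section_entropies(src, num_sections)
    dst_features = section_entropies(dst, num_sections)
    similar = all(abs(a - b) <= tolerance for a, b in zip(src_features, dst_features))
    return similar, "section"


if __name__ == "__main__":
    source = b"\x00" * 512 + bytes(range(256)) * 2
    reordered = bytes(range(256)) * 2 + b"\x00" * 512
    print(two_stage_evaluate(source, reordered))
    print(two_stage_evaluate(source, b"\x00" * 1024))
```

Files that already differ in whole-file entropy are rejected in the first stage, so the per-section computation is spent only on pairs that the coarse comparison cannot separate.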

The similarity evaluation system according to claim 2, wherein
the user terminal further includes:
an overall feature quantity calculation unit that calculates the entropy value from the whole of each of the comparison source file and the comparison destination file, and generates a comparison source overall feature quantity and a comparison destination overall feature quantity,
the similarity evaluation device further includes:
an overall similarity evaluation unit that obtains a provisional determination result indicating whether or not the comparison source file and the comparison destination file are similar, based on the comparison source overall feature quantity and the comparison destination overall feature quantity,
and the section similarity evaluation unit is executed when the provisional determination result indicates that the comparison source file and the comparison destination file are similar.

A similarity evaluation device including:
a section dividing unit that divides each of a comparison source file and a comparison destination file into a plurality of sections by a predetermined division method, and generates a comparison source division file and a comparison destination division file;
a section feature quantity calculation unit that calculates a predetermined entropy value for each section of each of the comparison source division file and the comparison destination division file, and generates a comparison source section feature quantity and a comparison destination section feature quantity; and
a section similarity evaluation unit that obtains a determination result indicating whether or not the comparison source file and the comparison destination file are similar, based on the comparison source section feature quantity and the comparison destination section feature quantity.

A user terminal including:
an input unit that transmits a comparison source file and a comparison destination file to a similarity evaluation device; and
a display unit that displays a determination result, received from the similarity evaluation device, indicating whether or not the comparison source file and the comparison destination file are similar.

A similarity evaluation device including a section similarity evaluation unit that obtains a determination result indicating whether or not a comparison source file and a comparison destination file are similar, based on a comparison source section feature quantity obtained by dividing the comparison source file into a plurality of sections by a predetermined division method and calculating a predetermined entropy value for each section, and a comparison destination section feature quantity obtained by dividing the comparison destination file into a plurality of sections by the division method and calculating the entropy value for each section.

A user terminal including:
a section dividing unit that divides each of a comparison source file and a comparison destination file into a plurality of sections by a predetermined division method, and generates a comparison source division file and a comparison destination division file;
a section feature quantity calculation unit that calculates a predetermined entropy value for each section of each of the comparison source division file and the comparison destination division file, and generates a comparison source section feature quantity and a comparison destination section feature quantity;
an input unit that transmits the comparison source section feature quantity and the comparison destination section feature quantity to a similarity evaluation device; and
a display unit that displays a determination result, received from the similarity evaluation device, indicating whether or not the comparison source file and the comparison destination file are similar.

A similarity evaluation method for determining whether a comparison source file and a comparison destination file are similar, the method including:
a step in which a user terminal transmits the comparison source file and the comparison destination file to a similarity evaluation device;
a section dividing step in which the similarity evaluation device divides each of the comparison source file and the comparison destination file into a plurality of sections by a predetermined division method, and generates a comparison source division file and a comparison destination division file;
a section feature quantity calculation step in which the similarity evaluation device calculates a predetermined entropy value for each section of each of the comparison source division file and the comparison destination division file, and generates a comparison source section feature quantity and a comparison destination section feature quantity;
a section similarity evaluation step in which the similarity evaluation device obtains a determination result indicating whether or not the comparison source file and the comparison destination file are similar, based on the comparison source section feature quantity and the comparison destination section feature quantity; and
a display step in which the user terminal displays the determination result received from the similarity evaluation device.

A similarity evaluation method for determining whether a comparison source file and a comparison destination file are similar, the method including:
a section dividing step in which a user terminal divides each of the comparison source file and the comparison destination file into a plurality of sections by a predetermined division method, and generates a comparison source division file and a comparison destination division file;
a section feature quantity calculation step in which the user terminal calculates a predetermined entropy value for each section of each of the comparison source division file and the comparison destination division file, and generates a comparison source section feature quantity and a comparison destination section feature quantity;
a step in which the user terminal transmits the comparison source section feature quantity and the comparison destination section feature quantity to a similarity evaluation device;
a section similarity evaluation step in which the similarity evaluation device obtains a determination result indicating whether or not the comparison source file and the comparison destination file are similar, based on the comparison source section feature quantity and the comparison destination section feature quantity; and
a display step in which the user terminal displays the determination result received from the similarity evaluation device.

A program for causing a computer to function as the similarity evaluation device according to claim 5 or 7 or the user terminal according to claim 6 or 8.