JP2003216425A

JP2003216425A - Resemblance measuring system

Info

Publication number: JP2003216425A
Application number: JP2002015135A
Authority: JP
Inventors: Katsuro Inoue; 克郎井上; Makoto Matsushita; 誠松下; Tetsuo Yamamoto; 哲男山本
Original assignee: Japan Science and Technology Corp
Current assignee: Japan Science and Technology Agency
Priority date: 2002-01-24
Filing date: 2002-01-24
Publication date: 2003-07-31

Abstract

PROBLEM TO BE SOLVED: To quantitatively measure the similarity of texts arranged as one-dimensional sequences (for example, source programs).

SOLUTION: The case of source programs is described. First, the parts that have no effect on the function of the generated program are removed (Step 1). The two given software systems are input and CCFinder is executed (Step 2). diff is executed on every pair of files in which CCFinder found at least one clone pair (Step 3). A correspondence is defined between lines judged to match by either CCFinder or diff (Step 4). The similarity is calculated according to the definition of CSR (Step 5). The system can also output the differing parts (the differences).

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to measuring the similarity between texts arranged as one-dimensional sequences, and in particular to quantitatively measuring how much programs differ from one another.

[0002]

[Technical Background] Given two software systems, it is important to know objectively how much they differ. By examining the degree of difference among several versions of a system, one can learn how the system has been maintained and how far it has evolved. If documentation was written when the system was modified, it can serve as a clue to the differences, but obtaining a quantitative value has not been easy. When a system is small and a person can readily grasp its overall structure, quantitative values can be examined for its individual components and combined into a value for the whole system. However, for a system with a complex structure consisting of hundreds or thousands of files, such values must be obtained automatically by some mechanical process.

[0003] Various methods have been proposed for calculating the similarity between individual files and extracting their differences. Baxter et al., in the paper below, proposed parsing the program text into an abstract syntax tree (AST) and using the proportion of common vertices as the similarity.

Baxter, I. D., Yahin, A., Moura, L., Sant'Anna, M., Bier, L.: Clone Detection Using Abstract Syntax Trees, Proceedings; International Conference on Software Maintenance, IEEE Computer Society Press, pp. 368-378 (1998).

Nagahashi, in the paper below, proposed a method that, for a pair of COBOL programs, applies an LCS algorithm after normalization to extract the differences and calculate the similarity.

Kenji Nagahashi, "Evaluation of Software Quality Based on Similarity", IPSJ Research Report 2000-SE-126, Vol. 2000, No. 25, pp. 65-72 (2000).

However, all of these methods compute over a pair of files and cannot be applied to a pair of sets of files.

[0004]

[Problems to Be Solved by the Invention] The object of the present invention is, in the development and management of large-scale software, large-scale web systems, large documents, manuals, and the like, to know quantitatively not how much one individual file differs from another file, but how much an entire system composed of multiple files differs from another system.

[0005]

[Means for Solving the Problems] To achieve the above object, the present invention is a similarity measuring system that measures the similarity between two texts, each composed of a plurality of files, comprising: corresponding-file extracting means that examines the files constituting the two texts for isomorphic patterns and extracts the pairs of files that share such a pattern; corresponding-line detecting means that obtains the pairs of corresponding lines between the file pairs extracted by the corresponding-file extracting means; and similarity calculating means that obtains the ratio between the sum of the line counts of the two texts and the sum of the numbers of lines in each text found to have a correspondence.

The corresponding-line detecting means may output the lines for which no correspondence was found, so that the differences can be output along with the similarity measurement. As preprocessing, line deleting means may also be provided that deletes, from both texts, lines irrelevant to the similarity measurement.

When the two texts are source programs, the corresponding-file extracting means, which measures the similarity between files, may convert the programs into token sequences according to the grammar of the programming language before looking for isomorphic patterns. The corresponding-line detecting means may detect the pairs of corresponding lines by using both a tool that converts the programs into token sequences according to the grammar before looking for isomorphic patterns and a tool that finds the differing portions without relying on the grammar.

The methods performed by these systems, the programs that implement these systems, and recording media on which such programs are recorded are also part of the present invention.

[0006]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. The embodiment is described using, as an example, a software system composed of multiple source program files, but the system can also be applied to large-scale web systems, large documents, manuals, and the like that are composed of files of linear, one-dimensional text. First, we propose CSR (Corresponding Source-line Ratio), a similarity metric for two software systems given as source programs. Many factors could enter a similarity metric for two software systems, for example the number of files, the number of lines, and differences in file names. Below, we give a formal definition of the proposed metric CSR and describe a concrete method for computing it.

[0007] <Definition of Similarity> To define relationships between software systems consisting of one or more files, we use the notion of a "product", which abstractly represents a file or a software system. For simplicity, a product is regarded as the set of its elements and written P = {p_1, ..., p_m}. When P is a file, each p_i is a line of that file; when P is a software system, each p_i is a file of that system. Suppose that for two products P = {p_1, ..., p_m} and Q = {q_1, ..., q_n} a correspondence R ⊆ P × Q is given. The similarity S (0 ≤ S ≤ 1) of P and Q with respect to R is defined as follows.

[Equation 1]

$$S = \frac{\left|\{\, p_i \mid (p_i, q_j) \in R \,\}\right| + \left|\{\, q_j \mid (p_i, q_j) \in R \,\}\right|}{|P| + |Q|}$$

As shown in FIG. 1, this is the number of elements of P and Q that appear in the correspondence R, divided by the total number of elements of P and Q. As the number of elements of P and Q that are not involved in R grows, S decreases. When R is the empty set, S = 0. When P and Q are identical, (p_i, q_i) ∈ R for every i, and S = 1.
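The definition can be tried out with a short Python sketch. It assumes, purely for illustration, that the elements of P and Q are identified by their positions and that R is given as a set of index pairs (i, j); the function name `similarity` and the treatment of two empty products are choices made here, not part of the source.

```python
def similarity(p, q, r):
    """S = (elements of P in R + elements of Q in R) / (|P| + |Q|).

    p, q: sequences of elements (lines or files).
    r: set of (i, j) index pairs meaning p[i] corresponds to q[j].
    """
    total = len(p) + len(q)
    if total == 0:
        return 0.0  # assumption: two empty products get similarity 0
    matched_p = {i for i, _ in r}  # elements of P that take part in R
    matched_q = {j for _, j in r}  # elements of Q that take part in R
    return (len(matched_p) + len(matched_q)) / total


p = ["a", "b", "c", "d"]
q = ["a", "c", "e"]
r = {(0, 0), (2, 1)}               # "a" matches "a", "c" matches "c"

s = similarity(p, q, r)            # (2 + 2) / (4 + 3)
s_empty = similarity(p, q, set())  # empty R gives 0
s_same = similarity(p, p, {(i, i) for i in range(len(p))})  # identical products give 1
```

Counting matched elements through sets means an element that corresponds to several partners is still counted once, which keeps S within the stated range of 0 to 1.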

[0008] <Applying the Similarity, and CSR> Consider how the similarity defined above can be applied to actual software systems and files.

1. Let P and Q be files, and let p_i and q_i be the lines in those files. If R is the relation given by the file differencing tool diff, S is defined as the similarity between files and can be computed.

2. Let P and Q be software systems, and let p_i and q_i be the files that make up those systems. Let R be the correspondence between files that have the same file name. In this case the similarity between systems is easy to compute, but when a file has been renamed, or when the name is unchanged but the contents have been modified, the similarity departs from the intuitive value. In addition, because every file carries equal weight regardless of its size, the result can be counterintuitive. For example, if only many small files correspond and a few large files do not, the similarity comes out high.

3. Let P and Q be software systems, and let p_i and q_i be the files that make up those systems. Let R map each file to the file with which its file-to-file similarity is highest. This copes with changes to file names and contents. However, file size is still not reflected. Moreover, the file-to-file similarity must be computed for every combination, which is very costly.

4. Let P and Q be software systems, and let p_i and q_i be the individual lines of each file in P and Q respectively. Intuitively, think of the lines of a single file formed by concatenating all the files. Suppose a correspondence R between lines is given by some means. The similarity is then unaffected by file names and file sizes, and can be expected to yield values close to intuition.

In the present invention, the similarity metric between software systems based on this fourth approach is called CSR (Corresponding Source-line Ratio). The following describes in detail how R is determined and how CSR is computed.
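The weakness of file-level counting noted in approach 2 can be seen in a small Python sketch. The two toy systems, the file names, and the deliberately crude line correspondence (only identical files with the same name correspond) are all assumptions made for illustration.

```python
def file_level_similarity(p, q):
    """Approach 2: elements are files, R pairs files with the same name."""
    shared = p.keys() & q.keys()
    return 2 * len(shared) / (len(p) + len(q))


def line_level_similarity(p, q):
    """Approach 4 with a crude stand-in for R: lines correspond only when
    the two files have the same name and identical contents."""
    total = sum(len(v) for v in p.values()) + sum(len(v) for v in q.values())
    matched = sum(2 * len(p[name]) for name in p.keys() & q.keys()
                  if p[name] == q[name])
    return matched / total


# Hypothetical systems: two small shared files, one large unshared file each.
small_a = ["int a;", "int b;"]
small_b = ["int c;", "int d;"]
p = {"a.c": small_a, "b.c": small_b,
     "big.c": [f"x{i} = {i};" for i in range(100)]}
q = {"a.c": small_a, "b.c": small_b,
     "huge.c": [f"y{i} = {i};" for i in range(100)]}

by_file = file_level_similarity(p, q)  # 4 of 6 files correspond
by_line = line_level_similarity(p, q)  # 8 of 208 lines correspond
```

Here the file-level value is about 0.67 while the line-level value is below 0.04, which is the kind of gap that motivates counting lines rather than files.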

[0009] <Computing CSR> This section describes a concrete way of obtaining the correspondence R needed to compute the similarity metric CSR defined above, and then an algorithm that computes the similarity of two given software systems.

(Approach) To obtain a correspondence for every line of every file, it suffices to check, for each line, whether an identical line exists. Several techniques for finding such code duplication already exist.

The clone detection tool CCFinder takes multiple source files as input and outputs code clones. CCFinder converts source code into a token sequence according to the grammar of the programming language; multiple source files are concatenated into a single token sequence. Whitespace and comments in the source code are ignored because they do not affect the function of the generated program. Furthermore, to detect only clones that are meaningful in practice, the token sequence is transformed, for example by parameter replacement (tokens are treated as equivalent even when names differ). The token sequences are then compared to check whether the source code matches. A matching token subsequence is called a code clone.

The UNIX(R) diff command applies an LCS (longest common subsequence) algorithm to the lines of two files to obtain their line-by-line difference. This difference is itself a file from which one of the two files can be generated from the other. A distinguishing feature of diff is that it computes differences without requiring any parsing.

Here the correspondence is obtained using both CCFinder and diff. First, code clones are detected with CCFinder. diff is then run on each pair of files in which a clone was detected. Lines judged identical by CCFinder or by diff are taken to correspond. Combining the two tools improves the accuracy of the similarity.
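The two ideas in this approach, name-insensitive token comparison and line-level differencing, can be illustrated with a minimal Python sketch. It is not CCFinder or the diff command: `normalize` is a hypothetical, much simplified stand-in for parameter replacement, the keyword set is an arbitrary subset, and `difflib.SequenceMatcher` uses its own matching heuristic rather than a strict LCS algorithm.

```python
import difflib
import re

C_KEYWORDS = {"int", "char", "return", "if", "else", "for", "while"}  # illustrative subset


def normalize(line):
    """Tokenize a line and replace every identifier with the placeholder $id."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", line)
    return ["$id" if (t[0].isalpha() or t[0] == "_") and t not in C_KEYWORDS else t
            for t in tokens]


def diff_pairs(a, b):
    """Index pairs (i, j) of lines that a diff-like comparison leaves unchanged."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return {(m.a + k, m.b + k)
            for m in matcher.get_matching_blocks()
            for k in range(m.size)}


# Lines that differ only in names have the same normalized token sequence.
same_shape = normalize("int total = a + b;") == normalize("int sum = x + y;")

old = ["#include <stdio.h>", "int x;", "int y;", "return 0;"]
new = ["#include <stdio.h>", "int y;", "int z;", "return 0;"]
pairs = diff_pairs(old, new)  # lines 0, 2, 3 of old match lines 0, 1, 3 of new
```

Note that the `#include` line is matched by the line-level comparison, which is the kind of line the text says a token-based clone detector leaves out.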

[0010] (Algorithm) The algorithm for computing CSR is given below with reference to the flowchart of FIG. 2.

Input: two software systems P and Q
Output: the similarity CSR of P and Q (0 ≤ S ≤ 1)

Step 1: Preprocessing. Remove the parts that do not affect the function of the generated program. This processing depends on the programming language in use. For example, for files written in C, all comments and blank lines are removed. This improves the accuracy of the similarity when diff is run.

Step 2: Running CCFinder. CCFinder is run with the two given software systems as input. As an option, the minimum number of matching tokens is set to 20. The minimum number of matching tokens is the minimum length of a matching token sequence that is reported.

Step 3: Running diff. diff is run on every pair of files for which CCFinder found at least one clone pair.

Step 4: Extracting the correspondence. From the token sequences of the clones detected by CCFinder, the matching lines of the actual files are determined. In addition, matching lines are computed from the difference information produced by diff. A relation is defined between lines judged to match by either CCFinder or diff.

Step 5: Computing CSR. CSR is computed according to its definition. The total line counts of the two software systems are the counts after preprocessing.

diff is used in addition to CCFinder in order to improve the accuracy of the correspondence. C preprocessor directives (such as #include lines) are excluded by CCFinder. Adding the difference information from diff therefore increases the number of lines that can be judged identical, and the similarity is expected to reflect the actual resemblance more accurately. In the application experiment described in the following sections, the number of lines with a correspondence increased by about ten percent.

If the correspondence were extracted with diff alone, there would be two options: give the two software systems as input with their directory structures preserved, or merge all lines of all files of each software system into one huge file and give those files as input. In the former case, the two software systems must have the same structure, so the method cannot follow file renames or changes to the directory structure. In the latter case, the order in which the files are concatenated becomes a problem, because the algorithm used by diff cannot handle reordered text. That is, given files A and B, running diff on the concatenation of A then B and the concatenation of B then A does not judge the two files to be identical. For this reason, the method adopted combines CCFinder and diff to obtain the correspondence.
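Steps 1 to 5 can be sketched end to end in Python under strong simplifying assumptions. `clone_pairs` is a hypothetical stand-in for CCFinder that looks for runs of identical lines instead of runs of at least 20 normalized tokens, `diff_pairs` uses `difflib` rather than the diff command, and `preprocess` only drops blank lines and `//` comment lines. Systems are modeled as dictionaries from file name to list of lines.

```python
import difflib


def preprocess(lines):
    """Step 1 (simplified): drop blank lines and lines that are only a // comment."""
    return [ln for ln in lines if ln.strip() and not ln.strip().startswith("//")]


def clone_pairs(a, b, min_run=2):
    """Step 2 stand-in: index pairs inside runs of >= min_run identical lines."""
    pairs = set()
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_run:
                pairs.update((i + d, j + d) for d in range(k))
    return pairs


def diff_pairs(a, b):
    """Step 3: index pairs of lines left unchanged by a diff-like comparison."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return {(m.a + k, m.b + k)
            for m in matcher.get_matching_blocks() for k in range(m.size)}


def csr(p, q, min_run=2):
    p = {name: preprocess(lines) for name, lines in p.items()}
    q = {name: preprocess(lines) for name, lines in q.items()}
    matched_p, matched_q = set(), set()
    for pn, a in p.items():
        for qn, b in q.items():
            clones = clone_pairs(a, b, min_run)
            if not clones:
                continue                      # diff runs only on pairs with a clone
            pairs = clones | diff_pairs(a, b)  # Step 4: matched by either tool
            matched_p.update((pn, i) for i, _ in pairs)
            matched_q.update((qn, j) for _, j in pairs)
    total = sum(map(len, p.values())) + sum(map(len, q.values()))
    return (len(matched_p) + len(matched_q)) / total if total else 0.0  # Step 5


p = {"a.c": ["int f() {", "  return 1;", "}", "", "// old"],
     "b.c": ["int g() {", "  return 2;", "}"]}
q = {"x/b.c": ["int g() {", "  return 2;", "}"],
     "a2.c": ["int f() {", "  return 1;", "}", "int h;"]}
score = csr(p, q)  # 12 of 13 preprocessed lines correspond, despite the renames

# Concatenation order: diffing A+B against B+A matches only one of the two parts.
A, B = ["a1", "a2"], ["b1", "b2", "b3"]
reordered = diff_pairs(A + B, B + A)
```

In the toy input both files were renamed or moved, yet the score stays high because file pairs are chosen by shared content rather than by name. The last two lines show the reordering problem described above: only the three lines of B are matched, out of five.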

[0011] <Application Example> Here the similarity metric CSR is applied to actual software systems. Recent UNIX(R)-family operating systems were used as the target systems.

(Target systems) The software systems are the BSD UNIX(R) releases 4.4-BSD Lite and 4.4-BSD Lite2, and the operating systems derived from them: FreeBSD, NetBSD, and OpenBSD. FreeBSD, NetBSD, and OpenBSD are still being developed as open source. For these three operating systems, major versions were selected from among those released after 4.4-BSD Lite up to the present. As a result, 6 versions of FreeBSD, 6 versions of NetBSD, and 9 versions of OpenBSD were selected, for a total of 23 operating system versions. The similarity CSR was measured for every combination of these. The language-dependent part of Step 1 of the algorithm above was handled as follows.

- Only the kernel portion of each OS is measured; only the files needed to build the kernel are extracted.
- Only C is targeted; only files with the extension c or h are measured.
- All comments and blank lines are removed.
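A minimal Python sketch of this language-dependent preprocessing follows. It is an approximation written for illustration: the names `is_target` and `preprocess_c` are hypothetical, the regular expression does not handle every corner of C (for example, backslash-continued `//` comments), and selecting only the files needed to build the kernel is not modeled.

```python
import re

_C_TOKEN = re.compile(
    r"//[^\n]*"             # line comment
    r"|/\*.*?\*/"           # block comment (may span lines)
    r'|"(?:\\.|[^"\\])*"'   # string literal, kept as is
    r"|'(?:\\.|[^'\\])*'",  # character literal, kept as is
    re.DOTALL,
)


def is_target(path):
    """Keep only C sources and headers."""
    return path.endswith((".c", ".h"))


def preprocess_c(text):
    """Replace comments with a space, then drop lines that are now blank."""
    stripped = _C_TOKEN.sub(
        lambda m: " " if m.group(0).startswith("/") else m.group(0), text)
    return [line.rstrip() for line in stripped.splitlines() if line.strip()]


source = 'int x; // counter\n\n/* block\n   comment */\nchar *s = "/* not a comment */";\n'
cleaned = preprocess_c(source)
```

String and character literals are matched by the same pattern so that comment markers inside them are left untouched.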

[0012] (Results) FIG. 3 is a table of the total number of files and lines for each OS. The values in the table were measured after the preprocessing of Step 1. FIG. 4 is a table of some of the measured similarity values. Looking only at FreeBSD in the table of FIG. 4, the version most similar to each version is either the one immediately before it or the one immediately after it; which of the two depends on when the versions were released. For example, FIG. 5 is a graph comparing the CSR of FreeBSD 2.2 with the other versions. The graph shows that, excluding FreeBSD 2.2 itself, the most similar version is FreeBSD 2.1, at 0.706. The similarity to the next version, FreeBSD 3.0, is 0.603, which is the third highest value. The second highest is FreeBSD 2.0.5, at 0.665. This indicates that substantial changes were made in FreeBSD 3.0.

[0013] Looking at the changes in file and line counts in the table of FIG. 3, the number of files increased about 1.8-fold and the number of lines about 1.7-fold, an amount of change unlike that seen before. FIG. 6 shows the CSR between FreeBSD and NetBSD. FreeBSD 2.0 through FreeBSD 2.2 are most similar to NetBSD 1.0, and the similarity decreases as the NetBSD version increases. For FreeBSD 3.0 and FreeBSD 4.0, however, the similarity to NetBSD 1.3 is higher than to the other NetBSD versions. Taking one version as the baseline, the similarity generally decreases as the version it is compared with advances. For FreeBSD and NetBSD, which are developed in fundamentally separate efforts (different developers and development policies), a rise in similarity can be explained by source code having been copied to add a feature that the other OS has, or by the same files having been imported into both operating systems.

[0014] FIG. 7 shows the derivation tree of the BSD UNIX(R) family. It shows how operating systems such as FreeBSD, NetBSD, and OpenBSD were derived and when they were released. It shows that FreeBSD 3.0 and NetBSD 1.3 are both the first versions to incorporate 4.4BSD Lite2. That is, the lines of the 4.4BSD Lite2 source code incorporated into FreeBSD and NetBSD correspond to each other, and the similarity increased. From these observations, it can be said that CSR correctly represents the similarity of software systems.

[0015] <Analysis> Here, based on the CSR values obtained above, the versions are classified by cluster analysis and a dendrogram of the versions is produced. The dendrogram is compared with the genealogy of the operating systems, and the points of agreement and difference are discussed.

(Dendrogram) Cluster analysis is performed using the similarity CSR as the distance between individuals. The average distance is used as the distance between clusters. FIG. 8 shows the dendrogram obtained from the cluster analysis. The horizontal axis represents the linkage distance. Consider how FIG. 8 classifies FreeBSD, NetBSD, and OpenBSD. The broadest division, at the point where the final clusters are merged, is between FreeBSD and NetBSD + OpenBSD. The classification obtained by cluster analysis with CSR thus correctly reflects the classification by OS type.

Next, consider the classification of NetBSD and OpenBSD. All OpenBSD versions except OpenBSD 2.0 are merged together and then merged with NetBSD 1.1. OpenBSD 2.0 is merged with NetBSD 1.2 and then with NetBSD 1.1. As the derivation diagram of FIG. 7 shows, OpenBSD is an OS derived from NetBSD 1.1. The dendrogram of FIG. 8 likewise shows OpenBSD deriving from NetBSD 1.1. FIG. 7 also shows that NetBSD 1.2 and OpenBSD 2.0 were released at about the same time, and that OpenBSD 2.0 is the first OpenBSD version. For these reasons, NetBSD 1.2 and OpenBSD showed high similarity.
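Average-linkage clustering of this kind can be sketched in a few lines of Python. The text does not state how CSR is turned into a distance, so the sketch assumes distance = 1 - CSR; the version labels and similarity values below are invented for illustration and are not taken from FIG. 4.

```python
def average_linkage(names, sim):
    """Agglomerative clustering with average linkage.

    sim maps frozenset({a, b}) to a similarity in [0, 1]; the distance is
    assumed to be 1 - similarity. Returns the merges in the order performed,
    as (cluster_1, cluster_2, linkage_distance) tuples.
    """
    clusters = [(n,) for n in names]

    def dist(c1, c2):
        # Mean pairwise distance between the members of the two clusters.
        return sum(1.0 - sim[frozenset((a, b))]
                   for a in c1 for b in c2) / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented similarities between four hypothetical versions of two systems.
names = ["A1", "A2", "B1", "B2"]
sim = {frozenset(pair): value for pair, value in [
    (("A1", "A2"), 0.9), (("B1", "B2"), 0.8),
    (("A1", "B1"), 0.3), (("A1", "B2"), 0.2),
    (("A2", "B1"), 0.3), (("A2", "B2"), 0.2),
]}
merges = average_linkage(names, sim)
```

The merge order and linkage distances are exactly what a dendrogram such as FIG. 8 draws: here the two A versions join first, then the two B versions, and the two groups join last at the largest distance.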

[0016] <Other Similarity Measures> To examine the effectiveness of CSR as a similarity measure, cluster analysis was also performed using similarity measures other than CSR as the distance between individuals. The measures used were the proportion of all files in which CCFinder detected a clone, and the proportion of all files whose file names, including the directory name, match. FIGS. 9 and 10 show the dendrograms produced with these two measures, respectively. Both give results similar to FIG. 8. The likely reason is that all of these operating systems have file structures and file names inherited from BSD. When development follows a fixed policy for file naming and file hierarchy, an analysis based simply on file names is also feasible.

In addition, the similarity was computed between FreeBSD 4.0 and Linux 2.2.15, two operating systems of different origin that were released at about the same time. The measures computed were CSR and the proportion of all files whose file names, including the directory name, match. Table 1 shows the results.

[Table 1] Similarity between FreeBSD 4.0 and Linux 2.2.15

The file-count proportion was 0: not a single identical file name was found. The CSR value, however, was 0.031. Under different forms of development, the file-count proportion alone cannot necessarily represent the similarity.

[0017] CSR is a value representing similarity, but the correspondence between lines is obtained in the course of computing it. This line correspondence makes it possible to know which lines of a file are actually the same as lines of another file. Thus not only the CSR value but also information on the corresponding lines is available, which is useful when consulting or modifying a software system on the basis of similarity. We also computed whether the release dates shown in the derivation diagram of FIG. 7 correlate with the similarity. Table 2 shows the computed release intervals for FreeBSD.

[Table 2] Release intervals between versions (months)

The correlation between CSR and release interval was -0.973. The correlation between the simple increase or decrease in line count in the table of FIG. 3 and the release interval was 0.528. There is thus a strong negative correlation between CSR and release interval, and CSR correlates with release interval more strongly than the change in line count does. Consequently, measuring the similarity of two versions makes it possible to estimate the interval between their releases. Since CSR in effect indicates the proportion of the source code that has not been changed, the high correlation suggests that development has proceeded steadily at a constant pace.

Until now there has been no way to know objectively how much large systems differ, but with the present invention the difference between two systems can be shown objectively. For example, when a system A is improved into a system B, difference information such as which files correspond between A and B, which files were partially modified, which files were newly added in B, and which files were deleted can be obtained efficiently, together with the proportion of changed lines (the similarity).
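For readers who want to reproduce this kind of figure, the correlation in question is presumably the ordinary Pearson coefficient (the text does not name it). The sketch below computes it in Python on deliberately artificial, exactly linear numbers; they are not the values of Table 2, which is not reproduced in this text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Artificial example: similarity falls by 0.05 per month of release interval.
intervals_months = [2, 4, 6, 8]
csr_values = [0.9, 0.8, 0.7, 0.6]
r = pearson(intervals_months, csr_values)  # close to -1 for this linear data
```

A value near -1 means that longer intervals go with lower similarity, which is the direction of the relationship reported above.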

[0018]

[Effects of the Invention] Until now there has been no method for objectively obtaining the similarity between two large-scale systems composed of text files, but the present invention makes it possible to compare a wide variety of systems. For example, UNIX (R) systems have existed in many versions, yet it has been difficult to know quantitatively how they differ. When this system is applied to the versions of BSD-family UNIX (R) such as FreeBSD and NetBSD and their differences are calculated, a genealogy that largely follows the actual course of their development can be obtained. Further, if the present invention is applied so that one system serves as a reference and only the difference of another system from it is stored, an efficient compressed storage method results. For example, when there are many systems that differ from one another in only some of their files, compressing and storing the other systems relative to a single reference system reduces the storage capacity required and also makes the systems easier to manage.
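The idea of keeping one reference system and storing only the difference of another system can be sketched as follows. This is a file-level simplification assumed for illustration (whole changed files are stored, not line-level differences), and the names `make_delta` and `restore` are hypothetical.

```python
def make_delta(base, other):
    """Record only what differs in `other` relative to the reference `base`.

    base, other: dict mapping file name -> file content (str).
    Assumption: differences are kept per file, not per line.
    """
    return {
        # Files that are new in `other` or whose content differs from `base`.
        "changed_or_added": {
            name: content
            for name, content in other.items()
            if base.get(name) != content
        },
        # Files present in `base` but absent from `other`.
        "deleted": sorted(name for name in base if name not in other),
    }


def restore(base, delta):
    """Rebuild the other system from the reference system and its delta."""
    files = {n: c for n, c in base.items() if n not in delta["deleted"]}
    files.update(delta["changed_or_added"])
    return files


# Toy systems for illustration only.
base_system = {"a.c": "1", "b.c": "2", "c.c": "3"}
other_system = {"a.c": "1", "b.c": "2x", "d.c": "4"}
delta = make_delta(base_system, other_system)
rebuilt = restore(base_system, delta)
print(delta)
```

Unchanged files such as `a.c` never appear in the delta, which is where the storage saving comes from when many systems share most of their files. A line-level delta built from the line correspondences would shrink the stored data further.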

[Brief description of drawings]

FIG. 1 is a diagram for explaining the similarity (CSR) according to the present invention.

FIG. 2 is a flowchart showing the process of obtaining the similarity (CSR).

FIG. 3 is a table showing the number of files and the number of lines of each OS.

FIG. 4 is a table showing part of the similarity (CSR) between versions.

FIG. 5 is a graph showing the similarity (CSR) between FreeBSD 2.2 and other versions.

FIG. 6 is a graph showing the similarity (CSR) between FreeBSD and NetBSD.

FIG. 7 is a derivation diagram of BSD-family UNIX (R).

FIG. 8 is a dendrogram based on the similarity (CSR).

FIG. 9 is a dendrogram based on the number of matched files.

FIG. 10 is a dendrogram based on identical file names.

Claims

1. A similarity measurement system for measuring the similarity between two texts each composed of a plurality of files, comprising: corresponding file extracting means for extracting pairs of files having isomorphic patterns by examining isomorphic patterns between the files composing the two texts; corresponding line detecting means for finding sets of lines that can be put into correspondence between the pairs of files extracted by the corresponding file extracting means; and similarity calculating means for calculating the ratio of the sum of the numbers of lines determined to correspond in each text to the sum of the numbers of lines of the two texts.

2. The similarity measurement system according to claim 1, wherein the corresponding line detecting means also outputs the lines that could not be put into correspondence, so that the difference can be output in addition to measuring the similarity.

3. The similarity measurement system according to claim 1, further comprising, as preprocessing, a line deletion unit that deletes lines unrelated to similarity measurement from both texts.

4. The similarity measurement system according to any one of claims 1 to 3, wherein the two texts are source programs, and the corresponding file extracting means obtains the isomorphic patterns after converting the texts into token strings according to the grammar of the program.

5. The similarity measurement system according to claim 4, wherein the corresponding line detecting means detects the sets of lines that can be put into correspondence by using both a method that obtains isomorphic patterns after conversion into token strings according to the grammar of the program and a method that finds differing portions without depending on the grammar of the program.

6. A similarity measurement method for measuring the similarity between two texts each composed of a plurality of files, comprising: extracting pairs of files having isomorphic patterns by examining isomorphic patterns between the files composing the two texts; finding sets of lines that can be put into correspondence between the extracted pairs of files; and obtaining the ratio of the sum of the numbers of lines that can be put into correspondence in each text to the sum of the numbers of lines of the two texts.

7. A recording medium recording a program for causing a computer system to construct the similarity measurement system according to claim 1, which measures the similarity between two texts composed of a plurality of files.

8. A program for causing a computer system to build the similarity measurement system according to claim 1, which measures the similarity between two texts composed of a plurality of files.