JP5342407B2 - Program analysis method, program analysis program, and program analysis apparatus

Info

Publication number: JP5342407B2
Application number: JP2009250678A
Authority: JP
Inventors: 真一横溝; 隆之武澤; 昇佐藤; 航巻口
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-10-30
Filing date: 2009-10-30
Publication date: 2013-11-13
Anticipated expiration: 2029-10-30
Also published as: JP2011096082A

Description

The present invention relates to a program analysis method, a program analysis program, and a program analysis apparatus.

In a large-scale program system composed of a large amount of program code, a new program system is often developed by reusing a proven, highly reliable program system and making minor modifications, additions, and corrections to its program code, in order to accommodate an increase or decrease in the equipment to be controlled or to satisfy user requirements. Under this development approach, a proven program system A is reused to newly develop a program system B. In such a case, the two program systems are large-scale clones of each other.

Even within a single large-scale program system A, similar program code often appears many times. Program code that recurs within one program system in this way is also called a clone. Such clones arise from copying and pasting program code. A large-scale program system developed in this manner contains many pieces of similar program code (clones), which significantly degrades its understandability and readability.
For example, when hardware running a dedicated program has become obsolete and is replaced with new hardware, the installed program is obsolete as well, so a program for the new architecture must be developed. In such cases, a new program is often developed by reusing the old program, but bugs may arise from, for example, differences in how global variables are handled.

In program maintenance, when a defect is found and corrected at some location in the program code, it is necessary to examine whether the same correction is also needed at locations similar to it. In many cases, however, such program code has been modified to a greater or lesser extent, and a simple character string search or the like cannot comprehensively identify where, and how many, such locations exist in a large amount of program code.

Furthermore, since control systems that support much of the social infrastructure are required to have absolute quality, their hardware is often composed of devices with long service life, environmental resistance, and high reliability. Consequently, many of the program systems installed on such hardware have been maintained and modified for more than 10 years. In recent years, however, it has become difficult to procure replacement parts such as semiconductors and hard disks, so even program systems that have been running on special hardware for more than 10 years are increasingly required to be migrated to open-system hardware. In this case as well, similar code that appears many times in the program code hinders development efficiency and reliability for the reasons described above.
In addition, when a program system is migrated to another system, there is no guarantee that the programs are updated in the same order in which the legacy system was developed, so it may become unclear which program code is the clone of which.
In such a case, it is not sufficient to perform clone analysis only on the program code constituting the updated system. Unless the clone relationships are clarified together with the development background, history, and evolution of the past program systems, development efficiency and reliability are again hindered.

CCFinder has been disclosed as software for analyzing such clones in existing program code (see, for example, Non-Patent Document 1). CCFinder performs lexical analysis on program code, converts it into a token sequence, replaces tokens with specific character strings, and then performs matching. It can therefore detect similarity between pieces of program code even when variable names or the names of called functions (subroutines) have been changed.
Patent Document 1 discloses a similarity evaluation program, a similarity evaluation apparatus, and a similarity evaluation method that standardize program code, create a token sequence for each piece of program code, create a correlation matrix from the tokens, calculate texture quantities by regarding the correlation matrix as a binarized image, and calculate the similarity between pieces of program code by calculating the distances between the reference vectors corresponding to the correlation matrices.

Patent Document 1: JP 2008-46695 A

Non-Patent Document 1: Toshihiro Kamiya, CCFinder homepage, [online], [retrieved October 2, 2009], Internet <URL:http://www.ccfinder.net/ccfinderx-j.html>

The program code clone detection tool (CCFinder) described in Non-Patent Document 1 and the technique described in Patent Document 1 both perform lexical analysis and analyze program code clones by computation according to predefined rules. Although clones can be extracted by this approach, a large-scale program system yields an enormous number of clones. It is then often difficult to pick out the clone information that is actually needed, for example when functions should be classified as different because they use different variables, and these techniques therefore do not fully satisfy the requirements.
Furthermore, the program code clone detection tool described in Non-Patent Document 1 lexically analyzes a program and directly compares the similarity of program syntax. As a result, it identifies code as a clone even when the code should be judged to have a different meaning in terms of processing because it references a different memory area. That is, even when variables are written with the same character string, if one is a global variable and the other is a local variable, the functions using these variables have essentially different behavior, yet they are regarded as the same function.
This problem is not solved by Patent Document 1 either.
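To make this limitation concrete, the following minimal Python sketch mimics a purely lexical comparison. It is not CCFinder's actual algorithm: the tokenizer, the `$id` placeholder, and the two C fragments are hypothetical illustrations. Because every identifier is masked with the same placeholder, a fragment that updates a global variable and one that updates a local variable become indistinguishable.

```python
import re

# Simplified stand-in for a lexical clone detector (an assumption for
# illustration; this is not CCFinder's real transformation rule set).
C_KEYWORDS = {"int", "void", "return", "if", "else", "for", "while"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def normalize(fragment):
    """Split a C fragment into tokens and mask every identifier as "$id"."""
    tokens = TOKEN_RE.findall(fragment)
    return [
        "$id" if (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS else tok
        for tok in tokens
    ]


# Hypothetical fragments: g_total is assumed to be a global variable,
# total is assumed to be a local variable of the enclosing function.
fragment_global = "g_total = g_total + v;"
fragment_local = "total = total + v;"

# Both fragments reduce to the same token sequence, so a purely lexical
# comparison reports them as clones even though they touch different memory.
same = normalize(fragment_global) == normalize(fragment_local)
print(same)  # True
```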

The present invention has been made in view of this background, and an object of the present invention is to accurately display the lineage of clones.

To solve the above problem, the present invention provides a program analysis method performed by a program analysis apparatus that derives and displays, among a plurality of pieces of program code, a parent-child relationship in which the program code that is the source of a modification is the parent and the program code resulting from the modification, made by adding or correcting tokens, is the child. The program analysis apparatus stores, in a storage unit, a function dictionary in which functions that differ in description form but can be regarded as the same function are registered, and a memory access dictionary in which the kinds of variables used by the functions are registered. The program analysis apparatus:
(a1) performs lexical analysis that decomposes the plurality of pieces of program code to be analyzed into tokens;
(a2) for each piece of program code, refers to a template, the result of the lexical analysis of that program code, and the function dictionary, compares the tokens with those registered in the template, and determines that functions that differ in description form but can be regarded as the same function according to the function dictionary are the same function, the template being generated based on the result of the lexical analysis of a predetermined piece of the program code, being stored in the storage unit, and having the tokens of the predetermined program code registered therein;
(a3) for each piece of program code, refers to the template, the result of the lexical analysis of that program code, and the memory access dictionary, compares the functions with those registered in the template, and determines that functions that have the same form but use different kinds of variables are different functions, thereby narrowing down the functions regarded as the same;
(a4) treats tokens relating to functions determined to be different functions as different tokens and, for two of the plurality of pieces of program code, performs clone analysis in which the appearance positions of the tokens of one piece of program code are taken on the vertical axis, the appearance positions of the tokens of the other are taken on the horizontal axis, and, when the same token appears in both, data is generated in which a point is plotted at (x, y), where x is the horizontal-axis coordinate and y is the vertical-axis coordinate corresponding to the appearance positions of that token;
(a5) derives a parent-child relationship between the template and the program code based on the result of the clone analysis;
(a6) performs the processing of (a4) and (a5) for each pair of the template and the program code being analyzed; and
(a7) causes a display unit to display a system diagram in which the templates and pieces of program code for which a parent-child relationship has been derived are connected by lines.
Other solutions are described in the embodiments.
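The plotting rule of step (a4) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `clone_plot_points`, the 0-based positions, and the `name#kind` token spelling (used here to show that the same function name with a different kind of variable counts as a different token) are all assumptions.

```python
def clone_plot_points(vertical_tokens, horizontal_tokens):
    """Return (x, y) points where the same token appears in both sequences.

    x is the token position in horizontal_tokens and y is the token position
    in vertical_tokens (0-based positions are an assumption of this sketch).
    """
    points = []
    for y, v_tok in enumerate(vertical_tokens):
        for x, h_tok in enumerate(horizontal_tokens):
            if v_tok == h_tok:
                points.append((x, y))
    return points


# Hypothetical token sequences; "read#global" and "read#local" stand for the
# same function name tagged with different variable kinds, which step (a4)
# treats as different tokens.
template_tokens = ["if", "read#global", "write"]
code_tokens = ["if", "read#local", "log", "write"]

points = clone_plot_points(template_tokens, code_tokens)
print(points)  # [(0, 0), (3, 2)]
```

In dot plots of this kind, runs of points along a diagonal generally indicate token sequences shared by the two pieces of program code, while gaps or offsets indicate inserted or changed tokens.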

According to the present invention, the lineage of clones can be displayed accurately.

FIG. 1 is a functional block diagram showing a configuration example of the program analysis apparatus according to the present embodiment.
FIG. 2 is a diagram showing an overview of the program analysis processing according to the present embodiment.
FIG. 3 is a flowchart showing the procedure of the program analysis processing according to the present embodiment.
FIG. 4 is a diagram for explaining the lexical analysis processing in step S102.
FIG. 5 is a diagram for explaining the template creation processing in step S104.
FIG. 6 is a diagram for explaining the function dictionary creation processing in step S105.
FIG. 7 is a diagram for explaining the memory access dictionary creation processing in step S106.
FIG. 8 is a diagram for explaining the similarity calculation processing in step S107.
FIG. 9 is a diagram for explaining the classification result creation processing in step S110 (part 1).
FIG. 10 is a diagram for explaining the classification result creation processing in step S110 (part 1).
FIG. 11 is a clone analysis diagram for explaining the clone analysis processing in step S111.
FIG. 12 is a diagram for explaining the parent-child relationship derivation portion of the parent-child relationship derivation and derivation rate calculation processing in step S112.
FIG. 13 is an example of a clone analysis diagram displayed on the result display unit after the clone analysis processing is completed.
FIG. 14 is a diagram showing an example of a system diagram for explaining the system diagram creation processing in step S114.
FIG. 15 is a diagram showing specific examples of a template, a function dictionary, and a memory access dictionary.
FIG. 16 is a diagram showing specific examples of a template and program code.
FIG. 17 is a diagram showing the similarity results, with respect to template a, of each piece of program code constituting program code group α.
FIG. 18 is a diagram showing a system diagram of program code groups α and β (part 1).
FIG. 19 is a diagram showing a system diagram of program code groups α and β (part 2).

Next, modes for carrying out the present invention (referred to as "embodiments") will be described in detail with reference to the drawings as appropriate. In the following description, a program code group is a program that functions as a collection of a plurality of pieces of program code. For example, if one program is regarded as a program code group, its program code corresponds to the main function and sub-functions. Alternatively, a process may constitute a piece of program code.
In the drawings of the present embodiment, like components are denoted by the same reference numerals, and redundant description is omitted.

《Device configuration》
FIG. 1 is a functional block diagram showing a configuration example of the program analysis apparatus according to the present embodiment.
The program analysis apparatus 1 includes a control unit 100 that controls the operation of the apparatus, a storage unit 200 that stores information, a result display unit (display unit) 300 such as a display, and a user interface unit 400 such as a keyboard and a mouse.
The control unit 100 includes an analysis target program code group registration unit 101, a lexical analysis unit 102, a template creation unit 103, a function dictionary creation unit 104, a memory access dictionary creation unit 105, a similarity calculation unit 106, a similarity result creation unit 107, a classification result creation unit 108, a clone analysis unit 109, and a system diagram creation unit 110.

The analysis target program code group registration unit 101 registers, via the user interface unit 400, a program code group 201 consisting of a plurality of pieces of program code to be analyzed.
The lexical analysis unit 102 performs lexical analysis processing that decomposes the registered program code into tokens. A token is the smallest unit of the words and symbols, such as a function, that make up program code.
The template creation unit 103 creates the template 202 based on the result of lexical analysis (lexical analysis result 205), information input from the user interface unit 400, and the like. Details of the template 202 are described later.
The function dictionary creation unit 104 creates the function dictionary (token dictionary) 203, in which variant forms of functions and the like are registered, based on the template 202, the lexical analysis result 205, information input from the user interface unit 400, and the like. Details of the function dictionary 203 are described later.
The memory access dictionary creation unit 105 creates the memory access dictionary 204, in which the kinds of variables referenced by functions are registered, based on the template 202, information input from the user interface unit 400, and the like. Details of the memory access dictionary 204 are described later.
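As a rough illustration of how the two dictionaries might be consulted, the Python sketch below resolves variant function spellings with a function dictionary and then checks the kind of variable with a memory access dictionary. The dictionary layouts, the function names, and the `"global"` / `"local"` labels are assumptions made for this example; the document describes the actual contents of the function dictionary 203 and the memory access dictionary 204 later.

```python
# Hypothetical function dictionary: variant spelling -> canonical function name.
FUNCTION_DICTIONARY = {
    "read_data": "read_data",
    "ReadData": "read_data",
    "readData": "read_data",
}

# Hypothetical memory access dictionary: canonical function name -> kind of
# variable that the template's version of the function refers to.
MEMORY_ACCESS_DICTIONARY = {
    "read_data": "global",
}


def canonical_name(function_name):
    """Map a variant spelling to its canonical name (unknown names map to themselves)."""
    return FUNCTION_DICTIONARY.get(function_name, function_name)


def is_same_function(template_function, code_function, code_variable_kind):
    """Decide whether a function in program code matches a template function."""
    canonical = canonical_name(template_function)
    if canonical_name(code_function) != canonical:
        return False  # different function even after resolving variants
    expected_kind = MEMORY_ACCESS_DICTIONARY.get(canonical)
    # Same form but a different kind of variable counts as a different function.
    return expected_kind is None or expected_kind == code_variable_kind


print(is_same_function("read_data", "ReadData", "global"))  # True
print(is_same_function("read_data", "ReadData", "local"))   # False
```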

The similarity calculation unit 106 calculates a similarity by comparing the structure of the program code in the template 202 with the structure of the registered program code.
The similarity result creation unit 107 causes the result display unit 300 to display the result output by the similarity calculation unit 106.
The classification result creation unit 108 causes the result display unit 300 to display, as a classification result, the similarity results created for each template 202.
The clone analysis unit 109 performs clone analysis based on the lexical analysis result 205, the function dictionary 203, and the memory access dictionary 204, and causes the result display unit 300 to display the result of the clone analysis.
The system diagram creation unit 110 derives parent-child relationships between pieces of program code based on the result of the clone analysis and causes the result display unit 300 to display these relationships as a system diagram.
The control unit 100 and the units 101 to 110 are realized when a program analysis program stored in a ROM (Read Only Memory) or HD (Hard Disk), not shown, is loaded into a RAM (Random Access Memory) and executed by a CPU (Central Processing Unit).

The storage unit 200 stores the registered program code group 201, the lexical analysis result 205, the template 202, the function dictionary 203, the memory access dictionary 204, and a token registration determination table 206. As noted above, the lexical analysis result 205, the template 202, the function dictionary 203, the memory access dictionary 204, and the token registration determination table 206 are described later.

《Overview of processing》
FIG. 2 is a diagram showing an overview of the program analysis processing according to the present embodiment.
Assume that program codes a1, a2, a3, and a4 are registered as a program code group α and that program codes b1, b2, and b3 are registered as a program code group β.
First, the program analysis apparatus 1 calculates the similarity between pieces of program code within each program code group 201 using the template 202, the function dictionary 203, the memory access dictionary 204, and the like. That is, it calculates the similarity among the program codes a1, a2, a3, and a4 and the similarity among the program codes b1, b2, and b3 (S1).
Next, the program analysis apparatus 1 performs clone analysis processing (S2) and, based on the resulting clone analysis diagram, calculates a derivation rate and derives the parent-child relationships between pieces of program code (S3).
The program analysis apparatus 1 then displays the parent-child relationships between pieces of program code on the result display unit 300 as a system diagram (S4). In a parent-child relationship, the original program code is the parent, and program code obtained by editing the parent program code, such as by adding or correcting tokens, is the child.
Note that "a" in the system diagram of FIG. 2 is the template 202.

《Processing procedure》
Next, the procedure of the program analysis processing according to the present embodiment is described with reference to FIGS. 3 to 19, while also referring to FIG. 1.
First, the overall flow of the procedure is described using the flowchart of FIG. 3, and then the details of each process are described with reference to FIGS. 4 to 19.
FIG. 3 is a flowchart showing the procedure of the program analysis processing according to the present embodiment.
First, a plurality of pieces of program code are input to the storage unit 200 via the user interface unit 400. Then, analysis target program code registration processing is performed, in which the program code to be analyzed is selected and registered, via the user interface unit 400, from the program code input to the storage unit 200 (S101).
Next, the lexical analysis unit 102 selects one of the pieces of program code selected in step S101 and performs lexical analysis processing that removes information not directly relevant, such as comments, from this program code and divides it into tokens (S102).
The control unit 100 then determines whether lexical analysis processing has been completed for all the program code selected in step S101 (S103).
If the result of step S103 is that lexical analysis processing has not been completed for all the program code (S103 → No), the control unit 100 returns to step S102, and lexical analysis processing is performed on program code that has not yet been analyzed.

If the result of step S103 is that lexical analysis processing has been completed for all the program code (S103 → Yes), the template creation unit 103 performs template creation processing that creates the template 202, in which tokens are weighted, based on the lexical analysis result 205 and information input via the user interface unit 400 (S104).
The function dictionary creation unit 104 then performs function dictionary creation processing that creates the function dictionary 203 based on the lexical analysis result 205, the template 202, information input via the user interface unit 400, and the like (S105).
Next, the memory access dictionary creation unit 105 performs memory access dictionary creation processing that creates the memory access dictionary 204 based on the template 202, information input via the user interface unit 400, and the like (S106).
A plurality of templates 202 of different types may be created. In that case, the processing of steps S104 to S106 is repeated.

The similarity calculation unit 106 then performs similarity calculation processing that calculates a similarity based on the template 202 and the token registration determination table 206 created from the template 202 (S107).
Next, the control unit 100 determines whether similarity calculation processing has been completed for all the program code with respect to the template 202 (S108).
If the result of step S108 is that similarity calculation processing has not been completed for all the program code (S108 → No), the control unit 100 returns the processing to step S107, and the similarity calculation unit 106 calculates the similarity for program code whose similarity has not yet been calculated.

If the result of step S108 is that similarity calculation processing has been completed for all the program code (S108 → Yes), the similarity result creation unit 107 performs similarity result display processing that causes the result display unit 300 to display the result of the similarity calculation processing (S109).
Next, the classification result creation unit 108 performs classification result creation processing that excludes program code whose similarity is equal to or less than a threshold from further processing and classifies the remaining program code by template 202 (S110).
The clone analysis unit 109 then performs clone analysis processing, in which it performs clone analysis on each combination of the template 202 and program code, based on the template 202, the lexical analysis result 205, the function dictionary 203, and the memory access dictionary 204, and displays the result on the result display unit 300 (S111).
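The classification step (S110) can be pictured with the short Python sketch below. The data layout, the function name `classify_by_template`, the 0 to 100 similarity scale, the threshold value, and the second template "b" are assumptions made for illustration only.

```python
def classify_by_template(similarity_by_template, threshold):
    """Group program codes under each template, dropping low-similarity ones.

    similarity_by_template maps a template name to {program code name: similarity}.
    Codes whose similarity is less than or equal to the threshold are excluded.
    """
    return {
        template: sorted(
            code for code, similarity in scores.items() if similarity > threshold
        )
        for template, scores in similarity_by_template.items()
    }


# Hypothetical similarity values on a 0-100 scale.
similarities = {
    "a": {"a1": 95, "a2": 80, "b1": 20},
    "b": {"a1": 15, "b1": 90, "b2": 60},
}

classified = classify_by_template(similarities, threshold=60)
print(classified)  # {'a': ['a1', 'a2'], 'b': ['b1']}
```

Note that b2, whose similarity equals the threshold, is excluded, matching the "equal to or less than" wording of step S110.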

Next, the system diagram creation unit 110 performs parent-child relationship derivation and derivation rate calculation processing, which derives parent-child relationships between templates and program code and calculates derivation rates based on the clone analysis result (S112).
The control unit 100 then determines whether the parent-child relationship derivation and derivation rate calculation processing has been completed for all combinations of program code (S113).
If the result of step S113 is that this processing has not been completed for all combinations (S113 → No), the control unit 100 returns to step S112, and the system diagram creation unit 110 performs the processing for combinations of program code whose derivation rate has not yet been calculated.
If the result of step S113 is that this processing has been completed for all combinations (S113 → Yes), the system diagram creation unit 110 performs system diagram creation processing that creates a system diagram showing the parent-child relationships between pieces of program code and causes the result display unit 300 to display it (S114).

（Ｓ１０２：字句解析処理）
以下、図１および図３を参照しつつ、図４〜図１４に沿って主な処理の詳細な説明を行う。
図４は、ステップＳ１０２における字句解析処理を説明するための図である。
ここで、字句解析とは、プログラムコードから、コメント部のような直接的には類似度の比較に必要のない情報を除去し、さらにプログラムコードをトークン単位で分割する処理をいう。まず、字句解析部１０２が、ステップＳ１０１で選択・登録されたプログラムコードを読み込むと、以下のような字句解析処理を行う。
例えば、プログラムコードがＣ言語で記述されている場合、読み込んだプログラムコード内の「／／」に続く文字列や「／＊」と「＊／」で囲まれたコメント文を削除する。さらにコメントを削除したプログラムコードをトークン毎に分割する処理を行う。
上記手順により字句解析されるプログラムコード及び字句解析されたトークン列の具体例が図４に示されている。 (S102: Lexical analysis processing)
Hereinafter, the main processing is described in detail along FIGS. 4 to 14, with reference to FIGS. 1 and 3 as appropriate.
FIG. 4 is a diagram for explaining the lexical analysis processing in step S102.
Here, the lexical analysis refers to a process of removing information that is not necessary for directly comparing the similarity, such as a comment part, from the program code, and further dividing the program code in units of tokens. First, when the lexical analyzer 102 reads the program code selected / registered in step S101, the lexical analyzer performs the following lexical analysis process.
For example, when the program code is written in C, the character string following “//” and the comments enclosed by “/*” and “*/” are deleted from the read program code. The program code from which the comments have been deleted is then divided into tokens.
FIG. 4 shows a specific example of the program code and the token string subjected to lexical analysis according to the above procedure.

プログラムコード４０１を字句解析した結果が字句解析されたプログラムコード４０２であり、プログラムコード４１１を字句解析した結果が字句解析されたプログラムコード４１２である。一見して相互の類似度が低く見えるようなプログラムコード間であっても、このように、プログラムコードを字句解析することにより、それらの類似度を的確に算出することが可能となる。
なお、字句解析処理は公知の技術である。 The result of lexically analyzing program code 401 is the lexically analyzed program code 402, and the result of lexically analyzing program code 411 is the lexically analyzed program code 412. Even for program codes whose mutual similarity appears low at first glance, lexically analyzing them in this way makes it possible to calculate their similarity accurately.
The lexical analysis process is a known technique.
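
The comment removal and token splitting described above can be sketched as follows. This is a minimal illustration in Python, not the lexical analysis unit 102 itself: the regular expressions, the function name `lex`, and the two sample snippets are assumptions chosen for the example, and the comment pattern deliberately ignores string literals.

```python
import re

# Assumed patterns covering only a small subset of C.
# Note: the comment pattern does not protect "//" inside string literals.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||->|[^\s\w]")


def lex(source: str) -> list[str]:
    """Remove comments, then split the remaining text into tokens."""
    without_comments = COMMENT_RE.sub(" ", source)
    return TOKEN_RE.findall(without_comments)


# Two hypothetical snippets that differ only in layout and comments.
code_1 = "int a = 0; // counter\nif (a == 0) { a = 1; }"
code_2 = "int a=0;\n/* check */ if(a==0){a=1;}"

print(lex(code_1))
print(lex(code_1) == lex(code_2))
```

Because layout and comments are discarded, the two snippets yield the same token sequence, which is what allows the later steps to compare program codes that look different on the surface.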

（Ｓ１０４：テンプレート作成処理）
図５は、ステップＳ１０４におけるテンプレート作成処理を説明するための図である。
テンプレート２０２は、ステップＳ１０２で字句解析された結果が結果表示部３００に表示され、ユーザが表示された字句解析結果２０５（図１）を参照して、ユーザインタフェース４００を介して情報を入力することにより作成される。
例えば、図５に示すプログラムコード５０１からテンプレート２０２が作成される。
図５に示すように、テンプレート２０２には、プログラムコードで処理順に記載される関数、システムコール等を用いてプログラムコードの構造がトークン毎に記述されている。例えば、プログラムコード５０１がＣ言語であるなら、「ｉｆ」文、「ｅｌｓｅ」文、「ｆｏｒ」文、共通関数、システムコールなどのトークンがテンプレート２０２に登録される。
ここで、テンプレート２０２に登録されるトークンは、プログラムコードにおけるすべてのトークンではなく、プログラムコードに特徴的なトークンが登録されていればよい。 (S104: Template creation process)
FIG. 5 is a diagram for explaining the template creation processing in step S104.
The template 202 is created as follows: the result of the lexical analysis in step S102 is displayed on the result display unit 300, and the user, referring to the displayed lexical analysis result 205 (FIG. 1), inputs information via the user interface 400.
For example, the template 202 is created from the program code 501 shown in FIG. 5.
As shown in FIG. 5, the template 202 describes the structure of the program code token by token, using the functions, system calls, and the like that appear in the program code in processing order. For example, if the program code 501 is written in C, tokens such as the “if” statement, “else” statement, “for” statement, common functions, and system calls are registered in the template 202.
Here, the tokens registered in the template 202 need not be all the tokens in the program code; it is sufficient that tokens characteristic of the program code are registered.

さらに、図５に示すように、ユーザは、テンプレート２０２に登録した各々のトークンに対し重み付けを行うことが可能である。つまり、テンプレート２０２内で特に重要なトークンに対して強い重み付けをすることで、テンプレート２０２の特徴的な構造を定義することが可能である。
なお、例えば、プログラムコードＡを基に作成したテンプレート２０２、プログラムコードＢを基に作成したテンプレート２０２などというように、テンプレート２０２を複数作成することができる。例えば、プログラムコードＡを基に作成したテンプレート２０２を用いることで、プログラムコードＡに近いか否かを判定することができ、プログラムコードＢを基に作成したテンプレート２０２を用いることで、プログラムコードＢに近いか否かを判定することができる。 Furthermore, as shown in FIG. 5, the user can weight each token registered in the template 202. That is, it is possible to define a characteristic structure of the template 202 by applying strong weighting to a particularly important token in the template 202.
A plurality of templates 202 can be created, for example, a template 202 created based on program code A and a template 202 created based on program code B. Using the template 202 created based on program code A makes it possible to determine whether a program code is close to program code A, and using the template 202 created based on program code B makes it possible to determine whether a program code is close to program code B.

図６で後記する関数辞書２０３を作成することで、複数の異なる関数を共通の定義として扱うことが可能である。つまり、関数辞書２０３で同種の関数であると定義されている関数（またはトークン）は同種の関数であるとして、類似度を上げることができる。また、代入文やメモリのコピーを行っているトークンに関しては、図７で後記するメモリアクセス辞書２０４でコピー先の変数の属性を登録することができ、コピー先がグローバル変数、または、ローカル変数であることを定義することが可能である。 By creating the function dictionary 203 described later with reference to FIG. 6, a plurality of different functions can be handled under a common definition. In other words, functions (or tokens) defined as the same kind in the function dictionary 203 are treated as the same kind of function, which raises the similarity. In addition, for tokens that perform an assignment or a memory copy, the attribute of the copy-destination variable can be registered in the memory access dictionary 204 described later with reference to FIG. 7, so that the copy destination can be defined as a global variable or a local variable.

（Ｓ１０５：関数辞書作成処理）
図６は、ステップＳ１０５における関数辞書作成処理を説明するための図である。
ユーザは、テンプレート２０２と、結果表示部３００に表示されている字句解析結果２０５（図１）を参照して、ユーザインタフェース４００を介して情報を入力することにより関数辞書２０３を作成する。
図６に示すように、例えば、ほぼ同義の共通関数（つまり、コピーなどを利用して作成された関数）「ｓｕｂ＿Ａ１（）」および「ｓｕｂ＿Ａ２（）」をテンプレート２０２における「ｆｕｎｃＡ（）」と同グループの関数として関数辞書２０３を定義する。
このようにすることで、後記する類似度算出時に、コールしている共通関数のみ異なるプログラムコードを類似していることを判別することができる。つまり、異なる関数であっても、関数辞書２０３で同グループの関数として登録されている関数は同じ関数とすることができる。
また、関数辞書２０３に登録されている関数は、テンプレート２０２に登録されている関数でもよい。 (S105: Function dictionary creation process)
FIG. 6 is a diagram for explaining the function dictionary creation processing in step S105.
The user creates a function dictionary 203 by inputting information through the user interface 400 with reference to the template 202 and the lexical analysis result 205 (FIG. 1) displayed on the result display unit 300.
As shown in FIG. 6, for example, the function dictionary 203 defines the nearly synonymous common functions “sub_A1()” and “sub_A2()” (that is, functions created by copying or the like) as functions in the same group as “funcA()” in the template 202.
By doing so, program codes that differ only in the common function they call can be determined to be similar when the similarity described later is calculated. That is, even different functions are treated as the same function if they are registered in the function dictionary 203 as functions of the same group.
The function registered in the function dictionary 203 may be a function registered in the template 202.
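
One way to picture the function dictionary 203 is as a mapping from each concrete function name to the name of its group. The sketch below is only an illustration under that assumption; the data layout and the helper names `canonical` and `same_function` are hypothetical, while the function names come from the FIG. 6 example.

```python
# Hypothetical representation: concrete function name -> group name.
FUNCTION_DICTIONARY = {
    "funcA": "funcA",
    "sub_A1": "funcA",
    "sub_A2": "funcA",
}


def canonical(token: str) -> str:
    """Return the group name for a registered function, else the token itself."""
    return FUNCTION_DICTIONARY.get(token, token)


def same_function(token_a: str, token_b: str) -> bool:
    """Two tokens count as the same function when their groups coincide."""
    return canonical(token_a) == canonical(token_b)


print(same_function("sub_A1", "sub_A2"))
print(same_function("sub_A1", "memcpy"))
```

With this mapping, a program code that calls `sub_A2()` where the template lists `funcA()` still matches the template token.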

（Ｓ１０６：メモリアクセス辞書作成処理）
図７は、ステップＳ１０６におけるメモリアクセス辞書作成処理を説明するための図である。
図７に示すように、ユーザは、テンプレート２０２に定義したトークンで使用している変数毎に、その変数がグローバル変数、ローカル変数、または複数のグローバル変数である複数グローバル変数群であるか否かといった変数の種類を、ユーザインタフェース４００を介して入力することによりメモリアクセス辞書２０４を作成する。
このようにすることで、後記する類似度算出時に、同じ形式であるが参照している変数がグローバル変数と、ローカル変数であるなど、異なる種類の変数を参照している関数を異なる関数として判別することができる。
なお、メモリアクセス辞書２０４は、結果表示部３００に表示されているテンプレート２０２や、字句解析結果２０５を参照しながら、ユーザによって作成されるものである。 (S106: Memory access dictionary creation process)
FIG. 7 is a diagram for explaining the memory access dictionary creation processing in step S106.
As shown in FIG. 7, for each variable used by the tokens defined in the template 202, the user inputs, via the user interface 400, the kind of the variable, that is, whether it is a global variable, a local variable, or a multiple-global-variable group consisting of a plurality of global variables. The memory access dictionary 204 is created from this input.
By doing so, when the similarity described later is calculated, functions that have the same form but refer to different kinds of variables (for example, one refers to a global variable and the other to a local variable) can be distinguished as different functions.
The memory access dictionary 204 is created by the user with reference to the template 202 displayed on the result display unit 300 and the lexical analysis result 205.
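
The effect of the memory access dictionary 204 can be sketched by attaching the kind of the destination variable to each token before comparison. The layout below is an assumption made for illustration: `pData` appears in the FIG. 15 example as a global variable, whereas `tmp`, `gBuf`, and the helper `access_key` are hypothetical.

```python
# Hypothetical representation: variable name -> kind of variable.
MEMORY_ACCESS_DICTIONARY = {
    "pData": "global",
    "gBuf": "global",
    "tmp": "local",
}


def access_key(function: str, destination: str) -> tuple[str, str]:
    """Pair a function name with the kind of variable it writes to."""
    kind = MEMORY_ACCESS_DICTIONARY.get(destination, "unknown")
    return (function, kind)


# Same form (memcpy), different kind of destination: treated as different.
print(access_key("memcpy", "pData") == access_key("memcpy", "tmp"))
# Same form, same kind of destination: treated as the same.
print(access_key("memcpy", "pData") == access_key("memcpy", "gBuf"))
```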

テンプレート２０２、関数辞書２０３、メモリアクセス辞書２０４の記述内容は任意であるが、抽出したいクローンコードの特徴的な構造を定義することが望ましい。また、これらは一度作成した後、記憶部２００に保存することが可能であり、何度でも読み込み・編集することが可能である。
また、テンプレート２０２、関数辞書２０３、メモリアクセス辞書２０４は、図２の処理が始まる前に予め作成されていてもよい。 Although the description contents of the template 202, the function dictionary 203, and the memory access dictionary 204 are arbitrary, it is desirable to define the characteristic structure of the clone code to be extracted. These can be created once and then stored in the storage unit 200, and can be read and edited any number of times.
Further, the template 202, the function dictionary 203, and the memory access dictionary 204 may be created in advance, before the processing of FIG. 2 starts.

（Ｓ１０７：類似度算出処理）
図８は、ステップＳ１０７における類似度算出処理を説明するための図である。
まず、類似度算出部１０６は、図８に示すようなトークン登録判別表２０６をプログラムコード毎に作成する。
まず、類似度算出部１０６は、ステップＳ１０４で作成したテンプレート２０２をコピーし、ステップＳ１０２で作成した字句解析結果２０５と比較することにより、処理対象となっているプログラムコードにおいて、テンプレート２０２に登録されているトークンがあるか否かを判定し、あれば該当する登録判別の欄に「○」を登録し、なければ「×」を登録する。
このとき、類似度算出部１０６は、関数辞書２０３や、メモリアクセス辞書２０４を参照して、異なる形式のトークンがあっても関数辞書２０３に登録されていれば、トークン登録判別表２０６の登録判別欄に「○」を登録し、同じ形式のトークンがあってもメモリアクセス辞書２０４において異なる変数を参照している場合はトークン登録判別表２０６の登録判別欄に「×」を登録する。 (S107: similarity calculation processing)
FIG. 8 is a diagram for explaining the similarity calculation processing in step S107.
First, the similarity calculation unit 106 creates a token registration determination table 206 as shown in FIG. 8 for each program code.
Specifically, the similarity calculation unit 106 copies the template 202 created in step S104 and compares it with the lexical analysis result 205 created in step S102 to determine, for each token registered in the template 202, whether that token exists in the program code being processed. If the token exists, “◯” is registered in the corresponding registration determination column; if not, “X” is registered.
At this time, the similarity calculation unit 106 refers to the function dictionary 203 and the memory access dictionary 204. Even if a token has a different form, “◯” is registered in the registration determination column of the token registration determination table 206 as long as the token is registered in the function dictionary 203; conversely, even if a token has the same form, “X” is registered when the memory access dictionary 204 shows that it refers to a different variable.

このトークン登録判別表２０６は、プログラムコード毎に作成される。例えば、テンプレートＡとテンプレートＢがあり、プログラムコードＡ〜プログラムコードＤがあれば、テンプレートＡを基にしたプログラムコードＡ〜プログラムコードＤのトークン登録判別表２０６を作成し、これとは別にテンプレートＢを基にしたプログラムコードＡ〜プログラムコードＤのトークン登録判別表２０６を作成する。 This token registration determination table 206 is created for each program code. For example, if there are templates A and B and program codes A to D, token registration determination tables 206 for program codes A to D are created based on template A, and, separately, token registration determination tables 206 for program codes A to D are created based on template B.

次に、類似度算出部１０６は、各プログラムコードのトークン登録判別表２０６において「○」が登録されているトークンの重みを加算し、加算した結果を、すべての重みを加算した値（最大スコア）で除算し、さらに１００を乗算することにより類似度を算出する。
すなわち、類似度算出部１０６は以下の式（１）より類似度を算出する。 Next, the similarity calculation unit 106 adds up the weights of the tokens for which “◯” is registered in the token registration determination table 206 of each program code, divides the sum by the sum of all weights (the maximum score), and multiplies the result by 100 to calculate the similarity.
That is, the similarity calculation unit 106 calculates the similarity from the following equation (1).

類似度（％）＝プログラムコードスコア÷最大スコア×１００・・・（１） Similarity (%) = program code score / maximum score × 100 (1)

例えば、図８に示すプログラムコードＤのトークン登録判別表２０６から求められる類似度は（１＋２＋１）÷（１＋１＋２＋１＋１）×１００＝６６（％）となる。
類似度算出部１０６は、類似度を作成されているテンプレート２０２に対するすべてのプログラムコードについて算出する。 For example, the similarity obtained from the token registration determination table 206 of the program code D shown in FIG. 8 is (1 + 2 + 1) ÷ (1 + 1 + 2 + 1 + 1) × 100 = 66 (%).
The similarity calculation unit 106 calculates the similarity of every program code with respect to each template 202 that has been created.
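
Equation (1) can be checked against the program code D example with a few lines of Python. This is a sketch under stated assumptions: the function name `similarity` is hypothetical, the positions of the matched tokens in `registered` are illustrative (the text gives only the matched weights 1, 2, and 1), and the result is truncated to an integer, which is consistent with the value 66 given above.

```python
def similarity(weights: list[int], registered: list[bool]) -> int:
    """Equation (1): matched weight / maximum score x 100, truncated."""
    max_score = sum(weights)
    score = sum(w for w, ok in zip(weights, registered) if ok)
    return 100 * score // max_score


# Template weights from the example: 1 + 1 + 2 + 1 + 1 = 6 (maximum score).
weights = [1, 1, 2, 1, 1]
# Assumed positions of the tokens marked as registered (weights 1, 2, 1).
registered = [True, False, True, True, False]

print(similarity(weights, registered))  # 66
```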

なお、類似度算出の処理対象となるテンプレート２０２および閾値は、ユーザによって指定されてもよい。指定のタイミングは、類似度算出処理の前であればいつでもよい。 Note that the template 202 and the threshold value that are the processing target of similarity calculation may be designated by the user. The designated timing may be any time before the similarity calculation process.

（Ｓ１１０：分類結果作成処理）
図９および図１０は、ステップＳ１１０における分類結果作成処理を説明するための図である。
分類結果作成部１０８は、予め設定してある閾値以上の類似度を有するプログラムコードをグループとして登録する。
図９の例では、閾値を「３０」と設定しているため、類似度が「２０（％）」のプログラムコードＣはグループから除外されている。 (S110: Classification result creation process)
FIGS. 9 and 10 are diagrams for explaining the classification result creation processing in step S110.
The classification result creation unit 108 registers program codes having a similarity equal to or higher than a preset threshold as a group.
In the example of FIG. 9, since the threshold is set to “30”, the program code C having a similarity of “20 (%)” is excluded from the group.

また、前記したようにテンプレート２０２は複数作成することも可能であるため、図１０に示すように異なるテンプレート２０２によるグループ化も可能である。
図１０の例では、テンプレートＡを基にグループ化されたプログラムコードＡ、Ｂ，Ｄと、テンプレートＢを基にグループ化されたプログラムコードＡ，Ｃ、Ｄとの２つのグループが作成されている。 In addition, since a plurality of templates 202 can be created as described above, grouping by different templates 202 as shown in FIG. 10 is also possible.
In the example of FIG. 10, two groups are created: program codes A, B, and D grouped based on template A, and program codes A, C, and D grouped based on template B.

分類結果作成部１０８は、図９や図１０に示すようなプログラムコードのグループを結果表示部３００に表示させる。 The classification result creation unit 108 causes the result display unit 300 to display the groups of program codes as shown in FIGS. 9 and 10.
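
The threshold-based grouping can be sketched as a simple filter. The function name and the similarity values for program codes A, B, and D below are assumptions for illustration; only the threshold of 30 and the similarity of 20 for program code C come from the FIG. 9 example.

```python
def group_by_threshold(similarities: dict[str, int], threshold: int) -> list[str]:
    """Keep the program codes whose similarity is at or above the threshold."""
    return [name for name, value in similarities.items() if value >= threshold]


# Similarities against one template; A, B, and D are hypothetical values.
similarities_to_template_a = {"A": 100, "B": 80, "C": 20, "D": 66}

print(group_by_threshold(similarities_to_template_a, threshold=30))
# ['A', 'B', 'D']
```

Running the same filter once per template gives one group per template, as in the FIG. 10 example.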

（Ｓ１１１：クローン解析処理）
図１１は、ステップＳ１１１におけるクローン解析処理を説明するためのクローン解析図である。
クローン解析部１０９が以下の手順でプログラムコードに対するクローン解析を行う。
まず、クローン解析部１０９は、図９に示すようなグループ化されたプログラムコードから、テンプレート２０２とプログラムコードの対、プログラムコードとプログラムコードの対を選択し、縦軸が一方のテンプレート２０２またはプログラムコードのトークン出現位置、横軸が他方のテンプレート２０２またはプログラムコードのトークン出現位置としたクローン解析図１１０１を作成する。
クローン解析部１０９は、比較対象となっているテンプレート２０２およびプログラムコードの字句解析結果２０５を基に、クローン解析図１１０１上において同じトークンが記述されている箇所にプロットする。
すなわち、クローン解析部１０９は、２つのプログラムコードについて、一方のプログラムコードのトークンの出現位置を縦軸にとり、他方のプログラムコードのトークンの出現位置を横軸にとり、絞り込んだ同一のトークンが前記２つのプログラムコードに出現するとき、前記同一のトークンの出現位置に対応する横軸の座標をｘ、前記同一のトークンの出現位置に対応する縦軸の座標をｙとし、（ｘ，ｙ）の点にプロットを行うことによりクローン解析図を作成する。
このとき、クローン解析部１０９は、形式が異なっても関数辞書２０３に登録されている関数や、メモリアクセス辞書２０４で同じ種類の変数を参照している関数を、同じトークンとしてクローン解析図１１０１にプロットする。逆に、クローン解析部１０９は、形式が同じでも、メモリアクセス辞書２０４において異なる変数を参照している関数は異なる関数とし、プロットを行わない。 (S111: Clone analysis process)
FIG. 11 is a clone analysis diagram for explaining the clone analysis processing in step S111.
The clone analysis unit 109 performs clone analysis on the program code in the following procedure.
First, from the grouped program codes as shown in FIG. 9, the clone analysis unit 109 selects a pair of the template 202 and a program code, or a pair of two program codes, and creates a clone analysis diagram 1101 whose vertical axis is the token appearance position in one template 202 or program code and whose horizontal axis is the token appearance position in the other template 202 or program code.
Based on the template 202 and the lexical analysis result 205 of the program codes being compared, the clone analysis unit 109 plots a point on the clone analysis diagram 1101 at each location where the same token is described.
That is, for two program codes, the clone analysis unit 109 takes the token appearance positions of one program code on the vertical axis and those of the other program code on the horizontal axis. When the same narrowed-down token appears in both program codes, with x being the horizontal-axis coordinate and y the vertical-axis coordinate corresponding to the appearance positions of that token, it plots the point (x, y), thereby creating the clone analysis diagram.
At this time, the clone analysis unit 109 plots as the same token on the clone analysis diagram 1101 functions that are registered in the function dictionary 203 even though their forms differ, and functions that refer to the same kind of variable according to the memory access dictionary 204. Conversely, even if the form is the same, the clone analysis unit 109 treats functions that refer to different variables according to the memory access dictionary 204 as different functions and does not plot them.

図１１に示すように、クローン解析図１１０１では、同じであると判定されたトークン列間のプロット部分のうち、長い直線となっている部分が各プログラムコード間の共通トークン部分である。この直線となっている部分をクローン片と称することとする。仮に、２つのプログラムコードが同一である場合は、原点から右下にかけて連続した直線がプロットされることとなる。
なお、クローン解析図１１０１において、水平方向および垂直方向に複数プロットされている箇所、つまり一方のトークン出現位置に対し、複数の他方のトークン出現位置が対応している箇所は、関数辞書２０３や、メモリアクセス辞書２０４によって一方のプログラムコードにおける１つのトークンに対し、他方のプログラムコードにおける複数のトークンが同一であると判定されたものである。 As shown in FIG. 11, in the clone analysis diagram 1101, among the plot portions between token strings determined to be the same, a portion that is a long straight line is a common token portion between program codes. This straight portion is referred to as a clone piece. If the two program codes are the same, a continuous straight line is plotted from the origin to the lower right.
In the clone analysis diagram 1101, locations where multiple points are plotted in the horizontal or vertical direction, that is, locations where multiple token appearance positions in one program code correspond to a single token appearance position in the other, arise because the function dictionary 203 or the memory access dictionary 204 has caused one token in one program code to be judged identical to multiple tokens in the other program code.
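
The plotting rule and the notion of a clone piece can be sketched as follows. This is an illustrative reading, not the clone analysis unit 109 itself: the function names, the token lists, and the minimum run length of 2 used to decide what counts as a clone piece are all assumptions, and the dictionary lookups are omitted so that tokens are compared by plain equality.

```python
def dot_plot(tokens_x: list[str], tokens_y: list[str]) -> set[tuple[int, int]]:
    """Plot (x, y) wherever the token at x on the horizontal axis equals the token at y."""
    return {
        (x, y)
        for x, token_x in enumerate(tokens_x)
        for y, token_y in enumerate(tokens_y)
        if token_x == token_y
    }


def clone_pieces(points: set[tuple[int, int]], min_length: int = 2):
    """Return diagonal runs as ((start_x, start_y), length) tuples."""
    pieces = []
    for x, y in sorted(points):
        if (x - 1, y - 1) in points:
            continue  # this point continues a run that started earlier
        length = 1
        while (x + length, y + length) in points:
            length += 1
        if length >= min_length:
            pieces.append(((x, y), length))
    return pieces


# Hypothetical token sequences: A1 is A with one token ("check") inserted.
code_a = ["memset", "if", "funcA", "memcpy"]           # vertical axis
code_a1 = ["memset", "if", "check", "funcA", "memcpy"]  # horizontal axis

points = dot_plot(code_a1, code_a)
print(clone_pieces(points))  # [((0, 0), 2), ((3, 2), 2)]
```

The second clone piece starts one position further along the horizontal axis than the first one ends, which is the kind of discontinuity discussed for step S112.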

（Ｓ１１２：親子関係導出処理）
図１２は、ステップＳ１１２の親子関係導出・派生率算出処理のうち、親子関係導出処理の部分を説明するための図である。
また、図１２では、プログラムコードＡの途中に条件文が挿入された場合、クローン解析図１２０１に不連続部分（破線矢印）が生じることを示している。このように、クローン解析図１２０１において、クローン片が不連続となっている場合、その行にはプログラムコード特有の処理が記述されているものと考えられる。
すなわち、元のプログラムコードをプログラムコードＡ、プログラムコードＡに条件文を追加挿入したものをプログラムコードＡ１とすると、図１２のクローン解析図１２０１において、条件文が追加挿入されたところで、クローン片が水平右方向にずれた状態で不連続となっている。このように、クローン片が水平右方向にずれていれば、横軸のトークン出現位置が示すプログラムコード（ここでは、プログラムコードＡ１）は、縦軸のトークン出現位置が示すプログラムコード（ここでは、プログラムコードＡ）にトークンが追加挿入されたものであることがわかる。
なお、逆にプログラムコードＡ１にトークンが追加挿入されたものがプログラムコードＡだとすると、クローン解析図では垂直下方向にクローン片がずれた状態の不連続部分が生じることとなる。 (S112: Parent-child relationship derivation process)
FIG. 12 is a diagram for explaining a part of the parent-child relationship derivation process in the parent-child relationship derivation / derivation rate calculation process in step S112.
FIG. 12 shows that when a conditional statement is inserted in the middle of program code A, a discontinuous portion (broken-line arrow) appears in the clone analysis diagram 1201. When the clone pieces are discontinuous in the clone analysis diagram 1201 in this way, the corresponding lines are considered to describe processing specific to that program code.
That is, let the original program code be program code A, and let program code A1 be program code A with a conditional statement additionally inserted. In the clone analysis diagram 1201 of FIG. 12, the clone piece becomes discontinuous, shifted horizontally to the right, at the point where the conditional statement was inserted. Thus, if the clone piece is shifted horizontally to the right, the program code indicated by the token appearance positions on the horizontal axis (here, program code A1) is the program code indicated by the token appearance positions on the vertical axis (here, program code A) with tokens additionally inserted.
On the other hand, if the program code A is obtained by additionally inserting a token into the program code A1, a discontinuous portion in which the clone pieces are shifted in the vertically downward direction is generated in the clone analysis diagram.

（Ｓ１１２：親子関係導出・派生率算出処理）
図１３は、クローン解析処理が終了した後、結果表示部に表示されるクローン解析図の例である。
図１３では、テンプレートＡ‐プログラムコードＡ（クローン解析図１３０１）、テンプレートＡ‐プログラムコードＢ（クローン解析図１３０２）、テンプレートＡ‐プログラムコードＤ（クローン解析図１３０３）、プログラムコードＡ‐プログラムコードＢ（クローン解析図１３０４）、プログラムコードＡ‐プログラムコードＤ（クローン解析図１３０５）、プログラムコードＢ‐プログラムコードＤ（クローン解析図１３０６）、・・・の組み合わせに対するクローン解析図が表示されている。ここで、プログラムコードＣを含んだ組み合わせがないのは、図９に示すようにプログラムコードＣの類似度が閾値以下であるため、グループから除外された例を示しているためである。 (S112: Parent-child relationship derivation / derivation rate calculation processing)
FIG. 13 is an example of a clone analysis diagram displayed on the result display unit after the clone analysis process is completed.
In FIG. 13, clone analysis diagrams are displayed for the combinations template A-program code A (clone analysis diagram 1301), template A-program code B (clone analysis diagram 1302), template A-program code D (clone analysis diagram 1303), program code A-program code B (clone analysis diagram 1304), program code A-program code D (clone analysis diagram 1305), program code B-program code D (clone analysis diagram 1306), and so on. No combination includes program code C because, as shown in FIG. 9, the example assumes that program code C was excluded from the group because its similarity is equal to or less than the threshold.

ここで、テンプレートＡは、プログラムコードＡを基に作成したテンプレート２０２（図１）である。プログラムコードＡを基にしているが、完全に同じわけではないため、テンプレートＡ−プログラムコードＡのクローン解析図には不連続部分が生じている。なお、図１３において、不連続部分は破線矢印で示している。
また、プログラムコードＡはオリジナルのプログラムコードであり、プログラムコードＢはプログラムコードＡに対して処理を追加したプログラムコードとする。そして、プログラムコードＤはプログラムコードＡに対して処理を修正したプログラムコードとする。 Here, the template A is the template 202 (FIG. 1) created based on the program code A. Although it is based on the program code A, it is not completely the same. Therefore, a discontinuous portion is generated in the clone analysis diagram of template A-program code A. In FIG. 13, discontinuous portions are indicated by broken-line arrows.
Program code A is the original program code, and program code B is program code A with processing added. Program code D is program code A with its processing modified.

図１２で、説明したようにあるプログラムコードにトークンが追加挿入されている場合、クローン片は水平右方向あるいは垂直下方向にずれた状態で不連続となっている。
従って、あるプログラムコードに対し、単純にトークンが追加されているだけであれば、水平右方向あるいは垂直下方向に対し、すべての不連続部分が一定の方向にずれた状態となる。そのため、すべての不連続部分が水平右方向もしくは垂直下方向にずれている場合、そのプログラムコードは元のプログラムコードに対し、単純にトークンが追加されているだけなので、元のプログラムコードを親、トークンが追加されたプログラムコードを子とみなすことができる。 In FIG. 12, when a token is additionally inserted into a certain program code as described above, the clone pieces are discontinuous in a state shifted in the horizontal right direction or the vertical downward direction.
Accordingly, if tokens have simply been added to a program code, all discontinuous portions are shifted in one consistent direction, either horizontally to the right or vertically downward. Therefore, when all discontinuous portions are shifted horizontally to the right, or all are shifted vertically downward, the program code differs from the original program code only by added tokens, so the original program code can be regarded as the parent and the program code with the added tokens as the child.

なお、水平右方向にずれていれば、クローン解析図の縦軸が示すプログラムコード（テンプレート）が親となり、横軸が示すプログラムコード（テンプレート）がトークンを追加された子となる。また、垂直下方向にずれていれば、クローン解析図の横軸が示すプログラムコード（テンプレート）が親となり、縦軸が示すプログラムコード（テンプレート）がトークンを追加された子となる。図１３の例では、クローン解析図１３０１，１３０２，１３０４の不連続部分１３１１がこれらの例に相当する。 If it is shifted horizontally to the right, the program code (template) indicated by the vertical axis of the clone analysis diagram becomes a parent, and the program code (template) indicated by the horizontal axis becomes a child to which a token is added. If it is shifted vertically downward, the program code (template) indicated by the horizontal axis of the clone analysis diagram becomes a parent, and the program code (template) indicated by the vertical axis becomes a child to which a token is added. In the example of FIG. 13, the discontinuous portion 1311 of the clone analysis diagrams 1301, 1302, and 1304 corresponds to these examples.

例えば、図１３のクローン解析図１３０１では、プログラムコードＡは、テンプレートＡに対しトークンが追加されているだけなので、テンプレートＡ＞プログラムコードＡであることがわかる。なお、「＞」は「親＞子」であることを示す。
同様に、クローン解析図１３０２では、テンプレートＡ＞プログラムコードＢ、クローン解析図１３０４からプログラムコードＡ＞プログラムコードＢであることがわかる。従って、系統図作成部１１０は、クローン解析図１３０１，１３０２，１３０４からテンプレートＡ＞プログラムコードＡ＞プログラムコードＢという親子関係を導出する。 For example, in the clone analysis diagram 1301 in FIG. 13, it can be seen that since the program code A has only a token added to the template A, template A> program code A. “>” Indicates “parent> child”.
Similarly, the clone analysis diagram 1302 shows that template A > program code B, and the clone analysis diagram 1304 shows that program code A > program code B. Therefore, the system diagram creation unit 110 derives the parent-child relationship template A > program code A > program code B from the clone analysis diagrams 1301, 1302, and 1304.
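
The direction rule can be sketched by looking at the gap between consecutive clone pieces. The representation of a clone piece as `((start_x, start_y), length)`, the function name `infer_parent_axis`, and the sample coordinates are assumptions for illustration; y is taken to increase downward, matching the description of a diagonal running from the origin to the lower right.

```python
def infer_parent_axis(pieces):
    """Return which axis holds the parent: 'vertical', 'horizontal', or None.

    pieces: list of ((start_x, start_y), length), ordered along the diagonal.
    """
    directions = set()
    for ((x0, y0), length), ((x1, y1), _) in zip(pieces, pieces[1:]):
        dx = x1 - (x0 + length)  # gap along the horizontal axis
        dy = y1 - (y0 + length)  # gap along the vertical axis
        if dx > 0 and dy == 0:
            directions.add("right")
        elif dx == 0 and dy > 0:
            directions.add("down")
        else:
            directions.add("other")  # oblique shift or torn portion
    if directions == {"right"}:
        return "vertical"    # horizontal-axis code has added tokens (child)
    if directions == {"down"}:
        return "horizontal"  # vertical-axis code has added tokens (child)
    return None              # shape alone does not decide


print(infer_parent_axis([((0, 0), 2), ((3, 2), 2)]))  # vertical
print(infer_parent_axis([((0, 0), 2), ((2, 3), 2)]))  # horizontal
print(infer_parent_axis([((0, 0), 2), ((3, 3), 2)]))  # None
```

When the function returns `None`, the shape of the diagram does not settle the relationship and another criterion is needed.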

さらに、トークンが修正されたり、異なるトークンに置換えられたりした部分は、クローン解析図１３０５の符号１３２１に示すように、クローン片が完全に断裂した状態となっている。これを断裂部分と称することとする。
このような断裂部分や、クローン解析図１３０５の符号１３２２のように、斜め方向にクローン片がずれている不連続部分が存在する場合、クローン解析図１３０１，１３０２，１３０４のようにクローン解析図の形状から親子関係を導出することができない。
図１３では、クローン解析図１３０３，１３０５，１３０６が、このようなクローン解析図に該当する。 Furthermore, as shown by reference numeral 1321 in the clone analysis diagram 1305, the portion where the token has been modified or replaced with a different token is in a state where the clone piece has been completely broken. This is referred to as a tearing portion.
When there is such a torn portion, or a discontinuous portion in which the clone piece is shifted obliquely as indicated by reference numeral 1322 in the clone analysis diagram 1305, the parent-child relationship cannot be derived from the shape of the clone analysis diagram as it can for the clone analysis diagrams 1301, 1302, and 1304.
In FIG. 13, clone analysis diagrams 1303, 1305, and 1306 correspond to such clone analysis diagrams.

図１３におけるクローン解析図１３０３，１３０５，１３０６のように、クローン解析図の形状から親子関係を導出することができない場合、系統図作成部１１０は、式（２）で算出される派生率を用いて各々のプログラムコード間のクローンコード片の平均長が一番長いものを構造が近いものと判断し、この派生率を基に親子関係を導出する。 When the parent-child relationship cannot be derived from the shape of the clone analysis diagram, as with the clone analysis diagrams 1303, 1305, and 1306 in FIG. 13, the system diagram creation unit 110 uses the derivation rate calculated by equation (2) to judge that the pair of program codes with the longest average clone code piece length is the closest in structure, and derives the parent-child relationship based on this derivation rate.

派生率＝（Ｔ_１＋Ｔ_２＋・・・＋Ｔ_Ｎ）／Ｎ・・・（２） Derivation rate = (T_1 + T_2 + ... + T_N) / N ... (2)

式（２）における、Ｎはクローンコード片の数であり、Ｔ_ｎはｎ個目のクローンコード片の長さである。
系統図作成部１１０は、図１３のクローン解析図１３０５（プログラムコードＡ‐プログラムコードＤ）、およびクローン解析図１３０６（プログラムコードＢ‐プログラムコードＤ）のそれぞれに対し、式（２）を計算し、クローンコード片の平均長（派生率）を求める。なお、テンプレートＡは、すべてのプログラムコードと親子関係をもつものであるため、系統図作成部１１０は派生率の算出対象からクローン解析図１３０３を外す。 In equation (2), N is the number of clone code pieces, and T_n is the length of the n-th clone code piece.
The system diagram creation unit 110 calculates Equation (2) for each of the clone analysis diagram 1305 (program code A-program code D) and the clone analysis diagram 1306 (program code B-program code D) in FIG. Then, the average length (derivation rate) of clone code fragments is obtained. Since template A has a parent-child relationship with all program codes, system diagram creation unit 110 removes clone analysis diagram 1303 from the derivation rate calculation target.

派生率の算出結果、プログラムコードＡ−プログラムコードＤの派生率＞プログラムコードＢ‐プログラムコードＤの派生率であったとすると、系統図作成部１１０が、プログラムコードＤは、プログラムコードＢよりプログラムコードＡに近い構造を有すると判別する。そして、系統図作成部１１０は、プログラムコードＤの系統図における位置を、プログラムコードＡと並列に位置付け、プログラムコードＡの親であるテンプレートＡの子とする。なお、プログラムコードＢ−プログラムコードＤの派生率＞プログラムコードＡ‐プログラムコードＤの派生率であった場合、系統図作成部１１０は、系統図におけるプログラムコードＤの位置をプログラムコードＢと並列に位置付け、プログラムコードＤをプログラムコードＢの親であるプログラムコードＡの子とする。 Suppose the calculation shows that the derivation rate of program code A-program code D is greater than that of program code B-program code D. The system diagram creation unit 110 then determines that program code D has a structure closer to program code A than to program code B, positions program code D in parallel with program code A in the system diagram, and makes it a child of template A, the parent of program code A. Conversely, if the derivation rate of program code B-program code D is greater than that of program code A-program code D, the system diagram creation unit 110 positions program code D in parallel with program code B in the system diagram and makes program code D a child of program code A, the parent of program code B.
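
Equation (2) and the placement rule can be illustrated with a short sketch. The clone piece lengths below are invented for the example, the names `derivation_rate` and `parent_of` are hypothetical, and the handling of a tie (which the text does not specify) is an arbitrary choice made here.

```python
def derivation_rate(piece_lengths: list[int]) -> float:
    """Equation (2): average length of the clone code pieces (N >= 1 assumed)."""
    return sum(piece_lengths) / len(piece_lengths)


# Hypothetical clone piece lengths read off diagrams 1305 and 1306.
rate_a_d = derivation_rate([4, 3, 5])  # program code A - program code D
rate_b_d = derivation_rate([2, 3, 1])  # program code B - program code D

# Relationships already derived from the diagram shapes.
parent_of = {"Program code A": "Template A", "Program code B": "Program code A"}

# D is placed in parallel with whichever code it is closer to, so it
# shares that code's parent. A tie falls to B here (assumption).
closer = "Program code A" if rate_a_d > rate_b_d else "Program code B"
parent_of["Program code D"] = parent_of[closer]

print(rate_a_d, rate_b_d)           # 4.0 2.0
print(parent_of["Program code D"])  # Template A
```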

以上、系統図作成部１１０による親子関係の導出手順を整理すると、
１．系統図作成部１１０は、クローン解析図において、すべての不連続部分が水平右方向あるいは垂直下方向に対し、一定の方向でずれているものがあるか否かを判定し、あれば、クローン片がずれている方向から親子関係を導出する（図１３のクローン解析図１３０１，１３０２，１３０４）。
２．系統図作成部１１０は、クローン片が一定方向にずれていないクローン解析図（クローン解析図１３０３，１３０５，１３０６）からテンプレート２０２を有するクローン解析図を派生率の算出対象から除外する（クローン解析図１３０３）。
３．派生率の算出対象から除外されなかったクローン解析図から派生率を算出し、親子関係を導出する。 The procedure by which the system diagram creation unit 110 derives the parent-child relationship is summarized as follows.
1. The system diagram creation unit 110 determines whether there is a clone analysis diagram in which all discontinuous portions are shifted in one consistent direction, either horizontally to the right or vertically downward; if there is, it derives the parent-child relationship from the direction in which the clone pieces are shifted (clone analysis diagrams 1301, 1302, and 1304 in FIG. 13).
2. Among the clone analysis diagrams in which the clone pieces are not shifted in one consistent direction (clone analysis diagrams 1303, 1305, and 1306), the system diagram creation unit 110 excludes those involving the template 202 from the derivation rate calculation (clone analysis diagram 1303).
3. The derivation rate is calculated from the clone analysis diagrams that were not excluded from the derivation rate calculation, and the parent-child relationship is derived.

なお、系統図作成部１１０は、ユーザによって登録されたシステムの開発順番を参照し、以前に作成されたプログラムコードが必ず親になるように派生関係を決定してもよい。
このようにして、系統図作成部１１０は、プログラムコード間の派生関係を求め、系統図を作成し、結果表示部３００に作成した系統図を表示する。 Note that the system diagram creation unit 110 may refer to the development order of the systems registered by the user and determine the derivation relationships so that program code created earlier always becomes the parent.
In this way, the system diagram creation unit 110 obtains a derivation relationship between program codes, creates a system diagram, and displays the created system diagram on the result display unit 300.

（Ｓ１１４：系統図作成処理）
図１４は、ステップＳ１１４における系統図作成処理を説明するための系統図の例を示す図である。
図１３で説明した処理により、系統図作成部１１０は、テンプレートＡの子としてプログラムコードＡおよびプログラムコードＤ、プログラムコードＡの子（テンプレートＡの孫）としてプログラムコードＢという関係を導出し、結果表示部３００に系統図１４０１を表示する。 (S114: System diagram creation process)
FIG. 14 is a diagram showing an example of a system diagram for explaining the system diagram creation processing in step S114.
Through the processing described with reference to FIG. 13, the system diagram creation unit 110 derives the relationship in which program code A and program code D are children of template A and program code B is a child of program code A (a grandchild of template A), and displays the system diagram 1401 on the result display unit 300.
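
The relationships in this example can be rendered as an indented tree. The sketch below assumes the parent-child relationships are held in a child-to-parent mapping; the mapping layout, the function name `render`, and the text output format are illustrative and do not reproduce the actual appearance of the system diagram 1401.

```python
def render(parent_of: dict[str, str], root: str, depth: int = 0) -> list[str]:
    """Return the tree below root as indented lines, children in name order."""
    lines = ["  " * depth + root]
    children = sorted(child for child, parent in parent_of.items() if parent == root)
    for child in children:
        lines.extend(render(parent_of, child, depth + 1))
    return lines


# Child -> parent relationships derived in the example.
parent_of = {
    "Program code A": "Template A",
    "Program code D": "Template A",
    "Program code B": "Program code A",
}

print("\n".join(render(parent_of, "Template A")))
# Template A
#   Program code A
#     Program code B
#   Program code D
```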

《具体例》
次に、図１５〜図１９を参照して、本実施形態に係る系統図の作成手順を具体例を用いて説明する。
本実施形態において、系統図の算出対象となるソースプログラムの使用言語はＣ言語とした。ここでは、ある１つのテンプレート２０２に対して、２つの解析対象プログラムコード群α，βから類似したプログラムコードを算出し、系統図を算出するものとする。
まず、ユーザは類似度算出に先立ち、解析対象としてプログラムコード群α、およびプログラムコード群αをベースとして作成されたプログラムコード群βを解析対象プログラムコードとして登録する。 "Concrete example"
Next, with reference to FIGS. 15 to 19, a procedure for creating a system diagram according to the present embodiment will be described using a specific example.
In this embodiment, the language used for the source program to be calculated for the system diagram is C language. Here, it is assumed that similar program codes are calculated from two analysis target program code groups α and β for a certain template 202, and a system diagram is calculated.
First, prior to the similarity calculation, the user registers the program code group α as an analysis target and the program code group β created based on the program code group α as an analysis target program code.

なお、ここで、プログラムコード群２０１とは、共通のプログラムコードを使用して構成されている別のプログラムコード群２０１である。例えば、バージョンが異なるプログラムなどがプログラムコード群α、βに該当する。登録されたプログラムコードは、図３のステップＳ１０２による字句解析が行われ、さらにステップＳ１０４〜Ｓ１０６の処理が行われ、テンプレート２０２、関数辞書２０３、メモリアクセス辞書２０４が作成される。 Here, the program code group 201 is another program code group 201 configured using a common program code. For example, programs with different versions correspond to the program code groups α and β. The registered program code is subjected to lexical analysis in step S102 of FIG. 3, and further processed in steps S104 to S106 to create a template 202, a function dictionary 203, and a memory access dictionary 204.

FIG. 15 is a diagram showing specific examples of a template, a function dictionary, and a memory access dictionary.
In the example of FIG. 15, the function dictionary 203 shows that "sub_A()" and "sub_B()" correspond to the "common function (...)" of the template 202, and the memory access dictionary 204 shows that "pData", which the function memcpy references, is a global variable.
In this way, the template is defined by combining meaningful tokens used in the analysis target program code group 201. Here, meaningful tokens are tokens that are in a copy relationship with one another. For example, as shown in FIG. 15, a structure using the memset function, an if-else statement, the common function, and the memcpy function is defined as the template 202. In the template 202 of FIG. 15, the "common function" and "memcpy" are regarded as particularly significant, so each is given a weight of "2", while all other tokens are given a weight of "1".
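As an illustration of how the two dictionaries of FIG. 15 could be applied when tokens are compared, the sketch below maps each function call to a comparison token. It is a minimal sketch under stated assumptions: the dictionary contents are modelled on the names quoted above, while the data layout, the helper name `normalize`, the variable name `buf`, and the choice to treat unregistered variables as local are all hypothetical.

```python
# Hypothetical dictionary contents modelled on the names quoted for FIG. 15.
FUNCTION_DICT = {"sub_A": "common_function", "sub_B": "common_function"}
MEMORY_ACCESS_DICT = {"pData": "global"}  # variable name -> variable type


def normalize(name, args=()):
    """Map a function call to a comparison token.

    Functions registered under the same entry of the function dictionary
    collapse to one canonical name; the types of the referenced variables
    are attached so that calls with the same form but different variable
    types yield different tokens.
    """
    canonical = FUNCTION_DICT.get(name, name)
    # Assumption: a variable absent from the dictionary is treated as local.
    kinds = sorted({MEMORY_ACCESS_DICT.get(arg, "local") for arg in args})
    return (canonical, tuple(kinds))


same = normalize("sub_A") == normalize("sub_B")
different = normalize("memcpy", ["pData"]) != normalize("memcpy", ["buf"])
print(same, different)
```

With this mapping, sub_A and sub_B compare as the same token, whereas a memcpy that references the global pData and a memcpy that references a local buffer compare as different tokens.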

FIG. 16 is a diagram showing specific examples of a template and program codes.
Template a in FIG. 16 was created based on program code a1 in FIG. 16. Program code c, in turn, was created by modifying program code a1.
Five tokens are registered in template a, and with the weights of FIG. 15 the maximum score is 7.
In program code a1, all of the tokens defined in template a appear in the order in which they are registered in template a (dotted portions). Assuming that "&pRtn" in "memcpy" is a global variable, the similarity of program code a1 to template a is 100%. Whether a variable used in a function is a global variable or a local variable can be determined from the form of its declaration.
In program code c, all of the tokens registered in template a also appear, but their order differs from that of template a; for example, "if" appears after "Sub_B". The "if-else" statement in program code c is therefore not judged to be clone code of template a, and only "memset", "Sub_B", and "memcpy" are judged to match template a (dotted portions). As a result, the similarity of program code c to template a is approximately 71%.

That is:
Similarity of program code a1 = (1 + 1 + 2 + 2 + 1) / (1 + 1 + 2 + 2 + 1) × 100 = 100 (%)
Similarity of program code c = (1 + 2 + 2) / (1 + 1 + 2 + 2 + 1) × 100 ≈ 71 (%)
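The arithmetic above can be reproduced with a short sketch. The text does not name the algorithm used to decide which tokens match in order, so the code below assumes a weighted longest common subsequence as one plausible model. The token sequences are hypothetical: FIG. 16 is not reproduced here, the name `fifth_token` stands in for the unnamed fifth token of template a, and the order of tokens in `CODE_C` is chosen only so that "if-else" follows the common function.

```python
def weighted_match_score(template, code, weights):
    """Largest total weight of template tokens found in code in template order.

    Assumption: in-order matching is modelled as a weighted longest common
    subsequence; the source text does not specify the matching algorithm.
    """
    n, m = len(template), len(code)
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            if template[i - 1] == code[j - 1]:
                gain = weights[template[i - 1]]
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + gain)
    return best[n][m]


def similarity(template, code, weights):
    """Similarity in percent: matched weight divided by the maximum score."""
    total = sum(weights[token] for token in template)
    return 100.0 * weighted_match_score(template, code, weights) / total


# Weights follow the text: common function and memcpy are 2, the rest are 1.
WEIGHTS = {"memset": 1, "if-else": 1, "common_function": 2,
           "memcpy": 2, "fifth_token": 1}
# Hypothetical token sequences (calls already mapped via the function dictionary).
TEMPLATE_A = ["memset", "if-else", "common_function", "memcpy", "fifth_token"]
CODE_A1 = list(TEMPLATE_A)
CODE_C = ["fifth_token", "memset", "common_function", "if-else", "memcpy"]

print(round(similarity(TEMPLATE_A, CODE_A1, WEIGHTS)))
print(round(similarity(TEMPLATE_A, CODE_C, WEIGHTS)))
```

Under these assumptions the matched weight for the second sequence is 1 + 2 + 2 = 5 out of a maximum of 7, which rounds to the 71% given in the text.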

In this manner, the program analysis apparatus 1 calculates the similarity of each program code in the program code groups α and β.

FIG. 17 is a diagram showing the similarity of each program code in the program code group α to template a.
Here, "a1", "c", "c1", "d", "e", "e1", "h", and "j" are the names of program codes in the program code group α.
The threshold is set to 40%, and only program codes with a similarity of 40% or more are displayed.
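The threshold step can be sketched as a simple filter. The values for "a1" and "c" below are the similarities worked out in the text; the entry "x" and its value are hypothetical and are included only to show a program code being excluded. The helper name `filter_by_threshold` is likewise an assumption.

```python
THRESHOLD = 40.0  # percent, as stated in the text


def filter_by_threshold(similarities, threshold=THRESHOLD):
    """Keep only program codes whose similarity is at or above the threshold."""
    return {name: s for name, s in similarities.items() if s >= threshold}


# "a1" and "c" use the similarities derived in the text; "x" is hypothetical.
scores = {"a1": 100.0, "c": 71.4, "x": 25.0}
shown = filter_by_threshold(scores)
print(shown)
```

A program code whose similarity is exactly 40% is retained, since the condition is "40% or more".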

FIGS. 18 and 19 are diagrams showing system diagrams of the program code groups α and β. In FIGS. 18 and 19, only "a" is a template 202; all the others are program codes.
These are the results of applying the processing shown in FIG. 3 to the program code group α and the program code group β individually.

Furthermore, as shown in FIG. 19, the system diagram creation process may merge the results of FIG. 18 and display the merged diagram on the result display unit 300.
The merged diagram may be created, for example, when the user enters, via the user interface unit 400, an indication that the program code group β was created based on the program code group α.

The present embodiment is not limited to C; it can be applied to any program system comprising a plurality of program codes, regardless of the programming language in which it is written.

<Summary>
According to the present embodiment, the function dictionary 203, which registers functions that can be regarded as the same function even when their forms differ, and the memory access dictionary 204, which registers the types of variables used in functions, are used so that tokens that differ in form can be treated as substantially the same token, and tokens that are identical in form can be treated as different tokens (functions) when the variables they reference differ. This makes it possible to extract token types that cannot be extracted by a simple comparison of program code, enabling highly accurate program analysis. Furthermore, by creating a system diagram that takes the token types into account, parent-child relationships between program codes can be derived with high accuracy.

DESCRIPTION OF SYMBOLS
1 Program analysis apparatus
100 Control unit
101 Analysis target program code group registration unit
102 Lexical analysis unit
103 Template creation unit
104 Function dictionary creation unit
105 Memory access dictionary creation unit
106 Similarity calculation unit
107 Similarity analysis result creation unit
108 Classification result creation unit
109 Clone analysis unit
110 System diagram creation unit
200 Storage unit
201 Program code group
202 Template
203 Function dictionary (token dictionary)
204 Memory access dictionary
205 Lexical analysis result
206 Token registration determination table
300 Result display unit (display unit)
1101, 1201, 1301 to 1306 Clone analysis diagram
1401 System diagram

Claims

A program analysis method performed by a program analysis apparatus that derives and displays, among a plurality of program codes, parent-child relationships in which a modification-source program code is a parent and a modification-destination program code, modified by adding or correcting tokens, is a child,
wherein the program analysis apparatus stores in a storage unit:
a function dictionary in which functions that differ in description format but can be regarded as the same function are registered; and
a memory access dictionary in which the types of variables used in functions are registered,
and wherein the program analysis apparatus:
(a1) performs lexical analysis that decomposes a plurality of program codes to be analyzed into tokens;
(a2) refers to a template, which is generated based on the result of the lexical analysis of a predetermined program code, is stored in the storage unit, and has the tokens of the predetermined program code registered in it, to the result of the lexical analysis of each program code, and to the function dictionary, and performs, for each program code, processing that compares the tokens with those registered in the template and determines functions that differ in description format but are regarded as the same function in the function dictionary to be the same function;
(a3) refers to the template, the result of the lexical analysis of each program code, and the memory access dictionary, and performs, for each program code, processing that compares functions with those registered in the template, determines functions that have the same format but use different types of variables to be different functions, and thereby narrows down the functions regarded as the same;
(a4) treats tokens relating to functions determined to be different functions as different tokens, and, for two program codes among the plurality of program codes, performs clone analysis that takes the appearance positions of the tokens of one program code on the vertical axis and the appearance positions of the tokens of the other program code on the horizontal axis and, when the same token appears in both program codes, generates data in which a point (x, y) is plotted, where x is the horizontal-axis coordinate and y is the vertical-axis coordinate corresponding to the appearance positions of that token;
(a5) derives a parent-child relationship between the template and the program codes based on the result of the clone analysis;
(a6) performs the processing of (a4) and (a5) for each pair of template and program code to be analyzed; and
(a7) displays on a display unit a system diagram in which the template and the program codes for which the parent-child relationship has been derived are connected by lines.

The program analysis method according to claim 1, wherein
a line formed by consecutive plotted points in the data resulting from the clone analysis is a clone piece, and
when all of the clone pieces are discontinuous and shifted in one fixed direction, either horizontally rightward or vertically downward, the program analysis apparatus:
when all of the discontinuities are shifted horizontally rightward, sets the template or program code represented by the vertical axis of the clone analysis result as the parent and the program code represented by the horizontal axis as the child; and
when all of the discontinuities are shifted vertically downward, sets the template or program code represented by the horizontal axis of the clone analysis result as the parent and the program code represented by the vertical axis as the child.

The program analysis method according to claim 1, wherein
a line formed by consecutive plotted points in the data resulting from the clone analysis is a clone piece, and
the program analysis apparatus:
calculates a derivation rate based on the following formula (1);
positions two program codes having similar derivation rates in parallel in the system diagram; and
assigns a common parent to the program codes positioned in parallel.
Derivation rate = (T_1 + T_2 + ... + T_N) / N   (1)
Here, N is the number of clone code fragments, and T_n is the length of the n-th clone code fragment.

The program analysis method according to claim 1, wherein
the program analysis apparatus further stores in the storage unit a template in which weights are associated with tokens used in the plurality of program codes, and
the program analysis apparatus calculates a similarity for each program code to be processed by summing the weights of the tokens that match tokens registered in the template.

The program analysis method according to claim 4, wherein the program analysis apparatus excludes, from the processing of (a2) to (a5), any set of program codes whose similarity is less than a predetermined value.

The program analysis method according to claim 1, wherein the types of variables are a global variable, a local variable, and a multiple global variable, which is a plurality of global variables.

A program analysis program for causing a computer to execute the program analysis method according to any one of claims 1 to 6.

A program analysis apparatus that derives and displays, among a plurality of program codes, parent-child relationships in which a modification-source program code is a parent and a modification-destination program code, modified by adding or correcting tokens, is a child, the apparatus comprising:
a control unit that processes information; and a storage unit that stores information,
wherein the storage unit stores:
a function dictionary in which functions that differ in description format but can be regarded as the same function are registered; and
a memory access dictionary in which the types of variables used in functions are registered,
and wherein the control unit:
(a1) performs lexical analysis that decomposes a plurality of program codes to be analyzed into tokens;
(a2) refers to a template, which is generated based on the result of the lexical analysis of a predetermined program code, is stored in the storage unit, and has the tokens of the predetermined program code registered in it, to the result of the lexical analysis of each program code, and to the function dictionary, and performs, for each program code, processing that compares the tokens with those registered in the template and determines functions that differ in description format but are regarded as the same function in the function dictionary to be the same function;
(a3) refers to the template, the result of the lexical analysis of each program code, and the memory access dictionary, and performs, for each program code, processing that compares functions with those registered in the template, determines functions that have the same format but use different types of variables to be different functions, and thereby narrows down the functions regarded as the same;
(a4) treats tokens relating to functions determined to be different functions as different tokens, and, for two program codes among the plurality of program codes, takes the appearance positions of the tokens of one program code on the vertical axis and the appearance positions of the tokens of the other program code on the horizontal axis and, when the same token appears in both program codes, generates data in which a point (x, y) is plotted, where x is the horizontal-axis coordinate and y is the vertical-axis coordinate corresponding to the appearance positions of that token;
(a5) derives a parent-child relationship between the template and the program codes based on the generated data;
(a6) performs the processing of (a4) and (a5) for each pair of template and program code to be analyzed; and
(a7) displays on a display unit a system diagram in which the template and the program codes for which the parent-child relationship has been derived are connected by lines.