JP2009099030A

JP2009099030A - Processing content determination device and processing content determination method

Info

Publication number: JP2009099030A
Application number: JP2007271325A
Authority: JP
Inventors: Akihiro Sakano; 晃弘坂野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-10-18
Filing date: 2007-10-18
Publication date: 2009-05-07

Abstract

PROBLEM TO BE SOLVED: To provide a processing content determination device and a processing content determination method that can more accurately determine whether processing contents described using a predetermined relation rule are identical or similar.

SOLUTION: A description rule analysis unit 112 analyzes first and second comparison target description data 103 and 104, which are to be judged for similarity, into, for example, a tree structure using a relation rule common to both. A similarity calculation unit 114 calculates the similarity between the two comparison targets over the blocks that remain after, as one example, pruning set by a comparison degree setting unit 107. The result is output as a similar/not-similar judgment or as a degree of similarity and is displayed by a similarity determination display unit 108.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a processing content determination device and a processing content determination method that determine whether processing contents assembled using a predetermined relation rule are identical, or detect their degree of similarity, as when sentences are compared using the relation rule called grammar.

Suppose there is a work A written in one language and a work B written in another language. In this specification, a work means created source code or a document. When judging plagiarism, for example, the question is whether the two works A and B under examination have points of similarity, or how similar they are. Various techniques for detecting similarities between works A and B have therefore been proposed as techniques related to the present invention (see, for example, Patent Document 1 and Patent Document 2).

Of these, Patent Document 1 discloses a technique for extracting similar source code fragments from source code written in a given programming language. The conventional approach compares all source code in a source code group exhaustively to extract similar fragments, which takes an enormous amount of processing time when the group contains a large amount of source code. This first related technique addresses that problem: it extracts fragments similar to a designated reference fragment, so results are obtained in less time than with the exhaustive all-pairs comparison.

Next, the second related technique, described in Patent Document 2, is explained. It is a technique for checking how much a source program has been changed when it is modified.

FIG. 4 outlines the configuration of the source program comparison information generation system according to the second related technique. In this source program comparison information generation system 400, a source program reading unit 401 reads a pre-change source program 402 and a post-change source program 403. A basic comparison information analysis unit 404 compares the source programs 402 and 403 line by line for matches and mismatches and extracts changed lines and comparison lines. It then classifies the result into changed blocks, which indicate changed portions made up of changed lines or comparison lines, and the remaining unchanged blocks. That is, the basic comparison information analysis unit 404 compares the source programs line by line and divides them into unchanged blocks, each consisting of one line or several consecutive lines with no changes, and changed blocks where changes were made.

A detailed comparison information analysis unit 405 creates detailed comparison information for the combinations of comparison lines and changed lines within each changed block, which lies between unchanged blocks. Specifically, a line similarity calculation unit 411 calculates, for every combination of lines, the similarity between the lines of the pre-change source program 402 and the lines of the post-change source program 403 within the changed block.

Next, a block similarity calculation unit 413 calculates the similarity of the changed block as a whole for every combination pattern in the changed block, based on the per-combination line similarities calculated by the line similarity calculation unit 411.

Finally, a change type determination unit 414 determines the change type of each line in the changed block, based on the combination pattern with the highest block similarity. Specifically, when a comparison line and a changed line are paired in that combination pattern, the line is judged to have been "modified". A comparison line with no counterpart is judged to have been "deleted", and a changed line with no counterpart is judged to have been "added".
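The line-level classification performed by units 411 to 414 can be illustrated with a short Python sketch. This is only a rough approximation under stated assumptions: the function names, the use of `difflib.SequenceMatcher` as the line similarity, the order-preserving pairing, and the `min_sim` cutoff are all hypothetical choices, not details taken from Patent Document 2.

```python
from difflib import SequenceMatcher
from functools import lru_cache


def line_similarity(a, b):
    # Assumed similarity measure: difflib's ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def classify_changes(old, new, min_sim=0.5):
    """Pair lines of a changed block and label each line.

    Searches order-preserving pairings of old and new lines, keeps the
    pairing with the highest total similarity (the "block similarity"
    in this sketch), and returns (kind, old_line, new_line) tuples.
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        # Best (score, pairs) for old[i:] against new[j:].
        if i == len(old) or j == len(new):
            return 0.0, ()
        candidates = [best(i + 1, j), best(i, j + 1)]
        s = line_similarity(old[i], new[j])
        if s >= min_sim:
            score, pairs = best(i + 1, j + 1)
            candidates.append((score + s, ((i, j),) + pairs))
        return max(candidates, key=lambda c: c[0])

    _, pairs = best(0, 0)
    paired_old = {i for i, _ in pairs}
    paired_new = {j for _, j in pairs}
    result = [("modified", old[i], new[j]) for i, j in pairs]
    result += [("deleted", old[i], None)
               for i in range(len(old)) if i not in paired_old]
    result += [("added", None, new[j])
               for j in range(len(new)) if j not in paired_new]
    return result


if __name__ == "__main__":
    before = ["x = 1", "print(x)"]
    after = ["x = 2", "y = compute_total(values)"]
    for entry in classify_changes(before, after):
        print(entry)
```

Note that this works purely on character strings, which is exactly the limitation the present specification goes on to criticize.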

Next, techniques related to the present invention other than patent documents are outlined. As a third related technique, CFG (Context Free Grammar) and BNF (Backus Naur Form) have been explained (see, for example, Non-Patent Document 1), and BNF implementations of ANSI C (American National Standard for Information Systems - Programming language C) (89) and of other languages have been published (see, for example, Non-Patent Document 2 and Non-Patent Document 3). Non-Patent Document 3 also explains DCG (Definite Clause Grammar) in Prolog (PROgramming in LOGic). Note, however, that "JIS X 3012:2001 Prolog" covers only the core of the language and does not include DCG. These conventional related techniques extract character-string differences between sources.

The above covers various techniques related to the present invention. These techniques for detecting similarities between works A and B generally share the following tendencies.
(a) They compare source code and documents directly as character strings rather than by structure.
(b) They compare character strings within a single language and do not compare across different languages.
(c) They do not set blocks as units of comparison.

For example, the second related technique shown in FIG. 4 divides the programs into blocks (regions): a changed block is a changed portion made up of changed lines on the post-change side, or of the comparison lines on the pre-change side that correspond to the changed portion, and an unchanged block is an unchanged portion. In other words, a "block" in the second related technique serves only to separate changed portions from unchanged ones, and similarity is measured statistically, mainly from line-by-line differences between the source programs.

Patent Document 1: JP 2006-018693 A (paragraph 0022, FIG. 3)
Patent Document 2: JP 2003-280903 A (paragraphs 0010 and 0011, FIGS. 1 and 3)
Non-patent documents:
五月女健治 (Kenji Saotome), "bison/flex program generator on MS-DOS", Keigaku Shuppan, January 25, 1994, pp. 281-318
http://www.quut.com/c/ANSI-C-grammar-y.html
http://www.csci.csusb.edu/dick/samples/index.html
http://bach.istc.kobe-u.ac.jp/prolog/intro/lang.html

As described above, when the similarity of works such as source code or documents is judged, the related techniques assume that the works A and B being compared are written in a common language and compare them simply by taking differences. The following problems have therefore been pointed out.

(a) Differences could not be checked independently of the freedom a source language gives as a development tool. That is, it was difficult to ignore source differences that produce no difference in the resulting object and identify only the significant differences.
(b) Consequently, when judging infringement of a patent or copyright, for example, concluding that a work is merely the result of finding similar source and substituting it for the original requires judging structural identity, and such judgment was difficult.
(c) As a result, these techniques were poor at detecting similar documents that infringe patents or copyrights.

For example, the second related technique mainly takes line-by-line differences between source programs. Applied to judging the similarity of works in general, it is useful for locating unauthorized tampering in a specified work or for finding what changed between two versions A and B of a work. However, it cannot meet needs such as recognizing or visualizing structural differences between works A and B, confirming in a language-independent way that software ported to a different language or platform is equivalent to the original, or detecting improper reuse that merely replaces character strings or the language.

The discussion so far has concerned judging the identity or similarity of works, but the same problems arise when judging similarity for processing contents in general that are described using a predetermined relation rule.

An object of the present invention is therefore to provide a processing content determination device and a processing content determination method that can more accurately determine whether processing contents described using a predetermined relation rule are identical or similar.

In the present invention, a processing content determination device comprises: (a) comparison target description input means for inputting descriptions that are each assembled using a predetermined relation rule and are to be compared for similarity; (b) description component relation map generation means for analyzing each of these descriptions using the relation rule and generating, for each, a description component relation map that represents the relations among the components making up the description; and (c) similarity calculation means for comparing the description component relation maps generated by the map generation means between the descriptions being compared and calculating the degree to which the maps match as a similarity.

Also in the present invention, a processing content determination method comprises: (a) a comparison target description input step of inputting descriptions that are each assembled using a predetermined relation rule and are to be compared for similarity; (b) a description component relation map generation step of analyzing each of these descriptions using the relation rule and generating, for each, a description component relation map that represents the relations among the components making up the description; and (c) a similarity calculation step of comparing the description component relation maps generated in the map generation step between the descriptions being compared and calculating the degree to which the maps match as a similarity.

Thus, in the present invention, descriptions that are assembled using a predetermined relation rule and are to be compared for similarity are input, and before they are compared, a description component relation map representing the relations among the components of each description is generated using that relation rule. The map is, for example, a parse tree, which shows hierarchical relations, or a network-type data structure, which shows mutual relations as a network. Similarity is calculated by comparing the maps. This makes it possible to calculate similarity (a similar/not-similar judgment or a degree of resemblance) while determining identity or differences between, for example, parents or between children in a parent-child relation, which allows more accurate determination than the prior art.
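As an illustration of comparing relation maps rather than character strings, the following Python sketch reduces each description to a set of parent-child relations and scores the overlap. The tuple representation `(label, children)`, the function names, and the choice of the Jaccard index as the matching measure are assumptions made for this sketch; the specification does not prescribe a particular formula.

```python
def relation_map(tree):
    """Collect (parent_label, child_label) pairs from a nested tree.

    A tree is a tuple (label, [child_tree, ...]). The resulting edge set
    is a very small stand-in for a description component relation map.
    """
    label, children = tree
    edges = set()
    for child in children:
        edges.add((label, child[0]))
        edges |= relation_map(child)
    return edges


def map_similarity(a, b):
    """Jaccard index of the two edge sets (1.0 means identical maps)."""
    edges_a, edges_b = relation_map(a), relation_map(b)
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


if __name__ == "__main__":
    doc_a = ("document", [
        ("chapter", [("sentence", []), ("sentence", [])]),
        ("appendix", []),
    ])
    doc_b = ("document", [("chapter", [("sentence", [])])])
    print(map_similarity(doc_a, doc_b))
```

Because only the relations between components are compared, the wording inside each component does not affect the score in this sketch.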

FIG. 1 shows the configuration of a processing content determination system that uses the processing content determination device according to an embodiment of the present invention. The processing content determination system 100 includes a first comparison target description data storage unit 101 and a second comparison target description data storage unit 102. The first storage unit 101 stores first comparison target description data 103, and the second storage unit 102 stores second comparison target description data 104 to be compared with it. Typical examples of the first and second comparison target description data 103 and 104 are works such as novels and programs.

The processing content determination device 105 includes a comparison target description data input unit 111 that inputs the first and second comparison target description data 103 and 104, a description rule analysis unit 112 that analyzes the relation rule (hereinafter, the description rule) in which the comparison target description data 103 and 104 received from the input unit 111 are written, and a similarity calculation unit 114 that receives the analyzed description rules 113 and calculates the similarity between the first and second comparison target description data 103 and 104. Both sets of comparison target description data 103 and 104 can be regarded as collections of blocks built according to description rules. The description rule analysis unit 112 analyzes the description rules by referring to an external description rule database 106.

That is, the description rule analysis unit 112 analyzes the description rules common to the first and second comparison target description data 103 and 104. If the data are written according to a relation rule (here, a language), as novels, patent specifications, and source programs are, and the languages differ, preprocessing is needed to unify them. For example, if the first comparison target description data 103 are written in English and the second comparison target description data 104 in Japanese, they must first be translated into a common language (English, Japanese, or a language common to both, such as Esperanto). The description rule database 106 may include the data needed to carry out such preprocessing.

The preprocessing method depends on what is described. Suppose, for example, that the descriptions are not languages but game contents, and the question is whether a career-advancement game about an office worker and one about a politician are similar for copyright purposes. Accurately rearranging both into the world of one of the games, or into a common game world, while taking into account elements such as the roles of the characters, improves the accuracy of the similarity judgment.

The description rule analysis unit 112 analyzes the contents of data translated into a common language in this way, or originally written in a common language. The first and second comparison target description data storage units 101 and 102 must therefore store comparison target description data 103 and 104 created using relation rules that allow their contents to be analyzed. A language such as Japanese, for example, has a relation rule called grammar, by which its contents can be analyzed. The present invention is of course applicable not only to languages but to any description, such as the games mentioned above, for which a relation rule is defined that allows the connections among its parts to be analyzed.

Here, analysis by the description rule analysis unit 112 means examining, using a description component relation map, how the contents within, for example, the first comparison target description data 103 relate to one another. Examples are hierarchical relations such as parent, child, and grandchild, and network-like relations such as people in the same department, people belonging to the same hobby club, or people of the same age.

The similarity calculation unit 114 compares the corresponding parts or corresponding groups of the description rules 113 analyzed for the first and second comparison target description data 103 and 104, measures whether corresponding items are identical and how much they differ, and calculates either a continuous-valued degree of resemblance between the two descriptions or a binary similar/not-similar result. The comparison degree setting unit 107 may set the degree (depth or range) of comparison. For example, when comparing description data whose analysis results are hierarchical, the similarity calculation may cover only the levels from the base level down to a specified level. The example described next realizes this with the concept of "pruning".
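The effect of limiting the comparison depth can be sketched in Python as follows. The tree representation `(label, children)`, the position-by-position matching rule, the Dice-style score, and the default `threshold` are illustrative assumptions; the specification only states that the comparison may be limited to a specified level and that the result may be continuous or binary.

```python
def prune(tree, depth):
    """Keep only the levels from the root down to the given depth."""
    label, children = tree
    if depth <= 0:
        return (label, [])
    return (label, [prune(child, depth - 1) for child in children])


def count_nodes(tree):
    return 1 + sum(count_nodes(child) for child in tree[1])


def matched_nodes(a, b):
    """Count nodes whose labels agree, walking both trees in parallel."""
    if a[0] != b[0]:
        return 0
    return 1 + sum(matched_nodes(x, y) for x, y in zip(a[1], b[1]))


def similarity(a, b, depth):
    """Degree of similarity in [0, 1] after pruning both trees."""
    pa, pb = prune(a, depth), prune(b, depth)
    return 2 * matched_nodes(pa, pb) / (count_nodes(pa) + count_nodes(pb))


def is_similar(a, b, depth, threshold=0.8):
    """Binary similar/not-similar judgment derived from the degree."""
    return similarity(a, b, depth) >= threshold


if __name__ == "__main__":
    a = ("func", [("if", [("call", [])]), ("return", [])])
    b = ("func", [("if", [("assign", [])]), ("return", [])])
    print(similarity(a, b, depth=1), is_similar(a, b, depth=1))
    print(similarity(a, b, depth=2), is_similar(a, b, depth=2))
```

In this example the two trees are indistinguishable at depth 1 and differ only at depth 2, so a shallow setting reports them as similar while a deeper one exposes the difference.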

The calculation result 115 of the similarity calculation unit 114 is output to the similarity determination display unit 108. The similarity determination display unit 108 may output to a recording medium, as a printer does, or may display the result, as a liquid crystal display does. The processing content determination device 105 and the similarity determination display unit 108 may be connected by a communication cable or via a communication network. A storage device such as a hard disk that stores the calculation result 115 temporarily or semi-permanently may be placed between them.

As noted above, the calculation result output to the similarity determination display unit 108 may express the degree of similarity or may be a similar/not-similar judgment. The output form depends on what the processing content determination device 105 is used for. The comparison degree setting unit 107 may also set the output form.

With the processing content determination device 105 of this embodiment, similarity can be judged even between comparison target description data written in different formats by performing preprocessing such as translation, which relaxes the restrictions on what can be compared. Because similarity is judged on the basis of description rules, the device is not limited to grammar and can make a wide range of similarity judgments, including comparisons of natural phenomena, for example judging the similarity of two typhoons from meteorological descriptions assembled using a predetermined relation rule.

Furthermore, because similarity is judged on the basis of relation rules, the judgment is objective. Setting the degree of comparison also allows processing to be made efficient for the purpose the processing content determination device 105 serves.

The processing content determination device 105 shown in FIG. 1 can be implemented as a computer that uses a CPU (Central Processing Unit) or processor (not shown) and a storage medium such as a hard disk (also not shown) storing a control program for similarity determination. In this case, the first comparison target description data storage unit 101, the second comparison target description data storage unit 102, and the description rule database 106 can be implemented on a hard disk or on a server on a communication network, and the comparison degree setting unit 107 can be implemented with a keyboard or pointing device (not shown).

FIG. 2 outlines the configuration of a processing content determination device according to an example of the present invention. This processing content determination device 200 includes a meta language 203 that reads a meta language source 201 and outputs a corresponding parser 202. Here, a meta language is a tool that generates the grammar analysis function for an existing language. Typical examples are "Yacc" (Yet Another Compiler Compiler), a tool that automatically generates C programs for parsing, and "Prolog".

The first source/language 204 and the second source/language 205, the works input to the processing content determination device 200, are written in different languages. A first parser 206 reads the first source/language 204 and outputs a first parse tree 207. A second parser 208 reads the second source/language 205 and outputs a second parse tree 209.

Before comparing, a parse tree comparator 211 reads a comparison unit criterion 212 and, likewise, a similarity determination criterion 213. It then reads the first parse tree 207 and the second parse tree 209 to be compared, and divides each of them into a set of subtrees according to the conditions specified by the comparison unit criterion 212. Next, the parse tree comparator 211 compares every subtree of the first parse tree 207 with every subtree of the second parse tree 209. For every comparison result it determines the similarity as specified by the similarity determination criterion 213 and outputs a comparison result list 214.
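The comparator's flow (split into subtrees, compare all pairs, judge against a criterion, emit a list) can be sketched in Python. Everything concrete here is an assumption for illustration: the comparison unit criterion is modeled as a set of root labels (`unit_labels`), the similarity determination criterion as a numeric `threshold`, and the score as `difflib.SequenceMatcher` applied to preorder label sequences.

```python
from difflib import SequenceMatcher
from itertools import product


def subtrees(tree, unit_labels):
    """Yield every subtree whose root label is a comparison unit."""
    label, children = tree
    if label in unit_labels:
        yield tree
    for child in children:
        yield from subtrees(child, unit_labels)


def shape(tree):
    """Flatten a tree into its preorder list of labels."""
    label, children = tree
    labels = [label]
    for child in children:
        labels.extend(shape(child))
    return labels


def compare_trees(tree1, tree2, unit_labels, threshold):
    """Compare all subtree pairs and return a comparison result list.

    Each entry is (shape1, shape2, score, judged_similar).
    """
    results = []
    pairs = product(subtrees(tree1, unit_labels),
                    subtrees(tree2, unit_labels))
    for s1, s2 in pairs:
        score = SequenceMatcher(None, shape(s1), shape(s2)).ratio()
        results.append((shape(s1), shape(s2), score, score >= threshold))
    return results


if __name__ == "__main__":
    t1 = ("program", [
        ("function", [("if", []), ("return", [])]),
        ("function", [("loop", [("call", [])])]),
    ])
    t2 = ("program", [("function", [("if", []), ("return", [])])])
    for row in compare_trees(t1, t2, {"function"}, threshold=0.8):
        print(row)
```

The all-pairs loop is quadratic in the number of subtrees, so the choice of comparison unit directly controls the cost of a run.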

Thus the processing content determination device 200 of this example has a part that creates the formal-language parser 202, a part that generates the parse trees 207 and 209 for each language, and a part that compares the parse trees 207 and 209. The generation of per-language parse trees 207 and 209 is performed only when needed. The processing content determination device 200 can take many configurations; this example shows only one particular pattern.

In the processing content determination device 200 shown in FIG. 2, the parser 202 is generated by the meta language 203. As the broken lines indicate, the generated parser 202 is used as the first parser 206 and the second parser 208. Instead of creating and using two different parsers 206 and 208 in this way, a single parser may be used.

FIG. 3 shows the configuration of the parse tree comparator in detail. The processing content determination apparatus 200 of this embodiment is described with reference to FIGS. 2 and 3. Because this embodiment deals with the meta language 203, the drawings deliberately do not distinguish in notation between data and processing steps. Such a distinction could be drawn in places, but since an object being processed can itself become data, giving the same thing a different notation in different parts of a drawing could make it look like a different thing.

First, in step S301, the first processing step, the meta language 203 reads the meta language source 201 and outputs the corresponding parser 202, as described above. The meta language source 201 consists of, for example, the grammar of the language in which the works to be compared are written, together with the side effects that convert the grammatical structure into a tree structure.

The language of a work may be a natural language or an artificial language, as long as its grammar can be interpreted as a formal language based on a theory amenable to information processing, such as CFG (Context Free Grammar) or DCG (Definite Clause Grammar). In practice, however, something that can process the language and serves as the meta language 203 is required, such as "Yacc" (Yet Another Compiler-Compiler) or "Prolog" (PROgramming in LOGic). Prolog was originally developed for natural language analysis.

As an implementation measure, for a language system that requires a preprocessor, the meta language source 201 is preprocessed in advance. The parser 202 output from the meta language 203 is used as the first and second parsers 206 and 208 in step S302.

In the next step S302, the first and second parsers 206 and 208, obtained from the meta language 203 and the meta language source 201, each read a work written in their target language. Here, the first source/language 204 and the second source/language 205 are read as the works. The first parser 206 reads the first source/language 204 and outputs the first parse tree 207. The second parser 208 reads the second source/language 205 and outputs the second parse tree 209. For example, the first parser 206 can be thought of as generating a parse tree from C source, and the second parser 208 as generating a parse tree from BASIC source.

Because the two works are expressed as the first and second parse trees 207 and 209 in step S302, works written in different languages take on an equivalent descriptive form, and a formal comparison becomes possible. Various output formats are conceivable, such as a tree structure, a network structure, or XML (Extensible Markup Language), but to compare the two works the representation must be unified into one of them. For example, a procedural language, that is, a programming language that executes the written instructions one after another and changes the contents of variables according to the results, is made up of sequence, conditional branching, repetition, and jumps. These are replaced in XML according to the following first to fifth relational rules R₁ to R₅. For a network structure the conversion format itself would need to be examined, but that lies outside the gist of the present invention and is not described here.

Relational rule R₁: "Sequence" is a run of tags at the same level, one per statement.
Relational rule R₂: "Conditional branch" is a branch condition expression plus a run of same-level tags for the branches.
Relational rule R₃: "Repetition" is a repetition condition expression plus a run of same-level tags to be repeated.
Relational rule R₄: "Jump" is a tag that points to the node obtained by converting the token that denotes the label.
Relational rule R₅: "Direct recursive call" is not given special treatment as a tag (it is compared through tag options).

Here, the "direct recursive call" of relational rule R₅ means calling a procedure from within that same procedure. It is distinguished from an indirect recursive call, in which a procedure (say, b) called from a procedure (say, a) in turn calls the original procedure (a). A token is the smallest unit of recognition, with whitespace and comments removed.

Taking the C language as an example, the following shows how the first and second parse trees 207 and 209 are converted to XML, following BNF (Backus Naur Form) as one meta language. No example is given for relational rule R₁: its BNF is too complex, and the rule in this embodiment simply lists essentially equivalent statements side by side at the same depth, so an example would add nothing essential.

Example of relational rule R₂
[BNF for conditional branching]
selection_statement
: IF '(' expression ')' statement-block
| IF '(' expression ')' statement-block ELSE statement-block
| SWITCH '(' expression ')' statement-block
;
[XML for conditional branching (only IF '(' expression ')' statement-block ELSE statement-block is shown)]
<select key='if' expression=''>
<statement-block eval='true'>…</statement-block>
<statement-block eval='false'>…</statement-block>
</select>
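The R₂ conversion above can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration under stated assumptions: the function name `if_else_to_xml` and its three arguments are hypothetical, the statements are passed in as plain strings rather than parse-tree nodes, and the tag and option names simply follow the provisional XML form shown in the example.

```python
import xml.etree.ElementTree as ET


def if_else_to_xml(condition, then_stmts, else_stmts):
    """Build a <select> element for an if/else statement (rule R2).

    The arguments are hypothetical stand-ins for what a parser would supply.
    """
    select = ET.Element("select", {"key": "if", "expression": condition})
    for branch_value, stmts in (("true", then_stmts), ("false", else_stmts)):
        block = ET.SubElement(select, "statement-block", {"eval": branch_value})
        for text in stmts:
            # Rule R1: sequential statements become sibling tags at one level.
            ET.SubElement(block, "statement").text = text
    return select


node = if_else_to_xml("x > 0", ["y = 1;", "z = 2;"], ["y = 0;"])
xml_text = ET.tostring(node, encoding="unicode")
print(xml_text)
```

The same shape of function would serve for the `SWITCH` alternative by changing the `key` option and emitting one `statement-block` per case.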

Example of relational rule R₃
[BNF for repetition]
iteration_statement
: WHILE '(' expression ')' statement-block
| DO statement WHILE '(' expression ')' ';'
| FOR '(' expression_statement expression_statement ')' statement-block
| FOR '(' expression_statement expression_statement expression ')' statement-block
| FOR '(' declaration expression_statement ')' statement-block
| FOR '(' declaration expression_statement expression ')' statement-block
;
[XML for repetition (only FOR '(' expression_statement expression_statement expression ')' statement-block is shown)]
<iteration key='for' eval='expression_statement'>
<statement-block>…</statement-block>
</iteration>

Example of relational rule R₄
[BNF for jumps]
jump_statement
: GOTO IDENTIFIER ';'
| CONTINUE ';'
| BREAK ';'
| RETURN ';'
| RETURN expression ';'
;
labeled_statement
: IDENTIFIER ':' statement-block
| CASE constant_expression ':' statement-block
| DEFAULT ':' statement-block
;
[XML for jumps (only GOTO IDENTIFIER ';' and IDENTIFIER ':' statement-block are shown)]
<jump key='goto' eval='<IDENTIFIER>'/>
:
<label key='<IDENTIFIER>'/>

Example of relational rule R₅
Direct recursive calls are naturally covered by the BNF for R₁ to R₄, so no separate BNF is given.
[XML for a direct recursive call]
<block rootid='<FUNCTIONAL>'>
:
<statement procid='<FUNCTIONAL>'>
:
</block>

Here, "FUNCTIONAL" corresponds to a C function, and a case where the two values match is treated as recursion. For other procedural languages, it can be read as a procedure name (for example, "PROCEDURE").
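The matching described for R₅ can be sketched as follows, assuming the provisional `rootid` and `procid` options from the example and a hypothetical enclosing `<program>` element. A block counts as directly recursive when one of the statements nested inside it carries a `procid` equal to the block's own `rootid`.

```python
import xml.etree.ElementTree as ET


def directly_recursive_blocks(xml_text):
    """Return the rootid of every block that calls itself (rule R5)."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for block in root.iter("block"):
        name = block.get("rootid")
        if name is not None and any(
            stmt.get("procid") == name for stmt in block.iter("statement")
        ):
            found.append(name)
    return found


# Hypothetical sample: "fact" calls itself, "main" only calls "fact".
SAMPLE = """
<program>
  <block rootid="fact">
    <statement procid="fact"/>
  </block>
  <block rootid="main">
    <statement procid="fact"/>
  </block>
</program>
"""

recursive = directly_recursive_blocks(SAMPLE)
print(recursive)
```

Indirect recursion (a calls b, b calls a) is deliberately not detected here, in line with the distinction drawn for R₅.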

Although no example was given for relational rule R₁, the converted form of an arbitrary statement is represented here by a tag named "statement".

In the examples above, a tag represents a node of the tree structure shown in FIG. 3. Options are used to record the description and characteristics of the source work. The XML conversion shown is only provisional. The options of the XML tags "statement-block", "select", "iteration", "jump", and "label" therefore need not carry the particular notions used here, such as "key" for a reserved word or "eval" for an evaluation. It is enough that structures such as relational rules R₁ to R₅ can be determined from the tag names, options, and tag hierarchy of the XML.

In these examples, "key" denotes the reserved word of the artificial language that matched the BNF rule. Upper-case strings denote arbitrary or specific tokens of the C language (reserved words or user-defined words). "eval" denotes a user-defined word or an expression.

Formal languages also include language systems other than structured procedural languages such as C. In functional and logic languages, for example, the statements themselves have a tree structure, which can be expected to serve directly in judging the similarity of how sources process data. Unstructured procedural languages, on the other hand, may be difficult to represent as a tree.

Some natural languages, such as English, are comparatively easy to handle as formal languages. For such natural languages, application to document similarity determination is also conceivable. Even in such a language, however, texts like prose poems either do not follow an established grammar or, although some can be explained by transformational generative grammar, lack punctuation and the like, so generating a parse tree for them is expected to be difficult.

The processing content determination apparatus 200 of this embodiment does not need to pursue rigor in syntax analysis. As long as the syntax can be analyzed even at an outline level, a language-independent parse tree can be generated and the similarity of the works under comparison can be judged. The present invention is therefore widely applicable. In the present invention, the tokens of a work and their types can be included as attributes of the parse tree.

Next, step S303 is described.
The parse tree comparator 211 compares the first and second parse trees 207 and 209 in the following manner.

The parse tree comparator 211 first reads the comparison unit criterion 212 and the similarity determination criterion 213 before the comparison. The comparison unit criterion 212, the similarity determination criterion 213, and the first and second parse trees 207 and 209 may be read in any order.

Next, the parse tree comparator 211 divides the first and second parse trees 207 and 209 it has read into smaller unit blocks (branches). The division follows a block division rule (pruning rule), described next.

The comparison unit criterion 212 used for division into blocks is implemented as environment settings (a settings file, the registry, environment variables, and so on). Conditions such as the block unit are set in the comparison unit criterion 212, and the parse tree comparator 211 reads these conditions before the comparison.

Among the blocking conditions, the mandatory condition specifies the unit to be split off from the root, such as a function definition or a method definition. In the XML conversion example above, a depth in the tag hierarchy is specified, and when the parse tree is read, everything at or below that depth is pruned off as a block. Concretely, the unit of a block is expected to be a self-contained unit that is not part of another procedure (for example, a C function) or a method-level scope.
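As a minimal sketch of this depth-based division, assuming the parse tree is held as an `xml.etree.ElementTree` element and that the comparison unit criterion reduces to a single integer depth (both are assumptions for illustration; `split_into_blocks` is a hypothetical name):

```python
import xml.etree.ElementTree as ET


def split_into_blocks(root, depth):
    """Return the subtrees rooted at the given tag depth (the root is depth 0)."""
    level = [root]
    for _ in range(depth):
        # Step one level down: collect the children of every node in this level.
        level = [child for node in level for child in node]
    return level


# Hypothetical parse tree of a C-like source with two functions f and g.
TREE = ET.fromstring(
    "<unit>"
    "<block rootid='f'><statement/><statement/></block>"
    "<block rootid='g'><select><statement-block/></select></block>"
    "</unit>"
)

blocks = split_into_blocks(TREE, 1)
print([b.get("rootid") for b in blocks])
```

For a source whose functions sit under namespace and class scopes, the same function would be called with a larger depth, which is one way to read the adjustment of the block division rule.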

In a language such as C#, class and namespace scopes sit above the function scope, so functions are not necessarily found directly under the root. The block division rule therefore needs to be adjusted. As an optional supplementary condition, branches that turn out to have very few levels after blocking (zero or one level) can be excluded from comparison. Whether they are needed depends on the purpose of the comparison; when the question is whether algorithms match, for example, comparing them is meaningless, and omitting them reduces the number of combinations and greatly shortens the processing time.

When the question is copyright infringement of a document, on the other hand, comparison at the level of shallower branches is also needed. The optional supplementary condition exists so that trivial subtrees are not compared; it does not mean that the subtrees being compared are not examined down to their leaves.

The pruning rule used in this embodiment is applied top-down, so the subtrees that serve as comparison units are generated from the viewpoint of the overall syntax. Trivial subtrees also appear as a result, but the desired processing units can be extracted reliably as subtrees. Many trivial subtrees are easy to recognize, so discarding a subtree as soon as it is judged trivial improves processing efficiency.
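One way to picture the optional supplementary condition is a filter on subtree height. The sketch below assumes that "trivial" means a block with zero or one level beneath its root, which is an interpretation of the text rather than a rule it spells out; `height`, `drop_trivial`, and `min_height` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def height(node):
    """Number of levels below this node (a leaf has height 0)."""
    return 0 if len(node) == 0 else 1 + max(height(child) for child in node)


def drop_trivial(blocks, min_height=2):
    """Keep only blocks deep enough to be worth comparing."""
    return [b for b in blocks if height(b) >= min_height]


blocks = [
    ET.fromstring("<block rootid='f'><statement/></block>"),
    ET.fromstring(
        "<block rootid='g'><select><statement-block><statement/>"
        "</statement-block></select></block>"
    ),
]

kept = drop_trivial(blocks)
print([b.get("rootid") for b in kept])
```

Setting `min_height` to 0 disables the filter, which corresponds to the case where even shallow branches must be compared.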

Once the first and second parse trees 207 and 209 are divided into smaller unit blocks (branches), each can be regarded as a set of such unit blocks. As a special case, the whole tree may be treated as a single block. The comparison is made, over the sets of blocks (branches) of the first and second parse trees 207 and 209, for combinations of a branch X of the first parse tree 207 and a branch Y of the second parse tree 209. Because each unit block on one side is used as a query against the other side, the time complexity of the search for one block is O(n), where n is the number of unit blocks on the other side. With hashing or a suitably used network type data structure, this lookup can be reduced to O(1).
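The hashing idea can be illustrated with a dictionary keyed by a structural signature. This is a sketch under assumptions, not the method of the embodiment: the signature used here is a canonical string of tag names and nesting with options ignored, so it only finds blocks whose structure matches exactly, and computing the signature itself still costs time proportional to the block size.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def signature(node):
    """Canonical string of tag names and nesting only (options are ignored)."""
    return node.tag + "(" + ",".join(signature(child) for child in node) + ")"


def build_index(blocks):
    """Map each structural signature to the blocks that have it."""
    index = defaultdict(list)
    for block in blocks:
        index[signature(block)].append(block)
    return index


blocks_b = [
    ET.fromstring("<block><statement/><statement/></block>"),
    ET.fromstring("<block><select><statement-block/></select></block>"),
]
index = build_index(blocks_b)

# One dictionary lookup replaces a scan over all n blocks of the other side.
query = ET.fromstring("<block rootid='f'><statement/><statement/></block>")
matches = index.get(signature(query), [])
print(len(matches))
```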

The parse tree comparator 211 makes a determination for every combination according to the similarity determination criterion 213. Each combination of blocks and its determination result are output as the comparison result list 214. Normally the list covers all determinations, but in some cases it may be created for only some of them.

These operations of the parse tree comparator 211 are shown in detail in FIG. 3. In the example of FIG. 3, the two works A and B are divided into blocks (A₁, A₂, ...) and (B₁, B₂, ...), respectively, and the determination results obtained with the similarity determination criterion are output as the comparison result list 214.

The similarity determination criterion includes degrees of matching and the criteria that decide them, classified for example as follows.
(a) Completely matching language system: structure and attributes match (for example, for detecting copyright infringement)
(b) Partially matching language system: structure matches (for example, for detecting patent infringement)
(c) Mismatch
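The three-way classification and the all-pairs comparison can be sketched together. The sketch assumes that "structure" means tag names and nesting and that "attributes" means the tag options; the result labels and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import product


def same_structure(a, b):
    """True when tag names and nesting agree, ignoring options."""
    return (a.tag == b.tag and len(a) == len(b)
            and all(same_structure(x, y) for x, y in zip(a, b)))


def same_attributes(a, b):
    """True when options agree at every node; call only on same-shaped trees."""
    return (a.attrib == b.attrib
            and all(same_attributes(x, y) for x, y in zip(a, b)))


def judge(a, b):
    if not same_structure(a, b):
        return "mismatch"                       # case (c)
    return "complete" if same_attributes(a, b) else "partial"  # (a) or (b)


def compare_all(blocks_a, blocks_b):
    """Comparison result list: (index in A, index in B, judgment) for every pair."""
    return [(i, j, judge(a, b))
            for (i, a), (j, b) in product(enumerate(blocks_a), enumerate(blocks_b))]


A = [ET.fromstring("<select key='if'><statement-block eval='true'/></select>"),
     ET.fromstring("<iteration key='for'/>")]
B = [ET.fromstring("<select key='if'><statement-block eval='true'/></select>"),
     ET.fromstring("<select key='switch'><statement-block eval='true'/></select>")]

results = compare_all(A, B)
print(results)
```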

The embodiment described above provides the following effects.

This embodiment is configured to compare subtrees of parse trees, so the grammatical structures of works written in formal languages can be compared. Formal languages include not only artificial languages but also natural languages whose grammar has been put in order, so the range of application is wide. How meaningful comparisons across the procedural, functional, and logic language paradigms are is unclear, but within each paradigm the determination criteria are meaningful. Even for a natural language, provided it can be treated as a formal language, checks short of copyright infringement are possible, such as whether the grammatical structure of a translation is correct.

In addition, because similarity is judged by means of parse trees, processing is independent of the kind of formal language. Because subtrees are used as the comparison unit, whether trivial subtrees are compared can be adjusted as needed, and the comparison can be made in appropriate units.

<Modification of the Embodiment>

The present invention is of course not limited to the embodiment described above. For example, the embodiment used a hierarchical network based on the tree structure of a parse tree, but the parse tree may instead be represented by a network type data structure. This is described as a modification.

A node of the network data structure is one of the following: the root, which is the topmost node; a branch point; or a leaf, which is a node with no node below it. In this modification the root is the initial node. Each token is then numbered in the order in which it appears during analysis. These tokens, the smallest units of recognition with whitespace and comments removed, are treated as nodes.

In terms of the XML example, wherever a block scope (the equivalent of statement-block) appears, in a conditional branch, in a repetition, or in some cases during sequential processing, the network branches at that point.

As far as the technical idea of the present invention is concerned, conditional branching and repetition differ only in being different control flows. Each node therefore holds as attributes only which of these structures it is and the condition of the branch or repetition. Note, however, that when the network data structure is processed in this modification, each subtree obtained by pruning must be re-recognized as a separate network. Consequently, the unique numbers assigned to the nodes of each branch are reassigned in order, with the root given the same value in every subtree.

Following this, the unique numbers of the branches (edges) connected to the nodes are also recalculated and reassigned. The unique numbers of nodes and branches are reassigned because the network type data structure in question depends on these values: if the numbers do not match, two networks would normally be judged, blindly, to have different structures.

In this modification, therefore, the parse tree comparator 211 shown in FIG. 2 can make the prescribed similarity determination by comparing the unique numbers of the nodes and branches together with the attributes of each node.
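The renumbering step can be sketched as below. The representation is an assumption made for illustration: nodes are a dictionary from original number to attribute, branches are (source, destination) pairs, and the position of a branch in the sorted output list plays the role of its reassigned number. `renumber` is a hypothetical name.

```python
def renumber(root, nodes, edges):
    """Renumber a pruned sub-network so that its root becomes 0.

    The remaining nodes follow in their original order of appearance.
    Returns (nodes, edges) expressed in the new numbering.
    """
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    # Collect every node reachable from the root of the pruned part.
    reachable, stack = set(), [root]
    while stack:
        n = stack.pop()
        if n not in reachable:
            reachable.add(n)
            stack.extend(children.get(n, []))

    order = [root] + sorted(reachable - {root})
    new_id = {old: new for new, old in enumerate(order)}
    new_nodes = {new_id[old]: nodes[old] for old in order}
    new_edges = sorted((new_id[s], new_id[d]) for s, d in edges if s in reachable)
    return new_nodes, new_edges


# Two hypothetical works: the same shape, numbered differently before pruning.
nodes_a = {1: "root", 10: "select", 11: "statement-block", 12: "statement-block"}
edges_a = [(1, 10), (10, 11), (10, 12)]
nodes_b = {40: "select", 45: "statement-block", 47: "statement-block"}
edges_b = [(40, 45), (40, 47)]

net_a = renumber(10, nodes_a, edges_a)
net_b = renumber(40, nodes_b, edges_b)
print(net_a == net_b)
```

After renumbering, a plain equality test on numbers and attributes is enough to recognize the two sub-networks as the same, which is the situation the paragraph above describes.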

Adopting such a network type data structure can be expected to serve as an alternative to XML, which is inefficient to process. When processing a network type data structure, it is effective to convert it into a one-dimensional data structure that a computer can handle easily. For this, the technique disclosed in a related art proposed by the present inventor (Japanese Patent Laid-Open No. 10-078911) can be used, for example.

The embodiments of the present invention described above can also be applied to listing differences in structural units when a work is revised. Conventionally, differences were checked line by line, which made checking cumbersome depending on the nature of the work; this problem can be resolved. The embodiments are also applicable to checking grammatical structure in translation and to listing structural identities, independent of character strings, in works that have been appropriated illegally.

FIG. 1 is a system configuration diagram of a processing content determination system using the processing content determination apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram of the configuration of the processing content determination apparatus according to one example of the present invention.
FIG. 3 is an explanatory diagram showing in detail the configuration of the parse tree comparator in this example.
FIG. 4 is a system configuration diagram outlining the configuration of the source program comparison information generation system according to the second related art.

Explanation of symbols

100 Processing content determination system
101 First comparison target description data storage unit
102 Second comparison target description data storage unit
103 First comparison target description data
104 Second comparison target description data
105, 200 Processing content determination apparatus
106 Description rule database
107 Comparison degree setting unit
108 Similarity determination display unit
111 Comparison target description data input unit
112 Description rule analysis unit
114 Similarity calculation unit
115 Calculation result
201 Meta language source
202 Parser
203 Meta language
204 First source/language
205 Second source/language
206 First parser
207 First parse tree
208 Second parser
209 Second parse tree
211 Parse tree comparator
212 Comparison unit criterion
213 Similarity determination criterion
214 Comparison result list

Claims

1. A processing content determination apparatus comprising:
comparison target description input means for inputting the descriptions, each assembled using a predetermined relational rule, that are to be compared in a similarity determination;
description composition content relation map generating means for analyzing each of the descriptions to be compared using the relational rule and generating a description composition content relation map that represents the relations among the description elements constituting each description; and
similarity calculation means for comparing, between the descriptions to be compared, the description composition content relation maps generated by the description composition content relation map generating means, and calculating the degree to which the maps match as a similarity.

2. The processing content determination apparatus according to claim 1, wherein each description assembled using the predetermined relational rule and to be compared in the similarity determination is a description having a grammatical structure, the description composition content relation map generating means is parse tree generating means for converting each description to be compared into a single language-independent parse tree by analyzing its grammatical structure, and the similarity calculation means is means for comparing the parse trees of the descriptions to be compared generated by the parse tree generating means and calculating the degree to which the contents of their structures match as a similarity.

3. The processing content determination apparatus according to claim 1, wherein each description assembled using the predetermined relational rule and to be compared in the similarity determination is a description having a grammatical structure, the description composition content relation map generating means is network type data structure generating means for converting each description to be compared into a single language-independent network, and the similarity calculation means is means for comparing the network type data structures of the descriptions to be compared generated by the network type data structure generating means and calculating the degree to which the contents of their structures match as a similarity.

4. The processing content determination apparatus according to claim 3, wherein the description composition content relation map generating means comprises preprocessing means for translating the descriptions into a common language when the descriptions to be compared in the similarity determination are written in different languages.

5. The processing content determination apparatus according to claim 2, wherein the similarity calculation means comprises block designation means for designating, as a block, the range of a description to be compared, the comparison range being determined by a hierarchy depth in the tree structure.

The processing content determination apparatus according to claim 2, wherein the syntax analysis performed by the parse tree generation means uses a CFG (Context Free Grammar) analysis tool for a procedural artificial language and a meta language having a DCG (Definite Clause Grammar) analysis function for a natural language.

The processing content determination apparatus according to claim 1, wherein the similarity calculation result of the similarity calculation means indicates a match or a mismatch of the descriptions to be compared, or an intermediate state between the two.

The processing content determination apparatus according to claim 2, further comprising dividing means for dividing the parse tree generated by the parse tree generation means into a set of subtrees, wherein the similarity calculation means compares all pairwise combinations of the subtrees and calculates the similarity for a block.
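
The division into subtrees and the exhaustive pairwise comparison can be sketched as follows. The tuple encoding of a tree, the similarity score, and the `min_size` threshold (used here so that single leaves do not match trivially) are all assumptions made for illustration and are not taken from the claim.

```python
from itertools import product

def subtrees(tree):
    """Yield the subtree rooted at every node of the tree."""
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)

def size(tree):
    return 1 + sum(size(child) for child in tree[1:])

def matched(a, b):
    """Count nodes that coincide when both trees are walked top-down in parallel."""
    if a[0] != b[0]:
        return 0
    return 1 + sum(matched(x, y) for x, y in zip(a[1:], b[1:]))

def similarity(a, b):
    return 2 * matched(a, b) / (size(a) + size(b))

def best_subtree_similarity(a, b, min_size=2):
    """Compare every subtree of a with every subtree of b and return the best score."""
    parts_a = [t for t in subtrees(a) if size(t) >= min_size]
    parts_b = [t for t in subtrees(b) if size(t) >= min_size]
    return max((similarity(x, y) for x, y in product(parts_a, parts_b)), default=0.0)

# The roots differ, but the same assignment block appears in both trees.
tree_a = ("function", ("assign", ("var",), ("const",)), ("return", ("var",)))
tree_b = ("module", ("function", ("print", ("var",)),
                     ("assign", ("var",), ("const",))))

print(similarity(tree_a, tree_b))               # whole trees: roots differ
print(best_subtree_similarity(tree_a, tree_b))  # the shared block is found
```

Comparing all pairs finds a copied block even when it sits at a different position or depth in the two trees, at the cost of a number of comparisons that grows with the product of the two subtree counts.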

A processing content determination method comprising:
a comparison target description input step for inputting each description that is assembled using a predetermined relational law and is a comparison target of similarity determination;
a description composition content relation map generation step for analyzing each of the descriptions to be compared by using the relational law and generating a description composition content relation map representing the relations among the description components that constitute the content of each description; and
a similarity calculation step for comparing, between the descriptions to be compared, the description composition content relation maps generated in the description composition content relation map generation step and calculating the degree of coincidence of these maps as a similarity.

The processing content determination method according to claim 9, wherein each description that is assembled using the predetermined relational law and is a comparison target of the similarity determination is a description having a grammatical structure, the description composition content relation map generation step is a parse tree generation step for generating a language-independent parse tree by analyzing the grammatical structure of each description to be compared individually, and the similarity calculation step is a step of comparing the parse trees of the descriptions to be compared, generated in the parse tree generation step, and calculating the degree of coincidence of the contents of these structures as a similarity.

The processing content determination method according to claim 9, wherein each description that is assembled using the predetermined relational law and is a comparison target of the similarity determination is a description having a grammatical structure, the description composition content relation map generation step is a network-type data structure generation step for generating a language-independent network by analyzing the grammatical structure of each description to be compared individually, and the similarity calculation step is a step of comparing the network-type data structures of the descriptions to be compared, generated in the network-type data structure generation step, and calculating the degree of coincidence of the contents of these structures as a similarity.