JP2021179932A

JP2021179932A - Document processing apparatus, document processing method, and program

Info

Publication number: JP2021179932A
Application number: JP2020086310A
Authority: JP
Inventors: 秀夫伊東; Hideo Ito
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2020-05-15
Filing date: 2020-05-15
Publication date: 2021-11-18

Abstract

To determine the degree of correspondence between a plurality of documents.

SOLUTION: A document processing apparatus comprises: an input unit that receives input of a plurality of documents; a document dividing unit that divides the documents into words being characters or character strings and divides the documents into portions being groups larger than the words; a word correspondence degree calculation unit that calculates a word correspondence degree that is the degree of correspondence between the words in the different documents; a portion correspondence degree calculation unit that calculates a portion correspondence degree that is the degree of correspondence between the portions in the different documents; and a document correspondence degree calculation unit that calculates a document correspondence degree that is the degree of correspondence between the documents based on the degree of correspondence between the words determined by using the word correspondence degree and the portion correspondence degree.

SELECTED DRAWING: Figure 1

Description

本発明は、文書処理装置、文書処理方法、及び、プログラムに関する。 The present invention relates to a document processing apparatus, a document processing method, and a program.

類似する内容の文書を検索する技術が知られている。 A technique for searching documents with similar contents is known.

For example, in information retrieval, the distance or similarity between documents is calculated based on the frequency of occurrence of the words contained in the documents. The documents are then presented in order of increasing distance or decreasing similarity. A technique for searching for documents similar to a given document in this way is known (for example, Non-Patent Document 1).

A document contains elements other than characters or character strings (hereinafter referred to as "words"). That is, a document is composed of groups larger than words (hereinafter referred to as "parts"). In the conventional technique, however, similarity is determined based on words alone. As a result, the conventional technique, which relies on correspondences between words defined independently of the correspondences between parts, cannot determine how well documents correspond to each other (hereinafter, the degree to which different documents correspond is referred to as the "degree of correspondence"). That is, in the conventional technique, the method that uses the degree of correspondence between words does not take the degree of correspondence between parts into account.

開示の技術は、複数の文書における対応度を判定することを目的とする。 The disclosed technique aims to determine the degree of correspondence in multiple documents.

The disclosed technique is a document processing apparatus including:
an input unit that receives input of a plurality of documents;
a document division unit that divides each document into words, which are characters or character strings, and into parts, which are groups larger than the words;
a word correspondence degree calculation unit that calculates a word correspondence degree, which is the degree of correspondence between the words of different documents;
a partial correspondence degree calculation unit that calculates a partial correspondence degree, which is the degree of correspondence between the parts of different documents; and
a document correspondence degree calculation unit that calculates a document correspondence degree, which is the degree of correspondence between the documents, based on the degree of correspondence between words defined by using the word correspondence degree and the partial correspondence degree.

複数の文書における対応度を判定することができる。 It is possible to determine the degree of correspondence in a plurality of documents.

FIG. 1 is a diagram showing a functional configuration example.
FIG. 2 is a diagram showing a hardware configuration example.
FIG. 3 is a diagram showing an overall processing example.
FIG. 4 is a diagram showing an example of documents.
FIG. 5 is a diagram showing an example of division.
FIG. 6 is a diagram showing an example of the calculation results of the word correspondence degree.
FIG. 7 is a diagram showing an example of the calculation results of the partial correspondence degree.
FIG. 8 is a diagram showing an example of the calculation results of the joint probability.

以下に、図面を参照して具体例を説明し、本発明に係る文書処理装置の実施形態について説明する。 Hereinafter, a specific example will be described with reference to the drawings, and an embodiment of the document processing apparatus according to the present invention will be described.

<Functional configuration example>
FIG. 1 is a diagram showing a functional configuration example. For example, the document processing apparatus 100 has a functional configuration including an input unit 101, a document division unit 102, a word correspondence degree calculation unit 103, a partial correspondence degree calculation unit 104, and a document correspondence degree calculation unit 105.

これらの機能は、例えば、以下のようなハードウェアによって実現する。 These functions are realized by, for example, the following hardware.

<Hardware configuration example>
FIG. 2 is a diagram showing a hardware configuration example. For example, the document processing apparatus 100 is an information processing apparatus including an input device 11, an output device 12, a drive device 13, an auxiliary storage device 14, a memory device 15, an arithmetic processing device 16, and an interface device 17, which are connected to each other by a bus B.

入力装置１１は、様々な情報を入力する装置である。例えば、入力装置１１は、キーボード、マウス、又は、ポインティングデバイス等である。 The input device 11 is a device for inputting various information. For example, the input device 11 is a keyboard, a mouse, a pointing device, or the like.

出力装置１２は、様々な情報を出力する装置である。例えば、出力装置１２は、ディスプレイ等である。 The output device 12 is a device that outputs various information. For example, the output device 12 is a display or the like.

インタフェース装置１７は、ネットワーク、又は、ケーブルを介して外部装置とデータを送受信するための装置である。 The interface device 17 is a device for transmitting / receiving data to / from an external device via a network or a cable.

ドライブ装置１３は、記憶媒体１８とデータを送受信する装置である。 The drive device 13 is a device that transmits / receives data to / from the storage medium 18.

補助記憶装置１４は、インストールされたプログラム、ファイル、及び、データ等を格納する。例えば、補助記憶装置１４は、ハードディスク等である。 The auxiliary storage device 14 stores installed programs, files, data, and the like. For example, the auxiliary storage device 14 is a hard disk or the like.

メモリ装置１５は、主記憶装置の例である。そのため、プログラムが実行されると、演算装置及び制御装置等と協働して動作し、処理を実行する。 The memory device 15 is an example of a main storage device. Therefore, when the program is executed, it operates in cooperation with the arithmetic unit, the control unit, and the like, and executes the process.

なお、ハードウェアの構成は、図示する構成に限られない。すなわち、文書処理装置１００は、図示する以外の演算装置、制御装置、記憶装置、入力装置、出力装置、又は、周辺機器を外部又は内部に更に有する構成でもよい。 The hardware configuration is not limited to the configuration shown in the figure. That is, the document processing device 100 may be configured to further have an arithmetic unit, a control device, a storage device, an input device, an output device, or a peripheral device other than those shown in the figure.

例えば、入力部１０１は、入力装置１１等で実現される。具体的には、入力装置１１を操作して入力される文字等を記憶して文書を入力する。 For example, the input unit 101 is realized by an input device 11 or the like. Specifically, the input device 11 is operated to store the characters and the like to be input, and the document is input.

なお、入力部１０１は、入力装置１１で実現するに限られない。例えば、入力部１０１は、インタフェース装置１７又はドライブ装置１３等で入力するテキストデータ又は文書データ等の形式で文書を入力してもよい。 The input unit 101 is not limited to the input device 11. For example, the input unit 101 may input a document in the form of text data, document data, or the like input by the interface device 17, the drive device 13, or the like.

また、文書は、入力するハードウェア及びデータの形式を問わない。したがって、例えば、音声認識処理又は変換処理等によって生成された文書が用いられてもよい。 In addition, the document may be input in any hardware and data format. Therefore, for example, a document generated by voice recognition processing, conversion processing, or the like may be used.

The document division unit 102, the word correspondence degree calculation unit 103, the partial correspondence degree calculation unit 104, and the document correspondence degree calculation unit 105 are realized, for example, by the memory device 15 and the arithmetic processing device 16 cooperating to execute processing. Specifically, they are realized by executing the following processing.

<Overall processing example>
FIG. 3 is a diagram showing an example of overall processing. For example, the document processing apparatus 100 executes the following processing.

Document input processing example (step S1)
The input unit 101 inputs documents. The documents to be checked for correspondence may be documents stored in a storage device in advance. That is, the document processing apparatus 100 may be configured to accumulate documents and the like created in the past.

図４は、文書の例を示す図である。以下、図示するような２つの文書を例にして説明する。以下に示す例では、第１文書Ｄ１、及び、第２文書Ｄ２の２つの文書が入力される。 FIG. 4 is a diagram showing an example of a document. Hereinafter, two documents as shown in the figure will be described as an example. In the example shown below, two documents, the first document D1 and the second document D2, are input.

The first document D1 is a three-line document that has the four characters "公", "知", "情", and "報" on the first line, the four characters "所", "有", "情", and "報" on the second line, and the four characters "独", "自", "情", and "報" on the third line.

The second document D2 is a two-line document that has the four characters "保", "有", "情", and "報" on the first line and the four characters "公", "知", "情", and "報" on the second line.

また、以下に示す例は、１文字を「単語」として扱う例である。さらに、以下の例は、１行単位で「部分」として扱う例である。なお、１行は、改行コードによって認識されるとする。また、改行コードは、図には明示しないが、文書がデータとして有する（図示する例では、改行のある箇所に改行コードがある）。したがって、文書処理装置１００は、改行コードによって１行を認識できる。 Further, the example shown below is an example in which one character is treated as a "word". Further, the following example is an example of treating as a "part" in units of one line. It is assumed that one line is recognized by the line feed code. Although the line feed code is not specified in the figure, the document has the line feed code as data (in the illustrated example, the line feed code is located at the place where the line feed is present). Therefore, the document processing apparatus 100 can recognize one line by the line feed code.

なお、この例では、改行コードは、単語としては扱われないとする。 In this example, the line feed code is not treated as a word.

Example of document division processing for dividing a document into words and parts (step S2)
The document division unit 102 divides each input document to generate words. Further, the document division unit 102 divides each input document to generate parts.

具体的には、第１文書Ｄ１、及び、第２文書Ｄ２は、以下のように単語及び部分に分割される。 Specifically, the first document D1 and the second document D2 are divided into words and parts as follows.

図５は、分割の例を示す図である。以下、第１文書Ｄ１を単語及び部分に分割した結果を図における縦軸で示し、かつ、第２文書Ｄ２を単語及び部分に分割した結果を図における横軸で示す。 FIG. 5 is a diagram showing an example of division. Hereinafter, the result of dividing the first document D1 into words and parts is shown by the vertical axis in the figure, and the result of dividing the second document D2 into words and parts is shown by the horizontal axis in the figure.

Since the first document D1 is composed of three lines, it is divided line by line into three parts: the 11th part P11, the 12th part P12, and the 13th part P13.

Since the first document D1 is composed of 12 characters, it is divided character by character into 12 words: the 101st character W101 to the 112th character W112.

Since the second document D2 is composed of two lines, it is divided line by line into two parts: the 21st part P21 and the 22nd part P22.

Since the second document D2 is composed of eight characters, it is divided character by character into eight words: the 201st character W201 to the 208th character W208.

以下、図示するように分割した場合を例に説明する。 Hereinafter, the case of division as shown in the figure will be described as an example.
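The division described above can be sketched in a few lines of Python. This is only an illustrative sketch under the assumptions of this example (one character per word, one line per part, line feed codes excluded from the words); the function name `split_document` and the tuple layout are hypothetical and do not come from the source.

```python
# Illustrative sketch of step S2; names are hypothetical, not from the source.
D1 = "公知情報\n所有情報\n独自情報"  # first document: three lines
D2 = "保有情報\n公知情報"            # second document: two lines


def split_document(text):
    """Return (parts, words) for one document.

    parts: list of lines (one line = one part).
    words: list of (character, part_index) tuples (one character = one word).
    The line feed code only delimits parts and is not treated as a word.
    """
    parts = text.split("\n")
    words = [(ch, i) for i, line in enumerate(parts) for ch in line]
    return parts, words


parts1, words1 = split_document(D1)
parts2, words2 = split_document(D2)
print(len(parts1), len(words1))  # 3 12
print(len(parts2), len(words2))  # 2 8
```

Keeping the part index next to each word makes it easy to look up the part "ua" or "ub" that a word belongs to in the later steps.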

Example of word correspondence degree calculation processing for calculating the word correspondence degree (step S3)
The word correspondence degree calculation unit 103 calculates the word correspondence degree.

The word correspondence degree is a value indicating the degree of correspondence between words. Specifically, the word correspondence degree is a value calculated by the following equation (1).

P(wa → wb | ua → ub) = probability that the word "wa" corresponds to the word "wb" when the part "ua" corresponds to the part "ub"   (1)

The above equation (1) can be expressed as the following equation (1-A).

Here, "wa" denotes a word in one of the documents (hereinafter, the first document D1); in this example, it is one of the 101st character W101 to the 112th character W112. Likewise, "wb" denotes a word in the other document (hereinafter, the second document D2); in this example, it is one of the 201st character W201 to the 208th character W208. The numerator of the above equation (1) is the similarity between the first word "wa" and the second word "wb". The denominator of the above equation (1) is the total similarity obtained by summing the similarities between the first word "wa" and each word "wk" in the second document D2. Therefore, based on the above equation (1), the word correspondence degree is the ratio of the similarity between the first word "wa" and the second word "wb" to the total similarity.

部分「ｕａ」、及び、部分「ｕｂ」は、単語「ｗａ」、及び、単語「ｗｂ」が属する部分を示す。 The part "ua" and the part "ub" indicate the part to which the word "wa" and the word "wb" belong.

また、「ｓｉｍ」は、類似度を示す。すなわち、上記（１）式では、「ｓｉｍ（ｗａ，ｗｂ）」は、単語「ｗａ」と単語「ｗｂ」の類似度を計算した結果を示す。以下、類似度を計算する対象となる単語同士が一致する場合を「１．００」とし、単語同士が異なる場合を「０．００」（以下、「０．００」の場合の記載を省略する。したがって、空欄の場合は、「０．００」を示す。）とする。 Further, "sim" indicates the degree of similarity. That is, in the above equation (1), "sim (wa, wb)" indicates the result of calculating the similarity between the word "wa" and the word "wb". Hereinafter, the case where the words for which the similarity is to be calculated match each other is defined as "1.00", and the case where the words differ from each other is "0.00" (hereinafter, the description in the case of "0.00" is omitted. Therefore, if it is blank, it indicates "0.00".).
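The formula for equation (1-A) does not appear in this text. Going only by the description of its numerator and denominator above, it presumably has the following form. This is a reconstruction, not the published formula; in particular, the summation range (all words "wk" of the second document D2) is taken from the wording of the description and is an assumption.

```latex
P(w_a \to w_b \mid u_a \to u_b)
  = \frac{\mathrm{sim}(w_a, w_b)}
         {\sum_{w_k \in D_2} \mathrm{sim}(w_a, w_k)}
  \tag{1-A}
```

With the binary similarity used in this example, the numerator is 1.00 for identical characters and 0.00 otherwise, and the denominator counts how often the character "wa" occurs in the second document D2.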

したがって、上記（１）式、すなわち、単語対応度の計算結果は、例えば、以下のような結果となる。 Therefore, the above equation (1), that is, the calculation result of the word correspondence degree is, for example, the following result.

FIG. 6 is a diagram showing an example of the calculation results of the word correspondence degree. As illustrated, the pairs W103 and W203, W103 and W207, W107 and W203, W107 and W207, W111 and W203, and W111 and W207 each consist of two words that are both the character "情". Since the words in each pair match, their similarity is "1.00". Note that similarity and correspondence are distinct: even when the similarity is "1.00", the words do not necessarily correspond to each other.

An example of words that do not correspond despite a high similarity arises when the part "AA" is associated with the part "A". The part "AA" contains the word "A" twice, whereas the part "A" contains it only once. Even though each "A" has a similarity of "1.00" with the other, one of the two "A"s in "AA" has no counterpart.

The pairs W104 and W204, W104 and W208, W108 and W204, W108 and W208, W112 and W204, and W112 and W208 each consist of two words that are both the character "報". Since the words in each pair match, their similarity is "1.00".

The 106th character W106 and the 202nd character W202 are both the character "有". Since these words match, their similarity is "1.00".

The 101st character W101 and the 205th character W205 are both the character "公". Since these words match, their similarity is "1.00".

The 102nd character W102 and the 206th character W206 are both the character "知". Since these words match, their similarity is "1.00".

For all other pairs of words (for example, the combination of the 101st character W101 and the 201st character W201), the words differ. Since these words do not match, their similarity is "0.00" (in the figure, a similarity of "0.00" is shown as a blank).
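The match-based similarity described above can be checked with a short Python sketch. It assumes the binary rule of this example (1.00 for identical characters, 0.00 otherwise); the names `sim` and `matrix` are illustrative and not taken from the source.

```python
# Binary similarity between every word of D1 and every word of D2.
words1 = list("公知情報所有情報独自情報")  # W101 .. W112
words2 = list("保有情報公知情報")          # W201 .. W208


def sim(wa, wb):
    """Return 1.0 if the two words are identical, otherwise 0.0."""
    return 1.0 if wa == wb else 0.0


# matrix[i][j] is the similarity between word W(101+i) and word W(201+j).
matrix = [[sim(wa, wb) for wb in words2] for wa in words1]

total_matches = sum(map(sum, matrix))
print(total_matches)  # 15.0
```

Of the 12 × 8 = 96 word pairs, 15 have a similarity of 1.00: six pairs for "情", six for "報", and one each for "有", "公", and "知".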

なお、単語の類似度は、単語が一致するか否かで計算される場合に限られない。例えば、単語の類似度は、単語をベクトルで表現して、ベクトルの内積に基づいて計算される値で定まってもよい。 The degree of similarity of words is not limited to the case where it is calculated based on whether or not the words match. For example, the similarity of words may be determined by a value calculated based on the inner product of the vectors, expressing the words as vectors.

Further, a word vector or the like may be used as the vector. A word vector expresses the features indicating the content of a word as a plurality of numerical values (hereinafter, a value indicating a feature is referred to as a "feature amount"), that is, as a vector. Specifically, the word vector is calculated by, for example, the deep-learning-based method described in the literature ("Distributed Representations of Sentences and Documents", Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1188-1196, 2014).

この方法では、あらかじめ用意されたコーパスと呼ばれる大容量のテキストを参照して、単語ベクトルを算出する。 In this method, a word vector is calculated by referring to a large amount of text called a corpus prepared in advance.

Specifically, the inner product of vectors is calculated as follows. <a, b> denotes the inner product of one vector "a" and another vector "b". Two words that are similar in meaning, context, related field, or the like have feature amounts with close values, so the vectors of similar words are similar to each other. For example, the inner product of two similar vectors (50, 10, 5) and (48, 9, 4) is calculated as "50 × 48 + 10 × 9 + 5 × 4 = 2510". On the other hand, the inner product of two dissimilar vectors (50, 10, 5) and (9, 4, 48) is calculated as "50 × 9 + 10 × 4 + 5 × 48 = 730".

このように、類似した単語同士、すなわち、類似した意味、文脈、又は、関連分野等の単語同士では、ベクトルの内積は、類似しない場合と比較して高い値となる。このようにして計算されるベクトルの内積が、あらかじめ設定される閾値を超える以上の値である場合には、互いの単語が類似していると判断されてもよい。 As described above, in the case of similar words, that is, words having similar meanings, contexts, or related fields, the inner product of the vectors has a higher value than in the case of dissimilarity. When the inner product of the vectors calculated in this way exceeds a preset threshold value, it may be determined that the words are similar to each other.
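The inner-product comparison can be reproduced with the following Python sketch. The three vectors are the ones used in the example above; the threshold value of 1000 is purely hypothetical, since the text only says that a threshold is set in advance.

```python
def inner_product(a, b):
    """Inner product <a, b> of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


THRESHOLD = 1000  # hypothetical value; the source does not specify one

base = (50, 10, 5)
similar = (48, 9, 4)
dissimilar = (9, 4, 48)

print(inner_product(base, similar))     # 2510
print(inner_product(base, dissimilar))  # 730

# Words are judged similar when the inner product exceeds the threshold.
print(inner_product(base, similar) > THRESHOLD)     # True
print(inner_product(base, dissimilar) > THRESHOLD)  # False
```

A raw inner product also grows with the magnitude of the vectors, so in practice a normalized measure such as cosine similarity is often used instead. That is a general remark and not something the source states.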

Example of partial correspondence degree calculation processing for calculating the partial correspondence degree (step S4)
The partial correspondence degree calculation unit 104 calculates the partial correspondence degree.

The partial correspondence degree is a value indicating the degree of correspondence between parts. Specifically, the partial correspondence degree is a value calculated by the following equation (2).

$$
P(u_a \to u_b) = \frac{\sum_{w_a \in u_a,\; w_b \in u_b} \mathrm{sim}(w_a, w_b)}{\sum_{\text{all } w_a,\; \text{all } w_b} \mathrm{sim}(w_a, w_b)} \tag{2}
$$

The denominator in the above equation (2) is the sum of the similarities over all pairs of a word "wa" in the first document D1 and a word "wb" in the second document D2. Therefore, unlike the numerator, the denominator does not depend on the part "ua" or the part "ub". The numerator of the above equation (2) is the "first total similarity", which is the sum of the similarities between each first word contained in the part "ua" (an example of a first part) and each second word contained in the part "ub" (an example of a second part). The denominator of the above equation (2) is the "second total similarity", which is the sum of the similarities between all the words contained in the first document D1 and the second document D2. Therefore, based on the above equation (2), the partial correspondence degree is the ratio of the first total similarity to the second total similarity.

具体的には、上記（２）式、すなわち、部分対応度の計算結果は、例えば、以下のような結果となる。 Specifically, the above equation (2), that is, the calculation result of the degree of partial correspondence is as follows, for example.

図７は、部分対応度の計算結果例を示す図である。以下、部分対応度ごとに計算の例を説明する。すなわち、上記（２）式に基づいて具体的にどのような計算を行うかを上記（２）式における分子と分母の計算に分けて説明する。 FIG. 7 is a diagram showing an example of a calculation result of the degree of partial correspondence. Hereinafter, an example of calculation for each degree of partial correspondence will be described. That is, what kind of calculation is specifically performed based on the above equation (2) will be described separately for the calculation of the numerator and the denominator in the above equation (2).

For the partial correspondence degree between the 11th part P11 and the 21st part P21, the numerator of the above equation (2) is obtained by summing the word correspondence degrees (that is, the calculation results shown in FIG. 6). Specifically, of the 16 calculation results in FIG. 6 for the 11th part P11 and the 21st part P21, two are "1.00", so the numerator of the above equation (2) is calculated as "1.00 × 2 = 2.00".

Meanwhile, the calculation results in FIG. 6 contain "12 (the number of words in the first document D1) × 8 (the number of words in the second document D2) = 96" word correspondence degrees in total across all the words of the first document D1 and the second document D2. In the example shown in FIG. 6, 15 of these are calculated as "1.00". The denominator of the above equation (2) is therefore calculated as "1.00 × 15 = 15.00".

Hence, the partial correspondence degree between the 11th part P11 and the 21st part P21 given by the above equation (2) is calculated as "2.00 / 15.00 = 0.1333... ≈ 0.13".

For the 11th part P11 and the 22nd part P22, four of the 16 calculation results are "1.00", so the numerator of the above equation (2) is calculated as "1.00 × 4 = 4.00". The denominator of the above equation (2) does not depend on the parts and is "15.00", the same as for the 11th part P11 and the 21st part P21.

Hence, the partial correspondence degree between the 11th part P11 and the 22nd part P22 is calculated as "4.00 / 15.00 = 0.2666... ≈ 0.27".

For the 12th part P12 and the 21st part P21, three of the 16 calculation results are "1.00", so the numerator of the above equation (2) is calculated as "1.00 × 3 = 3.00". The denominator is again "15.00".

Hence, the partial correspondence degree between the 12th part P12 and the 21st part P21 is calculated as "3.00 / 15.00 = 0.20".

For the 12th part P12 and the 22nd part P22, two of the 16 calculation results are "1.00", so the numerator of the above equation (2) is calculated as "1.00 × 2 = 2.00". The denominator is again "15.00".

Hence, the partial correspondence degree between the 12th part P12 and the 22nd part P22 is calculated as "2.00 / 15.00 = 0.1333... ≈ 0.13".

For the 13th part P13 and the 21st part P21, two of the 16 calculation results are "1.00", so the numerator of the above equation (2) is calculated as "1.00 × 2 = 2.00". The denominator is again "15.00".

Hence, the partial correspondence degree between the 13th part P13 and the 21st part P21 is calculated as "2.00 / 15.00 = 0.1333... ≈ 0.13".

For the 13th part P13 and the 22nd part P22, two of the 16 calculation results are "1.00", so the numerator of the above equation (2) is calculated as "1.00 × 2 = 2.00". The denominator is again "15.00".

Hence, the partial correspondence degree between the 13th part P13 and the 22nd part P22 is calculated as "2.00 / 15.00 = 0.1333... ≈ 0.13".

以上のように、それぞれの部分対応度を計算すると、図７に示すような結果が得られる。 When the degree of partial correspondence is calculated as described above, the result as shown in FIG. 7 is obtained.
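The six partial correspondence degrees worked out above can be reproduced with a short Python sketch of equation (2). It assumes the binary similarity of this example; the helper names `sim` and `pair_total` are illustrative and not taken from the source.

```python
parts1 = ["公知情報", "所有情報", "独自情報"]  # P11, P12, P13
parts2 = ["保有情報", "公知情報"]              # P21, P22


def sim(wa, wb):
    """Return 1.0 if the two words are identical, otherwise 0.0."""
    return 1.0 if wa == wb else 0.0


def pair_total(ua, ub):
    """Sum of sim(wa, wb) over all words wa in part ua and wb in part ub."""
    return sum(sim(wa, wb) for wa in ua for wb in ub)


# Denominator of equation (2): total similarity over all word pairs.
denominator = sum(pair_total(ua, ub) for ua in parts1 for ub in parts2)

# Partial correspondence degree P(ua -> ub) for every pair of parts.
partial = {
    (i, j): pair_total(ua, ub) / denominator
    for i, ua in enumerate(parts1)
    for j, ub in enumerate(parts2)
}

for (i, j), value in partial.items():
    print(f"P1{i + 1} -> P2{j + 1}: {value:.2f}")
# P11 -> P21: 0.13
# P11 -> P22: 0.27
# P12 -> P21: 0.20
# P12 -> P22: 0.13
# P13 -> P21: 0.13
# P13 -> P22: 0.13
```

Because every pair of parts shares the same denominator, the six values sum to 1, which is consistent with reading P(ua → ub) as a probability over pairs of parts.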

Example of document correspondence degree calculation processing for calculating the document correspondence degree (step S5)
The document correspondence degree calculation unit 105 calculates the document correspondence degree.

The document correspondence is a value indicating the degree of correspondence between documents. To calculate it, the value given by equation (3) below is calculated first. This value is hereinafter called the "joint probability"; a joint probability is written as "P(A, B)" and indicates the probability that events "A" and "B" occur together.

$$P(w_a \to w_b,\ u_a \to u_b) = P(w_a \to w_b \mid u_a \to u_b) \times P(u_a \to u_b) \tag{3}$$

The first factor on the right side of equation (3) above, that is, "P(wa → wb | ua → ub)", is the word correspondence, which is the calculation result of equation (1) above. In this example, it is the calculation result shown in FIG. 6.

Further, the second factor on the right side of equation (3) above, that is, "P(ua → ub)", is the partial correspondence, which is the calculation result of equation (2) above. In this example, it is the calculation result shown in FIG. 7.

Therefore, as equation (3) shows, the joint probability is the product of the word correspondence and the partial correspondence. Specifically, the joint probability is calculated as follows.

FIG. 8 is a diagram showing an example of joint probability calculation results. As equation (3) shows, the joint probabilities in the figure are obtained by multiplying the word correspondence shown in FIG. 6 by the partial correspondence shown in FIG. 7.

Specifically, the joint probabilities of the 103rd character W103 and the 203rd character W203, and of the 104th character W104 and the 204th character W204, are each calculated as "1.00 × 0.13 = 0.13".

The joint probabilities of the 101st character W101 and the 205th character W205, the 102nd character W102 and the 206th character W206, the 103rd character W103 and the 207th character W207, and the 104th character W104 and the 208th character W208 are each calculated as "1.00 × 0.27 = 0.27".

The joint probabilities of the 106th character W106 and the 202nd character W202, the 107th character W107 and the 203rd character W203, and the 108th character W108 and the 204th character W204 are each calculated as "1.00 × 0.20 = 0.20".

The joint probabilities of the 107th character W107 and the 207th character W207, and of the 108th character W108 and the 208th character W208, are each calculated as "1.00 × 0.13 = 0.13".

The joint probabilities of the 111th character W111 and the 203rd character W203, and of the 112th character W112 and the 204th character W204, are each calculated as "1.00 × 0.13 = 0.13".

The joint probabilities of the 111th character W111 and the 207th character W207, and of the 112th character W112 and the 208th character W208, are each calculated as "1.00 × 0.13 = 0.13".

Next, the document correspondence calculation unit 105 calculates the document correspondence as shown in equation (4) below.

$$\text{Document correspondence of D1 and D2} = \frac{\sum_{\text{all } w_a} \max P(w_a \to w_b,\ u_a \to u_b)}{\max(\text{number of words in D1},\ \text{number of words in D2})} \tag{4}$$

In equation (4) above, the word "wb" that gives the maximum of P(wa → wb, ua → ub) for the word "wa" is called the "correspondence destination of the word "wa"". Here, "max" is a function that extracts the maximum value, and the value calculated by "max P(wa → wb, ua → ub)" is called the "maximum correspondence". Further, the set of three items consisting of the word "wa", the word "wb", and the maximum correspondence is called a "correspondence between words".

In the calculation of equation (4) above, when a word "wb" in the second document D2 is the correspondence destination of multiple words in the first document D1, their maximum correspondences are compared and only the largest value is included in the sum (the "Σ" in equation (4)). That is, the smaller values are excluded from the sum.

In the example shown in FIG. 8, there are seven values to be added (hereinafter "addition targets C1"), as shown in the figure. All other values are excluded from the sum.

For example, the 111th character W111 in the 13th part P13 is a value excluded from the sum (hereinafter "exclusion target C2"). W111 has a maximum correspondence of "0.13" whether its correspondence destination is the 203rd character W203 or the 207th character W207. The exclusion target C2 is not added for the following reasons.

Reason 1) The 203rd character W203 has a larger maximum correspondence, "0.20", with the 107th character W107.

Reason 2) The 207th character W207 has a larger maximum correspondence, "0.27", with the 103rd character W103.

Totaling the addition targets C1 as described above gives the numerator of equation (4). In this example, the numerator of equation (4) (hereinafter "total maximum correspondence") is calculated as "0.27 × 4 + 0.20 × 3 = 1.68". Since the first document D1 has "12" words and the second document D2 has "8" words, the denominator of equation (4) (hereinafter "maximum number of words") is "12".

Therefore, based on equation (4) above, the document correspondence of the first document D1 and the second document D2 is obtained by dividing the total maximum correspondence by the maximum number of words: "1.68 / 12 = 0.14".
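The selection rule and the division in equation (4) can be sketched as follows. This is only an illustration under stated assumptions: the joint probabilities are the rounded values quoted above for FIG. 8, the function name `document_correspondence` and the dictionary layout are hypothetical, and a tie between destinations is broken by first occurrence.

```python
from collections import defaultdict


def document_correspondence(joint, n_words_a, n_words_b):
    """Equation (4): sum the surviving maximum correspondences, then normalise."""
    # Step 1: for each word wa, keep its largest joint probability and destination wb.
    best_for_wa = {}
    for (wa, wb), p in joint.items():
        if wa not in best_for_wa or p > best_for_wa[wa][0]:
            best_for_wa[wa] = (p, wb)
    # Step 2: if several wa share one destination wb, only the largest value is added.
    best_for_wb = defaultdict(float)
    for p, wb in best_for_wa.values():
        best_for_wb[wb] = max(best_for_wb[wb], p)
    return sum(best_for_wb.values()) / max(n_words_a, n_words_b)


# Joint probabilities as quoted in the text for FIG. 8 (rounded values).
joint = {
    ("W103", "W203"): 0.13, ("W104", "W204"): 0.13,
    ("W101", "W205"): 0.27, ("W102", "W206"): 0.27,
    ("W103", "W207"): 0.27, ("W104", "W208"): 0.27,
    ("W106", "W202"): 0.20, ("W107", "W203"): 0.20, ("W108", "W204"): 0.20,
    ("W107", "W207"): 0.13, ("W108", "W208"): 0.13,
    ("W111", "W203"): 0.13, ("W112", "W204"): 0.13,
    ("W111", "W207"): 0.13, ("W112", "W208"): 0.13,
}

score = document_correspondence(joint, n_words_a=12, n_words_b=8)
print(round(score, 2))  # 0.14
```

In this sketch W111 and W112 drop out in step 2, because their destinations W203 and W204 are already claimed by W107 and W108 with the larger value 0.20.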

In this way, the correspondence between documents that contain multiple parts is calculated from a correspondence between words that is determined using both the correspondence between words and the correspondence between parts. That is, the document correspondence is calculated based on the correspondence between words determined using both the word correspondence and the partial correspondence. With the document correspondence calculated in this way, the correspondence between documents can be calculated more accurately than when only the similarity between words is used. That is, documents with more closely corresponding content are easier to find.

For example, the "unique information" contained in the first document D1 (that is, the 13th part P13) has no corresponding part in the second document D2. However, a method that calculates only the similarity between words also counts the 13th part P13 when calculating the correspondence of the document. As a result, the first document D1, which contains the extra "unique information", would be calculated as corresponding more closely than, for example, a document containing only the 11th part P11 and the 12th part P12 (hereinafter "third document").

In contrast, the present embodiment can determine that the third document corresponds to the second document D2 more closely than the first document D1 does.
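The comparison with the third document can be checked on a toy example. The sketch below assumes exact-match similarity, takes the word correspondence of every matching pair as 1.00 as in the worked example, and uses hypothetical single-letter words; the function names are illustrative and are not taken from the embodiment.

```python
from collections import defaultdict


def match_count(part_a, part_b):
    """Number of exactly matching word pairs between two parts."""
    return sum(1 for wa in part_a for wb in part_b if wa == wb)


def document_correspondence(doc_a, doc_b):
    """Score doc_a against doc_b; each document is a list of parts (word lists)."""
    total = sum(match_count(ua, ub) for ua in doc_a for ub in doc_b)
    if total == 0:
        return 0.0
    best_for_wa = {}  # (part index, word index) in doc_a -> (joint, destination)
    for i, ua in enumerate(doc_a):
        for j, ub in enumerate(doc_b):
            part_corr = match_count(ua, ub) / total
            for k, wa in enumerate(ua):
                for m, wb in enumerate(ub):
                    if wa != wb:
                        continue
                    joint = 1.0 * part_corr  # assumed word correspondence of 1.00
                    if (i, k) not in best_for_wa or joint > best_for_wa[(i, k)][0]:
                        best_for_wa[(i, k)] = (joint, (j, m))
    best_for_wb = defaultdict(float)  # a shared destination keeps only the largest value
    for joint, dest in best_for_wa.values():
        best_for_wb[dest] = max(best_for_wb[dest], joint)
    n_a = sum(len(u) for u in doc_a)
    n_b = sum(len(u) for u in doc_b)
    return sum(best_for_wb.values()) / max(n_a, n_b)


D1 = [list("abcd"), list("efcd"), list("ghcd")]  # third part plays the "unique information"
D2 = [list("ifcd"), list("abcd")]
D3 = D1[:2]                                      # hypothetical third document

score_d1 = document_correspondence(D1, D2)
score_d3 = document_correspondence(D3, D2)
print(round(score_d1, 2), round(score_d3, 2))  # 0.14 0.28
```

In this toy setup the third document scores higher than the first, which is the ordering the paragraph above describes.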

The document is, for example, a contract. The following description takes the case where the document is a contract as an example.

In a review based on a contract, the reviewer judges, for example, whether the contract under review states the content the contract requires and whether it states content the contract does not need. In this way, the reviewer judges the validity of the contract. In doing so, the reviewer may consult other contracts reviewed in the past as reference information.

When searching for documents to consult in such a review, and particularly when a large number of previously reviewed documents, that is, contracts accumulated for reference, exist, it is desirable that documents worth consulting be extracted automatically.

When similar documents are searched for based only on word frequency, an appropriate document may not be retrieved unless the partial correspondence is considered, because a contract consists of multiple parts such as articles and clauses.

In this example, a past contract worth consulting for the current review should contain as many parts as possible that correspond to the characteristic parts of the contract under review, and as few other parts as possible. Such contracts are often difficult to find using word-based similarity or the like.

Therefore, the partial correspondence is calculated, and a word correspondence that takes the partial correspondence into account is used to search for corresponding documents. This makes it possible to retrieve corresponding documents. That is, in this example, past contracts whose content is similar to the contract under review can be retrieved accurately.

<About words and parts>
A word is not limited to a single character as in the example above. That is, a word may be a "character string" combining multiple characters. For example, when dividing into character strings, the document is divided every predetermined number of characters to generate the words. The predetermined number may be set in advance or optimized by learning. When a document is divided into words, an analysis method such as morphological analysis is used, for example. Alternatively, the document may be divided into words by extracting n characters at a time using a method such as N-gram.
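Two of the word-division options mentioned here can be sketched as follows. This is a minimal sketch with hypothetical function names: `split_fixed` cuts the text every predetermined number of characters, and `char_ngrams` assumes the common sliding-window reading of N-gram, which the text does not spell out.

```python
def split_fixed(text, n):
    """Cut text into consecutive, non-overlapping chunks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def char_ngrams(text, n):
    """Slide a window of n characters over text, one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(split_fixed("abcdef", 2))  # ['ab', 'cd', 'ef']
print(char_ngrams("abcd", 2))    # ['ab', 'bc', 'cd']
```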

A part need not be a one-line group of characters as in the example above. For example, a part may be a group of characters in units of a phrase, a sentence, a paragraph, or a passage. When a document is divided into parts, it may be divided at each delimiter such as a period, for example. Words and parts may also be applied to character strings (sequences) in general other than natural language, such as program source code or DNA sequences.
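Division into parts at delimiters can be sketched as follows. This is a minimal sketch, assuming the delimiters are a line feed and the Japanese period "。"; the function name `split_into_parts` and the default pattern are hypothetical and would be adapted to the documents at hand.

```python
import re


def split_into_parts(document, delimiter_pattern=r"[\n。]"):
    """Split a document at each delimiter and drop empty pieces."""
    pieces = re.split(delimiter_pattern, document)
    return [piece.strip() for piece in pieces if piece.strip()]


parts = split_into_parts("first clause。second clause\nthird clause")
print(parts)  # ['first clause', 'second clause', 'third clause']
```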

<Other embodiments>
Characters may include numerals, symbols, and the like.

Machine learning is a technology for giving a computer a human-like ability to learn. The computer autonomously generates, from training data loaded in advance, the algorithm needed for judgments such as data identification, and applies that algorithm to new data to make predictions. When machine learning is used in implementing the present invention, the learning method may be any of supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and deep learning, or a combination of these; the learning method is not limited.

Each function of the embodiments described above can be implemented by one or more processing circuits. Here, a "processing circuit" in this specification includes a processor programmed by software to execute each function, such as a processor implemented by an electronic circuit, and devices designed to execute each function described above, such as an ASIC (Application Specific Integrated Circuit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), and circuit modules.

The document processing device 100 of each embodiment described above may be, for example, a PJ (projector), an output device such as digital signage, a HUD (Head Up Display) device, an industrial machine, a medical device, a networked home appliance, an automobile (connected car), a notebook PC (Personal Computer), a mobile phone, a tablet terminal, a game console, a PDA (Personal Digital Assistant), a digital camera, a wearable PC, a desktop PC, or the like.

All or part of the overall processing may be implemented by a program that is written in a computer language and causes a computer to execute the document processing method. That is, the program is a computer program that causes a computer, or a document processing system using two or more computers, to execute each process.

Therefore, when the document processing method is executed based on the program, the arithmetic unit and control unit of the computer perform computation and control based on the program in order to execute each process. The storage device of the computer stores the data used in the processing, based on the program, in order to execute each process.

The program can be recorded on a computer-readable storage medium and distributed. The storage medium is a medium such as a magnetic tape, flash memory, optical disk, magneto-optical disk, or magnetic disk. The program can also be distributed through a telecommunication line.

The embodiments of the present invention may be implemented by a document processing system. The document processing system may execute each process and the storage of data in a redundant, distributed, parallel, or virtualized manner, or in a combination of these.

Although the present invention has been described based on the embodiments, the present invention is not limited to the requirements shown in the above embodiments. These points can be changed without departing from the gist of the present invention and can be determined appropriately according to the form of application.

100 Document processing device
101 Input unit
102 Document division unit
103 Word correspondence calculation unit
104 Partial correspondence calculation unit
105 Document correspondence calculation unit
C1 Addition target
C2 Exclusion target
D1 First document
D2 Second document
P11 11th part
P12 12th part
P13 13th part
P21 21st part
P22 22nd part
W101 101st character
W102 102nd character
W103 103rd character
W104 104th character
W106 106th character
W107 107th character
W108 108th character
W109 109th character
W110 110th character
W111 111th character
W112 112th character
W201 201st character
W202 202nd character
W203 203rd character
W204 204th character
W205 205th character
W206 206th character
W207 207th character
W208 208th character

Gerard Salton, Michael J. McGill, Introduction to Modern Information Retrieval, New York, McGraw-Hill, 1983.

Claims

A document processing device comprising:
an input unit that inputs a plurality of documents;
a document division unit that divides each document into words, each being a character or a character string, and divides the document into parts, each being a unit larger than the words;
a word correspondence calculation unit that calculates a word correspondence, which is a degree of correspondence between the words in different documents;
a partial correspondence calculation unit that calculates a partial correspondence, which is a degree of correspondence between the parts in different documents; and
a document correspondence calculation unit that calculates a document correspondence, which is a degree of correspondence between documents, based on a correspondence between words determined using the word correspondence and the partial correspondence.

The document processing device according to claim 1, wherein
the part is a sentence containing the words or a line-by-line collection of the words, and
the document division unit divides the document at each line feed code or delimiter to generate the parts.

The document processing device according to claim 1 or 2, wherein
the character string is a collection of a predetermined number of characters, and
the document division unit divides the document every predetermined number of characters to generate the words as character strings.

The document processing device according to any one of claims 1 to 3, wherein,
when the word correspondence between a first word included in a first document and a second word included in a second document is calculated,
the word correspondence is the ratio of the similarity between the first word and the second word to a total similarity, which is the sum of the similarities between the first word and all words included in the second document.

The document processing device according to any one of claims 1 to 4, wherein,
when the partial correspondence between a first part included in a first document and a second part included in a second document is calculated,
the partial correspondence is the ratio of a first total similarity, which is the sum of the similarities between first words included in the first part and second words included in the second part, to a second total similarity, which is the sum of the similarities of all words included in the first document and the second document.

The document processing device according to claim 4 or 5, wherein
the similarity indicates whether the words match, or is calculated by representing the words as vectors and taking the inner product of the vectors.

The document processing device according to any one of claims 1 to 6, wherein,
when the document correspondence between a first document and a second document is calculated,
the document correspondence is a value calculated by dividing a total maximum correspondence by a maximum number of words, the total maximum correspondence being the sum of maximum correspondences, each of which is the maximum value for a word among the results of multiplying the word correspondence by the partial correspondence, and the maximum number of words being the larger of the numbers of words included in the first document and the second document.

A document processing method performed by a document processing device, the method comprising:
an input procedure in which the document processing device inputs a plurality of documents;
a document division procedure in which the document processing device divides each document into words, each being a character or a character string, and divides the document into parts, each being a unit larger than the words;
a word correspondence calculation procedure in which the document processing device calculates a word correspondence, which is a degree of correspondence between the words in different documents;
a partial correspondence calculation procedure in which the document processing device calculates a partial correspondence, which is a degree of correspondence between the parts in different documents; and
a document correspondence calculation procedure in which the document processing device calculates a document correspondence, which is a degree of correspondence between documents, based on a correspondence between words determined using the word correspondence and the partial correspondence.

A program for causing a computer to execute a document processing method, the program causing the computer to execute:
an input procedure in which the computer inputs a plurality of documents;
a document division procedure in which the computer divides each document into words, each being a character or a character string, and divides the document into parts, each being a unit larger than the words;
a word correspondence calculation procedure in which the computer calculates a word correspondence, which is a degree of correspondence between the words in different documents;
a partial correspondence calculation procedure in which the computer calculates a partial correspondence, which is a degree of correspondence between the parts in different documents; and
a document correspondence calculation procedure in which the computer calculates a document correspondence, which is a degree of correspondence between documents, based on a correspondence between words determined using the word correspondence and the partial correspondence.