JP2009122942A

JP2009122942A - Inter-document distance calculation device and program

Info

Publication number: JP2009122942A
Application number: JP2007295912A
Authority: JP
Inventors: Toshiro Uchiyama; 俊郎内山; Ryoji Kataoka; 良治片岡; Katsuto Bessho; 克人別所; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-11-14
Filing date: 2007-11-14
Publication date: 2009-06-04
Anticipated expiration: 2027-11-14
Also published as: JP5057516B2

Abstract

PROBLEM TO BE SOLVED: To provide an inter-document distance calculation device that readily copes with the new names of persons, organizations, and the like and the new words that appear every day, for which concept vectors would have to be calculated and registered as soon as they appear, and that uses words which are important as features of a document in the inter-document distance calculation.

SOLUTION: This inter-document distance calculation device includes: a component frequency calculation means for calculating component frequencies in an input document; a vector set calculation means for calculating a vector set by assigning registered vectors to the components whose vectors are registered, and assigning unknown vectors, while retaining the original expressions, to the components whose vectors are not registered; and a vector inner product calculation means for calculating vector inner products by defining the inner product of unknown vectors with each other as a predetermined value, and defining the inner product of an unknown vector and the vector of a component whose vector is registered as 0.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a method of calculating the distance between data items for the purpose of searching and analyzing document data.

As a method of calculating the distance between documents, it is known to decompose each document into words such as nouns and verbs, associate pre-registered concept vectors with the words, represent the document as a set of concept vectors, take the average of those concept vectors as the concept vector of the document, and calculate the distance between documents from the Euclidean distance (or cosine) to the concept vector of another document (see, for example, Non-Patent Document 1).
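The known average-vector method described above can be sketched as follows. This is only an illustrative sketch: the two-dimensional `concept_base`, its values, and all function names are assumptions made for the example and do not come from Non-Patent Document 1. Note how the unregistered word is simply dropped, which is the limitation the invention addresses.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(dim)]

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical concept base (word -> concept vector); values are made up.
concept_base = {
    "ramen": [1.0, 0.0],
    "breakfast": [0.0, 1.0],
    "morning": [0.6, 0.8],
}

def document_vector(words, base):
    """Average the concept vectors of the registered words only.

    Words missing from the concept base are silently ignored.
    """
    return mean_vector([base[w] for w in words if w in base])

doc_a = document_vector(["ramen", "morning", "朝ラー"], concept_base)
doc_b = document_vector(["ramen", "breakfast"], concept_base)
print(euclidean(doc_a, doc_b), cosine(doc_a, doc_b))
```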

As a method of calculating the distance between vector sets such as images, it is known to calculate the distance between a vector set A and a vector set B as follows (see, for example, Patent Document 1). First, for each vector set, a representative vector set, that is, a set of vectors representing that vector set, is determined. Next, vector set B is vector-quantized with the representative vector set of vector set A, and the resulting quantization error is taken as error A. Vector set A is then vector-quantized with the representative vector set of vector set B, and the resulting quantization error is taken as error B. The larger of error A and error B is taken as the distance between vector set A and vector set B.

Combining the contents of the above documents, a document can be expressed as a set of concept vectors according to Non-Patent Document 1, and the distance between documents can then be calculated by the method of calculating the distance between vector sets of Patent Document 1.

Non-Patent Document 1: Katsuto Bessho, Toshiro Uchiyama, Ryoji Kataoka, "Concept base extension method based on co-occurrence between words and semantic attributes", Information Processing Society of Japan SIG Technical Report, Vol. SIG-ICS 144, pp. 29-34, 2006
Patent Document 1: JP 2002-366956 A

In the above prior art, words registered in advance can be expressed as vectors and reflected in the calculation of the distance between documents. However, new names of persons, organizations, and the like, as well as new words, appear every day, and it is difficult to calculate and register concept vectors for these words as soon as they appear. Consequently, words that should be important as features of a document cannot be reflected in the calculation of the distance between documents.

An object of the present invention is to provide an inter-document distance calculation device that readily copes with the new names of persons, organizations, and the like and the new words that appear every day, for which concept vectors would have to be calculated and registered as soon as they appear, and that can reflect words that are important as features of a document in the calculation of the distance between documents.

The present invention is an inter-document distance calculation device that calculates the distance between two documents when vectors are registered in advance for words, compound words, or phrases that are components of documents, the device comprising: component frequency calculation means for calculating component frequencies in an input document; vector set calculation means for calculating a vector set by assigning, to each component whose vector is registered, that vector, and by assigning, to each component whose vector is not registered, an unknown vector while retaining its original expression; and vector inner product calculation means for calculating vector inner products by setting the inner product of unknown vectors with each other to a predetermined value and setting the inner product of an unknown vector and the vector of a component whose vector is registered to 0.

According to the present invention, in a device that expresses pre-registered words as vectors and calculates the distance between documents, new words, compound words, phrases, and the like that appear daily and are not registered can be expressed as unknown vectors and reflected in the calculation of the distance between documents. As a result, a more appropriate inter-document distance can be calculated.

The best mode for carrying out the invention is described in the following embodiments.

FIG. 1 is a block diagram showing an inter-document distance calculation device 100 according to Embodiment 1 of the present invention.

The inter-document distance calculation device 100 includes document input means 10, word frequency calculation means 20, vector set calculation means 30, quantization error calculation means 40, inter-document distance determination means 50, and word concept base storage means 70.

The document input means 10 inputs a document A, which is an example of a first document, and a document B, which is an example of a second document.

The word frequency calculation means 20 calculates the word frequencies in the input documents.

The vector set calculation means 30 calculates a vector set by assigning, to each component whose vector is registered, that vector, and by assigning, to each component whose vector is not registered, an unknown vector while retaining its original expression.

The vector inner product calculation means calculates vector inner products by setting the inner product of unknown vectors having the same original expression to a predetermined value, setting the inner product of unknown vectors having different original expressions to 0, and setting the inner product of an unknown vector and the vector of a component whose vector is registered to 0.
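The inner-product rules above can be sketched as a small function. This is a minimal sketch under stated assumptions: the `Unknown` class, the name `inner_product`, and the value `ALPHA = 1.0` for the predetermined inner product are illustrative choices, not names or values fixed by the text at this point.

```python
from dataclasses import dataclass

# Assumed value for the predetermined inner product of an unknown vector
# with an unknown vector of the same original expression.
ALPHA = 1.0

@dataclass(frozen=True)
class Unknown:
    """Stand-in for an unknown vector; only the original expression is kept."""
    surface: str

def inner_product(u, v, alpha=ALPHA):
    """Inner product that follows the rules for unknown vectors.

    Registered vectors are plain lists of floats; unregistered components
    are represented by Unknown instances.
    """
    u_unknown = isinstance(u, Unknown)
    v_unknown = isinstance(v, Unknown)
    if u_unknown and v_unknown:
        # Same original expression: predetermined value; otherwise 0.
        return alpha if u.surface == v.surface else 0.0
    if u_unknown or v_unknown:
        # Unknown vector against a registered vector: 0.
        return 0.0
    # Both registered: ordinary inner product.
    return sum(a * b for a, b in zip(u, v))

print(inner_product(Unknown("朝ラー"), Unknown("朝ラー")))
print(inner_product(Unknown("朝ラー"), Unknown("アムラー")))
print(inner_product(Unknown("朝ラー"), [1.0, 0.0]))
```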

The quantization error calculation means 40 assigns to each word of document B the nearest word of document A, calculates the distance between the word vectors, sums the calculated word-vector distances over all words, and divides the sum by the number of words; the resulting average is the quantization error of document B by document A.

The inter-document distance determination means 50 obtains the distance between document A and document B on the basis of a first quantization error and a second quantization error. That is, of the quantization error of the second document by document A and the quantization error of document A by the second document, it takes the larger quantization error as the distance between document A and the second document.

Next, the operation by which the inter-document distance calculation device 100 calculates the inter-document distance between document A and document B will be described.

FIG. 2 is a flowchart showing the operation of Embodiment 1.

First, in S1, document A and document B are input. The concept vector corresponding to each word is calculated from a document corpus or the like, using the co-occurrence information between words and semantic attributes described in Non-Patent Document 1 and a dimension-reduction process such as singular value decomposition.

The above embodiment uses words, but components of a document such as compound words or phrases may be used instead of words.

FIG. 3 is a diagram showing the word concept base stored in the word concept base storage means 70 in Embodiment 1.

Here, the word vector, which is the concept vector of each word, is normalized to length (L2 norm) 1. Likewise, because of this normalization, the length of the word vector of an unregistered word (an unknown vector) is taken to be 1 (α in the above description). A word registered in the word concept base storage means 70 is hereinafter referred to as a "registered word". The concept base may also be calculated using word-to-word co-occurrence information.

Next, morphological analysis is performed on document A and document B, and the word frequencies are obtained.

FIG. 4 and FIG. 5 are diagrams showing examples of the word frequencies obtained by morphological analysis of document A and document B.
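A word frequency table of the kind described here could be built as in the following sketch. It assumes the morphological analysis has already produced a token list; the tokens, the registration numbers, and the function name `word_frequencies` are hypothetical and are not the data of FIG. 4 or FIG. 5.

```python
from collections import Counter

def word_frequencies(tokens, registration_numbers):
    """Return (word, frequency, registration number or "unregistered") rows."""
    counts = Counter(tokens)
    return [
        (word, count, registration_numbers.get(word, "unregistered"))
        for word, count in counts.items()
    ]

# Hypothetical output of a morphological analyzer (not the data of FIG. 4).
tokens = ["ラーメン", "朝ラー", "ラーメン", "朝", "朝ラー", "ラーメン"]
# Hypothetical registration numbers in the word concept base.
registration_numbers = {"ラーメン": 101, "朝": 102}

table = word_frequencies(tokens, registration_numbers)
for row in table:
    print(row)
```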

In document A shown in FIG. 4, the word "朝ラー" (asa-ra) has newly appeared and is therefore not registered in the word concept base storage means 70. Accordingly, in FIG. 4, "unregistered" is shown in the registration number column for the word "朝ラー". In the word frequency table for document B shown in FIG. 5, the words "朝ラー" and "アムラー" (amura) are unregistered words. The set X_A of word vectors corresponding to the word set S_A = {ramen, 朝ラー, morning, Kitakata, aim} belonging to document A is written as follows.

[Expression for X_A not reproduced in this text]

An unknown vector is assigned as the word vector corresponding to "朝ラー". Here, N_A denotes the number of words contained in document A.

In the example shown in FIG. 4, the number of words is 9. Similarly, the set X_B of word vectors corresponding to the word set S_B = {ramen, breakfast, Fukushima, Kitakata, アムラー} belonging to document B is written as follows.

[Expression for X_B not reproduced in this text]
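The assignment of registered and unknown vectors described above can be sketched as follows. The representation of an unknown vector as `None` paired with the original expression, the function name `build_vector_set`, and the concept base values are assumptions made for illustration only.

```python
def build_vector_set(tokens, concept_base):
    """Return one (original expression, vector or None) pair per token.

    Registered words receive their registered vector. Unregistered words
    keep their original expression and receive None, which stands in for
    an unknown vector.
    """
    return [(word, concept_base.get(word)) for word in tokens]

# Hypothetical concept base with unit-length vectors (values are made up).
concept_base = {"ラーメン": [1.0, 0.0], "朝": [0.0, 1.0]}
tokens = ["ラーメン", "朝ラー", "朝", "ラーメン"]

x_a = build_vector_set(tokens, concept_base)
n_a = len(x_a)  # number of words, counting repeated occurrences
unknown_words = [word for word, vector in x_a if vector is None]
print(n_a, unknown_words)
```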

Then the actual distance calculation is performed. First, the word vectors of document B are quantized with the set of word vectors belonging to document A; in S2, the quantization error E_{A→B} and the quantization error E_{B→A} are obtained.

Specifically, to obtain the quantization error E_{A→B}, each word of document B is assigned the nearest word of document A, and the distance between the two word vectors is calculated. These word-vector distances are summed over all words, and the sum is divided by the number of words; the resulting average is the quantization error E_{A→B} of document B by document A.

That is, the quantization error is obtained by assigning to each word of the second document the nearest word of the first document, calculating the distance between the word vectors, summing these distances over all words, and dividing the sum by the number of words.

To obtain the quantization error E_{B→A}, each word of document A is assigned the nearest word of document B, and the distance between the two word vectors is calculated. These word-vector distances are summed over all words, and the sum is divided by the number of words; the resulting average is the quantization error E_{B→A} of document A by document B.

FIG. 6 is a diagram showing an example of the result of quantizing the word vector set of document B with the word vector set of document A.

Since the word "朝ラー" is present in both document A and document B, the distance between the word vector of "朝ラー" in document A and the word vector of "朝ラー" in document B is 0.

On the other hand, the word "アムラー" is present in document B but not in document A. In this case, since by definition its inner product with the word vector of any word other than itself is 0, the distance between the words is the sum of the L2 norms of the two words, that is, 2. In the same manner, the word vectors of document A are quantized with the set of word vectors belonging to document B.

FIG. 7 is a diagram showing an example of the result of quantizing the word vectors of document A with the set of word vectors belonging to document B.

In FIG. 7, the distance from the word "aim" to its nearest word, "Kitakata", is 2.2. Thus, the distance between word vectors whose inner product is negative is greater than the distance to an unregistered word, which is 2.

Then, in S2, the quantization error is obtained by dividing the total distance by the number of words.

In this way, the quantization error of document B by document A, E_{A→B} = 0.58, and the quantization error of document A by document B, E_{B→A} = 0.39, are obtained.

Then, in S3, the distance is selected. That is, in S4, the larger quantization error, 0.58, is output as the distance between document A and document B.
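The flow from quantization to distance selection can be sketched end to end as follows. This is a sketch under stated assumptions: documents are lists of (original expression, vector or None) pairs with one entry per word occurrence, the distance is the squared Euclidean distance expanded through inner products, the predetermined inner product of an unknown vector with itself is 1, and the toy documents are made up, so the results differ from the 0.58 and 0.39 of the example in the text.

```python
def inner(u, v, alpha=1.0):
    """Inner product with the rules for unknown vectors (vector is None)."""
    (surface_u, vec_u), (surface_v, vec_v) = u, v
    if vec_u is None and vec_v is None:
        return alpha if surface_u == surface_v else 0.0
    if vec_u is None or vec_v is None:
        return 0.0
    return sum(a * b for a, b in zip(vec_u, vec_v))

def sq_distance(u, v):
    """Squared distance ||u - v||^2 expanded through inner products."""
    return inner(u, u) - 2.0 * inner(u, v) + inner(v, v)

def quantization_error(codebook, targets):
    """Average distance from each target word to its nearest codebook word."""
    total = sum(min(sq_distance(c, t) for c in codebook) for t in targets)
    return total / len(targets)

def document_distance(doc_a, doc_b):
    """Larger of the two quantization errors E_{A->B} and E_{B->A}."""
    e_a_to_b = quantization_error(doc_a, doc_b)  # document B quantized by A
    e_b_to_a = quantization_error(doc_b, doc_a)  # document A quantized by B
    return max(e_a_to_b, e_b_to_a)

# Toy documents with made-up unit vectors; None marks an unregistered word.
doc_a = [("ラーメン", [1.0, 0.0]), ("朝ラー", None), ("朝", [0.0, 1.0])]
doc_b = [("ラーメン", [1.0, 0.0]), ("アムラー", None)]

print(quantization_error(doc_a, doc_b))
print(quantization_error(doc_b, doc_a))
print(document_distance(doc_a, doc_b))
```

Repeating a word in the list once per occurrence makes the average weight each word by its frequency, which is one plausible reading of "divided by the number of words".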

Expressed as equations, the quantization errors E_{A→B} and E_{B→A} are as follows.

[Equations not reproduced in this text]
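The original equations are not available in this text. Based only on the verbal definition of the quantization errors given earlier, one way to write them is sketched below. The symbols x_i^A and x_j^B (the i-th word vector of document A and the j-th word vector of document B), and the use of the squared Euclidean distance of Equation (1), are assumptions of this sketch rather than the patent's own notation.

```latex
E_{A \to B} = \frac{1}{N_B} \sum_{j=1}^{N_B} \min_{1 \le i \le N_A}
              \left\lVert x_i^A - x_j^B \right\rVert^2,
\qquad
E_{B \to A} = \frac{1}{N_A} \sum_{i=1}^{N_A} \min_{1 \le j \le N_B}
              \left\lVert x_i^A - x_j^B \right\rVert^2
```

Here N_A and N_B are the numbers of words in document A and document B, and the distance between the two documents is the larger of the two values.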

FIG. 8 is a block diagram showing an inter-document distance calculation device 200 according to Embodiment 2 of the present invention.

Embodiment 1 performs quantization using all words. In Embodiment 2, the word vector set used for quantization is replaced by representative vectors, which represent the word vector set of each document with a smaller number of word vectors. This reduces the computational cost.

The inter-document distance calculation device 200 is the inter-document distance calculation device 100 with representative vector calculation means 60 added.

The representative vector calculation means 60 selects (or calculates) a predetermined number of vectors (representative vectors) that represent the distribution of the vector set obtained by the vector set calculation means 30.

FIG. 9 is a diagram showing five representative vectors that represent the words contained in document B in the inter-document distance calculation device 200.

Since the word vectors of the words "Fukushima" and "Kitakata" are similar to each other, they are represented by a single vector. When a representative vector is calculated, the weighted average of the word vectors is taken, weighted by the frequency of occurrence of each represented word.
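The frequency-weighted average used for a representative vector can be sketched as follows. The vectors and frequencies are made-up values for illustration, and the function name `representative_vector` is an assumption; the text does not state whether the result is renormalized, so this sketch leaves it as is.

```python
def representative_vector(vectors, frequencies):
    """Average of the word vectors, weighted by word frequency."""
    total = sum(frequencies)
    dim = len(vectors[0])
    return [
        sum(f * v[k] for v, f in zip(vectors, frequencies)) / total
        for k in range(dim)
    ]

# Hypothetical unit vectors and frequencies for two similar words.
fukushima = [1.0, 0.0]   # assumed to occur once
kitakata = [0.8, 0.6]    # assumed to occur three times

rep = representative_vector([fukushima, kitakata], [1, 3])
print(rep)
```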

Since the example shown in FIG. 9 is schematic, the number of represented words and the number of representative vectors are nearly the same. An ordinary document, however, contains many more distinct words (words with different original expressions), and the number of word vectors can be greatly reduced by calculating representative vectors, for example by clustering the set of word vectors.

However, since an unregistered word is, by definition, equidistant from every word vector, it is assigned one representative vector of its own. This is the case for "朝ラー" and "アムラー". The representative vectors of document A are denoted {W_1^A, ..., W_a^A}, and the representative vectors of document B are denoted {W_1^B, ..., W_b^B}.

Embodiment 1 corresponds to the case in which every word of a document is a representative vector. Here, the quantization error E_{A→B} obtained when document B is quantized with the representative vectors of document A, and the quantization error E_{B→A} obtained when document A is quantized with the representative vectors of document B, are as follows.

[Equations not reproduced in this text]

The larger of the quantization errors E_{A→B} and E_{B→A} is output as the distance between document A and document B.

Here too, the inner product of a word vector corresponding to an unregistered word with any word vector or representative vector other than that of the same word is taken to be 0. The distance between word vectors is calculated under this assumption. That is, the distance between identical words is 0, and the distance to the word vector (representative vector) of a different word is the sum of the L2 norms of the two vectors.

FIG. 10 is a flowchart showing the flow of the series of processes in Embodiment 2.

In the flowchart shown in FIG. 10, a process of calculating representative vectors (S11) is inserted into the flowchart shown in FIG. 2 between the process of inputting document A and document B (S1) and the process of calculating the quantization errors E_{A→B} and E_{B→A} (S2).

For the word vector set that is to be quantized, representative vectors for quantization error approximation, which represent the word vector set of each document with a smaller number of word vectors, may also be used. The representative vectors for quantization error approximation may be the same as the representative vectors used for quantization described above.

The representative vectors for quantization error approximation of document A are denoted

[Expression not reproduced in this text]

and the representative vectors for quantization error approximation of document B are denoted

[Expression not reproduced in this text]

For each of documents A and B, the number of words represented by each representative vector for quantization error approximation is denoted

[Expression not reproduced in this text]

FIG. 11 is a diagram showing the representative vectors for quantization error approximation of document B and the number of words each represents.

Here, the quantization error E_0^A obtained when the word vectors of document A are quantized with the representative vectors for quantization error approximation of document A is calculated by Equation (4).

[Equation (4) not reproduced in this text]

The quantization error E_0^B is calculated for document B in the same manner.

Using the above information, the quantization error E_{A→B} obtained when document B is quantized with the representative vectors of document A, and the quantization error E_{B→A} obtained when document A is quantized with the representative vectors of document B, are approximated by Equation (5).

[Equation (5) not reproduced in this text]

The larger of the quantization errors E_{A→B} and E_{B→A} is output as the distance between document A and document B.

That is, in the above embodiments, the inner product of a known word x and an unknown word y is set to zero, and the inner product of an unknown word y with itself is set to a constant value α. The distance between the known word x and the unknown word y is (x - y)^2. Expanding this into x^2 - 2xy + y^2, with 2xy equal to zero and y^2 equal to the constant α, the distance between the known word x and the unknown word y is x^2 + α.

An unregistered component such as a word is also regarded as being expressed by an unknown vector. The inner product of that unknown vector and the vector of any other component (whether registered or unregistered) is set to 0, and the inner product of unknown vectors corresponding to the same component is set to a predetermined value α. In this way, unregistered words can be reflected in the calculation of the distance between documents.

If the inner products with other vectors and with the vector itself are defined as above, the approach to calculating the distance between vector set A and vector set B described in Patent Document 1 can be applied to calculating the distance between (vector set + unknown vector set) A and (vector set + unknown vector set) B. Specifically, the Euclidean distance D between a registered vector x and an unknown vector y is obtained by Equation (1).

D = ||x - y||^2
  = ||x||^2 - 2x·y + ||y||^2
  = ||x||^2 + α … Equation (1)

In other words, the approach described in Patent Document 1 is applied to the calculation of the distance between documents.
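As a worked check of Equation (1), the three cases that arise in the embodiments can be written out as below. The sketch assumes unit-length registered vectors and α = 1, as in Embodiment 1; y and y' denote unknown vectors with different original expressions.

```latex
\begin{aligned}
\text{registered } x,\ \text{unknown } y:\quad
  & D = \lVert x \rVert^2 - 2\,x \cdot y + \lVert y \rVert^2 = 1 - 0 + 1 = 2 \\
\text{same unknown } y \text{ in both documents}:\quad
  & D = \alpha - 2\alpha + \alpha = 0 \\
\text{different unknowns } y,\ y':\quad
  & D = \alpha - 0 + \alpha = 2
\end{aligned}
```

These values agree with the distances 0 and 2 quoted for "朝ラー" and "アムラー" in Embodiment 1.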

That is, the above embodiments are an example of an inter-document distance calculation device that calculates the distance between two documents when vectors are registered in advance for words, compound words, or phrases that are components of documents, the device comprising: component frequency calculation means for calculating component frequencies in an input document; vector set calculation means for calculating a vector set by assigning, to each component whose vector is registered, that vector, and by assigning, to each component whose vector is not registered, an unknown vector while retaining its original expression; and vector inner product calculation means for calculating vector inner products by setting the inner product of unknown vectors with each other to a predetermined value and setting the inner product of an unknown vector and the vector of a component whose vector is registered to 0.

In this case, the documents are a first document and a second document, and quantization error calculation means is provided. This means assigns to each word of the second document the nearest word of the first document, calculates the distance between the word vectors, sums the calculated distances over all words, and divides the sum by the number of words to obtain a first quantization error. It likewise assigns to each word of the first document the nearest word of the second document, calculates the distance between the word vectors, sums the calculated distances over all words, and divides the sum by the number of words to obtain a second quantization error.

Inter-document distance determination means is also provided, which determines the larger of the first quantization error and the second quantization error to be the distance between the first document and the second document.

また、上記実施例は、文書の構成要素である単語、合成語、またはフレーズについて、ベクトルが予め登録されているときに、２つの文書間の距離を計算する装置において、入力された文書における構成要素頻度を算出する構成要素頻度算出手段と、ベクトルが登録されている上記構成要素に、そのベクトルを割り当て、ベクトルが登録されていない構成要素について、原表現を残し、未知のベクトルを割り当てることによって、ベクトル集合を算出するベクトル集合算出手段と、原表現が同じ、上記未知のベクトル同士の内積を予め決めた値とし、原表現が互いに異なる上記未知ベクトル同士の内積を０とし、上記未知のベクトルと上記ベクトルが登録されている構成要素のベクトルとの内積を０としてベクトル内積を計算するベクトル内積計算手段とを有する文書間距離計算装置の例である。 Further, in the above embodiment, in a device that calculates a distance between two documents when a vector is registered in advance for a word, a composite word, or a phrase that is a component of the document, the configuration in the input document A component frequency calculating means for calculating an element frequency, assigning the vector to the above-described component in which a vector is registered, leaving an original expression and assigning an unknown vector to a component in which the vector is not registered The vector set calculation means for calculating the vector set is the same as the original expression, the inner product of the unknown vectors having the same original expression is set to a predetermined value, the inner product of the unknown vectors having different original expressions is set to 0, and the unknown vector In the vector to calculate the vector inner product with the inner product of the vector and the component vector in which the vector is registered as 0 It is an example of a document between the distance calculation unit and a calculation unit.

この場合、上記文書が、第１文書と第２文書とであり、上記第２文書の各単語に、上記第１文書の単語で最も近い距離にある単語を割り当て、単語ベクトル間の距離を計算し、この計算された単語ベクトル間距離を全ての単語について足し合わせ、この足し合わせた距離を単語数で割って平均化したものを、第１量子化誤差として算出し、上記第１文書の各単語に、上記第２文書の単語で最も近い距離にある単語を割り当て、単語ベクトル間の距離を計算し、この計算された単語ベクトル間距離を全ての単語について足し合わせ、この足し合わせた距離を単語数で割って平均化したものを、第２量子化誤差として算出する量子化誤差算出手段を有する。 In this case, the document is the first document and the second document, and the word closest to the word of the first document is assigned to each word of the second document, and the distance between the word vectors is calculated. Then, the calculated distance between the word vectors is added up for all the words, and the average of the added distance divided by the number of words is calculated as the first quantization error. The word closest to the word in the second document is assigned to the word, the distance between the word vectors is calculated, the calculated distance between the word vectors is added for all the words, and the added distance is calculated. There is provided a quantization error calculating means for calculating the average by dividing by the number of words as the second quantization error.

The device also has inter-document distance determination means for determining the larger of the first quantization error and the second quantization error to be the distance between the first document and the second document.
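
The two quantization errors and the resulting distance can be sketched as follows. This is an illustrative sketch, not the patented implementation: the tiny `CONCEPT_BASE`, the predetermined value `1.0`, and the choice to derive a Euclidean distance from the inner products are all assumptions. Documents are given as lists of words, so a repeated word counts once per occurrence.

```python
import math

# Hypothetical registered concept vectors (illustrative values only).
CONCEPT_BASE = {
    "cat": (1.0, 0.0),
    "dog": (0.8, 0.6),
    "car": (0.0, 1.0),
}
UNKNOWN_SELF_PRODUCT = 1.0  # assumed predetermined value


def inner(w1, w2):
    """Inner product of two words, with the special rules for unknown vectors."""
    v1, v2 = CONCEPT_BASE.get(w1), CONCEPT_BASE.get(w2)
    if v1 is None and v2 is None:
        return UNKNOWN_SELF_PRODUCT if w1 == w2 else 0.0
    if v1 is None or v2 is None:
        return 0.0
    return sum(x * y for x, y in zip(v1, v2))


def word_distance(w1, w2):
    """Euclidean distance expressed through inner products (an assumption)."""
    squared = inner(w1, w1) - 2.0 * inner(w1, w2) + inner(w2, w2)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


def quantization_error(codebook_doc, target_doc):
    """Average distance from each word of target_doc to its nearest word in codebook_doc."""
    total = sum(min(word_distance(t, c) for c in codebook_doc) for t in target_doc)
    return total / len(target_doc)


def document_distance(first_doc, second_doc):
    # First error: words of the second document assigned to words of the first.
    first_error = quantization_error(first_doc, second_doc)
    # Second error: words of the first document assigned to words of the second.
    second_error = quantization_error(second_doc, first_doc)
    return max(first_error, second_error)
```

Taking the larger of the two errors makes the measure symmetric: a short document that is fully covered by a long one is still considered distant if the long document contains words the short one cannot represent.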

Furthermore, the above embodiments are an example of an inter-document distance calculation device that calculates the distance between two documents when vectors are registered in advance for the words, compound words, or phrases that are the components of a document. The device comprises: representative vector set calculation means for obtaining a first representative vector set for quantization from the vector set of the first document and a second representative vector set for quantization from the vector set of the second document; component frequency calculation means for calculating the frequency of each component in an input document; vector set calculation means for calculating a vector set by assigning the registered vector to each component that has one and, for each component that has no registered vector, retaining its original expression and assigning an unknown vector; and vector inner product calculation means for calculating vector inner products, in which the inner product of two unknown vectors with the same original expression is a predetermined value, the inner product of two unknown vectors with different original expressions is 0, and the inner product of an unknown vector and the vector of a component whose vector is registered is 0.

In this case, the device has quantization error calculation means that obtains a first quantization error, produced when the vector set of the second document is quantized with the first representative vector set for quantization, and a second quantization error, produced when the vector set of the first document is quantized with the second representative vector set for quantization.

The inter-document distance calculation means calculates, as the distance between the first document and the second document, the larger of the quantization error of the second document by the first document and the quantization error of the first document by the second document.
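
A minimal sketch of the representative-vector variant follows. This passage does not say how the representative vector sets are obtained, so the plain k-means procedure, its initialization from the first `k` vectors, and the restriction to registered vectors only are assumptions made for illustration; all function names are hypothetical.

```python
import math


def dist(u, v):
    """Euclidean distance between two vectors given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def representative_vectors(vectors, k, iterations=10):
    """Plain k-means; the initial centers are simply the first k vectors (assumption)."""
    centers = [tuple(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)), key=lambda i: dist(v, centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members; keep it if it has none.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers


def quantization_error(representatives, vectors):
    """Average distance from each vector to its nearest representative vector."""
    return sum(min(dist(v, r) for r in representatives) for v in vectors) / len(vectors)


def document_distance(vectors_a, vectors_b, k=2):
    reps_a = representative_vectors(vectors_a, k)
    reps_b = representative_vectors(vectors_b, k)
    first_error = quantization_error(reps_a, vectors_b)   # second document by first set
    second_error = quantization_error(reps_b, vectors_a)  # first document by second set
    return max(first_error, second_error)
```

Because each document is compared against only `k` representative vectors instead of every word of the other document, the number of distance evaluations per word no longer grows with the length of the other document.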

The above embodiments are also an example of a high-speed quantization error calculation device that quickly calculates the quantization error of the vector set of one document with respect to the representative vector set used to quantize another document. The device comprises: means for obtaining, from the vector set of the first document, a first representative vector set to be quantized and the frequency of the words represented by each representative vector, and, from the vector set of the second document, a second representative vector set to be quantized and the frequency of the words represented by each representative vector; means for obtaining a quantization error used for a first approximation, produced when the first vector set is quantized with the first representative vector set to be quantized, and a quantization error used for a second approximation, produced when the second vector set is quantized with the second representative vector set to be quantized; and means for obtaining, using the quantization error used for the first approximation, an approximate quantization error for the case where the first representative vector set to be quantized is quantized with the second representative vector set for quantization.

The quantization error used for the first approximation is the quantization error $E_0^A$ given in Equation (4). The quantization error used for the second approximation is the quantization error $E_0^B$, calculated in the same way as $E_0^A$. The approximate quantization errors are the quantization errors $E_{A \to B}$ and $E_{B \to A}$ given in Equation (5).

The above embodiments can also be understood as an invention of a program. That is, the above embodiments are an example of a computer program for causing a computer to function as each of the means constituting the inter-document distance calculation device. The above embodiments are likewise an example of a computer program for causing a computer to function as each of the means constituting the quantization error calculation device.

A block diagram showing the inter-document distance calculation device 100 according to Embodiment 1 of the present invention.
A flowchart showing the operation of Embodiment 1.
A diagram showing the word concept base stored in the word concept base storage means 70 in Embodiment 1.
A diagram showing an example of the word frequencies obtained by morphological analysis of document A and document B.
A diagram showing an example of the word frequencies obtained by morphological analysis of document A and document B.
A diagram showing an example of the result of quantizing the word vector set of document B with the word vector set of document A.
A diagram showing an example of the result of quantizing the word vectors of document A with the word vector set belonging to document B.
A block diagram showing the inter-document distance calculation device 200 according to Embodiment 2 of the present invention.
A diagram showing five representative vectors that represent the words contained in document B in the inter-document distance calculation device 200.
A flowchart showing the flow of a series of processes in Embodiment 2.
A diagram showing the representative vectors for quantization error approximation of document B and the number of words each one represents.

Explanation of symbols

100: Inter-document distance calculation device
10: Document input means
20: Word frequency calculation means
30: Quantization error means
40: Quantization error calculation means
50: Inter-document distance determination means
60: Representative vector calculation means
70: Word concept base storage means
200: Inter-document distance calculation device

Claims

A device for calculating the distance between two documents when vectors are registered in advance for the words, compound words, or phrases that are the components of a document, the device comprising:
component frequency calculation means for calculating the frequency of each component in an input document;
vector set calculation means for calculating a vector set by assigning the registered vector to each component whose vector is registered and, for each component whose vector is not registered, retaining its original expression and assigning an unknown vector; and
vector inner product calculation means for calculating vector inner products, wherein the inner product of unknown vectors having the same original expression is a predetermined value, the inner product of unknown vectors having different original expressions is 0, and the inner product of an unknown vector and the vector of a component whose vector is registered is 0.

The inter-document distance calculation device according to claim 1, wherein
the documents are a first document and a second document,
the device further comprising quantization error calculation means that assigns to each word of the second document the nearest word of the first document, calculates the distance between the word vectors, sums the calculated distances over all words, and calculates the sum divided by the number of words as a first quantization error, and that
assigns to each word of the first document the nearest word of the second document, calculates the distance between the word vectors, sums the calculated distances over all words, and calculates the sum divided by the number of words as a second quantization error.

The inter-document distance calculation device according to claim 2, further comprising
inter-document distance determination means for determining the larger of the first quantization error and the second quantization error to be the distance between the first document and the second document.

A device for calculating the distance between two documents when vectors are registered in advance for the words, compound words, or phrases that are the components of a document, the device comprising:
representative vector set calculation means for obtaining a first representative vector set for quantization from the vector set of the first document and a second representative vector set for quantization from the vector set of the second document;
component frequency calculation means for calculating the frequency of each component in an input document;
vector set calculation means for calculating a vector set by assigning the registered vector to each component whose vector is registered and, for each component whose vector is not registered, retaining its original expression and assigning an unknown vector; and
vector inner product calculation means for calculating vector inner products, wherein the inner product of unknown vectors having the same original expression is a predetermined value, the inner product of unknown vectors having different original expressions is 0, and the inner product of an unknown vector and the vector of a component whose vector is registered is 0.

The inter-document distance calculation device according to claim 4, further comprising
quantization error calculation means that obtains a first quantization error, produced when the vector set of the second document is quantized with the first representative vector set for quantization, and
a second quantization error, produced when the vector set of the first document is quantized with the second representative vector set for quantization.

The inter-document distance calculation device according to claim 4, wherein
the inter-document distance calculation means calculates, as the distance between the first document and the second document, the larger of the quantization error of the second document by the first document and the quantization error of the first document by the second document.

A high-speed quantization error calculation device that quickly calculates the quantization error of the vector set of one document with respect to the representative vector set used to quantize another document, the device comprising:
means for obtaining, from the vector set of the first document, a first representative vector set to be quantized and the frequency of the words represented by each representative vector, and, from the vector set of the second document, a second representative vector set to be quantized and the frequency of the words represented by each representative vector;
means for obtaining a quantization error used for a first approximation, produced when the first vector set is quantized with the first representative vector set to be quantized, and a quantization error used for a second approximation, produced when the second vector set is quantized with the second representative vector set to be quantized; and
means for obtaining, using the quantization error used for the first approximation, an approximate quantization error for the case where the first representative vector set to be quantized is quantized with the second representative vector set for quantization.

A computer program for causing a computer to function as each of the means constituting the inter-document distance calculation device according to any one of claims 1 to 6.

A computer program for causing a computer to function as each of the means constituting the quantization error calculation device according to claim 7.