JP5244452B2

JP5244452B2 - Document feature expression calculation apparatus and program

Info

Publication number: JP5244452B2
Application number: JP2008128857A
Authority: JP
Inventors: 俊郎内山; 克人別所; 直人阿部; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-05-15
Filing date: 2008-05-15
Publication date: 2013-07-24
Anticipated expiration: 2028-05-15
Also published as: JP2009277100A

Description

The present invention relates to a document feature expression calculation technique for expressing the features of a document in a form that is easy for humans to understand.

Conventional techniques for representing the features of a document include, for example, a method using a vector space model (Non-Patent Document 1) and a method using concept vectors (Patent Document 2).

In the method using the vector space model, a document is decomposed into words or the like, and the features of the document are expressed by a vector whose elements are the appearance frequencies of those words or the like. In the method using concept vectors, pre-registered concept vectors are associated with the words in a document, the document is represented as a set of concept vectors, and the average vector of that set is taken as the concept vector of the document (referred to as a document vector), thereby expressing the document features.

In the present specification and claims, the term "document" is used to mean both a document as a set of sentences having a certain coherent concept and a document set (document group) including a plurality of such documents. The term "document" also covers documents, document groups, and the like belonging to a certain category.

On the other hand, a vector quantization technique that represents a vector set with a limited number of representative vectors is known (Patent Document 1).

Kenji Kita, Kazuhiko Tsuda, Masami Shishibori, "Information Retrieval Algorithms" (情報検索アルゴリズム), pp. 60-63, Kyoritsu Shuppan, 2002
JP 08-316842 A
JP 2007-72610 A

With the above conventional techniques for expressing the features of a document, it has been difficult to express the features of the document itself in a form that is easy for humans to understand. Specifically, in the vector space model the number of words becomes enormous, so a human cannot easily grasp the features of the document. In the method of expressing document features by concept vectors, it is difficult for a human to interpret the concept vectors.

Here, when a vector quantization technique is applied to the concept vector set, a limited number of representative vectors can be obtained. Even in this case, however, it is difficult for a human to interpret the representative vectors, so the features of the document still cannot be expressed in a form that is easy for humans to understand. If the features of a document could be expressed in such a form, it would become easy to judge the validity of the document features and to grasp the tendency of the document.

The present invention has been made in view of the above points, and an object of the present invention is to provide a document feature expression calculation technique that makes it possible to express the features of a document in a form that is easy for humans to understand.

In order to solve the above problem, the present invention is configured as a document feature expression calculation apparatus that outputs, from an input document, index words for representing the features of the document and their weights, the apparatus comprising: concept base storage means for storing a concept base including index words and vectors corresponding to them; index word extraction means for extracting each index word from the document; input vector set calculation means for acquiring, from the concept base storage means, the vector corresponding to each index word extracted by the index word extraction means as an input vector, and storing the set of input vectors in input vector set storage means; representative vector set calculation means for calculating a set of representative vectors by quantizing the set of input vectors stored in the input vector set storage means so that the quantization error is minimized; and index word weight calculation means for calculating and outputting the index word corresponding to each representative vector calculated by the representative vector set calculation means and its weight, wherein the representative vector set calculation means performs, as the quantizing process, a process using a competitive learning method, and, when updating a representative vector using a randomly selected input vector in the learning process of the competitive learning method, selects, from the set of vectors stored in the concept base storage means, the vector that is closest to a center and lies within the distance from the center to the representative vector to be updated, the center being the position obtained by moving the representative vector to be updated toward the input vector at the ratio of the learning rate g, and uses the selected vector as the updated representative vector.

In the learning process of the competitive learning method, the representative vector set calculation means may increase the learning rate corresponding to a representative vector by the initial value g₀ of the learning rate g when the value of the updated representative vector is the same as its value before the update, and may set the learning rate g to the initial value g₀ when it is not the same.
The index word weight calculation means may calculate, as the weight of the index word corresponding to a representative vector, the sum of the weights corresponding to the input vectors in the input vector group represented by that representative vector.

The present invention can also be configured as a document feature expression calculation apparatus that calculates and outputs, from an input document, weights corresponding to index words representing the features of the document, the apparatus comprising: concept base storage means for storing a concept base including index words and vectors corresponding to them; feature expression index word storage means for storing feature expression index words, which are the index words representing the features of the document calculated by the document feature expression calculation apparatus including the representative vector set calculation means; index word extraction means for extracting each index word and its appearance frequency from the document; document vector calculation means for acquiring, from the concept base storage means, the vector corresponding to each index word extracted by the index word extraction means, and calculating a document vector by averaging the set of vectors weighted by the appearance frequencies; and weight calculation means for calculating and outputting the weight corresponding to each feature expression index word, using the vector in the concept base corresponding to each feature expression index word stored in the feature expression index word storage means and the document vector calculated by the document vector calculation means.
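The frequency-weighted averaging performed by the document vector calculation means can be sketched as follows. This is only an illustration under stated assumptions: the function name `document_vector`, the dictionary layout of the concept base, and the decision to skip index words that are missing from the concept base are hypothetical and are not specified in the text.

```python
def document_vector(term_freq, concept_base):
    """Frequency-weighted mean of the concept vectors of the index words.

    term_freq:    {index word: appearance frequency}
    concept_base: {index word: concept vector (list of floats)}
    """
    total = 0
    acc = None
    for term, freq in term_freq.items():
        vec = concept_base.get(term)
        if vec is None:
            # Assumption: index words absent from the concept base are skipped.
            continue
        if acc is None:
            acc = [0.0] * len(vec)
        for i, v in enumerate(vec):
            acc[i] += freq * v
        total += freq
    if acc is None:
        raise ValueError("no index word of the document is in the concept base")
    return [a / total for a in acc]


# Toy two-dimensional concept base (made-up values for illustration only).
concept_base = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
doc_vec = document_vector({"cat": 3, "dog": 1, "unknown": 5}, concept_base)
print(doc_vec)
```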

The weight calculation means may calculate the weights by multiplying the document vector by the pseudo-inverse of the matrix obtained by arranging, side by side as column vectors, the vectors in the concept base corresponding to the feature expression index words.
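As a rough illustration of this weight calculation, the sketch below computes w = A⁺d, where the columns of A are the concept vectors of the feature expression index words and d is the document vector. It assumes that those concept vectors are linearly independent, in which case A⁺ = (AᵀA)⁻¹Aᵀ and the weights can be obtained from the normal equations. The helper names `solve` and `term_weights` are hypothetical; the text does not prescribe how the pseudo-inverse is computed.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: the vectors are not linearly independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def term_weights(term_vectors, doc_vec):
    """Weights w = A+ d, computed as the solution of (A^T A) w = A^T d.

    term_vectors: concept vectors of the feature expression index words
                  (the columns of A); assumed linearly independent.
    """
    k = len(term_vectors)
    gram = [[dot(term_vectors[i], term_vectors[j]) for j in range(k)] for i in range(k)]
    rhs = [dot(v, doc_vec) for v in term_vectors]
    return solve(gram, rhs)


weights = term_weights([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.75, 0.25, 0.1])
print(weights)
```

When NumPy is available, `numpy.linalg.pinv(A) @ d` computes the same product and also handles the rank-deficient case.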

The present invention can also be configured as a program for causing a computer to function as each means of the above document feature expression calculation apparatus.

According to the present invention, the features of a document can be expressed in a form that is easy for humans to understand, which makes it possible to judge the validity of the document features and to grasp the tendency of the document. As a result, important clues for judging the quantity and quality of documents can be obtained, for example when collecting correctly labeled documents for a document classification problem.

Hereinafter, a first embodiment and a second embodiment will be described as embodiments of the present invention. The embodiments described below use a "word" as the unit for representing the features of a document, but the unit is not limited to a "word"; any unit may be used as long as it is a constituent element of a document. For example, a compound word, a phrase, or the like may be used. In that case, "word" in the following description may be read as that term (compound word, phrase, etc.). In the present specification and claims, the words, compound words, phrases, and the like serving as the unit are referred to as index words.

(First embodiment)
First, a first embodiment of the present invention will be described.

<Overview>
In the present embodiment, the features of a document are represented by a limited number of words and the weights assigned to those words. Because words are easy for humans to understand and their number is limited, the present embodiment makes it easy for a human to grasp the overall features of a document.

In the present embodiment, the words representing the features of a document are extracted so that the concept vectors corresponding to those words well represent the concept vectors of the words in the document. Here, "well represent" means that the quantization error is small. The vector quantization processing in the present embodiment is based on the competitive learning method described in Patent Document 1. The main difference from the technique of Patent Document 1 is that the values a representative vector can take are limited to the vectors in the word concept base, and the algorithm has been extended and modified accordingly.

<Device configuration>
FIG. 1 shows a functional configuration diagram of a document feature expression calculation apparatus 10 according to the first embodiment of the present invention. As shown in FIG. 1, the document feature expression calculation apparatus 10 of the present embodiment includes an input unit 11, a word frequency calculation unit 12, an input vector set calculation unit 13, a representative vector set calculation unit 14, a word/weight calculation unit 15, an output unit 16, a word concept base storage unit 17, an input vector set storage unit 18, and a representative vector set storage unit 19.

The input unit 11 is a functional unit for inputting the document whose features are to be expressed and the values of the various parameters used in the processing described later. The word frequency calculation unit 12 is a functional unit for calculating, from the input document, each word constituting the document and its appearance frequency. The input vector set calculation unit 13 is a functional unit that converts each word into a vector by referring to the word concept base stored in the word concept base storage unit 17, and stores the vector in the input vector set storage unit 18 in association with the appearance frequency.

The representative vector set calculation unit 14 is a functional unit that calculates representative vectors using the input parameter information, the input vector set, and the word concept base, and stores them in the representative vector set storage unit 19 as a representative vector set. The word/weight calculation unit 15 is a functional unit that calculates words and their weights as information representing the concept of the document, using the input vector set, the representative vector set, and the word concept base. The output unit 16 is a functional unit for outputting the words and weights calculated by the word/weight calculation unit 15.

The word concept base storage unit 17 stores the word concept base, the input vector set storage unit 18 stores the input vector set, and the representative vector set storage unit 19 stores the representative vector set.

The document feature expression calculation apparatus 10 is realized by causing a general computer, including a CPU and storage devices such as a memory and a hard disk, to execute a program corresponding to the processing described in the present embodiment. Each functional unit described above is realized by the computer executing that program. Accordingly, the exchange of information between the functional units is, for example, actually performed via a storage device such as a memory. The program can be installed on the computer from a recording medium such as a memory, or may be downloaded from a server on a network.

<Operation of the document feature expression calculation apparatus 10>
Next, the processing operation of the document feature expression calculation apparatus 10 will be described with reference to the flowcharts shown in FIGS. 2 and 3.

Before the processing of the document feature expression calculation apparatus 10 is performed, a word concept base is created as preprocessing and stored in the word concept base storage unit 17. As a method for creating the word concept base, for example, a document corpus is used, and the concept vector corresponding to each word is calculated from the word-semantic attribute co-occurrence information described in Patent Document 2 through dimension compression processing such as singular value decomposition. The result is stored in the word concept base storage unit 17 as the word concept base.

In Patent Document 2, the above concept vector is described as a "word vector". When the concept base is calculated, word-word co-occurrence information may be used instead of word-semantic attribute co-occurrence information.

Words and concept vectors are linked one-to-one, and the word concept base is created in the form of a correspondence table between words and concept vectors, for example as shown in FIG. 4.

The flow of processing will now be described along the flowchart of FIG. 2.

First, the document, the vector quantization number (n), the end count (L), the inspection count (C), and the base learning rate (g₀) are input to the document feature expression calculation apparatus 10 through the input unit 11 (step 1). Here, the "document" is the document whose features are to be expressed, and the vector quantization number n is the number of words that will finally represent the features of the document. The other values are used in the representative vector calculation described later. The input information is sent to the word frequency calculation unit 12.

Next, the word frequency calculation unit 12 performs morphological analysis on the input document to divide it into words, acquires each word and its appearance frequency (step 2), and sends the acquired information together with the above input information to the input vector set calculation unit 13.

The input vector set calculation unit 13 refers to the word concept base stored in the word concept base storage unit 17 and acquires from it the concept vector corresponding to each word obtained in step 2. It then calculates an input vector set, which is a set of entries each associating a vector with the appearance frequency of the corresponding word, and stores it in the input vector set storage unit 18 (step 3). FIG. 5 shows an example of the input vector set.
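Steps 2 and 3 can be sketched roughly as follows. The sketch makes several assumptions that are not in the text: whitespace splitting stands in for morphological analysis, the concept base is a plain dictionary, words that are not in the concept base are dropped, and the names `build_input_vector_set`, `word`, `vector`, and `freq` are hypothetical.

```python
from collections import Counter


def build_input_vector_set(text, concept_base, tokenize=str.split):
    """Return one entry per distinct word: its concept vector and appearance frequency.

    tokenize is a stand-in for morphological analysis (assumption: whitespace split).
    """
    freq = Counter(tokenize(text))
    return [
        {"word": word, "vector": concept_base[word], "freq": count}
        for word, count in freq.items()
        if word in concept_base  # assumption: unknown words are ignored
    ]


# Toy concept base with made-up two-dimensional vectors.
concept_base = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
input_set = build_input_vector_set("cat dog cat bird", concept_base)
for entry in input_set:
    print(entry)
```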

Subsequently, the representative vector set calculation unit 14 calculates a set of representative vectors using the input vector set calculated in step 3, the word concept base, and the vector quantization number (n), end count (L), inspection count (C), and base learning rate (g₀) passed from the input vector set calculation unit 13, and stores the result in the representative vector set storage unit 19 as the representative vector set (step 4).

The representative vector set calculation process executed by the representative vector set calculation unit 14 will now be described in detail along the flowchart shown in FIG. 3.

First, as described above, the vector quantization number (n), the end count (L), the inspection count (C), and the base learning rate (g₀) are input from the input vector set calculation unit 13 to the representative vector set calculation unit 14 (step 101).

Here, the end count L is the total number of learning iterations performed when calculating the representative vector set, and the inspection count C is the number of learning iterations performed before the next representative vector is generated when representative vectors are generated one after another. The base learning rate g₀ is the minimum value of the rate of change applied when a representative vector is updated. As described later, the learning rate changes depending on the situation.

Subsequently, the representative vector set calculation unit 14 initializes the parameters that manage the process (step 102). The number of representative vectors generated so far in the iterative process is defined as q and set to 1. The learning count, which is the number of learning iterations already performed, is defined as t and set to the initial value 0. The partial count, which counts the learning iterations until the inspection count C is reached, is defined as r and set to the initial value 0.

Next, the representative vector is initialized. The representative vector set calculation unit 14 randomly selects one input vector from the input vector set stored in the input vector set storage unit 18 and uses its value as the value of the first representative vector. Each representative vector carries a learning rate g and a partial distortion value; here the learning rate g is set to the base learning rate g₀ and the partial distortion is set to 0. The partial distortion is the quantization error, totaled per representative vector, that arises when the input vector set is quantized with the representative vectors. How it is used is explained in the representative vector generation process starting at step 151.

The representative vector is stored in a work area in memory. FIG. 6 shows an example of representative vector data. The values stored in this work area are updated successively by the processing described below. The "number" shown in FIG. 6 is the number of the concept vector selected as the representative vector and corresponds to the "number" in the word concept base shown in FIG. 4. This number links each representative vector in FIG. 6 to a word in the word concept base.
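The initialization of steps 102 onward and the per-vector data kept in the work area might be modeled as in the following sketch. The record layout is an assumption: the class `Representative`, the function `init_state`, and the dictionary keys are hypothetical names, and the actual layout in FIG. 6 is not reproduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Representative:
    number: int              # number of the concept vector in the word concept base
    vector: list             # current value of the representative vector
    g: float                 # learning rate attached to this representative vector
    distortion: float = 0.0  # accumulated partial distortion


def init_state(input_set, g0, rng=random):
    """Set q = 1, t = 0, r = 0 and create the first representative vector.

    input_set: entries with hypothetical keys "number" and "vector".
    """
    first = rng.choice(input_set)  # random pick from the input vector set
    rep = Representative(first["number"], list(first["vector"]), g0)
    return {"q": 1, "t": 0, "r": 0, "reps": [rep]}


state = init_state([{"number": 7, "vector": [0.1, 0.2]}], g0=0.05)
print(state)
```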

Next, the representative vector set calculation unit 14 determines whether the partial count r has reached the inspection count C and the number of representative vectors q is smaller than the vector quantization number n (step 103). If the result is Yes, the process of step 151 is performed; if No, the process from step 104 is performed. The processing for the case of No is described first.

In step 104, the representative vector set calculation unit 14 randomly selects one input vector from the input vector set stored in the input vector set storage unit 18. Each vector is treated as if it were present as many times as its appearance frequency, so the probability that an input vector is selected is proportional to its appearance frequency. In other words, the higher the appearance frequency, the more likely the corresponding input vector is to be selected. When each word also carries a weight such as an importance, that weight can be reflected in the appearance frequency (that is, in the selection probability) as well.
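The frequency-proportional selection of step 104 can be illustrated with the standard library function `random.choices`, which accepts per-item weights. The function name `pick_input` and the `freq` key are hypothetical.

```python
import random


def pick_input(input_set, rng=random):
    """Pick one input vector with probability proportional to its appearance frequency."""
    weights = [item["freq"] for item in input_set]
    return rng.choices(input_set, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the example is repeatable
items = [{"word": "a", "freq": 9}, {"word": "b", "freq": 1}]
picked = [pick_input(items, rng)["word"] for _ in range(10)]
print(picked)
```

If words carry an additional importance weight, multiplying it into `freq` before sampling is one way to reflect it in the selection probability.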

The representative vector set calculation unit 14 then calculates the squared Euclidean distance between the selected input vector X and each representative vector W already generated and stored in the work area, and selects as the winner the representative vector W with the smallest distance to the input vector X (step 105). If several representative vectors are at the same distance, the most recently registered one is taken as the winner. The distance between the input vector and the winning representative vector is also added to the partial distortion of that representative vector (step 106).
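Steps 105 and 106 can be sketched as below. The sketch assumes that the list of representative vectors is kept in registration order (oldest first), so that using `<=` in the comparison lets the most recently registered vector win ties, and that the value accumulated into the partial distortion is the squared Euclidean distance. The names `sq_dist`, `select_winner`, and the dictionary keys are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_winner(x, reps):
    """Return the representative vector closest to x and accumulate its partial distortion.

    reps: list of dicts with keys "vector" and "distortion", oldest first (assumption).
    """
    best, best_d = None, None
    for rep in reps:
        d = sq_dist(x, rep["vector"])
        if best_d is None or d <= best_d:  # '<=' so that a later (newer) entry wins ties
            best, best_d = rep, d
    best["distortion"] += best_d  # step 106
    return best


reps = [
    {"vector": [0.0, 0.0], "distortion": 0.0},
    {"vector": [2.0, 0.0], "distortion": 0.0},
    {"vector": [0.0, 2.0], "distortion": 0.0},
]
winner = select_winner([1.0, 1.0], reps)  # all three are at squared distance 2.0
print(winner)
```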

Next, the representative vector set calculation unit 14 updates the winning representative vector (step 107). First, W + g(X − W) is calculated using the winning representative vector W, the input vector X, and the learning rate g attached to this representative vector, and the result is taken as a provisional representative vector W₀. That is, W₀ is obtained by W₀ ← W + g(X − W).

As shown in FIG. 7, the provisional representative vector W₀ is the vector obtained by moving the winning representative vector W toward the input vector X by a proportion given by the learning rate g. In Patent Document 1, this W₀ is used as the updated representative vector. In the present embodiment, by contrast, the values a representative vector can take are limited to the vectors in the word concept base, so the following approximation is performed.

That is, among the vectors in the word concept base that lie within the distance ‖W − W₀‖ from W₀, the representative vector set calculation unit 14 takes the vector closest to W₀ as the new value W of the winning representative vector. In other words, the pre-update winning representative vector stored in the work area is overwritten with the new representative vector, and the number of the corresponding concept vector is recorded. If several vectors in the word concept base are equally closest to W₀, the one most recently registered in the word concept base is selected and used as the new value W of the winning representative vector.
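The constrained update of step 107 can be sketched as follows. The provisional vector W₀ = W + g(X − W) is computed first, and the new representative vector is then the concept base vector nearest to W₀ among those no farther from W₀ than W itself. The sketch assumes that concept vectors are held in a list whose index plays the role of the "number", and that a larger index means a more recent registration; the function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def update_winner(w_idx, x, g, concept_vectors):
    """Return the index of the concept vector that becomes the updated representative vector.

    w_idx: index of the current (winning) representative vector W in concept_vectors.
    """
    w = concept_vectors[w_idx]
    w0 = [wi + g * (xi - wi) for wi, xi in zip(w, x)]  # W0 = W + g (X - W)
    radius = sq_dist(w, w0)                            # squared ||W - W0||
    best_idx, best_d = None, None
    for idx, v in enumerate(concept_vectors):
        d = sq_dist(v, w0)
        # W itself always satisfies d <= radius, so a candidate always exists.
        # '<=' on best_d lets a later index win ties (assumed: later = newer).
        if d <= radius and (best_d is None or d <= best_d):
            best_idx, best_d = idx, d
    return best_idx


concept_vectors = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.0], [5.0, 5.0]]
print(update_winner(0, [1.0, 0.0], 0.5, concept_vectors))  # W0 = [0.5, 0.0]
print(update_winner(0, [1.0, 0.0], 0.1, concept_vectors))  # W0 = [0.1, 0.0]
```

With g = 0.5 the vector [0.4, 0.0] is nearest to W₀ and becomes the new representative vector; with g = 0.1 no other concept vector falls within the radius, so the representative vector stays where it is.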

Further, the representative vector set calculation unit 14 updates the value of the learning rate g attached to the representative vector (step 108). If the representative vector has moved as a result of learning, its learning rate is reset to the base learning rate by g ← g₀. If the representative vector has not moved, its learning rate is updated by g ← g + g₀.

The representative vector does not move when the only vector in the word concept base lying within the distance ‖W − W₀‖ from W₀ is the pre-update representative vector W itself. By changing the learning rate in this way, the learning rate per update of the representative vector can be kept at the base learning rate g₀. In other words, the expected value of the learning rate can be made equal to the base learning rate g₀.
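The learning rate rule of step 108 is small enough to state directly in code. The function name `next_learning_rate` is hypothetical; the example shows the rate growing by g₀ for each update in which the representative vector stays put and returning to g₀ once it moves.

```python
def next_learning_rate(g, g0, moved):
    """Reset to g0 when the representative vector moved, otherwise grow by g0."""
    return g0 if moved else g + g0


g0 = 0.1
g = g0
history = []
for moved in [False, False, False, True]:
    g = next_learning_rate(g, g0, moved)
    history.append(g)
print(history)
```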

Next, the representative vector set calculation unit 14 adds 1 to each of the partial count r and the learning count t (step 109), and then determines whether the learning count t has reached L (step 110). If the result is Yes, all the representative vectors in the work area are stored in the representative vector set storage unit 19 as the "representative vector set" (step 111), and the process ends. If the result is No, the process returns to the determination of step 103.
The learning process of steps 104 to 108 is repeated in this way, and when the partial count r has reached the inspection count C and the number of representative vectors q is smaller than the vector quantization number n, the representative vector generation process from step 151 is performed.

That is, the representative vector set calculation unit 14 clears the partial count r to 0 (step 151) and then obtains the partial distortion of each representative vector from the work area (step 152). As described above, the partial distortion is the quantization error that arises when the input vector set is quantized with the representative vectors, totaled separately for each representative vector.

It would also be possible to compute the partial distortion exactly by calculating the distance between a representative vector and every input vector in the region it governs and summing those distances. In this embodiment, however, the partial distortion accumulated by adding the distance to the input vector at each learning step is regarded as an approximation of the true partial distortion of each representative vector, and the representative vector with the largest partial distortion is selected (step 153). At this point, the partial distortion values of all representative vectors are reset to 0.

Next, the representative vector set calculation unit 14 generates a new representative vector based on the representative vector selected in step 153 (step 154). In this embodiment, one new representative vector having the same vector value as the selected representative vector is generated and stored in the work area. The representative vector set calculation unit 14 then adds 1 to the number of representative vectors q (step 155) and proceeds to step 103.
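Steps 153 to 155 can be sketched as a single helper. This is an illustrative sketch only: the name `split_max_distortion` and the use of plain Python lists for the work area are assumptions, and the other per-vector fields kept in the work area (such as the learning rate and the concept vector number) are omitted.

```python
import numpy as np

def split_max_distortion(reps, distortion):
    """Duplicate the representative vector with the largest partial distortion.

    reps       : list of representative vectors (np.ndarray of shape (M,))
    distortion : list of accumulated partial distortions, one per vector
    Returns new (reps, distortion) lists; the inputs are left unmodified.
    """
    # Step 153: pick the vector with the largest accumulated partial distortion.
    k = int(np.argmax(distortion))
    # Step 154: create one new vector with the same value as the selected one.
    new_reps = reps + [reps[k].copy()]
    # Step 153 also resets every partial distortion to 0.
    new_distortion = [0.0] * len(new_reps)
    # Step 155 (q <- q + 1) is implicit in the longer list.
    return new_reps, new_distortion
```

The duplicate starts at the same position as its parent; subsequent competitive learning is what separates the two.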

In this way, the representative vectors, of which there is initially only one, are increased by one each time learning has been repeated C times (the check count), until their number finally equals the vector quantization number n. Learning of the representative vectors continues, and when the termination count L is finally reached, the representative vector set calculation unit 14 stores all representative vectors as the representative vector set in the representative vector set storage unit 19 (step 111) and ends the representative vector set calculation process shown in FIG. 3. The structure of the data stored in the representative vector set storage unit 19 is the same as that shown in FIG. 6.

After the representative vector set has been calculated by the processing described so far, the word/weight calculation unit 15 calculates and outputs the words and weights that finally express the document features (step 5 in FIG. 2).

In this process, the word/weight calculation unit 15 first represents each input vector stored in the input vector set storage unit 18 by the representative vector closest to it. That is, for a given input vector, the representative vector closest to that input vector is taken as the representative vector that represents it. Performing this for every input vector yields the representative vector corresponding to each input vector. Then, for each representative vector, the word/weight calculation unit 15 calculates the sum of the weights of the input vectors that it represents (in this embodiment, for example, the appearance frequency associated with each input vector can be used as the weight) and takes that sum as the weight of the representative vector.

Next, the word/weight calculation unit 15 refers to the word concept base storage unit 17 to obtain the word corresponding to each representative vector (each of which is a concept vector in the word concept base). It then outputs each word together with its weight (the weight of the representative vector corresponding to that word) via the output unit 16. With this processing, the document features have been expressed as words and their associated weights, and the process ends.
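The assignment of input vectors to representative vectors and the summation of weights can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `representative_word_weights`, the array layouts, and the use of Euclidean distance are illustrative choices, and the appearance frequency is used as the weight of each input vector, as the embodiment suggests.

```python
import numpy as np

def representative_word_weights(inputs, freqs, rep_ids, concept_base, words):
    """Compute (word, weight) pairs for the representative vectors.

    inputs       : (P, M) array of input vectors
    freqs        : (P,) appearance frequency of each input vector's word
    rep_ids      : row indices in concept_base of the representative vectors
    concept_base : (K, M) array of concept vectors
    words        : list of K words, words[i] belonging to concept_base[i]
    """
    reps = concept_base[rep_ids]  # (n, M)
    # Distance from every input vector to every representative vector.
    d = np.linalg.norm(inputs[:, None, :] - reps[None, :, :], axis=2)  # (P, n)
    # Each input vector is represented by its nearest representative vector.
    nearest = np.argmin(d, axis=1)
    # Sum the frequencies of the input vectors assigned to each representative.
    weights = np.bincount(nearest, weights=freqs, minlength=len(rep_ids))
    return [(words[i], float(w)) for i, w in zip(rep_ids, weights)]
```

Because every representative vector is itself a concept vector, looking up its word is a plain index into the word list.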

(Second Embodiment)
Next, a second embodiment of the present invention will be described. This embodiment focuses on the fact that the document vector (see Patent Document 2), which is the average of the concept vectors of all the words in a document, represents the features of the document, and is based on the idea of approximating this document vector by words and weights.

In this embodiment, the set of words used to represent the document features is assumed to have been obtained in advance by some method. For example, a document is processed with the document feature expression calculation apparatus 10 of the first embodiment to obtain a set of words that represents the document well. If the features of various documents can be expressed with a common word set, those documents can be compared with one another in a form that is easy for humans to understand. The second embodiment assumes this kind of use.

<Apparatus configuration>
FIG. 8 shows a functional configuration diagram of the document feature expression calculation apparatus 20 in the second embodiment. As shown in FIG. 8, the document feature expression calculation apparatus 20 in this embodiment has an input unit 21, a word frequency calculation unit 22, a document vector calculation unit 23, a weight calculation unit 24, an output unit 25, a word concept base storage unit 26, and a word set storage unit 27.

The input unit 21 is a functional unit for inputting the document whose features are to be expressed. The word frequency calculation unit 22 is a functional unit for calculating, from the input document, each word that makes up the document and its appearance frequency. The document vector calculation unit 23 is a functional unit for calculating the document vector from the words and their appearance frequencies by referring to the word concept base. The weight calculation unit 24 is a functional unit for calculating the weights (coefficients) of the words based on the idea described above. The output unit 25 is a functional unit for outputting the words representing the features of the document and the weights calculated by the weight calculation unit 24.

The word concept base storage unit 26 is a storage unit that stores the word concept base, and the word set storage unit 27 is a storage unit that stores the words representing the features of a document.

The document feature expression calculation apparatus 20 is realized by causing a general-purpose computer that includes a CPU and storage devices such as memory and a hard disk to execute a program corresponding to the processing described in this embodiment, and each functional unit described above is realized by the computer executing that program. Accordingly, information is in practice exchanged between the functional units via a storage device such as memory. The program can be installed on the computer from a recording medium such as a memory, or it may be downloaded from a server on a network.

<Operation of the document feature expression calculation apparatus 20>
Next, the processing operation of the document feature expression calculation apparatus 20 will be described with reference to the flowchart shown in FIG. 9.

First, as preprocessing, a word concept base is prepared in the same way as in the first embodiment and stored in the word concept base storage unit 26, and a set of words representing the features of the documents to be processed is obtained and stored in the word set storage unit 27. For example, the word concept base used in the first embodiment can be used as the word concept base, and the set of words obtained by applying the processing of the first embodiment to the documents targeted in this embodiment can be used as the set of words representing the document features. In the following, the number of words used to represent the document features is denoted N, and the number of dimensions of the concept vectors in the word concept base is denoted M.

After the preprocessing, the document to be processed is first input to the document feature expression calculation apparatus 20 via the input unit 21 (step 11). The document is passed to the word frequency calculation unit 22. The word frequency calculation unit 22 divides the document into words by morphological analysis and calculates the words and their appearance frequencies (step 12). This information is sent to the document vector calculation unit 23.

Subsequently, the document vector calculation unit 23 performs the document vector calculation process (step 13). That is, by referring to the word concept base stored in the word concept base storage unit 26, the document vector calculation unit 23 obtains the concept vector corresponding to each word calculated by the word frequency calculation unit 22. It weights each obtained concept vector by the appearance frequency of the corresponding word, that is, multiplies it by the appearance frequency (the importance of the word can also be reflected as a weight in this appearance frequency), and takes the average of the weighted concept vectors (the sum divided by the total count) as the document vector y (an M-dimensional column vector). The information calculated here is passed to the weight calculation unit 24.
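The document vector calculation of step 13 can be sketched as a frequency-weighted average. This is a sketch under assumptions: the function name `document_vector` and the dictionary-based data layout are illustrative, "the total count" is interpreted here as the total appearance frequency of the words used, and words missing from the concept base are simply skipped, which the description does not specify.

```python
import numpy as np

def document_vector(word_freqs, concept_base):
    """Frequency-weighted average of the concept vectors of a document's words.

    word_freqs   : dict mapping word -> appearance frequency in the document
    concept_base : dict mapping word -> concept vector (np.ndarray of shape (M,))
    """
    acc = None
    total = 0.0
    for word, freq in word_freqs.items():
        vec = concept_base.get(word)
        if vec is None:
            continue  # assumption: words not in the concept base are ignored
        weighted = freq * vec  # weight the concept vector by its frequency
        acc = weighted if acc is None else acc + weighted
        total += freq
    if acc is None:
        raise ValueError("no word of the document is in the concept base")
    # Assumption: divide by the total frequency of the words that were used.
    return acc / total

# Usage: "cat" appears three times and "dog" once; "xyz" is not in the base.
base = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
y = document_vector({"cat": 3, "dog": 1, "xyz": 5}, base)
```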

Next, the weight (coefficient) of each word stored in the word set storage unit 27 as representing the document features is calculated (step 14). The details of this processing are as follows.

The weight calculation unit 24 obtains, from the word concept base stored in the word concept base storage unit 26, the concept vector corresponding to each word stored in the word set storage unit 27, treats each obtained concept vector as a column vector, and creates a matrix A (an M×N matrix) in which the column vectors for all N words are arranged side by side. This matrix is called the word basis.

Let x (an N-dimensional column vector) be the vector of weights for the words, that is, the weights to be determined in this embodiment. Using the word basis A and the coefficients x, the document vector y can be approximated as
y ≈ Ax   Equation (1)
Applying singular value decomposition to the word basis gives
A = UΣV^t   Equation (2)
and the pseudo-inverse matrix A⁺ of A can then be obtained as
A⁺ = VΣ⁺U^t   Equation (3)
Here, Σ⁺ is the transpose of the matrix obtained from Σ by replacing each nonzero component with its reciprocal. Using this pseudo-inverse matrix, the weight (coefficient) vector x that best approximates y in Equation (1) in the sense of minimizing the squared error is given by
x = A⁺y   Equation (4)
That is, the weight calculation unit 24 calculates the weight vector x from the word basis A and the document vector y by performing the calculations of Equations (2) to (4).
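Equations (2) to (4) can be sketched with NumPy as follows. The function name `word_weights` and the threshold `rcond` used to decide which singular values count as nonzero are assumptions added for numerical robustness; the description itself only speaks of nonzero components. The thin SVD is used, so Σ is a square diagonal matrix and its transpose is the same matrix.

```python
import numpy as np

def word_weights(A, y, rcond=1e-12):
    """Least-squares weights x = A+ y via the SVD-based pseudo-inverse.

    A : (M, N) word basis, one concept vector per column
    y : (M,) document vector
    """
    # Equation (2): A = U diag(s) V^t (thin SVD).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Sigma+: reciprocals of the nonzero singular values, zeros elsewhere.
    s_inv = np.zeros_like(s)
    nonzero = s > rcond * s.max()
    s_inv[nonzero] = 1.0 / s[nonzero]
    # Equation (3): A+ = V Sigma+ U^t.
    A_pinv = Vt.T @ np.diag(s_inv) @ U.T
    # Equation (4): x = A+ y.
    return A_pinv @ y

# Usage: two words in a three-dimensional concept space.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
y = np.array([2.0, 3.0, 7.0])
x = word_weights(A, y)
```

The third component of `y` lies outside the span of the two word vectors, so it contributes only to the residual and not to the weights. In practice `np.linalg.pinv(A) @ y` or `np.linalg.lstsq` computes the same minimum-norm least-squares solution.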

After that, the output unit 25 outputs the values of the weight vector x and the words stored in the word set storage unit 27. Since the words have been obtained in advance, only the weights x may be output.

With the above processing, the weights corresponding to the words that express the features of the given document have been calculated, and the process ends. Alternatively, the document vector calculation unit 23, the weight calculation unit 24, and the word set storage unit 27 of this embodiment may be added to the document feature expression calculation apparatus 10 of the first embodiment, so that the apparatus 10 calculates the weights x described in this embodiment and outputs them together with the words.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible within the scope of the claims.

FIG. 1 is a functional configuration diagram of the document feature expression calculation apparatus 10 in the first embodiment.
FIG. 2 is a flowchart for explaining the operation of the document feature expression calculation apparatus 10.
FIG. 3 is a flowchart for explaining the operation of the document feature expression calculation apparatus 10.
FIG. 4 is a diagram showing an example of the word concept base.
FIG. 5 is a diagram showing an example of the input vector set.
FIG. 6 is a diagram showing an example of the representative vectors stored in the work area.
FIG. 7 is a diagram for explaining the provisional representative vector W₀.
FIG. 8 is a functional configuration diagram of the document feature expression calculation apparatus 20 in the second embodiment.
FIG. 9 is a flowchart for explaining the operation of the document feature expression calculation apparatus 20.

Explanation of symbols

10 Document feature expression calculation apparatus
11 Input unit
12 Word frequency calculation unit
13 Input vector set calculation unit
14 Representative vector set calculation unit
15 Word/weight calculation unit
16 Output unit
17 Word concept base storage unit
18 Input vector set storage unit
19 Representative vector set storage unit
20 Document feature expression calculation apparatus
21 Input unit
22 Word frequency calculation unit
23 Document vector calculation unit
24 Weight calculation unit
25 Output unit
26 Word concept base storage unit
27 Word set storage unit

Claims

A document feature expression calculation apparatus that outputs, from an input document, index words for expressing features of the document and their weights, comprising:
concept base storage means for storing a concept base including index words and vectors corresponding thereto;
index word extraction means for extracting each index word from the document;
input vector set calculating means for obtaining, from the concept base storage means, the vector corresponding to each index word extracted by the index word extraction means as an input vector, and storing the set of input vectors in input vector set storage means;
representative vector set calculating means for calculating a set of representative vectors by executing, on the set of input vectors stored in the input vector set storage means, a quantization process that minimizes the quantization error; and
index word weight calculating means for calculating and outputting the index word corresponding to each representative vector calculated by the representative vector set calculating means and the weight of that index word,
wherein the representative vector set calculating means performs processing using a competitive learning method as the quantization process and, when updating a representative vector using a randomly selected input vector in the learning process of the competitive learning method, takes as a center the position obtained by moving the representative vector to be updated toward the input vector by the proportion given by a learning rate g, selects, from the set of vectors stored in the concept base storage means, the vector that lies within the distance from the center to the representative vector to be updated and is closest to the center, and uses the selected vector as the updated representative vector.

The document feature expression calculation apparatus according to claim 1, wherein, in the learning process of the competitive learning method, the representative vector set calculating means increases the learning rate g corresponding to the representative vector by the initial value g₀ of the learning rate when the updated representative vector has the same value as the pre-update representative vector, and sets the learning rate g to the initial value g₀ when it does not.

The document feature expression calculation apparatus according to claim 1 or 2, wherein the index word weight calculating means calculates, as the weight of the index word corresponding to a representative vector, the sum of the weights corresponding to the input vectors in the group of input vectors represented by that representative vector.

A document feature expression calculation apparatus that calculates and outputs, from an input document, weights corresponding to index words representing features of the document, comprising:
concept base storage means for storing a concept base including index words and vectors corresponding thereto;
feature expression index word storage means for storing feature expression index words, which are index words representing features of a document calculated by the document feature expression calculation apparatus according to any one of claims 1 to 3;
index word extraction means for extracting each index word and its appearance frequency from the document;
document vector calculation means for obtaining, from the concept base storage means, the vector corresponding to each index word extracted by the index word extraction means, and calculating a document vector by weighting the set of vectors by the appearance frequencies and taking the average; and
weight calculation means for calculating and outputting the weights corresponding to the feature expression index words, using the vectors in the concept base corresponding to the feature expression index words stored in the feature expression index word storage means and the document vector calculated by the document vector calculation means.

The document feature expression calculation apparatus according to claim 4, wherein the weight calculation means calculates the weights by multiplying the document vector by the pseudo-inverse matrix of a matrix in which the vectors in the concept base corresponding to the feature expression index words are arranged side by side as column vectors.

A program for causing a computer to function as each means of the document feature expression calculation apparatus according to any one of claims 1 to 5.