JP5891837B2

JP5891837B2 - Co-occurrence dictionary creation device

Info

Publication number: JP5891837B2
Application number: JP2012033981A
Authority: JP
Inventors: 貢三浦
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2012-02-20
Filing date: 2012-02-20
Publication date: 2016-03-23
Anticipated expiration: 2032-02-20
Also published as: JP2013171382A

Description

The present invention relates to the field of natural language analysis, and more particularly to a co-occurrence dictionary creation device that creates a co-occurrence dictionary.

Co-occurrence dictionaries are used in fields such as translation word selection in machine translation and kanji selection in kana-kanji conversion (see, for example, Patent Document 1). Such a co-occurrence dictionary can be created either manually or mechanically. When a co-occurrence dictionary is created mechanically, for example as described in Patent Document 2, the words that appear together in each of a large number of sentences are extracted by morphological analysis, and a co-occurrence matrix is updated based on the extraction results.

Patent Document 1: JP 2000-250914 A
Patent Document 2: JP 7-36883 A
Patent Document 3: JP 2002-082946 A
Patent Document 4: JP 2009-075791 A

Because the number of words used in ordinary text is very large, a co-occurrence matrix generally becomes huge. In Patent Document 2, the words to be used when creating the co-occurrence matrix are stored in advance in a co-occurrence registration dictionary, and co-occurrence relationships are created only for the words listed in that dictionary, which keeps the matrix from growing too large. However, restricting the words for which co-occurrence relationships are created also restricts the range of words whose co-occurrence relationships can be looked up in the matrix, so the accuracy of the co-occurrence dictionary drops significantly.

An object of the present invention is to provide a co-occurrence dictionary creation device that solves the problem described above, namely that suppressing the growth of the co-occurrence matrix greatly reduces the accuracy of the co-occurrence dictionary.

A co-occurrence dictionary creation device according to one aspect of the present invention includes:
a rearrangement unit that receives a co-occurrence matrix in which each intersection of a row and a column records a numerical value representing the frequency with which the word assigned to the row and the word assigned to the column appear together in the same sentence, and that rearranges the rows and the columns of the co-occurrence matrix so that semantically similar words are adjacent in the row direction and the column direction;
an image generation unit that generates an image in which the numerical value recorded at each intersection of a row and a column of the rearranged co-occurrence matrix is the luminance value of the pixel corresponding to that intersection;
an image reduction unit that removes high-frequency components from the DCT coefficients generated by performing a discrete cosine transform on the image, and performs an inverse discrete cosine transform on the remaining DCT coefficients to generate a reduced image of the image; and
a dictionary creation unit that generates a co-occurrence dictionary composed of a reduced co-occurrence matrix, which has rows and columns corresponding to the rows and columns of the reduced image and records at each intersection of a row and a column the luminance value of the corresponding pixel of the reduced image as a frequency, and relationship information representing the correspondence between the identification numbers assigned to the rows and columns of the reduced co-occurrence matrix and the words.

Because the present invention has the configuration described above, it can keep the co-occurrence matrix from becoming huge without significantly reducing the accuracy of the co-occurrence dictionary.

FIG. 1 is a block diagram of the first embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the processing of the first embodiment of the present invention.
FIG. 3 is a diagram explaining the operation of the rearrangement unit in the first embodiment of the present invention.
FIG. 4 is a diagram explaining the operation of the image reduction unit in the first embodiment of the present invention.
FIG. 5 is a diagram explaining the operation of the dictionary creation unit in the first embodiment of the present invention.
FIG. 6 is a diagram showing a configuration example of the relationship information that forms part of the co-occurrence dictionary in the first embodiment of the present invention.
FIG. 7 is a block diagram of the second embodiment of the present invention.
FIG. 8 is a diagram showing an example of extracting co-occurrence data in the second embodiment of the present invention.
FIG. 9 is a diagram showing an example of the co-occurrence matrix in the second embodiment of the present invention.
FIG. 10 is a diagram showing an example of rearrangement using a thesaurus in the second embodiment of the present invention.
FIG. 11 is a diagram showing an example of conversion to intensity (luminance) information in the second embodiment of the present invention.
FIG. 12 is a diagram showing an example of quantization using the discrete cosine transform in the second embodiment of the present invention.
FIG. 13 is a diagram showing an example of translation word selection in the second embodiment of the present invention.
FIG. 14 is a flowchart showing an example of the processing of the second embodiment of the present invention.

Next, embodiments of the present invention will be described in detail with reference to the drawings.
[First embodiment]
Referring to FIG. 1, a co-occurrence dictionary creation device 100 according to the first embodiment of the present invention has a function of receiving a co-occurrence matrix 101 as input and generating and outputting a co-occurrence dictionary 102.

The co-occurrence matrix 101 is a data structure in which each intersection of a row and a column records a numerical value representing the frequency with which the word assigned to the row and the word assigned to the column appear together in the same sentence. The co-occurrence matrix 101 is created, for example, manually, or mechanically by the same method as in the second embodiment described later, and is input to the co-occurrence dictionary creation device 100. Here, the co-occurrence matrix 101 is assumed to be an N × N symmetric matrix; that is, N is the number of words handled.

The co-occurrence dictionary creation device 100 is realized by an information processing device such as a personal computer that has, for example, a communication interface unit consisting of a dedicated data communication circuit, an operation input unit consisting of a keyboard, a mouse, and the like, a screen display unit consisting of an LCD or the like, a storage unit consisting of memory, a hard disk, and the like, and a processor consisting of a microprocessor and its peripheral circuits. The storage unit stores a co-occurrence dictionary creation program. When the device starts up, the processor reads this program, and the program controls the operation of the processor so as to realize on the processor the rearrangement unit 111, the image generation unit 112, the image reduction unit 113, and the dictionary creation unit 114 shown in FIG. 1.

The rearrangement unit 111 has a function of receiving the co-occurrence matrix 101 and rearranging its rows and columns so that semantically similar words are adjacent to each other in the row direction and the column direction.

The image generation unit 112 has a function of converting the rearranged co-occurrence matrix 101 into an image. Specifically, it generates an image in which the numerical value recorded at each intersection of a row and a column of the co-occurrence matrix 101 is the luminance value of the pixel corresponding to that intersection. The generated image has N × N rows and columns, the same as the co-occurrence matrix 101.

The image reduction unit 113 has a function of removing high-frequency components from the DCT coefficients generated by performing a discrete cosine transform on the generated image, and performing an inverse discrete cosine transform on the remaining DCT coefficients to generate a reduced image of the generated image. Here, the number of rows and columns of the reduced image is denoted M (< N). In general, N and M are related by N = 2^d × M, where d is a positive integer.

The dictionary creation unit 114 has a function of generating the co-occurrence dictionary 102, which is composed of a reduced co-occurrence matrix 103 and relationship information 104. The reduced co-occurrence matrix 103 has rows and columns corresponding to the rows and columns of the generated reduced image, and records at each intersection of a row and a column the luminance value of the corresponding pixel of the reduced image as a frequency. The relationship information 104 indicates the correspondence between the identification numbers assigned to the rows and columns of the reduced co-occurrence matrix 103 and the words assigned to the rows and columns of the co-occurrence matrix 101. The generated reduced co-occurrence matrix 103 has M × M rows and columns, the same as the reduced image.

Next, the operation of the co-occurrence dictionary creation device 100 according to the present embodiment will be described with reference to FIG. 1 and FIG. 2.

First, the rearrangement unit 111 receives the N × N co-occurrence matrix 101 and rearranges its rows and columns so that semantically similar words are adjacent to each other in the row direction and the column direction (step S101). For example, when the columns of the co-occurrence matrix 101 include the words "dog", "cat", "rearing", and "care", "dog" and "cat" are similar to each other, so the columns are rearranged so that the column for "dog" and the column for "cat" are adjacent. Likewise, "rearing" and "care" are similar to each other, so the columns are rearranged so that their columns are adjacent. Once all the columns have been rearranged, the rows are rearranged so that the row words appear in the same order as the rearranged column words. The similarity between words can be determined using a dictionary (thesaurus) that classifies words by synonymy, semantic similarity, inclusion relationships, and the like.

As a result of this rearrangement, related frequencies gather into clusters on the co-occurrence matrix, as shown for example in FIG. 3: the frequency of "dog" and "rearing" (= 5), of "dog" and "care" (= 7), of "cat" and "rearing" (= 4), and of "cat" and "care" (= 8).
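The rearrangement in step S101 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `reorder_matrix`, the `category_of` lookup standing in for a thesaurus, and the category labels are hypothetical, and the zero entries are filler; only the four frequencies 5, 7, 4, and 8 come from the FIG. 3 example.

```python
def reorder_matrix(words, matrix, category_of):
    """Reorder rows and columns so that words in the same category are adjacent."""
    # Sort word indices by (category, original position) so the order is deterministic.
    order = sorted(range(len(words)), key=lambda i: (category_of[words[i]], i))
    new_words = [words[i] for i in order]
    # Apply the same permutation to rows and columns, which keeps the matrix symmetric.
    new_matrix = [[matrix[r][c] for c in order] for r in order]
    return new_words, new_matrix


words = ["dog", "rearing", "cat", "care"]
matrix = [
    [0, 5, 0, 7],  # dog
    [5, 0, 4, 0],  # rearing
    [0, 4, 0, 8],  # cat
    [7, 0, 8, 0],  # care
]
# Hypothetical thesaurus categories (assumption for this sketch).
category_of = {"dog": "animal", "cat": "animal", "rearing": "tending", "care": "tending"}

new_words, new_matrix = reorder_matrix(words, matrix, category_of)
```

After the call, the frequencies 5, 7, 4, and 8 sit together in one 2 × 2 corner of `new_matrix`, which is the clustering effect the paragraph describes.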

Next, the image generation unit 112 generates an image in which the frequency value recorded at each intersection of a row and a column of the rearranged co-occurrence matrix 101 is the luminance value of the pixel corresponding to that intersection, that is, a two-dimensional image of N × N pixels (step S102).

Next, the image reduction unit 113 performs a discrete cosine transform on the image generated by the image generation unit 112, removes high-frequency components from the resulting DCT coefficients, and performs an inverse discrete cosine transform on the remaining DCT coefficients to generate a reduced image (step S103). More specifically, the image reduction unit 113 performs the following processing.

The image reduction unit 113 first divides the image into blocks. The block size is arbitrary; here the image is divided into blocks of 8 × 8 pixels. Next, the image reduction unit 113 performs a discrete cosine transform on each block to generate DCT coefficients. Applying the discrete cosine transform to one block of 8 × 8 pixels yields 8 × 8 DCT coefficients, as shown in FIG. 4. The DCT coefficient A₀₀ is the DC component, and all other DCT coefficients are AC components, of increasingly high order toward the lower right. Next, the image reduction unit 113 removes high-frequency components from the generated DCT coefficients. How many DCT coefficients are removed depends on the reduction ratio. For example, to halve the image size both vertically and horizontally, all DCT coefficients other than the 4 × 4 low-frequency coefficients enclosed by the broken line in FIG. 4 are removed. Next, the image reduction unit 113 performs an inverse discrete cosine transform on the remaining DCT coefficients to generate the reduced image. For example, performing the inverse discrete cosine transform on the 4 × 4 low-frequency DCT coefficients enclosed by the broken line in FIG. 4 yields an image of 4 × 4 pixels.
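The per-block reduction described above can be sketched in pure Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the orthonormal DCT-II normalization, and the `keep / n` gain used to preserve average brightness are assumptions made for this example.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n (row k is the k-th cosine basis)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def reduce_block(block, keep):
    """Shrink an n x n block to keep x keep by discarding high-frequency DCT coefficients."""
    n = len(block)
    c_n, c_k = dct_matrix(n), dct_matrix(keep)
    coeffs = matmul(matmul(c_n, block), transpose(c_n))   # forward 2-D DCT
    low = [row[:keep] for row in coeffs[:keep]]           # keep the low-frequency corner
    small = matmul(matmul(transpose(c_k), low), c_k)      # inverse 2-D DCT at the smaller size
    gain = keep / n  # assumption: compensates the orthonormal scaling so brightness is preserved
    return [[gain * v for v in row] for row in small]


flat_block = [[5.0] * 8 for _ in range(8)]
reduced_block = reduce_block(flat_block, 4)
```

With `keep=4` an 8 × 8 block becomes a 4 × 4 block, matching the halving example; a block of constant luminance keeps the same luminance after reduction, because only the DC coefficient is non-zero.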

Next, the dictionary creation unit 114 generates the reduced co-occurrence matrix 103 and the relationship information 104 from the reduced image generated by the image reduction unit 113 (step S104). To generate the reduced co-occurrence matrix 103, it creates a matrix that has rows and columns corresponding to the rows and columns of the reduced image and records at each intersection of a row and a column the luminance value of the corresponding pixel of the reduced image as a frequency. For example, from a reduced image whose height and width have each been halved, an (N/2) × (N/2) co-occurrence matrix is generated. Identification numbers are then assigned to the rows and columns of the generated reduced co-occurrence matrix 103. Since the number of rows and the number of columns are both N/2, the total number of identification numbers assigned is N/2.

To generate the relationship information 104, the dictionary creation unit 114 determines the correspondence between the identification numbers assigned to the rows and columns of the reduced co-occurrence matrix 103 and the words assigned to the rows and columns of the original co-occurrence matrix 101, and associates each word with an identification number. For example, as shown in FIG. 5, suppose that two columns of the rearranged co-occurrence matrix 101, the column for "dog" and the column for "cat", have been reduced by the image reduction process to a single column of the reduced co-occurrence matrix 103 that is given the identifier ID101. Then, as shown for example in FIG. 6, the pair "dog" and "ID101" and the pair "cat" and "ID101" are recorded in the relationship information 104. Likewise, suppose that two rows of the rearranged co-occurrence matrix 101, the row for "rearing" and the row for "care", have been reduced to a single row of the reduced co-occurrence matrix 103 that is given the identifier ID102. Then, as shown for example in FIG. 6, the pair "rearing" and "ID102" and the pair "care" and "ID102" are recorded in the relationship information 104.
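One way to build such relationship information is sketched below in Python. It assumes that each run of `factor` consecutive rows (or columns) of the rearranged matrix collapses into one row (or column) of the reduced matrix, and that identifiers are numbered upward from 101 as in the FIG. 6 example; the function name and both parameters are hypothetical.

```python
def build_relation_info(ordered_words, factor, first_id=101):
    """Map each word to the identifier of the reduced row/column that absorbs it."""
    # Word at position p in the rearranged matrix lands in reduced row p // factor.
    return {
        word: f"ID{first_id + position // factor}"
        for position, word in enumerate(ordered_words)
    }


relation = build_relation_info(["dog", "cat", "rearing", "care"], factor=2)
```

Here "dog" and "cat" share one identifier and "rearing" and "care" share the next, mirroring the pairs recorded in the example above.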

The relationship information 104 shown in FIG. 6 consists only of pairs of a word and an identification number, but it may include other information. For example, the relationship information 104 may include the part of speech, meaning, translation, and so on of each word.

To look up the co-occurrence frequency of a word A and another word B using the co-occurrence dictionary 102 generated as described above, the identification numbers corresponding to words A and B are obtained from the relationship information 104, and the frequency at the intersection of the row (or column) with the identification number for word A and the column (or row) with the identification number for word B is obtained from the reduced co-occurrence matrix 103.
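The lookup just described amounts to two dictionary accesses followed by one matrix access. The Python sketch below illustrates it; the names `cooccurrence` and `index_of_id` are hypothetical, and the values in `reduced` are made up for illustration rather than taken from the patent.

```python
def cooccurrence(word_a, word_b, relation, index_of_id, reduced):
    """Return the compressed co-occurrence frequency of two words."""
    row = index_of_id[relation[word_a]]  # identifier of word A -> row index
    col = index_of_id[relation[word_b]]  # identifier of word B -> column index
    return reduced[row][col]


relation = {"dog": "ID101", "cat": "ID101", "rearing": "ID102", "care": "ID102"}
index_of_id = {"ID101": 0, "ID102": 1}
reduced = [
    [0.0, 6.0],  # illustrative values only
    [6.0, 0.0],
]

dog_care = cooccurrence("dog", "care", relation, index_of_id, reduced)
cat_rearing = cooccurrence("cat", "rearing", relation, index_of_id, reduced)
```

Because "dog" and "cat" share an identifier, both lookups read the same cell; this is the price of compression and the reason similar words are placed next to each other before reduction.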

As described above, the present embodiment can keep the co-occurrence matrix from becoming huge without significantly reducing the accuracy of the co-occurrence dictionary. One reason is that the words for which co-occurrence relationships are created are not restricted. Another reason is that multiple frequencies are compressed into a single frequency using a DCT-based image reduction technique, which can shrink an image without greatly degrading the quality of the original image; the compression can therefore be performed while preserving the information on how often one word co-occurs with another (that is, while preserving their relatedness).

[Second embodiment]
Next, a second embodiment of the present invention will be described in detail. In this embodiment, the present invention is applied to a translation word selection device for machine translation.

Devices for selecting translation words have long been used in the field of machine translation. Various such devices exist; those in common use are as follows.
(1) Devices that use word vectors (for example, Patent Document 3).
(2) Devices that use word-to-word co-occurrence conditions (for example, Patent Document 1).
(3) Devices that use similarity or the like (for example, Patent Document 4).

The discrete cosine transform, the Fourier transform, and similar transforms are used as information compression techniques. The discrete cosine transform in particular has long been used for lossy image compression. Several standard variants of the discrete cosine transform are known; the most common is the type-II DCT (DCT-II), and the term "discrete cosine transform" usually refers to it. Similarly, the type-III DCT (DCT-III), which is the inverse of DCT-II, is called the inverse discrete cosine transform. In any case, the present embodiment uses the discrete cosine transform, a conventional technique, without restricting it to a particular variant. The present embodiment also works with the similar discrete Fourier transform. However, for data with many low-frequency components the discrete cosine transform compresses more effectively, so the present embodiment uses the discrete cosine transform.
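For reference, the standard unnormalized textbook definitions of the two transforms named above are given below. They are not quoted from the patent; the sequence length is written $L$ here (an assumed symbol) to avoid a clash with the matrix size N used elsewhere in this description.

```latex
% DCT-II (commonly called "the DCT") of a sequence x_0, ..., x_{L-1}
X_k = \sum_{n=0}^{L-1} x_n \cos\!\left[\frac{\pi}{L}\left(n + \tfrac{1}{2}\right) k\right],
\qquad k = 0, \dots, L-1

% DCT-III (commonly called "the inverse DCT")
Y_k = \tfrac{1}{2}\, x_0 + \sum_{n=1}^{L-1} x_n \cos\!\left[\frac{\pi}{L}\, n \left(k + \tfrac{1}{2}\right)\right],
\qquad k = 0, \dots, L-1
```

With these conventions, applying DCT-III to the output of DCT-II and multiplying by $2/L$ recovers the original sequence. The two-dimensional transform used on image blocks applies the one-dimensional transform along the rows and then along the columns.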

The problems addressed by the present embodiment are that co-occurrence data becomes enormous, and that compression schemes such as thinning out the co-occurrence data lose the relationships within the data and thereby reduce accuracy.

An object of the present embodiment is to provide users with a function for managing co-occurrence data easily, a function for selecting translation words accurately, and a translation word selection device having these functions.

The configuration and operation of the present embodiment are described in detail below with reference to the drawings.

The translation word selection device of the present embodiment is configured as shown in FIG. 7 and includes a data reading device 1, a co-occurrence information creation device 2, a thesaurus device 3, a co-occurrence matrix management device 4, a discrete cosine transform device 5, a filter 6, an inverse transform device 7, a co-occurrence information management device 8, a document input device 9, a dictionary 10, and a translation word selection device 11. Here, the co-occurrence matrix management device 4 corresponds to the rearrangement unit 111 and the image generation unit 112 in FIG. 1; the discrete cosine transform device 5, the filter 6, and the inverse transform device 7 correspond to the image reduction unit 113 in FIG. 1; and the co-occurrence information management device 8 corresponds to the dictionary creation unit 114 in FIG. 1.

The data reading device 1 is a device that reads documents and is of a kind that has been in general use. The data to be read may be file data in memory or on an external storage device, or HTML data on the Web; as long as the data is ordinary document data, the form and location of the device are not restricted.

The co-occurrence information creation device 2 is a device that creates co-occurrence information from the documents read by the data reading device 1. Co-occurrence information is a statistical summary of the frequency or probability with which a particular word appears in the same sentence as another particular word. For example, it captures facts such as that "surgery" and "hospital" readily co-occur, whereas "surgery" and "mantle" rarely do. In the present embodiment, this device has the function of splitting the sentences read by the data reading device 1 into morphemes and accumulating the results in a co-occurrence matrix. Devices that perform this operation have been used conventionally. FIG. 8 shows an example of extracting co-occurrence data, and FIG. 9 shows an example of creating a co-occurrence matrix; in this example, when a new co-occurrence pair (cecum, surgery) is obtained, the corresponding frequency in the matrix is incremented by one.
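The accumulation step can be sketched in Python as follows. The sketch assumes that morphological analysis has already split each sentence into words, and it stores the symmetric matrix sparsely as a dictionary keyed by word pairs; the function name and the sample sentences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def update_cooccurrence(counts, sentence_words):
    """Add one to the frequency of every pair of distinct words in a sentence."""
    for a, b in combinations(sorted(set(sentence_words)), 2):
        counts[(a, b)] += 1
        counts[(b, a)] += 1  # keep the matrix symmetric


counts = defaultdict(int)
update_cooccurrence(counts, ["cecum", "surgery", "hospital"])
update_cooccurrence(counts, ["cecum", "surgery"])  # a new (cecum, surgery) pair adds one
```

Counting each pair at most once per sentence is a design choice of this sketch; the description only states that the frequency is incremented when a co-occurrence pair is obtained.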

The thesaurus device 3 represents the relationships between words hierarchically. This device, too, has been used conventionally. It is used to arrange the data of the co-occurrence matrix so that highly related items are placed close together. The purpose of this rearrangement is to improve the data compression ratio. Giving adjacent pixels a strong correlation concentrates the signal energy in relatively low-frequency components, so that the low-frequency cosine components have large absolute values and the high-frequency components have small absolute values, which allows entropy coding to reduce the amount of information substantially. The device of the present embodiment operates even without the thesaurus device 3, but the thesaurus device 3 is included as an element of the embodiment in order to raise the compression ratio. In practice, the same effect can be obtained without a thesaurus if some other technique is implemented that gives adjacent pixels a correlation. FIG. 10 shows an example of rearrangement using the thesaurus.

The co-occurrence matrix management device 4 is a device that manages the data of the co-occurrence matrix. It has no counterpart in the prior art and forms the core of the present embodiment. Using the thesaurus device 3, it rearranges the co-occurrence matrix data obtained by the co-occurrence information creation device 2 so that similar words are adjacent, and it treats the frequency information as luminance information. Treating the co-occurrence matrix data as data analogous to a two-dimensional image, it applies a discrete cosine transform using the discrete cosine transform device 5. It also uses the filter 6 to quantize the image data and compress it. Finally, it uses the inverse transform device 7 and sends the quantized co-occurrence matrix data to the co-occurrence information management device 8. FIG. 11 shows an example of conversion to intensity (luminance) information, and FIG. 12 shows an example of quantization using the discrete cosine transform.

The discrete cosine transform device 5 is a device that performs the discrete cosine transform. Such devices have long been used for audio and image compression; the device is an essential component of the present embodiment but is not novel in itself. A similar device conventionally used for image compression and the like is the discrete Fourier transform device, and the same effect can be obtained in the present embodiment by using a discrete Fourier transform device instead. However, for data with many low-frequency components the discrete cosine transform compresses more effectively, so the present embodiment uses the discrete cosine transform.

The filter 6 is a device used to quantize the original data by filtering frequency components, change points, and the like in the data that has undergone the discrete cosine transform. Such devices have long been used for audio and image compression; the filter is an essential component of the present embodiment but is not novel in itself.

The inverse transform device 7 performs the discrete cosine transform in the reverse direction. Devices of this kind have long been used for audio and image compression; the device is an essential component of this embodiment but is not novel.
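The chain of transform, filter, and inverse transform can be illustrated with a small pure-Python sketch. It assumes a square image, an orthonormal DCT-II, and a filter that simply keeps the lowest `m` by `m` block of coefficients; the function names, the target size `m`, and the `m / n` rescaling that preserves the mean luminance are choices made for this illustration and are not taken from the specification.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II basis matrix."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)])
    return rows


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(image):
    """Two-dimensional DCT of a square image: C * A * C^T."""
    c = dct_matrix(len(image))
    return matmul(matmul(c, image), transpose(c))


def idct2(coeffs):
    """Inverse two-dimensional DCT of a square coefficient block: C^T * X * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def reduce_image(image, m):
    """Shrink an n x n image to m x m by discarding high-frequency DCT coefficients."""
    n = len(image)
    coeffs = dct2(image)
    # Keep the low-frequency m x m corner; rescale by m / n so the mean luminance is unchanged.
    kept = [[coeffs[u][v] * (m / n) for v in range(m)] for u in range(m)]
    return idct2(kept)


# Invented sample image with two bright blocks on the diagonal.
image = [
    [8, 8, 0, 0],
    [8, 8, 0, 0],
    [0, 0, 4, 4],
    [0, 0, 4, 4],
]
small = reduce_image(image, 2)
flat = reduce_image([[10] * 4 for _ in range(4)], 2)
restored = idct2(dct2(image))
```

In this sketch the reduced image keeps the overall pattern (a strong top-left block, a weaker bottom-right block, little in between), but the values are approximations of block averages rather than exact ones, because truncating DCT coefficients is a lossy operation.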

The co-occurrence information management device 8 manages the co-occurrence matrix data sent from the co-occurrence matrix management device 4. It works together with the translation selection device 11 to realize the function of selecting translations. It also has the function of associating the quantized co-occurrence matrix data with the dictionary 11. This device did not exist previously and is newly created by this embodiment.
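One way to picture the association between quantized matrix data and dictionary words is an ID table. The sketch below assumes that every `factor` consecutive words of the reordered word list collapse into one row and column of the reduced matrix, so the ID is obtained by integer division; that grouping rule, the function names, and the sample values are assumptions made for illustration only.

```python
def assign_ids(sorted_words, factor):
    """Map each word to the ID of the reduced row/column assumed to have absorbed it."""
    # Assumption: words at positions 0..factor-1 share ID 0, the next group shares ID 1, and so on.
    return {word: index // factor for index, word in enumerate(sorted_words)}


def lookup(reduced, word_ids, a, b):
    """Read the approximate co-occurrence value of two words from the reduced matrix."""
    return reduced[word_ids[a]][word_ids[b]]


# Invented sample data: four reordered words and a 2 x 2 reduced matrix.
sorted_words = ["cat", "dog", "bus", "car"]
word_ids = assign_ids(sorted_words, 2)
reduced = [
    [4.0, 0.5],
    [0.5, 3.0],
]
value = lookup(reduced, word_ids, "dog", "car")
```

With such a table, looking up a word pair costs two dictionary accesses and one index into the reduced matrix, at the price of all words in a group sharing the same approximate values.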

The document input device 9 receives the sentences to be translated. It is an ordinary text data input device; it is an essential component of this embodiment but is a conventionally used device.

The dictionary 10 is used by the translation selection device 11 and stores language data and translation data. It is an essential component of this embodiment but is a conventionally used device.

The translation selection device 11 uses the dictionary 10 to attach candidate translations, word by word, to the sentences input to the document input device 9, and uses the co-occurrence information management device 8 to refer to the compressed co-occurrence matrix data and select the appropriate translation. FIG. 13 shows an example of translation selection.
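A minimal sketch of co-occurrence-based translation selection follows. It assumes that the candidate with the highest summed co-occurrence score against the surrounding words is chosen; the scoring rule, the `cooccurrence` table keyed by unordered word pairs, and the sample words are hypothetical and are not the example of FIG. 13.

```python
def select_translation(candidates, context_words, cooccurrence):
    """Pick the candidate whose total co-occurrence with the context words is highest."""
    def score(candidate):
        # Unordered pairs are stored as frozensets; missing pairs count as zero.
        return sum(cooccurrence.get(frozenset((candidate, ctx)), 0) for ctx in context_words)
    return max(candidates, key=score)


# Invented sample data: two candidate translations of one source word.
cooccurrence = {
    frozenset(("shore", "river")): 7,
    frozenset(("shore", "water")): 3,
    frozenset(("bank", "river")): 2,
    frozenset(("bank", "money")): 9,
}
chosen = select_translation(["bank", "shore"], ["river", "water"], cooccurrence)
```

In the embodiment the scores would come from the compressed co-occurrence matrix rather than from an exact table, so words merged into the same region would receive the same score.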

Referring to the flowchart of FIG. 14, the operation of this embodiment is as follows.
[Step S1] The data is read.
[Step S2] Morphological analysis is performed word by word, and co-occurrence information is extracted.
[Step S3] It is determined whether the word is already registered in the co-occurrence matrix.
[Step S4] If it is not registered, it is newly registered.
[Step S5] If it is registered, its frequency information is incremented by one.
[Step S6] It is determined whether all words have been processed. If YES, the process proceeds to step S7; if NO, it proceeds to step S2.
[Step S7] Sorting is performed with reference to the thesaurus.
[Step S8] It is determined whether all words have been sorted. If YES, the process proceeds to step S10; if NO, it proceeds to step S7.
[Step S9] The discrete cosine transform is performed.
[Step S10] Filtering is performed.
[Step S11] The inverse transform is performed.
[Step S12] The quantized IDs are linked to the original words. That is, an ID is attached to each compressed (quantized) portion covering the area where the original words were located.
[Step S13] IDs are assigned to the words in the dictionary. For example, if the area where "cat" and "dog" are adjacent is quantized and given ID 10, then ID = 10 is assigned to the dictionary entries for "cat" and "dog".
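Steps S1 to S6 amount to counting word pairs sentence by sentence. The sketch below illustrates that loop under simplifying assumptions: whitespace tokenization stands in for morphological analysis, each unordered pair is counted at most once per sentence, and a newly registered pair starts at a count of 1. The function name and sample sentences are invented.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(sentences):
    """Count how often each unordered word pair appears in the same sentence."""
    counts = defaultdict(int)
    for sentence in sentences:
        # Assumption: whitespace tokenization replaces real morphological analysis.
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            # A pair seen for the first time is registered with count 1 (S4);
            # a pair seen again has its count incremented by one (S5).
            counts[(a, b)] += 1
    return dict(counts)


# Invented sample input.
sentences = [
    "the cat chased the dog",
    "the dog saw the cat",
    "the bus passed the car",
]
counts = build_cooccurrence(sentences)
```

Pairs are stored with the two words in alphabetical order, so each unordered pair has exactly one key; a full matrix can be filled from this table by writing each count into both symmetric cells.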

According to this embodiment, the following effects are obtained.

The first effect is that a function for managing co-occurrence data simply can be realized. This is because the discrete cosine transform makes it possible to compress and manage the co-occurrence data while preserving the relationships among the data as far as possible.

The second effect is that a function for selecting translations accurately can be realized. This is because co-occurrence data that preserves those relationships as far as possible can be used.

As a form of implementing this system, electronic circuits, computers, and the like are envisaged as means for realizing each component. Modules 1 to 11 above may be built as separate components, or all of them may be built as a single device. Modules 1 to 11 may also be housed inside a single device such as a PC.

Each module may also be implemented as a combination of software and a computer. The functions may also be configured to operate over a network. Multiple instances of each function, or of the whole device, may be provided to improve performance. Furthermore, when this device is combined with another device or implemented in software, it may be configured to be called from other software.

This embodiment is a translation selection device characterized by having a function for managing co-occurrence data simply and a function for selecting translations accurately. Translation selection is expected to be used primarily in machine translation devices, but use in kanji-kana conversion systems, ranking mechanisms in search devices, reduction of computation time in text mining, and the like is also envisaged.

Description of symbols
100 ... Co-occurrence dictionary creation device
101 ... Co-occurrence matrix
102 ... Co-occurrence dictionary
103 ... Reduced co-occurrence matrix
104 ... Relation information
111 ... Rearrangement unit
112 ... Image generation unit
113 ... Image reduction unit
114 ... Dictionary creation unit

Claims

A co-occurrence dictionary creation device comprising:
a rearrangement unit that receives a co-occurrence matrix in which a numerical value representing the frequency with which the word assigned to a row and the word assigned to a column appear together in the same sentence is recorded at the intersection of that row and column, and that rearranges the rows and columns of the co-occurrence matrix so that semantically similar words are adjacent in the row direction and the column direction;
an image generation unit that generates an image in which the numerical value recorded at each intersection of a row and a column of the rearranged co-occurrence matrix is the luminance value of the pixel corresponding to that intersection;
an image reduction unit that generates a reduced image, in which the vertical and horizontal sizes of the image are reduced to 1/2d (d is a positive integer), by removing high-frequency components from the DCT coefficients generated by applying a discrete cosine transform to the image and applying an inverse discrete cosine transform to the remaining DCT coefficients; and
a dictionary creation unit that creates a co-occurrence dictionary composed of a reduced co-occurrence matrix, which has rows and columns corresponding to the rows and columns of the reduced image and in which the luminance value of the corresponding pixel of the reduced image is recorded as a frequency at each intersection of a row and a column, and relation information representing the correspondence between the identification numbers assigned to the rows and columns of the reduced co-occurrence matrix and the words.

The co-occurrence dictionary creation device according to claim 1, wherein the rearrangement unit performs rearrangement with reference to a thesaurus dictionary in which words are classified according to synonyms, semantic similarity relationships, and inclusion relationships.

A translation selection device that selects translations in machine translation using the co-occurrence dictionary created by the co-occurrence dictionary creation device according to claim 1.

A co-occurrence dictionary creation method executed by a co-occurrence dictionary creation device having a rearrangement unit, an image generation unit, an image reduction unit, and a dictionary creation unit, wherein:
the rearrangement unit receives a co-occurrence matrix in which a numerical value representing the frequency with which the word assigned to a row and the word assigned to a column appear together in the same sentence is recorded at the intersection of that row and column, and rearranges the rows and columns of the co-occurrence matrix so that semantically similar words are adjacent in the row direction and the column direction;
the image generation unit generates an image in which the numerical value recorded at each intersection of a row and a column of the rearranged co-occurrence matrix is the luminance value of the pixel corresponding to that intersection;
the image reduction unit generates a reduced image, in which the vertical and horizontal sizes of the image are reduced to 1/2d (d is a positive integer), by removing high-frequency components from the DCT coefficients generated by applying a discrete cosine transform to the image and applying an inverse discrete cosine transform to the remaining DCT coefficients; and
the dictionary creation unit creates a co-occurrence dictionary composed of a reduced co-occurrence matrix, which has rows and columns corresponding to the rows and columns of the reduced image and in which the luminance value of the corresponding pixel of the reduced image is recorded as a frequency at each intersection of a row and a column, and relation information representing the correspondence between the identification numbers assigned to the rows and columns of the reduced co-occurrence matrix and the words.

A program for causing a computer to function as:
a rearrangement unit that receives a co-occurrence matrix in which a numerical value representing the frequency with which the word assigned to a row and the word assigned to a column appear together in the same sentence is recorded at the intersection of that row and column, and that rearranges the rows and columns of the co-occurrence matrix so that semantically similar words are adjacent in the row direction and the column direction;
an image generation unit that generates an image in which the numerical value recorded at each intersection of a row and a column of the rearranged co-occurrence matrix is the luminance value of the pixel corresponding to that intersection;
an image reduction unit that generates a reduced image, in which the vertical and horizontal sizes of the image are reduced to 1/2d (d is a positive integer), by removing high-frequency components from the DCT coefficients generated by applying a discrete cosine transform to the image and applying an inverse discrete cosine transform to the remaining DCT coefficients; and
a dictionary creation unit that creates a co-occurrence dictionary composed of a reduced co-occurrence matrix, which has rows and columns corresponding to the rows and columns of the reduced image and in which the luminance value of the corresponding pixel of the reduced image is recorded as a frequency at each intersection of a row and a column, and relation information representing the correspondence between the identification numbers assigned to the rows and columns of the reduced co-occurrence matrix and the words.