JP2019012468A

JP2019012468A - Meaning specification program, information generation program, meaning specification method, information generation method, meaning specification device, and information generation device

Info

Publication number: JP2019012468A
Application number: JP2017129498A
Authority: JP
Inventors: Masahiro Kataoka (片岡 正弘); Satoshi Onoe (尾上 聡); Seiji Okura (大倉 清司)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-06-30
Filing date: 2017-06-30
Publication date: 2019-01-24
Also published as: WO2019003516A1

Abstract

To provide an encoder that specifies the meaning of a polysemous word (word) included in text data. SOLUTION: An encoder receives text data. The encoder extracts a word included in the received text data. When the extracted word has a plurality of meanings, the encoder refers to a storage unit that stores, for each meaning of a word having a plurality of meanings, co-occurrence information on other words for that word, and specifies the meaning of the extracted word on the basis of the co-occurrence information. SELECTED DRAWING: Figure 9

Description

The present invention relates to a meaning specification program and the like.

In recent years, techniques for analyzing the words (morphemes) that make up text data have become available. There is also the Word2Vec technique, which generates a vector from each word. For example, Word2Vec calculates the vector of each word based on the relationship between that word and other adjacent words. Text data also contains polysemous words, which have multiple meanings, and Word2Vec likewise generates a single vector for such a polysemous word.

Prior art 1 describes extracting polysemous words to which multiple meanings are assigned within documents such as proposals and specifications for information processing systems. Prior art 2 describes assigning a concept vector to every word in a word dictionary.

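Japanese Unexamined Patent Application Publication No. 2013-020431 (JP 2013-020431 A)
Japanese Unexamined Patent Application Publication No. 2006-338342 (JP 2006-338342 A)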

However, the prior art described above has the problem that it cannot specify the meaning of a polysemous word (word) included in text data.

For example, the prior art assigns a single vector to each word included in text data; it does not specify in which meaning a polysemous word with multiple meanings is being used.

In one aspect, an object of the present invention is to provide a meaning specification program, an information generation program, a meaning specification method, an information generation method, a meaning specification device, and an information generation device that can specify the meaning of a polysemous word (word) included in text data.

In a first proposal, a computer is caused to execute the following processing. The computer receives text data. The computer extracts a word included in the received text data. When the extracted word has a plurality of meanings, the computer refers to a storage unit that stores, for each meaning of a word having a plurality of meanings, co-occurrence information on other words for that word, and specifies the meaning of the extracted word based on the co-occurrence information.

The meaning of a polysemous word (word) included in text data can be specified.

FIG. 1 is a diagram for explaining an example of the processing of the encoding device according to the present embodiment.
FIG. 2 is a functional block diagram illustrating the configuration of the encoding device according to the present embodiment.
FIG. 3 is a diagram illustrating an example of the data structure of the offset table.
FIG. 4 is a diagram illustrating an example of the data structure of the meaning HMM.
FIG. 5 is a diagram for explaining the pointers of the offset table.
FIG. 6 is a diagram illustrating an example of the data structure of the encoding zelkova tree.
FIG. 7 is a diagram illustrating an example of the data structure of the dynamic dictionary.
FIG. 8 is a diagram for explaining an example of the processing of the encoding unit.
FIG. 9 is a flowchart illustrating the processing procedure of the encoding device according to the present embodiment.
FIG. 10 is a diagram illustrating an example of the hardware configuration of a computer that implements the same functions as the encoding device.

Embodiments of the meaning specification program, information generation program, meaning specification method, information generation method, meaning specification device, and information generation device disclosed in the present application are described in detail below with reference to the drawings. The present invention is not limited to these embodiments.

FIG. 1 is a diagram for explaining an example of the processing of the encoding device according to the present embodiment. The encoding device is an example of a meaning specification device and an information generation device. The encoding device reads each word from the text file F1 and encodes it, thereby generating the encoded file F2. When encoding a word, the encoding device specifies the meaning of the word and assigns a code that depends on that meaning.

FIG. 1 illustrates the case of encoding the word "deep△" in the passage "... He△has△deep△idea△to△clear△that△matter. ..." contained in the text file F1. Here, "△" denotes a space. The word "deep△" is a polysemous word with multiple meanings. For example, the word "deep△" has the meanings "top to bottom, front to back, measurement, deep, breath, sounds, colours, sleep, serious, emotions, knowledge, difficult to understand, involved, person, in sport".

Based on the meaning HMM (Hidden Markov Model) 142, the encoding device specifies the meaning of "deep△" in the text file F1 from among its multiple meanings. The meaning HMM 142 associates a polysemous word, a meaning ID, a meaning, and a plurality of co-occurring words (with co-occurrence rates) with one another.

A polysemous word is a word that has multiple meanings. A meaning ID is a number that uniquely identifies one meaning of the polysemous word. A meaning is one of the meanings of the polysemous word. A co-occurring word is a word that co-occurs with the polysemous word when it is used in a given meaning. Each co-occurring word is associated with a co-occurrence rate. For example, the meaning HMM 142 indicates that when the polysemous word "deep" appears in a sentence in the meaning "knowledge", the probability that the word "understanding" co-occurs before or after that "deep" is 41%. Likewise, when the polysemous word "deep" appears in a sentence in the meaning "knowledge", the probability that the word "idea" co-occurs before or after that "deep" is 19%.

The encoding device compares the words that appear before and after "deep△" in the text file F1 with the co-occurring words in the meaning HMM 142 and determines which meaning record is hit. In the example shown in FIG. 1, "idea△" appears after "deep△" in the text file F1, so the record with meaning ID (12) in the meaning HMM 142 is hit. The encoding device therefore specifies the meaning of "deep△" in the text file F1 as the meaning "knowledge" of meaning ID (12).
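The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the dictionary layout, the function name `find_meaning`, the one-word window on each side, and the co-occurring words for meaning ID 1 are all assumptions made for the example. Only the two rates for "knowledge" (41% and 19%) come from the description.

```python
# Hypothetical stand-in for the meaning HMM 142: for each polysemous word,
# a list of (meaning ID, meaning, {co-occurring word: co-occurrence rate}).
# The entries for meaning ID 1 are invented for illustration.
MEANING_HMM = {
    "deep": [
        (1, "top to bottom", {"river": 0.30, "water": 0.25}),
        (12, "knowledge", {"understanding": 0.41, "idea": 0.19}),
    ],
}


def find_meaning(words, index, hmm=MEANING_HMM):
    """Return (meaning ID, meaning) for words[index], or None if nothing hits."""
    target = words[index]
    # Assumption: only the word immediately before and after is inspected.
    neighbours = set(words[max(0, index - 1):index] + words[index + 1:index + 2])
    for meaning_id, meaning, cooccurring in hmm.get(target, []):
        # A record "hits" when any neighbour is one of its co-occurring words.
        if cooccurring.keys() & neighbours:
            return meaning_id, meaning
    return None


words = "He has deep idea to clear that matter.".split()
print(find_meaning(words, words.index("deep")))  # (12, 'knowledge')
```

The first matching record wins in this sketch; choosing between several hitting records is a separate policy question.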

The encoding device registers the meaning code "A14F06h", which corresponds to the meaning "knowledge" of the polysemous word (word) "deep△", in the dynamic dictionary 144. The meaning code corresponding to each pair of a word and a meaning is assumed to be set in advance.

The encoding device registers the meaning code "A14F06h" in the dynamic dictionary 144. Suppose, for example, that the dynamic code indicating the registration position of the meaning code "A14F06h" is "AE00h". The encoding device acquires "AE00h" from the dynamic dictionary 144.

The encoding device performs encoding by outputting the dynamic code "AE00h", which corresponds to the polysemous word (word) "deep△", to the encoded file F2. For other polysemous words that, like "deep△", have multiple meanings, the encoding device performs encoding by repeating the above processing. For a word with a single meaning, on the other hand, the encoding device performs encoding by outputting the code assigned to it in advance (a static code or a dynamic code) to the encoded file F2.

The encoded file F2 includes a header part 2a, an encoded data part 2b, and a trailer part 2c. The header part 2a contains information such as word appearance frequencies. The encoded data part 2b stores the encoded data generated by the encoding device. The trailer part 2c contains information such as the dynamic dictionary 144.

As described above, the encoding device can specify the meaning of a word by comparing the words that appear before and after the target word in the text file F1 with the co-occurring words in the meaning HMM 142. The encoding device also registers the meaning code corresponding to the combination of the target word and its meaning in the dynamic dictionary 144, acquires a dynamic code, and performs encoding. As a result, even the same word can be encoded with a code that depends on its meaning.

FIG. 2 is a functional block diagram illustrating the configuration of the encoding device according to the present embodiment. As illustrated in FIG. 2, the encoding device 100 includes an interface unit 110, a storage unit 140, and a control unit 150.

The interface unit 110 is a processing unit that exchanges information with external devices via a network or the like.

The storage unit 140 holds the text file F1, an offset table 141, the meaning HMM 142, an encoding zelkova tree 143, the dynamic dictionary 144, a static dictionary 145, and the encoded file F2. The storage unit 140 corresponds to a semiconductor memory element such as a RAM (Random Access Memory), a ROM (Read Only Memory), or a flash memory, or to a storage device such as an HDD (Hard Disk Drive).

The text file F1 is the text data to be encoded. The text file F1 contains a plurality of words separated by spaces "△", as described with reference to FIG. 1.

The offset table 141 is a table that holds, among other things, the position (offset) of each word contained in the text file F1 and a pointer indicating whether the word is a polysemous word. FIG. 3 is a diagram illustrating an example of the data structure of the offset table. As illustrated in FIG. 3, the offset table 141 associates a number, a basic code, an offset, and a polysemous-word pointer with one another.

The number indicates the order in which each record was registered in the offset table 141. The basic code is a code assigned to each word in advance and uniquely identifies the word. The basic code is distinct from the static code and the dynamic code described later. The offset indicates the position in the text file F1 at which the word corresponding to the basic code occurs.

The polysemous-word pointer indicates whether the word corresponding to the basic code is a polysemous word with multiple meanings. When the polysemous-word pointer is "0", the word corresponding to the basic code of that record is not a polysemous word. For example, the polysemous-word pointer for the word code "A14F04h" corresponding to number "0" in FIG. 3 is "0", so the word with the word code "A14F04h" is not a polysemous word.

When the polysemous-word pointer holds a pointer to the meaning HMM 142, the word corresponding to the basic code is a polysemous word. The pointer indicates a record position in the meaning HMM 142, and that record position holds information on each meaning of the polysemous word. For example, a pointer to the meaning HMM 142 is stored for the word code "A14F06h" corresponding to number "2" in FIG. 3, so the word corresponding to the word code "A14F06h" is a polysemous word.

The meaning HMM 142 is a table that holds, for each meaning of a polysemous word, information on the other words that co-occur before and after the word when it is used in that meaning. FIG. 4 is a diagram illustrating an example of the data structure of the meaning HMM. As illustrated in FIG. 4, the meaning HMM 142 associates a polysemous word, a meaning ID, a meaning, and a plurality of co-occurring words (with co-occurrence rates) with one another. The polysemous word, meaning ID, meaning, and co-occurring words (co-occurrence rates) are as described with reference to FIG. 1.

FIG. 5 is a diagram for explaining the pointers of the offset table. For example, the pointer corresponding to number "2" in the offset table 141 indicates the record position of the polysemous word "deep" in the meaning HMM 142. It follows that the polysemous word "deep", which corresponds to the word code "A14F06h", has the meanings with meaning IDs (1) to (16).
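As a rough illustration of how the offset table and its pointer into the meaning HMM fit together, consider the sketch below. The field names, the record for number 1, and the offset values are hypothetical. The description uses "0" to mean "not a polysemous word"; the sketch uses `None` for that role so that list index 0 can serve as a valid pointer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OffsetRecord:
    number: int                      # registration order
    basic_code: str                  # code that uniquely identifies the word
    offset: int                      # position of the word in text file F1
    polysemy_pointer: Optional[int]  # None plays the role of "0": not polysemous


# Hypothetical meaning HMM records, addressed by list index.
MEANING_HMM_RECORDS = [
    {"word": "deep", "meaning_ids": list(range(1, 17))},  # meaning IDs (1) to (16)
]

# Codes for numbers 0 and 2 follow the description; the rest is illustrative.
OFFSET_TABLE = [
    OffsetRecord(0, "A14F04h", 0, None),
    OffsetRecord(1, "A14F05h", 3, None),
    OffsetRecord(2, "A14F06h", 7, 0),
]


def meanings_for(basic_code):
    """Return the meaning HMM record for a polysemous word, else None."""
    for record in OFFSET_TABLE:
        if record.basic_code == basic_code:
            if record.polysemy_pointer is None:
                return None  # not a polysemous word
            return MEANING_HMM_RECORDS[record.polysemy_pointer]
    raise KeyError(basic_code)


print(meanings_for("A14F04h"))          # None
print(meanings_for("A14F06h")["word"])  # deep
```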

The encoding zelkova tree 143 holds information on the static codes and dynamic codes used when encoding words. FIG. 6 is a diagram illustrating an example of the data structure of the encoding zelkova tree. As illustrated in FIG. 6, the encoding zelkova tree 143 has 2-grams, bitmaps, pointers, English words, static codes, dynamic codes, basic codes, and meaning codes.

A 2-gram is information indicating a two-character string (word). A bitmap is the bitmap corresponding to a 2-gram string. For example, the bitmap corresponding to "aa" is "0_0_0_0_0". A pointer indicates the position of the English word corresponding to a bitmap.

For example, to look up the static code corresponding to the word "able△" in the encoding zelkova tree 143, the bitmaps corresponding to the 2-grams "ab", "bl", "le", and "e△" are combined. For each digit, if the value is 0 in all of the bitmaps, the corresponding digit of the combined bitmap is set to "0". If at least one of the bitmaps has "1" in that digit, the corresponding digit is set to "1". The bitmaps are combined in this way.

For example, suppose the bitmap for "ab" is "1_0_0_0_0", the bitmap for "bl" is "0_1_0_0_0", the bitmap for "le" is "0_0_1_0_0", and the bitmap for "e△" is "0_0_0_1_0". In this case, the bitmap obtained by combining them is "1_1_1_1_0". The static code corresponding to the word "able△" is stored at the position indicated by the pointer for this bitmap "1_1_1_1_0".
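The digit-wise combination described here is a bitwise OR. A minimal sketch, assuming the bitmaps are held as integers and using hypothetical helper names (`two_grams`, `combined_bitmap`, `show`); the four bitmap values are the ones given in the example:

```python
from functools import reduce

# Bitmaps for the example 2-grams, written as 5-bit integers.
# "△" stands for the trailing space, as in the description.
BITMAPS = {
    "ab": 0b10000,
    "bl": 0b01000,
    "le": 0b00100,
    "e△": 0b00010,
}


def two_grams(word):
    """Split a word into its overlapping two-character strings."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def combined_bitmap(word, bitmaps=BITMAPS):
    """OR together the bitmaps of every 2-gram in the word."""
    return reduce(lambda acc, gram: acc | bitmaps[gram], two_grams(word), 0)


def show(bitmap, width=5):
    """Format a bitmap in the underscore-separated notation of the description."""
    return "_".join(format(bitmap, f"0{width}b"))


print(two_grams("able△"))              # ['ab', 'bl', 'le', 'e△']
print(show(combined_bitmap("able△")))  # 1_1_1_1_0
```

How the combined bitmap is then mapped to a pointer is not detailed in this passage, so the sketch stops at the combination step.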

An English word is, for example, a word that appears in an English dictionary or the like. A static code is a code assigned to an English word. Exactly one static code is assigned to each English word, so even a polysemous word has only one static code.

When an English word is a polysemous word, the static code record stores a meaning ID that uniquely identifies a meaning of the polysemous word.

A dynamic code is a code that is dynamically assigned to each meaning of a polysemous word. The processing for assigning a dynamic code to a meaning of a polysemous word is described later.

A basic code is a code assigned to each English word (word) in advance and uniquely identifies the word. A meaning code is a code assigned in advance to a meaning of a word.

The dynamic dictionary 144 is information used when assigning a dynamic code to a meaning. FIG. 7 is a diagram illustrating an example of the data structure of the dynamic dictionary. The dynamic dictionary 144 has a dynamic dictionary unit 144a and a buffer unit 144b. The dynamic dictionary unit 144a has dynamic codes, meaning codes, polysemous-word information, and pointers.

A dynamic code indicates a registration position in the dynamic dictionary 144. A dynamic code is also the code assigned to a word stored in the buffer 144b; even for the same word, a different dynamic code is assigned when the meaning differs.

A meaning code here is the meaning code of the word to which the dynamic code is assigned. The polysemous-word information indicates the content of the meaning and the meaning ID of the word to which the dynamic code is assigned. A pointer indicates the position in the buffer 144b at which the polysemous word corresponding to the dynamic code is stored.

For example, the dynamic dictionary 144 illustrated in FIG. 7 shows that the dynamic code "AE00h" is assigned to the polysemous word "deep" in the meaning "knowledge". The pointer of the dynamic code "AE00h" shows that the polysemous word "deep△" is stored at the head of the buffer 144b. The pointer of the dynamic code "AE01h" indicates the position immediately after "deep△" in the buffer 144b.
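The relationship between the dictionary part and the buffer can be pictured with the sketch below, which recovers the word for a dynamic code from the pointers alone. The layout is an assumption: the second entry (the word "zelkova△") is invented, and reading a word up to the next entry's pointer is one plausible interpretation of "the position immediately after", not something the description states.

```python
# Hypothetical snapshot of dynamic dictionary 144 after two registrations.
BUFFER = "deep△zelkova△"  # stands in for buffer 144b
DICTIONARY_PART = [        # stands in for dynamic dictionary unit 144a
    {"dynamic_code": 0xAE00, "meaning_code": "A14F06h",
     "polysemy_info": "knowledge (12)", "pointer": 0},
    {"dynamic_code": 0xAE01, "meaning_code": None,
     "polysemy_info": None, "pointer": 5},  # starts right after "deep△"
]


def word_for(dynamic_code):
    """Return the buffered word that a dynamic code refers to."""
    # The dynamic code doubles as the registration position in the dictionary.
    index = dynamic_code - DICTIONARY_PART[0]["dynamic_code"]
    start = DICTIONARY_PART[index]["pointer"]
    # Assumption: a word ends where the next entry begins, or at the buffer end.
    if index + 1 < len(DICTIONARY_PART):
        end = DICTIONARY_PART[index + 1]["pointer"]
    else:
        end = len(BUFFER)
    return BUFFER[start:end]


print(word_for(0xAE00))  # deep△
print(word_for(0xAE01))  # zelkova△
```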

The static dictionary 145 is information that associates words with static codes.

The encoded file F2 is the file obtained by encoding the text file F1. The encoded file F2 is as described with reference to FIG. 1.

Returning to the description of FIG. 2, the control unit 150 includes an extraction unit 151, a specifying unit 152, an encoding unit 153, and a co-occurrence information generation unit 154. The control unit 150 can be implemented by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like. The control unit 150 can also be implemented by hard-wired logic such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The extraction unit 151 is a processing unit that extracts, word by word, the data of the content portion to be encoded from the text file F1. For example, when the data of the content portion is "He△has△deep△idea△to△clear△that△matter.", the extraction unit 151 extracts the words in the units "He△", "has△", "deep△", "idea△", "to△", "clear△", "that△", and "matter.". The extraction unit 151 outputs each extracted word to the specifying unit 152.
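A minimal sketch of this word-by-word extraction, assuming that each word keeps the space "△" that follows it (as in "He△" and "has△") and that a regular expression is an acceptable way to split. The function name `extract_words` is hypothetical.

```python
import re


def extract_words(text):
    """Split text into words, keeping the trailing "△" with each word."""
    # One word = a run of non-"△" characters plus the "△" after it, if any.
    return re.findall(r"[^△]+△?", text)


print(extract_words("He△has△deep△idea△to△clear△that△matter."))
# ['He△', 'has△', 'deep△', 'idea△', 'to△', 'clear△', 'that△', 'matter.']
```

Keeping the delimiter attached means the original text can be restored by simple concatenation of the extracted units.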

The specifying unit 152 is a processing unit that acquires a word and specifies its meaning based on the offset table 141, the meaning HMM 142, and the encoding zelkova tree 143. An example of the processing of the specifying unit 152 is described below.

The specifying unit 152 identifies the basic code of the word. For example, the specifying unit 152 compares the word with the encoding zelkova tree 143 to identify the basic code of the word.

The specifying unit 152 compares the basic code of the word with the offset table 141 and determines whether the word is a polysemous word. For example, when the pointer corresponding to the basic code is "0", the specifying unit 152 determines that the word is not a polysemous word. When the pointer corresponding to the basic code indicates a record position in the meaning HMM 142, the specifying unit 152 determines that the word is a polysemous word.

When the specifying unit 152 determines that the word is not a polysemous word, it outputs the word, together with information indicating that the word is not a polysemous word, to the encoding unit 153.

When the specifying unit 152 determines that the word is a polysemous word, it executes the following processing. The specifying unit 152 refers to the pointer corresponding to the basic code of the word in the offset table 141 and identifies the record in the meaning HMM 142 that the pointer indicates.

The specifying unit 152 refers to the identified record in the meaning HMM 142 and specifies the meaning of the word. For example, the specifying unit 152 compares the words that appear before and after the word whose meaning is to be specified with the co-occurring words (co-occurrence rates) in the meaning HMM 142 to specify the meaning of the word. The co-occurring words used in the comparison are those of the meanings associated with the polysemous word in question. For example, if the polysemous word is "deep", the comparison is made against the co-occurring words for meanings (1) to (16) of the polysemous word "deep".

As an example, consider the case where the specifying unit 152 specifies the meaning of the polysemous word "deep△" contained in the text file F1, "He△has△deep△idea△to△clear△that△matter.". Because "idea△" appears after "deep△" in the text file F1, the specifying unit 152 hits the record with meaning ID (12) in the meaning HMM 142 illustrated in FIG. 4. The encoding device therefore specifies the meaning of "deep△" in the text file F1 as the meaning "knowledge" of meaning ID (12).

When co-occurring words of more than one meaning are hit, the specifying unit 152 may determine the meaning according to the number of co-occurring words hit, the co-occurrence rates, and so on. For example, when one of the co-occurring words for meaning ID (1) is hit and two of the co-occurring words for meaning ID (12) are hit, the specifying unit 152 specifies the meaning of meaning ID (12) as the meaning of the word. Alternatively, when a co-occurring word with a co-occurrence rate of 12% is hit for meaning ID (1) and a co-occurring word with a co-occurrence rate of 10% is hit for meaning ID (12), the specifying unit 152 specifies the meaning of meaning ID (1) as the meaning of the word. An administrator may adjust, as appropriate, whether the number of co-occurring words hit or the co-occurrence rate takes priority.
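The two priorities mentioned here (hit count versus co-occurrence rate) can be sketched as a switchable scoring rule. Everything below is an assumption for illustration: the function name `choose_meaning`, the `prefer` parameter, the use of the highest hit rate per meaning, and the sample words. Only the 12% and 10% rates mirror the example in the text.

```python
def choose_meaning(neighbours, candidates, prefer="count"):
    """Pick a meaning ID from candidates = {meaning ID: {word: rate}}.

    prefer="count" ranks by number of hits first, prefer="rate" ranks by the
    highest co-occurrence rate among the hits first. Returns None on no hit.
    """
    if prefer not in ("count", "rate"):
        raise ValueError("prefer must be 'count' or 'rate'")
    scored = []
    for meaning_id, cooccurring in candidates.items():
        hits = cooccurring.keys() & set(neighbours)
        if not hits:
            continue
        count = len(hits)
        best_rate = max(cooccurring[word] for word in hits)
        key = (count, best_rate) if prefer == "count" else (best_rate, count)
        scored.append((key, meaning_id))
    return max(scored)[1] if scored else None


# Hypothetical data: ID 1 gets one hit at 12%, ID 12 gets two hits at 10% and 8%.
candidates = {
    1: {"river": 0.12},
    12: {"idea": 0.10, "thought": 0.08},
}
neighbours = ["river", "idea", "thought"]
print(choose_meaning(neighbours, candidates, prefer="count"))  # 12
print(choose_meaning(neighbours, candidates, prefer="rate"))   # 1
```

Exposing the priority as a parameter corresponds to the remark that an administrator may adjust which criterion takes precedence.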

When the specifying unit 152 determines that the word is a polysemous word, it outputs the word, information indicating that the word is a polysemous word, and the meaning ID of the word to the encoding unit 153.

The encoding unit 153 is a processing unit that performs encoding by assigning a code to a word and outputting the assigned code to the encoded file F2.

The processing performed when the encoding unit 153 acquires a word to be encoded and the acquired word is "not a polysemous word" is described first. The encoding unit 153 compares the word with the encoding zelkova tree 143 and identifies the code (static code or dynamic code) corresponding to the word. The encoding unit 153 outputs the identified code to the encoded file F2. For example, when the word hits the static dictionary 145, the encoding unit 153 identifies the static code of the word. When the word does not hit the static dictionary 145, the encoding unit 153 identifies a dynamic code using the dynamic dictionary 144. For example, when the word is not yet registered in the dynamic dictionary 144, the encoding unit 153 registers the word in the dynamic dictionary 144 and obtains the dynamic code corresponding to the registration position. When the word is already registered in the dynamic dictionary 144, the encoding unit 153 obtains the dynamic code corresponding to the existing registration position.
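For a word that is not polysemous, the static-then-dynamic fallback might look like the sketch below. The static code values, the base value `0xAE00` for dynamic codes (borrowed from the "AE00h" example), the sample words, and the use of a plain list as the dynamic dictionary are all assumptions.

```python
# Hypothetical static dictionary 145: word -> static code (values invented).
STATIC_DICTIONARY = {"He△": 0x20, "has△": 0x21}
DYNAMIC_BASE = 0xAE00  # assumed first dynamic code


def encode_plain_word(word, dynamic_words):
    """Return the code for a non-polysemous word.

    dynamic_words is a list of words in registration order; it is extended
    in place when the word has to be newly registered.
    """
    if word in STATIC_DICTIONARY:
        return STATIC_DICTIONARY[word]          # static dictionary hit
    if word not in dynamic_words:
        dynamic_words.append(word)              # register the unseen word
    # The dynamic code corresponds to the registration position.
    return DYNAMIC_BASE + dynamic_words.index(word)


dynamic_words = []
print(hex(encode_plain_word("He△", dynamic_words)))       # 0x20
print(hex(encode_plain_word("zelkova△", dynamic_words)))  # 0xae00
print(hex(encode_plain_word("zelkova△", dynamic_words)))  # 0xae00
print(hex(encode_plain_word("keyaki△", dynamic_words)))   # 0xae01
```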

The processing performed when the encoding unit 153 acquires a word to be encoded and the acquired word "is a polysemous word" is described next. When the word "is a polysemous word", the meaning ID is notified by the specifying unit 152, as described above. The encoding unit 153 compares the word with the encoding zelkova tree 143 and identifies the dynamic code corresponding to the word.

As described with reference to FIG. 6, when a word is a polysemous word, meaning IDs are set in the static code corresponding to the word. The encoding unit 153 therefore compares the meaning ID notified from the specifying unit 152 with the meaning IDs of the static code, and determines whether a dynamic code corresponding to the meaning ID of the polysemous word has already been registered.

When a dynamic code corresponding to the meaning ID of the polysemous word has already been registered, the encoding unit 153 outputs that dynamic code to the encoded file F2 as the code of the word. For example, in the encoded zelkova tree 143 shown in FIG. 6, the dynamic code "AE00h" is registered for the polysemous word "deep" with the meaning ID "(12)". The encoding unit 153 therefore outputs the dynamic code "AE00h" to the encoded file F2 as the code corresponding to the meaning "(12)" of the polysemous word "deep".

When no dynamic code corresponding to the meaning ID of the polysemous word has been registered, the encoding unit 153 executes the processing described next with reference to FIG. 8. FIG. 8 is a diagram for explaining an example of the processing of the encoding unit. In the example shown in FIG. 8, it is assumed that no dynamic code corresponding to the polysemous word "deep" with the meaning ID "(12)" is registered in the encoded zelkova tree 143.

The encoding unit 153 registers the meaning code "A14F06h", which corresponds to the polysemous word "deep" and the meaning ID "(12)", in the meaning-code field of the record with the smallest dynamic code among the unused records of the dynamic dictionary unit 144a. In the example shown in FIG. 8, the encoding unit 153 registers the meaning code "A14F06h" as the meaning code corresponding to the dynamic code "AE00h" in the dynamic dictionary unit 144a. The encoding unit 153 stores the meaning "knowledge (12)" in the polysemous word information corresponding to the meaning code "A14F06h". The encoding unit 153 is assumed to have acquired the polysemous word information from the specifying unit 152 in advance.

On the buffer unit 144b, the encoding unit 153 registers the polysemous word "deep△" at the position indicated by the pointer corresponding to the dynamic code "AE00h". The encoding unit 153 then sets the pointer corresponding to the dynamic code "AE01h" to the position immediately after "deep△" on the buffer unit 144b.

By performing the above processing, the encoding unit 153 associates the meaning code "A14F06h", which corresponds to the polysemous word "deep" and the meaning ID "(12)", with the dynamic code "AE00h". The encoding unit 153 registers the dynamic code "AE00h" as the dynamic code corresponding to the meaning ID "(12)" of the polysemous word "deep". As a result, the dynamic code "AE00h" is assigned to the polysemous word "deep" used in the meaning identified by the meaning ID "12". The encoding unit 153 outputs the dynamic code corresponding to the meaning ID to the encoded file F2 as the code of the word.
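The registration sequence described for FIG. 8 can be sketched in Python as follows. This is a simplified model under stated assumptions: the class `DynamicDictionary`, its field names, and the use of a Python string for the buffer unit 144b are hypothetical, and the "smallest unused dynamic code" is modeled simply as the next code in sequence starting from AE00h.

```python
class DynamicDictionary:
    """Toy model of dynamic dictionary unit 144a plus buffer unit 144b."""

    def __init__(self, first_code=0xAE00, delimiter="△"):
        self.first_code = first_code
        self.delimiter = delimiter
        self.buffer = ""        # stands in for buffer unit 144b
        self.records = []       # stands in for the records of unit 144a
        self.by_meaning = {}    # (word, meaning ID) -> dynamic code

    def register(self, word, meaning_id, meaning_code, meaning_info):
        key = (word, meaning_id)
        # Already registered: reuse the existing dynamic code.
        if key in self.by_meaning:
            return self.by_meaning[key]
        # Smallest unused dynamic code (codes are handed out in order).
        dynamic_code = self.first_code + len(self.records)
        self.records.append({
            "dynamic_code": dynamic_code,
            "pointer": len(self.buffer),   # where the word starts in the buffer
            "meaning_code": meaning_code,
            "meaning_info": meaning_info,
        })
        # Append the word and delimiter; the next record's pointer
        # will start immediately after this entry.
        self.buffer += word + self.delimiter
        self.by_meaning[key] = dynamic_code
        return dynamic_code

    def next_pointer(self):
        """Buffer position that the next dynamic code's pointer will use."""
        return len(self.buffer)
```

With this model, registering ("deep", 12) with the meaning code A14F06h on an empty dictionary assigns the dynamic code AE00h, writes "deep△" to the buffer, and leaves the pointer for AE01h immediately after that entry.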

The encoding unit 153 generates the encoded file F2 by repeating the above processing for each word acquired from the specifying unit 152.

Returning to the description of FIG. 2, the co-occurrence information generation unit 154 is a processing unit that acquires information such as an English-English dictionary (not shown) from an external device via the interface unit 110 or the like, and generates the meaning HMM 142 based on the acquired dictionary information. The co-occurrence information generation unit 154 is assumed to run before the extraction unit 151, the specifying unit 152, and the encoding unit 153, so that the meaning HMM 142 is generated in advance. The co-occurrence information generation unit 154 also functions as a reception unit that receives information such as an English-English dictionary.

For example, the English-English dictionary information associates each word with the meanings of the word and an explanation of each meaning. The co-occurrence information generation unit 154 scans the dictionary information, associates each polysemous word with its meanings, and assigns a unique meaning ID to each meaning. The co-occurrence information generation unit 154 also refers to the dictionary information and, for each meaning, calculates the co-occurrence rate of the words that co-occur with the word, thereby obtaining the co-occurring words and co-occurrence rates for each meaning. Alternatively, the co-occurrence information generation unit 154 may acquire sentences in which the use of a word is distinguished by meaning and use those sentences to calculate the co-occurring words and co-occurrence rates for each meaning. The co-occurrence information generation unit 154 generates the meaning HMM 142 by executing the above processing.
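As a rough illustration of the per-meaning co-occurrence calculation, the Python sketch below builds a table from meaning-tagged example sentences. The text does not define the co-occurrence rate precisely, so this sketch assumes it is the fraction of a meaning's example sentences that contain the other word; the function name `build_meaning_hmm` and the input format are likewise hypothetical.

```python
from collections import Counter, defaultdict


def build_meaning_hmm(tagged_sentences):
    """Build {(word, meaning ID): {co-occurring word: rate}}.

    tagged_sentences: iterable of (word, meaning_id, tokens), where tokens
    is a sentence in which `word` is used in the meaning `meaning_id`.
    Assumption: rate = share of that meaning's sentences containing the word.
    """
    sentence_counts = Counter()
    cooc_counts = defaultdict(Counter)
    for word, meaning_id, tokens in tagged_sentences:
        key = (word, meaning_id)
        sentence_counts[key] += 1
        # Count each co-occurring word at most once per sentence.
        for other in set(tokens) - {word}:
            cooc_counts[key][other] += 1
    return {
        key: {other: n / sentence_counts[key] for other, n in counts.items()}
        for key, counts in cooc_counts.items()
    }
```

Under this assumption, a word that appears in every example sentence of a given meaning receives a rate of 1.0 for that meaning, and a word that appears in half of them receives 0.5.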

Next, an example of the processing procedure of the encoding device 100 according to the present embodiment is described. FIG. 9 is a flowchart illustrating the processing procedure of the encoding device according to the present embodiment. As illustrated in FIG. 9, the extraction unit 151 of the encoding device 100 extracts words, one at a time, from the text data F1 to be encoded (step S101).

The specifying unit 152 of the encoding device 100 determines, based on the offset table 141, whether the word is a polysemous word (step S102). If the word is not a polysemous word (step S102, No), the processing proceeds to step S103. If the word is a polysemous word (step S102, Yes), the processing proceeds to step S105.

The encoding unit 153 of the encoding device 100 compares the word with the encoded zelkova tree 143 and identifies the code (a static code or a dynamic code) corresponding to the word (step S103). The encoding unit 153 writes the code (the static code or the dynamic code) to the encoded file F2 (step S104), and the processing proceeds to step S109.

The encoding unit 153 identifies the meaning ID based on the words that appear before and after the target word and on the co-occurring words (co-occurrence rates) of the meaning HMM 142 (step S105). The encoding unit 153 determines whether a dynamic code corresponding to the polysemous word and the meaning ID is registered in the encoded zelkova tree 143 (step S106).

If a dynamic code corresponding to the polysemous word and the meaning ID is registered in the encoded zelkova tree 143 (step S106, Yes), the encoding unit 153 proceeds to step S108. If no such dynamic code is registered in the encoded zelkova tree 143 (step S106, No), the encoding unit 153 proceeds to step S107.

The encoding unit 153 assigns a dynamic code corresponding to the polysemous word and the meaning ID, and updates the dynamic dictionary 144 (step S107). The encoding unit 153 writes the dynamic code to the encoded file F2 (step S108).

If the end of the text data has not been reached (step S109, No), the encoding device 100 returns to step S101. If the end of the text data has been reached (step S109, Yes), the encoding device 100 proceeds to step S110.

The encoding unit 153 stores the information of the dynamic dictionary 144 in the trailer portion 2c of the encoded file F2 (step S110).
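The flow of steps S101 to S110 can be summarized with the following Python sketch. It is a simplified model, not the implementation of the embodiment: the function `encode_text`, its parameters, the one-word context window, and the dictionary returned in place of the encoded file F2 are all assumptions made for illustration. The polysemy check uses a plain set instead of the offset table 141, and the meaning ID is supplied by a caller-provided function.

```python
def encode_text(words, static_codes, polysemous, specify_meaning,
                first_dynamic_code=0xAE00, window=1):
    """Encode a token list following the S101-S110 flow (toy model).

    words:           tokens already split from the text data (S101)
    static_codes:    word -> static code
    polysemous:      set of polysemous words (stand-in for offset table 141)
    specify_meaning: callable(word, context_words) -> meaning ID
    """
    dynamic = {}                      # word or (word, meaning ID) -> code
    next_code = first_dynamic_code
    body = []
    for i, word in enumerate(words):                         # S101
        if word not in polysemous:                           # S102: No
            if word in static_codes:                         # S103
                body.append(static_codes[word])              # S104
                continue
            key = word        # unknown word: falls back to a dynamic code
        else:                                                # S102: Yes
            context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            key = (word, specify_meaning(word, context))     # S105
        if key not in dynamic:                               # S106: No
            dynamic[key] = next_code                         # S107
            next_code += 1
        body.append(dynamic[key])                            # S104 / S108
    # S109: end of text reached. S110: keep the dynamic dictionary
    # alongside the codes so that a decoder can resolve dynamic codes.
    trailer = {code: key for key, code in dynamic.items()}
    return {"body": body, "trailer": trailer}
```

The same surface word receives different dynamic codes when `specify_meaning` returns different meaning IDs, which is the property the embodiment relies on.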

Next, the effects of the encoding device 100 according to the present embodiment are described. The encoding device 100 can specify the meaning of a target word by comparing the words that appear before and after the target word in the text file F1 with the co-occurring words of the meaning HMM 142. The encoding device 100 also registers the meaning code corresponding to the combination of the target word and its meaning in the dynamic dictionary 144, obtains a dynamic code, and performs encoding. Consequently, even the same word can be encoded with a code that depends on its meaning.

The encoding device 100 generates co-occurrence information for each meaning of a word based on information such as an English-English dictionary, and generates the meaning HMM 142 by associating each meaning of the word with its co-occurrence information. Using the meaning HMM 142 makes it possible to specify the meanings of the words included in the text file F1.
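The comparison of surrounding words with the co-occurring words of each meaning can be sketched as below. The scoring rule is an assumption: this passage does not state how the co-occurrence rates are combined, so the sketch simply sums the rates of the context words for each candidate meaning and picks the highest total, breaking ties in favor of the lowest meaning ID. The function name `specify_meaning` and the table layout are hypothetical.

```python
def specify_meaning(word, context_words, meaning_hmm):
    """Pick the meaning ID whose co-occurring words best match the context.

    meaning_hmm: {(word, meaning ID): {co-occurring word: rate}}
    Returns None when the table has no entry for the word.
    Assumed scoring: sum of the rates of the context words.
    """
    best_id, best_score = None, -1.0
    # Sorted iteration makes ties resolve to the lowest meaning ID.
    for (w, meaning_id), rates in sorted(meaning_hmm.items()):
        if w != word:
            continue
        score = sum(rates.get(c, 0.0) for c in context_words)
        if score > best_score:
            best_id, best_score = meaning_id, score
    return best_id
```

A probabilistic model could replace the summed score without changing the interface; the sketch only shows where the co-occurrence table enters the decision.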

An example of the hardware configuration of a computer that realizes the same functions as the encoding device 100 described in the present embodiment is now described. FIG. 10 is a diagram illustrating an example of the hardware configuration of a computer that realizes the same functions as the encoding device.

As illustrated in FIG. 10, the computer 200 includes a CPU 201 that executes various arithmetic processes, an input device 202 that receives data input from a user, and a display 203. The computer 200 also includes a reading device 204 that reads programs and the like from a storage medium, and an interface device 205 that exchanges data with other computers via a wireless network. The computer 200 further includes a RAM 206 that temporarily stores various information, and a hard disk device 207. The devices 201 to 207 are connected to a bus 208.

The hard disk device 207 stores an extraction program 207a, a specifying program 207b, an encoding program 207c, and a co-occurrence information generation program 207d. The CPU 201 reads the extraction program 207a, the specifying program 207b, the encoding program 207c, and the co-occurrence information generation program 207d and loads them into the RAM 206.

The extraction program 207a functions as an extraction process 206a. The specifying program 207b functions as a specifying process 206b. The encoding program 207c functions as an encoding process 206c. The co-occurrence information generation program 207d functions as a co-occurrence information generation process 206d.

The processing of the extraction process 206a corresponds to the processing of the extraction unit 151. The processing of the specifying process 206b corresponds to the processing of the specifying unit 152. The processing of the encoding process 206c corresponds to the processing of the encoding unit 153. The processing of the co-occurrence information generation process 206d corresponds to the processing of the co-occurrence information generation unit 154.

The programs 207a to 207d do not necessarily have to be stored in the hard disk device 207 from the beginning. For example, each program may be stored on a "portable physical medium" inserted into the computer 200, such as a flexible disk (FD), a CD-ROM, a DVD, a magneto-optical disk, or an IC card. The computer 200 may then read and execute the programs 207a to 207d from that medium.

The following supplementary notes are further disclosed regarding the embodiments, including the examples described above.

(Supplementary Note 1) A meaning specifying program that causes a computer to execute a process comprising:
receiving text data;
extracting a word included in the received text data; and
when the extracted word has a plurality of meanings, referring to a storage unit that stores, for each meaning of a word having a plurality of meanings, co-occurrence information on other words with respect to the word, and specifying the meaning of the extracted word based on the co-occurrence information.

(Supplementary Note 2) The meaning specifying program according to Supplementary Note 1, wherein the co-occurrence information includes information on meaning codes respectively corresponding to the plurality of meanings of a word, and the program further causes the computer to execute a process of determining the meaning code to be assigned to the word based on the co-occurrence information and the meaning specified by the process of specifying the meaning.

(Supplementary Note 3) The meaning specifying program according to Supplementary Note 2, further causing the computer to execute a process of registering the meaning code determined by the process of determining the meaning code at a registration position of a dynamic dictionary to which a registration number is assigned.

(Supplementary Note 4) The meaning specifying program according to Supplementary Note 3, further causing the computer to execute a process of outputting the registration number of the registration position in the dynamic dictionary as the code corresponding to the word extracted by the process of extracting the word.

(Supplementary Note 5) An information generation program that causes a computer to execute a process comprising:
receiving information that includes a word having a plurality of meanings and an explanation of each meaning of the word;
generating, based on the received information and for each meaning of the word included in the information, co-occurrence information on other words with respect to the meaning; and
generating information that associates the generated co-occurrence information with the meaning of the word included in the information.

(Supplementary Note 6) A meaning specifying method executed by a computer, the method comprising:
receiving text data;
extracting a word included in the received text data; and
when the extracted word has a plurality of meanings, referring to a storage unit that stores, for each meaning of a word having a plurality of meanings, co-occurrence information on other words with respect to the word, and specifying the meaning of the extracted word based on the co-occurrence information.

(Supplementary Note 7) The meaning specifying method according to Supplementary Note 6, wherein the co-occurrence information includes information on meaning codes respectively corresponding to the plurality of meanings of a word, and the method further comprises determining the meaning code to be assigned to the word based on the co-occurrence information and the meaning specified by the process of specifying the meaning.

(Supplementary Note 8) The meaning specifying method according to Supplementary Note 7, further comprising registering the meaning code determined by the process of determining the meaning code at a registration position of a dynamic dictionary to which a registration number is assigned.

(Supplementary Note 9) The meaning specifying method according to Supplementary Note 8, further comprising outputting the registration number of the registration position in the dynamic dictionary as the code corresponding to the word extracted by the process of extracting the word.

(Supplementary Note 10) An information generation method executed by a computer, the method comprising:
receiving information that includes a word having a plurality of meanings and an explanation of each meaning of the word;
generating, based on the received information and for each meaning of the word included in the information, co-occurrence information on other words with respect to the meaning; and
generating information that associates the generated co-occurrence information with the meaning of the word included in the information.

(Supplementary Note 11) A meaning specifying device comprising:
an extraction unit that receives text data and extracts a word included in the received text data; and
a specifying unit that, when the word has a plurality of meanings, refers to a storage unit that stores, for each meaning of a word having a plurality of meanings, co-occurrence information on other words with respect to the word, and specifies the meaning of the extracted word based on the co-occurrence information.

(Supplementary Note 12) The meaning specifying device according to Supplementary Note 11, wherein the co-occurrence information includes information on meaning codes respectively corresponding to the plurality of meanings of a word, and the specifying unit further determines the meaning code to be assigned to the word based on the co-occurrence information and the meaning specified by the process of specifying the meaning.

(Supplementary Note 13) The meaning specifying device according to Supplementary Note 12, further comprising an encoding unit that registers the meaning code determined by the process of determining the meaning code at a registration position of a dynamic dictionary to which a registration number is assigned.

(Supplementary Note 14) The meaning specifying device according to Supplementary Note 13, wherein the encoding unit further outputs the registration number of the registration position in the dynamic dictionary as the code corresponding to the word extracted by the extraction unit.

(Supplementary Note 15) An information generation device comprising:
a reception unit that receives information that includes a word having a plurality of meanings and an explanation of each meaning of the word; and
a co-occurrence information generation unit that generates, based on the information received by the reception unit and for each meaning of the word included in the information, co-occurrence information on other words with respect to the meaning, and generates information that associates the generated co-occurrence information with the meaning of the word included in the information.

Description of Symbols
100 Encoding device
110 Interface unit
140 Storage unit
141 Offset table
142 Meaning HMM
143 Encoded zelkova tree
144 Dynamic dictionary
150 Control unit
151 Extraction unit
152 Specifying unit
153 Encoding unit
154 Co-occurrence information generation unit

Claims

1. A meaning specifying program that causes a computer to execute a process comprising:
receiving text data;
extracting a word included in the received text data; and
when the extracted word has a plurality of meanings, referring to a storage unit that stores, for each meaning of a word having a plurality of meanings, co-occurrence information on other words with respect to the word, and specifying the meaning of the extracted word based on the co-occurrence information.

2. The meaning specifying program according to claim 1, wherein the co-occurrence information includes information on meaning codes respectively corresponding to the plurality of meanings of a word, and the program further causes the computer to execute a process of determining the meaning code to be assigned to the word based on the co-occurrence information and the meaning specified by the process of specifying the meaning.

3. The meaning specifying program according to claim 2, further causing the computer to execute a process of registering the meaning code determined by the process of determining the meaning code at a registration position of a dynamic dictionary to which a registration number is assigned.

4. The meaning specifying program according to claim 3, further causing the computer to execute a process of outputting the registration number of the registration position in the dynamic dictionary as the code corresponding to the word extracted by the process of extracting the word.

5. An information generation program that causes a computer to execute a process comprising:
receiving information that includes a word having a plurality of meanings and an explanation of each meaning of the word;
generating, based on the received information and for each meaning of the word included in the information, co-occurrence information on other words with respect to the meaning; and
generating information that associates the generated co-occurrence information with the meaning of the word included in the information.

6. A meaning specifying method executed by a computer, the method comprising:
receiving text data;
extracting a word included in the received text data; and
when the extracted word has a plurality of meanings, referring to a storage unit that stores, for each meaning of a word having a plurality of meanings, co-occurrence information on other words with respect to the word, and specifying the meaning of the extracted word based on the co-occurrence information.

7. An information generation method executed by a computer, the method comprising:
receiving information that includes a word having a plurality of meanings and an explanation of each meaning of the word;
generating, based on the received information and for each meaning of the word included in the information, co-occurrence information on other words with respect to the meaning; and
generating information that associates the generated co-occurrence information with the meaning of the word included in the information.

8. A meaning specifying device comprising:
an extraction unit that receives text data and extracts a word included in the received text data; and
a specifying unit that, when the word has a plurality of meanings, refers to a storage unit that stores, for each meaning of a word having a plurality of meanings, co-occurrence information on other words with respect to the word, and specifies the meaning of the extracted word based on the co-occurrence information.

9. An information generation device comprising:
a reception unit that receives information that includes a word having a plurality of meanings and an explanation of each meaning of the word; and
a co-occurrence information generation unit that generates, based on the information received by the reception unit and for each meaning of the word included in the information, co-occurrence information on other words with respect to the meaning, and generates information that associates the generated co-occurrence information with the meaning of the word included in the information.