JP5373998B1 - Dictionary generating apparatus, method, and program

Info

Publication number: JP5373998B1
Application number: JP2013515598A
Authority: JP
Inventors: 正人萩原
Original assignee: Rakuten Inc
Current assignee: Rakuten Group Inc
Priority date: 2012-02-28
Filing date: 2012-09-03
Publication date: 2013-12-18
Anticipated expiration: 2032-09-03
Also published as: CN103608805A; KR20130137048A; JPWO2013128684A1; TWI452475B; WO2013128684A1; TW201335776A; KR101379128B1; CN103608805B

Abstract

A dictionary generation device includes: a model generation unit that generates a word segmentation model using a corpus and a word group prepared in advance; an analysis unit that performs word segmentation incorporating the word segmentation model on a set of collected texts and attaches boundary information to each text; a selection unit that selects words to be registered in a dictionary from the texts to which the analysis unit has attached boundary information; and a registration unit that registers the words selected by the selection unit in the dictionary. Each text included in the corpus carries boundary information indicating word boundaries.

Description

One aspect of the present invention relates to an apparatus, a method, a program, and a computer-readable recording medium for generating a word dictionary.

A technique for obtaining a plurality of words by dividing a sentence using a word dictionary (word segmentation) has long been known. In this connection, Patent Document 1 below describes a technique that searches a word dictionary for words matching substrings of an input text and generates them as word candidates; selects, from the substrings of the input text that do not match the word dictionary, those that may be unknown words as unknown-word candidates; estimates the word occurrence probability of each unknown-word candidate for each part of speech using an unknown-word model; and finds the word sequence that maximizes the joint probability using dynamic programming.

Patent Document 1: JP 2001-051996 A

To segment text correctly, it is desirable to have a large number of words in the dictionary so that lexical knowledge is as rich as possible. However, building a large dictionary by hand is not easy. There is therefore a need for a way to construct a large-scale word dictionary easily.

A dictionary generation device according to one aspect of the present invention includes: a model generation unit that generates a word segmentation model using a corpus and a word group prepared in advance, wherein each text included in the corpus carries boundary information indicating word boundaries, and the boundary information includes first information indicating that no boundary exists at an inter-character position, second information indicating that a boundary exists at an inter-character position, and third information indicating that a boundary exists probabilistically at an inter-character position; an analysis unit that performs word segmentation incorporating the word segmentation model on a set of collected texts and attaches boundary information to each text; a selection unit that selects words to be registered in a dictionary from the texts to which the analysis unit has attached boundary information; and a registration unit that registers the words selected by the selection unit in the dictionary.

A dictionary generation method according to one aspect of the present invention is executed by a dictionary generation device and includes: a model generation step of generating a word segmentation model using a corpus and a word group prepared in advance, wherein each text included in the corpus carries boundary information indicating word boundaries, and the boundary information includes first information indicating that no boundary exists at an inter-character position, second information indicating that a boundary exists at an inter-character position, and third information indicating that a boundary exists probabilistically at an inter-character position; an analysis step of performing word segmentation incorporating the word segmentation model on a set of collected texts and attaching boundary information to each text; a selection step of selecting words to be registered in a dictionary from the texts to which boundary information was attached in the analysis step; and a registration step of registering the words selected in the selection step in the dictionary.

A dictionary generation program according to one aspect of the present invention causes a computer to function as: a model generation unit that generates a word segmentation model using a corpus and a word group prepared in advance, wherein each text included in the corpus carries boundary information indicating word boundaries, and the boundary information includes first information indicating that no boundary exists at an inter-character position, second information indicating that a boundary exists at an inter-character position, and third information indicating that a boundary exists probabilistically at an inter-character position; an analysis unit that performs word segmentation incorporating the word segmentation model on a set of collected texts and attaches boundary information to each text; a selection unit that selects words to be registered in a dictionary from the texts to which the analysis unit has attached boundary information; and a registration unit that registers the words selected by the selection unit in the dictionary.

A computer-readable recording medium according to one aspect of the present invention stores a dictionary generation program that causes a computer to function as: a model generation unit that generates a word segmentation model using a corpus and a word group prepared in advance, wherein each text included in the corpus carries boundary information indicating word boundaries, and the boundary information includes first information indicating that no boundary exists at an inter-character position, second information indicating that a boundary exists at an inter-character position, and third information indicating that a boundary exists probabilistically at an inter-character position; an analysis unit that performs word segmentation incorporating the word segmentation model on a set of collected texts and attaches boundary information to each text; a selection unit that selects words to be registered in a dictionary from the texts to which the analysis unit has attached boundary information; and a registration unit that registers the words selected by the selection unit in the dictionary.

According to these aspects, a word segmentation model is generated using a corpus annotated with boundary information together with a word group, and word segmentation incorporating that model is applied to a text set. Words are then selected from the text set, which now carries boundary information as a result, and registered in the dictionary. Because boundary information is attached to the text set through analysis based on a boundary-annotated corpus, and the words extracted from that text set are then registered, a large-scale word dictionary can be constructed easily. In addition, introducing third information that represents an intermediate concept, rather than a binary choice between a boundary existing or not existing, allows text to be divided into words more appropriately.

In a dictionary generation device according to another aspect, the selection unit may select the words to be registered in the dictionary based on the appearance frequency of each word, calculated from the boundary information attached by the analysis unit. Taking the appearance frequency calculated in this way into account improves the accuracy of the dictionary.

In a dictionary generation device according to yet another aspect, the selection unit may select words whose appearance frequency is equal to or higher than a predetermined threshold. Registering only words that appear at least a certain number of times improves the accuracy of the dictionary.

In a dictionary generation device according to yet another aspect, the selection unit may extract words whose appearance frequency is equal to or higher than a threshold as registration candidates and select a predetermined number of words from those candidates in descending order of appearance frequency, and the registration unit may add the words selected by the selection unit to the dictionary in which the word group is recorded. Registering only words with a relatively high appearance frequency improves the accuracy of the dictionary. Adding the words to the dictionary that already holds the prepared word group also keeps the dictionary configuration simple.
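The threshold-then-top-N selection described above can be sketched as follows. This is only an illustrative sketch: the function name `select_words`, the dictionary-of-frequencies input, and the tie-breaking rule (alphabetical order among equal frequencies) are assumptions, not details taken from the patent.

```python
from typing import Dict, List


def select_words(freq: Dict[str, float], threshold: float, top_n: int) -> List[str]:
    """Pick registration candidates by threshold, then keep the top_n most frequent.

    freq maps each candidate word to its appearance frequency
    (hypothetical input format).
    """
    # Step 1: keep only words whose frequency reaches the threshold.
    candidates = [(word, f) for word, f in freq.items() if f >= threshold]
    # Step 2: order by descending frequency; ties are broken by the word itself
    # (an assumption made here so that the result is deterministic).
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    # Step 3: keep at most top_n words.
    return [word for word, _ in candidates[:top_n]]
```

The selected list could then be appended to the existing dictionary or written to a separate one, depending on which registration variant is used.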

In a dictionary generation device according to yet another aspect, the selection unit may extract words whose appearance frequency is equal to or higher than a threshold as registration candidates and select a predetermined number of words from those candidates in descending order of appearance frequency, and the registration unit may register the words selected by the selection unit in a dictionary separate from the one in which the word group is recorded. Registering only words with a relatively high appearance frequency improves the accuracy of the dictionary. Adding the words to a dictionary separate from the prepared word-group dictionary (the existing dictionary) also makes it possible to generate a dictionary whose characteristics differ from those of the existing dictionary.

In a dictionary generation device according to yet another aspect, the registration unit may register the words selected by the selection unit in a dictionary separate from the one in which the word group is recorded. Adding the words to a dictionary separate from the prepared word-group dictionary (the existing dictionary) makes it possible to generate a dictionary whose characteristics differ from those of the existing dictionary.

In a dictionary generation device according to yet another aspect, the selection unit may extract words whose appearance frequency is equal to or higher than a threshold as registration candidates and group those candidates according to how high their appearance frequency is, and the registration unit may register the groups generated by the selection unit individually in a plurality of dictionaries separate from the one in which the word group is recorded. Grouping words by appearance frequency and registering each group in its own dictionary makes it possible to generate a plurality of dictionaries whose characteristics differ according to appearance frequency.

In a dictionary generation device according to yet another aspect, each of the collected texts may be associated with information indicating the field of that text, and the registration unit may register each word selected by the selection unit in a dictionary prepared for each field, based on the field of the text that contained the word. Generating a dictionary for each field makes it possible to generate a plurality of dictionaries with mutually different characteristics.

In a dictionary generation device according to yet another aspect, the appearance frequency of each word may be calculated based on the first, second, and third information. Introducing third information that represents an intermediate concept, rather than a binary choice between a boundary existing or not existing, allows text to be divided into words more appropriately.

In a dictionary generation device according to yet another aspect, the analysis unit may include a first binary classifier and a second binary classifier. The first binary classifier determines, for each inter-character position, whether to assign the first information or information other than the first information. The second binary classifier determines, for each inter-character position at which the first binary classifier decided to assign information other than the first information, whether to assign the second information or the third information. Fixing the boundary information in stages with a plurality of binary classifiers allows boundary information to be attached to text quickly and efficiently.
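The two-stage decision can be illustrated with a minimal sketch. The classifiers are passed in as plain callables; their names, the feature representation, and the numeric tag values 0, 0.5, and 1 (borrowed from the three-level tags described later in the text) are assumptions for illustration, and no particular learning library is implied.

```python
from typing import Callable, Dict

# Tag values for the first, third, and second information, respectively.
NON_SPLIT, HALF_SPLIT, SPLIT = 0.0, 0.5, 1.0

Features = Dict[str, float]  # hypothetical feature representation


def cascade_tag(
    features: Features,
    is_non_split: Callable[[Features], bool],
    is_split: Callable[[Features], bool],
) -> float:
    """Assign a boundary tag to one inter-character position in two stages."""
    # Stage 1: "non-split" versus "anything else".
    if is_non_split(features):
        return NON_SPLIT
    # Stage 2: consulted only for positions that stage 1 did not reject;
    # it separates "split" from "half-split".
    return SPLIT if is_split(features) else HALF_SPLIT
```

Because the second classifier is evaluated only for positions that the first one passes on, most positions in a typical text would need a single classification.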

In a dictionary generation device according to yet another aspect, the set of collected texts may be divided into a plurality of groups. After the analysis unit, the selection unit, and the registration unit have executed processing based on one of the groups, the model generation unit generates a word segmentation model using the corpus, the word group, and the words registered by the registration unit; the analysis unit, the selection unit, and the registration unit then execute processing based on another of the groups.

According to one aspect of the present invention, a large-scale word dictionary can be constructed easily.

FIG. 1 is a diagram showing the hardware configuration of the dictionary generation device according to the embodiment. FIG. 2 is a block diagram showing the functional configuration of the dictionary generation device shown in FIG. 1. FIG. 3 is a diagram for explaining the setting of boundary information (word boundary tags). FIG. 4 is a flowchart showing the operation of the dictionary generation device shown in FIG. 1. FIG. 5 is a diagram showing the configuration of the dictionary generation program according to the embodiment.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.

First, the functional configuration of the dictionary generation device 10 according to the embodiment will be described with reference to FIGS. 1 to 3. The dictionary generation device 10 is a computer that analyzes a set consisting of a large amount of collected text (hereinafter also referred to as "large-scale text"), extracts words from that text set, and adds the extracted words to a dictionary.

As shown in FIG. 1, the dictionary generation device 10 includes a CPU 101 that executes an operating system, application programs, and the like; a main storage unit 102 composed of ROM and RAM; an auxiliary storage unit 103 composed of a hard disk or the like; a communication control unit 104 composed of a network card or the like; an input device 105 such as a keyboard and a mouse; and an output device 106 such as a display.

Each functional component of the dictionary generation device 10 described below is realized by loading predetermined software onto the CPU 101 and the main storage unit 102, operating the communication control unit 104, the input device 105, the output device 106, and the like under the control of the CPU 101, and reading and writing data in the main storage unit 102 and the auxiliary storage unit 103. The data and databases required for processing are stored in the main storage unit 102 and the auxiliary storage unit 103. Although FIG. 1 shows the dictionary generation device 10 as a single computer, its functions may be distributed across a plurality of computers.

As shown in FIG. 2, the dictionary generation device 10 includes a model generation unit 11, an analysis unit 12, a selection unit 13, and a registration unit 14 as functional components. When executing the word extraction process, the dictionary generation device 10 refers to a learning corpus 20, an existing dictionary 31, and a large-scale text 40 prepared in advance, and stores the extracted words in a word dictionary 30. The word dictionary 30 includes at least the existing dictionary 31 and may further include one or more additional dictionaries 32. These data are described first, before the dictionary generation device 10 itself is described in detail.

The learning corpus 20 is a set of texts to which boundary information (annotations) indicating word boundaries (the division positions when a sentence is divided into words) has been attached (associated), and it is prepared in advance as a database. A text is a sentence or character string consisting of a plurality of words. In this embodiment, a predetermined number of texts extracted at random from the product titles and descriptions accumulated on the website of a virtual shopping mall are used as the material for the learning corpus 20.

Boundary information is attached to each extracted text manually by an evaluator. The boundary information is set based on two techniques: word segmentation by point estimation, and a three-level word segmentation corpus.

[Word segmentation by point estimation]
Word boundary tags b = b_1 b_2 ... b_n are assigned to a text (character string) x = x_1 x_2 ... x_n, where x_1, x_2, ..., x_n are characters. Here, b_i is a tag indicating whether a word boundary exists between the characters x_i and x_{i+1} (an inter-character position): b_i = 1 means "split" and b_i = 0 means "non-split". The value of the tag b_i can also be regarded as the strength of the split.

FIG. 3 shows an example of determining the tag between "ン (n)" and "を (wo)" in the Japanese sentence "ボールペンを買った。" (bo-rupen wo katta; in English, "(I) bought a ballpoint pen."). The value of a word boundary tag is determined by referring to features obtained from the characters around it. For example, the value of the word boundary tag is set using three kinds of features: character features, character type features, and dictionary features.

A character feature is the combination of a character n-gram of length n or less that touches or contains the boundary b_i and the position of that n-gram relative to b_i. For example, with n = 3 in FIG. 3, the following nine features are obtained for the boundary b_i between "ン (n)" and "を (wo)": "-1/ン (n)", "1/を (wo)", "-2/ペン (pen)", "-1/ンを (n wo)", "1/を買 (wo ka)", "-3/ルペン (rupen)", "-2/ペンを (pen wo)", "-1/ンを買 (n wo ka)", and "1/を買っ (wo kat)".
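The enumeration of character features can be reproduced with a short sketch. The function name and the `"position/ngram"` string encoding are assumptions chosen to mirror the notation in the example; the relative-position convention (negative offsets for n-grams starting left of the boundary, 1 for n-grams starting immediately after it) is inferred from the nine features listed above.

```python
from typing import List


def char_ngram_features(text: str, boundary: int, max_n: int = 3) -> List[str]:
    """List character n-gram features for one inter-character position.

    boundary is the number of characters to the left of the position,
    so boundary=5 in "ボールペンを買った。" is the gap between "ン" and "を".
    """
    features = []
    for n in range(1, max_n + 1):
        # An n-gram touches or contains the boundary when it starts
        # anywhere from n characters before it up to right at it.
        for start in range(boundary - n, boundary + 1):
            if start < 0 or start + n > len(text):
                continue  # n-gram would run off the text
            # Relative position: -k for an n-gram starting k characters
            # to the left, 1 for one starting right after the boundary.
            rel = start - boundary if start < boundary else 1
            features.append(f"{rel}/{text[start:start + n]}")
    return features
```

Calling `char_ngram_features("ボールペンを買った。", 5)` yields the same nine features as the example, from "-1/ン" through "1/を買っ".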

Character type features are the same as the character features above, except that they deal with character types instead of the characters themselves. Eight character types were considered: hiragana, katakana, kanji, uppercase letters, lowercase letters, Arabic numerals, kanji numerals, and the middle dot (・). The character types used, and the number of types, are not limited in any way.
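One way to map characters onto the eight types is sketched below. The Unicode ranges, the set of kanji numerals, the handling of ASCII-only letters and digits, and the fallback label "other" are all assumptions made for illustration; the source does not specify how the types are detected.

```python
KANJI_NUMERALS = set("〇一二三四五六七八九十百千万億兆")  # assumed membership


def char_type(ch: str) -> str:
    """Classify a single character into one of eight types, or "other"."""
    if ch == "・":
        return "middle_dot"  # checked first: it sits inside the katakana block
    if ch in KANJI_NUMERALS:
        return "kanji_numeral"  # checked before the general kanji range
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "A" <= ch <= "Z":
        return "upper"
    if "a" <= ch <= "z":
        return "lower"
    if "0" <= ch <= "9":
        return "digit"
    return "other"  # punctuation, full-width forms, etc. (not in the source)


def type_sequence(text: str) -> list:
    """Replace every character by its type, ready for n-gram extraction."""
    return [char_type(ch) for ch in text]
```

Character type features could then be built by taking n-grams over `type_sequence(text)` in the same way that character features are built over the text itself.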

A dictionary feature indicates whether a word of length j (1 ≤ j ≤ k) located around the boundary exists in the dictionary. It is expressed as the combination of a flag, which indicates whether the boundary b_i is located at the end of the word (L), at the start of the word (R), or inside the word (M), and the length j of the word. If the words "ペン (pen)" and "を (wo)" are registered in the dictionary, the dictionary features L2 and R1 are created for the boundary b_i in FIG. 3. When a plurality of dictionaries are used, as described later, a dictionary identifier is added to the dictionary feature. For example, if "ペン (pen)" is registered in dictionary A, whose identifier is DIC1, and "を (wo)" is registered in dictionary B, whose identifier is DIC2, the dictionary features are expressed as DIC1-L2, DIC2-R1, and so on.
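The L/R/M flags can be computed with a lookup over the substrings around the boundary. The sketch below assumes a dictionary represented as a plain set of strings and an optional identifier prefix; both the function name and those representations are hypothetical.

```python
from typing import Optional, Set


def dictionary_features(
    text: str,
    boundary: int,
    dictionary: Set[str],
    max_len: int = 8,
    dict_id: Optional[str] = None,
) -> Set[str]:
    """Collect dictionary features (flag + word length) for one boundary.

    boundary is the number of characters to the left of the position.
    """
    features = set()
    for j in range(1, max_len + 1):
        for start in range(boundary - j, boundary + 1):
            end = start + j
            if start < 0 or end > len(text):
                continue
            if text[start:end] not in dictionary:
                continue
            if end == boundary:
                flag = "L"  # the boundary is at the end of the word
            elif start == boundary:
                flag = "R"  # the boundary is at the start of the word
            else:
                flag = "M"  # the boundary lies inside the word
            name = f"{flag}{j}"
            features.add(f"{dict_id}-{name}" if dict_id else name)
    return features
```

With the dictionary `{"ペン", "を"}` and the boundary between "ン" and "を", this returns `{"L2", "R1"}`, matching the example; passing `dict_id="DIC1"` produces identifiers such as `DIC1-L2`.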

In this embodiment, the maximum n-gram length n for the character features and character type features is 3, and the maximum word length k for the dictionary features is 8, but these values may be chosen arbitrarily.

[Three-level word segmentation corpus]
Japanese contains words whose boundaries are difficult to determine uniquely, and the appropriate way to segment them differs from one situation to another. As an example, consider a keyword search over a text set containing the word "ボールペン (bo-rupen)" (in English, "ballpoint pen"). If "ボールペン (bo-rupen)" is not split, a search for the keyword "ペン (pen)" (in English, "pen") will not retrieve the text (lower recall). On the other hand, if "ボールペン (bo-rupen)" is split into "ボール (bo-ru)" (in English, "ball") and "ペン (pen)", a search using the keyword "ボール (bo-ru)", meaning the item of sports equipment, will retrieve texts containing "ボールペン (bo-rupen)" (lower precision).

For this reason, a three-level word segmentation corpus is used, which introduces the concept of "half-split" in addition to the two values "split" and "non-split" described above. The three-level word segmentation corpus is an extension of probabilistic word segmentation, which expresses the manner of splitting as a probabilistic value. It is used because the strength of a word split that a human can actually perceive has at most a few levels, so there is little need to express the manner of splitting with continuous probability values. For a word that contains a half-split, both the whole word and its components are extracted. This makes it possible to record a word as half-split for the time being when it is difficult for a human to judge whether it should be split or not, and it also makes attaching boundary information easier. "Half-split" is one way of indicating that a boundary exists probabilistically (with a probability greater than 0 and less than 1) at an inter-character position.

The three-level word segmentation corpus is a corpus generated by discrete probabilistic word segmentation with three levels: "split" (b_i = 1) and "non-split" (b_i = 0), plus "half-split" (b_i = 0.5). For example, it is natural to define as half-splits the divisions (shown by "/" in these examples) inside compound nouns such as "ボール/ペン (bo-ru/pen)", compound verbs such as "折り/たたむ (ori/tatamu)" (in English, "fold"), and words that have been lexicalized together with an affix such as "お/すすめ (o/susume)" (in English, "recommendation"). In addition, "充電池 (juudenchi)" (in English, "rechargeable battery") can be regarded as a compound of the "AB + BC → ABC" type, formed from "充電 (juuden)" (in English, "recharge") and "電池 (denchi)" (in English, "battery"); such a word is half-split as "充/電/池 (juu/den/chi)".

The text "ボールペンを買った。" (bo-rupen wo katta) is segmented, for example, as shown in FIG. 3, using the word segmentation by point estimation and the three-level word segmentation corpus described above. In the example of FIG. 3, the "split" word boundary tag (b_i = 1) is attached at the beginning of the text, between "ン (n)" and "を (wo)", and so on. The "half-split" word boundary tag (b_i = 0.5) is attached between "ル (ru)" and "ペ (pe)". The "non-split" word boundary tag (b_i = 0) is omitted from FIG. 3, but it is attached wherever no boundary is shown between characters (for example, between "ペ (pe)" and "ン (n)").

Word boundary tags are attached to each text as boundary information, and the texts are stored in a database as the learning corpus 20. Any method may be used to attach the boundary information to a text. As one example, the boundary information may be embedded in each text by indicating "split" with a space, indicating "half-split" with a hyphen, and omitting any indication of "non-split". In this case, a text with boundary information can be recorded as a plain character string.
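The space/hyphen encoding can be decoded back into a raw text and a list of tags with a small parser. This is a sketch under several assumptions: the function name is hypothetical, only tags between characters are returned (markers at the very start or end of the string are ignored), and the text itself is assumed not to contain literal spaces or ASCII hyphens. The sample annotation in the usage note is illustrative and is not taken from FIG. 3.

```python
from typing import List, Tuple


def parse_annotated(annotated: str, split: str = " ", half: str = "-") -> Tuple[str, List[float]]:
    """Decode an annotated string into (raw text, tags between characters).

    tags[i] is the tag between raw characters i and i+1:
    1.0 for split, 0.5 for half-split, 0.0 for non-split.
    """
    chars: List[str] = []
    tags: List[float] = []
    pending = 0.0  # tag for the gap before the next real character
    for ch in annotated:
        if ch == split:
            pending = 1.0
        elif ch == half:
            pending = 0.5
        else:
            if chars:  # no gap exists before the first character
                tags.append(pending)
            chars.append(ch)
            pending = 0.0
    return "".join(chars), tags
```

For instance, `parse_annotated("ボール-ペン を 買っ た 。")` returns the raw text "ボールペンを買った。" together with nine tags, with 0.5 at the gap between "ル" and "ペ" and 1.0 at the gap between "ン" and "を". The long vowel mark "ー" is a different character from the ASCII hyphen, so it is kept as part of the text.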

既存辞書３１は、所定数の単語の集合であり、データベースとして予め用意されている。既存辞書３１は一般に用いられている電子化辞書でもよく、例えばＵｎｉＤｉｃという形態素解析辞書であってもよい。 The existing dictionary 31 is a set of a predetermined number of words, and is prepared in advance as a database. The existing dictionary 31 may be a generally used electronic dictionary, for example, a UniDic morphological analysis dictionary.

大規模テキスト４０は、収集されたテキストの集合であり、データベースとして予め用意されている。大規模テキスト４０には、抽出しようとする単語やその単語の分野などに応じて、任意の文や文字列を含めてよい。例えば、仮想商店街のウェブサイトから商品のタイトル及び説明文を大量に収集し、これらの生データから大規模テキスト４０を構築してもよい。大規模テキスト４０として用意されるテキストの数は、学習コーパス２０に含まれるテキストの数よりも圧倒的に多い。 The large-scale text 40 is a set of collected text and is prepared in advance as a database. The large-scale text 40 may include an arbitrary sentence or character string according to the word to be extracted and the field of the word. For example, a large amount of product titles and explanations may be collected from a virtual shopping street website, and the large-scale text 40 may be constructed from these raw data. The number of texts prepared as the large-scale text 40 is overwhelmingly larger than the number of texts included in the learning corpus 20.

以上を前提として辞書生成装置１０の機能的構成要素を説明する。 Based on the above, functional components of the dictionary generation device 10 will be described.

The model generation unit 11 is means for generating a word division model using the learning corpus 20 and the word dictionary 30. The model generation unit 11 includes a support vector machine (SVM) and generates the word division model by inputting the learning corpus 20 and the word dictionary 30 to the SVM and causing it to execute learning processing. The word division model represents the rules for how text should be divided and is output as a group of parameters used for word division. The machine learning algorithm is not limited to an SVM and may be, for example, a decision tree or logistic regression.

大規模テキスト４０を解析するために、モデル生成部１１は学習コーパス２０及び既存辞書３１に基づく学習をＳＶＭに実行させることで、最初の単語分割モデル（ベースライン・モデル）を生成する。そして、モデル生成部１１はこの単語分割モデルを解析部１２に出力する。 In order to analyze the large-scale text 40, the model generation unit 11 causes the SVM to perform learning based on the learning corpus 20 and the existing dictionary 31, thereby generating an initial word division model (baseline model). Then, the model generation unit 11 outputs this word division model to the analysis unit 12.

Thereafter, when words are added to the word dictionary 30 through the processing of the analysis unit 12, the selection unit 13, and the registration unit 14 described later, the model generation unit 11 causes the SVM to execute learning (relearning) based on the learning corpus 20 and the entire word dictionary 30, thereby generating a corrected word division model. Here, the entire word dictionary 30 means all of the words stored in the existing dictionary 31 from the beginning and the words obtained from the large-scale text 40.

解析部１２は、単語分割モデルが組み込まれた解析（単語分割）を大規模テキスト４０に対して実行して、各テキストに境界情報を付与する（関連付ける）手段である。この結果、図３に示すようなテキストが大量に得られる。解析部１２は大規模テキスト４０を成している各テキストについてそのような単語分割を実行することで、上記「分割」（第２の情報）、「半分割」（第３の情報）、及び「非分割」（第１の情報）を示す境界情報を各テキストに付与し、処理されたすべてのテキストを選択部１３に出力する。 The analysis unit 12 is a unit that performs analysis (word division) in which the word division model is incorporated on the large-scale text 40 and gives (associates) boundary information to each text. As a result, a large amount of text as shown in FIG. 3 is obtained. The analysis unit 12 performs such word division on each text constituting the large-scale text 40, so that the “division” (second information), “half-division” (third information), and Boundary information indicating “non-divided” (first information) is assigned to each text, and all processed texts are output to the selection unit 13.

The analysis unit 12 includes two binary classifiers and uses them in order to assign the three types of boundary information to each text. The first classifier determines whether an inter-character position is "non-division" or not, and the second classifier determines whether a boundary judged not to be "non-division" is "division" or "half-division". In practice, the majority of inter-character positions are "non-division". Determining first whether each position is "non-division", and then determining the mode of division only for the remaining positions, therefore allows boundary information to be assigned efficiently to a large amount of text. Combining binary classifiers also simplifies the structure of the analysis unit 12.
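The two-classifier cascade can be sketched as follows. This is only an illustration under stated assumptions: the classifiers are passed in as plain callables (`is_boundary` and `is_full_division` are hypothetical names), and the feature vectors are toy values rather than the features used in the embodiment.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def tag_boundaries(
    positions: Sequence[Features],
    is_boundary: Callable[[Features], bool],
    is_full_division: Callable[[Features], bool],
) -> list[float]:
    """Assign 0.0, 0.5, or 1.0 to each inter-character position."""
    tags: list[float] = []
    for feats in positions:
        if not is_boundary(feats):
            # First classifier: most positions stop here as "non-division".
            tags.append(0.0)
        elif is_full_division(feats):
            # Second classifier, consulted only for the remaining positions.
            tags.append(1.0)
        else:
            tags.append(0.5)
    return tags


# Toy stand-ins for the two trained classifiers.
print(tag_boundaries([[0.0], [1.0], [2.0]],
                     lambda f: f[0] > 0.0,
                     lambda f: f[0] > 1.0))
```

Because the second callable runs only when the first one reports a boundary, the cost of the second decision is paid for a minority of positions.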

選択部１３は、解析部１２により境界情報が付与されたテキストから、単語辞書３０に登録する単語を選択する手段である。 The selection unit 13 is means for selecting a word to be registered in the word dictionary 30 from the text to which boundary information is given by the analysis unit 12.

First, the selection unit 13 obtains the total appearance frequency f_r(w) of each word w included in the input text group by the following equation (1). This calculation means that the appearance frequency is obtained from the boundary information b_i assigned to each inter-character position.

Here, O_1 denotes the occurrences of the surface form of the word w and is defined as follows.
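Equation (1) and the definition of O_1 are not reproduced in this text. As a hedged reconstruction, and not the authoritative wording of the patent, one form consistent with the worked example of FIG. 3 is given below. The index convention is an assumption: the characters of a text are written x_1 ... x_n, b_i is the boundary value between x_i and x_{i+1}, and b_0 = b_n = 1 at both ends of the text.

```latex
f_r(w) \;=\; \sum_{(i,j)\,\in\,O_1} b_i \left[ \prod_{k=i+1}^{j-1} \bigl(1 - b_k\bigr) \right] b_j ,
\qquad
O_1 \;=\; \bigl\{\, (i,j) \;\bigm|\; x_{i+1}\, x_{i+2} \cdots x_j = w \,\bigr\}
```

Under this reading, each occurrence of the character string w contributes the product of the boundary values at its two ends and of (1 - b_k) for every position inside it, and the contributions are summed over all occurrences in all texts.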

The appearance frequency of the word "ボールペン" (bo-rupen, "ballpoint pen") in the single sentence "ボールペンを買った。" (bo-rupen wo katta) shown in FIG. 3 is 1.0 * 1.0 * 1.0 * 0.5 * 1.0 * 1.0 = 0.5, and the appearance frequency of the word "ペン" (pen) in that sentence is 0.5 * 1.0 * 1.0 = 0.5. This means that the words "ボールペン" and "ペン" are each regarded as having appeared 0.5 times in the sentence. The selection unit 13 obtains the appearance frequency of each word in each text and totals these frequencies for each word to obtain the total appearance frequency of each word.
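The arithmetic of this example can be checked with a short script. This is a sketch under assumptions, not the patented implementation: the function name `expected_counts`, the 0-based list layout (b[i] is the boundary value in front of character i, b[len(text)] the value at the end), and the `max_len` cap on word length are all introduced here for illustration.

```python
from collections import defaultdict
from math import prod


def expected_counts(text: str, b: list[float], max_len: int = 8) -> dict[str, float]:
    """Fractional appearance frequency of every substring of one text."""
    counts: dict[str, float] = defaultdict(float)
    n = len(text)
    for i in range(n):
        if b[i] == 0.0:
            continue  # a word cannot start at a "non-division" position
        for j in range(i + 1, min(n, i + max_len) + 1):
            # Every position strictly inside the word must be "not divided".
            inner = prod(1.0 - b[k] for k in range(i + 1, j))
            if inner == 0.0:
                break  # a full division inside rules out all longer words
            weight = b[i] * inner * b[j]
            if weight > 0.0:
                counts[text[i:j]] += weight
    return dict(counts)


# Boundary values for "ボールペンを" as far as they are stated for FIG. 3.
b = [1.0, 0.0, 0.0, 0.5, 0.0, 1.0, 1.0]
print(expected_counts("ボールペンを", b))
```

Summing the returned dictionaries over all texts gives the total appearance frequency per word.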

続いて、選択部１３は大規模テキスト４０内の単語群から、合計出現頻度が第１の閾値ＴＨａ以上である単語のみを登録候補Ｖとして選択する（頻度による単語の足切り）。そして、選択部１３は最終的に単語辞書３０に登録する単語をその登録候補Ｖの中から選択し、必要に応じてその単語を格納する辞書（データベース）を決定する。最終的に登録する単語及び格納先の辞書の決定方法は一つに限定されるものではなく、下記の通り様々な手法を用いうる。 Subsequently, the selection unit 13 selects only words whose total appearance frequency is equal to or higher than the first threshold THa from the word group in the large-scale text 40 as a registration candidate V (word truncation based on frequency). Then, the selection unit 13 selects a word to be finally registered in the word dictionary 30 from the registration candidates V, and determines a dictionary (database) for storing the word as necessary. The method of determining the word to be finally registered and the dictionary of the storage destination is not limited to one, and various methods can be used as described below.

The selection unit 13 may decide to add to the existing dictionary 31 only those registration candidates V whose total appearance frequency is equal to or higher than a predetermined threshold. In this case, the selection unit 13 may select only words whose total appearance frequency is equal to or higher than a second threshold THb (where THb > THa), or only the words ranked in the top n by total appearance frequency. Hereinafter, such processing is also referred to as "APPEND".

Alternatively, the selection unit 13 may decide to register in the additional dictionary 32 only those registration candidates V whose total appearance frequency is equal to or higher than a predetermined threshold. In this case as well, the selection unit 13 may select only words whose total appearance frequency is equal to or higher than the second threshold THb (where THb > THa), or only the words ranked in the top n by total appearance frequency. Hereinafter, such processing is also referred to as "TOP".

あるいは、選択部１３は、登録候補Ｖのすべてを追加辞書３２に登録すると決定してもよい。以下では、このような処理を「ＡＬＬ」ともいう。 Alternatively, the selection unit 13 may determine that all the registration candidates V are registered in the additional dictionary 32. Hereinafter, such processing is also referred to as “ALL”.

Alternatively, the selection unit 13 may decide to divide the registration candidates V into a plurality of subsets according to total appearance frequency and register each subset in a separate additional dictionary 32. Let V_n denote the subset of the registration candidates V consisting of the words ranked in the top n by total appearance frequency. In this case, the selection unit 13 generates, for example, a subset V_1000 of the top 1,000 words, a subset V_2000 of the top 2,000 words, and a subset V_3000 of the top 3,000 words. The selection unit 13 then decides to register the subsets V_1000, V_2000, and V_3000 in a first additional dictionary 32, a second additional dictionary 32, and a third additional dictionary 32, respectively. The number of subsets to generate and the size of each subset may be set arbitrarily. Hereinafter, such processing is also referred to as "MULTI".
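The four strategies described above (APPEND, TOP, ALL, MULTI) can be compared side by side in a small sketch. The function name `select_words`, the dictionary labels such as "existing" and "additional_1", and the tie-breaking by word are hypothetical choices made for this illustration. Only the top-n variant of APPEND and TOP is shown.

```python
def select_words(
    freq: dict[str, float],
    th_a: float,
    mode: str,
    n: int = 1000,
    sizes: tuple[int, ...] = (1000, 2000, 3000),
) -> dict[str, list[str]]:
    """Map each target dictionary label to the words to be registered in it."""
    # Frequency cutoff with the first threshold THa, then rank by frequency.
    candidates = sorted(
        (w for w, f in freq.items() if f >= th_a),
        key=lambda w: (-freq[w], w),
    )
    if mode == "APPEND":   # top n words go into the existing dictionary
        return {"existing": candidates[:n]}
    if mode == "TOP":      # top n words go into one additional dictionary
        return {"additional_1": candidates[:n]}
    if mode == "ALL":      # every candidate goes into one additional dictionary
        return {"additional_1": candidates}
    if mode == "MULTI":    # nested top-k subsets, one additional dictionary each
        return {
            f"additional_{k}": candidates[:size]
            for k, size in enumerate(sizes, start=1)
        }
    raise ValueError(f"unknown mode: {mode}")


freq = {"a": 10.0, "b": 5.0, "c": 3.0, "d": 1.0}
print(select_words(freq, th_a=2.0, mode="MULTI", sizes=(1, 2)))
```

In the MULTI case the subsets are nested, as in the text: the top-1,000 subset is contained in the top-2,000 subset, and so on.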

最終的に登録する単語を選択するとともに格納先の辞書を決定すると、選択部１３はその選択結果を登録部１４に出力する。 When the word to be registered is finally selected and the storage destination dictionary is determined, the selection unit 13 outputs the selection result to the registration unit 14.

登録部１４は、選択部１３により選択された単語を単語辞書３０に登録する手段である。単語辞書３０のうちどの辞書に単語を登録するかは選択部１３での処理に依存するので、登録部１４は既存辞書３１にのみ単語を登録するかもしれないし、一つの追加辞書３２にのみ単語を登録するかもしれない。上記の「ＭＵＬＴＩ」処理の場合には、登録部１４は選択された単語を複数の追加辞書３２に分けて登録する。 The registration unit 14 is means for registering the word selected by the selection unit 13 in the word dictionary 30. Which dictionary to register the word in the word dictionary 30 depends on the processing in the selection unit 13, so the registration unit 14 may register the word only in the existing dictionary 31, or the word only in one additional dictionary 32. May register. In the case of the above “MULTI” process, the registration unit 14 divides the selected word into a plurality of additional dictionaries 32 and registers them.

上述したように、単語辞書３０に追加された単語は単語分割モデルの修正に用いられるが、単語辞書３０を単語分割以外の目的で用いてもよい。例えば、形態素解析や、自動入力機能を備える入力ボックスにおける入力候補語句の表示や、固有名詞を抽出するための知識データベースなどのために単語辞書３０を用いてもよい。 As described above, the word added to the word dictionary 30 is used for correcting the word division model, but the word dictionary 30 may be used for purposes other than word division. For example, the word dictionary 30 may be used for morphological analysis, display of input candidate words in an input box having an automatic input function, a knowledge database for extracting proper nouns, and the like.

次に、図４を用いて、辞書生成装置１０の動作を説明するとともに本実施形態に係る辞書生成方法について説明する。 Next, the operation of the dictionary generation device 10 will be described with reference to FIG. 4 and the dictionary generation method according to the present embodiment will be described.

まず、モデル生成部１１が、学習コーパス２０及び既存辞書３１に基づく学習をＳＶＭに実行させることで最初の単語分割モデル（ベースライン・モデル）を生成する（ステップＳ１１、モデル生成ステップ）。続いて、解析部１２がそのベースライン・モデルが組み込まれた解析（単語分割）を大規模テキスト４０に対して実行して、「分割」、「半分割」、又は「非分割」を示す境界情報を各テキストに付与する（関連付ける）（ステップＳ１２、解析ステップ）。 First, the model generation unit 11 generates an initial word division model (baseline model) by causing the SVM to perform learning based on the learning corpus 20 and the existing dictionary 31 (step S11, model generation step). Subsequently, the analysis unit 12 performs an analysis (word division) in which the baseline model is incorporated on the large-scale text 40, and indicates a boundary indicating “division”, “half-division”, or “non-division”. Information is assigned (associated) to each text (step S12, analysis step).

続いて、選択部１３が、辞書に登録する単語を選択する（選択ステップ）。具体的には、選択部１３は境界情報付きのテキストに基づいて各単語の合計出現頻度を算出し（ステップＳ１３）、その頻度が所定の閾値以上である単語を登録候補として選択する（ステップＳ１４）。そして、選択部１３は最終的に辞書に登録する単語を登録候補から選択すると共に、単語を登録する辞書を決定する（ステップＳ１５）。選択部１３は上記のＡＰＰＥＮＤ，ＴＯＰ，ＡＬＬ，ＭＵＬＴＩなどの手法を用いて、単語を選択し辞書を指定することができる。 Subsequently, the selection unit 13 selects a word to be registered in the dictionary (selection step). Specifically, the selection unit 13 calculates the total appearance frequency of each word based on the text with boundary information (step S13), and selects a word whose frequency is a predetermined threshold or more as a registration candidate (step S14). ). Then, the selection unit 13 selects a word to be finally registered in the dictionary from registration candidates and determines a dictionary in which the word is registered (step S15). The selection unit 13 can select a word and designate a dictionary by using the above-described techniques such as APPEND, TOP, ALL, and MULTI.

続いて、登録部１４が選択部１３での処理に基づいて、選択した単語を指定の辞書に登録する（ステップＳ１６、登録ステップ）。 Subsequently, the registration unit 14 registers the selected word in the designated dictionary based on the processing in the selection unit 13 (step S16, registration step).

以上の処理により、単語辞書３０への単語の追加が完了する。本実施形態では、拡張された単語辞書３０を用いて単語分割モデルが修正される。すなわち、モデル生成部１１が、学習コーパス２０と単語辞書３０の全体とに基づく再学習により、修正された単語分割モデルを生成する（ステップＳ１７）。 With the above processing, the addition of words to the word dictionary 30 is completed. In the present embodiment, the word division model is corrected using the expanded word dictionary 30. That is, the model generation unit 11 generates a corrected word division model by relearning based on the learning corpus 20 and the entire word dictionary 30 (step S17).

次に、図５を用いて、コンピュータを辞書生成装置１０として機能させるための辞書生成プログラムＰ１を説明する。 Next, a dictionary generation program P1 for causing a computer to function as the dictionary generation device 10 will be described with reference to FIG.

辞書生成プログラムＰ１は、メインモジュールＰ１０、モデル生成モジュールＰ１１、解析モジュールＰ１２、選択モジュールＰ１３、及び登録モジュールＰ１４を備えている。 The dictionary generation program P1 includes a main module P10, a model generation module P11, an analysis module P12, a selection module P13, and a registration module P14.

The main module P10 is the part that controls the dictionary generation function as a whole. The functions realized by executing the model generation module P11, the analysis module P12, the selection module P13, and the registration module P14 are the same as the functions of the model generation unit 11, the analysis unit 12, the selection unit 13, and the registration unit 14 described above, respectively.

辞書生成プログラムＰ１は、例えば、ＣＤ−ＲＯＭやＤＶＤ−ＲＯＭ、半導体メモリ等の有形の記録媒体に固定的に記録された上で提供される。また、辞書生成プログラムＰ１は、搬送波に重畳されたデータ信号として通信ネットワークを介して提供されてもよい。 The dictionary generation program P1 is provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, DVD-ROM, or semiconductor memory. Further, the dictionary generation program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.

As described above, according to the present embodiment, a word division model is generated using the learning corpus 20, to which boundary information is given, and the existing dictionary 31, and word division incorporating that model is applied to the large-scale text 40. Words are then selected from the text set to which boundary information has been given by this application and are registered in the word dictionary 30. By giving boundary information to the text set through analysis using the learning corpus 20 and then registering the words extracted from that text set, a large-scale word dictionary 30 can be constructed easily.

For example, "スマホケース" (sumahoke-su, "smartphone case") is divided into "スマホ" (sumaho) and "ケース" (ke-su, "case"), so that "スマホ", which had been an unknown word, can be registered in the dictionary. Note that "スマホ" is an abbreviation of the Japanese word "スマートフォン" (suma-tofon, "smartphone"). The expression "うっとろりん" (uttororin), an unknown word corresponding to the Japanese "うっとり" (uttori, "fascinated"), can also be registered in the dictionary. Performing text analysis with the constructed dictionary then allows word division of sentences containing the registered words (for example, sentences containing "スマホ" or "うっとろりん") to be executed more accurately.

Next, an example of evaluating the word division performance of the dictionary generation device 10 in the present embodiment is shown. Precision (Prec), recall (Rec), and the F value were used as the evaluation indices. Letting N_REF be the total number of words in the correct corpus, N_SYS the total number of words in the analysis result, and N_COR the total number of words included in both the analysis result and the correct corpus, the three indices are defined as follows.
Prec = N_COR / N_SYS
Rec = N_COR / N_REF
F = 2 · Prec · Rec / (Prec + Rec)
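The three indices can be computed from a reference segmentation and a system segmentation of the same text. The sketch below assumes that a word counts toward N_COR only when its character span matches exactly in both segmentations, which is a common convention but is not spelled out in the text. The names `spans` and `prf` are hypothetical.

```python
def spans(words: list[str]) -> set[tuple[int, int]]:
    """Character spans (start, end) covered by each word of a segmentation."""
    out: set[tuple[int, int]] = set()
    start = 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def prf(reference: list[str], system: list[str]) -> tuple[float, float, float]:
    """Return (Prec, Rec, F) for one pair of segmentations."""
    ref, hyp = spans(reference), spans(system)
    n_cor = len(ref & hyp)                     # N_COR
    prec = n_cor / len(hyp) if hyp else 0.0    # N_COR / N_SYS
    rec = n_cor / len(ref) if ref else 0.0     # N_COR / N_REF
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f


print(prf(["ボールペン", "を", "買っ", "た"], ["ボール", "ペン", "を", "買った"]))
```

For a whole corpus, N_COR, N_SYS, and N_REF would be summed over all texts before the ratios are taken.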

The headword list of UniDic (304,267 distinct words) was used as the existing dictionary, and LIBLINEAR with default parameters was used as the support vector machine. All half-width characters in the learning corpus and the large-scale text were converted to full-width characters, but no further normalization was performed.

まず、学習コーパス及び大規模テキストが同じ分野である場合（同一分野の学習）の有効性について説明する。ここで、分野とは、文体、内容（ジャンル）などに基づいて文及び単語をグループ化するための概念である。同一分野の学習では、仮想商店街Ａのウェブサイトからジャンルの偏り無くランダムに抽出した５９０商品のタイトルおよび説明文と、仮想商店街Ｂのウェブサイトからランダムに抽出した５０商品の説明文とから３段階単語分割の学習コーパスを作成した。この学習コーパスの単語数は約１１万であり、文字数は約３４万であった。この学習コーパスを用いて性能を評価した。 First, the effectiveness when the learning corpus and the large-scale text are in the same field (learning in the same field) will be described. Here, the field is a concept for grouping sentences and words based on style, contents (genre), and the like. In learning in the same field, from the title and description of 590 products randomly extracted from the website of the virtual shopping mall A without genre bias, and the description of 50 products randomly extracted from the website of the virtual shopping mall B A learning corpus with three-level word division was created. The number of words in this learning corpus was about 110,000 and the number of characters was about 340,000. The performance was evaluated using this learning corpus.

大規模テキストとして、上記仮想商店街Ａ内の全商品データのタイトルおよび説明文を用いた。商品数は約２７００万であり、文字数は約１６０億であった。 As large-scale texts, the titles and explanations of all product data in the virtual shopping street A were used. The number of products was about 27 million and the number of characters was about 16 billion.

When this large-scale text was analyzed with the baseline model and two-stage word division was executed, 576,954 distinct words were extracted; when three-stage word division was executed after the analysis, 603,187 distinct words were extracted. The frequency threshold used for the word cutoff was 20. When "MULTI" was adopted, the top 100,000 words, the top 200,000 words, the top 300,000 words, the top 400,000 words, and all words by total appearance frequency were added as separate dictionaries. When "TOP" was adopted, only the top 100,000 words were used.

Table 1 shows the learning result of the baseline model, the result of relearning using the word dictionary obtained by two-stage word division, and the result of relearning using the word dictionary obtained by three-stage word division. All values in Table 1 are percentages (%).

When relearning was performed using two-stage word division, the F value improved regardless of which method (APPEND, TOP, ALL, or MULTI) was used to add words, which shows that the proposed learning using large-scale text is effective. The increase in the F value grew in the order APPEND < TOP < ALL < MULTI. This result shows that, when adding words, adding them to a separate dictionary is more effective than adding them to the existing dictionary, and that adding them to separate dictionaries according to appearance frequency is more effective than registering them in a single additional dictionary.

表１より、分類器が単語の出現頻度に応じて異なる貢献度及び重みを自動的に学習していると考えられる。さらに、３段階単語分割を使って再学習した場合には、すべての場合においてベースライン・モデルおよび２段階単語分割よりも性能が向上した。具体的には、半分割を考慮することにより、接辞を伴う単語を正確に獲得するなどの改善が得られた。 From Table 1, it can be considered that the classifier automatically learns different contributions and weights depending on the appearance frequency of words. Furthermore, when re-learning using three-stage word division, performance improved in all cases over the baseline model and two-stage word division. Specifically, by taking into account half-division, improvements such as accurately acquiring words with affixes were obtained.

次に、学習コーパスと大規模テキストとが異なる分野である場合の有効性について説明する。用いた学習コーパスは、上記同一分野の学習におけるものと同じとした。一方、大規模テキストは、旅行予約サイトＣ内のユーザレビュー、宿泊施設名、宿泊プラン名、及び宿泊施設からの返答を用いた。テキスト数は３４８，５６４であり、その文字数は約１億２６００万であった。この大規模テキストのうち、１５０件及び５０件のレビューをランダムに抽出して人手による単語分割を行い、それぞれテストコーパス及び能動学習用コーパス（学習コーパスに対する追加分）として用いた。 Next, the effectiveness when the learning corpus and the large-scale text are different fields will be described. The learning corpus used was the same as that used for learning in the same field. On the other hand, the large-scale text used a user review in the travel reservation site C, an accommodation facility name, an accommodation plan name, and a response from the accommodation facility. The number of texts was 348,564, and the number of characters was about 126 million. Of this large-scale text, 150 and 50 reviews were randomly extracted and manually divided into words, which were used as a test corpus and an active learning corpus (additions to the learning corpus), respectively.

まず、上記の商品分野の学習コーパスから学習したベースライン・モデルを用いて旅行分野の大規模テキストを解析した。この解析性能が下記表２の「ベースライン」である。 First, a large-scale text in the travel field was analyzed using the baseline model learned from the above-mentioned learning corpus in the product field. This analysis performance is the “baseline” in Table 2 below.

Next, a corpus for field adaptation was added to the learning corpus of the product field to learn a word division model, which was then used to analyze the large-scale text. This analysis performance is "field adaptation" in Table 2 below. After the large-scale text was analyzed, 41,671 distinct words were extracted with two-stage word division and 44,247 distinct words with three-stage word division. In both cases, only words with a total appearance frequency of 5 or more were used.

Table 2 shows the results of adding these words to the dictionary and relearning the model using the learning corpus and the field adaptation corpus. All values in Table 2 are percentages (%).

As can be seen from this table, when the learning corpus and the large-scale text belong to different fields, performance improved in the case of three-stage word division.

以上、本発明をその実施形態に基づいて詳細に説明した。しかし、本発明は上記実施形態に限定されるものではない。本発明は、その要旨を逸脱しない範囲で様々な変形が可能である。 The present invention has been described in detail based on the embodiments. However, the present invention is not limited to the above embodiment. The present invention can be variously modified without departing from the gist thereof.

上記実施形態では選択部１３が出現頻度に基づいて単語を選択したが、選択部１３は、この出現頻度を参照することなく、すべての単語を既存辞書３１又は追加辞書３２に登録してもよい。また、単語の足切りは必須の処理ではない。 In the above embodiment, the selection unit 13 selects a word based on the appearance frequency. However, the selection unit 13 may register all the words in the existing dictionary 31 or the additional dictionary 32 without referring to the appearance frequency. . Also, word truncation is not an essential process.

上記実施形態では解析部１２が大規模テキスト４０の全体を解析した後に選択部１３及び登録部１４による処理が行われたが、解析部１２は収集された大量のテキストを複数回に分けて解析してもよい。この場合には、モデル生成ステップ、解析ステップ、選択ステップ、及び登録ステップから成る一連の処理が複数回繰り返される。例えば、大規模テキスト４０をグループ１〜３に分けた場合には、１ループ目の処理でグループ１が解析されて単語が登録され、２ループ目の処理でグループ２が解析されて単語が更に登録され、３ループ目の処理でグループ３が解析されて単語が更に登録される。２ループ目以降の処理では、モデル生成部１１は単語辞書３０の全体を参照して、修正された単語分割モデルを生成する。 In the above embodiment, the processing by the selection unit 13 and the registration unit 14 is performed after the analysis unit 12 analyzes the entire large-scale text 40. However, the analysis unit 12 analyzes a large amount of collected text in multiple times. May be. In this case, a series of processes including a model generation step, an analysis step, a selection step, and a registration step are repeated a plurality of times. For example, when the large-scale text 40 is divided into groups 1 to 3, the group 1 is analyzed and the word is registered in the first loop process, and the group 2 is analyzed in the second loop process to further add the word. Registered, group 3 is analyzed in the process of the third loop, and further words are registered. In the processing after the second loop, the model generation unit 11 refers to the entire word dictionary 30 and generates a corrected word division model.
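The repeated processing over groups of text can be outlined as a loop. This is a structural sketch only: `train`, `analyze`, and `select` are hypothetical callables standing in for the model generation, analysis, and selection steps, and the toy stand-ins in the usage lines do not perform real word division.

```python
from typing import Callable, Iterable, Sequence


def build_dictionary(
    groups: Sequence[Sequence[str]],
    existing_words: Iterable[str],
    train: Callable[[set], object],
    analyze: Callable[[object, str], list],
    select: Callable[[list], Iterable[str]],
) -> set:
    """Run model generation, analysis, selection, and registration per group."""
    dictionary = set(existing_words)
    for group in groups:
        # Model generation step: from the second loop on, this uses the
        # whole dictionary, including words registered in earlier loops.
        model = train(dictionary)
        # Analysis step: give boundary information to each text of the group.
        tagged = [analyze(model, text) for text in group]
        # Selection and registration steps.
        dictionary |= set(select(tagged))
    return dictionary


# Toy stand-ins: the "model" is the known vocabulary, and "analysis"
# reports whitespace-separated tokens that the model does not know yet.
result = build_dictionary(
    groups=[["a b"], ["b c"]],
    existing_words={"a"},
    train=lambda d: frozenset(d),
    analyze=lambda model, text: [w for w in text.split() if w not in model],
    select=lambda tagged: [w for words in tagged for w in words],
)
print(sorted(result))
```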

上記実施形態では３段階分割の手法を用いたので境界情報は３種類であったが、境界情報の態様はこの例に限定されない。例えば、「分割」「非分割」という２種類の境界情報のみを用いて２段階の単語分割を行ってもよい。また、「分割」「非分割」と、複数種類の確率的分割とを用いて、４段階以上の単語分割を行ってもよい。例えば、ｂ_ｉ＝０．３３とｂ_ｉ＝０．６７という確率的分割（第３の情報）を用いた４段階の単語分割を行ってもよい。いずれにしても、第３の情報に相当する分割の強度は、境界情報が「非分割」の場合の強度（例えばｂ_ｉ＝０）より大きく、境界情報が「分割」の場合の強度（例えばｂ_ｉ＝１）より小さい。In the above embodiment, since the three-stage division method is used, there are three types of boundary information. However, the mode of the boundary information is not limited to this example. For example, two-stage word division may be performed using only two types of boundary information “division” and “non-division”. Furthermore, word division may be performed in four or more stages using “division”, “non-division”, and a plurality of types of probabilistic division. For example, four-stage word division may be performed using probabilistic division (third information) of b _i = 0.33 and b _i = 0.67. In any case, the strength of the division corresponding to the third information is larger than the strength when the boundary information is “non-divided” (for example, b _i = 0), and the strength when the boundary information is “divided” (for example, smaller than b _i = 1).

本実施形態によれば、大規模な単語辞書を容易に構築することができる。 According to this embodiment, a large-scale word dictionary can be easily constructed.

10 ... dictionary generation device, 11 ... model generation unit, 12 ... analysis unit, 13 ... selection unit, 14 ... registration unit, 20 ... learning corpus, 30 ... word dictionary, 31 ... existing dictionary (word group), 32 ... additional dictionary, 40 ... large-scale text (set of collected texts), P1 ... dictionary generation program, P10 ... main module, P11 ... model generation module, P12 ... analysis module, P13 ... selection module, P14 ... registration module.

Claims

A dictionary generation device comprising:
a model generation unit that generates a word division model using a corpus and a word group prepared in advance, wherein each text included in the corpus is given boundary information indicating a word boundary, the boundary information including first information indicating that no boundary exists at an inter-character position, second information indicating that a boundary exists at the inter-character position, and third information indicating probabilistically that a boundary exists at the inter-character position;
an analysis unit that executes, on a set of collected texts, word division in which the word division model is incorporated, and gives the boundary information to each text;
a selection unit that selects a word to be registered in a dictionary from the text to which the boundary information has been given by the analysis unit; and
a registration unit that registers the word selected by the selection unit in the dictionary.

The selection unit selects a word to be registered in the dictionary based on the appearance frequency of each word calculated from the boundary information given by the analysis unit.
The dictionary generation device according to claim 1.

The selection unit selects a word having the appearance frequency equal to or higher than a predetermined threshold;
The dictionary generation device according to claim 2.

The selection unit extracts words having the appearance frequency equal to or higher than the threshold as registration candidates, and selects a predetermined number of words from the registration candidates in order from the word having the highest appearance frequency,
The registration unit adds the word selected by the selection unit to a dictionary in which the word group is recorded;
The dictionary generation device according to claim 3.

The dictionary generation device according to claim 3, wherein
the selection unit extracts words whose appearance frequency is equal to or higher than the threshold as registration candidates, and selects a predetermined number of words from the registration candidates in descending order of appearance frequency, and
the registration unit registers the words selected by the selection unit in a dictionary different from the dictionary in which the word group is recorded.

The dictionary generation device according to claim 3, wherein the registration unit registers the word selected by the selection unit in a dictionary different from the dictionary in which the word group is recorded.

The dictionary generation device according to claim 3, wherein
the selection unit extracts words whose appearance frequency is equal to or higher than the threshold as registration candidates, and groups the registration candidate words according to the level of their appearance frequency, and
the registration unit individually registers the plurality of groups generated by the selection unit in a plurality of dictionaries different from the dictionary in which the word group is recorded.
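One possible reading of the grouping step is to split the registration candidates into frequency bands and keep one dictionary per band. The sketch below assumes fixed band edges given in descending order; the function name, the edges, and the sample data are hypothetical, and the claim does not say how the groups are delimited.

```python
def group_by_frequency(frequencies, threshold, band_edges):
    """Split candidates into len(band_edges) + 1 groups by frequency.

    `band_edges` are lower bounds in descending order, e.g. [10.0, 5.0]:
    group 0 holds words with frequency >= 10, group 1 holds >= 5,
    and the last group holds the remaining candidates (>= threshold).
    """
    groups = [[] for _ in range(len(band_edges) + 1)]
    for word, freq in sorted(frequencies.items()):
        if freq < threshold:
            continue  # not a registration candidate
        index = next(
            (i for i, edge in enumerate(band_edges) if freq >= edge),
            len(band_edges),
        )
        groups[index].append(word)
    return groups


frequencies = {"a": 12.0, "b": 7.0, "c": 2.0, "d": 0.5}
groups = group_by_frequency(frequencies, threshold=1.0, band_edges=[10.0, 5.0])
# Each group would then be written to its own dictionary.
dictionaries = {f"dict_{i}": set(words) for i, words in enumerate(groups)}
```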

The dictionary generation device according to claim 3, wherein
each of the collected texts is associated with information indicating the field of the text, and
the registration unit individually registers each word selected by the selection unit in a dictionary prepared for each field, based on the field of the text in which the word was included.
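Field-specific registration can be sketched by remembering, for each word occurrence, the field of the text it came from. The `(word, field)` pair format, the function name, and the field labels below are assumptions for illustration only.

```python
from collections import defaultdict


def register_by_field(selected_words, occurrences):
    """Build one dictionary (a set of words) per field.

    `occurrences` is an iterable of (word, field) pairs, one per place
    a word was found in a collected text tagged with that field.
    """
    wanted = set(selected_words)
    dictionaries = defaultdict(set)
    for word, field in occurrences:
        if word in wanted:
            dictionaries[field].add(word)
    return dict(dictionaries)


occurrences = [
    ("bluetooth", "electronics"),
    ("linen", "fashion"),
    ("bluetooth", "automotive"),
    ("xq", "fashion"),  # not selected, so never registered
]
per_field = register_by_field(["bluetooth", "linen"], occurrences)
```

A word found in texts from several fields ends up in each of those fields' dictionaries under this reading.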

The dictionary generation device according to any one of claims 2 to 8, wherein the appearance frequency of each word is calculated based on the first, second, and third information.
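One common way to derive a frequency from all three kinds of boundary information is to treat each value as a boundary probability and count every substring with the probability that it is delimited as a single word. The sketch below assumes that approach (it is an expected-count formulation, not necessarily the exact formula of this patent), and assumes the values `0.0`, `1.0`, and an intermediate probability for the first, second, and third information. The function name and `max_len` parameter are hypothetical.

```python
from collections import defaultdict


def expected_frequencies(text, boundaries, max_len=4):
    """Expected count of each substring being segmented as one word."""
    n = len(text)
    if len(boundaries) != max(n - 1, 0):
        raise ValueError("need exactly len(text) - 1 boundary values")
    # p[k] is the boundary probability just before character k;
    # the two ends of the text are treated as certain boundaries.
    p = [1.0] + list(boundaries) + [1.0]
    freq = defaultdict(float)
    for start in range(n):
        inside = 1.0  # probability of no boundary strictly inside the span
        for end in range(start + 1, min(n, start + max_len) + 1):
            if end - start > 1:
                inside *= 1.0 - p[end - 1]
            weight = p[start] * inside * p[end]
            if weight > 0.0:
                freq[text[start:end]] += weight
    return dict(freq)


# "a|b?c": a certain boundary after "a", a 50% boundary after "b".
result = expected_frequencies("abc", (1.0, 0.5))
```

In this example `a` counts fully, while `b`, `c`, and `bc` each receive half a count because the boundary between `b` and `c` is uncertain.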

The dictionary generation device according to claim 9, wherein
the analysis unit includes a first binary classifier and a second binary classifier,
the first binary classifier determines, for each inter-character position, whether to assign the first information or information other than the first information, and
the second binary classifier determines, for each inter-character position for which the first binary classifier has determined that information other than the first information is to be assigned, whether to assign the second information or the third information.
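The two-stage decision can be sketched as a cascade in which the second classifier is consulted only when the first does not assign the first information. Both classifiers are passed in as plain callables here; the score-threshold stand-ins and the value `0.5` used for the third information are assumptions for illustration, not details taken from the claim.

```python
def cascade_label(features, is_non_boundary, is_certain_boundary,
                  probabilistic_value=0.5):
    """Label each inter-character position with a two-stage cascade."""
    labels = []
    for feature in features:
        if is_non_boundary(feature):        # first binary classifier
            labels.append(0.0)              # first information
        elif is_certain_boundary(feature):  # second binary classifier
            labels.append(1.0)              # second information
        else:
            labels.append(probabilistic_value)  # third information
    return labels


# Toy stand-ins: each "feature" is just a boundary score in [0, 1].
scores = [0.1, 0.9, 0.5]
labels = cascade_label(
    scores,
    is_non_boundary=lambda s: s < 0.3,
    is_certain_boundary=lambda s: s > 0.7,
)
```

In practice the two callables would be trained classifiers; the cascade structure stays the same.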

The dictionary generation device according to any one of claims 1 to 10, wherein
the set of collected texts is divided into a plurality of groups, and
after the analysis unit, the selection unit, and the registration unit execute processing based on one of the plurality of groups, the model generation unit generates the word division model using the corpus, the word group, and the words registered by the registration unit, and subsequently the analysis unit, the selection unit, and the registration unit execute processing based on another one of the plurality of groups.
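The group-by-group processing amounts to a loop in which the model is regenerated with the words registered so far before the next group of texts is analyzed. The sketch below passes the three processing steps in as callables; `train`, `analyze`, and `select` and their toy implementations are hypothetical stand-ins, not the units defined in the claims.

```python
def build_dictionary_incrementally(corpus, word_group, text_groups,
                                   train, analyze, select):
    """Process text groups one at a time, retraining between groups."""
    registered = set()
    for texts in text_groups:
        # The model always sees the word group plus words registered so far.
        model = train(corpus, set(word_group) | registered)
        annotated = analyze(model, texts)
        registered |= set(select(annotated))
    return registered


# Toy stand-ins: the "model" is just its vocabulary, analysis splits on
# spaces, and selection keeps tokens the model did not know.
def train(corpus, vocabulary):
    return set(vocabulary)

def analyze(model, texts):
    return [(tok, tok in model) for text in texts for tok in text.split()]

def select(annotated):
    return [tok for tok, known in annotated if not known]


registered = build_dictionary_incrementally(
    corpus=[], word_group={"new"},
    text_groups=[["new phone"], ["phone case"]],
    train=train, analyze=analyze, select=select,
)
```

In the second pass `phone` is already known to the retrained model, so only `case` is newly selected.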

A dictionary generation method executed by a dictionary generation device, the method comprising:
a model generation step of generating a word division model using a corpus and a word group prepared in advance, wherein each text included in the corpus is given boundary information indicating a word boundary, and the boundary information includes first information indicating that the boundary does not exist at an inter-character position, second information indicating that the boundary exists at the inter-character position, and third information indicating a probability that the boundary exists at the inter-character position;
an analysis step of executing, on a set of collected texts, word division incorporating the word division model, and giving the boundary information to each text;
a selection step of selecting a word to be registered in a dictionary from the text to which the boundary information has been given in the analysis step; and
a registration step of registering the word selected in the selection step in the dictionary.

A dictionary generation program that causes a computer to execute:
a model generation unit that generates a word division model using a corpus and a word group prepared in advance, wherein each text included in the corpus is given boundary information indicating a word boundary, and the boundary information includes first information indicating that the boundary does not exist at an inter-character position, second information indicating that the boundary exists at the inter-character position, and third information indicating a probability that the boundary exists at the inter-character position;
an analysis unit that executes, on a set of collected texts, word division incorporating the word division model, and gives the boundary information to each text;
a selection unit that selects a word to be registered in a dictionary from the text to which the boundary information has been given by the analysis unit; and
a registration unit that registers the word selected by the selection unit in the dictionary.

A computer-readable recording medium that stores a dictionary generation program, the program causing a computer to execute:
a model generation unit that generates a word division model using a corpus and a word group prepared in advance, wherein each text included in the corpus is given boundary information indicating a word boundary, and the boundary information includes first information indicating that the boundary does not exist at an inter-character position, second information indicating that the boundary exists at the inter-character position, and third information indicating a probability that the boundary exists at the inter-character position;
an analysis unit that executes, on a set of collected texts, word division incorporating the word division model, and gives the boundary information to each text;
a selection unit that selects a word to be registered in a dictionary from the text to which the boundary information has been given by the analysis unit; and
a registration unit that registers the word selected by the selection unit in the dictionary.