JP2014120007A

JP2014120007A - Dictionary registration device, word division device, dictionary registration method, word division method and program

Info

Publication number: JP2014120007A
Application number: JP2012275198A
Authority: JP
Inventors: Manabu Satsusano; 学颯々野
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2012-12-18
Filing date: 2012-12-18
Publication date: 2014-06-30
Anticipated expiration: 2032-12-18
Also published as: JP5693552B2

Abstract

PROBLEM TO BE SOLVED: To provide a word division dictionary that is more accurate than conventional dictionaries. SOLUTION: A more accurate word division dictionary is obtained by a dictionary registration device that includes: a word division dictionary capable of storing one or more words and one or more pieces of division information, each being a set of a word and two or more divided words; a first division result acquisition unit that acquires a first division result from one word division device; an other division result acquisition unit that acquires two or more other division results from two or more other word division devices; a difference portion acquisition unit that acquires one or more difference portions between the first division result and the two or more other division results; a division information acquisition unit that, when the one or more difference portions satisfy a predetermined condition, constructs division information including a word that is the character string corresponding to a difference portion and the two or more words that constitute that difference portion; and a dictionary registration unit that accumulates the division information in the word division dictionary.

Description

The present invention relates to a dictionary registration device and the like that register information in a word division dictionary used by a word division device that divides a sentence into two or more words.

Conventionally, there has been a dictionary creation system that absorbs differences in the description formats of various input data and registers the data in common under each item of a dictionary (see Patent Document 1).

There has also been a machine learning system that can improve accuracy by removing unnecessary data from learning data (see Patent Document 2).

JP 2006-215823 A (page 1, FIG. 1, etc.)
JP 2005-181928 A (page 1, FIG. 1, etc.)

However, with the conventional techniques, a highly accurate word division dictionary could not be obtained.

The dictionary registration device of the first invention comprises: a word division dictionary that can store one or more words and one or more pieces of division information, each being a set of a word and two or more divided words obtained by dividing that word; a first division result acquisition unit that acquires a first division result, which is the result of one word division device dividing one sentence; an other division result acquisition unit that acquires two or more other division results, which are the results of two or more other word division devices (word division devices other than the one word division device) dividing the one sentence; a difference portion acquisition unit that acquires one or more difference portions, each being a part that is included in one of the two or more other division results and in which the first division result differs from that other division result; a division information acquisition unit that, when the one or more difference portions acquired by the difference portion acquisition unit satisfy a predetermined condition, uses any of the one or more difference portions to construct division information having a word that is the character string corresponding to the difference portion and the two or more words that constitute the difference portion; and a dictionary registration unit that accumulates the division information in the word division dictionary.

With this configuration, a highly accurate word division dictionary can be obtained.

In the dictionary registration device of the second invention, relative to the first invention, the other division result acquisition unit acquires a second division result and a third division result, which are two other division results; the difference portion acquisition unit acquires a first difference portion, which is a difference portion between the first division result and the second division result, and a second difference portion, which is a difference portion between the first division result and the third division result; and the division information acquisition unit, when the first difference portion and the second difference portion coincide, uses the first difference portion to construct division information having a word that is the character string corresponding to the first difference portion and the two or more words that constitute the first difference portion.

The word division device of the third invention comprises: the dictionary registration device of the first or second invention; a reception unit that receives a sentence having one or more characters; a first division unit that, using the word division dictionary constructed by the dictionary registration device, acquires from the word division dictionary the maximum-length word that matches a character string constituting the sentence received by the reception unit, acquires the two or more divided words corresponding to the acquired word, and thereby acquires a first division result, which is a set of two or more words obtained by dividing the sentence; and an output unit that outputs the first division result.

With this configuration, a sentence can be divided into two or more words at high speed using a highly accurate word division dictionary.

The word division device of the fourth invention is the word division device of the third invention, further comprising the two or more other word division devices.

According to the dictionary registration device of the present invention, a highly accurate word division dictionary can be obtained.

Block diagram of the word division device 1 in Embodiment 1
Flowchart explaining the operation of the word division device 1
Diagram showing the word division dictionary 11
Diagram showing experimental results of the word division device 1
Diagram showing experimental results of the word division device 1
Block diagram of the dictionary registration device 2 in Embodiment 2
Flowchart explaining the operation of the dictionary registration device 2
Flowchart explaining the difference portion acquisition processing
Block diagram of the word division device 3 in Embodiment 3
Flowchart explaining the operation of the word division device 3
Overview of the computer system in the above embodiments
Block diagram of the computer system

Hereinafter, embodiments of a word division device, a dictionary registration device, and the like are described with reference to the drawings. Components given the same reference numerals in the embodiments perform the same operations, so repeated descriptions may be omitted.

(Embodiment 1)
In the present embodiment, a word division device 1 that divides a sentence into two or more words is described.

FIG. 1 is a block diagram of the word division device 1 in the present embodiment. The word division device 1 includes a word division dictionary 11, a reception unit 12, a first division unit 13, and an output unit 14.

The word division dictionary 11 can store one or more words and one or more pieces of division information. A piece of division information is a set of a word and two or more divided words. The divided words are the result of dividing the word. Examples of division information are "自由形式：自由／形式" and "はないか：は／ない／か". In the division information "自由形式：自由／形式", "自由形式" (free format) is the word, and "自由" (free) and "形式" (format) are its divided words. Likewise, in the division information "はないか：は／ない／か", "はないか" is the word, and "は", "ない", and "か" are its divided words. A word may be considered to include any meaningful term, such as a morpheme or a collocation. A divided word is itself also a word.

In the word division dictionary 11, the one or more words and the one or more pieces of division information are preferably held in the same file or the same database. However, they may be held in separate files or separate databases. That is, the specific data structure of the word division dictionary 11 does not matter; it only needs to hold one or more words and one or more pieces of division information.
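Because the data structure is left open, the following is only a minimal sketch of one possible in-memory form. The mapping layout, the names `parse_entry` and `build_dictionary`, and the use of the "word：part／part" notation as an input format are assumptions made for illustration; the check that the divided words concatenate back to the word follows from the statement that divided words are the result of dividing the word.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory form of the word division dictionary:
# word -> list of divided words, or None for a plain word entry.
Dictionary = Dict[str, Optional[List[str]]]


def parse_entry(entry: str) -> Tuple[str, Optional[List[str]]]:
    """Parse 'word：part1／part2' (division information) or 'word' (plain word)."""
    if "：" not in entry:
        return entry, None
    word, parts = entry.split("：", 1)
    divided = parts.split("／")
    # Division information needs two or more divided words, and the
    # divided words must concatenate back to the original word.
    if len(divided) < 2 or "".join(divided) != word:
        raise ValueError(f"invalid division information: {entry}")
    return word, divided


def build_dictionary(entries: List[str]) -> Dictionary:
    """Build the mapping from a list of entries in the notation above."""
    return dict(parse_entry(e) for e in entries)


dictionary = build_dictionary(
    ["自由形式：自由／形式", "はないか：は／ない／か", "自由"]
)
```

Holding words and division information in one mapping corresponds to the "same file or same database" preference; two separate mappings would serve equally well.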

The word division dictionary 11 is preferably a non-volatile recording medium, but can also be realized with a volatile recording medium. How words and division information come to be stored in the word division dictionary 11 does not matter. For example, they may be stored via a recording medium, they may be transmitted over a communication line or the like and then stored, or they may be entered through an input device and then stored.

The reception unit 12 receives a sentence having one or more characters. The sentence may be incomplete; that is, it may be a collocation or the like. The language of the sentence does not matter. The sentence is usually in a language written without spaces between words, such as Japanese, Chinese, Korean, or Mongolian. However, the sentence may also be in a language written with spaces between words, such as English. The sentence may also be, for example, a character string representing a URL or a character string representing a file name. Here, "receiving" is a concept that includes receiving information entered from an input device such as a keyboard, mouse, or touch panel, receiving information transmitted over a wired or wireless communication line, and receiving information read from a recording medium such as an optical disk, magnetic disk, or semiconductor memory.

Any means may be used to input the sentence, such as a keyboard, a mouse, or a menu screen. The reception unit 12 can be realized by a device driver for an input means such as a keyboard, by control software for a menu screen, or the like.

The first division unit 13 divides the sentence received by the reception unit 12 and acquires a first division result, which is a set of two or more words.
More specifically, using the word division dictionary, the first division unit 13 acquires from the dictionary the maximum-length word that matches a character string constituting the sentence received by the reception unit 12, acquires the two or more divided words corresponding to the acquired word, and acquires the first division result, which is a set of two or more words obtained by dividing the sentence. In more detail, the processing is as follows. Using the word division dictionary, the first division unit 13 acquires one or more character strings constituting the sentence received by the reception unit 12. The first division unit 13 then acquires from the word division dictionary the maximum-length word that matches each of those character strings. Then, for each of the one or more words acquired from the word division dictionary, the first division unit 13 acquires the two or more divided words corresponding to that word, and acquires the first division result, which is a set of two or more words obtained by dividing the sentence.

Still more specifically, the first division unit 13 performs processing as follows, for example. First, the first division unit 13 performs a first process of acquiring, from the word division dictionary 11, the word that matches the maximum-length character string starting at the sentence pointer, which initially points to the head of the sentence received by the reception unit 12. If the acquired word has two or more corresponding divided words, the first division unit 13 performs a second process of acquiring those two or more divided words in place of the matching word. The first process and the second process together are called the divided word acquisition process. The first division unit 13 then moves the sentence pointer to the character following the matching word, and repeats the divided word acquisition process up to the word containing the last character of the sentence. As a result, the first division unit 13 obtains the first division result, which is a set of two or more words obtained by dividing the sentence. If the word acquired in the first process is not a word included in any division information, the first division unit 13 keeps that word as it is. Although the first division result is a set of two or more words, it has a data structure from which the boundaries between those words can be determined.
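The divided word acquisition process described above can be sketched as follows. This is an illustrative sketch, not the specification's implementation: the function names `longest_match` and `segment`, the dictionary layout (word mapped to a list of divided words, or `None`), the 0-based pointer, and the single-character fallback when no dictionary word matches are all assumptions, since the text does not say what happens when nothing matches.

```python
from typing import Dict, List, Optional

Dictionary = Dict[str, Optional[List[str]]]


def longest_match(sentence: str, p: int, dictionary: Dictionary) -> Optional[str]:
    """First process: longest dictionary word starting at index p, or None."""
    for end in range(len(sentence), p, -1):
        candidate = sentence[p:end]
        if candidate in dictionary:
            return candidate
    return None


def segment(sentence: str, dictionary: Dictionary) -> List[str]:
    """Repeat the divided word acquisition process over the whole sentence."""
    result: List[str] = []
    p = 0  # 0-based here; the text counts the pointer from 1
    while p < len(sentence):
        word = longest_match(sentence, p, dictionary)
        if word is None:
            # Assumption: emit one character when no dictionary word matches.
            word = sentence[p]
            result.append(word)
        else:
            divided = dictionary[word]
            # Second process: replace the word by its divided words, if any;
            # otherwise keep the word as it is.
            result.extend(divided if divided else [word])
        p += len(word)
    return result


example_dictionary: Dictionary = {
    "正夫": None,
    "は": None,
    "しっかり者": ["しっかり", "者"],
    "だ": None,
}
print(segment("正夫はしっかり者だ", example_dictionary))
# ['正夫', 'は', 'しっかり', '者', 'だ']
```

Returning a list keeps the word boundaries explicit, which is one way to satisfy the requirement that the boundaries in the first division result can be determined.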

The method used for the first process performed by the first division unit 13 does not matter. Known techniques such as the so-called longest match method (also called maximum match) can be used for the first process. The longest match method is described in, for example, Makoto Nagao (ed.), "Natural Language Processing", Iwanami Lectures on Software Science 15, Iwanami Shoten, pp. 126-127.

To find the maximum-length character string from the sentence pointer, the first division unit 13 may proceed as follows. It determines whether a word matching the character string from the character indicated by the sentence pointer to the last character of the sentence (assumed to be the N-th character from the sentence pointer), called character string A, exists in the word division dictionary 11; if it exists, character string A is acquired. If it does not exist, the first division unit 13 determines whether a word matching the character string from the character indicated by the sentence pointer to the (N-1)-th character from the sentence pointer, called character string B, exists in the word division dictionary 11; if it exists, character string B is acquired. If it does not exist, the character string is shortened one character at a time in the same way, and the word division dictionary 11 is searched for the longest word among the character strings that begin with the character indicated by the sentence pointer. In other words, the first division unit 13 may search the word division dictionary 11 in order, starting from the longest unprocessed character string in the sentence and removing one character at a time, to acquire the longest character string from the pointer p.
A known data structure for detecting the maximum-length character string in a sentence is the trie. Tries are described in (1) to (3) below, so a detailed explanation is omitted.
(1) Hiroyuki Tokunaga, "日本語入力を支える技術" (Technology Supporting Japanese Input), pp. 89-99
(2) Internet web page, URL "http://www.slideshare.net/higashiyama/ss-8738479"
(3) Internet web page, URL "http://nanika.osonae.com/DArray/dary.html"
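As a rough illustration of how a trie can replace the shorten-one-character-at-a-time search, the sketch below walks the sentence once from the pointer and remembers the last position at which a dictionary word ended. The class and method names (`TrieNode`, `Trie`, `longest_prefix`) are hypothetical, and this is a plain dictionary-of-children trie rather than the double-array form discussed in the cited references.

```python
from typing import Dict, Iterable, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words: Iterable[str] = ()) -> None:
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_prefix(self, text: str, start: int = 0) -> Optional[str]:
        """Longest stored word that begins at text[start], or None."""
        node = self.root
        best_end = None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break  # no stored word continues with this character
            if node.is_word:
                best_end = i + 1  # remember the longest match so far
        return text[start:best_end] if best_end is not None else None


trie = Trie(["正夫", "は", "しっかり", "しっかり者", "だ"])
print(trie.longest_prefix("正夫はしっかり者だ", 3))
# しっかり者
```

The shortening search performs up to N dictionary lookups per pointer position, each on a substring of up to N characters, whereas the trie walk touches each character from the pointer at most once.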

The first division unit 13 can usually be realized by an MPU, memory, and the like. The processing procedure of the first division unit 13 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

The output unit 14 outputs the first division result acquired by the first division unit 13. Here, "output" is a concept that includes display on a display, projection using a projector, printing by a printer, sound output, transmission to an external device, accumulation in a recording medium, and delivery of a processing result to another processing device or another program. When the processing result is delivered to another program, the word division device 1 and that other program together realize, for example, a speech recognition device or a machine translation device. That is, the first division result obtained by dividing a sentence can be used in, for example, speech recognition processing or machine translation processing.

The output unit 14 may or may not be considered to include an output device such as a display or a speaker. The output unit 14 can be realized by driver software for an output device, or by driver software for an output device together with the output device.

Next, the operation of the word division device 1 is described using the flowchart of FIG. 2.

(Step S201) The reception unit 12 determines whether a sentence has been received. If a sentence has been received, the process goes to step S202; otherwise, it returns to step S201.

(Step S202) The first division unit 13 sets the sentence pointer p to 1. The sentence pointer p indicates the position in the sentence at which word acquisition starts.

(Step S203) The first division unit 13 searches the word division dictionary 11 for the word that matches the maximum-length character string starting from the character at position p in the sentence, and acquires that word from the word division dictionary 11.

(Step S204) The first division unit 13 determines whether the word acquired in step S203 is a word included in division information. If it is, the process goes to step S205; if it is not, the process goes to step S206.

(Step S205) The first division unit 13 acquires, from the word division dictionary 11, the two or more divided words corresponding to the word acquired in step S203, and appends them to the buffer. The initial value of the buffer is NULL. When appending, the first division unit 13 inserts a delimiter between the divided words. The delimiter may be anything, for example "／", a space, or ",". The process goes to step S207.

(Step S206) The first division unit 13 appends the word acquired in step S203 to the buffer. The first division unit 13 places a delimiter between the word acquired in step S203 and the preceding and/or following word.

(Step S207) The first division unit 13 advances the pointer p by the length of the maximum-length character string.

(Step S208) The first division unit 13 determines whether all division processing is complete. If it is complete, the process goes to step S209; if not, the process returns to step S203. When the pointer p is at the position following the last character of the sentence, all division processing can be said to be complete.

(Step S209) The output unit 14 outputs the two or more words in the buffer. The process returns to step S201.

In the flowchart of FIG. 2, the processing ends when the power is turned off or an interrupt for ending the processing occurs.
In the flowchart of FIG. 2, processing starts at the beginning of the received sentence and proceeds in order to the end of the sentence. However, processing may instead start at the end of the received sentence and proceed from the back of the sentence toward the front. That is, in step S202 the first division unit 13 may set the sentence pointer p to the end of the sentence, and in step S207 it may move the pointer p back toward the front of the sentence by the length of the maximum-length character string. In that case, in step S203 the first division unit 13 searches the word division dictionary 11 for the word that matches the maximum-length character string obtained by extending toward the front of the sentence from the character at position p, and acquires that word from the word division dictionary 11.
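The back-to-front variant can be sketched in the same style. As before, this is an assumption-laden illustration: `segment_backward` is a hypothetical name, the dictionary is a mapping from a word to its divided words or `None`, the pointer is a 0-based exclusive end index, and a single character is emitted when no dictionary word ends at the pointer.

```python
from typing import Dict, List, Optional

Dictionary = Dict[str, Optional[List[str]]]


def segment_backward(sentence: str, dictionary: Dictionary) -> List[str]:
    """Longest-match division that starts at the end of the sentence."""
    result: List[str] = []
    p = len(sentence)  # exclusive end index of the unprocessed part
    while p > 0:
        word = None
        # Trying the smallest start index first yields the longest
        # candidate that ends at p.
        for start in range(0, p):
            if sentence[start:p] in dictionary:
                word = sentence[start:p]
                break
        if word is None:
            # Assumption: emit one character when nothing matches.
            word = sentence[p - 1]
            parts = [word]
        else:
            parts = dictionary[word] or [word]
        result[:0] = parts  # prepend so the output stays in sentence order
        p -= len(word)      # move the pointer toward the front
    return result


example_dictionary: Dictionary = {
    "正夫": None,
    "は": None,
    "しっかり者": ["しっかり", "者"],
    "だ": None,
}
print(segment_backward("正夫はしっかり者だ", example_dictionary))
# ['正夫', 'は', 'しっかり', '者', 'だ']
```

For this sentence both directions give the same division; in general, forward and backward longest matching can disagree when dictionary words overlap.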

Hereinafter, a specific operation of the word division device 1 in the present embodiment is described.

Assume that FIG. 3 shows the word division dictionary 11. Each record in the word division dictionary 11 has "ID", "word", and "divided words". Records may also have other information, such as part of speech and occurrence probability. Each record holds either a word, or a pair of a word and division information.

A record corresponding to a word has NULL ("−" in FIG. 3) as the value of the attribute "divided words". Records corresponding to a word are, for example, the records with "ID = 5, 6, 8, 9, 10, 11, 12, 13" in FIG. 3. A record corresponding to a pair of a word and division information has two or more divided words as the value of the attribute "divided words". Here, the divided words in that attribute are separated by the delimiter "／". Records corresponding to a pair of a word and division information are, for example, the records with "ID = 1, 2, 3, 4, 7" in FIG. 3. A record in the word division dictionary 11 may also have, as an attribute value, a flag indicating whether it is a word or a pair of a word and division information.
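A record of this shape might be modeled as below. This is only a sketch: the class name `Record`, its field names, and the IDs 901 and 902 are placeholders chosen for illustration and do not reproduce the contents of FIG. 3. Here the "word or pair" flag is derived from whether the divided-words attribute is NULL rather than stored separately.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Record:
    id: int
    word: str
    # None plays the role of NULL ("−" in FIG. 3).
    divided_words: Optional[Tuple[str, ...]] = None

    def __post_init__(self) -> None:
        # A pair of a word and division information needs two or more parts.
        if self.divided_words is not None and len(self.divided_words) < 2:
            raise ValueError("division information needs two or more divided words")

    @property
    def is_division_pair(self) -> bool:
        """Flag: True for a word/division-information pair, False for a word."""
        return self.divided_words is not None


# Placeholder records; the IDs are hypothetical.
records = [
    Record(901, "正夫"),
    Record(902, "しっかり者", ("しっかり", "者")),
]
plain_words = [r.word for r in records if not r.is_division_pair]
```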

(Specific example 1)
In this situation, assume that the reception unit 12 receives the sentence "正夫はしっかり者だ" ("Masao is a dependable person"). Next, the first division unit 13 sets the sentence pointer p to 1. That is, the pointer p is set to the position of "正" in the sentence.

Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the maximum-length character string starting from the character "正" at position p in the sentence, and acquires the word "正夫".

Next, the first division unit 13 determines whether the acquired word "正夫" is a word included in division information. Here, the first division unit 13 determines that the division information corresponding to the word "正夫" is "−" (NULL).

The first division unit 13 then appends the acquired word "正夫" to the buffer.

Next, the first division unit 13 calculates the character string length "2" of the word "正夫". It then advances the pointer p by that length, "2", setting the pointer p to the position of "は" in the sentence.

Next, the first division unit 13 determines that the division processing is not yet complete.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「は」から、最大長の文字列と一致する単語「は」を検索し、取得する。 Next, the first division unit 13 searches for a word “ha” that is present in the word division dictionary 11 and matches the maximum length character string from the character “ha” corresponding to p in the sentence. And get.

次に、第一分割部１３は、取得した単語「は」が、分割情報に含まれる単語であるか否かを判断する。つまり、第一分割部１３は、単語「は」に対応する分割情報が「−（ＮＵＬＬ）」であると判断する。 Next, the first division unit 13 determines whether or not the acquired word “ha” is a word included in the division information. That is, the first division unit 13 determines that the division information corresponding to the word “ha” is “− (NULL)”.

そして、第一分割部１３は、取得した単語「は」をバッファに追記する。なお、第一分割部１３は、単語「は」の前に区切り文字「／」を入れて、バッファに追記する。そして、現在のバッファには「正夫／は」が格納された。 Then, the first dividing unit 13 adds the acquired word “ha” to the buffer. The first dividing unit 13 adds a delimiter “/” in front of the word “ha” and appends it to the buffer. Then, “Masao / Ha” is stored in the current buffer.

次に、第一分割部１３は、単語「は」の文字列長「１」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「１」だけ進め、ポインタｐを文の「し」の位置に設定する。 Next, the first dividing unit 13 calculates the character string length “1” of the word “ha”. Next, the first dividing unit 13 advances the pointer p by “1” corresponding to the maximum length of the character string, and sets the pointer p to the position of “shi” of the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「し」から、最大長の文字列と一致する単語「しっかり者」を検索し、取得する。 Next, the first dividing unit 13 is a word that exists in the word dividing dictionary 11, and from the character “shi” corresponding to p in the sentence, the word “solid” that matches the maximum length character string is obtained. Search and get.

次に、第一分割部１３は、取得した単語「しっかり者」が、分割情報に含まれる単語であるか否かを判断する。つまり、単語「しっかり者」に対応する分割単語がＮＵＬＬでないので、第一分割部１３は、単語「しっかり者」が、分割情報に含まれる単語であると判断する。 Next, the first division unit 13 determines whether or not the acquired word “firm person” is a word included in the division information. That is, since the divided word corresponding to the word “firm person” is not NULL, the first dividing unit 13 determines that the word “firm person” is a word included in the division information.

そして、第一分割部１３は、単語「しっかり者」に対応する分割情報「しっかり／者」を、単語分割用辞書１１から取得する。 Then, the first division unit 13 acquires the division information “firm / person” corresponding to the word “firm person” from the word division dictionary 11.

Then, the first division unit 13 appends the delimiter “／” and the acquired divided words “しっかり／者” to the buffer. The buffer now holds “正夫／は／しっかり／者”.

次に、第一分割部１３は、単語「しっかり者」の文字列長「５」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「５」だけ進め、ポインタｐを文の「だ」の位置に設定する。 Next, the first dividing unit 13 calculates the character string length “5” of the word “firm person”. Next, the first dividing unit 13 advances the pointer p by “5” corresponding to the maximum length of the character string, and sets the pointer p to the position of “DA” in the sentence.

Next, the first division unit 13 searches the word division dictionary 11 for the longest word that matches the sentence starting at the character “だ”, to which p points, and acquires the word “だ”.

次に、第一分割部１３は、取得した単語「だ」が、分割情報に含まれる単語であるか否かを判断する。つまり、第一分割部１３は、単語「だ」に対応する分割情報が「−（ＮＵＬＬ）」であると判断する。 Next, the first division unit 13 determines whether or not the acquired word “DA” is a word included in the division information. That is, the first division unit 13 determines that the division information corresponding to the word “DA” is “− (NULL)”.

Then, the first division unit 13 appends the delimiter “／” and the acquired word “だ” to the buffer. The buffer now holds “正夫／は／しっかり／者／だ”.

次に、第一分割部１３は、単語「だ」の文字列長「１」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「１」だけ進め、ポインタｐを文の「だ」の次の位置に設定する。 Next, the first dividing unit 13 calculates the character string length “1” of the word “DA”. Next, the first dividing unit 13 advances the pointer p by “1” corresponding to the maximum length of the character string, and sets the pointer p to a position next to “DA” in the sentence.

次に、第一分割部１３は、分割処理が終了した、と判断する。 Next, the first division unit 13 determines that the division process has been completed.

Then, the output unit 14 outputs the sequence of two or more divided words in the buffer, “正夫／は／しっかり／者／だ”.
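
The walkthrough of specific example 1 can be sketched in Python as follows. This is a minimal sketch, not the implementation of the word division device 1: the dictionary contents, the names `DICTIONARY` and `segment`, the 0-based pointer, and the handling of characters not found in the dictionary are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical word division dictionary: word -> division information.
# None corresponds to "- (NULL)" in the description.
DICTIONARY: Dict[str, Optional[List[str]]] = {
    "正夫": None,
    "は": None,
    "しっかり": None,
    "しっかり者": ["しっかり", "者"],
    "だ": None,
}


def segment(sentence: str, dictionary: Dict[str, Optional[List[str]]]) -> List[str]:
    """Greedy longest-match division that expands division information."""
    max_len = max(len(word) for word in dictionary)
    buffer: List[str] = []
    p = 0  # 0-based pointer (the description counts from 1)
    while p < len(sentence):
        # Try the longest candidate starting at p first.
        for length in range(min(max_len, len(sentence) - p), 0, -1):
            candidate = sentence[p:p + length]
            if candidate in dictionary:
                break
        else:
            # Assumption: a character with no dictionary entry becomes a one-character word.
            candidate = sentence[p]
        divided = dictionary.get(candidate)
        # Use the divided words when division information exists, else the word itself.
        buffer.extend(divided if divided else [candidate])
        # Advance by the length of the matched word, not of its divided words.
        p += len(candidate)
    return buffer


result = segment("正夫はしっかり者だ", DICTIONARY)
# result == ["正夫", "は", "しっかり", "者", "だ"]
```

Joining `result` with the delimiter “／” gives the buffer contents “正夫／は／しっかり／者／だ” described above.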

(Specific example 2)
Assume that the reception unit 12 has received the sentence “そうはいってもまだ子供” (“Even so, still a child”). Next, the first division unit 13 sets the sentence pointer p to 1. That is, the pointer p is set to the position of “そ” in the sentence.

Next, the first division unit 13 searches the word division dictionary 11 for the longest word that matches the sentence starting at the character “そ”, to which p points, and acquires the word “そうはいっても” (“even so”).

Next, the first division unit 13 determines whether the acquired word “そうはいっても” is a word included in division information. Since the divided words corresponding to the word “そうはいっても” are not NULL, the first division unit 13 determines that the word “そうはいっても” is a word included in division information.

Then, the first division unit 13 acquires the division information “そう／は／いって／も” corresponding to the word “そうはいっても” from the word division dictionary 11.

Then, the first division unit 13 appends the acquired divided words “そう／は／いって／も” to the buffer. The buffer now holds “そう／は／いって／も”.

Next, the first division unit 13 calculates the character string length “7” of the word “そうはいっても”. It then advances the pointer p by “7”, the length of that longest-match string, so that p points to the character “ま” in the sentence.

Next, the first division unit 13 searches the word division dictionary 11 for the longest word that matches the sentence starting at the character “ま”, to which p points, and acquires the word “まだ” (“still”).

Next, the first division unit 13 determines whether the acquired word “まだ” is a word included in division information. Here, it determines that the division information corresponding to the word “まだ” is “− (NULL)”.

Then, the first division unit 13 appends the delimiter “／” and the acquired word “まだ” to the buffer. The buffer now holds “そう／は／いって／も／まだ”.

次に、第一分割部１３は、単語「まだ」の文字列長「２」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「２」だけ進め、ポインタｐを文の「子」の位置に設定する。 Next, the first dividing unit 13 calculates the character string length “2” of the word “still”. Next, the first division unit 13 advances the pointer p by “2” corresponding to the maximum length of the character string, and sets the pointer p to the position of the “child” of the sentence.

Next, the first division unit 13 searches the word division dictionary 11 for the longest word that matches the sentence starting at the character “子”, to which p points, and acquires the word “子供” (“child”).

Next, the first division unit 13 determines whether the acquired word “子供” is a word included in division information. Here, it determines that the division information corresponding to the word “子供” is “− (NULL)”.

Then, the first division unit 13 appends the delimiter “／” and the acquired word “子供” to the buffer. The buffer now holds “そう／は／いって／も／まだ／子供”.

Next, the first division unit 13 calculates the character string length “2” of the word “子供”. It then advances the pointer p by “2”, the length of that longest-match string, so that p points to the position following “供” in the sentence.

Then, the output unit 14 outputs the sequence of two or more divided words in the buffer, “そう／は／いって／も／まだ／子供”.

以上、本実施の形態によれば、非常に簡易な処理により、文を２以上の単語に分割できる。そのため、文の単語への分割が非常に高速に行える。 As described above, according to the present embodiment, a sentence can be divided into two or more words by a very simple process. Therefore, the sentence can be divided into words very quickly.

なお、本実施の形態において、第一分割部１３が最大長の文字列である単語を単語分割用辞書１１から取得するアルゴリズムは問わない。 In the present embodiment, the algorithm by which the first dividing unit 13 acquires the word that is the maximum length character string from the word dividing dictionary 11 does not matter.
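
Since the lookup algorithm is left open, one possible choice is a trie (prefix tree), which finds the longest dictionary word starting at the pointer in a single left-to-right scan. The following is a sketch of that option only; the description does not state that a trie is used, and `TrieNode`, `build_trie`, and `longest_match` are hypothetical names.

```python
from typing import Dict, Iterable, Optional


class TrieNode:
    """One node of a character trie over dictionary words."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def longest_match(root: TrieNode, sentence: str, p: int) -> Optional[str]:
    """Return the longest dictionary word starting at 0-based index p, or None."""
    node: Optional[TrieNode] = root
    best_end: Optional[int] = None
    for i in range(p, len(sentence)):
        node = node.children.get(sentence[i])
        if node is None:
            break  # no dictionary word continues with this character
        if node.is_word:
            best_end = i + 1  # remember the longest complete word seen so far
    return sentence[p:best_end] if best_end is not None else None


trie = build_trie(["そう", "そうはいっても", "は", "まだ", "子供"])
sentence = "そうはいってもまだ子供"
first = longest_match(trie, sentence, 0)   # "そうはいっても", not the shorter "そう"
second = longest_match(trie, sentence, 7)  # "まだ"
```

With a trie, the cost of each lookup is bounded by the length of the longest dictionary word rather than by the number of dictionary entries.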

Furthermore, the processing in the present embodiment may be realized by software. This software may be distributed by software download or the like, or may be recorded on a recording medium such as a CD-ROM and distributed. The same applies to the other embodiments in this specification. The software that implements the word division device in the present embodiment is the following program. That is, with a recording medium storing a word division dictionary that has one or more words and one or more pieces of division information, each being a pair of a word and two or more divided words obtained by dividing that word, this program causes a computer to function as: a reception unit that receives a sentence having one or more characters; a first division unit that performs a divided word acquisition process of acquiring, from the word division dictionary, the word that matches the longest character string starting at the sentence pointer, which initially indicates the head of the sentence received by the reception unit, and, when the acquired word has two or more corresponding divided words, acquiring those divided words in place of the matching word, then moves the sentence pointer to the character following the matching word, repeats the divided word acquisition process up to the word containing the last character of the sentence, and thereby acquires a first division result, which is a set of two or more words obtained by dividing the sentence; and an output unit that outputs the first division result.

Experimental results obtained with the word division device 1 are described below. In the experiments, the software that implements the word division device 1 is named “MA-2”.
(Experiment 1)

As another word division device in Experiment 1, the publicly known “MeCab 0.98” was used. “MeCab 0.98” is described at “http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html”. In addition, “MA-1”, a word division device developed by the applicant that uses the Viterbi algorithm, was also used as another word division device. FIG. 4 shows the processing speed (KB/sec) of each device, measured by inputting 388.5 MB of UTF-8 Japanese text to the three devices. “MA-2”, the word division device 1, ran at 4.3 times the processing speed of “MeCab 0.98”. It can also be seen that “MA-2” can analyze one year of newspaper text in about 30 seconds (see FIG. 4).
(Experiment 2)

Next, the results of Experiment 2 using the word division device 1 “MA-2” are described. The results are shown in FIG. 5. As the other word division devices in Experiment 2, the publicly known “JUMAN 6.0”, “MeCab 0.98”, “KyTea 0.3.0”, and “ChaSen 2.3.3” were used. “JUMAN 6.0” is described at “http://nlp.ist.i.kyoto-u.ac.jp/index.php?cmd=read&page=JUMAN&alias%5B%5D=%E6%97%A5%E6%9C%AC%E8%AA%9E%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90%E3%82%B7%E3%82%B9%E3%83%86%E3%83%A0JUMAN”, “KyTea 0.3.0” at “http://www.phontron.com/kytea/index-ja.html”, and “ChaSen 2.3.3” at “http://chasen.naist.jp/hiki/ChaSen/”. In this experiment, 80,000 sentences of web text were input to each of the five devices, and the time each device took to analyze them was measured (see FIG. 5). FIG. 5 shows that the word division device 1 “MA-2” is far faster than the others. The algorithm and model of the word division device 1 correspond to the “depth-first search + collocations” entry shown in FIG. 5.

(Embodiment 2)
The present embodiment describes a dictionary registration device 2 that registers division information in the word division dictionary 11 when the different parts between the division result of one word division device and the division results of a plurality of other word division devices satisfy a predetermined condition. The other word division devices are preferably conventional word division devices that have been confirmed to divide text into words with high accuracy.

図６は、本実施の形態における辞書登録装置２のブロック図である。辞書登録装置２は、単語分割用辞書１１、第一分割結果取得部２２、他分割結果取得部２３、相違部分取得部２４、分割情報取得部２５、および辞書登録部２６を備える。 FIG. 6 is a block diagram of dictionary registration apparatus 2 in the present embodiment. The dictionary registration device 2 includes a word division dictionary 11, a first division result acquisition unit 22, another division result acquisition unit 23, a different part acquisition unit 24, a division information acquisition unit 25, and a dictionary registration unit 26.

The first division result acquisition unit 22 acquires the first division result. The first division result is the result of one word division device dividing one sentence. It is a set of one or more words and usually has a structure in which the boundaries between words can be recognized. The one word division device is, for example, the word division device 1. The first division result acquisition unit 22 may receive the first division result from the one word division device, may read a first division result stored in a recording medium (not shown), or may itself divide a received sentence to obtain the first division result. In the last case, the first division result acquisition unit 22 has functions equivalent to those of the first division unit 13.

他分割結果取得部２３は、２以上の他分割結果を取得する。他分割結果は、他単語分割装置が一の文を分割した結果である。他分割結果は、２以上の単語の集合であり、単語間が区切られていることを認識できる構造を有する。なお、他単語分割装置は、上記の一の単語分割装置ではない単語分割装置である。 The other division result acquisition unit 23 acquires two or more other division results. The other division result is a result of dividing one sentence by another word dividing device. The other division result is a set of two or more words, and has a structure capable of recognizing that the words are separated. The other word dividing device is a word dividing device that is not the one word dividing device described above.

他分割結果取得部２３は、２つの他分割結果である、第二分割結果および第三分割結果を取得しても良い。 The other division result acquisition unit 23 may acquire the second division result and the third division result, which are two other division results.

相違部分取得部２４は、１以上の相違部分を取得する。相違部分とは、第一分割結果と他分割結果との相違する部分である。また、相違部分は、他分割結果に含まれる部分である。なお、相違部分取得部２４は、２以上の相違部分を取得しても良い。また、相違部分取得部２４は、一の文に対して、２箇所以上の相違部分を取得しても良い。相違部分は、通常、２以上の分割単語を有する。 The different part acquisition unit 24 acquires one or more different parts. The different part is a part where the first division result and the other division result are different. The different part is a part included in the other division result. Note that the different part acquisition unit 24 may acquire two or more different parts. Moreover, the different part acquisition part 24 may acquire two or more different parts with respect to one sentence. The difference part usually has two or more divided words.

Suppose, for example, that the other word division devices are a second word division device and a third word division device, that the division result acquired by the second word division device is the second division result, and that the division result acquired by the third word division device is the third division result. In that case, the different part acquisition unit 24 may acquire, for example, a first different part, which is the different part between the first division result and the second division result, and a second different part, which is the different part between the first division result and the third division result.

分割情報取得部２５は、相違部分取得部２４が取得した１以上の相違部分が予め決められた条件を満たす場合、１以上のいずれかの相違部分を用いて、分割情報を構成する。分割情報は、相違部分に対応する文字列である単語と、相違部分である２以上の単語とを有する情報である。予め決められた条件を満たすか否かが判断される相違部分は、文の中の同一の相違する箇所における相違部分である。また、相違部分とは、単語への分割の仕方が相違する部分である。 When one or more different parts acquired by the different part acquisition unit 24 satisfy a predetermined condition, the division information acquisition unit 25 configures the division information using any one or more different parts. The division information is information having a word that is a character string corresponding to a different part and two or more words that are different parts. The difference part in which it is determined whether or not a predetermined condition is satisfied is a difference part in the same different part in the sentence. Moreover, the different part is a part in which the way of dividing into words is different.

Here, the predetermined condition is, for example, that all of the two or more different parts are the same. The predetermined condition may instead be, for example, that at least N (N is, for example, 50%) of the two or more different parts are the same.

That is, for example, when the first different part and the second different part are identical, the division information acquisition unit 25 uses the first different part to construct division information that has a word, which is the character string corresponding to the first different part (or the second different part), and the two or more words that make up the first different part (or the second different part). Because the first different part and the second different part are the same, the resulting division information is the same whichever one is used.
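
The predetermined condition can be sketched as a small predicate with a configurable threshold. This is an illustration under stated assumptions: the function name `agreed_part` and its parameters are hypothetical, and the sketch assumes that agreement is measured against the number of other division results, so that a different part reported by only one of two other devices does not satisfy the “all the same” condition.

```python
from collections import Counter
from typing import List, Optional, Tuple

Part = Tuple[str, ...]  # one different part, e.g. ("は", "ない", "か")


def agreed_part(
    parts: List[Part],
    num_other_results: int,
    min_ratio: float = 1.0,
) -> Optional[Part]:
    """Return the most frequently reported different part at one location when the
    share of other division results reporting it is at least min_ratio, else None.

    min_ratio=1.0 models "all different parts are the same";
    min_ratio=0.5 models "N = 50% or more are the same".
    """
    if not parts or num_other_results < 2:
        return None
    part, count = Counter(parts).most_common(1)[0]
    return part if count / num_other_results >= min_ratio else None


# Both other devices report the same different part: the condition holds.
both = agreed_part([("は", "ない", "か"), ("は", "ない", "か")], num_other_results=2)
# Only one of two devices reports this part: the strict condition fails.
single = agreed_part([("自由", "形式")], num_other_results=2)
```

Lowering `min_ratio` trades precision of the registered division information for coverage, which is why the sketch keeps it as a parameter.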

辞書登録部２６は、分割情報取得部２５が取得した分割情報を単語分割用辞書１１に蓄積する。 The dictionary registration unit 26 accumulates the division information acquired by the division information acquisition unit 25 in the word division dictionary 11.

第一分割結果取得部２２、および他分割結果取得部２３は、例えば、無線または有線の通信手段により実現され得る。また、第一分割結果取得部２２等は、ＭＰＵやメモリ等から実現されても良い。 The first division result acquisition unit 22 and the other division result acquisition unit 23 can be realized by, for example, wireless or wired communication means. The first division result acquisition unit 22 and the like may be realized by an MPU, a memory, or the like.

相違部分取得部２４、分割情報取得部２５、辞書登録部２６は、通常、ＭＰＵやメモリ等から実現され得る。相違部分取得部２４の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The different part acquisition unit 24, the division information acquisition unit 25, and the dictionary registration unit 26 can be usually realized by an MPU, a memory, or the like. The processing procedure of the different part acquisition unit 24 is usually realized by software, and the software is recorded in a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

Next, the operation of the dictionary registration device 2 is described with reference to the flowchart of FIG. 7.

（ステップＳ７０１）第一分割結果取得部２２は、第一分割結果を取得する。 (Step S701) The first division result acquisition unit 22 acquires the first division result.

（ステップＳ７０２）他分割結果取得部２３は、２以上の他分割結果を取得する。 (Step S702) The other division result acquisition unit 23 acquires two or more other division results.

（ステップＳ７０３）相違部分取得部２４は、カウンタｉに１を代入する。 (Step S703) The different portion acquisition unit 24 substitutes 1 for a counter i.

（ステップＳ７０４）相違部分取得部２４は、ステップＳ７０２で取得した他分割結果の中に、ｉ番目の他分割結果が存在するか否かを判断する。ｉ番目の他分割結果が存在すればステップＳ７０５に行き、ｉ番目の他分割結果が存在しなければステップＳ７０９に行く。 (Step S704) The different portion acquisition unit 24 determines whether or not the i-th other division result exists in the other division results acquired in step S702. If the i-th other division result exists, the process goes to step S705. If the i-th other division result does not exist, the process goes to step S709.

(Step S705) The different part acquisition unit 24 acquires the different part between the first division result and the i-th other division result. This process is called the different part acquisition process, and it is described with reference to the flowchart of FIG. 8.

（ステップＳ７０６）相違部分取得部２４は、ステップＳ７０５で相違部分が取得できたか否かを判断する。相違部分を取得できればステップＳ７０７に行き、取得できなければステップＳ７０８に行く。 (Step S706) The different portion acquisition unit 24 determines whether or not a different portion has been acquired in step S705. If the different part can be acquired, the process goes to step S707, and if not, the process goes to step S708.

(Step S707) The different part acquisition unit 24 temporarily stores the different part acquired in step S705 in a buffer (not shown). Storing a different part is equivalent to storing information that indicates the different part. Information indicating a different part is, for example, pointers to the start character and the end character of the different part. Furthermore, when one sentence contains two or more differing locations, the different part acquisition unit 24 temporarily stores the different parts separately for each location.

（ステップＳ７０８）相違部分取得部２４は、カウンタｉを１、インクリメントする。ステップＳ７０４に戻る。 (Step S708) The different part acquisition unit 24 increments the counter i by one. The process returns to step S704.

（ステップＳ７０９）分割情報取得部２５は、カウンタｊに１を代入する。 (Step S709) The division information acquisition unit 25 substitutes 1 for a counter j.

（ステップＳ７１０）分割情報取得部２５は、ｊ番目の相違する箇所が存在するか否かを判断する。ｊ番目の相違する箇所が存在すればステップＳ７１１に行き、存在しなければ処理を終了する。 (Step S710) The division information acquisition unit 25 determines whether there is a j-th different place. If there is a j-th different part, the process goes to step S711, and if not, the process ends.

（ステップＳ７１１）分割情報取得部２５は、ｊ番目の相違箇所に関して、バッファ内の１以上の相違部分が予め決められた条件を満たすか否かを判断する。予め決められた条件を満たす場合はステップＳ７１２に行き、予め決められた条件を満たさない場合はステップＳ７１６に行く。 (Step S711) The division information acquisition unit 25 determines whether one or more different portions in the buffer satisfy a predetermined condition with respect to the j-th different portion. If the predetermined condition is satisfied, the process goes to step S712. If the predetermined condition is not satisfied, the process goes to step S716.

(Step S712) For the j-th differing location, the division information acquisition unit 25 acquires the word that is the character string corresponding to the different part. This word is usually a collocation, and it is the word that forms part of the division information.

（ステップＳ７１３）分割情報取得部２５は、相違部分である２以上の分割単語を取得する。 (Step S713) The division information acquisition unit 25 acquires two or more division words that are different parts.

（ステップＳ７１４）分割情報取得部２５は、ステップＳ７１２で取得した単語と、ステップＳ７１３で取得した２以上の分割単語を用いて、分割情報を構成する。 (Step S714) The division information acquisition unit 25 configures division information using the word acquired in step S712 and two or more division words acquired in step S713.

（ステップＳ７１５）辞書登録部２６は、ステップＳ７１４で構成された分割情報を単語分割用辞書１１に蓄積する。 (Step S715) The dictionary registration unit 26 accumulates the division information configured in step S714 in the word division dictionary 11.

（ステップＳ７１６）分割情報取得部２５は、カウンタｊを１、インクリメントする。ステップＳ７１０に戻る。 (Step S716) The division information acquisition unit 25 increments the counter j by 1. The process returns to step S710.

なお、図７のフローチャートにおいて、第一分割結果と他分割結果との相違部分は、通常、一箇所であるが、２箇所以上存在しても良い。 In the flowchart of FIG. 7, the difference between the first division result and the other division result is usually one place, but there may be two or more places.
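
Steps S709 to S716 can be sketched as a loop over the buffered different parts, grouped by their location in the sentence. This is a minimal sketch under assumptions: the buffer layout (sentence offset mapped to the parts reported there), the names `register_division_info` and `both_devices_agree`, and the use of a plain Python dict as the word division dictionary are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Part = Tuple[str, ...]           # one different part, e.g. ("は", "ない", "か")
Buffer = Dict[int, List[Part]]   # sentence offset -> different parts reported there


def register_division_info(
    buffer: Buffer,
    condition: Callable[[List[Part]], bool],
    dictionary: Dict[str, List[str]],
) -> None:
    """Register division information for every location that meets the condition."""
    for offset in sorted(buffer):           # S709, S710, S716: visit each location j
        parts = buffer[offset]
        if not condition(parts):            # S711: predetermined condition
            continue
        # Assumption: the condition guarantees that parts[0] is the agreed part.
        divided_words = list(parts[0])      # S713: the two or more divided words
        word = "".join(divided_words)       # S712: the corresponding character string
        dictionary[word] = divided_words    # S714, S715: build and store division info


def both_devices_agree(parts: List[Part]) -> bool:
    """Hypothetical condition: two other devices report the same different part."""
    return len(parts) == 2 and parts[0] == parts[1]


buffer: Buffer = {
    1: [("自由", "形式")],                            # reported by one device only
    9: [("は", "ない", "か"), ("は", "ない", "か")],  # reported by both devices
}
word_division_dictionary: Dict[str, List[str]] = {}
register_division_info(buffer, both_devices_agree, word_division_dictionary)
# word_division_dictionary == {"はないか": ["は", "ない", "か"]}
```

Passing the condition in as a function keeps the loop unchanged when a different predetermined condition, such as a ratio threshold, is used.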

次に、ステップＳ７０５の相違部分取得処理について、図８のフローチャートを用いて説明する。 Next, the different part acquisition process in step S705 will be described with reference to the flowchart of FIG.

（ステップＳ８０１）相違部分取得部２４は、カウンタｉ、およびｊに１を代入する。ここで、カウンタｉは第一分割結果の中の分割単語のカウンタであり、カウンタｊは第二分割結果の中の分割単語のカウンタである。 (Step S801) The different part acquisition unit 24 substitutes 1 for counters i and j. Here, the counter i is a counter for divided words in the first division result, and the counter j is a counter for divided words in the second division result.

（ステップＳ８０２）相違部分取得部２４は、第一分割結果の中にｉ番目の分割単語が存在するか否かを判断する。ｉ番目の分割単語が存在すればステップＳ８０３に行き、ｉ番目の分割単語が存在しなければ処理を終了する。 (Step S802) The different part acquisition unit 24 determines whether or not the i-th divided word exists in the first division result. If the i-th divided word exists, the process goes to step S803, and if the i-th divided word does not exist, the process ends.

（ステップＳ８０３）相違部分取得部２４は、第一分割結果の中のｉ番目の分割単語を取得する。 (Step S803) The different portion acquisition unit 24 acquires the i-th divided word in the first division result.

（ステップＳ８０４）相違部分取得部２４は、他分割結果の中のｊ番目の分割単語を取得する。 (Step S804) The different part acquisition unit 24 acquires the j-th divided word in the other division result.

（ステップＳ８０５）相違部分取得部２４は、ステップＳ８０３で取得したｉ番目の分割単語と、ステップＳ８０４で取得したｊ番目の分割単語とが同一か否かを判断する。２つの分割単語が同一であればステップＳ８０６に行き、同一でなければステップＳ８０７に行く。 (Step S805) The different portion acquisition unit 24 determines whether or not the i-th divided word acquired in step S803 is the same as the j-th divided word acquired in step S804. If the two divided words are the same, go to step S806, otherwise go to step S807.

（ステップＳ８０６）相違部分取得部２４は、カウンタｉ、およびｊを、１インクリメントする。ステップＳ８０２に戻る。 (Step S806) The different part acquisition unit 24 increments the counters i and j by 1. The process returns to step S802.

(Step S807) The different part acquisition unit 24 advances i and j until the last character of the divided word in the first division result coincides with the last character of the divided word in the other division result, or until the end of the sentence. Here, the last characters coinciding does not mean that their character codes match, but that they are the character at the same position in the sentence being divided. Characters at the same position naturally also have the same character code. Advancing i and j means advancing each of them from its current position.

（ステップＳ８０８）相違部分取得部２４は、ステップＳ８０７でｊを進める前の分割単語から、ｊを進めた後の分割単語までを、相違部分として取得する。ステップＳ８０６に戻る。 (Step S808) The different part acquisition unit 24 acquires, from the divided words before j is advanced in step S807 to the divided words after j is advanced, as different parts. The process returns to step S806.
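
The different part acquisition process of FIG. 8 can be sketched by tracking the end position of the current divided word in each division result and advancing whichever one lags behind until the two end positions coincide. This is a sketch under assumptions, not the claimed implementation: the name `find_different_parts`, the 0-based counters, the returned (offset, part) pairs with 1-based offsets, and the check that both results divide the same sentence are choices made for illustration.

```python
from typing import List, Tuple

Part = Tuple[str, ...]


def find_different_parts(first: List[str], other: List[str]) -> List[Tuple[int, Part]]:
    """Return (1-based offset, divided words of `other`) for each differing location."""
    if "".join(first) != "".join(other):
        raise ValueError("both results must divide the same sentence")
    parts: List[Tuple[int, Part]] = []
    i = j = 0   # S801: counters over divided words
    pos = 0     # characters consumed so far (identical in both results here)
    while i < len(first) and j < len(other):          # S802
        if first[i] == other[j]:                      # S803 to S805
            pos += len(first[i])
            i += 1                                    # S806
            j += 1
            continue
        start_j, offset = j, pos + 1
        end_i = pos + len(first[i])
        end_j = pos + len(other[j])
        while end_i != end_j:                         # S807: align the last characters
            if end_i < end_j:
                i += 1
                end_i += len(first[i])
            else:
                j += 1
                end_j += len(other[j])
        parts.append((offset, tuple(other[start_j:j + 1])))  # S808
        pos = end_i
        i += 1                                        # back to S806
        j += 1
    return parts


first_result = ["自由形", "式", "で", "間違い", "はな", "いか"]
result_a = ["自由形", "式", "で", "間違い", "は", "ない", "か"]
result_b = ["自由", "形式", "で", "間違い", "は", "ない", "か"]
parts_a = find_different_parts(first_result, result_a)
parts_b = find_different_parts(first_result, result_b)
# parts_a == [(9, ("は", "ない", "か"))]
# parts_b == [(1, ("自由", "形式")), (9, ("は", "ない", "か"))]
```

The division results used here are those of the specific operation example of this embodiment, and the offsets “1” and “9” match the positions given there.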

A specific operation of the dictionary registration device 2 in the present embodiment is described below. In this specific example, there are two other word division devices, word division device A and word division device B, and the predetermined condition for dictionary registration is that the different parts match. That is, the other division result acquisition unit 23 acquires a second division result and a third division result. The different part acquisition unit 24 acquires a first different part, which is the different part between the first division result and the second division result, and a second different part, which is the different part between the first division result and the third division result. Furthermore, the device that acquires the first division result is the word division device 1 described above.
The word division device A and the word division device B are, for example, “MeCab 0.98”, “JUMAN 6.0”, “KyTea 0.3.0”, or “ChaSen 2.3.3” mentioned above. The word division device A and the word division device B are different word division devices.

Here, assume that the word division device 1 and the two other word division devices have each received the sentence “自由形式で間違いはないか” (“Is free format correct?”).

Next, assume that the word division device 1 has acquired the first division result “自由形／式／で／間違い／はな／いか”, that the other word division device A has acquired “自由形／式／で／間違い／は／ない／か”, and that the other word division device B has acquired “自由／形式／で／間違い／は／ない／か”.

In this situation, the first division result acquisition unit 22 of the dictionary registration device 2 acquires the first division result “自由形／式／で／間違い／はな／いか” from the word division device 1.

次に、他分割結果取得部２３は、単語分割装置Ａから他分割結果Ａ「自由形／式／で／間違い／は／ない／か」を取得する。また、他分割結果取得部２３は、単語分割装置Ｂから他分割結果Ａ「自由／形式／で／間違い／は／ない／か」を取得する。 Next, the other division result acquisition unit 23 acquires the other division result A “free form / formula / de / incorrect / has / no /” from the word segmentation apparatus A. Further, the other division result acquisition unit 23 acquires the other division result A “free / format / de / incorrect / has / no /” from the word segmentation apparatus B.

次に、相違部分取得部２４は、第一分割結果と他分割結果Ａとの相違部分「は／ない／か」を取得する。そして、相違部分取得部２４は、相違部分「は／ない／か」をバッファに一時蓄積する。なお、相違部分「は／ない／か」の文中における箇所を他の箇所と区別するため、相違部分取得部２４は、相違部分「は／ない／か」と、文中の位置を示す「９」とを対応付けて蓄積することは好適である。ここで、「９」は、「は／ない／か」の最初の文字「は」の文中でのオフセット（最初からの文字数）である。また、第一分割結果と他分割結果Ａとは、「自由形／式／で／間違い」までの分割単語は同一であり、相違部分取得部２４は、「自由形／式／で／間違い」までの分割単語に関して、相違部分を取得しない。 Next, the different part acquisition unit 24 acquires the different part 「は／ない／か」 between the first division result and the other division result A, and temporarily stores it in a buffer. To distinguish this occurrence of 「は／ない／か」 from any other location in the sentence, it is preferable that the different part acquisition unit 24 store the different part in association with “9”, which indicates its position in the sentence. Here, “9” is the offset (the number of characters counted from the beginning of the sentence) of 「は」, the first character of 「は／ない／か」. The first division result and the other division result A have identical divided words up to 「自由形／式／で／間違い」, so the different part acquisition unit 24 acquires no different part for those divided words.

次に、相違部分取得部２４は、第一分割結果と他分割結果Ｂとの相違部分「自由／形式」、および「は／ない／か」を取得する。そして、相違部分取得部２４は、相違部分「自由／形式」、および「は／ない／か」をバッファに一時蓄積する。なお、相違部分取得部２４は、相違部分「自由／形式」と、文中の位置を示す「１」とを対応付けて蓄積することは好適である。また、相違部分取得部２４は、相違部分「は／ない／か」と、文中の位置を示す「９」とを対応付けて蓄積することは好適である。また、第一分割結果と他分割結果Ｂとは、「で／間違い」の中の分割単語は同一であり、相違部分取得部２４は、「で／間違い」の分割単語に関して、相違部分を取得しない。 Next, the different part acquisition unit 24 acquires the different parts 「自由／形式」 and 「は／ない／か」 between the first division result and the other division result B, and temporarily stores them in the buffer. It is preferable that the different part acquisition unit 24 store 「自由／形式」 in association with “1”, which indicates its position in the sentence, and store 「は／ない／か」 in association with “9”. The first division result and the other division result B have identical divided words for 「で／間違い」, so the different part acquisition unit 24 acquires no different part for those divided words.
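The extraction of different parts described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `boundaries`, `spans`, and `different_parts` are hypothetical, and the approach (grouping the words of the other division result between word boundaries that both results share) is one assumed way to realize the behavior of the different part acquisition unit 24. Offsets are 1-based character counts, as in the example above.

```python
from itertools import accumulate


def boundaries(words):
    """Cumulative end position of each word, e.g. ["ab", "c"] -> [2, 3]."""
    return list(accumulate(len(w) for w in words))


def spans(words):
    """(start, end) character span of each word, 0-based and end-exclusive."""
    ends = boundaries(words)
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def different_parts(first, other):
    """Return [(offset, words)] for each stretch where `other` is divided
    differently from `first`. `offset` is the 1-based position of the
    first character of the stretch (an assumption matching the text)."""
    if "".join(first) != "".join(other):
        raise ValueError("both results must divide the same sentence")
    common = set(boundaries(first)) & set(boundaries(other))
    first_spans = set(spans(first))
    parts, group, start, pos = [], [], 0, 0
    for word in other:
        group.append(word)
        pos += len(word)
        if pos in common:
            # The stretch [start, pos) is identical in both results only
            # when it is a single word that `first` also contains.
            if not (len(group) == 1 and (start, pos) in first_spans):
                parts.append((start + 1, tuple(group)))
            group, start = [], pos
    return parts


first = ["自由形", "式", "で", "間違い", "はな", "いか"]
result_a = ["自由形", "式", "で", "間違い", "は", "ない", "か"]
result_b = ["自由", "形式", "で", "間違い", "は", "ない", "か"]
print(different_parts(first, result_a))
print(different_parts(first, result_b))
```

For the example sentence, the sketch reports 「は／ない／か」 at offset 9 for result A, and both 「自由／形式」 at offset 1 and 「は／ない／か」 at offset 9 for result B.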

次に、分割情報取得部２５は、１箇所目の相違部分「自由／形式」をバッファから取得する。ここで、分割情報取得部２５は、相違部分「自由／形式」を一つだけ取得する。しかし、１箇所目の相違部分「自由／形式」に関して、他分割結果Ａから取得された相違部分と他分割結果Ｂから取得された相違部分とが共通しないので、予め決められた条件を満たさない、と分割情報取得部２５は判断する。そして、分割情報取得部２５は、１箇所目の相違部分「自由／形式」に関して、分割情報を構成しない。 Next, the division information acquisition unit 25 acquires the different part at the first location, 「自由／形式」, from the buffer. Here, the division information acquisition unit 25 acquires only one instance of the different part 「自由／形式」. For this first location, the different part acquired from the other division result A and the different part acquired from the other division result B do not match, so the division information acquisition unit 25 determines that the predetermined condition is not satisfied. The division information acquisition unit 25 therefore forms no division information for the different part 「自由／形式」.

次に、分割情報取得部２５は、２箇所目の相違部分「は／ない／か」をバッファから取得する。ここで、分割情報取得部２５は、相違部分「は／ない／か」を二つ取得する。そして、２箇所目の相違部分「は／ない／か」に関して、他分割結果Ａから取得された相違部分と他分割結果Ｂから取得された相違部分とが共通するので、予め決められた条件を満たす、と分割情報取得部２５は判断する。 Next, the division information acquisition unit 25 acquires the different part at the second location, 「は／ない／か」, from the buffer. Here, the division information acquisition unit 25 acquires two instances of the different part 「は／ない／か」. For this second location, the different part acquired from the other division result A and the different part acquired from the other division result B match, so the division information acquisition unit 25 determines that the predetermined condition is satisfied.

次に、分割情報取得部２５は、２箇所目の相違部分「は／ない／か」に対応する文字列「はないか」を取得する。また、分割情報取得部２５は、相違部分である２以上の分割単語「は／ない／か」を取得する。次に、分割情報取得部２５は、単語「はないか」と２以上の分割単語「は／ない／か」を用いて、分割情報「はないか：は／ない／か」を構成する。 Next, the division information acquisition unit 25 acquires the character string 「はないか」 corresponding to the second different part 「は／ない／か」, and acquires the two or more divided words 「は／ない／か」 that make up the different part. It then uses the word 「はないか」 and the divided words 「は／ない／か」 to form the division information 「はないか：は／ない／か」.

次に、辞書登録部２６は、構成された分割情報「はないか：は／ない／か」を単語分割用辞書１１に蓄積する。 Next, the dictionary registration unit 26 accumulates the resulting division information 「はないか：は／ない／か」 in the word division dictionary 11.
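The matching check and the construction of division information in the example above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_division_info`, the representation of a different part as an `(offset, words)` tuple, and the use of a Python `dict` as a stand-in for the word division dictionary 11 are all hypothetical. The predetermined condition is taken to be the one used in this example, namely that the different part obtained from A and the one obtained from B match at the same position.

```python
def build_division_info(parts_a, parts_b):
    """Return {word: divided_words} for every different part that appears,
    at the same offset and with the same division, in both lists.

    Each part is an (offset, words) tuple, where `words` is a tuple of
    divided words. Parts with fewer than two words are skipped, because
    division information pairs a word with two or more divided words.
    """
    matching = set(parts_a) & set(parts_b)
    return {
        "".join(words): list(words)
        for _offset, words in sorted(matching)
        if len(words) >= 2
    }


# Different parts from the worked example (offsets are 1-based).
parts_a = [(9, ("は", "ない", "か"))]
parts_b = [(1, ("自由", "形式")), (9, ("は", "ない", "か"))]

word_division_dictionary = {}  # stand-in for the word division dictionary 11
word_division_dictionary.update(build_division_info(parts_a, parts_b))

for word, divided in word_division_dictionary.items():
    print(f"{word}：{'／'.join(divided)}")
```

With the example inputs, only 「はないか」 is registered; 「自由／形式」 is dropped because it was obtained from division result B alone.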

以上、本実施の形態によれば、精度の高い単語分割用辞書を得ることができる。 As described above, according to the present embodiment, it is possible to obtain a word segmentation dictionary with high accuracy.

なお、本実施の形態によれば、分割情報を登録する条件は、上述したように種々考えられる。 Note that, according to the present embodiment, various conditions for registering division information can be considered as described above.

また、本実施の形態によれば、他の単語分割装置が３以上の多数存在し、共通する相違部分の割合（数でも良い）の閾値を大きくすれば、大きくするほど、精度の高い単語分割用辞書を構築できる。 Also, according to the present embodiment, when there are three or more other word division devices, the higher the threshold is set for the proportion (or the number) of matching different parts, the more accurate the resulting word division dictionary.
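A threshold on the proportion of other word division devices that agree on a different part might look like the sketch below. The function name `agreed_parts`, the parameter `min_ratio`, and the third device's output are assumptions introduced purely for illustration; the source states only that a proportion or a count may be used as the threshold.

```python
from collections import Counter


def agreed_parts(parts_per_device, min_ratio):
    """Keep the different parts reported by at least `min_ratio` of the
    other word division devices.

    `parts_per_device` holds one list of (offset, words) tuples per device.
    """
    n = len(parts_per_device)
    if n == 0:
        return []
    # Count each part at most once per device.
    counts = Counter(p for parts in parts_per_device for p in set(parts))
    return sorted(p for p, c in counts.items() if c / n >= min_ratio)


hanaika = (9, ("は", "ない", "か"))
jiyuu = (1, ("自由", "形式"))
# Devices A and B follow the worked example; the third list is hypothetical.
parts_per_device = [[hanaika], [jiyuu, hanaika], [jiyuu, hanaika]]

print(agreed_parts(parts_per_device, 1.0))  # unanimous agreement required
print(agreed_parts(parts_per_device, 0.6))  # two out of three is enough
```

Raising `min_ratio` admits fewer entries, which corresponds to the trade-off stated above: a stricter threshold yields a smaller but more reliable dictionary.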

さらに、本実施の形態における辞書登録装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、記録媒体に、１以上の単語と、単語と当該単語を分割した結果である２以上の分割単語の組である１以上の分割情報とを格納し得る単語分割用辞書を格納しており、コンピュータを、一の単語分割装置が一の文を分割した結果である第一分割結果を取得する第一分割結果取得部と、前記一の単語分割装置ではない単語分割装置である２以上の他単語分割装置が、前記一の文を分割した結果である２以上の他分割結果を取得する他分割結果取得部と、前記２以上の各他分割結果に含まれる部分であり、前記第一分割結果と前記２以上の各他分割結果との相違する部分である１以上の相違部分を取得する相違部分取得部と、前記相違部分取得部が取得した１以上の相違部分が予め決められた条件を満たす場合、１以上のいずれかの相違部分を用いて、当該相違部分に対応する文字列である単語と、当該相違部分である２以上の単語とを有する分割情報を構成する分割情報取得部と、前記分割情報を前記単語分割用辞書に蓄積する辞書登録部として機能させるためのプログラム、である。 Furthermore, the software that implements the dictionary registration device in the present embodiment is the following program. That is, with a word division dictionary stored on a recording medium, the dictionary being capable of storing one or more words and one or more pieces of division information, each being a pair of a word and two or more divided words that result from dividing that word, this program causes a computer to function as: a first division result acquisition unit that acquires a first division result, which is the result of one word division device dividing one sentence; an other division result acquisition unit that acquires two or more other division results, which are the results of two or more other word division devices, each being a word division device other than the one word division device, dividing the one sentence; a different part acquisition unit that acquires one or more different parts, each being a part that is included in one of the two or more other division results and that differs between the first division result and that other division result; a division information acquisition unit that, when the one or more different parts acquired by the different part acquisition unit satisfy a predetermined condition, uses any of the one or more different parts to form division information having a word, which is the character string corresponding to that different part, and the two or more words that constitute that different part; and a dictionary registration unit that accumulates the division information in the word division dictionary.

また、上記プログラムにおいて、前記他分割結果取得部は、２つの他分割結果である、第二分割結果および第三分割結果を取得し、前記相違部分取得部は、前記第一分割結果と前記第二分割結果との相違部分である第一相違部分と、前記第一分割結果と前記第三分割結果との相違部分である第二相違部分とを取得し、前記分割情報取得部は、前記第一相違部分と前記第二相違部分とが共通する場合、当該第一相違部分を用いて、当該第一相違部分に対応する文字列である単語と、当該第一相違部分である２以上の単語とを有する分割情報を構成するものとして、コンピュータを機能させることは好適である。 In the above program, it is preferable to cause the computer to function such that: the other division result acquisition unit acquires two other division results, namely a second division result and a third division result; the different part acquisition unit acquires a first different part, which is the difference between the first division result and the second division result, and a second different part, which is the difference between the first division result and the third division result; and the division information acquisition unit, when the first different part and the second different part match, uses the first different part to form division information having a word that is the character string corresponding to the first different part and the two or more words that constitute the first different part.

（実施の形態３）
本実施の形態において、実施の形態２で説明した辞書登録装置２が含まれる単語分割装置３について説明する。
(Embodiment 3)
In the present embodiment, the word division device 3, which includes the dictionary registration device 2 described in Embodiment 2, is described.

また、本実施の形態において、他単語分割装置も含まれていても良い。なお、他単語分割装置は、例えば、上述した「MeCab 0.98」「JUMAN 6.0」「KyTea 0.3.0」「ChaSen 2.3.3」等である。 In the present embodiment, other word division devices may also be included. The other word division devices are, for example, the aforementioned “MeCab 0.98”, “JUMAN 6.0”, “KyTea 0.3.0”, or “ChaSen 2.3.3”.

図９は、本実施の形態における単語分割装置３のブロック図である。単語分割装置３は、辞書登録装置２、２以上の他単語分割装置３１、受付部１２、第一分割部１３、出力部１４を備える。 FIG. 9 is a block diagram of word segmentation apparatus 3 in the present embodiment. The word division device 3 includes a dictionary registration device 2, two or more other word division devices 31, a reception unit 12, a first division unit 13, and an output unit 14.

次に、単語分割装置３の動作について、図１０のフローチャートを用いて説明する。図１０のフローチャートにおいて、図２のフローチャートと同一のステップについて説明を省略する。 Next, the operation of the word division device 3 is described using the flowchart of FIG. 10. In the flowchart of FIG. 10, description of the steps identical to those in the flowchart of FIG. 2 is omitted.

（ステップＳ１００１）受付部１２は、カウンタｉに、１を代入する。 (Step S1001) The reception unit 12 assigns 1 to the counter i.

（ステップＳ１００２）受付部１２は、ｉ番目の他単語分割装置３１が存在するか否かを判断する。ｉ番目の他単語分割装置３１が存在すればステップＳ１００３に行き、ｉ番目の他単語分割装置３１が存在しなければステップＳ１００５に行く。 (Step S1002) The reception unit 12 determines whether or not the i-th other word dividing device 31 exists. If the i-th other word dividing device 31 exists, the process goes to step S1003, and if the i-th other word dividing device 31 does not exist, the process goes to step S1005.

（ステップＳ１００３）ｉ番目の他単語分割装置３１は、受付部１２から文を受け付け、当該文に対して、単語に分割する処理を行う。そして、ｉ番目の他単語分割装置３１は、他分割結果を取得する。 (Step S1003) The i-th other word dividing device 31 receives a sentence from the receiving unit 12, and performs a process of dividing the sentence into words. Then, the i-th other word dividing device 31 acquires the other division result.

（ステップＳ１００４）受付部１２は、カウンタｉを１、インクリメントする。ステップＳ１００２に戻る。 (Step S1004) The reception unit 12 increments the counter i by one. The process returns to step S1002.

（ステップＳ１００５）辞書登録装置２は、辞書登録処理を行う。辞書登録処理は、図７を用いて説明した処理である。ステップＳ２０１に戻る。 (Step S1005) The dictionary registration device 2 performs a dictionary registration process. The dictionary registration process is the process described with reference to FIG. The process returns to step S201.

なお、図１０のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。 In the flowchart of FIG. 10, processing ends upon power-off or an interrupt that terminates processing.
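Steps S1001 to S1005 can be sketched as a simple loop. The sketch is illustrative only: the function `process_sentence` and its callable parameters are hypothetical stand-ins for the first division unit 13, the other word division devices 31, and the dictionary registration process of FIG. 7, and it assumes that the steps shared with FIG. 2 (omitted in the text) have already produced the first division result before step S1001.

```python
def process_sentence(sentence, first_divider, other_dividers, register):
    """One pass over a received sentence, following FIG. 10.

    first_divider:  callable standing in for the first division unit 13
    other_dividers: list of callables standing in for the devices 31
    register:       callable standing in for the dictionary registration
                    process (step S1005)
    """
    # Assumed to correspond to the steps shared with FIG. 2.
    first_result = first_divider(sentence)

    other_results = []
    i = 1                                    # S1001: set counter i to 1
    while i <= len(other_dividers):          # S1002: does the i-th device exist?
        divide = other_dividers[i - 1]
        other_results.append(divide(sentence))  # S1003: obtain other division result
        i += 1                               # S1004: increment i

    register(first_result, other_results)    # S1005: dictionary registration
    return first_result


# Usage with toy dividers (hypothetical, for illustration only).
registered = []
out = process_sentence(
    "abcd",
    lambda s: [s[:2], s[2:]],
    [lambda s: [s[:1], s[1:]], lambda s: [s[:3], s[3:]]],
    lambda first, others: registered.append((first, others)),
)
print(out, registered)
```

In the actual flowchart, control returns to step S201 after step S1005 so that the next sentence can be received; a caller would express that by invoking the function once per sentence.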

以上、本実施の形態によれば、精度の高い単語分割用辞書を用いて、文を２以上の単語に高速に分割できる。 As described above, according to the present embodiment, it is possible to divide a sentence into two or more words at high speed using a high-precision word division dictionary.

また、本実施の形態によれば、文を２以上の単語に分割する処理を行いながら、単語分割用辞書を充実させていくことができる。 Further, according to the present embodiment, it is possible to enrich the word division dictionary while performing a process of dividing a sentence into two or more words.

なお、本実施の形態によれば、単語分割装置３は、２以上の他単語分割装置３１を具備した。しかし、２以上の他単語分割装置３１は、単語分割装置３の外部に存在しても良い。かかる場合、単語分割装置３は、辞書登録装置２、受付部１２、第一分割部１３、出力部１４を備える。そして、かかる場合、単語分割装置３は、２以上の各他単語分割装置３１から、他分割結果を取得する。 According to the present embodiment, the word dividing device 3 includes two or more other word dividing devices 31. However, two or more other word dividing devices 31 may exist outside the word dividing device 3. In such a case, the word dividing device 3 includes a dictionary registration device 2, a receiving unit 12, a first dividing unit 13, and an output unit 14. In such a case, the word dividing device 3 acquires the other division result from each of the two or more other word dividing devices 31.

さらに、本実施の形態における単語分割装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、記録媒体に、１以上の単語と、単語と当該単語を分割した結果である２以上の分割単語の組である１以上の分割情報とを格納し得る単語分割用辞書を格納しており、コンピュータを、一の単語分割装置が一の文を分割した結果である第一分割結果を取得する第一分割結果取得部と、前記一の単語分割装置ではない単語分割装置である２以上の他単語分割装置が、前記一の文を分割した結果である２以上の他分割結果を取得する他分割結果取得部と、前記２以上の各他分割結果に含まれる部分であり、前記第一分割結果と前記２以上の各他分割結果との相違する部分である１以上の相違部分を取得する相違部分取得部と、前記相違部分取得部が取得した１以上の相違部分が予め決められた条件を満たす場合、１以上のいずれかの相違部分を用いて、当該相違部分に対応する文字列である単語と、当該相違部分である２以上の単語とを有する分割情報を構成する分割情報取得部と、前記分割情報を前記単語分割用辞書に蓄積する辞書登録部として機能させるためのプログラム、である。 Furthermore, the software that implements the word division device in the present embodiment is the following program. That is, with a word division dictionary stored on a recording medium, the dictionary being capable of storing one or more words and one or more pieces of division information, each being a pair of a word and two or more divided words that result from dividing that word, this program causes a computer to function as: a first division result acquisition unit that acquires a first division result, which is the result of one word division device dividing one sentence; an other division result acquisition unit that acquires two or more other division results, which are the results of two or more other word division devices, each being a word division device other than the one word division device, dividing the one sentence; a different part acquisition unit that acquires one or more different parts, each being a part that is included in one of the two or more other division results and that differs between the first division result and that other division result; a division information acquisition unit that, when the one or more different parts acquired by the different part acquisition unit satisfy a predetermined condition, uses any of the one or more different parts to form division information having a word, which is the character string corresponding to that different part, and the two or more words that constitute that different part; and a dictionary registration unit that accumulates the division information in the word division dictionary.

また、上記プログラムにおいて、２以上の他単語分割装置として、コンピュータをさらに機能させることは好適である。 In the above program, it is preferable to further cause the computer to function as two or more other word dividing devices.

また、図１１は、本明細書で述べたプログラムを実行して、上述した種々の実施の形態の辞書登録装置等を実現するコンピュータの外観を示す。上述の実施の形態は、コンピュータハードウェア及びその上で実行されるコンピュータプログラムで実現され得る。図１１は、このコンピュータシステム３００の概観図であり、図１２は、システム３００のブロック図である。 FIG. 11 shows the external appearance of a computer that executes the program described in this specification to realize the dictionary registration apparatus and the like of the various embodiments described above. The above-described embodiments can be realized by computer hardware and a computer program executed thereon. FIG. 11 is an overview diagram of the computer system 300, and FIG. 12 is a block diagram of the system 300.

図１１において、コンピュータシステム３００は、ＣＤ−ＲＯＭドライブを含むコンピュータ３０１と、キーボード３０２と、マウス３０３と、モニタ３０４とを含む。 In FIG. 11, a computer system 300 includes a computer 301 including a CD-ROM drive, a keyboard 302, a mouse 303, and a monitor 304.

図１２において、コンピュータ３０１は、ＣＤ−ＲＯＭドライブ３０１２に加えて、ＭＰＵ３０１３と、ＭＰＵ３０１３、ＣＤ−ＲＯＭドライブ３０１２に接続されたバス３０１４と、ブートアッププログラム等のプログラムを記憶するためのＲＯＭ３０１５と、ＭＰＵ３０１３に接続され、アプリケーションプログラムの命令を一時的に記憶するとともに一時記憶空間を提供するためのＲＡＭ３０１６と、アプリケーションプログラム、システムプログラム、及びデータを記憶するためのハードディスク３０１７とを含む。ここでは、図示しないが、コンピュータ３０１は、さらに、ＬＡＮへの接続を提供するネットワークカードを含んでも良い。 In FIG. 12, the computer 301 includes, in addition to the CD-ROM drive 3012, an MPU 3013; a bus 3014 connected to the MPU 3013 and the CD-ROM drive 3012; a ROM 3015 for storing programs such as a boot-up program; a RAM 3016 that is connected to the MPU 3013, temporarily stores instructions of application programs, and provides temporary storage space; and a hard disk 3017 for storing application programs, system programs, and data. Although not shown here, the computer 301 may further include a network card that provides a connection to a LAN.

コンピュータシステム３００に、上述した実施の形態の辞書登録装置等の機能を実行させるプログラムは、ＣＤ−ＲＯＭ３１０１に記憶されて、ＣＤ−ＲＯＭドライブ３０１２に挿入され、さらにハードディスク３０１７に転送されても良い。これに代えて、プログラムは、図示しないネットワークを介してコンピュータ３０１に送信され、ハードディスク３０１７に記憶されても良い。プログラムは実行の際にＲＡＭ３０１６にロードされる。プログラムは、ＣＤ−ＲＯＭ３１０１またはネットワークから直接、ロードされても良い。 A program that causes the computer system 300 to execute the functions of the dictionary registration device and the like of the above-described embodiment may be stored in the CD-ROM 3101, inserted into the CD-ROM drive 3012, and further transferred to the hard disk 3017. Alternatively, the program may be transmitted to the computer 301 via a network (not shown) and stored in the hard disk 3017. The program is loaded into the RAM 3016 at the time of execution. The program may be loaded directly from the CD-ROM 3101 or the network.

プログラムは、コンピュータ３０１に、上述した実施の形態の辞書登録装置等の機能を実行させるオペレーティングシステム（ＯＳ）、またはサードパーティープログラム等は、必ずしも含まなくても良い。プログラムは、制御された態様で適切な機能（モジュール）を呼び出し、所望の結果が得られるようにする命令の部分のみを含んでいれば良い。コンピュータシステム３００がどのように動作するかは周知であり、詳細な説明は省略する。 The program does not necessarily include an operating system (OS) or a third-party program that causes the computer 301 to execute functions such as the dictionary registration device according to the above-described embodiment. The program only needs to include an instruction portion that calls an appropriate function (module) in a controlled manner and obtains a desired result. How the computer system 300 operates is well known and will not be described in detail.

なお、上記プログラムにおいて、情報を送信する送信ステップや、情報を受信する受信ステップなどでは、ハードウェアによって行われる処理、例えば、送信ステップにおけるモデムやインターフェースカードなどで行われる処理（ハードウェアでしか行われない処理）は含まれない。 In the above program, the transmission step of transmitting information, the reception step of receiving information, and the like do not include processing performed by hardware, for example, processing performed by a modem or an interface card in the transmission step (processing that can be performed only by hardware).

また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

また、上記各実施の形態において、一の装置に存在する２以上の通信手段は、物理的に一の媒体で実現されても良いことは言うまでもない。 Further, in each of the above embodiments, it goes without saying that two or more communication units existing in one apparatus may be physically realized by one medium.

また、上記各実施の形態において、各処理（各機能）は、単一の装置（システム）によって集中処理されることによって実現されてもよく、あるいは、複数の装置によって分散処理されることによって実現されてもよい。 In each of the above embodiments, each process (each function) may be realized by centralized processing by a single device (system), or by distributed processing by a plurality of devices. May be.

本発明は、以上の実施の形態に限定されることなく、種々の変更が可能であり、それらも本発明の範囲内に包含されるものであることは言うまでもない。 The present invention is not limited to the above-described embodiments, and various modifications are possible, and it goes without saying that these are also included in the scope of the present invention.

以上のように、本発明にかかる辞書登録装置は、精度の高い単語分割用辞書を得ることができる、という効果を有し、辞書登録装置等として有用である。 As described above, the dictionary registration device according to the present invention has an effect of being able to obtain a highly accurate word segmentation dictionary, and is useful as a dictionary registration device or the like.

１、３単語分割装置
２辞書登録装置
１１単語分割用辞書
１２受付部
１３第一分割部
１４出力部
２２第一分割結果取得部
２３他分割結果取得部
２４相違部分取得部
２５分割情報取得部
２６辞書登録部
３１他単語分割装置
DESCRIPTION OF SYMBOLS
1, 3 Word division device
2 Dictionary registration device
11 Word division dictionary
12 Reception unit
13 First division unit
14 Output unit
22 First division result acquisition unit
23 Other division result acquisition unit
24 Different part acquisition unit
25 Division information acquisition unit
26 Dictionary registration unit
31 Other word division device

Claims

A dictionary registration device comprising:
a word division dictionary capable of storing one or more words and one or more pieces of division information, each being a pair of a word and two or more divided words that result from dividing that word;
a first division result acquisition unit that acquires a first division result, which is the result of one word division device dividing one sentence;
an other division result acquisition unit that acquires two or more other division results, which are the results of two or more other word division devices, each being a word division device other than the one word division device, dividing the one sentence;
a different part acquisition unit that acquires one or more different parts, each being a part that is included in one of the two or more other division results and that differs between the first division result and that other division result;
a division information acquisition unit that, when the one or more different parts acquired by the different part acquisition unit satisfy a predetermined condition, uses any of the one or more different parts to form division information having a word that is the character string corresponding to that different part and the two or more words that constitute that different part; and
a dictionary registration unit that accumulates the division information in the word division dictionary.

The dictionary registration device according to claim 1, wherein:
the other division result acquisition unit acquires two other division results, namely a second division result and a third division result;
the different part acquisition unit acquires a first different part, which is the difference between the first division result and the second division result, and a second different part, which is the difference between the first division result and the third division result; and
the division information acquisition unit, when the first different part and the second different part match, uses the first different part to form division information having a word that is the character string corresponding to the first different part and the two or more words that constitute the first different part.

A word division device comprising:
the dictionary registration device according to claim 1 or 2;
a reception unit that receives a sentence having one or more characters;
a first division unit that, using the word division dictionary constructed by the dictionary registration device, acquires from the word division dictionary the maximum-length word that matches a character string constituting the sentence received by the reception unit, acquires the two or more divided words corresponding to the acquired word, and thereby obtains a first division result, which is a set of two or more words into which the sentence is divided; and
an output unit that outputs the first division result.

The word division device according to claim 3, further comprising two or more other word division devices.

A dictionary registration method realized by a first division result acquisition unit, an other division result acquisition unit, a different part acquisition unit, a division information acquisition unit, and a dictionary registration unit, wherein a recording medium stores a word division dictionary capable of storing one or more words and one or more pieces of division information, each being a pair of a word and two or more divided words that result from dividing that word, the method comprising:
a first division result acquisition step in which the first division result acquisition unit acquires a first division result, which is the result of one word division device dividing one sentence;
an other division result acquisition step in which the other division result acquisition unit acquires two or more other division results, which are the results of two or more other word division devices, each being a word division device other than the one word division device, dividing the one sentence;
a different part acquisition step in which the different part acquisition unit acquires one or more different parts, each being a part that is included in one of the two or more other division results and that differs between the first division result and that other division result;
a division information acquisition step in which the division information acquisition unit, when the one or more different parts acquired in the different part acquisition step satisfy a predetermined condition, uses any of the one or more different parts to form division information having a word that is the character string corresponding to that different part and the two or more words that constitute that different part; and
a dictionary registration step in which the dictionary registration unit accumulates the division information in the word division dictionary.

A word division method realized by a reception unit, a first division unit, an output unit, and a dictionary registration device, wherein a recording medium stores a word division dictionary capable of storing one or more words and one or more pieces of division information, each being a pair of a word and two or more divided words that result from dividing that word, the method comprising:
a reception step in which the reception unit receives a sentence having one or more characters;
a first division step in which the first division unit, using the word division dictionary, acquires from the word division dictionary the maximum-length word that matches a character string constituting the sentence received by the reception unit, acquires the two or more divided words corresponding to the acquired word, and thereby obtains a first division result, which is a set of two or more words into which the sentence is divided;
an output step in which the output unit outputs the first division result; and
the steps, performed by the dictionary registration device, that constitute the dictionary registration method according to claim 6.

A program for causing a computer, with a word division dictionary constructed by the dictionary registration device according to claim 1 or 2 stored on a recording medium, to function as:
a first division result acquisition unit that acquires a first division result, which is the result of one word division device dividing one sentence;
an other division result acquisition unit that acquires two or more other division results, which are the results of two or more other word division devices, each being a word division device other than the one word division device, dividing the one sentence;
a different part acquisition unit that acquires one or more different parts, each being a part that is included in one of the two or more other division results and that differs between the first division result and that other division result;
a division information acquisition unit that, when the one or more different parts acquired by the different part acquisition unit satisfy a predetermined condition, uses any of the one or more different parts to form division information having a word that is the character string corresponding to that different part and the two or more words that constitute that different part; and
a dictionary registration unit that accumulates the division information in the word division dictionary.

A program for causing a computer to function as:
a reception unit that receives a sentence having one or more characters;
a first division unit that, using the word division dictionary constructed by the dictionary registration device according to claim 1 or 2, acquires from the word division dictionary the maximum-length word that matches a character string constituting the sentence received by the reception unit, acquires the two or more divided words corresponding to the acquired word, and thereby obtains a first division result, which is a set of two or more words into which the sentence is divided; and
an output unit that outputs the first division result.