JP2014106707A

JP2014106707A - Word division device, data structure of dictionary for word division, word division method and program

Info

Publication number: JP2014106707A
Application number: JP2012258722A
Authority: JP
Inventors: Manabu Satsusano; 学颯々野
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2012-11-27
Filing date: 2012-11-27
Publication date: 2014-06-09
Anticipated expiration: 2032-11-27
Also published as: JP5697648B2

Abstract

PROBLEM TO BE SOLVED: To solve the problem that a sentence cannot be divided into two or more words at high speed.
SOLUTION: A sentence can be divided into two or more words at high speed by a word division device including: a word division dictionary that can store one or more words and one or more pieces of division information, each being a pair of a word and the two or more divided words obtained by dividing that word; a first division unit that performs divided word acquisition processing, in which it acquires from the word division dictionary the word matching the longest character string starting at a sentence pointer (initially the beginning of the accepted sentence) and, if two or more divided words correspond to the acquired word, acquires those divided words in place of the matching word, then moves the sentence pointer to the character following the matching word and repeats the divided word acquisition processing up to the word containing the last character of the sentence, thereby acquiring a first division result that is a set of two or more words obtained by dividing the sentence; and an output unit that outputs the first division result.

Description

The present invention relates to a word division device and the like that divides a sentence into two or more words.

Conventionally, there has been a technique that raises the recognition accuracy of natural language processing by making effective use of a first corpus that has been divided into words and a second corpus that has not, so that word n-gram probabilities are calculated with high accuracy (see Patent Document 1).

Conventionally, there has also been a morphological analysis system that a user can easily customize (see Patent Document 2). In this system, a general-purpose morphological analysis unit divides a character string input to a text input unit into a plurality of morphemes by referring to a morphological analysis dictionary. Next, a pattern matching engine applies conversion processing to the morphemes produced by the general-purpose morphological analysis unit, referring to patterns that the user has written in a pattern file to instruct the division or combination of morphemes. The morphemes converted by the pattern matching engine are then output from an output generation unit as the analysis result.

Furthermore, there has conventionally been a technique for dividing a word string such as a sentence or a compound word into a correct sequence of words (see Patent Document 3).

Patent Document 1: JP 2006-31295 A (page 1, FIG. 1, etc.)
Patent Document 2: JP H10-40252 A (page 1, FIG. 1, etc.)
Patent Document 3: JP H07-262191 A (page 1, FIG. 1, etc.)

However, the prior art could not divide a sentence into two or more words at high speed.

The word division device of the first invention comprises: a word division dictionary that can store one or more words and one or more pieces of division information, each being a pair of a word and the two or more divided words obtained by dividing that word; a reception unit that accepts a sentence having one or more characters; a first division unit that, using the word division dictionary, acquires from the dictionary the maximum-length word matching a character string constituting the sentence accepted by the reception unit, acquires the two or more divided words corresponding to the acquired word, and thereby acquires a first division result that is a set of two or more words obtained by dividing the sentence; and an output unit that outputs the first division result.

With this configuration, a sentence can be divided into two or more words by very simple processing, and therefore at high speed.

The word division device of the second invention, relative to the first invention, further comprises: a second division result acquisition unit that acquires a second division result, which is a set of two or more words into which the sentence accepted by the reception unit is divided, by using a second division unit that divides a sentence into two or more words by an algorithm different from that of the first division unit; a determination unit that determines whether the first division result and the second division result differ; a division information acquisition unit that, when the determination unit determines that the two results differ, acquires the character string in the sentence corresponding to the location where they differ, acquires the two or more words in the second division result corresponding to that location, and constructs division information having the acquired character string as the word together with the acquired two or more words; and a dictionary registration unit that stores the division information in the word division dictionary.

With this configuration, the word division dictionary can be enriched.
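The way division information could be built from two differing division results can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions, not the procedure of the invention: it assumes both results divide the same sentence, and it reports only locations where a single word of the first result is covered exactly by two or more words of the second result. The names `boundaries` and `extract_division_info` are hypothetical.

```python
from itertools import accumulate


def boundaries(words):
    """Character offsets at which a division result ends a word."""
    return set(accumulate(len(w) for w in words))


def extract_division_info(first_result, second_result):
    """Return (word, divided_words) pairs for places where the two results differ.

    Assumption: only spans where one word of the first result is covered
    exactly by two or more words of the second result are reported.
    """
    if "".join(first_result) != "".join(second_result):
        raise ValueError("both results must divide the same sentence")

    second_bounds = boundaries(second_result) | {0}
    infos = []
    start = 0
    for word in first_result:
        end = start + len(word)
        if start in second_bounds and end in second_bounds:
            # Collect the words of the second result lying inside [start, end).
            parts, pos = [], 0
            for w in second_result:
                if start <= pos and pos + len(w) <= end:
                    parts.append(w)
                pos += len(w)
            if len(parts) >= 2:
                infos.append((word, tuple(parts)))
        start = end
    return infos


first = ["正夫", "は", "しっかり者", "だ"]        # e.g. a longest-match result
second = ["正夫", "は", "しっかり", "者", "だ"]   # e.g. a result from another analyzer
print(extract_division_info(first, second))
# [('しっかり者', ('しっかり', '者'))]
```

Each returned pair has the shape of one piece of division information (a word plus its divided words) and could then be registered in the dictionary.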

The word division device of the third invention, relative to the second invention, further comprises the second division unit, which divides the sentence accepted by the reception unit into two or more words by an algorithm different from that of the first division unit.

In the word division device of the fourth invention, relative to the second or third invention, the second division unit divides the sentence into two or more words by a morphological analysis algorithm that uses the Viterbi algorithm.

With this configuration, highly accurate division information can be registered in the word division dictionary.

According to the word division device of the present invention, a sentence can be divided into two or more words at high speed.

Block diagram of the word division device 1 in Embodiment 1
Flowchart explaining the operation of the word division device 1
Diagram showing the word division dictionary 11
Block diagram of the word division device 2 in Embodiment 2
Flowchart explaining the operation of the word division device 2
Block diagram of the word division device 3
Diagram showing experimental results of the word division device 1
Diagram showing experimental results of the word division device 1
Overview of the computer system in the above embodiments
Block diagram of the computer system in the above embodiments
Another block diagram of the word division device 2 in Embodiment 2

Hereinafter, embodiments of the word division device and the like will be described with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description may be omitted.

(Embodiment 1)
This embodiment describes a word division device 1 that divides a sentence into two or more words.

FIG. 1 is a block diagram of the word division device 1 in this embodiment. The word division device 1 includes a word division dictionary 11, a reception unit 12, a first division unit 13, and an output unit 14.

The word division dictionary 11 can store one or more words and one or more pieces of division information. A piece of division information is a pair of a word and two or more divided words. Divided words are the result of dividing the word. Examples of division information are "自由形式：自由／形式" and "はないか：は／ない／か". In "自由形式：自由／形式", "自由形式" (free format) is the word, and "自由" (free) and "形式" (format) are its divided words. In "はないか：は／ない／か", "はないか" is the word, and "は", "ない", and "か" are its divided words. A word may be understood to include any meaningful term, such as a morpheme or a collocation. A divided word is itself also a word.

In the word division dictionary 11, the one or more words and the one or more pieces of division information are preferably held in the same file or the same database. However, they may be held in separate files or separate databases. In other words, the specific data structure of the word division dictionary 11 does not matter; it only needs to hold one or more words and one or more pieces of division information.
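Since the concrete data structure is left open, one possible in-memory layout is sketched below. It is an assumption made for illustration, not a structure prescribed by the text: a single mapping holds both plain words (value `None`) and division information (value is a tuple of divided words). The type alias, the helper `divided_words`, and the presence of the individual entries "自由", "形式", "ない", and "か" are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical layout: each known word maps either to None (a plain word)
# or to the tuple of its divided words (a piece of division information).
WordDivisionDictionary = Dict[str, Optional[Tuple[str, ...]]]

dictionary: WordDivisionDictionary = {
    "自由": None,
    "形式": None,
    "自由形式": ("自由", "形式"),
    "は": None,
    "ない": None,
    "か": None,
    "はないか": ("は", "ない", "か"),
}


def divided_words(d: WordDivisionDictionary, word: str) -> Tuple[str, ...]:
    """Return the divided words for `word`, or the word itself if it has none."""
    parts = d[word]  # raises KeyError if the word is not in the dictionary
    return parts if parts is not None else (word,)


print(divided_words(dictionary, "自由形式"))  # ('自由', '形式')
print(divided_words(dictionary, "自由"))      # ('自由',)
```

Keeping words and division information in one mapping means a single lookup both confirms that a string is a known word and yields its divided words, which matches the stated preference for holding both in the same file or database.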

The word division dictionary 11 is preferably realized by a non-volatile recording medium, but can also be realized by a volatile recording medium. The process by which words and division information come to be stored in the word division dictionary 11 does not matter. For example, they may be stored via a recording medium, received over a communication line or the like, or entered through an input device.

The reception unit 12 accepts a sentence having one or more characters. The sentence may be incomplete; that is, it may be a collocation or the like. The language of the sentence does not matter. The sentence is usually in a language that is not written with spaces between words, such as Japanese, Chinese, Korean, or Mongolian. However, it may also be in a language such as English that is written with spaces between words. The sentence may also be, for example, a character string representing a URL or a file name. Here, acceptance is a concept that includes accepting information input from an input device such as a keyboard, mouse, or touch panel, receiving information transmitted over a wired or wireless communication line, and accepting information read from a recording medium such as an optical disc, magnetic disk, or semiconductor memory.

Any means may be used to input the sentence, such as a keyboard, a mouse, or a menu screen. The reception unit 12 can be realized by a device driver for an input means such as a keyboard, control software for a menu screen, or the like.

The first division unit 13 divides the sentence accepted by the reception unit 12 and acquires a first division result, which is a set of two or more words.
More specifically, using the word division dictionary, the first division unit 13 acquires from the dictionary the maximum-length word matching a character string constituting the sentence accepted by the reception unit 12, acquires the two or more divided words corresponding to the acquired word, and thereby acquires the first division result, which is a set of two or more words obtained by dividing the sentence. In more detail, the processing is as follows. The first division unit 13 acquires one or more character strings constituting the sentence accepted by the reception unit 12. For each of these character strings, it acquires the maximum-length matching word from the word division dictionary. Then, for each word acquired from the word division dictionary, it acquires the two or more divided words corresponding to that word, and thus acquires the first division result.

Still more specifically, the first division unit 13 operates, for example, as follows. First, it performs a first process of acquiring from the word division dictionary 11 the word that matches the maximum-length character string starting at the sentence pointer, which initially indicates the beginning of the sentence accepted by the reception unit 12. If two or more divided words correspond to the acquired word, it then performs a second process of acquiring those divided words in place of the matching word. The first process and the second process together are called divided word acquisition processing. The first division unit 13 then moves the sentence pointer to the character following the matching word, and repeats the divided word acquisition processing up to the word containing the last character of the sentence. As a result, the first division unit 13 obtains the first division result, which is a set of two or more words obtained by dividing the sentence. If the word acquired in the first process is not a word included in any division information, the first division unit 13 keeps that word as it is. Although the first division result is a set of two or more words, it has a data structure from which the boundaries between those words can be determined.
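The divided word acquisition processing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the dictionary is assumed to be a mapping from each word to `None` or to a tuple of divided words, the pointer is 0-based, a character that matches no dictionary word is emitted as a one-character word (the text does not specify this case), and the function names and sample entries are hypothetical.

```python
def longest_word_at(dictionary, sentence, p):
    """First process: longest dictionary word starting at position p."""
    for end in range(len(sentence), p, -1):   # try the longest candidate first
        candidate = sentence[p:end]
        if candidate in dictionary:
            return candidate
    return sentence[p]  # assumption: unknown character becomes its own word


def first_division(dictionary, sentence):
    """Return the first division result as a list of words."""
    result = []
    p = 0  # sentence pointer (0-based in this sketch)
    while p < len(sentence):
        word = longest_word_at(dictionary, sentence, p)
        parts = dictionary.get(word)
        if parts:                  # second process: use the divided words instead
            result.extend(parts)
        else:                      # no division information: keep the word as is
            result.append(word)
        p += len(word)             # move to the character after the matched word
    return result


# Hypothetical dictionary assembled from examples in the text.
sample = {
    "正夫": None, "は": None, "だ": None,
    "しっかり者": ("しっかり", "者"),
    "自由形式": ("自由", "形式"),
}
print(first_division(sample, "正夫はしっかり者だ"))
# ['正夫', 'は', 'しっかり', '者', 'だ']
```

Returning a list keeps the word boundaries explicit, which is one way to satisfy the requirement that the boundaries in the first division result be determinable.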

The method used for the first process by the first division unit 13 does not matter. Known techniques such as the so-called longest match method (also called maximum match) can be used. The longest match method is described in, for example, Makoto Nagao (ed.), "自然言語処理" (Natural Language Processing), Iwanami Lecture Series Software Science 15, Iwanami Shoten, pp. 126-127.

To acquire the maximum-length character string starting at the sentence pointer, the first division unit 13 may proceed as follows. It determines whether the word division dictionary 11 contains a word matching the character string from the character indicated by the sentence pointer to the last character of the sentence (assumed to be the N-th character from the sentence pointer); call this character string A. If such a word exists, it acquires character string A. If not, it determines whether the word division dictionary 11 contains a word matching the character string from the character indicated by the sentence pointer to the (N-1)-th character from the sentence pointer; call this character string B. If such a word exists, it acquires character string B. If not, it continues in the same way, shortening the character string by one character at a time, and searches the word division dictionary 11 for the longest word among the character strings that start at the character indicated by the sentence pointer. In other words, the first division unit 13 may search the word division dictionary 11 starting from the longest unprocessed character string in the sentence and removing one character at a time, thereby acquiring the longest character string from the pointer p.
A trie, a known data structure, can be used by the first division unit 13 to detect the maximum-length character string in the sentence. Tries are described in (1) to (3) below, so a detailed description is omitted.
(1) Hiroyuki Tokunaga, "日本語入力を支える技術" (Technology Supporting Japanese Input), pp. 89-99
(2) Internet web page, URL "http://www.slideshare.net/higashiyama/ss-8738479"
(3) Internet web page, URL "http://nanika.osonae.com/DArray/dary.html"

The first division unit 13 can usually be realized by an MPU, memory, and the like. Its processing procedure is usually realized by software, which is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).
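As one way the software might detect the maximum-length matching string with a trie, the following sketch walks the trie once from the sentence pointer and remembers the last position at which a complete word ended. It is an illustrative assumption rather than the structure used in the cited references (which may, for example, use double arrays); the names `TrieNode`, `build_trie`, and `longest_match_length` are hypothetical.

```python
class TrieNode:
    """One node of a character trie."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.is_word = False   # True if a dictionary word ends at this node


def build_trie(words):
    """Build a trie containing every word in `words`."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def longest_match_length(root, sentence, p):
    """Length of the longest dictionary word starting at position p (0 if none)."""
    node = root
    best = 0
    for i in range(p, len(sentence)):
        node = node.children.get(sentence[i])
        if node is None:
            break                 # no dictionary word continues with this character
        if node.is_word:
            best = i - p + 1      # remember the longest complete word seen so far
    return best


root = build_trie(["は", "ない", "はないか"])
print(longest_match_length(root, "はないかな", 0))  # 4
print(longest_match_length(root, "はなし", 0))      # 1
```

Compared with shortening the candidate string one character at a time, this visits each character after the pointer at most once per lookup, so the work is bounded by the length of the longest dictionary word rather than by the remaining sentence length.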

The output unit 14 outputs the first division result acquired by the first division unit 13. Here, output is a concept that includes display on a display, projection using a projector, printing by a printer, sound output, transmission to an external device, storage on a recording medium, and delivery of the processing result to another processing device or another program. When the processing result is delivered to another program, the word division device 1 and that program together realize, for example, a speech recognition device or a machine translation device. That is, the first division result obtained by dividing the sentence can be used in, for example, speech recognition processing or machine translation processing.

The output unit 14 may or may not be considered to include an output device such as a display or a speaker. The output unit 14 can be realized by driver software for an output device, or by such driver software together with the output device.

Next, the operation of the word division device 1 will be described with reference to the flowchart of FIG. 2.

(Step S201) The reception unit 12 determines whether a sentence has been accepted. If a sentence has been accepted, the process goes to step S202; otherwise, it returns to step S201.

(Step S202) The first division unit 13 sets the sentence pointer p to 1. The sentence pointer p indicates the position in the sentence at which word acquisition starts.

(Step S203) The first division unit 13 searches the word division dictionary 11 for the word that matches the maximum-length character string starting at the character indicated by p in the sentence, and acquires that word from the word division dictionary 11.

(Step S204) The first division unit 13 determines whether the word acquired in step S203 is a word included in division information. If it is, the process goes to step S205; if not, it goes to step S206.

(Step S205) The first division unit 13 acquires from the word division dictionary 11 the two or more divided words corresponding to the word acquired in step S203, and appends them to a buffer. The initial value of the buffer is NULL. When appending, the first division unit 13 inserts a delimiter between the divided words. Any delimiter may be used, for example "／", a space, or "，". The process goes to step S207.

(Step S206) The first division unit 13 appends the word acquired in step S203 to the buffer, placing a delimiter between that word and the preceding and/or following word.

(Step S207) The first division unit 13 advances the pointer p by the length of the maximum-length character string.

(Step S208) The first division unit 13 determines whether all division processing has finished. If it has, the process goes to step S209; if not, it returns to step S203. Division processing can be said to have finished when the pointer p is at the position following the last character of the sentence.

(Step S209) The output unit 14 outputs the two or more words in the buffer. The process returns to step S201.
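Steps S202 to S209 can be traced in code roughly as follows. This sketch mirrors the flowchart's 1-based pointer and delimiter-separated buffer; it is an illustration under assumptions, not the actual implementation. The dictionary is assumed to map each word to `None` or to a tuple of divided words, an unmatched character is treated as a one-character word (a case the steps do not cover), and `divide_sentence` and the sample entries are hypothetical.

```python
def divide_sentence(sentence, dictionary, delimiter="/"):
    """Return the buffer contents after steps S202 to S208 (what S209 would output)."""
    buffer = ""                         # initial value of the buffer is empty (NULL)
    p = 1                               # S202: 1-based sentence pointer
    while p <= len(sentence):           # S208: stop once p is past the last character
        # S203: longest dictionary word starting at the character indicated by p.
        word = sentence[p - 1]          # assumption: fallback for unknown characters
        for end in range(len(sentence), p - 1, -1):
            if sentence[p - 1:end] in dictionary:
                word = sentence[p - 1:end]
                break
        # S204: is the word part of division information?
        parts = dictionary.get(word)
        if parts:
            piece = delimiter.join(parts)   # S205: divided words, delimiter-separated
        else:
            piece = word                    # S206: the word itself
        buffer = piece if not buffer else buffer + delimiter + piece
        p += len(word)                  # S207: advance by the matched length
    return buffer                       # S209: contents handed to the output unit


# Hypothetical dictionary assembled from examples in the text.
sample = {
    "正夫": None, "は": None, "だ": None,
    "しっかり者": ("しっかり", "者"),
}
print(divide_sentence("正夫はしっかり者だ", sample))
# 正夫/は/しっかり/者/だ
```

The delimiter is a parameter because the steps allow any delimiter; note that a delimiter that can also occur inside the sentence would make the boundaries ambiguous.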

In the flowchart of FIG. 2, the processing ends on power-off or on an interrupt that terminates the processing.
In the flowchart of FIG. 2, processing starts at the beginning of the accepted sentence and proceeds in order to its end. However, processing may instead start at the end of the accepted sentence and proceed from back to front. That is, in step S202 the first division unit 13 may set the sentence pointer p to the end of the sentence, and in step S207 it may move the pointer p back toward the beginning of the sentence by the length of the maximum-length character string. In that case, in step S203, the first division unit 13 searches the word division dictionary 11 for the word that matches the maximum-length character string obtained by extending from the character indicated by p toward the beginning of the sentence, and acquires that word from the word division dictionary 11.
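The back-to-front variant can be sketched in the same style. Again this is only an illustration under assumptions: the dictionary maps each word to `None` or to a tuple of divided words, `p` is an exclusive end index rather than the flowchart's 1-based pointer, an unmatched character becomes a one-character word, and `first_division_backward` and the sample entries are hypothetical.

```python
def first_division_backward(dictionary, sentence):
    """Divide `sentence` by longest match, scanning from its end to its beginning."""
    result = []
    p = len(sentence)                    # position just after the last unprocessed character
    while p > 0:
        word = sentence[p - 1]           # assumption: fallback for unknown characters
        for start in range(0, p):        # smallest start gives the longest candidate
            if sentence[start:p] in dictionary:
                word = sentence[start:p]
                break
        parts = dictionary.get(word)
        chunk = list(parts) if parts else [word]
        result = chunk + result          # prepend, since we are moving toward the front
        p -= len(word)                   # move the pointer back by the matched length
    return result


# Hypothetical dictionary assembled from examples in the text.
sample = {
    "正夫": None, "は": None, "だ": None,
    "しっかり者": ("しっかり", "者"),
}
print(first_division_backward(sample, "正夫はしっかり者だ"))
# ['正夫', 'は', 'しっかり', '者', 'だ']
```

For this sentence both directions give the same division, but in general forward and backward longest match can disagree where dictionary words overlap.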

A specific operation of the word division device 1 in this embodiment is described below.

Assume that FIG. 3 shows the word division dictionary 11. Each record of the word division dictionary 11 has the attributes "ID", "word", and "divided words". A record may also have other information such as part of speech and occurrence probability. Each record is classified as either a word or division information.

In a record classified as a word, the value of the attribute "divided words" is NULL ("−" in FIG. 3). Examples are the records with ID = 5, 6, 8, 9, 10, 11, 12, and 13 in FIG. 3. In a record classified as division information, the value of the attribute "divided words" contains two or more divided words, separated here by the delimiter "／". Examples are the records with ID = 1, 2, 3, 4, and 7 in FIG. 3. A record of the word division dictionary 11 may also have, as an attribute value, a flag indicating whether it is a word or division information.
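The record layout described above can be illustrated with the following sketch. It is hypothetical: FIG. 3 is not reproduced here, so the pairing of particular IDs with particular words is invented for illustration (only the word/division-information classification by ID follows the text), `None` stands in for NULL, and the names `Record`, `is_division_info`, and `to_lookup` are assumptions.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    record_id: int
    word: str
    divided_words: Optional[str]   # None stands for NULL; otherwise delimiter-separated


def is_division_info(record: Record) -> bool:
    """A record is division information when its divided-words attribute is not NULL."""
    return record.divided_words is not None


def to_lookup(records: Iterable[Record],
              delimiter: str = "/") -> Dict[str, Optional[Tuple[str, ...]]]:
    """Convert records into a word -> divided words (or None) mapping."""
    lookup = {}
    for r in records:
        if is_division_info(r):
            lookup[r.word] = tuple(r.divided_words.split(delimiter))
        else:
            lookup[r.word] = None
    return lookup


# Invented sample rows; the ID-to-word assignments are NOT taken from FIG. 3.
records = [
    Record(1, "しっかり者", "しっかり/者"),
    Record(2, "自由形式", "自由/形式"),
    Record(3, "はないか", "は/ない/か"),
    Record(5, "正夫", None),
    Record(6, "は", None),
    Record(8, "だ", None),
]
lookup = to_lookup(records)
print(lookup["はないか"])  # ('は', 'ない', 'か')
print(lookup["正夫"])      # None
```

Testing the divided-words attribute for NULL makes a separate word/division-information flag unnecessary, although the text notes that such a flag may also be stored.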

(Specific Example 1)
In this situation, assume that the reception unit 12 has accepted the sentence "正夫はしっかり者だ" ("Masao is a dependable person"). The first division unit 13 then sets the sentence pointer p to 1; that is, p is set to the position of "正" in the sentence.

Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the maximum-length character string starting at "正", the character indicated by p, and acquires the word "正夫".

Next, the first division unit 13 determines whether the acquired word "正夫" is a word included in division information. Here, it determines that the divided words for "正夫" are "−" (NULL), so "正夫" is not included in division information.

The first division unit 13 then appends the acquired word "正夫" to the buffer.

Next, the first division unit 13 calculates the character string length of the word "正夫", which is 2. It then advances the pointer p by this length, so that p indicates the position of "は" in the sentence.

Next, the first division unit 13 determines that the division processing has not yet finished.

Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the maximum-length character string starting at "は", the character indicated by p, and acquires the word "は".

Next, the first division unit 13 determines whether the acquired word "は" is a word included in division information. Here, it determines that the divided words for "は" are "−" (NULL), so "は" is not included in division information.

The first division unit 13 then appends the acquired word "は" to the buffer, inserting the delimiter "／" before it. The buffer now holds "正夫／は".

Next, the first division unit 13 calculates the character string length of the word "は", which is 1. It then advances the pointer p by this length, so that p indicates the position of "し" in the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「し」から、最大長の文字列と一致する単語「しっかり者」を検索し、取得する。 Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the longest character string starting at the character “し” pointed to by p in the sentence, and acquires the word “しっかり者”.

次に、第一分割部１３は、取得した単語「しっかり者」が、分割情報に含まれる単語であるか否かを判断する。つまり、単語「しっかり者」に対応する分割単語がＮＵＬＬでないので、第一分割部１３は、単語「しっかり者」が、分割情報に含まれる単語であると判断する。 Next, the first division unit 13 determines whether the acquired word “しっかり者” is a word included in the division information. Because the divided words corresponding to the word “しっかり者” are not NULL, the first division unit 13 determines that the word “しっかり者” is a word included in the division information.

そして、第一分割部１３は、単語「しっかり者」に対応する分割情報「しっかり／者」を、単語分割用辞書１１から取得する。 Then, the first division unit 13 acquires the division information “しっかり／者” corresponding to the word “しっかり者” from the word division dictionary 11.

そして、第一分割部１３は、区切り文字「／」と取得した単語「しっかり／者」とをバッファに追記する。そして、現在のバッファには「正夫／は／しっかり／者」が格納された。 Then, the first division unit 13 appends the delimiter “／” and the acquired words “しっかり／者” to the buffer. The buffer now holds “正夫／は／しっかり／者”.

次に、第一分割部１３は、単語「しっかり者」の文字列長「５」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「５」だけ進め、ポインタｐを文の「だ」の位置に設定する。 Next, the first dividing unit 13 calculates the character string length “5” of the word “firm person”. Next, the first dividing unit 13 advances the pointer p by “5” corresponding to the maximum length of the character string, and sets the pointer p to the position of “DA” in the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「だ」から、最大長の文字列と一致する単語「だ」を検索し、取得する。 Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the longest character string starting at the character “だ” pointed to by p in the sentence, and acquires the word “だ”.

次に、第一分割部１３は、取得した単語「だ」が、分割情報に含まれる単語であるか否かを判断する。つまり、第一分割部１３は、単語「だ」に対応する分割情報が「−（ＮＵＬＬ）」であると判断する。 Next, the first division unit 13 determines whether the acquired word “だ” is a word included in the division information. Here, the first division unit 13 determines that the division information corresponding to the word “だ” is “− (NULL)”.

そして、第一分割部１３は、区切り文字「／」と取得した単語「だ」とをバッファに追記する。そして、現在のバッファには「正夫／は／しっかり／者／だ」が格納された。 Then, the first division unit 13 appends the delimiter “／” and the acquired word “だ” to the buffer. The buffer now holds “正夫／は／しっかり／者／だ”.

次に、第一分割部１３は、単語「だ」の文字列長「１」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「１」だけ進め、ポインタｐを文の「だ」の次の位置に設定する。 Next, the first division unit 13 calculates the character string length “1” of the word “だ”. It then advances the pointer p by “1”, the length of the longest matching character string, which sets the pointer p to the position following “だ” in the sentence.

次に、第一分割部１３は、分割処理が終了した、と判断する。 Next, the first division unit 13 determines that the division process has been completed.

そして、出力部１４は、バッファ内の２以上の分割された単語列「正夫／は／しっかり／者／だ」を出力する。 Then, the output unit 14 outputs the sequence of two or more divided words in the buffer, “正夫／は／しっかり／者／だ”.
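
The walk-through of specific example 1 can be reproduced with a short sketch. This is only an illustration under stated assumptions: the dictionary contents stand in for FIG. 3 (which is not reproduced here), the function name `first_division` is hypothetical, the pointer is 0-based rather than 1-based, and the single-character fallback for text not found in the dictionary is an assumption that the description does not specify.

```python
# Hypothetical stand-in for the word division dictionary 11:
# word -> division information (None corresponds to "− (NULL)").
DICTIONARY = {
    "正夫": None,
    "は": None,
    "しっかり者": ["しっかり", "者"],
    "だ": None,
}


def first_division(sentence, dictionary):
    """Longest-match division that substitutes registered divided words."""
    max_len = max(len(word) for word in dictionary)
    result = []
    p = 0  # sentence pointer (0-based here; the description counts from 1)
    while p < len(sentence):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(sentence) - p), 0, -1):
            candidate = sentence[p:p + length]
            if candidate in dictionary:
                break
        else:
            # Assumption: an unknown character becomes a one-character word.
            candidate = sentence[p]
        divided = dictionary.get(candidate)
        # Use the divided words when division information exists.
        result.extend(divided if divided else [candidate])
        # Advance the pointer by the length of the matched word.
        p += len(candidate)
    return result


print("／".join(first_division("正夫はしっかり者だ", DICTIONARY)))
# prints: 正夫／は／しっかり／者／だ
```

Each iteration performs only dictionary lookups and one pointer update, which is the source of the speed the description attributes to this approach.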

（具体例２） (Specific example 2)
受付部１２は、文「そうはいってもまだ子供」を受け付けた、とする。次に、第一分割部１３は、文のポインタｐを１に設定する。つまり、ポインタｐは文の「そ」の位置に設定された。 Suppose that the reception unit 12 has received the sentence “そうはいってもまだ子供”. Next, the first division unit 13 sets the sentence pointer p to 1. That is, the pointer p is set to the position of “そ” in the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「そ」から、最大長の文字列と一致する単語「そうはいっても」を検索し、取得する。 Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the longest character string starting at the character “そ” pointed to by p in the sentence, and acquires the word “そうはいっても”.

次に、第一分割部１３は、取得した単語「そうはいっても」が、分割情報に含まれる単語であるか否かを判断する。つまり、単語「そうはいっても」に対応する分割単語がＮＵＬＬでないので、第一分割部１３は、単語「そうはいっても」が、分割情報に含まれる単語であると判断する。 Next, the first division unit 13 determines whether the acquired word “そうはいっても” is a word included in the division information. Because the divided words corresponding to the word “そうはいっても” are not NULL, the first division unit 13 determines that the word “そうはいっても” is a word included in the division information.

そして、第一分割部１３は、単語「そうはいっても」に対応する分割情報「そう／は／いって／も」を、単語分割用辞書１１から取得する。 Then, the first division unit 13 acquires the division information “そう／は／いって／も” corresponding to the word “そうはいっても” from the word division dictionary 11.

そして、第一分割部１３は、取得した単語「そう／は／いって／も」をバッファに追記する。そして、現在のバッファには「そう／は／いって／も」が格納された。 Then, the first division unit 13 appends the acquired words “そう／は／いって／も” to the buffer. The buffer now holds “そう／は／いって／も”.

次に、第一分割部１３は、単語「そうはいっても」の文字列長「７」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「７」だけ進め、ポインタｐを文の「ま」の位置に設定する。 Next, the first division unit 13 calculates the character string length “7” of the word “そうはいっても”. It then advances the pointer p by “7”, the length of the longest matching character string, which sets the pointer p to the position of “ま” in the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「ま」から、最大長の文字列と一致する単語「まだ」を検索し、取得する。 Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the longest character string starting at the character “ま” pointed to by p in the sentence, and acquires the word “まだ”.

次に、第一分割部１３は、取得した単語「まだ」が、分割情報に含まれる単語であるか否かを判断する。つまり、第一分割部１３は、単語「は」に対応する分割情報が「−（ＮＵＬＬ）」であると判断する。 Next, the first division unit 13 determines whether the acquired word “まだ” is a word included in the division information. Here, the first division unit 13 determines that the division information corresponding to the word “まだ” is “− (NULL)”.

そして、第一分割部１３は、区切り文字「／」と取得した単語「まだ」とをバッファに追記する。そして、現在のバッファには「そう／は／いって／も／まだ」が格納された。 Then, the first division unit 13 appends the delimiter “／” and the acquired word “まだ” to the buffer. The buffer now holds “そう／は／いって／も／まだ”.

次に、第一分割部１３は、単語「まだ」の文字列長「２」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「２」だけ進め、ポインタｐを文の「子」の位置に設定する。 Next, the first division unit 13 calculates the character string length “2” of the word “まだ”. It then advances the pointer p by “2”, the length of the longest matching character string, which sets the pointer p to the position of “子” in the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「子」から、最大長の文字列と一致する単語「子供」を検索し、取得する。 Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the longest character string starting at the character “子” pointed to by p in the sentence, and acquires the word “子供”.

次に、第一分割部１３は、取得した単語「子供」が、分割情報に含まれる単語であるか否かを判断する。つまり、第一分割部１３は、単語「は」に対応する分割情報が「−（ＮＵＬＬ）」であると判断する。 Next, the first division unit 13 determines whether the acquired word “子供” is a word included in the division information. Here, the first division unit 13 determines that the division information corresponding to the word “子供” is “− (NULL)”.

そして、第一分割部１３は、区切り文字「／」と取得した単語「子供」とをバッファに追記する。そして、現在のバッファには「そう／は／いって／も／まだ／子供」が格納された。 Then, the first division unit 13 appends the delimiter “／” and the acquired word “子供” to the buffer. The buffer now holds “そう／は／いって／も／まだ／子供”.

次に、第一分割部１３は、単語「まだ」の文字列長「２」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「２」だけ進め、ポインタｐを文の「供」の次の位置に設定する。 Next, the first division unit 13 calculates the character string length “2” of the word “子供”. It then advances the pointer p by “2”, the length of the longest matching character string, which sets the pointer p to the position following “供” in the sentence.

そして、出力部１４は、バッファ内の２以上の分割された単語列「そう／は／いって／も／まだ／子供」を出力する。 Then, the output unit 14 outputs the sequence of two or more divided words in the buffer, “そう／は／いって／も／まだ／子供”.

以上、本実施の形態によれば、非常に簡易な処理により、文を２以上の単語に分割できる。そのため、文の単語への分割が非常に高速に行える。 As described above, according to the present embodiment, a sentence can be divided into two or more words by a very simple process. Therefore, the sentence can be divided into words very quickly.

なお、本実施の形態において、第一分割部１３が最大長の文字列である単語を単語分割用辞書１１から取得するアルゴリズムは問わない。 In the present embodiment, the algorithm by which the first division unit 13 acquires the word that is the longest matching character string from the word division dictionary 11 does not matter.
また、本実施の形態において、第二分割部２１の代わりに、１以上の第二分割結果の集合である第二分割結果格納部２６を用いても良い。かかる場合、判断部２３は、第一分割結果と、第二分割結果格納部２６に格納されている第二分割結果とが異なるか否かを判断する。そして、かかる場合、単語分割装置２は、単語分割用辞書１１、受付部１２、第一分割部１３、出力部１４、判断部２３、分割情報取得部２４、辞書登録部２５、および第二分割結果格納部２６を備える。かかる場合の単語分割装置２のブロック図を図１１に示す。 Further, in the present embodiment, a second division result storage unit 26 that holds a set of one or more second division results may be used instead of the second division unit 21. In that case, the determination unit 23 determines whether the first division result differs from the second division result stored in the second division result storage unit 26. The word division device 2 then includes the word division dictionary 11, the reception unit 12, the first division unit 13, the output unit 14, the determination unit 23, the division information acquisition unit 24, the dictionary registration unit 25, and the second division result storage unit 26. FIG. 11 shows a block diagram of the word division device 2 in this case.
そして、図１１において、分割情報取得部２４は、判断部２３経由で、第二分割結果格納部２６から第二分割結果を取得する。 In FIG. 11, the division information acquisition unit 24 acquires the second division result from the second division result storage unit 26 via the determination unit 23.
なお、第二分割結果格納部２６の第二分割結果の集合は、一定以上の多量のデータであり、人手で作成した単語分割済みのデータであることが好適である。また、第一分割結果と第二分割結果格納部２６に格納されている第二分割結果とに関して、分割対象の文は同じである。 The set of second division results in the second division result storage unit 26 is preferably a large volume of data, at or above a certain amount, consisting of manually created, already word-divided data. The first division result and the second division result stored in the second division result storage unit 26 refer to the same sentence to be divided.
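
Since the description leaves the longest-match lookup algorithm open, one common choice is a trie (prefix tree), which finds the longest dictionary word at a given position in a single left-to-right scan. The following is a minimal sketch of that option, not the method the patent prescribes; `build_trie`, `longest_match`, and the `"$end"` marker key are hypothetical names introduced for illustration.

```python
END = "$end"  # hypothetical marker key; it never collides with a single character


def build_trie(words):
    """Build a nested-dict trie from an iterable of dictionary words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True  # a dictionary word ends at this node
    return root


def longest_match(trie, sentence, p):
    """Return the longest dictionary word starting at index p, or None."""
    node = trie
    best_end = None
    for i in range(p, len(sentence)):
        node = node.get(sentence[i])
        if node is None:
            break  # no dictionary word continues with this character
        if END in node:
            best_end = i + 1  # remember the longest word seen so far
    return sentence[p:best_end] if best_end is not None else None


trie = build_trie(["そう", "そうはいっても", "まだ", "子供"])
print(longest_match(trie, "そうはいってもまだ子供", 0))
```

The scan stops as soon as no dictionary word can continue, so the cost per lookup is bounded by the length of the longest dictionary word rather than by the size of the dictionary.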

さらに、本実施の形態における処理は、ソフトウェアで実現しても良い。そして、このソフトウェアをソフトウェアダウンロード等により配布しても良い。また、このソフトウェアをＣＤ−ＲＯＭなどの記録媒体に記録して流布しても良い。なお、このことは、本明細書における他の実施の形態においても該当する。なお、本実施の形態における単語分割装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、記録媒体に、１以上の単語と、単語と当該単語を分割した結果である２以上の分割単語の組である１以上の分割情報とを有する単語分割用辞書を格納しており、コンピュータを、１以上の文字を有する文を受け付ける受付部と、前記受付部が受け付けた文の先頭である文のポインタから最大長の文字列に一致する単語を、前記単語分割用辞書から取得し、当該取得した単語に対応する２以上の分割単語を有する場合は、前記一致する単語に変えて前記２以上の分割単語を取得する分割単語取得処理を行い、前記文のポインタを前記一致する単語の次の文字に移動した後、前記分割単語取得処理を文の最後の文字を含む単語まで行い、文を分割して得られる２以上の単語の集合である第一分割結果を取得する第一分割部と、前記第一分割結果を出力する出力部として機能させるためのプログラム、である。 Furthermore, the processing in the present embodiment may be realized by software. This software may be distributed by software download or the like, or it may be recorded on a recording medium such as a CD-ROM and distributed. The same applies to the other embodiments in this specification. The software that implements the word division device in the present embodiment is the following program. That is, with a word division dictionary stored on a recording medium, the dictionary having one or more words and one or more pieces of division information each being a pair of a word and the two or more divided words that result from dividing that word, this program causes a computer to function as: a reception unit that receives a sentence having one or more characters; a first division unit that performs divided word acquisition processing, in which it acquires from the word division dictionary the word that matches the longest character string starting at the sentence pointer, which initially points to the head of the sentence received by the reception unit, and, when the acquired word has two or more corresponding divided words, acquires those two or more divided words in place of the matching word, then moves the sentence pointer to the character following the matching word, repeats the divided word acquisition processing up to the word containing the last character of the sentence, and acquires a first division result that is a set of two or more words obtained by dividing the sentence; and an output unit that outputs the first division result.

（実施の形態２） (Embodiment 2)
本実施の形態において、文の分割処理を行いながら、単語分割用辞書を充実させることができる単語分割装置１について説明する。 In the present embodiment, a word division apparatus 1 that can enrich the word division dictionary while performing sentence division processing will be described.

図４は、本実施の形態における単語分割装置２のブロック図である。単語分割装置２は、単語分割用辞書１１、受付部１２、第一分割部１３、出力部１４、第二分割部２１、第二分割結果取得部２２、判断部２３、分割情報取得部２４、辞書登録部２５を備える。 FIG. 4 is a block diagram of word segmentation apparatus 2 in the present embodiment. The word division device 2 includes a word division dictionary 11, a reception unit 12, a first division unit 13, an output unit 14, a second division unit 21, a second division result acquisition unit 22, a determination unit 23, a division information acquisition unit 24, A dictionary registration unit 25 is provided.

第二分割部２１は、第一分割部１３とは異なるアルゴリズムにより、受付部１２が受け付けた文を分割して２以上の単語を取得する。この２以上の単語を第二分割結果とも言う。 The second dividing unit 21 divides the sentence received by the receiving unit 12 using an algorithm different from that of the first dividing unit 13 and acquires two or more words. These two or more words are also referred to as a second division result.

第二分割部２１は、文を分割し２以上の単語を取得する処理において、一定以上の精度を有することが確認できているものであることが好適である。例えば、第二分割部２１は、ビタビアルゴリズムを用いた形態素解析のアルゴリズムにより、文を分割して２以上の単語を取得する。 It is preferable that the second dividing unit 21 has been confirmed to have a certain level of accuracy in the process of dividing a sentence and obtaining two or more words. For example, the second dividing unit 21 acquires two or more words by dividing a sentence by an algorithm of morphological analysis using a Viterbi algorithm.
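
The description names Viterbi-based morphological analysis as one option for the second division unit 21 but gives no cost model. As a rough illustration only, the sketch below runs a Viterbi-style dynamic program over a word lattice using hypothetical per-word costs; production morphological analyzers typically also use connection costs between adjacent parts of speech, which are omitted here. The function name `viterbi_segment` and every value in `COSTS` are assumptions.

```python
def viterbi_segment(sentence, costs):
    """Return the minimum-total-cost division of `sentence` into lexicon words."""
    n = len(sentence)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[k]: minimal cost of dividing sentence[:k]
    back = [0] * (n + 1)    # back[k]: start index of the last word ending at k
    best[0] = 0.0
    max_len = max(len(word) for word in costs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = start
    if best[n] == inf:
        raise ValueError("sentence cannot be covered by the lexicon")
    # Follow the back pointers from the end of the sentence to recover the words.
    words = []
    k = n
    while k > 0:
        words.append(sentence[back[k]:k])
        k = back[k]
    return words[::-1]


# Hypothetical word costs (lower is more plausible); not taken from the patent.
COSTS = {"間違い": 3.0, "は": 1.0, "はな": 4.0, "ない": 2.0, "いか": 5.0, "か": 1.0}
print("／".join(viterbi_segment("間違いはないか", COSTS)))
```

Unlike greedy longest match, this procedure weighs every path through the lattice, which is why it is slower but can avoid errors such as choosing “はな” over “は”.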

第二分割結果取得部２２は、第二分割部２１が取得した２以上の単語の集合である第二分割結果を取得する。なお、第二分割結果取得部２２は、第二分割部２１から第二分割結果を取得するだけの処理である。 The second division result acquisition unit 22 acquires a second division result that is a set of two or more words acquired by the second division unit 21. The second division result acquisition unit 22 is a process that only acquires the second division result from the second division unit 21.

また、後述する判断部２３に、第二分割部２１が第二分割結果を渡しても良い。かかる場合、第二分割結果取得部２２は、何も処理を行わないが、第二分割結果を第二分割結果取得部２２が判断部２３に渡した、と考えても良い。 Alternatively, the second division unit 21 may pass the second division result directly to the determination unit 23 described later. In that case, the second division result acquisition unit 22 performs no processing, although it may also be regarded as having passed the second division result to the determination unit 23.

判断部２３は、第一分割結果と第二分割結果とが異なるか否かを判断する。なお、第一分割結果は、第一分割部１３が取得した２以上の単語の集合である。 The determination unit 23 determines whether or not the first division result and the second division result are different. The first division result is a set of two or more words acquired by the first division unit 13.

分割情報取得部２４は、第一分割結果と第二分割結果とが異なると判断部２３が判断した場合、分割情報を構成する。 The division information acquisition unit 24 configures division information when the determination unit 23 determines that the first division result and the second division result are different.

分割情報は、第一分割結果と第二分割結果とが異なる箇所に対応する文の中の文字列と、当該文字列に対応する２以上の区切られた単語であり、第二分割結果に含まれる２以上の単語とを有する。つまり、分割情報取得部２４は、まず、第一分割結果と第二分割結果とが異なる箇所を特定する。次に、分割情報取得部２４は、受付部１２が受け付けた文の中から、当該箇所に対応する文の中の文字列を取得する。次に、分割情報取得部２４は、第二分割結果の中から、当該文字列に対応する２以上の分割単語を取得する。そして、分割情報取得部２４は、文の中の文字列と、２以上の分割単語とを有する分割情報を構成する。なお、文の中の文字列は、分割情報を構成する単語である。 The division information has a character string in the sentence corresponding to a portion where the first division result and the second division result differ, and the two or more delimited words that correspond to that character string and are included in the second division result. That is, the division information acquisition unit 24 first identifies the portion where the first division result and the second division result differ. Next, it acquires, from the sentence received by the reception unit 12, the character string corresponding to that portion. Next, it acquires, from the second division result, the two or more divided words corresponding to that character string. The division information acquisition unit 24 then constructs division information having the character string in the sentence and the two or more divided words. Note that the character string in the sentence serves as the word of the division information.

辞書登録部２５は、分割情報取得部２４が取得した分割情報を単語分割用辞書１１に蓄積する。 The dictionary registration unit 25 accumulates the division information acquired by the division information acquisition unit 24 in the word division dictionary 11.

第二分割部２１、第二分割結果取得部２２、判断部２３、分割情報取得部２４、および辞書登録部２５は、通常、ＭＰＵやメモリ等から実現され得る。第二分割部２１等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The second division unit 21, the second division result acquisition unit 22, the determination unit 23, the division information acquisition unit 24, and the dictionary registration unit 25 can be usually realized by an MPU, a memory, or the like. The processing procedure of the second division unit 21 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

次に、単語分割装置２の動作について、図５のフローチャートを用いて説明する。図５のフローチャートにおいて、図２のフローチャートと同一ステップについて、説明を省略する。 Next, the operation of the word division device 2 will be described with reference to the flowchart of FIG. 5. In the flowchart of FIG. 5, description of the steps that are the same as in the flowchart of FIG. 2 is omitted.

（ステップＳ５０１）第二分割部２１が、受付部１２が受け付けた文に対して、分割処理を行い、２以上の単語を取得する。この２以上の単語は第二分割結果である。 (Step S501) The second dividing unit 21 performs division processing on the sentence received by the receiving unit 12, and acquires two or more words. These two or more words are the result of the second division.

（ステップＳ５０２）第二分割結果取得部２２は、第二分割結果を取得する。 (Step S502) The second division result acquisition unit 22 acquires the second division result.

（ステップＳ５０３）判断部２３は、第一分割結果を取得する。 (Step S503) The determination unit 23 acquires the first division result.

（ステップＳ５０４）判断部２３は、カウンタｉ、およびｊに１を代入する。カウンタｉは第一分割結果に含まれる分割単語のカウンタであり、カウンタｊは第二分割結果に含まれる分割単語のカウンタである。 (Step S504) The determination unit 23 substitutes 1 for counters i and j. The counter i is a counter for divided words included in the first division result, and the counter j is a counter for divided words included in the second division result.

（ステップＳ５０５）判断部２３は、第二分割結果の中にｊ番目の分割単語が存在するか否かを判断する。ｊ番目の分割単語が存在すればステップＳ５０６に行き、ｊ番目の分割単語が存在しなければステップＳ２０１に戻る。 (Step S505) The determination unit 23 determines whether or not the j-th divided word is present in the second division result. If the jth divided word exists, the process goes to step S506, and if the jth divided word does not exist, the process returns to step S201.

（ステップＳ５０６）判断部２３は、第一分割結果の中のｉ番目の分割単語と、第二分割結果の中のｊ番目の分割単語とが一致するか否かを判断する。一致する場合はステップＳ５１１に行き、一致しない場合はステップＳ５０７に行く。 (Step S506) The determination unit 23 determines whether or not the i-th divided word in the first division result matches the j-th divided word in the second division result. If they match, go to step S511; otherwise, go to step S507.

（ステップＳ５０７）分割情報取得部２４は、第一分割結果の中の分割単語と、第二分割結果の中の分割単語との、最後の文字が一致するまで、第二分割結果の中から、２以上の分割単語を取得する。なお、この２以上の分割単語は、ｊ番目の分割単語から連続する分割単語である。 (Step S507) The division information acquisition unit 24 selects from the second division result until the last character of the division word in the first division result and the division word in the second division result match. Get two or more split words. The two or more divided words are divided words that are continuous from the jth divided word.

（ステップＳ５０８）分割情報取得部２４は、分割情報を構成する。つまり、分割情報取得部２４は、ステップＳ５０７で取得した１以上の分割単語から区切り文字を削除し、単語を取得する。そして、分割情報取得部２４は、当該単語と、ステップＳ５０７で取得した２以上の分割単語とを用いて、分割情報を構成する。なお、分割情報取得部２４は、ステップＳ５０７で取得した１以上の分割単語から区切り文字を削除し単語を取得するのではなく、受付部１２が受け付けた文から単語を取得しても良い。 (Step S508) The division information acquisition unit 24 constructs division information. That is, the division information acquisition unit 24 removes the delimiters from the one or more divided words acquired in step S507 to obtain a word. It then constructs division information from that word and the two or more divided words acquired in step S507. Note that, instead of removing the delimiters from the one or more divided words acquired in step S507, the division information acquisition unit 24 may acquire the word from the sentence received by the reception unit 12.

（ステップＳ５０９）辞書登録部２５は、ステップＳ５０８で構成された分割情報を、単語分割用辞書１１に登録する。 (Step S509) The dictionary registration unit 25 registers the division information configured in step S508 in the word division dictionary 11.

（ステップＳ５１０）判断部２３は、カウンタｉおよびｊを、ステップＳ５０７で、最後の文字が一致した分割単語まで進める。 (Step S510) The determination unit 23 advances the counters i and j to the divided word that matches the last character in step S507.

（ステップＳ５１１）判断部２３は、カウンタｉおよびｊを、それぞれ１ずつ進める。ステップＳ５０５に戻る。 (Step S511) The determination unit 23 advances the counters i and j by one each. The process returns to step S505.
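
Steps S504 to S511 can be sketched as a comparison loop over the two division results. This is a hedged illustration, not the patent's implementation: `collect_division_info` is a hypothetical name, counters are 0-based, the function returns the division information instead of registering it, and it assumes both results divide the same sentence. Spans where the second result supplies only one word are skipped, on the assumption that division information requires two or more divided words.

```python
def collect_division_info(first, second):
    """Return {word: divided words} for each span where the two results differ."""
    info = {}
    i = j = 0  # S504: counters into the first and second division results
    while i < len(first) and j < len(second):  # S505
        if first[i] == second[j]:  # S506: the words agree
            i += 1  # S511
            j += 1
            continue
        # S507: extend whichever side covers fewer characters until the last
        # characters of the two spans coincide.
        len_first, len_second = len(first[i]), len(second[j])
        i_end, j_end = i, j
        while len_first != len_second:
            if len_first < len_second:
                i_end += 1
                len_first += len(first[i_end])
            else:
                j_end += 1
                len_second += len(second[j_end])
        divided = second[j:j_end + 1]
        # S508: joining the divided words without delimiters gives the word.
        if len(divided) >= 2:
            info["".join(divided)] = divided  # S509 would register this entry
        # S510 and S511: move both counters past the aligned span.
        i, j = i_end + 1, j_end + 1
    return info


print(collect_division_info(["間違い", "はな", "いか"], ["間違い", "は", "ない", "か"]))
# prints: {'はないか': ['は', 'ない', 'か']}
```

Because both lists cover the same sentence, the cumulative lengths must eventually coincide, so the inner loop always terminates within the bounds of both lists.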

なお、図５のフローチャートにおいて、第一分割結果と第二分割結果とが異なる場合でも、出力部１４は第一分割結果を出力した。しかし、第一分割結果と第二分割結果とが異なる場合に、出力部１４は第二分割結果を出力しても良い。また、単語分割用辞書１１が予め決められた条件を満たすほど充実する前は、出力部１４は第二分割結果を出力し、充実した後は、出力部１４は第一分割結果を出力しても良い。 In the flowchart of FIG. 5, the output unit 14 outputs the first division result even when the first division result and the second division result differ. However, when the two results differ, the output unit 14 may instead output the second division result. Alternatively, the output unit 14 may output the second division result until the word division dictionary 11 has been enriched enough to satisfy a predetermined condition, and output the first division result after that.

また、図５のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。 Further, in the flowchart of FIG. 5, the processing is ended by powering off or interruption of processing end.

以下、本実施の形態における単語分割装置２の具体的な動作について説明する。 Hereinafter, a specific operation of the word segmentation apparatus 2 in the present embodiment will be described.

今、図３が単語分割用辞書１１である。 FIG. 3 shows the word division dictionary 11.

かかる状況において、受付部１２は、文「間違いはないか」を受け付けた、とする。次に、第一分割部１３は、文のポインタｐを１に設定する。つまり、ポインタｐは文の「間」の位置に設定された。 In this situation, suppose that the reception unit 12 has received the sentence “間違いはないか” (“Is there no mistake?”). Next, the first division unit 13 sets the sentence pointer p to 1. That is, the pointer p is set to the position of “間” in the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「間」から、最大長の文字列と一致する単語「間違い」を検索し、取得する。 Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the longest character string starting at the character “間” pointed to by p in the sentence, and acquires the word “間違い”.

次に、第一分割部１３は、取得した単語「間違い」が、分割情報に含まれる単語であるか否かを判断する。つまり、第一分割部１３は、単語「間違い」に対応する分割情報が「−（ＮＵＬＬ）」であると判断する。 Next, the first division unit 13 determines whether or not the acquired word “wrong” is a word included in the division information. That is, the first division unit 13 determines that the division information corresponding to the word “wrong” is “− (NULL)”.

そして、第一分割部１３は、取得した単語「間違い」をバッファに追記する。 Then, the first division unit 13 adds the acquired word “wrong” to the buffer.

次に、第一分割部１３は、単語「間違い」の文字列長「３」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「３」だけ進め、ポインタｐを文の「は」の位置に設定する。 Next, the first dividing unit 13 calculates the character string length “3” of the word “wrong”. Next, the first dividing unit 13 advances the pointer p by “3” corresponding to the maximum length of the character string, and sets the pointer p to the position “ha” of the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「は」から、最大長の文字列と一致する単語「はな」を検索し、取得する。 Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the longest character string starting at the character “は” pointed to by p in the sentence, and acquires the word “はな”.

次に、第一分割部１３は、取得した単語「はな」が、分割情報に含まれる単語であるか否かを判断する。つまり、第一分割部１３は、単語「はな」に対応する分割情報が「−（ＮＵＬＬ）」であると判断する。 Next, the first division unit 13 determines whether or not the acquired word “Hana” is a word included in the division information. That is, the first division unit 13 determines that the division information corresponding to the word “Hana” is “− (NULL)”.

そして、第一分割部１３は、取得した単語「はな」をバッファに追記する。なお、第一分割部１３は、単語「はな」の前に区切り文字「／」を入れて、バッファに追記する。そして、現在のバッファには「間違い／はな」が格納された。 Then, the first dividing unit 13 adds the acquired word “Hana” to the buffer. The first dividing unit 13 adds a delimiter character “/” before the word “Hana” and appends it to the buffer. The current buffer stores “wrong / hana”.

次に、第一分割部１３は、単語「はな」の文字列長「２」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「２」だけ進め、ポインタｐを文の「い」の位置に設定する。 Next, the first dividing unit 13 calculates the character string length “2” of the word “Hana”. Next, the first dividing unit 13 advances the pointer p by “2” corresponding to the maximum length of the character string, and sets the pointer p to the position “i” of the sentence.

次に、第一分割部１３は、単語分割用辞書１１に存在する単語であり、文の中のｐに対応する文字「い」から、最大長の文字列と一致する単語「いか」を検索し、取得する。 Next, the first division unit 13 searches the word division dictionary 11 for the word that matches the longest character string starting at the character “い” pointed to by p in the sentence, and acquires the word “いか”.

次に、第一分割部１３は、取得した単語「いか」が、分割情報に含まれる単語であるか否かを判断する。つまり、第一分割部１３は、単語「いか」に対応する分割情報が「−（ＮＵＬＬ）」であると判断する。 Next, the first division unit 13 determines whether the acquired word “いか” is a word included in the division information. Here, the first division unit 13 determines that the division information corresponding to the word “いか” is “− (NULL)”.

そして、第一分割部１３は、区切り文字「／」と取得した単語「いか」とをバッファに追記する。そして、現在のバッファには「間違い／はな／いか」が格納された。 Then, the first division unit 13 appends the delimiter “／” and the acquired word “いか” to the buffer. The buffer now holds “間違い／はな／いか”.

次に、第一分割部１３は、単語「いか」の文字列長「２」を算出する。次に、第一分割部１３は、ポインタｐを、最大長の文字列長の分「２」だけ進め、ポインタｐを文の「が」の次の位置に設定する。 Next, the first division unit 13 calculates the character string length “2” of the word “いか”. It then advances the pointer p by “2”, the length of the longest matching character string, which sets the pointer p to the position following “か”, the last character of the sentence.

そして、出力部１４は、バッファ内の２以上の分割された単語列「間違い／はな／いか」を出力する。 Then, the output unit 14 outputs the sequence of two or more divided words in the buffer, “間違い／はな／いか”.

次に、第二分割部２１が、受付部１２が受け付けた文「間違いはないか」に対して、分割処理を行い、２以上の単語「間違い／は／ない／か」を取得した、とする。 Next, suppose that the second division unit 21 performs division processing on the sentence “間違いはないか” received by the reception unit 12 and acquires the two or more words “間違い／は／ない／か”.

次に、第二分割結果取得部２２は、第二分割結果「間違い／は／ない／か」を取得する。また、判断部２３は、第一分割結果「間違い／はな／いか」を取得する。 Next, the second division result acquisition unit 22 acquires the second division result “間違い／は／ない／か”. The determination unit 23 also acquires the first division result “間違い／はな／いか”.

次に、判断部２３は、カウンタｉ、およびｊに１を代入する。 Next, the determination unit 23 substitutes 1 for counters i and j.

次に、判断部２３は、第二分割結果の中に１番目の分割単語が存在すると判断する。また、次に、判断部２３は、第一分割結果の中の１番目の分割単語と、第二分割結果の中に１番目の分割単語とが一致すると判断する。そして、判断部２３は、カウンタｉおよびｊを、それぞれ１ずつ進める。 Next, the determination unit 23 determines that the first divided word exists in the second division result. Next, the determination unit 23 determines that the first divided word in the first division result matches the first divided word in the second division result. Then, determination unit 23 increments counters i and j by one each.

次に、判断部２３は、第二分割結果の中に２番目の分割単語が存在すると判断する。また、次に、判断部２３は、第一分割結果の中の２番目の分割単語「はな」と、第二分割結果の中の２番目の分割単語「は」とが一致しない、と判断する。 Next, the determination unit 23 determines that a second divided word exists in the second division result. It then determines that the second divided word “はな” in the first division result does not match the second divided word “は” in the second division result.

次に、分割情報取得部２４は、第一分割結果の中の分割単語と、第二分割結果の中の分割単語との、最後の文字が一致するまで、第二分割結果の中から、２以上の分割単語を取得する。つまり、分割情報取得部２４は、第二分割結果の中の「は／ない／か」を取得する。 Next, the division information acquisition unit 24 acquires two or more divided words from the second division result, continuing until the last character of a divided word in the first division result coincides with the last character of a divided word in the second division result. That is, the division information acquisition unit 24 acquires “は／ない／か” from the second division result.

次に、分割情報取得部２４は、第二分割結果の中の「は／ない／か」から区切り文字を除き、単語「はないか」を取得する。そして、分割情報取得部２４は、単語「はないか」と２以上の分割単語「は／ない／か」を用いて、分割情報を構成する。 Next, the division information acquisition unit 24 removes the delimiters from “は／ない／か” in the second division result and obtains the word “はないか”. It then constructs division information from the word “はないか” and the two or more divided words “は／ない／か”.

次に、辞書登録部２５は、構成された分割情報を、単語分割用辞書１１に登録する。この分割情報は、単語「はないか」と２以上の分割単語「は／ない／か」とを有する情報である。 Next, the dictionary registration unit 25 registers the constructed division information in the word division dictionary 11. This division information has the word “はないか” and the two or more divided words “は／ない／か”.

次に、判断部２３は、カウンタｉおよびｊを、最後の文字が一致した分割単語まで進める。つまり、判断部２３は、カウンタｉを２進め、カウンタｊを３進める。 Next, the determination unit 23 advances the counters i and j to the divided word that matches the last character. That is, the determination unit 23 advances the counter i by 2 and advances the counter j by 3.

次に、判断部２３は、第二分割結果の中に５番目の分割単語が存在しない、と判断する。そして、処理を終了する。 Next, the determination unit 23 determines that the fifth divided word does not exist in the second division result. Then, the process ends.

以上、本実施の形態によれば、文の分割処理を行いながら、単語分割用辞書を充実させることができる。 As described above, according to the present embodiment, it is possible to enrich the word division dictionary while performing sentence division processing.
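
To make the benefit of the registration concrete, the sketch below runs a compact longest-match divider on the same sentence before and after the entry “はないか” is added. The initial dictionary contents are a hypothetical stand-in for FIG. 3, `first_division` is an illustrative name, and the one-character fallback for unknown text is an assumption.

```python
def first_division(sentence, dictionary):
    """Greedy longest match that substitutes registered divided words."""
    words, p = [], 0
    while p < len(sentence):
        # Longest dictionary word starting at p; fall back to one character.
        match = next(
            (sentence[p:q] for q in range(len(sentence), p, -1)
             if sentence[p:q] in dictionary),
            sentence[p],
        )
        words.extend(dictionary.get(match) or [match])
        p += len(match)
    return words


# Hypothetical dictionary before learning (None stands for "− (NULL)").
dictionary = {"間違い": None, "は": None, "はな": None, "いか": None}
before = first_division("間違いはないか", dictionary)

# Register the division information obtained from the second division result.
dictionary["はないか"] = ["は", "ない", "か"]
after = first_division("間違いはないか", dictionary)

print("／".join(before))  # 間違い／はな／いか
print("／".join(after))   # 間違い／は／ない／か
```

After registration, the longest match from “は” is “はないか”, so the fast first division reproduces the second division result for this span without running the slower algorithm again.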

なお、本実施の形態によれば、単語分割装置は第二分割部２１を有した。しかし、第二分割部２１は、単語分割装置の外部の装置に存在しても良い。かかる場合の単語分割装置３のブロック図を図６に示す。なお、ここでは、単語分割装置３は、第二分割部２１を具備する単語分割装置４から、第二分割結果を受け取るものとする。つまり、かかる場合、例えば、第二分割結果取得部２２は、第二分割部２１が取得した２以上の単語の集合である第二分割結果を、単語分割装置４から受信する。 According to the present embodiment, the word dividing device has the second dividing unit 21. However, the second dividing unit 21 may exist in a device outside the word dividing device. FIG. 6 shows a block diagram of the word dividing device 3 in such a case. Here, it is assumed that the word division device 3 receives the second division result from the word division device 4 including the second division unit 21. That is, in such a case, for example, the second division result acquisition unit 22 receives the second division result that is a set of two or more words acquired by the second division unit 21 from the word division device 4.

さらに、本実施の形態における単語分割装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、記録媒体に、１以上の単語と、単語と当該単語を分割した結果である２以上の分割単語の組である１以上の分割情報とを有する単語分割用辞書を格納しており、コンピュータを、１以上の文字を有する文を受け付ける受付部と、前記受付部が受け付けた文の先頭である文のポインタから最大長の文字列に一致する単語を、前記単語分割用辞書から取得し、当該取得した単語に対応する２以上の分割単語を有する場合は、前記一致する単語に変えて前記２以上の分割単語を取得する分割単語取得処理を行い、前記文のポインタを前記一致する単語の次の文字に移動した後、前記分割単語取得処理を文の最後の文字を含む単語まで行い、文を分割して得られる２以上の単語の集合である第一分割結果を取得する第一分割部と、前記第一分割結果を出力する出力部として機能させるためのプログラム、である。 Furthermore, the software that implements the word division device in the present embodiment is the following program. That is, with a word division dictionary stored on a recording medium, the dictionary having one or more words and one or more pieces of division information each being a pair of a word and the two or more divided words that result from dividing that word, this program causes a computer to function as: a reception unit that receives a sentence having one or more characters; a first division unit that performs divided word acquisition processing, in which it acquires from the word division dictionary the word that matches the longest character string starting at the sentence pointer, which initially points to the head of the sentence received by the reception unit, and, when the acquired word has two or more corresponding divided words, acquires those two or more divided words in place of the matching word, then moves the sentence pointer to the character following the matching word, repeats the divided word acquisition processing up to the word containing the last character of the sentence, and acquires a first division result that is a set of two or more words obtained by dividing the sentence; and an output unit that outputs the first division result.

In the above program, it is preferable that the computer further function as: a second division result acquisition unit that acquires a second division result, which is a set of two or more words obtained by dividing the sentence received by the reception unit using a second division unit that divides a sentence into two or more words by an algorithm different from that of the first division unit; a determination unit that determines whether the first division result and the second division result differ; a division information acquisition unit that, when the determination unit determines that they differ, acquires the character string in the sentence corresponding to the location where the two results differ, acquires the two or more words in the second division result corresponding to that location, and constructs division information having the word that is the acquired character string and the acquired two or more words; and a dictionary registration unit that accumulates the division information in the word division dictionary.
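As a rough illustration of the determination unit and the division information acquisition unit, the following Python sketch compares two division results of the same sentence and builds division information for each location where they differ. The function name `acquire_division_info` and the approach of aligning the two results on shared word boundaries are assumptions for illustration; the specification does not prescribe how the differing location is found.

```python
from itertools import accumulate
from typing import Dict, List


def acquire_division_info(first: List[str],
                          second: List[str]) -> Dict[str, List[str]]:
    """Build division information from spans where two division results differ."""
    if "".join(first) != "".join(second):
        raise ValueError("both results must divide the same sentence")
    sentence = "".join(first)
    # Character offsets at which each result ends a word.
    first_bounds = set(accumulate(len(w) for w in first))
    second_ends = list(accumulate(len(w) for w in second))
    info: Dict[str, List[str]] = {}
    start = 0
    pending: List[str] = []  # words of the second result in the current span
    for word, end in zip(second, second_ends):
        pending.append(word)
        if end in first_bounds:
            # A boundary shared by both results closes the current span.
            if len(pending) >= 2:
                # The second result splits this span where the first does not
                # split it the same way, so the results differ here.
                info[sentence[start:end]] = pending
            start, pending = end, []
    return info


first_result = ["本部長", "です"]
second_result = ["本部", "長", "です"]
new_info = acquire_division_info(first_result, second_result)
print(new_info)  # → {'本部長': ['本部', '長']}
```

Dictionary registration then amounts to merging `new_info` into the stored division information, for example with `division_info.update(new_info)`. Spans in which the second result contains only one word are skipped, since the text above requires two or more words from the second division result.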

In the above program, it is preferable that the computer further function as a second division unit that divides the sentence received by the reception unit into two or more words by an algorithm different from that of the first division unit.

In the above program, it is preferable that the second division unit divide a sentence into two or more words by a morphological analysis algorithm that uses the Viterbi algorithm.
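For orientation, a Viterbi-style division can be sketched in Python as a dynamic program that picks the lowest-cost path through the lattice of dictionary words. This is a simplified stand-in, not the second division unit of the specification: the function name `viterbi_division`, the per-word costs, and the `unknown_cost` penalty for single unmatched characters are hypothetical, and practical morphological analyzers typically also score connections between adjacent words, which is omitted here.

```python
from typing import Dict, List, Tuple


def viterbi_division(sentence: str,
                     word_cost: Dict[str, float],
                     unknown_cost: float = 10.0) -> List[str]:
    """Divide a sentence by finding the minimum-cost path over a word lattice."""
    n = len(sentence)
    max_len = max((len(w) for w in word_cost), default=1)
    # best[i] = (lowest cost of dividing sentence[:i], start of the last word)
    best: List[Tuple[float, int]] = [(0.0, 0)] + [(float("inf"), 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = sentence[start:end]
            if piece in word_cost:
                cost = word_cost[piece]
            elif end - start == 1:
                cost = unknown_cost  # assumption: penalty for unknown characters
            else:
                continue
            total = best[start][0] + cost
            if total < best[end][0]:
                best[end] = (total, start)
    # Trace the best path back from the end of the sentence.
    result: List[str] = []
    end = n
    while end > 0:
        start = best[end][1]
        result.append(sentence[start:end])
        end = start
    return result[::-1]


costs = {"本部長": 5.0, "本部": 2.0, "長": 2.0, "です": 1.0}
print(viterbi_division("本部長です", costs))
# → ['本部', '長', 'です']
```

Unlike greedy longest match, this search considers every dictionary word ending at each position, which is why it can prefer "本部" + "長" over "本部長" when their combined cost is lower.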
(Experiment 1)

The results of Experiment 1, which used the word division device 1, are described below. The software that implements the word division device 1 is named "MA-2". The publicly known "MeCab 0.98" was used as another word division device; it is described at "http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html". "MA-1", a word division device developed by the applicant that uses the Viterbi algorithm, was also used as another word division device. UTF-8 Japanese text of 388.5 MB was input to these three devices and the processing speed (KB/sec) of each was measured; FIG. 7 shows the results. "MA-2", the word division device 1, achieved 4.3 times the processing speed of "MeCab 0.98" and 7.5 times the processing speed of "WebMA2 (Version 3.7.0)". The results indicate that "MA-2" can analyze one year of newspaper text in about 30 seconds.
(Experiment 2)

Next, the results of Experiment 2, which used the word division device 1 "MA-2", are described. The results are shown in FIG. 8. In Experiment 2, the publicly known "JUMAN 6.0", "MeCab 0.98", "KyTea 0.3.0", and "ChaSen 2.3.3" were used as the other word division devices. "JUMAN 6.0" is described at "http://nlp.ist.i.kyoto-u.ac.jp/index.php?cmd=read&page=JUMAN&alias%5B%5D=%E6%97%A5%E6%9C%AC%E8%AA%9E%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90%E3%82%B7%E3%82%B9%E3%83%86%E3%83%A0JUMAN", "KyTea 0.3.0" at "http://www.phontron.com/kytea/index-ja.html", and "ChaSen 2.3.3" at "http://chasen.naist.jp/hiki/ChaSen/". In this experiment, 80,000 sentences of web text were input to the five devices, and the time each device took to analyze them was measured (see FIG. 8). The word division device 1 "MA-2" was far faster than the others. The algorithm and model of the word division device 1 are listed in FIG. 8 as "depth-first search + collocation".

FIG. 9 shows the external appearance of a computer that executes the program described in this specification to realize the word division devices of the embodiments described above. The embodiments described above can be realized by computer hardware and a computer program executed on it. FIG. 9 is an overview of this computer system 300, and FIG. 10 is a block diagram of the system 300.

In FIG. 9, the computer system 300 includes a computer 301 with a CD-ROM drive, a keyboard 302, a mouse 303, and a monitor 304.

In FIG. 10, the computer 301 includes, in addition to the CD-ROM drive 3012, an MPU 3013; a bus 3014 connected to the MPU 3013 and the CD-ROM drive 3012; a ROM 3015 for storing programs such as a boot-up program; a RAM 3016 that is connected to the MPU 3013, temporarily stores the instructions of application programs, and provides temporary storage space; and a hard disk 3017 for storing application programs, system programs, and data. Although not shown, the computer 301 may further include a network card that provides a connection to a LAN.

The program that causes the computer system 300 to execute the functions of the word division device of the embodiments described above may be stored on a CD-ROM 3101, which is inserted into the CD-ROM drive 3012, and then transferred to the hard disk 3017. Alternatively, the program may be transmitted to the computer 301 via a network (not shown) and stored on the hard disk 3017. The program is loaded into the RAM 3016 when it is executed. The program may also be loaded directly from the CD-ROM 3101 or the network.

The program need not include an operating system (OS), third-party programs, or the like that cause the computer 301 to execute the functions of the word division device of the embodiments described above. The program need only include those instructions that call the appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 300 operates is well known, so a detailed description is omitted.

In the above program, a transmission step of transmitting information, a reception step of receiving information, and the like do not include processing performed by hardware, such as processing performed by a modem or an interface card in the transmission step (processing that can only be performed by hardware).

The program may be executed by a single computer or by multiple computers. That is, the processing may be centralized or distributed.

In each of the above embodiments, it goes without saying that two or more communication means (such as a terminal information transmission unit and a terminal information reception unit) present in one device may be physically realized by a single medium.

In each of the above embodiments, each process (each function) may be realized by centralized processing in a single device (system) or by distributed processing across multiple devices.

The present invention is not limited to the above embodiments. Various modifications are possible, and it goes without saying that they are also included within the scope of the present invention.

As described above, the word division device according to the present invention has the effect of being able to divide a sentence into two or more words at high speed, and is useful as a word division device or the like.

1, 2, 3, 4 Word division device
11 Word division dictionary
12 Reception unit
13 First division unit
14 Output unit
21 Second division unit
22 Second division result acquisition unit
23 Determination unit
24 Division information acquisition unit
25 Dictionary registration unit

Claims

A word division device comprising:
a word division dictionary capable of storing one or more words and one or more pieces of division information, each of which is a pair of a word and two or more divided words that result from dividing the word;
a reception unit that receives a sentence having one or more characters;
a first division unit that, using the word division dictionary, acquires from the word division dictionary the longest word that matches a character string constituting the sentence received by the reception unit, acquires the two or more divided words corresponding to the acquired word, and acquires a first division result, which is a set of two or more words obtained by dividing the sentence; and
an output unit that outputs the first division result.

The word division device according to claim 1, further comprising:
a second division result acquisition unit that acquires a second division result, which is a set of two or more words obtained by dividing the sentence received by the reception unit using a second division unit that divides a sentence into two or more words by an algorithm different from that of the first division unit;
a determination unit that determines whether the first division result and the second division result differ;
a division information acquisition unit that, when the determination unit determines that the first division result and the second division result differ, acquires the character string in the sentence corresponding to the location where the first division result and the second division result differ, acquires the two or more words included in the second division result corresponding to that location, and constructs division information having the word that is the acquired character string and the acquired two or more words; and
a dictionary registration unit that accumulates the division information in the word division dictionary.

The word division device according to claim 2, further comprising a second division unit that obtains two or more words by dividing the sentence received by the reception unit using an algorithm different from the first division unit.

The word division device according to claim 2 or 3, wherein the second division unit divides a sentence into two or more words by a morphological analysis algorithm that uses the Viterbi algorithm.

A data structure of a word division dictionary having one or more words and one or more pieces of division information, each of which is a pair of a word and two or more divided words that result from dividing the word, the data structure being used by a word division device comprising:
a reception unit that receives a sentence having one or more characters;
a first division unit that acquires, from the word division dictionary, the word matching the longest character string starting at a sentence pointer that initially points to the head of the sentence received by the reception unit, performs divided word acquisition processing that, when the acquired word has two or more corresponding divided words, acquires the two or more divided words in place of the matching word, moves the sentence pointer to the character following the matching word, repeats the divided word acquisition processing until the word containing the last character of the sentence has been processed, and acquires a first division result, which is a set of two or more words obtained by dividing the sentence; and
an output unit that outputs the first division result.

A word division method in which a word division dictionary capable of storing one or more words and one or more pieces of division information, each of which is a pair of a word and two or more divided words that result from dividing the word, is stored on a recording medium,
the method being realized by a reception unit, a first division unit, and an output unit, and comprising:
a reception step in which the reception unit receives a sentence having one or more characters;
a first division step in which the first division unit, using the word division dictionary, acquires from the word division dictionary the longest word that matches a character string constituting the sentence received in the reception step, acquires the two or more divided words corresponding to the acquired word, and acquires a first division result, which is a set of two or more words obtained by dividing the sentence; and
an output step in which the output unit outputs the first division result.

A program that, with a word division dictionary stored on a recording medium, the dictionary being capable of storing one or more words and one or more pieces of division information, each of which is a pair of a word and two or more divided words that result from dividing the word,
causes a computer to function as:
a reception unit that receives a sentence having one or more characters;
a first division unit that, using the word division dictionary, acquires from the word division dictionary the longest word that matches a character string constituting the sentence received by the reception unit, acquires the two or more divided words corresponding to the acquired word, and acquires a first division result, which is a set of two or more words obtained by dividing the sentence; and
an output unit that outputs the first division result.