JP2000259673A

JP2000259673A - Method and device for dividing sentence to words

Info

Publication number: JP2000259673A
Application number: JP11204868A
Authority: JP
Inventors: Yasuki Iizuka; 泰樹飯塚; Tomoko Fujita; 智子藤田; Chuichi Kikuchi; 忠一菊池; Takashi Shimojima; 崇下島
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1999-01-06
Filing date: 1999-07-19
Publication date: 2000-09-22

Abstract

PROBLEM TO BE SOLVED: To provide a device capable of dividing a sentence into words without depending upon a dictionary. SOLUTION: The sentence division device for dividing text data into words is provided with a text storage means 102 for storing text data, a word processing knowledge storage means 107 for storing pattern information for word extraction and sentence division and a word extraction means 103 for extracting a word from the text data stored in the means 102 by applying the pattern information. The device is also provided with a word storage means 104 for storing the extracted word, a word detection means 105 for detecting respective locations of the text data using the extracted word, and a sentence division means 106 for dividing the text data into words on the basis of the detected locations of the word on the text data and a new sentence dividing position found out by applying the pattern information of sentence division based on the locations of the word. Consequently a sentence can be divided into words without using a dictionary.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention: The present invention relates to a method and an apparatus used in the preprocessing and analysis stage of a natural language processing system that performs machine translation, large-scale document retrieval, automatic text summarization, and the like on a computer. In particular, it allows text to be divided into words efficiently.

[0002]

2. Description of the Related Art: In natural language processing systems such as machine translation and automatic summarization, analyzing the text is the necessary first stage. In an agglutinative language such as Japanese in particular, text is not written with separators between words, so division into words is the first analysis required.

[0003] In a document search system, for example, the word 「東京都議会」 (Tokyo Metropolitan Assembly) in the string 「今月の東京都議会」 (this month's Tokyo Metropolitan Assembly) is retrieved both by a full-text search for 「東京」 (Tokyo) and by a full-text search for 「京都」 (Kyoto). To reduce this kind of search noise, the text likewise needs to be divided into words in advance.
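The search noise described above can be reproduced in a few lines. This is a minimal sketch, not part of the patent: `substring_hits` and the `segmented` list are hypothetical names, and the segmentation shown is simply assumed for comparison.

```python
text = "今月の東京都議会"

def substring_hits(text, query):
    """Return the start offset of every occurrence of query in text."""
    hits = []
    start = text.find(query)
    while start != -1:
        hits.append(start)
        start = text.find(query, start + 1)
    return hits

# Plain full-text (substring) search matches both queries.
tokyo_hits = substring_hits(text, "東京")
kyoto_hits = substring_hits(text, "京都")

# Assumed word segmentation of the same string, for comparison only.
segmented = ["今月", "の", "東京都議会"]
word_match_kyoto = "京都" in segmented
```

With word-level matching, the spurious hit for 「京都」 disappears because 「京都」 is not one of the words in the segmented text.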

[0004] Morphological analysis (for example, Japanese Patent Application Laid-Open No. 9-288673) is normally used for this kind of processing. In morphological analysis, a word dictionary is prepared and the text is divided into words, but the accuracy of the analysis depends on how complete this dictionary is. A method of estimating and collecting items that are not in the dictionary as unknown (unregistered) words is proposed in Japanese Patent Application Laid-Open No. 9-288673 and elsewhere. A method of exhaustively counting how often character strings occur in a text and collecting words and idioms from those frequencies is proposed in Japanese Patent Application Laid-Open No. 9-138801 and elsewhere.

[0005]

Problems to Be Solved by the Invention: However, because new words are constantly being coined in any language, a dictionary for morphological analysis requires constant maintenance. Word usage can also differ from one set of target documents to another, so the dictionary must be adjusted each time the target documents change. And however much care is taken, the possibility of encountering unknown words during morphological analysis, that is, words not listed in the dictionary, cannot be ruled out, and the appearance of unknown words can greatly reduce the accuracy of the analysis.

[0006] The present invention solves these problems of the prior art. Its object is to provide a word segmentation method that can divide text into words essentially without depending on a dictionary, and to provide an apparatus that carries out the method.

[0007]

Means for Solving the Problems: The word segmentation apparatus of the present invention therefore comprises text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information used for word extraction and word division; word extraction means for extracting words from the text data held in the text storage means by applying the pattern information; word storage means for storing the extracted words; word finding means for finding every position in the text data where an extracted word is used; and word division means for dividing the text data into words.

[0008] In the word segmentation method of the present invention, words are extracted from text data by applying pattern information for word extraction, every position in the text data where an extracted word is used is found, and the text data is divided into words using this position information and related information.

[0009] As a result, text can be divided into words without using a dictionary.

[0010]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS: The invention of claim 1 is a word segmentation apparatus for dividing text data into words, comprising text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information used for word extraction and word division; word extraction means for extracting words from the text data held in the text storage means by applying the pattern information; word storage means for storing the extracted words; word finding means for finding every position in the text data where an extracted word is used; and word division means for dividing the text data into words at the word positions found so far in the text data and at new division positions obtained by applying the word-division pattern information on the basis of those word positions. Text can thus be divided into words without using a dictionary.

[0011] The invention of claim 2 comprises text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information used for word extraction and word division; word extraction means for repeatedly extracting words from the text data by repeatedly applying the pattern information to the text data held in the text storage means; word storage means for storing the extracted words; word finding means for finding, each time the word extraction means extracts words, every position in the text data where an extracted word is used; and word division means for dividing the text data into words on the basis of the word positions found by the word finding means. By alternating between extracting words from the text data and finding every position where the extracted words are used, one word is identified, which in turn identifies the words before and after it, and so the number of extracted words grows.

[0012] The invention of claim 3 adds word count determination means for managing the number of words extracted by the word extraction means. Under the control of this means, the word extraction means repeats word extraction until the number of words stops increasing. After extraction has been repeated a number of times, no further increase in the word count is seen.
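The repeat-until-stable behaviour of claims 2 and 3 can be sketched as a fixed-point loop. The two rules inside `extract_once` are illustrative assumptions (a seed rule and a growth rule), not the patent's rule set, and all function names are hypothetical.

```python
import re

KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")

def extract_once(text, known):
    """One toy extraction pass; returns only words not yet known."""
    found = set()
    # Seed rule (assumed): two kanji directly before "は、" form a word.
    found.update(re.findall(r"([\u4e00-\u9fff]{2})は、", text))
    # Growth rule (assumed): if a kanji run starts with a known word,
    # the remainder of the run is taken as a new word.
    for run in KANJI_RUN.findall(text):
        for w in known:
            if run.startswith(w) and len(run) > len(w):
                found.add(run[len(w):])
    return found - known

def extract_until_stable(text):
    """Repeat extraction until a pass adds no new word."""
    known = set()
    passes = 0
    while True:
        new = extract_once(text, known)
        passes += 1
        if not new:  # the word count has stopped increasing
            return known, passes
        known |= new

words, passes = extract_until_stable("記録は、重要だ。記録方法を決める。")
```

In this toy input, the first pass finds 「記録」, the second uses it to find 「方法」, and the third finds nothing new, so the loop stops.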

[0013] The invention of claim 4 comprises text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information used for word extraction and word division; word extraction means for extracting words from the text data by applying the pattern information to the text data held in the text storage means; word storage means for storing the extracted words; word finding means for finding every position in the text data where an extracted word is used; division possibility calculation means for quantitatively calculating the possibility of division at each position found by the word finding means; and word division means for dividing the text data into words on the basis of the results of the division possibility calculation means. Whether the text should be divided at a given position can thus be judged quantitatively.

[0014] The invention of claim 5 adds numeric expression finding means for identifying numeric expressions in the text data held in the text storage means. A numeric expression has word boundaries immediately before and after its character string and none inside it, so, like punctuation marks, it can serve as a reference for word division.

[0015] In the invention of claim 6, the word extraction means extracts words of a specific character type from the text data. A word boundary is likely to exist where the character type changes.
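As an illustration of character-type change points, the sketch below groups consecutive characters by script. The Unicode ranges and the names `char_type` and `split_by_char_type` are assumptions made for this example; a type change is only a candidate boundary, not a confirmed one.

```python
from itertools import groupby

def char_type(ch):
    """Classify one character by script using basic Unicode ranges."""
    if "\u3041" <= ch <= "\u3096":
        return "hiragana"
    if "\u30a1" <= ch <= "\u30fa" or ch == "ー":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"

def split_by_char_type(text):
    """Group consecutive characters of the same type into runs."""
    return [(t, "".join(g)) for t, g in groupby(text, key=char_type)]

runs = split_by_char_type("デラックス家具は、")
```

Each run boundary marks a position where the character type changes, which is where claim 6 expects word boundaries to be likely.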

[0016] In the invention of claim 7, the word extraction means consists of kanji word extraction means for extracting kanji words, extraction means for extracting katakana words, and extraction means for extracting hiragana words. Each extraction means can extract words using rules suited to its character type.

[0017] The invention of claim 8 adds combination possibility calculation means for quantitatively calculating how likely words of different character types extracted by the word extraction means are to combine, and treats a combination of different character types with a high combination possibility as a single word. For example, a string such as 「オブジェクト指向」 (object-oriented) can be processed as one word.

[0018] The invention of claim 9 comprises text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information used for word extraction and word division; partial character string extraction means for extracting, from the text data held in the text storage means, character strings that contain a specific character type; classification and sorting means for classifying and sorting the extracted strings by the number of characters of that character type; word extraction means for extracting words of that character type by applying the pattern information to the strings sorted by character count; word storage means for storing the extracted words; word finding means for finding every position in the text data where an extracted word is used; division possibility calculation means for quantitatively calculating the possibility of division at each position found by the word finding means; and word division means for dividing the text data into words on the basis of the results of the division possibility calculation means. Words can be extracted more efficiently than when the pattern information is applied to the text data sequentially.

[0019] In the invention of claim 10, the partial character string extraction means extracts character strings containing kanji and the word extraction means extracts kanji words. Kanji words can thus be extracted efficiently.

[0020] The invention of claim 11 adds degenerate character string creation means for creating a character string in which every character type other than hiragana in the text data is replaced with a symbol. The partial character string extraction means extracts strings containing hiragana from the string created by the degenerate character string creation means, and the word extraction means extracts hiragana words. Hiragana words can thus be extracted efficiently.
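A degenerate character string of this kind might be built as follows. This is a sketch under stated assumptions: the placeholder `#`, the one-symbol-per-character replacement, and the function names are all hypothetical, since the text here does not specify the symbol or whether runs are collapsed.

```python
import re

def degenerate(text, symbol="#"):
    """Replace every non-hiragana character with a placeholder symbol."""
    return "".join(ch if "\u3041" <= ch <= "\u3096" else symbol for ch in text)

def hiragana_candidates(text):
    """Extract the hiragana runs that remain in the degenerate string."""
    return re.findall(r"[\u3041-\u3096]+", degenerate(text))

reduced = degenerate("その子供は、")
candidates = hiragana_candidates("その子供は、")
```

Because kanji, katakana, and punctuation all collapse to the same symbol, strings that differ only in their non-hiragana content reduce to the same form, which is one way the hiragana context could be compared across the text.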

[0021] The invention of claim 12 is a word segmentation method for dividing text data into words, in which words are extracted from the text data by applying pattern information for word extraction, every position in the text data where an extracted word is used is found, and the text data is divided into words at the word positions found and at new division positions obtained by applying word-division pattern information on the basis of those word positions. Text can thus be divided into words without using a dictionary.

[0022] In the invention of claim 13, words are extracted from text data by applying pattern information for word extraction, every position in the text data where an extracted word is used is found, new words are extracted by applying the pattern information on the basis of the word positions found, and, after this procedure has been repeated, the text data is divided at the word positions found in the text data. Repeating word extraction and word position finding increases the number of words extracted.

[0023] In the invention of claim 14, this repetition continues until no new word is extracted. After extraction has been repeated a number of times, no further increase in the word count is seen.

[0024] In the invention of claim 15, words are extracted from text data by applying pattern information for word extraction, every position in the text data where an extracted word is used is found, the possibility of division at each position found is calculated quantitatively, and the positions at which the text data is divided into words are decided from the calculated values. Whether a position is suitable as a word division position can thus be decided quantitatively.

[0025] In the invention of claim 16, digits are found in the text data, a numeric expression is identified from the character strings before and after them, and the position in the text data where the numeric expression appears is used as a reference position for word division. Like punctuation marks, numeric expressions are effective as references for word division.
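One way to locate numeric expressions and turn them into boundary positions is a regular expression over digits plus an optional trailing unit. The unit list in `UNITS` and the function name are assumptions for illustration; the patent text here does not enumerate the units it recognizes.

```python
import re

# Hypothetical unit characters that may follow a number.
UNITS = "年月日円個人%％"
NUMERIC = re.compile(r"[0-9０-９]+(?:[.．][0-9０-９]+)?[" + UNITS + r"]?")

def numeric_boundaries(text):
    """Return (start, end, expression) for each numeric expression found."""
    return [(m.start(), m.end(), m.group()) for m in NUMERIC.finditer(text)]

found = numeric_boundaries("価格は1200円です")
```

The start and end offsets of each match can then be treated as fixed division points, with no division allowed inside the match.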

[0026] In the invention of claim 17, character strings containing a specific character type are extracted from the text data, the extracted strings are classified and sorted by the number of characters of that character type, words of that character type are extracted by applying pattern information for word extraction to each of the strings sorted by character count, and every position in the text data where an extracted word is used is found. Word extraction can thus be made more efficient.
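The classify-and-sort step might look like the sketch below for kanji. It simplifies the claim by extracting pure kanji runs rather than longer strings that merely contain kanji, and `kanji_runs_by_length` is a hypothetical name.

```python
import re
from collections import defaultdict

def kanji_runs_by_length(text):
    """Group kanji runs by their character count and sort each group."""
    groups = defaultdict(list)
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        groups[len(run)].append(run)
    return {n: sorted(runs) for n, runs in sorted(groups.items())}

grouped = kanji_runs_by_length("記録方法は、記録の方法です")
```

Once the runs are grouped by length, short runs that have already been confirmed as words can be compared against the longer runs that contain them, instead of rescanning the whole text for each pattern.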

[0027] In the invention of claim 18, character strings containing kanji are extracted from the text data and kanji words are extracted. Kanji words can thus be extracted efficiently.

[0028] In the invention of claim 19, a character string is created in which every character type other than hiragana in the text data is replaced with a symbol, strings containing hiragana are extracted from this string, the extracted strings are classified and sorted by the number of hiragana characters, and hiragana words are extracted by applying pattern information for word extraction to each of the strings sorted by character count. Hiragana words can thus be extracted efficiently.

[0029] The invention of claim 20 is a word segmentation method for re-dividing text data that has already been divided into words, using an updated word group. The difference words between the word group used for the earlier division and the updated word group are extracted, text data containing the character string of a difference word is retrieved, the word division of the retrieved text data is undone to restore the text data as it was before division, and the restored text data is divided into words using the updated word group. Text data that has already been divided can thus be re-divided using the latest words.
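The re-division procedure of claim 20 can be sketched as follows. The longest-match `segment` function is only a stand-in for the patent's divider, the use of a symmetric difference for the "difference words" is an assumption, and all names are hypothetical.

```python
def segment(text, words):
    """Toy longest-match segmenter; unknown characters become single tokens."""
    out, i = [], 0
    max_len = max((len(w) for w in words), default=1)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in words:
                out.append(text[i:i + n])
                i += n
                break
    return out

def resegment(docs, old_words, new_words):
    """Re-divide only the documents that contain a difference word."""
    diff = old_words ^ new_words  # assumed definition of "difference words"
    updated = []
    for tokens in docs:
        raw = "".join(tokens)  # undo the earlier word division
        if any(w in raw for w in diff):
            updated.append(segment(raw, new_words))
        else:
            updated.append(tokens)  # no difference word: leave untouched
    return updated

old_words = {"東京", "都", "議会"}
new_words = old_words | {"東京都"}
docs = [segment("東京都議会", old_words), segment("議会", old_words)]
result = resegment(docs, old_words, new_words)
```

Only the first document is re-divided, because only it contains the newly added word 「東京都」.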

[0030] The invention of claim 21 is an information retrieval apparatus that searches text divided into words, comprising search target text storage means for storing the text data of the search target text divided into words; index creation means for creating an index of the search target text; index storage means for holding the created index; search means for searching the search target text using the index; extracted word storage means for keeping the words used to divide the search target text; a word segmentation apparatus for dividing text data into words; word comparison means for comparing the word group that the word segmentation apparatus used to divide new text data with the word group kept in the extracted word storage means and detecting the difference words; and word division correction means for controlling re-division of the search target text. The word division correction means performs control so that search target text containing the character string of a difference word is re-divided by the word segmentation apparatus. The word-divided search target text data of the information retrieval apparatus and its word search index can thus be updated.

[0031] The invention of claim 22 uses the word segmentation apparatus of any one of claims 1 to 11 as this word segmentation apparatus, and the text data is re-divided into words by that apparatus.

[0032] In the invention of claim 23, the search means uses the index to retrieve search target text containing the character string of a difference word. Text containing that string is retrieved using the index for full-text search.

[0033] In the invention of claim 24, the word division correction means compares the text data re-divided by the word segmentation apparatus with the search target text data as it was before re-division and, if the two differ, updates the search target text data held in the search target text storage means with the re-divided text data. When search target text in the search target text storage means has been updated, the index creation means re-creates the word search index for that text. The search target text data and the word search index can thus be updated efficiently.

[0034] Embodiments of the present invention are described below with reference to the drawings.

[0035] (First Embodiment) As shown in FIG. 1, the character string dividing apparatus of the first embodiment comprises text input means 101, through which the text data to be processed is input in electronic form; text storage means 102, which temporarily stores the text data input from the text input means 101 together with the intermediate text and the attributes and marks of characters in the text during processing; word extraction means 103, which extracts words from the text using knowledge such as patterns characteristic of the natural language; word storage means 104, which stores the words extracted by the word extraction means 103; word finding means 105, which finds the places in the text where words stored in the word storage means 104 appear; word division means 106, which divides the text stored in the text storage means 102 into words; word processing knowledge storage means 107, which stores knowledge for word processing; and text output means 108, which outputs the resulting text.

[0036] The word extraction means 103 and the word division means 106 use the knowledge accumulated in the word processing knowledge storage means 107 to extract words from the text or to divide the text into words.

[0037] The word storage means 104, which stores the words extracted by the word extraction means 103, holds nothing in its initial state. The stored words are kept in a structure such as a trie so that they can be searched at high speed.
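A trie of the kind mentioned above can be sketched with nested dictionaries. This is a minimal illustration, not the patent's data layout; the `Trie` class, the `"$end"` marker, and `matches_at` are hypothetical names.

```python
class Trie:
    """Minimal dictionary-based trie for stored words."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = True  # multi-character key, cannot clash with a character

    def matches_at(self, text, start):
        """Return every stored word that begins at text[start]."""
        node, found = self.root, []
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$end" in node:
                found.append(text[start:i + 1])
        return found

trie = Trie()
trie.insert("東京")
trie.insert("東京都")
hits = trie.matches_at("今月の東京都議会", 3)
misses = trie.matches_at("今月の東京都議会", 4)
```

A single walk from one text position finds every stored word starting there, which is what makes the lookup fast when the word finding step scans the whole text.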

[0038] The word finding means 105 finds the places in the text where words stored in the word storage means 104 appear, and writes marks into the text storage means 102 before and after each word found.

[0039] The word processing knowledge storage means 107 stores, as knowledge for word processing, knowledge for word extraction, knowledge for word division, and knowledge used for both extraction and division. This knowledge does not change from its initial state.

[0040] The apparatus of the present invention is implemented on a computer. The text input means 101 consists of, for example, an input device such as a keyboard or an OCR input device, and the text output means 108 consists of, for example, an output device such as a display or a printer. The text storage means 102 and the word processing knowledge storage means 107 are allocated in the computer's memory or in a storage area of a hard disk device. The other means are implemented by the computer's processing unit.

[0041] The operation of the word segmentation apparatus configured as above is described next. FIG. 2 shows the overall flow.

[0042] Step 201: The data input from the text input means 101 is first stored in the text storage means 102.

[0043] Step 202: The word extraction means 103 extracts words from this text, using the information in the word processing knowledge storage means 107. Each extracted word is stored in the word storage means 104 as it is found, and the place it was extracted from is marked at the corresponding location in the text storage means 102.

[0044] Step 203: The word finding means 105 checks whether each extracted word also appears in the text at places other than where it was extracted, and, if it does, marks the corresponding locations in the text storage means 102.

[0045] Step 204: The word division means 106 divides the text data in the text storage means 102 into words on the basis of the marks written in the preceding steps and the information in the word processing knowledge storage means 107.

[0046] Step 205: The divided data is output.
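Steps 201 to 205 can be sketched end to end as below. This is a simplified illustration under stated assumptions: the single extraction pattern is hypothetical, marks are kept as a set of offsets rather than written into the stored text, and step 204 here cuts only at marks, omitting the division patterns that the apparatus also applies.

```python
import re

def extract_words(text):
    # Step 202 stand-in: two kanji between a hiragana character and "は、".
    return set(re.findall(r"[\u3041-\u3096]([\u4e00-\u9fff]{2})は、", text))

def find_marks(text, words):
    # Step 203: mark the start and end offset of every occurrence of each word.
    marks = set()
    for w in words:
        for m in re.finditer(re.escape(w), text):
            marks.update((m.start(), m.end()))
    return marks

def divide(text, marks):
    # Step 204 (simplified): cut the text at every mark.
    cuts = sorted(marks | {0, len(text)})
    return [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]

def run(text):
    stored = text                       # step 201: store the input text
    words = extract_words(stored)       # step 202: extract words
    marks = find_marks(stored, words)   # step 203: find other occurrences
    return divide(stored, marks)        # steps 204-205: divide and output

pieces = run("その子供は、別の子供と遊ぶ")
```

The word 「子供」 is extracted once by the pattern, and the second occurrence, which the pattern alone does not match, is still separated because step 203 marks it.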

[0047] The operation of the word extraction means 103 in step 202 of FIG. 2, the word extraction process, is described in detail below, taking Japanese as an example.

[0048] The word extraction means 103 extracts words from the text using only pattern analysis of the surface characters, without a dictionary. In Japanese, patterns of hiragana strings that can be judged to be case particles can be found without syntactic analysis, and words are found using these patterns. Of the pattern information stored in the word processing knowledge storage means 107, this analysis uses the patterns designated for extraction or for both extraction and division.

[0049] FIG. 3 shows examples of the pattern information stored in the word processing knowledge storage means 107. Pattern 301 is used for both extraction and division. For the string 「その子供は、」, for example, it indicates that the text can be divided between 「その」 and 「子供」 and between 「子供」 and 「は、」, and that 「子供」 can be extracted as a word. Pattern 302, likewise used for both extraction and division, indicates that the same division and extraction are possible for a string such as 「デラックス家具は、」, and pattern 303 indicates the same for the string 「その子供が、」.
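Pattern 301 could be expressed as a regular expression along the following lines. FIG. 3 is not reproduced here, so the exact pattern is an assumed reading (a hiragana run, then two kanji, then the literal 「は、」), and `PATTERN_301` and `apply_pattern` are hypothetical names.

```python
import re

HIRAGANA = r"[\u3041-\u3096]"
KANJI = r"[\u4e00-\u9fff]"

# Assumed reading of pattern 301:
# hiragana run | two kanji (extractable part) | literal "は、"
PATTERN_301 = re.compile(
    rf"(?P<pre>{HIRAGANA}+)(?P<word>{KANJI}{{2}})(?P<post>は、)"
)

def apply_pattern(text):
    """Return the extracted word and the two division offsets, or None."""
    m = PATTERN_301.search(text)
    if m is None:
        return None
    return {
        "word": m.group("word"),
        "split_points": [m.start("word"), m.end("word")],
    }

result = apply_pattern("その子供は、")
no_match = apply_pattern("デラックス家具は、")
```

Under this reading, the katakana example does not match pattern 301, which is consistent with the text assigning it to a separate pattern 302.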

【００５０】また、パターン３１０は、後述する分割用
パターンであり、例えば「記録方法は、」という文字列
の場合、「記録」と言う漢字が単語であると認識される
場合に、その後に続く２文字の漢字「方法」は分割可能
であることを示している。Pattern 310 is a division pattern, described later. For the character string "記録方法は、" ("recording method", followed by the topic particle), for example, it indicates that when the kanji string "記録" is recognized as a word, the two kanji that follow it, "方法", can be split off.

【００５１】ここで、「任意の平仮名」「漢字２文字」
などはテキスト文字列にマッチするパターン指定であ
り、「文字列“は、”」は文字列そのものにマッチする
パターン指定であり、「抽出可能部分」はそのパターン
にマッチした文字列を単語とみなせる箇所、「分割可
能」は分割処理時に分割できる点を示すものである。In these patterns, elements such as "任意の平仮名" (any hiragana) and "漢字２文字" (two kanji) are specifications that match character strings in the text by character type, while "文字列“は、”" (the character string "は、") is a specification that matches that literal string. "抽出可能部分" (extractable part) marks the portion whose matching string can be regarded as a word, and "分割可能" (divisible) marks a point at which the text can be divided during division processing.

【００５２】単語抽出手段103は、このパターンをテキ
ストの文字列と照合し、もしパターンが合致した場合に
は、合致した部分の単語を抽出し、それを単語記憶手段
104に記憶する。The word extracting means 103 matches these patterns against the character strings of the text and, when a pattern matches, extracts the word in the matched portion and stores it in the word storage means 104.

【００５３】パターンの適用は、基本的にはテキストの
全域に渡って全てのパターンを試行し、パターンが適合
する場所を発見する。基本的アルゴリズムの一例を図４
に示すが、適合する場所の発見には別の方法を採用して
もかまわない。In principle, every pattern is tried across the entire text to find the places where it matches. FIG. 4 shows one example of a basic algorithm, but another method may be used to find the matching places.

【００５４】ステップ４０１：単語処理用知識記憶手段
107の中から、最初のパターンを選択する。なお、選択
可能なものは単語抽出用、または単語抽出分割兼用のパ
ターンのみである。Step 401: Select the first pattern from the word processing knowledge storage means 107. Only patterns for word extraction, or for both word extraction and division, can be selected.

【００５５】ステップ４０２：処理箇所を示すポインタ
を、テキスト記憶手段102の最初に移動する。Step 402: The pointer indicating the processing location is moved to the beginning of the text storage means 102.

【００５６】ステップ４０３：ポインタの位置からパタ
ーンの適用を試みる。Step 403: Attempt to apply the pattern from the position of the pointer.

【００５７】ステップ４０４：もしパターンの適用が成
功したなら、ステップ４０５：成功した部分の文字列を処理する。こ
こでは、文字列が単語と判断されたため、テキスト記憶
手段102の発見された単語部分にマークをつけ、発見さ
れた単語を単語記憶手段104に記憶する。Step 404: If the pattern was applied successfully, perform step 405.

Step 405: Process the matched portion of the character string. Because this string has been judged to be a word, the word found is marked in the text storage means 102 and stored in the word storage means 104.

【００５８】ステップ４０６：ポインタを、次にパター
ンが適用できる箇所まで移動させる。詳細は後で述べ
る。Step 406: The pointer is moved to a position where the pattern can be applied next. Details will be described later.

【００５９】ステップ４０７：ポインタがテキストの最
後を示しているかを判定し、最後でなければステップ４
０３へ戻って繰り返し処理を行う。テキストの最後を示
している場合、ステップ４０８へ進む。Step 407: Determine whether the pointer is at the end of the text. If it is not, return to step 403 and repeat the processing; if it is, proceed to step 408.

【００６０】ステップ４０８：選択されているパターン
は、単語処理用知識記憶手段107の中で最後のものかを
判定し、最後ならば終了する。最後でなければ、ステッ
プ４０９へ進む。Step 408: It is determined whether or not the selected pattern is the last one in the word processing knowledge storage means 107. If it is the last one, the process ends. If not the last, proceed to step 409.

【００６１】ステップ４０９：単語処理用知識記憶手段
107の中から、次のパターンを選択してステップ４０２
へ戻って繰り返し処理を行う。Step 409: Select the next pattern from the word processing knowledge storage means 107, then return to step 402 and repeat the processing.
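
The loop of steps 401 to 409 can be sketched in Python as follows. This is an illustrative sketch only, not the implementation disclosed in the specification: the regular-expression encoding of patterns 301 and 303, the names `EXTRACTION_PATTERNS` and `extract_words`, and the use of sets for the word store and the marks are all assumptions. Step 406 is simplified here to a one-character advance, and the sets absorb the duplicate matches that this causes.

```python
import re

# Character classes, given as Unicode ranges (assumed encoding of
# "any hiragana" and "kanji"; the specification does not define them this way).
HIRAGANA = "\u3041-\u309f"
KANJI = "\u4e00-\u9fff"

# Hypothetical encodings of patterns 301 and 303:
# any hiragana + two kanji (extractable) + a literal particle string.
EXTRACTION_PATTERNS = [
    re.compile(f"[{HIRAGANA}]+(?P<word>[{KANJI}]{{2}})は、"),
    re.compile(f"[{HIRAGANA}]+(?P<word>[{KANJI}]{{2}})が、"),
]


def extract_words(text, patterns=EXTRACTION_PATTERNS):
    """Try every pattern at every position and collect the extractable parts."""
    words = set()   # stands in for the word storage means 104
    marks = set()   # (start, end) spans marked in the stored text
    for pattern in patterns:                 # steps 401, 408, 409
        pos = 0                              # step 402
        while pos < len(text):               # step 407
            m = pattern.match(text, pos)     # step 403
            if m:                            # step 404
                words.add(m.group("word"))   # step 405
                marks.add(m.span("word"))
            pos += 1                         # step 406, simplified
    return words, marks


words, marks = extract_words("この装置は、その子供が、")
print(words, marks)
```

Encoding each pattern as one regular expression keeps the sketch short; a table of typed pattern elements, closer to the layout of FIG. 3, would work equally well.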

【００６２】図５の例の文字列に、図３の例に示された
パターンを用いて解析する場合を説明する。単語抽出分
割兼用パターンである図３のパターン３０１が選択され
ている時、テキストデータ中の文字列「この装置は、」
の「こ」の部分に処理用ポインタが置かれていた場合、
文字列「この」がパターン「任意の平仮名」に適合し、
文字列「装置」がパターン「漢字２文字」に適合し、文
字列「は、」がパターン文字列“は、”に適合すること
から、パターン全体が文字列に適合し、パターン適用が
成功する。そこで文字列「装置」の部分が単語であると
判断され、その部分にマークが付けられるとともに単語
記憶手段104に記憶される。The following describes how the example character string of FIG. 5 is analyzed with the patterns shown in FIG. 3. Suppose that pattern 301 of FIG. 3, a combined extraction and division pattern, is selected and that the processing pointer is at "こ" in the character string "この装置は、" ("this device", followed by the topic particle) in the text data. The string "この" matches the element "any hiragana", the string "装置" matches the element "two kanji", and the string "は、" matches the literal string "は、", so the whole pattern matches and its application succeeds. The string "装置" is therefore judged to be a word; it is marked in the text and stored in the word storage means 104.

【００６３】ステップ４０６におけるポインタの移動に
ついて図５の例を用いて説明する。ステップ４０６にお
けるポインタの移動は、適用が成功した場合も失敗した
場合も、次の文字への移動とは限らない。図５の文字列
「この装置は、」にパターン３０１を適用した場合、初
回の適用が文字位置１の「こ」から始まったとする。こ
の場合、次回の適用を次の文字である文字位置２の
「の」から開始したなら、パターン「任意の平仮名」に
文字列「の」が合致し、以下も全て合致することからパ
ターン３０１の適用は成功する。The movement of the pointer in step 406 is explained using the example of FIG. 5. Whether the application succeeded or failed, the pointer does not necessarily move to the very next character. Suppose pattern 301 is applied to the character string "この装置は、" of FIG. 5 and the first application starts at "こ" in character position 1. If the next application started at the following character, "の" in character position 2, the string "の" would match the element "any hiragana" and everything after it would match as well, so pattern 301 would again be applied successfully.

【００６４】しかし、この場合、文字位置１から始めて
も文字位置２から始めても同様に成功してしまうため結
果が重複してしまう。これを避けるため、ポインタの移
動は、パターン３０１における最初のパターン「任意の
平仮名」が終了した次の文字以降へ進めなければいけな
い。図５の例の場合、初回の適用が文字位置１からの場
合、次回の適用は文字位置３以降から試みることにな
る。パターンの解析により、成功した場合、失敗した場
合、それぞれどれだけ先から適用を試みるかをあらかじ
め計算しておいてもかまわない。However, because the application succeeds in the same way whether it starts at character position 1 or at character position 2, the results would be duplicated. To avoid this, the pointer must be advanced to the character following the end of the portion matched by the first element of pattern 301, "any hiragana", or beyond. In the example of FIG. 5, when the first application starts at character position 1, the next application is attempted from character position 3 onward. How far ahead the next application should start after a success and after a failure may be computed in advance by analyzing each pattern.
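
The pointer movement of step 406 might be sketched as follows for pattern 301 alone. The function names `match_301` and `scan_301` are hypothetical, positions are 0-based (so character positions 1 and 3 in the text correspond to indices 0 and 2), and skipping the whole leading hiragana run after a failure as well as after a success is an assumption that holds for this particular pattern, because every start position inside the same run leads to the same outcome.

```python
def is_hiragana(ch):
    return "\u3041" <= ch <= "\u309f"


def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"


def match_301(text, pos):
    """Try pattern 301 at pos.

    Returns (word_span, next_pos): word_span is the (start, end) of the
    two-kanji word or None, and next_pos is where the next attempt starts.
    """
    i = pos
    while i < len(text) and is_hiragana(text[i]):
        i += 1
    run_end = i
    if run_end == pos:
        # No leading hiragana here: simply move on by one character.
        return None, pos + 1
    word = text[run_end:run_end + 2]
    matched = (
        len(word) == 2
        and all(is_kanji(c) for c in word)
        and text.startswith("は、", run_end + 2)
    )
    # Restarting anywhere inside the same hiragana run would give the same
    # result, so the next attempt begins after the run.
    return ((run_end, run_end + 2) if matched else None), run_end


def scan_301(text):
    found, attempts, pos = [], [], 0
    while pos < len(text):
        attempts.append(pos)
        span, pos = match_301(text, pos)
        if span:
            found.append(span)
    return found, attempts


found, attempts = scan_301("この装置は、")
print(found, attempts)
```

In this run the attempt at index 1 ("の") is skipped, so "装置" is reported once rather than twice.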

【００６５】日本語の単語を発見するためのパターンと
しては、漢字や平仮名といった字種についての情報や、
「は、」「が、」といった格助詞と考えられる文字列を
用いる。字種の変わり目は単語の切れ目である可能性が
高く、「は、」「が、」といった文字列の直前の文字列
が名詞単語である可能性が高い。上記の例の他にも、
「である。」「なのだ。」といった断定表現の直前の漢
字文字列なども名詞単語の可能性が高い。このような知
識はあらかじめ単語処理用知識記憶手段107に記憶させ
ておく。Patterns for finding Japanese words use information about character types, such as kanji and hiragana, and character strings that are likely to be case particles, such as "は、" and "が、". A change of character type is likely to be a word boundary, and the string immediately before a string such as "は、" or "が、" is likely to be a noun. Beyond the examples above, a kanji string immediately before an assertive expression such as "である。" or "なのだ。" is also likely to be a noun. Such knowledge is stored in the word processing knowledge storage means 107 in advance.

【００６６】以下、図２のステップ２０３、単語発見処
理における単語発見手段105の動作について、詳細を説
明する。この処理は、ステップ２０２で抽出された単語
が、テキスト中の単語抽出箇所以外の場所で出現してい
る箇所を発見することを目的とする。基本的アルゴリズ
ムの一例を図６に示すステップ６０１：処理用のポインタをテキストの最初へ
移動する。The operation of the word finding means 105 in the word finding process, step 203 of FIG. 2, is described in detail below. The purpose of this process is to find places where the words extracted in step 202 appear in the text other than where they were extracted. FIG. 6 shows one example of a basic algorithm.

Step 601: Move the processing pointer to the beginning of the text.

【００６７】ステップ６０２：ポインタの位置から始ま
る単語が、単語記憶手段104にあるか調べる。あればス
テップ６０３へ、なければステップ６０４へ進む。Step 602: It is checked whether a word starting from the position of the pointer exists in the word storage means 104. If there is, go to step 603; otherwise go to step 604.

【００６８】ステップ６０３：もしそのような単語があ
れば、単語と認定された文字列について、テキスト記憶
手段102の中にマークを書き込む。図７の例のように、
テキストの中で、ステップ２０２の単語抽出処理で抽出
された単語について、抽出された箇所以外の場所に同じ
単語が出現していればそれにマークをつける。Step 603: If such a word exists, write a mark in the text storage means 102 for the character string recognized as a word. As in the example of FIG. 7, wherever a word extracted by the word extraction process of step 202 appears in the text at a place other than where it was extracted, that occurrence is marked.

【００６９】ステップ６０４：次の文字へポインタを移
動する。Step 604: Move the pointer to the next character.

【００７０】ステップ６０５：ポインタはテキストの最
後を示しているか判定し、最後でなければ６０２へ戻っ
て繰り返し処理を行う。ポインタがテキストの最後だっ
た場合には処理を終了する。Step 605: Determine whether the pointer is at the end of the text. If it is not, return to step 602 and repeat the processing; if it is, end the processing.

【００７１】ステップ２０３の単語発見処理は、テキス
ト中の単語の検索に相当するため、図６のアルゴリズム
の代わりに、単語記憶手段104の中の一つ一つの単語に
ついて、テキストの中を探す検索手法を採用してもかま
わない。Because the word finding process of step 203 amounts to searching the text for words, the algorithm of FIG. 6 may be replaced by a search method that looks through the text for each word in the word storage means 104 in turn.
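
Both ways of finding known words can be sketched as follows: a pointer scan in the manner of FIG. 6 and a per-word search. The function names and the representation of marks as a set of (start, end) index pairs are assumptions made for illustration, and the words are assumed to be non-empty strings.

```python
def find_known_words(text, known_words, already_marked=()):
    """Pointer scan: at each position, check whether a stored word starts there."""
    marks = set(already_marked)
    pos = 0                                        # step 601
    while pos < len(text):                         # step 605
        for word in known_words:                   # step 602
            if text.startswith(word, pos):
                marks.add((pos, pos + len(word)))  # step 603
        pos += 1                                   # step 604
    return marks


def find_by_search(text, known_words):
    """Alternative: search the text for each stored word in turn."""
    marks = set()
    for word in known_words:
        start = text.find(word)
        while start != -1:
            marks.add((start, start + len(word)))
            start = text.find(word, start + 1)
    return marks


text = "この装置は、その装置単体は、"
print(find_known_words(text, {"装置"}, already_marked={(2, 4)}))
print(find_by_search(text, {"装置"}))
```

Both functions return the same marks for the same input; the second simply lets the string search routine do the scanning.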

【００７２】次に、図２のステップ２０４での単語分割
処理における単語分割手段106の動作について詳細を説
明する。ステップ２０４の単語分割処理の動作は、ステ
ップ２０２の単語抽出処理の動作、すなわち図４のアル
ゴリズムと同様であるが、以下の三点で異なっている。（１）ステップ２０２の単語抽出処理は、パターンをも
とに単語の抽出を行い、図４のステップ４０５において
抽出できた単語を単語記憶手段104に記憶する。ステッ
プ２０４の単語分割処理は、パターンをもとに単語への
分割を行い、分割結果をテキスト記憶手段102に書き込
む。（２）ステップ２０２の単語抽出処理は、単語処理用知
識記憶手段107に記憶されている情報のうち、抽出用、
または抽出分割兼用のものを用いる。ステップ２０４の
単語分割処理は、単語処理用知識記憶手段107に記憶さ
れている情報のうち、分割用、または抽出分割兼用のも
のを用いる。（３）ステップ２０２の単語抽出処理は、漢字や平仮名
といった文字種類や「は、」「が、」といった特定文字
列のパターンのみを使うが、ステップ２０４の単語分割
処理は、既に発見されている単語の位置といった情報を
パターンとして使うことができる。Next, the operation of the word dividing means 106 in the word division process, step 204 of FIG. 2, is described in detail. The word division process of step 204 operates in the same way as the word extraction process of step 202, that is, the algorithm of FIG. 4, but differs in the following three respects.
(1) The word extraction process of step 202 extracts words on the basis of the patterns and, in step 405 of FIG. 4, stores the extracted words in the word storage means 104. The word division process of step 204 divides the text into words on the basis of the patterns and writes the division result to the text storage means 102.
(2) The word extraction process of step 202 uses, of the information stored in the word processing knowledge storage means 107, the patterns for extraction or for both extraction and division. The word division process of step 204 uses the patterns for division or for both extraction and division.
(3) The word extraction process of step 202 uses only patterns of character types, such as kanji and hiragana, and of specific character strings, such as "は、" and "が、". The word division process of step 204 can also use information such as the positions of words that have already been found as part of a pattern.

【００７３】図８の例の文字列に、図３の例に示された
パターン３１０の適用を試みたとしよう。すでに単語
「装置」は単語記憶手段104に記憶されており、単語発
見手段105によってマークされているものとする。この
時、テキストデータ中の文字列「その装置単体は、」の
文字列「装置」がパターン「漢字単語」に適合し、文字
列「単体」がパターン「漢字２文字」に適合し、文字列
「は、」がパターン文字列“は、”に適合することか
ら、パターン全体が文字列に適合し、パターン適用が成
功する。そこで文字列「単体」の部分が単語であると判
断され、その前後で分割可能であると判断される。この
ようにして、単語処理用知識記憶手段107のパターン情
報と、単語記憶手段104の単語情報をもとに、テキスト
データを単語に分割する。Suppose that pattern 310 of FIG. 3 is applied to the example character string of FIG. 8. Assume that the word "装置" has already been stored in the word storage means 104 and has been marked by the word finding means 105. The string "装置" in the text string "その装置単体は、" then matches the element "kanji word", the string "単体" matches the element "two kanji", and the string "は、" matches the literal string "は、", so the whole pattern matches and its application succeeds. The string "単体" is therefore judged to be a word, and the text is judged to be divisible immediately before and after it. In this way the text data is divided into words on the basis of the pattern information in the word processing knowledge storage means 107 and the word information in the word storage means 104.
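
A minimal sketch of applying pattern 310 to already-marked words follows. The names `apply_pattern_310` and `split_at` are hypothetical, and marks are assumed to be (start, end) index pairs of known words. Only the two boundaries that pattern 310 itself yields are returned; boundaries contributed by other patterns are left out.

```python
def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"


def apply_pattern_310(text, word_marks):
    """Known kanji word + two kanji + literal "は、".

    Returns the positions just before and just after the two kanji.
    """
    boundaries = set()
    for _start, end in sorted(word_marks):
        tail = text[end:end + 2]
        if (
            len(tail) == 2
            and all(is_kanji(c) for c in tail)
            and text.startswith("は、", end + 2)
        ):
            boundaries.update({end, end + 2})
    return boundaries


def split_at(text, boundaries):
    """Cut the text at the given character positions."""
    cuts = [0] + sorted(b for b in boundaries if 0 < b < len(text)) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


text = "その装置単体は、"
bounds = apply_pattern_310(text, {(2, 4)})   # "装置" is already marked
print(bounds, split_at(text, bounds))
```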

【００７４】以上のように、ステップ２０２からステッ
プ２０４によって、テキストデータを単語に分割する。As described above, the text data is divided into words in steps 202 to 204.

【００７５】なお、この実施形態では日本語における単
語分割を例にしたが、本発明は、日本語と同じ膠着語で
ある中国語などにも同様に適用できる。Although this embodiment takes word division in Japanese as its example, the invention can be applied in the same way to languages such as Chinese, which is described here as an agglutinative language like Japanese.

【００７６】以上のように、本実施形態では、単語処理
用知識記憶手段107の情報をもとに単語抽出手段103が抽
出した単語を単語記憶手段104に記憶し、この単語情報
と、単語処理用知識記憶手段107の情報を用いて、単語
分割手段106がテキストを単語に分割することで、辞書
を用いない単語分割を行うことが可能になり、その実用
的効果は大きい。As described above, in this embodiment the word extracting means 103 extracts words on the basis of the information in the word processing knowledge storage means 107 and stores them in the word storage means 104, and the word dividing means 106 divides the text into words using this word information together with the information in the word processing knowledge storage means 107. Word division without a dictionary thus becomes possible, which is of considerable practical value.

【００７７】（第２の実施の形態）第２の実施形態の文
字列分割装置では、同じテキストに対して単語抽出を繰
り返して実施することにより、抽出する単語数を増やし
ている。(Second Embodiment) In the character string dividing apparatus according to the second embodiment, the number of words to be extracted is increased by repeatedly performing word extraction on the same text.

【００７８】例えば、図１１に示すように、第１回目の
単語抽出で、「この装置は、」という文字列から単語
「装置」が抽出できると、第２回目の単語抽出では「こ
の装置単体は、」と言う文字列から「単体」を単語抽出
することができ、次の回には、「よって単体動作は、」
という文字列から「動作」という単語を抽出することが
できる。このように、同一テキストに対して単語抽出を
繰り返すことにより、単語数が増加する。For example, as shown in FIG. 11, once the word "装置" (device) has been extracted from the character string "この装置は、" in the first round of word extraction, the word "単体" (single unit) can be extracted from the character string "この装置単体は、" in the second round, and in the following round the word "動作" (operation) can be extracted from the character string "よって単体動作は、". Repeating word extraction on the same text in this way increases the number of extracted words.

【００７９】この文字列分割装置は、図９に示すよう
に、単語記憶手段104に記憶される単語数を管理し、単
語数の増加が見られなくなるまで単語抽出を繰り返すよ
うに装置全体の動作を制御する単語数判定手段109を備
えている。その他のブロック構成は第１の実施形態（図
１）と変わりがない。ただ、単語抽出手段103は、繰り
返し処理を行うために、繰り返し処理用カウンタを使う
ように拡張されており、また、単語記憶手段104は、繰
り返し処理を行うために、繰り返しの何回目に発見され
た単語であるかを記憶できるように拡張されており、こ
れらの点は第１の実施形態と違っている。As shown in FIG. 9, this character string dividing apparatus includes word number determination means 109, which manages the number of words stored in the word storage means 104 and controls the operation of the whole apparatus so that word extraction is repeated until the number of words stops increasing. The rest of the block configuration is the same as in the first embodiment (FIG. 1). The differences from the first embodiment are that the word extracting means 103 is extended to use a counter for the iteration, and the word storage means 104 is extended so that it can record in which iteration each word was found.

【００８０】この単語数判定手段109はコンピュータの
計算機構により構成される。The word number judging means 109 is constituted by a calculation mechanism of a computer.

【００８１】以上のように構成された単語分割装置につ
いて、その動作を説明する。全体の流れを図１０で示
す。The operation of the word segmenting apparatus configured as described above will be described. FIG. 10 shows the overall flow.

【００８２】ステップ１００１：テキスト入力手段101
から入力されたデータは、まずテキスト記憶手段102に
蓄えられる。Step 1001: The data input from the text input means 101 is first stored in the text storage means 102.

【００８３】ステップ１００２：繰り返しをカウントす
るための変数Ｎを１に初期化する。ステップ１００３：単語発見手段105は、繰り返しのＮ
−１回目に抽出された単語が、テキスト中の抽出された
以外の場所で出現しているかを探す。もし出現していれ
ばテキスト記憶手段102の該当箇所にマークをつける。
ステップ１００３は、第１の実施形態（図２）における
ステップ２０３での単語発見処理と同様であるが、ステ
ップ１００３の単語発見処理は、繰り返しの一回前、す
なわちＮ−１回目に発見された単語についてのみ処理を
行えばいいという点が違っている。一回目の処理、すな
わちＮ＝１でＮ−１＝０の時は、単語記憶手段104に単
語が一つも記憶されていないため、この処理は一つも単
語が発見されないまま終了する。Step 1002: Initialize the variable N, which counts the iterations, to 1.

Step 1003: The word finding means 105 searches for places where the words extracted in iteration N-1 appear in the text other than where they were extracted, and marks each such place in the text storage means 102. Step 1003 is the same as the word finding process of step 203 in the first embodiment (FIG. 2), except that it needs to process only the words found in the preceding iteration, that is, iteration N-1. In the first pass, where N = 1 and N-1 = 0, no words have yet been stored in the word storage means 104, so the process ends without finding any word.

【００８４】ステップ１００４：単語抽出手段103は、
テキスト記憶手段102のテキストデータから単語を抽出
する。抽出には、前回の繰り返しまでに発見された単語
のマーク、ステップ１００３で書き込まれた単語のマー
クといった単語情報と、単語処理用知識記憶手段107の
情報が用いられる。抽出された単語は逐次、単語記憶手
段104に繰り返しをカウントするための変数Ｎの値とと
もに蓄える。さらに抽出した箇所については、テキスト
記憶手段102の該当箇所にマークをつける。抽出方法
は、第１の実施形態における図２のステップ２０２と同
様であるが、抽出に今迄に発見されている単語情報を利
用する点と、抽出された単語とともに繰り返し回数Ｎを
単語記憶手段104に記憶する点の二点が異なっている。Step 1004: The word extracting means 103 extracts words from the text data in the text storage means 102. Extraction uses word information, namely the marks of words found up to the previous iteration and the marks written in step 1003, together with the information in the word processing knowledge storage means 107. Each extracted word is stored in the word storage means 104 as it is found, together with the value of the iteration counter N, and the place from which it was extracted is marked in the text storage means 102. The extraction method is the same as step 202 of FIG. 2 in the first embodiment, with two differences: the word information found so far is used for extraction, and the iteration number N is stored in the word storage means 104 along with each extracted word.

【００８５】ステップ１００５：単語抽出が充分に行わ
れたかを判断する。充分でない場合は、ステップ１００
６へ、充分と判断された場合はステップ１００７へ進
む。判断には、単語数判定手段109が、今回の繰り返し
で新たに単語が抽出されたかどうかを、単語記憶手段10
4の情報をもとに判断する。新規に一つでも単語が抽出
されていれば、充分ではないと判断してステップ１００
６へ進む。Step 1005: Determine whether word extraction has been carried out sufficiently. If not, proceed to step 1006; if so, proceed to step 1007. For this determination, the word number determination means 109 checks, on the basis of the information in the word storage means 104, whether any new word was extracted in the current iteration. If even one new word was extracted, extraction is judged insufficient and processing proceeds to step 1006.

【００８６】ステップ１００６：繰り返しをカウントす
るための変数Ｎを１増やし、ステップ１００３に戻って
繰り返し処理を行う。Step 1006: The variable N for counting repetitions is incremented by 1, and the flow returns to step 1003 to perform repetitive processing.

【００８７】ステップ１００７：単語分割手段106は、
前ステップまでに書き込まれたマークと、単語処理用知
識記憶手段107の情報をもとに、テキスト記憶手段102の
テキストデータを単語に分割する。分割戦略は単語処理
用知識記憶手段107の内容の変更でいつでも変更可能だ
が、基本的には単語のマークの前後で単語を分割する。Step 1007: The word dividing means 106 divides the text data in the text storage means 102 into words on the basis of the marks written in the preceding steps and the information in the word processing knowledge storage means 107. The division strategy can be changed at any time by changing the contents of the word processing knowledge storage means 107, but basically the text is divided immediately before and after each word mark.

【００８８】ステップ１００８：結果を出力して終了す
る。Step 1008: Output the result and end.

【００８９】上記ステップ１００４における処理は、第
１の実施形態における図２のステップ２０２単語抽出処
理と類似したものであるが、ステップ１００４はＮ−１
回目の繰り返しまでに発見された単語情報を利用して、
単語の抽出を行う点が異なっている。The processing in step 1004 is similar to the word extraction process of step 202 of FIG. 2 in the first embodiment, but differs in that step 1004 extracts words using the word information found up to iteration N-1.

【００９０】図１１の例を使って説明する。テキストデ
ータ「この装置は、」から、Ｎ＝１回目に単語「装置」
を抽出する方法は、第１の実施形態における図５と同様
である。Ｎ＝２回目において、テキストデータ「この装
置単体は、」から文字列「単体」を単語と判断するの
は、第１の実施形態における図８と同様である。Ｎ＝３
回目において、同様に、テキストデータ「よって単体動
作は、」から文字列「動作」を単語と判断して抽出す
る。このようにＮ−１回目までに抽出された単語の情報
を使って複合語の分解を行いながら、Ｎ回目の単語抽出
を行う。This is explained using the example of FIG. 11. In iteration N = 1, the word "装置" is extracted from the text data "この装置は、" in the same way as in FIG. 5 of the first embodiment. In iteration N = 2, the string "単体" is judged to be a word from the text data "この装置単体は、" in the same way as in FIG. 8 of the first embodiment. In iteration N = 3, the string "動作" is likewise judged to be a word and extracted from the text data "よって単体動作は、". In this way, the N-th word extraction is performed while compound words are decomposed using the information on the words extracted up to iteration N-1.
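
The repetition of steps 1002 to 1006 can be sketched as a loop that runs until an iteration yields no new word. This is a simplified illustration under stated assumptions: the names `extract_round` and `extract_until_stable` are hypothetical, the two patterns are encoded as regular expressions, and the word finding of step 1003 is folded into the regular-expression search instead of being kept as separate marks in the text.

```python
import re

HIRAGANA = "\u3041-\u309f"
KANJI = "\u4e00-\u9fff"

# Assumed encoding of pattern 301: any hiragana + two kanji + "は、".
PATTERN_301 = re.compile(f"[{HIRAGANA}]+(?P<word>[{KANJI}]{{2}})は、")


def extract_round(text, known):
    """One extraction pass that may use the words found so far."""
    found = {m.group("word") for m in PATTERN_301.finditer(text)}
    for word in known:
        # Known word + two kanji + "は、": the two kanji become a new word.
        tail = re.compile(re.escape(word) + f"(?P<word>[{KANJI}]{{2}})は、")
        found.update(m.group("word") for m in tail.finditer(text))
    return found - set(known)


def extract_until_stable(text):
    words = {}                              # word -> iteration N in which it was found
    n = 1                                   # step 1002
    while True:
        new = extract_round(text, words)    # steps 1003 and 1004
        if not new:                         # step 1005: nothing new, stop
            break
        for word in new:
            words[word] = n
        n += 1                              # step 1006
    return words


print(extract_until_stable("この装置は、この装置単体は、よって単体動作は、"))
```

Recording the iteration number alongside each word mirrors the extension of the word storage means 104 and would allow step 1003 to restrict itself to the words of iteration N-1.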

【００９１】このように、ステップ１００１からステッ
プ１００８によって、テキストデータを、より細かく単
語抽出して単語分割を行うことが可能になる。As described above, the steps 1001 to 1008 make it possible to more precisely extract words from text data and perform word division.

【００９２】以上のように、この実施形態では、単語処
理用知識記憶手段107の情報をもとに単語抽出手段103が
単語を抽出して、抽出した単語を単語記憶手段104に記
憶し、単語抽出手段103はこの単語の情報を用いて繰り
返し単語抽出を行い、単語数判定手段109が充分な数の
単語を抽出したと判断した上で、この単語情報と単語処
理用知識記憶手段107の情報を用いて、単語分割手段106
がテキストを単語に分割することで、辞書を用いない単
語分割を行うことが可能になり、その実用的効果は大き
い。As described above, in this embodiment the word extracting means 103 extracts words on the basis of the information in the word processing knowledge storage means 107 and stores them in the word storage means 104, then repeats word extraction using this word information. Once the word number determination means 109 judges that a sufficient number of words has been extracted, the word dividing means 106 divides the text into words using this word information and the information in the word processing knowledge storage means 107. Word division without a dictionary thus becomes possible, which is of considerable practical value.

【００９３】（第３の実施の形態）第３の実施形態の文
字列分割装置は、単語分割の可能性を定量的に求め、そ
の値から単語分割位置を決定する。(Third Embodiment) The character string dividing apparatus according to the third embodiment quantitatively determines the possibility of word division and determines the word division position from the value.

【００９４】また、テキストデータの中で使われている
数字を基に単語分割が可能な点を求めることも行う。例
えば、図１５に示すように、「この量は合計１万５千ト
ンに」という文字列がある場合に、１や５と言う数字の
前後の文字列を調べることにより、数値表現の単語「合
計１万５千トン」を識別する。Points at which words can be divided are also found on the basis of the numerals used in the text data. For example, as shown in FIG. 15, for the character string "この量は合計１万５千トンに" ("this amount comes to a total of 15,000 tons"), examining the strings before and after the numerals １ and ５ identifies the numerical expression "合計１万５千トン" ("a total of 15,000 tons") as a single word.
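
One way to picture this search is the following sketch, which loosely follows the steps of FIG. 14 described later in the text. The function name, the small `UNITS` and `PREFIXES` lists, and the rule that any number character may follow any other are simplifying assumptions made here; the specification checks connectability between number characters in more detail.

```python
# Illustrative character and word lists (assumptions, not taken from FIG. 14).
NUMBER_CHARS = set("0123456789０１２３４５６７８９一二三四五六七八九十百千万億")
UNITS = ("トン", "メートル", "円")
PREFIXES = ("合計", "およそ", "約")


def find_numeric_expressions(text):
    """Return (start, end) spans of numerical expressions such as "合計１万５千トン"."""
    spans = []
    p = 0
    while p < len(text):                                    # step 1402
        if text[p] not in NUMBER_CHARS:                     # step 1403
            p += 1                                          # step 1404
            continue
        q, r = p, p + 1                                     # step 1405
        while r < len(text) and text[r] in NUMBER_CHARS:    # steps 1406, 1407 (simplified)
            r += 1
        unit = next((u for u in UNITS if text.startswith(u, r)), None)  # step 1408
        if unit is not None:
            r += len(unit)                                  # step 1409
            for prefix in PREFIXES:                         # step 1410
                if text.endswith(prefix, 0, q):
                    q -= len(prefix)                        # step 1411
                    break
        elif r - q == 1:
            # A lone number character with no unit is probably not a number
            # (step 1414), so it is skipped.
            p = r
            continue
        spans.append((q, r))                                # step 1412
        p = r                                               # step 1413
    return spans


text = "この量は合計１万５千トンに"
for start, end in find_numeric_expressions(text):
    print(text[start:end])
```

The text would be divisible before and after each returned span and not inside it.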

【００９５】この装置は、図１２に示すように、テキス
ト中の数値表現を発見する数値表現発見手段110と、テ
キストのある位置が単語に分割できるかどうかの可能性
を定量的に計算する単語分割可能性計算手段111とを備
えている。As shown in FIG. 12, this apparatus includes a numerical expression finding means 110 for finding a numerical expression in a text, and a word for quantitatively calculating the possibility of dividing a certain position of the text into words. And division possibility calculating means 111.

【００９６】単語分割手段106は、単語分割可能性計算
手段111が求めた値を基に、テキストを単語に分割す
る。また、単語処理用知識記憶手段107は、単語抽出手
段103及び単語分割手段106だけでなく、単語分割可能性
計算手段111でも用いる知識を記憶している。The word dividing means 106 divides the text into words on the basis of the values obtained by the word division possibility calculating means 111. The word processing knowledge storage means 107 stores the knowledge used not only by the word extracting means 103 and the word dividing means 106 but also by the word division possibility calculating means 111.

【００９７】図１６は単語処理用知識記憶手段107に記
憶されている情報の一例である。例えば、パターン番号
１５０１のパターン「任意の平仮名＋（分離３０）＋漢
字２文字（抽出可能）＋（分離３０）＋“は、”」に
は、図３のパターン３０１に、「任意の平仮名」と「漢
字２文字」とが分離出来る可能性を示す数値（分離３
０）が加えられ、「漢字２文字」と「“は、”」とが分
離出来る可能性を示す数値（分離３０）が加えられてい
る。この数値は正値も負値も取ることが可能であり、正
値は分割できることを、負値は分割できないことを示
す。また、パターンに「（抽出可能）」と書かれている
ものは、そのパターンにマッチした文字列を単語と認定
して抽出することができることを示す。FIG. 16 shows an example of the information stored in the word processing knowledge storage means 107. For example, the pattern with pattern number 1501, "any hiragana + (separation 30) + two kanji (extractable) + (separation 30) + "は、"", is pattern 301 of FIG. 3 with two values added: a value (separation 30) indicating how likely it is that "any hiragana" and "two kanji" can be separated, and a value (separation 30) indicating how likely it is that "two kanji" and "は、" can be separated. These values may be positive or negative; a positive value indicates that the text can be divided at that point, and a negative value that it cannot. A pattern element labeled "(extractable)" indicates that the string matching it can be recognized as a word and extracted.
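
A scored pattern of this kind might be applied as in the sketch below. The way scores from several sources are combined is not stated at this point in the text, so summing them per position and dividing wherever the total is positive are assumptions made here, as are the function names and the regular-expression encoding of pattern 1501.

```python
import re
from collections import defaultdict

HIRAGANA = "\u3041-\u309f"
KANJI = "\u4e00-\u9fff"

# Assumed encoding of pattern 1501; both separation points carry the value 30.
PATTERN_1501 = re.compile(f"[{HIRAGANA}]+(?P<word>[{KANJI}]{{2}})は、")
SEPARATION = 30


def score_boundaries(text):
    """Map character positions to the separation values pattern 1501 assigns."""
    scores = defaultdict(int)
    for m in PATTERN_1501.finditer(text):
        start, end = m.span("word")
        scores[start] += SEPARATION   # between the hiragana and the two kanji
        scores[end] += SEPARATION     # between the two kanji and "は、"
    return dict(scores)


def merge_scores(*score_maps):
    """Add up the scores written for each position (assumed combination rule)."""
    total = defaultdict(int)
    for scores in score_maps:
        for pos, value in scores.items():
            total[pos] += value
    return dict(total)


def split_points(scores):
    """Positions whose total is positive are treated as divisible."""
    return sorted(pos for pos, value in scores.items() if value > 0)


scores = score_boundaries("その子供は、")
print(scores, split_points(scores))
```

Under this combination rule, a large negative value written at a position keeps the total negative even when a pattern adds 30 there, which is how a strongly negative score can veto a division.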

【００９８】その他の構成は第２の実施形態（図９）と
変わりがない。The other structure is the same as that of the second embodiment (FIG. 9).

【００９９】数値表現発見手段110及び単語分割可能性
計算手段111手段はコンピュータの計算機構により構成
される。The numerical expression finding means 110 and the word division possibility calculating means 111 are implemented by the computing mechanism of a computer.

【０１００】以上のように構成された単語分割装置につ
いて、その動作を説明する。全体の流れを図１３で示
す。The operation of the word segmenting apparatus configured as described above will be described. FIG. 13 shows the entire flow.

【０１０１】ステップ１３０１：テキスト入力手段101
から入力されたデータは、まずテキスト記憶手段102に
蓄えられる。Step 1301: The data input from the text input means 101 is first stored in the text storage means 102.

【０１０２】ステップ１３０２：数値表現発見手段110
は、テキスト全域にわたって数値表現を探し、数値表現
が発見されたらその前後をマークする。数値表現とは
「１００円」「約二ヶ月前」「１万５千メートル」とい
った数値を表現する文字列である。詳細は後で述べる。Step 1302: The numerical expression finding means 110 searches the entire text for numerical expressions and, when one is found, marks the positions before and after it. A numerical expression is a character string that expresses a numeric value, such as "１００円" (100 yen), "約二ヶ月前" (about two months ago), or "１万５千メートル" (15,000 meters). Details are given later.

【０１０３】ステップ１３０３：繰り返しをカウントす
るための変数Ｎを１に初期化する。ステップ１３０４：単語発見手段105は、繰り返しのＮ
−１回目に抽出された単語が、テキスト中の抽出された
以外の場所で出現しているかを探す。この処理は、第２
の実施形態（図１０）のステップ１００３と同様であ
る。Step 1303: Initialize the variable N, which counts the iterations, to 1.

Step 1304: The word finding means 105 searches for places where the words extracted in iteration N-1 appear in the text other than where they were extracted. This processing is the same as step 1003 of the second embodiment (FIG. 10).

【０１０４】ステップ１３０５：単語抽出手段103は、
前ステップまでに発見された単語位置のマークと、数値
表現のマークと、単語処理用知識記憶手段107の情報を
もとに、テキストデータの中で単語に分割できる点を計
算し、その計算結果をもとに単語抽出を行う。この処理
は、第２の実施形態（図１０）のステップ１００４と同
様であるが、ステップ１３０２で発見された数値表現の
前後で単語分割が可能であるという情報を利用できる点
が異なっている。数値表現は一つの単語として扱う。Step 1305: The word extracting means 103 calculates the points at which the text data can be divided into words, on the basis of the word position marks found in the preceding steps, the numerical expression marks, and the information in the word processing knowledge storage means 107, and extracts words on the basis of the result. This processing is the same as step 1004 of the second embodiment (FIG. 10), except that it can also use the information that the text can be divided before and after the numerical expressions found in step 1302. A numerical expression is treated as a single word.

【０１０５】ステップ１３０６：単語抽出が充分に行わ
れたかを判断する。充分でない場合は、ステップ１３０
７へ、充分と判断された場合はステップ１３０８へ進
む。判断には、単語数判定手段109が、今回の繰り返し
で新たに単語が抽出されたかどうかを、単語記憶手段10
4の情報をもとに判断する。新規に一つでも単語が抽出
されていれば、充分ではないと判断してステップ１３０
７へ進む。Step 1306: Determine whether word extraction has been carried out sufficiently. If not, proceed to step 1307; if so, proceed to step 1308. For this determination, the word number determination means 109 checks, on the basis of the information in the word storage means 104, whether any new word was extracted in the current iteration. If even one new word was extracted, extraction is judged insufficient and processing proceeds to step 1307.

【０１０６】ステップ１３０７：繰り返しをカウントす
るための変数Ｎを１増やし、ステップ１３０４に戻って
繰り返し処理を行う。Step 1307: The variable N for counting repetitions is incremented by 1, and the process returns to step 1304 to perform repetitive processing.

【０１０７】ステップ１３０８：分割可能性計算手段11
1は、前ステップまでに書き込まれた単語のマークと、
数値表現のマークと、単語処理用知識記憶手段107の情
報をもとに、テキストデータの中で単語に分割できる点
を計算する。そして、分割可能と判断される場所で、単
語分割手段106がテキストを分割する。詳細は後で述べ
る。Step 1308: The word division possibility calculating means 111 calculates the points at which the text data can be divided into words, on the basis of the word marks written in the preceding steps, the numerical expression marks, and the information in the word processing knowledge storage means 107. The word dividing means 106 then divides the text at the places judged to be divisible. Details are given later.

【０１０８】ステップ１３０９：結果を出力して終了す
る。Step 1309: Output the result and end.

【０１０９】以下、図１３のステップ１３０２での数値
表現発見処理における、数値表現発見手段110の動作の
詳細について、日本語を例に説明する。図１４にアルゴ
リズムを示す。Hereinafter, the details of the operation of the numerical expression finding means 110 in the numerical expression finding process in step 1302 in FIG. 13 will be described with reference to Japanese as an example. FIG. 14 shows the algorithm.

【０１１０】ステップ１４０１：ポインタＰ，Ｑ，Ｒを
テキストの先頭に位置させる。ポインタＰは処理位置
を、ポインタＱは数値表現の最初を、ポインタＲは数値
表現の最後を記録するのに用いる。Step 1401: The pointers P, Q, and R are positioned at the head of the text. The pointer P is used to record the processing position, the pointer Q is used to record the beginning of the numerical expression, and the pointer R is used to record the end of the numerical expression.

【０１１１】ステップ１４０２：ポインタＰの指す位置
がテキストの最後かを判断する。もし最後なら終了。最
後でなければステップ１４０３へ。Step 1402: It is determined whether the position pointed by the pointer P is at the end of the text. If it is the last, it ends. If not, go to step 1403.

【０１１２】ステップ１４０３：ポインタＰが指してい
る文字は数を表現する文字かどうかを判断する。数を表
現する文字には、１，２，３といった算用数字、一、
二、三、十、百、千といった漢数字がある。もし数を表
現する文字だったならステップ１４０５へ、そうでなか
ったならステップ１４０４へ。Step 1403: Determine whether the character pointed to by pointer P is a character that expresses a number. Such characters include Arabic numerals such as １, ２, and ３ and kanji numerals such as 一, 二, 三, 十, 百, and 千. If the character expresses a number, proceed to step 1405; otherwise proceed to step 1404.

【０１１３】ステップ１４０４：ポインタＰを一文字進
めてステップ１４０２へ。Step 1404: The pointer P is advanced by one character, and the flow advances to step 1402.

【０１１４】ステップ１４０５：ポインタＱをポインタ
Ｐに一致させる。ポインタＲをＰの次の文字にセットす
る。Step 1405: The pointer Q is matched with the pointer P. Set the pointer R to the character following P.

【０１１５】ステップ１４０６：この時点でポインタＱ
からポインタＰの文字までが数の表現文字列とみなされ
る。ポインタＰの次の文字、すなわちＰ＋１の位置にあ
る文字は、ＱからＰまでの数の表現と接続し得るか判断
する。これは漢数字の表現の「五百二十」などという次
に算用数字の「２３」などは続かないが、漢数字の
「億」「万」「一」などは接続できる。これらの接続可
能性を調べる。接続できる場合はステップ１４０７へ、
接続できない場合は１４０８へ。Step 1406: At this point, the characters from pointer Q through pointer P are regarded as a character string expressing a number. Determine whether the character following pointer P, that is, the character at position P+1, can be joined to the number expressed from Q through P. For instance, a kanji-numeral expression such as "五百二十" is not followed by Arabic numerals such as "２３", but kanji numerals such as "億", "万", and "一" can be joined to it. This connectability is checked. If the character can be joined, proceed to step 1407; if not, proceed to step 1408.

【０１１６】ステップ１４０７：ポインタＰとＲを一文
字ずつ進めてステップ１４０６へ戻り、同じ処理を繰り
返す。Step 1407: The pointers P and R are advanced one character at a time, and the process returns to step 1406 to repeat the same processing.

【０１１７】ステップ１４０８：ポインタＰの次の文
字、すなわちＰ＋１の位置にある文字は、数値の単位と
なり得るかを判断する。もし単位となり得るならステッ
プ１４０９へ、そうでなければステップ１４１４へ。Step 1408: It is determined whether the character next to the pointer P, that is, the character at the position of P + 1 can be a unit of numerical value. If it can be a unit, go to step 1409; otherwise go to step 1414.

【０１１８】ステップ１４０９：ポインタＲを単位を表
現すると判断されて文字列の次の文字まで移動する。Step 1409: Move pointer R to the character following the string judged to express the unit.

【０１１９】ステップ１４１０：ポインタＱの位置より
前にある文字列は、数前置詞かどうかを判断する。数前
置詞とは「約」「およそ」「合計」などのことである。
もし数前置詞と判断されたらステップ１４１１へ、そう
でなければステップ１４１２へ。Step 1410: It is determined whether the character string immediately before the position of pointer Q is a number preposition. Number prepositions are words such as 「約」 (about), 「およそ」 (approximately), and 「合計」 (total). If it is judged to be a number preposition, go to step 1411; otherwise, go to step 1412.

【０１２０】ステップ１４１１：ポインタＱを、発見さ
れた数前置詞の先頭へ移動する。Step 1411: Move the pointer Q to the head of the found number preposition.

【０１２１】ステップ１４１２：ポインタＱからポイン
タＲまでの文字列を、数値表現と判断する。そして、そ
の前後で分割可能性が高く、その間では分割可能性が低
いことを、テキスト記憶手段102に書き込む。図１５に
この分割可能性の例を模式図で示す。数値表現前後の分
割可能性は高いので正値を、数値表現の間は分割可能性
が低い、すなわち分割できないので負値を書き込む。こ
の負値の絶対値を充分大きくしておくことで、ステップ
１３０８における分割可能点の計算で誤って正値が書き
込まれても、数値表現の間で分割されることを防ぐこと
ができる。Step 1412: The character string from pointer Q to pointer R is judged to be a numerical expression. The text storage means 102 then records that the division possibility is high immediately before and after the expression and low inside it. FIG. 15 shows a schematic example of this division possibility. Because division is likely before and after a numerical expression, a positive value is written there; because division inside the expression is unlikely, that is, not permitted, a negative value is written there. By making the absolute value of this negative value sufficiently large, division inside the numerical expression is prevented even if a positive value is erroneously added by the calculation of dividable points in step 1308.

【０１２２】ステップ１４１３：ポインタＰをポインタ
Ｒの位置に移動し（ＰにＲを代入する）、ステップ１４
０２へ戻る。Step 1413: Move pointer P to the position of pointer R (assign R to P), and return to step 1402.

【０１２３】ステップ１４１４：ポインタＱとＲを比較
する。もしＲ−Ｑが１だったら、発見された文字列は一
文字であり、数値表現ではない可能性が非常に高くな
る。そこでこれは数値表現とは判断しない。Ｒ−Ｑが１
ならステップ１４１５へ。そうでなければステップ１４
１２へ。Step 1414: Compare pointers Q and R. If R - Q is 1, the string found is a single character and is very likely not a numerical expression, so it is not judged to be one. If R - Q is 1, go to step 1415; otherwise, go to step 1412.

【０１２４】ステップ１４１５：ポインタＰをＱの次の
文字に移動し、ステップ１４０２へ戻る。Step 1415: Move the pointer P to the character next to Q, and return to step 1402.

【０１２５】このようにして、テキスト文字列「鈴木一
郎」の「一」は数値表現とは判定せず（ステップ１４１
４）、テキスト文字列「約１万５千メートル」は数値表
現と判定することが可能になる。In this way, the 「一」 in the text string 「鈴木一郎」 (Suzuki Ichiro) is not judged to be a numerical expression (step 1414), while the text string 「約１万５千メートル」 (about 15,000 meters) can be judged to be one.
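The pointer walk of steps 1402 to 1415 can be sketched as follows. This is a minimal illustration, not the patented implementation: the character tables, the unit list `UNITS`, the prefix list `PREFIXES`, and the connection rule in `can_connect` are all assumptions chosen only to reproduce the two examples above.

```python
# Hypothetical character tables; the patent does not enumerate them.
ARABIC = set("0123456789０１２３４５６７８９")
KANJI_DIGITS = set("〇一二三四五六七八九")
KANJI_UNITS = set("十百千万億兆")
UNITS = ("メートル", "円", "人", "年", "％")   # assumed unit strings
PREFIXES = ("約", "およそ", "合計")            # number prepositions


def is_numeric_char(ch):
    return ch in ARABIC or ch in KANJI_DIGITS or ch in KANJI_UNITS


def can_connect(expr, nxt):
    """Assumed rule for step 1406: 十/百/千/万/億 may follow anything,
    but Arabic digits and kanji digits may not be mixed."""
    if nxt in KANJI_UNITS:
        return True
    if nxt in KANJI_DIGITS:
        return not any(c in ARABIC for c in expr)
    if nxt in ARABIC:
        return not any(c in KANJI_DIGITS for c in expr)
    return False


def find_numeric_expressions(text):
    """Return (start, end) spans, end exclusive, of numerical expressions."""
    spans = []
    p = 0
    while p < len(text):                                   # step 1402
        if not is_numeric_char(text[p]):                   # step 1403
            p += 1                                         # step 1404
            continue
        q, r = p, p + 1                                    # step 1405
        while r < len(text) and can_connect(text[q:r], text[r]):  # step 1406
            p += 1                                         # step 1407
            r += 1
        unit = next((u for u in UNITS if text.startswith(u, r)), None)  # 1408
        if unit is not None:
            r += len(unit)                                 # step 1409
            prefix = next((x for x in PREFIXES if text.endswith(x, 0, q)), None)
            if prefix is not None:                         # step 1410
                q -= len(prefix)                           # step 1411
        elif r - q == 1:                                   # step 1414
            p = q + 1                                      # step 1415
            continue
        spans.append((q, r))                               # step 1412
        p = r                                              # step 1413
    return spans
```

With these assumed tables, `find_numeric_expressions("鈴木一郎")` finds nothing, because the lone 「一」 has no continuation and no unit, while `find_numeric_expressions("約１万５千メートル")` returns one span covering the prefix, the number, and the unit.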

【０１２６】以下、図１３のステップ１３０８の分割可
能性計算処理と単語抽出処理における、単語分割可能性
計算手段111及び単語分割手段106の動作の詳細につい
て、日本語を例に説明する。図１７にアルゴリズムを示
す。Hereinafter, the details of the operations of the word division possibility calculation means 111 and the word division means 106 in the division possibility calculation processing and the word extraction processing in step 1308 of FIG. 13 will be described using Japanese as an example. FIG. 17 shows the algorithm.

【０１２７】ステップ１７０１：分割可能性計算手段11
1は、テキスト全域に、単語処理用知識記憶手段107の、
単語分割に使うことができる全てのパターンを適用す
る。この処理は、第２の実施形態におけるステップ２０
４の処理と同様にパターンを適用するが、適用が成功し
た場合、パターンに書かれている分離可能性の数値を、
テキスト記憶手段102の該当箇所に書き込む。既に数値
が書き込まれている場合は、書き込まれている数値に加
算した数値、または加算して平均を取った数値を書き込
む。Step 1701: The division possibility calculation means 111 applies, over the entire text, all patterns in the word processing knowledge storage means 107 that can be used for word division. The patterns are applied in the same way as in step 204 of the second embodiment, but when a pattern applies successfully, the separability value written in the pattern is written to the corresponding location in the text storage means 102. If a value has already been written there, either the sum of the existing value and the new value, or their average, is written instead.

【０１２８】ステップ１７０２：単語分割手段106は、
テキスト全域にわたって、あらかじめ設定された閾値以
上の分離可能性数値がある所で、テキストを分割する。Step 1702: The word division means 106 divides the text, over its entire length, at every position whose separability value is equal to or greater than a preset threshold.

【０１２９】文字列に対応する分割可能性の数値と閾値
の関係の模式図を図１８に示す。例えば、ステップ１３
０５においてＮ回目に抽出された単語については、単語
としての確からしさは、Ｎ−１回目までに抽出された単
語の精度に依存する。すなわち、１回目に抽出された単
語は最も単語としての確からしさが高いが、以降、抽出
の精度は少しずつ下がる。そこで、ステップ１７０１に
おいて適用する知識に、単語の発見されたＮが大きいほ
ど、その前後で分割できる確度が小さいことを反映させ
た知識を用意しておけば、確度が小さい箇所は分割され
ない。反対に確度が小さいパターンであっても、一つの
箇所に複数の小さな確度が累積されれば、合計した確度
が高くなって分割されることになる。このように単語分
割を、分割可能性によって計算することで、精度の高さ
と分割の細かさを制御することができる。FIG. 18 is a schematic diagram showing the relationship between the division possibility values along a character string and the threshold. For example, the reliability of a word extracted in the N-th iteration of step 1305 depends on the accuracy of the words extracted up to the (N-1)-th iteration. That is, words extracted in the first iteration are the most reliable, and extraction accuracy declines gradually thereafter. Therefore, if the knowledge applied in step 1701 is prepared so that the larger the N at which a word was found, the lower the certainty of dividing before and after it, positions with low certainty are not divided. Conversely, even patterns with low certainty lead to a division when several of them accumulate at one position and their total becomes high. Calculating word division from division possibilities in this way makes it possible to control both the precision and the granularity of the division.
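The accumulation of division possibilities and the threshold test of steps 1701 and 1702 can be sketched as below. The score layout (one value per character boundary), the function names, and all numeric values are assumptions made for illustration; the summing variant of step 1701 is used rather than averaging.

```python
def new_scores(text):
    """One division-possibility value per boundary; scores[i] is the
    boundary just before text[i]. Hypothetical layout."""
    return [0.0] * (len(text) + 1)


def mark_numeric(scores, start, end, boundary=1.0, inside=-100.0):
    """Step 1412: positive values around a numerical expression and a
    large negative value at every boundary inside it."""
    scores[start] += boundary
    scores[end] += boundary
    for i in range(start + 1, end):
        scores[i] += inside


def split_by_scores(text, scores, threshold):
    """Step 1702: cut wherever the accumulated value reaches the threshold."""
    words, start = [], 0
    for i in range(1, len(text)):
        if scores[i] >= threshold:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


text = "約１万５千メートルの橋"
scores = new_scores(text)
mark_numeric(scores, 0, 9)   # span of 「約１万５千メートル」
scores[3] += 0.6             # a pattern wrongly votes to cut inside the number
scores[10] += 0.3            # two weak patterns vote for the same boundary
scores[10] += 0.3
print(split_by_scores(text, scores, threshold=0.5))
print(split_by_scores(text, scores, threshold=0.7))
```

In this sketch the erroneous vote inside the number is swamped by the large negative value, and raising the threshold from 0.5 to 0.7 suppresses the cut supported only by the two weak patterns, which is the precision versus granularity trade-off described above.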

【０１３０】以上のように、この実施形態では、数値表
現発見手段110が数値表現を特定し、単語抽出手段103が
単語処理用知識記憶手段107の情報と数値表現の位置を
もとに単語を抽出して、抽出した単語を単語記憶手段10
4に記憶し、単語抽出手段103はこの単語の情報を用いて
繰り返し単語抽出を行い、単語数判定手段109が充分な
数の単語を抽出したと判断した上で、この単語情報と単
語処理用知識記憶手段107の情報を用いて、単語分割可
能性計算手段111が分割可能な点を計算して、単語分割
手段106がテキストを単語に分割することで、辞書を用
いない単語分割を行うことが可能になり、その実用的効
果は大きい。As described above, in this embodiment, the numerical expression finding means 110 identifies numerical expressions, and the word extraction means 103 extracts words based on the information in the word processing knowledge storage means 107 and the positions of the numerical expressions, and stores the extracted words in the word storage means 104. The word extraction means 103 then repeats word extraction using this word information. Once the word number determination means 109 judges that a sufficient number of words have been extracted, the word division possibility calculation means 111 calculates the dividable points using this word information and the information in the word processing knowledge storage means 107, and the word division means 106 divides the text into words. Word division without a dictionary thus becomes possible, and the practical effect is large.

【０１３１】（第４の実施の形態）第４の実施形態の文
字分割装置は、テキストから文字種を区分し、それを基
に平仮名や片仮名などの単語を分離する。(Fourth Embodiment) The character dividing apparatus according to the fourth embodiment classifies the text by character type and, on that basis, separates words such as hiragana words and katakana words.
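Classifying text by character type can be sketched with Unicode block ranges, as below. The ranges and the names `script_of` and `script_runs` are assumptions for illustration; the patent does not say how character types are detected, and this sketch ignores rarer characters such as 々 or half-width katakana.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by Unicode block (simplified)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # includes the long-vowel mark ー
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    return "other"


def script_runs(text):
    """Split text into maximal runs of one character type.
    Returns (script, start_offset, substring) tuples."""
    runs = []
    pos = 0
    for script, group in groupby(text, key=script_of):
        chunk = "".join(group)
        runs.append((script, pos, chunk))
        pos += len(chunk)
    return runs


for run in script_runs("オブジェクト指向の設計"):
    print(run)
```

Each run could then be handed to the extraction means for its own character type, so that only the patterns for that type need to be tried on it.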

【０１３２】この装置は、図１９に示すように、テキス
トデータの中から漢字単語を抽出する漢字単語抽出手段
112と、テキストデータの中から片仮名単語を抽出する
片仮名単語抽出手段113と、テキストデータの中から平
仮名単語を抽出する平仮名単語抽出手段114とを備えて
いる。As shown in FIG. 19, this apparatus includes kanji word extraction means 112 for extracting kanji words from the text data, katakana word extraction means 113 for extracting katakana words from the text data, and hiragana word extraction means 114 for extracting hiragana words from the text data.

【０１３３】また、単語処理用知識記憶手段107は、単
語処理用知識として漢字単語用、片仮名単語用、平仮名
単語用の知識を分類して記憶している。The word processing knowledge storage unit 107 classifies and stores knowledge for kanji words, katakana words, and hiragana words as word processing knowledge.

【０１３４】その他の構成は、第３の実施形態（図１
２）と変わりがない。また、漢字単語抽出手段112、片
仮名単語抽出手段113及び平仮名単語抽出手段114はコン
ピュータの計算機構により構成される。The rest of the configuration is the same as in the third embodiment (FIG. 12). The kanji word extraction means 112, the katakana word extraction means 113, and the hiragana word extraction means 114 are implemented by the computing mechanism of a computer.

【０１３５】以上のように構成された単語分割装置につ
いて、その動作を説明する。全体の流れを図２０で示
す。The operation of the word segmenting apparatus thus configured will be described. FIG. 20 shows the entire flow.

【０１３６】ステップ２００１：テキスト入力手段101
から入力されたデータは、まずテキスト記憶手段102に
蓄えられる。Step 2001: The data input from the text input means 101 is first stored in the text storage means 102.

【０１３７】ステップ２００２：数値表現発見手段110
は、テキスト全域にわたって数値表現を探し、数値表現
が発見されたらその前後を数値表現としてマークする。
この処理は、第３の実施形態におけるステップ１３０２
と同様である。Step 2002: The numerical expression finding means 110 searches the entire text for numerical expressions and, when one is found, marks the positions before and after it as the boundaries of a numerical expression. This processing is the same as step 1302 in the third embodiment.

【０１３８】ステップ２００３：漢字単語抽出手段112
は、漢字部分の文字列のみを重点的に処理することで、
漢字の単語を抽出する。詳細は後で述べる。Step 2003: The kanji word extraction means 112 extracts kanji words by concentrating its processing on the kanji portions of the text. Details are described later.

【０１３９】ステップ２００４：片仮名単語抽出手段11
3は、片仮名部分の文字列のみを重点的に処理すること
で、片仮名単語を抽出する。詳細は後で述べる。Step 2004: The katakana word extraction means 113 extracts katakana words by concentrating its processing on the katakana portions of the text. Details are described later.

【０１４０】ステップ２００５：平仮名単語抽出手段11
4は、平仮名部分の文字列のみを重点的に処理すること
で、平仮名単語を抽出する。詳細は後で述べる。Step 2005: The hiragana word extraction means 114 extracts hiragana words by concentrating its processing on the hiragana portions of the text. Details are described later.

【０１４１】ステップ２００６：以上で、前ステップま
でに抽出された単語は単語記憶手段104に記憶されてい
る。単語発見手段105は、前ステップまでに抽出された
全ての単語について、テキスト中に出現する所を探し、
全てに単語のマークを付ける。ステップ２００７：分割可能性計算手段111は、前ステ
ップまでに書き込まれた単語のマークと、数値表現のマ
ークと、単語処理用知識記憶手段107の情報をもとに、
テキストデータの中で単語に分割できる点を計算する。
そして、分割可能と判断される場所で、単語分割手段10
6がテキストを分割する。この処理は、第３の実施形態
（図１３）におけるステップ１３０８と同様である。Step 2006: At this point, the words extracted in the preceding steps are stored in the word storage means 104. The word finding means 105 searches the text for every occurrence of each of these words and marks all of them as words. Step 2007: The division possibility calculation means 111 calculates the points in the text data at which the text can be divided into words, based on the word marks and numerical expression marks written in the preceding steps and the information in the word processing knowledge storage means 107. The word division means 106 then divides the text at the positions judged to be dividable. This processing is the same as step 1308 in the third embodiment (FIG. 13).

【０１４２】ステップ２００８：結果を出力して終了す
る。Step 2008: Output the result and end.

【０１４３】なお、ステップ２００３、ステップ２００
４、ステップ２００５の３つの順序は問わない。例えば
ステップ２００４片仮名単語抽出処理がステップ２００
３漢字単語抽出処理より先でもかまわない。しかし片仮
名単語、漢字単語の発見の方が平仮名単語の発見より一
般的に容易であり、平仮名単語抽出処理２００５が後の
方がよい。Steps 2003, 2004, and 2005 may be performed in any order. For example, the katakana word extraction of step 2004 may precede the kanji word extraction of step 2003. However, katakana words and kanji words are generally easier to find than hiragana words, so it is preferable to perform the hiragana word extraction of step 2005 later.

【０１４４】以下、図２０のステップ２００３漢字単語
抽出処理について述べる。漢字単語抽出処理は、基本的
には第２の実施形態における単語抽出の過程を、漢字文
字列に限定して行うものであり、処理は同様である。図
２１にアルゴリズムを示す。ステップ２１０１：繰り返しを表現するＮを１に初期化
する。Hereinafter, the kanji word extraction processing in step 2003 in FIG. 20 will be described. The kanji word extraction processing is basically performed by limiting the word extraction process in the second embodiment to kanji character strings, and the processing is the same. FIG. 21 shows the algorithm. Step 2101: N representing repetition is initialized to 1.

【０１４５】ステップ２１０２：単語発見手段105は、
繰り返しのＮ−１回目に抽出された単語が、テキスト中
の抽出された以外の場所で出現しているかを探す。もし
出現していればテキスト記憶手段102の該当箇所にマー
クをつける。一回目の処理、すなわちＮ＝１でＮ−１＝
０の時は、単語記憶手段104に単語が一つも記憶されて
いないため、この処理は一つも単語が発見されないまま
終了する。Step 2102: The word finding means 105 searches the text for occurrences of the words extracted in the (N-1)-th iteration at positions other than where they were extracted. Where such an occurrence is found, the corresponding location in the text storage means 102 is marked. In the first iteration, that is, when N = 1 and N-1 = 0, no words are yet stored in the word storage means 104, so this step ends without finding any word.

【０１４６】ステップ２１０３：漢字単語抽出手段112
は、テキスト記憶手段102のテキストデータから漢字単
語を抽出する。抽出には、既に発見された単語に付けら
れているマークといった単語情報と、単語処理用知識記
憶手段107の情報を用いる。テキストの先頭から漢字文
字列を探して、その部分に単語処理用知識記憶手段107
からの漢字用パターンを適用して単語抽出を行う。抽出
された単語は逐次、単語記憶手段104に繰り返しをカウ
ントするための変数Ｎとともに蓄える。さらに抽出した
箇所については、テキスト記憶手段102の該当箇所にマ
ークをつける。抽出方法は、第３の実施形態における図
１３のステップ１３０５と同様であるが、漢字以外の部
分についてはパターン適用を省くことが可能であり、漢
字部分のみを処理する点が異なっている。Step 2103: The kanji word extraction means 112 extracts kanji words from the text data in the text storage means 102. The extraction uses word information, such as the marks attached to words already found, together with the information in the word processing knowledge storage means 107. Kanji character strings are searched for from the beginning of the text, and the kanji patterns from the word processing knowledge storage means 107 are applied to those portions to extract words. Each extracted word is stored in the word storage means 104 together with the iteration counter N. The location from which the word was extracted is also marked in the text storage means 102. The extraction method is the same as in step 1305 of FIG. 13 in the third embodiment, except that pattern application can be skipped for non-kanji portions, so only the kanji portions are processed.

【０１４７】ステップ２１０４：漢字単語抽出が充分に
行われたかを判断する。充分でない場合は、ステップ２
１０５へ、充分と判断された場合は終了する。判断に
は、単語数判定手段109が、今回の繰り返しで新たに単
語が抽出されたかどうかを、単語記憶手段104の情報を
もとに判断する。新規に一つでも単語が抽出されていれ
ば、充分ではないと判断してステップ２１０５へ進む。Step 2104: It is determined whether kanji word extraction has been performed sufficiently. If not, go to step 2105; if it is judged sufficient, the process ends. For this determination, the word number determination means 109 checks, based on the information in the word storage means 104, whether any new word was extracted in the current iteration. If even one new word was extracted, extraction is judged not yet sufficient and the process proceeds to step 2105.

【０１４８】ステップ２１０５：繰り返しをカウントす
るための変数Ｎを１増やし、ステップ２１０２に戻って
繰り返しを行う。Step 2105: The variable N for counting repetitions is incremented by 1, and the process returns to step 2102 to repeat.

【０１４９】以上が漢字の単語抽出処理である。カタカ
ナ、平仮名についても同様に処理を行うが、それぞれの
処理で用いる単語処理用知識記憶手段107の知識は、漢
字用、片仮名用、平仮名用を限定して用いる。パターン
の適用箇所を字種により特定して、その字種に適したパ
ターンのみを選択的に適用することで、パターン適用の
効率を向上させることができるとともに、字種によって
はパターン適用以外の方法を採用することも可能にな
る。The above is the kanji word extraction processing. Katakana and hiragana are processed in the same way, but each process uses only the knowledge in the word processing knowledge storage means 107 intended for its own character type: kanji, katakana, or hiragana. Identifying where patterns apply by character type, and selectively applying only the patterns suited to that character type, improves the efficiency of pattern application. It also makes it possible to adopt methods other than pattern application for some character types.
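The repeat-until-no-new-word loop of steps 2101 to 2105 can be sketched as follows. The loop structure follows the steps; the two toy rules inside `extract_once` are purely hypothetical stand-ins for the patterns held in the word processing knowledge storage means 107, which this section does not list.

```python
import re

KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_once(text, known):
    """One extraction pass over the kanji runs only (toy rules):
    A) a two-character kanji run is taken as a word;
    B) if a run starts with an already known word, the rest is a word."""
    found = set()
    for match in KANJI_RUN.finditer(text):
        run = match.group()
        if len(run) == 2:
            found.add(run)
        for word in known:
            if run.startswith(word) and len(run) > len(word):
                found.add(run[len(word):])
    return found


def extract_kanji_words(text):
    """Repeat extraction until an iteration yields no new word.
    Returns {word: N}, N being the iteration that found the word."""
    words = {}
    n = 1                                              # step 2101
    while True:
        new = extract_once(text, set(words)) - set(words)  # steps 2102-2103
        for word in new:
            words[word] = n        # stored together with the counter N
        if not new:                                    # step 2104
            return words
        n += 1                                         # step 2105


print(extract_kanji_words("設計を行う。設計手法の検討。手法開発部門"))
```

In this toy run, 「手法」 is only found in the second iteration, after 「設計」 is known, and 「開発部門」 only in the third. Keeping N with each word is what lets a later stage trust early words more than late ones.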

【０１５０】以上のように、この実施形態では、数値表
現発見手段110が数値表現を、漢字単語抽出手段112が漢
字単語を、片仮名単語抽出手段113が片仮名単語を、平
仮名単語抽出手段114が平仮名単語をそれぞれ効率的に
抽出して単語記憶手段104に記憶し、この単語情報と数
値表現の情報、単語処理用知識記憶手段107の情報を用
いて、単語分割可能性計算手段111が分割可能な点を計
算して、単語分割手段106がテキストを単語に分割する
ことで、辞書を用いない単語分割を行うことが可能にな
り、その実用的効果は大きい。As described above, in this embodiment, the numerical expression finding means 110 extracts numerical expressions, the kanji word extraction means 112 extracts kanji words, the katakana word extraction means 113 extracts katakana words, and the hiragana word extraction means 114 extracts hiragana words, each efficiently, and the results are stored in the word storage means 104. Using this word information, the numerical expression information, and the information in the word processing knowledge storage means 107, the word division possibility calculation means 111 calculates the dividable points, and the word division means 106 divides the text into words. Word division without a dictionary thus becomes possible, and the practical effect is large.

【０１５１】（第５の実施の形態）第５の実施形態の文
字分割装置は、例えば「オブジェクト指向」という文字
列のように、異なる文字種の結合であっても一つの単語
として扱う方が相応しいものは一単語として処理する。(Fifth Embodiment) The character dividing apparatus of the fifth embodiment treats as a single word any string that is better handled as one word even though it joins different character types, such as the string 「オブジェクト指向」 (object-oriented).

【０１５２】この装置は、図２２に示すように、抽出さ
れた単語のうち、隣接する２つの単語についての結合可
能性を調べる単語結合可能性計算手段115を備えてい
る。その他の構成は第４の実施形態（図１９）と変わり
がない。この単語結合可能性計算手段115はコンピュー
タの計算機構により構成される。As shown in FIG. 22, this apparatus includes word combination possibility calculation means 115 for examining the combination possibility of two adjacent words among the extracted words. Other configurations are the same as those of the fourth embodiment (FIG. 19). The word combination possibility calculation means 115 is constituted by a calculation mechanism of a computer.

【０１５３】以上のように構成された単語分割装置につ
いて、その動作を説明する。全体の流れを図２３で示
す。処理の流れのステップ２３０１からステップ２３０
６までは、第４の実施形態（図２０）のステップ２００
１からステップ２００６までと同様である。The operation of the word division apparatus configured as described above will now be explained. FIG. 23 shows the overall flow. Steps 2301 to 2306 of the flow are the same as steps 2001 to 2006 of the fourth embodiment (FIG. 20).

【０１５４】テキスト入力手段101から入力されたデー
タは、まずテキスト記憶手段102に蓄えられ（ステップ
２３０１）、数値表現発見手段110が、テキスト全域に
わたって数値表現を探し、数値表現が発見されたらその
前後を数値表現としてマークする（ステップ２３０
２）。漢字単語抽出手段112は、漢字部分の文字列のみ
を重点的に処理することで、漢字の単語を抽出する（ス
テップ２３０３）。片仮名単語抽出手段113は、片仮名
部分の文字列のみを重点的に処理することで、片仮名単
語を抽出する（ステップ２３０４）。平仮名単語抽出手
段114は、平仮名部分の文字列のみを重点的に処理する
ことで、平仮名単語を抽出する（ステップ２３０５）。
前ステップまでに抽出された単語は単語記憶手段104に
記憶され、単語発見手段105は、前ステップまでに抽出
された全ての単語について、テキスト中に出現する所を
探し、全てに単語のマークを付ける（ステップ２３０
６）。The data input from the text input means 101 is first stored in the text storage means 102 (step 2301). The numerical expression finding means 110 searches the entire text for numerical expressions and, when one is found, marks the positions before and after it as the boundaries of a numerical expression (step 2302). The kanji word extraction means 112 extracts kanji words by concentrating its processing on the kanji portions of the text (step 2303). The katakana word extraction means 113 extracts katakana words by concentrating its processing on the katakana portions (step 2304). The hiragana word extraction means 114 extracts hiragana words by concentrating its processing on the hiragana portions (step 2305). The words extracted in the preceding steps are stored in the word storage means 104, and the word finding means 105 searches the text for every occurrence of each of these words and marks all of them as words (step 2306).

【０１５５】ステップ２３０７：単語結合可能性計算手
段115は、抽出された単語のうち、隣接して出現する可
能性が高い単語を探す。そしてそれらの単語の隣接する
部分の単語分割可能性をマイナスに設定する。ステップ
２３０５までは基本的に文字種により単語が切れること
を仮定して処理を行なっていたが、ここでそれらの結合
の可能性を調べ、文字種を跨がって一つの単語を形成す
る場合の分割点を計算し直す。詳細は後で述べる。Step 2307: The word combination possibility calculation means 115 searches the extracted words for words that are highly likely to appear adjacent to each other, and sets the word division possibility at the junction between such words to a negative value. Up to step 2305, processing basically assumed that words break where the character type changes. Here, the possibility of joining such words is examined, and the division points are recalculated for cases where a single word spans different character types. Details are described later.

【０１５６】ステップ２３０８：単語分割可能性計算手
段111は、前ステップまでに書き込まれた単語のマーク
と、数値表現のマークと、単語結合可能性（分割可能性
の負値）と、単語処理用知識記憶手段107の情報をもと
に、テキストデータの中で単語に分割できる点、または
分割できない点を計算する。そして分割可能と判断され
る場所で、単語分割手段106がテキストを分割する。こ
の処理は、第３の実施形態（図１３）のステップ１３０
８と同様である。Step 2308: The word division possibility calculation means 111 calculates the points in the text data at which the text can, or cannot, be divided into words, based on the word marks written in the preceding steps, the numerical expression marks, the word combination possibilities (negative division possibility values), and the information in the word processing knowledge storage means 107. The word division means 106 then divides the text at the positions judged to be dividable. This processing is the same as step 1308 of the third embodiment (FIG. 13).

【０１５７】ステップ２３０９：結果を出力して終了す
る。Step 2309: Output the result and end.

【０１５８】なお、第４の実施形態で述べたように、ス
テップ２３０３、ステップ２３０４、ステップ２３０５
の３つの順序は問わない。As noted for the fourth embodiment, steps 2303, 2304, and 2305 may be performed in any order.

【０１５９】以下、図２３のステップ２３０７単語結合
可能性判定処理について述べる。図２４にアルゴリズム
を示す。Hereinafter, the word combination possibility determination processing in step 2307 of FIG. 23 will be described. FIG. 24 shows the algorithm.

【０１６０】ステップ２４０１：単語記憶手段104の中
から最初の単語を一つ取り出す。以下、この単語を単語
ｔとする。Step 2401: The first word is taken from the word storage means 104. This word is hereinafter referred to as word t.

【０１６１】ステップ２４０２：単語ｔの後に現われる
単語とその頻度を、テキストデータを走査することで計
算し、単語ｔの後に現われる確率を計算する。この時、
後に現れる語として、句読点、および格助詞と考えられ
るもの、すなわち「が」「は」「を」などは計算に入れ
ない。計算例を図２５に示す。図２５は、単語「オブジ
ェクト」を単語ｔとした時、それに続く単語として「モ
デル」「指向」「関係」「間」などがあり、それぞれの
確率が０．１，０．５，０．３５，０．０１の場合を示
す。Step 2402: The words that appear after word t, and their frequencies, are counted by scanning the text data, and the probability of each word appearing after word t is calculated. Punctuation marks and items considered to be case particles, such as 「が」, 「は」, and 「を」, are not counted as following words. FIG. 25 shows a calculation example: when word t is 「オブジェクト」 (object), the words that follow it include 「モデル」 (model), 「指向」 (oriented), 「関係」 (relation), and 「間」 (between), with probabilities of 0.1, 0.5, 0.35, and 0.01, respectively.

【０１６２】ステップ２４０３：このようにして計算し
たそれぞれの単語の出現確率のうち、あらかじめ決めら
れた閾値以上の確率を持つ単語があるか判断する。この
ような単語があれば、ステップ２４０４へ、そうでなけ
ればステップ２４０９へ進む。図２５の例においては、
閾値を０．３とすると、「指向」と「間」の２つの単語
が閾値以上の確率で出現すると判断される。Step 2403: It is determined whether any of the words has an appearance probability, calculated as above, that is equal to or greater than a predetermined threshold. If there is such a word, go to step 2404; otherwise, go to step 2409. In the example of FIG. 25, with a threshold of 0.3, the two words 「指向」 (oriented) and 「間」 (between) are judged to appear with a probability at or above the threshold.

【０１６３】ステップ２４０４：単語ｔの後に閾値以上
の確率で出現する単語の集合を、集合Ａとする。図２５
の例では、「指向」「間」の２つの単語が集合Ａに入
る。Step 2404: The set of words that appear after word t with a probability at or above the threshold is called set A. In the example of FIG. 25, the two words 「指向」 (oriented) and 「間」 (between) belong to set A.

【０１６４】ステップ２４０５：集合Ａの中から最初の
単語を一つ取り出す。これを単語ａと呼ぶ。Step 2405: The first word is taken from set A. This word is called word a.

【０１６５】ステップ２４０６：単語ａの前に現われる
単語とその頻度を調べ、単語ａの前に現れる確率を計算
する。この時、前に現れる語として、句読点、および格
助詞と考えられるもの、すなわち「が」「は」「を」な
どは計算に入れない。図２６、図２７はそれぞれ、「指
向」「間」を単語ａに選んだ時の、それぞれの前に現わ
れる単語の出現確率を計算した例である。Step 2406: The words that appear before word a, and their frequencies, are examined, and the probability of each word appearing before word a is calculated. Punctuation marks and items considered to be case particles, such as 「が」, 「は」, and 「を」, are not counted as preceding words. FIG. 26 and FIG. 27 show examples of the appearance probabilities of the words preceding word a when 「指向」 (oriented) and 「間」 (between), respectively, are chosen as word a.

【０１６６】ステップ２４０７：このようにして計算し
たそれぞれの単語の出現確率のうち、あらかじめ決めら
れた閾値以上の確率を持つ単語があり、その単語の中に
単語ｔが含まれているかを判断する。含まれていればス
テップ２４０８へ、そうでなければステップ２４０９へ
進む。図２６の単語ａが「指向」だった時の例では、
「指向」の前に「オブジェクト」が閾値以上の確率で出
現していると判断される。図２７の単語ａが「間」だっ
た時の例では、「間」の前に閾値以上の確率で出現する
単語はあるものの、その中に「オブジェクト」は無い。Step 2407: It is determined whether, among the appearance probabilities calculated as above, any word has a probability at or above a predetermined threshold, and whether word t is among those words. If it is, go to step 2408; otherwise, go to step 2409. In the example of FIG. 26, where word a is 「指向」 (oriented), 「オブジェクト」 (object) is judged to appear before 「指向」 with a probability at or above the threshold. In the example of FIG. 27, where word a is 「間」 (between), some words do appear before 「間」 with a probability at or above the threshold, but 「オブジェクト」 is not among them.

【０１６７】ステップ２４０８：単語ａの前に単語ｔが
閾値以上の確率で出現する時、この２つは結合可能性が
高いと判断し、一つの単語とみなす。そして、テキスト
記憶手段102の中で、２つの単語が隣接して出現する箇
所の単語分離可能性を負値に設定する。図２５と図２６
の例では、「オブジェクト」と「指向」が隣接して出現
する確率が高かったため、この２つはまとめて「オブジ
ェクト指向」という一つの単語であると判断できる。そ
こで、図２８に例示すように、テキスト記憶手段102の
中で、２つの単語が隣接して出現する箇所の単語分離可
能性を負値に設定する。Step 2408: When word t appears before word a with a probability at or above the threshold, the two are judged highly likely to combine and are regarded as one word. In the examples of FIG. 25 and FIG. 26, 「オブジェクト」 (object) and 「指向」 (oriented) are highly likely to appear adjacent to each other, so the two can be judged to form the single word 「オブジェクト指向」 (object-oriented). Accordingly, as illustrated in FIG. 28, the word separability at each place in the text storage means 102 where the two words appear adjacent to each other is set to a negative value.

【０１６８】ステップ２４０９：単語ａは、集合Ａの中
の最後の単語か判断する。最後の単語ならステップ２４
１１へ、そうでなければステップ２４１０へ進む。Step 2409: It is determined whether word a is the last word in set A. If it is, go to step 2411; otherwise, go to step 2410.

【０１６９】ステップ２４１０：集合Ａの中から次の単
語を一つ取り出し、これを単語ａとする。ステップ２４
０６へ戻って繰り返し処理を行う。Step 2410: The next word is taken from set A and becomes word a. The process returns to step 2406 and repeats.

【０１７０】ステップ２４１１：単語ｔは、単語記憶手
段104の中で最後の単語か判断する。最後の単語なら終
了する。そうでなければステップ２４１２へ。Step 2411: It is determined whether word t is the last word in the word storage means 104. If it is, the process ends; otherwise, go to step 2412.

【０１７１】ステップ２４１２：単語記憶手段104の中
から次の単語を取り出し、これを単語ｔとしてステップ
２４０２へ戻って繰り返し処理を行う。Step 2412: The next word is fetched from the word storage means 104, this is used as word t, and the process returns to step 2402 to repeat the processing.
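Steps 2401 to 2412 amount to a two-way check: word a must often follow word t, and word t must often precede word a. A compact sketch follows, assuming the text has already been cut into a provisional token list. The function name, the `IGNORED` set, and the threshold value are assumptions, and the nested loops of the flowchart are replaced by a single pass over adjacent pairs.

```python
from collections import Counter, defaultdict

# Punctuation and case particles left out of the counts (assumed list).
IGNORED = {"、", "。", "が", "は", "を", "の"}


def find_joinable_pairs(tokens, threshold):
    """Return the (t, a) pairs where a follows t, and t precedes a,
    each with probability >= threshold."""
    following = defaultdict(Counter)   # following[t][a]: a seen right after t
    preceding = defaultdict(Counter)   # preceding[a][t]: t seen right before a
    for left, right in zip(tokens, tokens[1:]):
        if left in IGNORED or right in IGNORED:
            continue
        following[left][right] += 1
        preceding[right][left] += 1

    pairs = set()
    for t, nexts in following.items():
        total_after_t = sum(nexts.values())
        for a, count in nexts.items():
            p_forward = count / total_after_t                        # step 2402
            p_backward = preceding[a][t] / sum(preceding[a].values())  # step 2406
            if p_forward >= threshold and p_backward >= threshold:   # 2403, 2407
                pairs.add((t, a))                                    # step 2408
    return pairs


tokens = ["オブジェクト", "指向", "の", "設計", "。",
          "オブジェクト", "指向", "は", "便利", "。",
          "オブジェクト", "間", "の", "関係", "。",
          "クラス", "間", "の", "関係", "。",
          "モジュール", "間", "の", "通信"]
print(find_joinable_pairs(tokens, threshold=0.5))
```

In this toy corpus 「間」 follows 「クラス」 every time, but 「クラス」 accounts for only a third of the words preceding 「間」, so the backward check rejects the pair. For each pair returned, the division possibility at the junction would then be set to a negative value as in step 2408.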

【０１７２】以上のように、この実施形態では、数値表
現発見手段110が数値表現を、漢字単語抽出手段112が漢
字単語を、片仮名単語抽出手段112が片仮名単語を、平
仮名単語抽出手段114が平仮名単語をそれぞれ抽出して
単語記憶手段104に記憶し、単語結合可能性計算手段115
が２つの単語の結合する可能性を計算する。そして、単
語分割可能性計算手段111が、ここまでで得られた単語
情報と単語処理用知識記憶手段107の情報を用いて、分
割可能な点を計算し、単語分割手段106がテキストを単
語に分割する。こうすることで、辞書を用いない単語分
割を行うことが可能になり、その実用的効果は大きい。As described above, in this embodiment, the numerical expression finding means 110 extracts numerical expressions, the kanji word extraction means 112 extracts kanji words, the katakana word extraction means 113 extracts katakana words, and the hiragana word extraction means 114 extracts hiragana words, and the results are stored in the word storage means 104. The word combination possibility calculation means 115 then calculates the possibility that two words combine. The word division possibility calculation means 111 calculates the dividable points using the word information obtained so far and the information in the word processing knowledge storage means 107, and the word division means 106 divides the text into words. Word division without a dictionary thus becomes possible, and the practical effect is large.

【０１７３】（第６の実施の形態）第６の実施形態の文
字分割装置は、テキストデータに含まれる漢字文字列を
文字数によって整理し、同じ文字数の漢字文字列に共通
のルールを適用して、漢字単語を効率的に抽出する。(Sixth Embodiment) A character dividing device according to a sixth embodiment arranges kanji character strings included in text data according to the number of characters, and applies a common rule to kanji character strings having the same number of characters. , To efficiently extract kanji words.

【０１７４】この装置は、図２９に示すように、テキス
トデータの中から、漢字文字列とその前後の２文字とを
含む部分文字列を抽出する部分文字列抽出手段116と、
部分文字列抽出手段116で抽出された文字列をそれらが
処理される間記憶するテキスト処理用バッファ117と、
テキスト処理用バッファ117の文字列を分類、整列する
文字列分類整列手段118とを備えている。漢字単語抽出
手段112は、部分文字列抽出手段116、テキスト処理用バ
ッファ117、文字列分類整列手段118を用いて、テキスト
データの中から漢字単語を効率的に抽出する。その他の
構成は第５の実施形態（図２２）と変わりがない。As shown in FIG. 29, this apparatus includes partial character string extraction means 116 for extracting, from the text data, partial character strings each consisting of a kanji character string and the two characters before and after it; a text processing buffer 117 for holding the character strings extracted by the partial character string extraction means 116 while they are processed; and character string classification and sorting means 118 for classifying and sorting the character strings in the text processing buffer 117. The kanji word extraction means 112 uses the partial character string extraction means 116, the text processing buffer 117, and the character string classification and sorting means 118 to extract kanji words efficiently from the text data. The rest of the configuration is the same as in the fifth embodiment (FIG. 22).

【０１７５】このテキスト処理用バッファ117はコンピ
ュータのメモリ、またはハードディスク装置の記憶領域
に設定され、部分文字列抽出手段116及び文字列分類整
列手段118はコンピュータの計算機構により構成され
る。The text processing buffer 117 is set in a memory of a computer or a storage area of a hard disk device, and the partial character string extracting means 116 and the character string sorting / aligning means 118 are constituted by a computer mechanism.

【０１７６】以上のように構成された単語分割装置につ
いて、その動作を説明する。全体の流れを図３０で示
す。処理の流れのステップ３００１とステップ３００２
は第５の実施形態（図２３）のステップ２３０１、２３
０２と同様である。ステップ３００４からステップ３０
０９についても、第５の実施形態（図２３）のステップ
２３０４からステップ２３０９と同様である。以下、第
６の実施形態を特徴づけるステップ３００３の、漢字文
字列抽出による漢字単語抽出処理について詳しく説明す
る。アルゴリズムの詳細を図３１に示す。The operation of the word division apparatus configured as described above will now be explained. FIG. 30 shows the overall flow. Steps 3001 and 3002 of the flow are the same as steps 2301 and 2302 of the fifth embodiment (FIG. 23). Steps 3004 to 3009 are likewise the same as steps 2304 to 2309 of the fifth embodiment (FIG. 23). The kanji word extraction by kanji character string extraction in step 3003, which characterizes the sixth embodiment, is described in detail below. FIG. 31 shows the details of the algorithm.

【０１７７】ステップ３１０１：まず、部分文字列抽出
手段116がテキスト記憶手段102より漢字の文字列部分を
抽出し、テキスト処理用バッファ117にその文字列の表
を作成して記憶する。抽出する時には、その前後Ｎ文字
も一緒に抽出する。この例を図を用いて説明する。図３
２の例文から漢字文字列を前後２文字を一緒に抽出した
ものが、図３３の表である。漢字文字列の前後の文字数
Ｎを３以上の大きな値に設定すると、単語抽出において
詳しいルールを用意することが可能となり抽出精度が向
上するが計算量が増える。一方、１以下に設定すると詳
しいルールを用意できず抽出精度が落ちてしまう。この
ため、適度な計算量で適度な精度を得るためには、Ｎを
２に設定するのが妥当である。Step 3101: First, the partial character string extraction means 116 extracts the kanji character string portions from the text storage means 102, and creates and stores a table of those strings in the text processing buffer 117. Each string is extracted together with the N characters before and after it. For example, the table in FIG. 33 shows the kanji character strings extracted from the example sentence of FIG. 32 together with the two characters before and after each. Setting N to a large value of 3 or more allows more detailed rules to be prepared for word extraction and improves extraction accuracy, but increases the amount of computation. Setting N to 1 or less, on the other hand, prevents detailed rules from being prepared and lowers extraction accuracy. Setting N to 2 is therefore appropriate for obtaining reasonable accuracy with a reasonable amount of computation.

【０１７８】ステップ３１０２：このように抽出した文
字列を、文字列分類整列手段118が漢字文字列の長さで
分類する。Step 3102: The character strings extracted in this way are classified by the character string classification and alignment means 118 according to the length of the kanji character string.

[0179] Step 3103: The character string classification and sorting means 118 then sorts the classified strings by their kanji portion, in dictionary order or character code order, and creates a table for each length. When entries are identical including the preceding and following characters, they are merged and their number of occurrences is recorded. The table in FIG. 34 contains the two-character kanji strings taken from the table in FIG. 33 and sorted. The table in FIG. 35 contains the three-character kanji strings taken from the table in FIG. 33 and sorted. Tables are created in the same way for four characters, five characters, and so on, but strings of M characters or longer (M = 5, 6, 7, etc.) may be combined into a single table.
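Steps 3102 and 3103 amount to grouping the extracted rows by kanji length, sorting each group, and merging identical rows with a count. The sketch below assumes rows shaped as `(before, kanji, after)` tuples and uses Python's default code point ordering as the "character code order"; the function name and the row layout are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_length_tables(rows, m=5):
    """Group (before, kanji, after) rows by kanji length and sort each group.

    Identical rows (including context) are merged and counted.
    Strings of m characters or more share one table, keyed by m.
    """
    counts = Counter(rows)
    tables = defaultdict(list)
    for (before, kanji, after), count in counts.items():
        key = min(len(kanji), m)
        tables[key].append((kanji, before, after, count))
    for key in tables:
        tables[key].sort()  # code point order on the kanji portion first
    return dict(tables)
```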

[0180] Step 3104: From the two-character and three-character tables among those created in this way, the kanji word extraction means 112 extracts kanji words. In word extraction, strings with punctuation marks before and after them, strings enclosed by a punctuation mark and a case particle, and the like are regarded as words. These detailed rules are stored in the word processing knowledge storage means 107. When the same kanji string appears more than once in a table, the processing is not repeated. For example, if in row 3502 of FIG. 35 the string 「衆議院」 (House of Representatives) is judged to be enclosed by the case particle 「の」 and is recognized as a word, row 3503 of FIG. 35 is not processed. The extracted words are stored one by one in the word storage means 104.

[0181] Step 3105: Next, from the tables of strings of four or more characters, the kanji word extraction means 112 splits compound words and extracts words, based on the word knowledge obtained from the two-character and three-character tables. The principle is as follows. Suppose the word 「衆議院」 (House of Representatives) has been extracted from the three-character table, and the string 「衆議院議員」 (member of the House of Representatives) in the five-character table is enclosed by case particles and therefore presumed to be a noun. Then 「衆議院議員」 is judged to be a compound noun consisting of the known word 「衆議院」 and the unknown word 「議員」 (member), and the word 「議員」 can be extracted. Such rules are stored in the word processing knowledge storage means 107.
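The compound-splitting principle of step 3105 can be illustrated with a small sketch. The source describes only the case where the known word is at the front of the compound; handling a known word at the end, and requiring the remainder to be at least two characters long, are assumptions added here. The function name `split_compound` is hypothetical.

```python
def split_compound(compound, known_words):
    """Return new words obtained by removing a known word from a compound.

    Assumptions (not from the source): a known word may sit at the front
    or the back, and the remainder must be at least two characters long.
    """
    new_words = set()
    for word in known_words:
        if compound == word:
            continue
        if compound.startswith(word):
            rest = compound[len(word):]
        elif compound.endswith(word):
            rest = compound[:-len(word)]
        else:
            continue
        if len(rest) >= 2 and rest not in known_words:
            new_words.add(rest)
    return new_words
```

Given the known word 「衆議院」, the compound 「衆議院議員」 yields the new word 「議員」.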

[0182] Effective kanji word extraction is achieved by the above steps. As in N-gram statistics techniques for natural language, the table of strings extracted in step 3101 need not hold the strings themselves; it may hold only pointers. As shown in FIG. 36, the text is stored only once, and the table records only the position and length of each extracted string. This reduces the storage area required.
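A pointer-only table of the kind described for FIG. 36 might look like the sketch below: the text is kept once, and each table entry is a `(position, length)` pair that is resolved on demand. The names `build_pointer_table` and `resolve` are hypothetical, and the kanji range is the same approximation as before.

```python
import re

# Assumption: kanji are approximated by the CJK Unified Ideographs block.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")

def build_pointer_table(text):
    """Record only (position, length) for each kanji run; the text is stored once."""
    return [(m.start(), m.end() - m.start()) for m in KANJI_RUN.finditer(text)]

def resolve(text, entry, n=2):
    """Recover (before, kanji, after) from a pointer entry when it is needed."""
    pos, length = entry
    return (text[max(0, pos - n):pos],
            text[pos:pos + length],
            text[pos + length:pos + length + n])
```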

[0183] As described above, in this embodiment the kanji character strings contained in the text data are organized by number of characters, and the same rules are applied uniformly to kanji strings of the same length to extract kanji words. Kanji words can therefore be extracted far more efficiently than when rules are applied word by word from the beginning of the text data.

[0184] (Seventh Embodiment) The word segmentation device of the seventh embodiment organizes the hiragana character strings contained in the text data and extracts hiragana words.

[0185] As shown in FIG. 37, this device includes degenerate character string creation means 120, which replaces character strings in the text data to create a degenerate character string, and a conversion table 119 used to create that degenerate character string. The degenerate character string created by the degenerate character string creation means 120 is stored in the text processing buffer 117. The partial character string extraction means 116 extracts partial strings containing hiragana from the degenerate character string, and the character string classification and sorting means 118 classifies and sorts these partial strings. The hiragana word extraction means 114 uses the partial character string extraction means 116, the text processing buffer 117, the character string classification and sorting means 118, the character string conversion table 119, and the degenerate character string creation means 120 to extract hiragana words efficiently from the text data.

[0186] The character string conversion table 119 is set up in the memory of a computer or in a storage area of a hard disk device, and the degenerate character string creation means 120 is implemented by the computing mechanism of the computer.

[0187] The operation of the word segmentation device configured as described above will now be explained. FIG. 38 shows the overall flow. Steps 3801 to 3804 of the processing flow are the same as steps 3001 to 3004 of the sixth embodiment (FIG. 30). Steps 3806 to 3809 are likewise the same as steps 3006 to 3009 of the sixth embodiment (FIG. 30). The hiragana word extraction processing by degenerate character string creation in step 3805, which characterizes the seventh embodiment, is described in detail below. FIG. 39 shows the details of the algorithm.

[0188] Step 3901: First, the degenerate character string creation means 120 focuses only on the hiragana portions of the entire text and creates a character string in which the non-hiragana portions are degenerated. This is done by replacing the non-hiragana portions with suitable symbols based on the character string conversion table 119. FIG. 40 shows an example of the character string conversion table 119. In this way, a non-hiragana character string is converted into a single symbol. FIG. 41 shows a conversion example using the conversion table of FIG. 40. In this example, punctuation marks are not converted. Turning non-hiragana characters into symbols reduces the number of distinct characters, so that a Japanese character, which normally has to be represented in 2 bytes or more, can be represented in 1 byte. The total storage area used in hiragana processing can therefore be kept small.
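The degeneration in step 3901 can be sketched with a small rule table. The actual conversion table of FIG. 40 is not reproduced in this text, so the symbols `K`, `T`, and `N` and the character ranges below are purely illustrative assumptions. The sketch also assumes the input contains no ASCII letters that would collide with those symbols, and it does not model the 1-byte encoding.

```python
import re

# Hypothetical conversion table: each run of a non-hiragana class becomes one symbol.
RULES = [
    (re.compile(r"[\u4e00-\u9fff]+"), "K"),           # kanji run
    (re.compile(r"[\u30a1-\u30fa\u30fc]+"), "T"),     # katakana run
    (re.compile(r"[0-9\uff10-\uff19]+"), "N"),        # ASCII or full-width digits
]

def degenerate(text):
    """Collapse non-hiragana runs into single symbols; hiragana and punctuation stay."""
    for pattern, symbol in RULES:
        text = pattern.sub(symbol, text)
    return text
```

Under these assumed rules, 「衆議院のテレビは、３台ある。」 becomes 「KのTは、NKある。」.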

[0189] Step 3902: The partial character string extraction means 116 extracts the hiragana character string portions from the degenerate character string, builds a table of those strings in the text processing buffer 117, and stores it. Each string is extracted together with the one converted symbol immediately before and after it. The table in FIG. 42 shows the hiragana strings extracted from the example sentence in FIG. 41 together with the one symbol before and after each. This table is similar to the table created in step 3101 of the sixth embodiment (FIG. 31), and the same processing is possible with a pointer-only table like the one shown in FIG. 36.
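Step 3902 can be sketched in the same style as the kanji extraction, with one character of context on each side. The input is assumed to be an already degenerated string such as 「KのTは、NKある。」, where `K`, `T`, and `N` are hypothetical symbols; the function name is likewise an assumption.

```python
import re

# Hiragana letters (small ぁ through ゖ).
HIRAGANA_RUN = re.compile(r"[\u3041-\u3096]+")

def extract_hiragana_runs(degenerate_text):
    """Return (before, hiragana, after) rows with one symbol of context per side."""
    rows = []
    for m in HIRAGANA_RUN.finditer(degenerate_text):
        before = degenerate_text[max(0, m.start() - 1):m.start()]
        after = degenerate_text[m.end():m.end() + 1]
        rows.append((before, m.group(), after))
    return rows
```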

[0190] Step 3903: The character string classification and sorting means 118 classifies the strings extracted in this way by the length of the hiragana character string. This processing is the same as step 3102 of the sixth embodiment (FIG. 31).

[0191] Step 3904: The character string classification and sorting means 118 then sorts the classified strings by their hiragana portion, in dictionary order or character code order, and creates tables. When entries are identical including the preceding and following symbols, they are merged and their number of occurrences is recorded. This processing is the same as step 3103 of the sixth embodiment (FIG. 31).

[0192] Step 3905: From the two-character, three-character, and four-character tables among those created in this way, the hiragana word extraction means 113 extracts hiragana words. In word extraction, strings that appear repeatedly with punctuation marks before and after them, strings that appear repeatedly enclosed by a punctuation mark and a case particle, and the like are regarded as words. A string that appears repeatedly after a single kanji character is judged to be the inflectional ending of a declinable word. These detailed rules are stored in the word processing knowledge storage means 107. For identical strings appearing in the tables, duplicate processing is reduced, as in step 3104 of the sixth embodiment (FIG. 31).

[0193] High-speed hiragana word extraction is achieved by the above steps.

[0194] Thus, in this embodiment, for hiragana, the degenerate character string creation means 120 creates a degenerate character string from the text data based on the character string conversion table 119, and the partial character string extraction means 116, the text processing buffer 117, and the character string classification and sorting means 118 extract hiragana partial strings from that degenerate character string and classify and sort them. The hiragana word extraction means 113 extracts hiragana words from among them and stores them in the word storage means 104. Hiragana words can therefore be extracted very efficiently.

[0195] (Eighth Embodiment) The eighth embodiment describes processing for reviewing the word segmentation of text.

[0196] Because the word segmentation method of the present invention extracts words from text data without using a dictionary, the words used for segmenting text are not fixed. For documents whose topics change rapidly, such as newspaper articles, a string that had not previously been extracted as a word may be newly taken in as a word after the next day's articles are analyzed. Highly adaptive word extraction and word segmentation are thus possible.

[0197] On the other hand, if text that was segmented in the past is segmented again using the currently recognized words, it may be divided at different positions than before. Text segmented in the past therefore needs to be maintained as required.

[0198] This embodiment describes processing for reviewing the word segmentation of text that has already been segmented in this way.

[0199] FIG. 43 shows an information retrieval device that searches word-segmented text and that makes it possible to review the word segmentation of the search target text.

[0200] As in the first embodiment (FIG. 1), this device includes text input means 101, text storage means 102, word extraction means 103, word storage means 104, word finding means 105, word division means 106, word processing knowledge storage means 107, and text output means 108. It further includes: search target text storage means 204, which stores the word-segmented text (search target text); extracted word saving means 207, which saves the words used when the search target text was segmented; index creation means 201, which creates a search index for the search target text; index storage means 202, which stores the created index; search means 203, which uses the index to find text containing a search term; word comparison means 206, which compares the words saved in the extracted word saving means 207 with the latest words stored in the word storage means 104; and word division correction means 205, which controls the re-segmentation of the search target text.

[0201] In this device, as described for the first embodiment (the flow chart of FIG. 2), text data entered from the text input means 101 is stored in the text storage means 102. The word extraction means 103 extracts words from this text using the information in the word processing knowledge storage means 107, the extracted words are stored in the word storage means 104, and the positions in the text where words were extracted are marked. The word finding means 105 checks whether each extracted word also appears elsewhere in the text and, if it does, marks those positions in the text. The word division means 106 divides the text data into words based on the marks written in the text and the information in the word processing knowledge storage means 107.

[0202] The segmented text data is stored in the search target text storage means 204, and the words stored in the word storage means 104 are saved in the extracted word saving means 207.

[0203] The index creation means 201 uses the word-segmented text data to create an index for searching the search target text.

[0204] The case described here is one in which the index creation means 201 creates an index that supports both full-text search and word search. Full-text search checks whether the character string of the search term is contained in the text data, whereas word search checks whether the search term is contained in the text data as a word. For example, for the text 「東京都では、‥」, a full-text search produces hits for search terms such as 「東京」 and 「京都」, whereas a word search produces a hit only when the term matches as a word, as with 「東京」 or 「東京都」.

[0205] Through the word segmentation processing, the text data is converted, as shown in FIG. 45(b), into text data in which the beginning of each word is delimited by the word start symbol "[" and the end of each word by the word end symbol "]". Because "[" and "]" may actually occur in the text, special character codes or the like that do not appear in the text are used as the word start and word end symbols in practice.

[0206] The index creation means 201 first takes index strings of length N (for example, N = 2) from this word-segmented text data. For the text 「本発明の実施は、」, seven strings can be taken: 「本発」, 「発明」, 「明の」, 「の実」, 「実施」, 「施は」, and 「は、」. An index is created from each of these strings. In addition to recording each string, the document number, and the character position, the index carries word information indicating how the string adjoins the word start and word end symbols.

[0207] As shown in FIG. 45(d), the word information here consists of three items, each held as a 1-bit flag: whether the first character is the beginning of a word, whether the first character is the end of a word, and whether the second character is the end of a word.
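The index entries with their three flags can be sketched as follows. The segmentation used in the example, the dictionary field names, and the function name `build_index` are assumptions for illustration; FIG. 45 is not reproduced in this text, and the sketch uses plain "[" and "]" as word markers rather than special character codes.

```python
def build_index(marked, doc_id, n=2, start="[", end="]"):
    """Build N-gram index entries with word-boundary flags from marked text.

    Returns the unmarked text and a list of entries. Each entry records the
    gram, document number, position, and three boolean flags.
    """
    chars, begins, ends = [], set(), set()
    for ch in marked:
        if ch == start:
            begins.add(len(chars))        # next character begins a word
        elif ch == end:
            ends.add(len(chars) - 1)      # previous character ends a word
        else:
            chars.append(ch)
    text = "".join(chars)
    index = []
    for pos in range(len(text) - n + 1):
        index.append({
            "gram": text[pos:pos + n],
            "doc": doc_id,
            "pos": pos,
            "first_begins": pos in begins,       # first character starts a word
            "first_ends": pos in ends,           # first character ends a word
            "last_ends": pos + n - 1 in ends,    # last (second) character ends a word
        })
    return text, index
```

For the assumed segmentation 「[本発明][の][実施][は]、」 this produces the seven bigrams listed in the text, and the entry for 「実施」 has `first_begins` and `last_ends` set.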

[0208] The index created in this way is stored in the index storage means 202.

[0209] To perform a full-text search with this index, the search string is divided from the beginning into pieces of the index string length N (here N = 2), and the offset from the first character at which each piece starts is recorded. Calling these pieces partial search strings, the search string 「全文検索装置」, for example, is divided into three partial search strings, 「全文」, 「検索」, and 「装置」, at offsets 0, 2, and 4 from the first character, respectively.

[0210] If the length of the search string is not divisible by N, the string is divided so that some pieces overlap, and sets of N characters are taken so that the partial search strings together always cover the entire original search string. For example, 「検索文字列」 can be divided into 「検索」, 「文字」, and 「字列」 at offsets 0, 2, and 3, respectively.
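The division into partial search strings, including the overlapping last piece, can be sketched as below. The function name `split_query` is hypothetical; the behavior for queries shorter than N is not described in the source, so the sketch simply rejects them.

```python
def split_query(query, n=2):
    """Split a query into (partial_string, offset) pairs of length n.

    If len(query) is not a multiple of n, the last piece overlaps the
    previous one so that the pieces cover the whole query.
    """
    if len(query) < n:
        raise ValueError("query is shorter than the index string length")
    parts = [(query[i:i + n], i) for i in range(0, len(query) - n + 1, n)]
    if len(query) % n:
        last = len(query) - n
        parts.append((query[last:], last))
    return parts
```

This reproduces both examples in the text: offsets 0, 2, 4 for 「全文検索装置」 and offsets 0, 2, 3 for 「検索文字列」.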

[0211] Next, the index is searched for each partial search string and the matching entries are retrieved. For the retrieved entries, the document number and the position of the string in the document are examined to evaluate continuity. When the search string is 「全文検索装置」, if 「全文」, 「検索」, and 「装置」 have the same document number and, with 「全文」 appearing at character position x, 「検索」 appears at position x + 2 and 「装置」 at position x + 4, the document is judged to contain the string 「全文検索装置」.
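The continuity check for full-text search can be sketched with a simple postings structure that maps each N-gram to a set of `(document number, position)` pairs. The data layout and the function names are illustrative assumptions, not the index format of the patent.

```python
from collections import defaultdict

def build_postings(docs, n=2):
    """Map each N-gram to the set of (doc_id, position) pairs where it occurs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for pos in range(len(text) - n + 1):
            postings[text[pos:pos + n]].add((doc_id, pos))
    return postings

def split_query(query, n=2):
    """Split a query into (partial_string, offset) pairs, overlapping the last piece."""
    parts = [(query[i:i + n], i) for i in range(0, len(query) - n + 1, n)]
    if len(query) % n:
        parts.append((query[len(query) - n:], len(query) - n))
    return parts

def full_text_search(postings, query, n=2):
    """Return (doc_id, x) for every place where all partial strings line up."""
    parts = split_query(query, n)
    first_gram, _ = parts[0]
    hits = set()
    for doc_id, x in postings.get(first_gram, ()):
        # Every later piece must occur in the same document at x + its offset.
        if all((doc_id, x + off) in postings.get(gram, ()) for gram, off in parts[1:]):
            hits.add((doc_id, x))
    return hits
```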

[0212] To perform a word search with this index, the search string is first divided into partial search strings, as in full-text search, and the index is then searched for each partial search string to retrieve the matching entries.

[0213] When the search string is 「全文検索装置」, for its first partial search string 「全文」 the flag indicating that the first character of the index entry is the beginning of a word is consulted, and entries whose flag is not set are treated as non-matching even if the characters match. Likewise, for the last partial search string 「装置」, the flag indicating whether the second character of the index entry is the end of a word is consulted, and entries whose flag is not set are treated as non-matching even if the characters match. For the other partial search strings (here 「検索」), only the characters are compared and the flags are not examined.

[0214] For the index entries retrieved in this way, the document number and the position of the string in the document are examined to evaluate continuity. If the document numbers are the same and the positions are consecutive, the document is judged to contain the word 「全文検索装置」.

[0215] In word search, confirming with the index flags both that the first character of the search string is the beginning of a word and that its last character is the end of a word retrieves words that match the search term exactly. Confirming only that the first character is the beginning of a word retrieves exact matches and prefix matches. Confirming only that the last character is the end of a word retrieves exact matches and suffix matches. Confirming that the first character is the beginning of a word and that the last character is not the end of a word retrieves prefix matches only. Confirming that the first character is not the beginning of a word and that the last character is the end of a word retrieves suffix matches only.
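The flag checks for word search can be sketched as follows. The index layout, the mode names `exact`, `prefix`, `suffix`, and `any`, and the example segmentation are assumptions made for illustration; only the first three modes correspond to combinations described in the text, and the prefix-only and suffix-only variants are omitted for brevity.

```python
def build_flag_index(marked_docs, n=2):
    """gram -> {(doc_id, pos): (first_char_begins_word, last_char_ends_word)}."""
    index = {}
    for doc_id, marked in marked_docs.items():
        chars, begins, ends = [], set(), set()
        for ch in marked:
            if ch == "[":
                begins.add(len(chars))
            elif ch == "]":
                ends.add(len(chars) - 1)
            else:
                chars.append(ch)
        for pos in range(len(chars) - n + 1):
            gram = "".join(chars[pos:pos + n])
            index.setdefault(gram, {})[(doc_id, pos)] = (pos in begins, pos + n - 1 in ends)
    return index

def word_search(index, query, mode="exact", n=2):
    """Return (doc_id, x) hits; flags are checked on the first and last pieces only."""
    if len(query) < n:
        raise ValueError("query is shorter than the index string length")
    offsets = list(range(0, len(query) - n + 1, n))
    if len(query) % n:
        offsets.append(len(query) - n)
    need_begin = mode in ("exact", "prefix")
    need_end = mode in ("exact", "suffix")
    hits = set()
    for (doc_id, x), (begins, _) in index.get(query[:n], {}).items():
        if need_begin and not begins:
            continue
        # Middle and last pieces must be contiguous in the same document.
        if not all((doc_id, x + off) in index.get(query[off:off + n], {})
                   for off in offsets[1:]):
            continue
        last = offsets[-1]
        _, ends = index[query[last:last + n]][(doc_id, x + last)]
        if need_end and not ends:
            continue
        hits.add((doc_id, x))
    return hits
```

With the assumed segmentation 「[東京都][で][は]、」, the query 「東京都」 matches exactly, 「東京」 matches only as a prefix, and 「京都」 matches only as a suffix.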

[0216] The work of reviewing the word segmentation of search target text that has already been stored, and of correcting the index of that text, is carried out by the following procedure, as shown in FIG. 44.

[0217] First, the word comparison means 206 compares the words saved in the extracted word saving means 207, which were used to segment the search target text, with the latest words stored in the word storage means 104, and extracts the words that are stored in the word storage means 104 but not saved in the extracted word saving means 207 (difference words) (a). The difference words are passed to the search means 203. Using the index of the search target text, the search means 203 performs a full-text search with each difference word as the search string, finds the texts containing the character string of the difference word, and passes their document numbers to the word division correction means 205 (b).

[0218] The word division correction means 205 reads the search target text with that document number from the search target text storage means 204, deletes the word start and word end symbols contained in the text data to restore it to the text data before segmentation (c), and outputs it to the text storage means 102 so that the text data is segmented again.

[0219] The text data stored in the text storage means 102 is segmented by the procedure described above. At this time, the words already stored in the word storage means 104 are used in segmenting the text data (d).

[0220] The word division correction means 205 compares the re-segmented text data stored in the text storage means 102 with the original search target text data stored in the search target text storage means 204. If they differ, it updates the search target text data in the search target text storage means 204 with the re-segmented text data (e).

[0221] When the search target text data has been updated, the index creation means 201 re-creates the index of that search target text (f), and the created index is stored in the index storage means 202 (g).

[0222] The word division correction means 205 controls the processing so that steps (c), (d), (e), (f), and (g) are performed on the search target texts of all document numbers detected in (b). The word division correction means 205 also replaces the contents of the extracted word saving means 207 with the words stored in the word storage means 104.
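The maintenance procedure (a) to (g) can be outlined as below. This is a sketch under several assumptions: the segmenter is passed in as a hypothetical `segment(plain_text, words)` callable, a plain substring test stands in for the index-based full-text search of step (b), the markers are plain "[" and "]", and index rebuilding in steps (f) and (g) is left to the caller, who receives the list of updated document numbers.

```python
def strip_markers(marked, start="[", end="]"):
    """Step (c): remove word start/end symbols to recover the unsegmented text."""
    return marked.replace(start, "").replace(end, "")

def maintain(search_texts, saved_words, current_words, segment):
    """Re-segment stored texts affected by newly extracted words.

    search_texts: dict of doc_id -> marked text, updated in place.
    Returns (updated_doc_ids, new_saved_words).
    """
    diff = set(current_words) - set(saved_words)            # step (a)
    updated = []
    for doc_id, marked in list(search_texts.items()):
        plain = strip_markers(marked)
        if not any(word in plain for word in diff):         # stand-in for step (b)
            continue
        resegmented = segment(plain, current_words)         # steps (c) and (d)
        if resegmented != marked:                           # step (e)
            search_texts[doc_id] = resegmented
            updated.append(doc_id)                          # caller re-indexes: (f), (g)
    return updated, set(current_words)
```

The returned word set replaces the saved words, corresponding to the final update of the extracted word saving means 207.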

[0223] Through this series of processing, the search target text is maintained.

[0224] Here, the re-segmented text data is compared with the original search target text data stored in the search target text storage means 204, and the search target text data in the search target text storage means 204 is updated only when they differ (e). Alternatively, this comparison may be omitted, so that whenever text data is segmented again, the search target text data in the search target text storage means 204 is always updated with the re-segmented text data.

[0225] The description here uses text data segmented by the word segmentation device of the first embodiment as an example, but the same processing can be applied equally to the word segmentation devices of the second to seventh embodiments.

[0226] The description here covers the case in which both full-text search and word search are performed with a single index, but the processing can also be applied when separate indexes are kept for full-text search and for word search. In that case, the search target texts containing a difference word are found using the full-text search index, and the word search index is updated using the re-segmented text data.

[0227]

[Effects of the Invention] As is clear from the above description, the word segmentation device and word segmentation method of the present invention can divide text into words without using a dictionary.

[0228] Furthermore, when document search is performed on newspapers, for example, words are extracted using newspapers, and when document search is performed on technical literature, words are extracted using technical literature. Words suited to the target documents can thus be extracted, and word segmentation can be performed accurately.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the word segmentation device according to the first embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the word segmentation device according to the first embodiment of the present invention.

FIG. 3 is a conceptual diagram showing an example of the structure of the character processing pattern information of the word segmentation device according to the first embodiment of the present invention.

FIG. 4 is a flowchart showing the operation of applying character processing patterns in the first embodiment of the present invention.

FIG. 5 is a conceptual diagram showing the result of applying character string patterns in the first embodiment of the present invention.

FIG. 6 is a flowchart showing the operation of the word finding processing in the first embodiment of the present invention.

FIG. 7 is a conceptual diagram showing a word finding result in the first embodiment of the present invention.

FIG. 8 is a conceptual diagram showing word segmentation based on the result of applying character string patterns in the first embodiment of the present invention.

FIG. 9 is a block diagram showing the configuration of the word segmentation device according to the second embodiment of the present invention.

FIG. 10 is a flowchart showing the operation of the word segmentation device according to the second embodiment of the present invention.

FIG. 11 is a conceptual diagram showing the result of repeated word extraction in the second embodiment of the present invention.

FIG. 12 is a block diagram showing the configuration of the word segmentation device according to the third embodiment of the present invention.

FIG. 13 is a flowchart showing the operation of the word segmentation device according to the third embodiment of the present invention.

FIG. 14 is a flowchart showing the operation of the numerical expression finding processing in the third embodiment of the present invention.

FIG. 15 is a conceptual diagram showing the possibility of word segmentation of character strings in the vicinity of a numerical expression in the third embodiment of the present invention.

【図１６】本発明の第３の実施形態における文字処理用
知識情報の概念図、FIG. 16 is a conceptual diagram of character processing knowledge information according to the third embodiment of the present invention;

【図１７】本発明の第３の実施形態における単語分割処
理動作を示すフローチャート、FIG. 17 is a flowchart showing a word division processing operation according to the third embodiment of the present invention;

【図１８】本発明の第３の実施形態における文字列の単
語分割可能性の例を示す概念図、FIG. 18 is a conceptual diagram showing an example of the possibility of word division of a character string in the third embodiment of the present invention;

【図１９】本発明の第４の実施形態における単語分割装
置の構成を示すブロック図、FIG. 19 is a block diagram showing a configuration of a word segmentation device according to a fourth embodiment of the present invention;

【図２０】本発明の第４の実施形態における単語分割装
置の動作を示すフローチャート、FIG. 20 is a flowchart showing the operation of the word segmentation device according to the fourth embodiment of the present invention;

【図２１】本発明の第４の実施形態における単語抽出処
理を示すフローチャート、FIG. 21 is a flowchart showing word extraction processing according to the fourth embodiment of the present invention;

【図２２】本発明の第５の実施形態における単語分割装
置の構成を示すブロック図、FIG. 22 is a block diagram showing a configuration of a word segmentation device according to a fifth embodiment of the present invention;

【図２３】本発明の第５の実施形態における単語分割装
置の動作を示すフローチャート、FIG. 23 is a flowchart showing the operation of the word segmentation apparatus according to the fifth embodiment of the present invention;

【図２４】本発明の第５の実施形態における単語結合可
能性判定処理の動作を示すフローチャート、FIG. 24 is a flowchart showing the operation of word combination possibility determination processing in the fifth embodiment of the present invention;

【図２５】本発明の第５の実施形態の単語結合可能性判
定処理における単語の後に接続する単語の確率を計算し
た例を示す図、FIG. 25 is a diagram illustrating an example of calculating a probability of a word connected after a word in the word combination possibility determination process according to the fifth embodiment of the present invention.

【図２６】本発明の第５の実施形態の単語結合可能性判
定処理における単語の前に接続する単語の確率を計算し
た例の一つを示す図、FIG. 26 is a diagram illustrating one example of calculating a probability of a word connected before a word in the word combination possibility determination process according to the fifth embodiment of the present invention;

【図２７】本発明の第５の実施形態の単語結合可能性判
定処理における単語の前に接続する単語の確率を計算し
た別の例を示す図、FIG. 27 is a diagram illustrating another example of calculating the probability of a word connected before a word in the word combination possibility determination process according to the fifth embodiment of the present invention.

【図２８】本発明の第５の実施形態における文字列の単
語分割可能性の例を示す概念図、FIG. 28 is a conceptual diagram showing an example of the possibility of word division of a character string in the fifth embodiment of the present invention;

【図２９】本発明の第６の実施形態における単語分割装
置の構成を示すブロック図、FIG. 29 is a block diagram showing a configuration of a word segmentation device according to a sixth embodiment of the present invention;

【図３０】本発明の第６の実施形態における単語分割装
置の動作を示すフローチャート、FIG. 30 is a flowchart showing the operation of the word segmentation apparatus according to the sixth embodiment of the present invention;

【図３１】本発明の第６の実施形態における漢字文字列
抽出による漢字単語抽出処理の動作を示すフローチャー
ト、FIG. 31 is a flowchart showing the operation of kanji word extraction processing by kanji character string extraction according to the sixth embodiment of the present invention;

【図３２】本発明の第６の実施形態における漢字文字列
抽出による漢字単語抽出処理の動作例を説明するための
例文、FIG. 32 is an example sentence for explaining an operation example of kanji word extraction processing by kanji character string extraction according to the sixth embodiment of the present invention;

【図３３】本発明の第６の実施形態におけるステップ３
１０１によって抽出された漢字文字列の表の例を示す
図、FIG. 33 is a diagram showing an example of a table of kanji character strings extracted in step 3101 according to the sixth embodiment of the present invention;

【図３４】本発明の第６の実施形態におけるステップ３
１０３によって分類整列された２文字漢字文字列の表の
例を示す図、FIG. 34 is a diagram showing an example of a table of two-character kanji character strings classified and sorted in step 3103 according to the sixth embodiment of the present invention;

【図３５】本発明の第６の実施形態におけるステップ３
１０３によって分類整列された３文字漢字文字列の表の
例を示す図、FIG. 35 is a diagram showing an example of a table of three-character kanji character strings classified and sorted in step 3103 according to the sixth embodiment of the present invention;

【図３６】本発明の第６の実施形態における漢字文字列
表の別の実現法を説明する概念図、FIG. 36 is a conceptual diagram illustrating another method of realizing a kanji character string table according to the sixth embodiment of the present invention;

【図３７】本発明の第７の実施形態における単語分割装
置の構成を示すブロック図、FIG. 37 is a block diagram showing a configuration of a word segmentation device according to a seventh embodiment of the present invention;

【図３８】本発明の第７の実施形態における単語分割装
置の動作を示すフローチャート、FIG. 38 is a flowchart showing the operation of the word segmentation apparatus according to the seventh embodiment of the present invention;

【図３９】本発明の第７の実施形態における縮退文字列
作成による平仮名単語抽出処理の動作を示すフローチャ
ート、FIG. 39 is a flowchart showing the operation of hiragana word extraction processing by generating a degenerated character string in the seventh embodiment of the present invention;

【図４０】本発明の第７の実施形態における文字列変換
表の例を示す図、FIG. 40 is a diagram showing an example of a character string conversion table according to the seventh embodiment of the present invention;

【図４１】本発明の第７の実施形態における縮退文字列
作成の例を示す図、FIG. 41 is a diagram showing an example of generating a degenerated character string according to the seventh embodiment of the present invention;

【図４２】本発明の第７の実施形態における平仮名文字
列抽出の例を示す図、FIG. 42 is a diagram showing an example of hiragana character string extraction according to the seventh embodiment of the present invention;

【図４３】本発明の第８の実施形態における情報検索装
置の構成を示すブロック図、FIG. 43 is a block diagram showing a configuration of an information search device according to an eighth embodiment of the present invention;

【図４４】本発明の第８の実施形態における情報検索装
置の動作を示す図、FIG. 44 is a view showing the operation of the information search device according to the eighth embodiment of the present invention;

【図４５】本発明の第８の実施形態における情報検索装
置のインデックスを説明する図である。FIG. 45 is a diagram illustrating an index of the information search device according to the eighth embodiment of the present invention.

[Explanation of symbols]

101 Text input means (テキスト入力手段)
102 Text storage means (テキスト記憶手段)
103 Word extraction means (単語抽出手段)
104 Word storage means (単語記憶手段)
105 Word finding means (単語発見手段)
106 Word division means (単語分割手段)
107 Word processing knowledge storage means (単語処理用知識記憶手段)
108 Text output means (テキスト出力手段)
109 Word count determination means (単語数判定手段)
110 Numerical expression finding means (数値表現発見手段)
111 Word division possibility calculation means (単語分割可能性計算手段)
112 Kanji word extraction means (漢字単語抽出手段)
113 Katakana word extraction means (片仮名単語抽出手段)
114 Hiragana word extraction means (平仮名単語抽出手段)
115 Word combination possibility calculation means (単語結合可能性計算手段)
116 Partial character string extraction means (部分文字列抽出手段)
117 Text processing buffer (テキスト処理用バッファ)
118 Character string classification and sorting means (文字列分類整列手段)
119 Character string conversion table (文字列変換表)
120 Degenerated character string creation means (縮退文字列作成手段)
201 Index creation means (インデックス作成手段)
202 Index storage means (インデックス記憶手段)
203 Search means (検索手段)
204 Search target text storage means (検索対象テキスト記憶手段)
205 Word division correction means (単語分割修正手段)
206 Word comparison means (単語比較手段)
207 Extracted word storage means (抽出単語保存手段)

Continued from the front page
(72) Inventor: Chuichi Kikuchi (菊池忠一), c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
(72) Inventor: Takashi Shimojima (下島崇), c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka
F-terms (reference): 5B075 ND03 NS10 QS01; 5B091 AA15 CA02

Claims

[Claims]

1. A word dividing apparatus for dividing text data into words, comprising: text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information for word extraction and word division; word extracting means for extracting words from the text data stored in the text storage means by applying the pattern information; word storage means for storing the extracted words; word finding means for finding each position in the text data at which an extracted word is used; and word dividing means for dividing the text data into words on the basis of the positions of the words found in the text data and new word division positions obtained by applying the word-division pattern information based on those positions.

2. A word dividing apparatus for dividing text data into words, comprising: text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information for word extraction and word division; word extracting means for repeatedly applying the pattern information to the text data stored in the text storage means to repeatedly extract words from the text data; word storage means for storing the extracted words; word finding means for finding, each time a word is extracted, each position in the text data at which the extracted word is used; and word dividing means for dividing the text data into words on the basis of the positions of the words in the text data found by the word finding means.

3. The word dividing apparatus according to claim 2, further comprising word number determining means for managing the number of words extracted by the word extracting means, wherein the extraction is repeated.

4. A word dividing apparatus for dividing text data into words, comprising: text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information for word extraction and word division; word extracting means for extracting words from the text data stored in the text storage means by applying the pattern information; word storage means for storing the extracted words; word finding means for finding each position in the text data at which an extracted word is used; division possibility calculating means for quantitatively calculating the division possibility at the positions found by the word finding means; and word dividing means for dividing the text data into words on the basis of the calculation result of the division possibility calculating means.

5. The word dividing apparatus according to claim 4, further comprising numerical expression finding means for identifying numerical expressions in the text data stored in the text storage means.

6. The word dividing apparatus according to claim 4, wherein the word extracting means extracts words of a specific character type from the text data.

7. The word dividing apparatus according to claim 6, wherein the word extracting means comprises kanji word extracting means for extracting kanji words, katakana word extracting means for extracting katakana words, and hiragana word extracting means for extracting hiragana words.

8. The word dividing apparatus according to claim 7, further comprising combination possibility calculating means for quantitatively calculating the possibility that words of different character types extracted by the word extracting means combine, wherein a combination of words of different character types having a high combination possibility is treated as one word.

9. A word dividing apparatus for dividing text data into words, comprising: text storage means for storing natural-language text data; word processing knowledge storage means for storing pattern information for word extraction and word division; partial character string extracting means for extracting character strings containing a specific character type from the text data stored in the text storage means; classifying and sorting means for classifying and sorting the extracted character strings by the number of characters of that character type; word extracting means for extracting words of that character type by applying the pattern information to the character strings arranged by number of characters; word storage means for storing the extracted words; word finding means for finding each position in the text data at which an extracted word is used; division possibility calculating means for quantitatively calculating the division possibility at the positions found by the word finding means; and word dividing means for dividing the text data into words on the basis of the calculation result of the division possibility calculating means.

10. The word dividing apparatus according to claim 9, wherein the partial character string extracting means extracts character strings including kanji, and the word extracting means extracts kanji words.

11. The word dividing apparatus according to claim 9, further comprising degenerated character string creating means for creating, from the text data, a character string in which character types other than hiragana are replaced with symbols, wherein the partial character string extracting means extracts character strings including hiragana from the character string created by the degenerated character string creating means, and the word extracting means extracts hiragana words.

12. A word division method for dividing text data into words, comprising: extracting words from the text data by applying pattern information for word extraction; finding each position in the text data at which an extracted word is used; and dividing the text data into words at the positions of the words found in the text data and at new word division positions obtained by applying word-division pattern information based on those positions.

13. A word division method for dividing text data into words, comprising: extracting words from the text data by applying pattern information for word extraction; finding each position in the text data at which an extracted word is used; extracting new words by applying the pattern information based on the found word positions; repeating this procedure; and then dividing the text data at the positions of the words found in the text data.

14. The word division method according to claim 13, wherein the procedure is repeated until no new word is extracted.

15. A word division method for dividing text data into words, comprising: extracting words from the text data by applying pattern information for word extraction; finding each position in the text data at which an extracted word is used; quantitatively calculating a division possibility at each found position; and determining, on the basis of the calculated values, the positions at which the text data is divided into words.

16. A word division method for dividing text data into words, wherein a numeral is found in the text data, a numerical expression is identified from the character strings before and after the numeral, and the position in the text data at which the numerical expression appears is used as a reference position for word division.

17. A word division method for dividing text data into words, comprising: extracting character strings containing a specific character type from the text data; classifying and sorting the extracted character strings by the number of characters of that character type; extracting words of that character type by applying pattern information for word extraction to the character strings arranged by number of characters; and finding each position in the text data at which an extracted word is used.

18. The word dividing method according to claim 17, wherein a character string including a kanji is extracted from the text data to extract a kanji word.

19. The word division method according to claim 17, wherein a character string is created in which character types other than hiragana in the text data are replaced with symbols, character strings including hiragana are extracted from the created character string, the extracted character strings are classified and sorted by the number of hiragana characters, and hiragana words are extracted by applying pattern information for word extraction to the character strings arranged by number of characters.

20. A word division method for re-dividing text data that has already been divided into words, using an updated word group, comprising: comparing the word group used for word division of the text data with the updated word group to detect difference words; searching for text data including the character string of a difference word; releasing the word division of the retrieved text data to reproduce the text data before word division; and dividing the reproduced text data into words using the updated word group.

21. An information search apparatus for searching text that has been divided into words, comprising: search target text storage means for storing text data of search target texts divided into words; index creating means for creating an index of the search target texts; index storage means for storing the created index; search means for searching the search target texts using the index; extracted word storage means for storing the words used for word division of the search target texts; a word dividing apparatus for dividing text data into words; word comparing means for comparing the word group used by the word dividing apparatus for word division of new text data with the word group stored in the extracted word storage means to detect difference words; and word division correcting means for controlling word re-division of the search target texts, wherein the word division correcting means performs control so that the search target texts including the character string of a difference word are re-divided into words through the word dividing apparatus.

22. The information search apparatus according to claim 21, comprising, as the word dividing apparatus, the word dividing apparatus according to claim 1.

23. The information search apparatus according to claim 21, wherein the search means searches for a search target text including the character string of the difference word using the index.

24. The information search apparatus according to claim 21, wherein the word division correcting means compares the text data re-divided into words by the word dividing apparatus with the data of the search target text before re-division and, when the two differ, updates the data of the search target text stored in the search target text storage means with the re-divided text data, and wherein, when the search target text in the search target text storage means is updated, the index creating means re-creates the index for word search of the search target text.
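The character-type preprocessing recited in claims 17 to 19 can be pictured with the following minimal sketch. It is an illustration only: the Unicode ranges used to recognize kanji and hiragana, the `*` placeholder symbol, and the function names are assumptions made for this example and are not taken from the claims.

```python
import re
from collections import defaultdict

# Assumed kanji range: the basic CJK ideographs from U+4E00 to U+9FA5.
KANJI_RUN = re.compile(r"[一-龥]+")


def kanji_strings_by_length(text):
    """Extract runs of kanji and group the distinct runs by character count."""
    table = defaultdict(set)
    for run in KANJI_RUN.findall(text):
        table[len(run)].add(run)
    return {length: sorted(runs) for length, runs in sorted(table.items())}


def degenerate(text, symbol="*"):
    """Replace every character that is not hiragana with a placeholder symbol."""
    return "".join(ch if "ぁ" <= ch <= "ん" else symbol for ch in text)


def hiragana_strings(text, symbol="*"):
    """Hiragana runs that remain once the degenerated string is cut at symbols."""
    return [run for run in degenerate(text, symbol).split(symbol) if run]


sample = "単語分割は辞書を使わずに行う"
kanji_table = kanji_strings_by_length(sample)
# kanji_table == {1: ["使", "行"], 2: ["辞書"], 4: ["単語分割"]}
reduced = degenerate(sample)
# reduced == "****は**を*わずに*う"
```

Grouping by length puts strings of the same number of characters side by side, which is where the word extraction patterns are then applied; the pattern step itself is not shown here.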