JPH09138801A

JPH09138801A - Character string extracting method and its system

Info

Publication number: JPH09138801A
Application number: JP7321179A
Authority: JP
Inventors: Sayori Shimohata; さより下畑; Toshiyuki Sugio; 俊之杉尾; Junji Nagata; 淳次永田
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-11-15
Filing date: 1995-11-15
Publication date: 1997-05-27

Abstract

PROBLEM TO BE SOLVED: To enable generation of a dictionary optimized for a text by registering extracted words as a dictionary, so that the words can be used for syntactic analysis of the text and the like. SOLUTION: An arbitrary consecutive character string is extracted (3) from a text (1) written in natural language, and for each character adjacent to the consecutive character string, the frequency with which it appears together with the string is examined (4). Whether the character has a high probability of being used as a unit with the consecutive character string is evaluated objectively by a numerical value corresponding to this appearance frequency. When the frequency is high, the character string including the adjacent character is recognized as a single word or phrase. Words extracted in this way are registered as a dictionary and used for syntactic analysis of the text and the like.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a character string extraction method and system for automatically extracting continuous character strings, such as words and idioms, for use in natural language processing systems that perform machine translation, information retrieval, and the like.

[0002]

2. Description of the Related Art

A machine translation system that automatically translates documents converted into data usable by an information processing device, or a system that searches the full text of documents to retrieve those that use a given keyword, requires processing that extracts character strings carrying a definite meaning from the document. When such character strings are technical terms or other academic terms, those in wide general use can simply be registered in a dictionary in advance. In general documents, however, the terminology varies from document to document, and a dictionary prepared in advance is often insufficient. For this reason, techniques have been developed that extract meaningful character strings directly from the document to be processed (reference: IPSJ Research Report Vol. 93, No. 61 (93-NL-96-1)). That report introduces a technique for extracting continuous character strings from natural-language text on the basis of string length and appearance frequency.

[0003]

Problems to be Solved by the Invention

The conventional system described above, however, leaves the following problem. Suppose, for example, that the character string "アドレス" (address) appears 100 times in the target text and the character string "ドレス" (dress) appears once on its own. A simple search for "アドレス" counts 100 occurrences. Counting "ドレス", on the other hand, yields 101 occurrences, because every "アドレス" also contains "ドレス". In such a case, deciding from the occurrence count alone whether "ドレス" should be recognized as a meaningful character string in that text can lead to an error. What is needed is to use such a continuous-string extraction technique to select character strings that actually form a meaningful unit, taking the strength of their cohesion into account. Generating a dictionary from these strings makes analysis, translation, and other processing of the text easier.
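The counting problem described above can be reproduced with a minimal sketch. The sample text, the helper name `count_occurrences`, and the surrounding filler characters are assumptions made only for illustration; the source gives just the counts 100 and 1.

```python
# Hypothetical text: "アドレス" 100 times plus one standalone "ドレス".
text = "アドレス、" * 100 + "赤いドレス。"


def count_occurrences(text: str, target: str) -> int:
    """Count occurrences of target at every start position in text."""
    return sum(1 for i in range(len(text)) if text.startswith(target, i))


address_count = count_occurrences(text, "アドレス")  # 100
dress_count = count_occurrences(text, "ドレス")      # 101: 100 inside "アドレス" + 1 standalone
print(address_count, dress_count)
```

A plain frequency ranking would therefore place "ドレス" above "アドレス", even though it is almost never used on its own in this text.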

[0004]

Means for Solving the Problems

To solve the above problem, the present invention adopts the following structures.

<Structure 1> An arbitrary continuous character string is extracted from a text written in natural language. For each character adjacent to the continuous character string, the frequency with which that character appears together with the string is compared with a preset threshold. A character whose appearance frequency exceeds the threshold is regarded as being used as a unit with the continuous character string, and the combined string is selected as a character string to be recognized as a whole.

[0005]

<Explanation> The text written in natural language may be one without delimiters between words, as in Japanese. An arbitrary character string may come from any part of the text and may consist of any number of characters. The character adjacent to the continuous character string may be the immediately preceding character or the immediately following character, and either one or both may be examined. Examining the frequency with which a character appears together with the continuous character string makes it possible to evaluate objectively, as a numerical value, whether that character is likely to be used as a unit with the string. If the frequency is high, the string together with the adjacent character can be recognized as a single word or idiom. Words extracted in this way are registered as a dictionary and used for syntactic analysis of the text and the like. The threshold is set empirically to a reasonable value. For example, the threshold may be raised to extract only very strongly bound continuous character strings, or words may be extracted in graded levels, from strongly bound to weakly bound.

[0006]

<Structure 2> An arbitrary continuous character string is extracted from a text written in natural language. For the characters immediately before and immediately after the continuous character string, the frequency with which each appears together with the string is compared with a preset threshold. If no character has an appearance frequency higher than the threshold, the continuous character string itself is selected as a character string to be recognized as a whole in the text.

[0007]

<Explanation> The appearance frequency of the characters immediately before the continuous character string can be expressed as a forward dispersion value, and that of the characters immediately after it as a backward dispersion value. The appearance frequency can also be computed statistically by other means, such as the standard deviation. A character string to be recognized as a whole in the text is a continuous character string that appears frequently in the text. Such strings can make up the dictionary best suited to analyzing that text. A dictionary can therefore be optimized for any text without preparing one in advance.

[0008]

<Structure 3> The system comprises: a continuous character string extraction unit that extracts an arbitrary continuous character string from a text written in natural language; an appearance frequency calculation unit that calculates, for each character adjacent to the continuous character string, the frequency with which it appears together with the string; an appearance frequency comparison unit that compares this appearance frequency with a preset threshold and, treating a character whose frequency exceeds the threshold as being used as a unit with the continuous character string, selects the combined string as a character string to be recognized as a whole; and a storage device that registers and stores the selected continuous character string.

[0009]

<Explanation> The text may be input from a keyboard, from a floppy disk, or by transfer from another information processing device. One threshold or several may be set. The extracted continuous character string may be registered directly, or it may first be stored on a floppy disk or similar medium and then registered in the storage device of another information processing device.

[0010]

BEST MODE FOR CARRYING OUT THE INVENTION

Embodiments of the present invention are described below using specific examples.

<Specific Example 1> FIG. 1 is a functional block diagram of a system according to the present invention. The system is realized, in outline, by the functional blocks shown in this figure. It comprises a text input unit 2 for inputting a natural-language text 1, a continuous character string extraction unit 3, an appearance frequency calculation unit 4, an appearance frequency comparison unit 5, a storage device 6, and so on. The natural-language text 1 is input from the text input unit 2 in digitized form. The continuous character string extraction unit 3 extracts continuous character strings from it. The extraction method is described later. For the characters adjacent to the front and rear of each such string, the appearance frequency calculation unit 4 calculates the appearance frequency, and the appearance frequency comparison unit 5 compares it with a predetermined threshold 7. For example, counting how many times a given character appears immediately before a continuous character string, and taking the ratio of that count to the number of occurrences of the string itself, makes the strength of the bond between the preceding character and the string explicit.
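The ratio described above can be sketched as follows. This is an illustrative reading, not the patented implementation: the function name `preceding_char_ratios`, the sample text, and the threshold value 0.9 are all assumptions (the source says only that the threshold is chosen empirically).

```python
from collections import Counter


def preceding_char_ratios(text: str, target: str) -> dict[str, float]:
    """For each character seen immediately before target, return
    (count of that character before target) / (occurrences of target)."""
    starts = [i for i in range(len(text)) if text.startswith(target, i)]
    if not starts:
        return {}
    before = Counter(text[i - 1] for i in starts if i > 0)
    return {ch: n / len(starts) for ch, n in before.items()}


# Hypothetical text: "アドレス" 100 times plus one standalone "ドレス".
text = "アドレス、" * 100 + "赤いドレス。"
ratios = preceding_char_ratios(text, "ドレス")

THRESHOLD = 0.9  # assumed value for illustration only
bound_chars = [ch for ch, r in ratios.items() if r > THRESHOLD]
print(bound_chars)  # ['ア']: "ア" is judged to be used as a unit with "ドレス"
```

With this text, "ア" precedes "ドレス" in 100 of 101 occurrences, so it exceeds the assumed threshold and the unit is extended to "アドレス".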

[0011]

The strength of the bond with the immediately preceding character is quantified in this way, and when the bond is strong, the continuous character string and the preceding character are judged to be used as a unit. This makes it possible to determine how far the unit extends. The same judgment is made for both the preceding and the following character, and the resulting string is extracted from the text as a single unit. Continuous character strings that should be recognized as a whole, such as important words and idioms, can thus be extracted from the text and used to generate a dictionary.

[0012]

The method of the present invention is described more concretely below. FIG. 2 is a block diagram of a concrete implementation of the system. The system in the figure has a storage device 6, an input/output device 11, and a processing device 14. The storage device 6 consists of hardware that stores the input text and the results of each processing stage. The input/output device 11 consists of a keyboard, a display, and the like, and is used to input text and display extraction results. The processing device 14 executes the various processes for extracting continuous character strings and is implemented by a workstation or the like. The storage device 6 has an input text storage unit 21 that stores the input text, a buffer 22 that stores the results of each processing stage, and an extracted character string storage table 23 that stores information on the extracted character strings. The buffer 22 comprises buffers X, Y, A, B, and C, each of which is allocated in a suitable storage area on a hard disk or in main memory.

[0013]

FIG. 3 is an explanatory diagram of an example of the extracted character string storage table. The extracted character string storage table 23 has an extracted character string storage unit 25, an appearance count storage unit 26, a forward dispersion value storage unit 27, and a backward dispersion value storage unit 28. The input/output device 11 shown in FIG. 2 has an input unit 12 and an output unit 13. The input unit 12 has the function of inputting text. It may be a keyboard, for example, or a device that accesses a text file stored in a computer's storage device. The output unit 13 has the function of displaying extraction results and the like, and consists of, for example, a display or a printer.
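One way to picture a row of the extracted character string storage table 23 is as a record with four fields. The class and field names below are hypothetical. The sample values for "市" (count 5, backward dispersion value 0.673012) are the ones given in the worked example later in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedStringEntry:
    """One row of the table; field names are illustrative assumptions."""
    string: str                                  # extracted character string storage unit 25
    count: int                                   # appearance count storage unit 26
    forward_dispersion: Optional[float] = None   # forward dispersion value storage unit 27
    backward_dispersion: Optional[float] = None  # backward dispersion value storage unit 28


table: dict[str, ExtractedStringEntry] = {}
# After the forward pass only the backward dispersion value is known.
table["市"] = ExtractedStringEntry("市", 5, backward_dispersion=0.673012)
print(table["市"])
```

Leaving the forward value empty until the reverse pass mirrors the two-pass order of the process, in which the string and its count are stored only once.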

[0014]

The processing device 14 has a character string cutout unit 15, a sort processing unit 16, a character string extraction unit 17, and a dispersion value calculation unit 18. The character string cutout unit 15 cuts arbitrary character strings out of the input text. The sort processing unit 16 sorts character strings by an arbitrary key and stores them in a buffer. The character string extraction unit 17 compares two character strings, counts the number of characters that match from the beginning, and stores the matching partial string, the character that follows it, and the start address of the string in a buffer. The dispersion value calculation unit 18 calculates the degree of dispersion of the characters that follow each extracted character string and stores it in the extracted character string storage table 23.

[0015]

The system configured as above operates as follows. FIG. 4 is a flowchart of the character string extraction process of the present invention. First, text input is performed (step S1) using the input unit 12, and the input text is stored in the input text storage unit 21 of FIG. 2. Next, the input text is read from the input text storage unit 21 (step S2), and character strings that start at each character and run to an arbitrary end point are cut out (step S3). This is done by the character string cutout unit 15. For example, for a text of n characters, if strings are cut out starting at each of characters 1 through n and ending at character n, this process yields n character strings.
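The cutout step (step S3) can be sketched as generating one suffix per character position. The function name `cut_suffixes` and the ASCII sample text are assumptions for illustration. Here the end point is simply the end of the text.

```python
def cut_suffixes(text: str) -> list[tuple[int, str]]:
    """Return (1-based start position, string from that position to the end)
    for every character position in text."""
    return [(i + 1, text[i:]) for i in range(len(text))]


suffixes = cut_suffixes("abcab")
print(suffixes)
# [(1, 'abcab'), (2, 'bcab'), (3, 'cab'), (4, 'ab'), (5, 'b')]
```

A text of n characters yields n strings, as the paragraph above states.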

[0016]

Next, the cut-out character strings are sorted by an arbitrary key (step S4) and stored in buffer X (step S5). In this specific example the sort key is dictionary order, meaning gojūon (Japanese syllabary) order or alphabetical order. This is done by the sort processing unit 16. Next, each character string in buffer X is compared with the character strings positioned immediately before and after it, and the kinds and counts of the matching strings, together with the kinds and counts of the characters following each extracted string, are stored in buffer Y (step S6). This is done by the character string extraction unit 17. Next, the degree of dispersion of the characters following each extracted string is calculated (step S7) from the number of occurrences of the extracted string and the kinds and counts of the following characters. The dispersion value is defined so that, for a string that occurs m times, it is smallest when the following character is always the same single character (one kind occurring m times) and largest when there are m different following characters occurring once each. This is done by the dispersion value calculation unit 18 on the basis of the result of step S6.
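The sort step (steps S4 and S5) matters because it places strings that share a prefix next to each other. The sketch below uses Python's default string ordering, which compares Unicode code points; treating that as a stand-in for gojūon or alphabetical dictionary order is an assumption. The input list is hypothetical.

```python
# (start position, string) pairs as produced by a cutout step.
suffixes = [(1, "abcab"), (2, "bcab"), (3, "cab"), (4, "ab"), (5, "b")]

# Sort by the string itself; the start position is carried along.
buffer_x = sorted(suffixes, key=lambda item: item[1])
print(buffer_x)
# [(4, 'ab'), (1, 'abcab'), (5, 'b'), (2, 'bcab'), (3, 'cab')]
```

After sorting, "ab" and "abcab" are neighbors, so a repeated substring can be found by comparing only adjacent entries.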

[0017]

From this result, as shown in FIG. 3, the extracted character string is stored in the extracted character string storage unit 25, its number of occurrences in the appearance count storage unit 26, and the dispersion value in the backward dispersion value storage unit 28 (step S8). When this is complete, the characters of the text in the input text storage unit 21 are rearranged in reverse order (step S10), and steps S2 through S9 are executed again. The dispersion value of the following characters obtained in this second pass is stored in the forward dispersion value storage unit 27 of FIG. 3 (step S8). The character strings extracted in the reverse pass, and their occurrence counts, are the same as those from the forward pass, so they need not be stored again. This completes the character string extraction process.
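The reason a reversed pass yields the forward (preceding-character) statistics can be checked with a small sketch: the characters that follow a reversed string in the reversed text are exactly the characters that precede the string in the original text. The function name and the ASCII sample text are assumptions.

```python
from collections import Counter


def following_chars(text: str, target: str) -> Counter:
    """Count the characters that immediately follow each occurrence of target."""
    n = len(target)
    return Counter(
        text[i + n]
        for i in range(len(text) - n)
        if text.startswith(target, i)
    )


text = "xabyabwzabq"
after = following_chars(text, "ab")                 # characters after "ab"
before = following_chars(text[::-1], "ab"[::-1])    # characters before "ab", via reversal
print(after, before)
```

The same "following character" routine therefore serves both passes; only the input text is reversed.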

[0018]

Next, the matching character string extraction of step S6 in FIG. 4 is described in more detail. FIG. 5 is a flowchart of this process. The character string extraction consists of a process that finds the maximum number of matching characters, a process that extracts the matching strings, and a process that tallies the results and stores them in a buffer. To find the maximum number of matching characters, each character string in buffer X is first read (step S1) and compared with the string stored immediately before it in buffer X, and the number of matching characters is stored in buffer A (step S2). It is then compared with the string stored immediately after it, and the number of matching characters is stored in buffer B (step S3). Buffers A and B are compared (step S4), and the larger value is taken as the maximum number of matching characters (steps S5, S6).
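The maximum-match computation (FIG. 5, steps S1 to S6) amounts to taking, for each entry of the sorted buffer, the longer of its common prefixes with the previous and the next entry. The following is a minimal sketch under that reading. The function names and the sample buffer are assumptions.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of characters that match from the start of both strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def max_match_lengths(buffer_x: list[str]) -> list[int]:
    """For each string, max of its match with the previous entry (buffer A)
    and with the next entry (buffer B)."""
    result = []
    for i, s in enumerate(buffer_x):
        a = common_prefix_len(s, buffer_x[i - 1]) if i > 0 else 0
        b = common_prefix_len(s, buffer_x[i + 1]) if i + 1 < len(buffer_x) else 0
        result.append(max(a, b))
    return result


buffer_x = ["ab", "abcab", "b", "bcab", "cab"]  # already sorted
max_lens = max_match_lengths(buffer_x)
print(max_lens)  # [2, 2, 1, 1, 0]
```

An entry whose value is 0 shares no prefix with either neighbor, which corresponds to the case where nothing is extracted.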

[0019]

Next, to extract the matching strings, the parameter n, which represents a number of characters, is set to 1 (step S7) and compared with the maximum number of matching characters (step S8). If n does not exceed it, characters 1 through n of the string, together with character n+1, are stored in buffer C (steps S9, S10). Then n is incremented by one (step S11) and the process returns to step S8. This continues until n reaches the maximum number of matching characters. Once n exceeds it, the same process is repeated for the next character string. If buffers A and B both hold 0, no matching string exists in the text, so nothing is extracted. Finally, the extracted strings, their occurrence counts, and the occurrence count of each following character are tallied from the contents of buffer C (steps S12, S13) and stored in buffer Y (step S14).
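Steps S7 to S14 can be sketched as two loops: one that emits (prefix, next character) pairs into buffer C, and one that tallies them into buffer Y. The names below are hypothetical, and the single `"<END>"` marker is a simplification introduced here for strings that have no character n+1.

```python
from collections import Counter, defaultdict


def extract_and_tally(buffer_x: list[str], max_lens: list[int]):
    """Return (buffer_c, buffer_y) for sorted strings and their max match lengths."""
    buffer_c = []
    for s, m in zip(buffer_x, max_lens):
        for n in range(1, m + 1):                  # n = 1 .. max match (S7 to S11)
            nxt = s[n] if n < len(s) else "<END>"  # character n+1, or assumed end marker
            buffer_c.append((s[:n], nxt))
    buffer_y = defaultdict(Counter)                # prefix -> counts of following characters
    for prefix, nxt in buffer_c:
        buffer_y[prefix][nxt] += 1
    return buffer_c, buffer_y


buffer_x = ["ab", "abcab", "b", "bcab", "cab"]
max_lens = [2, 2, 1, 1, 0]
buffer_c, buffer_y = extract_and_tally(buffer_x, max_lens)
for prefix, nexts in buffer_y.items():
    print(prefix, sum(nexts.values()), dict(nexts))
```

The total per prefix is its occurrence count in the text: "ab" occurs twice in "abcab", once followed by "c" and once at the end.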

[0020]

Next, the process of the present invention is described concretely using an actual example. Assume that the text shown in FIG. 6 is input. Line breaks are written as "CR" for convenience. FIG. 7 shows the correspondence between each character of the input text and its character number. Below, to make clear which character string is meant, each extracted string is written with the character number of its start position, in the form "string (character number)". The text is input using the input unit 12 (FIG. 4, step S1). Starting from the first character, the input text is cut into character strings that each run to an arbitrary end point (FIG. 4, step S3). Here the end point is the line break "CR". In the example of FIG. 6, the string from the first character "株 (1)" to "CR (5)" is cut out first. Next, the string from "式 (2)" to "CR (5)" is cut out. This is repeated up to the string from the final "合 (32)" to "CR (33)".

[0021]

FIG. 8 shows the character strings cut out by this process. The character number is that of the first character. Next, the cut-out strings are sorted in dictionary order and stored in buffer X (FIG. 4, step S5). FIG. 9 shows the contents of buffer X after this step. Next, the matching strings are extracted (FIG. 4, step S6). First, each string in buffer X is compared with the strings stored immediately before and after it in buffer X, the number of characters matching from the start of the string is counted, and the maximum number of matching characters is obtained (FIG. 5, steps S1 to S4). For "市場統合 (29)" (market integration) in FIG. 9, for example, the characters matching the preceding string "市場 (23)" (market) are "市場", a match count of 2, and the character matching the following string "市民 (8)" (citizen) is "市", a match count of 1. The larger value, 2, is therefore the maximum number of matching characters.

[0022]

Next, the partial strings of the matching string are extracted (steps S5 to S11). First, n is set to 1 (FIG. 5, step S7). Since n (= 1) does not exceed the maximum number of matching characters (= 2), characters 1 through n (= 1) of the string, that is, the first character "市", and character n+1 (= 2), "場", are stored in buffer C (FIG. 5, steps S9, S10). Next, n is incremented to 2 (FIG. 5, step S11). Again n (= 2) does not exceed the maximum (= 2), so characters 1 through n (= 2), "市場", and character n+1 (= 3), "統", are stored in buffer C (FIG. 5, steps S9, S10). Next, n is incremented to 3 (FIG. 5, step S11). Now n (= 3) exceeds the maximum (= 2), so processing of "市場統合 (29)" ends and the same process is applied to the next string, "市民 (8)" (step S2). This is repeated through the last character string.
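The walk-through for "市場統合 (29)" can be checked with a few lines. The three strings are taken from the description but truncated to the characters it quotes (the full buffer X entries in FIG. 9 are not reproduced here), and the variable names are assumptions. `os.path.commonprefix` is used only because it compares strings character by character.

```python
from os.path import commonprefix

prev_s, cur, next_s = "市場", "市場統合", "市民"  # neighbors in sorted buffer X

match_a = len(commonprefix([cur, prev_s]))  # 2, matches "市場"
match_b = len(commonprefix([cur, next_s]))  # 1, matches "市"
max_match = max(match_a, match_b)           # 2

# For n = 1 .. max_match: characters 1..n and character n+1 go to buffer C.
pairs = [(cur[:n], cur[n]) for n in range(1, max_match + 1)]
print(pairs)  # [('市', '場'), ('市場', '統')]
```

The loop stops at n = 2, which matches the description: at n = 3 the counter exceeds the maximum number of matching characters.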

[0023]

FIG. 10 shows the contents of buffer C after this process. Next, the matching strings and their occurrence counts, together with the kinds of characters following each string and their counts, are tallied and stored in the buffer 22 (FIG. 5, steps S13, S14). FIG. 11 shows the contents of buffer Y after this process. Next, the degree of dispersion of the characters following each matching string is calculated (FIG. 4, step S7). This is done by the dispersion value calculation unit 18 using the result of step S14 in FIG. 5. Here the degree of dispersion is obtained using entropy, which is given by $\sum_{i=1}^{n} -P_i \ln P_i$, where $P_i$ is the probability of $i$. For "市" in FIG. 11, for example, "市" appears 5 times, followed by "場" 3 times and by "民" 2 times. The backward dispersion value of "市" is then $-\frac{3}{5}\ln\frac{3}{5} - \frac{2}{5}\ln\frac{2}{5} \approx 0.673012$.
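The entropy calculation above can be reproduced directly. The function name `dispersion` is an assumption. The counts for "市" (followed by "場" 3 times and "民" 2 times) are the ones stated in the description.

```python
import math
from collections import Counter


def dispersion(next_char_counts: Counter) -> float:
    """Entropy (natural log) of the distribution of following characters."""
    total = sum(next_char_counts.values())
    return -sum((c / total) * math.log(c / total) for c in next_char_counts.values())


h = dispersion(Counter({"場": 3, "民": 2}))
print(round(h, 6))  # 0.673012
```

The same function satisfies the bounds stated earlier for a string occurring m times: it is 0 when a single following character occurs m times, and ln m when m different characters follow once each.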

[0024]

Next, the result is stored in the extracted character string storage table 23 (FIG. 4, step S8). In this example, "市" is stored in the extracted character string storage unit 25 of FIG. 3, "5" in the appearance count storage unit 26, and "0.673012" in the backward dispersion value storage unit 28. FIG. 12 shows the result of this process. Note that "CR" occurs twice as the character following "市場", but because "CR" here marks an end point, each occurrence is treated as a different character.

[0025]

When this process is complete, the process described above is executed again in reverse order, from the end of the text. The dispersion values obtained from the reverse pass are stored in the forward dispersion value storage unit 27 shown in FIG. 3 (FIG. 4, step S8). FIG. 13 shows an example of the text rearranged in reverse order. FIG. 14 shows an example of the character string cutout process, FIG. 15 an example of buffer X after sorting, FIG. 16 an example of buffer C after character string extraction, FIG. 17 an example of buffer Y after character string extraction, and FIG. 18 an example of the extracted character string storage table after the dispersion values are calculated.

[0026] The forward variance value represents how strongly the character string is separated from the characters preceding it. The backward variance value represents how strongly the character string is separated from the characters following it. When the forward and backward variance values are close to each other, as with 「市民」 ("citizen") and 「株式市場」 ("stock market"), the character string tends to form a single unit. When the forward variance value is large and the backward variance value is small, as with 「市」 ("city"), the character string tends to be followed by a specific character (string), like a prefix or modifier. Conversely, when the forward variance value is small and the backward variance value is large, as with 「市場」 ("market"), the character string tends to be preceded by a specific character (string), like a prefix or noun. These tendencies become stronger as the variance values increase, and they appear more clearly as the input text grows larger. By using this property and setting an appropriate threshold value, the character strings in the extracted character string storage table can be ordered or extracted according to the intended use.
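One way to apply such thresholds to the storage table is sketched below. The text does not specify a concrete rule, so everything here is an assumption: the `TableEntry` fields simply mirror storage units 25 to 28, the `tolerance` and `min_value` parameters are hypothetical, the ordering by the smaller of the two values is one possible choice, and the numeric values are illustrative rather than taken from the figures.

```python
from dataclasses import dataclass

@dataclass
class TableEntry:
    string: str      # extracted character string (storage unit 25)
    count: int       # appearance count (storage unit 26)
    forward: float   # forward variance value (storage unit 27)
    backward: float  # backward variance value (storage unit 28)

def classify(entry, tolerance):
    """Label an entry by comparing its forward and backward values (hypothetical rule)."""
    if abs(entry.forward - entry.backward) <= tolerance:
        return "unit"    # values are close: tends to form a single unit
    if entry.forward > entry.backward:
        return "leads"   # tends to be followed by a specific character (string)
    return "trails"      # tends to be preceded by a specific character (string)

def select_units(table, tolerance, min_value):
    """Keep unit-like entries above a threshold, strongest first."""
    picked = [e for e in table
              if classify(e, tolerance) == "unit"
              and min(e.forward, e.backward) >= min_value]
    return sorted(picked, key=lambda e: min(e.forward, e.backward), reverse=True)

# Illustrative values only; they are not the contents of FIG. 18.
table = [
    TableEntry("市", 5, 1.33, 0.67),
    TableEntry("市場", 3, 0.64, 1.10),
    TableEntry("市民", 2, 0.69, 0.69),
    TableEntry("株式市場", 2, 0.69, 0.64),
]
units = [e.string for e in select_units(table, tolerance=0.2, min_value=0.5)]
print(units)
```

Changing `tolerance` or `min_value` changes which strings are kept, which corresponds to ordering or extracting the table entries according to the intended use.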

[0027] As described above, according to this specific example, continuous character strings that should be recognized as a unit can be extracted automatically and accurately from the character strings appearing in the text, without depending on the length or frequency of the character strings. For example, suppose that in a certain text 「ドレス」 ("dress") is preceded by 「ア」 100 times and by any other character once. The forward variance value of 「ドレス」 is then very small. If various characters appear before and after 「アドレス」 ("address"), its forward and backward variance values are large, and 「アドレス」 can be extracted even though its appearance frequency is lower than that of 「ドレス」. The character strings extracted by this invention have a high appearance frequency and are expressions that do not depend on word structure, so they can be used in a wide range of natural language processing fields, such as creating technical term dictionaries for machine translation and extracting keywords for information retrieval.
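The 「ドレス」 example can be made concrete with a short calculation. The counts for 「ドレス」 (100 and 1) come from the text; the counts for 「アドレス」 (20 different preceding characters, 5 occurrences each) are invented here purely to illustrate "various characters", and the helper name is hypothetical.

```python
import math

def variance_from_counts(counts):
    """Return -sum(p * ln p) for a list of adjacent-character counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts)

# "ドレス": preceded by "ア" 100 times and by another character once (from the text).
dress_forward = variance_from_counts([100, 1])

# "アドレス": assumed to be preceded by 20 different characters, 5 times each.
address_forward = variance_from_counts([5] * 20)

print(f"dress={dress_forward:.4f} address={address_forward:.4f}")
```

Under these assumed counts the forward value of 「ドレス」 is close to zero while that of 「アドレス」 is far larger, so a threshold on the variance values separates the two regardless of their raw appearance counts.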

[Brief description of the drawings]

FIG. 1 is a functional block diagram of a system according to the present invention.

FIG. 2 is a block diagram of a concrete embodiment of the system of the present invention.

FIG. 3 is an explanatory diagram of an example of the extracted character string storage table.

FIG. 4 is a flowchart of the character string extraction processing.

FIG. 5 is a flowchart of the matching character string extraction processing.

FIG. 6 is an explanatory diagram of an example of input text.

FIG. 7 is an explanatory diagram of the correspondence between the input text and character numbers.

FIG. 8 is an explanatory diagram of an example of the character string cutout processing.

FIG. 9 is an explanatory diagram of the contents of buffer X after the sort processing.

FIG. 10 is an explanatory diagram of the contents of buffer C after the character string extraction processing.

FIG. 11 is an explanatory diagram of the contents of buffer Y after totaling.

FIG. 12 is an explanatory diagram of the contents of the extracted character string storage table.

FIG. 13 is an explanatory diagram of the correspondence between the text rearranged in reverse order and character numbers.

FIG. 14 is an explanatory diagram of an example of the character string cutout processing.

FIG. 15 is an explanatory diagram of the contents of buffer X after the sort processing.

FIG. 16 is an explanatory diagram of the contents of buffer C after the character string extraction processing.

FIG. 17 is an explanatory diagram of the contents of buffer Y after totaling.

FIG. 18 is an explanatory diagram of the contents of the extracted character string storage table.

[Explanation of symbols]

1 Text
2 Text input unit
3 Continuous character string extraction unit
4 Appearance frequency calculation unit
5 Appearance frequency comparison unit
6 Storage device
7 Threshold value

Claims

[Claims]

1. A character string extraction method characterized in that an arbitrary continuous character string is extracted from text written in natural language, the frequency with which each character adjacent to the continuous character string appears together with the continuous character string is compared with a preset threshold value, and a character whose appearance frequency is higher than the threshold value is regarded as being used integrally with the continuous character string and is selected, together with it, as a character string to be recognized as a unit.

2. A character string extraction method characterized in that an arbitrary continuous character string is extracted from text written in natural language, the frequency with which each character immediately before or after the continuous character string appears together with the continuous character string is compared with a preset threshold value, and, when no character has an appearance frequency higher than the threshold value, the continuous character string is selected as a character string to be recognized as a unit in the text.

3. A character string extraction system comprising: a continuous character string extraction unit that extracts an arbitrary continuous character string from text written in natural language; an appearance frequency calculation unit that calculates, for each character adjacent to the continuous character string, the frequency with which that character appears together with the continuous character string; an appearance frequency comparison unit that compares the appearance frequency with a preset threshold value and selects a character whose appearance frequency is higher than the threshold value, as being used integrally with the continuous character string, as a character string to be recognized as a unit; and a storage device that registers and stores the selected continuous character string.