JPH11175525A - Database device for processing natural language

Info

Publication number: JPH11175525A
Application number: JP9339815A
Authority: JP
Inventors: Tokuji Ikeno; 篤司池野
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1997-12-10
Filing date: 1997-12-10
Publication date: 1999-07-02

Abstract

PROBLEM TO BE SOLVED: To allow a user to add or correct, in a database for natural language processing, a character string of the user's choosing and its frequency information. SOLUTION: The device has a database 6 for natural language processing that stores a plurality of pairs, each consisting of a partial character string of a predetermined number of characters appearing in natural language sentences and its absolute or relative frequency information. It also has user input means 1, which receives a character string to be reflected in the stored contents of the database together with an importance, and database updating means 7. For each of the one or more partial character strings of the predetermined number of characters that make up the input character string, the database updating means 7 adds the partial character string, paired with frequency information corresponding to the importance, when no entry for it exists in the database, and updates the frequency information of the entry according to the importance when an entry already exists.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates to a database device for natural language processing and can be applied, for example, to a morphological analyzer that automatically analyzes the morphemes of a sentence.

[0002]

DESCRIPTION OF THE RELATED ART

[Reference] Japanese Patent Laid-Open No. Hei 7-271792

A device that processes natural language sentences such as Japanese sentences (for example, a machine translation device, a question-answering device, or a computer-assisted education device) first performs morphological analysis on the natural language sentence.

[0003] A morphological analyzer generally consists of a morphological analysis unit (a morphological analysis program unit), a dictionary (word dictionary), an inflectional ending table, and a part-of-speech connection table. The morphological analysis unit analyzes the input text by referring to the dictionary, the inflectional ending table, the part-of-speech connection table, and so on.

[0004] In recent years, by contrast, research has begun on morphological analyzers that do not use a word dictionary. Instead they use a statistical database (a database for natural language processing) that stores appearance frequency information for tagged partial character strings, learned from a tagged corpus, that is, a large body of text data annotated with morpheme boundaries, the part-of-speech information of each morpheme, and the like (see Japanese Patent Laid-Open No. Hei 7-271792 and the specification and drawings of Japanese Patent Application No. Hei 9-68300).

[0005] This morphological analysis method uses chain probabilities (appearance frequency information) based on statistical data obtained from a corpus in place of the connection table that developers have traditionally built heuristically, so the grounds for deciding morpheme boundaries are clearer than in the conventional method. It is also said that analysis can proceed with high accuracy under a consistent criterion even when unknown words are present.

[0006]

PROBLEM TO BE SOLVED BY THE INVENTION

However, because the statistics-based morphological analyzer described above does not use a dictionary, the usual method of registering a word in a dictionary is unavailable. When a user wants to define a morpheme and have it reflected in the morphological analysis result, no means of doing so is provided.

[0007] There is therefore a need for a database device for natural language processing (a statistical database device) that can accept input of user-defined morpheme data.

[0008]

MEANS FOR SOLVING THE PROBLEM

To solve this problem, the present invention provides a database device for natural language processing having a database for natural language processing that is used by a natural language processing device main body and that stores a plurality of pairs, each consisting of a partial character string of a predetermined number of characters appearing in natural language sentences and its absolute or relative frequency information. The device is characterized by comprising:

(1) user input means for receiving, from a user, a character string to be reflected in the stored contents of the database for natural language processing, together with an importance; and

(2) database updating means that, for each of the one or more partial character strings of the predetermined number of characters making up the character string input by the user, adds the partial character string to the database paired with frequency information corresponding to the importance if no entry for it exists in the database, and updates the frequency information of that partial character string according to the importance if an entry already exists.

[0009]

EMBODIMENTS OF THE INVENTION

(A) First Embodiment

(A-1) Configuration of the First Embodiment

A first embodiment, in which the database device for natural language processing (statistical database device) according to the present invention is applied to a morphological analyzer, is described in detail below with reference to the drawings.

[0010] The morphological analyzer of the first embodiment is realized on an information processing device such as a workstation or personal computer equipped, as appropriate, with input/output devices, communication devices, external storage, and the like. Functionally, it can be represented by the functional block diagram of FIG. 1.

[0011] In FIG. 1, the morphological analyzer of the first embodiment consists of an input unit 1, a morphological analysis unit 2, an output unit 3, a tagged corpus (tagged corpus storage unit) 4, a chain probability calculation unit 5, a statistical database 6, a statistical database weight changing unit 7, and a user morpheme input unit 8.

[0012] The input unit 1, the morphological analysis unit 2, and the output unit 3 make up the morphological analyzer main body. The tagged corpus 4, the chain probability calculation unit 5, the statistical database 6, the statistical database weight changing unit 7, and the user morpheme input unit 8 make up the statistical database device. The tagged corpus 4 and the chain probability calculation unit 5 serve to build the statistical database 6, whereas morphological analysis only uses the statistical database 6 once it has been built. In this first embodiment, the tagged corpus 4 and the chain probability calculation unit 5 can therefore be omitted.

[0013] The input unit 1 receives a character string (natural language text) as input and passes it to the morphological analysis unit 2. The input unit 1 may be built from any means, such as a keyboard, a mouse, an OCR (optical character recognition) device, or a speech recognition device, or it may be configured as a means of receiving communication signals from outside over a communication medium such as a network.

[0014] The morphological analysis unit 2 performs morphological analysis on the input character string using the information in the statistical database 6. Although its detailed configuration is not illustrated, the morphological analysis unit 2 has an extended character string generation unit 2a, a score table 2b, a score calculation unit 2c, and an optimal path search unit 2d, with the functions described below.

[0015] The extended character string generation unit 2a refers to the statistical database 6 to generate the extended characters of the input character string and stores in the score table 2b, for the span from the beginning to the end of the input sentence, the paths (combinations) of extended character strings of N characters (N-grams). The score table 2b holds all N-gram extended character string paths from the beginning to the end of the input character string, together with the chain probability information of each path, obtained from the partial chain probability information stored in the statistical database 6. The score calculation unit 2c computes the chain probability information for each path stored in the score table 2b from the partial chain probability information stored in the statistical database 6. The optimal path search unit 2d selects, from the chain probability information computed by the score calculation unit 2c, the extended character string that satisfies an optimality condition (for example, the one giving the largest chain probability) as the optimal extended character string, that is, the morphological analysis result.
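The scoring and path selection described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: it assumes that a path's score is the product of the weights of its N-grams, that an N-gram missing from the database scores 0, and that padding uses the special extended character `("#", 1)`. The function name `best_segmentation` is hypothetical, and the exhaustive enumeration is only practical for very short inputs; a real optimal path search would more likely use dynamic programming.

```python
from itertools import product
from math import prod

PAD = ("#", 1)  # assumed special extended character used for padding


def best_segmentation(text, db, n=3):
    """Return (morphemes, score) for the highest-scoring boundary assignment.

    db maps N-gram tuples of extended characters (char, boundary_flag) to weights.
    Assumption: path score = product of N-gram weights; unseen N-grams score 0.
    """
    best_score, best_path = -1.0, []
    # Every character except the last may or may not be followed by a boundary.
    for flags in product((0, 1), repeat=max(len(text) - 1, 0)):
        path = list(zip(text, flags + (1,)))  # the last character always ends a morpheme
        padded = [PAD] * (n - 1) + path + [PAD] * (n - 1)
        score = prod(db.get(tuple(padded[i:i + n]), 0.0)
                     for i in range(len(padded) - n + 1))
        if score > best_score:
            best_score, best_path = score, path

    # Rebuild the morpheme list from the boundary flags of the best path.
    morphemes, current = [], ""
    for ch, flag in best_path:
        current += ch
        if flag:
            morphemes.append(current)
            current = ""
    return morphemes, best_score
```

With a 2-gram database in which the path for "ab" as one morpheme has the larger product, the sketch returns `["ab"]`; setting one N-gram on that path to 0 makes the competing split win, which is why a zero weight behaves as a hard prohibition under this scoring assumption.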

[0016] The output unit 3 receives the morpheme sequence produced by the morphological analysis unit 2 and outputs it. Various display, printing, and communication means are examples.

[0017] The tagged corpus 4 is a large body of text data carrying tags for morpheme boundaries (and the part-of-speech information of each morpheme). FIG. 2 shows example data from the tagged corpus 4. In the example of FIG. 2, morpheme boundaries are marked with a slash (/), and the part of speech, conjugation type, and conjugation form are listed, separated by commas, inside square brackets. The tagged corpus 4 may also carry tags for morpheme boundaries only.
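As a rough illustration of the corpus format just described, the following Python sketch parses one tagged line. FIG. 2 is not reproduced here, so the exact layout is an assumption: the sketch supposes that each morpheme is followed by an optional `[part of speech,conjugation type,conjugation form]` bracket and that morphemes are separated by `/`. The function name `parse_tagged_line` and the sample tags are made up for the example.

```python
import re

# Assumed token layout: surface text, optionally followed by [tag1,tag2,...].
TOKEN = re.compile(r"([^\[\]/]+)(?:\[([^\]]*)\])?")


def parse_tagged_line(line):
    """Split a tagged corpus line into (surface, tags) pairs."""
    result = []
    for token in line.split("/"):
        if not token:  # ignore empty pieces, e.g. from a trailing slash
            continue
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        surface, tags = match.group(1), match.group(2)
        result.append((surface, tags.split(",") if tags else []))
    return result
```

A boundary-only corpus is handled by the same function, since the bracket is optional: `parse_tagged_line("ab/c")` yields two morphemes with empty tag lists.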

[0018] The chain probability calculation unit 5 processes the text data held in the tagged corpus 4 and creates the statistical database 6 (in Japanese Patent Laid-Open No. Hei 7-271792 this is called the word model estimating means or the part-of-speech tagging model estimating means). Specifically, it computes, for example, the chain probability information of N-gram extended character strings (extended character strings that carry only morpheme boundaries, or extended character strings that carry morpheme boundaries and parts of speech).
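One plausible reading of this calculation is sketched below in Python. It assumes that the chain probability of an N-gram is the conditional relative frequency of its last extended character given the first N-1, which is consistent with the later remark that weights sharing the same first N-1 extended characters sum to a fixed value. The function name `chain_probabilities` and the representation of extended characters as `(char, boundary_flag)` tuples are assumptions of the sketch.

```python
from collections import Counter


def chain_probabilities(sequences, n=3):
    """Estimate P(last extended char | first n-1 extended chars) from padded sequences.

    Each sequence is a list of extended characters (char, boundary_flag),
    already padded with special characters at both ends.
    """
    ngram_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1  # count of the leading n-1 characters
    return {gram: count / context_counts[gram[:-1]]
            for gram, count in ngram_counts.items()}
```

The returned dictionary plays the role of the statistical database: keys are N-gram extended character strings and values are their weights.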

[0019] The statistical database 6 is basically a database of the results computed by the chain probability calculation unit 5. In this first embodiment, the stored contents of the statistical database 6 can also be changed by the statistical database weight changing unit 7.

[0020] The statistical database weight changing unit 7 receives from the user morpheme input unit 8 the morpheme information that is to influence the morphological analysis result (data for specific N-gram extended character strings) and modifies the data for that morpheme information (those N-gram extended character strings) in the statistical database 6. Specifically, it changes, for example, the chain probability information of the indicated N-gram extended character strings.

[0021] The user morpheme input unit 8 accepts from the user the morpheme information that is to influence the morphological analysis result, organizes it, and sends it to the statistical database weight changing unit 7. In this first embodiment, the user morpheme input unit 8 also accepts an importance for the change and sends it along with the morpheme information. For example, a user who wants the morpheme to be reflected in the analysis result without fail enters an importance of 0.99, and a user who wants it reflected with as little effect on other parts as possible enters an importance of 0.3.

[0022]

(A-2) Operation of the First Embodiment

The operation in which the input unit 1 accepts an input character string, the morphological analysis unit 2 performs morphological analysis using the contents of the statistical database 6, and the output unit 3 outputs the morpheme sequence is the same as in a conventional morphological analyzer that uses the statistical database 6, so its description is omitted.

[0023] The operation in which the chain probability calculation unit 5 processes the tagged corpus 4 and creates the statistical database 6 is also the same as in the conventional device, so its description is omitted.

[0024] For example, the method described in Makoto Nagao and Shinsuke Mori, "How to build n-gram statistics for large-scale Japanese text and automatic extraction of words and phrases," Information Processing Society of Japan Research Report, Natural Language Processing 96-1, July 1993, can be applied.

[0025] The following therefore describes, with reference to FIG. 3, the operation in which the user morpheme input unit 8 accepts morpheme information from the user and the data in the statistical database 6 is modified through the statistical database weight changing unit 7.

[0026] In the following description, the statistical database 6 is assumed to concern the character boundaries of N-gram characters. The data in the tagged corpus 4 is assumed to carry only morpheme boundaries as information (no part-of-speech or similar information).

[0027] First, a morpheme information input screen (see FIG. 4) is presented to the user, and the input of morpheme information is accepted (step 301).

[0028] In the screen of FIG. 4, the parts enclosed in square brackets [ ] are the parts the user enters. Morpheme information is entered as a character string with morpheme delimiter symbols inserted; the user is taught the input method in advance through a manual or the like. The input means that each substring separated by a morpheme delimiter is a morpheme. If the entered string contains no delimiter, the whole string is treated as a single morpheme. The importance is entered as a number between 0.0 and 1.0. A value of 0.99 (as close to 1 as possible) means that each entered morpheme is reliably reflected, and the smaller the value, the less likely the morphemes are to be reflected. The value expresses how much influence on other morphemes, caused by cutting out the given morpheme, the user is willing to accept.

[0029] Next, the statistical database weight changing unit 7, having received from the user morpheme input unit 8 the morpheme sequence and the importance entered by the user, converts the morpheme sequence into an extended character string (step 302). FIG. 5 shows an example of an extended character string. In FIG. 5, the character こ (ko) in the extended characters <こ,0> and <こ,1> is the character of the morpheme sequence itself, and the digits 0 and 1 indicate, respectively, that the position after that character is not, or is, a morpheme boundary. <#,1> is a special extended character. If the statistical database 6 holds chain probability information per N-gram extended character string, N-1 of these special characters are added at the head and N-1 at the tail of the entered morpheme sequence.
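The conversion of step 302 might look like the following Python sketch. The delimiter symbol is not specified in the text, so `/` is assumed here, and extended characters are represented as `(char, boundary_flag)` tuples; the function name `to_extended_string` is hypothetical.

```python
PAD = ("#", 1)  # the special extended character <#,1>


def to_extended_string(text, n=3, delimiter="/"):
    """Convert delimited morphemes into a padded list of extended characters.

    The flag is 1 when a morpheme boundary follows the character, otherwise 0.
    N-1 special characters are added at the head and at the tail.
    """
    extended = []
    for morpheme in text.split(delimiter):
        for i, ch in enumerate(morpheme):
            is_last = i == len(morpheme) - 1
            extended.append((ch, 1 if is_last else 0))
    return [PAD] * (n - 1) + extended + [PAD] * (n - 1)
```

An input without any delimiter is treated as a single morpheme, so only its final character receives the flag 1.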

[0030] Next, the position pointer is set to the head of the extended character string (step 303), and a check is made as to whether N extended characters (for an N-gram) can be extracted starting from that position (step 304). For example, with N = 3 and the extended character string of FIG. 5, three extended characters can be extracted while the pointer lies anywhere from the first <#,1> up to <ぐ,1>. Extraction first becomes impossible when the pointer reaches the <#,1> immediately after <ぐ,1>.
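The termination condition of step 304 can be checked with a short Python sketch. It assumes extended characters are `(char, boundary_flag)` tuples and that the string is already padded with N-1 special characters on each side; `extract_ngrams` is a hypothetical name.

```python
def extract_ngrams(extended, n=3):
    """Slide a pointer over the extended string and collect every full N-gram."""
    grams = []
    pos = 0                               # step 303: pointer at the head
    while pos + n <= len(extended):       # step 304: can N characters be extracted?
        grams.append(tuple(extended[pos:pos + n]))
        pos += 1                          # move the pointer by one character
    return grams
```

For a string with three real characters and N = 3, the padded length is 7 and five 3-grams are produced; the last one begins at the final real character, and the pointer then lands on the first trailing special character, where extraction fails.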

[0031] If the check in step 304 finds that extraction is possible, processing proceeds to step 305; if not, the series of processing ends.

[0032] In step 305, the N extended characters are actually extracted, and the statistical database 6 is searched to see whether the extracted extended character string exists among its keys (step 306).

[0033] FIG. 6 shows an example in which the statistical database 6 is organized as a table. As shown in FIG. 6, each row of the statistical database 6 consists of a key, which is an N-gram (here 3-gram) extended character string, and the value of its chain probability information (either the probability itself or the probability multiplied by a uniform constant; referred to below as the weight). No key appears more than once.

[0034] If the key does not exist in the statistical database 6 (negative result in step 306), a weight is created with reference to the importance accepted when the morphemes were entered (step 310). Various methods of creating the value are conceivable; the simplest is to use the importance itself as the weight. A row with that key is then added to the statistical database 6 and the weight is registered (step 311).

[0035] If, on the other hand, the check in step 306 finds that the extracted data already exists in the statistical database 6, the current weight is read from the statistical database 6 (step 307). A new weight is then calculated with reference to the current weight and the importance accepted when the morphemes were entered (step 308), and the weight of that key (N-gram extended character string) is changed to the new weight (step 309).

[0036] Various methods of calculating the new weight are also conceivable; one example is the calculation shown in equation (1) below.

[0037]

(new weight) = (current weight) + {1.0 - (current weight)} * (importance) ... (1)

With the calculation of equation (1), the new weight is very close to 1.0 when the importance is at its maximum, and even when the importance is small the new weight is certain to be larger than the current weight, if only slightly.
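Equation (1) can be tried out with a few lines of Python. The function name `new_weight` is chosen for this sketch only. Rearranging the equation gives 1 - (new weight) = {1 - (current weight)} * {1 - (importance)}, so each update shrinks the remaining distance to 1.0 by the factor (1 - importance).

```python
def new_weight(current, importance):
    """Equation (1): move the weight toward 1.0 by the fraction given by the importance."""
    return current + (1.0 - current) * importance


# Worked values for a current weight of 0.2:
high = new_weight(0.2, 0.99)  # 0.2 + 0.8 * 0.99 = 0.992
low = new_weight(0.2, 0.3)    # 0.2 + 0.8 * 0.3  = 0.44
```

Both results exceed the current weight of 0.2, and the higher importance lands much closer to 1.0, matching the behavior described for the equation.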

[0038] When the addition or modification of the stored contents of the statistical database 6 has been completed for the N-gram extended character string determined by the current position pointer, the pointer is advanced by one character (step 312) and processing returns to step 304.
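Putting steps 303 to 312 together, the update loop could be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the statistical database is modeled as a plain dictionary from N-gram tuples to weights, a new entry takes the importance itself as its weight (the simplest option mentioned in the text), and an existing entry is updated with equation (1). The function name `apply_user_morphemes` is hypothetical.

```python
def apply_user_morphemes(db, extended, importance, n=3):
    """Add or update the weight of every N-gram in a padded extended string."""
    pos = 0                                   # step 303: pointer at the head
    while pos + n <= len(extended):           # step 304: can an N-gram be extracted?
        gram = tuple(extended[pos:pos + n])   # step 305: extract N extended characters
        if gram not in db:                    # step 306: key lookup
            db[gram] = importance             # steps 310-311: create weight, add a row
        else:
            current = db[gram]                # step 307: read the current weight
            db[gram] = current + (1.0 - current) * importance  # steps 308-309
        pos += 1                              # step 312: advance by one character
    return db
```

Because the database is updated in place, calling the function repeatedly with the same input keeps pushing the affected weights toward 1.0.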

[0039]

(A-3) Effects of the First Embodiment

According to the first embodiment, the user can add to or correct in the statistical database 6 the morphemes (extended character strings) of the user's choosing and their weights. As a result, the desired morpheme sequence information can be reflected in the morphological analysis result. In other words, the user can change the statistical database 6 through a simple input operation so that the desired analysis result is obtained, which makes the device easier to use than before.

[0040]

(A-4) Modifications of the First Embodiment

The statistical database 6 may have any structure that supports the addition and modification of entries; it is not limited to a table. Nor are its contents limited to N-gram character strings; they may be data on sequences of N parts of speech, as described in Japanese Patent Laid-Open No. Hei 7-271792.

[0041] If morpheme information is entered in a format that also carries part-of-speech information, as shown in FIG. 2, entries can be added to and modified in a statistical database 6 that stores N-gram extended character strings carrying part-of-speech information as well as morpheme boundary information.

[0042] The importance may also be allowed to take negative values. In that case the value in the table of the statistical database 6 becomes smaller than it currently is, which works to suppress segmentation into the morpheme in question.
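How a negative importance behaves depends on the update rule, which the text leaves open. As a sketch only, the Python below applies the rule new = current + (1.0 - current) * importance to a negative importance and clamps the result at a lower bound of 0.0. The clamp and the function name `lowered_weight` are assumptions of this example, added because a sufficiently negative importance would otherwise push the weight below zero.

```python
def lowered_weight(current, importance, floor=0.0):
    """Apply the update rule with a possibly negative importance, clamped at `floor`."""
    candidate = current + (1.0 - current) * importance
    return max(floor, candidate)  # assumed safeguard against negative weights
```

For a current weight of 0.5 and an importance of -0.3, the result is 0.5 - 0.5 * 0.3 = 0.35, so the entry becomes less likely to be chosen during analysis.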

[0043] In the first embodiment, the user assigns a single importance to the whole morpheme sequence, but the importance may instead be specified for each N-gram group of extended characters. The device may also query the user for the importance of each N-gram group of extended characters when the weights are calculated and process the input interactively.

[0044] When specification per N-gram group of extended characters is allowed in this way, the user may be allowed to specify the weight (chain probability information) itself rather than an importance. In that case it is preferable to display the existing weight (chain probability information) and let the user specify the correction. When the weight itself is specified, it is also preferable to allow a weight of 0. An N-gram extended character string with a weight of 0 will never be reflected in the morphological analysis result; in other words, an explicit prohibition has been imposed.

[0045] In the first embodiment, the weight (chain probability information) is corrected only for the N-gram extended character strings obtained from the morpheme information entered by the user, but the weights of other N-gram extended character strings may be corrected along with it. For example, in a statistical database 6 that stores information on N-gram extended character strings, the sum of the weights (chain probability information) of the N-gram extended character strings that share the same first N-1 extended characters is generally arranged to equal a predetermined value (for example, 1). The weights of the other N-gram extended character strings may be corrected so that this condition is preserved. Such a correction can be carried out by converting the weights (chain probability information) to frequency values, correcting those, and converting them back to weights.
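The sum-to-one condition can be restored with a simple rescaling, sketched below in Python. This is a simplification of the frequency-based correction mentioned above: instead of converting to counts and back, it divides each weight by the total of its group, where a group is all N-grams sharing the same first N-1 extended characters. The function name `renormalize` and the dictionary representation of the database are assumptions of the sketch.

```python
from collections import defaultdict


def renormalize(db, total=1.0):
    """Rescale weights so each group sharing the first N-1 characters sums to `total`."""
    group_sums = defaultdict(float)
    for gram, weight in db.items():
        group_sums[gram[:-1]] += weight
    rescaled = {}
    for gram, weight in db.items():
        group_sum = group_sums[gram[:-1]]
        # Leave all-zero groups untouched to avoid dividing by zero.
        rescaled[gram] = weight * total / group_sum if group_sum > 0 else weight
    return rescaled
```

Rescaling preserves the ratios between the weights inside a group, so an entry raised by the user still gains relative to its competitors after normalization.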

[0046] The technical idea of the first embodiment can also be applied to a statistical database 6 that manages its weights as frequencies.

[0047]

(B) Second Embodiment

(B-1) Configuration of the Second Embodiment

A second embodiment, in which the database device for natural language processing (statistical database device) according to the present invention is applied to a morphological analyzer, is described in detail below with reference to the drawings.

[0048] The morphological analyzer of the second embodiment is likewise realized on an information processing device such as a workstation or personal computer equipped, as appropriate, with input/output devices, communication devices, external storage, and the like. Functionally, it can be represented by the functional block diagram of FIG. 7. In FIG. 7, parts identical or corresponding to those in FIG. 1 are given the same reference numerals.

[0049] In FIG. 7, the morphological analyzer of the second embodiment consists of an input unit 1, a morphological analysis unit 2, an output unit 3, a tagged corpus 4, a chain probability calculation unit 5, a statistical database 6, a user morpheme input unit 8, and a corpus adding unit 9. In the second embodiment, the chain probability calculation unit 5, the statistical database 6, the user morpheme input unit 8, and the corpus adding unit 9 make up the statistical database device (database device for natural language processing).

[0050] In FIG. 7, the input unit 1, the morphological analysis unit 2, the output unit 3, the tagged corpus 4, the chain probability calculation unit 5, the statistical database 6, and the user morpheme input unit 8 have the same functions as in the first embodiment, and a description of those functions is omitted.

[0051] The corpus adding unit 9 creates copies of the morpheme string information sent from the user morpheme input unit 8, in a number that depends on the importance, and adds them to the tagged corpus 4. The corpus adding unit 9 is given an initial value for the data size of the tagged corpus 4 (for example, the number of sentences, morphemes, or characters) and updates the size value it holds whenever it adds data to the tagged corpus 4. After adding data to the tagged corpus 4, the corpus adding unit 9 also instructs the chain probability calculation unit 5 to recalculate.

[0052] (B-2) Operation of the Second Embodiment
The operation differs from that of the conventional device only in the portions related to the user morpheme input unit 8 and the corpus adding unit 9. The following therefore describes, with reference to the flowchart of FIG. 8, the operation from the point at which the user morpheme input unit 8 accepts morpheme information from the user, through the addition of data to the tagged corpus 4 by the corpus adding unit 9, up to the point at which a recalculation command is given to the chain probability calculation unit 5. Note that the tagged corpus 4 of the second embodiment is not in the format shown in FIG. 2; it is assumed to be that format with the part-of-speech information (including conjugation type and conjugated form) removed.

[0053] First, a morpheme input screen (see FIG. 4 above) is presented to the user, and input of morpheme information is accepted (step 801). This step is the same as in the first embodiment (step 301 in FIG. 3).

[0054] Next, the amount to be added (here, a number of sentences) is determined from the existing corpus size and the importance (step 802). The corpus size data is assumed to be held by the corpus adding unit 9. For example, the amount to add is determined by equation (2) below. A different equation may be selected and applied depending on the corpus size so far.

[0055]
Additional amount (number of sentences) = corpus size (number of sentences) × 0.01 × importance (fractions below the decimal point are rounded up) ... (2)

Next, the input morpheme string information is duplicated as many times as the number of additional sentences determined in step 802, and the copies are added to the tagged corpus 4 (step 803). At this time, the corpus size is updated by the number of sentences added (and, where applicable, by other measures such as the number of morphemes).
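Steps 802 and 803 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the importance is assumed to be a positive integer, and the tagged corpus is modelled as a plain list of morpheme strings.

```python
def additional_sentence_count(corpus_sentences, importance):
    """Equation (2): corpus size x 0.01 x importance, fractions rounded up."""
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-(corpus_sentences * importance) // 100)


def add_to_corpus(corpus, corpus_size, morpheme_string, importance):
    """Step 803: append copies of the input and return the updated corpus size."""
    n = additional_sentence_count(corpus_size, importance)
    corpus.extend([morpheme_string] * n)
    return corpus_size + n
```

With a corpus of 250 sentences and an importance of 1, equation (2) gives 2.5, which is rounded up to 3 copies.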

[0056] The corpus adding unit 9 then notifies the chain probability calculation unit 5 that the tagged corpus 4 has been updated and sends it a recalculation command (step 804). The chain probability calculation unit 5 recalculates using the updated tagged corpus 4, and the statistical database 6 is updated as a result.
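The effect of recalculating after duplication can be illustrated with a small sketch. The actual calculation performed by the chain probability calculation unit 5 is not reproduced here; as a stand-in, the hypothetical function below simply stores the relative frequency of every window of `n` characters (delimiters included), and `/` is used as an assumed delimiter.

```python
from collections import Counter


def rebuild_statistics(corpus, n=3):
    """Recount all n-character windows and return their relative frequencies."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


corpus = ["ab/c", "a/bc"]
before = rebuild_statistics(corpus, n=2)
corpus.extend(["a/bc"] * 2)          # copies added in step 803
after = rebuild_statistics(corpus, n=2)
```

In this toy corpus the window `a/` rises from one sixth to one quarter of all windows, which is the sense in which the duplicated input is treated as having appeared in the corpus several times.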

[0057] When morphological analysis is performed using the updated statistical database 6, an analysis result that reflects the user's input is obtained.

[0058] (B-3) Effects of the Second Embodiment
With the second embodiment as well, the user can add or modify a desired morpheme (extended character string) and its weight value in the statistical database 6, so that the desired morpheme string information is reflected in the morphological analysis result. In other words, with a simple input operation the user can change the statistical database 6 so that the desired analysis result is obtained, and the usability of the device is improved over the related art.

[0059] The second embodiment reflects the user's wishes in the statistical database 6 by using the same method with which the statistical database 6 is created, so adverse effects on other N-gram extended character strings are expected to be small. That is, the sentence (morpheme string) input by the user only needs to be regarded as having appeared many times in the corpus. Because little arbitrary manipulation is involved (that is, human error is unlikely to creep in), the risk of reflecting user information is small, and the statistical database 6 can be adjusted gently.

[0060] On the other hand, in the second embodiment the degree to which user information is reflected is small, and the chain probability information of the N-gram extended character strings related to a specified morpheme cannot be adjusted downward. In these respects the first embodiment is preferable.

[0061] (B-4) Modifications of the Second Embodiment
The statistical database 6 may have any structure that allows items to be added and modified, and is not limited to a table structure. Its contents are also not limited to N-gram character strings; for example, it may hold data on sequences of N parts of speech, as described in JP-A-7-271792.

[0062] If morpheme information is input in a format that also includes part-of-speech information, as shown in FIG. 2, items can also be added to and modified in a statistical database 6 whose N-gram extended character strings carry part-of-speech information in addition to morpheme delimiter information.

[0063] Furthermore, the technical concept of the second embodiment can also be applied to a statistical database 6 that manages weight values as frequencies.

[0064] (C) Third Embodiment
(C-1) Configuration of the Third Embodiment
A third embodiment, in which the database device for natural language processing (statistical database device) according to the present invention is applied to a morphological analyzer, is described below in detail with reference to the drawings.

[0065] The morphological analyzer of the third embodiment is also realized by an information processing device such as a workstation or a personal computer that has an input/output device, a communication device, an external storage device, and the like as appropriate. Functionally, it can be represented by the functional block diagram of FIG. 9. In FIG. 9, parts that are the same as or correspond to those in FIG. 1 described above are denoted by the same reference numerals.

[0066] In FIG. 9, the morphological analyzer of the third embodiment comprises an input unit 1, a morphological analysis unit 2, an output unit 3, a tagged corpus 4, a chain probability calculation unit 5, a statistical database 6, a statistical database weight changing unit 7, a user morpheme input unit 8, a sentence example search unit 10, a sentence example output unit 11, and an index (index storage unit) 12 of the tagged corpus. In the third embodiment, the chain probability calculation unit 5, the statistical database 6, the statistical database weight changing unit 7, the user morpheme input unit 8, the sentence example search unit 10, the sentence example output unit 11, and the index 12 constitute the statistical database device (database device for natural language processing).

[0067] In FIG. 9, the input unit 1, the morphological analysis unit 2, the output unit 3, the tagged corpus 4, the chain probability calculation unit 5, the statistical database 6, and the statistical database weight changing unit 7 have the same functions as in the first embodiment, and a description of those functions is omitted.

[0068] The data held by the tagged corpus 4 is the same as in the first embodiment, but an index 12 of the stored data is prepared separately. The index 12 of the tagged corpus 4 holds, as headings, the data obtained by removing the tags (delimiters and information such as part of speech and conjugated form) from the data held in the tagged corpus 4 (plain data). Each heading is given a pointer to the data in the tagged corpus 4 from which it was derived.
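The relationship between the tagged corpus 4 and the index 12 can be sketched as follows. This is a simplified illustration, not the stored format: it assumes the only tag is the full-width delimiter `／`, models the corpus as a list, and uses the list position as the pointer. The names `strip_tags` and `build_index` are hypothetical.

```python
def strip_tags(tagged_sentence, delimiter="／"):
    """Remove the delimiter tags to obtain the plain data used as a heading."""
    return tagged_sentence.replace(delimiter, "")


def build_index(tagged_corpus):
    """Return (heading, pointer) pairs, one per entry of the tagged corpus."""
    return [(strip_tags(sentence), pointer)
            for pointer, sentence in enumerate(tagged_corpus)]


tagged_corpus = ["ここ／で／はきもの／を／ぬぐ", "きもの／を／きる"]
index = build_index(tagged_corpus)
```

A corpus that also carries part-of-speech tags would need a richer `strip_tags`, which is outside what this sketch assumes.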

[0069] In addition to its operation in the first embodiment, the user morpheme input unit 8 also sends the accepted user morpheme string to the sentence example search unit 10.

[0070] The sentence example search unit 10 receives the user morpheme string from the user morpheme input unit 8, removes the delimiters, and from the resulting character string creates partial character strings of a fixed number of characters (M characters), as many as possible. It then sends each partial character string to the index 12 and, if any heading (plain data) in the index 12 contains that partial character string, obtains the pointer to the original data. It further sends the pointer to the tagged corpus 4, obtains the corresponding data (a sentence example, that is, a tagged morpheme string), and passes it to the sentence example output unit 11.

[0071] The sentence example output unit 11 outputs the morpheme strings (sentence examples) sent from the sentence example search unit 10.

[0072] (C-2) Operation of the Third Embodiment
The operation of the user morpheme input unit 8, the sentence example search unit 10, and the sentence example output unit 11, which perform the operations characteristic of the third embodiment, is described below with reference to the flowchart of FIG. 10.

[0073] First, a morpheme input screen (see FIG. 11) is presented to the user and morpheme input is accepted (step 1001). It is then determined whether the button pressed at the time of input was the sentence example search button or the learning button (step 1002).

[0074] As shown in FIG. 11, the morpheme input screen of the third embodiment displays not only input fields for the morpheme input and the importance input but also a "learning button" and a "sentence example search button", and the user is asked to specify the operation mode with these buttons. A display field for showing the sentence examples obtained by the search started with the "sentence example search button" is also provided in advance.

[0075] When the learning button is pressed, the statistical database weight changing unit 7 receives the user morpheme string and operates in the same way as in the first embodiment, so the processing shown in FIG. 10 ends. When the processing of FIG. 10 ends, the flow proceeds to step 302 and the subsequent steps of FIG. 3 described above.

[0076] If, on the other hand, the pressed button is the sentence example search button, the sentence example search unit 10 receives only the user morpheme string (the importance data is not needed) and removes the tags (here, only the delimiters) to create plain data (a character string) (step 1003). For example, if the input morpheme string is 「ここ／では／きもの／を／ぬぐ」, the plain data is 「ここではきものをぬぐ」, with all delimiters removed.

[0077] Next, partial character strings are created from the plain data (step 1004). The number of characters used for this processing is assumed to be set in advance (here, three characters; this number need not be related to the N of the N-gram extended character strings), and every partial character string of that length is cut out of the plain data. The number of characters may be made changeable, for example by specifying it at system start-up or by accepting input from the user. With the plain data of the example above, a total of eight partial character strings are created: 「ここで」, 「こでは」, 「ではき」, 「はきも」, 「きもの」, 「ものを」, 「のをぬ」, and 「をぬぐ」. If the plain data has fewer characters than the set number, the plain data itself is used as the partial character string.
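Steps 1003 and 1004 amount to stripping the delimiters and sliding a fixed-width window over the result. A minimal Python sketch, with hypothetical function names and the full-width `／` assumed as the only delimiter:

```python
def to_plain_data(morpheme_string, delimiter="／"):
    """Step 1003: remove the delimiters from the user morpheme string."""
    return morpheme_string.replace(delimiter, "")


def partial_strings(plain, m=3):
    """Step 1004: cut out every m-character partial character string."""
    if len(plain) < m:
        # Too short for even one window: use the plain data itself.
        return [plain]
    return [plain[i:i + m] for i in range(len(plain) - m + 1)]


plain = to_plain_data("ここ／では／きもの／を／ぬぐ")
parts = partial_strings(plain, m=3)
```

For the ten-character example string and `m=3` this yields the eight partial character strings listed in the text.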

[0078] Next, the first of the unprocessed partial character strings is set as the processing target (step 1005). The index is checked for headings that contain that partial character string, and the sentence example search unit 10 receives all the pointers to the original data of the matching headings (step 1006).

[0079] A processed flag is then attached to the partial character string being processed (step 1007), and it is checked whether any unprocessed partial character strings remain (step 1008).

[0080] If an unprocessed partial character string remains, the flow returns to step 1005. By repeating the processing loop of steps 1005 to 1008, the unprocessed partial character strings are eventually exhausted. When all partial character strings have been processed in this way, duplicates among the received pointers to the original data are removed (step 1009). This is necessary because, when a piece of plain data contains several of the partial character strings, the same original data is obtained once for each matching partial character string.

[0081] Next, the data (sentence examples) in the tagged corpus 4 is obtained by following the pointers (step 1010). The sentence examples are then sent to the sentence example output unit 11 and output to the screen (step 1011), and the series of processes ends.
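The search flow of steps 1003 to 1011 can be put together in one short sketch. It is an illustration under assumptions rather than the device's implementation: the index is modelled as a list of (heading, pointer) pairs, pointers are list positions in the tagged corpus, `／` is the only tag, and the function name is hypothetical.

```python
def search_sentence_examples(user_input, tagged_corpus, index, m=3, delimiter="／"):
    """Return tagged sentence examples that share an m-character string with the input."""
    plain = user_input.replace(delimiter, "")                       # step 1003
    if len(plain) < m:                                              # step 1004
        parts = [plain]
    else:
        parts = [plain[i:i + m] for i in range(len(plain) - m + 1)]
    pointers = []
    for part in parts:                                              # steps 1005-1008
        for heading, pointer in index:                              # step 1006
            if part in heading:
                pointers.append(pointer)
    unique = list(dict.fromkeys(pointers))                          # step 1009
    return [tagged_corpus[p] for p in unique]                       # step 1010


tagged_corpus = ["ここ／で／はきもの／を／ぬぐ", "きょう／は／はれ", "きもの／を／きる"]
index = [(s.replace("／", ""), i) for i, s in enumerate(tagged_corpus)]
examples = search_sentence_examples("ここ／では／きもの／を／ぬぐ", tagged_corpus, index)
```

Here the first corpus sentence matches every partial character string and the third matches 「きもの」 and 「ものを」, so each would be collected several times without the de-duplication of step 1009. `dict.fromkeys` is used because it removes duplicates while keeping first-seen order.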

[0082] FIG. 12 shows an example of the output screen. This output screen corresponds to the user morpheme string input screen shown in FIG. 11, and the search results are output below the "sentence example search button".

[0083] After looking at this output screen, the user can change the importance and press the "learning button" to move on to the operation of adding to or changing the statistical database 6, as in the first embodiment.

[0084] In other words, when the user adds or modifies a desired morpheme and its weight value in the statistical database 6, searching for and displaying sentence examples that contain the same character strings lets the user estimate which other sentences the input will affect during analysis. Because the database is statistical, effects on other sentences cannot be avoided entirely, but after looking at the sentence examples the user can change the input morpheme string or the importance.

[0085] The tagged corpus 4 that is searched for sentence examples is the source of the statistical database 6 used for analysis, and therefore indirectly represents the morphological analysis results that will be obtained. For example, if the tags were removed from those sentences and the sentences were actually analyzed, the result would return to the sentences of the tagged corpus 4, which hold morpheme delimiter information as tags. In that sense, the estimation described above can be performed efficiently. A completely different corpus (tagged or untagged) may also be used as the search target, but in that case, although an estimate can be made by referring to it, the direct effect on the current analysis cannot be known, so the benefit is somewhat reduced.

[0086] (C-3) Effects of the Third Embodiment
With the third embodiment as well, the user can add or modify a desired morpheme (extended character string) and its weight value in the statistical database 6, so that the desired morpheme string information is reflected in the morphological analysis result. In other words, with a simple input operation the user can change the statistical database 6 so that the desired analysis result is obtained, and the usability of the device is improved over the related art.

[0087] In addition, because the third embodiment provides a sentence example search unit that searches for sentence examples containing the same partial character strings as the morpheme string input by the user, the user can estimate the effect of a change before changing the statistical database and adjust the input accordingly.

[0088] (C-4) Modifications of the Third Embodiment
Because the third embodiment basically includes the configuration of the first embodiment, the modifications described above for the first embodiment also apply as modifications of the third embodiment.

[0089] The third embodiment adds the sentence example search unit 10 and the sentence example output unit 11 to the configuration of the first embodiment, but the device may instead be configured by adding the sentence example search unit 10 and the sentence example output unit 11 to the configuration of the second embodiment.

[0090] Furthermore, in the third embodiment the sentence example search unit 10 uses the index 12 to search for sentence examples, but the index 12 may be omitted and the sentence example search unit 10 may search the tagged corpus 4 directly. In this case, the sentence example search unit 10 does not need to remove the tags from the morpheme string input by the user.

[0091] (D) Other Embodiments
Various modifications have already been mentioned in the description of each of the above embodiments. The following further modifications are also possible.

[0092] In each of the above embodiments, the user inputs the morpheme string information to be reflected each time. Alternatively, the morpheme string information output from the output unit 3 may be used, as it is or after partial correction, as the user-input morpheme string information.

[0093] In each of the above embodiments the target natural language is Japanese, but the present invention can also be applied to a statistical database device (database device for natural language processing) that stores character strings of a predetermined number of characters in another language.

[0094] Furthermore, in each of the above embodiments the statistical database device of the present invention is applied to a morphological analyzer, but the processing that uses the statistical database device of the present invention is not limited to morphological analysis. For example, the statistical database device of the present invention can be applied as the statistical database device of a natural language processing apparatus that, when part of a natural language text received by a communication means has become an unknown word because of a burst error, estimates the correct character string for the unknown portion by using the contents of the statistical database. Depending on the use of the statistical database device of the present invention, the statistical database may store only character strings that carry no morpheme delimiter information, part-of-speech information, or the like.

[0095] Note that the term "character string" in the claims means both character strings that include morpheme delimiter information, part-of-speech information, and the like, and character strings that do not.

【００９６】[0096]

[Effects of the Invention] As described above, according to the present invention, a natural language processing database device has a natural language processing database that stores a plurality of sets, each consisting of a partial character string of a predetermined number of characters appearing in natural language sentences and its absolute or relative frequency information, and that is used by the main body of a natural language processing apparatus. The device comprises user input means for receiving a character string and an importance, input by the user, that are to be reflected in the stored content of the natural language processing database, and database updating means. If an item for one or more of the partial character strings of the predetermined number of characters that make up the input character string is not in the natural language processing database, the database updating means adds that partial character string to the database as a set with frequency information according to the importance; if the item is in the database, it updates the frequency information of that partial character string according to the importance. The user can therefore add or modify a desired character string and its frequency information in the natural language processing database, the desired character string information can be reflected in the natural language processing result, and the usability of the device is improved over the related art.

[Brief Description of the Drawings]

FIG. 1 is a functional block diagram showing the configuration of the first embodiment.

FIG. 2 is an explanatory diagram showing an example of the tagged corpus.

FIG. 3 is a flowchart showing the characteristic operation of the first embodiment.

FIG. 4 is an explanatory diagram showing an example of the morpheme input screen of the first embodiment.

FIG. 5 is an explanatory diagram showing an example of an extended character string.

FIG. 6 is an explanatory diagram showing an example of the statistical database.

FIG. 7 is a functional block diagram showing the configuration of the second embodiment.

FIG. 8 is a flowchart showing the characteristic operation of the second embodiment.

FIG. 9 is a functional block diagram showing the configuration of the third embodiment.

FIG. 10 is a flowchart showing the characteristic operation of the third embodiment.

FIG. 11 is an explanatory diagram showing an example of the morpheme input screen of the third embodiment.

FIG. 12 is an explanatory diagram showing an example of the sentence example search output screen of the third embodiment.

[Explanation of symbols]

1: input unit, 2: morphological analysis unit, 3: output unit, 4: tagged corpus, 5: chain probability calculation unit, 6: statistical database, 7: statistical database weight changing unit, 8: user morpheme input unit, 9: corpus adding unit, 10: sentence example search unit, 11: sentence example output unit, 12: index.

Claims

[Claims]

1. A natural language processing database device having a natural language processing database that stores a plurality of sets, each consisting of a partial character string of a predetermined number of characters appearing in natural language sentences and its absolute or relative frequency information, and that is used by the main body of a natural language processing apparatus, the device comprising: user input means for receiving a character string and an importance, input by a user, that are to be reflected in the stored content of the natural language processing database; and database updating means for, if an item for the one or more partial character strings of the predetermined number of characters constituting the character string input by the user is not in the natural language processing database, adding that partial character string to the natural language processing database as a set with frequency information according to the importance, and, if the item is in the natural language processing database, updating the frequency information of that partial character string in the natural language processing database according to the importance.

2. The natural language processing database device according to claim 1, wherein the database updating means separates the character string input by the user into one or more partial character strings of the predetermined number of characters, searches the natural language processing database using each separated partial character string, adds the separated partial character string to the natural language processing database as a set with frequency information according to the importance if the search finds no item for it, and updates the frequency information of that partial character string in the natural language processing database according to the importance if the search finds an item for it.
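The add-or-update behaviour recited in claims 1 and 2 can be illustrated with a short sketch. How the importance is converted into frequency information is not stated in the claims, so the sketch assumes, purely for illustration, that the importance is an integer added directly to an absolute frequency. The database is modelled as a dictionary, and the function name is hypothetical.

```python
def update_database(database, text, importance, n=3):
    """Separate text into n-character strings, then add or update each item."""
    if len(text) < n:
        parts = [text]
    else:
        parts = [text[i:i + n] for i in range(len(text) - n + 1)]
    for part in parts:
        if part not in database:
            # No item yet: add it with a frequency derived from the importance.
            database[part] = importance
        else:
            # Item exists: update its frequency according to the importance.
            database[part] += importance


database = {"abc": 5}
update_database(database, "abcd", importance=2)
```

After the call, the existing item `abc` has been raised from 5 to 7 and the new item `bcd` has been added with a frequency of 2.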

3. The natural language processing database device according to claim 1, further comprising a sentence example storage unit that stores natural language sentences or natural language character strings used for forming the natural language processing database, wherein the database updating means comprises: a sentence example adding unit that adds the character string received by the user input means to the sentence example storage unit at least as many times as a number of appearances determined according to the importance; and a database reconstructing unit that reconstructs the natural language processing database from the stored content of the sentence example storage unit after the addition processing.

4. The natural language processing database device according to claim 1, having a sentence example storage unit that stores natural language sentences and natural language character strings used for forming the natural language processing database, the device further comprising: sentence example extraction character string input means for receiving a character string for sentence example extraction input by a user; and sentence example search output means for extracting and outputting the natural language sentences or natural language character strings stored in the sentence example storage unit that contain the whole of the character string for sentence example extraction or a partial character string thereof.