JPH01144162A

JPH01144162A - System and device for morpheme analysis using key word

Info

Publication number: JPH01144162A
Application number: JP62303544A
Authority: JP
Inventors: Takeshi Nishimura; 健士西村
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1987-11-30
Filing date: 1987-11-30
Publication date: 1989-06-06

Abstract

PURPOSE: To achieve high-accuracy morphological analysis, and faster processing, for natural language sentences whose topic is limited to a specific field, including sentences that contain words peculiar to that field or character strings mixing special symbols and numerical values that carry a field-specific meaning and are not easily recognized as a unit of division in ordinary morphological analysis. CONSTITUTION: When a natural language sentence whose topic is limited to a specific field is input, a keyword-based dividing part 2 checks whether any of the keywords stored in advance in a keyword storing part 1 coincides with part of the character string of the input sentence. If a coincident part is detected, delimiters recognizable in morphological analysis are inserted before and after it; if none is found, nothing is inserted. The natural language sentence divided by the delimiters, or the undivided sentence containing no delimiter, is then sent to a morpheme analyzing part 3 for morphological analysis. Thus the processing speed of morphological analysis is increased.

Description

[Detailed description of the invention] The present invention relates to a method and device for morphologically analyzing natural language sentences.

[Prior art]

Conventionally, a morphological analysis technique consisting of three processing phases is known as a technique for morphologically analyzing natural language sentences. It is described, for example, in Hozumi Tanaka, "Syntactic Analysis and Semantic Analysis," in Nobumasa Takahashi (ed.), "Japanese Information Processing" (1986). The three phases are: forced segmentation at character-type boundaries; dictionary lookup and acquisition of connection information under the longest-match principle; and word segmentation by checking a connection table.
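As an illustration of the first of these phases only, the following is a minimal sketch of forced segmentation at character-type boundaries. The function names `char_type` and `split_by_char_type` and the choice of Unicode ranges are assumptions made for this example; the cited reference may classify characters differently.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Classify a character by script (a simplified, assumed classification)."""
    if ch.isdigit():
        return "digit"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


def split_by_char_type(text: str) -> list[str]:
    """Cut the text wherever the character type changes."""
    return ["".join(group) for _, group in groupby(text, key=char_type)]


print(split_by_char_type("40万で買ったパソコン"))
```

The resulting runs are only candidate boundaries; in the technique described above, dictionary lookup and the connection-table check would still have to decide the final segmentation.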

[Problem that the invention seeks to solve]

With conventional morphological analysis technology, when the topic of the natural language sentences to be analyzed is limited to a specific field, segmentation accuracy drops whenever a word unique to that field appears in the input natural language sentence.

[Means for solving problems]

The morphological analysis method using keywords of the present invention is characterized in that a natural language sentence whose topic is limited to a specific field is input; a dictionary in which keywords of that field have been registered in advance is consulted; whenever a keyword registered in the dictionary appears, delimiters recognizable by morphological analysis are inserted before and after it; and morphological analysis is then performed.
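The keyword pre-pass described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `insert_delimiters`, the `@` delimiter, and the longest-keyword-first tie-break are assumptions of the example.

```python
def insert_delimiters(sentence: str, keywords: list[str], delim: str = "@") -> str:
    """Wrap every occurrence of a registered keyword in delimiters."""
    out = []
    i = 0
    # Try longer keywords first so overlapping entries resolve predictably
    # (an assumption of this sketch; the source does not specify a tie-break).
    ordered = sorted(keywords, key=len, reverse=True)
    while i < len(sentence):
        hit = next((k for k in ordered if k and sentence.startswith(k, i)), None)
        if hit is None:
            out.append(sentence[i])  # ordinary character: copy it through
            i += 1
        else:
            out.append(delim + hit + delim)  # keyword: mark both boundaries
            i += len(hit)
    return "".join(out)


marked = insert_delimiters("40万で買ったパソコン新同品を23万円で売ります。", ["新同品"])
print(marked)
```

The marked sentence would then be handed to an ordinary morphological analyzer that understands the delimiter.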

[Operation]

Natural language sentences in a particular field contain words specific to that field. For example, suppose that goods-trading information is the field to be analyzed. In the sentence 「40万で買ったパソコン新同品を23万円で売ります。」 ("I am selling for 230,000 yen a like-new personal computer that I bought for 400,000"), 「新同品」 means "an item in like-new condition" and is a word peculiar to the goods-trading field. With conventional morphological analysis technology, even if the sentence can be segmented correctly up to 「…パソコン」, the word 「新同品」 can hardly be expected to be in the dictionary, so the segmentation that follows cannot be guaranteed. In application systems of natural language processing that are restricted to a target field, field-specific words usually play an important role in capturing the meaning of a sentence, so they must be picked out accurately.

Field-specific words often have a unique meaning within a text and can be extracted without considering their dependence on the preceding and following words. 「新同品」 is one such word, and words of this kind can be given priority in the analysis method. That is, first only the keywords are detected in the natural language sentence, and delimiters are inserted as marks before and after each detected keyword; segmentation at those points is regarded as settled. If the delimiter is written as @ for convenience, the above example becomes:

「40万で買ったパソコン@新同品@を23万円で売ります。」

Next, conventional morphological analysis is applied to this sentence while recognizing @ as a delimiter; specifically, @ is treated as a word that can connect freely to whatever precedes or follows it.

With this method, field-specific words are no longer segmented incorrectly. It is especially effective when something not normally recognized as a unit of segmentation, such as a character string mixing special symbols and numerical values, carries a field-specific meaning. In addition, the character strings subjected to morphological analysis become effectively shorter, so processing is faster. In the above example, the analysis reduces to the morphological analysis of two short fragments, 「40万で買ったパソコン」 and 「を23万円で売ります。」.
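The reduction to short fragments can be illustrated with a small sketch. It assumes the delimiter is `@` and that keywords already wrapped in delimiters need no further analysis; the function name `fragments` is hypothetical.

```python
def fragments(marked: str, keywords: set[str], delim: str = "@") -> list[str]:
    """Return the pieces that still need ordinary morphological analysis."""
    pieces = [p for p in marked.split(delim) if p]  # drop empty pieces
    # Keywords are already settled units, so only the remaining text is analyzed.
    return [p for p in pieces if p not in keywords]


rest = fragments("40万で買ったパソコン@新同品@を23万円で売ります。", {"新同品"})
print(rest)
```

Each returned fragment is shorter than the original sentence, which is the source of the speed-up claimed above.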

[Example]

FIG. 1 is a block diagram showing an embodiment of the method of the present invention. The keyword-based dividing section 2 checks whether any of the keywords stored in advance in the keyword storage section 1 matches part of the character string of the input natural language sentence. If there is a matching part, delimiters recognizable in morphological analysis are inserted before and after it; if nothing matches, nothing is inserted. The natural language sentence divided by delimiters, or the natural language sentence left as input with no delimiter inserted, is sent to the morphological analysis section 3 and morphologically analyzed using conventional technology.

FIG. 2 is a block diagram showing an embodiment of the device of the present invention. The input natural language sentence is stored in the input natural language sentence memory 11. The controller 15 instructs the keyword detector 12 to operate. The keyword detector 12 detects keywords while referring to the keyword dictionary 20, in which keywords of the specific field are stored in advance, and to the input natural language sentence memory 11, and writes the position of each detected keyword within the input sentence into the keyword position memory 14. Each time a keyword is detected in the input sentence, two values are written into the keyword position memory 14: the start position of the keyword in the input sentence, and the position of the character following its end position.

The keyword detector 12 and the keyword dictionary 20 can be realized in various ways. For example, a device based on the 68M algorithm introduced in I, Takagi, and Ushijima, "Realizing five kinds of pattern-matching methods as C-language functions," Nikkei Byte, August 1987 issue, may be used.

In the earlier example, 「40万で買ったパソコン新同品を23万円で売ります。」, two values are written into the keyword position memory 14 for the keyword 「新同品」: the position of 「新」 and the position of 「を」. When the keyword detector 12 has finished scanning the contents of the input natural language sentence memory 11, the controller 15 instructs the address counter 13, the comparator 16, and the multiplexer 17 to operate.
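The contents of the keyword position memory 14 can be illustrated with a short sketch. It uses Python's built-in `str.find` as a stand-in for whatever pattern-matching device the keyword detector 12 actually uses; the function name `keyword_positions` and the zero-based addressing are assumptions of the example.

```python
def keyword_positions(sentence: str, keywords: list[str]) -> list[int]:
    """Collect, for each keyword hit, its start and the position after its end."""
    positions = []
    for kw in keywords:
        start = sentence.find(kw)
        while kw and start != -1:
            positions.append(start)            # start position of the keyword
            positions.append(start + len(kw))  # character following its end
            start = sentence.find(kw, start + len(kw))
    return positions


sentence = "40万で買ったパソコン新同品を23万円で売ります。"
positions = keyword_positions(sentence, ["新同品"])
print(positions, [sentence[p] for p in positions])
```

For the example sentence the two recorded positions point at 「新」 and 「を」, matching the description above.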

The comparator 16 compares the value of the address counter 13 with the values in the keyword position memory 14. If one of the values in the keyword position memory 14 equals the value of the address counter 13, the comparator 16 causes the multiplexer 17 to output the character stored in the delimiter memory 18 to the morphological analyzer 19, instructs the address counter 13 to wait for one cycle without updating its address value, and deletes the matching value from the keyword position memory 14. If no value in the keyword position memory 14 equals the value of the address counter 13, the comparator 16 causes the multiplexer 17 to output to the morphological analyzer 19 the character in the input natural language sentence memory 11 at the address held by the address counter 13, and instructs the address counter 13 to add 1 to its address value. This operation of the comparator 16 is repeated until the value of the address counter 13 exceeds the address of the last character of the natural language sentence stored in the input natural language sentence memory 11.

In the earlier example, characters are sent one at a time from the input natural language sentence memory 11 to the morphological analyzer 19 up through 「40万で買ったパソコン」. When the address counter 13 points to 「新」, a delimiter is sent and the position value of 「新」 is deleted from the keyword position memory 14. 「新同品」 is then sent one character at a time. On reaching 「を」, a delimiter is sent and the position value of 「を」 is deleted from the keyword position memory 14. 「を」 and the remaining 「23万円で売ります。」 are then sent as is.

A conventional morphological analyzer can be used as the morphological analyzer 19.
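The interplay of the address counter 13, keyword position memory 14, comparator 16, and multiplexer 17 can be simulated in software. This is a sketch under assumptions: addresses are zero-based, the delimiter is `@`, and the function name `stream_with_delimiters` is hypothetical; the hardware described above is not reproduced in detail.

```python
def stream_with_delimiters(sentence: str, positions: list[int], delim: str = "@") -> str:
    """Replay the comparator loop and return what the analyzer would receive."""
    pending = list(positions)  # plays the role of keyword position memory 14
    address = 0                # plays the role of address counter 13
    sent = []                  # characters delivered to morphological analyzer 19
    while address < len(sentence):
        if address in pending:
            sent.append(delim)       # multiplexer selects delimiter memory 18
            pending.remove(address)  # delete the matched value; counter is held
        else:
            sent.append(sentence[address])  # multiplexer selects sentence memory 11
            address += 1
    return "".join(sent)


print(stream_with_delimiters("40万で買ったパソコン新同品を23万円で売ります。", [11, 14]))
```

Because the counter is held whenever a delimiter is emitted, a position stored twice (for two adjacent keywords) yields two delimiters in a row. In this literal reading of the loop, a keyword that ends the sentence gets no trailing delimiter, since the loop stops once the counter passes the last character.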

[Effect]

With the morphological analysis method or device using keywords of the present invention, when morphologically analyzing natural language text whose topic is limited to a particular field, text can be analyzed with high accuracy even if it contains words peculiar to that field, or character strings mixing special symbols and numerical values that carry a field-specific meaning and are hard to recognize as units of segmentation in ordinary morphological analysis. In addition, because the character strings subjected to morphological analysis may become shorter, processing is faster.

[Brief explanation of the drawing]

FIG. 1 and FIG. 2 are block diagrams showing embodiments of the method and the device of the present invention, respectively.

1: keyword storage section
2: keyword-based dividing section
3: morphological analysis section
11: input natural language sentence memory
12: keyword detector
13: address counter
14: keyword position memory
15: controller
16: comparator
17: multiplexer
18: delimiter memory
19: morphological analyzer
20: keyword dictionary
21: keyword-based dividing means
22: morphological analysis means
23: keyword storage means

Agent: Patent Attorney Susumu Uchihara

Claims

[Claims]

(1) A morphological analysis method using keywords, characterized in that a natural language sentence whose topic is limited to a specific field is input; a dictionary in which keywords of the specific field are registered in advance is referred to; when a keyword registered in the dictionary appears, preset delimiters recognizable by morphological analysis are inserted before and after that keyword in the natural language sentence; and morphological analysis is performed on the natural language sentence into which the delimiters have been inserted.

(2) A morphological analysis device using keywords, characterized by comprising: keyword storage means that stores keywords of a specific field in advance; keyword-based dividing means that refers to the keyword storage means to check whether a keyword of the specific field stored therein is present in an input natural language sentence whose topic is limited to the specific field and, when such a keyword is detected, inserts preset delimiters recognizable by morphological analysis before and after the detected keyword in the natural language sentence; and morphological analysis means that, when it finds a delimiter while morphologically analyzing the natural language sentence into which delimiters have been inserted by the dividing means, recognizes that the character string being processed has ended and starts morphological analysis anew from the next character, repeating this operation.