JPH0634207B2 - Topic prediction device

Info

Publication number: JPH0634207B2
Application number: JP62186009A
Authority: JP
Inventors: Hideo Shimazu (島津秀雄)
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1987-07-24
Filing date: 1987-07-24
Publication date: 1994-05-02
Anticipated expiration: 2009-05-02
Also published as: JPS6429972A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to a method and an apparatus for predicting the topic of a given sentence in natural language processing.

(Prior Art and Its Problems) Conventional natural language analysis techniques, including those used in machine translation, have centered on processing well-formed sentences on the basis of their syntactic information (reference: 「意味構造を介した日本語機械翻訳システム」 [A Japanese machine translation system via semantic structure], Nikkei Electronics, December 17, 1984 issue). However, the language that people ordinarily exchange contains noise such as typographical errors, dropped characters, grammatical deviations, and omissions. When such sentences are processed the conventional way, applying the same processing uniformly to every part of the input sentence, the noisy parts of the input can make analysis impossible.

To cope with such noise, it is important to implement on a computer a way of grasping the broad content of a sentence and of distinguishing its important parts from its unimportant parts, much as a person skims a text, and thereby to predict what the speaker intends and which parts are noise. The object of the present invention is to provide a method and an apparatus for predicting the topic of a given sentence.

(Means for Solving the Problems) The first invention of the present application is a topic prediction method characterized by analyzing an input sentence using the grammar rules of the target language, extracting the words and phrases that may determine the topic of the input sentence, predicting topics of the sentence from the extracted words and phrases, and, among the predicted topics, taking the topic predicted from the largest number of words and phrases in the input sentence as the topic of the input sentence.

The second invention of the present application is a topic prediction apparatus characterized by comprising: a topic dictionary in which the topic candidates predicted by each word are registered in advance, word by word; a morphological analysis unit that divides an input sentence into a sequence of words; a syntactic analysis unit that receives the output of the morphological analysis unit and analyzes the grammatical structure of the input sentence; a main component extraction unit that receives the output of the syntactic analysis unit and extracts from the parse result only the words that are main components of the input sentence; and a topic selection unit that, referring to the topic dictionary, retrieves the topic candidates predicted by each word extracted by the main component extraction unit, keeps a count of how many times each candidate has been retrieved, and outputs the candidate with the highest count as the topic of the input sentence.

(Operation and Principle) The present invention solves the problems of the prior art by the means described above. Its operation is explained next using an example: messages posted through mass media for buying and selling second-hand goods.

"As a bonus, I will include a charger." (「おまけに、充電器付けます。」)
"For whoever buys it, I will include a charger." (「買って下さった方には、充電器付けます。」)
"The charger is a bonus." (「充電器は、おまけです。」)
In all three examples above, the topic of the sentence is that the seller will give the buyer a bonus in addition to the item being sold. Because the content is assumed to be restricted to topics concerning the sale of second-hand goods, the topic of these sentences can be predicted from the appearance of keywords such as "bonus" (おまけ) and "include" (付けます).

However, the simple approach described above, in which finding a particular pattern in the input sentence is taken to indicate a particular topic, often predicts wrongly even when the domain of the input sentences is known in advance to be narrow. An example follows.

If the sentence "The delivery of the goods and the bonus is by hand." (「商品とおまけの受渡しは、手渡しです。」) is evaluated merely by the presence or absence of patterns, the result is "goods" (商品) -> "description of the item for sale", "bonus" (おまけ) -> "offer of a bonus", "delivery" (受渡し) -> "exchange method", and "hand delivery" (手渡し) -> "exchange method", so the topics predicted for the sentence diverge. To avoid this, it suffices to analyze the input sentence down to its grammatical structure, determine whether a given word is merely a modifier or is the central component of one of the cases that make up the sentence, and test for topic prediction only the words that are central components of those cases. The cases include the nominative (the agent of the action), the instrumental (the instrument of the action), the objective (the object of the action), the source (the starting point or cause of the action), the goal (the end point of the action), the locative (the place of the action), and the temporal (the time of the action).

In that sentence, the phrase "the delivery of the goods and the bonus" (「商品とおまけの受渡しは」) is the nominative case and its central component is "delivery"; the part "of the goods and the bonus" (「商品とおまけの」) merely modifies "delivery". Accordingly, "goods" and "bonus" are ignored. If only "delivery" is tested within the nominative phrase, the predictions for the sentence become "delivery" -> "exchange method" and "hand delivery" -> "exchange method", which converge on a single topic.
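
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the dictionary holds the word-to-topic pairs named in the text, the English glosses stand in for the Japanese words, and the name `predicted_topics` is a hypothetical helper, not part of the patent.

```python
# Hypothetical topic dictionary built from the word -> topic pairs in the text.
TOPIC_DICTIONARY = {
    "goods": "description of the item for sale",
    "bonus": "offer of a bonus",
    "delivery": "exchange method",
    "hand delivery": "exchange method",
}


def predicted_topics(words):
    """Return the set of distinct topics predicted by the given words."""
    return {TOPIC_DICTIONARY[w] for w in words if w in TOPIC_DICTIONARY}


# Pattern matching applied to every content word of the sentence.
all_words = ["goods", "bonus", "delivery", "hand delivery"]
# Only the central component of the nominative case plus the main verb.
central_words = ["delivery", "hand delivery"]

naive = predicted_topics(all_words)
filtered = predicted_topics(central_words)

print(len(naive), sorted(naive))        # three different topics: the prediction diverges
print(len(filtered), sorted(filtered))  # a single topic remains
```

Restricting the lookup to central components removes "goods" and "bonus" before the dictionary is consulted, which is why only "exchange method" survives.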

The present invention is intended for applications in which the domain of the input sentences is known in advance to be narrow. A dictionary is prepared beforehand that records the relationship between the words that serve as keywords in the domain and the topic candidates in the domain that each word predicts. When a sentence is input, it is analyzed according to grammar rules, the words that are its main components are extracted as keywords, the topic candidates predicted by those keywords are counted, and the candidate predicted by the most keywords is taken as the topic of the sentence. For example, in "As a bonus, I will include a charger.", both "bonus" and "include" predict the topic "offer of a bonus", so the topic of the sentence is "offer of a bonus". Similarly, in "The delivery of the goods and the bonus is by hand.", the topic "exchange method" is predicted from both "delivery" and "hand delivery".
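
A minimal Python sketch of this counting scheme follows. It assumes that the main-component keywords have already been extracted by a parser, and it uses English glosses in place of the Japanese words. The function name `predict_topic` and the dictionary layout are illustrative assumptions, not the patent's implementation.

```python
from collections import Counter

# Hypothetical topic dictionary: keyword -> topic candidate it predicts.
TOPIC_DICTIONARY = {
    "bonus": "offer of a bonus",
    "include": "offer of a bonus",
    "delivery": "exchange method",
    "hand delivery": "exchange method",
    "goods": "description of the item for sale",
}


def predict_topic(keywords, dictionary=TOPIC_DICTIONARY):
    """Return the topic predicted by the most keywords, or None if none match.

    Keywords missing from the dictionary are skipped. If several topics are
    tied, this sketch simply returns the one encountered first.
    """
    votes = Counter(dictionary[w] for w in keywords if w in dictionary)
    if not votes:
        return None
    topic, _count = votes.most_common(1)[0]
    return topic


# "As a bonus, I will include a charger."
print(predict_topic(["bonus", "charger", "include"]))  # offer of a bonus
# "The delivery of the goods and the bonus is by hand."
print(predict_topic(["delivery", "hand delivery"]))    # exchange method
```

"charger" is left out of the dictionary here only to show that unregistered words contribute no vote.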

(Embodiments) The present invention is described in detail below with reference to embodiments.

FIG. 1 is a block diagram showing an embodiment of the first invention of the present application. In FIG. 1, 1 is a topic dictionary in which the topic candidates predicted from each word are registered in advance, word by word. 2 is a topic indicator word extraction unit that analyzes the input sentence grammatically and extracts the words that are its main components. 3 is a topic selection unit. Referring to the topic dictionary, for each word extracted by the topic indicator word extraction unit 2, it does nothing if the word predicts no topic candidate; otherwise it retrieves the candidate and keeps, for each retrieved candidate, a count of how many times it has been retrieved. Once candidates have been retrieved and counted for all the words extracted by the topic indicator word extraction unit 2, it outputs the candidate with the highest count as the topic of the input sentence.

The input sentence is first fed to the topic indicator word extraction unit 2, where it is segmented into words by part of speech. For example, the input sentence "As a bonus, I will include a charger." (「おまけに、充電器付けます。」) is divided into "おまけ" (bonus), "に" (particle), "充電器" (charger), and "付けます" (include). Next, the structure of the input sentence is analyzed according to grammar rules. This shows that "おまけに" is the source case, "充電器" is the objective case, and "付けます" is the main verb, and that the central components of the cases are "おまけ" and "充電器".

Next, the central component of each case and the main verb are taken out one word at a time and fed to the topic selection unit 3. On receiving a word, the topic selection unit 3 consults the topic dictionary 1. If the word is not registered there as a keyword, processing of that word ends. If it is registered, the unit retrieves the topic candidate that the word predicts from the topic dictionary 1 and records the topic item together with its number of occurrences. This is done for every central component and the main verb of the given input sentence, and the topic candidate with the highest frequency is output as the topic of the input sentence.

Next, an apparatus that concretely implements this method is described as an embodiment of the second invention of the present application. FIG. 2 shows an embodiment of the second invention. In FIG. 2, 10 is a morphological analysis unit, 11 is a syntactic analysis unit, 12 is a main component extraction unit, 13 is a topic dictionary, 14 is a keyword comparison unit, 15 is a topic candidate storage unit, and 16 is a topic prediction unit.

FIG. 2 is explained here with reference to the example sentence used in the (Operation and Principle) section, "The delivery of the goods and the bonus is by hand." (「商品とおまけの受渡しは、手渡しです。」).

The input sentence is divided into part-of-speech units by the morphological analysis unit 10. The example sentence is divided into "商品" (goods), "と" (and), "おまけ" (bonus), "の" (of), "受渡し" (delivery), "は" (topic particle), "手渡し" (hand delivery), and "です" (copula). Various techniques have already been proposed for morphological analysis, and the present invention does not specify any particular one (reference: 日本語情報処理 [Japanese Information Processing], Takahashi (ed.), Tanaka et al., Kindai Kagaku Sha).

The output of the morphological analysis unit 10 is the input to the syntactic analysis unit 11, which parses the input sentence according to grammar rules. The parse shows that "受渡し" (delivery) is the nominative case, that "手渡し" "です" (is by hand) is the main verb, and that "商品とおまけの" (of the goods and the bonus) is a phrase modifying the nominative. Various techniques have likewise been proposed for syntactic analysis, and the present invention does not specify any particular one (reference: 日本語情報処理 [Japanese Information Processing], Takahashi (ed.), Tanaka et al., Kindai Kagaku Sha).

The output of the syntactic analysis unit 11 is the input to the main component extraction unit 12. On receiving it, the main component extraction unit 12 acts as a filter: it outputs only the words that are the main components of the cases in the input sentence, together with the main verb, and discards the other modifying phrases. Accordingly, only "delivery" and "hand delivery" remain from the example sentence.
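
The filtering role of the main component extraction unit can be sketched as follows. The patent does not specify the output format of the syntactic analysis unit, so the `ParsedWord` record and its role labels ("case_head", "main_verb", "modifier", "function") are assumptions made purely for this illustration, with English glosses standing in for the Japanese words.

```python
from typing import NamedTuple


class ParsedWord(NamedTuple):
    surface: str  # the word itself (English gloss here)
    role: str     # hypothetical role label assigned by the parser


# Roles that pass through the filter; everything else is discarded.
KEPT_ROLES = {"case_head", "main_verb"}


def extract_main_components(parsed):
    """Keep only central components of cases and the main verb."""
    return [w.surface for w in parsed if w.role in KEPT_ROLES]


# Assumed parse of "The delivery of the goods and the bonus is by hand."
parsed_sentence = [
    ParsedWord("goods", "modifier"),
    ParsedWord("and", "function"),
    ParsedWord("bonus", "modifier"),
    ParsedWord("of", "function"),
    ParsedWord("delivery", "case_head"),      # central component of the nominative
    ParsedWord("(topic particle)", "function"),
    ParsedWord("hand delivery", "main_verb"),
    ParsedWord("(copula)", "function"),
]

print(extract_main_components(parsed_sentence))  # ['delivery', 'hand delivery']
```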

The topic dictionary 13 has the structure shown in FIG. 3. For example, the topic candidate predicted from the word "delivery" (受渡し) is "exchange method".

When the keyword comparison unit 14 receives a word from the main component extraction unit 12, it searches the topic dictionary 13 using that word as the key. If the word is not registered in the topic dictionary 13, processing of that word ends and the unit receives the next word from the main component extraction unit 12. If the word is registered, the unit outputs the topic candidate read from the topic dictionary 13. In the example sentence, receiving "delivery" causes it to output "exchange method".

As shown in FIG. 4, the topic candidate storage unit 15 holds as its internal state pairs consisting of a topic candidate that has appeared and the number of times it has appeared. On receiving a topic candidate from the keyword comparison unit 14, it checks whether the candidate is appearing for the first time. If so, it stores the candidate's name together with its occurrence count (that is, 1). If the candidate has already appeared, it increments the count by 1 and finishes. In the example sentence, the topic candidate "exchange method" is retrieved from two keywords, "delivery" (受渡し) and "hand delivery" (手渡しです), so its count is 2, as shown in FIG. 4.
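
The internal state described for FIG. 4 can be sketched as a small Python class. The class name `TopicCandidateStore` and the method name `receive` are hypothetical, and the explicit first-appearance check mirrors the description in the text rather than any implementation given in the patent.

```python
class TopicCandidateStore:
    """Holds (topic candidate, occurrence count) pairs."""

    def __init__(self):
        # topic candidate name -> number of times it has appeared
        self.counts = {}

    def receive(self, candidate):
        """Record one occurrence of a topic candidate."""
        if candidate not in self.counts:
            # First appearance: store the name with a count of 1.
            self.counts[candidate] = 1
        else:
            # Already seen: increment the stored count.
            self.counts[candidate] += 1


# "delivery" and "hand delivery" both yield "exchange method".
store = TopicCandidateStore()
for candidate in ["exchange method", "exchange method"]:
    store.receive(candidate)

print(store.counts)  # {'exchange method': 2}
```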

Depending on the input sentence, two or more topic candidates may have the same number of occurrences. The present invention concerns a method for predicting the topic of an input sentence and does not specify how the prediction is then used in interpreting the sentence. When several equally likely topic candidates exist, however, a suitable approach appears to be to pick one of them provisionally and proceed with the analysis, and, if the analysis fails partway, to retry with another candidate. PROLOG is well suited to such processing (reference: 知識表現とProlog/KR [Knowledge Representation and Prolog/KR], Hideyuki Nakashima, Sangyo Tosho, 1985).
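
The retry strategy suggested for tied candidates can be approximated in Python with a loop, even though the text recommends PROLOG's built-in backtracking. Everything here is a sketch under stated assumptions: `analyze` is a hypothetical downstream interpretation step that the patent leaves unspecified, `AnalysisFailed` is an invented exception, and the tied counts are made-up example data.

```python
class AnalysisFailed(Exception):
    """Raised by the (hypothetical) downstream analysis when a topic does not fit."""


def tied_top_candidates(counts):
    """Return every topic candidate that shares the highest count."""
    if not counts:
        return []
    best = max(counts.values())
    return [topic for topic, n in counts.items() if n == best]


def analyze_with_retry(counts, analyze):
    """Try each top-ranked candidate in turn until the analysis succeeds."""
    for topic in tied_top_candidates(counts):
        try:
            return topic, analyze(topic)
        except AnalysisFailed:
            continue  # analysis broke down partway: fall back to the next candidate
    return None


def analyze(topic):
    # Stand-in for real sentence interpretation: pretend one topic leads nowhere.
    if topic == "description of the item for sale":
        raise AnalysisFailed(topic)
    return "interpreted as: " + topic


# Two candidates tied with one occurrence each (illustrative data only).
counts = {"description of the item for sale": 1, "offer of a bonus": 1}
print(analyze_with_retry(counts, analyze))
```

The loop tries the first tied candidate, abandons it when the stand-in analysis fails, and succeeds with the second, which is the behavior the text describes.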

When the above processing has been completed for every word in the input sentence, the topic prediction unit 16 reads the internal state of the topic candidate storage unit 15 and selects the topic candidate name with the highest occurrence count. This candidate is taken to be the topic of the input sentence and is output externally.

(Effects of the Invention) According to the present invention, the topic of a sentence can be determined in natural language processing, so the content of the sentence can be analyzed on the basis of that prediction. Moreover, even for noisy input sentences, which conventional methods handle poorly, the present invention makes it possible to judge what is and is not important in what a sentence says, so natural language sentences containing much noise can be analyzed. The method proposed in the present invention is suited to systems that extract only the information a user needs from a large volume of natural language text, and to building natural language dialogue systems as interfaces to databases and expert systems.

[Brief Description of the Drawings] FIG. 1 shows the configuration of an embodiment of the first invention of the present application. FIG. 2 shows the configuration of an embodiment of the second invention of the present application. FIG. 3 is a conceptual diagram showing the structure of the topic dictionary in the embodiment of FIG. 2. FIG. 4 is a conceptual diagram showing the structure of the topic candidate storage unit in the embodiment of FIG. 2.
1: topic dictionary; 2: topic indicator word extraction unit; 3: topic selection unit; 10: morphological analysis unit; 11: syntactic analysis unit; 12: main component extraction unit; 13: topic dictionary; 14: keyword comparison unit; 15: topic candidate storage unit; 16: topic prediction unit.

Claims

1. A topic prediction apparatus comprising: a topic dictionary in which the topic candidates predicted by each word are registered in advance, word by word; a morphological analysis unit that divides an input sentence into a sequence of words; a syntactic analysis unit that receives the output of the morphological analysis unit and analyzes the grammatical structure of the input sentence; a main component extraction unit that receives the output of the syntactic analysis unit and extracts from the parse result only the words that are main components of the input sentence; and a topic selection unit that, referring to the topic dictionary, retrieves the topic candidates predicted by each word extracted by the main component extraction unit, keeps a count of how many times each candidate has been retrieved, and outputs the topic candidate with the highest number of occurrences as the topic of the input sentence.