JP4114580B2

JP4114580B2 - Natural language processing system, natural language processing method, and computer program

Info

Publication number: JP4114580B2
Application number: JP2003326398A
Authority: JP
Inventors: 智子大熊; 博増市; 宏樹吉村; 大悟杉原
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-09-18
Filing date: 2003-09-18
Publication date: 2008-07-09
Anticipated expiration: 2023-09-18
Also published as: JP2005092617A

Description

The present invention relates to a natural language processing system, a natural language processing method, and a computer program for mathematically handling the natural language that humans use in everyday communication, and in particular to a natural language processing system, method, and computer program that perform syntactic and semantic analysis of natural language sentences.

More specifically, the present invention relates to a natural language processing system, a natural language processing method, and a computer program that output syntactic and semantic analysis results at a higher analysis speed for sentences containing compound words made up of a sequence of words, and in particular to a system, method, and computer program that output syntactic and semantic analysis results with higher recall when a compound word is merged into a single unit to improve analysis speed.

The languages that humans use in everyday communication, such as Japanese and English, are called “natural languages”. Many natural languages arose spontaneously and have evolved along with the history of humankind, peoples, and societies. People can of course also communicate through gestures, but natural language enables the most natural and sophisticated communication.

Meanwhile, with the development of information technology, computers have taken root in human society and have penetrated deeply into industry and daily life. Today not only computer data but almost all information content, including images and audio, is handled on computers, which makes advanced processing such as editing, storage, management, transmission, and sharing of information possible.

For example, natural language, whether written in Japanese, English, or any other language, is inherently abstract and highly ambiguous, but it can be processed by computer when sentences are handled mathematically. As a result, a variety of natural language applications and services, such as machine translation, dialogue systems, search systems, and question answering systems, are realized through automated processing.

Such natural language processing is generally divided into the phases of morphological analysis, syntactic analysis, semantic analysis, and context analysis.

Morphological analysis segments a sentence into morphemes, the smallest meaningful units, and identifies the part of speech of each. Syntactic analysis analyzes the structure of a sentence, such as its phrase structure, on the basis of grammar rules. Because grammar rules are tree-structured, the result of syntactic analysis is generally a tree in which individual morphemes are joined according to relationships such as dependency. Semantic analysis derives semantic structures that express the meaning the sentence conveys, based on the senses (concepts) of the words in the sentence and the semantic relationships between words, and composes them. Context analysis treats a text (discourse), that is, a sequence of sentences, as the basic unit of analysis, identifies the semantic coherence between sentences, and builds a discourse structure.

Syntactic analysis and semantic analysis in particular are regarded as indispensable for realizing natural language processing applications such as dialogue systems, machine translation, document proofreading support, and document summarization.

Syntactic analysis takes a natural language sentence and determines the dependency relationships between words (bunsetsu phrases) on the basis of grammar rules. The result can be expressed as a tree structure called a dependency structure (dependency tree). Semantic analysis can then determine the case relations in the sentence from the dependency relationships between words (bunsetsu phrases). A case relation here means the grammatical role, such as subject (SUBJ) or object (OBJ), that each element of the sentence plays. Semantic analysis may also include determining the tense, modality, and mode of speech of the sentence.

The time required for syntactic and semantic analysis in natural language processing is said to increase exponentially with the number of morphemes in a sentence. Merging several morphemes into one, and thereby reducing the number of morphemes, can therefore be expected to improve analysis speed.
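
To give a feel for why reducing the morpheme count matters, the following minimal sketch counts the distinct binary bracketings of a sequence of units, which is a Catalan number. This is only a stand-in growth model chosen for illustration; it is not the cost model of any particular parser, and the function name `binary_bracketings` is hypothetical.

```python
from math import comb


def binary_bracketings(n_units):
    """Number of distinct binary trees over n_units leaves (a Catalan number)."""
    n = n_units - 1
    return comb(2 * n, n) // (n + 1)


# A 10-unit sentence versus the same sentence after four units are merged into one.
print(binary_bracketings(10))  # 4862
print(binary_bracketings(7))   # 132
```

Under this assumed model, merging four units into one shrinks the space of candidate structures by more than an order of magnitude.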

For example, when a compound word made up of a sequence of words appears during syntactic or semantic analysis, analysis speed is improved by treating those words as a single word, that is, by merging the consecutive morphemes into one.

For example, a machine translation device has been proposed that treats a cohesive expression in the source text, typified by a collocation, as a single morpheme, thereby improving the accuracy of syntactic analysis and shortening the time it requires (see, for example, Patent Document 1). In this device an English-Japanese collocation dictionary is provided on a hard disk device, and cohesive expressions typified by collocations are stored in it. During syntactic analysis, the English text is searched for expressions consisting of words joined by a coordinating conjunction; if such an expression is registered in the collocation dictionary, or if the words that make up the expression share the same prefix or suffix, the expression is recognized as a single morpheme and parsed without being split.

A parsing method has also been proposed that enables a computer-based natural language analysis device to efficiently analyze compound words, compound sentences, and complex sentences, which had been difficult to analyze (see, for example, Patent Document 2). In this method, a word sequence whose parts of speech are verb + conjunctive particle + verb is treated as a single verb. When a word carrying the attribute of a logical operator such as “and” or “or” is found by consulting a dictionary prepared in advance, the words immediately before and after it are combined with it and processed as a single word. When at least one word selected from adjectives, adverbs, and interjections is detected, it is processed as a modifier of the first designating word (a word that denotes an event, such as a verb or a noun) that appears after it.

Compound words as referred to in this specification include “compound nouns”, which consist of a sequence of nouns, and “compound verbs”, which consist of a sequence of verbs.

For example, as in example sentence (1) below, the four consecutive nouns 「青少年」 (youth), 「総合」 (general), 「体育」 (athletic), and 「大会」 (meet) are treated as a single compound noun.

(1) 横浜で青少年総合体育大会が行われた。 (A general youth athletic meet was held in Yokohama.)
横浜で青少年総合体育大会が行うれるた。 (the same sentence with each inflected morpheme written in its base form)
→ 横浜で、青少年総合体育大会が行うれるた (with 「青少年総合体育大会」 handled as a single compound noun)

Merging what were originally separate morphemes, however, can cause problems. FIG. 8 shows an example of the syntactic and semantic analysis result for the example sentence above. If this analysis result is searched with the keyword 「大会」 (meet), for example, the keyword does not match the word 「青少年総合体育大会」, in which the consecutive nouns have been treated as a single noun.

(2) 横浜で行われた大会は何ですか？ (What meet was held in Yokohama?)

For example, sentence (1) cannot be adopted as the answer to a natural language query such as example sentence (2). FIG. 9 shows the syntactic and semantic analysis result for example sentence (2); because the analysis result in FIG. 8 has merged the nouns into a compound noun, the two can no longer be matched. In short, merging consecutive morphemes has the adverse effect of lowering the recall of a search system.

If, to maintain recall, the search uses partial matching of character strings instead of exact matching, a different problem arises.

(3) 合体はどこで行われましたか？ (Where did the merger take place?)

For a query such as example sentence (3) above, for example, the word 「合体」 (merger) partially matches 「青少年総合体育大会」, because the characters 合 and 体 happen to be adjacent inside it. FIG. 10 shows the syntactic and semantic analysis result for example sentence (3); because it is matched with the analysis result shown in FIG. 8, example sentence (1) is wrongly adopted as the answer. In other words, the precision of the search system is lowered.
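
The two failure modes described above can be reproduced with a minimal sketch. The token list and the function names `exact_match` and `partial_match` are hypothetical and merely stand in for a search over an analysis result in which the compound noun is stored as one opaque token.

```python
# Hypothetical analysis result for sentence (1): the compound noun is one opaque token.
indexed_tokens = ["横浜", "で", "青少年総合体育大会", "が", "行う", "れる", "た"]


def exact_match(keyword, tokens):
    """True if the keyword is identical to some token."""
    return keyword in tokens


def partial_match(keyword, tokens):
    """True if the keyword occurs as a substring of some token."""
    return any(keyword in token for token in tokens)


# Exact matching misses the relevant keyword 大会, which lowers recall.
print(exact_match("大会", indexed_tokens))    # False
# Substring matching finds 大会 but also the unrelated 合体, which lowers precision.
print(partial_match("大会", indexed_tokens))  # True
print(partial_match("合体", indexed_tokens))  # True
```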

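Patent Document 1: Japanese Unexamined Patent Application Publication No. H11-329178 (JP 11-329178 A)
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2001-125898 (JP 2001-125898 A)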

An object of the present invention is to provide an excellent natural language processing system, natural language processing method, and computer program capable of outputting syntactic and semantic analysis results at a higher analysis speed for sentences containing compound words made up of a sequence of words.

A further object of the present invention is to provide an excellent natural language processing system, natural language processing method, and computer program capable of outputting highly accurate syntactic and semantic analysis results when compound words are merged into single units to improve analysis speed.

A further object of the present invention is to provide an excellent natural language processing system, natural language processing method, and computer program capable of outputting highly accurate syntactic and semantic analysis results, without lowering the recall or precision of a search system, when compound words are merged into single units to improve analysis speed.

The present invention has been made in view of the above problems. A first aspect thereof is a natural language processing system that analyzes a natural language sentence in which a compound word made up of a sequence of words of a specific part of speech appears, the system comprising:
means for acquiring, for an input natural language sentence, a morphological analysis result that includes the part of speech identified for each morpheme;
means for extracting, based on the morphological analysis result, a portion of the input natural language sentence in which morphemes of the specific part of speech occur consecutively; and
means for placing, between each pair of adjacent morphemes of the specific part of speech so extracted, a delimiter indicating that they have been joined.

Here, the natural language processing system according to the present invention outputs the consecutive morphemes of the specific part of speech, with the joining delimiter placed between them, as a single word, that is, as a compound word.

The specific part of speech referred to here is, for example, a noun or a verb, and the compound word is accordingly a compound noun or a compound verb.

According to the present invention, when consecutive morphemes are merged into one compound word in preprocessing for syntactic and semantic analysis, a delimiter indicating the join is placed between the original morphemes. Because the delimiter makes it easy to recover the original morphemes from the merged compound word, recall in a search system can be maintained.

A second aspect of the present invention is a computer program written in a computer-readable format so as to execute, on a computer system, processing for analyzing a natural language sentence in which a compound word made up of a sequence of words of a specific part of speech appears, the computer program comprising the steps of:
acquiring, for an input natural language sentence, a morphological analysis result that includes the part of speech identified for each morpheme;
extracting, based on the morphological analysis result, a portion of the input natural language sentence in which morphemes of the specific part of speech occur consecutively;
placing, between each pair of adjacent morphemes of the specific part of speech so extracted, a delimiter indicating that they have been joined; and
outputting, as a single word, the consecutive morphemes of the specific part of speech with the joining delimiter placed between them.

The computer program according to the second aspect of the present invention defines a computer program written in a computer-readable format so as to realize predetermined processing on a computer system. In other words, installing the computer program according to the second aspect on a computer system produces a cooperative effect on that system and provides the same operation and effects as the natural language processing system according to the first aspect.

According to the present invention, it is possible to provide an excellent natural language processing system, natural language processing method, and computer program capable of outputting syntactic and semantic analysis results at a higher analysis speed for sentences containing compound words made up of a sequence of words.

Further, according to the present invention, it is possible to provide an excellent natural language processing system, natural language processing method, and computer program capable of outputting highly accurate syntactic and semantic analysis results, without lowering the recall or precision of a search system, when compound words are merged into single units to improve analysis speed.

Other objects, features, and advantages of the present invention will become apparent from the more detailed description based on the embodiments described below and the accompanying drawings.

Embodiments of the present invention are described in detail below with reference to the drawings.

To improve analysis speed, the natural language processing system according to the present invention merges compound words made up of consecutive morphemes, such as compound nouns and compound verbs, into single units, and in doing so it can output highly accurate syntactic and semantic analysis results without lowering the recall or precision of a search system.

A representative example of a grammar theory for syntactic and semantic analysis is Lexical Functional Grammar (LFG). The present invention can be implemented, for example, by incorporating it into syntactic and semantic analysis based on LFG. In LFG, a native speaker's linguistic knowledge, that is, the grammar, is constituted as a component separate from the computation itself and from other non-grammatical processing parameters that affect computational behavior.

First, an overview of the natural language processing system is given. FIG. 1 schematically shows the configuration of a natural language processing system 1 based on LFG. The system can be realized by running a predetermined natural language processing application program on a general-purpose computer system such as a personal computer (PC).

The morphological analysis unit 2 has morphological rules 2A and a morpheme dictionary 2B for a specific language such as Japanese, segments the input sentence into morphemes, the smallest meaningful units, and identifies their parts of speech. For example, when the sentence 「私の娘は英語を話します。」 (My daughter speaks English.) is input, the morphological analysis result “私{Noun}の{up}娘{Noun}は{up}英語{Noun}を{up}話す{Verb1}{tr}ます{jp}。{pt}” is output.
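
A tagged string of this shape can be split back into morphemes and their tags with a short sketch such as the following. It assumes the output is written with ASCII braces and that each morpheme is followed by one or more brace-enclosed tags; the function name `parse_tagged` is hypothetical, and the meaning of the individual tags is not interpreted.

```python
import re

# One morpheme followed by one or more {tag} groups.
TOKEN = re.compile(r"([^{}]+)((?:\{[^{}]+\})+)")
TAG = re.compile(r"\{([^{}]+)\}")


def parse_tagged(text):
    """Return a list of (surface, [tags]) pairs from an analyzer output string."""
    return [(m.group(1), TAG.findall(m.group(2))) for m in TOKEN.finditer(text)]


tagged = "私{Noun}の{up}娘{Noun}は{up}英語{Noun}を{up}話す{Verb1}{tr}ます{jp}。{pt}"
for surface, tags in parse_tagged(tagged):
    print(surface, tags)
```

Note that 「話す」 carries two tags in the example, which is why the pattern allows a run of tag groups after each surface string.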

This morphological analysis result is then input to the syntactic and semantic analysis unit 3. The syntactic and semantic analysis unit 3 has dictionaries such as grammar rules 3A and a valence dictionary 3B. It analyzes the phrase structure based on the grammar rules, and analyzes the semantic structure that expresses the meaning the sentence conveys, based on the senses of the words in the sentence and the semantic relationships between words. (The valence dictionary describes the relationships between a verb and other constituents of the sentence such as the subject, and makes it possible to extract the semantic relations between a predicate and the words that depend on it.) As the result of parsing, the unit outputs a “c-structure (constituent structure)”, which represents the phrase structure of the sentence, made up of words, morphemes, and so on, as a tree, and an “f-structure (functional structure)”, which is the result of analyzing the input sentence semantically and functionally, for example as interrogative, past tense, or polite, based on case structure such as subject and object.

FIGS. 2 and 3 show the c-structure and the f-structure, respectively, obtained by processing the input sentence 「私の娘は英語を話します。」 (My daughter speaks English.) with the syntactic and semantic analysis unit 3.

The c-structure represents the structure of the words and phrases in a sentence as a tree and is defined by syntactic categories. For example, phonological interpretation for generating a phoneme sequence can be performed on the basis of the c-structure. The f-structure, on the other hand, expresses grammatical functions explicitly and consists of grammatical function names, semantic forms, and feature symbols. By referring to the f-structure, a semantic understanding in terms of subject, object, complement, and adjunct can be obtained. The f-structure is a set of features attached to the nodes of the c-structure and is expressed as an attribute-value matrix, as shown in FIG. 3. That is, within each pair of square brackets, the left-hand side is the name of a feature (attribute) and the right-hand side is its value (attribute value).
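
An attribute-value matrix maps naturally onto a nested dictionary. The sketch below is purely illustrative: it is not a reproduction of FIG. 3, the attribute names (`PRED`, `SUBJ`, `OBJ`) follow common LFG notation rather than anything stated here, and the helper `lookup` is hypothetical.

```python
# Illustrative f-structure for 「私の娘は英語を話します。」; NOT the content of FIG. 3.
f_structure = {
    "PRED": "話す<SUBJ, OBJ>",  # semantic form of the predicate
    "SUBJ": {"PRED": "娘"},      # attribute whose value is itself a matrix
    "OBJ": {"PRED": "英語"},
}


def lookup(fs, *path):
    """Follow a chain of attribute names through nested attribute-value matrices."""
    node = fs
    for attribute in path:
        node = node[attribute]
    return node


print(lookup(f_structure, "SUBJ", "PRED"))  # 娘
print(lookup(f_structure, "OBJ", "PRED"))   # 英語
```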

For details of LFG, see, for example, R. M. Kaplan and J. Bresnan, “Lexical-Functional Grammar: A Formal System for Grammatical Representation” (The MIT Press, Cambridge (1982); reprinted in Formal Issues in Lexical-Functional Grammar, pp. 29-130, CSLI Publications, Stanford University (1995)).

Next, the merging process that the natural language processing according to the present invention applies to compound words made up of consecutive morphemes, such as compound nouns, is described in detail.

As already noted in the Background Art section, merging a compound word into a single word improves the speed of syntactic and semantic analysis but has the adverse effect of lowering recall or precision in a search system. In the present invention, therefore, when consecutive morphemes are merged into one compound word in preprocessing for syntactic and semantic analysis, a delimiter indicating the join is placed between the original morphemes. Because the delimiter makes it easy to recover the original morphemes from the morphological analysis result, recall in a search system can be maintained.
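
A minimal sketch of how a search might exploit the delimiter is shown below. It assumes a slash as the delimiter and a hypothetical token list for example sentence (1); the function names `components` and `keyword_match` are illustrative and do not describe a specific search system.

```python
DELIMITER = "/"  # assumed delimiter character

# Hypothetical analysis result for sentence (1) with the compound noun delimited.
tokens = ["横浜", "で", "青少年/総合/体育/大会", "が", "行う", "れる", "た"]


def components(token, delimiter=DELIMITER):
    """Recover the original morphemes of a (possibly compound) token."""
    return token.split(delimiter)


def keyword_match(keyword, tokens):
    """True if the keyword equals one of the original morphemes of some token."""
    return any(keyword in components(token) for token in tokens)


print(keyword_match("大会", tokens))  # True: the component is found, so recall is kept
print(keyword_match("合体", tokens))  # False: no spurious substring hit
```

Matching against whole components rather than raw substrings is what avoids the accidental hit on 「合体」 while still finding 「大会」.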

FIG. 4 shows, in flowchart form, the procedure that merges a sequence of consecutive morphemes into one compound word, based on the morphological analysis result, as preprocessing for syntactic and semantic analysis. Compound nouns are used here as the example of compound words. When a compound word is formed, a delimiter is inserted between its morphemes.

First, the original Japanese sentence is input, and the morphological analysis result produced by a separately performed morphological analysis is acquired (step S1). Morphological analysis segments the input sentence into morphemes, the smallest meaningful units, and identifies their parts of speech.

Next, 1 is assigned to the variable i, and loop 1 runs until i reaches the number of morphemes in the input sentence. Loop 1 inserts a delimiter between morphemes whenever several morphemes are merged into one compound word. A buffer is provided in loop 1 into which the consecutive morphemes that make up a compound word are written one after another.

If the i-th morpheme is a noun (step S2), loop 2 inserts delimiters between morphemes while forming the compound word that contains it. That is, if the buffer is not empty and already holds one or more morphemes (step S3), a delimiter is inserted after the preceding morpheme (step S9; this embodiment uses “/” (slash) as the delimiter), and the morpheme is then written to the buffer and thereby joined (step S4).

Next, the variable i is incremented by 1 (step S5), and the next morpheme is taken from the input sentence.

If the i-th morpheme is not a noun, loop 2 is exited and it is determined whether the buffer is empty, that is, whether a compound noun has been formed (step S6). If the buffer holds a compound noun, it is taken out as a single morpheme and the buffer is returned to the empty state (step S10). The compound noun taken out has delimiters inserted between its original morphemes.

The i-th morpheme, having undergone the above processing, is then output (step S7), i is incremented by 1, and loop 1 is repeated until all morphemes have been processed.
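
The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Morpheme` class, the function name `merge_compound_nouns`, and the part-of-speech labels 動詞 and 助動詞 are hypothetical, the index is 0-based whereas the flowchart counts from 1, and the sentence-final punctuation is omitted.

```python
from dataclasses import dataclass

DELIMITER = "/"  # the slash used as the delimiter in this embodiment


@dataclass
class Morpheme:
    surface: str  # character string of the morpheme
    pos: str      # part-of-speech label


def merge_compound_nouns(morphemes, target_pos="名詞", delimiter=DELIMITER):
    """Merge each run of consecutive target_pos morphemes into one delimited token."""
    output = []
    buffer = ""
    i = 0
    while i < len(morphemes):  # loop 1: visit every morpheme
        # loop 2: keep going while the i-th morpheme is a noun (S2)
        while i < len(morphemes) and morphemes[i].pos == target_pos:
            if buffer:                      # buffer already holds a morpheme (S3)
                buffer += delimiter         # insert the delimiter (S9)
            buffer += morphemes[i].surface  # write the morpheme to the buffer (S4)
            i += 1
        if buffer:                 # a noun or compound noun is pending
            output.append(buffer)  # take it out as one morpheme, empty the buffer (S10)
            buffer = ""
        if i < len(morphemes):
            output.append(morphemes[i].surface)  # output the non-noun morpheme (S7)
            i += 1
    return output


sentence = [
    Morpheme("横浜", "名詞"), Morpheme("で", "格助詞"),
    Morpheme("青少年", "名詞"), Morpheme("総合", "名詞"),
    Morpheme("体育", "名詞"), Morpheme("大会", "名詞"),
    Morpheme("が", "格助詞"), Morpheme("行う", "動詞"),
    Morpheme("れる", "助動詞"), Morpheme("た", "助動詞"),
]
print(" ".join(merge_compound_nouns(sentence)))
# 横浜 で 青少年/総合/体育/大会 が 行う れる た
```

A lone noun such as 「横浜」 passes through the buffer unchanged, since the delimiter is only added when the buffer already holds a morpheme.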

Next, the preprocessing for syntactic and semantic analysis according to this embodiment is described concretely using example sentence (1) above.

When the input sentence is submitted to morphological analysis, character strings are obtained together with part-of-speech information as its output and are held as the morphological analysis result. Here the morphemes that make up the input sentence, 「横浜」, 「で」, 「青少年」, 「総合」, 「体育」, 「大会」, 「が」, 「行う」, 「れる」, and 「た。」, are each stored with their part-of-speech information and their position from the beginning of the sentence, as shown in FIG. 5. In the figure, the “surface form” is the character string cut from the original text for each morpheme, and the “headword” is the base form when the morpheme is an inflected word.
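
One way to hold such a result is a list of records with a position, a surface form, a headword, and a part of speech. The sketch below is an assumption-laden illustration rather than the content of FIG. 5: the record type `MorphemeRecord`, the surface forms 「行わ」 and 「れ」, and the labels 動詞 and 助動詞 are inferred from ordinary Japanese morphology, and the sentence-final punctuation is omitted.

```python
from typing import NamedTuple


class MorphemeRecord(NamedTuple):
    position: int  # order from the beginning of the sentence (1-based)
    surface: str   # string as it appears in the original text
    headword: str  # base form (differs from surface only for inflected words)
    pos: str       # part-of-speech label


records = [
    MorphemeRecord(1, "横浜", "横浜", "名詞"),
    MorphemeRecord(2, "で", "で", "格助詞"),
    MorphemeRecord(3, "青少年", "青少年", "名詞"),
    MorphemeRecord(4, "総合", "総合", "名詞"),
    MorphemeRecord(5, "体育", "体育", "名詞"),
    MorphemeRecord(6, "大会", "大会", "名詞"),
    MorphemeRecord(7, "が", "が", "格助詞"),
    MorphemeRecord(8, "行わ", "行う", "動詞"),    # assumed surface form and label
    MorphemeRecord(9, "れ", "れる", "助動詞"),    # assumed surface form and label
    MorphemeRecord(10, "た", "た", "助動詞"),     # assumed label
]

surface_text = "".join(r.surface for r in records)
headword_text = "".join(r.headword for r in records)
print(surface_text)   # 横浜で青少年総合体育大会が行われた
print(headword_text)  # 横浜で青少年総合体育大会が行うれるた
```

Concatenating the headwords yields the base-form string that appears under example sentence (1).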

Next, the part-of-speech information of the i-th morpheme is consulted, and if it is not a noun the morpheme is output together with a space character. When i = 1, the i-th morpheme 「横浜」 is a noun (step S2) and the buffer is empty (step S3), so 「横浜」 is stored in the buffer as is (step S4). The variable i is then set to i + 1 (step S5).

The next morpheme i is 「で」, whose part of speech is “case particle” (step S2), so the character string in the buffer is output together with a space character and the buffer is emptied (step S10). The i-th morpheme (「で」) is then output (step S7), and the variable i is set to i + 1 (step S8).

The next morpheme i is 「青少年」, whose part of speech is "noun" (step S2). Because the buffer is empty (step S3), 「青少年」 is stored in the buffer as is (step S4), and i is set to i + 1 (step S5).

The next morpheme i is 「総合」, whose part of speech is "noun" (step S2). This time the buffer is not empty (step S3), so a delimiter (here, 「／」) and the i-th morpheme 「総合」 are appended to the buffer (step S9). The buffer therefore now holds 「青少年／総合」. The variable i is then set to i + 1 (step S5). The next morpheme i is 「体育」, whose part of speech is "noun" (step S2), and the buffer is not empty (step S3), so the procedure in loop 2 is repeated. That is, the character string in the buffer becomes 「青少年／総合／体育」 (step S9).

The processing in loop 2 is repeated for as long as the part of speech of the following i-th morpheme is a noun. Here, the next morpheme i is 「大会」 and its part of speech is "noun" (step S3), so the same processing is performed again. The contents of the buffer, that is, the structure of the compound noun, therefore become 「青少年／総合／体育／大会」, as shown in FIG. 6.

The next morpheme i after that is 「が」, whose part of speech is "case particle" (step S2). The repetition in loop 2 therefore stops, the character string in the buffer is output together with a space character, and then morpheme i is output (step S7).

The variable i is then set to i + 1 (step S8), and loop 1 is repeated. After that, the non-noun morphemes 「行う」, 「れる」, and 「た」 are each output (step S7).

In this way, the preprocessing procedure for syntactic/semantic analysis according to this embodiment outputs 「横浜で青少年／総合／体育／大会が行うれるた」 for the input of example sentence (1).
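The walkthrough above can be condensed into a short sketch. This is not the flowchart of FIG. 4 itself but one possible reading of it: the function name `merge_consecutive_nouns`, the `(form, part of speech)` tuple layout, and the English part-of-speech labels are assumptions for illustration. Flushing the buffer when the input ends on a noun is also an assumption, since the example sentence never exercises that case.

```python
from typing import Iterable, List, Tuple

DELIMITER = "／"  # delimiter used in the walkthrough


def merge_consecutive_nouns(
    morphemes: Iterable[Tuple[str, str]],
    target_pos: str = "noun",
    delimiter: str = DELIMITER,
) -> List[str]:
    """Join runs of consecutive target-POS morphemes into one delimited word."""
    output: List[str] = []
    buffer = ""
    for form, pos in morphemes:
        if pos == target_pos:
            # Empty buffer: store the morpheme as is.
            # Otherwise: append the delimiter and the morpheme.
            buffer = form if not buffer else buffer + delimiter + form
        else:
            # A non-target morpheme ends the run: emit the buffer, then the morpheme.
            if buffer:
                output.append(buffer)
                buffer = ""
            output.append(form)
    if buffer:  # assumption: flush a run that reaches the end of the input
        output.append(buffer)
    return output


sentence = [
    ("横浜", "noun"), ("で", "case particle"),
    ("青少年", "noun"), ("総合", "noun"), ("体育", "noun"), ("大会", "noun"),
    ("が", "case particle"),
    ("行う", "verb"), ("れる", "auxiliary"), ("た", "auxiliary"),
]
tokens = merge_consecutive_nouns(sentence)
print(" ".join(tokens))
```

Concatenating `tokens` without separators reproduces the output string quoted above. The single noun 「横浜」 passes through unchanged because its run has length one and therefore never receives a delimiter.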

FIG. 7 shows an example of the syntactic/semantic analysis result that is output by the processing procedure shown in FIG. 4 when the character string of example sentence (1) is given as input.

If a retrieval system is applied to an analysis result such as the one shown in that figure, an appropriate match can be obtained with a search term such as 「大会」, while an inappropriate match with a search term such as 「合体」, which occurs in the undelimited string only across the boundary between 「総合」 and 「体育」, can be prevented. In other words, a highly accurate syntactic/semantic analysis result can be output without lowering the recall or precision of the retrieval system.
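The difference between the two kinds of match can be illustrated with a minimal sketch. The function names are hypothetical, and treating a match as exact equality with one of the delimited morphemes is an assumption about how the retrieval system compares terms. The naive substring search is included only as a contrast.

```python
DELIMITER = "／"


def matches_by_morpheme(compound: str, query: str, delimiter: str = DELIMITER) -> bool:
    """Match the query against each original morpheme of a delimited compound."""
    return query in compound.split(delimiter)


def matches_by_substring(compound: str, query: str, delimiter: str = DELIMITER) -> bool:
    """Naive contrast: substring search on the compound with delimiters removed."""
    return query in compound.replace(delimiter, "")


compound = "青少年／総合／体育／大会"

for query in ("大会", "合体"):
    print(query,
          "morpheme match:", matches_by_morpheme(compound, query),
          "substring match:", matches_by_substring(compound, query))
```

With morpheme-wise matching, 「大会」 is found and 「合体」 is not, whereas the substring search also reports 「合体」 because the characters 「合」 and 「体」 happen to be adjacent in 「総合体育」.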

[Supplement]
The present invention has been described in detail above with reference to a specific embodiment. It is obvious, however, that those skilled in the art can modify or substitute the embodiment without departing from the gist of the present invention.

Although this embodiment has been described on the basis of LFG grammar theory, the present invention can of course be applied in the same way to analysis systems that use other grammar rules.

In short, the present invention has been disclosed by way of example, and the contents of this specification should not be interpreted restrictively. To determine the gist of the present invention, the claims section set forth at the beginning should be taken into consideration.

FIG. 1 is a diagram schematically showing the configuration of a natural language processing system 1 based on LFG.
FIG. 2 is a diagram showing the c-structure obtained by processing the input sentence 「私の娘は英語を話します。」 ("My daughter speaks English.") with the syntactic/semantic analysis unit 1.
FIG. 3 is a diagram showing the f-structure obtained by processing the input sentence 「私の娘は英語を話します。」 ("My daughter speaks English.") with the syntactic/semantic analysis unit 1.
FIG. 4 is a flowchart showing the procedure that, as preprocessing for syntactic/semantic analysis, merges a plurality of consecutive morphemes into one compound word based on the result of morphological analysis.
FIG. 5 is a diagram showing the morphological analysis result for example sentence (1).
FIG. 6 is a diagram showing the structure of the compound noun obtained from example sentence (1) by the preprocessing for syntactic/semantic analysis.
FIG. 7 is a diagram showing an example of the syntactic/semantic analysis result that is output by the processing procedure shown in FIG. 4 when the character string of example sentence (1) is given as input.
FIG. 8 is a diagram showing an example of a syntactic/semantic analysis result for example sentence (1).
FIG. 9 is a diagram showing the syntactic/semantic analysis result for example sentence (2).
FIG. 10 is a diagram showing the syntactic/semantic analysis result for example sentence (3).

Explanation of symbols

1: Natural language processing system
2: Morphological analysis unit
2A: Morphological rules, 2B: Morpheme dictionary
3: Syntactic/semantic analysis unit
3A: Grammar rules, 3B: Valence dictionary

Claims

A natural language processing system for analyzing a natural language sentence in which a compound word appears, the compound word being composed of a plurality of consecutive words of a specific part of speech, the system comprising:
means for acquiring, for an input natural language sentence, a morphological analysis result that includes a part-of-speech recognition result for each morpheme;
means for extracting, based on the morphological analysis result, a portion of the input natural language sentence in which morphemes of the specific part of speech occur consecutively;
means for inserting, between the extracted consecutive morphemes of the specific part of speech, a delimiter indicating that the morphemes are connected;
means for outputting, as a single word, the consecutive morphemes of the specific part of speech with the delimiter inserted between them; and
means for searching the compound word in which the delimiter has been inserted between the morphemes, so as to match a search word against each original morpheme separated by the delimiter.

The natural language processing system according to claim 1, wherein the specific part of speech is a noun or a verb.

A natural language processing method for analyzing, on a natural language processing system constructed using a computer, a natural language sentence in which a compound word appears, the compound word being composed of a plurality of consecutive words of a specific part of speech, the method comprising:
a step in which morphological analysis result acquisition means provided in the computer acquires, for an input natural language sentence, a morphological analysis result that includes a part-of-speech recognition result for each morpheme;
a step in which extraction means provided in the computer extracts, based on the morphological analysis result, a portion of the input natural language sentence in which morphemes of the specific part of speech occur consecutively;
a step in which assigning means provided in the computer inserts, between the extracted consecutive morphemes of the specific part of speech, a delimiter indicating that the morphemes are connected;
a step in which output means provided in the computer outputs, as a single word, the consecutive morphemes of the specific part of speech with the delimiter inserted between them; and
a step in which search means provided in the computer searches the compound word in which the delimiter has been inserted between the morphemes, so as to match a search word against each original morpheme separated by the delimiter.

The natural language processing method according to claim 3, wherein the specific part of speech is a noun or a verb.