JP2011232891A

JP2011232891A - Language analysis program

Info

Publication number: JP2011232891A
Application number: JP2010101421A
Authority: JP
Inventors: Guowei Zu; 国威祖; Toshiyuki Kano; 敏行加納
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2010-04-26
Filing date: 2010-04-26
Publication date: 2011-11-17
Anticipated expiration: 2030-04-26
Also published as: JP5017416B2

Abstract

PROBLEM TO BE SOLVED: To extract an unknown word accurately without using external information and frequency.

SOLUTION: A suitability judgment section 14 transmits a word with an unknown word tag to a clause extending section 15 when a morpheme analysis result in a storage section 11 includes the unknown word tag. The clause extending section 15 writes clauses extended to the right side and to the left side or an extended clause into the storage section 11 according to whether the transmitted word includes a delimiter or not. A morpheme analysis section 13 performs a morpheme analysis on the extended clause in the storage section 11 and writes the morpheme analysis result into the storage section 11. The suitability judgment section 14 determines whether the morpheme analysis result of the extended clause and the word with the unknown word tag of the previous morpheme analysis result are matched or not when the morpheme analysis result of the extended clause includes the unknown word tag. When matched, an analysis result output section 16 outputs the previous morpheme analysis result.

Description

Embodiments of the present invention relate to a language analysis program.

In general, techniques for extracting words that are not registered in a dictionary (hereinafter, unknown words or unregistered words) on the basis of Japanese morphological analysis fall into two groups: statistical methods and linguistic-knowledge-based methods.

Statistical methods extract unknown words using the word-likeness and connection costs of unknown words, machine-learned from large corpora. Linguistic-knowledge-based methods detect erroneous segmentation of unknown words and extract the appropriate unknown words on the basis of large-scale linguistic knowledge such as dictionaries, constraints, and examples.

JP 2003-99426 A
JP 2006-31295 A
JP H5-67073 A
JP H9-190440 A

Masanori Ikeya, "Unknown word extraction using the probability that a character string becomes a word", Information Processing Society of Japan research report, Natural Language Processing study group report, 2000 (11), pp. 49-54, 2000
Yugo Murawaki, "Online unknown word acquisition using morphological constraints", Natural Language Processing, Vol. 17, No. 1, 2010

However, all of the above techniques require detailed external information such as large-scale corpora, dictionaries, and linguistic rules, so accurate unknown word extraction cannot be achieved easily. In addition, frequency-based statistical methods can recognize commonly used terms (high-frequency words) but have difficulty recognizing low-frequency words.

An object of the embodiments of the present invention is to provide a language analysis program that can easily achieve accurate unknown word extraction without using external information or frequency.

When the language analysis apparatus of the embodiment is implemented as a combination of hardware resources and software, a language analysis program for realizing the functions of the language analysis apparatus is installed in a computer in advance.

The language analysis apparatus of the embodiment includes a storage unit for storing a target sentence and the morphological analysis result of the target sentence. When it morphologically analyzes a target sentence containing an unknown word, it outputs a morphological analysis result that includes an unknown word tag assigned to the unknown word.

The language analysis apparatus of the embodiment includes a storage unit, a morphological analysis unit, an appropriateness determination unit, a clause expansion unit, and an analysis result output unit.

The storage unit of the embodiment stores the target sentence and the morphological analysis result of the target sentence. It also functions as a work memory for the various processes of the morphological analysis unit, the appropriateness determination unit, the clause expansion unit, and the analysis result output unit.

The morphological analysis unit of the embodiment morphologically analyzes the target sentence in the storage unit and writes to the storage unit a morphological analysis result in which a part-of-speech tag is assigned to each word constituting the target sentence.

The appropriateness determination unit of the embodiment determines whether the morphological analysis result written to the storage unit includes an unknown word tag, which is one of the part-of-speech tags. If it does, the unit sends the word to which the unknown word tag is assigned to the clause expansion unit.

The clause expansion unit of the embodiment determines whether the received word contains a delimiter, that is, a particle, a punctuation mark, a symbol, the sentence head, or the sentence end. If the word contains a delimiter, the unit creates a left extended clause and a right extended clause and writes both to the storage unit.

According to the embodiment, when the word contains a delimiter, the left extended clause is the character string from a left boundary to a right boundary, where the left boundary is the delimiter nearest to the left of the contained delimiter and the right boundary is the contained delimiter itself.

According to the embodiment, when the word contains a delimiter, the right extended clause is the character string from a left boundary to a right boundary, where the left boundary is the contained delimiter itself and the right boundary is the delimiter nearest to the right of the contained delimiter.

When the word contains no delimiter, on the other hand, the clause expansion unit of the embodiment takes the delimiter nearest to the left of the word as the left boundary and the delimiter nearest to the right of the word as the right boundary, creates an extended clause consisting of the character string from the left boundary to the right boundary, and writes this extended clause to the storage unit.

The morphological analysis unit of the embodiment morphologically analyzes the extended clause in the storage unit and writes to the storage unit a morphological analysis result in which a part-of-speech tag is assigned to each word constituting the extended clause.

The appropriateness determination unit of the embodiment determines whether the morphological analysis result of the extended clause includes an unknown word tag. If it does, the unit determines whether the morphological analysis result of the extended clause matches the word to which the unknown word tag was assigned in the previous morphological analysis result.

If they do not match, the appropriateness determination unit of the embodiment sends the word to which the unknown word tag is assigned in the morphological analysis result of the extended clause to the clause expansion unit.

If, on the other hand, the appropriateness determination unit finds that the morphological analysis result of the extended clause matches the word to which the unknown word tag was assigned in the previous morphological analysis result, the analysis result output unit of the embodiment outputs the previous morphological analysis result.

A block diagram showing a configuration example of the language analysis apparatus according to the first embodiment.
A schematic diagram showing the extended clause generation rules in the same embodiment.
A flowchart for explaining an example of the operation in the same embodiment.
A schematic diagram for explaining an example of data transition in the same embodiment.
A flowchart for explaining an example of the operation of the clause expansion unit in the same embodiment.
A schematic diagram for explaining an example of data transition in the same embodiment.
A block diagram showing a configuration example of the language analysis apparatus according to the second embodiment.
A flowchart for explaining an example of the operation in the same embodiment.
A schematic diagram for explaining an example of data transition in the same embodiment.
A schematic diagram for explaining an example of data transition in the same embodiment.
A schematic diagram for explaining an example of data transition in the same embodiment.
A schematic diagram for explaining variations of common analysis errors.

Each embodiment is described below with reference to the drawings. The language analysis apparatus of each embodiment can be implemented either as a hardware configuration or as a combination of hardware resources and software. In the combined configuration, the software is a language analysis program that is installed in advance, from a network or a storage medium, on the computer serving as the language analysis apparatus and that realizes the functions of the language analysis apparatus. This applies to every embodiment.

(First embodiment)
FIG. 1 is a block diagram showing a configuration example of the language analysis apparatus according to the first embodiment. When this language analysis apparatus morphologically analyzes a target sentence containing an unknown word, it outputs a morphological analysis result that includes an unknown word tag assigned to the unknown word.

Specifically, the language analysis apparatus includes a storage unit 11, an input unit 12, a morphological analysis unit 13, an appropriateness determination unit 14, a clause expansion unit 15, and an analysis result output unit 16. The units 11 to 16 are connected to one another via a bus 17.

The storage unit 11 is a storage medium implemented in hardware such as a hard disk drive or a nonvolatile memory device. It stores the sentence data to be analyzed (hereinafter, the target sentence) and the morphological analysis result of the target sentence. The storage unit 11 also stores the control programs executed by the morphological analysis unit 13, the appropriateness determination unit 14, the clause expansion unit 15, and the analysis result output unit 16. In addition, it holds the analysis dictionary for morphological analysis, the determination rules for appropriateness determination, and the creation rules for extended clauses. The storage unit 11 further functions as a work memory for the various processes of the morphological analysis unit 13, the appropriateness determination unit 14, the clause expansion unit 15, and the analysis result output unit 16.

The input unit 12 is an input interface for receiving a target sentence from outside and writing it to the storage unit 11. Usable examples include a receiving device that receives a target sentence from a network (not shown) and writes it to the storage unit 11; a storage medium holding device that detachably holds a computer-readable storage medium (not shown), reads the target sentence from it, and writes the sentence to the storage unit 11; and an input device such as a keyboard that accepts a target sentence through user operation and writes it to the storage unit 11. Because the language analysis apparatus can perform morphological analysis as long as the target sentence is stored in the storage unit 11, any method may be used to write the sentence to the storage unit 11.

The morphological analysis unit 13 has two functions. The target sentence analysis function morphologically analyzes the target sentence in the storage unit 11 and writes to the storage unit 11 a morphological analysis result in which a part-of-speech tag is assigned to each word constituting the target sentence. The extended clause analysis function morphologically analyzes an extended clause in the storage unit 11 and writes to the storage unit 11 a morphological analysis result in which a part-of-speech tag is assigned to each word constituting the extended clause. Specifically, the morphological analysis unit 13 performs word segmentation and part-of-speech tagging on the target sentence in the storage unit 11 while referring to the morphological analysis dictionary stored in the storage unit 11, and stores the analysis result in the storage unit 11. Word segmentation may also be called morpheme segmentation. "A part-of-speech tag is assigned to a word" may be read as "a part-of-speech tag is associated with a word", and a part-of-speech tag may also be called part-of-speech tag information. The embodiments place no restriction on the morphological analysis method, the format of the analysis dictionary, or the format of the analysis result.

The appropriateness determination unit 14 checks the result produced by the morphological analysis unit 13 for analysis failures (for example, the presence of unknown words) on the basis of the appropriateness determination rules stored in the storage unit 11, and thereby determines whether the analysis result is appropriate. The determination result is stored in the storage unit 11. "Appropriateness determination" may be read as "determination of whether an unknown word is present", and "whether an unknown word is present" may be read as "whether an unknown word tag is included". In this embodiment, the appropriateness determination unit 14 has the following functions (f14-1) to (f14-3).

(f14-1) An unknown word determination function that determines whether the morphological analysis result written to the storage unit 11 by the target sentence analysis function includes an unknown word tag, which is one of the part-of-speech tags, and, if it does, sends that morphological analysis result to the clause expansion unit 15. The unknown word tag may also be read as unknown word tag information, an unknown word label, unknown word label information, or the like.

(f14-2) An analysis result determination function that determines whether the morphological analysis result written to the storage unit 11 by the extended clause analysis function includes an unknown word tag and, if it does, determines whether the morphological analysis result of the extended clause matches the word to which the unknown word tag was assigned in the previous morphological analysis result. "A word to which an unknown word tag is assigned" may be read as "a word associated with an unknown word tag".

(f14-3) A sending function that, when the analysis result determination function finds no match, sends the morphological analysis result of the extended clause to the delimiter determination function of the clause expansion unit 15.

The clause expansion unit 15 takes an analysis result that the appropriateness determination unit 14 has determined to be inappropriate, extracts the analysis failure location, cuts out an extended clause containing that location, and stores the extraction result in the storage unit 11. An analysis failure location is a character string at which word segmentation reached a dead end, or a character string that was not found in the analysis dictionary and therefore received no part-of-speech tag; in the usual case it can be regarded as an unknown word. An extended clause is the minimal character string that contains the analysis failure location and is bounded by delimiters defined in the expansion rules stored in the storage unit 11, as shown in FIG. 2. The expansion rules need not be stored in the storage unit 11; they may instead be written in advance as part of the program that realizes the functions of the clause expansion unit 15. In this embodiment, the clause expansion unit 15 has the following functions (f15-1) to (f15-3).

(f15-1) A delimiter determination function that extracts the word to which the unknown word tag is assigned from the morphological analysis result sent by the appropriateness determination unit 14 and determines whether this word contains a delimiter, that is, a particle, a punctuation mark, a symbol, the sentence head, or the sentence end. The set of delimiters is not limited to particles, punctuation marks, symbols, the sentence head, and the sentence end; any character used to divide text, such as a conjunction or a numeral, may be used.

(f15-2) A first extended clause creation function that, when the delimiter determination function finds that the word contains a delimiter, creates a left extended clause and a right extended clause and writes both to the storage unit 11.

Here, the left extended clause is the character string from a left boundary to a right boundary, where the left boundary is the delimiter nearest to the left of the contained delimiter and the right boundary is the contained delimiter itself.

The right extended clause is the character string from a left boundary to a right boundary, where the left boundary is the contained delimiter itself and the right boundary is the delimiter nearest to the right of the contained delimiter.

(f15-3) A second extended clause creation function that, when the delimiter determination function finds that the word contains no delimiter, takes the delimiter nearest to the left of the word as the left boundary and the delimiter nearest to the right of the word as the right boundary, creates an extended clause consisting of the character string from the left boundary to the right boundary, and writes this extended clause to the storage unit 11.

The analysis result output unit 16 outputs the morphological analysis result of a sentence that the appropriateness determination unit 14 has determined to be appropriate. For example, the analysis result output unit 16 has an analysis result output function that outputs the previous morphological analysis result when the analysis result determination function of the appropriateness determination unit 14 finds that the two items compared (the morphological analysis result of the extended clause and the word to which the unknown word tag was assigned in the previous morphological analysis result) match.

Next, the operation of the language analysis apparatus configured as described above is explained with reference to the flowchart of FIG. 3.

In the language analysis apparatus, the user operates the input unit 12 to input a target sentence, which is written to the storage unit 11 (S11). The target sentence may be a sentence typed directly by the user on a keyboard or the like, or it may come from a set of sentences (a document) read from an existing file. In the latter case, the document must first be split so that each target sentence is a single sentence.
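The text does not say how a document is split into sentences. The following is a minimal sketch, assuming that a sentence ends at one of the terminators 。, ！, or ？; the function name `split_sentences` and the terminator set are illustrative assumptions, not part of the embodiment.

```python
import re

# Assumed sentence terminators; the embodiment does not specify them.
_SENTENCE_END = re.compile(r"(?<=[。！？])")


def split_sentences(document):
    """Split a document into sentences, keeping each terminator attached."""
    parts = _SENTENCE_END.split(document)
    # Drop empty fragments and surrounding whitespace such as line breaks.
    return [p.strip() for p in parts if p.strip()]


print(split_sentences("今日は晴れ。明日は雨。"))
# → ['今日は晴れ。', '明日は雨。']
```

Each returned sentence would then be written to the storage unit as one target sentence.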

Next, the morphological analysis unit 13 performs morphological analysis, consisting of word segmentation and part-of-speech tagging, on the target sentence in the storage unit 11 and writes the morphological analysis result to the storage unit 11 (S12). In this embodiment, as shown in FIG. 4 and elsewhere, word boundaries are indicated by "／" and part-of-speech tags are enclosed in "＜" and "＞". Among the part-of-speech tags, the unknown word tag, which marks an unknown word, is written ＜未＞. A part-of-speech tag may also be called a part-of-speech label, and an unknown word tag may also be called an unknown word label.
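To make the notation concrete, here is a small sketch that reads a result written in this style into (word, tag) pairs. The exact serialization (each word followed directly by its tag, chunks joined by "／") is an assumption, since FIG. 4 is not reproduced here, and the sample string and its tags are hypothetical. The embodiment itself places no restriction on the result format.

```python
import re

UNKNOWN_TAG = "未"  # tag text used for unknown words in the figures

# One chunk is assumed to look like: word＜tag＞
_CHUNK = re.compile(r"^(.*)＜([^＜＞]*)＞$")


def parse_result(serialized):
    """Parse 'word＜tag＞／word＜tag＞...' into a list of (word, tag) pairs."""
    tokens = []
    for chunk in serialized.split("／"):
        match = _CHUNK.match(chunk)
        if match is None:
            raise ValueError("malformed chunk: %r" % chunk)
        tokens.append((match.group(1), match.group(2)))
    return tokens


def unknown_words(tokens):
    """Return the words that carry the unknown word tag."""
    return [word for word, tag in tokens if tag == UNKNOWN_TAG]


tokens = parse_result("色＜名詞＞／変＜名詞＞／え＜未＞／を＜助詞＞")
print(unknown_words(tokens))  # → ['え']
```

This sketch would misparse a sentence in which "／" itself is a word, so a real implementation would more likely keep the analysis result as structured data than as a delimited string.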

Next, the appropriateness determination unit 14 determines whether the morphological analysis result in the storage unit 11 is appropriate, on the basis of the determination rules stored in the storage unit 11. Suppose, for example, that the rule is "a morphological analysis result is appropriate if it contains no unknown word tag and inappropriate if it contains one". Then, as shown in FIG. 4, the appropriateness determination unit 14 determines that the morphological analysis result of Example 11, which contains no unknown word tag ＜未＞, is appropriate, and that the morphological analysis result of Example 12, which contains the unknown word tag ＜未＞, is inappropriate.

If the result is determined to be appropriate, the appropriateness determination unit 14 sends the morphological analysis result to the analysis result output unit 16. If it is determined to be inappropriate, the appropriateness determination unit 14 sends the morphological analysis result to the clause expansion unit 15 (S13).

The determination rule described above is only one example. As long as the rule determines whether an unknown or unregistered word is present, it may be defined using notations or numerical values other than "unknown word".

As shown in FIG. 5, the clause expansion unit 15 extracts the analysis failure location from the received morphological analysis result and executes the following steps S14-2 to S14-5 while referring to the creation rules in the storage unit 11, such as those shown in FIG. 2.

That is, when creating an extended clause, the clause expansion unit 15 determines whether the analysis failure location contains a delimiter (S14-2).

If it does, the unit uses that delimiter as a boundary, cuts out an extended clause on each side of it, and fixes the boundaries of the two extended clauses (S14-3).

For example, in Example 13 shown in FIG. 6, the analysis failure location "一面しか" contains the particle "しか", which is a delimiter. For the left extended clause, the delimiter nearest to the left of the contained delimiter "しか", namely the sentence head, becomes the left boundary, and "しか" itself becomes the right boundary. For the right extended clause, "しか" becomes the left boundary, and the particle "が", the delimiter nearest to the right of "しか", becomes the right boundary.

If, on the other hand, the analysis failure location contains no delimiter, the unit cuts out the entire character string that contains the analysis failure location and lies between the nearest delimiters on either side of it, and fixes the boundaries of the extended clause (S14-4).

For example, in Example 12 shown in FIG. 6, the analysis failure location "え" contains no delimiter (no particle, punctuation mark, symbol, sentence head, or sentence end). The particle "の", the delimiter nearest to the left of the word "え", therefore becomes the left boundary, and "を", the delimiter nearest to the right of the word "え", becomes the right boundary.

The clause expansion unit 15 creates an extended clause consisting of the character string between the boundaries fixed in step S14-3 or S14-4 (S15) and writes this extended clause to the storage unit 11. For example, in Example 13 shown in FIG. 6, it creates a left extended clause consisting of the character string "１面", which runs from the left boundary (the sentence head) to the right boundary "しか", and a right extended clause consisting of the character string "編種別データ", which runs from the left boundary "しか" to the right boundary "が". In Example 12 shown in FIG. 6, it creates an extended clause consisting of the character string "色変え", which runs from the left boundary "の" to the right boundary "を".
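The two expansion rules can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: delimiters are found by naive string matching against a small hypothetical list (`DELIMITERS`), whereas the embodiment takes them from the expansion rules of FIG. 2; the sentence head and sentence end act as implicit delimiters; and when a failure location contains several delimiters, the first one is used, a case the text does not address. The two sentences are hypothetical, built around the strings quoted for Examples 12 and 13.

```python
# Hypothetical delimiter list; the real set comes from the expansion rules.
DELIMITERS = ("しか", "の", "を", "が", "、", "。")


def find_delimiters(text):
    """Return sorted (start, end) offsets of every delimiter occurrence."""
    spans = []
    for delim in DELIMITERS:
        pos = text.find(delim)
        while pos != -1:
            spans.append((pos, pos + len(delim)))
            pos = text.find(delim, pos + 1)
    return sorted(spans)


def left_boundary(spans, pos):
    """End offset of the nearest delimiter at or before pos (0 = sentence head)."""
    return max((end for _, end in spans if end <= pos), default=0)


def right_boundary(spans, pos, length):
    """Start offset of the nearest delimiter at or after pos (length = sentence end)."""
    return min((start for start, _ in spans if start >= pos), default=length)


def extend_clause(text, start, end):
    """Return the extended clause(s) for the failure location text[start:end]."""
    spans = find_delimiters(text)
    inner = [(s, e) for s, e in spans if start <= s and e <= end]
    if inner:
        # The failure location contains a delimiter: cut a clause on each side.
        s, e = inner[0]
        left = text[left_boundary(spans, s):s]
        right = text[e:right_boundary(spans, e, len(text))]
        return [left, right]
    # No delimiter inside: cut one clause between the nearest delimiters.
    return [text[left_boundary(spans, start):right_boundary(spans, end, len(text))]]


# Failure location "１面しか" at offsets 0..4 (contains the delimiter "しか").
print(extend_clause("１面しか編種別データがない", 0, 4))  # → ['１面', '編種別データ']
# Failure location "え" at offsets 6..7 (contains no delimiter).
print(extend_clause("ボタンの色変えを行う", 6, 7))  # → ['色変え']
```

String matching would treat any "の" or "が" inside an ordinary word as a delimiter, so a practical version would more plausibly take delimiter positions from the part-of-speech tags of the previous analysis result.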

Next, the morphological analysis unit 13 again performs morphological analysis, consisting of word segmentation and part-of-speech tagging, this time on the extended clause in the storage unit 11, and writes the morphological analysis result to the storage unit 11.

The appropriateness determination unit 14 determines, on the basis of the determination rules, whether the morphological analysis result of the extended clause written to the storage unit 11 is appropriate.

For example, suppose the determination rule is set as follows: "If the morphological analysis result of the extended phrase contains no unknown word, it is judged appropriate. Even if it contains an unknown word, it is judged appropriate when the morphological analysis result of the extended phrase matches the previous morphological analysis result. If it contains an unknown word and does not match the previous morphological analysis result, it is judged inappropriate."
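The example determination rule can be written as a small predicate. This is a sketch under assumptions: results are lists of (surface, tag) pairs, the tag name `"unknown"` is hypothetical, and "matches" is taken to mean equality of the two lists, which the source does not define precisely.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface, tag); hypothetical representation
UNKNOWN = "unknown"         # hypothetical name for the unknown word tag


def is_appropriate(extended: List[Morpheme], previous: List[Morpheme]) -> bool:
    """Apply the example rule to an extended-phrase analysis result."""
    # No unknown word in the extended phrase: appropriate.
    if all(tag != UNKNOWN for _, tag in extended):
        return True
    # Unknown word present: appropriate only if nothing changed
    # compared with the previous analysis result.
    return extended == previous
```

A result judged inappropriate would be handed back to phrase expansion for another round, as the following paragraph describes.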

With this setting, the appropriateness determination unit 14 judges the extended phrase of Example 12 shown in FIG. 4 to be appropriate, and sends the morphological analysis result of the extended phrase and the previous morphological analysis result to the analysis result output unit 16. If it judges the result inappropriate, the appropriateness determination unit 14 sends the morphological analysis result to the phrase expansion unit 15. In that case, the language analysis apparatus repeats the processing from step S14 through step S12 to step S13.

Finally, for a morphological analysis result judged appropriate by the appropriateness determination unit 14, the analysis result output unit 16 compares the morphological analysis result of the extended phrase with the previous morphological analysis result. If no extended phrase was created, or if the two results match, it outputs the morphological analysis result as is.

If the morphological analysis result of the extended phrase does not match the previous morphological analysis result, the analysis result output unit 16 substitutes the morphological analysis result of the extended phrase into the previous morphological analysis result and then outputs it (S15).
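The substitution step can be pictured as a list splice. This is a sketch under assumptions: the previous result is a list of (surface, tag) pairs, the extended phrase is known to cover the half-open index range `start:end` of that list, and the function name and tags are hypothetical.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface, tag); hypothetical representation


def substitute_result(previous: List[Morpheme], start: int, end: int,
                      extended: List[Morpheme]) -> List[Morpheme]:
    """Replace previous[start:end] with the extended-phrase result."""
    return previous[:start] + extended + previous[end:]


# Illustrative data only: the tags and tokenization are assumed.
previous = [("線", "noun"), ("の", "particle"), ("色", "noun"),
            ("変", "noun"), ("え", "unknown"), ("を", "particle")]
merged = substitute_result(previous, 2, 5, [("色変え", "noun")])
print(merged)
```

The splice returns a new list and leaves the previous result untouched, which keeps it available for the comparison step.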

As described above, according to this embodiment, phrase expansion is performed on the unknown word in a morphological analysis result that contains an unknown word, and the morphological analysis result is recursively made appropriate. This yields a morphological analysis result with high unknown-word extraction accuracy.

(Second Embodiment)
FIG. 7 is a block diagram showing a configuration example of a language analysis apparatus according to the second embodiment. Functional units substantially identical to those in FIG. 1 are given the same reference numerals and their detailed description is omitted; the differences are mainly described here.

That is, compared with the configuration shown in FIG. 1, the language analysis apparatus 20 of this embodiment includes, as enclosed by the broken line, an unknown word extraction unit 21 and a completely unknown word determination unit 22 in place of the appropriateness determination unit 14, and an unregistered word output unit 23 in place of the analysis result output unit 16. With the change to units 21 to 23, the language analysis apparatus becomes an apparatus that outputs the unknown word when a target sentence containing an unknown word is morphologically analyzed. The programs and data stored in the storage unit 11 change accordingly. For example, the storage unit 11 stores the control programs executed by the morpheme analysis unit 13, the phrase expansion unit 15, the unknown word extraction unit 21, the completely unknown word determination unit 22, and the unregistered word output unit 23. The storage unit 11 also stores the analysis dictionary for morphological analysis, the completely unknown word determination condition, and the rules for creating extended phrases, and serves as work memory for the various processes of these units. In addition, with the change to units 21 to 23, the data input/output destinations of the phrase expansion unit 15 are changed.

Here, the unknown word extraction unit 21 has an unknown word tag determination function: it determines whether the morphological analysis result written to the storage unit 11 by the morpheme analysis unit 13 contains an unknown word tag and, if so, extracts the word bearing the unknown word tag from the morphological analysis result and writes it to the storage unit 11.

The completely unknown word determination unit 22 has a function of determining whether a word bearing an unknown word tag in the storage unit 11 (that is, the morphological analysis result under determination) satisfies the completely unknown word determination condition.

Here, the completely unknown word determination condition is that the following first condition [1] and second condition [2] are satisfied simultaneously. In other words, the determination by the completely unknown word determination unit 22 consists of determining whether both conditions hold.

Completely unknown word determination condition [1]: the word is not a single hiragana or katakana character.

Completely unknown word determination condition [2]: the morphological analysis result consists of a single morpheme whose part of speech is unknown word.

Under this condition, for example, the evaluation target 「全検＜未＞」 is judged to be a completely unknown word. A completely unknown word is a word that satisfies every part of the completely unknown word determination condition, and 「＜未＞」 is the unknown word tag. The evaluation targets 「え＜未＞」, 「ア＜未＞」, 「合計＜名＞」, and 「→＜記号＞特速＜未＞」 are judged not to be completely unknown words.
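Conditions [1] and [2] and the five evaluation targets above can be checked with a short predicate. This is a sketch under assumptions: results are lists of (surface, tag) pairs, the tag names (`"unknown"`, `"noun"`, `"symbol"`) are hypothetical stand-ins for ＜未＞, ＜名＞, and ＜記号＞, and "hiragana or katakana" is approximated by the Unicode ranges U+3041 to U+3096 and U+30A1 to U+30FA.

```python
import re
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface, tag); hypothetical representation
UNKNOWN = "unknown"         # stands in for the unknown word tag

# One hiragana or one katakana character (assumed Unicode ranges).
_SINGLE_KANA = re.compile(r"[\u3041-\u3096\u30A1-\u30FA]")


def is_completely_unknown(result: List[Morpheme]) -> bool:
    """Return True when conditions [1] and [2] both hold."""
    # Condition [2]: a single morpheme whose tag is the unknown word tag.
    if len(result) != 1 or result[0][1] != UNKNOWN:
        return False
    # Condition [1]: not a single hiragana or katakana character.
    return _SINGLE_KANA.fullmatch(result[0][0]) is None


print(is_completely_unknown([("全検", UNKNOWN)]))                    # True
print(is_completely_unknown([("え", UNKNOWN)]))                      # False
print(is_completely_unknown([("→", "symbol"), ("特速", UNKNOWN)]))   # False
```

Checking condition [2] first keeps the kana test from running on multi-morpheme results, where it would be meaningless.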

With the addition of the completely unknown word determination unit 22, the delimiter determination function (f15-1) of the phrase expansion unit 15 becomes a function that, when the determination by the completely unknown word determination means is negative, determines whether the word bearing the unknown word tag contains a delimiter corresponding to a particle, punctuation mark, symbol, sentence start, or sentence end. The first extended phrase creation function (f15-2) and the second extended phrase creation function (f15-3) of the phrase expansion unit 15 are as described above.

The unregistered word output unit 23 outputs the morpheme sent from the completely unknown word determination unit 22 as an unregistered word. It also divides unregistered words into types according to the timing and kind of determination and assigns a different label to each. Specifically, the unregistered word output unit 23 has the following functions (f23-1) and (f23-2).

(f23-1) A function that assigns an unregistered word label to an unknown word and outputs it when the determinations by the unknown word extraction unit 21 and the completely unknown word determination unit 22, made on the morphological analysis result written to the storage unit 11 by the extended phrase analysis function (f13-2) of the morpheme analysis unit 13, show that the word bearing the unknown word tag satisfies the first condition [1] and the second condition [2] simultaneously. Here, "assigns a label" may be read as "associates with label information". This applies not only to the unregistered word label but to other labels as well.

(f23-2) A function that assigns a label indicating an input error to an unknown word and outputs it when the determinations by the unknown word extraction unit 21 and the completely unknown word determination unit 22, made on the morphological analysis result written to the storage unit 11 by the extended phrase analysis function (f13-2) of the morpheme analysis unit 13, show that the word bearing the unknown word tag does not satisfy the first condition [1] and the second condition [2] simultaneously.

Next, the operation of the language analysis apparatus configured as described above will be described with reference to the flowchart of FIG. 8.

First, steps S21 and S22 are executed in the same way as steps S11 and S12 described above. As a result, the target sentence is input and written to the storage unit 11, and then the morphological analysis result of the target sentence is written to the storage unit 11.

Here, the unknown word extraction unit 21 determines whether the morphological analysis result in the storage unit 11 contains an unknown word tag, and thereby judges whether there is a morpheme whose part of speech is unknown word. If there is no unknown word tag and hence no unknown word, processing of the morphological analysis result of the target sentence ends, and the language analysis apparatus moves on to the input of the next target sentence without outputting anything (Example 21 shown in FIG. 9). If there is an unknown word tag and hence an unknown word, the unknown word extraction unit 21 writes the word bearing the unknown word tag to the storage unit 11 and sends it to the morpheme analysis unit 13 (S23). For example, in Example 22 shown in FIG. 9, the word 「変臨」 bearing the unknown word tag "＜未＞" is written to the storage unit 11.

Next, the morpheme analysis unit 13 morphologically analyzes the unknown word it receives, writes the morphological analysis result to the storage unit 11, and sends it to the completely unknown word determination unit 22 (S24).

To decide whether the unknown word is a completely unknown word, the completely unknown word determination unit 22 determines whether the morphological analysis result of the unknown word alone satisfies the completely unknown word determination condition. If it is a completely unknown word (Example 22 shown in FIG. 9), the unit sends the unknown word to the unregistered word output unit 23.

If, on the other hand, the word is determined not to be a completely unknown word, the completely unknown word determination unit 22 sends the unknown word to the phrase expansion unit 15 (S25). For example, in Examples 23 and 24 shown in FIG. 10 and Example 25 shown in FIG. 11, the unknown words that are not completely unknown words are 「分」, 「１面しか」, and 「されたた」, respectively.

The phrase expansion unit 15 creates an extended phrase based on the unknown word it receives and the morphological analysis result of the target sentence in the storage unit 11. The extended phrase is created in the same way as in step S14 described above.

For example, in Example 23 shown in FIG. 10, the analysis failure point 「分」 contains no delimiter (particle, punctuation mark, symbol, sentence start, or sentence end). The sentence start, the delimiter closest to the left of the word 「分」 used in the determination, therefore becomes the left delimiter, and 「を」, the delimiter closest to the right of that word, becomes the right delimiter. An extended phrase consisting of the character string 「遅い時分」, from the sentence start to the right delimiter 「を」, is then created.

In Example 24 shown in FIG. 10, as in Example 13 described above, the left extended phrase 「１面」 and the right extended phrase 「編種別データ」 are created.

In Example 25 shown in FIG. 11, the analysis failure point 「されたた」 contains no delimiter (particle, punctuation mark, symbol, sentence start, or sentence end). The particle 「が」, the delimiter closest to the left of the word 「されたた」 used in the determination, therefore becomes the left delimiter, and 「が」, the delimiter closest to the right of that word, becomes the right delimiter. An extended phrase consisting of the character string 「保存されたた各ダイヤ」, from the left delimiter 「が」 to the right delimiter 「が」, is then created.

In either case, the phrase expansion unit 15 writes the created extended phrase to the storage unit 11 and sends it to the morpheme analysis unit 13 (S26).
The morpheme analysis unit 13 morphologically analyzes the extended phrase it receives, writes the resulting morphological analysis result to the storage unit 11, and sends it to the unknown word extraction unit 21 (S27).

The unknown word extraction unit 21 determines whether the morphological analysis result of the extended phrase it receives contains an unknown word tag, and thereby judges whether there is a morpheme whose part of speech is unknown word. If there is no unknown word tag and hence no unknown word, the language analysis apparatus ends processing of the input target sentence, outputs nothing, and moves on to the input of the next target sentence (Example 23 shown in FIG. 10). If there is an unknown word tag and hence an unknown word, the unknown word extraction unit 21 writes the unknown word to the storage unit 11 and sends it to the morpheme analysis unit 13 (Example 23 shown in FIG. 9 and Example 24 shown in FIG. 10) (S28). For example, in Example 24 shown in FIG. 10 and Example 25 shown in FIG. 11, the unknown words are 「編種」 and 「されたた」, respectively.

The morpheme analysis unit 13 morphologically analyzes the unknown word it receives, writes the resulting morphological analysis result to the storage unit 11, and sends it to the completely unknown word determination unit 22 (S29).

To decide whether the unknown word is a completely unknown word, the completely unknown word determination unit 22 determines whether the morphological analysis result of the unknown word alone satisfies the completely unknown word determination condition. If it is a completely unknown word (「編種」 in Example 24 shown in FIG. 10), the unit assigns a label indicating "unregistered word" to the unknown word and sends it to the unregistered word output unit 23. As Example 24 shows, a label indicating an extended unknown word may be used as the label indicating an unregistered word.

If, on the other hand, the completely unknown word determination condition is not satisfied and the word is not a completely unknown word (「されたた」 in Example 25 shown in FIG. 11), the completely unknown word determination unit 22 assigns a label indicating "input error" to the unknown word and sends it to the unregistered word output unit 23 (S30).

The unregistered word output unit 23 outputs the unknown word sent from the completely unknown word determination unit 22 as an unregistered word (S31).

For example, if the completely unknown word determination unit 22 assigns different labels to unknown words judged to be "unregistered words" and unknown words judged to be "input errors", unregistered words and input errors can be output separately in step S31.
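The flow of steps S21 to S31 can be summarized in a short driver. This is a sketch under assumptions, not the patented program: the analyzer and the phrase expander are passed in as callables, results are lists of (surface, tag) pairs, and the label strings are hypothetical. The stub analyzer below is a lookup table; its entry for 「されたた」 is invented so that condition [2] fails, and the sentence keys are placeholders because the full example sentences appear only in the figures.

```python
import re
from typing import Callable, List, Tuple

Morpheme = Tuple[str, str]  # (surface, tag); hypothetical representation
Analyzer = Callable[[str], List[Morpheme]]
Expander = Callable[[List[Morpheme], str], List[str]]

UNKNOWN = "unknown"  # stands in for the unknown word tag
_SINGLE_KANA = re.compile(r"[\u3041-\u3096\u30A1-\u30FA]")


def unknown_words(result: List[Morpheme]) -> List[str]:
    """Surfaces of all morphemes bearing the unknown word tag."""
    return [surface for surface, tag in result if tag == UNKNOWN]


def is_completely_unknown(result: List[Morpheme]) -> bool:
    """Conditions [1] and [2] of the completely unknown word test."""
    return (len(result) == 1 and result[0][1] == UNKNOWN
            and _SINGLE_KANA.fullmatch(result[0][0]) is None)


def extract(sentence: str, analyze: Analyzer,
            expand: Expander) -> List[Tuple[str, str]]:
    """Return (word, label) pairs for one target sentence."""
    output = []
    result = analyze(sentence)                            # S22
    for word in unknown_words(result):                    # S23
        if is_completely_unknown(analyze(word)):          # S24, S25
            output.append((word, "unregistered_word"))
            continue
        for phrase in expand(result, word):               # S26
            for found in unknown_words(analyze(phrase)):  # S27, S28
                if is_completely_unknown(analyze(found)):  # S29
                    output.append((found, "unregistered_word"))
                else:
                    output.append((found, "input_error"))  # S30
    return output                                         # S31


# Stub analyzer and expander; all entries are illustrative assumptions.
_TABLE = {
    "<example 22>": [("変臨", UNKNOWN)],
    "変臨": [("変臨", UNKNOWN)],
    "<example 25>": [("されたた", UNKNOWN)],
    "されたた": [("され", "verb"), ("たた", UNKNOWN)],
    "保存されたた各ダイヤ": [("保存", "noun"), ("されたた", UNKNOWN),
                    ("各ダイヤ", "noun")],
}
stub_analyze: Analyzer = lambda text: _TABLE[text]
stub_expand: Expander = lambda result, word: ["保存されたた各ダイヤ"]

print(extract("<example 22>", stub_analyze, stub_expand))
print(extract("<example 25>", stub_analyze, stub_expand))
```

Keeping the two labels as separate strings is what allows a caller to list unregistered words and input errors separately, as the paragraph above notes.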

As described above, according to this embodiment, as in the first embodiment, phrase expansion is performed on the unknown word in a morphological analysis result that contains an unknown word and morphological analysis is executed recursively, so unregistered words can be extracted while reducing analysis errors for unknown words.

In addition, according to this embodiment, completely unknown words are identified by a determination that uses the completely unknown word determination condition, so the extraction accuracy for unknown words can be improved compared with the first embodiment.

Furthermore, according to this embodiment, when unknown words caused by input errors and unknown words caused by unregistered words are output separately as part of the completely unknown word determination, input errors can be detected.

According to each of the embodiments described above, accurate extraction of unknown words can be achieved simply, without using frequency or large-scale external information such as dictionaries and corpora. Furthermore, incorporating the corrected morphological analysis result into the original morphological analysis result improves the accuracy of morphological analysis.

Here, the analysis errors that each embodiment reduces are explained in more detail.
In general, in the field of language analysis, unregistered words found through morphological analysis are added to the dictionary in order to improve analysis accuracy. In many cases, unregistered words include highly specialized terms, new words, or terms defined by users themselves. Registering unregistered words is therefore also useful for extracting specialized knowledge.

Unregistered words are usually output by morphological analysis as "unknown words", so extracting unknown words as unregistered words is a simple approach. However, a morphological analysis result may also contain unknown words other than unregistered words, caused by analysis errors in which the analysis failed or reached a dead end somewhere in the sentence. Unknown words caused by analysis errors are not unregistered words, so they lower the extraction accuracy for unregistered words.

Therefore, to improve the extraction accuracy for unregistered words, unknown words caused by analysis errors must be reduced in the morphological analysis result. As shown in FIG. 12, analysis errors are classified into four types: over-determination (過剰判定), positional shift (位置ズレ), over-merging (過剰結合), and over-segmentation (過剰分割).

Analysis errors of the fourth type, over-segmentation, can be reduced by conventional techniques, but those caused by over-determination, positional shift, and over-merging cannot. Verification experiments have shown that about 20% of analysis errors are over-segmentation and the remaining 80% are over-determination, positional shift, and over-merging. The technique of each embodiment can reduce the analysis errors in this remaining 80%.

The techniques described in the above embodiments can also be stored, as a program executable by a computer, on a storage medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disc (CD-ROM, DVD, etc.), a magneto-optical disk (MO), or a semiconductor memory, and distributed in that form.

The storage medium may use any storage format, as long as it can store the program and can be read by a computer.

In addition, an OS (operating system) running on the computer, or MW (middleware) such as database management software or network software, may execute part of each process for realizing the above embodiments, based on the instructions of the program installed on the computer from the storage medium.

Furthermore, the storage medium in the above embodiments is not limited to a medium independent of the computer; it also includes a storage medium on which a program transmitted over a LAN, the Internet, or the like has been downloaded and stored or temporarily stored.

The number of storage media is not limited to one. A case in which the processing of the above embodiments is executed from a plurality of media is also covered by the storage medium of the embodiments, and the media may have any configuration.

The computer in the above embodiments executes each process of the embodiments based on the program stored on the storage medium, and may have any configuration, such as a single apparatus like a personal computer or a system in which a plurality of apparatuses are connected over a network.

The computer in the above embodiments is not limited to a personal computer. It also includes arithmetic processing units and microcomputers contained in information processing equipment, and is a generic term for equipment and apparatus capable of realizing the functions of the above embodiments by means of a program.

The above embodiments are not limited to the forms described; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist. Various modifications can also be formed by suitably combining the constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all those shown in an embodiment. Constituent elements from different embodiments may also be combined as appropriate.

DESCRIPTION OF SYMBOLS: 11 … storage unit, 12 … input unit, 13 … morpheme analysis unit, 14 … appropriateness determination unit, 15 … phrase expansion unit, 16 … analysis result output unit, 17, 28 … bus, 21 … unknown word extraction unit, 22 … completely unknown word determination unit, 23 … unregistered word output unit.

Claims

A storage means for storing the target sentence and the morphological analysis result of the target sentence is provided, and when a target sentence including an unknown word is analyzed, a morphological analysis result including an unknown word tag attached to the unknown word is output. A language analysis program used in a language analysis device for
The language analysis device;
A target sentence analysis unit that writes a morpheme analysis result in which a part of speech tag is added to each word constituting the target sentence by performing a morphological analysis on the target sentence in the storage unit;
For the written morpheme analysis result, determine whether or not an unknown word tag of the part of speech tag is included, and if an unknown word tag is included, an unknown word determination means for sending the morpheme analysis result,
A delimiter character that extracts a word with an unknown word tag from the sent morpheme analysis result and determines whether or not a delimiter character corresponding to a particle, punctuation mark, symbol, sentence head, or sentence end is included in this word Determination means,
As a result of this determination, if the word contains a delimiter, a first extended phrase creating means for creating a left extended phrase consisting of the character string from a left delimiter to a right delimiter, with the delimiter closest to the left of the included delimiter as the left delimiter and the included delimiter as the right delimiter, creating a right extended phrase consisting of the character string from a left delimiter to a right delimiter, with the included delimiter as the left delimiter and the delimiter closest to the right of the included delimiter as the right delimiter, and writing the right extended phrase and the left extended phrase to the storage means,
If the result of the determination by the delimiter determining means is negative, the delimiter closest to the left of the word used for the determination is the left delimiter, and the delimiter closest to the right of the word is the right delimiter from the left delimiter A second extended phrase creating means for creating an extended phrase consisting of a character string up to the right delimiter and writing the extended phrase in the storage means;
An expanded phrase analysis unit that writes a morpheme analysis result in which a part-of-speech tag is added to a word constituting the expanded phrase by morphological analysis of the expanded phrase in the storage unit;
An analysis result determination means for determining whether an unknown word tag is included in the morphological analysis result written in the storage means by the extended phrase analysis means and, if an unknown word tag is included, determining whether the morphological analysis result of the extended phrase matches the word to which the unknown word tag is assigned in the previous morphological analysis result,
If the result of this determination is NO, delimiter character determination sending means for sending the morpheme analysis result of the extended clause to the delimiter character determination means,
As a result of the determination by the analysis result determination means, if both match, an analysis result output means for outputting the previous morpheme analysis result,
Language analysis program to function as

A language analysis used for a language analysis device for outputting the unknown word when a morphological analysis is performed on a target sentence including an unknown word, including storage means for storing the target sentence and a morphological analysis result of the target sentence A program,
The language analysis device;
A morpheme analysis unit that writes a morpheme analysis result in which a part-of-speech tag is assigned to each word constituting the target sentence to the storage unit by performing morphological analysis on the target sentence in the storage unit;
It is determined whether or not an unknown word tag of the part-of-speech tags is included in the written morpheme analysis result, and when an unknown word tag is included, the word to which the unknown word tag is assigned is stored in the storage unit Unknown word tag determination means for writing to
Complete unknown word determination means for determining whether the word with the unknown word tag in the storage means simultaneously satisfies a first condition that the word is not a single hiragana or katakana character, and a second condition that its morphological analysis result consists of a single morpheme whose part-of-speech tag is the unknown word tag,
Delimiter determination means for determining, when the result of this determination is negative, whether the word given the unknown word tag includes a delimiter corresponding to a particle, a punctuation mark, a symbol, a sentence head, or a sentence end,
First extended phrase creating means for, when this determination finds that a delimiter is included in the word, creating a left extended phrase consisting of the character string from a left delimiter to a right delimiter, where the left delimiter is the delimiter closest to the left of the included delimiter and the right delimiter is the included delimiter; creating a right extended phrase consisting of the character string from a left delimiter to a right delimiter, where the left delimiter is the included delimiter and the right delimiter is the delimiter closest to the right of the included delimiter; and writing the right extended phrase and the left extended phrase to the storage means;
Second extended phrase creating means for, when the result of the determination by the delimiter determination means is negative, creating an extended phrase consisting of the character string from a left delimiter to a right delimiter, where the left delimiter is the delimiter closest to the left of the word used for the determination and the right delimiter is the delimiter closest to the right of the word, and writing the extended phrase to the storage means;
Extended phrase analysis means for performing morphological analysis on the extended phrase in the storage means and writing to the storage means a morphological analysis result in which a part-of-speech tag is assigned to each word constituting the extended phrase;
Means for, when the determinations by the unknown word tag determination means and the complete unknown word determination means on the morphological analysis result written to the storage means by the extended phrase analysis means find that the word to which the unknown word tag is assigned simultaneously satisfies the first condition and the second condition, assigning an unregistered word label to that unknown word and outputting it,
the language analysis program causing the language analysis device to function as each of the above means.
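
The first and second extended phrase creating means can be illustrated with the sketch below. It rests on several simplifying assumptions that are not taken from the claims: delimiters are modeled as single characters drawn from a small hypothetical set, the sentence head and sentence end are modeled as positions -1 and `len(sentence)`, the delimiters themselves are excluded from the extended phrase, and only the first delimiter inside the word is used. All names are hypothetical.

```python
from typing import List

# Hypothetical delimiter set: a few particles and punctuation marks.
DELIMITERS = set("はがをにの、。")


def boundaries(sentence: str) -> List[int]:
    """Delimiter positions; -1 and len(sentence) stand for sentence head and end."""
    inner = [i for i, ch in enumerate(sentence) if ch in DELIMITERS]
    return [-1] + inner + [len(sentence)]


def nearest_left(bounds: List[int], pos: int) -> int:
    """Closest boundary strictly to the left of pos."""
    return max(b for b in bounds if b < pos)


def nearest_right(bounds: List[int], pos: int) -> int:
    """Closest boundary strictly to the right of pos."""
    return min(b for b in bounds if b > pos)


def extend(sentence: str, start: int, end: int) -> List[str]:
    """Return the extended phrase(s) for the word sentence[start:end]."""
    bounds = boundaries(sentence)
    inside = [b for b in bounds if start <= b < end]
    if inside:
        # Word contains a delimiter: build left and right extended phrases.
        d = inside[0]
        left = sentence[nearest_left(bounds, d) + 1:d]
        right = sentence[d + 1:nearest_right(bounds, d)]
        return [left, right]
    # No delimiter in the word: extend to the nearest delimiter on each side.
    left_pos = nearest_left(bounds, start)
    right_pos = nearest_right(bounds, end - 1)
    return [sentence[left_pos + 1:right_pos]]
```

For the sentence 私はスカイツリーに行く。, the fragment at positions 3 to 6 contains no delimiter and is extended to the single phrase between は and に, while a fragment spanning に yields one phrase on each side of it.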

The language analysis program according to claim 2,
Means for, when the determinations by the unknown word tag determination means and the complete unknown word determination means on the morphological analysis result written to the storage means by the extended phrase analysis means find that the word to which the unknown word tag is assigned does not simultaneously satisfy the first condition and the second condition, assigning a label indicating an input error to that unknown word and outputting it,
the language analysis program further causing the language analysis device to function as the above means.
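
The two conditions and the resulting labels can be sketched as follows. This is a hedged illustration only: the tag name `UNK`, the label strings, the function names, and the use of the Unicode Hiragana and Katakana blocks (U+3040 to U+30FF) to recognize kana are assumptions, not details taken from the claims.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface form, part-of-speech tag); assumed representation
UNKNOWN = "UNK"             # hypothetical name for the unknown word tag


def is_single_kana(word: str) -> bool:
    """True if word is exactly one character in the Hiragana or Katakana block."""
    return len(word) == 1 and "\u3040" <= word <= "\u30ff"


def label_unknown(word: str, analysis: List[Morpheme]) -> str:
    """Label a word tagged as unknown after the extended phrase is analyzed."""
    # First condition: the word is not a single hiragana or katakana character.
    first = not is_single_kana(word)
    # Second condition: the analysis is a single morpheme tagged as unknown.
    second = len(analysis) == 1 and analysis[0][1] == UNKNOWN
    if first and second:
        return "unregistered_word"
    return "input_error"
```

A multi-character word analyzed as one unknown morpheme is labeled as an unregistered word; a lone kana character, or an unknown fragment left over beside known morphemes, is labeled as an input error.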