JP2019016162A

JP2019016162A - Morphological analysis program, morphological analysis device, and morphological analysis method

Info

Publication number: JP2019016162A
Application number: JP2017133065A
Authority: JP
Inventors: Hajime Morita (森田 一); Tomoya Iwakura (岩倉 友哉)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-07-06
Filing date: 2017-07-06
Publication date: 2019-01-31
Also published as: CN109213992A

Abstract

PROBLEM TO BE SOLVED: To improve the analysis accuracy of morphological analysis.
SOLUTION: A morphological analysis device comprises: a storage unit that stores a morphological analysis dictionary and a matching dictionary, the matching dictionary including character strings contained in each of a plurality of sentences and a first morphological analysis result obtained in common for those character strings across the plurality of sentences; a first analysis unit that outputs the first morphological analysis result for the character strings in an analysis target text that match the character strings included in the matching dictionary; and a second analysis unit that, for the remaining character strings in the analysis target text that do not match the character strings included in the matching dictionary, generates a lattice including a plurality of morphological analysis result candidates using the morphological analysis dictionary, morphologically analyzes the remaining character strings using the lattice, and outputs a second morphological analysis result for the remaining character strings.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a morphological analysis program, a morphological analysis device, and a morphological analysis method.

In recent years, the amount of information on the Internet has grown dramatically and businesses that use big data have multiplied, so efficient processing of big data is desired. For documents written without delimiters such as spaces between words, as in Japanese, Chinese, or Korean, morphological analysis is performed in order to compute word frequencies.

Morphological analysis is the process of dividing text into morphemes and assigning part-of-speech information to each morpheme. The morphemes obtained may also be treated as words. Morphological analysis determines the relationships between words in a document and their parts of speech, so the text in the document can be divided into words. However, morphological analysis imposes a heavy processing load, so processing a large amount of text takes a long time.

Morphological analysis uses a lattice: a graph structure that enumerates word candidates (analysis candidates), built by extracting from a dictionary every word that matches part of the surface form of the character string to be analyzed.

FIG. 1 is a diagram illustrating an example of a lattice.
FIG. 1 shows the lattice built for the input sentence 「送られてきた」 (roughly, "was sent (to us)"). Morphological analysis determines the correct morpheme sequence on the constructed lattice by taking the context (the preceding and following morphemes) into account. As a result, 「送られてきた」 is analyzed as 「送ら (verb, irrealis form)｜れて (suffix)｜きた (suffix)」. Building a lattice is computationally expensive and takes time.
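As a rough illustration of the lattice described above, the following Python sketch enumerates every dictionary word that matches a substring of 「送られてきた」. The names TOY_DICT and build_lattice, the dictionary contents, and every part-of-speech label other than those given for 送ら, れて, and きた are hypothetical and are not taken from the specification.

```python
from collections import defaultdict

# Hypothetical toy dictionary: surface form -> part-of-speech labels.
# Only the labels for 送ら, れて and きた follow the FIG. 1 description.
TOY_DICT = {
    "送ら": ["verb (irrealis form)"],
    "送": ["noun"],
    "ら": ["suffix"],
    "れて": ["suffix"],
    "れ": ["suffix"],
    "て": ["particle"],
    "きた": ["suffix"],
    "き": ["noun"],
    "た": ["auxiliary verb"],
}


def build_lattice(text, dictionary):
    """Return {start: [(end, surface, pos), ...]} for every dictionary word in text."""
    lattice = defaultdict(list)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            for pos in dictionary.get(surface, []):
                lattice[start].append((end, surface, pos))
    return dict(lattice)


lattice = build_lattice("送られてきた", TOY_DICT)
for start in sorted(lattice):
    print(start, lattice[start])
```

In this naive sketch every start offset is paired with every end offset, which hints at why lattice construction becomes costly on long inputs.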

A known method speeds up morphological analysis by using pattern matching instead of building a computationally expensive lattice (see, for example, Non-Patent Document 1). A word segmentation device that divides a sentence into two or more words at high speed is known (see, for example, Patent Document 1). A dictionary registration device that obtains a highly accurate word segmentation dictionary is also known (see, for example, Patent Document 2).

Patent Document 1: JP 2014-106707 A
Patent Document 2: JP 2014-120007 A

Non-Patent Document 1: Manabu Sassano, “Deterministic Word Segmentation Using Maximum Matching with Fully Lexicalized Rules”, Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 79-83, Gothenburg, Sweden, April 26-30, 2014

However, although the morphological analysis method of Non-Patent Document 1 performs morphological analysis quickly by pattern matching, it can output incorrect results, so its analysis accuracy is low.

In one aspect, an object of the present invention is to improve the analysis accuracy of morphological analysis.

A morphological analysis program according to an embodiment causes a computer to execute the following processes. The computer includes a storage unit that stores a morphological analysis dictionary and a matching dictionary, the matching dictionary containing character strings included in each of a plurality of sentences and a first morphological analysis result of each character string obtained in common across the plurality of sentences.

The computer outputs the first morphological analysis result for any character string in the analysis target text that matches a character string included in the matching dictionary.

For the remaining character strings in the analysis target text that do not match any character string included in the matching dictionary, the computer uses the morphological analysis dictionary to generate a lattice containing a plurality of candidate morphological analysis results.
The computer performs morphological analysis on the remaining character strings using the lattice and outputs a second morphological analysis result for the remaining character strings.

According to the embodiment, the analysis accuracy of morphological analysis can be improved.

FIG. 1 is a diagram illustrating an example of a lattice.
FIG. 2 is a configuration diagram of the morphological analysis device according to the embodiment.
FIG. 3 is an example of a context-independent dictionary.
FIG. 4 shows examples of context-dependent character strings and their analysis results.
FIG. 5 is a diagram illustrating morphological analysis processing according to the embodiment.
FIG. 6 is a diagram illustrating morphological analysis processing according to the embodiment.
FIG. 7 is a flowchart of the context-independent dictionary generation process according to the embodiment.
FIG. 8 is a flowchart of the morphological analysis process according to the embodiment.
FIG. 9 is a diagram illustrating a lattice for an unanalyzed character string and the morphemes before and after it.
FIG. 10 is a diagram illustrating a lattice for an unanalyzed character string and the morphemes before and after it.
FIG. 11 is a diagram illustrating the morpheme sequence obtained as the analysis result for an unanalyzed character string.
FIG. 12 is a diagram illustrating the morpheme sequence obtained as the analysis result for an unanalyzed character string.
FIG. 13 is a flowchart of a modification of the morphological analysis process according to the embodiment.
FIG. 14 is a diagram illustrating the lattice of an input sentence that contains an unanalyzed character string.
FIG. 15 is a diagram illustrating the morpheme sequence of an input sentence, including the analysis result for an unanalyzed character string.
FIG. 16 is a configuration diagram of an information processing device.

Embodiments are described below with reference to the drawings.
First, consider morphological analysis performed with the conventional technique of Non-Patent Document 1. That technique first outputs a morpheme sequence for the sentence to be analyzed using the longest match method with a dictionary. Among the morpheme sequences that were output incorrectly, any that matches a replacement pattern is then replaced with the correct morpheme sequence according to that pattern.
(First example of morphological analysis by the conventional technique)
For the input sentence 「非常に評判がいいわけだ」 (roughly, "so it has a very good reputation"), the longest match method yields 「非常に｜評判｜が｜いいわけ｜だ」, in which 「いいわけ」 ("excuse") is wrongly taken as a single word. To correct erroneous results, the conventional technique consults the replacement patterns and corrects any matching morpheme sequence.

Suppose there is a replacement pattern that corrects 「が｜いいわけ｜だ」 to 「が｜いい｜わけだ」. Then 「非常に｜評判｜が｜いいわけ｜だ」 is corrected to 「非常に｜評判｜が｜いい｜わけだ」. Thus, when a suitable replacement pattern exists, the correct result 「非常に｜評判｜が｜いい｜わけだ」 is obtained for 「非常に評判がいいわけだ」.

However, when no such replacement pattern exists, the morpheme sequence is not corrected, and the incorrect result 「非常に｜評判｜が｜いいわけ｜だ」 is obtained for 「非常に評判がいいわけだ」.
(Second example of morphological analysis by the conventional technique)
For the input sentence 「人手不足と言うがいいわけだ」 (roughly, "they call it a labor shortage, but that is an excuse"), the longest match method yields 「人手｜不足｜と｜言う｜が｜いいわけ｜だ」. To correct erroneous results, the conventional technique consults the replacement patterns and corrects any matching morpheme sequence.

If no replacement pattern exists that corrects a morpheme sequence contained in 「人手｜不足｜と｜言う｜が｜いいわけ｜だ」, the result is left uncorrected, and the correct result 「人手｜不足｜と｜言う｜が｜いいわけ｜だ」 is obtained.

Now suppose there is a replacement pattern that corrects 「が｜いいわけ｜だ」 to 「が｜いい｜わけだ」. Then 「人手｜不足｜と｜言う｜が｜いいわけ｜だ」 is changed to 「人手｜不足｜と｜言う｜が｜いい｜わけだ」. Applying the replacement pattern therefore produces the incorrect result 「人手｜不足｜と｜言う｜が｜いい｜わけだ」 for 「人手不足と言うがいいわけだ」.

As these examples show, replacement patterns are not rules that take context into account, so applying them can produce incorrect analysis results.
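The two conventional-technique examples can be reproduced with a small Python sketch of longest-match segmentation followed by one context-free replacement pattern. This is only an approximation of the approach summarized above, not the implementation of Non-Patent Document 1; the word list WORDS and the function names are assumptions made for illustration.

```python
def longest_match(text, words):
    """Greedy left-to-right segmentation that always takes the longest known word."""
    out, i = [], 0
    max_len = max(len(w) for w in words)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            # Fall back to a single character when nothing in the word list matches.
            if text[i:i + n] in words or n == 1:
                out.append(text[i:i + n])
                i += n
                break
    return out


def apply_pattern(morphemes, pattern, replacement):
    """Replace every occurrence of the morpheme sub-sequence `pattern`."""
    out, i, k = [], 0, len(pattern)
    while i < len(morphemes):
        if morphemes[i:i + k] == pattern:
            out.extend(replacement)
            i += k
        else:
            out.append(morphemes[i])
            i += 1
    return out


# Hypothetical word list covering both example sentences.
WORDS = {"非常に", "評判", "が", "いいわけ", "いい", "わけだ", "だ",
         "人手", "不足", "と", "言う"}
PATTERN = ["が", "いいわけ", "だ"]
REPLACEMENT = ["が", "いい", "わけだ"]

first = apply_pattern(longest_match("非常に評判がいいわけだ", WORDS), PATTERN, REPLACEMENT)
second = apply_pattern(longest_match("人手不足と言うがいいわけだ", WORDS), PATTERN, REPLACEMENT)
print("｜".join(first))   # corrected as intended
print("｜".join(second))  # the same pattern damages a segmentation that was already right
```

The pattern sees only the three morphemes it matches, so it cannot distinguish the two sentences.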

FIG. 2 is a configuration diagram of the morphological analysis device according to the embodiment.
The morphological analysis device 101 includes a dictionary generation unit 201, a morphological analysis unit 301, and a storage unit 401.

The dictionary generation unit 201 includes a context-independent dictionary construction unit 211, a morphological analysis unit 221, and a dependency determination unit 231.

The context-independent dictionary construction unit 211 generates the context-independent dictionary 421 using the results produced by the morphological analysis unit 221 and the dependency determination unit 231.

The morphological analysis unit 221 performs morphological analysis of the corpus 411, for example by using an existing morphological analysis method.

The dependency determination unit 231 determines whether a character string is one whose morphological analysis result differs depending on the context (that is, depends on the context).

In the embodiment, a character string whose morphological analysis result differs depending on the context is called a context-dependent character string, and a character string whose morphological analysis result does not change with the context is called a context-independent character string.

The morphological analysis unit 301 includes a context-independent character string analysis unit 311 and a context-dependent character string analysis unit 321.

The context-independent character string analysis unit 311 performs morphological analysis of the input sentence 431 by pattern matching against the context-independent dictionary 421. It thereby analyzes the context-independent character strings in the input sentence 431.

The context-dependent character string analysis unit 321 includes a lattice construction unit 322 and a morpheme sequence selection unit 323. It performs morphological analysis of the character strings in the input sentence 431 that the context-independent character string analysis unit 311 has not analyzed (that is, the context-dependent character strings).

The lattice construction unit 322 builds a lattice for the unanalyzed character string. A lattice (also called a word lattice) is a graph structure that enumerates word candidates (candidate analysis results), built by extracting from the morphological analysis dictionary every word that matches part of the surface form of the character string to be analyzed.

The morpheme sequence selection unit 323 selects, in the constructed lattice, the sequence of words (path) that is most plausible as a sentence. For example, it uses the Viterbi algorithm to select the path that minimizes an evaluation value. The morpheme sequence selection unit 323 is not limited to the Viterbi algorithm and may use another method such as beam search.
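Path selection of this kind can be sketched in Python as a dynamic program over character offsets. The sketch assumes that the evaluation value is simply the sum of per-word costs; the cost values in COSTS are invented, and the connection costs between adjacent morphemes that practical analyzers typically also use are omitted, so this is a simplified stand-in rather than the scoring used by the embodiment.

```python
def viterbi(text, word_costs):
    """Return (path, cost) for the minimum-cost segmentation of text.

    word_costs maps a surface form to its cost (hypothetical values).
    """
    inf = float("inf")
    n = len(text)
    best = [inf] * (n + 1)   # best[i]: lowest cost of any path covering text[:i]
    back = [None] * (n + 1)  # back[i]: start offset of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            cost = word_costs.get(text[start:end])
            if cost is not None and best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    if best[n] == inf:
        raise ValueError("no path covers the whole input")
    path, end = [], n
    while end > 0:
        path.append(text[back[end]:end])
        end = back[end]
    return path[::-1], best[n]


# Hypothetical costs: lower means more plausible.
COSTS = {"送ら": 2.0, "送": 4.0, "ら": 3.0, "れて": 1.5, "れ": 2.5,
         "て": 2.0, "きた": 1.5, "き": 3.0, "た": 2.0}
path, total = viterbi("送られてきた", COSTS)
print("｜".join(path), total)
```

A beam search variant would keep only a bounded number of partial paths per offset instead of the single best one.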

The storage unit 401 stores data, programs, and the like used by the morphological analysis device 101. It stores the corpus 411, the context-independent dictionary 421, the input sentence 431, and the analysis result 441. It also stores a morphological analysis dictionary (not shown) containing the words (morphemes) that the lattice construction unit 322 and the morphological analysis unit 221 use when building a lattice.

The corpus 411 is a collection of sentences. The dictionary generation unit 201 uses it to generate the context-independent dictionary 421.

The context-independent dictionary 421 is information that holds context-independent character strings together with their morphological analysis results. It is an example of the matching dictionary.

The input sentence 431 is the sentence that the morphological analysis unit 301 analyzes. It is an example of the analysis target text.

The analysis result 441 is the result of morphological analysis of the input sentence 431.
FIG. 3 is an example of the context-independent dictionary.

The context-independent dictionary 421 is information indicating context-independent character strings, that is, character strings whose morphological analysis result does not change with the context. It contains character strings and morpheme sequences, recorded in association with each other.

Each character string is a context-independent character string.
Each morpheme sequence is the result of morphological analysis of the corresponding character string, namely the set of morphemes into which the character string is divided. In the specification and drawings, 「｜」 in a morpheme sequence marks a boundary between morphemes. Information indicating the part of speech or conjugated form of each morpheme may be attached to the morpheme sequence.

For example, the context-independent dictionary 421 in FIG. 3 contains the character string 「夜間や休日」 ("nights and holidays") with the corresponding morpheme sequence 「夜間｜や｜休日」, and the character string 「がれきの山」 ("pile of rubble") with the corresponding morpheme sequence 「がれき｜の｜山」.

「夜間や休日」 is a character string whose morphological analysis result does not change with the context before or after it; the result is always the same. Morphological analysis of 「夜間や休日」 divides it into 「夜間｜や｜休日」.

Likewise, morphological analysis of 「がれきの山」 always divides it into 「がれき｜の｜山」.

Because the morphological analysis result of such a context-independent character string is always the same regardless of the surrounding context, the correct result can be obtained from the context-independent character string alone.
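One simple way to picture the context-independent dictionary is as a mapping from character string to morpheme sequence. The Python sketch below uses the two entries described for FIG. 3; representing the dictionary as a dict of lists, and the names CONTEXT_INDEPENDENT and lookup, are assumptions for illustration, since the specification does not prescribe a storage layout.

```python
# Entries follow the FIG. 3 examples; the dict-of-lists layout is an assumption.
CONTEXT_INDEPENDENT = {
    "夜間や休日": ["夜間", "や", "休日"],
    "がれきの山": ["がれき", "の", "山"],
}


def lookup(string):
    """Return the stored morpheme sequence for a registered string, or None."""
    return CONTEXT_INDEPENDENT.get(string)


# Sanity check: each stored morpheme sequence concatenates back to its key.
assert all("".join(m) == s for s, m in CONTEXT_INDEPENDENT.items())

print(lookup("夜間や休日"))
print(lookup("休日や夜間"))  # not registered, so None
```

A lookup hit returns the stored result directly, with no lattice involved.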

The context-independent dictionary 421 may also register, as character strings, model numbers, personal names, emoticons, fixed phrases, English words, or control tokens that represent tabs and line feeds. These are character strings whose morphological analysis result is always the same regardless of the surrounding context. The context-independent dictionary 421 may further contain information indicating the morphological analysis result for brackets, when the character string is a bracket, and for numerical expressions such as runs of digits, when the character string is a numerical expression. Brackets and numerical expressions are likewise character strings whose morphological analysis result is always the same regardless of the surrounding context.

Next, context-dependent character strings, whose morphological analysis result differs depending on the context, are described.

FIG. 4 shows examples of context-dependent character strings and their analysis results.
Three examples of context-dependent character strings are described here: 「よく知っているからだ」, 「休日や夜間」, and 「雪の山」.
(1) 「よく知っているからだ」
Depending on the context before and after it, morphological analysis of 「よく知っているからだ」 yields either 「よく｜知っている｜から｜だ」 (roughly, "because (someone) knows it well") or 「よく｜知っている｜からだ（体）」 (roughly, "a body (someone) knows well").
(2) 「休日や夜間」
When 「休日や夜間」 is preceded by 「今週の」, morphological analysis of 「今週の休日や夜間」 ("this week's holidays and nights") yields 「今週｜の｜休日｜や｜夜間」.

When 「休日や夜間」 is preceded by 「病院の定」, morphological analysis of 「病院の定休日や夜間」 ("the hospital's regular closing days and nights") yields 「病院｜の｜定休日｜や｜夜間」.
(3) 「雪の山」
When 「雪の山」 is followed by 「を見る」, morphological analysis of 「雪の山を見る」 ("look at the snowy mountain") yields 「雪｜の｜山｜を｜見る」.

When 「雪の山」 is preceded by 「大」 and followed by 「形県」, morphological analysis of 「大雪の山形県」 ("Yamagata Prefecture under heavy snow") yields 「大雪｜の｜山形｜県」.

Because the morphological analysis result of such a context-dependent character string differs with the context before and after it, it is difficult to obtain the correct result from the context-dependent character string alone.

Next, an example of the morphological analysis processing according to the embodiment is described.
FIG. 5 is a diagram illustrating morphological analysis processing according to the embodiment.

FIG. 5 describes morphological analysis of the input sentence 431 「非常に評判がいいわけだ」. Assume that the context-independent dictionary 421 contains the character string 「非常に評判がいい」 and the morpheme sequence 「非常に｜評判｜が｜いい」.

The context-independent character string analysis unit 311 analyzes the input sentence 「非常に評判がいいわけだ」 by the longest match method using the context-independent dictionary 421. In FIG. 5, the portion 「非常に評判がいい」 of the input sentence matches a character string in the context-independent dictionary 421.

Therefore, the morphological analysis result for the portion 「非常に評判がいい」 of the input sentence is 「非常に｜評判｜が｜いい」.

Next, the context-dependent character string analysis unit 321 performs morphological analysis of the remaining character strings in the input sentence that the context-independent character string analysis unit 311 did not analyze. Here, that is the remaining character string 「わけだ」 of the input sentence 「非常に評判がいいわけだ」.

The lattice construction unit 322 builds a lattice for the remaining (unanalyzed) character string 「わけだ」 and the already analyzed character strings before and after it, that is, for 「非常に評判がいいわけだ」.

The morpheme sequence selection unit 323 selects, in the constructed lattice, the sequence of words (path) that is most plausible as a sentence. As a result, the analysis result for the unanalyzed character string 「わけだ」 is 「わけだ」.

Consequently, the morphological analysis result for the input sentence 「非常に評判がいいわけだ」 is 「非常に｜評判｜が｜いい｜わけだ」.

Next, consider the case where a character string in the input sentence is not contained in the context-independent dictionary 421. Specifically, the corpus used to build the context-independent dictionary 421 is small, and the dictionary does not contain the character string 「非常に評判がいい」.

FIG. 6 is a diagram illustrating morphological analysis processing according to the embodiment.
As in FIG. 5, FIG. 6 describes morphological analysis of the input sentence 431 「非常に評判がいいわけだ」. Here, however, the context-independent dictionary 421 does not contain the character string 「非常に評判がいい」.

The context-independent character string analysis unit 311 analyzes the input sentence 「非常に評判がいいわけだ」 by the longest match method using the context-independent dictionary 421. In FIG. 6, no part of the input sentence matches a character string in the context-independent dictionary 421.

Therefore, the context-independent character string analysis unit 311 analyzes no part of the input sentence 「非常に評判がいいわけだ」.

Next, the context-dependent character string analysis unit 321 performs morphological analysis of the remaining character strings in the input sentence that the context-independent character string analysis unit 311 did not analyze. Here, that is the entire input sentence 「非常に評判がいいわけだ」.

The lattice construction unit 322 builds a lattice for the remaining (unanalyzed) character string 「非常に評判がいいわけだ」.

The morpheme sequence selection unit 323 selects, in the constructed lattice, the sequence of words (path) that is most plausible as a sentence. As a result, the analysis result for the unanalyzed character string 「非常に評判がいいわけだ」 is 「非常に｜評判｜が｜いい｜わけだ」.

Thus, morphological analysis can be performed correctly even when a character string in the input sentence is not contained in the context-independent dictionary 421.
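The two-stage flow of FIG. 5 and FIG. 6 can be sketched in Python as follows. The function names, the dictionary layout, and toy_lattice_analyze are hypothetical. In particular, toy_lattice_analyze is a lookup-table stand-in for lattice construction and path selection, and it receives only the unmatched fragment, whereas the embodiment also takes the neighbouring analyzed strings into the lattice.

```python
def match_context_independent(text, ci_dict):
    """Longest-match scan; returns (start, end, morphemes) for each matched span."""
    spans, i = [], 0
    max_len = max(map(len, ci_dict), default=0)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in ci_dict:
                spans.append((i, i + n, ci_dict[text[i:i + n]]))
                i += n
                break
        else:
            i += 1  # nothing registered starts here; leave it for the second stage
    return spans


def analyze(text, ci_dict, lattice_analyze):
    """Stage 1: pattern matching. Stage 2: lattice analysis of the unmatched rest."""
    result, pos = [], 0
    for start, end, morphemes in match_context_independent(text, ci_dict):
        if pos < start:
            result.extend(lattice_analyze(text[pos:start]))
        result.extend(morphemes)
        pos = end
    if pos < len(text):
        result.extend(lattice_analyze(text[pos:]))
    return result


def toy_lattice_analyze(fragment):
    # Hypothetical stand-in for lattice construction plus path selection.
    table = {
        "わけだ": ["わけだ"],
        "非常に評判がいいわけだ": ["非常に", "評判", "が", "いい", "わけだ"],
    }
    return table[fragment]


SENTENCE = "非常に評判がいいわけだ"
CI_DICT = {"非常に評判がいい": ["非常に", "評判", "が", "いい"]}

fig5 = analyze(SENTENCE, CI_DICT, toy_lattice_analyze)  # dictionary hit, as in FIG. 5
fig6 = analyze(SENTENCE, {}, toy_lattice_analyze)       # no hit, as in FIG. 6
print("｜".join(fig5))
print("｜".join(fig6))
```

Both calls yield the same segmentation; the difference is only how much of the sentence has to go through the lattice stage.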

FIG. 7 is a flowchart of the context-independent dictionary generation process according to the embodiment.
Assume that the corpus 411 contains sentences s_i (i = 0 to N). In the embodiment, sentences s_1, s_2, s_12, s_15, s_20, s_30, and s_35 are as follows.
Sentence s_1 = 「朝日新聞東京本社が「宅配便で不審な段ボール箱が二箱送られてきた」と築地署に届け出た。」 (The Asahi Shimbun Tokyo head office reported to the Tsukiji police station that "two suspicious cardboard boxes were delivered by courier.")
Sentence s_2 = 「そうする必要があるからだ。」 ("Because it is necessary to do so.")
Sentence s_12 = 「担当者は朝日新聞の取材に回答した。」 ("The person in charge responded to the Asahi Shimbun's inquiry.")
Sentence s_15 = 「からだと健康に気を付けましょう。」 ("Let's take care of our bodies and our health.")
Sentence s_20 = 「朝日新聞東京本社は大江戸線築地市場駅の前にある。」 ("The Asahi Shimbun Tokyo head office is in front of Tsukijishijo Station on the Oedo Line.")
Sentence s_30 = 「本社が意思決定権を持つ。」 ("The head office holds decision-making authority.")
Sentence s_35 = 「発行元の日本社が責任を負う。」 ("The publisher, Nihon-sha, bears responsibility.")
The index i of sentence s_i is its sentence ID.

ステップＳ５０１は、ステップＳ５０６の終端に対応するループの始端である。変数ｉの初期値は０であり、ループを実行する条件はｉがＮ以下であり、ループの終了毎にｉは１ずつインクリメントされる。 Step S501 is the beginning of a loop corresponding to the end of step S506. The initial value of the variable i is 0, the condition for executing the loop is that i is N or less, and i is incremented by 1 at the end of the loop.

ステップＳ５０２において、形態素解析部２２１は、コーパス４０１を読み出し、コーパス４０１に含まれる文ｓ_ｉの形態素解析を行う。例えば、形態素解析部２２１は、文ｓ_ｉに対するラティスを構築して、形態素解析を行う。文ｓ_ｉに対する形態素解析の結果である形態素列を形態素列ｓ’_ｉとする。文ｓ_１の形態素解析の結果ｓ’_１は、ｓ’_１＝「朝日|新聞|東京|本社|が|「|〜」となる。また、形態素列ｓ’_ｉのｉは、形態素列ｓ’_ｉの文ＩＤとする。 In step S502, the morphological analysis unit 221 reads the corpus 401 and performs morphological analysis on the sentence s_i included in the corpus 401. For example, the morphological analysis unit 221 constructs a lattice for the sentence s_i and performs morphological analysis. The morpheme string obtained as the result of the morphological analysis of the sentence s_i is denoted s'_i. The result s'_1 of the morphological analysis of sentence s_1 is s'_1 = 「朝日|新聞|東京|本社|が|「|〜」. The index i of the morpheme string s'_i is the sentence ID of s'_i.

ステップＳ５０３において、ステップＳ５０５の終端に対応するループの始端である。依存性判定部２３１は、形態素列ｓ’_ｉに含まれる連続する部分形態素列のうち未選択の連続する部分形態素列を１つ選択する。選択された部分形態素列ｎは、ｎ＝（文字列ｐ、形態素列ｍ、文ＩＤ）と表記する。文字列ｐは形態素列ｍを繋げた文字列であり、形態素列ｍは選択された部分形態素列を構成する形態素列であり、文ＩＤは選択された部分形態素列ｎが含まれる形態素列ｓ’_ｉまたは文ｓ_ｉの文ＩＤである。例えば、ｎ＝（朝日新聞、朝日｜新聞、１）となる。また、ｎ＝（新聞東京本社、新聞｜東京｜本社、１）となる。 Step S503 is the beginning of the loop that ends at step S505. The dependency determination unit 231 selects one unselected contiguous partial morpheme string from among the contiguous partial morpheme strings included in the morpheme string s'_i. The selected partial morpheme string n is written n = (character string p, morpheme string m, sentence ID). The character string p is the string obtained by concatenating the morpheme string m, the morpheme string m is the sequence of morphemes that makes up the selected partial morpheme string, and the sentence ID is the ID of the morpheme string s'_i (or sentence s_i) that contains n. For example, n = (朝日新聞, 朝日｜新聞, 1). Likewise, n = (新聞東京本社, 新聞｜東京｜本社, 1).

ステップＳ５０４において、依存性判定部２３１は、文字列ｐごとに、形態素列ｍと文ＩＤの配列をＴ［ｐ］．Ｍ、Ｔ［ｐ］．Ｈにそれぞれ保存する。例えば、文字列ｐ＝「朝日新聞」の場合、Ｔ［朝日新聞］．Ｍ＝[朝日｜新聞]、Ｔ［朝日新聞］．Ｈ＝[１，１２、〜]となる。また、文字列ｐ＝「からだ」の場合、Ｔ［からだ］．Ｍ＝[から｜だ，からだ（体）]，Ｔ［からだ］．Ｈ＝[２，１５、〜]となる。すなわち、文字列＝「からだ」の形態素解析の結果は「から｜だ」または「からだ（体）」となることを示す。また、Ｔ［からだ］．Ｈ＝[２，１５、〜]は、文字列＝「からだ」が文ｓ_２、ｓ_１５に含まれていることを示す。 In step S504, for each character string p, the dependency determination unit 231 stores the morpheme strings m and the sentence IDs in the arrays T[p].M and T[p].H, respectively. For example, for p = 「朝日新聞」, T[朝日新聞].M = [朝日｜新聞] and T[朝日新聞].H = [1, 12, 〜]. For p = 「からだ」, T[からだ].M = [から｜だ, からだ（体）] and T[からだ].H = [2, 15, 〜]. That is, the morphological analysis of the character string 「からだ」 yields either 「から｜だ」 (the particle から followed by だ) or 「からだ（体）」 (the noun meaning "body"). T[からだ].H = [2, 15, 〜] indicates that the character string 「からだ」 is contained in sentences s_2 and s_15.
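Steps S503 and S504 can be illustrated with the following Python sketch, which enumerates every contiguous partial morpheme string of each analyzed sentence and records it in a table T. This is a minimal sketch under stated assumptions and not the implementation of the embodiment: the function name `collect_partial_morpheme_strings`, the dictionary layout with the keys "M" and "H", and the hand-written segmentations of sentences s_2 and s_15 are all hypothetical.

```python
def collect_partial_morpheme_strings(segmented):
    """Build T[p] = {"M": segmentations of p, "H": IDs of sentences yielding p}.

    segmented maps a sentence ID to its morpheme list (the result s'_i of S502).
    """
    table = {}
    for sentence_id, morphemes in segmented.items():
        # S503: visit every contiguous partial morpheme string n = (p, m, ID).
        for start in range(len(morphemes)):
            for end in range(start + 1, len(morphemes) + 1):
                m = tuple(morphemes[start:end])
                p = "".join(m)
                # S504: store m in T[p].M and the sentence ID in T[p].H.
                entry = table.setdefault(p, {"M": [], "H": []})
                if m not in entry["M"]:
                    entry["M"].append(m)
                if sentence_id not in entry["H"]:
                    entry["H"].append(sentence_id)
    return table


# Assumed (hand-written) segmentations of sentences s_2 and s_15.
segmented = {
    2: ["そう", "する", "必要", "が", "ある", "から", "だ", "。"],
    15: ["からだ", "と", "健康", "に", "気", "を", "付け", "ましょう", "。"],
}
T = collect_partial_morpheme_strings(segmented)
print(T["からだ"]["M"])  # two different segmentations of the same surface string
print(T["からだ"]["H"])  # sentence IDs in which those segmentations were observed
```

The number of partial morpheme strings grows quadratically with sentence length, so a practical variant would presumably cap the length of the enumerated strings; that cap is not part of the description above.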

ステップＳ５０５において、ステップＳ５０３の始端に対応するループの終端である。形態素列ｓ’_ｉに含まれる連続する部分形態素列を全て選択済みの場合、制御はステップＳ５０６に進み、形態素列ｓ’_ｉにおいて未選択の連続する部分形態素列がある場合、制御はステップＳ５０３に戻る。 Step S505 is the end of the loop that begins at step S503. If all of the contiguous partial morpheme strings included in the morpheme string s'_i have been selected, control proceeds to step S506; if an unselected contiguous partial morpheme string remains in s'_i, control returns to step S503.

ステップＳ５０６は、ステップＳ５０１の始端に対応するループの終端である。iがＮより大きい場合、処理はステップＳ５０７に進み、iがＮ以下の場合、ｉは１インクリメントされ、制御はステップＳ５０１に戻る。 Step S506 is the end of the loop corresponding to the start of step S501. If i is larger than N, the process proceeds to step S507. If i is N or less, i is incremented by 1, and the control returns to step S501.

ステップＳ５０７において、ステップＳ５１６の終端に対応するループの始端である。依存性判定部２３１は、配列Ｔ［ｐ］の文字列ｐのうち未選択の文字列ｐを１つ選択する。以下、ステップＳ５０８〜Ｓ５１５における文字列ｐは、選択された文字列ｐであるとする。 In step S507, it is the beginning of the loop corresponding to the end of step S516. The dependency determination unit 231 selects one unselected character string p from the character strings p in the array T [p]. Hereinafter, it is assumed that the character string p in steps S508 to S515 is the selected character string p.

ステップＳ５０８において、依存性判定部２３１は、配列Ｔ［ｐ］．Ｍの要素の数｜Ｔ［ｐ］．Ｍ｜が１であるか判定する。配列Ｔ［ｐ］．Ｍの要素の数が１である場合、制御はステップＳ５１０に進み、Ｔ［ｐ］．Ｍの要素の数が１以外の場合、制御はステップＳ５０９に進む。例えば、文字列ｐ＝「からだ」である場合、Ｔ［からだ］．Ｍ＝[から｜だ，からだ（体）]であるので、｜Ｔ［ｐ］．Ｍ｜＝２となり、制御はステップＳ５０９に進む。例えば、文字列ｐ＝「東京本社が「宅配便」である場合、Ｔ［東京本社が「宅配便］．Ｍ＝[東京本社が「宅配便]であるので、｜Ｔ［ｐ］．Ｍ｜＝１となり、制御はステップＳ５１０に進む。ステップＳ５０８では、文字列ｐの形態素解析の結果が複数あるか、言い換えれば文字列ｐの形態素解析が常に同一であるかチェックしている。 In step S508, the dependency determination unit 231 determines whether the number of elements |T[p].M| of the array T[p].M is 1. If it is 1, control proceeds to step S510; otherwise, control proceeds to step S509. For example, for p = 「からだ」, T[からだ].M = [から｜だ, からだ（体）], so |T[p].M| = 2 and control proceeds to step S509. For p = 「東京本社が「宅配便」, T[東京本社が「宅配便].M = [東京本社が「宅配便], so |T[p].M| = 1 and control proceeds to step S510. Step S508 checks whether the character string p has more than one morphological analysis result, in other words, whether p is always analyzed in the same way.

ステップＳ５０９において、文字列ｐを破棄する。
ステップＳ５１０において、依存性判定部２３１は、配列Ｔ［ｐ］．Ｈの要素の数｜Ｔ［ｐ］．Ｈ｜が１より大きいか判定する。配列Ｔ［ｐ］．Ｈの要素の数が１より大きい場合、制御はステップＳ５１２に進み、Ｔ［ｐ］．Ｈの要素の数が１以下の場合、制御はステップＳ５１１に進む。例えば、文字列ｐ＝「東京本社が「宅配便」である場合、Ｔ［東京本社が「宅配便］．Ｈ＝[１]であるので、｜Ｔ［ｐ］．Ｈ｜＝１となり、制御はステップＳ５１１に進む。例えば、文字列ｐ＝「朝日新聞東京本社」である場合、Ｔ［朝日新聞東京本社］．Ｈ＝[１，１２，３０]であるので、｜Ｔ［ｐ］．Ｈ｜＝３となり、制御はステップＳ５１２に進む。 In step S509, the character string p is discarded.
In step S510, the dependency determination unit 231 determines whether the number of elements |T[p].H| of the array T[p].H is greater than 1. If it is greater than 1, control proceeds to step S512; if it is 1 or less, control proceeds to step S511. For example, for p = 「東京本社が「宅配便」, T[東京本社が「宅配便].H = [1], so |T[p].H| = 1 and control proceeds to step S511. For p = 「朝日新聞東京本社」, T[朝日新聞東京本社].H = [1, 12, 30], so |T[p].H| = 3 and control proceeds to step S512.

ステップＳ５１１において、文字列ｐを破棄する。
ステップＳ５１２において、依存性判定部２３１は、文字列ｐを含む文集合の文ＩＤであるＨ_ｐ’を得る。例えば、文字列ｐ＝「本社が」である場合、「本社が」を含む文は、文ｓ_１，ｓ_２０，ｓ_３０，ｓ_３５であるので、Ｈ_ｐ’＝１，２０，３０，３５となる。 In step S511, the character string p is discarded.
In step S512, the dependency determination unit 231 obtains H_p', the sentence IDs of the set of sentences that contain the character string p. For example, for p = 「本社が」, the sentences containing 「本社が」 are s_1, s_20, s_30, and s_35, so H_p' = 1, 20, 30, 35.

ステップＳ５１３において、依存性判定部２３１は、配列Ｔ［ｐ］．Ｈと文集合Ｈ_ｐ’が等しいか判定する。配列Ｔ［ｐ］．Ｈと文集合Ｈ_ｐ’が等しい場合、制御はステップＳ５１５に進み、配列Ｔ［ｐ］．Ｈと文集合Ｈ’が等しくない場合、制御はステップＳ５１４に進む。例えば、文字列ｐ＝「本社が」である場合、配列Ｔ［ｐ］．Ｈ＝[１，１２，３０]であり、Ｈ_ｐ’＝１，２０，３０，３５であり、配列Ｔ［ｐ］．ＨとＨ_ｐ’は等しくないため、制御はステップＳ５１４に進む。ステップＳ５１３では、形態素列の境界が異なる場合があるかを検出している。 In step S513, the dependency determination unit 231 determines whether the array T[p].H and the sentence set H_p' are equal. If they are equal, control proceeds to step S515; otherwise, control proceeds to step S514. For example, for p = 「本社が」, T[p].H = [1, 12, 30] and H_p' = 1, 20, 30, 35; since T[p].H and H_p' are not equal, control proceeds to step S514. Step S513 detects whether there are cases in which the morpheme boundaries differ, that is, sentences that contain p as a character string but not as the recorded morpheme string.

ステップＳ５１４において、文字列ｐを破棄する。
ステップＳ５１５において、文字列ｐと当該文字列ｐの解析結果である形態素列を文脈独立辞書４２１に登録する。文字列ｐ＝「朝日新聞東京本社」である場合、Ｔ［朝日新聞東京本社］．Ｍ＝[朝日｜新聞｜東京｜本社]、Ｔ［朝日新聞東京本社］．Ｈ＝[１，１２，３０]となり、文字列ｐ＝「朝日新聞東京本社」と形態素列＝「朝日｜新聞｜東京｜本社」が文脈独立辞書４２１に登録される。 In step S514, the character string p is discarded.
In step S515, the character string p and the morpheme string that is the analysis result of the character string p are registered in the context independent dictionary 421. When the character string p = “Asahi Shimbun Tokyo head office”, T [Asahi Shimbun Tokyo head office]. M = [Asahi | Newspaper | Tokyo | Headquarters], T [Asahi Shimbun Tokyo Headquarters]. H = [1, 12, 30], and the character string p = “Asahi Shimbun Tokyo head office” and the morpheme string = “Asahi | newspaper | Tokyo | head office” are registered in the context independent dictionary 421.
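The three checks of steps S508 to S515 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `build_context_independent_dict` and the layout of the table T are hypothetical, and the T entries below were written by hand from the listed sentences under an assumed segmentation, so they are not copied from the example values given in the text.

```python
def build_context_independent_dict(sentences, table):
    """Keep only the strings whose analysis does not depend on context."""
    dictionary = {}
    for p, entry in table.items():
        # S508/S509: discard p when it has more than one segmentation.
        if len(entry["M"]) != 1:
            continue
        # S510/S511: discard p when it was observed in only one sentence.
        if len(entry["H"]) <= 1:
            continue
        # S512: H_p' = IDs of all sentences whose raw text contains p.
        h_prime = {sid for sid, text in sentences.items() if p in text}
        # S513/S514: discard p when some sentence contains p with
        # different morpheme boundaries (H and H_p' differ).
        if set(entry["H"]) != h_prime:
            continue
        # S515: register p together with its single segmentation.
        dictionary[p] = entry["M"][0]
    return dictionary


sentences = {
    1: "朝日新聞東京本社が「宅配便で不審な段ボール箱が二箱送られてきた」と築地署に届け出た。",
    2: "そうする必要があるからだ。",
    15: "からだと健康に気を付けましょう。",
    20: "朝日新聞東京本社は大江戸線築地市場駅の前にある。",
    30: "本社が意思決定権を持つ。",
    35: "発行元の日本社が責任を負う。",
}
# Hand-written table entries (assumed segmentations, for illustration only).
T = {
    "朝日新聞東京本社": {"M": [("朝日", "新聞", "東京", "本社")], "H": [1, 20]},
    "本社が": {"M": [("本社", "が")], "H": [1, 30]},
    "からだ": {"M": [("から", "だ"), ("からだ",)], "H": [2, 15]},
    "意思決定権": {"M": [("意思", "決定", "権")], "H": [30]},
}
context_independent = build_context_independent_dict(sentences, T)
print(context_independent)
```

In this toy data, 「からだ」 is rejected by the first check, 「意思決定権」 by the second, and 「本社が」 by the third, because sentence 35 contains the characters 「本社が」 inside 「日本社が」 without the segmentation 本社｜が.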

ステップＳ５１６において、ステップＳ５０７の始端に対応するループの終端である。
図８は、実施の形態に係る形態素解析処理のフローチャートである。 In step S516, the end of the loop corresponding to the start of step S507.
FIG. 8 is a flowchart of morpheme analysis processing according to the embodiment.

ステップＳ６０１において、文脈独立文字列解析部３１１は、入力文４３１を読み出す。入力文４３１に含まれる文字を先頭から順にｃ０、ｃ１、〜、ｃＮと表記する。また、変数ｉ＝０とする。実施の形態において、入力文＝「朝日新聞東京本社が「宅配便で不審な段ボール箱が」とする。 In step S601, the context-independent character string analysis unit 311 reads the input sentence 431. The characters of the input sentence 431 are denoted c0, c1, ..., cN in order from the beginning, and the variable i is set to 0. In the embodiment, the input sentence is 「朝日新聞東京本社が「宅配便で不審な段ボール箱が」.

ステップＳ６０２において、文脈独立文字列解析部３１１は、入力文４３１と文脈独立辞書４２１とのパターンマッチングを行い、文脈独立辞書４２１に含まれる文字列と一致する文字列を入力文４３１から検出する。詳細には、文脈独立文字列解析部３１１は、文脈独立辞書４２１を検索し、文脈独立辞書４２１に含まれる文字列とマッチするｃｉを先頭とする最長の文字列ｃｉ〜ｃｊを探索する。 In step S602, the context-independent character string analysis unit 311 performs pattern matching between the input sentence 431 and the context-independent dictionary 421, and detects a character string that matches the character string included in the context-independent dictionary 421 from the input sentence 431. Specifically, the context-independent character string analysis unit 311 searches the context-independent dictionary 421 and searches for the longest character strings ci to cj that start with ci matching the character string included in the context-independent dictionary 421.

例えば、ｉ＝０の時、ｃ０〜ｃ７＝「朝日新聞東京本社」となる。ｉ＝８の時、マッチする文字列はない。ｉ＝９の時、ｃ９〜ｃ１１＝「「宅配」となる。ｉ＝１２の時、マッチする文字列はない。ｉ＝１３の時、ｃ１３〜ｃ２２＝「で不審な段ボール箱が」となる。 For example, when i = 0, c0 to c7 = 「朝日新聞東京本社」. When i = 8, there is no matching character string. When i = 9, c9 to c11 = 「「宅配」. When i = 12, there is no matching character string. When i = 13, c13 to c22 = 「で不審な段ボール箱が」.

ステップＳ６０３において、文脈独立文字列解析部３１１は、ｃｉを先頭とする文字列に一致する文字列が文脈独立辞書４２１にあるかチェックする。一致する文字列が文脈独立辞書４２１にある場合制御はステップＳ６０５に進み、一致する文字列が文脈独立辞書４２１に無い場合、制御はステップＳ６０４に進む。 In step S603, the context-independent character string analysis unit 311 checks whether there is a character string that matches the character string starting with ci in the context-independent dictionary 421. If there is a matching character string in the context independent dictionary 421, the control proceeds to step S605, and if there is no matching character string in the context independent dictionary 421, the control proceeds to step S604.

ステップＳ６０４において、文脈独立文字列解析部３１１は、変数ｉを１インクリメントする。 In step S604, the context-independent character string analysis unit 311 increments the variable i by 1.

ステップＳ６０５において、文脈独立文字列解析部３１１は、変数ｉをｊ＋１に設定する。例えば、ｉ＝０の時、ステップＳ６０２で述べたようにｃ０〜ｃ７＝「朝日新聞東京本社」となり、ｊ＝７なので、ｉは、８（＝７＋１）に設定される。 In step S605, the context-independent character string analysis unit 311 sets the variable i to j + 1. For example, when i = 0, as described in step S602, c0 to c7 = “Asahi Shimbun Tokyo head office” and j = 7, so i is set to 8 (= 7 + 1).

ステップＳ６０６において、文脈独立文字列解析部３１１は、文字列ｃｉ〜ｃｊに対する解析結果を解析結果４４１として記憶部４０１に保存する。例えば、ｉ＝０の時、ｃ０〜ｃ７＝「朝日新聞東京本社」に対する解析結果＝「朝日｜新聞｜東京｜本社」を解析結果４４１として記憶部４０１に保存する。ｉ＝９の時、ｃ９〜ｃ１１＝「「宅配」に対する解析結果＝「「｜宅配」を解析結果４４１として記憶部４０１に保存する。ｉ＝１３の時、ｃ１３〜ｃ２２＝「で不審な段ボール箱が」に対する解析結果＝「で｜不審な｜段ボール｜箱｜が」を解析結果４４１として記憶部４０１に保存する。 In step S606, the context-independent character string analysis unit 311 stores the analysis result for the character string ci to cj in the storage unit 401 as the analysis result 441. For example, when i = 0, the analysis result 「朝日｜新聞｜東京｜本社」 for c0 to c7 = 「朝日新聞東京本社」 is stored in the storage unit 401 as the analysis result 441. When i = 9, the analysis result 「「｜宅配」 for c9 to c11 = 「「宅配」 is stored. When i = 13, the analysis result 「で｜不審な｜段ボール｜箱｜が」 for c13 to c22 = 「で不審な段ボール箱が」 is stored.

ステップＳ６０７において、文脈独立文字列解析部３１１は、変数ｉがＮより大きいか判定する。変数ｉがＮより大きい場合、制御はステップＳ６０８に進み、変数ｉがＮ以下の場合、制御はステップＳ６０２に戻る。 In step S607, the context-independent character string analysis unit 311 determines whether the variable i is greater than N. If the variable i is greater than N, the control proceeds to step S608, and if the variable i is N or less, the control returns to step S602.
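The scan of steps S601 to S607 can be sketched in Python as a greedy longest-match loop. This is a minimal sketch under stated assumptions rather than the implementation of the embodiment: the function name `longest_match_scan`, the plain `dict` used as the context-independent dictionary, and its four entries are hypothetical, and a production system would more likely use a trie-based matcher than repeated slicing.

```python
def longest_match_scan(text, dictionary):
    """Return (start, end_inclusive, segmentation) for each longest match."""
    max_len = max((len(p) for p in dictionary), default=0)
    matches = []
    i = 0                                    # S601: i = 0
    while i < len(text):                     # S607: repeat while i <= N
        found = None
        # S602: look for the longest dictionary string starting at c_i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                found = (i, i + length - 1, dictionary[candidate])
                break
        if found is None:                    # S603: no match at c_i
            i += 1                           # S604: i = i + 1
        else:
            matches.append(found)            # S606: store the analysis result
            i = found[1] + 1                 # S605: i = j + 1
    return matches


# Hypothetical context-independent dictionary (string -> segmentation).
dictionary = {
    "朝日新聞": ["朝日", "新聞"],
    "朝日新聞東京本社": ["朝日", "新聞", "東京", "本社"],
    "「宅配": ["「", "宅配"],
    "で不審な段ボール箱が": ["で", "不審な", "段ボール", "箱", "が"],
}
text = "朝日新聞東京本社が「宅配便で不審な段ボール箱が"
matches = longest_match_scan(text, dictionary)
for start, end, segmentation in matches:
    print(f"c{start}-c{end}: {'｜'.join(segmentation)}")
```

With these assumed entries, the shorter entry 「朝日新聞」 is never chosen at i = 0 because the longer 「朝日新聞東京本社」 also matches there.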

ステップＳ６０８において、入力文４３１のうち文脈独立文字列解析部３１１において未解析の文字列をｓ_０、ｓ_１、〜、ｓ_Ｍとする。また、変数ｋ＝０とする。実施の形態において、入力文＝「朝日新聞東京本社が「宅配便で不審な段ボール箱が」のうち、「朝日新聞東京本社」、「「宅配」、および「で不審な段ボール箱が」が解析済みのため、未解析の文字列は、ｓ_０＝「が」、ｓ_１＝「便」となる。 In step S608, the character strings of the input sentence 431 that were left unanalyzed by the context-independent character string analysis unit 311 are denoted s_0, s_1, ..., s_M, and the variable k is set to 0. In the embodiment, of the input sentence 「朝日新聞東京本社が「宅配便で不審な段ボール箱が」, the strings 「朝日新聞東京本社」, 「「宅配」, and 「で不審な段ボール箱が」 have already been analyzed, so the unanalyzed character strings are s_0 = 「が」 and s_1 = 「便」.
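Given the character ranges that were matched against the context-independent dictionary, the unanalyzed strings s_0, s_1, ..., s_M of step S608 are simply the gaps between those ranges. The following Python sketch shows one way to compute them; the function name `unanalyzed_strings` and the representation of matches as (start, end) index pairs are assumptions made for illustration.

```python
def unanalyzed_strings(text, matched_spans):
    """Return the substrings of text not covered by any matched span.

    matched_spans is a list of (start, end_inclusive) pairs sorted by start
    and non-overlapping, as a left-to-right longest-match scan would produce.
    """
    gaps = []
    position = 0
    for start, end in matched_spans:
        if start > position:
            gaps.append(text[position:start])   # text before this match
        position = end + 1
    if position < len(text):
        gaps.append(text[position:])            # text after the last match
    return gaps


text = "朝日新聞東京本社が「宅配便で不審な段ボール箱が"
matched_spans = [(0, 7), (9, 11), (13, 22)]     # c0-c7, c9-c11, c13-c22
remaining = unanalyzed_strings(text, matched_spans)
print(remaining)
```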

ステップＳ６０９において、ラティス構築部３２２は、文字列ｓ_ｋと文字列ｓ_ｋの前後の解析済みの形態素について、複数の単語を含む辞書を用いてラティスを構築する。文字列ｓ_０＝「が」とその前後の解析済みの形態素のラティスを図９に示す。文字列ｓ_１＝「便」とその前後の解析済みの形態素のラティスを図１０に示す。 In step S609, the lattice constructing unit 322 constructs a lattice for the character string s_k and the already-analyzed morphemes before and after s_k, using a dictionary containing a plurality of words. FIG. 9 shows the lattice for s_0 = 「が」 and the analyzed morphemes before and after it. FIG. 10 shows the lattice for s_1 = 「便」 and the analyzed morphemes before and after it.

ステップＳ６１０において、形態素列選択部３２３は、構築されたラティスにおいて、文章として最も確からしいと思われる単語の並び（パス）を選択する。形態素列選択部３２３は、例えば、Viterbiアルゴリズムを用いて、評価値を最小とするようなパスを選択する。例えば、文字列ｓ_０＝「が」に対して、解析結果として「が（助詞）」が選択される。文字列ｓ_０＝「が」の解析結果と文字列ｓ_０の前後の解析済みの形態素を含む形態素列を図１１に示す。例えば、文字列ｓ_１＝「便」に対して、解析結果として「便（びん）」が選択される。文字列ｓ_１＝「便」の解析結果と文字列ｓ_１の前後の解析済みの形態素を含む形態素列を図１２に示す。 In step S610, the morpheme string selection unit 323 selects, from the constructed lattice, the word sequence (path) that is most plausible as a sentence. The morpheme string selection unit 323 selects the path that minimizes the evaluation value using, for example, the Viterbi algorithm. For example, for s_0 = 「が」, 「が（助詞）」 (the particle が) is selected as the analysis result. FIG. 11 shows the morpheme string containing the analysis result of s_0 = 「が」 and the analyzed morphemes before and after s_0. For s_1 = 「便」, 「便（びん）」 (read "bin") is selected as the analysis result. FIG. 12 shows the morpheme string containing the analysis result of s_1 = 「便」 and the analyzed morphemes before and after s_1.
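The path selection of step S610 can be illustrated with a small Viterbi search over a lattice, sketched below in Python. This is a sketch under assumptions and not the implementation of the embodiment: the function name `viterbi_segment`, the toy lexicon, the tag names, and every cost value are invented for illustration, and the evaluation value is assumed here to be the sum of word costs and tag-to-tag connection costs.

```python
def viterbi_segment(text, lexicon, conn_cost):
    """Return (total_cost, [(surface, tag), ...]) for the minimum-cost path."""
    n = len(text)
    # best[(offset, tag)] = (cost so far, previous state, surface ending here)
    best = {(0, "BOS"): (0, None, None)}
    for i in range(n):
        # Snapshot the states ending at offset i; new states end after i.
        here = [(key, value[0]) for key, value in best.items() if key[0] == i]
        for (_, prev_tag), prev_total in here:
            for surface, entries in lexicon.items():
                if not text.startswith(surface, i):
                    continue  # this word is not a lattice node at offset i
                for tag, word_cost in entries:
                    total = prev_total + conn_cost(prev_tag, tag) + word_cost
                    key = (i + len(surface), tag)
                    if key not in best or total < best[key][0]:
                        best[key] = (total, (i, prev_tag), surface)
    finals = [(value[0], key) for key, value in best.items() if key[0] == n]
    if not finals:
        return None  # no path covers the whole text
    total, key = min(finals)
    path = []
    while best[key][1] is not None:          # follow back-pointers to BOS
        _, prev_key, surface = best[key]
        path.append((surface, key[1]))
        key = prev_key
    path.reverse()
    return total, path


# Toy lexicon: surface -> [(tag, word cost)]. All values are assumptions.
lexicon = {
    "宅配": [("noun", 2)],
    "宅": [("noun", 5)],
    "配": [("noun", 6)],
    "便": [("suffix", 3), ("noun", 2)],  # びん (suffix) vs. べん (noun)
    "で": [("particle", 1)],
}
connection = {("noun", "suffix"): 0, ("noun", "noun"): 2}


def conn_cost(prev_tag, tag):
    return connection.get((prev_tag, tag), 1)  # assumed default cost of 1


result = viterbi_segment("宅配便で", lexicon, conn_cost)
print(result)
```

With these assumed costs, the suffix reading of 「便」 wins even though its word cost is higher, because the noun-to-suffix connection is cheaper than noun-to-noun.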

ステップＳ６１１において、形態素列選択部３２３は、変数ｋを１インクリメントする。 In step S611, the morpheme string selection unit 323 increments the variable k by 1.

ステップＳ６１２において、形態素列選択部３２３は、変数ｋがＭより大きいか判定する。変数ｋがＭより大きい場合、制御はステップＳ６１３に進み、変数ｋがＭ以下の場合、制御はステップＳ６０９に戻る。 In step S612, the morpheme string selection unit 323 determines whether the variable k is greater than M. If the variable k is larger than M, the control proceeds to step S613, and if the variable k is equal to or smaller than M, the control returns to step S609.

ステップＳ６１３において、文字列ｓ１〜ｓ_Ｍに対する形態素解析の結果を解析結果４４１として記憶部４０１に保存する。 In step S613, the results of the morphological analysis of the character strings s_1 to s_M are stored in the storage unit 401 as the analysis result 441.

図８に示す形態素解析処理のように、未解析の文字列について、未解析の文字列ごとに形態素解析を行うのでなく、全ての未解析の文字列を含む入力文全体のラティスを用いて形態素解析を行ってもよい。 Instead of performing morphological analysis separately for each unanalyzed character string, as in the morphological analysis process shown in FIG. 8, the morphological analysis may be performed using a lattice of the entire input sentence, including all of the unanalyzed character strings.

図１３は、実施の形態に係る形態素解析処理の変形例のフローチャートである。
図８の形態素解析処理と同様に、入力文＝「朝日新聞東京本社が「宅配便で不審な段ボール箱が」とする。 FIG. 13 is a flowchart of a modification of the morphological analysis process according to the embodiment.
As in the morphological analysis process of FIG. 8, it is assumed that the input sentence = “Asahi Shimbun Tokyo head office is“ suspicious cardboard box by courier ”.

ステップＳ１６０１〜Ｓ１６０８の処理は、それぞれ図８のステップＳ６０１〜Ｓ６０７の処理と同様であるため、説明は省略する。 The processes in steps S1601 to S1608 are the same as the processes in steps S601 to S607 in FIG. 8, respectively, and their description is therefore omitted.

ステップＳ１６０９において、ラティス構築部３２２は、文字列ｓ_ｋと文字列ｓ_ｋの前後の解析済みの形態素について、複数の単語を含む辞書を用いてラティスを構築する。 In step S1609, the lattice constructing unit 322 constructs a lattice for the character string s _k and the analyzed morphemes before and after the character string s _k using a dictionary including a plurality of words.

ステップＳ１６１０において、形態素列選択部３２３は、変数ｋを１インクリメントする。 In step S1610, the morpheme string selection unit 323 increments the variable k by 1.

ステップＳ１６１１において、形態素列選択部３２３は、変数ｋがＭより大きいか判定する。変数ｋがＭより大きい場合、制御はステップＳ１６１２に進み、変数ｋがＭ以下の場合、制御はステップＳ１６０９に戻る。実施の形態において、変数ｋがＭより大きい場合、図１４に示すような未解析の文字列ｓ_０＝「が」、ｓ_１＝「便」を含む入力文全体のラティスが構築される。 In step S1611, the morpheme string selection unit 323 determines whether the variable k is greater than M. If k is greater than M, control proceeds to step S1612; if k is M or less, control returns to step S1609. In the embodiment, when k is greater than M, a lattice of the entire input sentence including the unanalyzed character strings s_0 = 「が」 and s_1 = 「便」 has been constructed, as shown in FIG. 14.

ステップＳ１６１２において、形態素列選択部３２３は、構築されたラティスにおいて、文章として最も確からしいと思われる単語の並び（パス）を選択する。形態素列選択部３２３は、例えば、Viterbiアルゴリズムを用いて、評価値を最小とするようなパスを選択する。例えば、文字列ｓ_０＝「が」に対して、解析結果として「が（助詞）」が選択される。例えば、文字列ｓ_１＝「便」に対して、解析結果として「便（びん）」が選択される。文字列ｓ_０＝「が」、ｓ_１＝「便」の解析結果を含む入力文全体の形態素列を図１５に示す。入力文全体のラティスを構築して形態素解析を行うことで、図９，１０のように未解析の文字列とその前後の形態素列のラティスのみから形態素解析を行うより、精度を向上できる。 In step S1612, the morpheme string selection unit 323 selects, from the constructed lattice, the word sequence (path) that is most plausible as a sentence. The morpheme string selection unit 323 selects the path that minimizes the evaluation value using, for example, the Viterbi algorithm. For example, for s_0 = 「が」, 「が（助詞）」 (the particle が) is selected as the analysis result, and for s_1 = 「便」, 「便（びん）」 (read "bin") is selected. FIG. 15 shows the morpheme string of the entire input sentence including the analysis results of s_0 = 「が」 and s_1 = 「便」. Constructing a lattice of the entire input sentence and performing morphological analysis on it can improve accuracy compared with performing morphological analysis only from the lattice of an unanalyzed character string and the morpheme strings before and after it, as in FIGS. 9 and 10.

実施の形態の形態素解析装置によれば、パターンマッチングにより形態素解析を行い、パターンマッチングに合致しなかったテキストに対してラティスを構築して解析を行うことで、形態素解析の精度を保ちながら高速化できる。 According to the morphological analysis device of the embodiment, morphological analysis is performed by pattern matching, and a lattice is constructed and analyzed for the text that the pattern matching did not match, so that processing can be sped up while the accuracy of the morphological analysis is maintained.

実施の形態の形態素解析装置によれば、パターンマッチングに合致しなかったテキストに対してのみラティスを構築して形態素解析を行うので、解析対象のテキスト全体のラティスを構築して形態素解析を行う場合に比べて、計算コストを低減できる。 According to the morphological analysis device of the embodiment, the lattice is constructed only for the text that does not match the pattern matching and the morphological analysis is performed. Therefore, the lattice of the entire text to be analyzed is constructed and the morphological analysis is performed. Compared with, the calculation cost can be reduced.

図１６は、情報処理装置の構成図である。
図２の形態素解析装置１０１は、例えば、図１６に示すような情報処理装置（コンピュータ）１０を用いて実現可能である。 FIG. 16 is a configuration diagram of the information processing apparatus.
The morphological analysis apparatus 101 in FIG. 2 can be realized using, for example, an information processing apparatus (computer) 10 as shown in FIG.

図１６の情報処理装置は、Central Processing Unit（ＣＰＵ）１、メモリ２、入力装置３、出力装置４、補助記憶装置５、媒体駆動装置６、及びネットワーク接続装置７を含む。これらの構成要素はバス８により互いに接続されている。 The information processing apparatus in FIG. 16 includes a central processing unit (CPU) 1, a memory 2, an input device 3, an output device 4, an auxiliary storage device 5, a medium driving device 6, and a network connection device 7. These components are connected to each other by a bus 8.

メモリ２は、例えば、Read Only Memory（ＲＯＭ）、Random Access Memory（ＲＡＭ）、フラッシュメモリ等の半導体メモリである。メモリ２は、形態素解析処理のためのプログラム及びデータを格納する。メモリ２は、記憶部４０１として用いることができる。 The memory 2 is a semiconductor memory such as a read only memory (ROM), a random access memory (RAM), or a flash memory, for example. The memory 2 stores a program and data for morphological analysis processing. The memory 2 can be used as the storage unit 401.

ＣＰＵ１（プロセッサ）は、例えば、メモリ２を利用してプログラムを実行することにより、文脈独立辞書構築部２１１、形態素解析部２２１、依存性判定部２３１、文脈独立文字列解析部３１１、ラティス構築部３２２、および形態素列選択部３２３として動作する。 The CPU 1 (processor) operates as the context-independent dictionary construction unit 211, the morphological analysis unit 221, the dependency determination unit 231, the context-independent character string analysis unit 311, the lattice constructing unit 322, and the morpheme string selection unit 323 by, for example, executing a program using the memory 2.

入力装置３は、例えば、キーボード、ポインティングデバイス等であり、ユーザ又はオペレータからの指示や情報の入力に用いられる。出力装置４は、例えば、表示装置、プリンタ、スピーカ等であり、ユーザ又はオペレータへの問い合わせや処理結果の出力に用いられる。処理結果は、形態素解析の結果であってもよい。 The input device 3 is, for example, a keyboard, a pointing device, and the like, and is used for inputting instructions and information from a user or an operator. The output device 4 is, for example, a display device, a printer, a speaker, and the like, and is used for outputting an inquiry to a user or an operator and a processing result. The processing result may be a result of morphological analysis.

補助記憶装置５は、例えば、磁気ディスク装置、光ディスク装置、光磁気ディスク装置、テープ装置等である。補助記憶装置５は、ハードディスクドライブ又はフラッシュメモリであってもよい。情報処理装置は、補助記憶装置５にプログラム及びデータを格納しておき、それらをメモリ２にロードして使用することができる。補助記憶装置５は、記憶部４０１として用いることができる。 The auxiliary storage device 5 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, a tape device, or the like. The auxiliary storage device 5 may be a hard disk drive or a flash memory. The information processing apparatus can store programs and data in the auxiliary storage device 5 and load them into the memory 2 for use. The auxiliary storage device 5 can be used as the storage unit 401.

媒体駆動装置６は、可搬型記録媒体９を駆動し、その記録内容にアクセスする。可搬型記録媒体９は、メモリデバイス、フレキシブルディスク、光ディスク、光磁気ディスク等である。可搬型記録媒体９は、Compact Disk Read Only Memory（ＣＤ−ＲＯＭ）、Digital Versatile Disk（ＤＶＤ）、Universal Serial Bus（ＵＳＢ）メモリ等であってもよい。ユーザ又はオペレータは、この可搬型記録媒体９にプログラム及びデータを格納しておき、それらをメモリ２にロードして使用することができる。 The medium driving device 6 drives a portable recording medium 9 and accesses the recorded contents. The portable recording medium 9 is a memory device, a flexible disk, an optical disk, a magneto-optical disk, or the like. The portable recording medium 9 may be a compact disk read only memory (CD-ROM), a digital versatile disk (DVD), a universal serial bus (USB) memory, or the like. A user or an operator can store programs and data in the portable recording medium 9 and load them into the memory 2 for use.

このように、プログラム及びデータを格納するコンピュータ読み取り可能な記録媒体は、メモリ２、補助記憶装置５、及び可搬型記録媒体９のような、物理的な（非一時的な）記録媒体である。 As described above, the computer-readable recording medium for storing the program and data is a physical (non-transitory) recording medium such as the memory 2, the auxiliary storage device 5, and the portable recording medium 9.

ネットワーク接続装置７は、Local Area Network（ＬＡＮ）、インターネット等の通信ネットワークに接続され、通信に伴うデータ変換を行う通信インタフェースである。情報処理装置は、ネットワーク接続装置７を介して外部の装置からプログラム及びデータを受信し、それらをメモリ２にロードして使用することができる。 The network connection device 7 is a communication interface that is connected to a communication network such as a local area network (LAN) or the Internet and performs data conversion accompanying communication. The information processing apparatus can receive a program and data from an external apparatus via the network connection apparatus 7 and load them into the memory 2 for use.

情報処理装置は、ネットワーク接続装置７を介して、ユーザ端末から指示や情報を受信し、形態素解析処理を行って、処理結果をユーザ端末へ送信することもできる。 The information processing apparatus can also receive instructions and information from the user terminal via the network connection apparatus 7, perform morphological analysis processing, and transmit the processing result to the user terminal.

なお、情報処理装置が図１６のすべての構成要素を含む必要はなく、用途や条件に応じて一部の構成要素を省略することも可能である。例えば、ユーザ又はオペレータからの指示や情報の入力を行わない場合は、入力装置３を省略してもよく、ユーザ又はオペレータへの問い合わせや処理結果の出力を行わない場合は、出力装置４を省略してもよい。情報処理装置が可搬型記録媒体９又は通信ネットワークにアクセスしない場合は、媒体駆動装置６又はネットワーク接続装置７を省略してもよい。 Note that the information processing apparatus does not have to include all the components illustrated in FIG. 16, and some of the components may be omitted depending on the application and conditions. For example, the input device 3 may be omitted when an instruction or information is not input from the user or operator, and the output device 4 is omitted when an inquiry or processing result is not output to the user or operator. May be. When the information processing apparatus does not access the portable recording medium 9 or the communication network, the medium driving device 6 or the network connection device 7 may be omitted.

以上の実施の形態に関し、さらに以下の付記を開示する。
（付記１）
形態素解析辞書と、複数の文それぞれに含まれる文字列と、前記複数の文それぞれに対して共通に得られた前記文字列の第１の形態素解析結果とを含むマッチング辞書を記憶する記憶部を備えるコンピュータに
解析対象テキストのうち、前記マッチング辞書に含まれる前記文字列と一致する文字列に対して、前記第１の形態素解析結果を出力し、
前記解析対象テキストのうち、前記マッチング辞書に含まれる前記文字列と一致しない残りの文字列に対し、前記形態素解析辞書を用いて、複数の形態素解析結果の候補を含むラティスを生成し、
前記ラティスを用いて前記残りの文字列に対する形態素解析を行い、前記残りの文字列に対する第２の形態素解析結果を出力する、
処理を実行させる形態素解析プログラム。
（付記２）
前記複数の文の形態素解析を行い、前記複数の文それぞれに含まれる文字列の形態素解析結果がすべて同じである場合に、前記文字列を前記マッチング辞書に登録する処理を前記コンピュータにさらに実行させる付記１記載の形態素解析プログラム。
（付記３）
前記マッチング辞書は、複数の文字列と前記複数の文字列の複数の形態素解析結果とを含み、前記複数の文字列は、前記複数の文それぞれに含まれる文字列と他の文字列とを含み、前記複数の形態素解析結果は、前記第１の形態素解析結果と前記他の文字列の形態素解析結果とを含み、
前記解析対象テキストのうち、前記マッチング辞書に含まれる前記複数の文字列それぞれと一致する複数の文字列に対して、前記複数の形態素解析結果を出力し、
前記解析対象テキストのうち、前記マッチング辞書に含まれる前記複数の文字列と一致しない残りの文字列に対し、前記形態素解析辞書を用いて前記ラティスを生成し、前記ラティスを用いて前記複数の文字列と一致しない残りの文字列に対する形態素解析を行う
処理を前記コンピュータにさらに実行させる付記１記載の形態素解析プログラム。
（付記４）
形態素解析辞書と、複数の文それぞれに含まれる文字列と、前記複数の文それぞれに対して共通に得られた前記文字列の第１の形態素解析結果とを含むマッチング辞書を記憶する記憶部と、
解析対象テキストのうち、前記マッチング辞書に含まれる前記文字列と一致する文字列に対して、前記第１の形態素解析結果を出力する第１の解析部と、
前記解析対象テキストのうち、前記マッチング辞書に含まれる前記文字列と一致しない残りの文字列に対し、前記形態素解析辞書を用いて、複数の形態素解析結果の候補を含むラティスを生成し、前記ラティスを用いて前記残りの文字列に対する形態素解析を行い、前記残りの文字列に対する第２の形態素解析結果を出力する第２の解析部と、
を備える形態素解析装置。
（付記５）
前記複数の文の形態素解析を行い、前記複数の文それぞれに含まれる文字列の形態素解析結果がすべて同じである場合に、前記文字列を前記マッチング辞書に登録する辞書生成部と、
をさらに備えることを特徴とする付記４記載の形態素解析装置。
（付記６）
前記マッチング辞書は、複数の文字列と前記複数の文字列の複数の形態素解析結果とを含み、前記複数の文字列は、前記複数の文それぞれに含まれる文字列と他の文字列とを含み、前記複数の形態素解析結果は、前記第１の形態素解析結果と前記他の文字列の形態素解析結果とを含み、
前記第１の解析部は、前記解析対象テキストのうち、前記マッチング辞書に含まれる前記複数の文字列それぞれと一致する複数の文字列に対して、前記複数の形態素解析結果を出力し、
前記第２の解析部は、前記解析対象テキストのうち、前記マッチング辞書に含まれる前記複数の文字列と一致しない残りの文字列に対し、前記形態素解析辞書を用いて前記ラティスを生成し、前記ラティスを用いて前記複数の文字列と一致しない残りの文字列に対する形態素解析を行うことを特徴とする付記４記載の形態素解析装置。
（付記７）
形態素解析辞書と、複数の文それぞれに含まれる文字列と、前記複数の文それぞれに対して共通に得られた前記文字列の第１の形態素解析結果とを含むマッチング辞書を記憶する記憶部を備える形態素解析装置が
解析対象テキストのうち、前記マッチング辞書に含まれる前記文字列と一致する文字列に対して、前記第１の形態素解析結果を出力し、
前記解析対象テキストのうち、前記マッチング辞書に含まれる前記文字列と一致しない残りの文字列に対し、前記形態素解析辞書を用いて、複数の形態素解析結果の候補を含むラティスを生成し、
前記ラティスを用いて前記残りの文字列に対する形態素解析を行い、前記残りの文字列に対する第２の形態素解析結果を出力する、
処理を有する形態素解析方法。
（付記８）
前記複数の文の形態素解析を行い、前記複数の文それぞれに含まれる文字列の形態素解析結果がすべて同じである場合に、前記文字列を前記マッチング辞書に登録する処理をさらに有する付記７記載の形態素解析方法。
（付記９）
前記マッチング辞書は、複数の文字列と前記複数の文字列の複数の形態素解析結果とを含み、前記複数の文字列は、前記複数の文それぞれに含まれる文字列と他の文字列とを含み、前記複数の形態素解析結果は、前記第１の形態素解析結果と前記他の文字列の形態素解析結果とを含み、
前記解析対象テキストのうち、前記マッチング辞書に含まれる前記複数の文字列それぞれと一致する複数の文字列に対して、前記複数の形態素解析結果を出力し、
前記解析対象テキストのうち、前記マッチング辞書に含まれる前記複数の文字列と一致しない残りの文字列に対し、前記形態素解析辞書を用いて前記ラティスを生成し、前記ラティスを用いて前記複数の文字列と一致しない残りの文字列に対する形態素解析を行う
処理をさらに有する付記７記載の形態素解析方法。 Regarding the above embodiment, the following additional notes are disclosed.
(Appendix 1)
A morphological analysis program that causes a computer, the computer including a storage unit that stores a morphological analysis dictionary and a matching dictionary including a character string included in each of a plurality of sentences and a first morphological analysis result of the character string obtained in common for each of the plurality of sentences, to execute a process comprising:
outputting the first morphological analysis result for a character string in an analysis target text that matches the character string included in the matching dictionary;
generating, using the morphological analysis dictionary, a lattice including a plurality of morphological analysis result candidates for the remaining character strings in the analysis target text that do not match the character string included in the matching dictionary; and
performing morphological analysis on the remaining character strings using the lattice, and outputting a second morphological analysis result for the remaining character strings.
(Appendix 2)
Performing a morphological analysis of the plurality of sentences, and causing the computer to further execute a process of registering the character strings in the matching dictionary when all the morphological analysis results of the character strings included in the plurality of sentences are the same. The morphological analysis program according to attachment 1.
(Appendix 3)
The matching dictionary includes a plurality of character strings and a plurality of morphological analysis results of the plurality of character strings, and the plurality of character strings include a character string included in each of the plurality of sentences and another character string. The plurality of morpheme analysis results include the first morpheme analysis result and the morpheme analysis result of the other character string,
Among the analysis target text, for the plurality of character strings that match each of the plurality of character strings included in the matching dictionary, the plurality of morpheme analysis results are output,
Of the analysis target text, for the remaining character strings that do not match the plurality of character strings included in the matching dictionary, the lattice is generated using the morphological analysis dictionary, and the plurality of characters using the lattice The morpheme analysis program according to appendix 1, further causing the computer to execute a process for performing a morphological analysis on the remaining character string that does not match the string.
(Appendix 4)
A morphological analyzer comprising:
a storage unit that stores a morphological analysis dictionary and a matching dictionary, the matching dictionary including a character string included in each of a plurality of sentences and a first morphological analysis result of the character string obtained in common for each of the plurality of sentences;
a first analysis unit that outputs the first morphological analysis result for a character string in an analysis target text that matches the character string included in the matching dictionary; and
a second analysis unit that uses the morphological analysis dictionary to generate a lattice including a plurality of morphological analysis result candidates for a remaining character string in the analysis target text that does not match the character string included in the matching dictionary, performs morphological analysis on the remaining character string using the lattice, and outputs a second morphological analysis result for the remaining character string.
(Appendix 5)
The morphological analyzer according to appendix 4, further comprising a dictionary generation unit that performs morphological analysis of the plurality of sentences and registers the character string in the matching dictionary when the morphological analysis results of the character string included in each of the plurality of sentences are all the same.
(Appendix 6)
The morphological analyzer according to appendix 4, wherein the matching dictionary includes a plurality of character strings and a plurality of morphological analysis results for the plurality of character strings, the plurality of character strings include the character string included in each of the plurality of sentences and another character string, and the plurality of morphological analysis results include the first morphological analysis result and a morphological analysis result of the other character string,
the first analysis unit outputs the plurality of morphological analysis results for a plurality of character strings in the analysis target text that match respective ones of the plurality of character strings included in the matching dictionary, and
the second analysis unit generates the lattice, using the morphological analysis dictionary, for the remaining character string in the analysis target text that does not match any of the plurality of character strings included in the matching dictionary, and performs morphological analysis on that remaining character string using the lattice.
(Appendix 7)
A morphological analysis method in which a morphological analyzer, comprising a storage unit that stores a morphological analysis dictionary and a matching dictionary including a character string included in each of a plurality of sentences and a first morphological analysis result of the character string obtained in common for each of the plurality of sentences, executes a process comprising:
outputting the first morphological analysis result for a character string in an analysis target text that matches the character string included in the matching dictionary;
generating, using the morphological analysis dictionary, a lattice including a plurality of morphological analysis result candidates for a remaining character string in the analysis target text that does not match the character string included in the matching dictionary; and
performing morphological analysis on the remaining character string using the lattice, and outputting a second morphological analysis result for the remaining character string.
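The lattice step for a remaining character string can be sketched as follows. The text above states only that a lattice of candidates is generated and used for the analysis; the scoring shown here (word costs plus part-of-speech connection costs, with the minimum-cost path selected by dynamic programming) is a common approach and an assumption of this sketch. The names `build_lattice`, `best_path`, `Lexicon`, `Connection`, and `default_connection` are hypothetical, and unknown-word handling and an end-of-sentence connection cost are omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical morphological analysis dictionary: surface -> [(part of speech, word cost)].
Lexicon = Dict[str, List[Tuple[str, int]]]
# Hypothetical connection costs between adjacent parts of speech.
Connection = Dict[Tuple[str, str], int]


def build_lattice(text: str, lexicon: Lexicon):
    """Return lattice[end] = [(start, surface, pos, cost), ...] for every dictionary word."""
    lattice = [[] for _ in range(len(text) + 1)]
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            for pos, cost in lexicon.get(surface, []):
                lattice[end].append((start, surface, pos, cost))
    return lattice


def best_path(text: str, lexicon: Lexicon, connection: Connection,
              default_connection: int = 10) -> List[Tuple[str, str]]:
    """Select the minimum-cost morpheme sequence covering text (Viterbi-style search)."""
    lattice = build_lattice(text, lexicon)
    # best[position][pos] = (total cost, back pointer (start, previous pos, surface))
    best: List[Dict[str, tuple]] = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, None)
    for end in range(1, len(text) + 1):
        for start, surface, pos, cost in lattice[end]:
            for prev_pos, (prev_cost, _) in best[start].items():
                total = prev_cost + connection.get((prev_pos, pos), default_connection) + cost
                if pos not in best[end] or total < best[end][pos][0]:
                    best[end][pos] = (total, (start, prev_pos, surface))
    if not best[len(text)]:
        raise ValueError("no candidate path covers the whole text")
    # Trace back from the cheapest final state.
    pos = min(best[-1], key=lambda p: best[-1][p][0])
    position = len(text)
    result: List[Tuple[str, str]] = []
    while position > 0:
        _, (start, prev_pos, surface) = best[position][pos]
        result.append((surface, pos))
        position, pos = start, prev_pos
    result.reverse()
    return result


# Toy dictionary and costs, for illustration only.
example_lexicon: Lexicon = {
    "くるま": [("noun", 3)], "くる": [("verb", 3)],
    "まで": [("particle", 2)], "で": [("particle", 2)], "まつ": [("verb", 3)],
}
example_connection: Connection = {
    ("BOS", "noun"): 1, ("BOS", "verb"): 2, ("noun", "particle"): 1,
    ("verb", "particle"): 3, ("particle", "verb"): 1,
}
example_result = best_path("くるまでまつ", example_lexicon, example_connection)
```

With these toy costs, the lattice contains both "くるま / で / まつ" and "くる / まで / まつ", and the first candidate is selected because its total cost is lower.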
(Appendix 8)
The morphological analysis method according to appendix 7, further comprising a process of performing morphological analysis of the plurality of sentences and registering the character string in the matching dictionary when the morphological analysis results of the character string included in each of the plurality of sentences are all the same.
(Appendix 9)
The morphological analysis method according to appendix 7, wherein the matching dictionary includes a plurality of character strings and a plurality of morphological analysis results for the plurality of character strings, the plurality of character strings include the character string included in each of the plurality of sentences and another character string, and the plurality of morphological analysis results include the first morphological analysis result and a morphological analysis result of the other character string,
the method further comprising a process of outputting the plurality of morphological analysis results for a plurality of character strings in the analysis target text that match respective ones of the plurality of character strings included in the matching dictionary, and
generating the lattice, using the morphological analysis dictionary, for the remaining character string in the analysis target text that does not match any of the plurality of character strings included in the matching dictionary, and performing morphological analysis on that remaining character string using the lattice.

DESCRIPTION OF SYMBOLS
101 Morphological analyzer
201 Dictionary generation unit
211 Context-independent dictionary construction unit
221 Morphological analysis unit
231 Dependency determination unit
301 Morphological analysis unit
311 Context-independent character string analysis unit
321 Context-dependent character string analysis unit
322 Lattice construction unit
323 Morpheme string selection unit
401 Storage unit
411 Corpus
421 Context-independent dictionary
431 Input sentence
441 Analysis result

Claims

A morphological analysis program that causes a computer, comprising a storage unit that stores a morphological analysis dictionary and a matching dictionary including a character string included in each of a plurality of sentences and a first morphological analysis result of the character string obtained in common for each of the plurality of sentences, to execute a process comprising:
outputting the first morphological analysis result for a character string in an analysis target text that matches the character string included in the matching dictionary;
generating, using the morphological analysis dictionary, a lattice including a plurality of morphological analysis result candidates for a remaining character string in the analysis target text that does not match the character string included in the matching dictionary; and
performing morphological analysis on the remaining character string using the lattice, and outputting a second morphological analysis result for the remaining character string.

The morphological analysis program according to claim 1, further causing the computer to execute a process of performing morphological analysis of the plurality of sentences and registering the character string in the matching dictionary when the morphological analysis results of the character string included in each of the plurality of sentences are all the same.

The morphological analysis program according to claim 1, wherein the matching dictionary includes a plurality of character strings and a plurality of morphological analysis results for the plurality of character strings, the plurality of character strings include the character string included in each of the plurality of sentences and another character string, and the plurality of morphological analysis results include the first morphological analysis result and a morphological analysis result of the other character string,
the program further causing the computer to execute a process of outputting the plurality of morphological analysis results for a plurality of character strings in the analysis target text that match respective ones of the plurality of character strings included in the matching dictionary, and
generating the lattice, using the morphological analysis dictionary, for the remaining character string in the analysis target text that does not match any of the plurality of character strings included in the matching dictionary, and performing morphological analysis on that remaining character string using the lattice.

A morphological analyzer comprising:
a storage unit that stores a morphological analysis dictionary and a matching dictionary, the matching dictionary including a character string included in each of a plurality of sentences and a first morphological analysis result of the character string obtained in common for each of the plurality of sentences;
a first analysis unit that outputs the first morphological analysis result for a character string in an analysis target text that matches the character string included in the matching dictionary; and
a second analysis unit that uses the morphological analysis dictionary to generate a lattice including a plurality of morphological analysis result candidates for a remaining character string in the analysis target text that does not match the character string included in the matching dictionary, performs morphological analysis on the remaining character string using the lattice, and outputs a second morphological analysis result for the remaining character string.

A morphological analysis method in which a morphological analyzer, comprising a storage unit that stores a morphological analysis dictionary and a matching dictionary including a character string included in each of a plurality of sentences and a first morphological analysis result of the character string obtained in common for each of the plurality of sentences, executes a process comprising:
outputting the first morphological analysis result for a character string in an analysis target text that matches the character string included in the matching dictionary;
generating, using the morphological analysis dictionary, a lattice including a plurality of morphological analysis result candidates for a remaining character string in the analysis target text that does not match the character string included in the matching dictionary; and
performing morphological analysis on the remaining character string using the lattice, and outputting a second morphological analysis result for the remaining character string.