JP6619932B2 - Morphological analyzer and program

Info

Publication number: JP6619932B2
Application number: JP2014266384A
Authority: JP
Inventors: 信行西澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2014-12-26
Filing date: 2014-12-26
Publication date: 2019-12-11
Anticipated expiration: 2034-12-26
Also published as: JP2016126498A

Description

The present invention relates to a morphological analyzer and a program that output a morpheme string corresponding to an input sentence.

A typical application of speech synthesis technology is text-to-speech (TTS) conversion, which synthesizes a speech waveform corresponding to input text. In the following, this series of processes is divided broadly into two: a process that analyzes the input text and generates information on how the text is read, and a process that synthesizes a speech waveform from that reading information. The input is assumed to be Japanese text written in a mixture of kanji and kana.

Hereinafter, the symbols used to express reading information are referred to as speech synthesis symbols. Such symbols can take various forms; here we assume a notation that simultaneously records the phonological information making up a stretch of speech and the prosodic information expressed mainly as pauses and voice pitch. An example is the JEITA (Japan Electronics and Information Technology Industries Association) standard IT-4006, "Symbols for Japanese Text-to-Speech Synthesis" (see Non-Patent Document 1). It is difficult to express things such as the emotional coloring of speech with these symbols alone, but they include at least the information needed to describe the linguistic features of ordinary read-aloud speech.

The waveform synthesis process, for its part, is carried out so that the synthesized waveform follows the speech synthesis symbols. Accurate reading of Japanese text therefore comes down to producing accurate speech synthesis symbols for the mixed kanji-kana text.

Speech synthesis symbols can be generated from arbitrary Japanese text by dividing the mixed kanji-kana sentence into morphemes (the smallest units that carry meaning in a linguistic expression), assigning a reading to each morpheme, modifying the morpheme information as appropriate with reference to the morpheme string and the like, inserting prosodic boundaries such as pauses where necessary, and concatenating the results. The reading of each morpheme is created and stored in advance as morpheme dictionary information (see Patent Document 1).

However, a morpheme need not match the linguistic definition; it may be any unit that is convenient for the processing. For example, to handle morpheme strings more appropriately, a phrase made up of several morphemes (such as a compound noun phrase) may be treated as a single morpheme for convenience. In the following, therefore, a morpheme means a sequence of characters (a character string) chosen to serve as the minimum unit of processing for the application at hand, and every sentence is assumed to be constructible by concatenating such character strings.

Dividing a sentence into morphemes in this way is generally called morphological analysis, and it is used not only in speech synthesis but also for tasks such as extracting the constituents of a sentence. Because the correctness of the readings produced by a TTS system depends strongly on the accuracy of morphological analysis, a TTS system requires highly accurate morphological analysis. On the other hand, unlike data extraction from the large volume of text on the World Wide Web (WWW), a TTS system usually does not need to process large amounts of text in a short time.

For example, even if processing one sentence takes about 0.1 seconds and thereby adds 0.1 seconds to the latency of the TTS system, that processing time is not a problem in most cases. In other words, compared with a morphological analyzer built to process large amounts of text, high-speed processing is unnecessary. Considering the demand for TTS systems on mobile devices such as smartphones, however, reducing the size of the system matters more than increasing its speed. A morphological analyzer intended for a TTS system can therefore be designed differently from one intended for processing large amounts of text.

As the method of morphological analysis, a method based on the minimum cost method is described below. In this method, an occurrence cost function that reflects how frequently each morpheme appears and a connection cost function that expresses how readily consecutive morphemes connect are defined in advance. An appropriate morpheme string is then obtained by searching the morphemes registered in the morpheme dictionary for the string that matches the input text and minimizes the cost of the whole sentence. The occurrence cost function is normally defined to be smaller for morphemes that appear more frequently, and the connection cost function to be smaller for morpheme strings that connect more readily.

That is, let the morpheme string be $M = (m_1, \ldots, m_n)$, the occurrence cost function be $C_t(m)$, and the connection cost function be $C_c(m_{i-k+1}, \ldots, m_i)$. Morphological analysis is then performed by finding the morpheme string $M$ that minimizes the total cost $\sum C_t + \sum C_c$, that is, $\operatorname{argmin}_M \left( \sum C_t + \sum C_c \right)$. Here the connection cost function is assumed to be determined by a run of $k$ morphemes.

With the cost functions defined this way, every subsequence of the cost-optimal whole sequence is itself cost-optimal when considered on its own. A subsequence that is not cost-optimal therefore cannot be part of the optimal whole sequence and need not be considered during the search. A search that proceeds while discarding subsequences that cannot belong to the optimal sequence is generally called dynamic programming, and it finds the optimal sequence efficiently.
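The dynamic-programming search described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the connection cost depends on two adjacent morphemes (k = 2) and is looked up by morpheme class, a missing class pair is treated as forbidden, and the names `min_cost_path`, `LEXICON`, and `CONN`, as well as the toy costs, are hypothetical and not taken from the patent.

```python
from math import inf

# Hypothetical toy data: surface -> [(class, occurrence cost)]
LEXICON = {
    "a": [("X", 1.0)], "b": [("X", 1.0)], "c": [("X", 1.0)],
    "ab": [("Y", 2.0)], "bc": [("Y", 1.5)],
}
# Hypothetical connection table: (left class, right class) -> connection cost
CONN = {
    ("BOS", "X"): 0.0, ("BOS", "Y"): 0.0,
    ("X", "X"): 1.0, ("X", "Y"): 0.5,
    ("Y", "X"): 0.5, ("Y", "Y"): 1.0,
}

def min_cost_path(text, lexicon, conn_cost):
    """Return (total cost, [(surface, class), ...]) for the cheapest segmentation."""
    n = len(text)
    # best[pos][cls] = (cost, backpointer) of the cheapest partial path that
    # ends at character position pos with a morpheme of class cls.
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0.0, None)
    for start in range(n):
        if not best[start]:
            continue  # no partial path reaches this position
        for end in range(start + 1, n + 1):
            surface = text[start:end]
            for cls, occ in lexicon.get(surface, []):
                for prev_cls, (prev_cost, _) in best[start].items():
                    cost = prev_cost + occ + conn_cost.get((prev_cls, cls), inf)
                    # Keep only the cheapest path per (position, class);
                    # more expensive partial paths are never extended.
                    if cost < best[end].get(cls, (inf, None))[0]:
                        best[end][cls] = (cost, (start, prev_cls, surface))
    if not best[n]:
        return None  # the dictionary cannot cover the whole input
    cls = min(best[n], key=lambda c: best[n][c][0])
    total = best[n][cls][0]
    path, pos = [], n
    while pos > 0:
        start, prev_cls, surface = best[pos][cls][1]
        path.append((surface, cls))
        pos, cls = start, prev_cls
    return total, path[::-1]

print(min_cost_path("abc", LEXICON, CONN))
```

With these toy costs, the segmentation "a" + "bc" (total 3.0) is cheaper than "ab" + "c" (3.5) or "a" + "b" + "c" (5.0), so it is the one returned.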

Of the components of the cost function, the occurrence cost information can be held as part of the morpheme dictionary. The connection cost can be obtained from a table, called a connection table, that is prepared in advance. Building and using a table that covers every combination of morphemes is generally impractical because there are so many morphemes. Instead, morphemes are often grouped into classes (for example, by looking only at the part of speech), and a connection table between classes is used. These functions are sometimes defined so that larger values are more favorable; in that case, the search looks for the morpheme string that maximizes the value for the whole sentence.

When searching for the morpheme string, it is desirable to examine every possible arrangement of morphemes. To obtain morpheme candidates, ordinary morphological analysis therefore repeats the following operation: taking the substring that starts at a given position in the sentence as the search key, retrieve from the morpheme dictionary every registered morpheme that equals a prefix of the key. This kind of search is generally called a common prefix search. The trie is a well-known data structure that supports it relatively efficiently.

A trie is a multiway tree for storing a set of character strings; here it is assumed to be built by storing the characters of each string, from the first character onward, as branches of the tree. Because prefixes shared between strings are shared in the tree, every registered word that is a prefix of the string being searched lies on a single path of the tree. A common prefix search can therefore be carried out by following the search key from the root of the trie toward the leaves.
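A minimal Python sketch of a trie with a common prefix search is shown below for illustration. The class and method names are hypothetical, and a dictionary of child nodes is just one simple way to represent the branches; compact dictionaries typically use denser layouts.

```python
class Trie:
    """A trie node; the root node represents the empty string."""

    def __init__(self):
        self.children = {}    # character -> child Trie node
        self.is_word = False  # True if a registered word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def common_prefix_search(self, key):
        """Return every registered word that is a prefix of key, shortest first."""
        node, found = self, []
        for i, ch in enumerate(key):
            node = node.children.get(ch)
            if node is None:
                break  # no registered word continues along the key
            if node.is_word:
                found.append(key[:i + 1])
        return found


trie = Trie()
for w in ["東", "東京", "東京都", "京都"]:
    trie.insert(w)

print(trie.common_prefix_search("東京都庁"))  # ['東', '東京', '東京都']
print(trie.common_prefix_search("京都庁"))    # ['京都']
```

All matches are found in a single walk from the root, which is why one lookup per start position of the sentence is enough to collect the morpheme candidates.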

Patent Document 1: JP 2009-129258 A

Non-Patent Document 1: "Symbols for Japanese Text-to-Speech Synthesis", JEITA standard IT-4006, March 2010

In Japanese text, a certain amount of orthographic variation is generally unavoidable. Examples include variation between kanji and kana, between hiragana and katakana, and between new and old kanji forms, as well as the substitution or misuse of kanji.

Suppose the comparison between the input character string and the written character string registered in the morpheme dictionary (the string representing a morpheme) is based on exact matching. The morpheme dictionary must then be built with this variation in mind. As one such method, Patent Document 1 discloses a way to analyze documents containing mixed writing (mazegaki; here, writing in kana only the difficult kanji within a morpheme, such as kanji outside the jōyō list, for example "ばん回" for "挽回"): when the morpheme dictionary is created, the difficult characters in each morpheme's written form are converted to kana, and the converted form is also registered as a morpheme. This method can handle variation such as mixed writing, but it increases the number of words registered in the morpheme dictionary and therefore the size of the dictionary.

Another conceivable approach is to apply comparison rules more complex than exact character matching when comparing the written character strings in the morpheme dictionary with the input character string. This approach can handle hiragana versus katakana, new versus old kanji forms, and simple substitution or misuse, but writing rules that cover cases such as mixed writing is difficult. For example, defining rules that match kanji against kana strings requires, in practice, rules for a large number of kanji, and the total number of rules reaches the thousands to tens of thousands. Looking up the applicable rule during character comparison then itself requires a trie-like structure, which complicates the configuration of the device.

A similar approach is to search using an intermediate representation. In this approach, the same code is assigned in advance to characters that are closely related even though they differ (for example, a hiragana and its katakana counterpart, or the old and new forms of a kanji) and to characters that are similar in shape and tend to be misused or substituted for one another. Both the input and the morphemes in the morpheme dictionary are converted to this intermediate representation, and morphological analysis is performed on the intermediate representation.

This approach, too, is not easy to extend to cases such as mixed writing. If the intermediate representation is not unique, the text to be analyzed can take several possible forms, which makes morphological analysis difficult. Uniqueness can be achieved by using, for example, a kana reading as the intermediate representation (such as "エービーシー" for "ABC"), but homophones can then no longer be distinguished, the number of possible analysis results grows, and the accuracy of morphological analysis falls as a result.

The present invention has been made in view of these circumstances. Its object is to provide a morphological analyzer and a program that can take intermediate representations into account when forming hypotheses about morpheme notation while limiting growth in the size of the morpheme dictionary, and that can thereby perform morphological analysis efficiently and accurately in the presence of orthographic variation such as mixed writing.

(1) To achieve the above object, the morphological analyzer of the present invention is a morphological analyzer that outputs a morphological analysis result corresponding to an input sentence, comprising: a partial character string generation unit that cuts partial character strings out of the input sentence; an intermediate representation generation unit that uses an intermediate representation dictionary to generate, as intermediate representations, one or more representations into which each generated partial character string can be converted as an intermediate stage; a morpheme enumeration unit that uses a morpheme dictionary to enumerate the morphemes corresponding to the generated intermediate representations and to concatenations of those intermediate representations; and a morpheme string search unit that searches the enumerated morphemes for a morpheme string satisfying a predetermined condition and outputs it.

This makes it possible to take intermediate representations into account when forming hypotheses about morpheme notation while limiting growth in the size of the morpheme dictionary, and to perform morphological analysis efficiently and accurately in the presence of orthographic variation such as mixed writing.

(2) In the morphological analyzer of the present invention, at least one of the intermediate representation dictionary and the morpheme dictionary stores a code string in which character elements that can make up a partial character string and the intermediate representations corresponding to those character elements are arranged alternately. A character element string and its corresponding intermediate representations can thus be stored together in one area rather than separately, which simplifies the dictionary data structure.
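As a rough illustration of the alternating arrangement described in (2), the following Python sketch flattens pairs of a character element and its intermediate representation into one sequence and recovers both strings from it. The function names, the use of a Python list as the "code string", and the per-character split of "挽回" into "挽" <バン> and "回" <カイ> are assumptions made for this example; the concrete layout used by the invention is not given in this passage.

```python
def encode_entry(pairs):
    """Flatten [(character element, intermediate representation), ...]
    into a single alternating sequence: element, representation, element, ..."""
    codes = []
    for element, intermediate in pairs:
        codes.append(element)
        codes.append(intermediate)
    return codes


def decode_entry(codes):
    """Recover (written string, intermediate representation) from an alternating sequence."""
    written = "".join(codes[0::2])       # even positions hold character elements
    intermediate = "".join(codes[1::2])  # odd positions hold intermediate representations
    return written, intermediate


entry = encode_entry([("挽", "バン"), ("回", "カイ")])
print(entry)                # ['挽', 'バン', '回', 'カイ']
print(decode_entry(entry))  # ('挽回', 'バンカイ')
```

Because both views of an entry live in one sequence, a single stored record can serve lookups by written form and by intermediate representation.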

(3) The morphological analyzer of the present invention is further characterized in that all or part of one of the intermediate representation dictionary and the morpheme dictionary is used as all or part of the other. This reduces the memory needed for storage, so morphological analysis can be performed efficiently while limiting growth in the size of the morpheme dictionary.

(4) The morphological analyzer of the present invention is further characterized in that one of the intermediate representation dictionary and the morpheme dictionary stores, as first dictionary data, code strings in which character elements and the intermediate representations corresponding to them are arranged alternately, and the other stores, for the character element strings running from the beginning up to a predetermined number of character elements of some or all of the character element strings stored as the first dictionary data, code strings arranged alternately with the storage order of character element and intermediate representation reversed relative to the first dictionary data.

(5) The program of the present invention is a program that, when executed by the computer of a morphological analyzer having dictionaries, outputs a morphological analysis result corresponding to an input sentence, the program comprising: a process of cutting partial character strings out of the input sentence; a process of using an intermediate representation dictionary to generate, as intermediate representations, one or more representations into which each generated partial character string can be converted as an intermediate stage; a process of using a morpheme dictionary to enumerate the morphemes corresponding to the generated intermediate representations and to concatenations of those intermediate representations; and a process of searching the enumerated morphemes for a morpheme string satisfying a predetermined condition and outputting it.

According to the present invention, intermediate representations can be taken into account when forming hypotheses about morpheme notation while growth in the size of the morpheme dictionary is limited, and morphological analysis can be performed efficiently and accurately in the presence of orthographic variation such as mixed writing.

FIG. 1 is a block diagram showing the morphological analyzer of the present invention.
FIG. 2 is a diagram showing an example of a data structure of the present invention.
FIG. 3 is a diagram showing an example of a data structure of the present invention.
FIG. 4 is a flowchart showing an example of processing by the morphological analyzer of the present invention.
FIG. 5 is a flowchart showing an example of processing by the morphological analyzer of the present invention.
FIG. 6 is a flowchart showing an example of processing by the morphological analyzer of the present invention.
FIG. 7 is a diagram showing an example of a data structure of the present invention.

Next, embodiments of the present invention are described with reference to the drawings. In the following description, morphemes with different intermediate representations are treated as different morphemes even if their written character strings are the same.

[First Embodiment]
(Configuration of the morphological analyzer)
FIG. 1 is a block diagram showing the morphological analyzer 100. The morphological analyzer 100 is implemented on, for example, a PC, and outputs a morpheme string corresponding to an input sentence. When enumerating morpheme candidates during morphological analysis, the morphological analyzer 100 first converts partial character strings of the input character string into predetermined intermediate representations in the intermediate representation generation unit 120, and also enumerates as candidates the morphemes that have the same intermediate representation. The morphological analyzer 100 comprises a partial character string generation unit 110, an intermediate representation generation unit 120, a morpheme enumeration unit 130, and a morpheme string search unit 140.

The partial character string generation unit 110 cuts partial character strings out of the input sentence. The intermediate representation generation unit 120 uses the intermediate representation dictionary 125 to generate, as intermediate representations, one or more representations into which each generated partial character string can be converted. An intermediate representation is a representation into which a character string can be converted at an intermediate stage of morphological analysis; an example is the reading of the character string written in katakana. Using intermediate representations in this way makes it possible to take them into account when forming hypotheses about morpheme notation while limiting growth in the size of the morpheme dictionary, and to perform morphological analysis efficiently and accurately in the presence of orthographic variation such as mixed writing.

The intermediate representation generation unit 120 converts every partial character string of the input sentence into intermediate representations according to predetermined conversion rules. The conversion rules are defined by the intermediate representation dictionary 125 so that each character element, and each morpheme in the morpheme dictionary (the string of character elements making up the morpheme), can be converted into an intermediate representation. If a partial character string is registered in the intermediate representation dictionary 125, the unit looks it up there and sends the pair of the partial character string and its intermediate representation to the morpheme enumeration unit 130. The pair may be sent together with its position in the input character string (for example, which characters of the input it spans).
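The lookup performed by the intermediate representation generation unit can be sketched in Python as below. This is a simplified illustration, not the unit's actual implementation: the intermediate representation dictionary is modeled as a plain `dict` from a partial character string to a list of katakana readings, every substring is tried by brute force, and the names `generate_intermediates` and `INTER_DICT` are hypothetical.

```python
# Hypothetical stand-in for the intermediate representation dictionary 125:
# partial character string -> list of intermediate representations (katakana readings)
INTER_DICT = {
    "ば": ["バ"],
    "ん": ["ン"],
    "回": ["カイ"],
    "挽回": ["バンカイ"],
}

def generate_intermediates(text, inter_dict):
    """For every partial character string of text that is registered in inter_dict,
    return (start, end, partial string, intermediate representation).
    start and end are character offsets into text (end is exclusive)."""
    results = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            sub = text[start:end]
            for inter in inter_dict.get(sub, []):
                results.append((start, end, sub, inter))
    return results


for item in generate_intermediates("ばん回", INTER_DICT):
    print(item)
```

Carrying the offsets along with each pair gives the later stages the position information they need to line candidates up against the input.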

The morpheme enumeration unit 130 uses the morpheme dictionary 135 to generate the character strings of the morphemes corresponding to the generated intermediate representations and to concatenations of those intermediate representations. A concatenation of intermediate representations is an intermediate representation formed by joining a run of intermediate representations in a way that satisfies a predetermined condition. The predetermined condition is, for example, a predetermined number of intermediate representations.

For each of these, the morpheme enumeration unit 130 accesses the morpheme dictionary 135 and, if a morpheme with a matching intermediate representation exists, outputs the intermediate representation and the morpheme information as morpheme candidate information and sends it to the morpheme string search unit 140. The corresponding partial character string of the input sentence may also be output for use by the morpheme string search unit 140, described later. If necessary, partial character strings may be concatenated before output so that they correspond to the intermediate representation.

The morpheme information includes the position in the input character string and the morpheme character string. When the written character string and the morpheme character string are compared during the morpheme string search, as described later, the morpheme information must include information for restoring the written character string. If the morpheme string search unit 140 is able to refer to the input character string, the position information mentioned above is sufficient.

As an example, consider using katakana strings as the intermediate representation. Hereinafter, the minimum constituent of a written character string, such as an input character string, is called a character element. In most cases one character element corresponds to one character, but it is sometimes preferable to treat a string of two or more characters as a single character element. For example, in a system where the dakuten (voiced sound mark) and handakuten (semi-voiced sound mark) are encoded as independent characters, a kana character followed by a dakuten or handakuten is better treated as one character element than as two: in a trie, for instance, the tree becomes shallower and processing more efficient.
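The grouping of a kana character with a following sound mark can be sketched in Python as follows. The sketch assumes Unicode input in which the marks appear as the combining characters U+3099 (voiced) and U+309A (semi-voiced); the function name `split_character_elements` is hypothetical, and other encodings of the marks (for example half-width forms) are not handled.

```python
# Unicode combining marks: U+3099 (dakuten), U+309A (handakuten)
COMBINING_MARKS = {"\u3099", "\u309A"}

def split_character_elements(text):
    """Split text into character elements, attaching a combining dakuten or
    handakuten to the kana character that precedes it."""
    elements = []
    for ch in text:
        if ch in COMBINING_MARKS and elements:
            elements[-1] += ch   # kana + mark becomes one character element
        else:
            elements.append(ch)
    return elements


decomposed = "は\u3099ん回"  # "ば" written as "は" followed by a combining dakuten
print(len(decomposed))                            # 4 code points
print(len(split_character_elements(decomposed)))  # 3 character elements
```

Treating the pair as one element keeps the trie path for this input three levels deep instead of four.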

The morpheme string search unit 140 searches the enumerated morphemes for a morpheme string that satisfies a predetermined condition. It preferably searches for and outputs the optimal arrangement of morpheme candidates. For example, the morpheme string search unit 140 preferably uses the minimum cost method or a similar method to determine an arrangement of the morpheme candidates output by the morpheme enumeration unit 130 that satisfies the predetermined condition, and outputs it, together with the morpheme information of each morpheme, as the morphological analysis result.

In intermediate representation generation, several different intermediate representations may be generated from a single character element string. For example, when the intermediate representation is a kana representation and a character element or character element string has several kana spellings (readings), several intermediate representations are generated. Likewise, in morpheme candidate enumeration, several morpheme candidates can be generated from a single intermediate representation; for example, several homophones are generated from one intermediate representation.

(Concrete example)
As a concrete example of morphological analysis involving "挽回 <バンカイ>" (the intermediate representation is shown inside < >; the same applies below), consider the case where the morpheme dictionary 135 contains no morpheme for the mixed-writing form "ばん回".

Suppose the character strings "ば <バ>", "ん <ン>", "挽回 <バンカイ>", and "回 <カイ>" are registered as morphemes. With the conventional method, the only morpheme candidates for the input character string "ばん回" are the three morphemes "ば", "ん", and "回" connected in sequence. Note that "ば" and "ん" are normally registered as a particle, an interjection, or the like.

In contrast, the intermediate representation generation unit 120 first generates the intermediate representations "バ", "ン", and "カイ" from "ば", "ん", and "回". This requires conversion rules to the intermediate representation to be defined for character elements and for the morphemes in the morpheme dictionary (that is, the character element strings that make up those morphemes). For example, it suffices to provide conversion rules for every character element, plus conversion rules from the character element string of every morpheme in the morpheme dictionary.

Next, the morpheme enumeration unit 130 enumerates from the morpheme dictionary 135 the morphemes corresponding to every possible concatenation of consecutive intermediate representations: "バ", "バン", "バンカイ", "ン", "ンカイ", and "カイ". As a result, "挽回" is also enumerated as a morpheme candidate, via "バンカイ". Instead of considering every possible concatenation, a maximum number of concatenated units may be set, for example, to reduce the number of morpheme dictionary lookups.
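The enumeration step can be illustrated with a minimal Python sketch. The function name enumerate_candidates, the max_units parameter, and the plain dict used as the morpheme dictionary are assumptions made for illustration; the patent does not prescribe an implementation.

```python
def enumerate_candidates(units, morpheme_dict, max_units=None):
    """Look up every concatenation of consecutive intermediate representations."""
    candidates = []
    n = len(units)
    for start in range(n):
        # Optionally cap how many units are joined, to reduce dictionary lookups.
        stop = n if max_units is None else min(n, start + max_units)
        for end in range(start + 1, stop + 1):
            key = "".join(units[start:end])
            for surface in morpheme_dict.get(key, []):
                candidates.append((start, end, key, surface))
    return candidates


# Hypothetical toy dictionary: intermediate representation -> morpheme strings.
units = ["バ", "ン", "カイ"]
morpheme_dict = {"バ": ["ば"], "ン": ["ん"], "カイ": ["回"], "バンカイ": ["挽回"]}

all_candidates = enumerate_candidates(units, morpheme_dict)
limited = enumerate_candidates(units, morpheme_dict, max_units=2)
```

With no cap, the keys looked up are exactly the six concatenations listed above, and "挽回" is found through "バンカイ". With max_units=2 the three-unit key is never looked up, so "挽回" is missed; the cap trades coverage for fewer lookups.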

By passing through the intermediate representation once in this way, the morphological analysis of the input "ばん回" can take into account the possibility that it semantically contains the morpheme "挽回".

(Mitigating the effect of processing via the intermediate representation)
In the morpheme string search, the edit distance from the written character string may, for example, be added to the cost function so that morpheme candidates involving character element substitution are less likely to be selected. For example, if insertion, deletion, and substitution of a character each count as distance 1, the edit distance between the morpheme character string "挽回" and the input character string "ばん回" is 2. Multiplying this edit distance D by a suitable weighting factor Wd and adding the product to the original morpheme occurrence cost Ct(m) gives equation (1).

Ct'(m) = Ct(m) + Wd · D … (1)
Using equation (1) as the occurrence cost in morphological analysis makes morpheme candidates generated by substitution less likely to appear in the final result. In this example, the occurrence costs of the morphemes "ば", "ん", and "回" are generally considered relatively large (corresponding to their low frequency of occurrence), so for the input "ばん回" the candidate "挽回", generated via the intermediate representation, tends to be selected over the sequence "ば", "ん", "回".
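Equation (1) can be checked with a short Python sketch. The edit distance below is the standard Levenshtein distance with unit costs, matching the stated assumption that insertion, deletion, and substitution each cost 1. The function names and the numeric base cost and weight are illustrative assumptions, not values from the patent.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion, deletion, and substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def adjusted_cost(base_cost, morpheme, written, weight):
    """Ct'(m) = Ct(m) + Wd * D, as in equation (1)."""
    return base_cost + weight * edit_distance(morpheme, written)


d = edit_distance("挽回", "ばん回")
cost = adjusted_cost(10.0, "挽回", "ばん回", 1.5)  # hypothetical Ct(m) and Wd
```

Here d is 2 (substitute "挽" with "ば", then insert "ん"), so the adjusted cost is 10.0 + 1.5 × 2 = 13.0. A candidate whose morpheme string equals the written string has D = 0 and keeps its original cost.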

The intermediate representation itself may also be taken into account when computing the occurrence cost and the connection cost. For example, the occurrence probability of each intermediate representation can be estimated in advance from a large amount of text, and an occurrence cost function used in which the cost decreases as the probability of the resulting intermediate representation increases. Similarly, the occurrence probability of each pair of intermediate representations of adjacent morphemes can be estimated in advance from a large amount of text, and a connection cost function used in which the cost decreases as the probability of the pair increases. This avoids morphological analysis results that are unnatural from the standpoint of the intermediate representation.
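One common way to obtain a cost that decreases as probability increases is the negative log of a relative frequency. The sketch below uses that choice; it is an assumption, since the text only requires that the cost be smaller for more probable intermediate representations. The function name and the toy observation lists are hypothetical.

```python
import math
from collections import Counter


def costs_from_observations(observations):
    """Map each observed item to -log(relative frequency)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {item: -math.log(count / total) for item, count in counts.items()}


# Hypothetical intermediate representations collected from a corpus.
readings = ["バンカイ", "バンカイ", "バンカイ", "カイ"]
occurrence_cost = costs_from_observations(readings)

# Hypothetical pairs of intermediate representations of adjacent morphemes.
pairs = [("バンカイ", "スル"), ("バンカイ", "スル"), ("カイ", "スル")]
connection_cost = costs_from_observations(pairs)
```

Items never observed receive no cost in this sketch; a practical system would need some form of smoothing or a default cost, which is outside what the text specifies.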

[Second Embodiment]
The intermediate representation dictionary 125 used in the first embodiment has a data structure expressed as a table of pairs, each consisting of a character element string making up a written character string and its corresponding intermediate representation. The same content may instead be held in a data structure with a different form of representation.

That is, at least one of the intermediate representation dictionary 125 and the morpheme dictionary 135 preferably stores code strings in which the character elements that can form a partial character string and the intermediate representations corresponding to those character elements are arranged alternately. This allows each pair of a character element string and its intermediate representation to be stored in a single area rather than separately, simplifying the dictionary data structure.

For example, in the intermediate representation dictionary 125, "挽回 <バンカイ>" can be expressed as "挽/バン/回/カイ" (where "/" is a delimiter between written characters and intermediate representation notation). With this representation, each pair of a character element string and its intermediate representation can be stored in a single area rather than separately, which simplifies the dictionary data structure. The same applies to the way intermediate representations and morpheme character strings are represented in the morpheme dictionary 135.
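The alternating code string can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that "/" never occurs inside a character element or an intermediate representation.

```python
DELIM = "/"  # delimiter between written characters and intermediate representation


def encode_interleaved(pairs):
    """[(character element, reading), ...] -> 'c1/r1/c2/r2/...'."""
    return DELIM.join(part for pair in pairs for part in pair)


def decode_interleaved(code):
    """Inverse of encode_interleaved."""
    parts = code.split(DELIM)
    if len(parts) % 2:
        raise ValueError("code string must hold an even number of fields")
    return list(zip(parts[0::2], parts[1::2]))


code = encode_interleaved([("挽", "バン"), ("回", "カイ")])
pairs = decode_interleaved(code)
written = "".join(c for c, _ in pairs)   # the written character string
reading = "".join(r for _, r in pairs)   # the intermediate representation
```

Both the written string "挽回" and the reading "バンカイ" are recoverable from the single code string "挽/バン/回/カイ", which is what lets one storage area serve both directions of conversion.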

Furthermore, when these are stored in the form of a trie, the information encoded by the alternating arrangement lets the dictionaries share common prefixes, reducing the memory needed to store the conversion tables. That is, all or part of one of the intermediate representation dictionary 125 and the morpheme dictionary 135 can be used as all or part of the other. This allows efficient morphological analysis while limiting growth in dictionary size.

FIGS. 2 and 3 show examples of the data structure. FIG. 2 shows an example of a dictionary trie for converting written character strings to intermediate representation notation, and FIG. 3 shows an example of a dictionary trie for converting intermediate representations to morpheme character strings.

Here, the trie represents the conversion information for "い/イ", "か/カ", "ば/バ", "ん/ン", "挽/バン", "挽回/バンカイ", "挽く/ヒク", "回/カイ", "回す/マワス", and "回る/マワル".
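The prefix sharing that a trie provides can be illustrated with a small Python sketch. It assumes per-character alternation as in "挽/バン/回/カイ", a nested-dict trie with one edge per symbol, and a hypothetical end-of-entry marker "$"; none of these details are specified by the figures, and only three of the entries above are used.

```python
END = "$"  # hypothetical end-of-entry marker (not part of the patent's notation)


def build_trie(codes):
    """Nested-dict trie with one edge per symbol of each code string."""
    root = {}
    for code in codes:
        node = root
        for symbol in code + END:
            node = node.setdefault(symbol, {})
    return root


def count_nodes(node):
    """Number of nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())


codes = ["挽/バン", "挽/バン/回/カイ", "回/カイ"]
trie = build_trie(codes)
shared = count_nodes(trie)
unshared = 1 + sum(len(code) + 1 for code in codes)  # root + one chain per entry
```

The entries "挽/バン" and "挽/バン/回/カイ" share the four-symbol prefix "挽/バン", so the trie needs 17 nodes where three independent chains under one root would need 21.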

Similarly, when the morpheme dictionary 135 includes the morpheme character string as part of the morpheme information, pairs of intermediate representation and morpheme character string can be stored in the form of a representation that alternates the character elements making up the morpheme character string with their corresponding intermediate representations.

(Dictionary access)
When a trie representation is used, accessing the dictionary may require searching the trie along multiple paths. As an example, access to a dictionary whose data structure alternates character element strings and intermediate representations, in that order, as in this embodiment, is described. FIGS. 4 to 6 are flowcharts showing an example of the processing of the morphological analyzer 100. They show the search flow for the case where character elements and intermediate representations alternate in that order. In the processing below, node sets are denoted {Ng}, {Nc}, {Ni}, and so on.

First, the root node is assigned to {Ng} and {Ni} is set to the empty set (step S1). It is then determined whether {Ng} contains any elements (step S2); if it does, processing proceeds to step S3, and if {Ng} is empty, to step S11. In step S3, one node is taken from {Ng} and called n. The node taken is removed from the set; in the node-taking steps below, the node taken is likewise removed from the set it belonged to. All child nodes of node n are then assigned to the node set {Nc} (step S4).

It is determined whether {Nc} contains any elements (step S5); if so, processing proceeds to step S6, and if not, returns to step S2. In step S6, one node is taken from {Nc} and called c. In step S7, it is determined whether the code attached to the edge from node n to node c matches the character element being searched for; if it matches, processing proceeds to step S8, and otherwise to step S9. In step S8, node c is added to {Ng} and processing returns to step S5.

In step S9, it is determined whether the code attached to the edge from node n to node c is the delimiter; if it is, processing proceeds to step S10, and if not, returns to step S5. In step S10, node c is added to {Ni} and processing returns to step S5.

In step S11, it is determined whether {Ni} contains any elements; if so, processing proceeds to step S12, and if {Ni} is empty, to step S21.

In step S12, one node is taken from {Ni} and called n. All child nodes of node n are assigned to {Nc} (step S13). It is determined whether {Nc} contains any elements (step S14); if so, processing proceeds to step S15, and if {Nc} is empty, returns to step S11.

In step S15, one node is taken from {Nc} and called c. It is determined whether node c is a leaf node (step S16); if it is, processing proceeds to step S17, and if not, to step S18. In step S17, the character elements and intermediate representations on the edges from the root node to node c are output as a search result, and processing returns to step S14.

In step S18, it is determined whether the code attached to the edge from node n to node c is the delimiter; if so, processing proceeds to step S19, and if not, to step S20. In step S19, node c is added to {Ng} and processing returns to step S14. In step S20, node c is added to {Ni} and processing returns to step S14. In step S21, processing ends if {Ng} is empty; otherwise it returns to step S2.
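The two-worklist search above can be sketched in Python. This is a simplified illustration rather than a transcription of FIGS. 4 to 6: each worklist entry carries the input position and the reading collected so far, and a hypothetical end marker "$" creates the leaf edges. Both are assumptions, since the flowchart does not state how the character element being searched for advances. The lists grapheme_states and reading_states play the roles of {Ng} and {Ni}.

```python
DELIM, END = "/", "$"  # "$" is a hypothetical end-of-entry marker


def build_trie(codes):
    root = {}
    for code in codes:
        node = root
        for symbol in code + END:
            node = node.setdefault(symbol, {})
    return root


def lookup_prefixes(trie, text):
    """Return (written prefix of text, reading) for every matching entry."""
    results = []
    grapheme_states = [(trie, 0, "")]   # role of {Ng}: expecting a character element
    reading_states = []                 # role of {Ni}: inside an intermediate representation
    while grapheme_states or reading_states:
        while grapheme_states:
            node, pos, reading = grapheme_states.pop()
            for label, child in node.items():
                if pos < len(text) and label == text[pos]:
                    grapheme_states.append((child, pos + 1, reading))  # cf. S7, S8
                elif label == DELIM:
                    reading_states.append((child, pos, reading))       # cf. S9, S10
        while reading_states:
            node, pos, reading = reading_states.pop()
            for label, child in node.items():
                if label == END:
                    results.append((text[:pos], reading))              # cf. S16, S17
                elif label == DELIM:
                    grapheme_states.append((child, pos, reading))      # cf. S18, S19
                else:
                    reading_states.append((child, pos, reading + label))  # cf. S20
    return results


trie = build_trie(["ば/バ", "ん/ン", "挽/バン", "挽/バン/回/カイ", "回/カイ"])
found = sorted(lookup_prefixes(trie, "挽回"))
```

Character element edges are followed only when they match the input, whereas every branch inside an intermediate representation is followed because the reading is not known in advance. For the input "挽回" the sketch yields both ("挽", "バン") and ("挽回", "バンカイ").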

[Third Embodiment]
In the first embodiment, the intermediate representation generation process uses the intermediate representation dictionary 125 and the morpheme enumeration process uses the morpheme dictionary 135, but the intermediate representation generation process may instead use the morpheme dictionary 135.

If the morpheme information in the morpheme dictionary 135 includes the morpheme character string, the intermediate representation corresponding to a morpheme character string, which the intermediate representation generation unit 120 requires, can be obtained from the morpheme dictionary 135 used by the morpheme enumeration unit 130.

Likewise, the conversion from intermediate representation to morpheme character string required by the morpheme enumeration unit 130 can be performed using the conversion table from written character strings to intermediate representations held in the intermediate representation dictionary 125. Written character strings that are also morpheme character strings are marked with a code indicating this, and the table is searched for marked written character strings that have the given intermediate representation. This reduces the overall dictionary size.

[Fourth Embodiment]
One of the intermediate representation dictionary 125 and the morpheme dictionary 135 may store, as first dictionary data, code strings in which character elements and their corresponding intermediate representations are arranged alternately. The other may then store, for the character element strings running from the beginning of some or all of the character element strings in the first dictionary data up to a predetermined number of character elements, code strings in which the character elements and their intermediate representations alternate in the reverse order to that of the first dictionary data.

As a result, the lower part of the tree can use the data structure whose code strings are in the opposite order, while the increase in processing is kept small. For example, of the trie used for converting written character strings to intermediate representations and the trie used for converting intermediate representations to morpheme character strings, one (the one in which the intermediate representation comes first) can be a small trie covering only, say, the first two character elements, with the other trie used for the search beyond that point.

In converting between written character strings and intermediate representations, the trie is traversed from the root (the top of the figures) toward the leaves (the bottom). The intermediate representation generation unit 120 can therefore use a trie ordered character element first, as in "挽/バン/回/カイ", and the morpheme dictionary 135 accessed by the morpheme enumeration unit 130 can use a trie ordered intermediate representation first, as in "バン/挽/カイ/回".

However, as a general tendency the number of branches decreases toward the leaves, so following the tree downward on a trial basis does not usually increase processing much. For example, when a trie structured as "挽/バン/回/カイ" is used to convert an intermediate representation to a morpheme character string, the character elements must be skipped to reach the intermediate representation first, and skipping a character element means following every branch at that position. Even so, because the lower part of the tree has fewer branches than the upper part, processing does not increase much. Therefore, of the trie used for converting written character strings to intermediate representations and the trie used for converting intermediate representations to morpheme character strings, one may be a small trie covering only, for example, the first two character elements, with the other trie used for the search beyond that point. This further reduces the overall size; trial searching becomes necessary, but the increase in processing is limited.

Trial downward searches are preferable, in terms of processing, when they are shallow. It is therefore generally preferable to build the trie in which the character element precedes the intermediate representation ("挽/バン/回/カイ") in full, as before, and to build the other trie, in which the intermediate representation comes first, only up to the first character element of each morpheme character string. In the example above, that trie is built from data truncated at "バン/挽".

In this example, when converting from intermediate representation to morpheme character string, processing must move from the state of having reached the leaf of "バン/挽" (the rightmost element as written here) to the node corresponding to "回" in "挽/バン/回/カイ". The information needed to reach that node, namely the character element preceding "回" (that is, "挽") and its intermediate representation ("バン"), has already been obtained, so this access is easy. FIG. 7 shows an example of the data structure: the trie corresponding to FIG. 3, built only up to the first written character.

Alternatively, position information for directly referencing the "回" node in the "挽/バン/回/カイ" trie, without traversing from the trie root, may be stored in the leaf node of the "バン/挽" trie and used to access the intermediate node of that trie directly.
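The combination of a small intermediate-representation-first structure with direct references into the full character-element-first trie can be sketched as below. This is one possible reading, under several assumptions: per-character alternation, a nested-dict trie, a hypothetical end marker "$", and a plain dict (head_index) standing in for the truncated "バン/挽" trie, whose values are direct references to nodes of the main trie. "晩" is an extra hypothetical homophone entry added to show branching.

```python
DELIM, END = "/", "$"  # "$" is a hypothetical end-of-entry marker


def build_main_trie(entries):
    """Character-element-first trie over codes such as '挽/バン/回/カイ'."""
    root = {}
    for pairs in entries:
        node = root
        for symbol in DELIM.join(p for pair in pairs for p in pair) + END:
            node = node.setdefault(symbol, {})
    return root


def build_head_index(trie, entries):
    """reading of first pair -> {first character element: node in main trie}."""
    index = {}
    for pairs in entries:
        grapheme, reading = pairs[0]
        node = trie
        for symbol in grapheme + DELIM + reading:
            node = node[symbol]
        index.setdefault(reading, {})[grapheme] = node  # direct node reference
    return index


def surfaces_from(node, remaining, surface):
    """Trial descent: skip each character element, then match the reading."""
    found = [surface] if END in node and not remaining else []
    for grapheme, g_node in node.get(DELIM, {}).items():
        r_node, k = g_node.get(DELIM), 0
        while r_node is not None and k < len(remaining):
            r_node = r_node.get(remaining[k])
            k += 1
            if r_node is not None:
                found += surfaces_from(r_node, remaining[k:], surface + grapheme)
    return found


def find_surfaces(index, reading):
    results = []
    for cut in range(1, len(reading) + 1):
        for grapheme, node in index.get(reading[:cut], {}).items():
            results += surfaces_from(node, reading[cut:], grapheme)
    return results


entries = [[("挽", "バン")], [("挽", "バン"), ("回", "カイ")],
           [("回", "カイ")], [("晩", "バン")]]
main_trie = build_main_trie(entries)
head_index = build_head_index(main_trie, entries)
```

Only the first pair is indexed in intermediate-representation-first order. Beyond it, every character element branch is tried, which stays cheap where the lower part of the trie has few branches. In this sketch find_surfaces(head_index, "バンカイ") returns ["挽回"].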

[Other Embodiments]
(Other examples of intermediate representations)
Above, katakana notation is used as the intermediate representation, but other representations may also be used. For example, phonetic notation in the IPA (International Phonetic Alphabet) may be used.

Alternatively, the intermediate representation may be a notation that uses a character set consisting only of kana and jōyō (regular-use) kanji, written in kanji wherever possible. With this approach, the morpheme occurrence cost function in morphological analysis takes both the written character string and the intermediate representation into account, so that morphemes for which old character forms are commonly used, such as proper nouns, can be distinguished more accurately from those for which they are rarely used. For example, the place name "桜井" is rarely written "櫻井", so the spelling "櫻井" most likely denotes a personal name; yet the possibility that "櫻井" is a place name cannot be ruled out entirely.

In morphological analysis based on the minimum cost method, the arrangement of morphemes is ultimately also considered and an overall judgment is made. With the above approach, given that "桜井" and "櫻井" as commonly used personal names and "桜井" as a place name are registered in the morpheme dictionary, "櫻井" used as a place name can be considered in the analysis without registering the rarely used place name "櫻井" in the dictionary. Thus, for example, by using a connection cost function defined so that the connection cost is smaller when a place name is immediately followed by a morpheme such as "市" (city) or "町" (town), the place-name morpheme rather than the personal-name morpheme can be selected. However, if the intermediate representation is too abstract, the number of morpheme possibilities to be considered during processing increases.

This can be a particular problem for proper nouns, which have many homophonous morphemes. In the example above, if katakana notation is used as the intermediate representation, the intermediate representation of "櫻井" is "サクライ", and further morpheme possibilities such as "佐倉井" must also be considered. In the above approach, reflecting the number of character substitutions in the occurrence cost means that such possibilities are eliminated during the search for the cost-optimal solution. Even so, using a less abstract intermediate representation such as "桜井" limits how much notational variation can be handled, but reduces the number of possibilities to be considered in the subsequent search and makes processing more efficient.

(Example without a delimiter)
In the second and fourth embodiments above, a delimiter is used in the code string representation that alternates character elements and intermediate representations, but the arrangement is not limited to this. For example, if the codes representing character elements and the codes representing intermediate representations are completely disjoint in their code positions (code values), no delimiter is needed.
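A delimiter-free code string can be sketched in Python under a strong assumption: intermediate representations use only code points in the Unicode Katakana block (U+30A0 to U+30FF) and character elements never do. Real input text can itself contain katakana, so a practical system would need a separate code range for the intermediate representation. The function names are hypothetical.

```python
from itertools import groupby


def is_katakana(ch):
    """True for code points in the Unicode Katakana block (U+30A0..U+30FF)."""
    return "\u30a0" <= ch <= "\u30ff"


def split_pairs(code):
    """Split 'c1r1c2r2...' into (character element, reading) pairs by code range."""
    runs = ["".join(group) for _, group in groupby(code, key=is_katakana)]
    if len(runs) % 2 or any(is_katakana(run[0]) for run in runs[0::2]):
        raise ValueError("expected character elements followed by a katakana reading")
    return list(zip(runs[0::2], runs[1::2]))


pairs = split_pairs("挽バン回カイ")
```

The boundary between a character element and its reading is recovered purely from which range each code falls in, so "挽バン回カイ" carries the same information as "挽/バン/回/カイ".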

(Program)
The operations described above can be realized, for example, by having a CPU mounted in the morphological analyzer 100 execute a program. The program can be distributed recorded on a recording medium. It can also be distributed by receiving signals transmitted by a computer acting as a transmitting device over a transmission medium such as a communication network made up of public telephone lines, dedicated telephone lines, cable television lines, wireless communication lines, and the like.

100 Morphological analyzer
110 Partial character string generation unit
120 Intermediate representation generation unit
125 Intermediate representation dictionary
130 Morpheme enumeration unit
135 Morpheme dictionary
140 Morpheme string search unit

Claims

A morphological analyzer that outputs a morphological analysis result corresponding to an input character string, comprising:
a partial character string generation unit that cuts out and generates partial character strings from an input character string that is a sentence in mixed kanji and kana;
an intermediate representation generation unit that, using an intermediate representation dictionary that stores in association pairs of a character element string that can be a partial character string of the mixed kanji and kana sentence and an intermediate representation that is a reading notation corresponding to that character element string, generates as intermediate representations one or more reading notations, as an intermediate stage, into which each generated partial character string can be converted;
a morpheme enumeration unit that, using a morpheme dictionary that stores in association pairs of an intermediate representation and a character element string that can be a morpheme candidate, enumerates morpheme candidates corresponding to the generated intermediate representations and to concatenations of those intermediate representations; and
a morpheme string search unit that searches the enumerated morpheme candidates for a morpheme string satisfying a predetermined condition and outputs it,
wherein the predetermined condition is optimization of a measure defined by combining a measure corresponding to the appearance frequency of each enumerated morpheme candidate with a measure corresponding to the ease of connection when the enumerated morpheme candidates are consecutive, and
wherein at least one of the intermediate representation dictionary and the morpheme dictionary stores, as dictionary data, code strings in which the character elements that can form a partial character string and the intermediate representations corresponding to those character elements are arranged alternately, and is used for conversion from the character element string to the intermediate representation or for conversion from the intermediate representation to the morpheme string.

The morpheme analyzer according to claim 1, wherein all or part of one of the intermediate representation dictionary and the morpheme dictionary is used as all or part of the other.

The morphological analyzer according to claim 2, wherein one of the intermediate representation dictionary and the morpheme dictionary stores, as first dictionary data, code strings in which the character elements and the intermediate representations corresponding to those character elements are arranged alternately, and is used for conversion from the character element string to the intermediate representation, and
the other stores, for the character element strings running from the beginning of some or all of the character element strings stored as the first dictionary data up to a predetermined number of character elements, code strings in which the character elements and the corresponding intermediate representations are arranged alternately in the reverse order to that of the first dictionary data, and is used for conversion from the intermediate representation to the morpheme.

A program that causes a computer of a morphological analyzer having dictionaries to output a morphological analysis result corresponding to an input character string, by causing the computer to execute:
a process of cutting out and generating partial character strings from an input character string that is a sentence in mixed kanji and kana;
a process of generating as intermediate representations, using an intermediate representation dictionary that stores in association pairs of a character element string that can be a partial character string of the mixed kanji and kana sentence and an intermediate representation that is a reading notation corresponding to that character element string, one or more reading notations, as an intermediate stage, into which each generated partial character string can be converted;
a process of enumerating, using a morpheme dictionary that stores in association pairs of an intermediate representation and a character element string that can be a morpheme candidate, morpheme candidates corresponding to the generated intermediate representations and to concatenations of those intermediate representations; and
a process of searching the enumerated morpheme candidates for a morpheme string satisfying a predetermined condition and outputting it,
wherein the predetermined condition is optimization of a measure defined by combining a measure corresponding to the appearance frequency of each enumerated morpheme candidate with a measure corresponding to the ease of connection when the enumerated morpheme candidates are consecutive, and
wherein at least one of the intermediate representation dictionary and the morpheme dictionary stores, as dictionary data, code strings in which the character elements that can form a partial character string and the intermediate representations corresponding to those character elements are arranged alternately, and is used for conversion from the character element string to the intermediate representation or for conversion from the intermediate representation to the morpheme string.