JP2015194801A - Dictionary device, morpheme analysis device, data structure, method and program of morpheme analysis

Info

Publication number: JP2015194801A
Application number: JP2014071155A
Authority: JP
Inventors: 信行西澤; Nobuyuki Nishizawa
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2014-03-31
Filing date: 2014-03-31
Publication date: 2015-11-05
Anticipated expiration: 2034-03-31
Also published as: JP6300601B2

Abstract

PROBLEM TO BE SOLVED: To provide a dictionary device, a morphological analysis device, a data structure, and a morphological analysis method and program capable of reducing the data size of a dictionary and improving the efficiency of processing in morphological analysis.

SOLUTION: A dictionary device 120 for morphological analysis, which stores character string data in a trie-based structure, stores partial character strings obtained by dividing the character string data and information on those partial character strings, arranged alternately along the character strings. Using the information on the partial character strings makes efficient data representation possible and reduces the amount of character string data that must be stored for morphological analysis. In addition, since information on partial character strings that will not be adopted can be identified at an early stage, morphological analysis can be processed efficiently. Furthermore, since the information on the partial character strings can be referenced easily, morphological analysis that takes constraints on that information into account can be realized.

Description

The present invention relates to a dictionary device for morphological analysis that stores character string data in a trie-based structure, and to a morphological analysis device, a data structure, and a morphological analysis method and program.

A typical application of speech synthesis technology is text-to-speech (TTS) conversion, which synthesizes a speech waveform corresponding to input text. In the following, this series of processes is divided broadly into two: analyzing the input text to generate information on how the text is read, and synthesizing a speech waveform from that reading information. The input is assumed to be Japanese text written in a mixture of kanji and kana.

Hereinafter, the symbols used to express reading information are called speech synthesis symbols. Speech synthesis symbols can take various forms; here we assume a notation that simultaneously represents the phonological information making up an utterance and the prosodic information expressed mainly as pauses and voice pitch. An example is the JEITA (Japan Electronics and Information Technology Industries Association) standard IT-4006, "Symbols for Japanese Text-to-Speech Synthesis" (see Non-Patent Document 1). Although it is difficult to express features such as emotion in speech with these symbols alone, they contain at least the information needed to describe the linguistic features of ordinary read-aloud speech.

The speech waveform, in turn, is synthesized so that it follows the speech synthesis symbols. Accurate reading of Japanese text can therefore be achieved by creating accurate speech synthesis symbols for the Japanese mixed kanji-kana text.

Speech synthesis symbols can be generated from arbitrary Japanese text by dividing the mixed kanji-kana text into morphemes, the smallest units that carry meaning in linguistic expression; assigning a reading to each morpheme; appropriately modifying the morpheme information with reference to the morpheme sequence and the like; inserting prosodic boundaries such as pauses where necessary; and concatenating the results. The reading of each morpheme is created and stored in advance as morpheme dictionary information.

However, a morpheme need not follow the linguistic definition; it may be any unit delimited in a way that suits the processing. For example, to handle morpheme sequences more appropriately, a phrase composed of several morphemes (such as a compound noun phrase) may be treated as a single morpheme for convenience. In the following, a morpheme therefore means a sequence of characters (a character string) set appropriately as the minimum processing unit for the application at hand, and every sentence is assumed to be constructible by concatenating such character strings. The process of dividing a sentence into morphemes is generally called morphological analysis and is used not only in speech synthesis but also for tasks such as extracting the constituents of a sentence.

As a morphological analysis method, a method based on the minimum cost method is described below. In morphological analysis by the minimum cost method, an occurrence cost function reflecting the frequency of each morpheme and a connection cost function expressing how readily consecutive morphemes connect are defined in advance. An appropriate morpheme sequence is then obtained by searching the morphemes registered in the morpheme dictionary for the sequence that matches the input text and minimizes the cost of the whole sentence. Usually, the occurrence cost function is defined to be smaller for morphemes that occur more frequently, and the connection cost function to be smaller for morpheme sequences that connect more readily.

That is, with the morpheme sequence M = (m_1, ..., m_n), the occurrence cost function C_t(m), and the connection cost function C_c(m_{i-k+1}, ..., m_i), morphological analysis is performed by finding the morpheme sequence M that minimizes the total cost Σ C_t + Σ C_c, that is, argmin_M (Σ C_t + Σ C_c). Here the connection cost function is assumed to be determined by a sequence of k morphemes.

With cost functions defined in this way, any subsequence that forms part of the cost-optimal overall sequence is itself cost-optimal when considered on its own. A subsequence that is not cost-optimal therefore cannot be a component of the optimal overall sequence and need not be considered in the search. A search method that proceeds while discarding subsequences that cannot form part of the optimal sequence is generally called dynamic programming, and it allows the optimal sequence to be found efficiently.
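The minimum cost search described above can be illustrated with a small dynamic programming sketch. This is not the patent's implementation: it assumes the k = 2 case, in which the connection cost depends only on the parts of speech of two adjacent morphemes, and the lexicon format, the `BOS` start tag, and all function names are hypothetical.

```python
from math import inf

def min_cost_segmentation(text, lexicon, connect_cost):
    """Return (total_cost, morphemes) minimizing occurrence + connection costs.

    lexicon: dict surface -> list of (pos, occurrence_cost)  (toy format, assumed)
    connect_cost: function (prev_pos, pos) -> cost, i.e. the k = 2 case
    """
    n = len(text)
    # best[i][pos] = (cost, back-pointer) of the cheapest path that ends at
    # character offset i with a morpheme tagged pos
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0.0, None)
    for start in range(n):
        if not best[start]:
            continue  # no morpheme sequence reaches this offset
        for end in range(start + 1, n + 1):
            surface = text[start:end]
            for pos, occ in lexicon.get(surface, []):
                for prev_pos, (prev_cost, _) in best[start].items():
                    cost = prev_cost + occ + connect_cost(prev_pos, pos)
                    # keep only the cheapest partial path per (offset, pos)
                    if cost < best[end].get(pos, (inf, None))[0]:
                        best[end][pos] = (cost, (start, prev_pos, surface))
    if not best[n]:
        return inf, []
    pos, (cost, back) = min(best[n].items(), key=lambda kv: kv[1][0])
    morphemes = []
    while back is not None:  # follow back-pointers to recover the sequence
        start, prev_pos, surface = back
        morphemes.append((surface, pos))
        pos = prev_pos
        back = best[start][prev_pos][1]
    return cost, morphemes[::-1]
```

Keeping only the cheapest partial path for each (offset, part of speech) pair is the pruning step: a more expensive partial path ending in the same state can never be part of the optimal whole.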

Among the components of the cost function, the occurrence cost information can be held as part of the morpheme dictionary. The connection cost can be obtained by creating a table, called a connection table, in advance and looking up its values. Because it is difficult to build a table covering every combination of morpheme sequences, a table that considers only the parts of speech of the morphemes is sometimes used instead. These functions may also be defined so that larger values are more favorable, in which case the morpheme sequence that maximizes the value for the whole sentence is searched for.

In the search for morpheme sequences during morphological analysis, it is preferable to examine every possible sequence of morphemes. In ordinary morphological analysis, therefore, candidate morphemes are obtained by repeatedly taking a substring starting at some position in the sentence as the search key and retrieving every morpheme registered in the morpheme dictionary that equals a leading substring of the key. Such a search is generally called a common prefix search. The trie and the Patricia tree are known as data structures that support it relatively efficiently.

A trie is a multiway tree structure for storing multiple character strings; here it is assumed to be built by storing the characters of each string as tree branches in order from the first character. Because common prefixes are shared in the tree, every registered word that is a prefix of the string being searched lies on a single path of the tree. A common prefix search can therefore be realized by following the tree from the root toward the leaves along the search key.
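A minimal sketch of a trie with common prefix search, in the conventional form where only the written form is used as the key. The class and function names are hypothetical and chosen for illustration only.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.entries = []   # payloads of the words that end at this node

def insert(root, word, payload):
    """Store word character by character, sharing existing prefixes."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.entries.append(payload)

def common_prefix_search(root, key):
    """Yield (prefix_length, payload) for every stored word that is a prefix of key."""
    node = root
    for i, ch in enumerate(key):
        node = node.children.get(ch)
        if node is None:
            return  # no stored word continues along this key
        for payload in node.entries:
            yield i + 1, payload
```

Because all matching words lie on one root-to-leaf path, a single walk along the key finds every candidate.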

A Patricia tree is a trie in which each node that has only one child is merged with that child. As a result of this merging, a single branch may hold not just one character but several consecutive characters.

It is known that a trie can be implemented with little memory using LOUDS (Level-Order Unary Degree Sequence), one of the succinct data structures (see Non-Patent Document 2).
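For orientation, the LOUDS bit sequence of a tree can be produced as sketched below: the nodes are visited in level order and each node's number of children is written in unary (that many 1 bits followed by a 0), after a conventional "10" prefix for a virtual super-root. The tree representation (a dict from node to an ordered child list) and the function name are assumptions for illustration; the rank/select machinery needed to navigate the bit sequence is omitted.

```python
from collections import deque

def louds_bits(tree, root):
    """Return the LOUDS bit string for a tree given as node -> ordered child list."""
    bits = ["10"]            # virtual super-root with the real root as its only child
    queue = deque([root])
    while queue:             # breadth-first (level-order) traversal
        node = queue.popleft()
        children = tree.get(node, [])
        bits.append("1" * len(children) + "0")  # degree in unary
        queue.extend(children)
    return "".join(bits)
```

A tree with n nodes yields 2n + 1 bits, which is why the representation is described as memory-efficient.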

Non-Patent Document 1: "Symbols for Japanese Text-to-Speech Synthesis," JEITA Standard IT-4006, March 2010
Non-Patent Document 2: Jacobson, G., "Space-efficient static trees and graphs," 30th Annual Symposium on Foundations of Computer Science, pp. 549-554, October 1989

In Japanese and many other languages, there is a strong relationship between the written form and the reading. Conventional morphological analysis systems, however, treat the reading as additional information attached to each morpheme and do not exploit the relationship between written form and reading in their data representation. As a result, for example, the reading is encoded entirely independently for each morpheme, which increases the size of the morpheme dictionary. The reading information can be compressed, for instance by taking the frequency of the symbols that describe it into account as in Huffman coding, but the compressed readings must then be decoded, which increases the processing required.

Furthermore, in a morpheme dictionary based on a tree structure, the reading information is attached only to the leaves of the tree. No reading information, not even a partial reading for the portion traversed so far, can be obtained until the tree has been followed from the root to a leaf node. Consequently, algorithms that terminate the traversal of the tree-structured dictionary partway through cannot use the reading information as a criterion for termination.

The present invention has been made in view of these circumstances, and its object is to provide a dictionary device, a morphological analysis device, a data structure, and a morphological analysis method and program that can reduce the data size of the dictionary and make processing during morphological analysis more efficient.

(1) To achieve the above object, the dictionary device of the present invention is a dictionary device for morphological analysis that stores character string data in a trie-based structure, characterized in that partial character strings obtained by dividing the character string data and information on those partial character strings are stored arranged alternately along the character string.

As a result, the common leading portion of the information on the partial character strings is shared in the tree structure in the same way as the partial character strings themselves, so the size of the character string data stored for morphological analysis can be reduced. In addition, information on partial character strings that will not be adopted can be identified at an early stage of the common prefix search, so morphological analysis can be processed efficiently. Furthermore, because the information on the partial character strings can be referenced easily, morphological analysis that takes constraints on that information into account can be realized.

(2) The dictionary device of the present invention is further characterized in that the information on the partial character strings includes information on the readings of the partial character strings. The data size of the dictionary can thus be reduced by exploiting the reading information of the character strings, and the reading information can also be used to process morphological analysis efficiently.

(3) The morphological analysis device of the present invention outputs, on the basis of a character string, information on the morpheme sequence that constitutes it, and is characterized by comprising the above dictionary device and a collation unit that cuts out a partial character string from the input character string, collates the cut-out partial character string against the dictionary device partial character string by partial character string in the order of the character string, and outputs candidate information on morphemes that match a leading partial character string of the cut-out partial character string. Information on partial character strings that will not be adopted can thus be determined at an early stage, and morphological analysis can be processed efficiently.

(4) The morphological analysis device of the present invention outputs, on the basis of a character string, information on the morpheme sequence that constitutes it, and is characterized by comprising: the above dictionary device; a collation unit that cuts out a partial character string from the input character string, collates the cut-out partial character string against the dictionary device partial character string by partial character string in the order of the character string, and outputs candidate information on morphemes that match a leading partial character string of the cut-out partial character string; and a constraint reference unit that refers, as a constraint, to information on the reading of some of the characters constituting the input character string and outputs those candidates, among the candidates output as a result of the collation, that satisfy the constraint. Information on the readings of morphemes can thus be referenced easily during morphological analysis, and morphological analysis that takes constraints on reading information into account can be realized.

(5) The data structure of the present invention is a data structure of a dictionary for morphological analysis constructed on the basis of a trie in a storage unit of a computer, characterized in that partial character strings obtained by dividing character string data and information on those partial character strings are stored arranged alternately along the character string. The character string data can thus be made smaller through efficient data representation, and morphological analysis can be processed efficiently.

(6) The method of the present invention is a morphological analysis method that outputs, on the basis of a character string, information on the morpheme sequence that constitutes it, characterized in that a computer is used to execute the steps of: cutting out a partial character string from the input character string; and collating the cut-out partial character string, partial character string by partial character string in the order of the character string, against character string data having the above data structure. Information on partial character strings that will not be adopted can thus be determined at an early stage, and morphological analysis can be processed efficiently.

(7) The method of the present invention is a morphological analysis method that outputs, on the basis of a character string, information on the morpheme sequence that constitutes it, characterized in that a computer is used to execute the steps of: cutting out a partial character string from the input character string; collating the cut-out partial character string, partial character string by partial character string in the order of the character string, against character string data that has the above data structure and in which the information on the partial character strings includes information on their readings; and referring, as a constraint, to reading information for some of the characters constituting the input character string and outputting those candidates, among the candidates output as a result of the collation, that satisfy the constraint. Information on the readings of morphemes can thus be referenced easily during morphological analysis, and morphological analysis that takes constraints on reading information into account can be realized.

(8) The program of the present invention is a morphological analysis program that outputs, on the basis of a character string, information on the morpheme sequence that constitutes it, characterized by causing a computer to execute a series of processes including: cutting out a partial character string from the input character string; and collating the cut-out partial character string, partial character string by partial character string in the order of the character string, against character string data having the above data structure. Information on partial character strings that will not be adopted can thus be determined at an early stage, and morphological analysis can be processed efficiently.

(9) The program of the present invention is a morphological analysis program that outputs, on the basis of a character string, information on the morpheme sequence that constitutes it, characterized by causing a computer to execute a series of processes including: cutting out a partial character string from the input character string; collating the cut-out partial character string, partial character string by partial character string in the order of the character string, against character string data that has the above data structure and in which the information on the partial character strings includes information on their readings; and referring, as a constraint, to reading information for some of the characters constituting the input character string and outputting those candidates, among the candidates output as a result of the collation, that satisfy the constraint. Information on the readings of morphemes can thus be referenced easily during morphological analysis, and morphological analysis that takes constraints on reading information into account can be realized.

According to the present invention, efficient data representation becomes possible, and the size of the character string data stored for morphological analysis can be reduced. In addition, since information on partial character strings that will not be adopted can be determined at an early stage, morphological analysis can be processed efficiently. Furthermore, since the information on the partial character strings can be referenced easily, morphological analysis that takes constraints on that information into account can be realized.

FIG. 1 is a block diagram showing the morphological analysis device of the present invention.
FIG. 2 is a diagram showing an example of a conventional data structure.
FIG. 3 is a diagram showing an example of the data structure of the present invention.
FIG. 4 is a flowchart showing the operation of the morphological analysis device of the present invention.
FIG. 5 is a flowchart showing the operation of the morphological analysis device of the present invention.
FIG. 6 is a flowchart showing the operation of the morphological analysis device of the present invention.
FIG. 7 is a diagram showing an example of a conventional data structure.
FIG. 8 is a diagram showing an example of the data structure of the present invention.
FIG. 9 is a diagram showing an example of the data structure of the present invention.

Next, embodiments of the present invention are described with reference to the drawings. In the following description, morphemes that have the same written form but different readings are treated as different morphemes.

[First Embodiment]
(Configuration of the morphological analysis device)
FIG. 1 is a block diagram showing the morphological analysis device 100. As shown in FIG. 1, the morphological analysis device 100 includes a collation unit 110, a dictionary device 120, a constraint reference unit 130, a determination unit 140, and a connection table storage unit 150. On the basis of an input character string, it determines and outputs the most suitable of the information relating to that string.

The collation unit 110 cuts out a partial character string from the input character string (mixed kanji-kana text), collates the cut-out partial character string against the dictionary device 120 partial character string by partial character string in the order of the character string, and outputs candidate information on morphemes that match a leading partial character string of the cut-out partial character string. Information on morphemes that will not be adopted can thus be determined at an early stage, and morphological analysis can be processed efficiently. The data in the dictionary device 120 is collated in order along the stored character strings.

The dictionary device 120 stores character string data and information relating to it. It stores partial character strings obtained by dividing the character string data and information on those partial character strings, arranged alternately along the character string. Character string data is normally divided into several partial character strings. The data stored in the dictionary device 120 has a trie-based structure. That is, for multiple character strings whose data is identical from one end (the beginning or the end) up to some point, the common portion is represented by a single piece of data, and the portions that differ, from that point to the other end, branch into separate pieces of data.

Exploiting the relationship between characters and pronunciation in this way makes efficient data representation possible, so morphological analysis with reading information can be realized with a smaller data size. In addition, since information on partial character strings that will not be adopted can be identified at an early stage, morphological analysis can be processed efficiently.

The information on the partial character strings preferably includes information on the readings of the partial character strings. The data size of the dictionary can thus be reduced by exploiting the reading information of the character strings, and the reading information can also be used to process morphological analysis efficiently.

The constraint reference unit 130 refers, as a constraint, to information on the reading of some of the characters constituting the input character string, and outputs those candidates, among the candidates output as a result of the collation, that satisfy the constraint. Because the data structure of the dictionary device 120 makes it easy to reference information on the readings of morphemes during morphological analysis, morphological analysis that takes constraints on reading information into account can be realized.
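The kind of filtering performed by the constraint reference unit 130 might look like the following sketch. It assumes, purely for illustration, that each candidate is an index key in which written segments and readings alternate, separated by "/" (for example 今日/キョウ), and that a constraint maps a character span of the input to the reading required for it. Only spans that coincide exactly with a written segment are checked, which is a simplification; all names are hypothetical.

```python
def parse_key(key, sep="/"):
    """Split 'surface/reading/surface/reading...' into (surface, reading) pairs."""
    parts = key.split(sep)
    return list(zip(parts[0::2], parts[1::2]))

def satisfies(key, start, constraints):
    """Check a candidate that starts at character offset `start` of the input.

    constraints: dict {(begin, end): reading} over character offsets of the input.
    """
    pos = start
    for surface, reading in parse_key(key):
        span = (pos, pos + len(surface))
        if span in constraints and constraints[span] != reading:
            return False  # the reading contradicts the given constraint
        pos = span[1]
    return True

def filter_candidates(candidates, constraints):
    """Keep the (start, key) candidates whose readings satisfy the constraints."""
    return [(start, key) for start, key in candidates
            if satisfies(key, start, constraints)]
```

With the candidates 今日/キョウ and 今日/コンニチ at offset 0 and the constraint that characters 0 to 2 are read キョウ, only the first candidate remains.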

The determination unit 140 calculates the cost of each candidate morpheme sequence with reference to the connection table, determines the most suitable of the candidate sequences, and outputs it as the morpheme sequence resulting from the morphological analysis. The connection table storage unit 150 stores a connection table giving the cost associated with each succession of partial character strings and their readings.

(Data structure of the dictionary device)
The dictionary device 120 embeds pronunciation information in its index. FIG. 2 is a diagram showing an example of a conventional data structure. FIG. 3 is a diagram showing the data structure of the present invention.

As shown in FIG. 2, in the conventional method, which does not include pronunciation information in the index, the reading information is attached to the node representing the end of the morpheme. In this case the character 会 in the index is shared in the trie structure, but the common part of the readings カイシャ (kaisha) and カイギ (kaigi) is not directly shared.

In the example shown in FIG. 3, by contrast, the character string 会/カイ/社/シャ is used as the index for the morpheme 会社 (カイシャ, kaisha, "company"). Here the character "/" is a delimiter separating the written characters from the reading information. By using a trie (including a Patricia tree) for the dictionary structure, partial information including the pronunciation information is shared in the same nodes of the tree, so the overall size can be kept down. For example, if the morpheme dictionary also contains the morpheme 会議 (カイギ, kaigi, "meeting"), the portion 会/カイ/ of the indexes 会/カイ/社/シャ and 会/カイ/議/ギ is shared.
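The interleaved index can be illustrated with a short sketch. It assumes the alignment between written segments and readings is already known (how that alignment is obtained is not covered here), and the function names are hypothetical.

```python
SEP = "/"  # delimiter between written characters and reading information

def interleaved_key(segments):
    """Build an index key such as '会/カイ/社/シャ' from (surface, reading) pairs."""
    parts = []
    for surface, reading in segments:
        parts.extend([surface, reading])
    return SEP.join(parts)

def shared_prefix_len(a, b):
    """Number of leading characters two keys would share in a trie."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

For 会社 and 会議 the two keys share the five characters 会/カイ/, so the reading カイ is stored once instead of once per morpheme.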

(Data structure specific to the trie-based dictionary)
When morphological analysis is performed using the dictionary device 120 built on a trie, a common prefix search traverses the tree from the root node of the trie toward the leaf nodes and outputs the morphemes found on those nodes as candidates.

However, when the morphological analyzer 100 uses a dictionary structure whose index includes reading information, the reading information in each section of the index enclosed by "/" can be ignored during the search while moving toward the leaves, and the search simply transitions to the child node. Note that when a character has more than one possible pronunciation, the nodes corresponding to the pronunciation information branch, so that several child nodes can be search results. In that case, all of those child nodes must be considered as candidates.

In the conventional method, the morpheme candidates enumerated by a common prefix search lie on a single path of the trie. In the present invention, by contrast, morpheme candidates may lie on several different paths, and the search must take this into account. Specifically, it is necessary to create a list that can hold multiple lists of nodes from the root to a leaf.

(Operation of the morphological analyzer)
FIGS. 4 to 6 are flowcharts each showing an example of the operation of the collation unit 110 in the morphological analyzer 100. FIG. 4 shows the overall processing of the collation unit. As shown in FIG. 4, the input character string is first divided into partial character strings, which are enumerated so that the common prefix searches on them yield every morpheme candidate that may be contained in the input character string (step S101). The partial character strings can be enumerated, for example, by taking the string from the first character to the last character, then from the second character to the last character, and so on. Next, one partial character string is set as the search key (step S102). Then subroutine G(root node, 0) is called to collate the data stored in the dictionary with the search key (step S103).
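As a minimal sketch of the enumeration in step S101, the partial character strings can be generated as the suffixes of the input. The function name `enumerate_search_keys` is hypothetical, and the example sentence is the one used later in the description.

```python
def enumerate_search_keys(text):
    """Return every suffix of text: from the 1st character, from the 2nd, and so on."""
    return [text[start:] for start in range(len(text))]


keys = enumerate_search_keys("秋葉原に行く")
for key in keys:
    print(key)  # each key would be passed to the common prefix search in turn
```

Each suffix then serves as one search key in step S102.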

FIG. 5 shows the processing of subroutine G(N, k), which collates the character string with the search key. G(N, k) denotes character-string collation between node N of the dictionary data and the k-th character of the search key. As shown in FIG. 5, all child nodes of N are enumerated and stored in a node list L (step S201), and i is initialized to 0 (step S202).

Next, it is determined whether the character attached to the branch leading to L[i] matches the k-th character of the search key (step S203). If it matches, G(L[i], k+1) is called (step S204), and the process proceeds to step S207.

If it does not match, it is determined whether the character attached to the branch leading to L[i] is the delimiter (step S205). If the character is the delimiter, subroutine P(L[i], k) is called (step S206), and the process proceeds to step S207. If the character is not the delimiter, the process proceeds directly to step S207.

Then i is incremented by 1 (step S207), and it is determined whether i is less than the size of the list L (step S208). If it is, the process returns to step S203. If i is equal to or greater than the size of L, the subroutine ends and control returns to the caller.

FIG. 6 shows the processing of subroutine P(N, k), which collates reading information. P(N, k) denotes reading collation between node N of the dictionary data and the k-th character of the search key. As shown in FIG. 6, all child nodes of N are enumerated and stored in a node list L (step S301), and i is initialized to 0 (step S302).

Next, it is determined whether L[i] is a morpheme end (step S303). If it is a morpheme end, the character string from the root to L[i] is output as a result (step S304), and the process proceeds to step S308.

If it is not a morpheme end, it is determined whether the character attached to the branch leading to L[i] is the delimiter (step S305). If the character is the delimiter, subroutine G(L[i], k) is called (step S306), and the process proceeds to step S308. If the character is not the delimiter, subroutine P(L[i], k) is called (step S307), and the process proceeds to step S308.

Then i is incremented by 1 (step S308), and it is determined whether i is less than the size of the list L (step S309). If it is, the process returns to step S303. If i is equal to or greater than the size of L, the subroutine ends and control returns to the caller. The above processing can be carried out by having a computer execute it.
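The two subroutines can be sketched as a pair of mutually recursive functions. The following Python is an illustrative reading of the flowcharts, not the implementation of the specification. Several points are assumptions made for the sketch: the names `match_surface` (for G) and `match_reading` (for P), the modelling of a morpheme end as a leaf reached through a hypothetical end-marker edge `END`, the use of a `for` loop in place of the explicit counter i, and the premise that the input text contains neither the delimiter nor the end marker.

```python
DELIM = "/"   # delimiter between written characters and readings
END = "\0"    # hypothetical edge label leading to a morpheme-end leaf


class Node:
    def __init__(self):
        self.children = {}   # edge character -> child Node
        self.is_end = False  # True only for morpheme-end leaves


def insert(root, index):
    """Insert an interleaved index such as 秋/アキ and append its end leaf."""
    node = root
    for ch in index + END:
        node = node.children.setdefault(ch, Node())
    node.is_end = True


def match_surface(node, key, k, path, results):
    """G(N, k): compare the edges below N with the k-th character of the key."""
    for ch, child in node.children.items():                      # S201, S202, S207, S208
        if k < len(key) and ch == key[k]:                         # S203
            match_surface(child, key, k + 1, path + ch, results)  # S204
        elif ch == DELIM:                                         # S205
            match_reading(child, key, k, path + ch, results)      # S206


def match_reading(node, key, k, path, results):
    """P(N, k): follow reading characters without consuming the key."""
    for ch, child in node.children.items():                      # S301, S302, S308, S309
        if child.is_end:                                          # S303
            results.append(path)                                  # S304
        elif ch == DELIM:                                         # S305
            match_surface(child, key, k, path + ch, results)      # S306
        else:
            match_reading(child, key, k, path + ch, results)      # S307


def common_prefix_search(root, key):
    results = []
    match_surface(root, key, 0, "", results)                      # S103: G(root, 0)
    return results


root = Node()
for index in ["秋/アキ", "秋/アキ/葉/バ", "秋/アキ/葉/ハ/原/バラ"]:
    insert(root, index)

found = common_prefix_search(root, "秋葉原に行く")
print(found)
```

With the three entries above, the search returns all three indexes even though 秋/アキ/葉/バ and 秋/アキ/葉/ハ/原/バラ lie on different branches below 秋/アキ/葉/.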

(Example of morphological analysis processing)
An example of the processing is described below. FIG. 7 is a diagram showing an example of a conventional data structure. FIG. 8 is a diagram showing an example of the data structure of the present invention.

When the sentence 秋葉原に行く ("go to Akihabara") is analyzed morphologically, a common prefix search is first performed with 秋葉原に行く as the key. If the three words 秋 (アキ, aki), 秋葉 (アキバ, akiba) and 秋葉原 (アキハバラ, akihabara) are registered in the dictionary, they are enumerated as morpheme candidates.

With a conventional index that does not include reading information, as shown in FIG. 7, these three words lie on the same path from the root node of the trie. In other words, the common prefix search only needs to consider a single path on the trie.

In contrast, in the method of the present invention shown in FIG. 8, 秋葉 (アキバ) and 秋葉原 (アキハバラ) differ in the reading of 葉 (バ versus ハ), so the word 秋葉 does not lie on the path leading to 秋葉原. The two share the path up to 秋/アキ/葉/, and after that their information is stored on different branches.

This processing requires that a parent node be able to enumerate all of its child nodes quickly. Conventional morphological analysis only needs to determine quickly whether there is a child node on a branch labeled with a particular character, and this is where the present method differs.

Such a data structure is nevertheless relatively easy to realize. For example, the parent node may hold a link to only one of its child nodes, with a list structure built over all the sibling nodes starting from that child. Alternatively, LOUDS, a data structure used for memory-efficient trie implementations, enumerates child nodes when the tree is traversed toward the leaves even in the conventional method that does not include reading information in the index. A practical system that uses LOUDS for the trie data structure therefore already satisfies the above condition required by the present invention.
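The first of these options, a parent holding a single link to one child with the siblings chained in a list, might look as follows. This is only a sketch: the class and function names (`Node`, `add_child`, `iter_children`, `insert`) are hypothetical, and no claim is made about how an actual implementation lays out memory. LOUDS is not shown.

```python
class Node:
    """Trie node holding only a link to its first child and to its next sibling."""

    def __init__(self, char=None):
        self.char = char          # character on the edge from the parent
        self.first_child = None   # head of the linked list of children
        self.next_sibling = None  # next child of the same parent


def add_child(parent, char):
    """Return the child of parent labelled char, creating it if it is missing."""
    prev, child = None, parent.first_child
    while child is not None:
        if child.char == char:
            return child
        prev, child = child, child.next_sibling
    new = Node(char)
    if prev is None:
        parent.first_child = new
    else:
        prev.next_sibling = new
    return new


def iter_children(parent):
    """Enumerate every child by walking the sibling list once."""
    child = parent.first_child
    while child is not None:
        yield child
        child = child.next_sibling


def insert(root, index):
    node = root
    for ch in index:
        node = add_child(node, ch)
    return node


root = Node()
insert(root, "秋/アキ/葉/バ")
insert(root, "秋/アキ/葉/ハ/原/バラ")

node = root
for ch in "秋/アキ/葉/":
    node = add_child(node, ch)  # these nodes already exist, so none is created
readings = [child.char for child in iter_children(node)]
print(readings)
```

Enumerating the children of a node costs one step per child, which is what the search over alternative readings needs.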

(Processing when constraints are input)
In the conventional dictionary structure, morpheme information is attached to the nodes of the trie. Consequently, a search constrained by reading information has to enumerate all morpheme candidates by a common prefix search, then examine the information of each morpheme and discard the candidates that do not satisfy the reading constraint.

In contrast, when the morpheme dictionary structure of the present invention is built with reading information, the pronunciation information need not simply be skipped: whether it satisfies a given constraint on the reading can be checked during the trie search itself. Morphemes that do not satisfy the reading constraint can thus be excluded from the enumeration of morpheme candidates at an earlier point. This reduces the number of morpheme candidates to be enumerated and therefore the amount of processing and memory used. When the processing follows the flowchart shown in FIG. 6, a check of whether the reading constraint is satisfied is performed immediately before S303; if it is satisfied the process proceeds to S303, and otherwise to S308.

(Other processing examples)
FIG. 9 is a diagram showing an example of the data structure of the present invention. Suppose, for example, that the three words 上 (ウエ, ue), 上る (アガ・ル, agaru) and 上る (ノボ・ル, noboru) are registered in the dictionary, the text to be analyzed is 上る, and the reading ノボ is specified for the character 上 in that text. With the conventional dictionary structure, the common prefix search for 上る returns all three words as candidates, so the reading information of each must be examined and every morpheme other than 上る (ノボ・ル) discarded.

In the method of the present invention, by contrast, the common prefix search can be performed with 上/ノボ/る/* as the key (where "*" stands for any character string made up of characters other than "/"). Only 上る (ノボ・ル), whose index is 上/ノボ/る/ル, is then obtained as the result of the common prefix search.
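A constrained search of this kind can be sketched as below. The code is an illustration under stated assumptions, not the method as claimed: instead of a wildcard key, the constraint is passed as a hypothetical mapping `constraints` from a character position in the key to the reading required there, the trie is a nested `dict`, and `END` is a hypothetical end-marker edge. Branches whose reading cannot match are abandoned as soon as the mismatch is visible.

```python
DELIM = "/"
END = "\0"  # hypothetical edge label marking a morpheme end


def build_trie(indexes):
    """Build a nested-dict trie from interleaved index strings."""
    root = {}
    for index in indexes:
        node = root
        for ch in index + END:
            node = node.setdefault(ch, {})
    return root


def search(root, key, constraints):
    """Common prefix search that prunes readings violating the constraints.

    constraints maps a position in key to the reading required for that character.
    """
    results = []

    def surface(node, k, path):
        for ch, child in node.items():
            if k < len(key) and ch == key[k]:
                surface(child, k + 1, path + ch)
            elif ch == DELIM:
                # The reading that follows belongs to the character key[k - 1].
                reading(child, k, path + ch, "", constraints.get(k - 1))

    def reading(node, k, path, so_far, required):
        for ch, child in node.items():
            if ch == END or ch == DELIM:
                if required is not None and so_far != required:
                    continue  # the completed reading differs: drop this branch
                if ch == END:
                    results.append(path)
                else:
                    surface(child, k, path + ch)
            else:
                longer = so_far + ch
                if required is not None and not required.startswith(longer):
                    continue  # the reading can no longer match: prune early
                reading(child, k, path + ch, longer, required)

    surface(root, 0, "")
    return results


trie = build_trie(["上/ウエ", "上/アガ/る/ル", "上/ノボ/る/ル"])
unconstrained = search(trie, "上る", {})
constrained = search(trie, "上る", {0: "ノボ"})
print(unconstrained)
print(constrained)
```

Without a constraint all three entries are returned; with ノボ required for the first character, the ウエ and アガ branches are abandoned at their first reading character and only 上/ノボ/る/ル remains.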

In this processing, a selection criterion other than the exact match of reading information described above may be used in order to allow for variation in readings. For example, a constraint that tolerates ambiguity can be set, such as enumerating as search results of the morpheme dictionary only those morphemes whose edit distance from the specified reading is at most a certain value (or whose number of matching characters is at least a certain number).
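As an illustration of such a relaxed criterion, the sketch below computes the Levenshtein edit distance between readings and keeps the candidates within a threshold. For simplicity it filters a list of already enumerated candidates rather than pruning during the trie walk. The function names, the candidate tuples and the variant reading ノホル are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def filter_by_reading(candidates, required, max_distance):
    """Keep (surface, reading) pairs whose reading is within max_distance of required."""
    return [c for c in candidates if edit_distance(c[1], required) <= max_distance]


candidates = [("上", "ウエ"), ("上る", "アガル"), ("上る", "ノボル")]
kept = filter_by_reading(candidates, "ノホル", max_distance=1)
print(kept)
```

With a threshold of 1, the slightly different reading ノホル still selects 上る (ノボル), while the other two candidates are rejected.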

In the above description, the delimiter "/" is inserted between each partial character string of the written string and the corresponding phonetic symbols. However, when it is fixed as a rule that pronunciation information is given for every single character of the written string, an embodiment that omits the delimiter "/" between the characters of the written string and the pronunciation information is also possible.

Alternatively, instead of defining a delimiter, each node may store one bit of information indicating whether the character stored on the branch immediately above it is a character of the written string or a character of the reading information, and this bit may be used instead. This reduces the size of the dictionary further.
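The effect of the one-bit variant on path length can be sketched as follows. This is a hypothetical model in which the bit is carried as part of the edge label, written as a `(character, is_reading)` tuple, rather than stored in the node; the helper names are likewise illustrative only.

```python
def flagged_labels(pairs):
    """Turn (written, reading) pairs into (character, is_reading) edge labels."""
    labels = []
    for written, reading in pairs:
        labels.extend((ch, False) for ch in written)  # written characters
        labels.extend((ch, True) for ch in reading)   # reading characters
    return labels


def delimited_index(pairs, delim="/"):
    """The delimiter-based index for the same pairs, for comparison."""
    return delim.join(part for pair in pairs for part in pair)


def insert(root, labels):
    """Insert a label sequence into a nested-dict trie keyed by the labels."""
    node = root
    for label in labels:
        node = node.setdefault(label, {})
    return node


pairs = [("会", "カイ"), ("社", "シャ")]
labels = flagged_labels(pairs)
root = {}
insert(root, labels)
print(len(labels), len(delimited_index(pairs)))
```

For 会社 the flagged form needs six edges where the delimited index 会/カイ/社/シャ needs nine, because the three delimiter characters disappear.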

The description so far has used character strings consisting only of katakana as examples of reading information, but the reading information is not limited to this. For example, symbols that include prosodic information, such as those of JEITA IT-4006, may be used. For cases in which the reading is determined from the written character by rule, a code meaning "determined by rule" may be defined as one kind of reading information, which shortens the reading information. An example of a reading determined by rule is a kanji that has only one on-reading.

The method of the present invention may also store information other than reading information. For example, in a morphological analysis system in which the occurrence cost of a morpheme is defined as the sum of the occurrence costs of the characters in its written form, symbols representing character occurrence costs can be defined, and the trie index can be built by arranging pairs of a character and such a symbol. Information on the occurrence cost of morphemes is thereby embedded directly in the tree structure of the trie itself.

With this structure the occurrence cost is obtained while the trie is being traversed, so candidates can be enumerated with the morpheme occurrence cost taken into account, for example by discarding candidates with a high occurrence cost while the candidate morphemes are being enumerated. The stored information may also be a combination of various kinds of information, such as reading information combined with character occurrence costs.
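A cost-aware enumeration along these lines might be sketched as follows. Everything specific here is an assumption made for illustration: costs are stored as integer edge labels alternating with character edges, the cost values themselves are invented, `END` is a hypothetical end marker, and costs are taken to be non-negative so that a partial sum above the threshold justifies pruning.

```python
END = "\0"  # hypothetical key marking a morpheme end


def build_trie(entries):
    """entries: lists of (character, cost) pairs; edges alternate character, cost."""
    root = {}
    for entry in entries:
        node = root
        for ch, cost in entry:
            node = node.setdefault(ch, {})    # edge for the written character
            node = node.setdefault(cost, {})  # edge for its occurrence cost
        node[END] = {}
    return root


def search(root, key, max_cost):
    """Enumerate (surface, total cost) pairs, pruning paths above max_cost."""
    results = []

    def walk(node, k, surface, total):
        if END in node:
            results.append((surface, total))
        if k >= len(key) or key[k] not in node:
            return
        for cost, child in node[key[k]].items():
            if total + cost > max_cost:
                continue  # costs are non-negative, so this path cannot recover
            walk(child, k + 1, surface + key[k], total + cost)

    walk(root, 0, "", 0)
    return results


# The cost values below are invented for the example.
trie = build_trie([
    [("秋", 3)],
    [("秋", 3), ("葉", 4)],
    [("秋", 3), ("葉", 4), ("原", 5)],
])
cheap = search(trie, "秋葉原に行く", max_cost=7)
everything = search(trie, "秋葉原に行く", max_cost=100)
print(cheap)
print(everything)
```

With the threshold at 7 the path toward 秋葉原 is abandoned at the edge for 原, before that candidate is ever produced.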

A dictionary in which every written character string is stored in reverse order is also possible. For example, a morpheme dictionary of reversed character strings can be built in advance, and in the morphological analysis of a sentence the morpheme candidates can be enumerated from the beginning of the reversed sentence, that is, from the end of the original sentence toward its beginning. In this case the optimal morpheme sequence finally obtained is fixed from the beginning of the sentence toward its end. Since many devices also output the morphological analysis result from the beginning of the sentence toward its end, this has the advantage that, unlike the case where the result is fixed from the end of the sentence toward its beginning, the morphological analysis result does not need to be stored temporarily.

100 morphological analyzer
110 collation unit
120 dictionary device
130 constraint reference unit
140 determination unit
150 connection table storage unit

Claims

1. A dictionary device for morphological analysis that stores character string data in a trie-based structure,
the dictionary device being characterized in that partial character strings obtained by dividing the character string data, and information on those partial character strings, are stored arranged alternately along the character string.

2. The dictionary device according to claim 1, wherein the information on the partial character strings includes information on the reading of the partial character strings.

3. A morphological analyzer that, based on a character string, outputs information on the morpheme sequence constituting that character string, comprising:
the dictionary device according to claim 1 or 2; and
a collation unit that cuts out a partial character string from the input character string, collates the cut-out partial character string against the dictionary device, partial character string by partial character string in the order of the character string, and outputs candidates for information on morphemes that match a leading partial character string of the cut-out partial character string.

4. A morphological analyzer that, based on a character string, outputs information on the morpheme sequence constituting that character string, comprising:
the dictionary device according to claim 2;
a collation unit that cuts out a partial character string from the input character string, collates the cut-out partial character string against the dictionary device, partial character string by partial character string in the order of the character string, and outputs candidates for information on morphemes that match a leading partial character string of the cut-out partial character string; and
a constraint reference unit that refers, as a constraint, to information on the reading of some of the characters constituting the input character string, and outputs, among the candidates output as a result of the collation, the candidates that satisfy the constraint.

5. A data structure of a dictionary for morphological analysis, configured on the basis of a trie in a storage unit of a computer,
the data structure being characterized in that partial character strings obtained by dividing character string data, and information on those partial character strings, are stored arranged alternately along the character string.

6. A morphological analysis method for outputting, based on a character string, information on the morpheme sequence constituting that character string, the method being characterized by using a computer to execute:
a step of cutting out a partial character string from the input character string; and
a step of collating the cut-out partial character string against character string data having the data structure according to claim 5.

7. A morphological analysis method for outputting, based on a character string, information on the morpheme sequence constituting that character string, the method being characterized by using a computer to execute:
a step of cutting out a partial character string from the input character string;
a step of collating the cut-out partial character string, partial character string by partial character string in the order of the character string, against character string data having the data structure according to claim 5 in which the information on the partial character strings includes information on the reading of the partial character strings; and
a step of referring, as a constraint, to reading information for some of the characters constituting the input character string, and outputting, among the candidates output as a result of the collation, the candidates that satisfy the constraint.

8. A morphological analysis program for outputting information on the morpheme sequence constituting a character string, the program being characterized by causing a computer to execute a series of processes including:
a process of cutting out a partial character string from the input character string; and
a process of collating the cut-out partial character string, partial character string by partial character string in the order of the character string, against character string data having the data structure according to claim 5.

9. A morphological analysis program for outputting information on the morpheme sequence constituting a character string, the program being characterized by causing a computer to execute a series of processes including:
a process of cutting out a partial character string from the input character string;
a process of collating the cut-out partial character string, partial character string by partial character string in the order of the character string, against character string data having the data structure according to claim 5 in which the information on the partial character strings includes information on the reading of the partial character strings; and
a process of referring, as a constraint, to reading information for some of the characters constituting the input character string, and outputting, among the candidates output as a result of the collation, the candidates that satisfy the constraint.