JP5807891B2 - Language model learning apparatus and computer program

Info

Publication number: JP5807891B2
Application number: JP2010224870A
Authority: JP
Inventors: デサーガステイン; イシュトヴァーンヴァルガ; 清敬大竹; 健太郎鳥澤
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2010-10-04
Filing date: 2010-10-04
Publication date: 2015-11-10
Anticipated expiration: 2030-10-04
Also published as: JP2012078647A

Description

The present invention relates to a language model learning apparatus that trains a statistical language model on a large number of natural language word strings obtained from the natural language sentences in a corpus, and more particularly to a language model learning apparatus that can automatically generate, from the natural language sentences in a corpus, a language model suited to a predetermined purpose.

Speech recognition technology uses statistical language models. A statistical language model models the occurrence frequency of words (or word strings; hereinafter simply “words and the like”) in a corpus consisting of a large number of natural language sentences. It lists each word or word string that appears in the corpus together with its occurrence frequency. A word string in which N words are concatenated in a given order is called an N-gram, and a language model over N-grams is called an N-gram language model. In practice, 1-gram to 3-gram language models are the most commonly used.
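As an illustration of the frequency list described above, the following minimal sketch counts every 1-gram, 2-gram, and 3-gram in a tiny tokenized corpus. The function name `ngram_counts` and the two English sentences are assumptions made for this example and do not come from the patent.

```python
from collections import Counter


def ngram_counts(sentences, max_n=3):
    """Count every 1-gram to max_n-gram in a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of length n over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical two-sentence corpus, already split into words.
corpus = [
    ["the", "cause", "of", "flu", "is", "a", "virus"],
    ["the", "cause", "of", "fever", "is", "infection"],
]
counts = ngram_counts(corpus)
```

N-grams never seen in the corpus simply have a count of zero, which is the sparsity problem that motivates the rest of the document.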

In speech recognition, a language model is used to verify the plausibility of the phoneme strings obtained as recognition results. Among phoneme strings with high acoustic likelihood, only hypotheses that also receive a high likelihood from the language model are adopted as candidate recognition results.

Such statistical language models are used not only in speech recognition but also in machine translation and similar applications, where they score the plausibility of translation results.

Creating a language model requires a large number of natural language sentences. Quantity alone, however, is not sufficient. It is desirable to build the language model from natural language sentences appropriate to the application in which the model is used and to the domain to which it is applied. For example, if the field in which speech recognition will be used is clearly known, the language model should be created from natural language sentences related to that field.

A large number of machine-readable documents are now available, and techniques have been developed for selecting from them the natural language sentences needed to create a language model.

However, as long as sentences actually written by humans are used, there is no guarantee that they contain the expressions one wishes to include in the language model. Conversely, sentences extracted from a large body of machine-readable documents are likely to contain expressions unrelated to the target field or to the application in which the language model is used. Consequently, even when the target field or application is clearly known, it is difficult to deliberately build a language model suited to it.

As a result, conventional techniques take the corpus as given and try to secure language model performance by refining the modeling technique itself.

For example, Patent Document 1 listed below discloses smoothing for a language model that holds occurrence frequency information for word 3-grams, 2-grams, and 1-grams. In a 3-gram language model, a shortage of training data makes it likely that some 3-grams will have an occurrence frequency of 0. If such a language model is used as is, the likelihood of the word strings in a speech recognition result cannot be evaluated correctly. Smoothing is a technique for mitigating this problem.
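To make the zero-frequency problem concrete, the sketch below applies add-one (Laplace) smoothing to bigram probabilities. This is a generic textbook technique chosen for brevity, not the method of Patent Document 1, and it uses bigrams rather than 3-grams to keep the example short. The function name and the toy token sequence are assumptions.

```python
from collections import Counter


def smoothed_bigram_prob(w1, w2, unigrams, bigrams, vocab_size, alpha=1.0):
    """Additive smoothing: every bigram gets a pseudo-count of alpha,
    so a bigram never seen in training still has non-zero probability."""
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab_size)


# Hypothetical training data: a single short token sequence.
tokens = ["a", "b", "a", "c"]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)

p_seen = smoothed_bigram_prob("a", "b", unigrams, bigrams, vocab_size)
p_unseen = smoothed_bigram_prob("a", "a", unigrams, bigrams, vocab_size)
```

Without smoothing, the unseen bigram ("a", "a") would receive probability 0 and any hypothesis containing it would be ruled out entirely.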

In the technique of Patent Document 1, the types of language models available for smoothing and the dependency relationships among them (collectively, “dependency relationships and the like”) are stored in a storage device in advance. For example, when the total number of words appearing in the training corpus is smaller than a predetermined threshold, another language model can be selected on the basis of the dependency relationships and the like and used to smooth, for example, the 3-gram language model.

Patent Document 1: JP 2009-145775 A

However, even the technique disclosed in Patent Document 1 cannot improve language model performance if the underlying training data is biased or contains too few samples related to the target field. The more fundamental solution is therefore to address how to prepare training data suited to the field or application in which the language model is used.

Ideally, one would prepare training data containing every sentence that might be uttered in the target field or application and create the language model from that data. At present, the totality of data on the Web seems to come closest to such training data. However, as noted above, even Web data is written by humans, so its total volume is limited, and it certainly cannot contain every sentence that might be uttered in the target field or application. The question, then, is how to efficiently collect natural language sentences that cover as many as possible of the sentences that might be uttered in the target field or application.

An object of the present invention is therefore to provide a language model learning apparatus that can efficiently generate a language model that assigns relatively high probabilities to natural language word strings likely to be uttered in the target field or application.

A language model learning apparatus according to a first aspect of the present invention is used together with machine-readable corpus storage means that stores a corpus containing a plurality of natural language sentences, and trains from that corpus a language model suited to a specific use. The apparatus includes: template storage means for storing word string templates prepared in advance for the specific use; word string extraction means for extracting from the corpus word string patterns that match the word string templates stored in the template storage means; transformation means for transforming the word string patterns extracted by the word string extraction means on the basis of word string transformation rules prepared in advance so that natural language word strings in a form suited to a preselected purpose are generated; and learning means for training a language model using the word strings output by the transformation means as training data.

Word string templates are prepared in the template storage means in advance, and word string patterns that match them are extracted from the corpus. The word string transformation rules are applied to these word string patterns so that natural language word strings in a form suited to the preselected purpose are generated. This newly generates expressions that do not exist in the corpus. As a result, regardless of the limited number of word strings in the corpus, a large number of natural language word strings suited to the purpose can be generated, including expressions that the corpus does not contain.

Preferably, the template storage means includes: seed template storage means for storing seed templates, which are basic word string templates to be extracted from machine-readable sentences; expanded template generation means for generating expanded templates by applying template expansion rules prepared in advance to each of the seed templates stored in the seed template storage means; and expanded template storage means for storing the expanded templates generated by the expanded template generation means together with the seed templates stored in the seed template storage means, and supplying them to the word string extraction means as the word string templates.

Expanding the seed templates with the template expansion rules yields more templates than the seed templates originally prepared. More of the word string patterns contained in the corpus can therefore be extracted, and as a result more natural language word strings suited to the purpose can be generated.

More preferably, each of the word string templates stored in the template storage means includes a sequence of one or more variables, each representing an arbitrary word string that satisfies a predetermined constraint, and text data representing the other word strings.

The predetermined constraint may be the word class to which the word represented by each variable belongs. The word string extraction means includes: morphological analysis means for morphologically analyzing each of the natural language sentences stored in the corpus, tagging each morpheme with the word class to which it belongs, and outputting the result as a morpheme string; and means for comparing each of the word string templates stored in the template storage means with the morpheme strings output by the morphological analysis means and extracting from the corpus those morpheme strings that match the word string template except at its variables and in which the word class of the morpheme at the position corresponding to each variable in the word string template matches the word class of that variable.
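The matching condition described above can be sketched as follows. This is a simplified illustration under several assumptions: the morphological analyzer is replaced by a hand-tagged English sentence, each variable matches exactly one morpheme, and the names `Var` and `match_template` as well as the word class labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A template variable constrained to a single word class."""
    name: str
    word_class: str


def match_template(template, morphemes):
    """Return the variable bindings for the first window of `morphemes`
    that matches `template`, or None if no window matches.

    `morphemes` is a list of (surface, word_class) pairs.
    """
    n = len(template)
    for start in range(len(morphemes) - n + 1):
        bindings = {}
        window = morphemes[start:start + n]
        for elem, (surface, word_class) in zip(template, window):
            if isinstance(elem, Var):
                # Variable position: only the word class has to agree.
                if elem.word_class != word_class:
                    break
                bindings[elem.name] = surface
            elif elem != surface:
                # Literal position: the surface form has to agree.
                break
        else:
            return bindings
    return None


template = ["the", "cause", "of", Var("A", "disease"), "is", Var("B", "noun")]
tagged = [
    ("doctors", "noun"), ("say", "verb"), ("the", "function"),
    ("cause", "noun"), ("of", "function"), ("influenza", "disease"),
    ("is", "verb"), ("stress", "noun"),
]
bindings = match_template(template, tagged)
```

Here `bindings` maps A to "influenza" and B to "stress". A sentence whose word at position A is not tagged as a disease would not be extracted.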

Preferably, each of the word string templates stored in the template storage means includes one or more variables, each representing an arbitrary word that satisfies a predetermined constraint, other word strings, and syntax information indicating the grammatical relationships among these variables and word strings.

More preferably, the predetermined constraint is the word class to which the word represented by each variable belongs. The word string extraction means includes: morphological analysis means for morphologically analyzing each of the natural language sentences stored in the corpus, tagging each morpheme with the word class to which it belongs, and outputting the result as a morpheme string; syntactic analysis means for parsing the morpheme strings output by the morphological analysis means and outputting word string patterns consisting of the syntax information of the natural language sentences; and means for comparing each of the word string templates stored in the template storage means with the word string patterns output by the syntactic analysis means and extracting from the corpus those portions of the word string patterns that have a structure matching the word string template except at its variables and in which the word class of the word at the position corresponding to each variable in the word string template matches the word class of that variable.

Still more preferably, the language model learning apparatus further includes: frequency storage means for storing, for each word string pattern, the occurrence frequency of that word string pattern in a predetermined corpus; and frequency adjustment means for adjusting the occurrence frequency of the words in the word strings output from the transformation means by determining, for each transformed word string output from the transformation means, the number of copies of that word string on the basis of the occurrence frequency stored in the frequency storage means for the word string pattern from which the words constituting that word string were generated, and copying the word string that number of times.

The language model learning apparatus further includes: frequency storage means for storing the occurrence frequencies of words appearing in a predetermined corpus; and frequency adjustment means for adjusting the occurrence frequency of the words in the word strings output from the transformation means by determining, for each transformed word string output from the transformation means, the number of copies of that word string on the basis of the occurrence frequencies stored in the frequency storage means for the words constituting that word string, and copying the word string that number of times.

A weight may be assigned in advance to each of the seed templates stored in the template storage means. Each of the expanded templates stored in the expanded template storage means is assigned a weight smaller than that of the seed template from which it was derived. The language model learning apparatus further includes frequency adjustment means for adjusting the occurrence frequency of the words contained in the transformed word strings output from the transformation means by copying each transformed word string in accordance with the weight assigned to the word string template used by the word string extraction means.

A computer program according to a second aspect of the present invention causes a computer to function as each of the means of any of the language model learning apparatuses described above.

A speech recognition apparatus according to a third aspect of the present invention includes: any of the language model learning apparatuses described above; language model storage means for storing the language model trained by the language model learning apparatus; and speech recognition means for recognizing input speech using the language model stored in the language model storage means.

FIG. 1 is a block diagram of a language model learning apparatus according to one embodiment of the present invention.
FIG. 2 shows an example of a seed template set.
FIG. 3 shows an example of template expansion rules.
FIG. 4 shows an example of word string transformation rules.
FIG. 5 is a flowchart showing the control structure of a program for expanding seed templates.
FIG. 6 is a flowchart showing the control structure of a program for extracting from the Web corpus word string patterns that match a template.
FIG. 7 is a flowchart showing the control structure of a program that applies transformation rules to the extracted word string patterns to generate and output natural language word strings of a predetermined form.
FIG. 8 is an external view of a computer system that implements the language model learning apparatus according to one embodiment of the present invention.
FIG. 9 is a block diagram showing the hardware configuration of the computer shown in FIG. 8.

In the following description and drawings, identical components are given identical reference numerals. Detailed descriptions of them are therefore not repeated.

[Configuration]
Referring to FIG. 1, a language model learning apparatus 30 according to one embodiment of the present invention generates, from the natural language sentences in a Web corpus 32 consisting of sentences collected from the Web, a training corpus 34 consisting of natural language word strings of a specific form relating to a specific field, and trains a language model with a language model learning module 36 using the training corpus 34 as training data. The language model learning apparatus 30 makes it possible to build a language model 38 suited to speech recognition of sentences of a specific form relating to a specific field. In this embodiment, as described later, a language model is built for a speech recognition apparatus 40 that recognizes input speech 42 consisting of questions about diseases and outputs a speech recognition result 44. In this embodiment, the Web corpus 32, the training corpus 34, and the language model 38 are all stored on a non-volatile storage medium such as a hard disk.

The language model learning apparatus 30 includes: a seed template set storage unit 50 that stores a seed template set consisting of seed templates, each describing a word string pattern that word strings to be extracted from the Web corpus 32 must satisfy; a template expansion rule storage unit 54 that stores template expansion rules referred to when generating, from a seed template, expanded templates whose form differs from that of the seed template; a template expansion processing unit 52 that applies to each seed template stored in the seed template set storage unit 50 the applicable template expansion rules stored in the template expansion rule storage unit 54 and outputs an expanded template set; and an expanded template set storage unit 56 that stores the expanded template set output by the template expansion processing unit 52.

In this specification, a word string pattern means syntax information describing the grammatical relationships among the words that make up a natural language sentence or a natural language word string (hereinafter “natural language word string or the like”). A parse tree is one example. Each leaf of the parse tree is associated with a word string contained in the natural language word string or the like. A template has the same kind of structure as a word string pattern and is what a word string pattern is compared against.

Referring to FIG. 2, the seed templates stored in the seed template set storage unit 50 describe the basic syntactic structures that word string patterns extracted from the Web corpus 32 must satisfy. For example, a seed template consists of a parse tree specified in advance by the user and the word strings corresponding to the nodes of that tree. Seed templates can also be generated automatically from natural language sentences. In this embodiment, seed templates are prepared manually and are written using regular expressions. Various regular expression notations are known, and any of them may be used here.

In this embodiment, words are placed at the positions corresponding to the leaves of the parse tree of a word string template, including a seed template. These words comprise variables, each representing a word that must satisfy a predetermined constraint, and text data representing the word strings other than the variables. A template need not contain any variables. A template may also include a symbol indicating that a given word occurs at the beginning or end of a sentence.

Each of the templates shown in FIG. 2 (syntax information is omitted) contains variables (A, B). These variables are also defined with the regular expressions mentioned above. Each variable is assigned an attribute, such as the word class of the word at the corresponding position, or a combination of such attributes. A word class designates the set to which a word belongs when words are classified by attribute, for example disease name, drug name, symptom name, substance name, place name, person name, part of speech, or verb conjugation form. A single variable may be assigned multiple attributes. In that case, the variable also carries information specifying whether those attributes are combined by AND or by OR.
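One possible way to represent a variable that carries several attributes together with an AND/OR flag is sketched below. The class name `Constraint`, its fields, and the attribute labels are assumptions for illustration only; the patent does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """Attributes required of the word bound to a template variable."""
    attributes: frozenset
    mode: str = "AND"  # "AND": all attributes required; "OR": any one suffices

    def accepts(self, word_attributes):
        word_attributes = set(word_attributes)
        if self.mode == "AND":
            return self.attributes <= word_attributes
        if self.mode == "OR":
            return bool(self.attributes & word_attributes)
        raise ValueError(f"unknown mode: {self.mode!r}")


# Variable A must be a noun that is also a disease name.
var_a = Constraint(frozenset({"noun", "disease name"}), mode="AND")
# Variable B may be either a drug name or a substance name.
var_b = Constraint(frozenset({"drug name", "substance name"}), mode="OR")
```

A word tagged only as a noun would be rejected by `var_a`, while a word tagged as a substance name alone would be accepted by `var_b`.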

In the example of FIG. 2, if a word string pattern of the form “the cause of A is B” exists, that word string pattern (or the sentence containing it; hereinafter simply “word string pattern”) is extracted. If the word class “disease name” is specified for variable A, word string patterns of the form “the cause of A is B” in which A is a disease name are extracted. The same applies to B. A word string pattern extracted in this way, by matching a specific rule, is called an “instance” here.

The number to the right of each seed template in FIG. 2 is the weight assigned to that seed template. Embodiments that do not use weights are possible, but in this embodiment the weight is used to adjust the occurrence frequency of each finally obtained word string (the number of times it is output to the corpus). In this embodiment, the weight is greater than 0 and at most 1.

In the example shown in FIG. 2, every seed template contains two variables, but seed templates are not limited to this form. For example, expressions containing no variables, expressions containing only one variable, and expressions specifying the beginning or end of a sentence can also be used. To indicate the beginning or end of a sentence, a character string (tag) representing it is attached to the template.

A seed template may also specify only particular words, or a sequence of parts of speech.

Referring to FIG. 3, the template expansion rules stored in the template expansion rule storage unit 54 are rules for expanding the seed templates stored in the seed template set storage unit 50 to generate expanded templates. Template expansion rules can also be written using regular expressions.

For example, consider the seed template “the reason for A<disease name> is B” shown in (1) of FIG. 3. “<disease name>” is the word tag assigned to variable A and indicates that the word at this position must carry the tag “disease name”.

According to the template expansion rules shown in FIG. 3, the template “the reason for A<disease name> is B” yields templates such as “the reason for A is B”, “A is caused by B”, “B gives rise to A”, and “A is due to B”. The word classes assigned to the variables are not repeated here.

In this way, by preparing many template expansion rules in advance and applying them to the seed templates, new templates (called “expanded templates”) can be generated. With many template expansion rules, many expanded templates can be generated from a single seed template, which increases the number of instances extracted from the Web corpus 32.
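The expansion step can be sketched with regular-expression rewrite rules, as below. The rule list, the English template wording, and the function name `expand` are assumptions standing in for the rules of FIG. 3, which are not reproduced in this text. Templates are written as plain strings in which A and B denote the variables.

```python
import re

# Hypothetical expansion rules: (pattern over the seed template, rewrite).
EXPANSION_RULES = [
    (r"^the reason for (\w+) is (\w+)$", r"\1 is caused by \2"),
    (r"^the reason for (\w+) is (\w+)$", r"\2 gives rise to \1"),
    (r"^the reason for (\w+) is (\w+)$", r"\1 is due to \2"),
]


def expand(seed_template, rules=EXPANSION_RULES):
    """Apply every applicable rule to one seed template and
    return the list of expanded templates."""
    expanded = []
    for pattern, rewrite in rules:
        if re.match(pattern, seed_template):
            expanded.append(re.sub(pattern, rewrite, seed_template))
    return expanded


expanded_templates = expand("the reason for A is B")
```

Rules that do not apply to a given seed template are simply skipped, so one rule set can serve a whole seed template set.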

Although not shown in FIG. 3, using a thesaurus in which words are arranged according to their semantic structure further increases the number of templates that the template expansion rules can generate. For example, if the word class immediately above “drug name” is “substance name”, then when a template contains a variable whose word class is “drug name”, a template in which that variable's word class is replaced by the higher-level “substance name” can also be generated. Whether such replacement is allowed depends both on the design of the system and on its run-time settings.
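The thesaurus-based generalization could look like the following sketch, in which a template is reduced to a mapping from variable names to word classes. The one-entry thesaurus mirrors the “drug name” example in the text; the function name `generalize` and the mapping representation are assumptions.

```python
# Hypothetical thesaurus fragment: word class -> next higher word class.
HYPERNYMS = {"drug name": "substance name"}


def generalize(variable_classes, thesaurus=HYPERNYMS):
    """Return one new variable-to-class mapping for each variable whose
    word class has a higher-level class in the thesaurus."""
    variants = []
    for var, word_class in variable_classes.items():
        parent = thesaurus.get(word_class)
        if parent is not None:
            variant = dict(variable_classes)
            variant[var] = parent
            variants.append(variant)
    return variants


variants = generalize({"A": "disease name", "B": "drug name"})
```

A run-time switch for enabling or disabling this step could be as simple as passing an empty thesaurus.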

In the example shown in FIG. 3, the weight of each expansion rule is shown to its right. In this embodiment, the occurrence frequency of each finally extracted or generated word string pattern is adjusted according to the product of the weight assigned to the template expansion rule and the weight of the underlying seed template. Templates produced by the expansion rules are ones that the user did not explicitly specify as seed templates. In the final language model, it is therefore considered more in keeping with the purpose to give instances extracted with expanded templates a lower occurrence frequency than instances extracted with seed templates. For this reason, in this embodiment the weight assigned to each template expansion rule is greater than 0 and less than 1.
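The text specifies that the output frequency follows the product of the two weights but not how a weight is turned into a number of copies. The sketch below assumes one simple mapping: multiply the product by a fixed base count, round, and emit at least one copy. The names `copy_count`, `emit`, and `base_copies` and the value 10 are all hypothetical.

```python
def copy_count(seed_weight, rule_weight=1.0, base_copies=10):
    """Number of times a generated word string is written to the corpus.

    seed_weight is in (0, 1]. rule_weight is in (0, 1); the default 1.0
    stands for an instance extracted directly with a seed template.
    """
    if not 0 < seed_weight <= 1:
        raise ValueError("seed_weight must be in (0, 1]")
    if not 0 < rule_weight <= 1:
        raise ValueError("rule_weight must be in (0, 1]")
    return max(1, round(base_copies * seed_weight * rule_weight))


def emit(word_string, seed_weight, rule_weight=1.0, base_copies=10):
    """Return the word string repeated according to its template weight."""
    return [word_string] * copy_count(seed_weight, rule_weight, base_copies)


lines = emit("what is the cause of influenza", seed_weight=0.8, rule_weight=0.5)
```

Because every rule weight is below 1, an instance obtained through an expanded template is always copied at most as often as one obtained through its seed template.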

Referring again to FIG. 1, the language model learning device 30 further includes: a filter 60 that uses the extended templates stored in the extended template set storage unit 56 to extract, from the sentences included in the Web corpus 32, instances that match any of the extended templates; a syntax analysis dictionary 58 that the filter 60 refers to when parsing each sentence in the Web corpus 32; an extracted sentence corpus storage device 62 that stores an extracted sentence corpus consisting of the instance sentences extracted from the Web corpus 32 by the filter 60; a word string transformation rule storage unit 64 that stores word string transformation rules for converting the extracted sentences stored in the extracted sentence corpus storage device 62 into sentence patterns suited to the target field and application of the language model finally obtained; a transformation module 66 that applies, to each sentence stored in the extracted sentence corpus storage device 62, those word string transformation rules stored in the word string transformation rule storage unit 64 that are applicable, and outputs the transformed sentences; and a transformed word string set storage unit 68 that stores the set of transformed sentences output from the transformation module 66. In the present embodiment, the word string transformation rules stored in the word string transformation rule storage unit 64 are also written as regular expressions.

FIG. 4 shows a simple example of a word string transformation rule. As described above, the language model learning device 30 is assumed here to be used to create a language model for recognizing spoken questions about diseases. The word string transformation rule shown in (1) of FIG. 4 generates the question “Ａを引き起こすものについて教えてください。” (“Please tell me about what causes A.”) from a word string pattern of the form “Ａを引き起こすＢ” (“B that causes A”). Here too, “A” and “B” are variables. As with templates, a variable may carry an attribute specification such as a word class. When a variable carries an attribute, a word string pattern that matches the left side of the rule, attributes included, is transformed into the word string given on the right side of the rule.

FIG. 4 shows only rules that have one left side and one right side each. However, the present invention is not limited to such an embodiment. Several word string transformation rules that share the same left side but have different right sides may be implemented as a single combined rule.
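Since the rules are said to be written as regular expressions, a word string transformation rule with one left side and several right sides might look like the sketch below. The first right side follows rule (1) of FIG. 4; the second right side, the input string, and the `RULES` layout are hypothetical and only illustrate the combined-rule idea. Attribute checks on the variables are omitted.

```python
import re

# Each rule: (compiled left side, list of right-side templates).
# The second right side is a made-up example of an extra question form.
RULES = [
    (
        re.compile(r"^(?P<A>.+?)を引き起こす(?P<B>.+)$"),
        [
            r"\g<A>を引き起こすものについて教えてください。",
            r"\g<A>の原因は何ですか。",
        ],
    ),
]

def transform(word_string):
    """Apply every rule whose left side matches and collect all right sides."""
    outputs = []
    for left, rights in RULES:
        match = left.match(word_string)
        if match:
            outputs.extend(match.expand(right) for right in rights)
    return outputs

questions = transform("頭痛を引き起こす病気")
```

A word string that does not match any left side simply yields an empty list, which mirrors the "do nothing and move to the next rule" behavior.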

Referring again to FIG. 1, the language model learning device 30 further includes: a frequency calculation module 70 that calculates the appearance frequency of each word string, structure included, appearing in the Web corpus 32; a frequency data storage unit 72 that stores frequency data consisting of the appearance frequencies calculated for each word string by the frequency calculation module 70; and a frequency adjustment module 74 that adjusts the appearance frequency of word strings in the corpus finally obtained by determining, for each transformed sentence stored in the transformed word string set storage unit 68, the number of times the sentence is to be output, based on the weight attached to the sentence and the word string frequency data stored in the frequency data storage unit 72, and outputting the sentence that number of times. The set of transformed sentences output by the frequency adjustment module 74 forms the learning corpus 34.

本実施の形態では、頻度データ記憶部７２に記憶される頻度データは、構造別の単語列ごとの出現確率である。 In the present embodiment, the frequency data stored in the frequency data storage unit 72 is an appearance probability for each word string by structure.

Referring to FIG. 5, the program that implements the template expansion processing unit 52 of FIG. 1 on a computer has the following control structure. The program includes: step 100 of reading all the template expansion rules stored in the template expansion rule storage unit 54 into the main memory of the computer; step 102 of executing step 104, described below, for each rule read in step 100; and step 106 of, after step 102 is complete, outputting the result of merging the seed templates and the extended templates obtained in step 102 as the extended template set and ending the process.

Step 104 includes: step 110 of applying the expansion rule currently being processed to every seed template stored in the seed template set storage unit 50 to which it is applicable, thereby creating a new template (an extended template); step 112 of calculating the product of the weight of the underlying seed template and the weight of the expansion rule and attaching it to the newly created template as its weight; and step 114 of merging the extended template, together with the weight calculated in step 112, into the seed templates. If the new template created in step 110 has already been merged into the seed templates, it is not added again.
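The control structure of FIG. 5 can be sketched roughly as below. Representing templates as plain strings and expansion rules as `(pattern, replacement, weight)` triples is an assumption made for brevity; the English templates and the rule weights are illustrative only.

```python
import re

def expand_templates(seeds, rules):
    """seeds: dict mapping template -> weight.
    rules: list of (regex pattern, replacement, rule weight in (0, 1))."""
    merged = dict(seeds)                                   # seeds are kept unchanged
    for pattern, replacement, rule_weight in rules:        # step 102: each rule
        for template, seed_weight in seeds.items():        # step 110: each seed
            if not re.search(pattern, template):
                continue                                   # rule not applicable
            new_template = re.sub(pattern, replacement, template)
            if new_template in merged:
                continue                                   # already merged: not added again
            # Steps 112 and 114: weight is seed weight times rule weight.
            merged[new_template] = seed_weight * rule_weight
    return merged                                          # step 106: extended template set

seeds = {"A causes B": 1.0}
rules = [(r"causes", "leads to", 0.5), (r"causes", "triggers", 0.25)]
expanded = expand_templates(seeds, rules)
```

Because every rule weight is below 1, each extended template ends up with a lower weight than the seed it came from.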

図６を参照して、図１に示すフィルタ６０を実現するためのプログラムは、Ｗｅｂコーパス３２に記憶されている各文に対して以下のステップ１３２を実行するステップ１３０を含む。 Referring to FIG. 6, the program for realizing filter 60 shown in FIG. 1 includes a step 130 for executing the following step 132 for each sentence stored in Web corpus 32.

Step 132 includes: steps 140 and 141 of performing morphological analysis and syntax analysis, respectively, on the sentence being processed with reference to the syntax analysis dictionary 58; and step 142 of receiving the word string pattern produced by the morphological and syntax analysis, which consists of syntax information holding a word string (morpheme string) tagged with word classes, conjugation forms, and the like, and executing step 144, described below, for each template stored in the extended template set storage unit 56. Because the target language here is Japanese, step 140 performs syntax analysis that includes morphological analysis. When the target language separates words with spaces, as English does, morphological analysis is unnecessary, and it suffices to run syntax analysis that includes processing such as part-of-speech analysis. An existing morphological analysis program may be used, for example JUMAN (URL=http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html) or ChaSen (URL=http://chasen-legacy.sourceforge.jp/). There are two techniques for syntax analysis: dependency analysis and phrase structure analysis. Either may be used, but the present embodiment uses dependency analysis. The existing Japanese syntax analysis system KNP (URL=http://nlp.kuee.kyoto-u.ac.jp/nl-resource/knp.html) may be used.

Step 144 includes: step 150 of determining whether the word string pattern being processed contains a portion that matches the template being processed; and step 152 of, when the determination in step 150 is affirmative, outputting the matching portion with the weight of the template attached and moving on to the next template. When the determination in step 150 is negative, nothing is done and processing moves on to the next template. A match here requires that the syntax information making up the word string pattern contain a part that agrees with the syntax information of the word string template apart from its variables, and that, within that part, the word class of each word located at a position corresponding to a variable in the template agree with the word class of that variable. If even a part of the word string pattern being processed matches the template, that part is output. When only the morpheme string (word string) is used, without syntax information, the criterion is that the morpheme string agree with the word string template apart from the template's variables, and that the word class of each morpheme at a position corresponding to a variable agree with the word class of that variable.
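For the simpler case that uses only the morpheme string, the matching criterion of steps 150 and 152 can be sketched as follows. The tagged-token format, the English tokens, and the word classes “cause” and “symptom” are assumptions for illustration; a real system would take its tags from a morphological analyzer.

```python
def _fits(token, item):
    """A literal must equal the surface form; a variable must agree in word class."""
    surface, word_class = token
    if isinstance(item, tuple):              # (variable name, word class)
        return word_class == item[1]
    return surface == item

def find_match(tagged, template):
    """Return the first contiguous part of the tagged sentence matching the template."""
    width = len(template)
    for start in range(len(tagged) - width + 1):
        window = tagged[start:start + width]
        if all(_fits(token, item) for token, item in zip(window, template)):
            return window
    return None

def filter_sentence(tagged, templates):
    """Steps 150 and 152: output each matching part with the template's weight."""
    hits = []
    for template, weight in templates:
        match = find_match(tagged, template)
        if match is not None:
            hits.append((match, weight))
    return hits

sentence = [("often", "adverb"), ("stress", "cause"),
            ("causes", "verb"), ("headache", "symptom")]
templates = [
    ([("A", "cause"), "causes", ("B", "symptom")], 1.0),
    ([("A", "drug"), "treats", ("B", "symptom")], 0.5),
]
hits = filter_sentence(sentence, templates)
```

Only the first template matches here, so `hits` holds the three-token part “stress causes headache” together with the weight 1.0.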

図７を参照して、図１の変形モジュール６６及び頻度調整モジュール７４を実現するためのプログラムは、抽出文コーパス記憶装置６２に記憶された文に含まれる各単語列に対し、以下のステップ１８２を実行するステップ１８０を含む。 Referring to FIG. 7, the program for realizing the transformation module 66 and the frequency adjustment module 74 of FIG. 1 performs the following step 182 for each word string included in the sentence stored in the extracted sentence corpus storage device 62. Step 180 is executed.

ステップ１８２は、単語列変形規則記憶部６４に記憶された各単語列変形規則について、以下のステップ２０２を実行するステップ２００を含む。 Step 182 includes a step 200 of executing the following step 202 for each word string modification rule stored in the word string modification rule storage unit 64.

Step 202 includes: step 210 of determining whether the word string pattern being processed matches the left side of the transformation rule being processed and, if it does not, proceeding to the next transformation rule; step 212 of, when the determination in step 210 is affirmative, transforming the word string pattern according to the rule to generate a new word string; step 214 of calculating, following step 212, the product of the appearance frequency, in that pattern (word string structure), of the words contained in the word string pattern (the product of the frequencies when there are several words), the weight assigned to the word string pattern, and a predetermined constant for adjusting the number of sentences included in the learning corpus 34 finally obtained; and step 216 of outputting the word string obtained in step 212 repeatedly, the number of times given by the integer part of the value calculated in step 214. When the value calculated in step 214 is less than 1, the present embodiment outputs the transformed word string just once in step 216.
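The repetition count of steps 214 and 216 reduces to a short calculation. The sketch below assumes the word frequencies are given as probabilities; the function names, the probabilities, and the scaling constants are illustrative values, not figures from the patent.

```python
import math

def output_count(word_probs, pattern_weight, scale):
    """Step 214: product of word probabilities, pattern weight, and a constant.
    Step 216: use the integer part, but output at least once."""
    value = math.prod(word_probs) * pattern_weight * scale
    return max(1, int(value))

def emit(word_string, word_probs, pattern_weight, scale):
    """Return the transformed word string repeated the computed number of times."""
    return [word_string] * output_count(word_probs, pattern_weight, scale)

copies = output_count([0.5, 0.25], 0.5, 100)   # 0.125 * 0.5 * 100 = 6.25
rare = output_count([0.5, 0.25], 0.5, 4)       # 0.25 is below 1, so one copy
```

With these inputs `copies` is 6 and `rare` is 1. Raising `scale` enlarges the whole learning corpus without changing the relative frequencies much, which appears to be the role of the predetermined constant.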

[Operation]
The language model learning device 30 shown in FIGS. 1 to 7 operates as follows. A large number of sentences are collected from the Web in advance and stored in the Web corpus 32. The frequency calculation module 70 performs morphological analysis and syntax analysis on each sentence in the Web corpus 32 in advance, calculates for each word its appearance frequency in each structure in which it appears, and stores the result as frequency data in the frequency data storage unit 72. This processing is almost the same as ordinary language model training.

In the present embodiment, the user creates the seed templates, the template expansion rules, and the word string transformation rules in advance and stores them in the seed template set storage unit 50, the template expansion rule storage unit 54, and the word string transformation rule storage unit 64, respectively. All of these use regular expressions. The policy for creating them is determined by the field to which the language model 38 finally obtained will be applied and the application that will use it. Which templates and rules are ultimately created, however, is up to the user.

Once the seed templates and the template expansion rules are ready, the template expansion processing unit 52 runs and expands the templates by applying the template expansion rules stored in the template expansion rule storage unit 54 to each of the seed templates stored in the seed template set storage unit 50. This expansion generates a large number of templates, which are stored in the extended template set storage unit 56.

さらに、単語列変形規則を生成し、予め単語列変形規則記憶部６４に格納しておく。構文解析用辞書５８としては、フィルタ６０で使用する形態素解析及び構文解析プログラムに適合したフォーマットのものを用意しておく。 Further, a word string deformation rule is generated and stored in the word string deformation rule storage unit 64 in advance. The syntax analysis dictionary 58 is prepared in a format suitable for the morphological analysis and syntax analysis program used in the filter 60.

拡張テンプレート集合記憶部５６に格納された拡張テンプレートは予め全て読み出され、図示しない主記憶部に記憶される。フィルタ６０は、Ｗｅｂコーパス３２から文を読出し、それぞれについて形態素解析及び構文解析を行なう（図６のステップ１４０）。さらにフィルタ６０は、構文解析により得られた単語列パターン（単語クラス、意味クラス等のタグが付された形態素列と、それらをリーフとして持つ構文解析木からなる構文情報）について、主記憶部に記憶された拡張テンプレートにマッチする部分を持つか否かを判定する（ステップ１５０）。拡張テンプレートのいずれかとマッチする部分がある場合（ステップ１５０の判定が肯定）、フィルタ６０はその単語列パターンを、マッチしたテンプレートに付された重みとともに抽出文コーパス記憶装置６２に出力する（ステップ１５２）。抽出文コーパス記憶装置６２はこれらの単語列パターンを単語に付されたタグ及び重みとともに記憶する。フィルタ６０は、Ｗｅｂコーパス３２に記憶された全ての文についてこれを繰返す。 All the extension templates stored in the extension template set storage unit 56 are read in advance and stored in a main storage unit (not shown). The filter 60 reads a sentence from the Web corpus 32 and performs morphological analysis and syntax analysis on each of them (step 140 in FIG. 6). Further, the filter 60 stores the word string pattern (syntax information consisting of morpheme strings with tags such as word classes and semantic classes and the parse tree having them as leaves) obtained by the syntax analysis in the main storage unit. It is determined whether or not there is a portion that matches the stored extension template (step 150). If there is a portion that matches any of the extended templates (Yes at step 150), the filter 60 outputs the word string pattern to the extracted sentence corpus storage device 62 together with the weight assigned to the matched template (step 152). ). The extracted sentence corpus storage device 62 stores these word string patterns together with tags and weights attached to the words. The filter 60 repeats this for all sentences stored in the web corpus 32.

The transformation module 66 applies the word string transformation rules stored in the word string transformation rule storage unit 64 to each word string pattern stored in the extracted sentence corpus storage device 62. That is, for each word string pattern being processed, the transformation module 66 calls up the transformation rules and determines whether the word string pattern matches the left side of a rule (step 210). When it does (the determination in step 210 is affirmative), the transformation module 66 transforms the word string pattern according to the right side of the rule and generates a word string from it (step 212). The transformed word string is stored in the transformed word string set storage unit 68. The frequency adjustment module 74 calculates the weight of the transformed word string as the product of the appearance frequencies of the words appearing in the word string, the weight of the template applied in the filter 60, which was attached to the word string, and a predetermined constant (step 214). The frequency adjustment module 74 then outputs the transformed word string repeatedly, as many times as the integer part of the weight so calculated (step 216). All the transformed sentences output are stored in the learning corpus 34. When step 216 ends, the transformation module 66 proceeds to the next transformation rule. If the determination in step 210 is negative, the transformation module 66 does nothing for that rule and proceeds to the next transformation rule.

このようにして、ある単語列パターンについて、変形モジュール６６及び変形単語列集合記憶部６８が全ての変形規則を適用する処理が完了すると、次の単語列パターンについて、同じ処理が実行される。 In this way, when the modification module 66 and the modified word string set storage unit 68 apply all the modification rules for a certain word string pattern, the same process is executed for the next word string pattern.

全ての単語列パターンについて、変形モジュール６６及び頻度調整モジュール７４が全ての変形規則を適用すると、処理を終了する。 When all the transformation rules are applied by the transformation module 66 and the frequency adjustment module 74 for all the word string patterns, the processing ends.

The learning corpus 34 is created in this way. It consists of sentences obtained by transforming, with the word string transformation rules prepared in advance, sentences that fit the seed templates prepared at the outset and sentences that fit the extended templates derived from and related to those seed templates. The extended templates are formed from synonyms of words contained in the seed templates, paraphrases of the expressions in the seed templates, and so on. The word string transformation rules, for their part, assume the sentence patterns frequently used in the application in which the target language model will be used. The learning corpus 34 therefore contains many words that frequently appear in utterances about the specific field, their synonyms, and turns of phrase frequently used in the specific application. Moreover, because the seed templates are expanded by the template expansion rules, the extended template set storage unit 56 holds a very large number of templates. And because these templates use regular expressions, matching them against the sentences in the Web corpus 32 extracts a very large number of word string patterns (morpheme strings) from the Web corpus 32. The word “extraction” is used here, but since paraphrases are also accepted as templates, expressions not contained in the Web corpus 32 are also extracted through the processing of the filter 60.

The Web corpus 32 is considered to contain the largest number of expressions of any available corpus. The expressions in the Web corpus 32 were, however, created by people, so their number is inevitably limited. In contrast, expanding the templates, matching the various extended templates against the Web corpus 32, and further varying the expressions by means of the extended templates, as in the present embodiment, gives the transformation module 66 a far wider range of expressions than those created by hand. Consequently, the language model 38 trained on the learning corpus 34 generated from those expressions is suited to the field or application originally intended, and can also calculate appearance probabilities for a very wide range of expressions, including expressions not contained in the Web corpus 32. As a result, speech recognition using the language model 38 can achieve high recognition accuracy for the field and application intended when the seed templates and word string transformation rules were created.

It will be apparent to those skilled in the art, however, that the Web corpus 32 used in the present invention is not limited to one containing only sentences collected from the Web. The Web corpus 32 may consist of sentences collected from the Web supplemented with sentences from another source, or may be a corpus containing no sentences collected from the Web at all.

In the embodiment described above, the template expansion processing unit 52 expands only the seed templates. However, the present invention is not limited to this. The number of templates may be increased further by applying the template expansion rules again to the extended templates obtained by applying them to the seed templates. In that case, template expansion may be performed a predetermined number of times, or repeated until no new extended template appears.
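The repeated expansion described in this variation amounts to iterating until a fixed point or a round limit is reached. The following is a sketch under the same simplifying assumptions as elsewhere in these examples (string templates, `(pattern, replacement, weight)` rules, hypothetical English templates); `max_rounds` stands in for the predetermined number of repetitions.

```python
import re

def expand_to_fixpoint(seeds, rules, max_rounds=10):
    """Re-apply expansion rules to newly created templates until none are new."""
    templates = dict(seeds)
    frontier = dict(seeds)                     # templates created in the last round
    for _ in range(max_rounds):
        new = {}
        for pattern, replacement, rule_weight in rules:
            for template, weight in frontier.items():
                candidate = re.sub(pattern, replacement, template)
                if candidate == template:
                    continue                   # rule did not apply
                if candidate in templates or candidate in new:
                    continue                   # not a new template
                new[candidate] = weight * rule_weight
        if not new:
            break                              # no new extended template appeared
        templates.update(new)
        frontier = new
    return templates

seeds = {"A causes B": 1.0}
rules = [(r"\bcauses\b", "leads to", 0.5), (r"^A\b", "the A", 0.5)]
result = expand_to_fixpoint(seeds, rules)
```

Because the weight is multiplied by a rule weight below 1 on every application, a template reached through two rules (here “the A leads to B”, weight 0.25) is weighted lower than one reached through a single rule.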

図４に示す例では、１つの単語列パターンを変形して１つの新たな単語列を生成する変形規則のみが示されている。しかし、本発明はそのような実施の形態に限定されるわけではない。例えば、規則中に、別の単語列パターンを参照する記述を含ませることにより、２つの単語列パターンから新たな単語列を作成するような規則を用いても良い。 In the example shown in FIG. 4, only a deformation rule for deforming one word string pattern to generate one new word string is shown. However, the present invention is not limited to such an embodiment. For example, a rule that creates a new word string from two word string patterns by including a description referring to another word string pattern in the rule may be used.

例えば、変形後の単語列パターンの集合の中に、ある単語で終わっている単語列パターンと、同じ単語で始まっている単語列パターンとが存在しているときに、それら２つの単語列パターンを、共通の単語を中心に互いに接続して新たな単語列を作成することができる。例えば、「ＡのＢ」というテンプレートのインスタンスとして「ボリビアの首都」という単語列パターンが抽出され、「Ｘはどこ」というテンプレートに対して「首都はどこ」というインスタンスが抽出されたときを考える。前者の最後の単語と、後者の先頭の単語とは、いずれも「首都」である。こうしたときには、両者を「首都」を中心に接続し、「ボリビアの首都はどこ」という新たな単語列を生成できる。 For example, when there are a word string pattern ending with a certain word and a word string pattern beginning with the same word in the transformed set of word string patterns, the two word string patterns are A new word string can be created by connecting each other around a common word. For example, consider a case where a word string pattern “capital of Bolivia” is extracted as an instance of a template “B of A” and an instance “where is the capital” is extracted for a template “where is X”. The last word of the former and the first word of the latter are both “capital”. In such a case, a new word string “where is the capital city of Bolivia” can be generated by connecting both of them with the capital city as the center.
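Joining two word strings around a shared word, as in the “ボリビアの首都” and “首都はどこ” example, can be sketched as below. Representing each word string as a list of already segmented words is an assumption; in practice the segmentation would come from the morphological analysis.

```python
def join_on_shared_word(first, second):
    """Join two word lists when the first ends with the word the second starts with."""
    if first and second and first[-1] == second[0]:
        return first + second[1:]          # keep the shared word only once
    return None                            # no shared word: nothing to join

joined = join_on_shared_word(["ボリビア", "の", "首都"], ["首都", "は", "どこ"])
sentence = "".join(joined)
```

The result corresponds to the new word string “ボリビアの首都はどこ” given in the text.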

As another example, when a template contains no variables, simply concatenating templates is also performed as a kind of transformation. Suppose, for instance, that the templates include something like “ですね (sentence end)” and the transformed word strings include the expression “首都ですね” (“It is the capital, isn't it”). In that case, the instance “ボリビアの首都” (“the capital of Bolivia”) and the expression “ですね” are concatenated directly, and the expression “ボリビアの首都ですね” (“It is the capital of Bolivia, isn't it”) is also generated as a transformed word string.

こうした処理のためには、そのための変形規則を単語列変形規則記憶部６４に記憶された変形規則とは別に準備しておく必要がある。図７に示す処理が完了した後に、これら規則に従って、変形後の単語列をさらに加工するようにすればよい。 For such processing, it is necessary to prepare a modification rule for that purpose separately from the modification rule stored in the word string modification rule storage unit 64. After the process shown in FIG. 7 is completed, the modified word string may be further processed according to these rules.

上記実施の形態では、頻度調整モジュール７４は変形文に割当てられた重みと、変形文に含まれる単語の出現確率の積との積により、その変形文の複写数を調整している。しかし本発明はそのような実施の形態には限定されない。例えば、変形文に割当てる重みは全て等しい値としてもよい。また、変形文に含まれる全ての単語の出現確率の積ではなく、例えば名詞だけの出現確率を用いるようにしても良い。 In the above embodiment, the frequency adjustment module 74 adjusts the number of copies of the modified sentence by the product of the weight assigned to the modified sentence and the product of the appearance probabilities of the words included in the modified sentence. However, the present invention is not limited to such an embodiment. For example, all the weights assigned to the deformed sentences may be equal values. Further, instead of the product of the appearance probabilities of all the words included in the modified sentence, for example, the appearance probability of only a noun may be used.

上記実施の形態では、シードテンプレートに予め種々の重みを付与している。しかし本発明はそのような実施の形態には限定されない。シードテンプレートに付与している重みを一定とし、どのテンプレート拡張規則が用いられたかのみにより、テンプレートの重みを決定するようにしてもよい。または、Ｗｅｂコーパス３２に含まれる単語列について、適用可能なテンプレートが複数個ある場合には、その個数に応じて大きくなる重みを与えるようにしてもよい。テンプレート拡張規則をシードテンプレートだけでなく拡張テンプレートにも適用してテンプレートを作成するようにした場合には、拡張テンプレートを適用するごとに、テンプレートの重みが軽くなるようにすることが望ましい。 In the above embodiment, various weights are given to the seed template in advance. However, the present invention is not limited to such an embodiment. The weight assigned to the seed template may be constant, and the template weight may be determined only by which template expansion rule is used. Alternatively, when there are a plurality of applicable templates for the word string included in the web corpus 32, a weight that increases according to the number of templates may be given. When a template is created by applying the template expansion rule not only to the seed template but also to the expansion template, it is desirable that the weight of the template is reduced each time the expansion template is applied.

さらに、フィルタ６０によるフィルタリングの際に、上記実施の形態では、抽出された単語列に対し、抽出の際に適用されたテンプレートの重みを付しているだけである。しかし本発明はそのような実施の形態には限定されない。例えば、処理対象の文のうち、どの程度の大きさの部分があるテンプレートに適合したかにより、重みを変化させるようにしても良い。この場合、文の全体が１つのテンプレートに適合した場合に重みは変化させず、マッチした部分の文全体に対する割合が小さくなるにしたがって、重みも小さくなるようにすることが望ましい。 Furthermore, when filtering by the filter 60, the template weight applied at the time of extraction is only attached to the extracted word string in the above embodiment. However, the present invention is not limited to such an embodiment. For example, the weight may be changed depending on how large a portion of the sentence to be processed matches a template. In this case, it is desirable that the weight is not changed when the whole sentence matches one template, and the weight becomes smaller as the ratio of the matched portion to the whole sentence becomes smaller.
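One way to realize this variation is to scale the template weight by the fraction of the sentence that was matched. The linear scaling below is an assumption; the text only requires that the weight stay unchanged for a full-sentence match and shrink as the matched share shrinks.

```python
def coverage_weight(template_weight, matched_len, sentence_len):
    """Scale the template weight by the share of the sentence that matched.
    Lengths are counted in words (morphemes)."""
    if not 0 < matched_len <= sentence_len:
        raise ValueError("matched part must be non-empty and fit in the sentence")
    return template_weight * (matched_len / sentence_len)

full = coverage_weight(0.8, 5, 5)      # whole sentence matched: weight unchanged
partial = coverage_weight(1.0, 2, 4)   # half of the sentence matched
```

Here `full` stays at 0.8 and `partial` drops to 0.5.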

上記実施の形態では、単語列変形規則記憶部６４に記憶される単語列変形規則については重みを付与していない。しかし本発明はそのような実施の形態には限定されない。例えば、予め単語列変形規則に対して０より大きく１以下の重みを付与しておき、マッチした文に付与されていた重みにこの重みを乗じて、変形後の文の重みとしてもよい。 In the above embodiment, no weight is given to the word string deformation rules stored in the word string deformation rule storage unit 64. However, the present invention is not limited to such an embodiment. For example, a weight greater than 0 and less than or equal to 1 may be assigned to the word string modification rule in advance, and the weight assigned to the matched sentence may be multiplied by this weight to obtain the weight of the sentence after modification.

上記実施の形態では、各規則はいずれも正規表現を用いて記述されている。しかし本発明はそのような実施の形態には限定されない。目的に応じて規則を的確に記述できるものであれば、どのような記述方式に従うものであってもよい。 In the above embodiment, each rule is described using regular expressions. However, the present invention is not limited to such an embodiment. As long as the rules can be accurately described according to the purpose, any description method may be used.

さらに、上記した実施の形態では、コーパスの各文に対して構文解析を行なっている。しかし本発明はそのような実施の形態には限定されず、形態素解析のみを行なうようにしてもよい。この場合得られるのは１次元的に配列された形態素列となるが、これも一種の構造とみなせば、以後の処理としては上記実施の形態の処理をそのまま適用することができる。 Further, in the above-described embodiment, syntax analysis is performed on each sentence of the corpus. However, the present invention is not limited to such an embodiment, and only morphological analysis may be performed. In this case, a morpheme string arranged one-dimensionally is obtained, but if this is also regarded as a kind of structure, the process of the above embodiment can be applied as it is as the subsequent process.

[Implementation by Computer]
The language model learning device 30 according to this embodiment can be realized by computer hardware, a program executed by that computer hardware, and data stored in the computer hardware.

Referring to FIG. 8, the computer system 330 includes a computer 340 having an FD (flexible disk) drive 352 and a CD-ROM (compact disc read-only memory) drive 350, a keyboard 346, a mouse 348, and a monitor 342.

Referring to FIG. 9, in addition to the FD drive 352 and the CD-ROM drive 350, the computer 340 includes a CPU (central processing unit) 356; a bus 366 connected to the CPU 356, the FD drive 352, and the CD-ROM drive 350; a read-only memory (ROM) 358 that stores a boot-up program and the like; and a random access memory (RAM) 360 that is connected to the bus 366 and stores program instructions, a system program, working data, and the like. The computer system 330 further includes a network interface (I/F) 344 that provides a connection to the Internet. Although not shown, the computer 340 is connected to a mobile phone network via the network I/F 344 and can perform data communication with the mobile phone 300.

A computer program for causing the computer system 330 to operate as the language model learning device 30 is stored on a CD-ROM 362 or an FD 364 inserted into the CD-ROM drive 350 or the FD drive 352, and is then transferred to the hard disk 354. Alternatively, the program may be transmitted to the computer 340 over a network (not shown) and stored on the hard disk 354. The program is loaded into the RAM 360 at execution time. The program may also be loaded into the RAM 360 directly from the CD-ROM 362, from the FD 364, or via the network.

This program includes a plurality of instructions that cause the computer 340 to operate as the language model learning device 30 of this embodiment. Some of the basic functions needed for this operation are provided by the operating system (OS) running on the computer 340, by third-party programs, or by modules of various toolkits installed on the computer 340. Accordingly, the program need not include every function required to realize the system and method of this embodiment. The program need only include those instructions that carry out the operation of the language model learning device 30 described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result.

The Web corpus 32, seed template set storage unit 50, template extension rule storage unit 54, extension template set storage unit 56, syntax analysis dictionary 58, extracted sentence corpus storage device 62, word string deformation rule storage unit 64, deformed word string set storage unit 68, frequency data storage unit 72, learning corpus 34, language model 38, and so on shown in FIG. 1 are all realized by the hard disk 354 or the RAM 360 shown in FIG. 9. In particular, areas such as the Web corpus 32, the seed template set storage unit 50, the template extension rule storage unit 54, the extracted sentence corpus storage device 62, and the word string deformation rule storage unit 64 are normally allocated on the hard disk 354, and when the program runs, the necessary information is read from these areas and loaded into the RAM 360 as needed. The data stored in the extension template set storage unit 56, the extracted sentence corpus storage device 62, the deformed word string set storage unit 68, and the like has the character of working files. Such data is therefore created in the RAM 360 when it is generated and is saved to the hard disk 354 if it needs to be retained. The same applies to the learning corpus 34 and the language model 38.

The operation of the computer system 330 itself is well known and is therefore not repeated here.

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim of the appended claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of those claims.

30 Language model learning device
32 Web corpus
34 Learning corpus
36 Language model learning module
38 Language model
50 Seed template set storage unit
52 Template extension processing unit
54 Template extension rule storage unit
56 Extension template set storage unit
58 Syntax analysis dictionary
60 Filter
64 Word string deformation rule storage unit
66 Deformation module
74 Frequency adjustment module

Claims

A language model learning device that is used together with machine-readable corpus storage means for storing a corpus including a plurality of natural language sentences, and that learns a language model suitable for a specific application from the corpus, the device comprising:
template storage means for storing word string templates prepared in advance for the specific application;
word string extraction means for extracting, from the corpus, word string patterns that match the word string templates stored in the template storage means;
deformation means for deforming the word string patterns extracted by the word string extraction means, based on deformation rules prepared in advance, so that natural language word strings in a format suited to a preselected purpose are generated; and
learning means for learning the language model using the word strings output by the deformation means as learning data,
wherein the template storage means includes:
seed template storage means for storing seed templates, each being a basic word string template to be extracted from machine-readable sentences;
extension template generation means for generating extension templates by applying template extension rules prepared in advance to each of the seed templates stored in the seed template storage means; and
extension template storage means for storing the extension templates generated by the extension template generation means and the seed templates stored in the seed template storage means, and for providing them to the word string extraction means as the word string templates.

The language model learning device according to claim 1,
wherein each of the word string templates stored in the template storage means includes a sequence consisting of one or more variables, each representing an arbitrary word that satisfies a predetermined constraint condition, and text data representing the other word string patterns.

The language model learning device according to claim 2, wherein
the predetermined constraint condition is the word class to which the word represented by each variable belongs, and
the word string extraction means includes:
morphological analysis means for performing morphological analysis on each of the plurality of natural language sentences stored in the corpus, attaching to each morpheme a tag indicating the word class to which the morpheme belongs, and outputting the result as a morpheme string; and
means for comparing each of the word string templates stored in the template storage means with the morpheme string output by the morphological analysis means, and extracting from the corpus those morpheme strings that are identical to the word string template except at the variables included in the word string template, and in which the word class of the morpheme at the position corresponding to each variable of the word string template matches the word class of that variable.

The language model learning device according to claim 1,
wherein each of the word string templates stored in the template storage means includes a word string pattern comprising one or more variables, each representing an arbitrary word that satisfies a predetermined constraint condition, other word strings, and syntax information indicating the grammatical relationships among the variables and the word strings.

The language model learning device according to claim 4, wherein
the predetermined constraint condition is the word class to which the word represented by each variable belongs, and
the word string extraction means includes:
morphological analysis means for performing morphological analysis on each of the plurality of natural language sentences stored in the corpus, attaching to each morpheme a tag indicating the word class to which the morpheme belongs, and outputting the result as a morpheme string;
syntax analysis means for performing syntax analysis on the morpheme string output by the morphological analysis means, and outputting a word string pattern that includes syntax information of the natural language sentence; and
means for comparing each of the word string templates stored in the template storage means with the word string pattern output by the syntax analysis means, and extracting from the corpus each portion of the word string pattern output by the syntax analysis means whose structure matches the word string template except at the variables, and in which the word class of the word at the position corresponding to each variable of the word string template matches the word class of that variable.

The language model learning device according to any one of claims 4 to 5, further comprising:
frequency storage means for storing, for each word string pattern, the appearance frequency of that word string pattern in a predetermined corpus; and
frequency adjusting means, provided between the deformation means and the learning means, for receiving the deformed word strings output from the deformation means, determining for each of the word strings a number of copies based on the appearance frequency stored in the frequency storage means for the word string pattern from which that word string was generated, and copying the word string and repeatedly outputting it to the learning means that number of times, thereby adjusting the appearance frequency of the words in the word strings output from the deformation means.

The language model learning device according to any one of claims 1 to 5, further comprising:
frequency storage means for storing the appearance frequency of words appearing in a predetermined corpus; and
frequency adjusting means, provided between the deformation means and the learning means, for receiving the deformed word strings output from the deformation means, determining for each of the word strings a number of copies based on the appearance frequency, stored in the frequency storage means, of each word constituting that word string, and copying the word string and repeatedly outputting it to the learning means that number of times, thereby adjusting the appearance frequency of the words in the word strings output from the deformation means.

The language model learning device according to claim 1, wherein
a weight is assigned in advance to each of the seed templates stored in the template storage means,
each of the extension templates stored in the template storage means is assigned a weight smaller than the weight of the seed template from which that extension template was derived, and
the language model learning device further includes
frequency adjusting means, provided between the deformation means and the learning means, for receiving the deformed word strings output from the deformation means and, for each of the word strings, copying the word string and repeatedly outputting it to the learning means in accordance with the weight assigned to the word string template used by the word string extraction means, thereby adjusting the appearance frequency of the words included in the deformed word strings output from the deformation means.

A computer program that causes a computer connected to machine-readable corpus storage means for storing a corpus including a plurality of natural language sentences to function as:
template storage means for storing word string templates;
word string extraction means for extracting, from the corpus, word string patterns that match the word string templates stored in the template storage means;
deformation means for deforming the word string patterns extracted by the word string extraction means, based on deformation rules prepared in advance, so that natural language word strings in a format suited to a preselected purpose are generated; and
learning means for learning a statistical language model using the set of word strings output by the deformation means as learning data,
wherein the template storage means includes:
seed template storage means for storing seed templates, each being a basic word string template to be extracted from machine-readable sentences;
extension template generation means for generating extension templates by applying template extension rules prepared in advance to each of the seed templates stored in the seed template storage means; and
extension template storage means for storing the extension templates generated by the extension template generation means and the seed templates stored in the seed template storage means, and for providing them to the word string extraction means as the word string templates.

A speech recognition apparatus comprising:
the language model learning device according to any one of claims 1 to 8;
language model storage means for storing a language model learned by the language model learning device; and
speech recognition means for performing speech recognition on input speech by using the language model stored in the language model storage means.