JP2016180849A

JP2016180849A - Learning data generation unit, language model learning unit, learning data generation method and program

Info

Publication number: JP2016180849A
Application number: JP2015060664A
Authority: JP
Inventors: 秀治中嶋; Hideji Nakajima; 秀之水野; Hideyuki Mizuno
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-03-24
Filing date: 2015-03-24
Publication date: 2016-10-13

Abstract

PROBLEM TO BE SOLVED: To provide a learning data generation device that can generate, from given text data, many word n-tuples that are natural as adjacent word sequences, while suppressing the generation of word n-tuples that can never be adjacent. SOLUTION: The learning data generation device includes: a tree structure generation unit that reads a predetermined length of word-segmented text data together with dependency data between the clauses of that text data, and generates a tree structure representing the dependency relations of the text data of the predetermined length; and a sentence generation unit that, based on the tree structure, generates a new sentence by removing some or all of the clauses that receive no dependency from any other clause. SELECTED DRAWING: Figure 1

Description

The present invention relates to a learning data generation device and a learning data generation method that generate learning data for statistical language models used in speech recognition, statistical translation, natural language processing, and the like, and to a language model learning device that learns a statistical language model, and a program.

Learning a statistical language model such as a word n-gram model requires a large amount of text data. However much the text data grows, though, it always yields only a fraction of all possible word n-tuples. Techniques that exploit the available text data as fully as possible are therefore needed.

For example, Non-Patent Document 1 attempts to generate word n-tuples by skipping intervening words, so that the words in a tuple need not be contiguous, and to use these tuples as learning data for a word n-gram model.

Non-Patent Document 1: Guthrie, David, et al., "A closer look at skip-gram modelling", Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), 2006.

When learning data is generated by the method of Non-Patent Document 1, there is a risk of also generating word n-tuples that can never occur adjacently in the language, and a highly accurate statistical language model could not be learned from such learning data.
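
To make the problem concrete, the following is a minimal sketch of position-based skip-gram extraction. It assumes the common definition in which an n-tuple is kept when at most k words in total are skipped between its first and last word. The function name `skip_grams` and the word segmentation of the example sentence are illustrative assumptions, not taken from Non-Patent Document 1.

```python
from itertools import combinations


def skip_grams(words, n, k):
    """Return word n-tuples in which at most k words in total are skipped."""
    grams = []
    for idx in combinations(range(len(words)), n):
        # Number of words jumped over between the first and last chosen word.
        skipped = idx[-1] - idx[0] - (n - 1)
        if skipped <= k:
            grams.append(tuple(words[i] for i in idx))
    return grams


# Assumed segmentation of the example sentence used later in this description.
WORDS = ["あらゆる", "現実", "を", "すべて", "自分", "の", "ほう", "へ",
         "捻じ", "曲げ", "た", "の", "だ"]

one_skip = skip_grams(WORDS, 3, 1)
two_skip = skip_grams(WORDS, 3, 2)

# Position alone decides membership: this tuple simply jumps over one word.
print(("すべて", "自分", "ほう") in one_skip)
# This tuple needs a much longer skip, so a small fixed k does not reach it.
print(("現実", "を", "曲げ") in two_skip)
```

With a small fixed k, tuples are formed purely by position, so a triple such as 「すべて−自分＋ほう」 is produced regardless of syntax, while a triple that spans a long modifier is not.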

Accordingly, an object of the present invention is to provide a learning data generation device that suppresses the generation of word n-tuples that cannot be adjacent and that can generate, from given text data, many word n-tuples that are natural as adjacent word sequences.

The learning data generation device of the present invention includes a tree structure generation unit and a sentence generation unit.

The tree structure generation unit reads a predetermined length of text data divided into words, together with dependency data between the clauses of the text data, and generates a tree structure representing the dependency relations of the text data of the predetermined length. The sentence generation unit, based on the tree structure, generates a new sentence by removing some or all of the clauses that receive no dependency from any other clause.

According to the learning data generation device of the present invention, the generation of word n-tuples that cannot be adjacent is suppressed, and many word n-tuples that are natural as adjacent word sequences can be generated from given text data.

FIG. 1 is a block diagram illustrating the configuration of the learning data generation device and the language model learning device of Embodiment 1.
FIG. 2 is a flowchart illustrating the operation of the learning data generation device and the language model learning device of Embodiment 1.
FIG. 3 is a diagram illustrating an example of the tree structure generated by the tree structure generation unit.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate description is omitted.

The configuration and operation of the learning data generation device and the language model learning device of this embodiment are described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram illustrating the configuration of the learning data generation device 1 and the language model learning device 2 of this embodiment. FIG. 2 is a flowchart illustrating the operation of the learning data generation device 1 and the language model learning device 2 of this embodiment.

As shown in FIG. 1, the learning data generation device 1 of this embodiment includes a tree structure generation unit 11 and a sentence generation unit 12. The device may also be configured as a language model learning device 2 that, in addition to the functions of the learning data generation device 1, carries processing through to the generation of a statistical language model. In this case, the language model learning device 2 further includes a language model learning unit 21 in addition to the components of the learning data generation device 1.

A text data storage unit 8 and a dependency data storage unit 9 are assumed to exist as storage areas external to the learning data generation device 1 and the language model learning device 2. The text data storage unit 8 and the dependency data storage unit 9 may instead be provided inside the learning data generation device 1 and the language model learning device 2. The text data storage unit 8 is assumed to store text data divided into words. The dependency data storage unit 9 is assumed to store dependency data between the clauses of the text data stored in the text data storage unit 8.

Text data divided into words (also called word string data) can be obtained by morphological analysis. Morphological analysis can be carried out, for example, by the method described in Reference Non-Patent Document 1.
(Reference Non-Patent Document 1: Yuji Matsumoto, "Morphological analysis system ChaSen", Information Processing, vol. 41(11), pp. 1208-1214, 2000)

Dependency analysis is a technique that takes the result of morphological analysis as input, groups the words into clauses each consisting of one or more words, and predicts the dependency relations between clauses based on relationships among parts of speech, surface forms of words, and word IDs. Dependency analysis can be carried out, for example, by the method described in Reference Non-Patent Document 2.
(Reference Non-Patent Document 2: Taku Kudo, Yuji Matsumoto, "Japanese dependency analysis by cascaded application of chunking", IPSJ Transactions, 43(6), pp. 1834-1842, 2002)

The tree structure generation unit 11 reads a predetermined length (for example, one sentence at a time) of text data divided into words, together with dependency data between the clauses of the text data, and generates a tree structure representing the dependency relations of the text data of the predetermined length (S11). The sentence generation unit 12, based on the tree structure, generates new sentences by removing some or all of the clauses that receive no dependency from any other clause (S12). The language model learning unit 21 learns a language model using the original sentence and the generated sentences (S21).

Steps S11 and S12 above are described concretely below with reference to the example of FIG. 3. FIG. 3 is a diagram illustrating an example of the tree structure generated by the tree structure generation unit 11.

A dependency relation is information indicating which clause a given clause depends on. It is expressed by the clause number of the dependent (source) clause and the clause number of the clause it depends on (destination). From these, the tree structure generation unit 11 generates a tree structure such as that shown in FIG. 3 (S11). The tree structure generation unit 11 also records the word numbers belonging to each clause, thereby associating the internal structure of each clause with the tree.
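
As an illustration of step S11, the following is a minimal sketch of building such a tree from (source clause number, destination clause number) pairs while recording the word numbers of each clause. The class `ClauseNode`, the function names, the word segmentation, and the dependency pairs are all assumptions made for this example; FIG. 3 itself is not reproduced here, so the structure below is only one reading that is consistent with the clauses named in the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ClauseNode:
    number: int                      # clause number
    word_numbers: List[int]          # word numbers belonging to this clause
    head: Optional[int] = None       # clause this one depends on (None for the root)
    dependents: List[int] = field(default_factory=list)


def build_tree(clause_word_numbers: List[List[int]],
               dependencies: List[Tuple[int, int]]) -> Dict[int, ClauseNode]:
    """Build clause nodes and link them using (source, destination) pairs."""
    nodes = {number: ClauseNode(number, list(word_numbers))
             for number, word_numbers in enumerate(clause_word_numbers)}
    for source, destination in dependencies:
        nodes[source].head = destination
        nodes[destination].dependents.append(source)
    return nodes


def leaf_clauses(nodes: Dict[int, ClauseNode]) -> List[int]:
    """Clause numbers that receive no dependency from any other clause."""
    return [number for number, node in sorted(nodes.items()) if not node.dependents]


# Assumed word segmentation and clause grouping of the example sentence.
WORDS = ["あらゆる", "現実", "を", "すべて", "自分", "の", "ほう", "へ",
         "捻じ", "曲げ", "た", "の", "だ"]
CLAUSES = [[0], [1, 2], [3], [4, 5], [6, 7], [8], [9, 10, 11, 12]]
# Assumed dependency pairs (source clause, destination clause).
DEPENDENCIES = [(0, 1), (1, 6), (2, 6), (3, 4), (4, 6), (5, 6)]

tree = build_tree(CLAUSES, DEPENDENCIES)
leaf_texts = ["".join(WORDS[w] for w in tree[n].word_numbers)
              for n in leaf_clauses(tree)]
print(leaf_texts)
```

Keeping word numbers on each node means a clause can later be dropped or kept as a unit while the word-level sequence needed for n-gram counting remains recoverable.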

The sentence generation unit 12 can generate a variety of sentences by removing from the tree the clauses that receive no dependency from any clause. In addition, even when a clause does receive a dependency from another clause, if it begins with an auxiliary word it can be removed together with the clause that depends on it, as though it received no dependency. For example, in FIG. 3, the clauses 「あらゆる」 ("every"), 「すべて」 ("all"), 「自分の」 ("one's own"), and 「捻じ」 ("twist") (the dot-hatched clauses in FIG. 3) receive no dependency from any clause. By removing these clauses at random, the sentence generation unit 12 generates, from the original text data
「あらゆる現実をすべて自分のほうへ捻じ曲げたのだ」 (roughly, "twisted and bent every reality, all of it, toward oneself"),
new sentences such as
「あらゆる現実を曲げたのだ」,
「あらゆる現実を捻じ曲げたのだ」,
「あらゆる現実を自分のほうへ捻じ曲げたのだ」,
「あらゆる現実をすべて曲げたのだ」
(S12). Of these four examples, all but the third remove the two clauses 「自分のほうへ」, that is, the removal extends to the clause that begins with an auxiliary word. As a result, the language model learning unit 21 can obtain new word triples such as
「現実−を＋曲げ」, 「現実−を＋捻じ」, and 「現実−を＋自分」,
which are skip-grams that can occur naturally. Because the skip distance is not explicitly fixed to a constant value as in the conventional method, many natural word triples like the three above, which skip six, five, and two words respectively, can be obtained.
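
Step S12 on this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual procedure: the clause segmentation, the `HEAD` map (clause number to the clause it depends on), the set `AUXILIARY_HEADED` marking 「ほうへ」 as beginning with an auxiliary word, and the rule that such a clause is dropped together with its single dependent are all hypothetical choices made for the example.

```python
# Assumed clause segmentation, each clause given as its list of words.
CLAUSES = [["あらゆる"], ["現実", "を"], ["すべて"], ["自分", "の"],
           ["ほう", "へ"], ["捻じ"], ["曲げ", "た", "の", "だ"]]
# Assumed dependencies: clause number -> clause it depends on (root is absent).
HEAD = {0: 1, 1: 6, 2: 6, 3: 4, 4: 6, 5: 6}
# Assumed: clause 4 (ほうへ) begins with an auxiliary word.
AUXILIARY_HEADED = {4}


def removable_units(head, n_clauses, auxiliary_headed):
    """Group clause numbers that may be dropped together."""
    dependents = {i: [] for i in range(n_clauses)}
    for source, destination in head.items():
        dependents[destination].append(source)
    units = []
    for i in range(n_clauses):
        if dependents[i] or i not in head:
            continue  # receives a dependency, or is the root
        unit = [i]
        parent = head[i]
        # Assumption: an auxiliary-headed clause whose only dependent is i goes with it.
        if parent in auxiliary_headed and dependents[parent] == [i]:
            unit.append(parent)
        units.append(unit)
    return units


def remove_units(clauses, units):
    """Return the word sequence left after dropping the given units."""
    removed = {i for unit in units for i in unit}
    return [w for i, clause in enumerate(clauses) if i not in removed for w in clause]


def trigrams(words):
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


units = removable_units(HEAD, len(CLAUSES), AUXILIARY_HEADED)
original = remove_units(CLAUSES, [])
# Drop すべて, 自分の + ほうへ, and 捻じ, keeping あらゆる.
shortened = remove_units(CLAUSES, units[1:])
print("".join(shortened))
print(("現実", "を", "曲げ") in trigrams(shortened))
```

The triple 「現実−を＋曲げ」 is absent from the contiguous trigrams of the original sentence and appears only after the intervening clauses are removed, which is how the skip distance ends up being determined by the dependency structure rather than by a fixed window.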

The principle for generating new sentences is to remove some or all of the clauses that receive no dependency from any clause. If there are n such clauses, the choices range from removing a single clause to removing all n of them, giving a total of 2^n sentences when the original with nothing removed is counted (2^n − 1 new sentences otherwise). Rather than treating all of these clauses equally, the language model learning unit 21 may remove them probabilistically.
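
Both options, exhaustive enumeration and probabilistic removal, can be sketched as below. The clause strings, the grouping of removable clauses in `UNITS` (with 「自分の」 and 「ほうへ」 treated as one unit), and the per-unit removal probabilities are illustrative assumptions; the description does not specify how the probabilities would be chosen.

```python
import random
from itertools import combinations

# Assumed clause segmentation of the example sentence.
CLAUSES = ["あらゆる", "現実を", "すべて", "自分の", "ほうへ", "捻じ", "曲げたのだ"]
# Assumed removable units, as tuples of clause numbers.
UNITS = [(0,), (2,), (3, 4), (5,)]


def render(clauses, chosen_units):
    """Concatenate the clauses that are not in any chosen unit."""
    removed = {i for unit in chosen_units for i in unit}
    return "".join(c for i, c in enumerate(clauses) if i not in removed)


def enumerate_sentences(clauses, units):
    """Every sentence obtained by removing a non-empty subset of the units."""
    sentences = []
    for r in range(1, len(units) + 1):
        for chosen in combinations(units, r):
            sentences.append(render(clauses, chosen))
    return sentences


def sample_sentence(clauses, units, removal_prob, rng):
    """Remove each unit independently with its own probability."""
    chosen = [u for u in units if rng.random() < removal_prob[u]]
    return render(clauses, chosen)


all_new = enumerate_sentences(CLAUSES, UNITS)
print(len(all_new))  # non-empty subsets of 4 units

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
probs = {(0,): 0.1, (2,): 0.8, (3, 4): 0.5, (5,): 0.5}  # hypothetical values
print(sample_sentence(CLAUSES, UNITS, probs, rng))
```

Enumeration grows exponentially in the number of removable units, so sampling with per-unit probabilities is one way to bound the amount of generated data for long sentences.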

According to the learning data generation device 1 and the language model learning device 2 of this embodiment, learning data and word n-tuples that are plausible as natural sentences can be created on the basis of the dependency information of the text data, using a way of skipping that differs from the conventional skip-gram. As a result, more word n-tuples, and more natural ones, can be obtained from the same given text data than before and used for language model learning. Furthermore, with the learning data generated by the learning data generation device 1 of this embodiment, not only the word n-grams but also the part-of-speech n-grams built from the parts of speech of the resulting words are natural, so the data can also be used for learning other language models.
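
A minimal sketch of how word n-grams and part-of-speech n-grams might be counted over the original sentence plus one generated sentence is shown below. The part-of-speech tags are coarse, hypothetical labels chosen for illustration (they are not the output of any particular analyzer), and simple counting stands in for whatever estimation the language model learning unit 21 actually uses.

```python
from collections import Counter


def ngrams(items, n):
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# (word, part of speech) pairs; the tags are illustrative assumptions.
ORIGINAL = [("あらゆる", "ADN"), ("現実", "NOUN"), ("を", "PART"),
            ("すべて", "ADV"), ("自分", "NOUN"), ("の", "PART"),
            ("ほう", "NOUN"), ("へ", "PART"), ("捻じ", "VERB"),
            ("曲げ", "VERB"), ("た", "AUX"), ("の", "PART"), ("だ", "AUX")]

# Word numbers of the removed clauses すべて / 自分の / ほうへ / 捻じ.
REMOVED_WORDS = set(range(3, 9))
generated = [token for i, token in enumerate(ORIGINAL) if i not in REMOVED_WORDS]


def count_ngrams(sentences, n, field):
    """Count n-grams over one field of each token (0 = word, 1 = part of speech)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams([token[field] for token in sentence], n))
    return counts


word_counts = count_ngrams([ORIGINAL, generated], 3, 0)
pos_counts = count_ngrams([ORIGINAL, generated], 3, 1)

print(word_counts[("現実", "を", "曲げ")])
print(pos_counts[("NOUN", "PART", "VERB")])
```

Because the generated sentence keeps each word paired with its tag, the same removal step yields both kinds of n-gram without re-running analysis on the shortened sentence.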

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed in the processing of that program (the program is not limited to the external storage device and may, for example, be stored in ROM, which is a read-only storage device). Data obtained through the processing of the program is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for processing each program are read into memory as necessary and interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as necessary.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiment are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on a computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process in accordance with the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process in accordance with it, or the computer may execute processes in accordance with the received program successively each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and the retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A learning data generation device comprising: a tree structure generation unit that reads a predetermined length of text data divided into words and dependency data between clauses of the text data, and generates a tree structure representing the dependency relations of the text data of the predetermined length; and
a sentence generation unit that, based on the tree structure, generates a new sentence by removing some or all of the clauses that receive no dependency from any other clause.

A language model learning device comprising: a tree structure generation unit that reads a predetermined length of text data divided into words and dependency data between clauses of the text data, and generates a tree structure representing the dependency relations of the text data of the predetermined length;
a sentence generation unit that, based on the tree structure, generates a new sentence by removing some or all of the clauses that receive no dependency from any other clause; and
a language model learning unit that learns a language model using the generated sentence.

A learning data generation method executed by a learning data generation device,
wherein the learning data generation device executes:
a step of reading a predetermined length of text data divided into words and dependency data between clauses of the text data, and generating a tree structure representing the dependency relations of the text data of the predetermined length; and
a step of generating, based on the tree structure, a new sentence by removing some or all of the clauses that receive no dependency from any other clause.

A program that causes a computer to function as the learning data generation device according to claim 1.

A program for causing a computer to function as the language model learning device according to claim 2.