JP6426971B2 - Learning data generation device and program thereof

Info

Publication number: JP6426971B2
Application number: JP2014211298A
Authority: JP
Inventors: 貴裕奥; 庄衛佐藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2014-10-16
Filing date: 2014-10-16
Publication date: 2018-11-21
Anticipated expiration: 2034-10-16
Also published as: JP2016080832A

Description

The present invention relates to a learning data generation device, and a program therefor, that uses quasi-supervised learning to generate the learning data needed to adapt an acoustic model used for speech recognition of broadcast programs.

At present, subtitles for some sports programs and information programs are produced by the respeak method. In this method, a respeaker dedicated to subtitle production, called a subtitle caster, repeats the program audio, and the repeated speech is recognized to produce the subtitles (see, for example, Non-Patent Document 1). The respeak method demands a special repetition skill, and it takes time because the subtitles are produced via the respeaker. A technique that can recognize program audio in real time without relying on the respeak method is therefore desired.

Realizing this requires an acoustic model that can accurately recognize the speech of broadcast programs of various genres, such as sports programs and information programs. A large-scale speech language corpus is needed as the learning data for building such an acoustic model, and the corpus must be highly accurate if the resulting acoustic model is to reach a practical level.

Quasi-supervised (lightly supervised) learning has previously been proposed as a technique for generating a speech language corpus (see, for example, Non-Patent Document 2). The technique described in Non-Patent Document 2 aligns the speech recognition result of the program audio with the subtitle text, determines for each utterance section whether the speech recognition result and the subtitle text match, and extracts the matching utterance sections. It then uses the speech data and subtitle text corresponding to the extracted utterance sections to train the acoustic model.

Non-Patent Document 1: Matsui et al., "Real-time subtitle production for sports broadcasts by the respeak method using paraphrasing," IEICE Transactions D-II (Information and Systems, II: Pattern Processing), Vol. 87, No. 2, pp. 427-435, 2004-02-01
Non-Patent Document 2: Lamel et al., "Lightly Supervised and Unsupervised Acoustic Model Training," Computer Speech and Language, Vol. 6, pp. 115-129, 2002

However, because the technique described in Non-Patent Document 2 targets news programs, it cannot generate the necessary amount of learning data when applied to broadcast programs of other genres.
Specifically, information programs often contain background music and noise, and performers other than announcers often do not speak precisely. As a result, even with an acoustic model trained on news programs, the speech recognition accuracy for information programs is low, and the sections in which the words of the speech recognition result and the subtitle text match become fewer. The technique described in Non-Patent Document 2 therefore cannot generate the necessary amount of learning data.

An object of the present invention is to provide a learning data generation device, and a program therefor, that can generate a larger amount of highly accurate learning data.

In view of the above problem, the learning data generation device according to the present invention generates, by quasi-supervised learning, the learning data needed to adapt an acoustic model used for speech recognition of broadcast programs, and comprises third language model generation means, speech recognition means, alignment means, substitution means, and learning data generation means.

With this configuration, the learning data generation device uses the third language model generation means to generate a third language model by linearly interpolating a first language model, generated in advance from a text corpus, with a second language model, generated in advance from the subtitle text of broadcast programs.

The learning data generation device uses the speech recognition means to recognize the speech of a broadcast program with the third language model and an acoustic model generated in advance. The device then uses the alignment means to perform alignment, which associates the words of the speech recognition text (the text representing the speech recognition result of the broadcast program) with the words of the subtitle text in time order.

Here, speech recognition is considered to be less accurate than subtitle production. Moreover, when a word in the speech recognition text differs from the word associated with it in the subtitle text while the word sequences before and after that word match, it is very likely that the word in the speech recognition text was misrecognized.

The learning data generation device therefore uses the substitution means to determine, for each pair of words associated between the speech recognition text and the subtitle text, whether the word is a substitution target, that is, whether the two words differ and the word sequences of a preset number of words before and after them match. When the word is a substitution target, the substitution means replaces the word in the speech recognition text with the word in the subtitle text.

In this way, even when words in the speech recognition text and the subtitle text do not match because of low speech recognition accuracy, the learning data generation device replaces the words in the speech recognition text and can thereby increase the sections in which the words of the two texts match.

The learning data generation device uses the learning data generation means to determine, for each utterance section of the broadcast program, whether the speech recognition text after substitution matches the subtitle text, and attaches the words of the subtitle text corresponding to the utterance section, as a label, to the speech data of each utterance section determined to match. Because the word-matching sections between the speech recognition text and the subtitle text have increased, the number of utterance sections determined to match also increases.

The present invention provides the following advantageous effects.
The learning data generation device according to the present invention replaces words in the speech recognition text even when the words of the speech recognition text and the subtitle text do not match because of low speech recognition accuracy. This increases the word-matching sections between the speech recognition text and the subtitle text, so the device can generate a larger amount of highly accurate learning data.

FIG. 1 is a block diagram showing the configuration of an acoustic model generation device according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram illustrating word substitution in the acoustic model generation device of FIG. 1.
FIG. 3 is a flowchart showing the operation of the acoustic model generation device of FIG. 1.
FIG. 4 is a block diagram showing the configuration of an acoustic model generation device according to a second embodiment of the present invention.
FIG. 5 is a flowchart showing the operation of the acoustic model generation device of FIG. 4.
FIG. 6 is a graph showing the relationship between the number of words and the number of differing patterns in Example 1 of the present invention.
FIG. 7 is a graph showing the relationship between the number of adaptations and the speech language corpus for "Close-up Gendai" (クローズアップ現代) in Examples 2 and 3 and the comparative example.
FIG. 8 is a graph showing the relationship between the number of adaptations and the speech language corpus for "Marutoku Magazine" (まる得マガジン) in Examples 2 and 3 and the comparative example.
FIG. 9 is a graph showing the relationship between the number of adaptations and the speech language corpus for "Science ZERO" (サイエンスＺＥＲＯ) in Examples 2 and 3 and the comparative example.

Embodiments of the present invention are described in detail below with reference to the drawings as appropriate. In each embodiment, means having the same function are given the same reference numeral, and their description is not repeated.

(First Embodiment)
[Configuration of acoustic model generation device]
The configuration of an acoustic model generation device (learning data generation device) 1 according to the first embodiment of the present invention is described with reference to FIG. 1.
The acoustic model generation device 1 generates, by quasi-supervised learning, the learning data needed to adapt an acoustic model, and adapts (generates) the acoustic model using the generated learning data.
This acoustic model is not limited to news programs and can be used for speech recognition of broadcast programs of various genres, such as sports programs and information programs.

As shown in FIG. 1, the acoustic model generation device 1 comprises adapted language model generation means (third language model generation means) 10, speech recognition means 20, alignment means 30, substitution means 40, learning data generation means 50, and acoustic model adaptation means 60.

The adapted language model generation means 10 generates an adapted language model (third language model) by interpolating a baseline language model (first language model) and a domain language model (second language model).

The baseline language model is a language model generated in advance from a large-scale text corpus.
The domain language model is a language model generated in advance from the subtitle text attached to broadcast programs.

The adapted language model generation means 10 receives the baseline language model and the domain language model as input. It then linearly interpolates the two input models to generate the adapted language model, weighting the domain language model more heavily than the baseline language model.

For example, suppose that the baseline language model and the domain language model are trigram language models. Suppose also that both models contain an entry for the same word sequence 「今日」 ("today"), 「は」 (topic particle "wa"), 「私」 ("I"), with scores (probabilities) of 7.0 and 5.0 respectively, as listed below. If the interpolation coefficient (weighting coefficient) of the domain language model is 0.9 and that of the baseline language model is 0.1, the result is as follows.

<Example of each language model>
Baseline language model: 「今日」, 「は」, 「私」, score 7.0
Domain language model: 「今日」, 「は」, 「私」, score 5.0
Adapted language model: 「今日」, 「は」, 「私」, score 5.2

In this example, the adapted language model generation means 10 multiplies the baseline language model score 7.0 by the baseline interpolation coefficient 0.1 to obtain 0.7, and multiplies the domain language model score 5.0 by the domain interpolation coefficient 0.9 to obtain 4.5. It then adds the two products to obtain the score 5.2 and adds an entry for the word sequence 「今日」, 「は」, 「私」 with score 5.2 to the adapted language model.
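The arithmetic of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary representation of a language model are hypothetical, and the sketch interpolates only the n-grams present in both models, because the text does not describe how entries found in one model only are handled.

```python
def interpolate_score(baseline_score, domain_score,
                      baseline_weight=0.1, domain_weight=0.9):
    """Linearly interpolate two scores with the given coefficients."""
    return baseline_weight * baseline_score + domain_weight * domain_score


def interpolate_shared_entries(baseline_lm, domain_lm,
                               baseline_weight=0.1, domain_weight=0.9):
    """Interpolate the entries present in both models.

    Each model is a dict mapping an n-gram tuple to a score
    (a hypothetical representation chosen for this sketch).
    """
    shared = baseline_lm.keys() & domain_lm.keys()
    return {
        ngram: interpolate_score(baseline_lm[ngram], domain_lm[ngram],
                                 baseline_weight, domain_weight)
        for ngram in shared
    }


baseline_lm = {("今日", "は", "私"): 7.0}
domain_lm = {("今日", "は", "私"): 5.0}
adapted_lm = interpolate_shared_entries(baseline_lm, domain_lm)
```

With the values from the example, the adapted entry comes out at approximately 5.2 (0.7 + 4.5), up to floating-point rounding.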

The adapted language model generation means 10 then outputs the generated adapted language model to the speech recognition means 20.
The baseline language model, the domain language model, and the adapted language model are not limited to the example above, nor are the interpolation coefficients.

The speech recognition means 20 performs speech recognition of a broadcast program using the adapted language model input from the adapted language model generation means 10 and a baseline acoustic model. The speech recognition means 20 receives speech data in which the audio of the broadcast program is recorded, together with the baseline acoustic model generated in advance. Using any speech recognition decoder, such as a one-pass or two-pass decoder, it recognizes the speech data utterance section by utterance section and generates a speech recognition text representing the recognition result.

The speech recognition means 20 then outputs the generated speech recognition text and the speech data (not shown) to the alignment means 30.
In the iterative processing described later, the speech recognition means 20 updates the baseline acoustic model with the adapted acoustic model input from the acoustic model adaptation means 60, and performs speech recognition of the broadcast program using this adapted acoustic model and the adapted language model.

The alignment means 30 aligns the speech recognition text input from the speech recognition means 20 with the subtitle text.
Alignment means associating the words contained in the speech recognition text and the subtitle text in time order.

The alignment means 30 receives the subtitle text attached to the broadcast program. It associates the words contained in the speech recognition text with the words contained in the subtitle text in time order, and then outputs the aligned speech recognition text and subtitle text, together with the speech data, to the substitution means 40.
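The text does not specify the alignment algorithm. As a rough illustration only, the sketch below pairs the words of two word lists in order using Python's standard `difflib.SequenceMatcher`; this is a swapped-in technique chosen for brevity, not the method of the patent, and the function name `align_words` is hypothetical. It also ignores timestamps and relies on word order alone.

```python
from difflib import SequenceMatcher


def align_words(asr_words, subtitle_words):
    """Pair ASR words with subtitle words in order.

    Returns a list of (asr_word, subtitle_word) tuples; None marks a gap
    where one side has no counterpart.
    """
    pairs = []
    matcher = SequenceMatcher(a=asr_words, b=subtitle_words, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        asr_chunk = asr_words[i1:i2]
        sub_chunk = subtitle_words[j1:j2]
        # Pair the chunk position by position, padding the shorter side.
        for k in range(max(len(asr_chunk), len(sub_chunk))):
            asr_word = asr_chunk[k] if k < len(asr_chunk) else None
            sub_word = sub_chunk[k] if k < len(sub_chunk) else None
            pairs.append((asr_word, sub_word))
    return pairs


pairs = align_words(["a", "b", "X", "c", "d"], ["a", "b", "Y", "c", "d"])
```

Here the differing words "X" and "Y" end up in the same pair, flanked by the matching pairs for "a", "b" and "c", "d".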

The substitution means 40 determines, for each pair of words associated between the speech recognition text and the subtitle text input from the alignment means 30, whether the word is a substitution target, that is, whether the two words differ and the word sequences before and after them match. When the word is a substitution target, the substitution means 40 replaces the word in the speech recognition text with the word in the subtitle text.

<Word substitution>
Word substitution by the substitution means 40 is described with reference to FIG. 2 (see also FIG. 1 as appropriate).
In FIG. 2, word a to word d, word X, and word Y contained in the speech recognition text 100 and the subtitle text 200 are shown as "a" to "d", "X", and "Y". The words a, ..., b and the words c, ..., d are each a word sequence of N consecutive words. The words from word a to word b and from word c to word d are assumed to match between the speech recognition text 100 and the subtitle text 200.

As shown in FIG. 2, suppose that the words a, ..., b and the words c, ..., d have been associated between the speech recognition text 100 and the subtitle text 200, and that word X of the speech recognition text 100 has been associated with word Y of the subtitle text 200.

The number of words N is preset in the substitution means 40 to an arbitrary value. N is preferably set to 5, which suppresses alignment errors while increasing the amount of learning data (see Example 1).

The substitution means 40 determines, in order from the beginning of the speech recognition text 100 and the subtitle text 200, whether the associated words match. First, because word a of the speech recognition text 100 matches word a of the subtitle text 200, the substitution means 40 does not treat word a as a substitution target. The same holds for every word up to word b.

Next, the substitution means 40 determines that word X of the speech recognition text 100 and word Y of the subtitle text 200 do not match, because they are different words. Here, the same N consecutive words a, ..., b precede both word X in the speech recognition text 100 and word Y in the subtitle text 200, and the same N consecutive words c, ..., d follow both of them. The substitution means 40 therefore determines that the N-word sequences before and after word X and word Y match. Accordingly, it treats word X of the speech recognition text 100 as a substitution target and replaces it with word Y of the subtitle text 200.

In other words, when the word under examination differs between the two texts and the word sequences before and after it match, the substitution means 40 determines that the word in the speech recognition text 100 was misrecognized and replaces it with the word in the subtitle text 200.
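The substitution rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name and the representation of the aligned texts as a list of (speech recognition word, subtitle word) pairs are assumptions, as is the choice to compare context against the original (unsubstituted) speech recognition words and to skip mismatches that have fewer than N words of context on either side.

```python
def substitute_misrecognized(pairs, n=5):
    """Apply the context-based substitution rule to aligned word pairs.

    pairs: list of (asr_word, subtitle_word); None marks an alignment gap.
    n:     number of matching words required both before and after.
    Returns the speech recognition words after substitution.
    """
    asr = [a for a, _ in pairs]
    sub = [s for _, s in pairs]
    corrected = list(asr)
    for i in range(len(pairs)):
        # Only aligned word pairs that differ are candidates.
        if asr[i] is None or sub[i] is None or asr[i] == sub[i]:
            continue
        # Require n words of context on both sides (assumption of this sketch).
        if i - n < 0 or i + n >= len(pairs):
            continue
        before_ok = all(asr[k] == sub[k] for k in range(i - n, i))
        after_ok = all(asr[k] == sub[k] for k in range(i + 1, i + n + 1))
        if before_ok and after_ok:
            corrected[i] = sub[i]  # replace with the subtitle word
    return corrected


aligned = [("a", "a"), ("b", "b"), ("X", "Y"), ("c", "c"), ("d", "d")]
corrected = substitute_misrecognized(aligned, n=2)
```

With two matching words on each side and N = 2, "X" is replaced by "Y"; with a larger N the same input is left unchanged because the required context is not available.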

Next, because the words c, ..., d match between the speech recognition text 100 and the subtitle text 200, the substitution means 40 does not treat them as substitution targets.
The substitution means 40 then outputs the speech recognition text 100 after substitution, the subtitle text 200, and the speech data to the learning data generation means 50.

To generate the learning data, the learning data generation means 50 determines, for each utterance section, whether the speech recognition text 100 input from the substitution means 40 matches the subtitle text 200.

The learning data generation means 50 detects the utterance sections of the speech data and the speech recognition text 100 input from the substitution means 40, and uses them as the unit for comparing the speech recognition text 100 with the subtitle text 200. It makes the determination for each detected utterance section and generates learning data by attaching the words of the subtitle text corresponding to the utterance section, as a label, to the speech data of each utterance section determined to match.

For example, suppose that in FIG. 2 the words from word a to word d belong to the same utterance section. Because word X of the speech recognition text 100 has been replaced with word Y, the learning data generation means 50 determines that the utterance section from word a to word d matches between the speech recognition text 100 and the subtitle text 200, and generates learning data from this utterance section.
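The per-section check can be sketched as below. The representation is an assumption made for illustration: the two texts are taken to be aligned word lists of equal length, and each utterance section is given as a hypothetical (audio identifier, start index, end index) triple with an exclusive end index. How utterance sections are actually detected is not covered here.

```python
def generate_learning_data(sections, corrected_asr, subtitle):
    """Keep only the utterance sections whose words match exactly.

    sections:      list of (audio_id, start, end) word-index ranges.
    corrected_asr: speech recognition words after substitution.
    subtitle:      subtitle words aligned to the same indices.
    Returns a list of (audio_id, label_words) pairs.
    """
    learning_data = []
    for audio_id, start, end in sections:
        if corrected_asr[start:end] == subtitle[start:end]:
            # Label the section's audio with the subtitle words.
            learning_data.append((audio_id, subtitle[start:end]))
    return learning_data


subtitle = ["a", "b", "Y", "c", "d", "e", "f"]
corrected_asr = ["a", "b", "Y", "c", "d", "Q", "f"]
sections = [("utt1", 0, 5), ("utt2", 5, 7)]
data = generate_learning_data(sections, corrected_asr, subtitle)
```

In this toy input the first section matches after substitution and is kept, while the second still contains a mismatch and is discarded.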

The learning data generation means 50 then outputs the generated learning data to the acoustic model adaptation means 60. It may also output the generated learning data as a speech language corpus.

Returning to FIG. 1, the description of the configuration of the acoustic model generation device 1 continues.
The acoustic model adaptation means 60 adapts the acoustic model using the learning data input from the learning data generation means 50. For example, the acoustic model adaptation means 60 can use a hidden Markov model (HMM) as the acoustic model, and may use MAP (maximum a posteriori) estimation as the adaptation technique.

Because using the adapted acoustic model improves recognition accuracy, the acoustic model adaptation means 60 also determines whether to perform iterative processing. Specifically, it increments the number of times the acoustic model has been adapted (the adaptation count) and determines whether this count is less than or equal to a preset threshold.

When the adaptation count is less than or equal to the threshold, the acoustic model adaptation means 60 determines that iterative processing is to be performed and outputs the adapted acoustic model to the speech recognition means 20.
When the adaptation count exceeds the threshold, the acoustic model adaptation means 60 determines that iterative processing is not to be performed, outputs the adapted acoustic model to the outside, and ends the processing.

In the iterative processing, each means of the acoustic model generation device 1 performs the same processing as before, except that the speech recognition means 20 uses the adapted acoustic model instead of the baseline acoustic model; further description is therefore omitted.

The means of the acoustic model generation device 1 other than the substitution means 40 are described in Reference 1 below, so further description is omitted.
Reference 1: Lamel et al., "Lightly Supervised and Unsupervised Acoustic Model Training," Computer Speech and Language, Vol. 6, pp. 115-129, 2002

[Operation of acoustic model generation device]
The operation of the acoustic model generation device 1 is described with reference to FIG. 3 (see also FIG. 1 as appropriate).
The acoustic model generation device 1 uses the adapted language model generation means 10 to generate the adapted language model by interpolating the baseline language model and the domain language model (step S1).

The acoustic model generation device 1 uses the speech recognition means 20 to perform speech recognition of the broadcast program with the adapted language model generated in step S1 and the baseline acoustic model (step S2).
The acoustic model generation device 1 uses the alignment means 30 to align the speech recognition text generated in step S2 with the subtitle text (step S3).

The acoustic model generation device 1 uses the substitution means 40 to determine, for each word aligned in step S3, whether the word is a substitution target, that is, whether the words of the speech recognition text and the subtitle text differ and the N-word sequences before and after them match. When the word is a substitution target, the substitution means 40 replaces the word in the speech recognition text with the word in the subtitle text (step S4).

The acoustic model generation device 1 uses the learning data generation means 50 to determine, for each utterance section, whether the speech recognition text after the substitution in step S4 matches the subtitle text. The learning data generation means 50 then generates learning data by attaching the words of the subtitle text corresponding to the utterance section, as a label, to the speech data of each utterance section determined to match (step S5).

The acoustic model generation device 1 uses the acoustic model adaptation means 60 to adapt the acoustic model with the learning data generated in step S5 and to increment the adaptation count (step S6).
The acoustic model generation device 1 uses the acoustic model adaptation means 60 to determine whether to perform iterative processing, based on whether the adaptation count is less than or equal to the threshold (step S7).

When the processing is to be repeated (Yes in step S7), the acoustic model generation device 1 returns to step S2. In step S2, the speech recognition means 20 performs speech recognition of the broadcast program using the acoustic model adapted in step S6 instead of the baseline acoustic model. The acoustic model generation device 1 then continues with step S3 and the subsequent steps.
When the processing is not to be repeated (No in step S7), the acoustic model adaptation means 60 outputs the acoustic model adapted in step S6, and the processing ends.
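The control flow of steps S2 to S7 can be sketched as a loop. The callables `recognize`, `build_data`, and `adapt` are hypothetical placeholders for the speech recognition, the alignment through learning data generation, and the adaptation; only the loop structure is taken from the description.

```python
from typing import Callable, Tuple, TypeVar

M = TypeVar("M")  # acoustic model
T = TypeVar("T")  # speech recognition text
D = TypeVar("D")  # learning data


def adapt_iteratively(recognize: Callable[[M], T],
                      build_data: Callable[[T], D],
                      adapt: Callable[[M, D], M],
                      baseline_model: M,
                      threshold: int) -> Tuple[M, int]:
    """Repeat recognition and adaptation while the count is at or below the threshold."""
    model = baseline_model
    count = 0
    while True:
        text = recognize(model)     # step S2: uses the latest acoustic model
        data = build_data(text)     # steps S3 to S5
        model = adapt(model, data)  # step S6: adapt the model ...
        count += 1                  # ... and increment the adaptation count
        if count <= threshold:      # step S7: Yes, so return to step S2
            continue
        return model, count         # step S7: No, so output the adapted model
```

Because the count is incremented before the comparison, a literal reading of "equal to or less than the threshold" yields threshold + 1 adaptations in this sketch.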

As described above, the acoustic model generation device 1 according to the first embodiment of the present invention replaces words of the speech recognition text even when the words of the speech recognition text and the subtitle text do not match because of low speech recognition accuracy. As a result, the sections in which the words of the speech recognition text and the subtitle text match increase, and the acoustic model generation device 1 can generate a larger amount of highly accurate learning data.

(Second Embodiment)
[Configuration of acoustic model generation device]
With reference to FIG. 4, the configuration of the acoustic model generation device 1B according to the second embodiment of the present invention is described in terms of its differences from the first embodiment (see FIG. 1 as appropriate).
The second embodiment differs from the first embodiment in that the learning data and the speech language corpus are treated as different data.

As shown in FIG. 4, the acoustic model generation device 1B includes an adapted language model generation means 10, a speech recognition means 20, an alignment means 30B, a replacement means 40, a learning data generation means 50B, an acoustic model adaptation means 60, and a speech language corpus generation means 70.
The means other than the alignment means 30B, the learning data generation means 50B, and the speech language corpus generation means 70 are the same as in the first embodiment, so their description is omitted.

The alignment means 30B outputs the aligned speech recognition text and subtitle text to the replacement means 40 and the speech language corpus generation means 70. In other respects, the alignment means 30B is the same as in the first embodiment, so its description is omitted.
The learning data generation means 50B is the same as in the first embodiment except that it does not output a speech language corpus, so its description is omitted.

The speech language corpus generation means 70 determines, for each utterance section, whether the speech recognition text input from the alignment means 30B matches the subtitle text. The speech language corpus generation means 70 then generates a speech language corpus by attaching, as labels, the words of the subtitle text corresponding to each utterance section judged to match to the speech data of that utterance section.

When generating learning data, the learning data generation means 50 of FIG. 1 uses the speech recognition text in which words have been replaced (that is, the speech recognition text input from the replacement means 40). In contrast, when generating the speech language corpus, the speech language corpus generation means 70 uses the speech recognition text in which no words have been replaced (that is, the speech recognition text input from the alignment means 30B).
In other respects, the speech language corpus generation means 70 is the same as the learning data generation means 50 of FIG. 1, so its description is omitted.
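The difference between the two outputs of the second embodiment can be made concrete with a small sketch. The section records and their keys (`audio_id`, `asr_aligned`, `asr_replaced`, `subtitle`) are hypothetical; the only point taken from the description is that the learning data is selected with the replaced text and the speech language corpus with the unreplaced text.

```python
from typing import Dict, List, Tuple

Section = Dict[str, object]


def select_matching(sections: List[Section], text_key: str) -> List[Tuple[object, List[str]]]:
    """Label the audio of every section whose chosen recognition text equals the subtitles."""
    return [(s["audio_id"], list(s["subtitle"]))
            for s in sections
            if s[text_key] == s["subtitle"]]


sections = [
    {"audio_id": "u1", "asr_aligned": ["a", "b"], "asr_replaced": ["a", "b"], "subtitle": ["a", "b"]},
    {"audio_id": "u2", "asr_aligned": ["c", "x"], "asr_replaced": ["c", "d"], "subtitle": ["c", "d"]},
]
learning_data = select_matching(sections, "asr_replaced")  # replaced text, used for adaptation
corpus = select_matching(sections, "asr_aligned")          # unreplaced text, output as the corpus
```

In this toy input, section `u2` matches only after replacement, so it enters the learning data but not the corpus.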

[Operation of acoustic model generation device]
The operation of the acoustic model generation device 1B is described with reference to FIG. 5 (see FIGS. 3 and 4 as appropriate).
Steps S1 to S7 in FIG. 5 are the same as the corresponding steps in FIG. 3, so their description is omitted.

The acoustic model generation device 1B causes the speech language corpus generation means 70 to determine, for each utterance section, whether the speech recognition text aligned in step S3 matches the subtitle text. The speech language corpus generation means 70 then generates a speech language corpus by attaching, as labels, the words of the subtitle text corresponding to each utterance section judged to match to the speech data of that utterance section (step S8).
Step S8 need not be performed after step S5; it may be performed at any point after step S3 and before step S7.

As described above, in the acoustic model generation device 1B according to the second embodiment of the present invention, as in the first embodiment, the sections in which the words of the speech recognition text and the subtitle text match increase, so a larger amount of highly accurate learning data can be generated.

(Example 1)
The setting of the number of words N is described below as Example 1.
The subtitle text is assumed to be sufficiently accurate and unlikely to contain errors.

When the subtitle text contains several similar word chains, the words associated by the alignment may be shifted. If the number of words N is set to a small value such as 1 or 2, the misalignment is not resolved, and a word of the speech recognition text may be replaced with the wrong word of the subtitle text. If the number of words N is set to a large value, the misalignment is resolved, but fewer words are judged to be replacement targets, and utterance sections that could be used as utterance labels may go undetected.

As described above, an appropriate number of words N must be set in order to accurately detect, among the sections in which the speech recognition text and the subtitle text do not match, the sections (words) that should be replaced with the subtitle text. Therefore, the number of patterns in which the chains of N words before and after a certain word match while the word itself differs was surveyed in broadcast programs. If many such patterns occur within a single broadcast, the possibility of misalignment remains, so highly accurate learning data cannot be expected.
In the following, "a pattern in which the chains of N words before and after a certain word match while the word itself differs" is abbreviated as a "differing pattern".
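One way to count differing patterns in the text of a single broadcast is sketched below. The source does not give the exact counting procedure, so the following is an assumption: each distinct context (N words before, N words after) that surrounds two or more different center words counts as one differing pattern. The name `count_differing_patterns` is hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Context = Tuple[Tuple[str, ...], Tuple[str, ...]]


def count_differing_patterns(words: List[str], n: int) -> int:
    """Count contexts of N words on each side that enclose more than one distinct word."""
    centers: DefaultDict[Context, Set[str]] = defaultdict(set)
    # Only positions with a full N-word context on both sides are considered.
    for i in range(n, len(words) - n):
        context = (tuple(words[i - n:i]), tuple(words[i + 1:i + 1 + n]))
        centers[context].add(words[i])
    return sum(1 for seen in centers.values() if len(seen) > 1)
```

For the toy sequence `a b c d e a b x d e`, the context `(a b, d e)` surrounds both `c` and `x`, so N = 2 yields one differing pattern, while N = 3 yields none.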

The broadcast programs surveyed were 100 broadcasts of "Close-up Gendai" (broadcast time: 26 minutes), "Marutoku Magazine" (broadcast time: 5 minutes), and "Science ZERO" (broadcast time: 30 minutes). The number of differing patterns contained in each surveyed broadcast program was counted while the value of the number of words N was varied.

The survey results are shown in FIG. 6. The horizontal axis of FIG. 6 represents the number of words N, and the vertical axis represents the average number of differing patterns per broadcast. In FIG. 6, '■' represents the results for "Close-up Gendai", '◆' represents the results for "Marutoku Magazine", and '▲' represents the results for "Science ZERO".

In FIG. 6, the number of words N should be set to the smallest value for which the number of differing patterns is 0. For all three surveyed broadcast programs, the number of differing patterns was 0 at N = 5. From this, it is considered that misalignment no longer occurs when the number of words N is set to 5.

(Examples 2 and 3)
An experiment on generating a speech language corpus is described below.
A speech language corpus was generated using the acoustic model generation device 1 of FIG. 1, the acoustic model generation device 1B of FIG. 4, and the method described in Reference 1, and the generated speech language corpus was verified. In the following, the acoustic model generation device 1 of FIG. 1 is referred to as Example 2, the acoustic model generation device 1B of FIG. 4 as Example 3, and the method described in Reference 1 as the comparative example.

In Examples 2 and 3 and the comparative example, learning data was generated from two hours each of speech recognition text and subtitle text for "Close-up Gendai", "Marutoku Magazine", and "Science ZERO". These three broadcast programs were broadcast between February and June 2014, at different times from the broadcasts used in Example 1.

"Close-up Gendai" is a live news program. Its subtitles are produced by the speed word processor method; they often transcribe the program caster's speech as it is and contain a few errors.
"Marutoku Magazine" is an offline information program, and "Science ZERO" is an educational program. The subtitles of "Marutoku Magazine" and "Science ZERO" are produced in advance.

The adapted language model was generated for each broadcast from a baseline language model with a vocabulary size of 100 kilobytes, learned from transcriptions of broadcast programs, and a domain language model learned from the subtitle text. The interpolation coefficients of the baseline language model and the domain language model were 0.1 and 0.9, respectively.
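The interpolation with coefficients 0.1 and 0.9 can be illustrated on a toy example. The sketch assumes linear interpolation (as stated in the claims) applied to unigram probability tables represented as dictionaries; the actual language models are not unigram tables, and the function name and the toy probabilities are hypothetical.

```python
from typing import Dict


def interpolate(p_baseline: Dict[str, float],
                p_domain: Dict[str, float],
                w_baseline: float = 0.1,
                w_domain: float = 0.9) -> Dict[str, float]:
    """Linearly interpolate two word-probability tables over their joint vocabulary."""
    vocab = set(p_baseline) | set(p_domain)
    return {w: w_baseline * p_baseline.get(w, 0.0) + w_domain * p_domain.get(w, 0.0)
            for w in vocab}


# Toy probabilities chosen only for illustration.
baseline = {"news": 0.5, "sports": 0.5}
domain = {"news": 0.2, "science": 0.8}
adapted = interpolate(baseline, domain)
```

With a weight of 0.9 on the domain model, words that appear only in the subtitle text dominate the adapted model, while words known only to the baseline model keep a small probability.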

The two-pass decoder described in Reference 2 below was used as the speech recognition decoder. This two-pass decoder performs speech recognition using gender-dependent HMMs while determining whether the speaker is male or female.
Reference 2: Imai et al., Improvement of Speech Recognition Technology for Real-time Subtitle Production for Broadcast, 2nd Document Processing Workshop, pp. 113-120, 2008

The baseline acoustic model was learned from news programs broadcast by the Japan Broadcasting Corporation. These news programs contain 340 hours of male speech and 240 hours of female speech. The gender-specific acoustic models are triphone HMMs with five states and three self-loops and, through state sharing, have about 4000 states with 16-mixture distributions. These gender-specific acoustic models were adapted with the learning data extracted from the alignment results of the speech recognition text and the subtitle text.

The method described in Reference 3 below was used to detect utterance sections. This method performs continuous phoneme recognition with male and female gender-dependent acoustic models running in parallel and detects utterance sections from the cumulative phoneme likelihood ratio of speech to non-speech.
Reference 3: T. Imai et al., Online speech detection and dual-gender speech recognition for captioning broadcast news, IEICE Trans. Inf. & Syst., Vol. E90-D, No. 8, pp. 1286-1291, 2007

FIGS. 7 to 9 show the relationship between the number of adaptations of the acoustic model (horizontal axis) and the extraction rate of the speech language corpus (vertical axis). FIG. 7 shows the experimental results for "Close-up Gendai", FIG. 8 those for "Marutoku Magazine", and FIG. 9 those for "Science ZERO". In FIGS. 7 to 9, '▲' represents Example 2, '■' represents Example 3, and '◆' represents the comparative example.

With five adaptations, the extraction rate of Example 2 was at least 1.3 times that of the comparative example for all broadcast programs. The extraction rate of Example 3 was at least 1.2 times that of the comparative example for all broadcast programs.

The accuracy of the utterance labels of the speech language corpus was verified with five adaptations. In Example 2, some words were replaced with erroneous subtitle text, so there were more errors than in Example 3. The utterance label errors in Example 2 were found to be caused by filler words such as "ano" and "ee". In both Examples 2 and 3, the accuracy of the speech language corpus exceeded 99%, which is sufficient for constructing an acoustic model.

The extraction rates of the speech language corpus with five adaptations were also compared across the three broadcast programs. The extraction rate was highest for "Science ZERO", followed by "Marutoku Magazine" and then "Close-up Gendai".

The extraction rate of "Close-up Gendai" is considered to be the lowest because no subtitles were provided immediately before the end of the program. In several broadcasts of this program, the program caster spoke until just before the end, so the speed word processor method could not subtitle all of the program audio.
The speed word processor method is a subtitle production method that uses a special high-speed input keyboard on which several keys are pressed simultaneously.

In addition, background music occupied a larger share of the broadcast time in "Marutoku Magazine" than in "Science ZERO". This is considered to be why the extraction rate of "Science ZERO" was higher than that of "Marutoku Magazine".

From these results, to raise the extraction rate of the speech language corpus, the broadcast program is preferably (1) an offline subtitled program in which the program audio is subtitled through to the end of the program and (2) a program with little background music.

The embodiments and examples of the present invention have been described in detail above, but the present invention is not limited to these embodiments and examples and also covers design changes and the like that do not depart from the gist of the present invention.

In the embodiments described above, the baseline language model, the domain language model, and the baseline acoustic model are input from outside, but the present invention is not limited to this. For example, the acoustic model generation device may include a database that stores and manages the language models and acoustic models and may adapt the acoustic model by referring to this database.

In the embodiments described above, the acoustic model generation device (learning data generation device) is described as independent hardware, but the present invention is not limited to this. For example, the present invention can also be realized by a learning data generation program that causes hardware resources of a computer, such as a CPU, memory, and hard disk, to operate cooperatively as a learning data generation device. This program may be distributed via a communication line or may be written to a recording medium such as a CD-ROM or flash memory and distributed.

1, 1B acoustic model generation device (learning data generation device)
10 adapted language model generation means (third language model generation means)
20 speech recognition means
30, 30B alignment means
40 replacement means
50, 50B learning data generation means
60 acoustic model adaptation means
70 speech language corpus generation means

Claims

A learning data generation apparatus that generates, by quasi-supervised learning, learning data necessary for adaptation of an acoustic model used for speech recognition of a broadcast program, the apparatus comprising:
third language model generation means for generating a third language model by linearly interpolating a first language model generated in advance from a text corpus and a second language model generated in advance from subtitle text of the broadcast program;
speech recognition means for performing speech recognition of the broadcast program using the third language model and an acoustic model generated in advance;
alignment means for performing alignment in which the words of the speech recognition text representing the speech recognition result of the broadcast program and the words of the subtitle text are associated in time order;
replacement means for determining, for each word associated between the speech recognition text and the subtitle text, whether the word is a replacement target based on whether the word differs and whether the word chains of a preset number of words before and after the word match, and for replacing, when the word is a replacement target, the word of the speech recognition text with the word of the subtitle text; and
learning data generation means for determining, for each utterance section of the broadcast program, whether the speech recognition text after replacement by the replacement means matches the subtitle text, and for generating the learning data by attaching, as labels, the words of the subtitle text corresponding to each utterance section judged to match to the speech data of that utterance section.

The learning data generation apparatus according to claim 1, wherein the number of words set in advance in the replacement means is five.

The learning data generation apparatus according to claim 1 or 2, further comprising acoustic model adaptation means for adapting the acoustic model using the learning data.

The learning data generation apparatus according to claim 3, wherein the acoustic model adaptation means determines whether the number of times the acoustic model has been adapted is equal to or less than a preset threshold value and, when the number of times is equal to or less than the threshold value, outputs the adapted acoustic model to the speech recognition means, and
the speech recognition means performs speech recognition of the broadcast program using the third language model and the adapted acoustic model.

A learning data generation program for causing a computer to function as the learning data generation apparatus according to any one of claims 1 to 4.