JP2010243914A

JP2010243914A - Acoustic model learning device, voice recognition device, and computer program for acoustic model learning

Info

Publication number: JP2010243914A
Application number: JP2009094212A
Authority: JP
Inventors: Masato Mimura (三村 正人); Tatsuya Kawahara (河原 達也)
Original assignee: Kyoto University
Current assignee: Kyoto University
Priority date: 2009-04-08
Filing date: 2009-04-08
Publication date: 2010-10-28
Anticipated expiration: 2029-04-08
Also published as: JP5366050B2

Abstract

PROBLEM TO BE SOLVED: To provide an acoustic model learning device that effectively generates an acoustic model for recognizing and transcribing spoken-language speech for which a formatted document-style text DB already exists.

SOLUTION: The acoustic model learning device 78 includes: a language model estimating section 188 that estimates a language model 136 for transcription faithful to the actual spoken content from a language model 186 learned from document-style text (for example, conference minutes) 42, which was obtained by a human transcribing and formatting a voice DB (for example, a deliberation voice corpus) 40; a phoneme labeling section 144 that transcribes the voice DB 40 and attaches phoneme labels to it by voice recognition using an initial acoustic model 130 and the spoken-style transcription language model 136 estimated by the language model estimating section 188, and outputs a voice DB 80 with phoneme labels; and an acoustic model learning section that learns the acoustic model using the phoneme-labeled voice DB 80 as learning data.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to speech recognition technology, and more particularly to a speech recognition device capable of recognizing spoken-language speech with high accuracy, and to an acoustic model learning technique for that purpose.

In recent years, the main target of large-vocabulary continuous speech recognition has been shifting from speech carefully pronounced for speech recognition (hereinafter “read speech”) to spoken-language speech such as lectures and meetings (hereinafter “spoken speech”).

Spoken speech involves disfluencies that are not seen in read speech. These include, for example, self-corrections, hesitations, the insertion of utterances called fillers such as “ah” and “uh”, the omission of particles in the case of Japanese, and sloppy pronunciation.

In general, recognizing speech with statistical speech recognition technology requires an acoustic model. To train an acoustic model, a speech corpus consisting of pairs of speech and its faithful transcription must be prepared. To increase recognition accuracy, a larger speech corpus is desirable. Such speech corpora are usually created manually. In the case of spoken speech, however, the phenomena described above make manual transcription very costly, so building a large corpus is extremely difficult. As a result, a shortage of data for training the acoustic model needed for speech recognition becomes a problem.

To address this problem, Lamel et al. proposed in Non-Patent Document 1 an approach called lightly supervised training. In this approach, instead of faithful transcriptions of the utterances, phoneme labels for acoustic model training are created from formatted text data that is available at low cost. Non-Patent Document 1 proposes assigning phoneme labels to news speech as follows.

Many broadcasts are provided with captions. One could consider creating phoneme labels by using these captions as the text data for the broadcast. According to Non-Patent Document 1, however, captions contain many errors and cannot be used as phoneme labels as they are. Non-Patent Document 1 therefore creates phoneme labels for broadcast speech by performing speech recognition with a language model trained on the caption text. According to Non-Patent Document 1, news audio contains many non-speech segments such as music and commercials, so the recognition results are not highly reliable. Non-Patent Document 1 therefore reports that it is effective, after speech recognition, to match the results against the captions again and to use only the recognition results for the segments that agree.
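The matching step described above can be pictured with a short sketch. This is only an illustration under stated assumptions: Non-Patent Document 1's actual alignment procedure is not described in this text, so Python's standard `difflib.SequenceMatcher` stands in for it, and the function name `matched_segments`, the `min_len` threshold, and the sample sentences are all hypothetical.

```python
from difflib import SequenceMatcher


def matched_segments(hypothesis, caption, min_len=3):
    """Return the word runs (at least min_len words long) that the
    recognition hypothesis and the caption share, in hypothesis order."""
    matcher = SequenceMatcher(a=hypothesis, b=caption, autojunk=False)
    return [
        hypothesis[block.a : block.a + block.size]
        for block in matcher.get_matching_blocks()
        if block.size >= min_len  # short or empty matches are discarded
    ]


# Hypothetical recognizer output (with a filler) and an edited caption.
hyp = "the prime minister uh said that the budget will pass".split()
cap = "the prime minister said that the budget would pass".split()
segments = matched_segments(hyp, cap)
```

Only the runs on which both word sequences agree are kept as training material; the filler and the disputed word fall outside every retained segment.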

Non-Patent Document 2 likewise targets broadcast speech, but to handle expressions that do not appear in the captions, it combines a language model built from the captions with a separately built baseline language model, placing a large weight on the former, and performs speech recognition with the combined model. Non-Patent Document 2 reports that adding training data labeled with the resulting phoneme labels improved recognition accuracy not only with ordinary ML (maximum likelihood) training but also with minimum phone error (MPE) training, a form of discriminative training.
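As a rough illustration of combining two language models with a heavy weight on the caption-derived one, the sketch below uses linear interpolation of unigram probabilities. The text does not say which combination method Non-Patent Document 2 uses, so linear interpolation, the weight 0.9, the function name `interpolate`, and the toy probabilities are all assumptions.

```python
def interpolate(caption_lm, baseline_lm, caption_weight=0.9):
    """Linearly interpolate two unigram distributions (word -> probability).

    caption_weight is the (assumed) large weight on the caption model;
    words missing from one model contribute probability 0 from that model.
    """
    vocab = set(caption_lm) | set(baseline_lm)
    return {
        w: caption_weight * caption_lm.get(w, 0.0)
        + (1.0 - caption_weight) * baseline_lm.get(w, 0.0)
        for w in vocab
    }


# Toy distributions: the filler "uh" appears only in the baseline model.
caption_lm = {"budget": 0.6, "tax": 0.4}
baseline_lm = {"budget": 0.2, "tax": 0.3, "uh": 0.5}
mixed = interpolate(caption_lm, baseline_lm)
```

The mixed model stays close to the caption model but still assigns some probability to words, such as the filler here, that the captions never contain.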

Non-Patent Document 1: L. Lamel et al., “Investigating lightly supervised acoustic model training,” in ICASSP, Vol. 1, pp. 477-480, 2001.
Non-Patent Document 2: H. Y. Chan et al., “Improving broadcast news transcription by lightly supervised discriminative training,” in IEEE-ICASSP, Vol. 1, pp. 737-740, 2004.
Non-Patent Document 3: M. Paulik et al., “Lightly supervised acoustic model training on EPPS recordings,” in INTERSPEECH, pp. 224-227, 2008.
Non-Patent Document 4: Y. Akita et al., “Conversion of language model into spoken language style based on the framework of statistical machine translation,” IEICE Technical Report, SP2005-108, NLC2005-75 (SLP-59-19), 2005.

In recent years, attempts have been made in the National Diet, local assemblies, and similar bodies to create minutes using speech recognition. The reasons include the demand for greater efficiency and cost reduction in the work of public institutions, the declining number of skilled stenographers who have been responsible for preparing minutes, and social circumstances that make it difficult to train stenographers. In the background, of course, is the maturing of the necessary hardware and software, such as the spread of high-performance computers and the development of speech recognition technology.

However, Diet proceedings, and committee question-and-answer sessions in particular, are typical spoken language, so creating a speech corpus is difficult, as already described. As a result, the accuracy of acoustic models for spoken speech cannot be raised, and the speech recognition results are poor.

Judging from the reports in Non-Patent Documents 1 and 2, lightly supervised training appears to be an effective technique for recognizing spoken-language speech in broadcasts. Since utterances in Diet committees and the like are typical spoken language, it is quite likely that minutes could be created by performing speech recognition with an acoustic model trained by lightly supervised training.

Non-Patent Document 3 has already reported minutes creation using lightly supervised training for European Parliament speech. In Non-Patent Document 3, lightly supervised training using the text of European Parliament minutes is used to create phoneme labels for the speech data. Specifically, a language model is built directly from the manually prepared minutes, and phoneme labels are created by using this language model to recognize the speech corresponding to the minutes. An acoustic model is then built from the phoneme-labeled speech and used to recognize new meeting speech and produce minutes.

Non-Patent Document 3 further reports that training the language model with a large weight on the text of a particular session, and recognizing that session's speech with it, yielded more accurate phoneme labels than a language model trained uniformly on the minutes of all sessions.

As reported in Non-Patent Document 3, there is no problem if phoneme labeling accuracy is satisfactory when a language model built from the manually prepared minutes themselves is used. However, as described below, there are problems to be solved in creating minutes for the Japanese Diet and local assemblies in particular.

In the European Parliament, much of the speech is comparable to remarks in plenary sessions of the Japanese Diet, so speakers talk relatively carefully and problems specific to spoken language rarely arise. As a result, the minutes differ little from the actual utterances, and using the text of the minutes directly to build the language model does not greatly reduce phoneme labeling accuracy.

In the Japanese Diet, however, debate takes place mainly in committees rather than in plenary sessions. Committee debate is more interactive than plenary debate and consists mainly of spontaneous speech. Questioners in particular speak while thinking, with only brief notes in hand, and while taking the answers into account, so self-corrections, pauses, and filler insertions occur frequently. Respondents exhibit fewer such problems than questioners, but still many more spoken-language problems than remarks in plenary sessions.

At present, minutes are prepared by stenographers. In doing so, the meaningless sounds, self-corrections, sloppy pronunciation, and the like described above are corrected, and the text is shaped into expressions close to written language. This is intellectually demanding work that is very difficult to reproduce by machine. For that very reason, however, the difference between the actual utterances and the minutes becomes large, and the minutes cannot be used as they are to assign phoneme labels to speech data for acoustic model creation.

However, labeling meeting speech with phonemes without using the minutes at all would, as noted above, require new manual transcription at enormous cost. A technique is therefore needed that enables efficient phoneme labeling of large amounts of speech while making effective use of existing minutes. This problem is not limited to minutes; it is common to any attempt to automate the transcription of spoken speech data for which formatted transcripts exist, such as the preparation of lecture or talk records at universities and high schools. Furthermore, in a trial, for example, when referring to recorded video, there may be a demand to document the main statements in the video and then search again for the relevant portions of the video. In such cases too, it would be convenient to be able to assign phoneme labels to the speech efficiently.

In spoken language, moreover, the speakers, the topics, the surrounding acoustic environment, and so on may change over time. For example, after a cabinet reshuffle, the ministers who answer questions in the Diet change. After a change of government, the former ruling and opposition parties may trade places, and speaking styles are likely to change with the change in position. In such cases, it is desirable that the acoustic model used for transcription can also be updated easily so as to follow changes in the environment. Conventionally, no technique existed for assigning phoneme labels to large amounts of spoken speech data so simply and efficiently.

Accordingly, an object of the present invention is to provide an acoustic model learning device capable of effectively creating an acoustic model for transcribing spoken speech data for which formatted text data exists.

Another object of the present invention is to provide an acoustic model learning device that allows an acoustic model for transcribing spoken speech data for which formatted text data exists to be updated easily in response to changes in the environment.

An acoustic model learning device according to a first aspect of the present invention includes: language model estimation means for estimating a language model for spoken-style transcription faithful to the actual spoken content from a language model trained on document-style text obtained by a human transcribing and formatting a speech database; phoneme labeling means for attaching a transcription and its phoneme labels to the speech database by speech recognition using an initial acoustic model prepared in advance and the language model for spoken-style transcription estimated by the language model estimation means; and acoustic model learning means for training or updating an acoustic model for speech recognition using, as training data, the speech database to which phoneme labels have been attached by the phoneme labeling means.

In this acoustic model learning device, the language model estimation means estimates a language model for spoken-style transcription from the language model trained on the document-style text. Using this language model and the initial acoustic model, the phoneme labeling means attaches a transcription and its phoneme labels to the speech database containing the original utterances. The acoustic model learning means trains the acoustic model for speech recognition using the phoneme-labeled speech database as training data.

A language model for spoken-style transcription is estimated from the language model trained on the document-style text. Because the transcription and phoneme labels are attached to the underlying speech database using this language model, speech recognition can be performed accurately and faithfully to the spoken audio even if the utterances in the speech database contain phenomena specific to spoken language (hesitations, repetitions, filler insertions, and so on). Since the acoustic model for speech recognition is trained on speech data labeled faithfully to the spoken audio, the accuracy of recognizing new utterance data with this acoustic model can be increased.

Preferably, the language model estimation means includes N-gram creation means for creating a per-turn N-gram language model from the document-style text corresponding to each utterance turn in the speech database, and means for estimating, from each per-turn N-gram language model created by the N-gram creation means, a spoken-language N-gram language model for spoken-style transcription. The phoneme labeling means includes language model selection means for selecting, for each turn of the speech database, the corresponding N-gram language model from among the spoken-language N-gram language models, and speech recognition means for performing speech recognition on each utterance turn of the speech database using the N-gram language model selected by the language model selection means and the initial acoustic model, thereby attaching a transcription and its phoneme labels to each turn of the speech database.

The speaking style of utterances in the speech database varies with the speaker, the topic, and so on. By creating a spoken-language N-gram for spoken-style transcription for each turn and recognizing each turn with the spoken-language N-gram obtained from that turn, the accuracy of phoneme labeling of the speech database can be increased turn by turn. As a result, the acoustic model for speech recognition can be trained more efficiently, and the accuracy of speech recognition using it can be increased.
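The per-turn idea can be sketched as follows. This is a minimal illustration, not the patent's implementation: it only collects raw unigram, bigram, and trigram counts from the document-style text of each turn, and omits smoothing and the conversion to spoken style. The names `ngram_counts` and `per_turn_models`, the sentence-boundary markers, and the toy turns are assumptions.

```python
from collections import Counter


def ngram_counts(words, max_n=3):
    """Count all 1- to max_n-grams in one turn, with boundary markers."""
    padded = ["<s>"] + list(words) + ["</s>"]
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i : i + n])] += 1
    return counts


def per_turn_models(turns, max_n=3):
    """Build one independent N-gram count table per turn ID."""
    return {turn_id: ngram_counts(words, max_n) for turn_id, words in turns.items()}


# Two hypothetical turns taken from the minutes.
turns = {"turn1": ["we", "agree"], "turn2": ["we", "object"]}
models = per_turn_models(turns)
```

Selecting the model for a turn is then a dictionary lookup by turn ID, so each turn is recognized with statistics drawn only from its own text.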

More preferably, the acoustic model learning device further includes conversion model learning means for learning, based on an aligned corpus created from a spoken-style transcription of a part of the speech database and the portion of the document-style text corresponding to that part, a conversion model that statistically represents the conversion from expressions in the document-style text to expressions in the spoken-style transcription. The language model estimation means includes means for estimating the N-gram language model for spoken-style transcription by applying the conversion model to each per-turn N-gram language model.

When an aligned corpus is created from a spoken-style transcription of a part of the speech database and the corresponding part of the document-style text, the conversion model learning means learns a conversion model from that aligned corpus. This conversion model statistically represents the conversion from expressions in the document-style text to expressions in the spoken-style transcription. The language model estimation means applies this conversion model to each per-turn N-gram language model to create the N-gram language models for spoken-style transcription.
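One simple way to picture a conversion model learned from an aligned corpus is as relative frequencies of spoken-style expressions given each document-style expression. The sketch below assumes the aligned corpus has already been reduced to (document expression, spoken expression) pairs; that pair format, the function name `learn_conversion_model`, and the sample pairs are hypothetical, and the actual corpus format and estimation method of the embodiment are not given in this passage.

```python
from collections import Counter, defaultdict


def learn_conversion_model(aligned_pairs):
    """Estimate p(spoken | document) by relative frequency over aligned pairs."""
    counts = defaultdict(Counter)
    for doc_expr, spoken_expr in aligned_pairs:
        counts[doc_expr][spoken_expr] += 1
    return {
        doc_expr: {s: c / sum(spoken.values()) for s, c in spoken.items()}
        for doc_expr, spoken in counts.items()
    }


# Hypothetical aligned pairs: the same minutes expression was spoken
# once as written and once with a leading filler.
pairs = [
    ("I think", "I think"),
    ("I think", "uh I think"),
    ("it is", "it is"),
]
model = learn_conversion_model(pairs)
```

Because the model is estimated from a small aligned subset, it can then be applied to minutes text for which no faithful transcription exists.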

The aligned corpus itself is expected to be created manually. However, with the language model obtained in this way, phoneme labeling can be performed automatically not only on the part of the speech database from which the aligned corpus was created but also on a larger speech database that includes that part. Compared with creating an aligned corpus for the entire speech database, phoneme labeling of a large speech database can be performed accurately and efficiently with less effort.

More preferably, the speech database is a deliberation speech corpus in which the audio of some deliberation is recorded, and the document-style text is the minutes of that deliberation.

Speech from deliberations in the Diet and similar bodies frequently exhibits phenomena specific to spoken language (fillers, hesitations, and so on), and exists in large quantities. Manual phoneme labeling of the speech database is therefore difficult. Complete minutes, in which the remarks made during deliberation are formatted into document style, are available, however. By using these minutes as the document-style text and the deliberation speech database as the speech database, and training the acoustic model for speech recognition as described above, deliberation speech can be recognized efficiently and accurately.

A speech recognition device according to a second aspect of the present invention includes: acoustic model storage means for storing an acoustic model for speech recognition trained by any of the acoustic model learning devices described above using a predetermined speech corpus as training data; and speech recognition means for performing speech recognition on input utterance data using the acoustic model for speech recognition stored in the acoustic model storage means and a language model for speech recognition.

A computer program according to a third aspect of the present invention causes a computer to function as: language model estimation means for estimating a language model for spoken-style transcription faithful to the actual spoken content from a language model trained on document-style text obtained by a human transcribing and formatting a speech database; phoneme labeling means for attaching a transcription and its phoneme labels to the speech database by speech recognition using an initial acoustic model prepared in advance and the language model for spoken-style transcription estimated by the language model estimation means; and acoustic model learning means for training or updating an acoustic model for speech recognition using, as training data, the speech database to which phoneme labels have been attached by the phoneme labeling means.

FIG. 1 is a block diagram of a minutes creation system 30 according to a first embodiment of the present invention.
FIG. 2 is a diagram schematically showing the correspondence between the deliberation speech corpus 40 and the minutes 42 shown in FIG. 1.
FIG. 3 is a block diagram of the phoneme labeling processing unit 78 shown in FIG. 1.
FIG. 4 is a schematic diagram showing part of the contents of the aligned corpus used in the embodiment of the present invention.
FIG. 5 is a flowchart showing the control structure of a computer program that implements the processing unit for learning the spoken-language/written-language conversion model.
FIG. 6 is a flowchart showing the control structure of a computer program that implements the processing unit for creating an N-gram for each turn and the processing unit for converting the N-grams from written language to spoken language.
FIG. 7 is a diagram schematically showing the relationship among the computers constituting the minutes creation system according to the first embodiment.
FIG. 8 is an external view of the computer used for acoustic model creation in the minutes creation system shown in FIG. 7.
FIG. 9 is a block diagram showing the hardware configuration of the computer shown in FIG. 8.
FIG. 10 is an external view of the computer used for minutes creation in the minutes creation system shown in FIG. 7.

In the following description, identical parts are given identical reference numerals. Their names and functions are also identical. Detailed descriptions of them will therefore not be repeated. In the embodiments described below, unigrams, bigrams, and trigrams are used as N-grams.

[Principle of the Embodiment]
In the present embodiment, an automatic transcription system for Diet deliberation speech (a minutes creation system) is built on the following idea. In the Japanese Diet, as described above and unlike in the European Parliament, debate takes place mainly in committees. Interactive, spontaneous speech therefore predominates to a greater degree than in European Parliament deliberations. Such speech contains many fillers, hesitations, repetitions, and the like. In manually prepared minutes, these disfluent utterances are “translated” into fluent ones. In Japan, that is, the difference between the actual utterances and the minutes is large. Creating phoneme labels from the minutes as they are is therefore difficult, and the question is how to deal appropriately with the phenomena specific to spoken language.

FIG. 2 shows an example of actual utterances in Diet deliberation speech and the corresponding minutes.

FIG. 2 shows the deliberation speech corpus 40, consisting of actual utterances, side by side with the corresponding minutes 42. The deliberation speech corpus 40 contains, for example, recordings of Diet deliberations, and constitutes a speech database. Utterance 100 corresponds to minutes entry 110, utterance 102 to minutes entry 112, and utterance 104 to minutes entry 114.

As FIG. 2 shows, the minutes are formatted by inserting the particle “ga” and removing fillers such as “ii”, “ee”, and “anoo”. In effect, spoken language is converted into written language.
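The filler-removal part of this formatting can be illustrated with a toy sketch. It is an illustration only: the filler set uses romanized forms chosen for readability, the word sequences are made up rather than taken from FIG. 2, particle insertion is not modeled, and the function names are hypothetical.

```python
FILLERS = {"ii", "ee", "anoo"}  # illustrative romanized fillers (assumption)


def remove_fillers(spoken_words):
    """Spoken style -> document style: every filler is dropped."""
    return [w for w in spoken_words if w not in FILLERS]


def insert_one_filler(doc_words, filler="ee"):
    """Document style -> spoken style: one candidate per insertion position."""
    return [
        doc_words[:i] + [filler] + doc_words[i:]
        for i in range(len(doc_words) + 1)
    ]


doc = ["kore", "wa", "mondai", "desu"]
candidates = insert_one_filler(doc)
```

Every candidate maps back to the same document-style sequence under `remove_fillers`, while even a single filler already yields several distinct spoken-style candidates for one document-style sequence.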

Non-Patent Document 4 proposes a framework for building a statistical model for style conversion of language models from an aligned corpus of such spoken language (faithful transcriptions of what was said) and formatted documents (minutes). In the embodiment described below, this statistical language model conversion is applied to the individual minutes to build a spoken-language language model from a written-language language model, and speech recognition with this language model is used to create phoneme labels for the spoken language.

言語モデルの統計的スタイル変換では、統計的機械翻訳の枠組みに基づき、話し言葉スタイルＶと文書スタイルＷとの変換を行なう。この変換は双方向的である。すなわち、話し言葉の書き起こしから文書スタイルへ整形を行なう方向へも、文書スタイルのテキストから書き起こしを復元する方向へもこの変換モデルを適用することができる。 In the statistical style conversion of the language model, the spoken language style V and the document style W are converted based on the framework of statistical machine translation. This conversion is bidirectional. In other words, the conversion model can be applied in the direction of shaping from the transcription of the spoken language to the document style and in the direction of restoring the transcription from the text in the document style.

デコードは、統計的機械翻訳の枠組みにしたがい、次のベイズ則に基づいて行なわれる。 Decoding follows the framework of statistical machine translation and is based on the following Bayes' rule.
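
The equations themselves are not reproduced in this text. A reconstruction that is consistent with the symbol definitions in the next paragraph, and that should be read as an assumption rather than as the original typesetting, is:

```latex
\begin{align}
p(W \mid V) &= \frac{p(W)\, p(V \mid W)}{p(V)} \tag{1} \\
p(V \mid W) &= \frac{p(V)\, p(W \mid V)}{p(W)} \tag{2}
\end{align}
```

Under this reading, equation (1) corresponds to shaping a spoken-style text V into a document-style text W, and equation (2) to recovering V from W.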

この式において、ｐ（Ｗ）は文書スタイルのＮ−グラム確率、ｐ（Ｖ）は話し言葉スタイルのテキストＶのＮ−グラム確率、ｐ（Ｗ｜Ｖ）は話し言葉スタイルのテキストＶに対する文書スタイルのテキストＷの条件付確率、ｐ（Ｖ｜Ｗ）は文書スタイルのテキストＷに対する話し言葉スタイルのテキストＶの条件付確率を、それぞれ示す。各式の分母は通常は無視される。 In these equations, p(W) is the N-gram probability of the document-style text W, p(V) is the N-gram probability of the spoken-style text V, p(W|V) is the conditional probability of the document-style text W given the spoken-style text V, and p(V|W) is the conditional probability of the spoken-style text V given the document-style text W. The denominator of each equation is usually ignored.

ここで重要なのは、式（２）により話し言葉スタイルのテキストＶを一意に決定するのは、テキストＶが多様であり得るため、式（１）により整形を行なうプロセスよりもはるかに難しい点である。例えば、式（２）においてフィラーはランダムに挿入され得る（つまり、フィラーを含む話し言葉スタイルのテキストＶの形式が多様であり得る）が、式（１）においてはフィラーは確率１で除去される（すなわち、話し言葉スタイルのテキストＶ中のフィラーは文書スタイルのテキストＷへの変換の際に確実に除去される。）と考えてよい。したがって、話し言葉スタイルのテキストＶを一意に復元することよりも、次の式（３）のように話し言葉スタイルのテキストＶの統計的言語モデルを推定することの方が有意義である。 The important point here is that uniquely determining the spoken-style text V by equation (2) is far more difficult than the shaping process of equation (1), because V can take many forms. For example, in equation (2) fillers can be inserted at random (that is, the spoken-style text V containing fillers can take many forms), whereas in equation (1) fillers can be assumed to be removed with probability 1 (that is, fillers in the spoken-style text V are always removed in the conversion to the document-style text W). It is therefore more meaningful to estimate a statistical language model of the spoken-style text V, as in the following equation (3), than to recover V uniquely.
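
Equation (3) is likewise not reproduced in this text. Rearranging Bayes' rule for p(V) gives a form consistent with the surrounding description; it is offered here as an assumption, not as the original typesetting:

```latex
\begin{equation}
p(V) = p(W) \cdot \frac{p(V \mid W)}{p(W \mid V)} \tag{3}
\end{equation}
```

Read this way, the spoken-style language model is obtained by rescaling the document-style language model p(W) with the ratio of the two conditional probabilities.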

重要な点は、文書スタイルのテキストＷは話し言葉を忠実に書き起こしたテキストＶよりも豊富に存在する点である。すなわち、式（３）にしたがえば、豊富な文書スタイルのテキストを用いて話し言葉音声認識のための言語モデルｐ（Ｖ）をロバストに推定できる。 The important point is that the document-style text W exists more abundantly than the text V which is a transcription of the spoken language. That is, according to Equation (3), it is possible to robustly estimate the language model p (V) for spoken speech recognition using abundant document style text.

実際の変換は、次式のようにＮ−グラム計数を操作することで行なわれる。 The actual conversion is done by manipulating the N-gram count as follows:
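
Equation (4) is also missing from this text. A count-level form that matches the description of the patterns v and w and of the two conditional probabilities, written with the hypothetical notation $N_{\mathrm{gram}}(\cdot)$ for an N-gram count, would be:

```latex
\begin{equation}
N_{\mathrm{gram}}(v_1^n) = N_{\mathrm{gram}}(w_1^n) \cdot \frac{p(v \mid w)}{p(w \mid v)} \tag{4}
\end{equation}
```

Here $w_1^n$ and $v_1^n$ stand for a document-style N-gram and its spoken-style counterpart; the exact indexing used in the original is an assumption.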

ｖ及びｗは、各スタイルにおける変換パターンである。式（４）により、置換ｗ→ｖ、ｗの脱落、ｖの挿入を文脈を考慮してモデル化することができる。条件付確率ｐ（ｖ｜ｗ）及びｐ（ｗ｜ｖ）は、書き起こしと文書スタイルテキストとの対応付けコーパスから統計的に推定される。より具体的には、これら条件付確率ｐ（ｖ｜ｗ）及びｐ（ｗ｜ｖ）は、コーパス中の各パターンの出現回数から推定される。 Here v and w are the conversion patterns in each style. Equation (4) makes it possible to model substitution w → v, deletion of w, and insertion of v with context taken into account. The conditional probabilities p(v|w) and p(w|v) are estimated statistically from the aligned corpus of transcripts and document-style text. More specifically, they are estimated from the number of occurrences of each pattern in the corpus.

適切なモデルとなるように、パターンの隣接単語も考慮する。例えば、フィラー「あー」は、｛ｗ＝（ｗ₋₁，ｗ₊₁）→ｖ＝（ｗ₋₁，あー，ｗ₊₁）｝のようにモデル化される。品詞情報を用いたスムージングを行なうと、データのスパースネスに対応することができる。 To obtain an appropriate model, the words adjacent to a pattern are also taken into account. For example, the filler “あー” (“ah”) is modeled as {w = (w₋₁, w₊₁) → v = (w₋₁, あー, w₊₁)}. Smoothing based on part-of-speech information makes it possible to cope with data sparseness.
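
The context-dependent filler pattern above can be illustrated with a minimal Python sketch. The function name `filler_patterns`, the `FILLERS` set (built from filler examples mentioned in the text), and the token-list input are assumptions for illustration; the source does not specify how patterns are extracted.

```python
# Fillers taken from examples in the text; the full inventory is an assumption.
FILLERS = {"あー", "えー", "いー", "あのー"}

def filler_patterns(spoken_tokens, fillers=FILLERS):
    """Return (w, v) pairs for each filler that sits between two non-filler words.

    w = (prev, next) is the document-style context,
    v = (prev, filler, next) is the spoken-style pattern.
    """
    patterns = []
    for i in range(1, len(spoken_tokens) - 1):
        tok = spoken_tokens[i]
        prev, nxt = spoken_tokens[i - 1], spoken_tokens[i + 1]
        if tok in fillers and prev not in fillers and nxt not in fillers:
            patterns.append(((prev, nxt), (prev, tok, nxt)))
    return patterns

patterns = filler_patterns(["<sp>", "あー", "この", "法案"])
```

Keeping the neighbouring words in both w and v is what lets the model learn that a filler is more likely in some contexts (for example, right after a pause) than in others.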

［第１の実施の形態］ [First Embodiment]
図１を参照して、本発明の第１の実施の形態に係る会議録作成システム３０は、一般的には音声認識システムであって、審議音声コーパス４０と、審議音声コーパス４０に対応する会議録４２とから、審議音声５４を音声認識することによって書き起こし５６を出力するためのものである。この実施の形態は、前記した言語モデルの統計的スタイル変換（書き言葉→話し言葉）を、音響モデルの準教師付学習に適用したものである。国会では、収録した音声データによる大規模なアーカイブが作成されている。これらの音声に対しては、人手による書き起こしは付与されていないが、整形済の会議録が利用可能である。したがって、会議録をもとに音素ラベルを自動で作成できれば、豊富な音声データがそのまま音響モデルの学習データとして利用できることになる。 Referring to FIG. 1, a conference record creation system 30 according to the first embodiment of the present invention is, in general terms, a speech recognition system that outputs a transcript 56 by recognizing deliberation speech 54, using a deliberation speech corpus 40 and minutes 42 corresponding to the deliberation speech corpus 40. This embodiment applies the statistical style conversion of the language model described above (written language → spoken language) to semi-supervised learning of an acoustic model. The Diet maintains a large archive of recorded speech data. No manual transcripts are attached to this speech, but the shaped minutes are available. Therefore, if phoneme labels can be created automatically from the minutes, the abundant speech data can be used directly as training data for the acoustic model.

図１を参照して、この目的のために、会議録作成システム３０においては、審議音声コーパス４０の一部である部分コーパス６８から作成した忠実な書き起こし７０と、会議録４２のうち部分コーパス６８に対応する部分会議録７２とから、手作業の対応付けコーパス作成処理７４により、最初に対応付けコーパス７６を作成する。部分コーパス６８と部分会議録７２とは互いに対応付けられている。すなわち、部分コーパス６８に含まれる音声に対し、部分会議録７２のテキストデータを構成する文字・記号が予め割当てられている。書き起こし７０により、部分コーパス６８に音素ラベルを付与できる。 Referring to FIG. 1, for this purpose the conference record creation system 30 first creates an association corpus 76 by a manual association corpus creation process 74, from a faithful transcript 70 created from a partial corpus 68, which is a part of the deliberation speech corpus 40, and partial minutes 72, which are the part of the minutes 42 corresponding to the partial corpus 68. The partial corpus 68 and the partial minutes 72 are associated with each other. That is, the characters and symbols making up the text data of the partial minutes 72 are assigned in advance to the speech contained in the partial corpus 68. The transcript 70 makes it possible to attach phoneme labels to the partial corpus 68.

会議録は、予算委員会、法務委員会などの会議毎に作成されるが、各発言には会議内の話者ＩＤが付与されており、それにしたがってターン毎のテキストが抽出できる。各会議はおよそ２時間から５時間の長さであり、各ターンは１０秒から３分程度（平均１分）の長さである。ここで「ターン」とは、ある話者がまとめて話したひとまとまりの発話のことをいう。例えば質問者が質問を発したときの発話で１ターン、答弁者がその質問に答弁して次の１ターン、などのように一連の発話が複数のターンに分割される。同一の話者による連続した発話でも、話題が異なれば別ターンとされている。図２に示す発話１００、１０２及び１０４はそれぞれ１ターンとなっている。それに対応する会議録１１０、１１２及び１１４もターンごとに読出すことができる。 Minutes are created for each meeting, such as a meeting of the Budget Committee or the Judicial Affairs Committee. Each statement is tagged with a speaker ID within the meeting, so the text of each turn can be extracted accordingly. Each meeting lasts roughly 2 to 5 hours, and each turn lasts about 10 seconds to 3 minutes (1 minute on average). Here, a “turn” is a block of speech uttered by one speaker at a stretch. A series of utterances is divided into turns: for example, the utterance in which a questioner asks a question is one turn, and the respondent's answer to it is the next. Consecutive utterances by the same speaker are treated as separate turns if the topic differs. The utterances 100, 102, and 104 shown in FIG. 2 are each one turn. The corresponding minutes 110, 112, and 114 can likewise be read out turn by turn.

本実施の形態では、音素ラベル付与のための音声認識の際に言語モデルとして使用されるＮ−グラムが、より強い制約となるように、多くの話者又は話題を含む会議全体ではなく、個々のターンごとにＮ−グラムを作成する。本実施の形態に係る手法では、個々のＮ−グラムのサイズが大きくならないので、ターンのような詳細な単位ごとにＮ−グラムを用意することが可能である。その上、ベースライン言語モデルを音声認識に使用する場合のように、余計な表現が混入する可能性が極めて低いという利点がある。 In this embodiment, so that the N-gram used as the language model in speech recognition for phoneme labeling acts as a stronger constraint, an N-gram is created for each individual turn rather than for a whole meeting, which includes many speakers and topics. With the method of this embodiment the individual N-grams do not grow large, so an N-gram can be prepared for each fine-grained unit such as a turn. A further advantage is that, unlike when a baseline language model is used for speech recognition, extraneous expressions are very unlikely to be mixed in.

対応付けコーパス作成処理７４は、部分コーパス６８の書き起こし７０を作成した後、書き起こしの各単語を部分会議録７２の単語と対応付ける処理である。この処理は手作業である。しかし、対応付けコーパス７６は、審議音声コーパス４０の一部（部分コーパス６８）及び会議録４２の一部（部分会議録７２）のみに対応するものである。したがって、対応付けコーパス７６を作成するための作業量は、審議音声コーパス４０の全体を書き起こす場合と比較してはるかに小さくてよい。 The association corpus creation process 74 creates the transcript 70 of the partial corpus 68 and then associates each word of the transcript with a word of the partial minutes 72. This process is manual. However, the association corpus 76 covers only a part of the deliberation speech corpus 40 (the partial corpus 68) and a part of the minutes 42 (the partial minutes 72). The work needed to create the association corpus 76 is therefore far smaller than transcribing the entire deliberation speech corpus 40.

なお、本実施の形態ではＮ−グラムを言語モデルとして使用するため、対応付けコーパス７６の作成において、ポーズの取扱いに注意する必要がある。音声データではポーズが挿入されていても、会議録ではポーズはそのままで挿入されているわけではなく、句読点の形で挿入されていることが多いためである。ポーズの取扱い方には種々あるが、本実施の形態では「、」はショートポーズ（＜ｓｐ＞）、「。」は無音区間（＜ｓｉｌ＞）として取扱っている。対応付けコーパス７６の作成時には、このようにしてポーズの標記を統一している。 Because this embodiment uses N-grams as the language model, care is needed in handling pauses when creating the association corpus 76. Even where pauses occur in the speech data, the minutes do not record them as such; they often appear in the form of punctuation marks. Pauses can be handled in various ways; in this embodiment, “、” is treated as a short pause (<sp>) and “。” as a silent section (<sil>). The notation of pauses is unified in this way when the association corpus 76 is created.
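
The pause convention just described can be sketched in a few lines of Python. The function name `normalize_pauses` and the assumption that the minutes are already split into tokens are illustrative; only the two mappings come from the text.

```python
# Mapping stated in the text: "、" -> short pause, "。" -> silent section.
PAUSE_TOKENS = {"、": "<sp>", "。": "<sil>"}

def normalize_pauses(tokens):
    """Replace punctuation tokens with the unified pause symbols."""
    return [PAUSE_TOKENS.get(tok, tok) for tok in tokens]

normalized = normalize_pauses(["この", "、", "法案", "は", "。"])
```

Applying the same mapping to both the transcript side and the minutes side keeps the pause symbols comparable when N-grams are counted.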

会議録作成システム３０は、このようにして作成された対応付けコーパス７６を用い、式（４）によって書き言葉用の言語モデルを話し言葉用の言語モデルに変換する変換モデル１２２を推定するための話し言葉／書き言葉変換モデル学習部１２０と、この変換モデル１２２を使用して、審議音声コーパス４０から話し言葉の音声認識に対応した音響モデル４８の学習を行なうための音声認識用音響モデル学習部４４と、会議録４２の全体から音声認識用の統計的言語モデル５８の学習を行なうための言語モデル学習部４６と、変換モデル１２２を使用して、会議録４２から学習された書き言葉用の言語モデル５８を話し言葉用の言語モデル５０に変換するための言語モデル変換部６０と、各々話し言葉用に適応化された音響モデル４８及び言語モデル５０を用い、審議音声５４を音声認識して認識結果を書き起こし５６として出力するための音声認識装置５２とを含む。 The conference record creation system 30 includes: a spoken/written language conversion model learning unit 120 for estimating, from the association corpus 76 created in this way, a conversion model 122 that converts a written-style language model into a spoken-style language model according to equation (4); a speech recognition acoustic model learning unit 44 for training, from the deliberation speech corpus 40 and using the conversion model 122, an acoustic model 48 suited to recognition of spoken language; a language model learning unit 46 for training a statistical language model 58 for speech recognition from the whole of the minutes 42; a language model conversion unit 60 for converting, using the conversion model 122, the written-style language model 58 learned from the minutes 42 into a spoken-style language model 50; and a speech recognition device 52 for recognizing deliberation speech 54 using the acoustic model 48 and the language model 50, each adapted to spoken language, and outputting the recognition result as a transcript 56.

具体的には、話し言葉／書き言葉変換モデル学習部１２０は、部分会議録７２に出現するＮ−グラムの各々について、書き起こし７０内の対応する部分がどのように変化しているかを調べ、その結果を計数する。例えば部分会議録７２中にｗ＝「＜ｓｐ＞この法案」（＜ｓｐ＞はショートポーズを表す。）が５００回出現し、書き起こし７０ではそのうち５０回がｖ＝「＜ｓｐ＞えーこの法案」となっていた（フィラー「えー」が挿入された）とすれば、ｐ（ｖ｜ｗ）＝５０／５００となる。このような計数を、全てのＮ−グラムとその変化形とについて集計することで、式（４）にしたがった変換モデル１２２が得られる。この集計により得られるのは、どのような変化が何回あったかを示す計数である。この値は、文書スタイルの表現が話し言葉スタイルのどのような表現にどのような確率で変化するかを示す確率と同視することができる。 Specifically, for each N-gram appearing in the partial minutes 72, the spoken/written language conversion model learning unit 120 examines how the corresponding part of the transcript 70 differs and counts the results. For example, if w = “<sp> この法案” (<sp> denotes a short pause; the phrase means “this bill”) appears 500 times in the partial minutes 72, and in 50 of those cases the transcript 70 has v = “<sp> えーこの法案” (the filler “えー” inserted), then p(v|w) = 50/500. Tallying such counts over all N-grams and their variants yields the conversion model 122 according to equation (4). What the tally gives is a count of which changes occurred and how many times. This value can be regarded as the probability with which a document-style expression changes into a given spoken-style expression.
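
The worked example above (50 of 500 occurrences, so p(v|w) = 50/500) can be reproduced with a small relative-frequency estimator. This is a sketch under the assumption that the association corpus has already been reduced to a list of (w, v) pattern pairs; `estimate_conversion` is a hypothetical name, not one used in the source.

```python
from collections import Counter, defaultdict

def estimate_conversion(aligned_pairs):
    """Estimate p(v|w) as count(w -> v) / count(w) from (w, v) pattern pairs."""
    w_counts = Counter()
    wv_counts = defaultdict(Counter)
    for w, v in aligned_pairs:
        w_counts[w] += 1
        wv_counts[w][v] += 1
    return {
        w: {v: c / w_counts[w] for v, c in variants.items()}
        for w, variants in wv_counts.items()
    }

# The example from the text: w occurs 500 times, 50 of them with the filler inserted.
w = ("<sp>", "この", "法案")
v_filler = ("<sp>", "えー", "この", "法案")
pairs = [(w, v_filler)] * 50 + [(w, w)] * 450
model = estimate_conversion(pairs)
```

The reverse probability p(w|v) could be estimated the same way by swapping the roles of w and v in the input pairs.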

音声認識用音響モデル学習部４４は、審議音声コーパス４０、音素ラベル付部分コーパス６８、及び変換モデル１２２を用いた音声認識により審議音声コーパス４０の音声に対して音素ラベルを付す処理を行ない、音素ラベル付音声データベース８０を出力するための音素ラベリング処理部７８と、音素ラベル付音声データベース８０を学習データとして、通常の学習方法により話し言葉用の音響モデル４８の学習を行なうための音響モデル学習部８２とを含む。 The speech recognition acoustic model learning unit 44 includes: a phoneme labeling processing unit 78 for attaching phoneme labels to the speech of the deliberation speech corpus 40 by speech recognition using the deliberation speech corpus 40, the phoneme-labeled partial corpus 68, and the conversion model 122, and outputting a phoneme-labeled speech database 80; and an acoustic model learning unit 82 for training the spoken-language acoustic model 48 by an ordinary training method, using the phoneme-labeled speech database 80 as training data.

図３を参照して、音素ラベリング処理部７８は、音素ラベル付部分コーパス６８から初期音響モデル１３２の学習を行なうための初期音響モデル学習部１３０と、会議録４２のターンごとに会議録４２のテキストデータからＮ−グラム統計データを作成することにより、ターンごとＮ−グラム１８６を作成するためのターンごとＮ−グラム作成部１８４と、ターンごとＮ−グラム１８６の各々に含まれるＮ−グラムの確率に対し、変換モデル１２２により定まる、式（４）により表現される変換を行なうことによって話し言葉用Ｎ−グラム１３６を出力するためのＮ−グラム変換部１８８とを含む。 Referring to FIG. 3, the phoneme labeling processing unit 78 includes: an initial acoustic model learning unit 130 for training an initial acoustic model 132 from the phoneme-labeled partial corpus 68; a per-turn N-gram creation unit 184 for creating per-turn N-grams 186 by compiling N-gram statistics from the text data of the minutes 42 for each turn of the minutes 42; and an N-gram conversion unit 188 for outputting spoken-language N-grams 136 by applying the conversion expressed by equation (4), as determined by the conversion model 122, to the probability of each N-gram contained in each of the per-turn N-grams 186.

ターンごとＮ−グラム作成部１８４は、各ターンの会議録のテキストからＮ−グラムエントリの抽出とそれらの出現回数との計数を行なう。この結果、ターンごとにターンごとＮ−グラム１８６が得られる。ターンごとＮ−グラム１８６内の各エントリについて、変換モデル１２２を適用することによって話し言葉用Ｎ−グラム１３６がターンごとに得られる。 The N-gram creation unit 184 for each turn extracts N-gram entries from the text of the minutes of each turn and counts the number of appearances. This results in an N-gram 186 for each turn. For each entry in the N-gram 186 per turn, a spoken N-gram 136 is obtained for each turn by applying the transformation model 122.

音素ラベリング処理部７８はさらに、審議音声コーパス４０内の各ターンを順番に選択し、ターンを特定する情報と、選択されたターンの音声とを出力するためのターン・音声選択部１３８と、ターン・音声選択部１３８が選択したターンを示す情報を受け、話し言葉用Ｎ−グラム１３６の中から、そのターンに対応するＮ−グラム１４２を選択するためのＮ−グラム選択部１４０と、初期音響モデル１３２及びＮ−グラム１４２を用い、特にＮ−グラム１４２を言語モデルとして用いて、ターン・音声選択部１３８の出力した発話音声の音声認識を行なって、その音声に、単語レベル及び音素レベルの認識結果を付して音素ラベル付音声データベース８０に出力するための音声認識装置１４４とを含む。 The phoneme labeling processing unit 78 further includes: a turn/speech selection unit 138 for selecting each turn in the deliberation speech corpus 40 in order and outputting information identifying the turn together with the speech of the selected turn; an N-gram selection unit 140 for receiving the information indicating the turn selected by the turn/speech selection unit 138 and selecting, from the spoken-language N-grams 136, the N-gram 142 corresponding to that turn; and a speech recognition device 144 for recognizing the utterance speech output by the turn/speech selection unit 138 using the initial acoustic model 132 and the N-gram 142, the latter as the language model, attaching the word-level and phoneme-level recognition results to the speech, and outputting it to the phoneme-labeled speech database 80.
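
The data flow through units 138, 140, and 144 can be sketched as a simple loop. Everything here is a hypothetical illustration: `label_corpus`, the record layout, and in particular the `recognize` callable, which stands in for whatever existing speech recognizer is used and is not a real library API.

```python
def label_corpus(turns, spoken_ngrams, recognize):
    """Label each turn's audio using the language model built for that turn.

    turns:         iterable of (turn_id, audio) pairs
    spoken_ngrams: dict mapping turn_id to that turn's spoken-style N-gram model
    recognize:     callable(audio, language_model) -> (words, phonemes); a stand-in
                   for a recognizer that also holds the initial acoustic model
    """
    database = []
    for turn_id, audio in turns:                     # turn/speech selection unit 138
        language_model = spoken_ngrams[turn_id]      # N-gram selection unit 140
        words, phonemes = recognize(audio, language_model)  # speech recognition device 144
        database.append(
            {"turn": turn_id, "audio": audio, "words": words, "phonemes": phonemes}
        )
    return database
```

Passing the recognizer in as a parameter keeps the sketch independent of any particular decoder; the only structural point it makes is that the language model changes from turn to turn while the acoustic model stays fixed.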

音声認識装置１４４には、既存の統計的音声認識装置を用いることができる。ここでは単語レベル及び音素レベルの認識結果を出力するものを用いるが、音素レベルの結果のみを出力するものでもよい。音声認識装置１４４は、発話中のポーズにより、最長で３０秒程度の短い発話区間に分割した形で認識結果の付された音声データを出力する。以降の学習はこの区間を単位として行なう。 An existing statistical speech recognition device can be used as the speech recognition device 144. Here a device that outputs word-level and phoneme-level recognition results is used, but one that outputs only phoneme-level results may also be used. The speech recognition device 144 outputs the speech data with recognition results attached, divided at pauses in the utterance into short utterance segments of at most about 30 seconds. Subsequent training is performed in units of these segments.

このようにして得られた音素ラベル付音声データベース８０の各音素ラベルは、話し言葉には出現するが文書スタイルでは出現しないような音素列の出現確率を考慮して決定されている。しかもターンごとに、そのターンのみについて学習されたＮ−グラムを用いているため、音声認識の精度、すなわち付与される音素ラベルの精度は高くなる。その上、審議音声コーパス４０に大量の音声が存在する場合にも、その全てに対して、自動的に高精度で音素ラベルを付与することができる。 Each phoneme label in the speech database with phoneme label 80 obtained in this way is determined in consideration of the appearance probability of a phoneme string that appears in spoken language but does not appear in the document style. In addition, since the N-gram learned only for that turn is used for each turn, the accuracy of speech recognition, that is, the accuracy of the phoneme label to be applied is increased. In addition, even when a large amount of speech exists in the deliberation speech corpus 40, phoneme labels can be automatically assigned to all of them.

したがって、この音素ラベル付音声データベース８０から、図１に示す音響モデル学習部８２によって通常の方法で音響モデル４８を作成すると、音声認識装置５２による認識結果の精度が高くなることが十分に期待できる。 Therefore, when the acoustic model 48 is created from this phoneme-labeled speech database 80 by the acoustic model learning unit 82 shown in FIG. 1 using an ordinary method, the recognition accuracy of the speech recognition device 52 can reasonably be expected to improve.

一方、音声認識装置５２が使用する言語モデル５０も、会議録４２中に出現するＮ−グラムについて、変換モデル１２２を適用して得られたものであり、話し言葉に特有の音素列の発生確率が算入されたものである。 Meanwhile, the language model 50 used by the speech recognition device 52 is also obtained by applying the conversion model 122 to the N-grams appearing in the minutes 42, and thus incorporates the occurrence probabilities of phoneme sequences peculiar to spoken language.

このように、話し言葉特有の音素列の発生確率を考慮して得られた音響モデル４８及び言語モデル５０を使用するため、音声認識装置５２は、話し言葉においてよく発生する事象、すなわちフィラーの挿入、言い淀み、発音の怠けなどにもかかわらず、審議音声コーパス４０の高精度な書き起こしを出力することができる。 Because the acoustic model 48 and the language model 50, both obtained with the occurrence probabilities of phoneme sequences peculiar to spoken language taken into account, are used in this way, the speech recognition device 52 can output a highly accurate transcript of the deliberation speech corpus 40 despite events that occur frequently in spoken language, such as filler insertion, hesitation, and lazy pronunciation.

図４は、対応付けコーパス７６中の２つの文例を示す。図４において、審議音声コーパス４０では発話されているが会議録４２では削除されている音声を図４（Ａ）の発話１６０の先頭の「{えー}」のように中カッコ{ }で囲んで示してある。審議音声コーパス４０では発話されていないが会議録４２では挿入されている音声は、図４（Ｂ）の発話１６２内の「いただいて（い）るつもりで…」のようにカッコ（）で囲んで示してある。審議音声コーパス４０の発話での表現が会議録４２では他の表現に変えられている部分は、発話１６０内の「{んで／ので}」のように、全体を中カッコで囲み、審議音声コーパス４０での表現を「／」の前に、会議録４２での表現を「／」の後に、それぞれ示してある。 FIG. 4 shows two example sentences from the association corpus 76. In FIG. 4, speech that is uttered in the deliberation speech corpus 40 but deleted in the minutes 42 is shown enclosed in curly braces { }, like the “{えー}” at the beginning of utterance 160 in FIG. 4(A). Text that is not uttered in the deliberation speech corpus 40 but inserted in the minutes 42 is shown enclosed in parentheses ( ), as in “いただいて（い）るつもりで…” in utterance 162 in FIG. 4(B). Where an expression uttered in the deliberation speech corpus 40 is replaced with another expression in the minutes 42, the whole is enclosed in curly braces, like “{んで／ので}” in utterance 160, with the expression in the deliberation speech corpus 40 before the “／” and the expression in the minutes 42 after it.
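
The bracket notation described for FIG. 4 is regular enough to be split back into its two sides mechanically. The following Python sketch assumes the annotated sentence is available as a plain string and that brackets are never nested; the function name `split_styles` and the regular expression are illustrative, not part of the source.

```python
import re

# Three annotation forms: {spoken/written}, {spoken only}, (written only).
# Both half-width and full-width slashes and parentheses are accepted.
ANNOTATION = re.compile(
    r"\{([^{}/／]*)[/／]([^{}]*)\}"   # substitution: groups 1 and 2
    r"|\{([^{}]*)\}"                 # deleted in the minutes: group 3
    r"|[(（]([^()（）]*)[)）]"        # inserted in the minutes: group 4
)

def split_styles(aligned):
    """Return (spoken_text, written_text) recovered from one annotated sentence."""
    spoken, written, pos = [], [], 0
    for m in ANNOTATION.finditer(aligned):
        plain = aligned[pos:m.start()]      # unannotated text belongs to both sides
        spoken.append(plain)
        written.append(plain)
        sub_spoken, sub_written, deleted, inserted = m.groups()
        if sub_spoken is not None:
            spoken.append(sub_spoken)
            written.append(sub_written)
        elif deleted is not None:
            spoken.append(deleted)
        else:
            written.append(inserted)
        pos = m.end()
    tail = aligned[pos:]
    return "".join(spoken) + tail, "".join(written) + tail

spoken_text, written_text = split_styles("{えー}それ{んで／ので}いただいて（い）る")
```

A limitation of this sketch is that any literal parentheses in the minutes would be misread as insertions, so a real corpus would need an escaping convention.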

この対応付けコーパスは、書き起こし７０と部分会議録７２とを別の言語によるものと考えたときの翻訳モデル作成のためのパラレルコーパスと考えることができる。通常、翻訳モデルでは、単語の挿入、削除、置換に加え、順序の入替えという編集を考えるが、ここでは言語自体は同一限度であるため、順序の入替えは考えていない。 This association corpus can be regarded as a parallel corpus for building a translation model in which the transcript 70 and the partial minutes 72 are treated as being in different languages. Translation models normally consider reordering as an edit operation in addition to insertion, deletion, and substitution of words, but reordering is not considered here because the language itself is the same.

［話し言葉／書き言葉変換モデル学習部１２０のプログラム構造］ [Program structure of spoken/written language conversion model learning unit 120]
図５を参照して、話し言葉／書き言葉変換モデル学習部１２０による変換モデル１２２の学習処理を実現するコンピュータプログラムは、利用者からの処理開始の指示に応答してプログラムの実行を開始し、記憶領域の確保、変数のクリアなどの初期設定を行なうステップ１９０と、対応付けコーパス７６のファイルをオープンするステップ１９２と、繰返し変数ｉに０を代入するステップ１９４とを含む。 Referring to FIG. 5, the computer program that implements the training of the conversion model 122 by the spoken/written language conversion model learning unit 120 includes: step 190 of starting execution in response to a user instruction to start processing and performing initial settings such as securing storage areas and clearing variables; step 192 of opening the file of the association corpus 76; and step 194 of assigning 0 to a loop variable i.

繰返し変数ｉは、対応付けコーパス７６のうち、処理対象となっている単語の位置を示す変数であり、０から１ずつ増加する。以下、変数ｉによって示される位置の単語を「単語（ｉ）」と書く。 The repetition variable i is a variable indicating the position of the word to be processed in the association corpus 76 and increases by 1 from 0. Hereinafter, the word at the position indicated by the variable i is written as “word (i)”.

このプログラムはさらに、変数ｉの値が対応付けコーパス７６中の全単語の数より大きくなったか否かを判定し、判定結果に応じて制御の流れを分岐させるステップ１９６と、ステップ１９６の判定結果がＮＯのときに実行され、対応付けコーパス７６の中で、部分会議録７２の単語（ｉ）を先頭とするユニグラム、バイグラム、及びトライグラムの計数にそれぞれ１ずつ加算するステップ１９８と、変数ｉに１を加算して制御をステップ１９６に戻すステップ２００とを含む。ステップ１９６からステップ２００の処理を、対応付けコーパス７６中の全単語に対して実行することにより、部分会議録７２のＮ−グラムモデルが作成される。 The program further includes: step 196 of determining whether the value of the variable i has exceeded the total number of words in the association corpus 76 and branching the flow of control according to the result; step 198, executed when the result of step 196 is NO, of adding 1 to the count of each of the unigram, bigram, and trigram that begin with word (i) of the partial minutes 72 in the association corpus 76; and step 200 of adding 1 to the variable i and returning control to step 196. Executing steps 196 through 200 for all words in the association corpus 76 creates an N-gram model of the partial minutes 72.

このプログラムは更に、ステップ１９６での判定結果がＹＥＳのときに実行され、対応付けコーパス７６の読出位置を先頭に再設定するステップ２０２と、ステップ２０２に続き、部分会議録７２で計算されたユニグラム、バイグラム、トライグラムの各々について、書き起こし７０ではどのように変化しているかを集計することにより、変換モデル１２２を計算するステップ２０４と、ステップ２０４で計算された変換モデル１２２をファイルとして出力し、処理を終了するステップ２０６とを含む。 The program further includes: step 202, executed when the result of step 196 is YES, of resetting the read position of the association corpus 76 to the beginning; step 204, following step 202, of computing the conversion model 122 by tallying, for each unigram, bigram, and trigram computed for the partial minutes 72, how it changes in the transcript 70; and step 206 of outputting the conversion model 122 computed in step 204 as a file and ending the process.
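
The counting loop of steps 194 to 200 can be sketched as follows. This is an illustration only: it assumes the minutes side of the association corpus is already available as a list of words, uses 0-based positions (so the loop ends when i reaches the word count), and the function name `count_ngrams_by_position` is hypothetical.

```python
from collections import Counter

def count_ngrams_by_position(words):
    """For each position i, count the unigram, bigram and trigram starting at word(i)."""
    counts = Counter()
    i = 0                                    # step 194: i <- 0
    while i < len(words):                    # step 196: all words processed?
        for n in (1, 2, 3):                  # step 198: add 1 to each N-gram count
            if i + n <= len(words):          # skip N-grams that would run past the end
                counts[tuple(words[i:i + n])] += 1
        i += 1                               # step 200: i <- i + 1
    return counts

counts = count_ngrams_by_position(["<sp>", "この", "法案", "<sil>"])
```

The explicit `while` loop mirrors the flowchart; an idiomatic Python version would normally iterate with `range` instead.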

［ターンごとＮ−グラム作成部１８４及びＮ−グラム変換部１８８のプログラム構造］ [Program structure of per-turn N-gram creation unit 184 and N-gram conversion unit 188]
図６を参照して、ターンごとＮ−グラム作成部１８４及びＮ−グラム変換部１８８を実現するためのコンピュータプログラムは、プログラムの実行開始とともに、必要な記憶領域の確保及び初期化などの初期設定を行なうステップ２１０と、繰返し変数ｉに０を代入するステップ２１２と、繰返し変数ｉを処理対象の部分会議録７２に含まれるターン数と比較することにより、全ターンの処理が終了したか否かを判定し、判定結果により制御の流れを分岐させるステップ２１４とを含む。 Referring to FIG. 6, the computer program for implementing the per-turn N-gram creation unit 184 and the N-gram conversion unit 188 includes: step 210 of performing initial settings, such as securing and initializing the necessary storage areas, when execution starts; step 212 of assigning 0 to a loop variable i; and step 214 of determining whether all turns have been processed by comparing the loop variable i with the number of turns contained in the partial minutes 72 being processed, and branching the flow of control according to the result.

このプログラムはさらに、ステップ２１４の判定結果がＮＯの場合に実行され、ターン（ｉ）の会議録を部分会議録７２から読出すステップ２１６と、ステップ２１６で読出されたターン（ｉ）の会議録のＮ−グラムを作成し、所定の記憶媒体に出力するステップ２１８と、ステップ２１８に続き、繰返し変数ｉの値に１を加算し、制御をステップ２１４に戻すステップ２２０とを含む。 The program further includes: step 216, executed when the result of step 214 is NO, of reading the minutes of turn (i) from the partial minutes 72; step 218 of creating the N-gram of the minutes of turn (i) read in step 216 and outputting it to a predetermined storage medium; and step 220, following step 218, of adding 1 to the loop variable i and returning control to step 214.

このプログラムはさらに、ステップ２１４の判定結果がＹＥＳの場合に実行され、変換モデル１２２を外部記憶媒体から主記憶装置に読出すステップ２２２と、繰返し変数ｉに０を代入するステップ２２４と、繰返し変数ｉの値と部分会議録７２に含まれるターン数との比較により、部分会議録７２の内の全ターンの会議録についてＮ−グラムの変換（文書スタイル→話し言葉スタイルの変換）を行なったか否かを判定し、判定結果に応じて制御の流れを分岐させるステップ２２６と、ステップ２２６において、部分会議録７２の内の会議録についてのＮ−グラムの変換が完了していないと判定されたことに応答して実行され、ターン（ｉ）のＮ−グラムの全てについて、変換モデル１２２を適用することにより話し言葉スタイルにおける確率の推定値を再計算し更新するステップ２３０と、繰返し変数ｉに１を加算して制御をステップ２２６に戻すステップ２３２とを含む。 The program further includes: step 222, executed when the result of step 214 is YES, of reading the conversion model 122 from an external storage medium into main memory; step 224 of assigning 0 to the loop variable i; step 226 of determining, by comparing the value of the loop variable i with the number of turns contained in the partial minutes 72, whether N-gram conversion (document style → spoken style) has been performed for the minutes of all turns in the partial minutes 72, and branching the flow of control according to the result; step 230, executed in response to a determination in step 226 that N-gram conversion has not been completed for the minutes in the partial minutes 72, of recomputing and updating the estimated spoken-style probability for every N-gram of turn (i) by applying the conversion model 122; and step 232 of adding 1 to the loop variable i and returning control to step 226.
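
The two passes of FIG. 6 (per-turn counting, then per-turn conversion) can be sketched together. This is a minimal illustration under several assumptions: each turn is a list of tokens, the conversion model is a dict mapping a document-style N-gram w to `{v: ratio}` where the ratio stands for p(v|w)/p(w|v), and N-grams without a known pattern are carried over unchanged. The names `turn_ngrams` and `build_spoken_ngrams` are hypothetical.

```python
from collections import Counter

def turn_ngrams(tokens, max_n=3):
    """Count all 1- to max_n-grams in one turn of the minutes."""
    return Counter(
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )

def build_spoken_ngrams(turns, conversion_model):
    """Return one spoken-style count table per turn."""
    # First pass (steps 212-220): document-style N-gram counts for every turn.
    doc_counts = [turn_ngrams(tokens) for tokens in turns]
    # Second pass (steps 224-232): apply the conversion model turn by turn.
    spoken = []
    for counts in doc_counts:
        converted = Counter()
        for w, n_w in counts.items():
            # Assumption: an N-gram with no entry in the model is kept as-is.
            for v, ratio in conversion_model.get(w, {w: 1.0}).items():
                converted[v] += n_w * ratio
        spoken.append(converted)
    return spoken

example_model = {("この", "法案"): {("この", "法案"): 0.9, ("この", "えー", "法案"): 0.1}}
example = build_spoken_ngrams([["この", "法案"]], example_model)
```

The result holds fractional counts; turning them into probabilities, with whatever smoothing the recognizer expects, would be a separate step.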

［コンピュータシステムによる実現］ [Implementation by computer system]
上に構造を説明した会議録作成システム３０は、実質的にはコンピュータにより実現される。会議録作成システム３０の全体を１台のコンピュータ上に実装することも可能である。しかし、音響モデル４８及び言語モデル５０は大量の審議音声コーパス４０及び会議録４２を使用して学習するものであるのに対し、会議録作成には審議音声コーパス４０及び会議録４２は不要である。したがって、両者を分離する方がメンテナンス上都合がよい。また、変換モデルの学習及び音響モデルの学習は、システムの性能に大きな影響を及ぼすため、システムのユーザではなく、システムの管理者又は行なう方が好ましい。 The conference record creation system 30 whose structure is described above is in practice implemented by computers. The whole of the conference record creation system 30 can be implemented on a single computer. However, whereas the acoustic model 48 and the language model 50 are trained using the large deliberation speech corpus 40 and minutes 42, creating minutes does not require the deliberation speech corpus 40 or the minutes 42. Separating the two is therefore more convenient for maintenance. In addition, because training of the conversion model and of the acoustic model has a large effect on system performance, it is preferably carried out by the system administrator rather than by a user of the system.

したがって、本実施の形態に係る会議録作成システム３０は、図７に示されるように、音響モデル４８及び言語モデル変換部６０の学習を行なう学習用コンピュータシステム２５０と、コンピュータシステム２５０により学習が行なわれた音響モデル４８及び言語モデル５０を使用して、審議音声を音声認識し書き起こしを出力する処理を行なう会議録作成用コンピュータシステム３００とを含む。当業者には容易に分かるように、会議録作成用コンピュータシステム３００を複数使用すれば、共通の音響モデル４８及び言語モデル５０を用いて、複数の委員会における審議の会議録を作成することができる。 Accordingly, as shown in FIG. 7, the conference record creation system 30 according to this embodiment includes a learning computer system 250 that trains the acoustic model 48 and the language model conversion unit 60, and a conference record creating computer system 300 that uses the acoustic model 48 and the language model 50 trained by the computer system 250 to recognize deliberation speech and output a transcript. As those skilled in the art will readily appreciate, using multiple conference record creating computer systems 300 makes it possible to create minutes of the deliberations of multiple committees with a common acoustic model 48 and language model 50.

図８を参照して、学習用コンピュータシステム２５０は、コンピュータ２６０と、いずれもコンピュータ２６０に接続されるモニタ２６２、キーボード２６６、マウス２６８、マイクロホン２９０及び一対のスピーカ２５８とを含む。コンピュータ２６０には、ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ）の再生及び記録が可能なＤＶＤドライブ２７０と、所定の規格にしたがった半導体メモリ記憶装置が装着可能なメモリポート２７２とが備えられている。コンピュータ２６０の内部構成については図９を参照して後述する。 Referring to FIG. 8, a learning computer system 250 includes a computer 260, a monitor 262, a keyboard 266, a mouse 268, a microphone 290, and a pair of speakers 258 that are all connected to the computer 260. The computer 260 is provided with a DVD drive 270 capable of reproducing and recording a DVD (Digital Versatile Disc), and a memory port 272 into which a semiconductor memory storage device according to a predetermined standard can be mounted. The internal configuration of the computer 260 will be described later with reference to FIG.

図９を参照して、コンピュータ２６０は、図８に示すＤＶＤドライブ２７０及びメモリポート２７２に加え、ＣＰＵ（中央演算処理装置）２７６と、ＣＰＵ２７６に接続されたバス２８６と、いずれもバス２８６に接続されたＲＯＭ（読出専用メモリ）２７８、ＲＡＭ（ランダムアクセスメモリ）２８０、大容量ハードディスク２７４、ネットワークインターフェイス２９６、及びサウンドボード２８８を含む。 Referring to FIG. 9, in addition to DVD drive 270 and memory port 272 shown in FIG. 8, computer 260 has a CPU (Central Processing Unit) 276 and a bus 286 connected to CPU 276, both connected to bus 286. ROM (Read Only Memory) 278, RAM (Random Access Memory) 280, large capacity hard disk 274, network interface 296, and sound board 288.

ＤＶＤドライブ２７０には、ＤＶＤ２８２が装着される。メモリポート２７２には半導体メモリ２８４が装着される。ＣＰＵ２７６は、バス２８６並びにＤＶＤドライブ２７０及びメモリポート２７２をそれぞれ介して、ＤＶＤ２８２及び半導体メモリ２８４をアクセスし、データの読出及び書込を行なえる。 A DVD 282 is attached to the DVD drive 270. A semiconductor memory 284 is attached to the memory port 272. The CPU 276 accesses the DVD 282 and the semiconductor memory 284 via the bus 286, the DVD drive 270, and the memory port 272, respectively, and can read and write data.

キーボード２６６、マウス２６８、モニタ２６２は、いずれも図示しないインターフェイスを介してコンピュータ２６０のバス２８６に接続される。スピーカ２５８及びマイクロホン２９０は、サウンドボード２８８に接続される。 The keyboard 266, the mouse 268, and the monitor 262 are all connected to the bus 286 of the computer 260 via an interface (not shown). The speaker 258 and the microphone 290 are connected to the sound board 288.

上記実施の形態における審議音声コーパス４０、会議録４２、部分コーパス６８、書き起こし７０、部分会議録７２、対応付けコーパス７６、変換モデル１２２、音素ラベル付音声データベース８０、音響モデル４８、言語モデル５０及び５８等は、ＲＡＭ２８０、大容量ハードディスク２７４、ＤＶＤ２８２、半導体メモリ２８４のいずれでも実現できる。実際には、格納するデータの容量、読出し、書込みに要求される速度などによって、最も効率のよい記憶装置が各記憶部を実現するために選択される。本実施の形態では、これらは大容量ハードディスク２７４に記憶され、利用時にＲＡＭ２８０にロードされる。 The discussion speech corpus 40, conference record 42, partial corpus 68, transcript 70, partial conference record 72, association corpus 76, conversion model 122, phoneme-labeled speech database 80, acoustic model 48, language model 50 in the above embodiment. And 58 can be realized by any of the RAM 280, the large-capacity hard disk 274, the DVD 282, and the semiconductor memory 284. Actually, the most efficient storage device is selected to realize each storage unit depending on the capacity of data to be stored, the speed required for reading and writing, and the like. In the present embodiment, these are stored in the large-capacity hard disk 274 and loaded into the RAM 280 when used.

図１０を参照して、本実施の形態に係る会議録作成システム３０で用いられる会議録作成用コンピュータシステム３００は、コンピュータ３１０と、いずれもコンピュータ３１０に接続された、モニタ３２０、キーボード３２２、マウス３２４、マイク３２８及び一対のスピーカ３２６とを含む。図示していないが、コンピュータ３１０にはヘッドホン接続端子が設けられており、ヘッドホンによる音声の再生を行なうこともできる。コンピュータ３１０には、図１に示す音声認識装置５２を実現するための音声認識プログラムと、この音声認識プログラムにより出力される審議録ファイルを編集するための編集プログラムとが予めインストールされている。さらに、コンピュータ３１０は、大容量のＨＤＤを持ち、コンピュータシステム２５０からネットワークを介して受信した音響モデル４８及び言語モデル５０をこのＨＤＤに記憶することができる。 Referring to FIG. 10, a conference record creation computer system 300 used in the conference record creation system 30 according to the present embodiment includes a computer 310, a monitor 320, a keyboard 322, and a mouse, all connected to the computer 310. 324, a microphone 328, and a pair of speakers 326. Although not shown, the computer 310 is provided with a headphone connection terminal, so that sound can be reproduced by the headphone. The computer 310 is preinstalled with a speech recognition program for realizing the speech recognition apparatus 52 shown in FIG. 1 and an editing program for editing the proceedings file output by the speech recognition program. Furthermore, the computer 310 has a large-capacity HDD, and can store the acoustic model 48 and the language model 50 received from the computer system 250 via the network in this HDD.

The hardware configuration of the conference record creation computer system 300 is the same as that shown in FIG. 9, so its details are not repeated here.

[Operation]
The conference record creation system 30, whose configuration is described above, operates as follows. Its operation is divided into several phases, which are described in order below.

- Creation of the association corpus 76 -
Referring to FIG. 1, the association corpus 76 is first created on the computer system 250 from the existing deliberation speech corpus 40 and conference record 42. The partial corpus 68 is extracted manually from the deliberation speech corpus 40, and the corresponding partial conference record 72 is extracted from the conference record 42. The partial corpus 68 is played back, and a faithful transcript 70 of the deliberation speech is created manually for each turn. The association corpus creation process 74, which is also manual, is then applied to the transcript 70 and the partial conference record 72 to create the association corpus 76.

Here the association corpus 76 is created after the transcript 70 has been created. Alternatively, the association corpus 76 may be created by editing the partial conference record 72 directly on screen while the partial corpus 68 is played back.

The completed association corpus 76 is stored on the large-capacity hard disk 274.

- Creation of the conversion model 122 -
The association corpus 76 pairs a faithful transcript of the spoken-style partial corpus 68 with the formatted (document-style) partial conference record 72. In the present embodiment it has the format shown in FIG. 4. The spoken/written language conversion model learning unit 120 creates ordinary N-grams for the partial conference record 72 portion of the association corpus 76 (FIG. 5, steps 196-200). For each N-gram entry, the unit then examines the corresponding portion of the transcript 70, counts each variant that differs from the entry, and, once all counts are complete, obtains the conversion model 122 by calculating the proportion of each variant for each entry (step 204).

This process is performed, for example, as follows. Suppose the trigram w = "<sp>この法案" ("<sp> this bill") appears 500 times in the partial conference record 72, and that in the transcript 70, 50 of those occurrences appear as v = "<sp>えーこの法案" ("<sp> er, this bill"). In this case p(v|w) = 50/500. The spoken/written language conversion model learning unit 120 counts the number of occurrences of v (50 in this case). If the trigram w has no other variants, then the 500 occurrences of the document-style trigram w correspond to the following spoken-style occurrence counts: 50 for "<sp>えーこの法案" and 450 (= 500 - 50) for "<sp>この法案".

In this way, for each N-gram entry obtained from the association corpus 76, the spoken/written language conversion model learning unit 120 counts how often each of its variants occurs in the transcript 70. From these counts, the conversion coefficient of equation (4) is calculated for each spoken-style N-gram that appears in the transcript 70. The result is the conversion model 122, which is output to and stored on the HDD (FIG. 5, step 206).
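The counting step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the association corpus has already been reduced to pairs (w, v) of a document-style N-gram and its spoken-style realization, and the function name `learn_conversion_model` and the token segmentation of the example are hypothetical.

```python
from collections import Counter, defaultdict

def learn_conversion_model(aligned_pairs):
    """Estimate p(v | w) as a relative frequency.

    aligned_pairs: iterable of (w, v) tuples, one per occurrence of the
    document-style N-gram w, where v is the corresponding span in the
    faithful transcript (v == w when the wording is unchanged).
    """
    doc_counts = Counter()                 # occurrences of each w
    variant_counts = defaultdict(Counter)  # occurrences of each v given w
    for w, v in aligned_pairs:
        doc_counts[w] += 1
        variant_counts[w][v] += 1
    return {
        w: {v: n / doc_counts[w] for v, n in variants.items()}
        for w, variants in variant_counts.items()
    }

# Worked example from the text: 500 occurrences of w, 50 realized with a filler.
# The segmentation into tokens is an assumption made for illustration.
w = ("<sp>", "この", "法案")
v = ("<sp>", "えー", "この", "法案")
pairs = [(w, v)] * 50 + [(w, w)] * 450
model = learn_conversion_model(pairs)
```

With these counts, `model[w][v]` is 50/500 and `model[w][w]` is 450/500, matching the figures in the example above.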

- Phoneme labeling of the deliberation speech corpus 40 -
Once the conversion model 122 has been obtained as described above, phoneme labels can be assigned to the deliberation speech corpus 40 as follows.

First, as shown in FIG. 3, the initial acoustic model learning unit 130 trains the initial acoustic model 132 by an ordinary method, using the partial corpus 68 and the partial conference record 72. Next, for each turn of the conference record 42, the per-turn N-gram creation unit 184 produces per-turn N-grams 186 (see FIG. 3; FIG. 6, steps 214-220). The N-gram conversion unit 188 then applies the conversion model 122 to the per-turn N-grams 186, yielding spoken-language N-grams 136 for each turn.

The turn/speech selection unit 138 selects each turn of the deliberation speech corpus 40 in order and passes the turn information to the N-gram selection unit 140. According to that turn information, the N-gram selection unit 140 selects, from the spoken-language N-grams 136, the spoken-language N-gram obtained from the selected turn and passes it to the speech recognition device 144 as the N-gram 142. Meanwhile, the turn/speech selection unit 138 passes the speech data of the selected turn to the speech recognition device 144.

The speech recognition device 144 uses the N-gram 142 as its language model and the initial acoustic model 132 to recognize the speech selected from the deliberation speech corpus 40, and attaches the recognition result to that speech data as phoneme labels. For each turn, recognition uses as its language model the N-gram 142 that was derived from that same turn and converted for spoken language. As a result, a recognition result faithful to what was actually spoken is obtained for every turn of the deliberation speech corpus 40. In other words, the phoneme-labeled speech database 80 produced by the phoneme labeling processing unit 78 is a speech corpus with accurate phoneme labels that are faithful to spoken pronunciation. Moreover, phoneme labels can be assigned automatically in this way to all of the speech in the deliberation speech corpus 40.
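The per-turn flow described above (select a turn, select that turn's converted N-gram, recognize, attach labels) can be sketched as a simple loop. This is only an outline under stated assumptions: the data layout, the function `label_corpus`, and the `recognize` callback are hypothetical, and the callback stands in for a real decoder running with the initial acoustic model.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical representation of one turn: (turn_id, raw audio bytes).
Turn = Tuple[str, bytes]

def label_corpus(
    turns: Iterable[Turn],
    spoken_ngrams: Dict[str, dict],
    recognize: Callable[[bytes, dict], List[str]],
) -> Dict[str, List[str]]:
    """Attach phoneme labels to every turn using that turn's own language model."""
    labels: Dict[str, List[str]] = {}
    for turn_id, audio in turns:           # role of the turn/speech selection unit
        lm = spoken_ngrams[turn_id]        # role of the N-gram selection unit
        labels[turn_id] = recognize(audio, lm)  # role of the recognizer
    return labels

# Stub recognizer for illustration only; a real system would run a decoder here.
def fake_recognize(audio: bytes, lm: dict) -> List[str]:
    return [lm["name"], str(len(audio))]

turns = [("turn-001", b"\x00\x01"), ("turn-002", b"\x00\x01\x02")]
lms = {"turn-001": {"name": "lm-001"}, "turn-002": {"name": "lm-002"}}
labeled = label_corpus(turns, lms, fake_recognize)
```

Passing the recognizer in as a callback keeps the loop independent of any particular decoder, which matches the text's point that existing recognition software can be reused.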

- Training of the acoustic model 48 -
The phoneme-labeled speech database 80 obtained above is a speech corpus with phoneme labels faithful to the spoken language. Training on this database therefore yields an acoustic model 48 suited to recognizing spoken language. Because the labels are already faithful to the spoken language, the acoustic model learning unit 82 only needs to perform ordinary acoustic model training.

- Training of the language model 50 -
Separately from the training of the acoustic model 48, the language model 50 is trained as follows. The language model learning unit 46 trains the language model 58 by an ordinary language model training method, using the conference record 42 as training data. In the present embodiment, unigrams, bigrams, and trigrams are used as the language model.

The language model conversion unit 60 then applies the conversion model 122 to each N-gram in the language model 58, converting it into the language model 50 for spoken language. In the converted language model 50, part of the occurrence probability of each document-style N-gram is reallocated to the spoken-style N-grams derived from it, and the probability of the document-style N-gram is discounted by that amount.
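The reallocation of probability mass described above can be illustrated with a simplified sketch. It assumes the conversion model is a table of p(v | w) values and that each document-style probability is split among its variants in proportion to those values. This is not necessarily the exact form of equation (4), and the function name `convert_language_model` and the example numbers are hypothetical.

```python
def convert_language_model(doc_probs, conversion_model):
    """Split each document-style N-gram probability among its spoken-style variants.

    doc_probs: {w: p(w)} for document-style N-grams.
    conversion_model: {w: {v: p(v | w)}}; entries absent from the model are
    assumed to be spoken exactly as written.
    """
    spoken = {}
    for w, p_w in doc_probs.items():
        variants = conversion_model.get(w, {w: 1.0})
        for v, p_v_given_w in variants.items():
            # Mass moved to v is exactly the amount discounted from w.
            spoken[v] = spoken.get(v, 0.0) + p_w * p_v_given_w
    return spoken

# Illustrative numbers only (not taken from the experiments).
w = ("<sp>", "この", "法案")
v = ("<sp>", "えー", "この", "法案")
other = ("法案", "を", "提出")
doc_probs = {w: 0.02, other: 0.01}
conversion = {w: {v: 0.1, w: 0.9}}
spoken_probs = convert_language_model(doc_probs, conversion)
```

In this sketch the total probability mass is unchanged: one tenth of the mass of w moves to the filler variant v, and N-grams without an entry in the conversion model keep their original probability.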

- Creation of a new transcript -
The acoustic model 48 and the language model 50 obtained on the computer system 250 in this way are transmitted to, and stored on, the conference record creation computer system 300. The speech recognition device 52 of the conference record creation computer system 300 recognizes newly recorded deliberation speech 54 using the acoustic model 48 and the language model 50, and outputs the recognition result as a new transcript 56.

When the acoustic model 48 is trained, the entire deliberation speech corpus 40 can be used as training data, so the acoustic model 48 can cover a wide variety of spoken expressions. In addition, the language model 50 assigns expressions peculiar to spoken language occurrence probabilities that reflect the comparison between the transcript 70 and the partial conference record 72. Compared with using the document-style-only language model 58, this improves recognition accuracy for spoken-style utterances.

As described above, with the conference record creation system 30 according to this embodiment, once the transcript 70 has been created from the partial corpus 68 (a part of the deliberation speech corpus 40) and combined with the corresponding partial conference record 72 to create the association corpus 76, the remaining steps can be performed automatically: phoneme labeling of the deliberation speech corpus 40, training of the acoustic model 48, and training of the language model 50. Even when the circumstances of the deliberation speech change considerably, for example after a change of government, the acoustic model 48 and the language model 50 can be rebuilt automatically once the association corpus 76 has been created manually. As a result, the speech recognition device 52 can produce an accurate transcript even for deliberation speech 54 recorded under the new circumstances.

The computer program for realizing the conference record creation system 30 according to the above embodiment may be a single program or a combination of several programs. However, when the conference record creation system 30 is split across two computer systems, as in the above embodiment, the programs must also be separate. For individual functions of the units described above, such as the N-gram creation performed in the spoken/written language conversion model learning unit 120 shown in FIG. 1, the language model creation performed in the language model learning unit 46, and the acoustic model training performed by the initial acoustic model learning unit 130 and the acoustic model learning unit 82, widely available existing programs can be used as they are. Because such programs are written for general use, some adjustment is required, but this remains within what a person of ordinary skill in the art can readily achieve for the purpose at hand.

These programs are stored on a storage medium such as the DVD 282, or distributed over a network such as the Internet 252, and are normally stored in a nonvolatile external storage device such as the large-capacity hard disk 274. At execution time they are copied from the large-capacity hard disk 274 to the RAM 280, and the CPU 276 executes the instructions read from the address indicated by a register called the program counter (not shown) in the CPU 276, thereby realizing the intended functions. Since the operation of the computer hardware itself is well known, it is not described in further detail here.

[Evaluation experiments]
- Experimental conditions -
The performance of a conference record creation system built according to the approach of the above embodiment was evaluated on deliberation speech from the House of Representatives.

The baseline acoustic model and the statistical conversion model were trained on data from 2003 and 2004. Manual transcripts exist for these data and were aligned with the conference record in advance. The speech data amount to 134 hours, and the text of the proceedings amounts to 1.8M words.

The acoustic features used for speech recognition are 12-dimensional MFCC (Mel-Frequency Cepstrum Coefficients), ΔMFCC, ΔΔMFCC, Δpower, and ΔΔpower, for a total of 38 dimensions (12 × 3 + 2).
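The 38-dimensional layout can be made concrete with a small sketch that stacks the five components per frame. The delta formula used here (a two-frame central difference with edge replication) is an assumption for illustration. The source does not state the delta window, and practical front ends typically use a regression over several frames. The function names are hypothetical.

```python
def deltas(seq):
    """Central difference d[t] = (x[t+1] - x[t-1]) / 2, replicating edge frames."""
    n = len(seq)
    out = []
    for t in range(n):
        prev, nxt = seq[max(t - 1, 0)], seq[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def stack_features(mfcc, power):
    """Build per-frame vectors: MFCC(12) + dMFCC(12) + ddMFCC(12) + dPower(1) + ddPower(1).

    mfcc: list of frames, each a list of 12 floats.
    power: list of per-frame (log) power values. Static power itself is not
    included in the output, matching the component list in the text.
    """
    d_mfcc = deltas(mfcc)
    dd_mfcc = deltas(d_mfcc)
    pw = [[p] for p in power]
    d_pw = deltas(pw)
    dd_pw = deltas(d_pw)
    return [
        m + dm + ddm + dp + ddp
        for m, dm, ddm, dp, ddp in zip(mfcc, d_mfcc, dd_mfcc, d_pw, dd_pw)
    ]

# Five synthetic frames, for illustration only.
frames = [[float(t + k) for k in range(12)] for t in range(5)]
power = [float(t * t) for t in range(5)]
features = stack_features(frames, power)
```

Each vector in `features` has 12 + 12 + 12 + 1 + 1 = 38 entries.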

- Phoneme label creation experiment -
A phoneme label creation experiment was conducted on House of Representatives deliberation speech from 2006 and 2007. The data comprise 26 conferences, 5,170 turns, and 91 hours. The acoustic model is the baseline HMM (Hidden Markov Model) trained on the 2003 and 2004 data (134 hours). The HMM has 3000 states and 16 mixture components and was trained with MPE. CMN (Cepstral Mean Normalization) and CVN (Cepstral Variance Normalization) were applied to the features. Speech recognition was performed with Julius (http://julius.sourceforge.jp/). Because a large amount of data had to be processed, lightweight search parameters were used (allowing roughly twice real time).

For comparison, phoneme label creation experiments were run with the following models. As the unit of the language model, a condition that builds one model per conference was compared with a condition that builds a separate model per turn. In addition to the method of the present embodiment ("conference record, spoken-language conversion"), the following were used: a baseline model for spoken language ("baseline"); a model built from the conference record alone ("conference record"); a biased LM that combines those two with a weight of 100 on the conference record ("biased LM"); and a model that only adds filler entries at the pause positions of the conference record model ("conference record, filler"). The baseline model was created by applying spoken-language conversion to seven years of conference records, from 1999 to 2005.

Table 1 shows the accuracy of the phoneme labels obtained by speech recognition. In Table 1, Corr. (word correct rate) and Acc. (word accuracy) are values computed with the manual transcripts as the reference.
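The source does not spell out how Corr. and Acc. are computed. The sketch below assumes the conventional definitions used in speech recognition scoring, Corr. = H / N and Acc. = (H - I) / N, where N is the number of reference words, H the number of correctly recognized words, and I the number of insertions in a minimum-edit-distance alignment. The function names are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for a minimum-cost alignment."""
    # dp[i][j] = (cost, H, S, D, I) for aligning ref[:i] with hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(ref) + 1):          # only deletions possible
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1, c[4])
    for j in range(1, len(hyp) + 1):          # only insertions possible
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2], c[3], c[4] + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (d[0], d[1] + 1, d[2], d[3], d[4])      # hit
            else:
                diag = (d[0] + 1, d[1], d[2] + 1, d[3], d[4])  # substitution
            u = dp[i - 1][j]
            dele = (u[0] + 1, u[1], u[2], u[3] + 1, u[4])      # deletion
            l = dp[i][j - 1]
            ins = (l[0] + 1, l[1], l[2], l[3], l[4] + 1)       # insertion
            dp[i][j] = min(diag, dele, ins, key=lambda x: x[0])
    _, h, s, d, i_ = dp[-1][-1]
    return h, s, d, i_

def corr_and_acc(ref, hyp):
    """Word correct rate and word accuracy against a reference transcript."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    h, _, _, ins = align_counts(ref, hyp)
    return h / len(ref), (h - ins) / len(ref)

ref = "a b c d".split()
hyp = "a x c d e".split()
corr, acc = corr_and_acc(ref, hyp)  # 3 hits, 1 substitution, 1 insertion
```

For this toy pair, Corr. is 3/4 and Acc. is 2/4. Accuracy is always at most the correct rate because it additionally penalizes insertions, which is consistent with the 92.1% and 94.0% figures reported for the embodiment.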

Referring to Table 1, under the per-conference condition, handling the spoken style with either the biased LM or the method of the above embodiment gave higher word accuracy than the model built from the conference record alone. However, for the 26 conferences, the method of the embodiment produced models of compact size (100 MB), whereas the biased LM required a very large size (1.6 GB). Applying the biased LM to per-turn processing is therefore considered impractical.

Under the per-turn condition, accuracy was higher overall than under the per-conference condition. The method of the present embodiment improved recognition accuracy by 8.6 points over using the conference record alone. The model that adds fillers to the word model obtained from the conference record ("conference record, filler") is a simple language model for spoken language: it handles only filler insertion among spoken-language phenomena and ignores context. The method of the present embodiment exceeded the "conference record, filler" model by 5.9 points in recognition accuracy. This shows that the statistical conversion model can properly estimate a spoken-language model from the conference record. The method of the present embodiment achieved an accuracy of 92.1% and a word correct rate of 94.0%.

An example of the phoneme labels created by the present embodiment is shown below.

In this example, the method of the present embodiment produced correct phoneme labels for the dropped particle "が" (ga) and for inserted fillers such as "いー" (ii). The inserted particle "に" (ni) was labeled incorrectly, but this pattern did not exist in the conversion rules in the first place, so the language model could not have been expected to predict it.

- Speech recognition experiment -
Training data were added using the phoneme labels created by the method of the above embodiment, and an acoustic model was trained on these data. The following speech recognition experiment was conducted with the trained acoustic model.

The baseline model is an acoustic model trained on the 2003 and 2004 data (134 hours) with phoneme labels derived from manual transcripts. The additional data are the 91 hours from 2006 and 2007 that were phoneme-labeled in the phoneme label creation experiment above. For comparison, training on the same additional data with manual phoneme labels was also evaluated. Training used two criteria: ML (maximum likelihood) and MPE (Minimum Phone Error). The HMM has 5000 states and 32 mixture components. CMN, CVN, and VTLN (Vocal Tract Length Normalization) were applied to the features. The test sets are the House of Representatives Budget Committee sessions of February 26 and 29, 2008 (2.4 hours, 121 turns) and of October 7, 2008 (3.9 hours, 211 turns).

Table 3 shows the word accuracy obtained in this experiment.

Referring to Table 3, with ML training the method of the present embodiment gave higher accuracy than the baseline on both test sets, reaching a level almost the same as with manual phoneme labels. With MPE training, accuracy likewise improved over the baseline and was again almost the same as with manual phoneme labels.

As described above, the present invention makes it possible to build and update an acoustic model at low cost through lightly supervised (semi-supervised) training based on statistical spoken-language conversion. Consequently, even when data are added to or replaced in the speech corpus used for acoustic model training, the acoustic model can be rebuilt easily and at low cost. This makes it easy to accommodate changes of speakers caused by cabinet reshuffles or general elections, as well as changes in how individual speakers talk.

The above embodiment concerns a system that automatically creates records of committee deliberations in the Diet. The present invention is not, however, limited to such an embodiment. For example, the system can also be applied to creating subtitles for broadcast programs or transcripts of university lectures.

In the above embodiment, the acoustic model 48 and the language model 50 are trained on the computer system 250, and the conference record creation computer system 300 only receives these models and creates the conference record. The present invention is not, however, limited to such an embodiment. For example, all of the functions described above may be built into a single computer system. Alternatively, of the programs executed on the computer system 250, only the function of the phoneme labeling processing unit 78 may be executed on another computer, with the computer system 250 receiving the phoneme-labeled speech database 80 and training the acoustic model 48. Similarly, the function of the spoken/written language conversion model learning unit 120 may be realized on a separate system.

The conference record creation system 30 of the above embodiment is, more generally, a speech recognition system that can generate, by speech recognition, a transcript faithful to what was said in the deliberations. The deliberation speech corpus is, more generally, a speech database of recorded utterances from deliberations, and it may go by any name. Likewise, the conference record is one example of document-style text, and any text will do as long as it was produced by a human transcribing and formatting the utterances.

The embodiments disclosed herein are merely examples, and the present invention is not limited to them. The scope of the present invention is defined by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and range of equivalents of the wording of the claims.

30 Conference record creation system
40 Deliberation speech corpus
42 Conference record
44 Acoustic model learning unit for speech recognition
46 Language model learning unit
48 Acoustic model
50 Language model
52, 144 Speech recognition device
54 Deliberation speech
56 Transcript
58 Language model
60 Language model conversion unit
68 Partial corpus
70 Transcript
72 Partial conference record
76 Association corpus
78 Phoneme labeling processing unit
80 Phoneme-labeled speech database
130 Initial acoustic model learning unit
132 Initial acoustic model
136 Spoken-language N-gram
138 Turn/speech selection unit
186 Per-turn N-gram
188 N-gram conversion unit

Claims

An acoustic model learning device comprising:
language model estimation means for estimating a language model of a spoken-style transcript faithful to the actual utterance content, from a language model trained on document-style text obtained by a human transcribing and formatting a speech database;
phoneme labeling means for attaching a transcript and its phoneme labels to the speech database by speech recognition using an initial acoustic model prepared in advance and the language model of the spoken-style transcript estimated by the language model estimation means; and
acoustic model learning means for training or updating an acoustic model for speech recognition, using as training data the speech database to which the phoneme labels have been attached by the phoneme labeling means.

The acoustic model learning device according to claim 1, wherein
the language model estimation means includes:
N-gram creation means for creating a per-turn N-gram language model from the document-style text corresponding to each turn of the speech database; and
means for estimating a spoken-language N-gram language model of the spoken-style transcript from each of the per-turn N-gram language models created by the N-gram creation means, and
the phoneme labeling means includes:
language model selection means for selecting, for each turn of the speech database, the corresponding N-gram language model from the spoken-language N-gram language models; and
speech recognition means for performing, for each turn of the speech database, speech recognition using the N-gram language model selected by the language model selection means and the initial acoustic model, and for attaching a transcript and its phoneme labels to each turn of the speech database.

The acoustic model learning device according to claim 1, further comprising conversion model learning means for learning, on the basis of an association corpus created from a spoken-style transcript of a part of the speech database and the part of the document-style text corresponding to that part, a conversion model that statistically represents the conversion from expressions in the document-style text to expressions in the spoken-style transcript, wherein
the language model estimation means includes means for estimating an N-gram language model of the spoken-style transcript by applying the conversion model to each of the per-turn N-gram language models.

The acoustic model learning device according to claim 1, wherein
the speech database is a deliberation speech corpus in which the speech of a meeting is recorded, and
the document-style text is the minutes of that meeting.

A speech recognition device comprising:
acoustic model storage means for storing an acoustic model for speech recognition trained by the acoustic model learning device according to any one of claims 1 to 4, using a predetermined speech database as training data; and
speech recognition means for performing speech recognition on input speech data using the acoustic model for speech recognition stored in the acoustic model storage means and a language model for speech recognition.

A computer program for acoustic model learning that causes a computer to function as:
language model estimation means for estimating a language model of a spoken-style transcript faithful to the actual utterance content, from a language model trained on document-style text obtained by a human transcribing and formatting a speech database;
phoneme labeling means for attaching a transcript and its phoneme labels to the speech database by speech recognition using an initial acoustic model prepared in advance and the language model of the spoken-style transcript estimated by the language model estimation means; and
acoustic model learning means for training or updating an acoustic model for speech recognition, using as training data the speech database to which the phoneme labels have been attached by the phoneme labeling means.