JP4829871B2 - Learning data selection device, learning data selection method, program and recording medium, acoustic model creation device, acoustic model creation method, program and recording medium

Info

Publication number: JP4829871B2
Application number: JP2007301625A
Authority: JP
Inventors: 哲小橋川; 浩和政瀧
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-11-21
Filing date: 2007-11-21
Publication date: 2011-12-07
Anticipated expiration: 2027-11-21
Also published as: JP2009128490A

Description

The present invention relates to the selection of data (learning data) used to create an acoustic model, and to a technique for creating an acoustic model using that learning data.

In conventional speech recognition, the approach of creating an acoustic model by modeling each category of speech unit (such as the phonemes, syllables, and words that make up recognition result candidates) with a hidden Markov model (hereinafter "HMM") achieves high recognition performance and has become the mainstream of current speech recognition technology.

An acoustic model, typified by an HMM, is generated from sufficient statistics that are learned and accumulated from learning data. In recent years the amount of learning data has grown enormously, and learning data sets exceeding 500 hours are now used to create acoustic models.

Creating an acoustic model requires a learning time that depends on the amount of learning data. As the amount of learning data has grown in recent years, the time cost of acoustic model learning has become enormous. In addition, learning data contains interfering data that does not contribute to recognition performance, and the presence of such data can degrade recognition performance.

The technique disclosed in Patent Document 1 therefore creates a high-accuracy acoustic model as follows. Using a base acoustic model and a plurality of learning data clusters, sufficient statistics corresponding to each learning data cluster (cluster sufficient statistics) are obtained. Acoustic models are then created from one cluster sufficient statistic or a combination of several, and each acoustic model is evaluated using evaluation data and an evaluation language model. From the evaluation results, the acoustic model that gives a predetermined evaluation result is selected.
Patent Document 1: JP 2007-249051 A

Conventionally, the learning data used to create an acoustic model (for example, speech data and labels, in speech unit categories, associated with that speech data) has been obtained either by collecting from scratch speech that is acoustically close to the intended recognition target, or by manually selecting such speech from an existing speech database. Closeness is judged by criteria such as the task (the speech recognition application, including its environment), the speaking style, and the speaker. Collecting from scratch the enormous amount of task-specific learning data needed for a high-accuracy acoustic model, or selecting speech acoustically close to the intended recognition target from a speech database holding an enormous amount of learning data, takes a great deal of labor.

Further, as already noted, creating an acoustic model requires a learning time that depends on the amount of learning data, so the time cost of acoustic model learning becomes enormous as the amount of learning data grows. The learning data also contains interfering data that does not contribute to recognition performance, and the presence of this data degrades recognition performance.

In one aspect, the technique of Patent Document 1 subdivides an enormous amount of learning data, creates a plurality of acoustic models, and obtains a high-accuracy acoustic model based on their evaluation results using evaluation data and the like. Because it creates a plurality of acoustic models, however, its computational cost is high.

In view of these problems, the present invention provides a technique for selecting, from learning data, the learning data that is useful for creating a high-accuracy acoustic model with high recognition performance. It also provides a technique for creating a high-accuracy acoustic model in a short time.

To solve the above problems, the present invention selects learning data from learning data composed of speech data and labels associated with that speech data, as follows. A base acoustic model, which is an initial acoustic model, is trained with task adaptive learning data (learning data adapted to the task) to create an acoustic model adapted to the task (an adaptive acoustic model). Speech recognition is then performed on the speech data to obtain two recognition scores: one for recognition using a grammar obtained from the labels of the learning data together with the adaptive acoustic model (the adaptive recognition score), and one for recognition using the same grammar together with the base acoustic model (the base recognition score). The learning data that passes a comparison test between the adaptive recognition score and the base recognition score is selected.
In this way, learning data that passes the comparison test between the adaptive recognition score and the base recognition score is regarded as learning data suited to the task and is selected.

Also to solve the above problems, the present invention creates an acoustic model by training a basic acoustic model, which is an initial acoustic model, with the selected learning data.

A learning data selection program that causes a computer to function as the learning data selection device of the present invention allows the computer to operate as the learning data selection device. Similarly, an acoustic model creation program that causes a computer to function as the acoustic model creation device of the present invention allows the computer to operate as the acoustic model creation device. A computer-readable recording medium on which such a program is recorded makes it possible to have another computer function as the learning data selection device or the acoustic model creation device, and to distribute the program.

According to the present invention, the adaptive recognition score (for speech recognition using the adaptive acoustic model) and the base recognition score (for speech recognition using the base acoustic model) are obtained, and the learning data that passes the comparison test between the two scores is selected. Learning data useful for creating a high-accuracy acoustic model can therefore be selected. In addition, because the selected learning data is usually smaller in amount than the original learning data, training the basic acoustic model with the selected learning data yields a high-accuracy acoustic model with high recognition performance in a short time.

<< First Embodiment >>
A first embodiment of the present invention will be described with reference to the drawings.
In practice, the learning data selection device 1 of the first embodiment is more likely to exist as a component of a device that creates an acoustic model using the selected learning data (the acoustic model creation device 2 of the first embodiment) than as a standalone device. Going further, the learning data selection device 1 need not be a readily separable component of the acoustic model creation device 2; it can instead be regarded as the acoustic model creation device 2 itself, viewed from the standpoint of one particular function. In short, in practice the learning data selection device 1 is generally the acoustic model creation device 2 itself.
This is not meant, however, to exclude the learning data selection device 1 existing as a standalone component, or being a component of the acoustic model creation device 2 that is readily separable from it. For example, if selecting learning data is itself the goal, nothing prevents the learning data selection device 1 from being implemented as a standalone component.
Here, the acoustic model creation device 2 is assumed to be implemented on a computer, for example a dedicated machine built from dedicated hardware or a general-purpose machine such as a personal computer. The same applies when the learning data selection device 1 is implemented as a standalone component.

The following describes an example hardware configuration for the case where the acoustic model creation device 2 is a standalone component implemented on a general-purpose computer. The learning data selection device 1 is treated as a component of the acoustic model creation device 2.

<Hardware configuration example of acoustic model creation device 2>
The acoustic model creation device 2 includes an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a CPU (Central Processing Unit, which may include cache memory), RAM (Random Access Memory) and ROM (Read Only Memory) as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, CPU, RAM, ROM, and external storage device so that they can exchange data. If necessary, the acoustic model creation device 2 may also be provided with a device (drive) that can read from and write to a storage medium such as a CD-ROM. A general-purpose computer is one example of a physical entity with these hardware resources.

The external storage device of the acoustic model creation device 2 stores a program for creating acoustic models and the data needed by that program. (Storage is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device.) Data produced by these programs is stored as appropriate in RAM, the external storage device, or the like. Hereinafter, a storage device that stores data, the addresses of its storage areas, and so on is simply called a "storage unit".

In this embodiment, learning data 200 is stored in a predetermined storage area of the storage unit. The learning data 200 is learning data contained in a general-purpose speech database used for speech recognition and similar purposes. An existing general-purpose speech database can be used; it holds, for example, more than 500 hours of data.

Base learning data 100 is also stored in a predetermined storage area of the storage unit. The base learning data 100 is the learning data used to create the base acoustic model 141, which is the initial acoustic model. The base learning data 100 does not need to be adapted to the task. For example, the base learning data 100 may be identical to the learning data 200, or it may be collected from scratch.

Task adaptive learning data 120 is likewise stored in a predetermined storage area of the storage unit. The task adaptive learning data 120 is learning data adapted to the task in which the acoustic model will be used, and is assumed to be prepared in advance. For example, when the task is known in advance, the task adaptive learning data 120 can be prepared by collecting speech from the task from scratch or by selecting learning data suited to the task from an existing speech database. Even when speech from the task cannot be collected, learning data that matches the task in at least one of recording environment, speaker, utterance content, and speaking style may be used as the task adaptive learning data 120 (see the example described later).

Depending on the task, there is no guarantee that a sufficient amount of task adaptive learning data 120 can be obtained, and its amount is often small compared with the learning data 200. With the learning data selection technique of the present invention, as described later, selected learning data 131 is obtained by selecting from the learning data 200 the learning data that is suitable as task adaptive learning data 120, so the task adaptive learning data 120 can be augmented with the selected learning data 131. From this viewpoint, the learning data 200 should preferably contain speech data close to the task.

The base learning data 100, the task adaptive learning data 120, and the learning data 200 are each composed, per utterance, of speech data (analog data of the natural voice) and labels, in speech unit categories (for example phonemes, syllables, or demisyllables), associated with that speech data. The composition is not limited to this. For example, the data may be composed of acoustic analysis results and speech unit category labels associated with those results, or of digitized speech data and speech unit category labels associated with that speech data.

The storage unit of the acoustic model creation device 2 stores a program for creating the base acoustic model 141, a program for creating the adaptive acoustic model 151, a program for digitizing the learning data 200, a program for generating a grammar from the learning data 200, a program for performing speech recognition, a program for selecting learning data based on the speech recognition results, and a program for creating an acoustic model using at least the selected learning data.

In the acoustic model creation device 2, each program stored in the storage unit, together with the data it needs, is read into RAM as required and interpreted and executed by the CPU. As a result, the CPU realizes predetermined functions (a base acoustic model creation unit, an adaptive acoustic model creation unit, a digitization unit, a grammar generation unit, a speech recognition unit, a learning data selection unit, and an acoustic model creation unit), and thereby performs learning data selection and acoustic model creation. The base acoustic model creation unit and the grammar generation unit are not essential components of the acoustic model creation device 2. The learning data selection device 1 of the first embodiment comprises the base acoustic model creation unit, the adaptive acoustic model creation unit, the digitization unit, the grammar generation unit, the speech recognition unit, and the learning data selection unit; of these, the base acoustic model creation unit and the grammar generation unit are not essential components of the learning data selection device 1.

Next, as the first embodiment, the flow of the acoustic model creation process performed by the acoustic model creation device 2, including the learning data selection process performed by the learning data selection device 1, is described step by step with reference to FIGS. 1 and 2.

First, the grammar generation unit 16 generates a grammar 161 using the labels contained in the learning data 200 (step S1). The grammar 161 is created, for example, as a grammar that allows a short pause between the words in a label. Generating the grammar 161 is not mandatory; a descriptive grammar prepared in advance may also be used as the grammar 161.
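As a rough illustration of step S1, the sketch below turns the word sequence of one label into a grammar rule that permits an optional short pause between adjacent words. The rule notation (square brackets for an optional symbol), the pause symbol `sp`, and the function name are assumptions made for illustration; the specification does not prescribe a grammar format.

```python
def label_to_grammar(words, pause="sp"):
    """Build a one-utterance grammar rule from the words of a label.

    Square brackets mark an optional symbol, so "a [sp] b" accepts
    both "a b" and "a sp b". This notation is an assumption.
    """
    if not words:
        return ""
    parts = [words[0]]
    for word in words[1:]:
        parts.append(f"[{pause}]")  # optional short pause between words
        parts.append(word)
    return " ".join(parts)


# Hypothetical label consisting of three words.
rule = label_to_grammar(["kyou", "wa", "hare"])
```

Because the rule is derived from the label itself, recognition with this grammar amounts to scoring the known word sequence rather than searching over arbitrary sentences.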

Next, the base acoustic model creation unit 14 creates the base acoustic model 141 using the base learning data 100 (step S2). The base acoustic model 141 is, for example, a probabilistic model built with a three-state left-to-right HMM structure from monophone labels. These labels are created by labeling the speech data of the base learning data 100 with the phonemes corresponding to the pronunciation (kana) of the transcribed text, with alignment performed after forcibly inserting short pauses. Creating the base acoustic model 141 is not mandatory; an untrained acoustic model prepared in advance may also be used as the base acoustic model 141.

Next, the adaptive acoustic model creation unit 15 trains the base acoustic model 141 with the task adaptive learning data 120 to create an acoustic model (the adaptive acoustic model 151) (step S3).

Next, the digitization unit 11 converts the speech data contained in the learning data 200 into a digital speech signal, utterance by utterance (step S4). The digitization uses well-known A/D conversion or the like. The digital speech signal is the input to the speech recognition unit 12. If the speech data contained in the learning data 200 is already a digital speech signal, this step is unnecessary.

Next, the speech recognition unit 12 takes the digital speech signal, the grammar 161, the base acoustic model 141, and the adaptive acoustic model 151 as input and performs speech recognition utterance by utterance. It obtains the recognition score of the digital speech signal under recognition with the grammar 161 and the base acoustic model 141 (the base recognition score), and the recognition score of the digital speech signal under recognition with the grammar 161 and the adaptive acoustic model 151 (the adaptive recognition score) (step S5). Each recognition score may be calculated by any well-known method.
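The specification leaves the score computation to well-known methods. One common choice is the Viterbi (best-path) log-likelihood of the utterance under a left-to-right HMM; the sketch below assumes that choice. The one-dimensional Gaussian output distributions, the shared self-loop and forward transition log-probabilities, and all names are simplifications introduced here, not details from the specification.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_log_score(frames, means, variances, log_self, log_next):
    """Best-path log score of `frames` under a left-to-right HMM.

    The path must start in the first state and end in the last one.
    Returns -inf when no such path exists (too few frames).
    """
    n_states = len(means)
    neg_inf = float("-inf")
    score = [neg_inf] * n_states
    score[0] = log_gauss(frames[0], means[0], variances[0])
    for x in frames[1:]:
        new_score = [neg_inf] * n_states
        for s in range(n_states):
            stay = score[s] + log_self
            move = score[s - 1] + log_next if s > 0 else neg_inf
            best = max(stay, move)
            if best > neg_inf:
                new_score[s] = best + log_gauss(x, means[s], variances[s])
        score = new_score
    return score[-1]


# The same utterance scored with two hypothetical models.
frames = [0.0, 1.0, 2.0]
lp = math.log(0.5)
base_score = viterbi_log_score(frames, [5.0, 5.0, 5.0], [1.0] * 3, lp, lp)
adapted_score = viterbi_log_score(frames, [0.0, 1.0, 2.0], [1.0] * 3, lp, lp)
```

Running the same function twice, once per acoustic model, gives the pair of scores that step S6 compares.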

Next, the learning data selection unit 13 compares the adaptive recognition score with the base recognition score and selects, from the learning data 200, the learning data that passes the comparison test (usually only part of it) as the selected learning data 131 (step S6). For example, the learning data consisting of the speech data and corresponding labels of each utterance whose adaptive recognition score is greater than or equal to its base recognition score is selected as the selected learning data 131. The selected learning data 131 is thus a subset of the learning data 200.
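The example comparison in step S6 can be sketched as a simple filter over per-utterance scores. The record layout and field names below are assumptions for illustration; only the rule "adaptive recognition score greater than or equal to base recognition score" comes from the text.

```python
from typing import NamedTuple


class ScoredUtterance(NamedTuple):
    utt_id: str           # hypothetical utterance identifier
    label: str            # label associated with the speech data
    base_score: float     # score with grammar + base acoustic model
    adapted_score: float  # score with grammar + adaptive acoustic model


def select_learning_data(utterances):
    """Keep utterances whose adaptive score is at least the base score."""
    return [u for u in utterances if u.adapted_score >= u.base_score]


pool = [
    ScoredUtterance("u1", "a b", -120.0, -100.0),  # adapted model fits better
    ScoredUtterance("u2", "c d", -90.0, -95.0),    # base model fits better
    ScoredUtterance("u3", "e f", -80.0, -80.0),    # tie passes the test
]
selected = select_learning_data(pool)
```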

Subsequently, the acoustic model creation unit 17 trains the base acoustic model 141 with the selected learning data 131 to create the acoustic model 171 (step S7).
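Steps S3 and S5 to S7 can be summarized as the following control-flow sketch. The `train` and `score` callables stand in for the model training and speech recognition components, which the specification leaves to well-known methods; their signatures, and the function name itself, are assumptions.

```python
def create_acoustic_model(base_model, task_data, pool, train, score, threshold=0.0):
    """Sketch of steps S3 and S5-S7.

    train(model, data) -> model   (hypothetical training routine)
    score(model, utt)  -> float   (hypothetical recognition score)
    """
    # Step S3: adapt the base model to the task.
    adapted_model = train(base_model, task_data)
    # Steps S5 and S6: score each utterance with both models and keep
    # those where the adapted model scores at least as well.
    selected = [
        utt for utt in pool
        if score(adapted_model, utt) - score(base_model, utt) >= threshold
    ]
    # Step S7: train the base model again, now on the selected data.
    final_model = train(base_model, selected)
    return final_model, selected
```

Note that only two training runs are needed (one for adaptation, one for the final model), which is the source of the reduced computational cost compared with creating and evaluating many candidate models.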

The creation of the adaptive acoustic model 151 and the acoustic model 171 is explained further below. Each process can be understood by reading "model creation unit" in this explanation as the adaptive acoustic model creation unit 15 or the acoustic model creation unit 17, and "learning data" as the task adaptive learning data 120 or the selected learning data 131, respectively.

The model creation unit calculates sufficient statistics using the base acoustic model 141 and the learning data.

In one example of learning by the model creation unit, acoustic analysis is performed on the labeled speech data that makes up the learning data, the analysis results are treated as the output signal sequence of the HMM states, and the Baum-Welch algorithm is used to calculate, for each speech unit category, the statistics needed to obtain the maximum likelihood parameters of the HMM (given the base acoustic model 141). These statistics are the sufficient statistics.

Sufficient statistics are quantities that characterize the HMM. A concrete example follows. Suppose the probability distribution relating the acoustic features of the labeled speech data to the speech unit categories is expressed as a Gaussian mixture distribution, that is, a probability distribution obtained by mixing one or more multidimensional normal distributions. Each multidimensional normal distribution here is generally composed of normal distributions over cepstral coefficients, such as the i-th dimension cepstrum (including LPC cepstrum, MFCC (mel-frequency cepstral coefficients), and the like; the same applies below), the i-th dimension delta cepstrum (first-order difference of the cepstral coefficients), and the i-th dimension delta-delta cepstrum (first-order difference of the delta cepstral coefficients), together with log power, delta log power (first-order difference of log power), and delta-delta log power (first-order difference of delta log power). Each normal distribution is characterized by its mean and variance. The mixture distribution is generally obtained by weighting and mixing the multidimensional normal distributions. The statistics used to calculate the mean and variance of each multidimensional normal distribution, the mixture weights, and the state transition probabilities are the sufficient statistics.
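For a single one-dimensional normal distribution, the sufficient statistics reduce to an occupancy count, a weighted sum, and a weighted sum of squares, from which the mean and variance follow directly. The sketch below illustrates this under that simplification; the class and method names are assumptions, and the `weight` argument stands in for the state occupancy probability that Baum-Welch would supply.

```python
from dataclasses import dataclass


@dataclass
class GaussStats:
    """Accumulated sufficient statistics for one 1-D normal distribution."""
    count: float = 0.0     # sum of occupancy weights
    total: float = 0.0     # weighted sum of observations
    total_sq: float = 0.0  # weighted sum of squared observations

    def add(self, x, weight=1.0):
        """Accumulate one observation with its occupancy weight."""
        self.count += weight
        self.total += weight * x
        self.total_sq += weight * x * x

    def merge(self, other):
        """Statistics are additive, so data sets can be combined by summing."""
        return GaussStats(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    def mean(self):
        return self.total / self.count

    def variance(self):
        m = self.mean()
        return self.total_sq / self.count - m * m


stats = GaussStats()
for value in [1.0, 2.0, 3.0]:
    stats.add(value)
```

Because the statistics are additive, a model can be synthesized from the accumulated sums without revisiting the underlying speech data.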

The model creation unit synthesizes an acoustic model from the sufficient statistics. Methods for synthesizing an acoustic model from sufficient statistics are described in detail in the following reference.
(Reference) Lawrence Rabiner and Biing-Hwang Juang, Japanese translation supervised by Sadaoki Furui, "Fundamentals of Speech Recognition (Vol. 2)", NTT Advanced Technology, 1995

<< Second Embodiment >>
The second embodiment is a variation on how the selected learning data 131 is chosen. It replaces step S6 of the first embodiment with step S6a (see FIG. 3): the learning data selection unit 13 selects, from the learning data 200, the learning data consisting of the speech data and corresponding labels of each utterance whose difference recognition score (the adaptive recognition score minus the base recognition score) is greater than or equal to a predetermined threshold. The example given in the first embodiment corresponds to a threshold of 0. The threshold may also be negative. Apart from this change, the second embodiment is the same as the first embodiment.

When the threshold is set high, the selected learning data 131 is limited to learning data for utterances close to the task adaptive learning data 120. When the threshold is set low (including negative values), the selected learning data 131 is not necessarily specialized for the task, but its data amount can be increased.
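The effect of the threshold in step S6a can be seen in a small sketch. The dictionary layout and the score values are made up for illustration; only the rule "adaptive score minus base score is at least the threshold" is taken from the text.

```python
def select_by_margin(scores, threshold=0.0):
    """Return IDs whose difference recognition score meets the threshold.

    scores: dict mapping utterance ID -> (base_score, adapted_score)
    """
    return [
        utt_id
        for utt_id, (base, adapted) in scores.items()
        if adapted - base >= threshold
    ]


# Hypothetical scores; the differences are +20, -5, and 0.
scores = {
    "u1": (-120.0, -100.0),
    "u2": (-90.0, -95.0),
    "u3": (-80.0, -80.0),
}
strict = select_by_margin(scores, threshold=10.0)    # task-specific, fewer items
default = select_by_margin(scores, threshold=0.0)    # first-embodiment rule
lenient = select_by_margin(scores, threshold=-10.0)  # more data, less specific
```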

<< Third Embodiment >>
The third embodiment is a variation on how the acoustic model 171 is created. It replaces step S7 of the first embodiment with step S7a (see FIG. 4): the acoustic model creation unit 17 combines the selected learning data 131 with basic learning data 101, uses the combination as learning data, and trains the base acoustic model 141 with it to create the acoustic model 171. The basic learning data 101 may be the base learning data 100, the task adaptive learning data 120, or the combination of the base learning data 100 and the task adaptive learning data 120. Apart from this change, the third embodiment is the same as the first embodiment. Although not illustrated, the third embodiment can also be applied to the second embodiment.

Using the task adaptive learning data 120 as learning data in addition to the selected learning data 131 yields an acoustic model 171 that is specialized for the task and trained on a sufficient amount of learning data. Using the base learning data 100 as learning data in addition to the selected learning data 131 yields an acoustic model 171 that is not necessarily specialized for the task but is trained on a sufficient amount of learning data.
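Combining the selected learning data with the basic learning data in step S7a can be sketched as a list merge. Since the base learning data 100 may be identical to the learning data 200, the same utterance could appear in both inputs; the specification does not say how such overlap should be handled, and dropping duplicates by utterance ID, as done here, is an assumption. The tuple layout and function name are also hypothetical.

```python
def merge_learning_data(selected, basic):
    """Concatenate two lists of (utt_id, label), keeping the first copy of each ID."""
    seen = set()
    merged = []
    for utt_id, label in list(selected) + list(basic):
        if utt_id not in seen:  # assumption: duplicates are dropped
            seen.add(utt_id)
            merged.append((utt_id, label))
    return merged


selected_131 = [("u1", "a b"), ("u3", "e f")]
task_adaptive_120 = [("t1", "g h"), ("u3", "e f")]
learning_data = merge_learning_data(selected_131, task_adaptive_120)
```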

≪Supplementary Notes≫
In the first embodiment, when only the learning data selection processing is performed, the processing of step S7 can be omitted. Even in this case, the processing of steps S1 and S2 is not essential. The same applies to the second embodiment. In the third embodiment, when only the learning data selection processing is performed, the processing of step S7a can be omitted. Even in this case, the processing of steps S1 and S2 is not essential.

When, for example, the learning data selection device 1 and the acoustic model creation device 2 are configured as separate devices, the selected learning data 131 output by the learning data selection unit 13 of the learning data selection device 1 can be stored (for example, via a recording medium 33) in the storage unit of the acoustic model creation device 2, and the acoustic model creation unit 17 can train the basic acoustic model 142 with the stored selected learning data 131 to obtain the acoustic model 171 (see FIG. 5). In this case, the basic acoustic model 142 used by the acoustic model creation device 2 is preferably the same as the base acoustic model 141 used by the learning data selection device 1, but the same base acoustic model need not necessarily be used.

Also, as in the third embodiment, the selected learning data 131 stored in the storage unit of the acoustic model creation device 2 can be combined with the basic learning data 101 to form the learning data, and the acoustic model creation unit 17 can train the basic acoustic model 142 with this learning data to obtain the acoustic model 171 (see FIG. 5). In this case, the basic learning data 101 is preferably the same as the base learning data 100, the task adaptive learning data 120, or the combination of the base learning data 100 and the task adaptive learning data 120 used by the learning data selection device 1, but the same data need not necessarily be used.

The learning data selection device and method and the acoustic model creation device and method of the present invention are not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. Furthermore, the processing described in each embodiment need not be executed in time series in the order described; it may be executed in parallel or individually, according to the processing capability of the device that executes the processing or as required.

When the processing functions of the learning data selection device or acoustic model creation device are realized by a computer, the processing content of the functions that the device should have is described by a program. Executing this program on a computer realizes the processing functions of the learning data selection device or acoustic model creation device on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program that it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and the retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the learning data selection device or acoustic model creation device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

From learning data 200 amounting to 458.36 hours, the present invention selected 2.36 hours of selected learning data 131. With the base acoustic model 141 obtained from 13.58 hours of base learning data 100, the recognition rate was 78.37% and the recognition accuracy was 74.49%. With the acoustic model 171 obtained by training the base acoustic model 141 on the 13.58 hours of base learning data 100 combined with the 2.36 hours of selected learning data 131 (15.94 hours in total), the recognition rate was 79.10% and the recognition accuracy was 75.33%. Both figures improved.

Further, when the recognition target was free speech that differs from the task adaptive learning data 120 but has the same speaking style, the recognition rate with the acoustic model 171 was 67.60% and the recognition accuracy was 66.10%. With the base acoustic model 141, by contrast, the recognition rate was 67.43% and the recognition accuracy was 65.81%. Thus, even for speech that differs from the task adaptive learning data 120 but shares its speaking style, the acoustic model 171 gave a better recognition rate and recognition accuracy than the base acoustic model 141.
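
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable and function names are hypothetical, and the derived percentage-point gains and selection ratio are computed here rather than quoted from the source.

```python
# Data amounts in hours, as reported above.
pool_hours = 458.36      # learning data 200
selected_hours = 2.36    # selected learning data 131
base_hours = 13.58       # base learning data 100

combined_hours = round(base_hours + selected_hours, 2)
selected_percent = round(100 * selected_hours / pool_hours, 2)


def point_gain(baseline, improved):
    """Difference in percentage points, rounded to two decimals."""
    return round(improved - baseline, 2)


# (recognition rate, recognition accuracy) in percent.
task_base, task_new = (78.37, 74.49), (79.10, 75.33)
free_base, free_new = (67.43, 65.81), (67.60, 66.10)

print(combined_hours)                       # 15.94
print(selected_percent)                     # 0.51 (percent of the pool selected)
print(point_gain(task_base[0], task_new[0]),
      point_gain(task_base[1], task_new[1]))  # 0.73 0.84
print(point_gain(free_base[0], free_new[0]),
      point_gain(free_base[1], free_new[1]))  # 0.17 0.29
```

Roughly half a percent of the 458.36-hour pool was selected, and the gains amount to less than one percentage point in each condition.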

The present invention is useful for creating acoustic models used in speech recognition, for example speech-recognition-based text input or the speech recognition of a dialogue system.

FIG. 1 is a block diagram showing an example functional configuration of the learning data selection device and acoustic model creation device according to the first embodiment.
FIG. 2 is a diagram showing the processing flow of the learning data selection processing and acoustic model creation processing according to the first embodiment.
FIG. 3 is a diagram showing the processing flow of the learning data selection processing and acoustic model creation processing according to the second embodiment.
FIG. 4 is a diagram showing the processing flow of the learning data selection processing and acoustic model creation processing according to the third embodiment.
FIG. 5 is a block diagram showing an example functional configuration when the learning data selection device and the acoustic model creation device are separate devices.

Explanation of symbols

1 Learning data selection device
2 Acoustic model creation device
12 Speech recognition unit
13 Learning data selection unit
15 Adaptive acoustic model creation unit
17 Acoustic model creation unit
131 Selected learning data
141 Base acoustic model
151 Adaptive acoustic model
161 Grammar
171 Acoustic model
200 Learning data

Claims

A learning data selection device comprising:
storage means for storing a base acoustic model that is an initial acoustic model, learning data composed of speech data and labels associated with the speech data, a grammar obtained from the labels of the learning data, and learning data adapted to a task (task adaptive learning data);
adaptive acoustic model creating means for training the base acoustic model with the task adaptive learning data to create an acoustic model adapted to the task (adaptive acoustic model);
speech recognition means for performing speech recognition on the speech data to obtain a recognition score in speech recognition using the grammar and the adaptive acoustic model (adaptive recognition score) and a recognition score in speech recognition using the grammar and the base acoustic model (base recognition score); and
learning data selecting means for selecting, from among the learning data, data for which the score obtained by subtracting the base recognition score from the adaptive recognition score is equal to or greater than a predetermined threshold.

A learning data selection method in which storage means stores a base acoustic model that is an initial acoustic model, learning data composed of speech data and labels associated with the speech data, a grammar obtained from the labels of the learning data, and learning data adapted to a task (task adaptive learning data), the method comprising:
an adaptive acoustic model creating step of training the base acoustic model with the task adaptive learning data to create an acoustic model adapted to the task (adaptive acoustic model);
a speech recognition step of performing speech recognition on the speech data to obtain a recognition score in speech recognition using the grammar and the adaptive acoustic model (adaptive recognition score) and a recognition score in speech recognition using the grammar and the base acoustic model (base recognition score); and
a learning data selection step of selecting, from among the learning data, data for which the score obtained by subtracting the base recognition score from the adaptive recognition score is equal to or greater than a predetermined threshold.

An acoustic model creation device comprising:
storage means for storing a basic acoustic model and a base acoustic model, which are initial acoustic models, learning data composed of speech data and labels associated with the speech data, a grammar obtained from the labels of the learning data, and learning data adapted to a task (task adaptive learning data);
adaptive acoustic model creating means for training the base acoustic model with the task adaptive learning data to create an acoustic model adapted to the task (adaptive acoustic model);
speech recognition means for performing speech recognition on the speech data to obtain a recognition score in speech recognition using the grammar and the adaptive acoustic model (adaptive recognition score) and a recognition score in speech recognition using the grammar and the base acoustic model (base recognition score);
learning data selecting means for selecting, from among the learning data, data (selected learning data) for which the score obtained by subtracting the base recognition score from the adaptive recognition score is equal to or greater than a predetermined threshold; and
acoustic model creating means for training the basic acoustic model with the selected learning data to create an acoustic model.

The acoustic model creation device according to claim 3, wherein the acoustic model creating means uses, as learning data, a combination of the selected learning data and basic learning data different from the selected learning data, and creates the acoustic model by training the basic acoustic model with this learning data.

An acoustic model creation method in which storage means stores a basic acoustic model and a base acoustic model, which are initial acoustic models, learning data composed of speech data and labels associated with the speech data, a grammar obtained from the labels of the learning data, and learning data adapted to a task (task adaptive learning data), the method comprising:
an adaptive acoustic model creating step of training the base acoustic model with the task adaptive learning data to create an acoustic model adapted to the task (adaptive acoustic model);
a speech recognition step of performing speech recognition on the speech data to obtain a recognition score in speech recognition using the grammar and the adaptive acoustic model (adaptive recognition score) and a recognition score in speech recognition using the grammar and the base acoustic model (base recognition score);
a learning data selection step of selecting, from among the learning data, data (selected learning data) for which the score obtained by subtracting the base recognition score from the adaptive recognition score is equal to or greater than a predetermined threshold; and
an acoustic model creation step of training the basic acoustic model with the selected learning data to create an acoustic model.

A program for causing a computer to function as the learning data selection device according to claim 1.

A computer-readable recording medium on which the program according to claim 6 is recorded.

A program for causing a computer to function as the acoustic model creation device according to claim 3 or claim 4.

A computer-readable recording medium on which the program according to claim 8 is recorded.