JP2011095680A

JP2011095680A - Acoustic model adaptation device, acoustic model adaptation method and program for acoustic model adaptation

Info

Publication number: JP2011095680A
Application number: JP2009252247A
Authority: JP
Inventors: Takenori Tsujikawa; 剛範辻川; Yoshifumi Onishi; 祥史大西; Takeshi Hanazawa; 健花沢; Takafumi Koshinaka; 孝文越仲
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-11-02
Filing date: 2009-11-02
Publication date: 2011-05-12

Abstract

PROBLEM TO BE SOLVED: To efficiently perform acoustic model adaptation with high accuracy in a limited period of time. SOLUTION: An acoustic model adaptation device includes: a clustering section 12 for clustering voice signals divided by a dividing section 11 according to acoustic differences; a reliability calculation section 13 for calculating a reliability for the voice signals included in a cluster; a label estimation section 14 for obtaining an estimated label by recognizing the voice signals included in the cluster; a presentation section 15 for presenting the voice signals and the estimated label included in a first cluster which is selected based on the reliability; a teacher label obtaining section 16 for obtaining a teacher label for the presented voice signal; a transition instruction section 17 which instructs the presentation section 15 to shift from a state of handling the first cluster to a state of handling a second cluster which is different from the first cluster; and an acoustic model adaptation section 18 for adapting the acoustic model to the voice signals in the cluster by using the teacher label. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to an acoustic model adaptation device, an acoustic model adaptation method, and an acoustic model adaptation program, and in particular to a device, method, and program for adapting an acoustic model so that highly accurate speech recognition results can be obtained efficiently in a limited time.

Speech recognition generally uses an acoustic model, which represents the features of phonemes, and a language model, which represents constraints on how phonemes are arranged. In some cases "acoustic model adaptation" is performed to adapt the acoustic model to a speaker or an environment.

An example of an acoustic model adaptation device is described in Patent Document 1. The system described in Patent Document 1 presents a plurality of candidate speech recognition results to the user and has the user select the correct one. As a result, the user can efficiently correct erroneous speech recognition results. In other words, the system described in Patent Document 1 can perform supervised adaptation of an acoustic model using efficiently corrected recognition results.

FIG. 8 is a block diagram showing the configuration of the speech recognition system disclosed in Patent Document 1. The configuration and operation for correcting misrecognized words are described with reference to FIG. 8. The speech recognition system 400 shown in FIG. 8 includes speech input means 403, speech recognition means 405, data storage means 412 that stores a dictionary, word correction means 409, and recognition result display means 407.

The speech recognition means 405 includes continuous sound determination means 413 and speech recognition execution means 411.

In the word correction means 409, the competing word display command means 415 selects, from among the competing candidates, one or more competing words whose competition probabilities are close to that of the word with the highest competition probability, and displays the selected competing words on the screen of the recognition result display means 407, adjacent to the corresponding word with the highest competition probability. The competing word selection means 417 selects an appropriate correction word from the one or more competing words displayed on the screen in accordance with a manual operation by the user. The word replacement command means 419 commands the speech recognition means 405 to replace the recognized word having the highest competition probability with the correction word selected by the competing word selection means 417.

Patent Document 1: JP 2006-146008 A

With the acoustic model adaptation device described in Patent Document 1, the user can efficiently correct misrecognized words, and as a result the device can adapt the acoustic model efficiently.

However, the acoustic model adaptation device described in Patent Document 1 may fail to obtain the maximum adaptation effect in a limited time. The reason is that it does not consider how much the user should correct for each target of adaptation (a speaker, an utterance environment, and so on). In other words, it does not consider what kind of teacher data should be given.

Accordingly, an object of the present invention is to provide an acoustic model adaptation device that can perform highly accurate acoustic model adaptation efficiently in a limited time.

An acoustic model adaptation device according to the present invention includes: a dividing unit that divides a speech signal; a clustering unit that clusters the speech signals divided by the dividing unit according to acoustic differences; a reliability calculation unit that calculates an acoustic reliability for the speech signals included in each cluster created by the clustering unit; a label estimation unit that obtains estimated labels by recognizing the speech signals included in each cluster created by the clustering unit; a presentation unit that presents to a user the speech signals included in a first cluster, which is a cluster selected from the clusters created by the clustering unit based on the reliability calculated by the reliability calculation unit, together with the estimated labels obtained by the label estimation unit; a teacher label acquisition unit that obtains teacher labels for the speech signals presented by the presentation unit; a transition instructing unit that, when a predetermined condition is satisfied, instructs the presentation unit to transition from a first state, in which the speech signals and estimated labels of the first cluster are presented to the user, to a second state, in which the speech signals and estimated labels of a second cluster different from the first cluster are presented to the user; and an acoustic model adaptation unit that adapts an acoustic model to the speech signals in a cluster using the teacher labels acquired by the teacher label acquisition unit.

An acoustic model adaptation method according to the present invention includes: dividing a speech signal; clustering the divided speech signals according to acoustic differences; calculating an acoustic reliability for the speech signals included in each cluster created by the clustering; obtaining estimated labels by recognizing the speech signals included in each cluster created by the clustering; presenting to a user the speech signals included in a first cluster, which is a cluster selected from the created clusters based on the reliability, together with the estimated labels; obtaining teacher labels for the speech signals presented to the user; when a predetermined condition is satisfied, transitioning from a first state, in which the speech signals and estimated labels of the first cluster are presented to the user, to a second state, in which the speech signals and estimated labels of a second cluster different from the first cluster are presented to the user; and adapting an acoustic model to the speech signals in a cluster using the obtained teacher labels.

An acoustic model adaptation program according to the present invention causes a computer to execute: a dividing process of dividing a speech signal; a clustering process of clustering the speech signals divided in the dividing process according to acoustic differences; a reliability calculation process of calculating an acoustic reliability for the speech signals included in each cluster created in the clustering process; a label estimation process of obtaining estimated labels by recognizing the speech signals included in each cluster created in the clustering process; a presentation process of presenting to a user the speech signals included in a first cluster, which is a cluster selected from the clusters created in the clustering process based on the reliability calculated in the reliability calculation process, together with the estimated labels obtained by label estimation; a teacher label acquisition process of obtaining teacher labels for the speech signals presented in the presentation process; a transition instruction process of instructing, when a predetermined condition is satisfied, a transition from a first state, in which the speech signals and estimated labels of the first cluster are presented to the user, to a second state, in which the speech signals and estimated labels of a second cluster different from the first cluster are presented to the user; and an acoustic model adaptation process of adapting an acoustic model to the speech signals in a cluster using the teacher labels acquired in the teacher label acquisition process.

According to the present invention, the user is less likely to assign a large amount of teacher data to a single cluster in a biased manner, so adaptation processing can be performed for more acoustic models with a relatively small amount of teacher data. As a result, an acoustic model adaptation device that performs highly accurate acoustic model adaptation efficiently in a limited time can be realized.

FIG. 1 is a block diagram showing an example of an embodiment of the acoustic model adaptation device according to the present invention.
FIG. 2 is a flowchart showing the operation of the acoustic model adaptation device.
FIG. 3 is a block diagram showing a configuration example of a speech recognition system including the acoustic model adaptation device.
FIG. 4 is a block diagram showing a configuration example of a voice detection system including the acoustic model adaptation device.
FIG. 5 is a block diagram showing the main part of the acoustic model adaptation device according to the present invention.
FIG. 6 is a block diagram showing a configuration example of the transition instructing unit.
FIG. 7 is a block diagram showing a configuration example of the transition instructing unit.
FIG. 8 is a block diagram showing the configuration of the speech recognition system described in Patent Document 1.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of the acoustic model adaptation device of the present embodiment. The acoustic model adaptation device shown in FIG. 1 includes a dividing unit 1 that divides an input speech stream into pieces of speech data, and a speech data clustering unit 2 that clusters the divided speech data according to acoustic differences such as the speaker and the utterance environment. The speech data clustering unit 2 stores the created clusters 101_1 to 101_n in the cluster storage unit 101.

The acoustic model adaptation device also includes: a reliability calculation unit 3 that uses the acoustic model stored in the acoustic model storage unit 102 to calculate an acoustic reliability for the speech data included in the clusters 101_1 to 101_n; a label estimation unit 4 that obtains estimated labels by recognizing the speech data included in the clusters 101_1 to 101_n using the acoustic model stored in the acoustic model storage unit 102; and an estimated label storage unit 103 that temporarily stores the estimated labels obtained by the label estimation unit 4.

The acoustic model adaptation device further includes: a speech data estimated label presentation unit 5 that presents to the user the speech data and estimated labels of a highly reliable cluster (referred to as the first cluster); a teacher label acquisition unit 6 that obtains teacher labels from the user for the presented speech data and stores them in the teacher label storage unit 104; a transition unit 7 that shifts the processing target of the speech data estimated label presentation unit 5 to a second cluster different from the first cluster; and an acoustic model adaptation unit 8 that adapts the acoustic model to the speech data in each cluster using at least the teacher labels from the user, thereby obtaining the adapted models 105_1 to 105_n.

Next, the operation of the acoustic model adaptation device of the present embodiment will be described. FIG. 2 is a flowchart showing the processing procedure in the acoustic model adaptation device of the present embodiment.

Let the input speech stream be x(t), where t is a time index ranging, for example, from 0 to T. The dividing unit 1 divides the input speech stream x(t) into m pieces of speech data x1(t1), x2(t2), ..., xm(tm) (step S1). Here t1, t2, ..., tm are time indices whose ranges are contained in the range 0 to T of t. The unit of division may be, for example, an utterance or a predetermined length of time.
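As a minimal sketch of step S1 under the "predetermined time unit" option, the stream can be cut into fixed-length index ranges. The function name `split_stream` and the parameter `segment_len` are illustrative assumptions and do not appear in the publication; utterance-based division would need a voice activity detector instead.

```python
from typing import List, Sequence, Tuple


def split_stream(x: Sequence[float], segment_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges that cover x in fixed-length pieces.

    Each range plays the role of one time-index range t_i within 0..T.
    The last range may be shorter than segment_len.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [
        (start, min(start + segment_len, len(x)))
        for start in range(0, len(x), segment_len)
    ]
```

For a 10-sample stream and `segment_len=4`, this yields the three ranges (0, 4), (4, 8) and (8, 10).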

The speech data clustering unit 2 clusters the speech data divided by the dividing unit 1 into n clusters 101_1 to 101_n according to acoustic differences such as the speaker and the utterance environment (step S2). For example, the speech data clustering unit 2 performs clustering automatically based on the closeness of acoustic features. Specifically, when the features of one piece of speech data (such as the frequency spectrum or cepstrum) are similar to those of another, the two pieces are placed in the same cluster. If the correspondence between speech data and speakers is known, clustering may instead be performed based on that correspondence.
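The publication does not name a clustering algorithm, so the following is only one possible sketch of step S2: a greedy scheme that assigns each segment's feature vector to the nearest existing cluster centroid if it lies within an assumed distance `max_dist`, and otherwise opens a new cluster. The function name, the Euclidean distance, and the threshold are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cluster_by_proximity(
    features: Sequence[Sequence[float]], max_dist: float
) -> List[List[int]]:
    """Group segment indices whose feature vectors lie close together."""
    clusters: List[List[int]] = []      # member indices per cluster
    centroids: List[List[float]] = []   # running mean per cluster
    for idx, feat in enumerate(features):
        best, best_dist = None, max_dist
        for c, centroid in enumerate(centroids):
            d = math.dist(feat, centroid)
            if d <= best_dist:
                best, best_dist = c, d
        if best is None:
            # No centroid is close enough: start a new cluster.
            clusters.append([idx])
            centroids.append(list(feat))
        else:
            # Update the running mean, then record the new member.
            n = len(clusters[best])
            centroids[best] = [
                (m * n + v) / (n + 1) for m, v in zip(centroids[best], feat)
            ]
            clusters[best].append(idx)
    return clusters
```

In practice the feature vector for a segment might be, for example, its mean cepstrum; that choice is likewise left open by the text.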

The reliability calculation unit 3 calculates the reliability of each of the clusters 101_1 to 101_n by calculating the acoustic reliability of the speech data included in that cluster (step S3). As the reliability, for example, the average posterior probability or the average signal-to-noise (SN) ratio is used.

As an example, when cluster 101_1 contains the speech data x1(t1) and x2(t2), the average posterior probability (reliability) of x1(t1) and x2(t2) can be calculated as follows.

$$\text{average posterior probability} = \operatorname{ave}_{\{x_1, x_2\}}\Big\langle \operatorname{ave}_{t_1}\big\langle P(k_1 \mid x_1(t_1)) \big\rangle,\ \operatorname{ave}_{t_2}\big\langle P(k_2 \mid x_2(t_2)) \big\rangle \Big\rangle \tag{1}$$

In Equation (1), $\operatorname{ave}_{a}\langle b \rangle$ is an operator that calculates the average of b with respect to a, and $P(k_1 \mid x_1(t_1))$ is the posterior probability of the probability distribution k1 in the acoustic model given the speech data x1(t1). Here, k1 is the distribution with the highest posterior probability at time t1. Indices other than the posterior probability and the SN ratio may also be used as the reliability.
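Equation (1) can be sketched in code as a two-level average: first over the frames of each segment, then over the segments in the cluster. The input layout (one list of per-distribution posteriors per frame) and the function names are assumptions made for this illustration; how the posteriors are obtained from the acoustic model is outside the sketch.

```python
from statistics import mean
from typing import Sequence


def segment_posterior(frame_posteriors: Sequence[Sequence[float]]) -> float:
    """Inner average of Equation (1): mean over frames of the best posterior.

    frame_posteriors[t][k] is the posterior of distribution k at frame t;
    max() picks the distribution with the highest posterior at each frame.
    """
    return mean(max(frame) for frame in frame_posteriors)


def cluster_reliability(
    segments: Sequence[Sequence[Sequence[float]]],
) -> float:
    """Outer average of Equation (1): mean over the segments in a cluster."""
    return mean(segment_posterior(seg) for seg in segments)
```

With a two-frame segment whose best posteriors are 0.9 and 0.7 and a one-frame segment whose best posterior is 0.6, the segment averages are 0.8 and 0.6, and the cluster reliability is 0.7.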

The label estimation unit 4 obtains estimated labels for each of the clusters 101_1 to 101_n by recognizing the speech data included in that cluster (step S4). As estimated labels, for example, phoneme labels (a, i, u, e, o, k, s, t, n, ...), syllable labels (a, i, u, e, o, ka, sa, ta, na, ...), or labels indicating whether a portion is speech or noise are used.

The speech data estimated label presentation unit 5 presents to the user the speech data included in the first cluster, which has high reliability, together with the estimated labels for the first cluster (step S5). Specifically, it is preferable to play back the speech based on the speech data to the user and to display the estimated labels on a display device (not shown) so that the user can see them.

When the user inputs teacher labels for the presented speech data via an input device (not shown), the teacher label acquisition unit 6 acquires the input teacher labels. The teacher label acquisition unit 6 thus obtains correct teacher labels from the user for the speech data presented to the user (step S6). The teacher labels are preferably of the same type as the estimated labels handled by the label estimation unit 4, but they may be of a type that can be converted into that type. For example, since syllable labels can be converted into phoneme labels, the teacher labels may be syllable labels even when the label estimation unit 4 produced phoneme labels as the estimated labels.
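The syllable-to-phoneme conversion mentioned above can be sketched as a table lookup. The table below covers only the nine syllables listed as examples in the description, and the names `SYLLABLE_TO_PHONEMES` and `syllables_to_phonemes` are illustrative assumptions; a real system would need a complete syllable inventory.

```python
from typing import Dict, List, Sequence

# Partial table: only the syllables given as examples in the text.
SYLLABLE_TO_PHONEMES: Dict[str, List[str]] = {
    "あ": ["a"], "い": ["i"], "う": ["u"], "え": ["e"], "お": ["o"],
    "か": ["k", "a"], "さ": ["s", "a"], "た": ["t", "a"], "な": ["n", "a"],
}


def syllables_to_phonemes(syllables: Sequence[str]) -> List[str]:
    """Expand syllable teacher labels into a flat phoneme label sequence."""
    phonemes: List[str] = []
    for syllable in syllables:
        # A KeyError here means the syllable is missing from the partial table.
        phonemes.extend(SYLLABLE_TO_PHONEMES[syllable])
    return phonemes
```

After this conversion, teacher labels entered as syllables can be compared directly with phoneme-level estimated labels.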

When a predetermined condition is satisfied, the transition unit 7 shifts the processing target of the speech data estimated label presentation unit 5 from the first cluster, which has relatively high reliability, to a second cluster different from the first cluster (step S7). That is, it outputs an instruction to change the processing target to the second cluster. The second cluster is the cluster with the next-highest reliability after the first cluster.

<Transition from a cluster with high reliability>
In a cluster with high reliability, the estimated labels are likely to be correct. Therefore, if a small number of teacher labels are given and they agree closely with the estimated labels, the estimated labels can be used in place of teacher labels for the remaining speech data.

It follows that once label agreement at or above a threshold has been confirmed, a transition from this cluster to another cluster is possible. Accordingly, in step S7, each time an estimated label is presented to the user, the transition unit 7 calculates the degree of agreement (similarity) between the estimated labels and the teacher labels obtained by the teacher label acquisition unit 6. If the degree of agreement is equal to or greater than a predetermined threshold, the transition unit 7 shifts the processing target of the speech data estimated label presentation unit 5 to the second cluster even if unpresented estimated labels remain.
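The agreement check can be sketched as follows. The description does not define how the degree of agreement is computed, so position-wise label matching is an assumption here, as are the function names, the default threshold of 0.9, and the `min_checked` guard, which is added only so that a single matching label does not trigger a transition.

```python
from typing import Sequence


def label_agreement(estimated: Sequence[str], teacher: Sequence[str]) -> float:
    """Fraction of teacher labels that equal the estimated label at that position."""
    if not teacher:
        return 0.0
    matches = sum(e == t for e, t in zip(estimated, teacher))
    return matches / len(teacher)


def should_move_on(
    estimated: Sequence[str],
    teacher: Sequence[str],
    threshold: float = 0.9,   # assumed value
    min_checked: int = 5,     # assumed guard, not in the source
) -> bool:
    """True when enough labels were checked and agreement reaches the threshold."""
    if len(teacher) < min_checked:
        return False
    return label_agreement(estimated, teacher) >= threshold
```

Here `teacher` holds only the labels confirmed so far, so the check can run after every presentation while later estimated labels are still unpresented.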

<Transition from a cluster with low reliability>
In a cluster with low reliability, the estimated labels contain more estimation errors than in a cluster with high reliability. However, if the phoneme (label) coverage of the teacher labels is high, the acoustic model can be adapted without giving teacher labels to all of the speech data.

It follows that once phoneme (label) coverage of the teacher labels at or above a threshold has been confirmed, a transition from this cluster to another cluster is possible. Accordingly, in step S7, if the phoneme coverage of the teacher labels obtained by the teacher label acquisition unit 6 is equal to or greater than a predetermined threshold, the transition unit 7 shifts the processing target of the speech data estimated label presentation unit 5 to the second cluster even if unpresented estimated labels remain.
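Phoneme coverage is not given a formula in the description. One plausible reading, used in this sketch, is the fraction of a known phoneme inventory that occurs at least once among the teacher labels collected so far; the function name and the `inventory` argument are assumptions.

```python
from typing import Iterable


def phoneme_coverage(teacher_labels: Iterable[str], inventory: Iterable[str]) -> float:
    """Fraction of the phoneme inventory seen at least once in the teacher labels."""
    inventory_set = set(inventory)
    if not inventory_set:
        raise ValueError("inventory must not be empty")
    seen = inventory_set & set(teacher_labels)
    return len(seen) / len(inventory_set)
```

With the nine example phonemes a, i, u, e, o, k, s, t, n as the inventory, the teacher labels k, a, s, a, t, a cover four of nine phonemes, so the transition would be triggered only for thresholds at or below about 0.44.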

<Transition from a cluster that the user has judged not to need adaptation>
For a cluster (of a speaker) that the user has judged does not need to be recognized, a transition from that cluster to another cluster can be made at the user's instruction without assigning teacher labels. Accordingly, in step S7, when the user inputs via the input device that teacher labels need not be entered for the cluster currently handled by the speech data estimated label presentation unit 5, the transition unit 7 shifts the processing target of the speech data estimated label presentation unit 5 to the second cluster even if unpresented estimated labels remain. The user judges that recognition is unnecessary when, for example, the speech played back from the speech data matches the estimated labels.

When the speech data estimated label presentation unit 5 receives from the transition unit 7 an instruction to shift the processing target to the second cluster, it executes the processing of step S5 with the second cluster as the processing target. Thereafter, the speech data estimated label presentation unit 5, the teacher label acquisition unit 6, and the transition unit 7 repeat the processing of steps S5 to S7. Once the processing of steps S5 and S6 has been executed for all of the clusters 101_1 to 101_n, the transition unit 7 determines that no further transition to another cluster is necessary.

In the above example, a cluster with relatively high reliability among the clusters 101_1 to 101_n (for example, the cluster with the highest reliability) is taken as the first cluster, and the processing of steps S5 and S6 is then executed on the remaining clusters in turn, each time moving to the cluster whose priority is next highest after the one just handled. Alternatively, a cluster with relatively low reliability (for example, the cluster with the lowest reliability) may be taken as the first cluster, with steps S5 and S6 then executed in turn on the cluster whose priority is next lowest after the one just handled.
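Both visiting orders can be sketched by sorting the clusters on their reliability. This assumes that the priority mentioned above is simply the reliability computed in step S3; the function name and the `highest_first` flag are illustrative.

```python
from typing import Dict, List


def presentation_order(
    reliabilities: Dict[str, float], highest_first: bool = True
) -> List[str]:
    """Return cluster names in the order they would be presented (steps S5 to S7)."""
    return sorted(reliabilities, key=reliabilities.get, reverse=highest_first)
```

The first element of the returned list plays the role of the first cluster, and each later element is the "second cluster" for the transition that follows.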

The acoustic model adaptation unit 8 uses the teacher labels 104 acquired by the teacher label acquisition unit 6 to adapt the acoustic model 102 to the speech data included in each of the clusters 101_1 to 101_n, thereby obtaining the adapted models 105_1 to 105_n (step S8). The adapted models 105_1 to 105_n are stored in the adapted model storage unit 105.

As the algorithm for adapting the acoustic model using the speech data and the teacher labels, the MLLR (Maximum Likelihood Linear Regression) method, a tree-structure adaptation method, or the like may be used. In the present embodiment, acoustic model adaptation is performed after the teacher labels for all of the clusters 101_1 to 101_n have been acquired, but adaptation may instead be performed as soon as the teacher labels for a given cluster have been acquired.

In the present embodiment, the divided speech data is clustered according to acoustic differences, and the cluster being processed is changed as soon as it is determined that the teacher labels needed for that cluster have been acquired. This makes it possible to adapt the acoustic model so that highly accurate speech recognition results are obtained efficiently in a limited time.

The acoustic model adaptation device of the above embodiment can be applied to a speech recognition system. FIG. 3 is a block diagram showing a configuration example of a speech recognition system including the acoustic model adaptation device of the above embodiment. As shown in FIG. 3, the speech recognition system 200 includes the acoustic model adaptation device 10 of the above embodiment and a speech recognition device 20. The speech recognition device 20, for example, detects the features of input speech data, selects from the adapted models 105_1 to 105_n in the acoustic model adaptation device 10 the adapted model that matches those features, and executes speech recognition processing using the selected adapted model.
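How the speech recognition device 20 matches input features to an adapted model is not specified. One simple possibility, sketched here purely as an assumption, is to keep a representative feature vector (centroid) for each cluster and pick the adapted model whose centroid is nearest to the input feature vector. The function name and the centroid table are hypothetical.

```python
import math
from typing import Dict, Sequence


def select_adapted_model(
    feature: Sequence[float], cluster_centroids: Dict[str, Sequence[float]]
) -> str:
    """Return the name of the adapted model whose cluster centroid is nearest."""
    if not cluster_centroids:
        raise ValueError("no adapted models available")
    return min(
        cluster_centroids,
        key=lambda name: math.dist(feature, cluster_centroids[name]),
    )
```

The returned name would then be used to look up the corresponding adapted model among 105_1 to 105_n before recognition.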

The acoustic model adaptation device of the above embodiment can also be applied to a voice detection system. FIG. 4 is a block diagram showing a configuration example of a voice detection system including the acoustic model adaptation device of the above embodiment. As shown in FIG. 4, the voice detection system 300 includes the acoustic model adaptation device 10 of the above embodiment and a voice detection device 30. The voice detection device 30, for example, detects the features of input speech data, selects from the adapted models 105_1 to 105_n in the acoustic model adaptation device 10 the adapted model that matches those features, and uses the selected adapted model to execute voice detection processing such as extracting a specific speech portion from the speech data or performing speaker recognition.

FIG. 5 is a block diagram showing the main part of the acoustic model adaptation device according to the present invention. As shown in FIG. 5, the acoustic model adaptation device includes: a dividing unit 11 (corresponding to the dividing unit 1 shown in FIG. 1) that divides a speech signal; a clustering unit 12 (corresponding to the speech data clustering unit 2 shown in FIG. 1) that clusters the speech signals divided by the dividing unit 11 according to their acoustic differences; a reliability calculation unit 13 (corresponding to the reliability calculation unit 3 shown in FIG. 1) that calculates an acoustic reliability for the speech signals included in each cluster created by the clustering unit 12; a label estimation unit 14 (corresponding to the label estimation unit 4 shown in FIG. 1) that obtains estimated labels by recognizing the speech signals included in the clusters created by the clustering unit 12; a presentation unit 15 (corresponding to the speech data and estimated label presentation unit 5 shown in FIG. 1) that presents to the user the speech signals included in a first cluster, which is a cluster selected from the clusters created by the clustering unit 12 on the basis of the reliability calculated by the reliability calculation unit 13, together with the estimated labels obtained by the label estimation unit 14; a teacher label acquisition unit 16 (corresponding to the teacher label acquisition unit 6 shown in FIG. 1) that obtains teacher labels for the speech signals presented by the presentation unit 15; a transition instruction unit 17 (corresponding to the transition unit 7 shown in FIG. 1) that, when a predetermined condition is satisfied, instructs the presentation unit 15 to transition from a first state, in which the speech signals and estimated labels of the first cluster are presented to the user, to a second state, in which the speech signals and estimated labels of a second cluster different from the first cluster are presented to the user; and an acoustic model adaptation unit 18 (corresponding to the acoustic model adaptation unit 8 shown in FIG. 1) that adapts an acoustic model to the speech signals in a cluster using the teacher labels acquired by the teacher label acquisition unit 16.
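The interaction between the presentation, teacher label acquisition, and transition instruction units can be sketched as a simple control loop. This is an illustrative outline under stated assumptions, not the implementation of the embodiment: clusters are visited in descending order of reliability (the text says only that the first cluster is selected based on reliability), and `estimate_label`, `ask_teacher_label`, and `should_move_on` are hypothetical callables standing in for units 14, 16, and 17.

```python
def annotate_clusters(clusters, reliability, estimate_label,
                      ask_teacher_label, should_move_on):
    """Collect teacher labels cluster by cluster, leaving a cluster early
    when the transition condition holds.

    clusters:     dict mapping cluster id -> list of speech segments
    reliability:  dict mapping cluster id -> reliability score
    Assumption: clusters are visited from highest to lowest reliability.
    """
    teacher_labels = {}
    for cluster_id in sorted(clusters, key=reliability.get, reverse=True):
        pairs = []  # (estimated label, teacher label) for this cluster
        for segment in clusters[cluster_id]:
            estimated = estimate_label(segment)              # label estimation
            teacher = ask_teacher_label(segment, estimated)  # present and acquire
            pairs.append((estimated, teacher))
            if should_move_on(pairs):                        # transition instruction
                break
        teacher_labels[cluster_id] = [teacher for _, teacher in pairs]
    return teacher_labels


# Stub usage: the "user" accepts every estimate, and the loop leaves a
# cluster after two labelled segments.
demo_clusters = {"a": ["s1", "s2", "s3"], "b": ["s4"]}
demo_reliability = {"a": 0.2, "b": 0.8}
labels = annotate_clusters(
    demo_clusters,
    demo_reliability,
    estimate_label=str.upper,
    ask_teacher_label=lambda segment, estimated: estimated,
    should_move_on=lambda pairs: len(pairs) >= 2,
)
```

The returned teacher labels would then be passed, per cluster, to whatever adaptation routine plays the role of the acoustic model adaptation unit 18.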

The acoustic model adaptation device can also be implemented in software. That is, the acoustic model adaptation device may incorporate a CPU, and the CPU may be configured to realize, in accordance with a program, the functions of the dividing unit 11, the clustering unit 12, the reliability calculation unit 13, the label estimation unit 14, the presentation unit 15, the teacher label acquisition unit 16, the transition instruction unit 17, and the acoustic model adaptation unit 18 shown in FIG. 5.

As shown in FIG. 6, the transition instruction unit 17 may be configured to include a matching degree calculation unit 17A that calculates the degree of match between the estimated labels presented by the presentation unit 15 and the teacher labels acquired by the teacher label acquisition unit 16, and an instruction unit 17B that, when the degree of match calculated by the matching degree calculation unit 17A is equal to or greater than a predetermined value, regards the predetermined condition as satisfied and instructs the transition to the second state. With this configuration, processing for a cluster can be ended before all of its estimated labels have been presented, which shortens the time required for acoustic model adaptation.
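One way to realize the matching degree calculation and threshold test is sketched below. This passage does not define how the degree of match is computed, so the sketch assumes labels are token sequences and uses one minus the normalized edit distance, summed over all labelled segments. The function names and the default threshold of 0.9 are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def match_degree(estimated, teacher):
    """Degree of match in [0, 1] between estimated and teacher label lists.

    Assumption: 1 - (total edit distance / total length); the source does
    not prescribe a formula.
    """
    if len(estimated) != len(teacher):
        raise ValueError("label lists must have the same length")
    errors = sum(edit_distance(e, t) for e, t in zip(estimated, teacher))
    length = sum(max(len(e), len(t)) for e, t in zip(estimated, teacher))
    return 1.0 if length == 0 else 1.0 - errors / length


def should_transition(estimated, teacher, threshold=0.9):
    """True when the degree of match reaches the (hypothetical) threshold."""
    return match_degree(estimated, teacher) >= threshold


estimated_labels = [["a", "b", "c"], ["d", "e"]]
teacher_labels = [["a", "b", "c"], ["d", "f"]]
degree = match_degree(estimated_labels, teacher_labels)  # one error in five tokens
```

A high degree of match means the recognizer is already performing well on the cluster, so further teacher labels from that cluster add little, which is why the condition triggers the move to the next cluster.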

As shown in FIG. 7, the transition instruction unit 17 may instead be configured to include a phoneme coverage calculation unit 17C that calculates the phoneme coverage of the teacher labels acquired by the teacher label acquisition unit 16, and an instruction unit 17D that, when the phoneme coverage calculated by the phoneme coverage calculation unit 17C is equal to or greater than a predetermined value, regards the predetermined condition as satisfied and instructs the transition to the second state. With this configuration, processing for a cluster can be ended before all of its estimated labels have been presented, which shortens the time required for acoustic model adaptation.
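The phoneme coverage test can be sketched as follows. The sketch assumes that coverage means the fraction of a fixed phoneme inventory that appears at least once in the teacher labels collected so far; this passage does not give the exact definition, and the function name, the five-vowel inventory, and the sample data are hypothetical.

```python
def phoneme_coverage(teacher_phonemes, inventory):
    """Fraction of `inventory` observed in the teacher labels.

    teacher_phonemes: list of phoneme sequences, one per labelled segment
    inventory:        iterable of all phonemes the acoustic model covers
    Assumption: coverage counts distinct phonemes, not their frequency.
    """
    inventory = set(inventory)
    if not inventory:
        raise ValueError("phoneme inventory must not be empty")
    seen = set()
    for utterance in teacher_phonemes:
        seen.update(p for p in utterance if p in inventory)
    return len(seen) / len(inventory)


vowels = ["a", "i", "u", "e", "o"]
coverage = phoneme_coverage([["a", "i"], ["a", "u", "x"]], vowels)
enough = coverage >= 0.5  # hypothetical predetermined value
```

Symbols outside the inventory (here `"x"`) are ignored, so the result stays within the range 0 to 1.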

The transition instruction unit 17 may also be configured to instruct the transition to the second state in response to an instruction input by the user. With this configuration, the transition from the first state to the second state can be made at the user's discretion, which further shortens the time required for acoustic model adaptation.

The transition instruction unit 17 may also be configured to set, as the second cluster, the cluster with the next highest reliability after the first cluster. With this configuration, the processing of the presentation unit 15 and the transition instruction unit 17 is simplified.

Alternatively, the transition instruction unit 17 may be configured to set, as the second cluster, the cluster with the next lowest reliability after the first cluster. With this configuration, the processing of the presentation unit 15 and the transition instruction unit 17 is likewise simplified.
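Both orderings reduce to walking a list of clusters sorted by reliability, as the following sketch shows. The function name `next_cluster`, the `order` parameter, and the sample scores are hypothetical.

```python
def next_cluster(reliability, current, order="descending"):
    """Return the cluster that follows `current` in reliability order.

    order="descending": the cluster with the next highest reliability.
    order="ascending":  the cluster with the next lowest reliability.
    Returns None when `current` is the last cluster in the chosen order.
    """
    if order not in ("descending", "ascending"):
        raise ValueError("order must be 'descending' or 'ascending'")
    ranked = sorted(reliability, key=reliability.get,
                    reverse=(order == "descending"))
    position = ranked.index(current)
    return ranked[position + 1] if position + 1 < len(ranked) else None


scores = {"c1": 0.9, "c2": 0.5, "c3": 0.7}
after_c1 = next_cluster(scores, "c1")                      # walking downward from 0.9
after_c2 = next_cluster(scores, "c2", order="ascending")   # walking upward from 0.5
```

Because the order is fixed once the reliabilities are known, neither the presentation unit nor the transition instruction unit needs any per-step selection logic, which is the simplification the text refers to.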

The present invention can be applied to an acoustic model adaptation device that can be incorporated into a speech recognition system, a speech detection system, or the like.

DESCRIPTION OF SYMBOLS
1 Dividing unit
2 Speech data clustering unit
3 Reliability calculation unit
4 Label estimation unit
5 Speech data and estimated label presentation unit
6 Teacher label acquisition unit
7 Transition unit
8 Acoustic model adaptation unit
10 Acoustic model adaptation device
11 Dividing unit
12 Clustering unit
13 Reliability calculation unit
14 Label estimation unit
15 Presentation unit
16 Teacher label acquisition unit
17 Transition instruction unit
17A Matching degree calculation unit
17B Instruction unit
17C Phoneme coverage calculation unit
17D Instruction unit
18 Acoustic model adaptation unit
20 Speech recognition device
30 Speech detection device
101 Cluster storage unit
101_1 to 101_n Clusters
102 Acoustic model storage unit
103 Estimated label storage unit
104 Teacher label storage unit
105 Adaptation model storage unit
105_1 to 105_n Adaptation models
200 Speech recognition system
300 Speech detection system

Claims

An acoustic model adaptation device comprising:
a dividing unit that divides a speech signal;
a clustering unit that clusters the speech signals divided by the dividing unit according to their acoustic differences;
a reliability calculation unit that calculates an acoustic reliability for the speech signals included in each cluster created by the clustering unit;
a label estimation unit that obtains estimated labels by recognizing the speech signals included in the clusters created by the clustering unit;
a presentation unit that presents to a user the speech signals included in a first cluster, which is a cluster selected from the clusters created by the clustering unit on the basis of the reliability calculated by the reliability calculation unit, together with the estimated labels obtained by the label estimation unit;
a teacher label acquisition unit that obtains teacher labels for the speech signals presented by the presentation unit;
a transition instruction unit that, when a predetermined condition is satisfied, instructs the presentation unit to transition from a first state, in which the speech signals and estimated labels of the first cluster are presented to the user, to a second state, in which the speech signals and estimated labels of a second cluster different from the first cluster are presented to the user; and
an acoustic model adaptation unit that adapts an acoustic model to the speech signals in a cluster using the teacher labels acquired by the teacher label acquisition unit.

The acoustic model adaptation device according to claim 1, wherein the transition instruction unit includes:
a matching degree calculation unit that calculates the degree of match between the estimated labels presented by the presentation unit and the teacher labels acquired by the teacher label acquisition unit; and
an instruction unit that, when the degree of match calculated by the matching degree calculation unit is equal to or greater than a predetermined value, regards the predetermined condition as satisfied and instructs the transition to the second state.

The acoustic model adaptation device according to claim 2, wherein the transition instruction unit includes:
a phoneme coverage calculation unit that calculates the phoneme coverage of the teacher labels acquired by the teacher label acquisition unit; and
an instruction unit that, when the phoneme coverage calculated by the phoneme coverage calculation unit is equal to or greater than a predetermined value, regards the predetermined condition as satisfied and instructs the transition to the second state.

The acoustic model adaptation device according to any one of claims 1 to 3, wherein the transition instruction unit instructs the transition to the second state in accordance with an instruction input from a user.

The acoustic model adaptation device according to any one of claims 1 to 4, wherein the transition instruction unit sets, as the second cluster, the cluster with the next highest reliability after the first cluster.

The acoustic model adaptation device according to any one of claims 1 to 4, wherein the transition instruction unit sets, as the second cluster, the cluster with the next lowest reliability after the first cluster.

A speech recognition system including the acoustic model adaptation device according to any one of claims 1 to 6.

A speech detection system including the acoustic model adaptation device according to any one of claims 1 to 6.

An acoustic model adaptation method comprising:
dividing a speech signal;
clustering the divided speech signals according to their acoustic differences;
calculating an acoustic reliability for the speech signals included in each cluster created by the clustering;
obtaining estimated labels by recognizing the speech signals included in the clusters created by the clustering;
presenting to a user the speech signals included in a first cluster, which is a cluster selected from the clusters created by the clustering on the basis of the reliability, together with the estimated labels;
obtaining teacher labels for the speech signals presented to the user;
transitioning, when a predetermined condition is satisfied, from a first state, in which the speech signals and estimated labels of the first cluster are presented to the user, to a second state, in which the speech signals and estimated labels of a second cluster different from the first cluster are presented to the user; and
adapting an acoustic model to the speech signals in a cluster using the obtained teacher labels.

An acoustic model adaptation program that causes a computer to execute:
a dividing process of dividing a speech signal;
a clustering process of clustering the speech signals divided in the dividing process according to their acoustic differences;
a reliability calculation process of calculating an acoustic reliability for the speech signals included in each cluster created in the clustering process;
a label estimation process of obtaining estimated labels by recognizing the speech signals included in the clusters created in the clustering process;
a presentation process of presenting to a user the speech signals included in a first cluster, which is a cluster selected from the clusters created in the clustering process on the basis of the reliability calculated in the reliability calculation process, together with the estimated labels obtained in the label estimation process;
a teacher label acquisition process of obtaining teacher labels for the speech signals presented in the presentation process;
a transition instruction process of instructing, when a predetermined condition is satisfied, a transition from a first state, in which the speech signals and estimated labels of the first cluster are presented to the user, to a second state, in which the speech signals and estimated labels of a second cluster different from the first cluster are presented to the user; and
an acoustic model adaptation process of adapting an acoustic model to the speech signals in a cluster using the teacher labels acquired in the teacher label acquisition process.