JP2000259169A

JP2000259169A - Voice recognition device and its recording medium

Info

Publication number: JP2000259169A
Application number: JP11057710A
Authority: JP
Inventors: Shoe Sato; 庄衛佐藤; Toru Imai; 亨今井; Akio Ando; 彰男安藤
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1999-03-04
Filing date: 1999-03-04
Publication date: 2000-09-22

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of speech recognition when an acoustic model is generated from speech samples of unspecified speakers. SOLUTION: An estimated sample selection unit 112 computes the distance between each speech sample of unspecified speakers in a speech database 109 and the speech samples of a specific speaker in a task speech database, selects the speech samples in the speech database 109 that are closer to the specific speaker's speech samples, and passes the selected samples and their transcribed texts to a parameter estimation unit 111.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a speech recognition device that performs speech recognition using the parameters of an acoustic model, and to a recording medium therefor.

[0002]

2. Description of the Related Art Conventionally, speech recognition devices have been proposed whose use can be restricted according to the type of speaker (for example, a specific speaker or unspecified speakers) or the type of device used for speech recognition (hereinafter, such uses are collectively called tasks). To improve the recognition accuracy of such a device, it is effective to use an acoustic model restricted to the recognition task. Estimating the parameters of an acoustic model requires a large number of speech samples together with text describing the utterance content of each sample (referred to as transcribed text). For example, the specific speaker to be recognized must input speech corresponding to the task so that the speech recognition device can create enough samples for parameter estimation.

[0003] However, inputting speech for each task is a heavy burden on the specific speaker, so speech samples that are not task-specific, for example speech samples input by a plurality of speakers, end up being used.

[0004] Therefore, a speech recognition method has been proposed that adapts to the task using utterance samples from a relatively small number of persons to be recognized ("Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models", C. J. Leggetter and P. C. Woodland, Computer Speech and Language 1995 (9) 171-185). In this proposal, the above-described model is used as the acoustic model.

[0005]

Problems to Be Solved by the Invention: Because the speech samples used for acoustic parameter estimation are collected from a plurality of speakers, speech samples relatively close to the characteristics of the specific speaker to be recognized are mixed with speech samples whose speaker characteristics adversely affect recognition accuracy.

[0006] An acoustic model estimated with samples whose speaker characteristics differ from those of the target speaker is a cause of reduced recognition accuracy.

[0007] In view of the above, an object of the present invention is to provide a speech recognition device, and a recording medium therefor, capable of excluding speech samples that adversely affect recognition accuracy when acoustic parameters are estimated from speech samples of a plurality of speakers for speech recognition of a specific speaker.

[0008]

Means for Solving the Problems: To achieve this object, the invention of claim 1 is a speech recognition device that creates an acoustic model from speech samples and transcribed text and performs speech recognition of input speech based on the created acoustic model, comprising: first storage means storing a plurality of speaker-independent first speech samples and a plurality of transcribed texts corresponding to them; second storage means storing a second speech sample corresponding to a task; selection means for selecting, from the plurality of first speech samples stored in the first storage means, a first speech sample closer to the second speech sample recorded in the second storage means; and information processing means for retrieving the transcribed text corresponding to the selected first speech sample from the first storage means, wherein the acoustic model is created from the retrieved transcribed text and the first speech sample corresponding to that transcribed text.

[0009] The invention of claim 2 is the speech recognition device according to claim 1, wherein there are a plurality of the second speech samples, and the selection means calculates the distance between each of the plurality of second speech samples and each of the plurality of first speech samples and selects the closer first speech samples based on the calculation results.

[0010] The invention of claim 3 is the speech recognition device according to claim 2, wherein the selection means determines whether each calculation result is within an allowable range and, when it is, selects the first speech sample used for that calculation result as a speech sample to be used for the acoustic model.

[0011] The invention of claim 4 is the speech recognition device according to claim 2, wherein the selection means sorts the calculation results in ascending order of distance, selects a fixed number of calculation results starting from the shortest distance, and selects the first speech samples used for the selected calculation results as speech samples to be used for the acoustic model.

[0012] The invention of claim 5 is the speech recognition device according to claim 1, further comprising clustering means for clustering a plurality of speaker-independent speech samples and storing the clustering result in the second storage means as the second speech samples.

[0013] The invention of claim 6 is the speech recognition device according to any one of claims 1 to 5, wherein the second speech sample corresponding to the task is a speech sample of a specific speaker.

[0014] The invention of claim 7 is a storage medium recording a program executed by a speech recognition device that creates an acoustic model from speech samples and transcribed text and performs speech recognition of input speech based on the created acoustic model, wherein the speech recognition device has first storage means storing a plurality of speaker-independent first speech samples and a plurality of transcribed texts corresponding to them, and second storage means storing a second speech sample corresponding to a task; the program has a selection step of selecting, from the plurality of first speech samples stored in the first storage means, a first speech sample closer to the second speech sample recorded in the second storage means, and an information processing step of retrieving the transcribed text corresponding to the selected first speech sample from the first storage means; and the acoustic model is created from the retrieved transcribed text and the first speech sample corresponding to that transcribed text.

[0015] The invention of claim 8 is the recording medium of the speech recognition device according to claim 7, wherein there are a plurality of the second speech samples, and in the selection step the distance between each of the plurality of second speech samples and each of the plurality of first speech samples is calculated and the closer first speech samples are selected based on the calculation results.

[0016] The invention of claim 9 is the recording medium of the speech recognition device according to claim 8, wherein in the selection step it is determined whether each calculation result is within an allowable range and, when it is, the first speech sample used for that calculation result is selected as a speech sample to be used for the acoustic model.

[0017] The invention of claim 10 is the recording medium of the speech recognition device according to claim 8, wherein in the selection step the calculation results are sorted in ascending order of distance, a fixed number of calculation results are selected starting from the shortest distance, and the first speech samples used for the selected calculation results are selected as speech samples to be used for the acoustic model.

[0018] The invention of claim 11 is the recording medium of the speech recognition device according to claim 7, further comprising a clustering step of clustering a plurality of speaker-independent speech samples and storing the clustering result in the second storage means as the second speech samples.

[0019] The invention of claim 12 is the recording medium of the speech recognition device according to any one of claims 7 to 11, wherein the second speech sample corresponding to the task is a speech sample of a specific speaker.

[0020]

Embodiments of the Invention: Embodiments of the present invention are described below in detail with reference to the drawings.

[0021] Before describing the speech recognition device to which the present invention is applied, the speech recognition method of the present invention, more specifically the method of estimating acoustic parameters, is described.

[0022] In this embodiment, a relatively small number of utterance samples (speech samples) from the person to be recognized are used as references (the samples used for comparison), and speech samples close to that person are picked out from speech samples of different speakers. The parameters of an acoustic model specialized for the task are then estimated from the selected samples. To pick out close speech samples, the distance between each reference speech sample and each of the many speech samples from different speakers (the inter-utterance distance) is used.

[0023] In this embodiment, the reference speech samples are additionally clustered (similar samples are gathered into one group), and the acoustic models are created after securing enough estimation samples for each acoustic model by the method described above.

[0024] The following describes how speech samples close to the speech samples to be recognized are selected from a large number of speech samples from different speakers.

[0025] The samples used for estimation are selected based on the distance (inter-utterance distance) between each of the speech samples from different speakers and each of the speech samples of the person to be recognized that are used as references. Because the relationship between the inter-utterance distance and speech recognition accuracy is nonlinear, either or both of the Abs and Rt selection methods described below are used to select samples.

[0026] The Abs selection method selects, from the speech samples of different speakers, those whose distance from any one of the reference speech samples is equal to or less than a predetermined value x.

[0027] The Rt selection method selects, from the speech samples of different speakers, r% of the total number of samples, starting with the samples closest to the reference samples.

[0028] The distances between the samples that differ in speaker and utterance content and the reference samples are calculated individually, and speech samples with short distances are selected by the methods described above. In this embodiment, the speech samples used as references are clustered beforehand. In this clustering, the reference samples are divided into a plurality of clusters (groups) by, for example, a method called the complete-linkage hierarchical partitioning method (see Toshio Nishida, 多変量解析の徹底研究 (A Thorough Study of Multivariate Analysis), Gendai Sugakusha). For each clustered group of references, speech samples selected from the samples of different speakers are prepared and the acoustic model parameters are estimated, so that speech recognition suited to the task is performed and recognition accuracy is improved over the conventional approach.
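
The clustering of reference samples described in paragraph [0028] can be illustrated with a small sketch. It assumes the usual agglomerative form of complete-linkage clustering, which may differ from the exact variant in the cited book; the function name `complete_linkage`, the integer "samples", and the absolute-difference distance are all hypothetical stand-ins for real speech samples and an inter-utterance distance.

```python
from itertools import combinations

def complete_linkage(items, distance, num_clusters):
    """Merge clusters until num_clusters remain.

    Under complete linkage, the distance between two clusters is the
    largest distance between any member of one and any member of the other.
    """
    clusters = [[item] for item in items]
    while len(clusters) > num_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = max(distance(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Toy reference "samples", each reduced to one number (hypothetical data).
refs = [0, 2, 3, 50, 52, 90]
groups = complete_linkage(refs, lambda a, b: abs(a - b), 3)
print(groups)
```

Each resulting group would then serve as the reference set for one task-specific acoustic model.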

[0029] Several embodiments to which the present invention is applied are described below.

[0030] (First Embodiment) FIG. 1 shows the system configuration of the first embodiment of the present invention. The first embodiment is an example in which the estimation data selection unit is placed inside the acoustic model parameter estimation unit of the speech recognition device. In FIG. 1, the recognition speech input unit 101 inputs the speech to be recognized. The acoustic analysis unit 102 analyzes the input speech and outputs a sequence of acoustic feature vectors. The acoustic features include information indicating the frequency characteristics of the input speech. The speech recognition decoder 103 performs word-by-word speech recognition using the language model 104 and the acoustic model 106, and outputs the speech recognition result 105. For speech recognition, a method referred to as, for example, probabilistic sentence speech recognition can be used (Ronald Rosenfeld, "The CMU Statistical Language Modeling Toolkit and its use in the 1994 ARPA CSR Evaluation", Proceedings of the Spoken Language Systems Technology Workshop, pp. 47-50 (1995.1)).

[0031] More specifically, the language model 104 uses its built-in word sequences, which are joined grammatically (or semantically), together with the words recognized so far, to give the speech recognition decoder 103 the linguistic likelihood (language score) of the word currently being recognized. The acoustic model 106 matches candidate word phoneme sequences from the word pronunciation dictionary 107 against the input speech (in the form of an acoustic feature vector sequence), calculates the likelihood (acoustic score) that the input speech is that word, and gives the acoustic score to the speech recognition decoder 103.
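
The text does not say how the decoder combines the two scores. As a minimal sketch, assuming both scores are log probabilities and are simply added, the decoder's choice among candidate words might look like the following; `best_word` and the score tables are hypothetical.

```python
def best_word(candidates, language_score, acoustic_score):
    """Return the candidate with the highest combined log score."""
    return max(candidates, key=lambda w: language_score[w] + acoustic_score[w])

# Hypothetical log-probability scores for three candidate words.
language_score = {"whether": -2.3, "weather": -0.7, "wether": -6.9}
acoustic_score = {"whether": -1.1, "weather": -1.2, "wether": -1.0}

choice = best_word(list(language_score), language_score, acoustic_score)
print(choice)
```

Here the acoustic scores are nearly tied, so the language score decides the outcome.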

[0032] The parameter estimation unit 111 estimates the acoustic model parameters used to calculate acoustic scores, based on the acoustic parameters of the speech samples selected by the estimated sample selection unit 112 and the transcribed texts giving the utterance content of those samples. The estimated sample selection unit 112 has a task speech database 108, a speech database 109, and an estimated sample determination unit 110. The task speech database 108 stores task-specific speech samples in acoustic parameter form; one example of a task is storing speech samples of a specific speaker in the task speech database 108.

[0033] The speech database 109 stores speech samples of unspecified speakers (speaker-independent samples) in acoustic parameter form, together with their transcribed texts. The estimated sample determination unit 110 uses the Abs selection method or the Rt selection method described above to select, from the speech database 109, speech samples whose speaker characteristics are close to the speech samples of a particular task (for example, a specific speaker) in the task speech database 108, the task being designated externally in advance.

[0034] (First Form of the Estimated Sample Selection Unit) FIG. 2 shows the circuit configuration of the estimated sample selection unit 112 when the Abs selection method is used. In FIG. 2, parts identical to those in FIG. 1 carry the same reference numerals, and their detailed description is omitted. The utterance distance matrix calculation unit 206 receives n speech samples 203 from the speech database 109 and m speech samples 204 from the task speech database 108, calculates the distance between each pair of samples using a predetermined inter-utterance distance formula, and temporarily stores the results in the form of an m×n matrix (referred to as the utterance distance matrix). The values of each row of the utterance distance matrix are input to the shortest distance search unit 207.

[0035] The shortest distance search unit 207 finds the shortest distance in each row and outputs to the comparison operation unit 209 that value and the identification name (identification ID) of the speech sample in the speech database 109 that was used to calculate it. The comparison operation unit 209 compares the shortest distance of each row with an externally supplied parameter value x, and outputs to the estimated sample list creation output unit 211 the identification IDs whose shortest distance is within the allowable range (in this case, a value smaller than x).

[0036] The estimated sample list creation output unit 211 searches the speech database 109, reads out the speech samples having the input identification IDs, and outputs them to the parameter estimation unit 111 in FIG. 1. In this way, the speech samples in the speech database 109 that are close to the task speech samples are selected.
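
The flow through units 206 to 211 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the distance table is laid out with one row per database sample (so that the row minimum is that sample's distance to its nearest reference), the threshold test uses "at most x" as in paragraph [0026], and the names `select_abs`, `build_estimation_list`, and all data values are hypothetical.

```python
def select_abs(distance_rows, x):
    """Keep database sample IDs whose shortest distance to any reference is <= x.

    distance_rows maps a database sample ID to its distances from every
    reference sample.
    """
    selected = []
    for sample_id, distances in distance_rows.items():
        shortest = min(distances)   # role of the shortest distance search unit
        if shortest <= x:           # role of the comparison operation unit
            selected.append(sample_id)
    return selected

def build_estimation_list(selected_ids, database):
    """Look up (features, transcript) for each selected ID."""
    return [database[sample_id] for sample_id in selected_ids]

# Hypothetical data: three database samples, two reference samples.
database = {
    "utt1": ([0.1, 0.4], "good morning"),
    "utt2": ([0.9, 0.8], "here is the news"),
    "utt3": ([0.2, 0.3], "thank you"),
}
distance_rows = {"utt1": [0.5, 1.8], "utt2": [2.4, 2.1], "utt3": [1.6, 0.9]}

ids = select_abs(distance_rows, x=1.0)
training = build_estimation_list(ids, database)
print(ids)
```

The resulting list of samples and transcripts is what would be handed to the parameter estimation step.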

[0037] Any of the following formulas can be used to calculate the inter-utterance distance described above.

[0038] (1) Log Likelihood Ratio

[0039]

(Equation 1)

[0040] (Reference: "Segregation of Speakers for Speech Recognition and Speaker Identification", Herbert Gish, Man-Hung Siu, and Robin Rohlicek, ICASSP (91) 873-172)

(2) Arithmetic Harmonic Sphericity

[0041]

(Equation 2)

[0042] (Reference: "Text-Free Speaker Recognition Using an Arithmetic-Harmonic Sphericity Measure", F. Bimbot and L. Mathan, Eurospeech 93 (1) 169-172)

(3) Gaussian Divergence

[0043]

(Equation 3)

[0044] (Reference: "Speaker Clustering Using Direct Maximisation of the MLLR-Adapted Likelihood", S. E. Johnson and P. C. Woodland, ICSLP-98 #726)

Here, X is a reference sample and Y is a sample in the database; μ_X, μ_Y, Σ_X, Σ_Y are the mean vectors and covariance matrices of samples X and Y. N_X, N_Y, and N are the number of vectors in each sample and the total number of vectors in X and Y, D is the dimensionality of the vectors, and I is the identity matrix. In addition,

[0045]

(Equation 4)

[0046] holds.
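
The formulas for Equations 1 to 4 do not appear in this text. Purely as an illustration of how such a distance could be computed, the sketch below uses a form of the arithmetic-harmonic sphericity measure commonly given in the literature, log(tr(Σ_Y Σ_X⁻¹) · tr(Σ_X Σ_Y⁻¹)) − 2 log D. It may differ from the patent's Equation 2, and it assumes diagonal covariance matrices so that each trace reduces to a sum of variance ratios. The function name `ahs_distance` and the sample variances are hypothetical.

```python
import math

def ahs_distance(var_x, var_y):
    """AHS measure for diagonal covariances given as lists of variances."""
    d = len(var_x)
    trace_yx = sum(vy / vx for vx, vy in zip(var_x, var_y))  # tr(S_Y S_X^-1)
    trace_xy = sum(vx / vy for vx, vy in zip(var_x, var_y))  # tr(S_X S_Y^-1)
    return math.log(trace_yx * trace_xy) - 2.0 * math.log(d)

same = ahs_distance([1.0, 2.0], [1.0, 2.0])
different = ahs_distance([1.0, 2.0], [4.0, 0.5])
print(same, different)
```

In this form the measure is zero for identical covariances and grows as the two covariance shapes diverge; it does not depend on the mean vectors.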

[0047] (Second Form of the Estimated Sample Selection Unit) FIG. 3 shows the circuit configuration of the estimated sample selection unit 112 when the Rt selection method is used. In FIG. 3, parts identical to those in FIG. 1 carry the same reference numerals, and their detailed description is omitted.

[0048] The utterance distance matrix calculation unit 305 receives n speech samples (in acoustic parameter form) 303 from the speech database 109 and m speech samples (in acoustic parameter form) from the task speech database 108, and calculates the distance between each pair of samples drawn from the two sets. The results are stored, together with the identification IDs of the speech samples, in the form of an m×n matrix and output to the utterance distance sorting unit 306. The utterance distance sorting unit 306 sorts each column of the matrix in ascending order of distance (smallest to largest). From the sorted matrix, the estimated sample list extraction unit 307 selects speech samples, here their identification IDs, in order from the smallest value until a fixed number, namely the number corresponding to m×n×r%, has been selected, and removes duplicated distance values and IDs in each column. The selected identification IDs are output to the estimated sample list creation output unit 308. The estimated sample list creation output unit 308 searches the speech database 109, extracts the acoustic parameters corresponding to the identification IDs and their transcribed texts, and outputs them to the parameter estimation unit 111 in FIG. 1.
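
A minimal sketch of the Rt selection flow follows. The text sorts within each column before extraction; the sketch simplifies this by sorting all m×n entries together, and it rounds m×n×r% up to a whole number. Both choices are assumptions, as are the name `select_rt` and the data values.

```python
import math

def select_rt(distance_matrix, db_ids, r_percent):
    """Keep the database IDs behind the smallest r% of all m*n distances.

    distance_matrix[i][j] is the distance between reference sample i and
    database sample j; db_ids[j] is the ID of database sample j.
    """
    entries = [
        (dist, db_ids[j])
        for row in distance_matrix
        for j, dist in enumerate(row)
    ]
    entries.sort(key=lambda e: e[0])                   # ascending by distance
    count = math.ceil(len(entries) * r_percent / 100)  # m * n * r%
    selected = []
    for _, sample_id in entries[:count]:
        if sample_id not in selected:                  # drop duplicate IDs
            selected.append(sample_id)
    return selected

db_ids = ["utt1", "utt2", "utt3"]
distance_matrix = [
    [0.5, 2.4, 1.6],  # distances from reference 1
    [1.8, 2.1, 0.9],  # distances from reference 2
]
ids = select_rt(distance_matrix, db_ids, r_percent=50)
print(ids)
```

Unlike the Abs method, the number of distances considered is fixed in advance, so the selection adapts to how spread out the database is rather than to an absolute threshold.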

[0049] (Second Embodiment) A second embodiment, in which a plurality of acoustic models are created by task, is described next. FIG. 4 shows the system configuration of the second embodiment. In FIG. 4, speech is input from the recognition speech input unit 401. The input speech is converted into acoustic parameters by the acoustic analysis unit. The speech decoder 403 recognizes the acoustic parameters to be recognized (the input speech) using the language model 404 and the plurality of acoustic models 406, and outputs the recognition result 405. The word pronunciation dictionary 407 supplies candidate word phoneme sequences to the acoustic models 406. These components are the same as in the first embodiment except that there are a plurality of acoustic models.

[0050] In this embodiment, as many estimated sample determination units 110 and parameter estimation units 111 of FIG. 1 are provided as there are acoustic models. The speech samples in the task speech database 108 of FIG. 1 (410 in FIG. 4) are clustered into four groups (clustering gathers samples that are close to one another into one group, or cluster), and each group of clustered speech samples is stored in one of a plurality of databases 416. The speech samples of each clustered database 416 are compared with the speech samples of the speech database 408 (corresponding to the speech database 109 in FIG. 1) by calculating the distances between them, and the corresponding estimated sample determination unit 412 selects the speech samples in the speech database whose acoustic characteristics are close. Each estimated sample determination unit 412 outputs the selected speech samples and transcribed texts to the corresponding parameter estimation unit 414.
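
The per-cluster selection in the second embodiment can be sketched as one selection pass per reference cluster, each producing the training list for one acoustic model. The sketch assumes Abs-style thresholding; the function name `training_sets_per_cluster`, the one-dimensional features, and the distance function are hypothetical.

```python
def training_sets_per_cluster(clusters, database, distance, x):
    """Return one list of database sample IDs per reference cluster.

    A database sample is chosen for a cluster when it lies within
    distance x of at least one reference sample in that cluster.
    """
    result = []
    for references in clusters:
        chosen = [
            sample_id
            for sample_id, (features, _transcript) in database.items()
            if min(distance(features, ref) for ref in references) <= x
        ]
        result.append(chosen)
    return result

# Hypothetical one-dimensional features, for illustration only.
database = {
    "utt1": (1.0, "good morning"),
    "utt2": (5.2, "here is the news"),
    "utt3": (1.4, "thank you"),
}
clusters = [[1.1, 1.3], [5.0]]  # two clusters of reference samples
sets = training_sets_per_cluster(clusters, database, lambda a, b: abs(a - b), 0.5)
print(sets)
```

Each inner list would feed its own parameter estimation step, giving one acoustic model per cluster.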

[0051] Compared with the first embodiment, the second embodiment uses more finely task-restricted acoustic models and therefore has the advantage of improved recognition accuracy.

[0052] (Third Embodiment) A third embodiment is described in which a general-purpose computer (personal computer) implements the speech recognition device by executing a speech recognition program.

[0053] FIG. 5 shows the system configuration of the third embodiment. In FIG. 5, a CPU 500, a system memory 510, an input device 520, a disk reader 530, a hard disk storage device (hereinafter abbreviated as hard disk) 540, a display 550, and a speech input device 560 are connected to a bus.

[0054] The CPU 500 performs speech processing in accordance with the speech recognition program loaded into the system memory 510. In this embodiment, both the creation of acoustic models and language models and speech recognition using the acoustic model and language model are possible.

[0055] The CPU 500 also performs system control in accordance with the operating system stored on the hard disk 540.

【００５６】システムメモリ５１０はＲＯＭおよびＲＡ
Ｍを有し、ＣＰＵ５００が実行するプログラム、演算に
使用されるデータ等を記憶する。入力装置５２０はキー
ボードおよびマウスを有し、キーボードおよびマウスを
使用して、ＣＰＵ５００に対する各種の命令、データの
入力を行う。入力装置５２０は音声データベースへの登
録する書き起こしテキストを入力することもできる。The system memory 510 has a ROM and a RAM, and stores programs executed by the CPU 500, data used for computation, and the like. The input device 520 has a keyboard and a mouse, which are used to input various commands and data to the CPU 500. The input device 520 can also be used to input transcribed text to be registered in the voice database.

【００５７】ディスク読み取り装置５３０はフロッピ
ー、ＣＤＲＯＭなどの携帯用記録媒体から記録データを
読み取る。本実施形態では、上述の音声認識プログラム
を記録媒体から読み取る。The disk reading device 530 reads recorded data from a portable recording medium such as a floppy disk or a CD-ROM. In the present embodiment, it reads the above-described speech recognition program from the recording medium.

【００５８】ハードディスク５４０はシステム制御で使
用するオペレーティングシステム、音声認識プログラ
ム、音声認識プログラムで使用する言語モデル、音響モ
デル、音声データベース（図４の音声データベース４０
８に対応）、タスク音声データベース（図４のタスク音
声データベース４１０に対応）、その他、音声認識に必
要なデータを保存記憶する。ディスプレイ５５０は入力
装置５２０から入力されたデータや、音声認識結果など
をＣＰＵ５００の制御の下に表示する。音声入力装置５
６０はマイクロホンおよびアナログデジタル変換器を有
し、入力された音声をＣＰＵ５００が処理可能なデジタ
ル音声信号の形態で出力する。入力音声は音声の認識の
対象となる場合と、タスク音声データベースへの登録の
対象となる場合がある。The hard disk 540 stores the operating system used for system control, the speech recognition program, the language model and acoustic model used by the speech recognition program, a voice database (corresponding to the voice database 408 in FIG. 4), a task voice database (corresponding to the task voice database 410 in FIG. 4), and other data required for speech recognition. The display 550 displays data input from the input device 520, speech recognition results, and the like under the control of the CPU 500. The voice input device 560 has a microphone and an analog-to-digital converter, and outputs input voice in the form of a digital voice signal that the CPU 500 can process. The input voice may be a target of speech recognition or a target of registration in the task voice database.

【００５９】音声認識プログラムは音声認識それ自体を
行うプログラム部分と音響モデルを作成するプログラム
部分に分かれている。音声認識プログラム部分は従来と
同様の、言語モデルおよび音響モデルを使用して音声認
識を行うプログラムを使用することができる。また、音
響モデルを作成するプログラムの中の本発明に係る音声
サンプル選択処理以外については従来と同様とすること
ができるので、音声サンプル選択処理のためのプログラ
ムについて説明する。The speech recognition program is divided into a program portion for performing speech recognition itself and a program portion for creating an acoustic model. As the speech recognition program portion, a program for performing speech recognition using a language model and an acoustic model as in the related art can be used. In addition, since the program other than the audio sample selection process according to the present invention in the program for creating the acoustic model can be the same as the conventional one, the program for the audio sample selection process will be described.

【００６０】音声サンプルの選択処理を実行するための
プログラムの一例を図６に示す。このプログラムは音声
認識プログラムの一部分をなし、音響モデルの作成の実
行の指示が入力装置５２０から入力されたときにＣＰＵ
５００により実行される。また、ハードディスク５４０
上の音声データベースには音声サンプルおよび書き起こ
しサンプルが、タスク音声データベースには音声サンプ
ルが予め登録されているものとする。FIG. 6 shows an example of a program for executing the voice sample selection process. This program forms part of the speech recognition program and is executed by the CPU 500 when an instruction to create an acoustic model is input from the input device 520. It is assumed that voice samples and transcribed text have been registered in advance in the voice database on the hard disk 540, and that voice samples have been registered in advance in the task voice database.

【００６１】図６において、ＣＰＵ５００は音声データ
ベース中の音声サンプルに対してクラスタリング（完全
連鎖分割階層法と呼ばれる方法（西田俊夫著、多変量解
析の徹底研究、現代数学社発行を参照）ｎクラスタリン
グ）を行い、ハードディスク５４０上にデータベース
（図４のクラスタリングされたタスク音声データベース
４１６に対応）の形態でクラスタリングの結果を記憶す
る（ステップ１０００）。たとえば、４つのクラスタに
音声サンプルが分類されると、４つのデータベースが構
築される。In FIG. 6, the CPU 500 performs clustering on the voice samples in the voice database (n-clustering using a method called the complete-linkage divisive hierarchical method; see Toshio Nishida, Thorough Study of Multivariate Analysis, published by Gendai Sugakusha) and stores the clustering result on the hard disk 540 in the form of databases (corresponding to the clustered task voice databases 416 in FIG. 4) (step 1000). For example, if the voice samples are classified into four clusters, four databases are constructed.

【００６２】ＣＰＵ５００は第１番目のクラスタを指定
し、対応のデータベース（クラスタ別データベースと称
する）から第１番目の音声サンプルを読み出し、システ
ムメモリ５１０に一時記憶する。次に音声データベース
から第１番目の音声サンプル（識別ＩＤ付き）を読み出
し、システムメモリ５１０に一時記憶する。読み出され
た２つの音声サンプルの間の距離計算がＣＰＵ５００に
より行われ、計算結果がシステムメモリ５１０に記憶さ
れる。この計算結果を記憶しておくためのシステムメモ
リ５１０の領域はマトリクス構成となっている。次にＣ
ＰＵ５００は音声データベースから第２番目の音声サン
プルを読み出し、システムメモリ５１０に距離計算のた
めに一時記憶する。クラスタ別データベースから読み出
した第１番目の音声サンプルと音声データベースから読
み出した第２番目の音声サンプルの距離計算を行いシス
テムメモリの計算結果記憶領域に書き込む。The CPU 500 designates the first cluster, reads the first voice sample from the corresponding database (referred to as a cluster-specific database), and temporarily stores it in the system memory 510. Next, it reads the first voice sample (with its identification ID) from the voice database and temporarily stores it in the system memory 510. The CPU 500 calculates the distance between the two voice samples that were read and stores the result in the system memory 510. The area of the system memory 510 that holds the calculation results is organized as a matrix. Next, the CPU 500 reads the second voice sample from the voice database and temporarily stores it in the system memory 510 for distance calculation. It calculates the distance between the first voice sample read from the cluster-specific database and the second voice sample read from the voice database, and writes the result to the calculation result storage area of the system memory.

【００６３】このようにして、タスク別データベースの
第１番目の音声サンプルと、音声データベースの全ての
音声サンプルの距離計算を行い、計算結果をシステムメ
モリ５１０に書き込んで行く。タスク別データベースの
第１番目の音声サンプルについての距離計算が済むと、
ＣＰＵ５００は音声データベースの音声サンプル全てに
ついてタスク別データベースの第２番目の音声サンプル
との間の距離計算を行って、システムメモリ５１０に計
算結果を書き込む。In this way, the distances between the first voice sample in the task-specific database (the cluster-specific database described above) and all voice samples in the voice database are calculated, and the results are written to the system memory 510. When the distance calculation for the first voice sample in the task-specific database is finished, the CPU 500 calculates the distances between the second voice sample in the task-specific database and all voice samples in the voice database, and writes the results to the system memory 510.

【００６４】このような処理を行うと、タスク別データ
ベースの全ての音声サンプルとタスク別データベースの
全ての音声サンプルの距離計算が行われる（ステップ１
０２０）。When this processing has been carried out, the distances between all voice samples in the task-specific database and all voice samples in the voice database have been calculated (step 1020).
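The distance calculation of step 1020 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes each voice sample is already reduced to a fixed-length feature vector and that the distance is Euclidean, neither of which is specified in this passage. The names `euclidean` and `distance_matrix` are hypothetical.

```python
import math

def euclidean(a, b):
    """Assumed distance measure: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(task_samples, db_samples):
    """Row i, column j holds the distance between task-side sample i
    and voice-database sample j (the matrix-shaped result area)."""
    return [[euclidean(t, d) for d in db_samples] for t in task_samples]

# Hypothetical feature vectors, for illustration only.
task_samples = [[0.0, 0.0], [1.0, 1.0]]
db_samples = [[0.0, 0.0], [3.0, 4.0], [1.0, 2.0]]
matrix = distance_matrix(task_samples, db_samples)
```

Each row corresponds to one task-side sample and each column to one voice database sample, so a later selection step can map a small distance back to the database sample that produced it.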

【００６５】ＣＰＵ５００はシステムメモリ５１０に記
憶された距離計算結果について後述の音声サンプル選択
処理を実行し、選択した音声サンプルの識別ＩＤ、たと
えば、音声データベースの記憶アドレスをシステムメモ
り５１０またはハードディスク５４０に保存する（ステ
ップ１０３０）。保存の形態は文書ファイルの形態（い
わゆるリスト）とすることができる。The CPU 500 executes the voice sample selection process described later on the distance calculation results stored in the system memory 510, and saves the identification IDs of the selected voice samples, for example their storage addresses in the voice database, in the system memory 510 or on the hard disk 540 (step 1030). The IDs can be saved in the form of a document file (a so-called list).

【００６６】１つのクラスタについての音声サンプルの
選択処理を終了すると、ＣＰＵ５００はクラスタを別の
ものに切り替えて上述と同様にして、音声データベース
の音声サンプルと、クラスタ別データベースの音声サン
プルの距離計算および音声サンプルの選択処理を行う。
（ステップ１０４０→ステップ１０２０〜１０３０）。
全てのクラスタ別データベースについて音声サンプルの
距離計算および選択処理を行うと、選択された音声サン
プルの識別ＩＤに基づき音声データベースから識別ＩＤ
に対応する書き起こしテキストを抽出すると共に、音声
サンプル（音響パラメータ）をパラメータ推定処理のた
めのプログラムに引き渡すためにシステムメモリ５１０
の特定領域に記憶して、図６の処理手順を終了する（ス
テップ１０５０）。When the voice sample selection process for one cluster is finished, the CPU 500 switches to another cluster and, in the same manner as described above, performs the distance calculation between the voice samples in the voice database and the voice samples in the cluster-specific database, followed by the voice sample selection process (step 1040 → steps 1020 to 1030). When the distance calculation and selection process have been performed for all cluster-specific databases, the CPU 500 extracts from the voice database the transcribed text corresponding to the identification IDs of the selected voice samples, stores it together with the voice samples (acoustic parameters) in a specific area of the system memory 510 so that they can be handed over to the program for parameter estimation processing, and ends the processing procedure of FIG. 6 (step 1050).
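The overall flow of FIG. 6 (steps 1020 to 1050) might look like the following Python sketch. It is written under several assumptions that the text does not state: samples are fixed-length feature vectors, the distance is Euclidean, a simple threshold test stands in for the selection step, and a sample already chosen for a cluster is not chosen twice. The function and variable names are hypothetical.

```python
import math

def euclidean(a, b):
    """Assumed distance measure between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_per_cluster(clusters, database, threshold):
    """clusters: list of clusters, each a list of task-side feature vectors.
    database: dict mapping sample ID -> (feature vector, transcribed text).
    Returns one list of (sample ID, transcribed text) pairs per cluster."""
    training_sets = []
    for cluster in clusters:                      # step 1040: next cluster
        ids = []
        for task_vec in cluster:                  # step 1020: distance calculation
            for sample_id, (vec, _text) in database.items():
                if sample_id not in ids and euclidean(task_vec, vec) < threshold:
                    ids.append(sample_id)         # step 1030: keep the ID
        # step 1050: look up the transcribed text by ID
        training_sets.append([(sid, database[sid][1]) for sid in ids])
    return training_sets

# Hypothetical data, for illustration only.
database = {
    "a": ([0.0, 0.0], "hello"),
    "b": ([5.0, 5.0], "good evening"),
    "c": ([0.5, 0.0], "news"),
}
clusters = [[[0.0, 0.1]], [[5.0, 4.5]]]
training_sets = select_per_cluster(clusters, database, threshold=1.0)
```

Keeping one result list per cluster mirrors the idea that each cluster can feed its own parameter estimation.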

【００６７】Ａｂｓ選択方法を採用する場合の図６の音
声サンプルの選択処理の詳細を図７に示す。図７におい
て、ＣＰＵ５００はシステムメモリ５１０に記憶されて
いる距離の計算結果（マトリクス形態）を第１番目から
順次に読み出す（ステップ１１００）。読み出された距
離の計算結果の値と閾値（上述のｘ）との大小間の比較
をＣＰＵ５００は行う（ステップ１１１０）。FIG. 7 shows the details of the audio sample selection process in FIG. 6 when the Abs selection method is employed. In FIG. 7, CPU 500 sequentially reads distance calculation results (matrix form) stored in system memory 510 from the first position (step 1100). The CPU 500 compares the value of the read distance calculation result with the threshold (x described above) in magnitude (step 1110).

【００６８】距離計算の結果が閾値よりも、小さい場
合、すなわち、音声データベースの音声サンプルと、タ
スク別データベースの音声サンプルが音響的に近い場合
には、読み出した音声データベース側の音声サンプルお
よびそれに付随している識別ＩＤをシステムメモリ５１
０に記憶する（ステップ１１２０）。If the result of the distance calculation is smaller than the threshold, that is, if the voice sample in the voice database and the voice sample in the task-specific database are acoustically close, the voice sample read from the voice database and its associated identification ID are stored in the system memory 510 (step 1120).

【００６９】一方、距離計算の結果の値が閾値より大き
い場合には、手順をステップ１１００に戻し、次の距離
計算結果を読み出して閾値と比較する（ステップ１１３
０→１１００）。以下、全ての距離計算結果について、
ステップ１１０〜１１３０の処理を繰り返し実行する
と、閾値より距離計算結果の値が小さい音声サンプルを
選択することができる。以上が１つのクラスタについて
の音声サンプルの選択処理であり、クラスタ別に同じ処
理を繰り返すことになる。On the other hand, if the value of the distance calculation result is larger than the threshold, the procedure returns to step 1100, and the next distance calculation result is read and compared with the threshold (step 1130 → 1100). By repeating the processing of steps 1100 to 1130 for all distance calculation results, the voice samples whose distance calculation result is smaller than the threshold can be selected. This is the voice sample selection process for one cluster, and the same process is repeated for each cluster.
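A minimal Python sketch of the Abs selection loop (steps 1100 to 1130) is given here. It assumes the distance results are held as a list of rows, one row per task-side sample and one column per voice database sample, and it skips a sample that has already been selected; that de-duplication is an assumption, not something the text specifies. The name `select_abs` is hypothetical.

```python
def select_abs(matrix, db_ids, threshold):
    """Return the IDs of voice-database samples whose distance to some
    task-side sample is smaller than the threshold x."""
    selected = []
    for row in matrix:                         # step 1100: read results in order
        for db_id, dist in zip(db_ids, row):
            # step 1110: compare with the threshold
            if dist < threshold and db_id not in selected:
                selected.append(db_id)         # step 1120: keep the sample ID
    return selected

# Hypothetical distance results, for illustration only.
matrix = [[0.5, 2.0, 1.5],
          [3.0, 0.9, 1.2]]
db_ids = ["s1", "s2", "s3"]
chosen = select_abs(matrix, db_ids, threshold=1.0)
```

With these illustrative values, only `s1` (distance 0.5) and `s2` (distance 0.9) fall below the threshold.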

【００７０】Ｒｔ選択方法により音声サンプルの選択を
行う場合の処理手順を図８に示す。図８において、ＣＰ
Ｕ５００はシステムメモリ５１０に記憶されている全て
の距離計算結果についてのソーティング処理を行う（ス
テップ１２００）。複数のデータのの値を昇順に並び換
える処理自体は周知のプログラムを使用するとよい。FIG. 8 shows the processing procedure for selecting voice samples by the Rt selection method. In FIG. 8, the CPU 500 sorts all the distance calculation results stored in the system memory 510 (step 1200). A well-known program may be used for the process of rearranging a plurality of data values in ascending order.

【００７１】次にＣＰＵ５００はシステムメモリ５１０
に記憶された距離計算結果の個数を取得し、一定の比率
ｒ％を乗じることにより選択すべき音声サンプル数Ｎを
計算する（ステップ１２１０→１２２０）。Next, the CPU 500 obtains the number of distance calculation results stored in the system memory 510 and calculates the number N of voice samples to be selected by multiplying that number by a fixed ratio r% (step 1210 → 1220).

【００７２】ＣＰＵ５００は昇順に並べられた距離計算
の値の中から最も小さい値から順にその値の計算に使用
された音声データベースの音声サンプルおよびその識別
ＩＤをシステムメモリ５１０から取り出し、別領域に記
憶する（ステップ１２４０）。この時、音声サンプルの
取り出し回数Ｉが計数され（ステップ１２６０）、選択
すべき音声サンプル数Ｎに到達したしたことがステップ
１２５０で検出されると、図８の処理手順が終了する。Working through the distance calculation values arranged in ascending order, starting from the smallest, the CPU 500 retrieves from the system memory 510 the voice database sample used to calculate each value together with its identification ID, and stores them in a separate area (step 1240). The number of retrievals I is counted (step 1260), and when step 1250 detects that I has reached the number N of voice samples to be selected, the processing procedure of FIG. 8 ends.
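The Rt selection procedure of FIG. 8 can be sketched in Python as below. The sketch assumes the same row-per-task-sample matrix layout as the distance results, and it truncates N to an integer; the rounding rule is an assumption, since the text only says that the count is multiplied by r%. The name `select_rt` is hypothetical.

```python
def select_rt(matrix, db_ids, ratio_percent):
    """Sort all distance results in ascending order and return the
    voice-database IDs behind the N smallest ones, N = count * r%."""
    # step 1200: sort every (distance, ID) pair in ascending order
    results = sorted((dist, db_id) for row in matrix
                     for db_id, dist in zip(db_ids, row))
    # steps 1210-1220: N = number of results * r% (truncated, by assumption)
    n = int(len(results) * ratio_percent / 100)
    selected = []
    for i, (_dist, db_id) in enumerate(results):
        if i >= n:                 # step 1250: N retrievals reached
            break
        selected.append(db_id)     # step 1240: take the next-closest sample
    return selected

# Hypothetical distance results, for illustration only.
matrix = [[0.5, 2.0, 1.5],
          [3.0, 0.9, 1.2]]
db_ids = ["s1", "s2", "s3"]
chosen = select_rt(matrix, db_ids, ratio_percent=50)
```

Because the count is taken over distance results rather than over distinct samples, the same database ID can appear more than once when r is large; whether to merge such repeats is left open here.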

【００７３】上述の実施形態の他に次の形態を実施でき
る。The following embodiment can be carried out in addition to the above embodiment.

【００７４】１）上述の実施形態では音響モデルと言語
モデルを併用して音声認識を行う音声認識装置を説明し
た。しかしながら、音声認識精度は若干落ちるが、音響
モデルのみを使用して音声認識を行う音声認識装置に対
しても本発明を適用することができる。1) In the above-described embodiment, the speech recognition apparatus that performs speech recognition using both the acoustic model and the language model has been described. However, the present invention can be applied to a speech recognition device that performs speech recognition using only an acoustic model, although the speech recognition accuracy is slightly reduced.

【００７５】２）上述の第２および第３の実施形態では
音声認識装置が音声サンプルのクラスタリング機能を有
しているが、他の装置でクラスタリングを行って、クラ
スタリングされた音声サンプルをオフラインやオンライ
ンで音声認識装置に転送してもよいこと勿論である。2) In the above-described second and third embodiments, the speech recognition apparatus has the function of clustering speech samples. However, another apparatus performs clustering, and the clustered speech samples are offline or online. Of course, the data may be transferred to the voice recognition device.

【００７６】３）上述の実施形態ではスタンドアローン
（単体）の音声認識装置を紹介したが、電話の音声、テ
レビ映像から取り出した音声を認識対象として入力する
ことが可能である。3) In the above embodiment, a stand-alone (single) speech recognition apparatus is introduced. However, it is possible to input a speech of a telephone or a speech extracted from a television image as a recognition target.

【００７７】４）本発明で言う記録媒体はフロッピーデ
ィスク、ＣＤＲＯＭ等の記録媒体に限定されない。プロ
グラムを記録（記憶）できる媒体であればいずれでもよ
い。たとえば、ＩＣメモリ、ハードディスク記憶装置な
ども記録媒体として使用することができる。さらにはこ
のような記録媒体は音声認識装置内に設置する必要はな
く、無線、有線を介して、他の装置内に設置された記録
媒体から音声認識装置内の記憶装置に音声認識プログラ
ムをダウンロード（転送すること）してもよいこと勿論
である。4) The recording medium referred to in the present invention is not limited to a recording medium such as a floppy disk, CDROM and the like. Any medium that can record (store) a program may be used. For example, an IC memory, a hard disk storage device, or the like can be used as a recording medium. Further, such a recording medium does not need to be installed in the voice recognition device, and a voice recognition program can be downloaded from a storage medium installed in another device to a storage device in the voice recognition device via wireless or wired communication. (Transferring) may of course be performed.

【００７８】５）Ａｂｓ選択方法で使用する閾値ｘやＲ
ｔ選択方法で使用するｒ％の値は固定化する必要はな
く、作成したい音響モデルの種類に応じて可変すること
ができる。この場合、入力装置５２０から数値を手入力
する方法と、音響モデルの種類のみを手入力し、音響モ
デルの種類に対応した値をテーブル等から取り出す方法
のいずれかを採用するとよい。5) The threshold x used in the Abs selection method and the value r% used in the Rt selection method need not be fixed, and can be varied according to the type of acoustic model to be created. In this case, either of two methods may be adopted: manually entering the numerical value from the input device 520, or manually entering only the type of acoustic model and retrieving the value corresponding to that type from a table or the like.
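The two ways of setting x and r% described in item 5 could be combined as in the following Python sketch. The model type names and the numerical values in the table are purely hypothetical placeholders, since the text gives no concrete values.

```python
# Hypothetical table: acoustic model type -> selection parameters.
DEFAULT_PARAMS = {
    "specific_speaker": {"x": 1.0, "r_percent": 10},
    "input_device": {"x": 2.0, "r_percent": 20},
}

def selection_params(model_type, x=None, r_percent=None):
    """Look up x and r% for a model type; a manually entered value,
    when given, overrides the table entry."""
    params = dict(DEFAULT_PARAMS[model_type])   # copy so the table is not modified
    if x is not None:
        params["x"] = x
    if r_percent is not None:
        params["r_percent"] = r_percent
    return params

from_table = selection_params("specific_speaker")
manual = selection_params("input_device", x=0.5)
```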

【００７９】６）上述の実施形態では、タスク音声デー
タベースに記憶された音声サンプルから特定話者の音声
サンプルを抽出し、音声データベースからの音声サンプ
ルの選択に使用した。しかしながら、音声認識の用途に
応じて他のタスクたとえば、音声認識デコーダ、認識音
声入力装置等の機械特性に応じた音声サンプルを使用し
て音声データベースから音声サンプルを選択してもよ
い。6) In the above-described embodiments, voice samples of a specific speaker are extracted from the voice samples stored in the task voice database and used to select voice samples from the voice database. However, depending on the application of speech recognition, voice samples for other tasks, for example voice samples that reflect the machine characteristics of the speech recognition decoder, the recognition voice input device, or the like, may be used to select voice samples from the voice database.

【００８０】[0080]

【発明の効果】以上、説明したように、請求項１，７の
発明は、話者の異なる音声サンプルの中からタスク、た
とえば、特定話者の音声サンプルに近い音声サンプルを
選択するので、音響モデルの作成に使用される音声サン
プルの中から認識特性に悪影響を与える音声サンプルは
除外される。As described above, according to the first and seventh aspects of the present invention, a task, for example, a voice sample close to a voice sample of a specific speaker is selected from voice samples of different speakers. Speech samples that adversely affect the recognition characteristics are excluded from the speech samples used to create the model.

【００８１】請求項２，８の発明では複数の第１および
第２の音声サンプルとの間の距離計算を行うことで、音
響モデルに使用するできるだけ多数の音声サンプルを取
り出すことができる。According to the second and eighth aspects of the present invention, by calculating the distance between the first and second voice samples, it is possible to extract as many voice samples as possible for use in the acoustic model.

【００８２】請求項３，９の発明では、距離計算結果が
許容範囲内にあるかを判定することにより、選択すべき
第１の音声サンプルについて近さについての精度を保つ
ことができる。According to the third and ninth aspects of the present invention, it is possible to maintain the accuracy of the closeness of the first audio sample to be selected by determining whether the result of the distance calculation is within the allowable range.

【００８３】請求項４，１０の発明では、用いる発話サ
ンプル間距離と音声認識精度間の非線形性による影響を
小さくすることができる。According to the fourth and tenth aspects of the present invention, it is possible to reduce the influence of the nonlinearity between the distance between speech samples to be used and the speech recognition accuracy.

【００８４】請求項５，６，１１，１２の発明は、第２
の音声サンプルが多数ある場合には、クラスタリングを
行うことにより、より細かなタスク限定モデルを用いる
ことができる。また、特定話者などのタスク別に第２の
音声サンプルを使用することにより用途に応じた基準
（リファレンス）の第２の音声サンプルを複数の第２の
音声サンプルから選択することも可能となる。According to the fifth, sixth, eleventh, and twelfth aspects of the present invention, when there are a large number of second voice samples, performing clustering makes it possible to use a finer task-limited model. In addition, by using second voice samples for each task, such as a specific speaker, a reference second voice sample suited to the application can be selected from the plurality of second voice samples.

[Brief description of the drawings]

【図１】本発明第１の実施形態のシステム構成を示すブ
ロック図である。FIG. 1 is a block diagram showing a system configuration according to a first embodiment of the present invention.

【図２】推定サンプル選択部の回路構成を示すブロック
図である。FIG. 2 is a block diagram illustrating a circuit configuration of an estimated sample selection unit.

【図３】推定サンプル選択部の他の回路構成を示すブロ
ック図である。FIG. 3 is a block diagram illustrating another circuit configuration of the estimated sample selection unit.

【図４】本発明第２の実施形態のシステム構成を示すブ
ロック図である。FIG. 4 is a block diagram showing a system configuration according to a second embodiment of the present invention.

【図５】本発明第３の実施形態のシステム構成を示すブ
ロック図である。FIG. 5 is a block diagram illustrating a system configuration according to a third embodiment of the present invention.

【図６】本発明第３の実施形態の音声サンプル選択処理
の処理手順を示すフローチャートである。FIG. 6 is a flowchart illustrating a procedure of a voice sample selection process according to a third embodiment of the present invention.

【図７】Ａｂｓ選択方法にしたがった音声サンプル選択
処理の処理手順を示すフローチャートである。FIG. 7 is a flowchart showing a processing procedure of audio sample selection processing according to the Abs selection method.

【図８】Ｒｔ選択方法にしたがった音声サンプル選択処
理の処理手順を示すフローチャートである。FIG. 8 is a flowchart showing a processing procedure of audio sample selection processing according to an Rt selection method.

[Explanation of symbols]

１０１認識音声入力部１０２音響分析部１０３音声認識デコーダ１０４言語モデル１０５音声認識結果１０６音響モデル１０７単語発音辞書１０８タスク音声データベース１０９音声データベース１１０推定サンプル判定部１１１パラメータ推定部
Reference Signs List
101 recognition speech input unit
102 acoustic analysis unit
103 speech recognition decoder
104 language model
105 speech recognition result
106 acoustic model
107 word pronunciation dictionary
108 task speech database
109 speech database
110 estimation sample determination unit
111 parameter estimation unit

フロントページの続き (72)発明者 安藤彰男 東京都世田谷区砧一丁目10番11号 日本放送協会放送技術研究所内 Ｆターム(参考） 5D015 AA03 BB01 GG01 GG03
Continuation of front page: (72) Inventor Akio Ando, c/o Japan Broadcasting Corporation Broadcasting Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo. F-term (reference): 5D015 AA03 BB01 GG01 GG03

Claims

[Claims]

1. A speech recognition apparatus that creates an acoustic model from speech samples and transcribed text and performs speech recognition of input speech based on the created acoustic model, comprising: first storage means for storing a plurality of speaker-independent first speech samples and a plurality of transcribed texts corresponding thereto; second storage means for storing a second speech sample corresponding to a task; selecting means for selecting, from among the plurality of first speech samples stored in the first storage means, a first speech sample closer to the second speech sample stored in the second storage means; and information processing means for extracting the transcribed text corresponding to the selected first speech sample from the first storage means, wherein the acoustic model is created from the extracted transcribed text and the first speech sample corresponding to that transcribed text.

2. The speech recognition apparatus according to claim 1, wherein there are a plurality of the second speech samples, and the selecting means calculates the distance between each of the plurality of second speech samples and each of the plurality of first speech samples, and selects the closer first speech sample based on the calculation result.

3. The speech recognition apparatus according to claim 2, wherein the selecting means determines whether the calculation result is within an allowable range and, when it is determined that the calculation result is within the allowable range, selects the first speech sample used for that calculation result as a speech sample to be used for the acoustic model.

4. The speech recognition apparatus according to claim 2, wherein the selecting means sorts the calculation results in ascending order of distance, selects a fixed number of calculation results starting from the shortest distance among the sorted calculation results, and selects the first speech samples used for the selected calculation results as speech samples to be used for the acoustic model.

5. The speech recognition apparatus according to claim 1, further comprising clustering means for clustering a plurality of speaker-independent speech samples and storing the result of the clustering in the second storage means as the second speech sample.

6. The speech recognition apparatus according to claim 1, wherein the second speech sample corresponding to the task is a speech sample of a specific speaker.

7. A recording medium on which is recorded a program executed in a speech recognition apparatus that creates an acoustic model from speech samples and transcribed text and performs speech recognition of input speech based on the created acoustic model, the speech recognition apparatus having first storage means for storing a plurality of speaker-independent first speech samples and a plurality of transcribed texts corresponding thereto, and second storage means for storing a second speech sample corresponding to a task, the program comprising: a selection step of selecting, from among the plurality of first speech samples stored in the first storage means, a first speech sample closer to the second speech sample stored in the second storage means; and an information processing step of extracting the transcribed text corresponding to the selected first speech sample from the first storage means, wherein the acoustic model is created from the extracted transcribed text and the first speech sample corresponding to that transcribed text.

8. The recording medium of the speech recognition apparatus according to claim 7, wherein there are a plurality of the second speech samples, and in the selection step, the distance between each of the plurality of second speech samples and each of the plurality of first speech samples is calculated, and the closer first speech sample is selected based on the calculation result.

9. The recording medium of the speech recognition apparatus according to claim 8, wherein in the selection step, it is determined whether the calculation result is within an allowable range and, when it is determined that the calculation result is within the allowable range, the first speech sample used for that calculation result is selected as a speech sample to be used for the acoustic model.

10. The recording medium of the speech recognition apparatus according to claim 8, wherein in the selection step, the calculation results are sorted in ascending order of distance, a fixed number of calculation results are selected starting from the shortest distance among the sorted calculation results, and the first speech samples used for the selected calculation results are selected as speech samples to be used for the acoustic model.

11. The recording medium of the speech recognition apparatus according to claim 7, wherein the program further comprises a clustering step of clustering a plurality of speaker-independent speech samples and storing the result of the clustering in the second storage means as the second speech sample.

12. The recording medium of the speech recognition apparatus according to claim 7, wherein the second speech sample corresponding to the task is a speech sample of a specific speaker.