JP2003271179A - Method, device and program for speech recognition

Info

Publication number: JP2003271179A
Application number: JP2002071638A
Authority: JP
Inventors: Hiroyuki Segi; 寛之世木; Shoe Sato; 庄衛佐藤; Kazuho Onoe; 和穂尾上
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2002-03-15
Filing date: 2002-03-15
Publication date: 2003-09-25
Anticipated expiration: 2022-03-15
Also published as: JP3841342B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition method, and a device and a program therefor, capable of performing speech recognition with speaker adaptation without requiring a large-capacity storage medium for storing the acoustic models of specific speakers and without having to identify the specific speaker in advance.

SOLUTION: A speech recognition device 1 uses an acoustic model in which each phoneme of a speech signal to be recognized is composed of a plurality of statistical distributions, a language model in which the relation between words is expressed by a statistical distribution, and a word pronunciation dictionary showing the relation between words and phonemes. The device is provided with a statistical distribution specifying means 3a, which specifies statistical distributions with high output probability among the plurality of statistical distributions of the acoustic model on the basis of a result obtained by recognizing speech uttered by the speaker to be adapted to, and a speech recognition means 3b, which recognizes the speech to be recognized on the basis of the statistical distributions specified by the statistical distribution specifying means.

COPYRIGHT: (C)2003, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition method, a speech recognition device, and a speech recognition program that perform speech recognition adapted to the speaker who produces the speech to be recognized.

[0002]

[Prior Art] Conventionally, methods such as the following have been used to perform speech recognition with so-called speaker adaptation, in which recognition is adapted to the speaker's manner of speaking (speech habits), frequently used vocabulary, and the like.

[0003] One such method uses MAP estimation ("Spoken Language Processing", Kenji Kita et al., 65-68). It applies MAP estimation (maximum a posteriori estimation) to speaker adaptation in speech recognition. The method moves the parameters of a speaker-independent acoustic model toward the parameters estimated from the utterances (utterance data) of the speaker to be adapted to (the specific speaker). Using this method, the speaker-independent acoustic model can be adapted into an acoustic model of the specific speaker.
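As a rough illustration of the MAP approach described above, the following sketch applies the commonly used MAP update to a single Gaussian mean. The function name `map_adapt_mean`, the prior weight `tau`, and the per-frame `occupancies` are assumptions made for this example; the cited text may formulate the update differently.

```python
def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP re-estimate of one Gaussian mean (illustrative sketch).

    prior_mean  : mean vector of the speaker-independent model
    frames      : adaptation feature vectors from the specific speaker
    occupancies : per-frame weight (0..1) that the frame belongs to this Gaussian
    tau         : assumed prior weight; larger values keep the mean near prior_mean
    """
    total = sum(occupancies)
    adapted = []
    for d, m in enumerate(prior_mean):
        # Occupancy-weighted sum of the adaptation data in dimension d.
        weighted_sum = sum(g * x[d] for g, x in zip(occupancies, frames))
        # Interpolate between the prior mean and the speaker's data.
        adapted.append((tau * m + weighted_sum) / (tau + total))
    return adapted
```

With no adaptation data the function returns the prior mean unchanged; with more data the result moves toward the speaker's own statistics. Every adapted mean is a new parameter vector, so this style of adaptation yields a complete model per specific speaker.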

[0004]

[Problem to Be Solved by the Invention] However, the conventional "method using MAP estimation" requires an acoustic model to be prepared for each speaker to be adapted to (each specific speaker). When there are multiple specific speakers, the number of specific-speaker acoustic models adapted from the speaker-independent acoustic model grows with the number of specific speakers. This creates the problem that a large amount of capacity must be secured on a storage medium, such as a hard disk, to store these specific-speaker acoustic models.

[0005] In addition, when the utterance data of each specific speaker is used, it must be determined in advance who the specific speaker is. The speaker may be misidentified, and as a result the speech recognition rate falls.

[0006] An object of the present invention is therefore to solve the above problems of the conventional technique and to provide a speech recognition method, a speech recognition device, and a speech recognition program that can perform speech recognition with speaker adaptation without requiring a large-capacity storage medium for storing specific-speaker acoustic models and without having to identify the specific speaker when using the utterance data of each specific speaker.

[0007]

[Means for Solving the Problem] To achieve the above object, the present invention is configured as follows. The speech recognition method according to claim 1 is a speech recognition method that uses an acoustic model in which each phoneme of a speech signal to be recognized is composed of a plurality of statistical distributions, a language model in which the relation between words is expressed by probabilities, and a word pronunciation dictionary indicating the relation between words and phonemes. The method comprises: a statistical distribution specifying step of specifying, from among the plurality of statistical distributions of the acoustic model, statistical distributions with high output probability on the basis of a result of recognizing, in advance, speech uttered by the speaker to be adapted to; and a speech recognition step of recognizing, when the speaker speaks, the speech to be recognized on the basis of the statistical distributions specified in the statistical distribution specifying step.

[0008] According to this method, first, in the statistical distribution specifying step, statistical distributions with high output probability are specified from among the plurality of statistical distributions of the acoustic model on the basis of the recognition result for speech uttered by the speaker to be adapted to. In this case, the speaker utters the speech freely, as he or she wishes. A typical example of the statistical distribution is the normal distribution. Then, in the speech recognition step, the speech to be recognized is recognized on the basis of the statistical distributions specified from among the plurality of statistical distributions. In other words, in the statistical distribution specifying step, the statistical distributions with high output probability (high frequency of use) are specified within a single acoustic model for unspecified speakers. Consequently, unlike the conventional approach, there is no need to prepare a separate acoustic model for each speaker to be adapted to. In the speech recognition step, the speech of the specific speaker (the speaker to be adapted to) is then recognized on the basis of the statistical distributions with high output probability.
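The statistical distribution specifying step described above can be sketched as follows. This is a minimal illustration, not the method as claimed: the text does not state how "high output probability" is decided, so the sketch assumes it means "most often the best-scoring distribution for frames aligned to that state". The names `select_distributions`, `aligned_states`, and `top_n`, and the diagonal-covariance model layout, are assumptions made for the example.

```python
import math
from collections import Counter

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance normal distribution."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def select_distributions(frames, aligned_states, model, top_n=1):
    """Return, per state, the indices of the distributions that won most often.

    frames         : feature vectors from the adaptation speech
    aligned_states : state id for each frame (from the first recognition pass)
    model          : {state_id: [(mean, var), ...]} speaker-independent mixtures
    top_n          : assumed number of distributions to keep per state
    """
    wins = {state: Counter() for state in model}
    for x, state in zip(frames, aligned_states):
        scores = [log_gauss(x, m, v) for m, v in model[state]]
        wins[state][scores.index(max(scores))] += 1
    # States never observed during adaptation are left out of the index.
    return {
        state: sorted(idx for idx, _ in counter.most_common(top_n))
        for state, counter in wins.items() if counter
    }
```

The result holds only indices into the one shared acoustic model, so no model parameters are copied or changed.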

[0009] The speech recognition method according to claim 2 is a speech recognition method that uses an acoustic model in which each phoneme of a speech signal to be recognized is composed of a plurality of statistical distributions, a language model in which the relation between words is expressed by probabilities, and a word pronunciation dictionary indicating the relation between words and phonemes. The method comprises: a statistical distribution specifying step of specifying, from among the plurality of statistical distributions of the acoustic model, statistical distributions with high output probability on the basis of a result of aligning, in advance, utterance content whose content is known with the speech in which the speaker to be adapted to utters that content; and a speech recognition step of recognizing, when the speaker speaks, the speech to be recognized on the basis of the statistical distributions specified in the statistical distribution specifying step.

[0010] According to this method, in the statistical distribution specifying step, statistical distributions with high output probability are specified from among the plurality of statistical distributions of the acoustic model on the basis of the result of aligning the known utterance content with the speech in which the speaker to be adapted to utters that content. Then, in the speech recognition step, the speech to be recognized is recognized on the basis of the specified statistical distributions. Utterance content whose content is known usually means utterance content that has been written down (converted to text data).

[0011] The speech recognition device according to claim 3 is a speech recognition device that uses an acoustic model in which each phoneme of a speech signal to be recognized is composed of a plurality of statistical distributions, a language model in which the relation between words is expressed by probabilities, and a word pronunciation dictionary indicating the relation between words and phonemes. The device comprises: a statistical distribution specifying means for specifying, from among the plurality of statistical distributions of the acoustic model, statistical distributions with high output probability on the basis of a result of recognizing, in advance, speech uttered by the speaker to be adapted to; and a speech recognition means for recognizing, when the speaker speaks, the speech to be recognized on the basis of the statistical distributions specified by the statistical distribution specifying means.

[0012] With this configuration, the statistical distribution specifying means specifies statistical distributions with high output probability from among the plurality of statistical distributions of the acoustic model on the basis of the recognition result for speech uttered by the speaker to be adapted to. In this case, the speaker utters the speech freely, as he or she wishes. The speech recognition means then recognizes the speech to be recognized on the basis of the specified statistical distributions.

[0013] The speech recognition device according to claim 4 is a speech recognition device that uses an acoustic model in which each phoneme of a speech signal to be recognized is composed of a plurality of statistical distributions, a language model in which the relation between words is expressed by probabilities, and a word pronunciation dictionary indicating the relation between words and phonemes. The device comprises: a statistical distribution specifying means for specifying, from among the plurality of statistical distributions of the acoustic model, statistical distributions with high output probability on the basis of a result of aligning, in advance, utterance content whose content is known with the speech in which the speaker to be adapted to utters that content; and a speech recognition means for recognizing, when the speaker speaks, the speech to be recognized on the basis of the statistical distributions specified by the statistical distribution specifying means.

[0014] With this configuration, the statistical distribution specifying means specifies statistical distributions with high output probability from among the plurality of statistical distributions of the acoustic model on the basis of the result of aligning the known utterance content with the speech in which the speaker to be adapted to utters that content. The speech recognition means then recognizes the speech to be recognized on the basis of the specified statistical distributions.

[0015] The speech recognition program according to claim 5 causes a device that uses an acoustic model in which each phoneme of a speech signal to be recognized is composed of a plurality of statistical distributions, a language model in which the relation between words is expressed by probabilities, and a word pronunciation dictionary indicating the relation between words and phonemes, to function as the following means: a statistical distribution specifying means for specifying, from among the plurality of statistical distributions of the acoustic model, statistical distributions with high output probability on the basis of a result of recognizing, in advance, speech uttered by the speaker to be adapted to; and a speech recognition means for recognizing, when the speaker speaks, the speech to be recognized on the basis of the statistical distributions specified by the statistical distribution specifying means.

[0016] With this configuration, the statistical distribution specifying means specifies statistical distributions with high output probability from among the plurality of statistical distributions of the acoustic model on the basis of the recognition result for speech uttered by the speaker to be adapted to. In this case, the speaker utters the speech freely, as he or she wishes. The speech recognition means then recognizes the speech to be recognized on the basis of the specified statistical distributions.

[0017] The speech recognition program according to claim 6 causes a device that uses an acoustic model in which each phoneme of a speech signal to be recognized is composed of a plurality of statistical distributions, a language model in which the relation between words is expressed by probabilities, and a word pronunciation dictionary indicating the relation between words and phonemes, to function as the following means: a statistical distribution specifying means for specifying, from among the plurality of statistical distributions of the acoustic model, statistical distributions with high output probability on the basis of a result of aligning, in advance, utterance content whose content is known with the speech in which the speaker to be adapted to utters that content; and a speech recognition means for recognizing, when the speaker speaks, the speech to be recognized on the basis of the statistical distributions specified by the statistical distribution specifying means.

[0018] With this configuration, the statistical distribution specifying means specifies statistical distributions with high output probability from among the plurality of statistical distributions of the acoustic model on the basis of the result of aligning the known utterance content with the speech in which the speaker to be adapted to utters that content. The speech recognition means then recognizes the speech to be recognized on the basis of the recorded statistical distributions.

[0019]

[Embodiments of the Invention] An embodiment of the present invention is described in detail below with reference to the drawings. The description first covers the configuration of the speech recognition device and then its operation. A supplementary explanation of the waveform data and acoustic scores during speech recognition follows.

[0020] (Configuration of the Speech Recognition Device) FIG. 1 is a block diagram of the speech recognition device. As shown in FIG. 1, the speech recognition device 1 includes a main control unit 3, a storage unit 5, a text input unit 7, a speech input unit 9, and a display output unit 11. On the basis of the recognition result for speech uttered in advance by a speaker, the speech recognition device 1 records the indices of those normal distributions of the acoustic model whose output probability was high, and then uses the recorded indices to recognize speech uttered by that speaker. In this embodiment, the speech recognition device 1 is implemented on a general-purpose personal computer.

[0021] The main control unit 3 consists of a CPU, main memory, and the like, and controls the speech recognition device 1. It includes a speaker adaptation means 3a, a speech recognition means 3b, and a word string output means 3c. The speaker adaptation means 3a performs speech recognition on the raw speech data (utterance data) input from the speech input unit 9, or on speech features extracted from that raw speech data, using the acoustic model data 5a, the language model data 5b, and the pronunciation dictionary 5c stored in the storage unit 5, and generates the used-acoustic-model normal distribution index 5a1.

[0022] In other words, the speaker adaptation means 3a generates a normal distribution index of the acoustic model suited to the speaker who utters the speech to be recognized (the speaker to be adapted to). This used-acoustic-model normal distribution index 5a1 records the normal distributions whose output probability was high for the input raw speech data or speech features.

[0023] When the utterance content is known (for example, when it has been converted to text data in advance), the speaker adaptation means 3a generates the used-acoustic-model normal distribution index 5a1 for the distributions with high output probability, on the basis of the result of aligning that utterance content with the raw speech data (utterance data) input from the speech input unit 9 or with the speech features extracted from that raw speech data. The speaker adaptation means 3a corresponds to the statistical distribution specifying means recited in the claims.

[0024] The speech recognition means 3b uses the used-acoustic-model normal distribution index 5a1 generated by the speaker adaptation means 3a to calculate the acoustic score and the language score (the latter using the language model data 5b) of the speech to be recognized that is input from the speech input unit 9. The acoustic score is a value calculated from the acoustic model, a hidden Markov model, on the basis of a comparison with the waveform pattern of the speech to be recognized input from the speech input unit 9. The language score is a value from the language model, which quantifies how words connect to one another. The candidate with the higher acoustic and language scores becomes the recognition result.
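The way the two scores decide the recognition result can be sketched as follows. The description only says that the candidate with the higher scores wins; adding the two scores in the log domain with a language model weight is an assumption borrowed from common practice, and the names `pick_recognition_result` and `lm_weight` are hypothetical.

```python
def pick_recognition_result(hypotheses, lm_weight=1.0):
    """Choose the word string with the best combined score (illustrative sketch).

    hypotheses : list of (word_list, acoustic_log_score, language_log_score)
    lm_weight  : assumed scaling factor applied to the language score
    """
    def combined(hypothesis):
        _, acoustic, language = hypothesis
        return acoustic + lm_weight * language

    return max(hypotheses, key=combined)[0]
```

Raising `lm_weight` makes the choice lean more on word-to-word plausibility, while lowering it favors the acoustic match.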

[0025] The word string output means 3c sends out the output word string, which is the recognition result, on the basis of the acoustic model score and the language model score calculated by the speech recognition means 3b. The storage unit 5 consists of a large-capacity hard disk or the like and stores the acoustic model data 5a, the language model data 5b, the pronunciation dictionary 5c, adaptation speech data 5d, recognition speech data 5e, and text data 5f.

[0026] The acoustic model data 5a is a collection of data for a speaker-independent acoustic model, a hidden Markov model composed of a plurality of normal distributions. The language model data 5b is a statistical distribution such as one expressed as an n-gram. The pronunciation dictionary 5c lists, for each word, the reading of that word. The adaptation speech data 5d is the speech data uttered by the speaker to be adapted to and input from the speech input unit 9. The recognition speech data 5e is the speech data to be recognized, input from the speech input unit 9. The text data 5f is the text of the utterance content, input from the text input unit 7, when the content of the speech data supplied from the adaptation speech data 5d is known. The acoustic model data 5a corresponds to the acoustic model recited in the claims, and the language model data 5b corresponds to the language model recited in the claims.
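To make the relationship between the stored items concrete, the sketch below models the acoustic model data 5a, the index 5a1, the language model data 5b, and the pronunciation dictionary 5c as plain Python containers. The class and field names are hypothetical, the bigram table stands in for a general n-gram, and keying the acoustic model by phoneme rather than by HMM state is a simplification made for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Gaussian:
    mean: list
    var: list

@dataclass
class RecognizerData:
    # 5a: speaker-independent model, {phoneme: [Gaussian, ...]}
    acoustic_model: dict = field(default_factory=dict)
    # 5a1: indices of the distributions kept for the adapted speaker(s)
    used_index: dict = field(default_factory=dict)
    # 5b: bigram probabilities, {(previous_word, word): probability}
    language_model: dict = field(default_factory=dict)
    # 5c: {word: [phoneme, ...]}
    pronunciations: dict = field(default_factory=dict)

    def phonemes_for(self, words):
        """Expand a word sequence into its phoneme sequence via 5c."""
        return [p for w in words for p in self.pronunciations[w]]

    def bigram_prob(self, prev, word, floor=1e-6):
        """Look up a bigram in 5b; unseen pairs get an assumed floor value."""
        return self.language_model.get((prev, word), floor)
```

Only `used_index` depends on the speaker; the other three fields are shared by everyone.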

[0027] Of these, the adaptation speech data 5d, the recognition speech data 5e, and the text data 5f strictly speaking denote storage areas in the storage unit 5. Nothing is stored in the adaptation speech data 5d or the text data 5f until adaptation time, that is, until the speaker to be adapted to speaks. Likewise, nothing is stored in the recognition speech data 5e until recognition time, that is, until the adapted speaker utters the speech to be recognized. For real-time speech recognition, however, no data needs to be stored in the storage area of the recognition speech data 5e.

[0028] The text input unit 7 consists of a keyboard, a mouse, and the like. It is used to input the utterance content as text data when the content of the speech data to be stored in the adaptation speech data 5d of the storage unit 5 is known, and it is also used to operate the speech recognition device 1.

[0029] The speech input unit 9 consists of a speaker or the like and takes in both the speech (speech data) uttered by the speaker to be adapted to and the speech uttered by the speaker at recognition time. The display output unit 11 consists of a display or the like and displays the output word string, which is the recognition result generated by the word string output means 3c of the main control unit 3.

[0030] (Operation of the Speech Recognition Device) Next, the operation of the speech recognition device 1 is described with reference to FIG. 2. Speaker adaptation and speech recognition are described here as one continuous sequence. First, it is determined whether text data for the speech uttered by the speaker to be adapted to is stored in the text data 5f of the storage unit 5 (S1). If it is not determined that text data is stored (S1, No), the device operates as for the case where the content uttered by the speaker is unknown (S2 to S6, described below; this corresponds to the method of claim 1). If it is determined that text data is stored (S1, Yes), the device operates as for the case where the content uttered by the speaker is known (S7 and S3 to S6, described below; this corresponds to the method of claim 2).
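The branch at S1 can be summarized in a few lines. The step labels come from the description; the function name `operation_steps` and the tuple it returns are hypothetical and serve only to show which steps follow from each outcome of S1.

```python
def operation_steps(text_data_stored):
    """Return (claim number, ordered step labels) for the outcome of S1.

    text_data_stored : True if text data 5f holds the utterance content (S1, Yes)
    """
    if text_data_stored:
        # Known content: S7 replaces S2, then the common steps follow.
        return 2, ["S7", "S3", "S4", "S5", "S6"]
    # Unknown content: the adaptation speech is recognized first.
    return 1, ["S2", "S3", "S4", "S5", "S6"]
```

Both paths share S3 to S6, so only the way the adaptation speech is matched against text differs.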

[0031] As the adaptation-time operation when the utterance content is unknown, the speech (speech data) of the speaker to be adapted to is input from the speech input unit 9 and stored in the adaptation speech data 5d of the storage unit 5 (S2). The input speech is then recognized using the acoustic model data 5a, the language model data 5b, and the pronunciation dictionary 5c. On the basis of this recognition result (the adaptation-time recognition result), the speaker adaptation means 3a of the main control unit 3 generates the used-acoustic-model normal distribution index 5a1 from the acoustic model normal distribution data 5a stored in the storage unit 5 (S3). If very little speech (speech data) was input during the adaptation-time operation, the acoustic model data 5a is used as is in the recognition-time operation.
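The fallback for very small amounts of adaptation speech can be sketched as follows. The description does not say how "extremely little" is measured, so the frame-count threshold `min_frames` is an assumption, as are the function name and the choice to keep all distributions for states missing from the index.

```python
def distributions_for_recognition(model, index, n_adapt_frames, min_frames=100):
    """Decide which distributions of each state are used at recognition time.

    model          : {state_id: [distribution, ...]} (acoustic model data 5a)
    index          : {state_id: [kept indices]} (index 5a1) or None
    n_adapt_frames : amount of adaptation speech, counted in frames
    min_frames     : assumed threshold below which the index is ignored
    """
    everything = {s: list(range(len(mix))) for s, mix in model.items()}
    if index is None or n_adapt_frames < min_frames:
        # Too little adaptation speech: use the acoustic model as is.
        return everything
    # States absent from the index keep all of their distributions.
    return {s: index.get(s, everything[s]) for s in model}
```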

[0032] Next, as the recognition-time operation, the speech to be recognized (recognition target speech data) uttered by the adapted speaker is input from the speech input unit 9 and stored in the recognition speech data 5e of the storage unit 5 (S4). The speech recognition means 3b of the main control unit 3 then uses the used-acoustic-model normal distribution index 5a1, the acoustic model data 5a, the language model data 5b, and the pronunciation dictionary 5c to calculate the acoustic score and the language score of the speech to be recognized. The candidate with the highest acoustic and language scores (which becomes the recognition result) is output to the word string output means 3c (S5). The word string output means 3c then sends the output word string, which is the recognition result, to the display output unit 11 (S6).

[0033] With this operation of the speech recognition device 1, the speaker adaptation means 3a stores in the storage unit 5, as the used acoustic model normal distribution index 5a1, those of the plurality of normal distributions in the acoustic model data 5a that have a high output probability, based on the adaptation-time speech recognition result for the speech uttered by the speaker to be adapted to. The speech recognition means 3b then recognizes the recognition target speech input from the voice input unit 9 using this used acoustic model normal distribution index 5a1, the acoustic model data 5a, the language model data 5b, and the pronunciation dictionary 5c.
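One way the index could enter the acoustic score is to evaluate only the indexed normal distributions of a state. The sketch below assumes a mixture of diagonal-covariance normal distributions with mixture weights, keeps the weights unnormalized after restriction, and falls back to all distributions when no index entry exists. These choices and the name `state_log_likelihood` are assumptions, not details given in this specification.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance normal distribution at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def state_log_likelihood(x, components, used_ids=None):
    """Output log probability of one state, optionally restricted to indexed distributions.

    components: list of (weight, mean, var).
    used_ids: ids taken from the index, or None/empty to use every distribution.
    """
    ids = list(used_ids) if used_ids else range(len(components))
    return logsumexp(
        [
            math.log(components[k][0]) + log_gauss(x, components[k][1], components[k][2])
            for k in ids
        ]
    )


# Toy state with three 1-dimensional distributions (illustrative values).
comps = [(0.5, [0.0], [1.0]), (0.3, [5.0], [1.0]), (0.2, [10.0], [1.0])]
x = [0.1]
print(state_log_likelihood(x, comps), state_log_likelihood(x, comps, used_ids=[0]))
```

Restricting the sum can only lower the score of a state, so the speaker's frequently used distributions dominate while fewer densities have to be evaluated.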

[0034] Because the used acoustic model normal distribution index 5a1 is generated from the acoustic model data 5a for each speaker to be adapted to, there is no need to prepare and store a separate acoustic model for each such speaker, and the storage capacity required of the storage unit 5 can be kept small.
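The storage argument can be made concrete with a rough size estimate. The sketch below assumes diagonal-covariance normal distributions stored as 4-byte floats and an index that stores one 4-byte integer per selected distribution. The storage layout, the function names, and all sample sizes are hypothetical and are not figures from this specification.

```python
def full_model_bytes(n_states, n_mix, dim, bytes_per_float=4):
    """Size of one complete acoustic model (mean and variance per dimension, plus a weight)."""
    floats_per_distribution = 2 * dim + 1
    return n_states * n_mix * floats_per_distribution * bytes_per_float


def index_bytes(n_selected, bytes_per_id=4):
    """Size of one index that lists the ids of the selected distributions."""
    return n_selected * bytes_per_id


def storage_for_speakers(n_speakers, n_states, n_mix, dim, selected_per_speaker):
    """Compare one model per speaker against one shared model plus one index per speaker."""
    model = full_model_bytes(n_states, n_mix, dim)
    per_speaker_models = n_speakers * model
    shared_plus_indexes = model + n_speakers * index_bytes(selected_per_speaker)
    return per_speaker_models, shared_plus_indexes


# Hypothetical sizes chosen only to show the comparison.
separate, shared = storage_for_speakers(
    n_speakers=10, n_states=1000, n_mix=8, dim=39, selected_per_speaker=2000
)
print(separate, shared)
```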

[0035] In conventional adaptation, when there are multiple speakers, the utterances must be grouped by speaker. In the speech recognition device 1, by contrast, sentences uttered by multiple speakers can be treated together as adaptation data to create the used acoustic model normal distribution index 5a1, and at recognition time speech can be recognized using the index 5a1 without identifying the speaker.

[0036] The following describes the operation when it is determined that text data of the speech uttered by the speaker to be adapted to is stored in the text data 5f of the storage unit 5 (S1, YES), that is, when the content of that speaker's utterance is known.

[0037] As the adaptation-time operation, the voice data (speech waveform data) uttered by the speaker to be adapted to is input from the voice input unit 9 and stored in the adapted voice data 5d of the storage unit 5 (S7). The speaker adaptation means 3a then associates the stored adapted voice data 5d with the utterance content (text data 5f) to obtain a correspondence result. Based on this correspondence result, the speaker adaptation means 3a of the main control unit 3 generates the used acoustic model normal distribution index 5a1 from the acoustic model data 5a stored in the storage unit 5 (S8). The recognition-time operations S4 to S6 are then executed.
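The association of voice data with known text in step S7 is commonly done by forced alignment. The sketch below is a simplified illustration under stated assumptions: one 1-dimensional normal distribution per phoneme, a strictly left-to-right path, and a toy pronunciation dictionary entry for 「次」. The specification does not state which alignment algorithm is used, and the names `forced_align`, `pron_dict`, and `models` are hypothetical.

```python
import math


def log_gauss_1d(x, mean, var):
    """Log density of a 1-dimensional normal distribution at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def forced_align(frames, phonemes, models):
    """Assign each frame to a phoneme of the known sequence (left-to-right dynamic programming)."""
    n_frames, n_ph = len(frames), len(phonemes)
    if n_frames < n_ph:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_ph for _ in range(n_frames)]
    back = [[0] * n_ph for _ in range(n_frames)]
    score[0][0] = log_gauss_1d(frames[0], *models[phonemes[0]])
    for t in range(1, n_frames):
        for s in range(n_ph):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            prev, best = (s, stay) if stay >= move else (s - 1, move)
            if best == neg_inf:
                continue  # this phoneme cannot be reached yet at frame t
            score[t][s] = best + log_gauss_1d(frames[t], *models[phonemes[s]])
            back[t][s] = prev
    # Trace back from the last phoneme at the last frame.
    path = [n_ph - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [phonemes[s] for s in path]


# Hypothetical pronunciation dictionary entry and phoneme models (mean, variance).
pron_dict = {"次": ["ts", "u", "g", "i"]}
models = {"ts": (0.0, 1.0), "u": (3.0, 1.0), "g": (6.0, 1.0), "i": (9.0, 1.0)}
frames = [0.1, 0.2, 2.9, 6.1, 5.8, 9.2]
print(forced_align(frames, pron_dict["次"], models))
```

The resulting frame-to-phoneme assignment plays the role of the correspondence result from which the index 5a1 is generated in step S8.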

[0038] With this operation of the speech recognition device 1, the speaker adaptation means 3a stores in the storage unit 5, as the used acoustic model normal distribution index 5a1, those of the plurality of normal distributions in the acoustic model data 5a that have a high output probability, based on the correspondence result obtained by associating the waveform data of the speech uttered by the speaker to be adapted to with the text data 5f. The speech recognition means 3b then recognizes the recognition target speech input from the voice input unit 9 using this used acoustic model normal distribution index 5a1, the acoustic model data 5a, the language model data 5b, and the pronunciation dictionary 5c.

[0039] Because the used acoustic model normal distribution index 5a1 is generated from the acoustic model data 5a for each speaker to be adapted to, there is no need to prepare and store a separate acoustic model for each such speaker, and the storage capacity required of the storage unit 5 can be kept small.

[0040] In conventional adaptation, when there are multiple speakers, the utterances must be grouped by speaker. In the speech recognition device 1, by contrast, sentences uttered by multiple speakers can be treated together as adaptation data to create the used acoustic model normal distribution index 5a1, and at recognition time speech can be recognized using the index 5a1 without identifying the speaker.

[0041] (Speech waveform and phoneme score calculation during speech recognition) Next, the speech waveform and the calculation of phoneme scores during speech recognition are described with reference to FIG. 3 (see also FIG. 1 as appropriate). FIG. 3 shows the relationship between the speech waveform data of the word 「次」 (tsugi, "next") uttered at recognition time by the speaker to be adapted to (the speech waveform data input from the voice input unit 9 at recognition time) and the used acoustic model normal distribution index 5a1 that was adapted (given priority) at adaptation time.

[0042] FIG. 3(a) shows the relationship when the content of the speech uttered at adaptation time by the speaker to be adapted to is unknown, and FIG. 3(b) shows the relationship when that content is known. That is, in the case of FIG. 3(a), the speech waveform data input at recognition time is compared with the acoustic feature patterns of the phonemes making up the phoneme string of the adaptation-time speech recognition result; in the case of FIG. 3(b), it is compared with the acoustic feature patterns of the phonemes making up a phoneme string (or word string) given in advance.

[0043] In other words, which acoustic feature patterns were used for these phoneme strings at adaptation time is stored in the used acoustic model normal distribution index 5a1 (・・・ in the figure). At recognition time, these acoustic feature patterns are compared with the portions of the speech waveform data that correspond to each phoneme, and the speech is recognized.

[0044] The present invention has been described above based on one embodiment, but it is not limited to this embodiment. Each component of the speech recognition device 1 can be regarded as a single process (step), so that the invention can be understood as a speech recognition method; in this case, the same effects as those of the speech recognition device 1 are obtained. Likewise, the processing of each component of the speech recognition device 1 can be written in a general-purpose computer language and understood as a speech recognition program; in this case too, the same effects as those of the speech recognition device 1 are obtained.

[0045]

[Effects of the Invention] According to the inventions of claims 1, 3, and 5, the statistical distribution specifying step (statistical distribution specifying means) specifies a statistical distribution having a high output probability from among the plurality of statistical distributions of the acoustic model, based on the speech recognition result for the speech uttered by the speaker to be adapted to, and the speech recognition step (speech recognition means) then performs speech recognition. There is therefore no need to prepare and store an acoustic model for each speaker to be adapted to, and the storage capacity of the storage medium can be kept small.

[0046] According to the inventions of claims 2, 4, and 6, the statistical distribution specifying step (statistical distribution specifying means) specifies a statistical distribution having a high output probability from among the plurality of statistical distributions of the acoustic model, based on the correspondence result obtained by associating the speech uttered by the speaker to be adapted to with text data, and the speech recognition step (speech recognition means) then performs speech recognition. There is therefore no need to prepare and store an acoustic model for each speaker to be adapted to, and the storage capacity of the storage medium can be kept small.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech recognition device according to one embodiment of the present invention.

FIG. 2 shows the operation of the speech recognition device shown in FIG. 1.

FIG. 3 is an explanatory diagram illustrating the speech waveform and the calculation of phoneme scores during speech recognition.

[Explanation of symbols]
1 Speech recognition device
3 Main control unit
3a Speaker adaptation means (statistical distribution specifying means)
3b Speech recognition means
3c Word string output means
5 Storage unit
5a Acoustic model data
5b Language model data
5c Pronunciation dictionary
7 Text input unit
9 Voice input unit
11 Display output unit

Continuation of front page: (72) Inventor: Kazuho Onoe, c/o Japan Broadcasting Corporation (NHK) Science & Technology Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo. F-term (reference): 5D015 GG04 HH23

Claims

[Claims]

1. A speech recognition method using an acoustic model in which each phoneme of a speech signal subject to speech recognition is composed of a plurality of statistical distributions, a language model in which the relationships between words are expressed by probabilities, and a word pronunciation dictionary indicating the relationship between words and phonemes, the method comprising: a statistical distribution specifying step of specifying a statistical distribution having a high output probability from among the plurality of statistical distributions of the acoustic model, based on a result of recognizing in advance speech uttered by a speaker to be adapted to; and a speech recognition step of recognizing, when the speaker utters recognition target speech to be recognized, the recognition target speech based on the statistical distribution specified in the statistical distribution specifying step.

2. A speech recognition method using an acoustic model in which each phoneme of a speech signal subject to speech recognition is composed of a plurality of statistical distributions, a language model in which the relationships between words are expressed by probabilities, and a word pronunciation dictionary indicating the relationship between words and phonemes, the method comprising: a statistical distribution specifying step of specifying a statistical distribution having a high output probability from among the plurality of statistical distributions of the acoustic model, based on a correspondence result obtained by associating utterance content whose content is already known with speech in which a speaker to be adapted to utters that utterance content; and a speech recognition step of recognizing, when the speaker utters recognition target speech to be recognized, the recognition target speech based on the statistical distribution specified in the statistical distribution specifying step.

3. A speech recognition device using an acoustic model in which each phoneme of a speech signal subject to speech recognition is composed of a plurality of statistical distributions, a language model in which the relationships between words are expressed by probabilities, and a word pronunciation dictionary indicating the relationship between words and phonemes, the device comprising: statistical distribution specifying means for specifying a statistical distribution having a high output probability from among the plurality of statistical distributions of the acoustic model, based on a result of recognizing in advance speech uttered by a speaker to be adapted to; and speech recognition means for recognizing, when the speaker utters recognition target speech to be recognized, the recognition target speech based on the statistical distribution specified by the statistical distribution specifying means.

4. A speech recognition device using an acoustic model in which each phoneme of a speech signal subject to speech recognition is composed of a plurality of statistical distributions, a language model in which the relationships between words are expressed by probabilities, and a word pronunciation dictionary indicating the relationship between words and phonemes, the device comprising: statistical distribution specifying means for specifying a statistical distribution having a high output probability from among the plurality of statistical distributions of the acoustic model, based on a correspondence result obtained by associating utterance content whose content is already known with speech in which a speaker to be adapted to utters that utterance content; and speech recognition means for recognizing, when the speaker utters recognition target speech to be recognized, the recognition target speech based on the statistical distribution specified by the statistical distribution specifying means.

5. A speech recognition program causing a device, which uses an acoustic model in which each phoneme of a speech signal subject to speech recognition is composed of a plurality of statistical distributions, a language model in which the relationships between words are expressed by probabilities, and a word pronunciation dictionary indicating the relationship between words and phonemes, to function as: statistical distribution specifying means for specifying a statistical distribution having a high output probability from among the plurality of statistical distributions of the acoustic model, based on a result of recognizing in advance speech uttered by a speaker to be adapted to; and speech recognition means for recognizing, when the speaker utters recognition target speech to be recognized, the recognition target speech based on the statistical distribution specified by the statistical distribution specifying means.

6. A speech recognition program causing a device, which uses an acoustic model in which each phoneme of a speech signal subject to speech recognition is composed of a plurality of statistical distributions, a language model in which the relationships between words are expressed by probabilities, and a word pronunciation dictionary indicating the relationship between words and phonemes, to function as: statistical distribution specifying means for specifying a statistical distribution having a high output probability from among the plurality of statistical distributions of the acoustic model, based on a correspondence result obtained by associating in advance utterance content whose content is already known with speech in which a speaker to be adapted to utters that utterance content; and speech recognition means for recognizing, when the speaker utters recognition target speech to be recognized, the recognition target speech based on the statistical distribution specified by the statistical distribution specifying means.