JP4583772B2

JP4583772B2 - Speech recognition system, speech recognition method, and speech recognition program

Info

Publication number: JP4583772B2
Application number: JP2004029143A
Authority: JP
Inventors: 健花沢
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-02-05
Filing date: 2004-02-05
Publication date: 2010-11-17
Anticipated expiration: 2024-02-05
Also published as: JP2005221727A

Description

The present invention relates to a speech recognition system, a speech recognition method, and a speech recognition program for recognizing the speech of unspecified speakers.

A speech recognition system that recognizes the speech of unspecified speakers using a speaker-independent acoustic model does not require the user to register his or her voice, but its recognition performance is generally lower than when an acoustic model specialized for a specific speaker or speaker group is used. Here, an acoustic model is a model that assigns an acoustic likelihood to the input speech.

There are also speech recognition systems that, as a stage preceding speech recognition, discriminate among a plurality of speakers using a GMM (Gaussian Mixture Model) or the like and then perform speech recognition for the speaker selected from among them. In such a system, however, the preprocessing adds to the amount of processing, and because the input speech must first be processed during the preprocessing, the response is delayed.

There are therefore speech recognition systems that run recognition with the acoustic models of a plurality of speakers in parallel and finally select the one with the best speech recognition score, thereby discriminating among the speakers. Such a system, however, requires an amount of computation proportional to the number of speakers.

In contrast, Patent Document 1 discloses an example of a speech recognition system with a reduced amount of computation. This system automatically selects the best speaker by applying a process called pruning, described later, in common to a plurality of hypotheses corresponding to a plurality of speakers. As shown in FIG. 5, it comprises a microphone 1, a feature extraction unit 2, a buffer memory 3, a phoneme matching unit 4, an LR parser 5, a speaker-mixed hidden Markov network memory 11, a speaker model memory 12, an LR table memory 13, and a context-free grammar database memory 20.

The conventional speech recognition system having this configuration operates as follows.

First, the speaker's speech is received by the microphone 1, and the feature extraction unit 2 extracts feature quantities, which are parameters of the speaker's speech. The extracted feature quantities are input to the phoneme matching unit 4 via the buffer memory 3.

In response to a phoneme matching request from the LR parser 5, the phoneme matching unit 4 performs phoneme matching on the input feature quantities. At this time, the LR parser 5 passes information on the phoneme matching interval together with phoneme context information, which contains the phoneme to be matched and the phonemes before and after it. Based on the received phoneme context information, the phoneme matching unit 4 selects one model by concatenating the states on the hidden Markov network (hereinafter, HM network) that can accept that phoneme context, within the constraints of the preceding state list and the succeeding state list. The likelihood of the data in the phoneme matching interval is then calculated using the selected model, and this likelihood value is returned to the LR parser 5 as the phoneme matching score.

The LR parser 5 refers to the speaker model memory 12, which contains for example a phoneme duration model, and to the LR table memory 13, and performs hypothesis calculation on the input phoneme matching scores from left to right without backtracking. The LR table memory 13 stores an LR table obtained by automatically converting a predetermined context-free grammar (CFG) in the context-free grammar database memory 20. The LR parser 5 predicts the next phoneme from the LR table in the LR table memory 13 and outputs the phoneme prediction data to the phoneme matching unit 4.

In response, the phoneme matching unit 4 performs matching with reference to the information in the hidden Markov network memory 11 corresponding to the input phoneme prediction data, and returns the resulting likelihood to the LR parser 5 as a speech recognition score.

The LR parser 5 recognizes continuous speech by successively concatenating phonemes based on the input speech recognition scores. When a plurality of phonemes are predicted in the phoneme matching unit 4, the existence of all of them is checked, and high-speed processing is achieved by pruning, that is, by keeping only the partial trees whose partial speech recognition likelihood is high.
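The pruning step described above can be illustrated with a minimal Python sketch. The description only says that partial trees with a high likelihood are kept; the fixed beam width, the tuple layout, and the function name `prune` are assumptions made for illustration and are not taken from Patent Document 1.

```python
def prune(hypotheses, beam_width):
    """Keep the beam_width partial hypotheses with the highest log-likelihood.

    hypotheses: list of (phoneme_sequence, log_likelihood) tuples (hypothetical layout).
    beam_width: number of survivors; a fixed width is an assumption of this sketch.
    """
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    return ranked[:beam_width]


# Four partial hypotheses; only the two most likely ones survive.
partial = [("k a", -12.0), ("k o", -15.5), ("g a", -13.1), ("t a", -20.4)]
survivors = prune(partial, beam_width=2)
```

Because every speaker's hypotheses compete in the same beam, hypotheses from poorly matching speakers tend to drop out on their own, which is how the prior-art system selects a speaker implicitly.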

After that, when the speaker's speech received by the microphone 1 has been processed to the end, the LR parser 5 outputs, as speech recognition result data, the hypothesis with the maximum overall likelihood or a predetermined number of top-ranked hypotheses.
Specification of Japanese Patent No. 2905674

However, when speech is recognized while automatically discriminating among a plurality of speakers as described above, a speech recognition system that prunes the hypotheses of the plurality of speakers in common must carry out everything up to and including the hypothesis calculation for the acoustic models of all speakers, and is therefore inefficient in terms of the amount of processing.

An object of the present invention is to provide a speech recognition system, a speech recognition method, and a speech recognition program that can recognize input speech while discriminating and selecting the acoustic model for that input speech with a small amount of processing.

The speech recognition system of the present invention comprises: acoustic likelihood calculation means for calculating, for the feature quantities of input speech, acoustic likelihoods against a plurality of acoustic models; model discrimination means for discriminating and selecting the acoustic model of the input speech, at the point when the acoustic likelihood calculation means has finished the acoustic likelihood calculation for the first predetermined section of the input speech, based on the acoustic likelihoods calculated within that section; and hypothesis processing means for performing hypothesis processing, when the acoustic likelihood calculation means finishes the acoustic likelihood calculation for a predetermined section of the input speech, using the acoustic likelihoods calculated within that section for the acoustic model selected by the model discrimination means, and for outputting a speech recognition result based on the hypothesis processing results after the hypothesis processing for all sections of the input speech has finished. In the first predetermined section of the input speech, the acoustic likelihood calculation means calculates the acoustic likelihood of the feature quantities against every one of the plurality of acoustic models; in the predetermined sections after the first, it calculates the acoustic likelihood against only the acoustic model selected by the model discrimination means.

With this configuration, the acoustic likelihood calculation is performed once per acoustic model, that is, a plurality of times, only in the first predetermined section of the input speech; in the predetermined sections after the first, it is performed only for the already selected acoustic model, that is, only once. The amount of processing needed to recognize the input speech while discriminating and selecting its acoustic model can therefore be kept small.

Here, the target of model discrimination may be the speaker, in which case the speaker may be discriminated by gender, age, or dialect. Alternatively, the target of model discrimination may be the noise environment, the transmission characteristics, the language, or a combination of these.

According to the present invention, the acoustic likelihood calculation is performed once per acoustic model, that is, a plurality of times, only in the first predetermined section of the input speech, and only once, for the already selected acoustic model, in the sections after the first. In addition, the hypothesis processing reuses the acoustic likelihoods calculated in the preceding acoustic likelihood calculation, so no acoustic likelihood has to be calculated anew. As a result, the amount of processing needed to recognize the input speech while discriminating and selecting its acoustic model is small.

Next, the best mode for carrying out the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention. The system has an acoustic model storage unit 105 that stores a plurality of acoustic models; a language model storage unit 107 that stores a language model; a microphone 102 through which input speech 101 is input; a feature extraction unit 103 that extracts feature quantities, which are parameters of the input speech 101; an acoustic likelihood calculation unit 104 that calculates, for the extracted feature quantities, acoustic likelihoods against the plurality of acoustic models; a model discrimination unit 108 that, at the point when the acoustic likelihood calculation for the first predetermined section of the input speech 101 has finished, discriminates and selects the acoustic model of the input speech 101 based on the acoustic likelihoods calculated within that section; and a hypothesis processing unit 106 that, each time the acoustic likelihood calculation for a predetermined section of the input speech 101 finishes, performs hypothesis processing using the acoustic likelihoods calculated within that section for the selected acoustic model, and that outputs a speech recognition result 109 based on the hypothesis processing results after the hypothesis processing for all sections of the input speech 101 has finished. In the first predetermined section of the input speech 101, the acoustic likelihood calculation unit 104 calculates the acoustic likelihood of the feature quantities of the input speech 101 against every one of the plurality of acoustic models; in the subsequent predetermined sections, it calculates the acoustic likelihood against only the acoustic model selected by the model discrimination unit 108.

Next, an example of the operation of the speech recognition system according to this embodiment will be described with reference to the flowchart of FIG. 2.

First, the input speech 101 is input from the microphone 102 in a time-synchronous manner, for example 10 milliseconds at a time (step 201). The samples for these 10 milliseconds constitute one frame.

Next, the feature extraction unit 103 extracts the feature quantities of one frame of the input speech 101 (step 202).

Next, the acoustic likelihood calculation unit 104 calculates, for the one frame of feature quantities extracted by the feature extraction unit 103, the acoustic likelihoods against the acoustic models stored in the acoustic model storage unit 105 (step 203). In the first predetermined section of the input speech 101, it calculates the acoustic likelihood against every one of the acoustic models; in the subsequent sections, it calculates the acoustic likelihood against only the already selected acoustic model. The acoustic likelihoods calculated here are accumulated in the acoustic likelihood calculation unit 104.

When acoustic likelihoods have been accumulated for a predetermined section, for example 60 frames, and that section is the beginning of the input speech 101 (Yes in step 204), the model discrimination unit 108 discriminates and selects the acoustic model of the input speech 101 based on the calculated acoustic likelihoods (step 205), and the acoustic likelihoods that the acoustic likelihood calculation unit 104 calculated for the selected acoustic model are used for hypothesis processing in the hypothesis processing unit 106. If the section is not the beginning of the input speech 101 (No in step 204), the model discrimination unit 108 performs no model discrimination, and the acoustic likelihoods that the acoustic likelihood calculation unit 104 calculated for the acoustic model already selected by the model discrimination unit 108 are used for hypothesis processing in the hypothesis processing unit 106.

Next, the hypothesis processing unit 106 performs hypothesis processing frame by frame (step 206), using the acoustic likelihoods that the acoustic likelihood calculation unit 104 calculated within the above predetermined section for the acoustic model selected by the model discrimination unit 108, together with the language model stored in the language model storage unit 107.

When the processing up to and including hypothesis processing has finished for a predetermined section of the input speech 101 and the end of that section is not the end of the whole input speech 101 (No in step 207), the process returns to step 201 and accepts the next frame of the input speech 101. If it is the end of the whole input speech 101 (Yes in step 207), the hypothesis processing unit 106 outputs the most probable word string among those obtained by the hypothesis processing as the speech recognition result 109 (step 208), and the process ends.
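The flow of steps 201 to 208 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the callbacks `score` and `hypothesis_step`, the use of the average likelihood for selection, and the handling of an utterance shorter than one section are all assumptions introduced here.

```python
def recognize(frames, models, score, hypothesis_step, section_len=60):
    """Select an acoustic model on the first section, then score only that model.

    frames: per-frame feature quantities (steps 201-202 are assumed already done).
    models: dict mapping a model name to a model object (at least one entry).
    score(model, frame): hypothetical callback returning a log-likelihood.
    hypothesis_step(state, section_scores): hypothetical callback that extends the
        hypotheses with one section of likelihoods and returns the new state.
    """
    selected = None
    buffered = {name: [] for name in models}
    state = None
    for i, frame in enumerate(frames):
        # Step 203: all models in the first section, only the selected one afterwards.
        active = list(models) if selected is None else [selected]
        for name in active:
            buffered[name].append(score(models[name], frame))
        # A section ends after section_len frames or at the end of the input.
        at_section_end = (i + 1) % section_len == 0 or i + 1 == len(frames)
        if not at_section_end:
            continue
        if selected is None:
            # Steps 204-205: discriminate once, here by the highest average likelihood.
            selected = max(buffered, key=lambda n: sum(buffered[n]) / len(buffered[n]))
        # Step 206: hypothesis processing reuses the likelihoods already computed.
        state = hypothesis_step(state, buffered[selected])
        buffered = {selected: []}
    return selected, state


# Toy usage: models are scalar means, frames are scalars, and the "hypothesis
# state" is just the number of likelihoods consumed so far.
calls = []

def toy_score(model_mean, frame):
    calls.append(1)  # count likelihood evaluations
    return -(frame - model_mean) ** 2

def toy_step(state, section_scores):
    return (state or 0) + len(section_scores)

selected, consumed = recognize([1.0] * 5, {"a": 0.0, "b": 1.0}, toy_score, toy_step, section_len=2)
```

In the toy run, both models are scored for the first two frames and only the selected model for the remaining three, so seven likelihood evaluations are made instead of ten.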

As described above, in the present invention, the acoustic likelihood calculation unit 104 performs the acoustic likelihood calculation for the first predetermined section of the input speech 101 using all of the acoustic models, and when the calculation for that section has finished, the model discrimination unit 108 selects the acoustic model with the best acoustic likelihood result. For the predetermined sections after the first, the acoustic likelihood calculation unit 104 performs the acoustic likelihood calculation using only the acoustic model selected by the model discrimination unit 108.

As a result, the acoustic likelihood calculation is performed once per acoustic model, that is, a plurality of times, only in the first section of the input speech 101, and only once, for the already selected acoustic model, in the sections after the first. The amount of processing needed to select the acoustic model of the input speech 101 from among a plurality of acoustic models and recognize the speech can therefore be kept small.
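As a rough worked example of this saving, the sketch below counts per-frame likelihood evaluations for the scheme described here and for fully parallel recognition. The 10-millisecond frame and 60-frame first section are the example values given in the description; the 3-second utterance and the function name are assumptions for illustration, and the count ignores the cost of hypothesis processing.

```python
def likelihood_evaluations(total_frames, head_frames, num_models):
    """Return (proposed, parallel) counts of per-frame likelihood evaluations."""
    head = min(head_frames, total_frames)
    # All models on the first section, one selected model on the rest.
    proposed = head * num_models + (total_frames - head)
    # Every model on every frame.
    parallel = total_frames * num_models
    return proposed, parallel


# 3 s of speech at 10 ms per frame = 300 frames, 60-frame first section, 2 models.
proposed, parallel = likelihood_evaluations(total_frames=300, head_frames=60, num_models=2)
```

With these numbers the scheme needs 360 evaluations against 600 for parallel recognition, and the gap widens as the utterance gets longer or the number of models grows.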

Next, the speech recognition system of the present invention will be described using specific examples.

(First Example)
FIG. 3 is a block diagram showing the configuration of a speech recognition system according to a first example of the present invention. This system uses the speech of unspecified speakers as a specific example of the input speech 101 shown in FIG. 1, so that the target of model discrimination is the speaker, and uses a male speaker model and a female speaker model as specific examples of the acoustic models, so that speakers are discriminated by gender. Accordingly, the model discrimination unit 108 shown in FIG. 1 is replaced with a speaker discrimination unit 308 that discriminates the speaker model, and the acoustic model storage unit 105 is replaced with a male speaker model storage unit 305A that stores the male speaker model and a female speaker model storage unit 305B that stores the female speaker model. In FIG. 3, the microphone 302, the feature extraction unit 303, the acoustic likelihood calculation unit 304, the hypothesis processing unit 306, and the language model storage unit 307 correspond to the microphone 102, the feature extraction unit 103, the acoustic likelihood calculation unit 104, the hypothesis processing unit 106, and the language model storage unit 107 shown in FIG. 1, respectively.

Next, an example of the operation of the speech recognition system according to this example will be described.

First, the input speech 301 is input from the microphone 302 in a time-synchronous manner, for example 10 milliseconds at a time. The samples for these 10 milliseconds constitute one frame.

Next, the feature extraction unit 303 extracts the feature quantities of one frame of the input speech 301.

Next, the acoustic likelihood calculation unit 304 calculates, for the one frame of feature quantities extracted by the feature extraction unit 303, the acoustic likelihood against the male speaker model stored in the male speaker model storage unit 305A and the acoustic likelihood against the female speaker model stored in the female speaker model storage unit 305B. The acoustic likelihoods calculated here are accumulated in the acoustic likelihood calculation unit 304.

When the acoustic likelihoods for the first predetermined section of the input speech 301, for example the first 60 frames, have been accumulated, the speaker discrimination unit 308 compares the average acoustic likelihoods of the male speaker model and the female speaker model over that section. If, for example, the average acoustic likelihood of the male speaker model indicates a better match than that of the female speaker model, the male speaker model is selected. The following description assumes that the male speaker model has been selected.
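The comparison of average acoustic likelihoods can be sketched as below. The sketch assumes that the per-frame values are log-likelihoods for which a larger value means a better match; the description only says that the more plausible model is selected. The dictionary layout and the function name are hypothetical.

```python
def select_by_average(scores):
    """Return the model name whose per-frame log-likelihoods have the highest mean.

    scores: dict mapping a model name to the list of per-frame log-likelihoods
    accumulated over the first section (hypothetical layout).
    """
    return max(scores, key=lambda name: sum(scores[name]) / len(scores[name]))


# Three frames for brevity; the description uses 60 frames as its example.
first_section = {
    "male": [-4.0, -5.0, -3.0],    # mean -4.0
    "female": [-6.0, -2.0, -7.0],  # mean -5.0
}
chosen = select_by_average(first_section)
```

Here the male speaker model has the higher mean and is selected, even though the female model scores better on the second frame.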

Next, the hypothesis processing unit 306 performs hypothesis processing frame by frame, using the acoustic likelihoods that the acoustic likelihood calculation unit 304 calculated within the above predetermined section for the male speaker model selected by the speaker discrimination unit 308, together with the language model stored in the language model storage unit 307, and obtains a plurality of word string hypotheses in order of plausibility.

When the processing up to and including hypothesis processing has finished for the first predetermined section of the input speech 301, the acoustic likelihood calculation unit 304 and the hypothesis processing unit 306 perform the acoustic likelihood calculation and the hypothesis processing for the subsequent sections using only the already selected male speaker model. When the processing has reached the end of the whole input speech 301, the hypothesis processing unit 306 outputs the most probable word string as the speech recognition result 309.

As described above, this example makes it possible to use a speaker model specialized for male or female speakers without specifying the speaker's gender in advance, and the increase in the amount of processing compared with specifying the gender in advance is small. That is, the acoustic likelihood calculation is performed a plurality of times (twice in this example) only for the first predetermined section.

In this example, speakers are discriminated by gender and divided into two clusters, male and female, but it is equally easy to divide speakers into a plurality of clusters by some other criterion and discriminate the speaker model among them. For example, speakers may be discriminated by age, distinguishing children's speech, adults' speech, and elderly people's speech, or by dialect, distinguishing English spoken by Americans from English spoken by Japanese speakers.

Also, although the target of model discrimination in this example is the speaker, it is not limited to the speaker. For example, the target may be the noise environment, with speech recognized while the noise environment is discriminated using models corresponding to a plurality of noise environments with different signal-to-noise ratios, or the target may be the transmission characteristics, with speech recognized while the transmission characteristics are discriminated using models corresponding to different characteristics such as mobile telephone speech and fixed-line telephone speech. The target may also be the language, with speech recognized while the language is discriminated using, for example, three models for Japanese, English, and Chinese. In that case, the languages may be narrowed down to one in the first predetermined section, or the two most plausible ones may be selected first and then narrowed down to one in the next predetermined section. The target of model discrimination may also be a combination of the noise environment, transmission characteristics, and language mentioned above.
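The two-stage narrowing mentioned for language discrimination might look like the following sketch. It assumes that each stage ranks the remaining models by the average log-likelihood of that section alone; the description does not say whether the second stage also takes the first section's likelihoods into account. The function name and the sample values are hypothetical.

```python
def narrow(scores, keep):
    """Return the `keep` model names with the highest average log-likelihood."""
    ranked = sorted(scores, key=lambda n: sum(scores[n]) / len(scores[n]), reverse=True)
    return ranked[:keep]


# First section: all three language models are scored, the best two are kept.
first = {"ja": [-3.0, -4.0], "en": [-5.0, -3.0], "zh": [-9.0, -8.0]}
candidates = narrow(first, keep=2)

# Second section: only the two survivors are scored, and one is finally selected.
second = {"ja": [-6.0, -5.0], "en": [-2.0, -3.0]}
final = narrow({name: second[name] for name in candidates}, keep=1)
```

Deferring the final decision costs one extra model evaluation per frame during the second section, in exchange for more evidence before committing to a single language.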

Furthermore, although this example compares acoustic likelihoods for speaker discrimination using the average of the per-frame acoustic likelihoods over the section, the comparison is not limited to this method. The models' acoustic likelihoods may instead be compared frame by frame, with the decision made by majority vote over the section.
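The majority-vote alternative can be sketched as follows, again assuming log-likelihoods where larger means a better match. Tie handling is not specified in the description; this sketch simply returns whichever name `Counter.most_common` lists first. The names and sample values are hypothetical.

```python
from collections import Counter


def select_by_majority(scores):
    """Return the model name that wins the most frames in the section.

    scores: dict mapping a model name to per-frame log-likelihoods of equal length.
    """
    names = list(scores)
    n_frames = len(scores[names[0]])
    # For each frame, find the model with the highest likelihood and count its win.
    wins = Counter(max(names, key=lambda n: scores[n][t]) for t in range(n_frames))
    return wins.most_common(1)[0][0]


# One badly scored frame drags the male model's average down,
# but it still wins two of the three frames.
section = {"male": [-1.0, -1.0, -30.0], "female": [-2.0, -2.0, -3.0]}
winner = select_by_majority(section)
```

The example shows where the two methods differ: the average would favor the female model because of the single outlier frame, while the majority vote selects the male model.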

(Second Example)
FIG. 4 is a block diagram showing the configuration of a speech recognition system according to a second example of the present invention. As in the first example, this system uses the speech of unspecified speakers as a specific example of the input speech 101 shown in FIG. 1, so that the target of model discrimination is the speaker, and uses a male speaker model and a female speaker model as specific examples of the acoustic models, so that speakers are discriminated by gender. Accordingly, the model discrimination unit 108 shown in FIG. 1 is replaced with a speaker discrimination unit 408 that discriminates the speaker model, and the acoustic model storage unit 105 is replaced with a male speaker model storage unit 405A that stores the male speaker model and a female speaker model storage unit 405B that stores the female speaker model. In FIG. 4, the microphone 402, the feature extraction unit 403, the acoustic likelihood calculation unit 404, the hypothesis processing unit 406, and the language model storage unit 407 correspond to the microphone 102, the feature extraction unit 103, the acoustic likelihood calculation unit 104, the hypothesis processing unit 106, and the language model storage unit 107 shown in FIG. 1, respectively. In addition, this system is newly provided with a threshold determination unit 410 that determines whether the average acoustic likelihood of the model that the speaker discrimination unit 408 is about to select exceeds a predetermined threshold.

First, the input speech 401 is input from the microphone 402 in a time-synchronous manner, for example 10 milliseconds at a time. The samples for these 10 milliseconds constitute one frame.

Next, the feature extraction unit 403 extracts the feature quantities of one frame of the input speech 401.

Next, the acoustic likelihood calculation unit 404 calculates, for the one frame of feature quantities extracted by the feature extraction unit 403, the acoustic likelihood against the male speaker model stored in the male speaker model storage unit 405A and the acoustic likelihood against the female speaker model stored in the female speaker model storage unit 405B. The acoustic likelihoods calculated here are accumulated in the acoustic likelihood calculation unit 404.

When the acoustic likelihoods for a predetermined section at the beginning of the input speech 401, for example the first 30 frames, have been accumulated, the speaker discrimination unit 408 compares the average acoustic likelihoods of the male speaker model and the female speaker model over that section. If, for example, the average acoustic likelihood of the male speaker model is more likely than that of the female speaker model, the speaker discrimination unit 408 passes the average acoustic likelihood of the male speaker model to the threshold determination unit 410. The following description assumes that the average acoustic likelihood of the male speaker model has been passed to the threshold determination unit 410.
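The comparison of average likelihoods can be sketched as below. The function name `pick_model` and the dictionary layout are assumptions for illustration, and the scores are treated as log-likelihoods so that a larger value means a more likely model. The 30-frame default follows the example in the text.

```python
def pick_model(accumulated, head_frames=30):
    """Return (model_name, average) for the model with the highest mean score
    over the first head_frames frames, or None if too few frames are stored."""
    if not accumulated:
        return None
    averages = {}
    for name, scores in accumulated.items():
        if len(scores) < head_frames:
            return None  # the head section is not complete yet
        averages[name] = sum(scores[:head_frames]) / head_frames
    best = max(averages, key=averages.get)
    return best, averages[best]


scores = {"male": [-2.0] * 30, "female": [-5.0] * 30}
print(pick_model(scores))  # prints: ('male', -2.0)
```

The returned average is the value that would be handed to the threshold check in the next step.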

Next, the threshold determination unit 410 determines whether the average acoustic likelihood of the male speaker model passed from the speaker discrimination unit 408 exceeds a predetermined threshold and, if it does, for example, returns a determination result of acceptance to the speaker discrimination unit 408. The following description assumes that a determination result of acceptance has been returned to the speaker discrimination unit 408.

Next, because the threshold determination unit 410 has returned a determination result of acceptance, the speaker discrimination unit 408 selects the male speaker model as the model discrimination result.

Next, the hypothesis processing unit 406 performs hypothesis processing frame by frame, using the acoustic likelihoods that the acoustic likelihood calculation unit 404 calculated within the predetermined section for the male speaker model selected by the speaker discrimination unit 408, together with the language model stored in the language model storage unit 407, and obtains a plurality of word string hypotheses in descending order of likelihood.

When the processing up to the hypothesis processing for the predetermined section at the beginning of the input speech 401 has finished, the acoustic likelihood calculation unit 404 and the hypothesis processing unit 406 perform acoustic likelihood calculation and hypothesis processing, respectively, for the subsequent sections using only the already selected male speaker model. When processing has reached the end of the input speech 401, the hypothesis processing unit 406 outputs the most likely word string as the speech recognition result 409.
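The two-phase flow described here (all models on the head section, then only the selected model) can be sketched as follows, with a counter that makes the saving in likelihood evaluations visible. This is a simplified illustration under stated assumptions: hypothesis processing and the threshold check are omitted, and `run_two_phase`, `toy_score`, and the scalar "models" are hypothetical stand-ins.

```python
def run_two_phase(features, models, score, head_frames=30):
    """Score all models on the head section, then only the selected one.

    Returns (selected_name, per_frame_scores_of_selected, n_evaluations).
    """
    evaluations = 0
    head_scores = {name: [] for name in models}
    for x in features[:head_frames]:               # phase 1: every model
        for name, model in models.items():
            head_scores[name].append(score(x, model))
            evaluations += 1
    # Equal frame counts, so comparing sums equals comparing averages.
    selected = max(head_scores, key=lambda n: sum(head_scores[n]))
    selected_scores = list(head_scores[selected])  # reused, not rescored
    for x in features[head_frames:]:               # phase 2: selected model only
        selected_scores.append(score(x, models[selected]))
        evaluations += 1
    return selected, selected_scores, evaluations


def toy_score(x, centre):
    """Toy scorer: negative squared distance to a scalar 'model'."""
    return -(x - centre) ** 2


features = [0.1] * 100
selected, scores, n = run_two_phase(features, {"male": 0.0, "female": 1.0}, toy_score)
print(selected, len(scores), n)  # prints: male 100 130
```

With 100 frames and two models, scoring both models on every frame would take 200 evaluations; restricting the second phase to the selected model reduces this to 130 in the example.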

As described above, in this embodiment the threshold determination makes it possible to give a variable length to the leading section of the input speech 401 in which speaker discrimination is performed, and as a result speaker discrimination can be performed with higher accuracy.
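One way to read the variable-length leading section is sketched below. The passage above does not state what happens when the threshold check rejects, so the behaviour shown here (extend the section by a fixed number of frames and retry) is an assumption made for illustration, as are the function name `discriminate_with_threshold`, the 10-frame step, and the use of log-likelihood scores.

```python
def discriminate_with_threshold(scores_by_model, threshold,
                                initial_frames=30, step_frames=10):
    """Grow the head section until the best average clears the threshold.

    Returns (model_name, frames_used), or (None, frames_available) if the
    threshold is never cleared within the available frames.
    """
    available = min(len(s) for s in scores_by_model.values())
    n = initial_frames
    while n <= available:
        averages = {name: sum(s[:n]) / n for name, s in scores_by_model.items()}
        best = max(averages, key=averages.get)
        if averages[best] > threshold:   # "accept" from the threshold check
            return best, n
        n += step_frames                 # "reject": extend the section (assumed)
    return None, available


# The male scores improve after frame 30, so the decision is deferred.
example = {
    "male": [-4.0] * 30 + [-1.0] * 20,
    "female": [-6.0] * 50,
}
result = discriminate_with_threshold(example, threshold=-3.0)
print(result)  # prints: ('male', 50)
```

In the example the male average is -4.0 at 30 frames and -3.25 at 40 frames, both below the threshold, and reaches -2.8 at 50 frames, at which point the model is accepted.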

In the present invention, the processing in the speech recognition system may be realized not only by the dedicated hardware described above but also by executing a speech recognition program that implements its functions. In this case, the speech recognition system is configured as a single computer, the speech recognition program is recorded on a computer-readable recording medium, and the computer reads and executes the speech recognition program recorded on that medium. The computer-readable recording medium refers to removable recording media such as floppy disks, magneto-optical disks, DVDs, and CDs, as well as an HDD built into the computer.

The present invention is applicable to uses such as a speech input interface for a computer.

A block diagram showing the configuration of a speech recognition system according to an embodiment of the present invention.
A flowchart explaining an example of the operation of the speech recognition system shown in FIG. 1.
A block diagram showing the configuration of a speech recognition system according to the first embodiment of the present invention.
A block diagram showing the configuration of a speech recognition system according to the second embodiment of the present invention.
A block diagram showing an example of the configuration of a conventional speech recognition system.

Explanation of symbols

101 Input speech
102 Microphone
103 Feature extraction unit
104 Acoustic likelihood calculation unit
105 Acoustic model storage unit
106 Hypothesis processing unit
107 Language model storage unit
108 Model discrimination unit
109 Speech recognition result
201 to 208 Steps
301 Input speech
302 Microphone
303 Feature extraction unit
304 Acoustic likelihood calculation unit
305A Male speaker model storage unit
305B Female speaker model storage unit
306 Hypothesis processing unit
307 Language model storage unit
308 Speaker discrimination unit
309 Speech recognition result
401 Input speech
402 Microphone
403 Feature extraction unit
404 Acoustic likelihood calculation unit
405A Male speaker model storage unit
405B Female speaker model storage unit
406 Hypothesis processing unit
407 Language model storage unit
408 Speaker discrimination unit
409 Speech recognition result
410 Threshold determination unit

Claims

A speech recognition system that performs speech recognition of input speech while discriminating and selecting an acoustic model for the input speech from among a plurality of acoustic models, comprising:
acoustic likelihood calculation means for calculating, in time synchronization, acoustic likelihoods with the plurality of acoustic models for the feature amounts of the input speech;
model discrimination means for discriminating and selecting the acoustic model of the input speech, at the time when the acoustic likelihood calculation means has finished the acoustic likelihood calculation for a predetermined section at the beginning of the input speech, based on the acoustic likelihoods calculated by the acoustic likelihood calculation means within that predetermined section; and
hypothesis processing means for performing hypothesis processing in time synchronization, at the time when the acoustic likelihood calculation means has finished the acoustic likelihood calculation for a predetermined section of the input speech, using the acoustic likelihoods calculated within that predetermined section for the acoustic model selected by the model discrimination means, and for outputting a speech recognition result based on the hypothesis processing result after the hypothesis processing of all sections of the input speech has finished,
wherein the acoustic likelihood calculation means calculates, in time synchronization, acoustic likelihoods with all of the plurality of acoustic models for the feature amounts of the input speech in the predetermined section at the beginning of the input speech, and, in predetermined sections after the beginning of the input speech, calculates, in time synchronization, acoustic likelihoods with only the acoustic model selected by the model discrimination means for the feature amounts of the input speech.

The speech recognition system according to claim 1, wherein the target of model discrimination by the model discrimination means is a speaker.

The speech recognition system according to claim 2, wherein the target of speaker discrimination by the model discrimination means is gender, age, or dialect.

The speech recognition system according to claim 1, wherein the target of model discrimination by the model discrimination means is a noise environment, a transmission characteristic, a language, or a combination thereof.

A speech recognition method performed by a speech recognition system that performs speech recognition of input speech while discriminating and selecting an acoustic model for the input speech from among a plurality of acoustic models, the method comprising:
a first step in which acoustic likelihood calculation means calculates, in time synchronization, acoustic likelihoods with all of the plurality of acoustic models for the feature amounts of the input speech in a predetermined section at the beginning of the input speech;
a second step in which model discrimination means, at the time when the acoustic likelihood calculation for the predetermined section at the beginning of the input speech has finished, discriminates and selects the acoustic model of the input speech based on the acoustic likelihoods calculated in the first step within that predetermined section;
a third step in which hypothesis processing means, at the time when the acoustic likelihood calculation for the predetermined section at the beginning of the input speech has finished, performs hypothesis processing in time synchronization using the acoustic likelihoods calculated in the first step within that predetermined section for the acoustic model selected in the second step;
a fourth step in which the acoustic likelihood calculation means calculates, in time synchronization, acoustic likelihoods with only the acoustic model selected in the second step for the feature amounts of the input speech in a predetermined section after the beginning of the input speech; and
a fifth step in which the hypothesis processing means, at the time when the acoustic likelihood calculation for the predetermined section after the beginning of the input speech has finished, performs hypothesis processing in time synchronization using the acoustic likelihoods calculated in the fourth step within that predetermined section for the acoustic model selected in the second step, and, after the hypothesis processing of all sections of the input speech has finished, outputs a speech recognition result based on the hypothesis processing result.

The speech recognition method according to claim 5, wherein, in the second step, the target of model discrimination is a speaker.

The speech recognition method according to claim 6, wherein, in the second step, the target of speaker discrimination is gender, age, or dialect.

The speech recognition method according to claim 5, wherein, in the second step, the target of model discrimination is a noise environment, a transmission characteristic, a language, or a combination thereof.

A speech recognition program to be executed by a computer that performs speech recognition of input speech while discriminating and selecting an acoustic model for the input speech from among a plurality of acoustic models, the program causing the computer to execute:
a first step of calculating, in time synchronization, acoustic likelihoods with all of the plurality of acoustic models for the feature amounts of the input speech in a predetermined section at the beginning of the input speech;
a second step of discriminating and selecting, at the time when the acoustic likelihood calculation for the predetermined section at the beginning of the input speech has finished, the acoustic model of the input speech based on the acoustic likelihoods calculated in the first step within that predetermined section;
a third step of performing, at the time when the acoustic likelihood calculation for the predetermined section at the beginning of the input speech has finished, hypothesis processing in time synchronization using the acoustic likelihoods calculated in the first step within that predetermined section for the acoustic model selected in the second step;
a fourth step of calculating, in time synchronization, acoustic likelihoods with only the acoustic model selected in the second step for the feature amounts of the input speech in a predetermined section after the beginning of the input speech; and
a fifth step of performing, at the time when the acoustic likelihood calculation for the predetermined section after the beginning of the input speech has finished, hypothesis processing in time synchronization using the acoustic likelihoods calculated in the fourth step within that predetermined section for the acoustic model selected in the second step, and, after the hypothesis processing of all sections of the input speech has finished, outputting a speech recognition result based on the hypothesis processing result.

The speech recognition program according to claim 9, wherein, in the second step, the target of model discrimination is a speaker.

The speech recognition program according to claim 10, wherein, in the second step, the target of speaker discrimination is gender, age, or dialect.

The speech recognition program according to claim 9, wherein, in the second step, the target of model discrimination is a noise environment, a transmission characteristic, a language, or a combination thereof.