JP2005249874A - Device, method, and program for speech recognition

Info

Publication number: JP2005249874A
Application number: JP2004056528A
Authority: JP
Inventors: Nobuyuki Kunieda; 伸行國枝
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2004-03-01
Filing date: 2004-03-01
Publication date: 2005-09-15

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device capable of realizing speech recognition with higher precision than a conventional device.
SOLUTION: The speech recognition device 10 is equipped with a signal state classification means 11 for analyzing the features of an input signal and classifying the signal state of the input signal into one of a plurality of predetermined signal states, a state occurrence frequency calculation means 12 for calculating the occurrence frequency of each signal state, an acoustic model storage memory 13 that stores a plurality of acoustic models and, for each model, the data used when that model was created, an acoustic model mixing ratio calculation means 14 for calculating the mixing ratio of the data according to the occurrence frequency, an acoustic model synthesis means 15 for synthesizing an acoustic model by mixing the data stored in the acoustic model storage memory 13 at the mixing ratio, and a speech recognition means 16 for performing speech recognition processing on the input signal by using the acoustic model synthesized by the acoustic model synthesis means 15.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a speech recognition device, a speech recognition method, and a speech recognition program for recognizing speech.

Conventionally, speech recognition is realized by matching processing that uses an acoustic model close to the characteristics of the input signal. Because those characteristics vary with the speaker's environment, utterance style, and so on at the time of recognition, the recognition processing has to respond to these variations.

For example, consider noise in the speaker's environment. In a noisy environment, high-performance speech recognition can be achieved by preparing many acoustic models, each generated from speech data whose noise type, SN ratio, and so on are close to that environment, and performing matching with them.

The speaker's utterance style also varies: speaking quickly differs from speaking slowly, and reading a newspaper aloud differs from everyday conversational speech. Even when speakers say the same content, utterance style differs from person to person with gender, age, loudness of voice, and so on. Even for a single speaker, utterance style differs when speaking on the telephone or in a reverberant room such as a bathroom. Because a different utterance style gives the speech different characteristics, achieving high recognition performance requires preparing many acoustic models generated from speech data close to the speaker's utterance style and performing matching with them.

As described above, the field of speech recognition demands robust recognition that can cope with diverse changes in the speaker's environment, utterance style, and so on (hereinafter "environment etc.") at the time of recognition, and various proposals have been made to meet this demand.

For example, the in-vehicle speech recognition device described in Non-Patent Document 1 uses running noise and idling noise as the in-vehicle noise superimposed on the training speech, and recognizes speech uttered by an occupant with an acoustic model created from speech data containing these two kinds of noise.
Non-Patent Document 1: Hiroaki Kokubo, Akio Amano, Nobuo Hataoka, "Noise Countermeasures and Their Evaluation in In-Vehicle Speech Recognition", IEICE Transactions D-II, Vol. J83-D-II, No. 11, pp. 2190-2197, November 2000

However, for such a conventional speech recognition device to recognize speech with high accuracy in real environments, where the environment etc. changes in many ways, an enormous number of acoustic models must be prepared, each created from speech data for one of the assumed environments.

With a conventional speech recognition device it is therefore difficult to prepare acoustic models that can follow diverse changes in the environment etc., because of constraints such as the memory capacity that can be installed in the device and the time required for recognition processing. As a result, recognition accuracy falls when the environment etc. changes.

The present invention has been made to solve this problem, and provides a speech recognition device, a speech recognition method, and a speech recognition program that can achieve speech recognition with higher accuracy than conventional devices even when the environment etc. in which speech is recognized changes.

The speech recognition device of the present invention comprises: signal state classification means for analyzing the features of an input signal and classifying the signal state of the input signal into one of a plurality of predetermined signal states; state occurrence frequency calculation means for calculating the occurrence frequency of the signal states; acoustic model mixing ratio calculation means for calculating, based on the occurrence frequency, a mixing ratio for mixing the data that was used for each of a plurality of predetermined acoustic models when those models were created; acoustic model synthesis means for synthesizing a new acoustic model by mixing the data at the mixing ratio; and speech recognition means for performing speech recognition processing on the input signal using the synthesized acoustic model.

With this configuration, the acoustic model mixing ratio calculation means calculates the mixing ratio of the data used for each acoustic model based on the occurrence frequency of the signal states, and the speech recognition means performs recognition on the input signal with an acoustic model trained on data mixed at a ratio suited to the environment etc. at that time. Speech recognition can therefore be achieved with higher accuracy than conventional devices even when the environment etc. changes.

In the speech recognition device of the present invention, the acoustic model mixing ratio calculation means increases the mixing ratio of the data used for each acoustic model corresponding to the signal states classified by the signal state classification means as the occurrence frequency of that signal state becomes higher.

With this configuration, the acoustic model mixing ratio calculation means gives the data used for each acoustic model a larger mixing ratio as the occurrence frequency becomes higher, so speech recognition processing adapted to the signal state of the input signal can be realized.

Furthermore, in the speech recognition device of the present invention, the acoustic model mixing ratio calculation means calculates the mixing ratio based on the speech recognition result of the speech recognition means.

With this configuration, the acoustic model mixing ratio calculation means calculates the mixing ratio of the data based on the speech recognition result of the speech recognition means, so the mixing ratio can be reset according to the recognition result and recognition accuracy can be improved.

Furthermore, in the speech recognition device of the present invention, the acoustic model mixing ratio calculation means calculates the mixing ratio based on the trend of change in the occurrence frequency when a score, which expresses the likelihood of the speech recognition result as a number, is equal to or less than a predetermined value.

With this configuration, the acoustic model mixing ratio calculation means calculates the mixing ratio based on the trend of change in the occurrence frequency when the score of the speech recognition result is equal to or less than the predetermined value. A mixing ratio adapted to both the signal state of the input signal and the score of the recognition result can thus be calculated, and speech recognition can be achieved with higher accuracy than conventional devices.

Furthermore, the speech recognition device of the present invention comprises acoustic model mixing ratio input means for inputting the mixing ratio.

With this configuration, the mixing ratio is supplied through the acoustic model mixing ratio input means, so the mixing ratio of the acoustic models in the initial state and similar situations can be set easily.

Furthermore, the speech recognition device of the present invention comprises test data generation means for generating test data used to set the mixing ratio.

With this configuration, the test data generation means generates test data for setting the mixing ratio, so the relationship between the signal state and the score of the speech recognition result can be ascertained and speech recognition performance can be improved.

Furthermore, in the speech recognition device of the present invention, the data includes at least one of the time waveform data, mel-cepstrum coefficients, delta mel-cepstrum coefficients, and delta log-power coefficients of the speech data used when the acoustic model was created.

With this configuration, the acoustic model storage memory stores the time waveform data, mel-cepstrum coefficients, and so on of the speech data used when the acoustic model was created. By mixing these data at the mixing ratio to synthesize an acoustic model, speech recognition can be achieved with higher accuracy than conventional devices even when the environment etc. in which speech is recognized changes.

Furthermore, in the speech recognition device of the present invention, the acoustic model has the structure of a hidden Markov model, and the acoustic model storage means stores the acoustic model as data of the transition probabilities and output probabilities defined by the hidden Markov model.

With this configuration, because the acoustic model has the structure of a hidden Markov model, calculating the mixing ratio of the transition probability or output probability data allows speech recognition to be achieved with higher accuracy than conventional devices even when the environment etc. in which speech is recognized changes.

The speech recognition method of the present invention analyzes the features of an input signal, classifies the signal state of the input signal into one of a plurality of predetermined signal states, then calculates, based on the occurrence frequency of the signal states, a mixing ratio for mixing the data that was used when predetermined acoustic models were created, synthesizes an acoustic model by mixing the data at the mixing ratio, and performs speech recognition processing on the input signal using the synthesized acoustic model.

With this method, the mixing ratio of the data used for each acoustic model is calculated based on the occurrence frequency of the signal states, and speech recognition processing is performed on the input signal with an acoustic model trained on the data mixed at that ratio. Speech recognition can therefore be achieved with higher accuracy than conventional devices even when the environment etc. in which speech is recognized changes.

The speech recognition program of the present invention includes the steps of: analyzing the features of an input signal; classifying the signal state of the input signal into one of a plurality of predetermined signal states; calculating, based on the occurrence frequency of the signal states, a mixing ratio for mixing the data that was used when predetermined acoustic models were created; synthesizing an acoustic model by mixing the data at the mixing ratio; and performing speech recognition processing on the input signal using the synthesized acoustic model.

This program allows a computer to execute speech recognition processing with higher accuracy than conventional programs.

The present invention provides a speech recognition device that can achieve speech recognition with higher accuracy than conventional devices.

Embodiments of the present invention are described below with reference to the drawings.

(First embodiment)
First, the configuration of the speech recognition device according to the first embodiment of the present invention will be described.

As shown in FIG. 1, the speech recognition device 10 of this embodiment comprises: signal state classification means 11, which analyzes the features of an input signal and classifies its signal state into one of a plurality of predetermined signal states; state occurrence frequency calculation means 12, which calculates the occurrence frequency of the signal states classified by the signal state classification means 11; an acoustic model storage memory 13, which stores a plurality of predetermined acoustic models together with the data, for each model, that was used when the models were created; acoustic model mixing ratio calculation means 14, which calculates the mixing ratio of the data based on the occurrence frequency calculated by the state occurrence frequency calculation means 12; acoustic model synthesis means 15, which synthesizes an acoustic model by mixing the data stored in the acoustic model storage memory 13 at the mixing ratio; and speech recognition means 16, which performs speech recognition processing on the input signal using the acoustic model synthesized by the acoustic model synthesis means 15.

Here, a signal state classified by the signal state classification means 11 is a state distinguished by features such as the type of noise, the SN ratio, or the speaker's gender, age, or speaking speed.

The signal state classification means 11 analyzes the features of the input signal and classifies it, for example, into one of several preset signal states defined by the SN ratio of the input signal.

More specifically, suppose that three signal states are preset as the states to be distinguished by the signal state classification means 11: "signal state A, with an SN ratio of 25 dB or more", "signal state B, with an SN ratio of 15 dB or more and less than 25 dB", and "signal state C, with an SN ratio of less than 15 dB". The signal state classification means 11 then calculates the SN ratio of the input signal for each predetermined time frame and thereby classifies the signal state of the input signal, frame by frame, as signal state A, B, or C.
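The threshold rule described above can be sketched as follows. This is a minimal illustration only: the function name `classify_signal_state` and the sample SN ratio values are hypothetical, and how the per-frame SN ratio is estimated is not specified in the text and is left out.

```python
def classify_signal_state(snr_db: float) -> str:
    """Map one frame's SN ratio (in dB) to signal state A, B, or C."""
    if snr_db >= 25.0:
        return "A"  # SN ratio of 25 dB or more
    if snr_db >= 15.0:
        return "B"  # SN ratio of 15 dB or more and less than 25 dB
    return "C"      # SN ratio of less than 15 dB


# Hypothetical per-frame SN ratios; their estimation is outside this sketch.
frame_snrs_db = [31.2, 24.9, 15.0, 9.5]
frame_states = [classify_signal_state(s) for s in frame_snrs_db]
print(frame_states)
```

The boundaries are closed on the lower side, matching "25 dB or more" and "15 dB or more and less than 25 dB" in the description.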

In the following description, the speech recognition device 10 of this embodiment is assumed to recognize speech from input signals in the signal states A, B, and C described above.

The state occurrence frequency calculation means 12 calculates the occurrence frequency of the signal states by calculating the proportion of time, within a predetermined period (hereinafter the "occurrence frequency calculation time"), during which the input signal is in each of the signal states A, B, and C.

Specifically, suppose the occurrence frequency calculation time is 100 seconds. If the input signal spends 30 seconds in signal state A (SN ratio of 25 dB or more), 60 seconds in signal state B (SN ratio of 15 dB or more and less than 25 dB), and 10 seconds in signal state C (SN ratio of less than 15 dB), the state occurrence frequency calculation means 12 calculates the occurrence frequencies of signal states A, B, and C as 30%, 60%, and 10%, respectively.
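The 100-second example can be reproduced with a short sketch. It assumes, purely for illustration, that the window is represented as a list of per-frame state labels of equal duration (here one label per second); the function name `occurrence_frequencies` is hypothetical.

```python
from collections import Counter


def occurrence_frequencies(frame_states, states=("A", "B", "C")):
    """Return the fraction of equal-length frames spent in each signal state."""
    if not frame_states:
        raise ValueError("at least one classified frame is required")
    counts = Counter(frame_states)
    total = len(frame_states)
    # Counter returns 0 for states that never occurred in the window.
    return {s: counts[s] / total for s in states}


# 100 one-second frames: 30 in state A, 60 in state B, 10 in state C.
window = ["A"] * 30 + ["B"] * 60 + ["C"] * 10
freqs = occurrence_frequencies(window)
print(freqs)
```

Expressed as fractions, the result corresponds to the 30%, 60%, and 10% figures in the example.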

The occurrence frequency calculation time is not limited to the 100 seconds mentioned above and may be, for example, 20 seconds or 10 minutes. The state occurrence frequency calculation means 12 may also be configured to set the occurrence frequency calculation time according to, for example, how the SN ratio changes over time. With this configuration, the state occurrence frequency calculation means 12 calculates the occurrence frequency, for example, every 5 seconds when the SN ratio changes over short periods and every 20 minutes when the SN ratio hardly changes over time, so resources such as the CPU and memory can be used effectively.
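One possible way to choose the calculation time from the variability of the SN ratio is sketched below. The text does not say how "changes over short periods" is detected, so the use of a standard deviation and the 3 dB threshold are assumptions made only for this illustration, as is the function name `choose_window_seconds`.

```python
from statistics import pstdev


def choose_window_seconds(recent_snr_db, spread_threshold_db=3.0,
                          fast_window_s=5.0, slow_window_s=1200.0):
    """Pick a short window when the SN ratio fluctuates, a long one otherwise.

    spread_threshold_db is an assumed tuning parameter, not a value
    taken from the description.
    """
    if len(recent_snr_db) < 2:
        # Too few samples to show any change; assume a stable environment.
        return slow_window_s
    if pstdev(recent_snr_db) > spread_threshold_db:
        return fast_window_s   # for example, every 5 seconds
    return slow_window_s       # for example, every 20 minutes


print(choose_window_seconds([20.0, 20.5, 19.8, 20.1]))
print(choose_window_seconds([10.0, 25.0, 12.0, 30.0]))
```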

As shown in FIG. 2, the acoustic model storage memory 13 stores, for example, an acoustic model A trained at an SN ratio of 30 dB, an acoustic model B trained at an SN ratio of 20 dB, and an acoustic model C trained at an SN ratio of 10 dB, together with the speech data A, B, and C that were used for training when the acoustic models A, B, and C were created.

Specifically, the acoustic model storage memory 13 holds a plurality of speech data A recorded in an environment with an SN ratio of 30 dB and the acoustic model A created by training on speech data A; a plurality of speech data B recorded in an environment with an SN ratio of 20 dB and the acoustic model B created by training on speech data B; and a plurality of speech data C recorded in an environment with an SN ratio of 10 dB and the acoustic model C created by training on speech data C. The acoustic models A, B, and C shown in FIG. 2 are assumed to include the speech data A, B, and C, respectively.

The acoustic model mixing ratio calculation means 14 calculates the mixing ratios so that, when the acoustic models A, B, and C are synthesized, the speech data corresponding to each acoustic model receives a larger mixing ratio the higher the occurrence frequency calculated by the state occurrence frequency calculation means 12.

For example, as shown in FIG. 2, when the occurrence frequencies of signal states A, B, and C calculated by the state occurrence frequency calculation means 12 are 30%, 60%, and 10%, respectively, the acoustic model mixing ratio calculation means 14 calculates the respective mixing ratios as 0.3, 0.6, and 0.1.
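In this example the mixing ratios are simply the occurrence frequencies rescaled to sum to 1. A minimal sketch of that proportional mapping follows; the description only requires that the ratio grow with the frequency, so proportional scaling is one admissible choice rather than the only one, and the function name `mixing_ratios` is hypothetical.

```python
def mixing_ratios(frequencies):
    """Scale occurrence frequencies (any unit) to mixing ratios that sum to 1."""
    total = sum(frequencies.values())
    if total <= 0:
        raise ValueError("occurrence frequencies must have a positive sum")
    return {state: f / total for state, f in frequencies.items()}


# Occurrence frequencies in percent, as in the example above.
ratios = mixing_ratios({"A": 30, "B": 60, "C": 10})
print(ratios)
```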

The acoustic model synthesis means 15 uses the mixing ratios calculated by the acoustic model mixing ratio calculation means 14 as weighting factors, mixes the speech data A, B, and C stored in the acoustic model storage memory 13, and synthesizes a new acoustic model. The acoustic model synthesized by the acoustic model synthesis means 15 is hereinafter called the "synthesized acoustic model".

Specifically, as shown in FIG. 2, the acoustic model synthesis means 15 trains on data drawn from data A, B, and C, each of which consists of a plurality of data items, in the proportions given by the respective weighting factors 0.3, 0.6, and 0.1, and thereby generates the synthesized acoustic model.
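The data-mixing step can be illustrated by assembling a training set whose composition follows the weighting factors. The sketch below stops before model training, which the text does not detail. Sampling with replacement, the fixed seed, and the names `build_training_set`, `datasets`, and the placeholder utterance IDs are all assumptions made for illustration.

```python
import random


def build_training_set(datasets, ratios, total, seed=0):
    """Draw about ratios[name] * total items from each data set.

    Items are drawn with replacement so that a small data set can still
    fill its share; per-set counts are rounded, so the result may differ
    from `total` by a few items for ratios that do not divide evenly.
    """
    rng = random.Random(seed)
    mixed = []
    for name, utterances in datasets.items():
        count = round(ratios[name] * total)
        mixed.extend(rng.choices(utterances, k=count))
    rng.shuffle(mixed)
    return mixed


# Placeholder utterance identifiers standing in for speech data A, B, and C.
datasets = {
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
    "C": ["c1", "c2", "c3", "c4"],
}
training_set = build_training_set(datasets, {"A": 0.3, "B": 0.6, "C": 0.1}, total=10)
print(training_set)
```

With the weights 0.3, 0.6, and 0.1 and a total of 10, the set contains 3 items from A, 6 from B, and 1 from C, in shuffled order.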

The speech recognition means 16 performs speech recognition processing on the input signal using the synthesized acoustic model and outputs the result of the recognition processing.

Next, the operation of the speech recognition device 10 of this embodiment will be described, taking as an example the case where speech recognition is performed on input signals in the signal states A, B, and C described above.

First, the signal state classification means 11 analyzes the features of the input signal and classifies the signal state of the input signal as signal state A, B, or C.

Next, the state occurrence frequency calculation means 12 calculates the occurrence frequencies of signal states A, B, and C within the occurrence frequency calculation time.

The acoustic model mixing ratio calculation means 14 then calculates, based on the occurrence frequencies, the mixing ratio for mixing the speech data A, B, and C stored in the acoustic model storage memory 13.

Subsequently, the acoustic model synthesis means 15 mixes the speech data A, B, and C stored in the acoustic model storage memory 13, using the mixing ratios as weighting factors, and generates the synthesized acoustic model.

Finally, the speech recognition means 16 performs speech recognition on the input signal using the synthesized acoustic model and outputs the speech recognition result.

Although the signal states are classified by SN ratio in this embodiment, the present invention is not limited to this; they may instead be classified by features such as the type of noise or the speaker's gender, age, or speaking speed.

For example, when the speech recognition device 10 of this embodiment is applied to a portable, voice-input car navigation device, the signal state classification means 11 can be set to classify the signal state of the input signal by noise type, and acoustic models trained on household noise and on in-vehicle noise can be prepared together with their data. Then, even when a route is set by voice at home and the car navigation device is afterwards carried from the home into the vehicle and used continuously, accurate speech recognition matched to the noise type can be performed regardless of the change in the usage environment etc.

The speech recognition device 10 of this embodiment is implemented, for example, by a computer. In that case, the computer reads a program written to make it operate as the speech recognition device 10 from a storage medium, or receives it over a network, and executes it, thereby achieving speech recognition with higher accuracy than conventional devices.

As described above, in the speech recognition device 10 of this embodiment, the acoustic model synthesis means 15 generates the synthesized acoustic model by mixing the speech data A, B, and C stored in the acoustic model storage memory 13 at a mixing ratio that depends on the signal state, and the speech recognition means 16 performs speech recognition processing on the input signal using the synthesized acoustic model. Speech recognition can therefore be achieved with higher accuracy than conventional devices even when the environment etc. in which speech is recognized changes.

Although the data used to create the acoustic models has been described as speech data in this embodiment, the present invention is not limited to this. The same effect is obtained with data such as the time waveform data, mel-cepstrum coefficients, delta mel-cepstrum coefficients, or delta log-power coefficients of the speech data used when the acoustic models were created.

When the acoustic models in this embodiment have the structure of a hidden Markov model, the same effect is also obtained by generating the synthesized acoustic model through applying the weighting factors to the transition probability or output probability data of the acoustic models A, B, and C described above.
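Applying the weighting factors to transition probabilities can be illustrated as a weighted sum of the three models' transition matrices. This is a sketch under assumptions: the models are taken to share the same state topology, the 2-state matrices are made-up values, and the function name `interpolate_transitions` is hypothetical. The text does not specify how output probabilities would be combined, so they are not covered here.

```python
def interpolate_transitions(models, weights):
    """Weighted sum of same-shaped transition matrices (lists of rows).

    If the weights sum to 1 and every input row sums to 1, every row of
    the result also sums to 1, so it is again a valid transition matrix.
    """
    names = list(models)
    size = len(models[names[0]])
    mixed = [[0.0] * size for _ in range(size)]
    for name in names:
        w = weights[name]
        for i in range(size):
            for j in range(size):
                mixed[i][j] += w * models[name][i][j]
    return mixed


# Made-up 2-state left-to-right transition matrices for models A, B, and C.
transitions = {
    "A": [[0.9, 0.1], [0.0, 1.0]],
    "B": [[0.5, 0.5], [0.0, 1.0]],
    "C": [[0.7, 0.3], [0.0, 1.0]],
}
mixed = interpolate_transitions(transitions, {"A": 0.3, "B": 0.6, "C": 0.1})
print(mixed)
```

Compared with retraining on mixed speech data, interpolating stored parameters avoids a training pass, which may matter on devices with limited processing time.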

(Second embodiment)
First, the configuration of the speech recognition device according to the second embodiment of the present invention will be described. Components that are the same as those of the speech recognition device 10 of the first embodiment are given the same reference numerals, and their description is omitted.

As shown in FIG. 3, the speech recognition device 20 of this embodiment includes acoustic model mixing ratio calculation means 14 that calculates the mixing ratio based on both the calculation result of the state occurrence frequency calculation means 12 and the speech recognition result of the speech recognition means 16.

次に、本実施の形態の音声認識装置２０の動作について説明する。ただし、本発明の第１の実施の形態の音声認識装置１０と同様な動作については説明を省略する。 Next, the operation of the speech recognition apparatus 20 according to the present embodiment will be described. However, description of operations similar to those of the speech recognition apparatus 10 according to the first exemplary embodiment of the present invention is omitted.

The acoustic model mixture ratio calculation means 14 receives the occurrence frequency calculated by the state occurrence frequency calculation means 12 and the speech recognition result output by the speech recognition means 16.

ここで、音声認識結果の確からしさを数値で表したスコアが所定値以上のときは、音響モデル混合比率計算手段１４は、混合比率の計算結果が入力信号の信号状態に適応していると判断し、本発明の第１の実施の形態で説明したように、状態発生頻度計算手段１２の計算結果に基づいて混合比率を計算する。 Here, when the score representing the probability of the speech recognition result as a numerical value is equal to or greater than a predetermined value, the acoustic model mixture ratio calculation means 14 determines that the calculation result of the mixture ratio is adapted to the signal state of the input signal. Then, as described in the first embodiment of the present invention, the mixing ratio is calculated based on the calculation result of the state occurrence frequency calculation means 12.

On the other hand, when the score of the speech recognition result is less than the predetermined value, the acoustic model mixture ratio calculation means 14 determines that the calculated mixing ratio is not adapted to the signal state of the input signal, and calculates a mixing ratio based on the change tendency of the occurrence frequency.

例えば、入力信号の信号状態が刻々と変化しているような場面では、混合比率の計算結果が入力信号の信号状態に適応しない場合が生じ、音声認識結果のスコアが低下することがある。そこで、音響モデル混合比率計算手段１４は、信号状態の発生頻度の時間的な変化傾向に基づき、音声認識結果のスコアが所定値以上になるよう混合比率を算出する。 For example, in a scene where the signal state of the input signal changes every moment, the calculation result of the mixing ratio may not be adapted to the signal state of the input signal, and the score of the speech recognition result may be lowered. Therefore, the acoustic model mixture ratio calculation means 14 calculates the mixture ratio based on the temporal change tendency of the occurrence frequency of the signal state so that the score of the speech recognition result becomes a predetermined value or more.

Specifically, suppose that at a certain time the occurrence frequencies of the signal states A, B, and C are 10%, 50%, and 40%, respectively, and that the signal state C has the largest temporal change tendency. If the score of the speech recognition result at this time is lower than the predetermined value, the acoustic model mixture ratio calculation means 14 calculates the mixing ratio with the occurrence frequencies of the signal states A, B, and C provisionally set to 10%, 40%, and 50%, respectively. That is, the acoustic model mixture ratio calculation means 14 calculates a mixing ratio in which the occurrence frequency of the signal state C, which has the largest temporal rate of change, is raised above its actual value, and thereby operates so as to raise the score of the speech recognition result for the input signal.
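The score-dependent behaviour described for the second embodiment can be sketched as follows. The specification gives only the example values (10%, 50%, 40% becoming 10%, 40%, 50%) and not the adjustment rule, so the rule used here is an assumption: a fixed amount `shift` is moved to the state with the largest signed trend from the most frequent of the remaining states. The names `adjust_frequencies`, `trends`, `threshold`, and `shift`, and the trend and score values, are hypothetical.

```python
def adjust_frequencies(freqs, trends, score, threshold, shift=0.10):
    """Return the occurrence frequencies to use for the mixing ratio.

    If the recognition score reaches the threshold, the observed
    frequencies are used unchanged. Otherwise the state whose frequency is
    rising fastest is provisionally boosted (assumed rule, see lead-in).
    """
    if score >= threshold:
        return dict(freqs)
    target = max(trends, key=trends.get)  # state with the largest trend
    donor = max((s for s in freqs if s != target), key=freqs.get)
    amount = min(shift, freqs[donor])     # never drive the donor below 0
    adjusted = dict(freqs)
    adjusted[target] += amount
    adjusted[donor] -= amount
    return adjusted


observed = {"A": 0.10, "B": 0.50, "C": 0.40}
trends = {"A": 0.00, "B": -0.02, "C": 0.08}  # hypothetical change per interval

kept = adjust_frequencies(observed, trends, score=0.8, threshold=0.6)
boosted = adjust_frequencies(observed, trends, score=0.3, threshold=0.6)
```

With the low score, `boosted` reproduces the provisional setting of the example: A stays at 10%, B drops to 40%, and C rises to 50%, while the total remains 100%.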

Note that the speech recognition apparatus 20 according to the present embodiment can also be configured so that the acoustic model mixture ratio calculation means 14 learns the relationship between the occurrence frequency and temporal change of the signal state and the score of the speech recognition result. With this configuration, even if the environment in which speech is recognized changes abruptly and the score of the speech recognition result drops temporarily, the speech recognition apparatus 20 can restore the score to the predetermined value or higher in a short time.

As described above, according to the speech recognition apparatus 20 of the present embodiment, the acoustic model mixture ratio calculation means 14 calculates the mixing ratio based on the calculation result of the state occurrence frequency calculation means 12 and the speech recognition result of the speech recognition means 16. A mixing ratio adapted to both the signal state of the input signal and the score of the speech recognition result can therefore be calculated, and speech recognition can be realized with higher accuracy than with conventional apparatuses.

In the present embodiment, calculation of the mixing ratio based on the change tendency of the occurrence frequency of the signal state and the score of the speech recognition result has been described. However, the present invention is not limited to this. For example, even when the score of the speech recognition result is less than the predetermined value in an environment where the signal state is almost constant, the acoustic model mixture ratio calculation means 14 can improve speech recognition performance by calculating the mixing ratio with reference to the score of the speech recognition result.

(Third Embodiment)
First, the configuration of the speech recognition apparatus according to the third embodiment of the present invention will be described. Components that are the same as those of the speech recognition apparatus 10 according to the first embodiment of the present invention are denoted by the same reference numerals, and their description is omitted.

図４に示すように、本実施の形態の音声認識装置３０は、音響モデル混合比率を入力する音響モデル混合比率入力手段３１を備えている。 As shown in FIG. 4, the speech recognition apparatus 30 according to the present embodiment includes an acoustic model mixture ratio input unit 31 that inputs an acoustic model mixture ratio.

The acoustic model mixture ratio input means 31 is composed of, for example, a switch as shown in FIG. 5(a) and a memory (not shown) that stores a table as shown in FIG. 5(b), and inputs the mixing ratio as a switch pattern given by the combination of positions of the switch levers S1 to S3.

For example, when the acoustic model mixture ratio input means 31 inputs the mixing ratio with the switch pattern "1, 1, 0" shown in FIG. 5(a), the acoustic model mixture ratio calculation means 14 sets, based on the table shown in FIG. 5(b), the mixing ratios of the speech data A, B, and C corresponding to the respective acoustic models to 0.4, 0.4, and 0.2, respectively.
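The switch-pattern lookup can be sketched as a small table keyed by the lever positions. Only the entry for the pattern "1, 1, 0" is taken from the text; the other two entries, the names `SWITCH_TABLE` and `ratios_from_switches`, and the error handling are hypothetical and stand in for the table of FIG. 5(b), whose full contents are not reproduced here.

```python
# Lever positions (S1, S2, S3) -> mixing ratios for speech data (A, B, C).
SWITCH_TABLE = {
    (1, 1, 0): (0.4, 0.4, 0.2),  # entry stated in the description
    (1, 0, 0): (1.0, 0.0, 0.0),  # hypothetical entry, for illustration only
    (0, 0, 1): (0.0, 0.0, 1.0),  # hypothetical entry, for illustration only
}


def ratios_from_switches(pattern):
    """Look up the mixing ratios for a switch pattern such as (1, 1, 0)."""
    try:
        ratio_a, ratio_b, ratio_c = SWITCH_TABLE[tuple(pattern)]
    except KeyError:
        raise ValueError(f"no table entry for switch pattern {pattern!r}") from None
    return {"A": ratio_a, "B": ratio_b, "C": ratio_c}


ratios = ratios_from_switches([1, 1, 0])
```

Keeping the table as data rather than code mirrors the described arrangement, in which the table is stored in memory and only the switch pattern changes at run time.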

次に、本実施の形態の音声認識装置３０の動作について説明する。ただし、本発明の第１の実施の形態の音声認識装置１０と同様な動作については説明を省略する。 Next, the operation of the speech recognition apparatus 30 according to the present embodiment will be described. However, description of operations similar to those of the speech recognition apparatus 10 according to the first exemplary embodiment of the present invention is omitted.

音響モデル混合比率入力手段３１によって、スイッチで設定されたスイッチパターン情報が読み出され、このスイッチパターン情報及びテーブル情報が音響モデル混合比率計算手段１４に出力される。 The switch pattern information set by the switch is read by the acoustic model mixture ratio input means 31, and the switch pattern information and table information are output to the acoustic model mixture ratio calculation means 14.

そして、音響モデル混合比率計算手段１４によって、スイッチパターン情報及びテーブル情報に基づいて各音響モデルに対応する音声データの混合比率が設定される。 Then, the acoustic model mixture ratio calculation means 14 sets the mixture ratio of the audio data corresponding to each acoustic model based on the switch pattern information and the table information.

As described above, according to the speech recognition apparatus 30 of the present embodiment, the acoustic model mixture ratio input means 31 inputs the acoustic model mixing ratio given by the switch pattern information set with the switch. The acoustic model mixing ratio can therefore be set easily, for example when the signal state of the input signal can be predicted, or in the initial state after power-on or after a reset.

In the present embodiment, an example in which the acoustic model mixture ratio input means 31 is composed of a switch and a memory has been described. However, the present invention is not limited to this. For example, the same effect can be obtained when the acoustic model mixture ratio input means 31 is composed of a keyboard or pointing device and a display or the like, and the acoustic model mixing ratio is entered on an input screen shown on the display.

(Fourth Embodiment)
First, the configuration of the speech recognition apparatus according to the fourth embodiment of the present invention will be described. Components that are the same as those of the speech recognition apparatus 20 according to the second embodiment of the present invention are denoted by the same reference numerals, and their description is omitted.

図６に示すように、本実施の形態の音声認識装置４０は、音声認識の対象である入力信号と切り替えて使用されるテストデータを発生するテストデータ発生手段４１を備えている。 As shown in FIG. 6, the speech recognition apparatus 40 of the present embodiment includes test data generation means 41 that generates test data that is used by switching to an input signal that is a target of speech recognition.

テストデータ発生手段４１は、入力信号として想定される様々な信号、例えば、ＳＮ比や騒音の種類等を設定した信号をテストデータとして発生するようになっている。 The test data generating means 41 generates various signals assumed as input signals, for example, signals in which the SN ratio, noise type, etc. are set as test data.

次に、本実施の形態の音声認識装置４０の動作について説明する。ただし、本発明の第２の実施の形態の音声認識装置２０と同様な動作については説明を省略する。 Next, the operation of the speech recognition apparatus 40 according to this embodiment will be described. However, description of operations similar to those of the speech recognition apparatus 20 according to the second exemplary embodiment of the present invention is omitted.

まず、テストデータ発生手段４１によって、例えば、様々なＳＮ比のテストデータが発生される。そして、このテストデータは、前述の音声認識装置２０と同様な手順で処理され、音声認識手段１６によって音声認識処理される。 First, the test data generating means 41 generates test data with various signal-to-noise ratios, for example. The test data is processed in the same procedure as the voice recognition device 20 described above, and voice recognition processing is performed by the voice recognition means 16.

次いで、音響モデル混合比率計算手段１４によって、音声認識結果のスコアとテストデータの信号状態の分類結果とに基づいて音響モデル混合比率が計算される。 Next, the acoustic model mixture ratio calculation unit 14 calculates the acoustic model mixture ratio based on the score of the speech recognition result and the classification result of the signal state of the test data.
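The specification does not say how test data with a given SN ratio are produced. One common approach, shown here purely as an assumed illustration, is to scale a noise signal so that the ratio of speech power to noise power matches the target value in decibels and then add it to the speech. The function names `mix_at_snr` and `measured_snr_db` and the synthetic signals are hypothetical.

```python
import math


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that the result has the target SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # Solve power(speech) / (gain**2 * power(noise)) == 10 ** (snr_db / 10).
    gain = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins for a speech signal and a noise signal.
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [0.3 if t % 2 == 0 else -0.3 for t in range(200)]

test_signals = {snr: mix_at_snr(speech, noise, snr) for snr in (0, 10, 20)}
```

Each generated signal could then be passed through classification and recognition in place of the input signal, and the resulting score recorded against the classified signal state.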

以上のように、本実施の形態の音声認識装置４０によれば、テストデータ発生手段４１は、入力信号として想定される様々なテストデータを発生する構成としたので、信号状態と音声認識結果のスコアとの関係を把握することができ、音声認識の性能の向上を図ることができる。 As described above, according to the speech recognition apparatus 40 of the present embodiment, the test data generation means 41 is configured to generate various test data assumed as input signals. The relationship with the score can be grasped, and the speech recognition performance can be improved.

As described above, the speech recognition apparatus according to the present invention has the effect of realizing speech recognition with higher accuracy than conventional apparatuses, and is useful as a speech recognition apparatus, a speech recognition method, a speech recognition program, and the like for recognizing speech.

FIG. 1 is a block diagram of the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram of acoustic model synthesis in the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 3 is a block diagram of the speech recognition apparatus according to the second embodiment of the present invention.
FIG. 4 is a block diagram of the speech recognition apparatus according to the third embodiment of the present invention.
FIG. 5(a) is a diagram showing an example of the switch constituting the acoustic model mixture ratio input means according to the third embodiment of the present invention, and FIG. 5(b) is a diagram showing an example of the table in which acoustic model mixing ratios are set according to the third embodiment of the present invention.
FIG. 6 is a block diagram of the speech recognition apparatus according to the fourth embodiment of the present invention.

Explanation of symbols

10, 20, 30, 40 Speech recognition apparatus
11 Signal state classification means
12 State occurrence frequency calculation means
13 Acoustic model storage memory
14 Acoustic model mixture ratio calculation means
15 Acoustic model synthesis means
16 Speech recognition means
31 Acoustic model mixture ratio input means
41 Test data generation means

Claims

1. A speech recognition apparatus comprising: signal state classification means for analyzing a feature of an input signal and classifying the signal state of the input signal into one of a plurality of predetermined signal states; state occurrence frequency calculation means for calculating the occurrence frequency of the signal state; acoustic model mixture ratio calculation means for calculating, based on the occurrence frequency, a mixing ratio for mixing the data used for each acoustic model when a plurality of predetermined acoustic models were created; acoustic model synthesis means for newly synthesizing an acoustic model by mixing the data at the mixing ratio; and speech recognition means for performing speech recognition processing on the input signal using the synthesized acoustic model.

2. The speech recognition apparatus according to claim 1, wherein the acoustic model mixture ratio calculation means increases the mixing ratio of the data used for each of the acoustic models corresponding to the plurality of signal states classified by the signal state classification means as the occurrence frequency increases.

3. The speech recognition apparatus according to claim 1, wherein the acoustic model mixture ratio calculation means calculates the mixing ratio based on a speech recognition result of the speech recognition means.

4. The speech recognition apparatus according to any one of claims 1 to 3, wherein the acoustic model mixture ratio calculation means calculates the mixing ratio based on a change tendency of the occurrence frequency when a score expressing the likelihood of the speech recognition result as a numerical value is equal to or less than a predetermined value.

5. The speech recognition apparatus according to claim 1, further comprising acoustic model mixture ratio input means for inputting the mixing ratio.

6. The speech recognition apparatus according to claim 1, further comprising test data generation means for generating test data for setting the mixing ratio.

7. The speech recognition apparatus according to any one of claims 1 to 6, wherein the data includes at least one of time waveform data of the speech data used when the acoustic model was created, mel-cepstrum coefficients, delta mel-cepstrum coefficients, and delta logarithmic power coefficients.

8. The speech recognition apparatus according to any one of claims 1 to 7, wherein the acoustic model has a hidden Markov model structure, and the acoustic model storage means stores the acoustic model as transition probability and output probability data defined by the hidden Markov model.

9. A speech recognition method comprising: analyzing a feature of an input signal and classifying the signal state of the input signal into one of a plurality of predetermined signal states; calculating, based on the occurrence frequency of the signal state, a mixing ratio for mixing the data used when predetermined acoustic models were created; mixing the data at the mixing ratio to synthesize an acoustic model; and performing speech recognition processing on the input signal using the synthesized acoustic model.

10. A speech recognition program causing a computer to execute: a step of analyzing a feature of an input signal; a step of classifying the signal state of the input signal into one of a plurality of predetermined signal states; a step of calculating, based on the occurrence frequency of the signal state, a mixing ratio for mixing the data used when predetermined acoustic models were created; a step of mixing the data at the mixing ratio to synthesize an acoustic model; and a step of performing speech recognition processing on the input signal using the synthesized acoustic model.