JP2003044079A

JP2003044079A - Device and method for recognizing voice, recording medium, and program

Info

Publication number: JP2003044079A
Application number: JP2001233323A
Authority: JP
Inventors: Katsuki Minamino; 活樹南野
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2001-08-01
Filing date: 2001-08-01
Publication date: 2003-02-14
Anticipated expiration: 2021-08-01
Also published as: JP4655184B2

Abstract

PROBLEM TO BE SOLVED: To improve recognition accuracy in the presence of background noise. SOLUTION: A voice signal input from a microphone 1 is supplied through an AD conversion unit 2 to a voice synthesis unit 34. The voice synthesis unit 34 reads a noise signal stored in a noise storage unit 33, superimposes it on the input voice signal, and supplies the superimposed voice signal to an acoustic analysis unit 3. The acoustic analysis unit 3 applies acoustic analysis processing to the input voice signal, extracts a feature vector as a feature quantity, and supplies it to a registration unit 11. The registration unit 11 reads acoustic models from an acoustic model database 5, connects them on the basis of an acoustic model network 12, determines the sequence of acoustic models (phoneme sequence) that gives the highest score for the input feature quantity, and registers that sequence in a language model database 6 as pronunciation information for the corresponding word.

Description

Detailed Description of the Invention

Technical Field of the Invention

[0001] The present invention relates to a voice recognition device and method, a recording medium, and a program, and more particularly to a voice recognition device and method, a recording medium, and a program that make it possible to improve recognition accuracy in the presence of background noise.

Description of the Related Art

[0002] In recent years, voice recognition devices have come to be used in many systems as man-machine interfaces and the like.

[0003] FIG. 1 shows the configuration of an example of such a voice recognition device.

[0004] Voice uttered by a user is input to a microphone 1, which converts the input voice into a voice signal as an electric signal. This voice signal is supplied to an AD (Analog Digital) conversion unit 2. The AD conversion unit 2 samples and quantizes the voice signal, which is an analog signal from the microphone 1, and converts it into voice data, which is a digital signal. This voice data is supplied to an acoustic analysis unit 3.

[0005] The acoustic analysis unit 3 applies acoustic analysis processing to the voice data from the AD conversion unit 2 frame by frame (at short time intervals), thereby extracting a feature vector as a feature quantity such as MFCC (Mel Frequency Cepstrum Coefficient), and supplies it to a recognition unit 4. The acoustic analysis unit 3 can also extract other feature quantities such as a spectrum, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs. Linear predictive analysis (LPC), the fast Fourier transform (FFT), band-pass filters (BPF), and the like are used for this analysis.
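The frame-by-frame analysis described above can be sketched as follows. This is a minimal illustration, not the analysis used in the patent: the frame length of 400 samples and hop of 160 samples are assumed values (25 ms and 10 ms at an assumed 16 kHz sampling rate), and the per-frame log energy is only a stand-in for a real feature vector such as MFCC.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Cut a sample sequence into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def log_energy(frame, floor=1e-10):
    """One scalar feature per frame: log of the mean squared amplitude."""
    power = sum(x * x for x in frame) / len(frame)
    return math.log(max(power, floor))  # floor avoids log(0) on silence


def extract_features(samples, frame_len=400, hop=160):
    """Return one feature value per frame (hypothetical stand-in for MFCC)."""
    return [log_energy(f) for f in split_into_frames(samples, frame_len, hop)]
```

A real front end would replace `log_energy` with a multi-dimensional feature computation; the framing loop is the part that corresponds to "frame by frame (at short time intervals)".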

[0006] Using the feature quantity sequence from the acoustic analysis unit 3, the recognition unit 4 recognizes the voice input to the microphone 1 (input voice) on the basis of, for example, the continuous-density HMM method, referring as necessary to an acoustic model database 5 and to a language model database 6 composed of word dictionary information and grammar rule information.

[0007] The acoustic model database 5 stores acoustic models representing the acoustic features of individual phonemes, syllables, and the like in the language of the voice to be recognized. Since voice recognition here is based on the continuous-density HMM method, an HMM (Hidden Markov Model), for example, is used as the acoustic model. The language model database 6 stores word dictionary information, which describes information on the pronunciation (phonological information) of each word (vocabulary item) to be recognized, and grammar rule information (a language model), which describes how the words registered in the word dictionary information chain (connect) together. As the grammar rules, rules based on, for example, a context-free grammar (CFG) or statistical word chain probabilities (N-gram) are used.

[0008] The recognition unit 4 refers to the word dictionary in the language model database 6 and connects acoustic models stored in the acoustic model database 5 to construct an acoustic model of a word (a word model). The recognition unit 4 further connects several word models by referring to the grammar rule information stored in the language model database 6 and, using the word models connected in this way, recognizes the voice input to the microphone 1 from the feature quantities by the continuous-density HMM method. That is, the recognition unit 4 detects the sequence of word models with the highest score (likelihood) of observing the time-series feature quantities output by the acoustic analysis unit 3, and outputs the word string corresponding to that sequence of word models as the voice recognition result.

[0009] In other words, for the word string corresponding to the connected word models, the recognition unit 4 accumulates the occurrence probabilities of the individual feature quantities, takes the accumulated value as the score, and outputs the word string that maximizes the score as the voice recognition result.

[0010] Specifically, when, for example, kana acoustic models whose units are the Japanese "a", "i", "u", "e", "o", "ka", ..., "n" are used from among the phoneme or syllable acoustic models stored in the acoustic model database 5, connecting them makes it possible to form various words and phrases such as "hai" (yes), "iie" (no), "ohayou" (good morning), and "ima nanji desu ka" (what time is it now?). It then becomes possible to calculate, for each of these, a score representing its similarity to the input feature quantities.

[0011] The information for connecting the acoustic models is the word dictionary information and the grammar rule information in the language model database 6. The word dictionary information specifies how acoustic models are connected to form each word to be recognized. The grammar rule information specifies how words are connected to one another. For example, to handle the sentence "(number) ji kara (number) ji made" ("from (number) o'clock to (number) o'clock"), the numbers "0 (zero)", "1 (ichi)", ..., "24 (nijuuyon)" and the words "ji" (o'clock), "kara" (from), and "made" (to) are first held, together with their kana readings, as word dictionary information, which gives the connection relationships of the kana-unit acoustic models. Next, the rule "(number)" + "ji" + "kara" + "(number)" + "ji" + "made" is held as grammar rule information, which gives the connection relationships of the words. By combining the word dictionary information and the grammar rule information, the similarity between the input feature quantities and each sentence, such as "from 1 o'clock to 2 o'clock" or "from 2 o'clock to 5 o'clock", can be calculated, and the sentence with the highest score can be output as the recognition result.
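The way the word dictionary information and the grammar rule information combine can be illustrated with a small sketch. The data layout here is an assumption made for illustration: `WORD_DICT`, `NUMBERS`, `GRAMMAR`, and the `<num>` placeholder are hypothetical names, and only three of the numbers are included to keep the output small.

```python
from itertools import product

# Word dictionary information: surface form -> kana reading (tiny subset).
WORD_DICT = {
    "1": "いち", "2": "に", "5": "ご",
    "時": "じ", "から": "から", "まで": "まで",
}
NUMBERS = ["1", "2", "5"]

# Grammar rule information: (number) + 時 + から + (number) + 時 + まで
GRAMMAR = ["<num>", "時", "から", "<num>", "時", "まで"]


def expand(grammar, numbers):
    """Enumerate every word string the grammar rule allows."""
    slots = [numbers if sym == "<num>" else [sym] for sym in grammar]
    return [list(words) for words in product(*slots)]


def reading(words, word_dict):
    """Concatenate kana readings: the order in which kana models are connected."""
    return "".join(word_dict[w] for w in words)


sentences = expand(GRAMMAR, NUMBERS)
```

Each entry of `sentences` is one candidate word string; a recognizer would score the input feature quantities against the kana-unit acoustic models connected in the order given by `reading` and keep the best-scoring candidate.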

[0012] Because this voice recognition device uses small units such as phonemes or syllables as its acoustic models, it can recognize a wide variety of words and phrases merely by changing the word dictionary information or the grammar rule information.

[0013] However, when constructing such a voice recognition device, information on the connection relationships of the acoustic models for each word, like a kana reading (hereinafter called pronunciation information), must be set in advance in the language model database 6 as word dictionary information. For example, if the user's own name is not registered in the language model database 6, the kana reading can be entered with a keyboard or the like and registered, but entering the kana takes effort. Moreover, when units that are seldom used in daily life, such as phonemes or phonetic symbols, rather than widely used units such as kana, serve as the acoustic models, entering the connection relationships (pronunciation information) without prior knowledge is extremely difficult.

[0014] One approach to the problem of how to enter pronunciation information for a new word is to use a phoneme typewriter. A phoneme typewriter performs phoneme recognition on input voice to estimate the corresponding phoneme sequence (pronunciation information). Units other than phonemes, such as syllables, may also be used as the unit of recognition.

[0015] The following describes how pronunciation information is acquired for a new word not included in the word dictionary information, and how new pronunciation information is acquired for a word already included in it.

[0016] FIG. 2 shows another configuration example, a voice recognition device with a registration function that uses a phoneme typewriter. In the figure, parts corresponding to those in FIG. 1 carry the same reference numerals, and their description is omitted below as appropriate to avoid repetition.

[0017] As the units of the acoustic model database 5, small units such as phonemes or syllables are used, for example phoneme HMMs with vowels and consonants as units, as shown in FIG. 3A. "sil" in FIG. 3A denotes an HMM that models silent portions. The language model database 6 is composed of word dictionary information and grammar rule information, and information on how to connect the phoneme HMMs (pronunciation information) is registered for each word included in the word dictionary information. For example, as shown in FIG. 3B, the pronunciation information "hai" is registered for the word "はい" (yes).

[0018] The acoustic analysis unit 3 extracts feature quantities from the input voice signal and supplies them to a registration unit 11. The registration unit 11 performs voice recognition processing using the acoustic model database 5 and an acoustic model network 12. The voice recognition processing connects acoustic models on the basis of the acoustic model network 12 and determines the sequence of acoustic models with the highest score for the input feature quantities.

[0019] As shown in FIG. 4, the acoustic model network 12 is a state transition network whose nodes are the acoustic models "a", "i", "u", ..., "N", and "sil", and it is configured so that it can generate any sequence of acoustic models, that is, any phoneme sequence (pronunciation information). For example, "hai" is generated by the state transitions that start at "START", pass through "h" via branch point 21, return from branch point 22 to branch point 21, pass through "a", return from branch point 22 to branch point 21, pass through "i" and branch point 22, and reach "END".
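The loop structure of this network can be sketched as a function that lists the nodes visited while generating a phoneme sequence. The labels `branch21` and `branch22` are hypothetical names for branch points 21 and 22, and the node set below is only a small assumed subset of the acoustic models.

```python
# Assumed subset of the acoustic model nodes in the network.
NODES = {"a", "i", "u", "e", "o", "h", "k", "N", "sil"}


def trace_path(phonemes, nodes=NODES):
    """Return the node sequence visited when the loop network emits phonemes."""
    path = ["START"]
    for p in phonemes:
        if p not in nodes:
            raise ValueError(f"no acoustic model for {p!r}")
        # Every phoneme is entered from branch point 21 and left via 22;
        # the jump from 22 back to 21 is what allows arbitrary sequences.
        path += ["branch21", p, "branch22"]
    path.append("END")
    return path


hai_path = trace_path(["h", "a", "i"])
```

For `["h", "a", "i"]` the result follows the transitions described for "hai": into "h" via branch point 21, back from 22 to 21 twice, and out to "END" after "i".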

[0020] The score is obtained by connecting the phoneme HMMs on the basis of the acoustic model network 12 and accumulating, over that network, the probability values of outputting the input feature quantities. For example, an accumulation method based on the Viterbi algorithm is used. This makes it possible to determine, for a given feature quantity sequence, the state transition sequence with the highest accumulated value. That is, the phoneme sequence (pronunciation information) with the highest score among all possible sequences of phoneme HMMs can be obtained.
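A much-simplified sketch of Viterbi decoding over a phoneme loop is shown below. It rests on several assumptions that are not in the text: each phoneme is reduced to a single state instead of a multi-state HMM, per-frame log-likelihoods are supplied directly as dictionaries instead of being computed from continuous densities, and the `stay_logp` and `switch_logp` transition scores are arbitrary illustrative values.

```python
import math


def viterbi_phoneme_loop(frame_loglik, phonemes,
                         stay_logp=math.log(0.6), switch_logp=math.log(0.4)):
    """Return (phoneme sequence, accumulated log score) of the best path.

    frame_loglik: one dict per frame mapping phoneme -> log-likelihood.
    """
    # best[p]: highest accumulated score of any path ending in phoneme p.
    best = {p: frame_loglik[0][p] for p in phonemes}
    backpointers = []
    for obs in frame_loglik[1:]:
        new_best, ptr = {}, {}
        for p in phonemes:
            # Arrive in p either by staying in p or by looping from another phoneme.
            score, prev = max(
                (best[q] + (stay_logp if q == p else switch_logp), q)
                for q in phonemes
            )
            new_best[p] = score + obs[p]
            ptr[p] = prev
        best = new_best
        backpointers.append(ptr)
    # Trace back from the best final phoneme to recover the state sequence.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    # Collapse runs of the same state into one phoneme.
    sequence = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return sequence, best[last]
```

Because runs of the same state are collapsed, this single-state simplification cannot represent a phoneme that occurs twice in a row; multi-state phoneme HMMs do not have that limitation.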

[0021] The pronunciation information obtained here is supplied to the language model database 6 and newly registered in its word dictionary information as the pronunciation information for the corresponding word. More than one piece of pronunciation information may be given to a single word.

[0022] In this way, the word dictionary information in the language model database 6 is updated as appropriate, and the updated word dictionary information is used in voice recognition processing. Acquiring pronunciation information therefore makes it possible to add new words to the system or to correct the pronunciation information of words the system already holds.

[0023] When pronunciation information has been correctly given to a word in this way, connecting the acoustic models according to that pronunciation information normally yields a high score for the corresponding voice.

Problems to be Solved by the Invention

[0024] However, when background noise or the like is added to the input voice, the feature quantities obtained by acoustic analysis are perturbed, so a high score is not necessarily obtained.

[0025] For example, when a phoneme sequence is estimated with a phoneme typewriter from a voice signal uttered in a quiet environment, that phoneme sequence no longer fits a voice signal to which background noise has been added. This leads to a lower recognition rate, which has been a problem.

[0026] The present invention has been made in view of this situation and makes it possible to improve recognition accuracy in the presence of background noise.

Means for Solving the Problems

[0027] The voice recognition device of the present invention comprises: acquisition means for acquiring background noise; synthesis means for combining input voice with the background noise acquired by the acquisition means; analysis means for acoustically analyzing the synthesized voice produced by the synthesis means and extracting feature quantities of that synthesized voice; estimation means for estimating pronunciation information on the basis of the feature quantities extracted by the analysis means; and registration means for registering the pronunciation information estimated by the estimation means as pronunciation information of the corresponding word.

[0028] The registration means may register a plurality of pieces of pronunciation information for a word.

[0029] The device may further comprise matching means for performing matching processing on the basis of the pronunciation information registered by the registration means.

[0030] The voice recognition method of the present invention includes: an acquisition step of acquiring background noise; a synthesis step of combining input voice with the background noise acquired by the processing of the acquisition step; an analysis step of acoustically analyzing the synthesized voice produced by the processing of the synthesis step and extracting feature quantities of that synthesized voice; an estimation step of estimating pronunciation information on the basis of the feature quantities extracted by the processing of the analysis step; and a registration step of registering the pronunciation information estimated by the processing of the estimation step as pronunciation information of the corresponding word.

[0031] The program on the recording medium of the present invention includes: an acquisition step of acquiring background noise; a synthesis step of combining input voice with the background noise acquired by the processing of the acquisition step; an analysis step of acoustically analyzing the synthesized voice produced by the processing of the synthesis step and extracting feature quantities of that synthesized voice; an estimation step of estimating pronunciation information on the basis of the feature quantities extracted by the processing of the analysis step; and a registration step of registering the pronunciation information estimated by the processing of the estimation step as pronunciation information of the corresponding word.

[0032] The program of the present invention causes a computer for a voice recognition device that performs voice recognition processing on input voice to execute: an acquisition step of acquiring background noise; a synthesis step of combining the input voice with the background noise acquired by the processing of the acquisition step; an analysis step of acoustically analyzing the synthesized voice produced by the processing of the synthesis step and extracting feature quantities of that synthesized voice; an estimation step of estimating pronunciation information on the basis of the feature quantities extracted by the processing of the analysis step; and a registration step of registering the pronunciation information estimated by the processing of the estimation step as pronunciation information of the corresponding word.

[0033] In the voice recognition device and method, recording medium, and program of the present invention, background noise is combined with input voice, the synthesized voice is acoustically analyzed, feature quantities of the synthesized voice are extracted, and pronunciation information estimated on the basis of those feature quantities is registered as pronunciation information of the corresponding word.

Embodiments of the Invention

[0034] FIG. 5 shows a configuration example of a voice recognition device to which the present invention is applied. In the figure, parts corresponding to those in FIG. 1 and FIG. 2 carry the same reference numerals, and their description is omitted below as appropriate to avoid repetition.

[0035] A control unit 32 controls the AD conversion unit 2 on the basis of a user instruction from an input unit 31, causing the digital voice data input to the AD conversion unit 2 to be output to either the acoustic analysis unit 3 or a voice synthesis unit 34.

[0036] When a voice input signal is input from the AD conversion unit 2, the voice synthesis unit 34 reads the noise signal stored in a noise storage unit 33, superimposes it on the input voice signal, and supplies the result to the acoustic analysis unit 3.
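Superimposing a stored noise signal on a voice signal can be sketched as sample-wise addition. The details below are assumptions for illustration only: samples are treated as 16-bit integers, the noise is looped when it is shorter than the voice signal, and `noise_gain` is a hypothetical scaling parameter that the text does not mention.

```python
def superimpose(speech, noise, noise_gain=1.0):
    """Add noise to speech sample by sample, clipping to the 16-bit range."""
    if not noise:
        return list(speech)
    mixed = []
    for i, s in enumerate(speech):
        n = noise[i % len(noise)]           # loop the noise if it is shorter
        v = int(round(s + noise_gain * n))  # sample-wise superimposition
        mixed.append(max(-32768, min(32767, v)))
    return mixed
```

Setting `noise_gain` below 1.0 would model a quieter noise condition; whether the device scales the stored noise at all is not stated.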

[0037] When a voice input signal is input from the voice synthesis unit 34, the acoustic analysis unit 3 extracts its feature quantities and supplies them to the recognition unit 4 or the registration unit 11.

[0038] The noise storage unit 33 stores the noise signal that the voice synthesis unit 34 superimposes. For example, when voice recognition is to be performed inside a moving car, only the noise during driving may be recorded and stored in advance, or, even for the same car, noise corresponding to various driving conditions, fan noise, and the like may be recorded and stored. In other words, this noise signal is estimated to some extent and stored in advance.

[0039] The pronunciation information registration processing of the voice recognition device is described with reference to the flowchart of FIG. 6.

[0040] Voice uttered by the user is input to the microphone 1, which converts it into a voice signal as an electric signal. In this example, registration is performed from utterances made in a stationary car, and voice recognition is performed while the car is moving. The noise storage unit 33 therefore stores driving noise (a noise signal) in advance.

[0041] In step S1, the AD conversion unit 2 receives the voice signal via the microphone 1.

[0042] In step S2, the control unit 32 determines, on the basis of a user instruction from the input unit 31, whether to superimpose a noise signal on the voice signal input to the AD conversion unit 2. If it determines that a noise signal is to be superimposed, it controls the AD conversion unit 2 so that the voice signal is supplied to the voice synthesis unit 34.

[0043] In step S3, the voice synthesis unit 34 reads the noise signal stored in the noise storage unit 33 and superimposes it on the voice signal input from the AD conversion unit 2. The voice synthesis unit 34 then supplies the superimposed voice signal to the acoustic analysis unit 3.

[0044] If it is determined in step S2 that no noise signal is to be superimposed on the voice signal, the AD conversion unit 2 supplies the voice signal to the acoustic analysis unit 3. In this case, the noise superimposition processing of step S3 is skipped.

[0045] In step S4, the acoustic analysis unit 3 applies acoustic analysis processing to the input voice signal frame by frame (at short time intervals), thereby extracting a feature vector as a feature quantity, and supplies it to the registration unit 11.

[0046] In step S5, the registration unit 11 reads acoustic models from the acoustic model database 5 and connects them on the basis of the acoustic model network 12. In step S6, the registration unit 11 determines, from the connected acoustic models, the sequence of acoustic models (pronunciation information) with the highest score for the feature quantities input from the acoustic analysis unit 3.

[0047] In step S7, the registration unit 11 registers the determined pronunciation information in the language model database 6 as the pronunciation information of the corresponding word.

[0048] In the above processing, a plurality of pieces of pronunciation information can be registered for a single word. It is also possible to generate and register two kinds of pronunciation information, one from the voice signal without a superimposed noise signal and one from the voice signal with it.
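The registration flow of steps S2 to S7, including registering both a clean and a noise-superimposed variant for one word, might be sketched as below. Everything here is a hypothetical illustration: the stages are passed in as functions, the `toy_` stand-ins exist only so the sketch runs, and the pronunciation "hae" is an invented example of how noise could change the estimated phoneme sequence, not a result from the text.

```python
def register_pronunciation(word, speech, noise, superimpose_noise,
                           mix, analyze, estimate, dictionary):
    """Run steps S2 to S7 once and return the registered pronunciation."""
    # S2/S3: route the signal through the synthesis stage only when requested.
    signal = mix(speech, noise) if superimpose_noise else speech
    # S4: acoustic analysis turns the signal into feature quantities.
    features = analyze(signal)
    # S5/S6: estimate the best-scoring phoneme sequence.
    pronunciation = estimate(features)
    # S7: append to the word's entry; one word may hold several pronunciations.
    entries = dictionary.setdefault(word, [])
    if pronunciation not in entries:
        entries.append(pronunciation)
    return pronunciation


# Toy stand-ins; real stages would be far more involved.
def toy_mix(speech, noise):
    return [s + n for s, n in zip(speech, noise)]


def toy_analyze(signal):
    return sum(abs(x) for x in signal)


def toy_estimate(features):
    return "hai" if features < 10 else "hae"


dictionary = {}
speech, noise = [1, 2, 3], [4, 4, 4]
for with_noise in (False, True):
    register_pronunciation("はい", speech, noise, with_noise,
                           toy_mix, toy_analyze, toy_estimate, dictionary)
```

After the loop, the single word entry holds one pronunciation from the clean signal and one from the noise-superimposed signal. Skipping duplicates on registration is a design choice of this sketch.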

[0049] As described above, the pronunciation information registration processing using the noise storage unit 33 is performed not only for new words not included in the word dictionary information of the language model database 6 but also, in the same way, for words already included in the word dictionary information. Pronunciation information that takes background noise into account can thereby be registered.

[0050] The word dictionary information of the language model database 6 updated in this way is used in the voice recognition processing described next.

[0051] The voice recognition processing of the voice recognition device is described with reference to the flowchart of FIG. 7.

[0052] In step S21, the AD conversion unit 2 supplies the voice signal input via the microphone 1 to the acoustic analysis unit 3.

[0053] In step S22, the acoustic analysis unit 3 applies acoustic analysis processing to the input voice signal frame by frame (at short time intervals), thereby extracting a feature vector as a feature quantity, and supplies it to the recognition unit 4.

[0054] In step S23, the recognition unit 4 refers, on the basis of the input feature quantity sequence, to the word dictionary information in the language model database 6 and connects acoustic models stored in the acoustic model database 5 to construct acoustic models of words (word models).

[0055] In step S24, for the word strings corresponding to the connected acoustic models, the recognition unit 4 accumulates the occurrence probabilities of the individual feature quantities, takes the accumulated value as the score, and outputs the word string that maximizes the score as the voice recognition result.
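Steps S23 and S24 can be sketched for isolated words with several registered pronunciations each. This is a simplified illustration under stated assumptions: each phoneme is a single state, per-frame log-likelihoods are given directly as dictionaries, transition probabilities are omitted, and `alignment_score` and `recognize` are hypothetical names.

```python
NEG_INF = float("-inf")


def alignment_score(frame_loglik, phones):
    """Best accumulated log score of aligning frames left to right to phones."""
    # score[j]: best score with the current frame assigned to phones[j].
    score = [NEG_INF] * len(phones)
    score[0] = frame_loglik[0][phones[0]]
    for obs in frame_loglik[1:]:
        new = [NEG_INF] * len(phones)
        for j, p in enumerate(phones):
            # Stay in the same phoneme or advance from the previous one.
            prev = score[j] if j == 0 else max(score[j], score[j - 1])
            if prev > NEG_INF:
                new[j] = prev + obs[p]
        score = new
    return score[-1]  # the path must end in the last phoneme


def recognize(frame_loglik, dictionary):
    """Return (word, score) for the best-scoring pronunciation of any word."""
    best_word, best_score = None, NEG_INF
    for word, pronunciations in dictionary.items():
        for phones in pronunciations:
            s = alignment_score(frame_loglik, phones)
            if s > best_score:
                best_word, best_score = word, s
    return best_word, best_score
```

Because every registered pronunciation of a word is scored, a word is recognized when any one of its pronunciations, for example one registered from noise-superimposed voice, fits the input best.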

[0056] As described above, the word dictionary information of the language model database 6 contains pronunciation information that takes the noise signal into account, obtained by the processing of FIG. 6, so a drop in the recognition rate can be suppressed when voice recognition is performed in the presence of background noise.

[0057] The results of an experiment that actually evaluated voice recognition using the voice recognition device of the present invention are described next.

[0058] Data were recorded in a quiet environment in which five men and five women each uttered a predetermined set of 100 words three times. The first two recordings were made on the same day and were used to register pronunciation information. The last recording was made one month later and was used to evaluate voice recognition.

[0059] In this example, the acoustic models were three-state phoneme HMMs created for 29 phonemes, taking into account the dependence on the preceding and following phoneme context. MFCC was used for the acoustic analysis in the acoustic analysis unit 3. In addition, noise removal known as spectral subtraction was applied to the input voice signal.
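For readers unfamiliar with spectral subtraction, the idea can be sketched as below. The text does not describe the variant used in the experiment, so everything here is an assumption: the sketch works on magnitude spectra that have already been computed, takes a fixed noise estimate, and uses a hypothetical `floor` parameter to keep magnitudes positive.

```python
def spectral_subtraction(frames, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from each frame.

    frames    -- list of magnitude spectra (one list of bins per frame)
    noise_mag -- estimated noise magnitude for each bin
    floor     -- fraction of the original magnitude kept as a lower bound
                 (hypothetical parameter, not taken from the text)
    """
    cleaned = []
    for frame in frames:
        cleaned.append(
            [max(m - n, floor * m) for m, n in zip(frame, noise_mag)]
        )
    return cleaned


# Illustrative values only: two frames with two frequency bins each.
frames = [[1.0, 2.0], [0.25, 3.0]]
noise_mag = [0.5, 1.0]
cleaned = spectral_subtraction(frames, noise_mag)
```

The floor prevents bins from going negative when the noise estimate exceeds the observed magnitude, which would otherwise produce invalid spectra.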

[0060] FIG. 8A shows the results of a voice recognition evaluation performed using word dictionary information in the language model database 6 consisting of the above 100 words. The evaluation data are the third utterances, which were recorded in a quiet environment. The recognition rate is the average over the 10 speakers.

[0061] The "reading kana" recognition rate is the rate obtained when pronunciation information is determined from the reading kana; its value was 99.30%. This is the result of voice recognition using the pronunciation information of acoustic models derived from the reading kana in the language model database 6 (the recognition method described with reference to FIG. 1). That is, the registration unit 11 and the voice synthesis unit 34 (noise storage unit 33) were not used.

[0062] The "one utterance" recognition rate is the rate obtained when only one piece of pronunciation information, taken from the first utterance, is registered; its value was 99.10%. The "two utterances" recognition rate is the rate obtained when two pieces of pronunciation information are registered, one obtained from the first utterance and one from the second; its value was 99.50%. These are the results of voice recognition using pronunciation information registered with the phoneme typewriter (the recognition method described with reference to FIG. 2). That is, the voice synthesis unit 34 (noise storage unit 33) was not used.

[0063] These results show that registering two pieces of pronunciation information obtained from two utterances with the phoneme typewriter yields almost the same recognition rate as determining the pronunciation information from the reading kana.

[0064] In the following, whenever pronunciation information is registered, it is assumed that two pieces of pronunciation information obtained from the first two utterances are registered.

[0065] FIG. 8B shows the results of a voice recognition evaluation performed using word dictionary information in the language model database 6 consisting of the above 100 words. The evaluation data are the third utterances with vehicle running noise superimposed. The running noise comprises seven kinds of in-vehicle noise that differ in vehicle type, running speed, road surface condition, and so on. Including the case in which no noise is superimposed, voice recognition was evaluated in a total of eight environments. The recognition rate is therefore the average over the 10 speakers across the eight environments.

[0066] The recognition rate was 92.34% for "reading kana", 92.15% for "conventional", 94.88% for "Invention 1", and 95.22% for "Invention 2".

[0067] Here, "reading kana" denotes the case in which pronunciation information is determined from the reading kana, and "conventional" denotes the case in which pronunciation information is registered using only the phoneme typewriter. "Invention 1" denotes the case in which, in addition to the pronunciation information registered with the phoneme typewriter, pronunciation information that takes into account the background noise stored in the noise storage unit 33 is registered as described above. "Invention 2" denotes the case in which the pronunciation information obtained from the reading kana is used in addition to the pronunciation information of "Invention 1". Two utterances are used to obtain the pronunciation information both with the phoneme typewriter and with background noise. Therefore, as shown in FIG. 9 for example, the number of pronunciation entries per word is one for "reading kana" ("b e N ch i"); two for "conventional" (two of the four entries of "Invention 1"); four for "Invention 1" ("h b e m u ch i i", "pr d e u ch i", "b e r i N g i", and "p e N ch i j"); and five for "Invention 2" (the reading kana entry plus the entries of "Invention 1").

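The per-word entry counts described for FIG. 9 can be pictured as a dictionary that maps each word to a list of phoneme strings. The sketch below is only an illustration of that data structure: the phoneme strings are the ones quoted above, while the dictionary key `"bench"` and the helper `build_lexicon` are hypothetical, and the text does not say which two of the four "Invention 1" entries make up the "conventional" case, so that case is omitted.

```python
# Phoneme strings quoted in the description of FIG. 9.
READING_KANA = ["b e N ch i"]
INVENTION_1 = [
    "h b e m u ch i i",
    "pr d e u ch i",
    "b e r i N g i",
    "p e N ch i j",
]


def build_lexicon(word, *sources):
    """Merge pronunciation lists for one word, dropping exact duplicates."""
    entries = []
    for source in sources:
        for pron in source:
            if pron not in entries:
                entries.append(pron)
    return {word: entries}


# "bench" is a hypothetical key; the text does not name the word.
lexicon_reading = build_lexicon("bench", READING_KANA)
lexicon_inv1 = build_lexicon("bench", INVENTION_1)
lexicon_inv2 = build_lexicon("bench", READING_KANA, INVENTION_1)
```

With this layout, `len(lexicon_inv1["bench"])` is 4 and `len(lexicon_inv2["bench"])` is 5, matching the counts given for "Invention 1" and "Invention 2".
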
[0068] Thus, in the environments that include in-vehicle noise, the recognition rate with "reading kana" falls from 99.30% in the previous result to 92.34%. With "conventional", which does not take background noise into account, the recognition rate likewise falls to 92.15%.

[0069] In contrast, the recognition rate improves to 94.88% with "Invention 1", which takes background noise into account, and to 95.22% with "Invention 2".
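One way to read these figures is in terms of error rate rather than recognition rate. The sketch below computes the relative reduction in error rate from the FIG. 8B recognition rates quoted above. The derived percentages are not stated in the source; they are plain arithmetic on its numbers, and the function names are hypothetical.

```python
def error_rate(recognition_rate):
    """Convert a recognition rate in percent to an error rate in percent."""
    return 100.0 - recognition_rate


def relative_error_reduction(baseline_rate, improved_rate):
    """Relative reduction in error rate, in percent of the baseline error."""
    base = error_rate(baseline_rate)
    return (base - error_rate(improved_rate)) / base * 100.0


# Recognition rates for FIG. 8B as given in the text.
rates_8b = {
    "reading kana": 92.34,
    "conventional": 92.15,
    "Invention 1": 94.88,
    "Invention 2": 95.22,
}

# About 34.8%: errors fall from 7.85% to 5.12%.
inv1_vs_conventional = relative_error_reduction(
    rates_8b["conventional"], rates_8b["Invention 1"]
)
# About 37.6%: errors fall from 7.66% to 4.78%.
inv2_vs_reading = relative_error_reduction(
    rates_8b["reading kana"], rates_8b["Invention 2"]
)
```

Expressed this way, a gain of under three percentage points in recognition rate corresponds to removing roughly a third of the recognition errors.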

[0070] In particular, "Invention 2" adds pronunciation information that takes background noise into account to word dictionary information in the language model database 6 whose pronunciation information was determined in advance from the reading kana. This shows that applying the present invention can improve the recognition rate even for words already registered in the word dictionary information.

[0071] FIG. 8C shows the results of a voice recognition evaluation performed using word dictionary information in the language model database 6 consisting of 5075 words, including the 100 evaluation words. The evaluation data are the third utterances, and, as in FIG. 8B, voice recognition was evaluated in eight noise environments. The recognition rate is therefore the average over the 10 speakers across the eight environments.

[0072] The recognition rate was 71.28% for "reading kana" and 86.80% for "present invention".

[0073] Here, "present invention" denotes the word dictionary information of the 5075-word language model database 6, whose pronunciation information was determined from the reading kana, with pronunciation information that takes background noise into account added for the 100 evaluation words.

[0074] This result also shows that applying the present invention can improve the recognition rate even for words already registered in the word dictionary information.

[0075] In the above, two utterances were used to obtain the pronunciation information, but a single utterance, or two or more utterances, may be used.

[0076] For a single voice signal, it is also possible to prepare two versions, one with the noise signal superimposed and one without, and to register pronunciation information for each. This amounts to the registration unit 11 performing the registration process twice for one utterance. That is, for one utterance, two voice signals, one that passes through the voice synthesis unit 34 and one that does not, are processed by the registration unit 11. Accordingly, to superimpose and register several kinds of noise, for example, the registration process of the registration unit 11 is performed multiple times.
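The repeated registration described in this paragraph can be sketched as a loop over signal variants. This is a simplified illustration under stated assumptions: signals are plain lists of samples, noise is added sample by sample and repeated if it is shorter than the speech, and `estimate_pronunciation` is a hypothetical stand-in for the acoustic analysis and pronunciation estimation performed by the device.

```python
def superimpose(speech, noise):
    """Add noise to speech sample by sample, repeating the noise if shorter."""
    if not noise:
        return list(speech)
    return [s + noise[i % len(noise)] for i, s in enumerate(speech)]


def register_pronunciations(speech, noises, estimate_pronunciation):
    """Run registration once for the clean signal and once per noise signal.

    `estimate_pronunciation` is a caller-supplied function (hypothetical here)
    that maps a signal to a pronunciation string.
    """
    variants = [list(speech)] + [superimpose(speech, n) for n in noises]
    pronunciations = []
    for signal in variants:
        pron = estimate_pronunciation(signal)
        if pron not in pronunciations:  # keep each distinct result once
            pronunciations.append(pron)
    return pronunciations


# Stub estimator for illustration only; a real one would decode phonemes.
def stub_estimator(signal):
    return "loud" if max(signal) > 1.0 else "quiet"


result = register_pronunciations(
    [0.1, 0.2, 0.3], [[1.0], [0.0, 0.0]], stub_estimator
)
```

Each additional noise signal adds one pass through the registration process, which is why registering several noises means running it several times for the same utterance.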

[0077] The above description used noise from an in-vehicle driving environment, but noise in various other environments, such as offices, airplanes, and trains, can also be handled. In the case of voice recognition for a robot, the motor noise generated when the robot moves or the friction noise against the floor generated when it walks can also be handled.

[0078] Furthermore, in the present embodiment, the registration unit 11 performs the registration process when pronunciation information is registered, and the recognition unit 4 performs the recognition process when recognition is performed. However, the pronunciation information can also be determined while the recognition unit 4 performs the recognition process. In that case, the word dictionary information in the language model database 6 is updated as necessary; that is, new words are added and pronunciation information is added for words already registered in the dictionary.

[0079] The above description covered a method of determining pronunciation information from input voice, but this method can be combined with other methods, such as registering pronunciation information through another input means such as a keyboard.

[0080] The series of processes described above can be executed by hardware, but can also be executed by software. In the latter case, the device is configured, for example, as the voice recognition device 50 shown in FIG. 10.

[0081] In FIG. 10, a CPU (Central Processing Unit) 51 executes various processes in accordance with a program stored in a ROM (Read Only Memory) 52 or a program loaded from a storage unit 58 into a RAM (Random Access Memory) 53. The RAM 53 also stores, as appropriate, data that the CPU 51 needs in order to execute the various processes.

[0082] The CPU 51, the ROM 52, and the RAM 53 are connected to one another via a bus 54. An input/output interface 55 is also connected to the bus 54.

[0083] Connected to the input/output interface 55 are an input unit 56 comprising a keyboard, a mouse, and the like; an output unit 57 comprising a display such as a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display), a speaker, and the like; a storage unit 58 comprising a hard disk or the like; and a communication unit 59 comprising a modem, a terminal adapter, or the like. The communication unit 59 performs communication processing via a network (not shown).

[0084] A drive 60 is also connected to the input/output interface 55 as necessary. A magnetic disk 61, an optical disk 62, a magneto-optical disk 63, a semiconductor memory 64, or the like is mounted on the drive as appropriate, and a computer program read from it is installed in the storage unit 58 as necessary.

[0085] When the series of processes is executed by software, the program constituting the software is installed, from a network or a recording medium, onto a computer built into dedicated hardware or onto, for example, a general-purpose personal computer that can execute various functions once various programs are installed.

[0086] As shown in FIG. 10, this recording medium is constituted not only by package media on which the program is recorded and which are distributed separately from the device body to provide the program to users, namely the magnetic disk 61 (including a flexible disk), the optical disk 62 (including a CD-ROM (Compact Disk-Read Only Memory) and a DVD (Digital Versatile Disk)), the magneto-optical disk 63 (including an MD (Mini-Disk) (trademark)), or the semiconductor memory 64, but also by the ROM 52 or the hard disk included in the storage unit 58, on which the program is recorded and which are provided to users already built into the device body.

[0087] In this specification, the steps describing the program recorded on the recording medium include not only processes performed in time series in the described order, but also processes that are executed in parallel or individually without necessarily being processed in time series.

[0088]

[Effects of the Invention] As described above, according to the voice recognition device and method, the recording medium, and the program of the present invention, background noise is synthesized with the input voice, the synthesized voice is acoustically analyzed to extract its feature amount, and the pronunciation information estimated on the basis of the feature amount is registered as the pronunciation information of the corresponding word. Recognition accuracy in the presence of background noise can therefore be improved.

[Brief description of drawings]

FIG. 1 is a block diagram showing a configuration example of a conventional voice recognition device.

FIG. 2 is a block diagram showing another configuration example of a conventional voice recognition device.

FIG. 3 is a diagram illustrating the acoustic models and pronunciation information of the voice recognition device of FIG. 2.

FIG. 4 is a diagram illustrating the acoustic model network of the voice recognition device of FIG. 2.

FIG. 5 is a block diagram showing a configuration example of a voice recognition device to which the present invention is applied.

FIG. 6 is a flowchart illustrating the pronunciation information registration process of the voice recognition device of FIG. 5.

FIG. 7 is a flowchart illustrating the voice recognition process of the voice recognition device of FIG. 5.

FIG. 8 is a diagram showing the experimental results of voice recognition using the voice recognition device of FIG. 5.

FIG. 9 is a diagram showing an example of the pronunciation information of the voice recognition device of FIG. 5.

FIG. 10 is a block diagram showing another configuration example of a voice recognition device to which the present invention is applied.

[Explanation of symbols]

3 acoustic analysis unit, 4 recognition unit, 5 acoustic model database, 6 language model database, 11 registration unit, 12 acoustic model network, 33 noise storage unit, 34 voice synthesis unit

Claims

[Claims]

1. A voice recognition device that performs voice recognition processing for recognizing an input voice, comprising: acquisition means for acquiring background noise; synthesis means for synthesizing the background noise acquired by the acquisition means with the input voice; analysis means for acoustically analyzing the synthesized voice synthesized by the synthesis means and extracting a feature amount of the synthesized voice; estimation means for estimating pronunciation information on the basis of the feature amount extracted by the analysis means; and registration means for registering the pronunciation information estimated by the estimation means as the pronunciation information of a corresponding word.

2. The voice recognition device according to claim 1, wherein the registration means registers a plurality of pieces of the pronunciation information for the word.

3. The voice recognition device according to claim 1, further comprising matching means for performing matching processing on the basis of the pronunciation information registered by the registration means.

4. A voice recognition method for a voice recognition device that performs voice recognition processing for recognizing an input voice, comprising: an acquisition step of acquiring background noise; a synthesis step of synthesizing the background noise acquired by the processing of the acquisition step with the input voice; an analysis step of acoustically analyzing the synthesized voice synthesized by the processing of the synthesis step and extracting a feature amount of the synthesized voice; an estimation step of estimating pronunciation information on the basis of the feature amount extracted by the processing of the analysis step; and a registration step of registering the pronunciation information estimated by the processing of the estimation step as the pronunciation information of a corresponding word.

5. A recording medium on which a computer-readable program is recorded, the program being for a voice recognition device that performs voice recognition processing for recognizing an input voice and comprising: an acquisition step of acquiring background noise; a synthesis step of synthesizing the background noise acquired by the processing of the acquisition step with the input voice; an analysis step of acoustically analyzing the synthesized voice synthesized by the processing of the synthesis step and extracting a feature amount of the synthesized voice; an estimation step of estimating pronunciation information on the basis of the feature amount extracted by the processing of the analysis step; and a registration step of registering the pronunciation information estimated by the processing of the estimation step as the pronunciation information of a corresponding word.

6. A program that causes a computer of a voice recognition device that performs voice recognition processing for recognizing an input voice to execute: an acquisition step of acquiring background noise; a synthesis step of synthesizing the background noise acquired by the processing of the acquisition step with the input voice; an analysis step of acoustically analyzing the synthesized voice synthesized by the processing of the synthesis step and extracting a feature amount of the synthesized voice; an estimation step of estimating pronunciation information on the basis of the feature amount extracted by the processing of the analysis step; and a registration step of registering the pronunciation information estimated by the processing of the estimation step as the pronunciation information of a corresponding word.