JPS63303395A

JPS63303395A - Voice recognition equipment with multi-amplification function

Info

Publication number: JPS63303395A
Application number: JP62138240A
Authority: JP
Inventors: 羽金　廣
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1987-06-03
Filing date: 1987-06-03
Publication date: 1988-12-09

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a speech recognition device that recognizes uttered speech, and in particular to a speech recognition device with a multi-amplification function that reliably extracts phoneme-level features from portions of speech with a low utterance level.

[Conventional technology]

Conventionally, in this type of speech recognition device, the amplification degree of the amplifier that amplifies the speech signal from the microphone is fixed before registration (a process performed in speaker-dependent speech recognition devices) or recognition. Either the speaker sets it with a volume control or the like, or the device sets it automatically according to the level of a test utterance by the speaker (this process is hereinafter called level setting). Speech recognition is then performed on the speech signal amplified with that amplification degree.

[Problem that the invention seeks to solve]

In the conventional speech recognition device described above, registration and recognition are performed after level setting, and the amplification degree determined by level setting is not changed during the subsequent registration or recognition processing. The speech signal from the microphone is therefore amplified with a constant amplification degree. As a result, in low-volume sections of the utterance, especially consonant sections, recognition has to be performed on a speech signal that is insufficiently amplified, so its quantization after A/D conversion is coarser than in high-volume sections.
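
The quantization argument can be illustrated numerically. The following is only a sketch under assumptions that do not appear in the source: an 8-bit converter with a full scale of ±1.0, a sine wave as the test signal, and the hypothetical helper names `quantize` and `snr_db`.

```python
import math

def quantize(x, bits=8, full_scale=1.0):
    """Uniform A/D quantization with clipping (bit depth is an assumption)."""
    levels = 2 ** (bits - 1)
    step = full_scale / levels
    code = round(x / step)
    code = max(-levels, min(levels - 1, code))  # converter output saturates
    return code * step

def snr_db(amplitude, bits=8, n=1000):
    """Signal-to-quantization-noise ratio of a sine wave with the given amplitude."""
    signal = [amplitude * math.sin(2 * math.pi * 7 * i / n) for i in range(n)]
    noise = [s - quantize(s, bits) for s in signal]
    p_signal = sum(s * s for s in signal)
    p_noise = sum(e * e for e in noise)
    return 10 * math.log10(p_signal / p_noise)

loud = snr_db(0.9)    # vowel-like section that nearly fills the converter range
quiet = snr_db(0.02)  # consonant-like section at the same fixed amplification
```

With a fixed amplification degree, `quiet` comes out well below `loud`, which is the coarseness the passage describes for consonant sections.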

[Means for solving problems]

When extracting phoneme-level features from uttered speech and recognizing it, the speech recognition device with a multi-amplification function of the present invention reads from a storage unit the speech signal whose amplification degree is optimal for the time interval from which the phoneme-level features are to be extracted. It then recognizes the whole utterance while extracting phoneme-level features from that signal.

[Effect]

Because the present invention can extract features from a speech signal amplified at the optimum level, the recognition rate can be improved.

[Example]

FIG. 1 is a block diagram showing an embodiment of a speech recognition device with a multi-amplification function according to the present invention. In the figure, MC is a microphone, and A1 to A5 are amplifiers with mutually different amplification degrees a1 to a5. C1 to C5 are A/D converters that convert the input analog speech signals into digital speech signals and output them. SD and ED are a start detection unit and an end detection unit, which detect the start and the end of the speech signal and output a start detection signal S1 and an end detection signal S2, respectively. M1 to M5 are storage units that begin storing when the start detection signal S1 is input, store the digital speech signals output from the A/D converters C1 to C5, and stop storing when the end detection signal S2 is input. SE is an amplification degree selection unit that selects, from the storage units M1 to M5, the speech signal stored with the optimum amplification degree for a frame interval and outputs it as speech signal V.

RC is a speech recognition unit. It outputs the volume level of each frame interval as a volume level signal L. It also divides the received speech signal V into frames of, for example, several ms each, extracts phoneme-level features from each frame, matches them against standard frame patterns, and outputs a speech recognition result signal T with phoneme labels attached. FIG. 2 shows the volume change of the uttered speech "asa" (morning) divided into frames F1 to F15.

The speech recognition unit RC uses, for example, vowels, fricatives, nasals, and plosives as phonemic features. A standard frame pattern for each of these is registered in advance. At recognition time, each frame is matched against the standard frame patterns and is given the phoneme label name that indicates the attribute of the most similar standard frame pattern.
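
The per-frame labeling step can be sketched as a nearest-pattern lookup. The source does not specify the feature representation or the similarity measure, so the two-dimensional feature vectors, the pattern values, the negative squared Euclidean distance, and the name `label_frame` below are all assumptions made for illustration.

```python
def label_frame(features, standard_patterns):
    """Return the label of the standard frame pattern most similar to `features`.

    Similarity is taken as negative squared Euclidean distance, which is an
    assumption; the source does not name a similarity measure.
    """
    def similarity(pattern):
        return -sum((f - p) ** 2 for f, p in zip(features, pattern))
    return max(standard_patterns, key=lambda label: similarity(standard_patterns[label]))

# Hypothetical two-dimensional standard frame patterns, registered in advance.
STANDARD_PATTERNS = {
    "vowel": (0.9, 0.1),
    "fricative": (0.2, 0.8),
    "nasal": (0.5, 0.2),
    "plosive": (0.1, 0.3),
}

# Hypothetical features extracted from three consecutive frames.
frame_features = [(0.85, 0.15), (0.25, 0.75), (0.8, 0.1)]
labels = [label_frame(f, STANDARD_PATTERNS) for f in frame_features]
```

Each frame receives exactly one label, so the output is a label sequence of the same length as the frame sequence.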

Next, the operation of the speech recognition device with the multi-amplification function configured as above is described. First, the speech signal input from the microphone MC is amplified by the amplifiers A1 to A5, which have mutually different amplification degrees a1 to a5, and is then input to the A/D converters C1 to C5. The A/D converters C1 to C5 convert the input analog speech signals into digital speech signals and output them to the storage units M1 to M5. Meanwhile, the start detection unit SD and the end detection unit ED output the start detection signal S1 and the end detection signal S2 of the speech signal to the storage units M1 to M5. The storage units M1 to M5 therefore store the digital speech signals output from the A/D converters C1 to C5 from the time the start detection signal S1 is input until the end detection signal S2 is input.
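
The parallel capture path can be sketched as follows. This is a simplified software model, not the circuit of the embodiment: the amplification values in `GAINS`, the 8-bit converters, and the representation of S1 and S2 as sample indices `start` and `end` are assumptions.

```python
def capture(samples, gains, start, end, bits=8):
    """Store every amplifier channel between the detected start and end.

    `samples` are analog values in [-1.0, 1.0]; `start` and `end` stand in for
    the start and end detection signals S1 and S2. Returns one list of integer
    codes per gain, playing the role of the storage units.
    """
    levels = 2 ** (bits - 1)
    memories = []
    for gain in gains:
        channel = []
        for x in samples[start:end]:
            code = round(x * gain * levels)
            channel.append(max(-levels, min(levels - 1, code)))  # converter clips
        memories.append(channel)
    return memories

GAINS = [1, 2, 4, 8, 16]  # hypothetical values for a1..a5
samples = [0.0, 0.01, 0.05, 0.5, 0.9, 0.05, 0.0]
memories = capture(samples, GAINS, start=1, end=6)
```

In this model the highest-gain channel resolves the quiet samples with more codes but saturates on the loud ones, while the lowest-gain channel does the opposite.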

The speech recognition unit RC then reads the speech signals from the storage units M1 to M5, extracts phoneme-level features from them, and recognizes the speech from start to end. First, as shown for example in FIG. 2, the uttered speech "asa" (morning) is divided into frames of several ms each, phoneme-level features are extracted from each frame, and a phoneme label is attached to each frame. Specifically, the speech recognition unit RC outputs a volume level signal L, derived from the volume level of frame F1, to the amplification degree selection unit SE. The amplification degree selection unit SE then selects, from the storage units M1 to M5, the speech signal stored with the optimum amplification degree for the interval of frame F1 and outputs it as speech signal V. The speech recognition unit RC extracts features from the input speech signal V, matches them against the standard frame patterns, and attaches a phoneme label. Frames F2 to F15 are labeled in the same way.
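
The selection step can be sketched as below. The source says only that the amplification degree is "optimum" for the frame, so the criterion used here (the largest gain that keeps the frame peak under 90% of converter full scale) is an assumption, as are the values in `GAINS` and the names `select_gain` and `build_signal`.

```python
GAINS = [1, 2, 4, 8, 16]  # hypothetical a1..a5, in ascending order

def select_gain(volume_level, gains=GAINS, limit=0.9):
    """Index of the largest gain that keeps the frame peak below `limit`.

    `volume_level` is the frame peak at unit gain (the role of signal L);
    `limit` is a fraction of converter full scale. Both the criterion and
    the value 0.9 are assumptions.
    """
    best = 0  # fall back to the lowest gain if even that one exceeds the limit
    for index, gain in enumerate(gains):
        if volume_level * gain <= limit:
            best = index
    return best

def build_signal(memories, frame_levels, frame_len):
    """Assemble V frame by frame from the storage unit chosen for each frame."""
    signal = []
    for n, level in enumerate(frame_levels):
        chosen = memories[select_gain(level)]
        signal.append(chosen[n * frame_len:(n + 1) * frame_len])
    return signal

consonant = select_gain(0.05)  # quiet frame: highest gain is chosen
vowel = select_gain(0.6)       # loud frame: lowest gain is chosen
```

A recognizer that consumes frames taken at different gains would presumably have to account for the gain of each frame when comparing against standard patterns; the source does not discuss this point.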

The word dictionary holds the phoneme label sequence of each word to be recognized (hereinafter called a word phoneme label sequence). The phoneme label sequence obtained from the unknown input signal is matched against the word phoneme label sequences, and the word whose sequence has the highest similarity is taken as the recognition result. Word recognition is performed in this way.
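
The dictionary lookup can be sketched as follows. The source does not specify how similarity between label sequences is computed; this sketch substitutes the `ratio()` of Python's `difflib.SequenceMatcher`. The dictionary contents, the label alphabet, and the `collapse` step that merges repeated frame labels are likewise assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical word dictionary: word -> word phoneme label sequence.
WORD_DICTIONARY = {
    "asa": ["a", "s", "a"],
    "aka": ["a", "k", "a"],
    "sora": ["s", "o", "r", "a"],
}

def collapse(frame_labels):
    """Merge runs of identical frame labels into one label per segment (assumed step)."""
    merged = []
    for label in frame_labels:
        if not merged or merged[-1] != label:
            merged.append(label)
    return merged

def recognize_word(frame_labels, dictionary=WORD_DICTIONARY):
    """Return the dictionary word whose label sequence is most similar."""
    observed = collapse(frame_labels)
    def similarity(word):
        return SequenceMatcher(None, observed, dictionary[word]).ratio()
    return max(dictionary, key=similarity)

result = recognize_word(["a", "a", "a", "s", "s", "a", "a", "a", "a"])
```

Any sequence similarity measure could be swapped in for `SequenceMatcher` without changing the structure of the lookup.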

The above description uses word recognition as an example, but the invention is not limited to this. By providing a language-level dictionary, containing information such as the connections between words, in addition to the word dictionary, continuously uttered speech can of course be recognized in the same way. Likewise, the features were described as phoneme labels and the recognition method as matching, but the invention is not limited to these. Finally, the description covered the case of five amplifiers, five A/D converters, and five storage units, but any number may be used.

[Effect of the invention]

As described in detail above, the speech recognition device with the multi-amplification function according to the present invention extracts phoneme-level features from portions of speech with a low utterance level, such as word beginnings, word endings, and consonants, using the speech signal amplified at the optimum level for each such interval. This has the effect of reducing recognition errors and rejections.

[Brief explanation of the drawings]

FIG. 1 is a block diagram showing an embodiment of a speech recognition device with a multi-amplification function according to the present invention. FIG. 2 is a diagram showing the change in volume of the speech signal from the microphone when the speaker utters "asa".

MC: microphone; A1 to A5: amplifiers; T: speech recognition result; C1 to C5: A/D converters; M1 to M5: storage units; SE: amplification degree selection unit; RC: speech recognition unit; SD: start detection unit; ED: end detection unit; S1: start detection signal; S2: end detection signal; L: volume level signal.

Claims

[Claims]

1. A speech recognition device having a multi-amplification function, comprising: a plurality of amplifiers, each having a different amplification degree, that amplify the output of a microphone; a storage unit that temporarily stores the speech output signals of these amplifiers from the start to the end of the speech; and means for reading a speech output signal of the optimum amplification degree from the storage unit and performing recognition processing.