JP3125928B2

JP3125928B2 - Voice recognition device

Info

Publication number: JP3125928B2
Application number: JP01026078A
Authority: JP
Inventors: 潤一郎藤本; 晴剛安田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-02-03
Filing date: 1989-02-03
Publication date: 2001-01-22
Anticipated expiration: 2016-01-22
Also published as: JPH02205898A

Description

Detailed Description of the Invention

Technical Field

The present invention relates to a speech recognition device.

Prior Art

Speech recognition devices are being actively researched, and both speaker-dependent and speaker-independent methods exist. With either method, correctly extracting the voice section is an essential condition for correct recognition. One known way of extracting the voice section is described in JP-B-62-50837. In that method the choice of threshold A is critical: if it is set low, noise pushes the signal energy over the threshold and the voice section extraction unit outputs a voice detection signal; if it is set high, the beginning of the utterance is lost. To prevent this, JP-A-57-177197 and JP-A-58-076899, for example, show how to set threshold A according to the ambient noise level, so that a threshold A less prone to these problems is chosen. However, the noise level changes from moment to moment, and even when the noise level does not change, accurate extraction becomes impossible if the speaker's voice becomes quieter.

As another approach, JP-A-57-148799 analyzes not only the speech energy but also the phoneme sequence in order to extract an accurate section, but this requires the cumbersome computation of phoneme classification. Furthermore, as shown in JP-A-56-56588, there is a method that, after detecting the start of the voice section by the above method, moves the start point back by 50 ms. This compensates for the portion that tends to be lost when the voice is quiet and the threshold is high, but those 50 ms do not necessarily contain speech. When they contain no speech, or contain noise, recognition accuracy is degraded.

There is also the arrangement shown in Fig. 10 (JP-A-60-23899), which is intended to compensate for the drawbacks of using threshold A. It integrates the energy within the voice section and normalizes it by the duration of the section; if the result is larger than a certain value the utterance is accepted, and if it is smaller the user is instructed to speak louder. However, a time average of the energy alone does not cope well with the difference between words that contain a low-energy portion, as in Fig. 11(a), and words that do not, as in Fig. 11(b). For example, a word such as "sutoppu" ("stop"), which is of the type shown in Fig. 11(a), contains a geminate consonant, so the instruction to speak louder may be issued even though the speaker is talking loudly. Conversely, a word such as "me" ("eye") is of the type shown in Fig. 11(b): even when the voice is so quiet that the consonant /m/ cannot be detected, the vowel /e/ has much more energy than /m/, so the average over the voice section often does not fall below the threshold, and detection errors tend to occur.
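The weakness of the duration-normalized average can be illustrated with a small sketch. The frame energies and the reference value below are made-up illustrative numbers, not data from the cited publication, and the function names are hypothetical.

```python
def normalized_energy(energies):
    """Energy integrated over the voice section, divided by its length in frames."""
    return sum(energies) / len(energies)


def needs_louder(energies, reference):
    """True when the normalized energy falls below the reference value."""
    return normalized_energy(energies) < reference


# Illustrative values only (assumption): a loud word with a silent gap in the
# middle, as in Fig. 11(a), and a quiet word whose weak consonant is followed
# by a strong vowel, as in Fig. 11(b).
loud_with_gap = [4.0, 4.0, 0.0, 0.0, 0.0, 4.0, 4.0]
quiet_consonant_then_vowel = [0.5, 0.5, 4.0, 4.0, 4.0]

print(needs_louder(loud_with_gap, 2.5))               # True: flagged although spoken loudly
print(needs_louder(quiet_consonant_then_vowel, 2.5))  # False: weak onset goes unnoticed
```

In this sketch the silent gap drags the average of the loud word below the reference, while the strong vowel hides the weak onset of the quiet word, which is the failure mode described above.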

Object

The present invention has been made in view of the above circumstances. Its object is to provide a speech recognition device with high recognition accuracy by allowing the start of an utterance to be detected accurately regardless of how loudly the speaker talks, and thereby allowing the correct voice section to be detected.

Configuration

To achieve the above object, the present invention is a speech recognition device having an acoustic-to-electric transducer that converts speech into an electrical signal, a voice section detection unit that extracts the speech portion from the converted signal, and a recognition unit that recognizes the speech using the detected signal, characterized in that the electrical signal that continuously precedes, in time, the speech start-point candidate detected by the voice section detection unit is analyzed, and the user is instructed how to speak according to the result of that analysis. The invention is described below on the basis of embodiments.

First, Fig. 5 is a block diagram of a speech recognition device that uses ordinary pattern matching. The voice section of the signal from the microphone 12 is detected by the voice section detection unit 13, and the recognition unit 14 matches the speech pattern from the microphone against the standard patterns 15 to perform recognition.

Fig. 1 is a block diagram for explaining one embodiment of the speech recognition device according to the present invention, and shows the configuration of the voice section detection unit of Fig. 5. In the figure, 1 is a microphone, 2 is an A/D converter, 3 is a first memory, 4 is an energy detection unit, 5 is a register, 6 is a comparator, 7 is threshold A, 8 is an energy detection unit, 9 is a comparator, 10 is threshold B, and 11 is a combining unit. The following first describes how, in a speech recognition device having a voice section detection unit that extracts the speech portion from the signal produced by an acoustic-to-electric transducer and a recognition unit that recognizes the speech using the detected signal, the start point of the speech is shifted to a time earlier than the moment at which the start of the voice section was detected.

The signal from the microphone 1 is A/D-converted by the A/D converter 2 and written sequentially into the first memory 3. The signal may be converted into feature values before it is written, or the written data may be read out and converted afterwards. The feature values referred to here are the results of analyses such as a spectrum or LPC, and their type is not particularly limited. The first memory 3 is written at successively shifted positions as time passes, and when it has been filled to the end, writing returns to the beginning. At the same time as the signal is written to the memory, its energy is detected in order to detect the voice section.
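The wrap-around behaviour described for the first memory 3 can be sketched as a small ring memory. This is only an illustration: the class name, the method names, and the use of a Python list are assumptions, and the patent does not specify an implementation.

```python
class RingMemory:
    """Fixed-capacity frame store that returns to the first slot when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = [None] * capacity
        self.write_pos = 0  # slot that the next write will overwrite
        self.count = 0      # number of valid frames currently held

    def write(self, frame):
        """Store one frame, overwriting the oldest one once the memory is full."""
        self.frames[self.write_pos] = frame
        self.write_pos = (self.write_pos + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def oldest_first(self):
        """Return the held frames in chronological order."""
        if self.count < self.capacity:
            return self.frames[:self.count]
        return self.frames[self.write_pos:] + self.frames[:self.write_pos]


memory = RingMemory(3)
for frame in [1, 2, 3, 4, 5]:
    memory.write(frame)
print(memory.oldest_first())  # [3, 4, 5]
```

With a frame period of 10 ms, for instance, a capacity of 10 frames would correspond to the 100 ms of history used in the embodiments below; the frame period here is an assumed value.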

The voice section is generally detected by separating the speech from the surrounding background noise on the basis of the magnitude of the speech energy, as shown in Fig. 6. In this method, an energy threshold A at the noise level is determined before speech is input, and the voice section is taken to run from the moment a sound louder than threshold A is input until the energy falls below threshold A. This is the basic idea, and various refinements have been made to distinguish speech from noise. No particular feature needs to be used: any of the most common ones, such as the power spectrum, LPC, or the cepstrum, may be used. Taking the power spectrum as an example, it can be obtained by applying the input speech to a bank of band-pass filters, and the analysis can be changed freely through the choice of the band-pass filter characteristics.
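The basic threshold-A detection can be sketched as follows. It is a minimal illustration that assumes the energy has already been computed per frame; the function name and the choice of returning frame indices are assumptions, and none of the noise-related refinements mentioned above are included.

```python
def detect_section(energies, threshold_a):
    """Return (start, end) frame indices of the first voice section, or None.

    The section starts at the first frame whose energy exceeds threshold_a
    and ends (exclusive) at the first later frame whose energy falls below it.
    """
    start = None
    for i, energy in enumerate(energies):
        if start is None:
            if energy > threshold_a:
                start = i
        elif energy < threshold_a:
            return (start, i)
    if start is not None:
        # The energy never dropped below the threshold within the data.
        return (start, len(energies))
    return None


energies = [0.1, 0.2, 0.6, 1.5, 2.0, 1.2, 0.3, 0.1]
print(detect_section(energies, threshold_a=1.0))  # (3, 6)
```

Note that frame 2 (energy 0.6) belongs to the gently rising onset but is missed, which is the problem the embodiments address.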

Next, operation is described with reference to the waveform of Fig. 7. When the illustrated speech waveform is input, the energy is first calculated and compared with threshold A; when it exceeds threshold A, the voice section is taken to have started and the detected speech is stored in the register. Suppose the first memory holds 100 ms of data; it then holds the data from a to d in Fig. 7. The start point of the speech, which was at c, is therefore shifted to somewhere between a and c. If it is simply moved to a, however, the superfluous portion between a and b is attached to the beginning of the speech. In particular, if these 100 ms are prepended to an energy waveform such as that of the vowel in Fig. 8, almost all of the 100 ms is unnecessary data. Therefore, after the start point has been shifted, the signal between the start point detected by the voice section detection unit (the original start point) and the start point produced by shifting (the shifted start point) is analyzed, and the shifted start point is moved according to the result.

The embodiment of Fig. 1 uses the energy between a and c as an example of this analysis. The energy of the data from a to c held in the first memory 3 is detected and compared with threshold B. Needless to say, threshold A must be greater than threshold B; threshold B may even be 0. Moving the start of the speech to the point at which comparator 9 finds that threshold B is exceeded makes it possible to detect the correct voice section from b to d in Fig. 7, and a waveform like that of Fig. 8 is also detected correctly. Naturally, this method may also be applied to the end of the speech. The 100 ms buffer length is not a limitation either, and may be shorter. In terms of Fig. 7, the portion b to c detected in this way is joined to the portion c to d detected by the ordinary method, giving the correct speech b to d. This is transferred to the recognition unit and recognized. The recognition method is not particularly limited; a well-known method such as DP matching may be used.

Although Fig. 1 effectively shows the energy detection unit as two separate units, a single unit can serve both purposes, and instead of holding two threshold values, threshold B may be derived from threshold A, for example as B = A/5. Furthermore, although energy is used here as the method of analyzing the data from a to c, other methods, such as taking differences of the power spectrum, can also be used.
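The start-point correction of the Fig. 1 embodiment can be sketched as below. This is a minimal illustration under stated assumptions: the energies are per-frame values, the buffered interval is scanned forward from its oldest frame (the text does not state the scan direction), B is derived as A/5 as in the example above, and the function name and parameters are hypothetical.

```python
def corrected_start(energies, start_c, lookback, threshold_b):
    """Move the start from c back to the first buffered frame exceeding threshold B.

    energies    : per-frame energies of the whole input
    start_c     : frame index at which threshold A was first exceeded (point c)
    lookback    : number of buffered frames before c (points a to c)
    threshold_b : lower threshold, chosen so that threshold A > threshold B
    """
    start_a = max(0, start_c - lookback)
    for i in range(start_a, start_c):
        if energies[i] > threshold_b:
            return i      # point b: the onset missed by threshold A
    return start_c        # abrupt onset: nothing to prepend


threshold_a = 1.0
threshold_b = threshold_a / 5            # B = A/5, as in the example in the text
gentle_onset = [0.05, 0.1, 0.4, 0.7, 1.5, 2.0]
abrupt_onset = [0.0, 0.0, 0.0, 1.5, 2.0]

print(corrected_start(gentle_onset, start_c=4, lookback=10, threshold_b=threshold_b))  # 2
print(corrected_start(abrupt_onset, start_c=3, lookback=10, threshold_b=threshold_b))  # 3
```

For the gentle onset the start moves back two frames, while for the abrupt, vowel-like onset it stays at c, so no unnecessary data are prepended.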

Next, another embodiment of the present invention is described with reference to Fig. 2. In the figure, 16 is a second memory and 17 is a clear unit; the parts 1 to 11 that operate in the same way as in Fig. 1 are given the same reference numerals as in Fig. 1.

In Fig. 2, in a speech recognition device having a voice section detection unit that extracts the speech portion from the signal produced by an acoustic-to-electric transducer and a recognition unit that recognizes the speech using the detected signal, the start point is shifted back a fixed time from the speech start point detected by the voice section detection unit, the shifted interval is analyzed, and when a silent portion is detected anywhere in that interval other than at its start, the start point of the speech is moved to the end of that silent portion.

The speech signal input from the microphone 1, or the feature values obtained from it, are recorded in the first memory 3. The first memory 3 can record in order the one or more data items output at each timing, and it suffices for it to hold, for example, 100 ms of data temporarily. When 100 ms of data have been written, the next data are written over the old data again, starting from the beginning. At the same time as data are written to the first memory 3, the energy at each timing is obtained and the first comparator 6 checks whether it is greater than threshold A; if it is, a speech-start signal is sent to the register 5, which begins taking in the A/D-converted data. Next, the energy of the 100 ms of data stored in the first memory 3 is obtained from the beginning and compared with threshold B by the second comparator 9; data below threshold B are ignored and data above it are written to the second memory 16. If a portion with energy below threshold B occurs after that, the clear function 17 clears the entire contents of the second memory 16, and the same process repeats. After the 100 ms of data have been checked, any data in the second memory 16 are joined to the beginning of the data stored in the register 5 to form the speech data, which are transferred to the recognition unit.

Fig. 9 illustrates this with a waveform. In a voice section detected by the ordinary method, the leading phoneme is lost. If the data are simply extended back 100 ms from the start, noise such as the sound of the lips opening and closing may be included as speech. The energy of these 100 ms is therefore checked again, the portion that is continuous with the voice section already found is kept, and the rest is discarded. The correct voice section can thus be detected. To obtain this effect, threshold A must naturally be greater than threshold B.
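The interplay of the second memory 16 and the clear unit 17 can be sketched as follows. This is an illustration only: the frame labels and energies are made up, the function names are hypothetical, and a frame whose energy exactly equals threshold B is treated as silence, a case the text does not address.

```python
def leading_frames(buffered, threshold_b):
    """Keep only the buffered frames that run continuously up to the detected start.

    buffered : list of (energy, frame) pairs, oldest first, covering the
               interval held in the first memory before the detected start
    """
    second_memory = []
    for energy, frame in buffered:
        if energy > threshold_b:
            second_memory.append(frame)   # candidate for the missed onset
        else:
            second_memory.clear()         # silence found: discard what came before
    return second_memory


def speech_data(buffered, register, threshold_b):
    """Join the recovered onset to the frames captured after the detected start."""
    return leading_frames(buffered, threshold_b) + register


# Illustrative frames (assumption): a lip click, a pause, then a weak consonant.
buffered = [(0.5, "click"), (0.05, "pause"), (0.3, "m1"), (0.6, "m2")]
register = ["e1", "e2"]
print(speech_data(buffered, register, threshold_b=0.2))  # ['m1', 'm2', 'e1', 'e2']
```

Because the pause clears the second memory, the lip click is discarded while the consonant frames that run directly into the detected section are kept.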

This method has been described for the beginning of the speech, but it can also be applied to the end. The 100 ms memory length is not a limitation and may be shorter. The recognition method is not particularly limited, and a well-known method such as the DP matching mentioned above may be used. These techniques are described in detail in, for example, "Speech Recognition" (Niimi, Kyoritsu Shuppan).

As with Fig. 1, Fig. 2 shows the energy detection unit as two separate units for convenience, but a single unit can serve both purposes, and instead of holding two threshold values, threshold B may be derived from threshold A, for example as B = A/5. Furthermore, although energy is used here as the method of analyzing the data from a to c, other methods, such as taking differences of the power spectrum, can also be used.

A further embodiment of the present invention is described with reference to Fig. 3. In the figure, 18 is a display unit; the parts 1 to 10 that operate in the same way as in Fig. 1 are given the same reference numerals as in Fig. 1. This embodiment focuses on the fact that the onset of speech is not very steep, and in particular rises gently when the first sound is a consonant, so that this portion cannot be detected accurately. In a speech recognition device having a voice section detection unit that extracts the speech portion from the signal produced by an acoustic-to-electric transducer and a recognition unit that recognizes the speech using the detected signal, the data preceding the speech start point detected by the voice section detection unit are analyzed, and the user is instructed how to speak according to the result.

The speech from the microphone 1 is converted into a digital signal by the A/D converter 2. It is desirable to convert the signal into feature values beforehand and then A/D-convert it. The data are recorded in the first memory 3 in order at each sample time, and at the same time their energy is detected. The first comparator 6 compares this energy with threshold A, and the moment it exceeds threshold A is regarded as the start of speech. In other words, this part performs the voice section detection shown in Fig. 6, but at the moment the energy exceeds threshold A, the data from shortly before that moment are still stored in the first memory 3. If the first memory 3 can record 0.1 s of data, then at the instant the speech onset is detected it holds the data going back 0.1 s.

The energy of these 0.1 s is therefore analyzed. If it is greater than a predetermined threshold B, it is judged that something that should have been detected as part of the voice section was missed because the voice was quiet and the energy low, causing a detection error, and the message "Speak louder" is shown on the display unit 18. If it is less than threshold B, the onset is judged to have been detected accurately and nothing is displayed, or a message to the effect of "good" is displayed. As for how to choose threshold B: when the whole 0.1 s is compared at once, it may be set to roughly the energy that a level of A/2 would have over 0.1 s; when each time point is compared individually, a value of about A/2 may be used.

Incorporating this into the voice section detection unit of Fig. 5 gives a working recognition device according to the present invention. The method used by the recognition unit in this case is not particularly limited. In the speaker-dependent method a routine for registering the standard patterns is required; it is omitted from Fig. 5 with speaker-independent operation and the like in mind.
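The decision that drives the display unit 18 can be sketched as below. This is a minimal illustration assuming per-frame energies for the buffered 0.1 s; the function name, the `whole_window` flag, and the message string are hypothetical, while the choice of roughly A/2 follows the guidance in the text.

```python
def utterance_advice(pre_start_energies, threshold_a, whole_window=False):
    """Return a message for the user, or None when the onset looks clean.

    pre_start_energies : per-frame energies of the data buffered just before
                         the detected start (for example, 0.1 s of frames)
    whole_window       : compare the summed energy of the whole window instead
                         of comparing each frame individually
    """
    half_a = threshold_a / 2.0
    if whole_window:
        # Threshold B is the energy a level of A/2 would have over the window.
        threshold_b = half_a * len(pre_start_energies)
        too_quiet = sum(pre_start_energies) > threshold_b
    else:
        # Threshold B is about A/2 for a frame-by-frame comparison.
        too_quiet = any(energy > half_a for energy in pre_start_energies)
    return "Speak louder" if too_quiet else None


print(utterance_advice([0.1, 0.2, 0.7], threshold_a=1.0))  # Speak louder
print(utterance_advice([0.1, 0.1, 0.2], threshold_a=1.0))  # None
```

Energy above threshold B just before the detected start suggests that part of the utterance stayed below threshold A, which is why the prompt is issued in that case rather than when the buffered interval is quiet.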

A further embodiment of the present invention is described with reference to Fig. 4. In the figure, 19 is a feature extraction unit, 20 is a register group (register control unit), 21 is a content check unit, 22 is a ring counter, 23 is a data transfer control unit, 24 is a start detection unit, 25 is a start correction unit, 26 is an input data buffer, 27 is a pattern matching unit, 28 is a dictionary template, and 29 is a result output unit. First, as shown in Fig. 7, when the voice section is determined, a speech feature such as the power spectrum or LPC cepstrum is compared with a threshold A, and the portion exceeding it is detected as the voice section. However, detection becomes difficult if threshold A is kept constant with respect to the loudness of the voice and the ambient noise, so it is often set variably. Consequently, a consonant at the beginning of a word, for example, cannot be detected because of threshold A when the voice is quiet or the ambient noise is loud. Moving the start to point b reduces this effect. In addition, rather than capturing all the data until the utterance ends and matching afterwards, many speech recognition devices perform the matching computation in parallel with data input (for example, DP matching, or the preliminary selection in the BTSP method), so the voice section must also be detected in real time. However, it is difficult to go back in time, so in practice a register group is used to accumulate past data.

The feature values of the unknown speech, input at a certain sampling period (for example 10 ms or 20 ms), are obtained by the feature extraction unit 19, and until the start of the speech is detected the feature data are accumulated successively in the register group 20. The register group 20 has a predetermined length of n frames, and the register control unit 20 controls where the data are written by using the value of the ring counter 22 as a pointer that cycles 0, 1, 2, ..., n, 0, 1, 2, ..., n-1, n, 0, .... When the start detection unit 24 detects the start, the past n frames are likewise assembled from the designated m-frame position reckoned from the position indicated by the ring counter 22, so the correction is applied at the same time as the start is detected and matching is possible in nearly real time.

The start is then corrected by searching the values of the register group 20 in the reverse direction from the position indicated by the ring counter 22; the frame immediately before the data become 0 is the true start. According to the pointer obtained in this way, the data from the true start up to the current pointer are transferred to the input buffer, after which data input at the sampling period is again processed in real time.
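The reverse search over the register group can be sketched as follows. This is an illustration under several assumptions that the text does not confirm: frames judged to be silence are stored as the value 0, the ring counter points at the slot of the most recently written frame, and each register holds a single number. The function names are hypothetical.

```python
def true_start_offset(registers, ring_counter):
    """Count how many frames before the current pointer the true start lies.

    Walks backward from ring_counter (wrapping around the register group)
    and stops just before the first register holding 0.
    """
    n = len(registers)
    back = 0
    while back < n - 1:                       # never wrap onto the current slot
        previous = (ring_counter - back - 1) % n
        if registers[previous] == 0:
            break
        back += 1
    return back


def frames_from_true_start(registers, ring_counter):
    """Frames from the true start up to the current pointer, oldest first."""
    n = len(registers)
    back = true_start_offset(registers, ring_counter)
    return [registers[(ring_counter - k) % n] for k in range(back, -1, -1)]


# Slots 2 and 3 hold silence (0); slot 1 is the newest frame.
registers = [5, 7, 0, 0, 2, 3]
print(true_start_offset(registers, ring_counter=1))       # 3
print(frames_from_true_start(registers, ring_counter=1))  # [2, 3, 5, 7]
```

The returned list corresponds to the data that would be transferred to the input buffer before real-time processing resumes.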

Effects

As is clear from the above description, the present invention makes it possible to detect the correct voice section regardless of how loud the speaker's voice is, so the recognition rate of the speech recognition device can be improved, and the correction can be performed in real time.

In addition, the section can be detected accurately regardless of the speech level. The invention is particularly effective for people whose voice is quiet and who tend to produce lip opening and closing sounds, for whom high recognition accuracy can now be achieved.

Moreover, when the user has spoken quietly, an instruction to speak louder from then on can be given accurately, regardless of the word spoken. When threshold A varies with the ambient noise, loud noise raises threshold A, and the user is then instructed to speak even louder. As a result, the start and end of the speech are detected accurately, and recognition accuracy is improved.

[Brief description of the drawings]

Fig. 1 is a block diagram for explaining one embodiment of the speech recognition device according to the present invention; Fig. 2 is a block diagram showing another embodiment; Fig. 3 is a block diagram showing a further embodiment; Fig. 4 is a block diagram showing yet another embodiment; Fig. 5 is a block diagram of a speech recognition device using ordinary pattern matching; Fig. 6 is a diagram for explaining the voice section detection method; Figs. 7 to 9 are diagrams showing thresholds and voice sections for speech waveforms; Fig. 10 is a diagram showing a conventional voice section detection method; and Figs. 11(a) and 11(b) are diagrams showing a conventional voice section detection method applied to speech with different energy distributions.

1: microphone; 2: A/D converter; 3: first memory; 4, 8: energy detection units; 5: register; 6, 9: comparators; 7: threshold A; 10: threshold B; 11: combining unit; 16: second memory; 17: clear unit; 18: display unit.

Continuation of front page

(56) References: JP-A-62-191895; JP-A-63-77097; JP-A-60-23899; JP-A-60-101598; JP-A-59-152498; JP-A-58-76899; JP-B-62-50837

Claims

A speech recognition device having an acoustic-to-electric transducer that converts speech into an electrical signal, a voice section detection unit that extracts the speech portion from the converted signal, and a recognition unit that recognizes the speech using the detected signal, characterized in that the electrical signal that continuously precedes, in time, the speech start-point candidate detected by the voice section detection unit is analyzed, and the user is instructed how to speak according to the result of the analysis.