JPH03135600A - Voice recognition device

Info

Publication number: JPH03135600A
Application number: JP1273585A
Authority: JP
Inventors: Junichiro Fujimoto; 潤一郎藤本; Harutake Yasuda; 安田　晴剛
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1989-10-20
Filing date: 1989-10-20
Publication date: 1991-06-10

Abstract

PURPOSE: To accurately discriminate a plurality of specific spoken Japanese words with a simple, low-cost configuration by detecting one voice section, counting how many following voice sections are detected within a specific time from the end of the preceding voice section, and outputting the count as the recognition result.

CONSTITUTION: A power detecting means 3 detects the power of an input voice signal, and a voice section detecting means 5 detects voice sections according to the detected power. A counting means 6, after one voice section is detected, counts how many following voice sections are detected within the specific time from the end of the preceding voice section, and outputs the count as the recognition result. For example, when "maru" is spoken, the period during which the word is spoken forms a single voice section T0. When "batsu" is spoken, on the other hand, the period during which "ba" is spoken and the period during which "tsu" is spoken are detected as two voice sections T1 and T2. Consequently, the specific spoken Japanese words can be accurately discriminated by a practical, simple, and low-cost voice recognition device.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a speech recognition device that can discriminate among a plurality of specific words when they are input as speech.

[Conventional technology]

In recent years, research in the field of speech recognition has been directed particularly at larger vocabularies and at recognizing continuously uttered speech. From a practical standpoint, however, current speech recognition devices require restrictions on how words are uttered in order to recognize them, so even with a larger recognizable vocabulary they can hardly be called sufficiently practical.

From a practical standpoint, what is often wanted is rather a device that takes only specific words as recognition targets: for example, one that can determine whether a predetermined word has been uttered, or one that can distinguish "hai" (yes) from "iie" (no). A device capable of recognizing a large vocabulary can of course also distinguish two such words, but for the sole purpose of recognizing and discriminating two words, such devices have generally been complex and costly.

U.S. Patent No. 3,688,126 discloses a practical, simple, and low-cost device intended to discriminate only two such words. This device distinguishes the input utterances "YES" and "NO" by the difference in their frequency distributions. Specifically, noting that the "S" sound of "YES" is a sibilant with high-frequency components, the device recognizes the input as "YES" if high-frequency components are present at the end of the voice section, and as "NO" if no high-frequency components are present anywhere in the voice section.

[Problem to be solved by the invention]

However, if one tries to apply the above speech recognition device, which can distinguish "YES" from "NO", to the recognition of Japanese speech, Japanese words such as "hai" and "iie" cannot be effectively distinguished, because neither contains high-frequency components. The above device is therefore suited to discriminating English speech but cannot be used as is for Japanese speech.

An object of the present invention is to provide a practical, simple, and low-cost speech recognition device suited to discriminating specific Japanese utterances.

[Means to solve the problem]

To achieve the above object, the present invention is characterized by comprising: power detection means for detecting the power of an input sound signal; voice section detection means for detecting a voice section based on the detected power; and counting means which, after the voice section detection means detects one voice section, counts the number of times a following voice section is detected within a predetermined time from the end of the preceding voice section, and outputs the count as the recognition result.

[Effect]

In the above configuration, the power detection means detects the power of the input sound signal, and the voice section detection means detects voice sections based on the detected power. When the specific words to be discriminated contain different numbers of plosives, the number of voice sections detected before the utterance of the whole word ends differs from word to word.

That is, a plosive produces a silent state lasting no longer than a predetermined time, which splits the utterance of a whole word into separate voice sections. Exploiting this, the counting means, after one voice section is detected, counts the number of times a following voice section is detected within the predetermined time from the end of the preceding voice section, and outputs the count as the recognition result. This makes it possible to discriminate among the utterances of a plurality of words.

[Example]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram of an embodiment of the speech recognition device according to the present invention.

The present invention was made by noting that Japanese has, alongside pairs such as "hai" (yes) and "iie" (no) or "yoi" (good) and "warui" (bad), the paired words "maru" ("○") and "batsu" or "peke" ("×"). Referring to FIG. 1, the speech recognition device of this embodiment comprises: a conversion unit 1, such as a microphone, that converts input sound into an electric signal; an amplification unit 2 that amplifies the sound signal converted into an electric signal; a power detection unit 3 that detects the power, that is, the energy, of the sound signal; a reference value storage unit 4 in which the power of the ambient noise before speech is input is stored as a reference value; a voice section detection unit 5 that compares the power detected by the power detection unit 3 with the reference value stored in the reference value storage unit 4 and detects the voice sections of the input speech; and a counting unit 6 which, after the voice section detection unit 5 detects one voice section, counts the number of times a following voice section is detected within a predetermined time from the end of the preceding voice section, and outputs a recognition result corresponding to the count.

The power detection unit 3 is implemented, for example, by a rectifier circuit 10 and a low-pass filter 11, as shown in FIG. 2(a); FIG. 2(b) shows a specific circuit example. The sound signal from the amplification unit 2 is rectified by the rectifier circuit 10, which consists of diodes D1 and D2, and then filtered by the low-pass filter 11, which consists of a resistor R and a capacitor C, thereby detecting the power of the sound signal.
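The rectify-then-smooth arrangement described above can be approximated in software. The following is a minimal sketch, not the circuit of FIG. 2(b): it assumes full-wave rectification and a one-pole discrete-time equivalent of the RC filter, and the names `detect_power`, `sample_rate_hz`, and `cutoff_hz` (and the 20 Hz default) are hypothetical, since the text gives no component values.

```python
import math

def detect_power(samples, sample_rate_hz, cutoff_hz=20.0):
    """Rectify the signal, then smooth it with a one-pole RC low-pass filter.

    Assumptions (not from the source): full-wave rectification and a
    hypothetical cutoff frequency; the patent gives no R or C values.
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)   # RC time constant in seconds
    dt = 1.0 / sample_rate_hz                # sampling interval in seconds
    alpha = dt / (rc + dt)                   # discrete-time smoothing factor
    envelope = []
    level = 0.0
    for x in samples:
        rectified = abs(x)                      # role of rectifier circuit 10
        level += alpha * (rectified - level)    # role of low-pass filter 11
        envelope.append(level)
    return envelope
```

The returned envelope plays the role of the power signal that the later stages compare against the noise reference.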

The counting unit 6 is constituted by, for example, a counter.

Next, an example of the operation of the speech recognition device configured as above is described.

First, consider recognizing the opposing words "maru" and "batsu" (or "peke") and discriminating between them. When one of these words is uttered, the voice section detection unit 5 detects voice sections based on the power of the input sound detected by the power detection unit 3. Specifically, the voice section detection unit 5 detects the start of a voice section at the moment the power of the sound signal detected by the power detection unit 3 exceeds the reference value stored in the reference value storage unit 4, that is, the power of the ambient noise. The reference value stored in the reference value storage unit 4 is preferably the average of the ambient-noise power data over several milliseconds preceding the actual speech input; this allows the start of the voice section to be detected accurately. Once the start of a voice section has been detected, the reference value is no longer updated. After detecting the start, the voice section detection unit 5 checks whether the power of the sound signal has fallen below the reference value, and detects the end of the voice section at the moment it does.
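The start and end detection just described might be sketched as follows. This is an illustration under stated assumptions rather than the device's actual logic: `detect_voice_sections` and `noise_frames` are hypothetical names, the power values are treated as discrete frames, and the reference value is computed once from the leading frames and then left frozen.

```python
def detect_voice_sections(power, noise_frames=5):
    """Return (start, end) frame-index pairs of detected voice sections.

    Assumption: the first `noise_frames` values contain only ambient noise;
    this count is a hypothetical stand-in for "several milliseconds".
    """
    # Reference value: mean ambient-noise power before speech is input.
    reference = sum(power[:noise_frames]) / noise_frames
    sections = []
    start = None
    for i in range(noise_frames, len(power)):
        if start is None and power[i] > reference:
            start = i                      # power rose above the reference
        elif start is not None and power[i] < reference:
            sections.append((start, i))    # power fell below the reference
            start = None
    if start is not None:                  # input ended while still in speech
        sections.append((start, len(power)))
    return sections

# Synthetic power trace: noise, a burst, a short gap, a second burst.
trace = [1, 1, 1, 1, 1, 5, 6, 5, 0.5, 0.5, 7, 8, 0.5]
print(detect_voice_sections(trace))
```

With real signals the noise power fluctuates around its own mean, so a practical version would probably add a margin above the reference; the source does not specify one.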

When "maru" is uttered, the power of the utterance is as shown in FIG. 3(a), and the period during which the word is uttered forms a single voice section T0.

In contrast, when "batsu" is uttered, the utterance contains the plosive "t". Because the lips are closed once to produce the plosive, there is a short period before the following sound "tsu" in which the signal is silent, that is, the power is close to 0. As a result, as shown in FIG. 3(b), the voice section detection unit 5 detects two voice sections, T1 and T2: the period during which "ba" is uttered and the period during which "tsu" is uttered.

Similarly, when "peke" is uttered, the utterance contains the plosive "k", so, as shown in FIG. 3(c), two voice sections T3 and T4 are detected: the period during which "pe" is uttered and the period during which "ke" is uttered. A silent period between two such voice sections arises when the following sound is a geminate consonant (sokuon) or a plosive such as "t", "k", or "p" as above. For a geminate consonant the silent period is said to be about 350 milliseconds, so for a plosive it is shorter than this.

That is, in the above examples, after the end of the first voice section T1 or T3 is detected, the start of the following voice section T2 or T4 is detected within a short time of about 350 milliseconds.

After the end of one voice section is detected, the counting unit 6 monitors whether the start of the next voice section is detected within the above time. Each time the start of the next voice section is detected within that time, the count is incremented by 1. When the start of a next voice section is no longer detected within that time, the counting unit 6 judges that the utterance of the whole word has ended, stops counting, and outputs the count so far as the recognition result. Thus, when "maru" is uttered there is only the single voice section T0, so the count is 0; when "batsu" or "peke" is uttered, the start of the following voice section T2 or T4 brings the count to 1. In this way the two opposing Japanese words "maru" and "batsu" (or "peke") can be accurately recognized and discriminated in a very simple manner.
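The counting rule can be illustrated with a short sketch. It assumes the voice sections are already available as (start, end) times in milliseconds; the function name `count_following_sections`, the parameter `max_gap`, and the example timings are hypothetical, and 350 ms is used only because the text cites it as the approximate upper bound on the silent gap.

```python
def count_following_sections(sections, max_gap):
    """Count the sections that start within max_gap of the previous section's end.

    `sections` is a time-ordered list of (start, end) pairs. Returns None
    when no voice section was detected at all (an assumption of this sketch).
    """
    if not sections:
        return None
    count = 0
    prev_end = sections[0][1]
    for start, end in sections[1:]:
        if start - prev_end > max_gap:
            break            # no follower in time: the word is finished
        count += 1           # a follower arrived in time: increment by 1
        prev_end = end
    return count

# Hypothetical timings in milliseconds.
print(count_following_sections([(0, 400)], 350))              # "maru"-like
print(count_following_sections([(0, 150), (230, 400)], 350))  # "batsu"-like
```

A section that begins after the window has elapsed is treated as belonging to a later utterance and is not counted.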

The above description used "maru" and "batsu" (or "peke") as examples, but the speech recognition device of the present invention is not limited to these. It can be applied to discriminating any pair of opposing words in which only one of the words contains a plosive, for example "zero" ("0") and "ichi" ("1").

It is also possible to discriminate not just two but three or more Japanese words. For example, consider discriminating three words: "maru" ("○"), "batsu" or "peke" ("×"), and "sankaku" ("△"), which represents something in between. When "sankaku" is uttered, the utterance contains two instances of the plosive "k", so, as shown in FIG. 4, the voice section detection unit 5 detects three voice sections T5, T6, and T7: the periods during which "san", "ka", and "ku" are uttered.

As a result, the count of the counting unit 6 when "sankaku" is uttered is 2, which distinguishes it from "maru", whose count is 0, and from "batsu" or "peke", whose count is 1.
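The mapping from count to word in the three-word example amounts to a small lookup. The sketch below is only an illustration: the table `WORD_BY_COUNT` and the function `word_for_count` are hypothetical names, and returning None for an unmapped count is an assumption, since the source does not say how other counts are handled.

```python
# Hypothetical lookup table following the three-word example in the text.
WORD_BY_COUNT = {
    0: "maru",        # single voice section, no follower
    1: "batsu/peke",  # one following voice section
    2: "sankaku",     # two following voice sections
}

def word_for_count(count):
    """Map the counting unit's output to a word label, or None if unmapped."""
    return WORD_BY_COUNT.get(count)

for n in range(4):
    print(n, word_for_count(n))
```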

[Effect of the Invention]

As described above, according to the present invention, after one voice section is detected, the number of times a following voice section is detected within a predetermined time from the end of the preceding voice section is counted, and the count is output as the recognition result. A plurality of specific Japanese utterances can therefore be accurately discriminated with a simple, low-cost configuration.

FIG. 4 is a diagram showing the time course of the voice power when the word "sankaku" is uttered.

DESCRIPTION OF SYMBOLS

1... conversion unit, 2... amplification unit, 3... power detection unit, 4... reference value storage unit, 5... voice section detection unit, 6... counting unit

Claims

A speech recognition device comprising: power detection means for detecting the power of an input sound signal; voice section detection means for detecting a voice section based on the detected power; and counting means for counting, after the voice section detection means detects one voice section, the number of times a following voice section is detected within a predetermined time from the end of that voice section, and outputting the count as the recognition result.