JPS61156100A

JPS61156100A - Voice recognition equipment

Info

Publication number: JPS61156100A
Application number: JP59277403A
Authority: JP
Inventors: 藤橋　勇一郎; 福井　昭
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1984-12-27
Filing date: 1984-12-27
Publication date: 1986-07-15

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Industrial Field of Application] The present invention relates to a speech recognition device.

[Prior Art] Conventionally, in this type of speech recognition device, speech sections are detected using a single predetermined threshold. When the recognition result is a rejection or a misrecognition, the method most often used is to have the speaker speak again, re-input the speech, and repeat the same recognition process.

[Problems to Be Solved by the Invention] In the conventional speech recognition device described above, speech sections are detected using a single predetermined threshold. Consequently, when a misrecognition or rejection is caused by improper speech section detection, having the speaker speak again and re-input the speech is likely to produce the same detection error and therefore another misrecognition or rejection. In addition, making the speaker repeat the same content increases the burden on the speaker and degrades the quality of service.

[Means for Solving the Problems] The present invention stores the speech power and feature pattern obtained by analyzing the input speech in a buffer memory, and prepares several thresholds for speech section detection. Pattern matching between the feature pattern of the detected speech section and the standard patterns yields a similarity, which is then judged. If the recognition result is a rejection, the threshold for speech section detection is changed, the power of the input speech stored in the buffer memory is read out, and the speech section is detected again. The feature pattern of the newly found speech section is likewise read from the buffer memory, and pattern matching and judgment of the recognition result are performed. Speech sections can thus be detected with multiple thresholds for the input speech of a single utterance. This raises the probability of detecting a more appropriate speech section, lowers the probability of a rejection or misrecognition caused by erroneous speech section detection, and solves the problems of the conventional device described above.

[Embodiment] An embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram of an embodiment of a speech recognition device according to the present invention.

The speech recognition device of this embodiment comprises speech analysis means 2, analysis result buffer memory 3, speech section detection means 4, standard pattern memory 5, pattern matching means 6, and recognition result determination means 7.

The speech analysis means 2 receives the input speech 1 and stores the results of its analysis, the feature pattern 8 and the speech power 9, in the analysis result buffer memory 3. The speech section detection means 4 reads the stored speech power 11 from the analysis result buffer memory 3, detects the speech section, and outputs the start and end information 12 of the speech section to the pattern matching means 6. The pattern matching means 6 reads the stored feature pattern 10 of the speech section from the analysis result buffer memory 3 according to the start and end information 12, reads a standard pattern 13 from the standard pattern memory 5 (which stores a plurality of standard patterns for each word or sentence to be recognized), performs pattern matching between the feature pattern of the speech section and the standard pattern, and outputs the similarity 14 to the recognition result determination means 7.

The recognition result determination means 7 determines the recognition result according to the input similarity 14. If the result is not a reject, it outputs the recognition result 16 and the series of speech recognition processes ends. If the result is a reject, it does not output the reject as a recognition result; instead it outputs a re-recognition instruction 15 to the speech section detection means 4 and the pattern matching means 6. On receiving the re-recognition instruction 15, the speech section detection means 4 changes the threshold for speech section detection, reads the stored speech power 11 from the analysis result buffer memory 3 again, detects a new speech section, and outputs it to the pattern matching means 6. On receiving the re-recognition instruction 15, the pattern matching means 6 reads the stored feature pattern from the analysis result buffer memory 3 again according to the start and end information 12 of the new speech section, reads the standard pattern 13 from the standard pattern memory 5 as in the first recognition pass, performs pattern matching, and outputs a new similarity 14 to the recognition result determination means 7.

The recognition result determination means 7 judges the new similarity 14. If the result is not a reject, it outputs the recognition result 16 and the re-recognition process ends. If the result is a reject and the maximum number of re-recognition repetitions is one, it outputs a reject as the recognition result 16 and the re-recognition process ends. If the maximum number is two or more, the re-recognition process is repeated; if the result is still a reject after the maximum number of repetitions, a reject is output as the recognition result 16 and the re-recognition process ends.
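The retry flow described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the device: the names `find_section`, `toy_similarity`, and `recognize`, the data layout, and the similarity measure are all introduced here for illustration, and the section detector is reduced to "first to last frame at or above the threshold". The point it shows is that the power and feature sequences are analyzed once, kept in a buffer, and re-read for each threshold, so the speaker does not have to repeat the utterance.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Pattern = Sequence[float]


def find_section(power: Sequence[float], threshold: float) -> Optional[Tuple[int, int]]:
    """Simplified detector (assumption): first to last frame at or above threshold."""
    above = [i for i, p in enumerate(power) if p >= threshold]
    return (above[0], above[-1] + 1) if above else None


def toy_similarity(segment: Pattern, reference: Pattern) -> float:
    """Hypothetical similarity in [0, 1]; stands in for the device's matcher."""
    if len(segment) != len(reference):
        return 0.0
    mean_diff = sum(abs(a - b) for a, b in zip(segment, reference)) / len(reference)
    return 1.0 / (1.0 + mean_diff)


def recognize(
    power: Sequence[float],                 # buffered speech power, one value per frame
    features: Sequence[float],              # buffered feature pattern, one value per frame
    templates: Dict[str, List[Pattern]],    # several standard patterns per word
    thresholds: Sequence[float],            # first entry: initial pass; rest: retries
    min_similarity: float,                  # below this, the result is a reject
    similarity: Callable[[Pattern, Pattern], float] = toy_similarity,
) -> Optional[str]:
    """Return the recognized label, or None (reject) once all thresholds are used."""
    for threshold in thresholds:
        section = find_section(power, threshold)   # re-reads the buffered power
        if section is None:
            continue
        start, end = section
        segment = features[start:end]              # re-reads the buffered features
        best_label, best_score = None, float("-inf")
        for label, references in templates.items():
            for reference in references:
                score = similarity(segment, reference)
                if score > best_score:
                    best_label, best_score = label, score
        if best_score >= min_similarity:
            return best_label                      # not a reject: stop here
    return None                                    # still a reject after all retries


power = [0, 1, 4, 9, 9, 4, 2, 4, 9, 4, 1, 0]
features = [float(i) for i in range(len(power))]
templates = {"hello": [features[2:10]]}
result = recognize(power, features, templates, thresholds=[8, 3], min_similarity=0.8)
```

In this sketch the length of `thresholds` minus one plays the role of the maximum number of re-recognition repetitions: with `[8, 3]` one retry is allowed, and with `[8]` alone the same input is rejected.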

FIG. 2 is a waveform diagram showing an example of the speech power P over time. In this embodiment, two thresholds of the speech power P are set for speech section detection: a first threshold Pth1 and a second threshold Pth2. With the first threshold Pth1, the start duration Tb begins at the start point tb1, where the speech power P reaches the threshold Pth1, and ends at the end point te1, where P meets Pth1 again; the end duration Te begins at this end point te1. The speech section is detected from the start duration Tb and the end duration Te. If Tb is at least a set time T1, the section is regarded as a speech section. If Tb does not reach T1, the decision depends on Te: if Te is at least a set time T2, the section of Tb is regarded as not being a speech section; if Te does not reach T2, that interval is added to the preceding Tb, and whether the section is a speech section is decided at the next end point.

With the first threshold Pth1, Tb is shorter than T1 and Te is longer than T2, so the section of Tb is not regarded as a speech section. With the second threshold Pth2, Tb begins at the start point tb2 (almost the same time as tb1). A power dip section T- occurs partway through, but it is shorter than T2, so it is added to Tb, which continues to the next end point te2. The start duration Tb from tb2 to te2 is longer than T1, so this section is regarded as a speech section.
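The duration rules for Tb, Te, T1, and T2 can be sketched as a small frame-based routine. This is an illustrative reading of the rules, not the device's implementation: the function name `detect_speech_section`, the use of frame counts as the time unit, and the sample waveform are assumptions made here.

```python
from typing import Optional, Sequence, Tuple


def detect_speech_section(
    power: Sequence[float],
    threshold: float,
    t1: int,  # minimum start duration Tb (in frames) for a speech section
    t2: int,  # end duration Te (in frames) at which a short candidate is discarded
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first speech section, end exclusive, or None."""
    start = None  # start point: first frame at or above the threshold
    end = None    # end point: first frame below the threshold after the start
    for i, p in enumerate(power):
        if p >= threshold:
            if start is None:
                start = i      # start duration Tb begins
            end = None         # a dip shorter than T2 is absorbed into Tb
        elif start is not None:
            if end is None:
                end = i        # power fell below the threshold: end point
                if end - start >= t1:
                    return (start, end)    # Tb >= T1: speech section
            if i - end + 1 >= t2:
                start, end = None, None    # Tb < T1 and Te >= T2: not speech
    # Input ended while the power was still at or above the threshold.
    if start is not None and end is None and len(power) - start >= t1:
        return (start, len(power))
    return None


# Hypothetical waveform: two short peaks separated by a one-frame dip.
power = [0, 1, 4, 9, 9, 4, 2, 4, 9, 4, 1, 0, 0, 0]
high = detect_speech_section(power, threshold=8, t1=6, t2=3)
low = detect_speech_section(power, threshold=3, t1=6, t2=3)
```

With the higher threshold each peak is shorter than T1 and is followed by at least T2 frames below the threshold, so nothing is detected. With the lower threshold the one-frame dip is shorter than T2, is added to Tb, and the combined section from frame 2 to frame 10 is accepted.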

Therefore, even if speech section detection is first performed with the first threshold Pth1 and the recognition result is a reject, performing speech section detection with the second threshold Pth2 allows an appropriate speech section to be detected and a correct recognition result to be obtained.

Note that, to simplify the explanation, this embodiment takes the points at which the speech power P begins to exceed the threshold and begins to fall below it as the start point and the end point. It is also effective to add a hangover to each: the start point is set a predetermined time before the point at which the speech power begins to exceed the threshold, and the end point a predetermined time after the point at which it begins to fall below the threshold.
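The hangover variant amounts to widening a detected section on both sides. A minimal sketch follows, assuming frame indices with an exclusive end; the function name `apply_hangover` and the parameters `pre` and `post` are hypothetical and introduced only for illustration.

```python
from typing import Tuple


def apply_hangover(
    section: Tuple[int, int],  # (start, end) in frames, end exclusive
    n_frames: int,             # total number of buffered frames
    pre: int,                  # frames to move the start point earlier
    post: int,                 # frames to move the end point later
) -> Tuple[int, int]:
    """Widen a detected section, clamped to the buffered range."""
    start, end = section
    return (max(0, start - pre), min(n_frames, end + post))


widened = apply_hangover((2, 10), n_frames=14, pre=3, post=2)
```

Clamping matters because the widened section is used to read the feature pattern back from the buffer, which only holds frames 0 through `n_frames - 1`.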

[Effects of the Invention] As described above, the present invention stores the power and feature pattern obtained by analyzing the input speech in a buffer memory and prepares several thresholds for speech section detection. Pattern matching between the feature pattern of the detected speech section and the standard patterns yields a similarity, which is then judged. If the recognition result is a rejection, the threshold for speech section detection is changed, the power of the input speech stored in the buffer memory is read out, the speech section is detected again, and the feature pattern of the newly found speech section is likewise read from the buffer memory for pattern matching and judgment of the recognition result. Speech sections can thus be detected with multiple thresholds for the input speech of a single utterance, which raises the probability of detecting a more appropriate speech section. This lowers the probability of a rejection or misrecognition caused by erroneous speech section detection, reduces the burden on the speaker of having to speak again, and enables more efficient speech input.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of an embodiment of a speech recognition device according to the present invention, and FIG. 2 is a diagram showing the relationship between the thresholds for speech section detection and the speech power waveform.

1: input speech
2: speech analysis means
3: analysis result buffer memory
4: speech section detection means
5: standard pattern memory
6: pattern matching means
7: recognition result determination means
8: feature pattern
9: speech power
10: stored feature pattern
11: stored speech power
12: start and end information of the speech section
13: standard pattern
14: similarity
15: re-recognition instruction
16: recognition result

Claims

[Claims] 1. A speech recognition device comprising: a standard pattern memory in which a plurality of standard patterns are stored in advance for each word or sentence to be recognized; speech analysis means for analyzing the speech power and feature pattern of input speech; an analysis result buffer memory that stores the speech power and feature pattern obtained by the speech analysis means; speech section detection means that reads out the speech power stored in the analysis result buffer memory and detects a speech section from that speech power; pattern matching means that, according to information on the speech section detected by the speech section detection means, reads out the feature pattern of the speech section stored in the analysis result buffer memory, reads out a standard pattern from the standard pattern memory, and obtains a similarity by pattern matching between the two; and recognition result determination means that determines a recognition result according to the similarity, wherein, when the recognition result is not a reject, the recognition result is output and the recognition process ends, and, when the recognition result is a reject, the recognition result is not output and a re-recognition process is performed in which the threshold used for determining the speech section in the speech section detection means is changed, the above processing by the speech section detection means and the pattern matching means is repeated to obtain a new similarity, and a new recognition result is determined and output according to that similarity.

2. The speech recognition device according to claim 1, wherein a maximum number of repetitions of the re-recognition process is set, and if the recognition result is still a reject after the re-recognition process has been repeated the maximum number of times, a reject is output as the recognition result.