JPS6285300A

JPS6285300A - Word voice recognition system

Info

Publication number: JPS6285300A
Application number: JP60225212A
Authority: JP
Inventors: 教幸藤本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1985-10-09
Filing date: 1985-10-09
Publication date: 1987-04-18
Also published as: JPH0367279B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Summary] The invention concerns a speech recognition device that detects a speech section from the power of the input speech using a preset threshold, extracts speech feature parameters, and recognizes speech by comparing and matching them against the speech feature parameters of words registered in advance. In this device, multiple speech sections are obtained using multiple preset speech power thresholds, and the differences between the start positions and between the end positions of the sections are examined. When both differences are small, comparison and matching against the registered word speech patterns is performed for only one of the sections. This retains the advantage of multiple thresholds while reducing the amount of processing.

[Industrial Field of Application] The present invention relates to a word recognition method in a speech recognition device, and in particular to a method for detecting speech sections.

[Prior Art] Some speech recognition devices register speech for each word in advance, compare the speech feature parameters of an input word with the feature parameters of each registered word, and output the word with the highest similarity as the recognition result. In such a device, the speech section of the word must first be detected from the input speech.

To detect that a word has been spoken and extract it as a word speech section, the power of the input speech is compared against a fixed threshold, and the section exceeding the threshold is detected as one word section. The speech parameters of that section are extracted and compared with those of the registered speech, and the closest match is taken as the recognized word.

For example, if the power of the input speech has the shape shown in FIG. 4, the section exceeding the fixed threshold is detected as a word section, and its speech feature parameters are extracted.
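The single-threshold detection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `detect_section`, the per-frame power list, and the choice to take the span from the first to the last frame above the threshold are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

def detect_section(power: List[float], threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the word section, or None.

    Assumption: the section runs from the first frame whose power exceeds
    the threshold to the last such frame.
    """
    above = [i for i, p in enumerate(power) if p > threshold]
    if not above:
        return None  # the power never exceeds the threshold
    return above[0], above[-1]

# Hypothetical per-frame power values for one utterance.
power = [0, 1, 5, 8, 7, 2, 0]
print(detect_section(power, 3))  # (2, 4)
```

The frames inside the returned span are the ones from which the speech feature parameters would then be extracted.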

The speech feature parameters are extracted, for example, by providing bandpass filters 1, 2, 3, ... and storing the outputs of bandpass filters 1, 2, 3, ... for every 10 ms (frame) of the input word in the format shown in FIG. 5. This is called a word pattern.

This input speech pattern is compared and matched against the word pattern of each registered word, stored in the same format, to find the one with the highest similarity.
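As a rough illustration of this matching step, the sketch below treats a word pattern as a list of frames, each frame holding one value per bandpass channel. The patent does not state the similarity measure, so the linear time normalization (`resample`) and the city-block distance used here are assumptions chosen only to keep the example short. All function names are hypothetical.

```python
from typing import Dict, List

Pattern = List[List[float]]  # one row per frame, one column per bandpass channel

def resample(pattern: Pattern, n_frames: int) -> Pattern:
    """Pick n_frames rows evenly spaced in time (assumed linear time normalization)."""
    last = len(pattern) - 1
    return [pattern[round(k * last / (n_frames - 1))] for k in range(n_frames)]

def distance(a: Pattern, b: Pattern, n_frames: int = 4) -> float:
    """City-block distance between two patterns after resampling (assumed measure)."""
    ra, rb = resample(a, n_frames), resample(b, n_frames)
    return sum(abs(x - y) for fa, fb in zip(ra, rb) for x, y in zip(fa, fb))

def recognize(input_pattern: Pattern, registered: Dict[str, Pattern]) -> str:
    """Return the registered word whose pattern is closest to the input pattern."""
    return min(registered, key=lambda word: distance(input_pattern, registered[word]))

# Two hypothetical registered words with two channels each.
registered = {"a": [[1, 0], [1, 0]], "b": [[0, 1], [0, 1]]}
print(recognize([[0.9, 0.1], [1, 0], [0.8, 0.2]], registered))  # a
```

Highest similarity corresponds here to smallest distance. A real system of this era would more likely use dynamic time warping or a comparable alignment, which this sketch does not attempt.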

[Problems to be Solved by the Invention] As described above, when word sections are detected using a single power threshold, the relationship between the power of the input speech and the threshold can lead to two cases: (1) a noise section is included in the speech section, or (2) low-power speech such as a devoiced syllable is left out of the section. Either case causes misrecognition or rejection.

Word sections have therefore been detected using multiple power thresholds, with a word pattern extracted for each section and its similarity to the registered word patterns computed. It has been confirmed that the correct speech section is then likely to be among the detected sections, so that a high recognition rate is obtained.
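A minimal sketch of multiple-threshold detection, under the assumption that each threshold is applied independently to the same per-frame power sequence. The names `detect_section` and `detect_all`, the threshold values, and the power values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Section = Tuple[int, int]

def detect_section(power: List[float], threshold: float) -> Optional[Section]:
    """Span from the first to the last frame whose power exceeds the threshold."""
    above = [i for i, p in enumerate(power) if p > threshold]
    return (above[0], above[-1]) if above else None

def detect_all(power: List[float], thresholds: Dict[str, float]) -> Dict[str, Optional[Section]]:
    """Run the same detection once per threshold."""
    return {name: detect_section(power, t) for name, t in thresholds.items()}

# Hypothetical utterance: weak noise (frames 1-2) followed by the word (frames 5-8).
power = [0, 2, 2, 0, 0, 6, 9, 8, 6, 0]
print(detect_all(power, {"L1": 1, "L2": 3, "L3": 5}))
# {'L1': (1, 8), 'L2': (5, 8), 'L3': (5, 8)}
```

In this example the lowest threshold pulls the noise into the section, while the two higher thresholds agree on the same section. Matching both of those identical sections against every registered pattern is the redundant work the invention sets out to avoid.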

(See, for example, Proceedings of the Acoustical Society of Japan, October 1984, 挿口, 接置, 佐藤: "Simplified word speech recognition method using multiple thresholds.") FIG. 6 shows the effect of this speech section detection method using multiple thresholds.

In FIG. 6, (a), (b), and (c) show three registered word patterns.

FIG. 6(d) shows the power pattern for input A. The highest threshold L3 detects the correct speech section, which has the highest similarity to the registered word in (a) and is recognized correctly. With the lower thresholds L1 and L2, the noise section is included in the detected speech sections, which may be misrecognized as the registered word in (b).

FIG. 6(e) shows the power pattern for input B. The lowest threshold L1 detects the correct speech section, which has the highest similarity to the registered word in (b) and is recognized correctly. With the higher thresholds L2 and L3, the input may be misrecognized as the registered word in (c).

With this multiple-threshold method, however, a word pattern had to be computed and matched against the registered patterns for every detected section, even though two of the sections are almost identical for input A in FIG. 6(d), and likewise for input B in FIG. 6(e).

The processing therefore took considerable time. The present invention aims to provide a new word speech recognition method that solves this problem.

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the word speech recognition method of the present invention.

In FIG. 1, 1_1, 1_2, ..., 1_m are section detection units that detect speech sections using m speech power thresholds.

2 is a section selection unit that selects only the necessary sections from the m speech sections detected by the m section detection units 1_1, 1_2, ..., 1_m.

The section detection units 1_1, 1_2, ..., 1_m consist of start detection units 11_1, 11_2, ..., 11_m, which detect the first points S1, S2, ..., Sm at which the input speech power exceeds the respective thresholds L1, L2, ..., Lm, and end detection units 12_1, 12_2, ..., 12_m, which detect the last points E1, E2, ..., Em.

After the start has been detected, the end detection units 12_1, 12_2, ..., 12_m also handle the case where the input speech power falls below the threshold but rises above it again within a predetermined time. They do not treat this as the end, but as a continuation of the speech section.
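The continuation rule for end detection can be sketched as below. This is an illustrative reading only: the parameter `max_gap` (the predetermined time, counted in frames) and the function name are assumptions, and the patent does not say how the time limit is measured.

```python
from typing import List, Optional, Tuple

def detect_section_with_gap(
    power: List[float], threshold: float, max_gap: int
) -> Optional[Tuple[int, int]]:
    """Detect (start, end), bridging dips below the threshold of at most max_gap frames."""
    start: Optional[int] = None
    end: Optional[int] = None
    gap = 0  # consecutive frames below the threshold since the last frame above it
    for i, p in enumerate(power):
        if p > threshold:
            if start is None:
                start = i  # first point exceeding the threshold
            end = i        # latest point exceeding the threshold
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:
                break      # the dip lasted too long: the section has ended
    if start is None or end is None:
        return None
    return start, end

power = [0, 5, 6, 1, 1, 7, 6, 0, 0, 0, 0, 9]
print(detect_section_with_gap(power, 3, max_gap=2))  # (1, 6)
print(detect_section_with_gap(power, 3, max_gap=1))  # (1, 2)
```

With `max_gap=2` the two-frame dip at frames 3 and 4 is bridged, and the later burst at frame 11 is ignored because the four-frame silence before it exceeds the limit.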

The section selection unit 2 performs the following check on the start points S1, S2, ..., Sm and end points E1, E2, ..., Em detected by the section detection units 1_1, 1_2, ..., 1_m.

$$|S_i - S_j| < Q_1, \qquad |E_i - E_j| < Q_2$$

Here Q_1 and Q_2 are predetermined thresholds, and the check is performed for all combinations of i = 1, 2, ..., m-1 and j = 2, 3, ..., m with i < j.

When both of the above inequalities hold, the speech section (Sj, Ej) is excluded from the candidate sections, and only the remaining speech sections are output as candidate sections.
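The pairwise check and exclusion rule can be sketched as follows. The function name `select_sections` and the sample values are hypothetical. The sketch takes the text literally in one respect that is an interpretation: every pair with i < j is tested, including pairs whose first member has already been excluded.

```python
from typing import List, Tuple

Section = Tuple[int, int]  # (start S, end E) in frames

def select_sections(sections: List[Section], q1: int, q2: int) -> List[int]:
    """Return the indices of the sections kept as candidates.

    For every pair i < j, section j is excluded when both the start
    difference is below q1 and the end difference is below q2.
    """
    excluded = set()
    m = len(sections)
    for i in range(m - 1):
        for j in range(i + 1, m):
            (si, ei), (sj, ej) = sections[i], sections[j]
            if abs(si - sj) < q1 and abs(ei - ej) < q2:
                excluded.add(j)
    return [k for k in range(m) if k not in excluded]

# Sections detected with three thresholds; the first two nearly coincide.
sections = [(10, 50), (11, 51), (30, 50)]
print(select_sections(sections, q1=3, q2=3))  # [0, 2]
```

Only the sections at the returned indices would then be turned into word patterns and matched, so the matching cost falls from three sections to two in this example.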

A word speech pattern is generated for each candidate section output by the section selection unit 2 and is compared and matched against the registered word speech patterns.

Instead of the above inequalities, the check may be performed using the following inequality.

$$|S_i - S_j| + |E_i - E_j| < Q_3$$

In either case, one of the two sections is excluded when both the start position difference and the end position difference are smaller than a certain value.
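The two forms of the check can be compared in a short sketch. The function names and the threshold parameter `q_sum` for the summed form are hypothetical labels introduced for this example.

```python
from typing import Tuple

Section = Tuple[int, int]  # (start, end) in frames

def near_duplicate_separate(a: Section, b: Section, q1: int, q2: int) -> bool:
    """Separate thresholds on the start difference and the end difference."""
    return abs(a[0] - b[0]) < q1 and abs(a[1] - b[1]) < q2

def near_duplicate_sum(a: Section, b: Section, q_sum: int) -> bool:
    """Single threshold on the sum of the start and end differences."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) < q_sum

a, b = (10, 50), (12, 52)
print(near_duplicate_separate(a, b, q1=3, q2=3))  # True
print(near_duplicate_sum(a, b, q_sum=5))          # True
print(near_duplicate_sum(a, b, q_sum=4))          # False
```

Because both differences are non-negative, the summed form passing with `q_sum` implies each individual difference is also below `q_sum`, which is why the text can describe both forms as requiring both differences to be small.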

[Operation] With the above configuration, nearly identical speech sections are excluded, so the amount of processing is reduced while the advantage of using multiple thresholds is retained.

[Embodiment] The present invention is described more specifically below with reference to the embodiment shown in FIGS. 2 and 3.

FIG. 2 is a block diagram of an embodiment of the present invention.

In FIG. 2, 1_1, 1_2, ..., 1_m are the section detection units and 2 is the section selection unit. These are the same elements as in FIG. 1.

3 is a power calculation unit, which calculates the power of the input speech and feeds it to each of the section detection units 1_1, 1_2, ..., 1_m.

4 is a microphone for speech input, and 5 is an audio amplifier, whose amplified output is fed to the power calculation unit 3 and to the bandpass filters 6_1, 6_2, 6_3, ..., 6_n.

6_1, 6_2, 6_3, ..., 6_n are bandpass filters (BPF), each of which passes one of the n frequency bands into which the speech band is divided.

7_1, 7_2, 7_3, ..., 7_n are rectifier/smoothers, which rectify and smooth the outputs of the respective bandpass filters 6_1, 6_2, 6_3, ..., 6_n to produce envelope outputs.
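A digital analogue of one rectifier/smoother channel might look like the sketch below. The patent describes analog-style blocks and does not specify the rectification type or the smoothing filter, so full-wave rectification and a causal moving average with a hypothetical `window` length are assumptions made purely for illustration.

```python
from typing import List

def envelope(signal: List[float], window: int) -> List[float]:
    """Full-wave rectify, then smooth with a causal moving average.

    Assumption: the average is taken over the last `window` samples,
    or over all samples seen so far when fewer are available.
    """
    rectified = [abs(x) for x in signal]
    out: List[float] = []
    acc = 0.0
    for i, x in enumerate(rectified):
        acc += x
        if i >= window:
            acc -= rectified[i - window]  # drop the sample leaving the window
        out.append(acc / min(i + 1, window))
    return out

print(envelope([2, -4, 6], window=2))  # [2.0, 3.0, 5.0]
```

Sampling each channel's envelope once per frame would then give one column of the word speech pattern.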

8 is a word speech pattern generation unit, which uses the outputs of the rectifier/smoothers 7_1, 7_2, 7_3, ..., 7_n to generate a word speech pattern such as that shown in FIG. 5 for each section output by the section selection unit 2.

9 is a registered pattern storage unit, a storage device that holds the speech patterns generated beforehand by the word speech pattern generation unit 8 for the required words.

10 is a matching unit, which compares and matches the speech pattern received from the word speech pattern generation unit 8 against each pattern in the registered pattern storage unit 9 and outputs the word with the highest similarity as the recognition result.

To register a word initially, switch 11 is set to the upper position and the word is spoken into the microphone 4. A word speech pattern is generated from the output of a section detection unit with a standard power threshold, for example 1_2, and is stored in the registered pattern storage unit 9.

For speech recognition, switch 11 is set to the lower position. Sections of the speech to be recognized, input from the microphone 4, are detected by each of the section detection units 1_1, 1_2, ..., 1_m, and the section selection unit 2 selects only the necessary sections. Word speech patterns are generated for these sections and are compared and matched against the registered word speech patterns, and the recognition result is output.
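The recognition flow just described can be put together in one compact sketch. Everything here is illustrative: the function names, the single-channel frames, the threshold values, and the distance measure (city-block after linear time normalization) are assumptions, since the patent leaves the similarity computation unspecified.

```python
from typing import Dict, List, Optional, Tuple

Frame = List[float]        # one value per bandpass channel
Section = Tuple[int, int]  # (start, end) frame indices

def detect(power: List[float], threshold: float) -> Optional[Section]:
    above = [i for i, p in enumerate(power) if p > threshold]
    return (above[0], above[-1]) if above else None

def select(sections: List[Section], q1: int, q2: int) -> List[Section]:
    """Drop section j when an earlier section i has nearly the same start and end."""
    kept = []
    for j, (sj, ej) in enumerate(sections):
        if not any(abs(si - sj) < q1 and abs(ei - ej) < q2 for si, ei in sections[:j]):
            kept.append((sj, ej))
    return kept

def distance(a: List[Frame], b: List[Frame], n: int = 4) -> float:
    """Assumed measure: city-block distance over n evenly spaced frames."""
    def pick(p: List[Frame], k: int) -> Frame:
        return p[round(k * (len(p) - 1) / (n - 1))]
    return sum(abs(x - y) for k in range(n) for x, y in zip(pick(a, k), pick(b, k)))

def recognize(power, frames, thresholds, registered: Dict[str, List[Frame]], q1, q2):
    found = [s for s in (detect(power, t) for t in thresholds) if s is not None]
    best_word, best_dist = None, float("inf")
    for start, end in select(found, q1, q2):   # only the candidate sections are matched
        pattern = frames[start:end + 1]
        for word, ref in registered.items():
            d = distance(pattern, ref)
            if d < best_dist:
                best_word, best_dist = word, d
    return best_word

frames = [[0], [2], [2], [0], [0], [6], [9], [8], [6], [0]]
power = [f[0] for f in frames]
registered = {"word_a": [[6], [9], [8], [6]], "word_b": [[3], [3], [9], [9]]}
print(recognize(power, frames, [1, 3, 5], registered, q1=2, q2=2))  # word_a
```

In this example the three thresholds yield the sections (1, 8), (5, 8), and (5, 8). The third duplicates the second and is dropped, so two sections are matched instead of three, and the best match overall comes from the section (5, 8).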

FIG. 3 shows an example of the processing performed by the section selection unit.

For input A in FIG. 3(a), three speech sections were detected using the three thresholds L1, L2, and L3. Two of these sections differ only slightly in both start and end position, so one of them is excluded, as shown by the broken line.

For input B in FIG. 3(b), three speech sections were likewise detected using the three thresholds L1, L2, and L3. Two of these sections differ only slightly in both start and end position, so one of them is excluded, as shown by the broken line.

[Effects of the Invention] As described above, the present invention excludes nearly identical speech sections while retaining the advantage of using multiple thresholds. This reduces the amount of processing and so increases processing speed, which is of considerable practical value.

[Brief explanation of drawings]

FIG. 1 is a block diagram of the principle of the present invention, FIG. 2 is a block diagram of an embodiment of the present invention, FIG. 3 is a diagram showing an example of processing by the section selection unit, FIG. 4 is a diagram showing the power pattern of a spoken word, FIG. 5 is a diagram illustrating a word speech pattern, and FIG. 6 is a diagram showing the effect of the speech section detection method using multiple thresholds.

In the drawings, 1_1, 1_2, ..., 1_m are section detection units; 11_1, 11_2, ..., 11_m are start detection units; 12_1, 12_2, ..., 12_m are end detection units; 2 is a section selection unit; 3 is a power calculation unit; 4 is a microphone; 5 is an amplifier; 6_1, 6_2, 6_3, ..., 6_n are bandpass filters; 7_1, 7_2, 7_3, ..., 7_n are rectifier/smoothers; 8 is a word speech pattern generation unit; 9 is a registered pattern storage unit; 10 is a matching unit; and 11 is a switch.

Claims

[Claims] A word speech recognition method for a speech recognition device that detects a speech section from the power of input speech using a preset threshold, extracts speech feature parameters, and recognizes speech by comparing and matching them against the speech feature parameters of words registered in advance, the method being characterized by comprising: section detection units that detect the speech sections exceeding each of a plurality of preset speech power thresholds; and a section selection unit that compares the differences between the start positions and between the end positions of the sections detected by the section detection units with preset thresholds and, when both the start position difference and the end position difference are within the thresholds, excludes one of the sections and outputs the remaining sections; wherein the speech feature parameters of the speech sections selected by the section selection unit are compared and matched against the speech feature parameters of the registered words.