JPS61113099A

JPS61113099A - Voice section detecting system for voice recognition equipment

Info

Publication number: JPS61113099A
Application number: JP59234385A
Authority: JP
Inventors: 桜庭　孝宏
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1984-11-07
Filing date: 1984-11-07
Publication date: 1986-05-30

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a speech recognition device that compares and matches a plurality of standard speech patterns against an input speech pattern and takes the most similar pattern as the recognition result. In particular, it relates to a method for detecting the correct speech section in input speech that contains noise.

[Prior art]

In speech recognition, the portion of the microphone input in which speech is present is generally found by speech section detection, and only that portion is passed on for comparison and matching. Conventional speech section detection checks information such as the power of the signal: if the power is above a certain threshold, speech is judged to be present, and if it is below, the signal is judged not to be speech.

A specific example is described below with reference to the drawings.

Fig. 2 shows an example of a conventional speech recognition device, in which 1 is a microphone, 2 an amplifier, 3 a 16-channel filter, 4 a power calculation unit, 5 a section detection unit, 6 an input speech buffer, 7 a speech dictionary, 8 a matching unit, and 9 a matching result determination unit.

The speech signal input from the microphone 1 is amplified by the amplifier 2 and supplied to the 16-channel filter 3. The filter 3 splits the input speech signal into predetermined frequency bands, parameterizes it frame by frame, and converts it into a speech pattern. The power calculation unit 4 calculates the power from the speech parameters output by the filter 3.
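The text does not say how the power is derived from the filter outputs. As a minimal sketch, assuming each frame is a list of 16 non-negative channel outputs and that the frame power is simply their sum (the function name `frame_power` and this definition are illustrative assumptions, not taken from the patent):

```python
def frame_power(channel_outputs):
    """Return one power value for a frame of filter-bank outputs.

    Assumption: the power is the plain sum of the channel outputs.
    The patent only states that power is computed from the parameters.
    """
    return sum(channel_outputs)


# Two hypothetical frames of 16 channel outputs each.
frames = [[1] * 16, [3] * 16]
powers = [frame_power(f) for f in frames]
print(powers)  # [16, 48]
```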

The section detection unit 5 applies a preset threshold to the result from the power calculation unit 4. Signal sections whose power is below the threshold are treated as silent sections and sections whose power is above it as speech sections, and only the speech pattern of the speech sections is stored in the input speech buffer 6.
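The conventional single-threshold detection described here can be sketched as follows. This is an illustration only: the function name `detect_span`, the frame-index representation, and the choice to count a frame exactly at the threshold as speech are assumptions.

```python
def detect_span(frame_power, threshold):
    """Return (start, end) frame indices, end exclusive, spanning the first
    to the last frame whose power reaches the threshold, or None if no
    frame does."""
    above = [i for i, p in enumerate(frame_power) if p >= threshold]
    if not above:
        return None
    return above[0], above[-1] + 1


# Hypothetical per-frame power values for one utterance.
power = [1, 1, 3, 8, 9, 7, 3, 1, 1]
print(detect_span(power, 2))  # (2, 7): a low threshold keeps the weak edges
print(detect_span(power, 5))  # (3, 6): a high threshold keeps only the core
```

The two calls show the trade-off in miniature: the lower threshold captures the weak edges of the utterance, while the higher one keeps only its strong core.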

A large number of standard speech patterns (parameters) created in advance are registered in the speech dictionary 7. For each input speech pattern in the input speech buffer 6, the matching unit 8 retrieves the standard speech patterns from the speech dictionary 7 one after another, matches each against the input, and calculates the distance. From these results the matching result determination unit 9 finds the standard speech pattern with the minimum distance and outputs it as the recognition result.

Fig. 3 shows the relationship between the noise level in the input speech signal and the thresholds, where V_s denotes the input speech signal, E_n the noise level, and S1 and S2 the threshold levels.

The speech signal V_s in the figure shows the speech power level of the word "Funabashi", which has the strong sounds "na" and "ba" in the middle, with the weak sounds "fu" and "shi" before and after them.

The noise level E_n is higher than the power level of the weak sounds "fu" and "shi".

If the threshold level is set to S1, weak speech can be detected, but noise in a truly silent section may then be recognized as speech.

If, on the other hand, the threshold level is set to S2, above the noise level E_n, weak speech cannot be detected, which causes recognition failures. An appropriate threshold setting is therefore required.

[Problem that the invention seeks to solve]

As described above, the conventional speech section detection method detects speech sections inaccurately when the speech is input in a noisy environment, and this has been a major cause of reduced recognition rates.

[Means for solving problems]

To solve the above problem, the present invention first detects and fixes the high-power portion of the input speech, which is unaffected by noise. For the low-power portions before and after it, which are easily affected by noise, it applies section lengths measured in advance on the standard speech, and from these it computes and cuts out the entire speech section.

The invention is accordingly configured as follows. In a speech recognition device that recognizes speech by comparing and matching a plurality of standard speech patterns against an input speech pattern, two thresholds are provided: a first threshold capable of detecting low-power speech portions that are easily affected by noise, and a second threshold capable of detecting high-power speech portions that are unaffected by noise. For standard speech, the speech section is detected using the first threshold, and the difference between the section detected with the first threshold and the section detected with the second threshold is stored together with the standard speech pattern. For input speech to be recognized, the speech section is detected using only the second threshold. The difference sections stored with a standard speech pattern are added to both ends of the detected section, and the resulting portion is compared and matched against that standard speech pattern as the speech section of the input pattern.

[Action of the invention]

As shown in Fig. 4, the present invention has a threshold S1 for detecting low-level speech sections and a threshold S2 for detecting speech sections that are reliably above the noise level. First, a standard speech pattern is created in an environment where the noise level is lower than S1. At this time, threshold S1 is used to find the whole speech section, that is, the sum of the weak-power start portion t_s, the weak-power end portion t_e, and the strong-power central portion t_m. At the same time, the other threshold S2 is used to find the central portion t_m, and t_s and t_e are each obtained from the difference between the two.
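The dictionary-creation step can be sketched as below: both thresholds are applied to a clean recording, and the lengths of the weak start and end portions (written t_s and t_e here) fall out as the difference between the two detected sections. The function names, the frame-based units, and the sample power values are illustrative assumptions.

```python
def detect_span(frame_power, threshold):
    """Return (start, end), end exclusive, from the first to the last frame
    whose power reaches the threshold, or None if there is none."""
    above = [i for i, p in enumerate(frame_power) if p >= threshold]
    return (above[0], above[-1] + 1) if above else None


def measure_margins(frame_power, s1, s2):
    """Return (t_s, t_e) in frames for one clean standard utterance."""
    full = detect_span(frame_power, s1)  # start + centre + end portions
    core = detect_span(frame_power, s2)  # strong central portion only
    if full is None or core is None:
        raise ValueError("no speech detected")
    t_s = core[0] - full[0]  # length of the weak start portion
    t_e = full[1] - core[1]  # length of the weak end portion
    return t_s, t_e


# Hypothetical clean recording: weak edges around a strong centre.
clean = [0, 0, 2, 3, 8, 9, 8, 3, 2, 2, 0]
print(measure_margins(clean, s1=1, s2=5))  # (2, 3)
```

The pair returned here is what would be stored in the dictionary alongside the standard speech pattern.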

Next, in recognition processing, suppose the input speech to be recognized has a high noise level, as in Fig. 5, so that threshold S1 cannot separate the weak portions of the speech from the noise. Even then, threshold S2 can detect the central portion t_m of the strong speech section, which differs little from that of the standard speech. To the central portion t_m detected with S2 are added the maximum section in which the start point is expected to lie (the start valid section) and the maximum section in which the end point is expected to lie (the end valid section). This gives a deliberately long speech storage section, and the speech in that section is saved in a buffer.
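A minimal sketch of how the longer storage section might be computed. The text does not say how the maximum start and end sections are chosen; the sketch assumes they are the largest start and end margins found in the dictionary, and the name `storage_span` and the clamping to the recording length are likewise assumptions.

```python
def storage_span(core, margins, n_frames):
    """Extend the detected centre (start, end) by the largest start and end
    margins found in the dictionary, clamped to the recording length."""
    max_start = max(t_s for t_s, _ in margins)
    max_end = max(t_e for _, t_e in margins)
    start = max(0, core[0] - max_start)
    end = min(n_frames, core[1] + max_end)
    return start, end


# Hypothetical (t_s, t_e) pairs stored with three standard patterns.
margins = [(2, 3), (0, 1), (4, 2)]
print(storage_span((10, 20), margins, 100))  # (6, 23)
print(storage_span((2, 20), margins, 22))    # (0, 22)
```

Saving this wider span means every per-pattern section considered later is guaranteed to lie inside the buffer.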

To match this generously cut input speech against the standard speech, the speech section that defines the recognition range is determined as follows.

The t_s and t_e obtained for each standard speech pattern are added to the central portion t_m detected in the input speech with threshold S2, which fixes the start and end points of the speech section.

This speech section differs from one standard speech pattern to another.

In this way a recognition speech section is determined for each standard speech pattern, and matching and recognition are then performed.

[Embodiment]

The present invention is described in detail below with reference to an embodiment.

Fig. 1 is a block diagram of a device according to one embodiment of the present invention. In the figure, 11 is a microphone, 12 an amplifier, 13 a 16-channel filter, 14 a power calculation unit, 15 a section detection unit, 16 an input speech buffer, 17 a speech dictionary, 18 a matching unit, and 19 a matching result determination unit.

The basic functions of the device of this embodiment are almost the same as those of the conventional device shown in Fig. 2, but in accordance with the present invention the configuration and functions of the section detection unit 15, the input speech buffer 16, and the speech dictionary 17 are changed.

The device of this embodiment has two operating modes: a standard dictionary creation mode and a recognition processing mode.

First, in the standard dictionary creation mode, standard speech patterns are created and registered in the speech dictionary 17. In this mode the section detection unit 15 uses two thresholds: a low level S1, capable of detecting weak speech, and S2, a level higher than the environmental noise level normally expected.

The microphone 11 is placed in an environment where the noise level is lower than S1, and the prescribed standard speech is input.

The input speech signal is amplified by the amplifier 12, then split into bands by the filter 13, parameterized, and converted into a speech pattern.

The power calculation unit 14 calculates the power level of the input speech for each frame and supplies it to the section detection unit 15. The section detection unit 15 applies the two thresholds S1 and S2 to the power level of the input speech: with S1 it detects the speech section (t_s + t_m + t_e) shown in Fig. 4, and with S2 it detects the central portion t_m. From these it obtains the start portion t_s and the end portion t_e, and registers them, together with the input speech pattern, in the speech dictionary 17 via the input speech buffer 16.

Once all the standard speech has been registered in the speech dictionary 17, the device is switched to the recognition processing mode and recognition of unknown input speech begins.

In the recognition processing mode, the section detection unit 15 uses only the threshold S2. When unknown speech is input, the section detection unit 15 detects the power level of the input using threshold S2 and obtains the central portion t_m. Based on this t_m, it adds before and after it the start valid section and the end valid section described for Fig. 5 to obtain the speech storage section, and stores the input speech pattern of that range in the input speech buffer 16. At this time t_m is also stored as part of the data.

The matching unit 18 retrieves each standard speech pattern registered in the speech dictionary 17 in turn and matches it against the input speech pattern. The section of the input speech pattern taken from the input speech buffer 16 at this time is determined as follows. For each standard pattern, the section detection unit 15 retrieves the section data for the start portion t_s and the end portion t_e from the speech dictionary 17 and joins them to the central portion t_m stored earlier with the input speech pattern in the input speech buffer 16, giving the speech section (t_s + t_m + t_e) to be matched. It then cuts the speech pattern corresponding to this section out of the speech storage section held in the input speech buffer 16 and transfers it to the matching unit 18.
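The per-pattern cut-out and matching loop can be sketched as follows. The patent does not specify the distance measure; a mean city-block distance with linear time alignment is substituted here purely for illustration, and the function names, the tuple layout of the dictionary entries, and the toy two-channel frames are all assumptions.

```python
def distance(pattern, segment):
    """Mean city-block distance after linearly aligning segment to pattern.
    Both arguments are non-empty lists of equal-width frames."""
    total = 0.0
    for i, frame in enumerate(pattern):
        j = min(len(segment) - 1, i * len(segment) // len(pattern))
        total += sum(abs(x - y) for x, y in zip(frame, segment[j]))
    return total / len(pattern)


def recognize(buffer, core, dictionary):
    """buffer: frames of the storage section; core: (start, end) of the
    centre within buffer; dictionary: (label, pattern, t_s, t_e) entries.
    Returns (label, distance) of the closest standard pattern."""
    best = None
    for label, pattern, t_s, t_e in dictionary:
        start = max(0, core[0] - t_s)         # per-pattern start point
        end = min(len(buffer), core[1] + t_e) # per-pattern end point
        d = distance(pattern, buffer[start:end])
        if best is None or d < best[1]:
            best = (label, d)
    return best


# Toy data: noisy edge frames surround the utterance in the buffer.
buffer = [[3, 3], [1, 0], [9, 9], [9, 9], [0, 1], [3, 3]]
dictionary = [
    ("A", [[1, 0], [9, 9], [9, 9], [0, 1]], 1, 1),
    ("B", [[9, 9], [9, 9], [5, 5]], 0, 1),
]
print(recognize(buffer, (2, 4), dictionary))  # ('A', 0.0)
```

Because each entry carries its own t_s and t_e, the same buffered input is cut to a different section for each standard pattern before the distance is computed.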

The matching unit 18 sends the matching results (distance calculation results) between the input speech pattern and each standard pattern to the matching result determination unit 19, which makes the recognition decision.

[Effect of the invention]

According to the present invention, a substantially correct speech section can be detected even when the noise level is higher than the weak-power portions at the start or end of the speech, which reduces misrecognition caused by noise.

[Brief description of the drawings]

Fig. 1 is a block diagram of a device according to one embodiment of the present invention, Fig. 2 is a block diagram of a conventional device, Fig. 3 is an explanatory diagram of the relationship between noise level and threshold, Fig. 4 is an explanatory diagram of section detection processing when the speech dictionary is created, and Fig. 5 is an explanatory diagram of section detection processing during recognition. In the figures, 11 is a microphone, 12 an amplifier, 13 a filter, 14 a power calculation unit, 15 a section detection unit, 16 an input speech buffer, 17 a speech dictionary, 18 a matching unit, and 19 a matching result determination unit.

Patent applicant: Fujitsu Ltd. Agent: patent attorney Fumihiro Hasegawa (and one other)

Claims

[Claims]

A speech section detection method for a speech recognition device that recognizes speech by comparing and matching a plurality of standard speech patterns against an input speech pattern, characterized in that: a first threshold capable of detecting low-power speech portions that are easily affected by noise and a second threshold capable of detecting high-power speech portions that are unaffected by noise are provided; for standard speech, the speech section is detected using the first threshold, and the difference between the speech section detected with the first threshold and the speech section detected with the second threshold is stored together with the standard speech pattern; for input speech to be recognized, the speech section is detected using only the second threshold; and the portion obtained by adding, to both ends of the detected speech section, the difference sections between the first-threshold and second-threshold detections stored together with a standard speech pattern is compared and matched against that standard speech pattern as the speech section of the recognition input pattern.