JPH09222899A

JPH09222899A - Word voice recognizing method and device for implementing this method

Info

Publication number: JPH09222899A
Application number: JP8028921A
Authority: JP
Inventors: Yoshio Nakadai; 芳夫中台; Tetsutada Sakurai; 哲真桜井; Yutaka Nishino; 豊西野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1996-02-16
Filing date: 1996-02-16
Publication date: 1997-08-26

Abstract

PROBLEM TO BE SOLVED: To reduce the erroneous recognition caused by the addition of noises detected in the voice zone. SOLUTION: A partial standard pattern expressing the voice feature for each standard pattern is extracted in advance, the partial input pattern having the same timewise positional relation as the partial standard pattern is extracted from the start end while each position at each fixed time interval is assumed as the true start end of voice for the input voice pattern, and the pattern matching process is conducted between both partial patterns. The start end and terminal end positions of the true voice zone in the input voice pattern are determined from the position of the partial input pattern capable of obtaining the minimum value of the distance between both partial patterns, and matching is conducted between the true voice zone of the input voice pattern and the standard pattern.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a word voice recognition method and an apparatus for implementing the method, and more particularly to a word voice recognition method that takes voice input in word units and outputs a recognition result, and to an apparatus for implementing the method.

[0002]

[Prior Art] A conventional example is described with reference to the drawings. Voice recognition devices, which are used where an electrical appliance or other machine is operated by voice commands instead of by hand, have long been the subject of various research and development efforts.

[0003] Ideally, voice recognition technology would recognize, with 100% probability, an utterance of arbitrary length spoken by a person from an arbitrary place at an arbitrary time. To achieve such perfect recognition performance, however, the technology must accurately capture speech uttered at arbitrary timing in a real acoustic environment containing various kinds of noise. As a result, voice input processing requires a complex section detection algorithm to run continuously, one that repeatedly detects the start and end of speech within the observed signal, noise included, and excludes noise and other unwanted signal sections, so the amount of computation becomes enormous. For this reason, an isolated word voice recognition scheme, which detects the start and the end of speech only once each within a certain fixed period, is adopted as a simple voice recognition technique.

[0004] An isolated word voice recognition device is described with reference to FIG. 4. In FIG. 4, the voice input unit 1 is a microphone or similar device that converts speech into an electrical voice waveform and inputs it. The conversion unit 2 converts the voice waveform into digital values as preprocessing for voice analysis. The voice feature extraction unit 3 performs short-time spectral analysis of the voice waveform and extracts the features required for voice recognition at fixed time intervals, that is, for each short-time frame. The voice section detection unit 5 determines exactly one start point and one end point of the speech based on the voice feature values obtained from the voice feature extraction unit 3. The start switch unit 4 supplies an external trigger to begin start-point detection for voice section detection. The input pattern storage unit 6 takes in, from the voice feature extraction unit 3, the voice feature values from the voice start to the voice end determined by the voice section detection unit 5 and holds them as the input voice pattern. The standard pattern storage unit 7 stores a plurality of word voice patterns used for voice recognition, each obtained by the same procedure as for the input pattern storage unit 6 and given a label name. The pattern matching unit 8 performs matching between the unknown input voice pattern stored in the input pattern storage unit 6 and each standard pattern stored in the standard pattern storage unit 7, and outputs the resulting distance value between the input voice pattern and each standard pattern. The distance comparison unit 9 accumulates and compares the distance values output by the pattern matching unit 8 for each matched standard pattern, and thereby obtains the smallest distance value among the matching results of the standard patterns for one unknown input pattern. The result output unit 10 outputs the label name of the standard pattern having the smallest distance value, among the distance values output by the distance comparison unit 9, to the host that operates the voice recognition device.

[0005] The operation of the voice recognition device of FIG. 4 is described below. It is assumed that the standard patterns, analyzed and prepared in the same way as the input voice pattern, have been registered in advance in the standard pattern storage unit 7. Speech is continuously input and analyzed through the voice input unit 1, the conversion unit 2, and the voice feature extraction unit 3 at fixed time intervals of about 10 to 30 msec, that is, for each short-time frame, and part of the analysis result, for example the logarithmic power of the voice signal, is sent to the voice section detection unit 5 as information for voice section detection. Suppose now that the start switch unit 4 is driven by the speaker or by the host operating the voice recognition device, generating a trigger to start voice section detection. The voice section detection unit 5 is then initialized and begins detecting the voice start point in the information input from the voice feature extraction unit 3. One method of detecting the voice start point, for example, is to take the rising position of the signal power value as the start point when the signal power value rises from a no-speech state and stays at or above a certain threshold for a certain time. The voice section detection unit 5 then detects the decay point of the signal power value, takes it as the voice end point, and finishes its operation. The analysis results of the voice feature extraction unit 3 for the section from the detected start to the detected end are stored in the input pattern storage unit 6 as the input voice pattern. When storage is complete, the pattern matching unit 8 compares the input voice pattern stored in the input pattern storage unit 6 against each standard pattern stored in the standard pattern storage unit 7 by DP matching or another pattern matching method, and calculates the distances. The distance comparison unit 9 sorts the distance results for the standard patterns in ascending order, and the label name of the standard pattern with the smallest distance value is output to the host via the result output unit 10.
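The power-based start and end detection described in this paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not the device's actual implementation: the function name `detect_endpoints`, the parameters `threshold` and `min_frames`, and the choice of the first sub-threshold frame as the "decay point" are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def detect_endpoints(
    log_power: Sequence[float], threshold: float, min_frames: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of a single voice section, end exclusive.

    Assumed rule: the start is the rising position of the first run that stays
    at or above `threshold` for at least `min_frames` frames; the end is the
    first later frame that falls below `threshold`.
    """
    run_start = None
    for i, p in enumerate(log_power):
        if p >= threshold:
            if run_start is None:
                run_start = i              # candidate rising position
            if i - run_start + 1 >= min_frames:
                break                      # run lasted long enough: start confirmed
        else:
            run_start = None               # run too short, discard it
    else:
        return None                        # no sufficiently long run was found

    end = run_start
    while end < len(log_power) and log_power[end] >= threshold:
        end += 1                           # advance to the assumed decay point
    return run_start, end
```

With per-frame log power values, a short spike below `min_frames` is ignored, and only the first sustained run is reported, which matches the "one start, one end" behaviour of an isolated word recognizer.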

[0006] However, such an isolated word voice recognition device needs a technique for detecting the voice section accurately. One approach to voice section detection feeds all the information obtained by the voice feature extraction unit into a neural network or other filtering unit so as to extract only the voice section accurately, but the computation involved is large-scale and requires an amount of calculation not much different from that of so-called sentence voice recognition. Therefore, when voice section detection is to be carried out with a small amount of computation, the voice section is generally detected from simple information such as voice power and the zero-crossing count. In addition, to prevent part of the voice section from going undetected at the detection stage, a method can be adopted in which a short noise section sandwiched between two voice sections is detected together with them as a single voice section.

[0007] FIG. 5 is a schematic diagram of speech cut out by such a voice section detection method. In this example the voice section is cut out based on the voice signal power. In FIG. 5, section (1) is noise picked up when the lips begin to move, generally called lip noise. Section (2) is the true voice section that was intended to be detected. Section (3) is breath sound picked up after the utterance, and section (4) is ambient noise or transmission noise arising between the voice input unit 1 and the conversion unit 2. FIG. 5 shows that when the voice input unit 1 is close to the speaker's mouth, as with a telephone handset, breath sounds or lip noise may accompany the true voice section before and after it, and ambient noise or noise from the transmission system may also be mistakenly judged to be part of the voice section. When unwanted signal sections are attached to the true voice section in this way, pattern matching is carried out with those unwanted sections included, so even if the standard pattern is identical to the true voice section, the two patterns disagree and a voice recognition error results. A section detection state that causes a recognition error in this way is generally called a voice section detection error.

[0008] Misrecognition due to voice section detection errors must be avoided by adjusting the input voice pattern. The reason is that a standard pattern is either created from careful utterances by the speaker so as to obtain the best recognition rate or generated automatically on a computer, so in most cases it is free of section detection errors, whereas an input voice pattern comes from speech recorded in a real environment, so the cause and form of section detection errors differ from utterance to utterance. The voice recognition device also needs an avoidance technique that works effectively even in unknown acoustic environments that cannot be anticipated in advance.

[0009] A technique called word spotting is used to avoid misrecognition caused by section detection errors in the input speech. A section thought to contain speech is first roughly detected in the input, each position at fixed time intervals within that section is treated as a candidate start of the true input speech, end-free pattern matching against the standard pattern is repeated from each such position, and the minimum distance value obtained is taken as the matching result between the two patterns. However, this method repeats the matching over the whole length of the roughly cut-out voice section, so the amount of computation becomes enormous.

[0010]

[Problems to Be Solved by the Invention] As described above, a word voice recognition device intended to run with a small amount of computation simplifies its voice section detection, so its detection results can include unwanted signal sections before and after the true voice section, and correct recognition results cannot be obtained for such speech. Solving this problem requires algorithms whose amount of computation is enormous.

[0011] The effect of word spotting, which is regarded as an effective way of avoiding misrecognition caused by section detection errors, can be examined as follows. FIG. 6 illustrates an example of word spotting. In this technique, the input voice pattern detected as a long section (horizontal axis) and the shorter standard pattern (vertical axis) are matched end-free, starting from each position at fixed time intervals in the input voice pattern, and a distance value is calculated. As the hatched area in FIG. 6 shows, however, the pattern matching covers the entire region where the two patterns intersect, which requires an enormous amount of computation. The distance value calculated as the matching result varies so as to take a local minimum in the partial section most similar to the standard pattern, as in the graph in the upper part of FIG. 6. Word spotting is effective because the pattern matching itself doubles as voice section detection, relying on the sufficient condition that the section where the distance value reaches a local minimum is the true voice section.
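To make the cost of this conventional scheme concrete, the following is a minimal sketch of word spotting with end-free DP matching. It is an illustration under assumptions, not the procedure of FIG. 6 itself: the names `frame_dist`, `end_free_dtw`, and `word_spot`, the Euclidean frame distance, the three-way DP step, the unnormalized cumulative distance, and the search window of twice the standard pattern length are all hypothetical choices.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def end_free_dtw(ref, seg):
    """DP matching with the start fixed at (0, 0) and the end free on the input side.

    Returns the smallest cumulative distance at which the last frame of `ref`
    is aligned with any frame of `seg`.
    """
    inf = float("inf")
    prev = None
    for i, r in enumerate(ref):
        cur = []
        for j, s in enumerate(seg):
            if i == 0 and j == 0:
                best = 0.0
            else:
                up = prev[j] if i > 0 else inf
                left = cur[j - 1] if j > 0 else inf
                diag = prev[j - 1] if i > 0 and j > 0 else inf
                best = min(up, left, diag)
            cur.append(frame_dist(r, s) + best)
        prev = cur
    return min(prev)


def word_spot(ref, inp, step=1):
    """Try every `step`-th input frame as the start and keep the best match."""
    if not ref:
        raise ValueError("standard pattern must not be empty")
    best_start, best_dist = None, float("inf")
    for start in range(0, len(inp), step):
        seg = inp[start:start + 2 * len(ref)]  # assumed window length
        dist = end_free_dtw(ref, seg)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist
```

Every candidate start runs a full DP table against the standard pattern, so the work grows with the product of the number of start positions, the standard pattern length, and the window length. That product is the hatched area the paragraph refers to.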

[0012] Pattern matching, as typified by DP matching, can absorb fluctuations at the start and end of the speech and fluctuations in time expansion and contraction between the patterns, provided the voice section can be roughly estimated. Exploiting this property, if the voice section is roughly estimated by a procedure similar to word spotting using only a partial section of the standard pattern in which the features of the speech appear, and pattern matching is then executed between the entire standard pattern and the estimated partial section of the input voice pattern, voice recognition performance comparable to conventional word spotting can be obtained with less computation than word spotting.

[0013] The present invention provides a voice recognition device that, even when the section detection result has unwanted signal sections attached before and after the input speech, estimates the voice section with a simple word spotting algorithm using a partial section of the standard pattern in which the features of the speech appear and then performs matching, thereby recognizing speech accurately without incurring an enormous amount of computation.

[0014]

[Means for Solving the Problems] A word voice recognition method is constituted in which a partial section highly similar to a partial section of each standard pattern to be compared is derived from the input voice pattern, the true voice section is estimated using it as a clue, and matching is performed. Further, in that word voice recognition method, for each standard pattern to be compared, a partial standard pattern, which is a partial section in which the features of the speech appear, is extracted in advance; for the input voice pattern to be recognized, each position at fixed time intervals is assumed to be the start of the true speech, and a partial input pattern, which is a partial section having the same temporal positional relationship from that start as the partial standard pattern, is extracted; pattern spotting is performed between the two partial patterns to determine the start and end positions of the true voice section at which the distance between the two partial patterns reaches a local minimum; and matching is performed between the true voice section of the input voice pattern and the standard pattern.

[0015] A word voice recognition device is constituted that comprises: a voice input unit 1 that inputs a voice signal; a voice feature extraction unit 3 that extracts a voice feature pattern from the input voice signal; a voice section detection unit 5 that detects a voice section based on the voice feature pattern information output by the voice feature extraction unit 3; an input voice pattern storage unit 6 that fixes the start and end of the voice section based on the voice section detection result and stores the voice feature pattern of the section between them; a standard pattern storage unit 7 that stores the standard patterns used for voice recognition; a partial standard pattern extraction unit 11 that extracts, from each stored standard pattern, a partial section pattern in which the features of the speech appear; a partial input pattern extraction unit 12 that, taking each position at fixed time intervals in the input voice pattern as a start, extracts a partial section pattern having the same temporal relationship as the partial section pattern of the standard pattern; a partial pattern spotting unit 13 that performs pattern matching between the partial section of the standard pattern and the partial section of the input voice pattern; a section position determination unit 14 that, from the matching result of the partial pattern spotting unit 13, determines the position where the distance value between the partial section pattern and the input voice pattern reaches a local minimum as the start and end positions of the true voice section; a pattern matching unit 8 that performs pattern matching between the standard pattern and the input voice pattern based on the positional information obtained from the section position determination unit 14 and outputs a distance value; a distance comparison unit 9 that accumulates the distance values output as matching results between each standard pattern and the input voice pattern and identifies the standard pattern with the smallest distance value; and a result output unit 10 that outputs the label name of the standard pattern judged to have the smallest distance value.

[0016]

[Embodiments of the Invention] In the present invention, each standard pattern used for voice recognition is first registered, and a partial section in which the features of the speech appear is extracted from each standard pattern. Next, the speech to be recognized is input, and the voice section is detected from simple information such as signal power to obtain the input voice pattern. Then, frame by frame at fixed intervals from the start to the end of the detected input voice pattern, a partial section having the same temporal positional relationship as the partial standard pattern is extracted, simple pattern matching is performed against the previously extracted partial section of each standard pattern, and the cumulative distance value between the two partial section patterns is obtained. From this pattern matching, the position of the partial section of the input voice pattern where the cumulative distance value reaches a local minimum can be estimated. Based on this start position information, and assuming the section length equals that of the standard pattern being compared, the section of the input voice pattern to be matched is identified, matching is performed between the entire standard pattern and the identified portion of the input pattern, and a distance value is obtained. By repeating these steps for each standard pattern, the standard pattern with the smallest cumulative distance value is obtained as the recognition result, and recognition errors caused by voice section detection errors can be avoided.
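The section estimation step of this paragraph can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: the names `frame_dist` and `estimate_segment`, the parameters `part_offset` and `part_len` (where the characteristic partial section sits inside the standard pattern), the Euclidean frame distance, and the use of a plain frame-by-frame sum as the "simple pattern matching" are all hypothetical. The text does not say how the partial section is chosen, so the sketch takes it as given.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_segment(inp, ref, part_offset, part_len, step=1):
    """Estimate where the true voice section lies inside `inp`.

    Each candidate start `s` is assumed to be the true start of speech. The
    partial input pattern is taken at the same offset from `s` as the partial
    standard pattern has from the start of `ref`, and the two are compared
    frame by frame without time warping.

    Returns (start, end, distance) with `end` exclusive, or None if `inp` is
    too short to hold the partial pattern.
    """
    part_ref = ref[part_offset:part_offset + part_len]
    best_start, best_dist = None, float("inf")
    last_start = len(inp) - part_offset - part_len
    for s in range(0, last_start + 1, step):
        part_in = inp[s + part_offset:s + part_offset + part_len]
        dist = sum(frame_dist(a, b) for a, b in zip(part_ref, part_in))
        if dist < best_dist:
            best_start, best_dist = s, dist
    if best_start is None:
        return None
    # Section length is assumed equal to the standard pattern length.
    end = min(len(inp), best_start + len(ref))
    return best_start, end, best_dist
```

Compared with full word spotting, each candidate start costs only `part_len` frame distances instead of a whole DP table, and the full match against the standard pattern is then run once on `inp[start:end]`.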

[0017]

[Example] An embodiment of the present invention is described with reference to FIG. 1. In FIG. 1, the voice input unit 1 is where speech is input, and uses an audio microphone and an audio input terminal. The conversion unit 2 converts the voice waveform into digital values as preprocessing for voice analysis. The voice feature extraction unit 3 performs short-time spectral analysis of the voice waveform obtained by the conversion unit 2 and extracts the feature values required for voice recognition at fixed time intervals of about 10 to 30 msec, that is, for each short-time frame; analysis methods such as short-time logarithmic power analysis and cepstrum analysis are used. The start switch unit 4 supplies the trigger to begin start-point detection in the voice section detection needed for isolated word voice recognition. The voice section detection unit 5 determines exactly one start point and one end point of the speech based on the voice feature values obtained from the voice feature extraction unit 3. As the detection method, the noise level before the utterance can be measured in advance, and the section in which signal components with a logarithmic power at or above a threshold derived from that noise level persist within a certain time can be taken as the voice section. Further, when two sections exceeding the threshold exist on either side of a short section below the threshold, the three sections can be treated together as one partial section. The input pattern storage unit 6 takes in, from the voice feature extraction unit 3, the voice feature values from the voice start to the voice end determined by the voice section detection unit 5 and holds them as the input voice pattern. The standard pattern storage unit 7 stores a plurality of word voice patterns used for voice recognition, each obtained by the same procedure as for the input pattern storage unit 6 and given a label name.
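The noise-relative threshold and the merging of sections separated by a short gap can be sketched as below. This is a rough illustration under assumptions: the function name `speech_runs`, the parameters `noise_frames`, `margin`, and `max_gap`, and the rule "threshold = mean noise log power + margin" are hypothetical, since the text only says the threshold is derived from the measured noise level.

```python
def speech_runs(log_power, noise_frames, margin, max_gap):
    """Find above-threshold runs and merge runs separated by short gaps.

    The first `noise_frames` frames are assumed to contain only background
    noise. Returns a list of (start, end) pairs with `end` exclusive.
    """
    if noise_frames <= 0 or noise_frames > len(log_power):
        raise ValueError("noise_frames must lie within the signal")
    noise_level = sum(log_power[:noise_frames]) / noise_frames
    threshold = noise_level + margin       # assumed derivation of the threshold

    # Collect every run of frames at or above the threshold.
    runs = []
    start = None
    for i, p in enumerate(log_power):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(log_power)))

    # Merge neighbouring runs when the gap between them is at most max_gap frames.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged
```

Merging keeps a word with a brief internal dip in power from being split in two, at the price of occasionally pulling lip noise or breath sound into the detected section, which is the situation the partial pattern spotting is meant to handle.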

【００１８】この発明により付加される部分標準パター
ン抽出部１１は、音声の特徴が現われている部分区間で
ある部分標準パターンを標準パターン記憶部７より抽出
し、後で説明される部分パターンスポッティング部１３
に供給するところであり、マッチングに使用する。同様
にこの発明により付加される部分入力パターン抽出部１
２は、入力パターン格納部６より入力された音声パター
ンについて、始端から終端まで一定間隔で位置をずらし
ながら、部分標準パターン抽出部１１により抽出したも
のと同様の部分区間である部分入力パターンを抽出する
ところである。この発明により付加される部分パターン
スポッティング部１３は、部分標準パターン抽出部１１
より出力された部分区間パターンと部分入力パターン抽
出部１２より出力された部分区間パターンとの間で簡単
なパターンマッチングを実行し、両部分区間パターン間
の距離値を出力するところである。区間位置決定部１４
もこの発明により付加される構成であり、部分パターン
スポッティング部１３から出力される距離値を部分入力
パターンの抽出位置毎に蓄積および比較し、距離値が極
小値となる入力パターンの位置を特定し、更に比較対象
とされた標準部分パターンを参照して、パターンマッチ
ング部８において照合を行う入力音声パターン区間の位
置を特定するところである。パターンマッチング部８
は、入力パターン格納部６に格納された未知の入力音声
パターンと標準パターン記憶部７に格納されている各標
準パターンとの間において、区間位置決定部１４の情報
に基づいて位置合わせしてパターンマッチングを実行
し、入力音声パターンとの間の距離値を出力するところ
であり、そのパターンマッチングの手法としては、音声
認識のパターンマッチング法としてよく知られているＤ
Ｐマッチング法を採用することができる。距離比較部９
はパターンマッチング部８の出力する距離値を、マッチ
ングする各標準パターン毎に蓄積および比較し、その結
果一つの未知入力パターンに対する各標準パターンのマ
ッチング結果の内から最小の距離値を得るところであ
る。結果出力部１０は距離比較部９より出力された距離
値の内の最も小さい距離値を有する標準パターンを導出
し、その標準パターンのラベル名を音声認識装置を動作
させる上位ホストへ出力するところである。The partial standard pattern extraction unit 11, added by the present invention, extracts from each standard pattern in the standard pattern storage unit 7 a partial standard pattern, that is, a partial section in which a voice feature appears, and supplies it to the partial pattern spotting unit 13, described later, for matching. Similarly, the partial input pattern extraction unit 12, added by the present invention, extracts a partial input pattern covering the same partial section as that extracted by the partial standard pattern extraction unit 11, while shifting the extraction position along the voice pattern supplied from the input pattern storage unit 6 from its start end to its terminal end at regular intervals. The partial pattern spotting unit 13, added by the present invention, performs simple pattern matching between the partial section pattern output by the partial standard pattern extraction unit 11 and the partial section pattern output by the partial input pattern extraction unit 12, and outputs the distance value between the two partial section patterns. The section position determination unit 14 is also added by the present invention. It accumulates and compares the distance values output from the partial pattern spotting unit 13 for each extraction position of the partial input pattern, and identifies the position in the input pattern at which the distance value is minimum. By further referring to the partial standard pattern being compared, it specifies the position of the input voice pattern section to be matched in the pattern matching unit 8. The pattern matching unit 8 performs matching between the unknown input voice pattern stored in the input pattern storage unit 6 and each standard pattern stored in the standard pattern storage unit 7, based on the information from the section position determination unit 14, and outputs the distance value to the input voice pattern. As the pattern matching method, the DP matching method, which is well known as a pattern matching method for voice recognition, can be adopted. The distance comparison unit 9 accumulates and compares the distance values output by the pattern matching unit 8 for each standard pattern matched, and thereby obtains the minimum distance value among the matching results of all standard patterns for one unknown input pattern. The result output unit 10 identifies the standard pattern having the smallest distance value among the distance values output from the distance comparison unit 9, and outputs the label name of that standard pattern to the host that operates the voice recognition apparatus.
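The roles of the distance comparison unit 9 and the result output unit 10 can be illustrated with a minimal sketch. The function name `select_label` and the dictionary layout are assumptions made for illustration only; the source does not specify any data structure.

```python
def select_label(distances):
    """Pick the standard pattern with the smallest distance value.

    `distances` is a hypothetical mapping from a standard pattern's label
    name to the distance value produced by the pattern matching stage.
    Returns (best_label, best_distance, ranked), where `ranked` lists all
    (label, distance) pairs in ascending order of distance.
    """
    if not distances:
        raise ValueError("no standard patterns were matched")
    # Distance comparison: arrange results in order of increasing distance.
    ranked = sorted(distances.items(), key=lambda item: item[1])
    # Result output: the label of the closest standard pattern.
    best_label, best_distance = ranked[0]
    return best_label, best_distance, ranked
```

In this sketch the ranked list is kept alongside the winner, which mirrors the description of results being arranged in order of distance before the label is output.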

【００１９】以下、図１の音声認識装置の動作について
説明する。標準パターンは入力音声パターンと同様に分
析され整備されたものが標準パターン記憶部７に予め登
録されているものとする。音声は、常時、音声入力部
１、変換部２、音声特徴抽出部３を介して１０〜３０ｍ
ｓｅｃ程度の一定時間間隔、即ち短時間フレーム毎に入
力および分析され、その分析結果の一部の情報、例えば
音声信号の対数パワーは音声区間検出部５に送られ、音
声区間検出の情報とされる。ここで、発声者或は音声認
識装置を動作させる上位ホストの操作により起動スイッ
チ部４を駆動し、音声区間検出開始のトリガが発生した
ものとする。これにより音声区間検出部５は初期化さ
れ、音声特徴抽出部３から入力する情報について音声始
端の検出を開始する。音声始端の検出方法としては、例
えば、信号パワー値が音声のない状態から或る一定閾値
以上の大きな値で一定時間継続したときにその信号パワ
ー値の立ち上がり位置を始端とする方法がある。この
後、音声区間検出部５は音声の信号パワー値の減衰点を
検出してこれを音声の終端とし、動作を終了する。この
様にして検出された音声の始端から終端に到る区間につ
いて音声特徴抽出部３の分析結果を、入力パターン格納
部６に入力音声パターンとして格納する。The operation of the voice recognition apparatus shown in FIG. 1 will be described below. It is assumed that standard patterns, analyzed and prepared in the same manner as the input voice pattern, are registered in the standard pattern storage unit 7 in advance. Voice is continuously input and analyzed through the voice input unit 1, the conversion unit 2, and the voice feature extraction unit 3 at fixed time intervals of about 10 to 30 msec, that is, for each short-time frame. Part of the analysis result, for example the logarithmic power of the voice signal, is sent to the voice section detection unit 5 and used as information for voice section detection. Here, it is assumed that the start switch unit 4 is driven by an operation of the speaker or of the host that operates the voice recognition apparatus, and that a trigger for starting voice section detection is generated. The voice section detection unit 5 is thereby initialized and begins detecting the voice start end in the information input from the voice feature extraction unit 3. One method of detecting the voice start end is, for example, to take the rising position of the signal power value as the start end when, from a state without voice, the signal power value remains at or above a certain threshold for a certain time. Thereafter, the voice section detection unit 5 detects the attenuation point of the signal power value of the voice, takes it as the terminal end of the voice, and ends its operation. The analysis results of the voice feature extraction unit 3 for the section from the start end to the terminal end detected in this way are stored in the input pattern storage unit 6 as the input voice pattern.
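The power-based start and end detection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the parameters `threshold` and `min_frames`, and the rule that the terminal end is the first frame whose power falls back below the same threshold are all hypothetical simplifications, since the source only speaks of an "attenuation point".

```python
def detect_voice_section(log_power, threshold, min_frames):
    """Return (start, end) frame indices (end exclusive), or None.

    start: first frame of the first run of at least `min_frames`
           consecutive frames whose power is >= `threshold`.
    end:   first later frame whose power drops below `threshold`
           (assumed stand-in for the attenuation point), or the
           end of the data if the power never drops.
    """
    run_start = None
    start = None
    for i, power in enumerate(log_power):
        if power >= threshold:
            if run_start is None:
                run_start = i            # candidate rising position
            if i - run_start + 1 >= min_frames:
                start = run_start        # run lasted long enough
                break
        else:
            run_start = None             # short burst: discard it
    if start is None:
        return None
    end = len(log_power)
    for i in range(start + min_frames, len(log_power)):
        if log_power[i] < threshold:
            end = i
            break
    return start, end
```

A short burst that does not last `min_frames` frames is ignored in this sketch, but any sustained sound, including a filler word or breath noise, is accepted as voice, which is the weakness the partial pattern spotting is meant to address.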

【００２０】入力パターン格納部６に対する入力音声パ
ターンの格納が完了すると、この入力音声パターンと登
録されている各標準パターンとの間のマッチングが開始
される。先ず、部分標準パターン抽出部１１において、
標準パターンより音声の特徴が現われている部分区間で
ある部分標準パターンを抽出して部分パターンスポッテ
ィング部１３に供給する。部分標準パターンの抽出の仕
方を図２を参照して説明する。標準パターン全体を図２
（ａ）の通りとする。説明を簡略化するために対数音声
パワーのみにより音声波形を表記している。部分区間の
抽出例としては、（ｂ）の斜線により示される様な音声
の特徴が現われている１つ以上の部分区間、（ｃ）の斜
線により示される始端および終端の短い部分区間、
（ｄ）の斜線により示される、例えば、標準パターン全
長の１／４、３／４、にあたる位置その他の、図形的に
計算の容易な位置の部分区間を採用すると好適である。
この場合、部分標準パターンの相互位置関係は元の標準
パターン区間上の位置関係を保持したまま、即ち時間伸
縮は考慮しないものとする。When storage of the input voice pattern in the input pattern storage unit 6 is completed, matching between this input voice pattern and each registered standard pattern is started. First, the partial standard pattern extraction unit 11 extracts from the standard pattern a partial standard pattern, that is, a partial section in which the characteristics of the voice appear, and supplies it to the partial pattern spotting unit 13. The method of extracting the partial standard pattern will be described with reference to FIG. 2. Let the entire standard pattern be as shown in FIG. 2(a). To simplify the explanation, the voice waveform is represented by logarithmic voice power only. Suitable examples of partial sections are: one or more partial sections in which the characteristics of the voice appear, as indicated by hatching in (b); short partial sections at the start end and the terminal end, as indicated by hatching in (c); and partial sections at positions that are easy to compute geometrically, for example positions corresponding to 1/4 and 3/4 of the total length of the standard pattern, as indicated by hatching in (d). In this case, the partial standard patterns retain the mutual positional relationship they have on the original standard pattern section; that is, time expansion and contraction is not considered.
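As an illustration of the extraction at the 1/4 and 3/4 positions, the following minimal sketch computes the frame ranges of two partial sections and cuts them out of a pattern. The function names, the `section_length` parameter, and the choice to centre each section on its position are assumptions; the source gives only the positions, not the section width or alignment.

```python
def quarter_sections(pattern_length, section_length):
    """Return [(begin, end), ...] frame ranges (end exclusive) centred
    near 1/4 and 3/4 of a pattern that is `pattern_length` frames long."""
    sections = []
    for numerator in (1, 3):
        centre = pattern_length * numerator // 4
        begin = max(0, centre - section_length // 2)
        end = min(pattern_length, begin + section_length)
        sections.append((begin, end))
    return sections


def extract_partial_pattern(pattern, sections):
    """Cut the given sections out of a pattern (a list of frames).

    The sections keep their original frame offsets, so no time
    expansion or contraction is applied.
    """
    return [pattern[begin:end] for begin, end in sections]
```

Because the same `(begin, end)` offsets can be reused for the input side, the temporal relationship between the partial sections is preserved by construction.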

【００２１】次に、部分入力パターン抽出部１２におい
て、入力パターン格納部６に格納した入力音声パターン
に対して、パターンの始端から終端に到る区間をフレー
ム単位の一定時間間隔如に細分化したときの各フレーム
を真の入力音声パターンの始端と仮定し、始端より部分
標準パターンと同一位置の部分パターンである部分入力
パターンを抽出する。この様に、部分標準パターン抽出
部１１において抽出された部分標準パターンと部分入力
パターン抽出部１２において抽出された部分入力パター
ンとの間のマッチングを部分パターンスポッティング部
１３において実行する。各部分区間に対するマッチング
の方法は、例えば、ケプストラムのユークリッド距離の
累積距離値を使用する。この距離計算結果は、区間位置
決定部１４へ送出されるが、区間位置決定部１４におい
ては、部分入力パターン抽出部１２で走査する入力音声
パターンの仮定始端毎に累積距離値の推移を見る。この
推移の例を図３に示す。図３において、部分標準パター
ンと入力音声パターンとの間で部分パターンスポッティ
ングのためにスペクトル距離計算を必要とする領域は図
３の中央の枠の内の斜線部のみであり、図６の場合と比
較して、計算領域は明らかに小さい領域で済んでいるこ
とが判る。部分パターンスポッティングの結果、部分区
間同士の距離値は図３の上部に示される様に推移する
が、標準パターン全長と、入力音声パターンの内の真の
入力音声部分とがほぼ合致する位置関係を取ったときに
距離値は極小値となる。区間位置決定部１４において
は、この極小値を取ったときの入力音声パターンの仮定
始端を真の音声区間に対する始端と決定し、その情報を
パターンマッチング部８へ送出する。パターンマッチン
グ部８は、区間決定部１４から送出された部分区間位置
情報を入力し、標準パターン記憶部７より供給される標
準パターンの全長と入力パターン記憶部６より供給され
る入力音声パターンとについて、区間決定部１４により
判断された部分区間位置情報に基づいてマッチングを行
う。このとき、入力音声の区間長は、標準パターン区間
長と同一とする。マッチング結果は距離比較部９におい
て各標準パターンについて蓄積されると共に、小さい距
離値の順に整理され、結果出力部１０へ送出される。最
も小さい距離値を取った標準パターンのラベル名が結果
出力部１０を介して上位ホストへ出力される。なお、各
標準パターンと入力音声パターンとの比較において、入
力音声パターン長が標準パターン長より短い場合があ
る。この場合は、区間位置決定部１４の判断により入力
音声パターンの全長と標準パターンの全長とをパターン
マッチングする様にパターンマッチング部８に指示す
る。Next, the partial input pattern extraction unit 12 subdivides the input voice pattern stored in the input pattern storage unit 6, from its start end to its terminal end, at fixed time intervals in frame units. Each frame is in turn assumed to be the start end of the true input voice pattern, and a partial input pattern, that is, a partial pattern at the same position relative to that start end as the partial standard pattern, is extracted. The partial pattern spotting unit 13 then performs matching between the partial standard pattern extracted by the partial standard pattern extraction unit 11 and the partial input pattern extracted by the partial input pattern extraction unit 12. The matching for each partial section uses, for example, the cumulative value of the Euclidean distance between cepstra. The distance calculation result is sent to the section position determination unit 14, which observes the transition of the cumulative distance value over the assumed start ends of the input voice pattern scanned by the partial input pattern extraction unit 12. An example of this transition is shown in FIG. 3. In FIG. 3, the region in which spectral distance calculation is required for partial pattern spotting between the partial standard pattern and the input voice pattern is only the hatched portion inside the central frame of FIG. 3; compared with the case of FIG. 6, the calculation region is clearly much smaller. As a result of the partial pattern spotting, the distance value between the partial sections changes as shown in the upper part of FIG. 3, and it reaches a local minimum when the full length of the standard pattern and the true voice portion of the input voice pattern are positioned so that they almost coincide.
The section position determination unit 14 takes the assumed start end of the input voice pattern at which this minimum is obtained as the start end of the true voice section, and sends that information to the pattern matching unit 8. The pattern matching unit 8 receives the section position information sent from the section position determination unit 14 and, based on it, performs matching between the full length of the standard pattern supplied from the standard pattern storage unit 7 and the input voice pattern supplied from the input pattern storage unit 6. At this time, the section length of the input voice is taken to be the same as the section length of the standard pattern. The matching results are accumulated for each standard pattern in the distance comparison unit 9, arranged in ascending order of distance value, and sent to the result output unit 10. The label name of the standard pattern having the smallest distance value is output to the host via the result output unit 10. Note that, when each standard pattern is compared with the input voice pattern, the input voice pattern may be shorter than the standard pattern. In this case, the section position determination unit 14 instructs the pattern matching unit 8 to perform pattern matching between the full length of the input voice pattern and the full length of the standard pattern.
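The scan over assumed start ends can be sketched as follows. This is a simplified illustration, not the patented implementation: the names `frame_distance` and `spot_start`, the representation of a pattern as a list of feature vectors, and the handling of the short-input case by returning start 0 with an empty distance list are assumptions.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (e.g. cepstra)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def spot_start(input_pattern, standard_pattern, sections):
    """Find the assumed start end that minimises the partial distance.

    `sections` lists (begin, end) frame ranges, end exclusive, measured
    from the start of the standard pattern. Returns (best_start,
    distances), where distances[s] is the cumulative distance obtained
    when frame s of the input is assumed to be the true start end.
    """
    n_std = len(standard_pattern)
    n_in = len(input_pattern)
    if n_in < n_std:
        # Input shorter than the standard pattern: no scan is possible;
        # the whole input is matched against the whole standard pattern.
        return 0, []
    distances = []
    for start in range(n_in - n_std + 1):
        total = 0.0
        for begin, end in sections:
            for k in range(begin, end):
                # Same offset k on both sides: no time warping.
                total += frame_distance(input_pattern[start + k],
                                        standard_pattern[k])
        distances.append(total)
    best_start = min(range(len(distances)), key=distances.__getitem__)
    return best_start, distances
```

The true voice section would then be taken as frames `best_start` to `best_start + len(standard_pattern)` of the input, and only that section is passed to the full pattern matching. Only the frames inside `sections` contribute to the scan, which is why the calculation region is small compared with spotting over the whole word.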

【００２２】以上のアルゴリズムについて、実際の音声
に対して実験した結果を説明する。認識対象は文献「音
声認識用共通音声データ」（著者板橋、音響学会予稿
集、１９８５年発表）に記述された日本都市名１００単
語中上位２０単語を男性話者１名が発声したものであ
る。音声は電話帯域（３００Ｈｚ〜３. ４ｋＨｚ）のフ
ィルタを通して８ｋＨｚでＡ／Ｄ変換され、１６ｍｓｅ
ｃ毎に３２ｍｓｅｃ長の短時間フレームについてＬＰＣ
ケプストラム分析が実行される。音声区間検出は短時間
対数パワーに着目して行った。発声においては、この発
明の音声認識手法の有効性を明確化するために（１）認
識語彙の前に「えー」を付随させて発声する、（２）認
識語彙の後に「です」を付随させて発声する、（３）認
識語彙の前後に「えー」および「です」をそれぞれ付随
させて発声する、のスタイルで発声させた。部分標準パ
ターンの抽出方法は、図２（ｄ）の様に標準パターンの
始端から１／４および終端から１／４、即ち始端から３
／４の２箇所の区間のみに抽出する方法を採用した。そ
の結果、音声区間を一つに固定する従来の認識手法にお
いて、（１）、（２）、（３）の各発声スタイルによる
認識率がそれぞれ１０％、４０％および５％であったの
に対して、入力パターンから真の音声区間を推定してマ
ッチングを行うこの発明の方法においては認識率はそれ
ぞれ１００％、１００％および８５％となり、この発明
の方法が有効であることが確認された。The results of an experiment applying the above algorithm to actual speech will now be described. The recognition targets were the top 20 words of the 100 Japanese city names described in the document "Common Speech Data for Speech Recognition" (Itabashi, Proceedings of the Acoustical Society of Japan, 1985), uttered by one male speaker. The speech was passed through a telephone-band filter (300 Hz to 3.4 kHz) and A/D converted at 8 kHz, and LPC cepstrum analysis was performed every 16 msec on short-time frames 32 msec long. Voice section detection was based on the short-time logarithmic power. To make the effectiveness of the voice recognition method of the present invention clear, the words were uttered in three styles: (1) with the filler "ee" ("えー") uttered before the recognition vocabulary word, (2) with "desu" ("です") uttered after the recognition vocabulary word, and (3) with "ee" before and "desu" after the recognition vocabulary word. The partial standard pattern was extracted, as in FIG. 2(d), from only two sections of the standard pattern: at 1/4 from the start end and at 1/4 from the terminal end, that is, 3/4 from the start end. As a result, with the conventional recognition method, which fixes a single voice section, the recognition rates for utterance styles (1), (2), and (3) were 10%, 40%, and 5%, respectively, whereas with the method of the present invention, which estimates the true voice section from the input pattern before matching, the recognition rates were 100%, 100%, and 85%, respectively. This confirms that the method of the present invention is effective.
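The framing used in the experiment (8 kHz sampling, 32 msec frames, 16 msec shift) works out to 256-sample frames advanced by 128 samples. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the LPC cepstrum analysis itself is not shown.

```python
def frame_layout(sample_rate_hz, frame_ms, shift_ms):
    """Convert frame length and shift from milliseconds to samples."""
    frame_len = sample_rate_hz * frame_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    return frame_len, shift


def split_frames(samples, frame_len, shift):
    """Cut a sample sequence into overlapping frames of `frame_len`
    samples, advancing by `shift` samples; a trailing remainder that
    does not fill a whole frame is dropped."""
    last_begin = len(samples) - frame_len
    return [samples[i:i + frame_len]
            for i in range(0, last_begin + 1, shift)]
```

With these settings consecutive frames overlap by half, and one second of speech yields 61 complete frames.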

【００２３】この発明は、また、この実験の様に意図的
に付随させた不要音声だけでなく、発声者の意図に関係
なく発生するリップノイズ、呼吸音、或は背景雑音その
他の雑音を音声区間から除外して音声認識することがで
きる。The present invention can exclude from the voice section not only unnecessary speech intentionally attached as in this experiment, but also lip noise, breath sounds, background noise, and other noise that occur regardless of the speaker's intention, and can then perform voice recognition.

【００２４】[0024]

【発明の効果】以上の通りであって、この発明は、音声
区間検出を行ってからパターンマッチングを行う単語音
声認識装置について、音声区間検出時に誤って音声区間
として検出した不要音声、リップノイズ、呼吸音、或は
背景雑音その他の雑音の付加に起因して生ずる認識誤り
を音声区間検出アルゴリズムに対する簡易な演算の追加
により回避する効果を奏する。As described above, for a word voice recognition apparatus that performs pattern matching after voice section detection, the present invention has the effect of avoiding, by adding a simple computation to the voice section detection algorithm, recognition errors caused by unnecessary speech, lip noise, breath sounds, background noise, or other noise erroneously detected as part of the voice section.

[Brief description of drawings]

【図１】実施例を説明するブロック図。FIG. 1 is a block diagram illustrating an embodiment.

【図２】標準パターンの内から部分パターンを抽出する
仕方を説明する図。FIG. 2 is a diagram illustrating a method of extracting a partial pattern from a standard pattern.

【図３】パターンスポッティングを行ったときの入力音
声パターンの位置に対する累積距離値の推移を示す図。FIG. 3 is a diagram showing a transition of a cumulative distance value with respect to a position of an input voice pattern when pattern spotting is performed.

【図４】従来例を説明するブロック図。FIG. 4 is a block diagram illustrating a conventional example.

【図５】音声区間検出時に生じる信号現象を説明する
図。FIG. 5 is a diagram illustrating a signal phenomenon that occurs when a voice section is detected.

【図６】ワードスポッティングを行ったときの入力音声
パターンの位置に対する累積距離値の推移を示す図。FIG. 6 is a diagram showing a transition of a cumulative distance value with respect to a position of an input voice pattern when performing word spotting.

[Explanation of symbols]

１ 音声入力部　２ 変換部　３ 音声特徴抽出部　４ 起動スイッチ部　５ 音声区間検出部　６ 入力パターン格納部　７ 標準パターン記憶部　８ パターンマッチング部　９ 距離比較部　１０ 結果出力部　１１ 部分標準パターン抽出部　１２ 部分入力パターン抽出部　１３ 部分パターンスポッティング部　１４ 区間位置決定部
1: voice input unit; 2: conversion unit; 3: voice feature extraction unit; 4: start switch unit; 5: voice section detection unit; 6: input pattern storage unit; 7: standard pattern storage unit; 8: pattern matching unit; 9: distance comparison unit; 10: result output unit; 11: partial standard pattern extraction unit; 12: partial input pattern extraction unit; 13: partial pattern spotting unit; 14: section position determination unit

Claims

[Claims]

1. A word voice recognition method characterized by deriving, from an input voice pattern, a partial section having high similarity to a partial section of each standard pattern to be compared, thereby estimating the true voice section, and performing matching.

2. The word voice recognition method according to claim 1, wherein, for each standard pattern to be compared, a partial standard pattern that is a partial section in which a characteristic of the voice appears is extracted in advance; for the input voice pattern to be recognized, each position at constant time intervals is assumed to be the true start end of the voice, and a partial input pattern that is a partial section having the same temporal positional relationship as the partial standard pattern is extracted from that start end; pattern matching is performed between both partial patterns; the start and end positions of the true voice section in the input voice pattern are determined from the position of the partial input pattern at which the minimum value of the distance between both partial patterns is obtained; and matching is performed between the true voice section of the input voice pattern and the standard pattern.

3. A word voice recognition apparatus characterized by comprising: a voice input unit for inputting a voice signal; a voice feature extraction unit for extracting a voice feature pattern from the input voice signal; a voice section detection unit for detecting a voice section based on the voice feature pattern information output by the voice feature extraction unit; an input voice pattern storage unit for determining the start end and the terminal end of the voice section based on the voice section detection result and storing the voice feature pattern of the section delimited by these ends; a standard pattern storage unit that stores standard patterns used for voice recognition; a partial standard pattern extraction unit that extracts, from each stored standard pattern, a partial section pattern in which a characteristic of the voice appears; a partial input pattern extraction unit that extracts, starting from each position of the input voice pattern at constant time intervals, a partial section pattern having the same temporal relationship as the partial section pattern of the standard pattern; a partial pattern spotting unit that performs pattern matching between the partial section of the standard pattern and the partial section of the input voice pattern; a section position determination unit that, based on the matching result of the partial pattern spotting unit, takes the position at which the partial section pattern and the input voice pattern give the minimum distance value as the start and end positions of the true voice section; a pattern matching unit that performs pattern matching between the standard pattern and the input voice pattern based on the positional relationship information obtained from the section position determination unit and outputs a distance value; a distance comparison unit that accumulates the distance values output as the matching results between each standard pattern and the input voice pattern and specifies the standard pattern having the minimum distance value; and a result output unit that outputs the label name of the specified standard pattern.