JPH01285996A

JPH01285996A - Speech recognizing device

Info

Publication number: JPH01285996A
Application number: JP63117183A
Authority: JP
Inventors: Toru Ueda; 徹上田; Mitsuhiro Toya; 充宏斗谷
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1988-05-13
Filing date: 1988-05-13
Publication date: 1989-11-16

Abstract

PURPOSE: To accurately extract the syllable sections of a speech input word during recognition by comparing the acoustic features of the speech input word with those of a recognition word candidate and establishing a time-axis correspondence between them. CONSTITUTION: The device has two modes, a registration mode and a recognition mode. In the registration mode, the syllable sections of a speech input word from the user are extracted from its acoustic features, and a syllable standard pattern is created for each syllable section and registered in a standard pattern memory part 5. In the recognition mode, the syllable sections of the speech uttered by the user are extracted, the distance to the previously registered syllable standard patterns is calculated for each syllable section, the speech input word is recognized on the basis of the calculated distances, and the recognition result is output. For this extraction, the time series of the acoustic features of the recognition word candidates stored in a recognized word memory part 6 is put into correspondence with the time series of the acoustic features of the speech input word. The syllable sections of the speech input word are thereby correctly extracted.

Description

DETAILED DESCRIPTION OF THE INVENTION <Industrial Application Field> The present invention relates to a speech recognition device that recognizes input speech by dividing it into syllables.

<Prior Art> A common method of speaker-dependent word recognition stores, for each word, features representing the entire word in a storage unit and recognizes a speech input word on the basis of these whole-word features. With this method, however, the features of every word must be registered, which makes it unsuitable for recognizing a large vocabulary (several hundred to several thousand words). Moreover, recognition words are not easy to change once they have been registered. In contrast, a method that stores syllable features in a storage unit and recognizes a speech input word by dividing it into syllables on the basis of those features needs to store features for only a small number of syllables. Even when all of Japanese is to be input, a large vocabulary can be recognized simply by registering the features of about one hundred kinds of syllables. Furthermore, once the syllable features have been registered, a change in the recognition words can be handled simply by changing the kana character strings in the word dictionary.

A conventional speech recognition device that recognizes a speech input word by dividing it into syllables divides the word into syllables, calculates the distance between the feature pattern of each syllable and the syllable standard patterns registered in advance by the user, and recognizes the speech input word on the basis of these distances. When dividing the speech input word into syllables, the device extracts the syllable sections using only the acoustic features of the speech input word entered by the user.

<Problems to be Solved by the Invention> In a speech recognition device that decomposes a speech input word into syllables and recognizes the word on the basis of the distances between those syllables and the syllable standard patterns, the accuracy of syllable section extraction when the word entered by the user is divided into syllables has a large effect on the accuracy of word recognition. This is because, if syllables are recognized on the basis of incorrect syllable sections, the recognized syllable candidates do not include the correct syllables, and therefore the recognized word candidates do not include the correct word either.

In the conventional speech recognition device described above, the syllable sections of a speech input word are extracted using only the acoustic features of that word. The syllable sections are therefore frequently extracted incorrectly, and as a result the recognition accuracy for speech input words is low.

Accordingly, an object of the present invention is to provide a speech recognition device that can correctly extract the syllable sections of a speech input word.

<Means for Solving the Problems> To achieve the above object, the present invention provides a speech recognition device in which syllable standard patterns are registered in advance by the speaker, a speech input word entered by voice is divided into syllables, the distance to the syllable standard patterns is calculated for each syllable, and the speech input word is recognized on the basis of the calculation results, the device being characterized by comprising: a feature extraction unit that extracts acoustic features of the speech input word; a recognition word storage unit that stores the recognition words used when recognizing the speech input word; a feature generation unit that generates acoustic features for a recognition word candidate selected from the recognition words stored in the recognition word storage unit; a syllable section extraction unit that compares the acoustic features of the speech input word extracted by the feature extraction unit with the acoustic features of the recognition word candidate generated by the feature generation unit to establish a time-axis correspondence, and extracts the syllable sections of the speech input word on the basis of that correspondence; and a distance calculation unit that calculates the distances between the syllable sections of the speech input word extracted by the syllable section extraction unit and the syllable standard patterns, and obtains the average of the calculated distances for the recognition word candidate.

<Operation> When a speech input word entered by voice is recognized, the acoustic features of the speech input word are extracted by the feature extraction unit. In addition, a recognition word candidate is preselected, according to the speech input word, from the recognition words stored in the recognition word storage unit, and acoustic features for that word are generated by the feature generation unit. The syllable section extraction unit then compares the acoustic features of the speech input word extracted by the feature extraction unit with the acoustic features of the recognition word candidate generated by the feature generation unit to establish a time-axis correspondence, and extracts the syllable sections of the speech input word on the basis of that correspondence. The distance calculation unit then calculates the distances between the syllable sections extracted by the syllable section extraction unit and the syllable standard patterns, and the average of these distances for the recognition word candidate is computed.

Therefore, the syllable sections of the speech input word can be extracted more accurately than when they are extracted using only the acoustic features extracted by the feature extraction unit.

<Embodiment> The present invention is described in detail below with reference to the illustrated embodiment.

FIG. 1 is a block diagram of the speech recognition device of the present invention. Reference numeral 1 denotes a microphone for inputting speech; 2, an amplifier that amplifies only the speech band of the speech input from the microphone 1; 3, a feature extraction unit that calculates, from the speech waveform input from the amplifier 2, the features used for syllable section extraction and for distance calculation during speech input word recognition; 4, a CPU (central control unit) that controls the entire speech recognition device; 5, a standard pattern storage unit that stores the syllable standard patterns; 6, a recognition word storage unit that stores the recognition words used during speech input word recognition; 7, a top-down segmentation unit that generates acoustic features for a recognition word from its character string in the recognition vocabulary stored in the recognition word storage unit 6, and extracts the syllable sections of the input speech so that they best match the generated acoustic features; 8, a distance calculation unit that calculates the distances between the extracted syllable sections of the speech input word and the syllable standard patterns stored in the standard pattern storage unit 5; 9, a working memory used during the syllable standard pattern registration operation and the syllable recognition operation; and 10, an I/O interface for exchanging data with external devices (a display unit, a keyboard unit, a host CPU, and so on), not shown.

The speech recognition device of the present invention has two main modes: a registration mode and a recognition mode. In the registration mode, the syllable sections of a speech input word from the user are extracted from the acoustic features of that word, and a syllable standard pattern is created for each syllable section and registered in the standard pattern storage unit 5. In the recognition mode, as described in detail below, the syllable sections are extracted from the speech uttered by the user, the distance to the syllable standard patterns registered beforehand in the registration mode is calculated for each extracted syllable section, the speech input word is recognized on the basis of the calculated distances, and the recognition result is output. Performing speech recognition with a registration mode and a recognition mode in this way is a common speech recognition technique.

FIG. 2 is a flowchart of the recognition mode described above.

The recognition mode operation of the speech recognition device configured as above is described below with a concrete example, following the flowchart of FIG. 2.

In step S1, the speech input from the microphone 1 is amplified by the amplifier 2 and input to the feature extraction unit 3. The feature extraction unit 3 then calculates, for each frame of the input speech, time series of acoustic features such as the phoneme sequence, the power sequence, and other features used for distance calculation (for example, the output of a 16-channel filter bank).
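As a rough illustration of the power sequence mentioned in step S1, the following sketch computes one log-power value per frame of a waveform. The frame length, hop size, sampling rate, and test signal are assumptions chosen for the example; the patent does not specify them, and the phoneme sequence and filter bank output are not modelled here.

```python
import math

def frame_log_power(samples, frame_len=256, hop=128):
    """Return one log-power value (in dB) per frame of a mono waveform."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) on silent frames.
        powers.append(10.0 * math.log10(energy + 1e-10))
    return powers

# Hypothetical input: 512 silent samples followed by 512 samples of a
# 440 Hz tone sampled at 8 kHz (values chosen only for illustration).
signal = [0.0] * 512 + [math.sin(2 * math.pi * 440 * n / 8000) for n in range(512)]
power = frame_log_power(signal)
```

The resulting list plays the role of the power time series that later steps compare against the expected features of a candidate word.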

In step S2, separately from the above processing of the input speech, several recognition word candidates are selected according to the speech input word from the recognition words stored in the recognition word storage unit 6, for example by preliminary selection using DP matching of phoneme strings.
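The preliminary selection in step S2 can be pictured with a small sketch. It ranks vocabulary entries by a DP (edit-distance) match between an observed phoneme string and each entry's phoneme string and keeps the best few. The unit costs, the use of single letters as stand-ins for phonemes, and the vocabulary itself are assumptions made for the example, not details given in the patent.

```python
def phoneme_edit_distance(a, b):
    """DP match between two phoneme sequences with unit edit costs."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (pa != pb)))  # match or substitution
        prev = cur
    return prev[-1]

def preselect(observed, vocabulary, n_best=2):
    """Return the n_best words whose phoneme strings are closest to observed."""
    ranked = sorted(vocabulary,
                    key=lambda w: phoneme_edit_distance(observed, vocabulary[w]))
    return ranked[:n_best]

# Hypothetical vocabulary; each letter stands in for one phoneme.
vocabulary = {
    "tokushima": list("tokushima"),
    "kashima": list("kashima"),
    "akai": list("akai"),
}
observed = list("tokshima")  # hypothetical noisy phoneme string from step S1
candidates = preselect(observed, vocabulary)
```

Only the words returned by `preselect` would go on to the more expensive syllable-level processing in steps S3 to S7.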

Then, from the character string of one of the selected recognition word candidates, the top-down segmentation unit 7 generates the acoustic features expected for that candidate (the expected acoustic features).

In step S3, it is determined whether processing has been completed for all of the recognition word candidates selected in step S2. If it has, the process proceeds to step S8; otherwise, it proceeds to step S4.

In step S4, the syllable sections of the input speech are extracted and the speech is divided into syllables.

This syllable division is performed as follows. The time series of acoustic features of the input speech calculated in step S1 is compared with the time series of expected acoustic features of the recognition word candidate generated in step S2, and the acoustic features of the input speech are put into correspondence so that they best match the time series of expected acoustic features. The change points in the time series of acoustic features of the input speech that correspond to the change points in the time series of expected acoustic features are then taken as the boundaries of the optimal syllable sections of the input speech. One way to match the two time series is DP matching, an algorithm that finds a nonlinear time correspondence without reversing the time axis.
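The syllable division of step S4 can be sketched as follows. The code aligns an observed feature sequence to an expected one with a monotonic DP match and then reads the syllable sections off the alignment. The scalar features, the absolute-difference local cost, the three-way step pattern, and the sample values are all assumptions for illustration; the patent names DP matching but does not fix these details.

```python
def dp_align(inp, ref):
    """Monotonic DP alignment. Returns, per input frame, a matched ref index."""
    n, m = len(inp), len(ref)
    cost = [[float("inf")] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            local = abs(inp[i] - ref[j])
            if i == 0 and j == 0:
                cost[0][0] = local
                continue
            # Allowed predecessors never move backwards in time.
            steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            steps = [(p, q) for p, q in steps if p >= 0 and q >= 0]
            p, q = min(steps, key=lambda s: cost[s[0]][s[1]])
            cost[i][j] = local + cost[p][q]
            back[i][j] = (p, q)
    match = [None] * n
    cell = (n - 1, m - 1)
    while cell is not None:  # trace the best path back to (0, 0)
        i, j = cell
        if match[i] is None:
            match[i] = j
        cell = back[i][j]
    return match

def syllable_sections(match, ref_syllable, names):
    """Cut the input where the aligned expected syllable index changes."""
    sections, start = [], 0
    for i in range(1, len(match) + 1):
        if i == len(match) or ref_syllable[match[i]] != ref_syllable[match[start]]:
            sections.append((names[ref_syllable[match[start]]], start, i))
            start = i
    return sections

# Hypothetical data: expected features for "to-ku-shi-ma" and an utterance
# in which /ku/ lasts only one frame.
names = ["to", "ku", "shi", "ma"]
expected = [1, 1, 5, 5, 2, 2, 6, 6]
expected_syllable = [0, 0, 1, 1, 2, 2, 3, 3]
observed = [1, 1, 1, 5, 2, 2, 2, 6, 6, 6]
sections = syllable_sections(dp_align(observed, expected), expected_syllable, names)
```

Each tuple in `sections` is (syllable, first frame, one past the last frame). Because the boundaries come from the candidate's expected sequence, the short /ku/ still receives its own section instead of being merged into its neighbour.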

In step S5, the distance calculation unit 8 calculates, for every syllable of the current recognition word candidate, the distance between the syllable section of the input speech extracted in step S4 and the syllable standard pattern stored in the standard pattern storage unit 5 in the registration mode described above.

Because only the similarity between the current syllable of the speech input word and the current recognition word candidate is needed here, it suffices to calculate the distance to the syllable patterns of the syllables that make up the current recognition word candidate.

For example, if the current recognition word candidate is "akai" (あかい), only the distance to the syllable standard pattern /a/ needs to be calculated for the first syllable section of the speech input word.

In step S6, the sum of the distances for all syllables of the current recognition word candidate obtained in step S5 is divided by the number of syllables in that candidate, giving the average syllable distance of the input speech with respect to the current recognition word candidate.
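Steps S5 and S6 together amount to a per-syllable distance followed by an average. The sketch below assumes, purely for illustration, that each section is summarised by its mean feature vector and compared with the standard pattern by Euclidean distance; the patent does not state which distance measure is used, and the patterns and frames are made-up values.

```python
import math

def section_distance(frames, pattern):
    """Euclidean distance between a section's mean frame vector and a pattern."""
    dim = len(pattern)
    mean = [sum(f[k] for f in frames) / len(frames) for k in range(dim)]
    return math.sqrt(sum((mean[k] - pattern[k]) ** 2 for k in range(dim)))

def average_syllable_distance(sections, syllables, patterns):
    """Step S5: one distance per syllable of the candidate. Step S6: their mean."""
    total = sum(section_distance(sec, patterns[syl])
                for sec, syl in zip(sections, syllables))
    return total / len(syllables)

# Hypothetical standard patterns and extracted sections for the candidate "akai".
patterns = {"a": [0.0, 0.0], "ka": [3.0, 4.0], "i": [1.0, 1.0]}
sections = [
    [[0.0, 0.0], [0.0, 0.0]],  # section 1, compared only with /a/
    [[0.0, 0.0]],              # section 2, compared only with /ka/
    [[1.0, 1.0]],              # section 3, compared only with /i/
]
avg = average_syllable_distance(sections, ["a", "ka", "i"], patterns)
```

Note that each section is compared only with the one pattern the candidate predicts at that position, which is the saving described for step S5.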

In step S7, the average syllable distance of the current recognition word candidate calculated in step S6 is stored, for example in the working memory 9, and the process returns to step S2.

In step S8, the average syllable distances calculated and stored as described above for all the recognition word candidates are sorted in ascending order.

In step S9, the recognition word candidate with the smallest average syllable distance is output as the recognition result from the I/O interface 10.
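The candidate loop of steps S3 to S9 reduces to scoring each candidate, storing the score, sorting, and taking the smallest. A minimal sketch follows; the function name `recognize` and the score values are hypothetical and are not taken from FIG. 4.

```python
def recognize(candidates, average_distance):
    """Score every candidate, sort ascending, return the best word and ranking."""
    stored = []
    for word in candidates:                           # loop of steps S3 to S7
        stored.append((average_distance(word), word)) # S7: store the score
    stored.sort()                                     # S8: smallest distance first
    return stored[0][1], stored                       # S9: output the best candidate

# Hypothetical average syllable distances for two candidates.
scores = {"kashima": 0.8, "tokushima": 0.3}
best, ranking = recognize(["kashima", "tokushima"], scores.get)
```

In a fuller version, `average_distance` would perform the segmentation of step S4 and the distance calculations of steps S5 and S6 for the given candidate.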

FIGS. 3 and 4 show the extracted syllable sections and the syllable distances when the two words "Kashima" (かしま) and "Tokushima" (とくしま) are selected as recognition word candidates.

FIG. 3 shows the power of the input speech and its extracted syllable sections.

The top row is the time series of power obtained when the user utters /tokushima/. Row I in the lower part shows the optimal syllable sections obtained when the input word is assumed to be "Kashima", that is, when "Kashima" is used as the recognition word candidate. Row II shows the optimal syllable sections obtained when "Tokushima" is used as the recognition word candidate. Row III shows the syllable sections obtained for the speech input word "Tokushima" from the acoustic features of the input speech alone, without using a recognition word candidate. In row III, because the syllable /ku/ of the input speech "Tokushima" is short, it is merged with the following syllable /shi/, and the word is incorrectly divided into three syllables.

FIG. 4 shows, for the case where "Kashima" and "Tokushima" are selected as recognition word candidates, the distance between each syllable of the input speech word divided as shown in rows I and II of FIG. 3 and the syllable standard pattern stored in the standard pattern storage unit 5 (the distance obtained in step S5 of the flowchart of FIG. 2). It also shows the average syllable distance for each recognition word candidate based on these distances (the distance obtained in step S6 of the flowchart of FIG. 2).

In this embodiment, therefore, the recognition word candidate "Tokushima", which has the smaller average syllable distance, is output as the recognized word.

In this way, several recognition word candidates (for example, "Kashima" and "Tokushima" in FIG. 3) are selected in advance, according to the speech input word, from the recognition vocabulary stored in the recognition word storage unit 6, and the syllable sections of the speech input word are extracted by putting the time series of acoustic features of each candidate into correspondence with the time series of acoustic features of the speech input word. Consequently, a spoken word entered as "Tokushima", as in FIG. 3, is not incorrectly divided into three syllable sections as in row III. In other words, a three-syllable word obtained from three incorrectly divided syllable sections is not output as the recognized word.

Therefore, according to the present invention, the input speech can be divided into accurate syllable sections and the speech input word can be recognized correctly.

In the above embodiment, when the distances between the syllable sections of the speech input word and the syllable standard patterns are calculated, the syllable sections are extracted for each recognition word candidate and the distance to the syllable standard pattern is calculated for every one of those sections; however, the present invention is not limited to this. In FIG. 3, the same sections A and B of the speech input word are extracted as syllables both for the syllables /ka/ and /ma/ of the candidate "Kashima" and for the syllables /to/ and /ma/ of the candidate "Tokushima". The distances between syllable section A and the syllable standard patterns /ka/ and /to/ may therefore be calculated together, and one of the two distance calculations between syllable section B and the syllable standard pattern /ma/ may be omitted.
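One way to realise the saving described here is to cache each distance under the pair (section, syllable), so that a section shared by several candidates is scored against a given pattern only once. The sketch below is an assumption about how this could be organised; the frame indices of sections A and B and the stub distance function are hypothetical.

```python
def cached_distance(cache, section, syllable, compute):
    """Return the distance for (section, syllable), computing it at most once."""
    key = (section, syllable)
    if key not in cache:
        cache[key] = compute(section, syllable)
    return cache[key]

calls = []

def fake_distance(section, syllable):
    """Stand-in for the real distance calculation; records each invocation."""
    calls.append((section, syllable))
    return 0.0

# Hypothetical frame ranges: A and B are extracted identically for both candidates.
A, B = (0, 10), (25, 40)
kashima = [(A, "ka"), ((10, 25), "shi"), (B, "ma")]
tokushima = [(A, "to"), ((10, 14), "ku"), ((14, 25), "shi"), (B, "ma")]

cache = {}
for candidate in (kashima, tokushima):
    for section, syllable in candidate:
        cached_distance(cache, section, syllable, fake_distance)
```

With these inputs the seven (section, syllable) pairs trigger only six distance calculations, because section B against /ma/ is reused for the second candidate.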

In this embodiment, the expected acoustic features of a recognition word candidate may be generated according to predetermined rules stored in advance, or the character string of each recognition word may be stored in the recognition word storage unit 6 together with a time series of its acoustic features, and the expected acoustic features may be generated from that stored time series.
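The first option, rule-based generation, might look like the following sketch. The rule used here (a low expected power for an unvoiced onset, a medium level for any other onset, and a high level for the vowel) and the romanized syllable input are invented for the example; the patent does not describe the actual rules.

```python
VOWELS = "aiueo"
UNVOICED = set("kstph")  # assumed set of unvoiced onset letters

def expected_features(syllables, vowel_frames=2):
    """Return (expected power per frame, owning syllable index per frame)."""
    features, owner = [], []
    for k, syl in enumerate(syllables):
        onset = syl.rstrip(VOWELS)  # consonant part, e.g. "sh" in "shi"
        if onset:
            level = 0.1 if onset[0] in UNVOICED else 0.5
            features.append(level)
            owner.append(k)
        features.extend([1.0] * vowel_frames)  # vowel nucleus at full power
        owner.extend([k] * vowel_frames)
    return features, owner

# Romanized syllables of the candidate "tokushima" (illustrative input).
features, owner = expected_features(["to", "ku", "shi", "ma"])
```

The `owner` list records which syllable each expected frame belongs to, which is the information an alignment step needs in order to place syllable boundaries in the input speech.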

<Effects of the Invention> As is clear from the above, when recognizing a speech input word, the speech recognition device of the present invention compares the acoustic features of the speech input word with the acoustic features of a recognition word candidate to establish a time-axis correspondence and thereby extracts the syllable sections of the speech input word, and then recognizes the speech input word using the average, for the recognition word candidate, of the distances between the extracted syllable sections and the syllable standard patterns.

Therefore, according to the present invention, the syllable sections of a speech input word can be extracted accurately during recognition, and highly accurate speech recognition can be performed on the basis of these correct syllable sections.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of the speech recognition device of the present invention; FIG. 2 is a flowchart of the syllable recognition operation in the recognition mode; FIG. 3 shows an example of the power of an input speech word and its extracted syllable sections; and FIG. 4 shows an example of the distances between the syllables of the divided speech input word and the syllable standard patterns, and of the average syllable distances of the recognition word candidates.

1: microphone; 2: amplifier; 3: feature extraction unit; 4: CPU; 5: standard pattern storage unit; 6: recognition word storage unit; 7: top-down segmentation unit; 8: distance calculation unit; 9: working memory; 10: I/O interface.

Claims

(1) A speech recognition device in which syllable standard patterns are registered in advance by a speaker, a speech input word entered by voice is divided into syllables, the distance to the syllable standard patterns is calculated for each syllable, and the speech input word is recognized on the basis of the calculation results, the device comprising: a feature extraction unit that extracts acoustic features of the speech input word; a recognition word storage unit that stores the recognition words used when recognizing the speech input word; a feature generation unit that generates acoustic features for a recognition word candidate selected from the recognition words stored in the recognition word storage unit; a syllable section extraction unit that compares the acoustic features of the speech input word extracted by the feature extraction unit with the acoustic features of the recognition word candidate generated by the feature generation unit to establish a time-axis correspondence, and extracts the syllable sections of the speech input word on the basis of that correspondence; and a distance calculation unit that calculates the distance between each syllable section of the speech input word extracted by the syllable section extraction unit and the syllable standard patterns.