JP3477751B2

JP3477751B2 - Continuous word speech recognition device

Info

Publication number: JP3477751B2
Application number: JP22236193A
Authority: JP
Inventors: 友康藤井; 久衛谷口
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 1993-09-07
Filing date: 1993-09-07
Publication date: 2003-12-10
Anticipated expiration: 2018-12-10
Also published as: JPH0777998A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a continuous word speech recognition device that recognizes the word sequence contained in externally input speech by using a word dictionary in which the acoustic features of the words to be recognized are registered.

[0002]

[Prior Art] Conventionally, a continuous word speech recognition device of this type first analyzes externally input speech at a predetermined period and sequentially extracts acoustic features. Then, by the well-known DP matching method or the like, it uses the acoustic features registered in the word dictionary for each word to be recognized to divide the time-series data of the extracted acoustic features into data strings, each of which most closely approximates the acoustic features of one of the words. Finally, it assigns to each data string the word represented by the corresponding acoustic features, thereby recognizing the word sequence of the input speech.

[0003] Consequently, speech recognition works well when the input speech consists only of words registered in the word dictionary in advance. However, if many silent portions occur in the middle of words in the input speech, or if the input speech contains speech that is not registered in the word dictionary (unnecessary words), speech recognition cannot be performed well and the word sequence is misrecognized.

[0004] To solve this problem, it has been proposed, for example as disclosed in Japanese Patent Application Laid-Open No. 61-20095, to generate the acoustic features of each word from speech data obtained by removing the silent portions from the speech of the words to be recognized, to register those features in the word dictionary, and, when recognizing the word sequence of input speech with this word dictionary, to use speech data obtained by removing the silent sections from the input speech.

[0005]

[Problems to Be Solved by the Invention] However, although this measure improves recognition accuracy when many silent sections occur in the middle of words in the input speech, it cannot prevent misrecognition when the input speech contains unnecessary words. To perform speech recognition accurately, sounds other than the user's own utterances must be kept from entering the speech recognition device so that no unnecessary words are input, and in addition the user must utter only words registered in the word dictionary.

[0006] The present invention has been made in view of these problems, and its object is to provide a continuous word speech recognition device that can perform speech recognition accurately even when the input speech contains speech other than the words to be recognized (unnecessary words).

[0007]

[Means for Solving the Problems] The invention made to achieve this object, as illustrated in FIG. 1 and according to claim 1 of the present application, is a continuous word speech recognition device comprising: word dictionary storage means in which the acoustic features of a plurality of words to be recognized are stored in advance for each word; acoustic analysis means that analyzes externally input speech at a predetermined period and sequentially extracts acoustic features; speech recognition means that divides the time-series data of the acoustic features sequentially extracted by the acoustic analysis means into data strings that each most closely approximate acoustic features stored in the word dictionary storage means, assigns to each data string the word represented by the corresponding acoustic features, and thereby recognizes the word sequence of the input speech; and output means that outputs the word sequence recognized by the speech recognition means to an external device. The device is characterized in that the word dictionary storage means stores, in addition to the acoustic features of each word and as the acoustic features of a word to be recognized by the speech recognition means, the acoustic features of an unnecessary word that does not require speech recognition, obtained by averaging the acoustic features of each word to be recognized, and in that the device further includes unnecessary word removal means that removes any word recognized as the unnecessary word from the word sequence recognized by the speech recognition means and passes the result to the output means.

The invention according to claim 2 of the present application is a continuous word speech recognition device comprising: word dictionary storage means in which the acoustic features of a plurality of words to be recognized are stored in advance for each word; acoustic analysis means that analyzes externally input speech at a predetermined period and sequentially extracts acoustic features; speech recognition means that divides the time-series data of the acoustic features sequentially extracted by the acoustic analysis means into data strings that each most closely approximate acoustic features stored in the word dictionary storage means, assigns to each data string the word represented by the corresponding acoustic features, and thereby recognizes the word sequence of the input speech; and output means that outputs the word sequence recognized by the speech recognition means to an external device. The device is characterized in that the word dictionary storage means stores, in addition to the acoustic features of each word and as the acoustic features of a word to be recognized by the speech recognition means, the acoustic features of an unnecessary word that does not require speech recognition, obtained by averaging the acoustic features of all the words to be recognized, and in that the device further includes unnecessary word removal means that removes any word recognized as the unnecessary word from the word sequence recognized by the speech recognition means and passes the result to the output means.

[0008]

[Operation] As described above, in the continuous word speech recognition device of the present invention, the word dictionary storage means stores, together with the acoustic features of the plurality of words to be recognized, the acoustic features of an unnecessary word that does not require speech recognition, as the acoustic features of a word to be recognized by the speech recognition means.

[0009] When speech is input from outside, the acoustic analysis means first analyzes the input speech at a predetermined period and sequentially extracts acoustic features. The speech recognition means then divides the time-series data of the acoustic features sequentially extracted by the acoustic analysis means into data strings that each most closely approximate the acoustic features of a word or of the unnecessary word stored in the word dictionary storage means, and assigns to each data string the word or unnecessary word represented by the corresponding acoustic features, thereby recognizing the word sequence in the input speech.

[0010] Once the speech recognition means has recognized the word sequence in the input speech in this way, the unnecessary word removal means removes the words recognized as unnecessary words from that word sequence and passes the result to the output means. As a result, the output means outputs the words to be recognized that are contained in the input speech, consecutively in time order, and the external device receives only the necessary word sequence that the user spoke.
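The removal step described above can be pictured with a minimal Python sketch. The label `<unnecessary>` and the function name are hypothetical names chosen for illustration; the patent does not specify how the unnecessary-word entry is identified.

```python
# Hypothetical label for the single unnecessary-word dictionary entry.
UNNECESSARY = "<unnecessary>"

def remove_unnecessary_words(recognized, unnecessary_label=UNNECESSARY):
    """Drop every entry recognized as the unnecessary word, keeping time order."""
    return [word for word in recognized if word != unnecessary_label]

recognized = ["<unnecessary>", "radio", "<unnecessary>", "on"]
print(remove_unnecessary_words(recognized))  # ['radio', 'on']
```

Only the necessary words reach the output stage, in the order in which they were spoken.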

[0011] That is, in the present invention, not only the acoustic features of the words to be recognized but also the acoustic features of an unnecessary word that need not be recognized are registered in the word dictionary, and the speech recognition means recognizes an unnecessary word contained in the input speech as a word in its own right. This prevents the region of an unnecessary word in the input speech from being misrecognized as one of the words to be recognized.

[0012] For the unnecessary-word dictionary, it suffices to register acoustic features capable of recognizing speech other than the words to be recognized, and one or more such entries may be used. However, if the entry is set to the average of the acoustic features of each word to be recognized, as recited in claim 1, or to the average of the acoustic features of all the words to be recognized, as recited in claim 2, unnecessary words can be recognized well with a single dictionary. This is because the average feature obtained by averaging the acoustic features of the words to be recognized has a large variance, so an unnecessary-word portion of the input speech lies closer to that average feature than to the acoustic features of a word to be recognized.

[0013]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 2 is a block diagram showing the overall configuration of a continuous word speech recognition device (hereinafter simply called the speech recognition device) according to an embodiment to which the present invention is applied. Note that FIG. 2 shows the functional configuration of the speech recognition device, not its hardware configuration.

[0014] As shown in FIG. 2, the speech recognition device of this embodiment consists of a dictionary creation unit 2, which creates the word dictionary 10 used for speech recognition, and a recognition unit 4, which uses the word dictionary 10 created by the dictionary creation unit 2 to recognize the word sequence in externally input speech.

[0015] The dictionary creation unit 2 creates the word dictionary 10 needed for speech recognition through processing by a microcomputer comprising a CPU, ROM, RAM, and the like, and stores the created word dictionary 10 in a predetermined storage area of a storage element such as a RAM or an IC card. It includes a necessary-word dictionary creation unit 12 and an unnecessary-word dictionary creation unit 14. The necessary-word dictionary creation unit 12 uses necessary-word speech data 6, collected in advance for each word to be recognized (hereinafter called a necessary word), to obtain the acoustic features of each necessary word used for speech recognition and sets them as the dictionary for necessary-word recognition. The unnecessary-word dictionary creation unit 14 likewise uses the necessary-word speech data 6 to obtain the acoustic features of an unnecessary word that need not be recognized and sets them as the dictionary for unnecessary-word recognition.

[0016] The recognition unit 4 consists of a speech input unit 22, which comprises a microphone, an A/D converter, and the like for converting ambient sound into digital data and capturing it; a speech recognition unit 24, which recognizes the word sequence in the input speech from the input data supplied by the speech input unit 22 and the word dictionary 10; and a recognition result output unit 26, which outputs the recognition result of the speech recognition unit 24 to an external display device or to an external device that operates according to the recognition result. Like the necessary-word dictionary creation unit 12 and the unnecessary-word dictionary creation unit 14, the speech recognition unit 24 is implemented by microcomputer processing.

[0017] Next, the operation of the necessary-word dictionary creation unit 12, the unnecessary-word dictionary creation unit 14, and the speech recognition unit 24 is described with reference to the flowcharts shown in FIGS. 3 to 5. FIG. 3 is a flowchart of the necessary-word dictionary creation process executed in the necessary-word dictionary creation unit 12.

[0018] As shown in FIG. 3, when the necessary-word dictionary creation process starts, step 100 first performs initialization by setting the counter n, which counts necessary words, to the initial value 1, so that all N necessary words contained in the necessary-word speech data 6 can be registered in the word dictionary 10 one at a time.

[0019] In the following step 110, the speech data of the necessary word (n) corresponding to the value of the counter n is read from the necessary-word speech data 6. In the next step 120, the speech data thus read is acoustically analyzed at every frame period (for example, 20 msec) and acoustic features (for example, the cepstrum) are extracted.
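As a rough illustration of frame-by-frame analysis, the following Python sketch cuts a sample sequence into 20 msec frames and computes a real cepstrum for one frame. The sampling rate, the use of non-overlapping frames, the function names, and the choice of the real cepstrum via a directly evaluated Fourier transform are all assumptions made for this example; the patent states only the frame period and that the cepstrum is one possible feature.

```python
import cmath
import math

def split_into_frames(samples, sample_rate_hz, frame_period_ms=20):
    """Cut samples into consecutive non-overlapping frames of one frame period each."""
    frame_len = sample_rate_hz * frame_period_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame period is shorter than one sample")
    last_start = len(samples) - frame_len
    # A trailing partial frame is dropped.
    return [samples[k:k + frame_len] for k in range(0, last_start + 1, frame_len)]

def real_cepstrum(frame, n_coeffs):
    """Return the first n_coeffs real-cepstrum coefficients of one frame."""
    n = len(frame)
    # Discrete Fourier transform evaluated directly (O(n^2), adequate for short frames).
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]
    # A small offset avoids log(0) on silent frames.
    log_mag = [math.log(abs(s) + 1e-12) for s in spectrum]
    # Inverse transform of the real, symmetric log spectrum: only cosine terms survive.
    return [
        sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
        for q in range(n_coeffs)
    ]

frames = split_into_frames(list(range(400)), sample_rate_hz=8000)  # 160 samples per frame
features = [real_cepstrum(frame, n_coeffs=4) for frame in frames]
```

At an assumed 8 kHz sampling rate, each 20 msec frame holds 160 samples, and each frame yields one feature vector.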

[0020] When the necessary word (n) consists of m pieces of speech data, the reading in step 110 reads all m pieces. In that case, step 120 yields m acoustic features.

[0021] Next, step 130 averages the m acoustic features obtained in step 120 to obtain a single average feature that represents the acoustic features of the necessary word (n) as a whole. The following step 140 writes this average feature into the entry for the necessary word (n) in the word dictionary 10 as the acoustic features of the necessary word (n).

[0022] The following step 150 checks whether the value of the counter n has reached the number N of necessary words to be registered, and thereby determines whether a dictionary entry has been created for all N necessary words contained in the necessary-word speech data 6. If entries have not yet been created for all N necessary words, step 160 increments the counter n so that the process moves on to the next necessary word in the necessary-word speech data 6, and the process returns to step 110. If step 150 determines that entries have been created for all N necessary words, the process ends.

[0023] In this way, the necessary-word dictionary creation process obtains an average feature for each necessary word in the necessary-word speech data 6 and registers it in the word dictionary 10. The word dictionary 10 therefore holds N dictionary entries, one per necessary word, and the entry for each necessary word stores the acoustic features that represent that necessary word as a whole.
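The necessary-word dictionary creation process of steps 100 to 160 can be sketched in Python as follows. The sketch assumes, purely for illustration, that each of the m pieces of speech data for a word has already been reduced to one fixed-length feature vector; the function names and the dictionary layout are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def build_necessary_word_dictionary(training_features):
    """Map each necessary word to the average of its m feature vectors (steps 110 to 140)."""
    return {word: mean_vector(vectors) for word, vectors in training_features.items()}

# Two training vectors for "on" (m = 2) and one for "off" (m = 1).
training = {"on": [[1.0, 2.0], [3.0, 4.0]], "off": [[10.0, 0.0]]}
word_dictionary = build_necessary_word_dictionary(training)
print(word_dictionary)  # {'on': [2.0, 3.0], 'off': [10.0, 0.0]}
```

The resulting dictionary has N entries, one averaged feature per necessary word.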

[0024] FIG. 4 is a flowchart of the unnecessary-word dictionary creation process executed in the unnecessary-word dictionary creation unit 14. As shown in FIG. 4, when this process starts, step 200 first performs initialization by setting the counter n, which counts necessary words, to the initial value 1, so that all N necessary words contained in the necessary-word speech data 6 can be acoustically analyzed.

[0025] In the following step 210, the speech data of the necessary word (n) corresponding to the value of the counter n is read from the necessary-word speech data 6. In the next step 220, the speech data thus read is acoustically analyzed at every frame period (for example, 20 msec) and acoustic features (for example, the cepstrum) are extracted. Steps 210 and 220 are executed in the same way as steps 110 and 120 of the necessary-word dictionary creation process.

[0026] When the acoustic analysis of the necessary word (n) is finished, step 230 checks whether the value of the counter n has reached the number N of necessary words, and thereby determines whether acoustic analysis has been completed for all N necessary words contained in the necessary-word speech data 6. If it has not, the process moves to step 240, which increments the counter n so that the next necessary word in the necessary-word speech data 6 is analyzed, and then returns to step 210.

[0027] If step 230 determines that acoustic analysis has been completed for all N necessary words, the process moves to step 250, which averages the acoustic features of all N necessary words obtained by the repeated execution of step 220 to obtain a single average feature covering all the necessary words. The following step 260 writes this average feature into the unnecessary-word entry of the word dictionary 10 as the acoustic features of the unnecessary word, and the process ends.

[0028] In this way, the unnecessary-word dictionary creation process registers the average feature of all the necessary words in the word dictionary 10 as the acoustic features of the unnecessary word. The word dictionary 10 therefore has N + 1 dictionary entries: the N necessary words plus one. The average feature of all the necessary words is used as the acoustic features of the unnecessary word because that average has a large variance, so an unnecessary-word portion of the input speech lies closer to the average feature of all the necessary words than to the features of an individual necessary word. In other words, by setting the average feature of all the necessary words as the acoustic features of the unnecessary word, this embodiment allows unnecessary words in the input speech to be recognized accurately with a single unnecessary-word dictionary.
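The unnecessary-word entry of steps 200 to 260 can be sketched in the same style. This example pools every feature vector obtained in step 220 and averages the pool; averaging the per-word averages instead would be an equally plausible reading of the text. The label `<unnecessary>`, the function names, and the one-vector-per-utterance representation are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def build_unnecessary_word_entry(training_features):
    """Average the feature vectors of all necessary words into one entry (step 250)."""
    pooled = [vector for vectors in training_features.values() for vector in vectors]
    return mean_vector(pooled)

training = {"on": [[0.0, 0.0], [2.0, 0.0]], "off": [[10.0, 6.0]]}
word_dictionary = {word: mean_vector(vectors) for word, vectors in training.items()}
# Step 260: store the overall average under the (hypothetical) unnecessary-word label.
word_dictionary["<unnecessary>"] = build_unnecessary_word_entry(training)
print(len(word_dictionary))              # 3, that is N + 1 with N = 2
print(word_dictionary["<unnecessary>"])  # [4.0, 2.0]
```

The dictionary ends up with N + 1 entries, and the recognizer treats the extra entry like any other word.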

[0029] FIG. 5 is a flowchart of the speech recognition process executed in the speech recognition unit 24. As shown in FIG. 5, when this process starts, step 300 first performs the processing of the acoustic analysis means: the speech data input from the speech input unit 22 is acoustically analyzed frame by frame at a predetermined frame period (for example, 20 msec) to extract acoustic features (for example, the cepstrum). The number of acoustic features obtained, one per frame, is stored as the frame length F.

[0030] Once step 300 has produced the time-series data of acoustic features by analyzing the input speech at the predetermined frame period, steps 310 to 430 perform the processing of the speech recognition means: using the well-known DP matching method, the time-series data is divided into several sections and the word stored in the word dictionary 10 to which each section corresponds is determined.

[0031] That is, the time-series data, in which the acoustic features of the input speech are stored frame by frame, is divided into every section that can be formed at frame boundaries. For each section, the degree of match (distance) between the data (acoustic features) in that section and the acoustic features of the necessary words and the unnecessary word registered in the word dictionary 10 is calculated. All the frames are then partitioned into the sections for which this distance is smallest, and each section is assigned the word with the smallest distance, which yields the word sequence in the input speech.
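To make the objective concrete, the following Python sketch enumerates every way of cutting a short frame sequence into sections, labels each section with its closest dictionary entry, and keeps the cheapest result. It is a brute-force reference with exponential cost, not the DP matching procedure itself. The section distance used here (the sum of Euclidean distances between each frame and a single template vector per entry) and all names are assumptions for illustration.

```python
from itertools import combinations
import math

def segment_distance(frames, template):
    """Hypothetical section score: summed Euclidean frame-to-template distances."""
    return sum(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(frame, template)))
        for frame in frames
    )

def best_segmentation(frames, dictionary):
    """Try every segmentation; return (total distance, [(first frame, last frame, word)])."""
    n = len(frames)
    best_total, best_sections = math.inf, []
    for n_cuts in range(n):
        for cuts in combinations(range(1, n), n_cuts):
            bounds = [0, *cuts, n]
            total, sections = 0.0, []
            for start, end in zip(bounds, bounds[1:]):
                # Closest dictionary entry for this section.
                word, dist = min(
                    ((w, segment_distance(frames[start:end], t)) for w, t in dictionary.items()),
                    key=lambda pair: pair[1],
                )
                total += dist
                sections.append((start + 1, end, word))  # 1-based, inclusive frame numbers
            if total < best_total:
                best_total, best_sections = total, sections
    return best_total, best_sections

dictionary = {"on": [0.0], "off": [5.0]}
print(best_segmentation([[0.0], [0.0], [5.0]], dictionary))
# (0.0, [(1, 2, 'on'), (3, 3, 'off')])
```

Dynamic programming reaches the same minimum without enumerating every segmentation, which is why a DP matching method is practical for real frame counts.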

[0032] This process is described below step by step. As shown in FIG. 5, step 310 first performs initialization by setting the variables T(1), S(1), and W(1), which are used in the subsequent processing, to the initial value 0. The following steps 320 and 330 set the end frame j and the start frame i, which delimit the region of the time-series data of acoustic features whose distance to the word dictionary 10 is to be calculated, to the initial value 1. The following step 340 sets the entry number n of the word in the word dictionary 10 used for the distance calculation to the initial value 1.

[0033] The following step 350 reads, from the time-series data obtained in step 300, the acoustic features from the start frame i to the end frame j, as initialized in steps 320 and 330 or updated by later processing, and reads the acoustic features of the word with entry number n from the word dictionary 10. It then calculates the distance Dij(n), which represents the degree of match between these features, taking as its initial value T(i), the distance calculated from the first frame up to the start frame i.

[0034] The following step 360 checks whether the entry number n equals N + 1, the number of words registered in the word dictionary 10, and thereby determines whether the distance to every word registered in the word dictionary 10 (that is, the N necessary words and the one unnecessary word) has been calculated for the acoustic features from frame i to frame j. If step 360 determines that the distance calculation has not been completed for all registered words, the process moves to step 370, which increments the entry number n so that step 350 calculates the distance to the word with the next entry number (n + 1), and then returns to step 350.

[0035] If step 360 determines that the distances to all the words registered in the word dictionary 10 have been calculated for the acoustic features from frame i to frame j, the process moves to step 380, which checks whether the start frame i equals the end frame j and thereby determines whether the distance calculation has been carried out with the start frame i shifted one frame at a time up to the end frame j. If step 380 determines that the start frame i does not equal the end frame j, the process moves to step 390, which increments the start frame i so that the distance is calculated with the start frame shifted to the next frame (i + 1), and then returns to step 340.

[0036] As a result, the distances for all the words, from the word with entry number 1 to the word with entry number N + 1 (that is, the unnecessary word), are calculated again for the region whose start frame i has been shifted by one frame. By repeating this processing, the distances between the acoustic features of the input speech and every combination of words registered in the word dictionary 10 are calculated for every region that can be formed at frame boundaries within the time-series data from the first frame to the end frame j.

[0037] When step 380 determines that the start frame i equals the end frame j, the following step 400 selects, from the distances Dij(n) repeatedly calculated in step 350 for the currently set end frame j, the smallest distance, min Dij(n). It stores the value of that distance as the variable T(j), the value of the start frame i corresponding to that distance as the variable S(j), and the dictionary entry n of the word dictionary 10 corresponding to that distance as the variable W(j).

[0038] In the subsequent step 410, it is determined whether the value of the end frame j matches the frame length F, that is, whether the processing of steps 330 to 400 has been executed while shifting the end frame j one frame at a time from its initial value "1" to the final frame F of the time-series data. If step 410 determines that the value of the end frame j does not match the frame length F, the process proceeds to step 420, where the value of the end frame j is incremented so that steps 330 to 400 are executed with the end frame j shifted by one frame, and the process returns to step 330.

[0039] As a result, within the time-series data from the first frame to the end frame j, now shifted by one frame, the distances between the acoustic features of the input speech and the combinations of all the words registered in the word dictionary 10 are calculated for every region that can be divided in frame units, and the interval with the smallest distance is obtained. By repeating this process, step 400 sets the variable arrays T(1), T(2), ..., T(F); S(1), S(2), ..., S(F); and W(1), W(2), ..., W(F) for the regions ending at end frames "1" through "F".
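
The nested loops over the end frame j, the start frame i, and the dictionary item n can be sketched as below. This is an illustrative sketch, not the patented implementation. It assumes that Dij(n) is the accumulated value T(i-1) plus the distance between frames i..j and word n, which matches the later statement that S(j) marks the last interval of the minimum-total combination, but the exact form of step 350 is not given in this passage. The names `fill_tables` and `mean_frame_distance` are hypothetical, and the distance function is a simple stand-in for DP matching.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def fill_tables(
    frames: List[Vector],
    dictionary: List[Vector],
    segment_distance: Callable[[List[Vector], Vector], float],
) -> Tuple[List[float], List[int], List[int]]:
    """Fill T(j), S(j), W(j) for j = 1..F (index 0 is a sentinel)."""
    F = len(frames)
    T = [0.0] + [float("inf")] * F   # T[0] = 0: no frames consumed yet
    S = [0] * (F + 1)
    W = [0] * (F + 1)
    for j in range(1, F + 1):            # end frame (steps 410 and 420)
        for i in range(1, j + 1):        # start frame (steps 380 and 390)
            segment = frames[i - 1:j]    # frames i..j, 1-based inclusive
            for n, template in enumerate(dictionary, start=1):
                # Assumption: D_ij(n) = T(i-1) + distance(frames i..j, word n)
                d = T[i - 1] + segment_distance(segment, template)
                if d < T[j]:             # step 400: keep the minimum
                    T[j], S[j], W[j] = d, i, n
    return T, S, W


def mean_frame_distance(segment: List[Vector], template: Vector) -> float:
    """Hypothetical stand-in: summed squared distance of each frame to the template."""
    return sum(
        sum((x - t) ** 2 for x, t in zip(frame, template)) for frame in segment
    )
```

With four one-dimensional frames `[[0.0], [0.0], [5.0], [5.0]]` and a two-item dictionary `[[0.0], [5.0]]`, the tables mark frames 1 to 2 as item 1 and frames 3 to 4 as item 2.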

[0040] When the variables T(F), S(F), and W(F) have been obtained in step 400, the value of the end frame j equals the frame length F, so step 410 makes an affirmative determination and the process proceeds to step 430. In step 430, the variables S(j) and W(j) obtained successively in step 400 are traced backward from the final frame j = F to obtain the word sequence for which the total sum of the distances to the words registered in the word dictionary 10 is smallest.

[0041] In other words, for the region from the first frame to the j-th frame of the time-series data of the input speech, the variable S(j) represents the last interval of the combination that minimizes the total sum of the distances to the words registered in the word dictionary 10, and the variable W(j) represents the word of that interval. W(F) therefore holds the word corresponding to the last interval, and S(F) holds the start frame of the last interval. Accordingly, the variables S(j) and W(j) obtained successively in step 400 are traced backward from the final frame j = F. The word corresponding to W(F) is set as the final word. The second-to-last word is found by looking up W(j) with the end frame j set to the frame immediately before the start frame given by S(F), and is set as the word of the second-to-last interval. The word before that is then found from the variable S(j) that gives the start frame of the second-to-last word, and so on. In this way the word sequence of the input speech can be obtained easily.
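
The backward trace of step 430 can be sketched as follows. The function name `backtrack` and the list layout (1-based tables with an unused index 0) are assumptions made for illustration only.

```python
from typing import List, Tuple


def backtrack(S: List[int], W: List[int], F: int) -> List[Tuple[int, int, int]]:
    """Return (word item, start frame, end frame) triples in spoken order."""
    result = []
    j = F
    while j >= 1:
        result.append((W[j], S[j], j))   # word occupying frames S(j)..j
        j = S[j] - 1                     # end frame of the preceding word
    result.reverse()                     # collected last-to-first
    return result
```

For example, with `S = [0, 1, 1, 3, 3]`, `W = [0, 1, 1, 2, 2]`, and `F = 4`, the trace visits j = 4 and then j = 2, giving item 1 on frames 1 to 2 followed by item 2 on frames 3 to 4.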

[0042] When the word sequence of the input speech has been obtained in step 430, the process proceeds to step 440, which performs processing as the unnecessary word removing means and removes from the word sequence the words recognized as unnecessary words. That is, the word dictionary 10 holds one unnecessary word in addition to the N necessary words, and any interval of the time-series data of the input speech that approximates the acoustic features of this unnecessary word has been recognized as the unnecessary word when the word sequence was set. Removing the words recognized as unnecessary words from this word sequence therefore yields a word sequence consisting only of necessary words. As a result, the recognition result output unit 26 outputs data representing a word sequence consisting only of the necessary words.
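
Step 440 amounts to a filter over the recognized item numbers. A minimal sketch, assuming the word sequence is held as a list of dictionary item numbers and the unnecessary word has item number N + 1 (the function name is hypothetical):

```python
from typing import List


def remove_unnecessary_words(word_sequence: List[int], unnecessary_item: int) -> List[int]:
    """Drop every occurrence of the unnecessary word, keeping the original order."""
    return [n for n in word_sequence if n != unnecessary_item]
```

With N = 3, the sequence `[4, 1, 4, 2, 3, 4]` reduces to `[1, 2, 3]`.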

[0043] As described above, in the speech recognition device of this embodiment, the word dictionary 10 holds, together with the acoustic features of the necessary words (the words to be recognized), an average feature obtained by averaging the acoustic features of all the necessary words, registered as the acoustic feature of the unnecessary word. During speech recognition, the regions of the input speech that contain unnecessary words are thus recognized as unnecessary words by means of the registered unnecessary word dictionary, and after speech recognition is complete, the words recognized as unnecessary words are removed from the recognition result so that a word sequence consisting only of necessary words is output.
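
The averaging that produces the unnecessary word entry can be sketched as below. The sketch assumes that each necessary word is represented by a single fixed-length feature vector (for example, a mean cepstrum), so that the vectors can be averaged element by element. How the embodiment handles words of different lengths is not stated in this passage, and the function names are hypothetical.

```python
from typing import List


def average_feature(features: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized feature vectors."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def build_dictionary(necessary: List[List[float]]) -> List[List[float]]:
    """Items 1..N are the necessary words; item N+1 is the averaged unnecessary word."""
    return necessary + [average_feature(necessary)]
```

For two necessary words with features `[1.0, 2.0]` and `[3.0, 6.0]`, the appended unnecessary word feature is `[2.0, 4.0]`.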

[0044] Consequently, when the input speech contains unnecessary words, the regions of those unnecessary words are not misrecognized as one of the necessary words registered in the word dictionary, as happened in conventional devices. The recognition accuracy of the word sequence can be improved, and the accurate word sequence uttered by the user can be output to the external device.

[0045] Although the speech recognition device of this embodiment is configured to perform speech recognition by the DP matching method, speech recognition may instead be performed using, for example, an HMM (hidden Markov model). In that case, when the word dictionary is created, steps 130 and 250 in FIGS. 3 and 4, which obtain the acoustic features of the necessary words and the unnecessary word, estimate the HMM parameters using the Forward-Backward algorithm or the like instead of obtaining the average of the acoustic features (for example, cepstra) obtained by acoustic analysis at each frame period, and steps 140 and 260 register the estimated HMM parameters in the word dictionary 10 as the acoustic features of the necessary words and the unnecessary word. During speech recognition, step 350 in FIG. 5 calculates, instead of the distance Dij(n) between the time-series data of the acoustic features from the start frame i to the end frame j and the acoustic features of word dictionary item n, the likelihood of that time-series data under the model having the parameters of word dictionary item n. Step 400 then obtains, instead of the start frame i and dictionary item n of the interval for which the distance Dij(n) is smallest, the start frame i and dictionary item n of the interval for which the likelihood is largest.
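
The change from "smallest distance" to "largest likelihood" can be illustrated with a small forward-algorithm sketch. Everything here is an assumption made for illustration: the models are discrete-observation HMMs given as (initial probabilities, transition matrix, emission matrix), whereas a real recognizer working on cepstral features would normally use continuous densities and log probabilities. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical discrete HMM: (initial probs, transition matrix, emission matrix)
HMM = Tuple[List[float], List[List[float]], List[List[float]]]


def forward_likelihood(model: HMM, observations: List[int]) -> float:
    """P(observations | model), computed with the forward algorithm."""
    initial, transition, emission = model
    states = range(len(initial))
    alpha = [initial[s] * emission[s][observations[0]] for s in states]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in states) * emission[s][symbol]
            for s in states
        ]
    return sum(alpha)


def best_word(models: Dict[int, HMM], observations: List[int]) -> int:
    """Dictionary item n whose model gives the highest likelihood for the segment."""
    return max(models, key=lambda n: forward_likelihood(models[n], observations))
```

In the recognition loop, `forward_likelihood` would take the place of the distance calculation in step 350, and the comparison in step 400 would keep the maximum rather than the minimum.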

[0046]

[Effects of the Invention] As described above, in the continuous word speech recognition device of the present invention, the word dictionary holds not only the acoustic features of the words to be recognized but also the acoustic features of unnecessary words that do not need to be recognized. During speech recognition, unnecessary words contained in the input speech are also recognized as words, and after speech recognition is complete, the words recognized as unnecessary words are removed from the recognition result so that a word sequence consisting only of necessary words is output.

[0047] Therefore, according to the present invention, when the input speech contains unnecessary words, those unnecessary words are not misrecognized as words to be recognized, as happened in conventional devices. The recognition accuracy of the word sequence can be improved, and the accurate word sequence uttered by the user can be output to the external device.

[Brief description of drawings]

FIG. 1 is a block diagram illustrating the configuration of the present invention.

FIG. 2 is a block diagram showing the configuration of the speech recognition device of the embodiment.

FIG. 3 is a flowchart showing the necessary word dictionary creation process executed by the necessary word dictionary creation unit of the embodiment.

FIG. 4 is a flowchart showing the unnecessary word dictionary creation process executed by the unnecessary word dictionary creation unit of the embodiment.

FIG. 5 is a flowchart showing the speech recognition process executed by the speech recognition unit of the embodiment.

[Explanation of symbols]

2 ... dictionary creation unit, 4 ... recognition unit, 6 ... necessary word speech data, 10 ... word dictionary, 12 ... necessary word dictionary creation unit, 14 ... unnecessary word dictionary creation unit, 22 ... speech input unit, 24 ... speech recognition unit, 26 ... recognition result output unit

Continuation of front page
(56) References: JP-A-5-313690 (JP, A); JP-A-61-20095 (JP, A); JP-A-59-119400 (JP, A); JP-A-61-52698 (JP, A); JP-A-62-65092 (JP, A)
(58) Fields searched (Int.Cl.⁷, DB name): G10L 15/10; G10L 15/20

Claims

(57) [Claims]

1. A continuous word speech recognition device comprising: word dictionary storage means in which acoustic feature quantities of a plurality of words to be recognized are stored in advance for each word; acoustic analysis means for analyzing externally input speech at a predetermined cycle and sequentially extracting acoustic feature quantities; speech recognition means for recognizing the word sequence of the input speech by dividing the time-series data of the acoustic feature quantities sequentially extracted by the acoustic analysis means into data strings, each of which most closely approximates one of the acoustic feature quantities stored in the word dictionary storage means, and assigning to each data string the word represented by the corresponding acoustic feature quantity; and output means for outputting the word sequence recognized by the speech recognition means to an external device, characterized in that the word dictionary storage means stores, in addition to the acoustic feature quantity of each word, an acoustic feature quantity of an unnecessary word that does not require speech recognition, obtained by averaging the acoustic feature quantities of the words to be recognized, as an acoustic feature quantity of a word to be recognized by the speech recognition means, and in that the device further comprises unnecessary word removing means for removing the word recognized as the unnecessary word from the word sequence recognized by the speech recognition means and outputting the result to the output means.

2. A continuous word speech recognition device comprising: word dictionary storage means in which acoustic feature quantities of a plurality of words to be recognized are stored in advance for each word; acoustic analysis means for analyzing externally input speech at a predetermined cycle and sequentially extracting acoustic feature quantities; speech recognition means for recognizing the word sequence of the input speech by dividing the time-series data of the acoustic feature quantities sequentially extracted by the acoustic analysis means into data strings, each of which most closely approximates one of the acoustic feature quantities stored in the word dictionary storage means, and assigning to each data string the word represented by the corresponding acoustic feature quantity; and output means for outputting the word sequence recognized by the speech recognition means to an external device, characterized in that the word dictionary storage means stores, in addition to the acoustic feature quantity of each word, an acoustic feature quantity of an unnecessary word that does not require speech recognition, obtained by averaging the acoustic feature quantities of all of the words to be recognized, as an acoustic feature quantity of a word to be recognized by the speech recognition means, and in that the device further comprises unnecessary word removing means for removing the word recognized as the unnecessary word from the word sequence recognized by the speech recognition means and outputting the result to the output means.