JP2002132293A

JP2002132293A - Speech recognizer

Info

Publication number: JP2002132293A
Application number: JP2000328747A
Authority: JP
Inventors: Masaru Kuroda; 勝黒田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-10-27
Filing date: 2000-10-27
Publication date: 2002-05-09

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognizer that accurately recognizes input speech with a simple configuration. SOLUTION: The speech recognizer comprises a word dictionary database for speech recognition; a speech input unit that computes a speech feature quantity from the input speech; a matching unit that determines the similarity between the computed speech feature quantity and each word stored in the dictionary database; a specifying means that specifies, among the words whose similarity determined by the matching unit has reached a maximum, a word that is not followed by another word whose similarity reaches a maximum; and an output means that outputs the word specified by the specifying means as the recognized word.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a speech recognition device that recognizes input speech and outputs the recognition result.

[0002]

[Prior art] Speech recognition devices are conventionally known that recognize speech input through a microphone or the like and output the recognized content to a display means such as a display, or to a navigation system mounted in an automobile as destination information.

[0003] Speech recognition methods include the speaker-dependent method, in which words uttered by the speaker are registered in advance to create a recognition word dictionary, and the speaker-independent method, in which the recognition word dictionary is created from text documents or the like rather than from registered utterances. In both methods, input speech is recognized when the speaker utters a predetermined word. A conventional speech recognition device selects, from the words registered in the word dictionary, the one most similar to the speech input within a fixed period after the speaker's utterance is detected.

[0004]

[Problems to be solved by the invention] However, in the conventional speech recognition device described above, if the speaker utters a filler word unrelated to recognition, such as "er" or "um", and then utters a word registered in the dictionary, the registered word cannot be recognized correctly.

[0005] The word spotting method is known as one speech recognition technique that resolves this problem. Word spotting successively collates the input against the words in the standard dictionary at a fixed time unit, outputs any word whose similarity exceeds a predetermined threshold, and otherwise continues collation. A known speech recognition method using word spotting is, for example, "Word speech recognition method using a duration-controlled state transition model" (Transactions of the Institute of Electronics, Information and Communication Engineers, vol. J72-D-II, No. 11, pp. 1769-1777, November 1989). That method attaches duration information to the phonemes contained in the dictionary to be recognized, achieving good recognition performance while reducing the amount of computation.
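The collation loop that word spotting performs can be sketched as follows. This is a minimal illustration and not the cited method: the per-frame similarity scores are assumed to be already computed, and the function name `word_spotting`, the word labels, and the threshold value are hypothetical.

```python
def word_spotting(frame_scores, threshold):
    """Yield (frame_index, word) each time a word's similarity exceeds the threshold.

    frame_scores: iterable of dicts mapping word -> similarity, one dict per time unit.
    The similarity computation itself is outside this sketch (assumed given).
    """
    for index, scores in enumerate(frame_scores):
        for word, similarity in scores.items():
            if similarity > threshold:
                yield index, word
        # If no word exceeded the threshold, nothing is emitted and
        # collation simply continues with the next time unit.


# Hypothetical scores for three consecutive time units.
scores = [
    {"tokyo": 0.2, "osaka": 0.1},
    {"tokyo": 0.9, "osaka": 0.3},
    {"tokyo": 0.4, "osaka": 0.2},
]
hits = list(word_spotting(scores, threshold=0.8))
```

Because the loop runs on every time unit regardless of who is speaking, any sound that scores above the threshold produces an output.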

[0006] With the word spotting method described above, the required word can be recognized accurately even when the speaker utters a filler word unrelated to recognition, such as "er" or "um", before uttering a word registered in the dictionary. However, because collation against the words in the dictionary is performed at all times, regardless of whether the intended speaker is speaking, the method suffers from so-called spurious detection, in which speech uttered by a third party other than the intended speaker is also recognized.

[0007] An object of the present invention is to provide a speech recognition device capable of performing speech recognition more accurately.

[0008]

[Means for solving the problems] A first speech recognition device of the present invention comprises: a word dictionary database for speech recognition; a speech input unit that calculates a speech feature quantity from input speech; a matching unit that obtains the similarity between the calculated speech feature quantity and each word stored in the dictionary database; a specifying means that specifies, among the words whose similarity obtained by the matching unit has reached a maximum, a word that is not followed by another word whose similarity reaches a maximum; and an output means that outputs the word specified by the specifying means as the recognized word.

[0009] A second speech recognition device of the present invention is the first speech recognition device, wherein the word specifying means specifies, among the words whose similarity obtained by the calculation means is at or above a predetermined threshold and has reached a maximum, a word for which there is no other word whose similarity is at or above the predetermined threshold and reaches a maximum.

[0010] A third speech recognition device of the present invention is any of the above speech recognition devices, further comprising a volume measuring means that measures the utterance volume of the speaker, wherein the output means outputs, as the recognized word, a word that has been specified by the word specifying means and for which the value measured by the volume measuring means exceeds a predetermined threshold.

[0011]

[Embodiments of the invention] An embodiment of the speech recognition device of the present invention is described below with reference to the accompanying drawings. FIG. 1 is a configuration diagram of the speech recognition device 100. The speech recognition device 100 is built around a central processing unit (hereinafter, CPU) 1 and comprises a microphone 3 that collects speech, an A/D converter 2 that converts the analog signal collected by the microphone 3 into a digital signal for recognition processing, a ROM 4 that stores the speech recognition processing program, a RAM 5 into which the program is loaded when the speech recognition processing is executed, an output device 6 such as a display that outputs the result of the speech recognition processing, and a standard dictionary database 7 built for predetermined words. A configuration is also conceivable in which a navigation system mounted in an automobile or the like is connected in place of the output device 6, and the speech recognition result is output as destination information for that navigation system.

[0012] FIG. 2 shows the processing blocks of the speech recognition processing executed by the CPU 1. The speech picked up by the microphone 3 is converted into a digital signal by the A/D converter 2 and then output to the feature extraction unit 50 and the volume detection unit 51. The feature extraction unit 50 extracts from the input speech the speech feature quantities needed for speech recognition. Specifically, it obtains a 10th-order mel-cepstrum for each predetermined time unit called a frame (for example, 20 ms).

[0013] The standard dictionary database 7 stores, for each word, the average mel-cepstrum vector of each phoneme generated from the phoneme string (character string) of the word, the duration of each phoneme, and information such as an automaton representing the state transitions of the phonemes. For example, when the speech recognition result is used as destination information for a navigation system as described above, the standard dictionary database 7 consists of words such as place names and building names.
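One way to picture a dictionary entry holding the three kinds of information listed above is the following sketch. The class names, field names, the duration unit (frames), and the strictly left-to-right automaton are all assumptions made for illustration; the text does not specify the storage layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Phoneme:
    symbol: str
    mean_mel_cepstrum: List[float]  # average 10th-order mel-cepstrum vector
    duration_frames: int            # phoneme duration; frame units are an assumption


@dataclass
class DictionaryEntry:
    word_id: int                    # identification number of the word
    reading: str                    # phoneme string (character string) of the word
    phonemes: List[Phoneme] = field(default_factory=list)

    def transitions(self) -> List[Tuple[int, int]]:
        """State transitions of an assumed linear automaton: state i leads to i + 1."""
        return [(i, i + 1) for i in range(len(self.phonemes) - 1)]


# Hypothetical entry with placeholder vectors.
entry = DictionaryEntry(
    word_id=1,
    reading="shinyokohama",
    phonemes=[
        Phoneme("sh", [0.0] * 10, 3),
        Phoneme("i", [0.0] * 10, 4),
        Phoneme("N", [0.0] * 10, 3),
    ],
)
```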

[0014] The matching unit 52 compares the above information for each word stored in the standard dictionary database 7 with the feature quantities obtained by the feature extraction unit 50, obtains a distance vector (hereinafter, similarity) S while judging the state transitions of the phonemes of each word, and outputs the similarity S obtained for each word, together with the identification number of that word (for example, a JIS code), to the result judgment output unit 54.

[0015] Meanwhile, the volume detection unit 51 obtains the maximum volume V of the input speech for each frame (20 ms) and outputs the obtained volume value V to the result judgment output unit 54.
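The per-frame maximum volume can be sketched as below. Treating "volume" as the peak absolute sample amplitude is an assumption, as are the function name `frame_max_volume` and its parameters; the text only states that a maximum volume V is obtained every 20 ms frame.

```python
def frame_max_volume(samples, sample_rate, frame_ms=20):
    """Return the maximum absolute amplitude of each frame of `frame_ms` milliseconds.

    samples: sequence of numeric PCM samples (assumed already digitized).
    The last frame may be shorter than the others.
    """
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("sample_rate is too low for the requested frame length")
    return [
        max(abs(s) for s in samples[start:start + frame_len])
        for start in range(0, len(samples), frame_len)
    ]


# Toy input: at 200 Hz a 20 ms frame holds 4 samples.
volumes = frame_max_volume([0, 3, -5, 2, 1, -1, 7, 0], sample_rate=200)
```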

[0016] FIG. 3 shows the transitions of the similarity S and the maximum volume V when the speaker utters a word, and the transitions of the similarity S and the maximum volume V when a third party utters it. As shown, the maximum volume V generally takes a large value when speech is uttered deliberately. For speech uttered unintentionally, ambient noise, or utterances by a third party other than the speaker, the maximum volume V is comparatively small.

[0017] The result judgment output unit 54 detects a word whose similarity has reached a maximum (a word whose similarity rose and then began to fall), and checks whether any other word's similarity reaches a maximum between the moment the detected word's similarity peaked and the end of a predetermined response time T_th. If no other word whose similarity reaches a maximum is detected during the response time T_th, the word whose similarity reached the maximum is output as the recognized word, provided that the maximum volume V at that point exceeds a predetermined threshold V_th.
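The notion of a word whose similarity "rose and then began to fall" can be illustrated with a small peak detector over one word's similarity sequence. This is a sketch under the assumption that one similarity value is available per frame; the function name `detect_peaks` is hypothetical.

```python
def detect_peaks(similarities):
    """Return (frame_index, value) for every point where the similarity,
    having risen, starts to fall.

    A peak is only known one frame late, once the first falling value is seen.
    """
    peaks = []
    rising = False
    for i in range(1, len(similarities)):
        prev, cur = similarities[i - 1], similarities[i]
        if cur > prev:
            rising = True
        elif cur < prev and rising:
            peaks.append((i - 1, prev))  # the previous frame was the maximum
            rising = False
    return peaks


peaks = detect_peaks([0.1, 0.3, 0.7, 0.6, 0.4, 0.5, 0.9, 0.8])
```

A sequence that only falls, or that is still rising when the input ends, yields no peak, which matches the condition that the similarity must have risen before it begins to fall.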

[0018] If, on the other hand, another word whose similarity reaches a maximum is detected before the response time T_th elapses, the unit checks again, from the time that word was detected until the response time T_th elapses, whether any further word's similarity reaches a maximum.

[0019] Even when no other word whose similarity reaches a maximum is detected during the response time T_th, if the maximum volume V does not exceed the predetermined threshold V_th, the unit judges that the word is a spurious detection uttered by a third party, initializes the measured similarity and volume values, and then repeats the above processing.

[0020] FIG. 4 is a flowchart of the speech recognition processing executed by the CPU 1. FIG. 5 shows the transition, in the result judgment output unit 54, of the similarity of the two words "Shinyokohama" and "Shinyokohamakita". The procedure of the speech recognition processing is described below with reference to FIG. 4.

[0021] First, the similarity and maximum volume values of the speech input for each frame (20 ms) are initialized (step S1). Speech input is accepted (step S2). Feature quantities are extracted (step S3). The similarity S is obtained for each word stored in the standard dictionary database 7 (step S4). If there is a word whose similarity S has reached a maximum S_max, for example "Shinyokohama" as shown in FIG. 5 (YES in step S5), the response time timer is initialized (step S6). After confirming that the similarity S_max exceeds the threshold S_th (YES in step S7), the timer is started or, if it is already running, kept running (step S8). If the similarity S_max does not exceed the threshold S_th (NO in step S7), the process returns to step S2 and detects the next word whose similarity reaches a maximum.

[0022] If the similarity S_max exceeds the threshold S_th (YES in step S7) but the response time timer value t has not exceeded the threshold T_th (NO in step S9), the process returns to step S2 and detects any other word whose similarity reaches a maximum. If, before the timer value t passes the threshold T_th, another word whose similarity S reaches a maximum S_max is detected, for example "Shinyokohamakita" as shown in FIG. 5 (YES in step S5), the response time timer is initialized again (step S6). If the timer value t passes the threshold T_th without any other word whose similarity S reaches a maximum S_max being detected (YES in step S9), then, provided that the maximum volume value V exceeds the predetermined threshold V_th (YES in step S10), the recognized word, "Shinyokohamakita" in the example of FIG. 5, is confirmed (step S11) and output (step S12).
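Steps S1 to S12 can be read as the frame-driven loop sketched below. This is one interpretation of the flowchart, not a definitive implementation: similarities and volumes are assumed to be supplied per frame, the timer is counted in frames, V is taken as the running maximum since the last initialization, and a peak that fails the S_th check is assumed to clear the pending candidate. The function name `recognize` and all numeric values are hypothetical.

```python
def recognize(frames, s_th, t_th, v_th):
    """Return the recognized word, or None if nothing is confirmed.

    frames: iterable of (similarities, volume) per frame, where similarities
            maps word -> S (steps S2 to S4 are assumed to have produced them).
    s_th, v_th: similarity and volume thresholds; t_th: response time in frames.
    """
    prev, rising = {}, set()                 # S1: initialize measured values
    candidate, elapsed, v_max = None, 0, 0.0
    for sims, volume in frames:
        v_max = max(v_max, volume)
        # S5: look for a word whose similarity rose and has now begun to fall.
        peak_word, peak_val = None, None
        for word, s in sims.items():
            p = prev.get(word)
            if p is None:
                continue
            if s > p:
                rising.add(word)
            elif s < p and word in rising:
                rising.discard(word)
                if peak_val is None or p > peak_val:
                    peak_word, peak_val = word, p
        prev = dict(sims)
        if peak_word is not None:
            elapsed = 0                      # S6: initialize the response timer
            # S7/S8: keep the word as candidate only if S_max exceeds S_th.
            candidate = peak_word if peak_val > s_th else None
            continue
        if candidate is None:
            continue
        elapsed += 1
        if elapsed > t_th:                   # S9: response time has passed
            if v_max > v_th:                 # S10: volume condition
                return candidate             # S11, S12: confirm and output
            # Treated as a spurious detection: reinitialize and start over.
            prev, rising = {}, set()
            candidate, elapsed, v_max = None, 0, 0.0
    return None


# Similarity curves loosely modeled on FIG. 5 (values are invented).
curve = [
    {"shinyokohama": a, "shinyokohamakita": b}
    for a, b in [(0.2, 0.1), (0.6, 0.3), (0.9, 0.5), (0.7, 0.7), (0.5, 0.95),
                 (0.4, 0.8), (0.3, 0.6), (0.2, 0.4), (0.1, 0.2)]
]
loud = recognize([(s, 0.8) for s in curve], s_th=0.6, t_th=2, v_th=0.5)
quiet = recognize([(s, 0.1) for s in curve], s_th=0.6, t_th=2, v_th=0.5)
```

With the loud input the shorter word peaks first, but the longer word peaks before the response time runs out, so the longer word is the one returned; with the quiet input the volume condition fails and nothing is output.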

[0023]

[Effects of the invention] In the first speech recognition device of the present invention, when a word whose similarity has reached a maximum is detected, the device checks whether any other word's similarity subsequently reaches a maximum before treating the detected word as the recognized word. This processing makes it possible, for example, to distinguish "Shinyokohama" from "Shinyokohamakita" accurately.

[0024] The second speech recognition device of the present invention is the first speech recognition device in which a word whose similarity is greater than a predetermined threshold is output. This enables more accurate recognition processing.

[0025] In the third speech recognition device of the present invention, any of the above speech recognition devices additionally judges, based on the volume of the input speech as well as the similarity, whether a word was uttered by the speaker or by a third party nearby. This prevents misrecognition caused by spurious detection.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of an image processing system according to Embodiment 1.

FIG. 2 is a system configuration diagram.

FIG. 3 is a diagram showing the relationship between the similarity S and the maximum volume V for a given word.

FIG. 4 is a flowchart of the speech recognition processing executed by the CPU.

FIG. 5 is a diagram showing how the speech recognition processing proceeds for the two words "Shinyokohama" and "Shinyokohamakita".

[Explanation of symbols]

1: CPU, 2: A/D converter, 3: microphone, 4: ROM, 5: RAM, 6: output device, 7: word dictionary database, 50: feature extraction unit, 51: volume detection unit, 52: matching unit, 53: standard dictionary, 54: result judgment output unit.

Claims

[Claims]

1. A speech recognition device comprising: a word dictionary database for speech recognition; a speech input unit that calculates a speech feature quantity from input speech; a matching unit that obtains the similarity between the calculated speech feature quantity and each word stored in the dictionary database; specifying means for specifying, among the words whose similarity obtained by the matching unit has reached a maximum, a word that is not followed by another word whose similarity reaches a maximum; and output means for outputting the word specified by the specifying means as the recognized word.

2. The speech recognition device according to claim 1, wherein the specifying means specifies, among the words whose similarity obtained by the matching unit is at or above a predetermined threshold and has reached a maximum, a word that is not followed by another word whose similarity is at or above the predetermined threshold and reaches a maximum.

3. The speech recognition device according to claim 1, further comprising volume measuring means for measuring the utterance volume of a speaker, wherein the output means outputs, as the recognized word, a word that has been specified by the specifying means and for which the value measured by the volume measuring means exceeds a predetermined threshold.