JP2011203434A

JP2011203434A - Voice recognition device and voice recognition method

Info

Publication number: JP2011203434A
Application number: JP2010069818A
Authority: JP
Inventors: Hitoshi Iwamida; 均岩見田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2010-03-25
Filing date: 2010-03-25
Publication date: 2011-10-13

Abstract

PROBLEM TO BE SOLVED: To reduce incorrect recognition and incorrect operation caused by noise and unintended speech in a voice recognition device. SOLUTION: The voice recognition device includes: a speech input unit for inputting speech data; a word dictionary in which a plurality of words for voice recognition are registered in association with their reading information; a voice recognizer which performs voice recognition on the speech data input through the speech input unit by using the word dictionary, and which outputs, as recognition result information, a word corresponding to reading information whose evaluation value is equal to or greater than a predetermined value; a recognition result storage unit for storing, as past recognition result information, the recognition result information output by the voice recognizer together with the input time at which the speech data was input to the speech input unit; and a final result determination unit for determining output information corresponding to newly output recognition result information, based on the past recognition result information stored in the recognition result storage unit.

Description

The present invention relates to a voice recognition device and a voice recognition method for recognizing voice signals.

Voice recognition devices that recognize an input voice signal and output the recognition result are known. One example of such a device takes as input voice captured through a sound collector such as a microphone, or the voice in a digital audio file, determines its similarity to the reading information of words registered in advance in a predetermined word dictionary, and outputs, as the recognition result, display information or identification information corresponding to reading information whose similarity is equal to or greater than a predetermined threshold.

The recognition result produced by the voice recognition device is input to, for example, a car navigation system or an automatic voice response system. The device that receives the recognition result executes the corresponding processing, using the input recognition result as a trigger.

For example, by outputting, as the recognition result of the voice recognition device, a control command for instructing operation of a device, or a dial command corresponding to a person's name or digits for designating the number of a call destination, a device can be controlled by voice input. A control command is a command for switching the power of the device on or off or for instructing an operation specific to the device. A dial command is a command corresponding to a person's name or a telephone number for specifying a communication destination when the other device has a function for communicating over a line.

Such a voice recognition device compares the input voice with the reading information of every word registered in the word dictionary and takes the words with high similarity as the voice recognition result. Therefore, when a large number of words are registered in the word dictionary, the voice recognition processing takes a long time, the probability of misrecognition increases, and the recognition accuracy decreases.

To make voice recognition processing more efficient and to improve recognition accuracy, it has been proposed to register frequently used words in the word dictionary based on past recognition results and to perform voice recognition processing using this word dictionary (see Patent Document 1).

In a voice recognition device configured in this way, the words registered in the word dictionary are frequently used ones, so the word coverage of the dictionary can be kept high even with a small vocabulary, and the recognition accuracy can be improved.

Patent Document 1: JP 2002-245078 A

In an environment that is susceptible to ambient noise, speech that was not intended as voice input may cause the voice recognition device to misrecognize, or may cause another device to malfunction based on the recognition result of the voice recognition device. In particular, when the voice recognition device is able to accept voice input at all times, it may start voice recognition processing in response to ambient noise or the like, in which case an unintended recognition result is obtained.

For example, some car navigation systems equipped with a voice recognition device allow the driver to designate a destination by speaking a place name. The voice recognition device executes voice recognition processing on the input voice and passes the corresponding place name to the car navigation system as the destination. Based on the input destination, the car navigation system calculates and displays the route to the destination, the predicted travel time, and so on.

In this situation, the voice recognition device may start voice recognition processing with speech uttered by a passenger in the front passenger seat or a rear seat as its input. When the voice recognition device starts voice recognition processing in response to such unintended speech, either the output recognition result must be discarded as a misrecognition, or the other processing that was triggered by the output recognition result must be reset as a malfunction.

An object of the present invention is to reduce, in a voice recognition device, misrecognition and malfunction caused by noise or by speech not intended as voice input.

The voice recognition device includes: a voice input unit for inputting voice data; a word dictionary in which a plurality of words for voice recognition are registered in association with their reading information; a voice recognition unit that calculates an evaluation value indicating the degree to which the voice data input from the voice input unit resembles the reading information of each word registered in the word dictionary, and outputs, as recognition result information, the word corresponding to reading information whose evaluation value is equal to or greater than a predetermined value; a recognition result storage unit that stores the recognition result information output by the voice recognition unit, together with the input time at which the voice data was input to the voice input unit, as past recognition result information; and a final result determination unit that, when the voice recognition unit outputs new recognition result information, determines output information corresponding to the newly output recognition result information based on the past recognition result information stored in the recognition result storage unit.

When a word included in the new recognition result information is also included in the past recognition result information stored in the recognition result storage unit, the voice recognition device can determine that word to be the output information. Consequently, when the past recognition result information does not contain the same word, the newly output recognition result information can be regarded as noise, or as voice data not intended as input, and suppressed.

As a result, misrecognition due to noise can be avoided even in an environment that is susceptible to ambient noise, and malfunctions caused by voice data not intended as input can be reduced.

FIG. 1 is a block diagram showing an example of an in-vehicle device including a voice recognition device.
FIG. 2 is a block diagram showing the hardware configuration of the voice recognition device 10.
FIG. 3 is a functional block diagram of the voice recognition device of the first embodiment.
FIG. 4 is a flowchart showing processing in the voice recognition device 10 of the first embodiment.
FIG. 5 is an explanatory diagram of a word information table showing the correspondence between the words registered in the word dictionary 303 and their reading information.
FIG. 6 is an explanatory diagram of the recognition result information output by the voice recognition unit 302.
FIG. 7 is an explanatory diagram of the past recognition result information stored in the recognition result storage unit 304.
FIG. 8 is a functional block diagram of the voice recognition device of the second embodiment.
FIG. 9 is a flowchart showing processing in the voice recognition device 10 of the second embodiment.

Embodiments of the voice recognition device will be described with reference to the drawings.

<Outline configuration>
As an example of the voice recognition device, the following describes a device that is mounted on a vehicle and that, in response to voice uttered by the user, recognizes, generates, and outputs control commands to a linked car navigation device and other devices.

FIG. 1 is a block diagram showing an example of an in-vehicle device including a voice recognition device.

In the vehicle, a voice recognition device 10, a car navigation device 20, an audio device 30, a communication device 40, and other electrical components 50 are connected via a network 60.

In the car navigation device 20, recognizable control commands such as power on/off, current location display, destination setting, route search, and destination change are registered in advance.

In the audio device 30, recognizable control commands such as power on/off, play, stop, pause, previous track, and next track are registered in advance.

The communication device 40 is, for example, an in-vehicle telephone capable of hands-free calls, in which recognizable control commands such as call destination designation, call start, and call end are registered in advance.

The other electrical components 50 are electrical components mounted on the vehicle, such as a car air conditioner, wipers, and headlights, and have control commands for power on/off, operation mode, speed, intensity, and the like.

The voice recognition device 10 executes voice recognition processing on the voice uttered by the user, determines from the recognition result whether the control command is addressed to the car navigation device 20, the audio device 30, the communication device 40, or the other electrical components 50, and outputs the control command to the corresponding device. For example, the voice recognition device 10 registers in the word dictionary reading information that identifies a control command as being addressed to the car navigation device 20, together with reading information that identifies the control commands instructing power on, power off, current location display, destination setting, route search, destination change, and so on. The voice recognition device 10 converts the voice input by the user into a digital voice signal and performs voice recognition processing on it.

In voice recognition processing, the distance or similarity between the input digital voice signal and each phoneme is normally calculated using an acoustic model that represents, for each phoneme, the distribution of feature values of digital voice signals, and an evaluation value indicating the degree of match with each word is calculated while referring to the reading information of the words registered in the word dictionary. The voice recognition device 10 outputs to the car navigation device 20, as the control command input by voice, the word corresponding to reading information whose evaluation value exceeds a predetermined threshold, or the word corresponding to the reading information with the highest evaluation value.
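The two selection rules described above (words whose evaluation value exceeds a threshold, or the single word with the highest evaluation value) can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the function names `select_candidates` and `best_word`, the score dictionary, and the sample scores and threshold are all hypothetical, and the acoustic scoring itself is assumed to have been done elsewhere.

```python
from typing import Dict, List, Optional, Tuple


def select_candidates(scores: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Return (word, score) pairs whose score exceeds the threshold, best first."""
    hits = [(word, score) for word, score in scores.items() if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


def best_word(scores: Dict[str, float]) -> Optional[str]:
    """Return the word with the highest score, or None if nothing was scored."""
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hypothetical evaluation values for three registered control commands.
scores = {"route search": 91.0, "destination setting": 62.0, "power off": 40.0}
candidates = select_candidates(scores, threshold=60.0)
top = best_word(scores)
```

Which of the two rules is used is left open by the description; the sketch simply exposes both.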

For the other devices as well, reading information identifying a control command as being addressed to each device and reading information identifying each device's control commands are prepared. The voice signal obtained by converting the input voice into a digital acoustic signal is subjected to voice recognition processing and compared with the reading information registered in the word dictionary, and the word corresponding to reading information whose evaluation value exceeds a predetermined threshold, or the word corresponding to the reading information with the highest evaluation value, is output to the respective device as recognition result information.

FIG. 2 is a block diagram showing the hardware configuration of the voice recognition device 10.

The hardware configuration to which the voice recognition device 10 is applied includes a CPU 11 composed of a microprocessor, a ROM 12 that stores the BIOS and various parameters, a RAM 13 that temporarily stores variable values and computed values during application execution, and a hard disk drive (HDD) 14 that stores application program data, various parameters required for the functions of the device, various databases, word dictionaries, and the like.

The hardware configuration to which the voice recognition device 10 is applied may also include a microphone 15 for collecting voice uttered by the user, a key input unit 16 for accepting input of control commands and various parameters, a display unit 17 composed of a liquid crystal display panel, a plasma display, an organic EL display, or the like, and a speaker 18 for outputting music from the audio device, voice guidance from a voice synthesis unit built into the car navigation device, and so on. The parts of this hardware configuration are connected via a bus 19.

The voice recognition device 10 can be implemented as application software that runs on the hardware configuration described above, or as a DLL (Dynamic Link Library), that is, a program that multiple pieces of application software can use in common. All or part of the hardware configuration can operate in cooperation with another device, for example with the hardware configuration of the car navigation device.

The voice recognition device 10 described above can be mounted in an in-vehicle device together with a car navigation device. It can also be realized as a voice recognition device built into an automatic voice response device, a voice recognition device built into a mobile phone or a PDA (Personal Digital Assistant), digital signage, or a voice recognition device running on a general-purpose personal computer.

<First Embodiment>
A first embodiment of the voice recognition device will be described.

FIG. 3 is a functional block diagram of the voice recognition device of the first embodiment.

The voice recognition device 10 includes a voice input unit 301, a voice recognition unit 302, a word dictionary 303, a recognition result storage unit 304, and a final result determination unit 305.

The voice input unit 301 collects the voice uttered by the user with a microphone, performs analog/digital conversion, and inputs the result to the voice recognition unit 302 as a digital acoustic signal. The voice input unit 301 can also input to the voice recognition unit 302 voice data restored from a wav file or other digital audio data.

The voice recognition unit 302 performs voice recognition processing on the voice signal input from the voice input unit 301 using the word dictionary 303. Using a predetermined acoustic model (not shown), the voice recognition unit 302 generates a phoneme model corresponding to the reading information of each word registered in the word dictionary 303, compares its feature values with those of the voice signal, and calculates an evaluation value indicating the degree of similarity. The voice recognition unit 302 outputs, as recognition result information, the word corresponding to reading information whose evaluation value is equal to or greater than a predetermined value.

In the word dictionary 303, words for voice recognition are registered together with their reading information. The words registered in the word dictionary 303 correspond to control commands for instructing operations to the car navigation device and other devices. For example, when control commands are to be output to the car navigation device 20, control commands relating to operation of the device, such as power on, power off, current location display, destination search, route search, and destination change, are registered in the word dictionary 303 as words together with the corresponding reading information. The system designer of the device that receives the control commands, the user of that device, or others can selectively register the control commands the device needs. The words can also be registered in the word dictionary 303 as a text file listing the reading information, for example a file in CSV (Comma Separated Values) format.
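As an illustration of registering words from a CSV-style text file, the following minimal sketch loads such a file into an in-memory dictionary. The column layout (word first, then one or more readings), the function name `load_word_dictionary`, and the sample rows are assumptions made for this example; the description does not specify a file layout.

```python
import csv
import io
from typing import Dict, List


def load_word_dictionary(text: str) -> Dict[str, List[str]]:
    """Parse CSV text whose rows are: word, reading[, further readings...]."""
    dictionary: Dict[str, List[str]] = {}
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue  # skip blank rows
        word, readings = cells[0], [r for r in cells[1:] if r]
        dictionary[word] = readings
    return dictionary


# Hypothetical dictionary content for a car navigation device.
sample = "power on,でんげんおん\nroute search,るーとけんさく\n\n"
word_dictionary = load_word_dictionary(sample)
```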

The recognition result storage unit 304 stores the recognition result information output by the voice recognition unit 302, together with the input time at which the voice data was input to the voice input unit 301, as past recognition result information. The recognition result storage unit 304 can be held on a predetermined storage medium; various configurations are possible, such as a hard disk drive or flash memory provided in the voice recognition device 10, or a recording medium on a network connected via a communication device.

When the voice recognition unit 302 outputs new recognition result information, the final result determination unit 305 determines output information corresponding to the newly output recognition result information based on the past recognition result information stored in the recognition result storage unit 304. For example, when the voice recognition unit 302 outputs new recognition result information, the final result determination unit 305 refers to the past recognition result information stored in the recognition result storage unit 304, and if, within a predetermined time immediately before, there is at least one piece of recognition result information containing the same word as a word in the new recognition result information, it determines that word to be the output information.
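The time-window check performed by the final result determination unit 305 can be sketched as below. This is a simplified illustration under stated assumptions: the `Recognition` record, the function `decide_output`, and the window lengths are hypothetical names and values, and when several words of the new result match, the sketch returns the first one in the order given, a tie-breaking rule the description does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Recognition:
    input_time: datetime   # time the voice data entered the voice input unit
    words: List[str]       # recognized words, best candidate first


def decide_output(new: Recognition, history: List[Recognition],
                  window: timedelta) -> Optional[str]:
    """Return a word of `new` that also occurs in a past result within `window`."""
    earliest = new.input_time - window
    for word in new.words:
        for past in history:
            # Strictly earlier than the new input, so the new result never
            # confirms itself even if it has already been stored.
            if earliest <= past.input_time < new.input_time and word in past.words:
                return word
    return None  # treated as noise or unintended speech: nothing is output


t0 = datetime(2010, 3, 25, 11, 13, 20)
history = [Recognition(t0, ["Kyoto", "Hyogo"])]
repeat = Recognition(t0 + timedelta(seconds=5), ["Kyoto"])
```

With this history, `decide_output(repeat, history, timedelta(seconds=10))` returns the repeated word, while a shorter window or a word never seen before yields `None`.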

FIG. 4 is a flowchart showing processing in the voice recognition device 10 of the first embodiment.

In step S401, the voice recognition device 10 inputs a voice signal to the voice recognition unit 302 via the voice input unit 301. As described above, the voice input unit 301 inputs to the voice recognition unit 302 either a digital acoustic signal obtained by collecting the user's analog voice with a microphone and performing analog/digital conversion, or voice data restored from a wav file or other digital audio data.

In step S403, the voice recognition device 10 performs voice recognition processing on the voice signal input from the voice input unit 301 using the word dictionary 303. The voice recognition unit 302 refers to the reading information of the words registered in the word dictionary 303, compares the feature values of the input voice signal with those of the phoneme model, and calculates an evaluation value indicating the degree of similarity. The voice recognition unit 302 then judges a word corresponding to reading information whose evaluation value is equal to or greater than a predetermined value to be a successfully recognized word, and outputs it as recognition result information.

FIG. 5 is an explanatory diagram of a word information table showing the correspondence between the words registered in the word dictionary 303 and their reading information.

The word information table 500 includes at least a notation information column 501 and a reading information column 503. The notation information column 501 stores information on how the word is written when it is displayed on a display device. The reading information column 503 stores the standard reading of the word in kana notation.

For each word registered in the word dictionary 303, the voice recognition unit 302 generates a phoneme model from its reading information, compares the feature values of that model with those of the input voice signal, and calculates an evaluation value.

Because of slurred pronunciation, regional pronunciation characteristics, individual differences in pronunciation, and the like, voice recognition processing based on the standard reading information alone may fail to recognize the voice. The word information table 500 shown in FIG. 5 therefore includes an extended reading information column 505 so that extended readings based on slurred pronunciation, regional pronunciation, individual differences in pronunciation, and the like can be registered.

FIG. 5 shows an example in which, for the word written as 「東京」 (Tokyo), 「とうきょう」 (toukyou) is registered as the standard reading information and 「ときょう」 (tokyou) is registered as extended reading information. The extended reading information is not limited to one entry; two or more extended readings can be registered.

The voice recognition unit 302 can perform voice recognition processing using both the standard reading information registered in the reading information column 503 and the reading information registered in the extended reading information column 505, which takes slurred pronunciation and the like into account. It can also perform voice recognition processing using only one of the two, as needed.
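One row of the word information table 500, with its notation, standard reading, and any number of extended readings, might be represented as in the following sketch. The class name `WordEntry`, the helper `readings_for`, and its two flags are hypothetical; they only illustrate the choice between using both kinds of reading or just one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordEntry:
    notation: str                      # column 501: how the word is displayed
    reading: str                       # column 503: standard reading in kana
    extended_readings: List[str] = field(default_factory=list)  # column 505


def readings_for(entry: WordEntry, use_standard: bool = True,
                 use_extended: bool = True) -> List[str]:
    """Collect the readings the recognizer should match against."""
    readings: List[str] = []
    if use_standard:
        readings.append(entry.reading)
    if use_extended:
        readings.extend(entry.extended_readings)
    return readings


# The example row from FIG. 5.
tokyo = WordEntry("東京", "とうきょう", ["ときょう"])
```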

In step S405, the voice recognition device 10 stores the recognition result information output by the voice recognition unit 302 in the recognition result storage unit 304. As described above, the recognition result storage unit 304 can be a hard disk drive, flash memory, a recording medium connected via a network, or any of various other recording media, and can store, as the recognition result information output by the voice recognition unit 302, the word information, its evaluation value, and the time of input to the voice input unit 301.

In step S407, the voice recognition device 10 refers to the recognition result information stored in the recognition result storage unit 304.

The recognition result information output from the voice recognition unit 302 is stored in the recognition result storage unit 304 and is also output to the final result determination unit 305. When new recognition result information is input from the voice recognition unit 302, the final result determination unit 305 refers to the past recognition result information stored in the recognition result storage unit 304.

The final result determination unit 305 determines whether a word included in the recognition result information newly output from the voice recognition unit 302 is also included in the past recognition result information stored in the recognition result storage unit 304. If past recognition result information containing the same word exists, it determines that word to be the output information and outputs it as the final recognition result.

The voice recognition unit 302 outputs recognition result information that includes the successfully recognized word, its evaluation value (score), and the time at which the voice signal was input to the voice input unit 301.

FIG. 6 is an explanatory diagram of the recognition result information output by the voice recognition unit 302.

The recognition result information output by the voice recognition unit 302 can be represented by a recognition result information table 600 such as that shown in FIG. 6. The recognition result information table 600 includes a time information column 601, a first recognition result column 603, a second recognition result column 605, and a third recognition result column 607. The time information column 601 stores the time at which the voice signal was input to the voice input unit 301. Each of the first recognition result column 603 through the third recognition result column 607 has a word column 611 and a score column 613. The notation information of a word whose reading information was successfully recognized, because its evaluation value was higher than the predetermined value, is stored in the word column 611, and the corresponding evaluation value is stored in the score column 613.

In the example shown in FIG. 6, for the voice signal input through the voice input unit 301 at 11:13:25, the word “Kyoto” was evaluated as having the closest reading information, with an evaluation value of “97”, and is stored in the first recognition result column 603. The word “Hyogo”, with an evaluation value of “77”, is stored in the second recognition result column 605. No word qualified for the third recognition result column 607, so no recognition result information was output for it.
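The layout of the recognition result information described above can be pictured with a small data structure. The following is a minimal sketch only: the class names `Candidate` and `RecognitionResult` and the helper `make_result` are hypothetical names chosen for illustration and do not appear in the specification. It models one row of the table in FIG. 6 as an input time plus up to three word and score pairs, ordered from the highest evaluation value.

```python
from dataclasses import dataclass
from datetime import time
from typing import List


@dataclass(frozen=True)
class Candidate:
    word: str   # notation of the word (word column 611)
    score: int  # evaluation value, normalized to a maximum of 100 (score column 613)


@dataclass(frozen=True)
class RecognitionResult:
    input_time: time             # time information column 601
    candidates: List[Candidate]  # first to third recognition results, best first


MAX_CANDIDATES = 3  # FIG. 6 shows three recognition result columns


def make_result(input_time: time, scored_words: List[Candidate]) -> RecognitionResult:
    """Keep at most three candidates, ordered by descending evaluation value."""
    ranked = sorted(scored_words, key=lambda c: c.score, reverse=True)
    return RecognitionResult(input_time, ranked[:MAX_CANDIDATES])


# The example of FIG. 6: only two words exceeded the predetermined value,
# so the third recognition result is simply absent.
fig6 = make_result(time(11, 13, 25), [Candidate("Hyogo", 77), Candidate("Kyoto", 97)])
```

In this sketch an empty third column is represented by a shorter list rather than by a placeholder entry, which is one of several reasonable choices.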

The evaluation value used here is obtained by adding up the feature amounts that match between the reading information of a word and the voice signal, and normalizing the sum so that the maximum value is 100.

図７は、認識結果蓄積部304に蓄積された過去の認識結果情報の説明図である。 FIG. 7 is an explanatory diagram of past recognition result information accumulated in the recognition result accumulation unit 304.

Like the recognition result information table 600 shown in FIG. 6, the past recognition result information table 701 stored in the recognition result storage unit 304 has a time information column 701, a first recognition result column 703, a second recognition result column 705, and a third recognition result column 707. The time information column 701 stores the time at which the voice signal was input to the voice input unit 301. Each of the first recognition result column 703 to the third recognition result column 707 has a word column 711 and a score column 713. For the recognition result information output from the speech recognition unit 302, the notation of each word is stored in the word column 711 and the corresponding evaluation value is stored in the score column 713.

As shown in FIG. 7, the recognition result information output from the speech recognition unit 302 is stored in chronological order according to the time at which each voice signal was input to the voice input unit 301. In the example shown in FIG. 7, for the voice signal input at 11:11:56, past recognition result information is stored in which the reading information corresponding to the word “Tokyo” has an evaluation value of “95” and that corresponding to “Kyoto” has an evaluation value of “75”. For the voice signal input at 11:11:59, the stored evaluation values are “81” for “Wakayama”, “80” for “Okayama”, and “72” for “Takayama”. Similarly, for the voice signal input at 11:13:07, the stored evaluation values are “87” for “Wakayama” and “70” for “Okayama”, and for the voice signal input at 11:13:20, the stored evaluation values are “84” for “Kyoto” and “80” for “Tokyo”.

ステップS409において、音声認識装置10は、新たに出力された認識結果情報と認識結果蓄積部304に蓄積された過去の認識結果情報とに基づいて、出力情報を決定してこれを出力する。 In step S409, the speech recognition apparatus 10 determines output information based on the newly output recognition result information and the past recognition result information stored in the recognition result storage unit 304, and outputs this.

最終結果判定部305は、音声認識部302から新たな認識結果情報が出力されると、認識結果蓄積部304を参照して、新たな認識結果情報に含まれる単語と同一の単語を含む過去の認識結果情報が蓄積されているか否かを判定する。 When new recognition result information is output from the speech recognition unit 302, the final result determination unit 305 refers to the recognition result storage unit 304, and includes a past result including the same word as the word included in the new recognition result information. It is determined whether the recognition result information is accumulated.

The final result determination unit 305 can determine whether the same word is included in recognition result information that falls within a predetermined time of the new recognition result information. For example, the final result determination unit 305 can determine whether the recognition result information stored in the recognition result storage unit 304 includes past recognition result information that was input within one minute before the time of the new recognition result information.

Suppose that, when the recognition result information shown in FIG. 6 is output from the speech recognition unit 302, the final result determination unit 305 refers to the recognition result storage unit 304 and obtains the past recognition result information shown in FIG. 7. In this case, the words included in the newly output recognition result information are “Kyoto” and “Hyogo”, with evaluation values of “97” and “77”, respectively. Among the past recognition result information stored in the recognition result storage unit 304, the entry for 11:13:20 includes the word “Kyoto”. In that entry “Kyoto” has an evaluation value of “84”, which is the highest evaluation value obtained in that speech recognition process. In such a case, the final result determination unit 305 determines “Kyoto”, which is included both in the new recognition result information and in the recognition result information from within the past one minute, as the output information.

Note that, in the example shown in FIG. 7, the past recognition result information input at 11:11:56 also contains the word “Kyoto” as its second recognition result. However, when the search is limited to recognition result information output within one minute before the newly output recognition result information, this entry is not used, because it precedes the new input at 11:13:25 by 89 seconds.
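The time-window check described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the final result determination unit 305: the function name `confirmed_words`, the use of plain dictionaries for recognition results, and the calendar date attached to the times are all hypothetical. The data are the values given for FIG. 6 and FIG. 7.

```python
from datetime import datetime, timedelta

DAY = (2010, 3, 25)  # arbitrary date; the description gives only times of day

# Past recognition results of FIG. 7 as (input time, {word: evaluation value}).
history = [
    (datetime(*DAY, 11, 11, 56), {"Tokyo": 95, "Kyoto": 75}),
    (datetime(*DAY, 11, 11, 59), {"Wakayama": 81, "Okayama": 80, "Takayama": 72}),
    (datetime(*DAY, 11, 13, 7), {"Wakayama": 87, "Okayama": 70}),
    (datetime(*DAY, 11, 13, 20), {"Kyoto": 84, "Tokyo": 80}),
]

# Newly output recognition result of FIG. 6.
new_result = (datetime(*DAY, 11, 13, 25), {"Kyoto": 97, "Hyogo": 77})


def confirmed_words(new, past, window=timedelta(minutes=1)):
    """Return the words of the new result that also occur in a past result
    whose input time lies within `window` before the new input time."""
    new_time, new_scores = new
    recent_words = set()
    for past_time, scores in past:
        if timedelta(0) <= new_time - past_time <= window:
            recent_words.update(scores)
    return [word for word in new_scores if word in recent_words]


print(confirmed_words(new_result, history))  # prints ['Kyoto']
```

With the one-minute window only the entries at 11:13:07 and 11:13:20 are consulted, so the “Kyoto” recorded at 11:11:56 plays no part in the decision.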

In this way, by determining the output information on the basis of both the recognition result information output within a predetermined time in the past and the newly output recognition result information, the apparatus can avoid recognizing sudden noise or speech not intended as input and outputting it as output information.

For example, when a device such as a car navigation system accepts command input by voice, the device is specified so that the user utters the same voice command twice in succession.

この場合、音声認識部302が１分以内に同一のコマンドに対応した単語を認識した場合に、最終結果判定部305がその単語をカーナビゲーションシステムに対する出力情報とすることができる。したがって、単語辞書303に登録されている単語に基づいて音声認識部302で認識された単語が、カーナビゲーションシステムのコマンドに対応したものであっても、１分以内の過去に同一の単語が認識されていなければ、最終結果判定部305は、音声認識部302から出力された認識結果情報を雑音もしくは意図しない音声信号に基づくものと見なして出力情報として出力しない。 In this case, when the voice recognition unit 302 recognizes a word corresponding to the same command within one minute, the final result determination unit 305 can use the word as output information for the car navigation system. Therefore, even if a word recognized by the speech recognition unit 302 based on a word registered in the word dictionary 303 corresponds to a command of the car navigation system, the same word is recognized in the past within one minute. If not, the final result determination unit 305 regards the recognition result information output from the speech recognition unit 302 as being based on noise or an unintended speech signal and does not output it as output information.

When the car navigation system is operated mainly by instructions from the user in the driver's seat, malfunction can be prevented even if an unintended command is recognized from speech uttered by a passenger in the front passenger seat or a rear seat.

このようにして、突発的な雑音や他の話者による音声が入力された場合であっても、音声認識の最終結果として出力しないことから、音声認識装置10として誤認識を減少させることができ、また、音声認識装置10から出力される出力情報に基づく誤作動を低減することが可能となる。 In this way, even when sudden noise or speech from another speaker is input, it is not output as the final result of speech recognition, so the speech recognition device 10 can reduce misrecognition. In addition, malfunctions based on output information output from the speech recognition apparatus 10 can be reduced.

FIGS. 6 and 7 show the case where the speech recognition unit 302 outputs, as the recognition result information, each recognized word together with the evaluation value obtained in the speech recognition process. In such a case, the final result determination unit 305 can total, for each word, the evaluation values of the recognition result information recognized within a predetermined time in the past, and use the word with the largest total as the output information.

The word with the highest evaluation value in the recognition result information may differ from one utterance to the next. For example, when a user makes a voice input to the speech recognition apparatus 10, the articulation may be slurred differently in the first and second utterances, or the pronunciation may differ slightly because of a change in the user's physical condition.

By totaling the evaluation values included in the recognition result information for each word and determining the word with the largest total as the output information, appropriate output information can be obtained even when the user's utterances differ from one another.
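The per-word totaling described above might look like the following sketch. The function name `pick_by_total` is hypothetical, and the sketch assumes, in line with the dependent claim on totaling, that only words appearing both in the new result and in at least one recent past result are candidates. The past results passed in are assumed to have already been limited to the predetermined time window. The evaluation values in the second usage example are made up for illustration.

```python
from collections import defaultdict


def pick_by_total(new_scores, recent_past):
    """Total the evaluation values per word and return the word with the
    largest total, or None if no word of the new result was seen recently.

    new_scores  : {word: evaluation value} of the newly output result
    recent_past : list of {word: evaluation value} within the time window
    """
    past_totals = defaultdict(int)
    for scores in recent_past:
        for word, score in scores.items():
            past_totals[word] += score

    # Candidates are words present both now and in the recent past.
    totals = {
        word: past_totals[word] + score
        for word, score in new_scores.items()
        if word in past_totals
    }
    if not totals:
        return None
    return max(totals, key=totals.get)


# FIG. 6 against the FIG. 7 entries from within the past minute.
first = pick_by_total(
    {"Kyoto": 97, "Hyogo": 77},
    [{"Wakayama": 87, "Okayama": 70}, {"Kyoto": 84, "Tokyo": 80}],
)

# Hypothetical values: the top-ranked word differs between two utterances,
# and the totals (Wakayama 165, Okayama 161) settle the choice.
second = pick_by_total(
    {"Wakayama": 85, "Okayama": 79},
    [{"Okayama": 82, "Wakayama": 80}],
)
```

Here `first` is “Kyoto” (total 181) and `second` is “Wakayama”, even though “Okayama” ranked first in the earlier hypothetical utterance.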

前述のカーナビゲーションシステムの例では、２回の発声に基づいて音声入力されたコマンドを判定しているが、３回もしくはそれ以上に設定することも可能である。 In the example of the car navigation system described above, a command input by voice is determined based on two utterances, but it is also possible to set the command to three or more times.
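The number of required utterances can be treated as a parameter. The sketch below is one possible reading, assuming that a word is accepted once it has been recognized in at least the required number of results inside the time window, counting the new one; the function name `confirmed_after_n` and its parameter names are hypothetical.

```python
def confirmed_after_n(new_words, recent_past, required_utterances=2):
    """Return the words of the new result that were also recognized in at
    least `required_utterances - 1` earlier results inside the time window.

    new_words   : iterable of words in the newly output result
    recent_past : list of word collections, one per earlier result
    """
    needed = required_utterances - 1  # the new result itself counts as one
    confirmed = []
    for word in new_words:
        hits = sum(1 for past_words in recent_past if word in past_words)
        if hits >= needed:
            confirmed.append(word)
    return confirmed


recent = [{"Kyoto", "Tokyo"}, {"Kyoto"}]
twice = confirmed_after_n(["Kyoto", "Hyogo"], recent, required_utterances=2)
thrice = confirmed_after_n(["Kyoto", "Hyogo"], recent, required_utterances=3)
four_times = confirmed_after_n(["Kyoto", "Hyogo"], recent, required_utterances=4)
```

Raising the count makes accidental confirmation by noise less likely at the cost of more effort from the user.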

Compared with character input using a keyboard or mouse, a software keyboard or handwriting input on a touch panel, and other character input methods, voice input places little burden on the user, and speaking several times is relatively easy. Performing speech recognition on the basis of multiple utterances therefore makes it possible, with a relatively simple configuration, to prevent misrecognition caused by noise and malfunction caused by speech not intended as input.

〈Second Embodiment〉
FIG. 8 is a functional block diagram of the speech recognition apparatus according to the second embodiment.

第１実施形態と同様に、音声認識装置10は、音声入力部301、音声認識部302、単語辞書303、スコア判定部801、認識結果蓄積部304、最終結果判定部305を備えている。 Similar to the first embodiment, the speech recognition apparatus 10 includes a speech input unit 301, a speech recognition unit 302, a word dictionary 303, a score determination unit 801, a recognition result storage unit 304, and a final result determination unit 305.

音声入力部301は、ユーザが発声した音声をアナログ／デジタル変換したデジタル音響信号またはwavファイルやその他のデジタル音声データから復元した音声データを音声認識部302に入力する。 The voice input unit 301 inputs to the voice recognition unit 302 digital audio signals obtained by analog / digital conversion of voice uttered by the user, or voice data restored from a wav file or other digital voice data.

音声認識部302は、音声入力部301から入力される音声信号を、単語辞書303を用いて音声認識処理し、所定値以上の評価値である読み情報に対応する単語を認識結果情報として出力する。 The speech recognition unit 302 performs speech recognition processing on the speech signal input from the speech input unit 301 using the word dictionary 303, and outputs a word corresponding to the reading information that is an evaluation value equal to or greater than a predetermined value as recognition result information. .

単語辞書303は、音声認識用の単語がその読み情報とともに登録されている。 In the word dictionary 303, words for speech recognition are registered together with their reading information.

認識結果蓄積部304は、音声認識部302により出力される認識結果情報を、音声データが音声入力部301に入力された入力時刻とともに過去の認識結果情報として蓄積する。 The recognition result accumulation unit 304 accumulates the recognition result information output by the voice recognition unit 302 as past recognition result information together with the input time when the voice data is input to the voice input unit 301.

スコア判定部801は、音声認識部302から出力される認識結果情報に含まれる単語の評価値が所定値以上であるか否かを判定し、その判定結果を最終結果判定部305に送信する。 The score determination unit 801 determines whether or not the word evaluation value included in the recognition result information output from the speech recognition unit 302 is equal to or greater than a predetermined value, and transmits the determination result to the final result determination unit 305.

When recognition result information is newly output from the speech recognition unit 302, the final result determination unit 305 determines the output information corresponding to the newly output recognition result information on the basis of the past recognition result information stored in the recognition result storage unit 304. However, if the determination result output from the score determination unit 801 indicates that the evaluation value of a word included in the recognition result information is equal to or greater than a predetermined value, the final result determination unit 305 determines that word as the output information without referring to the recognition result storage unit 304.

図９は、第２実施形態の音声認識装置10における処理を示すフローチャートである。 FIG. 9 is a flowchart showing processing in the speech recognition apparatus 10 of the second embodiment.

In step S901, the speech recognition apparatus 10 inputs a voice signal to the speech recognition unit 302 via the voice input unit 301. As described above, the voice input unit 301 inputs to the speech recognition unit 302 either a digital acoustic signal obtained by collecting the user's analog voice with a microphone and applying analog/digital conversion, or voice data restored from a wav file or other digital audio data.

In step S903, the speech recognition apparatus 10 performs speech recognition processing on the voice signal input from the voice input unit 301 using the word dictionary 303. The speech recognition unit 302 refers to the reading information of the words registered in the word dictionary 303, compares the feature amounts of the input voice signal with the feature amounts of the phoneme models, and calculates an evaluation value indicating the degree of similarity. The speech recognition unit 302 then judges that a word corresponding to reading information whose evaluation value is equal to or greater than a predetermined value has been recognized, and outputs it as recognition result information.

ステップS905において、音声認識装置10は、音声認識部302より出力される認識結果情報を認識結果蓄積部304に蓄積する。前述したように、認識結果蓄積部304は、ハードディスクドライブ、フラッシュメモリ、ネットワークで接続された記録媒体、その他種々の記録媒体とすることができ、音声認識部302より出力される認識結果情報として、単語情報とその評価値及び音声入力部301に入力された時刻を蓄積することができる。 In step S905, the speech recognition apparatus 10 stores the recognition result information output from the speech recognition unit 302 in the recognition result storage unit 304. As described above, the recognition result accumulation unit 304 can be a hard disk drive, a flash memory, a recording medium connected via a network, and other various recording media. As the recognition result information output from the voice recognition unit 302, The word information, its evaluation value, and the time input to the voice input unit 301 can be accumulated.

ステップS907において、音声認識装置10は、認識結果情報に含まれる評価値が所定値以上であるか否かを判別する。スコア判定部801は、音声認識部302から出力された認識結果情報に含まれる単語の評価値が所定値以上であるか否かを判定しその判定結果を最終結果判定部305に出力する。認識結果情報に含まれる単語の評価値が所定値以上である場合には、ステップS911に移行し、そうでない場合にはステップS909に移行する。 In step S907, the speech recognition apparatus 10 determines whether or not the evaluation value included in the recognition result information is greater than or equal to a predetermined value. The score determination unit 801 determines whether or not the word evaluation value included in the recognition result information output from the speech recognition unit 302 is greater than or equal to a predetermined value, and outputs the determination result to the final result determination unit 305. If the evaluation value of the word included in the recognition result information is greater than or equal to the predetermined value, the process proceeds to step S911, and if not, the process proceeds to step S909.

ステップS909において、音声認識装置10は、認識結果蓄積部304に蓄積された認識結果情報を参照する。最終結果判定部305では、音声認識部302から新たな認識結果情報が入力されると、認識結果蓄積部304に蓄積されている過去の認識結果情報を参照する。 In step S909, the speech recognition apparatus 10 refers to the recognition result information stored in the recognition result storage unit 304. When new recognition result information is input from the speech recognition unit 302, the final result determination unit 305 refers to past recognition result information stored in the recognition result storage unit 304.

ステップS911において、音声認識装置10は、音声認識結果として出力する出力情報を決定する。 In step S911, the speech recognition apparatus 10 determines output information to be output as a speech recognition result.

If the determination result of the score determination unit 801 indicates that the recognition result information contains no word with an evaluation value equal to or greater than the predetermined value, the final result determination unit 305 determines the output information on the basis of the past recognition result information stored in the recognition result storage unit 304, as in the first embodiment. If the determination result indicates that the recognition result information does contain a word with an evaluation value equal to or greater than the predetermined value, the final result determination unit 305 determines that word as the output information and outputs it.

For example, when an operation command is input by voice to a car navigation system or the like, the user can be expected to utter the word corresponding to the command relatively clearly. Accordingly, when the user speaks in order to input an operation command, the recognition result information output from the speech recognition unit 302 includes a word with a relatively high evaluation value. For example, if the evaluation values are normalized so that the maximum is 100 and an evaluation value of 90 or more, or 95 or more, is obtained, the word included in the recognition result information is highly likely to be based on a genuine voice input.

In contrast, when the speech recognition unit 302 performs speech recognition processing on noise or on speech not intended as input, the evaluation values included in the recognition result information are not expected to be very high. For example, if the evaluation values are normalized so that the maximum is 100, the recognition result information for suddenly occurring noise is assumed to be unlikely to include a word with an evaluation value of 95 or more.

このことから、認識結果情報に含まれる単語の評価値がある程度の高い値であるような場合には、最終結果判定部305は、その単語をユーザが音声入力を目的として発声したものと見なして、認識結果蓄積部304を参照せずに、出力情報として出力することができる。 From this, when the evaluation value of the word included in the recognition result information is a certain high value, the final result determination unit 305 regards the word as being uttered by the user for the purpose of voice input. The output information can be output without referring to the recognition result storage unit 304.

たとえば、図６に示す例では、音声認識部302から出力された認識結果情報には、評価値が"97"であるような読み情報に対応する「京都」という単語が含まれている。 For example, in the example illustrated in FIG. 6, the recognition result information output from the speech recognition unit 302 includes the word “Kyoto” corresponding to the reading information whose evaluation value is “97”.

Suppose, for example, that for evaluation values with a maximum of 100, the score determination unit 801 outputs to the final result determination unit 305 a determination result indicating a high evaluation value for any word whose evaluation value is 95 or more.

As described above, the recognition result information output from the speech recognition unit 302 includes the word “Kyoto”, whose reading information has an evaluation value of “97”. The score determination unit 801 determines that this is a word with a high evaluation value and outputs a determination result to that effect to the final result determination unit 305.

最終結果判定部305は、認識結果情報に含まれる「京都」という単語に対して、評価値が高い単語である旨の判定結果を得ていることから、認識結果蓄積部304を参照せずに、「京都」を出力情報として決定してこれを出力する。 Since the final result determination unit 305 has obtained a determination result indicating that the word “Kyoto” included in the recognition result information is a word having a high evaluation value, the final result determination unit 305 does not refer to the recognition result accumulation unit 304. , “Kyoto” is determined as output information and output.
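The decision flow of steps S907 to S911 can be sketched as follows. This is a simplified sketch, not the apparatus itself: the function name `decide_output` is hypothetical, the threshold of 95 is the example value from the description, and the fallback simply returns the highest-scoring word that also appears in the recent past results, which are assumed to be already limited to the predetermined time window.

```python
HIGH_SCORE = 95  # example threshold for evaluation values normalized to 100


def decide_output(new_scores, recent_past, high_score=HIGH_SCORE):
    """Return the word to output, or None if nothing can be confirmed.

    new_scores  : {word: evaluation value} of the newly output result
    recent_past : list of {word: evaluation value} within the time window
    """
    # Step S907: a word with a sufficiently high evaluation value is
    # accepted at once, without consulting the stored past results.
    best_word = max(new_scores, key=new_scores.get, default=None)
    if best_word is not None and new_scores[best_word] >= high_score:
        return best_word

    # Steps S909 and S911: otherwise fall back to the first embodiment and
    # require the word to have been recognized in the recent past as well.
    recent_words = set().union(*recent_past)
    for word in sorted(new_scores, key=new_scores.get, reverse=True):
        if word in recent_words:
            return word
    return None


# FIG. 6: "Kyoto" scores 97, so it is output after a single utterance.
immediate = decide_output({"Kyoto": 97, "Hyogo": 77}, [])

# A score of 84 is below the threshold and needs confirmation from the past.
unconfirmed = decide_output({"Kyoto": 84, "Tokyo": 80}, [])
confirmed = decide_output({"Kyoto": 84, "Tokyo": 80}, [{"Tokyo": 95, "Kyoto": 75}])
```

Placing the threshold check first means the stored results are read only when the single-utterance evidence is weak.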

音声認識部302において音声認識処理を行った結果、評価値が高い読み情報に対応する単語は、所定時間以内に同じ単語として誤認識される可能性は低いと考えられ、これを出力情報としても誤認識となる可能性は低い。 As a result of performing speech recognition processing in the speech recognition unit 302, it is considered that a word corresponding to reading information with a high evaluation value is unlikely to be erroneously recognized as the same word within a predetermined time, and this is also used as output information. The possibility of misrecognition is low.

このように、評価値が高い読み情報に対応する単語については、１回の発声だけで出力情報として決定することが可能であり、音声認識処理の時間を短縮することが可能となり、ユーザの発声の手間も省力することができる。 As described above, a word corresponding to reading information having a high evaluation value can be determined as output information only by one utterance, and it is possible to reduce the time of the speech recognition processing, and the user's utterance. Can save labor.

〈Other Embodiments〉
The speech recognition apparatus according to the embodiments described above may also be provided with a noise level measurement unit that measures the background noise level.

In this case, when the background noise level measured by the noise level measurement unit exceeds a predetermined value, the apparatus can be configured, as described in the first embodiment, to refer to the past recognition result information from within a predetermined time and, if the same word is found, to determine that word as the output information.

When the background noise level measured by the noise level measurement unit is equal to or less than the predetermined value, the apparatus can be configured to output the word included in the recognition result information from the speech recognition unit 302 as the output information without further checks.

When the background noise level is fairly high, even an utterance intended as input may be misrecognized in the speech recognition processing of the speech recognition unit 302. The probability of misrecognition can therefore be reduced by having the final result determination unit 305 refer to the past recognition result information stored in the recognition result storage unit 304 and determine the output information through multiple rounds of speech recognition.

また、背景雑音のレベルが低い場合には、音声認識部302における音声認識処理の誤認識が少なくなると考えられる。したがって、最終結果判定部305が、１回の音声認識処理の認識結果情報に基づいて出力情報を決定しても、誤認識となる可能性が低いことが想定される。 Further, when the background noise level is low, it is considered that erroneous recognition of the speech recognition processing in the speech recognition unit 302 is reduced. Therefore, even if the final result determination unit 305 determines output information based on the recognition result information of one speech recognition process, it is assumed that there is a low possibility of erroneous recognition.

雑音レベル測定部により測定される背景雑音レベルが所定値を超える場合には、ユーザに対して複数回同一音声の入力を促す指示を表示または音声による通知を行うようにしてもよい。 When the background noise level measured by the noise level measurement unit exceeds a predetermined value, an instruction for prompting the user to input the same voice a plurality of times may be displayed or notified by voice.

In this way, when the background noise level is low, the user needs to speak only once, which shortens the speech recognition processing time and spares the user the effort of repeated voice input.
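The switching by background noise level can be sketched as follows. Every concrete detail here is an assumption made for illustration: the function name `decide_with_noise_gate`, the decibel unit, and the threshold of 60.0 are hypothetical, since the description speaks only of a predetermined value. The function returns the word to output together with a flag indicating whether the user should be prompted to repeat the input.

```python
def decide_with_noise_gate(noise_level_db, new_scores, recent_past,
                           noise_threshold_db=60.0):
    """Return (word or None, prompt_repeat).

    noise_level_db : measured background noise level (unit is an assumption)
    new_scores     : {word: evaluation value} of the newly output result
    recent_past    : list of {word: evaluation value} within the time window
    """
    best_word = max(new_scores, key=new_scores.get, default=None)
    if best_word is None:
        return None, False  # nothing was recognized at all

    # Quiet surroundings: one recognition is trusted and output as it is.
    if noise_level_db <= noise_threshold_db:
        return best_word, False

    # Noisy surroundings: require the same word in a recent past result.
    recent_words = set().union(*recent_past)
    for word in sorted(new_scores, key=new_scores.get, reverse=True):
        if word in recent_words:
            return word, False

    # Noisy and not yet confirmed: ask the user to say it again.
    return None, True


quiet = decide_with_noise_gate(45.0, {"Kyoto": 84, "Tokyo": 80}, [])
noisy_first = decide_with_noise_gate(75.0, {"Kyoto": 84, "Tokyo": 80}, [])
noisy_second = decide_with_noise_gate(75.0, {"Kyoto": 88}, [{"Kyoto": 84, "Tokyo": 80}])
```

In this sketch `quiet` yields “Kyoto” at once, `noisy_first` yields no word and requests a repeat, and `noisy_second` yields “Kyoto” after the repeat.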

カーナビゲーションシステム、デジタルサイネージ、携帯電話などの音声入力により操作指示が可能な装置に適用することが可能である。 The present invention can be applied to devices capable of giving operation instructions by voice input, such as car navigation systems, digital signage, and mobile phones.

10: Speech recognition apparatus
301: Voice input unit
302: Speech recognition unit
303: Word dictionary
304: Recognition result storage unit
305: Final result determination unit
801: Score determination unit

Claims

A speech recognition apparatus comprising:
a voice input unit that inputs voice data;
a word dictionary in which a plurality of words for speech recognition are registered in association with their reading information;
a speech recognition unit that calculates an evaluation value indicating the degree of similarity between the voice data input from the voice input unit and the reading information of each word registered in the word dictionary, and outputs, as recognition result information, a word corresponding to reading information whose evaluation value is equal to or greater than a predetermined value;
a recognition result storage unit that stores the recognition result information output by the speech recognition unit, together with the input time at which the voice data was input to the voice input unit, as past recognition result information; and
a final result determination unit that, when recognition result information is newly output from the speech recognition unit, determines output information corresponding to the newly output recognition result information on the basis of the past recognition result information stored in the recognition result storage unit.

The speech recognition apparatus according to claim 1, wherein, when a word included in the recognition result information newly output by the speech recognition unit is also included in at least one item of past recognition result information stored in the recognition result storage unit, the final result determination unit determines, as the output information, the word included in both the past recognition result information and the newly output recognition result information.

The speech recognition apparatus according to claim 2, wherein, when a word included in at least one item of past recognition result information output within a predetermined time before the newly output recognition result information matches a word included in the newly output recognition result information, the final result determination unit determines the matching word as the output information.

The speech recognition apparatus according to claim 3, wherein the recognition result information output from the speech recognition unit includes the evaluation value calculated for the reading information of each word, and, when two or more words are included both in at least one item of past recognition result information stored in the recognition result storage unit and in the newly output recognition result information, the final result determination unit totals the evaluation values for each word and determines the word with the highest total evaluation value as the output information.

The speech recognition apparatus according to claim 4, further comprising a score determination unit that determines whether the evaluation value of a word included in the newly output recognition result information is equal to or greater than a predetermined value, wherein the final result determination unit determines, as the output information, a word whose evaluation value the score determination unit has determined to be equal to or greater than the predetermined value.

A program for a speech recognition apparatus that performs speech recognition processing on input voice data, the program causing a computer to function as a speech recognition apparatus comprising:
a voice input unit that inputs voice data;
a word dictionary in which a plurality of words for speech recognition are registered in association with their reading information;
a speech recognition unit that calculates an evaluation value indicating the degree of similarity between the voice data input from the voice input unit and the reading information of each word registered in the word dictionary, and outputs, as recognition result information, a word corresponding to reading information whose evaluation value is equal to or greater than a predetermined value;
a recognition result storage unit that stores the recognition result information output by the speech recognition unit, together with the input time at which the voice data was input to the voice input unit, as past recognition result information; and
a final result determination unit that, when recognition result information is newly output from the speech recognition unit, determines output information corresponding to the newly output recognition result information on the basis of the past recognition result information stored in the recognition result storage unit.