JP4941494B2

JP4941494B2 - Speech recognition system

Info

Publication number: JP4941494B2
Application number: JP2009082675A
Authority: JP
Inventors: 竜一鈴木
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2009-03-30
Filing date: 2009-03-30
Publication date: 2012-05-30
Anticipated expiration: 2029-03-30
Also published as: JP2010237286A

Description

The present invention relates to a speech recognition system applied to, for example, a vehicle navigation device.

In recent years, systems that recognize speech uttered by a user and, based on the recognition result, operate a target device or provide a service to the user have been developed and are coming into practical use.

For example, Patent Document 1 describes a voice response device that recognizes a user's speech and, based on the recognition result, provides services such as order acceptance and database search. In this device, a speech recognition unit recognizes which of the words and phrases registered in advance in a speech recognition dictionary unit were uttered, and in what order. An unnecessary word detection unit then checks whether the recognition result contains an unnecessary word, that is, a word or phrase not needed to operate the voice response device. If it does, the unnecessary word detection unit further examines the positional relationship between the unnecessary word and the target word in the recognition result.

A proficiency estimation unit then estimates the user's proficiency in operating the voice response device from the findings of the unnecessary word detection unit. A conversation flow control unit retrieves, from conversation flows stored in advance, the guidance contained in the conversation flow that corresponds to the estimated proficiency, and outputs it to the user.

Patent Document 1: JP 2001-331196 A

As described above, the voice response device of Patent Document 1 provides guidance suited to the user's operating proficiency.

However, estimating the user's proficiency from the presence of unnecessary words and from their positional relationship to the target word, as in Patent Document 1, requires a speech recognition dictionary with an enormous recognition vocabulary. The user's utterance must be matched not only against the recognition vocabulary for target words, which is already enormous, but also against a recognition vocabulary for a very large number of unnecessary words. Matching against such a large vocabulary may actually raise the probability of misrecognition and degrade speech recognition performance.

The present invention has been made in view of this problem. Its object is to provide a speech recognition system that can suppress degradation of speech recognition performance by performing recognition with a dictionary whose recognition vocabulary contains as few unnecessary words as possible.

To achieve the above object, the speech recognition system according to claim 1 comprises:
a voice input unit for inputting speech;
speaker identification means for identifying a speaker based on the speech input to the voice input unit;
a speech recognition unit that recognizes the speech input to the voice input unit using a dictionary having a large recognition vocabulary;
an unnecessary word usage frequency storage unit that, for each speaker identified by the speaker identification means, counts the unnecessary words (words inherently unnecessary as input speech) in the speech recognized by the speech recognition unit, and stores an unnecessary word usage frequency calculated from the count; and
a dictionary switching unit that, with a plurality of dictionaries containing different numbers of unnecessary words prepared for use by the speech recognition unit, switches the dictionary used by the speech recognition unit to one whose number of unnecessary words corresponds to the stored unnecessary word usage frequency, when such a frequency is stored in the unnecessary word usage frequency storage unit for the speaker identified by the speaker identification means.

As described above, the invention of claim 1 identifies the speaker from the speech input to the voice input unit, calculates how frequently unnecessary words appear in that speaker's utterances, and stores the result for each speaker as an unnecessary word usage frequency. This frequency indicates, for each speaker, the tendency to use unnecessary words.

Accordingly, when a speaker is identified from the input speech and an unnecessary word usage frequency is stored for that speaker, the dictionary used by the speech recognition unit is switched to one whose number of unnecessary words corresponds to the stored frequency. As a result, a dictionary with relatively many unnecessary words is used to recognize the speech of a user who uses unnecessary words often, while a dictionary with relatively few unnecessary words is used for a user who rarely uses them. Because the invention of claim 1 thus selects the recognition dictionary according to how often the user uses unnecessary words, each user's speech can be recognized with a dictionary containing as few unnecessary words as possible.

As described in claim 2, the dictionary preferably consists of a target word dictionary, which collects target words (vocabulary required as input speech), and an unnecessary word dictionary, which collects unnecessary words (words inherently unnecessary as input speech). A plurality of unnecessary word dictionaries containing different numbers of unnecessary words are prepared, the dictionary switching unit switches the unnecessary word dictionary according to the stored unnecessary word usage frequency, and the speech recognition unit performs recognition using the target word dictionary together with the unnecessary word dictionary selected by the dictionary switching unit. Separating the target word dictionary from the unnecessary word dictionary and switching only the latter keeps the total dictionary size from becoming excessive.

As described in claim 3, the dictionary switching unit is preferably able to switch so that no unnecessary word dictionary is used. This is because a user who is proficient in operating the speech recognition system may not need an unnecessary word dictionary for their input speech to be recognized.
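The selection described in claims 1 to 3 can be sketched as a lookup from a stored frequency to a dictionary tier. This is only an illustration: the threshold values, the tier names, and the fallback used when no frequency is stored for a speaker are assumptions, not values given in this description.

```python
from enum import IntEnum
from typing import Mapping


class UnnecessaryWordDictionary(IntEnum):
    """Dictionary tiers, ordered by how many unnecessary words they contain."""
    NONE = 0   # unnecessary word dictionary not used (claim 3)
    SMALL = 1
    LARGE = 2


# Hypothetical thresholds on the stored usage frequency (a ratio from 0.0 to 1.0).
SMALL_THRESHOLD = 0.05
LARGE_THRESHOLD = 0.20


def tier_for_frequency(frequency: float) -> UnnecessaryWordDictionary:
    """Map an unnecessary word usage frequency to a dictionary tier."""
    if frequency >= LARGE_THRESHOLD:
        return UnnecessaryWordDictionary.LARGE
    if frequency >= SMALL_THRESHOLD:
        return UnnecessaryWordDictionary.SMALL
    return UnnecessaryWordDictionary.NONE


def tier_for_speaker(
    speaker_id: str,
    stored: Mapping[str, float],
    default: UnnecessaryWordDictionary = UnnecessaryWordDictionary.LARGE,
) -> UnnecessaryWordDictionary:
    """Pick the tier for an identified speaker.

    The fallback for a speaker with no stored frequency is an assumption.
    """
    frequency = stored.get(speaker_id)
    if frequency is None:
        return default
    return tier_for_frequency(frequency)
```

With these assumed thresholds, a speaker whose stored frequency is 0.10 would be given the small dictionary, and a speaker whose stored frequency is 0.0 would be recognized without any unnecessary word dictionary.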

As described in claim 4, the system preferably includes determination means for determining, based on a user operation, the type of information to be input by voice. When the determination means has determined the type of voice input information, the dictionary switching unit also takes that type into account in switching the dictionary used by the speech recognition unit.

For example, suppose the vehicle navigation device is the device to be operated and the destination is set either by speaking a combination of geographical information and a genre or by speaking a telephone number. Unnecessary words tend to be more numerous in the former case. The type of information input by voice thus correlates to some degree with the number of unnecessary words in the input speech. It is therefore preferable, as described in claim 4, to take the type of voice input information into account when switching dictionaries.

Claims 5 and 6 describe specific methods for switching the dictionary in consideration of the type of voice input information.

Specifically, as described in claim 5, when the determination means has determined the type of voice input information, the dictionary switching unit may switch to a dictionary whose number of unnecessary words corresponds to that type, regardless of the unnecessary word usage frequency of the speaker identified by the speaker identification means. This is because the type of voice input information is considered to affect the number of unnecessary words in the user's utterance more strongly than individual differences between users do.

Alternatively, as described in claim 6, the dictionary switching unit may compare the dictionary corresponding to the identified speaker's unnecessary word usage frequency with the dictionary corresponding to the type of voice input information, and switch to whichever contains more unnecessary words. The dictionary used by the speech recognition unit then accommodates both the unnecessary words attributable to individual differences between users and those attributable to the type of voice input information.
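The two policies of claims 5 and 6 differ only in how the speaker-based choice and the type-based choice are combined, which a short sketch may make concrete. The tier names and the mapping from information type to tier below are hypothetical; the description states only that geographical information plus genre tends to attract more unnecessary words than a telephone number.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    """Dictionary tiers, ordered by number of unnecessary words."""
    NONE = 0
    SMALL = 1
    LARGE = 2


# Hypothetical mapping from voice input information type to dictionary tier.
TIER_BY_INPUT_TYPE = {
    "phone_number": Tier.NONE,
    "address": Tier.SMALL,
    "genre": Tier.LARGE,  # geographical information combined with a genre
}


def select_claim5(speaker_tier: Tier, input_type: Optional[str]) -> Tier:
    """Claim 5: once the input type is determined, it overrides the speaker's tier."""
    if input_type in TIER_BY_INPUT_TYPE:
        return TIER_BY_INPUT_TYPE[input_type]
    return speaker_tier


def select_claim6(speaker_tier: Tier, input_type: Optional[str]) -> Tier:
    """Claim 6: use whichever candidate dictionary has more unnecessary words."""
    type_tier = TIER_BY_INPUT_TYPE.get(input_type)
    if type_tier is None:
        return speaker_tier
    return max(speaker_tier, type_tier)
```

Under these assumptions, a speaker who normally needs the large dictionary keeps it under the claim 6 policy even when entering a telephone number, whereas the claim 5 policy would drop the unnecessary word dictionary for that input.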

FIG. 1 is a block diagram showing the configuration of a vehicle navigation device equipped with a speech recognition system according to an embodiment of the present invention.

FIG. 2 is a control block diagram showing the detailed configuration of the speech recognition unit and the dialogue control unit in the speech recognition system.

FIG. 3(a) is an explanatory diagram showing how the unnecessary word frequency is stored for each user, and FIG. 3(b) is an explanatory diagram showing an example of the criteria for selecting an unnecessary word dictionary according to the unnecessary word frequency.

FIG. 4 is a flowchart showing the main control processing in the speech recognition system.

Embodiments of the present invention are described below with reference to the drawings. In the embodiment described here, the speech recognition system of the present invention is applied to a vehicle navigation device, but its application is not limited to vehicle navigation devices.

As shown in FIG. 1, the vehicle navigation device 2 includes a position detector 4, a data input device 6, an operation switch group 8, a control circuit 10 connected to these, and, connected to the control circuit 10, a communication device 12, an external memory 14, a display device 16, a remote control sensor 18, and a speech recognition system 30. The control circuit 10 is configured as an ordinary computer and contains a well-known CPU, ROM, RAM, and I/O, together with a bus line connecting them.

The position detector 4 has a well-known gyroscope 20, a distance sensor 22, and a GPS receiver 24 for detecting the position of the vehicle based on radio waves from satellites. Because the sensors 20, 22, and 24 each have errors of a different nature, they are used together so that they complement one another. Depending on the required accuracy, only some of them may be used, and a steering rotation sensor, wheel sensors for the individual rolling wheels, and the like may also be used.

The data input device 6 is a device for inputting various navigation data, including so-called map matching data for improving position detection accuracy, map data, and landmark data, as well as the dictionary data used by the speech recognition system 30 in recognition processing. Given the volume of data, a hard disk or DVD is generally used as the storage medium, but other media such as a CD-ROM may be used.

The operation switch group 8 consists of, for example, touch switches integrated with the display device 16 or mechanical switches, and outputs various operation instructions to the control circuit 10 when a switch is operated. Examples include map scale change, menu display selection, destination setting, route search, start of route guidance, display screen change, voice guidance setting, and volume adjustment. The operation switch group 8 also includes, for example, a switch for selecting the type of information used to set the departure point and the destination. By operating this selection switch, the user (a vehicle occupant) can set the departure point and the destination using the desired information, such as a previously registered point, a facility name, a telephone number, or an address.

The communication device 12 communicates with the contact specified by the configured contact communication information, and consists of, for example, a mobile communication device such as a mobile phone. The external memory 14 is a writable large-capacity storage device. It is used, for example, to store large amounts of data and data that must not be erased when the power is turned off, and to hold copies of frequently used data from the data input device 6. The external memory 14 may instead be a removable memory with a relatively small storage capacity.

The display device 16 is, for example, a liquid crystal display. Its screen can show, superimposed on one another, a vehicle current position mark indicating the current position detected by the position detector 4, map data for the area around the current position input from the data input device 6, and additional data to be displayed on the map, such as a guidance route and markers for set points. It can also display a menu screen presenting multiple options, and a command input screen that presents further options when one of them is selected.

The remote control sensor 18 receives operation signals from a remote control (not shown) and outputs them to the control circuit 10. The remote control has a number of switches, and operating them can instruct the control circuit 10 to execute functions substantially equivalent to those of the operation switch group 8.

Next, the configuration of the speech recognition system 30 is described with reference to FIGS. 1 and 2. FIG. 2 is a block diagram showing the detailed configuration of the speech recognition unit 31 and the dialogue control unit 32 in the speech recognition system 30.

Whereas the operation switch group 8 and the remote control are operated manually to input commands, the speech recognition system 30 allows the user to input commands to the control circuit 10 by speaking.

The speech recognition system 30 includes a speech recognition unit 31, a dialogue control unit 32, a speech synthesis unit 33, a speech extraction unit 34, a microphone 35, a talk switch 36, a speaker 37, and a control unit 38.

The talk switch 36 is used by the user (the driver) to indicate that voice input is about to begin, and is located where the user can operate it easily, for example on the side of the steering column cover or near the shift lever. The talk switch 36 is a so-called click-type switch: the user turns it on and then speaks. When an ON signal is input from the talk switch 36, the control unit 38 instructs the speech extraction unit 34 to execute speech signal extraction. The control unit 38 also notifies the speech recognition unit 31 and the dialogue control unit 32 that speech extraction has started in the speech extraction unit 34. The dialogue control unit 32 then outputs guidance such as “Please input voice” from the speaker 37 via the speech synthesis unit 33. The speech synthesis unit 33 uses speech waveforms stored in a waveform database to synthesize speech according to the response output instruction from the dialogue control unit 32, and this synthesized speech is output from the speaker 37.

The microphone 35, which receives the user's speech, is located where it can easily pick up the user's voice, for example on the upper surface of the steering column cover or on the sun visor on the driver's side. On instruction from the control unit 38, the speech extraction unit 34 takes in the speech signal from the microphone 35 and removes noise components from it to extract speech data. The extracted speech data is output to the speech recognition unit 31.

The processing in the speech extraction unit 34 is now described in more detail. The speech extraction unit 34 converts the ambient sound signal captured by the microphone 35 into digital speech data. It then cuts out frames of, for example, several tens of milliseconds at regular intervals and determines whether each input signal is a speech section, which contains speech, or a noise section, which does not. This determination is needed because the signal from the microphone 35 contains noise as well as the speech to be recognized. Many methods have been proposed for it. A commonly used one extracts the short-time power of the input signal at fixed intervals and classifies a section as speech or noise according to whether short-time power at or above a predetermined threshold has continued for at least a certain duration. A section determined to be a speech section is output to the speech recognition unit 31 as speech data.
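The short-time power method mentioned above can be illustrated with a minimal sketch. Non-overlapping frames, mean-square power, and the specific parameter names are assumptions made for the example; the description does not fix any of them.

```python
from typing import List, Sequence, Tuple


def short_time_power(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def speech_sections(
    samples: Sequence[float],
    frame_len: int,
    power_threshold: float,
    min_frames: int,
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, judged to be speech.

    A run of frames counts as a speech section only if its power stays at or
    above power_threshold for at least min_frames consecutive frames.
    """
    sections: List[Tuple[int, int]] = []
    run_start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        loud = short_time_power(frame) >= power_threshold
        if loud and run_start is None:
            run_start = i
        elif not loud and run_start is not None:
            if i - run_start >= min_frames:
                sections.append((run_start, i))
            run_start = None
    # Close a run that is still open at the end of the signal.
    if run_start is not None and n_frames - run_start >= min_frames:
        sections.append((run_start, n_frames))
    return sections
```

Requiring a minimum run length is what lets a brief burst of loud noise be rejected while sustained speech is passed on for recognition.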

As shown in FIG. 2, the speech recognition unit 31 includes an extraction result storage unit 311, a matching unit 312, and a dictionary unit 313.

The extraction result storage unit 311 stores the noise-removed speech data input from the speech extraction unit 34. The matching unit 312 then matches the speech data stored in the extraction result storage unit 311 against the target word dictionary 313a and the unnecessary word dictionary 313b in the dictionary unit 313 (recognition processing), compares it with multiple candidate patterns, and outputs the top-ranked patterns with the highest degree of match to the dialogue control unit 32 as the speech recognition result.

In this embodiment, as shown in FIG. 2, the dictionary unit 313 has an unnecessary word dictionary 313b in addition to the target word dictionary 313a. The target word dictionary 313a collects target words, that is, vocabulary needed for voice operation such as commands and destinations. The unnecessary word dictionary 313b collects unnecessary words, that is, words and phrases not needed for voice operation. In this embodiment, the unnecessary word dictionary 313b consists of an unnecessary word (large) dictionary 313ba, which contains many unnecessary words, and an unnecessary word (small) dictionary 313bb, which contains few. The dictionary unit 313 can switch the unnecessary word dictionary 313b used by the matching unit 312 in response to instructions from the dictionary switching unit 326 of the dialogue control unit 32, described later. Furthermore, when the dictionary switching unit 326 instructs that the unnecessary word dictionary 313b not be used, the dictionary unit 313 can provide the matching unit 312 with the target word dictionary 313a alone. In that case, the matching unit 312 performs the recognition processing described above using only the target word dictionary 313a, with no unnecessary word dictionary.
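One way to picture the dictionary unit is as a fixed target vocabulary combined with at most one switchable unnecessary word dictionary. The following sketch assumes that dictionaries can be modelled as sets of words; the class name, method names, and sample words are illustrative only.

```python
from typing import Dict, FrozenSet, Iterable, Mapping, Optional


class DictionaryUnit:
    """Provides the matching vocabulary: target words plus the active unnecessary words."""

    def __init__(
        self,
        target_words: Iterable[str],
        unnecessary_dictionaries: Mapping[str, Iterable[str]],
    ) -> None:
        self._target: FrozenSet[str] = frozenset(target_words)
        self._unnecessary: Dict[str, FrozenSet[str]] = {
            name: frozenset(words) for name, words in unnecessary_dictionaries.items()
        }
        self._active: Optional[str] = None  # None means "not used"

    def switch(self, name: Optional[str]) -> None:
        """Select an unnecessary word dictionary by name, or None to disable it."""
        if name is not None and name not in self._unnecessary:
            raise KeyError(name)
        self._active = name

    def vocabulary(self) -> FrozenSet[str]:
        """Vocabulary handed to the matching step under the current selection."""
        if self._active is None:
            return self._target
        return self._target | self._unnecessary[self._active]
```

Because only the unnecessary word dictionary is swapped, the target vocabulary is stored once regardless of how many unnecessary word dictionaries are prepared.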

As shown in FIG. 2, the dialogue control unit 32 includes a processing unit 321, a speaker identification unit 322, an unnecessary word usage frequency counting unit 323, an unnecessary word usage frequency storage unit 324, an unnecessary word usage frequency determination unit 325, and a dictionary switching unit 326.

Based on the recognition result from the speech recognition unit 31 and instructions from the control unit 38, the processing unit 321 either instructs the speech synthesis unit 33 to output a response, or notifies the control circuit 10, which executes the processing of the navigation device 2 itself, of the recognition result (for example, a destination or a command) and instructs it to set the destination or execute the command. As a result, the speech recognition system 30 makes it possible to specify a destination and perform similar operations on the navigation device 2 by voice, without manually operating the operation switch group 8 or the remote control.

In addition, when the control circuit 10 determines the type of information to be used, for example to set the destination, as a result of manual operation of the operation switch group 8 or the remote control, or of voice operation, it notifies the processing unit 321 of the determined type. The processing unit 321 then instructs the dictionary switching unit 326 to select the unnecessary word dictionary 313b whose number of unnecessary words corresponds to that type. Thus, when information of the determined type is input by voice, the input speech can be recognized using an unnecessary word dictionary of a size suited to that type of information.

For example, when the destination is set by speaking a combination of geographical information and a genre (such as “ramen shop near Nagoya Station”), the number of unnecessary words tends to be greater than when it is set by speaking a telephone number. The type of information input by voice thus correlates to some degree with the number of unnecessary words in the input speech. Switching the unnecessary word dictionary according to the type of voice input information therefore allows recognition to be performed with an unnecessary word dictionary containing an appropriate number of unnecessary words.

To specify to the control circuit 10 by voice the type of information used to set the destination, the user says “destination setting” and then the type of information to be input by voice (“address”, “facility name”, “genre” (including geographical information), “telephone number”, and so on).

The speaker identification unit 322 receives the speech data stored in the extraction result storage unit 311 and identifies the user who is speaking based on that data. Specifically, the speaker identification unit 322 generates and stores, for each user, a so-called speaker model representing the acoustic features of that user's speech, and uses these models to identify the speaker of an actual utterance. The comparison between a speaker model and the acoustic features of an utterance can be performed, for example, by calculating the similarity between them and comparing it with the similarity threshold associated with that speaker model. When the difference between the similarity and the threshold falls within a predetermined range, the utterance is identified as having been made by the speaker corresponding to that model.
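As a rough illustration of threshold-based speaker identification, the sketch below represents each speaker model as a feature vector with its own threshold and uses cosine similarity. Both choices are assumptions, and the acceptance rule is simplified to "similarity reaches the model's threshold", which is only one possible reading of the predetermined range mentioned above.

```python
import math
from typing import Mapping, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    features: Sequence[float],
    models: Mapping[str, Tuple[Sequence[float], float]],
) -> Optional[str]:
    """Return the best-matching speaker id, or None if no model is close enough.

    models maps speaker id -> (model feature vector, similarity threshold).
    Simplifying assumption: a model matches when similarity >= its threshold.
    """
    best_id: Optional[str] = None
    best_margin = 0.0
    for speaker_id, (model, threshold) in models.items():
        margin = cosine_similarity(features, model) - threshold
        if margin >= 0.0 and (best_id is None or margin > best_margin):
            best_id, best_margin = speaker_id, margin
    return best_id
```

Returning `None` for an unmatched utterance leaves room for the caller to treat the speaker as new, for example by creating a model and starting a fresh usage record.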

When speech actually uttered by a speaker has been recognized by the matching unit 312, the unnecessary word usage frequency counting unit 323 counts the usage frequency of unnecessary words within the whole recognized phrase. As shown in FIG. 3(a), for each speaker identified by the speaker identification unit 322, either the count result is stored as is in the unnecessary word usage frequency storage unit 324, or an unnecessary word usage frequency updated with the count result is stored. That is, if no unnecessary word usage frequency is stored in the unnecessary word usage frequency storage unit 324 for the identified speaker, the count result from the unnecessary word usage frequency counting unit 323 is stored as is. If an unnecessary word usage frequency for the identified speaker is already stored, the stored value is updated to reflect the current count result.
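A minimal sketch of this store-or-update behavior is shown below. The description does not say how the stored frequency is updated, so the sketch assumes cumulative counts per speaker (unnecessary words over all recognized words); the class name `UsageFrequencyStore` and its methods are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class UsageFrequencyStore:
    """Hypothetical stand-in for the counting unit 323 plus storage unit 324."""

    def __init__(self) -> None:
        # speaker -> (unnecessary word count, total recognized word count)
        self._counts: Dict[str, Tuple[int, int]] = {}

    def record(self, speaker: str, recognized_words: Iterable[str],
               unnecessary_words: Set[str]) -> None:
        """Count unnecessary words in one recognized phrase and store or update."""
        words = list(recognized_words)
        unnecessary = sum(1 for word in words if word in unnecessary_words)
        # With no prior entry this stores the count as is; otherwise the
        # new count is folded into the stored totals (assumed update rule).
        prev_unnecessary, prev_total = self._counts.get(speaker, (0, 0))
        self._counts[speaker] = (prev_unnecessary + unnecessary,
                                 prev_total + len(words))

    def frequency(self, speaker: str) -> Optional[float]:
        """Usage frequency in percent, or None if nothing is stored."""
        if speaker not in self._counts:
            return None
        unnecessary, total = self._counts[speaker]
        return 100.0 * unnecessary / total if total else 0.0
```

Other update rules, such as a moving average that weights recent utterances more heavily, would fit the description equally well.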

When the speaker identification unit 322 has identified a speaker and an unnecessary word usage frequency for that speaker is stored in the unnecessary word usage frequency storage unit 324, the unnecessary word usage frequency determination unit 325 reads the stored frequency and determines how many unnecessary words the unnecessary word dictionary must contain in order to recognize that speaker's speech.

For example, as shown in FIG. 3(b), when the usage frequency of unnecessary words is 50% or more, the speaker is considered to use unnecessary words often and to use many kinds of them, so the unnecessary word (large) dictionary 313ba, which contains the most unnecessary words, is determined to be necessary. When the usage frequency is greater than 0% and less than 50%, the frequency is not especially high and the kinds of unnecessary words used are considered limited, so the unnecessary word (small) dictionary 313bb, which contains relatively few unnecessary words, is determined to be appropriate. When the usage frequency is 0%, the speaker is considered to be proficient in voice operation and to utter only target words without unnecessary words, so it is determined that no unnecessary word dictionary is needed.
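The thresholds of FIG. 3(b) as described above can be written as a small function. The enum and function names are hypothetical; only the 0% and 50% boundaries come from the description.

```python
from enum import Enum


class UnnecessaryWordDictionary(Enum):
    NONE = "no unnecessary word dictionary"
    SMALL = "unnecessary word (small) dictionary 313bb"
    LARGE = "unnecessary word (large) dictionary 313ba"


def required_dictionary(usage_percent: float) -> UnnecessaryWordDictionary:
    """Map a stored usage frequency (in percent) to the dictionary to use."""
    if not 0.0 <= usage_percent <= 100.0:
        raise ValueError("usage frequency must be between 0 and 100 percent")
    if usage_percent >= 50.0:
        return UnnecessaryWordDictionary.LARGE
    if usage_percent > 0.0:
        return UnnecessaryWordDictionary.SMALL
    return UnnecessaryWordDictionary.NONE
```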

Based on the unnecessary word dictionary selection instruction from the processing unit 321 and the determination result of the unnecessary word usage frequency determination unit 325, the dictionary switching unit 326 switches the unnecessary word dictionary 313b that the matching unit 312 uses when recognizing input speech.

For example, when the type of information to be input has been determined and the processing unit 321 has accordingly instructed the dictionary switching unit 326 to select an unnecessary word dictionary 313b whose number of unnecessary words suits that type, the dictionary switching unit 326 switches to the unnecessary word dictionary 313b indicated by the processing unit 321, regardless of the determination result of the unnecessary word usage frequency determination unit 325. When the processing unit 321 has given no instruction about selecting the unnecessary word dictionary 313b, the dictionary switching unit 326 switches the unnecessary word dictionary 313b based on the determination result of the unnecessary word usage frequency determination unit 325. The reason is that the type of voice input information is considered to affect the number of unnecessary words in the user's speech more strongly than individual differences between users do.

Next, the main control processing in the speech recognition system 30 described above, including the switching of the unnecessary word dictionary 313b, is described with reference to the flowchart of FIG. 4.

First, in step S110, it is determined whether the talk switch 36 has been turned on. If it has, the process proceeds to step S120, where voice input processing is performed: the speech extraction unit 34 generates voice data by removing noise components from the voice signal input to the microphone 35.

In step S130, it is determined whether the control circuit 10 has notified that the type of input information has been determined. If the type has been determined, the process proceeds to step S140, where an unnecessary word dictionary 313b whose number of unnecessary words suits the determined type is selected. As a result, when information of the determined type is input by voice, the input speech can be recognized using an unnecessary word dictionary 313b containing a number of unnecessary words suited to that type (including the case where no unnecessary word dictionary 313b is used). If it is determined in step S130 that the type of input information has not been determined, the process proceeds to step S150.

In step S150, the speaker identification unit 322 identifies the speaker based on the extracted voice data. That is, a so-called speaker model representing the acoustic features of speech uttered by each user is generated and stored for each user, and the speaker models are used to identify who produced the actual utterance.

Then, in step S160, it is determined whether an unnecessary word usage frequency is stored for the identified speaker. If one is stored, the process proceeds to step S170, where the unnecessary word dictionary 313b whose number of unnecessary words is best suited to recognizing the identified speaker's speech is determined based on the stored unnecessary word usage frequency (including the case where no unnecessary word dictionary is used). If it is determined in step S160 that no unnecessary word usage frequency is stored, the process proceeds to step S180, where the unnecessary word (large) dictionary 313ba, which contains the most unnecessary words, is chosen as the unnecessary word dictionary 313b to use. Because it is not known how often this user uses unnecessary words, this choice allows the input speech to be recognized even if unnecessary words are used frequently.

In the following step S190, recognition of the speech input by the user is performed using the target word dictionary 313a together with the unnecessary word dictionary 313b selected in step S140 or determined in step S170 or S180. In step S200, the recognition result is output to the control circuit 10.

In step S210, the usage frequency of unnecessary words within the whole recognized phrase is counted based on the recognition result. Then, in step S220, for the speaker identified in step S150, either the count result is stored as is or an unnecessary word usage frequency updated with the count result is stored.
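The dictionary selection branch of this flow (steps S130 to S180) can be sketched as follows. This is an illustration under stated assumptions: the `DICTIONARY_FOR_TYPE` mapping from information type to dictionary is hypothetical (the description only indicates that genre input tends to involve more unnecessary words than telephone number input), speaker identification in S150 is assumed to have already produced `speaker`, and dictionaries are represented by the strings `"large"` and `"small"`, with `None` meaning that no unnecessary word dictionary is used.

```python
from typing import Dict, Optional

# Hypothetical mapping; the actual assignment per type is not given in the source.
DICTIONARY_FOR_TYPE: Dict[str, Optional[str]] = {
    "genre": "large",
    "address": "small",
    "facility name": "small",
    "telephone number": None,
}


def dictionary_from_frequency(usage_percent: float) -> Optional[str]:
    """Thresholds of FIG. 3(b): >= 50% large, (0%, 50%) small, 0% none."""
    if usage_percent >= 50.0:
        return "large"
    if usage_percent > 0.0:
        return "small"
    return None


def choose_dictionary(info_type: Optional[str], speaker: str,
                      stored_frequencies: Dict[str, float]) -> Optional[str]:
    """Return the unnecessary word dictionary to use for the next recognition."""
    # S130 -> S140: a determined information type takes precedence.
    if info_type is not None:
        return DICTIONARY_FOR_TYPE[info_type]
    # S160: is a usage frequency stored for the speaker identified in S150?
    frequency = stored_frequencies.get(speaker)
    if frequency is not None:
        # S170: pick the dictionary suited to this speaker.
        return dictionary_from_frequency(frequency)
    # S180: unknown speaker habits, so fall back to the largest dictionary.
    return "large"
```

The result would then be passed to the recognition step (S190) together with the target word dictionary 313a.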

Preferred embodiments of the present invention have been described above. The present invention is not limited to these embodiments, however, and can be implemented with various modifications without departing from its spirit.

For example, in the embodiment described above, when the control circuit 10 notifies that the type of input information has been determined, an unnecessary word dictionary 313b whose number of unnecessary words suits that type is selected, regardless of which speaker performs the voice input.

Alternatively, for example, an unnecessary word dictionary 313b suited to the type of input information and an unnecessary word dictionary 313b suited to the identified speaker's unnecessary word usage frequency may each be obtained, and the system may switch to whichever of the two contains more unnecessary words. This makes it possible to switch the unnecessary word dictionary 313b used in the speech recognition unit 31 to one that appropriately covers both the number of unnecessary words attributable to individual differences between users and the number attributable to the type of input information.
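This variation amounts to taking the larger of two candidates. A minimal sketch, assuming the dictionaries can be ordered by the number of unnecessary words they contain (the `DictionarySize` enum and `combine` function are hypothetical names):

```python
from enum import IntEnum


class DictionarySize(IntEnum):
    """Ordered by the number of unnecessary words contained."""
    NONE = 0   # no unnecessary word dictionary
    SMALL = 1  # unnecessary word (small) dictionary 313bb
    LARGE = 2  # unnecessary word (large) dictionary 313ba


def combine(by_type: DictionarySize, by_speaker: DictionarySize) -> DictionarySize:
    """Pick whichever candidate contains more unnecessary words."""
    return max(by_type, by_speaker)
```

With this rule, a speaker who habitually uses many unnecessary words still gets the large dictionary when entering a telephone number, whereas the base embodiment would follow the information type alone.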

In the embodiment described above, the unnecessary word dictionary 313b is switched among three options: no unnecessary word dictionary 313b, the unnecessary word (small) dictionary 313bb, and the unnecessary word (large) dictionary 313ba. However, the number of options to switch among may be two, or four or more.

30 speech recognition system
31 speech recognition unit
32 dialogue control unit
33 speech synthesis unit
34 speech extraction unit
35 microphone
36 talk switch
37 speaker
38 control unit

Claims

A speech recognition system comprising:
a voice input unit for inputting voice;
speaker identification means for identifying a speaker based on the voice input to the voice input unit;
a speech recognition unit for recognizing the voice input to the voice input unit using a dictionary having a large number of recognition vocabulary items;
an unnecessary word usage frequency storage unit that, for each speaker identified by the speaker identification means, stores an unnecessary word usage frequency calculated from a count of the unnecessary words, which are words not needed as input speech, in the speech recognized by the speech recognition unit; and
a dictionary switching unit, wherein a plurality of dictionaries containing different numbers of unnecessary words in the recognition vocabulary are prepared as the dictionary used in the speech recognition unit, and wherein, when an unnecessary word usage frequency is stored in the unnecessary word usage frequency storage unit for the speaker identified by the speaker identification means, the dictionary switching unit switches the dictionary used in the speech recognition unit to a dictionary whose number of unnecessary words corresponds to the stored unnecessary word usage frequency.

The speech recognition system according to claim 1, wherein
the dictionary comprises a target word dictionary that collects target words, which are the vocabulary needed as input speech, and an unnecessary word dictionary that collects unnecessary words, which are inherently not needed as input speech, a plurality of unnecessary word dictionaries containing different numbers of unnecessary words being prepared,
the dictionary switching unit switches the unnecessary word dictionary according to the stored unnecessary word usage frequency, and
the speech recognition unit performs speech recognition using the target word dictionary and the unnecessary word dictionary selected by the dictionary switching unit.

The speech recognition system according to claim 2, wherein the dictionary switching unit can switch the unnecessary word dictionary so that the unnecessary word dictionary is not used.

The speech recognition system according to any one of claims 1 to 3, further comprising a determination unit that determines, based on a user operation, the type of information to be input by voice,
wherein, when the determination unit has determined the type of voice input information, the dictionary switching unit switches the dictionary used in the speech recognition unit in consideration of the determined type of voice input information.

The speech recognition system according to claim 4, wherein, when the determination unit has determined the type of voice input information, the dictionary switching unit switches to a dictionary whose number of unnecessary words corresponds to the determined type of voice input information, regardless of the unnecessary word usage frequency of the speaker identified by the speaker identification means.

The speech recognition system according to claim 4, wherein the dictionary switching unit obtains a dictionary whose number of unnecessary words corresponds to the unnecessary word usage frequency of the speaker identified by the speaker identification means and a dictionary whose number of unnecessary words corresponds to the type of the voice input information, and switches to whichever of the two contains the larger number of unnecessary words.