JP2013088477A

JP2013088477A - Speech recognition system

Info

Publication number: JP2013088477A
Application number: JP2011226051A
Authority: JP
Inventors: Shuichi Kawaguchi; 修市川口
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2011-10-13
Filing date: 2011-10-13
Publication date: 2013-05-13

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition system that shortens the time required to obtain a recognition result when speech recognition processing is distributed between a client and a server, and that can reduce the scale and processing burden of the client with respect to speech recognition processing.

SOLUTION: Speech recognition processing on a user's speech input to an in-vehicle device 1 is performed either by the in-vehicle device 1 or by a server 2 connected via a network 3. The in-vehicle device 1 includes: a microphone 22; a speech recognition processing unit 100 that performs speech recognition processing on a plurality of words or sentences prepared in advance; a distribution determination unit 102 that determines whether an input speech is to be recognized by the speech recognition processing unit 100 or by the server 2; and a speech data transmission unit 56 that transmits the speech data to the server 2 when the speech is to be recognized by the server.

Description

The present invention relates to a speech recognition system that performs speech recognition processing on input speech.

Conventionally, a speech recognition system is known that includes a server and a client connected via a network, in which whatever the client can process is processed by the client and only what cannot be processed without the server is transmitted to the server for processing (see, for example, Patent Document 1). Which speech recognition engine to use, the client's or the server's, is basically decided by whether the client has a small-vocabulary dictionary group including one or more of an acoustic model dictionary, a language model dictionary, and a word dictionary. That is, if the small-vocabulary dictionary group is present, speech recognition processing is performed by the client; if it is not, speech recognition processing is performed by the server. Speech recognition processing is also performed by the server when the client has attempted speech recognition processing and failed to recognize the speech.

Patent Document 1: Japanese Patent Laid-Open No. 2005-249829 (pages 8-11, FIGS. 1-3)

Regarding whether speech recognition processing for an input speech is performed by the client or by the server, Patent Document 1 includes the following specific examples.
(1) If a small-vocabulary dictionary group including one or more of an acoustic model dictionary, a language model dictionary, and a word dictionary adapted to the language of each country is prepared, speech recognition adapted to each language can also be performed by the client.
(2) If small-vocabulary dictionary groups including one or more of a field-specific acoustic model dictionary, language model dictionary, and word dictionary are prepared according to the use, such as a hospital or restaurant reservation system or Internet stock trading, the user can select a small-vocabulary dictionary group for each field the user wishes to use, and the hit rate of speech recognition at the client can also be increased.

As these descriptions show, in the speech recognition system of Patent Document 1, a small-vocabulary dictionary group corresponding to a specific use (a specific language or field) is prepared in the client and the client performs speech recognition processing for that use, while the server performs speech recognition processing when the client cannot recognize the speech and for all other uses.

However, in such a speech recognition system, when the client performs speech recognition processing and fails to recognize the speech, speech recognition processing by the server is performed afterward, so in such cases it takes a long time to obtain the final recognition result. This problem arises because speech recognition processing is performed by both the client and the server. Ensuring that the client reliably yields a good recognition result whenever it performs speech recognition processing would, however, require enlarging the client's small-vocabulary dictionary group and raising the performance of the speech recognition engine that uses it, which defeats the purpose of distributing speech recognition processing between the client and the server.

The present invention was made in view of these points. Its object is to provide a speech recognition system that shortens the time required to obtain a recognition result when speech recognition processing is distributed between a client and a server, and that can reduce the scale and processing burden of the client with respect to speech recognition processing.

To solve the above problem, the speech recognition system of the present invention performs speech recognition processing on a user's speech input at a client, either by the client or by a server connected to the client via a network. The client includes: speech input means for inputting speech to be recognized; client-side speech recognition processing means for performing speech recognition processing on a plurality of words or sentences prepared in advance; distribution means for distributing the speech input by the speech input means between speech to be recognized by the client-side speech recognition processing means and speech to be recognized by the server; and client-side communication means for transmitting to the server the data of speech that the distribution means has assigned to the server. The server includes: server-side communication means for receiving the speech data sent from the client; and server-side speech recognition processing means for performing speech recognition processing using the speech data received by the server-side communication means. Specifically, the client-side speech recognition processing means performs speech recognition processing on the speech input by the speech input means and thereby identifies the reading of one of the plurality of words or sentences prepared in advance.

Because speech recognition processing on the client side is limited to words and sentences prepared in advance, speech recognition processing to be performed on the client side and speech recognition processing to be performed on the server side can be distributed accurately. This avoids performing speech recognition processing on both the client side and the server side and shortens the time required to obtain a recognition result. In addition, because the client side only needs to produce recognition results for the words and sentences prepared in advance, the scale and processing burden of the client with respect to speech recognition processing can be reduced.

The client is preferably an in-vehicle device, and preferably further includes input processing means for issuing operation instructions or inputting information to the in-vehicle device in accordance with the recognition result from the client-side speech recognition processing means or the server-side speech recognition processing means. This allows various inputs to the in-vehicle device to be made by speech input using speech recognition processing, while shortening the time required for that speech recognition processing and reducing the scale and processing burden of the in-vehicle device.

Preferably, the client further includes operation means for accepting manual operation by the user, and the input processing means issues operation instructions or inputs information to the in-vehicle device in accordance with any of the recognition result from the client-side speech recognition processing means, the recognition result from the server-side speech recognition processing means, and manual operation using the operation means. Thus, when making various inputs to the in-vehicle device, the user can choose between speech input using speech recognition processing and manual input using the operation means as needed, which improves operability.

Preferably, when connection to the server via the network is not possible, the distribution means assigns the speech to speech recognition processing in the client-side speech recognition processing means instead of speech recognition processing in the server. Thus, even when connection to the server is not possible for some reason, operation instructions and information input can still be made using speech recognition processing.

Preferably, when connection to the server via the network is not possible, the input processing means issues operation instructions or inputs information to the in-vehicle device in accordance with manual operation using the operation means instead of the recognition result from the server-side speech recognition processing means. Thus, even when connection to the server is not possible for some reason, operation instructions and information input can still be made using the operation means.
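The two fallback behaviours described above for an unreachable server (falling back to client-side recognition, or falling back to manual operation) can be illustrated with a minimal Python sketch. The names `Source` and `choose_source`, and the `fall_back_to_manual` flag that selects between the two variants, are hypothetical and do not appear in the source; how reachability is detected is also left open here.

```python
from enum import Enum


class Source(Enum):
    CLIENT_ASR = "client-side speech recognition"
    SERVER_ASR = "server-side speech recognition"
    MANUAL = "manual operation"


def choose_source(intended: Source, server_reachable: bool,
                  fall_back_to_manual: bool = False) -> Source:
    """Return the input source actually used for one input.

    `intended` is what the distribution step would pick when the network
    is available. Only an intended server-side recognition is affected
    by an unreachable server.
    """
    if intended is not Source.SERVER_ASR or server_reachable:
        return intended
    # Server wanted but unreachable: apply one of the two described variants.
    return Source.MANUAL if fall_back_to_manual else Source.CLIENT_ASR
```

With the server reachable the intended source is returned unchanged; the flag only matters when a server-bound input cannot be sent.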

Preferably, when the plurality of words or sentences that can be the object of an operation instruction or information input to the in-vehicle device are known, the distribution means assigns the speech to speech recognition processing by the client-side speech recognition processing means, and the client-side speech recognition processing means performs speech recognition processing on the speech input by the speech input means and thereby selects, from among the plurality of words or sentences, the one corresponding to the speech recognition result. The client-side speech recognition processing means preferably has a speech recognition dictionary for the known plurality of words or sentences. This allows the client-side speech recognition processing means to reliably extract, from the words and sentences prepared in advance, the one corresponding to the input speech.

The plurality of words or sentences are preferably a plurality of operation commands for issuing operation instructions to the in-vehicle device. In this case, operation instructions to the in-vehicle device are recognized on the in-vehicle device side, so the content of an instruction can be determined quickly and reflected in the operation of the in-vehicle device.

Preferably, the in-vehicle device operates as a hands-free telephone system that, when a mobile telephone containing phone book data that includes destination telephone numbers and a name corresponding to each telephone number is connected, places a call via the mobile telephone to a telephone number included in the phone book data, and the plurality of words or sentences are at least one of the telephone numbers and the names. The client-side speech recognition processing means preferably has speech recognition dictionary creation means that, when the mobile telephone is connected, creates a speech recognition dictionary corresponding to the readings of at least one of the telephone numbers and the names included in the phone book data. This makes it possible to apply the present invention to the input of telephone numbers and names in a hands-free telephone system serving as the in-vehicle device.
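As an illustration of creating a dictionary from phone book data, the following Python sketch maps each reading to its phone book entry. It assumes, beyond what the source states, that each entry already carries a textual reading of the name and that the digit string of the number can serve as its reading; `PhoneBookEntry`, `normalize`, and `build_phone_dictionary` are hypothetical names, and the sample data used in testing is made up.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PhoneBookEntry:
    name: str
    reading: str  # assumed: a pronunciation supplied with the phone book data
    number: str


def normalize(reading: str) -> str:
    """Lower-case a reading and strip all whitespace."""
    return "".join(reading.lower().split())


def build_phone_dictionary(entries: Iterable[PhoneBookEntry]) -> Dict[str, PhoneBookEntry]:
    """Map the reading of each name and each number to its entry."""
    dictionary: Dict[str, PhoneBookEntry] = {}
    for entry in entries:
        dictionary[normalize(entry.reading)] = entry
        # Keep only the digits so "03-1234-5678" and "0312345678" coincide.
        digits = "".join(ch for ch in entry.number if ch.isdigit())
        dictionary[digits] = entry
    return dictionary
```

Rebuilding the dictionary each time a phone is connected keeps the client vocabulary limited to the entries of the currently connected phone.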

Preferably, the in-vehicle device operates as an audio device that selectively plays back a plurality of songs, and the plurality of words or sentences are at least one of the song title, album name, and artist name corresponding to each of the songs. The client-side speech recognition processing means preferably has speech recognition dictionary creation means for creating a speech recognition dictionary corresponding to the readings of at least one of the song titles, album names, and artist names. This makes it possible to apply the present invention to the input of song titles, album names, and artist names in an audio device serving as the in-vehicle device.

Preferably, the in-vehicle device further includes document creation means for creating a document to be sent via the network, and when the text input needed to create the document is made on the basis of speech input by the speech input means, the distribution means assigns the speech to speech recognition processing in the server-side speech recognition processing means. Preferably, the in-vehicle device further includes recognition result acquisition means for acquiring the recognition result of the server-side speech recognition processing means transmitted from the server, and the document creation means uses the recognition result acquired by the recognition result acquisition means as the text needed to create the document. This makes it possible to apply the present invention to text input when creating a document, such as an e-mail, on the in-vehicle device.

Preferably, the in-vehicle device further includes facility information display means for displaying detailed information on a specific facility, and when the specific facility whose detailed information is to be displayed by the facility information display means is input on the basis of speech input by the speech input means, the distribution means assigns the speech to speech recognition processing in the server-side speech recognition processing means. Preferably, the in-vehicle device further includes recognition result acquisition means for acquiring either the recognition result of the server-side speech recognition processing means transmitted from the server or detailed information retrieved using that recognition result, and the facility information display means displays either detailed information retrieved using the recognition result acquired by the recognition result acquisition means or the detailed information acquired by the recognition result acquisition means. This makes it possible to apply the present invention to the input of the specific facility to be displayed when the in-vehicle device displays detailed information on a specific facility.
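The distribution rules described in the preceding paragraphs (known vocabularies such as operation commands, phone book entries, and music metadata go to the client; free text for documents and facility names goes to the server) can be summarized in a short Python sketch. It assumes the device knows what kind of input it is currently waiting for; the source text up to this point does not say how that is determined, and `InputContext`, `Target`, and `distribute` are hypothetical names.

```python
from enum import Enum, auto


class InputContext(Enum):
    OPERATION_COMMAND = auto()
    PHONE_BOOK = auto()
    MUSIC_SELECTION = auto()
    DOCUMENT_TEXT = auto()
    FACILITY_NAME = auto()


class Target(Enum):
    CLIENT = "client"
    SERVER = "server"


# Contexts whose vocabulary is known in advance and held in the client dictionary.
CLIENT_CONTEXTS = frozenset({
    InputContext.OPERATION_COMMAND,
    InputContext.PHONE_BOOK,
    InputContext.MUSIC_SELECTION,
})


def distribute(context: InputContext) -> Target:
    """Decide once, before recognition, where an utterance is recognized."""
    return Target.CLIENT if context in CLIENT_CONTEXTS else Target.SERVER
```

Because the decision is made before any recognition is attempted, an utterance is never processed first by the client and then again by the server.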

The speech input means is preferably a microphone. In this case, the user only has to speak toward the microphone provided in the client; speech recognition processing is appropriately distributed to the client side or the server side, and a recognition result can be obtained in a short time.

Preferably, the system further includes a speech switch that the user can operate when speaking toward the microphone, and the distribution means distributes the user's speech collected by the microphone after the speech switch has been operated. This makes the input timing of the speech to be recognized explicit, which simplifies the processing procedure and improves recognition accuracy.

Preferably, a plurality of servers, each including server-side speech recognition processing means, can be connected to the client, and the distribution means selects one of the servers when assigning speech to speech recognition processing in a server. In this case, speech recognition processing can be requested from servers that specialize in different fields, which improves recognition accuracy when the server is used.
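Selecting among several servers by field can be sketched in Python as a lookup with a default. The field labels, the `"general"` default, and the function name `select_server` are assumptions made for illustration; the source only states that the servers may differ in the fields they handle well.

```python
from typing import Mapping


def select_server(field: str, servers: Mapping[str, str],
                  default_field: str = "general") -> str:
    """Return the server registered for `field`, else the default server.

    `servers` maps a field label to a server identifier (hypothetical).
    """
    return servers.get(field, servers[default_field])
```

A fixed default keeps the selection total: every server-bound utterance is assigned to exactly one server even when no specialist is registered for its field.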

FIG. 1 is a diagram showing the overall configuration of a speech recognition system according to an embodiment.
FIG. 2 is a diagram showing the detailed configuration of the in-vehicle device.
FIG. 3 is a flowchart showing the operation procedure from the user's utterance until its content is reflected in the operation or the like of the in-vehicle device.
FIG. 4 is a flowchart showing the operation procedure of a modification for the case where connection to the server is not possible.
FIG. 5 is a flowchart showing the operation procedure of another modification for the case where connection to the server is not possible.
FIG. 6 is a flowchart showing the operation procedure for reading phone book data and registering a speech recognition dictionary when a cellular phone is connected.
FIG. 7 is a flowchart showing the operation procedure for reading the attached information of a content list and registering a speech recognition dictionary when a USB memory is connected.
FIG. 8 is a diagram showing the overall configuration of a speech recognition system according to a modification.
FIG. 9 is a flowchart showing the operation procedure of a modification in which a plurality of servers are used selectively.

A speech recognition system according to an embodiment to which the present invention is applied is described below with reference to the drawings. FIG. 1 is a diagram showing the overall configuration of the speech recognition system according to the embodiment. The speech recognition system of this embodiment includes an in-vehicle device 1 and a server 2. The in-vehicle device 1 has functions such as those of a navigation device and an audio device, and is mounted in a vehicle. The server 2 is provided outside the vehicle and is connected to the in-vehicle device 1 via a predetermined network 3. The network 3 is, for example, the Internet, and is reached via a cellular phone, serving as the mobile telephone, connected to the in-vehicle device 1 and via a base station (neither shown). The in-vehicle device 1 need not connect to the network 3 through a cellular phone; it may instead connect through a wireless LAN communication device connected to (or built into) the in-vehicle device 1 and an access point (neither shown). Alternatively, the in-vehicle device 1 may connect by wireless LAN when the network 3 is reachable by wireless LAN, and connect through the cellular phone when wireless LAN connection is not possible (for example, when no access point is nearby).

The in-vehicle device 1 includes a speech recognition processing unit 100 and a distribution determination unit 102, and the server 2 includes a speech recognition processing unit 200. In the speech recognition system of this embodiment, the user's speech input at the in-vehicle device 1, which serves as the client, is recognized either by the speech recognition processing unit 100 in the in-vehicle device 1 or by the speech recognition processing unit 200 in the server 2 connected to the in-vehicle device 1 via the network 3. The distribution determination unit 102 determines whether speech recognition processing is performed in the in-vehicle device 1 or in the server 2.

FIG. 2 is a diagram showing the detailed configuration of the in-vehicle device 1. As shown in FIG. 2, the in-vehicle device 1 includes a navigation processing unit 10, a TV tuner processing unit 14, a radio tuner processing unit 16, a speech input processing unit 20, the speech recognition processing unit 100, an operation unit 40, a speech switch (SW) 42, an input control unit 44, a control unit 50, a display processing unit 60, a display device 62, a digital-to-analog converter (D/A) 64, a speaker 66, a hard disk drive (HDD) 70, and USB (Universal Serial Bus) interface units (USB I/F) 80 and 82.

The navigation processing unit 10 uses map data stored in the hard disk drive 70 to perform navigation operations that guide the travel of the vehicle in which the in-vehicle device 1 is mounted. It is used together with a GPS device 12 that detects the vehicle position. The navigation operations that guide the travel of the vehicle include, in addition to map display and route search and guidance, searching for and displaying nearby facilities and POIs (points of interest). For vehicle position detection, autonomous navigation sensors such as a gyro sensor and a vehicle speed sensor may be used in combination with the GPS device 12.

The TV tuner processing unit 14 receives broadcast signals such as digital terrestrial broadcasts and reproduces video and audio. The radio tuner processing unit 16 receives radio broadcast signals and reproduces audio.

The speech input processing unit 20 performs input processing on the speech of the user (speaker) collected by the microphone 22. Specifically, the speech input processing unit 20 includes an analog-to-digital converter (A/D) 24 and a compression processing unit 26. The analog-to-digital converter 24 converts the output signal of the microphone 22 into digital speech data. The compression processing unit 26 compresses the speech data output from the analog-to-digital converter 24.
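The two stages of the speech input processing unit 20 can be illustrated with a small Python sketch: quantizing samples to digital speech data, then compressing that data. The 16-bit PCM format and the use of zlib are assumptions chosen only to make the example runnable; the source names neither a sample format nor a compression method, and a real system would more likely use a speech codec. The function names are hypothetical.

```python
import struct
import zlib
from typing import Sequence


def quantize_16bit(samples: Sequence[float]) -> bytes:
    """Convert samples in [-1.0, 1.0] to little-endian signed 16-bit PCM.

    Stands in for the A/D stage; out-of-range samples are clipped.
    """
    pcm = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))
        pcm.append(int(round(clipped * 32767)))
    return struct.pack("<%dh" % len(pcm), *pcm)


def compress_speech(pcm_bytes: bytes) -> bytes:
    """Stand-in for the compression stage (lossless zlib, an assumption)."""
    return zlib.compress(pcm_bytes)
```

Compressing before transmission reduces the amount of speech data sent over the network when an utterance is assigned to the server.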

The speech recognition processing unit 100 performs speech recognition processing on the speech collected by the microphone 22, and includes a speech recognition dictionary creation unit 32, a speech recognition dictionary 34, and a speech recognition unit 36. The speech recognition dictionary creation unit 32 creates the speech recognition dictionary 34 so that it corresponds to speech reading out items such as the telephone numbers to be called and the name corresponding to each telephone number (including personal names) when the in-vehicle device 1 operates as a hands-free telephone system, and the song titles, album names, and artist names to be played back when the in-vehicle device 1 operates as an audio device. The speech recognition dictionary 34 is created at predetermined timings, specific examples of which are described later. The speech recognition dictionary 34 is used to perform speech recognition processing on a plurality of known words or sentences. In addition to items generated at predetermined timings, such as the telephone numbers and song titles mentioned above, these words or sentences include operation commands for issuing operation instructions to the in-vehicle device 1. The speech recognition unit 36 performs speech recognition processing on the user's speech collected by the microphone 22 using the speech recognition dictionary 34, and identifies the content (character string) of the speech uttered by the user.
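Recognition against a closed vocabulary can be illustrated with the Python sketch below. It is not the recognition method of the source: actual recognition works on audio, whereas this sketch assumes a hypothetical front end has already turned the utterance into an approximate text reading, and it swaps in string similarity from the standard `difflib` module as a stand-in for acoustic matching. The function name `recognize` and the `cutoff` value are likewise assumptions.

```python
import difflib
from typing import Mapping, Optional


def recognize(heard_reading: str, dictionary: Mapping[str, str],
              cutoff: float = 0.8) -> Optional[str]:
    """Return the dictionary entry whose reading best matches, or None.

    `dictionary` maps a reading to the word, sentence, or command it
    stands for. Only entries in the dictionary can ever be returned.
    """
    matches = difflib.get_close_matches(
        heard_reading, list(dictionary), n=1, cutoff=cutoff)
    return dictionary[matches[0]] if matches else None
```

The point of the sketch is the shape of the result: the output is always one of the prepared entries or nothing, never free text.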

The operation unit 40 accepts manual operation of the in-vehicle device 1 by the user and includes various operation keys, operation switches, operation knobs, and the like. When various operation screens and input screens are displayed on the display device 62, the user can select a displayed item by directly pointing at part of the screen with a finger or the like. To enable operation using such operation screens and input screens, a touch panel that detects the position of the pointing finger or the like is provided as part of the operation unit 40. Instead of a touch panel, a remote control unit or the like may be used to select part of the operation screen or input screen in accordance with the user's instruction. The speech switch 42 is operated by the user when speaking toward the microphone 22 and is used to indicate the utterance timing. The input control unit 44 monitors the operation unit 40 and the speech switch 42 and determines the content of operations performed on them.

The control unit 50 controls the entire in-vehicle device 1 and also performs the operations of the audio device and the hands-free telephone system. The control unit 50 is realized by a CPU executing an operation program stored in a ROM, RAM, or the like. Although FIG. 1 shows the navigation processing unit 10, the TV tuner processing unit 14, the radio tuner processing unit 16, the voice recognition processing unit 100, and the like separately from the control unit 50, some of their functions may be implemented by the control unit 50. Details of the control unit 50 are described later.

The display processing unit 60 outputs video signals for displaying the various operation screens and input screens, the video screen corresponding to the broadcast signal received by the TV tuner processing unit 14, and the like, and displays these screens on the display device 62. The digital-analog converter 64 converts voice data into an analog voice signal and outputs it from the speaker 66 when the in-vehicle device 1 operates as a hands-free telephone system, and converts audio data (music data) into an analog audio signal and outputs it from the speaker 66 when the in-vehicle device 1 operates as an audio device. In practice, an amplifier that amplifies the signal is connected between the digital-analog converter 64 and the speaker 66, but this amplifier is omitted from FIG. 2. Likewise, one combination of digital-analog converter 64 and speaker 66 is provided for each playback channel, but only one set is shown in FIG. 2.

The hard disk device 70 stores the map data and the nearby-facility/POI search data used for navigation by the navigation processing unit 10, the content list used for playback by the audio device, the phone book data used by the hands-free telephone system, and the like. The content list contains the folder structure, file structure, and file attributes of the content data (music data). The file attributes contain supplementary information for each song: specifically, the name of the artist who sings or performs the song, the album name if the song is included in an album, and the title of the song. The phone book data contains telephone numbers registered in advance and the name corresponding to each number (for an individual, the person's name, nickname, or the like; for a company or other organization, the company name, its abbreviation, or the like).

The USB interface units 80 and 82 input and output signals to and from the mobile phone 90, the USB memory 92 serving as an external storage medium, and the like via USB cables, and include a USB port and a USB host controller. Music data is recorded in the USB memory 92.

Next, the control unit 50 is described in detail. As shown in FIG. 2, the control unit 50 includes a phone book acquisition unit 51, a telephone processing unit 52, a content list creation unit 53, an AV processing unit 54, an Internet processing unit 55, the distribution determination unit 102, a voice data transmission unit 56, a recognition result acquisition unit 57, and an input processing unit 58.

The phone book acquisition unit 51 reads and acquires the phone book data registered in the mobile phone 90 connected to either of the USB interface units 80 and 82. The acquired phone book data is stored, for example, in the hard disk device 70. This phone book data contains the "telephone number" to be called, the "name" (personal name, company name, or the like) corresponding to each telephone number, and, where an e-mail address is known, that "address". In the following description of this embodiment, the mobile phone 90 is assumed to be connected to one USB interface unit 80, and the USB memory 92 storing music data is assumed to be connected to the other USB interface unit 82.

The telephone processing unit 52 performs call origination using the mobile phone 90, placing a call to one of the telephone numbers contained in the phone book data acquired by the phone book acquisition unit 51 or, when the user enters a telephone number directly with the operation unit 40, to that number. After the telephone line to the other party has been connected, the telephone processing unit 52 transmits the speaker's voice picked up by the microphone 22 to the other party and outputs the other party's voice from the speaker 66. In this way, the telephone processing unit 52 realizes a hands-free telephone system that uses the mobile phone 90.

The content list creation unit 53 analyzes the content data (music data) recorded in the USB memory 92, the recording medium whose connection has been detected, and creates a content list based on the analysis result. As described above, the content list contains the folder structure, file structure, and file attributes (artist name, album name, song title, and the like) of the content data. The created content list is stored, for example, in the hard disk device 70. The content list is created, for example, when the USB memory 92 is connected to the USB interface unit 82.

The AV processing unit 54 plays music by reading music data in a predetermined format from the USB memory 92, demodulating it, and converting it into music data in the format input to the digital-analog converter 64 (for example, PCM data). During playback, the AV processing unit 54 also creates a playback menu screen that lets the user select the playback position within a song, change the volume, and so on. This playback menu screen is displayed on the display device 62 via the display processing unit 60.

The Internet processing unit 55 performs the processing needed to use various services over the Internet. Specifically, it has the functions of a web browser and mail software and, in response to the user's instructions and input, browses web pages, composes, sends, and receives e-mail, and displays and accepts input on SNS (social networking service) screens.

For speech picked up by the microphone 22, the distribution determination unit 102 decides whether voice recognition is to be performed by the voice recognition processing unit 100 on the client side (in-vehicle device 1) or by the voice recognition processing unit 200 of the server 2. When voice recognition is requested from the voice recognition processing unit 200 of the server 2, the voice data transmission unit 56 transmits to the server 2 the voice data picked up by the microphone 22 and compressed by the compression processing unit 26. The recognition result acquisition unit 57 receives the result of the voice recognition (the recognition result) when it is sent back from the server 2. The server 2 is provided with a communication control unit 202, which receives the voice data sent from the in-vehicle device 1, inputs it to the voice recognition processing unit 200, obtains the recognition result from the voice recognition processing unit 200, and sends it back to the in-vehicle device 1.

The input processing unit 58 selects one of the following as the content of an operation instruction or information input to the in-vehicle device 1: the recognition result from the voice recognition processing unit 100, the recognition result from the voice recognition processing unit 200 obtained from the server 2, or the content of an operation performed with the operation unit 40. Specific examples of this selection are described later.

The microphone 22 described above corresponds to the voice input means, the voice recognition processing unit 100 to the client-side voice recognition processing means, the distribution determination unit 102 to the distribution means, the voice data transmission unit 56 to the client-side communication means, the communication control unit 202 to the server-side communication means, and the voice recognition processing unit 200 to the server-side voice recognition processing means. Likewise, the input processing unit 58 corresponds to the input processing means, the operation unit 40 to the operation means, the Internet processing unit 55 to the document creation means, the recognition result acquisition unit 57 to the recognition result acquisition means, and the navigation processing unit 10 to the facility information display means.

The voice recognition system of this embodiment is configured as described above. Next, the operation in which the user speaks toward the microphone 22 and voice recognition is performed on the utterance is described.

FIG. 3 is a flowchart showing the operation procedure from the user's utterance until its content is reflected in the operation of the in-vehicle device 1. The voice recognition unit 36 determines whether the utterance switch 42 has been turned on (step 100). While the utterance switch 42 is not turned on, a negative determination is made and this determination is repeated.

When the utterance switch 42 is turned on, an affirmative determination is made in step 100. The distribution determination unit 102 then analyzes the input mode based on the content displayed at that moment (step 102). For example, it determines whether the current state is an input mode in which a menu screen or the like for selecting an item is displayed (referred to as the "corresponding item selection mode"), or an input mode in which a text box is present and text is being entered (referred to as the "text input mode").

In this embodiment, it is assumed that various operation instructions and information inputs are made by voice. Voice recognition is performed on the input speech to identify its content, and the distribution determination unit 102 decides which of two cases applies: the recognition target is a word or sentence for which a one-to-one corresponding voice dictionary has been prepared in advance, or it is an unspecified word or sentence for which no such dictionary has been prepared.

More specifically, when the input mode is the "corresponding item selection mode", the recognition targets are words or sentences for which a one-to-one corresponding voice dictionary has been prepared in advance, and voice recognition is performed by the voice recognition processing unit 100 of the in-vehicle device 1. When the input mode is the "text input mode", the recognition targets are unspecified words or sentences for which no corresponding voice dictionary has been prepared, and voice recognition is performed by the voice recognition processing unit 200 of the server 2.
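The distribution rule above depends only on the input mode, so it can be sketched as a small pure function. This is a minimal illustration, not the patented implementation; the names `InputMode`, `Recognizer`, and `route` are hypothetical and do not appear in the source.

```python
from enum import Enum


class InputMode(Enum):
    ITEM_SELECTION = "corresponding item selection mode"  # fixed-choice screen
    TEXT_INPUT = "text input mode"                        # a text box is active


class Recognizer(Enum):
    CLIENT = "voice recognition processing unit 100 (in-vehicle device 1)"
    SERVER = "voice recognition processing unit 200 (server 2)"


def route(mode: InputMode) -> Recognizer:
    """Choose the recognizer from the input mode alone (steps 102 to 104)."""
    if mode is InputMode.ITEM_SELECTION:
        # A dictionary for these words was prepared in advance: stay local.
        return Recognizer.CLIENT
    # Free-form text has no prepared dictionary: delegate to the server.
    return Recognizer.SERVER
```

Because the decision is made before any recognition is attempted, only one of the two recognizers ever runs for a given utterance.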

The distribution determination unit 102 determines whether the analyzed input mode is the corresponding item selection mode (step 104). If so, an affirmative determination is made, and the voice recognition processing unit 100 built into the in-vehicle device 1 performs voice recognition on the user's speech picked up by the microphone 22 (step 106). The input processing unit 58 then uses this recognition result as the operation instruction or information input corresponding to the content displayed at that moment, and carries out the operation on or input to the in-vehicle device 1 (step 108).

For example, suppose a map image is displayed during navigation by the navigation processing unit 10. It is determined in advance that the spoken input "shukushaku" (scale) instructs a change of the display scale and that "mokutekichi" (destination) instructs destination setting, and a voice recognition dictionary 34 for recognizing "shukushaku", "mokutekichi", and the like is prepared. When the content displayed at that moment is a "map image", the distribution determination unit 102 determines that the input mode is the "corresponding item selection mode". The voice recognition unit 36 then performs recognition on the input speech "shukushaku" or the like using the voice recognition dictionary 34 and obtains the character string "shukushaku" or the like as the recognition result. On receiving this result, the navigation processing unit 10 starts processing to change the display scale of the displayed map image.

Similarly, suppose a channel selection screen for choosing the broadcast station to receive is displayed while the TV tuner processing unit 14 is receiving. It is determined in advance that the spoken input "XX TV" instructs switching to that broadcast station, and a voice recognition dictionary 34 for recognizing "XX TV" and the like is prepared. When the content displayed at that moment is the "channel selection screen", the distribution determination unit 102 determines that the input mode is the "corresponding item selection mode". The voice recognition unit 36 then performs recognition on the input speech "XX TV" or the like using the voice recognition dictionary 34 and obtains the character string "XX TV" or the like as the recognition result. On receiving this result, the TV tuner processing unit 14 switches the selected station to XX TV or the like.
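The two examples above amount to a per-screen table of known utterances. The sketch below assumes the recognizer has already produced a character string; the table contents, the action names, and the `dispatch` function are hypothetical and only illustrate how a small fixed vocabulary could be tied to the displayed screen.

```python
from typing import Optional

# Hypothetical per-screen command tables: recognized string -> action name.
COMMANDS_BY_SCREEN = {
    "map_image": {
        "shukushaku": "change_display_scale",
        "mokutekichi": "set_destination",
    },
    "channel_selection": {
        "xx tv": "tune_to_xx_tv",
    },
}


def dispatch(screen: str, recognized: str) -> Optional[str]:
    """Return the action for a recognized string on the given screen.

    Returns None when the screen has no command table or the string is not
    one of the commands prepared for that screen.
    """
    table = COMMANDS_BY_SCREEN.get(screen, {})
    return table.get(recognized.strip().lower())
```

Keeping one small table per screen is what lets the in-vehicle side stay lightweight: only the entries relevant to the current display need to be matched.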

If, on the other hand, the current input mode is the text input mode, a negative determination is made in step 104. The voice data transmission unit 56 then transmits the voice data, input from the microphone 22 and compressed by the compression processing unit 26, to the server 2 via the network 3 and requests voice recognition by the voice recognition processing unit 200 in the server 2 (step 110). After that, the recognition result acquisition unit 57 determines whether the recognition result sent back from the server 2 has been received (step 112). If it has not been received, a negative determination is made and this determination is repeated. When the recognition result has been received, an affirmative determination is made in step 112. The input processing unit 58 then uses the recognition result received from the server 2 as the operation instruction or information input corresponding to the content displayed at that moment, and carries out the operation on or input to the in-vehicle device 1 (step 108).
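Steps 110 and 112 can be sketched as "compress, send, then poll until a result arrives". The sketch makes several assumptions that are not in the source: `zlib` stands in for whatever codec the compression processing unit 26 uses, `send` and `poll` are hypothetical callables representing the network transport, and `max_polls` is an added bound (the flowchart simply repeats the check).

```python
import zlib
from typing import Callable, Optional


def request_server_recognition(
    pcm: bytes,
    send: Callable[[bytes], None],
    poll: Callable[[], Optional[str]],
    max_polls: int = 1000,
) -> Optional[str]:
    """Send compressed audio to the server and wait for its recognition result."""
    send(zlib.compress(pcm))        # step 110: transmit the compressed voice data
    for _ in range(max_polls):      # step 112: repeat until a result is received
        result = poll()
        if result is not None:
            return result
    return None                     # assumption: give up after max_polls checks
```

A bounded loop is used here only so that the sketch terminates when no result ever arrives; the source does not specify a timeout.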

For example, suppose the mail composition screen is displayed while the Internet processing unit 55 is composing mail, and the input position is in the mail body. The distribution determination unit 102 determines that the input mode is the "text input mode", and voice recognition (natural language recognition) of the input speech, which represents the text to be entered in the mail body, is requested from the voice recognition processing unit 200 in the server 2. When the recognition result (the character string corresponding to the text to be entered) is sent back, the Internet processing unit 55 enters this character string in the mail body and composes the mail.

Thus, in the voice recognition system of this embodiment, voice recognition on the in-vehicle device 1 side is limited to words and sentences prepared in advance. This makes it possible to divide recognition accurately between the in-vehicle device 1 and the server 2, which avoids performing recognition on both sides and shortens the time needed to obtain a recognition result. In addition, because the in-vehicle device 1 only needs to produce recognition results for the words and sentences prepared in advance, the scale and processing load of the in-vehicle device 1 relating to voice recognition can be reduced.

In addition, because the in-vehicle device 1 has the operation unit 40, voice input using voice recognition and manual input using the operation unit 40 can be used selectively as needed when making various inputs on the in-vehicle device 1, which improves operability.

In addition, when the plurality of words or sentences that are targets of operation instructions or information input to the in-vehicle device 1 are known, the distribution determination unit 102 assigns voice recognition to the in-vehicle device 1, and the in-vehicle device 1 is provided with the voice recognition dictionary 34 corresponding to these known words or sentences and performs the recognition. This allows the in-vehicle device 1 to reliably perform voice recognition on the input speech.

Furthermore, by making the known words or sentences described above a plurality of operation commands that instruct the in-vehicle device 1, operation instructions to the in-vehicle device 1 are recognized on the in-vehicle device 1 side, so that the content of an instruction can be determined quickly and reflected in the operation of the in-vehicle device 1.

The operation procedure shown in FIG. 3 presupposes that a connection to the server 2 is available at all times. However, when the vehicle carrying the in-vehicle device 1 is traveling or parked in a place that the radio waves of the mobile phone 90 do not reach, it may be impossible to connect to the server 2. For example, connection to the server 2 is often impossible while traveling or parked in a mountainous area with no base station for the mobile phone 90, or while traveling through a long tunnel.

FIG. 4 is a flowchart showing the operation procedure of a modification for the case where connection to the server 2 is not possible. It differs from the procedure of FIG. 3 in that step 109 is added before step 110.

In step 109, after a negative determination has been made in step 104 because the current input mode is the text input mode, the voice data transmission unit 56 determines whether a connection to the server 2 has been established. If so, an affirmative determination is made and the procedure moves to step 110, where voice recognition by the voice recognition processing unit 200 in the server 2 is requested.

If connection to the server 2 is difficult (this includes not only poor radio conditions but also cases where the mobile phone 90 is not connected or has failed), a negative determination is made in step 109. In this case the procedure moves to step 106, and voice recognition is performed using the voice recognition processing unit 100 built into the in-vehicle device 1. In the text input mode, the words and sentences to be entered are not known in advance, so the voice recognition unit 36 performs recognition word by word on the user's utterance to identify its content.

In this way, when connection to the server 2 is not possible, voice recognition is performed in the in-vehicle device 1, which prevents processing from being interrupted because the server 2 cannot be reached.

FIG. 5 is a flowchart showing the operation procedure of another modification for the case where connection to the server 2 is not possible. It differs from the procedure of FIG. 4 in that step 111 is added as the operation performed after a negative determination in step 109.

In step 111, after a negative determination has been made in step 109 because connection to the server 2 is not possible, an input operation using the operation unit 40 is performed. Instead of a voice recognition result received from the server 2, the input processing unit 58 uses the content of the operation on the operation unit 40 as the operation instruction or information input corresponding to the content displayed at that moment, and carries out the operation on or input to the in-vehicle device 1 (step 108).

In this way, when connection to the server 2 is not possible, the user's manual operation with the operation unit 40 takes the place of voice recognition by the server 2, which prevents processing from being interrupted because the server 2 cannot be reached.
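The two fallback variants (FIG. 4 and FIG. 5) differ only in what happens after a negative determination in step 109, which the following sketch makes explicit. All function and parameter names are hypothetical, and the recognizers and the manual-input reader are passed in as plain callables so the sketch stays self-contained.

```python
from typing import Callable, Tuple


def handle_text_input(
    audio: bytes,
    server_reachable: Callable[[], bool],
    server_recognize: Callable[[bytes], str],
    local_recognize: Callable[[bytes], str],
    read_manual_input: Callable[[], str],
    variant: str = "fig4",
) -> Tuple[str, str]:
    """Return (source, text) for an utterance made in the text input mode."""
    if server_reachable():                          # step 109
        return "server", server_recognize(audio)    # steps 110 to 112
    if variant == "fig4":
        # FIG. 4: fall back to word-by-word recognition in the in-vehicle device.
        return "client", local_recognize(audio)     # step 106
    if variant == "fig5":
        # FIG. 5: fall back to manual input through the operation unit.
        return "manual", read_manual_input()        # step 111
    raise ValueError(f"unknown variant: {variant!r}")
```

Either way, the returned text is then applied to the current screen in step 108, so an unreachable server changes the source of the input but does not stall the procedure.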

When voice recognition is performed by the voice recognition processing unit 100 in the in-vehicle device 1 in the corresponding item selection mode described above, the words and sentences to be recognized must be known, and a voice recognition dictionary 34 corresponding to them must be provided. For example, when the in-vehicle device 1 is used as an audio device, navigation device, or the like and its operation commands (operation instructions) are given by voice, it suffices to create in advance a voice recognition dictionary 34 corresponding to the word or sentence for each operation command.

On the other hand, when the in-vehicle device 1 is used as a hands-free telephone system and the name or telephone number of the party to call is entered by voice, or when it is used as an audio device and an artist name, album name, or song title is entered by voice, the content to be spoken differs for each in-vehicle device 1, or for each connected mobile phone 90 or USB memory 92. The voice recognition dictionary 34 corresponding to the words and sentences to be recognized therefore has to be created as the need arises.

FIG. 6 is a flowchart showing the operation procedure for reading the phone book data and registering the voice recognition dictionary 34 when the mobile phone 90 is connected. The phone book acquisition unit 51 checks whether the mobile phone 90 has been connected (step 200), making a negative determination and repeating the check until it is. When the mobile phone 90 is connected, an affirmative determination is made in step 200.

Next, the phone book acquisition unit 51 reads the phone book data stored in the mobile phone 90 (step 202). Besides the telephone numbers and names of the parties to call, this data also contains information such as addresses. The phone book acquisition unit 51 extracts the telephone numbers and names from the phone book data it has read (step 204).

Next, the voice recognition dictionary creation unit 32 in the voice recognition processing unit 100 creates a voice recognition dictionary corresponding to the extracted telephone numbers and names (step 206) and registers it as the voice recognition dictionary 34 used by the voice recognition unit 36 (step 208). For example, the voice recognition dictionary creation unit 32 performs GTP (Grapheme To Phoneme, grapheme-to-phoneme conversion) processing on each character string of the extracted telephone numbers and names to create "reading information" for the character string, and then creates a dynamic recognition dictionary for voice recognition from this reading information. For example, TTS (Text-to-Speech) processing is applied to the reading information to generate a speech waveform, and features for voice recognition are extracted from this waveform to create the dynamic recognition dictionary. The created recognition dictionary is registered as the voice recognition dictionary 34.

In this way, when the mobile phone 90 is connected, the phone book data is read, the telephone numbers and names are extracted, and a dynamic recognition dictionary is created. Consequently, when one of these is spoken, the call destination of the hands-free telephone system can be determined reliably by speech recognition processing on the in-vehicle device 1 side.

FIG. 7 is a flowchart showing the operating procedure for reading the information accompanying the content list and registering the speech recognition dictionary 34 when the USB memory 92 is connected. The content list creation unit 53 determines whether the USB memory 92 has been connected (step 300) and repeats this determination, returning a negative result each time, until a connection is made. Once the USB memory 92 is connected, the determination in step 300 is affirmative.

Next, the content list creation unit 53 analyzes the content data recorded in the USB memory 92 (step 302). A content list is created from the result of this analysis; it contains the folder structure, file structure, and file attributes (artist name, album name, song name, and so on) of the content data. The content list creation unit 53 extracts the artist names, album names, and song names from the content list created from the analysis result (step 304).

Next, the speech recognition dictionary creation unit 32 in the speech recognition processing unit 100 creates a speech recognition dictionary corresponding to the extracted artist names, album names, and song names (step 306) and registers it as the speech recognition dictionary 34 used by the speech recognition unit 36 (step 308). For example, the speech recognition dictionary creation unit 32 performs GTP processing on each character string of the extracted artist names, album names, and song names to create "reading information" for the string, and then creates a dynamic recognition dictionary for speech recognition processing from this reading information. The recognition dictionary thus created is registered as the speech recognition dictionary 34.

In this way, when the USB memory 92 is connected, the content data is read, the artist names, album names, and song names are extracted, and a dynamic recognition dictionary is created. Consequently, when one of these is spoken, the song or other item to be played back by the audio device can be determined reliably by speech recognition processing on the in-vehicle device 1 side.
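One point worth illustrating for the audio case is that a single spoken name, such as an artist name, may correspond to several pieces of music. The sketch below indexes tracks by artist, album, and song name for steps 304 to 308. It is a simplified illustration: the dictionary keys `path`, `artist`, `album`, and `title` are assumed field names, and `str.casefold` is used as a crude placeholder for the reading information that GTP processing would produce.

```python
from collections import defaultdict


def build_music_dictionary(content_list):
    """Steps 304-308: index tracks by artist name, album name, and song name."""
    dictionary = defaultdict(list)
    for track in content_list:
        for attribute in ("artist", "album", "title"):
            value = track.get(attribute)
            if not value:
                continue  # a file may lack some of its attributes
            key = value.casefold()  # placeholder for the GTP reading
            if track["path"] not in dictionary[key]:
                dictionary[key].append(track["path"])
    return dict(dictionary)
```

Looking up an artist name then yields every track by that artist, while looking up a song name yields the single matching track.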

In the embodiment described above, the server 2 is requested to perform speech recognition processing when the input mode is the "text input mode", but this request may instead be distributed among a plurality of servers. For example, when the in-vehicle device 1 is used as a navigation device to search for nearby facilities or POIs (Points of Interest), speech recognition processing is performed on words, including facility names (word recognition). On the other hand, when a text box is opened and filled in while composing an e-mail, speech recognition processing is performed on sentences rather than words (natural language recognition). A server suited to each kind of processing may thus be selected according to the type of speech recognition processing (word recognition or natural language recognition). Even within word recognition, the server best suited to the processing may differ depending on the field of the words to be recognized, so the way the servers are used is not limited to the division described above; a different division that also takes the field and the like into account may be used. Alternatively, a server with a low usage fee may be selected preferentially (which presupposes that information on usage fees is held).

FIG. 8 is a diagram showing the overall configuration of a speech recognition system according to a modification. This speech recognition system includes the in-vehicle device 1 and a plurality of servers 2A and 2B. The server 2A includes a speech recognition processing unit 200A suited to word recognition. The server 2B includes a speech recognition processing unit 200B suited to natural language recognition.

FIG. 9 is a flowchart showing the operating procedure of the modification in which the servers 2A and 2B are used selectively. The procedure shown in FIG. 9 differs from the procedure shown in FIG. 3 in that step 110 is replaced with steps 110A and 110B. The following description focuses on these replacement steps.

When the current input mode is the text input mode and the determination in step 104 is negative, the speech data transmission unit 56 selects the server to which speech recognition processing will be requested (step 110A). It then transmits the speech data, which was input from the microphone 22 and compressed by the compression processing unit 26, to the selected one of the servers 2A and 2B to request speech recognition processing (step 110B).

For example, the server selection in step 110A is made according to the type of speech recognition processing to be applied to the input speech. For word recognition, the server 2A is selected and the speech recognition processing unit 200A in the server 2A is requested to perform speech recognition processing. For natural language recognition, the server 2B is selected and the speech recognition processing unit 200B in the server 2B is requested to perform speech recognition processing. In this way, speech recognition processing can be requested from whichever of the servers 2A and 2B, which differ in their areas of strength, is appropriate, improving recognition accuracy when processing is requested from a server.
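The selection in step 110A might be sketched as below. The server table, the fee values, and the rule of preferring the cheapest server among those suited to the recognition type are assumptions made for illustration; the text mentions selection by recognition type and selection by usage fee as alternatives and does not say how, or whether, they are combined.

```python
from enum import Enum


class RecognitionType(Enum):
    WORD = "word recognition"
    NATURAL_LANGUAGE = "natural language recognition"


# Hypothetical server table: the recognition types each server suits and an
# illustrative usage fee. None of these values come from the embodiment.
SERVERS = {
    "2A": {"suits": {RecognitionType.WORD}, "fee": 1.0},
    "2B": {"suits": {RecognitionType.NATURAL_LANGUAGE}, "fee": 1.0},
}


def select_server(recognition_type, servers=SERVERS):
    """Step 110A: pick a server suited to the given recognition type.

    If several servers suit the type, the one with the lowest fee is chosen
    (an assumed tie-breaking rule, not one stated in the text).
    """
    candidates = [name for name, info in servers.items()
                  if recognition_type in info["suits"]]
    if not candidates:
        raise LookupError(f"no server handles {recognition_type.value}")
    return min(candidates, key=lambda name: servers[name]["fee"])
```

Keeping the table separate from the selection function means that a further server, for instance one specialized in a particular field of words, could be added without changing the selection logic.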

The present invention is not limited to the embodiment described above, and various modifications are possible within the scope of the gist of the invention. For example, in the embodiment described above, the in-vehicle device 1 is given the functions of a navigation device, an audio device, a hands-free telephone system, a television receiver, and a radio receiver, but it may instead be given only some of these functions and perform speech recognition processing on the input speech.

In the embodiment described above, the "text input mode" is applied to the speech input needed to compose an e-mail, but the "corresponding item selection mode" may be applied to the input of the destination address. For example, in step 204 shown in FIG. 6, e-mail addresses are extracted from the read phone book data in addition to the telephone numbers and names, and in step 205, a speech recognition dictionary corresponding to the extracted telephone numbers, names, and addresses is created in advance. Then, when an e-mail is being composed, if the input position on the mail composition screen is at the "To" field, where the destination address of the e-mail is entered, and the address is entered by voice, the speech recognition processing for this input speech may be performed on the in-vehicle device 1 side. For this voice input, the user may read out the name or telephone number corresponding to the address instead of reading out the address itself. Because the address that corresponds one-to-one to a name (or telephone number) can be extracted from the phone book data acquired by the phone book acquisition unit 51, the extracted address can be assigned as the destination of the e-mail.
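The lookup from a recognized name or telephone number to the destination address can be sketched as follows. The field names `name`, `number`, and `email` and the function name are hypothetical, and the sketch assumes, as the text does, that each name or number corresponds to exactly one address.

```python
def resolve_destination(recognized, phone_book):
    """Map a recognized name, telephone number, or address to an e-mail address.

    Returns None when no phone book entry matches the recognition result.
    """
    for entry in phone_book:
        if recognized in (entry["name"], entry["number"], entry["email"]):
            return entry["email"]
    return None


# Illustrative data only.
phone_book = [
    {"name": "Taro Yamada", "number": "0312345678", "email": "taro@example.com"},
]
destination = resolve_destination("Taro Yamada", phone_book)
```

Here `destination` holds the address registered for the spoken name, which would then be assigned to the destination field of the e-mail.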

In the embodiment described above, the speech data corresponding to the input speech is sent to the server 2 or the like, and the recognition result is sent back from the server 2 or the like to the in-vehicle device 1. However, the server 2 or the like may instead perform predetermined processing based on the recognition result and send the result of that processing back to the in-vehicle device 1. For example, when the in-vehicle device 1 is used as a navigation device and the items needed to search for nearby facilities or POIs are input by voice, the speech recognition processing unit 200 in the server 2 may perform speech recognition processing on the input speech, a search processing unit (not shown) may search for nearby facilities and the like using the recognition result, and the search result may be sent back to the in-vehicle device 1.

As described above, according to the present invention, speech recognition processing on the in-vehicle device 1 side is limited to words and sentences prepared in advance, so the speech recognition processing performed on the in-vehicle device 1 side and that performed on the server 2 side can be distributed accurately. This avoids performing speech recognition processing on both the in-vehicle device 1 side and the server 2 side, and shortens the time required to obtain a recognition result. In addition, because the in-vehicle device 1 side only needs to obtain recognition results for the words and sentences prepared in advance, the scale and processing load of the in-vehicle device 1 related to speech recognition processing can be reduced.
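The distribution between the in-vehicle device and the server can be summarized in a short sketch. The enum and function names are hypothetical. The fallback to client-side recognition when the server cannot be reached follows one of the options in the claims (the other being manual operation) and is included here as an assumption about how the two might be combined.

```python
from enum import Enum


class InputMode(Enum):
    ITEM_SELECTION = "corresponding item selection mode"
    TEXT_INPUT = "text input mode"


class Target(Enum):
    CLIENT = "in-vehicle device 1"
    SERVER = "server 2"


def distribute(mode: InputMode, server_reachable: bool = True) -> Target:
    """Decide where the input speech is recognized.

    Known words or sentences (item selection) are handled on the client;
    free text goes to the server when it can be reached.
    """
    if mode is InputMode.ITEM_SELECTION:
        return Target.CLIENT
    if server_reachable:
        return Target.SERVER
    # Assumed fallback: recognize on the client when the server is unavailable.
    return Target.CLIENT
```

Because the decision depends only on the input mode and connectivity, and not on the outcome of a first recognition attempt, the speech never has to be processed on both sides.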

DESCRIPTION OF SYMBOLS
1 In-vehicle device
2, 2A, 2B Server
10 Navigation processing unit
14 TV tuner processing unit
16 Radio tuner processing unit
20 Speech input processing unit
22 Microphone
24 Analog-to-digital converter (A/D)
26 Compression processing unit
32 Speech recognition dictionary creation unit
34 Speech recognition dictionary
36 Speech recognition unit
40 Operation unit
42 Utterance switch (SW)
44 Input control unit
50 Control unit
51 Phone book acquisition unit
52 Telephone processing unit
53 Content list creation unit
54 AV processing unit
55 Internet processing unit
56 Speech data transmission unit
57 Recognition result acquisition unit
58 Input processing unit
60 Display processing unit
62 Display device
64 Digital-to-analog converter (D/A)
66 Speaker
70 Hard disk drive (HDD)
80, 82 USB interface unit (USB I/F)
90 Mobile phone
92 USB memory
100 Speech recognition processing unit
102 Distribution determination unit
200, 200A, 200B Speech recognition processing unit
202 Communication control unit

Claims

A speech recognition system in which speech recognition processing on a user's speech input at a client is performed by the client or by a server connected to the client via a network, wherein
The client comprises:
Speech input means for inputting speech to be recognized;
Client-side speech recognition processing means for performing speech recognition processing on a plurality of words or sentences prepared in advance;
Distribution means for distributing the speech input by the speech input means to either speech recognition processing by the client-side speech recognition processing means or speech recognition processing in the server; and
Client-side communication means for transmitting, to the server, speech data for speech that the distribution means has distributed to speech recognition processing in the server; and
The server comprises:
Server-side communication means for receiving the speech data sent from the client; and
Server-side speech recognition processing means for performing speech recognition processing using the speech data received by the server-side communication means.

In claim 1,
The speech recognition system, wherein the client-side speech recognition processing means identifies the reading of one of the plurality of words or sentences prepared in advance by performing speech recognition processing on the speech input by the speech input means.

In claim 2,
The client is an in-vehicle device,
The speech recognition system, wherein the client further comprises input processing means for issuing an operation instruction or inputting information to the in-vehicle device in accordance with a recognition result from the client-side speech recognition processing means or the server-side speech recognition processing means.

In claim 3,
The client further comprises operation means for accepting a manual operation by a user,
The speech recognition system, wherein the input processing means issues an operation instruction or inputs information to the in-vehicle device in accordance with a recognition result from the client-side speech recognition processing means, a recognition result from the server-side speech recognition processing means, or a manual operation using the operation means.

In claim 3 or 4,
The speech recognition system, wherein, when the connection unit cannot connect to the server via the network, speech recognition processing is performed by the client-side speech recognition processing means instead of by the server.

In claim 4,
The speech recognition system, wherein, when connection to the server via the network is not possible, the input processing means issues an operation instruction or inputs information to the in-vehicle device in accordance with a manual operation using the operation means instead of a recognition result from the server-side speech recognition processing means.

In any one of Claims 3-6,
The speech recognition system, wherein, when the plurality of words or sentences that are the target of operation instructions or information input to the in-vehicle device are known, the distribution means distributes the speech to speech recognition processing by the client-side speech recognition processing means, and the client-side speech recognition processing means performs speech recognition processing on the speech input by the speech input means, thereby selecting, from the plurality of words or sentences, the one corresponding to the recognition result.

In claim 7,
The speech recognition system, wherein the client-side speech recognition processing means has a speech recognition dictionary for the plurality of known words or sentences.

In claim 8,
The speech recognition system, wherein the plurality of words or sentences are a plurality of operation commands for instructing the in-vehicle device.

In claim 8,
When a mobile phone holding phone book data that includes telephone numbers of call destinations and a name corresponding to each telephone number is connected, the in-vehicle device operates as a hands-free telephone system that uses the mobile phone to place calls to the telephone numbers included in the phone book data,
The speech recognition system, wherein the plurality of words or sentences are at least one of the telephone number and the name.

In claim 10,
The speech recognition system, wherein the client-side speech recognition processing means comprises speech recognition dictionary creation means for creating, when the mobile phone is connected, the speech recognition dictionary corresponding to the reading of at least one of the telephone number and the name included in the phone book data.

In claim 8,
The in-vehicle device operates as an audio device that selectively reproduces a plurality of music pieces,
The speech recognition system, wherein the plurality of words or sentences are at least one of a song name, an album name, and an artist name corresponding to each of the plurality of songs.

In claim 12,
The speech recognition system, wherein the client-side speech recognition processing means comprises speech recognition dictionary creation means for creating the speech recognition dictionary corresponding to the reading of at least one of the song name, the album name, and the artist name.

In any one of Claims 3-13,
The in-vehicle device further includes document creation means for creating a document to be transmitted via a network,
The speech recognition system, wherein, when text needed to create the document is input based on the speech input by the speech input means, the distribution means distributes the speech to speech recognition processing in the server-side speech recognition processing means.

In claim 14,
The in-vehicle device further includes recognition result acquisition means for acquiring a recognition result by the server side voice recognition processing means transmitted from the server,
The speech recognition system, wherein the document creation means uses the recognition result acquired by the recognition result acquisition means as text necessary for creation of the document.

In any one of Claims 3-13,
The in-vehicle device further includes facility information display means for displaying detailed information of a specific facility,
The speech recognition system, wherein, when the specific facility whose detailed information is to be displayed by the facility information display means is input based on the speech input by the speech input means, the distribution means distributes the speech to speech recognition processing in the server-side speech recognition processing means.

In claim 16,
The in-vehicle device further includes recognition result acquisition means for acquiring a recognition result from the server-side speech recognition processing means transmitted from the server, or the detailed information retrieved using that recognition result,
The speech recognition system, wherein the facility information display means displays the detailed information retrieved using the recognition result acquired by the recognition result acquisition means, or the detailed information acquired by the recognition result acquisition means.

In any one of Claims 1-17,
The voice recognition system, wherein the voice input means is a microphone.

In claim 18,
An utterance switch operated by the user when speaking into the microphone is further provided,
The speech recognition system, wherein the distribution means distributes the user's speech collected by the microphone after the utterance switch is operated.

In any one of Claims 1-19,
A plurality of the servers, each including the server-side speech recognition processing means, are connectable to the client,
The speech recognition system, wherein the distribution means selects one of the plurality of servers when distributing speech to speech recognition processing in a server.