JP2017181667A

JP2017181667A - Voice recognition apparatus and voice recognition method

Info

Publication number: JP2017181667A
Application number: JP2016066312A
Authority: JP
Inventors: Junichi Ito; Tokuji Ikeno; Naoki Hayashi; Kota Hatanaka; Takuma Minemura; Junya Masui; Toshiyuki Nanba
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2016-03-29
Filing date: 2016-03-29
Publication date: 2017-10-05

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition apparatus with well-balanced recognition accuracy and speed. SOLUTION: A voice recognition apparatus for recognizing a voice obtained from a user includes: voice obtainment means for obtaining a voice from the user; recognition means for recognizing the voice and obtaining a first recognition result and a degree of certainty corresponding to the first recognition result; information obtainment means for obtaining a second recognition result, which is a result of having the voice recognized by a second voice recognition apparatus different from the apparatus itself; and determination means for adopting the first recognition result when the degree of certainty with which the recognition means recognized the voice is equal to or higher than a given value, and, when the degree of certainty is less than the given value, making an inquiry to the user and determining, based on the response to the inquiry, which of the first recognition result and the second recognition result to adopt. SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech recognition apparatus that recognizes input speech.

Speech recognition technology, in which a computer recognizes speech uttered by a user and performs processing using the recognition result, has become widespread. Speech recognition makes it possible to operate a computer without touching it, which greatly improves the convenience of computers mounted on moving bodies such as automobiles.

The recognition accuracy of speech recognition varies with the scale of the dictionary used for recognition. For example, there may be a large difference in recognition accuracy between a workstation specialized for speech recognition and a personal computer that is not.
For this reason, when speech recognition is to be used on a small-scale computer, a technique is used in which the speech data is transferred to a large-scale computer over a communication line and the recognition result is acquired from it.

However, performing speech recognition over a network has the problem that the overall response becomes slow. In other words, it is difficult to achieve both recognition accuracy and responsiveness.
To address this, techniques have been proposed that dynamically switch between performing speech recognition locally and performing it on a server. For example, Patent Literature 1 describes a system that calculates the reliability of a speech recognition result obtained locally and adopts the server's speech recognition result when the reliability is low.

Patent Literature 1: JP2013-232001A

In the system described in Patent Literature 1, the reliability of the speech recognition result obtained locally is compared with a threshold, and the comparison result determines whether to use the local recognition result or to perform speech recognition on the server. However, if this decision depends only on a locally held threshold, there are cases in which an appropriate decision cannot be made.

For example, if the threshold is too high relative to the scale of the dictionary, speech recognition is always performed on the server, and the overall response deteriorates. If the threshold is too low, the response is fast, but a high-quality recognition result cannot be obtained.
In addition, if the reliability is below the threshold, the server's speech recognition result is adopted unconditionally, even if the result of the local speech recognition is correct.
In other words, it is difficult to achieve both recognition accuracy and speed.

The present invention has been made in view of the above problems, and an object thereof is to provide a speech recognition apparatus that achieves both recognition accuracy and speed.

The speech recognition apparatus according to the present invention is a speech recognition apparatus that recognizes speech acquired from a user, comprising: speech acquisition means for acquiring speech from the user; recognition means for recognizing the speech and acquiring a first recognition result and a certainty corresponding to the first recognition result; information acquisition means for acquiring a second recognition result, which is a result of having the speech recognized by a second speech recognition apparatus different from the apparatus itself; and determination means for adopting the first recognition result when the certainty with which the recognition means recognized the speech is equal to or greater than a predetermined value, and, when the certainty is below the predetermined value, making an inquiry to the user and determining, based on the answer to the inquiry, whether to adopt the first recognition result or the second recognition result.

The first recognition result is the result of speech recognition performed locally, and the second recognition result is the result of speech recognition performed using a speech recognition apparatus different from the apparatus itself (that is, an external one).

If a certain level of recognition accuracy can be obtained, performing speech recognition locally is preferable from the viewpoint of response time; if it cannot, using an external apparatus is preferable from the viewpoint of correctness. That is, the certainty of the local speech recognition result can be used to decide whether to adopt the first recognition result or the second recognition result. However, if the choice is made on the basis of the certainty alone, problems arise: for example, even though the first recognition result turns out to be correct, the apparatus attempts to adopt the second recognition result, causing extra communication or a longer response time.

Therefore, the speech recognition apparatus according to the present invention adopts the first recognition result unconditionally when its certainty is equal to or greater than the predetermined value. When the certainty is below the predetermined value, the apparatus makes an inquiry to the user and determines, based on the content of the answer, whether to adopt the first recognition result or the second recognition result.
The inquiry to the user typically concerns whether the first recognition result is correct, but is not limited to this. With this configuration, the user is queried only when the local recognition result is in doubt, and speech recognition using the external source is used only when a clear error is found as a result. This lowers the probability of adopting an incorrect recognition result while preserving the responsiveness of local speech recognition.
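The decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Recognition`, `decide`, `ask_user`, and `get_second`, the 0-to-1 certainty scale, and the default threshold of 0.8 are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Recognition:
    """A local recognition result with its certainty (assumed to lie in [0, 1])."""
    text: str
    certainty: float


def decide(first: Recognition,
           get_second: Callable[[], Optional[str]],
           ask_user: Callable[[str], bool],
           threshold: float = 0.8) -> Optional[str]:
    """Return the adopted text.

    get_second and ask_user are hypothetical callbacks: the first fetches the
    external recognizer's result, the second asks the user to confirm a text.
    """
    if first.certainty >= threshold:
        return first.text      # certainty is high enough: adopt unconditionally
    if ask_user(first.text):
        return first.text      # user confirmed the low-certainty local result
    return get_second()        # user rejected it: fall back to the external result
```

Note that `get_second` is only called on the last branch, so the external result is consulted only after the user has rejected the local one.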

The determination means may, when the certainty is below the predetermined value, ask the user whether the first recognition result is correct, and adopt the first recognition result when the answer is affirmative.

When the certainty is below the predetermined value, the user is queried, and the first recognition result is adopted if the answer is affirmative, that is, if the user answers that the first recognition result is correct. In this way, response speed can be maintained in cases where the certainty is relatively low but the local speech recognition result turns out to be correct.

The determination means may adopt the second recognition result when the answer is negative and the first recognition result differs from the second recognition result.

This is because, when the user's answer is negative and the first and second recognition results differ, the second recognition result is likely to be correct.

The determination means may, when the answer is negative and the first recognition result differs from the second recognition result, ask the user whether the second recognition result is correct, and adopt the second recognition result when the answer is affirmative.

With this configuration, it is possible to decide whether to adopt the second recognition result or to have the recognition itself performed again.

The information acquisition means may transmit the speech acquired by the speech acquisition means to the second speech recognition apparatus, and acquire the second recognition result when the answer is negative.

The second recognition result does not necessarily have to be acquired at the same time as the first recognition result. For example, the response can be made faster by not acquiring the second recognition result until a negative answer to the inquiry is obtained.
The data may be transmitted to the second speech recognition apparatus at the time the first recognition result is acquired, or after it has been decided not to adopt the first recognition result. The request to the second speech recognition apparatus may also be executed in parallel.
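One way to picture the parallel option is to hand the request to the second apparatus to a background thread and read its result only if the local result ends up rejected. The sketch below uses Python's standard `concurrent.futures`; the function `recognize_in_parallel`, its callback parameters, the threshold, and the timeout are all hypothetical stand-ins and not part of the source.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def recognize_in_parallel(audio, local_recognize, server_recognize, confirm,
                          threshold=0.8, timeout_s=5.0):
    """Start the server request first, then decide using the local result.

    local_recognize(audio) -> (text, certainty); server_recognize(audio) -> text;
    confirm(text) -> bool. All three are assumed callbacks for illustration.
    Returns the adopted text, or None if the server did not answer in time.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(server_recognize, audio)  # runs while we work locally
    try:
        text, certainty = local_recognize(audio)
        if certainty >= threshold or confirm(text):
            future.cancel()  # best effort; has no effect once the call is running
            return text
        try:
            # Only now do we actually wait for the external result.
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return None
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the server call
```

Starting the request early trades extra communication for a shorter wait when the local result is rejected; deferring the request until after the rejection makes the opposite trade.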

The second speech recognition apparatus may be an apparatus that communicates with the determination means via a wireless network and that has higher speech recognition accuracy than the apparatus itself.

In such a configuration a trade-off between recognition accuracy and response arises, so the present invention can be suitably applied.

The speech recognition apparatus according to the present invention may further comprise information providing means for providing information to the user based on the speech recognition result adopted by the determination means.

The information providing means provides information to the user based on the result of recognizing the speech. The information provided is, for example, route information to a destination, information associated with a route, web search results, search results from a database, or search results from another server, but is not limited to these. Alternatively, a computer may be made to perform processing according to the speech recognition result, and the result of that processing may be provided as information.

The present invention can be specified as a speech recognition apparatus including at least some of the above means. It can also be specified as a speech recognition method performed by the speech recognition apparatus. The above processes and means can be freely combined and implemented as long as no technical contradiction arises.

According to the present invention, a speech recognition apparatus that achieves both recognition accuracy and speed can be provided.

FIG. 1 is a system configuration diagram of the information providing system according to the embodiment.
FIG. 2 is a flowchart of processing performed by the information providing apparatus according to the embodiment.
FIG. 3 is a flowchart of processing performed by the information providing apparatus according to the embodiment.

Preferred embodiments of the present invention are described below with reference to the drawings.
The information providing system according to the present embodiment acquires a voice command from an occupant of a vehicle (for example, the driver), performs speech recognition, collects the corresponding information based on the recognition result, and provides it to the occupant.

<System configuration>
FIG. 1 is a system configuration diagram of the information providing system according to the present embodiment.
The information providing system according to the present embodiment comprises an information providing apparatus 10 and a speech recognition server 20.

First, the information providing apparatus 10 is described. The information providing apparatus 10 has a function of acquiring speech from a vehicle occupant and performing speech recognition, and a function of acquiring information based on the speech recognition result and providing the acquired information to the occupant. The information providing apparatus 10 may be, for example, an in-vehicle car navigation apparatus, a general-purpose computer, or another in-vehicle terminal.
The information providing apparatus 10 also has a function of performing speech recognition processing using an external speech recognition apparatus (the speech recognition server 20) rather than the apparatus itself. The speech recognition server 20 is described in detail later.

The information providing apparatus 10 comprises a speech acquisition unit 11, a speech recognition unit 12, an information acquisition unit 13, an input/output unit 14, and a communication unit 15.
The speech acquisition unit 11 is means for acquiring speech from a vehicle occupant. Specifically, it converts speech into an electrical signal (hereinafter, speech data) using a microphone (not shown). The acquired speech data is transmitted to the speech recognition unit 12.

The speech recognition unit 12 is means for performing speech recognition on the acquired speech data and converting it into text. Speech recognition can be performed using known techniques. For example, the speech recognition unit 12 stores an acoustic model and a recognition dictionary, extracts features by comparing the acquired speech data with the acoustic model, and performs speech recognition by matching the extracted features against the recognition dictionary. The speech recognition unit 12 is also configured to be able to acquire, together with the recognition result (text), the likelihood (that is, the certainty) of that recognition result.

The information acquisition unit 13 is means for acquiring the information to be provided to the occupant based on the speech recognition result.
The information provided may be, for example, information representing a route connecting a departure point and a destination, information associated with the route, or information associated with the movement of the vehicle. It may also be information not directly related to the travel of the vehicle, such as web search results. For example, when providing route information, a route search may be performed in response to the voice command "route to XX (destination)" and the result presented. When providing information associated with a route, the required time may be estimated in response to, for example, the voice command "time required to reach XX (destination)" and the result presented. Beyond these, the voice command and the information provided may be of any kind, as long as the information can be provided through natural language processing.

The input/output unit 14 is means for accepting input operations performed by the user and presenting information to the user. In the present embodiment it consists of a single touch panel display, that is, a liquid crystal display and its control means together with a touch panel and its control means.
The communication unit 15 is means for communicating with the speech recognition server 20 by accessing a network via a communication line (for example, a mobile phone network).

The speech recognition server 20 is a server apparatus specialized for speech recognition, and comprises a communication unit 21 and a speech recognition unit 22.
The functions of the communication unit 21 are the same as those of the communication unit 15 described above, so a detailed description is omitted. The speech recognition unit 22 has the same functions as the speech recognition unit 12 described above, but differs in that its recognition dictionary is larger than that of the speech recognition unit 12. That is, the speech recognition unit 22 is configured to be able to perform speech recognition with higher accuracy than the speech recognition unit 12.

The information providing apparatus 10 and the speech recognition server 20 can each be configured as an information processing apparatus having a CPU, a main storage device, and an auxiliary storage device. The means shown in FIG. 1 function when a program stored in the auxiliary storage device is loaded into the main storage device and executed by the CPU. All or some of the illustrated functions may instead be executed using dedicated circuits.

<Processing flowchart>
Next, the specific processing performed by the information providing apparatus 10 is described. FIG. 2 is a flowchart showing the processing executed by the information providing apparatus 10.
First, in step S11, the speech acquisition unit 11 acquires speech from a vehicle occupant through a microphone (not shown). The acquired speech is converted into speech data and transmitted to the speech recognition unit 12 and the communication unit 15.
Next, in step S12, the speech recognition unit 12 performs speech recognition on the acquired speech data and converts the speech into text. The resulting text is transmitted to the information acquisition unit 13 together with its likelihood.

Next, in step S13, the communication unit 15 starts transmitting the acquired speech data to the speech recognition server 20. The information providing apparatus 10 performs the processing of step S14 in parallel with the transmission of the speech data. The transmitted speech data is converted into text by the speech recognition unit 22 and, as soon as the conversion is complete, is transmitted to the information acquisition unit 13 via the communication unit 21 and the communication unit 15.

In step S14, the information acquisition unit 13 determines whether the local speech recognition succeeded (that is, whether a speech recognition result was obtained). If a speech recognition result was obtained, the processing proceeds to step S15.

In step S15, the information acquisition unit 13 determines whether the likelihood with which the speech recognition unit 12 performed speech recognition (hereinafter, the local likelihood) is equal to or greater than a predetermined threshold. If it is, the local speech recognition result is adopted and the information to be provided is generated (step S16).

If the local likelihood is below the threshold, the result of the speech recognition performed by the speech recognition unit 12 is presented to the user through the input/output unit 14, and the user is asked to confirm whether it is correct (step S17). The inquiry may be made on the screen or by voice. The confirmation may present the recognition result as it is, or may interpret the context based on the recognition result and present that interpretation.
For example, the inquiry may be "Is the command you entered 'Tell me about restaurants around XX Station'?" or "May I search for restaurants around XX Station?".

If the answer is affirmative, that is, an answer indicating that the recognition result is correct is obtained, the local speech recognition result is adopted and the information to be provided is generated (step S16).

On the other hand, if a negative answer is obtained in step S17, that is, an answer indicating that the recognition result is incorrect, the information providing apparatus 10 acquires the result of the speech recognition performed by the speech recognition server 20 and attempts to adopt it.
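The branching of steps S14, S15, and S17 can be summarized as a small routing function. The following is a minimal sketch, not the implementation of the embodiment: the enum `Next`, the function `route_after_local_recognition`, and the `confirm` callback are hypothetical names, and a failed local recognition is represented here by `text is None`. The branch taken when S14 fails corresponds to the FIG. 3(B) processing that the text describes later.

```python
from enum import Enum
from typing import Callable, Optional


class Next(Enum):
    """Where processing goes after the local recognition (labels are illustrative)."""
    ADOPT_LOCAL = "S16"
    SERVER_WITH_CONFIRMATION = "FIG. 3(A)"
    SERVER_WITHOUT_CONFIRMATION = "FIG. 3(B)"


def route_after_local_recognition(text: Optional[str],
                                  local_likelihood: float,
                                  threshold: float,
                                  confirm: Callable[[str], bool]) -> Next:
    # S14: did local recognition produce any result at all?
    if text is None:
        return Next.SERVER_WITHOUT_CONFIRMATION
    # S15: is the local likelihood at or above the threshold?
    if local_likelihood >= threshold:
        return Next.ADOPT_LOCAL
    # S17: present the local result and ask the user whether it is correct.
    if confirm(text):
        return Next.ADOPT_LOCAL
    return Next.SERVER_WITH_CONFIRMATION
```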

This processing is described with reference to FIG. 3(A). The processing shown in FIG. 3(A) is performed by the information acquisition unit 13.
First, in step S21, it is determined whether a recognition result has been acquired from the speech recognition server 20. If no recognition result has been acquired, the processing proceeds to step S25 and the speech is acquired again. A recognition result is regarded as not acquired both when a response indicating that recognition failed has been received and when no response has been obtained within the timeout period.

If there is a recognition result transmitted from the speech recognition server 20, it is determined whether that recognition result is identical to the local recognition result (step S22). If they are identical, the server result matches the local result that the user has already rejected, so both are incorrect; the processing proceeds to step S25 and the speech is acquired again.

If the local recognition result and the server recognition result differ, the processing proceeds to step S23, where the result of the speech recognition performed by the server is presented to the user and confirmation is requested. The confirmation method may be the same as that used in step S17.
If the answer is affirmative (that is, an answer indicating that the recognition result is correct is obtained), the processing proceeds to step S24 and the server's speech recognition result is adopted. If the answer is negative (that is, an answer indicating that the recognition result is incorrect is obtained), the processing proceeds to step S25 and the speech is acquired again.
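Steps S21 to S25 of FIG. 3(A) reduce to three checks in sequence. The sketch below illustrates this under assumptions: the function name `resolve_with_server` and the `confirm` callback are hypothetical, a missing or timed-out server response is modeled as `None`, and a return value of `None` stands for "acquire the speech again" (S25).

```python
from typing import Callable, Optional


def resolve_with_server(local_text: str,
                        server_text: Optional[str],
                        confirm: Callable[[str], bool]) -> Optional[str]:
    """Return the adopted server text (S24), or None to re-acquire speech (S25)."""
    # S21: no server result (recognition failure or timeout).
    if server_text is None:
        return None
    # S22: same text as the local result, which the user already rejected.
    if server_text == local_text:
        return None
    # S23: present the server result and ask the user to confirm it.
    if confirm(server_text):
        return server_text  # S24
    return None             # S25
```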

On the other hand, if the local speech recognition failed in step S14, the information providing apparatus 10 acquires the result of the speech recognition performed by the speech recognition server 20 and attempts to adopt it.
This processing is described with reference to FIG. 3(B). The processing shown in FIG. 3(B) is performed by the information acquisition unit 13.
First, in step S21, it is determined whether a recognition result has been acquired from the speech recognition server 20. If no recognition result has been acquired, the processing proceeds to step S25 and the speech is acquired again. If there is a recognition result transmitted from the speech recognition server 20, the processing proceeds to step S24 and the server's speech recognition result is adopted.
That is, the processing of FIG. 3(B) differs from that of FIG. 3(A) in that the step of presenting the server's recognition result to the user is omitted.

As described above, when the likelihood of the local speech recognition result is below the threshold, the information providing apparatus according to the present embodiment has the user confirm the recognition result, and executes the processing for adopting the server's speech recognition result only when the answer is negative.
With this configuration, when the local recognition result is in doubt, the server's recognition result is not adopted unconditionally; instead, the apparatus decides which to adopt after querying the user. This makes it possible to secure recognition accuracy while improving responsiveness.

(Modification)
The above embodiment is merely an example, and the present invention can be implemented with appropriate modifications without departing from its gist.
For example, the threshold in step S15 need not be a fixed value. It may be, for example, a dynamic value set on the basis of learning results.
In addition, when speech recognition yields two or more results, the inquiry in step S17 or S23 may be made by presenting the results one after another.
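The modification in which several candidate results are presented one after another might look like the following. This is a hedged sketch only: the function `confirm_in_order` and the `ask_user` callback are hypothetical, and the source does not say in what order candidates are presented or what happens when all are rejected (here the function simply returns `None`).

```python
from typing import Callable, Iterable, Optional


def confirm_in_order(candidates: Iterable[str],
                     ask_user: Callable[[str], bool]) -> Optional[str]:
    """Present candidates one by one and return the first one the user accepts."""
    for text in candidates:
        if ask_user(text):
            return text
    return None  # every candidate was rejected


# Usage: the user accepts only the second candidate.
chosen = confirm_in_order(["route to A", "route to B"],
                          lambda text: text == "route to B")
```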

In addition, when the likelihood of the speech recognition can be acquired from the speech recognition server (hereinafter, the server likelihood), whether to execute or omit step S23 may be decided based on the server likelihood. For example, when the server likelihood exceeds a predetermined threshold, step S23 may be omitted and step S24 executed.
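As an illustration of this modification, the confirmation of step S23 can be made conditional on the server likelihood. The names `adopt_server_result` and `confirm`, and the use of `None` for "server likelihood not available", are assumptions made for this sketch and are not taken from the source.

```python
from typing import Callable, Optional


def adopt_server_result(server_text: str,
                        server_likelihood: Optional[float],
                        server_threshold: float,
                        confirm: Callable[[str], bool]) -> Optional[str]:
    """Return the server text if adopted (S24), or None to re-acquire speech (S25)."""
    # High server likelihood: skip the user confirmation (S23) and adopt directly.
    if server_likelihood is not None and server_likelihood > server_threshold:
        return server_text
    # Otherwise fall back to the normal confirmation step (S23).
    return server_text if confirm(server_text) else None
```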

In the description of the embodiment, transmission of the voice to the speech recognition server starts in step S13; however, the transmission may instead be performed in step S21 or later. Doing so reduces the amount of communication.
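
The deferred transmission described above can be illustrated with the sketch below. All names (`recognize_deferred`, `local_recognize`, `server_recognize`, `confirm`) are hypothetical, and the server call is modeled as a plain synchronous function for simplicity.

```python
from typing import Callable, Tuple


def recognize_deferred(audio: bytes,
                       local_recognize: Callable[[bytes], Tuple[str, float]],
                       server_recognize: Callable[[bytes], str],
                       confirm: Callable[[str], bool],
                       threshold: float) -> str:
    """Send audio to the server only after the local result has been rejected."""
    text, likelihood = local_recognize(audio)
    # Local result adopted directly or confirmed by the user: nothing is transmitted.
    if likelihood >= threshold or confirm(text):
        return text
    # Transmission happens only on this path, after the negative answer.
    return server_recognize(audio)
```

In this sketch the audio is never sent when the local result is adopted, which is where the reduction in communication comes from; the server request begins only once the user has answered negatively.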

DESCRIPTION OF SYMBOLS
10 ... Information providing apparatus
20 ... Speech recognition server
11 ... Voice acquisition unit
12, 22 ... Speech recognition unit
13 ... Information acquisition unit
14 ... Input/output unit
15, 21 ... Communication unit

Claims

A speech recognition device for recognizing speech acquired from a user, comprising:
voice acquisition means for acquiring voice from the user;
recognition means for recognizing the voice and obtaining a first recognition result and an accuracy corresponding to the first recognition result;
information acquisition means for acquiring a second recognition result, which is a result of causing a second speech recognition device different from the own device to recognize the voice; and
determination means for adopting the first recognition result when the accuracy obtained when the recognition means recognizes the voice is equal to or higher than a predetermined value, and, when the accuracy is lower than the predetermined value, making an inquiry to the user and determining, based on an answer to the inquiry, which of the first recognition result and the second recognition result to adopt.

The speech recognition device according to claim 1, wherein the determination means inquires of the user whether the first recognition result is correct when the accuracy is lower than the predetermined value, and adopts the first recognition result when the answer is affirmative.

The speech recognition device according to claim 2, wherein the determination means adopts the second recognition result when the answer is negative and the first recognition result differs from the second recognition result.

The speech recognition device according to claim 2, wherein, when the answer is negative and the first recognition result differs from the second recognition result, the determination means inquires of the user whether the second recognition result is correct, and adopts the second recognition result if the answer to that inquiry is affirmative.

The speech recognition device according to claim 2, wherein the information acquisition means transmits the voice acquired by the voice acquisition means to the second speech recognition device and acquires the second recognition result when the answer is negative.

The speech recognition device according to claim 1, wherein the second speech recognition device communicates with the determination means via a wireless network and has higher speech recognition accuracy than the own device.

The speech recognition device according to claim 1, further comprising information providing means for providing information to the user based on the speech recognition result adopted by the determination means.

A speech recognition method performed by a speech recognition device that recognizes speech acquired from a user, the method comprising:
a voice acquisition step of acquiring voice from the user;
a recognition step of recognizing the voice and obtaining a first recognition result and an accuracy corresponding to the first recognition result; and
a determination step of adopting the first recognition result when the accuracy obtained when the voice is recognized in the recognition step is equal to or higher than a predetermined value, and, when the accuracy is lower than the predetermined value, making an inquiry to the user and determining, based on an answer to the inquiry, which to adopt: the first recognition result, or a second recognition result that is a result of causing a second speech recognition device different from the speech recognition device to recognize the voice.