JP5658641B2

JP5658641B2 - Terminal device, voice recognition program, voice recognition method, and voice recognition system

Info

Publication number: JP5658641B2
Application number: JP2011202064A
Authority: JP
Inventors: 孝輔辻野; 真也飯塚; 俊治栗栖; 悟史須田; 恭子増田
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2011-09-15
Filing date: 2011-09-15
Publication date: 2015-01-28
Anticipated expiration: 2031-09-15
Also published as: JP2013064777A

Description

The present invention relates to a terminal device that processes speech recognition results, a speech recognition program, a speech recognition method, and a speech recognition system.

There are techniques for recognizing the content of speech input through a microphone and executing the recognition result as a command. For example, Patent Documents 1 and 2 listed below disclose techniques that perform in-terminal speech recognition and convert its result into a command.

JP 2007-318319 A
JP H07-219587 A

However, the inventions described in Patent Documents 1 and 2 have the following problem. In those inventions, a command is executed if the speech recognition result is registered in the command dictionary, and an error is reported if it is not. Because the vocabulary that in-terminal speech recognition can recognize is limited, when a command contains a word outside that vocabulary, the command may not be recognized correctly by in-terminal speech recognition, and therefore may not be executed, even if it is spoken correctly.

Accordingly, to solve this problem, an object of the present invention is to provide a terminal device, a speech recognition program, a speech recognition method, and a speech recognition system that execute commands input by speech quickly and reliably.

To solve the above problem, a terminal device of the present invention comprises: speech input means for receiving input of a speech signal; speech recognition means for performing speech recognition on the speech signal received by the speech input means; a command dictionary in which a plurality of commands are registered; speech transmission means for transmitting the speech signal to a speech recognition server; server speech recognition result receiving means for receiving a server speech recognition result, which is the result of speech recognition performed on the speech signal by the speech recognition server; command matching means for matching an in-terminal speech recognition result, which is the result recognized by the speech recognition means, against the command dictionary, determining which of the in-terminal speech recognition result and the server speech recognition result to use, and identifying the command indicated by the speech signal based on the determined result; and command execution means for executing the command identified by the command matching means.

A speech recognition system of the present invention includes a terminal device and a speech recognition server. The terminal device comprises: speech input means for receiving input of a speech signal; speech recognition means for performing speech recognition on the speech signal received by the speech input means; a command dictionary in which a plurality of commands are registered; speech transmission means for transmitting the speech signal to the speech recognition server; server speech recognition result receiving means for receiving a server speech recognition result, which is the result of speech recognition performed on the speech signal by the speech recognition server; command matching means for matching an in-terminal speech recognition result, which is the result recognized by the speech recognition means, against the command dictionary, determining which of the in-terminal speech recognition result and the server speech recognition result to use, and identifying the command indicated by the speech signal based on the determined result; and command execution means for executing the command identified by the command matching means. The speech recognition server comprises: speech receiving means for receiving the speech signal transmitted from the speech transmission means; a server dictionary containing a larger vocabulary than the terminal device; server speech recognition means for performing speech recognition, based on the server dictionary, on the speech signal received by the speech receiving means; and server speech recognition result transmission means for transmitting the server speech recognition result to the terminal device.

According to the present invention, the in-terminal speech recognition result is matched against the command dictionary, and which of the in-terminal speech recognition result and the server speech recognition result to use is determined based on the matching result. For example, when the in-terminal speech recognition result is accepted as a command, the command is executed using the in-terminal result; when it cannot be accepted as a command, the server speech recognition result can be used. That is, when a command recognizable with the in-terminal vocabulary is input by speech, the terminal can respond quickly by using the in-terminal result; when a command that is not recognizable with the in-terminal vocabulary is input, the command can be reliably recognized and executed by using the server result.

Preferably, the command matching means matches the in-terminal speech recognition result against the command dictionary to calculate a certainty factor indicating the likelihood that the result corresponds to a command and, when the certainty factor is equal to or greater than a predetermined threshold, decides to use the in-terminal speech recognition result and identifies the command whose certainty factor is at or above the threshold as the command indicated by the speech signal. With this configuration, the in-terminal speech recognition result can be used when the speech was recognized correctly, even if the reliability of that result is low. As a result, a command recognizable with the in-terminal vocabulary can be executed quickly even when the reliability of the in-terminal speech recognition result is low.

Preferably, the command dictionary stores a keyword list in which, for each of the plurality of commands, a plurality of keywords and scores associated with those keywords are registered, and the command matching means determines, for each word contained in the in-terminal speech recognition result, whether the word corresponds to any of the keywords registered in the keyword list, and calculates the certainty factor based on the commands and scores associated with the matching keywords. With this configuration, it can be determined whether the command is one recognizable with the in-terminal vocabulary, and the command indicated by the speech signal can be recognized more reliably.

Preferably, the command matching means determines, for each word contained in the in-terminal speech recognition result, whether the word corresponds to any of the keywords registered in the keyword list, and calculates the certainty factor based on the commands and scores associated with the matching keywords and on the speech recognition reliability of the words. With this configuration, it can be determined whether the command is one recognizable with the in-terminal vocabulary, and the command indicated by the speech signal can be recognized more reliably.
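The keyword-based calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the keyword list contents, the function name `certainty_per_command`, and the rule of summing score × reliability over the matching words are hypothetical choices made for this example. The text above states only that the certainty factor is based on the commands and scores of the matching keywords and on the reliability of the recognized words.

```python
from collections import defaultdict

# Hypothetical keyword list: command -> {keyword: score}.
# The entries and score values are illustrative only.
KEYWORD_LIST = {
    "call": {"call": 2.0, "phone": 1.5},
    "mail": {"mail": 2.0, "send": 1.0},
}


def certainty_per_command(words, keyword_list=KEYWORD_LIST):
    """Return {command: certainty factor} for a recognized utterance.

    `words` is a list of (word, reliability) pairs taken from the
    in-terminal speech recognition result. Assumption: each matching
    keyword contributes score * reliability, and contributions are summed.
    """
    totals = defaultdict(float)
    for word, reliability in words:
        for command, keywords in keyword_list.items():
            if word in keywords:
                totals[command] += keywords[word] * reliability
    return dict(totals)
```

For instance, `certainty_per_command([("send", 0.5), ("mail", 1.0)])` accumulates both matches under the hypothetical `"mail"` command, while words that match no keyword contribute nothing.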

Preferably, when the command whose certainty factor is at or above the threshold is a command instructing execution of an in-terminal function, the command matching means decides to use the in-terminal speech recognition result, and when that command is any other command, it decides to use the server speech recognition result. With this configuration, commands instructing execution of in-terminal functions can be executed quickly using the in-terminal result, and other commands can be executed reliably using the server result.

Preferably, the speech transmission means transmits the speech signal to the speech recognition server before the speech recognition means obtains the in-terminal speech recognition result. With this configuration, the server speech recognition result can be received sooner, so the command can be executed quickly when it is decided to use the server result.

Preferably, when the server speech recognition result receiving means receives the server speech recognition result after the command matching means has identified a command based on the in-terminal speech recognition result, it discards the server speech recognition result. With this configuration, the command can be identified from the in-terminal result without waiting for the server result to arrive, so commands recognizable with the in-terminal vocabulary can be executed quickly.
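The combination of early transmission and discarding a late server result can be sketched with `asyncio`. This is a minimal sketch, not the implementation of the invention: the two recognizer coroutines are hypothetical stand-ins with artificial delays, the threshold value is an example, and cancelling the pending server task is used here to model discarding a server result that would arrive after the command has already been identified.

```python
import asyncio

THRESHOLD = 2.5  # example threshold for the certainty factor


async def local_recognition(audio: str):
    """Hypothetical in-terminal recognizer: returns (command, certainty)."""
    await asyncio.sleep(0.01)  # fast, no network round trip
    if audio == "call Tanaka":
        return ("call", 3.0)
    return (None, 0.0)  # outside the in-terminal vocabulary


async def server_recognition(audio: str):
    """Hypothetical server recognizer: slower, larger vocabulary."""
    await asyncio.sleep(0.05)  # simulated network and server delay
    return f"server:{audio}"


async def identify_command(audio: str):
    # Start the server request before the in-terminal result is available.
    server_task = asyncio.create_task(server_recognition(audio))
    command, certainty = await local_recognition(audio)
    if command is not None and certainty >= THRESHOLD:
        # The command is already identified; the server result is not needed.
        server_task.cancel()
        return ("terminal", command)
    # Otherwise fall back to the server result, which is already in flight.
    return ("server", await server_task)
```

Calling `asyncio.run(identify_command("call Tanaka"))` returns as soon as the in-terminal path succeeds, whereas an utterance outside the in-terminal vocabulary waits only for the remainder of the server round trip.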

The present invention can be described not only as an invention of a terminal device, as above, but also as inventions of a speech recognition program and a speech recognition method, as follows. These differ only in category; they are substantially the same invention and provide the same operations and effects.

That is, a speech recognition program of the present invention comprises: a speech input module that receives input of a speech signal; a speech recognition module that performs speech recognition on the speech signal received by the speech input module; a speech transmission module that transmits the speech signal to a speech recognition server; a server speech recognition result receiving module that receives a server speech recognition result, which is the result of speech recognition performed on the speech signal by the speech recognition server; a command matching module that matches an in-terminal speech recognition result, which is the result recognized by the speech recognition module, against a command dictionary in which a plurality of commands are registered, determines which of the in-terminal speech recognition result and the server speech recognition result to use, and identifies the command indicated by the speech signal based on the determined result; and a command execution module that executes the command identified by the command matching module.

Preferably, the command matching module matches the in-terminal speech recognition result against the command dictionary to calculate a certainty factor indicating the likelihood that the result corresponds to a command and, when the certainty factor is equal to or greater than a predetermined threshold, decides to use the in-terminal speech recognition result and identifies the command whose certainty factor is at or above the threshold as the command indicated by the speech signal.

Preferably, the command dictionary stores a keyword list in which, for each of the plurality of commands, a plurality of keywords and a score associated with each of those keywords are registered, and the command matching module determines, for each word contained in the in-terminal speech recognition result, whether the word corresponds to any of the keywords registered in the keyword list, and calculates the certainty factor based on the commands and scores associated with the matching keywords.

Preferably, the command matching module determines, for each word contained in the in-terminal speech recognition result, whether the word corresponds to any of the keywords registered in the keyword list, and calculates the certainty factor based on the commands and scores associated with the matching keywords and on the speech recognition reliability of the words.

Preferably, when the command whose certainty factor is at or above the threshold is a command instructing execution of an in-terminal function, the command matching module decides to use the in-terminal speech recognition result, and when that command is any other command, it decides to use the server speech recognition result.

Preferably, the speech transmission module transmits the speech signal to the speech recognition server before the speech recognition module obtains the in-terminal speech recognition result.

Preferably, when the server speech recognition result receiving module receives the server speech recognition result after the command matching module has identified a command based on the in-terminal speech recognition result, it discards the server speech recognition result.

A speech recognition method of the present invention comprises: a speech input step of receiving input of a speech signal; a speech recognition step of performing speech recognition on the speech signal received in the speech input step; a speech transmission step of transmitting the speech signal to a speech recognition server; a server speech recognition result receiving step of receiving a server speech recognition result, which is the result of speech recognition performed on the speech signal by the speech recognition server; a command matching step of matching an in-terminal speech recognition result, which is the result recognized in the speech recognition step, against a command dictionary in which a plurality of commands are registered, determining which of the in-terminal speech recognition result and the server speech recognition result to use, and identifying the command indicated by the speech signal based on the determined result; and a command execution step of executing the command identified in the command matching step.

According to the present invention, commands input by speech can be executed quickly and reliably.

FIG. 1 is a diagram showing the functional configuration of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the hardware configuration of the terminal device of FIG. 1.
FIG. 3 is a diagram showing an example of the keyword list stored in the command dictionary of FIG. 1.
FIG. 4 is a diagram for explaining the method of calculating the certainty factor in the speech recognition system of FIG. 1.
FIG. 5 is a flowchart for explaining an example of the command determination and execution processing executed by the terminal device of FIG. 1.
FIG. 6 is a flowchart for explaining another example of the command determination and execution processing executed by the terminal device of FIG. 1.
FIG. 7 is a flowchart for explaining another example of the command determination and execution processing executed by the terminal device of FIG. 1.
FIG. 8 is a diagram for explaining the command determination and execution processing in the speech recognition system of FIG. 1.
FIG. 9 is a diagram for explaining the command determination and execution processing in the speech recognition system of FIG. 1.
FIG. 10 is a flowchart for explaining an example of the function/application determination and invocation processing in the speech recognition system of FIG. 1.
FIG. 11 is a flowchart showing the continuation of FIG. 10.
FIG. 12 is a diagram for explaining the function/application determination and invocation processing in the speech recognition system of FIG. 1.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are given the same reference numerals, and redundant description is omitted.

FIG. 1 is a diagram showing the functional configuration of the speech recognition system according to the present embodiment. As shown in FIG. 1, the speech recognition system 10 recognizes a speech signal input to the terminal device 1 by in-terminal speech recognition or network-type speech recognition, and includes the terminal device 1 and the speech recognition server 2. The terminal device 1 is a device that, in response to the command indicated by a speech signal input from a speech input device such as a microphone, activates functions of the terminal device 1, acquires information from outside, and so on. The speech recognition server 2 is a device that performs speech recognition on the speech signal transmitted from the terminal device 1 to obtain a server speech recognition result, and transmits that result to the terminal device 1. The terminal device 1 and the speech recognition server 2 are communicatively connected via a network NW.

Here, in-terminal speech recognition means speech recognition performed within the terminal device 1. It can respond quickly because it requires no communication over the network NW, but because its recognition vocabulary is small, it may fail to recognize speech accurately. Network-type speech recognition means speech recognition performed by the speech recognition server 2. It has a larger recognition vocabulary than in-terminal speech recognition and higher recognition accuracy, but because it communicates over the network NW, its response may be slowed by communication delays and the like.

Functionally, the terminal device 1 includes a speech input unit 11 (speech input means), a speech recognition unit 12 (speech recognition means), a user dictionary 13, a speech transmission unit 14 (speech transmission means), a server speech recognition result receiving unit 15 (server speech recognition result receiving means), a command matching unit 16 (command matching means), a command dictionary 17, and a command execution unit 18 (command execution means). The terminal device 1 is a device equipped with a speech input device, such as a mobile phone, a smartphone, a PDA (Personal Digital Assistant), a car navigation system, or a notebook PC, and is configured with the hardware shown in FIG. 2.

FIG. 2 is a diagram showing the hardware configuration of the terminal device 1. As shown in FIG. 2, the terminal device 1 is physically configured with hardware including a CPU (Central Processing Unit) 101, a RAM (Random Access Memory) 102 serving as the main storage device, a ROM (Read Only Memory) 103, an auxiliary storage device 104 such as a hard disk, a communication module 105 that is a data transmission/reception device such as a network card, a speech input device 106 such as a microphone, an input device 107 such as a keyboard or mouse, and an output device 108 such as a liquid crystal display. Each function of the terminal device 1 described with reference to FIG. 1 is realized by loading predetermined computer software onto hardware such as the CPU 101 and the RAM 102 shown in FIG. 2, operating the speech input device 106, the input device 107, the output device 108, and so on under the control of the CPU 101, and reading and writing data in the RAM 102 and the auxiliary storage device 104.

Next, the functions of the terminal device 1 are described with reference to FIG. 1. The speech input unit 11 functions as speech input means that receives input of a speech signal. Specifically, the speech input unit 11 receives a speech signal based on the content of the user's utterance, input through the speech input device 106 such as a microphone. The speech input unit 11 then sends the received speech signal to the speech recognition unit 12 and the speech transmission unit 14.

The speech recognition unit 12 functions as speech recognition means that performs speech recognition on the speech signal received by the speech input unit 11. The speech recognition unit 12 refers to a prestored acoustic model and language model and to the user dictionary 13, described later, to obtain an in-terminal speech recognition result, which is the speech recognition result of the terminal device 1. The in-terminal speech recognition result includes character string data composed of a plurality of words, which is the character string obtained by recognizing the speech signal, and a reliability indicating the likelihood of the recognition result for the character string data as a whole or for each word constituting it. The speech recognition unit 12 then sends the in-terminal speech recognition result to the command matching unit 16. Note that the speech recognition unit 12 cannot recognize the speech correctly when the utterance contains a word outside the in-terminal recognition vocabulary, that is, a word not registered in the user dictionary 13.
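One possible in-memory shape for the in-terminal speech recognition result is sketched below. The class and field names are hypothetical; the only points taken from the description are that the result carries a word sequence and a reliability given either for the whole string or for each word.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecognizedWord:
    text: str
    # Per-word reliability; None when only an overall value is available.
    reliability: Optional[float] = None


@dataclass
class InTerminalResult:
    words: List[RecognizedWord]
    # Reliability of the whole character string; None when given per word.
    overall_reliability: Optional[float] = None

    @property
    def text(self) -> str:
        """Character string data reassembled from the recognized words."""
        return " ".join(word.text for word in self.words)
```

A result for a two-word utterance could then be built as `InTerminalResult([RecognizedWord("call", 0.9), RecognizedWord("Tanaka", 0.4)])`, with the reliability values here chosen purely for illustration.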

The user dictionary 13 is a registered list of the words to be recognized. In addition to commonly used words, the user dictionary 13 may contain words specific to the user of the terminal device 1. For example, names registered in the phone book of the terminal device 1 and the names of content such as music and videos stored in the terminal device 1 are each registered as words in the user dictionary 13. Proper nouns such as place names, station names, product names, restaurant names, and application names are so varied that they may not be registered in the user dictionary 13.

The speech transmission unit 14 functions as speech transmission means that transmits the speech signal to the speech recognition server 2. Specifically, the speech transmission unit 14 receives the speech signal sent by the speech input unit 11 and transmits it to the speech recognition server 2 via the network NW. The speech transmission unit 14 transmits the speech signal either uncompressed or compressed. The speech transmission unit 14 may temporarily store the speech signal received from the speech input unit 11 and transmit it to the speech recognition server 2 on instruction from the command matching unit 16, described later. Alternatively, the speech transmission unit 14 may transmit the speech signal received from the speech input unit 11 to the speech recognition server 2 without waiting for an instruction from the command matching unit 16.
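The two transmission modes described above (hold the signal until instructed, or send immediately) can be sketched as a small class. The class name `SpeechTransmitter`, its method names, and the injected `send` callable are assumptions for illustration; compression and the actual network transport are left out.

```python
from typing import Callable, Optional


class SpeechTransmitter:
    """Sketch of a transmitter that either buffers or sends immediately."""

    def __init__(self, send: Callable[[bytes], None], wait_for_instruction: bool):
        self._send = send  # hypothetical transport to the recognition server
        self._wait = wait_for_instruction
        self._buffered: Optional[bytes] = None

    def on_speech_signal(self, signal: bytes) -> None:
        """Called when the speech input side hands over a speech signal."""
        if self._wait:
            self._buffered = signal  # hold until the matcher instructs sending
        else:
            self._send(signal)  # send without waiting for an instruction

    def on_instruction(self) -> None:
        """Called when the command matcher asks for the signal to be sent."""
        if self._buffered is not None:
            self._send(self._buffered)
            self._buffered = None
```

Sending immediately trades extra network traffic for a shorter wait when the server result turns out to be needed; buffering avoids transmission whenever the in-terminal result is accepted.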

The server speech recognition result receiving unit 15 functions as server speech recognition result receiving means that receives the server speech recognition result, which is the result of speech recognition performed on the speech signal by the speech recognition server 2. The server speech recognition result receiving unit 15 receives the server speech recognition result from the speech recognition server 2 via the network NW, and sends the received result to the command matching unit 16.

The command matching unit 16 functions as command matching means that matches the in-terminal speech recognition result against the command dictionary 17, described later, determines which of the in-terminal speech recognition result and the server speech recognition result to use, and identifies the command indicated by the speech signal based on the determined result. Specifically, the command matching unit 16 first matches the in-terminal speech recognition result against the command dictionary 17 and calculates a certainty factor indicating the likelihood that the result corresponds to a command. The method of calculating the certainty factor is described later.

When the certainty factor is equal to or greater than a predetermined threshold, the command matching unit 16 decides to use the in-terminal speech recognition result and identifies the command whose certainty factor is equal to or greater than the threshold as the command indicated by the voice signal. When the certainty factor is less than the threshold, that is, when the in-terminal speech recognition result could not be accepted as a command, the command matching unit 16 decides to use the server speech recognition result and instructs the voice transmission unit 14 to transmit the voice signal received from the voice input unit 11 to the speech recognition server 2. The threshold is a fixed value set in advance in the command matching unit 16, for example 2.5.
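The acceptance rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation described in the patent: the function name `choose_result` and the string labels are hypothetical, and 2.5 is the example threshold given in the text.

```python
# Example threshold from the text; the constant name is an assumption.
THRESHOLD = 2.5


def choose_result(certainty: float, threshold: float = THRESHOLD) -> str:
    """Return which recognition result to use for a given certainty factor."""
    # At or above the threshold: accept the in-terminal result as a command.
    if certainty >= threshold:
        return "in-terminal"
    # Below the threshold: fall back to the server speech recognition result.
    return "server"
```

Note that a certainty factor exactly equal to the threshold is accepted, matching the "equal to or greater than" wording.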

The command matching unit 16 may instead decide to use the in-terminal speech recognition result when the command whose certainty factor is equal to or greater than the threshold instructs execution of an in-terminal function, and decide to use the server speech recognition result when that command is any other kind of command. Here, an in-terminal function is a function provided in the terminal device 1 (for example, a telephone function, a mail function, a camera function, or a schedule function) that can be executed within the terminal device 1 without using any other device on the network NW. The command matching unit 16 may determine whether a function is an in-terminal function based on, for example, the keyword list described later. In this case, commands that instruct execution of an in-terminal function can be executed quickly using the in-terminal speech recognition result, and all other commands can be executed reliably using the server speech recognition result.
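This variant adds a second condition to the threshold test. The sketch below assumes the matcher already knows whether the best-scoring command targets an in-terminal function; the function and parameter names are hypothetical.

```python
def choose_result_by_function(certainty: float, is_in_terminal: bool,
                              threshold: float = 2.5) -> str:
    """Use the in-terminal result only for accepted in-terminal commands."""
    # Accepted and executable without the network: use the fast local result.
    if certainty >= threshold and is_in_terminal:
        return "in-terminal"
    # Rejected, or accepted but not an in-terminal function: ask the server.
    return "server"
```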

The command dictionary 17 is a list of the commands that can be used on the terminal device 1. In addition to commands that instruct execution of in-terminal functions, the command dictionary 17 may include commands that instruct execution of out-of-terminal functions, which rely on an external server or the like. Commands for in-terminal functions include, for example, making a phone call, starting the mail function, starting the camera, and starting an in-terminal application. Commands for out-of-terminal functions include transfer search, restaurant search, shopping site search, weather forecast browsing, and application search.

FIG. 3 shows an example of the keyword list stored in the command dictionary 17. As shown in FIG. 3, the keyword list stores a "keyword", a "function", a "score", and an "in-terminal function" field in association with one another. The "keyword" field holds a keyword that has a certain relevance to a function, for example "phone", "mail", "read", or "buy". The "function" field holds a function that the terminal device 1 can execute, for example "phone", "mail", "camera", "transfer search", or "shopping search". These functions include out-of-terminal functions as well as in-terminal functions. The "score" field holds a value indicating how strongly the associated keyword is related to the function; the stronger the relation, the larger the value. The score is used to calculate the certainty factor, as described later. The "in-terminal function" field indicates whether the associated function is an in-terminal function.

For example, the keyword "phone" is associated with the telephone function, a score of 2.0, and "Yes", which indicates an in-terminal function. That is, when a speech recognition result contains the word "phone", it is very likely a command instructing execution of the telephone function, so a score of 2.0 is assigned for the telephone function.

A keyword that indicates a person's name is associated with both the telephone function and the mail function, with a score of 0.5 assigned to each. In this way, one keyword may be associated with several different functions. Because such a keyword maps to more than one function, its presence in a speech recognition result is not enough to identify which function the command is meant to execute. For this reason, a keyword associated with several functions may be given a small score, or may be given no score at all. Keywords indicating a person's name may include the family names and given names registered in the phone book function of the terminal device 1. They may also be set based on additional information contained in the speech recognition result, such as part of speech, or taken from a fixed dictionary of personal names.
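One way to picture the keyword list is as a small table of records. The sketch below is an assumption about representation only: the class, field, and function names are hypothetical, the "phone" and person-name scores are the ones stated above, and the "call" entry (0.5 for the telephone function) is taken from the worked example in the description. Treating the mail function as in-terminal follows the list of in-terminal functions given earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordEntry:
    keyword: str       # keyword, or a placeholder for a class of keywords
    function: str      # function the keyword is related to
    score: float       # strength of the keyword-function relation
    in_terminal: bool  # True if the function runs without the network


# Hypothetical placeholder standing for "any word recognized as a person name".
PERSON_NAME = "<person name>"

KEYWORD_LIST = [
    KeywordEntry("phone", "phone", 2.0, True),
    KeywordEntry("call", "phone", 0.5, True),
    KeywordEntry(PERSON_NAME, "phone", 0.5, True),
    KeywordEntry(PERSON_NAME, "mail", 0.5, True),
]


def lookup(word: str, person_names: frozenset = frozenset()) -> list:
    """Return every entry whose keyword matches the recognized word."""
    # Words found in the name set (e.g. phone book entries) map to the placeholder.
    key = PERSON_NAME if word in person_names else word
    return [entry for entry in KEYWORD_LIST if entry.keyword == key]
```

A person name returns two entries (telephone and mail), which is exactly the ambiguity that motivates the small 0.5 scores.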

The command execution unit 18 functions as command execution means for executing the command identified by the command matching unit 16. The command matching unit 16 then outputs the result of executing the command to, for example, the output device 108.

Next, the functions of the speech recognition server 2 are described with reference to FIG. 1. Functionally, the speech recognition server 2 includes a voice reception unit 21 (voice reception means), a server speech recognition unit 22 (server speech recognition means), a large-vocabulary dictionary 23 (server dictionary), and a server speech recognition result transmission unit 24 (server speech recognition result transmission means).

The voice reception unit 21 functions as voice reception means for receiving the voice signal transmitted from the terminal device 1. Specifically, the voice reception unit 21 receives the voice signal from the voice transmission unit 14 of the terminal device 1 via the network NW and passes it to the server speech recognition unit 22.

The server speech recognition unit 22 functions as server speech recognition means that performs speech recognition on the voice signal received by the voice reception unit 21, based on the large-vocabulary dictionary 23. More specifically, the server speech recognition unit 22 obtains the server speech recognition result by referring to an acoustic model and a language model stored in advance and to the large-vocabulary dictionary 23 described later. The server speech recognition result contains character string data, which is the string of words obtained by recognizing the voice signal, and the reliability of each word in that string. The server speech recognition unit 22 can perform speech recognition with higher accuracy than the speech recognition unit 12.

The large-vocabulary dictionary 23 contains a larger vocabulary than the user dictionary 13. It lists a wide range of registered words, including proper nouns such as place names, station names, product names, restaurant names, and application names.

The server speech recognition result transmission unit 24 functions as server speech recognition result transmission means for transmitting the server speech recognition result to the terminal device 1. Specifically, the server speech recognition result transmission unit 24 receives, from the server speech recognition unit 22, the server speech recognition result for the voice signal received by the voice reception unit 21, and transmits it via the network NW to the server speech recognition result reception unit 15 of the terminal device 1.

Next, the method of calculating the certainty factor is described using the example shown in FIG. 4, which illustrates how the certainty factor is calculated from a speech recognition result. As shown in FIG. 4, suppose the user speaks utterance (a) "やまださんにでんわをかける" ("Yamada-san ni denwa o kakeru", that is, "make a phone call to Yamada-san") into the voice input device 106, and the speech recognition unit 12 obtains in-terminal speech recognition result (b) "山田さんに電話をかける". The command matching unit 16 refers to the keyword list and determines whether each word of in-terminal speech recognition result (b) matches a keyword registered in the keyword list. When a word in result (b) matches a registered keyword, the command matching unit 16 obtains from the keyword list the function associated with that keyword and the score assigned to that function.

In this case, "Yamada" (山田) is a person's name, so it is determined to match a keyword registered in the keyword list. Based on the keyword list, the command matching unit 16 obtains a score of 0.5 for the telephone function and a score of 0.5 for the mail function. Because "phone" (電話) matches a registered keyword, the command matching unit 16 obtains a score of 2.0 for the telephone function associated with the keyword "phone". Likewise, because "call" (かける) matches a registered keyword, the command matching unit 16 obtains a score of 0.5 for the telephone function associated with the keyword "call".

<First certainty factor calculation method>
The command matching unit 16 sums the obtained scores separately for each function and takes the largest total as the certainty factor of the command that instructs execution of the function with that total. In the example of FIG. 4, the total score of the telephone function is 0.5 + 2.0 + 0.5 = 3.0 and the total score of the mail function is 0.5, so the certainty factor of the command instructing execution of the telephone function is calculated as 3.0.
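The first method is a per-function sum followed by a maximum. The following is a minimal sketch using the scores from the FIG. 4 example; the function name, the `(function, score)` pair representation, and the handling of an empty match list are assumptions made for illustration.

```python
from collections import defaultdict

# (function, score) pairs obtained for "Yamada", "phone", and "call" in FIG. 4.
MATCHES = [("phone", 0.5), ("mail", 0.5), ("phone", 2.0), ("phone", 0.5)]


def certainty_by_max_total(matches):
    """Return (function, certainty) where certainty is the largest score total."""
    totals = defaultdict(float)
    for function, score in matches:
        totals[function] += score
    if not totals:
        return None, 0.0  # assumption: no keyword match means zero certainty
    best = max(totals, key=totals.get)
    return best, totals[best]


print(certainty_by_max_total(MATCHES))  # ('phone', 3.0)
```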

<Second certainty factor calculation method>
The command matching unit 16 sums the obtained scores separately for each function and takes the difference between the largest total and the second-largest total as the certainty factor of the command that instructs execution of the function with the largest total. In the example of FIG. 4, the total score of the telephone function is 3.0 and the total score of the mail function is 0.5, so the certainty factor of the command instructing execution of the telephone function is calculated as 3.0 - 0.5 = 2.5.
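The second method measures how far the best function is ahead of the runner-up. A sketch under the same assumptions as before (hypothetical names and pair representation) follows; the text does not say what happens when only one function has a score, so treating the runner-up as 0.0 in that case is an assumption.

```python
from collections import defaultdict

# (function, score) pairs obtained for "Yamada", "phone", and "call" in FIG. 4.
MATCHES = [("phone", 0.5), ("mail", 0.5), ("phone", 2.0), ("phone", 0.5)]


def certainty_by_margin(matches):
    """Return (function, certainty) where certainty is best total minus runner-up."""
    totals = defaultdict(float)
    for function, score in matches:
        totals[function] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None, 0.0  # assumption: no keyword match means zero certainty
    best_function, best_total = ranked[0]
    # Assumption: with a single candidate function, the runner-up total is 0.0.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_function, best_total - runner_up


print(certainty_by_margin(MATCHES))  # ('phone', 2.5)
```

Compared with the first method, an ambiguous utterance that scores highly for two functions yields a low certainty factor here.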

When a reliability is available for the speech recognition result of each word, the third or fourth certainty factor calculation method below may be used.
<Third certainty factor calculation method>
The command matching unit 16 calculates the certainty factor from the scores obtained for each function and the reliability of the speech recognition result of each word. For example, suppose that for the words of the recognized utterance "山田さんに電話をかける", the reliability of "Yamada" is 0.9, that of "san" is 0.8, that of "ni" is 0.8, that of "phone" is 1.0, that of "o" is 0.5, and that of "call" is 0.5. These reliabilities are included in the in-terminal speech recognition result obtained by the speech recognition unit 12. When the command matching unit 16 determines that a word in the in-terminal speech recognition result matches a keyword registered in the keyword list, it adds the reliability of that word to the score assigned to the function associated with the keyword. The command matching unit 16 then sums the reliability-added scores separately for each function and takes the largest total as the certainty factor of the command that instructs execution of the function with that total. In the example of FIG. 4, the total for the telephone function is 0.5 + 0.9 + 2.0 + 1.0 + 0.5 + 0.5 = 5.4 and the total for the mail function is 0.5 + 0.9 = 1.4, so the certainty factor of the command instructing execution of the telephone function is calculated as 5.4.
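The third method can be sketched by extending each matched pair with the word's reliability. The triple representation and the function name are assumptions; only matched keywords contribute, so the particles "san", "ni", and "o" do not appear. The result is compared with a tolerance in practice because the sums are floating-point.

```python
from collections import defaultdict

# (function, score, reliability of the matched word), from the FIG. 4 example.
MATCHES = [
    ("phone", 0.5, 0.9),  # "Yamada" as a person name
    ("mail", 0.5, 0.9),   # "Yamada" as a person name
    ("phone", 2.0, 1.0),  # "phone"
    ("phone", 0.5, 0.5),  # "call"
]


def certainty_additive(matches):
    """Add each word's reliability to its score, then take the largest total."""
    totals = defaultdict(float)
    for function, score, reliability in matches:
        totals[function] += score + reliability
    if not totals:
        return None, 0.0  # assumption: no keyword match means zero certainty
    best = max(totals, key=totals.get)
    return best, totals[best]
```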

<Fourth certainty factor calculation method>
When a word in the in-terminal speech recognition result matches a keyword registered in the keyword list, the command matching unit 16 multiplies the score assigned to that keyword by the reliability of the word. The command matching unit 16 then sums the reliability-weighted scores separately for each function and takes the largest total as the certainty factor of the command that instructs execution of the function with that total. Assuming the same word reliabilities as in the third certainty factor calculation method, in the example of FIG. 4 the total for the telephone function is 0.5 × 0.9 + 2.0 × 1.0 + 0.5 × 0.5 = 2.7 and the total for the mail function is 0.5 × 0.9 = 0.45, so the certainty factor of the command instructing execution of the telephone function is calculated as 2.7.
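The fourth method replaces the addition with a multiplication, so a poorly recognized word contributes proportionally less. The sketch below uses the same hypothetical triple representation as an illustration; names are assumptions.

```python
from collections import defaultdict

# (function, score, reliability of the matched word), from the FIG. 4 example.
MATCHES = [
    ("phone", 0.5, 0.9),  # "Yamada" as a person name
    ("mail", 0.5, 0.9),   # "Yamada" as a person name
    ("phone", 2.0, 1.0),  # "phone"
    ("phone", 0.5, 0.5),  # "call"
]


def certainty_weighted(matches):
    """Weight each score by its word's reliability, then take the largest total."""
    totals = defaultdict(float)
    for function, score, reliability in matches:
        totals[function] += score * reliability
    if not totals:
        return None, 0.0  # assumption: no keyword match means zero certainty
    best = max(totals, key=totals.get)
    return best, totals[best]
```

Unlike the additive variant, the weighted total stays on the same scale as the raw scores, so a threshold such as 2.5 remains directly comparable.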

Next, the command determination execution process (speech recognition method) executed by the terminal device 1 is described.

<First command determination execution process>
FIG. 5 is a flowchart showing an example of the command determination execution process of the terminal device 1. This process starts when the user of the terminal device 1 inputs voice via the voice input device 106.

The voice input unit 11 accepts input of a voice signal from the voice input device 106 (S01, voice input step) and passes the voice signal to the speech recognition unit 12 and the voice transmission unit 14. Next, the speech recognition unit 12 receives the voice signal from the voice input unit 11 and performs speech recognition on it with reference to the user dictionary 13 (S02, speech recognition step). The speech recognition unit 12 sends the command matching unit 16 an in-terminal speech recognition result that contains character string data, which is the string obtained by recognizing the voice signal, and reliability information for each word in that string.

Next, the command matching unit 16 matches the in-terminal speech recognition result against the command dictionary 17 (S03, command matching step). More specifically, the command matching unit 16 refers to the keyword list stored in the command dictionary 17 and determines whether each word in the in-terminal speech recognition result matches a keyword registered in the keyword list. When a word matches a keyword, the command matching unit 16 obtains from the keyword list the function associated with that keyword and the score assigned to that function. The command matching unit 16 then calculates the certainty factor of the command by one of the certainty factor calculation methods described above.

Next, the command matching unit 16 determines whether the in-terminal speech recognition result is accepted as a command (S04, command matching step), that is, whether the certainty factor is equal to or greater than the threshold. When the in-terminal speech recognition result is determined to be accepted as a command, that is, when the certainty factor is determined to be equal to or greater than the threshold (S04; Yes), the command matching unit 16 determines whether the accepted command instructs execution of an in-terminal function (S05). This determination is made based on, for example, the information in the keyword list that indicates whether a function is an in-terminal function.

When the accepted command is determined to instruct execution of an in-terminal function (S05; Yes), the command matching unit 16 identifies that command as the command indicated by the voice signal and instructs the command execution unit 18 to execute it. The command execution unit 18 then executes the command identified by the command matching unit 16 (S06, command execution step), and the terminal device 1 ends the command determination execution process.

On the other hand, when it is determined in S04 that the in-terminal speech recognition result was not accepted as a command, that is, that the certainty factor is less than the threshold (S04; No), or when it is determined in S05 that the accepted command is not a command instructing execution of an in-terminal function (S05; No), the command matching unit 16 instructs the voice transmission unit 14 to transmit the voice signal received from the voice input unit 11 to the speech recognition server 2. The voice transmission unit 14 then transmits the voice signal to the speech recognition server 2 (S07, voice transmission step).

On receiving the voice signal transmitted by the voice transmission unit 14, the speech recognition server 2 performs speech recognition on it and obtains a server speech recognition result. Because this speech recognition uses the large-vocabulary dictionary 23, whose vocabulary is larger than that of the user dictionary 13, it is more accurate than the in-terminal speech recognition performed by the speech recognition unit 12. The speech recognition server 2 then transmits the server speech recognition result to the terminal device 1.

The server speech recognition result reception unit 15 then receives the server speech recognition result from the speech recognition server 2 (S08, server speech recognition result reception step) and passes it to the command matching unit 16. Next, the command matching unit 16 matches the server speech recognition result against the command dictionary 17 (S09). More specifically, the command matching unit 16 refers to the keyword list stored in the command dictionary 17 and determines whether each word in the server speech recognition result matches a keyword registered in the keyword list. When a word matches a keyword, the command matching unit 16 obtains from the keyword list the function associated with that keyword and the score assigned to that function. The command matching unit 16 then calculates the certainty factor of the command by the certainty factor calculation method described above.

Next, the command matching unit 16 determines whether the server speech recognition result is accepted as a command (S10), that is, whether the certainty factor is equal to or greater than the threshold. When the certainty factor is determined to be equal to or greater than the threshold, that is, when the server speech recognition result is determined to be accepted as a command (S10; Yes), the command matching unit 16 identifies that command as the command indicated by the voice signal and instructs the command execution unit 18 to execute it. The command execution unit 18 then executes the command identified by the command matching unit 16 (S06, command execution step), and the terminal device 1 ends the command determination execution process. Because server speech recognition is more accurate than in-terminal speech recognition, the threshold used in the determination of S10 may be smaller than the threshold used in the determination of S04.

On the other hand, when it is determined in S10 that the certainty factor is less than the threshold, that is, that the server speech recognition result was not accepted as a command (S10; No), the terminal device 1 ends the command determination execution process. At this time, the terminal device 1 may display a message on the output device 108 prompting the user to input the voice again.
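The control flow of FIG. 5 can be summarized as the sketch below. Everything here is a hypothetical stand-in: the recognizers, the matcher, and the executor are passed in as plain callables, and `match` is assumed to return a `(command, certainty, is_in_terminal)` triple. The step labels in the comments refer to the flowchart steps described above.

```python
from typing import Callable, Optional, Tuple

# Assumed shape of a matching result: (command, certainty, is_in_terminal).
MatchResult = Tuple[Optional[str], float, bool]


def first_process(audio: object,
                  recognize_local: Callable[[object], str],
                  recognize_server: Callable[[object], str],
                  match: Callable[[str], MatchResult],
                  execute: Callable[[str], None],
                  local_threshold: float = 2.5,
                  server_threshold: float = 2.5) -> Optional[str]:
    """Run S02-S10 for one utterance and return the executed command, if any."""
    command, certainty, in_terminal = match(recognize_local(audio))  # S02, S03
    if certainty >= local_threshold and in_terminal:                 # S04, S05
        execute(command)                                             # S06
        return command
    server_text = recognize_server(audio)                            # S07, S08
    command, certainty, _ = match(server_text)                       # S09
    if certainty >= server_threshold:                                # S10
        execute(command)                                             # S06
        return command
    return None  # S10; No: the caller may prompt the user to speak again
```

Keeping the two thresholds separate reflects the remark that the S10 threshold may be set lower than the S04 threshold.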

<Second command determination execution process>
FIG. 6 is a flowchart showing another example of the command determination execution process of the terminal device 1. This process starts when the user of the terminal device 1 inputs voice via the voice input device 106. Steps S21 to S29 of this process are the same as steps S01 to S09 of FIG. 5, respectively, so their description is omitted.

After S29, the command matching unit 16 again matches the in-terminal speech recognition result against the command dictionary 17 (S30). The command matching unit 16 then compares the certainty factor of the command calculated in S29 with the certainty factor of the command calculated in S30 and selects the command with the highest certainty factor (S31). The command matching unit 16 identifies the selected command as the command indicated by the voice signal and instructs the command execution unit 18 to execute it. The command execution unit 18 determines whether a command identified by the command matching unit 16 exists (S32). When it is determined that a command exists (S32; Yes), the command execution unit 18 executes that command (S26, command execution step), and the terminal device 1 ends the command determination execution process. When it is determined in S32 that no command exists (S32; No), the command execution unit 18 does not execute any command, and the terminal device 1 ends the command determination execution process. At this time, the terminal device 1 may display a message on the output device 108 prompting the user to input the voice again.

In S29, the command matching unit 16 may match the server speech recognition result only against the group of commands that do not correspond to in-terminal functions, and in S30 it may match the in-terminal speech recognition result only against the group of commands that correspond to in-terminal functions. Also, although the command matching unit 16 selects the command with the highest certainty factor in S31, it may instead select the command with the highest certainty factor from among the commands whose certainty factor is equal to or greater than a threshold. In this case, the command matching unit 16 preferably performs the matching with the threshold for accepting a command set smaller than the threshold in S04.
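The selection in S31, including the optional threshold variant, can be sketched as follows. The candidate list is assumed to hold one `(command, certainty)` pair from the server result (S29) and one from the in-terminal result (S30); the function name and this representation are hypothetical.

```python
from typing import List, Optional, Tuple


def select_command(candidates: List[Tuple[Optional[str], float]],
                   threshold: Optional[float] = None) -> Optional[str]:
    """Pick the candidate command with the highest certainty factor (S31)."""
    eligible = [
        (command, certainty)
        for command, certainty in candidates
        if command is not None and (threshold is None or certainty >= threshold)
    ]
    if not eligible:
        return None  # corresponds to S32; No: nothing is executed
    return max(eligible, key=lambda pair: pair[1])[0]
```

For example, `select_command([("weather", 3.0), ("phone", 2.5)])` picks the server candidate, while passing `threshold=3.5` for the same candidates leaves no command to execute.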

<Third command determination execution process>
FIG. 7 is a flowchart showing another example of the command determination execution process of the terminal device 1. This process starts when the user of the terminal device 1 inputs voice via the voice input device 106. Steps S41 to S47 of this process are the same as steps S01 to S07 of FIG. 5, respectively, so their description is omitted.

In the speech recognition server 2, when the voice reception unit 21 receives the voice signal transmitted in S47, the server speech recognition unit 22 refers to the large vocabulary dictionary 23, performs speech recognition on the received voice signal, and obtains a server speech recognition result. The server speech recognition unit 22 then matches the server speech recognition result against a command dictionary (not shown) provided in the speech recognition server 2. Specifically, the server speech recognition unit 22 refers to the keyword list stored in that command dictionary and determines whether each word contained in the server speech recognition result matches a keyword registered in the keyword list.

This keyword list may contain keywords corresponding to more commands than the keyword list stored in the command dictionary 17. When a word contained in the server speech recognition result matches a keyword, the server speech recognition unit 22 obtains from the keyword list the function associated with that keyword and the score assigned to that function. The server speech recognition unit 22 then calculates the certainty factor of the command by one of the certainty factor calculation methods described above.

Next, the server speech recognition unit 22 determines whether the certainty factor is equal to or greater than a threshold. If it is, that is, if the server speech recognition result is accepted as a command, the server speech recognition result transmission unit 24 of the speech recognition server 2 instructs the terminal device 1 to execute the command whose certainty factor is equal to or greater than the threshold. The server speech recognition result receiving unit 15 receives the instruction from the speech recognition server 2 (S48) and passes it to the command execution unit 18. The command execution unit 18 then executes the instructed command (S46, command execution step), and the terminal device 1 ends the command determination execution process.

In the first to third command determination execution processes described above, the voice signal is transmitted to the speech recognition server 2 (S07, S27, or S47; voice transmission step) after in-terminal speech recognition (S02, S22, or S42; speech recognition step) and command matching of the in-terminal speech recognition result (S03, S23, or S43; command matching step) have completed. However, the transmission may be performed at any time after voice input, either before in-terminal speech recognition and command matching of its result or in parallel with those steps. This shortens the delay until the server speech recognition result is received.
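The parallel variant can be sketched with a worker thread that starts the server request before in-terminal recognition begins. This is only an illustration under stated assumptions: `recognize_locally` and `recognize_on_server` are hypothetical stand-ins that return placeholder strings, not real recognizers or a real network call.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_locally(audio: bytes) -> str:
    # Hypothetical stand-in for in-terminal speech recognition (S02/S22/S42).
    return f"local:{len(audio)}"


def recognize_on_server(audio: bytes) -> str:
    # Hypothetical stand-in for sending the voice signal to the server
    # (S07/S27/S47) and waiting for the server speech recognition result.
    return f"server:{len(audio)}"


def recognize_in_parallel(audio: bytes) -> tuple[str, str]:
    """Start the server request first, then recognize locally while it runs."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        server_future = pool.submit(recognize_on_server, audio)
        local_result = recognize_locally(audio)  # overlaps with the request
        server_result = server_future.result()   # blocks only if still pending
    return local_result, server_result
```

In a real terminal the caller would probably skip `server_future.result()` whenever the in-terminal result is already accepted for an in-terminal function, so that the server round trip costs nothing in the common case.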

Next, the processing from identifying a command based on the input utterance to executing that command is described concretely with reference to FIGS. 4, 8, and 9.

As described above, in the example shown in FIG. 4, the user first inputs utterance (a) "やまださんにでんわをかける" ("call Mr. Yamada on the phone") via the voice input device 106. The voice input unit 11 accepts the voice signal corresponding to utterance (a) and sends it to the speech recognition unit 12 and the voice transmission unit 14. Next, the speech recognition unit 12 performs in-terminal speech recognition with reference to the user dictionary 13 and obtains the in-terminal speech recognition result (b) "山田さんに電話をかける". The speech recognition unit 12 sends the in-terminal speech recognition result (b) to the command matching unit 16.

Next, the command matching unit 16 matches the in-terminal speech recognition result (b) against the command dictionary 17. The command matching unit 16 judges "山田" (Yamada) to be a person's name and adds a score of 0.5 to each of the telephone function and the mail function, which are associated with the keyword "<person name>" in the keyword list. Because "電話" (phone) matches the keyword "電話" in the keyword list, the command matching unit 16 adds a score of 2.0 to the telephone function associated with that keyword. Because "かける" (call) matches the keyword "かける" in the keyword list, it adds a further score of 0.5 to the telephone function associated with that keyword. The command matching unit 16 then calculates the certainty factor from the matching result. Here, the certainty factor is calculated by the second certainty factor calculation method described above, and the threshold is set to 2.0. In this case, the telephone function has the highest score, and its certainty factor is 3.0 - 0.5 = 2.5.

The command matching unit 16 then compares the certainty factor with the threshold to decide whether to accept the command for executing the telephone function. Because the certainty factor is equal to or greater than the threshold, the command matching unit 16 accepts the command. Next, the command matching unit 16 determines whether the telephone function is an in-terminal function. Because it is, the command matching unit 16 identifies the command indicated by utterance (a) as the command for executing the telephone function and instructs the command execution unit 18 to execute it. The command execution unit 18 then executes the command for executing the telephone function.
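The scoring in this example can be reproduced with a short sketch. Several things here are assumptions rather than statements from the source: the keyword table contains only the entries mentioned in the worked examples, the input is assumed to be already split into words, the `PERSON_NAMES` set stands in for however the terminal decides that a word is a person's name, and the rule "highest score minus the runner-up score" is inferred from the arithmetic of the examples (3.0 - 0.5 = 2.5), since the second certainty factor calculation method is defined elsewhere in the description.

```python
from typing import Optional

# Keyword -> {function: score}, limited to entries used in the worked examples.
KEYWORDS: dict[str, dict[str, float]] = {
    "<person name>": {"phone": 0.5, "mail": 0.5},
    "電話": {"phone": 2.0},
    "かける": {"phone": 0.5},
    "買う": {"shopping": 2.0},
}

# Assumption: a simple lookup replaces the real person-name detection.
PERSON_NAMES = {"山田", "笹尾"}


def score_functions(words: list[str]) -> dict[str, float]:
    """Accumulate per-function scores for the words of a recognition result."""
    scores: dict[str, float] = {}
    for word in words:
        key = "<person name>" if word in PERSON_NAMES else word
        for function, points in KEYWORDS.get(key, {}).items():
            scores[function] = scores.get(function, 0.0) + points
    return scores


def certainty(scores: dict[str, float]) -> tuple[Optional[str], float]:
    """Inferred rule: top score minus the second-highest score."""
    if not scores:
        return None, 0.0
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_function, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_function, best_score - runner_up
```

For the FIG. 4 result, `certainty(score_functions(["山田", "さん", "に", "電話", "を", "かける"]))` gives the telephone function with a certainty factor of 2.5, which is at or above the threshold of 2.0.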

FIG. 8 is a diagram showing an example for explaining the command determination execution process in the speech recognition system 10. First, the user inputs utterance (a) "かさをかう" ("buy an umbrella") via the voice input device 106. The voice input unit 11 accepts the voice signal corresponding to utterance (a) and sends it to the speech recognition unit 12 and the voice transmission unit 14. Next, the speech recognition unit 12 performs in-terminal speech recognition with reference to the user dictionary 13 and obtains the in-terminal speech recognition result (b) "笹尾買う" ("Sasao buy", a misrecognition). The speech recognition unit 12 sends the in-terminal speech recognition result (b) to the command matching unit 16.

Next, the command matching unit 16 matches the in-terminal speech recognition result (b) against the command dictionary 17. The command matching unit 16 judges "笹尾" (Sasao) to be a person's name and adds a score of 0.5 to each of the telephone function and the mail function, which are associated with the keyword "<person name>" in the keyword list. Because "買う" (buy) matches a keyword in the keyword list, the command matching unit 16 also adds a score of 2.0 to the shopping function associated with the keyword "買う". The command matching unit 16 then calculates the certainty factor from the matching result. Here, the certainty factor is calculated by the second certainty factor calculation method described above, and the threshold is set to 2.0. In this case, the shopping function has the highest score, and its certainty factor is 2.0 - 0.5 = 1.5.

The command matching unit 16 then compares the certainty factor with the threshold to decide whether to accept the command for executing the shopping function. Because the certainty factor is smaller than the threshold, the command matching unit 16 does not accept the command. Next, the command matching unit 16 instructs the voice transmission unit 14 to transmit the voice signal corresponding to utterance (a) to the speech recognition server 2, and the voice transmission unit 14 does so. The server speech recognition result receiving unit 15 then receives the server speech recognition result (c) "傘を買う" ("buy an umbrella") from the speech recognition server 2 and sends it to the command matching unit 16.

Next, the command matching unit 16 matches the server speech recognition result (c) against the command dictionary 17. Because "買う" (buy) matches a keyword in the keyword list, the command matching unit 16 adds a score of 2.0 to the shopping function associated with the keyword "買う". The command matching unit 16 calculates the certainty factor from the matching result. In this case, the shopping function has the highest score, and its certainty factor is 2.0. The command matching unit 16 then compares the certainty factor with a threshold to decide whether to accept the command for executing the shopping function.

Here, the threshold may be the same as the value used to decide whether the in-terminal speech recognition result is accepted as a command, but a smaller value is preferable. In this example, the threshold is 0.5. Because the certainty factor is equal to or greater than the threshold, the command matching unit 16 accepts the command for executing the shopping function. The command matching unit 16 then identifies the command indicated by utterance (a) as the command for executing the shopping function and instructs the command execution unit 18 to execute it. The command execution unit 18 then executes the command for executing the shopping function.

Note that if the in-terminal speech recognition result (b) were "傘を買う" and the threshold used to decide whether result (b) is accepted as a command were 1.5, the certainty factor based on matching result (b) would be equal to or greater than the threshold, so the command matching unit 16 would accept the command for executing the shopping function. However, because the shopping function is not an in-terminal function, the command matching unit 16 would in this case too instruct the voice transmission unit 14 to transmit the voice signal corresponding to utterance (a) to the speech recognition server 2.
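The two decisions walked through in FIG. 4 and FIG. 8 (whether to fall back to the server, and whether to accept the server result) can be sketched as two small predicates. The names, the function labels, and the contents of `IN_TERMINAL_FUNCTIONS` are assumptions for illustration; the threshold values 2.0 and 0.5 are the ones used in the examples.

```python
# Assumption: labels for functions that count as in-terminal functions.
IN_TERMINAL_FUNCTIONS = {"phone", "mail", "scheduler"}

LOCAL_THRESHOLD = 2.0   # threshold applied to the in-terminal result
SERVER_THRESHOLD = 0.5  # smaller threshold applied to the server result


def needs_server(function: str, certainty: float,
                 local_threshold: float = LOCAL_THRESHOLD) -> bool:
    """True when the voice signal must be sent to the speech recognition server.

    The in-terminal result is used directly only if it is accepted
    (certainty at or above the threshold) AND names an in-terminal function.
    """
    accepted = certainty >= local_threshold
    return not (accepted and function in IN_TERMINAL_FUNCTIONS)


def accept_server_result(certainty: float,
                         threshold: float = SERVER_THRESHOLD) -> bool:
    """True when the command derived from the server result is accepted."""
    return certainty >= threshold
```

With these definitions, `needs_server("phone", 2.5)` is false (FIG. 4), `needs_server("shopping", 1.5)` is true (FIG. 8), and `needs_server("shopping", 2.0, local_threshold=1.5)` is also true, because an accepted command for a function outside the terminal still goes to the server.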

FIG. 9 is a diagram showing another example for explaining the command determination execution process in the speech recognition system 10. First, the user inputs utterance (a) "やまださんにかける" ("call Mr. Yamada") via the voice input device 106. The voice input unit 11 accepts the voice signal corresponding to utterance (a) and sends it to the speech recognition unit 12 and the voice transmission unit 14. Next, the speech recognition unit 12 performs in-terminal speech recognition with reference to the user dictionary 13 and obtains the in-terminal speech recognition result (b) "山田さんにかける". The speech recognition unit 12 sends the in-terminal speech recognition result (b) to the command matching unit 16.

Next, the command matching unit 16 matches the in-terminal speech recognition result (b) against the command dictionary 17. The command matching unit 16 judges "山田" (Yamada) to be a person's name and adds a score of 0.5 to each of the telephone function and the mail function, which are associated with the keyword "<person name>" in the keyword list. Because "かける" (call) matches a keyword in the keyword list, the command matching unit 16 also adds a score of 0.5 to the telephone function associated with the keyword "かける". The command matching unit 16 then calculates the certainty factor from the matching result. Here, the certainty factor is calculated by the second certainty factor calculation method, and the threshold is set to 2.0. In this case, the telephone function has the highest score, and its certainty factor is 1.0 - 0.5 = 0.5.

The command matching unit 16 then compares the certainty factor with the threshold to decide whether to accept the command for executing the telephone function. Because the certainty factor is smaller than the threshold, the command matching unit 16 does not accept the command. Next, the command matching unit 16 instructs the voice transmission unit 14 to transmit the voice signal corresponding to utterance (a) to the speech recognition server 2, and the voice transmission unit 14 does so. The server speech recognition result receiving unit 15 then receives the server speech recognition result (c) "山田さんにかける" from the speech recognition server 2 and sends it to the command matching unit 16.

Next, the command matching unit 16 matches the server speech recognition result (c) against the command dictionary 17. The command matching unit 16 judges "山田" (Yamada) to be a person's name and adds a score of 0.5 to each of the telephone function and the mail function, which are associated with the keyword "<person name>" in the keyword list. Because "かける" (call) matches a keyword in the keyword list, the command matching unit 16 also adds a score of 0.5 to the telephone function associated with the keyword "かける". The command matching unit 16 then calculates the certainty factor from the matching result. In this case, the telephone function has the highest score, and its certainty factor is 1.0 - 0.5 = 0.5. The command matching unit 16 then compares the certainty factor with a threshold to decide whether to accept the command for executing the telephone function.

Here, the threshold is 0.5, a value smaller than the one used to decide whether the in-terminal speech recognition result is accepted as a command. Because the certainty factor is equal to or greater than the threshold, the command matching unit 16 accepts the command for executing the telephone function. The command matching unit 16 then identifies the command indicated by utterance (a) as the command for executing the telephone function and instructs the command execution unit 18 to execute it. The command execution unit 18 then executes the command for executing the telephone function.

As described above, when the user utters, for example, a command for executing an in-terminal function, the utterance is correctly recognized by in-terminal speech recognition and the command is executed without server speech recognition. On the other hand, when the user utters a command intended to obtain information from the network, the command may contain proper nouns such as place names, station names, product names, restaurant names, and app names. In such cases in-terminal speech recognition cannot recognize the utterance correctly, so the utterance is recognized reliably by server speech recognition and the command is executed.

Next, an example in which the command determination execution process of the terminal device 1 is applied to a function/app determination call process is described. In this process, the user speaks to the terminal device 1 in order to call a function in the terminal device 1 (telephone, mail, scheduler, and so on) or an app installed on the terminal device 1. FIGS. 10 and 11 are flowcharts for explaining an example of the function/app determination call process in the speech recognition system 10. This process starts when the user of the terminal device 1 inputs voice via the voice input device 106. Steps S51 and S52 of this process are the same as steps S01 and S02 of FIG. 5, respectively, so their description is omitted.

After S52, the speech recognition unit 12 sends the in-terminal speech recognition result obtained in S52 to the command matching unit 16. Next, the command matching unit 16 matches the in-terminal speech recognition result against the command dictionary 17 (S53). Specifically, the command matching unit 16 refers to the keyword list stored in the command dictionary 17 and determines whether the in-terminal speech recognition result matches a keyword registered in the keyword list (S54). In addition to the commands usable on the terminal device 1, a plurality of apps are registered in the keyword list in advance. The keyword list stores at least information indicating a keyword in association with information indicating a function or an app. Function names and app names are registered as the keywords. The apps registered in the keyword list are not limited to apps installed on the terminal device 1 and include apps that can be installed on the terminal device 1, such as popular apps.

If the in-terminal speech recognition result matches a keyword in S54 (S54; Yes), the command matching unit 16 judges the utterance to be a call for the function or app associated with that keyword and displays "Is (function name or app name) correct? Yes / Other candidates" on the output device 108 (S55). The command matching unit 16 then determines whether the user selected "Yes" (S56). If the user selected "Other candidates" (S56; No), the command matching unit 16 instructs the voice transmission unit 14 to transmit the voice signal received from the voice input unit 11 to the speech recognition server 2. The command matching unit 16 issues the same instruction to the voice transmission unit 14 if the in-terminal speech recognition result does not match any keyword in S54 (S54; No).

The voice transmission unit 14 then transmits the voice signal to the speech recognition server 2 (S57). On receiving the voice signal transmitted in S57, the speech recognition server 2 performs speech recognition and transmits the server speech recognition result to the terminal device 1. The server speech recognition result receiving unit 15 receives the server speech recognition result from the speech recognition server 2 (S58) and sends it to the command matching unit 16. Next, the command matching unit 16 matches the server speech recognition result against the command dictionary 17 (S59). Specifically, the command matching unit 16 refers to the keyword list stored in the command dictionary 17 and determines whether the server speech recognition result matches a keyword registered in the keyword list. The command matching unit 16 also matches the in-terminal speech recognition result against the command dictionary 17 again (S60).

From the matching in S59 and S60, the command matching unit 16 obtains a list of the keywords (function names or app names) that matched the server speech recognition result and the in-terminal speech recognition result. The command matching unit 16 displays the obtained list of function names or app names on the output device 108 as candidates (S61). The command matching unit 16 then determines whether the user selected one of the candidates (S62). If no candidate was selected (S62; No), the terminal device 1 ends the function/app determination call process. In that case, the terminal device 1 may display a message or the like on the output device 108 prompting the user to input the voice again.
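Building the candidate list for S61 can be sketched as an order-preserving merge. This assumes that the displayed list is the union of the keywords matched in S59 and S60, with server matches shown first; the source does not specify the ordering or how duplicates are handled, and `merge_candidates` is a hypothetical name.

```python
def merge_candidates(server_matches: list[str],
                     local_matches: list[str]) -> list[str]:
    """Combine the keywords matched in S59 and S60 into one candidate list."""
    # dict.fromkeys keeps first-seen order and drops repeated names.
    return list(dict.fromkeys([*server_matches, *local_matches]))
```

An empty result corresponds to the case where there is nothing for the user to select in S62.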

On the other hand, if a candidate was selected in S62 (S62; Yes), or if the user selected "Yes" in S56 (S56; Yes), the command matching unit 16 refers to the keyword list stored in the command dictionary 17 and determines whether the selected candidate is a function or an app (S63). If the selected candidate is a function (S63; function), the command matching unit 16 identifies the command for executing that function as the command indicated by the voice signal and instructs the command execution unit 18 to execute it. The command execution unit 18 executes the command for executing the selected function (S64), and the terminal device 1 ends the function/app determination call process.

If the selected candidate is an app (S63; app), the command matching unit 16 determines whether that app is installed on the terminal device 1 (S65). If the selected app is installed (S65; Yes), the command matching unit 16 identifies it as the app indicated by the voice signal and instructs the command execution unit 18 to launch it. The command execution unit 18 launches the selected app (S66), and the terminal device 1 ends the function/app determination call process. If the selected app is not installed (S65; No), the command matching unit 16 displays "Search for the app? Yes / No" on the output device 108 (S67).

The command matching unit 16 then determines whether the user selected "Yes" (S68). If the user selected "Yes" (S68; Yes), the command matching unit 16 searches for the selected app via the communication module 105 in an app market, which is an app search site on the Internet (S69). The command matching unit 16 displays the search results on the output device 108 so that the user can install the app, and the terminal device 1 ends the function/app determination call process. If the user selected "No" (S68; No), the terminal device 1 ends the function/app determination call process.
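The branching from S63 to S69 reduces to a small decision function. The sketch below is illustrative only: the `Action` enum, the `kind` strings, and the boolean inputs are assumptions, and the real process obtains them from the keyword list, the installation state of the terminal, and the user's answer to the prompt.

```python
from enum import Enum


class Action(Enum):
    RUN_FUNCTION = "run_function"    # S64
    LAUNCH_APP = "launch_app"        # S66
    SEARCH_MARKET = "search_market"  # S69
    END = "end"                      # S68; No


def decide_action(kind: str, installed: bool = False,
                  search_confirmed: bool = False) -> Action:
    """Map a selected candidate to the next step of the call process.

    kind: "function" or "app", as recorded in the keyword list (S63).
    installed: whether the app is installed on the terminal (S65).
    search_confirmed: the user's answer to the app search prompt (S68).
    """
    if kind == "function":
        return Action.RUN_FUNCTION
    if installed:
        return Action.LAUNCH_APP
    return Action.SEARCH_MARKET if search_confirmed else Action.END
```

Keeping the decision separate from the side effects (executing, launching, searching) makes each branch of the flowchart easy to check on its own.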

As in the command determination execution process described above, the keyword list may store information indicating a keyword, information indicating a function name or app name, information indicating a score, and information indicating whether the entry is a function or an app, in association with one another. In this case, as in the command determination execution process described above, the in-terminal speech recognition result and the server speech recognition result may be matched by comparing the words contained in the speech recognition result with the keywords registered in the keyword list and calculating the certainty factor by one of the certainty factor calculation methods described above.

Next, the process of identifying an app based on the input utterance is described concretely with reference to FIG. 12. FIG. 12 is a diagram showing an example for explaining the function/app determination call process in the speech recognition system 10.

First, the user inputs utterance (a) "ふらっどいっと" (the app name "Flood-It" as pronounced in Japanese) via the voice input device 106. The voice input unit 11 accepts the voice signal corresponding to utterance (a) and sends it to the speech recognition unit 12 and the voice transmission unit 14. Next, the speech recognition unit 12 performs in-terminal speech recognition with reference to the user dictionary 13 and obtains the in-terminal speech recognition result (b) "風呂糸" (literally "bath thread", a misrecognition). The speech recognition unit 12 sends the in-terminal speech recognition result (b) to the command matching unit 16.

Next, the command matching unit 16 collates the in-terminal speech recognition result (b) with the command dictionary 17. Specifically, the command matching unit 16 determines whether the in-terminal speech recognition result (b) matches any keyword (function name or app name) registered in the keyword list. Because the in-terminal speech recognition result matches no keyword, the command matching unit 16 instructs the voice transmission unit 14 to transmit the voice signal corresponding to utterance (a) to the voice recognition server 2. The voice transmission unit 14 then transmits the voice signal corresponding to utterance (a) to the voice recognition server 2.

The voice recognition server 2 performs speech recognition on the received voice signal using the large vocabulary dictionary 23, in which the names of publicly available apps and the like are registered. The server speech recognition result receiving unit 15 then receives the server speech recognition result (c) "floodit" from the voice recognition server 2 and sends the server speech recognition result (c) to the command matching unit 16.

Next, the command matching unit 16 collates the server speech recognition result (c) with the command dictionary 17. Specifically, the command matching unit 16 determines whether the server speech recognition result (c) matches any keyword (function name or app name) registered in the keyword list. As a result, the server speech recognition result (c) is determined to match the keyword "Flood-It". The command matching unit 16 then displays "Flood-It" as a candidate on the output device 108. After that, the command matching unit 16 determines whether the user has selected a candidate, and determines that "Flood-It" has been selected.

Next, the command matching unit 16 refers to the keyword list to determine whether the selected candidate "Flood-It" is a function or an app. Because the keyword "Flood-It" is registered in the keyword list as an app, the command matching unit 16 determines that it is an app. The command matching unit 16 then determines whether the app "Flood-It" is installed on the terminal device 1. Because the app "Flood-It" is not installed on the terminal device 1, the command matching unit 16 displays on the output device 108 a prompt asking whether to perform an app search.

If the user chooses to perform the app search, the command matching unit 16 searches the app market for the app "Flood-It" via the communication module 105 and displays the search result on the output device 108. When the user then instructs installation of the app, the app "Flood-It" is installed on the terminal device 1. In this way, the user can install a desired app on the terminal device 1 with only a few simple operations: speaking the app name, confirming the speech recognition result, instructing the app search, and instructing the installation.
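The FIG. 12 walkthrough can be condensed into the following minimal sketch. It is an illustration only: the names `Entry`, `normalize`, `lookup` and `discriminate_and_call` are hypothetical, the normalization that lets "floodit" match "Flood-It" is an assumption (the description does not say how the two strings are compared), and the user's confirmation of the displayed candidate is omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    name: str  # function name or app name shown to the user
    kind: str  # "function" or "app"

def normalize(text):
    # Assumed normalization: ignore case, spaces and hyphens.
    return "".join(ch for ch in text.lower() if ch.isalnum())

def lookup(result, keyword_list):
    return keyword_list.get(normalize(result))

def discriminate_and_call(local_result, fetch_server_result, keyword_list, installed):
    # 1. Try the in-terminal speech recognition result first.
    entry = lookup(local_result, keyword_list)
    # 2. Fall back to the server speech recognition result if nothing matched.
    if entry is None:
        entry = lookup(fetch_server_result(), keyword_list)
    if entry is None:
        return "error: no matching function or app"
    # 3. Functions are run directly; apps are launched only if installed.
    if entry.kind == "function":
        return f"run function {entry.name}"
    if entry.name in installed:
        return f"launch app {entry.name}"
    return f"offer app-market search for {entry.name}"
```

With a keyword list containing only the app "Flood-It", the in-terminal result 「風呂糸」, a server result of "floodit" and no installed apps, the sketch ends at the app-market search prompt, as in FIG. 12.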

Next, the operation and effects of the terminal device 1 are described. In the terminal device 1, the voice input unit 11 receives input of a voice signal, and the voice recognition unit 12 performs speech recognition on the voice signal. The voice transmission unit 14 transmits the voice signal to the voice recognition server 2, and the server speech recognition result receiving unit 15 receives the server speech recognition result. The command matching unit 16 collates the in-terminal speech recognition result with the command dictionary 17, determines which of the in-terminal speech recognition result and the server speech recognition result to use, and identifies the command indicated by the voice signal based on the chosen speech recognition result. The command execution unit 18 then executes the command identified by the command matching unit 16.

Thus, for example, when the in-terminal speech recognition result is accepted as a command, the command is executed using the in-terminal speech recognition result, and when the in-terminal speech recognition result cannot be accepted as a command, the server speech recognition result can be used. That is, when a command recognizable with the in-terminal vocabulary is spoken, the terminal can respond quickly by using the in-terminal speech recognition result; when a command that is not recognizable with the in-terminal vocabulary is spoken, the spoken command can be reliably recognized and executed by using the server speech recognition result.

The command matching unit 16 also collates the in-terminal speech recognition result with the command dictionary 17 to calculate a certainty factor and, when the certainty factor is equal to or greater than a predetermined threshold, determines use of the in-terminal speech recognition result and identifies the command whose certainty factor is equal to or greater than the threshold as the command indicated by the voice signal.

Thus, even when ambient noise or the like has lowered the reliability of the in-terminal speech recognition result, the in-terminal speech recognition result can still be used if the speech was in fact recognized correctly. As a result, a command recognizable with the in-terminal vocabulary can be executed quickly even when the reliability of the in-terminal speech recognition result is low.

The command dictionary 17 also stores a keyword list in which, for each of a plurality of commands, a plurality of keywords and scores associated with those keywords are registered. For each word included in the in-terminal speech recognition result, the command matching unit 16 determines whether the word corresponds to any of the keywords registered in the keyword list, and calculates the certainty factor based on the command and score associated with the corresponding keyword. This makes it possible to determine whether the command is recognizable with the in-terminal vocabulary, so the command indicated by the voice signal can be recognized more reliably.

It is also preferable that, for each word included in the in-terminal speech recognition result, the command matching unit 16 determines whether the word corresponds to any of the keywords registered in the keyword list, and calculates the certainty factor based on the command and score associated with the corresponding keyword together with the speech recognition reliability of the word. This makes it possible to determine whether the command is recognizable with the in-terminal vocabulary, so the command indicated by the voice signal can be recognized more reliably.
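As a rough illustration of a keyword-based certainty factor, the sketch below accumulates, per command, the score of each matched keyword weighted by the recognition reliability of the word. The summation rule, the sample keywords and the name `certainty` are assumptions made for this example; the certainty factor calculation methods defined earlier in the description may differ.

```python
from collections import defaultdict

# Hypothetical keyword list: keyword -> (command, score)
KEYWORDS = {
    "mail": ("send_mail", 1.0),
    "send": ("send_mail", 0.5),
    "call": ("make_call", 1.0),
}

def certainty(words, keywords=KEYWORDS):
    """words: list of (word, reliability) pairs, reliability in [0, 1].

    Returns a dict mapping each matched command to its certainty factor.
    """
    totals = defaultdict(float)
    for word, reliability in words:
        hit = keywords.get(word)
        if hit is not None:
            command, score = hit
            # Assumed rule: weight the keyword score by the word's reliability.
            totals[command] += score * reliability
    return dict(totals)
```

Passing a reliability of 1.0 for every word reduces this to the score-only variant, in which the certainty factor depends only on the matched commands and scores.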

In addition, when the command whose certainty factor is equal to or greater than the threshold is a command instructing execution of an in-terminal function, the command matching unit 16 determines use of the in-terminal speech recognition result; when that command is not a command instructing execution of an in-terminal function, the command matching unit 16 determines use of the server speech recognition result. As a result, commands instructing execution of in-terminal functions can be executed quickly using the in-terminal speech recognition result, and other commands can be executed reliably using the server speech recognition result.
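The selection rule just described can be sketched as follows. The function name `choose_result`, the tuple return value and the choice to consider only the highest-certainty command are illustrative assumptions, not details taken from the description.

```python
def choose_result(certainties, threshold, in_terminal_commands):
    """Decide which recognition result to use.

    certainties: dict mapping command -> certainty factor
    in_terminal_commands: set of commands that run in-terminal functions
    Returns ("local", command) or ("server", None).
    """
    if not certainties:
        return ("server", None)
    # Assumption: only the best-scoring command is considered.
    command, value = max(certainties.items(), key=lambda kv: kv[1])
    if value >= threshold and command in in_terminal_commands:
        return ("local", command)
    return ("server", None)
```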

The voice transmission unit 14 transmits the voice signal to the voice recognition server 2 before the voice recognition unit 12 obtains the in-terminal speech recognition result. This allows the server speech recognition result to be received sooner, so that the command can be executed quickly when use of the server speech recognition result is determined.

The server speech recognition result receiving unit 15 discards the server speech recognition result if it receives that result after the command matching unit 16 has identified a command based on the in-terminal speech recognition result. A command can therefore be identified based on the in-terminal speech recognition result without waiting for the server speech recognition result, so a command recognizable with the in-terminal vocabulary can be executed quickly.

As described above, the terminal device 1 can use in-terminal speech recognition and network-based speech recognition using the voice recognition server 2 in cooperation. In-terminal speech recognition can respond quickly because it requires no communication over the network NW, but it may fail to recognize speech accurately because the user dictionary 13 contains only a small vocabulary. Network-based speech recognition, on the other hand, is more accurate than in-terminal speech recognition because the large vocabulary dictionary 23 contains a larger vocabulary than the user dictionary 13, but its response may be slowed by communication delay and the like because it communicates over the network NW. Therefore, when a voice command recognizable with the recognition vocabulary in the terminal device 1 (including user-dependent vocabulary) is input to the terminal device 1, in-terminal speech recognition provides a fast response; when a voice command that is not recognizable with the recognition vocabulary in the terminal device 1 is input, network-based speech recognition using the voice recognition server 2 provides large-vocabulary, high-accuracy speech recognition.

In this embodiment, the terminal device 1 has been described as an example of the device, but the invention is not limited to this; for example, it may be configured as program modules for executing the functions of the terminal device 1. That is, a voice recognition program comprising a voice input module corresponding to the voice input unit 11, a voice recognition module corresponding to the voice recognition unit 12, a voice transmission module corresponding to the voice transmission unit 14, a server speech recognition result receiving module corresponding to the server speech recognition result receiving unit 15, a command matching module corresponding to the command matching unit 16, and a command execution module corresponding to the command execution unit 18 can realize functions equivalent to those of the terminal device 1 described above when it is loaded into a computer system such as a mobile terminal. The voice recognition program is provided, for example, stored on a storage medium such as a flexible disk, CD-ROM, DVD, or ROM, or in a semiconductor memory. The voice recognition program may also be provided via a network as a computer data signal superimposed on a carrier wave.

The voice transmission unit 14 may transmit an uncompressed or compressed voice waveform to the voice recognition server 2 as the voice signal, or it may transmit feature values used for speech recognition to the voice recognition server 2.

After receiving the voice signal from the voice input unit 11, the voice transmission unit 14 may also transmit the voice signal to the voice recognition server 2 before it is decided whether to use the in-terminal speech recognition result. In this case, server speech recognition can be performed in the voice recognition server 2 in parallel with the in-terminal speech recognition by the voice recognition unit 12 and the command matching of the in-terminal speech recognition result by the command matching unit 16, so the server speech recognition result can be obtained sooner. This shortens the processing time when the command matching unit 16 decides to use the server speech recognition result. If the command matching unit 16 obtains the server speech recognition result before the in-terminal speech recognition result, it is desirable that it wait for the in-terminal speech recognition result before deciding which result to use, but it may instead give priority to the server speech recognition result.

Suppose also that the voice transmission unit 14 transmits the voice signal to the voice recognition server 2 before it is decided whether to use the in-terminal speech recognition result, and that the command matching unit 16 decides to use the in-terminal speech recognition result before the server speech recognition result receiving unit 15 receives the server speech recognition result from the voice recognition server 2. In that case, the command matching unit 16 may instruct the voice transmission unit 14 to send the voice recognition server 2 a signal for canceling the speech recognition process in the voice recognition server 2, and the voice transmission unit 14 may send that cancel signal to the voice recognition server 2 in accordance with the instruction. In this way, when a command recognizable by in-terminal speech recognition is input, the decision to use the in-terminal speech recognition result is made without waiting for the server speech recognition result from the voice recognition server 2, so the command execution unit 18 can execute the command quickly.
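The parallel operation and cancellation described above can be illustrated with Python's `asyncio`. This is a sketch under assumptions: `recognize` and its three callable parameters are hypothetical names, and cancelling the local task merely stands in for the cancel signal that the voice transmission unit 14 would send to the voice recognition server 2.

```python
import asyncio

async def recognize(local_recognize, server_recognize, accept_local):
    """Run server and in-terminal recognition in parallel.

    local_recognize / server_recognize: coroutine functions returning text.
    accept_local: predicate deciding whether the local result is a command.
    """
    # Start the server request first so that it overlaps local recognition.
    server_task = asyncio.create_task(server_recognize())
    local_result = await local_recognize()
    if accept_local(local_result):
        # Stand-in for the cancel signal; a late server result is never used.
        server_task.cancel()
        try:
            await server_task
        except asyncio.CancelledError:
            pass
        return ("local", local_result)
    # The local result was not accepted: wait for the server result.
    return ("server", await server_task)
```

Starting the server request before local recognition finishes means the fallback path pays only the remaining server latency rather than the full round trip.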

The certainty factor threshold is not limited to a fixed value; it may instead be the number of words included in the in-terminal speech recognition result multiplied by a predetermined ratio (for example, 0.2). In this way the threshold changes dynamically with the number of words, which makes it possible to identify the command more accurately.
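The word-count-dependent threshold amounts to a single multiplication, sketched below. The function names are hypothetical; the default ratio of 0.2 is the example value given in the text.

```python
def dynamic_threshold(word_count, ratio=0.2):
    # Threshold grows in proportion to the number of recognized words.
    return word_count * ratio

def is_accepted(certainty_value, word_count, ratio=0.2):
    # A command is accepted when its certainty factor reaches the threshold.
    return certainty_value >= dynamic_threshold(word_count, ratio)
```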

The command matching unit 16 is also not limited to identifying the command only when the accepted command corresponds to an in-terminal function; for example, it may identify the command when the accepted command corresponds to a function that uses only a limited set of words, that is, a function that does not require large-vocabulary recognition by the voice recognition server 2.

The command dictionary 17 may also store a keyword list that associates keywords such as function names with functions and has no scores. In this case, the command matching unit 16 may determine whether a word or substring included in the speech recognition result matches a keyword registered in the keyword list and, if it matches any keyword, identify the command for executing the function associated with that keyword as the command indicated by the voice signal.
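A scoreless keyword list reduces command identification to a substring test, as in this minimal sketch. The table contents and the name `identify_function` are made up for the example, and returning the first match in insertion order is an assumption, since the text does not say how multiple matches are resolved.

```python
# Hypothetical scoreless keyword list: keyword -> function
KEYWORD_TO_FUNCTION = {
    "alarm": "open_alarm",
    "camera": "open_camera",
}

def identify_function(result, table=KEYWORD_TO_FUNCTION):
    """Return the function for the first keyword found in the result."""
    for keyword, function in table.items():
        # Substring match: the keyword may be a whole word or part of one.
        if keyword in result:
            return function
    return None
```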

Description of symbols: 1: terminal device; 2: voice recognition server; 10: voice recognition system; 11: voice input unit (voice input means); 12: voice recognition unit (voice recognition means); 13: user dictionary; 14: voice transmission unit (voice transmitting means); 15: server speech recognition result receiving unit (server speech recognition result receiving means); 16: command matching unit (command matching means); 17: command dictionary; 18: command execution unit (command execution means); 21: voice receiving unit (voice receiving means); 22: server voice recognition unit (server voice recognition means); 23: large vocabulary dictionary (server dictionary); 24: server speech recognition result transmitting unit (server speech recognition result transmitting means).

Claims

A terminal device comprising:
voice input means for receiving an input of a voice signal;
voice recognition means for performing voice recognition on the voice signal received by the voice input means;
voice transmitting means for transmitting the voice signal to a voice recognition server;
server speech recognition result receiving means for receiving a server speech recognition result, which is a speech recognition result for the voice signal by the voice recognition server;
a command dictionary in which a plurality of commands are registered;
command matching means for collating an in-terminal speech recognition result, which is the speech recognition result recognized by the voice recognition means, with the command dictionary, determining which of the in-terminal speech recognition result and the server speech recognition result to use, and identifying a command indicated by the voice signal based on the determined speech recognition result; and
command execution means for executing the command identified by the command matching means.

The terminal device according to claim 1, wherein the command matching means collates the in-terminal speech recognition result with the command dictionary, calculates a certainty factor indicating the likelihood that the result corresponds to a command, and, when the certainty factor is equal to or greater than a predetermined threshold, determines use of the in-terminal speech recognition result and identifies the command whose certainty factor is equal to or greater than the threshold as the command indicated by the voice signal.

The terminal device according to claim 2, wherein the command dictionary stores a keyword list in which, for each of the plurality of commands, a plurality of keywords and scores associated with the plurality of keywords are registered, and
the command matching means determines, for each word included in the in-terminal speech recognition result, whether the word corresponds to any of the plurality of keywords registered in the keyword list, and calculates the certainty factor based on the command and score associated with the corresponding keyword.

The terminal device according to claim 3, wherein the command matching means determines, for each word included in the in-terminal speech recognition result, whether the word corresponds to any of the plurality of keywords registered in the keyword list, and calculates the certainty factor based on the command and score associated with the corresponding keyword and on the voice recognition reliability of the word.

When the command having the certainty factor equal to or greater than the threshold is a command instructing execution of an in-terminal function, the command matching means determines use of the in-terminal speech recognition result, and when the command having the certainty factor equal to or greater than the threshold is a command other than a command instructing execution of an in-terminal function, the command matching means determines use of the server speech recognition result. The terminal device described in 1.

6. The terminal device according to claim 1, wherein the voice transmitting means transmits the voice signal to the voice recognition server before the voice recognition means obtains the in-terminal speech recognition result.

The terminal device according to claim 6, wherein the server speech recognition result receiving means discards the server speech recognition result when it receives the server speech recognition result after the command matching means has identified a command based on the in-terminal speech recognition result.

A voice recognition program comprising:
a voice input module that receives an input of a voice signal;
a voice recognition module that performs voice recognition on the voice signal received by the voice input module;
a voice transmission module that transmits the voice signal to a voice recognition server;
a server speech recognition result receiving module that receives a server speech recognition result, which is a speech recognition result for the voice signal by the voice recognition server;
a command matching module that collates an in-terminal speech recognition result, which is the speech recognition result recognized by the voice recognition module, with a command dictionary in which a plurality of commands are registered, determines which of the in-terminal speech recognition result and the server speech recognition result to use, and identifies a command indicated by the voice signal based on the determined speech recognition result; and
a command execution module that executes the command identified by the command matching module.

9. The voice recognition program according to claim 8, wherein the command matching module collates the in-terminal speech recognition result with the command dictionary, calculates a certainty factor indicating the likelihood that the result corresponds to a command, and, when the certainty factor is equal to or greater than a predetermined threshold, determines use of the in-terminal speech recognition result and identifies the command whose certainty factor is equal to or greater than the threshold as the command indicated by the voice signal.

The voice recognition program according to claim 9, wherein the command dictionary stores a keyword list in which, for each of the plurality of commands, a plurality of keywords and a score associated with each of the plurality of keywords are registered, and
the command matching module determines, for each word included in the in-terminal speech recognition result, whether the word corresponds to any of the plurality of keywords registered in the keyword list, and calculates the certainty factor based on the command and score associated with the corresponding keyword.

The voice recognition program according to claim 10, wherein the command matching module determines, for each word included in the in-terminal speech recognition result, whether the word corresponds to any of the plurality of keywords registered in the keyword list, and calculates the certainty factor based on the command and score associated with the corresponding keyword and on the voice recognition reliability of the word.

When the command having the certainty factor equal to or greater than the threshold is a command instructing execution of an in-terminal function, the command matching module determines use of the in-terminal speech recognition result, and when the command having the certainty factor equal to or greater than the threshold is a command other than a command instructing execution of an in-terminal function, the command matching module determines use of the server speech recognition result. The speech recognition program described in 1.

The voice transmission module transmits the voice signal to the voice recognition server before the voice recognition result is obtained by the voice recognition module. The described voice recognition program.

The voice recognition program according to claim 13, wherein the server speech recognition result receiving module discards the server speech recognition result when it receives the server speech recognition result after the command matching module has identified a command based on the in-terminal speech recognition result.

A voice recognition method comprising:
a voice input step of receiving an input of a voice signal;
a voice recognition step of performing voice recognition on the voice signal received in the voice input step;
a voice transmission step of transmitting the voice signal to a voice recognition server;
a server speech recognition result receiving step of receiving a server speech recognition result, which is a speech recognition result for the voice signal by the voice recognition server;
a command matching step of collating an in-terminal speech recognition result, which is the speech recognition result recognized in the voice recognition step, with a command dictionary in which a plurality of commands are registered, determining which of the in-terminal speech recognition result and the server speech recognition result to use, and identifying a command indicated by the voice signal based on the determined speech recognition result; and
a command execution step of executing the command identified in the command matching step.

A voice recognition system comprising a terminal device and a voice recognition server, wherein
the terminal device comprises:
voice input means for receiving an input of a voice signal;
voice recognition means for performing voice recognition on the voice signal received by the voice input means;
a command dictionary in which a plurality of commands are registered;
voice transmitting means for transmitting the voice signal to the voice recognition server;
server speech recognition result receiving means for receiving a server speech recognition result, which is a speech recognition result for the voice signal by the voice recognition server;
command matching means for collating an in-terminal speech recognition result, which is the speech recognition result recognized by the voice recognition means, with the command dictionary, determining which of the in-terminal speech recognition result and the server speech recognition result to use, and identifying a command indicated by the voice signal based on the determined speech recognition result; and
command execution means for executing the command identified by the command matching means,
and
the voice recognition server comprises:
voice receiving means for receiving the voice signal transmitted from the voice transmitting means;
a server dictionary containing a larger vocabulary than the terminal device;
server voice recognition means for performing voice recognition on the voice signal received by the voice receiving means based on the server dictionary; and
server speech recognition result transmitting means for transmitting the server speech recognition result to the terminal device.