JP2020101822A

JP2020101822A - Information providing method using voice recognition function, and control method of instrument

Info

Publication number: JP2020101822A
Application number: JP2020033526A
Authority: JP
Inventors: 育規石井; Yasunori Ishii; 良宏小島; Yoshihiro Kojima
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2014-05-13
Filing date: 2020-02-28
Publication date: 2020-07-02
Also published as: JP2019056913A; JP6670364B2

Abstract

To provide a new information providing method using voice recognition processing. SOLUTION: An information providing method is used in an information providing system that is connected to a display device having a display and to a voice input device capable of receiving a user's voice, and that provides information via the display device in response to the user's voice. Display screen information, which causes the display of the display device to show a display screen including a plurality of selectable items, is transmitted to the display device. Item selection information indicating that one of the plurality of items has been selected on the display screen is received. When a voice instruction including first voice information expressing instruction content is received from the voice input device while the one item is selected, the instruction content is recognized from the first voice information, and it is determined whether the voice instruction includes second voice information indicating an instruction word. When it is determined that the voice instruction includes the second voice information, the instruction content is executed for the one item. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to an information providing method and a device control method that use a voice recognition function.

Conventionally, there are apparatuses that control a device by receiving voice through a microphone (hereinafter sometimes referred to as a "mic"), recognizing the received voice, and interpreting its content. Such a microphone may be connected to the device, or may be built into an input device supplied with the device (for example, a remote controller (hereinafter sometimes referred to as a "remote control")). Controlling a device by voice can offer users unprecedented convenience, such as turning the power on and off or controlling devices collectively.

Control commands for a device include commands that are well suited to input by voice recognition and commands that are not. Device control using a multimodal input method that combines voice with an input device such as a remote control is therefore desirable. Patent Document 1 discloses a device control method that combines a remote control and voice recognition.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2004-260544

The device control method using the voice recognition function described above required further improvement for practical use.

To solve the above problem, a method according to one aspect of the present invention is an information providing method in an information providing system that is connected to a display device having a display and to a voice input device capable of receiving a user's voice, and that provides information via the display device in response to the user's voice. The method includes: transmitting, to the display device, display screen information that causes the display of the display device to show a display screen including a plurality of selectable items; receiving item selection information indicating that one of the plurality of items has been selected on the display screen; when a voice instruction including first voice information representing instruction content is received from the voice input device while the one item is selected, recognizing the instruction content from the first voice information; determining whether the voice instruction includes second voice information representing an instruction word; and, when it is determined that the voice instruction includes the second voice information, executing the instruction content for the one item.

A method according to another aspect of the present invention is a method for controlling a display device that has a display and is connected to a voice input device capable of receiving a user's voice. The method causes a computer of the display device to: display, on the display, a display screen including a plurality of selectable items; detect that one of the plurality of items has been selected on the display screen; when a voice instruction including first voice information representing instruction content is received from the voice input device while selection of the one item is detected, recognize the instruction content from the first voice information and execute the instruction content; and, when selection of the one item is not detected or when it is determined that the instruction content cannot be executed, transmit the voice instruction to another computer.

According to each of the above aspects, accesses between the server and the client that occur during device control are reduced, which improves operability.

Each of the above aspects can be implemented using a system, an apparatus, or a computer program, or by a combination of a system, an apparatus, and a computer program.

The above aspects achieve further improvement.

A sequence diagram showing an outline of processing in exemplary Embodiment 1.
A diagram showing the configuration of an information presentation method using a voice recognition function according to exemplary Embodiment 1.
A diagram showing a first sequence of communication processing between the server and the client in exemplary Embodiment 1.
A diagram showing processing on the server in exemplary Embodiment 1.
A diagram showing processing on the client in exemplary Embodiment 1.
A diagram showing an example of specifying a position on a map.
A first diagram showing an example of specifying the position of a person on a screen.
A second diagram showing an example of specifying the position of a person on a screen.
A first diagram showing an example of a search based on a position on a map.
A second diagram showing an example of a search based on a position on a map.
A diagram showing a second sequence of communication processing between the server and the client in exemplary Embodiment 1.
A first sequence diagram showing an outline of processing in exemplary Embodiment 1.
A second sequence diagram showing an outline of processing in exemplary Embodiment 1.
A diagram showing the configuration of an information presentation method using a voice recognition function according to exemplary Embodiment 2.
A diagram showing a sequence of communication processing between the server and the client in exemplary Embodiment 2.
A diagram showing processing on the server in exemplary Embodiment 2.
A diagram showing processing on the client in exemplary Embodiment 2.
A diagram showing an example of displaying program content from a recommended program list.
A diagram showing the configuration of a conventional information presentation method using a voice recognition function.

(Findings underlying the present invention)
The findings underlying the present invention are as follows.

The inventors of the present application considered that further improvement was necessary for the practical application of an apparatus that controls a device by receiving voice through a microphone, recognizing the received voice, and interpreting its content.

When a device is controlled by voice, a plurality of control commands can be assigned to a single voice command so that the device can be controlled with simple words. This has the advantage that even a user unfamiliar with a remote control that has many buttons can control the device with natural speech.

On the other hand, performing every operation by voice impairs operability for the user. This is explained below using a television (TV) as an example.

FIG. 16 is a diagram showing an example of a screen displayed on a television. For example, suppose that the voice command "recommended program list" causes a program list 901 to be displayed on the screen, as shown in FIG. 16. This is a function in which, for example, the TV receives the user's voice via a remote control 902, recognizes the words "recommended program list" (that is, the voice), and interprets their content, so that the device (that is, the TV) presents recommended programs tailored to the user. To specify a program here, the user utters cursor movement commands such as "down" or "up".

When many recommended programs are displayed, the number of programs shown on one screen becomes large, and the display may extend over multiple pages. In such a case, to specify a program the user must utter many cursor movement commands such as "down", "up", "next page", or "previous page". Repeated voice input increases the likelihood of recognition errors. A method that requires uttering the same words over and over can hardly be called easy to use.

To address this problem, Patent Document 1, for example, discloses a voice recognition method that makes it easy to operate a television by combining a remote control with voice recognition.

In this conventional method, when a recommended program list is displayed by a voice command as described above, the user first specifies a program with the remote control. The user then inputs speech consisting of a pair of a demonstrative pronoun (sometimes referred to as an "instruction word" or "instruction character string") and words for controlling the specified program (that is, the instruction content), thereby controlling the program specified with the remote control. For example, when the user specifies a program with the remote control 902 while the program list 901 is displayed, the screen changes to a screen state 903 that shows that the program has been selected. When the user then utters "display the content of that", the content of the program specified with the remote control is displayed, as in a program content display screen 904. In this example, "that" corresponds to the instruction word, and "display the content" corresponds to the instruction content. In this specification, voice information representing the instruction content is sometimes referred to as "first voice information", and voice information representing the instruction word as "second voice information".
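The split of an utterance into an instruction word and instruction content can be sketched as follows. This is only a minimal illustration that assumes the speech has already been recognized as text. The vocabulary in `DEMONSTRATIVES` and `INSTRUCTIONS`, the command identifiers, and the name `parse_instruction` are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabulary; the specification gives only "that" as an example.
DEMONSTRATIVES = ("that", "this")
# Hypothetical mapping from recognized phrases to command identifiers.
INSTRUCTIONS = {"show contents": "SHOW_CONTENTS", "record": "RECORD"}


@dataclass
class ParsedInstruction:
    demonstrative: Optional[str]  # second voice information, if present
    command: Optional[str]        # first voice information, if recognized


def parse_instruction(text: str) -> ParsedInstruction:
    """Separate the instruction word from the instruction content."""
    words = text.lower().split()
    # The first demonstrative found plays the role of the instruction word.
    demonstrative = next((w for w in words if w in DEMONSTRATIVES), None)
    # Whatever remains is looked up as the instruction content.
    remainder = " ".join(w for w in words if w not in DEMONSTRATIVES)
    return ParsedInstruction(demonstrative, INSTRUCTIONS.get(remainder))
```

With this toy vocabulary, `parse_instruction("record that")` yields both an instruction word and a command, while an utterance outside the vocabulary yields a command of `None`.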

FIG. 17 shows a configuration example of a program information presentation device 1000 that implements the conventional voice recognition method described in Patent Document 1. In FIG. 17, voice is input through a microphone 1001, and a voice recognition unit 1002 performs voice recognition. An instruction character string detection unit 1003 extracts an instruction character string from the voice recognition result. A voice synthesis unit 1004 generates synthesized speech for responding to the user by voice. A control signal generation process 1005 generates a signal for controlling the device. An input device 1006 consists of a mouse, a touch panel, a keyboard, a remote controller, or the like. When information on a plurality of programs is displayed, the user uses the input device 1006 to select one of them. The input device 1006 receives information on the selected position when the user selects one program from the plurality of programs displayed on the screen. An output unit 1007 performs output processing such as displaying the selected program, controlling the device based on the signal generated by the control signal generation process, displaying the control result, and playing back the synthesized speech generated by the voice synthesis process.

When voice commands are used in place of the buttons on a remote control, the number and variety of words uttered are limited by the number of buttons. It is therefore sufficient to register in advance, as a recognition dictionary, the names printed on the remote control buttons or the voice commands corresponding to those buttons. For each word registered in the dictionary, speech from many people of different ages and genders is collected to build acoustic and language models for voice recognition. Measures such as manually customizing the recognition dictionary or models are sometimes taken to reduce misrecognition.

However, now that home appliances are connected to networks outside the home, it has become possible to obtain program information from the web and to perform web searches using the TV screen. In this case, words unrelated to TV may be input, so it is difficult to know in advance what words will be input. In other words, acoustic and language models specialized for a predetermined set of words cannot be prepared. As a result, voice recognition accuracy decreases, and it becomes difficult for the user to input the desired vocabulary by voice.

To recognize words other than those printed on the remote control with high accuracy, a voice recognition model must be built from a large-scale data set. Building a statistical voice recognition model from a large-scale data set makes it possible to recognize with high accuracy even words that are not known in advance. Because voice recognition based on a statistical model requires substantial memory and computation, it is executed on a server computer (hereinafter sometimes simply referred to as a "server") connected to the device via a network.

In the technique disclosed in Patent Document 1, the device to be controlled and the voice recognition processing unit are integrated. A voice recognition dictionary can therefore be prepared in advance for the items printed on the remote control that controls the device. For free utterances such as web searches, however, voice recognition accuracy was low. Users therefore often found the system difficult to use, and the range of use of voice recognition had to be restricted.

In view of the above, it is desirable in practice for a server to perform voice recognition on the voice signal received by the device. However, when voice recognition is performed over a network, the time from transmitting the voice signal until a response is returned is long. That is, there is a problem of processing delay.

As an example of a system in which this problem arises, consider a system that performs voice recognition, detects an instruction character string in the recognition result, and then returns a voice response or a control signal according to the detection result. When voice recognition is executed on a server, the whole series of processes (voice recognition, instruction character string detection, voice response based on the recognition result, and device control) is performed on the server. In this case, each time an instruction character string is detected in the voice recognition result, the server accesses the device, which is the client, to ask which item the instruction character string (for example, "that") refers to. Subsequent processing cannot proceed until this communication between the server and the client has finished, so a processing delay can occur. Such a system needs to reduce the processing delay caused by the server accessing the client every time an instruction character string is detected. However, no technical solution for meeting this need had been studied.

One aspect of a device control method using the voice recognition function for solving this problem includes: an input process that receives input from a user; a selection state detection process that detects whether part of the screen has been designated by the input process; a selection information detection process that acquires internal information on the on-screen position of the one selected item; an output process that returns a response to the user; a communication process that communicates with an external apparatus; a voice input process that inputs voice; a voice recognition process that recognizes the voice; an instruction character string detection process that detects an instruction character string based on the voice recognition result; and a selection state management process that manages the state of item selection by the user. A server different from the device to be controlled is caused to execute the voice input process, the voice recognition process, the instruction character string detection process, and the selection state management process. Each time the selection state detection process detects that the selection state has changed, the state held by the selection state management process is updated, and the instruction character string detection process acquires the selection information detected by the selection information detection process only when the updated state is the selected state.

Through the selection state management process, the server holds information on whether an item (for example, an item representing a program) is currently selected with the input device. When voice recognition is performed on the server, the server can therefore decide, according to the state it holds, whether to access the client. As a result, processing delay can be reduced.
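The effect of holding the selection state on the server can be illustrated with a small sketch. It assumes that the client reports every change of selection state and that a callable stands in for the server-to-client query. The names `SelectionStateManager`, `resolve_target`, and `fetch_selection` are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SelectionStateManager:
    """Server-side record of whether the client currently has an item selected."""
    selected: bool = False

    def update(self, selected: bool) -> None:
        # Called each time the client reports a change in selection state.
        self.selected = selected


def resolve_target(manager: SelectionStateManager,
                   has_instruction_word: bool,
                   fetch_selection: Callable[[], str]) -> Optional[str]:
    """Ask the client for the selected item only when the held state is 'selected'."""
    if has_instruction_word and manager.selected:
        return fetch_selection()  # one server-to-client round trip
    return None                   # no round trip is made
```

In this sketch the round trip is skipped whenever the server already knows that nothing is selected, which is the source of the reduced delay described above.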

The device control method described above may further include a dialogue management process and a response sentence generation process, and may control the device through interactive processing with the user.

The device control method described above may further include a voice synthesis process and a control signal generation process, and, when the output process returns a response to the user, may respond with synthesized speech or by controlling the device with the generated control signal.

The selection state management process may manage only the state of whether part of the screen has been selected by the input process.

In addition to the state of whether part of the screen has been selected by the input process, the selection state management process may also manage internal information corresponding to the selected location.

The input process may designate either metadata about a television program or content of a television program.

The metadata about the television program may be any of a program name, a channel name, content, a degree of attention, and a degree of recommendation.

The content of the television program may include any of a person, an animal, a car, a map, characters, and numbers.

Further, one aspect of an information providing method for solving the above problem is an information providing method in an information providing system that is connected to a display device having a display and to a voice input device capable of receiving a user's voice, and that provides information via the display device in response to the user's voice. The method includes: transmitting, to the display device, display screen information that causes the display of the display device to show a display screen including a plurality of selectable items; receiving item selection information indicating that one of the plurality of items has been selected on the display screen; when a voice instruction including first voice information representing instruction content is received from the voice input device while the one item is selected, recognizing the instruction content from the first voice information; determining whether the voice instruction includes second voice information representing an instruction word; and, when it is determined that the voice instruction includes the second voice information, executing the instruction content for the one item.

The instruction content may be an instruction to search for information related to the one item, and the user may be notified of a search result based on the instruction content.

Search result information that causes the search result to be shown on the display may be transmitted to the display device.

The information providing system may further be connected to a voice output device capable of outputting voice, and search result information that causes the voice output device to output the search result as voice may be transmitted to the voice output device.

The plurality of items may be items representing metadata about a television program or content of a television program.

The metadata may indicate at least one of a television program name, a channel name, an outline of the television program, a degree of attention of the television program, and a degree of recommendation of the television program.

The content of the television program may include information indicating at least one of a person, an animal, a car, a map, characters, and numbers.

The display screen may represent a map of a specific area, and each of the plurality of items may be arbitrary coordinates on the map or an object on the map.

The object may represent a building on the map.

The object may represent a road on the map.

The object may represent a place name on the map.

Another aspect of the device control method using the voice recognition function of the present disclosure includes: an input process that receives input from a user; a selection state detection process that detects whether part of the screen has been designated by the input process; a selection information detection process that acquires internal information on the on-screen position of the one selected item; an output process that returns a response to the user; a communication process that communicates with an external apparatus; a voice input process that inputs voice; a first voice recognition process that recognizes the voice; a second voice recognition process trained by a method different from that of the first voice recognition process; an instruction character string detection process that detects an instruction character string based on the voice recognition result; and a command character string detection process that detects a command character string based on the voice recognition result. When the selection state detection process finds that part of the screen has been selected by the input process and both the instruction character string and the command character string have been detected, the output process is performed according to the result of the first voice recognition process. When no part of the screen is selected, or when either the instruction character string or the command character string has not been detected, the output process is performed according to the result of the second voice recognition process.

As a result, when the screen has been designated by the input process and both the instruction character string and the command character string have been detected, a response can be returned to the user without waiting for the voice recognition result from the server. This makes it possible to reduce response delay in voice dialogue compared with conventional approaches.
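The choice between the two recognition results can be sketched as below. The sketch assumes that the first voice recognition process runs on the device and has already produced an (instruction character string, command character string) pair, and that the second process is a slower call represented by a callable. The name `choose_result` and the tuple layout are hypothetical.

```python
from typing import Callable, Optional, Tuple


def choose_result(item_selected: bool,
                  first_result: Tuple[Optional[str], Optional[str]],
                  second_recognizer: Callable[[], str]) -> Tuple[str, str]:
    """Return which recognition result drives the output process.

    first_result is (instruction_string, command_string) as detected in the
    output of the first voice recognition process; either may be None.
    """
    instruction_string, command_string = first_result
    if item_selected and instruction_string is not None and command_string is not None:
        # All conditions hold, so respond without waiting for the second process.
        return ("first", command_string)
    # Otherwise fall back to the second voice recognition process.
    return ("second", second_recognizer())
```

Because `second_recognizer` is invoked only in the fallback branch, the fast path never waits on it.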

The device control method described above may further include a voice synthesis process that generates synthesized speech and a control signal generation process that generates a control signal, and, when the output process returns a response to the user, may respond with the synthesized speech or by controlling the device with the generated control signal.

選択状態検出処理は、入力処理で画面上の一部が選択されているかの状態のみを管理してもよい。 The selection state detection process may manage only the state of whether a part of the screen is selected by the input process.

選択状態検出処理は、入力処理で画面上の一部が選択されているかの状態に加えて、選択された場所に対応する内部情報も管理してもよい。 The selection state detection process may manage internal information corresponding to the selected location in addition to the state of whether a part of the screen is selected in the input process.

さらに、上述の課題を解決するための制御方法の他の態様は、ユーザの音声を入力可能な音声入力装置に接続され、ディスプレイを有する表示機器の制御方法であって、表示機器のコンピュータに、選択可能な複数の項目を含む表示画面をディスプレイに表示させ、ディスプレイの表示画面において、複数の項目の中の一の項目が選択されたことを検知させ、一の項目が選択されたことが検知されているときに、音声入力装置から指示内容を表す第１音声情報を含む音声指示が受信された場合、第１音声情報から指示内容を認識させて指示内容を実行させ、一の項目が選択されたことが検知されていないとき、または、指示内容が実行できないと判断されたとき、音声指示を他のコンピュータへ送信させる。 Furthermore, another aspect of the control method for solving the above problem is a control method for a display device that has a display and is connected to a voice input device capable of inputting a user's voice. The method causes a computer of the display device to: display, on the display, a display screen including a plurality of selectable items; detect that one of the plurality of items has been selected on the display screen; when a voice instruction including first voice information representing instruction content is received from the voice input device while the selection of the one item is detected, recognize the instruction content from the first voice information and execute it; and when the selection of the one item is not detected, or when it is determined that the instruction content cannot be executed, transmit the voice instruction to another computer.

表示機器のコンピュータに、さらに、音声指示に指示語を示す第２音声情報が含まれているか否かを判断させ、一の項目が選択されたことが検知され、第１音声情報から指示内容が認識され、かつ、音声指示に第２音声情報が含まれていると判断された場合、指示内容を実行させ、一の項目が選択されたことが検知されなかった場合、第１音声情報から前記指示内容が認識されなかった場合、または音声指示に第２音声情報が含まれていると判断されなかった場合、音声指示を前記他のコンピュータへ送信させてもよい。 The computer of the display device may further be caused to determine whether the voice instruction includes second voice information indicating an instruction word. When the selection of the one item is detected, the instruction content is recognized from the first voice information, and the voice instruction is determined to include the second voice information, the instruction content is executed. When the selection of the one item is not detected, when the instruction content is not recognized from the first voice information, or when the voice instruction is not determined to include the second voice information, the voice instruction may be transmitted to the other computer.

指示内容は、一の項目に関連する情報を検索する指示であり、制御方法は、指示内容に基づく検索結果をユーザへ通知させてもよい。 The instruction content is an instruction to search for information related to one item, and the control method may notify the user of a search result based on the instruction content.

表示機器はネットワークを介してサーバと接続され、一の項目に関連する情報を、サーバ内のデータベースを参照して検索してもよい。 The display device may be connected to the server via a network, and information related to one item may be searched by referring to a database in the server.

制御方法は、検索結果をディスプレイに表示させてもよい。 The control method may display the search result on the display.

音声入力装置は、表示機器に含まれてもよい。 The voice input device may be included in the display device.

表示機器はさらに、音声を出力可能な音声出力装置と接続され、制御方法は、検索結果を音声出力装置からの音声として出力させる検索結果情報を、音声出力装置に送信させてもよい。 The display device may be further connected to an audio output device capable of outputting sound, and the control method may cause search result information, which causes the search result to be output as sound from the audio output device, to be transmitted to the audio output device.

音声出力装置は、表示機器に含まれてもよい。 The audio output device may be included in the display device.

メタデータは、テレビ番組名、チャンネル名、テレビ番組の概容、テレビ番組の注目度、およびテレビ番組のおすすめ度の少なくとも１つを示してもよい。 The metadata may indicate at least one of a TV program name, a channel name, an overview of the TV program, a TV program attention level, and a TV program recommendation level.

さらに、上述の課題を解決するためのコンピュータプログラムの一態様は、ユーザの音声を入力可能な音声入力装置に接続され、ディスプレイを有する表示機器に実行させるコンピュータプログラムであって、前記コンピュータプログラムは前記表示機器のコンピュータに、選択可能な複数の項目を含む表示画面を前記ディスプレイに表示させ、前記ディスプレイの表示画面において、前記複数の項目の中の一の項目が選択されたことを検知させ、前記一の項目が選択されたことが検知されているときに、前記音声入力装置から指示内容を表す第１音声情報を含む音声指示が受信された場合、前記第１音声情報から前記指示内容を認識させて前記指示内容を実行させ、前記一の項目が選択されたことが検知されていないとき、または、前記指示内容が実行できないと判断されたとき、前記音声指示を他のコンピュータへ送信させる。 Furthermore, one mode of a computer program for solving the above-mentioned problem is a computer program which is connected to a voice input device capable of inputting a user's voice and causes a display device having a display to execute the computer program. A computer of a display device displays a display screen including a plurality of selectable items on the display, and causes the display screen of the display to detect that one of the plurality of items is selected, When it is detected that one item is selected, when the voice instruction including the first voice information indicating the instruction content is received from the voice input device, the instruction content is recognized from the first voice information. Then, the instruction content is executed, and when it is not detected that the one item is selected, or when it is determined that the instruction content cannot be executed, the voice instruction is transmitted to another computer.

本開示の表示機器の一態様は、ユーザの音声を入力可能な音声入力装置に接続された表示機器であって、ディスプレイと、制御回路と、通信回路と、を備え、前記制御回路は、選択可能な複数の項目を含む表示画面を前記ディスプレイに表示させ、前記ディスプレイの前記表示画面において、前記複数の項目の中の一の項目が選択されたことを検知し、前記一の項目が選択されたことを検知しているときに、前記音声入力装置から指示内容を表す第１音声情報を含む音声指示が受信された場合、前記第１音声情報から前記指示内容を認識して前記指示内容を実行し、前記一の項目が選択されたことを検知していないとき、または、前記指示内容が実行できないと判断したとき、前記音声指示を他のコンピュータへ送信するように前記通信回路に指示する。 One aspect of the display device of the present disclosure is a display device connected to a voice input device capable of inputting a user's voice, the display device including a display, a control circuit, and a communication circuit. The control circuit causes the display to display a display screen including a plurality of selectable items; detects that one of the plurality of items has been selected on the display screen; when a voice instruction including first voice information representing instruction content is received from the voice input device while the selection of the one item is detected, recognizes the instruction content from the first voice information and executes it; and when the selection of the one item is not detected, or when it determines that the instruction content cannot be executed, instructs the communication circuit to transmit the voice instruction to another computer.

なお、以下で説明する実施の形態は、いずれも本発明の一具体例を示すものである。以下の実施の形態で示される数値、形状、構成要素、ステップ、ステップの順序などは、一例であり、本発明を限定する主旨ではない。また、以下の実施の形態における構成要素のうち、最上位概念を示す独立請求項に記載されていない構成要素については、任意の構成要素として説明される。また全ての実施の形態において、各々の内容を組み合わせることも出来る。 Each of the embodiments described below shows one specific example of the present invention. Numerical values, shapes, constituent elements, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present invention. Further, among the constituent elements in the following embodiments, the constituent elements that are not described in the independent claim indicating the highest concept are described as arbitrary constituent elements. In addition, the contents of each of the embodiments can be combined.

以下、添付の図面を参照しながら、本発明の例示的な実施の形態を説明する。 Hereinafter, exemplary embodiments of the present invention will be described with reference to the accompanying drawings.

（実施の形態１） (Embodiment 1)
図１は、本実施の形態における情報提供システムが表示機器に対して実行する情報提供方法の概要を示すシーケンス図である。本実施の形態における情報提供システムは、ディスプレイを有する表示機器とユーザの音声を入力可能な音声入力機器とに接続される。ここで「接続される」とは、電気信号の送受信ができるように電気的に接続されることを意味する。「接続」は、有線に限らず無線でもよい。２つの機器の間に他の機器（例えば、スイッチングハブ、ルータ、パーソナルコンピュータ（ＰＣ）等）が接続され、それらを介して電気信号の送受信が行われ得る状態も、２つの機器が接続されている状態に該当する。 FIG. 1 is a sequence diagram showing an outline of the information providing method that the information providing system according to the present embodiment executes for a display device. The information providing system in the present embodiment is connected to a display device having a display and to a voice input device capable of inputting a user's voice. Here, "connected" means electrically connected so that electric signals can be transmitted and received. The connection is not limited to a wired connection and may be wireless. A state in which another device (for example, a switching hub, a router, or a personal computer (PC)) is connected between the two devices and electric signals can be transmitted and received through it also counts as a state in which the two devices are connected.

情報提供システムは、典型的にはサーバコンピュータを含む１以上の機器の組み合わせであり得る。情報提供システムは、選択可能な複数の項目を含む表示画面を表示機器のディスプレイに表示させる表示画面情報を表示機器に送信する。それを受けて、表示機器は表示画面をディスプレイに表示する（ステップＳ１００）。この表示画面は、選択可能な複数の項目を含む。複数の項目の各々は、例えば図１３に示すようなテレビ番組を示す項目であり得るが、これに限定されない。複数の項目の各々は、テレビ番組に関するメタデータまたはテレビ番組のコンテンツを示す項目であってもよい。メタデータは、例えば、テレビ番組名、チャンネル名、テレビ番組の概要、テレビ番組の注目度、およびテレビ番組のおすすめ度の少なくとも１つを示すデータであり得る。テレビ番組のコンテンツは、例えば人物、動物、車、地図、文字、数字の少なくとも１つを示す情報を含み得る。表示画面が地図の画像を含む場合、複数の項目の各々は、地図上の位置を特定する座標情報であり得る。 The information providing system may typically be a combination of one or more devices including a server computer. The information providing system transmits display screen information that causes a display of a display device to display a display screen including a plurality of selectable items to the display device. In response to this, the display device displays the display screen on the display (step S100). This display screen includes a plurality of selectable items. Each of the plurality of items may be, for example, an item indicating a television program as shown in FIG. 13, but is not limited to this. Each of the plurality of items may be an item indicating metadata about the television program or content of the television program. The metadata may be, for example, data indicating at least one of a television program name, a channel name, an outline of the television program, a degree of attention of the television program, and a degree of recommendation of the television program. The content of the television program may include information indicating at least one of a person, an animal, a car, a map, letters, and numbers, for example. When the display screen includes a map image, each of the plurality of items may be coordinate information that identifies a position on the map.
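As an illustration of what the display screen information of step S100 might carry, the following sketch serializes a list of selectable program items together with a few of the metadata fields named above. The message layout and every field name (`type`, `items`, `item_id`, `metadata`, and so on) are assumptions made for this example; the disclosure does not specify a wire format.

```python
import json


def build_display_screen_info(programs):
    """Build a display-screen message as JSON text (hypothetical format)."""
    items = []
    for index, program in enumerate(programs):
        items.append({
            "item_id": index,            # identifies the selectable item
            "selectable": True,
            "metadata": {
                "program_name": program["program_name"],
                "channel_name": program["channel_name"],
                "summary": program.get("summary", ""),
            },
        })
    # ensure_ascii=False keeps Japanese program names readable in the payload.
    return json.dumps({"type": "display_screen_info", "items": items},
                      ensure_ascii=False)
```

A display device receiving such a message would render one selectable entry per element of `items`.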

ユーザは、表示機器のディスプレイに表示された複数の項目の中から、一の項目を選択することができる。例えば、テレビ番組を示す複数の項目が表示されている場合、その中から１つの項目を選択することができる。表示機器がタッチスクリーンをディスプレイとして備えている場合、項目の選択はタッチスクリーンへの直接的な接触によって行われ得る。表示機器が外付けのディスプレイに表示画面を表示させる場合、項目の選択は例えばマウスの操作によって行われ得る。前者の場合はタッチスクリーンが、後者の場合はマウスが入力装置として機能する。 The user can select one item from the plurality of items displayed on the display of the display device. For example, when a plurality of items indicating a television program are displayed, one item can be selected from them. If the display device comprises a touch screen as the display, the selection of items can be done by direct contact with the touch screen. When the display device displays a display screen on an external display, selection of items can be performed by operating a mouse, for example. In the former case, the touch screen functions as the input device, and in the latter case, the mouse functions as the input device.

ディスプレイの表示画面において、複数の項目の中の一の項目が選択されると、表示機器は、そのことを示す情報（「項目選択情報」と称する。）を、情報提供システムに含まれるサーバに送信する。サーバは、項目選択情報を受信すると、どの項目が選択されたかを判断し、各項目の選択／非選択の状態を記録（または更新）する（ステップＳ１１０）。この処理を選択状態管理処理と称する。項目選択情報の送信および選択状態管理処理は、ユーザが項目の選択を変更する度に実行される。言い換えれば、ユーザの項目の選択（または変更）に起因する選択状態管理処理は、音声指示の前に何回でも実行され得る。 When one of the plurality of items is selected on the display screen of the display, the display device transmits information indicating the selection (referred to as "item selection information") to the server included in the information providing system. Upon receiving the item selection information, the server determines which item has been selected and records (or updates) the selected/non-selected state of each item (step S110). This process is called the selection state management process. The transmission of item selection information and the selection state management process are executed each time the user changes the selection. In other words, the selection state management process triggered by the user's selection (or change of selection) may be executed any number of times before the voice instruction.
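The selection state management process of step S110 can be sketched as a small server-side store that is overwritten on every item selection message. The class name `SelectionStateStore` and the message key `selected_item` are hypothetical, and the sketch assumes that at most one item is selected at a time.

```python
class SelectionStateStore:
    """Server-side record of the selected/non-selected state of each item."""

    def __init__(self, item_ids):
        # Every item starts in the non-selected state.
        self._selected = {item_id: False for item_id in item_ids}

    def apply(self, item_selection_info):
        """Update all states from one item selection message (step S110)."""
        chosen = item_selection_info.get("selected_item")
        for item_id in self._selected:
            self._selected[item_id] = (item_id == chosen)

    def selected_item(self):
        """Return the currently selected item id, or None if nothing is selected."""
        for item_id, is_selected in self._selected.items():
            if is_selected:
                return item_id
        return None
```

Because the store always reflects the latest message, the server can answer "is an item selected?" later without asking the display device again.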

ユーザは、一の項目を選択した後、その項目に対する音声指示を行う。例えば、選択した項目に対応するテレビ番組を再生する指示や、そのテレビ番組の概要を表示したりする指示を音声によって行うことができる。そのような指示は、例えば、「それを再生」、「その内容を表示」などと発声することによって行われ得る。この指示は、「を再生」、「内容を表示」といった指示内容を示す第１音声情報と、「それ」、「その」といった指示語を示す第２音声情報とを含み得る。第１音声情報は、表示機器の制御コマンドと関連付けられる。表示機器が何等かの音声指示をユーザから受け付けると、表示機器はその音声情報をサーバに送信する。 After selecting one item, the user gives a voice instruction for that item. For example, the user can give, by voice, an instruction to play the television program corresponding to the selected item or an instruction to display a summary of that program. Such an instruction may be given by saying, for example, "play it" (それを再生) or "display its contents" (その内容を表示). The instruction may include first voice information indicating instruction content, such as "play" and "display contents", and second voice information indicating an instruction word, such as "it" (それ) and "its" (その). The first voice information is associated with a control command of the display device. When the display device receives any voice instruction from the user, the display device transmits the voice information to the server.
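Splitting a recognized utterance into the first voice information (instruction content) and the second voice information (instruction word) can be sketched with simple substring matching on the recognized text. The command names `PLAY` and `SHOW_SUMMARY`, the word lists, and the function `parse_instruction` are assumptions for illustration; naive substring matching can misfire on words that merely contain a demonstrative, so a real system would likely use morphological analysis.

```python
# Instruction words (demonstratives); the list is illustrative, not exhaustive.
DEMONSTRATIVES = ("それ", "その", "これ", "この", "あれ", "あの")

# Instruction content mapped to hypothetical control command names.
COMMANDS = {
    "再生": "PLAY",
    "内容を表示": "SHOW_SUMMARY",
}


def parse_instruction(text):
    """Return (command, demonstrative); either element is None if not found."""
    command = next((cmd for phrase, cmd in COMMANDS.items() if phrase in text), None)
    demonstrative = next((word for word in DEMONSTRATIVES if word in text), None)
    return command, demonstrative
```

For the two example utterances above, `parse_instruction("それを再生")` yields the `PLAY` command with the instruction word "それ", and `parse_instruction("その内容を表示")` yields `SHOW_SUMMARY` with "その".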

サーバは、音声情報を受信すると、一の項目が選択されているか否か（ステップＳ１１１）、音声指示が第１音声情報を含むか否か（ステップＳ１１２）、音声指示が第２音声情報を含むか否か（ステップＳ１１４）を判定する。これらの３つのステップのいずれかでＮｏと判定した場合、サーバは指示内容を無視し、待機状態に戻る。あるいは、指示内容を実行しない旨を示す情報を表示機器に送信してもよい。 When the server receives the voice information, it determines whether one item is selected (step S111), whether the voice instruction includes the first voice information (step S112), and whether the voice instruction includes the second voice information (step S114). If the determination in any of these three steps is No, the server ignores the instruction content and returns to the standby state. Alternatively, the server may transmit, to the display device, information indicating that the instruction content will not be executed.

ステップＳ１１１では、サーバは、選択状態管理処理（Ｓ１１０）において更新された選択状態の情報を参照し、一の項目が選択されているか否かを判定する。一の項目が選択されている場合、ステップＳ１１２に進む。ステップＳ１１２では、サーバは、音声指示が第１音声情報（すなわち指示内容）を含むか否かを判定する。音声指示が第１音声情報を含むと判定した場合、サーバは、指示内容を認識する（ステップＳ１１３）。続くステップＳ１１４において、サーバは、音声指示が第２音声情報（すなわち指示語）を含むか否かを判定する。音声指示が第２音声情報を含むと判定した場合、サーバは、指示内容を実行する（ステップＳ１１５）。指示内容の実行は、例えば、要求された指示に対応する機器の制御情報などを表示機器に送信することによって行われる。なお、ステップＳ１１１、Ｓ１１２、Ｓ１１４の順序は図１に示す順序に限らず、相互に入れ替えてもよい。 In step S111, the server refers to the information on the selection state updated in the selection state management process (S110) and determines whether or not one item is selected. If one item is selected, the process proceeds to step S112. In step S112, the server determines whether or not the voice instruction includes the first voice information (that is, the instruction content). When it is determined that the voice instruction includes the first voice information, the server recognizes the instruction content (step S113). In subsequent step S114, the server determines whether or not the voice instruction includes the second voice information (that is, the instruction word). When it is determined that the voice instruction includes the second voice information, the server executes the instruction content (step S115). Execution of the instruction content is performed, for example, by transmitting control information of a device corresponding to the requested instruction to the display device. The order of steps S111, S112, and S114 is not limited to the order shown in FIG. 1 and may be interchanged.
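The flow of steps S111 to S115 can be summarized in a short sketch. It assumes that recognition has already produced a command (first voice information) and an instruction word (second voice information), either of which may be absent; the function name and the shape of the returned control information are hypothetical.

```python
from typing import Optional


def process_voice_instruction(selected_item: Optional[str],
                              command: Optional[str],
                              instruction_word: Optional[str]) -> Optional[dict]:
    """Return control information to send to the display device, or None to ignore."""
    # S111: is one item selected, according to the state recorded in S110?
    if selected_item is None:
        return None
    # S112/S113: does the instruction contain instruction content, and what is it?
    if command is None:
        return None
    # S114: does the instruction contain an instruction word?
    if instruction_word is None:
        return None
    # S115: execute by returning control information for the selected item.
    return {"command": command, "target": selected_item}
```

Since the three checks are independent of one another, reordering them does not change the result, which is consistent with the note that steps S111, S112, and S114 may be interchanged.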

このような方法により、選択状態管理処理（Ｓ１１０）によってサーバが表示機器の表示画面における項目の選択状態をリアルタイムで把握できる。サーバが音声指示を受け付けた後、表示機器に選択状態の問い合わせを行う必要がないため、表示機器とサーバとの間のアクセスを低減することができる。 With such a method, the server can grasp the selection state of the item on the display screen of the display device in real time by the selection state management process (S110). After the server accepts the voice instruction, it is not necessary to inquire the display device about the selection state, so that the access between the display device and the server can be reduced.

次に、本実施の形態における番組情報提示方法を採用するシステムのより具体的な例を説明する。 Next, a more specific example of a system that employs the program information presentation method according to the present embodiment will be described.

図２は、本実施の形態における番組情報提示方法を採用するシステムの構成を示す。この番組情報提示方法は、ユーザの音声を認識する音声認識機能を利用して番組の情報をユーザに提示する。本システムは、クライアント１２１と、サーバ１２０とを含む。クライアント１２１は、前述の表示機器、または表示機器に接続される他の機器に対応する。クライアント１２１は、例えばテレビ、レコーダー、スマートフォン、タブレット端末などの機器であり得る。図２の例では、クライアント１２１は、音声入力装置であるマイクロフォン１０１と、入力装置１０８と、出力回路１１２と、通信回路１１３ｂと、これらを制御する制御回路１１４ｂとを備える。制御回路１１４ｂは、ユーザによる項目の選択を検出する選択状態検出部１０９と、入力装置によって指定された番組の表示画面上での位置情報および指定された番組の情報を検出する選択情報検出部１１１とを有する。 FIG. 2 shows the configuration of a system that employs the program information presentation method according to the present embodiment. This program information presentation method presents program information to the user by using a voice recognition function that recognizes the user's voice. The system includes a client 121 and a server 120. The client 121 corresponds to the above-described display device or to another device connected to the display device. The client 121 may be a device such as a television, a recorder, a smartphone, or a tablet terminal. In the example of FIG. 2, the client 121 includes a microphone 101 serving as a voice input device, an input device 108, an output circuit 112, a communication circuit 113b, and a control circuit 114b that controls these. The control circuit 114b has a selection state detection unit 109 that detects the selection of an item by the user, and a selection information detection unit 111 that detects the position, on the display screen, of the program designated by the input device as well as information on the designated program.

サーバ１２０は、クライアント１２１と通信する通信回路１１３ａと、制御回路１１４ａとを備える。制御回路１１４ａは、選択状態管理部１１０、音声認識部１０２、指示文字列検出部１０３、対話管理部１０４、応答文生成部１０５、音声合成部１０６、および制御信号生成部１０７の７つの機能部を有する。 The server 120 includes a communication circuit 113a that communicates with the client 121, and a control circuit 114a. The control circuit 114a has seven functional units: a selection state management unit 110, a voice recognition unit 102, an instruction character string detection unit 103, a dialogue management unit 104, a response sentence generation unit 105, a voice synthesis unit 106, and a control signal generation unit 107.

本実施の形態では、音声入力装置であるマイクロフォン１０１がユーザの音声信号をセンシングする。サーバ１２０の音声認識部１０２は、センシングした音声信号を文字列に変換する。以後は主としてサーバ１２０による処理が行われる。指示文字列検出部１０３は、音声認識部１０２で変換された文字列中に含まれる指示代名詞を検出する。対話管理部１０４は、ユーザと機器との対話型の処理を行った履歴やどのような対話処理を行うかという応答戦略などを管理する。ここで、対話型処理とは、タッチパネルなどの物理的なインターフェースや音声などを用いたユーザと機器とのメッセージのやりとりに関する処理をいう。そのような履歴情報および応答戦略に用いられる情報は、不図示のメモリなどの記録媒体に格納される。 In the present embodiment, the microphone 101, which is a voice input device, senses the user's voice signal. The voice recognition unit 102 of the server 120 converts the sensed voice signal into a character string. The subsequent processing is performed mainly by the server 120. The instruction character string detection unit 103 detects demonstrative pronouns included in the character string produced by the voice recognition unit 102. The dialogue management unit 104 manages the history of interactive processing between the user and the device, the response strategy that determines what kind of interactive processing is to be performed, and the like. Here, interactive processing refers to processing related to the exchange of messages between the user and the device through a physical interface such as a touch panel, through voice, or the like. The history information and the information used for the response strategy are stored in a recording medium such as a memory (not shown).

応答文生成部１０５は、入力された文字列に応じてユーザに応答する文字列を生成する。音声合成部１０６は、応答文生成部１０５で生成した文字列を音声に変換する。制御信号生成部１０７は、対話内容に応じた機器制御コマンドを生成する。 The response sentence generation unit 105 generates a character string that responds to the user according to the input character string. The voice synthesis unit 106 converts the character string generated by the response sentence generation unit 105 into voice. The control signal generation unit 107 generates a device control command according to the content of the dialogue.

なお、音声合成部１０６は、応答文生成部１０５が生成した文章から合成音声を生成し、ユーザに音声を提示すると説明したが、これは一例である。例えば、TVなどのディスプレイ装置がクライアント１２１に設けられている場合には、文字列を画面上に表示しても構わない。 It has been described that the voice synthesis unit 106 generates a synthetic voice from the sentence generated by the response sentence generation unit 105 and presents the voice to the user, but this is an example. For example, when a display device such as a TV is provided in the client 121, the character string may be displayed on the screen.

入力装置１０８は、例えば、マウス、タッチパネル、キーボード、リモートコントローラ等であり得る。この入力装置１０８は、ディスプレイ装置などの表示装置に複数の番組の情報が表示されている場合に、ユーザが一つの番組を選択することを可能にする。 The input device 108 can be, for example, a mouse, a touch panel, a keyboard, or a remote controller. The input device 108 enables the user to select one program when information on a plurality of programs is displayed on a display device such as a display.

入力装置１０８により番組が選択されると、その選択された画面上の位置の情報が取得される。位置の情報は、例えば二次元の座標情報であり得る。表示画面には、番組を示す選択可能な複数の項目の他に、指定可能な他の表示領域が存在し得る。例えば、ページ遷移のためのボタン、番組の選択を終了するボタン、または他の機能を呼び出すためのボタンなどの他の表示領域が存在し得る。ユーザはそのような表示領域も指定することができる。クライアント１２１における選択状態検出部１０９は、入力装置１０８によっていずれかの番組が選択されているか否かの検出を行う。この検出は、指定された位置がいずれかの番組を示す項目の位置と重なるか否かを判定することによって行われ得る。検出結果は通信回路１１３ｂ、１１３ａを介してサーバ１２０の選択状態管理部１１０に送られる。選択状態管理部１１０は、いずれかの番組が選択されているか否かを示す情報を管理する。例えば、いずれかの番組が選択されている場合は、選択状態管理部１１０の内部メモリに１を設定し、番組が選択されていない場合は、内部メモリに０を設定する。この内部メモリの値は選択状態に合わせて更新される。 When a program is selected by the input device 108, information on the selected position on the screen is acquired. The position information may be two-dimensional coordinate information, for example. In addition to a plurality of selectable items indicating programs, another display area that can be designated may be present on the display screen. For example, there may be other display areas such as buttons for page transitions, buttons to end program selection, or buttons to invoke other functions. The user can also specify such a display area. The selection state detection unit 109 in the client 121 detects whether any program is selected by the input device 108. This detection can be performed by determining whether or not the designated position overlaps with the position of the item indicating any program. The detection result is sent to the selection state management unit 110 of the server 120 via the communication circuits 113b and 113a. The selection state management unit 110 manages information indicating whether or not any program is selected. For example, if any program is selected, 1 is set in the internal memory of the selection state management unit 110, and if no program is selected, 0 is set in the internal memory. The value of this internal memory is updated according to the selected state.
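The overlap test performed by the selection state detection unit 109 can be sketched as a point-in-rectangle check over the program items. The sketch assumes that each item occupies an axis-aligned rectangle in screen coordinates; `ItemRegion` and `detect_selection` are hypothetical names. The returned flag follows the 1/0 convention described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemRegion:
    """Screen rectangle occupied by one program item (assumed axis-aligned)."""
    item_id: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px, py):
        # Left/top edges are inclusive, right/bottom edges exclusive.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def detect_selection(regions, px, py):
    """Return (flag, item_id): flag is 1 if the point hits a program item, else 0."""
    for region in regions:
        if region.contains(px, py):
            return 1, region.item_id
    # The point fell on another display area (for example a page button) or on nothing.
    return 0, None
```

Only the flag needs to be sent to the selection state management unit 110; the item id stays available on the client for the selection information detection unit 111.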

選択情報検出部１１１は、入力装置１０８によって指定された番組の位置情報、および、指定された番組の情報などを検出する。検出された情報は、通信回路１１３ｂ、１１３ａを介して指示文字列検出部１０３に送信される。出力回路１１２は、応答文生成部１０５、音声合成部１０６、制御信号生成部１０７の出力結果に基づく情報を出力する。出力回路１１２は、例えば、ディスプレイへの応答文の表示、スピーカーへの合成音声の再生、生成された制御信号による機器の制御、およびディスプレイへの制御結果の表示などの出力処理を行う。 The selection information detection unit 111 detects position information of the program designated by the input device 108, information of the designated program, and the like. The detected information is transmitted to the instruction character string detection unit 103 via the communication circuits 113b and 113a. The output circuit 112 outputs information based on the output results of the response sentence generation unit 105, the voice synthesis unit 106, and the control signal generation unit 107. The output circuit 112 performs output processing such as displaying a response sentence on a display, reproducing synthesized voice on a speaker, controlling a device by a generated control signal, and displaying a control result on a display.

通信回路１１３ａおよび１１３ｂは、サーバ１２０とクライアント１２１の間の通信を行うための通信モジュールを備える。ここで、通信モジュールは、例えばＷｉ−Ｆｉ（登録商標）、Ｂｌｕｅｔｏｏｔｈ（登録商標）などの既存の通信方式を利用して通信を行う。そのような機能を有する限り、通信モジュールの種類は問わない。音声合成部１０６で合成された音声信号、および、機器を制御する制御信号は出力回路１１２に送信される。出力回路１１２は、音声信号、機器を制御するための信号、および制御結果を示す情報を出力する。 The communication circuits 113a and 113b each include a communication module for communication between the server 120 and the client 121. The communication module communicates using an existing communication method such as Wi-Fi (registered trademark) or Bluetooth (registered trademark). Any type of communication module may be used as long as it has such a function. The voice signal synthesized by the voice synthesis unit 106 and the control signal for controlling the device are transmitted to the output circuit 112. The output circuit 112 outputs the voice signal, the signal for controlling the device, and information indicating the control result.

上述したサーバ１２０における制御回路１１４ａの各構成要素は、サーバ１２０のコンピュータ（たとえばＣＰＵ）が、コンピュータプログラムを実行することによって実現されてもよいし、それぞれが別個独立の回路等として設けられてもよい。 Each component of the control circuit 114a in the server 120 described above may be realized by a computer (for example, a CPU) of the server 120 executing a computer program, or each may be provided as a separate, independent circuit or the like.

上述したクライアント１２１における制御回路１１４ｂの各構成要素（選択状態検出部１０９、および選択情報検出部１１１の各々）も、クライアント１２１のコンピュータ（たとえばＣＰＵ）が、コンピュータプログラムを実行することによって実現されてもよいし、それぞれが別個独立の回路等として設けられてもよい。 Each component of the control circuit 114b in the client 121 (each of the selection state detection unit 109 and the selection information detection unit 111) described above is also realized by the computer (for example, CPU) of the client 121 executing a computer program. Alternatively, each may be provided as an independent circuit or the like.

例えば、後述の図３に示されるサーバ１２０の各処理は、コンピュータプログラムを実行したサーバ１２０のコンピュータが行う制御方法として実現され得る。同様に、例えば図３に示されるクライアント１２１の各処理は、コンピュータプログラムを実行したクライアント１２１のコンピュータが行う制御方法として実現され得る。 For example, each processing of the server 120 shown in FIG. 3 described later can be realized as a control method performed by a computer of the server 120 that executes a computer program. Similarly, each process of the client 121 shown in FIG. 3, for example, can be realized as a control method performed by a computer of the client 121 that executes a computer program.

本実施の形態では、音声認識処理を、サーバ１２０が行う例を説明する。音声認識を行った後に実行される対話管理部１０４、応答文生成部１０５、音声合成部１０６、および制御信号生成部１０７の各処理は、サーバ１２０ではなくクライアント１２１が実行しても良い。 In this embodiment, an example in which the server 120 performs the voice recognition process will be described. Each process of the dialogue management unit 104, the response sentence generation unit 105, the voice synthesis unit 106, and the control signal generation unit 107, which is executed after performing the voice recognition, may be executed by the client 121 instead of the server 120.

図３は、サーバ１２０とクライアント１２１との通信処理のシーケンスを示す。このシーケンスは、ユーザがリモコンなどの入力装置１０８によって表示画面上の一部を指定することによって開始される。 FIG. 3 shows a sequence of communication processing between the server 120 and the client 121. This sequence is started by the user designating a part on the display screen with the input device 108 such as a remote controller.

ステップS２００の入力装置情報取得処理において、選択状態検出部１０９は、入力装置１０８によって指定された表示画面上の位置を示す情報を取得する。位置の指定は、入力装置１０８がタッチパネルであれば、指等によるタッチによって行われ得る。入力装置１０８がリモコンであればボタン操作によって行われ得る。 In the input device information acquisition process of step S200, the selection state detection unit 109 acquires information indicating the position on the display screen designated by the input device 108. If the input device 108 is a touch panel, the position can be designated by touching with a finger or the like. If the input device 108 is a remote control, it can be performed by button operation.

ステップS２０１の選択状態検出処理において、選択状態検出部１０９は、一の番組が選択されたか否かを検出する。この検出は、入力装置情報取得処理で取得された位置情報に基づいて、入力装置１０８によって指定された位置が番組を示す項目の位置に該当するか否かを判定することによって行われる。 In the selection state detection process of step S201, the selection state detection unit 109 detects whether one program is selected. This detection is performed by determining whether or not the position designated by the input device 108 corresponds to the position of the item indicating the program, based on the position information acquired by the input device information acquisition process.

ステップS２０２の選択情報保存処理において、クライアント１２１は、入力装置１０８によって選択された項目に関する情報（以下、「選択情報」と称することがある。）を取得してメモリ等の記録媒体に保存する処理を行う。例えば、番組であれば、選択された番組に関連づけられた情報（例えば、番組名、放送日時、概要、出演者などの情報）を取得する。なお、後述する例のように、ディスプレイに地図を表示させる例では、選択された項目に関する情報は、指定された位置にある建物の情報であり得る。地図の例については後述する。 In the selection information storage process of step S202, the client 121 acquires information regarding the item selected by the input device 108 (hereinafter, may be referred to as “selection information”) and stores it in a recording medium such as a memory. I do. For example, in the case of a program, information associated with the selected program (for example, program name, broadcast date/time, summary, performer information, etc.) is acquired. In an example in which a map is displayed on the display, as in the example described later, the information on the selected item may be information on the building at the specified position. An example of the map will be described later.

ステップS２０３の選択状態送信処理では、選択状態検出処理で取得した入力装置１０８による番組選択の有無を示す情報が、クライアント１２１の通信回路１１３ｂからサーバ１２０の通信回路１１３ａに送信される。 In the selection state transmission process of step S203, the information indicating the presence/absence of program selection by the input device 108 acquired in the selection state detection process is transmitted from the communication circuit 113b of the client 121 to the communication circuit 113a of the server 120.

ステップS２０４の選択状態受信処理では、サーバ１２０の通信回路１１３ａはクライアント１２１から送信された選択状態を示す情報を受信する。 In the selection state reception process of step S204, the communication circuit 113a of the server 120 receives the information indicating the selection state transmitted from the client 121.

ステップS２０５の選択状態管理処理では、選択状態管理部１１０は、選択状態受信処理で受信した情報に基づき、番組選択状態を管理する。具体的には、選択状態管理部１１０は、番組が選択されている状態を１、選択されていない状態を０として、０か１かの情報をサーバ１２０における特定のメモリに保存する。これにより、番組の選択有無の管理を実現できる。 In the selection state management process of step S205, the selection state management unit 110 manages the program selection state based on the information received in the selection state reception process. Specifically, the selection state management unit 110 sets the state in which a program is selected as 1 and the state in which it is not selected as 0, and stores information of 0 or 1 in a specific memory in the server 120. This makes it possible to manage whether or not a program is selected.

以上のステップＳ２００〜Ｓ２０５は、ユーザによって番組の選択が変更される度に実行される。したがって、図３に示すステップＳ２００〜Ｓ２０５は、複数回実行され得る。 The above steps S200 to S205 are executed every time the user changes the selection of the program. Therefore, steps S200 to S205 shown in FIG. 3 may be executed multiple times.

ステップS２０６の音声要求送信処理では、サーバ１２０における通信回路１１３ａは、クライアント１２１における通信回路１１３ｂに、音声信号を送付するよう要求する信号を送信する。この処理は、例えば、ユーザからの音声指示の開始の要求に応答して行われる。音声指示の開始の要求は、例えば、画面に表示される開始ボタンの押下をトリガーとして行われ得る。 In the voice request transmission process of step S206, the communication circuit 113a in the server 120 transmits, to the communication circuit 113b in the client 121, a signal requesting that a voice signal be sent. This process is performed, for example, in response to a request from the user to start a voice instruction. The request to start a voice instruction may be triggered, for example, by pressing a start button displayed on the screen.

ステップS２０７の音声要求受信処理では、クライアント１２１は、クライアント１２１に関連付けられたマイクロフォン１０１からの音声の入力を許可する。 In the voice request receiving process of step S207, the client 121 permits the voice input from the microphone 101 associated with the client 121.

ステップS２０８のＡ／Ｄ変換処理では、クライアント１２１は、入力された音声信号についてＡ／Ｄ変換（アナログデジタル変換）を行う。これにより、アナログの音声がデジタルの音声信号に変換される。 In the A/D conversion process of step S208, the client 121 performs A/D conversion (analog-digital conversion) on the input audio signal. As a result, analog voice is converted into a digital voice signal.
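In an actual device the A/D conversion is performed by converter hardware, so the following only illustrates its quantization step in software. It assumes input samples normalized to the range -1.0 to 1.0 and a 16-bit little-endian PCM output; neither the bit depth nor the byte order is stated in the disclosure, and `quantize_pcm16` is a hypothetical name.

```python
import struct


def quantize_pcm16(samples):
    """Quantize normalized float samples to 16-bit little-endian PCM bytes."""
    out = bytearray()
    for sample in samples:
        # Clip to the representable range before scaling.
        clipped = max(-1.0, min(1.0, sample))
        # Scale to the signed 16-bit range and pack as a little-endian short.
        out += struct.pack("<h", int(round(clipped * 32767)))
    return bytes(out)
```

The resulting byte string is the kind of digital voice signal that the client could then hand to its communication circuit in step S209.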

ステップS２０９の音声信号送信処理では、クライアント１２１の通信回路１１３ｂは、デジタル音声信号をサーバ１２０に送信する。 In the audio signal transmission process of step S209, the communication circuit 113b of the client 121 transmits a digital audio signal to the server 120.

ステップS２１０において、サーバ１２０の通信回路１１３ａは、クライアント１２１から送信された音声信号を受信する。 In step S210, the communication circuit 113a of the server 120 receives the audio signal transmitted from the client 121.

ステップS２１１において、音声認識部１０２は音声認識処理を行う。音声認識処理とは、入力された音声信号を解析し、テキストデータに変換する処理である。 In step S211, the voice recognition unit 102 performs voice recognition processing. The voice recognition process is a process of analyzing an input voice signal and converting it into text data.

ステップS２１２では、指示文字列検出部１０３は指示文字列を検出する。指示文字列検出処理とは、音声認識処理によって生成されたテキストデータを解析することで、指示文字列の検出を行う処理である。 In step S212, the instruction character string detection unit 103 detects the instruction character string. The instruction character string detection process is a process of detecting the instruction character string by analyzing the text data generated by the voice recognition process.

ステップS２１３の選択状態判定処理では、選択状態管理部１１０は、ステップＳ２０５の選択状態管理処理においてメモリに保存された選択状態の情報を参照することにより、一の項目が選択されている状態か否かを判定する。つまり、選択状態管理部１１０は、サーバ１２０上のデータのみに基づいて選択状態であるか否かの判定を行う。選択状態にあると判定した場合には、サーバ１２０は、クライアント１２１に、選択情報を要求する。クライアント１２１は、この要求を受信すると、ステップS２１４の選択情報送信処理において、ステップＳ２０２の選択情報保存処理でメモリに保存された選択情報をサーバに送信する。 In the selection state determination process of step S213, the selection state management unit 110 refers to the selection state information stored in the memory in the selection state management process of step S205 to determine whether or not one item is selected. That is, the selection state management unit 110 makes this determination based only on the data held on the server 120. When it determines that an item is selected, the server 120 requests the selection information from the client 121. Upon receiving this request, the client 121, in the selection information transmission process of step S214, transmits to the server the selection information that was stored in the memory in the selection information storage process of step S202.

ステップS２１５の選択情報受信処理において、サーバ１２０の通信回路１１３ａは、クライアント１２１の通信回路１１３ｂから選択情報を受信する。 In the selection information receiving process of step S215, the communication circuit 113a of the server 120 receives the selection information from the communication circuit 113b of the client 121.

ステップS２１６の対話管理処理では、対話管理部１０４は、受信した選択情報と指示文字列検出処理の結果に基づいて、機器の制御方法および音声での応答方法を決定し、クライアント１２１に返信するための情報を出力する。対話管理処理は、例えば、入力された音声の情報と出力の情報とが対応付けられたテーブルを参照することで、対話管理部１０４が応答方法を決める処理であり得る。例えば、「TV、電源ON」という音声が入力された場合、対話管理部１０４は、TVの電源をＯＮにする機器制御信号、あるいは、機器制御信号に対応する識別子（ID）を出力する。ユーザが「その番組の内容を表示」と言った場合、指示文字列検出結果により、「その」という文字が検出される。これにより、「その」という言葉が、ステップＳ２１５の選択情報受信処理で取得した選択情報を指していることがわかる。その結果、対話管理部１０４は、選択情報から得られる番組情報から番組内容を特定し、クライアント１２１に返信するための情報を生成できる。 In the dialogue management process of step S216, the dialogue management unit 104 determines the device control method and the voice response method based on the received selection information and the result of the instruction character string detection process, and outputs information to be returned to the client 121. The dialogue management process may be, for example, a process in which the dialogue management unit 104 determines the response method by referring to a table that associates input voice information with output information. For example, when the voice "TV, power ON" is input, the dialogue management unit 104 outputs a device control signal for turning on the power of the TV, or an identifier (ID) corresponding to that device control signal. When the user says "display the content of that program", the instruction character string detection process detects the word "that". This shows that the word "that" refers to the selection information acquired in the selection information receiving process of step S215. As a result, the dialogue management unit 104 can identify the program content from the program information contained in the selection information and can generate information to be returned to the client 121.
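A minimal sketch of the table lookup and demonstrative resolution described for step S216 is given below. The table contents, the dictionary layout of the response, the `content` key of the selection information, and the fallback message are all assumptions made for illustration; the disclosure only states that a table associating input voice information with output information may be used.

```python
from typing import Optional

# Hypothetical table: recognized utterance -> response information.
RESPONSE_TABLE = {
    "TV, power ON": {"type": "control", "id": "TV_POWER_ON"},
}


def decide_response(utterance: str,
                    has_demonstrative: bool,
                    selection_info: Optional[dict]) -> dict:
    """Return the information to be sent back to the client."""
    # Fixed commands are resolved directly from the table.
    if utterance in RESPONSE_TABLE:
        return RESPONSE_TABLE[utterance]
    # A demonstrative is taken to refer to the item the client reported
    # as selected, so the answer is built from that selection information.
    if has_demonstrative and selection_info is not None:
        return {"type": "text", "text": selection_info["content"]}
    # Assumed fallback when the utterance cannot be resolved.
    return {"type": "text", "text": "Sorry, I did not understand."}
```

Returning an identifier rather than a raw control signal mirrors the option mentioned in the text of sending an ID that corresponds to the device control signal.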

ステップS２１７の応答結果送信処理では、サーバ１２０の通信回路１１３ａは、ステップＳ２１６の対話管理処理において生成された情報をクライアント１２１に送信する。送信される情報は、例えば、機器の制御信号もしくは制御信号に対応するID、または、合成音声のデータもしくは音声を合成するためのテキストデータであり得る。 In the response result transmission process of step S217, the communication circuit 113a of the server 120 transmits the information generated in the dialogue management process of step S216 to the client 121. The transmitted information may be, for example, a control signal of the device or an ID corresponding to the control signal, or data of synthetic voice or text data for synthesizing voice.

ステップS２１８の応答結果受信処理では、サーバ１２０からの応答結果をクライアント１２１の通信回路１１３ｂが受信する。 In the response result reception process of step S218, the communication circuit 113b of the client 121 receives the response result from the server 120.

ステップS２１９の応答結果出力処理では、出力回路１１２は、ステップＳ２１８の応答結果受信処理で受信した機器の制御信号や合成音声、テキストなどを機器の出力手段を通じてユーザまたは制御対象の機器に出力する。例えば、機器の制御信号に関しては、応答結果出力処理として、TVの電源のON、OFFや、音量、チャンネルの上下を制御することが考えられる。合成音声に関しては、TVのスピーカーから応答音声を出力することが考えられる。テキストに関しては、クライアント１２１の機器が音声の合成を行い、合成された音声を出力してもよいし、加工されたテキストをTVの画面に表示してもよい。 In the response result output process of step S219, the output circuit 112 outputs the control signal of the device, the synthesized voice, the text, etc. received in the response result reception process of step S218 to the user or the device to be controlled through the output means of the device. For example, regarding the control signal of the device, as a response result output process, it is conceivable to control ON/OFF of the power supply of the TV, control of volume, and up/down of the channel. Regarding synthetic speech, it is possible to output response speech from the TV speaker. Regarding the text, the device of the client 121 may synthesize the voice and output the synthesized voice, or the processed text may be displayed on the screen of the TV.

以下、サーバ１２０での処理とクライアント１２１での処理に分けて、音声認識機能を用いた番組情報提示方法をさらに詳細に説明する。 Hereinafter, the program information presentation method using the voice recognition function will be described in more detail by dividing the processing in the server 120 and the processing in the client 121.

図４は、サーバ１２０が音声指示を受信した後の処理フローの詳細を示す。 FIG. 4 shows details of the processing flow after the server 120 receives the voice instruction.

まず、音声入力処理（S３００）では、マイクロフォン１０１から音声信号が入力される。本実施の形態では、マイクロフォンはクライアント１２１に備えられているものとし、クライアント１２１上でA/D変換された音声信号がサーバ１２０側に転送される。 First, in the voice input process (S300), a voice signal is input from the microphone 101. In the present embodiment, the microphone is provided in the client 121, and the audio signal A/D converted on the client 121 is transferred to the server 120 side.

音声認識処理（S３０１）では、音声認識部１０２は入力された音声の認識処理を行う。音声認識処理では、入力された音声信号が文字列データに変換される。サーバ１２０上で音声認識を行うことで、大規模のデータ群から構築した音響モデルおよび言語モデルを利用できる。サーバ１２０の計算能力はクライアント１２１に比べると高い。大規模データから統計的学習手法によって学習した音響モデルおよび言語モデルを利用できるため、多様な言葉の認識率が高いというメリットがある。また、スマートフォンやFTTHなどの普及により、端末が常時ネットワークに接続された環境が整ってきている。このため、サーバ１２０上で音声認識を行う方法は実用的である。 In the voice recognition process (S301), the voice recognition unit 102 recognizes the input voice. In the voice recognition process, the input voice signal is converted into character string data. By performing speech recognition on the server 120, an acoustic model and a language model constructed from a large-scale data group can be used. The computing power of the server 120 is higher than that of the client 121. Since acoustic models and language models learned from a large-scale data by a statistical learning method can be used, there is an advantage that the recognition rate of various words is high. Also, with the spread of smartphones and FTTH, the environment in which terminals are constantly connected to networks has been established. Therefore, the method of performing voice recognition on the server 120 is practical.

指示文字列検出処理（S３０２）では、指示文字列検出部１０３は音声認識によって得られる文字列から指示文字列の検出を行う。ここで、指示文字列とは、「これ」、「それ」、「あれ」、「この」、「その」、「あの」、「これの」、「それの」、「あれの」などといった指示語または指示詞のことである。ここでは、指示文字列の検出は次のようにして行う。まず、指示文字列検出部１０３は、入力された文字列を形態素解析によって単語、品詞単位に分割する。形態素とは文章の要素のうち意味を持つ最小の単位である。形態素解析によって、文章を単語や品詞など複数の形態素に分割できる。あらかじめ指示文字列をリストとして用意しておき、そのリストに含まれる語と分割した形態素とが一致すれば、文章中の指示文字列が検出できたものとする。このように、単語同士のマッチングにて、指示文字列が検出されたか否かの検出処理を行う。 In the instruction character string detection process (S302), the instruction character string detection unit 103 detects an instruction character string in the character string obtained by voice recognition. Here, an instruction character string is a demonstrative word such as "kore" (this), "sore" (that), "are" (that over there), "kono" (this ...), "sono" (that ...), "ano" (that ... over there), "kore no" (of this), "sore no" (of that), or "are no" (of that over there). The detection is performed as follows. First, the instruction character string detection unit 103 divides the input character string into words and parts of speech by morphological analysis. A morpheme is the smallest meaningful unit among the elements of a sentence. Morphological analysis divides a sentence into multiple morphemes such as words and parts of speech. A list of instruction character strings is prepared in advance, and if a word in the list matches one of the morphemes, an instruction character string is deemed to have been detected in the sentence. In this way, whether or not an instruction character string is present is detected by word-to-word matching.
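The list-matching step can be sketched as below. The sketch assumes that morphological analysis has already been performed by some external analyzer and that its output is available as a list of surface strings; the function name and the contents of the list are illustrative. It further assumes that forms such as 「これの」 are split into 「これ」 and 「の」 by the analyzer, so only the base demonstratives are listed.

```python
# Demonstratives prepared in advance as a list (here a set for fast lookup).
DEMONSTRATIVES = frozenset({"これ", "それ", "あれ", "この", "その", "あの"})


def detect_instruction_strings(morphemes):
    """Return the morphemes that match an entry in the demonstrative list.

    `morphemes` is the token sequence produced by morphological analysis.
    An empty result means no instruction character string was detected.
    """
    return [m for m in morphemes if m in DEMONSTRATIVES]
```

For example, the token sequence for 「その番組の内容を表示」 would yield a non-empty result, whereas the tokens of 「TV、電源ON」 would yield an empty list.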

指示文字列を検出したか否かによってサーバ１２０は以降の処理を切換える（S３０３）。指示文字列を検出すると、選択状態管理部１１０はクライアント側の入力装置１０８がTV画面上の番組に関する情報を選択しているか否かの状態を取得する（S３０４）。そして選択状態管理部１１０は、取得した選択状態に基づき番組を選択している状態であるか否かの判定を行う（S３０５）。具体的には、入力装置１０８が画面上の番組を選択しているときは、選択状態として１を、番組を選択していないときは選択状態として１以外を指定するものとすると、選択状態管理部１１０は選択状態取得処理（S３０４）にて、１か１以外かの情報を取得する。選択状態管理部１１０は選択状態判定処理（S３０５）にて、選択状態であるか否かの判定、つまり、選択状態が１であるか否かの判定を行う。このときの１か１以外かの値は、選択状態管理部１１０に保存されている。この判定結果に基づき、番組が選択されているか否かにより処理が切り替えられる（S３０６）。 The server 120 switches the subsequent processing depending on whether or not an instruction character string has been detected (S303). When an instruction character string is detected, the selection state management unit 110 acquires the state indicating whether or not the input device 108 on the client side is selecting information about a program on the TV screen (S304). The selection state management unit 110 then determines, based on the acquired selection state, whether or not a program is selected (S305). Specifically, assuming that the selection state is set to 1 while the input device 108 is selecting a program on the screen and to a value other than 1 while no program is selected, the selection state management unit 110 acquires a value of either 1 or other than 1 in the selection state acquisition process (S304). In the selection state determination process (S305), the selection state management unit 110 determines whether or not a program is selected, that is, whether or not the selection state is 1. This value is stored in the selection state management unit 110. Based on the determination result, the processing branches depending on whether or not a program is selected (S306).

もし番組が選択されていると判断されれば、選択状態管理部１１０は選択情報取得処理（S３０７）によって、画面上で選択された番組に関連する情報（例えば、番組名、放送日時、録画日時、ジャンル、放送局、番組内容、EPG情報など）を取得する。ここで、サーバ１２０がクライアント１２１から取得する情報は、番組に関する詳細な操作を行うためのものである。例えば、番組内容の表示、番組ジャンルの表示などの操作命令が入力されたときにサーバで処理できるように、番組の詳細な情報がクライアント１２１からサーバ１２０に送信される。 If it is determined that a program is selected, the selection state management unit 110 acquires, in the selection information acquisition process (S307), information related to the program selected on the screen (for example, program name, broadcast date and time, recording date and time, genre, broadcasting station, program content, and EPG information). The information that the server 120 acquires from the client 121 here is used to perform detailed operations on the program. For example, detailed program information is transmitted from the client 121 to the server 120 so that the server can handle operation commands such as displaying the program content or the program genre when they are input.
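One possible reading of the branching in steps S303 to S307 is sketched below. The function name, its parameters, and the callback used to request the selection information from the client are hypothetical; the sketch only illustrates that the selection information is fetched when a demonstrative was detected and the stored selection state equals 1.

```python
from typing import Callable, Optional


def acquire_selection_info(demonstrative_detected: bool,
                           selection_state: int,
                           request_from_client: Callable[[], dict]) -> Optional[dict]:
    """Return the selection information, or None if it is not needed."""
    # S303: without a demonstrative there is nothing to resolve.
    if not demonstrative_detected:
        return None
    # S304-S306: only a state of exactly 1 means a program is selected.
    if selection_state != 1:
        return None
    # S307: obtain the details of the selected program from the client.
    return request_from_client()
```

Passing the client request as a callback keeps the sketch independent of any particular transport between the communication circuits 113a and 113b.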

指示文字列検出の判定（S３０３）で指示文字列が検出されたと判断される、もしくは番組選択の判定（S３０６）で番組が選択されていないと判定される場合、対話管理部１０４は対話管理処理（S３０８）を行う。本実施の形態における対話管理処理では、対話管理部１０４は、音声認識された文字列の内容を理解し、入力言語情報や過去の文脈等を考慮してどのような応答をするかを決定し、応答結果を示す情報を出力する。例えば、TV番組の録画設定やTV画面の制御など機器の制御に関する応答を行うのであれば、対話管理部１０４の指示に従って制御信号生成処理（S３０９）が、機器の制御信号を生成することで、クライアント１２１の機器制御を行う。また、音声でユーザに応答するのであれば、対話管理部１０４の指示に従って音声合成部１０６が音声合成処理（S３１０）において合成音声を生成し、音声信号を出力する。 When it is determined in the instruction character string detection determination (S303) that an instruction character string has been detected, or when it is determined in the program selection determination (S306) that no program is selected, the dialogue management unit 104 performs the dialogue management process (S308). In the dialogue management process of the present embodiment, the dialogue management unit 104 interprets the content of the recognized character string, decides what kind of response to make in consideration of the input language information, the past context, and the like, and outputs information indicating the response result. For example, when the response concerns device control, such as setting a TV program recording or controlling the TV screen, the control signal generation process (S309) generates a device control signal according to an instruction from the dialogue management unit 104, thereby controlling the device of the client 121. When the response to the user is given by voice, the voice synthesis unit 106 generates synthesized voice in the voice synthesis process (S310) according to an instruction from the dialogue management unit 104 and outputs a voice signal.

信号送信処理（S３１１）では、通信回路１１３ａは、機器信号生成処理と音声合成処理で生成した機器の制御信号や音声の合成信号をクライアント１２１の通信回路１１３ｂに送信する。 In the signal transmission process (S311), the communication circuit 113a transmits the device control signal and the voice synthesis signal generated by the device signal generation process and the voice synthesis process to the communication circuit 113b of the client 121.

図５は、クライアント１２１が実行する処理のうち、選択状態の検出および出力に関する部分に関する処理フローを示す。 FIG. 5 shows a processing flow regarding a portion related to detection and output of a selected state among the processing executed by the client 121.

入力装置情報取得処理（S４００）は、入力装置１０８が情報を取得する処理である。ユーザが選択した番組の位置情報を入力装置１０８が取得する。選択状態検出処理（S４０１）では、入力装置１０８が番組を選択しているか否かを選択状態検出部１０９が検出する。入力装置１０８が番組を選択しているとは、例えば、入力装置１０８がリモコンであれば、ユーザが十字キーで番組を指定し、決定ボタンを押すことによって、番組が選択されている状態に遷移することをいう。決定ボタンを設けず、単に十字キーで番組を指定するだけで番組が選択されている状態に遷移するようにクライアント１２１を構成してもよい。入力装置１０８がタッチスクリーンまたはＰＣに接続されたディスプレイの場合、ユーザが特定の番組が表示されている箇所をタップまたはクリックすることによってその番組が選択されている状態に遷移するように構成してもよい。番組が選択されている状態で、ユーザが再度決定ボタンを押すなどして、非選択状態に変更することもできる。すなわち、入力装置情報取得処理で、どの位置を入力装置が指定しているかがわかり、選択状態検出処理で、どの位置のどの情報を選択しているかを知ることができる。 The input device information acquisition process (S400) is a process in which the input device 108 acquires information. The input device 108 acquires the position information of the program selected by the user. In the selection state detection process (S401), the selection state detection unit 109 detects whether or not the input device 108 is selecting a program. "The input device 108 is selecting a program" means, for example, that when the input device 108 is a remote controller, the user designates a program with the cross key and presses the enter button, causing a transition to the state in which the program is selected. The client 121 may instead be configured without an enter button, so that merely designating a program with the cross key causes the transition to the selected state. When the input device 108 is a touch screen or a display connected to a PC, the client may be configured so that tapping or clicking the location where a specific program is displayed causes a transition to the state in which that program is selected. While a program is selected, the user can also return it to the non-selected state, for example by pressing the enter button again. In other words, the input device information acquisition process reveals which position the input device is designating, and the selection state detection process reveals which information at which position is selected.
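The select and deselect behaviour of the enter button can be sketched as a small state holder. The class name, the grid-coordinate representation of a position, and the rule that pressing enter on a different position moves the selection are assumptions for illustration; the text only states that pressing the button again can return a selected program to the non-selected state.

```python
from typing import Optional, Tuple

Position = Tuple[int, int]  # assumed (column, row) of a program on the screen


class SelectionStateDetector:
    """Hypothetical stand-in for the selection state detection unit 109."""

    def __init__(self) -> None:
        self.selected: Optional[Position] = None

    def press_enter(self, cursor: Position) -> bool:
        """Toggle selection at the cursor; return True if something is selected."""
        if self.selected == cursor:
            # Pressing enter again on the selected program deselects it.
            self.selected = None
        else:
            # Otherwise the program under the cursor becomes the selection.
            self.selected = cursor
        return self.selected is not None
```

The returned flag corresponds to the selected or not-selected information that is saved in step S402 together with the position.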

選択状態保存処理（S４０２）では、クライアント１２１は、入力装置情報取得処理で取得した位置情報と選択状態検出処理で得られる選択中であるか否かの情報の保存処理を行う。選択情報検出処理（S４０３）では、選択情報検出部１１１は、選択状態保存処理にて保存されている位置情報に対応する番組の情報または番組に関する情報を検出する。本明細書において「番組に関する情報」とは、例えばテレビ番組に関するメタデータまたはテレビ番組のコンテンツをいう。メタデータは、例えばテレビ番組名、放送日時、ジャンル、放送局、チャンネル名、テレビ番組の内容、テレビ番組の人気度、テレビ番組のおすすめ度、出演者、CM企業の少なくとも１つを含む。メタデータとして録画日時を含んでもよい。また、テレビ番組のコンテンツは、人物、動物、車、地図、文字、数字の少なくとも１つの情報を含む。ただしこれらは一例であり、これらに限られない。番組の情報の検出には、番組名に関する情報をシステム内外のEPGから検索する方法や、番組名などに基づいてウェブ検索を行い、関連情報を取得する方法などがある。 In the selection state saving process (S402), the client 121 performs a process of saving the position information acquired in the input device information acquisition process and the information obtained in the selection state detection process whether or not the selection is being performed. In the selection information detection process (S403), the selection information detection unit 111 detects information about the program or information about the program corresponding to the position information stored in the selection state storage process. In the present specification, “information about a program” refers to metadata about a television program or contents of a television program, for example. The metadata includes, for example, at least one of a TV program name, broadcast date/time, genre, broadcasting station, channel name, TV program content, TV program popularity, TV program recommendation level, performer, and CM company. The recording date and time may be included as metadata. In addition, the content of the television program includes at least one information of a person, an animal, a car, a map, characters, and numbers. However, these are examples and not limited to these. To detect program information, there are a method of searching for information related to a program name from EPGs inside and outside the system, and a method of performing a web search based on the program name and acquiring related information.

信号受信処理（S４０４）では、クライアント１２１の通信回路１１３ｂは、サーバの信号送信処理によって、サーバ１２０から送信される機器制御信号および合成された音声信号を受信する。 In the signal receiving process (S404), the communication circuit 113b of the client 121 receives the device control signal and the synthesized audio signal transmitted from the server 120 by the signal transmitting process of the server.

出力処理（S４０５）では、出力回路１１２は信号受信処理で受信した制御信号生成処理（S３０９）の結果と音声合成処理（S３１０）の結果に基づいて、ユーザに処理結果を出力する。 In the output process (S405), the output circuit 112 outputs the processing result to the user based on the result of the control signal generation process (S309) and the result of the voice synthesis process (S310) received in the signal reception process.

なお、入力装置１０８で指定される対象は、番組などを表すアイコンやリストに限ったものではない。例えば、地図などの任意の位置をマウスで指定されるものであってもよい。地図上の指定には、画面上のｘ座標、ｙ座標を位置情報としてもよいし、地図特有の緯度経度情報に座標が表現されてもよい。緯度経度の値は住所に対応付けることが可能である。このため、緯度経度情報をキーボードから数値で入力して住所を指定するものであってもよい。あるいは、住所自体をキーボードで入力してもよい。住所は比較的長い文字列であるため、音声認識を失敗しやすいと考えられる。このような場合は、ユーザが入力しやすい方法で指し示す対象を指定すればよい。 It should be noted that the target specified by the input device 108 is not limited to the icons and lists representing programs and the like. For example, an arbitrary position such as a map may be designated with a mouse. For designation on the map, the x-coordinate and y-coordinate on the screen may be used as the position information, or the coordinates may be expressed in the latitude and longitude information specific to the map. The latitude and longitude values can be associated with an address. Therefore, the latitude and longitude information may be numerically input from the keyboard to specify the address. Alternatively, the address itself may be entered using the keyboard. Since the address is a relatively long character string, it is considered that voice recognition is likely to fail. In such a case, the target to be pointed may be designated by a method that is easy for the user to input.

位置指定の解除のためのボタン、アイコンを、位置指定した対象以外の位置に設けても良い。番組の選択の場合は、番組に関するアイコンの選択を繰返すことで、その番組の選択と選択解除を簡単に行える。しかし、地図上の特定の位置を指定する場合には、地図上の１点を選択することで、選択を解除することは難しい。そこで、図６に示すように、地図の画面上部に選択解除ボタンを設けてもよい。選択解除ボタンを押すことで選択解除を行うことができ、選択解除が容易になる。図６は、「○○スーパー」が指定されている例を示している。指定されている位置にカーソルを示す「矢印」が表示されている。図６では、選択解除は、地図右上にある選択解除枠を選択することで行われる。 A button or icon for canceling the position designation may be provided at a position other than the position designated object. In the case of selecting a program, the selection and deselection of the program can be easily performed by repeating the selection of the icon related to the program. However, when designating a specific position on the map, it is difficult to cancel the selection by selecting one point on the map. Therefore, as shown in FIG. 6, a selection cancel button may be provided at the top of the map screen. The selection can be canceled by pressing the selection cancel button, which facilitates the cancellation of the selection. FIG. 6 shows an example in which “XX super” is designated. An "arrow" indicating the cursor is displayed at the specified position. In FIG. 6, deselection is performed by selecting the deselection frame on the upper right of the map.

このような地図を表示させる表示機器は、テレビ、パーソナルコンピュータ、スマートフォン、タブレット端末などの情報機器の他、カーナビゲーションシステムに用いられてもよい。ユーザは任意の地点を指定（すなわち選択）した上で、その地点を示す指示語を含む音声指示により、所望の情報を得ることができる。例えば、「ここへの経路は？」、「ここから一番近いガソリンスタンドは？」といった音声指示に応答して要求された情報を提示するシステムを構築できる。 The display device that displays such a map may be an information device such as a television, a personal computer, a smartphone, or a tablet terminal, or may be used in a car navigation system. The user can designate (that is, select) an arbitrary point and then obtain desired information by giving a voice instruction that includes a demonstrative referring to that point. For example, it is possible to construct a system that presents the requested information in response to voice instructions such as "What is the route to here?" and "Which gas station is the closest to here?"

選択状態検出部１０９において、番組が選択されたことを示す情報に対して、その番組が選択された時間を付随させて記憶してもよい。これによって、番組が選択された時刻と現在の時刻との絶対差ｔが、所定の閾値よりも小さい場合と大きい場合とで、対応付けられる指示語を変えることもできる。例えば、絶対差ｔが所定の閾値よりも小さい場合には、「この」、「その」、「こっち」、「これ」、「それ」、「そっち」などのように、近称、中称と呼ばれる指示語で番組の指定を行い、前記絶対差ｔが所定の閾値より大きい場合は、「あっち」、「あれ」などのように遠称と呼ばれる指示語で番組の指定を行うようにしてもよい。このように、前記絶対差ｔの大きさに応じて指定する言葉を変えてもよい。 The selection state detection unit 109 may store the information indicating that a program has been selected together with the time at which the program was selected. This makes it possible to change the associated demonstrative depending on whether the absolute difference t between the time the program was selected and the current time is smaller or larger than a predetermined threshold. For example, when the absolute difference t is smaller than the threshold, the program may be designated with a proximal or medial demonstrative such as "kono", "sono", "kotchi", "kore", "sore", or "sotchi" (roughly "this" and "that"), and when the absolute difference t is larger than the threshold, the program may be designated with a distal demonstrative such as "atchi" or "are" (roughly "that over there"). In this way, the word used for designation may be changed according to the magnitude of the absolute difference t.
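The threshold rule on the absolute difference t can be written compactly as below. The function name, the use of seconds as the unit, and the returned labels are illustrative assumptions. The text does not say how the case where t equals the threshold is handled, so treating it as distal here is also an assumption.

```python
def expected_demonstrative_class(selected_at: float,
                                 now: float,
                                 threshold_s: float) -> str:
    """Classify which kind of demonstrative is associated with a selection.

    t = |now - selected_at| is compared with the threshold:
    t < threshold  -> proximal or medial demonstratives
    otherwise      -> distal demonstratives (assumed to include t == threshold)
    """
    t = abs(now - selected_at)
    return "proximal_or_medial" if t < threshold_s else "distal"
```

A dialogue manager could use this class to decide which stored selection a spoken demonstrative is allowed to refer to.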

本実施の形態では、指示代名詞を用いて特定の番組を選択するため、２つ以上の番組が指定されている場合には、指示代名詞がどちらの番組を指し示しているか分からない場合がある。この場合には、最初に指定した番組を「この番組」「その番組」というように、近称、中称の指示語で選択し、後に指定した番組を「あの番組」というように、遠称の指示語で選択してもよい。これにより、指示代名詞の使い分けで、複数の候補から一つを選択することができる。 In the present embodiment, a specific program is selected using a demonstrative pronoun, so when two or more programs have been designated, it may be unclear which program the pronoun refers to. In that case, the first designated program may be selected with a proximal or medial demonstrative such as "kono bangumi" (this program) or "sono bangumi" (that program), and the later designated program may be selected with a distal demonstrative such as "ano bangumi" (that program over there). In this way, one of a plurality of candidates can be selected by choosing the appropriate demonstrative pronoun.

指示代名詞を用いて番組を指定する際、不図示の個人認識部を活用した個人識別情報を利用してもよい。例えば、入力装置１０８によって番組を選択したときに、誰が番組を選択したかを識別し、個人識別情報を選択状態検出部１０９に保存してもよい。（この情報を個人識別情報Aとする）。このとき、個人識別情報とその人がどの番組を選択したかという情報が対で記憶される。さらに、指示文字列検出部１０３によって指示文字列が検出されたとき、その指示文字列を発話した個人を識別してもよい（この識別情報を個人識別情報Bとする）。個人識別情報Bに合致する個人識別情報Ａを選択状態検出部１０９が保持する情報の中から検索すれば、番組を選択した個人と指示文字列を発話した個人とが合致するか否かを判定できる。両者が合致したとき、その個人識別情報Ａと対で記憶されている番組を指示代名詞で指定された番組とし、操作対象とする。 When a program is designated using a demonstrative pronoun, personal identification information obtained by a personal recognition unit (not shown) may be used. For example, when a program is selected with the input device 108, the person who selected it may be identified and the personal identification information may be stored in the selection state detection unit 109 (this information is referred to as personal identification information A). At this time, the personal identification information and the information indicating which program that person selected are stored as a pair. Further, when the instruction character string detection unit 103 detects an instruction character string, the individual who uttered it may be identified (this identification information is referred to as personal identification information B). By searching the information held by the selection state detection unit 109 for personal identification information A that matches personal identification information B, it can be determined whether or not the individual who selected the program is the same as the individual who uttered the instruction character string. When the two match, the program stored as a pair with that personal identification information A is treated as the program designated by the demonstrative pronoun and becomes the operation target.
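The pairing of personal identification information A with a selected program, and the lookup by personal identification information B, can be sketched with a dictionary. The function name and the use of plain strings for identifiers and program names are assumptions; how a person is actually recognized is outside the scope of the sketch.

```python
from typing import Dict, Optional


def resolve_program_for_speaker(selections_by_person: Dict[str, str],
                                speaker_id: str) -> Optional[str]:
    """Return the program selected by the person who spoke, if any.

    `selections_by_person` maps personal identification information A
    (who selected) to the program that person selected. `speaker_id` is
    personal identification information B (who uttered the demonstrative).
    None means no stored A matches B, so there is no operation target.
    """
    return selections_by_person.get(speaker_id)
```

With this lookup, a demonstrative spoken by one viewer does not pick up a program that a different viewer selected.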

なお、リモコン上に搭載されたタッチパッドやジョイスティックなどによって画面の任意の場所を指定できるようにすれば、画面上の好きな所を指定できるようになる。これにより、例えば、画面上の特定の人物をカーソルで指定し、「その人がでている番組」と指定すると、カーソルで指定された人が出演している番組の一覧を画面に表示することも可能となる。画面上に一人しか写っていなければ、音声のみで「この人」が誰を表すかを知ることができる。しかし、図７Ａように人物が二人以上写っている場合には、音声のみで人物を指定することが難しい。カーソルを使うことにより、図７Ｂに示すように、音声では指定することができないような、TVに出演する複数人物のうち一人を選択することができる。これにより、選択された人物に特定した情報検索が可能となる。カーソルが指す人物が誰であるかを認識するためには、既存の顔検出技術、顔認識技術を用いることができる。図７Ａの画面例６０１は、画面上の人物をカーソルで指定した例である。画面例６０１では、左側の人物上にカーソルが合わせられている。カーソルで人物を指定すると、その近辺の顔検出、顔認識処理が行われる。その後、図７Ｂに示すように、どこの、誰が認識されたかをディスプレイに表示する、あるいは音声でユーザに提示することで、ユーザは視覚的に誰が指定されたかを確認できる（画面例６０２）。 If an arbitrary place on the screen can be designated with a touch pad, a joystick, or the like mounted on the remote controller, the user can point at any desired location on the screen. For example, when the user designates a specific person on the screen with the cursor and says "programs in which that person appears", a list of programs featuring the person designated by the cursor can be displayed on the screen. If only one person is shown on the screen, who "this person" refers to can be determined from the voice alone. However, when two or more persons are shown, as in FIG. 7A, it is difficult to designate a person by voice alone. By using the cursor, as shown in FIG. 7B, the user can select one of several people appearing on TV, which cannot be done by voice alone. This makes it possible to search for information specific to the selected person. Existing face detection and face recognition technologies can be used to recognize who the person indicated by the cursor is. Screen example 601 in FIG. 7A is an example in which a person on the screen is designated with the cursor. In screen example 601, the cursor is placed on the person on the left. When a person is designated with the cursor, face detection and face recognition are performed in the vicinity of the cursor. Thereafter, as shown in FIG. 7B, the location and identity of the recognized person are shown on the display or presented to the user by voice, so that the user can confirm who has been designated (screen example 602).

この例では、人物の検出例について述べたが、一般物体認識技術を利用することで、上述したように、動物、車、文字、数字などを認識することも可能である。 In this example, the example of detecting a person has been described, but by using the general object recognition technique, it is possible to recognize an animal, a car, a character, a number, or the like as described above.

画面上に表示された地図を用いて場所を検索する場合には、表示画面に特定地域における地図を表示し、カーソルで指定された地図上の任意の座標、または地図上のオブジェクトを基準とした検索が可能となる。図８Ａおよび図８Ｂに示すように、例えば、「この場所より北にあるドラッグストア」というと、カーソルの場所より北のドラッグストアを表示することができる。図８Ａの表示例７０１では、ｘｘ公園を指定しており、図面上は矢印にて表示されている。音声による検索を行うことで、図８Ｂの表示例７０２のようにドラッグストアの場所が提示される。表示例７０２では、検索位置が点線丸で表示される。これにより、利用者は、詳細な住所を知ることが無くても、直感的に現在指している位置の情報と音声による地図検索が可能となる。同様に、「そこまでの行き方は？」という聞き方をすることで、現在位置からカーソルが指定している位置までの行き方を検索（路線検索やカーナビゲーション）できる。これにより、通常は、地図で位置を確認した後、その位置までの行き方検索をするために数ステップのボタン操作が必要になるが、音声の入力で素早く処理を完了させることができ、設定が簡便になる。 When searching for a place using a map displayed on the screen, a map of a specific area is displayed on the display screen, and a search can be performed relative to arbitrary coordinates on the map designated by the cursor or relative to an object on the map. As shown in FIGS. 8A and 8B, for example, saying "drugstores north of this location" causes drugstores north of the cursor location to be displayed. In display example 701 of FIG. 8A, xx Park is designated, as indicated by the arrow in the drawing. Performing the search by voice presents the locations of drugstores as in display example 702 of FIG. 8B. In display example 702, the found locations are indicated by dotted circles. As a result, the user can intuitively perform a map search that combines the currently indicated position with voice, without knowing the detailed address. Similarly, by asking "How do I get there?", the user can search for directions (route search or car navigation) from the current position to the position designated by the cursor. Normally, after confirming a position on the map, several button operations are required to search for directions to that position, but with voice input the process can be completed quickly and the setup becomes simpler.
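A query such as "drugstores north of this location" can be resolved by combining the cursor position with a category filter, as sketched below. The `Place` record, the category strings, and the sample coordinates in the usage example are hypothetical. "North" is interpreted simply as having a larger latitude than the cursor position, which is an assumption of the sketch rather than something stated in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Place:
    name: str
    category: str
    lat: float  # latitude in degrees, positive to the north
    lon: float  # longitude in degrees


def find_north_of(places: List[Place], category: str,
                  cursor_lat: float) -> List[Place]:
    """Return places of the given category lying north of the cursor."""
    return [p for p in places
            if p.category == category and p.lat > cursor_lat]


# Usage with made-up data: the cursor is on a park at latitude 34.70.
places = [
    Place("Drugstore A", "drugstore", 34.72, 135.50),
    Place("Drugstore B", "drugstore", 34.68, 135.49),
    Place("Supermarket C", "supermarket", 34.75, 135.51),
]
north_drugstores = find_north_of(places, "drugstore", 34.70)
```

Only the demonstrative needs to be resolved from the cursor; the category and the direction come from the recognized utterance.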

なお、本実施の形態では、選択状態の送受信と選択情報の送受信の処理を分けて記載したが、選択状態の送信時に選択情報も送信するような形態であっても構わない。このような場合、サーバとクライアントのデータ送受信のシーケンスは、図９のようになる。システムの構成、サーバとクライアントの処理フローはそれぞれ図２、図４、図５の通りである。以下、重複する説明は省略することがある。 In the present embodiment, the transmission and reception of the selection state and the transmission and reception of the selection information have been described as separate processes, but the selection information may instead be transmitted together with the selection state. In that case, the sequence of data transmission and reception between the server and the client is as shown in FIG. 9. The system configuration and the processing flows of the server and the client are as shown in FIGS. 2, 4, and 5, respectively. Redundant description may be omitted below.

図９は、選択状態の送信時に選択情報も送信する場合におけるサーバ１２０とクライアント１２１との通信処理のシーケンスを示す。このシーケンスは、ユーザがリモコンなどの入力装置１０８によって表示画面上の一部を指定することによって開始される。 FIG. 9 shows a sequence of communication processing between the server 120 and the client 121 when the selection information is also transmitted when transmitting the selected state. This sequence is started by the user designating a part on the display screen with the input device 108 such as a remote controller.

ステップS８００は、入力装置情報取得処理である。選択状態検出部１０９は、入力装置１０８がクライアント１２１の画面上のどこを指しているかを検出する。 Step S800 is an input device information acquisition process. The selection state detection unit 109 detects where the input device 108 is pointing on the screen of the client 121.

ステップS８０１は、選択状態検出処理である。選択状態検出部１０９は、入力装置情報取得処理で指定された位置が、入力装置１０８によって指定されているか否かを取得する。 Step S801 is a selection state detection process. The selection state detection unit 109 acquires whether or not the position specified by the input device information acquisition process is specified by the input device 108.

ステップS８０２は、選択情報送信処理である。通信回路１１３ｂは、選択された項目に関する情報をサーバ１２０に送信する。 Step S802 is a selection information transmission process. The communication circuit 113b transmits information regarding the selected item to the server 120.

Step S803 is a selection information reception process. The communication circuit 113a of the server 120 receives the selection information from the client 121.

Step S804 is a selection state management process. In this process, the selection state management unit 110 manages, on the server 120 side, the selection state made via the input device 108 and received in the selection state reception process. The selection state management unit 110 represents the state in which the input device 108 is selecting a specific item as 1 and the state in which nothing is selected as 0, and stores this 0 or 1 value in a specific memory on the server 120. In this example, the selection information has already been transmitted as well, so the transmitted information is also stored in the memory. For example, for a list of television programs, the program name, broadcast date, and content are stored; for a map, the place name, latitude and longitude, and housing information of the selected location are stored.
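The bookkeeping performed in step S804 can be sketched as follows. This is only an illustration under the assumption that the selection state and the selection information arrive in a single message; the class name `SelectionStateStore` and the dictionary keys are hypothetical.

```python
from typing import Optional


class SelectionStateStore:
    """Holds the selection state (1 = item selected, 0 = nothing selected)
    together with the selection information sent by the client."""

    def __init__(self) -> None:
        self.state = 0
        self.info: Optional[dict] = None

    def update(self, selected: bool, info: Optional[dict] = None) -> None:
        self.state = 1 if selected else 0
        # Selection information is kept only while an item is selected.
        self.info = dict(info) if selected and info is not None else None


store = SelectionStateStore()
# Hypothetical selection information for an item in a TV program list.
store.update(True, {"title": "Evening News", "date": "2014-05-13"})
print(store.state, store.info)
store.update(False)
print(store.state, store.info)
```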

Step S805 is a voice request transmission process. The server 120 transmits to the client 121 a signal requesting that an audio signal be sent.

Step S806 is a voice request reception process. Upon receiving the voice request, the client 121 permits voice input from the microphone 101 associated with the client 121.

In step S807, the client 121 permits voice input and performs A/D (analog-to-digital) conversion. The analog voice is thereby converted into a digital audio signal. In the audio signal transmission process of step S808, the communication circuit 113b of the client 121 transmits the digital audio signal to the server 120.

In step S809, the communication circuit 113a of the server 120 receives the audio signal transmitted from the client 121.

In step S810, the voice recognition unit 102 performs voice recognition processing. Then, in step S811, the instruction character string detection unit 103 detects an instruction character string.

Step S812 is a dialogue management process. Based on the received selection information and the result of the instruction character string detection process, the dialogue management unit 104 outputs the device control, the method of responding by voice, and the like. The dialogue management process is the same as described above.

Step S813 is a response result transmission process. In this process, the control signal output by the dialogue management process, the ID corresponding to the control signal, the synthesized voice, and the text for synthesizing voice are transmitted to the client 121.

Step S814 is a response result reception process. The communication circuit 113b of the client 121 receives the response result from the server 120.

Step S815 is a response result output process. The output circuit 112 outputs the device control signal, synthesized voice, text, and the like received in the response result reception process to the user terminal or to the device to be controlled through the output means of the device.

With the above configuration and processing, the processing delay can be reduced even when the voice recognition processing is performed on the server.

(Embodiment 2)
FIG. 10 is a sequence diagram showing an outline of the control method that the information providing system according to the present embodiment executes for a display device. The information providing system of the present embodiment differs from that of Embodiment 1 in that the display device also has a voice recognition function. The following description focuses on the differences from Embodiment 1, and description of overlapping matters may be omitted.

The display device control method according to the present embodiment causes the computer of the display device to execute the processing shown in FIG. 10. In this control method, a display screen including a plurality of selectable items is first displayed on a display mounted on or connected to the display device (step S900). Next, it is detected that one of the plurality of items has been selected on the display screen of the display (step S901). Steps S900 and S901 are executed repeatedly each time the item selection changes.

When the display device receives a voice instruction, it determines whether or not one item is selected (step S902). If no item is selected, the display device transmits the received voice information to another computer in the information providing system (hereinafter referred to as the "server"). If an item is selected, the display device determines whether or not the voice instruction is executable (step S903). If the voice instruction is executable, the display device executes the instruction content (step S904). If the voice instruction is not executable, the display device transmits the voice information to the server. The server recognizes and executes the voice instruction that the display device could not execute (steps S911 and S912).

Here, an executable voice instruction means a voice instruction that can be processed within the range of functions programmed into the display device in advance. For example, suppose the display device can correctly recognize a voice instruction consisting of a combination of a specific instruction word and specific instruction content, but cannot recognize other voice instructions (for example, a web search instruction). In that case the former is executable and the latter is not. The server then executes the latter voice instruction on behalf of the display device and returns the response result to the display device.

Thus, in the control method of the present embodiment, when a voice instruction including first voice information representing instruction content is received from the voice input device while selection of one item has been detected, the computer of the display device is caused to recognize the instruction content from the first voice information and execute it. On the other hand, when selection of one item has not been detected, or when it is determined that the instruction content cannot be executed, the voice instruction is transmitted to the server. As a result, the display device accesses the server only when necessary, so the processing delay can be reduced.
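The branching of FIG. 10 (steps S902 to S904, with fallback to the server) can be sketched as a small routing function. This is a minimal illustration that assumes the recognized instruction is already available as text; the set `LOCAL_COMMANDS` and the function name are hypothetical stand-ins for "the functions programmed into the display device in advance".

```python
# Hypothetical set of instructions the display device can execute by itself.
LOCAL_COMMANDS = {"record", "show details"}


def route_voice_instruction(item_selected: bool, instruction: str) -> str:
    """Return "local" if the display device handles the instruction itself,
    or "server" if the voice information must be sent to the server."""
    if not item_selected:              # S902: no item selected
        return "server"
    if instruction in LOCAL_COMMANDS:  # S903: is the instruction executable?
        return "local"                 # S904: execute on the display device
    return "server"                    # handled by the server (S911, S912)


print(route_voice_instruction(True, "record"))
print(route_voice_instruction(True, "web search"))
print(route_voice_instruction(False, "record"))
```

Only the second and third calls would involve network access, which is the source of the delay reduction described above.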

FIG. 11 is a sequence diagram showing an example of a control method for a display device capable of recognizing a voice instruction consisting of a combination of an instruction word and instruction content. In this control method, steps S905 to S907 are executed in place of step S903 of FIG. 10. Apart from this point, the method is the same as that of FIG. 10. In step S905, the display device determines whether or not the voice instruction includes first voice information indicating instruction content. If the result is No, the display device transmits the voice information to the server. If the result is Yes, the display device recognizes the instruction content (step S906). In the subsequent step S907, it is determined whether or not the voice instruction includes second voice information indicating an instruction word. If the result is No, the display device transmits the voice information to the server. If the result is Yes, the display device executes the instruction content (step S904).

Thus, the control method shown in FIG. 11 causes the computer of the display device to execute the instruction content when selection of one item has been detected, the instruction content has been recognized from the first voice information, and the voice instruction has been determined to include the second voice information. On the other hand, when selection of one item has not been detected, when the instruction content has not been recognized from the first voice information, or when the voice instruction has not been determined to include the second voice information, the voice instruction is transmitted to the server. As a result, the display device accesses the server only when necessary, so the processing delay can be reduced.
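The three conditions of FIG. 11 can be sketched as below. The sketch uses English stand-ins for the Japanese vocabulary and plain substring or word matching instead of real voice recognition; `COMMANDS`, `DEMONSTRATIVES`, and the function names are hypothetical.

```python
# Hypothetical vocabularies; the disclosure uses Japanese words.
COMMANDS = ("record", "show details")  # first voice information (instruction content)
DEMONSTRATIVES = ("this", "that")      # second voice information (instruction word)


def recognize_command(text: str):
    """S905/S906: return the instruction content found in the text, or None."""
    lowered = text.lower()
    for command in COMMANDS:
        if command in lowered:
            return command
    return None


def has_demonstrative(text: str) -> bool:
    """S907: does the utterance contain an instruction word?"""
    return any(word in DEMONSTRATIVES for word in text.lower().split())


def handle(item_selected: bool, text: str):
    """Return ("local", command) or ("server", None)."""
    if not item_selected:            # S902
        return ("server", None)
    command = recognize_command(text)
    if command is None:              # S905: No
        return ("server", None)
    if not has_demonstrative(text):  # S907: No
        return ("server", None)
    return ("local", command)        # S904


print(handle(True, "record this program"))
print(handle(True, "record the news"))
```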

FIG. 12 shows the configuration of a system that employs the program information presentation method of the present embodiment. This method presents program information to the user by using a voice recognition function that recognizes the user's voice. The system includes a client 121 and a server 120. The client 121 corresponds to the display device described above or to another device connected to the display device. The client 121 may be, for example, a television, a recorder, a smartphone, or a tablet terminal. In the example of FIG. 12, the client 121 includes a microphone 101 serving as a voice input device, an input device 108, an output circuit 112, a communication circuit 113b, and a control circuit 114d that controls them. The control circuit 114d of the present embodiment differs from the control circuit 114b shown in FIG. 2 in that, in addition to the selection state detection unit 109 and the selection information detection unit 111, it has a voice recognition unit 102b, an instruction character string detection unit 103, and a command character string detection unit 115.

The server 120 includes a communication circuit 113a that communicates with the client 121, and a control circuit 114c. The control circuit 114c has five functional units: a voice recognition unit 102, a dialogue management unit 104, a response sentence generation unit 105, a voice synthesis unit 106, and a control signal generation unit 107.

In the present embodiment, the microphone 101, which is the voice input device, senses the user's voice signal, and the voice recognition unit 102b converts the sensed voice signal into a character string. The instruction character string detection unit 103 determines whether the converted character string contains a demonstrative pronoun. The command character string detection unit 115 detects whether the converted character string contains a command character string, such as one for controlling a device. The input device 108 allows the user to select one program when information on a plurality of programs is shown on the display.

When a program is selected with the input device 108, information on the selected position on the screen is input to the system. The selection state detection unit 109 determines whether or not a program is selected by the input device 108. The selection information detection unit 111 detects the position information of the program selected by the input device 108, information about the selected program, and the like. The output circuit 112 receives the outputs of the response sentence generation unit 105, the voice synthesis unit 106, and the control signal generation unit 107, and performs output processing such as displaying the response sentence on the display, playing the synthesized voice through the speaker, controlling the device with the generated control signal, and displaying the control result on the display.

The communication circuits 113a and 113b include communication modules for communication between the server 120 and the client 121. As described above, the communication modules communicate using an existing communication scheme such as Wi-Fi (registered trademark) or Bluetooth (registered trademark). Any type of communication module may be used as long as it has this function. The voice signal synthesized by the voice synthesis unit 106 and the control signal for controlling the device are sent to the output circuit 112. The output circuit 112 outputs the voice signal, the signal for controlling the device, and information indicating the control result.

The voice recognition unit 102a performs voice recognition on the server 120. The dialogue management unit 104 manages the history of interactive processing between the user and the device, the response strategy that determines what kind of dialogue processing is performed, and the like. The response sentence generation unit 105 generates a character string for responding to the user according to the input character string. The voice synthesis unit 106 converts the character string generated by the response sentence generation unit into voice. The control signal generation unit 107 generates a device control command according to the content of the dialogue.

Each component of the control circuit 114c in the server 120 and of the control circuit 114d in the client 121 described above may be realized by the computer (for example, a CPU) of the server 120 executing a computer program, or each may be provided as a separate, independent circuit or the like.

For example, each process of the server 120 shown in FIG. 13, described later, can be realized as a control method of the server 120 performed by the computer of the server 120 executing a computer program. Similarly, each process of the client 121 shown in FIG. 13 can be realized as a control method of the client 121 performed by the computer of the client 121 executing a computer program.

The present embodiment differs from the related art and from Embodiment 1 in that both the client 121 and the server 120 perform voice recognition processing. The dialogue management unit 104 and the response sentence generation unit 105, which execute processing after voice recognition, and the voice synthesis unit 106 and the control signal generation unit 107, which generate the processing results, may each be provided in the client 121 rather than in the server 120.

FIG. 13 shows the sequence of communication between the server 120 and the client 121. This sequence starts when the user designates a part of the display screen with the input device 108, such as a remote controller.

Step S500 is an input device information acquisition process. The selection state detection unit 109 acquires information indicating the position on the display screen designated by the input device 108.

Step S501 is a selection state detection process. The selection state detection unit 109 detects whether or not one program has been selected. This detection is performed by determining, based on the position information acquired in the input device information acquisition process, whether the position designated by the input device 108 corresponds to the position of an item representing a program.

In step S502, the client 121 takes in voice and performs A/D (analog-to-digital) conversion. The analog voice is thereby converted into a digital audio signal.

Step S503 is a voice recognition process in which the client 121 recognizes the input voice.

In step S504, the instruction character string detection unit 103 performs instruction character string detection. In this process, the instruction character string is detected by analyzing the text produced by the voice recognition process.

In step S505, the command character string detection unit 115 performs command character string detection. In this process, the command character string is detected by analyzing the text produced by the voice recognition process.

In step S506, the selection information detection unit 111 performs a selection information detection process. The input device 108 detects the position information acquired in the information acquisition process.

Step S507 is an audio signal transmission process. The communication circuit 113b of the client 121 transmits the audio signal to the server 120.

Step S508 is an audio signal reception process. The communication circuit 113a of the server 120 receives the audio signal.

Step S509 is a voice input process. The audio signal received by the communication circuit 113a is input into the server 120.

Step S510 is a server-side voice recognition process. The voice recognition unit 102a performs voice recognition processing on the server 120.

Step S511 is a dialogue management process. Based on the received selection information and the result of the instruction character string detection process, the dialogue management unit 104 determines the device control method and the voice response method, and outputs the information to be returned to the client. The dialogue management process is as described in Embodiment 1.

Step S512 is a response result transmission process. In this process, the control signal output by the dialogue management process, the ID corresponding to the control signal, the synthesized voice, and the text for synthesizing voice are transmitted to the client 121.

Step S513 is a response result reception process. The communication circuit 113b of the client 121 receives the response result from the server 120.

Step S514 is a response result output process. The output circuit 112 outputs the device control signal, synthesized voice, text, and the like received in the response result reception process to the user terminal or to the device to be controlled through the output means of the device.

The program information presentation method using the voice recognition function is described in more detail below, separately for the processing in the server 120 and the processing in the client 121.

FIG. 14 shows the processing flow relating to the server 120 in the configuration shown in FIG. 1.

First, in the voice input process (S600), a voice signal is input from the microphone 101. In the present embodiment, the microphone is assumed to be provided in the client 121. The audio signal that has undergone A/D conversion on the client 121 is transferred to the server 120.

In the server-side voice recognition process (S601), the voice recognition unit 102a performs recognition processing on the input voice. The voice recognition process converts the input audio signal into character string data. Performing voice recognition on the server 120 makes it possible to use acoustic models and language models built from large-scale data sets. In addition, the computing power of the server 120 is higher than that of the client 121. Because acoustic models and language models trained on large-scale data by statistical learning methods can be used, there is the advantage that the recognition rate for a wide variety of words is high. Furthermore, with the spread of smartphones, FTTH, and the like, an environment in which terminals are always connected to the network is becoming established, so performing voice recognition on the server 120 is practical.

In the dialogue management process (S602), the dialogue management unit 104 interprets the content of the recognized character string and outputs what kind of response to make, taking into account the input language information, the past context, and the like. Depending on the output of the dialogue management process, a determination is made as to whether or not the response is a control signal (S603). For example, if the response concerns device control, such as setting a TV program recording or controlling the TV screen, the control signal generation unit 107 generates a device control signal in the control signal generation process (S604). In the control signal transmission process (S605), the communication circuit 113a of the server 120 transmits the control signal generated in the control signal generation process to the client 121. Device control is thereby performed on the client 121 side.

If the result of step S603 is No, or when step S605 has finished, it is determined whether or not to respond to the user by voice (S606). When responding to the user by voice, a response sentence is generated in the response sentence generation process (S607). Next, it is determined whether the output is voice or text (S608). If the output is voice, the voice synthesis unit 106 generates synthesized voice in the voice synthesis process (S609) and outputs a voice signal. In the voice transmission process (S608), the communication circuit 113a of the server 120 transmits the data converted from text into synthesized voice to the client 121.

If the output is text, the response sentence transmission process (S607) is performed. The response sentence generation unit 105 generates text in the response sentence generation process, and the response sentence, which is the generated text, is transmitted from the server 120 to the client 121.
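The server-side decisions of steps S603 to S609 can be sketched as a function that turns the outcome of dialogue management into a list of messages for the client. The dictionary keys (`control_command`, `reply_text`, `output`) and the `synthesize` placeholder are hypothetical; real speech synthesis is outside the scope of this sketch.

```python
def synthesize(text: str) -> bytes:
    """Placeholder for the voice synthesis unit; returns fake audio bytes."""
    return b"PCM:" + text.encode("utf-8")


def build_server_response(decision: dict) -> list:
    """Build the messages the server sends back to the client."""
    messages = []
    if decision.get("control_command"):       # S603: is it a control signal?
        # S604/S605: generate and send the control signal
        messages.append(("control", decision["control_command"]))
    if decision.get("reply_text"):            # S606: respond to the user?
        text = decision["reply_text"]         # S607: response sentence
        if decision.get("output") == "voice":  # S608: voice or text?
            messages.append(("audio", synthesize(text)))  # S609
        else:
            messages.append(("text", text))
    return messages


print(build_server_response(
    {"control_command": "REC_START", "reply_text": "Recording set.", "output": "text"}
))
```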

FIG. 15 shows the processing flow for the part of the processing executed by the client 121 that relates to detection of the selection state and to output.

The input device information acquisition process (S700) is a process in which the input device 108 acquires information. The input device 108 acquires the position information of the program selected by the user. In the selection state detection process (S701), the selection state detection unit 109 detects whether or not the input device 108 is selecting a program on the TV screen. If the input device 108 is a remote controller, for example, "selecting a program" means that the user designates a program with the directional keys and presses the enter button, which causes a transition to the state in which the program is selected. While a program is selected, the user can press the enter button again to return to the non-selected state. In other words, the input device information acquisition process reveals which position the input device is designating, and the selection state detection process reveals which information at which position is selected, or that nothing is selected.
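The remote-controller behavior described above (move with the directional keys, press enter to select, press enter again to deselect) can be sketched as a small state holder. The class `ProgramSelector` and its method names are hypothetical and only illustrate the state transition.

```python
class ProgramSelector:
    """Tracks the cursor position and the selected/non-selected state."""

    def __init__(self, programs):
        self.programs = list(programs)
        self.cursor = 0       # position designated by the input device (S700)
        self.selected = None  # index of the selected program, or None

    def move(self, step: int) -> None:
        """Directional key: move the cursor, wrapping around the list."""
        self.cursor = (self.cursor + step) % len(self.programs)

    def press_enter(self) -> None:
        """Enter button: select the program under the cursor, or deselect."""
        if self.selected is not None:
            self.selected = None
        else:
            self.selected = self.cursor

    def selection_state(self) -> int:
        """S701: 1 if a program is selected, 0 otherwise."""
        return 1 if self.selected is not None else 0

    def selection_info(self):
        if self.selected is None:
            return None
        return {"index": self.selected, "title": self.programs[self.selected]}


selector = ProgramSelector(["News", "Drama", "Sports"])
selector.move(1)
selector.press_enter()
print(selector.selection_state(), selector.selection_info())
```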

In the voice input process (S702), the communication circuit 113a receives the voice transmitted from the client 121. In the voice recognition process (S703), the voice recognition unit 102 recognizes the input voice. Compared with server-based voice recognition, voice recognition on the client 121 is limited in the number of words that can be registered. To reduce misrecognition with limited computation and memory, it is desirable to register only the minimum necessary words in the dictionary. The dictionary may be stored in a memory (not shown) in the circuit that functions as the voice recognition unit 102, or in a storage device (not shown) provided in the client 121.

The minimum necessary words are, for example, the set of words corresponding to the buttons of the remote controller, such as "power ON", "power OFF", "volume up", and "volume down". In addition, because the present embodiment performs the instruction character string detection process and the command character string detection process described later, the vocabulary used for those detections is registered in the dictionary in advance. For example, to recognize instruction character strings, demonstratives such as 「これ」, 「それ」, and 「あれ」 ("this", "that", "that over there"), their adnominal forms 「この」, 「その」, and 「あの」, and the forms 「これの」, 「それの」, and 「あれの」 ("of this", "of that", "of that over there") are registered. Command vocabulary such as "display contents" and "search" is also registered. This allows the voice recognition unit 102 to recognize utterances such as "display the contents of that program". As a result, the instruction character string and the command character string can be detected in the subsequent processing.

In the instruction character string detection process (S704), the instruction character string detection unit 103 detects an instruction character string in the character string obtained by voice recognition. An instruction character string is one of the demonstratives described above. Detection proceeds as follows. First, the instruction character string detection unit 103 divides the input character string into words and parts of speech by morphological analysis. A morpheme is the smallest meaningful unit of a sentence, and morphological analysis divides a sentence into multiple morphemes such as words and parts of speech. A list of instruction character strings is prepared in advance, and if a word in that list matches one of the resulting morphemes, the instruction character string is regarded as detected in the sentence.
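The list-matching step in S704 can be sketched as follows. This is only an illustration under stated assumptions: a real system would use a proper morphological analyzer, whereas the `tokenize` function here is a toy greedy longest-match segmenter over a small hypothetical lexicon, and the names `DEMONSTRATIVES`, `LEXICON`, and `detect_instruction_strings` are not taken from the patent.

```python
# Demonstratives taken from the examples in the description.
DEMONSTRATIVES = {"これ", "それ", "あれ", "この", "その", "あの",
                  "これの", "それの", "あれの"}

# Toy lexicon standing in for a morphological analyzer's dictionary (assumption).
LEXICON = DEMONSTRATIVES | {"番組", "内容", "表示", "検索", "の", "を"}


def tokenize(text, lexicon=LEXICON):
    """Greedy longest-match segmentation.

    Characters not covered by the lexicon become one-character tokens.
    """
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


def detect_instruction_strings(text):
    """Return the tokens that match the prepared demonstrative list (S704)."""
    return [token for token in tokenize(text) if token in DEMONSTRATIVES]
```

For the utterance 「その番組の内容を表示」 this sketch yields the single demonstrative 「その」, while an utterance such as 「電源ON」 yields an empty list, which corresponds to the case where no instruction character string is detected.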

In the command character string detection process (S705), the command character string detection unit 115 detects a command character string in the voice recognition result. As in the instruction character string detection process, the command character string detection unit 115 performs morphological analysis to divide the sentence, and detects the command character string by comparing the divided sentence with a word list registered in advance. The command character strings registered in the word list are words that correspond to operation commands, for example "contents, display", "search", and "record".
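A minimal sketch of the word-list comparison in S705 is shown below. It assumes the utterance has already been split into morphemes, and it treats a multi-word entry such as "contents, display" as a sequence of tokens that must appear in that order. The command identifiers (`SHOW_CONTENTS` and so on) and the in-order matching rule are assumptions made for illustration; the patent only states that the divided sentence is compared with a registered word list.

```python
# Hypothetical word list: token sequence -> operation command identifier.
COMMAND_WORD_LIST = {
    ("内容", "表示"): "SHOW_CONTENTS",  # "contents, display"
    ("検索",): "SEARCH",                # "search"
    ("録画",): "RECORD",                # "record"
}


def contains_in_order(tokens, pattern):
    """True if every word of pattern occurs in tokens, in the same order."""
    remaining = iter(tokens)
    # "word in remaining" consumes the iterator up to the match,
    # so later words must appear after earlier ones.
    return all(word in remaining for word in pattern)


def detect_command(tokens):
    """Return the first matching operation command, or None if none is found."""
    for pattern, command in COMMAND_WORD_LIST.items():
        if contains_in_order(tokens, pattern):
            return command
    return None
```

With the tokens of 「その番組の内容を表示」 the sketch returns `SHOW_CONTENTS`; when no registered word appears it returns `None`, which corresponds to the case where no command character string is detected.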

Next, the selection state detection unit 109 uses the information obtained by the selection state detection process to determine whether an area on the screen is selected (S706). For example, when a program on the TV screen is selected, the selection state detection unit 109 outputs a flag indicating the program selection state. In that case, 1 is returned when a program is selected, and a value other than 1 is output when no program is selected. This value indicates the selection state of the program and allows the state to be determined. Next, the instruction character string detection unit 103 determines whether an instruction character string has been detected (S707), and the command character string detection unit 115 determines whether a command character string has been detected (S708). As described above, these determinations are made by matching against the vocabulary of the lists registered in advance.

When the selection state detection unit 109 determines that no item is selected, when the instruction character string detection unit 103 does not detect an instruction character string, or when the command character string detection unit 115 does not detect a command character string, the signal transmission/reception process (S709) is performed. In this process, the communication circuit 113a transmits the voice signal to the server 120 and then receives a signal indicating the response result returned from the server 120. The signal indicating the response result includes a voice signal generated by voice recognition and dialogue processing in the server 120, or a device control signal. The output circuit 112 performs the output process (S711) and notifies the user of the processing result.

When, in steps S706 to S708, the selection state detection unit 109 determines that an item is in the selected state and both the instruction character string and the command character string have been detected, the selection information detection process (S710) is performed. In the selection information detection process (S710), the selection information detection unit 107 acquires the position information, the TV program information, and other information obtained in the input device information acquisition process. For example, it acquires the on-screen position of the program designated on the TV screen with the input device 108 and information related to that program, such as the metadata about the television program described above or the content of the television program. Based on the acquired information and the command character string, the output circuit 112 performs the output process (S711) and controls the device.
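The branching across S706 to S710 can be summarized in a short sketch. It is an illustration under assumptions rather than the patented implementation: `route_voice_instruction`, its parameters, and the returned dictionaries are hypothetical, and `send_to_server` stands in for whatever the communication circuit 113a does to upload the voice signal and obtain the server's response.

```python
SELECTED = 1  # per the description, 1 means selected; any other value means not selected


def route_voice_instruction(selection_flag, instruction_string, command_string,
                            selected_item, send_to_server):
    """Choose between local execution (S710) and hand-off to the server (S709).

    send_to_server is a hypothetical callable that uploads the voice signal
    and returns the server's response.
    """
    handled_locally = (
        selection_flag == SELECTED          # S706: an item is selected
        and instruction_string is not None  # S707: a demonstrative was detected
        and command_string is not None      # S708: a command was detected
    )
    if handled_locally:
        # S710: pair the command with information about the selected item.
        return {"handled_by": "client",
                "command": command_string,
                "target": selected_item}
    # S709: if any check fails, forward the voice signal to the server.
    return {"handled_by": "server", "response": send_to_server()}
```

In this sketch the server is contacted only when one of the three checks fails, which reflects the stated aim of keeping client-server traffic to a minimum.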

As described above, according to the present embodiment, voice instructions are recognized not only by the server 120 but also by the client 121. The client 121 transmits the voice signal to the server 120 only when it cannot execute the voice instruction itself; it then hands the processing over to the server 120 and waits for the response result. As a result, processing with few variations in voice instructions, such as operations related to television programs, can be executed on the client 121 side, and other processing can be executed on the server 120 side. Because the present embodiment keeps access between the client 121 and the server 120 to a minimum, processing delay can be reduced.

The various modifications described in the first embodiment can also be applied to the present embodiment. The first embodiment and the second embodiment may be combined to form a new embodiment.

In the embodiments described above, the microphone 101, which is the voice input device, is described as being provided in the client. This configuration is only an example. For example, the microphone 101 may exist as a device separate from the client. It is sufficient that the client is connected to such a microphone 101 and can receive voice input through it.

Even when the microphone 101 is provided in the client, it exists inside the client 121 as an independent device and is merely wired internally. The microphone 101 can be provided so that it is easily attached and detached. The microphone 101 is not an essential component of the client 121. It is sufficient that the client 121 is connected to the microphone 101, whether internally or externally.

In the embodiments described above, the output circuit 108 is described as outputting device control signals, synthesized voice, text, and the like. This means that the output circuit 108 may be part of a control signal transmitter (for example, an output terminal or the infrared transmitter of a remote control), an audio output device (for example, a speaker), or a display. These may be provided integrally or may exist as separate, independent devices.

The present disclosure is useful for information presentation methods using a voice recognition function in cases where voice recognition processing is performed on a server.

Reference Signs List
101 Microphone
102 Voice recognition unit
103 Instruction character string detection unit
104 Dialogue management unit
105 Response sentence generation unit
106 Voice synthesis unit
107 Control signal generation unit
108 Input device
109 Selection state detection unit
110 Selection state management unit
111 Selection information detection unit
112 Output circuit
113a, 113b Communication circuit
114a, 114b Control circuit
115 Command character string detection unit
120 Server
121 Client
601 Example of designating a person with the input device
602 Example of a personal authentication result from person designation
701 Example of designating a location on a map
702 Display example in which a location on the map has been designated
901 Display example of a program list
902 Remote controller
903 Example of selecting a program from the program list
904 Example of displaying program contents using the remote control and voice recognition
1000 Conventional program information presentation device
1001 Microphone
1002 Voice recognition unit
1003 Instruction character string detection unit
1004 Voice synthesis unit
1005 Control signal generation unit
1006 Input device
1007 Output unit

Claims

A method for controlling a device connected to a voice input device capable of inputting a user's voice, the method causing a computer of the device to:
detect, from an input made by the user to an input device, that one of the items presented by the device has been selected;
when a voice instruction including first voice information indicating instruction content is received from the voice input device while selection of the one item is detected, transmit the voice instruction to another computer;
when a voice instruction including second voice information indicating instruction content is received from the voice input device while selection of the one item is not detected, recognize the instruction content from the second voice information and determine whether the voice instruction can be executed; and
execute the instruction content when it is determined that the instruction content can be executed, and transmit the voice instruction to another computer when it is determined that the instruction content cannot be executed.

The control method according to claim 1, further causing the computer of the device to:
determine whether the first voice information includes third voice information indicating a demonstrative;
when selection of the one item is detected, recognize the instruction content from the first voice information and, when it is determined that the first voice information includes the third voice information, execute the instruction content; and
when it is not determined that the voice instruction includes the third voice information, transmit the voice instruction to the other computer.

The control method according to claim 1, wherein the instruction content is an instruction to search for information related to the one item, and a user is notified of a search result based on the instruction content.

The control method according to claim 3, wherein the device is connected to a server via a network, and information related to the one item is searched by referring to a database in the server.

The control method according to claim 3, wherein the device is further connected to an audio output device capable of outputting audio, and search result information that causes the search result to be output as audio from the audio output device is transmitted to the audio output device.

A computer program to be executed by a device connected to a voice input device capable of inputting a user's voice, the computer program causing a computer of the device to:
detect, from an input made by the user to an input device, that one of the items presented by the device has been selected;
when a voice instruction including first voice information indicating instruction content is received from the voice input device while selection of the one item is detected, transmit the voice instruction to another computer;
when a voice instruction including second voice information indicating instruction content is received from the voice input device while selection of the one item is not detected, recognize the instruction content from the second voice information and determine whether the voice instruction can be executed; and
execute the instruction content when it is determined that the instruction content can be executed, and transmit the voice instruction to another computer when it is determined that the instruction content cannot be executed.

A device connected to a voice input device capable of inputting a user's voice, the device comprising:
a control circuit; and
a communication circuit,
wherein the control circuit:
detects, from an input made by the user to an input device, that one of a plurality of items presented by the device has been selected;
when a voice instruction including first voice information indicating instruction content is received from the voice input device while selection of the one item is detected, instructs the communication circuit to transmit the voice instruction to another computer;
when a voice instruction including second voice information indicating instruction content is received from the voice input device while selection of the one item is not detected, recognizes the instruction content from the second voice information and determines whether the voice instruction can be executed; and
executes the instruction content when it is determined that the instruction content can be executed, and instructs the communication circuit to transmit the voice instruction to another computer when it is determined that the instruction content cannot be executed.