JP2021089310A

JP2021089310A - Voice operation device, voice operation system and voice operation method

Info

Publication number: JP2021089310A
Application number: JP2019217954A
Authority: JP
Inventors: 龍也桑本; Tatsuya Kuwamoto
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2019-12-02
Filing date: 2019-12-02
Publication date: 2021-06-10

Abstract

To provide a technique for improving usability while suppressing an excessive increase in communication load and processing load in a voice operation device that transmits voice data of a user's utterance to a voice recognition engine and operates a related device according to an operation command generated from the voice recognition result of that engine.

SOLUTION: A voice operation device comprises: a voice acquisition unit that acquires voice data of an utterance made by a user; a related word detection unit that, when the voice data newly acquired by the voice acquisition unit contains no activation word, detects in that voice data an operation-related word related to the related device operated by the previous operation command; and a voice data transmission unit that transmits the voice data without an activation word to the voice recognition engine when the related word detection unit detects the operation-related word in it.

SELECTED DRAWING: Figure 12

Description

The present invention relates to a voice operation device, a voice operation system, and a voice operation method.

Voice operation devices that operate in-vehicle equipment by recognizing an occupant's speech are known. Voice dialogue systems in which an in-vehicle terminal and a server conduct a voice dialogue are also known (see, for example, Patent Documents 1 and 2).

Intelligent personal assistants (also called AI assistants), which perform tasks or services in response to a user's spoken request, are also known.

Patent Document 1: Japanese Unexamined Patent Publication No. 2005-250379
Patent Document 2: Japanese Unexamined Patent Publication No. 2001-249685

When a voice recognition engine such as an AI assistant is used to recognize a user's utterance, the user must first utter a specific activation word (WuW: Wake-up Word), such as "Alexa (registered trademark)" or "Hey Siri (registered trademark)", and then utter a request such as "do ○○". In other words, the user has to utter the activation word every time a voice input is made, so usability cannot be said to be high. On the other hand, if every utterance that does not include the activation word were submitted for voice recognition, the communication load and the processing load could increase excessively.

Accordingly, an object of the present invention is to provide a technique for improving usability while suppressing an excessive increase in communication load and processing load in a voice operation device that transmits voice data of a user's utterance to a voice recognition engine and operates a related device according to an operation command generated based on the voice recognition result of the voice recognition engine.

To solve the above problem, the present invention adopts the following configuration. The present invention is a voice operation device that, when voice data of an utterance made by a user includes an activation word for activating a voice recognition engine, transmits to the voice recognition engine voice data including at least the utterance following the activation word, and operates a related device according to an operation command generated based on the voice recognition result of the voice recognition engine. The voice operation device includes: a voice acquisition unit that acquires voice data of an utterance made by the user; a related word detection unit that, when the voice data newly acquired by the voice acquisition unit is voice data without an activation word, detects in the newly acquired voice data an operation-related word related to the related device operated according to the previous operation command; and a voice data transmission unit that transmits the voice data without an activation word to the voice recognition engine when the related word detection unit detects the operation-related word in it.
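The transmission decision described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes utterances are already available as text (the source works on voice data), and the names `ACTIVATION_WORDS`, `OPERATION_RELATED_WORDS`, and `should_transmit`, as well as the word lists themselves, are hypothetical.

```python
from typing import Optional

# Hypothetical word lists for illustration; the source does not specify them.
ACTIVATION_WORDS = ("hello, my car",)
OPERATION_RELATED_WORDS = {
    "air_conditioner": ("temperature", "warmer", "cooler"),
    "audio": ("volume", "louder", "next track"),
}


def should_transmit(utterance: str, previous_device: Optional[str]) -> bool:
    """Decide whether an utterance is sent to the voice recognition engine."""
    text = utterance.lower()
    # An utterance containing an activation word is always transmitted.
    if any(word in text for word in ACTIVATION_WORDS):
        return True
    # Without an activation word, there must be a previously operated device.
    if previous_device is None:
        return False
    # Transmit only if a word related to that device appears in the utterance.
    related = OPERATION_RELATED_WORDS.get(previous_device, ())
    return any(word in text for word in related)
```

Under these assumptions, "a bit cooler please" is transmitted only when the air conditioner was the previously operated device; all other utterances without an activation word stay on the device, which is how the communication and processing load is kept down.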

The voice operation device according to the present invention may further include a voice storage unit that stores the voice data acquired by the voice acquisition unit, and the related word detection unit may identify the related device operated according to the previous operation command by detecting, in the previous voice data stored in the voice storage unit, a device-specific related word predetermined in association with each related device.
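As a rough sketch of this identification step, the following Python function looks up device-specific related words in the previous utterance. It assumes the previous utterance is available as text, and the table contents and the names `DEVICE_RELATED_WORDS` and `identify_previous_device` are hypothetical; the actual table (TB1 in FIG. 10) is not reproduced here.

```python
from typing import Optional

# Hypothetical device-specific related word table (stand-in for table TB1).
DEVICE_RELATED_WORDS = {
    "air_conditioner": ("air conditioner", "temperature"),
    "audio": ("music", "volume", "radio"),
    "headlight": ("headlight", "lights"),
}


def identify_previous_device(previous_utterance: str) -> Optional[str]:
    """Return the device whose related word appears in the previous utterance."""
    text = previous_utterance.lower()
    for device, words in DEVICE_RELATED_WORDS.items():
        if any(word in text for word in words):
            return device
    # No device-specific related word was found.
    return None
```

A first-match lookup is used only to keep the sketch short; how ties between devices would be resolved is not stated in this part of the source.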

The voice operation device according to the present invention may further include a command information storage unit that stores operation command information on the operation command generated based on the voice recognition result of the voice data by the voice recognition engine, and the related word detection unit may identify the related device operated according to the previous operation command based on the operation command information stored in the command information storage unit.
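The alternative based on stored command information might look like the following sketch. The class names `OperationCommandInfo` and `CommandInfoStore` and their fields are assumptions made for illustration; the source only states that operation command information is stored and used to identify the previously operated device.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OperationCommandInfo:
    """Hypothetical record of one received operation command."""
    device: str   # related device that was operated
    action: str   # operation that was performed


class CommandInfoStore:
    """Hypothetical stand-in for the command information storage unit."""

    def __init__(self) -> None:
        self._history: List[OperationCommandInfo] = []

    def record(self, info: OperationCommandInfo) -> None:
        # Keep every received command; the newest entry is the "previous" one.
        self._history.append(info)

    def previous_device(self) -> Optional[str]:
        # Device operated by the most recent command, if any.
        return self._history[-1].device if self._history else None
```

Compared with searching the previous voice data, this variant reads the device directly from the command that was actually executed.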

The present invention can also be specified as a voice operation system. The voice operation system according to the present invention includes any of the voice operation devices described above, the voice recognition engine, and an operation command generation server that generates an operation command to be transmitted to the voice operation device based on the voice recognition result of the voice data by the voice recognition engine.

The present invention can also be specified as a voice operation method. The present invention is a voice operation method in which a voice operation device, when voice data of an utterance made by a user includes an activation word for activating a voice recognition engine, transmits to the voice recognition engine voice data including at least the utterance following the activation word, and operates a related device according to an operation command generated based on the voice recognition result of the voice recognition engine. The method includes: a voice acquisition step of acquiring voice data of an utterance made by the user; a related word detection step of, when the voice data newly acquired in the voice acquisition step is voice data without an activation word, detecting in the newly acquired voice data an operation-related word related to the related device operated according to the previous operation command; and a voice data transmission step of transmitting the voice data without an activation word to the voice recognition engine when the operation-related word is detected in it in the related word detection step. In this voice operation method, the processing of each step is executed by the computer of the voice operation device.

The present invention may also be a program for causing a computer to execute the processing of each step of the voice operation method. The present invention may also be a computer-readable storage medium that stores the program non-transitorily.

According to the present invention, it is possible to provide a technique for improving usability while suppressing an excessive increase in communication load and processing load in a voice operation device that transmits voice data of a user's utterance to a voice recognition engine and operates a related device according to an operation command generated based on the voice recognition result of the voice recognition engine.

FIG. 1 is a schematic configuration diagram of a voice operation system according to the first embodiment.
FIG. 2 is a hardware configuration diagram of the voice operation device according to the first embodiment.
FIG. 3 is a hardware configuration diagram of the voice recognition engine according to the first embodiment.
FIG. 4 is a hardware configuration diagram of the command generation server according to the first embodiment.
FIG. 5 is a diagram showing an example of a speaker's utterances and voice assistant utterances.
FIG. 6 is a functional block diagram of the voice operation device according to the first embodiment.
FIG. 7 is a functional block diagram of the voice recognition engine according to the first embodiment.
FIG. 8 is a functional block diagram of the command generation server according to the first embodiment.
FIG. 9 is a diagram showing an example of the data structure of operation command data.
FIG. 10 is a diagram showing an example of the data structure of the device-specific related word definition table TB1.
FIG. 11 is a diagram showing an example of the data structure of operation command data.
FIG. 12 is a flowchart showing the control performed when the voice operation device transmits voice data to the voice recognition engine.

<Embodiment 1>
Hereinafter, embodiments of the present invention will be described by way of example with reference to the drawings.

FIG. 1 is a schematic configuration diagram of the voice operation system 100 according to the first embodiment. As shown in FIG. 1, the voice operation system 100 includes a voice operation device 10, a voice recognition engine 30, and a command generation server 50. The voice operation device 10, the voice recognition engine 30, and the command generation server 50 are connected via a communication line N so that they can communicate with one another. The communication line N is, for example, a communication network such as the Internet. At least part of the communication line N may be a line using a wireless communication method such as WiFi or LTE. In the present embodiment, a large number of voice operation devices 10 are connected to the voice recognition engine 30 and the command generation server 50 via the communication line N.

The voice operation device 10 is an in-vehicle unit mounted on a vehicle 1, and is a control device for operating, by voice input, the various related devices 2 mounted on the vehicle 1. The voice operation device 10 may, for example, form part of an AVN unit (an integrated in-vehicle audio, visual, and navigation unit). A related device 2 (see FIG. 2) is a device to be operated by the voice operation device 10; examples include an air conditioner 2A, an audio device 2B, and headlights (lighting device) 2C. The related devices 2 are not limited to these examples, and may also include wipers, turn signals, power windows, the door lock device of the vehicle 1, and the like.

The voice operation device 10 is connected to each related device 2 via an in-vehicle network (CAN, LIN, etc.). For example, the voice operation device 10 controls each related device 2 according to the utterance of a user (speaker) riding in the vehicle 1. The vehicle 1 is also provided with a microphone 3 and a speaker 4, which are connected to the voice operation device 10 via the in-vehicle network. The positions and the number of microphones 3 and speakers 4 provided in the vehicle 1 are not particularly limited.

FIG. 2 is a hardware configuration diagram of the voice operation device 10 according to the first embodiment. The voice operation device 10 is a computer having a processor 12, a memory 13, an input/output IF (interface) 14, and a communication IF (interface) 15, which are connected to one another by a connection bus 11. The processor 12 is a central processing unit that controls the entire device by processing input information and outputting the processing result. The processor 12 is also called a CPU (Central Processing Unit) or an MPU (Micro-processing unit). The processor 12 is not limited to a single processor and may have a multiprocessor configuration. It may also have a multi-core configuration with a plurality of cores in a single chip connected through a single socket.

The memory 13 includes, for example, a main storage device and an auxiliary storage device. The main storage device is used as a work area for the processor 12, a storage area for temporarily storing information processed by the processor 12, and a buffer area for communication data. The main storage device is a storage medium in which the processor 12 caches programs and data and expands its work area. The main storage device includes, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), and a flash memory. The auxiliary storage device is a storage medium that stores programs executed by the processor 12, data used for information processing, operation setting information, and the like. The auxiliary storage device is, for example, an HDD (Hard-disk Drive), an SSD (Solid State Drive), an EPROM (Erasable Programmable ROM), a flash memory, a USB memory, or a memory card. The auxiliary storage device in the memory 13 also holds a voice recognition dictionary DB as a storage location for data referred to by each processing unit of the voice operation device 10.

The input/output IF 14 is an interface for inputting and outputting data to and from each related device 2 connected to the voice operation device 10. The voice operation device 10 is connected to, for example, the microphone 3 and the speaker 4 via the input/output IF 14. The microphone 3 is an input device that receives the voice (utterances) of occupants of the vehicle 1. Information input from the microphone 3 is passed to the processor 12 via the connection bus 11. The speaker 4 is an output device that outputs, as sound, voice data processed by the processor 12 or the like.

The communication IF 15 is an interface for communicating with other devices via the communication line N. The communication IF 15 may communicate using a wireless communication method such as WiMAX (Worldwide Interoperability for Microwave Access), LTE (Long Term Evolution), WiFi, or Bluetooth (registered trademark).

The specific processing is described later, but in outline the voice operation device 10 works as follows. When the voice data (utterance data) of an utterance made by an occupant (user, speaker) of the vehicle 1 includes a predetermined activation word for activating the voice recognition engine 30, the voice operation device 10 transmits to the voice recognition engine 30 voice data including at least the utterance following the activation word, acquires the operation command generated based on the voice recognition result of the voice recognition engine 30, and operates the related device 2 according to the acquired operation command. The activation word is a specific word predetermined in association with the voice recognition engine 30 or the like, and serves, for example, as a trigger for the voice operation device 10 to transmit voice data in order to activate the voice recognition engine 30 from a standby state. The activation word is also referred to as a WuW (Wake Up Word), wake-up word, or wake word. Examples of such activation words include "Alexa (registered trademark)" and "Hey Siri (registered trademark)".
In this specification, the "activation word" is not limited to a single word, and may be a term, phrase, or the like that includes a plurality of words. Various activation words can be adopted, such as "Hello, my car" or "Hi, my vehicle".
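Because an activation word may be a multi-word phrase, locating it and taking the utterance that follows can be sketched as below. This is only an illustration on text: the source operates on voice data and transmits at least the portion following the activation word, whereas the function name `utterance_after_activation` and the phrase list are hypothetical, and the sketch assumes lowercasing does not change string length (true for ASCII text).

```python
from typing import Optional

# Hypothetical activation phrases, taken from the examples in the text.
ACTIVATION_PHRASES = ("hello, my car", "hi, my vehicle")


def utterance_after_activation(transcript: str) -> Optional[str]:
    """Return the text following the first activation phrase, or None."""
    lowered = transcript.lower()
    for phrase in ACTIVATION_PHRASES:
        index = lowered.find(phrase)
        if index != -1:
            # Keep the original casing of the request that follows the phrase.
            rest = transcript[index + len(phrase):]
            return rest.lstrip(" ,.")
    # No activation phrase: nothing to transmit on this path.
    return None
```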

In the voice operation system 100 according to the present embodiment, voice data is transmitted from the voice operation device 10 to the voice recognition engine 30, triggered by the occupant uttering the activation word as described above. FIG. 3 is a hardware configuration diagram of the voice recognition engine 30 according to the first embodiment. The voice recognition engine 30 is a computer having a processor 32, a memory 33, a communication IF 34, and the like, which are connected to one another by a connection bus 31. The processor 32 is a central processing unit that controls the entire device by processing input information and outputting the processing result. The processor 32 is also called a CPU (Central Processing Unit) or an MPU (Micro-processing unit). The processor 32 is not limited to a single processor and may have a multiprocessor configuration. It may also have a multi-core configuration with a plurality of cores in a single chip connected through a single socket.

The memory 33 includes a main storage device and an auxiliary storage device. The main storage device is used as a work area for the processor 32, a storage area for temporarily storing information processed by the processor 32, and a buffer area for communication data. The main storage device is a storage medium in which the processor 32 caches programs and data and expands its work area. The main storage device includes, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), and a flash memory. The auxiliary storage device is a storage medium that stores programs executed by the processor 32, data used for information processing, operation setting information, and the like. The auxiliary storage device is, for example, an HDD (Hard-disk Drive), an SSD (Solid State Drive), an EPROM (Erasable Programmable ROM), a flash memory, a USB memory, or a memory card. The communication IF 34 is an interface for communicating with other devices via the communication line N. The voice recognition engine 30 may reside in the cloud.

On receiving voice data from the voice operation device 10, the voice recognition engine 30 performs voice recognition processing on it, converting the voice data into text (character string) data representing the speaker's utterance. The voice recognition engine 30 then performs natural language analysis on the resulting text data to generate speaker request data, which indicates what the speaker is requesting, and transmits the speaker request data to the command generation server 50.

FIG. 4 is a hardware configuration diagram of the command generation server 50 according to the first embodiment. The command generation server 50 is a computer having a processor 52, a memory 53, a communication IF 54, and the like, which are connected to one another by a connection bus 51. The processor 52 is a central processing unit that controls the entire device by processing input information and outputting the processing result. The processor 52 is also called a CPU (Central Processing Unit) or an MPU (Micro-processing unit). The processor 52 is not limited to a single processor and may have a multiprocessor configuration. It may also have a multi-core configuration with a plurality of cores in a single chip connected through a single socket.

The memory 53 includes a main storage device and an auxiliary storage device. The main storage device is used as a work area for the processor 52, a storage area for temporarily storing information processed by the processor 52, and a buffer area for communication data. The main storage device is a storage medium in which the processor 52 caches programs and data and expands its work area. The main storage device includes, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), and a flash memory. The auxiliary storage device is a storage medium that stores programs executed by the processor 52, data used for information processing, operation setting information, and the like. The auxiliary storage device is, for example, an HDD (Hard-disk Drive), an SSD (Solid State Drive), an EPROM (Erasable Programmable ROM), a flash memory, a USB memory, or a memory card. The communication IF 54 is an interface for communicating with other devices via the communication line N. The command generation server 50 may be configured as a web server residing in the cloud.

The command generation server 50 generates operation command data to be transmitted to the voice operation device 10, based on the speaker request data received from the voice recognition engine 30. The operation command data stores information on how a related device 2 operated by the voice operation device 10 is to be operated, and naturally reflects the speaker's request. The operation command data generated by the command generation server 50 is transmitted from the command generation server 50 to the voice operation device 10 via the communication line N. The voice operation device 10 operates the target related device 2 in accordance with the operation command data received from the command generation server 50, which indicates the speaker's request.

In addition, the command generation server 50 generates response voice generation text data for producing a response voice corresponding to the generated operation command data, and transmits this text data to the voice recognition engine 30. The voice recognition engine 30 performs voice synthesis processing on the response voice generation text data received from the command generation server 50, converting it into voice data (hereinafter referred to as "response voice data"). The voice recognition engine 30 transmits the generated response voice data to the voice operation device 10. The voice operation device 10 then announces the result to the occupant by having the speaker 4 output the voice corresponding to the response voice data received from the voice recognition engine 30 (hereinafter referred to as a "voice assistant utterance").
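The round trip described above (recognition and analysis, command generation, response synthesis) can be traced with stub functions. Everything in this sketch is a stand-in: the function and class names are hypothetical, the "recognition" step takes already-transcribed text, the analysis is a single keyword check, and "synthesis" merely encodes the response text as bytes instead of producing audio.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SpeakerRequest:
    """Hypothetical speaker request data produced by the engine."""
    device: str
    action: str


@dataclass
class OperationCommand:
    """Hypothetical operation command data sent to the voice operation device."""
    device: str
    action: str


def recognize_and_analyze(text: str) -> SpeakerRequest:
    # Stand-in for the engine's recognition and natural language analysis.
    if "air conditioner" in text.lower():
        return SpeakerRequest(device="air_conditioner", action="on")
    raise ValueError("request not understood in this sketch")


def generate_command(request: SpeakerRequest) -> Tuple[OperationCommand, str]:
    # Stand-in for the command generation server: command plus response text.
    command = OperationCommand(request.device, request.action)
    device_name = request.device.replace("_", " ")
    response_text = f"Turning {request.action} the {device_name}."
    return command, response_text


def synthesize(response_text: str) -> bytes:
    # Stand-in for voice synthesis: returns bytes in place of audio data.
    return response_text.encode("utf-8")
```

In this sketch, the command goes to the voice operation device while the response text goes back to the engine for synthesis, mirroring the two return paths in the description.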

Next, the specific control performed in the voice operation system 100 will be described. The voice operation system 100 starts up when, for example, the ACC (accessory) power supply of the vehicle 1 is switched on.

FIG. 5 is a diagram showing an example of a speaker's utterances and voice assistant utterances. FIG. 6 is a functional block diagram of the voice operation device 10 according to the first embodiment. In the voice operation device 10, the processor 12 executes an application program and thereby functions as the following processing units: a voice acquisition unit 21, an activation word detection unit 22, a voice data transmission unit 23, a related word detection unit 24, a command acquisition unit 25, a response voice acquisition unit 26, an operation processing unit 27, and a response voice output unit 28. However, at least part of the processing of these units may be provided by a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. At least some of these units may be a dedicated large scale integration (LSI) circuit such as a Field-Programmable Gate Array (FPGA), or another digital circuit. At least some of these units may also include an analog circuit.

The voice acquisition unit 21 acquires voice data by receiving, via the input/output IF 14, the utterance of the occupant (speaker) input to the microphone 3. The memory 13 has a voice storage unit 131 that stores voice data on the occupant's utterances acquired from the microphone 3. The voice storage unit 131 is a storage area allocated to part of the memory 13. The voice acquisition unit 21 stores the speaker's voice data acquired from the microphone 3 in the voice storage unit 131.

The activation word detection unit 22 detects whether an activation word is present in the voice data by determining, based on the speaker's voice data newly acquired by the voice acquisition unit 21 (that is, the latest voice data stored in the voice storage unit 131 of the memory 13), whether that voice data contains the activation word. The activation word detection unit 22 detects the activation word by, for example, performing voice recognition processing on the voice data. The voice recognition processing can be performed by referring to a voice recognition dictionary DB (database) in accordance with an algorithm corresponding to a known voice signal matching model. The voice recognition dictionary DB and the voice signal matching model used for the voice recognition processing may be stored in, for example, an auxiliary storage device of the memory 13.

FIG. 5 shows an example in which, with the ACC (accessory) power supply of the vehicle 1 on, the occupant utters "○○ (activation word), raise the temperature of the air conditioner by 3 degrees" (here and below, "○○" represents the activation word (WuW)). In this case, the utterance "○○ (activation word), raise the temperature of the air conditioner by 3 degrees" is picked up by the microphone 3, and the voice acquisition unit 21 acquires the voice data of the utterance and stores it in the voice storage unit 131 of the memory 13. The activation word detection unit 22 then detects "○○ (activation word)" in the voice data of the utterance "○○ (activation word), raise the temperature of the air conditioner by 3 degrees".

The voice data transmission unit 23 performs processing for transmitting voice data to the voice recognition engine 30. When the activation word detection unit 22 detects the activation word, the voice data transmission unit 23 transmits voice data including at least the utterance that follows the activation word to the voice recognition engine 30. Here, the voice data transmission unit 23 transmits to the voice recognition engine 30 the voice data corresponding to the portion of the utterance that follows the activation word, that is, "raise the temperature of the air conditioner by 3 degrees" (hereinafter referred to as "request content voice data"). The voice data transmission unit 23 also transmits ID identification information to the voice recognition engine 30 together with the request content voice data. The request content voice data and the ID identification information are transmitted to the voice recognition engine 30 via the communication IF 15 of the voice operation device 10 and the communication line N.
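The handling of an utterance that contains the activation word can be sketched in Python as follows. This is only an illustrative sketch under simplifying assumptions: utterances are represented as already-transcribed text instead of audio, and the activation word "hey car", the `Request` type, and the function names are hypothetical and do not appear in this specification.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical activation word standing in for "○○" in the specification.
WAKE_WORD = "hey car"


@dataclass
class Request:
    device_id: str  # ID identification information of the voice operation device
    content: str    # request content (the part following the activation word)


def split_wake_word(utterance: str, wake_word: str = WAKE_WORD) -> Optional[str]:
    """Return the text following the activation word, or None if it is absent."""
    text = utterance.strip()
    if not text.lower().startswith(wake_word):
        return None
    # Drop the activation word and any separating comma or space.
    return text[len(wake_word):].lstrip(" ,")


def build_request(utterance: str, device_id: str) -> Optional[Request]:
    """Build the payload for the recognition engine when the activation word is found."""
    content = split_wake_word(utterance)
    if content is None:
        return None
    return Request(device_id=device_id, content=content)
```

In this sketch, an utterance without the activation word yields `None`, which corresponds to the case handled separately by the related word detection unit 24.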

The ID identification information is identification information assigned to each voice operation device 10 in order to identify the voice operation device 10 mounted on the vehicle 1. Because the ID identification information is transmitted to the voice recognition engine 30 together with the request content voice data, the voice recognition engine 30 can identify the sender of the request content voice data it receives. The related word detection unit 24, the command acquisition unit 25, the operation processing unit 27, and the response voice output unit 28 of the voice operation device 10 will be described later.

FIG. 7 is a functional block diagram of the voice recognition engine 30 according to the first embodiment. In the voice recognition engine 30, the processor 32 executes an application program and thereby functions as processing units including a data acquisition unit 36, an analysis processing unit 37, and a voice synthesis unit 38.

The data acquisition unit 36 of the voice recognition engine 30 acquires the request content voice data and the ID identification information transmitted by the voice data transmission unit 23 of the voice operation device 10. The data acquisition unit 36 passes the request content voice data to the analysis processing unit 37, and the analysis processing unit 37 performs voice recognition processing and natural language analysis processing on the received request content voice data. For example, the analysis processing unit 37 converts the request content voice data into text data by performing voice recognition processing on it. The analysis processing unit 37 then interprets the intent of the speaker's request contained in the text data by performing natural language analysis on the text data. The natural language analysis executed by the analysis processing unit 37 may include morphological analysis, syntactic analysis, semantic analysis, context analysis, and the like.

As described above, the analysis processing unit 37 generates, based on the request content voice data, speaker request data indicating the content of the speaker's request. The analysis processing unit 37 transmits the generated speaker request data to the command generation server 50 in association with the ID identification information. The speaker request data and the ID identification information are transmitted to the command generation server 50 via the communication IF 34 of the voice recognition engine 30 and the communication line N. The function of the voice synthesis unit 38 of the voice recognition engine 30 will be described later. The voice recognition engine 30 may store the request content voice data and the ID identification information received from the voice operation device 10 in a storage area of the memory 33. The analysis processing unit 37 of the voice recognition engine 30 may also be implemented by artificial intelligence (AI). That is, a machine learning model may be stored in the memory 33 of the voice recognition engine 30, and the analysis processing unit 37 may generate the speaker request data using that machine learning model. In such a configuration, the voice recognition engine 30 is built as a so-called AI assistant.

FIG. 8 is a functional block diagram of the command generation server 50 according to the first embodiment. In the command generation server 50, the processor 52 executes an application program and thereby functions as processing units including a command generation unit 56 and a response voice text generation unit 57.

The command generation unit 56 of the command generation server 50 generates operation command data to be transmitted to the voice operation device 10, based on the speaker request data and the ID identification information received from the voice recognition engine 30. FIG. 9 is a diagram showing an example of the data structure of the operation command data. In the example shown in FIG. 9, the operation command data stores, in association with one another, the related device 2 to be operated by the operation command (Object), the operation content (Action), and the ID identification information. In this example, the related device 2 to be operated (Object) is "air conditioner", the operation content (Action) is "raise by 3 degrees", and the ID identification information is "No1". However, the data structure of the operation command data shown in FIG. 9 is only an example, and the data structure is not limited to it. The command generation unit 56 transmits the operation command data to the voice operation device 10 corresponding to the ID identification information. The operation command data is transmitted to the voice operation device 10 via the communication IF 54 of the command generation server 50 and the communication line N.
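As an illustration of the data structure described with reference to FIG. 9, the operation command data could be represented roughly as follows. The field names, the `OperationCommand` class, and the use of JSON as the transmission format are assumptions made for this sketch; the specification only states that Object, Action, and the ID identification information are stored in association with one another.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class OperationCommand:
    obj: Optional[str]  # related device to be operated (Object); None if unspecified
    action: str         # operation content (Action)
    device_id: str      # ID identification information of the destination device


def to_message(command: OperationCommand) -> str:
    """Serialize the command for transmission (JSON is an assumption of this sketch)."""
    return json.dumps(asdict(command))


# The example of FIG. 9 expressed with the hypothetical class above.
command = OperationCommand(obj="air conditioner", action="raise by 3 degrees", device_id="No1")
message = to_message(command)
```

Allowing `obj` to be `None` anticipates utterances in which the device to be operated is not named.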

The response voice text generation unit 57 generates response voice generation text data to be transmitted to the voice recognition engine 30, based on the speaker request data received from the voice recognition engine 30 or on the operation command data generated by the command generation unit 56. The response voice generation text data is text data corresponding to the operation command data, and is the text (character string) data from which the voice assistant utterance output from the speaker 4 of the voice operation device 10 is produced. The response voice generation text data includes the related device 2 to be operated (Object) and the operation content (Action) stored in the operation command data. In this example, it includes the text (character string) "Raising the temperature of the air conditioner by 3 degrees". The response voice text generation unit 57 transmits the response voice generation text data to the voice recognition engine 30 together with the ID identification information. The response voice generation text data and the ID identification information are transmitted to the voice recognition engine 30 via the communication IF 54 of the command generation server 50 and the communication line N.

In the voice recognition engine 30, which has received the response voice generation text data and the ID identification information from the command generation server 50, the voice synthesis unit 38 performs voice synthesis processing on the response voice generation text data. The response voice generation text data is thereby converted into response voice data. The voice synthesis unit 38 of the voice recognition engine 30 transmits the response voice data converted from the response voice generation text data to the voice operation device 10 corresponding to the ID identification information. The response voice data is transmitted to the voice operation device 10 via the communication IF 34 of the voice recognition engine 30 and the communication line N.

In this way, the voice operation device 10 receives the response voice data from the voice recognition engine 30 and receives the operation command data from the command generation server 50. The command acquisition unit 25 of the voice operation device 10 receives the operation command data transmitted from the command generation server 50. The memory 13 of the voice operation device 10 has a command information storage unit 132 that stores operation command data (operation command information). The command information storage unit 132 is a storage area allocated to part of the memory 13. The command acquisition unit 25 of the voice operation device 10 stores the acquired operation command data in the command information storage unit 132.

The operation processing unit 27 of the voice operation device 10 reads the latest operation command data stored in the command information storage unit 132 and operates the related device 2 to be operated in accordance with the operation command data. In this example, as described with reference to FIG. 9, the related device 2 to be operated (Object) is the air conditioner 2A, and the operation content (Action) is "raise by 3 degrees". The operation processing unit 27 therefore outputs a control signal to the air conditioner 2A so that, in accordance with the operation content defined in the operation command data, the set temperature of the air conditioner 2A is changed to a temperature 3 degrees higher. As a result, the set temperature of the air conditioner 2A is changed in accordance with the operation command data.
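The processing of the operation processing unit 27 in this example can be pictured with the following minimal sketch. The `AirConditioner` class, the textual form of the action ("raise by 3 degrees"), the default set temperature, and the function name are all assumptions introduced for illustration; the specification does not define how the operation content is encoded or how the control signal is produced.

```python
import re
from dataclasses import dataclass


@dataclass
class AirConditioner:
    set_temperature: float = 24.0  # hypothetical current set temperature


# Hypothetical textual encoding of the operation content (Action).
_ACTION_PATTERN = re.compile(r"(raise|lower) by (\d+) degrees?")


def apply_action(device: AirConditioner, action: str) -> float:
    """Change the set temperature according to the action and return the new value."""
    match = _ACTION_PATTERN.fullmatch(action)
    if match is None:
        raise ValueError(f"unsupported action: {action!r}")
    delta = int(match.group(2))
    if match.group(1) == "lower":
        delta = -delta
    device.set_temperature += delta
    return device.set_temperature
```

Rejecting unrecognized actions with an exception is a design choice of the sketch, so that a malformed command never changes the device state silently.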

The response voice acquisition unit 26 of the voice operation device 10 acquires the response voice data transmitted from the voice recognition engine 30. The response voice acquisition unit 26 passes the acquired response voice data to the response voice output unit 28. The response voice output unit 28 outputs a control signal to the speaker 4 and causes the speaker 4 to output the voice assistant utterance based on the response voice data received from the response voice acquisition unit 26. In this example, the voice "Raising the temperature of the air conditioner by 3 degrees" is output from the speaker 4.

In the voice operation system 100 of the present embodiment, the voice recognition engine 30 returns to the standby state when a certain period of time (for example, several seconds) has elapsed after the end of one session. A session here refers to the series of processes in which the voice recognition engine 30 starts up upon receiving voice data from the voice operation device 10, transmits the generated speaker request data to the command generation server 50, receives the response voice generation text data from the command generation server 50, and transmits the response voice data generated from the response voice generation text data to the voice operation device 10. The session described so far, which was started by the speaker's utterance "○○ (activation word), raise the temperature of the air conditioner by 3 degrees", is referred to as the "first session".

In a conventional voice operation system, when a related device is operated by the speaker's voice input, the speaker has to utter the activation word (WuW) and then utter the request each time voice input is made, which is not very convenient for the user. On the other hand, if voice data without the activation word were always transmitted to the voice recognition engine 30 and subjected to voice recognition processing and language analysis processing, even when the speaker's voice data acquired by the voice operation device does not contain the activation word (WuW), the communication load and the processing load would increase excessively. The voice operation device 10 of the present embodiment therefore enables voice operation of the related device 2 even when the speaker omits the activation word (WuW), without causing an excessive increase in the communication load or the processing load. The characteristic processing of the voice operation device 10 is described below.

In the example shown in FIG. 5, after the first session ends, the occupant of the vehicle 1 utters "lower the temperature by 1 degree". This utterance does not contain the activation word. In this case, the speaker's voice data newly acquired by the voice acquisition unit 21 of the voice operation device 10 is voice data without the activation word, and this voice data is stored in the voice storage unit 131 of the memory 13. Accordingly, the activation word detection unit 22 does not detect the activation word in this voice data.

When the speaker's voice data newly acquired by the voice acquisition unit 21 is voice data without the activation word, as in this case, the related word detection unit 24 of the voice operation device 10 detects, in the newly acquired voice data without the activation word, an operation-related word related to the related device 2 that was operated in accordance with the previous operation command. The processes of the "second session", which is started when the speaker utters "lower the temperature by 1 degree" without uttering the activation word, are described below.

In the second session, the related word detection unit 24 identifies the related device 2 that was operated in accordance with the previous operation command by detecting device-specific related words in the previous voice data stored in the voice storage unit 131 of the memory 13 (in the control example shown in FIG. 5, the utterance "○○ (activation word), raise the temperature of the air conditioner by 3 degrees"). Device-specific related words are words related to a related device 2 that are predetermined in association with each related device 2.

The memory 13 of the voice operation device 10 stores the device-specific related word definition table TB1 shown in FIG. 10. The device-specific related word definition table TB1 defines the correspondence between each related device 2 that is voice-operated by the voice operation device 10 and the device-specific related words predetermined for that related device 2. The related devices 2 are the air conditioner 2A, the audio device 2B, the headlight 2C, and so on.

In the example shown in FIG. 10, words such as "air conditioner", "temperature", "degree", "raise", "lower", "stop", and "turn on" are defined as the device-specific related words corresponding to the air conditioner 2A. Words such as "audio", "CD", "DVD", "volume", "sound level", "play", "stop", "track up", and "track down" are defined as the device-specific related words corresponding to the audio device 2B. Words such as "headlight", "light", "lighting", "brighter", and "darker" are defined as the device-specific related words corresponding to the headlight 2C. However, the device-specific related words set for each related device 2 are only examples.

As described above, the related word detection unit 24 refers to the device-specific related word definition table TB1 stored in the memory 13 and detects device-specific related words in the previous voice data stored in the voice storage unit 131. In this example, "air conditioner", "temperature", and "degree", which are contained in the previous utterance "○○ (activation word), raise the temperature of the air conditioner by 3 degrees", are detected as device-specific related words. The related word detection unit 24 then refers to the definition table TB1 and reads out the related device 2 with which the detected device-specific related words ("air conditioner", "temperature", "degree") are associated. The related word detection unit 24 can thereby identify the related device 2 that was operated in accordance with the previous operation command. In this control example, the air conditioner 2A is identified as the related device 2 that was operated in accordance with the previous operation command.
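The lookup against the device-specific related word definition table TB1 can be sketched as follows. The English word lists are loose renderings of the examples of FIG. 10, the utterance is represented as transcribed text, and the rule of choosing the device with the most matching words is an assumption of this sketch; the specification does not state how ambiguous matches are resolved.

```python
from typing import Dict, List, Optional

# Hypothetical English rendering of the definition table TB1 (FIG. 10).
RELATED_WORDS: Dict[str, List[str]] = {
    "air conditioner": ["air conditioner", "temperature", "degree",
                        "raise", "lower", "stop", "turn on"],
    "audio": ["audio", "cd", "dvd", "volume", "play", "stop",
              "track up", "track down"],
    "headlight": ["headlight", "light", "lighting", "brighter", "darker"],
}


def detect_related_words(utterance: str, device: str) -> List[str]:
    """Return the device-specific related words of `device` found in the utterance."""
    text = utterance.lower()
    return [word for word in RELATED_WORDS[device] if word in text]


def identify_device(previous_utterance: str) -> Optional[str]:
    """Identify the previously operated device from the previous utterance.

    The device with the most matching words wins (an assumption of this sketch).
    """
    counts = {device: len(detect_related_words(previous_utterance, device))
              for device in RELATED_WORDS}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Plain substring matching is used only for brevity; an actual device would match against recognized words in the voice data.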

The related word detection unit 24 of the voice operation device 10 then detects, in the newly acquired voice data without the activation word, an operation-related word related to the related device 2 that was operated in accordance with the previous operation command. Because that related device 2 has been identified as the air conditioner 2A as described above, the related word detection unit 24 determines whether the newly acquired voice data without the activation word contains any of the words related to the air conditioner 2A, such as "air conditioner", "temperature", "degree", "raise", "lower", "stop", and "turn on". As a result, the related word detection unit 24 detects the words "temperature" and "degree" as operation-related words in the voice data without the activation word corresponding to the utterance "lower the temperature by 1 degree".

When the related word detection unit 24 detects, in the newly acquired voice data without the activation word, an operation-related word related to the related device 2 that was operated in accordance with the previous operation command, the voice data transmission unit 23 transmits the voice data without the activation word and the ID identification information to the voice recognition engine 30. The transmission of the voice data without the activation word and the ID identification information from the voice operation device 10 to the voice recognition engine 30 triggers the voice recognition engine 30, which was in the standby state, to start up.
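The transmission decision described so far can be summarized in a short sketch: voice data is sent to the voice recognition engine 30 when it contains the activation word, or when it lacks the activation word but contains an operation-related word for the previously operated device. The function name, the activation word "hey car", and the use of transcribed text are assumptions made for illustration.

```python
from typing import Optional, Sequence


def should_transmit(utterance: str,
                    wake_word: str,
                    previous_device_words: Optional[Sequence[str]]) -> bool:
    """Decide whether the utterance is sent to the voice recognition engine.

    `previous_device_words` holds the related words of the device operated by
    the previous operation command, or None if there was no previous command.
    """
    text = utterance.lower()
    if text.startswith(wake_word):
        return True  # ordinary case: the activation word was uttered
    if not previous_device_words:
        return False  # nothing to relate the utterance to
    # Case without the activation word: transmit only if an operation-related word is found.
    return any(word in text for word in previous_device_words)


# Hypothetical related words of the air conditioner, as in the example above.
AC_WORDS = ["air conditioner", "temperature", "degree",
            "raise", "lower", "stop", "turn on"]
```

Utterances that match neither condition are not transmitted, which is how the sketch reflects the suppression of communication load and processing load.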

The processes executed in the voice recognition engine 30 are substantially the same as in the first session. That is, the analysis processing unit 37 of the voice recognition engine 30 performs voice recognition processing on the voice data without the activation word, thereby converting the voice data of "lower the temperature by 1 degree" into text data, and performs natural language analysis processing on this text data to interpret the intent of the speaker's request contained in it. In this way, the analysis processing unit 37 generates, based on the voice data without the activation word, speaker request data indicating the content of the speaker's request. The speaker request data generated in the voice recognition engine 30 is transmitted to the command generation server 50 in association with the ID identification information, and the command generation server 50 generates operation command data and response voice generation text data.

Here, the speaker request data that the command generation server 50 receives from the voice recognition engine 30 does not contain information corresponding to the related device 2 to be operated (Object). The command generation server 50 therefore fills in the information corresponding to the related device 2 (Object) when generating the operation command data and the response voice generation text data.

The memory 53 of the command generation server 50 has a command data storage unit 151 that stores the operation command data generated by the command generation unit 56. The command data storage unit 151 is a storage area allocated to part of the memory 53. The command generation unit 56 of the command generation server 50 reads the operation command data generated in the previous session from the command data storage unit 151. The command generation unit 56 then takes, from the operation command data generated in the previous session, the information corresponding to the related device 2 to be operated (Object), which is blank (null) in the speaker request data newly received from the voice recognition engine 30, and fills it in. The operation command data shown in FIG. 11 is thereby generated. In this control example, as shown in FIG. 11, the related device 2 to be operated (Object) is "air conditioner" and the operation content (Action) is "lower by 1 degree". In FIG. 11, the parentheses around "air conditioner" indicate that the value was filled in from the operation command data generated in the previous session. The command generation unit 56 then transmits the generated operation command data to the voice operation device 10 corresponding to the ID identification information, as in the first session.
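The complementing of the blank Object can be illustrated with the following sketch. The `OperationCommand` class and the `complement` function are hypothetical names, and the check that the previous command belongs to the same ID identification information is an assumption of the sketch; the specification only states that the previous operation command data is read from the command data storage unit 151.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class OperationCommand:
    obj: Optional[str]  # related device to be operated (Object); None when blank
    action: str         # operation content (Action)
    device_id: str      # ID identification information


def complement(new: OperationCommand,
               previous: Optional[OperationCommand]) -> OperationCommand:
    """Fill a blank Object from the command generated in the previous session."""
    if (new.obj is None
            and previous is not None
            and previous.device_id == new.device_id):  # same sender (assumption)
        return replace(new, obj=previous.obj)
    return new


# First session: Object and Action are both given.
first = OperationCommand("air conditioner", "raise by 3 degrees", "No1")
# Second session: "lower the temperature by 1 degree" names no device.
second = complement(OperationCommand(None, "lower by 1 degree", "No1"), first)
```

With these inputs, `second` carries "air conditioner" as its Object, matching the complemented entry shown in parentheses in FIG. 11.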

As in the first session, the response voice text generation unit 57 of the command generation server 50 generates, based on the operation command data generated by the command generation unit 56, response voice generation text data whose content corresponds to the operation command data, and transmits the generated response voice generation text data to the voice recognition engine 30 together with the ID identification information. In the voice recognition engine 30, which has received the response voice generation text data and the ID identification information from the command generation server 50, the voice synthesis unit 38 generates response voice data based on the response voice generation text data and transmits the generated response voice data to the voice operation device 10 corresponding to the ID identification information.

そして、コマンド生成サーバ５０から操作コマンドデータを受信した音声操作装置１０は、第1セッションと同様、操作コマンドデータに基づいて操作対象となる関連機器２を操作する。具体的には、音声操作装置１０の操作処理部２７は、操作コマンドデータに定義されている操作内容に則してエアコンディショナ２Ａの設定温度が１度低い温度に設定されるように、エアコンディショナ２Ａを操作する。また、音声操作装置１０の応答用音声取得部２６は、音声認識エンジン３０から取得した応答用音声データを応答音声出力部２８に引き渡し、応答音声出力部２８は、応答用音声データに基づいてスピーカ４に音声アシスタント発話を音声出力させる。その結果、「エアコンの温度を１度下げます」という音声がスピーカ４から出力される。これにより、第２セッションが終了する。 Then, the voice operation device 10 that has received the operation command data from the command generation server 50 operates the related device 2 to be operated based on the operation command data, as in the first session. Specifically, the operation processing unit 27 of the voice operation device 10 operates the air conditioner 2A so that its set temperature is lowered by one degree, in accordance with the operation content defined in the operation command data. Further, the response voice acquisition unit 26 of the voice operation device 10 passes the response voice data acquired from the voice recognition engine 30 to the response voice output unit 28, and the response voice output unit 28 causes the speaker 4 to output the voice assistant utterance based on the response voice data. As a result, the voice "I will lower the air conditioner temperature by 1 degree" is output from the speaker 4. This ends the second session.

次に、図５を参照して第３セッションについて説明する。第３セッションは、第２セッションが終了して音声認識エンジン３０が待機状態にあるときに、乗員が「〇〇（起動ワード）、ＣＤを再生して」と発話したことを契機に開始される。このように、乗員の発話に起動ワードが含まれる場合には、第１セッションと同様の処理が行われる。すなわち、「ＣＤを再生して」という発話部分に対応する要求内容音声データが音声操作装置１０から音声認識エンジン３０に送信される。それ以降の処理内容についても、第１セッションと同様である。すなわち、最終的に、音声操作装置１０は、コマンド生成サーバ５０から操作コマンドデータを受信すると共に、音声認識エンジン３０から応答用音声データを受信する。そして、音声操作装置１０の操作処理部２７はオーディオ機器２Ｂに制御信号を出力し、ＣＤを再生させる。また、音声操作装置１０の応答音声出力部２８が応答用音声データに基づいてスピーカ４を操作することで、「ＣＤを再生します」という音声アシスタント発話がスピーカ４から音声出力される。これにより、第３セッションが終了する。 Next, the third session will be described with reference to FIG. 5. The third session starts when the occupant says "○○ (activation word), play the CD" while the voice recognition engine 30 is in the standby state after the second session has ended. When the occupant's utterance includes the activation word in this way, the same processing as in the first session is performed. That is, the request content voice data corresponding to the utterance portion "play the CD" is transmitted from the voice operation device 10 to the voice recognition engine 30. The subsequent processing is also the same as in the first session. That is, the voice operation device 10 ultimately receives the operation command data from the command generation server 50 and the response voice data from the voice recognition engine 30. Then, the operation processing unit 27 of the voice operation device 10 outputs a control signal to the audio device 2B to play the CD. Further, the response voice output unit 28 of the voice operation device 10 operates the speaker 4 based on the response voice data, so that the voice assistant utterance "I will play the CD" is output from the speaker 4. This ends the third session.

なお、第３セッションにおいて、乗員が「〇〇（起動ワード）、ＣＤを再生して」と発話する代わりに「エアコンを止めて」と発話した場合、当該発話に起動ワードが含まれないため、起動ワード無し音声データから操作関連ワードを検出する処理が関連ワード検出部２４によって行われる。この場合、関連ワード検出部２４は、メモリ１３に格納されている機器別関連ワード定義テーブルＴＢ１を参照し、音声記憶部１３１に記憶されている前回（第２セッション）の音声データから機器別関連ワードを検出する。具体的には、「温度を１度下げて」という発話に含まれる「温度」、「度」という単語が機器別関連ワードとして検出され、前回の操作コマンドに応じて操作した関連機器２がエアコンディショナ２Ａに特定される。 In the third session, if the occupant says "turn off the air conditioner" instead of "○○ (activation word), play the CD", the utterance does not include the activation word, so the related word detection unit 24 performs the process of detecting an operation-related word from the voice data without the activation word. In this case, the related word detection unit 24 refers to the device-specific related word definition table TB1 stored in the memory 13 and detects device-specific related words from the previous (second session) voice data stored in the voice storage unit 131. Specifically, the words "temperature" and "degree" included in the utterance "lower the temperature by 1 degree" are detected as device-specific related words, and the related device 2 operated in response to the previous operation command is identified as the air conditioner 2A.

そして、音声操作装置１０の関連ワード検出部２４は、前回の操作コマンドに応じて操作した関連機器２に関連する操作関連ワードを、新たに取得した「エアコンを止めて」という起動ワード無し音声データから検出する。その結果、関連ワード検出部２４は、エアコンディショナ２Ａに関連する「エアコン」、「止め」という単語を操作関連ワードとして検出する。このようにして、関連ワード検出部２４が操作関連ワードを起動ワード無し音声データから検出することで、起動ワード無し音声データが音声認識エンジン３０に送信される。以降の処理内容については、第２セッションで説明した処理内容と同様であるため説明を省略する。 Then, the related word detection unit 24 of the voice operation device 10 detects operation-related words related to the related device 2 operated in response to the previous operation command from the newly acquired voice data without the activation word, "turn off the air conditioner". As a result, the related word detection unit 24 detects the words "air conditioner" and "turn off", which are related to the air conditioner 2A, as operation-related words. Because the related word detection unit 24 detects operation-related words from the voice data without the activation word in this way, the voice data without the activation word is transmitted to the voice recognition engine 30. The subsequent processing is the same as that described for the second session, so its description is omitted.

次に、第３セッションにおいて、乗員が「〇〇（起動ワード）、ＣＤを再生して」と発話する代わりに「ＣＤを再生して」と発話した場合について説明する。この場合、発話に起動ワードが含まれていないため、起動ワード無し音声データから操作関連ワードを検出する処理が関連ワード検出部２４によって行われる。すなわち、関連ワード検出部２４は、メモリ１３に格納されている機器別関連ワード定義テーブルＴＢ１を参照し、音声記憶部１３１に記憶されている前回（第２セッション）の音声データから機器別関連ワードを検出する。具体的には、「温度を１度下げて」という発話に含まれる「温度」、「度」という単語が機器別関連ワードとして検出され、前回の操作コマンドに応じて操作した関連機器２がエアコンディショナ２Ａに特定される。 Next, a case will be described in which, in the third session, the occupant says "play the CD" instead of "○○ (activation word), play the CD". In this case, since the utterance does not include the activation word, the related word detection unit 24 performs the process of detecting an operation-related word from the voice data without the activation word. That is, the related word detection unit 24 refers to the device-specific related word definition table TB1 stored in the memory 13 and detects device-specific related words from the previous (second session) voice data stored in the voice storage unit 131. Specifically, the words "temperature" and "degree" included in the utterance "lower the temperature by 1 degree" are detected as device-specific related words, and the related device 2 operated in response to the previous operation command is identified as the air conditioner 2A.

そして、音声操作装置１０の関連ワード検出部２４は、前回の操作コマンドに応じて操作した関連機器２（コンディショナ２Ａ）に関連する操作関連ワードを、新たに取得した「ＣＤを再生して」という起動ワード無し音声データから検出する。このケースにおいては、「ＣＤを再生して」という起動ワード無し音声データには、エアコンディショナ２Ａに関連する操作関連ワードが含まれていないため、操作関連ワードが関連ワード検出部２４によって検出されない。そのため、「ＣＤを再生して」という起動ワード無し音声データは音声認識エンジン３０に送信されず、音声認識エンジン３０は起動されない。すなわち、このケースにおいては、音声操作装置１０は発話者の発話に応答せず、オーディオ機器２Ｂを操作しない。 Then, the related word detection unit 24 of the voice operation device 10 attempts to detect operation-related words related to the related device 2 operated in response to the previous operation command (the air conditioner 2A) from the newly acquired voice data without the activation word, "play the CD". In this case, the voice data "play the CD" contains no operation-related word related to the air conditioner 2A, so the related word detection unit 24 detects no operation-related word. Therefore, the voice data without the activation word "play the CD" is not transmitted to the voice recognition engine 30, and the voice recognition engine 30 is not activated. That is, in this case, the voice operation device 10 does not respond to the speaker's utterance and does not operate the audio device 2B.

なお、上述した第２セッションの説明において、関連ワード検出部２４は、音声記憶部１３１に記憶されている前回の音声データから、関連機器２ごとに対応付けて予め定められる機器別関連ワードを検出することで、前回の操作コマンドに応じて操作した関連機器２を特定したが、関連ワード検出部２４は他の処理によって前回の操作コマンドに応じて操作した関連機器２を特定してもよい。例えば、関連ワード検出部２４は、コマンド情報記憶部１３２に記憶されている操作コマンドデータに基づいて前回の操作コマンドに応じて操作した関連機器２を特定してもよい。 In the above description of the second session, the related word detection unit 24 detects a predetermined device-specific related word associated with each related device 2 from the previous voice data stored in the voice storage unit 131. By doing so, the related device 2 operated in response to the previous operation command is specified, but the related word detection unit 24 may specify the related device 2 operated in response to the previous operation command by another process. For example, the related word detection unit 24 may specify the related device 2 operated in response to the previous operation command based on the operation command data stored in the command information storage unit 132.
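The two ways of identifying the previously operated related device 2 described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes the stored utterance is available as text, and the table contents, device labels, and function names are hypothetical stand-ins for the device-specific related word definition table TB1 and the stored operation command data.

```python
from typing import Optional

# Hypothetical stand-in for the device-specific related word definition table TB1.
DEVICE_WORDS = {
    "air conditioner 2A": ["temperature", "degree", "air conditioner"],
    "audio device 2B": ["cd", "volume", "play"],
}


def device_from_previous_utterance(previous_text: str) -> Optional[str]:
    """Pick the device whose related words occur most often in the previous utterance."""
    text = previous_text.lower()
    best, best_hits = None, 0
    for device, words in DEVICE_WORDS.items():
        hits = sum(1 for word in words if word in text)
        if hits > best_hits:
            best, best_hits = device, hits
    return best  # None when no related word was found


def device_from_previous_command(command: Optional[dict]) -> Optional[str]:
    """Alternative: read the operated device directly from the stored command data."""
    return command.get("object") if command else None


by_words = device_from_previous_utterance("Lower the temperature by 1 degree")
by_command = device_from_previous_command(
    {"object": "air conditioner 2A", "action": "lower by 1 degree"}
)
```

Both calls identify the air conditioner 2A for the second-session example. The second variant avoids re-scanning the stored utterance, at the cost of keeping the command data on the device side.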

次に、車両１の乗員が発話した際に、音声操作装置１０から音声認識エンジン３０に音声データを送信するときにプロセッサ１２が実行する制御フローについて説明する。図１２は、音声操作装置１０が音声認識エンジン３０に音声データを送信する際の制御内容を示すフローチャートである。ステップＳ１０において、車両１の乗員が発話した音声データをプロセッサ１２（音声取得部２１）が新たに取得する（音声取得工程）。また、プロセッサ１２（音声取得部２１）は、取得した発話者の音声データをメモリ１３の音声記憶部１３１に記憶させる。 Next, the control flow executed by the processor 12 when the voice data is transmitted from the voice control device 10 to the voice recognition engine 30 when the occupant of the vehicle 1 speaks will be described. FIG. 12 is a flowchart showing a control content when the voice operation device 10 transmits voice data to the voice recognition engine 30. In step S10, the processor 12 (voice acquisition unit 21) newly acquires the voice data spoken by the occupant of the vehicle 1 (voice acquisition step). Further, the processor 12 (voice acquisition unit 21) stores the acquired voice data of the speaker in the voice storage unit 131 of the memory 13.

次に、ステップＳ２０において、プロセッサ１２（起動ワード検出部２２）は、新たに取得した音声データに起動ワードが含まれているか否かを判定する。プロセッサ１２（起動ワード検出部２２）は、新たに取得した音声データに起動ワードが含まれていると判定した場合、音声データから起動ワードを検出し、ステップＳ３０に進む。一方、新たに取得した音声データに起動ワードが含まれず、プロセッサ１２（起動ワード検出部２２）が起動ワードを当該音声データから検出しなかった場合には、ステップＳ４０に進む。なお、この場合には、新たに取得した音声データは、起動ワード無し音声データであったことを意味する。 Next, in step S20, the processor 12 (activation word detection unit 22) determines whether or not the newly acquired voice data includes the activation word. When the processor 12 (activation word detection unit 22) determines that the newly acquired voice data includes the activation word, it detects the activation word from the voice data and proceeds to step S30. On the other hand, if the newly acquired voice data does not include the activation word and the processor 12 (activation word detection unit 22) does not detect the activation word from the voice data, the process proceeds to step S40. In this case, the newly acquired voice data is voice data without the activation word.

ステップＳ３０において、プロセッサ１２（音声データ送信部２３）は、新たに取得した音声データのうち、起動ワードに後続する発話部分に関する要求内容音声データを、ＩＤ識別情報と併せて音声認識エンジン３０に送信する。これにより、音声操作装置１０から要求内容音声データを受け取った音声認識エンジン３０は、要求内容音声データに対して音声認識処理や自然言語解析処理などを行うことで、発話者の要求内容を示す発話者要求データを生成することができる。プロセッサ１２は、ステップＳ３の処理が終了すると、本制御フローに係るルーチンを終了する。 In step S30, the processor 12 (voice data transmission unit 23) transmits, out of the newly acquired voice data, the request content voice data for the utterance portion following the activation word to the voice recognition engine 30 together with the ID identification information. The voice recognition engine 30, having received the request content voice data from the voice operation device 10, can then generate speaker request data indicating the speaker's request by performing voice recognition processing, natural language analysis processing, and the like on the request content voice data. When the process of step S30 is completed, the processor 12 ends the routine of this control flow.

ステップＳ４０において、プロセッサ１２（関連ワード検出部２４）は、前回の操作コマンドに応じて操作した関連機器２に関連する操作関連ワードを、新たに取得した起動ワード無し音声データから検出する（関連ワード検出工程）。本ステップにおいて、プロセッサ１２（関連ワード検出部２４）は、上記のように、メモリ１３の音声記憶部１３１に記憶されている前回の音声データから、関連機器２ごとに対応付けて予め定められる機器別関連ワードを検出することで、前回の操作コマンドに応じて操作した関連機器２を特定することができる。或いは、プロセッサ１２（関連ワード検出部２４）は、メモリ１３のコマンド情報記憶部１３２に記憶されている操作コマンドデータに基づいて前回の操作コマンドに応じて操作した関連機器２を特定してもよい。 In step S40, the processor 12 (related word detection unit 24) detects an operation-related word related to the related device 2 operated in response to the previous operation command from the newly acquired voice data without the activation word (related word detection step). In this step, as described above, the processor 12 (related word detection unit 24) can identify the related device 2 operated in response to the previous operation command by detecting, from the previous voice data stored in the voice storage unit 131 of the memory 13, a device-specific related word predetermined in association with each related device 2. Alternatively, the processor 12 (related word detection unit 24) may identify the related device 2 operated in response to the previous operation command based on the operation command data stored in the command information storage unit 132 of the memory 13.

そして、ステップＳ４０においては、プロセッサ１２（関連ワード検出部２４）は、前回の操作コマンドに応じて操作した関連機器２に関連する操作関連ワードが、新たに取得した起動ワード無し音声データに含まれているか否かを判定する。そして、プロセッサ１２（関連ワード検出部２４）は、新たに取得した起動ワード無し音声データに上記操作関連ワードが含まれていない場合、すなわち新たに取得した起動ワード無し音声データから操作関連ワードを検出しなかった場合、起動ワード無し音声データは前回の操作コマンドによる操作に関連しない発話内容であると判断されるため、本制御フローに係るルーチンを終了する。この場合、起動ワード無し音声データは音声操作装置１０から音声認識エンジン３０に送信されず、音声認識エンジン３０は起動されない。 Then, in step S40, the processor 12 (related word detection unit 24) determines whether or not the newly acquired voice data without the activation word includes an operation-related word related to the related device 2 operated in response to the previous operation command. If the newly acquired voice data without the activation word does not include such an operation-related word, that is, if no operation-related word is detected from it, the voice data is judged to be an utterance unrelated to the operation performed by the previous operation command, and the processor 12 (related word detection unit 24) ends the routine of this control flow. In this case, the voice data without the activation word is not transmitted from the voice operation device 10 to the voice recognition engine 30, and the voice recognition engine 30 is not activated.

一方、ステップＳ４０において、プロセッサ１２（関連ワード検出部２４）は、新たに取得した起動ワード無し音声データに操作関連ワードが含まれていると判定した場合、起動ワード無し音声データから操作関連ワードを検出し、ステップＳ５０に進む。ステップＳ５０において、プロセッサ１２（音声データ送信部２３）は、新たに取得した起動ワード無し音声データを、ＩＤ識別情報と併せて音声認識エンジン３０に送信する（音声データ送信工程）。これにより、音声操作装置１０から起動ワード無し音声データを受け取った音声認識エンジン３０は、起動ワード無し音声データに対して音声認識処理や自然言語解析処理などを行うことで、発話者の要求内容を示す発話者要求データを生成することができる。プロセッサ１２は、ステップＳ５の処理が終了すると、本制御フローに係るルーチンを終了する。 On the other hand, when the processor 12 (related word detection unit 24) determines in step S40 that the newly acquired voice data without the activation word includes an operation-related word, it detects the operation-related word from that voice data and proceeds to step S50. In step S50, the processor 12 (voice data transmission unit 23) transmits the newly acquired voice data without the activation word to the voice recognition engine 30 together with the ID identification information (voice data transmission step). The voice recognition engine 30, having received the voice data without the activation word from the voice operation device 10, can then generate speaker request data indicating the speaker's request by performing voice recognition processing, natural language analysis processing, and the like on that voice data. When the process of step S50 is completed, the processor 12 ends the routine of this control flow.
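The branching of steps S20 to S50 can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the utterance is treated as already-transcribed text rather than voice data, the activation word "hey assistant" is a placeholder (the source writes only "〇〇"), the activation word is assumed to appear at the start of the utterance, and the word table and the name `route_utterance` are hypothetical.

```python
from typing import Optional

ACTIVATION_WORD = "hey assistant"  # placeholder; not specified in the source

# Hypothetical operation-related words per related device.
OPERATION_WORDS = {
    "air conditioner 2A": ["air conditioner", "temperature", "degree", "turn off"],
    "audio device 2B": ["cd", "volume", "play"],
}


def route_utterance(text: str, previous_device: Optional[str]) -> Optional[str]:
    """Return the text to send to the recognition engine, or None to send nothing."""
    normalized = text.lower().strip()
    # S20: does the utterance contain the activation word?
    if normalized.startswith(ACTIVATION_WORD):
        # S30: send only the portion following the activation word.
        return normalized[len(ACTIVATION_WORD):].lstrip(" ,")
    # S40: look for words related to the previously operated device.
    words = OPERATION_WORDS.get(previous_device, []) if previous_device else []
    if any(word in normalized for word in words):
        # S50: send the whole utterance without the activation word.
        return normalized
    # No related word: the engine is not activated.
    return None


with_word = route_utterance("Hey assistant, play the CD", "air conditioner 2A")
related = route_utterance("Turn off the air conditioner", "air conditioner 2A")
unrelated = route_utterance("Play the CD", "air conditioner 2A")
```

The three calls mirror the three third-session cases in the description: the request following the activation word is forwarded, the follow-up about the air conditioner is forwarded as a whole, and the unrelated "play the CD" utterance yields `None`, so nothing is sent.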

上記制御フローで説明したように、本実施形態に係る音声操作装置１０は、車両１の乗員（ユーザ）が発した発話に関する音声データを取得し、新たに取得した音声データが起動ワードを含まない起動ワード無し音声データである場合に、前回の操作コマンドに応じて操作した関連機器に関連する操作関連ワードを新たに取得した起動ワード無し音声データから検出する。そして、起動ワード無し音声データから操作関連ワードを検出した場合には、起動ワード無し音声データを音声認識エンジンに送信することを特徴とする。そのため、ユーザは、音声認識エンジン３０が待機状態のときに関連機器２を音声操作する際、音声操作の度に起動ワード（ＷｕＷ）を発声しなくても関連機器２を音声入力によって操作することができ、ユーザの利便性を高めることができる。すなわち、ユーザビリティの優れた音声操作装置１０および音声操作システム１００を提供することができる。 As described in the above control flow, the voice operation device 10 according to the present embodiment acquires voice data of an utterance made by the occupant (user) of the vehicle 1 and, when the newly acquired voice data is voice data without the activation word, detects from it an operation-related word related to the related device operated in response to the previous operation command. When an operation-related word is detected from the voice data without the activation word, that voice data is transmitted to the voice recognition engine. Therefore, when operating the related device 2 by voice while the voice recognition engine 30 is in the standby state, the user can operate the related device 2 by voice input without uttering the activation word (WuW) for every operation, which improves user convenience. That is, a voice operation device 10 and a voice operation system 100 with excellent usability can be provided.

また、ユーザの発話に起動ワードが含まれない場合、その発話内容が前回の音声操作に関連しない場合には起動ワード無し音声データを音声認識エンジン３０に送信せず、発話内容が前回の音声操作に関連する場合に起動ワード無し音声データを音声認識エンジン３０に送信し、音声認識エンジン３０を起動するようにしたので、通信負荷や処理負荷の過度な増加を招くことを抑制しつつ、関連機器２の音声操作を実現することができる。以上より、本実施形態に係る音声操作装置１０およびこれを含む音声操作システム１００は、従来に比べて通信負荷や処理負荷の過度な増加を招くことを抑制しつつユーザビリティを向上することができる。また、音声操作装置１０およびこれを含む音声操作システム１００によれば、音声認識エンジン３０として、既存の音声認識エンジンに大掛かりな改変を加えることなく使用することができるため、システムの開発コスト、構築コストを抑えることができる。 Further, when the user's utterance does not include the activation word, the voice data without the activation word is not transmitted to the voice recognition engine 30 if the utterance is unrelated to the previous voice operation, and is transmitted to the voice recognition engine 30, which is thereby activated, only if the utterance is related to the previous voice operation. Voice operation of the related device 2 can therefore be realized while suppressing an excessive increase in communication load and processing load. Accordingly, the voice operation device 10 according to the present embodiment and the voice operation system 100 including it can improve usability compared with the conventional case while suppressing an excessive increase in communication load and processing load. Further, with the voice operation device 10 and the voice operation system 100 including it, an existing voice recognition engine can be used as the voice recognition engine 30 without major modification, so the development and construction costs of the system can be kept down.

また、音声操作装置１０のプロセッサ１２（関連ワード検出部２４）は、前回の操作コマンドに応じて操作した関連機器２に関連する操作関連ワードを新たに取得した起動ワード無し音声データから検出する際、メモリ１３の音声記憶部１３１に記憶されている前回の音声データから、関連機器２ごとに対応付けて予め定められた機器別関連ワードを検出することで前回の操作コマンドに応じて操作した関連機器２を特定するようにした。これによれば、関連機器２ごとに関連付けられた機器別関連ワードに基づいて、前回の操作コマンドによる操作対象の関連機器２を容易に特定することができる。よって、ユーザが起動ワードを発声せずに発話した場合に、その発話内容が前回の操作コマンドによる操作に関連する発話かどうかを容易に判定し、その発話に応答して関連機器２を操作すべきかどうかを容易に判定できる。 Further, when detecting an operation-related word related to the related device 2 operated in response to the previous operation command from the newly acquired voice data without the activation word, the processor 12 (related word detection unit 24) of the voice operation device 10 identifies that related device 2 by detecting, from the previous voice data stored in the voice storage unit 131 of the memory 13, a device-specific related word predetermined in association with each related device 2. This makes it easy to identify the related device 2 targeted by the previous operation command based on the device-specific related words associated with each related device 2. Therefore, when the user speaks without uttering the activation word, it can be easily determined whether the utterance is related to the operation performed by the previous operation command, and hence whether the related device 2 should be operated in response to the utterance.

また、音声操作装置１０のプロセッサ１２（関連ワード検出部２４）は、前回の操作コマンドに応じて操作した関連機器２に関連する操作関連ワードを新たに取得した起動ワード無し音声データから検出する際、メモリ１３のコマンド情報記憶部１３２に記憶されている操作コマンドデータに基づいて前回の操作コマンドに応じて操作した関連機器２を特定するようにした。このように、コマンド情報記憶部１３２に記憶されている操作コマンドデータを参照することで、前回の操作コマンドによる操作対象の関連機器２を容易に特定することができる。よって、ユーザが起動ワードを発声せずに発話した場合に、その発話内容が前回の操作コマンドによる操作に関連する発話かどうかを容易に判定し、その発話に応答して関連機器２を操作すべきかどうかを容易に判定できる。 Further, when detecting an operation-related word related to the related device 2 operated in response to the previous operation command from the newly acquired voice data without the activation word, the processor 12 (related word detection unit 24) of the voice operation device 10 may instead identify that related device 2 based on the operation command data stored in the command information storage unit 132 of the memory 13. By referring to the operation command data stored in the command information storage unit 132 in this way, the related device 2 targeted by the previous operation command can be easily identified. Therefore, when the user speaks without uttering the activation word, it can be easily determined whether the utterance is related to the operation performed by the previous operation command, and hence whether the related device 2 should be operated in response to the utterance.

以上、本発明の実施の形態を説明したが、これらはあくまで例示にすぎず、本発明はこれらに限定されるものではなく、上記構成を組み合わせるなど、特許請求の範囲の趣旨を逸脱しない限りにおいて、当業者の知識に基づく種々の変更が可能である。 Although embodiments of the present invention have been described above, they are merely examples, and the present invention is not limited to them. Various modifications based on the knowledge of those skilled in the art, such as combining the above configurations, are possible as long as they do not depart from the spirit of the claims.

例えば、上記実施形態では、音声操作装置１０が車両に搭載される車載機として構成される例を説明したが、これには限られない。すなわち、音声操作装置１０は車載機でなくてもよく、例えばスマートフォンや、タブレットＰＣ、スマートスピーカ等であってもよい。また、音声操作システム１００は、音声認識エンジン３０およびコマンド生成サーバ５０が単一のサーバ装置として提供されてもよい。また、音声認識エンジン３０の機能が音声操作装置１０に備えられていてもよいし、コマンド生成サーバ５０の機能が音声認識エンジン３０に備えられていてもよい。 For example, in the above embodiment, the example in which the voice control device 10 is configured as an in-vehicle device mounted on a vehicle has been described, but the present invention is not limited to this. That is, the voice operation device 10 does not have to be an in-vehicle device, and may be, for example, a smartphone, a tablet PC, a smart speaker, or the like. Further, in the voice operation system 100, the voice recognition engine 30 and the command generation server 50 may be provided as a single server device. Further, the function of the voice recognition engine 30 may be provided in the voice operation device 10, or the function of the command generation server 50 may be provided in the voice recognition engine 30.

また、本実施形態における音声操作装置１０、音声認識エンジン３０およびコマンド生成サーバ５０において実現される各処理は、これら各々のプロセッサがメモリに記憶されている各種アプリケーションプログラムを実行することによって実現されている。また、音声操作装置１０、音声認識エンジン３０およびコマンド生成サーバ５０における各々のプロセッサに上述した各処理を実行させるプログラムは、コンピュータが読み取り可能な記録媒体に記録することができる。ここで、コンピュータ読み取り可能な記録媒体とは、データやプログラム等の情報を電気的、磁気的、光学的、機械的、または化学的作用によって蓄積し、コンピュータから読み取ることができる記録媒体をいう。このような記録媒体のうちコンピュータから取り外し可能なものとしては、例えばフレキシブルディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ／Ｗ、ＤＶＤ、ＤＡＴ、８ｍｍテープ、メモリカード等がある。また、コンピュータに固定された記録媒体としてハードディスクやＲＯＭ等がある。 Further, each process performed by the voice operation device 10, the voice recognition engine 30, and the command generation server 50 in the present embodiment is realized by the respective processor executing various application programs stored in memory. A program that causes the respective processors of the voice operation device 10, the voice recognition engine 30, and the command generation server 50 to execute the above-described processes can be recorded on a computer-readable recording medium. Here, a computer-readable recording medium is a recording medium that stores information such as data and programs by electrical, magnetic, optical, mechanical, or chemical action and can be read by a computer. Examples of such recording media that are removable from the computer include flexible disks, magneto-optical disks, CD-ROMs, CD-R/Ws, DVDs, DATs, 8 mm tapes, and memory cards. Recording media fixed to the computer include hard disks and ROMs.

１・・・車両
２・・・関連機器
３・・・マイクロフォン
４・・・スピーカ
１０・・・音声操作装置
２１・・・音声取得部
２２・・・起動ワード検出部
２３・・・音声データ送信部
２４・・・関連ワード検出部
２５・・・コマンド取得部
２６・・・応答用音声取得部
２７・・・操作処理部
２８・・・応答音声出力部
３０・・・音声認識エンジン
５０・・・コマンド生成サーバ
１００・・・音声操作システム
1 ... Vehicle
2 ... Related equipment
3 ... Microphone
4 ... Speaker
10 ... Voice control device
21 ... Voice acquisition unit
22 ... Activation word detection unit
23 ... Voice data transmission unit
24 ... Related word detection unit
25 ... Command acquisition unit
26 ... Response voice acquisition unit
27 ... Operation processing unit
28 ... Response voice output unit
30 ... Voice recognition engine
50 ... Command generation server
100 ... Voice operation system

Claims

A voice operation device that, when voice data of an utterance made by a user includes an activation word for activating a voice recognition engine, transmits voice data including at least the utterance following the activation word to the voice recognition engine, and operates a related device according to an operation command generated based on a result of voice recognition processing of the voice data by the voice recognition engine, the voice operation device comprising:
a voice acquisition unit that acquires voice data of an utterance made by the user;
a related word detection unit that, when the voice data newly acquired by the voice acquisition unit is voice data without the activation word, detects, from the newly acquired voice data without the activation word, an operation-related word related to the related device operated in response to the previous operation command; and
a voice data transmission unit that transmits the voice data without the activation word to the voice recognition engine when the related word detection unit detects the operation-related word from the voice data without the activation word.

The voice operation device according to claim 1, further comprising a voice storage unit that stores the voice data acquired by the voice acquisition unit,
wherein the related word detection unit identifies the related device operated in response to the previous operation command by detecting, from the previous voice data stored in the voice storage unit, a device-specific related word predetermined in association with each related device.

The voice operation device according to claim 1, further comprising a command information storage unit that stores operation command information on the operation command generated based on the result of voice recognition processing of the voice data by the voice recognition engine,
wherein the related word detection unit identifies the related device operated in response to the previous operation command based on the operation command information stored in the command information storage unit.

A voice operation system comprising: the voice operation device according to any one of claims 1 to 3; the voice recognition engine; and an operation command generation server that generates, based on the result of voice recognition processing of the voice data by the voice recognition engine, the operation command to be transmitted to the voice operation device.

A voice operation method executed by a voice operation device that, when voice data of an utterance made by a user includes an activation word for activating a voice recognition engine, transmits voice data including at least the utterance following the activation word to the voice recognition engine, and operates a related device according to an operation command generated based on a result of voice recognition processing of the voice data by the voice recognition engine, the method comprising:
a voice acquisition step of acquiring voice data of an utterance made by the user;
a related word detection step of, when the voice data newly acquired in the voice acquisition step is voice data without the activation word, detecting, from the newly acquired voice data without the activation word, an operation-related word related to the related device operated in response to the previous operation command; and
a voice data transmission step of transmitting the voice data without the activation word to the voice recognition engine when the operation-related word is detected from the voice data without the activation word in the related word detection step.