JP2015069086A

JP2015069086A - Voice recognition device and voice recognition program

Info

Publication number: JP2015069086A
Application number: JP2013204500A
Authority: JP
Inventors: 邦宏伊藤; Kunihiro Ito; 智己片野; Tomoki Katano; 美佑紀楠田; Miyuki Kusuda
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2013-09-30
Filing date: 2013-09-30
Publication date: 2015-04-13

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device and a voice recognition program capable of executing voice recognition processing hands-free. SOLUTION: A CPU acquires first voice data based on a voice uttered by a user (S11) and determines a first volume value based on the first voice data (S13). When the first volume value is larger than a first reference value (S13: YES), the CPU starts acquiring second voice data (S15) and determines whether a second volume value based on the acquired second voice data is larger than a second reference value (S18). When the second volume value is larger than the second reference value (S18: YES), the CPU determines whether the second volume value becomes smaller than a third reference value within a second monitoring time (S23). When the second volume value becomes smaller than the third reference value within the second monitoring time (S23: YES), the CPU recognizes that a voice corresponding to the second voice data has been uttered by the user and executes the subsequent predetermined processing.

Description

The present invention relates to a voice recognition device and a voice recognition program capable of voice recognition.

Conventionally, for image display devices that are worn on a user's head and display an image visible to the user, a technology for controlling the device based on a voice uttered by the user is known (for example, see Patent Document 1). The head mounted display (hereinafter referred to as "HMD") described in Patent Document 1 includes a voice signal processing circuit. The voice signal processing circuit acquires the user's voice input through a microphone located near the user's mouth and performs predetermined voice recognition processing. When the result of the voice recognition processing matches a keyword registered in advance, the HMD switches the image displayed on the image display unit. The voice signal processing circuit performs the voice recognition processing when a voice control changeover switch is turned on and does not perform it when the switch is turned off. The voice control changeover switch is located near the microphone and is operated by the user.

Patent Document 1: JP 2002-165156 A

However, in the HMD described in Patent Document 1, the user has to switch the voice recognition processing on and off by manually operating the voice control changeover switch. The voice recognition processing therefore cannot be started hands-free, and hands-free operability, which is one of the features of an HMD, is impaired.

An object of the present invention is to provide a voice recognition device and a voice recognition program capable of executing voice recognition processing hands-free.

The voice recognition device according to the first aspect of the present invention includes: first voice acquisition means for acquiring first voice data output from a microphone that outputs voice data in accordance with input voice; first volume determination means for determining a first volume value corresponding to the first voice data acquired by the first voice acquisition means; first volume judgment means for judging whether the first volume value is larger than a first reference value; second voice acquisition means for acquiring, when the first volume judgment means judges that the first volume value is larger than the first reference value, second voice data output from the microphone after the first voice data; first generation means for generating, based on the second voice data acquired by the second voice acquisition means, first result data corresponding to the second voice data; first result judgment means for judging whether the first result data generated by the first generation means includes reference data corresponding to a specific word; third voice acquisition means for acquiring, when the first result judgment means judges that the first result data includes the reference data, third voice data output from the microphone after the second voice data; second generation means for generating second result data indicating corresponding text data by executing voice recognition processing on the third voice data acquired by the third voice acquisition means; second result judgment means for referring to correspondence data, in which operation data indicated by predetermined text data is associated with processing data that is data about a process, and judging whether the text data indicated by the second result data generated by the second generation means includes data identical to the operation data; and execution means for executing, when the second result judgment means judges that the text data indicated by the second result data includes data identical to the operation data, a process based on the processing data associated with that operation data.

In the voice recognition device according to the first aspect of the present invention, the first volume judgment means first judges whether the first volume value is larger than the first reference value. If it is, the device then judges whether the first result data generated from the second voice data includes the reference data corresponding to the specific word. If the first result data includes that reference data, voice recognition processing is executed on the third voice data to generate the second result data, and the voice recognition device is controlled based on the generated second result data. The user can therefore cause the voice recognition device to execute voice recognition processing based on the third voice data hands-free, simply by uttering the voice corresponding to the first voice data and the voice corresponding to the second voice data.
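The three-stage flow of the first aspect can be sketched roughly as follows. This is only an illustrative outline: the function names, the threshold value, the use of peak amplitude as the "volume value", and the choice to pass already-recognized text for the second and third voice data are all assumptions made for the example and are not taken from the specification.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical threshold; the specification gives no numeric value.
FIRST_REFERENCE = 0.6


def volume_value(samples: Sequence[float]) -> float:
    """Peak absolute amplitude, used here as a stand-in for the 'volume value'."""
    return max((abs(s) for s in samples), default=0.0)


def handle_utterances(
    first_samples: Sequence[float],
    second_text: str,
    third_text: str,
    keyword: str,
    commands: Dict[str, Callable[[], str]],
) -> Optional[str]:
    """Return the result of the executed process, or None if any stage fails.

    second_text and third_text stand for the first and second result data
    (assumed here to be recognized text from an external recognizer).
    """
    # Stage 1: the first volume value must exceed the first reference value.
    if not volume_value(first_samples) > FIRST_REFERENCE:
        return None
    # Stage 2: the first result data must contain the reference data (keyword).
    if keyword not in second_text:
        return None
    # Stage 3: find operation data inside the second result data and run
    # the process associated with it.
    for operation, process in commands.items():
        if operation in third_text:
            return process()
    return None
```

Each stage acts as a gate for the next, so recognition of the third voice data is reached only after both the loud first voice and the keyword have been detected.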

The voice recognition device may include first period judgment means for judging, when the first volume judgment means judges that the first volume value is larger than the first reference value, whether a first period, which is the period during which the first volume value stays larger than the first reference value, is within a predetermined range. The first generation means may generate the first result data when the first period judgment means judges that the first period is within the predetermined range, and need not generate the first result data when the first period is judged not to be within the predetermined range.

To prevent malfunction of the voice recognition device and to improve its operability by voice, the voice corresponding to the first voice data may be a predetermined voice rather than an arbitrary one. When the voice corresponding to the first voice data is a predetermined voice, the range of time needed to utter it can be predicted. When the first period is not within the predetermined range, the voice recognition device regards the voice corresponding to the first voice data as not being the predetermined voice and can skip generation of the first result data by the first generation means. The voice recognition device can therefore simplify processing and prevent a malfunction in which the processing from the first generation means onward is erroneously executed in response to a voice other than the predetermined voice.
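As a rough illustration of the first-period check described above, the sketch below measures how long the volume stays above the first reference value and tests whether that duration falls inside an expected range. The frame-based representation, the function names, and all numeric values are assumptions for the example only.

```python
from typing import Sequence


def first_period_seconds(
    volumes: Sequence[float], first_reference: float, frame_seconds: float
) -> float:
    """Duration of the first contiguous run of frames louder than the reference."""
    frames = 0
    for v in volumes:
        if v > first_reference:
            frames += 1
        elif frames:
            # The loud run has ended; stop counting.
            break
    return frames * frame_seconds


def period_in_range(period: float, shortest: float, longest: float) -> bool:
    """True when the measured period lies within the predetermined range."""
    return shortest <= period <= longest
```

A run that is too short (a click or cough) or too long (continuous noise) falls outside the range, so the later, more expensive stages can be skipped.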

The first aspect may include: second volume determination means for determining, when the first volume judgment means judges that the first volume value is larger than the first reference value, a second volume value corresponding to the second voice data acquired by the second voice acquisition means; second volume judgment means for judging whether the second volume value is larger than a second reference value; measurement means capable of measuring a fixed time that elapses from the point at which the second volume judgment means judges that the second volume value is larger than the second reference value; and second voice end judgment means for judging whether the second volume value is smaller than a third reference value. The first generation means need not generate the first result data if, within the fixed time measured by the measurement means, the second voice end judgment means does not judge that the second volume value is smaller than the third reference value.

Like the voice corresponding to the first voice data, the voice corresponding to the second voice data may be a voice indicating a specific word rather than an arbitrary voice. The range of time needed to utter the voice indicating the specific word can be predicted. When the second volume value is larger than the second reference value, the voice recognition device regards the voice indicating the specific word as having been uttered as the voice corresponding to the second voice data, and measures the time elapsed from that point. If the second volume value does not fall below the third reference value within that time, the device regards the voice corresponding to the second voice data as not being the predetermined voice and can skip generation of the first result data by the first generation means. That is, before the first result judgment means makes its judgment, subsequent processing can be eliminated for second voice data that is expected to yield first result data other than the specific result. The voice recognition device can therefore simplify processing and make it more efficient.
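The start-and-end check on the second voice can be sketched as a small loop over per-frame volume values. The frame representation, the parameter names, and the numbers in the usage note are hypothetical; the specification does not state concrete values.

```python
from typing import Optional, Sequence


def second_voice_ended_in_time(
    volumes: Sequence[float],
    second_reference: float,
    third_reference: float,
    frame_seconds: float,
    monitoring_seconds: float,
) -> bool:
    """True if the volume rises above second_reference and then drops below
    third_reference before monitoring_seconds have elapsed."""
    start: Optional[int] = None
    for i, v in enumerate(volumes):
        if start is None:
            if v > second_reference:
                start = i  # timing starts when the second voice begins
            continue
        elapsed = (i - start) * frame_seconds
        if elapsed > monitoring_seconds:
            return False  # still loud after the monitoring time: reject
        if v < third_reference:
            return True  # the utterance ended in time
    return False
```

For example, with 0.1-second frames and a 0.5-second monitoring time, the sequence `[0.1, 0.6, 0.7, 0.1]` is accepted, while twenty consecutive loud frames are rejected.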

The first generation means need not generate the first result data if the second volume judgment means does not judge that the second volume value is larger than the second reference value within a predetermined period after the first volume judgment means judges that the first volume value is larger than the first reference value.

To cause the voice recognition device to execute voice recognition processing based on the third voice data, the user utters the predetermined voice corresponding to the first voice data and the voice indicating the specific word corresponding to the second voice data. If the user does not utter the voice corresponding to the second voice data within a certain time after uttering the predetermined voice corresponding to the first voice data, the user presumably has no intention of having the device execute voice recognition processing based on the third voice data. If the second volume value is not judged to be larger than the second reference value within a certain period after the first volume value is judged to be larger than the first reference value, the voice recognition device regards the voice corresponding to the second voice data as not having been uttered by the user and can skip generation of the first result data by the first generation means. The voice recognition device can therefore simplify processing and prevent the processing delay that would result from continuing to wait for the second voice data.
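The waiting-time check between the first and second voices can be illustrated with the following sketch. It assumes that `volumes` holds the per-frame volume values observed after the first volume value exceeded the first reference value; the names and the frame-based timing are assumptions for the example.

```python
from typing import Sequence


def second_voice_started_in_time(
    volumes: Sequence[float],
    second_reference: float,
    frame_seconds: float,
    wait_seconds: float,
) -> bool:
    """True if some frame exceeds second_reference before the wait time expires."""
    for i, v in enumerate(volumes):
        if i * frame_seconds > wait_seconds:
            return False  # waited too long: treat as no second voice
        if v > second_reference:
            return True
    return False
```

Returning early on timeout lets the device go back to listening for a new first voice instead of waiting indefinitely.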

The reference data may be the operation data, and the voice recognition device may include storage means that stores first correspondence data, in which the reference data and the processing data are associated with each other, and second correspondence data different from the first correspondence data. The first generation means may generate the first result data indicating corresponding text data by executing voice recognition processing on the second voice data. The first result judgment means may refer to the first correspondence data and judge whether the text data indicated by the first result data generated by the first generation means includes data identical to the reference data. The second result judgment means may refer to the second correspondence data and judge whether the text data indicated by the second result data generated by the second generation means includes data identical to the operation data.

In this case, the first result judgment means refers to the first correspondence data and judges whether the text data indicated by the first result data includes data identical to the reference data. The second result judgment means refers to the second correspondence data and judges whether the text data indicated by the second result data includes data identical to the operation data. Switching the correspondence data referred to between the judgment by the first result judgment means and the judgment by the second result judgment means improves the accuracy of each judgment. The voice recognition device can therefore be controlled more reliably.

The number of pairs of reference data and processing data included in the first correspondence data may be smaller than the number of pairs of operation data and processing data included in the second correspondence data. In this case, the accuracy and speed of the judgment by the first result judgment means in particular can be improved.
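The use of two correspondence tables of different sizes can be sketched as two dictionaries searched by the same lookup function. Every entry below is a hypothetical placeholder; the specification's actual table contents are not reproduced here.

```python
from typing import Dict, Optional

# Hypothetical first correspondence data: few entries, checked against the
# text recognized from the second voice data.
FIRST_CORRESPONDENCE: Dict[str, str] = {
    "voice operation start": "enter_command_mode",
}

# Hypothetical second correspondence data: more entries, checked against the
# text recognized from the third voice data.
SECOND_CORRESPONDENCE: Dict[str, str] = {
    "take photo": "capture_image",
    "zoom in": "enlarge_display",
    "next page": "show_next_page",
}


def find_process(text: str, table: Dict[str, str]) -> Optional[str]:
    """Return the processing data whose operation data appears in text."""
    for operation, process in table.items():
        if operation in text:
            return process
    return None
```

Keeping the first table small means the keyword stage has fewer candidates to confuse, while the larger second table is consulted only after the keyword has been accepted.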

The first reference value may be larger than the second reference value. In this case, to cause the voice recognition device to execute various processes based on the third voice, the user needs to utter the voice corresponding to the first voice data more loudly than the voice corresponding to the second voice data. This prevents various processes from being erroneously executed based on the first voice data.

The voice recognition program according to the second aspect of the present invention causes a computer to execute: a first voice acquisition step of acquiring first voice data output from a microphone that outputs voice data in accordance with input voice; a first volume determination step of determining a first volume value corresponding to the first voice data acquired in the first voice acquisition step; a first volume judgment step of judging whether the first volume value is larger than a first reference value; a second voice acquisition step of acquiring, when the first volume value is judged in the first volume judgment step to be larger than the first reference value, second voice data output from the microphone after the first voice data; a first generation step of generating, based on the second voice data acquired in the second voice acquisition step, first result data corresponding to the second voice data; a first result judgment step of judging whether the first result data generated in the first generation step includes reference data corresponding to a specific word; a third voice acquisition step of acquiring, when the first result data is judged in the first result judgment step to include the reference data, third voice data output from the microphone after the second voice data; a second generation step of generating second result data indicating corresponding text data by executing voice recognition processing on the third voice data acquired in the third voice acquisition step; a second result judgment step of referring to correspondence data, in which operation data indicated by predetermined text data is associated with processing data that is data about a process, and judging whether the text data indicated by the second result data generated in the second generation step includes data identical to the operation data; and an execution step of executing, when the text data indicated by the second result data is judged in the second result judgment step to include data identical to the operation data, a process based on the processing data associated with that operation data. In this case, the same effects as those of the first aspect can be obtained when the computer of the voice recognition device executes the voice recognition program of the second aspect.

The drawings are as follows: a perspective view showing the external appearance of the HMD 1 (FIG. 1); a block diagram showing the electrical configuration of the HMD 1 and the server 80 (FIG. 2); a data configuration diagram of the first correspondence data 95; a data configuration diagram of the second correspondence data 96; two flowcharts showing the voice determination process of the voice recognition program; and a flowchart showing the voice operation process executed within the voice determination process.

Embodiments of the present invention will be described below with reference to the drawings. The drawings referred to are used to explain technical features that the present invention can adopt. The configurations of the illustrated devices are merely illustrative examples and are not intended to limit the invention to those forms.

As shown in FIG. 1, the HMD 1 is an example of a voice recognition device according to the present invention. The HMD 1 includes a projection device (hereinafter referred to as "head display" or "HD") 10 and a control device (hereinafter referred to as "control box" or "CB") 50. When the HMD 1 is used, the HD 10 is worn, for example, on the user's head. The CB 50 is worn at a position different from the HD 10 (for example, on the user's waist belt or arm). The HMD 1 can also be connected to the server 80 shown in FIG. 2 via wireless or wired communication. In the following description, the upper side, lower side, lower right, upper left, upper right, and lower left of FIG. 1 correspond to the upper, lower, front, rear, right, and left sides of the HMD 1, respectively. To aid understanding of the positional and directional relationships among the components in the embodiment, these directions are indicated in FIG. 1 with reference to the axes of a three-dimensional Cartesian coordinate system.

The HD 10 is attached to the glasses 5, which are a dedicated mounting fixture. The HD 10 may instead be attached to another mounting fixture, such as glasses the user wears daily, a helmet, or headphones. The HD 10 emits image light rearward. The image light enters a light receiving object (for example, the eye of the user wearing the HD 10). The HD 10 is detachably connected to the CB 50 via a harness 7. The CB 50 controls the HD 10.

The configuration of the HD 10 will be described. The HD 10 includes a housing 2. The housing 2 includes a half mirror 3 on its left side. The half mirror 3 is positioned in front of the user's eye (for example, the left eye) when the HMD 1 is worn on the user's head. The HD 10 includes an image display unit 14 (see FIG. 2) and an eyepiece optical system (not shown) inside the housing 2. The image display unit 14 displays an image based on a video signal transmitted from the CB 50 via the harness 7. The image display unit 14 is, for example, a spatial modulation element such as a liquid crystal element together with a light source. The image display unit 14 may instead be a retinal scanning display unit that displays an image by two-dimensionally scanning laser light whose intensity corresponds to an image signal, a liquid crystal display, an organic EL (organic electro-luminescence) display, or the like. The HD 10 includes a camera 20. The camera 20 captures the outside scenery in front of the HD 10.

The eyepiece optical system condenses the image light representing the image displayed on the image display unit 14 and emits it from the left end of the housing 2 toward the half mirror 3. At least part (for example, half) of the image light emitted from the eyepiece optical system is reflected rearward by the half mirror 3 provided on the left side of the housing 2. When the HMD 1 is worn on the user's head, the image light reflected by the half mirror 3 enters one of the user's eyeballs (for example, the left one; not shown). Because the half mirror 3 transmits at least part of the light from the real scene outside, the user sees the image superimposed on the real scene in the user's field of view.

The glasses 5 hold the HD 10 on the user's head. The glasses 5 include a support portion 4 at the right end of the upper surface of the rim portion of the frame 6 that supports the left-eye lens. The support portion 4 holds the housing 2 of the HD 10 and attaches the housing 2 to the glasses 5. The support portion 4 can adjust the holding position of the housing 2 in the up-down and left-right directions. By adjusting the holding position of the housing 2, the half mirror 3 is placed at a position matched to the position of the user's eyeball.

The glasses 5 include a headset 16, which has a microphone 17 and an earphone 18, on the temple portion of the frame 6 that rests on the right ear. As described in detail later, the HMD 1 can accept various operations during use by acquiring, from the microphone 17 of the headset 16, voice that contains operation data associated with processing data, which is data about an operation executed by the HMD 1. The headset 16 is not limited to an ordinary air-conduction microphone and speaker and may be of a bone-conduction type. Because the shape of the glasses 5 themselves is similar to that of ordinary glasses, a detailed description is omitted.

The configuration of the CB 50 will be described. The CB 50 has a substantially rectangular parallelepiped housing. The CB 50 includes an operation unit 61 that includes a power switch 62 with a built-in power lamp 63. Operating the power switch 62 turns the power of the HMD 1 on or off. Various settings of the HD 10 and various operations during use are input through the operation unit 61.

The CB 50 can connect to the server 80 shown in FIG. 2 via known wireless communication (for example, wireless LAN communication through a predetermined access point) and can exchange various data, including voice data and image data, with the server 80. The CB 50 may include a wired communication interface and connect to the server 80 through the network 9 (see FIG. 2) using a communication cable. Alternatively, the CB 50 may include a USB interface and connect to the server 80 using a USB cable. Instead of the server 80, the CB 50 may connect to another device on the same LAN, such as a personal computer, a smartphone, or a tablet terminal.

図２を参照し、ＨＭＤ１の電気的構成について説明する。ＨＤ１０は、ＨＤ１０全体の制御を行うＣＰＵ１１を備える。ＣＰＵ１１は、ＲＡＭ１２、フラッシュＲＯＭ１３、画像表示部１４、インターフェイス１５、および接続コントローラ１９に電気的に接続される。ＣＰＵ１１は、インターフェイス１５を介してカメラ２０およびヘッドセット１６に電気的に接続される。ＲＡＭ１２は、各種データを一時的に記憶する。フラッシュＲＯＭ１３は、ＣＰＵ１１が実行する各種プログラム等を記憶する。各種プログラムは、ＨＤ１０の出荷時にフラッシュＲＯＭ１３に記憶される。 The electrical configuration of the HMD 1 will be described with reference to FIG. The HD 10 includes a CPU 11 that controls the entire HD 10. The CPU 11 is electrically connected to the RAM 12, flash ROM 13, image display unit 14, interface 15, and connection controller 19. The CPU 11 is electrically connected to the camera 20 and the headset 16 via the interface 15. The RAM 12 temporarily stores various data. The flash ROM 13 stores various programs executed by the CPU 11. Various programs are stored in the flash ROM 13 when the HD 10 is shipped.

画像表示部１４は前述の通り、映像信号に基づいて画像を表示する。インターフェイス１５はカメラ２０およびヘッドセット１６と電気的に接続し、信号の入出力を制御する。接続コントローラ１９は、ハーネス７を介してＣＢ５０の接続コントローラ５８と電気的に接続し、有線通信を行う。カメラ２０は画像を撮像する。ヘッドセット１６はマイク１７およびイヤホン１８を備える。なお、ＨＤ１０は、ヘッドセット１６の代わりに、筐体２にマイクとスピーカを内蔵してもよい。 As described above, the image display unit 14 displays an image based on the video signal. The interface 15 is electrically connected to the camera 20 and the headset 16 and controls input / output of signals. The connection controller 19 is electrically connected to the connection controller 58 of the CB 50 via the harness 7 and performs wired communication. The camera 20 captures an image. The headset 16 includes a microphone 17 and an earphone 18. Note that the HD 10 may incorporate a microphone and a speaker in the housing 2 instead of the headset 16.

ＣＢ５０の電気的構成について説明する。ＣＢ５０は、ＣＢ５０全体の制御を行うＣＰＵ５１を備える。ＣＰＵ５１は、ＲＡＭ５２、フラッシュＲＯＭ５３、インターフェイス５５、ビデオＲＡＭ５６、画像処理部５７、接続コントローラ５８、および無線通信部５９に電気的に接続される。ＲＡＭ５２は、各種データを一時的に記憶する。 The electrical configuration of the CB 50 will be described. The CB 50 includes a CPU 51 that controls the entire CB 50. The CPU 51 is electrically connected to the RAM 52, the flash ROM 53, the interface 55, the video RAM 56, the image processing unit 57, the connection controller 58, and the wireless communication unit 59. The RAM 52 temporarily stores various data.

フラッシュＲＯＭ５３は、ＯＳを記憶する。また、フラッシュＲＯＭ５３は、ＣＰＵ５１が実行する各種プログラム、各種プログラムが使用するフラグやデータの初期値等を記憶する。各種プログラムは、ＯＳ上で実行される。フラッシュＲＯＭ５３は、少なくとも、メインプログラム記憶エリア６７と、音声認識プログラム記憶エリア６８の記憶領域を確保している。メインプログラム記憶エリア６７は、ＣＰＵ５１がＨＭＤ１の各種動作を制御するために実行するメインプログラムを記憶する。なお、メインプログラムは、音声認識プログラムを含む各種プログラムを並列処理によって実行するマルチタスク型のプログラムである。音声認識プログラム記憶エリア６８は、ＣＰＵ５１が、ユーザの発声する音声に基づいてＨＭＤ１の各種操作等を行うための音声認識プログラム（後述）を記憶する。音声認識プログラムは、メインプログラムに従ってＣＰＵ５１が実行する各種プログラムのうちの一つである。メインプログラム、音声認識プログラムを含む各種プログラムおよびＯＳは、ＨＭＤ１の出荷時にフラッシュＲＯＭ５３に記憶される。 The flash ROM 53 stores the OS. The flash ROM 53 also stores various programs executed by the CPU 51, initial values of flags and data used by those programs, and the like. The various programs are executed on the OS. The flash ROM 53 secures at least two storage areas: a main program storage area 67 and a voice recognition program storage area 68. The main program storage area 67 stores a main program that the CPU 51 executes to control various operations of the HMD 1. The main program is a multitasking program that executes various programs, including the voice recognition program, by parallel processing. The voice recognition program storage area 68 stores a voice recognition program (described later) with which the CPU 51 performs various operations of the HMD 1 based on voices uttered by the user. The voice recognition program is one of the various programs executed by the CPU 51 in accordance with the main program. The main program, the various programs including the voice recognition program, and the OS are stored in the flash ROM 53 when the HMD 1 is shipped.

なお、フラッシュＲＯＭ５３には、ＨＤ１０のＣＰＵ１１が実行するプログラムが記憶されてもよい。ＣＰＵ５１は、ＨＤ１０のＣＰＵ１１が実行する処理と同じ処理を、ＣＰＵ１１の代わりに実行してもよい。また、ＨＭＤ１は、各種プログラムおよびＯＳを、無線通信部５９を介してプログラムダウンロード用のサーバからダウンロードし、インストールしてもよい。例えば、各種プログラムおよびＯＳは、コンピュータで読み取り可能な一時的な記憶媒体（例えば、伝送信号）として、サーバからＨＭＤ１に送信されてもよい。但し、記憶装置は、例えばＲＯＭ、フラッシュＲＯＭ、ＨＤＤ、ＲＡＭなどの、一時的な記憶媒体を除く記憶媒体であってもよい。また、記憶装置は、非一時的な記憶媒体であってもよい。非一時的な記憶媒体は、データを記憶する時間の長さに関わらず、データを留めておくことが可能なものであってもよい。 The flash ROM 53 may store a program executed by the CPU 11 of the HD 10. The CPU 51 may execute the same processing as that executed by the CPU 11 of the HD 10, in place of the CPU 11. Further, the HMD 1 may download various programs and the OS from a program download server via the wireless communication unit 59 and install them. For example, the various programs and the OS may be transmitted from the server to the HMD 1 as a computer-readable transitory storage medium (for example, a transmission signal). However, the storage device may be a storage medium other than a transitory storage medium, such as a ROM, a flash ROM, an HDD, or a RAM. Further, the storage device may be a non-transitory storage medium. The non-transitory storage medium may be any medium capable of retaining data regardless of the length of time for which the data is stored.

インターフェイス５５は電源スイッチ６２および電源ランプ６３を含む操作部６１と電気的に接続し、ユーザによる操作に対応した入力信号やランプの点灯信号等の入出力を行う。画像処理部５７は、ＨＤ１０の画像表示部１４に表示する画像を形成する処理を行う。ビデオＲＡＭ５６は、画像処理部５７が形成した画像を示す映像信号を生成するため、映像を構成するフレームを記憶領域内に形成する。接続コントローラ５８は、ハーネス７を介してＨＤ１０の接続コントローラ１９と電気的に接続し、有線通信を行う。無線通信部５９は、ネットワーク９のアクセスポイント（図示略）へ無線通信によって接続し、例えばサーバ８０等、ネットワーク９に接続する他の機器と通信を行う。 The interface 55 is electrically connected to an operation unit 61 including a power switch 62 and a power lamp 63, and inputs / outputs an input signal corresponding to an operation by a user, a lamp lighting signal, and the like. The image processing unit 57 performs processing for forming an image to be displayed on the image display unit 14 of the HD 10. The video RAM 56 forms a frame constituting the video in the storage area in order to generate a video signal indicating the image formed by the image processing unit 57. The connection controller 58 is electrically connected to the connection controller 19 of the HD 10 via the harness 7 and performs wired communication. The wireless communication unit 59 is connected to an access point (not shown) of the network 9 by wireless communication, and communicates with other devices connected to the network 9 such as the server 80.

サーバ８０は、サーバ８０全体の制御を行うＣＰＵ８１を備える。ＣＰＵ８１は、データバスを介してＲＯＭ８２、ＲＡＭ８３、および入出力バス（以下、「Ｉ／Ｏバス」という。）８５と電気的に接続する。ＲＯＭ８２は、ＣＰＵ８１が実行するＢＩＯＳ等のプログラムを記憶する読出し専用の記憶装置である。ＲＡＭ８３は、データを一時的に記憶する読み書き可能な記憶装置である。 The server 80 includes a CPU 81 that controls the entire server 80. CPU 81 is electrically connected to ROM 82, RAM 83, and input / output bus (hereinafter referred to as “I / O bus”) 85 via a data bus. The ROM 82 is a read-only storage device that stores programs such as BIOS executed by the CPU 81. The RAM 83 is a readable / writable storage device that temporarily stores data.

Ｉ／Ｏバス８５には、ハードディスクドライブ（以下、「ＨＤＤ」という。）８４および通信部８６が電気的に接続されている。ＨＤＤ８４は、ＯＳやプログラム等がインストールされる記憶装置である。通信部８６は、ネットワーク９のアクセスポイント（図示略）へ有線通信または無線通信で接続し、サーバ８０をネットワーク９に接続する。また、図示しないが、Ｉ／Ｏバス８５にはマウスやキーボード等の入力デバイスも接続されている。 A hard disk drive (hereinafter referred to as “HDD”) 84 and a communication unit 86 are electrically connected to the I / O bus 85. The HDD 84 is a storage device in which an OS, a program, and the like are installed. The communication unit 86 connects to an access point (not shown) of the network 9 by wired communication or wireless communication, and connects the server 80 to the network 9. Although not shown, input devices such as a mouse and a keyboard are also connected to the I / O bus 85.

ＨＭＤ１の構成は上記実施形態に限定されず、例えば、ＨＤ１０とＣＢ５０とが一体となった構成であってもよい。ＨＤ１０のＣＰＵ１１とＣＢ５０のＣＰＵ５１とは、ハーネス７の代わりに無線通信によって通信を行ってもよい。 The configuration of the HMD 1 is not limited to the above embodiment, and may be a configuration in which the HD 10 and the CB 50 are integrated, for example. The CPU 11 of the HD 10 and the CPU 51 of the CB 50 may communicate by wireless communication instead of the harness 7.

図３および図４を参照して、第一対応データ９５および第二対応データ９６について説明する。第一対応データ９５および第二対応データ９６は、例えば、フラッシュＲＯＭ５３に記憶されている。第一対応データ９５および第二対応データ９６には、操作データと処理データとが対応付けられている。操作データは、ＨＭＤ１の実行する動作をユーザがＣＰＵ５１に指示し、ＨＭＤ１を操作する際に発声する単語を示すテキストデータである。処理データは、ＣＰＵ５１がＨＭＤ１に動作を実行させるための処理についてのデータである。処理データは、例えば、ＣＰＵ５１によって実行される特定のルーチンを示す。より詳細には、第一対応データ９５には、操作データ「起動」が処理データ「音声操作処理を開始する」に対応付けられている。即ち、ＣＰＵ５１が操作データ「起動」を受け付けた場合、対応付けられている処理データ「音声操作処理を開始する」によって示される特定のルーチン（ここでは、図７で示される処理）が実行される。詳しくは後述するが、第一対応データ９５の操作データ「起動」は、Ｓ３４（図６参照）の音声操作処理を行うか否かの判断基準となる単語である。このため、本実施形態において、第一対応データ９５の操作データ「起動」を基準データとも呼ぶこととする。 The first correspondence data 95 and the second correspondence data 96 will be described with reference to FIGS. 3 and 4. The first correspondence data 95 and the second correspondence data 96 are stored in the flash ROM 53, for example. The first correspondence data 95 and the second correspondence data 96 are associated with operation data and processing data. The operation data is text data indicating a word uttered when the user instructs the CPU 51 to perform an operation performed by the HMD 1 and operates the HMD 1. The processing data is data regarding processing for the CPU 51 to cause the HMD 1 to perform an operation. The processing data indicates a specific routine executed by the CPU 51, for example. More specifically, in the first correspondence data 95, the operation data “activation” is associated with the processing data “start voice operation processing”. That is, when the CPU 51 receives the operation data “activation”, a specific routine (in this case, the process shown in FIG. 7) indicated by the associated process data “start the voice operation process” is executed. . As will be described in detail later, the operation data “activation” of the first correspondence data 95 is a word that serves as a criterion for determining whether or not to perform the voice operation processing of S34 (see FIG. 6). 
For this reason, in the present embodiment, the operation data “activation” of the first correspondence data 95 is also referred to as reference data.

第二対応データ９６には、操作データ「送る」、「戻る」、「Ｘ枚目」、「閉じる」等がそれぞれ、処理データ「表示中の図面等を次頁に送る」、「表示中の図面等を前頁に戻す」、「表示中の図面のＸ頁を表示する」、「表示中の図面等を閉じる」等に対応付けられている。例えば、ユーザが音声「ページを送る」を発声すると、ＣＰＵ５１は、後述する図７のＳ４９で生成するテキストデータ「ページを送る」に「送る」の操作データが含まれると判断して（後述する図７のＳ５１：ＹＥＳ）、処理データ「画像の次ページを表示する」に基づくＨＭＤ１の動作を実行させるための処理を行う（後述する図７のＳ５３参照）。なお、Ｘ枚目等の「Ｘ」は任意の自然数であり、ユーザは、Ｘを自然数に置き換えて発声する。 In the second correspondence data 96, the operation data "send", "return", "Xth page", "close", and so on are associated with the processing data "send the drawing or the like being displayed to the next page", "return the drawing or the like being displayed to the previous page", "display page X of the drawing being displayed", "close the drawing or the like being displayed", and so on, respectively. For example, when the user utters the voice "send page", the CPU 51 determines that the operation data "send" is included in the text data "send page" generated in S49 of FIG. 7, described later (S51 in FIG. 7: YES), and performs processing for causing the HMD 1 to execute the operation based on the processing data "display the next page of the image" (see S53 in FIG. 7). Note that "X" in "Xth page" and the like is an arbitrary natural number, and the user utters the phrase with X replaced by a natural number.
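
The lookup described for the second correspondence data 96 can be sketched as a keyword table. This is only an illustrative sketch: the routine names (`next_page`, `previous_page`, `close_document`, `show_page`) and the function `find_processing` are hypothetical and do not appear in the specification, and the recognized text is assumed to already be available as a string.

```python
import re

# Hypothetical table modelled on the second correspondence data 96:
# operation word (as uttered) -> name of the routine to run.
# The routine names are illustrative assumptions, not from the specification.
SECOND_CORRESPONDENCE = {
    "送る": "next_page",        # "send"
    "戻る": "previous_page",    # "return"
    "閉じる": "close_document",  # "close"
}

# "X枚目" ("Xth page") is matched separately because X is any natural number.
PAGE_PATTERN = re.compile(r"(\d+)枚目")


def find_processing(text):
    """Return (routine, argument) for the first operation word contained in
    the recognized text, or None when no operation word is present."""
    match = PAGE_PATTERN.search(text)
    if match and int(match.group(1)) >= 1:
        return ("show_page", int(match.group(1)))
    for word, routine in SECOND_CORRESPONDENCE.items():
        if word in text:  # corresponds to the "contains operation data" check
            return (routine, None)
    return None
```

With this table, the text "ページを送る" ("send page") contains the operation word "送る" and therefore selects the hypothetical `next_page` routine, mirroring the S51: YES branch described above.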

図５から図７を参照し、音声認識プログラムを構成する音声判断処理について説明する。ＨＭＤ１は、前述したように、あらかじめフラッシュＲＯＭ５３に音声認識プログラムを記憶した状態で出荷される。音声認識プログラムは、ＨＭＤ１のＣＢ５０のＣＰＵ５１が実行するプログラムである。音声認識プログラムを実行したＣＰＵ５１は、ユーザが発声した音声を判断する音声判断処理の実行、および音声判断処理の中において行われる音声操作処理の開始・終了を行う。ＣＰＵ５１は、ユーザがＨＭＤ１の各種操作等の操作データに対応する音声を発声した場合、操作データに対応付けられた処理データに応じたＨＭＤ１の各種操作等を実行する。 With reference to FIG. 5 to FIG. 7, a voice determination process constituting the voice recognition program will be described. As described above, the HMD 1 is shipped with a voice recognition program stored in the flash ROM 53 in advance. The voice recognition program is a program executed by the CPU 51 of the CB 50 of the HMD1. The CPU 51 that has executed the voice recognition program executes voice judgment processing for judging voice uttered by the user, and starts and ends voice operation processing performed in the voice judgment processing. When the user utters a voice corresponding to operation data such as various operations of the HMD 1, the CPU 51 executes various operations of the HMD 1 according to the processing data associated with the operation data.

音声判断処理で使用する各種タイマカウンタについて説明する。ここで、音声判断処理が開始されて最初にＣＰＵ５１が取得する音声データを、第一音声データとする。第一タイマカウンタは、第一音声データに基づいて決定される音量値である第一音量値が第一基準値（後述）よりも大きいと判断された後の所定期間である第一監視時間を計測するため、ＲＡＭ５２に記憶されるタイマカウンタである。第一監視時間に対応するカウンタ値がＲＡＭ５２の第一タイマカウンタにセットされると、セットされた第一タイマカウンタの値が、０を下限値として順次減算される。詳細は後述するが、第一監視時間が経過する前に第二音量値が第二基準値（後述）より大きいと判断された場合、ユーザによって第二音声データに対応する音声が発声されたとみなして、ＣＰＵ５１は、以降の所定の処理を実行する。ここで、音声判断処理において、第一音声データの後にＣＰＵ５１が取得する音声データを、第二音声データとする。また、第二音量値は、第二音声データに基づいて決定される音量値である。第一タイマカウンタにおける第一監視時間の計測は、第一監視時間が経過する前に第二音量値が第二基準値より大きいと判断されない場合、ユーザに第二音声データに対応する音声の発声意思がないとみなして、音声操作処理（図６および図７参照）を実行させないために行われる。なお、第一タイマカウンタの値の減算は、ＯＳのタイマ機能に基づいて行われる。 Various timer counters used in the voice determination process will be described. Here, the voice data that the CPU 51 first acquires after the voice determination process starts is referred to as first voice data. The first timer counter is a timer counter stored in the RAM 52 for measuring a first monitoring time, which is a predetermined period after it is determined that the first volume value, a volume value determined based on the first voice data, is larger than a first reference value (described later). When a counter value corresponding to the first monitoring time is set in the first timer counter of the RAM 52, the set value is sequentially decremented with 0 as the lower limit. Although details will be described later, if it is determined that the second volume value is larger than a second reference value (described later) before the first monitoring time elapses, the CPU 51 regards the user as having uttered the voice corresponding to the second voice data and executes the subsequent predetermined processing. Here, in the voice determination process, the voice data that the CPU 51 acquires after the first voice data is referred to as second voice data. The second volume value is a volume value determined based on the second voice data. 
The first timer counter measures the first monitoring time so that, if the second volume value is not determined to be larger than the second reference value before the first monitoring time elapses, the user is regarded as having no intention of uttering the voice corresponding to the second voice data and the voice operation processing (see FIGS. 6 and 7) is not executed. Note that the value of the first timer counter is decremented based on the timer function of the OS.

第二タイマカウンタは、第二音声データに基づく第二音量値が第二基準値よりも大きいと判断された後の所定期間である第二監視時間を計測するため、ＲＡＭ５２に記憶されるタイマカウンタである。第二監視時間に対応するカウンタ値がＲＡＭ５２の第二タイマカウンタにセットされると、セットされた第二タイマカウンタの値が、０を下限値として順次減算される。詳細は後述するが、第二監視時間が経過する前に第二音量値が第三基準値（後述）を下回った場合、ユーザによって第二音声データに対応する音声の発声が適切に終了されたとみなして、以降の所定の処理を実行する。また、第二監視時間が経過する前に第二音声データに対応する音声が終了しない場合、第二音声データに対応する音声が所定の音声でないとみなして、ＣＰＵ５１は以降の処理、特に、音声操作処理（図６および図７参照）を実行しない。なお、第二タイマカウンタの値の減算は、ＯＳのタイマ機能に基づいて行われる。 The second timer counter is a timer counter stored in the RAM 52 for measuring a second monitoring time which is a predetermined period after it is determined that the second volume value based on the second audio data is larger than the second reference value. It is. When the counter value corresponding to the second monitoring time is set in the second timer counter of the RAM 52, the set second timer counter value is sequentially subtracted with 0 as the lower limit value. Although details will be described later, if the second sound volume value falls below a third reference value (described later) before the second monitoring time elapses, it is assumed that the utterance of the voice corresponding to the second voice data is appropriately terminated by the user. As a result, the following predetermined processing is executed. If the sound corresponding to the second sound data does not end before the second monitoring time elapses, the sound corresponding to the second sound data is regarded as not a predetermined sound, and the CPU 51 performs the subsequent processing, particularly the sound. The operation process (see FIGS. 6 and 7) is not executed. The value of the second timer counter is subtracted based on the OS timer function.

第三タイマカウンタは、音声判断処理の中において実行される音声操作処理（図７参照）において取得される第三音声データに基づく第三音量値が第二基準値よりも大きいと判断された後の所定期間である第三監視時間を計測するため、ＲＡＭ５２に記憶されるタイマカウンタである。ここで、第二音声データの後にＣＰＵ５１が取得する音声データを、第三音声データとする。また、第三音量値は、第三音声データに基づく音量値である。第三監視時間に対応するカウンタ値がＲＡＭ５２の第三タイマカウンタにセットされると、セットされた第三タイマカウンタの値が、０を下限値として順次減算される。詳細は後述するが、第三監視時間が経過する前に第三音量値が第二基準値より大きいと判断された場合、ユーザによって第三音声データに対応する音声が発声されたとみなして、以降の所定の処理を実行する。また、第三監視時間が経過する前に第三音量値が第二基準値より大きいと判断されない場合、ユーザに第三音声データに対応する音声の発声意思がないとみなして、ＣＰＵ５１は第三音声データについての音声認識処理（Ｓ４８、図７参照）を実行しない。なお、第三タイマカウンタの値の減算は、ＯＳのタイマ機能に基づいて行われる。 The third timer counter is a timer counter stored in the RAM 52 for measuring a third monitoring time, which is a predetermined period after it is determined that the third volume value, based on the third voice data acquired in the voice operation processing (see FIG. 7) executed within the voice determination process, is larger than the second reference value. Here, the voice data that the CPU 51 acquires after the second voice data is referred to as third voice data. The third volume value is a volume value based on the third voice data. When a counter value corresponding to the third monitoring time is set in the third timer counter of the RAM 52, the set value is sequentially decremented with 0 as the lower limit. Although details will be described later, if it is determined that the third volume value is larger than the second reference value before the third monitoring time elapses, the CPU 51 regards the user as having uttered the voice corresponding to the third voice data and executes the subsequent predetermined processing. If the third volume value is not determined to be larger than the second reference value before the third monitoring time elapses, the CPU 51 regards the user as having no intention of uttering the voice corresponding to the third voice data and does not execute the voice recognition process (S48, see FIG. 7) on the third voice data. 
The value of the third timer counter is subtracted based on the OS timer function.
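
The three timer counters described above share the same behavior: a value is set, decremented step by step with 0 as the lower limit, and may be cleared to 0 early. A minimal sketch of that behavior follows. The class name `TimerCounter` and its method names are hypothetical, and the periodic call to `tick()` stands in for the OS timer function, whose actual interface the specification does not describe.

```python
class TimerCounter:
    """Countdown counter modelled on the timer counters held in the RAM 52.

    Hypothetical sketch: in the described device the decrement is driven by
    the OS timer function; here the caller invokes tick() once per time unit.
    """

    def __init__(self):
        self.value = 0

    def set(self, ticks):
        """Load the counter with the value corresponding to a monitoring time."""
        self.value = ticks

    def tick(self):
        """Decrement by one, never going below the lower limit of 0."""
        if self.value > 0:
            self.value -= 1

    def clear(self):
        """Clear the counter to 0 (as done in S19 and S24)."""
        self.value = 0

    def expired(self):
        """The monitoring time has elapsed when the value has reached 0."""
        return self.value == 0
```

Because the counter saturates at 0, a check such as S21 only needs to compare the stored value with 0 and does not need to track when the counter was set.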

ＨＤ１０のＣＰＵ１１によって実行される処理の概要について説明する。ＨＭＤ１のＣＢ５０に設けられた電源スイッチ６２が操作されると、ＣＰＵ１１は、マイク１７およびインターフェイス１５を制御することで、音声データの取得を開始する。例えば、マイク１７は、入力される音声に対応するアナログの音声信号を、インターフェイス１５に対して出力する。インターフェイス１５は、アナログの音声をデジタルの音声データに変換する。ＣＰＵ１１は、接続コントローラ１９を制御して、変換した音声データを、ハーネス７を介してＣＢ５０に継続的に送信する。 An outline of the processing executed by the CPU 11 of the HD 10 will be described. When the power switch 62 provided on the CB 50 of the HMD 1 is operated, the CPU 11 starts acquiring voice data by controlling the microphone 17 and the interface 15. For example, the microphone 17 outputs an analog voice signal corresponding to the input voice to the interface 15. The interface 15 converts the analog voice signal into digital voice data. The CPU 11 controls the connection controller 19 to continuously transmit the converted voice data to the CB 50 via the harness 7.

次に、音声認識プログラムの実行に伴いＣＰＵ５１が行う処理について説明する。ユーザがＨＭＤ１のＣＢ５０に設けられた電源スイッチ６２を操作すると、ＣＰＵ５１は起動時における所定の動作をメインプログラムの実行に従って行う。ＣＰＵ５１は、音声認識プログラムを含む各種プログラムを実行する。 Next, processing performed by the CPU 51 as the voice recognition program is executed will be described. When the user operates the power switch 62 provided on the CB 50 of the HMD 1, the CPU 51 performs a predetermined operation at the time of startup according to the execution of the main program. The CPU 51 executes various programs including a voice recognition program.

音声認識プログラムにおいて、ＣＰＵ５１は、起動時に行う初期設定処理（図示略）を行う。ＣＰＵ５１は、ＲＡＭ５２に記憶するフラグやデータを初期化し、フラッシュＲＯＭ５３に記憶されているフラグやデータの初期値をＲＡＭ５２に書き込む。ＣＰＵ５１は音声データを保存するための記憶領域を、ＲＡＭ５２に確保する。ＣＰＵ５１は音声認識プログラムの初回実行時にＲＡＭ５２に第一対応データ９５および第二対応データ９６を展開する。なお、以下の説明では、ＨＭＤ１は、ネットワーク９およびサーバ８０に接続されていないとする。 In the voice recognition program, the CPU 51 performs an initial setting process (not shown) performed at the time of activation. The CPU 51 initializes flags and data stored in the RAM 52 and writes initial values of the flags and data stored in the flash ROM 53 to the RAM 52. The CPU 51 secures a storage area in the RAM 52 for storing audio data. The CPU 51 expands the first correspondence data 95 and the second correspondence data 96 in the RAM 52 when the voice recognition program is executed for the first time. In the following description, it is assumed that the HMD 1 is not connected to the network 9 and the server 80.

上記の初期設定処理が終了すると、ＣＰＵ５１は、音声判断処理を開始する。図５に示すように、ＣＰＵ５１は、音声判断処理を開始すると、ＨＤ１０からＣＢ５０に送信された最初の音声データである第一音声データを取得する（Ｓ１１）。次いで、ＣＰＵ５１は、取得した第一音声データに基づく音量値である第一音量値を決定する（Ｓ１２）。第一音量値は、例えば、Ｓ１１で受信した第一音声データにおいて、サンプリングされた波形のレベルを検出することで決定される。なお、第一音声データは、例えば、複数のサンプリング点を含む。サンプリング点の間隔は、予め定められたサンプリングレート（例えば、１１．０２５ｋＨｚ）に対応する。即ち、第一音量値は、個々のサンプリング点に対して決定される。次いで、ＣＰＵ５１は、決定した第一音量値の最大値を抽出し、第一音量値の最大値が第一基準値より大きいか否かを判断する（Ｓ１３）。なお、Ｓ１３の判断において、最大値の代わりに、平均値、中央値、最頻値などが用いられてもよい。第一音量値の最大値が第一基準値よりも大きい場合（Ｓ１３：ＹＥＳ）、第一基準値よりも第一音量値が大きくなる期間である第一期間の時間情報を算出する。例えば、ＣＰＵ５１は、第一音声データにおいて、第一基準値を超える第一音量値のサンプリング点が連続する回数を検出することで、第一期間の時間情報を算出する。ＣＰＵ５１は、算出した第一期間の時間情報が所定の範囲内であるか否かを判断する（Ｓ１４）。第一期間の時間情報が所定範囲内である場合（Ｓ１４：ＹＥＳ）、ＣＰＵ５１は、処理をＳ１５に移行する。なお、第一音量値が第一基準値より大きくない場合（Ｓ１３：ＮＯ）、または、第一音量値が第一基準値より大きい場合でも、第一期間の時間情報が所定範囲内でない場合（Ｓ１４：ＮＯ）、ＣＰＵ５１は、Ｓ１１に処理を戻す。 When the initial setting process is completed, the CPU 51 starts a sound determination process. As shown in FIG. 5, when starting the sound determination process, the CPU 51 acquires first sound data that is the first sound data transmitted from the HD 10 to the CB 50 (S11). Next, the CPU 51 determines a first volume value that is a volume value based on the acquired first sound data (S12). The first volume value is determined, for example, by detecting the level of the sampled waveform in the first audio data received in S11. Note that the first audio data includes, for example, a plurality of sampling points. The interval between the sampling points corresponds to a predetermined sampling rate (for example, 11.025 kHz). That is, the first volume value is determined for each sampling point. Next, the CPU 51 extracts the determined maximum value of the first volume value, and determines whether or not the maximum value of the first volume value is larger than the first reference value (S13). In the determination of S13, an average value, a median value, a mode value, or the like may be used instead of the maximum value. 
When the maximum value of the first volume value is larger than the first reference value (S13: YES), time information of the first period, which is a period in which the first volume value is larger than the first reference value, is calculated. For example, the CPU 51 calculates the time information of the first period by detecting the number of consecutive sampling points of the first volume value that exceeds the first reference value in the first audio data. The CPU 51 determines whether or not the calculated time information of the first period is within a predetermined range (S14). When the time information of the first period is within the predetermined range (S14: YES), the CPU 51 proceeds to S15. When the first volume value is not greater than the first reference value (S13: NO), or even when the first volume value is greater than the first reference value, the time information of the first period is not within the predetermined range ( (S14: NO), the CPU 51 returns the process to S11.
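
The two checks in S13 and S14 can be sketched as follows. This is a simplified illustration under stated assumptions: the input `levels` is taken to be the per-sample volume values already determined in S12, the "predetermined range" is assumed to be "at most about one second", and the function name `first_stage_trigger` and its parameters are hypothetical.

```python
def first_stage_trigger(levels, first_reference,
                        sample_rate_hz=11025, max_duration_s=1.0):
    """Return True when the first voice data passes both S13 and S14.

    levels: per-sample volume values (assumed output of S12).
    max_duration_s: assumed upper bound of the predetermined range.
    """
    # S13: the maximum volume value must exceed the first reference value.
    if not levels or max(levels) <= first_reference:
        return False

    # First period: longest run of consecutive samples above the reference.
    longest = run = 0
    for level in levels:
        run = run + 1 if level > first_reference else 0
        longest = max(longest, run)

    # S14: convert the run length to seconds and compare with the range.
    duration_s = longest / sample_rate_hz
    return duration_s <= max_duration_s
```

At the example sampling rate of 11.025 kHz, one second corresponds to 11,025 consecutive sampling points, so a longer uninterrupted run above the first reference value would be rejected in this sketch.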

ここで、本実施形態では、第一音声データに対応するユーザの音声として、例えば「ハイ」と発声されることを想定している。マイク１７に入力する「ハイ」の音声に対応する第一音声データについて、第一音量値の最大値が第一基準値を上回るか否かをまず判断する。第一音量値の最大値が第一基準値を上回る場合には、さらに第一期間が所定範囲内であるか否かを監視することによって、ＣＰＵ５１は、後述する音声操作処理（Ｓ３４、図６参照）を行うか否かの一段階目の判断を行う。即ち、ＣＰＵ５１は、第一音声データについては、第一音量値と、第一音量値が第一基準値よりも大きくなる期間を監視するものの、第一音声データについての音声認識処理を行わない。音量値の監視のみの実行に比べて、複雑な信号からなる音声データを解析する音声認識処理の実行には、ＣＰＵ５１は非常に多くの電力を必要とする。ＨＭＤ１は、ユーザに装着されて使用される性質上、外部からの電源供給を受けず、バッテリー等の内部に搭載した電源によって動作を行うことが想定される。このため、ＨＭＤ１に音声認識処理を組み込む場合、消費電力の低減は欠かせない。本実施形態では、ＨＭＤ１がユーザの発する音声によって操作されるのに先だって、第一音声データに対して音声認識処理を実行するのではなく、以降の音声操作処理を行うためのトリガとするために第一音量値のみを判断することとしている。これにより、複雑な制御を必要とせずに、ＨＭＤ１の消費電力を低減し、内部電源の寿命を延ばすことができる。 Here, in the present embodiment, it is assumed that the user utters, for example, "hai" ("yes") as the voice corresponding to the first voice data. For the first voice data corresponding to the voice "hai" input to the microphone 17, it is first determined whether the maximum value of the first volume value exceeds the first reference value. When the maximum value of the first volume value exceeds the first reference value, the CPU 51 further monitors whether the first period is within a predetermined range, and thereby makes the first-stage determination of whether to perform the voice operation processing (S34, see FIG. 6) described later. That is, for the first voice data, the CPU 51 monitors the first volume value and the period during which the first volume value is larger than the first reference value, but does not perform the voice recognition process. Compared with monitoring the volume value alone, the CPU 51 requires far more power to execute the voice recognition process, which analyzes voice data consisting of complex signals. Because the HMD 1 is worn by the user during use, it is assumed to operate on an internally mounted power source such as a battery, without receiving external power. 
For this reason, when voice recognition processing is incorporated into the HMD 1, reducing power consumption is essential. In the present embodiment, before the HMD 1 is operated by the voice uttered by the user, the voice recognition process is not executed on the first voice data; instead, only the first volume value is evaluated, and it serves as a trigger for the subsequent voice operation processing. This reduces the power consumption of the HMD 1 and extends the life of the internal power source without requiring complicated control.

なお、第一基準値は、ＨＭＤ１の使用環境等に応じて、任意の値を設定することができる。前述したように、第一音量値は後述する音声操作処理（Ｓ３４、図６参照）を行うか否かの一段階目の判断材料である。このため、音声操作処理が誤って実行されて、ＨＭＤ１が誤作動することを防ぐため、第一基準値は、ＨＭＤ１の使用環境における周囲の雑音よりも大きい値に設定することが好ましい。 The first reference value can be set to an arbitrary value according to the usage environment of the HMD 1 or the like. As described above, the first sound volume value is a first-stage determination material for determining whether or not to perform a voice operation process (S34, see FIG. 6) described later. For this reason, in order to prevent the voice operation process from being erroneously performed and causing the HMD 1 to malfunction, it is preferable to set the first reference value to a value larger than the ambient noise in the usage environment of the HMD 1.

また、Ｓ１４の判断において、第一期間の時間情報が所定範囲内であるか否かが判断される。この所定範囲について、本実施形態では、第一音声データに対応する音声として想定する「ハイ」が発声される時間に対応して、ＣＰＵ５１は、第一期間が約１秒以内であるか否かを判断する。Ｓ１３の判断のみを行い、Ｓ１４の判断を行わない場合には、例えば、ＨＭＤ１を装着したユーザの通常の会話による音声に基づく音量値が第一基準値を超える場合等にも、ＣＰＵ５１は、以降の処理を実行してしまう。即ち、所定範囲を超えて第一期間が継続する場合には、ＣＰＵ５１は、ユーザが発声する音声を、音声操作処理（Ｓ３４、図６参照）を行うため以外の音声であるとみなす。Ｓ１３の判断に加えてＳ１４の判断を行うことによって、音声判断処理の単純化および所定の音声以外の音声によるＨＭＤ１の誤作動を防止できる。 In the determination of S14, it is determined whether the time information of the first period is within a predetermined range. Regarding this predetermined range, in the present embodiment the CPU 51 determines whether the first period is within about one second, corresponding to the time taken to utter "hai", the word assumed as the voice corresponding to the first voice data. If only the determination of S13 were made and the determination of S14 were omitted, the CPU 51 would execute the subsequent processing even when, for example, the volume value of ordinary conversation by the user wearing the HMD 1 exceeds the first reference value. That is, when the first period continues beyond the predetermined range, the CPU 51 regards the voice uttered by the user as a voice that is not intended to start the voice operation processing (S34, see FIG. 6). By making the determination of S14 in addition to the determination of S13, the voice determination process can be kept simple while malfunction of the HMD 1 caused by voices other than the predetermined voice is prevented.

図５の説明に戻る。Ｓ１５では、ＣＰＵ５１は、第一音声データの後にＨＤ１０からＣＢ５０に送信された第二音声データを取得し、取得した第二音声データをＲＡＭ５２に保存する処理を開始する。本実施形態では、第二音声データに対応するユーザの音声として、特定の単語（例えば「起動（キドウ）」）が発声されることを想定している。次いで、ＣＰＵ５１は、ＲＡＭ５２の第一タイマカウンタに第一監視時間に対応する値をセットする（Ｓ１６）。次いで、ＣＰＵ５１は、Ｓ１５において取得および保存を開始した第二音声データに基づく音量値である第二音量値を逐次決定する（Ｓ１７）。音量値を決定する方法は、Ｓ１２と同様である。次いで、ＣＰＵ５１は、決定した第二音量値の最大値を抽出し、第二音量値の最大値が第二基準値よりも大きいか否かを判断する（Ｓ１８）。決定した第二音量値の最大値が第二基準値より大きい場合（Ｓ１８：ＹＥＳ）、ＣＰＵ５１は、第二音声データに対応する音声の発声が開始されたとみなして、ＲＡＭ５２の第一タイマカウンタの値を「０」にクリアし（Ｓ１９）、処理をＳ２０へ移行する。 Returning to the description of FIG. 5. In S15, the CPU 51 acquires the second voice data transmitted from the HD 10 to the CB 50 after the first voice data, and starts processing to store the acquired second voice data in the RAM 52. In the present embodiment, it is assumed that the user utters a specific word (for example, "kidou", meaning "activation") as the voice corresponding to the second voice data. Next, the CPU 51 sets a value corresponding to the first monitoring time in the first timer counter of the RAM 52 (S16). Next, the CPU 51 sequentially determines the second volume value, which is a volume value based on the second voice data whose acquisition and storage started in S15 (S17). The method for determining the volume value is the same as in S12. Next, the CPU 51 extracts the maximum of the determined second volume values and determines whether that maximum is larger than the second reference value (S18). When the maximum of the second volume values is larger than the second reference value (S18: YES), the CPU 51 regards utterance of the voice corresponding to the second voice data as having started, clears the value of the first timer counter in the RAM 52 to "0" (S19), and proceeds to S20.

一方、第二音量値の最大値が第二基準値より大きくない場合（Ｓ１８：ＮＯ）、ＣＰＵ５１は、ＲＡＭ５２に記憶される第一タイマカウンタの値を参照して、第一監視時間が経過したか否かを判断する（Ｓ２１）。参照した第一タイマカウンタの値が「０」でない場合（Ｓ２１：ＮＯ）、第一監視時間が経過していないため、ＣＰＵ５１は、処理をＳ１８へ戻し、第一監視時間が経過するまで、第二音量値の最大値が第二基準値より大きくなるか否かの判断を繰り返す。第一監視時間が経過して第一タイマカウンタの値が「０」となった場合（Ｓ２１：ＹＥＳ）、ＣＰＵ５１は、処理をＳ２８へ移行する。このようにして、ＣＰＵ５１は、第一監視時間が経過する前に、第二音量値の最大値が第二基準値を上回るか否かを判断する。そして、第二音量値の最大値が第二基準値を上回ることなく第一監視時間が経過した場合には、第二音声データの取得を中止する（Ｓ２８）。前述したように、第一タイマカウンタによる第一監視時間の計測は、第一音声データに対応する音声を発声したユーザに第二音声データに対応する音声の発声意思があるか否か判断するために行われる。第一監視時間は、第一音声データに対応する音声として想定する「ハイ」と第二音声データに対応する音声として想定する特定の単語である「起動（キドウ）」との、それぞれの発声の間隔として想定される任意の時間に相当する値を設定することができる。 On the other hand, when the maximum value of the second volume value is not larger than the second reference value (S18: NO), the CPU 51 refers to the value of the first timer counter stored in the RAM 52 and the first monitoring time has elapsed. It is determined whether or not (S21). When the value of the referred first timer counter is not “0” (S21: NO), since the first monitoring time has not elapsed, the CPU 51 returns the process to S18 and continues until the first monitoring time has elapsed. The determination whether or not the maximum value of the two sound volume values is larger than the second reference value is repeated. When the value of the first timer counter becomes “0” after the first monitoring time has elapsed (S21: YES), the CPU 51 proceeds to S28. In this way, the CPU 51 determines whether or not the maximum value of the second volume value exceeds the second reference value before the first monitoring time has elapsed. Then, when the first monitoring time has passed without the maximum value of the second volume value exceeding the second reference value, the acquisition of the second audio data is stopped (S28). 
As described above, the first timer counter measures the first monitoring time in order to determine whether the user who uttered the voice corresponding to the first voice data intends to utter the voice corresponding to the second voice data. The first monitoring time can be set to a value corresponding to any time assumed as the interval between the utterance of "hai", assumed as the voice corresponding to the first voice data, and the utterance of "kidou" (activation), the specific word assumed as the voice corresponding to the second voice data.
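
The loop formed by S16 to S21 and S28 can be sketched as below. This is a simplified illustration, not the described implementation: it assumes one timer decrement per volume value (in the device the OS timer drives the counter independently), and the function name `wait_for_second_voice` and its parameters are hypothetical.

```python
def wait_for_second_voice(levels, second_reference, first_monitoring_ticks):
    """Return the index at which the second volume value first exceeds the
    second reference value, or None if the first monitoring time elapses.

    levels: successive second volume values (assumed output of S17).
    """
    remaining = first_monitoring_ticks      # S16: set the first timer counter
    for index, level in enumerate(levels):
        if level > second_reference:        # S18: YES
            return index                    # S19: timer cleared, go on to S20
        remaining -= 1                      # stand-in for the OS timer tick
        if remaining <= 0:                  # S21: YES, monitoring time elapsed
            return None                     # S28: stop acquiring second data
    return None                             # input exhausted without a voice
```

Checking each new value against the reference is equivalent to checking the running maximum, since the maximum exceeds the reference exactly when some value does.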

In the present embodiment, the first reference value is set larger than the second reference value. The purpose is to ensure that the voice operation process (S34, see FIG. 6) is not performed unless the user utters "hai" (ハイ), assumed as the voice corresponding to the first voice data, more loudly than the voice uttered after it. That is, because the first reference value is larger than the second reference value, the user must utter the voice "hai" corresponding to the first voice data more loudly than the voice "kidou" (起動) corresponding to the second voice data in order for the CPU 51 to perform the voice operation process and cause the HMD 1 to execute various operations. The CPU 51 can thus determine whether the user intends to have the HMD 1 execute an operation by comparing the first volume value based on the first voice data with the first reference value, which is larger than the second reference value. This prevents the CPU 51 from erroneously executing various processes in response to a voice quieter than the first reference value uttered by a user who has no intention of having the HMD 1 execute an operation.
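The effect of placing the first reference value above the second can be shown with two comparisons. This is only a sketch: the description gives no numeric reference values, so the constants below are hypothetical and chosen solely to satisfy the stated ordering.

```python
FIRST_REFERENCE = 0.7    # hypothetical value for the first reference value
SECOND_REFERENCE = 0.4   # hypothetical value; must stay below FIRST_REFERENCE


def passes_first_stage(first_volume: float) -> bool:
    """S13: only a voice louder than the first reference value opens the gate."""
    return first_volume > FIRST_REFERENCE


def starts_later_utterance(volume: float) -> bool:
    """S18: later voices only need to exceed the lower second reference value."""
    return volume > SECOND_REFERENCE
```

With these values, speech at a volume of 0.5 would count as the start of a second voice but would never be accepted as the first voice, which is the behavior the paragraph above relies on to reject ordinary conversation.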

The description returns to FIG. 5. Next, the CPU 51 sets a value corresponding to the second monitoring time in the second timer counter of the RAM 52 (S20). Next, the CPU 51 determines whether the second volume value is smaller than the third reference value (S23). When the second volume value is smaller than the third reference value (S23: YES), the CPU 51 regards the utterance of the voice corresponding to the second voice data as finished and clears the value of the second timer counter of the RAM 52 to "0" (S24). Next, the CPU 51 ends the process, started in S15, of storing the second voice data in the RAM 52 (S25). The CPU 51 then advances the process to S30 (see FIG. 6).

On the other hand, when the second volume value is not smaller than the third reference value (S23: NO), the CPU 51 refers to the value of the second timer counter stored in the RAM 52 and determines whether the second monitoring time has elapsed (S26). When the value of the second timer counter is not "0" (S26: NO), the CPU 51 returns the process to S23 and repeats the determination of whether the second volume value falls below the third reference value until the second monitoring time elapses. When the second monitoring time has elapsed and the value of the second timer counter has become "0" (S26: YES), the CPU 51 advances the process to S28, as it does when the first monitoring time is found to have elapsed in the determination of S21.

In S28, the CPU 51 stops the process, started in S15, of storing the second voice data in the RAM 52 and returns the process to S11. The CPU 51 may erase from the RAM 52 the second voice data stored there between the start of acquisition in S15 and its stop in S28 before returning to S11, or may overwrite it in the RAM 52 with the second voice data acquired in the next execution of the voice determination process.

In this way, the CPU 51 determines whether the user has finished uttering the voice corresponding to the second voice data according to whether the second volume value falls below the third reference value between the time the second volume value is determined to be larger than the second reference value and the time the second monitoring time elapses. If the end of the utterance were judged on the condition that complete silence is detected, ambient noise and the like present in the usage environment of the HMD 1 could make that judgment difficult. In the present embodiment, therefore, the third reference value is set larger than the ambient noise and smaller than the second reference value. The CPU 51 stores in the RAM 52, as one unit of second voice data, the voice data acquired from the time the utterance corresponding to the second voice data is regarded as started until the time it is regarded as finished.

The second monitoring time can be set to any value corresponding to the time taken to utter the specific word "kidou" (起動) assumed as the voice corresponding to the second voice data. In the present embodiment, the second monitoring time is about 1.5 seconds. If a voice corresponding to second voice data whose volume value exceeds the third reference value is still being input to the microphone 17 after the second monitoring time has elapsed, the user is presumed to be uttering something other than "kidou", such as ordinary conversation. In that case, the CPU 51 regards the voice corresponding to the second voice data as not containing "kidou" and can skip the processing from S24 onward in the voice determination process. The voice determination process can thus be simplified and made more efficient.
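The end-of-utterance check (S23) and its timeout (S26) can be sketched as a single loop over volume frames. The sketch assumes one volume value per 0.1-second frame, so that the roughly 1.5-second second monitoring time becomes 15 frames; the frame length, the function name, and the threshold used in the usage note are assumptions made for illustration only.

```python
from typing import List, Optional, Sequence


def capture_second_voice(frames: Sequence[float],
                         third_reference: float,
                         second_monitoring_frames: int) -> Optional[List[float]]:
    """frames: volume values beginning at the frame where the start of the
    utterance was detected. Returns the frames captured up to the end of
    the utterance, or None when the monitoring time runs out first."""
    timer = second_monitoring_frames     # second timer counter, set in S20
    captured: List[float] = []
    for volume in frames:
        if volume < third_reference:     # S23: YES, utterance regarded as finished
            return captured              # S24/S25: keep what was stored
        captured.append(volume)
        timer -= 1
        if timer <= 0:                   # S26: YES, still above the third reference
            return None                  # S28: abandon the capture
    return None                          # input ended without a clear end of utterance
```

For example, `capture_second_voice(frames, 0.2, 15)` would accept a short word that dies away within 15 frames and reject continuous conversation that stays above the threshold.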

In S30 (see FIG. 6), the CPU 51 executes a known voice recognition process (for example, one based on a hidden Markov model) that extracts acoustic features (for example, phonemes), and performs voice recognition on the second voice data stored in the RAM 52. As a result of the voice recognition process, the CPU 51 generates phoneme data corresponding to the second voice data. This phoneme data is stored in the RAM 52. Based on the phoneme data corresponding to the second voice data recognized by the voice recognition process, the CPU 51 acquires text data corresponding to the second voice data (S31). For example, the CPU 51 compares the phoneme data with a plurality of word models prepared in advance and calculates, for each word model, the probability that the voice corresponding to the second voice data is output from that model. The CPU 51 then acquires the text data of the word model with the highest probability and stores it in the RAM 52. Next, the CPU 51 refers to the first correspondence data 95 (see FIG. 3), which was loaded into the RAM 52 when the voice recognition program was first executed (S32). The CPU 51 compares the operation data in the first correspondence data 95 with the text data generated in S31 and determines whether the text data contains that operation data (S33). In the present embodiment, the first correspondence data 95 contains only one pair of data, in which the reference data "kidou" (起動) is associated with the processing data "start the voice operation process" (see FIG. 3). In S33, therefore, the CPU 51 need only determine whether the text data generated in S31 contains the reference data "kidou", which improves both the accuracy and the speed of the determination.
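The selection of the most probable word model (S31) and the containment check against the first correspondence data 95 (S33) can be sketched as follows. The probability scores are assumed to come from some recognizer that is not shown here; the dictionary layout, the function names, and the example scores are hypothetical.

```python
from typing import Dict

# First correspondence data 95: a single pair of reference data and processing data.
FIRST_CORRESPONDENCE_DATA: Dict[str, str] = {
    "起動": "start the voice operation process",
}


def best_word(scores: Dict[str, float]) -> str:
    """S31: pick the text of the word model with the highest probability."""
    return max(scores, key=scores.get)


def should_start_voice_operation(text: str) -> bool:
    """S33: does the recognized text contain the reference data?"""
    return any(keyword in text for keyword in FIRST_CORRESPONDENCE_DATA)
```

Because the table holds only one entry, the check reduces to a single substring test, which is the simplification the paragraph above credits for the faster and more accurate determination.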

The determination of whether the text data based on the second voice data contains the reference data "kidou" (起動) is the second-stage determination of whether to perform the voice operation process of S34. Whether to perform the voice operation process of S34 is decided by two determinations: the first-stage determination based on the first volume value, made in S13 and S14 (see FIG. 5), and the second-stage determination on the text data based on the second voice data, made in S33. That is, to operate the HMD 1 by voice, the user is required to utter two voices: "hai" (ハイ), corresponding to the first voice data, and "kidou", corresponding to the second voice data. The CPU 51 thereby lets the user start the voice operation process hands-free while preventing the HMD 1 from malfunctioning because the voice operation process of S34 is executed by mistake. When the text data generated in S31 contains the same data as the operation data (S33: YES), the CPU 51 executes the voice operation process of S34 and then ends the voice determination process. When the text data generated in S31 does not contain the same data as the operation data (S33: NO), the CPU 51 returns the process to S11 (see FIG. 5).

In the present embodiment, the operation data checked in S33 against the text data generated in S31 is the single item of reference data "kidou" (起動). The determination of S33 therefore need only establish whether the second voice data is voice data corresponding to the reference data "kidou". For this reason, the conversion of the second voice data into text data in S31 may, for example, be omitted. Instead, the voice recognition process of S30 generates, for example, waveform data of the sound wave based on the second voice data, and specific waveform data corresponding to the specific word "kidou" is stored as the reference data in the first correspondence data 95. The determination of S33 may then be made according to whether the generated waveform data corresponds to that reference data. Voice data consists of complex signals, and waveform data based on it likewise has a complex waveform pattern. In this case, the CPU 51 can determine whether the second voice data contains voice data corresponding to the reference data "kidou" by matching waveform data against waveform data, without converting the generated complex waveform data into text data. This simplifies the voice determination process and also helps reduce power consumption.
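The description does not say how the waveform data are matched. As one possible illustration, the sketch below uses a normalized correlation coefficient between two equal-length waveforms and accepts the match above a threshold. The choice of this measure, the equal-length requirement, and the 0.9 threshold are all assumptions introduced here, not the method of the embodiment.

```python
import math
from typing import Sequence


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson-style correlation of two equal-length waveforms, in [-1, 1]."""
    if len(a) != len(b) or not a:
        raise ValueError("waveforms must be non-empty and the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0                       # a flat waveform carries no shape to compare
    return sum(x * y for x, y in zip(da, db)) / denom


def matches_reference(waveform: Sequence[float],
                      reference: Sequence[float],
                      threshold: float = 0.9) -> bool:
    """Stand-in for the S33 decision when no text conversion is performed."""
    return normalized_correlation(waveform, reference) >= threshold
```

Because the measure is normalized, a quieter repetition of the same waveform still matches, while a waveform of different shape does not. A practical matcher would also have to handle differences in timing and length, which this sketch ignores.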

The voice operation process (S34, see FIG. 6) is described in detail with reference to FIG. 7. The voice operation process causes the HMD 1 to execute various operations in response to the user's voice. When the voice operation process starts, the CPU 51 acquires the third voice data transmitted from the HD 10 to the CB 50 after the second voice data and starts storing the acquired third voice data in the RAM 52 (S41). In the present embodiment, it is assumed that the user utters, as the voice corresponding to the third voice data, various voices containing the operation data of the second correspondence data 96, such as "okuru" (送る, "forward") and "modoru" (戻る, "back"), in order to have the HMD 1 execute various operations.

Next, the CPU 51 sets a value corresponding to the third monitoring time in the third timer counter of the RAM 52 (S42). Next, the CPU 51 successively determines the third volume value, which is the volume value based on the third voice data whose acquisition and storage began in S41 (S43). Next, the CPU 51 extracts the maximum of the determined third volume values and determines whether that maximum is larger than the second reference value (S44). When the maximum of the third volume value is larger than the second reference value (S44: YES), the CPU 51 regards the utterance of the voice corresponding to the third voice data as started, clears the value of the third timer counter of the RAM 52 to "0" (S45), and advances the process to the determination of S46.

On the other hand, when the maximum of the third volume value is not larger than the second reference value (S44: NO), the CPU 51 refers to the value of the third timer counter stored in the RAM 52 and determines whether the third monitoring time has elapsed (S54). When the value of the third timer counter is not "0" (S54: NO), the third monitoring time has not elapsed, so the CPU 51 returns the process to S44 and repeats the determination of whether the maximum of the third volume value exceeds the second reference value until the third monitoring time elapses. In the determination of S46, the CPU 51 determines whether the third volume value is smaller than the third reference value. When the third volume value is not smaller than the third reference value (S46: NO), the CPU 51 repeats the determination of S46 until the third volume value falls below the third reference value. When the third volume value is smaller than the third reference value (S46: YES), the CPU 51 regards the utterance of the voice corresponding to the third voice data as finished and ends the process, started in S41, of storing the third voice data in the RAM 52 (S47). The CPU 51 stores in the RAM 52, as one unit of third voice data, the voice data acquired from the time the utterance corresponding to the third voice data is regarded as started until the time it is regarded as finished.

No monitoring time corresponding to the second monitoring time, which is used when storing the second voice data in the RAM 52, is provided for storing the third voice data in the RAM 52. This allows the user to operate the HMD 1 by voice not only when uttering only the voice corresponding to the operation data but also when uttering a longer voice that contains the operation data. In other words, the time over which one unit of third voice data can be stored is not limited to a particularly short time but is given some latitude. Depending on the usage environment of the HMD 1, however, loud ambient noise and the like may keep the third volume value from falling below the third reference value for a long time, so that the determination of S46 is repeated more than necessary. For this reason, depending on the usage environment of the HMD 1, processing similar to S26 and S28 (see FIG. 5), which stops the acquisition of the second voice data according to whether the second volume value falls below the third reference value before the second monitoring time elapses, may be provided after S46.

Next, the CPU 51 executes a known voice recognition process as in S30, performs voice recognition on the third voice data stored in the RAM 52, and stores the phoneme data corresponding to the third voice data in the RAM 52 (S48). Based on the phoneme data corresponding to the third voice data recognized by the voice recognition process, the CPU 51 acquires text data corresponding to the third voice (S49) and stores it in the RAM 52. Next, the CPU 51 refers to the second correspondence data 96 (see FIG. 4), which was loaded into the RAM 52 when the voice recognition program was first executed (S50). The CPU 51 compares the operation data in the second correspondence data 96 with the text data generated in S49 and determines whether the text data contains operation data in the second correspondence data 96 (S51).

In the present embodiment, the second correspondence data 96 contains multiple pairs of data in which the operation data "okuru" (送る, "forward"), "modoru" (戻る, "back"), "○-maime" (○枚目, "sheet ○"), "tojiru" (閉じる, "close"), "akaruku" (明るく, "brighter"), "kuraku" (暗く, "darker"), "on" (オン), "off" (オフ), and "shuuryou" (終了, "end") are each associated with corresponding processing data (see FIG. 4). Because the second correspondence data 96 contains multiple pairs associating various operation data with processing data, the CPU 51 can cause the HMD 1 to execute various operations in response to the voice uttered by the user.
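The second correspondence data 96 behaves like a lookup table from operation data to processing data. The sketch below models it as a dictionary. Only the processing data for 送る and 終了 are stated in the description; the other descriptions are placeholders assumed for illustration, and the numbered entry ○枚目 is left out because matching it would need number parsing that the text does not describe.

```python
from typing import Optional

SECOND_CORRESPONDENCE_DATA = {
    "送る": "advance the displayed drawing to the next page",  # stated in the description
    "戻る": "return to the previous page",                     # assumed wording
    "閉じる": "close the displayed drawing",                   # assumed wording
    "明るく": "increase display brightness",                   # assumed wording
    "暗く": "decrease display brightness",                     # assumed wording
    "オン": "turn on",                                         # assumed wording
    "オフ": "turn off",                                        # assumed wording
    "終了": "end the voice operation process",                 # stated in the description
}


def find_operation(text: str) -> Optional[str]:
    """S51: return the first operation data contained in the recognized text."""
    for keyword in SECOND_CORRESPONDENCE_DATA:
        if keyword in text:
            return keyword
    return None
```

Substring matching is what lets a longer utterance that merely contains an operation word still trigger the operation.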

As described above, in the present embodiment the first correspondence data 95 contains only one pair of operation data and processing data: the reference data "kidou" (起動) associated with the processing data "start the voice operation process". This is fewer than the number of pairs contained in the second correspondence data 96. The reason is that S33 (see FIG. 6) only determines, from the voice corresponding to the second voice data uttered by the user, whether to start the voice operation process that causes the HMD 1 to execute various operations by voice. The first correspondence data 95 therefore need only contain that one pair. The second correspondence data 96, by contrast, is provided so that the HMD 1 can be made to execute various operations by the user's voice in the voice operation process, and so requires a data structure covering the various operations the HMD 1 can execute. In the present embodiment, providing two sets of correspondence data, the first correspondence data 95 and the second correspondence data 96, reduces the number of pairs contained in the first correspondence data 95 compared with holding all operation data and processing data in a single set of correspondence data. This improves the accuracy and speed of the determination in S32, which decides whether to start the voice operation process, while still allowing the HMD 1 to execute various operations in the voice operation process.

The description returns to FIG. 7. When the text data generated in S49 contains the same data as the operation data in the determination of S51 (S51: YES), the CPU 51 advances the process to the determination of S52. In S52, it is determined whether the text data generated in S49 contains the operation data "shuuryou" (終了, "end"; see FIG. 4) among the operation data in the second correspondence data 96. When the text data generated in S49 contains operation data other than "shuuryou" (S52: NO), the CPU 51 advances the process to S53.

The CPU 51 causes the HMD 1 to execute the operation based on the processing data associated with the operation data, other than "shuuryou" (終了, "end"), that was determined in S52 to be contained (S53). Specifically, when the operation data determined to be contained in the text data generated in S49 is "okuru" (送る, "forward"), the CPU 51 acquires from the flash ROM 53 the image data showing the next page of the drawing, in accordance with the corresponding processing data "advance the displayed drawing or the like to the next page". The CPU 51 processes the acquired image data in the image processing unit 57 to generate a video signal showing the next page of the drawing for display on the image display unit 14 of the HD 10. The image display unit 14 displays an image showing the next page of the drawing based on the generated video signal. The user can thus operate the HMD 1 hands-free with the voice corresponding to the third voice data.

In addition, in the present embodiment, when the text data generated in S49 contains no data identical to any operation data in the second correspondence data 96 (S51: NO), the CPU 51 regards the voice uttered by the user as not intended to make the HMD 1 execute an operation and ends the voice operation process. Likewise, when the third monitoring time has elapsed and the value of the third timer counter has become "0" in the determination of S54 (S54: YES), the CPU 51 stops the process, started in S41, of storing the third voice data in the RAM 52 (S55) and ends the voice operation process. This prevents the voice operation process from running needlessly and the HMD 1 from malfunctioning. The CPU 51 may erase from the RAM 52 the third voice data stored there between the start of acquisition in S41 and its stop in S55 before ending the voice operation process, or may overwrite it in the RAM 52 with the third voice data acquired in the next execution of the voice operation process.

In addition, in the present embodiment the voice operation process can be ended by an utterance expressing the user's intention to end it. The bottom row of the second correspondence data 96 holds data in which the operation data "shuuryou" (終了, "end") is associated with the processing data "end the voice operation process". When the text data generated in S49 contains the operation data "shuuryou" in the determination of S52 (S52: YES), the CPU 51 ends the voice operation process. The user can thus end the voice operation process hands-free.
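The three outcomes of S51 to S53 (no operation data found, the end keyword found, or an operation executed) can be sketched as one function that handles a single recognized utterance. The handler table, the returned status strings, and the function name are hypothetical; what happens after S53 is not modeled.

```python
from typing import Callable, Dict


def handle_recognized_text(text: str,
                           handlers: Dict[str, Callable[[], None]],
                           end_keyword: str = "終了") -> str:
    """Handle one unit of recognized third voice data.

    handlers maps operation data (other than the end keyword) to callables
    that stand in for the operations executed by the device."""
    matched = [k for k in list(handlers) + [end_keyword] if k in text]
    if not matched:                      # S51: NO
        return "ended: no operation data"
    if end_keyword in matched:           # S52: YES
        return "ended: user request"
    handlers[matched[0]]()               # S53: run the associated processing
    return "executed: " + matched[0]
```

Passing a handler table such as `{"送る": show_next_page}` (with `show_next_page` being any caller-supplied function) keeps the keyword matching separate from what each operation actually does.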

As described above, the CPU 51 first determines in S13 whether the first volume value is larger than the first reference value. When the first volume value is determined to be larger than the first reference value, the CPU 51 performs the voice recognition process on the second voice data in S31 to generate text data based on the second voice data, and determines in S33 whether the generated text data contains the reference data "kidou" (起動) in the first correspondence data 95. When the text data generated in S31 contains the same data as the reference data "kidou", the CPU 51 performs the voice recognition process on the third voice data in S49 to generate text data based on the third voice data, and controls the HMD 1 based on that text data in S53. The user can therefore have the HMD 1 execute the voice recognition process based on the third voice hands-free by uttering the first voice and the second voice.
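The staged flow summarized above (volume gate, activation keyword, operation keyword) can be condensed into a few lines. This is a simplified sketch that takes already-recognized text as input and omits the timers and the duration check of S14; the function name, parameter names, and default operation list are assumptions.

```python
from typing import Optional, Sequence


def voice_control(first_volume: float,
                  second_text: str,
                  third_text: str,
                  first_reference: float,
                  activation_keyword: str = "起動",
                  operations: Sequence[str] = ("送る", "戻る")) -> Optional[str]:
    """Return the operation to execute, or None if any stage rejects the input."""
    if not first_volume > first_reference:       # S13: NO
        return None
    if activation_keyword not in second_text:    # S33: NO
        return None
    for operation in operations:                 # S51
        if operation in third_text:
            return operation                     # S53 would execute this operation
    return None
```

Each stage can only reject, never skip ahead, so an operation word spoken without the louder first voice and the activation keyword has no effect.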

When the first period is not within the predetermined range in the determination of S14, the CPU 51 regards the first voice as not being the predetermined voice, returns the process to S11, and skips the processing of S31, which performs the voice recognition process on the second voice data to generate text data based on the second voice data. The CPU 51 can thereby simplify the voice determination process and prevent the HMD 1 from malfunctioning because the voice recognition process on the second voice data is executed by mistake in response to a voice other than the predetermined voice.

When the second volume value is larger than the second reference value in the determination of S18, the CPU 51 regards the utterance of the second voice as started and sets a value corresponding to the second monitoring time in the second timer counter of the RAM 52. When the second volume value does not fall below the third reference value within the second monitoring time in the determination of S26, the CPU 51 regards the utterance of the voice corresponding to the second voice data as not having finished and can skip the processing of S31. The second monitoring time is set to any value corresponding to the time taken to utter the specific word "kidou" (起動) assumed as the voice corresponding to the second voice data. That is, S26 determines whether the utterance finishes within the second monitoring time, and if it does not, acquisition of the second voice data is stopped. The CPU 51 can thereby simplify the voice determination process and make it more efficient.

When the CPU 51 starts acquiring the second voice data, it sets a value corresponding to the first monitoring time in the first timer counter of the RAM 52 in S16. The first monitoring time is set to a value corresponding to an arbitrary time assumed as the interval between the utterance of “hai” (“yes”), assumed as the voice corresponding to the first voice data, and the utterance of the specific word “kidou” (“start up”), assumed as the voice corresponding to the second voice data. If the determination in S21 finds that the second volume value has not exceeded the second reference value within the first monitoring time, the CPU 51 regards the utterance of the voice corresponding to the second voice data as not having started and stops acquiring the second voice data. The CPU 51 can therefore simplify the voice determination process and prevent the processing delay that would result from continuing to wait for the second voice data.
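The interplay of the two monitoring times can be sketched as a small state machine. This is an illustration only: it assumes one volume value per frame and expresses both monitoring times as frame counts, and the function and parameter names are hypothetical.

```python
def second_voice_completed(volumes, second_ref, third_ref,
                           first_monitor, second_monitor):
    """Return True if the utterance starts within first_monitor frames
    and then ends within second_monitor frames; otherwise return False."""
    first_timer = first_monitor   # frames left to wait for the start of the utterance
    second_timer = None           # set only once the start has been detected
    for volume in volumes:
        if second_timer is None:
            if volume > second_ref:          # start of utterance detected
                second_timer = second_monitor
                continue
            first_timer -= 1
            if first_timer <= 0:             # first monitoring time elapsed
                return False
        else:
            if volume < third_ref:           # end of utterance detected
                return True
            second_timer -= 1
            if second_timer <= 0:            # second monitoring time elapsed
                return False
    return False  # input ran out before the utterance ended
```

Only when this function returns true would the captured second voice data be passed on to voice recognition processing; in every other case acquisition is abandoned without running recognition.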

In the determination of S33, the CPU 51 refers to the first correspondence data 95 and determines whether the text data based on the second voice data, generated by performing voice recognition processing on the second voice data, includes data identical to the reference data. In the determination of S51, the CPU 51 refers to the second correspondence data 96 and determines whether the text data based on the third voice data, generated by performing voice recognition processing on the third voice data, includes data identical to the operation data. Switching the referenced correspondence data between the first correspondence data 95, for the text data based on the second voice data, and the second correspondence data 96, for the text data based on the third voice data, improves the accuracy of the determinations in S33 and S51. The CPU 51 can therefore control the HMD 1 more reliably.

Because the number of pairs of reference data and processing data included in the first correspondence data 95 is smaller than the number of pairs of operation data and processing data included in the second correspondence data 96, the accuracy and speed of the determination in S33, which is made on the text data based on the second voice data, can be improved in particular.
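The two-table lookup can be sketched as follows. The table contents, the romanized keys, and the processing names are hypothetical placeholders; the actual contents of the first correspondence data 95 and the second correspondence data 96 are not reproduced here.

```python
# Hypothetical stand-ins for the first and second correspondence data.
FIRST_CORRESPONDENCE = {
    "kidou": "start_voice_operation",
}
SECOND_CORRESPONDENCE = {
    "next": "show_next_page",
    "back": "show_previous_page",
    "end": "end_voice_operation",
}


def find_match(text, correspondence):
    """Return (key, processing data) for the first key contained in text, else None."""
    for key, processing in correspondence.items():
        if key in text:
            return key, processing
    return None
```

Checking the wake word only against the small first table keeps that check quick and means a recognized command word such as `"next"` cannot be mistaken for the wake word.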

Because the first reference value is larger than the second reference value, the user must utter the voice corresponding to the first voice data more loudly than the voice corresponding to the second voice data in order to cause the HMD 1 to execute an operation based on the third voice. This prevents various processes from being executed erroneously on the basis of the first voice data.

In the present embodiment, the CPU 51 that acquires the first voice data in S11 of FIG. 5 functions as the “first voice acquisition means” of the present invention. The CPU 51 that determines the first volume value in S12 of FIG. 5 functions as the “first volume determination means” of the present invention. The CPU 51 that judges whether the first volume value is larger than the first reference value in S13 of FIG. 5 functions as the “first volume judgment means” of the present invention. The CPU 51 that starts and continues the acquisition of the second voice data in S15 of FIG. 5 functions as the “second voice acquisition means” of the present invention. The CPU 51 that performs voice recognition processing on the second voice data and generates text data based on the second voice data in S30 and S31 of FIG. 6 functions as the “first generation means” of the present invention. The CPU 51 that judges in S33 of FIG. 6 whether the text data generated in S31 includes the reference data in the first correspondence data 95 functions as the “first result judgment means” of the present invention. The CPU 51 that starts and continues the acquisition of the third voice data in S41 of FIG. 7 functions as the “third voice acquisition means” of the present invention. The CPU 51 that performs voice recognition processing on the third voice data and generates text data based on the third voice data in S48 and S49 of FIG. 7 functions as the “second generation means” of the present invention. The first correspondence data 95 stored in the flash ROM 53 corresponds to the “correspondence data” and the “first correspondence data” of the present invention. The second correspondence data 96 stored in the flash ROM 53 corresponds to the “correspondence data” and the “second correspondence data” of the present invention. The CPU 51 that judges in S51 of FIG. 7 whether the text data generated in S49 includes the operation data in the second correspondence data 96 functions as the “second result judgment means” of the present invention. The CPU 51 that causes the HMD 1 to execute an operation based on the processing data associated with the operation data in S53 of FIG. 7 functions as the “execution means” of the present invention.

The CPU 51 that judges whether the time information of the first period is within the predetermined range in S14 of FIG. 5 functions as the “first period judgment means” of the present invention. The CPU 51 that determines the second volume value on the basis of the second voice data in S17 of FIG. 5 functions as the “second volume determination means” of the present invention. The CPU 51 that judges whether the second volume value is larger than the second reference value in S18 of FIG. 5 functions as the “second volume judgment means” of the present invention. The second timer counter, which is stored in the RAM 52 and measures the second monitoring time, corresponds to the “measuring means” of the present invention. The CPU 51 that judges whether the second volume value is smaller than the third reference value in S23 of FIG. 5 functions as the “second voice end judgment means” of the present invention. The flash ROM 53 that stores the first correspondence data 95 and the second correspondence data 96 corresponds to the “storage means” of the present invention.

The process of acquiring the first voice data in S11 corresponds to the “first voice acquisition step” of the present invention. The process of determining the first volume value in S12 corresponds to the “first volume determination step” of the present invention. The process of judging whether the first volume value is larger than the first reference value in S13 corresponds to the “first volume judgment step” of the present invention. The process of acquiring the second voice data in S15 corresponds to the “second voice acquisition step” of the present invention. The process of performing voice recognition processing on the second voice data and generating text data based on the second voice data in S30 and S31 corresponds to the “first generation step” of the present invention. The process of judging in S33 whether the text data generated in S31 includes the reference data in the first correspondence data 95 corresponds to the “first result judgment step” of the present invention. The process of acquiring the third voice data in S41 corresponds to the “third voice acquisition step” of the present invention. The process of performing voice recognition processing on the third voice data and generating text data based on the third voice data in S48 and S49 corresponds to the “second generation step” of the present invention. The process of judging in S51 whether the text data generated in S49 includes the operation data in the second correspondence data 96 corresponds to the “second result judgment step” of the present invention. The process of causing the HMD 1 to execute an operation based on the processing data associated with the operation data in S53 corresponds to the “execution step” of the present invention.

The present invention is not limited to the above embodiment, and various modifications are of course possible without departing from the gist of the present invention. In the above embodiment, the first reference value used in the determination of S13 is set to be larger than the second reference value used in the determination of S18. However, depending on the environment in which the HMD 1 is used, it may be difficult to utter the voice corresponding to the first voice data more loudly than the voice corresponding to the second voice data. The first reference value therefore does not necessarily have to be larger than the second reference value.

In the above embodiment, the end of the utterance of the voices corresponding to the second voice data and the third voice data is determined by whether the second volume value and the third volume value fall below the third reference value. However, the method of determining the end of the user's utterance is not limited to this. For example, the CPU 51 may determine whether the frequencies of the sound indicated by the second voice data include a characteristic frequency band (for example, several hundred Hz, corresponding to the human vocal range) and regard the user's utterance as having ended when the characteristic frequency band is no longer included.
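One way to picture the frequency-based alternative is to measure how much of a frame's spectral energy lies in an assumed voice band. The sketch below uses a plain discrete Fourier transform from the standard library; the band limits of 100 Hz to 1000 Hz, the 0.5 ratio threshold, and all function names are assumptions made for illustration, not values from the specification.

```python
import cmath
import math


def band_energy_ratio(samples, sample_rate, low_hz, high_hz):
    """Fraction of spectral energy (DC excluded) that lies in [low_hz, high_hz]."""
    n = len(samples)
    total = 0.0
    in_band = 0.0
    for k in range(1, n // 2):
        # Naive DFT coefficient for bin k; adequate for short frames.
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        power = abs(coeff) ** 2
        total += power
        if low_hz <= k * sample_rate / n <= high_hz:
            in_band += power
    return in_band / total if total > 0.0 else 0.0


def utterance_ended(samples, sample_rate, low_hz=100.0, high_hz=1000.0, min_ratio=0.5):
    """Treat the utterance as ended when the assumed voice band no longer dominates."""
    return band_energy_ratio(samples, sample_rate, low_hz, high_hz) < min_ratio
```

A practical implementation would more likely use an FFT and smooth the decision over several frames; the naive transform is used here only to keep the sketch self-contained.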

In the above embodiment, the series of processes from S42 to S47 (see FIG. 7) determines the start and end of the user's utterance of the voice corresponding to the third voice data according to whether the third volume value falls below the third reference value before the third monitoring time elapses. This series of processes is similar to the series of processes from S16 to S25 (see FIG. 5), which determines the start and end of the user's utterance of the voice corresponding to the second voice data. For this reason, in the present embodiment, the second reference value and the third reference value, which are the criteria for the determinations in S18 and S23 (see FIG. 5), are also used as the criteria in S44 and S46 (see FIG. 7), so that the data is shared. However, the second reference value and the third reference value do not necessarily have to be shared. Arbitrary reference values may be set according to the usage environment, application, and so on of the HMD 1.

In the above embodiment, three timer counters, namely the first timer counter, the second timer counter, and the third timer counter, are provided in the RAM 52, and each monitors an elapsed time. It is not always necessary to provide three timer counters. The same timer counter may be used in each of the processes that monitor an elapsed time, with a value corresponding to the time to be monitored set in the timer counter each time.
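The single-counter variant can be sketched as a countdown that is reloaded before each monitoring phase. The class name, the method names, and the tick-based time unit are hypothetical.

```python
class TimerCounter:
    """A single countdown counter reused for every monitoring time."""

    def __init__(self):
        self.remaining = 0

    def set(self, ticks):
        """Load the value corresponding to the time to be monitored."""
        self.remaining = ticks

    def clear(self):
        """Reset the counter when a monitoring phase finishes early."""
        self.remaining = 0

    def tick(self):
        """Count down by one tick; return True once no time remains."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0
```

The same instance would be loaded with the first monitoring time while waiting for the utterance to start and then reloaded with the second monitoring time once the start is detected, which is workable because the monitoring phases never overlap.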

In the above embodiment, the voice determination process is executed by the CPU 51 of the HMD 1, but the present invention is not limited to this. For example, the voice operation process executed within the voice determination process may be executed by the CPU 81 of the server 80. That is, the server 80 may execute the voice recognition processing for the third voice data. A modification in which the voice operation process is executed by the CPU 81 of the server 80 is described below.

In the following description, it is assumed that the CPU 51 of the HMD 1 and the CPU 81 of the server 80 are connected to the network 9 via the wireless communication unit 59 and the communication unit 86, respectively, and can exchange data with each other. It is also assumed that the program for executing the voice operation process, the second correspondence data 96, and so on are stored in the HDD 84. The following description covers only the processes that differ from the case where the CPU 51 of the HMD 1 executes the voice operation process; description of the other processes is omitted.

When acquisition of the third voice data is started in S41, the CPU 81 starts a process of acquiring the third voice data and storing the acquired third voice data in the RAM 83. The third voice data acquired by the CPU 81 is voice data that the CPU 51 of the HMD 1 has acquired via the microphone 17 and transmitted to the server 80. The third timer counter, in which the value corresponding to the third monitoring time is set and cleared in S42 and S45, is stored in the RAM 83. The third volume value acquired successively in S43 is determined on the basis of the third voice data stored in the RAM 83. In S48, the CPU 81 executes known voice recognition processing (not shown) on the third voice data stored in the RAM 83. The CPU 81 converts the recognized third voice data into text data (S49) and stores the text data in the RAM 83. Next, the CPU 81 refers to the second correspondence data 96 (see FIG. 4), which was loaded into the RAM 83 when the voice recognition program was first executed (S50). If the operation data determined in S51 to be included is operation data other than “end” (S52: NO), the CPU 81 causes the HMD 1 to execute an operation based on the processing data associated with the operation data (S53). If the operation data determined in S51 to be included is “end” (S52: YES), the CPU 81 ends the voice operation process. In these cases, the CPU 81 transmits to the HMD 1 either instruction data for executing the operation based on the processing data or instruction data for ending the voice operation process. The CPU 51 of the HMD 1 receives the instruction data and executes either the operation based on the processing data or the process of ending the voice operation process.
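The server-side decision in S51 to S53 can be sketched as a function that turns recognized text into an instruction for the HMD. The JSON message layout, the field names, the table contents, and the behaviour of returning `None` when no operation data is found are all assumptions for illustration; the specification does not define a message format.

```python
import json


def build_instruction(text, second_correspondence, end_word="end"):
    """Return a JSON instruction for the HMD, or None if no operation data is found."""
    for operation, processing in second_correspondence.items():
        if operation in text:
            if operation == end_word:
                # S52: YES, so instruct the HMD to end the voice operation process.
                return json.dumps({"type": "end"})
            # S52: NO, so instruct the HMD to execute the associated processing (S53).
            return json.dumps({"type": "execute", "processing": processing})
    return None
```

On the HMD side, the receiving code would decode the message and either run the named processing or leave the voice operation process, depending on the `type` field.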

The processes other than those described above are the same as when the CPU 51 of the HMD 1 executes them. When the CPU 81 of the server 80 executes the voice operation process, there is the advantage that the voice recognition processing of the third voice data and its conversion into text data can be performed using the voice recognition processing, the second correspondence data 96, and so on that have been updated to the latest state on the server 80.

In this modification, the CPU 81 that performs voice recognition processing on the third voice data and generates text data based on the third voice data in S48 and S49 functions as the “second generation means” of the present invention. The second correspondence data 96 stored in the HDD 84 corresponds to the “correspondence data” and the “second correspondence data” of the present invention. The CPU 81 that judges in S51 whether the text data generated in S49 includes the operation data in the second correspondence data 96 functions as the “second result judgment means” of the present invention. The CPU 81 that causes the HMD 1 to execute an operation based on the processing data associated with the operation data in S53 functions as the “execution means” of the present invention. The HDD 84 that stores the second correspondence data 96 corresponds to the “storage means” of the present invention.

Note that all of the processes in the voice operation process need not be executed by only one of the CPU 51 and the CPU 81; the processes may be divided between those executed by the CPU 51 and those executed by the CPU 81.

1 Head mounted display (HMD)
17 Microphone
50 Control box (CB)
51 CPU
52 RAM
53 Flash ROM
95 First correspondence data
96 Second correspondence data

Claims

A voice recognition device comprising:
first voice acquisition means for acquiring first voice data output from a microphone that outputs voice data corresponding to input voice;
first volume determination means for determining a first volume value corresponding to the first voice data acquired by the first voice acquisition means;
first volume judgment means for judging whether the first volume value is greater than a first reference value;
second voice acquisition means for acquiring second voice data output from the microphone after the first voice data when the first volume judgment means judges that the first volume value is greater than the first reference value;
first generation means for generating, on the basis of the second voice data acquired by the second voice acquisition means, first result data corresponding to the second voice data;
first result judgment means for judging whether the first result data generated by the first generation means includes reference data corresponding to a specific word;
third voice acquisition means for acquiring third voice data output from the microphone after the second voice data when the first result judgment means judges that the first result data includes the reference data;
second generation means for generating second result data indicating corresponding text data by executing voice recognition processing on the third voice data acquired by the third voice acquisition means;
second result judgment means for judging, with reference to correspondence data in which operation data indicated by predetermined text data is associated with processing data that is data relating to processing, whether the text data indicated by the second result data generated by the second generation means includes data identical to the operation data; and
execution means for executing, when the second result judgment means judges that the text data indicated by the second result data includes data identical to the operation data, processing based on the processing data associated with the operation data judged to be included.

The voice recognition device according to claim 1, further comprising first period judgment means for judging, when the first volume judgment means judges that the first volume value is greater than the first reference value, whether a first period, which is a period during which the first volume value is greater than the first reference value, is within a predetermined range,
wherein the first generation means generates the first result data when the first period judgment means judges that the first period is within the predetermined range, and does not generate the first result data when the first period is judged not to be within the predetermined range.

The voice recognition device according to claim 1, further comprising:
second volume determination means for determining, when the first volume judgment means judges that the first volume value is greater than the first reference value, a second volume value corresponding to the second voice data acquired by the second voice acquisition means;
second volume judgment means for judging whether the second volume value is greater than a second reference value;
measuring means capable of measuring, when the second volume judgment means judges that the second volume value is greater than the second reference value, the elapse of a certain time from the time of that judgment; and
second voice end judgment means for judging whether the second volume value is smaller than a third reference value,
wherein the first generation means does not generate the first result data if the second voice end judgment means does not judge, within the certain time measured by the measuring means, that the second volume value is smaller than the third reference value.

The voice recognition device according to claim 3, wherein the first generation means does not generate the first result data if the second volume judgment means does not judge that the second volume value is greater than the second reference value within a predetermined period after the first volume judgment means judges that the first volume value is greater than the first reference value.

The voice recognition device according to claim 1, wherein the reference data is the operation data, the device further comprising storage means for storing first correspondence data, in which the reference data and the processing data are associated, and second correspondence data different from the first correspondence data, wherein
the first generation means generates the first result data indicating corresponding text data by executing voice recognition processing on the second voice data,
the first result judgment means refers to the first correspondence data and judges whether the text data indicated by the first result data generated by the first generation means includes data identical to the reference data, and
the second result judgment means refers to the second correspondence data and judges whether the text data indicated by the second result data generated by the second generation means includes data identical to the operation data.

The voice recognition device according to claim 5, wherein the number of pairs of the reference data and the processing data included in the first correspondence data is smaller than the number of pairs of the operation data and the processing data included in the second correspondence data.

The voice recognition device according to claim 3 or 4, wherein the first reference value is larger than the second reference value.

A voice recognition program for causing a computer to execute:
a first voice acquisition step of acquiring first voice data output from a microphone that outputs voice data corresponding to input voice;
a first volume determination step of determining a first volume value corresponding to the first voice data acquired in the first voice acquisition step;
a first volume judgment step of judging whether the first volume value is greater than a first reference value;
a second voice acquisition step of acquiring second voice data output from the microphone after the first voice data when it is judged in the first volume judgment step that the first volume value is greater than the first reference value;
a first generation step of generating, on the basis of the second voice data acquired in the second voice acquisition step, first result data corresponding to the second voice data;
a first result judgment step of judging whether the first result data generated in the first generation step includes reference data corresponding to a specific word;
a third voice acquisition step of acquiring third voice data output from the microphone after the second voice data when it is judged in the first result judgment step that the first result data includes the reference data;
a second generation step of generating second result data indicating corresponding text data by executing voice recognition processing on the third voice data acquired in the third voice acquisition step;
a second result judgment step of judging, with reference to correspondence data in which operation data indicated by predetermined text data is associated with processing data that is data relating to processing, whether the text data indicated by the second result data generated in the second generation step includes data identical to the operation data; and
an execution step of executing, when it is judged in the second result judgment step that the text data indicated by the second result data includes data identical to the operation data, processing based on the processing data associated with the operation data judged to be included.