JP2009122598A

JP2009122598A - Electronic device, control method of electronic device, speech recognition device, speech recognition method and speech recognition program

Info

Publication number: JP2009122598A
Application number: JP2007299309A
Authority: JP
Inventors: Hiroaki Shibazaki; 裕昭柴▲崎▼
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2007-11-19
Filing date: 2007-11-19
Publication date: 2009-06-04

Abstract

PROBLEM TO BE SOLVED: To recognize an entire speech part without losing its head, which is required for speech recognition.

SOLUTION: When a detecting means 9 detects that the sound volume level of sound data 4 exceeds a threshold sound volume level, a speech recognition means 11 goes back, within the sound data 4 recorded in an overwrite record means 7, to a predetermined period Tw2 before the detection time Tp of the detecting means 9, discriminates the speech part from that recorded sound data 4, and performs speech recognition on the speech part. A control means 13 controls operation based on the control content expressed by the speech part discriminated by the speech recognition means 11.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to an electronic device and the like that recognize a voice in ambient environmental sound.

Some recent electronic devices let a user give a desired operation by voice from outside; the device recognizes the voice and performs an operation according to the operation content it has understood. One way to start such voice recognition is to use the operator's press of an utterance button as the trigger, but having to press the button every time an operation is desired may reduce the user's willingness to use the device.

A conventional electronic device could instead be kept in a voice recognition standby state at all times, but this needlessly occupies resources such as a CPU (Central Processing Unit) and a DSP (Digital Signal Processor) and lowers resource utilization efficiency.

Conventional electronic devices therefore adopt a technique in which a voice recognition device identifies the user's voice in the collected sound and recognizes the content of that voice once the voice signal reaches a predetermined threshold (see Patent Document 1).

Specifically, the conventional voice recognition device has a level detection unit and a voice recognition unit. The level detection unit detects whether an input signal of sound collected by, for example, a microphone is at or above a predetermined level. When the level detection unit detects that the input signal is at or above the predetermined level and is a voice, the voice recognition unit identifies the voice contained in the sound and performs voice recognition.

If the voice recognition unit starts voice recognition of the sound only after the level detection unit has detected it, the first part of the voice is lost from the recognition target (hereinafter called "head cut"), and the unit cannot correctly recognize what the voice part represents as a whole.

The conventional voice recognition device therefore also incorporates a delay unit in addition to the configuration described above. The delay unit is placed between the microphone that collects the voice and the voice recognition unit. It delays the input signal itself, triggered for example by the input of a start signal indicating the start of a voice part, and thereby prevents the head of the input signal from being cut off.

Patent Document 1: Japanese Patent Laid-Open No. 5-27795 (paragraph 0012, FIG. 2(b))

The above prior art can indeed prevent the head of the input signal from being cut off, but because it delays the input signal, the start of the voice recognition processing itself is necessarily delayed as well. When voice recognition is delayed in this way, there is a gap between the operator speaking and the electronic device actually starting to operate. The operator finds the response unnatural, and as a result operability cannot be called good.

The problems to be solved by the present invention include the above problem as one example.

To solve the above problem, the invention according to claim 1 has: overwrite recording means on which sound data based on collected sound is continuously overwritten and recorded; detection means for detecting that the volume level of the sound data has exceeded a threshold volume level; voice recognition means for, when the detection means detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded on the overwrite recording means, going back a predetermined time before the detection time of the detection means, and performing voice recognition on the voice part; and control means for controlling operation based on the control content represented by the voice part identified by the voice recognition means.

To solve the above problem, the invention according to claim 7 has: a detection step of detecting, while sound data based on collected sound is being continuously overwritten and recorded on overwrite recording means, that the volume level of the sound data has exceeded a threshold volume level; a voice recognition step of, when the detection step detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded on the overwrite recording means, going back a predetermined time before the detection time in the detection step, and performing voice recognition on the voice part; and a control step of controlling operation based on the control content represented by the voice part identified in the voice recognition step.

To solve the above problem, the invention according to claim 8 has: overwrite recording means on which sound data based on collected sound is continuously overwritten and recorded; detection means for detecting that the volume level of the sound data has exceeded a threshold volume level; and voice recognition means for, when the detection means detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded on the overwrite recording means, going back a predetermined time before the detection time of the detection means, and performing voice recognition on the voice part.

To solve the above problem, the invention according to claim 9 has: a detection step of detecting, while sound data based on collected sound is being continuously overwritten and recorded on overwrite recording means, that the volume level of the sound data has exceeded a threshold volume level; and a voice recognition step of, when the detection step detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded on the overwrite recording means, going back a predetermined time before the detection time in the detection step, and performing voice recognition on the voice part.

To solve the above problem, the invention according to claim 10 causes a computer to execute: a detection step of detecting, while sound data based on collected sound is being continuously overwritten and recorded on overwrite recording means, that the volume level of the sound data has exceeded a threshold volume level; and a voice recognition step of, when the detection step detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded on the overwrite recording means, going back a predetermined time before the detection time in the detection step, and performing voice recognition on the voice part.

An embodiment of the present invention will now be described with reference to the drawings.
<First Embodiment>
FIG. 1 is a block diagram showing a configuration example of a robot 100, as one example to which the electronic device of the first embodiment is applied. The electronic device is not limited to the robot 100 and can be applied to various devices that operate in response to a user's utterance.

The robot 100 is mounted in a vehicle, for example. It greets a user (the driver or a passenger) who gets in and reacts endearingly when the user calls to it, and thereby has the function of, for example, contributing to the driver's safe driving and easing the fatigue and anxiety of the driver and passengers.

The robot 100 recognizes, by voice recognition, the operation content instructed by the user's utterance and performs an operation according to it. If the robot 100 has been given a name, for example, the user calls that name to make it perform a predetermined operation, and utters a voice part serving as a command word that instructs the desired operation, or a sentence or conversation containing such a voice part.

The robot 100 incorporates the following voice recognition device 1. The voice recognition device 1 need not be built into the robot 100; it may instead perform voice recognition as an independent device. The voice recognition device 1 includes a digitizing unit 5, an overwrite memory 7, a detection unit 9, and a voice recognition unit 11, and may further include a microphone 3.

The microphone 3 corresponds to sound collection means and has the function of collecting sound generated in the surrounding environment and outputting sound data 4. The microphone 3 may be built in or detachable. In the detachable case, sound data 4 based on the ambient sound collected by the microphone 3 is input to the voice recognition device 1. In this embodiment, the sound data 4 output by the microphone 3 is assumed to be analog data, for example.

The digitizing unit 5 is connected to the microphone 3 and the overwrite memory 7. It converts the analog sound data 4 into digital sound data 4 and outputs it to the overwrite memory 7. The overwrite memory 7 is an information recording medium on which the sound data 4 from the digitizing unit 5 is repeatedly overwritten and recorded in predetermined buffering units.
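As an illustration only, the overwrite memory 7 can be pictured as a ring buffer that always holds the most recent samples. The Python sketch below is one possible reading, not the disclosed implementation; the class name `OverwriteBuffer`, its methods, and the use of absolute sample indices are assumptions introduced here.

```python
class OverwriteBuffer:
    """Fixed-capacity buffer that overwrites its oldest samples (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = [0] * capacity
        self._total = 0  # number of samples written since start

    def write(self, samples):
        # Each new sample overwrites the slot holding the oldest one.
        for s in samples:
            self._data[self._total % self.capacity] = s
            self._total += 1

    def oldest_index(self):
        # Absolute index of the oldest sample that is still stored.
        return max(0, self._total - self.capacity)

    def read_from(self, start):
        # Return stored samples whose absolute index is >= start, oldest first.
        start = max(start, self.oldest_index())
        return [self._data[i % self.capacity] for i in range(start, self._total)]
```

Because reads are addressed by absolute index, a caller can ask for "everything since a given moment" without the recording ever being paused or delayed.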

The detection unit 9 corresponds to detection means and has the function of detecting that the volume level of the specific frequency band of the sound data 4 has exceeded a prescribed threshold volume level. Specifically, the detection unit 9 outputs a trigger TG to the voice recognition unit 11 when the volume level of the sound data 4 exceeds the prescribed threshold volume level.

Here, the detection unit 9 takes into account the noise level contained in the collected sound when detecting whether the volume level of the specific frequency band of the sound data 4 has exceeded the predetermined threshold volume level. The specific frequency band is the frequency band of the user's voice. In the following description, unless otherwise necessary, the volume level of the specific frequency band of the sound data 4 is written simply as "the volume level of the sound data 4".
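The document does not say how the volume level of the specific frequency band is computed. One simple possibility, sketched below in Python, is to sum the power of the DFT bins that fall inside the band. The function names `band_level` and `tone`, the 300 to 3400 Hz band, and the 8000 Hz sample rate are all assumptions chosen for the example.

```python
import cmath
import math

def band_level(frame, sample_rate, f_lo=300.0, f_hi=3400.0):
    """Power of `frame` inside [f_lo, f_hi] Hz, computed with a naive DFT."""
    n = len(frame)
    power = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n  # centre frequency of bin k
        if f_lo <= freq <= f_hi:
            x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                      for i, s in enumerate(frame))
            power += abs(x_k) ** 2
    return power / (n * n)

def tone(freq, sample_rate=8000, n=80):
    """Unit-amplitude sine wave used only to exercise band_level."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]
```

A 1000 Hz tone lies inside the assumed band and yields a clearly non-zero level, while a 3800 Hz tone lies outside it and yields a level near zero.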

The detection unit 9 is connected to a vehicle sensor (not shown) and, while the engine is running, constantly acquires vehicle information such as the engine speed from that sensor.

The voice recognition unit 11 corresponds to voice recognition means. When the detection unit 9 detects that the volume level of the specific frequency band of the sound data 4 has exceeded the threshold, that is, when the voice recognition unit 11 receives the trigger TG from the detection unit 9, it keeps track of the time at which it received the trigger TG and operates as follows. In this embodiment, the time at which the voice recognition unit 11 receives the trigger TG from the detection unit 9 is called the "detection time".

That is, from the sound data 4 recorded in the overwrite memory 7, the voice recognition unit 11 identifies a voice part from the sound data 4 recorded from a predetermined time before the detection time of the detection unit 9, and performs voice recognition on that voice part. The voice recognition unit 11 stops the voice recognition processing once the volume level of the sound data 4, having risen, falls back to the noise level and a predetermined time elapses. The noise level here means the volume level of the sound, containing no voice, that is constantly present around the robot 100.

The control unit 10 controls operation based on the control content represented by the voice part identified by the voice recognition unit 11. In accordance with that control content, the control unit 10 controls operation so as to respond to the user who uttered the voice part. In other words, the control unit 10 controls the operation of the robot 100, as described later, in accordance with the control content obtained by voice recognition.

The robot 100 incorporating the voice recognition device 1 has the example configuration described above. Next, an example of a method of controlling the robot 100 with this configuration is described with reference to FIG. 1. The method of controlling the robot 100 includes the steps of the voice recognition method executed in the voice recognition device 1. The voice recognition method consists of the steps that a voice recognition program causes the voice recognition unit 11, the detection unit 9, and other parts of the voice recognition device 1 to execute.

FIG. 2 is a flowchart showing an example of the processing procedure of the robot 100. The voice recognition unit 11 does not run in the normal state; it operates only when needed, as described later.

First, in step S1, the control unit 10 detects the engine speed with a vehicle sensor (not shown) and determines whether the engine has been started. Next, in step S2, the voice recognition device 1 starts overwrite recording of the sound generated in the surrounding environment.

Specifically, the digitizing unit 5 digitizes the analog sound data 4 based on the sound collected by the microphone 3 and outputs the digital sound data 4 to the overwrite memory 7. The sound data 4 is overwritten and recorded in the overwrite memory 7 in predetermined units.

Next, in step S3, the detection unit 9 specifies the frequency at which voice recognition by the voice recognition unit 11 should be performed, based on, for example, the frequency band of the voice uttered by the user. In the following description, the frequency specified in this way is called the "specific frequency". In this embodiment, whether collected sound is a voice is judged based on, for example, the frequency characteristics typical of a human voice such as the driver's (corresponding to the voice described above). The user producing the voice may also be a vehicle passenger.

Next, in step S4, the detection unit 9 specifies the noise level that is constantly present in the surrounding environment. This noise level corresponds to the environmental-sound component of the collected sound. In the following description, the noise level specified in this way is called the "specific noise level". Next, in step S5, the detection unit 9 starts monitoring the volume level at the specific frequency.
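How the specific noise level is measured in step S4 is not described. A minimal Python sketch, assuming the level is taken as the median of recent per-frame volume levels so that a short loud burst does not shift the estimate, could look like this; the function name `estimate_noise_level` and the choice of the median are assumptions.

```python
from statistics import median

def estimate_noise_level(frame_levels):
    """Estimate the steady noise level as the median of recent frame levels."""
    return median(frame_levels)
```

For example, the frame levels `[10, 11, 9, 10, 40, 10, 12]` contain one burst of 40, yet the estimate stays at 10.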

Next, in step S6, the detection unit 9 determines whether the vehicle's engine has stopped, based on the engine speed contained in the vehicle information from the vehicle sensor (not shown). If the engine has stopped, this processing ends; if it has not, the detection unit 9 executes the next step S7.

<Detection step>
In step S7, the detection unit 9 determines whether sound data 4 with a volume level exceeding the specific noise level has been input. If no sound of such a volume level has been input, the detection unit 9 returns to step S6 described above; if such a sound has been input, it outputs a trigger to the voice recognition unit 11.

Next, in step S8, it is determined whether the voice recognition unit 11 has received the trigger from the detection unit 9; if it has not, the process returns to step S6 described above.

<Voice recognition step>
If, on the other hand, the voice recognition unit 11 has received the trigger from the detection unit 9, voice recognition starts from this point. Next, in step S9, the voice recognition unit 11 acquires the sound data 4 recorded in the overwrite memory 7. Next, in step S10, the voice recognition unit 11 executes the voice recognition processing, details of which are described later.

Next, in step S11, the voice recognition unit 11 determines whether the series of voice recognition processing has finished. Once it has, the unit waits until a predetermined time elapses and then returns to step S6 (step S12). Even after the voice recognition processing has finished, recording of the sound data 4 to the overwrite memory 7 and detection by the detection unit 9 continue as long as the vehicle is still being driven or occupied; when driving or occupancy ends, recording to the overwrite memory 7 and detection by the detection unit 9 end as well.

<Control step>
The control unit 10 controls operation based on the control content indicated by the voice part recognized in this way. The electronic device 100 therefore operates so as to respond to the user's voice under the control of the control unit 10. Because the voice part is recognized without its head being lost, the electronic device 100 can reliably respond as the user wishes.

FIG. 3 is a diagram showing an example of the voice recognition processing of FIG. 2 in progress. In the illustrated example, the horizontal axis represents time t and the vertical axis represents the volume level L.
In this example, sound data 4 with the illustrated waveform is recorded in the overwrite memory 7. In this sound data 4, the volume level L changes because of the user's utterance from time Ts to time Te.

<Start of voice recognition processing>
As described above, the detection unit 9 outputs the trigger TG when the volume level L of the sound data 4 rises from the noise level L0 and exceeds the threshold volume level L1. The detection unit 9 measures the steady running noise level L0 in advance and outputs the trigger TG when the volume level reaches a level L1 that exceeds the steady running noise level L0 by a predetermined amount or more.
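The trigger condition can be sketched as a comparison of each measured level against a threshold L1 placed a fixed margin above the noise level L0. In the Python sketch below, the function name `find_detection_index`, the per-frame level list, and the additive `margin` parameter are assumptions; the document only states that L1 exceeds L0 by a predetermined amount.

```python
def find_detection_index(levels, noise_level, margin):
    """Return the index of the first level above L1 = L0 + margin, or None."""
    threshold = noise_level + margin  # threshold volume level L1
    for i, level in enumerate(levels):
        if level > threshold:
            return i  # frame at which the trigger TG would be output (Tp)
    return None
```

With levels `[10, 10, 12, 18, 25, 30, 28]`, L0 = 10 and a margin of 10, the level starts rising at index 2 but the trigger fires only at index 4, which is the lag between Ts and Tp that the look-back is meant to cover.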

In the illustrated example, the volume level L starts to rise from the noise level L0 at time Ts but exceeds the threshold volume level L1 only at time Tp. A time difference Tw1 therefore arises between the moment the volume level L starts to rise and the moment the detection unit 9 actually outputs the trigger TG. On receiving the trigger TG from the detection unit 9, the voice recognition unit 11 first keeps track of the detection time Tp and then operates as follows.

In addition, from the sound data 4 recorded in the overwrite memory 7, the voice recognition unit 11 determines the voice part from the specific portion recorded from a predetermined time Tw0 (which includes Tw1) before the detection time Tp of the detection unit 9 onward, that is, from time Ts onward, and performs voice recognition on that voice part. The time Tw0 is set in advance to a length sufficient to include Tw1, for example. Tw1 varies on each occasion, but it can be determined accurately if an ideal mechanism is adopted.
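The look-back can be sketched as follows, working in frame indices rather than seconds. This Python sketch is a simplified reading: the function name `lookback_window` is hypothetical, and estimating Ts by walking backwards from Tp while the level stays above L0 is an assumption, since the document does not say how the voice part is located inside the window.

```python
def lookback_window(levels, tp, tw0, noise_level):
    """Return (window_start, ts) for a look-back of tw0 frames before tp."""
    window_start = max(0, tp - tw0)  # Tp - Tw0, clamped to the start of the data
    ts = tp
    # Walk back from Tp while the preceding frame is still above the noise level.
    while ts > window_start and levels[ts - 1] > noise_level:
        ts -= 1
    return window_start, ts
```

With levels `[10, 10, 10, 12, 15, 18, 25, 30]`, Tp = 6 and L0 = 10, a look-back of 5 frames gives a window starting at frame 1 and finds the voice onset at frame 3. A look-back of only 2 frames stops at frame 4 and loses the head, which is why Tw0 has to be long enough to include Tw1.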

<End of voice recognition processing>
On the other hand, the voice recognition unit 11 stops the voice recognition processing once the volume level L of the sound data 4, having risen in this way, falls back to the noise level L0 and a predetermined time Tw2 elapses.
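One way to read this stop condition is that the level must stay at or below L0 for Tw2 consecutive frames. The Python sketch below follows that reading; the function name `find_end_index`, the frame-based units, and resetting the quiet counter whenever the level rises again are assumptions.

```python
def find_end_index(levels, tp, noise_level, tw2):
    """Return the frame at which recognition would stop, or None if not yet reached."""
    quiet = 0  # consecutive frames at or below the noise level
    for i in range(tp, len(levels)):
        if levels[i] <= noise_level:
            quiet += 1
            if quiet >= tw2:
                return i  # end determination time
        else:
            quiet = 0  # voice resumed, so restart the quiet count
    return None
```

Resetting the counter keeps a brief pause inside an utterance from ending recognition early.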

FIGS. 4 to 10 are conceptual diagrams showing the flow of processing of the sound data 4 in the voice recognition processing described above. In FIGS. 4 to 10, the sound data 4 in the overwrite memory 7 is older the further right it lies along the time axis t; that is, the left part of the sound data 4 shown in the overwrite memory 7 is the most recently recorded data. The digitizing unit 5 is omitted from FIG. 4 and the other figures.

First, as shown in FIG. 4, the sound data 4 gradually accumulates in the overwrite memory 7 as time passes. As shown in FIG. 5, on receiving the trigger TG from the detection unit 9, the voice recognition unit 11 acquires the sound data 4 stored in the overwrite memory 7 up to that point. That point is the detection time Tp described above.

The voice recognition unit 11 executes voice recognition processing on the acquired sound data 4, and in the meantime, as shown in FIG. 6, sound data 4 continues to be gradually overwritten and recorded in the overwrite memory 7.

By the time the voice recognition unit 11 finishes the voice recognition processing for the acquired sound data 4, as shown in FIG. 7, still more sound data 4 has been overwritten and recorded in the overwrite memory 7 in proportion to the elapsed time.

As shown in FIG. 8, the detection unit 9 outputs the trigger TG to the voice recognition unit 11 at the time when the volume level L of the sound data 4 exceeds the threshold volume level L1. This time corresponds to the detection time Tp described above. The sound data 4 stored in the overwrite memory 7 up to the detection time Tp is then handed over to the voice recognition unit 11 as shown in FIG. 9, even if the overwrite memory 7 still has spare storage capacity.

The sound data 4 handed over to the voice recognition unit 11 at this point contains only sound data 4a, which is the part of the sound data 4 from the voice part start time Ts to the detection time Tp, and the margin before it. Recording of new sound data 4 continues after that, and as shown in FIG. 10 the new sound data 4 is overwritten and recorded in the overwrite memory 7. This sound data 4 is handed over to the voice recognition unit 11 in small increments, for example in short predetermined time units. In addition to the sound data 4a and the margin, the voice recognition unit 11 therefore also accumulates sound data 4b and sound data 4c. The sound data 4b is the part of the sound data 4 from the detection time Tp to the voice part end time Te, and the sound data 4c is the part from the voice part end time Te to the end determination time Tx. In this embodiment, the handover of sound data 4 continues until the sound data 4c has been handed over. The handover may take place after all of the sound data 4 has accumulated in the overwrite memory 7, or it may be performed continuously so that voice recognition processing can run sooner. The voice part represents, for example, a command spoken by the user. Of the sound data 4a to 4c, the voice recognition unit 11 takes the sound data 4a and 4b, which correspond to the voice part from time Ts to time Te, as the target of voice recognition and performs the voice recognition processing described above.
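The division of the handed-over data into the margin and the parts 4a, 4b and 4c can be sketched with plain list slicing. In this Python sketch the function name `split_segments`, the dictionary keys, and the half-open index ranges are assumptions made for illustration.

```python
def split_segments(samples, ts, tp, te, tx):
    """Split samples at Ts, Tp, Te and Tx and build the recognition target."""
    seg_4a = samples[ts:tp]  # voice start Ts up to detection time Tp
    seg_4b = samples[tp:te]  # detection time Tp up to voice end Te
    return {
        "margin": samples[:ts],    # data before the voice part
        "4a": seg_4a,
        "4b": seg_4b,
        "4c": samples[te:tx],      # trailing data up to end determination Tx
        "target": seg_4a + seg_4b, # Ts to Te, the part that is recognized
    }
```

Only the `target` entry is passed on for recognition; the margin and 4c are received but not recognized.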

The electronic device 100 of the above embodiment comprises: overwrite recording means 7 (corresponding to the overwrite memory), in which sound data 4 based on collected sound is continuously overwritten and recorded; detection means 9 (corresponding to the detection unit), which detects that the volume level L of the sound data 4 has exceeded a threshold volume level L1; voice recognition means 11 (corresponding to the voice recognition unit), which, when the detection means 9 detects that the volume level L of the sound data 4 has exceeded the threshold volume level L1, identifies a voice part from the sound data 4 recorded in the overwrite recording means 7, going back a predetermined time Tw1 before the detection time Tp of the detection means 9, and performs voice recognition on that voice part; and control means 13 (corresponding to the control unit), which controls operation based on the control content represented by the voice part identified by the voice recognition means 11.

With this configuration, the voice recognition means 11 uses sound data 4 that has already been recorded, going back the predetermined time Tw1 before the detection time Tp of the detection means 9, so that the beginning of the voice part to be recognized is not lost. There is therefore no need to delay the input timing of the sound data 4 itself.

The voice recognition means 11 can therefore perform voice recognition on the voice part almost in real time, at the moment recognition is needed, without losing the beginning of the voice part. The control means 13 can accordingly control the operation of the electronic device 100 in real time based on the control content obtained by this voice recognition, and the electronic device 100 can respond promptly to the control content of the voice part.

Moreover, in this electronic device 100 the voice recognition unit 11 does not need to run constantly; it only needs to start and stop as required. The load on resources is therefore lower than when the voice recognition process runs constantly.

The control method of the electronic device 100 of the above embodiment comprises: a detection step of detecting, while sound data 4 based on collected sound is being continuously overwritten and recorded in the overwrite recording means 7, that the volume level L of the sound data 4 has exceeded the threshold volume level L1; a voice recognition step of, when the detection step detects that the volume level L of the sound data 4 has exceeded the threshold volume level L1, identifying a voice part from the sound data recorded in the overwrite recording means 7, going back a predetermined time Tw1 before the detection time Tp of the detection step, and performing voice recognition on that voice part; and a control step of controlling operation based on the control content represented by the voice part identified in the voice recognition step.

With this method, the voice recognition step uses sound data 4 that has already been recorded, going back the predetermined time Tw1 before the detection time Tp of the detection step, so that the beginning of the voice part to be recognized is not lost. There is therefore no need to delay the input timing of the sound data 4 itself.

In the voice recognition step, voice recognition can therefore be performed on the voice part almost in real time, at the moment recognition is needed, without losing the beginning of the voice part. In the control step, the operation of the electronic device 100 can accordingly be controlled in real time based on the control content obtained by this voice recognition, and the electronic device 100 can respond promptly to the control content of the voice part.

Moreover, in this electronic device 100 voice recognition does not need to run constantly; it only needs to start and stop as required. The load on resources is therefore lower than when the voice recognition process runs constantly.
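The three steps of the control method (detection, voice recognition, control) can be illustrated with a toy pipeline. Everything concrete here is an assumption for the sake of the example: frames are modeled as `(level, label)` pairs, the "recognizer" simply concatenates the labels of frames above the noise level L0, and the `COMMANDS` table is hypothetical. A real recognizer would replace `recognition_step`, and this sketch reads to the end of the recording instead of applying an end determination.

```python
# Hypothetical command table: recognized word -> device action.
COMMANDS = {"stop": "halt motors", "go": "start motors"}


def detection_step(frames, l1):
    """Return the index Tp of the first frame whose level exceeds L1, or None."""
    for t, (level, _label) in enumerate(frames):
        if level > l1:
            return t
    return None


def recognition_step(frames, tp, tw1, l0):
    """Go back Tw1 frames before Tp and 'recognize' the voiced frames.

    Toy stand-in for a recognizer: frames above the noise level L0 are
    treated as the voice part and their labels are joined into a word.
    """
    start = max(0, tp - tw1)
    return "".join(label for level, label in frames[start:] if level > l0)


def control_step(word):
    """Map the recognized word to an action; unknown words are ignored."""
    return COMMANDS.get(word, "ignore")


def run(frames, l0, l1, tw1):
    """Run the three steps; the recognizer only runs after a detection."""
    tp = detection_step(frames, l1)
    if tp is None:
        return None  # nothing detected, recognition never started
    return control_step(recognition_step(frames, tp, tw1, l0))
```

For a spoken "stop" whose first two frames ("s", "t") are quieter than L1, a look-back of three frames recovers the whole word, whereas a look-back of zero would see only "op" and the command would be ignored. This is the head-loss problem that the look-back addresses.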

In the electronic device 100 of the above embodiment, in addition to the configuration described above, the volume level of the sound data is the volume level of a specific frequency band of the sound data.
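One way to obtain a volume level restricted to a specific frequency band is to sum the power of the discrete Fourier transform bins that fall inside the band. The sketch below is only an illustration under assumptions: the function name `band_level`, the direct (slow) DFT, and the example band of 300 to 3400 Hz are not taken from the embodiment.

```python
import cmath
import math


def band_level(samples, sample_rate, f_low, f_high):
    """Power of the DFT bins whose frequency lies in [f_low, f_high].

    Uses a direct DFT for clarity; an FFT or a band-pass filter would be
    the practical choice for long frames.
    """
    n = len(samples)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_low <= freq <= f_high:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples)
            )
            power += abs(coeff) ** 2
    return power / (n * n)
```

With such a measure, a low-frequency hum outside the band contributes almost nothing to the level, while a tone inside the band does, so a threshold applied to this level responds to voice-like sound and not to the hum.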

In the electronic device 100 of the above embodiment, in addition to the configuration described above, the control means 13 controls operation so as to respond, in accordance with the control content, to the user who uttered the voice part.

With this configuration, the electronic device 100 can operate in real time based on the control content obtained by voice recognition and can respond promptly to the control content of the voice part. The electronic device can therefore respond quickly without making the user feel that anything is unnatural.

In the electronic device 100 of the above embodiment, in addition to the configuration described above, the detection means 9 (corresponding to the detection unit) detects whether the volume level L of the sound data 4 has reached the threshold volume level L1, taking into account the noise level L0 contained in the collected sound.

With this configuration, the detection means 9 can accurately detect that the collected sound contains a voice part.

The electronic device 100 of the above embodiment, in addition to the configuration described above, further comprises sound collection means 3 (corresponding to the microphone), which collects sound from the surrounding environment and outputs sound data based on that sound.

With this configuration, if the sound collection means 3 is pointed in a suitable direction, the voice recognition unit 11 can distinguish the voice part from the various sounds that contain voice and perform voice recognition.

The voice recognition device 1 of the above embodiment comprises: overwrite recording means 7, in which sound data 4 based on collected sound is continuously overwritten and recorded; detection means 9, which detects that the volume level L of the sound data 4 has exceeded the threshold volume level L1; and voice recognition means 11, which, when the detection means 9 detects that the volume level L of the sound data 4 has exceeded the threshold volume level L1, identifies a voice part from the sound data 4 recorded in the overwrite recording means 7, going back a predetermined time Tw1 before the detection time Tp of the detection means 9, and performs voice recognition on that voice part.

The voice recognition means 11 can therefore perform voice recognition on the voice part almost in real time, at the moment recognition is needed, without losing the beginning of the voice part. In addition, the voice recognition unit 11 does not need to run constantly; it only needs to start and stop as required, so the load on resources is lower than when the voice recognition process runs constantly.

The voice recognition method of the above embodiment comprises: a detection step of detecting, while sound data 4 based on collected sound is being continuously overwritten and recorded in the overwrite recording means 7, that the volume level L of the sound data 4 has exceeded the threshold volume level L1; and a voice recognition step of, when the detection step detects that the volume level L of the sound data 4 has exceeded the threshold volume level L1, identifying a voice part from the sound data 4 recorded in the overwrite recording means 7, going back a predetermined time Tw1 before the detection time Tp of the detection step, and performing voice recognition on that voice part.

The voice recognition program of the above embodiment causes the electronic device 100 to execute: a detection step of detecting, while sound data 4 based on collected sound is being continuously overwritten and recorded in the overwrite recording means 7, that the volume level L of the sound data 4 has exceeded the threshold volume level L1; a voice recognition step of, when the detection step detects that the volume level L of the sound data 4 has exceeded the threshold volume level L1, identifying a voice part from the sound data 4 recorded in the overwrite recording means 7, going back a predetermined time Tw1 before the detection time Tp of the detection step, and performing voice recognition on that voice part; and a control step of controlling operation based on the control content represented by the voice part identified in the voice recognition step.

With these, the voice recognition step uses sound data 4 that has already been recorded, going back the predetermined time Tw1 before the detection time Tp of the detection step, so that the beginning of the voice part to be recognized is not lost. There is therefore no need to delay the input timing of the sound data 4 itself.

In the voice recognition step, voice recognition can therefore be performed on the voice part almost in real time, at the moment recognition is needed, without losing the beginning of the voice part. In addition, the voice recognition step does not need to run constantly; it only needs to start and stop as required, so the load on resources is lower than when the voice recognition process runs constantly.

<Second Embodiment>
FIG. 11 is a block diagram showing a configuration example of the electronic device 100a according to the second embodiment.
The electronic device 100a of the second embodiment has substantially the same configuration as the electronic device 100 of the first embodiment and operates in substantially the same way. In the second embodiment, therefore, the same reference numerals as in FIGS. 1 to 10 of the first embodiment are used for identical configurations and operations, their description is omitted, and the following description focuses on the differences.

The second embodiment differs in that the digitizing unit 5 and the voice recognition unit 11 are connected. Specifically, in accordance with the progress of voice recognition on the voice part described above, the voice recognition unit 11 acquires the sound data 4 directly from the digitizing unit 5 without going through the overwrite memory 7.

In the first embodiment, the voice recognition unit 11 obtains the voice part from sound data 4 going back the predetermined time Tw1 before the detection time Tp of the detection unit 9, so processing falls very slightly short of real time by that look-back time Tw1.

Because the voice recognition process of the voice recognition unit 11 is faster than the recording of new sound data 4 into the overwrite memory 7, the voice recognition unit 11 can recognize the voice part of the looked-back sound data 4 and make up this very slight delay.

In the second embodiment, therefore, in accordance with the progress of voice recognition on the voice part, for example once this delay has been made up, the voice recognition unit 11 acquires the sound data 4 directly from the digitizing unit 5 instead of from the overwrite memory 7. The voice recognition unit 11 thereby avoids the time needed to write the sound data 4 into the overwrite memory 7 and can acquire the sound data 4 and run the voice recognition process at an even earlier stage.

In the electronic device 100 of the above embodiment, in addition to the configuration described above, the voice recognition means 11 (corresponding to the voice recognition unit) acquires the sound data 4 directly, without going through the overwrite recording means 7, in accordance with the progress of voice recognition on the voice part.

With this configuration, the voice recognition means 11 acquires the sound data 4 without going through the overwrite recording means 7. When the voice recognition process has capacity to spare, for example, it can therefore acquire the sound data 4 sooner and complete the voice recognition process earlier than in the first embodiment.
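The source switch of the second embodiment can be sketched as a small simulation. The assumptions are stated in the code: time advances in frame ticks, the recognizer can consume `speed` frames per tick while it is behind (it is faster than recording, as the text notes), and the function name `recognize_stream` and the source labels are illustrative only.

```python
def recognize_stream(num_frames, tp, tw1, speed=2):
    """Return [(frame_index, source), ...] in the order frames are consumed.

    At tick `now`, frames 0..now have been recorded. Starting at Tp, the
    recognizer reads from the overwrite memory beginning at Tp - Tw1,
    consuming up to `speed` frames per tick until it has caught up with the
    live input. From then on each new frame is taken straight from the
    digitizer.
    """
    consumed = []
    cursor = max(0, tp - tw1)  # next frame the recognizer needs
    caught_up = False
    for now in range(tp, num_frames):
        if caught_up:
            consumed.append((now, "digitizer"))
            cursor = now + 1
            continue
        for _ in range(speed):
            if cursor > now:
                break
            consumed.append((cursor, "memory"))
            cursor += 1
        if cursor > now:
            caught_up = True
    return consumed
```

With ten frames, Tp = 4, Tw1 = 3 and a speed of 2, the recognizer works through frames 1 to 6 from memory over three ticks, then takes frames 7, 8 and 9 directly. Every frame is consumed exactly once, so the switch does not drop or duplicate data in this model.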

The present embodiment is not limited to the above, and various modifications are possible. Such modifications are described in order below.
In the embodiments described above, the functions of the voice recognition device 1, namely those of the detection unit 9 and the voice recognition unit 11, may be implemented in software using the voice recognition program described above, or in hardware using circuits or the like. The digitizing unit 5 may likewise be implemented in software as part of the voice recognition program.

In the embodiments described above, the detection unit 9 outputs the trigger TG based on the volume level L of the sound data 4. This is not a limitation; the trigger TG may instead be output when, for example, a gesture of the speaking user is detected by image recognition.

In the embodiments described above, the microphone 3 may face the seat of the driver or a passenger, and the voice recognition unit 11 may determine whether voice is present based on the sound data 4 acquired by that microphone 3, for example by restricting attention to the frequency band of speech.

In the above embodiments, the specific frequency to be detected by the detection unit 9 may be obtained from, for example, general experimental or statistical data. That data may be held, for example, in a memory in a predetermined terminal (not shown) or on a server on a network (not shown). The specific frequency may also be obtained from measurement data of the voice of the owner of the device, such as the robot 100, or of the vehicle.

This measurement data may be registered by the owner in advance. Alternatively, if the voice recognition device 1 has a function for identifying the owner's voice, data on which frequency bands the owner's past utterances were distributed over may be acquired, and the frequency band may be determined from that cumulative data.

The band so determined may include every frequency at which an utterance has ever been observed, or only the statistically meaningful data may be extracted from the cumulative data. The determination may also use cumulative data on which frequency bands utterances spoken in this vehicle in the past were distributed over.

The determination may also be made by acquiring the sex and age of the driver, the front-seat passenger, or all passengers on board, and using frequency band data specific to that sex and age. For example, when both men and women are on board, the band may be set to the combination of both frequency bands, and when passengers of various ages are on board, the band may likewise be set to the combination of their frequency bands.

For the above determination, data on which frequency bands the utterances of the passengers currently on board are distributed over may also be acquired, and the frequency band may be determined from that cumulative data. For example, for the first utterance after boarding the vehicle on a given day, the frequency band may be determined from past data such as that of the previous ride, and then revised successively as utterance data from the current ride accumulates.
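Two of the band-selection options above can be sketched in a few lines. The data layout (a mapping from frequency bin in Hz to the number of past utterances observed there), the `min_share` cutoff as a stand-in for "statistically meaningful", and the choice to combine bands as the smallest single band covering all of them are assumptions of this example, as are the function names.

```python
def band_from_history(counts, min_share=0.0):
    """Pick a (low, high) band from cumulative utterance counts per bin.

    min_share=0.0 keeps every frequency that was ever observed; a larger
    value keeps only bins that hold at least that share of all
    observations. Returns None when no bin qualifies.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    kept = [f for f, c in counts.items() if c > 0 and c / total >= min_share]
    if not kept:
        return None
    return (min(kept), max(kept))


def merge_bands(bands):
    """Smallest single band covering all given (low, high) bands."""
    return (min(low for low, _ in bands), max(high for _, high in bands))
```

For the sequential revision mentioned above, the counts from the previous ride could seed the mapping, with the counts for the current ride added as utterances arrive, and `band_from_history` called again after each update.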

In the above embodiments, the detection unit 9 acquires a steady noise level L0, but the following approaches may also be used. The detection unit 9 may obtain the noise level L0 from, for example, general experimental or statistical data, or from data accumulated by measuring the noise level in this vehicle in the past. The detection unit 9 may also determine the noise level L0 by measuring the noise level in the vehicle during the current ride.

Furthermore, the detection unit 9 may measure the noise level L0 continuously during the ride and revise it in real time. That is, the detection unit 9 may continually check, for example against the most recent noise level, whether the noise level L0 is still appropriate. The detection unit 9 may also use a noise level L0 obtained by excluding voice parts from the collected sound. The detection unit 9 may, for example, acquire separate noise levels for when audio is being played and when it is not, determine whether audio is currently being played, and choose which noise level to use accordingly. The detection unit 9 may also monitor the volume level L at the specific frequency and determine whether sound with a volume level L exceeding the identified noise level L0 has been input.
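A continuously revised noise level of the kind described above could look like the following sketch. The exponential moving average, the fixed `margin` added to L0 to form the detection threshold, and the class name `NoiseTracker` are assumptions chosen for illustration; the text does not prescribe a particular update rule.

```python
class NoiseTracker:
    """Tracks the noise level L0 separately for audio playing / not playing.

    Frames judged to be voice (level above L0 + margin) are excluded from
    the noise estimate; all other frames update it with an exponential
    moving average.
    """

    def __init__(self, initial, margin, alpha=0.1):
        self.levels = {False: initial, True: initial}  # key: audio playing?
        self.margin = margin
        self.alpha = alpha

    def update(self, level, audio_playing):
        """Return True if `level` counts as voice; otherwise revise L0."""
        l0 = self.levels[audio_playing]
        if level > l0 + self.margin:
            return True  # voice part: do not fold it into the noise level
        self.levels[audio_playing] = (1 - self.alpha) * l0 + self.alpha * level
        return False
```

Keeping one estimate per playback state means that switching the audio on does not have to be relearned as a sudden rise in noise each time; the tracker simply selects the estimate that matches the current state.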

In the above embodiments, the voice recognition unit 11 judges that the voice recognition process should end when, for example, no voice has been input in the sound data 4 for a time Tw2. The judgment is not limited to this, and the following may be used instead.

That is, after starting the voice recognition process, the voice recognition unit 11 may perform voice recognition when there is, for example, voice, or a voice part corresponding to a sentence or word, and end the voice recognition process upon detecting the end of that voice or voice part. Alternatively, the voice recognition unit 11 may perform voice recognition in the same way and end the process if, after detecting the end of the sentence or word, a predetermined time elapses without detecting the next voice, or sound judged to be the next voice part corresponding to a sentence or word.
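The time-based end judgment (no voice for a time Tw2) can be sketched as below. Frame indices stand in for time, a frame counts as voice when its level exceeds the noise level L0, and the function name `find_end` and the returned pair are assumptions of this example. The returned values correspond to the voice part end time Te and the end determination time Tx.

```python
def find_end(levels, l0, start, tw2):
    """Return (te, tx) or None if the end cannot be determined yet.

    te is the index just after the last voiced frame, and tx is the index
    at which tw2 consecutive non-voiced frames have elapsed after it.
    Silence before the first voiced frame is not counted.
    """
    silent = 0
    last_voiced = None
    for t in range(start, len(levels)):
        if levels[t] > l0:
            last_voiced = t
            silent = 0
        elif last_voiced is not None:
            silent += 1
            if silent >= tw2:
                return last_voiced + 1, t + 1
    return None
```

The choice of Tw2 is a trade-off in this model: a short value ends recognition quickly but may cut a command at a brief pause between words, while a longer value tolerates pauses at the cost of a later end determination.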

FIG. 1 is a block diagram showing a configuration example of a robot as an example to which the electronic device of the first embodiment is applied.
FIG. 2 is a flowchart showing an example of the robot's processing procedure.
FIG. 3 is a diagram showing an example of the voice recognition process shown in FIG. 2 being performed.
FIGS. 4 to 10 are conceptual diagrams showing the flow of sound data processing in the voice recognition process.
FIG. 11 is a block diagram showing a configuration example of the electronic device of the second embodiment.

Explanation of symbols

3 Microphone (corresponds to the sound collection means)
4 Sound data
7 Overwrite memory (corresponds to the overwrite recording means)
11 Voice recognition unit (corresponds to the voice recognition means)
13 Control unit (corresponds to the control means)
100 Electronic device
L0 Noise level
Tp Detection time
Tw1 Predetermined time

Claims

An electronic device comprising:
overwrite recording means in which sound data based on collected sound is continuously overwritten and recorded;
detection means for detecting that the volume level of the sound data has exceeded a threshold volume level;
voice recognition means for, when the detection means detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded in the overwrite recording means, going back a predetermined time before the detection time of the detection means, and performing voice recognition on the voice part; and
control means for controlling operation based on the control content represented by the voice part identified by the voice recognition means.

The electronic device according to claim 1,
wherein the volume level of the sound data is a volume level of a specific frequency band of the sound data.

The electronic device according to claim 1 or 2,
wherein the control means controls operation so as to respond, in accordance with the control content, to the user who uttered the voice part.

The electronic device according to any one of claims 1 to 3,
wherein the detection means detects whether the volume level of the sound data has reached the threshold volume level, taking into account a noise level contained in the collected sound.

The electronic device according to any one of claims 1 to 4,
further comprising sound collection means for collecting sound from the surrounding environment and outputting sound data based on the sound.

The electronic device according to any one of claims 1 to 5,
wherein the voice recognition means acquires the sound data directly, without going through the overwrite recording means, in accordance with the progress of voice recognition on the voice part.

A method for controlling an electronic device, comprising:
a detection step of detecting, while sound data based on collected sound is being continuously overwritten and recorded in overwrite recording means, that the volume level of the sound data has exceeded a threshold volume level;
a voice recognition step of, when the detection step detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded in the overwrite recording means, going back a predetermined time before the detection time of the detection step, and performing voice recognition on the voice part; and
a control step of controlling operation based on the control content represented by the voice part identified in the voice recognition step.

A speech recognition apparatus comprising:
overwrite recording means in which sound data based on collected sound is continuously overwritten and recorded;
detection means for detecting that the volume level of the sound data has exceeded a threshold volume level; and
voice recognition means for, when the detection means detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded in the overwrite recording means, going back a predetermined time before the detection time of the detection means, and performing voice recognition on the voice part.

A speech recognition method comprising:
a detection step of detecting, while sound data based on collected sound is being continuously overwritten and recorded in overwrite recording means, that the volume level of the sound data has exceeded a threshold volume level; and
a voice recognition step of, when the detection step detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded in the overwrite recording means, going back a predetermined time before the detection time of the detection step, and performing voice recognition on the voice part.

A speech recognition program causing a computer to execute:
a detection step of detecting, while sound data based on collected sound is being continuously overwritten and recorded in overwrite recording means, that the volume level of the sound data has exceeded a threshold volume level; and
a voice recognition step of, when the detection step detects that the volume level of the sound data has exceeded the threshold volume level, identifying a voice part from the sound data recorded in the overwrite recording means, going back a predetermined time before the detection time of the detection step, and performing voice recognition on the voice part.