JP6450139B2

JP6450139B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP6450139B2
Application number: JP2014208834A
Authority: JP
Inventors: 孝輔辻野; 悠輔中島
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2014-10-10
Filing date: 2014-10-10
Publication date: 2019-01-09
Anticipated expiration: 2034-10-10
Also published as: JP2016080750A

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition program that recognize speech arriving from an arbitrary direction.

In recent years, techniques that use speech recognition for device operation, information retrieval, dialogue, and the like have become widespread. In particular, when a device such as a robot performs speech recognition and executes processing based on the recognition result, the device must accurately recognize speech arriving from an arbitrary direction. For this purpose, apparatuses are known that estimate the sound source direction and steer the directivity of a microphone array toward it.

For example, Patent Document 1 describes generating, in addition to a spatial filter with a null steered toward the sound source direction, a spatial filter with its directivity steered toward the sound source direction, obtaining a direction-versus-gain pattern for each filter, and estimating the sound source direction from both patterns. Patent Document 2 describes using the MUSIC method to estimate the sound source direction. Patent Document 3 describes detecting the sound of a speaker clapping their hands as a cue sound and setting the directivity direction of the microphone array accordingly.

JP 2012-150237 A; JP 2010-121975 A; International Publication No. 2011/055410

However, with the methods described in Patent Documents 1 and 2, the directivity direction may be set toward a sound source other than the speech uttered by the device user, such as ambient noise. In addition, the audio signal must be observed for some period in order to estimate the sound source direction. The method described in Patent Document 3 can prevent the directivity direction from being set toward ambient noise and the like, but the user must first perform an action such as clapping so that the sound source direction can be estimated. In other words, because speech recognition starts only after the sound source direction has been estimated, it takes time to obtain a speech recognition result, and the user experience suffers.

The present invention has been made in view of the above problems, and its object is to provide a speech recognition apparatus, a speech recognition method, and a speech recognition program that can accurately recognize speech from an arbitrary direction without the delay associated with estimating the sound source direction.

A speech recognition apparatus according to the present invention comprises: audio acquisition means for acquiring each of audio streams from a plurality of directions; speech recognition processing means for performing speech recognition on each of the audio streams from the plurality of directions acquired by the audio acquisition means; and sound source direction determination means for determining, when the speech recognition processing means obtains a speech recognition result that satisfies a predetermined reliability criterion, the direction corresponding to the audio stream from which that result was obtained as the sound source direction.

A speech recognition method according to the present invention is a speech recognition method executed by a speech recognition apparatus, and includes: an audio acquisition step of acquiring each of audio streams from a plurality of directions; a speech recognition processing step of performing speech recognition on each of the audio streams from the plurality of directions acquired in the audio acquisition step; and a sound source direction determination step of determining, when a speech recognition result that satisfies a predetermined reliability criterion is obtained in the speech recognition processing step, the direction corresponding to the audio stream from which that result was obtained as the sound source direction.

A speech recognition program according to the present invention causes a computer to function as: audio acquisition means for acquiring each of audio streams from a plurality of directions; speech recognition processing means for performing speech recognition on each of the audio streams from the plurality of directions acquired by the audio acquisition means; and sound source direction determination means for determining, when the speech recognition processing means obtains a speech recognition result that satisfies a predetermined reliability criterion, the direction corresponding to the audio stream from which that result was obtained as the sound source direction.

In the speech recognition apparatus according to the present invention, the speech recognition processing means obtains a speech recognition result for each of the audio streams from the plurality of directions acquired by the audio acquisition means. In addition, when the speech recognition processing means obtains a speech recognition result that satisfies the predetermined reliability criterion, the sound source direction determination means determines the direction corresponding to the audio stream from which that result was obtained as the sound source direction. Thus, rather than starting speech recognition after the sound source direction has been determined, the apparatus determines the sound source direction while performing speech recognition continuously. That is, speech from an arbitrary direction can be recognized without the delay associated with estimating the sound source direction. Furthermore, once the sound source direction has been determined, it becomes possible, for example, to perform more accurate speech recognition on the audio stream from the determined direction, so the accuracy of speech recognition can be increased. Therefore, the speech recognition apparatus can accurately recognize speech from an arbitrary direction without the delay associated with estimating the sound source direction.

The speech recognition apparatus may further comprise second speech recognition processing means for performing, on the audio stream from the sound source direction determined by the sound source direction determination means among the audio streams acquired by the audio acquisition means, speech recognition that is more accurate than the speech recognition performed by the speech recognition processing means.

With this speech recognition apparatus, when the sound source direction determination means has determined the sound source direction, the second speech recognition processing means can perform more accurate speech recognition on the audio stream from that direction.

In the speech recognition apparatus, the audio acquisition means may acquire an audio stream corresponding to the beam direction of each directional beam by setting directional beams in a plurality of predetermined directions.

With this speech recognition apparatus, an audio stream from each of a plurality of predetermined (fixed) directions can be acquired accurately.

In the speech recognition apparatus, the audio acquisition means may acquire an audio stream corresponding to the beam direction of each directional beam by setting directional beams in a plurality of directions that are sound source direction candidates estimated by a predetermined method.

With this speech recognition apparatus, setting directional beams in a plurality of directions that are sound source direction candidates estimated by, for example, the MUSIC method makes it possible to preferentially acquire audio streams from directions that are likely to be the sound source direction, which improves the accuracy of speech recognition.

In the speech recognition apparatus, the speech recognition processing means may determine that a speech recognition result satisfies the predetermined reliability criterion when the result contains a predetermined word.

With this speech recognition apparatus, the reliability of a speech recognition result can be judged simply and accurately based on whether a predetermined word has been recognized.

In the speech recognition apparatus, the speech recognition processing means may execute utterance section detection processing for detecting an utterance section, and perform speech recognition on the utterance section detected by that processing.

With this speech recognition apparatus, speech recognition can be performed only on the utterance sections detected by the utterance section detection processing. This avoids wasted speech recognition processing on the parts of the audio stream outside utterance sections and reduces power consumption.

According to the present invention, speech from an arbitrary direction can be recognized accurately without the delay associated with estimating the sound source direction.

A block diagram showing the functional configuration of the speech recognition apparatus according to an embodiment of the present invention.
A block diagram showing the hardware configuration of the speech recognition apparatus.
A diagram showing an example of beam directions set with a plurality of microphones.
A flowchart showing the operation of the speech recognition apparatus.
A block diagram showing the module configuration of the speech recognition program.

Hereinafter, a speech recognition apparatus, a speech recognition method, and a speech recognition program according to an embodiment of the present invention will be described with reference to the drawings. Where possible, identical parts are given identical reference numerals and redundant description is omitted. FIG. 1 is a block diagram showing the functional configuration of the speech recognition apparatus 1 according to the present embodiment. As shown in FIG. 1, the speech recognition apparatus 1 includes an audio input unit 11, a directivity control unit 12, a first speech recognition processing unit 13, a sound source direction determination unit 14, a second speech recognition processing unit 15, and a speech recognition result output unit 16.

The speech recognition apparatus 1 is configured as an apparatus that recognizes a user's uttered speech and executes processing according to the recognition result. For example, the speech recognition apparatus 1 may be installed in the living room of a home and configured to recognize the user's speech and instruct home appliances and the like, by radio waves or similar means, to execute processing according to the recognition result, or it may be built into the very device that executes that processing. The speech recognition apparatus 1 may also be configured as a spoken dialogue device (for example, a robot) that presents a response to a user's question to that user as text, speech, or the like.

FIG. 2 is a block diagram showing an example of the hardware configuration of the speech recognition apparatus 1. As shown in FIG. 2, the hardware of the speech recognition apparatus 1 includes, for example, a CPU (Central Processing Unit) 10A, a RAM (Random Access Memory) 10B, a ROM (Read Only Memory) 10C, an input device 10D, a communication device 10E such as a radio module for communicating with external devices, an auxiliary storage device 10F, and an output device 10G. The input device 10D includes a plurality of microphones corresponding to the audio input unit 11 and may also include other input devices such as a keyboard and a mouse. The output device 10G is, for example, a display that outputs a response as text or a speaker that outputs a response as speech. Each function of the speech recognition apparatus 1 is realized, for example, by loading the speech recognition program P described later into the RAM 10B or the like and having the CPU 10A execute it.

Note that the speech recognition apparatus 1 does not necessarily have to include all of the hardware described above. For example, if the speech recognition apparatus 1 has no function for outputting a response as text, speech, or the like, it need not include the output device 10G. The speech recognition apparatus 1 may be configured as a single physical device, or as a plurality of physically separate devices operating in cooperation.

The audio input unit 11 is audio input means that collects sound around the speech recognition apparatus 1 and acquires it as signals of a plurality of channels (a plurality of frequency bands). The audio input unit 11 is constituted by, for example, a plurality of microphones.

The directivity control unit 12 is audio acquisition means that acquires each of audio streams from a plurality of directions. Using a well-known technique such as a fixed beamformer, the directivity control unit 12 performs signal processing that emphasizes only sound arriving from a preset beam direction and suppresses sound arriving from other directions. More specifically, the directivity control unit 12 applies this signal processing to the multi-channel signals obtained from the audio input unit 11 and thereby generates, for each of a plurality of beam directions, an audio stream in which only the sound arriving from that beam direction is emphasized and the sound arriving from the other beam directions is suppressed.
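The source names a fixed beamformer only as one well-known option and gives no algorithm. As an illustration, the following is a minimal sketch of a delay-and-sum beamformer, assuming a far-field plane wave, a planar microphone layout in metres, and integer sample delays. The function names, the speed-of-sound constant, and the layout are assumptions made for this example, not details from the patent.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def steering_delays(mic_positions, beam_deg, sample_rate):
    """Integer sample delays that time-align a plane wave arriving from beam_deg.

    mic_positions is a list of (x, y) coordinates in metres (hypothetical layout).
    """
    ux = math.cos(math.radians(beam_deg))
    uy = math.sin(math.radians(beam_deg))
    # Projection of each microphone onto the unit vector pointing at the source.
    # A larger projection means the wavefront reaches that microphone earlier,
    # so it needs a longer delay to line up with the others.
    proj = [x * ux + y * uy for x, y in mic_positions]
    lowest = min(proj)
    return [round((p - lowest) / SPEED_OF_SOUND * sample_rate) for p in proj]


def delay_and_sum(channels, delays):
    """Delay each channel by its steering delay and average the results."""
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        for t in range(d, n):
            out[t] += signal[t - d]
    return [v / len(channels) for v in out]
```

Running `delay_and_sum` once per beam direction, each time with the delays for that direction, yields one directional audio stream per beam. Sound from the steered direction adds coherently, while sound from other directions is attenuated by averaging.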

The directivity control unit 12 may acquire an audio stream corresponding to the beam direction of each directional beam by setting directional beams in a plurality of predetermined directions. That is, the plurality of beam directions set by the directivity control unit 12 may be fixed beam directions set in advance. This makes it possible to accurately acquire an audio stream from each of the plurality of predetermined (fixed) directions. In particular, when it is not known in advance from which direction speech will arrive, the beam directions may be set so that the plurality of beams cover all directions, as shown in FIG. 3. In the example of FIG. 3, beam directions are set in eight directions at 45-degree intervals in the horizontal plane, centered on the speech recognition apparatus 1. With the beam directions set in this way, the directivity control unit 12 can generate, for example, an audio stream that emphasizes the sound arriving from a sound source X located along beam direction a, which points from the speech recognition apparatus 1 toward the upper right in FIG. 3.
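The eight-direction layout of FIG. 3 can be sketched as follows. The helper names and the choice of 0 degrees as the reference direction are assumptions made for illustration. The sketch also shows which fixed beam lies closest to a source at a given angle.

```python
def fixed_beam_directions(count=8):
    """Evenly spaced beam directions in degrees covering the full horizontal plane."""
    step = 360.0 / count
    return [i * step for i in range(count)]


def angular_distance(a_deg, b_deg):
    """Smallest absolute angle between two directions, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def nearest_beam(source_deg, beams):
    """Beam direction closest to a source located at source_deg."""
    return min(beams, key=lambda b: angular_distance(source_deg, b))
```

With eight beams at 45-degree intervals, a source is never more than 22.5 degrees away from the nearest beam direction.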

When the candidate source directions of the speech input to the speech recognition apparatus 1 are limited to a certain range, the directivity control unit 12 may set the beam directions so that the plurality of beams cover only that range. For example, consider a case where the speech recognition apparatus 1 is built into a television receiver and the user instructs the television receiver by speech to perform a predetermined operation (for example, changing the channel). In this case, the user is assumed to be in front of the television screen. That is, the sound source direction candidates are limited to the 180-degree range in front of the screen. Accordingly, the directivity control unit 12 may set the beam directions so as to cover only that 180-degree range. The number of beam directions that can be set depends on the number of microphones, the sharpness of the directivity produced by the signal processing, and other factors, but it is usually possible to set more beam directions than there are microphones.
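Restricting the beams to a sector, as in the television example, might look like the sketch below. The function name and the convention of placing each beam at the center of an equal-width sub-sector are assumptions; the source does not state how the beams are distributed within the range.

```python
def sector_beam_directions(center_deg, span_deg, count):
    """Beam directions (degrees) covering only span_deg around center_deg.

    The sector is split into `count` equal sub-sectors and one beam is
    placed at the center of each (an assumed layout).
    """
    width = span_deg / count
    start = center_deg - span_deg / 2.0
    return [(start + (i + 0.5) * width) % 360.0 for i in range(count)]
```

For a screen facing 90 degrees, `sector_beam_directions(90.0, 180.0, 4)` places four beams inside the front half-plane and none behind the screen.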

The directivity control unit 12 may also acquire an audio stream corresponding to the beam direction of each directional beam by setting directional beams in a plurality of directions that are sound source direction candidates estimated by a predetermined method. That is, the directivity control unit 12 may set, as the plurality of beam directions, a plurality of sound source direction candidates estimated by, for example, the MUSIC method applied to the acoustic signal input to the audio input unit 11. Alternatively, the directivity control unit 12 may set, as the plurality of beam directions, a plurality of sound source direction candidates estimated by combining sound source direction estimation by the MUSIC method or the like with sound source direction tracking using a technique such as a Kalman filter or a particle filter. When the beam directions are set in this way, unlike the fixed beamformer case described above, the beam directions are set and changed depending on the acoustic signal and are therefore variable. Setting directional beams in a plurality of candidate directions estimated by the MUSIC method or the like makes it possible to preferentially acquire audio streams from directions that are likely to be the sound source direction, which improves the accuracy of speech recognition.
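The MUSIC computation itself is outside the scope of this passage, but the step of turning a spatial spectrum into several candidate beam directions can be sketched. The following assumes that some estimator has already produced one spectrum value per scanned angle; the function name and the peak-picking rule are illustrative assumptions, not the patent's method.

```python
def candidate_directions(spectrum, angles_deg, max_candidates=3):
    """Pick the strongest local maxima of a circular spatial spectrum.

    spectrum[i] is the (assumed precomputed) spectrum value for angles_deg[i].
    The angle grid is treated as wrapping around at 360 degrees.
    """
    n = len(spectrum)
    peaks = []
    for i in range(n):
        left = spectrum[(i - 1) % n]
        right = spectrum[(i + 1) % n]
        # Strictly above the left neighbour, at least as high as the right one.
        if spectrum[i] > left and spectrum[i] >= right:
            peaks.append((spectrum[i], angles_deg[i]))
    peaks.sort(reverse=True)
    return [angle for _, angle in peaks[:max_candidates]]
```

The returned angles would then be used as the beam directions, so the set of beams changes whenever the spectrum changes.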

The first speech recognition processing unit 13 is speech recognition processing means that performs speech recognition on each of the audio streams from the plurality of directions acquired by the directivity control unit 12. Hereinafter, each of these audio streams is also referred to as a directional audio stream. The plurality of directional audio streams are acquired continuously through the processing of the audio input unit 11 and the directivity control unit 12 described above and are input to the first speech recognition processing unit 13. The first speech recognition processing unit 13 therefore performs speech recognition continuously on each of the plurality of directional audio streams.

A directional audio stream may contain only noise and no human voice. It may also contain only sections that are almost silent. Therefore, to avoid mistaking noise in a directional audio stream for a human voice and obtaining an erroneous recognition result, the first speech recognition processing unit 13 may apply well-known noise removal processing to each directional audio stream before performing speech recognition. In addition, to detect the utterance sections (sections containing a human voice) on which speech recognition should be performed, the first speech recognition processing unit 13 may apply well-known utterance section detection processing to each directional audio stream and perform speech recognition on the detected utterance sections. This avoids wasted speech recognition processing on the parts of the directional audio stream outside utterance sections and reduces power consumption.
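The source refers to utterance section detection only as a well-known process. One simple form is frame-energy thresholding, sketched below; the frame length, threshold, and minimum section length are assumed parameters, and practical detectors are usually more elaborate.

```python
def detect_speech_sections(samples, frame_len, energy_threshold, min_frames=2):
    """Return (start, end) sample index pairs of sections judged to contain speech.

    A frame is active when its mean squared amplitude reaches energy_threshold.
    Runs of fewer than min_frames active frames are discarded. `end` is exclusive.
    """
    sections = []
    start = None
    n_frames = len(samples) // frame_len
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy >= energy_threshold:
            if start is None:
                start = f  # first active frame of a new run
        elif start is not None:
            if f - start >= min_frames:
                sections.append((start * frame_len, f * frame_len))
            start = None
    # Close a run that is still open at the end of the buffer.
    if start is not None and n_frames - start >= min_frames:
        sections.append((start * frame_len, n_frames * frame_len))
    return sections
```

Only the returned sections would be passed to the recognizer, so silent or near-silent stretches of each directional audio stream cost no recognition work.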

The first speech recognition processing unit 13 determines whether the speech recognition result obtained by performing speech recognition on each directional audio stream (or on each directional audio stream after the noise removal processing, utterance section detection processing, and the like described above) satisfies a predetermined reliability criterion. As the reliability used for this determination, for example, a reliability based on the likelihood of the output hypothesis, a well-known measure in the field of statistical speech recognition, can be used.
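The source does not specify how the likelihood is converted into a reliability. One common approach, shown here purely as an assumption, normalizes the best hypothesis against the other hypotheses in an N-best list and compares the result with a threshold. The threshold value 0.8 is arbitrary.

```python
import math


def nbest_confidence(log_likelihoods):
    """Posterior-style confidence of the best hypothesis in an N-best list.

    Computes exp(best) / sum(exp(all)) in a numerically stable way.
    """
    best = max(log_likelihoods)
    total = sum(math.exp(v - best) for v in log_likelihoods)
    return 1.0 / total


def meets_criterion(log_likelihoods, threshold=0.8):
    """True when the best hypothesis is confident enough (assumed criterion)."""
    return nbest_confidence(log_likelihoods) >= threshold
```

A result whose best hypothesis clearly dominates its competitors passes, while a result with several near-equal hypotheses, as is typical for noise, does not.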

When one or more command words are defined in advance as the words a user must utter (input by voice) first in order to have the speech recognition apparatus 1 recognize the user's speech and execute predetermined processing, the first speech recognition processing unit 13 may determine that a speech recognition result satisfies the predetermined reliability criterion when the result contains one of the predetermined command words. In this case, the first speech recognition processing unit 13 may restrict its recognition vocabulary to the command words only. The first speech recognition processing unit 13 can then determine that the reliability criterion is satisfied whenever recognition of a directional audio stream yields any result at all (that is, a result indicating that one of the command words was recognized). In other words, the reliability of a speech recognition result can be judged simply and accurately based on whether a command word has been recognized.
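The command-word criterion reduces to a membership test on the recognized text. A minimal sketch, in which the command words themselves and the whitespace and case normalization are hypothetical choices for illustration:

```python
COMMAND_WORDS = {"hello robot", "ok device"}  # hypothetical command words


def find_command_word(recognized_text, command_words=COMMAND_WORDS):
    """Return the command word contained in the recognition result, or None.

    Matching is done on whole words after lowercasing and collapsing whitespace.
    """
    text = " " + " ".join(recognized_text.lower().split()) + " "
    for word in sorted(command_words):
        if " " + word + " " in text:
            return word
    return None
```

A non-`None` return value plays the role of "the result satisfies the reliability criterion" for the stream that produced it.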

When the first speech recognition processing unit 13 obtains, from a certain directional audio stream, a speech recognition result that satisfies the predetermined reliability criterion, it estimates that human speech is arriving from the arrival direction of that directional audio stream (that is, the beam direction corresponding to that stream) and notifies the sound source direction determination unit 14 of that beam direction. The first speech recognition processing unit 13 also outputs the speech recognition results obtained for each directional audio stream to the speech recognition result output unit 16.

The sound source direction determination unit 14 is sound source direction determination means that, when the first speech recognition processing unit 13 obtains a speech recognition result satisfying the predetermined reliability criterion, determines the direction corresponding to the audio stream from which that result was obtained as the sound source direction.

For example, when the first speech recognition processing unit 13 notifies the sound source direction determination unit 14 of a beam direction estimated to be the arrival direction of human speech as described above, the sound source direction determination unit 14 determines that beam direction as the sound source direction. The sound source direction determination unit 14 then acquires the directional audio stream corresponding to that beam direction from the directivity control unit 12 and outputs it to the second speech recognition processing unit 15. Alternatively, the sound source direction determination unit 14 may acquire the multi-channel signals obtained from the audio input unit 11 and apply its own signal processing to them to obtain the directional audio stream corresponding to the beam direction notified by the first speech recognition processing unit 13.

In the initial state, in which the first speech recognition processing unit 13 has not yet notified it of any beam direction, the sound source direction determination unit 14 may output to the second speech recognition processing unit 15 the directional audio stream corresponding to a preset initial beam direction, or may output no directional audio stream at all.

The sound source direction determination unit 14 may return to the initial state when a preset period has elapsed since the last beam direction notification from the first speech recognition processing unit 13 without any further notification. In addition, when the sound source direction determination unit 14 has been notified of one beam direction by the first speech recognition processing unit 13 and is later notified of another beam direction, it may determine the later-notified beam direction as the latest sound source direction and output the directional audio stream corresponding to that direction to the second speech recognition processing unit 15.
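The initial state, the timeout, and the latest-notification-wins behavior of the sound source direction determination unit 14 can be summarized as a small state holder. The class and method names are assumptions, and timestamps are passed in explicitly so the sketch stays deterministic.

```python
class SourceDirectionDeterminer:
    """Keeps the most recently notified beam direction, with a timeout."""

    def __init__(self, timeout_s, initial_direction=None):
        self.timeout_s = timeout_s
        # None models "output no directional audio stream in the initial state".
        self.initial_direction = initial_direction
        self._direction = initial_direction
        self._last_notified = None

    def notify(self, beam_direction, now):
        """Called when the first recognizer reports a reliable result."""
        self._direction = beam_direction  # a later notification replaces an earlier one
        self._last_notified = now

    def current(self, now):
        """Current sound source direction, reverting to the initial state on timeout."""
        if self._last_notified is not None and now - self._last_notified > self.timeout_s:
            self._direction = self.initial_direction
            self._last_notified = None
        return self._direction
```

For example, with `timeout_s=10.0`, a notification of 45 degrees at time 1.0 followed by one of 90 degrees at time 6.0 makes `current(10.0)` return 90 degrees, and `current(17.0)` returns the initial direction again.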

The second speech recognition processing unit 15 is second speech recognition processing means that performs speech recognition on, among the audio streams acquired by the directivity control unit 12, the audio stream from the sound source direction determined by the sound source direction determination unit 14. The speech recognition performed by the second speech recognition processing unit 15 covers a larger vocabulary than that performed by the first speech recognition processing unit 13 and is more accurate. Consequently, once the sound source direction determination unit 14 has determined the sound source direction, the second speech recognition processing unit 15 can perform more accurate speech recognition on the audio stream from that direction. For example, the speech recognition by the first speech recognition processing unit 13 may be local speech recognition, in which the recognition processing is executed within the speech recognition device 1, and the speech recognition by the second speech recognition processing unit 15 may be server-based speech recognition, which uses an external server. Note that the speech recognition processing by the second speech recognition processing unit 15 may be identical to that by the first speech recognition processing unit 13.

When performing the server-based speech recognition described above, the second speech recognition processing unit 15 may perform speech recognition by transmitting the directional audio stream acquired from the sound source direction determination unit 14 to a server (not shown) capable of speech recognition, having the server perform the recognition, and receiving the speech recognition result (for example, text; the same applies hereinafter) from the server. By having a server or the like equipped with a high-performance speech recognition engine perform the recognition in this way, the second speech recognition processing unit 15 can perform more accurate speech recognition on the audio stream from the sound source direction determined by the sound source direction determination unit 14. Data is exchanged between the second speech recognition processing unit 15 and the server via the Internet, a LAN (Local Area Network), or the like, for example by using the communication function of the communication device 10E described above.

The second speech recognition processing unit 15 performs speech recognition on the directional audio stream acquired from the sound source direction determination unit 14 as described above and outputs the resulting speech recognition result to the speech recognition result output unit 16.

The speech recognition result output unit 16 is speech recognition result output means that outputs a speech recognition result acquired from at least one of the first speech recognition processing unit 13 and the second speech recognition processing unit 15. The speech recognition result output unit 16 may output only the result acquired from one of the two units, or may output text that combines the results acquired from both. As for the specific output method, the speech recognition result output unit 16 may, for example, present the speech recognition result (text) to the user by outputting it to an output device 10G such as a display, or may synthesize speech corresponding to the speech recognition result by a known technique and output the resulting speech to an output device 10G such as a speaker.

In addition to outputting the speech recognition result as text, speech, or the like, the speech recognition result output unit 16 may present to the user information indicating some response based on the speech recognition result. For example, when the speech recognition device 1 is configured as a device that performs information retrieval based on the content of the user's utterance, the speech recognition result output unit 16 may present the search result to the user as text, speech, or the like. Similarly, when the speech recognition device 1 is configured as a device that executes a predetermined device operation based on the content of the user's utterance (for example, turning off a light through a remote controller), the speech recognition result output unit 16 may present the result of that device operation (for example, information indicating that the light has been turned off) to the user as text, speech, or the like. Further, when the speech recognition device 1 is configured as a device that converses with the user by answering the user's questions, such as in casual chat, the speech recognition result output unit 16 may present an answer message to the user's question as text, speech, or the like.

Next, an example of the processing (speech recognition method) executed by the speech recognition device 1 is described with reference to the flowchart shown in FIG. 4. First, external sound is continuously input to the voice input unit 11, which consists of a plurality of microphones or the like. The directivity control unit 12 then performs signal processing on the multi-channel signals obtained from the voice input unit 11 and acquires the audio streams from a plurality of directions (that is, the directional audio streams corresponding to the respective beam directions) (step S1, audio acquisition step).
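
As a rough illustration of how step S1 can turn multi-channel input into one stream per beam direction, the sketch below uses delay-and-sum beamforming with integer sample delays. This passage does not name the signal processing used by the directivity control unit 12, so delay-and-sum is an assumed stand-in, and the function names, the direction labels, and the steering delays are hypothetical. In practice the delays for each direction would follow from the microphone array geometry.

```python
from typing import Dict, List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Align each channel by its integer sample delay and average the channels."""
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, delays):
        for i in range(n):
            j = i + delay  # read ahead to undo this channel's arrival delay
            if 0 <= j < n:
                out[i] += samples[j]
    return [v / len(channels) for v in out]


def directional_streams(channels: Sequence[Sequence[float]],
                        steering: Dict[int, Sequence[int]]) -> Dict[int, List[float]]:
    """Produce one output stream per beam direction (assumed model of step S1)."""
    return {direction: delay_and_sum(channels, delays)
            for direction, delays in steering.items()}
```

When the steering delays match the actual arrival delays, the channels add coherently and the signal keeps its full amplitude; for other directions the same signal is smeared and attenuated.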

Next, the first speech recognition processing unit 13 performs speech recognition on each directional audio stream (step S2, speech recognition processing step). The first speech recognition processing unit 13 then determines whether the speech recognition result for each directional audio stream satisfies a predetermined reliability criterion (step S3).

If it is determined in step S3 that a speech recognition result satisfying the predetermined reliability criterion has been obtained (step S3: YES), the first speech recognition processing unit 13 notifies the sound source direction determination unit 14 of the beam direction corresponding to the directional audio stream from which that speech recognition result was obtained. The sound source direction determination unit 14 then determines the direction corresponding to that audio stream to be the sound source direction (step S4, sound source direction determination step) and outputs the directional audio stream corresponding to the sound source direction to the second speech recognition processing unit 15.
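
The decision in steps S3 and S4 can be sketched as a threshold test over the first-pass results. This is only an illustration under stated assumptions: the `FirstPassResult` type, the 0.0 to 1.0 confidence scale, and the 0.8 threshold are hypothetical, and choosing the most confident beam when several beams qualify is a tie-break added for the example rather than something this passage specifies.

```python
from typing import NamedTuple, Optional, Sequence


class FirstPassResult(NamedTuple):
    beam_direction: int  # degrees; hypothetical representation
    text: str
    confidence: float    # assumed scale from 0.0 to 1.0


def pick_source_direction(results: Sequence[FirstPassResult],
                          threshold: float = 0.8) -> Optional[int]:
    """Return the beam direction to notify, or None for "step S3: NO"."""
    qualifying = [r for r in results if r.confidence >= threshold]
    if not qualifying:
        return None
    # Tie-break assumption: if several beams qualify, take the most confident one.
    return max(qualifying, key=lambda r: r.confidence).beam_direction
```

A keyword-based criterion, such as checking whether the recognized text contains a predetermined word, could replace the numeric threshold without changing the structure.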

On the other hand, if it is not determined in step S3 that a speech recognition result satisfying the predetermined reliability criterion has been obtained (step S3: NO), the first speech recognition processing unit 13 does not notify the sound source direction determination unit 14 of a beam direction (a beam direction estimated to be the sound source direction), and the sound source direction determination unit 14 outputs to the second speech recognition processing unit 15 the directional audio stream corresponding to the preset initial beam direction (or to a beam direction notified by the first speech recognition processing unit 13 within a fixed past period).

Next, the second speech recognition processing unit 15 performs speech recognition on the audio stream from the currently set sound source direction (step S5, second speech recognition processing step). Here, the "currently set sound source direction" is the sound source direction determined by the sound source direction determination unit 14 in step S4 if it was determined in step S3 that a speech recognition result satisfying the predetermined reliability criterion was obtained (step S3: YES). Otherwise (step S3: NO), it is the preset initial beam direction (or a beam direction notified by the first speech recognition processing unit 13 within a fixed past period).

Next, the speech recognition result output unit 16 outputs at least one of the speech recognition results obtained by the first speech recognition processing unit 13 for the directional audio streams and the speech recognition result obtained by the second speech recognition processing unit 15 (step S6). The speech recognition result may be output by presenting it to the user as is, as text, speech, or the like, or by presenting to the user, as text, speech, or the like, information indicating some response based on the speech recognition result (for example, a search result, the result of a device operation, or an answer message to the user's question).
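
Steps S2 to S6 can be read as a single pass over streams that step S1 has already produced. The sketch below is a simplified illustration under assumptions, not the device's actual implementation: recognizers are modeled as plain functions returning a (text, confidence) pair, streams are stand-in strings, and `run_once`, `fallback_direction`, and the 0.8 threshold are hypothetical names and values.

```python
from typing import Callable, Dict, Optional, Tuple

Stream = str  # stand-in for audio data; a real stream would hold samples
Recognizer = Callable[[Stream], Tuple[str, float]]  # returns (text, confidence)


def run_once(streams: Dict[int, Stream],
             first_pass: Recognizer,
             second_pass: Recognizer,
             fallback_direction: Optional[int],
             threshold: float = 0.8) -> Tuple[Optional[int], Optional[str]]:
    """One pass over steps S2 to S6 for streams already acquired in step S1."""
    # S2: first-pass recognition on every directional stream.
    first = {d: first_pass(s) for d, s in streams.items()}
    # S3: does any result satisfy the reliability criterion?
    passing = {d: r for d, r in first.items() if r[1] >= threshold}
    if passing:
        # S4: the direction of the qualifying stream becomes the source direction
        # (most confident beam wins; this tie-break is an assumption).
        direction = max(passing, key=lambda d: passing[d][1])
    else:
        # S3: NO, so fall back to the initial or recently notified direction.
        direction = fallback_direction
    if direction is None or direction not in streams:
        return direction, None
    # S5: second-pass recognition on the selected stream only.
    text, _ = second_pass(streams[direction])
    # S6: output the result (here, simply return it).
    return direction, text
```

The first pass runs on every stream in every call, which mirrors the point that recognition does not wait for the sound source direction to be fixed.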

In the speech recognition device 1 described above, the first speech recognition processing unit 13 obtains a speech recognition result for each of the audio streams from the plurality of directions acquired by the directivity control unit 12. In addition, when the first speech recognition processing unit 13 obtains a speech recognition result that satisfies the predetermined reliability criterion, the sound source direction determination unit 14 determines the direction corresponding to the audio stream from which that result was obtained to be the sound source direction. Thus, rather than starting speech recognition only after the sound source direction has been determined, the speech recognition device 1 can determine the sound source direction while continuously performing speech recognition. In other words, it can perform speech recognition on speech from an arbitrary direction without the delay that accompanies estimation of the sound source direction.

More specifically, speech recognition by the first speech recognition processing unit 13 is performed continuously regardless of whether the sound source direction has been determined. Therefore, when the user makes an utterance intended to cause the speech recognition device 1 to execute some processing (for example, the device operation described above), at least the speech recognition by the first speech recognition processing unit 13 is performed immediately. Consequently, when that speech recognition succeeds, the speech recognition device 1 can appropriately execute processing based on the speech recognition result without the delay that accompanies estimation of the sound source direction.

Furthermore, after the sound source direction has been determined, the second speech recognition processing unit 15 performs more accurate speech recognition on the audio stream from the determined sound source direction, which improves the accuracy of speech recognition. The speech recognition device 1 can therefore accurately recognize speech from an arbitrary direction without the delay that accompanies estimation of the sound source direction.

Next, a speech recognition program for causing a computer to execute the series of processing by the speech recognition device 1 described above is described. The speech recognition program P1 is stored in a program storage area formed on a recording medium that is inserted into and accessed by a computer, or that is provided in the computer.

As shown in FIG. 5, the speech recognition program P1 comprises a voice input module P11, a directivity control module P12, a first speech recognition processing module P13, a sound source direction determination module P14, a second speech recognition processing module P15, and a speech recognition result output module P16. The functions realized by executing the voice input module P11, the directivity control module P12, the first speech recognition processing module P13, the sound source direction determination module P14, the second speech recognition processing module P15, and the speech recognition result output module P16 are the same as those of the voice input unit 11, the directivity control unit 12, the first speech recognition processing unit 13, the sound source direction determination unit 14, the second speech recognition processing unit 15, and the speech recognition result output unit 16 of the speech recognition device 1 described above, respectively.

Part or all of the speech recognition program P1 may be transmitted via a transmission medium such as a communication line and received and recorded (including installed) by another device. In addition, each module of the speech recognition program P1 may be installed on any of a plurality of computers rather than on a single computer. In that case, the series of processing of the speech recognition device 1 described above is performed by a computer system made up of those computers.

Preferred embodiments and modifications of the present invention have been described above. However, the present invention is not limited to the above embodiments, and various modifications are possible without departing from its gist.

Description of reference signs: 1 ... speech recognition device, 11 ... voice input unit, 12 ... directivity control unit, 13 ... first speech recognition processing unit, 14 ... sound source direction determination unit, 15 ... second speech recognition processing unit, 16 ... speech recognition result output unit, P1 ... speech recognition program, P11 ... voice input module, P12 ... directivity control module, P13 ... first speech recognition processing module, P14 ... sound source direction determination module, P15 ... second speech recognition processing module, P16 ... speech recognition result output module.

Claims

A speech recognition device comprising:
audio acquisition means for acquiring each of audio streams from a plurality of directions;
speech recognition processing means for performing speech recognition on each of the audio streams from the plurality of directions acquired by the audio acquisition means;
sound source direction determination means for determining, when the speech recognition processing means obtains a speech recognition result that satisfies a predetermined reliability criterion, the direction corresponding to the audio stream from which that speech recognition result was obtained as a sound source direction; and
second speech recognition processing means for performing, on the audio stream from the sound source direction determined by the sound source direction determination means among the audio streams acquired by the audio acquisition means, speech recognition with higher accuracy than the speech recognition performed by the speech recognition processing means.

The speech recognition device according to claim 1, wherein the audio acquisition means sets directional beams in a plurality of predetermined directions and thereby acquires the audio stream corresponding to the beam direction of each directional beam.

The speech recognition device according to claim 1, wherein the audio acquisition means sets directional beams in a plurality of directions that are candidates for the sound source direction estimated by a predetermined method and thereby acquires the audio stream corresponding to the beam direction of each directional beam.

The speech recognition device according to any one of claims 1 to 3, wherein the speech recognition processing means determines that a speech recognition result satisfies the predetermined reliability criterion when the speech recognition result contains a predetermined word.

The speech recognition device according to any one of claims 1 to 4, wherein the speech recognition processing means performs speech section detection processing that detects speech sections, and performs speech recognition on the speech sections detected by the speech section detection processing.

A speech recognition method executed by a speech recognition device, the method comprising:
an audio acquisition step of acquiring each of audio streams from a plurality of directions;
a speech recognition processing step of performing speech recognition on each of the audio streams from the plurality of directions acquired in the audio acquisition step;
a sound source direction determination step of determining, when a speech recognition result that satisfies a predetermined reliability criterion is obtained in the speech recognition processing step, the direction corresponding to the audio stream from which that speech recognition result was obtained as a sound source direction; and
a second speech recognition processing step of performing, on the audio stream from the sound source direction determined in the sound source direction determination step among the audio streams acquired in the audio acquisition step, speech recognition with higher accuracy than the speech recognition performed in the speech recognition processing step.

A speech recognition program causing a computer to function as:
audio acquisition means for acquiring each of audio streams from a plurality of directions;
speech recognition processing means for performing speech recognition on each of the audio streams from the plurality of directions acquired by the audio acquisition means;
sound source direction determination means for determining, when the speech recognition processing means obtains a speech recognition result that satisfies a predetermined reliability criterion, the direction corresponding to the audio stream from which that speech recognition result was obtained as a sound source direction; and
second speech recognition processing means for performing, on the audio stream from the sound source direction determined by the sound source direction determination means among the audio streams acquired by the audio acquisition means, speech recognition with higher accuracy than the speech recognition performed by the speech recognition processing means.