JP2022183848A

JP2022183848A - Speech recognition device, display device, and control method and program and storage medium therefor

Info

Publication number: JP2022183848A
Application number: JP2021091349A
Authority: JP
Inventors: 祐介鳥海; Yusuke Chokai; 龍也華山; Tatsuya Hanayama; 拓弥坂牧; Takuya Sakamaki; 淳史菅原; Junji Sugawara; かおり戸村; Kaori Tomura; 陽介深井; Yosuke Fukai
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-05-31
Filing date: 2021-05-31
Publication date: 2022-12-13

Abstract

To provide a speech recognition device that enables operation commands for speech recognition control to be input even in a state in which speech is acquired with sensitivity enhanced in a specific direction, a display device with the speech recognition device, and a control method, program and storage medium therefor. SOLUTION: Electronic binoculars have a speech processing part 305 which executes beamforming control capable of acquiring a speech input from a specific direction with higher sensitivity than a speech input from a different direction, based upon a plurality of speech data obtained from a plurality of speech input parts 301 to 304, and which outputs a corresponding control command when the speech data obtained from the plurality of speech input parts 301 to 304 correspond to a predetermined command. The speech processing part 305 determines whether or not the plurality of speech data correspond to the predetermined command when the beamforming control is effective and the plurality of speech data input from the plurality of speech input parts 301 to 304 meet predetermined conditions. SELECTED DRAWING: Figure 3

Description

The present invention relates to a speech recognition device, a display device having the speech recognition device, a control method thereof, a program, and a storage medium.

Some electronic binoculars include a telephoto camera and a display unit that displays an image captured by the telephoto camera. Electronic binoculars may record sound in addition to images. To capture sound related to a subject observed with the telephoto camera, a microphone that has strong sensitivity in a specific direction (that is, a directional microphone) is sometimes used. Instead of using a strongly directional microphone, sound from a specific direction can also be acquired by beamforming with a plurality of microphones.

Patent Document 1 discloses beamforming control using a plurality of microphones. Patent Document 1 further describes a method of preventing erroneous speech recognition by measuring the distance to a subject with a depth sensor and not controlling the directivity when there is no beamforming target.

In addition, when voice input is available, the user (observer) can use voice recognition processing to instruct control of the electronic binoculars hands-free.

Patent Document 1: U.S. Patent Application Publication No. 2015/0058003

However, in the conventional technology disclosed in the above patent document, when the subject is within the beamforming target range, sensitivity to sound from directions other than the specific direction decreases. In that state, even if a user near the electronic binoculars speaks in order to operate the device by voice recognition (that is, inputs an operation command), the voice recognition processing may not be executed properly because of the reduced sensitivity to sound from directions other than the specific direction.

Accordingly, an object of the present invention is to provide a speech recognition device, a display device, control methods therefor, a program, and a storage medium that enable input of an operation command for speech recognition control even while speech is being acquired with increased sensitivity in a specific direction.

One form of the speech recognition device according to the present invention includes: control means for executing beamforming control that can acquire speech input from a specific direction with higher sensitivity than speech input from other directions, based on a plurality of speech data obtained from a plurality of speech input means; and speech recognition means for outputting, when speech data obtained from the plurality of speech input means corresponds to a predetermined command, a control signal corresponding to that command. The speech recognition means determines whether or not the plurality of speech data corresponds to the predetermined command when the beamforming control is enabled and the plurality of speech data input from the plurality of speech input means satisfies a predetermined condition.

According to the speech recognition device, the display device, the control methods therefor, the program, and the storage medium of the present invention, an operation command for speech recognition control can be input even while speech is being acquired with increased sensitivity in a specific direction.

FIG. 1 is a flowchart of the voice recognition performed by the electronic binoculars.
FIG. 2 is an external view of the electronic binoculars.
FIG. 3 is a block diagram of the electronic binoculars.
FIG. 4 is a block diagram showing functional blocks of the audio processing unit.
FIG. 5 is a timing chart of the signals in voice recognition control during beamforming control.
FIG. 6 is a schematic diagram showing an example data structure of voice commands.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In this embodiment, electronic binoculars including a telephoto camera and a display unit that displays an image captured by the telephoto camera are taken as an example of a speech recognition device and of a display device including the speech recognition device. The devices to which the present invention can be applied are not limited to this example. The present invention is applicable to any device that can acquire voice input with strong directivity in a specific direction by using the inputs of a plurality of sound collectors (microphones) and that can execute device control by voice recognition. For example, it is also applicable to devices such as personal computers (PCs), tablets, smartphones, and televisions.

Control by the electronic binoculars 100 according to the present embodiment is described below with reference to FIGS. 1 to 6. This embodiment assumes a scene in which a user of the electronic binoculars 100 observes a desired subject and controls the electronic binoculars 100 by voice recognition.

FIG. 1 is a flowchart of the voice recognition performed by the electronic binoculars 100. FIG. 2 is an external view of the electronic binoculars 100. FIG. 2A is an external perspective view of the electronic binoculars 100 as seen from the front side. FIG. 2B is an external perspective view of the electronic binoculars 100 as seen from the rear side.

As shown in FIG. 2, the electronic binoculars 100 have a camera 101, a right-eye display 102, a left-eye display 103, a panning section 104, a tilting section 105, a gyro sensor 106, an operation member 107, and a frame 110. The electronic binoculars 100 further have an acceleration sensor 203 and audio input units 301 to 304.

The camera 101 is an imaging device that captures an image of an observation target. As indicated by the arrows in FIG. 2A, the camera 101 can be rotated in the pan and tilt directions by driving the panning section 104 and the tilting section 105, which are driven by built-in actuators. The focal length of the camera 101 can be switched by user operation between two settings, 100 mm and 400 mm (35 mm full-frame equivalent). The user performs operations such as switching the focal length and turning the electronic binoculars on and off via the operation member 107. A gyro sensor 106 is also provided to detect head shake. At a focal length of 400 mm, optical zoom and electronic zoom are used together.

An image captured by the camera 101 (hereinafter, a captured image) is cropped to a display range that matches the angle of view set by the user, generating the display image that the user observes. The generated display image is displayed in real time on the displays 102 and 103, which serve as display units.

The camera 101 has an autofocus function and automatically focuses on an observation target within the observation range. The in-focus subject distance is uniquely determined by the stop position of a focusing lens (not shown) that is driven for focus adjustment. Therefore, if this relationship is held in the system in advance, the subject distance can be detected from the position at which the focusing lens stops after autofocus. That is, the camera 101 also has a function of detecting the subject distance.
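The relationship between the focusing lens stop position and the in-focus distance could be held as a calibration table and interpolated. The following is a minimal sketch under that assumption; the table values, the use of motor steps as the position unit, and the names `FOCUS_TABLE` and `subject_distance_m` are hypothetical and are not taken from the publication.

```python
from bisect import bisect_left

# Hypothetical calibration table: (lens stop position in motor steps,
# in-focus subject distance in metres). Values are illustrative only.
FOCUS_TABLE = [(0, 2.0), (100, 5.0), (200, 10.0), (300, 30.0), (400, 100.0)]


def subject_distance_m(lens_position: float) -> float:
    """Estimate subject distance by linear interpolation over the table."""
    positions = [p for p, _ in FOCUS_TABLE]
    # Clamp positions outside the calibrated range to the nearest entry.
    if lens_position <= positions[0]:
        return FOCUS_TABLE[0][1]
    if lens_position >= positions[-1]:
        return FOCUS_TABLE[-1][1]
    i = bisect_left(positions, lens_position)
    (p0, d0), (p1, d1) = FOCUS_TABLE[i - 1], FOCUS_TABLE[i]
    t = (lens_position - p0) / (p1 - p0)
    return d0 + t * (d1 - d0)
```

A real lens would likely need a denser, non-linear table, but the lookup structure would be the same.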

FIG. 3 is a block diagram of the electronic binoculars 100. The CPU 201 is a processor that controls each part of the electronic binoculars. The CPU 201 controls the camera 101, the right-eye display 102, the left-eye display 103, the gyro sensor 106, the operation member 107, and the camera rotation control mechanism 202. The CPU 201 also controls the acceleration sensor 203 and the audio processing unit 305.

The acceleration sensor 203 detects minute vertical and horizontal displacements of the electronic binoculars 100.

The audio input units 301, 302, 303, and 304 each include a microphone. The audio input units 301 to 304 convert sound into an electrical signal, then into a digital signal, and output it. The audio input units 301 and 302 are arranged substantially in a straight line on the left side of the frame 110 of the electronic binoculars 100. The audio input units 303 and 304 are arranged substantially in a straight line on the right side of the frame 110. Beamforming control is executed by the audio processing unit 305 and the audio memory 306 using the input signals (audio signals) of two audio input units arranged in a straight line.

Beamforming control uses two or more omnidirectional microphones and, without changing their placement, sharpens the directivity along the axis of the two microphones by signal processing, so that sound from a given direction can be emphasized. Two audio input units are arranged side by side in the front-rear direction of the electronic binoculars 100, so beamforming control can emphasize sound arriving from the front of the electronic binoculars 100, that is, from directly in front of the user wearing them (the observer). This makes it possible to emphasize the sound around the subject that the wearer is viewing (capturing).
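The publication does not state which beamforming algorithm is used. As an illustration only, the sketch below uses delay-and-sum, a common technique for a two-microphone array on a front-rear axis. The function name and the integer sample delay are assumptions introduced for this example.

```python
def delay_and_sum(front, rear, delay_samples):
    """Align the rear-microphone signal with the front one and average them.

    Sound from the front reaches the front microphone first and the rear
    microphone `delay_samples` later. Advancing the rear signal by that
    delay makes on-axis sound add coherently, while sound from other
    directions adds less coherently and is attenuated.
    """
    out = []
    for i in range(len(front)):
        j = i + delay_samples  # rear sample matching front sample i
        r = rear[j] if 0 <= j < len(rear) else 0.0
        out.append(0.5 * (front[i] + r))
    return out


# Period-4 test tone sampled at the front microphone.
front = [0, 1, 0, -1, 0, 1, 0, -1]
# On-axis source: the rear microphone hears the same tone one sample later.
rear_on_axis = [0] + front[:-1]
# Source behind the array: the rear microphone hears it one sample earlier.
rear_off_axis = front[1:] + [0]

on_axis = delay_and_sum(front, rear_on_axis, 1)
off_axis = delay_and_sum(front, rear_off_axis, 1)
```

In this toy case the on-axis tone passes through unchanged and the tone from behind cancels, apart from the final sample, where the rear signal has run out.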

Based on voice commands input from the audio input units 301, 302, 303, and 304, the audio processing unit 305 and the audio memory 306 can perform voice recognition control. Voice commands corresponding to the various controls of the imaging apparatus are stored in advance in the audio memory 306. When audio data corresponding to a stored voice command is input from at least one of the audio input units 301 to 304, the audio processing unit 305 outputs the corresponding control command to the CPU 201. Because the CPU 201 executes control according to the control command it receives, control of the electronic binoculars 100 can be instructed by voice in the same way as through the operation member 107.

FIG. 6 is a schematic diagram showing an example data structure of the voice commands stored in the audio memory 306. The audio processing unit 305 executes voice command recognition processing based on the voice commands stored in the audio memory 306.

The audio memory 306 has an audio data buffer area, which is rewritten successively, and a command area, which holds the profiles of predetermined voice commands.

The audio data buffer area successively stores audio data from any of the audio input units 301, 302, 303, and 304. The amount of audio data that can be buffered is determined by the length of the voice commands to be recognized. It is set in advance so that audio data covering the period in which a voice command to be recognized may be input can be stored.
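A successively rewritten buffer sized to the longest command can be sketched as a fixed-capacity ring buffer. The sampling rate, the command duration, and the class name `AudioDataBuffer` below are assumptions for illustration; the publication gives no concrete values.

```python
from collections import deque

SAMPLE_RATE_HZ = 16_000    # assumed sampling rate
LONGEST_COMMAND_S = 2.0    # assumed duration of the longest registered command


class AudioDataBuffer:
    """Fixed-capacity buffer that always holds the most recent samples."""

    def __init__(self, sample_rate_hz=SAMPLE_RATE_HZ,
                 longest_command_s=LONGEST_COMMAND_S):
        # Capacity covers the period in which one command may be spoken.
        self.capacity = int(sample_rate_hz * longest_command_s)
        self._samples = deque(maxlen=self.capacity)

    def push(self, samples):
        """Append new samples; the oldest ones are discarded when full."""
        self._samples.extend(samples)

    def snapshot(self):
        """Return the buffered samples, oldest first."""
        return list(self._samples)
```

With `deque(maxlen=...)`, old samples are dropped automatically, so the recognizer always sees a window of the most recent audio.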

Voice commands include an activation command and operation commands.

The activation command is a voice command that instructs the start of control by voice recognition. Voice recognition processing runs continuously while the voice recognition function is enabled, but the activation command is used by the user to explicitly instruct the start of control by voice recognition. The activation command is preferably a unique command that is a short word or a small number of words and is unlikely to occur in ordinary conversation. For example, a command that addresses the imaging device 100, such as "Hi, Camera", is assumed. The profile data corresponding to this registered word is stored as word profile A.

An operation command is a voice command that instructs execution of control of the electronic binoculars 100 by voice recognition. An operation command consists of a short phrase or a plurality of words corresponding to the control to be instructed. Processing that can be controlled by voice recognition is assumed to include, for example, still image shooting, moving image shooting, and the start of tracking. For example, operation command B-1 is a voice command that instructs still image shooting, and is set to "Snap it". Operation command B-2 is a voice command that instructs the start of moving image shooting, and is set to "Start Movie". A plurality of operation commands (operation commands B-1 to B-N) are registered according to the processing to be controlled.

The activation command and the operation commands are stored separately for convenience, but they may be the same word. The words may also be rewritten according to the destination market or language support, or the user may be allowed to rewrite them freely.

The audio processing unit 305 compares the audio data successively stored in the audio data buffer area with the activation command or the operation commands in the command area, and determines that audio data corresponding to a voice command has been input when the degree of matching with that voice command is equal to or greater than a threshold. The audio processing unit 305 then outputs to the CPU 201 a control command that instructs execution of the control corresponding to that voice command.
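The publication does not specify how the degree of matching is computed. The sketch below assumes, purely for illustration, that both the buffered audio and each command profile have been reduced to feature vectors, and uses cosine similarity as the score. The names `cosine_similarity` and `match_command` and the threshold of 0.9 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_command(features, profiles, threshold=0.9):
    """Return the best-matching command whose score meets the threshold, else None."""
    best_name, best_score = None, threshold
    for name, profile in profiles.items():
        score = cosine_similarity(features, profile)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical profiles keyed by the labels used in the description.
profiles = {"A": [1.0, 0.0, 0.0], "B-1": [0.0, 1.0, 0.0]}
```

A production recognizer would use a proper acoustic model; the point here is only the "score at or above a threshold" decision.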

In the electronic binoculars 100 of this embodiment, voice commands related to beamforming control, such as "Zoom Up" and "Zoom OUT", are held with a larger data size than other commands, which allows them to be recognized with higher accuracy than other commands.

FIG. 1 is a flowchart of the voice recognition control of the electronic binoculars 100 in this embodiment. It shows the processing from controlling the electronic binoculars 100 by a voice command, through starting and ending subject sound enhancement by beamforming control, to returning to normal shooting by another voice command. The control shown in this flowchart is executed repeatedly while the voice recognition function of the electronic binoculars 100 is enabled and the electronic binoculars 100 are powered on.

When the user turns the power on with the operation member 107 to start normal shooting, the camera 101 starts up and the captured image begins to be displayed in real time on the displays 102 and 103. Although the focal length of the camera 101 can be switched between 100 mm and 400 mm, immediately after power-on the wider-angle 100 mm setting is used so that the observation target is easy to find.

When the user finds an observation target, the user switches the focal length to 400 mm via the operation member 107. Autofocus (AF) and automatic exposure control (AE) are then performed on the acquired image, and the user can begin observing the subject through the image captured by the camera 101.

The user can also control the focal length with a voice command instead of the operation member 107. In that case, the beamforming control described above can be used to emphasize the sound around the subject.

In S1001, the audio processing unit 305 determines whether or not a voice corresponding to the activation command has been input from at least one of the audio input units 301 to 304. If it is determined that a voice corresponding to the activation command has been input, the audio processing unit 305 transitions to a standby state for an operation command, and the process proceeds to S1002. If no voice corresponding to the activation command has been input, the process proceeds to S1020.

In S1002, the audio processing unit 305 acquires the frequency characteristic from the input audio data and stores it in the audio memory 306.

In S1003, the audio processing unit 305 determines whether or not a voice corresponding to an operation command has been input from at least one of the audio input units 301 to 304. Specifically, the audio processing unit 305 determines whether the voice input from at least one of the audio input units 301 to 304 matches any of the plurality of operation commands held in advance in the audio memory 306. If a voice matching an operation command has been input, the process proceeds to S1004. If not, the process returns to S1003 and waits for a voice corresponding to an operation command.
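Steps S1001 to S1003 behave like a small state machine: wait for the activation command, then wait for an operation command. The sketch below models only that part of the flow. It assumes recognition has already turned the audio into text, and it omits the frequency characteristic stored in S1002. The names `State` and `step`, and the particular command set, are illustrative.

```python
from enum import Enum, auto


class State(Enum):
    WAIT_ACTIVATION = auto()  # S1001: listening for the activation command
    WAIT_OPERATION = auto()   # S1003: listening for an operation command
    EXECUTE = auto()          # S1004: a control command goes to the CPU


ACTIVATION_COMMAND = "Hi, Camera"
OPERATION_COMMANDS = {"Snap it", "Start Movie", "Zoom UP"}


def step(state, recognized_text):
    """Advance one step; return (next_state, operation command or None)."""
    if state is State.WAIT_ACTIVATION:
        if recognized_text == ACTIVATION_COMMAND:
            return State.WAIT_OPERATION, None
        return State.WAIT_ACTIVATION, None
    if state is State.WAIT_OPERATION:
        if recognized_text in OPERATION_COMMANDS:
            return State.EXECUTE, recognized_text
        return State.WAIT_OPERATION, None  # keep waiting, as in S1003
    return state, None
```

An operation command spoken before the activation command is ignored, which matches the role of the activation command as an explicit start signal.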

In S1004, the audio processing unit 305 notifies the CPU 201 of a control command that instructs execution of the control corresponding to the operation command matched by the input voice. The CPU 201 executes the processing corresponding to the notified control command. Here, assume that a voice corresponding to the operation command "Zoom UP" has been input. The CPU 201 sets the focal length of the camera 101 to 400 mm and, in addition, executes processing that emphasizes sound from a specific direction by beamforming control using the audio input units 301 and 302 and the audio input units 303 and 304. The direction in which beamforming control increases the sensitivity is the shooting direction of the camera 101. That is, when the operation command "Zoom UP" is input, the imaging range of the camera 101 is expanded and the sensitivity of voice input in the shooting direction is increased by beamforming control.

In S1005, the audio processing unit 305 determines whether or not audio data has been input from at least one of the audio input units 301 to 304. If no audio data has been input, the process returns to S1005. If audio data has been input, the process proceeds to S1006.

In S1006, the audio processing unit 305 compares the phase differences of the audio data (input signals) input from the audio input units 301 to 304.

In S1007, the audio processing unit 305 compares the levels (volumes) of the audio data input from the audio input units 301 to 304.

In S1008, it is determined whether the audio data input to the audio input units 301 to 304 are equivalent. For a voice command spoken by the user wearing the device, the phase and level of the audio data input to the four audio input units 301 to 304 are almost the same, so this check is used as a primary determination of whether the command came from the wearer. If the phases and levels of the audio data input to the four audio input units 301 to 304 are substantially the same, the process proceeds to S1009. Otherwise, it is determined that the input is not a voice input by the wearer for voice recognition control, and the process returns to S1005.
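The primary determination in S1006 to S1008 can be sketched as a check that all channels agree in arrival time and in level. The publication does not say how phase or level is measured. The sketch below assumes a cross-correlation lag as a stand-in for the phase difference and an RMS ratio in decibels for the level; the tolerances and function names are hypothetical.

```python
import math


def rms(signal):
    """Root-mean-square level of one channel."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def best_lag(ref, sig, max_lag):
    """Lag in samples at which `sig` correlates best with `ref`."""
    def corr(lag):
        return sum(ref[i] * sig[i + lag]
                   for i in range(len(ref)) if 0 <= i + lag < len(sig))
    return max(range(-max_lag, max_lag + 1), key=corr)


def inputs_equivalent(channels, max_lag=4, lag_tol=1, level_tol_db=3.0):
    """True if every channel matches the first in timing and level."""
    ref = channels[0]
    ref_rms = rms(ref)
    for ch in channels[1:]:
        if abs(best_lag(ref, ch, max_lag)) > lag_tol:
            return False  # arrival times differ too much
        ch_rms = rms(ch)
        if ref_rms == 0 or ch_rms == 0:
            return False  # a silent channel cannot be compared
        if abs(20 * math.log10(ch_rms / ref_rms)) > level_tol_db:
            return False  # levels differ too much
    return True


base = [0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0]   # arrives 3 samples later
quiet = [0.1 * x for x in base]                  # about 20 dB lower
```

Four identical channels pass the check, while a channel that arrives noticeably later or is much quieter causes it to fail.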

In S1009, the frequency characteristic of the audio data input from the audio input units 301 to 304 is compared with the frequency characteristic stored in S1002. This comparison serves as a secondary determination of whether the person who input the operation command is the same wearer who uttered the activation command. S1009 is realized by the audio processing unit 305 processing the audio data from the audio input units 301 to 304 that is temporarily held in the audio memory 306. If the frequency characteristics are equivalent, it is determined that the wearer who uttered the activation command and the person who input the operation command are the same, and the process proceeds to S1010. Otherwise, it is determined that the input is not a voice input by the wearer for voice recognition control, and the process returns to S1005.

In S1010, the CPU 201 determines whether or not the wearer spoke at the time the audio data was input, based on the minute vertical displacement detected by the acceleration sensor 203 attached to the bridge portion of the electronic binoculars 100. If the minute vertical displacement (the detection result) does not satisfy the displacement detection condition predetermined for speech detection, it is determined that the audio data was not input by the wearer's speech, and the process returns to S1005. If the minute vertical displacement satisfies the displacement detection condition, it is determined that the audio data was input by the wearer's speech, and the process proceeds to S1011.
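The displacement detection condition of S1010 is not defined in the publication. As one possible reading, the sketch below treats it as a peak-to-peak threshold on the vertical sensor signal during the window in which the audio arrived. The function name, the window representation, and the threshold value are assumptions.

```python
def wearer_spoke(vertical_signal, window, threshold=0.05):
    """True if the peak-to-peak vertical displacement in the window meets the threshold.

    `vertical_signal` is a list of vertical sensor readings and `window`
    is a (start, end) index pair covering the time the audio was input.
    """
    start, end = window
    segment = vertical_signal[start:end]
    if not segment:
        return False  # no sensor data for this window
    return (max(segment) - min(segment)) >= threshold
```

For example, `wearer_spoke([0.0, 0.01, 0.0, 0.08, -0.02, 0.0], (2, 5))` evaluates the readings at indices 2 to 4, whose peak-to-peak swing exceeds the assumed threshold.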

In S1011, the command corresponding to the input audio data is detected. Performing the processing of S1006 to S1010 makes it possible to execute voice recognition control even during beamforming control, when the directivity of the audio input units 301 to 304 is not directed toward the wearer's mouth.

In S1011, it is determined whether or not the command is related to beamforming control. As shown in FIG. 6, in this embodiment, voice commands related to beamforming, such as "Zoom Up" and "Zoom OUT", are held with a larger data size than other commands.

After beamforming control has started, the user will necessarily perform operations related to beamforming control, such as canceling beamforming or recording the sound from the beamforming target. For this reason, restricting the determination to whether or not the input is a command related to beamforming control allows the device to respond with higher recognition accuracy.

In S1012, the CPU 201 controls the electronic binoculars 100 according to the command detected in S1011. The process then returns to S1005.
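The loop from S1005 to S1012 can be summarized as a chain of gates in front of the recognizer: recognition is attempted only if every check passes. The sketch below shows one pass of that loop with the checks, the recognizer, and the executor supplied as callables. The function name and the handling of an unrecognized input (returning `None`) are assumptions.

```python
def process_input(audio, checks, recognize, execute):
    """Run one pass of the S1005-S1012 loop; return the executed command or None."""
    # S1006-S1010: every check must pass before recognition is attempted.
    for check in checks:
        if not check(audio):
            return None          # not the wearer's command: back to S1005
    command = recognize(audio)   # S1011: detect the command
    if command is None:
        return None              # assumed: nothing recognized, keep waiting
    execute(command)             # S1012: the CPU controls the device
    return command
```

Ordering the cheap signal checks before recognition means sound that merely passes through the beamformer, such as the subject's own voice, never reaches the command matcher.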

Each process is described in detail with reference to FIGS. 4 and 5. FIG. 4 is a block diagram showing the functional blocks of the audio processing unit 305. The audio processing unit 305 has a beamforming control unit 401, a recording processing unit 402, a voice recognition processing unit 403, a phase comparison processing unit 404, a sound pressure comparison processing unit 405, and a specific band detection unit 406.

The voice recognition processing unit 403 uses the audio memory 306 to compare the audio data stored in the audio data buffer with the voice commands held in advance, and thereby performs voice command recognition. The voice recognition processing unit 403 is also configured to receive, as enable signals, the determination signals of the phase comparison processing unit 404, which compares the phases of the inputs from the audio input units 301 to 304, the sound pressure comparison processing unit 405, which compares their sound pressures, and the specific band detection unit 406, which detects a specific band.

Each enable signal becomes valid when the corresponding comparison or detection process detects a predetermined value, as described later, and can thereby correct the operation of the speech recognition processing unit 403. The CPU 201 can also control whether the enable signals are enabled or disabled.
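The gating of the speech recognition processing unit by the enable signals might be modeled as in the following Python sketch. The class and field names are hypothetical. Combining the three determination signals with a logical AND, and treating the CPU control as a single master gate, are assumptions; the description states only that the three signals are valid when a command is recognized and that the CPU 201 can enable or disable them.

```python
from dataclasses import dataclass


@dataclass
class EnableSignals:
    """Determination signals fed to the recognizer (names are illustrative)."""
    phase_ok: bool           # inter-MIC phase difference comparison signal
    level_ok: bool           # inter-MIC sound pressure level comparison signal
    band_ok: bool            # specific band detection signal
    cpu_enable: bool = True  # assumed master gate controlled by the CPU


def recognition_enabled(sig):
    """Run command recognition only when every gate is valid."""
    return sig.cpu_enable and sig.phase_ok and sig.level_ok and sig.band_ok
```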

The recording processing unit 402 can output the sound data of the band emphasized by the beamforming control unit 401 to the audio memory 306.

FIG. 5 is a timing chart of the signals involved in speech recognition control during beamforming control, as shown in S1005 to S1011 of FIG. 1.

When the acceleration displacement detected by the acceleration sensor 203 exceeds a certain value, the CPU 201, triggered by that detection, sends an enable signal to the audio processing unit 305 to enable each process of the audio processing unit 305.
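As a rough illustration of this trigger, the Python sketch below flags a window of accelerometer readings when the swing in acceleration magnitude exceeds a threshold. The function name, the use of a peak-to-peak magnitude swing as the "acceleration displacement", and the threshold value are all assumptions; the description does not say how the displacement is computed.

```python
import math


def acceleration_trigger(samples, threshold):
    """Return True when the acceleration magnitude swings by more than threshold.

    samples: list of (x, y, z) readings from the accelerometer.
    The peak-to-peak swing of the magnitude is used as a stand-in for the
    "acceleration displacement" mentioned in the description (an assumption).
    """
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    if len(magnitudes) < 2:
        return False
    return (max(magnitudes) - min(magnitudes)) > threshold
```

In this model the CPU would send the enable signal to the audio processing unit only for windows in which the function returns True.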

After that, the inputs of the voice input units 301 to 304 are compared by the phase comparison processing unit 404 and the sound pressure comparison processing unit 405. The phase comparison processing unit 404 compares the phases of the audio data input to the voice input units 301 to 304 and determines whether the phases of the data are aligned. A predetermined threshold held in the audio memory 306 is used for this comparison. When the phase differences among the data are within the predetermined threshold, the inter-MIC phase difference comparison signal is made valid.
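One way to picture the phase comparison is to estimate the time lag between microphone channels and check it against a threshold. The Python sketch below does this with a brute-force cross-correlation. The function names, the choice of the first channel as reference, and the expression of the threshold in samples are assumptions made for illustration, not the method stated in the description.

```python
def best_lag(ref, sig, max_lag):
    """Lag (in samples) of sig relative to ref that maximizes cross-correlation.

    Both channels are assumed to have the same length.
    """
    n = len(ref)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(ref[i] * sig[i + lag] for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best, best_score = lag, score
    return best


def phases_aligned(channels, max_lag, threshold):
    """True when every channel lags the first by at most threshold samples."""
    ref = channels[0]
    return all(abs(best_lag(ref, ch, max_lag)) <= threshold for ch in channels[1:])
```

A sound source roughly equidistant from all microphones, such as the wearer's mouth, would give lags near zero, whereas a distant off-axis source would generally produce larger lags on some channels.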

The sound pressure comparison processing unit 405 determines whether the sound pressure levels input to the voice input units 301 to 304 are within a predetermined threshold range. When the sound pressure differences among the data are within the predetermined threshold, the inter-MIC sound pressure level comparison signal is made valid.
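The sound pressure comparison can be sketched in a similar way: compute a level per channel and check that the spread across channels stays within a threshold. In the Python sketch below, the function names, the use of RMS level in decibels, and the threshold unit are assumptions; the description specifies only that the level differences must fall within a predetermined threshold.

```python
import math


def rms_db(samples):
    """RMS level of one channel in dB relative to full scale 1.0."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))  # floor avoids log10(0) on silence


def levels_matched(channels, max_spread_db):
    """True when the loudest and quietest channels differ by at most max_spread_db."""
    levels = [rms_db(ch) for ch in channels]
    return max(levels) - min(levels) <= max_spread_db
```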

As described above, when a voice command is uttered by the user wearing the device, the distance between each voice input unit and the user's mouth is almost the same, so the phase and level of the sound data input to the four voice input units are almost the same. As a result, both the inter-MIC phase difference comparison signal and the inter-MIC sound pressure level comparison signal become valid.

In addition, the specific band detection unit 406 compares the band of the uttered command with frequency characteristics held in advance, and makes a secondary determination as to whether the speaker is the same wearer who uttered the activation command. This band comparison is realized by the audio processing unit 305 processing each piece of audio data that has been temporarily held in the audio memory 306 from the voice input units 301 to 304. When the voice command is determined to be within the predetermined band, the specific band detection signal is made valid.
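A simplified picture of the band check is to find the dominant frequency of the utterance and test whether it lies inside a band stored for the wearer. The Python sketch below uses a naive discrete Fourier transform for this. The function names, the use of a single dominant frequency as the "frequency characteristic", and the band limits are assumptions; an actual implementation would likely compare richer spectral features.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest DFT bin, excluding DC (naive O(n^2) DFT)."""
    n = len(samples)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(samples))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n


def in_wearer_band(samples, sample_rate, band_hz):
    """True when the dominant frequency lies inside the stored (low, high) band."""
    low, high = band_hz
    return low <= dominant_frequency(samples, sample_rate) <= high
```

Here `band_hz` stands in for the frequency characteristics held in advance, for example a range recorded when the wearer spoke the activation command.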

Then, at T601, the determination of whether the utterance is an activation command is completed using the audio data and the three signals described above. When the utterance is determined to be an activation command, the inter-MIC phase difference comparison signal, the inter-MIC sound pressure level comparison signal, and the specific band detection signal are all valid.

The timing chart between T601 and T602 shows voice command detection when a control command is uttered. Because the processing and timing are the same as for the activation command, the description is omitted.

With the control described above, the speech recognition device of this embodiment, and the display device having the speech recognition device, allow the wearer to execute speech recognition control even while audio data from a specific direction is being amplified by beamforming control. In particular, by detecting displacement at the user's mouth, comparing the inter-MIC phase differences, comparing the inter-MIC sound pressure levels, and detecting a specific band, the device can be returned to normal imaging by a voice command.

(Other Embodiments)
The present invention can also be realized by processing in which a program implementing one or more functions of the above-described embodiment is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of that system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

Although preferred embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and various modifications and changes are possible within the scope of its gist.

100 Electronic binoculars
101 Camera
102, 103 Display
104 Panning unit
105 Tilt unit
106 Gyro sensor
107 Operation member
203 Acceleration sensor
301, 302, 303, 304 Voice input unit

Claims

1. A speech recognition device comprising:
control means for executing beamforming control capable of acquiring voice input from a specific direction with higher sensitivity than voice input from other directions, based on a plurality of voice data obtained from a plurality of voice input means; and
voice recognition means for outputting a control command corresponding to a predetermined command when the voice data obtained from the plurality of voice input means correspond to the predetermined command,
wherein, when the beamforming control is in effect and the plurality of voice data input from the plurality of voice input means satisfy a predetermined condition, the voice recognition means determines whether the plurality of voice data correspond to the predetermined command.

2. The speech recognition device according to claim 1, further comprising detection means for detecting displacement of the speech recognition device at the timing when the plurality of voice data are input,
wherein the voice recognition means determines whether the plurality of voice data correspond to the predetermined command when the detection result of the detection means satisfies a predetermined displacement detection condition.

3. The speech recognition device according to claim 1 or 2, wherein the predetermined condition is that the frequency characteristics of the plurality of voice data match the frequency characteristics of voice data of the wearer's utterance acquired before the beamforming control is executed.

4. The speech recognition device according to any one of claims 1 to 3, wherein the predetermined command includes a command instructing activation of the device and a command instructing control of the device.

5. The speech recognition device according to any one of claims 1 to 4, further comprising imaging means for capturing an image,
wherein the control means executes the beamforming control so that sound input from the direction in which the imaging means captures the image is acquired with higher sensitivity than sound input from other directions.

6. The speech recognition device according to claim 1, wherein the control means executes the beamforming control in response to the voice recognition means detecting input of voice data corresponding to a command for expanding the imaging range of the imaging means while the beamforming control is not being executed.

7. The speech recognition device according to any one of claims 1 to 6, wherein, while the beamforming control is being executed, the voice recognition means does not determine whether the plurality of voice data correspond to the predetermined command if the plurality of voice data, even when input, do not satisfy the predetermined condition.

8. A display device comprising:
imaging means for capturing an image;
display means for displaying the image;
a plurality of voice input means;
control means for executing beamforming control capable of acquiring voice input from the imaging direction with higher sensitivity than voice input from other directions, based on a plurality of voice data obtained from the plurality of voice input means; and
voice recognition means for outputting a control command corresponding to a predetermined command when the voice data obtained from the plurality of voice input means correspond to the predetermined command,
wherein, when the beamforming control is in effect and the plurality of voice data input from the plurality of voice input means satisfy a predetermined condition, the voice recognition means determines whether the plurality of voice data correspond to the predetermined command.

9. A control method for a speech recognition device, comprising:
a control step of executing beamforming control capable of acquiring voice input from a specific direction with higher sensitivity than voice input from other directions, based on a plurality of voice data obtained from a plurality of voice input means; and
a speech recognition step of outputting a control command corresponding to a predetermined command when the voice data obtained from the plurality of voice input means correspond to the predetermined command,
wherein, in the speech recognition step, when the beamforming control is in effect and the plurality of voice data input from the plurality of voice input means satisfy a predetermined condition, it is determined whether the plurality of voice data correspond to the predetermined command.

10. A program for causing a computer to function as each means of the speech recognition device according to any one of claims 1 to 7.

11. A computer-readable storage medium storing a program for causing a computer to function as each means of the speech recognition device according to any one of claims 1 to 7.