JP2008271157A - Sound enhancement device and control program

Info

Publication number: JP2008271157A
Application number: JP2007111066A
Authority: JP
Inventors: Shoji Sakamoto; 彰司坂本
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2007-04-19
Filing date: 2007-04-19
Publication date: 2008-11-06

Abstract

PROBLEM TO BE SOLVED: To provide a sound enhancement device, and a control program, capable of enhancing a target sound even in an environment surrounded by noise.

SOLUTION: A video camera 2 captures an image that includes an object serving as a sound source. A microphone array 3 has a plurality of microphone elements and receives the sound emitted by the object. An operation unit 13 inputs a position indicated by an operator. A display unit 15 displays the image captured by the video camera 2 and, superimposed on that image, the position input with a mouse cursor 18 of the operation unit 13. A control unit 11 converts the position on the image indicated by the mouse cursor 18 into a direction relative to the microphone array 3 and sets the superdirectivity of the microphone array 3 in the converted direction.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a sound enhancement device and a control program that use an imaging device and a sound input device.

Conventionally, a directivity setting device is known that reliably sets the directivity of a microphone array toward the direction of arrival of a target sound through speech recognition of a single keyword (see, for example, Patent Document 1). In this directivity setting device, sound source localization is performed based on human speech.
Patent Document 1: JP 2004-109361 A

However, when the sound to be enhanced is difficult to distinguish from ambient noise, such as the sound emitted by a machine, localizing the sound source from the audio signal alone, as the above directivity setting device does, is not always effective. Consequently, when the abnormal sound of a machine operating in a factory is to be enhanced and collected, forming superdirectivity on the premise of sound source localization, as in conventional microphone array control, does not work.

The present invention has been made in view of these circumstances, and its object is to provide a sound enhancement device and a control program capable of enhancing a target sound even in an environment surrounded by noise.

To achieve the above object, the sound enhancement device of claim 1 comprises: photographing means for capturing an image including an object serving as a sound source; sound input means having a plurality of sound collecting elements, for inputting the sound emitted by the object; position input means for inputting a position indicated by an operator; image display means for displaying the image captured by the photographing means and, superimposed on that image, the position input by the position input means; conversion means for converting the position on the image indicated by the position input means into a direction relative to the sound input means; and setting means for setting the superdirectivity of the sound input means in the converted direction.

The sound enhancement device of claim 2 is the sound enhancement device according to claim 1, wherein the setting means sets the superdirectivity of the sound input means in the converted direction by calculating, for each of the other sound collecting elements, the delay of its sound relative to the sound of one sound collecting element, adding the calculated delay to the sound of each of the other sound collecting elements, and subtracting the sound of the one sound collecting element from each delayed sound, thereby extracting and removing noise.

The sound enhancement device of claim 3 is the sound enhancement device according to claim 1 or 2, wherein the photographing means and the sound input means are connected to the sound enhancement device via a communication network.

The sound enhancement device of claim 4 is the sound enhancement device according to any one of claims 1 to 3, further comprising sound output means for outputting the sound corresponding to the set superdirectivity of the sound input means.

The sound enhancement device of claim 5 is the sound enhancement device according to claim 1 or 2, further comprising storage means for storing the image captured by the photographing means and the sound input by the sound input means, and reproduction instruction means for instructing reproduction of the image and sound stored in the storage means, wherein, when the reproduction instruction means instructs reproduction of the stored image and the stored sound, the image display means displays the stored image and, superimposed on that image, the position input by the position input means, and the setting means sets the superdirectivity of the sound input means in the converted direction for the stored sound.

The sound enhancement device of claim 6 is the sound enhancement device according to claim 5, further comprising sound output means for outputting the sound corresponding to the set superdirectivity of the sound input means.

The control program of claim 7 causes a computer, which is connected to photographing means for capturing an image including an object serving as a sound source and to sound input means having a plurality of sound collecting elements for inputting the sound emitted by the object, to function as: position input means for inputting a position indicated by an operator; image display means for displaying the image captured by the photographing means and, superimposed on that image, the position input by the position input means; conversion means for converting the position on the image indicated by the position input means into a direction relative to the sound input means; and setting means for setting the superdirectivity of the sound input means in the converted direction.

According to the inventions of claims 1 and 7, the target sound corresponding to the position indicated by the operator can be enhanced even in an environment surrounded by noise.

According to the invention of claim 2, the superdirectivity of the sound input means can be set with high accuracy.

According to the invention of claim 3, the target sound corresponding to the position indicated by the operator can be enhanced even for sound input via a communication network.

According to the invention of claim 4, the target sound corresponding to the position indicated by the operator can be enhanced and output.

According to the invention of claim 5, the target sound corresponding to the position indicated by the operator can be enhanced within the sound stored in the storage means.

According to the invention of claim 6, the target sound corresponding to the position indicated by the operator can be enhanced within the sound stored in the storage means and then output.

Embodiments of the present invention will be described below with reference to the drawings.

(First embodiment)
FIG. 1 is a diagram showing the configuration of the sound enhancement device according to the first embodiment of the present invention.

The sound enhancement device according to the embodiment of the present invention includes a personal computer (PC) 1, a video camera 2 (photographing means) that captures an object 9, a microphone array 3 (sound input means) that acquires the sound emitted by the object 9, and a speaker 4 (sound output means) that outputs sound. The PC 1 includes a reproduction unit 10 (sound output means) that reproduces sound; a control unit 11 (conversion means, setting means) that controls the entire device; a transmission/reception unit 12 that receives video from the video camera 2 and sound from the microphone array 3 and transmits to the microphone array 3 a control command for controlling its superdirectivity; an operation unit 13 (position input means, reproduction instruction means) consisting of a mouse, a keyboard, and the like; a storage unit 14 (storage means) that stores the control program, data, information, and the like; and a display unit 15 (image display means) that displays a display area 17 and a user interface (UI) 16. The microphone array 3 includes a plurality of microphone elements.

The control unit 11 is connected to the reproduction unit 10, the transmission/reception unit 12, the operation unit 13, the storage unit 14, and the display unit 15, and is further connected to the video camera 2 and the microphone array 3 via the transmission/reception unit 12. The PC 1 may be configured as an integrated personal computer that includes the display unit 15.

The display area 17 shows the image captured by the video camera 2. The display area 17 also shows a mouse cursor 18 that moves according to operation instructions from the operation unit 13. When a position is designated with the mouse cursor 18, the control unit 11 transmits to the microphone array 3 a control command that sets the superdirectivity of the microphone array 3 toward the designated position, as described later.

FIG. 2 is a block diagram showing the hardware configuration of the PC 1.

The PC 1 includes a CPU 21 that controls the entire device, a ROM 22 that holds the control program, a RAM 23 that functions as a working area, a hard disk drive (HDD) 24 that holds various information and programs, a mouse and keyboard 25, a network interface 26 for connecting to other computers, a display 27 consisting of a liquid crystal monitor or a CRT, and a USB (universal serial bus) interface 28 for connecting to the video camera 2 and the microphone array 3. The CPU 21 is connected via a system bus 29 to the ROM 22, the RAM 23, the hard disk drive (HDD) 24, the mouse and keyboard 25, the network interface 26, the display 27, and the USB interface 28.

The control unit 11 corresponds to the CPU 21, which executes various processes according to the control program. The transmission/reception unit 12 corresponds to the network interface 26 and the USB interface 28. The operation unit 13 corresponds to the mouse and keyboard 25, and the storage unit 14 corresponds to the hard disk drive (HDD) 24. The display unit 15 corresponds to the display 27.

FIG. 3 is a flowchart showing the processing executed by the PC 1 of FIG. 1.

First, the control unit 11 receives a captured image from the video camera 2 via the transmission/reception unit 12 and displays it in the display area 17 of the display unit 15 (step S1). At this time, the mouse cursor 18 is also displayed in the display area 17, superimposed on the captured image.

Next, the control unit 11 receives sound from the microphone array 3 via the transmission/reception unit 12 (step S2).

The control unit 11 then determines whether a position has been designated with the mouse cursor 18 (step S3). Specifically, the control unit 11 determines whether a position designation command, such as a double click, has been input from the operation unit 13.

If the result of the determination in step S3 is NO, the process ends. If it is YES, the control unit 11 calculates in which direction the position designated with the mouse cursor 18 lies as seen from the microphone array 3 (step S4). More specifically, the control unit 11 performs a calculation that converts the designated position on the image into a direction relative to the microphone array 3.
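
The text does not say how the conversion in step S4 is computed. As a minimal sketch only, the following assumes an ideal pinhole camera whose optical axis coincides with the broadside direction of the microphone array and whose horizontal and vertical fields of view are known. The function name and all parameters are hypothetical and do not come from the source.

```python
import math

def pixel_to_direction(x, y, width, height, hfov_deg, vfov_deg):
    """Convert an image position to (horizontal, vertical) angles in degrees.

    Assumption (not stated in the source): an ideal pinhole camera whose
    optical axis is the broadside direction of the microphone array.
    """
    # Focal lengths in pixels, derived from the assumed fields of view.
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    # Offset of the designated pixel from the image centre.
    dx = x - width / 2
    dy = height / 2 - y  # image y grows downward
    horizontal = math.degrees(math.atan2(dx, fx))
    vertical = math.degrees(math.atan2(dy, fy))
    return horizontal, vertical
```

Under these assumptions, a position at the image centre maps to the array's broadside direction, and a position on the right edge maps to half the horizontal field of view. A real installation would also need to account for any offset between the camera and the array.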

The control unit 11 calculates, for each microphone element, the sound delay amount needed to set the superdirectivity of the microphone array 3 in the calculated direction (step S5).
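
The array geometry and the delay formula are not given in the text. The sketch below illustrates step S5 under assumed conditions: a uniform linear array, a far-field (plane wave) source, and a sign convention in which a positive angle places the source on the side of element 0. The function name, the parameters, and the constant are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at about 20 degrees C

def steering_delays(num_elements, spacing_m, angle_deg, sample_rate_hz):
    """Return per-element delays, in whole samples, that align a plane wave.

    Assumptions (not from the source): uniform linear array, far-field
    source, angle measured from broadside, and a positive angle meaning
    the wavefront reaches element 0 first.
    """
    theta = math.radians(angle_deg)
    # Arrival time at element i relative to element 0.
    arrival = [i * spacing_m * math.sin(theta) / SPEED_OF_SOUND
               for i in range(num_elements)]
    # Delay the early channels so that every channel lines up with the latest.
    latest = max(arrival)
    return [round((latest - t) * sample_rate_hz) for t in arrival]
```

For a source at broadside (0 degrees) every delay is zero. Rounding to whole samples keeps the sketch simple; finer steering would need fractional-delay filtering.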

The control unit 11 transmits a control command for setting the superdirectivity of the microphone array 3 in the calculated direction, together with the delay amounts, to the microphone array 3 via the transmission/reception unit 12 (step S6). The control unit 11 thereby sets the superdirectivity of the microphone array 3 toward the designated position.

For example, when there are two microphone elements A and B, the control unit 11 sets the delay of the sound of microphone element B relative to the sound of microphone element A. The control unit 11 then adds the set delay to the sound acquired by microphone element B so that the sounds of microphone elements A and B are in phase. Subtracting the sound of microphone element A from the sound of microphone element B then extracts the noise component, which is removed to obtain the desired sound (that is, the sound corresponding to the designated position).
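
The two-element case can be illustrated with sample lists and an integer sample delay. This is a sketch, not the patented implementation, and the helper names are hypothetical. The text does not detail how the extracted noise is then removed, so the sketch stops at the aligned average and the noise reference.

```python
def delay(signal, n):
    """Delay a list of samples by n >= 0 samples, zero-padding the start."""
    return [0.0] * n + list(signal[:len(signal) - n])

def align_and_subtract(mic_a, mic_b, delay_b):
    """Delay microphone B by delay_b samples, then form sum and difference.

    After alignment the target is in phase on both channels, so the
    difference cancels it and leaves a noise reference, while the
    average keeps the target.
    """
    b_aligned = delay(mic_b, delay_b)
    noise_ref = [b - a for a, b in zip(mic_a, b_aligned)]
    target_avg = [(a + b) / 2 for a, b in zip(mic_a, b_aligned)]
    return target_avg, noise_ref

# A target that reaches microphone B two samples before microphone A.
mic_a = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
mic_b = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
target_avg, noise_ref = align_and_subtract(mic_a, mic_b, 2)
```

With no noise present, `noise_ref` is all zeros and `target_avg` equals `mic_a`. Any sound arriving from another direction would not cancel and would appear in `noise_ref`.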

The control unit 11 collects, via the microphone array 3 and the transmission/reception unit 12, the enhanced sound from the object 9 that corresponds to the designated position, stores it in the storage unit 14 (step S7), and ends the process. The sound stored in the storage unit 14 is read out by the control unit 11 as needed and output via the reproduction unit 10 and the speaker 4. The target sound corresponding to the position indicated by the operator can thus be enhanced and output.

As described above, according to the present embodiment, the video camera 2 captures an image including an object serving as a sound source; the microphone array 3 has a plurality of microphone elements and inputs the sound emitted by the object; the operation unit 13 inputs a position indicated by the operator; the display unit 15 displays the image captured by the video camera 2 and, superimposed on that image, the position input with the mouse cursor 18 of the operation unit 13; and the control unit 11 converts the position on the image indicated by the mouse cursor 18 into a direction relative to the microphone array 3 and sets the superdirectivity of the microphone array 3 in the converted direction. The target sound corresponding to the position indicated by the operator can therefore be enhanced even in an environment surrounded by noise.

The control unit 11 sets the superdirectivity of the microphone array 3 in the converted direction by calculating, for each of the other microphone elements, the delay of its sound relative to the sound of one microphone element, adding the calculated delay to the sound of each of the other microphone elements, and subtracting the sound of the one microphone element from each delayed sound, thereby extracting and removing noise. The superdirectivity of the microphone array 3 can therefore be set with high accuracy.

(Second embodiment)
The present embodiment differs from the first embodiment in that the video camera 2 and the microphone array 3 are connected to another PC, and the captured image from the video camera 2 and the sound from the microphone array 3 are received via a network.

FIG. 4 is a diagram showing the configuration of the sound enhancement device according to the second embodiment.

As shown in the figure, the PC 1 is connected to a PC 5 via a network 6 (communication network). The video camera 2 and the microphone array 3 are connected to the PC 5. The image captured by the video camera 2 and the sound from the microphone array 3 are transmitted to the PC 1 via the PC 5 and the network 6.

FIG. 5 is a flowchart showing the processing executed by the PC 1. Steps identical to those in FIG. 3 are given the same step numbers.

First, the control unit 11 receives a captured image from the video camera 2 via the PC 5, the network 6, and the transmission/reception unit 12, and displays it in the display area 17 of the display unit 15 (step S11). At this time, the mouse cursor 18 is also displayed in the display area 17, superimposed on the captured image.

Next, the control unit 11 receives sound from the microphone array 3 via the PC 5, the network 6, and the transmission/reception unit 12 (step S12).

The control unit 11 transmits a control command for setting the superdirectivity of the microphone array 3 in the calculated direction, together with the delay amounts, to the microphone array 3 via the transmission/reception unit 12, the network 6, and the PC 5 (step S13). The control unit 11 thereby sets the superdirectivity of the microphone array 3 toward the designated position.
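
The format of the control command sent over the network is not described. Purely as an illustration, the command and delay amounts could be serialized as a small JSON message. The message layout, field names, and function names below are all hypothetical.

```python
import json

def build_steering_command(angle_deg, delays_samples):
    """Serialize a steering command as JSON bytes (hypothetical format)."""
    message = {
        "command": "set_directivity",  # hypothetical command name
        "angle_deg": angle_deg,
        "delays_samples": list(delays_samples),
    }
    return json.dumps(message).encode("utf-8")

def parse_steering_command(payload):
    """Decode the message, as a relaying PC might do before applying it."""
    message = json.loads(payload.decode("utf-8"))
    if message.get("command") != "set_directivity":
        raise ValueError("unexpected command")
    return message["angle_deg"], message["delays_samples"]
```

A text-based message like this is easy to inspect while debugging; a deployed system might prefer a more compact binary encoding.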

The control unit 11 collects, via the microphone array 3, the PC 5, the network 6, and the transmission/reception unit 12, the enhanced sound from the object 9 that corresponds to the designated position, outputs it via the reproduction unit 10 and the speaker 4 (step S14), and ends the process. The sound enhanced in step S14 may also be stored in the storage unit 14.

As described above, according to the present embodiment, the video camera 2 and the microphone array 3 are connected to the PC 1 via the network 6, so the target sound corresponding to the position indicated by the operator can be enhanced even for sound input in real time via the network.

In addition, because the control unit 11 collects, via the microphone array 3, the PC 5, the network 6, and the transmission/reception unit 12, the enhanced sound from the object 9 that corresponds to the designated position and outputs it via the reproduction unit 10 and the speaker 4, the target sound corresponding to the position indicated by the operator can be enhanced and output.

(Third embodiment)
The present embodiment differs from the first embodiment in that the target sound corresponding to the position indicated by the operator is enhanced within the sound stored in the storage unit 14.

The configuration of the sound enhancement device in the present embodiment is the same as that of the sound enhancement device of FIG. 1, so its description is omitted.

FIG. 6 is a flowchart showing the processing executed by the PC 1. Steps identical to those in FIG. 3 are given the same step numbers.

First, the control unit 11 receives a captured image from the video camera 2 and sound from the microphone array 3 via the transmission/reception unit 12, and stores the captured image and the sound in the storage unit 14 in association with each other (step S21). At this time, the operator gives the captured image and the sound a file name or the like.

The control unit 11 determines whether an instruction to reproduce the captured image and sound stored in the storage unit 14 has been input via the operation unit 13 (step S22).

If the result in step S22 is NO, the process ends. If it is YES, the control unit 11 reads the captured image corresponding to the reproduction instruction from the storage unit 14 and displays it in the display area 17 of the display unit 15 (step S23). At this time, the mouse cursor 18 is also displayed in the display area 17, superimposed on the captured image.

Next, the control unit 11 outputs the sound corresponding to the reproduction instruction via the reproduction unit 10 and the speaker 4 (step S24).

The control unit 11 applies the control command for setting the superdirectivity of the microphone array 3 in the calculated direction, together with the delay amounts, to the sound being reproduced (step S25). The control unit 11 thereby sets the superdirectivity of the microphone array 3 toward the designated position.
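
Applying delay amounts to stored sound presupposes that the recording keeps one channel per microphone element, which the text does not state explicitly. Under that assumption, the sketch below steers a stored multichannel recording after the fact. For simplicity it uses plain delay-and-sum rather than the delay-and-subtract scheme described for the first embodiment, and the function name is hypothetical.

```python
def delay_and_sum(channels, delays):
    """Apply a per-channel sample delay to stored audio and average it.

    channels: list of equal-length sample lists, one per microphone element.
    delays:   non-negative integer delay, in samples, for each channel.
    """
    length = len(channels[0])
    output = [0.0] * length
    for samples, d in zip(channels, delays):
        # Shift this channel right by d samples and accumulate it.
        for n in range(d, length):
            output[n] += samples[n - d]
    return [value / len(channels) for value in output]
```

Because the recording is unchanged, the same stored channels can be processed again with a different set of delays each time the operator designates a new position.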

The control unit 11 collects the enhanced sound from the object 9 that corresponds to the designated position, outputs it via the reproduction unit 10 and the speaker 4 (step S26), and ends the process. The sound enhanced in step S26 may also be stored in the storage unit 14.

As described above, according to the present embodiment, the storage unit 14 stores the image captured by the video camera 2 and the sound input by the microphone array 3, and the operation unit 13 inputs an instruction to reproduce the image and sound stored in the storage unit 14. When reproduction of the stored image and sound is instructed, the display unit 15 displays the stored image and, superimposed on that image, the position input with the mouse cursor 18 of the operation unit 13, and the control unit 11 sets the superdirectivity of the microphone array 3 in the converted (calculated) direction for the stored sound. The target sound corresponding to the position indicated by the operator can therefore be enhanced within the sound stored in the storage unit 14.

In addition, because the speaker 4 and the reproduction unit 10 output the sound corresponding to the set superdirectivity of the microphone array 3, the target sound corresponding to the position indicated by the operator can be enhanced within the sound stored in the storage unit 14 and then output.

The same effects as in the above embodiments are also obtained when a storage medium on which a software program for realizing the functions of the PC 1 is recorded is supplied to the PC 1 and the CPU of the PC 1 reads and executes the program stored on that medium. Examples of storage media for supplying the program include a CD-ROM, a DVD, and an SD card.

The same effects as in the above embodiments are also obtained when the CPU of the PC 1 executes a software program for realizing the functions of the PC 1.

The present invention is not limited to the embodiments described above and can be implemented with various modifications without departing from its gist.

FIG. 1 is a diagram showing the configuration of the sound enhancement device according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the PC 1.
FIG. 3 is a flowchart showing the processing executed by the PC 1 of FIG. 1.
FIG. 4 is a diagram showing the configuration of the sound enhancement device according to the second embodiment.
FIG. 5 is a flowchart showing the processing executed by the PC 1 according to the second embodiment.
FIG. 6 is a flowchart showing the processing executed by the PC 1 according to the third embodiment.

Explanation of symbols

1 PC
2 Video camera (photographing means)
3 Microphone array (sound input means)
4 Speaker (sound output means)
9 Object
10 Reproduction unit (sound output means)
11 Control unit (conversion means, setting means)
12 Transmission/reception unit
13 Operation unit (position input means, reproduction instruction means)
14 Storage unit (storage means)
15 Display unit (image display means)

Claims

1. A sound enhancement device comprising:
photographing means for capturing an image including an object serving as a sound source;
sound input means having a plurality of sound collecting elements, for inputting the sound emitted by the object;
position input means for inputting a position indicated by an operator;
image display means for displaying the image captured by the photographing means and, superimposed on that image, the position input by the position input means;
conversion means for converting the position on the image indicated by the position input means into a direction relative to the sound input means; and
setting means for setting the superdirectivity of the sound input means in the converted direction.

2. The sound enhancement device according to claim 1, wherein the setting means sets the superdirectivity of the sound input means in the converted direction by calculating, for each of the other sound collecting elements, the delay of its sound relative to the sound of one sound collecting element, adding the calculated delay to the sound of each of the other sound collecting elements, and subtracting the sound of the one sound collecting element from each delayed sound, thereby extracting and removing noise.

3. The sound enhancement device according to claim 1 or 2, wherein the photographing means and the sound input means are connected to the sound enhancement device via a communication network.

4. The sound enhancement device according to any one of claims 1 to 3, further comprising sound output means for outputting the sound corresponding to the set superdirectivity of the sound input means.

5. The sound enhancement device according to claim 1 or 2, further comprising:
storage means for storing the image captured by the photographing means and the sound input by the sound input means; and
reproduction instruction means for instructing reproduction of the image and sound stored in the storage means,
wherein, when the reproduction instruction means instructs reproduction of the stored image and the stored sound, the image display means displays the stored image and, superimposed on that image, the position input by the position input means, and the setting means sets the superdirectivity of the sound input means in the converted direction for the stored sound.

6. The sound enhancement device according to claim 5, further comprising sound output means for outputting the sound corresponding to the set superdirectivity of the sound input means.

7. A control program causing a computer, which is connected to photographing means for capturing an image including an object serving as a sound source and to sound input means having a plurality of sound collecting elements for inputting the sound emitted by the object, to function as:
position input means for inputting a position indicated by an operator;
image display means for displaying the image captured by the photographing means and, superimposed on that image, the position input by the position input means;
conversion means for converting the position on the image indicated by the position input means into a direction relative to the sound input means; and
setting means for setting the superdirectivity of the sound input means in the converted direction.