JP2020178150A - Voice processing device and voice processing method

Info

Publication number: JP2020178150A
Application number: JP2019076861A
Authority: JP
Inventors: Takeshi Nakazawa (中澤 剛); Keiko Hirukawa (蛭川 慶子); Yosuke Osaki (大崎 洋介)
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2019-04-15
Filing date: 2019-04-15
Publication date: 2020-10-29

Abstract

To appropriately acquire a speaker's voice in a voice processing device used by a plurality of users. SOLUTION: A voice processing device includes a voice reception unit that receives voice collected by a microphone, an image acquisition unit that acquires an image captured by an imaging unit, a detection processing unit that detects a plurality of persons and the respective positions of the plurality of persons from the captured image acquired by the image acquisition unit, and a directivity adjustment unit that sets the directivity of the microphone with respect to each of the positions of the plurality of persons on the basis of the respective positions detected by the detection processing unit. The voice reception unit receives the voice on the basis of the directivity set by the directivity adjustment unit. SELECTED DRAWING: Figure 8

Description

The present invention relates to a voice processing device and a voice processing method.

Voice processing devices are known that can acquire a speaker's voice and store the voice data or transmit it to another information processing device. This type of voice processing device can be used, for example, for a conference in which a plurality of users participate, or for a remote conference in which a plurality of users participate with remote locations connected over a network.

For example, Patent Document 1 discloses a technique in which an estimation unit of a mobile terminal estimates the position of a person being photographed relative to the mobile terminal, based on the position of that person in moving image data captured by a camera and on parameter information used by the camera for capturing, and an adjustment unit adjusts the directivity of a microphone toward the relative position.

Japanese Unexamined Patent Publication No. 2011-41096

However, the conventional technique adjusts the directivity for a single user of the mobile terminal. Therefore, when a plurality of users are present around the voice processing device, as in a conference, it is difficult to reliably identify the speaker among them. Further, when a plurality of users speak at the same time, it is difficult to appropriately adjust the directivity of the microphone. Thus, when a voice processing device is used by a plurality of users, it is difficult to appropriately acquire the voice of the speaker.

An object of the present invention is to appropriately acquire the voice of a speaker in a voice processing device used by a plurality of users.

A voice processing device according to one aspect of the present invention includes: a voice reception unit that receives voice collected by a microphone; an image acquisition unit that acquires a captured image captured by an imaging unit; a detection processing unit that detects a plurality of persons and the respective positions of the plurality of persons from the captured image acquired by the image acquisition unit; and a directivity adjustment unit that sets the directivity of the microphone with respect to each of the positions of the plurality of persons, based on the respective positions detected by the detection processing unit. The voice reception unit receives the voice based on the directivity set by the directivity adjustment unit.

A voice processing method according to another aspect of the present invention includes: a voice reception step of receiving voice collected by a microphone; an image acquisition step of acquiring a captured image captured by an imaging unit; a detection step of detecting a plurality of persons and the respective positions of the plurality of persons from the captured image acquired in the image acquisition step; and a directivity adjustment step of setting the directivity of the microphone with respect to each of the positions of the plurality of persons, based on the respective positions detected in the detection step. In the voice reception step, the voice is received based on the directivity set in the directivity adjustment step.

According to the present invention, it is possible to appropriately acquire the voice of a speaker in a voice processing device used by a plurality of users.

FIG. 1 is a diagram schematically showing a conference to which a voice processing device according to an embodiment of the present invention is applied.
FIG. 2 is a functional block diagram showing the configuration of the voice processing device according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of parameter information used in the voice processing device according to the embodiment of the present invention.
FIG. 4A is a graph showing a setting example of directivity parameters set in the voice processing device according to the embodiment of the present invention.
FIG. 4B is a graph showing a setting example of directivity parameters set in the voice processing device according to the embodiment of the present invention.
FIG. 4C is a graph showing a setting example of directivity parameters set in the voice processing device according to the embodiment of the present invention.
FIG. 4D is a graph showing a setting example of directivity parameters set in the voice processing device according to the embodiment of the present invention.
FIG. 5 is a graph showing another example of parameter information used in the voice processing device according to the embodiment of the present invention.
FIG. 6 is a graph showing another example of parameter information used in the voice processing device according to the embodiment of the present invention.
FIG. 7 is a flowchart for explaining an example of the procedure of the initial setting process of the voice processing in the voice processing device according to the embodiment of the present invention.
FIG. 8 is a flowchart for explaining an example of the procedure of the voice input process of the voice processing in the voice processing device according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. The following embodiment is one example embodying the present invention and is not intended to limit the technical scope of the present invention.

The voice processing device according to the present invention is installed, for example, in a conference room of an office, and is applied to a conference in which a plurality of users participate or to a remote conference in which a plurality of users participate with remote locations connected over a network. FIG. 1 schematically shows an example of a conference room in which the conference is held. The conference room shown in FIG. 1 contains a voice processing device 1 placed on a table, users A to D, who are four conference participants seated around the table, a vacant chair 2, and a board 3 such as an electronic board, a whiteboard, or a blackboard.

Here, for example, the direction (angle) in which user D is located with respect to the voice processing device 1 is taken as the reference (0 degrees). In this case, for example, user C is located at 30 degrees, user B at 60 degrees, and user A at 120 degrees with respect to the voice processing device 1. The chair 2 is located at 160 degrees and the board 3 at 250 degrees with respect to the voice processing device 1. Further, in FIG. 1, the distance from the voice processing device 1 to user D is Dd, the distance to user C is Dc, the distance to user B is Db, the distance to user A is Da, the distance to the chair 2 is Dx, and the distance to the board 3 is Dy.

In the conference shown in FIG. 1, the voice processing device 1 collects, for example, the voice uttered by users A to D with a microphone and stores the voice data of the collected voice in the storage unit 12. The stored voice data is saved, for example, as minutes data of the conference. When the conference is a remote conference connected over a network to another conference room at a remote location, the voice processing device 1 can also transmit the voice data to a voice processing device 1 located in the other conference room and receive voice data of voice uttered in the other conference room.

The voice processing device 1 may also have a function of executing various commands based on a user's instruction. In this case, the voice processing device 1 transmits a command voice corresponding to the user's instruction to a cloud server (not shown), acquires from the cloud server a response (command response) corresponding to the command executed on the cloud server, and outputs the command response from the speaker of the voice processing device 1.

[Voice processing device 1]
As shown in FIG. 2, the voice processing device 1 includes a control unit 11, a storage unit 12, a camera 13, a microphone 14, a speaker 15, a communication interface 16, and the like. The voice processing device 1 may be, for example, a display device or an information processing device such as a personal computer. The voice processing device 1 is an example of the voice processing device of the present invention. The voice processing device of the present invention may be a server in which the camera 13, the microphone 14, and the speaker 15 are omitted.

The communication interface 16 connects the voice processing device 1 to a network by wire or wirelessly, and performs data communication according to a predetermined communication protocol with other external devices (for example, another voice processing device 1) via the network.

The speaker 15 outputs the voice acquired by the voice processing device 1 to the outside. The voice processing device of the present invention does not have to include the speaker 15.

The microphone 14 collects voice around the voice processing device 1. The microphone 14 can receive voice over a 360-degree range around the voice processing device 1. The microphone 14 also supports a function (beamforming) that increases directivity toward the sound source direction when collecting voice, and collects voice based on the setting value of the directivity parameter set by the control unit 11.

The camera 13 is a digital camera that captures an image of a subject and outputs it as digital image data. For example, the camera 13 is provided on the upper surface of the voice processing device 1 and can capture a 360-degree range around the voice processing device 1. In the example shown in FIG. 1, the camera 13 can capture the entire interior of the conference room. The camera 13 is an example of the imaging unit of the present invention.

The storage unit 12 is a non-volatile storage unit including a semiconductor memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like that stores various types of information. For example, the storage unit 12 stores control programs such as a voice processing program for causing the control unit 11 to execute the voice processing described later (see FIGS. 7 and 8). For example, the voice processing program is recorded non-transitorily on a computer-readable recording medium such as a USB, CD, or DVD (all registered trademarks), is read by a reading device (not shown) such as a USB drive, CD drive, or DVD drive electrically connected to the voice processing device 1, and is stored in the storage unit 12. The voice processing program may instead be downloaded from an external device via a network and stored in the storage unit 12.

The storage unit 12 also stores image information 121 and parameter information 122. The image information 121 stores image data captured by the camera 13. In addition, the voice collected by the microphone 14 is stored in the storage unit 12 according to instructions from the control unit 11.

FIG. 3 shows an example of the parameter information 122. In the parameter information 122, information such as an angle, a distance, a directivity parameter, and a gain parameter is registered for each detection target included in the captured image. In the example shown in FIG. 1, the "detection targets" are the persons (users A to D), the chair 2, and the board 3. The detection targets may be stored in the storage unit in advance. Other examples of detection targets include a table in the conference room, a wall of the conference room, and a display panel. The "angle" is the angle, relative to the reference (0 degrees), of the direction from the voice processing device 1 toward the position of the detection target. The "distance" is the distance from the voice processing device 1 (for example, the microphone 14) to the position of the detection target. The detection target, the angle, and the distance are detected by the control unit 11 (target detection unit 112).
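The layout of the parameter information 122 can be illustrated with a minimal Python sketch. The class name, field names, and all numeric distances are hypothetical; the source gives only the symbols Dd, Dc, Db, Da, Dx, and Dy, so the values below are placeholders chosen to respect Db > Da > Dc > Dd.

```python
from dataclasses import dataclass


@dataclass
class ParameterEntry:
    """One row of the parameter information (hypothetical field names)."""
    target: str               # detection target, e.g. "user A" or "chair 2"
    is_person: bool           # True for persons, False for objects
    angle_deg: float          # direction relative to the 0-degree reference (user D)
    distance_m: float         # distance from the device; placeholder values
    directivity: float = 0.0  # directivity parameter, filled in later
    gain_pct: float = 0.0     # gain parameter, filled in later


# Angles follow FIG. 1 as described in the text; distances are assumed.
parameter_info = [
    ParameterEntry("user D", True, 0.0, 1.0),
    ParameterEntry("user C", True, 30.0, 1.5),
    ParameterEntry("user B", True, 60.0, 3.0),
    ParameterEntry("user A", True, 120.0, 2.5),
    ParameterEntry("chair 2", False, 160.0, 2.0),
    ParameterEntry("board 3", False, 250.0, 3.5),
]
```

Keeping the person/object distinction in each row lets later steps treat objects as noise positions without a second lookup.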

The "directivity parameter" is a setting value corresponding to the strength of the directivity (beamforming) of the microphone 14. For example, the directivity parameter is set to a strong value for the speaker whose voice is to be collected, and to a weak value for users other than the speaker. The directivity parameter is also set to a value corresponding to the distance. For example, the directivity parameter is set so that the directivity becomes stronger as the distance from the voice processing device 1 to the user becomes longer, and weaker as that distance becomes shorter. The directivity parameter is set by the control unit 11 (directivity adjustment unit 113).

The "gain parameter" is an adjustment value (gain value) for the volume of the voice input to the voice processing device 1 via the microphone 14. For example, when the voice of one speaker is input, the gain parameter is set to 100% (see FIG. 5). When, for example, the voices of two speakers are input and the volume ratio of one speaker (for example, user C) to the other speaker (for example, user B) is 7:3, the gain parameter is set to 30% for the voice of user C and to 70% for the voice of user B (see FIG. 6). The gain parameter is set by the control unit 11 (gain adjustment unit 116).
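The text gives only two data points for the gain parameter (a single speaker gets 100%, and a 7:3 volume ratio yields 30% and 70%) and does not state the rule behind them. The following Python sketch assumes one rule that reproduces both: each gain is proportional to the inverse of the speaker's volume. The function name and the rule itself are assumptions, not the method of the source.

```python
def gain_parameters(volumes):
    """Return one gain percentage per speaker.

    Assumed rule: gain is proportional to 1/volume, normalised to 100%,
    so that quieter speakers receive more gain.
    """
    if not volumes or any(v <= 0 for v in volumes):
        raise ValueError("volumes must be a non-empty list of positive numbers")
    inverse = [1.0 / v for v in volumes]
    total = sum(inverse)
    return [100.0 * w / total for w in inverse]


# User C : user B = 7 : 3 gives roughly 30% for C and 70% for B.
gain_c, gain_b = gain_parameters([7.0, 3.0])
```

Under this assumed rule the louder speaker is attenuated relative to the quieter one, which tends to even out the recorded volumes.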

The control unit 11 has control devices such as a CPU, a ROM, and a RAM. The CPU is a processor that executes various arithmetic processes. The ROM stores in advance control programs such as a BIOS and an OS for causing the CPU to execute various processes. The RAM stores various types of information and is used as a temporary storage memory (work area) for the various processes executed by the CPU. The control unit 11 controls the voice processing device 1 by having the CPU execute the various control programs stored in advance in the ROM or the storage unit 12.

Specifically, the control unit 11 includes various processing units such as an image acquisition unit 111, a target detection unit 112, a directivity adjustment unit 113, a voice reception unit 114, a determination processing unit 115, and a gain adjustment unit 116. The control unit 11 functions as these processing units by having the CPU execute various processes according to the voice processing program. Some or all of the processing units included in the control unit 11 may be implemented as electronic circuits. The voice processing program may be a program for causing a plurality of processors to function as the various processing units.

The image acquisition unit 111 acquires a captured image captured by the camera 13. The image acquisition unit 111 is an example of the image acquisition unit of the present invention. For example, when the camera 13 captures the surroundings of the voice processing device 1 in the conference room, the image acquisition unit 111 acquires the captured image of those surroundings. For example, the image acquisition unit 111 sequentially acquires frame images captured by the camera 13 at a predetermined frame rate. The image acquisition unit 111 stores the image data of the acquired captured image in the storage unit 12.

The target detection unit 112 detects predetermined detection targets based on the captured image acquired by the image acquisition unit 111. The target detection unit 112 is an example of the detection processing unit of the present invention. For example, the target detection unit 112 analyzes the captured image to detect a person (users A to D), the chair 2, or the board 3. The target detection unit 112 also detects the position (the angle and the distance) of each detected target. The target detection unit 112 registers the detected target, angle, and distance in the parameter information 122 (see FIG. 3) of the storage unit 12. Well-known techniques can be applied to the method of detecting a predetermined object from the captured image and to the method of detecting the position (angle, distance) of the detected object.

The directivity adjustment unit 113 sets (adjusts) the directivity of the microphone 14 with respect to each of the positions of the plurality of persons detected by the target detection unit 112. The directivity adjustment unit 113 is an example of the directivity adjustment unit of the present invention. Specifically, the directivity adjustment unit 113 sets the directivity parameter of the microphone 14. For example, in the initial setting process (initial setting mode) (see FIG. 7), the directivity adjustment unit 113 sets the directivity parameter based on the position (angle, distance) of each detection target detected by the target detection unit 112. For example, the directivity adjustment unit 113 classifies the distance from the microphone 14 of the voice processing device 1 into three stages, "short distance", "medium distance", and "long distance", and sets the directivity parameter in three stages, "weak", "medium", and "strong", according to the distance stage. The directivity adjustment unit 113 may instead set the directivity parameter so that it changes continuously, following the distance.
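The three-stage classification can be sketched in Python as follows. The two boundary distances are hypothetical, since the source names the stages but gives no thresholds.

```python
# Hypothetical stage boundaries in metres; the source gives no values.
NEAR_LIMIT_M = 1.5
FAR_LIMIT_M = 3.0

# Stage-to-strength mapping as described in the text.
STAGE_TO_DIRECTIVITY = {"short": "weak", "medium": "medium", "long": "strong"}


def distance_stage(distance_m):
    """Classify a distance from the microphone into one of three stages."""
    if distance_m < NEAR_LIMIT_M:
        return "short"
    if distance_m < FAR_LIMIT_M:
        return "medium"
    return "long"


def directivity_level(distance_m):
    """Return the directivity strength for a target at the given distance."""
    return STAGE_TO_DIRECTIVITY[distance_stage(distance_m)]
```

For instance, `directivity_level(0.8)` evaluates to `"weak"` and `directivity_level(4.0)` to `"strong"` under the assumed boundaries.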

FIGS. 4A and 4B are graphs showing an example of the directivity parameters set in the initial setting process. For example, for the conference room shown in FIG. 1, when the target detection unit 112 detects the persons (users A to D), the chair 2, and the board 3 together with their respective positions (angle, distance), the directivity adjustment unit 113 sets directivity parameters according to the distances of the four users A to D. Specifically, the directivity adjustment unit 113 sets the directivity parameter to be stronger as the distance from the voice processing device 1 to the user becomes longer, and weaker as that distance becomes shorter. Here, the directivity parameter is set to "Bp1" (weak) for user D at an angle of "0 degrees" and a distance of "Dd", to "Bp2" (medium) for user C at an angle of "30 degrees" and a distance of "Dc", to "Bp4" (strong) for user B at an angle of "60 degrees" and a distance of "Db", and to "Bp3" (strong) for user A at an angle of "120 degrees" and a distance of "Da". The distances satisfy the relationship "Db > Da > Dc > Dd". Accordingly, the directivity parameters satisfy the relationship "Bp4 > Bp3 > Bp2 > Bp1". For the chair 2 and the board 3, the directivity adjustment unit 113 sets the directivity parameter to "0" (noise parameter). The directivity adjustment unit 113 registers each directivity parameter it has set in the parameter information 122 (see FIG. 3).
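The initial setting described above can be sketched in Python using the continuous variant, in which the parameter follows the distance. The linear mapping, the bounds `bp_low` and `bp_high`, and the numeric distances are all assumptions for illustration; the source only requires that longer distances give stronger directivity and that objects receive "0".

```python
def initial_directivity(targets, bp_low=0.2, bp_high=1.0):
    """Assign an initial directivity parameter to each detection target.

    targets: list of (name, is_person, distance) tuples.
    Persons are mapped linearly from the nearest (bp_low) to the farthest
    (bp_high); non-person targets get 0.0 (the noise parameter).
    """
    person_distances = [d for _, is_person, d in targets if is_person]
    if not person_distances:
        return {name: 0.0 for name, _, _ in targets}
    nearest, farthest = min(person_distances), max(person_distances)
    span = farthest - nearest
    params = {}
    for name, is_person, distance in targets:
        if not is_person:
            params[name] = 0.0
        elif span == 0:
            params[name] = bp_high
        else:
            params[name] = bp_low + (bp_high - bp_low) * (distance - nearest) / span
    return params


# Placeholder distances respecting Db > Da > Dc > Dd.
room = [
    ("user D", True, 1.0), ("user C", True, 1.5),
    ("user B", True, 3.0), ("user A", True, 2.5),
    ("chair 2", False, 2.0), ("board 3", False, 3.5),
]
initial = initial_directivity(room)
```

With these placeholder distances, the resulting values preserve the ordering Bp4 > Bp3 > Bp2 > Bp1 for users B, A, C, and D.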

The voice reception unit 114 receives the voice collected by the microphone 14. The voice reception unit 114 is an example of the voice reception unit of the present invention. For example, the voice reception unit 114 receives voice uttered by the plurality of persons detected by the target detection unit 112. The voice reception unit 114 also receives, for example, voice uttered by a third party not participating in the conference, voice uttered by the persons or the third party and reflected off an object (the chair 2, the board 3, etc.), and other noise. That is, the speaker's voice, third-party voice, reflected voice, noise, and the like are all sound sources. The sound source positions of such third-party voice, reflected voice, and noise differ from the positions of the persons detected by the target detection unit 112.

After the directivity for each position of the plurality of persons (users A to D) has been set in the initial setting process (initial setting mode), the voice reception unit 114 starts receiving voice upon transition to the voice input mode. The voice reception unit 114 receives the voice based on the directivity set by the directivity adjustment unit 113. When the voice reception unit 114 receives voice, the directivity adjustment unit 113 readjusts the directivity set in the initial setting process based on the sound source position of that voice. Specifically, the directivity adjustment unit 113 resets the directivity parameter based on the sound source position. More specifically, it resets the directivity parameter to adjust the directivity according to the distance, detected by the target detection unit 112, from the voice processing device 1 to each of a plurality of sound source positions. For example, the directivity adjustment unit 113 resets the directivity parameter so that the directivity becomes stronger as the distance becomes longer, and weaker as the distance becomes shorter.

For example, when the sound source position of the voice received by the voice reception unit 114 is the same as a position for which the directivity adjustment unit 113 has set the directivity (see FIG. 3), the directivity adjustment unit 113 increases the strength of the directivity toward that sound source position. For example, suppose the directivity parameters have been set as shown in FIGS. 3 and 4B and user B speaks. Because the position of user B, which is the sound source position, is the same as a position registered in the parameter information 122 (angle "60 degrees", distance "Db"), the directivity adjustment unit 113 resets the directivity parameter for that sound source position from "Bp4" to the maximum value ("Bpmax"), as shown for example in FIG. 4C. In this case, the directivity adjustment unit 113 further weakens the strength of the directivity toward positions different from the sound source position. For example, the directivity adjustment unit 113 resets the directivity parameters for the positions of users A, C, and D, which differ from the sound source position, to weaker values (for example, "Bp31", "Bp21", and "Bp11"). After resetting the directivity parameters, the directivity adjustment unit 113 updates the parameter information 122 (see FIG. 5).
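The single-speaker readjustment can be sketched in Python as follows. The maximum value of 1.0 and the attenuation factor of 0.5 are hypothetical; the source says only that the speaker's parameter is raised to "Bpmax" and the others are reset to weaker values.

```python
def boost_speaker(params, speaker, bp_max=1.0, attenuation=0.5):
    """Return updated directivity parameters when one registered user speaks.

    The speaker's parameter is raised to bp_max; every other position is
    weakened by the (assumed) attenuation factor. The input is not modified.
    """
    if speaker not in params:
        raise KeyError(f"{speaker!r} is not a registered position")
    return {
        name: (bp_max if name == speaker else value * attenuation)
        for name, value in params.items()
    }


# Illustrative starting values standing in for Bp3, Bp4, Bp2, Bp1.
before = {"user A": 0.7, "user B": 0.8, "user C": 0.5, "user D": 0.3, "chair 2": 0.0}
after = boost_speaker(before, "user B")
```

Positions already at 0, such as the chair, stay at 0 because the attenuation is multiplicative.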

Further, when the voice received by the voice receiving unit 114 has a plurality of sound sources, the directivity adjusting unit 113 adjusts the directivity for each sound source position according to the sound source position of each sound source. For example, when user C speaks while user B is speaking as shown in FIG. 4C, there are a plurality of sound sources, one at the position of user B and one at the position of user C. In this case, the directivity adjusting unit 113 adjusts (allocates) the directivity for each sound source position according to the position of user B and the position of user C. Specifically, the directivity adjusting unit 113 resets the directivity parameter for the sound source position of user B, who is farther from the voice processing device 1, from "Bpmax" to "Bp42", and resets the directivity parameter for the sound source position of user C, who is closer to the voice processing device 1, from "Bp21" to "Bp22" (see FIG. 4D). Here, "Bp22" is a weaker (smaller) value than "Bp42". After resetting the directivity parameters, the directivity adjusting unit 113 updates the parameter information 122 (see FIG. 6). Since user A and user D are not speaking here, "Bp12" may be the same value as "Bp11" (see FIG. 5), and "Bp32" may be the same value as "Bp31" (see FIG. 5).
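The readjustment for one speaker and for several simultaneous speakers can be sketched as follows. The description does not give the allocation formula, so the distance-proportional split, the weakening factor, and the names B_P_MAX and readjust are all assumptions for illustration. Distances of active sources are assumed to be positive.

```python
B_P_MAX = 10.0  # hypothetical maximum directivity parameter ("Bpmax")


def readjust(initial, distances, active, weak_factor=0.5):
    """Return readjusted directivity parameters per registered position.

    initial:   dict position -> directivity parameter from initial setting
    distances: dict position -> distance from the device (positive)
    active:    set of positions that are currently sound sources
    """
    total_active_distance = sum(distances[p] for p in active)
    updated = {}
    for position, param in initial.items():
        if position not in active:
            # Positions other than the sound source are weakened.
            updated[position] = param * weak_factor
        elif len(active) == 1:
            # A single speaker receives the maximum directivity.
            updated[position] = B_P_MAX
        else:
            # Several speakers share the maximum; farther speakers get more.
            updated[position] = (B_P_MAX * distances[position]
                                 / total_active_distance)
    return updated


initial = {"A": 3.0, "B": 4.0, "C": 2.0, "D": 1.0}
distances = {"A": 2.0, "B": 3.0, "C": 1.0, "D": 0.5}
only_b = readjust(initial, distances, {"B"})
b_and_c = readjust(initial, distances, {"B", "C"})
```

With these sample values, the parameter for B drops from the maximum when C joins, and the parameter for the closer user C stays below that of the farther user B, which matches the ordering described for "Bp22" and "Bp42".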

Here, when the sound source position of the received voice is the same as a position for which the directivity has been set by the directivity adjusting unit 113, the voice receiving unit 114 stores the voice in the storage unit 12. On the other hand, when the sound source position of the received voice differs from the positions for which the directivity has been set by the directivity adjusting unit 113, the voice receiving unit 114 deletes the voice.

The determination processing unit 115 determines whether or not the voice received by the voice receiving unit 114 is a voice uttered by any one of the plurality of persons detected by the target detection unit 112. That is, the determination processing unit 115 identifies the speaker based on the captured image. For example, the determination processing unit 115 detects the mouth movement of a person included in the captured image and, when the microphone 14 collects voice from the direction in which the mouth movement was detected, identifies the person in that direction as the speaker and determines that the voice was uttered by that person. The determination processing unit 115 can thereby determine, for example, whether a voice was uttered by one of users A to D participating in the conference or is a third party's voice, a reflected sound, or other noise. The determination processing unit 115 is an example of the determination processing unit of the present invention.

The gain adjusting unit 116 sets a gain value (gain parameter) for the volume of the voice input to the voice processing device 1. The gain adjusting unit 116 registers the set gain parameter in the parameter information 122. Specifically, when the voice has a plurality of sound sources, the gain adjusting unit 116 sets a gain parameter corresponding to the volume of each of the plurality of sound sources based on the volume ratio of the voices of the plurality of sound sources. For example, when one user B is speaking as shown in FIG. 4C, the gain adjusting unit 116 sets the gain parameter for the voice of user B to "100%" (see FIG. 5). When two users, user B and user C, are speaking as shown in FIG. 4D and the volume ratio of the voices of user B and user C is "3:7", the gain adjusting unit 116 sets the gain parameter for the voice of user B to "70%" and the gain parameter for the voice of user C to "30%" (see FIG. 6). In this way, the gain adjusting unit 116 sets the gain parameter for the voice of each of the plurality of speakers to a value inversely proportional to that speaker's share of the volume ratio, so that a quieter voice receives a larger gain. This makes it possible to equalize the volume of the voices stored in the storage unit 12. When the sound source is not a speaker, that is, when the sound source is the chair 2, the board 3, or the like, the gain adjusting unit 116 sets the gain parameter to "0".
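The gain rule can be checked with a short sketch. It assumes that the gains are normalized so that they sum to 100% across the active speakers, which reproduces the "100%" and "70%" / "30%" examples above; the function name gain_parameters and the dictionary layout are hypothetical.

```python
def gain_parameters(volumes, speakers):
    """Return gain parameters in percent for each sound source.

    volumes:  dict source -> measured volume (positive number)
    speakers: set of sources identified as speaking persons
    Gains are inversely proportional to volume and sum to 100% over the
    speakers; sources that are not speakers receive a gain of 0.
    """
    inverse = {s: 1.0 / v for s, v in volumes.items() if s in speakers}
    total = sum(inverse.values())
    gains = {}
    for source in volumes:
        if source in speakers:
            gains[source] = 100.0 * inverse[source] / total
        else:
            gains[source] = 0.0  # e.g. a chair or a board
    return gains


two_speakers = gain_parameters({"B": 3.0, "C": 7.0}, {"B", "C"})
one_speaker = gain_parameters({"B": 5.0, "chair": 1.0}, {"B"})
```

For the volume ratio 3:7, the inverse volumes are in the ratio 1/3 : 1/7 = 7:3, which gives gains of 70% for user B and 30% for user C.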

[Voice processing]
Hereinafter, an example of the procedure of the voice processing executed by the control unit 11 of the voice processing device 1 will be described with reference to FIGS. 7 and 8. The voice processing includes an initial setting process that performs initial setting in the initial setting mode (see FIG. 7) and a voice input process that performs voice input in the voice input mode after the initial setting (see FIG. 8). For example, when the power of the voice processing device 1 is turned on, the control unit 11 of the voice processing device 1 starts executing the initial setting processing program and thereby starts the initial setting process.

The present invention can also be regarded as an invention of a voice processing method that executes one or more steps included in the voice processing. One or more of the steps included in the voice processing described here may be omitted as appropriate. The steps of the voice processing may also be executed in a different order as long as the same effects are obtained. Further, although the case where the control unit 11 executes each step of the voice processing is described here as an example, in other embodiments the steps of the voice processing may be distributed among and executed by a plurality of processors.

First, an example of the procedure of the initial setting process will be described with reference to FIG. 7.

In step S11, the control unit 11 determines whether or not a captured image captured by the camera 13 has been acquired. For example, when the camera 13 captures the surroundings of the voice processing device 1 in a conference room (see FIG. 1), the control unit 11 acquires the captured image of the surroundings of the voice processing device 1. When the control unit 11 has acquired the captured image (S11: YES), the process proceeds to step S12. Step S11 is an example of the image acquisition step of the present invention.

In step S12, the control unit 11 determines whether or not a person has been detected. Specifically, the control unit 11 analyzes the captured image to detect predetermined detection targets such as a person, the chair 2, and the board 3. When the control unit 11 detects a person (S12: YES), the process proceeds to step S13. On the other hand, when the control unit 11 does not detect a person (S12: NO), that is, when the captured image does not include a person, the process proceeds to step S15.

In step S13, the control unit 11 detects the position of the person. Specifically, the control unit 11 detects the distance and the angle from the voice processing device 1 to the detected person. The control unit 11 registers the detection target ("person"), the distance, and the angle in the parameter information 122 (see FIG. 3) in association with the captured image. Steps S12 and S13 are an example of the detection step of the present invention.

Next, in step S14, the control unit 11 sets (adjusts) the directivity of the microphone 14 with respect to the position of the detected person. Specifically, the control unit 11 sets the directivity parameter based on the position (angle, distance) of the detected person. When a plurality of persons are detected, the control unit 11 sets a directivity parameter according to the position of each person (see FIG. 4B). The control unit 11 registers the set directivity parameters in the parameter information 122 (see FIG. 3) in association with the captured image.

On the other hand, in step S15, the control unit 11 detects the position of an object. Specifically, the control unit 11 detects the distance and the angle from the voice processing device 1 to the detected object (the chair 2, the board 3, or the like). The control unit 11 registers the detection target ("chair", "board"), the distance, and the angle in the parameter information 122 (see FIG. 3) in association with the captured image.

Next, in step S16, the control unit 11 sets (adjusts) the directivity of the microphone 14 with respect to the position of the detected object. Specifically, the control unit 11 sets the directivity parameter for the position of the object ("chair", "board") to "0", which is the noise parameter (see FIG. 3). The control unit 11 registers the set noise parameter in the parameter information 122 (see FIG. 3) in association with the captured image. Steps S14 and S16 are an example of the directivity adjustment step of the present invention.
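Steps S12 to S16 can be sketched as building a parameter table from the detection results. The Detection class, the field names, and the linear strength_per_meter rule are assumptions for illustration; the description specifies only that persons receive a position-dependent directivity parameter and that objects receive the noise parameter "0".

```python
from dataclasses import dataclass

NOISE_PARAMETER = 0.0  # directivity parameter assigned to objects


@dataclass
class Detection:
    kind: str          # "person", "chair", "board", ...
    distance_m: float  # distance from the voice processing device
    angle_deg: float   # angle as seen from the voice processing device


def initial_setting(detections, strength_per_meter=2.0):
    """Build parameter entries for all detected targets (S13 to S16)."""
    table = []
    for d in detections:
        if d.kind == "person":
            # S13/S14: a person gets a distance-dependent directivity.
            directivity = strength_per_meter * d.distance_m
        else:
            # S15/S16: an object gets the noise parameter.
            directivity = NOISE_PARAMETER
        table.append({
            "target": d.kind,
            "distance_m": d.distance_m,
            "angle_deg": d.angle_deg,
            "directivity": directivity,
        })
    return table


table = initial_setting([
    Detection("person", 3.0, 60.0),
    Detection("person", 1.0, 120.0),
    Detection("chair", 2.0, 200.0),
])
```

The resulting list plays the role of the parameter information 122 in this sketch: one entry per detection target, with the chair entry carrying a directivity of 0.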

The initial setting process is performed as described above. When the initial setting process is completed, the voice input process described below is performed. An example of the procedure of the voice input process will be described with reference to FIG. 8.

In step S21, the control unit 11 determines whether or not a voice has been received via the microphone 14. When the control unit 11 has received a voice (S21: YES), the process proceeds to step S22. Step S21 is an example of the voice reception step of the present invention.

In step S22, the control unit 11 determines whether or not the sound source position (distance, angle) of the received voice is the same as a position (distance, angle) for which a directivity parameter has been set (see FIG. 3). When the control unit 11 determines that the sound source position is the same as a position for which a directivity parameter has been set (S22: YES), the process proceeds to step S23. On the other hand, when the control unit 11 determines that the sound source position is not the same as any position for which a directivity parameter has been set (S22: NO), the process proceeds to step S29.

In step S23, the control unit 11 determines whether or not the voice at the sound source position is a voice uttered by a speaker. When the control unit 11 determines that the voice at the sound source position is a voice uttered by a speaker (S23: YES), the process proceeds to step S24. On the other hand, when the control unit 11 determines that the voice at the sound source position is not a voice uttered by a speaker (S23: NO), the process proceeds to step S29.

In step S24, the control unit 11 determines whether or not the sound source position is the same as the sound source position of a voice that has already been received. For example, when one user B continues to speak, the position of user B, which is the sound source position, is the same as the sound source position of the voice of user B that has already been received. In contrast, when user C speaks while user B is speaking, the position of user C, which is the sound source position, differs from the sound source position of the voice of user B that has already been received. When the control unit 11 determines that the sound source position is the same as the sound source position of the already received voice (S24: YES), the process proceeds to step S25. On the other hand, when the control unit 11 determines that the sound source position is not the same as the sound source position of the already received voice (S24: NO), the process proceeds to step S27.

In step S25, the control unit 11 sets the directivity parameter for the sound source position of a single sound source, that is, one user. For example, as shown in FIG. 4C, the control unit 11 resets the directivity parameter for the sound source position of user B from "Bp4" to the maximum value ("Bpmax") to increase the strength of the directivity for that sound source position. The control unit 11 also reduces the strength of the directivity for positions different from the sound source position. The control unit 11 registers the set directivity parameters in the parameter information 122 (see FIG. 5).

Next, in step S26, the control unit 11 sets the gain parameter for the volume of the voice of a single sound source, that is, one user. For example, as shown in FIG. 4C, the control unit 11 sets the gain parameter for the voice of the one user B to "100%". The gain adjusting unit 116 registers the set gain parameter in the parameter information 122 (see FIG. 5). The process then returns to step S21.

In step S27, the control unit 11 sets the directivity parameter for each of the sound source positions of a plurality of sound sources, that is, a plurality of users. For example, as shown in FIG. 4D, the control unit 11 weakens the directivity parameter for the sound source position of user B, who is farther from the voice processing device 1, from "Bpmax" to "Bp42", and strengthens the directivity parameter for the sound source position of user C, who is closer to the voice processing device 1, from "Bp21" to "Bp22" (see FIG. 4D). The control unit 11 sets the directivity for the position of user C to be weaker than the directivity for the position of user B. The control unit 11 registers each set directivity parameter in the parameter information 122 (see FIG. 6).

Next, in step S28, the control unit 11 sets the gain parameter for the volume of each voice of a plurality of sound sources, that is, a plurality of users. For example, when two users, user B and user C, are speaking as shown in FIG. 4D and the volume ratio of the voices of user B and user C is "3:7", the control unit 11 sets the gain parameter for the voice of user B to "70%" and the gain parameter for the voice of user C to "30%" (see FIG. 6). The control unit 11 registers each set gain parameter in the parameter information 122 (see FIG. 6). The process then returns to step S21.

In step S29, the control unit 11 determines that the received voice is noise and deletes it. The voice processed in step S29 is a third party's voice, a reflected sound, or other noise. The control unit 11 sets the directivity parameter for the sound source position of this voice to "0" (the noise parameter) so that no directivity is given to that position. The control unit 11 then deletes the voice from the voice processing device 1 without performing input processing such as storing the voice in the storage unit 12. The process then returns to step S21. The voice input process is performed as described above, and the control unit 11 repeats the voice input process each time a voice is received.
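The branching of steps S22 to S24 and S29 can be summarized in a small decision function. This is a sketch only: positions are compared by exact equality of (angle, distance) tuples, whereas a practical device would presumably need a tolerance, and the handling of the very first utterance (no voice currently being received) as the single-source case is an assumption that the description does not state. The function and parameter names are hypothetical.

```python
def next_step(source, registered, is_speaker, already_active):
    """Return the step that follows S21 for one received voice.

    source:         (angle_deg, distance_m) of the received voice
    registered:     set of positions with a directivity parameter set
    is_speaker:     result of the speaker determination (S23)
    already_active: set of positions whose voice is already being received
    """
    if source not in registered:   # S22: NO
        return "S29"               # treat as noise and delete
    if not is_speaker:             # S23: NO
        return "S29"
    # S24: same source as the voice already being received?
    # Assumption: with no voice in progress, treat it as a single source.
    if not already_active or source in already_active:
        return "S25"               # single source: S25 then S26
    return "S27"                   # plural sources: S27 then S28


registered = {(60.0, 3.0), (120.0, 1.0)}
first = next_step((60.0, 3.0), registered, True, set())
second = next_step((120.0, 1.0), registered, True, {(60.0, 3.0)})
noise = next_step((300.0, 2.0), registered, True, {(60.0, 3.0)})
```

In this example the first utterance from the user at 60 degrees leads to step S25, a second user speaking over it leads to step S27, and a voice from an unregistered position leads to step S29.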

As described above, according to the voice processing device 1 of the present embodiment, the speaker can be reliably identified from among a plurality of users when a plurality of users are present around the voice processing device, as in a conference. Even when a plurality of users speak at the same time, the directivity of the microphone can be appropriately adjusted according to the position (distance, angle) of each speaker, so that the voice of each speaker can be appropriately acquired. In addition, no directivity is given to the direction of a voice acquired from a position different from the positions of the persons around the voice processing device 1, and such a voice is determined to be noise and deleted. This prevents unnecessary voice from being input and allows the voice of the speaker to be appropriately acquired.

Within the scope of the invention described in each claim, the voice processing device of the present invention may also be configured by freely combining the embodiments described above, or by appropriately modifying or partially omitting each embodiment.

1: Voice processing device
11: Control unit
12: Storage unit
13: Camera
14: Microphone
15: Speaker
111: Image acquisition unit
112: Target detection unit
113: Directivity adjusting unit
114: Voice receiving unit
115: Determination processing unit
116: Gain adjusting unit
121: Image information
122: Parameter information

Claims

1. A voice processing device comprising:
a voice receiving unit that receives voice collected by a microphone;
an image acquisition unit that acquires a captured image captured by an imaging unit;
a detection processing unit that detects a plurality of persons and the respective positions of the plurality of persons from the captured image acquired by the image acquisition unit; and
a directivity adjusting unit that sets the directivity of the microphone with respect to each of the positions of the plurality of persons based on the respective positions of the plurality of persons detected by the detection processing unit,
wherein the voice receiving unit receives the voice based on the directivity set by the directivity adjusting unit.

2. The voice processing device according to claim 1,
wherein the detection processing unit detects the distance from the voice processing device to each of the positions of the plurality of persons, and
the directivity adjusting unit increases the strength of the directivity as the distance becomes longer and reduces the strength of the directivity as the distance becomes shorter.

3. The voice processing device according to claim 1 or 2,
wherein the voice is stored when the sound source position of the voice received by the voice receiving unit is the same as a position for which the directivity is set by the directivity adjusting unit, and
the voice is deleted when the sound source position of the voice received by the voice receiving unit is different from the position for which the directivity is set by the directivity adjusting unit.

4. The voice processing device according to any one of claims 1 to 3,
wherein the directivity adjusting unit increases the strength of the directivity with respect to the sound source position when the sound source position of the voice received by the voice receiving unit is the same as a position for which the directivity is set by the directivity adjusting unit.

5. The voice processing device according to claim 4,
wherein the directivity adjusting unit further reduces the strength of the directivity with respect to a position different from the sound source position.

6. The voice processing device according to any one of claims 1 to 5,
wherein, when the voice received by the voice receiving unit has a plurality of sound sources, the directivity adjusting unit adjusts the directivity with respect to each of the sound source positions according to the sound source position of each of the sound sources.

7. The voice processing device according to any one of claims 1 to 6, further comprising
a determination processing unit that determines whether or not the voice received by the voice receiving unit is a voice uttered by any one of the plurality of persons detected by the detection processing unit,
wherein the voice is deleted when the determination processing unit determines that the voice received by the voice receiving unit is not a voice uttered by any one of the plurality of persons.

8. The voice processing device according to any one of claims 1 to 7, further comprising
a gain adjusting unit that sets a gain value for the volume of the voice received by the voice receiving unit,
wherein, when the voice has a plurality of sound sources, the gain adjusting unit sets the gain value corresponding to the volume of each of the plurality of sound sources based on the volume ratio of the voices of the plurality of sound sources.

9. The voice processing device according to any one of claims 1 to 8, further comprising
the microphone and the imaging unit.

10. A voice processing method comprising:
a voice reception step of receiving voice collected by a microphone;
an image acquisition step of acquiring a captured image captured by an imaging unit;
a detection step of detecting a plurality of persons and the respective positions of the plurality of persons from the captured image acquired in the image acquisition step; and
a directivity adjustment step of setting the directivity of the microphone with respect to each of the positions of the plurality of persons based on the respective positions of the plurality of persons detected in the detection step,
wherein, in the voice reception step, the voice is received based on the directivity set in the directivity adjustment step.