JP6079179B2 - Hands-free call device

Info

Publication number: JP6079179B2
Application number: JP2012264479A
Authority: JP
Inventors: 渡邉　聡; 聡渡邉; 秋作 ▲高▼間
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2012-12-03
Filing date: 2012-12-03
Publication date: 2017-02-15
Anticipated expiration: 2032-12-03
Also published as: JP2014110554A

Description

The present invention relates to a hands-free call device.

Hands-free call devices have been proposed that reduce the influence of the echo produced when sound output from the speaker leaks into the microphone.

For example, in the technique described in Patent Document 1, the signal level of each voice signal is amplified or reduced, based on the signal level of the transmission voice signal (or the reception voice signal), so that the total gain of the transmission voice signal and the reception voice signal is "1" or less. For example, when a transmission voice signal is input, control is performed to amplify the signal level of the transmission voice signal and to reduce the signal level of the reception voice signal.

Known echo cancellation techniques for removing echo are generally built around an adaptive filter that learns the transfer function of the echo path from the reception voice signal and the transmission voice signal (for example, Patent Document 2). That is, for the transmission voice signal, the adaptive filter generates a pseudo echo from the reception voice signal, and the echo component is subtracted from the transmission voice signal.

However, while the transfer function is being learned, the voice uttered by the user acts as a disturbance and lowers the estimation accuracy of the transfer function. Patent Document 2 therefore detects the movement of the user's mouth with a camera; while mouth movement is detected, it determines that the user is speaking and stops updating the adaptive filter. This improves the accuracy of the pseudo echo.

Patent Document 1: JP-A-9-162786
Patent Document 2: JP 2000-295338 A

However, with the technique described in Patent Document 1, if the reception voice signal has a signal level corresponding to a quiet voice near the noise level while a transmission voice signal is being input, the signal level of the reception voice signal is reduced, and the user may be unable to hear the received voice output from the speaker. Conversely, if the transmission voice signal has a signal level corresponding to a quiet voice near the noise level while a reception voice signal is being input, the other party cannot hear the user's uttered voice.

The technique described in Patent Document 2 makes the echo harder to hear by raising the accuracy of the pseudo echo, but it cannot always remove the echo completely. For example, in a situation where the transfer function changes dynamically, such as when the user is driving, the transfer function of the actual echo path changes while updating of the adaptive filter is stopped. A discrepancy then arises between the transfer function learned by the adaptive filter and the actual transfer function, so the accuracy of the pseudo echo falls. As a result, an annoying echo reaches the other party.

The present invention has been made in view of the above circumstances. Its object is to provide a hands-free call device with which, when a reception voice signal arrives from the other party while the user is speaking, the user can hear the received voice contained in that signal even if its signal level corresponds to a quiet voice; with which the user's uttered voice is conveyed to the other party without its signal level being reduced, even if it is quiet; and which reduces the influence of the echo transmitted from the user side to the other party while the user is not speaking.

To achieve this object, the present invention is a hands-free call device (100) comprising:
a camera (1) that captures an image of the user's face;
a camera image processing unit (2) that detects movement of the user's mouth from the image data captured by the camera; and
a signal level adjustment unit (3) that adjusts the signal level of the transmission voice signal sent from the user to the other party by adjusting an amplification factor through a continuous or stepwise change,
wherein the signal level adjustment unit
is configured so that the amplification factor can be set to a value between 0 and 1, and sets the amplification factor to a value smaller than 1 when the camera image processing unit does not detect mouth movement, and
sets the amplification factor of the transmission voice signal to a value of 1 or more when the camera image processing unit detects mouth movement.

With this configuration, the signal level of the transmission voice signal is reduced while no movement of the user's mouth is detected. Therefore, even in a situation where the transfer function of the echo path changes dynamically, if an echo enters the microphone while the user's mouth is not moving, the influence of that echo can be reduced regardless of the change in the environment.

Furthermore, because the configuration adjusts the signal level of the transmission voice signal and does not reduce the signal level of the reception voice signal, the user can hear the received voice from the other party even when the reception voice signal has a low signal level and the user is speaking at the same time.

In addition, even when a reception voice signal is being input from the other party, the amplification factor of the signal level adjustment unit is set to 1 or more whenever the camera image processing unit detects movement of the user's mouth, so the uttered voice is conveyed to the other party even if it is quiet.
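The gain rule stated in the claim can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_tx_gain` and the default values of `active_gain` and `idle_gain` are assumptions chosen only to satisfy the stated constraints (1 or more when mouth movement is detected, less than 1 otherwise).

```python
def select_tx_gain(mouth_moving: bool,
                   active_gain: float = 1.0,
                   idle_gain: float = 0.0) -> float:
    """Pick the amplification factor for the transmission voice signal.

    active_gain and idle_gain are hypothetical parameters; the claim only
    requires active_gain >= 1 and 0 <= idle_gain < 1.
    """
    if active_gain < 1.0:
        raise ValueError("gain while speaking must be 1 or more")
    if not 0.0 <= idle_gain < 1.0:
        raise ValueError("gain while silent must be in [0, 1)")
    # The reception voice signal is never touched: only the transmit gain changes.
    return active_gain if mouth_moving else idle_gain
```

Because only the transmit-side gain depends on the camera result, the received voice keeps its level whether or not the user is speaking.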

FIG. 1 is a block diagram showing the configuration of the hands-free call device of this embodiment.
FIG. 2 is a flowchart showing the control flow of this embodiment.
FIG. 3 is a flowchart showing the flow of the amplification factor adjustment process in the control of this embodiment.
FIG. 4 is a conceptual diagram explaining the operation of this embodiment.
FIG. 5 is a conceptual diagram explaining the operation of this embodiment.
FIG. 6 is a conceptual diagram explaining the operation of this embodiment.
FIG. 7 is a block diagram showing a modification of the present invention.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an embodiment of the hands-free call device 100 of the present invention. The hands-free call device 100 of FIG. 1 includes a signal processing unit 1 and a camera 6, and the signal processing unit 1 is connected to a speaker 4, a microphone 5, the camera 6, and a telephone interface unit 7. The telephone interface unit 7 and the hands-free call device 100 may be connected by wire or wirelessly, for example via Bluetooth (registered trademark). A telephone (not shown) outputs the reception voice signal received over a network such as a public line to the telephone interface unit 7, and the telephone interface unit 7 outputs this reception voice signal to the signal processing unit 1. The transmission voice signal input to the telephone from the telephone interface unit 7 is sent to the other party over the network. Here, the reception voice signal contains, in addition to the received voice from the other party, an echo component and noise other than the echo component. Likewise, the transmission voice signal contains, in addition to the voice uttered by the user, an echo component and noise other than the echo component.

The signal processing unit 1 includes a camera image processing unit 2, a signal level adjustment unit 3, and an echo cancellation unit 10. The microphone 5 is a small omnidirectional microphone that picks up the voice uttered by the user (uttered voice S) and ambient sounds such as noise, converts them into an electrical signal, and outputs it to the signal processing unit 1. The speaker 4 converts the electrical reception voice signal input from the signal processing unit 1 into sound and outputs it.

Although not shown, the signal processing unit 1 has amplifiers that amplify the signal level of the reception voice signal or the transmission voice signal at the input section that receives the signal from the microphone 5 and at the output section that outputs a signal to the telephone interface unit 7. Similar amplifiers (not shown) are also provided at the input section that receives the signal from the telephone interface unit 7 and at the output section that outputs to the speaker 4. The amplification factors of these amplifiers are fixed.

The process by which echo arises and the operation of the echo cancellation unit 10 are now described with reference to FIG. 1. First, the reception voice signal received via the telephone interface unit 7 is converted into sound and emitted by the speaker 4. The reception voice signal normally carries an echo component, a noise component, and so on superimposed on the voice uttered by the other party (received voice R); for the purpose of explanation, however, the echo and noise components in the reception voice signal are ignored here, and the reception voice signal is treated as identical to the received voice R. The electrical signal corresponding to the received voice and the received voice itself are not distinguished, and both are denoted by the same symbol R. The uttered voice S and the echo component r are likewise each denoted by a single symbol.

Part of the sound emitted from the speaker 4 (in this case the received voice R, since the echo and noise components are ignored) travels along the echo path into the microphone 5 and is picked up as the echo component r. Because both the user's uttered voice S and the echo component r enter the microphone 5, the transmission voice signal input from the microphone 5 to the signal processing unit 1 is S + r. This echo component r reaches the other party as annoying noise.

The transmission voice signal S (strictly, S + r) sent from the user to the other party likewise leaks from the speaker 4 into the microphone 5 on the other party's side and becomes an echo. In that case, the reception voice signal input from the telephone interface unit 7 to the signal processing unit 1 is R + s, in which the echo component s is superimposed on the received voice R.

The echo cancellation unit 10 uses the reception voice signal (≈ received voice R) and the transmission voice signal (≈ uttered voice S) to remove the echo components r and s. More specifically, the echo cancellation unit 10 includes a DSP (Digital Signal Processor) 11 that generates pseudo echo signals, a reception-side adder 12, and a transmission-side adder 13. As shown in FIG. 1, the received voice R output to the speaker 4 (strictly, the voice obtained after the echo cancellation unit 10 has removed the echo component r from the reception voice signal) is also input to the DSP 11. The DSP 11 generates a pseudo echo signal r' based on the received voice R. As in the prior art, the pseudo echo signal r' may be generated with an adaptive filter that successively learns the transfer function of the echo path. The adder 13 then subtracts the pseudo echo signal r' generated by the DSP 11 from the voice signal S + r input from the microphone 5, so that the transmission voice signal sent to the other party is S + r − r'.

If the pseudo echo signal r' generated from the received voice R matched r exactly, the echo would be removed completely and only the user's uttered voice S would be sent to the other party. Even with known echo cancellation techniques, however, it is difficult to generate a perfect pseudo echo signal (r = r'); in general r ≈ r'. Moreover, even if r = r' could be achieved temporarily, the echo path and its transfer function change when the user changes posture or when the environment around the speaker 4 and the microphone 5 changes, so r ≠ r' while the transfer function is being relearned, and a residual echo arises. In addition, a more accurate echo cancellation unit 10 requires a more complex circuit.
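The adaptive-filter idea behind the pseudo echo r' can be illustrated with a short sketch. The patent only says that a known adaptive filter may be used; the normalized LMS (NLMS) update below, the function name `nlms_echo_cancel`, and the parameters `taps`, `mu`, and `eps` are assumptions made for illustration.

```python
def nlms_echo_cancel(far_end, mic, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated pseudo echo r' from the mic signal.

    far_end: samples of the received voice R sent to the speaker.
    mic:     samples picked up by the microphone (S + r).
    Returns (residual, weights), where residual is S + r - r'.
    NLMS is an assumed choice; the source does not name an algorithm.
    """
    w = [0.0] * taps          # estimated echo-path impulse response
    history = [0.0] * taps    # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        r_est = sum(wi * hi for wi, hi in zip(w, history))  # pseudo echo r'
        e = d - r_est                                       # S + r - r'
        norm = sum(h * h for h in history) + eps
        # Normalized LMS update of the filter weights.
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        residual.append(e)
    return residual, w
```

With a fixed echo path and no near-end speech the weights settle on the path, and the residual shrinks toward zero. When the path changes, or when the user's own voice disturbs the update, the residual r − r' grows again, which is the limitation the paragraph above describes.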

The echo component s caused by the uttered voice S can be reduced in the same way with known echo cancellation techniques. That is, as shown in FIG. 1, the transmission voice signal (≈ uttered voice S) from which the echo cancellation unit 10 has removed the echo component r is input to the DSP 11, and the DSP 11 generates a pseudo echo signal s' predicted from the uttered voice S. The adder 13 then subtracts the pseudo echo signal s' generated by the DSP 11 from the reception voice signal R + s input from the telephone interface unit 7, so that the signal output to the speaker 4 is R + s − s'. Here too, s = s' does not hold exactly, so a residual echo arises.

The camera 6 successively captures images of the user's face and outputs the captured image data to the camera image processing unit 2. The number of images of the user's face captured per second (the frame rate) is set to a value that is high enough to detect the mouth movement that accompanies the user's speech and that the camera image processing unit 2, described later, can still process. If the frame rate is too low, the interval between captured images becomes long, and detection of mouth movement may lag behind the user's speech; the reduced continuity of the image data between frames may also lower the detection accuracy. If the frame rate is too high, on the other hand, the processing load on the camera image processing unit 2 increases. As an example, this embodiment uses 30 frames per second.

The camera image processing unit 2 detects the movement of the user's mouth from the image data input from the camera 6. Mouth movement may be detected, for example as in the technique described in Patent Document 2, from the motion vectors of feature points detected in a predetermined range centered on the mouth (the mouth region), or from the inter-frame difference of luminance values in that region. The mouth region may be located, for example, by detecting the positions of the user's pupils, statistically predicting the range of the mouth from those positions, and then locating the mouth within that range by pattern matching.

To detect mouth movement from motion vectors, the located mouth region is divided into a plurality of blocks, a motion vector is obtained for each block in the region, and the magnitudes of these motion vectors are summed. The sum is then compared with a predetermined threshold; if it exceeds the threshold, mouth movement is deemed to have been detected.

When the inter-frame difference of luminance values is used instead of motion vectors, the inter-frame luminance difference is likewise obtained for each block in the located mouth region and compared with a predetermined threshold; if it exceeds the threshold, mouth movement is deemed to have been detected.
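The luminance-difference variant can be sketched as below. Frames are assumed to be small 2-D lists of luminance values already cropped to the mouth region. The names `block_luma_diffs` and `mouth_moved`, the block size, the threshold value, and the choice to sum the per-block differences before thresholding (mirroring the motion-vector variant) are all illustrative assumptions, since the source does not fix them.

```python
def block_luma_diffs(prev, curr, block=2):
    """Mean absolute luminance difference between two frames, per block."""
    rows, cols = len(curr), len(curr[0])
    diffs = []
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            total, count = 0, 0
            for y in range(top, min(top + block, rows)):
                for x in range(left, min(left + block, cols)):
                    total += abs(curr[y][x] - prev[y][x])
                    count += 1
            diffs.append(total / count)
    return diffs


def mouth_moved(prev, curr, threshold=10.0, block=2):
    """Report movement when the summed block differences exceed the threshold.

    Summing over blocks is an assumption; a per-block test would also be
    consistent with the description.
    """
    return sum(block_luma_diffs(prev, curr, block)) > threshold
```

For a 4×4 region in which only the lower half changes by 40 luminance levels, the four 2×2 blocks give differences of 0, 0, 40, and 40, so the sum of 80 exceeds the example threshold of 10.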

When movement of the user's mouth is detected in this way, the camera image processing unit 2 outputs to the signal level adjustment unit 3 an ON signal instructing it to change the amplification factor from 0 to 1. When, on the other hand, it is determined that the user's mouth has not moved over a predetermined number of consecutive frames (for example, three frames), it outputs to the signal level adjustment unit 3 an OFF signal instructing it to change the amplification factor from 1 to 0.
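The asymmetric switching described above (ON as soon as movement is seen, OFF only after several still frames in a row) can be sketched as a small state machine. The class name `MouthGate` and its attribute names are illustrative assumptions; the default of three frames follows the example in the text.

```python
class MouthGate:
    """Turns ON immediately on mouth movement, OFF after N still frames."""

    def __init__(self, off_frames: int = 3):
        self.off_frames = off_frames  # consecutive still frames needed for OFF
        self.still_count = 0
        self.on = False

    def update(self, moved: bool) -> bool:
        """Feed one per-frame detection result; return True for ON."""
        if moved:
            self.still_count = 0
            self.on = True
        else:
            self.still_count += 1
            if self.still_count >= self.off_frames:
                self.on = False
        return self.on
```

At the 30 frames per second used in this embodiment, three frames correspond to 100 ms, so a brief pause between syllables does not close the gate.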

The signal level adjustment unit 3 is configured so that the ratio of the output value to the input value of the signal level (for example, the voltage) of the transmission voice signal (hereinafter, the amplification factor) can be changed; based on this amplification factor, it adjusts the signal level of the transmission voice signal and outputs the result to the telephone interface unit 7. An amplification factor of 0 is not limited to exactly 0 and includes any state in which the signal is reduced to a level that the other party cannot hear. Alternatively, the amplification factor and the signal level may be brought to exactly 0 by opening the circuit that carries the transmission voice signal.

The amplification factor of the signal level adjustment unit 3 is changed successively according to the signal input from the camera image processing unit 2. In this embodiment, while the ON signal is being input from the camera image processing unit 2, the amplification factor is set to 1, and the transmission voice signal input to the signal level adjustment unit 3 is output with its signal level unchanged. While the OFF signal is being input from the camera image processing unit 2, the amplification factor is set to 0, and the signal level of the transmission voice signal becomes 0. In other words, any transmission voice signal that enters the signal level adjustment unit 3 while the OFF signal is being input is discarded. The transmission voice signal is therefore conveyed to the other party via the telephone interface unit 7 only while the camera image processing unit 2 detects mouth movement.
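The signal relations used in this embodiment can be collected in one place. The symbol g for the amplification factor of the signal level adjustment unit 3 is notation introduced here for illustration and does not appear in the source.

```latex
\begin{align*}
\text{microphone input:}\quad & S + r \\
\text{after transmit-side echo cancellation:}\quad & S + r - r' \\
\text{sent to the other party:}\quad & g\,(S + r - r'),
  \qquad g = \begin{cases} 1 & \text{ON (mouth movement detected)} \\ 0 & \text{OFF} \end{cases} \\
\text{input from the telephone interface:}\quad & R + s \\
\text{speaker output:}\quad & R + s - s'
\end{align*}
```

While the user is silent, S = 0 and g = 0, so the residual echo r − r' is not transmitted however far r' has drifted from r. The receive path carries no such gain, so R is never attenuated.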

The operation of this embodiment is now described with reference to the flowcharts of FIG. 2. FIG. 2(A) shows the processing performed when a transmission voice signal is sent from the user to the other party, and FIG. 2(B) shows the processing performed when a reception voice signal is received from the other party.

The flowchart of FIG. 2(A) is executed repeatedly by the signal processing unit 1 and starts at step S12.

In step S12, the echo cancellation unit 10 performs echo cancellation on the transmission voice signal (uttered voice S + echo component r) input from the microphone 5 to the signal processing unit 1. The echo cancellation performed here uses a known technique, as described above. The transmission voice signal from which the echo component has been removed (S + r − r') is output to the signal level adjustment unit 3, and the process proceeds to step S14.

In step S14, the signal level adjustment unit 3 scales the signal level of the transmission voice signal by a factor of 0 or 1 according to the instruction from the camera image processing unit 2, and the process proceeds to step S16. This amplification factor is adjusted successively by the separate amplification factor adjustment process shown in the flowchart of FIG. 3. The transmission voice signal whose signal level has been adjusted by the signal level adjustment unit 3 is output to the telephone interface unit 7 and also to the DSP 11. In step S18, the transmission voice signal input to the telephone interface unit 7 is sent from the telephone to the other party over the network.
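The transmit-side flow of FIG. 2(A) can be sketched per frame of samples. The function name `transmit_path` and its arguments are hypothetical; the pseudo echo is assumed to be supplied by whatever echo canceller is in use.

```python
def transmit_path(mic_frame, pseudo_echo, mouth_on):
    """One pass of the transmit flow for a frame of samples.

    S12: subtract the pseudo echo r' from the microphone signal S + r.
    S14: scale by 1 (ON) or 0 (OFF) as instructed by the camera side.
    The returned frame is what would be handed on for sending in S18.
    """
    cancelled = [m - e for m, e in zip(mic_frame, pseudo_echo)]  # S + r - r'
    gain = 1.0 if mouth_on else 0.0
    return [gain * c for c in cancelled]
```

With the gate OFF the output is silence regardless of how much residual echo remains after step S12.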

The flowchart of FIG. 2(B) is likewise executed repeatedly by the signal processing unit 1 and starts at step S22. In step S22, the echo cancellation unit 10 performs known echo cancellation on the reception voice signal (received voice R + echo component s) using the transmission voice signal, and the process proceeds to step S24.

In step S24, the voice signal from which the echo has been removed by the echo cancellation of step S22 (R + s − s') is output to the DSP 11 and to the speaker 4. In step S26, the speaker 4 converts the reception voice signal input from the signal processing unit 1 into sound and outputs it.

The flowchart of FIG. 3 shows the flow of the amplification factor adjustment process, which is executed repeatedly from the start of a call. In this process, step S160 first detects, as described above, whether the user's mouth is moving from the image data captured by the camera 6. If mouth movement is detected, step S162 results in YES and the process proceeds to step S164. If mouth movement is not detected, step S162 results in NO and the process proceeds to step S166. In step S164, the camera image processing unit 2 outputs the ON signal to the signal level adjustment unit 3 so that the amplification factor is set to 1. In step S166, the camera image processing unit 2 outputs the OFF signal to the signal level adjustment unit 3 so that the amplification factor is set to 0.

The following compares, with reference to FIG. 4, the operation of the embodiment described above with that of comparison technique A and comparison technique B in each of the situations obtained by classifying cases according to whether the user's mouth is moving, the user's speaking state, and the other party's speaking state.

Comparison technique A adjusts the amplification factors of the reception voice signal and the transmission voice signal according to the signal level of the transmission voice signal from the user and the signal level of the reception voice signal from the other party. Patent Document 1 gives no example of control that determines the respective amplification factors by taking both signal levels into account. Comparison technique A therefore defines, for convenience, the processing for the case in which both a transmission voice signal and a reception voice signal are input in Patent Document 1: it compares the signal levels of the two signals and treats the one with the higher signal level as the voice signal to be conveyed. That is, comparison technique A reduces the signal with the lower signal level and controls the gains so that the sum of the gains of the reception voice signal and the transmission voice signal is 1. When the signal levels of the transmission voice signal and the reception voice signal are equal, each gain is set to 0.5.
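The gain allocation of comparison technique A can be sketched as follows. The function name `comparison_a_gains`, the `reduced_gain` parameter, and the equality tolerance `tol` are assumptions; the description fixes only that the lower signal is reduced, that the gains sum to 1, and that equal levels give 0.5 each.

```python
def comparison_a_gains(tx_level, rx_level, reduced_gain=0.0, tol=1e-9):
    """Return (tx_gain, rx_gain) for comparison technique A.

    reduced_gain is the hypothetical gain given to the weaker signal;
    the stronger signal receives the remainder so the gains sum to 1.
    """
    if abs(tx_level - rx_level) <= tol:
        return 0.5, 0.5
    if tx_level > rx_level:
        return 1.0 - reduced_gain, reduced_gain
    return reduced_gain, 1.0 - reduced_gain
```

Under this rule a quiet voice on either side loses to a louder signal on the other side, which is the behaviour the embodiment avoids by keying the transmit gain to mouth movement instead of signal level.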

Comparison technique A is included in order to examine the difference between the effect of adjusting the amplification factor according to the signal levels of the transmission and reception voice signals and the effect of adjusting it according to mouth movement, as in this embodiment.

Comparison technique B applies only a known echo cancellation technique, such as the one described in Patent Document 2, to the signal processing unit 1b. It is included in order to compare the situations in which no echo arises with the known echo cancellation technique against the situations in which no echo arises in this embodiment.

In this embodiment, comparison technique A, and comparison technique B alike, the input/output sections through which the signal processing unit (1, 1a, 1b) connects to external devices such as the speaker 4 and the microphone 5 are provided with amplifiers that amplify the input (or output) signal level by a fixed amount. The element of interest, however, particularly in this embodiment and comparison technique A, is not these fixed-gain amplifiers but the amplifier whose amplification factor is adjusted according to the situation (that is, the signal level adjustment unit).

Next, the parameters in Table 1 are explained. The "mouth movement" column (yes/no) indicates whether the camera image processing unit 2 has detected movement of the user's mouth. In the "utterance" column, × denotes no speech, △ denotes speech in a low voice at about the noise level, and ○ denotes speech at a normal voice level (that is, volume). The "microphone" column shows how the level of the transmission voice signal is treated, and the "SP" (speaker) column shows how the level of the received voice signal is treated: 1 means the signal is passed unchanged, 0.5 means it is passed at half level, and 0 means it is blocked. These columns thus show how the amplification factor for each signal is set in each assumed situation.

In the "call" column, × marks a case in which adjusting the amplification factor reduces at least one of the transmission voice signal and the received voice signal from an audible level to an inaudible level (= 0); all other cases are marked ○. In the "echo" column, on the assumption that the pseudo echo signal does not remove the echo component completely and that the pseudo echo signal can deviate from the actual echo, a situation in which echo reaches the call partner or the user is marked "yes".

Because comparison technique A serves as the reference for examining the difference between the effect of adjusting the amplification factor according to the signal levels of the transmission and received voice signals and the effect of adjusting it according to mouth movement, as in the present embodiment, the echo column is omitted for comparison technique A. Likewise, because comparison technique B serves to compare the situations in which no echo arises under the known echo cancellation technique with those in which no echo arises under the present embodiment, all columns other than the echo column are omitted for comparison technique B.

In the following, Nos. 6 and 8 are described first; in these cases, comparing the present embodiment with comparison technique A shows a difference in whether a call remains possible after the amplification factor is adjusted. Nos. 11 and 12 are then described; in these cases, comparing the present embodiment with comparison technique B shows a difference in the influence of echo.

In No. 6, the camera image processing unit 2 detects movement of the user's mouth, the user speaks in a low voice at about the noise level, and the call partner speaks at a normal voice level. The operation of the present embodiment and of comparison technique A in this situation is described with reference to FIG. 4. Because Nos. 6 and 8 concern whether a call remains possible after the amplification factor is adjusted, comparison technique B, which corresponds to the conventional echo cancellation technique, is not discussed.

In comparison technique A, the signal level of the received voice signal R is higher than that of the transmission voice signal S by a large margin, so the received voice signal R is judged to be the signal that should be conveyed. In this case, as shown in FIG. 4, the amplification factor of the receiving-side signal level adjustment unit 3R becomes 1 and that of the transmitting-side signal level adjustment unit 3S becomes 0. As a result, the user's speech contained in the transmission voice signal no longer reaches the call partner. Strictly speaking, the amplification factor in comparison technique A does not always become 0, but further attenuating a low-voice signal at about the noise level makes it inaudible to the listener, so it is treated as 0 for convenience.

In the present embodiment, by contrast, the camera image processing unit 2 detects movement of the user's mouth, so the amplification factor on the speaking (transmitting) side is 1. Because the amplification factor is 1, the transmission voice signal is not attenuated, and the user's speech is conveyed to the call partner.

In No. 8, the camera image processing unit 2 detects movement of the user's mouth, the user speaks at a normal voice level, and the call partner speaks in a low voice at about the noise level. The operation of the present embodiment and of comparison technique A in this situation is described with reference to FIG. 5.

In comparison technique A, the signal level of the transmission voice signal S is higher than that of the received voice signal R by a large margin, so the transmission voice signal S is judged to be the signal that should be conveyed. In this case, as shown in FIG. 5, the amplification factor of the transmitting-side signal level adjustment unit 3S becomes 1 and that of the receiving-side signal level adjustment unit 3R becomes 0. As a result, the level of the received voice signal becomes 0, and the call partner's speech contained in the received voice signal no longer reaches the user. In the present embodiment, by contrast, the receiving-side amplification factor is not reduced, so the call partner's speech is conveyed to the user regardless of its voice level.

As described above, adjusting the transmitting-side amplification factor according to whether the user's mouth is moving, rather than according to the signal levels of the transmission and received voice signals, allows both parties' speech to be conveyed even when a received voice signal and a spoken voice signal are input simultaneously at different levels. That is, the present embodiment adjusts only the level of the transmission voice signal and does not reduce the level of the received voice signal; therefore, even if the received voice signal is at a low level and the user is speaking at the same time, the user can still hear the call partner. Moreover, even while a received voice signal is being input, if the camera image processing unit detects movement of the user's mouth, the amplification factor of the signal level adjustment unit is set to 1 or more, so the user's speech is conveyed to the call partner even if it is quiet.
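
The mouth-movement-based control summarized above can be sketched as follows for the situations of Nos. 6 and 8. This is only an illustrative sketch: the function names and the numeric levels `LOW` and `NORMAL` are hypothetical, and the binary gain of 0 or 1 follows the basic embodiment rather than its variations.

```python
def embodiment_gains(mouth_moving: bool) -> tuple[float, float]:
    """Return (tx_gain, rx_gain) for the present embodiment.

    Only the transmitting side is gated by mouth movement; the receiving
    side is never attenuated.
    """
    tx_gain = 1.0 if mouth_moving else 0.0
    rx_gain = 1.0
    return tx_gain, rx_gain


def apply_gains(tx_level: float, rx_level: float,
                mouth_moving: bool) -> tuple[float, float]:
    """Return the (tx, rx) levels after the signal level adjustment."""
    tx_gain, rx_gain = embodiment_gains(mouth_moving)
    return tx_gain * tx_level, rx_gain * rx_level


LOW, NORMAL = 0.1, 1.0  # hypothetical "low voice" and "normal" levels

no6 = apply_gains(LOW, NORMAL, mouth_moving=True)   # quiet user, normal partner
no8 = apply_gains(NORMAL, LOW, mouth_moving=True)   # normal user, quiet partner
print(no6, no8)
```

In both cases the two levels pass through unchanged, whereas a level-comparison rule would drive the quieter side to 0.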

However, with the signal level adjustment unit 3 and the camera image processing unit 2 alone, echo may be heard by the call partner while the amplification factor is 1. A configuration that, as in the present embodiment, suppresses the influence of echo by using an existing echo cancellation technique in addition to the signal level adjustment unit 3 is therefore preferable. A configuration that reduces the influence of noise by using a known noise cancellation technique may also be combined.

Nos. 11 and 12, described next, are situations in which no movement of the user's mouth is detected, the user is not speaking, and the call partner is speaking (in a low voice at about the noise level in No. 11, and at a normal voice level in No. 12). The operation of the present embodiment and of comparison technique B in these situations is described with reference to FIG. 6. Because the aim here is to compare the present embodiment with comparison technique B with respect to the influence of echo, comparison technique A is not discussed.

In comparison technique B, if the pseudo echo signal generated by the echo cancel unit 10a deviates even slightly from the actual echo component, for example because the echo path and its transfer function have changed, echo may reach the call partner.

In the present embodiment, by contrast, no movement of the user's mouth is detected, so the amplification factor is 0. The level of the transmission voice signal is therefore 0, and no echo is heard by the call partner regardless of the accuracy (or even the presence) of the pseudo echo signal.

Furthermore, in comparison technique B, noise arising around the microphone 5 (noise other than echo) that enters the microphone 5 is not removed by the echo cancel unit 10 and reaches the call partner. A known noise cancellation technique could additionally be used to reduce this noise, but, as with echo cancellation, if the pseudo noise signal used to cancel the noise deviates from the actual noise, the noise is heard by the call partner. In the present embodiment, setting the amplification factor to 0 while no movement of the user's mouth is detected keeps noise from reaching the call partner during that time as well.
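
The comparison for Nos. 11 and 12 can be summarized with a simple signal model. Apart from the echo component $r$, the symbols are assumptions introduced here for illustration: $s$ is the user's speech, $n$ is ambient noise picked up by the microphone 5, $\hat{r}$ is the pseudo echo signal, $g$ is the amplification factor of the signal level adjustment unit 3, and $y$ is the signal sent to the call partner. Comparison technique B is modeled as having no gate, which is taken to be equivalent to $g = 1$.

```latex
\begin{align}
y &= g\,\bigl(s + n + (r - \hat{r})\bigr) \\
\text{comparison technique B } (g = 1,\ s = 0):\quad
y &= n + (r - \hat{r}) \\
\text{present embodiment, no mouth movement } (g = 0):\quad
y &= 0
\end{align}
```

Under this model the residual $r - \hat{r}$ and the noise $n$ reach the call partner in comparison technique B whenever they are nonzero, while in the present embodiment they are multiplied by $g = 0$ for as long as no mouth movement is detected.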

With the above configuration, the receiving-side amplification factor is not adjusted, so both parties' speech can be conveyed simultaneously even when the call partner's voice is quiet and the user's voice is loud. Even when the call partner's voice is loud and the user's voice is quiet, the transmitting-side amplification factor is set to 1 whenever the camera image processing unit 2 detects movement of the user's mouth, so both parties' speech can still be conveyed.

In addition, when it is determined that the user's mouth is not moving, the amplification factor becomes 0 and the level of the transmission voice signal becomes 0, which reduces the possibility that echo or noise entering the microphone 5 reaches the call partner.

(Modification)
The present invention may also be configured as shown in FIG. 7 (a modification). FIG. 7 corresponds to FIG. 1 and is a block diagram showing the configuration of a hands-free call device according to the modification. Descriptions that overlap with FIG. 1 are omitted.

The buffer unit 8 stores the transmission voice signal from which the echo cancel unit 10 has removed or reduced the echo component r for a fixed time (hereinafter, the delay time) and then outputs it to the signal level adjustment unit 3. The delay time is set short enough not to feel unnatural to the user or the call partner (for example, several tens of milliseconds). The delay time may also be chosen to account for the time from the moment the user starts speaking until the camera image processing unit 2 detects the accompanying mouth movement and the amplification factor of the signal level adjustment unit 3 is set to 1.

Detecting mouth movement requires the camera image processing unit 2 to analyze the images captured by the camera 6. At the beginning of the user's speech, the detection may therefore not finish in time, leaving the amplification factor at 0 and cutting off the start of the utterance.

With the buffer unit 8 of this modification, however, the time from when speech is picked up by the microphone 5 until it reaches the signal level adjustment unit 3 is lengthened by the time the signal spends in the buffer unit 8. The portion missing from the speech is reduced by the amount of this delay, which lowers the chance that the call partner perceives something unnatural. Moreover, if the camera image processing unit 2 can detect the mouth movement and set the amplification factor to 1 between the time the start of the utterance enters the buffer unit 8 and the time it is output, the start of the utterance is conveyed to the call partner without loss.
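
The role of the buffer unit 8 can be illustrated with a frame-based sketch. Everything here is an assumption made for the example: the signal is modeled as one number per frame, the detector output is a list of booleans that lags the speech onset, and the names `transmit`, `frames`, `mouth_flags`, and `delay_frames` are hypothetical.

```python
from collections import deque


def transmit(frames, mouth_flags, delay_frames):
    """Delay each frame by `delay_frames`, then gate it with the detector flag.

    frames[i] is the echo-cancelled microphone frame at time i, and
    mouth_flags[i] is the mouth-movement decision available at time i.
    """
    buffer = deque([0.0] * delay_frames)  # buffer unit: fixed-length delay line
    output = []
    for frame, moving in zip(frames, mouth_flags):
        buffer.append(frame)
        delayed = buffer.popleft()
        gain = 1.0 if moving else 0.0     # signal level adjustment unit
        output.append(gain * delayed)
    return output


# Speech occupies frames 2-4; the detector reacts two frames late.
frames = [0.0, 0.0, 5.0, 6.0, 7.0, 0.0, 0.0, 0.0]
flags = [False, False, False, False, True, True, True, True]

print(transmit(frames, flags, delay_frames=0))  # onset frames 5.0 and 6.0 are lost
print(transmit(frames, flags, delay_frames=2))  # the whole utterance gets through
```

When the delay is at least as long as the detection latency, the first frames of the utterance are still in the buffer when the gain switches to 1, so nothing is cut off.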

In the present embodiment, the amplification factor is switched from 1 to 0 when the camera image processing unit 2 determines that there has been no mouth movement for a predetermined number of consecutive frames, but the control is not limited to this. The amplification factor may instead be switched from 1 to 0 a predetermined time (for example, a few seconds) after that determination is made.

This reduces the possibility that, when the user speaks haltingly, for example while thinking about what to say, the amplification factor flips between 0 and 1 at short intervals and the beginning of each utterance is lost as a result.
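
The delayed switch from 1 to 0 can be sketched as a gate with a hold period. This is a simplified illustration under stated assumptions: the hold time is counted in frames rather than seconds, and the names `gate_with_hold`, `off_frames`, and `hold_frames` are hypothetical.

```python
def gate_with_hold(mouth_flags, off_frames, hold_frames=0):
    """Return the transmit gain for each frame.

    The gain switches to 1 as soon as mouth movement is detected. It drops
    back to 0 only after `off_frames` consecutive frames without movement
    plus an additional `hold_frames` frames.
    """
    gains = []
    gain = 0.0
    still = 0  # consecutive frames without mouth movement
    for moving in mouth_flags:
        if moving:
            still = 0
            gain = 1.0
        else:
            still += 1
            if still >= off_frames + hold_frames:
                gain = 0.0
        gains.append(gain)
    return gains


# Halting speech: a two-frame pause in the middle of an utterance.
flags = [True, False, False, True, False, False, False, False]

print(gate_with_hold(flags, off_frames=2))                 # gain drops during the pause
print(gate_with_hold(flags, off_frames=2, hold_frames=1))  # gain stays at 1 across the pause
```

A longer hold keeps the gate open across short pauses at the cost of passing echo or noise for a little longer after the user stops speaking.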

In the above configuration, the amplification factor of the signal level adjustment unit 3 takes the two values 0 and 1, but it is not limited to these. The amplification factor may be set to a value smaller than 1 when no movement of the user's mouth is detected and to a value of 1 or more when movement is detected. With a value smaller than 1, the transmission voice signal is attenuated but still conveyed to the call partner while no mouth movement is detected. In this case, noise and echo entering the microphone 5 may still reach the call partner, but their influence is reduced by the amount of attenuation applied by the signal level adjustment unit 3.

The amplification factor of the signal level adjustment unit 3 may also be changed from 0 to 1 (or from 1 to 0) continuously or in several small increments rather than in a single step. It may further take intermediate values between 0 and 1, such as 0.4, 0.5, or 0.6.
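
A gradual change of the amplification factor can be sketched as a ramp that moves the gain toward its target by a fixed increment per frame. The increment `step`, the per-frame update, and the function names are assumptions chosen for this illustration; the source does not specify how the change is scheduled.

```python
def ramp_gain(current: float, target: float, step: float) -> float:
    """Move `current` toward `target` by at most `step`, without overshoot."""
    if current < target:
        return min(current + step, target)
    return max(current - step, target)


def ramp_sequence(start: float, target: float, step: float, n: int) -> list[float]:
    """Return the gain over `n` successive frames, starting from `start`."""
    gains = []
    gain = start
    for _ in range(n):
        gain = ramp_gain(gain, target, step)
        gains.append(gain)
    return gains


# Mouth movement detected: the gain rises from 0 to 1 in increments of 0.25.
print(ramp_sequence(0.0, 1.0, 0.25, 5))
# Mouth movement ended: the gain falls back to 0 in the same increments.
print(ramp_sequence(1.0, 0.0, 0.25, 5))
```

A smaller `step` approximates a continuous change, while a larger one gives a coarse staircase.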

Furthermore, where face image data of the call partner is available, as in a videophone call, a processing unit that performs the same processing as the camera image processing unit of the present invention may be applied to that face image data, and the level of the received voice signal may be adjusted based on its detection result. With such a configuration, setting the receiving-side signal level to 0 while no movement of the call partner's mouth is detected suppresses echo and noise from the call partner's side that would otherwise reach the user.

The camera image processing unit 2 may also output an ON signal to the signal level adjustment unit 3 at the moment it detects, from the movement of feature points on and around the mouth and from the mouth-movement patterns that immediately precede speech, that the user is about to start speaking.

The present invention is not limited to the embodiments described above; various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining the disclosed technical means are also included in the technical scope of the present invention.

DESCRIPTION OF SYMBOLS 100 ... hands-free call device, 1 ... signal processing unit, 2 ... camera image processing unit, 3 ... signal level adjustment unit, 4 ... speaker, 5 ... microphone, 6 ... camera, 8 ... buffer unit, 10 ... echo cancel unit, R ... received voice (signal corresponding to the received voice), S ... spoken voice (signal corresponding to the spoken voice)

Claims

A hands-free call device (100) comprising:
a camera (1) that captures the user's face;
a camera image processing unit (2) that detects movement of the user's mouth from image data captured by the camera; and
a signal level adjustment unit (3) that adjusts the signal level of a transmission voice signal transmitted from the user to a call partner by changing an amplification factor continuously or stepwise,
wherein the signal level adjustment unit is configured so that the amplification factor can be set to a value between 0 and 1, sets the amplification factor to a value smaller than 1 when the camera image processing unit does not detect movement of the mouth,
and sets the amplification factor of the transmission voice signal to a value of 1 or more when the camera image processing unit detects movement of the mouth.

The hands-free call device according to claim 1,
wherein the signal level adjustment unit sets the amplification factor to 0 when the camera image processing unit does not detect mouth movement, and sets the amplification factor of the transmission voice signal to 1 when the camera image processing unit detects mouth movement.

The hands-free call device according to claim 1 or 2,
wherein a signal corresponding to sound collected by a microphone (5) is input as the transmission voice signal,
the hands-free call device further comprising a buffer unit (8) that is located between the microphone and the signal level adjustment unit, stores the transmission voice signal for a fixed time, and then outputs it to the signal level adjustment unit.

The hands-free call device according to any one of claims 1 to 3,
wherein the amplification factor of the signal level adjustment unit is set to 0 when the camera image processing unit has not detected mouth movement for a fixed time.

The hands-free call device according to any one of claims 1 to 4,
further comprising an echo cancel unit (10) that generates, based on the received voice signal and the echo component of the received voice signal mixed into the transmission voice signal, a pseudo echo signal estimated from the received voice signal, and cancels the echo component by subtracting the pseudo echo signal from the transmission voice signal.

The hands-free call device according to any one of claims 1 to 5,
further comprising an echo cancel unit (10) that generates, based on the transmission voice signal and the echo component of the transmission voice signal mixed into the received voice signal, a pseudo echo signal estimated from the transmission voice signal, and cancels the echo component by subtracting the pseudo echo signal from the received voice signal.