JP2023088360A

JP2023088360A - Video call device, video call method, and control program of video call device

Info

Publication number: JP2023088360A
Application number: JP2021203015A
Authority: JP
Inventors: 俊一藤井; Shunichi Fujii; 裕一新幡; Yuichi ARAHATA; 貴之荒瀬; Takayuki Arase; 智至吉村; Satoshi Yoshimura
Original assignee: JVCKenwood Corp
Current assignee: JVCKenwood Corp
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2023-06-27

Abstract

To provide a video call device capable of preventing or reducing the transmission, to a call destination, of voice that is not intended for that destination and would otherwise be heard there.

SOLUTION: A camera 1 captures a speaker making a video call. A microphone 6 picks up voice uttered by the speaker. A communication unit 5 transmits the captured image of the speaker captured by the camera 1 and the voice of the speaker picked up by the microphone 6 to a network 20. An image analysis unit 3 detects a line of sight of the speaker on the basis of the captured image of the speaker captured by the camera 1, and detects whether the line of sight is off the camera 1. A voice control unit 4 controls the communication unit 5 so as to cut off the transmission of the voice or reduce the volume of the voice, at least when the image analysis unit 3 detects that the line of sight of the speaker is off the camera 1.

SELECTED DRAWING: Figure 1

Description

本発明は、ビデオ通話装置、ビデオ通話方法、及びビデオ通話装置の制御プログラムに関する。 The present invention relates to a video call device, a video call method, and a video call device control program.

As described in Patent Literature 1, video call systems (teleconference systems) that communicate images and voice bidirectionally over a network have become widespread.

Patent Literature 1: JP 2012-100185 A

Suppose a participant on a first side is talking to a participant on a second side using a video call system. A murmur that the speaker on the first side intends only for the participants present on the first side may be picked up by the microphone and transmitted to the participants on the second side. It is undesirable for voice such as murmurs, which is not intended for the call destination, to be transmitted to that destination.

An object of the present invention is to provide a video call device, a video call method, and a control program for a video call device that can prevent or reduce the transmission of voice not intended for the call destination to that destination, where it would be heard.

The present invention provides a video call device comprising: a camera that photographs a speaker making a video call; a microphone that picks up the voice uttered by the speaker; a communication unit that sends the captured image of the speaker photographed by the camera and the voice of the speaker picked up by the microphone to a network; an image analysis unit that detects the line of sight of the speaker based on the captured image of the speaker photographed by the camera and detects whether the line of sight is off the camera; and a voice control unit that controls the communication unit so as to cut off the transmission of the voice or reduce the volume of the voice, at least when the image analysis unit detects that the line of sight of the speaker is off the camera.

The present invention provides a video call device comprising: a camera that photographs a speaker making a video call; a microphone that picks up the voice uttered by the speaker; an image analysis unit that analyzes the captured image of the speaker photographed by the camera; a voice analysis unit that analyzes the voice of the speaker picked up by the microphone; a communication unit that sends the captured image of the speaker photographed by the camera and the voice of the speaker picked up by the microphone to a network; a recording unit that records the captured image and the voice picked up by the microphone; a learning unit that learns by associating a section, designated by the speaker during playback of the captured image and voice of a past video call recorded in the recording unit, in which voice that should not have been sent to the network by the communication unit is uttered, with the analysis result of the reproduced captured image analyzed by the image analysis unit and the analysis result of the reproduced voice analyzed by the voice analysis unit, and that, during a new video call, extracts a section in which voice that should not be sent to the network by the communication unit is uttered, based on the learning result, the analysis result of the captured image of the new video call analyzed by the image analysis unit, and the analysis result of the voice of the new video call analyzed by the voice analysis unit; and a voice control unit that controls the communication unit so as to cut off the transmission of the voice, or reduce the volume of the voice, in the section extracted by the learning unit in which voice that should not be sent to the network is uttered.

The present invention provides a video call method in which: a camera photographs a speaker making a video call; a microphone picks up the voice uttered by the speaker; an image analysis unit detects the line of sight of the speaker based on the captured image of the speaker photographed by the camera and detects whether the line of sight is off the camera; if the image analysis unit does not detect that the line of sight of the speaker is off the camera, a communication unit sends the captured image of the speaker photographed by the camera and the voice of the speaker picked up by the microphone to a network; and, at least if the image analysis unit detects that the line of sight of the speaker is off the camera, the transmission of the voice by the communication unit is cut off or the volume of the voice is reduced.

The present invention provides a control program for a video call device that causes a computer to execute: a step of detecting the line of sight of a speaker making a video call, based on a captured image of the speaker photographed by a camera, and detecting whether the line of sight is off the camera; a step of causing a communication unit to send the captured image of the speaker photographed by the camera and the voice of the speaker picked up by a microphone to a network if it is not detected that the line of sight of the speaker is off the camera; and a step of cutting off the transmission of the voice by the communication unit or reducing the volume of the voice, at least if it is detected that the line of sight of the speaker is off the camera.

According to the video call device, the video call method, and the control program for a video call device of the present invention, it is possible to prevent or reduce the transmission of voice not intended for the call destination to that destination, where it would be heard.

FIG. 1 is a block diagram showing the video call device of the first embodiment.
FIG. 2 is a flowchart showing the operation of the video call device of the first embodiment and the video call method of the first embodiment.
FIG. 3 is a block diagram showing the video call device of the second embodiment.
FIG. 4 is a flowchart showing the operation of the video call device of the second embodiment and the video call method of the second embodiment.
FIG. 5 is a block diagram showing the video call device of the third embodiment.

以下、各実施形態のビデオ通話装置、ビデオ通話方法、及びビデオ通話装置の制御プログラムについて、添付図面を参照して説明する。 A video call device, a video call method, and a control program for the video call device of each embodiment will be described below with reference to the accompanying drawings.

<First embodiment>
In FIG. 1, the video call device 101 of the first embodiment, the video call server 30, and the video call device 40, which are connected to one another via the network 20, constitute a video call system. Via the network 20 and the video call server 30, the video call device 101 transmits image data and voice data to the video call device 40 and receives image data and voice data from the video call device 40. FIG. 1 shows the state in which image data and voice data are transmitted from the video call device 101 to the video call device 40.

The video call device 101 includes a camera 1, a temporary storage memory 2, an image analysis unit 3, a voice control unit 4, a communication unit 5, a microphone 6, and a temporary storage memory 7. The camera 1 photographs the speaker making the video call, who is the user of the video call device 101. Several people, including the speaker, may be present where the video call device 101 is installed. The temporary storage memory 2 temporarily stores the captured image data output from the camera 1. The image analysis unit 3 analyzes the captured image as described later. The voice control unit 4 may cut off the transmission of voice data to the network 20 by the communication unit 5. Instead of cutting off the transmission of the voice data, the voice control unit 4 may reduce the volume of the voice data to be transmitted.

The microphone 6 picks up the voice uttered by the speaker. The temporary storage memory 7 temporarily stores the voice data output from the microphone 6. Unless the voice control unit 4 is controlling the communication unit 5 to cut off the transmission of voice data, the communication unit 5 transmits the captured image data and the voice data to the video call server 30 via the network 20. Typically, the network 20 is the Internet, and the communication unit 5 transmits the captured image data and voice data according to the Internet Protocol. The video call device 40 receives the captured image data and voice data from the video call server 30 via the network 20.

In the video call system configured as described above, the video call device 101 is placed on the first side and the video call device 40 is placed on the second side. Suppose a speaker, one of the participants on the first side, is talking to the participants on the second side. When the speaker on the first side utters murmur-like voice intended only for the participants present on the first side, rather than voice intended for the participants on the second side, the speaker often turns to face the participants present and speaks in a low voice.

The first embodiment is configured to determine whether voice is intended for the participants on the second side according to whether the line of sight of the speaker is directed at the camera 1. Specifically, the image analysis unit 3 detects the line of sight of the speaker based on the captured image represented by the input captured image data. The image analysis unit 3 detects whether the line of sight is off the camera 1. When the image analysis unit 3 detects that the line of sight is off the camera 1, it notifies the voice control unit 4 that the line of sight is off the camera.

When notified by the image analysis unit 3 that the line of sight is off the camera, the voice control unit 4 controls the communication unit 5 to cut off the transmission of the voice data to the network 20. In this case, the voice control unit 4 functions as a cutoff control unit that cuts off the transmission of voice data. As a result, even if the speaker turns to face a participant who is present and utters voice, such as a murmur, that is not intended for the call destination, the voice is not transmitted to the call destination, so it is prevented from being inadvertently heard by the participants on the second side.

When notified by the image analysis unit 3 that the line of sight is off the camera, the voice control unit 4 may instead control the communication unit 5 to reduce the volume of the voice data. In this case, the voice control unit 4 functions as a volume control unit that lowers the volume. Voice such as murmurs that is not intended for the call destination is usually uttered at a low volume to begin with. Therefore, even if the communication unit 5 transmits such voice, its volume is low enough that the participants on the second side cannot hear it. As a result, even if the speaker turns to face a participant who is present and utters voice, such as a murmur, that is not intended for the call destination, only voice data of extremely low volume is transmitted, which reduces the chance of it being inadvertently heard by the participants on the second side.
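The two control modes described for the first embodiment (cutting off the voice data, or lowering its volume, while the line of sight is off the camera) can be sketched as follows. This is only an illustrative sketch: the function names, the representation of voice data as a list of PCM sample values, and the -40 dB attenuation are assumptions and are not specified in this publication.

```python
def attenuate(samples, gain_db=-40.0):
    """Scale PCM sample values by a gain in decibels (assumed value: -40 dB)."""
    gain = 10.0 ** (gain_db / 20.0)
    return [int(round(s * gain)) for s in samples]


def apply_voice_control(samples, gaze_off_camera, mode="cut"):
    """Return the voice data to hand to the communication unit.

    gaze_off_camera: notification from the image analysis (hypothetical input).
    mode: "cut" blocks transmission (None is returned, nothing is sent);
          "reduce" sends the chunk at a lowered volume.
    """
    if not gaze_off_camera:
        return list(samples)      # send the voice data unchanged
    if mode == "cut":
        return None               # cutoff control: no voice data is sent
    return attenuate(samples)     # volume control: send attenuated voice data
```

Returning `None` for the cutoff case is one possible design; an implementation could equally send silence so that the receiving side sees a continuous stream.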

Unless notified by the image analysis unit 3 that the line of sight is off the camera, the voice control unit 4 neither controls the communication unit 5 to cut off the transmission of the voice data to the network 20 nor controls it to reduce the volume of the voice data. The communication unit 5 therefore sends the captured image data and the voice data to the network 20 as they are.

図２に示すフローチャートを用いて、ビデオ通話装置１０１の動作及び第１実施形態のビデオ通話方法を説明する。図２に示すフローチャートのステップＳ２～Ｓ５、Ｓ１１、Ｓ１２は、第１実施形態のビデオ通話装置の制御プログラムが、ビデオ通話装置１０１が備えるコンピュータに実行させる処理を示してもよい。 The operation of the video call device 101 and the video call method of the first embodiment will be described using the flowchart shown in FIG. Steps S2 to S5, S11, and S12 in the flowchart shown in FIG. 2 may represent processing that the control program for the video call device of the first embodiment causes the computer included in the video call device 101 to execute.

In FIG. 2, when video call processing by the video call system starts, the camera 1 acquires captured image data of the speaker in step S1. In step S2, the image analysis unit 3 determines whether the line of sight of the speaker has been detected. If the line of sight has not been detected (NO), the image analysis unit 3 generates, in step S4, a murmur detection value "L" based on the captured image, which indicates that no murmur-like voice has been detected, and supplies it to the voice control unit 4. The process then proceeds to step S12. "L" is, for example, "0".

ステップＳ２にて、話者の視線を検知していれば（YES）、画像解析部３は、ステップＳ３にて、視線がカメラ１から外れたか否かを判定する。視線がカメラ１から外れなければ（NO）、画像解析部３は、ステップＳ４にて、撮影画像に基づく呟き検出“Ｌ”を生成して音声制御部４に供給する。その後、処理はステップＳ１２に移行される。上記のように、第１実施形態においては、視線がカメラ１から外れていない状態を呟きのような音声の発生を検出していない状態とみなしている。 If the line of sight of the speaker has been detected in step S2 (YES), the image analysis unit 3 determines whether or not the line of sight has left the camera 1 in step S3. If the line of sight does not deviate from the camera 1 (NO), the image analysis unit 3 generates murmur detection "L" based on the captured image and supplies it to the voice control unit 4 in step S4. After that, the process proceeds to step S12. As described above, in the first embodiment, the state in which the line of sight is not deviated from the camera 1 is regarded as the state in which no sound such as murmuring is detected.

If the line of sight is off the camera 1 in step S3 (YES), the image analysis unit 3 generates, in step S5, a murmur detection value "H" based on the captured image, which indicates that murmur-like voice has been detected, and supplies it to the voice control unit 4. "H" is, for example, "1". Next, in step S11, the voice control unit 4 cuts off the transmission of the voice data. The voice control unit 4 may instead reduce the volume of the voice data in step S11. The process then proceeds to step S12.

ビデオ通話装置１０１は、ステップＳ１２にて、ビデオ通話を終了する指示がなされたか否かを判定する。ビデオ通話を終了する指示がなされなければ（NO）、ビデオ通話装置１０１は、ステップＳ１～Ｓ１２の処理を繰り返す。ビデオ通話を終了する指示がなされれば（YES）、ビデオ通話装置１０１はビデオ通話の処理を終了させる。 In step S12, the video call device 101 determines whether or not an instruction to end the video call has been issued. If no instruction to end the video call is given (NO), the video call device 101 repeats the processing of steps S1 to S12. If an instruction to end the video call is issued (YES), the video call device 101 ends the processing of the video call.
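The flow of FIG. 2 can be summarized in a short sketch. It assumes a hypothetical gaze estimator that reports `None` when no line of sight is detected, `True` when the gaze is on the camera, and `False` when it is off the camera; the publication does not specify how the line of sight is estimated, and the function names here are illustrative only.

```python
from typing import Optional

L, H = 0, 1  # example values for murmur detection "L" and "H"


def image_murmur_flag(gaze_on_camera: Optional[bool]) -> int:
    """Steps S2 to S5: derive the image-based murmur detection value."""
    if gaze_on_camera is None:      # S2: no line of sight detected -> "L" (S4)
        return L
    if gaze_on_camera:              # S3: gaze is on the camera -> "L" (S4)
        return L
    return H                        # S3: gaze is off the camera -> "H" (S5)


def run_call(frames):
    """Process (gaze_on_camera, voice_chunk) pairs until the input ends.

    The end of the input stands in for the end-of-call check in step S12.
    Returns the voice chunks that would be sent; None marks a cut-off chunk (S11).
    """
    sent = []
    for gaze_on_camera, voice_chunk in frames:
        flag = image_murmur_flag(gaze_on_camera)
        sent.append(None if flag == H else voice_chunk)
    return sent
```

For example, `run_call([(True, "a"), (False, "b"), (None, "c")])` keeps the first and third chunks and cuts off the second.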

以上のようにして、第１実施形態によれば、通話先に送信することを意図してない音声が通話先に送信されて聞かれることを防止または低減させることができる。 As described above, according to the first embodiment, it is possible to prevent or reduce the possibility that voices not intended to be transmitted to the called party are transmitted to and heard by the called party.

<Second embodiment>
In the video call device 102 of the second embodiment shown in FIG. 3, parts identical to those of the video call device 101 are given the same reference numerals, and their description may be omitted. The video call device 102 includes a voice analysis unit 8, which the video call device 101 does not have. The voice control unit 4 may cut off the transmission of voice data to the network 20 by the communication unit 5 based on both the analysis result of the image analysis unit 3 and the analysis result of the voice analysis unit 8. Alternatively, the voice control unit 4 may reduce the volume of the voice data to be transmitted based on both analysis results.

The voice analysis unit 8 determines whether the sound pressure level of the input voice data is at or below a predetermined threshold. The voice analysis unit 8 also applies a discrete Fourier transform to the input voice data. Typically, the voice analysis unit 8 uses a fast Fourier transform (FFT) algorithm to compute the discrete Fourier transform of the voice data.

Voice intended for the participants on the second side is voiced, whereas murmur-like voice intended only for the participants present on the first side is often unvoiced. Voiced and unvoiced sounds occupy different frequency bands. Based on the frequencies of the frequency-domain data obtained by applying the discrete Fourier transform to the time-domain voice data, the voice analysis unit 8 determines whether the voice is voiced or unvoiced, that is, whether it is voice intended for the participants on the second side or murmur-like voice.
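As a rough illustration of a frequency-based voiced/unvoiced decision, the sketch below transforms one frame to the frequency domain and checks how much of its energy lies in a low band. The publication does not state which frequency bands or criteria are used, so the 1000 Hz cutoff, the 0.5 energy ratio, and all function names are assumptions. A naive discrete Fourier transform is used for brevity; an FFT computes the same values faster.

```python
import cmath
import math


def dft_power(samples):
    """Power of DFT bins 1..n/2 (DC excluded), computed with a naive DFT."""
    n = len(samples)
    powers = []
    for k in range(1, n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        powers.append(abs(acc) ** 2)
    return powers


def is_unvoiced_frame(samples, sample_rate, cutoff_hz=1000.0, max_low_ratio=0.5):
    """Treat a frame as unvoiced when little of its energy lies below cutoff_hz.

    The cutoff and ratio are illustrative assumptions, not values from the source.
    """
    n = len(samples)
    powers = dft_power(samples)
    total = sum(powers)
    if total == 0.0:
        return False  # silence is not classified as unvoiced speech
    low = sum(p for k, p in enumerate(powers, start=1)
              if k * sample_rate / n < cutoff_hz)
    return low / total < max_low_ratio


def sine(freq_hz, sample_rate=8000, n=80):
    """Test signal: n samples of a sine wave at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]
```

With these assumed settings, a 200 Hz tone is classified as voiced and a 3000 Hz tone as unvoiced; real speech would need a criterion tuned on recorded data.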

The voice analysis unit 8 preferably determines that the voice picked up by the microphone 6 is murmur-like when the sound pressure level of the input voice data before the discrete Fourier transform is at or below a predetermined threshold and the data after the discrete Fourier transform lies at unvoiced frequencies. Alternatively, the voice analysis unit 8 may skip the sound pressure level check and determine that the voice picked up by the microphone 6 is murmur-like whenever the frequency of the frequency-domain data is an unvoiced frequency. The voice analysis unit 8 notifies the voice control unit 4 of the analysis result as to whether the voice picked up by the microphone 6 is voiced or unvoiced. To avoid mistaking foreign-language words that contain unvoiced consonants for murmurs, the voice analysis unit 8 may determine that the voice is unvoiced only when unvoiced sound accounts for at least a predetermined proportion.
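The decision rule in this paragraph (a level check combined with an unvoiced check, with a minimum unvoiced proportion to avoid false positives on unvoiced consonants) could look roughly like the sketch below. The sound pressure level is approximated here by the RMS level of the digital samples in dBFS, which is an assumption, as are the -40 dB threshold, the 0.8 proportion, and the function names. The per-frame unvoiced labels are taken as given input.

```python
import math


def rms_level_dbfs(samples, full_scale=32768.0):
    """RMS level relative to full scale, in dB (stand-in for sound pressure level)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / full_scale) if rms > 0 else float("-inf")


def is_murmur(frame_is_unvoiced, level_db, level_threshold_db=-40.0,
              min_unvoiced_ratio=0.8, check_level=True):
    """Decide whether a stretch of voice is murmur-like.

    frame_is_unvoiced: per-frame booleans from a voiced/unvoiced classifier.
    check_level=False corresponds to the variant that skips the level check.
    """
    if not frame_is_unvoiced:
        return False
    ratio = sum(frame_is_unvoiced) / len(frame_is_unvoiced)
    unvoiced = ratio >= min_unvoiced_ratio   # predetermined-proportion rule
    if check_level:
        return level_db <= level_threshold_db and unvoiced
    return unvoiced
```

A word with a few unvoiced consonants yields a low unvoiced proportion and is therefore not flagged, while a fully whispered phrase at low level is.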

When the voice control unit 4 is notified by the image analysis unit 3 that the line of sight is off the camera and is also notified by the voice analysis unit 8 that the voice picked up by the microphone 6 is unvoiced, it controls the communication unit 5 to cut off the transmission of the voice data to the network 20. As a result, even if the speaker turns to face a participant who is present and utters voice, such as a murmur, that is not intended for the call destination, the voice is not transmitted to the call destination, so it is prevented from being inadvertently heard by the participants on the second side.

When the voice control unit 4 is notified by the image analysis unit 3 that the line of sight is off the camera and is also notified by the voice analysis unit 8 that the voice picked up by the microphone 6 is unvoiced, it may instead control the communication unit 5 to reduce the volume of the voice data. As a result, even if the speaker turns to face a participant who is present and utters voice, such as a murmur, that is not intended for the call destination, only voice data of extremely low volume is transmitted, which reduces the chance of it being inadvertently heard by the participants on the second side.

If the voice control unit 4 is not notified by the image analysis unit 3 that the line of sight is off the camera, or is not notified by the voice analysis unit 8 that the voice picked up by the microphone 6 is unvoiced, it does not control the communication unit 5 to cut off the transmission of the voice data to the network 20. Nor does it control the communication unit 5 to reduce the volume of the voice data. The communication unit 5 therefore sends the captured image data and the voice data to the network 20 as they are.

The operation of the video call device 102 and the video call method of the second embodiment will be described using the flowchart shown in FIG. 4. Steps S2 to S5, S7 to S11, and S20 to S22 of the flowchart shown in FIG. 4 may represent processing that the control program for the video call device of the second embodiment causes the computer included in the video call device 102 to execute.

In FIG. 4, steps S1 to S5 are the same as steps S1 to S5 shown in FIG. 2. In step S6, the microphone 6 acquires voice data of the voice uttered by the speaker. In step S7, the voice analysis unit 8 applies the discrete Fourier transform to the voice data. In step S8, the voice analysis unit 8 determines, based on the voice data before the discrete Fourier transform, whether the sound pressure level is at or below the threshold. If the sound pressure level is not at or below the threshold (NO), the voice analysis unit 8 generates, in step S10, a murmur detection value "L" based on the voice and supplies it to the voice control unit 4. The process then proceeds to step S20.

If the sound pressure level is at or below the threshold in step S8 (YES), the voice analysis unit 8 determines in step S9, based on the frequency-domain data after the discrete Fourier transform, whether the frequency indicates unvoiced sound. If the frequency does not indicate unvoiced sound (NO), the voice analysis unit 8 generates, in step S10, a murmur detection value "L" based on the voice and supplies it to the voice control unit 4. If the frequency indicates unvoiced sound (YES), the voice analysis unit 8 generates, in step S11, a murmur detection value "H" based on the voice and supplies it to the voice control unit 4. The process proceeds from step S10 or S11 to step S20.

ステップＳ８を省略して、ステップＳ９の判定のみでステップＳ１０とステップＳ１１とを選択してもよい。 Step S8 may be omitted and step S10 and step S11 may be selected only by the determination of step S9.

In step S20, the voice control unit 4 determines whether the murmur detection based on the captured image and the murmur detection based on the voice are both "H". If both are "H" (YES), the voice control unit 4 cuts off the transmission of the voice data in step S21. The voice control unit 4 may instead reduce the volume of the voice data in step S21. The process then proceeds to step S22. If the two detection values are not both "H" (NO), the process proceeds from step S20 directly to step S22.

ビデオ通話装置１０２は、ステップＳ２２にて、ビデオ通話を終了する指示がなされたか否かを判定する。ビデオ通話を終了する指示がなされなければ（NO）、ビデオ通話装置１０２は、ステップＳ１～Ｓ２２の処理を繰り返す。ビデオ通話を終了する指示がなされれば（YES）、ビデオ通話装置１０２はビデオ通話の処理を終了させる。 In step S22, the video call device 102 determines whether or not an instruction to end the video call has been issued. If no instruction to end the video call is issued (NO), the video call device 102 repeats the processing of steps S1 to S22. If an instruction to end the video call is given (YES), the video call device 102 ends the processing of the video call.
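The combined decision of steps S20 and S21 amounts to a logical AND of the two murmur detection values. The following sketch assumes the example values "L" = 0 and "H" = 1 given in the text; the function names and the use of `None` to mark a cut-off chunk are illustrative assumptions.

```python
L, H = 0, 1  # example values for murmur detection "L" and "H"


def suppress_voice(image_flag, voice_flag):
    """Step S20: suppress only when both detections are "H"."""
    return image_flag == H and voice_flag == H


def filter_call(frames):
    """Apply steps S20 and S21 to (image_flag, voice_flag, voice_chunk) triples.

    The end of the input stands in for the end-of-call check in step S22.
    None marks a chunk whose transmission is cut off.
    """
    sent = []
    for image_flag, voice_flag, voice_chunk in frames:
        sent.append(None if suppress_voice(image_flag, voice_flag) else voice_chunk)
    return sent
```

Because both conditions must hold, a speaker who merely looks away while speaking normally, or who speaks quietly while facing the camera, is still transmitted.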

以上のようにして、第２実施形態によれば、通話先に送信することを意図してない音声が通話先に送信されて聞かれることを防止または低減させることができる。第２実施形態によれば、話者が同席している参加者に顔を向けただけで、実際には呟き等の音声を発していない場合には、音声データの送出は遮断されず、実際に呟き等の音声を発したときに音声データの送出を遮断することができる。 As described above, according to the second embodiment, it is possible to prevent or reduce the possibility that voices not intended to be transmitted to the called party are transmitted to and heard by the called party. According to the second embodiment, when the speaker only turns his/her face to the participant who is present and does not actually utter a voice such as muttering, the transmission of the voice data is not interrupted. It is possible to cut off the transmission of voice data when a voice such as muttering is uttered.

<Third embodiment>
In the video call device 103 of the third embodiment shown in FIG. 5, parts identical to those of the video call device 102 are given the same reference numerals, and their description may be omitted. The video call device 103 includes a recording unit 10, a learning unit 11, a display unit 12, a speaker 13, and an operation unit 14, which the video call device 102 does not have. The display unit 12, the speaker 13, and the operation unit 14 may be external to the video call device 103.

The recording unit 10 records the captured image data output from the temporary storage memory 2 and the voice data output from the temporary storage memory 7. For example, after a video call ends, the speaker operates the operation unit 14 to reproduce the captured image data and voice data of the past video call recorded in the recording unit 10. The display unit 12 displays the captured image based on the captured image data being reproduced. The speaker 13 outputs the voice based on the voice data being reproduced.

While the captured image data and voice data recorded in the recording unit 10 are being reproduced, the image analysis unit 3 analyzes the captured image of the reproduced captured image data, and the voice analysis unit 8 analyzes the voice of the reproduced voice data. The image analysis unit 3 detects the line of sight of the speaker. The voice analysis unit 8 either determines, based on the frequency of the voice, whether the voice is voiced or unvoiced, or determines the sound pressure level of the voice and additionally determines, based on the frequency, whether the voice is voiced or unvoiced.
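The two quantities examined by the voice analysis unit 8 can be illustrated with a rough sketch. This section does not specify how the frequency-based voiced/unvoiced decision is made, so the example below substitutes a zero-crossing rate, a common crude indicator that noise-like unvoiced sound has more high-frequency content than voiced sound. The level is computed relative to digital full scale (dBFS), which is not a calibrated sound pressure level. The function names and the threshold value are assumptions made for illustration only.

```python
import math
from typing import Sequence


def level_dbfs(frame: Sequence[float]) -> float:
    """RMS level of a non-empty frame in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    # Clamp to avoid log10(0) on an all-zero frame.
    return 20.0 * math.log10(max(rms, 1e-12))


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_unvoiced(frame: Sequence[float], zcr_threshold: float = 0.25) -> bool:
    """Hypothetical rule: a high zero-crossing rate suggests unvoiced sound."""
    return zero_crossing_rate(frame) > zcr_threshold
```

In practice the threshold would need tuning for the sampling rate and microphone in use, and a pitch-based detector may be a better match for a frequency-based decision.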

While checking the captured image displayed on the display unit 12 and the voice output from the speaker 13, the speaker designates the sections in which voice was uttered that should not have been sent to the network 20 by the communication unit 5. Based on the sections designated by the speaker, the learning unit 11 learns the sections in which voice that should not have been sent to the network 20 was uttered. In doing so, the learning unit 11 learns by associating each section designated by the speaker with the analysis result of the captured image obtained by the image analysis unit 3 and the analysis result of the voice obtained by the voice analysis unit 8 for that section.
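The association between designated sections and analysis results can be pictured as building labeled training examples. The sketch below is only one possible data layout: the `FrameAnalysis` record, its fields, and the `label_frames` helper are hypothetical names, and the per-frame granularity is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameAnalysis:
    """Analysis results for one short frame of a recorded call (hypothetical)."""
    time_s: float          # start time of the frame within the recording
    gaze_off_camera: bool  # result from the image analysis unit
    unvoiced: bool         # result from the voice analysis unit
    level_db: float        # level reported by the voice analysis unit


def label_frames(
    frames: List[FrameAnalysis],
    designated: List[Tuple[float, float]],
) -> List[Tuple[FrameAnalysis, bool]]:
    """Pair each frame with True if it lies in a section the speaker designated.

    Each designated section is a (start_s, end_s) pair, end exclusive.
    """
    labeled = []
    for frame in frames:
        should_not_send = any(
            start <= frame.time_s < end for start, end in designated
        )
        labeled.append((frame, should_not_send))
    return labeled
```

The resulting list of (analysis result, label) pairs is the kind of input a learning step could consume after each reviewed call.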

After video calls end, the speaker has the learning unit 11 carry out this learning of voice that should not have been sent to the network 20 a plurality of times. The learning unit 11 can thereby acquire the ability to determine, based on the analysis result of the captured image by the image analysis unit 3 and the analysis result of the voice by the voice analysis unit 8, whether a given section is one in which voice that should not be sent to the network 20 is being uttered.

After the learning unit 11 has acquired this determination ability, during a new video call the learning unit 11 extracts the sections in which voice is uttered that should not be sent to the network 20 by the communication unit 5. The learning unit 11 extracts these sections based on the learning result, the analysis result of the captured image of the new video call obtained by the image analysis unit 3, and the analysis result of the voice of the new video call obtained by the voice analysis unit 8. Information indicating the sections extracted by the learning unit 11 is supplied to the voice control unit 4. The voice control unit 4 controls the communication unit 5 so as to cut off the transmission of the voice in the sections extracted by the learning unit 11 or to reduce the volume of that voice.
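As a rough illustration of learning followed by section extraction, the sketch below uses a deliberately simple stand-in for the learning unit 11: a frequency table recording, for each combination of analysis results, how often the speaker marked it as voice that should not be sent. The publication does not state which learning method is used, so the feature set, the `learn` and `extract_sections` names, the 0.5 threshold, and the use of integer millisecond timestamps are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (gaze_off_camera, unvoiced): a hypothetical two-feature analysis result.
Features = Tuple[bool, bool]


def learn(examples: List[Tuple[Features, bool]]) -> Dict[Features, float]:
    """Estimate how often each feature combination was marked 'do not send'."""
    counts = defaultdict(lambda: [0, 0])  # features -> [marked, total]
    for feats, marked in examples:
        counts[feats][0] += int(marked)
        counts[feats][1] += 1
    return {feats: marked / total for feats, (marked, total) in counts.items()}


def extract_sections(
    model: Dict[Features, float],
    frames: List[Tuple[int, Features]],
    frame_len_ms: int,
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Merge consecutive flagged frames into (start_ms, end_ms) sections."""
    sections: List[Tuple[int, int]] = []
    for start_ms, feats in frames:
        # Feature combinations never seen in training are treated as sendable.
        if model.get(feats, 0.0) < threshold:
            continue
        if sections and sections[-1][1] == start_ms:
            # Extend the previous section when this frame directly follows it.
            sections[-1] = (sections[-1][0], start_ms + frame_len_ms)
        else:
            sections.append((start_ms, start_ms + frame_len_ms))
    return sections
```

The extracted sections correspond to the information supplied to the voice control unit 4, which would then cut off or attenuate the voice within them.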

As described above, the third embodiment makes it possible to prevent or reduce cases in which voice not intended for the call destination is transmitted to and heard by the call destination. According to the third embodiment, the learning unit 11 learns in advance the sections in which voice that should not be sent to the network 20 is uttered and thereby acquires the ability to determine whether a given section is such a section, so that voice that should not be sent to the network 20 can be cut off with high accuracy.

The present invention is not limited to the first to third embodiments described above, and various modifications can be made without departing from the gist of the present invention.

1 camera
2, 7 temporary storage memory
3 image analysis unit
4 voice control unit
5 communication unit
6 microphone
8 voice analysis unit
10 recording unit
11 learning unit
12 display unit
13 speaker
14 operation unit
20 network
30 video call server
40, 101 to 103 video call device

Claims

A video call device comprising:
a camera that captures a speaker making a video call;
a microphone that picks up voice uttered by the speaker;
a communication unit that transmits to a network the captured image of the speaker captured by the camera and the voice of the speaker picked up by the microphone;
an image analysis unit that detects a line of sight of the speaker based on the captured image of the speaker captured by the camera, and detects whether the line of sight is off the camera; and
a voice control unit that controls the communication unit so as to cut off transmission of the voice or reduce the volume of the voice at least when the image analysis unit detects that the line of sight of the speaker is off the camera.

The video call device according to claim 1, further comprising a voice analysis unit that analyzes whether the voice of the speaker picked up by the microphone is voiced or unvoiced,
wherein the voice control unit controls the communication unit so as to cut off transmission of the voice or reduce the volume of the voice when the image analysis unit detects that the line of sight of the speaker is off the camera and the voice analysis unit determines that the voice of the speaker is unvoiced.

A video call device comprising:
a camera that captures a speaker making a video call;
a microphone that picks up voice uttered by the speaker;
an image analysis unit that analyzes the captured image of the speaker captured by the camera;
a voice analysis unit that analyzes the voice of the speaker picked up by the microphone;
a communication unit that transmits to a network the captured image of the speaker captured by the camera and the voice of the speaker picked up by the microphone;
a recording unit that records the captured image and the voice picked up by the microphone;
a learning unit that learns by associating a section, designated by the speaker when the captured image and voice of a past video call recorded in the recording unit are reproduced, in which voice was uttered that should not have been transmitted to the network by the communication unit, with the analysis result of the reproduced captured image obtained by the image analysis unit and the analysis result of the reproduced voice obtained by the voice analysis unit, and that, during a new video call, extracts a section in which voice is uttered that should not be transmitted to the network by the communication unit, based on the learning result, the analysis result of the captured image of the new video call obtained by the image analysis unit, and the analysis result of the voice of the new video call obtained by the voice analysis unit; and
a voice control unit that controls the communication unit so as to cut off transmission of the voice in the section extracted by the learning unit, in which voice that should not be transmitted to the network is uttered, or to reduce the volume of that voice.

A video call method comprising:
capturing, with a camera, a speaker making a video call;
picking up, with a microphone, voice uttered by the speaker;
detecting, by an image analysis unit, a line of sight of the speaker based on the captured image of the speaker captured by the camera, and detecting whether the line of sight is off the camera;
transmitting to a network, by a communication unit, the captured image of the speaker captured by the camera and the voice of the speaker picked up by the microphone when the image analysis unit does not detect that the line of sight of the speaker is off the camera; and
cutting off transmission of the voice by the communication unit or reducing the volume of the voice at least when the image analysis unit detects that the line of sight of the speaker is off the camera.

A control program for a video call device, the program causing a computer to execute:
a step of detecting a line of sight of a speaker based on a captured image of the speaker captured by a camera, and detecting whether the line of sight is off the camera;
a step of causing a communication unit to transmit to a network the captured image of the speaker captured by the camera and the voice of the speaker picked up by a microphone when it is not detected that the line of sight of the speaker is off the camera; and
a step of cutting off transmission of the voice by the communication unit or reducing the volume of the voice at least when it is detected that the line of sight of the speaker is off the camera.