JP2023118335A

JP2023118335A - Communication terminal, communication system, and communication server

Info

Publication number: JP2023118335A
Application number: JP2022021244A
Authority: JP
Inventors: 剛志大和久; Tsuyoshi Owaku
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2022-02-15
Filing date: 2022-02-15
Publication date: 2023-08-25

Abstract

To provide a technique that keeps surrounding noise from interrupting an interaction with a communication partner. SOLUTION: A communication terminal 10A is the terminal used by one participant, out of multiple communication terminals 10A, 10B used by multiple participants in a remote conference. The communication terminal 10A includes a communication unit, a sound collection unit, an image acquisition unit, a determination unit, and an output control unit. The communication unit communicates with a communication terminal 10B via a communication line. The sound collection unit collects sound emitted during the remote conference. The image acquisition unit acquires a captured image of the one participant during the remote conference (S11). The determination unit determines whether the one participant has spoken, on the basis of at least one of the collected sound and the captured image acquired by the image acquisition unit (S12). When it is determined that the one participant has spoken, the output control unit controls the communication unit to output voice data (S13); when it is determined that the one participant has not spoken, the output control unit controls the communication unit not to output voice data (S14). SELECTED DRAWING: Figure 4

Description

本発明は、通信端末、通信システム、及び通信サーバに関する。 The present invention relates to communication terminals, communication systems, and communication servers.

近年、インターネットや専用回線等の通信回線を介した遠隔会議の利用が高まっている。遠隔会議においては、会議参加者の周囲で発せられた雑音も他の会議参加者の端末に送信される。特許文献１には、テレビ会議システム等において、音声信号に含まれる雑音を除去する技術が開示されている。 In recent years, the use of teleconferencing via communication lines such as the Internet and dedicated lines has increased. In teleconferencing, noise emitted around conference participants is also transmitted to terminals of other conference participants. Patent Literature 1 discloses a technique for removing noise contained in an audio signal in a teleconferencing system or the like.

特開２０１８－１７８６５号公報 JP 2018-17865 A

音声信号に含まれる雑音の種類は様々であるため、特許文献１のような雑音成分を除去する技術を用いても、雑音成分を除去できない場合もあり得る。そのため、例えば会議とは関係のない会話の音声が他の会議参加者の端末に送信されると、会議参加者との間の会話が妨げられる。 Because an audio signal can contain many types of noise, a noise-removal technique such as that of Patent Literature 1 may fail to remove some noise components. Consequently, if, for example, the voices of a conversation unrelated to the conference are transmitted to the terminals of other conference participants, the conversation with the conference participants is disturbed.

本発明は、周囲音によって通信相手との会話が妨げられにくい技術を提供することを目的とする。 An object of the present invention is to provide a technique that makes a conversation with a communication partner less likely to be hindered by ambient sounds.

本発明に係る通信端末は、通信回線を通じて遠隔会議に参加する複数の参加者が使用する複数の通信端末のうち、一の参加者が使用する通信端末である。通信端末は、通信部、集音部、画像取得部、判断部、及び出力制御部を含む。通信部は、通信回線を介して他の通信端末と通信する。集音部は、遠隔会議中の音声を集音する。画像取得部は、遠隔会議中の一の参加者の撮影画像を取得する。判断部は、集音部で集音された音声、及び画像取得部で取得された撮影画像の少なくとも一方に基づいて、一の参加者が発話したか否かを判断する。出力制御部は、判断部において一の参加者が発話したと判断された場合、通信部から通信回線に音声を示す音声データを出力し、判断部において一の参加者が発話していないと判断された場合、通信部から通信回線に音声データを出力しないように制御する。 A communication terminal according to the present invention is the communication terminal used by one participant, among a plurality of communication terminals used by a plurality of participants participating in a teleconference via a communication line. The communication terminal includes a communication unit, a sound collection unit, an image acquisition unit, a determination unit, and an output control unit. The communication unit communicates with another communication terminal via the communication line. The sound collection unit collects sound during the teleconference. The image acquisition unit acquires a captured image of the one participant during the teleconference. The determination unit determines whether or not the one participant has spoken, based on at least one of the sound collected by the sound collection unit and the captured image acquired by the image acquisition unit. When the determination unit determines that the one participant has spoken, the output control unit causes the communication unit to output audio data representing the sound to the communication line; when the determination unit determines that the one participant has not spoken, the output control unit performs control so that the communication unit does not output the audio data to the communication line.

本発明に係る通信システムは、通信回線を通じて遠隔会議に参加する複数の参加者のうちの一の参加者の音声の出力を制御する通信システムである。通信システムは、通信部、集音部、画像取得部、判断部、及び出力制御部を含む。通信部は、通信回線と通信接続される。集音部は、遠隔会議中の一の参加者側の音声を集音する。画像取得部は、遠隔会議中の一の参加者の撮影画像を取得する。判断部は、集音部で集音された音声、及び画像取得部で取得された撮影画像の少なくとも一方に基づいて、一の参加者が発話したか否かを判断する。出力制御部は、一の参加者が発話したと判断された場合に、通信部から他の参加者に向けて音声を示す音声データを出力し、一の参加者が発話していないと判断された場合に、通信部から他の参加者に向けて音声データを出力しないように制御する。 A communication system according to the present invention is a communication system that controls the output of the voice of one participant among a plurality of participants participating in a teleconference via a communication line. The communication system includes a communication unit, a sound collection unit, an image acquisition unit, a determination unit, and an output control unit. The communication unit is communicatively connected to the communication line. The sound collection unit collects sound on the one participant's side during the teleconference. The image acquisition unit acquires a captured image of the one participant during the teleconference. The determination unit determines whether or not the one participant has spoken, based on at least one of the sound collected by the sound collection unit and the captured image acquired by the image acquisition unit. When it is determined that the one participant has spoken, the output control unit causes the communication unit to output audio data representing the sound to the other participants; when it is determined that the one participant has not spoken, the output control unit performs control so that the communication unit does not output the audio data to the other participants.

本発明に係る通信サーバは、通信回線を通じて遠隔会議に参加する複数の参加者が使用する複数の通信端末と通信接続される。通信サーバは、音声取得部、画像取得部、判断部、及び出力制御部を備える。音声取得部は、通信端末ごとに、遠隔会議中に通信端末側で集音された音声を取得する。画像取得部は、通信端末ごとに、遠隔会議中に撮影された参加者の撮影画像を取得する。判断部は、通信端末ごとに、取得した音声及び撮影画像の少なくとも一方に基づいて、通信端末を使用する参加者が発話したか否かを判断する。出力制御部は、判断部において、参加者が発話したと判断された場合、取得した音声を示す音声データを、通信回線を介して他の通信端末に出力し、参加者が発話していないと判断された場合、音声データを、通信回線を介して他の端末に出力しないように制御する。 A communication server according to the present invention is communicatively connected to a plurality of communication terminals used by a plurality of participants participating in a teleconference via a communication line. The communication server includes an audio acquisition unit, an image acquisition unit, a determination unit, and an output control unit. The audio acquisition unit acquires, for each communication terminal, the sound collected on the communication terminal side during the teleconference. The image acquisition unit acquires, for each communication terminal, a captured image of the participant taken during the teleconference. The determination unit determines, for each communication terminal, whether or not the participant using the communication terminal has spoken, based on at least one of the acquired sound and the captured image. When the determination unit determines that the participant has spoken, the output control unit outputs audio data representing the acquired sound to the other communication terminals via the communication line; when the determination unit determines that the participant has not spoken, the output control unit performs control so that the audio data is not output to the other terminals via the communication line.

本発明に係る通信端末、通信システム、及び通信サーバによれば、周囲音によって会話が妨げられにくい。 According to the communication terminal, communication system, and communication server according to the present invention, conversation is less likely to be disturbed by ambient sounds.

図１は、第１実施形態における通信システムの構成を示す模式図である。 FIG. 1 is a schematic diagram showing the configuration of a communication system according to the first embodiment.
図２は、図１に示す通信端末の概略構成を示すブロック図である。 FIG. 2 is a block diagram showing a schematic configuration of the communication terminal shown in FIG. 1.
図３は、図１に示すサーバの概略構成を示すブロック図である。 FIG. 3 is a block diagram showing a schematic configuration of the server shown in FIG. 1.
図４は、通信端末における遠隔会議処理を示す動作フローである。 FIG. 4 is an operation flow showing teleconference processing in a communication terminal.
図５は、第３実施形態における通信端末の概略構成を示すブロック図である。 FIG. 5 is a block diagram showing a schematic configuration of a communication terminal according to the third embodiment.
図６Ａは、周波数スペクトルＰ１（ｆ）の波形例を示す図である。 FIG. 6A is a diagram showing a waveform example of the frequency spectrum P1(f).
図６Ｂは、周波数スペクトルＰ２（ｆ）の波形例を示す図である。 FIG. 6B is a diagram showing a waveform example of the frequency spectrum P2(f).
図６Ｃは、周波数スペクトルＰ１（ｆ）と周波数スペクトルＰ２（ｆ）との差分の波形例を示す図である。 FIG. 6C is a diagram showing a waveform example of the difference between the frequency spectrum P1(f) and the frequency spectrum P2(f).

以下、図面を参照して、実施形態に係る通信端末及び通信システムについて説明する。なお、図中、同一又は相当部分については同一の参照符号を付して説明を繰り返さない。 Hereinafter, communication terminals and communication systems according to embodiments will be described with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals, and description thereof will not be repeated.

＜第１実施形態＞ <First Embodiment>
図１は、本実施形態における会議システム１の構成を示す模式図である。会議システム１は、例えば、遠隔会議に参加する参加者（参加者Ｕａ及び参加者Ｕｂ）間の音声及び映像をやり取りする遠隔会議システムである。会議システム１は、参加者Ｕａ及び参加者Ｕｂそれぞれの通信端末１０（１０Ａ、１０Ｂ）と、サーバ２０とを備える。各通信端末１０とサーバ２０とは、公衆回線又は専用回線等の通信回線Ｎに接続されている。参加者Ｕａ及び参加者Ｕｂはそれぞれの通信端末１０を用い、サーバ２０を介して通信相手の通信端末１０と映像及び音声をやり取りすることにより遠隔会議を行う。尚、通信端末１０の数は２台に限らず、３台以上であってもよい。以下、会議システム１の構成について具体的に説明する。 FIG. 1 is a schematic diagram showing the configuration of a conference system 1 according to this embodiment. The conference system 1 is, for example, a teleconference system that exchanges audio and video between the participants (participant Ua and participant Ub) in a teleconference. The conference system 1 includes the communication terminals 10 (10A, 10B) of the participants Ua and Ub, and a server 20. Each communication terminal 10 and the server 20 are connected to a communication line N such as a public line or a dedicated line. Participant Ua and participant Ub each use their own communication terminal 10 to exchange video and audio with the other party's communication terminal 10 via the server 20, and thereby hold a teleconference. The number of communication terminals 10 is not limited to two and may be three or more. The configuration of the conference system 1 is described in detail below.

（通信端末１０） (Communication terminal 10)
図２は、通信端末１０の概略構成を示すブロック図である。通信端末１０は、本実施形態において、ＰＣ（Personal Computer）、タブレット端末、又はスマートフォン等の装置であってもよい。以下では、説明の便宜上、主として通信端末１０Ａを例に説明する。 FIG. 2 is a block diagram showing a schematic configuration of the communication terminal 10. In the present embodiment, the communication terminal 10 may be a device such as a PC (Personal Computer), a tablet terminal, or a smartphone. For convenience of explanation, the communication terminal 10A is mainly used as the example below.

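通信端末１０は、図２に示すように、マイク１１（集音部の一例）、カメラ１２（画像取得部の一例）、スピーカ１３、通信部１４、操作部１５、記憶部１６、表示部１７、及び制御部１８を備える。 As shown in FIG. 2, the communication terminal 10 includes a microphone 11 (an example of a sound collection unit), a camera 12 (an example of an image acquisition unit), a speaker 13, a communication unit 14, an operation unit 15, a storage unit 16, a display unit 17, and a control unit 18.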

マイク１１は、通信端末１０周辺の音声を集音して電気信号に変換した音声信号を制御部１８へ出力する。カメラ１２は、カメラ１２の画角の範囲に写る被写体を撮像した撮像信号を制御部１８へ出力する。つまり、通信端末１０Ａにおいて、通信端末１０Ａ周辺の音声が集音され、参加者Ｕａ等の被写体が撮像される。また、通信端末１０Ｂにおいて、通信端末１０Ｂ周辺の音声が集音され、参加者Ｕｂ等の被写体が撮像される。 The microphone 11 collects sounds around the communication terminal 10 and converts them into electric signals to output the sound signals to the control unit 18 . The camera 12 outputs to the control unit 18 an imaging signal obtained by imaging a subject within the range of the angle of view of the camera 12 . That is, the communication terminal 10A collects sounds around the communication terminal 10A, and captures an image of a subject such as the participant Ua. Also, in the communication terminal 10B, sounds around the communication terminal 10B are collected, and a subject such as the participant Ub is imaged.

スピーカ１３は、制御部１８から出力された音声データをＤ／Ａ変換し、増幅して放音する。 The speaker 13 D/A-converts the audio data output from the control unit 18, amplifies the audio data, and emits sound.

通信部１４は、通信回線Ｎを介してサーバ２０と通信するための通信インタフェースである。通信部１４は、制御部１８の制御の下、ＲＴＰ（Real-time Transport Protocol）等の通信プロトコルを用い、サーバ２０との間で通信を確立し、サーバ２０を介して他の通信端末１０との間でデータを送受信する。具体的には、通信部１４は、サーバ２０から受信したデータを制御部１８へ出力し、制御部１８から入力されるデータをサーバ２０へ送信する。 The communication unit 14 is a communication interface for communicating with the server 20 via the communication line N. Under the control of the control unit 18, the communication unit 14 establishes communication with the server 20 using a communication protocol such as RTP (Real-time Transport Protocol), and transmits data to and receives data from the other communication terminal 10 via the server 20. Specifically, the communication unit 14 outputs data received from the server 20 to the control unit 18 and transmits data input from the control unit 18 to the server 20.

操作部１５は、マウス、キーボード、又はタッチパネル等を含む。操作部１５は、参加者の操作を受け付け、受け付けた操作を示す操作信号を制御部１８へ出力する。 The operation unit 15 includes a mouse, keyboard, touch panel, or the like. The operation unit 15 receives an operation by a participant and outputs an operation signal indicating the received operation to the control unit 18 .

記憶部１６は、ハードディスク等の不揮発性記憶媒体を含む。記憶部１６は、遠隔会議を行うための自装置のＩＰアドレスや他の通信端末１０のアドレス等のアドレス情報を記憶する。 The storage unit 16 includes a nonvolatile storage medium such as a hard disk. The storage unit 16 stores address information such as the IP address of its own device and the addresses of other communication terminals 10 for conducting remote conferences.

表示部１７は、液晶パネル等の表示パネルと、表示パネルを駆動する駆動回路とを含む（いずれも図示略）。駆動回路は、制御部１８の制御の下、カメラ１２で撮像された被写体の映像や、通信部１４を介して取得された通信相手の映像等の各種画像を表示するための駆動信号を表示パネルに供給する。表示パネルは、駆動回路から供給される駆動信号に応じた画像を表示する。 The display unit 17 includes a display panel such as a liquid crystal panel and a drive circuit that drives the display panel (neither is shown). Under the control of the control unit 18, the drive circuit supplies the display panel with drive signals for displaying various images, such as the video of the subject captured by the camera 12 and the video of the communication partner acquired via the communication unit 14. The display panel displays an image according to the drive signal supplied from the drive circuit.

制御部１８は、ＣＰＵ（Central Processing Unit）及びメモリ（ＲＯＭ(Read Only Memory)及びＲＡＭ(Random Access Memory)）を含む。制御部１８は、操作部１５を介した参加者の操作に応じて、ＣＰＵがＲＯＭに記憶された遠隔会議アプリケーションの制御プログラムを実行する。制御部１８は、ＣＰＵが制御プログラムを実行することにより、音声・映像信号処理部１８１、判断部１８２、出力制御部１８３、及び表示制御部１８４として機能し、遠隔会議処理を行う。遠隔会議処理は、送信制御処理と映像音声出力処理とを含む。送信制御処理は、音声・映像信号処理部１８１、判断部１８２、及び出力制御部１８３により、カメラ１２及びマイク１１によって取得した映像データ及び音声データの他の通信端末１０への送信を制御する処理である。また、映像音声出力処理は、音声・映像信号処理部１８１及び表示制御部１８４により、他の通信端末１０から取得した映像データ及び音声データを自端末から出力する処理である。 The control unit 18 includes a CPU (Central Processing Unit) and memory (ROM (Read Only Memory) and RAM (Random Access Memory)). In the control unit 18, the CPU executes the control program of a teleconference application stored in the ROM in response to the participant's operation via the operation unit 15. When the CPU executes the control program, the control unit 18 functions as an audio/video signal processing unit 181, a determination unit 182, an output control unit 183, and a display control unit 184, and performs teleconference processing. The teleconference processing includes transmission control processing and video/audio output processing. The transmission control processing is processing in which the audio/video signal processing unit 181, the determination unit 182, and the output control unit 183 control the transmission, to the other communication terminal 10, of the video data and audio data acquired by the camera 12 and the microphone 11. The video/audio output processing is processing in which the audio/video signal processing unit 181 and the display control unit 184 output, from the own terminal, the video data and audio data acquired from the other communication terminal 10.

音声・映像信号処理部１８１は、ＣＯＤＥＣを含む。音声・映像信号処理部１８１は、通信部１４を介して、遠隔会議中の映像データ及び音声データのパケットをサーバ２０との間で逐次送受信する。 The audio/video signal processing unit 181 includes a CODEC. The audio/video signal processing unit 181 sequentially transmits and receives packets of video data and audio data to and from the server 20 via the communication unit 14 during the teleconference.

具体的には、音声・映像信号処理部１８１は、送信制御処理において、マイク１１から入力される一定時間ごとの音声信号と、カメラ１２から入力される一定時間ごとの映像信号とを、遠隔会議システムの規格（例えばＨ．３２３）に従ってデジタルデータ（映像データ及び音声データ）に変換して判断部１８２へ出力する。 Specifically, in the transmission control processing, the audio/video signal processing unit 181 converts the audio signal input from the microphone 11 at regular time intervals and the video signal input from the camera 12 at regular time intervals into digital data (video data and audio data) in accordance with a teleconference system standard (for example, H.323), and outputs the digital data to the determination unit 182.

また、音声・映像信号処理部１８１は、映像音声出力処理として、通信部１４から逐次入力されるサーバ２０からの映像データ及び音声データをデコードし、パケットに分離する。入力される映像データ及び音声データには送信先情報、送信元情報、及びタイムスタンプ等の情報が付加されている。音声・映像信号処理部１８１は、デコードされた映像データ及び音声データをタイムスタンプの順に並べる。音声・映像信号処理部１８１は、音声データの音量を調整してスピーカ１３から出力し、映像データを表示制御部１８４へ出力する。なお、音声・映像信号処理部１８１は、映像音声出力処理において、カメラ１２から入力される自端末側の映像信号の映像データを表示制御部１８４へ出力してもよい。この場合、参加者Ｕｂだけでなく自端末側の参加者Ｕａの映像が表示部１７に表示される。 In the video/audio output processing, the audio/video signal processing unit 181 decodes the video data and audio data from the server 20 that are sequentially input from the communication unit 14, and separates them into packets. Information such as destination information, source information, and a time stamp is attached to the input video data and audio data. The audio/video signal processing unit 181 arranges the decoded video data and audio data in time-stamp order. The audio/video signal processing unit 181 adjusts the volume of the audio data and outputs it from the speaker 13, and outputs the video data to the display control unit 184. In the video/audio output processing, the audio/video signal processing unit 181 may also output, to the display control unit 184, the video data of the own terminal's video signal input from the camera 12. In this case, the display unit 17 displays the video of the participant Ua on the own terminal side in addition to that of the participant Ub.
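The time-stamp ordering described above can be illustrated with a minimal Python sketch. The `MediaPacket` class, its field names, and the `order_by_timestamp` helper are hypothetical and are not part of the publication; the sketch only assumes that each decoded item carries destination information, source information, and a time stamp, as the paragraph states.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MediaPacket:
    """Hypothetical packet layout; the field names are assumptions."""
    source: str        # source information
    destination: str   # destination information
    timestamp: int     # time stamp attached by the sender
    kind: str          # "video" or "audio"
    payload: bytes


def order_by_timestamp(packets: Iterable[MediaPacket]) -> List[MediaPacket]:
    """Return the packets sorted by time stamp.

    sorted() is stable, so packets that share a time stamp keep
    their arrival order.
    """
    return sorted(packets, key=lambda p: p.timestamp)
```

A real terminal would more likely use a jitter buffer with a playout deadline; a plain sort is used here only to show the ordering step.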

判断部１８２は、音声・映像信号処理部１８１から自端末側の音声データ及び映像データを取得する。判断部１８２は、取得した映像データに基づいて、参加者Ｕａが発話しているか否か判断し、判断結果に応じて送信モードを設定する。送信モードは、映像音声モード及び映像モードを含む。映像音声モードは、取得した映像データと音声データとをサーバ２０に送信するモードである。映像モードは、取得した映像データのみをサーバ２０に送信するモードである。判断部１８２は、参加者Ｕａが発話しているか否かに応じて、映像音声モード及び映像モードのいずれかに切り替える。 The determination unit 182 acquires the audio data and video data of the own terminal from the audio/video signal processing unit 181. The determination unit 182 determines, based on the acquired video data, whether or not the participant Ua is speaking, and sets the transmission mode according to the determination result. The transmission modes include a video/audio mode and a video mode. The video/audio mode is a mode in which both the acquired video data and the acquired audio data are transmitted to the server 20. The video mode is a mode in which only the acquired video data is transmitted to the server 20. The determination unit 182 switches to either the video/audio mode or the video mode depending on whether or not the participant Ua is speaking.

具体的には、例えば、通信端末１０Ａにおいて、判断部１８２は、音声・映像信号処理部１８１から自端末側の被写体を撮影した映像データ（以下、映像データＰａ）と、自端末側の音声が集音された音声データ（以下、音声データＳａ）とを取得する。判断部１８２は、映像データＰａを画像解析して参加者Ｕａの唇の動きを検出する。発話中の唇の動きの検出は、例えば、参加者Ｕａの顔画像から発話中の参加者Ｕａの唇の動きを出力する学習済みモデルを用いて検出してもよい。 Specifically, in the communication terminal 10A, for example, the determination unit 182 acquires from the audio/video signal processing unit 181 the video data obtained by capturing the subject on the own terminal side (hereinafter, video data Pa) and the audio data obtained by collecting the sound on the own terminal side (hereinafter, audio data Sa). The determination unit 182 performs image analysis on the video data Pa to detect the movement of the lips of the participant Ua. Lip movement during speech may be detected using, for example, a trained model that outputs the lip movement of the speaking participant Ua from a face image of the participant Ua.

判断部１８２は、映像データＰａから発話中の参加者Ｕａの唇の動きを検出した場合、送信モードを映像音声モードに設定する。つまり、判断部１８２は、映像データＰａとともに取得した音声データＳａに参加者Ｕａが発した音声が含まれていると判断して映像データＰａ及び音声データＳａを出力制御部１８３へ出力する。 If the determination unit 182 detects movement of the lips of the speaking participant Ua from the video data Pa, the determination unit 182 sets the transmission mode to the video/audio mode. That is, the determination unit 182 determines that the audio data Sa acquired together with the video data Pa includes the audio uttered by the participant Ua, and outputs the video data Pa and the audio data Sa to the output control unit 183 .

また、判断部１８２は、映像データＰａから発話中の参加者Ｕａの唇の動きを検出できない場合、送信モードを映像モードに設定する。つまり、判断部１８２は、音声データＳａに参加者Ｕａが発した音声が含まれていないと判断して、映像データＰａのみを出力制御部１８３へ出力する。なお、判断部１８２は、映像モードにおいて、操作部１５を介して音声の出力を指示する操作を受け付けた場合、送信モードを映像モードから映像音声モードに切り替えてもよい。 Further, when the movement of the lips of the speaking participant Ua cannot be detected from the video data Pa, the determination unit 182 sets the transmission mode to the video mode. That is, the determination unit 182 determines that the audio data Sa does not include the voice uttered by the participant Ua, and outputs only the video data Pa to the output control unit 183 . Note that the determination unit 182 may switch the transmission mode from the video mode to the video/audio mode when receiving an operation to instruct the output of audio via the operation unit 15 in the video mode.
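The mode selection performed by the determination unit 182 can be sketched as follows. This is only an illustration under stated assumptions: the names `TransmissionMode`, `select_mode`, and `payload_for` are hypothetical, the lip-movement result is passed in as a plain boolean instead of coming from a trained model, and the manual audio request stands in for the operation received via the operation unit 15.

```python
from enum import Enum


class TransmissionMode(Enum):
    VIDEO_AUDIO = "video_audio"  # send video data Pa and audio data Sa
    VIDEO_ONLY = "video_only"    # send video data Pa only


def select_mode(lip_movement_detected: bool,
                manual_audio_request: bool = False) -> TransmissionMode:
    """Pick the transmission mode from the lip-movement result.

    manual_audio_request models the optional operation that switches
    the video mode to the video/audio mode (an assumption about how
    that operation would be represented).
    """
    if lip_movement_detected or manual_audio_request:
        return TransmissionMode.VIDEO_AUDIO
    return TransmissionMode.VIDEO_ONLY


def payload_for(mode: TransmissionMode, video: bytes, audio: bytes) -> dict:
    """Return what would be handed to the output control unit 183."""
    if mode is TransmissionMode.VIDEO_AUDIO:
        return {"video": video, "audio": audio}
    return {"video": video}
```

With these definitions, `payload_for(select_mode(False), b"Pa", b"Sa")` contains only the video entry, which is the behavior that keeps ambient sound from reaching the other terminal.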

出力制御部１８３は、判断部１８２から映像データＰａ及び音声データＳａ（以下、映像音声データ）、又は映像データＰａを取得し、取得した映像音声データ又は映像データＰａをパケット化し、他の通信端末１０のアドレス情報を付加して通信部１４へ出力する。 The output control unit 183 acquires, from the determination unit 182, either the video data Pa and the audio data Sa (hereinafter, video/audio data) or the video data Pa alone, packetizes the acquired video/audio data or video data Pa, adds the address information of the other communication terminal 10, and outputs the result to the communication unit 14.

表示制御部１８４は、音声・映像信号処理部１８１から入力される映像データに基づく駆動信号を表示部１７に供給し、表示部１７に参加者Ｕａ及び参加者Ｕｂ等の被写体画像を表示させる。また、表示制御部１８４は、判断部１８２から送信モードを示すモード情報を取得し、取得したモード情報を示す画像（送信モード画像）を表示部１７に表示させてもよい。送信モード画像を表示部１７に表示させることにより、マイク１１で集音している音声が通信端末１０Ｂに送信されているか否かを参加者Ｕａが確認することができる。 The display control unit 184 supplies the display unit 17 with a drive signal based on the video data input from the audio/video signal processing unit 181, and causes the display unit 17 to display subject images such as the participant Ua and the participant Ub. Further, the display control unit 184 may acquire mode information indicating the transmission mode from the determination unit 182 and cause the display unit 17 to display an image (transmission mode image) indicating the acquired mode information. By displaying the transmission mode image on the display unit 17, the participant Ua can confirm whether or not the sound collected by the microphone 11 is being transmitted to the communication terminal 10B.

（サーバ２０） (Server 20)
図３は、サーバ２０の概略構成を示すブロック図である。図３に示すように、サーバ２０は、通信部２１、制御部２２、及び記憶部２３を備える。 FIG. 3 is a block diagram showing a schematic configuration of the server 20. As shown in FIG. 3, the server 20 includes a communication unit 21, a control unit 22, and a storage unit 23.

通信部２１は、通信回線Ｎを介して各通信端末１０と通信する通信インタフェースである。通信部２１は、制御部２２の制御の下、ＲＴＰ等の所定の通信プロトコルを用い、各通信端末１０との間で通信を確立し、映像データ及び音声データを送受信する。 The communication unit 21 is a communication interface that communicates with each communication terminal 10 via the communication line N. Under the control of the control unit 22, the communication unit 21 establishes communication with each communication terminal 10 using a predetermined communication protocol such as RTP, and transmits and receives video data and audio data.

記憶部２３は、ハードディスク等の不揮発性記憶媒体を含む。記憶部２３は、各通信端末１０の識別情報（ＩＰアドレス等）を含む通信端末情報（図示略）を記憶する。 Storage unit 23 includes a non-volatile storage medium such as a hard disk. The storage unit 23 stores communication terminal information (not shown) including identification information (IP address, etc.) of each communication terminal 10 .

制御部２２は、ＣＰＵ及びメモリ（ＲＯＭ及びＲＡＭ）を含む。制御部２２は、ＣＰＵが、ＲＯＭに記憶された制御プログラムを実行することにより、通信部２１を介して、各通信端末１０との間で通信する。具体的には、制御部２２は、各通信端末１０から送信された音声データ及び映像データのパケットを取得する。制御部２２は、記憶部２３における通信端末情報（図示略）に基づき、音声データ及び映像データを、音声データ及び映像データに付加された送信先情報で指定された通信端末１０に送信する。 The control unit 22 includes a CPU and memory (ROM and RAM). The control unit 22 communicates with each communication terminal 10 via the communication unit 21 by the CPU executing the control program stored in the ROM. Specifically, the control unit 22 acquires packets of audio data and video data transmitted from each communication terminal 10 . Based on the communication terminal information (not shown) in the storage unit 23, the control unit 22 transmits the audio data and the video data to the communication terminal 10 specified by the destination information added to the audio data and the video data.
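The relay behavior of the control unit 22 can be sketched as a lookup from destination information to a stored terminal address. Everything named here is an assumption for illustration: the `forward` function, the dictionary-based packet shape, and the choice to skip packets whose destination is unknown are not described in the publication.

```python
from typing import Dict, Iterable, List


def forward(packets: Iterable[dict],
            terminal_info: Dict[str, str]) -> Dict[str, List[dict]]:
    """Group packets by the address of their destination terminal.

    terminal_info maps a terminal identifier to its address and stands
    in for the communication terminal information in storage unit 23.
    """
    outbox: Dict[str, List[dict]] = {}
    for packet in packets:
        address = terminal_info.get(packet["destination"])
        if address is None:
            # Assumption: packets for unknown terminals are skipped.
            continue
        outbox.setdefault(address, []).append(packet)
    return outbox
```

For example, with `terminal_info = {"10B": "192.0.2.2"}`, every packet whose `destination` is `"10B"` ends up in the list keyed by `"192.0.2.2"`, in arrival order.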

（動作） (Operation)
図４は、通信端末１０における遠隔会議処理を示す動作フローである。図４の動作フローは、通信端末１０Ａ及び通信端末１０Ｂにおいて遠隔会議アプリケーションが実行されている状態から開始される。つまり、通信端末１０間において所定の通信プロトコルを用いて通信が確立された後に遠隔会議処理が行われる。図４では、主として、通信端末１０Ａにおける送信制御処理について説明する。 FIG. 4 is an operation flow showing teleconference processing in the communication terminal 10. The operation flow of FIG. 4 starts from a state in which the teleconference application is running on the communication terminal 10A and the communication terminal 10B. That is, the teleconference processing is performed after communication has been established between the communication terminals 10 using a predetermined communication protocol. FIG. 4 mainly describes the transmission control processing in the communication terminal 10A.

通信端末１０Ａは、マイク１１で自端末側の音声を集音し、カメラ１２により自端末側の被写体を撮影する。通信端末１０Ａにおける制御部１８は、マイク１１から出力される音声信号と、カメラ１２から出力される映像信号とをＡ／Ｄ変換することにより、音声データＳａと映像データＰａとを逐次取得する（ステップＳ１１）。 The communication terminal 10A collects the sound on the own terminal side with the microphone 11 and captures the subject on the own terminal side with the camera 12. The control unit 18 in the communication terminal 10A sequentially acquires the audio data Sa and the video data Pa by A/D-converting the audio signal output from the microphone 11 and the video signal output from the camera 12 (step S11).

制御部１８は、取得した映像データＰａに基づいて、参加者Ｕａが発話しているか否かを判断する（ステップＳ１２）。具体的には、判断部１８２において、例えば、発話中の参加者Ｕａの唇の動きを検出する学習済みモデルに映像データＰａを入力することにより、参加者Ｕａが発話しているか否かを判断する。 The control unit 18 determines whether or not the participant Ua is speaking based on the acquired video data Pa (step S12). Specifically, the determination unit 182 determines whether or not the participant Ua is speaking by, for example, inputting the video data Pa to a trained model that detects the lip movement of the participant Ua during speech.

ステップＳ１２において、通信端末１０Ａは、参加者Ｕａが発話していると判断した場合（ステップＳ１２：Ｙｅｓ）、送信モードを映像音声モードに設定し、映像データＰａと音声データＳａとを通信部１４を介してサーバ２０へ送信する（ステップＳ１３）。具体的には、発話中の参加者Ｕａの唇の動きを検出する学習済みモデルにより、映像データＰａから参加者Ｕａの発話中の唇の動きが検出できた場合、判断部１８２は、送信モードを映像音声モードに設定する。そして、判断部１８２は、映像データＰａと、映像データＰａと共に取得した音声データＳａとを出力制御部１８３に出力する。出力制御部１８３は、判断部１８２から入力された映像データＰａ及び音声データＳａをエンコードしてパケット化し、通信端末１０Ｂのアドレス情報をパケットに付加して通信部１４に出力する。通信部１４は、出力制御部１８３から入力された映像データＰａ及び音声データＳａ（映像音声データ）をサーバ２０へ送信する。 In step S12, when the communication terminal 10A determines that the participant Ua is speaking (step S12: Yes), it sets the transmission mode to the video/audio mode and transmits the video data Pa and the audio data Sa to the server 20 via the communication unit 14 (step S13). Specifically, when the trained model that detects the lip movement of the participant Ua during speech detects such lip movement in the video data Pa, the determination unit 182 sets the transmission mode to the video/audio mode. The determination unit 182 then outputs the video data Pa and the audio data Sa acquired together with the video data Pa to the output control unit 183. The output control unit 183 encodes and packetizes the video data Pa and the audio data Sa input from the determination unit 182, adds the address information of the communication terminal 10B to the packets, and outputs the packets to the communication unit 14. The communication unit 14 transmits the video data Pa and the audio data Sa (video/audio data) input from the output control unit 183 to the server 20.

また、ステップＳ１２において、通信端末１０Ａは、参加者Ｕａが発話していないと判断した場合（ステップＳ１２：Ｎｏ）、送信モードを映像モードに設定し、映像データＰａのみを通信部１４を介してサーバ２０へ送信する（ステップＳ１４）。具体的には、発話中の参加者Ｕａの唇の動きを検出する学習済みモデルにより、映像データＰａから参加者Ｕａの発話中の唇の動きが検出できない場合、判断部１８２は、送信モードを映像モードに設定する。そして、判断部１８２は、映像データＰａのみを出力制御部１８３に出力する。出力制御部１８３は、判断部１８２から入力された映像データＰａをエンコードしてパケット化し、通信端末１０Ｂのアドレス情報をパケットに付加して通信部１４に出力する。通信部１４は、出力制御部１８３から入力された映像データＰａをサーバ２０へ送信する。 In step S12, when the communication terminal 10A determines that the participant Ua is not speaking (step S12: No), it sets the transmission mode to the video mode and transmits only the video data Pa to the server 20 via the communication unit 14 (step S14). Specifically, when the trained model that detects the lip movement of the participant Ua during speech cannot detect such lip movement in the video data Pa, the determination unit 182 sets the transmission mode to the video mode. The determination unit 182 then outputs only the video data Pa to the output control unit 183. The output control unit 183 encodes and packetizes the video data Pa input from the determination unit 182, adds the address information of the communication terminal 10B to the packets, and outputs the packets to the communication unit 14. The communication unit 14 transmits the video data Pa input from the output control unit 183 to the server 20.

図示を省略するが、サーバ２０は、通信端末１０Ａから送信された映像音声データ（映像データＰａ及び音声データＳａ）、又は映像データＰａを、記憶部２３に記憶された通信端末情報（図示略）に基づいて、通信端末１０Ｂへ送信する。 Although not illustrated, the server 20 transmits the video/audio data (video data Pa and audio data Sa) or the video data Pa received from the communication terminal 10A to the communication terminal 10B, based on the communication terminal information (not shown) stored in the storage unit 23.

通信端末１０Ｂは、通信部１４を介してサーバ２０から映像音声データを受信した場合（ステップＳ２１：Ｙｅｓ）、映像データに基づく映像を表示部１７に表示し、音声データに基づく音声をスピーカ１３に出力する（ステップＳ２２）。 When the communication terminal 10B receives the video/audio data from the server 20 via the communication unit 14 (step S21: Yes), the communication terminal 10B displays the video based on the video data on the display unit 17, and outputs the audio based on the audio data to the speaker 13. Output (step S22).

具体的には、音声・映像信号処理部１８１において映像音声データ（映像データＰａ及び音声データＳａ）をデコードする。音声・映像信号処理部１８１は、デコードした音声データＳａをＤ／Ａ変換した音声信号をスピーカ１３から出力する。また、音声・映像信号処理部１８１は、デコードされた映像データＰａを表示制御部１８４に出力する。表示制御部１８４は、デコードされた映像データＰａに基づく駆動信号を表示部１７に入力し、表示部１７に駆動信号に基づく画像、すなわち、参加者Ｕａ等の映像を表示させる。 Specifically, the audio/video signal processing unit 181 decodes audio/video data (video data Pa and audio data Sa). The audio/video signal processing unit 181 outputs an audio signal obtained by D/A converting the decoded audio data Sa from the speaker 13 . Also, the audio/video signal processing unit 181 outputs the decoded video data Pa to the display control unit 184 . The display control unit 184 inputs a drive signal based on the decoded video data Pa to the display unit 17, and causes the display unit 17 to display an image based on the drive signal, that is, a video of the participant Ua and the like.

また、通信端末１０Ｂは、通信部１４を介してサーバ２０から映像音声データではなく、映像データのみを受信した場合(ステップＳ２１：Ｎｏ、ステップＳ２３：Ｙｅｓ)、映像データに基づく画像を表示部１７に表示する（ステップＳ２４）。この場合、ステップＳ２２で説明したように、通信端末１０Ｂにおいて、映像データＰａに基づく画像が表示部１７に表示されるが、映像データＰａが表示されている間、通信端末１０Ａ側の音声は出力されない。 When the communication terminal 10B receives only video data, rather than video/audio data, from the server 20 via the communication unit 14 (step S21: No, step S23: Yes), it displays an image based on the video data on the display unit 17 (step S24). In this case, as described for step S22, an image based on the video data Pa is displayed on the display unit 17 of the communication terminal 10B, but no sound from the communication terminal 10A side is output while the video data Pa is being displayed.

通信端末１０Ａは、操作部１５を介して遠隔会議を終了する操作を受け付けるまで（ステップＳ１５：Ｎｏ）、ステップＳ１１以下の処理を繰り返し、操作部１５を介して遠隔会議を終了する操作を受け付けると（ステップＳ１５：Ｙｅｓ）、処理を終了する。 The communication terminal 10A repeats the processing from step S11 onward until it receives an operation to end the remote conference via the operation unit 15 (step S15: No). When it receives such an operation via the operation unit 15 (step S15: Yes), it ends the processing.

通信端末１０Ｂは、ステップＳ２３において、通信部１４を介して映像データを受信できるまで（ステップＳ２３：Ｎｏ）、待機する。また、通信端末１０Ｂは、操作部１５を介して遠隔会議を終了する操作を受け付けるまで（ステップＳ２５：Ｎｏ）、ステップＳ２１以下の処理を繰り返し、操作部１５を介して遠隔会議を終了する操作を受け付けると（ステップＳ２５：Ｙｅｓ）、処理を終了する。 In step S23, the communication terminal 10B waits until video data can be received via the communication unit 14 (step S23: No). The communication terminal 10B repeats the processing from step S21 onward until it receives an operation to end the remote conference via the operation unit 15 (step S25: No). When it receives such an operation via the operation unit 15 (step S25: Yes), it ends the processing.

本実施形態では、通信端末１０において、参加者を撮影した映像から発話中の参加者の唇の動きを検出することにより、参加者が発話したか否かを判断する。通信端末１０は、参加者が発話していないときに集音された音声データを通信相手に送信しない。そのため、参加者が発話していない状態で周囲音のみが含まれた音声データが通信相手に送信されない。その結果、周囲音によって会話が妨げられることなく、スムーズに会議を進行させることができる。 In this embodiment, the communication terminal 10 determines whether or not the participant has spoken by detecting the movement of the lips of the participant who is speaking from the image of the participant. The communication terminal 10 does not transmit the sound data collected when the participant is not speaking to the communication partner. Therefore, voice data containing only the ambient sound is not transmitted to the communication partner while the participant is not speaking. As a result, the conference can proceed smoothly without being disturbed by ambient sounds.
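The sending-side behavior summarized above (steps S11 to S14) can be sketched as follows. This is a minimal illustration only: the `Frame` and `Packet` types, the function names, and the `lips_moving` flag (standing in for the output of a lip-movement detector) are hypothetical and are not defined in the specification.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Frame:
    video: bytes        # video data Pa for one capture interval
    audio: bytes        # audio data Sa collected in the same interval
    lips_moving: bool   # hypothetical result of a lip-movement detector


@dataclass
class Packet:
    video: bytes
    audio: Optional[bytes]  # None when only video is transmitted


def select_packet(frame: Frame) -> Packet:
    """Decide what to send for one frame (steps S12 to S14)."""
    if frame.lips_moving:
        # S12: Yes -> send video and audio together (S13)
        return Packet(frame.video, frame.audio)
    # S12: No -> send video only, so ambient sound is not transmitted (S14)
    return Packet(frame.video, None)


def transmit(frames: Iterable[Frame]) -> Iterator[Packet]:
    """Repeat the decision for every captured frame (loop from S11)."""
    for frame in frames:
        yield select_packet(frame)
```

In this sketch the video is always forwarded and only the audio is gated, which matches the description that the partner terminal keeps displaying the participant while no sound is output.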

＜第２実施形態＞ <Second Embodiment>
第１実施形態では、図４のステップＳ１２において、参加者Ｕａが発話したか否かを映像データＰａに写る参加者Ｕａの唇の動きに基づいて判断したが、参加者Ｕａの唇以外の動きも用いて発話しているか否かを判断してもよい。 In the first embodiment, in step S12 of FIG. 4, whether or not the participant Ua has spoken is determined based on the movement of the lips of the participant Ua appearing in the video data Pa. However, movements of the participant Ua other than those of the lips may also be used to determine whether or not the participant is speaking.

具体的には、映像データＰａにおいて、会議に関連する特定の動作が含まれている場合に参加者Ｕａが発話している可能性があると判断されてもよい。特定の動作は、例えば、参加者Ｕａがカメラ１２を背にして、ホワイトボード等に文字等を書く動作である。通常、会議中にカメラ１２を背にしてホワイトボード等に文字を書く場合、参加者が他の参加者に対して何かを説明しようとしている可能性が高い。そのため、映像データＰａから発話中の唇の動きだけでなく、このような特定の動作を検出した場合には、送信モードを映像音声モードに設定してもよい。参加者が特定の動作を行っているか否かの判断は、参加者が背を向けてホワイトボード等に文字等を書いている画像を学習させた学習済モデルを用いて行ってもよい。判断部１８２は、映像データＰａから参加者Ｕａによる特定の動作を検出できた場合、送信モードを映像音声モードに設定し、映像データＰａ及び音声データＳａを出力制御部１８３に出力する。また、映像データＰａから参加者Ｕａによる特定の動作を検出できない場合、判断部１８２は、送信モードを映像モードに設定し、映像データＰａのみを出力制御部１８３に出力する。 Specifically, when the video data Pa contains a specific action related to the conference, it may be determined that the participant Ua is possibly speaking. The specific action is, for example, the participant Ua writing characters or the like on a whiteboard or the like with his or her back to the camera 12. Normally, a participant who writes on a whiteboard or the like with his or her back to the camera 12 during a conference is likely to be trying to explain something to the other participants. Therefore, the transmission mode may be set to the video/audio mode not only when lip movement during speech is detected in the video data Pa but also when such a specific action is detected. Whether or not the participant is performing the specific action may be determined using a trained model that has learned images of a participant writing characters or the like on a whiteboard or the like with his or her back turned. When the determination unit 182 detects the specific action by the participant Ua in the video data Pa, it sets the transmission mode to the video/audio mode and outputs the video data Pa and the audio data Sa to the output control unit 183. When the specific action by the participant Ua cannot be detected in the video data Pa, the determination unit 182 sets the transmission mode to the video mode and outputs only the video data Pa to the output control unit 183.

第１実施形態のように参加者の唇の動きだけで参加者の発話を判断する場合、参加者がカメラ１２を背にしたときに参加者の唇の動きが検出できないため、マイク１１で集音された音声は通信相手に送信されない。そのため、参加者がカメラ１２を背にしてホワイトボード等に文字を書きながら説明する場合、参加者が発話した音声が通信相手に送信されず会話が途切れる。本実施形態では、通信端末１０において、参加者が会議に関連した特定の動作を行っているか否かによって参加者による発話の可能性を判断する。そのため、参加者がカメラ１２に背を向けていても、特定の動作を行っていれば、参加者が発話した音声を含んだ音声データが通信相手の通信端末１０に送信され、通信相手との会話を継続させることができる。 When the participant's speech is determined only from the movement of the participant's lips, as in the first embodiment, the lip movement cannot be detected while the participant has his or her back to the camera 12, so the sound collected by the microphone 11 is not transmitted to the communication partner. Consequently, when a participant gives an explanation while writing on a whiteboard or the like with his or her back to the camera 12, the voice uttered by the participant is not transmitted to the communication partner and the conversation is interrupted. In this embodiment, the communication terminal 10 determines the possibility that the participant is speaking according to whether or not the participant is performing a specific action related to the conference. Therefore, even if the participant has his or her back to the camera 12, as long as the participant is performing the specific action, audio data containing the voice uttered by the participant is transmitted to the communication terminal 10 of the communication partner, and the conversation with the communication partner can continue.
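The mode selection of the second embodiment can be sketched as below. The enum and function names are hypothetical, and the two boolean inputs stand in for a lip-movement detector and for the trained whiteboard-writing model mentioned in the text; neither detector is implemented here.

```python
from enum import Enum


class TransmissionMode(Enum):
    VIDEO_AUDIO = "video/audio"  # video data Pa and audio data Sa are sent
    VIDEO = "video"              # only video data Pa is sent


def choose_mode(lips_moving: bool, specific_action: bool) -> TransmissionMode:
    """Pick the transmission mode from two (hypothetical) detector outputs.

    Either cue is enough to assume the participant may be speaking, so a
    participant writing on a whiteboard with their back to the camera is
    still heard.
    """
    if lips_moving or specific_action:
        return TransmissionMode.VIDEO_AUDIO
    return TransmissionMode.VIDEO
```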

＜第３実施形態＞ <Third Embodiment>
第１及び第２実施形態では、参加者が発話したか否かを映像データに基づいて判断したが、音声データに基づいて参加者が発話したか否かを判断してもよい。以下、本実施形態の通信端末の構成について具体的に説明する。 In the first and second embodiments, whether or not the participant has spoken is determined based on the video data, but this determination may instead be made based on the audio data. The configuration of the communication terminal according to this embodiment is described in detail below.

図５は、本実施形態における通信端末１０１の概略構成を示すブロック図である。図５において、第１実施形態と同じ構成には第１実施形態と同じ符号が付されている。以下、主として、第１実施形態と異なる構成について説明する。 FIG. 5 is a block diagram showing a schematic configuration of the communication terminal 101 in this embodiment. In FIG. 5, the same reference numerals as in the first embodiment are assigned to the same configurations as in the first embodiment. Hereinafter, mainly the configuration different from that of the first embodiment will be described.

図５に示すように、通信端末１０１は、第１実施形態の判断部１８２及び記憶部１６に替えて、判断部１８２１及び記憶部１６１を備える。記憶部１６１は、参加者Ｕａが発話していない状況での参加者Ｕａ周辺の音声の周波数解析結果を示す音声情報を記憶する。周波数解析結果は、予め参加者Ｕａが発話していない状況での、参加者Ｕａ周辺の音声のみを集音した音声信号をＡ／Ｄ変換した音声データ（以下、基準音声データ）について周波数解析を行った結果である。周波数解析では、例えばフーリエ変換により、基準音声データにおける周波数成分の振幅（強度）を解析する。 As shown in FIG. 5, the communication terminal 101 includes a determination unit 1821 and a storage unit 161 in place of the determination unit 182 and the storage unit 16 of the first embodiment. The storage unit 161 stores audio information indicating the result of frequency analysis of the sound around the participant Ua in a situation where the participant Ua is not speaking. The frequency analysis result is obtained in advance by performing frequency analysis on audio data (hereinafter, reference audio data) produced by A/D converting an audio signal in which only the sound around the participant Ua was collected while the participant Ua was not speaking. In the frequency analysis, the amplitude (intensity) of the frequency components in the reference audio data is analyzed by, for example, a Fourier transform.

判断部１８２１は、音声・映像信号処理部１８１から音声データＳａを取得し、音声データＳａを周波数解析する。判断部１８２１は、音声データＳａを周波数解析した結果と、記憶部１６１に記憶された音声情報とに基づいて、参加者Ｕａが発話したか否かを判断する。具体的には、判断部１８２１は、音声データＳａを周波数解析して得られる周波数スペクトルＰ１（ｆ）の振幅と、音声情報に基づく周波数スペクトルＰ２（ｆ）の振幅との差分が所定の閾値以上となる周波数の区間がある場合、参加者Ｕａが発話したと判断する。また、判断部１８２１は、周波数スペクトルＰ１（ｆ）及び周波数スペクトルＰ２（ｆ）の振幅との差分が全ての周波数において所定の閾値未満である場合、参加者Ｕａが発話していないと判断する。参加者Ｕａが発話していない場合、周波数スペクトルＰ１（ｆ）と、音声情報に基づく周波数スペクトルＰ２（ｆ）との振幅の差分は、参加者Ｕａが発話した場合の振幅の差分より小さい、つまり、所定の閾値未満となる。また、音声データＳａに周囲の別の会話等が含まれている場合であっても、参加者Ｕａが発話していれば、周囲の会話とは声質が異なるため、周波数スペクトルＰ１（ｆ）と周波数スペクトルＰ２（ｆ）との間に、所定の閾値以上の振幅の差分が現れる。 The determination unit 1821 acquires the audio data Sa from the audio/video signal processing unit 181 and performs frequency analysis on it. The determination unit 1821 determines whether or not the participant Ua has spoken based on the result of the frequency analysis of the audio data Sa and the audio information stored in the storage unit 161. Specifically, the determination unit 1821 determines that the participant Ua has spoken when there is a frequency section in which the difference between the amplitude of the frequency spectrum P1(f), obtained by frequency analysis of the audio data Sa, and the amplitude of the frequency spectrum P2(f), based on the audio information, is equal to or greater than a predetermined threshold. When the difference between the amplitudes of the frequency spectrum P1(f) and the frequency spectrum P2(f) is less than the predetermined threshold at all frequencies, the determination unit 1821 determines that the participant Ua is not speaking. When the participant Ua is not speaking, the amplitude difference between the frequency spectrum P1(f) and the frequency spectrum P2(f) based on the audio information is smaller than the amplitude difference when the participant Ua is speaking, that is, it is less than the predetermined threshold. Even when the audio data Sa contains other nearby conversations or the like, if the participant Ua is speaking, the voice quality differs from that of the surrounding conversations, so an amplitude difference equal to or greater than the predetermined threshold appears between the frequency spectrum P1(f) and the frequency spectrum P2(f).

図６Ａは、周波数スペクトルＰ１（ｆ）の波形例を示す図であり、図６Ｂは、周波数スペクトルＰ２（ｆ）の波形例を示す図である。また、図６Ｃは、周波数スペクトルＰ１（ｆ）と周波数スペクトルＰ２（ｆ）との差分（｜Ｐ１（ｆ）－Ｐ２（ｆ）｜）の波形例を示す図である。図６Ｃの波形において、所定の閾値Ｔ（ｆ）以上となる周波数の区間が含まれている。そのため、判断部１８２１は、参加者が発話していると判断する。また、例えば、図６Ｃの波形において、全ての周波数成分が閾値ＴＨ（ｆ）未満となる場合、判断部１８２１は、参加者が発話していないと判断する。 FIG. 6A shows an example waveform of the frequency spectrum P1(f), and FIG. 6B shows an example waveform of the frequency spectrum P2(f). FIG. 6C shows an example waveform of the difference (|P1(f)-P2(f)|) between the frequency spectrum P1(f) and the frequency spectrum P2(f). The waveform of FIG. 6C includes a frequency section in which the difference is equal to or greater than the predetermined threshold TH(f). The determination unit 1821 therefore determines that the participant is speaking. If, on the other hand, all frequency components in the waveform of FIG. 6C were less than the threshold TH(f), the determination unit 1821 would determine that the participant is not speaking.
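The comparison of |P1(f) - P2(f)| against the threshold can be sketched as below. This is an illustration under stated assumptions, not the implementation in the specification: it uses a naive discrete Fourier transform written with the Python standard library, treats TH(f) as one value per frequency bin, and the function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum(samples: Sequence[float]) -> List[float]:
    """Amplitude of each DFT bin from 0 up to the Nyquist bin (naive O(N^2) DFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples))) / n
        for k in range(n // 2 + 1)
    ]


def is_speaking(p1: Sequence[float], p2: Sequence[float],
                th: Sequence[float]) -> bool:
    """True if |P1(f) - P2(f)| >= TH(f) holds for at least one frequency bin.

    p1: spectrum of the sound just collected (audio data Sa)
    p2: stored spectrum of the ambient sound (reference audio data)
    th: per-bin threshold TH(f)
    """
    return any(abs(a - b) >= t for a, b, t in zip(p1, p2, th))
```

As a usage illustration with made-up signals, P2 could be computed from a recording containing only a faint hum and P1 from the same hum plus a stronger tone; the added tone then produces a bin where the difference exceeds the threshold, while comparing the hum with itself does not.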

本実施形態では、通信端末１０１において、マイク１１で集音された音声の周波数解析の結果と、予め参加者が発話していない状態での参加者の周囲の音声を周波数解析した結果とを比較して参加者が発話したか否かを判断する。通信端末１０１は、参加者が発話していないと判断した場合、そのときの音声を通信相手に送信しない。そのため、参加者が発話していない状態で周囲音のみが含まれた音声が通信相手に送信されず、周囲音によって会話が妨げられることなく、スムーズに会議を進行させることができる。 In this embodiment, in the communication terminal 101, the result of frequency analysis of the sound collected by the microphone 11 is compared with the result of frequency analysis of the sound around the participant in a state in which the participant is not speaking in advance. Then, it is determined whether or not the participant has spoken. When the communication terminal 101 determines that the participant is not speaking, the communication terminal 101 does not transmit the voice at that time to the communication partner. Therefore, the voice containing only the ambient sound is not transmitted to the communication partner in a state where the participants are not speaking, and the conversation can be smoothly progressed without being disturbed by the ambient sound.

以上、本発明に係る実施形態について説明した。但し、本発明は、上記の実施形態に限られるものではなく、その要旨を逸脱しない範囲で種々の態様において実施することが可能である。図面は、理解しやすくするために、それぞれの構成要素を主体に模式的に示しており、図示された各構成要素の厚み、長さ、個数等は、図面作成の都合上から実際とは異なる。また、上記の実施形態で示す各構成要素の形状、寸法等は一例であって、特に限定されるものではなく、本発明の効果から実質的に逸脱しない範囲で種々の変更が可能である。以下、上記実施形態の変形例を説明する。 The embodiments according to the present invention have been described above. However, the present invention is not limited to the above-described embodiments, and can be implemented in various aspects without departing from the gist of the present invention. In order to facilitate understanding, the drawings schematically show each component mainly, and the thickness, length, number, etc. of each component illustrated are different from the actual ones due to the convenience of drawing. . Further, the shape, dimensions, etc. of each component shown in the above embodiment are examples and are not particularly limited, and various modifications are possible within a range that does not substantially deviate from the effects of the present invention. Modifications of the above embodiment will be described below.

［変形例］ [Modifications]
（１）参加者が発話しているか否かの判断は、第１及び第２実施形態における映像データに基づく判断と、第３実施形態における音声データに基づく判断とを組み合わせてもよい。つまり、例えば、映像データにおいて参加者の発話中の唇の動きが検出できない場合であっても、周波数スペクトルの差分（｜Ｐ１（ｆ）－Ｐ２（ｆ）｜）が閾値ＴＨ（ｆ）以上であれば、参加者が発話していると判断してもよい。また、例えば、映像データにおいて参加者の発話中の唇の動きが検出できた場合であっても、周波数スペクトルの差分（｜Ｐ１（ｆ）－Ｐ２（ｆ）｜）が閾値ＴＨ（ｆ）未満であれば、参加者が発話していないと判断してもよい。 (1) Whether or not a participant is speaking may be determined by combining the determination based on the video data in the first and second embodiments with the determination based on the audio data in the third embodiment. For example, even when the movement of the participant's lips during speech cannot be detected in the video data, it may be determined that the participant is speaking if the frequency spectrum difference (|P1(f)-P2(f)|) is equal to or greater than the threshold TH(f). Conversely, even when the movement of the participant's lips during speech is detected in the video data, it may be determined that the participant is not speaking if the frequency spectrum difference (|P1(f)-P2(f)|) is less than the threshold TH(f).

又は、映像データにおいて参加者の発話中の唇の動きが検出でき、且つ、音声データ及び音声情報の周波数スペクトルの差分（｜Ｐ１（ｆ）－Ｐ２（ｆ）｜）が閾値ＴＨ（ｆ）以上である場合に、参加者が発話していると判断してもよい。また、映像データにおいて参加者の発話中の唇の動きが検出できず、且つ、音声データ及び音声情報の周波数スペクトルの差分（｜Ｐ１（ｆ）－Ｐ２（ｆ）｜）が閾値ＴＨ（ｆ）未満である場合に、参加者が発話していないと判断してもよい。要は、映像データ及び音声データの少なくとも一方に基づいて、参加者が発話しているか否かが判断されればよい。 Alternatively, the movement of the lips during speech of the participant can be detected in the video data, and the difference (|P1(f)-P2(f)|) between the frequency spectra of the audio data and the audio information is equal to or greater than the threshold TH(f). , it may be determined that the participant is speaking. In addition, the movement of the lips during speech of the participant cannot be detected in the video data, and the difference between the frequency spectrum of the audio data and the audio information (|P1(f)-P2(f)|) is the threshold TH(f) If less, it may be determined that the participant is not speaking. In short, whether or not the participant is speaking should be determined based on at least one of the video data and the audio data.
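The two ways of combining the cues described in modification (1) can be written out as small decision functions. This is a sketch with hypothetical names. For the second variant the text only states the outcome when both cues agree, so the sketch returns `None` for the mixed cases rather than guessing.

```python
from typing import Optional


def audio_priority(lips_moving: bool, spectrum_exceeds: bool) -> bool:
    """First variant: the spectrum comparison decides.

    Lips not detected but difference >= TH(f) -> speaking;
    lips detected but difference < TH(f) -> not speaking.
    """
    return spectrum_exceeds


def agreement(lips_moving: bool, spectrum_exceeds: bool) -> Optional[bool]:
    """Second variant: decide only when both cues agree.

    Returns None for the mixed cases, which the text leaves unspecified.
    """
    if lips_moving and spectrum_exceeds:
        return True
    if not lips_moving and not spectrum_exceeds:
        return False
    return None
```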

（２）第３実施形態において、例えば、マイク１１で集音された音声に含まれる文言が会議に関連する文言である場合、参加者が発話していると判断してもよい。会議に関連する文言は、会議の参加者の名前や会議のタイトル名等でもよい。この場合、予め、会議の参加者の名前、及び会議のタイトル名等のキーワード情報を記憶部１６１に記憶しておく。判断部１８２１は、マイク１１で集音された音声の音声データを形態素解析し、キーワード情報と照合してもよい。キーワード情報と一致する文言が音声データに含まれている場合、判断部１８２１は、参加者が発話していると判断し、出力制御部１８３から音声データを通信相手に送信してもよい。 (2) In the third embodiment, for example, if the words included in the sound collected by the microphone 11 are words related to the conference, it may be determined that the participant is speaking. The wording related to the conference may be the names of the participants of the conference, the title of the conference, and the like. In this case, keyword information such as names of participants in the conference and the title of the conference is stored in advance in the storage unit 161 . The determination unit 1821 may perform morphological analysis on voice data of voice collected by the microphone 11 and compare it with keyword information. If the voice data contains words that match the keyword information, the determination unit 1821 may determine that the participant is speaking, and the output control unit 183 may transmit the voice data to the communication partner.
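The keyword check in modification (2) can be sketched as follows, assuming speech recognition and morphological analysis have already turned the collected sound into a list of word tokens. That preprocessing is not shown, and the function name and the example keywords are hypothetical.

```python
from typing import Iterable, Set


def mentions_keyword(tokens: Iterable[str], keywords: Set[str]) -> bool:
    """True if any token from the analyzed speech matches a stored keyword.

    tokens:   words obtained from the collected sound (assumed to be
              recognized and segmented beforehand)
    keywords: keyword information such as participant names or the
              conference title, stored in advance
    """
    return any(token in keywords for token in tokens)


# Hypothetical keyword information for illustration only.
keywords = {"Tanaka", "roadmap"}
speaking = mentions_keyword(["the", "roadmap", "is", "ready"], keywords)
```

Matching whole tokens keeps the sketch simple; a multi-word conference title would need phrase matching, which is outside what the text describes.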

（３）第１実施形態の判断部１８２において、映像データにおける参加者の唇の動きだけでなく、参加者の視線の向きに基づいて、参加者が発話しているか否かを判断してもよい。この場合、判断部１８２は、参加者の唇の動きだけでなく、公知の技術を用いて、映像データに映る参加者の目の動きを検出することにより視線の向きを特定する。映像データに発話中の参加者の唇の動きが検出できた場合であっても、参加者の視線の向きが、例えばカメラ１２に向かう方向ではない場合、参加者は会議に関する発言を行っていない可能性がある。この場合、判断部１８２において、参加者が会議に関して発話していないと判断してもよい。 (3) The determination unit 182 of the first embodiment may determine whether or not the participant is speaking based not only on the movement of the participant's lips in the video data but also on the direction of the participant's line of sight. In this case, in addition to detecting the lip movement, the determination unit 182 uses a known technique to detect the eye movement of the participant appearing in the video data and thereby identifies the direction of the line of sight. Even when the lip movement of a speaking participant is detected in the video data, if the participant's line of sight is not directed toward, for example, the camera 12, the participant may not be making a remark related to the conference. In this case, the determination unit 182 may determine that the participant is not speaking about the conference.

（４）第１～第３実施形態において、参加者が発話しているか否かの判断を、映像データ又は音声データに加えて、表示部１７に表示される画像を用いて行ってもよい。例えば、映像データに発話中の参加者の唇の動きが検出できない場合であっても、操作部１５を介して会議に関連する文書等の画像を表示部１７に表示させる操作を受け付けた場合、参加者が発話する可能性が高い。この場合、判断部１８２は、送信モードとして映像音声モードを設定してもよい。会議に関連する文書等の画像が表示部１７に表示されているか否かは、例えば、予め設定された会議のタイトル等のキーワードを予め設定し、表示された文書等に所定のキーワードが含まれているか否かによって判断してもよい。 (4) In the first to third embodiments, whether or not a participant is speaking may be determined using the image displayed on the display unit 17 in addition to the video data or the audio data. For example, even when the lip movement of a speaking participant cannot be detected in the video data, the participant is likely to speak if an operation to display an image such as a document related to the conference on the display unit 17 is received via the operation unit 15. In this case, the determination unit 182 may set the video/audio mode as the transmission mode. Whether or not an image such as a document related to the conference is displayed on the display unit 17 may be determined, for example, by setting keywords such as the conference title in advance and checking whether the displayed document or the like contains any of those keywords.

（５）第１～第３実施形態において、カメラ１２及びマイク１１は通信端末１０、１０１に搭載されていなくてもよい。カメラ１２及びマイク１１は、通信端末１０、１０１とは別に設けられ、通信端末１０、１０１と電気的に接続されていればよい。 (5) In the first to third embodiments, the camera 12 and the microphone 11 may not be mounted on the communication terminals 10 and 101. The camera 12 and the microphone 11 may be provided separately from the communication terminals 10 and 101 and electrically connected to the communication terminals 10 and 101 .

（６）実施形態で示した各ブロック（各機能部）（図２及び図５）は、電気的に接続された複数のコンピュータ装置に分散して搭載された通信システムとして実現されてもよい。また、実施形態で示した各ブロック（各機能部）（図２及び図５）は、ＬＳＩなどの半導体装置により個別に１チップ化されても良いし、一部又は全部を含むように１チップ化されても良い。また、実施形態で示した各機能部の処理の一部又は全部は、プログラムにより実現されてもよい。 (6) The blocks (functional units) shown in the embodiments (FIGS. 2 and 5) may be realized as a communication system in which they are distributed across a plurality of electrically connected computer devices. The blocks (functional units) shown in the embodiments (FIGS. 2 and 5) may each be implemented as an individual chip using a semiconductor device such as an LSI, or some or all of them may be integrated into a single chip. Part or all of the processing of each functional unit shown in the embodiments may also be implemented by a program.

（７）通信端末１０の機能の一部をサーバ２０が備えるようにしてもよい。具体的には、サーバ２０は、音声・映像信号処理部１８１、判断部１８２及び出力制御部１８３と同等の機能を備えてもよい。サーバ２０は、通信端末１０から音声データと映像データとを取得し、取得した音声データ及び映像データの少なくとも一方に基づいて、通信端末１０を利用する参加者が発話したか否かを判断する。参加者が発話したと判断された場合に、取得した音声データを他の通信端末１０へ出力し、参加者が発話していないと判断された場合に、取得した音声データを他の通信端末１０へ出力しないようにする。つまり、サーバ２０は、通信サーバの一例である。 (7) The server 20 may provide part of the functions of the communication terminal 10. Specifically, the server 20 may have functions equivalent to those of the audio/video signal processing unit 181, the determination unit 182, and the output control unit 183. The server 20 acquires audio data and video data from the communication terminal 10 and, based on at least one of the acquired audio data and video data, determines whether or not the participant using the communication terminal 10 has spoken. When it is determined that the participant has spoken, the server 20 outputs the acquired audio data to the other communication terminals 10; when it is determined that the participant has not spoken, the server 20 does not output the acquired audio data to the other communication terminals 10. In other words, the server 20 is an example of a communication server.
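The server-side variant in modification (7) can be sketched as a routing function. The names and the data layout are hypothetical, and the `spoke` callback stands in for whichever determination (video-based, audio-based, or combined) the server applies. Forwarding the video regardless of the result is an assumption carried over from the first embodiment; modification (7) itself only specifies the handling of the audio data.

```python
from typing import Callable, Dict, List, Optional, Tuple

Payload = Tuple[bytes, Optional[bytes]]  # (video, audio or None)


def route(sender: str, video: bytes, audio: bytes, terminals: List[str],
          spoke: Callable[[bytes, bytes], bool]) -> Dict[str, Payload]:
    """Build what the server forwards to every terminal except the sender.

    spoke(audio, video) is a hypothetical determination function that
    returns True when the sender's participant is judged to be speaking.
    """
    out_audio = audio if spoke(audio, video) else None  # gate only the audio
    return {t: (video, out_audio) for t in terminals if t != sender}
```

Gating on the server means the terminals can stay unchanged, at the cost of always uploading the audio from each terminal to the server.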

本発明は、テレビ会議やＷＥＢ会議等の遠隔会議に利用可能である。 INDUSTRIAL APPLICABILITY The present invention can be used for remote conferences such as video conferences and web conferences.

１：会議システム 1: conference system
１０、１０Ａ、１０Ｂ、１０１：通信端末 10, 10A, 10B, 101: communication terminal
１１：マイク 11: microphone
１２：カメラ 12: camera
１３：スピーカ 13: speaker
１４、２１：通信部 14, 21: communication unit
１５：操作部 15: operation unit
１６、１６１、２３：記憶部 16, 161, 23: storage unit
１７：表示部 17: display unit
１８、２２：制御部 18, 22: control unit
２０：サーバ 20: server
１８１：音声・映像信号処理部 181: audio/video signal processing unit
１８２、１８２１：判断部 182, 1821: determination unit
１８３：出力制御部 183: output control unit
１８４：表示制御部 184: display control unit

Claims

A communication terminal that is used by one participant and is one of a plurality of communication terminals used by a plurality of participants participating in a remote conference via a communication line, the communication terminal comprising:
a communication unit that communicates with the other communication terminal via the communication line;
a sound collecting unit that collects sounds around the communication terminal during the remote conference;
an image acquisition unit that acquires a photographed image of the one participant during the remote conference;
a determination unit that determines whether or not the one participant has spoken based on at least one of the sound collected by the sound collection unit and the captured image obtained by the image acquisition unit;
and an output control unit that, when the determination unit determines that the one participant has spoken, controls the communication unit to output voice data representing the sound to the communication line and, when the determination unit determines that the one participant has not spoken, controls the communication unit not to output the voice data to the communication line.

The communication terminal according to claim 1, wherein the determination unit determines whether or not the one participant speaks based on movement of the lips of the one participant appearing in the captured image.

The communication terminal according to claim 1 or 2, wherein the determination unit determines whether or not the one participant speaks based on the movement of the one participant appearing in the captured image.

The communication terminal according to any one of claims 1 to 3, wherein the determination unit determines whether or not the one participant has spoken based on frequency components of the sound collected by the sound collection unit and frequency components obtained by frequency analysis, performed in advance, of the sound around the one participant in a state in which the one participant is not speaking.

A communication system that controls output of the voice of one participant among a plurality of participants participating in a remote conference via a communication line, the communication system comprising:
a communication unit communicatively connected to the communication line;
a sound collecting unit that collects the sound of the one participant during the remote conference;
an image acquisition unit that acquires a photographed image of the one participant during the remote conference;
a determination unit that determines whether or not the one participant has spoken based on at least one of the sound collected by the sound collection unit and the captured image obtained by the image acquisition unit;
and an output control unit that, when it is determined that the one participant has spoken, controls the communication unit to output audio data representing the sound to the other participants and, when it is determined that the one participant has not spoken, controls the communication unit not to output the audio data to the other participants.

A communication server communicatively connected, via a communication line, to a plurality of communication terminals used by a plurality of participants participating in a remote conference, the communication server comprising:
a sound acquisition unit that acquires, for each communication terminal, sound collected by the communication terminal during the remote conference;
an image acquisition unit that acquires, for each of the communication terminals, a photographed image of a participant photographed during the remote conference;
a determination unit that determines whether or not the participant using the communication terminal has spoken based on at least one of the acquired sound and the captured image for each communication terminal;
and an output control unit that, when the determination unit determines that the participant has spoken, outputs audio data representing the acquired sound to another communication terminal via the communication line and, when the determination unit determines that the participant has not spoken, performs control so that the audio data is not output to the other communication terminal via the communication line.