JP2022114044A

JP2022114044A - Remote conference device, remote conference method, remote conference terminal and remote conference system

Info

Publication number: JP2022114044A
Application number: JP2021010152A
Authority: JP
Inventors: 一浩片桐; Kazuhiro Katagiri
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2021-01-26
Filing date: 2021-01-26
Publication date: 2022-08-05

Abstract

To provide a technology that enables a remote conference with a plurality of participants to proceed more smoothly. SOLUTION: A remote conference device is provided, comprising: a speech detection unit which detects speech from sound data received from a first remote conference terminal; a speech information generating unit which, on the basis of the detection of the speech from the sound data, inserts speech start data indicating that the speech has started in front of the sound data; a speed adjustment unit which adjusts a first speed of the sound data on the basis of the insertion of the speech start data; and a transmission control unit which controls a data transmission unit to transmit the speech start data to a second remote conference terminal and then to transmit the sound data whose first speed has been adjusted to the second remote conference terminal. SELECTED DRAWING: Figure 1

Description

The present invention relates to a remote conference device, a remote conference method, a remote conference terminal, and a remote conference system.

In recent years, with the development of ICT (Information and Communication Technology) and changes in work styles, the establishment of small satellite offices and the shift to telework have been progressing rapidly. When workers are dispersed across remote locations, they may not be able to know each other's situation in as much detail as when they are gathered in one office. As a result, it may become difficult for each worker to communicate smoothly with other workers in remote locations.

Therefore, a telework system has been proposed in which multiple offices in separate locations are interconnected by video, sound, and information from various sensors, enabling smooth communication with users in remote locations (for example, see Non-Patent Document 1).

In this system, cameras and microphones are installed at multiple locations in an office, and the video data and sound data obtained from these cameras and microphones are transmitted to another office away from that office. A user in the other office can freely switch the camera of interest among the plurality of cameras installed at the remote location, and each time the user switches cameras, the sound collected by the microphone installed near the newly selected camera is played back.

This allows the user to know the situation at the remote location in real time. Users in separate locations can then communicate with an understanding of each other's situations.

Patent Document 1: JP 2013-017027 A
Patent Document 2: JP 2012-215600 A
Patent Document 3: JP 2017-135669 A

Non-Patent Document 1: Nonaka et al., "Office Communication System Using Multiple Image/Sound/Sensor Information", Human Interface Society Research Report, Vol. 13, No. 10, 2011.
Non-Patent Document 2: Kano et al., "Study of sound image localization for making remote multi-person conversation easier to understand", Information Processing Society of Japan Research Report, Vol. 2014-GN-91, No. 45, 2014.
Non-Patent Document 3: Morita, Itakura, "Expansion and Compression of Speech on the Time Axis Using Overlapping Addition Method (PICOLA) with Pointer Movement Amount Control and Its Evaluation", Proceedings of the Acoustical Society of Japan, 1986.

However, it is desired to provide a technology that enables a remote conference with a plurality of participants to proceed more smoothly.

In order to solve the above problem, according to one aspect of the present invention, there is provided a remote conference device comprising: a speech detection unit that detects speech from sound data received from a first remote conference terminal; a speech information generation unit that, based on the detection of the speech from the sound data, inserts speech start data indicating that the speech has started before the sound data; a speed adjustment unit that adjusts a first speed of the sound data based on the insertion of the speech start data; and a transmission control unit that controls a data transmission unit to transmit the speech start data to a second remote conference terminal and then controls the data transmission unit to transmit the sound data whose first speed has been adjusted to the second remote conference terminal.

The speech information generation unit may, based on the detection of the speech, insert the speech start data before the sound data and before video data received together with the sound data from the first remote conference terminal. The speed adjustment unit may, based on the insertion of the speech start data, adjust the first speed of the sound data and adjust a second speed of the video data so that it matches the adjusted first speed. The transmission control unit may control the data transmission unit to transmit the speech start data to the second remote conference terminal and then to transmit, to the second remote conference terminal, the sound data whose first speed has been adjusted and the video data whose second speed has been adjusted.

The speed adjustment unit may adjust the first speed by making the first speed higher than it was before the insertion of the speech start data until the delay of the sound data caused by the insertion of the speech start data is eliminated.

The speed adjustment unit may adjust the first speed by keeping the first speed higher than it was before the insertion of the speech start data while gradually lowering it until the delay is eliminated.
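The two speed-adjustment variants above can be illustrated with a little arithmetic. The following is only a sketch under stated assumptions, not the method of the publication: the function names, the 0.1 s step, and the linear shape of the ramp are all hypothetical. It treats the "first speed" as a playback-rate factor, where 1.0 is the speed before insertion. At a constant factor r > 1, each second of wall-clock time absorbs r - 1 seconds of delay, so a delay D is eliminated after D / (r - 1) seconds.

```python
import math


def catch_up_time(delay_s: float, speed: float) -> float:
    """Wall-clock seconds needed to absorb delay_s at a constant speed factor > 1."""
    if speed <= 1.0:
        raise ValueError("speed must exceed 1.0 to recover any delay")
    return delay_s / (speed - 1.0)


def ramp_schedule(delay_s: float, start_speed: float, step_s: float = 0.1) -> list[float]:
    """Per-step speed factors that start at start_speed and fall toward 1.0.

    Assumption: a linear ramp. A linear ramp absorbs (start_speed - 1) * T / 2
    seconds over T seconds, so T = 2 * delay_s / (start_speed - 1).
    """
    if start_speed <= 1.0:
        raise ValueError("start_speed must exceed 1.0")
    total_s = 2.0 * delay_s / (start_speed - 1.0)
    steps = max(1, math.ceil(total_s / step_s))
    schedule: list[float] = []
    remaining = delay_s
    for k in range(steps):
        speed = 1.0 + (start_speed - 1.0) * (1.0 - k / steps)
        gain = (speed - 1.0) * step_s  # delay absorbed during this step
        if gain >= remaining:
            # Final step: use only as much extra speed as is still needed.
            schedule.append(1.0 + remaining / step_s)
            break
        schedule.append(speed)
        remaining -= gain
    return schedule
```

For example, `catch_up_time(0.5, 1.25)` gives 2.0 seconds. The schedule only decides how fast to play at each moment; actually changing the speed of speech without altering pitch would need a time-scale modification technique such as the one cited as Non-Patent Document 3.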

The speed adjustment unit may adjust the first speed by deleting a part of the sound data that follows the insertion of the speech start data.

The speed adjustment unit may adjust the first speed by deleting the sound data of a non-speech section from the sound data that follows the insertion of the speech start data, and by filling the section vacated by the deletion with the sound data that follows the non-speech section.
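The deletion-based variant can be sketched as follows. This is a hedged illustration only: the frame representation, the per-frame `is_speech` flags, and the function name `drop_silence` are assumptions introduced here, and the speech/non-speech decision is taken as given.

```python
from typing import Sequence, TypeVar

Frame = TypeVar("Frame")


def drop_silence(
    frames: Sequence[Frame],
    is_speech: Sequence[bool],
    frames_to_recover: int,
) -> tuple[list[Frame], int]:
    """Delete up to frames_to_recover non-speech frames.

    Frames after a deleted one simply move forward in the output list, which
    corresponds to filling the vacated section with the sound data behind it.
    Returns the kept frames and the number of frames actually removed.
    """
    kept: list[Frame] = []
    removed = 0
    for frame, speech in zip(frames, is_speech):
        if not speech and removed < frames_to_recover:
            removed += 1  # skip this non-speech frame
            continue
        kept.append(frame)
    return kept, removed
```

Because only non-speech frames are dropped, the speech itself plays at its original speed. Deletion stops once the requested number of frames has been recovered, so later pauses are left untouched.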

According to another aspect of the present invention, there is provided a remote conference method comprising: detecting speech from sound data received from a first remote conference terminal; inserting, based on the detection of the speech from the sound data, speech start data indicating that the speech has started before the sound data; adjusting a first speed of the sound data based on the insertion of the speech start data; and controlling a data transmission unit to transmit the speech start data to a second remote conference terminal and then controlling the data transmission unit to transmit the sound data whose first speed has been adjusted to the second remote conference terminal.

According to another aspect of the present invention, there is provided a remote conference terminal comprising: an acquisition unit that acquires, from a remote conference device, speech start data, which is data indicating that speech has started and which was inserted before sound data transmitted from another remote conference terminal based on the detection of the speech from that sound data, and that acquires, from the remote conference device, the sound data whose first speed has been adjusted based on the insertion of the speech start data; and a presentation control unit that, based on the acquisition of the speech start data, controls a presentation unit to present presentation data indicating that the speech has started, and controls the presentation unit to present the sound data whose first speed has been adjusted.

According to another aspect of the present invention, the acquisition unit may acquire, from the remote conference device, the speech start data inserted, based on the detection of the speech, before the sound data and before video data transmitted together with the sound data from the other remote conference terminal, and may acquire, from the remote conference device, the sound data whose first speed has been adjusted based on the insertion of the speech start data and the video data whose second speed has been adjusted to match the adjusted first speed. The presentation control unit may, based on the acquisition of the speech start data, control the presentation unit to present the presentation data, and may control the presentation unit to present the sound data whose first speed has been adjusted and the video data whose second speed has been adjusted.

The presentation data may include predetermined image data, and the presentation control unit may control a monitor so that the predetermined image data is displayed as the presentation data.

The presentation data may include predetermined sound data, and the presentation control unit may control a speaker so that the predetermined sound data is output as the presentation data.
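On the receiving terminal, the behavior described above amounts to a small dispatch loop: present a cue when speech start data arrives, then present the speed-adjusted sound data that follows. The sketch below assumes a hypothetical stream of `(kind, payload)` tuples and hypothetical callbacks standing in for the monitor and speaker; none of these names come from the publication.

```python
from typing import Callable, Iterable, Optional


def present_stream(
    packets: Iterable[tuple[str, bytes]],
    play_audio: Callable[[bytes], None],
    show_image: Optional[Callable[[], None]] = None,
    play_cue: Optional[Callable[[], None]] = None,
) -> None:
    """Present a speech-start cue, then the sound data that follows it.

    show_image stands in for displaying the predetermined image on a monitor;
    play_cue stands in for outputting the predetermined sound from a speaker.
    Either or both may be supplied.
    """
    for kind, payload in packets:
        if kind == "speech_start":
            if show_image is not None:
                show_image()
            if play_cue is not None:
                play_cue()
        elif kind == "audio":
            play_audio(payload)
```

Because the cue is handled before any of the speaker's audio, the other participants learn that someone has begun speaking slightly earlier than they hear the speech itself.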

According to another aspect of the present invention, there is provided a remote conference method comprising: acquiring, from a remote conference device, speech start data, which is data indicating that speech has started and which was inserted before sound data transmitted from another remote conference terminal based on the detection of the speech from that sound data, and acquiring, from the remote conference device, the sound data whose first speed has been adjusted based on the insertion of the speech start data; and controlling, based on the acquisition of the speech start data, a presentation unit to present presentation data indicating that the speech has started, and controlling the presentation unit to present the sound data whose first speed has been adjusted.

According to another aspect of the present invention, there is provided a remote conference system comprising a first remote conference terminal, a remote conference device, and a second remote conference terminal. The first remote conference terminal comprises a transmission control unit that controls a data transmission unit to transmit sound data. The remote conference device comprises: a speech detection unit that detects speech from the sound data received from the first remote conference terminal; a speech information generation unit that, based on the detection of the speech from the sound data, inserts speech start data indicating that the speech has started before the sound data; a speed adjustment unit that adjusts a first speed of the sound data based on the insertion of the speech start data; and a transmission control unit that controls a data transmission unit to transmit the speech start data to the second remote conference terminal and then controls the data transmission unit to transmit the sound data whose first speed has been adjusted to the second remote conference terminal. The second remote conference terminal comprises: an acquisition unit that acquires the speech start data from the remote conference device and acquires the sound data whose first speed has been adjusted from the remote conference device; and a presentation control unit that, based on the acquisition of the speech start data, controls a presentation unit to present presentation data indicating that the speech has started, and controls the presentation unit to present the sound data whose first speed has been adjusted.

As described above, according to the present invention, a technology is provided that enables a remote conference with a plurality of participants to proceed more smoothly.

FIG. 1 is a block diagram showing an example of the configuration of a remote conference system according to an embodiment of the present invention.
A further drawing is a block diagram showing an example of the hardware configuration of a device according to the embodiment of the present invention.

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings. In this specification and the drawings, constituent elements having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.

In addition, in this specification and the drawings, a plurality of constituent elements having substantially the same or similar functional configurations are distinguished by appending different letters to the same reference numeral. However, when there is no particular need to distinguish between such constituent elements, only the same reference numeral is used.

(0. Background)
First, the background of the embodiments of the present invention will be described. As described above, Non-Patent Document 1 proposes a telework system in which multiple offices in separate locations are interconnected by video, sound, and information from various sensors, enabling smooth communication with users in remote locations. With the method described in Non-Patent Document 1, the user can experience the current conditions at various places in a remote location with a rich sense of presence.

However, when the user teleworks (for example, when the user works from home), it may be difficult to install the equipment needed to create a sense of presence. Alternatively, network bandwidth may be limited. Because such problems can arise, when a plurality of participants hold a conference, it is common to use a system that aggregates the video and audio of each participant (a remote conference system).

However, when many people participate in a remote conference, it may be difficult for each participant to know the situation and mood of the other participants. This is especially likely when other participants have turned off video transmission to conserve network bandwidth.

In such a situation, the time intervals of the speech of two or more participants may at least partly overlap (hereinafter also simply referred to as "overlap of speech sections"). This may make the voice of each of those participants (the speech sound) difficult to hear. In addition, each time speech sections overlap, the participants involved may each interrupt their speech, so that the remote conference no longer proceeds smoothly.

To address this problem, Patent Document 1 proposes a method of applying sound image localization processing to the participants' voices so that the speech of two or more participants is heard from different directions. With this method, the speech of each participant becomes easier to hear even if speech sections overlap. The benefit of giving a sense of direction to how voices are heard is explained in Non-Patent Document 2.

However, the number of directions in which a sound image can be localized (the resolution of the sound image localization processing) is limited, so if many participants speak at the same time, their individual voices still become difficult to hear. Moreover, even when only a few participants speak at the same time, it remains difficult to tell apart voices uttered simultaneously by two or more participants. Therefore, even if a listener manages to follow speech uttered simultaneously by a small number of participants, doing so may be stressful.

Therefore, this specification mainly proposes a technology for reducing the possibility that the speech sections of a plurality of participants overlap. Reducing that possibility allows a remote conference with a plurality of participants to proceed more smoothly.

(1. Details of embodiment)
Next, details of the embodiment of the present invention will be described.

(1-1. System configuration)
FIG. 1 is a block diagram showing an example of the configuration of a remote conference system according to an embodiment of the present invention. Referring to FIG. 1, a remote conference system 1 according to the embodiment of the present invention has a remote conference server 10, N remote conference clients 20 (N is an integer equal to or greater than 2), and a network 40. The remote conference server 10 and the N remote conference clients 20 are connected to the network 40 and can communicate with each other via the network 40. The remote conference server 10 can function as an example of a remote conference device. The remote conference client 20 can function as an example of a remote conference terminal.

Each of the N remote conference clients 20 is used by a corresponding participant (user). As an example, remote conference clients 20 and users correspond one to one. In the following description, when the N remote conference clients 20 need to be distinguished, they are referred to as remote conference clients 20-1 to 20-N.

The embodiment of the present invention mainly assumes that the remote conference client 20 is a PC (Personal Computer). However, the form of the remote conference client 20 is not limited. For example, the remote conference client 20 may be a mobile terminal such as a smartphone, or a voice input/output device installed in a home or similar setting. The voice input/output device may be, for example, an AI (Artificial Intelligence) speaker.

(1-2. Functional configuration of remote conference server)
Referring to FIG. 1, the remote conference server 10 includes a data reception unit 110, a decoder 120, a speech detection unit 130, a speech information generation unit 140, a media speed adjustment unit 150, an encoder 160, and a data transmission unit 170. The media speed adjustment unit 150 can function as an example of a speed adjustment unit. The encoder 160 can function as an example of a transmission control unit.

(1-3. Functional configuration of remote conference client)
Referring to FIG. 1, the remote conference client 20 includes an encoder 210, a data transmission unit 220, a data reception unit 230, a decoder 240, and a video/audio reproduction unit 250. A microphone 31 and a camera 32 are connected to the encoder 210. The encoder 210 can function as an example of an acquisition unit. In the following description, when the N remote conference clients 20 need to be distinguished, the microphones 31 and cameras 32 connected to their respective encoders 210 are referred to as microphones 31-1 to 31-N and cameras 32-1 to 32-N.

A speaker 33 and a monitor 34 are connected to the video/audio reproduction unit 250. The speaker 33 and the monitor 34 can each function as an example of a presentation unit. The video/audio reproduction unit 250 can function as an example of a presentation control unit. In the following description, when the N remote conference clients 20 need to be distinguished, the speakers 33 and monitors 34 connected to their respective video/audio reproduction units 250 are referred to as speakers 33-1 to 33-N and monitors 34-1 to 34-N.

(1-4. Operation example of remote conference system)
Next, an operation example of the remote conference system 1 according to the embodiment of the present invention will be described with reference to FIG. 1.

The following description assumes that the sound data transmission function is turned ON in the remote conference client 20-1. In that case, sound data obtained by the microphone 31-1 of the remote conference client 20-1 is transmitted via the remote conference server 10 to the remote conference clients 20-2 to 20-N other than the remote conference client 20-1, and is output by the speakers 33-2 to 33-N. As a result, the participant using the remote conference client 20-1 can make their speech heard by the other participants using the remote conference clients 20-2 to 20-N.

Similarly, it is assumed that the sound data transmission function is turned ON in each of the remote conference clients 20-2 to 20-N. As a result, the participants using the remote conference clients 20-2 to 20-N can make their speech heard by the other participants. The conference proceeds as the participants listen to each other's speech in this way. However, the sound data transmission function does not have to be turned ON in all of the remote conference clients 20-1 to 20-N. For example, one or more of the remote conference clients 20-1 to 20-N may have the sound data transmission function turned OFF.

The following description also assumes that the video data transmission function is turned ON in the remote conference client 20-1. In that case, video data obtained by the camera 32-1 of the remote conference client 20-1 is transmitted via the remote conference server 10 to the remote conference clients 20-2 to 20-N other than the remote conference client 20-1, and is output by the monitors 34-2 to 34-N. As a result, the participant using the remote conference client 20-1 can show video of themselves to the other participants using the remote conference clients 20-2 to 20-N. The video data obtained by the camera 32-1 of the remote conference client 20-1 may also be transmitted back to the remote conference client 20-1 via the remote conference server 10 and output by the monitor 34-1.

Similarly, it is assumed that the video data transmission function is turned ON in each of the remote conference clients 20-2 to 20-N. As a result, the participants using the remote conference clients 20-2 to 20-N can show video of themselves to the other participants. Sharing video among the participants in this way makes it easier for each participant to grasp the situation and mood of the others. However, the video data transmission function does not have to be turned ON in all of the remote conference clients 20-1 to 20-N. For example, some or all of the remote conference clients 20-1 to 20-N may have the video data transmission function turned OFF.

To simplify the explanation, the following mainly describes an example in which sound data and video data are transmitted from the remote conference client 20-1 to each of the remote conference clients 20-2 to 20-N via the remote conference server 10. In this example, the remote conference client 20-1 functions as the first remote conference terminal, and each of the remote conference clients 20-2 to 20-N functions as the second remote conference terminal. The same applies, however, when sound data and video data are transmitted to the remote conference server 10 from the remote conference clients 20-2 to 20-N.

(Encoder 210)
First, in the remote conference client 20-1, the encoder 210 encodes the sound data obtained by the microphone 31-1 and the video data obtained by the camera 32-1 using an arbitrary codec.

(Data transmission unit 220)
The data transmission unit 220 transmits the video data and sound data encoded by the encoder 210 to the remote conference server 10 via the network 40.

(Data reception unit 110)
In the remote conference server 10, the data reception unit 110 receives the encoded video data and sound data from the remote conference client 20-1 via the network 40.

(Decoder 120)
The decoder 120 decodes the encoded video data and sound data using an arbitrary codec to obtain decoded video data and sound data.

(Speech detection unit 130)
The speech detection unit 130 attempts to detect speech in the sound data decoded by the decoder 120. More specifically, the speech detection unit 130 detects, based on the decoded sound data, whether a participant has started speaking.

Note that the speech detection method is not limited to any specific method. As one example, voice activity detection may be used, with speech determined to have started when the section determined from the sound data changes from a non-speech section, such as noise, to a speech section. The speech section may be determined, for example, based on fluctuations in amplitude or changes in acoustic features, or by the method disclosed in Patent Document 2.
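The amplitude-based variant mentioned above can be sketched as follows. This is only an illustrative sketch, not the method of Patent Document 2: the function names, the RMS threshold of 0.1, and the requirement of three consecutive loud frames are all assumptions introduced for the example.

```python
import math


def frame_rms(samples):
    """Root-mean-square amplitude of one frame of PCM samples (floats)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_speech_start(frames, threshold=0.1, min_speech_frames=3):
    """Return the index of the first frame of the first run of at least
    min_speech_frames consecutive frames whose RMS exceeds threshold.

    Returns None when no such run exists. threshold and
    min_speech_frames are hypothetical tuning values.
    """
    run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) > threshold:
            run += 1
            if run == min_speech_frames:
                return i - min_speech_frames + 1
        else:
            run = 0  # a quiet frame ends the current run
    return None
```

Requiring several consecutive loud frames is one simple way to avoid treating a short noise burst as the start of speech; a practical detector would likely also use acoustic features, as the text notes.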

(Speech information generation unit 140)
When the speech detection unit 130 detects speech in the sound data (that is, when the start of speech is detected), the speech information generation unit 140 inserts speech start data, which indicates that speech has started, before the sound data and video data. The speech start data can serve as a trigger signal that causes the teleconference clients 20-2 to 20-N to present presentation data indicating that speech has started. Examples of presentation data are described in detail later.

Note that, when no video data is transmitted from the teleconference client 20-1, the speech information generation unit 140 need only insert the speech start data before the sound data when the speech detection unit 130 detects speech in the sound data. The temporal length of the speech start data is desirably set so that the delay of the sound data does not become too large (for example, between 100 milliseconds and 2000 milliseconds).

(Media speed adjustment unit 150)
The media speed adjustment unit 150 adjusts the speed of the sound data (first speed) based on the insertion of the speech start data by the speech information generation unit 140. More specifically, inserting the speech start data before the sound data delays the sound data by at least the temporal length of the speech start data. The media speed adjustment unit 150 therefore desirably adjusts the speed of the sound data by making it higher than before the insertion of the speech start data (that is, by playing the sound data at faster than normal speed) until the delay is eliminated (that is, until the sound data has recovered a delay equal to the temporal length of the speech start data).
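How long the speed-up must last follows from simple arithmetic: at a constant speed factor r greater than 1, each second of playback consumes r seconds of content, so the delay shrinks by (r - 1) seconds per second and a delay D is recovered after D / (r - 1) seconds. The helper below is a hypothetical illustration of that relation, assuming a constant speed factor; it is not taken from the specification.

```python
def catch_up_seconds(delay_s, speed):
    """Wall-clock seconds needed to recover delay_s seconds of delay when
    playing at a constant speed factor (for example 1.5 for 1.5x).

    Assumes a constant speed factor greater than 1.
    """
    if speed <= 1.0:
        raise ValueError("speed must be greater than 1.0 to recover any delay")
    return delay_s / (speed - 1.0)
```

For instance, under this model a 0.5 second speech start marker is recovered after 1 second of playback at 1.5x, and after about 5 seconds at 1.1x.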

Here, the method of adjusting the speed of the sound data is not limited to any specific method. For example, a speech rate conversion technique that changes the speaking rate while preserving each speaker's voice quality, typified by PICOLA (Pointer Interval Control OverLap and Add) described in Non-Patent Document 3, may be used.

また、音データの速度を具体的にどのようなタイミングでどの程度の高さに調整するかも限定されない。例えば、音データの速度を高めすぎてしまうと音を聞いている参加者に違和感を与えてしまう可能性がある。そこで、メディア速度調整部１５０は、音データの速度をある程度以下に抑えてもよい。一例として、メディア速度調整部１５０は、調整後の音データの速度が、発話開始データ挿入前の音データの速度の１．１倍から１．５倍の間に収まるように音データの速度を調整してもよい。 In addition, there are no specific restrictions on when and to what level the speed of the sound data is adjusted. For example, if the speed of the sound data is increased too much, the participants listening to the sound may feel uncomfortable. Therefore, the media speed adjustment unit 150 may suppress the speed of sound data to a certain level or less. As an example, the media speed adjustment unit 150 adjusts the speed of the sound data so that the speed of the sound data after adjustment is between 1.1 and 1.5 times the speed of the sound data before the speech start data is inserted. may be adjusted.

When the delay of the sound data has been eliminated, the media speed adjustment unit 150 returns the speed of the sound data to its speed before the insertion of the speech start data. If the speed returns abruptly at this point, participants listening to the sound may find it unnatural. The media speed adjustment unit 150 may therefore lower the speed of the sound data gradually, while keeping it higher than before the insertion of the speech start data, until the delay is eliminated.

As one example, the media speed adjustment unit 150 may set the speed of the sound data immediately after the speech start data to 1.5 times its speed before the insertion of the speech start data. Then, as the delay of the sound data decreases, the media speed adjustment unit 150 may gradually lower the speed to 1.3 times and then 1.1 times the speed before the insertion.
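The stepped schedule of 1.5x, 1.3x, and 1.1x can be sketched as a function of the delay still outstanding. The switching points (two thirds and one third of the initial delay) and the 20 millisecond tick are assumptions made for this illustration; the text does not specify when each step is taken.

```python
def speed_for_remaining_delay(remaining_s, initial_delay_s):
    """Choose a speed factor from the fraction of the delay still outstanding.

    The 2/3 and 1/3 switching points are hypothetical.
    """
    if remaining_s <= 0:
        return 1.0  # delay eliminated: back to normal speed
    fraction = remaining_s / initial_delay_s
    if fraction > 2 / 3:
        return 1.5
    if fraction > 1 / 3:
        return 1.3
    return 1.1


def simulate_catch_up(initial_delay_s, tick_s=0.02):
    """Step through playback in fixed ticks until the delay is recovered.

    Returns (elapsed wall-clock seconds, list of speed factors per tick).
    """
    remaining = initial_delay_s
    elapsed = 0.0
    speeds = []
    while remaining > 1e-9:
        speed = speed_for_remaining_delay(remaining, initial_delay_s)
        remaining -= (speed - 1.0) * tick_s  # delay recovered during this tick
        elapsed += tick_s
        speeds.append(speed)
    return elapsed, speeds
```

Because the remaining delay only decreases, the chosen speed never rises again, which gives the gradual return to normal speed described above.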

The media speed adjustment unit 150 may also delete part of the sound data that follows the inserted speech start data. This can likewise eliminate the delay of the sound data. For example, from the sound data following the inserted speech start data, the media speed adjustment unit 150 may delete the sound data of non-speech sections detected by the speech detection unit 130 and move the sound data after each non-speech section forward to close the gap left by the deletion.
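The deletion approach can be sketched on a frame-by-frame basis. The function name and the choice to stop deleting once the delay (counted in frames) has been recovered are assumptions for this example; the text itself only says that non-speech sound data is deleted and later data is moved forward.

```python
def drop_non_speech(frames, is_speech, delay_frames):
    """Remove up to delay_frames non-speech frames from frames.

    frames and is_speech are parallel sequences; is_speech[i] is True when
    frame i was classified as speech. Later frames move forward to close
    the gaps. Returns (kept_frames, number_of_frames_dropped).
    """
    kept = []
    to_drop = delay_frames
    for frame, speech in zip(frames, is_speech):
        if not speech and to_drop > 0:
            to_drop -= 1  # delete this non-speech frame
            continue
        kept.append(frame)
    return kept, delay_frames - to_drop
```

Speech frames are never removed in this sketch, so if the signal contains too little non-speech the delay is only partly recovered and the remainder would have to be handled by the speed-up described earlier.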

In addition, based on the insertion of the speech start data, the media speed adjustment unit 150 adjusts the speed of the video data (second speed) to match the adjusted speed of the sound data (first speed). This allows the other participants to perceive the speaking participant's audio and video without any sense of incongruity. As with the sound data, the media speed adjustment unit 150 may adjust the speed of the video data by raising it (that is, by playing the video data at faster than normal speed).
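One simple way to make the video follow the adjusted audio speed is to compress the frame presentation times by the same factor. The sketch below assumes video frames carry presentation timestamps in seconds and that a single constant speed factor applies to the whole span; both are assumptions for illustration only.

```python
def retime_video(timestamps_s, speed):
    """Scale frame presentation times so the video plays at the given speed.

    The first frame keeps its timestamp; every later interval is divided
    by speed, matching audio played at the same speed factor.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not timestamps_s:
        return []
    start = timestamps_s[0]
    return [start + (t - start) / speed for t in timestamps_s]
```

With a speed of 1.5, frames originally 0.3 seconds apart are presented about 0.2 seconds apart, so lip movement stays aligned with the faster audio.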

When the media speed adjustment unit 150 deletes the sound data of a non-speech section, it may also delete the video data of that section and move the video data after the section forward to close the gap. In doing so, the media speed adjustment unit 150 may apply, for example, morphing processing to the video data to suppress any unnatural discontinuity between the video data before and after the deleted section. Note that the media speed adjustment unit 150 need not adjust the speed of video data when no video data is transmitted from the teleconference client 20-1.

(Encoder 160)
The encoder 160 encodes, using an arbitrary codec, the speech start data that the speech information generation unit 140 inserted before the sound data and video data. The encoder 160 then outputs the encoded speech start data to the data transmission unit 170, thereby controlling transmission of the encoded speech start data by the data transmission unit 170.

続いて、エンコーダ１６０は、メディア速度調整部１５０によって速度が調整された後の音データおよび映像データに対して、任意のコーデックによりエンコードを行う。そして、エンコーダ１６０は、エンコードされた後の音データおよび映像データをデータ送信部１７０に出力する。これによって、エンコードされた後の音データおよび映像データのデータ送信部１７０による送信が制御される。 Subsequently, the encoder 160 encodes the audio data and video data whose speed has been adjusted by the media speed adjustment unit 150 using an arbitrary codec. Encoder 160 then outputs the encoded audio data and video data to data transmitting section 170 . This controls the transmission of the encoded audio data and video data by the data transmission unit 170 .

(Data transmission unit 170)
The data transmission unit 170 transmits the speech start data encoded by the encoder 160, via the network 40, to each of the teleconference clients 20-2 to 20-N, that is, to every client other than the teleconference client 20-1. The data transmission unit 170 then transmits the sound data and video data encoded by the encoder 160 to the same clients via the network 40.
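The transmission order described here (speech start data first, then the speed-adjusted media, to every client except the speaker's) can be sketched as follows. The function name and the representation of clients and packets as plain values are hypothetical; no real network API is implied.

```python
def build_send_order(speech_start, media_packets, clients, sender):
    """Return (recipient, payload) pairs in the order they would be sent.

    The speech start data goes to every client except the sender before
    any media packet is sent.
    """
    recipients = [c for c in clients if c != sender]
    order = [(r, speech_start) for r in recipients]
    for packet in media_packets:
        order.extend((r, packet) for r in recipients)
    return order
```

For example, with clients 20-1, 20-2, and 20-3 and sender 20-1, both remaining clients receive the speech start data before either receives the first media packet.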

(Data receiving unit 230)
In each of the teleconference clients 20-2 to 20-N, the data receiving unit 230 receives the encoded speech start data transmitted from the teleconference server 10. The data receiving unit 230 then receives the encoded sound data and video data transmitted from the teleconference server 10.

(Decoder 240)
The decoder 240 decodes the encoded speech start data received by the data receiving unit 230 using an arbitrary codec to obtain decoded speech start data. The decoder 240 likewise decodes the encoded sound data and video data received by the data receiving unit 230 using an arbitrary codec to obtain decoded sound data and video data.

(Video/audio reproduction unit 250)
When the speech start data decoded by the decoder 240 is obtained, the video/audio reproduction unit 250 performs control so that presentation data indicating that speech has started is presented. The type of presentation data is not limited to any specific type. For example, the presentation data may include predetermined image data (still image data or moving image data). The still image data may include an icon, and the moving image data may include an animation. In this case, the video/audio reproduction unit 250 controls the monitor 34 so that the predetermined image is displayed as an example of the presentation data.

あるいは、提示データは、所定の音データを含んでもよい。かかる場合には、映像・音声再生部２５０は、所定の音データが提示データの例として出力されるようにスピーカ３３を制御する。 Alternatively, the presentation data may include predetermined sound data. In such a case, the video/audio reproducing unit 250 controls the speaker 33 so that predetermined sound data is output as an example of presentation data.

また、映像・音声再生部２５０は、デコーダ２４０によってデコードされた後の音データおよび映像データが得られたことに基づいて、デコードされた後の音データおよび映像データを再生する。映像・音声再生部２５０は、再生した音データをスピーカ３３に出力する。これによって、スピーカ３３によって音データが出力される。また、映像・音声再生部２５０は、再生した映像データをモニタ３４に出力する。これによって、モニタ３４によって映像データが出力される。 Further, the video/audio reproduction unit 250 reproduces the decoded audio data and video data based on the fact that the audio data and the video data after being decoded by the decoder 240 are obtained. The video/audio reproducing unit 250 outputs the reproduced sound data to the speaker 33 . As a result, the speaker 33 outputs sound data. Also, the video/audio reproducing unit 250 outputs the reproduced video data to the monitor 34 . As a result, the monitor 34 outputs video data.
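On the receiving client, the behavior of the video/audio reproduction unit 250 amounts to dispatching on what was decoded: speech start data triggers the presentation data, and media is played back. The sketch below assumes a hypothetical message format (a dict with a "type" key) and caller-supplied callbacks; neither appears in the specification.

```python
def handle_message(message, show_indicator, play_media):
    """Dispatch one decoded message on the receiving client.

    message is a dict with a hypothetical "type" field:
      "speech_start" -> present the indicator (icon, animation, or sound)
      "media"        -> play the decoded sound/video payload
    """
    kind = message.get("type")
    if kind == "speech_start":
        show_indicator(message.get("speaker"))
    elif kind == "media":
        play_media(message["payload"])
    else:
        raise ValueError(f"unknown message type: {kind!r}")
```

Because the server sends the speech start data before the media, the indicator callback runs before the first media payload is played.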

以上、本発明の実施形態に係る遠隔会議システム１の動作例について説明した。 The operation example of the remote conference system 1 according to the embodiment of the present invention has been described above.

(2. Hardware configuration)
FIG. 2 is a block diagram showing an example of the hardware configuration of the teleconference server 10 and the teleconference client 20 according to the embodiment of the present invention (hereinafter, the teleconference server 10 and the teleconference client 20 may be referred to, without distinction, as the "apparatus according to this embodiment").

Note that the teleconference server 10 and the teleconference client 20 need not each include all of the hardware described below (for example, the teleconference server 10 need not be directly equipped with a sensor); each apparatus may include only the hardware modules needed to realize its functional configuration.

図２を参照すると、本実施形態に係る装置は、バス８０１、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）８０３、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）８０５、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）８０７、記憶装置８０９、通信インタフェース８１１、センサ８１３、入力装置８１５、表示装置８１７、スピーカ８１９を備える。 Referring to FIG. 2, the device according to this embodiment includes a bus 801, a CPU (Central Processing Unit) 803, a ROM (Read Only Memory) 805, a RAM (Random Access Memory) 807, a storage device 809, a communication interface 811, a sensor 813 , an input device 815 , a display device 817 and a speaker 819 .

ＣＰＵ８０３は、本実施形態に係る装置における様々な処理を実行する。また、ＲＯＭ８０５は、本実施形態に係る装置における処理をＣＰＵ８０３に実行させるためのプログラム及びデータを記憶する。また、ＲＡＭ８０７は、ＣＰＵ８０３の処理の実行時に、プログラム及びデータを一時的に記憶する。 A CPU 803 executes various processes in the apparatus according to this embodiment. The ROM 805 also stores programs and data for causing the CPU 803 to execute processing in the apparatus according to this embodiment. The RAM 807 also temporarily stores programs and data when the CPU 803 executes processing.

バス８０１は、ＣＰＵ８０３、ＲＯＭ８０５及びＲＡＭ８０７を相互に接続する。バス８０１には、さらに、記憶装置８０９、通信インタフェース８１１、センサ８１３、入力装置８１５、表示装置８１７及びスピーカ８１９が接続される。バス８０１は、例えば、複数の種類のバスを含む。一例として、バス８０１は、ＣＰＵ８０３、ＲＯＭ８０５及びＲＡＭ８０７を接続する高速バスと、該高速バスよりも低速の１つ以上の別のバスを含む。 A bus 801 interconnects the CPU 803 , ROM 805 and RAM 807 . A storage device 809 , a communication interface 811 , a sensor 813 , an input device 815 , a display device 817 and a speaker 819 are also connected to the bus 801 . Bus 801 includes, for example, multiple types of buses. As an example, bus 801 includes a high speed bus connecting CPU 803, ROM 805 and RAM 807, and one or more other buses that are slower than the high speed bus.

記憶装置８０９は、本実施形態に係る装置内で一時的または恒久的に保存すべきデータを記憶する。記憶装置８０９は、例えば、ハードディスク（ＨａｒｄＤｉｓｋ）等の磁気記憶装置であってもよく、または、ＥＥＰＲＯＭ（ＥｌｅｃｔｒｉｃａｌｌｙＥｒａｓａｂｌｅａｎｄＰｒｏｇｒａｍｍａｂｌｅＲｅａｄＯｎｌｙＭｅｍｏｒｙ）、フラッシュメモリ（ｆｌａｓｈｍｅｍｏｒｙ）、ＭＲＡＭ（ＭａｇｎｅｔｏｒｅｓｉｓｔｉｖｅＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）、ＦｅＲＡＭ（ＦｅｒｒｏｅｌｅｃｔｒｉｃＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）及びＰＲＡＭ（ＰｈａｓｅｃｈａｎｇｅＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等の不揮発性メモリ（ｎｏｎｖｏｌａｔｉｌｅｍｅｍｏｒｙ）であってもよい。 A storage device 809 stores data to be temporarily or permanently stored in the device according to this embodiment. The storage device 809 may be, for example, a magnetic storage device such as a hard disk, or may be EEPROM (Electrically Erasable and Programmable Read Only Memory), flash memory, MRAM (Magnetoresistive Random Access Memory). , FeRAM (Ferroelectric Random Access Memory) and PRAM (Phase change Random Access Memory).

通信インタフェース８１１は、本実施形態に係る装置が備える通信手段であり、ネットワークを介して（あるいは直接的に）外部装置と通信する。通信インタフェース８１１は、無線通信用のインタフェースであってもよく、この場合に、例えば、通信アンテナ、ＲＦ回路及びその他の通信処理用の回路を含んでもよい。また、通信インタフェース８１１は、有線通信用のインタフェースであってもよく、この場合に、例えば、ＬＡＮ端子、伝送回路及びその他の通信処理用の回路を含んでもよい。 A communication interface 811 is communication means provided in the device according to the present embodiment, and communicates with an external device via a network (or directly). The communication interface 811 may be an interface for wireless communication, in which case it may include, for example, a communication antenna, RF circuits and other circuits for processing communication. Also, the communication interface 811 may be an interface for wired communication, and in this case may include, for example, a LAN terminal, a transmission circuit, and other circuits for communication processing.

センサ８１３は、例えばカメラ、マイクロフォン、生体センサ、その他のセンサまたはそれらの複合である。カメラは、被写体を撮像するもので、例えば光学系、撮像素子及び画像処理回路を含む。マイクロフォンは、周囲の音を収音するもので、該音を電気信号へ変換し該電気信号をデジタルデータに変換する。 Sensor 813 is, for example, a camera, microphone, biosensor, other sensor, or a combination thereof. A camera captures an image of a subject, and includes, for example, an optical system, an image sensor, and an image processing circuit. A microphone picks up ambient sound, converts the sound into an electrical signal, and converts the electrical signal into digital data.

The input device 815 is, for example, a touch panel, a mouse, a line-of-sight detection device such as a camera, or a microphone. The display device 817 displays an output image (that is, a display screen) from the apparatus according to this embodiment, and may be realized using, for example, a liquid crystal display, an organic EL (Organic Light-Emitting Diode) display, or a CRT (Cathode Ray Tube). The speaker 819 outputs sound; it converts digital data into an electrical signal and converts the electrical signal into sound.

以上、本実施形態に係る装置のハードウェア構成例について説明した。 The hardware configuration example of the apparatus according to the present embodiment has been described above.

(3. Summary)
As described above, the teleconference server 10 according to the embodiment of the present invention includes: the speech detection unit 130, which detects speech from sound data received from the teleconference client 20-1; the speech information generation unit 140, which, based on the detection of speech from the sound data, inserts speech start data indicating that speech has started before the sound data; the media speed adjustment unit 150, which adjusts the first speed of the sound data based on the insertion of the speech start data; and the encoder 160, which controls the data transmission unit 170 to transmit the speech start data to the teleconference clients 20-2 to 20-N and then controls the data transmission unit 170 to transmit the sound data whose first speed has been adjusted to the teleconference clients 20-2 to 20-N.

With this configuration, the speech start data is delivered before the sound data. Participants can therefore notice another participant's speech sooner, so a participant who was about to speak can hold back just before starting. This reduces the likelihood that one participant interrupts another (that is, the likelihood that the speech sections of multiple participants overlap), which allows a teleconference with multiple participants to proceed more smoothly.

The reason why delivering the speech start data before the sound data lets participants notice another participant's speech sooner is as follows. When a participant recognizes that another participant has spoken by listening to sound arriving from that participant's remote location, the participant usually cannot recognize that the sound contains speech until having listened to it for some length of time (for example, one second or more). As a result, by the time the participant recognizes that the other participant has spoken, the participant has often already begun speaking as well, so participants tend to interrupt one another.

それに対して、本発明の実施形態のように、発話開始データが視覚情報として参加者に提示されることによって、遠隔地から流れてくる音に基づいて他の参加者が発話したことを認識する場合と比較して、参加者は、他の参加者が発話を開始したことを早く認識することが可能となる。発話開始データが所定の音データ（例えば、あらかじめ用意された特殊な音データ）として提示される場合にも、参加者は、他の参加者が発話を開始したことを早く認識することが可能となる（例えば、数ミリ秒で認識することが可能となる）。なお、特殊な音データは、人間の発する声以外の音であれば限定されず、所定の警告音であってもよい。 On the other hand, as in the embodiment of the present invention, by presenting speech start data as visual information to the participants, it is possible to recognize that other participants have spoken based on the sound coming from a remote location. Compared to the case, the participants can quickly recognize that the other participants have started speaking. Even when speech start data is presented as predetermined sound data (for example, special sound data prepared in advance), participants can quickly recognize that other participants have started speaking. (for example, it becomes possible to recognize in a few milliseconds). Note that the special sound data is not limited as long as it is a sound other than a human voice, and may be a predetermined warning sound.

以上、添付図面を参照しながら本発明の好適な実施形態について詳細に説明したが、本発明はかかる例に限定されない。本発明の属する技術の分野における通常の知識を有する者であれば、特許請求の範囲に記載された技術的思想の範疇内において、各種の変更例または修正例に想到し得ることは明らかであり、これらについても、当然に本発明の技術的範囲に属するものと了解される。 Although the preferred embodiments of the present invention have been described in detail above with reference to the accompanying drawings, the present invention is not limited to such examples. It is obvious that a person having ordinary knowledge in the technical field to which the present invention belongs can conceive of various modifications or modifications within the scope of the technical idea described in the claims. It is understood that these also naturally belong to the technical scope of the present invention.

例えば、上記のように、本発明の実施形態によれば、複数の参加者それぞれの発話区間が重複してしまう可能性が低減される。しかし、複数の参加者によって同時に発話が行われてしまい、遠隔会議クライアント２０において、遠隔会議サーバ１０から複数の発話開始データが受信される場合も想定され得る。かかる場合には、映像・音声再生部２５０は、特許文献３に示されている手法などの音像定位処理を行い、複数の参加者による発話音声に方向感を付けて再生してもよい。 For example, as described above, embodiments of the present invention reduce the likelihood of overlapping speech segments of multiple participants. However, it is conceivable that a plurality of participants may speak at the same time and the teleconference client 20 may receive a plurality of speech start data from the teleconference server 10 . In such a case, the video/audio reproducing unit 250 may perform sound image localization processing such as the method disclosed in Patent Document 3, and reproduce the uttered voices of a plurality of participants with a sense of direction.

1 teleconference system
10 teleconference server
110 data receiving unit
120 decoder
130 speech detection unit
140 speech information generation unit
150 media speed adjustment unit
160 encoder
170 data transmission unit
20 teleconference client
210 encoder
220 data transmission unit
230 data receiving unit
240 decoder
250 video/audio reproduction unit
31 microphone
32 camera
33 speaker
34 monitor
40 network

Claims

1. A teleconference device comprising:
a speech detection unit that detects speech from sound data received from a first teleconference terminal;
a speech information generation unit that, based on the detection of the speech from the sound data, inserts speech start data indicating that the speech has started before the sound data;
a speed adjustment unit that adjusts a first speed of the sound data based on the insertion of the speech start data; and
a transmission control unit that controls a data transmission unit to transmit the speech start data to a second teleconference terminal and then controls the data transmission unit to transmit the sound data whose first speed has been adjusted to the second teleconference terminal.

2. The teleconference device according to claim 1, wherein
the speech information generation unit, based on the detection of the speech, inserts the speech start data before the sound data and video data received together with the sound data from the first teleconference terminal,
the speed adjustment unit, based on the insertion of the speech start data, adjusts the first speed of the sound data and adjusts a second speed of the video data to match the adjusted first speed, and
the transmission control unit controls the data transmission unit to transmit the speech start data to the second teleconference terminal and then controls the data transmission unit to transmit the sound data whose first speed has been adjusted and the video data whose second speed has been adjusted to the second teleconference terminal.

3. The teleconference device according to claim 1 or 2, wherein the speed adjustment unit adjusts the first speed by making the first speed higher than before the insertion of the speech start data until a delay of the sound data caused by the insertion of the speech start data is eliminated.

4. The teleconference device according to claim 3, wherein the speed adjustment unit adjusts the first speed by gradually lowering the first speed, while keeping it higher than before the insertion of the speech start data, until the delay is eliminated.

5. The teleconference device according to claim 1 or 2, wherein the speed adjustment unit adjusts the first speed by deleting part of the sound data following the inserted speech start data.

6. The teleconference device according to claim 5, wherein the speed adjustment unit adjusts the first speed by deleting sound data of a non-speech section from the sound data following the inserted speech start data and filling the section vacated by the deletion with the sound data that follows the non-speech section.

7. A teleconference method comprising:
detecting speech from sound data received from a first teleconference terminal;
inserting, based on the detection of the speech from the sound data, speech start data indicating that the speech has started before the sound data;
adjusting a first speed of the sound data based on the insertion of the speech start data; and
controlling a data transmission unit to transmit the speech start data to a second teleconference terminal and then controlling the data transmission unit to transmit the sound data whose first speed has been adjusted to the second teleconference terminal.

8. A teleconference terminal comprising:
an acquisition unit that acquires, from a teleconference device, speech start data, which is data indicating that speech has started and which has been inserted before sound data transmitted from another teleconference terminal based on detection of the speech from the sound data, and that acquires, from the teleconference device, the sound data whose first speed has been adjusted based on the insertion of the speech start data; and
a presentation control unit that, based on the acquisition of the speech start data, controls a presentation unit to present presentation data indicating that the speech has started, and controls the presentation unit to present the sound data whose first speed has been adjusted.

9. The teleconference terminal according to claim 8, wherein
the acquisition unit acquires, from the teleconference device, the speech start data inserted, based on the detection of the speech, before the sound data and video data transmitted together with the sound data from the other teleconference terminal, and acquires, from the teleconference device, the sound data whose first speed has been adjusted based on the insertion of the speech start data and the video data whose second speed has been adjusted to match the adjusted first speed, and
the presentation control unit, based on the acquisition of the speech start data, controls the presentation unit to present the presentation data, and controls the presentation unit to present the sound data whose first speed has been adjusted and the video data whose second speed has been adjusted.

10. The teleconference terminal according to claim 8 or 9, wherein
the presentation data includes predetermined image data, and
the presentation control unit controls a monitor so that the predetermined image data is displayed as the presentation data.

11. The teleconference terminal according to claim 8 or 9, wherein
the presentation data includes predetermined sound data, and
the presentation control unit controls a speaker so that the predetermined sound data is output as the presentation data.

12. A teleconference method comprising:
acquiring, from a teleconference device, speech start data, which is data indicating that speech has started and which has been inserted before sound data transmitted from another teleconference terminal based on detection of the speech from the sound data, and acquiring, from the teleconference device, the sound data whose first speed has been adjusted based on the insertion of the speech start data; and
controlling, based on the acquisition of the speech start data, a presentation unit to present presentation data indicating that the speech has started, and controlling the presentation unit to present the sound data whose first speed has been adjusted.

13. A teleconference system comprising a first teleconference terminal, a teleconference device, and a second teleconference terminal, wherein
the first teleconference terminal comprises
a transmission control unit that controls a data transmission unit to transmit sound data,
the teleconference device comprises:
a speech detection unit that detects speech from the sound data received from the first teleconference terminal;
a speech information generation unit that, based on the detection of the speech from the sound data, inserts speech start data indicating that the speech has started before the sound data;
a speed adjustment unit that adjusts a first speed of the sound data based on the insertion of the speech start data; and
a transmission control unit that controls a data transmission unit to transmit the speech start data to the second teleconference terminal and then controls the data transmission unit to transmit the sound data whose first speed has been adjusted to the second teleconference terminal, and
the second teleconference terminal comprises:
an acquisition unit that acquires the speech start data from the teleconference device and acquires the sound data whose first speed has been adjusted from the teleconference device; and
a presentation control unit that, based on the acquisition of the speech start data, controls a presentation unit to present presentation data indicating that the speech has started, and controls the presentation unit to present the sound data whose first speed has been adjusted.