JP2019176375A

JP2019176375A - Moving image output apparatus, moving image output method, and moving image output program

Info

Publication number: JP2019176375A
Application number: JP2018063887A
Authority: JP
Inventors: 一次永井; Kazuji Nagai
Original assignee: Advanced Media Inc
Current assignee: Advanced Media Inc
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2019-10-10

Abstract

To provide a moving image output apparatus, a moving image output method, and a moving image output program that make it possible to recognize a speaker's facial expression appropriately and to grasp easily what emotion the speaker is speaking with. SOLUTION: A moving image output apparatus comprises: a first moving image acquisition part that acquires, as first moving image data, moving image data obtained by photographing a speaker who is speaking; a second moving image generation part that generates, as second moving image data, each of a plurality of utterance portions in which the speaker is speaking within the first moving image data acquired by the first moving image acquisition part; and a second moving image output part that sequentially outputs the plurality of pieces of second moving image data generated by the second moving image generation part. SELECTED DRAWING: Figure 1

Description

The present invention relates to a moving image output apparatus, a moving image output method, and a moving image output program.

Conventionally, with advances in communication technology, image processing technology, and the like, video conference systems have been put into practical use that allow a conference to be held over a network among a plurality of information processing apparatuses installed at a plurality of remote sites. In addition, because large volumes of data can be transmitted and received, conference systems (so-called Web conference systems) have been put into practical use that not only transmit audio data collected at one terminal device to other terminal devices so that the content of a speaker's utterance is shared among the terminal devices, but also photograph the conference participants at each terminal device and transmit the captured moving image data to the other terminal devices, thereby enabling a conference that conveys facial expressions, gestures, and the like.

Techniques have also been proposed that use speech recognition technology to recognize the uttered speech of a speaker in a Web conference system, convert it into a character string, and then display the character string as subtitles or use it to create a draft of the meeting minutes (see, for example, Patent Document 1). Converting "speech" into a visible form as a "character string" in this way lets conference participants share the same information without omission and reduces the time needed to transcribe the minutes.

Patent Document 1: JP 2017-004270 A

In a Web conference system, however, the pieces of moving image data capturing the conference participants are displayed simultaneously on a plurality of screens. Each participant therefore has to pick out, from the simultaneously displayed screens, the speaker whose mouth is actually moving. This takes extra effort, and the speaker's facial expression, such as complexion, sometimes cannot be recognized appropriately. In that case, it is difficult to grasp what emotion (joy, surprise, disappointment, and so on) the speaker is speaking with.

An object of the present invention is to provide a moving image output apparatus, a moving image output method, and a moving image output program that make it possible to recognize a speaker's facial expression appropriately and to grasp easily what emotion the speaker is speaking with.

The moving image output apparatus according to the present invention includes:
a first moving image acquisition unit that acquires, as first moving image data, moving image data obtained by photographing a speaker who is speaking;
a second moving image generation unit that generates, as second moving image data, each of a plurality of utterance portions in which the speaker is speaking within the first moving image data acquired by the first moving image acquisition unit; and
a second moving image output unit that sequentially outputs the plurality of pieces of second moving image data generated by the second moving image generation unit.

The moving image output method according to the present invention includes:
acquiring, as first moving image data, moving image data obtained by photographing a speaker who is speaking;
generating, as second moving image data, each of a plurality of utterance portions in which the speaker is speaking within the acquired first moving image data; and
sequentially outputting the plurality of pieces of generated second moving image data.

The moving image output program according to the present invention causes a computer to execute:
a process of acquiring, as first moving image data, moving image data obtained by photographing a speaker who is speaking;
a process of generating, as second moving image data, each of a plurality of utterance portions in which the speaker is speaking within the acquired first moving image data; and
a process of sequentially outputting the plurality of pieces of generated second moving image data.

According to the present invention, it is possible to recognize a speaker's facial expression appropriately and to grasp easily what emotion the speaker is speaking with.

A block diagram showing the configuration of the Web conference system in the present embodiment.
A diagram explaining the flow of generating and transmitting the second moving image data in the present embodiment.
A flowchart showing an operation example of the server device in the present embodiment.
A diagram explaining the flow of generating and recording the second moving image data in the present embodiment.
A flowchart showing an operation example of the server device in the present embodiment.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings as appropriate.

<Configuration of Web conference system>
FIG. 1 is a block diagram showing the functional configuration of the Web conference system 10 in the present embodiment. The Web conference system 10 includes a client device 100 and a server device 200 (which functions as the "moving image output apparatus" of the present invention) connected via a network (for example, the Internet).

A user (hereinafter also referred to as a speaker) uses the client device 100 to log in to (join) the Web conference service provided by the server device 200, and can make various utterances (remarks) via a Web browser running on the client device 100. Although FIG. 1 shows only one client device 100, in practice there are a plurality of client devices 100, each used by one of the users participating in the Web conference.

First, the functional configuration of the client device 100 will be described. The client device 100 is a device such as a PC (personal computer), a tablet PC, a smartphone, or a mobile phone, and has a function of transmitting and receiving data to and from other devices (other client devices 100 and the server device 200).

The client device 100 includes an input reception unit 110, a moving image/audio input unit 120, a moving image/audio transmission unit 130, a moving image/audio reception unit 140, a moving image/audio output control unit 150, and a moving image/audio output unit 160.

Although not illustrated, the client device 100 has, for example, a CPU (Central Processing Unit) as a processor, a storage medium such as a ROM (Read Only Memory) storing a control program, a working memory such as a RAM (Random Access Memory), and a communication circuit. In this case, the functions of the units described above are realized by the CPU executing the control program.

The input reception unit 110 receives various inputs from the user via an operation unit (not shown) and transmits an input signal corresponding to each received input to the server device 200. For example, the input reception unit 110 receives input of the user ID and password that the user needs in order to log in to the Web conference service provided by the server device 200.

When the Web conference service is provided from the server device 200 to the client device 100, the moving image/audio input unit 120 inputs, as moving image data, a moving image of the user captured by imaging means (not shown) such as a camera. The moving image/audio input unit 120 also inputs, as audio data, sound around the user collected by sound collecting means (not shown) such as a microphone.

The moving image/audio transmission unit 130 transmits the moving image data and audio data input to the moving image/audio input unit 120 to the server device 200.

The moving image/audio reception unit 140 receives the second moving image data and second audio data transmitted from the moving image/audio transmission unit 270 of the server device 200, and outputs the received second moving image data and second audio data to the moving image/audio output control unit 150.

The moving image/audio reception unit 140 also receives the distribution content transmitted from the distribution content transmission unit 290 of the server device 200, and outputs the received distribution content to the moving image/audio output control unit 150.

The moving image/audio output control unit 150 controls display output of the second moving image data and the distribution content output from the moving image/audio reception unit 140, and controls audio output of the second audio data output from the moving image/audio reception unit 140. Specifically, while the Web conference service is being provided from the server device 200 to the client device 100, the moving image/audio output control unit 150 performs control to display the distribution content output from the moving image/audio reception unit 140. When second moving image data is output from the moving image/audio reception unit 140, it performs control to display that second moving image data instead of the distribution content.
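The display rule above can be illustrated with a minimal Python sketch. The `Received` container, its field names, and `select_display` are hypothetical names introduced only for illustration; the description does not specify how the output control unit 150 represents its inputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Received:
    """Hypothetical snapshot of what the reception unit 140 has handed over."""
    distribution_content: Optional[str] = None  # e.g. an identifier for content 336
    second_video: Optional[str] = None          # e.g. an identifier for an utterance clip


def select_display(received: Received) -> Optional[str]:
    """Pick what to show: second moving image data takes priority when present."""
    if received.second_video is not None:
        return received.second_video
    # Otherwise fall back to the distribution content (may be None if nothing arrived).
    return received.distribution_content
```

For example, `select_display(Received("content-336", "clip-330"))` selects the utterance clip, while `select_display(Received("content-336"))` falls back to the distribution content.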

The moving image/audio output unit 160 has, for example, a display and a speaker. Under the control of the moving image/audio output control unit 150, it displays the second moving image data and the distribution content via the Web browser and outputs the second audio data as sound.

Next, the functional configuration of the server device 200 will be described. Based on a request from the client device 100 concerning the Web conference service, the server device 200 executes various processes corresponding to the request.

The server device 200 includes a moving image/audio reception unit 210, a first moving image/audio storage unit 220, an audio acquisition unit 230, a speech recognition unit 240, a second moving image/audio generation unit 250, a second moving image/audio storage unit 260, a moving image/audio transmission unit 270, a distribution content storage unit 280, and a distribution content transmission unit 290.

Although not illustrated, the server device 200 has, for example, a CPU (Central Processing Unit) as a processor, a storage medium such as a ROM (Read Only Memory) storing a control program (which functions as the "moving image output program" of the present invention), a working memory such as a RAM (Random Access Memory), and a communication circuit. In this case, the functions of the units described above are realized by the CPU executing the control program.

While utterances by a plurality of users via a plurality of client devices 100 (that is, a Web conference) are in progress, the moving image/audio reception unit 210 receives, in real time, the moving image data and audio data transmitted from the moving image/audio transmission unit 130 of the client device 100. The moving image/audio reception unit 210 then records the received moving image data and audio data in the first moving image/audio storage unit 220 as first moving image data and first audio data. The moving image/audio reception unit 210 functions as the "first moving image acquisition unit" of the present invention.

The first moving image/audio storage unit 220 stores the first moving image data and the first audio data. The first moving image/audio storage unit 220 is constituted by, for example, a nonvolatile semiconductor memory (so-called flash memory) or a hard disk drive.

The audio acquisition unit 230 acquires the audio data received by the moving image/audio reception unit 210 and outputs the acquired audio data to the speech recognition unit 240.

The speech recognition unit 240 performs speech recognition processing on the audio data output from the audio acquisition unit 230. The speech recognition unit 240 can perform the speech recognition processing using any conventional speech recognition technique.

When a correct speech recognition result is obtained from the speech recognition processing, the speech recognition unit 240 generates text data (a character string) obtained as the speech recognition result, utterance start time information indicating the time at which the user started the utterance, and utterance end time information indicating the time at which the user ended the utterance, and outputs them to the second moving image/audio generation unit 250. A correct speech recognition result is considered to have been obtained when, for example, the speech recognition processing succeeds and unnecessary words and noise appear in the audio of the audio data no more often than a certain frequency.

Here, unnecessary words will be described. In general, spoken language contains words that do not appear in written language. For example, interjections uttered when a speaker (user) hesitates, such as "iya," "ano," "ee," and "eeto" (Japanese fillers roughly comparable to "well," "um," and "er"), generally do not appear in written language but appear frequently in spoken language. These words are usually unrelated to the content of what is being said and are usually unnecessary for conveying information to the listener.
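The criterion for a "correct" recognition result can be sketched in Python as follows. This is only an illustration under stated assumptions: the description does not say how the frequency is measured or what the threshold is, so the per-token ratio, the `MAX_UNWANTED_RATIO` value, the `noise_count` parameter, and the function name are all hypothetical. The filler set uses the four example words from the description.

```python
from typing import Sequence

# Example fillers named in the description: "iya", "ano", "ee", "eeto".
FILLER_WORDS = {"いや", "あの", "えー", "えーと"}

# Hypothetical threshold; the description only says "a certain frequency".
MAX_UNWANTED_RATIO = 0.3


def is_correct_result(recognized: bool,
                      tokens: Sequence[str],
                      noise_count: int = 0,
                      max_ratio: float = MAX_UNWANTED_RATIO) -> bool:
    """Return True when recognition succeeded and fillers/noise are rare enough.

    Assumes the recognizer yields word tokens plus a count of noise events.
    """
    if not recognized or not tokens:
        return False
    unwanted = sum(1 for t in tokens if t in FILLER_WORDS) + noise_count
    total = len(tokens) + noise_count
    return unwanted / total <= max_ratio
```

With the assumed threshold, an utterance tokenized as `["えーと", "会議", "を", "始め", "ます"]` passes (one filler in five tokens), while one consisting mostly of fillers does not.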

Based on the utterance start time information and utterance end time information output from the speech recognition unit 240, the second moving image/audio generation unit 250 identifies, in the first moving image data and first audio data stored in the first moving image/audio storage unit 220, each of the utterance portions (intervals from utterance start time to utterance end time) in which the users are speaking. The second moving image/audio generation unit 250 generates the second moving image data by, for example, cutting out each identified utterance portion from the first moving image data stored in the first moving image/audio storage unit 220.

In the present embodiment, the second moving image/audio generation unit 250 generates the second moving image data with the text data output from the speech recognition unit 240 included as subtitles (corresponding to the "predetermined expression mode" of the present invention). The second moving image/audio generation unit 250 then records the generated second moving image data in the second moving image/audio storage unit 260.

The second moving image/audio generation unit 250 also generates second audio data by, for example, cutting out each identified utterance portion from the first audio data stored in the first moving image/audio storage unit 220. The second moving image/audio generation unit 250 then records the generated second audio data in the second moving image/audio storage unit 260.
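The cutting-out step can be sketched in Python as below. This is a simplified illustration, not the embodiment's implementation: it assumes the first moving image data is available as a list of frames at a fixed frame rate and the first audio data as a list of samples at a fixed sample rate, both starting at time zero. The names `Utterance`, `Clip`, `cut`, and `make_clip` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Utterance:
    start_s: float  # utterance start time, seconds from the start of recording
    end_s: float    # utterance end time, seconds from the start of recording
    text: str       # text data from speech recognition, used as the subtitle


@dataclass
class Clip:
    frames: List[str]   # stand-in for the second moving image data
    samples: List[int]  # stand-in for the second audio data
    subtitle: str


def cut(items: Sequence, rate_hz: float, start_s: float, end_s: float) -> list:
    """Cut the items that fall in [start_s, end_s) given a fixed rate."""
    first = int(round(start_s * rate_hz))
    last = int(round(end_s * rate_hz))
    return list(items[first:last])


def make_clip(frames: Sequence[str], fps: float,
              samples: Sequence[int], sample_rate: float,
              utt: Utterance) -> Clip:
    """Build one clip (video, audio, subtitle) for one utterance portion."""
    return Clip(cut(frames, fps, utt.start_s, utt.end_s),
                cut(samples, sample_rate, utt.start_s, utt.end_s),
                utt.text)
```

Using the same start and end times for both the video and the audio keeps the two cut-outs aligned to the same utterance portion.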

The second moving image/audio storage unit 260 stores the second moving image data and the second audio data. The second moving image/audio storage unit 260 is constituted by, for example, a nonvolatile semiconductor memory (so-called flash memory) or a hard disk drive.

The moving image/audio transmission unit 270 transmits, in chronological order, the second moving image data and second audio data that are stored in the second moving image/audio storage unit 260 and correspond to the same utterance portion to the moving image/audio reception unit 140 of the client device 100. That is, the moving image/audio transmission unit 270 transmits the second moving image data and second audio data so that the users participating in the Web conference can view the utterances made by the participating users and confirm their content. In the present embodiment, the moving image/audio transmission unit 270 temporarily stops the transmission of distribution content from the distribution content transmission unit 290 and transmits the second moving image data and second audio data to the moving image/audio reception unit 140 of the client device 100 in a streaming format or a download format. The moving image/audio transmission unit 270 functions as the "second moving image output unit" of the present invention.

The distribution content storage unit 280 stores, as distribution content, an image set in association with the non-utterance portions of the first moving image data stored in the first moving image/audio storage unit 220, that is, the portions in which the users are not speaking. In the present embodiment, the distribution content is a moving image, a still image, or the like in a file format designated in advance by the user. The distribution content may also be moving image data transmitted to the moving image/audio reception unit 210 from the moving image/audio transmission unit 130 of a client device 100 designated in advance by the user. The distribution content storage unit 280 is constituted by, for example, a nonvolatile semiconductor memory (so-called flash memory) or a hard disk drive.

The distribution content transmission unit 290 transmits the distribution content stored in the distribution content storage unit 280 to the moving image/audio reception unit 140 of the client device 100. In the present embodiment, while the Web conference is in progress, the distribution content transmission unit 290 transmits the distribution content in a streaming format or a download format at times when the second moving image data and second audio data are not being transmitted from the moving image/audio transmission unit 270. The distribution content transmission unit 290 functions as the "image output unit" of the present invention.

Next, a specific example of the flow of generating and transmitting the second moving image data will be described with reference to FIG. 2. FIG. 2A shows the content of the first moving image data 300, 310, and 320 that, while the Web conference was in progress, was transmitted from the client devices 100a, 100b, and 100c (smartphones in the example of FIG. 2A) used by users X, Y, and Z, respectively, received in real time by the moving image/audio reception unit 210 of the server device 200, and recorded in the first moving image/audio storage unit 220.

The first moving image data 300 contains utterance portions 302 and 304, in which user X is speaking, and non-utterance portions, in which user X is not speaking. In the utterance portion 302, user X says, "We will now start the meeting." In the utterance portion 304, user X says, "Today's agenda is ○○." User X says nothing between the utterance portions 302 and 304 or after the utterance portion 304.

The first moving image data 310 contains utterance portions 312 and 314, in which user Y is speaking, and non-utterance portions, in which user Y is not speaking. In the utterance portion 312, user Y says, "Thank you in advance." In the utterance portion 314, user Y says, "I see." User Y says nothing before the utterance portion 312, between the utterance portions 312 and 314, or after the utterance portion 314.

The first moving image data 320 contains utterance portions 322 and 324, in which user Z is speaking, and non-utterance portions, in which user Z is not speaking. In the utterance portion 322, user Z says, "Thank you in advance." In the utterance portion 324, user Z says, "About that matter." User Z says nothing before the utterance portion 322, between the utterance portions 322 and 324, or after the utterance portion 324.

FIG. 2B shows, arranged in chronological order (in the example of FIG. 2B, from the left in order of earliest utterance), the content of the second moving image data 330, 332, 334, 338, 340, and 342 generated by the second moving image/audio generation unit 250 by identifying the utterance portions in the first moving image data 300, 310, and 320 recorded in the first moving image/audio storage unit 220.

The second moving image data 330 is generated from the utterance portion 302 of the first moving image data 300, with the text data output from the speech recognition unit 240 included as a subtitle ("We will now start the meeting."; shown as "XXXXXX" in FIG. 2B).

The second moving image data 332 is generated from the utterance portion 312 of the first moving image data 310, with the text data output from the speech recognition unit 240 included as a subtitle ("Thank you in advance."; shown as "XXXXXX" in FIG. 2B).

The second moving image data 334 is generated from the utterance portion 322 of the first moving image data 320, with the text data output from the speech recognition unit 240 included as a subtitle ("Thank you in advance."; shown as "XXXXXX" in FIG. 2B).

The second moving image data 338 is generated from the utterance portion 304 of the first moving image data 300, with the text data output from the speech recognition unit 240 included as a subtitle ("Today's agenda is ○○."; shown as "XXXXXX" in FIG. 2B).

The second moving image data 340 is generated from the utterance portion 324 of the first moving image data 320, with the text data output from the speech recognition unit 240 included as a subtitle ("About that matter."; shown as "XXXXXX" in FIG. 2B).

The second moving image data 342 is generated from the utterance portion 314 of the first moving image data 310, with the text data output from the speech recognition unit 240 included as a subtitle ("I see."; shown as "XXXXXX" in FIG. 2B).

In FIG. 2B, the distribution content 336 stored in the distribution content storage unit 280 is shown between the second moving image data 334 and the second moving image data 338, and after the second moving image data 342. The distribution content 336 is set in association with the non-utterance portions in which none of the users X, Y, and Z participating in the Web conference is speaking, and is a moving image of user X captured at a time other than the Web conference. That is, viewed chronologically over the course of the Web conference, there are non-utterance portions, in which none of the participating users X, Y, and Z is speaking, between the second moving image data 334 and the second moving image data 338 and after the second moving image data 342.

The second moving image data 330, 332, 334, 338, 340, and 342 and the distribution content 336 are transmitted from the server device 200 to the client devices 100a, 100b, and 100c used by users X, Y, and Z, respectively, and displayed there in the order in which they are arranged in FIG. 2B. This allows users X, Y, and Z to confirm what each of users X, Y, and Z said in the Web conference.
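The ordering shown in FIG. 2B can be reproduced with a short Python sketch that sorts the utterance portions of all users by start time and places the distribution content wherever nobody is speaking. The `Segment` type, the `build_timeline` function, and the concrete times are hypothetical; the description gives no timestamps and does not say how overlapping utterances are handled (this sketch simply outputs them in start order).

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_s: float  # utterance start time (hypothetical values below)
    end_s: float    # utterance end time
    label: str      # reference numeral of the second moving image data


def build_timeline(segments: List[Segment], conference_end_s: float,
                   filler: str = "336") -> List[str]:
    """Order clips by start time, inserting distribution content in silent gaps."""
    timeline: List[str] = []
    cursor = 0.0  # time up to which output has been scheduled
    for seg in sorted(segments):  # NamedTuple sorts by start_s first
        if seg.start_s > cursor:
            timeline.append(filler)  # nobody is speaking: show distribution content
        timeline.append(seg.label)
        cursor = max(cursor, seg.end_s)
    if conference_end_s > cursor:
        timeline.append(filler)  # silence after the last utterance
    return timeline


# Users X, Y, and Z from FIG. 2A, with made-up times.
segments = [
    Segment(0, 2, "330"), Segment(6, 8, "338"),   # user X: portions 302, 304
    Segment(2, 3, "332"), Segment(9, 10, "342"),  # user Y: portions 312, 314
    Segment(3, 4, "334"), Segment(8, 9, "340"),   # user Z: portions 322, 324
]
order = build_timeline(segments, conference_end_s=12)
```

With these assumed times, `order` lists 330, 332, 334, 336, 338, 340, 342, 336, matching the arrangement described for FIG. 2B.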

Next, an operation example of the server device 200 in the present embodiment will be described. FIG. 3 is a flowchart showing an operation example of the server device 200. The processing in FIG. 3 is executed after the server device 200 starts providing the Web conference service to the client device 100.

First, the distribution content transmission unit 290 starts transmitting the distribution content stored in the distribution content storage unit 280 to the moving image/audio receiving unit 140 of the client device 100 (step S100).

Next, the moving image/audio receiving unit 210 determines whether it is receiving moving image data and audio data transmitted from the moving image/audio transmission unit 130 of the client device 100 (step S120). If it is not receiving moving image data and audio data (step S120, NO), the server device 200 determines that all client devices 100 have logged off from the Web conference service and ends the processing in FIG. 3.

If, on the other hand, it is receiving moving image data and audio data (step S120, YES), the moving image/audio receiving unit 210 records the received moving image data and audio data in the first moving image/audio storage unit 220 as first moving image data and first audio data.

Next, the audio acquisition unit 230 acquires the audio data received by the moving image/audio receiving unit 210 (step S140). The audio acquisition unit 230 then outputs the acquired audio data to the speech recognition unit 240.

Next, the speech recognition unit 240 performs speech recognition processing on the audio data output from the audio acquisition unit 230 (step S160). The speech recognition unit 240 then determines whether the speech recognition processing produced a correct speech recognition result (step S180). If no correct speech recognition result was obtained (step S180, NO), the process returns to the point before step S120.

If, on the other hand, a correct speech recognition result was obtained (step S180, YES), the speech recognition unit 240 generates the text data obtained as the speech recognition result, utterance start time information indicating the time at which the user started speaking, and utterance end time information indicating the time at which the user finished that utterance, and outputs them to the second moving image/audio generation unit 250.

Next, based on the utterance start time information and utterance end time information output from the speech recognition unit 240, the second moving image/audio generation unit 250 cuts out, from the first moving image data and first audio data stored in the first moving image/audio storage unit 220, the plurality of utterance portions (the intervals from utterance start time to utterance end time) in which the plurality of users are speaking, and generates second moving image data that includes the text data output from the speech recognition unit 240 as subtitles (step S200). The second moving image/audio generation unit 250 then records the generated second moving image data in the second moving image/audio storage unit 260. The second moving image/audio generation unit 250 also generates second audio data and records it in the second moving image/audio storage unit 260.

Next, the moving image/audio transmission unit 270 temporarily stops the transmission of the distribution content from the distribution content transmission unit 290 and transmits the second moving image data and second audio data that are stored in the second moving image/audio storage unit 260 and correspond to the same utterance portion to the moving image/audio receiving unit 140 of the client device 100 (step S220). After step S220 is completed, the process returns to the point before step S120.
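The flow of FIG. 3 (steps S100 to S220) can be sketched roughly as follows. This is only an illustrative outline, not the implementation described in the specification: the names `Recognition`, `Clip`, `run_session`, and the event strings are hypothetical, the recognizer is passed in as a stub, and a finite list stands in for the live stream of received data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Recognition:
    text: str      # recognized text, later used as the subtitle
    start: float   # utterance start time, in seconds
    end: float     # utterance end time, in seconds


@dataclass
class Clip:
    user: str
    start: float
    end: float
    subtitle: str


def run_session(
    received: Iterable[Tuple[str, bytes]],
    recognize: Callable[[bytes], Optional[Recognition]],
) -> List[object]:
    """Walk through steps S100 to S220 for a finite stream of received audio."""
    events: List[object] = ["start_distribution_content"]         # S100
    for user, audio in received:                                   # S120: YES
        result = recognize(audio)                                  # S140, S160
        if result is None:                                         # S180: NO
            continue                                               # back to S120
        clip = Clip(user, result.start, result.end, result.text)   # S200
        events.append("pause_distribution_content")                # S220: pause content
        events.append(clip)                                        # S220: send the clip
    return events                                                  # S120: NO, end
```

The recognizer returning `None` is an assumed stand-in for the case in which no correct speech recognition result is obtained (step S180, NO).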

As described in detail above, in the present embodiment the server device 200 includes: a moving image/audio receiving unit 210 (first moving image acquisition unit) that, while utterances by a plurality of users via a plurality of client devices 100 (that is, a Web conference) are in progress, receives in real time, as first moving image data, a plurality of moving image data obtained by photographing the plurality of speakers (users) who are speaking; a second moving image/audio generation unit 250 (second moving image generation unit) that generates, as second moving image data, each of the plurality of utterance portions in which the plurality of users are speaking in the first moving image data received by the moving image/audio receiving unit 210; and a moving image/audio transmission unit 270 (second moving image output unit) that sequentially transmits the plurality of second moving image data generated by the second moving image/audio generation unit 250 so that the plurality of speakers can view them.

According to the present embodiment configured in this way, the plurality of utterance portions in which the plurality of speakers are speaking are each generated as second moving image data and sequentially transmitted to the client devices 100. Consequently, to share what each speaker said, a client device 100 does not need to display the plurality of second moving image data, which the server device 200 transmits utterance by utterance, on a plurality of display screens; displaying them on a single display screen is sufficient. A speaker viewing the second moving image data displayed utterance by utterance on a single display screen therefore does not need to search for the speaker whose mouth is actually moving, as would be necessary if the plurality of second moving image data were displayed on a plurality of display screens (that is, the extra effort is eliminated). The viewer can properly recognize the speaker's facial expression, such as complexion, and can easily grasp what emotion (joy, surprise, disappointment, and so on) the speaker is speaking with.

In particular, even when a plurality of speakers speak simultaneously at a plurality of client devices 100, a speaker viewing the second moving image data displayed utterance by utterance on a single display screen can check what each speaker said one after another. In addition, because the plurality of second moving image data no longer need to be displayed on a plurality of screens, a display module for displaying them on those screens becomes unnecessary, which reduces the configuration burden on the client device 100.

In the present embodiment, the moving image/audio transmission unit 270 (second moving image output unit) transmits the plurality of second moving image data in time series. With this configuration, a speaker viewing the second moving image data displayed on the client device 100 in time series and utterance by utterance can check what each speaker said in step with the actual progress of the Web conference, which makes the content easy to follow.

In the present embodiment, the server device 200 also includes an audio acquisition unit 230 that acquires audio data related to the first moving image data acquired by the moving image/audio receiving unit 210 (first moving image acquisition unit), and a speech recognition unit 240 that performs speech recognition on the audio data acquired by the audio acquisition unit 230. The second moving image/audio generation unit 250 (second moving image generation unit) identifies the plurality of utterance portions based on the speech recognition result (utterance start time information and utterance end time information) of the speech recognition unit 240 and generates the plurality of second moving image data using the identification result. Compared with, for example, detecting by image analysis whether a speaker's mouth is actually moving in the moving image data, this configuration makes it possible to identify accurately the plurality of utterance portions in which the plurality of speakers are speaking in the first moving image data, and to generate second moving image data limited to those utterance portions (that is, with the non-utterance portions properly excluded).

In the present embodiment, the second moving image/audio generation unit 250 (second moving image generation unit) generates the second moving image data so that it includes the speech recognition result (text data) of the speech recognition unit 240 as subtitles (a predetermined form of expression). With this configuration, a speaker viewing the second moving image data displayed on the client device 100 can easily check what each speaker said by also reading the subtitles displayed with it.
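As one possible illustration of attaching the recognized text to a cut-out clip, the sketch below renders a subtitle entry in the SubRip (SRT) text format. The specification does not name any subtitle format, so SRT, the helper names `fmt_timestamp` and `srt_entry`, and the choice to time the subtitle from 0 to the clip duration are all assumptions made for this example.

```python
def fmt_timestamp(seconds: float) -> str:
    """Format a time offset as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Build one subtitle entry covering a whole cut-out utterance clip.

    start and end are the utterance start and end times in the first
    moving image data; inside the clip the subtitle runs from 0 to
    (end - start).
    """
    duration = end - start
    return (
        f"{index}\n"
        f"{fmt_timestamp(0.0)} --> {fmt_timestamp(duration)}\n"
        f"{text}\n"
    )


print(srt_entry(1, 12.0, 18.0, "Hello, everyone."))
```

Burning the subtitle into the picture or delivering it as a separate track is a design choice that the specification leaves open.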

In the present embodiment, the server device 200 also includes a distribution content transmission unit 290 (image output unit) that transmits distribution content (an image) set in association with non-utterance portions, in which the plurality of speakers are not speaking, in the first moving image data received by the moving image/audio receiving unit 210 (first moving image acquisition unit). With this configuration, the client device 100 displays the distribution content whenever no second moving image data is being transmitted from the server device 200. Therefore, even if the first moving image data contains non-utterance portions in which the plurality of speakers are not speaking, it is possible to prevent a situation in which nothing is displayed on the client device 100 and the user is left with a sense that something is wrong.
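The interleaving of utterance clips and distribution content shown in FIG. 2B can be illustrated with a small playlist builder. This is a sketch under assumptions: the function name `build_playlist`, the tuple layout, and the example times are hypothetical, and utterance intervals are taken as already identified.

```python
from typing import List, Tuple

Interval = Tuple[float, float]          # (start, end) in seconds
Entry = Tuple[str, float, float]        # ("clip" | "content", start, end)


def build_playlist(clips: List[Interval], session_end: float) -> List[Entry]:
    """Fill every gap in which nobody is speaking with distribution content."""
    playlist: List[Entry] = []
    cursor = 0.0                        # end of the material scheduled so far
    for start, end in sorted(clips):
        if start > cursor:              # non-utterance portion before this clip
            playlist.append(("content", cursor, start))
        playlist.append(("clip", start, end))
        cursor = max(cursor, end)
    if session_end > cursor:            # silence after the last utterance
        playlist.append(("content", cursor, session_end))
    return playlist


for entry in build_playlist([(0, 4), (4, 9), (12, 15)], session_end=20):
    print(entry)
```

With the example times above, content entries appear for the gap from 9 to 12 seconds and for the tail from 15 to 20 seconds, mirroring the two positions of the distribution content 336 in FIG. 2B.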

The embodiment above describes an example in which the moving image/audio receiving unit 210 receives, in real time, the moving image data and audio data transmitted from the moving image/audio transmission unit 130 of the client device 100 while utterances by a plurality of users via a plurality of client devices 100 are in progress, but the present invention is not limited to this. For example, the moving image/audio receiving unit 210 may receive, from the moving image/audio transmission unit 130 of the client device 100, the moving image data and audio data that were input to the moving image/audio input unit 120 while the Web conference service was being provided from the server device 200 to the client device 100, after all utterances by the plurality of speakers have been made (that is, after the Web conference has ended).

In this case, based on the utterance start time information and utterance end time information output from the speech recognition unit 240, the second moving image/audio generation unit 250 identifies each of the plurality of utterance portions in which the plurality of users are speaking in the first moving image data and first audio data stored in the first moving image/audio storage unit 220. The second moving image/audio generation unit 250 then generates second moving image data by, for example, cutting the identified utterance portions out of the first moving image data stored in the first moving image/audio storage unit 220. The second moving image/audio generation unit 250 also generates second audio data by cutting the identified utterance portions out of the first audio data stored in the first moving image/audio storage unit 220.

Based on the utterance start time information and utterance end time information generated as the speech recognition result of the speech recognition unit 240, the second moving image/audio generation unit 250 joins the second moving image data and second audio data corresponding to the same utterance portion in time series to generate minutes moving image data for the Web conference. The second moving image/audio generation unit 250 then records the generated minutes moving image data in the second moving image/audio storage unit 260. The minutes moving image data recorded in the second moving image/audio storage unit 260 is provided from the server device 200 to the client device 100 for viewing in response to a user request. In this modification, the second moving image/audio generation unit 250 functions as the "second moving image output unit" of the present invention.

FIG. 4 shows the minutes moving image data 350 generated by the second moving image/audio generation unit 250. As shown in FIG. 4, the minutes moving image data 350 is formed by joining the plurality of second moving image data 330, 332, 334, 338, 340, and 342 and the second audio data corresponding to the same utterance portions in time series (in the example of FIG. 4, from the left in order of earliest utterance).

The case in which the utterances of a plurality of users overlap in time is now described. Suppose, for example, that user X speaks for 6 seconds and user Y speaks for 5 seconds, with user Y starting to speak 3 seconds after user X starts. The utterances of user X and user Y then overlap for 3 seconds. In such a case, for example, the second moving image data corresponding to user Y's utterance portion (5 seconds) is joined after the second moving image data corresponding to user X's utterance portion (6 seconds) (duration of the second moving image data: 11 seconds). Alternatively, the second moving image data corresponding to user Y's utterance portion (5 seconds) is joined after the second moving image data corresponding to the first 3 seconds of user X's utterance portion; that is, the second moving image data switches from user X to user Y at the moment user Y starts speaking (duration of the second moving image data: 8 seconds). As a further alternative, the second moving image data corresponding to user X's utterance portion (6 seconds) and that corresponding to user Y's utterance portion (5 seconds) are joined, and for the portion in which the utterances overlap, the second moving image data corresponding to user X's utterance portion (3 seconds) and that corresponding to user Y's utterance portion (3 seconds) are superimposed, or the second moving image data is edited so that user X's utterance portion and user Y's utterance portion are shown in a split display on a single display screen (duration of the second moving image data: 8 seconds).
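The durations quoted for the three ways of handling overlapping utterances (11, 8, and 8 seconds) can be checked with a short calculation. The function names below are hypothetical labels for the three options; the intervals are the ones from the example, with user X speaking from 0 to 6 seconds and user Y from 3 to 8 seconds.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def sequential(first: Interval, second: Interval) -> float:
    """Play the first utterance in full, then the second in full."""
    return (first[1] - first[0]) + (second[1] - second[0])


def switch_at_second_start(first: Interval, second: Interval) -> float:
    """Cut the first utterance off at the moment the second one starts."""
    first_shown = min(first[1], second[0]) - first[0]
    return first_shown + (second[1] - second[0])


def overlay(first: Interval, second: Interval) -> float:
    """Superimpose or split-screen the overlap, keeping real-time length."""
    return max(first[1], second[1]) - min(first[0], second[0])


user_x = (0.0, 6.0)   # user X speaks for 6 seconds
user_y = (3.0, 8.0)   # user Y starts 3 seconds later and speaks for 5 seconds

print(sequential(user_x, user_y))              # 11.0
print(switch_at_second_start(user_x, user_y))  # 8.0
print(overlay(user_x, user_y))                 # 8.0
```

The first option preserves every utterance in full at the cost of a longer result, while the other two keep the result the same length as the real conversation.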

FIG. 5 is a flowchart showing an operation example of the server device 200 when generating minutes moving image data. The processing in FIG. 5 is executed after the server device 200 has finished providing the Web conference service to all client devices 100.

First, the moving image/audio receiving unit 210 receives, from the moving image/audio transmission unit 130 of each of the plurality of client devices 100, the moving image data and audio data that were input to the moving image/audio input unit 120 while the Web conference service was being provided from the server device 200 to the client devices 100 (step S300). The moving image/audio receiving unit 210 then stores the moving image data and audio data received from the plurality of client devices 100 in the first moving image/audio storage unit 220 as first moving image data and first audio data.

Next, the audio acquisition unit 230 determines whether the audio data received from the plurality of client devices 100 includes any audio data on which the speech recognition unit 240 has not yet performed speech recognition processing (step S320). If such audio data exists (step S320, YES), the audio acquisition unit 230 acquires that audio data (step S340). The audio acquisition unit 230 then outputs the acquired audio data to the speech recognition unit 240.

Next, the speech recognition unit 240 performs speech recognition processing on the audio data output from the audio acquisition unit 230 (step S360). The speech recognition unit 240 then determines whether the speech recognition processing produced a correct speech recognition result (step S380). If no correct speech recognition result was obtained (step S380, NO), the process returns to the point before step S320.

If, on the other hand, a correct speech recognition result was obtained (step S380, YES), the speech recognition unit 240 generates the text data obtained as the speech recognition result, utterance start time information indicating the time at which the user started speaking, and utterance end time information indicating the time at which the user finished that utterance, and outputs them to the second moving image/audio generation unit 250.

Next, based on the utterance start time information and utterance end time information output from the speech recognition unit 240, the second moving image/audio generation unit 250 cuts out the utterance portion in which the user (speaker) is speaking from the first moving image data and first audio data stored in the first moving image/audio storage unit 220, and generates second moving image data that includes the text data output from the speech recognition unit 240 as subtitles (step S400). The second moving image/audio generation unit 250 also generates second audio data by cutting the identified utterance portion out of the first audio data stored in the first moving image/audio storage unit 220. The process then returns to the point before step S320.

Returning to the determination in step S320, if no audio data remains on which speech recognition processing has not been performed, that is, if generation of the second moving image data and second audio data from the moving image data and audio data received from all client devices 100 is complete (step S320, NO), the second moving image/audio generation unit 250 joins the second moving image data and second audio data corresponding to the same utterance portion in time series, based on the utterance start time information and utterance end time information generated as the speech recognition result of the speech recognition unit 240, to generate minutes moving image data for the Web conference (step S420). The second moving image/audio generation unit 250 then records the generated minutes moving image data in the second moving image/audio storage unit 260. When step S420 is completed, the server device 200 ends the processing in FIG. 5.
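The batch flow of FIG. 5 (steps S300 to S420) can be outlined as below. This is a rough sketch rather than the specification's implementation: `Clip`, `build_minutes`, and the recognizer signature are hypothetical, the recognizer is supplied by the caller, and the actual cutting and joining of moving image data is represented only by the ordered list of clip descriptors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Clip:
    user: str
    start: float     # utterance start time, in seconds
    end: float       # utterance end time, in seconds
    subtitle: str    # recognized text


# Assumed recognizer shape: returns (text, start, end), or None on failure.
Recognizer = Callable[[bytes], Optional[Tuple[str, float, float]]]


def build_minutes(recordings: Dict[str, List[bytes]], recognize: Recognizer) -> List[Clip]:
    """Return the utterance clips in the order they appear in the minutes."""
    # S300: everything recorded during the conference has been received.
    pending = [(user, audio) for user, chunks in recordings.items() for audio in chunks]
    clips: List[Clip] = []
    while pending:                                   # S320: unprocessed audio left?
        user, audio = pending.pop(0)                 # S340
        result = recognize(audio)                    # S360
        if result is None:                           # S380: NO
            continue                                 # back to S320
        text, start, end = result
        clips.append(Clip(user, start, end, text))   # S400
    return sorted(clips, key=lambda c: c.start)      # S420: join in time series
```

Sorting by utterance start time is one straightforward reading of "joining in time series"; how overlapping utterances are then laid out is a separate choice.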

The minutes moving image data could also be generated by, for example, a server administrator editing the first moving image data and first audio data with known video editing software. Such editing, however, is laborious and places a heavy workload on the server administrator. In this modification, by contrast, the minutes moving image data can be generated automatically after provision of the Web conference service has ended, without any of the effort of editing the first moving image data and first audio data with known video editing software. This modification, which cuts only the utterance portions out of a plurality of moving image data containing speakers' utterances and generates a single piece of moving image data, is not limited to generating minutes moving image data. It can also be applied, for example in inspection work or route guidance, to generating explanatory moving image data or report moving image data by cutting only the explanation or report portions out of a plurality of moving image data containing them (that is, containing the speaker's utterance portions) and joining them in time series.

The embodiment above describes an example in which the second moving image/audio generation unit 250 generates the second moving image data so that it includes the speech recognition result (text data) output from the speech recognition unit 240 as subtitles, but the present invention is not limited to this. For example, the second moving image/audio generation unit 250 may generate the second moving image data so that it includes the speech recognition result (text data) output from the speech recognition unit 240 in another form of expression, such as sign language, text data in another language (a translation result), an image, or audio. In short, the second moving image/audio generation unit 250 need only generate the second moving image data so that it includes the content of the speech recognition result (text data) output from the speech recognition unit 240 in a form of expression that makes it easier for a speaker viewing the second moving image data to understand.

The embodiment above also describes an example in which the moving image/audio receiving unit 210 receives, as first moving image data, a plurality of moving image data obtained by photographing a plurality of speakers who are speaking, and the second moving image/audio generation unit 250 generates, as second moving image data, each of the plurality of utterance portions in which the plurality of speakers are speaking in the first moving image data, but the present invention is not limited to this. For example, the moving image/audio receiving unit 210 may receive, as first moving image data, a single piece of moving image data obtained by photographing one or more speakers who are speaking, and the second moving image/audio generation unit 250 may generate, as second moving image data, each of the plurality of utterance portions in which the one or more speakers are speaking in the first moving image data.

The embodiments described above are merely examples of how the present invention may be embodied, and the technical scope of the present invention must not be construed restrictively on their basis. That is, the present invention can be implemented in various forms without departing from its gist or its principal features.

The present invention is suitable as a moving image output apparatus, a moving image output method, and a moving image output program that make it possible to recognize a speaker's facial expression properly and to grasp easily what emotion the speaker is speaking with.

DESCRIPTION OF SYMBOLS
10 Web conference system
100 Client device
110 Input reception unit
120 Moving image/audio input unit
130, 270 Moving image/audio transmission unit
140, 210 Moving image/audio receiving unit
150 Moving image/audio output control unit
160 Moving image/audio output unit
200 Server device
220 First moving image/audio storage unit
230 Audio acquisition unit
240 Speech recognition unit
250 Second moving image/audio generation unit
260 Second moving image/audio storage unit
280 Distribution content storage unit
290 Distribution content transmission unit

Claims

A moving image output apparatus comprising:
a first moving image acquisition unit that acquires, as first moving image data, moving image data obtained by photographing a speaker who is speaking;
a second moving image generation unit that generates, as second moving image data, each of a plurality of utterance portions in which the speaker is speaking in the first moving image data acquired by the first moving image acquisition unit; and
a second moving image output unit that sequentially outputs the plurality of second moving image data generated by the second moving image generation unit.

The moving image output apparatus according to claim 1, wherein the second moving image output unit outputs the plurality of second moving image data in time series.

The moving image output apparatus according to claim 1 or 2, further comprising:
an audio acquisition unit that acquires audio data related to the first moving image data acquired by the first moving image acquisition unit; and
a speech recognition unit that performs speech recognition on the audio data acquired by the audio acquisition unit,
wherein the second moving image generation unit identifies the plurality of utterance portions based on a speech recognition result of the speech recognition unit and generates the plurality of second moving image data using the identification result.

The moving image output apparatus according to claim 3, wherein the second moving image generation unit generates the second moving image data so as to include the speech recognition result of the speech recognition unit in a predetermined form of expression.

The moving image output apparatus according to any one of claims 1 to 4, wherein the first moving image acquisition unit acquires the first moving image data in real time while the speaker is speaking.

The moving image output apparatus according to claim 5, further comprising an image output unit that outputs an image set in association with a non-utterance portion in which the plurality of speakers are not speaking in the first moving image data acquired by the first moving image acquisition unit.

The moving image output apparatus according to any one of claims 1 to 6, wherein the second moving image output unit sequentially outputs the plurality of second moving image data so that the speaker can view them.

The moving image output apparatus according to any one of claims 1 to 4, wherein the first moving image acquisition unit acquires the first moving image data after all utterances by the speaker have been made.

A moving image output method comprising:
acquiring, as first moving image data, moving image data obtained by photographing a speaker who is speaking;
generating, as second moving image data, each of a plurality of utterance portions in which the speaker is speaking in the acquired first moving image data; and
sequentially outputting the plurality of generated second moving image data.

A moving image output program that causes a computer to execute:
a process of acquiring, as first moving image data, moving image data obtained by photographing a speaker who is speaking;
a process of generating, as second moving image data, each of a plurality of utterance portions in which the speaker is speaking in the acquired first moving image data; and
a process of sequentially outputting the plurality of generated second moving image data.