JP2023120068A

JP2023120068A - Speech processing system, device and speech processing method

Info

Publication number: JP2023120068A
Application number: JP2022023261A
Authority: JP
Inventors: 大介切金; Taisuke Kirigane
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2022-02-17
Filing date: 2022-02-17
Publication date: 2023-08-29

Abstract

To provide a speech processing system that lightens a processing load of conversion to text data. SOLUTION: The present invention relates to a speech processing system in which a terminal unit and a device comprising a microphone communicate with each other, and the speech processing system has: a speech synthesis part which synthesizes a first speech that the terminal unit receives from outside and a second speech that the device collects to generate a synthetic speech; and a text conversion request part which sends a request to convert the synthetic speech that the speech synthesis part generates into text data to the outside. SELECTED DRAWING: Figure 2

Description

TECHNICAL FIELD

本発明は、音声処理システム、デバイス、及び、音声処理方法に関する。 The present invention relates to a speech processing system, a device, and a speech processing method.

Description of the Related Art

一方の拠点から１つ以上の他の拠点にリアルタイムに画像や音声を送信し、遠隔地にいるユーザー同士で画像や音声を用いた会議を行う遠隔会議サービスシステムが知られている。また、会議などの遠隔コミュニケーションの内容を議事録として記録する方法として、各拠点のユーザーの音声をテキストに変換して、テキストデータを保存する方法が知られている。 Teleconference service systems are known in which images and voices are transmitted in real time from one site to one or more other sites, and users at remote locations hold conferences using the images and voices. Also, as a method of recording the content of remote communication such as a meeting as minutes, a method of converting the voices of users at each site into text and saving the text data is known.

複数の話者のうち誰が発言したのかをテキストデータに関連付ける技術が知られている（例えば、特許文献１参照。）。特許文献１には、声紋識別により話者の声紋を区別することで発言している話者を区別し、各発言文章に対していずれかの話者の識別情報を付加する情報処理装置が開示されている。 A technology is known that associates text data with which of a plurality of speakers has spoken (see, for example, Patent Document 1). Patent Document 1 discloses an information processing device that identifies the speaker who is speaking by distinguishing speakers' voiceprints through voiceprint identification, and adds the identification information of one of the speakers to each uttered sentence.

しかしながら、従来の技術では、テキストデータへの変換の処理負荷が大きいという問題がある。すなわち、従来の技術は、複数の話者が発言した各音声を別々に音声認識しているため、音声認識のためのリソースを圧迫するおそれや認識に時間がかかるおそれがある。 However, the conventional technology has a problem that the processing load of conversion to text data is large. In other words, the conventional technology separately recognizes each voice uttered by a plurality of speakers, so there is a risk that resources for voice recognition will be pressed and recognition will take time.

SUMMARY OF THE INVENTION

本発明は、上記課題に鑑み、テキストデータへの変換の処理負荷を低減する技術を提供することを目的とする。 In view of the above problem, an object of the present invention is to provide a technique for reducing the processing load of conversion into text data.

上記課題に鑑み、本発明は、端末装置と、マイクを備えたデバイスとが通信する音声処理システムであって、前記端末装置が外部から受信した第一の音声と前記デバイスが集音した第二の音声を合成して合成音声を生成する音声合成部と、前記音声合成部が合成した合成音声のテキストデータへの変換を外部に要求するテキスト変換要求部と、を有することを特徴とする。 In view of the above problem, the present invention provides a speech processing system in which a terminal device and a device equipped with a microphone communicate with each other, the system comprising: a speech synthesis unit that synthesizes a first speech received by the terminal device from the outside and a second speech collected by the device to generate a synthesized speech; and a text conversion request unit that requests an external destination to convert the synthesized speech generated by the speech synthesis unit into text data.

テキストデータへの変換の処理負荷を低減する音声処理システムを提供できる。 It is possible to provide a speech processing system that reduces the processing load of conversion into text data.

遠隔会議中に実行されたアプリの画面を周囲のパノラマ画像と共に保存する記録情報の作成の概略を説明する図である。 A diagram illustrating an overview of the creation of recorded information that saves the screen of an application executed during a teleconference together with a surrounding panoramic image.
記録情報作成システムの機能に関するシステムブロック図の一例である。 An example of a system block diagram relating to the functions of the recorded information creation system.
記録情報作成システムの構成例を示す図である。 A diagram showing a configuration example of the recorded information creation system.
情報処理システム及び端末装置の一例のハードウェア構成を示す図である。 A diagram showing an example hardware configuration of the information processing system and the terminal device.
３６０°の動画を撮像可能なミーティングデバイスのハードウェア構成図の一例である。 An example of a hardware configuration diagram of a meeting device capable of capturing 360° moving images.
ミーティングデバイスの撮像範囲を説明する図である。 A diagram explaining the imaging range of the meeting device.
パノラマ画像と話者画像の切り出しを説明する図である。 A diagram explaining the cutting out of a panoramic image and a speaker image.
記録情報作成システムにおける端末装置、ミーティングデバイス、及び、情報処理システムの機能をブロックに分けて説明する機能ブロック図の一例である。 An example of a functional block diagram explaining, in blocks, the functions of the terminal device, the meeting device, and the information processing system in the recorded information creation system.
情報記憶部が記憶している動画記録情報の一例を示す図である。 A diagram showing an example of moving image recording information stored in the information storage unit.
会議情報取得部が管理する、会議情報記憶部に記憶された会議情報の一例を示す図である。 A diagram showing an example of conference information stored in the conference information storage unit and managed by the conference information acquisition unit.
録画情報記憶部に記憶されている録画情報の一例を示す図である。 A diagram showing an example of recording information stored in the recording information storage unit.
ストレージサービスシステムに保存されているテキストデータの構造を説明する図である。 A diagram explaining the structure of text data stored in the storage service system.
端末装置で動作する情報記録アプリが表示するログイン後の初期画面の一例を示す図である。 A diagram showing an example of the initial screen after login displayed by the information recording application running on the terminal device.
情報記録アプリが表示する録画設定画面の一例を示す図である。 A diagram showing an example of the recording setting screen displayed by the information recording application.
情報記録アプリが表示する会議一覧画面の一例を示す図である。 A diagram showing an example of the conference list screen displayed by the information recording application.
会議記録確認部が提示し、端末装置が表示する会議記録確認画面の一例を示す図である。 A diagram showing an example of the conference record confirmation screen presented by the conference record confirmation unit and displayed by the terminal device.
一例の拠点判断条件を説明する図である。 A diagram explaining an example of site determination conditions.
拠点判断条件1－1による拠点の判断方法を説明するフローチャート図の一例である。 An example of a flowchart explaining a method of determining a site based on site determination condition 1-1.
拠点判断条件1－2による拠点の判断方法を説明するフローチャート図の一例である。 An example of a flowchart explaining a method of determining a site based on site determination condition 1-2.
拠点判断条件２による拠点の判断方法を説明するフローチャート図の一例である。 An example of a flowchart explaining a method of determining a site based on site determination condition 2.
キュー方式で、拠点識別情報Ｅ、音声認識結果Ｄ、及び、合成音声Ｃをユーザーが対応付ける方法を説明する図である。 A diagram explaining a method by which a user associates site identification information E, speech recognition result D, and synthesized speech C using a queue method.
タイムスタンプ方式で、拠点識別情報Ｅ、音声認識結果Ｄ、及び、合成音声Ｃをユーザーが対応付ける方法を説明する図である。 A diagram explaining a method by which a user associates site identification information E, speech recognition result D, and synthesized speech C using a timestamp method.
情報記録アプリがパノラマ画像、話者画像及びアプリの画面を録画する手順を示すシーケンス図の一例である。 An example of a sequence diagram showing a procedure by which the information recording application records a panoramic image, a speaker image, and an application screen.
情報記録アプリがパノラマ画像、話者画像及びアプリの画面を録画する手順を示すシーケンス図の一例である。 An example of a sequence diagram showing a procedure by which the information recording application records a panoramic image, a speaker image, and an application screen.
ユーザーが拠点識別情報Ｅを編集する処理を説明するシーケンス図の一例である。 An example of a sequence diagram explaining processing by which a user edits the site identification information E.

以下、本発明を実施するための形態の一例として、記録情報作成システムと記録情報作成システムが行う音声処理方法について説明する。 Hereinafter, as an example of a mode for carrying out the present invention, a recording information creation system and an audio processing method performed by the recording information creation system will be described.

＜遠隔会議におけるテキストデータの作成方法の一例＞
まず、図１を参照して、パノラマ画像とアプリの画面を用いた議事録の作成方法の概略を説明する。図１は、遠隔会議中に実行されたアプリの画面を周囲のパノラマ画像と共に保存する記録情報の作成の概略を説明する図である。図１に示すように、図示する自拠点１０２にいるユーザーが遠隔会議サービスシステム９０を利用して、他の拠点１０１と遠隔会議を行っている。 <Example of how to create text data in a teleconference>
First, with reference to FIG. 1, an outline of a method of creating minutes using a panoramic image and an application screen will be described. FIG. 1 is a diagram explaining an overview of the creation of recorded information that saves the screen of an application executed during a teleconference together with a surrounding panoramic image. As shown in FIG. 1, a user at the illustrated own site 102 is holding a teleconference with another site 101 using a teleconference service system 90.

本実施形態の記録情報作成システム１００は、マイクとスピーカーを備えたミーティングデバイス６０が撮像した水平パノラマ画像（以下、パノラマ画像という）と、端末装置１０が実行するアプリケーション（以下、アプリという）が作成する画面と、を用いて、記録情報（議事録）を作成する。音声については、記録情報作成システム１００（音声処理システムの一例）は、遠隔会議アプリ４２が受信する音声と、ミーティングデバイス６０が取得する音声とを合成して、記録情報に含める。以下、概略を説明する。 The recorded information creation system 100 of this embodiment creates recorded information (minutes) using a horizontal panoramic image (hereinafter referred to as a panoramic image) captured by a meeting device 60 equipped with a microphone and a speaker, and a screen created by an application (hereinafter also called an app) executed by the terminal device 10. As for audio, the recorded information creation system 100 (an example of a speech processing system) synthesizes the audio received by the teleconference application 42 and the audio acquired by the meeting device 60, and includes the result in the recorded information. An outline is given below.

(1) 端末装置１０では、後述する情報記録アプリ４１と遠隔会議アプリとが動作している。この他、資料表示用のアプリなども動作していてよい。情報記録アプリ４１は、端末装置１０が出力する音声（遠隔会議アプリが他拠点から受信した音声を含む。第一の音声データの一例。）をミーティングデバイス６０に送信する。ミーティングデバイス６０は、自身が取得している音声（第二の音声データの一例）と、遠隔会議アプリの音声とをミキシング（合成）する。ミーティングデバイス６０は、端末装置１０が出力する音声の音圧に基づいて発言した拠点を判断する。 (1) In the terminal device 10, an information recording application 41 and a teleconference application, which will be described later, are running. In addition, an application for displaying materials may also be running. The information recording application 41 transmits audio output by the terminal device 10 (including audio received by the teleconference application from other bases; an example of first audio data) to the meeting device 60 . The meeting device 60 mixes (synthesizes) the voice (an example of the second voice data) acquired by itself and the voice of the teleconference application. The meeting device 60 determines the location of the speech based on the sound pressure of the voice output by the terminal device 10 .

(2) ミーティングデバイス６０はマイクを備え、音声を取得した方向に基づき、パノラマ画像から話者を切り出す処理を行い、話者画像を作成する。ミーティングデバイス６０は、パノラマ画像と話者画像の両方を端末装置１０に送信する。 (2) The meeting device 60 is equipped with a microphone, and based on the direction from which the voice is acquired, performs a process of cutting out the speaker from the panorama image and creates a speaker image. The meeting device 60 transmits both the panorama image and the speaker image to the terminal device 10 .

(3) 端末装置１０で動作する情報記録アプリ４１は、パノラマ画像２０３と話者画像２０４を表示できる。情報記録アプリ４１は、ユーザーが選択した任意のアプリの画面（例えば遠隔会議アプリの画面１０３）と、パノラマ画像２０３と話者画像２０４と、を結合する。例えば、左側にパノラマ画像２０３と話者画像２０４、右側に遠隔会議アプリの画面１０３が配置されるように、パノラマ画像２０３、話者画像２０４、アプリの画面１０３を結合する（以下、結合画像１０５という）。(3)の処理は繰り返し実行されるので、結合画像１０５は動画となる（以下、結合画像動画という）。また、情報記録アプリ４１は、結合画像動画に、合成された音声（以下、合成音声という）を結合して音声付きの動画を作成する。 (3) The information recording application 41 running on the terminal device 10 can display the panoramic image 203 and the speaker image 204. The information recording application 41 combines the screen of any application selected by the user (for example, the screen 103 of the teleconference application) with the panoramic image 203 and the speaker image 204. For example, the panoramic image 203, the speaker image 204, and the application screen 103 are combined so that the panoramic image 203 and the speaker image 204 are arranged on the left side and the teleconference application screen 103 on the right side (the result is hereinafter referred to as the combined image 105). Since process (3) is executed repeatedly, the combined image 105 becomes a moving image (hereinafter referred to as a combined image moving image). In addition, the information recording application 41 creates a moving image with sound by combining the synthesized audio (hereinafter referred to as synthesized speech) with the combined image moving image.

なお、本実施形態では、パノラマ画像２０３、話者画像２０４、アプリの画面１０３を結合する例を説明するが、情報記録アプリ４１がこれらを別々に保存し、再生時に画面に配置してもよい。 In this embodiment, an example in which the panoramic image 203, the speaker image 204, and the application screen 103 are combined will be described, but the information recording application 41 may store them separately and arrange them on the screen during playback.

(4) 情報記録アプリ４１は、編集作業（ユーザーによる不要箇所のカット等）を受け付け、結合画像動画を完成させる。結合画像動画は記録情報の一部を構成する。 (4) The information recording application 41 accepts editing work (cutting of unnecessary parts by the user, etc.) and completes the combined image moving image. The combined image moving image constitutes part of the recorded information.

(5) 情報記録アプリ４１は、作成した結合画像動画（音声付き）をストレージサービスシステム７０に送信し保存しておく。 (5) The information recording application 41 transmits the created combined image moving image (with sound) to the storage service system 70 and stores it.

(6) 音声の一括変換の場合、情報記録アプリ４１は、結合画像動画から音声のみを抽出しておき（結合前の音声を取っておいてもよい）、抽出した音声を、情報処理システム５０に送信する。情報処理システム５０は音声をテキストデータに変換する音声認識サービスシステム８０に送信し、音声をテキスト化する。テキストデータには、録画開始から何分後に話したか、というデータも含まれる。 (6) In the case of batch conversion of audio, the information recording application 41 extracts only the audio from the combined image moving image (the audio from before combining may instead be retained) and transmits the extracted audio to the information processing system 50. The information processing system 50 transmits the audio to a speech recognition service system 80 that converts speech into text data, and the audio is converted into text. The text data also includes data indicating how many minutes after the start of recording each utterance was made.

(7) リアルタイム変換の場合、ミーティングデバイス６０が拠点の判断後に合成音声Ｃを情報処理システム５０に送信する。情報処理システム５０はリアルタイムに音声認識サービスシステムでテキストデータに変換し、このテキストデータを情報記録アプリ４１に送信する。本実施形態では、主にリアルタイム変換の場合を説明する。 (7) In the case of real-time conversion, the meeting device 60 transmits the synthesized speech C to the information processing system 50 after determining the base. The information processing system 50 converts the data into text data in real time using the speech recognition service system, and transmits this text data to the information recording application 41 . In this embodiment, the case of real-time conversion will be mainly described.

なお、情報処理システム５０は、ユーザーに対し利用したサービスに応じた課金処理を実行できる。例えば、課金はテキストデータ量、結合画像動画のファイルサイズ、処理時間などに基づいて算出される。 Note that the information processing system 50 can execute charging processing according to the service used by the user. For example, the charge is calculated based on the amount of text data, the file size of the combined image moving image, the processing time, and the like.

(8) 情報処理システム５０は、結合画像動画を格納したストレージサービスシステム７０に、テキストデータを追加で格納する。ユーザーは結合画像動画を端末装置１０で再生できる。テキストデータは記録情報の一部を構成する。 (8) The information processing system 50 additionally stores text data in the storage service system 70 that stores the combined image moving image. The user can reproduce the combined image moving image on the terminal device 10 . Text data forms part of the recorded information.

このように、結合画像動画には、ユーザーを含む周囲のパノラマ画像、話者画像、及び、遠隔会議中に表示されたアプリの画面が表示され、録画される。音声認識が合成音声に対し行われるので、別々に音声認識するよりも音声認識サービスシステムの処理負荷を低減できる。また、合成音声は、音圧情報に基づいて発言された拠点が判断されるので、音声データがどの拠点で発言されたものか記録できる。 In this way, in the combined image moving image, the surrounding panorama image including the user, the speaker image, and the screen of the application displayed during the teleconference are displayed and recorded. Since the speech recognition is performed on the synthesized speech, the processing load of the speech recognition service system can be reduced as compared with separate speech recognition. In addition, since the site where the synthetic voice was spoken is determined based on the sound pressure information, it is possible to record the site where the voice data was spoken.
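The arrangement described in step (3), with the panoramic image and speaker image on the left and the application screen on the right, can be sketched as a simple layout computation. This is a minimal illustration only: the names `Rect` and `combined_layout`, the 40% left-column width, and the choice to stack the panoramic image above the speaker image are all assumptions made for the example and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Pixel rectangle inside the combined image (hypothetical helper)."""
    x: int
    y: int
    w: int
    h: int


def combined_layout(canvas_w: int, canvas_h: int, left_percent: int = 40) -> dict:
    """Return placement rectangles for one frame of the combined image.

    Assumption: the left column is split vertically, panoramic image on top
    and speaker image below; the application screen fills the right column.
    """
    left_w = canvas_w * left_percent // 100
    top_h = canvas_h // 2
    return {
        "panorama": Rect(0, 0, left_w, top_h),
        "speaker": Rect(0, top_h, left_w, canvas_h - top_h),
        "app_screen": Rect(left_w, 0, canvas_w - left_w, canvas_h),
    }


layout = combined_layout(1280, 720)
```

Repeating this placement for every captured frame would yield the sequence of combined images that forms the combined image moving image.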

＜用語について＞
アプリケーション（アプリ）とは、ある特定の機能や目的のために開発・使用されるソフトウェアで、コンピュータの操作自体のためのものではないものである。アプリケーションにはネイティブアプリとＷｅｂアプリがある。 <Terms>
An application (app) is software that is developed and used for a specific function or purpose, not for operating a computer itself. Applications include native apps and web apps.

実行中のアプリとは、アプリが起動されてから終了されるまでの間の状態のアプリをいう。アプリはアクティブ（最も手前にあるアプリ）でなくてもよく、バックグラウンドで動作していればよい。 A running application is an application that is in a state from when the application is started until it is terminated. The app doesn't have to be active (the one in the foreground), it just needs to run in the background.

デバイスとは、周囲の画像を撮像でき、周囲の音声を集音できる装置である。本実施形態では、ミーティングデバイス６０という用語で説明される。 A device is a device capable of capturing an image of the surroundings and collecting sounds of the surroundings. In this embodiment, the term meeting device 60 will be used.

ミーティングデバイス６０が取得した周囲の画像は、水平方向に通常の画角より広い画角で撮像された画像をいう。本実施形態では、周囲の画像は、パノラマ画像という用語で説明される。パノラマ画像はおおむね水平方向に１８０°～３６０°の画角がある画像である。ミーティングデバイス６０は１台でパノラマ画像を撮像しなくてもよく、通常の画角の撮像装置が複数個、組み合わされていてもよい。 The image of the surroundings acquired by the meeting device 60 refers to an image captured with a wider angle of view than the normal angle of view in the horizontal direction. In this embodiment, surrounding images are described in terms of panoramic images. A panoramic image is an image having an angle of view of approximately 180° to 360° in the horizontal direction. The meeting device 60 does not have to capture a panoramic image by itself, and may be a combination of a plurality of imaging devices with normal angles of view.

記録情報とは、情報記録アプリ４１が記録する情報である。情報記録アプリ４１が遠隔会議アプリの画面を録画した場合、記録情報が議事録となる場合がある。記録情報は、例えば、結合画像動画（音声を含む）及び音声が音声認識されたテキストデータを含む。 Recorded information is information recorded by the information recording application 41 . When the information recording application 41 records the screen of the teleconference application, the recorded information may be minutes. Recorded information includes, for example, combined image moving images (including audio) and text data in which audio has been recognized.

テナントとは、サービスの提供者からサービスを受けることを契約したユーザーのグループ（企業や自治体、これらの一部の組織等）である。本実施形態の記録情報の作成やテキストデータへの変換は、テナントがサービス提供元と契約しているために実行される。 A tenant is a group of users (companies, local governments, some of these organizations, etc.) who have contracted to receive services from a service provider. Creation of record information and conversion into text data in this embodiment are executed because the tenant has a contract with the service provider.

遠隔コミュニケーションとは、物理的に離れた拠点にいる相手と、ソフトウェアや端末装置を活用することによって音声や映像を通じたコミュニケーションを取ることをいう。遠隔コミュニケーションの一例に遠隔会議があり、会議は、会合、ミーティング、打ち合わせ、集会、寄り合い、集まり、セミナーは、講習会、勉強会、ゼミ、研修会等と呼ばれてもよい。 Remote communication refers to communicating with a person at a physically distant base through voice and video using software and terminal devices. An example of remote communication is a teleconference, and the conference may be called a meeting, a meeting, a meeting, a gathering, a get-together, a gathering, and a seminar may be called a workshop, a study group, a seminar, a training session, or the like.

拠点とは、活動のよりどころとする場所をいう。拠点の例として会議室がある。会議室は、主に会議に使用することを目的に設置された部屋のことである。 A base is a place where activities are based. An example of a base is a conference room. A conference room is a room set up mainly for the purpose of using it for a conference.

音声とは人間が発する言語音や周囲の音等であり、音声データは音声をデータ化したものであるが、本実施形態では、厳密には区別せずに説明する。 Speech refers to language sounds uttered by humans, ambient sounds, and the like, and speech data is speech converted into data; in this embodiment, however, the two are described without strict distinction.

＜機能に関するシステムブロック図＞
図２は、記録情報作成システム１００の機能に関するシステムブロック図である。記録情報作成システム１００は、遠隔会議における自拠点音声Ａを取得する自拠点音声入力部７と、自拠点以外の拠点の他拠点音声Ｂを取得する他拠点音声入力部８を有する。 <System block diagram related to functions>
FIG. 2 is a system block diagram relating to the functions of the recorded information creation system 100. The recorded information creation system 100 has an own-site voice input unit 7 that acquires own-site voice A in a teleconference, and an other-site voice input unit 8 that acquires other-site voice B from sites other than the own site.

自拠点音声入力部７は、例えば一般的なマイクロホンでよい。また、他拠点音声入力部８は、自拠点音声入力部７とは別に、他拠点音声Ｂを取得できるモジュールである。他拠点音声入力部８は、例えば遠隔会議に参加している端末装置と記録情報作成システム１００を接続する接続部を介して他拠点音声Ｂを取得できるＵＳＢコネクタ、Ｂｌｕｅｔｏｏｔｈ（登録商標）、Wi－Fiなどの無線通信モジュールなどが考えられる。 The own-site voice input unit 7 may be, for example, a general microphone. The other-site voice input unit 8 is a module that can acquire the other-site voice B separately from the own-site voice input unit 7. The other-site voice input unit 8 may be, for example, a USB connector or a wireless communication module such as Bluetooth (registered trademark) or Wi-Fi that can acquire the other-site voice B via a connection unit that connects a terminal device participating in the teleconference to the recorded information creation system 100.

上記構成により、記録情報作成システム１００は、複数の専用端末や専用アカウントを用意せずとも、一般的な遠隔コミュニケーションシステムを利用する遠隔会議においても自拠点音声Ａと他拠点音声Ｂをそれぞれ別音声として取得可能となる。 With the above configuration, the recorded information creation system 100 can acquire the own-site voice A and the other-site voice B as separate audio streams, even in a teleconference that uses a general remote communication system, without preparing multiple dedicated terminals or dedicated accounts.

更に、自拠点音声Ａと他拠点音声Ｂは音声合成部６５に入力され、合成処理されたことで合成音声Ｃとして出力される。合成音声Ｃは音声認識部５５で音声認識技術によりテキスト化され、音声認識結果Ｄとして音声認識結果記録部５７に保存される。更に、合成音声Ｃは音声記録部５６により音声データとして録音される。 Further, the local site's voice A and the other site's voice B are input to the voice synthesizing unit 65 and output as a synthesized voice C after being subjected to synthesizing processing. The synthesized speech C is converted into text by a speech recognition technique in the speech recognition section 55 and stored as a speech recognition result D in the speech recognition result recording section 57 . Furthermore, the synthesized speech C is recorded as speech data by the speech recording unit 56 .
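The synthesis of own-site voice A and other-site voice B into synthesized speech C can be illustrated with a minimal sample-wise mixer. The patent does not specify the mixing algorithm, so this sketch assumes 16-bit PCM samples at the same sample rate, simple addition, and clipping to the 16-bit range; the function name `mix_pcm16` is hypothetical.

```python
from itertools import zip_longest

INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm16(own_site_a, other_site_b):
    """Mix two sequences of 16-bit PCM samples into one (assumed method).

    Samples are added pairwise; the shorter input is padded with silence (0),
    and each sum is clipped so it stays a valid 16-bit sample.
    """
    mixed = []
    for a, b in zip_longest(own_site_a, other_site_b, fillvalue=0):
        mixed.append(max(INT16_MIN, min(INT16_MAX, a + b)))
    return mixed


synthesized_c = mix_pcm16([1000, -2000, 30000], [500, 500, 10000, 7])
```

Because both inputs end up in a single stream, only one stream needs to be sent for speech recognition and recording, which is the property the surrounding text relies on.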

音声認識技術として、音声認識サービスシステム８０は、一般的な音声認識エンジンを利用可能である。音声認識エンジンの実装形態としてはハードウェアへの組み込みやクラウドサービスの利用などが考えられシステムごとに選択可能である。 As a speech recognition technology, the speech recognition service system 80 can use a general speech recognition engine. As the implementation form of the speech recognition engine, it is possible to select it for each system, such as embedding it in hardware or using a cloud service.

自拠点音声Ａと他拠点音声Ｂを合成音声Ｃに合成することにより、通信環境が悪い環境においてもデータ量が増大しにくいので音声認識部５５へ音声データを入力できる。 By synthesizing the local site's voice A and the other site's voice B into the synthesized voice C, it is possible to input the voice data to the voice recognition unit 55 because the amount of data is less likely to increase even in a poor communication environment.

また、複数拠点の音声を合成せずに音声認識すると２つの音声認識を並行処理する必要があるため、音声認識に際してパフォーマンス低下が懸念され、リアルタイム性も損なわれる可能性がある。しかし、複数拠点の音声を合成して合成音声Ｃとして音声認識部５５に音声データを入力することで、低パフォーマンス、低コストで音声認識によりテキスト化が可能になり、リアルタイム性も向上する。更に、合成音声Ｃを音声認識することで、文脈を推定した音声認識も行うことが可能であり、音声認識率が向上する。 Also, if the voices from multiple sites were recognized without being synthesized, two speech recognition processes would have to run in parallel, which raises concerns about degraded recognition performance and loss of real-time behavior. By synthesizing the voices from multiple sites and inputting them to the speech recognition unit 55 as the synthesized speech C, text conversion by speech recognition becomes possible with modest computing resources and at low cost, and real-time behavior also improves. Furthermore, recognizing the synthesized speech C allows recognition that takes the conversational context into account, which improves the speech recognition rate.

更に、音声記録部５６が合成音声Ｃを記録することで、合成前のどちらかの音声データが紛失したり、再生時に発言のタイミングがずらされたりせず、同時に発声された音声がそのまま記録できるため、ユーザーは会議音声を違和感なく後日、会議記録として確認することができる。 Furthermore, since the voice recording unit 56 records the synthesized voice C, the voices uttered at the same time can be recorded as they are without loss of either of the voice data before synthesis or shifting of the timing of utterances during reproduction. Therefore, the user can check the conference voice as a conference record at a later date without any sense of incongruity.

なお、記録情報作成システム１００は、音声認識前や音声合成前に音声データを処理・加工する音声データ加工部を有していてもよい。 Note that the recorded information creation system 100 may have a voice data processing unit that processes and processes voice data before voice recognition or voice synthesis.

また、記録情報作成システム１００は拠点判断処理部６４を有する。拠点判断処理部６４は自拠点音声Ａと他拠点音声Ｂから、自拠点音圧情報Ａ'と他拠点音圧情報Ｂ'とをそれぞれ検知する音圧検知部６４ａを有する。拠点判断部６４ｂは、自拠点音圧情報Ａ'と他拠点音圧情報Ｂ'を比較することにより合成音声Ｃが自拠点音声Ａと他拠点音声Ｂのどちらで主に構成されているかを判断することにより、拠点識別情報Ｅを生成する。拠点識別情報Ｅは拠点判断結果記録部５８に保存される。 The recorded information creation system 100 also has a site determination processing unit 64 . The base determination processing unit 64 has a sound pressure detection unit 64a that detects own base sound pressure information A' and other base sound pressure information B' from the own base sound A and the other base sound B, respectively. The base determination unit 64b compares the own base sound pressure information A' and the other base sound pressure information B' to determine whether the synthesized speech C is mainly composed of the own base sound A or the other base sound B. By doing so, the base identification information E is generated. The site identification information E is stored in the site determination result recording unit 58 .
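The comparison of own-site sound pressure information A' and other-site sound pressure information B' can be sketched as follows. This is a simplified illustration under stated assumptions: sound pressure is approximated by the RMS level of one frame of samples, a fixed silence threshold is used, and the labels "own", "other", and "none" are hypothetical. The actual site determination conditions of the patent are described elsewhere in the document and may differ.

```python
import math


def rms(samples):
    """Root-mean-square level of one audio frame (stand-in for sound pressure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def determine_site(own_site_a, other_site_b, silence_threshold=100.0):
    """Return a site label E for one frame by comparing A' and B'.

    Assumption: whichever input is louder is taken to dominate the
    synthesized speech C; frames where both are quiet get no site.
    """
    a_level = rms(own_site_a)      # A'
    b_level = rms(other_site_b)    # B'
    if max(a_level, b_level) < silence_threshold:
        return "none"
    return "own" if a_level >= b_level else "other"


label = determine_site([1000, -1000], [10, -10])
```

The design point is that only two scalar levels are compared per frame, which is far cheaper than running voiceprint or face recognition.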

発言拠点の識別に、音圧情報を比較するという簡易な手段を用いることにより、声紋を利用した音声識別ＡＩや発言者の顔画像を利用する顔認識ＡＩなどの話者識別ＡＩを用いずとも、拠点識別可能になるため低パフォーマンスで拠点識別が可能になり、低コストで実装が可能である。 Because the speaking site is identified by the simple means of comparing sound pressure information, no speaker identification AI, such as voice identification AI using voiceprints or face recognition AI using speakers' face images, is needed. Site identification therefore requires little computing power and can be implemented at low cost.

更に、記録情報作成システム１００は、拠点判断結果記録部５８と、音声認識結果記録部５７と、音声記録部５６からそれぞれ拠点識別情報Ｅと、音声認識結果Ｄと、合成音声Ｃを同期して読み出しユーザーに表示する会議記録確認部５９を有する。 Furthermore, the recorded information creation system 100 has a conference record confirmation unit 59 that reads out the site identification information E, the speech recognition result D, and the synthesized speech C in synchronization from the site determination result recording unit 58, the speech recognition result recording unit 57, and the voice recording unit 56, respectively, and displays them to the user.
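One way to read out site identification information E and speech recognition result D in synchronization is to join them on timestamps. The sketch below is an assumption-laden illustration, not the patent's procedure: it supposes that each recognition result carries start and end times in seconds, that site labels are recorded as (timestamp, site) pairs, and that a majority vote over the labels inside the interval decides the site. The names `Recognition` and `label_results` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Recognition:
    """One entry of speech recognition result D (hypothetical structure)."""
    start: float  # seconds from the start of recording
    end: float
    text: str


def label_results(results, site_labels):
    """Attach a site to each recognized utterance.

    site_labels: list of (timestamp_seconds, site) pairs, i.e. E.
    Each utterance gets the site that occurs most often among the labels
    whose timestamp falls in [start, end); "unknown" if there are none.
    """
    labeled = []
    for r in results:
        votes = Counter(site for t, site in site_labels if r.start <= t < r.end)
        site = votes.most_common(1)[0][0] if votes else "unknown"
        labeled.append((r.start, site, r.text))
    return labeled


rows = label_results(
    [Recognition(0.0, 2.0, "hello"), Recognition(2.0, 4.0, "hi")],
    [(0.5, "own"), (1.0, "own"), (1.5, "other"), (2.5, "other"), (3.0, "other")],
)
```

The `start` value of each row can also serve as the playback offset into the recorded synthesized speech C, so text, site, and audio stay aligned when the record is reviewed.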

以上によりユーザーは音声認識結果が自拠点の発言に基づくテキストであるか、他拠点の発言に基づくテキストであるかが分かるようになり、議事録の理解が促進する。 As described above, the user can recognize whether the speech recognition result is a text based on the utterance of the user's own site or a text based on the utterance of another site, thereby facilitating understanding of the minutes.

＜システム構成例＞
続いて、図３を参照して、記録情報作成システム１００のシステム構成を説明する。図３は、記録情報作成システム１００の構成例を示す。図３では、遠隔会議を行う複数の拠点のうち１つの拠点（自拠点１０２）を示し、自拠点１０２における端末装置１０がネットワークを介して情報処理システム５０と、ストレージサービスシステム７０と、遠隔会議サービスシステム９０と、通信する。自拠点１０２には更に、ミーティングデバイス６０が配置され、端末装置１０はこのミーティングデバイス６０とＵＳＢケーブル等を介して通信可能に接続されている。 <System configuration example>
Next, with reference to FIG. 3, the system configuration of the recorded information creation system 100 will be described. FIG. 3 shows a configuration example of the recorded information creation system 100. FIG. 3 shows one site (own site 102) among a plurality of sites holding a teleconference, where the terminal device 10 at the own site 102 communicates with the information processing system 50, the storage service system 70, and the teleconference service system 90 via a network. A meeting device 60 is further arranged at the own site 102, and the terminal device 10 is communicably connected to this meeting device 60 via a USB cable or the like.

端末装置１０では、少なくとも情報記録アプリ４１と遠隔会議アプリ４２とが動作する。遠隔会議アプリ４２は、他の拠点１０１の端末装置１０とネットワーク上の遠隔会議サービスシステム９０を介して通信することができ、各拠点のユーザー同士が遠隔地から会議できるようになっている。情報記録アプリ４１は、遠隔会議アプリ４２が実施する遠隔会議における記録情報を、情報処理システム５０及びミーティングデバイス６０の機能を使って作成する。 At least an information recording application 41 and a teleconference application 42 operate on the terminal device 10 . The teleconference application 42 can communicate with the terminal devices 10 of the other bases 101 via the teleconference service system 90 on the network, so that users at each base can have a conference from a remote location. The information recording application 41 uses the functions of the information processing system 50 and the meeting device 60 to create recorded information in the remote conference held by the remote conference application 42 .

なお、本実施形態では、遠隔会議中の記録情報を作成する例を説明するが、記録される会議は、遠隔の拠点と通信する会議でなくてもよい。つまり、会議は１拠点内の参加者のみが参加する会議でもよい。この場合、ミーティングデバイス６０が集音した音声のみが合成なしに保存される他、情報記録アプリ４１の処理に変更はない。 In this embodiment, an example of creating recorded information during a remote conference will be described, but the conference to be recorded does not have to be a conference that communicates with a remote site. In other words, the conference may be a conference in which only participants within one base participate. In this case, only the sound collected by the meeting device 60 is saved without synthesis, and the processing of the information recording application 41 remains unchanged.

端末装置１０には通常の画角のカメラが内蔵されており（外付けでもよい）、端末装置１０を操作するユーザー１０７を含む正面の画像を撮像している。通常の画角とは、パノラマ画像でない画像であるが、本実施形態では、主に全天球画像のように曲面でない平面画像である。また、端末装置１０にはマイクが内蔵されており（外付けでもよい）、端末装置１０を操作するユーザー等の周囲の音声を集音している。したがって、ユーザーは、情報記録アプリ４１を意識することなく、遠隔会議アプリ４２を使用した従来の遠隔会議が可能である。情報記録アプリ４１やミーティングデバイス６０は、端末装置１０の処理負荷増を除けば遠隔会議アプリ４２に影響を与えない。なお、遠隔会議アプリ４２はミーティングデバイス６０が撮像するパノラマ画像や話者画像を遠隔会議サービスシステム９０に送信することも可能である。 The terminal device 10 has a built-in (or external) camera with a normal angle of view, and captures a front image including the user 107 operating the terminal device 10 . A normal angle of view is an image that is not a panoramic image, but in the present embodiment, it is mainly a planar image that is not a curved surface such as an omnidirectional image. In addition, the terminal device 10 has a built-in microphone (it may be externally attached), and collects the surrounding sounds of the user operating the terminal device 10 or the like. Therefore, the user can have a conventional remote conference using the remote conference application 42 without being conscious of the information recording application 41 . The information recording application 41 and the meeting device 60 do not affect the teleconference application 42 except for an increase in the processing load on the terminal device 10 . Note that the teleconference application 42 can also transmit panoramic images and speaker images captured by the meeting device 60 to the teleconference service system 90 .

情報記録アプリ４１はミーティングデバイス６０と通信して記録情報を作成する。ミーティングデバイス６０は、パノラマ画像の撮像装置、マイク、及び、スピーカーを備えたミーティング用のデバイスである。端末装置１０が有するカメラは正面の限られた範囲しか撮像できないが、ミーティングデバイス６０はミーティングデバイス６０を囲む全周囲（必ずしも全周囲でなくてもよい）を撮像できる。ミーティングデバイス６０は図３に示す複数の参加者１０６を常に画角に収めることができる。 The information recording application 41 communicates with the meeting device 60 to create recorded information. The meeting device 60 is a meeting device that includes a panorama image capturing device, a microphone, and a speaker. The camera of the terminal device 10 can capture only a limited range of the front, but the meeting device 60 can capture the entire surroundings (not necessarily the entire surroundings) surrounding the meeting device 60 . The meeting device 60 can always keep the multiple participants 106 shown in FIG. 3 within the angle of view.

この他、ミーティングデバイス６０は、パノラマ画像からの話者画像の切り出し、ミーティングデバイス６０が取得した音声と端末装置１０が出力する音声（遠隔会議アプリ４２が受信した音声を含む）との合成等を行う。なお、ミーティングデバイス６０は、机の上に限らず自拠点１０２のどこに配置されてもよい。ミーティングデバイス６０は全天球画像を撮像できるので、例えば天井に配置されてもよい。 In addition, the meeting device 60 cuts out a speaker image from the panoramic image, synthesizes the audio it has acquired with the audio output by the terminal device 10 (including the audio received by the teleconference application 42), and so on. Note that the meeting device 60 may be placed anywhere in the own site 102, not only on a desk. Since the meeting device 60 can capture an omnidirectional image, it may be placed on the ceiling, for example.

The information recording application 41 displays a list of the applications running on the terminal device 10, synthesizes images for the recorded information described above (creates a combined image moving image), reproduces the combined image moving image, accepts editing, and the like. The information recording application 41 also displays a list of teleconferences that have been held or are scheduled to be held. The list of teleconferences is used as information about the recorded information, so that the user can associate a teleconference with its recorded information.

遠隔会議アプリ４２は、他の拠点１０１との通信接続、他の拠点１０１との画像及び音声の送受信、画像の表示や音声の出力等を行う。 The teleconference application 42 performs communication connection with other bases 101, transmission/reception of images and voices with other bases 101, display of images, output of voices, and the like.

なお、情報記録アプリ４１及び遠隔会議アプリ４２はＷｅｂアプリでもネイティブアプリでもよい。Ｗｅｂアプリとは、Ｗｅｂサーバー上のプログラムとＷｅｂブラウザ上のプログラムが協働して処理を行うアプリであり、端末装置１０へのインストールが不要なアプリである。ネイティブアプリとは、端末装置１０にインストールして利用されるアプリである。本実施形態では、両者ともネイティブアプリであるとして説明する。 Note that the information recording application 41 and the teleconference application 42 may be web applications or native applications. A web application is an application that performs processing in cooperation with a program on a web server and a program on a web browser, and does not need to be installed on the terminal device 10 . A native application is an application that is installed in the terminal device 10 and used. In this embodiment, it is assumed that both are native applications.

端末装置１０は、例えば、ＰＣ（Personal Computer）、スマートフォン、タブレット端末等、通信機能を備えた汎用的な情報処理装置でよい。端末装置１０は、この他、電子黒板、ゲーム機、ＰＤＡ（Personal Digital Assistant）、ウェアラブルＰＣ、カーナビ、産業機械、医療機器、ネットワーク家電等でもよい。端末装置１０は情報記録アプリ４１と遠隔会議アプリ４２が動作する装置であればよい。 The terminal device 10 may be, for example, a general-purpose information processing device having a communication function, such as a PC (Personal Computer), a smart phone, a tablet terminal, or the like. In addition, the terminal device 10 may be an electronic blackboard, a game machine, a PDA (Personal Digital Assistant), a wearable PC, a car navigation system, an industrial machine, a medical device, a network appliance, or the like. The terminal device 10 may be any device as long as the information recording application 41 and the teleconference application 42 operate.

情報処理システム５０は、ネットワーク上に配置された一台以上の情報処理装置である。情報処理システム５０は、情報記録アプリ４１と協働して処理を行う１つ以上のサーバーアプリと、基盤サービスを有している。このサーバーアプリは、会議管理システム９が管理する遠隔会議のリストの取得、遠隔会議で記録された記録情報の管理、各種設定やストレージパスの管理等を行う。基盤サービスは、ユーザー認証や契約、課金処理等を行う。 The information processing system 50 is one or more information processing devices arranged on a network. The information processing system 50 has one or more server applications that perform processing in cooperation with the information recording application 41, and infrastructure services. This server application acquires a list of remote conferences managed by the conference management system 9, manages recording information recorded in remote conferences, and manages various settings and storage paths. The platform service performs user authentication, contracts, billing processing, and the like.

会議管理システム９は会議室の予約、会議の予定の管理などを行うシステムである。会議管理システム９と情報処理システム５０とが一体でもよい。 The conference management system 9 is a system for reserving conference rooms and managing conference schedules. The conference management system 9 and the information processing system 50 may be integrated.

なお、情報処理システム５０の機能の全て又は一部は、クラウド環境に存在してもよいし、オンプレミス環境に存在してもよい。情報処理システム５０は複数台のサーバー装置により構成されてもよいし、一台の情報処理装置により構成されてもよい。例えば、サーバーアプリと基盤サービスが別々の情報処理装置より提供されてよいし、更にサーバーアプリ内の機能ごとに情報処理装置が存在してもよい。情報処理システム５０と次述するストレージサービスシステム７０、音声認識サービスシステム８０が一体でもよい。 All or part of the functions of the information processing system 50 may exist in the cloud environment or may exist in the on-premises environment. The information processing system 50 may be configured by a plurality of server devices, or may be configured by a single information processing device. For example, the server application and the infrastructure service may be provided by separate information processing devices, or there may be an information processing device for each function within the server application. The information processing system 50, the storage service system 70, and the voice recognition service system 80, which will be described later, may be integrated.

The storage service system 70 is storage means on the network and provides a storage service that accepts files and the like for storage. OneDrive (registered trademark), Google Workspace (registered trademark), Dropbox (registered trademark), and the like are known examples of the storage service system 70. The storage service system 70 may also be an on-premises NAS (Network Attached Storage) or the like.

音声認識サービスシステム８０は、音声データに音声認識を行いテキストデータに変換するサービスを提供する。音声認識サービスシステム８０は、汎用的な商用サービスでもよいし、情報処理システム５０の機能の一部でもよい。 The voice recognition service system 80 provides a service of performing voice recognition on voice data and converting it into text data. The speech recognition service system 80 may be a general-purpose commercial service, or may be part of the functions of the information processing system 50 .

<Hardware configuration example>
The hardware configurations of the information processing system 50 and the terminal device 10 according to the present embodiment will be described with reference to FIG. 4.

<<Information processing system and terminal device>>
FIG. 4 is a diagram showing an example hardware configuration of the information processing system 50 and the terminal device 10 according to this embodiment. As shown in FIG. 4, the information processing system 50 and the terminal device 10 are each constructed by a computer and include a CPU 501, a ROM 502, a RAM 503, an HD (Hard Disk) 504, an HDD (Hard Disk Drive) controller 505, a display 506, an external device connection I/F (Interface) 508, a network I/F 509, a bus line 510, a keyboard 511, a pointing device 512, an optical drive 514, and a media I/F 516.

Among these, the CPU 501 controls the overall operation of the information processing system 50 and the terminal device 10. The ROM 502 stores programs used to drive the CPU 501, such as an IPL. The RAM 503 is used as a work area for the CPU 501. The HD 504 stores various data such as programs. The HDD controller 505 controls reading and writing of various data from and to the HD 504 under the control of the CPU 501. The display 506 displays various information such as cursors, menus, windows, characters, and images. The external device connection I/F 508 is an interface for connecting various external devices. The external device in this case is, for example, a USB (Universal Serial Bus) memory or a printer. The network I/F 509 is an interface for data communication using the network N2. The bus line 510 is an address bus, a data bus, or the like for electrically connecting the components shown in FIG. 4, such as the CPU 501.

また、キーボード５１１は、文字、数値、又は各種指示などの入力に使用される複数のキーを備えた入力手段の一種である。ポインティングデバイス５１２は、各種指示の選択や実行、処理対象の選択、カーソルの移動などを行う入力手段の一種である。光学ドライブ５１４は、着脱可能な記録媒体の一例としての光記憶媒体５１３に対する各種データの読み出し又は書き込みを制御する。なお、光記憶媒体５１３は、ＣＤ，ＤＶＤ、Ｂｌｕ－ｒａｙ（登録商標）等でよい。メディアＩ／Ｆ５１６は、フラッシュメモリ等の記録メディア５１５に対するデータの読み出し又は書き込み（記憶）を制御する。 Also, the keyboard 511 is a type of input means having a plurality of keys used for inputting characters, numerical values, various instructions, and the like. A pointing device 512 is a kind of input means for selecting and executing various instructions, selecting a processing target, moving a cursor, and the like. An optical drive 514 controls reading or writing of various data to an optical storage medium 513 as an example of a removable recording medium. Note that the optical storage medium 513 may be a CD, DVD, Blu-ray (registered trademark), or the like. A media I/F 516 controls reading or writing (storage) of data to a recording medium 515 such as a flash memory.

<<Meeting device>>
The hardware configuration of the meeting device 60 will be described with reference to FIG. 5. FIG. 5 is an example of a hardware configuration diagram of the meeting device 60, which can capture 360° moving images. In the following description, the meeting device 60 is a device that uses imaging elements to capture, at a predetermined height, a 360° moving image of the device's surroundings; the number of imaging elements may be one, or two or more. The meeting device 60 need not be a dedicated device: a PC, digital camera, smartphone, or the like may be given substantially the same function by attaching an add-on 360° moving image capturing unit.

As shown in FIG. 5, the meeting device 60 includes an imaging unit 601, an image processing unit 604, an imaging control unit 605, a microphone 608, a sound processing unit 609, a CPU (Central Processing Unit) 611, a ROM (Read Only Memory) 612, an SRAM (Static Random Access Memory) 613, a DRAM (Dynamic Random Access Memory) 614, an operation unit 615, an external device connection I/F 616, a communication unit 617, an antenna 617a, an audio sensor 618, and a recessed terminal 621 for Micro USB.

Among these, the imaging unit 601 includes a wide-angle lens (a so-called fisheye lens) 602 having a 360° angle of view for forming a hemispherical image, and an imaging element 603 (image sensor) provided for each wide-angle lens. The imaging element 603 includes an image sensor, such as a CMOS (Complementary Metal Oxide Semiconductor) sensor or a CCD (Charge Coupled Device) sensor, that converts the optical image formed by the fisheye lens 602 into image data as an electrical signal and outputs it; a timing generation circuit that generates the horizontal or vertical synchronization signals, the pixel clock, and the like for the image sensor; and a group of registers in which various commands and parameters necessary for operating the imaging element are set.

撮像ユニット６０１の撮像素子６０３（イメージセンサー）は、各々、画像処理ユニット６０４とパラレルＩ／Ｆバスで接続されている。一方、撮像ユニット６０１の撮像素子６０３は、撮像制御ユニット６０５とは、シリアルＩ／Ｆバス（Ｉ２Ｃバス等）で接続されている。画像処理ユニット６０４、撮像制御ユニット６０５及び音処理ユニット６０９は、バス６１０を介してＣＰＵ６１１と接続される。更に、バス６１０には、ＲＯＭ６１２、ＳＲＡＭ６１３、ＤＲＡＭ６１４、操作部６１５、外部機器接続Ｉ／Ｆ６１６、通信部６１７、及び音声センサー６１８なども接続される。 The imaging elements 603 (image sensors) of the imaging unit 601 are each connected to the image processing unit 604 via a parallel I/F bus. On the other hand, the imaging element 603 of the imaging unit 601 is connected to the imaging control unit 605 via a serial I/F bus (such as an I2C bus). The image processing unit 604 , imaging control unit 605 and sound processing unit 609 are connected to a CPU 611 via a bus 610 . Furthermore, the bus 610 is also connected with a ROM 612, an SRAM 613, a DRAM 614, an operation unit 615, an external device connection I/F 616, a communication unit 617, an audio sensor 618, and the like.

The image processing unit 604 takes in the image data output from the imaging element 603 through the parallel I/F bus, applies predetermined processing to the image data, and creates panorama image data and speaker image data from the fisheye video. The image processing unit 604 further combines the panorama image, the speaker images, and the like, and outputs a single moving image.

撮像制御ユニット６０５は、一般に撮像制御ユニット６０５をマスタデバイス、撮像素子６０３をスレーブデバイスとして、Ｉ２Ｃバスを利用して、撮像素子６０３のレジスタ群にコマンド等を設定する。必要なコマンド等は、ＣＰＵ６１１から受け取る。また、撮像制御ユニット６０５は、同じくＩ２Ｃバスを利用して、撮像素子６０３のレジスタ群のステータスデータ等を取り込み、ＣＰＵ６１１に送る。 The imaging control unit 605 generally uses the I2C bus with the imaging control unit 605 as a master device and the imaging device 603 as a slave device to set commands and the like in registers of the imaging device 603 . Necessary commands and the like are received from the CPU 611 . The imaging control unit 605 also uses the I2C bus to take in status data and the like of the register group of the imaging device 603 and send it to the CPU 611 .

また、撮像制御ユニット６０５は、操作部６１５の撮像開始ボタンが押下されたタイミングあるいはPCから撮像開始指示を受信したタイミングで、撮像素子６０３ａ，６０３ｂに画像データの出力を指示する。ミーティングデバイス６０によっては、ディスプレイ（例えば、ＰＣやスマートフォンのディスプレイ）によるプレビュー表示機能や動画表示に対応する機能を持つ場合もある。この場合は、撮像素子６０３からの画像データの出力は、所定のフレームレート（フレーム／分）によって連続して行われる。 Also, the imaging control unit 605 instructs the imaging devices 603a and 603b to output image data at the timing when the imaging start button of the operation unit 615 is pressed or at the timing when the imaging start instruction is received from the PC. Depending on the meeting device 60, there may be a function corresponding to a preview display function or moving image display by a display (for example, a display of a PC or a smartphone). In this case, image data is output continuously from the image sensor 603 at a predetermined frame rate (frames/minute).

また、撮像制御ユニット６０５は、後述するように、ＣＰＵ６１１と協働して撮像素子６０３の画像データの出力タイミングの同期をとる同期制御手段としても機能する。なお、本実施形態では、ミーティングデバイス６０にはディスプレイが設けられていないが、表示部を設けてもよい。 The imaging control unit 605 also functions as synchronization control means for synchronizing the output timing of the image data of the imaging element 603 in cooperation with the CPU 611, as will be described later. In this embodiment, the meeting device 60 is not provided with a display, but may be provided with a display unit.

マイク６０８は、音を音（信号）データに変換する。音処理ユニット６０９は、マイク６０８から出力される音声データをＩ／Ｆバスを通して取り込み、音声データに対して所定の処理を施す。 A microphone 608 converts sound into sound (signal) data. The sound processing unit 609 takes in audio data output from the microphone 608 through the I/F bus and performs predetermined processing on the audio data.

ＣＰＵ６１１は、ミーティングデバイス６０の全体の動作を制御すると共に必要な処理を実行する。ＲＯＭ６１２は、ＣＰＵ６１１のための種々のプログラムを記憶している。ＳＲＡＭ６１３及びＤＲＡＭ６１４はワークメモリであり、ＣＰＵ６１１で実行するプログラムや処理途中のデータ等を記憶する。特にＤＲＡＭ６１４は、画像処理ユニット６０４での処理途中の画像データや処理済みの正距円筒射影画像のデータを記憶する。 The CPU 611 controls the overall operation of the meeting device 60 and executes necessary processing. ROM 612 stores various programs for CPU 611 . An SRAM 613 and a DRAM 614 are work memories, and store programs to be executed by the CPU 611, data during processing, and the like. In particular, the DRAM 614 stores image data being processed by the image processing unit 604 and data of the processed equirectangular projection image.

操作部６１５は、撮像開始ボタン６１５ａなどの操作ボタンの総称である。ユーザーは操作部６１５を操作することで、撮像や録画を開始する他、電源ON/OFFの実行、通信接続の実行、種々の撮像モードや撮像条件などの設定を入力する。 The operation unit 615 is a general term for operation buttons such as the imaging start button 615a. By operating the operation unit 615, the user inputs settings such as starting image capturing and recording, power ON/OFF execution, communication connection execution, and various image capturing modes and image capturing conditions.

外部機器接続Ｉ／Ｆ６１６は、各種の外部機器を接続するためのインターフェースである。この場合の外部機器は、例えば、ＰＣ(Personal Computer)等である。ＤＲＡＭ６１４に記憶された動画データや画像データは、この外部機器接続Ｉ／Ｆ６１６を介して外部端末に送信されたり、外付けのメディアに記録されたりする。 The external device connection I/F 616 is an interface for connecting various external devices. The external device in this case is, for example, a PC (Personal Computer) or the like. The moving image data and image data stored in the DRAM 614 are transmitted to an external terminal via the external device connection I/F 616 or recorded on an external medium.

通信部６１７は、ミーティングデバイス６０に設けられたアンテナ６１７ａを介して、Wi－Fi等の無線通信技術によって、インターネット経由でクラウドサーバと通信し、記憶した動画データや画像データをクラウドサーバに送信してもよい。また、通信部６１７は、BLE（Bluetooth Low Energy。登録商標）やNFC等の近距離無線通信技術を用いて付近のデバイスと通信してもよい。 The communication unit 617 communicates with the cloud server via the Internet using wireless communication technology such as Wi-Fi via an antenna 617a provided in the meeting device 60, and transmits the stored video data and image data to the cloud server. may Also, the communication unit 617 may communicate with nearby devices using short-range wireless communication technology such as BLE (Bluetooth Low Energy. Registered trademark) and NFC.

音声センサー６１８は、ミーティングデバイス６０の周辺（水平面）の３６０°においてどの方向から音声が大きい音で入力されたかを特定するために、３６０°の音声データを取得するセンサーである。音処理ユニット６０９は入力した３６０°の音声パラメータに基づき、最も強い方向を特定して３６０°における音声入力方向を出力する。 The audio sensor 618 is a sensor that acquires 360° audio data in order to identify from which direction a loud sound is input in 360° around the meeting device 60 (horizontal plane). The sound processing unit 609 identifies the strongest direction according to the input 360° audio parameters and outputs the 360° audio input direction.
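As a rough illustration of identifying the strongest direction, the following sketch assumes that the 360° audio parameters arrive as one sound level per equal angular sector around the device; the publication does not describe the parameter format. The function name `strongest_direction` is hypothetical.

```python
def strongest_direction(levels):
    """Return the center angle (degrees) of the loudest sector.

    Assumption (not from the publication): `levels` holds one sound
    level per equal-width sector, with sector 0 starting at 0 degrees.
    """
    if not levels:
        raise ValueError("at least one sector level is required")
    sector_width = 360.0 / len(levels)
    # Index of the sector with the highest level.
    loudest = max(range(len(levels)), key=lambda i: levels[i])
    # Report the middle of that sector as the audio input direction.
    return (loudest + 0.5) * sector_width
```

With four sectors, for example, the second sector spans 90° to 180°, so a peak there is reported as 135°.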

なお、他のセンサー（方位・加速度センサーやGPS等）が方位・位置・角度・加速度等を算出し、画像補正や位置情報付加に用いてもよい。 Note that other sensors (azimuth/acceleration sensor, GPS, etc.) may calculate the azimuth, position, angle, acceleration, etc., and use them for image correction and addition of position information.

また画像処理ユニット６０４は、以下の処理を行う。 The image processing unit 604 also performs the following processing.

・ＣＰＵ６１１は、パノラマ画像の作成を次の方法で行う。ＣＰＵ６１１は、球面映像を入力するイメージセンサーから入力されたRAWデータをBayer変換（RGB補完処理）等の所定のカメラ映像処理を行って魚眼映像（曲面の映像）を作成する。更に作成した魚眼映像（曲面の映像）に対してDeWarp処理（歪み補正処理）を行い、ミーティングデバイス６０の周辺の３６０°が写ったパノラマ画像（平面の映像）を作成する。 - The CPU 611 creates a panorama image by the following method. The CPU 611 performs predetermined camera image processing such as Bayer conversion (RGB interpolation processing) on RAW data input from an image sensor that inputs a spherical image to create a fisheye image (curved surface image). Further, DeWarp processing (distortion correction processing) is performed on the created fisheye image (curved surface image) to create a panoramic image (planar image) showing a 360° surrounding of the meeting device 60 .

- The CPU 611 creates a speaker image as follows. From the panorama image (planar image) showing the 360° surroundings, the CPU 611 creates a speaker image by cutting out the speaker. The CPU 611 treats the audio input direction, which is identified within the 360° range from the output of the audio sensor 618 and the sound processing unit 609, as the direction of the speaker and cuts the speaker image out of the panorama image. To cut out an image of a person based on the audio input direction, the CPU 611 takes a 30° portion centered on the identified audio direction, performs face detection (person detection or the like may be used instead) within that portion, and cuts out the detected person. Among the speaker images cut out in this way, the CPU 611 further identifies the speaker images of a specific number of people (for example, three) who have spoken most recently.
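Keeping only the speaker images of the most recent speakers can be sketched as follows. The class `RecentSpeakers` and the idea of keying each speaker by some identifier (for example, a quantized audio direction) are assumptions made for illustration; the publication does not state how speakers are told apart.

```python
from collections import OrderedDict


class RecentSpeakers:
    """Track the latest image of the most recent `limit` speakers."""

    def __init__(self, limit=3):
        self.limit = limit
        self._images = OrderedDict()  # oldest speaker first

    def update(self, speaker_key, image):
        # Move a returning speaker to the "most recent" end.
        self._images.pop(speaker_key, None)
        self._images[speaker_key] = image
        # Drop the speakers who spoke longest ago.
        while len(self._images) > self.limit:
            self._images.popitem(last=False)

    def images(self):
        """Return the stored images, most recent speaker first."""
        return list(reversed(self._images.values()))
```

Each time a new utterance is detected, `update` would be called with the speaker's key and the freshly cut-out image.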

The panorama image and the one or more speaker images may be sent individually to the information recording application 41, or the meeting device 60 may create a single image from them and send it to the information recording application 41. In this embodiment, it is assumed that the panorama image and the one or more speaker images are transmitted individually from the meeting device 60 to the information recording application 41.

FIG. 6 is a diagram explaining the imaging range of the meeting device 60. As shown in FIG. 6(a), the meeting device 60 captures a 360° range in the horizontal direction. As shown in FIG. 6(b), the meeting device 60 takes the horizontal direction at the height of the meeting device 60 as 0° and uses a predetermined angle above and below it as the imaging range.

FIG. 7 is a diagram explaining the panorama image and the cutting out of the speaker image. As shown in FIG. 7, the image captured by the meeting device 60 forms a portion 110 of a sphere and therefore has a three-dimensional shape. As shown in FIG. 6(b), the meeting device 60 divides the angle of view at every predetermined vertical angle and every predetermined horizontal angle and performs perspective projection transformation on each division. Performing the perspective projection transformation over the entire 360° in the horizontal direction without gaps yields a predetermined number of planar images, and connecting these planar images side by side yields the panorama image 111. The meeting device 60 also performs face detection on the panorama image within a predetermined range centered on the audio direction and cuts out 15° to the left and right of the center of the face (30° in total) to create the speaker image 112.
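The cut-out of 15° to each side of the face center can be sketched in terms of pixel columns. The sketch assumes, as a simplification not stated in the publication, that the columns of the panorama image map linearly onto 0° to 360°; it also wraps around the seam where 360° meets 0°. The function name `crop_columns` is hypothetical.

```python
def crop_columns(panorama_width, center_deg, half_width_deg=15.0):
    """Return the pixel columns covering center_deg +/- half_width_deg.

    Assumption (not from the publication): column x of the panorama
    corresponds to the angle 360 * x / panorama_width degrees.
    """
    px_per_deg = panorama_width / 360.0
    start = int(round((center_deg - half_width_deg) * px_per_deg))
    count = int(round(2 * half_width_deg * px_per_deg))
    # The modulo handles a face that straddles the 0/360 degree seam.
    return [(start + i) % panorama_width for i in range(count)]
```

For a 3600-pixel-wide panorama, a face centered at 0° gives columns 3450 through 3599 followed by 0 through 149, which a caller would stitch together into one speaker image.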

<About functions>
Next, the functional configuration of the recorded information creation system 100 will be described with reference to FIG. 8. FIG. 8 is an example of a functional block diagram that divides the functions of the terminal device 10, the meeting device 60, and the information processing system 50 in the recorded information creation system 100 into blocks.

<<Terminal device>>
The information recording application 41 running on the terminal device 10 includes a communication unit 11, an operation reception unit 12, a display control unit 13, an application screen acquisition unit 14, an audio acquisition unit 15, a device communication unit 16, a moving image storage unit 17, an audio data processing unit 18, a recording/playback unit 19, an upload unit 20, an editing processing unit 21, and a search unit 22. Each of these units of the terminal device 10 is a function, or a functioning means, implemented when any of the components shown in FIG. 4 operates in response to instructions from the CPU 501 in accordance with the information recording application 41 loaded from the HD 504 into the RAM 503. The terminal device 10 also has a storage unit 1000 constructed from the HD 504 or the like shown in FIG. 4. An information storage unit 1001 is constructed in the storage unit 1000.

通信部１１は、ネットワークを介して情報処理システム５０と各種の情報を通信する。通信部１１は、例えば、遠隔会議のリストを情報処理システム５０から受信したり、音声データの認識要求を情報処理システム５０に送信したりする。 The communication unit 11 communicates various information with the information processing system 50 via the network. The communication unit 11 receives, for example, a list of teleconferences from the information processing system 50 and transmits a recognition request for voice data to the information processing system 50 .

表示制御部１３は情報記録アプリ４１に設定されている画面遷移にしたがって情報記録アプリ４１においてユーザーインターフェースとなる各種の画面を表示する。操作受付部１２は、情報記録アプリ４１に対する各種の操作を受け付ける。 The display control unit 13 displays various screens serving as user interfaces in the information recording application 41 in accordance with screen transitions set in the information recording application 41 . The operation reception unit 12 receives various operations for the information recording application 41 .

アプリ画面取得部１４は、デスクトップ画面、又は、ユーザーが選択したアプリの画面をＯＳ（Operating System）等から取得する。ユーザーが選択したアプリが遠隔会議アプリ４２の場合、遠隔会議アプリ４２が生成する画面（各拠点の画像、資料の画像等）が得られる。 The application screen acquisition unit 14 acquires a desktop screen or an application screen selected by the user from an OS (Operating System) or the like. When the application selected by the user is the teleconference application 42, a screen generated by the teleconference application 42 (images of each site, images of materials, etc.) is obtained.

The audio acquisition unit 15 acquires the audio that the terminal device 10 outputs from a microphone or an earphone (including audio data received in a teleconference by the teleconference application 42). The audio acquisition unit 15 can acquire the audio even when the output audio is muted. The user does not need to perform any operation for the audio data, such as selecting the teleconference application 42; the audio acquisition unit 15 can acquire the audio that the terminal device 10 can output via the OS or an API (Application Programming Interface). As a result, the audio data that the teleconference application 42 receives from the other site 101 is also acquired. If the teleconference application 42 is not running, or no teleconference is in progress, the information recording application 41 may be unable to acquire audio data. Note that the audio acquired by the audio acquisition unit 15 does not include the audio collected by the terminal device 10; it is only the audio data to be output. This is because the meeting device 60 collects audio separately.

デバイス通信部１６は、ＵＳＢケーブルなどを利用してミーティングデバイス６０と通信する。デバイス通信部１６は、無線ＬＡＮやＢｌｕｅｔｏｏｔｈ（登録商標）等でミーティングデバイス６０と通信してよい。デバイス通信部１６は、パノラマ画像と話者画像をミーティングデバイス６０から受信し、音声取得部１５が取得した音声データをミーティングデバイス６０に送信する。デバイス通信部１６は、ミーティングデバイス６０で合成された音声データを受信する。 The device communication section 16 communicates with the meeting device 60 using a USB cable or the like. The device communication section 16 may communicate with the meeting device 60 via a wireless LAN, Bluetooth (registered trademark), or the like. The device communication section 16 receives the panoramic image and the speaker image from the meeting device 60 and transmits the audio data acquired by the audio acquisition section 15 to the meeting device 60 . The device communication section 16 receives voice data synthesized by the meeting device 60 .

動画保存部１７は、デバイス通信部１６が受信したパノラマ画像と話者画像、及び、アプリ画面取得部１４が取得したアプリの画面を結合し、結合画像を作成する。また、動画保存部１７は繰り返し作成する結合画像を時系列に接続して結合画像動画を作成し、合成された音声データを結合画像動画に結合して音声付きの結合画像動画を作成する。 The moving image storage unit 17 combines the panorama image and the speaker image received by the device communication unit 16 and the application screen acquired by the application screen acquisition unit 14 to create a combined image. In addition, the moving image storage unit 17 creates a combined image moving image by connecting repeatedly created combined images in time series, and combines synthesized audio data with the combined image moving image to create a combined image moving image with sound.

音声データ処理部１８は、結合画像動画に結合された音声データを抽出するか、又は、ミーティングデバイス６０から受信した合成後の音声データの、テキストデータへの変換を情報処理システム５０に要求する。 The audio data processing unit 18 extracts the audio data combined with the combined image moving image, or requests the information processing system 50 to convert the synthesized audio data received from the meeting device 60 into text data.

録画再生部１９は、結合画像動画の再生を行う。結合画像動画は、録画中は端末装置１０に保存され、その後、情報処理システム５０にアップロードされる。 The recording/playback unit 19 plays back the combined image moving image. The combined image moving image is stored in the terminal device 10 during recording, and then uploaded to the information processing system 50 .

アップロード部２０は、遠隔会議が終了すると、結合画像動画を情報処理システム５０に送信する。 The upload unit 20 transmits the combined image moving image to the information processing system 50 when the teleconference ends.

編集処理部２１は、ユーザーの操作に応じて、結合画像動画の編集（一部の削除、つなぎ合わせ等）を実行する。 The editing processing unit 21 edits the combined image moving image (partial deletion, joining, etc.) according to the user's operation.

検索部２２は、キーワードによるテキストデータの検索を受け付け、テキストデータを検索し、検索結果を表示する。 The search unit 22 receives a search for text data using a keyword, searches the text data, and displays the search results.
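A keyword search over the text data could look like the following minimal sketch. It assumes the text data is available as a list of lines and uses a case-insensitive substring match; the publication does not describe the matching rules, and `search_text` is a hypothetical name.

```python
def search_text(lines, keyword):
    """Return (line_number, line) pairs whose text contains the keyword.

    Assumption (not from the publication): matching is a simple
    case-insensitive substring test on each line of the text data.
    """
    key = keyword.casefold()
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if key in line.casefold()
    ]
```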

図９は、情報記憶部１００１が記憶している動画記録情報を示す。動画記録情報は、会議ＩＤ、録画ＩＤ、更新日時、タイトル、アップロード、保存先等の各項目を有している。ユーザーが情報処理システム５０にログインすると、情報記録アプリ４１は情報処理システム５０の会議情報記憶部５００１から会議情報をダウンロードする。会議情報に含まれる会議ＩＤなどが動画記録情報に反映される。図９の動画記録情報は、あるユーザーが操作する端末装置１０が保持するものである。 FIG. 9 shows moving image recording information stored in the information storage unit 1001 . The moving image recording information has items such as meeting ID, recording ID, update date and time, title, upload, and save destination. When the user logs into the information processing system 50 , the information recording application 41 downloads the meeting information from the meeting information storage section 5001 of the information processing system 50 . A meeting ID or the like included in the meeting information is reflected in the moving image recording information. The moving image record information in FIG. 9 is held by the terminal device 10 operated by a certain user.

・会議ＩＤは、開催された遠隔会議を識別する識別情報である。会議ＩＤは、会議管理システム９に遠隔会議の予定が登録された際に採番されるか、又は、情報記録アプリ４１からの要求で情報処理システム５０が採番する。 - The conference ID is identification information for identifying the held remote conference. The conference ID is numbered when a remote conference schedule is registered in the conference management system 9 or is numbered by the information processing system 50 upon request from the information recording application 41 .

- The recording ID is identification information that identifies a combined image moving image recorded in a teleconference. The recording ID is assigned by the meeting device 60, but may instead be assigned by the information recording application 41 or the information processing system 50. Different recording IDs attached to the same conference ID indicate that recording ended partway through the teleconference and was then resumed for some reason.

・更新日時は、結合画像動画が更新された（録画が終了した）日時である。結合画像動画が編集された場合、編集された日時である。 - The updated date and time is the date and time when the combined image moving image was updated (recording ended). When the combined image moving image is edited, it is the date and time when it was edited.

・タイトルは、会議の会議名である。会議管理システム９への会議の登録時に設定されてもよいし、ユーザーが任意に設定してもよい。 • The title is the conference name of the conference. It may be set when the conference is registered in the conference management system 9, or may be arbitrarily set by the user.

・アップロードは、結合画像動画が情報処理システム５０にアップロードされたか否かを示す。 · Upload indicates whether or not the combined image moving image has been uploaded to the information processing system 50 .

・保存先は、ストレージサービスシステム７０において、結合画像動画とテキストデータが保存されている場所（ＵＲＬやファイルパス）を示す。したがって、ユーザーはアップロードされた結合画像動画を任意に閲覧できる。なお、結合画像動画とテキストデータは、例えばＵＲＬに続いて別々のファイル名で保存される。 The storage destination indicates the location (URL or file path) where the combined image moving image and the text data are stored in the storage service system 70 . Therefore, the user can arbitrarily view the uploaded combined image moving image. Note that the combined image moving image and the text data are saved with separate file names following the URL, for example.
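As an illustration only, the items of FIG. 9 could be held in a record such as the following. This is a minimal sketch: the class name, field names, and the helper `recordings_for` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MovingImageRecord:
    """One row of the moving image recording information (field names are hypothetical)."""
    conference_id: str    # identifies the remote conference
    recording_id: str     # identifies one combined image video
    updated_at: datetime  # when recording ended, or when the video was last edited
    title: str            # conference name
    uploaded: bool        # whether the video was uploaded to the information processing system 50
    storage_url: str      # location (URL or file path) in the storage service system 70


def recordings_for(records, conference_id):
    """Return every recording of one conference, oldest first.

    Several recordings can share a conference ID when recording was
    stopped and later resumed during the same conference.
    """
    hits = [r for r in records if r.conference_id == conference_id]
    return sorted(hits, key=lambda r: r.updated_at)
```

Keeping the recording ID separate from the conference ID is what allows one conference to map to several recorded files.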

<<Meeting Device>>
Returning to FIG. 8, the description continues. The meeting device 60 has a communication unit 61, a panorama image creation unit 62, a speaker image creation unit 63, a site determination processing unit 64, a voice synthesis unit 65, a text conversion request unit 66, a device connection unit 67, and a sound collection unit 68. Each of these units is a function, or a functioning means, implemented in hardware by one of the components shown in FIG. 5. However, each unit may instead be realized by one of those components operating on instructions from the CPU 611 in accordance with a program loaded from the ROM 612 into the DRAM 614.

The communication unit 61 transmits and receives various kinds of information to and from the information processing system 50. The communication unit 61 can also communicate with the speech recognition service system and the storage service system 70.

The device connection unit 67 accepts audio input from the terminal device 10. It communicates with the terminal device 10 over a USB cable or the like, and may also communicate over a wireless LAN, Bluetooth (registered trademark), or the like. The meeting device 60 may include a speaker capable of outputting the other site audio B acquired by the device connection unit 67. The device connection unit 67 corresponds to the other site audio input unit 8 in FIG. 2.

The panorama image creation unit 62 creates a panorama image. The speaker image creation unit 63 creates speaker images. These creation methods were described with reference to FIGS. 6 and 7.

The site determination processing unit 64 divides the own site audio A and the other site audio B according to a predetermined rule. Based on the own site sound pressure information A' of the divided own site audio A and the other site sound pressure information B' of the divided other site audio B, it determines the site at which the own site audio A or the other site audio B was uttered. Details are described later.

The voice synthesis unit 65 synthesizes the audio transmitted from the terminal device 10 with the audio collected by the sound collection unit 68. This combines the speech uttered at the other site 101 with the speech uttered at the own site 102.
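The specification does not state how the two audio streams are combined. One common approach, shown here purely as an assumption, is to add the two streams sample by sample and clip the result to the 16-bit range. The function name `synthesize` and the use of 16-bit PCM samples are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def synthesize(other_site_audio, own_site_audio):
    """Mix two sequences of 16-bit PCM samples into one (assumed mixing rule).

    The shorter input is padded with silence, and each summed sample is
    clipped so that it stays inside the int16 range.
    """
    length = max(len(other_site_audio), len(own_site_audio))
    mixed = []
    for i in range(length):
        b = other_site_audio[i] if i < len(other_site_audio) else 0
        a = own_site_audio[i] if i < len(own_site_audio) else 0
        mixed.append(max(INT16_MIN, min(INT16_MAX, a + b)))
    return mixed
```

With this rule, only a single mixed stream has to be sent for text conversion, rather than one stream per site.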

The sound collection unit 68 converts the audio signal of the own site audio A picked up by the microphone of the meeting device 60 into (digital) audio data. In this way, what the user and participants say at the site on the terminal device 10 side is collected. The own site audio A collected by the sound collection unit 68 is preferably transmitted to the terminal device 10 via the device connection unit 67, which lets the user use the meeting device 60 in the same way as an ordinary external microphone. The sound collection unit 68 (optionally together with the audio acquisition unit 15) corresponds to the own site audio input unit 7 in FIG. 2.

Because the meeting device 60 has both the sound collection unit 68 and the device connection unit 67, it can acquire the own site audio A and the other site audio B separately. The own site sound pressure information A' and the other site sound pressure information B' can therefore also be acquired separately.
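Since the two sound pressure values are available separately, a site determination can be sketched as a comparison of the two. The rule below (the louder channel wins, measured as root mean square) is an assumption made for illustration, because the actual rule is described elsewhere in the specification. The codes 1 and 2 follow the example the specification gives for the site identification information E, and all names are hypothetical.

```python
import math

OWN_SITE = 1    # example code for the own site
OTHER_SITE = 2  # example code for the other site


def sound_pressure(samples):
    """Root mean square of one audio segment; 0.0 for an empty segment."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def determine_site(own_segment, other_segment):
    """Assumed rule: the site whose segment is louder is taken as the speaking site."""
    if sound_pressure(own_segment) >= sound_pressure(other_segment):
        return OWN_SITE
    return OTHER_SITE
```

A real implementation would probably also need a threshold for the case in which both channels are silent. That case is not handled here.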

The text conversion request unit 66 transmits the synthesized speech C to the information processing system 50 via the communication unit 61 and requests real-time speech recognition (conversion into text data).

<<Information Processing System>>
The information processing system 50 has a communication unit 51, an authentication unit 52, a screen generation unit 53, a conference information acquisition unit 54, a voice recognition unit 55, a voice recording unit 56, a voice recognition result recording unit 57, a site determination result recording unit 58, and a conference record confirmation unit 59. Each of these units is a function, or a functioning means, realized by one of the components shown in FIG. 4 operating on instructions from the CPU 501 in accordance with a program loaded from the HD 504 into the RAM 503. The information processing system 50 also has a storage unit 5000 built on the HD 504 or the like shown in FIG. 4. A conference information storage unit 5001 and a recording information storage unit 5002 are built in the storage unit 5000.

The communication unit 51 transmits and receives various kinds of information to and from the terminal device 10. For example, it transmits a list of remote conferences to the terminal device 10 and receives audio data recognition requests from the terminal device 10.

The authentication unit 52 authenticates the user operating the terminal device 10. For example, it authenticates the user according to whether the authentication information (user ID and password) included in the authentication request received by the communication unit 51 matches authentication information held in advance. The authentication information may instead be the card number of an IC card, biometric information such as a face or fingerprint, or the like. The authentication unit 52 may also authenticate through an external authentication system or an authentication method such as OAuth.

The screen generation unit 53 generates the screen information to be displayed by the terminal device 10. When the terminal device 10 runs a native application, the terminal device 10 holds the screen information, and the information to be displayed is transmitted in XML or the like. When the terminal device 10 runs a web application, the screen information is created using HTML, XML, CSS (Cascading Style Sheets), JavaScript (registered trademark), and the like.

The conference information acquisition unit 54 acquires information about remote conferences from the conference management system 9, using either each user's account or a system account assigned to the information processing system 50. It can acquire a list of the remote conferences that a user belonging to the tenant is authorized to view. Because a conference ID is set for each remote conference, the conference ID associates the remote conference with its recorded information.

The voice recognition unit 55 uses an external speech recognition service to convert into text data the synthesized speech C for which the terminal device 10 or the meeting device 60 has requested conversion. The voice recognition unit 55 may also perform the conversion itself.

The voice recording unit 56 records at least the synthesized speech C. The voice recognition result recording unit 57 stores the speech recognition result D. The site determination result recording unit 58 stores the site identification information E. These three units may store their data in the information processing system 50 or in the storage service system 70. That is, they may act as passive storage, or as functions that work together with the recorded information storage unit 7001 of the storage service system 70. This embodiment mainly describes the latter case.

The synthesized speech C, the speech recognition result D, and the site identification information E are stored in divided form and in association with one another. A recognition result character string is a character string generated for each segment of speech recognition. Where speech recognition is segmented is set as appropriate, for example at a period of silence or after a fixed length of time.
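The two example segmentation criteria (a period of silence, or a fixed length of time) can be sketched as follows. The sketch assumes that the audio has already been reduced to one sound level per frame. The function name, the parameter names, and the frame-based representation are hypothetical.

```python
def split_segments(frame_levels, silence_threshold, min_silence_frames, max_frames):
    """Split a sequence of per-frame sound levels into (start, end) index pairs.

    A segment is closed when `min_silence_frames` consecutive frames fall
    below `silence_threshold`, or when the segment reaches `max_frames`
    frames even without silence. `end` is exclusive.
    """
    segments = []
    start = 0
    silent_run = 0
    for i, level in enumerate(frame_levels):
        silent_run = silent_run + 1 if level < silence_threshold else 0
        length = i + 1 - start
        if silent_run >= min_silence_frames or length >= max_frames:
            segments.append((start, i + 1))
            start = i + 1
            silent_run = 0
    if start < len(frame_levels):
        # Whatever remains after the last cut forms the final segment.
        segments.append((start, len(frame_levels)))
    return segments
```

Each resulting segment would then receive its own ID, recognition result character string, and site identification information, so that the three stay associated.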

The conference record confirmation unit 59 acquires the site identification information E, the speech recognition result D, and the synthesized speech C in synchronization and presents them to the user, thereby accepting the user's confirmation of the site determination.

With this configuration, the meeting device 60 can be used in remote conferences held over widely used remote communication systems. It can pick up speech in a natural manner and identify the site where an utterance was made, without using multiple dedicated microphones for sound collection and site identification.

<<Conference Information Storage Unit>>
FIG. 10 shows an example of the conference information stored in the conference information storage unit 5001 and managed by the conference information acquisition unit 54. Using the account described above, the conference information acquisition unit 54 can acquire a list of the remote conferences that the user belonging to the tenant is authorized to view. Although this embodiment uses remote conferences as the example, the list of remote conferences also includes conferences held in only one conference room.

The conference information is managed by conference ID and is associated with participants, title (conference name), start time, end time, location, and so on. These are examples, and the conference information may include other information as well.

- The conference ID is identification information of the conference.

- The tenant ID is identification information of the tenant in which the conference is held.

- The title is the subject or agenda of the conference.

- The organizer is the person who hosts the conference.

- Participants is the list of participants invited to the conference.

- Users with viewing authority is the list of users who can access conference resources, including the combined image video.

- Ad hoc participants is the list of guest participants.

- The location is information about the conference room, such as its name.

- The start time is the time at which the conference is scheduled to start.

- The end time is the time at which the conference is scheduled to end.

- The conference creator is the user ID or the like of the user who registered the conference information.

- The password is the password that participants use to log in to the conference.

- The location is the place where the conference is held, for example a conference room, a branch office name, or a building.

As shown in FIGS. 9 and 10, the conference ID identifies the combined image video recorded in the conference.

FIG. 11 shows the recording information stored in the recording information storage unit 5002. The recording information holds a list of the combined image videos recorded by all users belonging to the tenant. It has items such as conference ID, recording ID, update date and time, title, and storage destination. These items may be the same as those in FIG. 9.

<<Storage Service System>>
The storage service system 70 may be any service system that stores recorded information. The recorded information storage unit 7001 stores the recorded information (combined image video and text data). The structure of the text data stored in the recorded information storage unit 7001 is described with reference to FIG. 12.

FIG. 12 is a diagram illustrating the structure of the text data stored in the storage service system 70. As shown in FIG. 12, the text data associates the items ID, time, recognition result character string, audio data, and site identification information E with one another. The text data is stored in association with the conference ID.

- The ID is identification information assigned when the own site audio A and the other site audio B are divided according to a predetermined rule. The predetermined rule is set in at least one of the meeting device 60 and the speech recognition service system 80. Examples of the rule are: splitting when silence has continued for a certain time; forcibly splitting after a certain time even if there is no silence; and splitting sentence by sentence as detected by morphological analysis.

- time is the utterance time, expressed as the elapsed time from the start of recording. Because the clock time at the start of recording is also saved, the absolute time at which the text was uttered can also be determined.

- The recognition result character string is the portion of the text data obtained by converting the divided synthesized speech C through speech recognition. The synthesized speech C is the audio data from which the recognition result character string was converted.

- The audio data is the synthesized speech C (already divided) obtained by synthesizing the own site audio A and the other site audio B after the site determination.

- The site identification information E is identification information of the site at which the audio data was uttered, determined from the sound pressures in the own site sound pressure information A' and the other site sound pressure information B'. For example, a value of 1 indicates the own site and 2 indicates the other site.

Because time and the recognition result character string are associated in this way, when text data matches a search, the information recording application 41 can play back the combined image video from the playback position corresponding to that time.
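The association between time and the recognition result character string can be illustrated with a small record type and a keyword search. This is a minimal sketch: the class `TextSegment`, its field names, and `find_playback_times` are hypothetical, and simple substring matching stands in for whatever search the application actually performs.

```python
from dataclasses import dataclass


@dataclass
class TextSegment:
    """One row of the text data of FIG. 12 (field names are hypothetical)."""
    id: int       # assigned when the audio is divided
    time: float   # seconds elapsed since recording started
    text: str     # recognition result character string
    audio: bytes  # the divided synthesized speech C
    site: int     # site identification information E (e.g. 1 = own site, 2 = other site)


def find_playback_times(segments, keyword):
    """Return the start times of every segment whose text contains the keyword."""
    return [seg.time for seg in segments if keyword in seg.text]
```

A player could then seek the combined image video to any of the returned offsets.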

<Screen Transitions>
Next, several screens that the terminal device 10 displays during a remote conference are described with reference to FIGS. 13 to 16. FIG. 13 shows the initial screen 200 displayed after login by the information recording application 41 running on the terminal device 10. The user of the terminal device 10 connects the information recording application 41 to the information processing system 50. When the user enters authentication information and logs in successfully, the initial screen 200 of FIG. 13 is displayed.

The initial screen 200 has a fixed display button 201, a front change button 202, a panorama image 203, one or more speaker images 204a to 204c (hereinafter referred to as speaker images 204 when not distinguished), and a recording start button 205. If the meeting device 60 is already running and capturing its surroundings at login, the initial screen 200 displays the panorama image 203 and the speaker images 204 created by the meeting device 60. The user can therefore decide whether to start recording while viewing them. If the meeting device 60 is not running (not capturing images), the panorama image 203 and the speaker images 204 are not displayed.

The information recording application 41 may display speaker images 204 of all participants, based on all faces detected in the panorama image 203, or only the speaker images 204 of the N people who spoke most recently. FIG. 13 shows an example in which up to three speaker images 204 are displayed. Until a participant speaks, no speaker image 204 need be displayed (images are added one at a time as participants speak), or the speaker images 204 of three participants in predetermined directions may be displayed (and swapped out as participants speak).

When no one has spoken, such as immediately after the meeting device 60 starts up, speaker images 204 are created for predetermined directions within the horizontal 360° (for example 0°, 120°, and 240°). If the fixed display described later is set, the fixed display setting takes priority.

The fixed display button 201 lets the user fix a region of the panorama image 203 as a close-up speaker image 204.

The front change button 202 lets the user change the front of the panorama image 203 (the panorama image covers 360° horizontally, so its right and left edges point in the same direction). The user can slide the panorama image 203 left and right with a pointing device to choose which participants appear at the front. The user's operation is transmitted to the meeting device 60, which changes the angle treated as the front within the horizontal 360°, creates the panorama image, and transmits it to the terminal device 10.
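Because the left and right edges of a 360° panorama meet, changing the front amounts to a cyclic shift of the pixel columns. The sketch below assumes the panorama is a list of rows, that column 0 corresponds to 0°, and that the front is the centre column. These conventions and the function name are hypothetical.

```python
def change_front(panorama_rows, front_deg):
    """Cyclically shift each row so the column at `front_deg` lands in the centre.

    `panorama_rows` is a list of equally long rows of pixels covering
    360 degrees horizontally, with column 0 at 0 degrees (assumed layout).
    """
    if not panorama_rows:
        return []
    width = len(panorama_rows[0])
    front_col = int((front_deg % 360) * width / 360) % width
    shift = (front_col - width // 2) % width
    # Columns that leave one edge re-enter at the other edge.
    return [row[shift:] + row[:shift] for row in panorama_rows]
```

For an 8-column row `[0, 1, ..., 7]`, a front of 90° puts column 2 at the centre index 4.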

When the user presses the recording start button 205, the information recording application 41 displays the recording setting screen 210 of FIG. 14.

FIG. 14 is an example of the recording setting screen 210 displayed by the information recording application 41. On the recording setting screen 210, the user can set whether to record (that is, include in the recording) the panorama image and speaker images created by the meeting device 60, and the desktop screen of the terminal device 10 or the screen of a running application. If the information recording application 41 records neither the panorama image and speaker images nor the desktop or application screen, only audio (the audio output by the terminal device 10 plus the audio collected by the meeting device 60) is recorded.

The camera toggle button 211 switches recording of the panorama image and speaker images created by the meeting device 60 on and off. The camera toggle button 211 may also allow recording of the panorama image and the speaker images to be set individually.

The PC screen toggle button 212 switches recording of the desktop screen of the terminal device 10, or of the screen of an application running on the terminal device 10, on and off. When the PC screen toggle button 212 is on, the desktop screen is recorded.

If the user wants to record the screen of an application, the user additionally selects the application in the application selection field 213. The application selection field 213 lists the names of the applications running on the terminal device 10 in a pull-down menu, from which the user can select the application to record. The information recording application 41 obtains these application names from the OS. Of the running applications, it can list only those that have a UI (screen). The selectable applications may include the teleconference application 42, so the information recording application 41 can also record, as video, the materials displayed by the teleconference application 42 and the participants at each site. The other applications listed in the pull-down menu vary and include presentation applications, word processors, spreadsheet applications, and web browsers. The user can therefore flexibly choose which application screens to include in the combined image video.

When recording on a per-application basis, the user can select multiple applications. The information recording application 41 can record the screens of all selected applications.

When the camera toggle button 211 and the PC screen toggle button 212 are both off, the recorded content confirmation window 214 displays "Only audio will be recorded." This audio consists of the audio output by the terminal device 10 (the audio that the teleconference application 42 receives from the other site 101) and the audio collected by the meeting device 60. In other words, as long as a remote conference is in progress, the audio of the teleconference application 42 and the audio of the meeting device 60 are saved regardless of whether images are recorded. However, the user may be able to selectively stop saving the audio of the teleconference application 42 or the audio of the meeting device 60 through user settings.

The combined image video is recorded as follows, depending on the on/off combination of the camera toggle button 211 and the PC screen toggle button 212. The recorded content confirmation window 214 displays the combined image video in real time.

- When the camera toggle button 211 is on and the PC screen toggle button 212 is off, the recorded content confirmation window 214 displays the panorama image and speaker images captured by the meeting device 60.

- When the camera toggle button 211 is off and the PC screen toggle button 212 is on (with a screen already selected), the recorded content confirmation window 214 displays the desktop screen or the screen of the selected application.

- When the camera toggle button 211 and the PC screen toggle button 212 are both on, the recorded content confirmation window 214 displays the panorama image and speaker images captured by the meeting device 60 side by side with the desktop screen or the screen of the selected application.
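The toggle combinations above, together with the audio-only case, can be summarized as a small lookup. The function name and the returned labels are hypothetical. An empty list stands for the case in which only audio is recorded.

```python
def recorded_visuals(camera_on, pc_screen_on):
    """Return the visual sources included in the recording for a toggle combination.

    An empty list means that no image is recorded and only audio is kept.
    """
    sources = []
    if camera_on:
        sources += ["panorama image", "speaker images"]
    if pc_screen_on:
        sources.append("desktop or selected application screen")
    return sources
```

For example, `recorded_visuals(True, True)` lists all three sources, which corresponds to the side-by-side layout.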

Accordingly, there are cases in which some of the panorama image, the speaker images, and the application screen are not recorded, and cases in which none of them is recorded at all. In this embodiment, for convenience, the image created by the information recording application 41 is nevertheless called a combined image video.

In addition, the recording setting screen 210 has a check box 215 accompanied by the message "Automatically transcribe the recording after upload." The recording setting screen 210 also has a "Start recording now" button 216. When the user checks the check box 215, text data converted from the speech uttered during the remote conference is attached to the recorded video. In this case, after recording ends, the information recording application 41 uploads the audio to the information processing system 50 together with a request for conversion into text data. When the user presses the "Start recording now" button 216, the recording-in-progress screen is displayed.

FIG. 15 is an example of the conference list screen 230 displayed by the information recording application 41. The conference list screen 230 is a list of conferences, and it can display a list of the recorded information captured in remote conferences. It is not limited to remote conferences, and also includes conferences held only within a single conference room. The conference list screen 230 displays the conference information in the conference information storage unit 5001 that the logged-in user is authorized to view. The moving image recording information stored in the information storage unit 1001 may additionally be merged into it.

The conference list screen 230 is displayed when the user selects the conference list tab 231 on the initial screen 200 of FIG. 13. The conference list screen 230 displays a list 236 of the recorded information that this user is authorized to view. The conference creator (minutes creator) can set viewing authority for the participants. The conference list may be a list of stored recorded information, or a list of conference schedules or conference data.

The conference list screen 230 has the items check box 232, update date and time 233, title 234, and status 235.

・チェックボックス２３２は録画ファイルの選択を受け付ける。チェックボックス２３２は、ユーザーがまとめて録画ファイルを削除したい場合に使用される。 • Check box 232 accepts selection of a recording file. Check box 232 is used when the user wants to delete recorded files in bulk.

・更新日時２３３は、結合画像動画の録画の開始時と終了時を示す。編集された場合は編集日時でよい。 The update date/time 233 indicates the start and end times of recording of the combined image moving image. If it has been edited, the date and time of editing may be used.

• The title 234 is the title of the conference (agenda, etc.). It may be copied from the conference information or set by the user.

・ステータス２３５は、結合画像動画が情報処理システム５０にアップロード済みか否かを示す。アップロード済みでない場合、「ローカルＰＣ」が表示され、アップロード済みの場合「アップロード済み」が表示される。アップロード済みでない場合、アップロードボタンが表示される。未アップロードの結合画像動画がある場合、ユーザーが情報処理システム５０にログイン時に、情報記録アプリ４１が自動アップロードするとよい。 · The status 235 indicates whether or not the combined image moving image has been uploaded to the information processing system 50 . If it has not been uploaded, "Local PC" is displayed, and if it has been uploaded, "Uploaded" is displayed. If not already uploaded, an upload button will be displayed. If there is an unuploaded combined image moving image, the information recording application 41 may automatically upload it when the user logs into the information processing system 50 .

ユーザーが結合画像動画のリスト２３６から任意のタイトル等をポインティングデバイスで選択すると、情報記録アプリ４１が録画再生画面を表示するが本実施形態では省略する。録画再生画面では、結合画像動画の再生などが可能である。 When the user selects an arbitrary title or the like from the combined image video list 236 with a pointing device, the information recording application 41 displays a recording/playback screen, but this is omitted in this embodiment. On the recording/playback screen, it is possible to play back a combined image moving image.

It is desirable that the user can narrow down the conferences by update date and time, title, keyword, and the like. When many conferences are displayed and the relevant one is hard to find, it is also desirable to provide a search function in which the user enters a word or phrase and the recorded information is narrowed down by the words and phrases contained in the conference remarks, titles, and so on. With the search function, the user can find the desired recorded information in a short time even when a large amount of recorded information has accumulated. On the conference list screen, the user may also be able to sort the conferences by update date and time or by title.

<Example of screen presented by the conference record confirmation unit>
FIG. 16 is an example of a conference record confirmation screen 240 presented by the conference record confirmation unit 59 and displayed by the terminal device 10.

会議記録確認画面２４０は任意の数のセグメント表示部１３０－１～１３０－ｎを有する。以下では、セグメント表示部１３０－１～１３０－ｎのうち、任意のセグメント表示部を単にセグメント表示部１３０という。 The conference record confirmation screen 240 has an arbitrary number of segment display sections 130-1 to 130-n. An arbitrary segment display portion among the segment display portions 130-1 to 130-n will be simply referred to as a segment display portion 130 hereinafter.

Each segment display section 130 has at least a speech recognition result display section 133, a site identification information display section 131, and a synthesized speech reproduction section 132. The speech recognition result display section 133 displays the speech recognition result D. The site identification information display section 131 displays the site identification information E corresponding to the speech recognition result D. The synthesized speech reproduction section 132 has at least a play button for audio playback; by pressing the play button, the user can play the synthesized speech C corresponding to the speech recognition result D displayed in the speech recognition result display section 133. The synthesized speech reproduction section 132 may also provide pause, double-speed playback, a seek bar for changing the playback position, a function for skipping an arbitrary amount of time, and the like.

Displaying the speech recognition result D in association with the corresponding site identification information E helps the user understand the conference record and prevents misunderstandings about which site a remark came from. Because the synthesized speech C can be played back as audio, the user can check the minutes more easily.

In FIG. 16, the site identification information display section 131 and the synthesized speech reproduction section 132 appear on the right side for some segments and on the left side for others; this placement indicates the site. For example, when the remark was made at the own site, the site identification information display section 131 and the synthesized speech reproduction section 132 are displayed on the right side, and when the remark was made at the other site, they are displayed on the left side. This lets the user tell the site at a glance.

音声認識結果表示部１３３、拠点識別情報表示部１３１に表示されている情報は後からユーザーが任意に編集できることが好ましい。また、セグメント表示部１３０－1～nは任意の個所をユーザーの操作により削除できることが好ましい。これにより、音声認識結果に間違いがあった場合や、議事録に不要なテキストデータがあった場合にユーザーは任意に認識結果文字列の修正、削除が可能である。 It is preferable that the information displayed in the speech recognition result display section 133 and the base identification information display section 131 can be edited arbitrarily by the user later. In addition, it is preferable that an arbitrary portion of the segment display portions 130-1 to 130-n can be deleted by the user's operation. This allows the user to arbitrarily modify or delete the recognition result character string when there is an error in the speech recognition result or when there is unnecessary text data in the minutes.

When the user edits (changes) the site identification information display section 131, the display control unit 13 moves the site identification information display section 131 and the synthesized speech reproduction section 132 from left to right, or from right to left, according to the edited site. This change to the screen display may instead be performed by the screen generation unit 53. When the user edits the site identification information E, the conference record confirmation unit 59 receives the edited content from the terminal device 10 and changes the site identification information E of the text data shown in FIG. 12.

<Site determination>
FIG. 17 is a diagram explaining the site determination conditions. The upper and lower graphs in each of FIGS. 17(a), 17(b), and 17(c) show examples of sound pressure information for the own-site audio A and the other-site audio B, respectively. The sound pressure is the amplitude of the waveforms in FIGS. 17(a), (b), and (c), and is obtained by converting the input voltage V_IN into decibels relative to full scale (dBFS) using Equation (1). In other words, the dBFS value is treated as the sound pressure. The graphs in FIGS. 17(a), (b), and (c) were obtained in this way.

Using dBFS allows the sound pressure of audio data to be derived from the signal level (input voltage). The own-site sound pressure information A' and the other-site sound pressure information B' are preferably derived as the effective value (RMS value) of the sound pressure (dBFS), but the peak value may also be used; the most suitable method can be selected for each system.

In the peak-value method, the peak (maximum) value of the audio waveform is used as the representative value of the sound pressure information of a piece of audio data. In the effective-value (RMS) method, the effective value V_IN-rms of the waveform of the input voltage V_IN of a piece of audio is converted into sound pressure (dBFS) using Equation (1) above, and the result is used as the representative value of the sound pressure information. The effective value can be calculated using Equation (2).

When the peak value is used, the determination can go wrong in a case such as FIG. 17(b), where a speaker is talking at the own site and nobody is speaking at the other site, but a sudden loud noise occurs at the other site. The noise dominates the comparison, and the site identification information E may be determined to be the other site. Using the effective value (RMS value) averages the sound pressure and makes the result less susceptible to such sudden sounds.
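The difference between the two representative values can be illustrated with a minimal Python sketch. Equations (1) and (2) are not reproduced in this text, so the sketch assumes the conventional forms: dBFS = 20·log10(V / V_full_scale) and RMS = sqrt(mean(v²)). The function names, the normalized full-scale value of 1.0, and the sample signals are all hypothetical choices made for illustration.

```python
import math

def to_dbfs(level, full_scale=1.0):
    """Convert a linear level to dBFS (assumed form of Equation (1))."""
    if level <= 0:
        return float("-inf")
    return 20.0 * math.log10(level / full_scale)

def peak_dbfs(samples, full_scale=1.0):
    """Representative value based on the peak (maximum absolute) sample."""
    return to_dbfs(max(abs(s) for s in samples), full_scale)

def rms_dbfs(samples, full_scale=1.0):
    """Representative value based on the RMS (assumed form of Equation (2))."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return to_dbfs(rms, full_scale)

# Own site: steady, speech-like tone.
own_site = [0.1 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(1000)]
# Other site: near silence with a single loud click at the end.
other_site = [0.001] * 999 + [0.5]

print("peak A', B':", peak_dbfs(own_site), peak_dbfs(other_site))
print("rms  A', B':", rms_dbfs(own_site), rms_dbfs(other_site))
```

With these made-up signals, the peak comparison favors the other site because of the single click, while the RMS comparison favors the own site, which matches the behavior described for FIG. 17(b).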

Next, examples of the conditions used by the site determination processing unit 64 to determine the site are described. Of the conditions described below, condition 2 is preferable to condition 1, but the conditions are not limited to these, and the most suitable condition can be selected for each system. In the site determination conditions below, the own-site sound pressure information A' is written simply as A' and the other-site sound pressure information B' simply as B'; either the peak value or the effective value (RMS value) described above can be used for each.

<<Site determination condition 1-1>>
(1) The site determination processing unit 64 compares A' and B' and, if A' is greater than B', determines that the utterance was made at the own site.

(2) 拠点判断処理部６４は、Ａ'とＢ'の値を比較して、Ｂ'がＡ'よりも大きければ他拠点で発言があったと判断する。 (2) The site determination processing unit 64 compares the values of A' and B', and determines that the speech was made at another site if B' is greater than A'.

(3) 拠点判断処理部６４は、音圧が同一の場合は「ＮＡ（不明）」と判断する。 (3) When the sound pressures are the same, the site determination processing unit 64 determines "NA (unknown)".

以上の判断例をフローチャート図で表すと図１８のようになる。なお、一般に、ＮＡの場合、拠点識別情報Ｅが不明であるので、認識結果文字列も削除される。ただし、拠点判断処理部６４は、認識結果文字列を残してもよいし、拠点判断条件によってＮＡと判断された認識結果文字列を残すかどうかを切り替えてよい。 FIG. 18 is a flow chart showing the above determination example. In the case of NA, generally speaking, the recognition result character string is also deleted because the base identification information E is unknown. However, the site determination processing unit 64 may leave the recognition result character string, or may switch whether to leave the recognition result character string determined as NA according to the site determination condition.

図１８は、拠点判断条件1－1による拠点の判断方法を説明するフローチャート図の一例である。 FIG. 18 is an example of a flow chart for explaining a method of determining a site based on the site determination condition 1-1.

上記のように、拠点判断処理部６４は、所定の規則で音声を自拠点音声Ａと他拠点音声Ｂを分割し、Ａ'とＢ'の値の大小関係を判断する（Ｓ１０１）。 As described above, the site determination processing unit 64 divides the voice into own site voice A and other site voice B according to a predetermined rule, and determines the magnitude relationship between the values of A' and B' (S101).

拠点判断処理部６４は、Ａ'＞Ｂ'の場合、自拠点で発言があったと判断する（Ｓ１０２）。 If A'>B', the base determination processing unit 64 determines that the speech was made at the own base (S102).

拠点判断処理部６４は、Ａ'＜Ｂ'の場合、他拠点で発言があったと判断する（Ｓ１０３）。 If A'<B', the site determination processing unit 64 determines that the speech was made at another site (S103).

拠点判断処理部６４は、Ａ'＝Ｂ'の場合、不明であると判断する（Ｓ１０４）。 The site determination processing unit 64 determines that it is unknown when A'=B' (S104).
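Steps S101 to S104 of FIG. 18 can be sketched as a small Python function. The function name and the string labels `"own"`, `"other"`, and `"NA"` are hypothetical; the inputs are assumed to be A' and B' already expressed in dBFS.

```python
def judge_site_1_1(a_dbfs, b_dbfs):
    """Site determination condition 1-1 (sketch of FIG. 18).

    a_dbfs: own-site sound pressure information A'
    b_dbfs: other-site sound pressure information B'
    """
    if a_dbfs > b_dbfs:   # S102: own site is louder
        return "own"
    if a_dbfs < b_dbfs:   # S103: other site is louder
        return "other"
    return "NA"           # S104: equal sound pressures, site unknown
```

For example, `judge_site_1_1(-20.0, -35.0)` yields `"own"`, and two equal inputs yield `"NA"`.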

<<Site determination condition 1-2>>
(1) The site determination processing unit 64 compares A' and B' and, if A' is greater than B', determines that the utterance was made at the own site.

(2) The site determination processing unit 64 compares A' and B' and, if B' is greater than or equal to A', determines that the utterance was made at the other site (equal sound pressures are judged as "other site").

According to site determination condition 1-2, A' and B' are compared and the site with the louder audio is judged to be the speaking site, so the speech recognition result and the site identification information E can be matched with high probability. In addition, because the other-site audio is received through the remote communication system, the noise canceling function of the web conferencing system may slightly attenuate it, so the sound pressure actually captured can be lower even when the speakers talk at the same volume. Judging equal sound pressures of A' and B' as the other site under condition 1-2 therefore reduces misjudgment of the site.

図１９は、拠点判断条件1－2による拠点の判断方法を説明するフローチャート図の一例である。 FIG. 19 is an example of a flowchart for explaining a method of determining a site based on the site determination condition 1-2.

上記のように、拠点判断処理部６４は、所定の規則で音声を自拠点音声Ａと他拠点音声Ｂを分割し、Ａ'がＢ'より大きいか判断する（Ｓ１１１）。 As described above, the site determination processing unit 64 divides the audio into own site audio A and other site audio B according to a predetermined rule, and determines whether A' is greater than B' (S111).

拠点判断処理部６４は、Ａ'＞Ｂ'の場合、自拠点で発言があったと判断する（Ｓ１１２）。 If A'>B', the base determination processing unit 64 determines that the speech was made at the own base (S112).

拠点判断処理部６４は、Ａ'≦Ｂ'の場合、他拠点で発言があったと判断する（Ｓ１１３）。 If A'≦B', the site determination processing unit 64 determines that the statement was made at another site (S113).
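Condition 1-2 differs from condition 1-1 only in how a tie is handled. A minimal Python sketch of steps S111 to S113 of FIG. 19 follows; the function name and return labels are hypothetical.

```python
def judge_site_1_2(a_dbfs, b_dbfs):
    """Site determination condition 1-2 (sketch of FIG. 19).

    Ties go to the other site, on the stated grounds that audio received
    through the conferencing system may have been slightly attenuated.
    """
    if a_dbfs > b_dbfs:   # S111 Yes -> S112
        return "own"
    return "other"        # S111 No (A' <= B') -> S113
```

Unlike condition 1-1, this variant never returns an "unknown" result, so every recognized segment receives a site label.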

<<Site determination condition 2>>
(1) If the value of B' is greater than or equal to the noise threshold X and B' is greater than A', the site determination processing unit 64 determines that the utterance was made at the other site.

(2) 拠点判断処理部６４は、Ａ'の値がノイズ閾値Ｘ以上で、かつ、Ａ'がＢ'の値よりも大きい場合は、自拠点で発言があったと判断する。 (2) If the value of A' is equal to or greater than the noise threshold value X and A' is greater than the value of B', the site determination processing unit 64 determines that the speech was made at its own site.

(3) 拠点判断処理部６４は、Ａ'もＢ'のいずれの値もノイズ閾値Ｘ未満であった場合は、両拠点で発言がなかった（N/A（不明））と判断する。 (3) If both the values of A' and B' are less than the noise threshold value X, the site determination processing unit 64 determines that neither site has spoken (N/A (unknown)).

条件２によると、拠点判断処理部６４は、まず、Ａ'とＢ'の大きさとノイズ閾値Ｘを比較し、更にＡ'とＢ'の大きさを比較して大きい方の拠点を拠点識別情報Ｅとして返す。これにより、環境ノイズがある環境で環境ノイズの影響により拠点が誤識別されてしまうことを防止可能である。 According to Condition 2, the site determination processing unit 64 first compares the magnitudes of A' and B' with the noise threshold value X, and then compares the magnitudes of A' and B' and selects the larger site as the site identification information. Return as E. As a result, it is possible to prevent the site from being erroneously identified due to the influence of environmental noise in an environment with environmental noise.

ノイズ閾値Ｘはシステムにより適切な値を選択可能であるが、音圧情報としてデシベルフルスケールのピーク値を用いる場合は－40dBFS程度、音圧情報としてデシベルフルスケールの実効値(RMS値)を用いる場合は－50dBFS程度が好ましい。 An appropriate value for the noise threshold X can be selected depending on the system, but when using the peak value of the decibel full scale as sound pressure information, use about -40 dBFS, and use the effective value (RMS value) of the decibel full scale as the sound pressure information. -50 dBFS is preferable.

FIG. 20 is an example of a flowchart explaining the site determination method based on site determination condition 2.

拠点判断処理部６４は、所定の規則で音声を自拠点音声Ａと他拠点音声Ｂを分割する。拠点判断処理部６４は、Ｂ'の値がノイズ閾値Ｘ以上か否かを判断する（Ｓ１２１）。ステップＳ１２１の判断がＹｅｓの場合、拠点判断処理部６４は、Ｂ'の値がＡ'よりも大きいか否かを判断する（Ｓ１２２）。 The site determination processing unit 64 divides the voice into own site voice A and other site voice B according to a predetermined rule. The site determination processing unit 64 determines whether the value of B' is equal to or greater than the noise threshold value X (S121). If the determination in step S121 is Yes, the base determination processing unit 64 determines whether or not the value of B' is greater than A' (S122).

ステップＳ１２２の判断がＹｅｓの場合、拠点判断処理部６４は、他拠点で発言があったと判断する（Ｓ１２３）。 If the determination in step S122 is Yes, the site determination processing unit 64 determines that there is a statement at another site (S123).

ステップＳ１２１、又はＳ１２２の判断がＮｏの場合、拠点判断処理部６４は、Ａ'の値がノイズ閾値Ｘ以上か否かを判断する（Ｓ１２４）。 If the determination in step S121 or S122 is No, the base determination processing unit 64 determines whether or not the value of A' is equal to or greater than the noise threshold value X (S124).

ステップＳ１２４の判断がＹｅｓの場合、拠点判断処理部６４は、Ａ'の値がＢ'よりも大きいか否かを判断する（Ｓ１２５）。 If the determination in step S124 is Yes, the base determination processing unit 64 determines whether or not the value of A' is greater than B' (S125).

ステップＳ１２５の判断がＹｅｓの場合、拠点判断処理部６４は、自拠点で発言があったと判断する（Ｓ１２６）。 If the determination in step S125 is YES, the site determination processing unit 64 determines that the statement was made at its own site (S126).

ステップＳ１２４、又はＳ１２５の判断がＮｏの場合、拠点判断処理部６４は、拠点が不明であると判断する（Ｓ１２７）。 If the determination in step S124 or S125 is No, the base determination processing unit 64 determines that the base is unknown (S127).

なお、他拠点と自拠点の判断の順番は逆でもよい。 It should be noted that the order of determination between the other sites and the own site may be reversed.
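The flow of FIG. 20 (steps S121 to S127) can be sketched in Python as follows. The function name, the return labels, and the default threshold are assumptions; the default of -50 dBFS is the value the text suggests when RMS values are used, and about -40 dBFS is suggested for peak values.

```python
def judge_site_2(a_dbfs, b_dbfs, noise_threshold=-50.0):
    """Site determination condition 2 (sketch of FIG. 20).

    a_dbfs: own-site sound pressure information A'
    b_dbfs: other-site sound pressure information B'
    noise_threshold: noise threshold X in dBFS
    """
    # S121, S122: other site is above the noise floor and louder -> S123
    if b_dbfs >= noise_threshold and b_dbfs > a_dbfs:
        return "other"
    # S124, S125: own site is above the noise floor and louder -> S126
    if a_dbfs >= noise_threshold and a_dbfs > b_dbfs:
        return "own"
    # S127: both below the threshold, or equal levels -> unknown
    return "NA"
```

Note that, following the flowchart as written, equal values of A' and B' above the threshold also fall through to the "unknown" result. Swapping the two `if` blocks gives the reversed order of determination mentioned above without changing the outcome.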

<<Determination examples>>
Examples of the determinations obtained when site determination conditions 1 and 2 are applied to the sound pressures in FIG. 17 are described below.

In FIG. 17(a), when site determination condition 1-1 is used with the peak value as the sound pressure information, the site identification information E is determined to be the own site.

In FIG. 17(a), when site determination condition 1-2 is used with the effective value (RMS value) as the sound pressure information, the site identification information E is determined to be the own site.

In FIG. 17(a), when the noise threshold is set to -40 dBFS and site determination condition 2 is used with the peak value as the sound pressure information, the site identification information E is determined to be the own site.

In FIG. 17(a), when the noise threshold is set to -50 dBFS and site determination condition 2 is used with the effective value (RMS value) as the sound pressure information, the site identification information E is determined to be the own site.

With either site determination condition 1 or 2, the own site, where the utterance is presumed to have been made, is determined as the site identification information E.

In FIG. 17(b), when site determination condition 1-1 is used with the peak value as the sound pressure information, the site identification information E is determined to be the other site.

In FIG. 17(b), when site determination condition 1-2 is used with the effective value (RMS value) as the sound pressure information, the site identification information E is determined to be the own site.

In FIG. 17(b), when the noise threshold is set to -40 dBFS and site determination condition 2 is used with the peak value as the sound pressure information, the site identification information E is determined to be the other site.

In FIG. 17(b), when the noise threshold is set to -50 dBFS and site determination condition 2 is used with the effective value (RMS value) as the sound pressure information, the site identification information E is determined to be the own site.

Using site determination condition 2 makes the site determination result less susceptible to noise.

In FIG. 17(c), when site determination condition 1-1 is used with the peak value as the sound pressure information, the site identification information E is determined to be the own site.

In FIG. 17(c), when site determination condition 1-2 is used with the effective value (RMS value) as the sound pressure information, the site identification information E is determined to be the own site.

In FIG. 17(c), when the noise threshold is set to -40 dBFS and site determination condition 2 is used with the peak value as the sound pressure information, the site identification information E is determined to be the own site.

In FIG. 17(c), when the noise threshold is set to -50 dBFS and site determination condition 2 is used with the effective value (RMS value) as the sound pressure information, the site identification information E is determined to be N/A (no site identification information).

Using the effective value (RMS value) as the sound pressure information makes the site determination result less susceptible to sudden sounds.

<Association of site identification information E, speech recognition result D, and synthesized speech C>
In FIG. 12, IDs are used to associate the site identification information E, the speech recognition result D, and the synthesized speech C. Possible methods for assigning the same ID to the site identification information E, the speech recognition result D, and the synthesized speech C include a queue method and a time stamp method.

<<Queue method>>
FIG. 21 is a diagram explaining how the information processing system 50 associates the site identification information E, the speech recognition result D, and the synthesized speech C using the queue method. A queue is a data structure in which elements are lined up in the order they arrive and are taken out starting from the element that was put in first. In the queue method, the information processing system 50 sets the same ID for the site identification information E, the speech recognition result D, and the synthesized speech C according to the order in which they were input to the site determination result recording unit 58, the speech recognition result recording unit 57, and the audio recording unit 56, respectively.

The conference record confirmation screen 240 displays the site identification information E-1, the speech recognition result D-1, and the synthesized speech C-1 in association with each other in the segment display section 130-1, which is the first segment display section 130. Likewise, it displays the site identification information E-n, the speech recognition result D-n, and the synthesized speech C-n in association with each other in the segment display section 130-n, which is the n-th segment display section 130.

With the queue method, the site identification information E, the speech recognition result D, and the synthesized speech C can be associated and displayed to the user using a simple algorithm.
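The queue method can be sketched with three first-in, first-out queues from which one item each is taken to form a segment. This is a minimal illustration, not the recording units' actual implementation; the class name, method names, and dictionary keys are hypothetical.

```python
from collections import deque
from itertools import count

class QueueAssociator:
    """Pairs the n-th E, n-th D, and n-th C by arrival order (queue method)."""

    def __init__(self):
        self._sites = deque()   # site identification information E
        self._texts = deque()   # speech recognition results D
        self._audio = deque()   # synthesized speech C
        self._ids = count(1)    # sequential segment IDs

    def push_site(self, site):
        self._sites.append(site)

    def push_text(self, text):
        self._texts.append(text)

    def push_audio(self, audio):
        self._audio.append(audio)

    def pop_segments(self):
        """Return every segment for which E, D, and C have all arrived."""
        segments = []
        while self._sites and self._texts and self._audio:
            segments.append({
                "id": next(self._ids),
                "site": self._sites.popleft(),
                "text": self._texts.popleft(),
                "audio": self._audio.popleft(),
            })
        return segments
```

The simplicity comes with a constraint: the pairing is correct only if each of the three streams delivers its items in the same order and none is dropped.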

<<Time stamp method>>
FIG. 22 is a diagram explaining how the information processing system 50 associates the site identification information E, the speech recognition result D, and the synthesized speech C using the time stamp method. The time stamp method adds the same time stamp, as metadata, to the site identification information E, the speech recognition result D, and the synthesized speech C. A time stamp makes it possible to confirm the identity of data reliably and easily based on the time stamp information F it contains, such as a time or a hash value. The time stamps for the site identification information E and the synthesized speech C are added by the meeting device 60. The time stamp for the speech recognition result D is added by the information processing system 50.

The information processing system 50 sets the same ID for the site identification information E, the speech recognition result D, and the synthesized speech C that have the same time stamp information F. The conference record confirmation screen 240 displays the site identification information E, the speech recognition result D, and the synthesized speech C that have the same time stamp information F in association with each other.

The conference record confirmation screen 240 displays, in the segment display section 130-1, which is the first segment display section 130, the site identification information E-1, the speech recognition result D-1, and the synthesized speech C-1, which are the data having the time stamp information F-1, in association with each other. Likewise, it displays, in the segment display section 130-n, which is the n-th segment display section 130, the site identification information E-n, the speech recognition result D-n, and the synthesized speech C-n, which are the data having the time stamp information F-n, in association with each other.

タイムスタンプ方式は、簡易なアルゴリズムでユーザーに拠点識別情報Ｅと音声認識結果Ｄと合成音声Ｃを対応付けて開示可能である。更に、データの遅延などがあり拠点識別情報Ｅと音声認識結果Ｄと合成音声Ｃが拠点判断結果記録部５８と音声認識結果記録部５７と音声記録部５６に入る順番が入れ替わったとしても確実に拠点識別情報Ｅと音声認識結果Ｄと合成音声Ｃを同期できる。 The time stamp method can disclose to the user the base identification information E, the speech recognition result D, and the synthesized speech C in association with each other using a simple algorithm. Furthermore, even if the order in which the site identification information E, the voice recognition result D, and the synthesized voice C enter the site determination result recording unit 58, the voice recognition result recording unit 57, and the voice recording unit 56 are switched due to data delay, etc., the The base identification information E, the speech recognition result D, and the synthesized speech C can be synchronized.
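The time stamp method can be sketched by grouping items on their time stamp information F and emitting a segment once all three items for a given F have arrived. This is a minimal illustration under assumed names; the class, the field labels, and the use of a plain string as the time stamp key are hypothetical.

```python
from itertools import count

class TimestampAssociator:
    """Groups E, D, and C that carry the same time stamp information F."""

    FIELDS = ("site", "text", "audio")  # E, D, C

    def __init__(self):
        self._pending = {}      # time stamp -> fields received so far
        self._ids = count(1)    # IDs are assigned in completion order

    def add(self, timestamp, field, value):
        """Record one item; return the finished segment once E, D, C are all in."""
        if field not in self.FIELDS:
            raise ValueError(f"unknown field: {field}")
        entry = self._pending.setdefault(timestamp, {})
        entry[field] = value
        if len(entry) < len(self.FIELDS):
            return None
        del self._pending[timestamp]
        return {"id": next(self._ids), "timestamp": timestamp, **entry}
```

Because items are matched on the time stamp rather than on arrival order, a delayed speech recognition result is still attached to the correct site and audio. In this sketch the IDs follow completion order, so a display layer would sort segments by time stamp if chronological order is required.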

The time stamp is preferably generated at the timing when the own-site audio A or the other-site audio B is acquired. However, the meeting device 60 may instead generate the time stamp at the timing when the synthesized speech C is generated, when the own-site sound pressure information A' or the other-site sound pressure information B' is acquired, or when the site identification information E is generated. The generation timing of the time stamp is not limited to these, as long as it allows the site identification information E, the speech recognition result D, and the synthesized speech C to be associated with each other.

<Operation procedure>
Next, the operations and processing performed by the recorded information creation system 100 based on the above configuration are described.

<<Saving the combined image video>>
The processing for saving a combined image video is described with reference to FIG. 23. FIG. 23 is an example of a sequence diagram showing the procedure by which the information recording application 41 records the panorama image, the speaker image, and the application screen. FIG. 23 describes the processing for real-time speech recognition.

S1: The user starts the information recording application 41 on the terminal device 10 and connects the terminal device 10 (information recording application 41) to the information processing system 50. If the access token has expired, the display control unit 13 displays a login screen. The user enters, into the information recording application 41, the authentication information (for example, a user ID and password) for logging in to the tenant. The operation reception unit 12 of the information recording application 41 receives the input.

S2：情報記録アプリ４１の通信部１１が、認証情報を指定してログイン要求を情報処理システム５０に送信する。 S2: The communication unit 11 of the information recording application 41 specifies authentication information and transmits a login request to the information processing system 50 .

S3：情報処理システム５０の通信部５１はログイン要求を受信し、認証部５２が認証情報に基づいてユーザーを認証する。ここでは認証が成功したものとする。情報処理システム５０の通信部５１はアクセストークン１を情報記録アプリ４１に送信する。図では明記しないが、以降、通信部５１は、アクセストークン１を情報処理システム５０との通信に添付する。アクセストークン１にはログインしたユーザーの権限が対応付けられている。 S3: The communication unit 51 of the information processing system 50 receives the login request, and the authentication unit 52 authenticates the user based on the authentication information. Assume here that the authentication has succeeded. The communication unit 51 of the information processing system 50 transmits the access token 1 to the information recording application 41 . Although not shown in the drawing, the communication unit 51 attaches the access token 1 to communications with the information processing system 50 thereafter. The access token 1 is associated with the authority of the logged-in user.

S4: Similarly, the user also logs into the storage service system 70, because the recorded information is stored in the storage service system 70. The user enters authentication information (for example, a user ID and password) for logging into the storage service system 70. The operation reception unit 12 of the information recording application 41 receives the input.

S5: The communication unit 11 of the information recording application 41 transmits a login request specifying the authentication information to the information processing system 50.

S6: The communication unit 51 of the information processing system 50 receives the login request and, because it is a login request for the storage service system 70, forwards it to the storage service system 70.

S7: The storage service system 70 authenticates the user based on the authentication information. Authentication is assumed to succeed here. The storage service system 70 transmits access token 2 to the information processing system 50.

S8: The communication unit 51 of the information processing system 50 receives access token 2 and transmits it to the information recording application 41. Although not shown in the figure, the communication unit 51 thereafter attaches access token 2 to communications with the storage service system 70. Access token 2 is associated with the authority of the logged-in user.

S21: Next, the user operates the teleconference application 42 to start a teleconference. Here, it is assumed that the teleconference applications 42 at the own site 102 and the other site 101 have started a teleconference. The teleconference application 42 at the own site 102 transmits the images captured by the camera of the terminal device 10 and the audio collected by its microphone to the teleconference application 42 at the other site 101. The teleconference application 42 at the other site 101 displays the received images on a display and outputs the received audio from a speaker. Likewise, the teleconference application 42 at the other site 101 transmits the images captured by the camera of its terminal device 10 and the audio collected by its microphone to the teleconference application 42 at the own site 102. The teleconference application 42 at the own site 102 displays the received images on the display and outputs the received audio from the speaker. Each teleconference application 42 repeats this to carry out the teleconference.

S22: The user makes recording-related settings on the recording setting screen 210 of the information recording application 41 shown in FIG. 16. The operation reception unit 12 of the information recording application 41 receives the settings. Here, it is assumed that the camera toggle button 211 and the PC screen toggle button 212 are both on.

If the user has reserved a teleconference in advance, the user can display a list of teleconferences as shown in FIG. 15 and select the teleconference with which the combined image video is to be associated. Because the user has already logged into the information processing system 50, the information processing system 50 identifies the teleconferences that the logged-in user is authorized to view. The information processing system 50 transmits the list of identified teleconferences to the terminal device 10, and the user selects a teleconference that is in progress or is yet to be held. Information about the teleconference, such as the conference ID, is thereby determined.

Even if the user has not reserved a teleconference in advance, the user can create a conference when creating the combined image video. The following describes the case in which the information recording application 41 creates a conference when creating the combined image video and acquires the conference ID from the information processing system 50.

S23: The user instructs the information recording application 41 to start recording (using the "start recording now" button 216). The operation reception unit 12 of the information recording application 41 receives the instruction. The display control unit 13 displays the recording-in-progress screen.

S24: Because no teleconference has been selected (that is, the conference ID has not been determined), the communication unit 11 of the information recording application 41 transmits a teleconference creation request to the information processing system 50.

S25: The communication unit 51 of the information processing system 50 receives the teleconference creation request, the conference information acquisition unit 54 acquires a unique conference ID assigned by the conference management system 9, and the communication unit 51 transmits the conference ID to the information recording application 41.

S26: The conference information acquisition unit 54 also transmits the storage destination of the combined image video (the URL of the storage service system 70) to the information recording application 41 via the communication unit 51.

S27: When the communication unit 11 of the information recording application 41 has received the conference ID and the storage destination of the recording file, the video storage unit 17 determines that recording is ready and starts recording.
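The start condition in step S27 can be sketched as follows. This is a minimal illustration only, assuming that the conference ID and the storage destination may arrive in either order; the names `RecordingSession`, `on_conference_id`, and `on_storage_url` are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordingSession:
    """Hypothetical state held by the video storage unit before recording."""
    conference_id: Optional[str] = None
    storage_url: Optional[str] = None
    recording: bool = False

    def on_conference_id(self, conference_id: str) -> None:
        # Called when the conference ID is received (step S25).
        self.conference_id = conference_id
        self._maybe_start()

    def on_storage_url(self, storage_url: str) -> None:
        # Called when the storage destination is received (step S26).
        self.storage_url = storage_url
        self._maybe_start()

    def _maybe_start(self) -> None:
        # Recording is considered ready only when both values are known.
        if self.conference_id is not None and self.storage_url is not None:
            self.recording = True
```

In this sketch, recording does not start until both values have been received, which mirrors the readiness check described in step S27.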

S28: The application screen acquisition unit 14 of the information recording application 41 requests the screen of the application selected by the user from that application (more precisely, the application screen acquisition unit 14 acquires the application screen via the OS). In FIG. 23, the application selected by the user is assumed to be the teleconference application 42.

S29: The video storage unit 17 of the information recording application 41 notifies the meeting device 60 of the start of recording via the device communication unit 16. With this notification, the video storage unit 17 may also indicate that the camera toggle button 211 is on (that is, request the panorama image and the speaker image). The meeting device 60 transmits the panorama image and the speaker image to the information recording application 41 regardless of whether such a request is made.

S30: When the device connection unit 67 of the meeting device 60 receives the recording start notification, it assigns a unique recording ID and returns the recording ID to the information recording application 41. The recording ID may instead be assigned by the information recording application 41 or acquired from the information processing system 50.

S31: The teleconference service system 90 repeatedly transmits the audio data and image data sent from the other site to the teleconference application.

S32: The audio acquisition unit 15 of the information recording application 41 acquires the audio data output by the terminal device 10 (the audio data received by the teleconference application 42).

S33: The device communication unit 16 transmits the audio data acquired by the audio acquisition unit 15, together with a synthesis request, to the meeting device 60.

S34: The device connection unit 67 of the meeting device 60 receives the audio data and the synthesis request. The sound collection unit 68 constantly collects the surrounding sound. The site determination processing unit 64 divides the other-site speech B received by the device connection unit 67 and the own-site speech A collected by the sound collection unit 68 according to a predetermined rule, and determines the site based on the sound pressure of each.
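The site determination in step S34 can be illustrated with a small sketch. It assumes, purely for illustration, that the "predetermined rule" is a fixed segment length and that the stream with the higher RMS level in a segment identifies the speaking site; the function names `rms`, `split_fixed`, and `judge_sites` are hypothetical, and the actual rule used by the site determination processing unit 64 may differ.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a segment, used here as a proxy for sound pressure."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def split_fixed(samples: Sequence[float], segment_len: int) -> List[Sequence[float]]:
    """Split audio into fixed-length segments (one possible dividing rule)."""
    return [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]


def judge_sites(own_a: Sequence[float], other_b: Sequence[float],
                segment_len: int) -> List[str]:
    """Label each segment 'own' or 'other' by comparing RMS levels.

    Ties are labeled 'own' (an arbitrary assumption). Segments beyond the
    end of the shorter stream are ignored because zip() stops early.
    """
    labels = []
    for seg_a, seg_b in zip(split_fixed(own_a, segment_len),
                            split_fixed(other_b, segment_len)):
        labels.append("own" if rms(seg_a) >= rms(seg_b) else "other")
    return labels
```

The list of labels returned by `judge_sites` plays the role of the site identification information E in this sketch, with one label per segment.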

S35: Next, the speech synthesis unit 65 synthesizes the own-site speech A (the surrounding audio data) collected by the sound collection unit 68 and the other-site speech B received by the device connection unit 67. The synthesized speech C is therefore generated in its divided (segmented) state. For example, the speech synthesis unit 65 adds the own-site speech A and the other-site speech B together. Because clear audio from around the meeting device 60 is recorded, the accuracy of converting speech to text improves, particularly for speech around the meeting device 60 (on the conference room side).

This speech synthesis can also be performed by the terminal device 10. However, distributing the recording function to the terminal device 10 and the audio processing to the meeting device 60 reduces the load on both the terminal device 10 and the meeting device 60. Alternatively, the recording function may be placed on the meeting device 60 and the audio processing on the terminal device 10.
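The addition mentioned in step S35 can be sketched for a single segment as follows. This is only an illustration: it assumes 16-bit signed PCM samples, and the clipping to the 16-bit range and the zero-padding of the shorter segment are assumptions that the embodiment does not specify. The function name `mix_segment` is hypothetical.

```python
from typing import List, Sequence


def mix_segment(seg_a: Sequence[int], seg_b: Sequence[int],
                limit: int = 32767) -> List[int]:
    """Add own-site segment A and other-site segment B sample by sample.

    The shorter segment is treated as padded with silence (zeros), and
    sums are clipped to the signed 16-bit range [-limit - 1, limit].
    """
    length = max(len(seg_a), len(seg_b))
    mixed = []
    for i in range(length):
        a = seg_a[i] if i < len(seg_a) else 0
        b = seg_b[i] if i < len(seg_b) else 0
        mixed.append(max(-limit - 1, min(limit, a + b)))
    return mixed
```

Applying `mix_segment` to each pair of segments yields the synthesized speech in its segmented form, so the per-segment site labels stay aligned with the mixed audio.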

S36: The text conversion request unit 66 of the meeting device 60 transmits a speech recognition request (the segmented synthesized speech C) and the site identification information E to the information processing system 50 via the communication unit 61.

S37: The communication unit 51 of the information processing system 50 receives the speech recognition request (the segmented synthesized speech C) and the site identification information E, and the speech recognition unit 55 transmits the speech recognition request (the segmented synthesized speech C) to the speech recognition service system and obtains a recognition result character string.

S38: The information processing system 50 returns the recognition result character string, the audio data, and the site identification information E to the information recording application 41. So that the information processing system 50 can return these to the information recording application 41, the meeting device 60 attaches its own identification information in step S36. In addition, the information recording application 41 registers, in advance, the IP address of the terminal device 10 and the identification information acquired from the meeting device 60 in the information processing system 50. This allows the information processing system 50 to identify the terminal device 10 from the identification information of the meeting device 60.
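The lookup described in step S38 amounts to a mapping from the meeting device's identification information to the terminal's address. A minimal sketch, assuming an in-memory dictionary and hypothetical names (`TerminalRegistry`, `register`, `resolve`), might look like this:

```python
from typing import Dict, Optional


class TerminalRegistry:
    """Maps a meeting device's identification information to a terminal IP address."""

    def __init__(self) -> None:
        self._terminal_by_device: Dict[str, str] = {}

    def register(self, device_id: str, terminal_ip: str) -> None:
        # Registered in advance by the information recording application (assumed call).
        self._terminal_by_device[device_id] = terminal_ip

    def resolve(self, device_id: str) -> Optional[str]:
        # Returns None when no terminal has been registered for this device.
        return self._terminal_by_device.get(device_id)
```

With such a registry, the identification information attached in step S36 is sufficient to decide where the recognition result should be returned.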

S39: The site determination result recording unit 58, the speech recognition result recording unit 57, and the audio recording unit 56 of the information processing system 50 store, via the communication unit 51, the recognition result character string, the audio data, and the site identification information E in the same storage destination as the combined image video. The conference ID is attached to each of these.

S40: In addition, the panorama image creation unit 62 of the meeting device 60 creates a panorama image, and the speaker image creation unit 63 creates a speaker image.

S41: The device communication unit 16 of the information recording application 41 repeatedly acquires the panorama image and the speaker image from the meeting device 60. The device communication unit 16 also repeatedly requests and acquires the synthesized audio data from the meeting device 60. These acquisitions may be made in response to requests from the device communication unit 16 to the meeting device 60. Alternatively, the meeting device 60, having been notified that the camera toggle button 211 is on, may transmit the panorama image and the speaker image automatically. Likewise, the meeting device 60, having received the audio data synthesis request, may automatically transmit the synthesized audio data to the information recording application 41.

S42: The display control unit 13 of the information recording application 41 displays the application screen, the panorama image, and the speaker image side by side on the recording-in-progress screen 220. The video storage unit 17 of the information recording application 41 combines the application screen acquired from the teleconference application 42, the panorama image, and the speaker image and saves them as a combined image video. That is, the video storage unit 17 combines the repeatedly received application screen, panorama image, and speaker image into a combined image, and creates the combined image video by assigning each combined image to a frame of the video. The video storage unit 17 also stores the audio data received from the meeting device 60.
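The frame-combining step in S42 can be illustrated with a toy example. The sketch below represents each image as a list of pixel rows and simply stacks the three images vertically, padding narrower rows; this layout, the padding value, and the function name `stack_vertical` are assumptions for illustration, not the arrangement actually used by the video storage unit 17.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values (toy representation)


def stack_vertical(images: List[Image], fill: int = 0) -> Image:
    """Stack images top to bottom, right-padding rows to the widest image."""
    width = max((len(row) for img in images for row in img), default=0)
    combined: Image = []
    for img in images:
        for row in img:
            combined.append(list(row) + [fill] * (width - len(row)))
    return combined
```

Calling `stack_vertical([panorama, speaker, app_screen])` once per capture interval would yield one combined image per video frame in this simplified model.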

The information recording application 41 repeats steps S32 to S42 above.

S43: When the teleconference ends and recording is no longer needed, the user instructs the information recording application 41 to end recording (for example, with the recording end button 227). The operation reception unit 12 of the information recording application 41 receives the instruction.

S44: The device communication unit 16 of the information recording application 41 notifies the meeting device 60 of the end of recording. The meeting device 60 continues to create panorama images and speaker images and to synthesize speech. However, the meeting device 60 may change its processing load depending on whether recording is in progress, for example by changing the resolution or the frame rate (fps).

S45: The video storage unit 17 of the information recording application 41 combines the audio data with the combined image video to create a combined image video with audio. If none of the panorama image, the speaker image, and the application screen is saved, the audio data may be kept as an independent file.

S46: The upload unit 20 of the information recording application 41 stores the combined image video in its storage destination via the communication unit 11. In the recording information storage unit 5002, the combined image video is associated with the conference ID and the recording ID. The combined image video is recorded as uploaded.

Because the user is notified of the storage destination, the user can share the combined image video with the participants by informing them of the storage destination by e-mail or the like. Even if the combined image video, the audio data, and the text data are created by different devices, they can be collected and stored in one storage location, where the user and others can easily view them later.

Note that steps S32 to S42 need not be performed in the order shown in FIG. 23; the synthesis of the audio data and the saving of the combined image video may occur in either order.

<<Speech recognition after the end of recording>>
Next, the processing procedure for performing speech recognition after the end of recording is described with reference to FIG. 24. FIG. 24 is an example of a sequence diagram showing the procedure by which the information recording application 41 records the panorama image, the speaker image, and the application screen. The description of FIG. 24 focuses on the differences from FIG. 23. Steps S1 to S35 may be the same as steps S1 to S35 in FIG. 23. However, steps S36 to S39, in which the meeting device 60 requests speech recognition from the information processing system 50, are absent.

S51: The device connection unit 67 of the meeting device 60 transmits the panorama image, the speaker image, the audio data (the synthesized speech C), and the site identification information E to the information recording application 41. At this point, therefore, no recognition result character string exists yet, but the audio data has already been segmented.

S52: The device communication unit 16 of the information recording application 41 repeatedly acquires the panorama image, the speaker image, the audio data, and the site identification information E from the meeting device 60. The display control unit 13 of the information recording application 41 displays the application screen, the panorama image, and the speaker image side by side on the recording-in-progress screen 220.

S53, S54: The recording end processing may be the same as steps S43 and S44 in FIG. 23.

S55: If the user has checked the check box 215 associated with "Automatically transcribe after uploading the recording" on the recording setting screen 210, the audio data processing unit 18 requests the information processing system 50 to convert the audio data into text data. Specifically, the audio data processing unit 18 transmits, via the communication unit 11, a request to convert the audio data combined with the combined image video (the segmented synthesized speech C) to the information processing system 50, specifying the URL of the storage destination and the site identification information E, together with the conference ID and the recording ID.
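The items carried by the conversion request in step S55 can be pictured as a small structured payload. The sketch below assumes a JSON encoding and hypothetical field names (`storage_url`, `site_identification`, `conference_id`, `recording_id`); the embodiment does not specify the wire format.

```python
import json
from typing import List


def build_conversion_request(storage_url: str, site_labels: List[str],
                             conference_id: str, recording_id: str) -> str:
    """Serialize a hypothetical speech-to-text conversion request as JSON."""
    payload = {
        "storage_url": storage_url,          # where the recording is stored
        "site_identification": site_labels,  # one label per audio segment
        "conference_id": conference_id,
        "recording_id": recording_id,
    }
    return json.dumps(payload, sort_keys=True)
```

Sending the per-segment site labels along with the request lets the results be stored with the same site information as in the real-time case.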

S56: The communication unit 51 of the information processing system 50 receives the audio data conversion request, and the speech recognition unit 55 uses the speech recognition service system 80 to convert the audio data into text data.

S57: The communication unit 51 acquires the recognition result character string from the speech recognition service system.

S58: The site determination result recording unit 58, the speech recognition result recording unit 57, and the audio recording unit 56 store, via the communication unit 51, the recognition result character string, the audio data (the segmented synthesized speech C), and the site identification information E in the same storage destination as the combined image video.

S59: The video storage unit 17 of the information recording application 41 combines the audio data with the combined image video to create a combined image video with audio. If none of the panorama image, the speaker image, and the application screen is saved, the audio data may be kept as an independent file.

S60: The upload unit 20 of the information recording application 41 stores the combined image video in its storage destination via the communication unit 11. In the recording information storage unit 5002, the combined image video is associated with the conference ID and the recording ID. The combined image video is recorded as uploaded.

In this way, the site can be determined even when the information recording application 41 performs speech recognition after the end of recording.

<<Editing the site identification information>>
FIG. 25 is an example of a sequence diagram illustrating the processing by which the user edits the site identification information E.

S71: The user connects the terminal device 10 to the information processing system 50 and displays the conference list screen. The conference list screen shows a list of the conferences that the logged-in user is authorized to view.

S72: The user selects a conference and inputs an operation to display the conference record confirmation screen 240. The operation reception unit 12 receives the operation.

S73: The communication unit 11 of the terminal device 10 requests the conference record confirmation screen 240 from the information processing system 50, specifying the conference ID.

S74, S75: The communication unit 51 of the information processing system 50 receives the request for the conference record confirmation screen 240 and acquires the recognition result character string, the audio data, and the site identification information E identified by the conference ID from the storage service system 70.

S76: The conference record confirmation unit 59 of the information processing system 50 generates the conference record confirmation screen 240 using the recognition result character string, the audio data, and the site identification information E. The conference record confirmation unit 59 places each recognition result character string toward the left side or the right side of the conference record confirmation screen 240 according to the site identification information E.
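The left and right placement in step S76 resembles a chat-style transcript. As a rough text-only sketch, assuming that own-site utterances go on the left and other-site utterances on the right (the embodiment does not say which side is which), and using the hypothetical function name `layout_transcript`:

```python
from typing import List, Tuple


def layout_transcript(entries: List[Tuple[str, str]], width: int = 40) -> List[str]:
    """Left-align own-site text and right-align other-site text.

    Each entry is (recognized_text, site_label), where site_label is
    'own' or 'other'. Text longer than `width` is left unchanged.
    """
    lines = []
    for text, site in entries:
        lines.append(text.ljust(width) if site == "own" else text.rjust(width))
    return lines
```

For example, `layout_transcript([("Hello", "own"), ("Hi", "other")], width=8)` places "Hello" at the left edge and "Hi" at the right edge of an eight-character line.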

S77: The communication unit 51 of the information processing system 50 transmits the screen information of the conference record confirmation screen 240 to the terminal device 10. The communication unit 11 of the terminal device 10 receives the screen information of the conference record confirmation screen 240, and the display control unit 13 displays it.

S78: When the user changes, for example, the site identification information E, the operation reception unit 12 receives the change.

S79: The communication unit 11 of the terminal device 10 transmits the edit result to the information processing system 50, specifying the conference ID.

S80: The communication unit 51 of the information processing system 50 receives the edit result, and the site determination result recording unit 58 changes the site identification information E of the text data in the storage service system 70 according to the edit.
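The update in step S80 can be sketched as replacing the site label of one stored transcript entry. The record layout (a list of dictionaries with `text` and `site` keys) and the function name `apply_site_edit` are assumptions made only for this illustration.

```python
from typing import Dict, List


def apply_site_edit(records: List[Dict[str, str]], index: int,
                    new_site: str) -> List[Dict[str, str]]:
    """Return a copy of the records with the site label of one entry replaced."""
    if not 0 <= index < len(records):
        raise IndexError("no transcript entry at this index")
    updated = [dict(record) for record in records]  # leave the input untouched
    updated[index]["site"] = new_site
    return updated
```

Returning a copy rather than mutating the input is a design choice of this sketch; a real implementation would write the change back to the storage service system 70.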

<Main effects>
As described above, the combined image video displays and records the panorama image of the surroundings including the user, the speaker image, and the application screen displayed during the teleconference. Because speech recognition is performed on the synthesized speech C, the processing load on the speech recognition service system can be reduced compared with recognizing each speech separately. In addition, because the site at which each utterance in the synthesized speech C was made is determined based on sound pressure information, it is possible to record at which site the audio data was spoken.

<Other application examples>
Although the best mode for carrying out the present invention has been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

For example, in the present embodiment, it is determined only whether speech comes from the own site or from another site; however, if a site ID is attached to the speech of another site, it is also possible to record which of the other sites the speech came from.

In the present embodiment, the panorama image, the speaker image, and the application screen are combined and saved as a combined image video, but they may instead be saved as separate videos. In that case, the information recording application 41 arranges the panorama image, the speaker image, and the application screen on the playback screen when playing back the videos.

The terminal device 10 and the meeting device 60 may be integrated, or the meeting device 60 may be externally attached to the terminal device 10. The meeting device 60 may also be an omnidirectional camera, a microphone, and a speaker connected by cables.

A meeting device 60 may also be placed at the other site 101. The other site 101 then separately uses its meeting device 60 to create a combined image video and text data. A plurality of meeting devices 60 may also be placed at one site. In that case, recorded information is created for each meeting device 60, resulting in a plurality of sets of recorded information.

The arrangement of the panorama image 203, the speaker image 204, and the application screen in the combined image video used in the present embodiment is merely an example. The panorama image 203 may be at the bottom and the speaker image 204 at the top, the user may change the arrangement, and during playback the panorama image 203 and the speaker image 204 may be individually switched between shown and hidden.

The configuration examples such as FIG. 8 are divided according to main functions in order to facilitate understanding of the processing performed by the terminal device 10, the meeting device 60, and the information processing system 50. The present invention is not limited by how the processing units are divided or by their names. The processing of the terminal device 10, the meeting device 60, and the information processing system 50 can be divided into more processing units according to the processing content. The processing can also be divided so that one processing unit includes more processing.

The group of devices described in the embodiment merely represents one of a plurality of computing environments for implementing the embodiments disclosed herein. In one embodiment, the information processing system 50 includes a plurality of computing devices, such as a server cluster. The computing devices are configured to communicate with one another via any type of communication link, including a network or shared memory, and perform the processing disclosed herein.

Furthermore, the information processing system 50 can be configured to share the disclosed processing steps, for example those of FIG. 23, in various combinations. For example, a process executed by a given unit may be executed by a plurality of information processing apparatuses included in the information processing system 50. The information processing system 50 may be consolidated into a single server apparatus or divided among a plurality of apparatuses.

Each function of the embodiments described above can be implemented by one or more processing circuits. Here, a "processing circuit" in this specification includes a processor programmed by software to execute each function, such as a processor implemented by electronic circuitry, as well as devices designed to execute each function described above, such as an ASIC (Application Specific Integrated Circuit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), and conventional circuit modules.

10 Terminal device
50 Information processing system
60 Meeting device
100 Record information creation system

Japanese Translation of PCT International Application Publication No. 2014-206896

Claims

1. A speech processing system in which a terminal device and a device equipped with a microphone communicate, the speech processing system comprising:
a speech synthesis unit that synthesizes a first speech received by the terminal device from outside and a second speech collected by the device to generate a synthesized speech; and
a text conversion request unit that requests an external destination to convert the synthesized speech generated by the speech synthesis unit into text data.

2. The speech processing system according to claim 1, wherein
the terminal device receives the first speech from a terminal device at another site via a network, and
the speech processing system further comprises a site determination processing unit that, before the speech synthesis unit synthesizes the first speech and the second speech, determines, based on the sound pressures of the first speech and the second speech, the site at which the synthesized speech obtained by synthesizing the first speech and the second speech was uttered.

3. The speech processing system according to claim 2, wherein
the site determination processing unit divides the first speech and the second speech according to a predetermined rule,
determines the site based on the sound pressures of the divided first speech and second speech, and
the synthesized speech obtained by synthesizing the first speech and the second speech after the site is determined, the recognition result character string obtained by converting the synthesized speech, and the site identification information determined by the site determination processing unit are stored in association with one another.

4. The speech processing system according to claim 2 or 3, wherein the site determination processing unit determines that the site at which the speech having the greater sound pressure, of the first speech and the second speech, was uttered is the site at which the synthesized speech obtained by synthesizing the first speech and the second speech was uttered.

5. The speech processing system according to claim 2, wherein the site determination processing unit:
determines that the synthesized speech was uttered at another site when the sound pressure of the first speech is equal to or greater than a threshold and is greater than the sound pressure of the second speech;
determines that the synthesized speech was uttered at its own site when the sound pressure of the second speech is equal to or greater than a threshold and is greater than the sound pressure of the first speech; and
determines that the site at which the synthesized speech was uttered is unknown when the sound pressure of the first speech and the sound pressure of the second speech are less than a threshold.
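
For illustration only, the threshold comparison recited in the preceding claim might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the root-mean-square level of a segment is used here as a stand-in for "sound pressure", a single threshold is shared by all three conditions, and the function names and the labels `"other_site"`, `"own_site"`, and `"unknown"` are hypothetical. The claim does not state what happens when both sound pressures are equal and at or above the threshold; the sketch treats that case as unknown.

```python
import math
from typing import Sequence


def sound_pressure_rms(samples: Sequence[float]) -> float:
    """RMS level of one speech segment (assumed stand-in for 'sound pressure')."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def determine_site(first: Sequence[float],
                   second: Sequence[float],
                   threshold: float) -> str:
    """Decide where a segment was uttered.

    first  -- segment of the first speech (received from another site)
    second -- segment of the second speech (collected by the local device)
    """
    p1 = sound_pressure_rms(first)
    p2 = sound_pressure_rms(second)
    # First speech is loud enough and louder than the second: another site.
    if p1 >= threshold and p1 > p2:
        return "other_site"
    # Second speech is loud enough and louder than the first: own site.
    if p2 >= threshold and p2 > p1:
        return "own_site"
    # Both below the threshold (or, by assumption, an exact tie): unknown.
    return "unknown"
```

In this sketch the decision is made per divided segment, before the two segments are synthesized, so that the resulting label can later be stored with the synthesized speech.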

6. The speech processing system according to claim 2, wherein
the synthesized speech, the recognition result character string obtained by converting the synthesized speech, and the site identification information determined by the site determination processing unit are stored on a network in association with one another,
the speech processing system further comprises an information processing system that communicates with the terminal device,
the information processing system includes a conference record confirmation unit that provides the terminal device with a screen displaying, in association with each other, the recognition result character string converted from the synthesized speech and the site identification information determined by the site determination processing unit, and
the conference record confirmation unit changes the arrangement of the recognition result character string on the screen according to the site identification information.

7. The speech processing system according to claim 6, wherein
the conference record confirmation unit arranges, for each divided synthesized speech, a button for reproducing the synthesized speech in association with the recognition result character string and the site identification information, and
when the button is pressed, the terminal device reproduces the synthesized speech corresponding to the button.

8. The speech processing system according to claim 6, wherein the terminal device includes:
an operation receiving unit that receives editing of the site identification information; and
a communication unit that transmits the edited site identification information to the information processing system.

9. The speech processing system according to any one of claims 1 to 8, wherein
the terminal device creates record information in which a screen acquired from an application running on the terminal device and a surrounding image acquired by the device are combined, and
text data obtained by converting the synthesized speech by speech recognition is stored in association with the record information.

10. The speech processing system according to claim 3, wherein
a time stamp is added to each of the synthesized speech, the recognition result character string obtained by converting the synthesized speech, and the site identification information determined by the site determination processing unit, and
the synthesized speech, the recognition result character string, and the site identification information are stored in association with one another based on the time stamp.
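
As a hedged illustration of the time-stamp-based association recited in the preceding claim, the three kinds of data can be joined on a shared time stamp. The sketch assumes integer millisecond time stamps that match exactly across the three stores; the `MinutesEntry` type, the dictionary layout, and the function name are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MinutesEntry:
    timestamp_ms: int   # time stamp shared by the three items
    audio: bytes        # synthesized speech segment
    text: str           # recognition result character string
    site: str           # site identification information


def associate_by_timestamp(audio_by_ts: Dict[int, bytes],
                           text_by_ts: Dict[int, str],
                           site_by_ts: Dict[int, str]) -> List[MinutesEntry]:
    """Join the three stores on time stamps present in all of them."""
    # dict key views support set intersection.
    common = sorted(audio_by_ts.keys() & text_by_ts.keys() & site_by_ts.keys())
    return [
        MinutesEntry(ts, audio_by_ts[ts], text_by_ts[ts], site_by_ts[ts])
        for ts in common
    ]
```

Time stamps missing from any one store are skipped in this sketch; a real system might instead tolerate small offsets between the stores.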

11. A device equipped with a microphone and capable of communicating with a terminal device, the device comprising:
a speech synthesis unit that synthesizes a first speech received by the terminal device from outside and a second speech collected by the device to generate a synthesized speech; and
a text conversion request unit that requests an external destination to convert the synthesized speech generated by the speech synthesis unit into text data.

12. A speech processing method performed by a speech processing system in which a terminal device and a device equipped with a microphone communicate, the speech processing method comprising:
a step in which a speech synthesis unit synthesizes a first speech received by the terminal device from outside and a second speech collected by the device to generate a synthesized speech; and
a step in which a text conversion request unit requests an external destination to convert the synthesized speech generated by the speech synthesis unit into text data.
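
The two steps of the method can be sketched as follows, purely as an illustration. The claims do not specify how the two speeches are synthesized; this sketch assumes sample-wise addition of normalized samples in the range [-1.0, 1.0] at a common sampling rate, with clipping. The external conversion service is represented by a caller-supplied function, `request_text_conversion`, which is a hypothetical placeholder rather than any real API.

```python
from typing import Callable, List, Sequence


def mix_speech(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Synthesize two speeches by sample-wise addition (assumed method)."""
    n = max(len(first), len(second))
    mixed = []
    for i in range(n):
        a = first[i] if i < len(first) else 0.0    # pad the shorter input
        b = second[i] if i < len(second) else 0.0
        mixed.append(max(-1.0, min(1.0, a + b)))   # clip to [-1.0, 1.0]
    return mixed


def process_speech(first: Sequence[float],
                   second: Sequence[float],
                   request_text_conversion: Callable[[List[float]], str]) -> str:
    """Synthesize first, then issue a single conversion request."""
    synthesized = mix_speech(first, second)
    return request_text_conversion(synthesized)
```

Because the two speeches are synthesized before the request is made, only one conversion request is issued per segment rather than one per speech, which is in line with the reduced processing load that the abstract states as the aim.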