JP2022103490A

JP2022103490A - Information processing device, information processing method, and computer program

Info

Publication number: JP2022103490A
Application number: JP2020218157A
Authority: JP
Inventors: 彰彦佐野; Akihiko Sano
Original assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Current assignee: Canon Marketing Japan Inc; Canon IT Solutions Inc
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2022-07-08

Abstract

To provide a technique that converts outgoing and incoming voice data into text and compares the two transcriptions, making it possible to determine delay and degradation of data caused by the network and Web-RTC and to notify the user. SOLUTION: An information processing device has a communication screen. The information processing device accepts voice data input, determines whether the received voice data is delayed or degraded, and notifies the user according to the determination result. SELECTED DRAWING: Figure 4

Description

The present invention relates to voice analysis and real-time communication.

Systems for communicating between remote locations are now widespread. For example, at a construction site, on-site workers can use information processing terminals such as smartphones and tablets to carry out their work while communicating with supervisors and operators at remote locations.
Communication takes place by transmitting each party's video and audio, but depending on the communication environment, the video or audio data may be delayed or degraded. To cope with such delay and degradation, there are techniques that aim for smooth communication by converting voice into text, for example those of Patent Document 1 and Patent Document 2.

Patent Document 1: Japanese Unexamined Patent Publication No. 2020-88818
Patent Document 2: Japanese Translation of PCT Application Publication No. 2016-529839

Patent Document 1 discloses a speech-to-text technique that converts a call between a calling terminal and a receiving terminal into text and displays the converted text on both terminals, so that the text of the call content matches on both sides.
Patent Document 2 describes a technique for maintaining communication in voice communication by converting the voice into text, which has a smaller data size than voice data, when the communication channel is congested.

However, the invention described in Patent Document 1 only makes the text of the call content match between the calling side and the receiving side. The invention described in Patent Document 2 can maintain communication by converting voice into small-sized text when the communication channel is congested, but the people who are communicating still cannot know how much data degradation is being caused by the network or Web-RTC.

Converting voice into text lets the user check the displayed voice text to see what was said even when the voice data is delayed, but it does not reveal how much the voice data is actually delayed.
For example, when a user talks to the other party about what is visible in the video, and the amount of voice delay is unknown, the other party cannot tell which point in the video the user is talking about, and miscommunication results.

The present invention has been made in view of the above problems, and its object is to provide a mechanism that makes it possible to recognize the state of data delay and degradation in communication that includes voice.

To solve the above problems, the present invention is an information processing device having a communication screen, comprising: input means for accepting input of voice data; creation means for converting the voice data accepted by the input means into text; comparison means for comparing the sender-side and receiver-side voice text data created by the creation means; and determination means for determining, based on the comparison result of the comparison means, whether delay and degradation have occurred.

According to the present invention, it is possible to recognize the state of data delay and degradation in communication that includes voice.

FIG. 1 is a diagram showing an example of the overall configuration of the communication system in the embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the hardware configuration of the smartphone 101, the tablet terminal 102, the client PC 111, and the like.
FIG. 3 is a sequence diagram showing an outline of voice transmission and reception in a voice communication system using a Web-RTC system.
FIG. 4 is a flowchart showing an example of the client communication processing performed between a speaker and a listener in the embodiment of the present invention.
FIG. 5 is a diagram showing the flow of the processing during communication within the communication processing loop of FIG. 4.
FIG. 6 is a diagram showing the flow of the delay determination registration process within the processing during communication of FIG. 5.
FIG. 7 is a flowchart showing an example of the delay determination acceptance notification process of the delay determination system in the embodiment of the present invention.
FIG. 8 is a flowchart showing an example of the delay determination process within the delay determination acceptance notification process of FIG. 7.
FIG. 9 is a diagram showing an example of how the delay determination system holds its data in the embodiment of the present invention.
FIG. 10 is an image of the screen layout of the mobile communication application in the normal state in the embodiment of the present invention.
FIG. 11 is an image of the screen layout of the mobile communication application when user 1 (oneself) is delayed in the embodiment of the present invention.
FIG. 12 is an image of the screen layout of the mobile communication application when user 2 (the other party) is delayed in the embodiment of the present invention.
FIG. 13 is an image of the screen layout of the PC communication application in the normal state in the embodiment of the present invention.
FIG. 14 is an image of the screen layout of the PC communication application when user 1 (oneself) is delayed in the embodiment of the present invention.
FIG. 15 is an image of the screen layout of the PC communication application when user 2 (the other party) is delayed in the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a diagram showing an example of the overall configuration of: a mobile communication application (hereinafter, mobile app) and a PC communication application (hereinafter, PC app), with which users communicate with each other and which display notifications based on a delay determination made by converting voice into text; a communication system server that controls the communication in the cloud; a delay determination system server that performs the delay determination; and a SpeechToText system server that converts voice into text. In this embodiment these servers are described as being configured on the cloud, but they do not have to be.
Here, "users" refers to the speaker side and the listener side who carry out the communication.

For convenience of explanation, the speaker side who carries out the communication is hereinafter referred to as user 1, and the listener side who receives the speaker's communication is referred to as user 2.
These roles are assigned only for convenience; the two parties can of course swap positions while communicating.

User 1 can use the mobile app 100 on the smartphone 101 or the tablet terminal 102 by installing it on that device.

The mobile app 100 is connected, via the Internet 130, to the communication system server 121, the delay determination system server 122, and the AI/DL SpeechToText system server 123 of the cloud system 120, and to the PC communication application on the client PC of user 2.

User 2 has the PC app 110 and the client PC 111. The PC app 110 runs on the PC once it is installed on the client PC 111. The PC app is connected, via the Internet 130, to the communication system server 121, the delay determination system server 122, and the AI/DL SpeechToText system server 123 of the cloud system 120, and to the mobile app on the smartphone 101 or the tablet terminal 102 of user 1.

In the above embodiment, user 1 uses the mobile app and user 2 uses the PC app, but the applications allow the speaker and listener roles to be reversed during communication. The same mechanism also works when both parties use mobile apps or both use PC apps. Although a one-to-one communication is described as the example, the same mechanism can also operate for communication among multiple people.

The cloud system consists of the communication system server 121, the delay determination system server 122, and the AI/DL SpeechToText system server 123. The servers are connected to one another via the Internet 130.

The data 140 includes voice data 141, voice text data 142, and delay determination result data 143. The voice data is the audio streaming data input from an input device. The voice text data is the voice data converted so that it can be handled as text data. The delay determination result data is the result of a delay determination made by comparing voice text data and its timestamps, and it is stored in the delay determination management table 920.
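The relationship between the voice text data 142 and the delay determination result data 143 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and field names, the `determine_delay` function, and the one-second threshold are all assumptions introduced here, since the text above only says that voice text data and its timestamps are compared.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class VoiceTextData:
    """Voice text data 142: one transcribed sentence plus its timestamp."""
    text: str
    timestamp: datetime


@dataclass
class DelayDeterminationResult:
    """Delay determination result data 143 (field names are assumptions)."""
    delay: timedelta
    delayed: bool


def determine_delay(sent, received, threshold=timedelta(seconds=1)):
    """Compare the sender-side and receiver-side timestamps of one sentence.

    The threshold is a hypothetical parameter, not a value from the source.
    """
    delay = received.timestamp - sent.timestamp
    return DelayDeterminationResult(delay=delay, delayed=delay > threshold)


sent = VoiceTextData("Hello", datetime(2020, 12, 28, 14, 0, 0))
received = VoiceTextData("Hello", datetime(2020, 12, 28, 14, 0, 3))
result = determine_delay(sent, received)
print(result.delayed, result.delay.total_seconds())
```

In this sketch the sender and receiver clocks are assumed to be comparable; a real system would have to account for clock differences between terminals.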

The above description focuses on conveying voice, but the system is an example of a general communication system in which video can be exchanged together with voice.
This concludes the description of FIG. 1.

Next, with reference to FIG. 2, an example of a hardware configuration applicable to the client terminal in the embodiment of the present invention will be described.

In FIG. 2, 201 is a CPU, which comprehensively controls each device and controller connected to the system bus 204. The ROM 202 or the external memory 211 stores the BIOS (Basic Input/Output System), which is the control program of the CPU 201, the operating system (hereinafter, OS), and the various programs described later that are needed to realize the functions executed by each server or each PC.

203 is a RAM, which functions as the main memory, work area, and the like of the CPU 201. When executing a process, the CPU 201 loads the necessary programs from the ROM 202 or the external memory 211 into the RAM 203 and executes the loaded programs to realize various operations.

205 is an input controller, which controls input from the input device 209. Examples of the input device 209 include a keyboard, a touch panel, and a pointing device such as a mouse.

When the input device 209 is a touch panel, the user can give various instructions by pressing (touching with a finger or the like) an icon, cursor, or button displayed on the touch panel.

The touch panel may also be one that can detect positions touched by multiple fingers, such as a multi-touch screen.

206 is a video controller, which controls display on a display device such as the display 210. The display 210 refers to a display device such as a CRT or a liquid crystal display, but it may be another type of display device.

The display of a notebook personal computer integrated with the main body is also included. The external output device is not limited to a display and may be, for example, a projector.

When the device to which the hardware configuration of FIG. 2 is applied can accept the touch operations described above, the display 210 also provides the function of the input device 209.

207 is a memory controller, which controls access to the external memory 211, such as a hard disk (HDD) that stores a boot program, various applications, font data, user files, edit files, and various data, a flexible disk (FD), or a CompactFlash (registered trademark) memory connected to a PCMCIA card slot via an adapter.

208 is a communication I/F controller, which connects to and communicates with external devices via a network and executes communication control processing on the network. For example, communication using TCP/IP is possible.

The CPU 201 enables display on the display 210 by, for example, executing outline font expansion (rasterization) into the display information area in the RAM 203. The CPU 201 also enables user instructions with a mouse cursor or the like (not shown) on the display 210.

The various programs described later for realizing the present invention are recorded in the ROM 202 or the external memory 211 and are executed by the CPU 201 after being loaded into the RAM 203 as needed.

The definition files, various information, tables, and the like used when executing these programs are also stored in the external memory 211; they are described in detail later.

The camera controller 212 controls video acquisition by the camera 214 and the like, and the audio controller 213 controls audio acquisition by the microphone/speaker 215 and the like and output to the speaker.

The camera 214 is a camera built into the device to which the hardware configuration of FIG. 2 is applied. Likewise, the microphone/speaker 215 is a microphone and speaker built into that device. This concludes the description of FIG. 2.

Next, with reference to FIG. 3, voice transmission and reception in a general communication system such as a Web-RTC system will be briefly described.

In step S301, an utterance by the speaker is detected.

In step S302, the content of the utterance detected in step S301 is acquired as voice data.

In step S303, the mobile app transmits the voice data acquired by the microphone to the Web-RTC system on the cloud as Web-RTC voice data.

In step S304, the Web-RTC system on the cloud accepts the Web-RTC voice data from the mobile app and relays it to the PC app on the listener side.

In step S305, the PC app outputs the voice data received from the Web-RTC system from an output device of the PC 111, such as a speaker or earphones.

In step S306, the listener hears the voice uttered by the speaker as sound from the output device, such as a speaker or earphones.

During this series of processes, voice delay and degradation can occur due to various factors such as bandwidth conditions. As a result, when the voice that the speaker put into the microphone is output from the listener's output device, parts of it drop out, as illustrated at 309 in FIG. 3.

If the bandwidth is normal, when the speaker says into the microphone at 307, "Hello, I will work at the site in Tokyo from 14:00.", the listener hears from the speaker at 308, "Hello, I will work at the site in Tokyo from 14:00."

However, if the bandwidth is poor, parts become inaudible or are skipped, and the listener hears something like "Hello, I will work at the ?ite in Toki from 1? o'clock."
The present application is intended to enable smooth communication even in such a state.
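The idea of comparing what was said with what was heard can be sketched with a simple text similarity score. This is only an illustration under stated assumptions: the comparison method is not specified at this point in the description, so Python's standard `difflib.SequenceMatcher` is used here as a stand-in, and the function name `text_similarity` and the sample received text are hypothetical.

```python
from difflib import SequenceMatcher


def text_similarity(sent_text, received_text):
    """Return a similarity ratio in [0, 1]; 1.0 means the texts are identical."""
    return SequenceMatcher(None, sent_text, received_text).ratio()


# Transcription on the speaker side (what was actually said).
sent_text = "Hello, I will work at the site in Tokyo from 14:00."
# Transcription on the listener side (parts dropped out in transit).
received_text = "Hello, I will work at the ite in Tok from 1:00."

score = text_similarity(sent_text, received_text)
degraded = score < 1.0  # any mismatch is treated as degradation in this sketch
print(f"similarity={score:.2f} degraded={degraded}")
```

A practical system would likely tolerate small transcription differences rather than flag every mismatch, so the strict `score < 1.0` test here is a simplification.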

This concludes the description of FIG. 3.

FIG. 4 is a diagram showing the overall flow of voice-to-text conversion, delay determination using the voice text data 142, and notification of the delay determination result in the embodiment of the present invention.

In FIG. 4, the mobile app and the PC app are collectively described as the client app, and the smartphone and the PC are collectively described as the client terminal. The processing flow of FIG. 4 is described below.

In step S401, the client app accepts a launch instruction through the user's client terminal and starts up.

In step S402, a login instruction is accepted for the client app started in step S401, and a general multi-tenant login process is performed.

Next, in step S403, the voice data 141 input from the client terminal or from a voice input device such as a microphone connected to the client terminal is acquired.

In step S404, an instruction to start communication is accepted from the client terminal, and the communication start processing of a general communication system is performed.

In step S405, a communication start request is made to the communication system. The communication start request requires an API key, a connection destination ID, and video/audio data as request data. The API key is text data that serves as the key for using the communication system as an API. The connection destination ID is the ID needed to connect the parties who will communicate. Parties who want to communicate with each other must specify the same connection destination ID.
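The request data of step S405 can be sketched as a small payload builder. The field names (`api_key`, `connection_id`, `media`) and both helper functions are hypothetical; the source only states that the request carries an API key, a connection destination ID, and video/audio data, and that both parties must specify the same connection destination ID.

```python
def build_start_request(api_key, connection_id, media):
    """Build the (hypothetical) payload of a communication start request."""
    if not api_key or not connection_id:
        raise ValueError("api_key and connection_id are required")
    return {"api_key": api_key, "connection_id": connection_id, "media": media}


def can_connect(request_a, request_b):
    """Two parties are joined only when their connection destination IDs match."""
    return request_a["connection_id"] == request_b["connection_id"]


media = {"audio": True, "video": True}  # placeholder for the media streams
user1_request = build_start_request("key-user1", "site-tokyo", media)
user2_request = build_start_request("key-user2", "site-tokyo", media)
other_request = build_start_request("key-user3", "site-osaka", media)

print(can_connect(user1_request, user2_request))
print(can_connect(user1_request, other_request))
```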

In step S406, the request of step S405 is approved and the communication process starts.

In step S407, once communication has started in step S406, the communication processing loop is entered.

In step S408, within the communication processing loop entered in step S407, three processes are executed in parallel: transmitting the local voice data of user 1 (oneself), receiving the voice data from user 2 (the other party) and outputting it to the user, and receiving notifications from the delay determination system.

The detailed processing of step S408 is described next with reference to FIG. 5. In the processing of FIG. 5, the three processes, namely transmitting the local voice data of user 1, receiving the other party's voice data and outputting it to the user, and receiving notifications from the delay determination system, each run as a loop in parallel. They run in parallel because multiple events can occur simultaneously during communication, and handling those events in parallel is what makes real-time communication possible.
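The three concurrent processes of step S408 can be sketched with Python's `asyncio`. This is an assumed structure for illustration only: the coroutine names are hypothetical, each body is a placeholder for the real work, and a single pass is shown instead of a continuous loop.

```python
import asyncio


async def send_local_audio(log):
    """Placeholder for sending user 1's local voice data (S511 to S513)."""
    await asyncio.sleep(0)
    log.append("sent local audio")


async def receive_remote_audio(log):
    """Placeholder for receiving and outputting user 2's voice data (S501 to S503)."""
    await asyncio.sleep(0)
    log.append("received remote audio")


async def wait_for_delay_notice(log):
    """Placeholder for waiting on the delay determination system (S521 to S523)."""
    await asyncio.sleep(0)
    log.append("checked delay notice")


async def communication_loop_iteration():
    log = []
    # The three processes run concurrently; none blocks the others.
    await asyncio.gather(
        send_local_audio(log),
        receive_remote_audio(log),
        wait_for_delay_notice(log),
    )
    return log


log = asyncio.run(communication_loop_iteration())
print(log)
```

Cooperative tasks are one way to get this concurrency; threads or separate event handlers would serve the same purpose.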

After the communication processing loop is entered in step S407, in step S501 the video/audio data of the other party is received from the communication system.

In step S502, the received voice data of the other party is input to the delay determination registration process.

In step S503, the delay determination registration process is performed on the voice data input in step S502. The details of the processing performed in step S503 are described with reference to FIG. 6.

The delay determination registration process of step S503 is described next with reference to FIG. 6.
In the delay determination registration process, the sent or received voice data is examined for continuity of the voice and for boundaries of meaning as sentences, and is divided into sentence-delimited voice data. This process is performed so that the delay determination process shown in FIG. 8 can be carried out.

In step S601, the voice data is divided into sentences so that natural language processing can be applied to it. Putting the voice data into a state in which it can be saved as separate files also makes the text conversion easier.
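One simple way to approximate the "continuity of the voice" criterion of step S601 is to split on sustained silence. The sketch below is an assumption made for illustration, not the method of the source: `split_on_silence`, the amplitude threshold, and the minimum silence length are hypothetical, and the semantic sentence-boundary judgment mentioned above is not modeled.

```python
def split_on_silence(samples, silence_threshold=0.05, min_silence_len=3):
    """Split amplitude samples into segments separated by sustained silence.

    Silent samples are discarded; a pause shorter than min_silence_len
    does not end the current segment.
    """
    segments, current, quiet_run = [], [], 0
    for sample in samples:
        if abs(sample) < silence_threshold:
            quiet_run += 1
            if quiet_run >= min_silence_len and current:
                segments.append(current)  # a long pause closes the segment
                current = []
        else:
            quiet_run = 0
            current.append(sample)
    if current:
        segments.append(current)
    return segments


samples = [0.5, 0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.7, 0.8]
segments = split_on_silence(samples)
print(segments)
```

Here the single quiet sample after `0.6` is treated as a short pause inside one sentence, while the run of three quiet samples is treated as a sentence boundary.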

In step S602, the loop over the sentence-delimited voice data starts.

In step S603, the sentence-delimited voice data is transmitted to the SpeechToText system server 123 in order to obtain voice text data, that is, the voice data converted into text.

The request in step S603 requires an API key and the voice data as request data.

In step S604, the voice text data received from the SpeechToText system is transmitted to the delay determination system so that the delay determination system can analyze it. The request in step S604 requires the following as request data: the communication ID, the voice text data, the registrant, whether the data is from the sending side or the receiving side, the speaker, the speaking time, and the registration time.
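The request data listed for step S604 can be sketched as a typed record. The class name, field names, the `"send"`/`"receive"` encoding, and the use of ISO 8601 strings for the times are assumptions for illustration; the source only lists which items the request carries.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class DelayRegistrationRequest:
    """Hypothetical shape of the S604 request sent to the delay determination system."""
    communication_id: str
    text: str
    registrant: str
    direction: str  # "send" or "receive" (encoding is an assumption)
    speaker: str
    spoken_at: datetime
    registered_at: datetime

    def __post_init__(self):
        if self.direction not in ("send", "receive"):
            raise ValueError("direction must be 'send' or 'receive'")

    def to_payload(self):
        """Return a plain dict with the times serialized as ISO 8601 strings."""
        data = asdict(self)
        data["spoken_at"] = self.spoken_at.isoformat()
        data["registered_at"] = self.registered_at.isoformat()
        return data


request = DelayRegistrationRequest(
    communication_id="comm-1",
    text="Hello, I will work at the site in Tokyo from 14:00.",
    registrant="user2",
    direction="receive",
    speaker="user1",
    spoken_at=datetime(2020, 12, 28, 14, 0, 0),
    registered_at=datetime(2020, 12, 28, 14, 0, 4),
)
payload = request.to_payload()
print(payload["direction"], payload["spoken_at"])
```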

SpeechToText is a known technique that extracts features from the acquired voice data, performs language processing analysis, and transcribes the content of the voice data as text.

When the processing of steps S603 and S604 has finished for all of the voice data divided into sentences in step S601, the delay determination registration process shown in FIG. 6 ends.

When the processing up to step S503 has finished, the processing during communication ends and the process proceeds to step S409.

In step S511, the voice data input by user 1 (oneself) is acquired from the video/audio device.

In step S512, the voice data is input to the delay determination registration process.

In step S513, the delay determination registration process is performed on the voice data input in step S512. This is the same process as that performed in step S503 described above.
When the processing up to step S513 has finished, the processing during communication ends and the process proceeds to step S409.

In step S521, the client app waits until a delay notification arrives from the delay determination system. Throughout the communication, it stays on standby in a state in which it can receive notifications from the delay determination system at any time.

In step S522, it is determined whether a delay notification has arrived. If one has arrived (the determination result is Yes), the process proceeds to step S523. If none has arrived (the determination result is No), the process returns to step S521 and waits until a notification arrives.
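The wait-and-check loop of steps S521 and S522 can be sketched with an `asyncio.Queue` that stands in for the notification channel. The channel type, the polling interval, the function names, and the notification fields are all assumptions; the source does not say how notifications are delivered.

```python
import asyncio


async def wait_for_delay_notification(queue, poll_interval=0.01):
    """Loop between S521 (wait) and S522 (check) until a notification arrives."""
    while True:
        try:
            return queue.get_nowait()  # S522: Yes, hand the notice on to S523
        except asyncio.QueueEmpty:
            await asyncio.sleep(poll_interval)  # S522: No, return to S521


async def demo():
    queue = asyncio.Queue()

    async def delay_determination_system():
        # Simulated server that pushes one notification after a short wait.
        await asyncio.sleep(0.03)
        await queue.put({"user": "user1", "delay_seconds": 3})

    server = asyncio.create_task(delay_determination_system())
    notice = await wait_for_delay_notification(queue)
    await server
    return notice


notice = asyncio.run(demo())
print(notice)
```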

ステップＳ５２３では、遅延通知を受信するとクライアントアプリはクライアント端末上の画面に通知内容を表示し、ユーザに遅延通知を行う。これによりユーザは自分の音声が実際にどれだけ遅延や劣化して相手に伝わっているかを理解することができ、遅延や劣化を考慮したコミュニケーションを行うことで相手とのコミュニケーションを円滑に行うことができる。 In step S523, upon receiving the delay notification, the client application displays the notification content on the screen of the client terminal and notifies the user of the delay. This lets the user understand how much their own voice is actually delayed or deteriorated by the time it reaches the other party, and communicating with that delay or deterioration in mind makes communication with the other party smoother.
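The wait-and-display loop of steps S521 to S523 can be sketched in Python as follows. This is only an illustration under stated assumptions: the in-process `queue.Queue` standing in for the channel from the delay determination system, and the names `wait_and_display`, `inbox`, `display` and `poll_seconds`, are hypothetical and do not appear in the specification.

```python
import queue


def wait_and_display(inbox, display, poll_seconds=0.1):
    """Block until a delay notification arrives, then show it once."""
    while True:
        try:
            # S521: wait for a notification from the delay determination system.
            notice = inbox.get(timeout=poll_seconds)
        except queue.Empty:
            # S522 "No": nothing arrived yet, so return to waiting.
            continue
        # S522 "Yes" -> S523: display the notification content to the user.
        display(notice)
        return notice
```

For example, a client could pass a function that draws the pop-up as `display`; the loop itself only decides when that function is called.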

ステップＳ５２３まで処理が終了すると、コミュニケーション中の処理を終了して、ステップＳ４０９へ進む。 When the processing is completed up to step S523, the processing during communication is terminated and the process proceeds to step S409.

ステップＳ４０９では、ログアウト指示を受け付ける。 In step S409, a logout instruction is accepted.

ステップＳ４１０では、アプリで終了指示を受け付ける。 In step S410, the application receives an end instruction.

以上で図５のコミュニケーション中の処理のフローチャートの説明を終了する。 This is the end of the explanation of the flowchart of the processing during communication in FIG. 5.

続けて、図７を用いて遅延判断受付通知の説明を行う。図７では、遅延判断システムの遅延判断のための全体動作フローの説明を行う。 Subsequently, the delay determination acceptance notification will be described with reference to FIG. 7. FIG. 7 describes an overall operation flow for delay determination of the delay determination system.

ステップＳ７０１では、遅延判断システムは、ステップS６０４でクライアントアプリから送信された遅延判断登録要求を受付ける。 In step S701, the delay determination system accepts the delay determination registration request transmitted from the client application in step S604.

ステップＳ７０２では、受け付けた遅延判断登録内容を図９に示す音声テキスト管理テーブル９００に格納する。音声テキスト管理テーブル９００は、遅延判断登録内容であるテキストＩＤ９０１と、コミュニケーションＩＤ、音声テキストの登録者である登録者９０３、送信であるか受信であるかを管理する送受信９０４、音声の発言者が誰であるかを管理する音声の発言者９０５、発言していた時間を管理する９０７、音声のテキストデータを管理する場所であるテキストデータ９０８を管理するテーブルである。 In step S702, the received delay determination registration content is stored in the voice text management table 900 shown in FIG. 9. The voice text management table 900 manages a text ID 901 for the delay determination registration content; a communication ID; a registrant 903, who registered the voice text; a transmission/reception 904, which indicates whether the entry is from the transmitting side or the receiving side; a voice speaker 905, which indicates who spoke; an item 907, which manages the time at which the utterance was made; and text data 908, which holds the text of the voice.
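One row of the voice text management table 900 might be modelled as below. This is a sketch only: the Python field names, the `"send"`/`"receive"` encoding of item 904, and the in-memory list used as the table are assumptions for illustration; the specification names the items but not their data types or storage.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VoiceTextRecord:
    text_id: str           # 901: text ID
    communication_id: str  # communication ID
    registrant: str        # 903: who registered this voice text
    direction: str         # 904: "send" or "receive" (assumed encoding)
    speaker: str           # 905: who spoke
    spoken_at: datetime    # 907: time of the utterance
    text: str              # 908: text of the voice


voice_text_table = []      # hypothetical stand-in for table 900


def register(record):
    """S702: store one accepted registration in the table."""
    if record.direction not in ("send", "receive"):
        raise ValueError("direction must be 'send' or 'receive'")
    voice_text_table.append(record)
    return record
```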

ステップＳ７０３では、上記の音声テキスト管理テーブルに登録された内容に基づいて、受信した遅延判断登録内容が送信側であるか受信側であるかの判定を行う。受信した登録内容が送信側であればステップＳ７０４に進み、クライアントに遅延判断の登録完了を意味する登録成功のレスポンスを送信して、本処理を終了する。 In step S703, it is determined whether the received delay determination registration content is the transmission side or the reception side based on the content registered in the voice text management table. If the received registration content is on the transmitting side, the process proceeds to step S704, a registration success response indicating that the registration of the delay determination is completed is transmitted to the client, and this process is terminated.

ステップＳ７０３の処理は、遅延判断を行う為には送信側と受信側の両方のデータが揃う必要がある。理論的に送信側が先に届くことから、遅延判断は受信側の登録時に実施するため、どちらの登録内容であるかを判断する。登録内容が受信側であれば、ステップＳ７０５に進め、遅延判断処理を行う。 The determination in step S703 is needed because a delay determination requires data from both the transmitting side and the receiving side. Since the transmitting side's data should, in principle, arrive first, the delay determination is performed when the receiving side registers, so step S703 determines which side the registration content came from. If the registration content is from the receiving side, the process proceeds to step S705 and the delay determination process is performed.
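The branch in step S703 can be illustrated with a short Python sketch. The dictionary-based record, the function name `accept_registration`, the `determine_delay` callback and the shape of the response are all hypothetical; only the ordering of the steps follows the description above.

```python
def accept_registration(table, record, determine_delay):
    """Store a registration and run the delay determination only for receivers."""
    table.append(record)                      # S702: store the registration
    if record["direction"] == "send":         # S703: transmitting side?
        # S704: nothing to compare against yet, so just report success.
        return {"status": "registered", "judged": False}
    # S705: the receiving side arrived, so both sides should now be present.
    result = determine_delay(table, record)
    # S706: report success after the determination has run.
    return {"status": "registered", "judged": True, "result": result}
```

Running the determination only on the receiving-side registration keeps the comparison to a single pass per utterance, which matches the reasoning given for step S703.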

ステップＳ７０５で行う遅延判断処理について、図８を用いて説明する。上述の通り、遅延判断は受信側の登録時に実施するため、ステップＳ７０５のタイミングで行う。 The delay determination process performed in step S705 will be described with reference to FIG. 8. As described above, the delay determination is carried out when the receiving side registers, so it is performed at the timing of step S705.

ステップＳ８０１では、音声テキスト管理テーブル９００から登録内容に含まれる音声テキストデータに該当する送信側が登録したデータの候補を検索する。検索する為の条件は、コミュニケーションＩＤ９０２の人物と音声の発言者９０５が同じ人物であること、発言時間及び登録時間がリクエストデータの登録内容と比較して６０秒以内であることを条件に検索する。 In step S801, the voice text management table 900 is searched for candidates for the data registered by the transmitting side that corresponds to the voice text data included in the registration content. The search conditions are that the person of the communication ID 902 and the voice speaker 905 are the same person, and that the speaking time and registration time are within 60 seconds of those in the registered content of the request data.

この条件は、コミュニケーション相手が同じ内容を復唱したり、少し前に話した内容と同じことをもう一度繰り返し話したりした際に、復唱した内容や繰り返して話した内容を遅延していると判定させないために、採用している。勿論、実際の運用状況に合わせて、適宜検索条件は変更してよい。 This condition is adopted so that, when the communication partner repeats back the same content, or says again something that was said a little earlier, the repeated content is not judged to be delayed. Of course, the search conditions may be changed as appropriate to suit the actual operating situation.
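A minimal Python sketch of the candidate search in step S801 follows. It reads the conditions as "same communication ID, same speaker, transmitting-side entry, speaking time within 60 seconds of the request", which is one interpretation of the text; the registration-time condition is left out for brevity. The dictionary keys and the name `find_candidates` are hypothetical.

```python
from datetime import datetime, timedelta


def find_candidates(table, request, window=timedelta(seconds=60)):
    """Return transmitting-side rows that could correspond to the request."""
    return [
        row for row in table
        if row["direction"] == "send"
        and row["communication_id"] == request["communication_id"]
        and row["speaker"] == request["speaker"]
        # The time window keeps repeated or re-spoken content from being
        # matched against a much older utterance.
        and abs(request["spoken_at"] - row["spoken_at"]) <= window
    ]
```

Because `window` is a parameter, the 60-second limit can be adjusted to the operating situation, as the description allows.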

ステップＳ８０２では、候補の音声テキストデータに対して全文検索をすることで、音声テキスト管理テーブル９００に送信側が登録したデータを特定する。 In step S802, the data registered by the sender in the voice text management table 900 is specified by performing a full-text search on the candidate voice text data.

具体的には、検索した結果、ヒットした候補の音声テキストデータに対して、受信側の音声テキストデータをキーにした検索エンジンによる全文検索を行い、送信側の音声テキストデータを検索する。この処理により音声テキスト管理テーブル９００に事前に登録していた送信側の登録データを特定する。 Specifically, a full-text search by a search engine, keyed on the receiving side's voice text data, is run over the candidate voice text data hit by the preceding search in order to find the transmitting side's voice text data. This process identifies the transmitting side's registered data that was registered in advance in the voice text management table 900.
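The specification uses a full-text search engine for step S802 and does not name one. As a stand-in for illustration only, the sketch below picks the candidate whose text is most similar to the received text using `difflib.SequenceMatcher` from the Python standard library; this is a substituted technique, not the search engine of the embodiment, and the name `best_match` and the dictionary keys are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(candidates, received_text):
    """Return the candidate row most similar to the received text, or None."""
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda row: SequenceMatcher(None, row["text"], received_text).ratio(),
    )
```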

ステップＳ８０３では、送信側の登録データと受信側の登録データの発言時間の差分からコミュニケーションの遅延判断を行う。 In step S803, the communication delay is determined from the difference in the speaking time between the registered data on the transmitting side and the registered data on the receiving side.

具体的には、送信側の登録データの発言時間とクライアントからの要求に含まれる受信側の発言時間を比較し、差分からコミュニケーションの遅延判断を行う。差分が１秒未満の場合は遅延なしと判断し、１～５秒以内の場合は遅延小、６～１０秒の場合は遅延中、１０秒以上の場合は遅延大と判断する。この遅延レベルの評価の結果が各クライアントの端末上でポップアップとして表示される。例えば、図１１に示す１１０１や図１２に示す１２０１、図１４に示す１４０１、図１５に示す１５０１などのように表示される。 Specifically, the speaking time in the transmitting side's registered data is compared with the receiving side's speaking time included in the request from the client, and the communication delay is determined from the difference. If the difference is less than 1 second, it is determined that there is no delay; if it is 1 to 5 seconds, the delay is small; if it is 6 to 10 seconds, the delay is medium; and if it is 10 seconds or more, the delay is large. The result of this delay level evaluation is displayed as a pop-up on each client's terminal, for example as 1101 shown in FIG. 11, 1201 shown in FIG. 12, 1401 shown in FIG. 14, or 1501 shown in FIG. 15.
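The delay levels of step S803 can be expressed as a small Python function. The thresholds come from the description; how to treat a difference between 5 and 6 seconds is not stated, so this sketch assumes it counts as medium. The function name and the level strings are hypothetical.

```python
def delay_level(diff_seconds):
    """Map a speaking-time difference in seconds to a delay level."""
    if diff_seconds < 1:
        return "none"
    if diff_seconds <= 5:
        return "small"
    if diff_seconds < 10:
        # Assumption: values between 5 and 6 seconds are treated as medium,
        # since the description leaves that range unspecified.
        return "medium"
    return "large"  # 10 seconds or more
```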

ステップＳ８０４では、送信側の登録データと受信側の登録データの発言内容の差分からコミュニケーションの劣化判断を行う。 In step S804, deterioration of communication is determined from the difference between the remark contents of the registered data on the transmitting side and the registered data on the receiving side.

具体的には、送信側の登録データの音声テキストデータとクライアントからの要求に含まれる受信側の音声テキストデータを比較し、一致度からコミュニケーションの劣化判断を行う。一致度が９５％以上の場合は劣化なしと判断し、９０～９５％の場合は劣化小、８０～９０％の場合は劣化中、８０％より低い場合は劣化大と判断する。この劣化レベルの評価の結果が各クライアントの端末上でポップアップとして表示される。例えば、図１１に示す１１０１や図１２に示す１２０１、図１４に示す１４０１、図１５に示す１５０１などのように表示される。 Specifically, the voice text data in the transmitting side's registered data is compared with the receiving side's voice text data included in the request from the client, and the deterioration of communication is determined from the degree of matching. When the degree of matching is 95% or more, it is determined that there is no deterioration; when it is 90 to 95%, the deterioration is small; when it is 80 to 90%, the deterioration is medium; and when it is lower than 80%, the deterioration is large. The result of this deterioration level evaluation is displayed as a pop-up on each client's terminal, for example as 1101 shown in FIG. 11, 1201 shown in FIG. 12, 1401 shown in FIG. 14, or 1501 shown in FIG. 15.
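The deterioration levels of step S804 can be sketched in the same way. The specification does not say how the degree of matching is computed; the sketch assumes a character-level similarity from `difflib.SequenceMatcher` purely for illustration. Where the stated ranges share a boundary (90% and 95%), the boundary value is assigned to the milder level, which is also an assumption. All names are hypothetical.

```python
from difflib import SequenceMatcher


def match_percent(sent_text, received_text):
    """Assumed matching measure: similarity ratio scaled to 0..100."""
    return SequenceMatcher(None, sent_text, received_text).ratio() * 100


def deterioration_level(percent):
    """Map a degree of matching in percent to a deterioration level."""
    if percent >= 95:
        return "none"
    if percent >= 90:
        return "small"
    if percent >= 80:
        return "medium"
    return "large"  # lower than 80%
```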

ステップＳ８０５では、解析結果である遅延判断と劣化判断を図９に示す遅延判断管理格納テーブル９２０の解析結果（遅延）９２３、解析結果（劣化）９２４にそれぞれ格納する。 In step S805, the delay determination and the deterioration determination, which are the analysis results, are stored in the analysis result (delay) 923 and the analysis result (deterioration) 924, respectively, of the delay determination management table 920 shown in FIG. 9.

遅延判断管理テーブル９２０は、コミュニケーションＩＤを管理するコミュニケーションＩＤ９２１、音声の発言者を管理する音声の発言者９２２、遅延判断の解析結果を管理する解析結果（遅延）９２３、劣化判断の解析結果を管理する解析結果（劣化）９２４、更新された日時を管理する更新日時９２５、撮影された日時を管理する撮影日時９２６を管理するテーブルである。 The delay judgment management table 920 manages the communication ID 921 that manages the communication ID, the voice speaker 922 that manages the voice speaker, the analysis result (delay) 923 that manages the analysis result of the delay judgment, and the analysis result of the deterioration judgment. It is a table which manages the analysis result (deterioration) 924, the update date and time 925 which manages the updated date and time, and the shooting date and time 926 which manages the shooting date and time.

ステップＳ８０６では、遅延若しくは劣化が発生したと判断した場合は、クライアントアプリに遅延若しくは劣化が発生した旨の通知を行う。遅延、劣化が発生したデータと同じコミュニケーションＩＤのデータを音声テキスト管理テーブル９００から検索して、ヒットしたデータの登録者全員に対して遅延若しくは劣化の内容を含む通知イベントを送信する。 In step S806, when it is determined that a delay or deterioration has occurred, the client application is notified to that effect. Data with the same communication ID as the data in which the delay or deterioration occurred is retrieved from the voice text management table 900, and a notification event containing the details of the delay or deterioration is sent to all registrants of the retrieved data.
モバイルアプリであれば、図１１や図１２のように表示し、ＰＣアプリであれば図１４や図１５のように表示される。 In the mobile application, the notification is displayed as shown in FIGS. 11 and 12; in the PC application, it is displayed as shown in FIGS. 14 and 15.
以上が、図８で行われる遅延判断処理の内容である。 The above is the content of the delay determination process performed in FIG. 8.
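The notification fan-out of step S806 can be sketched as follows. The event structure, the level strings and the function names are assumptions for illustration; the description only states that every registrant of data with the same communication ID receives a notification event containing the delay or deterioration details.

```python
def notification_targets(table, communication_id):
    """Collect every registrant that has rows for this communication ID."""
    return sorted(
        {row["registrant"] for row in table
         if row["communication_id"] == communication_id}
    )


def build_notifications(table, communication_id, delay, deterioration):
    """Return (registrant, event) pairs, or nothing if no problem was found."""
    if delay == "none" and deterioration == "none":
        return []
    event = {
        "communication_id": communication_id,
        "delay": delay,
        "deterioration": deterioration,
    }
    return [(user, event) for user in notification_targets(table, communication_id)]
```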

上述のステップＳ７０５の遅延判断処理が終了すると、ステップＳ７０６の処理へ移行する。 When the delay determination process in step S705 described above is completed, the process proceeds to step S706.
ステップＳ７０６では、クライアントに登録成功のレスポンスを送信する。 In step S706, a registration success response is sent to the client.
以上で図７の遅延判断受付通知の処理は終了する。 This completes the processing of the delay determination acceptance notification shown in FIG. 7.

続けて、図１０を用いて、通常のモバイルアプリのコミュニケーション画面構成イメージについて、説明する。通常のコミュニケーション画面１０００は、カメラを作動し、ルームに入室している際のものである。 Next, the screen configuration of the normal communication screen of the mobile application will be described with reference to FIG. 10. The normal communication screen 1000 shows the state in which the camera is active and the user has entered a room.

１００１のアイコンは、スマートフォンを利用して、ルームに入室しているユーザの数を示している。 The icon 1001 indicates the number of users who are in the room using a smartphone.

１００２のアイコンは、ＰＣを利用して、ルームに入室しているユーザの数を示している。 The icon of 1002 indicates the number of users who are in the room by using the PC.

１００３のアイコンは、カメラのシャッターボタンと、ビデオの撮影開始、ビデオの撮影終了等の処理を指示する操作部である。 The icon of 1003 is a shutter button of the camera and an operation unit for instructing processing such as start of video shooting and end of video shooting.

１００４のアイコンは、カメラモードを示すアイコンである。１００４のカメラアイコンが黄色で色付けされている場合は、カメラモードで撮影されていることを示す。 The icon of 1004 is an icon indicating the camera mode. When the camera icon of 1004 is colored in yellow, it indicates that the image is taken in the camera mode.

１００５のアイコンは、ビデオモードを示すアイコンである。１００５のビデオアイコンが黄色で色付けされている場合は、ビデオモードで撮影されていることを示す。 The icon of 1005 is an icon indicating a video mode. If the 1005 video icon is colored yellow, it indicates that it was shot in video mode.

１００６は、現在ユーザがどのルームに入室しているかが確認できる表示である。 Reference numeral 1006 is a display that shows which room the user is currently in.
この場合であれば、ルームＡ２に入室していることが分かる。以上で、図１０の説明を終了する。 In this case, it can be seen that the user is in room A2. This is the end of the description of FIG. 10.

図１１を用いて、ユーザ１（自身）が遅延している場合のモバイルアプリのコミュニケーション画面の画面構成イメージを説明する。 With reference to FIG. 11, a screen configuration image of the communication screen of the mobile application when the user 1 (self) is delayed will be described.

ユーザ１（自身）において、コミュニケーションの遅延や劣化が発生すると、１１０１のようなポップアップが通常のコミュニケーション画面に重畳表示される。 When the communication is delayed or deteriorated in the user 1 (self), a pop-up such as 1101 is superimposed and displayed on the normal communication screen.

ポップアップ１１０１では、遅延が発生していることを通知する。そして、遅延レベルと劣化レベルを表示する。また、コミュニケーションの相手方が、ユーザ１（自身）の音声を聞き取りづらくなっていることを通知する。 Pop-up 1101 notifies the user that a delay has occurred and displays the delay level and the deterioration level. It also notifies the user that the other party is having difficulty hearing the voice of user 1 (self).

これにより、ユーザ１（自身）は、現在相手方とのコミュニケーションの間で音声データ、映像データの遅延、劣化が発生していることが瞬時に把握できる。 As a result, the user 1 (self) can instantly grasp that the delay or deterioration of the audio data and the video data is currently occurring during the communication with the other party.

また、ユーザ（自身）は、自分の音声が実際にどれだけ遅延や劣化しているかを理解することができ、遅延や劣化を考慮したコミュニケーションを行うことで相手方とのコミュニケーションを円滑に行うことができる。以上で図１１の説明を終了する。 In addition, the user (self) can understand how much their own voice is actually delayed or deteriorated, and by communicating with that delay or deterioration in mind can communicate smoothly with the other party. This is the end of the description of FIG. 11.

続けて、図１２を用いて、ユーザ２（コミュニケーションの相手方）が遅延している場合のモバイルアプリのコミュニケーション画面の画面構成イメージを説明する。 Subsequently, with reference to FIG. 12, a screen configuration image of the communication screen of the mobile application when the user 2 (communication partner) is delayed will be described.

ユーザ２（コミュニケーションの相手方）において、コミュニケーションの遅延や劣化が発生すると、１２０１のようなポップアップが通常のコミュニケーション画面に重畳表示される。 When the communication is delayed or deteriorated in the user 2 (communication partner), a pop-up such as 1201 is superimposed and displayed on the normal communication screen.

ポップアップ１２０１では、遅延が発生していることを通知する。そして、遅延レベルを表示する。１２０１のポップアップでは「小」が表示されている。 Pop-up 1201 notifies that a delay has occurred. Then, the delay level is displayed. "Small" is displayed in the pop-up of 1201.

相手が遅延している場合は、実際に音声の劣化が聞いてわかるので、ポップアップ１２０１には、劣化レベルの表示は不要である。 When the other party is delayed, the deterioration of the voice is actually heard and known, so that the pop-up 1201 does not need to display the deterioration level.

また、ユーザ２（コミュニケーションの相手方）のコミュニケーション相手であるユーザ１が劣化に気づいていない可能性がある旨の通知と、遅延を考慮した会話を実施するように促すメッセージを表示する。 In addition, a notification that the user 1 who is the communication partner of the user 2 (communication partner) may not be aware of the deterioration and a message prompting the user to carry out the conversation in consideration of the delay are displayed.

これにより、ユーザ２は、現在相手方とのコミュニケーションの間で音声データ、映像データの遅延が発生していることが瞬時に把握できる。 As a result, the user 2 can instantly grasp that the delay of the audio data and the video data is currently occurring between the communication with the other party.

また、遅延を考慮したコミュニケーションを行うことで、相手方とのコミュニケーションを円滑に行うことが出来る。以上で図１２の説明を終了する。 In addition, by communicating with the delay in mind, the user can communicate smoothly with the other party. This is the end of the description of FIG. 12.
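The difference between the two pop-ups (1101 for the user's own delay, 1201 for the partner's delay) can be summarized in a short Python sketch. The message wording, the `"self"`/`"partner"` values and the function name are hypothetical; only the rule that the deterioration level is shown for the user's own delay and omitted for the partner's delay comes from the description.

```python
def popup_lines(delayed_party, delay_level, deterioration_level):
    """Build the text lines of the pop-up for 'self' or 'partner' delay."""
    lines = ["A delay has occurred.", f"Delay level: {delay_level}"]
    if delayed_party == "self":
        # Pop-up 1101: the user cannot hear their own deterioration,
        # so the deterioration level is shown as well.
        lines.append(f"Deterioration level: {deterioration_level}")
        lines.append("The other party may have difficulty hearing you.")
    else:
        # Pop-up 1201: the user hears the partner directly, so the
        # deterioration level is omitted.
        lines.append("The other party may not be aware of this. "
                     "Please allow for the delay when speaking.")
    return lines
```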

続けて、図１３を用いて、通常のＰＣアプリのコミュニケーション画面の画面構成イメージを説明する。１３００がＰＣアプリのコミュニケーション全体の画面である。 Subsequently, with reference to FIG. 13, a screen configuration image of a communication screen of a normal PC application will be described. 1300 is the screen of the entire communication of the PC application.

１３０１では、コミュニケーションを行うルームを選択することが出来る。 At 1301, it is possible to select a room for communication.

１３０２は、ユーザが選択することで、拡大して映像を観ることが可能な画面である。 Reference numeral 1302 is a screen on which the user can view an enlarged version of a selected video.
勿論、ユーザが選択しなくとも一定のタイミングで１３０２の画面に表示される映像が切り替わるようにしてもよい。 Of course, the video displayed on the screen 1302 may also be switched at a fixed timing without the user making a selection.

１３０３、１３０４、１３０５、１３０６はルームに参加する各ユーザから送られてくる映像を表示する領域である。各領域には、ユーザから送られてくる映像と、どのユーザからの映像なのかが分かるようにユーザ名が表示される。 1303, 1304, 1305, and 1306 are areas for displaying the videos sent from the users participating in the room. Each area displays the video sent from a user together with the user name, so that it is clear which user the video is from.

ＰＣ上での操作を行うユーザが、ルームに参加しているユーザが送ってくる映像を表示する画面右側の１３０３、１３０４、１３０５、１３０６の領域のうち、いずれかを選択すると、ユーザが選択した領域に表示される映像が、１３０２の画面に拡大して表示される。 When the user who operates on the PC selects one of the areas 1303, 1304, 1305, and 1306 on the right side of the screen displaying the image sent by the user participating in the room, the user selects it. The image displayed in the area is enlarged and displayed on the screen of 1302.

図１３の例であると、ユーザは１３０５を選択したので、１３０２には、１３０５の領域に映されていた映像を拡大して表示する。ユーザが１３０５を選択すると、１３０５に表示されていた映像が１３０２に移って、表示されていることをメッセージで示す。この場合、ＶＩＥＷと表示される。 In the example of FIG. 13, since the user has selected 1305, the image projected on the area of 1305 is enlarged and displayed on 1302. When the user selects 1305, the image displayed in 1305 is moved to 1302, and a message indicates that the image is displayed. In this case, it is displayed as VIEW.

１３０７は、カメラモードを選択するためのアイコンである。そして、カメラモードを選択すると、画面に表示されている映像をスクリーンショットで撮ることや、ＰＣに内蔵若しくは外部から接続されたカメラを利用して、写真を撮ることが出来る。 Reference numeral 1307 is an icon for selecting a camera mode. Then, when the camera mode is selected, the image displayed on the screen can be taken as a screenshot, or the picture can be taken by using the camera built in the PC or connected from the outside.

１３０８は、ビデオモードを選択するためのアイコンである。ビデオモードを選択すると、ビデオの撮影が開始する。画面自体を映像として撮影することも可能であり、また内蔵若しくは外部から接続されたカメラを利用して、ユーザの顔や身体を含む映像を撮ることも可能である。これにより、ジェスチャーなどでもコミュニケーションが可能となる。 1308 is an icon for selecting a video mode. Select the video mode and the video will start shooting. It is also possible to shoot the screen itself as an image, and it is also possible to take an image including the user's face and body by using a built-in or externally connected camera. This enables communication even with gestures.

１３０９は、写真一覧のアイコンである。写真一覧のアイコンを押下すると、撮った写真を確認することができる。 1309 is a photo list icon. You can check the photos you have taken by pressing the photo list icon.

１３１０は、画面キャプチャー共有のアイコンである。画面キャプチャー共有のアイコンを押下すると、ユーザがキャプチャーした画面をルームに参加する他のユーザに共有することが出来る。以上で図１３の説明を終了する。 1310 is a screen capture sharing icon. Pressing the screen capture sharing icon shares the screen captured by the user with the other users participating in the room. This is the end of the description of FIG. 13.

図１４は、ユーザ１（自分）が遅延している場合のＰＣアプリのコミュニケーション画面である。 FIG. 14 is a communication screen of the PC application when the user 1 (self) is delayed.

１４０１のようなポップアップ形式で、遅延が発生している旨の通知と、遅延レベルと劣化レベルを表示し、音声が聞き取りにくくなっていることを通知する。 In a pop-up format such as 1401, a notification that a delay has occurred and a delay level and a deterioration level are displayed to notify that the voice is difficult to hear.

これにより、現在相手方とのコミュニケーションの間で音声データ、映像データの遅延が発生していることが瞬時に把握できる。以上で図１４の説明を終了する。 As a result, the user can instantly grasp that a delay in the audio data and video data is currently occurring in the communication with the other party. This is the end of the description of FIG. 14.

図１５は、ユーザ２（コミュニケーションの相手方）が遅延している場合のコミュニケーション画面である。 FIG. 15 is a communication screen when user 2 (communication partner) is delayed.

遅延が発生しているコミュニケーションの相手方が存在する場合は、１５００のように通知バーを表示して、遅延が発生していることを通知する。通知バー１５００にマウスオーバーすることで、１５０１のようなポップアップのイメージが表示されます。 If there is a communication partner with a delay, a notification bar such as 1500 is displayed to notify that the delay has occurred. Mouse over the notification bar 1500 to see a pop-up image like 1501.

モバイルアプリと同様に、相手が遅延しているかどうかは実際に音声を聞いて分かるので劣化レベルについては表示しない。以上で図１５の説明を終了する。 As in the mobile application, whether the other party is delayed can be recognized by actually listening to the voice, so the deterioration level is not displayed. This is the end of the description of FIG. 15.

以上、本発明について説明したが、本発明によって、発信側と着信側の音声データをテキスト化し、テキスト化した両者の音声データの結果を比較の上、ネットワークやＷｅｂ－ＲＴＣによるデータの遅延や劣化を判断して、ユーザに通知することが可能となる。 Although the present invention has been described above, according to the present invention, the voice data of the calling side and the called side are converted into text, and after comparing the results of the voice data of both converted into text, data delay or deterioration due to the network or Web-RTC is performed. Can be determined and notified to the user.

これにより、遅延や劣化が発生しているコミュニケーションにおいて、ユーザ同士でお互いの遅延や劣化の状況を把握、考慮しながらコミュニケーションが出来るので、ユーザの煩わしさが軽減できる。 As a result, in communication in which delay or deterioration occurs, users can communicate while grasping and considering each other's delay or deterioration situation, so that the user's annoyance can be reduced.

従来であれば、ユーザ１（自身）は、相手に自分の声が伝わっているか判断できず、また、自分の声が伝わっているとしても遅延が発生している場合にどの程度の遅延が発生しているのかまでは判断できなかった。しかしながら、本発明によって遅延や劣化の具合がユーザに通知されることで、発言をする側はゆっくり話す、文章を区切って話すなどの工夫をすることが出来る。また、遅延があることを想定しながら発言を行うことが出来るので、発言が被ることが少なくなる。 Conventionally, user 1 (self) could not determine whether their voice was reaching the other party, and even if it was, could not determine how much delay was occurring when there was a delay. With the present invention, however, the user is notified of the degree of delay or deterioration, so the speaker can adapt by speaking slowly, speaking in shorter sentences, and so on. In addition, because the user can speak on the assumption that there is a delay, the speakers talk over each other less often.

以上、本願の実施形態について示したが、本発明は、例えば、システム、装置、方法、プログラムもしくは記録媒体等としての実施態様をとることが可能である。具体的には、複数の機器から構成されるシステムに適用しても良いし、また、一つの機器からなる装置に適用しても良い。 Although the embodiments of the present application have been described above, the present invention can be implemented as, for example, a system, an apparatus, a method, a program, a recording medium, or the like. Specifically, it may be applied to a system composed of a plurality of devices, or may be applied to a device composed of one device.

また、本発明におけるプログラムは、図４に示すフローチャートの処理方法をコンピュータが実行可能なプログラムであり、本発明の記憶媒体は図４の処理方法をコンピュータが実行可能なプログラムが記憶されている。なお、本発明におけるプログラムは図１の各装置の処理方法ごとのプログラムであってもよい。 Further, the program in the present invention is a program in which a computer can execute the processing method of the flowchart shown in FIG. 4, and the storage medium of the present invention stores a program in which the computer can execute the processing method in FIG. The program in the present invention may be a program for each processing method of each device of FIG.

以上のように、前述した実施形態の機能を実現するプログラムを記録した記録媒体を、システムあるいは装置に供給し、そのシステムあるいは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記録媒体に格納されたプログラムを読み出し、実行することによっても本発明の目的が達成されることは言うまでもない。 As described above, it goes without saying that the object of the present invention is also achieved when a recording medium on which a program realizing the functions of the above-described embodiment is recorded is supplied to a system or device, and the computer (or CPU or MPU) of that system or device reads and executes the program stored on the recording medium.

この場合、記録媒体から読み出されたプログラム自体が本発明の新規な機能を実現することになり、そのプログラムを記録した記録媒体は本発明を構成することになる。 In this case, the program itself read from the recording medium realizes the novel function of the present invention, and the recording medium on which the program is recorded constitutes the present invention.

プログラムを供給するための記録媒体としては、例えば、フレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ－ＲＯＭ、ＣＤ－Ｒ、ＤＶＤ－ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ、ＥＥＰＲＯＭ、シリコンディスク等を用いることが出来る。 Recording media for supplying programs include, for example, flexible disks, hard disks, optical disks, magneto-optical disks, CD-ROMs, CD-Rs, DVD-ROMs, magnetic tapes, non-volatile memory cards, ROMs, EEPROMs, and silicon. A disc or the like can be used.

また、コンピュータが読み出したプログラムを実行することにより、前述した実施形態の機能が実現されるだけでなく、そのプログラムの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Further, by executing the program read by the computer, not only the function of the above-described embodiment is realized, but also the OS (operating system) or the like running on the computer is actually realized based on the instruction of the program. Needless to say, there are cases where a part or all of the processing is performed and the processing realizes the functions of the above-described embodiment.

さらに、記録媒体から読み出されたプログラムが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵ等が実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれることは言うまでもない。 Further, after the program read from the recording medium is written in the memory provided in the function expansion board inserted in the computer or the function expansion unit connected to the computer, the function expansion board is based on the instruction of the program code. It goes without saying that there are cases where the CPU or the like provided in the function expansion unit performs a part or all of the actual processing, and the processing realizes the functions of the above-described embodiment.

また、本発明は、複数の機器から構成されるシステムに適用しても、ひとつの機器から成る装置に適用しても良い。また、本発明は、システムあるいは装置にプログラムを供給することによって達成される場合にも適用できることは言うまでもない。この場合、本発明を達成するためのプログラムを格納した記録媒体を該システムあるいは装置に読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。 Further, the present invention may be applied to a system composed of a plurality of devices or a device composed of one device. Needless to say, the present invention can also be applied when it is achieved by supplying a program to a system or an apparatus. In this case, by reading the recording medium containing the program for achieving the present invention into the system or the device, the system or the device can enjoy the effect of the present invention.

さらに、本発明を達成するためのプログラムをネットワーク上のサーバ、データベース等から通信プログラムによりダウンロードして読み出すことによって、そのシステムあるいは装置が、本発明の効果を享受することが可能となる。なお、上述した各実施形態およびその変形例を組み合わせた構成も全て本発明に含まれるものである。 Further, by downloading and reading a program for achieving the present invention from a server, database, or the like on a network by a communication program, the system or device can enjoy the effect of the present invention. It should be noted that the present invention also includes all the configurations in which each of the above-described embodiments and modifications thereof are combined.

１００モバイルコミュニケーションアプリケーション
１１０ＰＣコミュニケーションアプリケーション
１２０クラウドシステム
１３０インターネット
100 Mobile Communication Application
110 PC Communication Application
120 Cloud System
130 Internet

Claims

An information processing device having a communication screen, comprising:
an input means for accepting input of voice data;
a creating means for converting the voice data accepted by the input means into text;
a comparison means for comparing the transmitting-side and receiving-side voice text data created by the creating means; and
a determination means for determining whether or not delay and deterioration have occurred based on the comparison result of the comparison means.

The information processing device according to claim 1, further comprising a notification means that gives a notification corresponding to the determination result of the determination means.

The information processing device according to claim 1 or 2, wherein the determination means determines whether delay and deterioration have occurred by comparing the degree of matching between the voice text data on the transmitting side and the voice text data on the receiving side.

The information processing device according to any one of claims 1 to 3, wherein the determination means determines the degree of delay and the degree of deterioration according to the degree of matching between the voice text data on the transmitting side and the voice text data on the receiving side.

The information processing apparatus according to claim 1, wherein the notification means gives a notification corresponding to the degree of delay and the degree of deterioration determined by the determination means.

The information processing apparatus according to claim 1, wherein the notification means superimposes and displays on the communication screen that a delay or deterioration has occurred.

A control method of an information processing device having a communication screen, wherein the information processing device executes:
an input step of accepting input of voice data;
a creation step of converting the voice data accepted in the input step into text;
a comparison step of comparing the transmitting-side and receiving-side voice text data created in the creation step; and
a determination step of determining whether or not delay and deterioration have occurred based on the comparison result of the comparison step.

A program for causing an information processing device having a communication screen to function as:
an input means for accepting input of voice data;
a creating means for converting the voice data accepted by the input means into text;
a comparison means for comparing the transmitting-side and receiving-side voice text data created by the creating means; and
a determination means for determining whether or not delay and deterioration have occurred based on the comparison result of the comparison means.