JP2022113375A

JP2022113375A - Information processing method and monitoring system

Info

Publication number: JP2022113375A
Application number: JP2021009587A
Authority: JP
Inventors: 亜彩美井上; Asami Inoue
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2021-01-25
Filing date: 2021-01-25
Publication date: 2022-08-04

Abstract

To provide an information processing method, a monitoring system, and the like that allow a conference or the like held over a network to proceed smoothly when a defect occurs in audio. SOLUTION: An information processing method for monitoring audio data in a conference system that provides a conference among a plurality of participant terminal devices includes the steps of: acquiring, from the conference system, the audio data transmitted from the plurality of participant terminal devices to the conference system; detecting a defect in the audio data; and, when a defect in the audio data is detected, outputting to the conference system transcript data including text that is a speech recognition result of the audio data. SELECTED DRAWING: Figure 4

Description

The present invention relates to an information processing method, a monitoring system, and the like.

Systems that use a network such as the Internet to hold a conference or the like among a plurality of participants are known. In a conference held over a network, volume or sound quality may degrade because of network delay or similar causes. When that happens, the audio becomes hard to hear, which may hinder the progress of the conference.

Patent Literature 1 discloses a technique for controlling the bandwidth of transmitted and received information in a network conference system. Patent Literature 2 discloses a technique for controlling packet readout according to data playback time in a system that provides an electronic conference via a network.

Patent Literature 1: Japanese Unexamined Patent Application Publication No. 2003-152793
Patent Literature 2: Japanese Unexamined Patent Application Publication No. 2008-227968

The techniques of Patent Literature 1 and Patent Literature 2 suppress the occurrence of audio defects by controlling bandwidth or packet readout. Even with these techniques, however, defects may not be fully suppressed, depending on the degree of network delay and other factors. The conventional techniques do not disclose how to respond once an audio defect has occurred.

According to some aspects of the present disclosure, it is possible to provide an information processing method, a monitoring system, and the like that allow a conference or the like held over a network to proceed smoothly when a defect occurs in audio.

One aspect of the present disclosure relates to an information processing method for monitoring audio data in a conference system that provides a conference among a plurality of participant terminal devices, the method including: a step of acquiring, from the conference system, audio data transmitted from the plurality of participant terminal devices to the conference system; a step of detecting a defect in the audio data; and a step of outputting, to the conference system, transcript data including text that is a speech recognition result of the audio data when a defect in the audio data is detected.

Another aspect of the present disclosure relates to a monitoring system that monitors audio data in a conference system that provides a conference among a plurality of participant terminal devices, the monitoring system including: an audio data acquisition unit that acquires, from the conference system, audio data transmitted from the plurality of participant terminal devices to the conference system; a defect detection unit that performs processing for detecting a defect in the audio data; and a transcript data output unit that performs processing for outputting, to the conference system, transcript data including text that is a speech recognition result of the audio data when a defect in the audio data is detected.
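The three steps named in these aspects (acquire, detect, output) can be sketched as a small loop. This is only an illustrative outline under stated assumptions: the function `monitor` and its callable parameters `detect_defect`, `recognize`, and `output` are hypothetical names, and the sketch runs recognition only on defective frames, which the description does not require.

```python
from typing import Callable, Iterable, Sequence


def monitor(
    frames: Iterable[Sequence[int]],
    detect_defect: Callable[[Sequence[int]], bool],
    recognize: Callable[[Sequence[int]], str],
    output: Callable[[str], None],
) -> int:
    """Run the acquire -> detect -> output steps over a stream of audio frames.

    All four parameters are hypothetical stand-ins:
    frames        audio frames already acquired from the conference system
    detect_defect returns True when a frame is judged defective
    recognize     converts a frame to text (speech recognition)
    output        delivers transcript text back to the conference system
    Returns the number of transcripts that were output.
    """
    sent = 0
    for frame in frames:
        if detect_defect(frame):
            # Only defective audio triggers transcript output.
            output(recognize(frame))
            sent += 1
    return sent


# Toy usage with stand-in callables; a frame counts as "defective" when quiet.
delivered = []
count = monitor(
    frames=[[1], [1000], [2]],
    detect_defect=lambda f: max(f) < 10,
    recognize=lambda f: f"text{f[0]}",
    output=delivered.append,
)
```

Passing the detector, recognizer, and output channel in as callables keeps the loop independent of how each unit is actually realized.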

FIG. 1 shows a configuration example of a communication system including a monitoring system.
FIG. 2A shows a hardware configuration example of the conference system.
FIG. 2B shows a hardware configuration example of the monitoring system.
FIG. 3 is a functional block diagram of the conference system.
FIG. 4 is a functional block diagram of the monitoring system.
FIG. 5 shows a hardware configuration example of a participant terminal device.
FIG. 6 is a functional block diagram of a participant terminal device.
FIG. 7 is a sequence diagram explaining the processing of the embodiment.
FIG. 8 is a flowchart explaining the defect detection processing.
FIG. 9 shows an example of a display screen in a conference system using audio.
FIG. 10 shows an example of a display screen in a conference system using chat.
FIG. 11 shows another configuration example of a communication system including a monitoring system.

The present embodiment is described below with reference to the drawings. In the drawings, identical or equivalent elements are given the same reference numerals, and duplicate descriptions are omitted. The embodiment described below does not unduly limit the content recited in the claims. In addition, not all of the configurations described in the embodiment are necessarily essential constituent elements of the present disclosure.

1. System Configuration Example
1.1 Overall Configuration
FIG. 1 is a diagram showing the configuration of a conference communication system 10 that includes a monitoring system 200 according to the present embodiment. As shown in FIG. 1, the communication system 10 includes a conference system 100, the monitoring system 200, and a plurality of participant terminal devices 300. The configuration of the communication system 10 is not limited to that of FIG. 1, and modifications such as adding other components are possible. For example, although FIG. 1 shows two participant terminal devices 300, there may be three or more.

The communication system 10 is a system used for conferences held over a network. The term "conference" here broadly covers situations that require communication among a plurality of users. That is, a conference in the present embodiment is not limited to a conference in the narrow sense, where a plurality of users exchange opinions, deliberate, and make decisions, and may also take forms such as seminars and lectures. A seminar is a study session on a predetermined theme. A lecture is an occasion where an audience listens to a presentation by a speaker. More generally, a conference in the present embodiment can include any situation in which a given participant speaks and other participants may listen to that speech.

The participant terminal device 300 transmits participant data to the conference system 100. Participant data in the present embodiment includes the various kinds of information transmitted from the participant terminal device 300 to the conference system 100. For example, participant data includes pointer data representing the operation results of a pointing device 392 (described later), key input data entered with a keyboard 393, audio data recorded by a microphone 394, and captured image data captured by a camera 395.

Participant data may also include screen data displayed on a display 391, described later with reference to FIG. 5, file data uploaded by the user, and the like. Participant data may further include identification information about the participant terminal device 300. This identification information includes terminal identification information that uniquely identifies the participant terminal device 300 and user identification information that identifies the user of that participant terminal device 300. The user identification information may be a user ID that uniquely identifies the user, or a user name that is allowed to duplicate another user's name.
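One way to picture the participant data described above is as a single record with optional fields. The sketch below is an assumption for illustration only: the class name `ParticipantData` and all field names are hypothetical, and the description does not prescribe any particular data layout or encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ParticipantData:
    """Hypothetical record of what one terminal sends to the conference system."""

    terminal_id: str                          # uniquely identifies the terminal
    user_name: Optional[str] = None           # may duplicate another user's name
    pointer: Optional[Tuple[int, int]] = None # pointing-device position
    key_input: Optional[str] = None           # text typed on the keyboard
    audio: bytes = b""                        # audio recorded by the microphone
    camera_image: bytes = b""                 # image captured by the camera
    screen_image: bytes = b""                 # contents shown on the display
    files: Dict[str, bytes] = field(default_factory=dict)  # uploaded files by name


# A terminal that only sends a typed message leaves the other fields empty.
sample = ParticipantData(terminal_id="T-001", user_name="alice", key_input="hello")
```

Keeping the terminal identifier mandatory and everything else optional mirrors the text: every kind of payload "may" be present, while the identification information lets the receiver attribute it.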

The conference system 100 distributes conference data to the plurality of participant terminal devices 300. Conference data here includes the various kinds of information transmitted from the conference system 100 to the participant terminal devices 300. For example, conference data is created from the participant data transmitted by the plurality of participant terminal devices 300, and includes conference audio data obtained by multiplexing a plurality of audio data, conference screen data created from captured image data and screen data, and the like.

The conference system 100 is, for example, a server system. The conference system 100 may be a single server or a collection of servers. For example, the conference system 100 may include a database server that stores identification information and the like for the plurality of participant terminal devices 300, and an application server that creates and distributes conference data. A server here may be a physical server or a virtual server. For example, the database server and the application server may each be a separate physical server, so that the conference system 100 consists of two physical servers. Alternatively, the database server and the application server may each be a virtual server. In that case, one virtual server may be built on one physical server or distributed across a plurality of physical servers, and a plurality of virtual servers may be built on the same physical server. As described above, various modifications are possible for both the functional configuration and the physical configuration of the conference system 100.

The conference system 100 also has a function of providing a conference service that uses audio data. In the following, a video conference service that can use both audio data and image data is described as a specific example of such a service. The conference system 100 may also have a function of providing a chat service. In that case, one server may provide both functions, or each function may be implemented by a different server. For example, the conference system 100 may include an application server for providing the video conference service, an application server for providing the chat service, and a database server shared by the two services. The specific configuration is not limited to this, and various modifications are possible. Conference data in the present embodiment may include both data used for the video conference service and data used for the chat service.

The monitoring system 200 is a system that monitors audio data in the conference provided by the conference system 100. The monitoring system 200 is, for example, a server system, and may be a single server or a collection of servers. As with the conference system 100 described above, various modifications are possible for the functional and physical configuration of the monitoring system 200. The monitoring system 200 monitors the audio data in the conference system 100 for defects and outputs transcript data when a defect is detected. The specific processing is described later.

The participant terminal device 300 is a device used by a user who participates in the conference, and is, for example, a PC (Personal Computer). The participant terminal device 300 may instead be a mobile terminal device such as a tablet or a smartphone.

1.2 Conference System and Monitoring System
FIG. 2A is a hardware configuration diagram of the conference system 100. The conference system 100 includes a processor 140, a memory 150, and a communication interface 160. The configuration of the conference system 100 is not limited to that of FIG. 2A; some components may be omitted, other components may be added, and various modifications are possible. Likewise, the specific configurations in FIG. 2B and FIGS. 3 to 6, described later, are not limited to what the drawings show.

Various processors can be used as the processor 140, such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a DSP (Digital Signal Processor). The processor 140 may include peripheral circuit devices in addition to the CPU, GPU, or DSP. A peripheral circuit device may be an IC (Integrated Circuit) and may include resistors, capacitors, and the like.

The memory 150 may be a semiconductor memory such as an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), a ROM (Read Only Memory), or a flash memory; a register; a magnetic storage device such as an HDD (Hard Disk Drive); or an optical storage device such as an optical disk device.

The communication interface 160 is an interface for communicating via a network, and includes, for example, an antenna, an RF (radio frequency) circuit, and a baseband circuit. The communication interface 160 may operate under the control of the processor 140, or may include a communication control processor separate from the processor 140. The communication interface 160 is, for example, an interface for communication according to TCP/IP (Transmission Control Protocol/Internet Protocol). Various modifications are possible for the specific communication method.

FIG. 2B is a hardware configuration diagram of the monitoring system 200. The monitoring system 200 includes a processor 240, a memory 250, and a communication interface 260.

Various processors such as a CPU, GPU, or DSP can be used as the processor 240. The memory 250 may be a semiconductor memory such as an SRAM, DRAM, ROM, or flash memory; a register; a magnetic storage device; or an optical storage device. The communication interface 260 is an interface for communicating via a network, and includes, for example, an antenna, an RF circuit, and a baseband circuit.

FIG. 3 is a functional block diagram of the conference system 100. The conference system 100 includes a processing unit 110, a storage unit 120, and a communication unit 130. The processing unit 110 includes a participant data acquisition unit 111, a first conference data creation unit 112, a first conference data distribution unit 113, a second conference data creation unit 114, a second conference data distribution unit 115, and a control unit 116.

The processing unit 110 corresponds to the processor 140 of FIG. 2A. The storage unit 120 corresponds to the memory 150 of FIG. 2A. The communication unit 130 corresponds to the communication interface 160 of FIG. 2A.

The storage unit 120 stores computer-readable instructions, and the functions of the processing unit 110 are realized as processing when the processing unit 110 executes those instructions. Specifically, the processor 140 operates according to the instructions stored in the memory 150, whereby the processing of each of the participant data acquisition unit 111, the first conference data creation unit 112, the first conference data distribution unit 113, the second conference data creation unit 114, the second conference data distribution unit 115, and the control unit 116 included in the processing unit 110 is executed. An instruction here may be an instruction from the instruction set that makes up a program, or an instruction that directs the operation of a hardware circuit of the processor 140.

The participant data acquisition unit 111 acquires participant data from the plurality of participant terminal devices 300 via the communication unit 130. The participant data acquisition unit 111 outputs the acquired participant data to the first conference data creation unit 112 and the second conference data creation unit 114.

When a participant terminal device 300 connects for the first time, it is assigned terminal identification information that uniquely identifies it. Using this terminal identification information, the participant data acquisition unit 111 can associate participant data with the participant terminal device 300 that sent it. For example, when the participant data acquisition unit 111 acquires audio data from a plurality of participant terminal devices 300, it can identify which participant terminal device 300 sent each piece of audio data. The participant data acquisition unit 111 may also accept input of a user name, for example when the participant terminal device 300 connects for the first time. The storage unit 120 stores the terminal identification information in association with user identification information such as the user name.
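The association between terminal identification information and user names can be illustrated with a small in-memory registry. This is a sketch under assumptions: the class `TerminalRegistry`, its method names, and the use of random UUIDs as terminal identifiers are all hypothetical choices, not details given in the description.

```python
import uuid
from typing import Dict, Optional


class TerminalRegistry:
    """Hypothetical store mapping terminal IDs to (possibly duplicated) user names."""

    def __init__(self) -> None:
        self._user_names: Dict[str, Optional[str]] = {}

    def register(self, user_name: Optional[str] = None) -> str:
        """Assign a fresh terminal ID on first connection and remember the user name."""
        terminal_id = uuid.uuid4().hex  # assumption: a random UUID serves as the unique ID
        self._user_names[terminal_id] = user_name
        return terminal_id

    def user_name_of(self, terminal_id: str) -> Optional[str]:
        """Look up the user name for a terminal, or None if unknown or not provided."""
        return self._user_names.get(terminal_id)


registry = TerminalRegistry()
first = registry.register("alice")
second = registry.register("alice")  # same user name, different terminal
```

Because the key is the terminal identifier rather than the user name, two terminals with the same user name remain distinguishable, which is what lets each audio stream be attributed to its sender.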

The first conference data creation unit 112 creates conference data for the video conference service based on the participant data acquired by the participant data acquisition unit 111. The participant data here includes, for example, audio data acquired by the microphone 394. It may also include captured image data captured by the camera 395, screen data displayed on the display unit 340, and the like. The conference data here includes, for example, conference audio data obtained by multiplexing the audio data from the plurality of participant terminal devices 300. The conference data may also include conference screen data generated by arranging the captured image data and screen data from the plurality of participant terminal devices 300 according to a given rule. The first conference data creation unit 112 outputs the created conference data to the first conference data distribution unit 113.
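The description does not say how the audio data is multiplexed. As one possible reading, the sketch below mixes per-terminal streams by summing samples and clipping to the 16-bit range. The function name `mix_audio`, the use of 16-bit PCM sample lists, and additive mixing itself are assumptions made for illustration.

```python
from typing import Dict, List

INT16_MIN, INT16_MAX = -32768, 32767


def mix_audio(streams: Dict[str, List[int]]) -> List[int]:
    """Mix per-terminal 16-bit PCM sample lists into one conference stream.

    streams maps a (hypothetical) terminal ID to its samples. Shorter streams
    are treated as silent once they run out of samples.
    """
    if not streams:
        return []
    length = max(len(samples) for samples in streams.values())
    mixed: List[int] = []
    for i in range(length):
        total = sum(samples[i] for samples in streams.values() if i < len(samples))
        # Clip so the sum stays within the 16-bit sample range.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


conference_audio = mix_audio({"T-001": [100, 200], "T-002": [50]})
```

Clipping is the simplest way to keep the sum in range; a production mixer would more likely apply gain control, which is outside the scope of this sketch.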

The conference screen data may be a file created with a markup language or the like, with the actual screen generation performed on the participant terminal device 300. In other words, conference screen data in the present embodiment is data for displaying the screen used in the conference on the display unit 340 of the participant terminal device 300, and is not limited to the data of the screen itself. The same applies to the conference screen data generated by the second conference data creation unit 114.

The first conference data distribution unit 113 transmits the conference data for the video conference service to the plurality of participant terminal devices 300 via the communication unit 130. Within the conference system 100, the first conference data creation unit 112 and the first conference data distribution unit 113 correspond to a video conference system that provides the video conference service. Hereinafter, the video conference system is also referred to as the first system.

The second conference data creation unit 114 creates conference data for the chat service based on the participant data acquired by the participant data acquisition unit 111. The participant data here includes, for example, key input data entered with the keyboard 393. For example, on the chat screen of FIG. 10 described later, when text is entered in the text posting area Re5 and the post button is pressed, the communication unit 330 of the participant terminal device 300 transmits that text to the conference system 100 as participant data. The chat screen here is a screen that displays text data posted from the participant terminal devices 300. The conference data of the second conference data creation unit 114 includes conference screen data for displaying the chat screen and the like. The second conference data creation unit 114 outputs the created conference data to the second conference data distribution unit 115.

The second conference data distribution unit 115 transmits the conference data for the chat service to the plurality of participant terminal devices 300 via the communication unit 130. Within the conference system 100, the second conference data creation unit 114 and the second conference data distribution unit 115 correspond to a chat system that provides the chat service. Hereinafter, the chat system is also referred to as the second system.

The control unit 116 controls each unit included in the conference system 100. For example, the control unit 116 performs read/write control of the storage unit 120 and communication control of the communication unit 130. The control unit 116 may also control each unit included in the processing unit 110.

The storage unit 120 stores various kinds of information such as the participant data, conference data, and identification information described above. The storage unit 120 may also store a program that runs on the participant terminal device 300. The program here is, for example, a web application program described later. The communication unit 130 performs various kinds of communication, such as receiving participant data and distributing conference data.

FIG. 4 is a functional block diagram of the monitoring system 200. The monitoring system 200 includes a processing unit 210, a storage unit 220, and a communication unit 230. The processing unit 210 includes an audio data acquisition unit 211, a speech recognition result acquisition unit 212, a defect detection unit 213, a transcript data output unit 214, and a control unit 215.

The processing unit 210 corresponds to the processor 240 of FIG. 2B. The storage unit 220 corresponds to the memory 250 of FIG. 2B. The communication unit 230 corresponds to the communication interface 260 of FIG. 2B.

The memory 250 stores computer-readable instructions, and the functions of the units included in the processing unit 210 are realized as processing when the processor 240 executes those instructions. Specifically, the processor 240 operates according to the instructions stored in the memory 250, whereby the processing of each of the audio data acquisition unit 211, the speech recognition result acquisition unit 212, the defect detection unit 213, the transcript data output unit 214, and the control unit 215 is executed.

The audio data acquisition unit 211 acquires, from the conference system 100, the audio data transmitted from the plurality of participant terminal devices 300 to the conference system 100. For example, the monitoring system 200 may join the conference as a guest user, as described later with reference to FIG. 7. In other words, the monitoring system 200 may connect to the conference system 100 as a participant terminal device 300. In that case, the audio data acquisition unit 211 acquires conference data including audio data from the conference system 100. The audio data acquisition unit 211 outputs the acquired audio data to the speech recognition result acquisition unit 212 and the defect detection unit 213.

The speech recognition result acquisition unit 212 acquires a speech recognition result, which is the result of performing speech recognition processing on the audio data. The speech recognition result acquisition unit 212 may output the audio data to an external speech recognition server via the communication unit 230 and acquire the speech recognition result from that server. In speech recognition processing, acoustic analysis is first performed to extract features from the audio data. An acoustic model is then applied to the result of the acoustic analysis to identify phonemes with similar features. Finally, a pronunciation dictionary and a language model are used to convert the phonemes into words and sentences, which yields the speech recognition result. The speech recognition result is data representing the result of converting the audio data into text. Because well-known techniques can be widely applied to the speech recognition processing of the present embodiment, further detailed description is omitted. The speech recognition processing is not limited to processing performed outside the monitoring system 200. For example, the speech recognition result acquisition unit 212 may itself perform the speech recognition processing to convert the audio data into text.

The defect detection unit 213 performs defect detection processing for detecting a defect in the audio data. A defect here includes at least one of a drop in volume and a deterioration in sound quality. Details of the defect detection processing are described later.

The transcript data output unit 214 outputs transcript data, which includes text that is the speech recognition result of the audio data, to the conference system 100 when a defect in the audio data is detected. For example, when the monitoring system 200 functions as a participant terminal device 300, the transcript data output unit 214 creates participant data including the transcript data and transmits the participant data to the conference system 100. The transcript data may be the text data of the speech recognition result itself, or may be that text data with some metadata added. The metadata includes, for example, identification information on the participant terminal device 300 that is the transmission source of the audio data.
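
The relationship between transcript data, its optional metadata, and the participant data that carries it can be sketched as below. This is only an illustration: the class name, the field names, and the dictionary layout are assumptions made for the example, since the description does not define a concrete data format.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TranscriptData:
    """Speech recognition text plus optional metadata (hypothetical layout)."""
    text: str
    # Identification information on the source terminal; optional because the
    # transcript data may be the bare text itself.
    source_terminal_id: Optional[str] = None


def to_participant_data(transcript: TranscriptData) -> Dict[str, Any]:
    """Wrap transcript data in a participant-data payload (assumed format)."""
    payload: Dict[str, Any] = {"type": "transcript", "text": transcript.text}
    if transcript.source_terminal_id is not None:
        payload["metadata"] = {"source_terminal_id": transcript.source_terminal_id}
    return payload
```

Keeping the metadata optional mirrors the two variants described above: text only, or text with identification information attached.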

制御部２１５は、監視システム２００に含まれる各部の制御を行う。例えば制御部２１５は、記憶部２２０の読み出し／書き込み制御、通信部２３０の通信制御、処理部２１０に含まれる各部の制御等を行う。 The control unit 215 controls each unit included in the monitoring system 200 . For example, the control unit 215 performs read/write control of the storage unit 220, communication control of the communication unit 230, control of each unit included in the processing unit 210, and the like.

The storage unit 220 stores various types of information such as audio data and transcript data. The communication unit 230 transmits and receives data to and from the conference system 100. When the monitoring system 200 functions as a participant terminal device 300, the communication unit 230 transmits participant data, receives conference data, and so on.

なお本実施形態の手法は、監視システム２００の各部において実行されるステップを含む情報処理方法に適用されてもよい。当該情報処理方法は、例えば複数の参加者端末装置３００による会議を提供する会議システム１００における音声データを監視するための情報処理方法である。 Note that the technique of the present embodiment may be applied to an information processing method including steps executed by each unit of the monitoring system 200 . The information processing method is, for example, an information processing method for monitoring audio data in a conference system 100 that provides a conference by a plurality of participant terminal devices 300 .

1.3 Participant Terminal Device
FIG. 5 is a hardware configuration diagram of the participant terminal device 300. The participant terminal device 300 includes a processor 360, a memory 370, a communication interface 380, a display 391, a pointing device 392, a keyboard 393, a microphone 394, a camera 395, and the like.

Various processors such as a CPU, a GPU, and a DSP can be used as the processor 360. The memory 370 may be a semiconductor memory such as an SRAM, a DRAM, a ROM, or a flash memory, or may be a register, a magnetic storage device, or an optical storage device. The communication interface 380 is an interface for communicating via a network and includes, for example, an antenna, an RF circuit, and a baseband circuit.

ディスプレイ３９１は、各種の表示画面を表示するためのものであり、例えば液晶ディスプレイや有機ＥＬディスプレイなどにより実現できる。 The display 391 is for displaying various display screens, and can be realized by, for example, a liquid crystal display or an organic EL display.

ポインティングデバイス３９２は、ディスプレイ３９１に表示されるポインタを移動させるための操作インターフェースである。ポインティングデバイス３９２は、マウス、ペンタブレット、タッチパッド、トラックボール等、種々のデバイスによって実現できる。 A pointing device 392 is an operation interface for moving a pointer displayed on the display 391 . The pointing device 392 can be implemented by various devices such as a mouse, pen tablet, touch pad, trackball, and the like.

キーボード３９３は、複数のキーを有し、当該キーに対する操作が行われることによって、対応する信号を出力する操作インターフェースである。なおキーボードの具体的な形状、キー配置、接続方法等は種々の変形実施が可能である。 The keyboard 393 is an operation interface that has a plurality of keys and outputs corresponding signals when the keys are operated. Various modifications can be made to the specific shape of the keyboard, key arrangement, connection method, and the like.

マイク３９４は、音声を受け付けて、音声情報を出力するインターフェースである。なお参加者端末装置３００は、プロセッサからの信号に基づいて、各種の音声を出力する不図示のスピーカを含んでもよい。 A microphone 394 is an interface that receives voice and outputs voice information. Note that the participant terminal device 300 may include a speaker (not shown) that outputs various sounds based on signals from the processor.

カメラ３９５は、参加者端末装置３００の所与の位置に配置され、例えばユーザの顔周辺を撮像した撮像画像を出力する。カメラ３９５は、例えば被写体からの光が入射されるレンズユニットと、当該レンズユニットを介して被写体像を結像して撮像画像信号を出力する撮像素子と、を含む。 The camera 395 is arranged at a given position of the participant terminal device 300, and outputs a captured image of, for example, the vicinity of the user's face. The camera 395 includes, for example, a lens unit into which light from a subject is incident, and an imaging device that forms a subject image via the lens unit and outputs a captured image signal.

FIG. 6 is a functional block diagram of the participant terminal device 300. The participant terminal device 300 includes a processing unit 310, a storage unit 320, a communication unit 330, a display unit 340, and a user input reception unit 350. The processing unit 310 includes a participant data transmission unit 311, a conference data presentation unit 312, and a control unit 313.

The processing unit 310 corresponds to the processor 360 in FIG. 5. The storage unit 320 corresponds to the memory 370 in FIG. 5. The communication unit 330 corresponds to the communication interface 380 in FIG. 5. The display unit 340 corresponds to the display 391 in FIG. 5. The user input reception unit 350 corresponds to at least one of the pointing device 392, the keyboard 393, the microphone 394, and the camera 395 in FIG. 5.

メモリ３７０はコンピュータによって読み取り可能な命令を格納しており、当該命令をプロセッサ３６０が実行することによって、処理部３１０に含まれる各部の機能が処理として実現される。例えば、会議システムの記憶部１２０は、Ｗｅｂアプリケーションプログラムを記憶している。参加者端末装置３００は、通信部３３０を介してＷｅｂアプリケーションプログラムを受信し、受信したＷｅｂアプリケーションプログラムを記憶部３２０に記憶する。処理部３１０は、記憶部３２０に記憶されたＷｅｂアプリケーションプログラムに従って動作することによって、処理部３１０の各部の機能を実現する。 The memory 370 stores computer-readable instructions, and when the processor 360 executes the instructions, the functions of the units included in the processing unit 310 are realized as processes. For example, the storage unit 120 of the conference system stores a web application program. The participant terminal device 300 receives the web application program via the communication unit 330 and stores the received web application program in the storage unit 320 . The processing unit 310 implements the functions of each unit of the processing unit 310 by operating according to the web application program stored in the storage unit 320 .

The participant data transmission unit 311 creates participant data and transmits it to the conference system 100 via the communication unit 330. Examples of participant data are as described above: the participant data may be data received by the user input reception unit 350, data displayed on the display unit 340, or an upload file selected by the user.

会議データ提示部３１２は、通信部３３０が会議システム１００から受信した会議データを提示する処理を行う。例えば、会議データ提示部３１２は、第１会議データ配信部１１３によって送信された会議データを取得する。会議データ提示部３１２は、当該会議データのうちの会議画面データを表示部３４０に表示する処理や、会議音声データをスピーカに出力する処理を行う。これにより、ビデオ会議サービスが提供される。 The conference data presenting unit 312 performs processing for presenting the conference data received by the communication unit 330 from the conference system 100 . For example, the conference data presentation unit 312 acquires the conference data transmitted by the first conference data delivery unit 113 . The conference data presentation unit 312 performs processing for displaying conference screen data of the conference data on the display unit 340 and processing for outputting conference audio data to a speaker. This provides video conferencing services.

また会議データ提示部３１２は、第２会議データ配信部１１５によって送信された会議データを取得する。当該会議データは、チャット画面を表示するための会議画面データであるチャット画面データである。会議データ提示部３１２は、チャット画面データを表示部３４０に表示する処理を行う。これにより、チャットサービスが提供される。なお、会議データ提示部３１２は、受信した会議データをそのまま提示してもよいし、何らかの処理を行った結果を提示してもよい。 Also, the conference data presentation unit 312 acquires the conference data transmitted by the second conference data delivery unit 115 . The conference data is chat screen data that is conference screen data for displaying a chat screen. The conference data presentation unit 312 performs processing for displaying chat screen data on the display unit 340 . This provides a chat service. Note that the conference data presentation unit 312 may present the received conference data as it is, or may present the result of performing some processing.

制御部３１３は、参加者端末装置３００に含まれる各部の制御を行う。例えば制御部３１３は、記憶部３２０の読み出し／書き込み制御や、通信部３３０の通信制御、表示部３４０の表示制御、ユーザ入力受付部３５０の各部の制御を行う。また制御部３１３は、処理部３１０に含まれる各部の制御を行ってもよい。 The control unit 313 controls each unit included in the participant terminal device 300 . For example, the control unit 313 performs read/write control of the storage unit 320 , communication control of the communication unit 330 , display control of the display unit 340 , and control of each unit of the user input reception unit 350 . Also, the control unit 313 may control each unit included in the processing unit 310 .

The storage unit 320 stores various types of information such as the participant data and conference data described above. The communication unit 330 performs various types of communication such as transmitting participant data and receiving conference data. The display unit 340 displays conference screen data. The user input reception unit 350 receives input from the user. As described above, user input may be made with various devices and may include input by voice, gesture, and the like. Voice is detected based on the output of the microphone 394. Gestures are detected based on the output of the camera 395.

2. Flow of Processing
FIG. 7 is a sequence diagram illustrating the flow of processing in the present embodiment. First, a user performs an operation to create a conference using a participant terminal device 300. The user here is, for example, the organizer of the conference, and the participant terminal device 300 is the device used by the organizer. For example, the organizer performs an operation to create setting data used for the conference, such as the start time of the conference, the participants, and the authority granted to each participant. Based on this operation, in step S101 the participant terminal device 300 transmits conference creation data, which includes the setting data, for example, to the conference system 100.

In step S102, the conference system 100 performs processing for providing the conference to the participant terminal devices 300. Specifically, the conference system 100 transmits data for joining the conference, such as a conference ID, a conference URL (uniform resource locator), and a password, to the participant terminal devices 300 designated as conference participants. Each participant terminal device 300 uses this data to transmit a request to join the conference to the conference system 100. When the conference system 100 approves the join request, it provides the conference by starting to receive participant data from that participant terminal device 300 and to distribute conference data to it. Although details are omitted in FIG. 7, as the above description shows, the processing of step S101 is performed by a specific participant terminal device 300 such as the organizer's, and the processing of step S102 is executed for the plurality of participant terminal devices 300 that are participants.

ステップＳ１０３において、監視システム２００は、監視対象である会議に、ゲストとして参加する処理を行う。例えば、会議システム１００は、監視システム２００を会議の参加者として招待する処理を行ってもよい。具体的には、会議システム１００は、会議ＩＤ等を含む会議に参加するためのデータを監視システム２００に送信する。監視システム２００は、当該データを用いて指定の会議への参加要求を行うことによって、ステップＳ１０３の処理を実行する。あるいは、主催者であるユーザが、音声データの監視を行うか否かを決定してもよい。例えば、主催者が音声データの監視を実行する旨の操作を行った場合に、会議システム１００が監視システム２００を会議に招待する処理を実行する。あるいは、主催者が監視システム２００を利用する権限を有するユーザであるか否かを、会議システム１００が判定してもよい。主催者が当該権限を有すると判定した場合に、会議システム１００は、監視システム２００を会議の参加者として招待する処理を行う。 In step S103, the monitoring system 200 performs processing for participating as a guest in the conference to be monitored. For example, the conference system 100 may perform processing to invite the monitoring system 200 as a conference participant. Specifically, the conference system 100 transmits data for participating in the conference including the conference ID and the like to the monitoring system 200 . The monitoring system 200 executes the process of step S103 by requesting participation in the designated conference using the data. Alternatively, the user who is the organizer may decide whether or not to monitor the audio data. For example, when the organizer performs an operation to monitor voice data, the conference system 100 executes processing for inviting the monitoring system 200 to the conference. Alternatively, the conference system 100 may determine whether the organizer is a user authorized to use the monitoring system 200 . If it is determined that the host has the authority, the conference system 100 performs processing to invite the monitoring system 200 as a conference participant.
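
The alternative conditions under which the conference system 100 invites the monitoring system 200 can be summarized in a small decision function. The sketch below is illustrative only: the policy names and the two boolean inputs are assumptions introduced for the example and do not appear in the description.

```python
from enum import Enum


class InvitePolicy(Enum):
    """Hypothetical names for the invitation variants described in step S103."""
    ALWAYS = "always"                              # invite for every conference
    ORGANIZER_OPT_IN = "organizer_opt_in"          # organizer requested audio monitoring
    AUTHORIZED_ORGANIZER = "authorized_organizer"  # organizer may use the monitoring system


def should_invite_monitor(policy: InvitePolicy,
                          organizer_opted_in: bool = False,
                          organizer_authorized: bool = False) -> bool:
    """Return True when the conference system should invite the monitoring system."""
    if policy is InvitePolicy.ALWAYS:
        return True
    if policy is InvitePolicy.ORGANIZER_OPT_IN:
        return organizer_opted_in
    if policy is InvitePolicy.AUTHORIZED_ORGANIZER:
        return organizer_authorized
    raise ValueError(f"unknown policy: {policy!r}")
```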

ステップＳ１０４において、会議システム１００は、ステップＳ１０３の参加要求に対して、承認処理を行う。これにより、監視システム２００は、複数の参加者端末装置３００と同様に、会議システム１００への参加者データの送信、及び、会議システム１００からの会議データの受信が可能になる。 In step S104, the conference system 100 performs approval processing for the participation request in step S103. As a result, the monitoring system 200 can transmit participant data to the conference system 100 and receive conference data from the conference system 100 in the same manner as the plurality of participant terminal devices 300 .

ステップＳ１０１～Ｓ１０４の処理によって、会議システム１００によって会議が提供され、複数の参加者端末装置３００及び監視システム２００が、当該会議に参加した状態となる。 By the processing of steps S101 to S104, a conference is provided by the conference system 100, and a plurality of participant terminal devices 300 and the monitoring system 200 enter a state of participating in the conference.

会議中は、複数の参加者端末装置３００は、それぞれ所与のタイミングで参加者データを会議システム１００に送信する。会議システム１００は、参加者データに基づいて生成した会議データを、複数の参加者端末装置３００に配信する。 During the conference, the plurality of participant terminal devices 300 each transmit participant data to the conference system 100 at given timings. The conference system 100 distributes conference data generated based on participant data to a plurality of participant terminal devices 300 .

As described above, the conference system 100 may include a first system that acquires audio data from the plurality of participant terminal devices 300 and provides a conference using at least audio, and a second system that acquires text data from the plurality of participant terminal devices 300 as participant data and provides a conference by chat. The first system corresponds to the first conference data creation unit 112 and the first conference data distribution unit 113 and provides a video conference service. The second system corresponds to the second conference data creation unit 114 and the second conference data distribution unit 115 and provides a chat service.

FIG. 9 is an example of a display screen used in the video conference system, and FIG. 10 is an example of a display screen used in the chat system. FIGS. 9 and 10 show screens displayed on the display unit 340 of the participant terminal device 300 based on conference screen data from the conference system 100.

図９に示すように、ビデオ会議サービスにおける表示画面は、参加者表示領域Ｒｅ１、画像表示領域Ｒｅ２、操作ボタン表示領域Ｒｅ３等を含む。参加者表示領域Ｒｅ１は、会議の参加者を表すユーザ識別情報を表示する領域である。図９の例では、会議の参加者は、参加者端末装置３００を用いるユーザである参加者Ａ及び参加者Ｂと、監視システム２００に対応する擬似的なユーザである。図９における、「ゲストＲｅｃｏｒｄｅｒ」が擬似的なユーザを表す。画像表示領域Ｒｅ２は、アップロードされたファイルの具体的な内容や、カメラ３９５によって撮像された各参加者の顔周辺の画像等が表示される領域である。操作ボタン表示領域Ｒｅ３は、マイク３９４やカメラ３９５のオンオフ操作、音量操作等を行うためのオブジェクトが表示される領域である。 As shown in FIG. 9, the display screen in the video conference service includes a participant display area Re1, an image display area Re2, an operation button display area Re3, and the like. The participant display area Re1 is an area for displaying user identification information representing conference participants. In the example of FIG. 9, the participants of the conference are participants A and B who are users using the participant terminal device 300 and pseudo users corresponding to the monitoring system 200 . "Guest Recorder" in FIG. 9 represents a pseudo user. The image display area Re2 is an area in which specific contents of the uploaded file, an image around each participant's face captured by the camera 395, and the like are displayed. The operation button display area Re3 is an area where objects for performing on/off operations of the microphone 394 and the camera 395, volume operations, and the like are displayed.

図１０に示すように、チャットサービスにおける表示画面は、テキスト表示領域Ｒｅ４と、テキスト投稿領域Ｒｅ５を含む。テキスト表示領域Ｒｅ４は、参加者によって投稿されたテキストデータを、投稿した参加者と対応づけて時系列に表示する領域である。テキスト投稿領域Ｒｅ５は、投稿対象となるテキストを入力する領域と、投稿ボタンを表示する領域とを含む。 As shown in FIG. 10, the display screen in the chat service includes a text display area Re4 and a text posting area Re5. The text display area Re4 is an area for displaying the text data posted by the participants in chronological order in association with the posted participants. The text posting area Re5 includes an area for inputting text to be posted and an area for displaying a post button.

図９及び図１０に示すように、方式の異なる複数の会議を提供することによって、会議をより円滑に進行することが可能になる。 As shown in FIGS. 9 and 10, by providing a plurality of conferences with different methods, the conference can proceed more smoothly.

Returning to FIG. 7, the description continues. After the start of a conference joined by the plurality of participant terminal devices 300 and the monitoring system 200, suppose that in step S105 a given participant terminal device 300 among the plurality of participant terminal devices 300 transmits participant data including audio data to the conference system 100. The processing of step S105 is executed repeatedly during the conference, for example.

In step S106, the conference system 100 transmits the acquired audio data to the monitoring system 200. In step S106, the conference system 100 may transmit to the monitoring system 200 the same conference data that it transmits to the plurality of participant terminal devices 300. For example, the conference system 100 creates conference audio data and transmits it to the plurality of participant terminal devices 300 and the monitoring system 200. Alternatively, the conference system 100 may separately create and transmit conference data for the monitoring system 200. For example, the conference system 100 extracts the audio data from the collected participant data, associates each item of extracted audio data with identification information on the participant terminal device 300 that transmitted it, and transmits the result to the monitoring system 200. For example, the monitoring system 200 receives information in which audio data from a first participant terminal device among the plurality of participant terminal devices 300 is associated with identification information on that first participant terminal device.
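
The variant in which audio data is extracted from the collected participant data and associated with the identification information of its source terminal can be sketched as follows. The record types and field names are hypothetical and serve only to make the association concrete.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ParticipantData:
    """One item of participant data (hypothetical shape)."""
    terminal_id: str
    audio: Optional[bytes] = None   # present only when the item carries audio
    text: Optional[str] = None      # e.g. a chat post


@dataclass
class TaggedAudio:
    """Audio data associated with the identifier of its source terminal."""
    terminal_id: str
    audio: bytes


def extract_tagged_audio(items: Iterable[ParticipantData]) -> List[TaggedAudio]:
    """Keep only items that carry audio and tag each with its source terminal."""
    return [TaggedAudio(p.terminal_id, p.audio) for p in items if p.audio is not None]
```

Tagging each item at this point is what later allows metadata about the source terminal to be attached to the text derived from that audio.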

このように、会議システム１００から音声データを取得するステップにおいて、監視システム２００は、参加者端末装置３００として、会議システム１００と通信してもよい。監視システム２００は、会議システム１００が複数の参加者端末装置３００にデータを配信する際の通信方式に従って、会議システム１００から音声データを受信する。 Thus, in the step of acquiring audio data from the conference system 100 , the monitoring system 200 may communicate with the conference system 100 as the participant terminal device 300 . The monitoring system 200 receives audio data from the conference system 100 according to the communication method used when the conference system 100 distributes data to the plurality of participant terminal devices 300 .

In this way, the conference system 100 can treat the monitoring system 200 as one of the participant terminal devices 300. For example, the conference system 100 can have the same configuration as a conventional system that collects participant data and distributes conference data, while the audio monitoring function according to the technique of the present embodiment is realized as a system separate from the conference system 100. In this case, there is the advantage that building the system according to the present embodiment requires no change to the conference system 100 itself. In addition, the interface of the monitoring system 200 that exchanges data with the conference system 100 can use the same configuration as the participant terminal device 300. For example, some functions of the monitoring system 200 may be realized by the processor 240 operating according to a web application program in the same way as the participant terminal device 300. This allows input and output with the conference system 100 to use a standard protocol such as HTTPS (Hypertext Transfer Protocol Secure), which makes the monitoring system 200 easier to implement.

監視システム２００の音声認識結果取得部２１２は、ステップＳ１０６において取得した音声データの音声認識結果を取得する。図７では、外部の音声認識サーバによって音声認識処理が行われる例を図示している。ステップＳ１０７において、監視システム２００の通信部２３０は、音声データを音声認識サーバに送信する。 The voice recognition result acquisition unit 212 of the monitoring system 200 acquires the voice recognition result of the voice data acquired in step S106. FIG. 7 illustrates an example in which speech recognition processing is performed by an external speech recognition server. In step S107, the communication unit 230 of the monitoring system 200 transmits the voice data to the voice recognition server.

ステップＳ１０８において、音声認識サーバは、音声認識処理を行う。上述したように、ステップＳ１０８の音声認識処理は、公知の手法を広く適用可能である。ステップＳ１０９において、音声認識サーバは、音声認識結果を監視システム２００に送信する。 In step S108, the speech recognition server performs speech recognition processing. As described above, the speech recognition processing in step S108 can widely apply known methods. In step S109 , the speech recognition server transmits the speech recognition result to the monitoring system 200 .

ステップＳ１１０において、監視システム２００は、音声データの音声認識結果を取得し、当該音声認識結果を議事録データとして記憶してもよい。このようにすれば、議事録データを自動作成することが可能になる。音声認識結果が、Ｓ１１２以降で説明するように不具合発生時の対処として利用されるだけでなく、議事録データとしても利用されるため、データの有効活用が可能になる。音声認識結果を議事録データとして保持することで、会議に関するデータの管理や活用が容易になる。またステップＳ１１０の処理は、Ｓ１１２以降の処理と比較すればわかるように、音声の不具合検出結果によらず実行されてもよい。このようにすれば、会議中の多くの場面が議事録データの作成対象となるため、会議の全体的な内容の把握に有用な議事録データの作成が可能になる。 In step S110, the monitoring system 200 may acquire the speech recognition result of the speech data and store the speech recognition result as minutes data. In this way, it becomes possible to automatically create minutes data. Since the speech recognition result is used not only as a countermeasure when a problem occurs, but also as meeting minutes data, as will be described in S112 and later, the data can be used effectively. Storing speech recognition results as meeting minutes data facilitates management and utilization of meeting-related data. Further, as can be seen by comparing with the processing after S112, the processing of step S110 may be executed regardless of the sound defect detection result. In this way, many scenes in the conference are subject to creation of minutes data, so that it is possible to create minutes data that is useful for understanding the overall content of the conference.
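
A minimal sketch of minutes data that accumulates every speech recognition result, whether or not a defect was detected, is shown below. The class and its line format are assumptions for illustration; the description does not specify how minutes data is stored.

```python
from typing import List, Tuple


class MinutesStore:
    """Accumulates recognition results as minutes data (hypothetical design)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []

    def append(self, speaker_id: str, text: str) -> None:
        # Called for every recognition result, independent of defect detection.
        self._entries.append((speaker_id, text))

    def render(self) -> str:
        """Return the minutes as one line per utterance, in arrival order."""
        return "\n".join(f"{speaker}: {text}" for speaker, text in self._entries)
```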

In step S111, the defect detection unit 213 of the monitoring system 200 performs defect detection processing for detecting an audio defect in the audio data acquired in step S106. An audio defect here means at least one of the following: the volume (sound pressure) has decreased by a predetermined value or more relative to a given reference volume, or the sound quality has dropped relative to a given reference sound quality to an extent that satisfies a predetermined condition. The processing of step S111 can be executed at any time after the audio data is acquired in step S106, and its order relative to steps S107 to S110 is not limited to the one illustrated in FIG. 7.

図８は、ステップＳ１１１の不具合検出処理を説明するフローチャートである。この処理が開始されると、まずステップＳ２０１において、不具合検出部２１３は、不具合検出処理の対象となる検出期間の音声データの音量を検出する。ここでの音量は、音声データの振幅値であって、例えばｄＢを単位とする数値データである。不具合検出部２１３は、検出期間における音声データの平均音量を求めてもよいし、最大音量や最低音量を求めてもよい。なおここでの検出期間は任意の設定が可能であり、数秒～数十秒程度の時間であってもよいし、より長い時間であってもよい。また同一人物が継続して発話している期間を検出し、当該期間を用いて動的に検出期間が設定されてもよい。 FIG. 8 is a flowchart for explaining the defect detection processing in step S111. When this process is started, first, in step S201, the defect detection unit 213 detects the sound volume of the audio data during the detection period that is the target of the defect detection process. The volume here is the amplitude value of audio data, and is numerical data in units of dB, for example. The defect detection unit 213 may obtain the average volume of the audio data during the detection period, or may obtain the maximum volume or the minimum volume. Note that the detection period here can be set arbitrarily, and may be several seconds to several tens of seconds or longer. Alternatively, a period during which the same person continues to speak may be detected, and the detection period may be set dynamically using the detected period.
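
One possible way to obtain the volume of step S201 is sketched below. It assumes that the audio data is available as floating-point samples normalized to the range -1.0 to 1.0, that the level is measured as the RMS of fixed-size frames in dB relative to full scale, and that the average is taken over the per-frame dB values. None of these choices is prescribed by the description, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

FLOOR_DB = -120.0  # assumed floor so that digital silence does not produce -inf


def frame_levels_db(samples: Sequence[float], frame_size: int) -> List[float]:
    """RMS level of each full frame in dB relative to full scale (amplitude 1.0)."""
    levels: List[float] = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        levels.append(max(FLOOR_DB, 20.0 * math.log10(rms)) if rms > 0 else FLOOR_DB)
    return levels


def summarize_volume(levels_db: Sequence[float]) -> Dict[str, float]:
    """Average, maximum, and minimum level over the detection period."""
    return {
        "mean": sum(levels_db) / len(levels_db),
        "max": max(levels_db),
        "min": min(levels_db),
    }
```

The detection period corresponds to the span of samples passed in, so a dynamically chosen period (for example, one speaker's continuous utterance) only changes which samples are supplied.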

ステップＳ２０２において、不具合検出部２１３は、検出された音量の値が所与の音量閾値以下かを判定する。音量が音量閾値以下と判定された場合、ステップＳ２０３において、不具合検出部２１３は、不具合ありと判定する。ここでの音量閾値は、所与の固定値であってもよいし、会議中に取得された音声データの平均音量等を用いて動的に設定されてもよい。 In step S202, the defect detection unit 213 determines whether the detected volume value is equal to or less than a given volume threshold. If it is determined that the volume is equal to or less than the volume threshold, the fault detection unit 213 determines that there is a fault in step S203. The volume threshold here may be a given fixed value, or may be dynamically set using the average volume of voice data acquired during the meeting.
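
The comparison of step S202 with a dynamically set threshold can be sketched as follows. The rule used here, the running average of levels observed so far minus a fixed margin, is only one assumed way of deriving a threshold from the average volume during the conference; the margin and fallback values are illustrative.

```python
class DynamicVolumeThreshold:
    """Threshold derived from the running average level (assumed rule)."""

    def __init__(self, margin_db: float = 20.0, fallback_db: float = -50.0) -> None:
        self.margin_db = margin_db      # how far below the average counts as a defect
        self.fallback_db = fallback_db  # used until at least one level has been observed
        self._sum = 0.0
        self._count = 0

    def observe(self, level_db: float) -> None:
        """Record the level of audio data acquired during the conference."""
        self._sum += level_db
        self._count += 1

    @property
    def value(self) -> float:
        if self._count == 0:
            return self.fallback_db
        return self._sum / self._count - self.margin_db


def is_volume_defect(level_db: float, threshold_db: float) -> bool:
    """Step S202: a defect is flagged when the level is at or below the threshold."""
    return level_db <= threshold_db
```

A fixed threshold is the special case in which `is_volume_defect` is called with a constant instead of `DynamicVolumeThreshold.value`.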

If the volume value is determined in step S202 to be greater than the volume threshold, the defect detection unit 213 evaluates the sound quality. For example, in step S204, the defect detection unit 213 performs sound source separation processing that separates the audio data into voice and noise. One known technique, for example, performs sound source separation with nonlinear filtering by exploiting the fact that, in a three-dimensional spectrogram of time, frequency, and signal component intensity, signals from multiple sound sources rarely overlap. In recent years, machine learning techniques have also become widely known that generate a trained model for sound source separation from a data set pairing audio data with superimposed noise and data in which the voice portion has been extracted. The defect detection unit 213 may perform the sound source separation processing of step S204 by acquiring such a trained model and inputting the audio data into it.

ステップＳ２０５において、不具合検出部２１３は、分離された音声の信号と、ノイズの信号とに基づいて、音質の指標値を算出する。ここでの指標値は、例えば音声の信号レベルとノイズの信号レベルの比であるＳ／Ｎ比である。 In step S205, the defect detection unit 213 calculates a sound quality index value based on the separated audio signal and noise signal. The index value here is, for example, the S/N ratio, which is the ratio of the signal level of voice to the signal level of noise.
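
Given the separated voice and noise signals, the S/N ratio of step S205 can be computed as a power ratio expressed in dB. The sketch below assumes both signals are non-empty sequences of samples; the function names are hypothetical.

```python
import math
from typing import Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty signal."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(voice: Sequence[float], noise: Sequence[float]) -> float:
    """S/N ratio in dB: 10 * log10(voice power / noise power)."""
    p_voice = mean_power(voice)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        return math.inf    # no noise component was separated out
    if p_voice == 0.0:
        return -math.inf   # no voice component was separated out
    return 10.0 * math.log10(p_voice / p_noise)
```

For example, a voice signal whose amplitude is ten times that of the noise has one hundred times the power, which gives an S/N ratio of about 20 dB.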

ステップＳ２０６において、不具合検出部２１３は、算出されたＳ／Ｎ比が所与のＳＮ閾値以下かを判定する。Ｓ／Ｎ比がＳＮ閾値以下である場合に、ステップＳ２０３に移行し、不具合検出部２１３は、不具合ありと判定する。ここでのＳＮ閾値は、所与の固定値であってもよいし、会議中に取得されたＳ／Ｎ比の平均値等を用いて動的に設定されてもよい。ステップＳ２０６でＳ／Ｎ比がＳＮ閾値より大きい場合、ステップＳ２０７において、不具合検出部２１３は、不具合なしと判定する。 In step S206, the defect detection unit 213 determines whether the calculated S/N ratio is equal to or less than a given SN threshold. When the S/N ratio is equal to or less than the SN threshold, the process proceeds to step S203, and the defect detection unit 213 determines that there is a defect. The SN threshold here may be a given fixed value, or may be dynamically set using an average value of S/N ratios obtained during the conference. If the S/N ratio is greater than the SN threshold in step S206, the defect detection unit 213 determines that there is no defect in step S207.
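
The decision flow of FIG. 8 (steps S201 to S207) can be sketched as a single function. The measurement and separation steps are passed in as callables because the description leaves their concrete methods open; the parameter names are assumptions for the example.

```python
from typing import Callable, Sequence, Tuple

Signal = Sequence[float]


def detect_defect(
    samples: Signal,
    volume_threshold_db: float,
    snr_threshold_db: float,
    measure_volume_db: Callable[[Signal], float],
    separate: Callable[[Signal], Tuple[Signal, Signal]],
    measure_snr_db: Callable[[Signal, Signal], float],
) -> bool:
    """Return True when a defect is found, following the order of FIG. 8."""
    volume_db = measure_volume_db(samples)        # S201: detect the volume
    if volume_db <= volume_threshold_db:          # S202: volume at or below threshold?
        return True                               # S203: defect
    voice, noise = separate(samples)              # S204: sound source separation
    snr = measure_snr_db(voice, noise)            # S205: sound quality index
    if snr <= snr_threshold_db:                   # S206: S/N at or below threshold?
        return True                               # S203: defect
    return False                                  # S207: no defect
```

Because the volume check returns early, the comparatively expensive separation step runs only for audio that is loud enough to be worth evaluating for quality.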

Note that FIG. 8 shows one example of the processing, and the defect detection processing of the present embodiment is not limited to it. For example, when noise related to the microphone 394 of the participant terminal device 300, noise related to communication between the participant terminal device 300 and the conference system 100, and the like are assumed as causes of defects in the voice data, typical data at the time a defect occurs can be estimated in advance for each cause. That is, the defect detection unit 213 may hold normal data and one or more pieces of abnormal data in advance, and may determine the presence or absence of a defect according to whether the acquired voice data is similar to the normal data or to the abnormal data. In addition, although FIG. 8 describes an example in which no defect is determined when both the volume and the sound quality satisfy their conditions, the defect detection processing may evaluate only the volume or only the sound quality. The defect detection of the present embodiment can also be extended to other processing related to volume and sound quality.
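For example, the comparison with normal data and abnormal data held in advance may be sketched as a nearest-reference classification. The following Python sketch is an illustration under assumptions only: the feature vector (a volume value and an S/N ratio), the Euclidean distance, the labels, and the numeric values in `REFERENCES` are all hypothetical and do not come from the embodiment. In practice the features would likely need to be normalized so that no single feature dominates the distance.

```python
import math
from typing import Dict, List, Sequence

Feature = Sequence[float]

# Hypothetical reference data: one normal example and one example per defect cause.
REFERENCES: Dict[str, List[Feature]] = {
    "normal": [[0.5, 30.0]],
    "microphone_noise": [[0.5, 5.0]],
    "network_dropout": [[0.02, 20.0]],
}


def distance(a: Feature, b: Feature) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(feature: Feature, references: Dict[str, List[Feature]]) -> str:
    """Return the label of the reference data closest to the given feature."""
    best_label, best_dist = None, math.inf
    for label, examples in references.items():
        for example in examples:
            d = distance(feature, example)
            if d < best_dist:
                best_label, best_dist = label, d
    if best_label is None:
        raise ValueError("no reference data available")
    return best_label


def has_defect(feature: Feature, references: Dict[str, List[Feature]]) -> bool:
    """A defect is reported when the closest reference is not the normal data."""
    return classify(feature, references) != "normal"
```

Keeping one label per assumed cause means that the same lookup could also indicate which cause a detected defect most resembles.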

Returning to FIG. 7, the description continues. When the defect detection processing detects a defect in the voice data, the monitoring system 200 executes processing for keeping the defect from halting the progress of the conference. Specifically, first, in step S112, the transcript data output unit 214 creates participant data including transcript data, which is a voice recognition result of the voice data. The transcript data here is, for example, the voice recognition result of the voice data determined to have a defect. However, the transcript data may include a voice recognition result of at least one of voice data acquired before the voice data in which the defect was detected and voice data acquired after it. Moreover, the transcript data need not include all of the voice data in which the defect was detected; part of it may be omitted. In step S113, the transcript data output unit 214 transmits the transcript data to the conference system 100.
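For example, the creation of transcript data that also covers voice data acquired before and after the defective voice data may be sketched as follows. This Python sketch rests on assumptions: the `Segment` type, the `build_transcript` function, and the `context` parameter (the number of neighbouring segments included on each side) are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    text: str        # voice recognition result for this segment
    defective: bool  # result of the defect detection for this segment


def build_transcript(segments: List[Segment], context: int = 1) -> str:
    """Join the text of defective segments plus `context` neighbours on each side."""
    keep = set()
    for i, seg in enumerate(segments):
        if seg.defective:
            lo = max(0, i - context)
            hi = min(len(segments), i + context + 1)
            keep.update(range(lo, hi))
    # Segments with an empty recognition result are omitted from the output.
    return " ".join(segments[i].text for i in sorted(keep) if segments[i].text)
```

With `context=0` only the defective segments are included, which corresponds to the basic case described above; a larger value adds surrounding speech that may help readers follow the utterance.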

As described above, the monitoring system 200 may connect to the conference system 100 as a participant terminal device 300. As shown in steps S112 and S113, the monitoring system 200 may transmit the transcript data to the conference system 100 according to the communication method used when the plurality of participant terminal devices 300 transmit data to the conference system 100.

In this way, the monitoring system 200 can be handled as one of the participant terminal devices 300. Therefore, as described above, the method of the present embodiment can be easily realized without changing the conference system 100 itself. Moreover, since the input/output interface between the monitoring system 200 and the conference system 100 can be realized with the same configuration as that of the participant terminal devices 300, the monitoring system 200 is easy to implement.

In step S114, the conference system 100 creates conference data including the transcript data. In the step of outputting the transcript data shown in step S113, the monitoring system 200 may output the transcript data to the second system. That is, the conference data created in step S114 may be chat screen data, which is the conference screen data used in the chat system, and may be created by the second conference data creation unit 114. Using the chat system in this way makes it possible to present the content of the voice data affected by the defect to the user in a form in which the chronological order and the content are easy to see.

In step S115, the second conference data distribution unit 115 of the conference system 100 distributes the chat screen data to the plurality of participant terminal devices 300. In step S116, each of the plurality of participant terminal devices 300 displays, on the display unit 340, a chat screen showing the transcript data based on the received chat screen data.

As described above, in the step of acquiring the voice data from the conference system 100 shown in step S106, the monitoring system 200 may acquire identification information on the participant terminal device 300 that is the transmission source of the voice data in association with the voice data. Then, in the step of outputting the transcript data to the conference system 100 shown in step S113, the monitoring system 200 may associate the identification information indicating the transmission source of the voice data in which the defect was detected with the transcript data and output them to the conference system 100.

For example, suppose that participant A and participant B are present as shown in FIG. 9 and that the monitoring system 200 detects a defect in the voice data associated with participant B. In this case, the transcript data output unit 214 outputs the transcript data in association with participant B. This makes it possible to detect which participant terminal device 300 sent the defective voice data and to present the detection result to the user in an easy-to-understand manner.

At this time, in the step of outputting the chat screen data to the plurality of participant terminal devices 300 shown in step S115, the second system of the conference system 100 may output chat screen data that displays the user represented by the identification information associated with the voice data as the poster and displays the transcript data as the posted content. For example, the monitoring system 200 may transmit to the conference system 100 data disguised as if the monitoring system 200 itself were the participant terminal device 300 of participant B. Alternatively, the second conference data creation unit 114 of the conference system 100 may modify the data posted by the monitoring system 200 so that it appears to have been posted by the participant terminal device 300 of participant B.

In the example of FIG. 10, the actual poster is the monitoring system 200, but participant B is displayed in the area representing the poster, and text corresponding to the transcript data is displayed in the area representing the posted content. This allows a user viewing the chat screen to easily understand which participant's voice had a defect. Explicitly indicating that the post was made by the monitoring system 200 makes it easy to distinguish it from data actually posted by participant B. For example, FIG. 10 displays a guidance message, "An interruption in participant B's voice was detected, so transcription will start," together with an indication such as "BOT".
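For example, a chat post that shows the transmission source as the poster while marking the post as made by the monitoring system may be sketched as follows. This Python sketch is an assumption-laden illustration: the `ChatPost` type, its field names, and the `[BOT]` rendering are hypothetical and are not the chat screen data format of the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatPost:
    poster: str          # name shown in the poster area of the chat screen
    body: str            # text shown in the posted-content area
    posted_by_bot: bool  # True when the monitoring system made the post


def make_transcript_post(source_id: str, transcript: str) -> ChatPost:
    """Create a post attributed to the transmission source but flagged as automatic."""
    return ChatPost(poster=source_id, body=transcript, posted_by_bot=True)


def render(post: ChatPost) -> str:
    """Render one chat line; automatic posts carry a BOT marker after the name."""
    label = f"{post.poster} [BOT]" if post.posted_by_bot else post.poster
    return f"{label}: {post.body}"
```

Carrying the flag as a separate field, rather than embedding it in the poster name, lets the chat screen decide how to present the distinction.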

As described above with reference to FIGS. 7 to 10, in the method of the present embodiment, when a defect is detected in the voice data, the related voice recognition result is output to the conference system 100 as transcript data. The conference system 100 can therefore appropriately detect the voice defect. For example, the conference system 100 can notify the participants of the voice defect by distributing the transcript data to the participant terminal devices 300. Because the distribution of the transcript data reveals the defect to the speaker of the voice data, the speaker can refrain from continuing to speak and can take action such as identifying and eliminating the cause of the defect. Participants other than the speaker can understand the content of the utterance at the time of the defect to some extent by referring to the transcript data, which keeps the conference from coming to a halt. When the voice data has a defect, a human user may have difficulty recognizing the content because the voice is interrupted, quiet, or noisy. Voice recognition processing, however, can apply signal processing such as noise reduction and interpolation, so it is considered possible to obtain a voice recognition result good enough to convey the general content of the utterance.

As shown in FIG. 7, when the defect detection processing of step S111 detects no defect in the voice data, the monitoring system 200 skips the processing from step S112 onward. That is, when no defect is detected in the voice data, the monitoring system 200 does not output to the conference system 100 the transcript data including the text that is the voice recognition result of the voice data. This suppresses the output of information that is of little use when there is no defect. From a different point of view, when transcript data is output, it is clear that the cause is a defect in the voice data.
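For example, the branch between steps S111 and S113 may be sketched as a loop that sends transcript data only for voice data in which a defect was detected. The following Python sketch is hypothetical: the `monitor` function and its callable parameters (`detect_defect`, `recognize`, `send`) are assumed names, and the actual detection, voice recognition, and transmission are left to the caller.

```python
from typing import Callable, Iterable, Tuple

VoiceItem = Tuple[str, bytes]  # (identification information, voice data)


def monitor(
    items: Iterable[VoiceItem],
    detect_defect: Callable[[bytes], bool],
    recognize: Callable[[bytes], str],
    send: Callable[[str, str], None],
) -> int:
    """Process voice data in order and return the number of transcripts sent."""
    sent = 0
    for source_id, voice in items:
        # Step S111: defect detection.
        if not detect_defect(voice):
            continue  # no defect: the processing from step S112 onward is skipped
        # Steps S112 and S113: create the transcript data and send it.
        send(source_id, recognize(voice))
        sent += 1
    return sent
```

Because nothing is sent on the no-defect path, every transcript that reaches the conference system in this sketch corresponds to a detected defect.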

3. Modifications
3.1 Other Configurations of the Communication System
An example in which the conference system 100 and the monitoring system 200 are provided as separate bodies has been described above. Providing the conference system 100 and the monitoring system 200 as separate systems connected via, for example, a network keeps an error that occurs in one of them from propagating to the other. However, the configuration of the communication system 10 of the present embodiment is not limited to this.

FIG. 11 is a diagram showing another configuration of the communication system 10. As shown in FIG. 11, the communication system 10 may include an integrated server 400 having the functions of the conference system 100 and the functions of the monitoring system 200, and a plurality of participant terminal devices 300. For example, the hardware configuration of the server 400 is the same as that in FIG. 2A and includes a processor, a memory, and a communication interface. The processor operates according to instructions stored in the memory, thereby realizing the functions of the participant data acquisition unit 111, the first conference data creation unit 112, the first conference data distribution unit 113, the second conference data creation unit 114, the second conference data distribution unit 115, the control unit 116, the voice data acquisition unit 211, the voice recognition result acquisition unit 212, the defect detection unit 213, the transcript data output unit 214, and the control unit 215. Either the control unit 116 or the control unit 215 may be omitted.

In the configuration of FIG. 11, the monitoring system 200 in the server 400 may function as a virtual participant terminal device 300 and connect to the conference system 100 in the same device. Alternatively, the monitoring system 200 may acquire the voice data and output the transcript data without joining the conference as a participant terminal device 300. For example, the voice data acquired by the conference system 100 may be stored in a given area of the memory, and the monitoring system 200 may acquire the voice data by reading that area. In this case, communication via the network is unnecessary, which has the advantage that delays and errors in communication between the conference system 100 and the monitoring system 200 need not be considered. The same applies to the transmission of the transcript data from the monitoring system 200 to the conference system 100: the monitoring system 200 may function as a virtual participant terminal device 300, or input and output may be performed via the memory.
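For example, the exchange of voice data through a memory area within the server 400 may be sketched with an in-process queue. This Python sketch is only an illustration under assumptions: the `SharedVoiceBuffer` class, its `write` and `read` methods, and the fixed capacity are hypothetical, and the embodiment does not specify how the memory area is organized.

```python
import queue
from typing import Optional, Tuple

VoiceItem = Tuple[str, bytes]  # (identification information, voice data)


class SharedVoiceBuffer:
    """In-memory area written by the conference side and read by the monitoring side."""

    def __init__(self, capacity: int = 256) -> None:
        self._items: "queue.Queue[VoiceItem]" = queue.Queue(maxsize=capacity)

    def write(self, source_id: str, voice: bytes) -> bool:
        """Called by the conference side; returns False if the buffer is full."""
        try:
            self._items.put_nowait((source_id, voice))
            return True
        except queue.Full:
            return False

    def read(self) -> Optional[VoiceItem]:
        """Called by the monitoring side; returns None if nothing is pending."""
        try:
            return self._items.get_nowait()
        except queue.Empty:
            return None
```

The non-blocking calls are a design choice of this sketch: neither side waits on the other, so a stall in the monitoring side does not hold up the conference side. A second buffer of the same kind could carry the transcript data in the opposite direction.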

3.2 Identification Information
An example has been described above in which the monitoring system 200 associates identification information with the transcript data when outputting it. For example, as shown in FIG. 10, identification information on the transmission source of the voice data, such as "participant B", is associated with the transcript data. However, the identification information is not essential in the method of the present embodiment.

For example, the monitoring system 200 may output the transcript data without associating it with identification information. The conference system 100 then distributes the transcript data to the plurality of participant terminal devices 300 without identifying the speaker. Even in this case, the posting of the transcript data allows the conference participants to recognize that a voice defect has occurred.

In addition, because speech becomes hard to follow when a plurality of participants speak at the same time in a conference, it is assumed that, apart from incidental situations, a single participant is usually speaking at any given time. Therefore, even when no identification information is associated, each participant is considered able to determine whether the defect concerns his or her own voice data by referring to the timing at which the transcript data was posted and to the specific content of the utterance. Consequently, the speaker can recognize a defect in his or her own voice data, and the other participants can grasp the general content of the utterance.

Although the present embodiment has been described in detail above, those skilled in the art will readily understand that many modifications are possible without substantially departing from the novel matters and effects of the present embodiment. Accordingly, all such modifications are included in the scope of the present disclosure. For example, a term that appears at least once in the specification or drawings together with a different term having a broader or identical meaning can be replaced with that different term anywhere in the specification or drawings. All combinations of the present embodiment and the modifications are also included in the scope of the present disclosure. Furthermore, the configurations and operations of the conference system, the monitoring system, the participant terminal devices, and the like are not limited to those described in the present embodiment, and various modifications are possible.

Description of symbols: 10: communication system, 100: conference system, 110: processing unit, 111: participant data acquisition unit, 112: first conference data creation unit, 113: first conference data distribution unit, 114: second conference data creation unit, 115: second conference data distribution unit, 116: control unit, 120: storage unit, 130: communication unit, 140: processor, 150: memory, 160: communication interface, 200: monitoring system, 210: processing unit, 211: voice data acquisition unit, 212: voice recognition result acquisition unit, 213: defect detection unit, 214: transcript data output unit, 215: control unit, 220: storage unit, 230: communication unit, 240: processor, 250: memory, 260: communication interface, 300: participant terminal device, 310: processing unit, 311: participant data transmission unit, 312: conference data presentation unit, 313: control unit, 320: storage unit, 330: communication unit, 340: display unit, 350: user input reception unit, 360: processor, 370: memory, 380: communication interface, 391: display, 392: pointing device, 393: keyboard, 394: microphone, 395: camera, 400: server, Re1: participant display area, Re2: image display area, Re3: operation button display area, Re4: text display area, Re5: text posting area

Claims

1. An information processing method for monitoring voice data in a conference system that provides a conference by a plurality of participant terminal devices, the method comprising:
a step of acquiring, from the conference system, the voice data transmitted from the plurality of participant terminal devices to the conference system;
a step of detecting a defect in the voice data; and
a step of outputting, to the conference system, transcript data including text that is a voice recognition result of the voice data when a defect in the voice data is detected.

2. The information processing method according to claim 1, wherein
the conference system includes:
a first system that acquires the voice data from the plurality of participant terminal devices and provides a conference using at least voice; and
a second system that acquires text data from the plurality of participant terminal devices and provides a chat-based conference, and
in the step of outputting the transcript data, the transcript data is output to the second system.

3. The information processing method according to claim 1 or 2, wherein
in the step of acquiring the voice data from the conference system,
identification information on a first participant terminal device that is, among the plurality of participant terminal devices, the transmission source of the voice data is acquired in association with the voice data, and
in the step of outputting the transcript data to the conference system,
the identification information representing the transmission source of the voice data in which the defect is detected is associated with the transcript data and output to the conference system.

4. The information processing method according to claim 2, further comprising
a step in which the second system of the conference system outputs chat screen data to the plurality of participant terminal devices, wherein
in the step of acquiring the voice data from the conference system,
identification information on a first participant terminal device that is, among the plurality of participant terminal devices, the transmission source of the voice data is acquired in association with the voice data,
in the step of outputting the transcript data to the conference system,
the identification information representing the transmission source of the voice data in which the defect is detected is associated with the transcript data and output to the conference system, and
in the step of outputting the chat screen data,
the chat screen data is output such that the user represented by the identification information associated with the voice data is displayed as the poster and the transcript data is displayed as the posted content.

5. The information processing method according to any one of claims 1 to 4, wherein
in the step of acquiring the voice data from the conference system,
communication with the conference system is performed as one of the plurality of participant terminal devices, and
the voice data is received from the conference system according to the communication method used when the conference system distributes data to the plurality of participant terminal devices.

6. The information processing method according to claim 5, wherein
in the step of outputting the transcript data to the conference system,
the transcript data is transmitted to the conference system according to the communication method used when the plurality of participant terminal devices transmit data to the conference system.

7. The information processing method according to any one of claims 1 to 6, further comprising
a step of acquiring the voice recognition result of the voice data and storing the voice recognition result as meeting minutes data regardless of the result of detecting a defect in the voice data.

8. A monitoring system for monitoring voice data in a conference system that provides a conference by a plurality of participant terminal devices, the monitoring system comprising:
a voice data acquisition unit that performs processing for acquiring, from the conference system, the voice data transmitted from the plurality of participant terminal devices to the conference system;
a defect detection unit that performs processing for detecting a defect in the voice data; and
a transcript data output unit that performs processing for outputting, to the conference system, transcript data including text that is a voice recognition result of the voice data when a defect in the voice data is detected.