JP7000547B1

JP7000547B1 - Programs, methods, information processing equipment, systems

Info

Publication number: JP7000547B1
Application number: JP2020212000A
Authority: JP
Inventors: 俊二菅谷
Original assignee: Optim Corp
Current assignee: Optim Corp
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2022-01-19
Anticipated expiration: 2040-12-22
Also published as: JP2022099335A; JP2022098561A

Abstract

PROBLEM TO BE SOLVED: To improve the convenience of speech recognition processing. SOLUTION: A program to be executed by a computer including a processor and a memory, the program causing the processor to execute: a step of acquiring sound collected by a sound collecting device; a step of extracting at least one voice from the acquired sound; a step of converting the extracted voice into text information by analyzing it; a step of estimating the role of the speaker of the extracted voice based on the text information; and a step of presenting the converted text information to a user such that the role is identifiable. SELECTED DRAWING: FIG. 4

Description

The present disclosure relates to a program, a method, an information processing device, and a system.

A technique for distinguishing speakers by the characteristics of their voice waveforms is known. For example, Patent Document 1 describes authenticating a user by using biometric information that includes voice information.

Patent Document 1: Japanese Patent Application Laid-Open No. 2015-061086

However, a conventional system cannot identify a speaker unless that speaker's voice information has been registered in advance. Therefore, even when text information is generated by speech recognition processing, the speaker cannot be identified if the voice information has not been registered in advance, and the convenience of the speech recognition processing may be impaired.

An object of the present disclosure is to improve the convenience of speech recognition processing.

According to one embodiment, there is provided a program to be executed by a computer including a processor and a memory, the program causing the processor to execute: a step of acquiring sound collected by a sound collecting device; a step of extracting at least one voice from the acquired sound; a step of converting the extracted voice into text information by analyzing it; a step of estimating the role of the speaker of the extracted voice based on the text information; and a step of presenting the converted text information to a user such that the role is identifiable.

According to the present disclosure, the convenience of speech recognition processing can be improved.

FIG. 1 is a diagram showing the overall configuration of the system 1.
FIG. 2 is a diagram showing the functional configuration of the server 20.
FIG. 3 is a diagram showing the data structures of the text information database 2021 and the voice information database 2022 stored in the server 20.
FIG. 4 is a diagram showing an outline of the devices and other elements that make up the system 1.
FIG. 5 is a flowchart showing a series of processes performed when the server 20 generates text data based on sound data.
FIG. 6 is a diagram showing a display example of text data generated based on a conversation between a surgeon and an assistant.
FIG. 7 is a diagram showing a display example of text data generated based on a conversation between a lecturer and an audience member.
FIG. 8 is a diagram showing a display example of text data generated based on a conversation between a manager and a worker.
FIG. 9 is a diagram showing the overall configuration of the system 1A in the second embodiment.
FIG. 10 is a diagram showing the functional configuration of the server 20A in the second embodiment.
FIG. 11 is a diagram showing the data structure of the image information database 2023 stored in the server 20A in the second embodiment.
FIG. 12 is a diagram showing an outline of the devices and other elements that make up the system 1A in the second embodiment.
FIG. 13 is a flowchart showing a series of processes performed when the control unit 203A of the server 20A generates text data based on sound data and image data.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, identical components are given identical reference numerals. Their names and functions are also the same. Therefore, detailed descriptions of them will not be repeated.

<First Embodiment>
<Overview>
The following embodiment describes a system 1 that estimates the role of a speaker and stores the speaker's utterance content together with the estimated role as text data.

The system 1 collects ambient sound with a sound collecting device. The system 1 extracts at least one voice from sound data based on the collected sound. The system 1 converts the utterance content of the extracted voice into text information. The system 1 estimates the role of the speaker based on the text information. The system 1 stores text data in which the estimated role is added to the text information, and presents the text data in response to a request from a user.

The system 1 can be installed, for example, in a medical facility such as a hospital. Specifically, for example, a sound collecting device is installed in an operating room, and the system 1 converts conversations during surgery between a surgeon, an assistant, and others into text information and stores it together with the roles estimated from the text information. As another example, a sound collecting device is installed in a hospital room, and the system 1 converts conversations between an attending physician, a nurse, and others into text information and stores it together with the roles estimated from the text information. The surgeon and the attending physician are examples of a person who mainly performs medical treatment, and the assistant and the nurse are examples of a person who assists. This makes it possible to store intraoperative and everyday conversations as text data that includes text information and roles.

The system 1 can also be installed, for example, at a seminar or a press conference. Specifically, for example, a sound collecting device is installed in the venue, and the system 1 converts a question-and-answer session between a lecturer and audience members into text information and stores it together with the roles estimated from the text information. This reduces the effort of preparing minutes and makes it easy to look back at what was asked. The lecturer is an example of a main speaker; the main speaker may instead be a person who leads the proceedings of the meeting, such as a moderator.

The system 1 can also be installed, for example, at a work site. Specifically, for example, a sound collecting device is installed at the site, and the system 1 converts the content of instructions from a manager to workers, the content of reports from workers to the manager, and the like into text information and stores it together with the roles estimated from the text information. The manager may also be called an instructor. A worker is an example of a managed person who is managed by the manager. This makes it possible, when trouble occurs, to check whether any instruction from the manager to the workers was omitted.

<1 Overall system configuration>
FIG. 1 is a diagram showing the overall configuration of the system 1.

As shown in FIG. 1, the system 1 includes a server 20, an edge server 30, and a sound collecting device 40. The server 20 and the edge server 30 are communicably connected via a network 80. The edge server 30 is connected to the sound collecting device 40. For example, the sound collecting device 40 is a transmitting and receiving device based on a communication standard used in short-range communication systems between information devices. Specifically, the sound collecting device 40 uses the 2.4 GHz band, for example through a Bluetooth (registered trademark) module, to receive beacon signals from other information devices equipped with a Bluetooth (registered trademark) module. The edge server 30 acquires the information transmitted from the sound collecting device 40 based on beacon signals that use this short-range communication. In this way, the sound collecting device 40 transmits the acquired voice information of the speaker to the edge server 30 by short-range communication without going through the network 80. Alternatively, the edge server 30 may be communicably connected to the sound collecting device 40 via the network 80.

The server 20 manages information about sound. The information about sound includes, for example, sound data and text data generated based on voices extracted from the sound. The server 20 shown in FIG. 1 has a communication IF 22, an input/output IF 23, a memory 25, a storage 26, and a processor 29.

The communication IF 22 is an interface for inputting and outputting signals so that the server 20 can communicate with external devices. The input/output IF 23 functions as an interface with an input device for receiving input operations from a user and as an interface with an output device for presenting information to the user. The memory 25 temporarily stores programs and the data processed by those programs, and is a volatile memory such as a DRAM (Dynamic Random Access Memory). The storage 26 is a storage device for saving data, and is, for example, a flash memory or an HDD (Hard Disk Drive). The processor 29 is hardware for executing the instruction set described in a program, and is composed of an arithmetic unit, registers, peripheral circuits, and the like.

Although the present embodiment is described using an example in which the system 1 has the server 20, the system 1 may be formed as a collection of multiple servers. How the functions required to realize the system 1 according to the present embodiment are distributed across one or more pieces of hardware can be determined as appropriate in view of the processing capacity of each piece of hardware and/or the specifications required of the system 1.

The edge server 30 receives signals transmitted from the sound collecting device 40 and transmits the received signals to the server 20. The edge server 30 also transmits signals acquired from the server 20 to the sound collecting device 40. The signals acquired from the server 20 include, for example, information for updating the settings of the sound collecting device 40. Although FIG. 1 shows an example with one edge server 30, the system 1 may include multiple edge servers.

The sound collecting device 40 collects ambient sound and converts it into, for example, digital sound data. The sound collecting device 40 transmits a sound signal based on the sound data to the edge server 30. The sound collecting device 40 is realized by, for example, a microphone. The microphone is, for example, a directional microphone or an omnidirectional microphone. A directional microphone may be unidirectional or bidirectional. The sound collecting device 40 is installed, for example, at a position where sound can be collected efficiently. Although FIG. 1 shows an example with one sound collecting device 40, the system 1 may include multiple sound collecting devices 40.

<1.1 Configuration of the server 20>
FIG. 2 is a diagram showing the functional configuration of the server 20. As shown in FIG. 2, the server 20 functions as a communication unit 201, a storage unit 202, and a control unit 203.

The communication unit 201 performs processing for the server 20 to communicate with external devices.

The storage unit 202 stores data and programs used by the server 20. The storage unit 202 stores a text information database 2021, a voice information database 2022, and the like.

The text information database 2021 stores text data generated based on the sound collected by the sound collecting device 40. Details will be described later.

The voice information database 2022 stores sound data based on the sound collected by the sound collecting device 40. Details will be described later.

The control unit 203 provides the functions shown as the various modules below when the processor of the server 20 performs processing according to a program.

The reception control module 2031 controls processing in which the server 20 receives signals from external devices according to a communication protocol. For example, the reception control module 2031 controls the communication unit 201 to receive the sound signal transmitted from the sound collecting device 40 via the edge server 30.

The transmission control module 2032 controls processing in which the server 20 transmits signals to external devices according to a communication protocol.

The acquisition module 2033 acquires sound data from the received sound signal. The acquisition module 2033 stores the acquired sound data in the voice information database 2022. For example, the acquisition module 2033 stores the acquired sound data in the voice information database 2022 when a predetermined requirement is satisfied. The predetermined requirement is, for example, one of the following:
・The period from input of a recording start instruction until input of a recording end instruction
・Arrival of a preset time
・Continued occurrence of sound (for example, recording starts when sound occurs and stops when no sound has occurred for a preset period)
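The third requirement (start recording when sound occurs, stop after a preset silent period) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `gate_recording`, the per-frame level input, and the `threshold` and `max_silent_frames` parameters are all assumptions introduced here.

```python
from typing import Iterable, List

def gate_recording(frame_levels: Iterable[float],
                   threshold: float = 0.1,
                   max_silent_frames: int = 3) -> List[List[int]]:
    """Return [start, end) frame index ranges that would be recorded.

    Hypothetical helper: a frame counts as "sound" when its level is at
    or above `threshold`; recording stops once `max_silent_frames`
    consecutive frames fall below it.
    """
    segments: List[List[int]] = []
    start = None   # frame index where the current recording began
    silent = 0     # consecutive silent frames seen while recording
    last = -1      # index of the last frame processed
    for i, level in enumerate(frame_levels):
        last = i
        if level >= threshold:
            if start is None:
                start = i          # sound occurred: start recording
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= max_silent_frames:
                # Silence lasted long enough: close the segment at the
                # first silent frame and wait for the next sound.
                segments.append([start, i - silent + 1])
                start, silent = None, 0
    if start is not None:
        # Input ended while still recording; drop trailing silence.
        segments.append([start, last + 1 - silent])
    return segments
```

For example, with levels `[0, 0.5, 0.6, 0, 0, 0, 0, 0.7, 0]` and the defaults, frames 1 to 2 and frame 7 would be recorded.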

The voice analysis module 2034 analyzes the acquired sound data. For example, the voice analysis module 2034 analyzes the sound data when a predetermined requirement is satisfied. The predetermined requirement is, for example, one of the following:
・The period from input of a recording (analysis) start instruction until input of a recording (analysis) end instruction
・Arrival of a preset time
・Continued occurrence of sound (for example, analysis starts when sound occurs and stops when no sound has occurred for a preset period)

The voice analysis module 2034 extracts a predetermined voice from the acquired sound data. Specifically, the voice analysis module 2034 extracts a predetermined voice from the sound data based on, for example, any of the following information:
・Features of the voice
・The direction from which the sound was collected
・The timing at which the sound was collected
・The sound collecting device that collected the sound

More specifically, for example, the voice analysis module 2034 analyzes at least one voice feature in the sound data, selected from the group consisting of loudness, pitch (frequency), voicing (voiced or unvoiced), phoneme type, formants, and the like. Based on the analysis result, the voice analysis module 2034 extracts from the sound data the voices presumed to have been uttered by the same person.
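As a rough illustration of feature-based grouping, the sketch below computes two of the listed features, loudness and pitch, and compares two voice segments by pitch. The zero-crossing pitch estimate, the function names, and the 20 Hz tolerance are assumptions chosen for brevity; the patent does not specify how the features are computed or compared, and a practical system would likely use more robust features.

```python
import math
from typing import Dict, Sequence

def voice_features(samples: Sequence[float], sample_rate: int) -> Dict[str, float]:
    """Return loudness (RMS) and a rough pitch estimate in Hz."""
    if not samples:
        raise ValueError("samples must not be empty")
    # Loudness as the root mean square of the samples.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Count sign changes between neighbouring samples.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration_sec = len(samples) / sample_rate
    # A periodic tone crosses zero twice per cycle.
    pitch_hz = crossings / (2 * duration_sec)
    return {"loudness": rms, "pitch_hz": pitch_hz}

def likely_same_speaker(a: Dict[str, float], b: Dict[str, float],
                        pitch_tolerance_hz: float = 20.0) -> bool:
    """Hypothetical rule: treat two segments as the same speaker when
    their estimated pitches are within the given tolerance."""
    return abs(a["pitch_hz"] - b["pitch_hz"]) <= pitch_tolerance_hz
```

Segments whose features are judged similar would then be grouped as one voice (for example, the "first voice" of person A) before speech recognition is applied.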

As another example, when the sound collecting device 40 is directional, the voice analysis module 2034 extracts from the sound data the voice arriving from the direction in which the device is pointed, based on information about the directivity of the sound collecting device 40.

As another example, when the timing of an utterance is known in advance, the voice analysis module 2034 extracts the voice contained in the sound data based on the time at which the sound collecting device 40 collected the sound. For example, the voice analysis module 2034 refers to a schedule of utterances, such as a lecture schedule, and extracts from the sound data the first voice uttered after the corresponding time is reached.
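The schedule-based case can be sketched as picking the earliest utterance at or after a scheduled time. The `Utterance` type, its fields, and the use of seconds as the time unit are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Utterance:
    start_sec: float   # time at which the utterance began (assumed unit)
    text: str          # placeholder for the utterance content

def first_utterance_after(utterances: Iterable[Utterance],
                          scheduled_sec: float) -> Optional[Utterance]:
    """Return the earliest utterance that starts at or after the
    scheduled time, or None when there is no such utterance."""
    candidates = [u for u in utterances if u.start_sec >= scheduled_sec]
    return min(candidates, key=lambda u: u.start_sec, default=None)
```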

As another example, when multiple sound collecting devices 40 are used, the voice analysis module 2034 extracts from the sound data the voice collected by each sound collecting device 40.

The voice analysis module 2034 may extract a speaker's voice using any one of the above extraction methods alone, or using a combination of several of them.

The voice analysis module 2034 also converts the utterance content into text information by executing speech recognition processing on the extracted voice. Any existing speech recognition technique may be used. The converted text information is stored in the text information database 2021.

The estimation module 2035 estimates the role of the speaker based on the text information. For example, the estimation module 2035 estimates the role of the speaker by inputting the text information into a trained model stored in the storage unit 202 of the server 20.

The trained model is generated, for example, by having a machine learning model undergo machine learning according to a model training program based on training data. In the present embodiment, the trained model is trained to output a role for, for example, an utterance stored in the text information database 2021. Here, the training data uses, for example, the text of a given utterance as input data and the role of the person who makes that utterance as correct output data. For example, the text of an utterance that leads a surgery is used as input data, and surgeon, the role of the person who makes such an utterance, is used as correct output data. Likewise, the text of an utterance that assists a surgery is used as input data, and assistant, the role of the person who makes such an utterance, is used as correct output data. When text information is input, a model trained in this way outputs the role of the speaker, for example, surgeon, assistant, attending physician, nurse, lecturer, audience member, manager, or worker.
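The training setup described above (utterance text as input, role as correct output) can be illustrated with a toy classifier. The patent does not state which kind of model is used; the bag-of-words naive Bayes classifier below, the class name `RoleClassifier`, and the example sentences are stand-ins chosen only to keep the sketch self-contained.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

class RoleClassifier:
    """Toy multinomial naive Bayes over whitespace-separated words."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocab: Set[str] = set()

    def fit(self, examples: List[Tuple[str, str]]) -> "RoleClassifier":
        # Each example is (utterance text, correct role).
        for text, role in examples:
            words = text.lower().split()
            self.word_counts[role].update(words)
            self.doc_counts[role] += 1
            self.vocab.update(words)
        return self

    def predict(self, text: str) -> str:
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_role, best_score = "", -math.inf
        for role, n_docs in self.doc_counts.items():
            counts = self.word_counts[role]
            # Laplace smoothing so unseen words do not zero the score.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_role, best_score = role, score
        return best_role

# Hypothetical training data: utterances that lead or assist a surgery.
model = RoleClassifier().fit([
    ("begin the incision now", "surgeon"),
    ("hand me the scalpel", "surgeon"),
    ("suction is ready doctor", "assistant"),
    ("yes doctor here it is", "assistant"),
])
```

With this toy data, `model.predict("begin the incision")` yields the surgeon role and `model.predict("yes doctor")` yields the assistant role. A deployed system would need far more training data and a proper tokenizer, especially for Japanese text.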

When multiple voices are extracted from the sound data, the estimation module 2035 estimates a role for each voice from the text information into which the content of that voice was converted. The estimation module 2035 stores the estimated role in the text information database 2021 together with the text information.

Once the estimation module 2035 has estimated a role, it may assign that same role to any voice that can be presumed to be the same voice, without executing the role estimation process again.

The estimation module 2035 may redo the role estimation at a predetermined timing. The predetermined timing is, for example, one of the following:
・After a preset time has elapsed
・When the recording switches
・When a new person appears
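The reuse of an estimated role, together with re-estimation at the timings listed above, can be sketched as a small cache. The class `RoleCache`, the `voice_key` identifier for "voices presumed to be the same", and the `max_age_sec` parameter are assumptions introduced for illustration.

```python
from typing import Callable, Dict, Tuple

class RoleCache:
    """Remember the role estimated for each voice and reuse it."""

    def __init__(self, estimate: Callable[[str], str],
                 max_age_sec: float = 600.0) -> None:
        self._estimate = estimate          # e.g. a trained model's predict
        self._max_age_sec = max_age_sec    # preset time before re-estimating
        # voice key -> (role, time at which the role was estimated)
        self._roles: Dict[str, Tuple[str, float]] = {}

    def role_for(self, voice_key: str, text: str, now_sec: float) -> str:
        cached = self._roles.get(voice_key)
        if cached is not None and now_sec - cached[1] < self._max_age_sec:
            return cached[0]               # same voice: reuse the role
        role = self._estimate(text)        # first time, or estimate expired
        self._roles[voice_key] = (role, now_sec)
        return role

    def reset(self) -> None:
        """Forget all roles, e.g. when the recording switches or a new
        person appears, so that every role is estimated again."""
        self._roles.clear()
```

Passing `now_sec` explicitly rather than reading a clock inside the class keeps the sketch deterministic and easy to test.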

The presentation module 2036 presents the text data stored in the text information database 2021 to the user in response to a request from the user.

<2 Data structure>
FIG. 3 is a diagram showing the data structures of the text information database 2021 and the voice information database 2022 stored in the server 20.

As shown in FIG. 3, the text information database 2021 includes an item "date and time", an item "text ID", an item "voice ID", an item "data", and the like.

The item "date and time" indicates the date and time at which the sound underlying the text data was collected.

The item "text ID" indicates information for identifying the text data.

The item "voice ID" indicates information for identifying the sound data underlying the text data. For example, the text data with text ID "T001" was generated based on the sound data with voice ID "V001".

The item "data" stores the text data. The text data stored in the item "data" includes the text information into which the content of the voice was converted and the role estimated from that text information.

As shown in FIG. 3, the voice information database 2022 includes an item "date and time", an item "voice ID", an item "data", and the like.

The item "date and time" indicates the date and time at which the sound was collected.

The item "voice ID" indicates information for identifying the acquired sound data.

The item "data" stores the sound data. The sound data stored in the item "data" is stored in a data format such as wav.
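The two record layouts described above might be modeled as follows. The class and field names are assumptions mapped from the items in FIG. 3, and representing the "data" item of a text record as a list of role and text pairs is one possible reading, not something the source specifies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class VoiceRecord:
    recorded_at: datetime   # item "date and time"
    voice_id: str           # item "voice ID", e.g. "V001"
    data: bytes             # item "data": sound data, e.g. in wav format

@dataclass
class TextEntry:
    role: str               # role estimated from the text information
    text: str               # text information converted from the voice

@dataclass
class TextRecord:
    recorded_at: datetime   # item "date and time"
    text_id: str            # item "text ID", e.g. "T001"
    voice_id: str           # item "voice ID" of the underlying sound data
    data: List[TextEntry]   # item "data": text information with roles

def source_voice(record: TextRecord,
                 voices: Dict[str, VoiceRecord]) -> VoiceRecord:
    """Look up the sound data from which a text record was generated."""
    return voices[record.voice_id]
```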

<3 Summary>
FIG. 4 is a diagram showing an outline of the system 1. In the example shown in FIG. 4, the sound collecting device 40 is installed near person A and person B, whose voices are to be acquired.

The sound collecting device 40 acquires the sound around it. The sound collecting device 40 transmits a sound signal for the acquired sound to the edge server 30.

The edge server 30 transmits the received sound signal to the server 20.

The server 20 analyzes the sound data of the received sound signal and extracts voices from the sound data. The server 20 converts the content of each voice into text information by executing speech recognition processing on the extracted voice. The server 20 estimates the role of the speaker of the voice from the converted text information.

As a result, the server 20 can store the content of a voice uttered by a speaker with the role of the speaker and the text information associated with each other.

＜４動作＞
以下、サーバ２０が集音装置４０で集音された音に基づき、テキストデータを生成する際の一連の処理について説明する。 <4 operation>
Hereinafter, a series of processes when the server 20 generates text data based on the sound collected by the sound collecting device 40 will be described.

図５は、サーバ２０の制御部２０３が音データに基づいてテキストデータを生成する際の一連の処理を示すフローチャートである。以下の説明では、例えば、集音装置４０の周囲には、図４に示すように人物Ａと、人物Ｂとがいる場合を例に説明する。 FIG. 5 is a flowchart showing a series of processes when the control unit 203 of the server 20 generates text data based on the sound data. In the following description, for example, a case where a person A and a person B are around the sound collecting device 40 will be described as an example.

集音装置４０は、周囲の音を集音する。このとき、例えば、人物Ａが所定の発言をし、その後に、人物Ｂが人物Ａの発言に対して応答をしたとする。集音装置４０が集音した音には、人物Ａの音声の後に、人物Ｂの音声が含まれる。集音装置４０は、集音した音についての音信号を、エッジサーバ３０を介してサーバ２０へ送信する。 The sound collecting device 40 collects ambient sounds. At this time, for example, it is assumed that the person A makes a predetermined remark and then the person B responds to the remark of the person A. The sound collected by the sound collecting device 40 includes the voice of the person A followed by the voice of the person B. The sound collecting device 40 transmits a sound signal about the collected sound to the server 20 via the edge server 30.

ステップＳ５０１において、制御部２０３は、エッジサーバ３０から受信した音信号から音データを取得する。 In step S501, the control unit 203 acquires sound data from the sound signal received from the edge server 30.

ステップＳ５０２において、制御部２０３は、取得した音データを解析する。具体的には、例えば、制御部２０３は、取得した音データに含まれる声の特徴、例えば、声の大きさ、音高、有声、無声、音素の種類、フォルマント等から成る群から選択される少なくとも１つを分析する。制御部２０３は、人物Ａが発生した音声を、第１特徴を有する第１音声として音データから抽出する。制御部２０３は、人物Ａの後に人物Ｂが発生した音声を、第２特徴を有する第２音声として音データから抽出する。 In step S502, the control unit 203 analyzes the acquired sound data. Specifically, for example, the control unit 203 is selected from a group consisting of voice features included in the acquired sound data, for example, voice volume, pitch, voiced, unvoiced, phoneme type, formant, and the like. Analyze at least one. The control unit 203 extracts the voice generated by the person A from the sound data as the first voice having the first feature. The control unit 203 extracts the voice generated by the person B after the person A from the sound data as the second voice having the second feature.

なお、ここでは、制御部２０３が、声の特徴に基づいて音データから音声を抽出する場合を例に説明した。制御部２０３は、声の特徴、集音装置４０の指向方向、音が集音されたタイミング、集音に用いられた集音装置等から成る群から選択される少なくとも１つの手法を利用して音声を抽出してよい。 Here, a case where the control unit 203 extracts voice from sound data based on the characteristics of voice has been described as an example. The control unit 203 utilizes at least one method selected from a group consisting of voice characteristics, the direction of the sound collector 40, the timing at which the sound is collected, the sound collector used for the sound collection, and the like. Sound may be extracted.

ステップＳ５０３において、制御部２０３は、抽出した音声に対して音声認識処理を実行することで、音声の内容をテキスト情報に変換する。具体的には、例えば、制御部２０３は、第１音声に対して音声認識処理を実行することで、第１音声の内容を第１テキスト情報に変換する。制御部２０３は、第１テキスト情報をテキスト情報データベース２０２１に記憶する。また、制御部２０３は、第２音声に対して音声認識処理を実行することで、第２音声の内容をテキスト情報に変換する。制御部２０３は、第２テキスト情報をテキスト情報データベース２０２１に記憶する。 In step S503, the control unit 203 converts the content of the voice into text information by executing the voice recognition process on the extracted voice. Specifically, for example, the control unit 203 converts the content of the first voice into the first text information by executing the voice recognition process for the first voice. The control unit 203 stores the first text information in the text information database 2021. Further, the control unit 203 converts the content of the second voice into text information by executing the voice recognition process for the second voice. The control unit 203 stores the second text information in the text information database 2021.

ステップＳ５０４において、制御部２０３は、テキスト情報に基づき、音声の発声者の役割を推定する。具体的には、例えば、制御部２０３は、第１テキスト情報を学習済みモデルに入力する。学習済みモデルは、第１テキスト情報が入力されると、第１役割を出力する。また、制御部２０３は、第２テキスト情報を学習済みモデルに入力する。学習済みモデルは、第２テキスト情報が入力されると、第２役割を出力する。制御部２０３は、第１テキスト情報と第１役割とを関連付け、第２テキスト情報と第２役割とを関連付けてテキストデータとし、テキストデータをテキスト情報データベース２０２１に記憶する。 In step S504, the control unit 203 estimates the role of the voice speaker based on the text information. Specifically, for example, the control unit 203 inputs the first text information to the trained model. The trained model outputs the first role when the first text information is input. Further, the control unit 203 inputs the second text information to the trained model. The trained model outputs the second role when the second text information is input. The control unit 203 associates the first text information with the first role, associates the second text information with the second role to obtain text data, and stores the text data in the text information database 2021.

ステップＳ５０５において、制御部２０３は、ユーザからの要望に応じ、テキスト情報データベース２０２１に記憶されているテキストデータをユーザに提示する。 In step S505, the control unit 203 presents the text data stored in the text information database 2021 to the user in response to the request from the user.
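The flow of steps S503 to S505 can be sketched as follows. This is a minimal illustration, not the implementation in the source: the voice recognizer and the role estimator are passed in as plain callables because the source does not specify them, and the names `TextRecord`, `build_text_data`, and `present` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TextRecord:
    """One entry of the text data: text information associated with a role."""
    time: str
    text: str
    role: str


def build_text_data(
    voices: List[Tuple[str, bytes]],
    recognize: Callable[[bytes], str],
    estimate_role: Callable[[str], str],
) -> List[TextRecord]:
    """Convert each extracted voice to text (S503) and estimate its role (S504)."""
    records = []
    for time, voice in voices:
        text = recognize(voice)       # S503: voice recognition (assumed callable)
        role = estimate_role(text)    # S504: role estimated from the text only
        records.append(TextRecord(time=time, text=text, role=role))
    return records


def present(records: List[TextRecord]) -> List[str]:
    """Format the stored text data so that each role is identifiable (S505)."""
    return [f"[{r.time}] {r.role}: {r.text}" for r in records]
```

The point of the sketch is the data flow: the role is derived from the recognized text alone, so no speaker-specific voice information has to be registered in advance.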

＜５画面例＞ <5 Screen examples>
図６～８は、第１の実施形態において、テキストデータをユーザに提示する際の、ユーザが操作する端末のディスプレイの表示例を示す図である。ユーザ端末は、例えば据え置き型のＰＣ（Personal Computer）、ラップトップＰＣであるとしてもよい。また、ユーザ端末は、ヘッドマウントディスプレイとして機能してもよく、例えば、透過型、非透過型、又はシースルー型ヘッドマウントディスプレイとして機能してもよい。なお、テキストデータは、ディスプレイでの表示に限らず、紙にプリントアウトされてユーザに提示されてもよい。 FIGS. 6 to 8 are diagrams showing display examples on the display of a terminal operated by the user when text data is presented to the user in the first embodiment. The user terminal may be, for example, a desktop PC (Personal Computer) or a laptop PC. The user terminal may also function as a head-mounted display, for example, as a transmissive, non-transmissive, or see-through head-mounted display. The text data is not limited to being shown on a display and may be printed on paper and presented to the user.

図６は、人物Ａが執刀医であり、人物Ｂが助手である場合のテキストデータの表示例を示す図である。 FIG. 6 is a diagram showing an example of displaying text data when the person A is a surgeon and the person B is an assistant.

図６において、オブジェクト６０１、６０７は、第１テキスト情報に基づいて推定される役割を表す。図６では、オブジェクト６０１、６０７は画面の左端に位置し、「執刀医」と表示されている。オブジェクト６０４は、第２テキスト情報に基づいて推定される役割を表す。図６では、オブジェクト６０４は画面の右端に位置し、「助手」と表示されている。このように、役割に応じてオブジェクトを表示する位置を変えることで、ユーザは、役割の表示位置を視認するだけで、役割の異なる者が会話していることを把握することが可能となる。 In FIG. 6, the objects 601 and 607 represent roles estimated based on the first text information. In FIG. 6, the objects 601 and 607 are located at the left edge of the screen and are displayed as "surgeon". The object 604 represents a role estimated based on the second text information. In FIG. 6, the object 604 is located at the right edge of the screen and is displayed as "assistant". In this way, by changing the position where the object is displayed according to the role, the user can grasp that a person having a different role is talking only by visually recognizing the display position of the role.

図６では、オブジェクト６０１、６０７が画面の左端に沿って位置し、オブジェクト６０４が画面の右端に沿って位置する例を示しているが、オブジェクト６０１、６０７及びオブジェクト６０４の位置はこれに限定されない。オブジェクト６０１及びオブジェクト６０４は、同じ端部に位置していてもよい。 FIG. 6 shows an example in which the objects 601 and 607 are located along the left edge of the screen and the object 604 is located along the right edge of the screen, but the positions of the objects 601, 607, and 604 are not limited to this. The objects 601 and 604 may be located at the same edge.

アイコン６０２、６０８およびアイコン６０５は、役割に応じたアイコンを表す。例えば、アイコン６０２、６０８は、それぞれオブジェクト６０１、６０７の下に表示され、執刀医を識別するアイコンを示す。アイコン６０５は、オブジェクト６０４の下に表示され、助手を識別するアイコンを示す。当該アイコンは、例えば、役割に応じて制御部２０３によって自動的に設定されてもよい。 The icons 602, 608 and 605 represent icons according to their roles. For example, the icons 602 and 608 are displayed below the objects 601 and 607, respectively, to indicate an icon that identifies the surgeon. The icon 605 is displayed below the object 604 and indicates an icon for identifying the assistant. The icon may be automatically set by the control unit 203 according to the role, for example.

ボックス６０３、６０９およびボックス６０６は、発声者それぞれの発言内容を表すテキスト情報が表示される。例えば、ボックス６０３、６０９は、画面の右端寄りに表示され、執刀医の発言を時刻と共に表示する。また、ボックス６０６は、画面の左端寄りに表示され、助手の発言を時刻と共に表示する。 In the boxes 603, 609 and 606, text information representing the content of each speaker's remark is displayed. For example, the boxes 603 and 609 are displayed near the right edge of the screen and display the surgeon's remarks together with the time. Further, the box 606 is displayed near the left edge of the screen, and displays the assistant's remarks together with the time.
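As a rough illustration of the role-dependent placement described for FIG. 6, the sketch below renders each utterance as one line of fixed width, aligning the first role to appear on the left and every other role on the right. This is a simplification under stated assumptions: the figure places the role object, icon, and text box separately, whereas the sketch collapses them into a single line, and the function name `render_conversation` is hypothetical.

```python
from typing import List, Tuple


def render_conversation(
    records: List[Tuple[str, str, str]], width: int = 60
) -> List[str]:
    """Render (time, role, text) records with role-dependent alignment.

    Assumption for illustration: the role of the first record is drawn
    flush left and all other roles flush right, so a reader can tell the
    roles apart from position alone.
    """
    if not records:
        return []
    left_role = records[0][1]
    lines = []
    for time, role, text in records:
        line = f"{role} [{time}] {text}"
        lines.append(line.ljust(width) if role == left_role else line.rjust(width))
    return lines
```

Aligning by role rather than by speaker identity mirrors the description: the display distinguishes roles, not individual persons.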

これにより、ユーザは、執刀医と助手とが手術中などに行った会話の内容を、各々の役割を識別する形でテキスト情報として確認することができる。このため、執刀医と助手との術中における会話を、例えば、研修医の指導の際に、指示の出し方が適切か、誤った判断をしていないか等の確認に活用することが可能となる。また、執刀医自身が、自分の担当した手術中の会話を確認することで、反省点の振り返り、改善点の発見などに役立てることができる。 As a result, the user can review the content of a conversation held between the surgeon and the assistant, for example during surgery, as text information in which each role is identified. The intraoperative conversation between the surgeon and the assistant can therefore be used, for example when training residents, to check whether instructions were given appropriately and whether any wrong judgments were made. In addition, surgeons can review the conversations from operations they performed, which helps them reflect on what went wrong and find points for improvement.

図７は、人物Ａが講演者であり、人物Ｂが視聴者である場合のテキストデータの表示例を示す図である。 FIG. 7 is a diagram showing an example of displaying text data when the person A is a speaker and the person B is a viewer.

図７において、オブジェクト７０１、７０７は、図６におけるオブジェクト６０１、６０７と同様に、第１テキスト情報に基づいて推定される役割を表す。図７では、オブジェクト７０１、７０７は画面の左端に位置し、「講演者」と表示されている。オブジェクト７０４は、図６におけるオブジェクト６０４と同様に、第２テキスト情報に基づいて推定される役割を表す。図７では、オブジェクト７０４は画面の右端に位置し、「視聴者」と表示されている。 In FIG. 7, the objects 701 and 707 represent roles estimated based on the first text information, similar to the objects 601 and 607 in FIG. In FIG. 7, the objects 701 and 707 are located at the left edge of the screen and are displayed as "speaker". The object 704, like the object 604 in FIG. 6, represents a role estimated based on the second text information. In FIG. 7, the object 704 is located at the right edge of the screen and is displayed as "viewer".

アイコン７０２、７０８およびアイコン７０５は、図６におけるアイコン６０２、６０８および６０５と同様に、役割に応じたアイコンを表す。例えば、アイコン７０２、７０８は、それぞれオブジェクト７０１、７０７の下に表示され、講演者を識別するアイコンを示す。アイコン７０５は、オブジェクト７０４の下に表示され、視聴者を識別するアイコンを示す。 The icons 702, 708 and the icon 705 represent role-based icons, similar to the icons 602, 608 and 605 in FIG. For example, the icons 702 and 708 are displayed below the objects 701 and 707, respectively, to indicate an icon that identifies the speaker. The icon 705 is displayed below the object 704 and indicates an icon that identifies the viewer.

ボックス７０３、７０９およびボックス７０６は、図６におけるボックス６０３、６０９およびボックス６０６と同様に、発声者それぞれの発言内容を表すテキスト情報が表示される。例えば、ボックス７０３、７０９は、画面の右端寄りに表示され、講演者の発言を時刻と共に表示する。また、ボックス７０６は、画面の左端寄りに表示され、視聴者の発言を時刻と共に表示する。 Similar to the boxes 603, 609 and 606 in FIG. 6, the boxes 703, 709 and 706 display text information representing the content of each speaker's remark. For example, the boxes 703 and 709 are displayed near the right edge of the screen and display the speaker's remarks together with the time. Further, the box 706 is displayed near the left edge of the screen, and displays the viewer's remarks together with the time.

これにより、ユーザは、講演者と視聴者とが講演中などに行った会話、例えば質疑応答の内容を、各々の役割を識別する形でテキスト情報として確認することができる。このため、講演者は、質疑応答の内容をテキスト情報として確認することで、講演会における話の流れ、視聴者の反応などを確認することが可能となる。また、議事録を作成するユーザは、質疑応答の内容をテキスト情報として確認することで、容易に議事録を作成することが可能となる。 As a result, the user can confirm the content of the conversation between the speaker and the viewer during the lecture, for example, the question and answer session, as text information in a form that identifies each role. Therefore, the lecturer can confirm the flow of the talk in the lecture, the reaction of the viewer, etc. by confirming the contents of the question and answer as text information. In addition, the user who creates the minutes can easily create the minutes by confirming the contents of the question and answer as text information.

図８は、人物Ａが管理者であり、人物Ｂが作業員である場合のテキストデータの表示例を示す図である。 FIG. 8 is a diagram showing an example of displaying text data when the person A is an administrator and the person B is a worker.

図８において、オブジェクト８０１、８０７は、図７におけるオブジェクト７０１、７０７と同様に、第１テキスト情報に基づいて推定される役割を表す。図８では、オブジェクト８０１、８０７は画面の左端に位置し、「管理者」と表示されている。オブジェクト８０４は、図７におけるオブジェクト７０４と同様に、第２テキスト情報に基づいて推定される役割を表す。図８では、オブジェクト８０４は画面の右端に位置し、「作業員」と表示されている。 In FIG. 8, the objects 801 and 807 represent roles estimated based on the first text information, similar to the objects 701 and 707 in FIG. 7. In FIG. 8, the objects 801 and 807 are located at the left edge of the screen and are displayed as "administrator". The object 804, like the object 704 in FIG. 7, represents a role estimated based on the second text information. In FIG. 8, the object 804 is located at the right edge of the screen and is displayed as "worker".

アイコン８０２、８０８およびアイコン８０５は、図７におけるアイコン７０２、７０８および７０５と同様に、役割に応じたアイコンを表す。例えば、アイコン８０２、８０８は、それぞれオブジェクト８０１、８０７の下に表示され、管理者を識別するアイコンを示す。アイコン８０５は、オブジェクト８０４の下に表示され、作業員を識別するアイコンを示す。 The icons 802, 808 and the icon 805 represent role-based icons, similar to the icons 702, 708, and 705 in FIG. 7. For example, the icons 802 and 808 are displayed below the objects 801 and 807, respectively, to indicate an icon that identifies the administrator. The icon 805 is displayed below the object 804 and indicates an icon that identifies a worker.

ボックス８０３、８０９およびボックス８０６は、図７におけるボックス７０３、７０９およびボックス７０６と同様に、発声者それぞれの発言内容を表すテキスト情報が表示される。例えば、ボックス８０３、８０９は、画面の右端寄りに表示され、管理者の発言を時刻と共に表示する。また、ボックス８０６は、画面の左端寄りに表示され、作業員の発言を時刻と共に表示する。 Similar to the boxes 703, 709 and 706 in FIG. 7, the boxes 803, 809 and 806 display text information representing the content of each speaker's remark. For example, the boxes 803 and 809 are displayed near the right edge of the screen, and display the remarks of the administrator together with the time. Further, the box 806 is displayed near the left edge of the screen, and displays the worker's remarks together with the time.

これにより、ユーザは、管理者と作業員とが行った作業現場における会話、例えば当日の作業指示などの内容を、各々の役割を識別する形でテキスト情報として確認することができる。これにより、管理者は、作業指示の内容をテキスト情報として確認することで、当日の作業内容の振り返り、次の日の作業計画の立案などに役立てることが可能となる。また、管理者を管理監督する監督者が、管理者が作業員に出した指示内容、作業員の反応などをテキスト情報として確認することが可能となる。そのため、監督者は、ハラスメントなどの問題が生じたときに、指示の仕方が適切であったか、無理な負担を作業員に強いていないか、などを確認することが可能となる。 As a result, the user can confirm the contents of the conversation between the manager and the worker at the work site, for example, the work instruction of the day, as text information in the form of identifying each role. As a result, the administrator can check the contents of the work instruction as text information, which can be useful for looking back on the work contents of the day and for planning the work plan for the next day. In addition, the supervisor who manages and supervises the manager can confirm the contents of instructions given by the manager to the worker, the reaction of the worker, and the like as text information. Therefore, when a problem such as harassment occurs, the supervisor can confirm whether the instruction method is appropriate and whether the worker is forced to bear an unreasonable burden.

このように、サーバ２０は、音データから音声を抽出し、抽出した音声のテキスト情報への変換、変換したテキスト情報に基づいて発声者の役割を推定するようにしている。また、サーバ２０は、受信した一つ、または複数の音データから、複数の発声者の役割を推定するようにしている。このため、サーバ２０は、発声者について事前に登録された情報がなくても、発声者の役割を判別しながらテキスト情報をユーザへ提示することが可能となる。 In this way, the server 20 extracts voice from the sound data, converts the extracted voice into text information, and estimates the role of the speaker based on the converted text information. The server 20 also estimates the roles of a plurality of speakers from one or more pieces of received sound data. The server 20 can therefore present text information to the user while distinguishing the speakers' roles, even when no information about the speakers has been registered in advance.

＜６変形例＞ <6 Modification examples>
上記実施形態では、音声解析をサーバ２０で実施する場合を説明したが、音声解析はサーバ２０以外で実施されてもよい。例えば、エッジサーバ３０が音声解析を実施し、テキスト情報をサーバ２０へ送信してもよい。また、集音装置４０が音声解析を実施し、テキスト情報をエッジサーバ３０へ送信してもよい。なお、テキスト情報をサーバ２０へ送信する場合であっても、音信号をサーバ２０へ送信してもよい。 In the above embodiment, the case where voice analysis is performed on the server 20 has been described, but voice analysis may be performed somewhere other than the server 20. For example, the edge server 30 may perform voice analysis and transmit text information to the server 20. Alternatively, the sound collecting device 40 may perform voice analysis and transmit text information to the edge server 30. Even when text information is transmitted to the server 20, the sound signal may also be transmitted to the server 20.

また、上記実施形態では、推定処理をサーバ２０で実施する場合を説明したが、推定処理は音声解析の後であれば、サーバ２０以外で実施されてもよい。例えば、エッジサーバ３０、又は集音装置４０が音声解析を実施した場合には、エッジサーバ３０が推定処理を実施し、役割に関する情報をサーバ２０へ送信してもよい。また、集音装置４０が音声解析を実施した場合には、集音装置４０が推定処理を実施し、役割に関する情報をエッジサーバ３０へ送信してもよい。 Further, in the above embodiment, the case where the estimation process is performed on the server 20 has been described, but the estimation process may be performed on a server other than the server 20 as long as it is after the voice analysis. For example, when the edge server 30 or the sound collecting device 40 performs voice analysis, the edge server 30 may perform estimation processing and transmit information on the role to the server 20. Further, when the sound collecting device 40 performs voice analysis, the sound collecting device 40 may perform estimation processing and transmit information regarding the role to the edge server 30.

＜第２の実施形態＞ <Second embodiment>
第１の実施形態では、集音装置４０のみを利用する場合を説明した。しかしながら、音声を抽出する方法はこれに限らない。第２の実施形態では、集音装置４０に加え、撮影装置５０を利用する方法について説明する。なお、第１の実施形態と同一の符号を付しているものについての詳細な説明は繰り返さない。 In the first embodiment, the case where only the sound collecting device 40 is used has been described. However, the method of extracting voice is not limited to this. The second embodiment describes a method that uses the photographing device 50 in addition to the sound collecting device 40. Detailed descriptions of elements bearing the same reference numerals as in the first embodiment are not repeated.

＜１システム全体の構成図＞ <1 Configuration diagram of the entire system>
図９は、第２の実施形態における、システム１Ａの全体の構成を示す図である。 FIG. 9 is a diagram showing the overall configuration of the system 1A in the second embodiment.

図９に示すように、システム１Ａは、サーバ２０Ａと、エッジサーバ３０と、集音装置４０と、撮影装置５０とを含む。サーバ２０Ａとエッジサーバ３０とは、ネットワーク８０を介して通信接続する。エッジサーバ３０は、集音装置４０と撮影装置５０と接続されている。例えば、集音装置４０と撮影装置５０は、情報機器間の近距離通信システムで用いられる通信規格に基づく送受信装置である。具体的には、集音装置４０と撮影装置５０は、例えば、Bluetooth（登録商標）モジュールなど２．４ＧＨｚ帯を使用して、Bluetooth（登録商標）モジュールを搭載した他の情報機器からのビーコン信号を受信する。エッジサーバ３０は、当該近距離通信を利用したビーコン信号に基づき、集音装置４０と撮影装置５０から送信される情報を取得する。このように、集音装置４０と撮影装置５０は、取得した発声者の音声の情報、および発声者の動作情報を、ネットワーク８０を介さず、近距離通信によりエッジサーバ３０へ送信する。なお、エッジサーバ３０は、ネットワーク８０を介して集音装置４０と撮影装置５０と通信接続してもよい。 As shown in FIG. 9, the system 1A includes a server 20A, an edge server 30, a sound collecting device 40, and a photographing device 50. The server 20A and the edge server 30 are communicatively connected via the network 80. The edge server 30 is connected to the sound collecting device 40 and the photographing device 50. For example, the sound collecting device 40 and the photographing device 50 are transmission/reception devices based on a communication standard used in short-range communication systems between information devices. Specifically, the sound collecting device 40 and the photographing device 50 use the 2.4 GHz band, for example through a Bluetooth (registered trademark) module, to receive beacon signals from other information devices equipped with a Bluetooth (registered trademark) module. The edge server 30 acquires the information transmitted from the sound collecting device 40 and the photographing device 50 based on beacon signals using this short-range communication. In this way, the sound collecting device 40 and the photographing device 50 transmit the acquired voice information and motion information of the speaker to the edge server 30 by short-range communication, without going through the network 80. The edge server 30 may instead be communicatively connected to the sound collecting device 40 and the photographing device 50 via the network 80.

撮影装置５０は、受光素子により光を受光して、撮影画像として出力するためのデバイスである。撮影装置５０は、設定されている方向の画像を撮影し、撮影により得られる画像データに基づく画像信号をエッジサーバ３０へ送信する。撮影装置５０は、例えば、以下のいずれかのデバイスが想定される。 The photographing device 50 is a device that receives light with a light receiving element and outputs it as a captured image. The photographing device 50 captures an image in a set direction and transmits an image signal based on the resulting image data to the edge server 30. The photographing device 50 may be, for example, any of the following devices.
・可視光カメラ Visible light camera
・赤外線カメラ Infrared camera
・紫外線カメラ Ultraviolet camera
・超音波センサ Ultrasonic sensor
・ＲＧＢ－Ｄカメラ RGB-D camera
・ＬｉＤＡＲ（Light Detection and Ranging） LiDAR (Light Detection and Ranging)
図９では、撮影装置５０が１台である場合を例に示しているが、システム１Ａに収容される撮影装置５０は、複数台あっても構わない。 Although FIG. 9 shows an example with a single photographing device 50, the system 1A may include a plurality of photographing devices 50.

エッジサーバ３０は、集音装置４０から送信される音信号を受信し、受信した音信号を、サーバ２０へ送信する。また、エッジサーバ３０は、撮影装置５０から送信される画像信号を受信し、受信した画像信号を、サーバ２０へ送信する。 The edge server 30 receives the sound signal transmitted from the sound collecting device 40, and transmits the received sound signal to the server 20. Further, the edge server 30 receives the image signal transmitted from the photographing device 50, and transmits the received image signal to the server 20.

＜１．１サーバ２０Ａの構成＞ <1.1 Configuration of server 20A>
図１０は、第２の実施形態における、サーバ２０Ａの機能的な構成を示す図である。 FIG. 10 is a diagram showing the functional configuration of the server 20A in the second embodiment.

取得モジュール２０３３Ａは、受信制御モジュール２０３１で受信された音信号から音データを取得する。取得モジュール２０３３Ａは、取得した音データを音声情報データベース２０２２に記憶する。取得モジュール２０３３Ａは、受信制御モジュール２０３１で受信された画像信号から画像データを取得する。取得モジュール２０３３Ａは、取得した画像データを画像情報データベース２０２３に記憶する。取得モジュール２０３３Ａは、例えば、所定の要件を満たすと、取得した音データおよび画像データを、音声情報データベース２０２２および画像情報データベース２０２３にそれぞれ記憶する。所定の要件は、例えば、以下である。 The acquisition module 2033A acquires sound data from the sound signal received by the reception control module 2031. The acquisition module 2033A stores the acquired sound data in the voice information database 2022. The acquisition module 2033A acquires image data from the image signal received by the reception control module 2031. The acquisition module 2033A stores the acquired image data in the image information database 2023. For example, when a predetermined requirement is satisfied, the acquisition module 2033A stores the acquired sound data and image data in the voice information database 2022 and the image information database 2023, respectively. The predetermined requirements are, for example, as follows.
・録音／録画開始指示が入力されてから録音／録画終了指示が入力されるまで From the input of an audio/video recording start instruction until the input of an audio/video recording end instruction
・予め設定された時間への到達 Arrival of a preset time
・音の継続した発生（例えば、音が発生すると録音／録画を開始し、音が予め設定された期間発生しないと録音／録画を停止する） Continued occurrence of sound (for example, audio/video recording starts when sound occurs and stops when no sound occurs for a preset period)
・発声者の動作を検知（例えば、発声者の口の動きを検知すると録音／録画を開始し、動作が予め設定された期間発生しないと録音／録画を停止する） Detection of a speaker's motion (for example, audio/video recording starts when movement of the speaker's mouth is detected and stops when no motion occurs for a preset period)
・発声者が別の発声者を指定する動作を検知（例えば、録音、および撮影していた発声者が異なる発声者を指定する動作を検知すると、指定された対象の録音および撮影を開始し、動作が予め設定された期間発生しないと録音および撮影を停止する） Detection of a motion in which a speaker designates another speaker (for example, when the speaker being recorded and filmed is detected designating a different speaker, recording and filming of the designated person starts, and stops when no motion occurs for a preset period)
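The "continued occurrence of sound" requirement in the list above can be sketched as a simple start/stop rule over a sequence of sound levels. This is a minimal sketch under assumptions not stated in the source: the input is one level value per frame, "sound occurs" means the level reaches a threshold, and the preset period is counted in frames. The name `recording_spans` and its parameters are hypothetical.

```python
from typing import List, Tuple


def recording_spans(
    levels: List[float], threshold: float, hold_frames: int
) -> List[Tuple[int, int]]:
    """Return (start, stop) frame spans during which recording is active.

    Recording starts at the first frame whose level reaches `threshold`
    and stops once `hold_frames` consecutive frames stay below it.
    `stop` is exclusive and includes the trailing silent frames that
    were recorded before the stop condition was met.
    """
    spans = []
    start = None    # frame index where the current recording began
    silent = 0      # consecutive silent frames seen while recording
    for i, level in enumerate(levels):
        if level >= threshold:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= hold_frames:
                spans.append((start, i + 1))
                start, silent = None, 0
    if start is not None:
        # Input ended while still recording.
        spans.append((start, len(levels)))
    return spans
```

The motion-triggered requirements could follow the same pattern, with a per-frame "mouth movement detected" flag in place of the level comparison.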

画像情報データベース２０２３は、サーバ２０Ａが撮影装置５０で撮影された画像に基づく画像データを記憶する。 The image information database 2023 is where the server 20A stores image data based on images captured by the photographing device 50.

画像解析モジュール２０３７は、取得した画像データを解析することで、画像データから動作情報を抽出する。例えば、画像解析モジュール２０３７は、学習済みモデルを用い、撮影装置５０が撮影した画像から動作情報を抽出する。 The image analysis module 2037 extracts motion information from the acquired image data by analyzing it. For example, the image analysis module 2037 uses a trained model to extract motion information from images captured by the photographing device 50.

本実施形態において、学習済みモデルは、例えば、取得された画像データに対し、動作情報を出力するように学習されている。このとき、学習用データは、例えば、所定の動作を含む画像を入力データとし、その動作対象へのラベリング、ラベリングされた対象の変位を正解出力データとする。例えば、人物を含む画像を入力データとし、人物の口へのラベリング、ラベリングされた口の変位を正解出力データとする。なお、人物の手足のラベリング、ラベリングされた手足の変位を正解出力データとしてもよい。 In the present embodiment, the trained model is trained to output motion information with respect to the acquired image data, for example. At this time, for the learning data, for example, an image including a predetermined motion is used as input data, and labeling to the motion target and displacement of the labeled target are used as correct output data. For example, an image including a person is used as input data, and labeling of the person's mouth and displacement of the labeled mouth are used as correct output data. The labeling of the limbs of a person and the displacement of the labeled limbs may be used as correct output data.

画像解析モジュール２０３７は、例えば、取得した画像データから撮影された人の口の動作情報を抽出する。なお、抽出される動作情報は口に限定されず、ジェスチャー等の動作であってもよい。画像解析モジュール２０３７は、抽出した動作情報を、音声解析モジュール２０３４Ａに送信する。 The image analysis module 2037, for example, extracts motion information of a person's mouth taken from the acquired image data. The extracted motion information is not limited to the mouth, and may be a gesture or the like. The image analysis module 2037 transmits the extracted operation information to the voice analysis module 2034A.

音声解析モジュール２０３４Ａは、取得した音データと、画像解析によって得られた動作情報とから音声を抽出する。具体的には、音声解析モジュール２０３４Ａは、例えば、動作情報と同期して発声された音声を、その人物の発声であると認識し、その人物の音声として音データから抽出する。より具体的には、口の動きと同期して発声された音声を、口が動いた人物の発声であると認識し、その人物の音声とする。 The voice analysis module 2034A extracts voice from the acquired sound data and the operation information obtained by the image analysis. Specifically, the voice analysis module 2034A recognizes, for example, the voice uttered in synchronization with the motion information as the voice of the person, and extracts it from the sound data as the voice of the person. More specifically, the voice uttered in synchronization with the movement of the mouth is recognized as the utterance of the person whose mouth has moved, and is used as the voice of that person.

音声解析モジュール２０３４Ａは、撮影方向に複数の人物が含まれている場合において、それぞれの人物の音声を音データから抽出してもよい。また、音声解析モジュール２０３４Ａは、声の特徴、音が集音された方向、音が集音されたタイミング、音を集音した集音装置に基づいて音声を抽出してもよい。音声解析モジュール２０３４Ａは、単独で発声者の音声を抽出してもよいし、複数の手法を組み合わせて発声者の音声を抽出してもよい。 The voice analysis module 2034A may extract the voice of each person from the sound data when a plurality of people are included in the shooting direction. Further, the voice analysis module 2034A may extract the voice based on the characteristics of the voice, the direction in which the sound is collected, the timing at which the sound is collected, and the sound collecting device that collects the sound. The voice analysis module 2034A may extract the voice of the speaker alone, or may extract the voice of the speaker by combining a plurality of methods.
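The idea of attributing a voice to the person whose mouth moved in synchrony with it can be sketched as a time-overlap comparison. The sketch assumes, for illustration only, that voice activity and mouth motion are both available as (start, end) intervals in seconds. The source does not say how synchrony is measured, so the overlap criterion and the names `overlap` and `assign_speakers` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the intersection of two intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def assign_speakers(
    voice_segments: List[Interval],
    mouth_motion: Dict[str, List[Interval]],
) -> List[Tuple[Interval, Optional[str]]]:
    """Attribute each voice segment to the person whose mouth moved with it.

    A segment is assigned to the person with the largest total overlap
    between the segment and that person's mouth-motion intervals, or to
    None when nobody's mouth moved during the segment.
    """
    result = []
    for segment in voice_segments:
        best_person, best_overlap = None, 0.0
        for person, intervals in mouth_motion.items():
            total = sum(overlap(segment, interval) for interval in intervals)
            if total > best_overlap:
                best_person, best_overlap = person, total
        result.append((segment, best_person))
    return result
```

Segments left unassigned (None) could then be handled by the other techniques mentioned in the text, such as voice features or the direction from which the sound was collected.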

＜２データ構造＞ <2 Data structure>
図１１は、サーバ２０Ａが記憶する画像情報データベース２０２３のデータ構造を示す図である。 FIG. 11 is a diagram showing the data structure of the image information database 2023 stored in the server 20A.

図１１に示すように、画像情報データベース２０２３は、項目「日時」と、項目「画像ＩＤ」と、項目「音声ＩＤ」と、項目「データ」等を含む。 As shown in FIG. 11, the image information database 2023 includes an item "date and time", an item "image ID", an item "voice ID", an item "data", and the like.

項目「日時」は、画像を録画した日時を示す情報である。 The item "date and time" is information indicating the date and time when the image was recorded.

項目「画像ＩＤ」は、画像データを識別する情報を示す。 The item "image ID" indicates information for identifying image data.

項目「音声ＩＤ」は、関連付けられている音データを識別する情報を示す。画像データと音データとは、例えば、時刻情報に基づいて関連付けられている。 The item "voice ID" indicates information for identifying the associated sound data. The image data and the sound data are associated with each other based on, for example, time information.

項目「データ」は、画像データを記憶している。項目「データ」で記憶される画像データは、例えば、ｊｐｅｇ等のデータ形式で記憶されている。 The item "data" stores image data. The image data stored in the item "data" is stored in a data format such as JPEG.
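The record layout of the image information database 2023 and the time-based association with sound data can be sketched as below. The field names mirror the items "date and time", "image ID", "voice ID", and "data". The matching rule (nearest recording time within a tolerance) is an assumption made for illustration, since the source only states that the association is based on time information. The names `ImageRecord` and `link_by_time` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class ImageRecord:
    """One row of the image information database (items as in FIG. 11)."""
    recorded_at: datetime        # item "date and time"
    image_id: str                # item "image ID"
    voice_id: Optional[str]      # item "voice ID": associated sound data, if any
    data: bytes                  # item "data": e.g. JPEG-encoded image bytes


def link_by_time(
    image_time: datetime,
    sound_times: Dict[str, datetime],
    tolerance: timedelta = timedelta(seconds=1),
) -> Optional[str]:
    """Return the ID of the sound data recorded closest to `image_time`.

    Assumption: sound data further away than `tolerance` is not
    associated, in which case None is returned.
    """
    best_id, best_gap = None, tolerance
    for voice_id, sound_time in sound_times.items():
        gap = abs(sound_time - image_time)
        if gap <= best_gap:
            best_id, best_gap = voice_id, gap
    return best_id
```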

＜３小括＞ <3 Summary>
図１２は、第２の実施形態におけるシステム１Ａの概要を示す図である。図１２に示す例では、音声を取得する対象である人物Ａおよび人物Ｂの周囲に、集音装置４０が設置される。また、人物Ａおよび人物Ｂを撮影方向に含むように撮影装置５０が設置される。 FIG. 12 is a diagram showing an outline of the system 1A in the second embodiment. In the example shown in FIG. 12, the sound collecting device 40 is installed near person A and person B, whose voices are to be acquired. The photographing device 50 is installed so that person A and person B are within its photographing direction.

集音装置４０は、集音装置４０の周囲の音を取得する。集音装置４０は、取得した音信号をエッジサーバ３０に送信する。 The sound collecting device 40 acquires the sound around the sound collecting device 40. The sound collector 40 transmits the acquired sound signal to the edge server 30.

撮影装置５０は、撮影方向の画像を撮影する。撮影装置５０は、取得した画像信号をエッジサーバ３０に送信する。 The photographing device 50 captures an image in the photographing direction. The photographing device 50 transmits the acquired image signal to the edge server 30.

エッジサーバ３０は、受信した音信号と画像信号とをサーバ２０Ａに送信する。 The edge server 30 transmits the received sound signal and the image signal to the server 20A.

サーバ２０Ａは、画像データの画像解析結果を参照し、受信した音信号についての音データから撮影されている人物の音声を抽出する。サーバ２０Ａは、抽出した音声に対して音声認識処理を実行することで音声の内容をテキスト情報に変換する。サーバ２０Ａは、変換したテキスト情報から、音声の発声者の役割を推定する。 The server 20A refers to the image analysis result of the image data, and extracts the sound of the person being photographed from the sound data of the received sound signal. The server 20A converts the content of the voice into text information by executing the voice recognition process on the extracted voice. The server 20A estimates the role of the voice speaker from the converted text information.

これにより、サーバ２０Ａは、発声者が発した音声の内容を、発声者の役割とテキスト情報とを対応付けて記憶することが可能となる。 As a result, the server 20A can store the content of the voice uttered by the utterer in association with the role of the utterer and the text information.

これにより、サーバ２０Ａは、取得した音データと画像データとから、より正確に音声を抽出し、テキスト情報に変換することが可能となる。そのため、サーバ２０Ａは、発声者の音声が小さく、周囲の音との差別化が困難な場合でも、正確に発声者の音声を抽出することができる。 As a result, the server 20A can more accurately extract voice from the acquired sound data and image data and convert it into text information. Therefore, the server 20A can accurately extract the voice of the speaker even when the voice of the speaker is small and it is difficult to differentiate from the surrounding sound.

＜４動作＞ <4 Operation>
以下、サーバ２０Ａが集音装置４０で集音された音と撮影装置５０で撮影された動作とに基づき、テキストデータを生成する際の一連の処理について説明する。 The following describes the series of processes by which the server 20A generates text data based on the sound collected by the sound collecting device 40 and the motion captured by the photographing device 50.

図１３は、サーバ２０Ａの制御部２０３Ａが音データと画像データとに基づいてテキストデータを生成する際の一連の処理を示すフローチャートである。以下の説明では、例えば、集音装置４０の周囲に、図１２に示すように人物Ａおよび人物Ｂがおり、人物Ａおよび人物Ｂを撮影方向に含むように撮影装置５０が設置される場合を例に説明する。 FIG. 13 is a flowchart showing a series of processes when the control unit 203A of the server 20A generates text data based on the sound data and the image data. In the following description, for example, there are a person A and a person B around the sound collecting device 40 as shown in FIG. 12, and the photographing device 50 is installed so as to include the person A and the person B in the photographing direction. Let's take an example.

集音装置４０は、周囲の音を集音する。このとき、例えば、人物Ａが所定の発言をし、その後に、人物Ｂが人物Ａの発言に対する応答をしたとする。集音装置４０が集音した音には、人物Ａの音声の後に、人物Ｂの音声が含まれる。集音装置４０は、集音した音についての音信号を、エッジサーバ３０を介してサーバ２０Ａへ送信する。 The sound collecting device 40 collects ambient sounds. At this time, for example, it is assumed that the person A makes a predetermined remark and then the person B responds to the remark of the person A. The sound collected by the sound collecting device 40 includes the voice of the person A followed by the voice of the person B. The sound collecting device 40 transmits a sound signal about the collected sound to the server 20A via the edge server 30.

撮影装置５０は、撮影方向の画像を撮影する。撮影装置５０が撮影した画像には、人物Ａの動作と、人物Ｂの動作とが含まれる。撮影装置５０は、撮影した画像についての画像信号を、エッジサーバ３０を介してサーバ２０Ａへ送信する。 The photographing device 50 captures an image in the photographing direction. The image captured by the photographing device 50 includes the motion of person A and the motion of person B. The photographing device 50 transmits an image signal for the captured image to the server 20A via the edge server 30.

ステップＳ１３０１において、制御部２０３Ａは、エッジサーバ３０から受信した画像信号から画像データを取得する。 In step S1301, the control unit 203A acquires image data from the image signal received from the edge server 30.

ステップＳ１３０２において、制御部２０３Ａは、取得した画像データを解析することで、画像データから動作情報を抽出する。制御部２０３Ａは、例えば、撮影方向に含まれる人物Ａおよび人物Ｂの動作、例えば、発言に伴う口の動き、ジェスチャー等についての動作情報を抽出する。 In step S1302, the control unit 203A extracts operation information from the image data by analyzing the acquired image data. The control unit 203A extracts, for example, motion information about the motions of the person A and the person B included in the shooting direction, for example, the movement of the mouth accompanying the speech, the gesture, and the like.

ステップＳ１３０３において、制御部２０３Ａは、取得した画像データの画像解析結果に基づいて、音データを解析する。具体的には、制御部２０３Ａは、人物Ａおよび人物Ｂの口の動きと同期して発声された音声を、人物Ａおよび人物Ｂの発声であると認識し、人物Ａおよび人物Ｂの音声として音データから抽出する。 In step S1303, the control unit 203A analyzes the sound data based on the image analysis result of the acquired image data. Specifically, the control unit 203A recognizes the voices uttered in synchronization with the movements of the mouths of the person A and the person B as the voices of the person A and the person B, and as the voices of the person A and the person B. Extract from sound data.

なお、ここでは、制御部２０３Ａが、声の特徴および発声者の動作情報、特に口の動きに基づいて音データから音声を抽出する場合を例に説明した。制御部２０３Ａは、声の特徴、集音装置４０の指向方向、音が集音されたタイミング、集音に用いられた集音装置等から成る群から選択される少なくとも１つと、撮影装置５０の撮影した、発声者の他の動作、例えば、発声に伴うジェスチャー、異なる発声者を指定する動き等から成る群から選択される少なくとも１つとを組み合わせて利用して音声を抽出してよい。 Here, the case where the control unit 203A extracts voice from the sound data based on voice features and the speaker's motion information, in particular mouth movement, has been described as an example. The control unit 203A may extract voice by combining at least one item selected from the group consisting of voice features, the directivity of the sound collecting device 40, the timing at which the sound was collected, the sound collecting device used for the collection, and the like, with at least one item selected from the group consisting of other motions of the speaker captured by the photographing device 50, for example, gestures accompanying speech, movements designating a different speaker, and the like.

＜５変形例＞ <5 Modification examples>
上記実施形態では、画像解析および、画像解析結果に基づいた音声解析をサーバ２０Ａで実施する場合を説明したが、一連の解析処理はサーバ２０Ａ以外で実施されてもよい。例えば、エッジサーバ３０が画像解析および、画像解析結果に基づいた音声解析を実施し、テキスト情報をサーバ２０Ａへ送信してもよい。また、撮影装置５０が画像解析を実施し、画像解析の結果を集音装置４０に送信することで、集音装置４０が音声解析を実施し、テキスト情報をエッジサーバ３０へ送信してもよい。 In the above embodiment, the case where image analysis and voice analysis based on the image analysis result are performed on the server 20A has been described, but this series of analysis processes may be performed somewhere other than the server 20A. For example, the edge server 30 may perform image analysis and voice analysis based on the image analysis result, and transmit text information to the server 20A. Alternatively, the photographing device 50 may perform image analysis and transmit the result to the sound collecting device 40, and the sound collecting device 40 may then perform voice analysis and transmit text information to the edge server 30.

Further, in the above embodiment, the estimation process is performed on the server 20A, but the estimation process may be performed on a device other than the server 20A, provided that it takes place after the voice analysis. For example, when the edge server 30 or the sound collecting device 40 has performed the voice analysis, the edge server 30 may perform the estimation process and transmit information on the role to the server 20A. Likewise, when the sound collecting device 40 has performed the voice analysis, the sound collecting device 40 may perform the estimation process and transmit information on the role to the edge server 30.

Further, in the above embodiment, the case where the estimation module 2035 estimates the role of the speaker using a trained model has been described as an example. However, the estimation module 2035 may estimate the role of the speaker without using a trained model. For example, the storage unit 202 stores in advance a table in which roles are associated with predetermined wording. The estimation module 2035 refers to the table and estimates the role from the text information.
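The table-based alternative can be sketched as follows. This is only an illustration under stated assumptions: the table contents, the role names, the function name `estimate_role`, and the scoring rule (counting phrase occurrences) are all hypothetical, since the disclosure only states that roles are associated with predetermined wording.

```python
# Hypothetical table associating each role with predetermined wording.
ROLE_TABLE = {
    "lead practitioner": ["scalpel", "suction"],
    "assistant": ["yes, doctor", "here you are"],
}


def estimate_role(text, table=ROLE_TABLE):
    """Return the role whose registered phrases occur most often in text.

    Returns None when the table is empty or no registered phrase occurs.
    On a tie, the role listed first in the table is returned.
    """
    lowered = text.lower()
    scores = {
        role: sum(lowered.count(phrase) for phrase in phrases)
        for role, phrases in table.items()
    }
    if not scores:
        return None
    role, score = max(scores.items(), key=lambda item: item[1])
    return role if score > 0 else None


role_a = estimate_role("Scalpel, please.")
role_b = estimate_role("Here you are.")
role_c = estimate_role("Good morning.")
```

Because the lookup works on the text information alone, no voice information needs to be registered in advance; the speaker is labeled by role, not by identity.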

<Additional notes>
The matters described in the above embodiments are appended below.

(Appendix 1)
A program to be executed by a computer 20 including a processor 29 and a memory 25, the program causing the processor 29 to execute: a step (S501) of acquiring sound collected by a sound collecting device 40; a step (S502) of extracting at least one voice from the acquired sound; a step (S503) of converting the extracted voice into text information by analyzing it; a step (S504) of estimating the role of the speaker of the extracted voice based on the text information; and a step (S505) of presenting the converted text information to a user such that the role is identifiable.

(Appendix 2)
The program according to Appendix 1, wherein, in the extracting step (S502), the at least one voice is extracted based on information about voice characteristics. (Paragraph 0063)

(Appendix 3)
The program according to Appendix 1, wherein, in the extracting step (S502), the at least one voice is extracted based on information about the direction of the sound. (Paragraph 0063)

(Appendix 4)
The program according to Appendix 1, wherein, in the extracting step (S502), the at least one voice is extracted based on information about the timing at which the sound is acquired. (Paragraph 0063)

(Appendix 5)
The program according to Appendix 1, causing the processor 29 to execute a step (S1301) of acquiring an image captured by a photographing device and a step (S1302) of acquiring motion information of the speaker from the acquired image, wherein, in the extracting step (S502), the voice is extracted based on the timing at which the sound was collected and the timing at which the motion information was acquired. (Paragraph 0095)

(Appendix 6)
The program according to Appendix 5, wherein the motion information is motion information of the mouth or limbs of the speaker captured by the photographing device 50. (Paragraph 0095)

(Appendix 7)
The program according to any one of Appendices 1 to 6, wherein, in the estimating step (S504), the role of the speaker is estimated based on preset role information. (Paragraph 0039)

(Appendix 8)
The program according to any one of Appendices 1 to 6, wherein, in the estimating step (S504), the role of the speaker is estimated by inputting the text information into a trained model that has been trained with character information of predetermined remarks as input data and the role of the person making each remark as correct-answer output data. (Paragraph 0040)

(Appendix 9)
The program according to any one of Appendices 1 to 8, wherein a plurality of voices are extracted in the extracting step (S502), the extracted plurality of voices are each analyzed and converted into a plurality of pieces of text information in the converting step (S503), and the roles of the speakers of the extracted plurality of voices are each estimated based on the converted plurality of pieces of text information in the estimating step (S504). (Paragraph 0036)

(Appendix 10)
The program according to Appendix 9, wherein, in the estimating step (S504), a person in charge who mainly performs a medical procedure and a person in charge who assists that person are each estimated as the roles of the speakers of the plurality of voices. (Paragraph 0074)

(Appendix 11)
The program according to Appendix 9, wherein, in the estimating step (S504), a main speaker and a viewer who views and listens to that speaker's talk are each estimated as the roles of the speakers of the plurality of voices. (Paragraph 0079)

(Appendix 12)
The program according to Appendix 9, wherein, in the estimating step (S504), a manager and a person managed by that manager are each estimated as the roles of the speakers of the plurality of voices. (Paragraph 0084)

(Appendix 13)
A method executed by a computer 20 including a processor 29 and a memory 25, in which the processor 29 executes: a step (S501) of acquiring sound collected by a sound collecting device 40; a step (S502) of extracting at least one voice from the acquired sound; a step (S503) of converting the extracted voice into text information by analyzing it; a step (S504) of estimating the role of the speaker of the extracted voice based on the text information; and a step (S505) of presenting the converted text information to a user such that the role is identifiable.

(Appendix 14)
An information processing device 20 including a control unit 203, in which the control unit 203 executes: a step (S501) of acquiring sound collected by a sound collecting device 40; a step (S502) of extracting at least one voice from the acquired sound; a step (S503) of converting the extracted voice into text information by analyzing it; a step (S504) of estimating the role of the speaker of the extracted voice based on the text information; and a step (S505) of presenting the converted text information to a user such that the role is identifiable.

(Appendix 15)
A system including: means (S501) for acquiring sound collected by a sound collecting device 40; means (S502) for extracting at least one voice from the acquired sound; means (S503) for converting the extracted voice into text information by analyzing it; means (S504) for estimating the role of the speaker of the extracted voice based on the text information; and means (S505) for presenting the converted text information to a user such that the role is identifiable.

20 server, 22 communication IF, 23 input/output IF, 25 memory, 26 storage, 29 processor, 30 edge server, 40 sound collecting device, 50 photographing device, 80 network, 201 communication unit, 202 storage unit, 203 control unit, 2021 text information database, 2022 voice information database, 2023 image information database.

Claims

A program to be executed by a computer including a processor and a memory, the program causing the processor to execute:
a step of acquiring sound collected by a sound collecting device;
a step of extracting at least one voice from the acquired sound;
a step of converting the extracted voice into text information by analyzing it;
a step of estimating a role of the speaker of the extracted voice based on the text information; and
a step of presenting the converted text information to a user together with the role estimated for the speaker,
wherein, in the estimating step, information about the position of the speaker is estimated as the role of the speaker without identifying the speaker.

The program according to claim 1, wherein in the step of presenting, the converted text information is presented to the user so that the role can be identified.

The program according to claim 1 or 2, wherein, in the extracting step, the at least one voice is extracted based on information about voice characteristics.

The program according to claim 1 or 2, wherein, in the extracting step, the at least one voice is extracted based on information about the direction of the sound.

The program according to claim 1 or 2, wherein, in the extracting step, the at least one voice is extracted based on information about the timing at which the sound is acquired.

The program according to claim 1 or 2, causing the processor to execute:
a step of acquiring an image captured by a photographing device; and
a step of acquiring motion information of the speaker from the acquired image,
wherein, in the extracting step, the voice is extracted based on the timing at which the sound was collected and the timing at which the motion information was acquired.

The program according to claim 6, wherein the motion information is motion information of the mouth or limbs of the speaker.

The program according to any one of claims 1 to 7, wherein, in the estimating step, the role of the speaker is estimated based on preset role information.

The program according to any one of claims 1 to 7, wherein, in the estimating step, the role of the speaker is estimated by inputting the text information into a trained model that has been trained with character information of predetermined remarks as input data and the role of the person making each remark as correct-answer output data.

The program according to any one of claims 1 to 9, wherein:
in the extracting step, a plurality of voices are extracted;
in the converting step, the extracted plurality of voices are each analyzed and converted into a plurality of pieces of text information; and
in the estimating step, the roles of the speakers of the extracted plurality of voices are each estimated based on the converted plurality of pieces of text information.

The program according to claim 10, wherein, in the estimating step, a person in charge who mainly performs a medical procedure and a person in charge who assists that person are each estimated as the roles of the speakers of the plurality of voices.

The program according to claim 10, wherein, in the estimating step, a main speaker and a viewer who views and listens to that speaker's talk are each estimated as the roles of the speakers of the plurality of voices.

The program according to claim 10, wherein, in the estimating step, a manager and a person managed by that manager are each estimated as the roles of the speakers of the plurality of voices.

A method executed by a computer including a processor and a memory, in which the processor executes:
a step of acquiring sound collected by a sound collecting device;
a step of extracting at least one voice from the acquired sound;
a step of converting the extracted voice into text information by analyzing it;
a step of estimating a role of the speaker of the extracted voice based on the text information; and
a step of presenting the converted text information to a user together with the role estimated for the speaker,
wherein, in the estimating step, information about the position of the speaker is estimated as the role of the speaker without identifying the speaker.

An information processing device including a control unit, wherein the control unit executes:
a step of acquiring sound collected by a sound collecting device;
a step of extracting at least one voice from the acquired sound;
a step of converting the extracted voice into text information by analyzing it;
a step of estimating a role of the speaker of the extracted voice based on the text information; and
a step of presenting the converted text information to a user together with the role estimated for the speaker,
wherein, in the estimating step, information about the position of the speaker is estimated as the role of the speaker without identifying the speaker.

A system including:
means for acquiring sound collected by a sound collecting device;
means for extracting at least one voice from the acquired sound;
means for converting the extracted voice into text information by analyzing it;
means for estimating a role of the speaker of the extracted voice based on the text information; and
means for presenting the converted text information to a user together with the role estimated for the speaker,
wherein the estimating means estimates information about the position of the speaker as the role of the speaker without identifying the speaker.