JP2022048516A

JP2022048516A - Information processing unit, program and information processing method

Info

Publication number: JP2022048516A
Application number: JP2020154373A
Authority: JP
Inventors: 尚史福江; Naofumi Fukue; 啓介小西; Keisuke Konishi
Original assignee: TIS Inc
Current assignee: TIS Inc
Priority date: 2020-09-15
Filing date: 2020-09-15
Publication date: 2022-03-28

Abstract

To provide an information processing unit, a program and an information processing method that can precisely record the details of a speaker's speech even if the speaker's voice is hard to recognize. SOLUTION: An information processing unit comprises: a first speech acquisition part which acquires first speech data on a first speech of a first speaker; a reproduction part which reproduces the first speech based upon the first speech data when a repetition mode for acquiring a speech repeating the first speech is set; a second speech acquisition part which acquires second speech data on a second speech of a second speaker as speech data on the repeated speech when the repetition mode is set; a recognition result acquisition part which acquires, based upon the first speech data and the second speech data, first text information representing a first recognition result of the first speech and second text information representing a second recognition result of the second speech; and a recording generation part which generates, based upon the first text information and the second text information, text-based recording data on the speech. SELECTED DRAWING: Figure 3

Description

The present invention relates to an information processing device, a program, and an information processing method.

Conventionally, a technique is known that acquires the voices of speakers at a meeting and creates minutes from the acquired voices by using voice recognition technology.

The minutes creation system disclosed in Patent Document 1 below performs voice recognition on a speaker's voice by using a preset dictionary and, for terms that were not recognized as a result, outputs a recognition request to a second dictionary. The minutes creation system then receives the recognition result from the second dictionary and creates the minutes. With such a minutes creation system, the portions that could not be recognized with the preset dictionary can be supplemented by voice recognition with the second dictionary, so the recognition accuracy can be improved. This reduces the work the minutes creator must do to check and correct the created minutes.

Patent Document 1: Japanese Unexamined Patent Publication No. 2017-191533

In the minutes creation system of Patent Document 1, if the voice of a speaker at the meeting is hard to recognize in the first place (for example, because the volume is low), recognition remains difficult even with the second dictionary. As a result, the correction work of the minutes creator cannot be reduced.

Therefore, an object of the present invention is to provide an information processing device, a program, and an information processing method capable of accurately recording the content of an utterance even if the voice of the speaker is hard to recognize.

An information processing device according to one aspect of the present invention includes: a first voice acquisition unit that acquires first voice data of a first voice uttered by a first speaker; a playback unit that plays back the first voice based on the first voice data when a repeat mode for acquiring a voice that repeats the first voice is set; a second voice acquisition unit that, when the repeat mode is set, acquires second voice data of a second voice uttered by a second speaker as the voice data of the repeating voice; a recognition result acquisition unit that acquires, based on the first voice data and the second voice data, first text information indicating a first recognition result of the first voice and second text information indicating a second recognition result of the second voice; and a record generation unit that generates text-based recorded data of the utterance based on the first text information and the second text information.

A program according to one aspect of the present invention causes a computer to realize: a first voice acquisition function that acquires first voice data of a first voice uttered by a first speaker; a playback function that plays back the first voice based on the first voice data when a repeat mode for acquiring a voice that repeats the first voice is set; a second voice acquisition function that, when the repeat mode is set, acquires second voice data of a second voice uttered by a second speaker as the voice data of the repeating voice; a recognition result acquisition function that acquires, based on the first voice data and the second voice data, first text information indicating a first recognition result of the first voice and second text information indicating a second recognition result of the second voice; and a record generation function that generates text-based recorded data of the utterance based on the first text information and the second text information.

In an information processing method according to one aspect of the present invention, a computer: acquires first voice data of a first voice uttered by a first speaker; plays back the first voice based on the first voice data when a repeat mode for acquiring a voice that repeats the first voice is set; acquires, when the repeat mode is set, second voice data of a second voice uttered by a second speaker as the voice data of the repeating voice; acquires, based on the first voice data and the second voice data, first text information indicating a first recognition result of the first voice and second text information indicating a second recognition result of the second voice; and generates text-based recorded data of the utterance based on the first text information and the second text information.

According to the present invention, it is possible to provide an information processing device, a program, and an information processing method capable of accurately recording the content of an utterance even if the voice of the speaker is hard to recognize.

FIG. 1 is a diagram for explaining a system configuration example of the minutes creation system according to the present embodiment.
FIG. 2 is a diagram for explaining an overview of the minutes creation system according to the present embodiment.
FIG. 3 is a diagram showing an example of the functional configuration of the recording device according to the present embodiment.
FIG. 4 is a diagram showing examples of screens of the minutes creation system according to the present embodiment.
FIG. 5 is a diagram showing an example of the relationship between the recognition rate of the minutes creation system according to the present embodiment and the volume or the distance to the speaker.
FIG. 6 is a diagram showing an example of the relationship between power and frequency in the minutes creation system according to the present embodiment.
FIG. 7 is a diagram showing an operation example of the recording device according to the present embodiment.
FIG. 8 is a diagram showing an example of the hardware configuration of the recording device according to the present embodiment.

A preferred embodiment of the present invention (hereinafter referred to as "the present embodiment") will be described with reference to the accompanying drawings. In the figures, elements with the same reference numerals have the same or similar configurations.

In the present embodiment, "unit", "means", "device", and "system" do not simply mean physical means; they also cover cases where the functions of the "unit", "means", "device", or "system" are realized by software. The functions of one "unit", "means", "device", or "system" may be realized by two or more physical means or devices, and the functions of two or more "units", "means", "devices", or "systems" may be realized by one physical means or device.

<1. System configuration>
A system configuration example of the minutes creation system 1 according to the present embodiment will be described with reference to FIG. 1. The minutes creation system 1 is a system that records the content of users' utterances at a meeting or the like as minutes. However, the present invention is not limited to this. The present invention is applicable not only to minutes but also to various systems that record the content of users' utterances. As shown in FIG. 1, the minutes creation system 1 includes a recording device 100 and a user terminal 200. The minutes creation system 1 is also connected to a voice recognition system 300 via a network N.

The network N is configured as a wireless network, a wired network, or the like. Examples include networks conforming to a mobile phone network, a PHS (Personal Handy-phone System) network, a wireless LAN (Local Area Network), 3G (3rd Generation), LTE (Long Term Evolution), 4G (4th Generation), 5G (5th Generation), WiMax (registered trademark), infrared communication, Bluetooth (registered trademark), a wired LAN, a telephone line, a power line network, IEEE 1394, and the like.

The recording device 100 is an information processing device capable of communicating with the user terminal 200 and the voice recognition system 300. The recording device 100 acquires the voices of the first speaker and the second speaker, both described later, converts the acquired voices into text by voice recognition, and records the text.

The recording device 100 is a so-called smart speaker that responds to the acquired voice by dialogue or the like, but it is not limited to this. As another example, the recording device 100 may be a general-purpose tablet terminal, a smartphone, or the like. For example, a general-purpose tablet terminal may be used as the recording device 100 by installing a dedicated program on it and running that program.

The user terminal 200 is an information processing device, such as a smartphone or a laptop terminal, capable of accepting input of requests from a user and communicating with the recording device 100. By executing a predetermined program, the user terminal 200 cooperates with the recording device 100 to display the text data recorded by voice recognition (hereinafter also referred to as "recorded data") and to display a form for editing the recorded data so that the text data can be edited.

The users include the first speaker and the second speaker and, besides the speakers, persons involved with the minutes creation system 1, such as the person in charge of creating the minutes.

The voice recognition system 300 is a system capable of communicating with the recording device 100. The voice recognition system 300 recognizes a user's voice based on voice data indicating the user's voice (hereinafter also simply referred to as "voice data") received from the recording device 100.

<2. System overview>
An overview of the minutes creation system 1 will be described with reference to FIG. 2.

(1) As shown in FIG. 2, the first voice acquisition unit 131 of the recording device 100 acquires first voice data of "We will start the meeting" as the first voice uttered by the first speaker. (2) The recognition result acquisition unit 112 of the recording device 100 instructs the voice recognition system 300 to perform voice recognition on the first voice, based on the first voice data acquired in (1) above. (3) The recognition result acquisition unit 112 of the recording device 100 acquires, from the voice recognition system 300, first text information indicating the first recognition result of the first voice.

The "first speaker" is a person who utters the first voice to be recorded by the minutes creation system 1. The first speaker may be, for example, a person who speaks at a meeting.

(4) The playback unit 151 of the recording device 100 plays back the first voice based on the first voice data when the repeat mode is set, for example by designation from the second speaker. That is, the playback unit 151 outputs the first voice "We will start the meeting". Here, the "repeat mode" is an operation mode for acquiring a voice that repeats the first voice (the second voice). The repeat mode may also be an operation mode in which the first voice is played back in order to acquire the second voice.

(5) When the repeat mode is set and the second speaker, while listening to the first voice, repeats it and utters the second voice "We will start the meeting", the second voice acquisition unit 132 of the recording device 100 acquires second voice data of this repeating second voice. (6) The recognition result acquisition unit 112 of the recording device 100 instructs the voice recognition system 300 to perform voice recognition on the second voice, based on the second voice data acquired in (5) above. (3) The recognition result acquisition unit 112 of the recording device 100 acquires, from the voice recognition system 300, second text information indicating the second recognition result of the second voice.

The "second speaker" is a person who repeats the first voice and thereby utters the second voice. When there is no particular need to distinguish between the "first speaker" and the "second speaker", they are hereinafter also collectively referred to as the "speaker".

When there is no particular need to distinguish between the "first voice data" and the "second voice data", they are hereinafter also collectively referred to as "voice data".

When there is no particular need to distinguish between the "first recognition result" and the "second recognition result", they are hereinafter also collectively referred to as the "recognition result".

When there is no particular need to distinguish between the "first text information" and the "second text information", they are hereinafter also collectively referred to as "text information".

(7) The comparison unit 113 of the recording device 100 compares the first text information with the second text information. (8) The record generation unit 114 of the recording device 100 generates text-based recorded data as minutes, based on the comparison result of (7) above, the first text information, and the second text information. For example, the record generation unit 114 may generate the recorded data by replacing the portions of the first text information that did not match the second text information in the comparison with the corresponding second text information.

According to the above configuration, even if the first voice of the first speaker at the meeting is hard to recognize, the recording device 100 can generate text-based recorded data as minutes by comparison with the voice recognition result of the second voice, which repeats the first voice. The recording device 100 can therefore create the minutes accurately.
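Steps (1) to (8) above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function `create_minutes`, its callback parameters, and the representation of a recognition result as a list of per-portion strings are all assumptions introduced here, and the sketch assumes both recognition results contain the same number of portions.

```python
from typing import Any, Callable, List

Audio = Any  # placeholder for whatever type carries the voice data (assumption)


def create_minutes(
    first_voice_data: Audio,
    recognize: Callable[[Audio], List[str]],  # hypothetical stand-in for the voice recognition system
    play: Callable[[Audio], None],            # hypothetical stand-in for the playback unit
    capture_repeat: Callable[[], Audio],      # hypothetical stand-in for the second voice acquisition unit
    repeat_mode: bool,
) -> List[str]:
    # (2)-(3): obtain the first text information for the first voice.
    first_text = recognize(first_voice_data)
    if not repeat_mode:
        return first_text

    # (4): play back the first voice so that the second speaker can repeat it.
    play(first_voice_data)

    # (5)-(6): capture the repeated voice and obtain the second text information.
    second_voice_data = capture_repeat()
    second_text = recognize(second_voice_data)

    # (7)-(8): compare portion by portion; where the texts differ,
    # replace the first text with the second text.
    return [
        first if first == second else second
        for first, second in zip(first_text, second_text)
    ]
```

Passing the recognizer, playback, and capture steps in as callables keeps the sketch independent of any particular speech recognition service or audio hardware.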

<3. Functional configuration>
The functional configuration of the recording device 100 according to the present embodiment will be described with reference to FIG. 3. As shown in FIG. 3, the recording device 100 includes a control unit 110, a voice acquisition unit 130, a communication unit 140, an output unit 150, and a storage unit 160.

The control unit 110 includes a recognition result acquisition unit 112 and a record generation unit 114. The control unit 110 may further include, for example, a voice recognition unit 111, a comparison unit 113, an utterance data generation unit 115, a display unit 116, a reception unit 117, a reliability calculation unit 118, an estimation unit 119, an accuracy calculation unit 120, a voice data generation unit 121, or a processing unit 122.

The control unit 110 sets the operation mode of the recording device 100 to the repeat mode based on the designation of the repeat mode accepted by the reception unit 117.

The voice recognition unit 111 recognizes the speaker's voice data acquired by the voice acquisition unit 130 and generates text information indicating the result of this recognition. For example, in response to a voice recognition instruction from the recognition result acquisition unit 112, the voice recognition unit 111 may convert the voice data acquired by the voice acquisition unit 130 into text information by using voice recognition technology. The voice recognition unit 111 may also, for example, calculate its own recognition rate.

For example, if communication with the voice recognition system 300 becomes impossible while the communication unit 140 is transmitting voice data to the voice recognition system 300, the voice recognition unit 111 may generate text information based on the voice of the voice data that has not yet been transmitted.
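The fallback described above, where the on-device voice recognition unit takes over the voice data that could not be sent, might look like the following sketch. The names `recognize_with_fallback`, `remote`, and `local` are hypothetical, and the sketch assumes the voice data is sent in chunks and that a failed transmission surfaces as Python's built-in `ConnectionError`.

```python
from typing import Any, Callable, List, Sequence


def recognize_with_fallback(
    chunks: Sequence[Any],
    remote: Callable[[Any], str],  # hypothetical call to the external voice recognition system
    local: Callable[[Any], str],   # hypothetical call to the on-device voice recognition unit
) -> List[str]:
    texts: List[str] = []
    remote_available = True
    for chunk in chunks:
        if remote_available:
            try:
                texts.append(remote(chunk))
                continue
            except ConnectionError:
                # Communication failed: this chunk and all later ones count
                # as untransmitted and are recognized locally instead.
                remote_available = False
        texts.append(local(chunk))
    return texts
```

The sketch does not retry the remote system once it has failed; whether and when to reconnect is left open by the description.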

The recognition result acquisition unit 112 acquires first text information indicating the first recognition result of the first voice, based on the first voice data acquired by the voice acquisition unit 130. The recognition result acquisition unit 112 also acquires second text information indicating the second recognition result of the second voice, based on the second voice data acquired by the voice acquisition unit 130. For example, based on the first voice data and the second voice data, the recognition result acquisition unit 112 instructs the voice recognition system 300 or the voice recognition unit 111 to perform voice recognition on these voice data. The recognition result acquisition unit 112 acquires the first text information and the second text information as the response to this instruction.

For example, the recognition result acquisition unit 112 may acquire first text information and second text information divided into a plurality of sections, based on the first utterance data and the second utterance data generated by the utterance data generation unit 115 described later.

A "section" may include, for example, a silent section of the voice data (digital signal) in which the voice level is zero or at or below a predetermined threshold, and an utterance section (voiced section) in which the voice level is greater than zero or exceeds the predetermined threshold. As another example, a section may be a range delimited at every predetermined period.
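The threshold-based definition of a section can be illustrated with a short sketch. It assumes the voice level is already available as one non-negative number per frame; the function name `detect_sections` and the `(kind, start, end)` tuple layout are choices made for this example only.

```python
from typing import List, Sequence, Tuple


def detect_sections(
    levels: Sequence[float], threshold: float = 0.0
) -> List[Tuple[str, int, int]]:
    """Group consecutive frames into ("silent" | "utterance", start, end) runs.

    A frame whose level exceeds `threshold` counts as utterance; a frame at
    or below it counts as silent. `end` is exclusive.
    """
    sections: List[Tuple[str, int, int]] = []
    start = 0
    for i in range(1, len(levels) + 1):
        # Close the current run at the end of the data or when the
        # voiced/silent state changes.
        if i == len(levels) or (levels[i] > threshold) != (levels[start] > threshold):
            kind = "utterance" if levels[start] > threshold else "silent"
            sections.append((kind, start, i))
            start = i
    return sections


print(detect_sections([0, 0, 3, 5, 0, 2]))
# [('silent', 0, 2), ('utterance', 2, 4), ('silent', 4, 5), ('utterance', 5, 6)]
```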

The comparison unit 113 compares the first text information with the second text information. For example, the comparison unit 113 may compare the first text information with the second text information for each of a plurality of sections. As the comparison result, the comparison unit 113 outputs whether the first text information and the second text information match or do not match.

For example, the comparison unit 113 may compare, for each of a plurality of utterance sections, which of the first text information and the second text information has the higher voice recognition accuracy (hereinafter also simply referred to as "recognition accuracy"). The recognition accuracy may be, for example, the recognition rate in the voice recognition process. The comparison unit 113 may set this comparison result in an accuracy flag. Here, the "accuracy flag" is information indicating, for each section, which of the first text information and the second text information has the higher recognition accuracy. In the accuracy flag, for example, "1" is set for the text information with the relatively higher recognition accuracy and "0" is set for the other (the one with the relatively lower recognition accuracy).
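As a minimal sketch of the accuracy flag, the following function takes per-section recognition rates for the first and second text information and returns one `(first_flag, second_flag)` pair per section. The function name, the pair representation, and the tie-breaking rule (a tie favors the first text information) are assumptions; the description does not say how ties are handled.

```python
from typing import List, Sequence, Tuple


def set_accuracy_flags(
    first_rates: Sequence[float], second_rates: Sequence[float]
) -> List[Tuple[int, int]]:
    """Return (first_flag, second_flag) per section: 1 marks the more accurate text."""
    flags: List[Tuple[int, int]] = []
    for first_rate, second_rate in zip(first_rates, second_rates):
        if first_rate >= second_rate:  # assumption: ties favor the first text
            flags.append((1, 0))
        else:
            flags.append((0, 1))
    return flags


print(set_accuracy_flags([0.9, 0.4, 0.7], [0.8, 0.95, 0.7]))
# [(1, 0), (0, 1), (1, 0)]
```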

The record generation unit 114 generates text-based recorded data of the utterance based on the first text information and the second text information. The "recorded data of the utterance" here may be, for example, data that serves as the minutes of a meeting. The recorded data of the utterance is hereinafter also simply referred to as "recorded data".

According to the above configuration, even if the first voice of the first speaker at the meeting is hard to recognize, for example, the record generation unit 114 can supplement the first text information by also using the second text information, which is the voice recognition result of the second voice repeating the first voice. The record generation unit 114 can therefore accurately generate the recorded data as minutes.

For example, the record generation unit 114 may generate the recorded data based on the comparison result from the comparison unit 113. For example, based on the accuracy flag, the record generation unit 114 adopts, for each of a plurality of utterance sections, whichever of the first text information and the second text information has the higher recognition accuracy in the comparison result from the comparison unit 113. The record generation unit 114 then generates the recorded data by combining the first text information and the second text information adopted for each of the utterance sections.

According to the above configuration, the record generation unit 114 can adopt, for each section, whichever of the first text information and the second text information has the higher recognition accuracy as the recorded data. The record generation unit 114 can therefore generate the recorded data more accurately.
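Flag-based record generation can be sketched as below. The function name `generate_record` and the `(first_flag, second_flag)` pair format are assumptions made for illustration; the description only states that the text with the higher recognition accuracy is adopted per section and the adopted texts are combined.

```python
from typing import List, Sequence, Tuple


def generate_record(
    first_texts: Sequence[str],
    second_texts: Sequence[str],
    flags: Sequence[Tuple[int, int]],  # (first_flag, second_flag) per section
) -> List[str]:
    record: List[str] = []
    for first, second, (first_flag, _second_flag) in zip(first_texts, second_texts, flags):
        # Adopt the text whose accuracy flag is 1 for this section.
        record.append(first if first_flag == 1 else second)
    return record


record = generate_record(
    ["we will start", "the meating"],
    ["we will star", "the meeting"],
    [(1, 0), (0, 1)],
)
print(" ".join(record))  # we will start the meeting
```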

For example, the record generation unit 114 may generate the recorded data by combining the first text information and the second text information section by section, based on a selection, accepted by the reception unit 117 described later, of which of the first text information and the second text information to adopt as the record of the utterance.

According to the above configuration, the record generation unit 114 can have the comparison result displayed to the user and generate the recorded data by combining, for each section, whichever of the first text information and the second text information was selected. Because the user chooses between the first text information and the second text information, recorded data that meets the user's wishes can be generated. This improves the fitness for purpose and accuracy of the minutes.

For example, for a section in which the first text information and the second text information did not match in the comparison result from the comparison unit 113, the record generation unit 114 may generate the recorded data by overwriting the first text information or the second text information with text information accepted by the reception unit 117.

According to the above configuration, the record generation unit 114 can customize the recorded data with text information edited by the user, and can therefore generate recorded data that better meets the user's wishes. This improves the fitness for purpose and accuracy of the utterance record.
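Overwriting mismatched sections with user-edited text might be sketched as follows. The function name `apply_user_edits` and the `edits` mapping (section index to edited text) are hypothetical. The sketch also assumes that a mismatched section without an edit keeps the first text information, which the description does not specify.

```python
from typing import Dict, List, Sequence


def apply_user_edits(
    first_texts: Sequence[str],
    second_texts: Sequence[str],
    edits: Dict[int, str],  # section index -> text entered by the user
) -> List[str]:
    record: List[str] = []
    for index, (first, second) in enumerate(zip(first_texts, second_texts)):
        if first != second and index in edits:
            # Mismatched section: overwrite with the user's text.
            record.append(edits[index])
        else:
            # Matching sections are kept as they are; edits for them are ignored.
            record.append(first)
    return record


print(apply_user_edits(
    ["start", "agenda won", "close"],
    ["start", "agenda one", "close"],
    {1: "agenda 1"},
))
# ['start', 'agenda 1', 'close']
```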

The utterance data generation unit 115 generates a plurality of first utterance data corresponding to a plurality of sections of the first voice data. The utterance data generation unit 115 also generates a plurality of second utterance data corresponding to a plurality of sections of the second voice data. When there is no particular need to distinguish between the "first utterance data" and the "second utterance data", they are hereinafter also collectively referred to as "utterance data".

The utterance data generation unit 115 first detects a plurality of utterance sections and silent sections in the voice data. Next, the utterance data generation unit 115 divides the voice data into utterance data for each utterance section. In this way, the utterance data generation unit 115 generates a plurality of utterance data corresponding to the plurality of utterance sections of the voice data.
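The detect-then-split procedure can be illustrated with a simplified sketch that works directly on raw samples. The `UtteranceData` class, the function name, and the per-sample amplitude threshold are assumptions; a practical detector would more likely work on frame energy and tolerate short pauses inside an utterance.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class UtteranceData:
    index: int            # order of the utterance section within the voice data
    start: int            # first sample of the section (inclusive)
    end: int              # last sample of the section (exclusive)
    samples: List[float]  # the voice data belonging to this section


def generate_utterance_data(
    samples: Sequence[float], threshold: float
) -> List[UtteranceData]:
    utterances: List[UtteranceData] = []
    start: Optional[int] = None
    for i, sample in enumerate(samples):
        voiced = abs(sample) > threshold
        if voiced and start is None:
            start = i  # an utterance section begins
        elif not voiced and start is not None:
            # A silent sample ends the current utterance section.
            utterances.append(UtteranceData(len(utterances), start, i, list(samples[start:i])))
            start = None
    if start is not None:
        # The voice data ended while still inside an utterance section.
        utterances.append(UtteranceData(len(utterances), start, len(samples), list(samples[start:])))
    return utterances
```

Silent sections are used only as boundaries here and produce no utterance data of their own.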

The display unit 116 causes the user terminal 200 to display the comparison result from the comparison unit 113. For example, the display unit 116 generates display information as the comparison result. The display information includes, for example, the recorded data and the accuracy flag for each of the plurality of sections. This "recorded data for each of the plurality of sections" includes the first text information of each section and the second text information of each section. The display information is also information for causing the user terminal 200 to display screens of the minutes creation system 1, such as the first comparison screen A1 and the second comparison screen A2 shown in FIG. 4. The display unit 116 transmits the generated display information to the user terminal 200 via the communication unit 140.
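One possible shape for the display information is sketched below. The patent does not specify a data format, so the JSON layout, the `SectionDisplay` field names, and the function `build_display_info` are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Sequence, Tuple


@dataclass
class SectionDisplay:
    section: int               # section number
    first_text: str            # first text information for the section
    second_text: str           # second text information for the section
    match: bool                # whether the two texts match
    first_accuracy_flag: int   # 1 if the first text is the more accurate one
    second_accuracy_flag: int  # 1 if the second text is the more accurate one


def build_display_info(
    first_texts: Sequence[str],
    second_texts: Sequence[str],
    flags: Sequence[Tuple[int, int]],
) -> str:
    rows = [
        SectionDisplay(i, first, second, first == second, flag[0], flag[1])
        for i, (first, second, flag) in enumerate(zip(first_texts, second_texts, flags))
    ]
    # Serialize so the result can be sent to the user terminal.
    return json.dumps({"sections": [asdict(row) for row in rows]}, ensure_ascii=False)
```

`ensure_ascii=False` keeps non-ASCII text such as Japanese readable in the serialized output.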

The display unit 116 may, for example, display the reliability of each of the first recognition result and the second recognition result on the user terminal 200 together with the comparison result from the comparison unit 113. Here, "reliability" is the degree of certainty (confidence) of a voice recognition result. The display information includes these reliabilities.

According to the above configuration, the display unit 116 can show the user the reliability of each recognition result together with the comparison result between the first text information and the second text information. The user can therefore check the reliability, which can serve as an index, when selecting which of the first text information and the second text information to adopt. The display unit 116 can thus improve usability in selecting between the first text information and the second text information.

The display unit 116 may, for example, display the recognition accuracy of each of the first recognition result and the second recognition result on the user terminal 200 together with the comparison result from the comparison unit 113. The display information includes these recognition accuracies.

According to the above configuration, the display unit 116 can show the user the recognition accuracy of each recognition result together with the comparison result between the first text information and the second text information. The user can therefore check the reliability, which can serve as an index, when selecting which of the first text information and the second text information to adopt. The display unit 116 can thus improve usability in selecting between the first text information and the second text information.

The display unit 116 may, for example, cause the user terminal 200 to display an edit form for a section in which the first text information and the second text information did not match in the comparison result, so that the first text information or the second text information of that section can be edited. The display information includes this edit form.

The reception unit 117 accepts from the user terminal 200, for each of the plurality of sections, a selection of which of the first recognition result (first text information) and the second recognition result (second text information) is to be adopted as the record of the first speaker's utterance. For example, for a section in which the first text information and the second text information displayed by the display unit 116 did not match, the reception unit 117 may accept a selection of which of the two is to be adopted as the minutes of the first speaker's utterance.

受付部１１７は、例えば、ユーザ端末２００から、表示部１１６が表示させた編集フォームに対してユーザが入力したテキスト情報を受け付けてもよい。 The reception unit 117 may receive, for example, the text information input by the user for the edit form displayed by the display unit 116 from the user terminal 200.

Here, an example of the comparison screens displayed by the display unit 116 will be described with reference to FIG. 4. FIG. 4(a) shows an example of a first comparison screen that displays the first text information, divided by utterance section, as the first recognition result. FIG. 4(b) shows an example of a second comparison screen that displays the second text information, divided by utterance section, as the second recognition result. For ease of explanation, this example displays the first comparison screen and the second comparison screen as separate screens, but the configuration is not limited to this. The contents of the first comparison screen and the second comparison screen may, for example, be displayed side by side on a single screen.

As shown in FIG. 4(a), the display unit 116 causes the user terminal 200 to display the first comparison screen A1. The first comparison screen A1 includes a first voice data display area a11, a play button a12 for playing back the voice of the utterance data displayed in the first voice data display area a11 or the second voice data display area a21, and a save button a13 for saving the displayed and edited text information as recorded data.

The first voice data display area a11 includes a plurality of first utterance data display areas, each displaying the first utterance data of one utterance section. This example uses two of them, the first utterance data display area a111 and the first utterance data display area a112. The first utterance data display area a111 shows the first speaker as "participant 1" and displays the first text information "こんにちは" ("hello") recognized from the voice uttered by participant 1. The first utterance data display area a112 shows the first speaker as "participant 2" and displays the first text information "こちは" recognized from the voice uttered by participant 2.

図４（ｂ）に示すように、表示部１１６は、ユーザ端末２００に、第２比較画面Ａ２を表示させる。第２比較画面Ａ２は、第２音声データ表示エリアａ２１と、再生ボタンａ１２と、保存ボタンａ１３と、を含む。 As shown in FIG. 4B, the display unit 116 causes the user terminal 200 to display the second comparison screen A2. The second comparison screen A2 includes a second audio data display area a21, a play button a12, and a save button a13.

第２音声データ表示エリアａ２１は、第１音声データ表示エリアａ１１と同様に、複数の第２発話データ表示エリアを含む。 The second voice data display area a21 includes a plurality of second utterance data display areas, like the first voice data display area a11.

This example uses two of the plurality of second utterance data display areas: the second utterance data display area a211, which corresponds to the first utterance data display area a111, and the second utterance data display area a212, which corresponds to the first utterance data display area a112. For the first utterance data displayed in the first utterance data display area a111, the second utterance data of the second voice that repeats that first voice is displayed in the second utterance data display area a211. For the first utterance data displayed in the first utterance data display area a112, the second utterance data of the second voice that repeats that first voice is displayed in the second utterance data display area a212.

The second utterance data display area a211 shows the second speaker as "repeater 1" and displays the second text information "こんにちは" ("hello") recognized from the voice uttered by repeater 1. The second utterance data display area a212 likewise shows the second speaker as "repeater 1" and displays the second text information "こんにちは" ("hello") recognized from the voice uttered by repeater 1.

第１発話データ表示エリアａ１１１と第２発話データ表示エリアａ２１１とでは、該当の発話区間における第１テキスト情報と第２テキスト情報とが一致しているため、それぞれの認識結果を表示する。なお、このように一致している発話データ表示エリアのいずれかをユーザが押下（タップ操作・クリック操作など）した場合、表示部１１６は、この発話データ表示エリアのテキスト情報を編集するための編集入力ウィンドウａ１４（編集フォームの一態様）をユーザ端末２００に表示させてもよい。 In the first utterance data display area a111 and the second utterance data display area a211, since the first text information and the second text information in the corresponding utterance section match, the respective recognition results are displayed. When the user presses (tap operation, click operation, etc.) any of the utterance data display areas that match in this way, the display unit 116 edits the text information in this utterance data display area. The input window a14 (one aspect of the editing form) may be displayed on the user terminal 200.

The display unit 116 can change the display mode of the utterance data display areas for a mismatched portion so that it differs from that of matching portions and the mismatch between the first text information and the second text information can be seen at a glance. Specifically, because the first text information and the second text information of the corresponding utterance section do not match, the first utterance data display area a112 and the second utterance data display area a212 display the comparison result, an edit form and the like in addition to their respective recognition results. More specifically, the first utterance data display area a112 displays, as the comparison result, a troubled-face icon indicating that the recognition accuracy is relatively low, together with text information whose character color (for example, red) or font has been changed. The second utterance data display area a212 displays, as the comparison result, a smiling-face icon indicating that the recognition accuracy is relatively high, together with text information whose character color (for example, black) or font has been changed. The second utterance data display area a212 also serves as a text form (one aspect of the edit form) in which the user can edit the text information directly. When the recognition accuracy of the second recognition result is higher than a predetermined threshold, this text form may be shown in a display mode indicating that editing by the user is unnecessary (for example, by changing the background color to gray). When the user enters text information in this text form, the reception unit 117 accepts the entered text information.

For a portion where the first text information and the second text information do not match, when the recognition accuracy of the second recognition result is higher than that of the first recognition result, the display unit 116 displays the second utterance data display area a212 as a text form, as in the above example. On the other hand, when the recognition accuracy of the first recognition result is higher than that of the second recognition result, the display unit 116 may copy (overwrite) the first text information of the first utterance data display area a112 into the second utterance data display area a212 and then display the second utterance data display area a212 as a text form. In this case, the display unit 116 may display the second text information of the second recognition result in a remarks area (not shown) of the second utterance data display area a212.
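The rule for pre-filling the text form of a mismatched section can be sketched as below. The `EditForm` type and `build_edit_form` function are hypothetical names introduced only for illustration. The source does not say what happens when the two recognition accuracies are equal, so keeping the second text information in that case is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditForm:
    text: str                     # text pre-filled in the editable form
    remark: Optional[str] = None  # second text kept in the remarks area


def build_edit_form(
    first_text: str,
    second_text: str,
    first_accuracy: float,
    second_accuracy: float,
) -> Optional[EditForm]:
    """Return the pre-filled form for one section, or None when the texts match."""
    if first_text == second_text:
        return None  # no mismatch, so no mismatch form is generated
    if first_accuracy > second_accuracy:
        # Copy the first text into the form and keep the second text as a remark.
        return EditForm(text=first_text, remark=second_text)
    # Second result is more accurate (or tied, by assumption): show it as-is.
    return EditForm(text=second_text)
```

For the section in the example of FIG. 4, `build_edit_form("こちは", "こんにちは", 0.4, 0.9)` would pre-fill the form with the repeater's text; the accuracy values here are made up.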

As described above, for the mismatched first utterance data display area a112 and second utterance data display area a212, the user can select which one to adopt as the recorded data by pressing the corresponding area. When the user makes a selection, the reception unit 117 accepts it.
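One way the accepted selections could be turned into recorded data is sketched below. This is an assumption-laden illustration: the function `build_record`, the `"first"`/`"second"` selection labels, and the choice to raise an error for an unresolved mismatch are all hypothetical and not taken from the source.

```python
from typing import Dict, List, Tuple


def build_record(
    sections: List[Tuple[str, str]],
    selections: Dict[int, str],
) -> List[str]:
    """Build the text record, one entry per utterance section.

    sections   -- (first_text, second_text) for each section, in order
    selections -- section index -> "first" or "second", as chosen by the user
    """
    record: List[str] = []
    for index, (first_text, second_text) in enumerate(sections):
        if first_text == second_text:
            record.append(first_text)       # matching sections need no choice
        elif selections.get(index) == "first":
            record.append(first_text)
        elif selections.get(index) == "second":
            record.append(second_text)
        else:
            raise ValueError(f"section {index} is mismatched and has no selection")
    return record
```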

上記構成によれば、表示部１１６は、第１比較画面Ａ１と第２比較画面Ａ２とにより、第１テキスト情報と第２テキスト情報とを区間ごとに比較可能に表示させることができる。また、上記構成によれば、表示部１１６は、第１テキスト情報と第２テキスト情報の不一致箇所が一目でわかるようその表示態様を変更することができる。このため上記構成によれば、表示部１１６は、テキスト情報の確認やテキスト情報の選択などのＵＩにおいて、ユーザビリティを向上させることができる。 According to the above configuration, the display unit 116 can display the first text information and the second text information in a comparable manner for each section by the first comparison screen A1 and the second comparison screen A2. Further, according to the above configuration, the display unit 116 can change the display mode so that the mismatched portion between the first text information and the second text information can be seen at a glance. Therefore, according to the above configuration, the display unit 116 can improve usability in the UI such as confirmation of text information and selection of text information.

Returning to FIG. 3, the description continues. The reception unit 117 may, for example, accept from the user terminal 200 a designation to swap the audio of the left and right channels for stereophonic reproduction by the reproduction unit.

受付部１１７は、例えば、ユーザ端末２００から、復唱モードの指定を受け付けてもよい。 The reception unit 117 may accept the designation of the repeat mode from the user terminal 200, for example.

信頼度算出部１１８は、第１認識結果および第２認識結果それぞれの信頼度を算出する。信頼度算出部１１８は、例えば、認識結果に含まれる単語ごとの信頼度を算出し、算出した単語ごとの信頼度を集計して認識結果の信頼度を算出してもよい。 The reliability calculation unit 118 calculates the reliability of each of the first recognition result and the second recognition result. For example, the reliability calculation unit 118 may calculate the reliability of each word included in the recognition result, aggregate the calculated reliability of each word, and calculate the reliability of the recognition result.

The reliability of each word may, for example, take a value in a predetermined range (for example, 0.0 to 1.0). The closer the value is to 1.0, that is, to the upper limit of this range, the fewer competing candidates have a score similar to that of the word. Conversely, the closer the value is to 0.0, that is, to the lower limit, the more competing candidates have a score similar to that of the word. In other words, the closer the value is to the upper limit, the more it can be said that no other candidate came close to the top-ranked word and that the recognition result was output with confidence.

Several methods are conceivable for calculating the reliability of a word. One known example is Komatani and Kawahara, "Dialogue processing for efficient confirmation and guidance using the reliability of speech recognition results" (IPSJ Journal, Vol. 43, No. 10, pp. 3078-3086).
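The aggregation of per-word reliabilities into a reliability for the whole recognition result can be sketched as follows. The source only says that the per-word values are aggregated, so the arithmetic mean used here is an assumption, as are the function name `aggregate_reliability` and the choice to return 0.0 for an empty result. The per-word values themselves are taken as given; this sketch does not implement the cited method for computing them.

```python
from typing import List, Tuple


def aggregate_reliability(words: List[Tuple[str, float]]) -> float:
    """Aggregate per-word reliabilities (each in 0.0 to 1.0) into one value.

    Assumption: the aggregate is the arithmetic mean of the word values.
    """
    if not words:
        return 0.0  # assumption: an empty recognition result has no reliability
    for word, reliability in words:
        if not 0.0 <= reliability <= 1.0:
            raise ValueError(f"reliability of {word!r} is outside 0.0 to 1.0")
    return sum(reliability for _, reliability in words) / len(words)
```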

The estimation unit 119 estimates a first distance between the first speaker and the recording device 100 based on the first voice data. The estimation unit 119 also estimates a second distance between the second speaker and the recording device 100 based on the second voice data. Here, the "distance between the speaker and the recording device 100" (hereinafter also simply the "distance to the speaker") may specifically be the distance between the speaker and the plurality of microphones (microphone array) of the voice input device 817 (hereinafter also simply the "microphone").

推定部１１９は、例えば、発話者ごとの音声データに基づいて、発話者の方向や位置または発話者との距離などを推定する。推定部１１９は、推定結果（発話者の方向や位置または発話者との距離など）を位置情報として記憶部１６０に記録してもよい。推定部１１９は、例えば、音声入力装置８１７に入力された二つの音声信号の時間波形の間で相互相関関数を算出して、算出した相互相関関数より音の到達時間差を算出する。推定部１１９は、算出した音到達時間差に基づいて、発話者の方向や位置または距離を推定してもよい。 The estimation unit 119 estimates, for example, the direction and position of the speaker or the distance to the speaker based on the voice data of each speaker. The estimation unit 119 may record the estimation result (direction and position of the speaker, distance from the speaker, etc.) in the storage unit 160 as position information. For example, the estimation unit 119 calculates a cross-correlation function between the time waveforms of two voice signals input to the voice input device 817, and calculates the arrival time difference of the sound from the calculated cross-correlation function. The estimation unit 119 may estimate the direction, position, or distance of the speaker based on the calculated sound arrival time difference.
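The arrival-time-difference step can be sketched as below: a brute-force cross-correlation over a small lag window picks the lag with the highest correlation, and a far-field two-microphone model converts the time difference into a direction. The far-field model, the speed of sound of 343 m/s, and the names `arrival_time_difference`, `direction_from_tdoa`, `max_lag` and `mic_spacing_m` are assumptions for illustration; the source does not specify how direction, position or distance are derived from the time difference.

```python
import math
from typing import Sequence


def arrival_time_difference(
    sig_a: Sequence[float],
    sig_b: Sequence[float],
    sample_rate: float,
    max_lag: int,
) -> float:
    """Return the delay of sig_b relative to sig_a in seconds.

    The value is positive when the sound reaches microphone B later than A.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation value at this lag: sum of a[n] * b[n + lag].
        score = 0.0
        for n, a in enumerate(sig_a):
            m = n + lag
            if 0 <= m < len(sig_b):
                score += a * sig_b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


def direction_from_tdoa(
    tdoa: float,
    mic_spacing_m: float,
    speed_of_sound: float = 343.0,
) -> float:
    """Far-field angle of arrival in degrees, measured from broadside."""
    ratio = max(-1.0, min(1.0, tdoa * speed_of_sound / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

Real recordings are usually processed with an FFT-based correlation (often with a weighting such as GCC-PHAT) rather than this quadratic loop, which is kept only for readability.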

The accuracy calculation unit 120 calculates the recognition accuracy of each of the first recognition result and the second recognition result based on the combination of the first volume of the first voice and the second volume of the second voice (hereinafter also the "volume combination"). The accuracy calculation unit 120 also calculates the recognition accuracy of each of the first recognition result and the second recognition result based on the combination of the first distance and the second distance (hereinafter also the "distance combination"). The accuracy calculation unit 120 may calculate the recognition accuracy of each of the first recognition result and the second recognition result based on at least one of the volume combination and the distance combination.

精度算定部１２０は、例えば、音声認識システム３００による音声認識を利用する場合、音声認識システム３００から認識結果と併せて認識率を取得してもよい。精度算定部１２０は、例えば、音声認識部１１１による音声認識を利用する場合、音声認識部１１１から認識結果と併せて認識率を取得してもよい。 For example, when the voice recognition by the voice recognition system 300 is used, the accuracy calculation unit 120 may acquire the recognition rate together with the recognition result from the voice recognition system 300. For example, when using the voice recognition by the voice recognition unit 111, the accuracy calculation unit 120 may acquire the recognition rate together with the recognition result from the voice recognition unit 111.

For the volume combination, for example, the accuracy calculation unit 120 may construct a first pattern model of volume and recognition rate, as shown in FIG. 5(a), by inputting as learning data the voice volume in a predetermined learning period and the corresponding voice recognition rate. The accuracy calculation unit 120 may, for example, construct the first pattern model by statistical processing based on regression analysis, with the volume as the explanatory variable (feature amount) and the recognition rate as the objective variable (feature amount). The accuracy calculation unit 120 may input a voice volume into the constructed first pattern model to calculate the recognition rate. The accuracy calculation unit 120, for example, divides the range of possible volumes into three grades ("high", "medium" and "low"). The accuracy calculation unit 120, for example, calculates the recognition rate as "high" for a volume that falls within a predetermined range (R1) among the three ranges.

For the distance combination, for example, the accuracy calculation unit 120 constructs a second pattern model of distance to the speaker and recognition rate, as shown in FIG. 5(b), by inputting as learning data the distance between the speaker and the microphone in a predetermined learning period and the corresponding recognition rate. The accuracy calculation unit 120 may, for example, construct the second pattern model by statistical processing based on regression analysis, with the distance to the speaker as the explanatory variable (feature amount) and the recognition rate as the objective variable (feature amount). The accuracy calculation unit 120 may input a distance to the speaker into the constructed second pattern model to calculate the recognition rate. The accuracy calculation unit 120, for example, divides the range of possible distances to the speaker into three grades ("high", "medium" and "low"). The accuracy calculation unit 120, for example, calculates the recognition rate as "high" for a distance to the speaker that falls within a predetermined range (R2) among the three ranges.
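A pattern model of this kind can be sketched with an ordinary least-squares line, using distance as the explanatory variable and recognition rate as the objective variable. This is only one possible reading: the source says "regression analysis" without naming a model, and the curves in FIG. 5 need not be linear. The grade thresholds (0.8 and 0.5) and the training values below are made up, and grading the predicted rate is a simplification of the source, which instead partitions the input range (R1, R2).

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("explanatory variable has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_rate(model: Tuple[float, float], x: float) -> float:
    """Predicted recognition rate, clamped to the range 0.0 to 1.0."""
    slope, intercept = model
    return min(1.0, max(0.0, slope * x + intercept))


def grade(rate: float, high: float = 0.8, medium: float = 0.5) -> str:
    """Map a recognition rate to one of three grades (thresholds are assumptions)."""
    if rate >= high:
        return "high"
    if rate >= medium:
        return "medium"
    return "low"


# Hypothetical learning data: distance in metres and the observed recognition rate.
distances = [1.0, 2.0, 3.0]
rates = [0.9, 0.7, 0.5]
distance_model = fit_line(distances, rates)
```

The same three functions apply unchanged to the volume model by passing volumes instead of distances as `xs`.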

精度算定部１２０は、例えば、第１音声の周波数および第２音声の周波数の組み合わせに基づいて、第１認識結果および第２認識結果のそれぞれの認識精度を算定してもよい。精度算定部１２０は、例えば、発話区間ごとに、第１音声や第２音声の周波数の統計値（平均値や中央値）または周波数帯域を算出し、統計値または周波数帯域の下限が所定の閾値より高い場合には、この発話区間における認識率を「高」と算定してもよい。すなわち、精度算定部１２０は、高い周波数成分が音声に多く含まれる場合に、認識率を高く算定してもよい。 The accuracy calculation unit 120 may calculate the recognition accuracy of each of the first recognition result and the second recognition result, for example, based on the combination of the frequency of the first voice and the frequency of the second voice. The accuracy calculation unit 120 calculates, for example, a statistical value (average value or median value) or frequency band of the frequency of the first voice or the second voice for each speech section, and the lower limit of the statistical value or the frequency band is a predetermined threshold value. If it is higher, the recognition rate in this speech section may be calculated as "high". That is, the accuracy calculation unit 120 may calculate the recognition rate high when the voice contains a large amount of high frequency components.

The accuracy calculation unit 120 may, for example, extract as a feature amount the power (or sound pressure level) of at least one of the consonants and the high-frequency range at or above a predetermined threshold contained in the voice. "Power" here is so-called acoustic power: in the frequency analysis of sound it indicates the weight (power) of each frequency, and it differs from the loudness or strength (volume) of a sound as perceived by human hearing. The power is the strength of the consonants or of the voice in the high-frequency range at or above the predetermined threshold. The accuracy calculation unit 120 may calculate the recognition rate based on the extracted feature amount. The accuracy calculation unit 120 may, for example, weight the recognition rate calculated above by the power of the consonants and then grade the weighted recognition rate in the three grades described above ("high", "medium" and "low").

精度算定部１２０は、例えば、音声の音圧レベルと周波数とについて、図６に示すようにプロットする。精度算定部１２０は、プロットしたデータが取りうる範囲を３つの認識率の段階（「高」「中」「低」）のエリアに区分けする。精度算定部１２０は、例えば、音声の音圧レベルと周波数とが区分けした３つのエリアのいずれに属するかによって、認識率を算定してもよい。 The accuracy calculation unit 120 plots, for example, the sound pressure level and frequency of voice as shown in FIG. The accuracy calculation unit 120 divides the range that the plotted data can take into three recognition rate stages (“high”, “medium”, and “low”). The accuracy calculation unit 120 may calculate the recognition rate depending on which of the three areas the sound pressure level and frequency of the voice belong to, for example.

精度算定部１２０は、例えば、上記のように（ア）音量、（イ）発話者との距離、（ウ）周波数、（エ）子音または所定の閾値以上の高周波数域のパワー、の少なくともいずれかにより算定した認識率と、（オ）音声認識システム３００や音声認識部１１１から取得した認識率と、の組み合わせに基づいて、複合的な認識率（以下、「複合認識率」ともいう）を算定してもよい。 The accuracy calculation unit 120 is, for example, at least one of (a) volume, (b) distance from the speaker, (c) frequency, (d) consonant or power in a high frequency range above a predetermined threshold, as described above. Based on the combination of the recognition rate calculated by the above and (e) the recognition rate acquired from the voice recognition system 300 and the voice recognition unit 111, a composite recognition rate (hereinafter, also referred to as "composite recognition rate") is obtained. You may calculate.

The accuracy calculation unit 120 may, for example, calculate the weighted average of the recognition rates of (a) to (e) above and use it as the composite recognition rate. In this weighted average, the accuracy calculation unit 120 may, for example, set the importance of (a) and (b) above higher than that of (c) to (e). The accuracy calculation unit 120 may, for example, weight each recognition rate by multiplying it by a coefficient proportional to its importance. Specifically, the accuracy calculation unit 120 may calculate the composite recognition rate by the following formula.

Composite recognition rate = (α × recognition rate of (e) above + β × recognition rate of (a) above + θ × recognition rate of (b) above + δ × recognition rate of (c) above) / (α + β + θ + δ)

"α" is the weighting coefficient of (e) above, "β" is the weighting coefficient of (a) above, that is, of the volume, "θ" is the weighting coefficient of (b) above, that is, of the distance, and "δ" is the weighting coefficient of (c) above, that is, of the frequency. β and θ may be set to values larger than α and δ according to the importance that has been set.
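The weighted average in the formula above can be written directly as a small function. The default weights (β = θ = 2.0, α = δ = 1.0) are placeholders chosen only so that volume and distance count more than the other terms, as the text permits; the source gives no concrete values, and the function name is hypothetical.

```python
def composite_recognition_rate(
    asr_rate: float,        # (e) rate obtained from the voice recognition side
    volume_rate: float,     # (a) rate calculated from the volume
    distance_rate: float,   # (b) rate calculated from the distance to the speaker
    frequency_rate: float,  # (c) rate calculated from the frequency
    alpha: float = 1.0,
    beta: float = 2.0,
    theta: float = 2.0,
    delta: float = 1.0,
) -> float:
    """Weighted average of the four recognition rates, as in the formula above."""
    total_weight = alpha + beta + theta + delta
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive value")
    weighted_sum = (
        alpha * asr_rate
        + beta * volume_rate
        + theta * distance_rate
        + delta * frequency_rate
    )
    return weighted_sum / total_weight
```

With the placeholder weights, rates of 0.9, 0.6, 0.6 and 0.9 give (0.9 + 1.2 + 1.2 + 0.9) / 6 = 0.7.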

The voice data generation unit 121 uses voice synthesis processing to generate, based on the first text information, third voice data for outputting a third voice. The third voice may be, for example, a voice that reads out the character string of the first text information.

音声データ生成部１２１は、例えば、テキスト情報に基づき、応答情報を生成してもよい。ここで「応答情報」とは、記録装置１００がユーザの音声に対して応答するための情報である。音声データ生成部１２１は、例えば、自然言語処理を用いてテキスト情報を解析する。そして音声データ生成部１２１は、この解析により、ユーザの音声に対する応答の内容を特定し、応答情報を生成する。音声データ生成部１２１は音声合成処理を用いて、応答情報に基づいて、ユーザの音声に対する応答のための音声データを生成してもよい。 The voice data generation unit 121 may generate response information based on, for example, text information. Here, the "response information" is information for the recording device 100 to respond to the voice of the user. The voice data generation unit 121 analyzes text information using, for example, natural language processing. Then, the voice data generation unit 121 identifies the content of the response to the user's voice by this analysis, and generates the response information. The voice data generation unit 121 may generate voice data for a response to the user's voice based on the response information by using the voice synthesis process.

The voice data generation unit 121, for example, performs morphological analysis on the content of the user's voice, "議事録を開始" ("start minutes"), and extracts the words "議事録" ("minutes") and "開始" ("start"). Next, the voice data generation unit 121 searches the dictionary information using the extracted words as search keys and identifies the corresponding response content. This response content is (a) executing a series of processes, such as acquiring the first voice data and voice recognition processing, for creating the minutes of the first speaker's utterance, and (b) executing a process that outputs the voice "議事録を開始します" ("Starting the minutes") to the user.

「辞書情報」とは、単語または複数の単語の組み合わせと、応答の内容を関連付ける情報である。辞書情報は、例えば、「議事録」および「開始」とする単語の組み合わせと、上記（ア）および（イ）の処理の実行とする応答の内容と、を関連付ける。 "Dictionary information" is information that associates a word or a combination of a plurality of words with the content of a response. The dictionary information associates, for example, the combination of the words "minutes" and "start" with the content of the response to execute the processes (a) and (b) above.
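The dictionary lookup can be sketched as a mapping from a word combination to response content. Morphological analysis is outside this sketch, so the words are assumed to have been extracted already. The action labels `"start_minutes_processing"` and `"speak:..."`, the `DICTIONARY` constant and the function `find_response` are hypothetical; "議事録" means "minutes", "開始" means "start", and "議事録を開始します" means "Starting the minutes".

```python
from typing import Dict, FrozenSet, Iterable, List, Optional

# Dictionary information: word combination -> response content.
DICTIONARY: Dict[FrozenSet[str], List[str]] = {
    frozenset({"議事録", "開始"}): [
        "start_minutes_processing",   # (a) acquire first voice data, run recognition
        "speak:議事録を開始します",     # (b) output the confirmation voice
    ],
}


def find_response(
    words: Iterable[str],
    dictionary: Dict[FrozenSet[str], List[str]] = DICTIONARY,
) -> Optional[List[str]]:
    """Return the response content whose word combination is fully contained in words."""
    extracted = set(words)
    for key_words, response in dictionary.items():
        if key_words <= extracted:  # every key word was extracted from the speech
            return response
    return None
```

Passing the output of a morphological analyzer, for example `["議事録", "を", "開始"]`, returns both actions; a lone `["開始"]` matches nothing.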

加工部１２２は、第１音声データおよび第３音声データを、ステレオ音声データに加工する。ここで「ステレオ音声データ」とは、第１音声と、第２音声と、第３音声とのいずれか二つの音声をステレオフォニック再生するための音声データである。 The processing unit 122 processes the first audio data and the third audio data into stereo audio data. Here, the "stereo audio data" is audio data for stereophonically reproducing any two audios of the first audio, the second audio, and the third audio.

加工部１２２は、例えば、第１音声データ、第２音声データまたは第３音声データの少なくともいずれか二つを、ステレオ音声データに加工してもよい。この場合、ステレオ音声データは、（Ａ）第１音声データと第２音声データとの組み合わせ、（Ｂ）第１音声データと第３音声データとの組み合わせ、（Ｃ）第２音声データと第３音声データとの組み合わせ、とする３パターンのうちいずれか一つのパターンであってもよい。 For example, the processing unit 122 may process at least two of the first audio data, the second audio data, and the third audio data into stereo audio data. In this case, the stereo audio data is (A) a combination of the first audio data and the second audio data, (B) a combination of the first audio data and the third audio data, and (C) the second audio data and the third. It may be any one of the three patterns to be combined with the voice data.
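Combining two of the voice data into stereo voice data can be sketched as pairing two mono sample sequences into left and right channels. The padding of the shorter channel with silence and the `swap` flag (mirroring the left/right swap designation that the reception unit 117 may accept) are assumptions for illustration, and `to_stereo` is a hypothetical name; sound image localization is not modeled here.

```python
from itertools import zip_longest
from typing import List, Sequence, Tuple


def to_stereo(
    left: Sequence[float],
    right: Sequence[float],
    swap: bool = False,
) -> List[Tuple[float, float]]:
    """Pair two mono signals into (left, right) stereo frames.

    Assumption: the shorter signal is padded with silence (0.0).
    """
    if swap:
        left, right = right, left  # exchange the left and right channels
    return list(zip_longest(left, right, fillvalue=0.0))
```

For pattern (A), `left` would carry the first voice data and `right` the second voice data; patterns (B) and (C) only change which two signals are passed in.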

加工部１２２は、例えば、ステレオ音声データの加工の前処理として、第１音声データ、第２音声データまたは第３音声データの音声の音像を定位させてもよい。加工部１２２は、例えば、第１音声データについて、第１発話者（チャンネル）ごとに仮想音源の位置に第１音声の音像を定位させてもよい。この仮想音源の位置は、例えば、発話者の位置（角度）に偏りがあると聞き取りづらい音声になる、すなわち認識しづらい音声になるため、発話者の位置が均等になるように設定してもよい。 For example, as preprocessing for generating the stereo audio data, the processing unit 122 may localize the sound image of the voice in the first audio data, the second audio data, or the third audio data. For the first audio data, for example, the processing unit 122 may localize the sound image of the first voice at the position of a virtual sound source for each first speaker (channel). Because a bias in the speaker positions (angles) makes the voices difficult to hear, that is, difficult to recognize, the positions of the virtual sound sources may be set so that the speakers are evenly spaced.
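A rough sketch of these two ideas, evenly spaced virtual sound sources and pairing two mono signals into stereo data, is given below. The function names, the 180-degree default spread, and the zero-padding of the shorter signal are assumptions for illustration; the description does not specify them.

```python
def even_source_angles(num_speakers, spread_deg=180.0):
    """Return virtual sound source angles spaced evenly across spread_deg.

    Angles are in degrees, centred on 0 (straight ahead). The spread is an
    assumed parameter, not a value taken from the description.
    """
    if num_speakers == 1:
        return [0.0]
    step = spread_deg / (num_speakers - 1)
    return [-spread_deg / 2 + i * step for i in range(num_speakers)]


def to_stereo(left, right):
    """Combine two mono sample lists into a list of (left, right) frames.

    The shorter signal is padded with silence so both channels have the
    same number of frames.
    """
    n = max(len(left), len(right))
    left_padded = list(left) + [0.0] * (n - len(left))
    right_padded = list(right) + [0.0] * (n - len(right))
    return list(zip(left_padded, right_padded))
```

For pattern (B), for instance, `to_stereo(first_samples, third_samples)` would place the first voice on the left channel and the third voice on the right.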

音声取得部１３０は、ユーザの音声の音声データを取得する。音声取得部１３０は、第１音声取得部１３１と、第２音声取得部１３２と、を備える。第１音声取得部１３１は、第１発話者による第１音声の第１音声データを取得する。第２音声取得部１３２は、復唱モードが設定された場合に、復唱する音声の音声データとして第２発話者による第２音声の第２音声データを取得する。 The voice acquisition unit 130 acquires voice data of the user's voice. The voice acquisition unit 130 includes a first voice acquisition unit 131 and a second voice acquisition unit 132. The first voice acquisition unit 131 acquires the first voice data of the first voice by the first speaker. The second voice acquisition unit 132 acquires the second voice data of the second voice by the second speaker as the voice data of the voice to be repeated when the repeat mode is set.

音声取得部１３０は、例えば、発話者ごとの音声データを取得するにあたって、音声入力装置８１７に入力された音声信号に対して指向性処理や音源を分離する音源分離処理をしてもよい。指向性処理とは、例えば、発話者の方向からの音声を強調し、発話者以外の方向からの音声を抑制する信号処理（ビームフォーミング処理）である。また音源分離処理とは、発話者の方向ごとの対象音を抽出して個別に分離する処理である。音声取得部１３０は、すなわち発話者を分離し、分離された発話者ごとに、発話者それぞれの方向からの音声の音声データを取得する。 For example, when acquiring voice data for each speaker, the voice acquisition unit 130 may perform directivity processing or sound source separation processing for separating the sound source with respect to the voice signal input to the voice input device 817. The directional processing is, for example, signal processing (beamforming processing) that emphasizes the sound from the direction of the speaker and suppresses the sound from a direction other than the speaker. The sound source separation process is a process of extracting target sounds for each direction of the speaker and separating them individually. The voice acquisition unit 130 separates the speakers, and acquires voice data of voice from each speaker direction for each separated speaker.
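The description does not name a specific beamforming algorithm. As one common example, a delay-and-sum beamformer can be sketched as follows; the function name `delay_and_sum` and the use of integer sample delays are assumptions made to keep the sketch short.

```python
def delay_and_sum(channels, delays):
    """Steer a microphone array toward one direction by delay-and-sum.

    channels: list of equal-length sample lists, one per microphone.
    delays:   integer arrival delay (in samples) of the target sound at
              each microphone, relative to the earliest microphone.
    Sound from the target direction adds coherently after alignment, while
    sound from other directions is attenuated by the averaging.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, delays):
        for t in range(n):
            src = t + delay  # read ahead to undo this microphone's delay
            if 0 <= src < n:
                out[t] += samples[src]
    return [value / len(channels) for value in out]
```

Running the same routine once per speaker direction, each with its own delay set, gives one enhanced signal per speaker, which is in the spirit of acquiring voice data for each separated speaker.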

音声取得部１３０は、例えば、発話者ごとに指向性を有するマイクロフォンを用いて、集音したマイクロフォンを特定して発話者を識別し、識別した発話者ごとに音声データを取得してもよい。 The voice acquisition unit 130 may, for example, use a microphone having directivity for each speaker to identify the microphone that collects the sound, identify the speaker, and acquire voice data for each identified speaker.

通信部１４０は、ネットワークＮを介して、ユーザ端末２００、音声認識システム３００などとの間で音声データやテキスト情報などの各種情報・データを送受信する。 The communication unit 140 transmits / receives various information / data such as voice data and text information to / from the user terminal 200, the voice recognition system 300, and the like via the network N.

出力部１５０は、応答情報に基づき、音声に対する応答を出力する。出力部１５０の出力態様は、どのような態様でもよい。出力部１５０の出力態様は、例えば、音声出力、画面出力、ファイル出力またはメッセージ出力などが考えられる。出力部１５０は、再生部１５１を備える。 The output unit 150 outputs a response to the voice based on the response information. The output mode of the output unit 150 may be any mode. The output mode of the output unit 150 may be, for example, audio output, screen output, file output, message output, or the like. The output unit 150 includes a reproduction unit 151.

再生部１５１は、復唱モードが設定された場合に、第１音声データに基づいて、第１音声を再生する。 The reproduction unit 151 reproduces the first voice based on the first voice data when the repeat mode is set.

再生部１５１は、例えば、加工部により加工されたステレオ音声データに基づいて、第１音声、第２音声、または第３音声の少なくともいずれか二つをステレオフォニック再生してもよい。また、再生部１５１は、ステレオフォニック再生にあたって、受付部１１７が受け付けた左右のチャンネルの入れ替えの指定に基づいて、左右のチャンネルの音声を入れ替えてもよい。再生部１５１は、例えば、左のチャンネルが第１音声で右のチャンネルが第３音声の場合、上記の左右のチャンネルの入れ替えの指定に基づいて、左のチャンネルが第３音声で右のチャンネルが第１音声に入れ替えてステレオフォニック再生してもよい。 The reproduction unit 151 may, for example, stereophonically reproduce at least two of the first voice, the second voice, and the third voice based on the stereo audio data processed by the processing unit. In stereophonic reproduction, the reproduction unit 151 may also swap the audio of the left and right channels based on a channel-swap designation accepted by the reception unit 117. For example, when the left channel carries the first voice and the right channel carries the third voice, the reproduction unit 151 may, based on this designation, swap them so that the left channel carries the third voice and the right channel carries the first voice, and then perform stereophonic reproduction.
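The channel swap can be illustrated with a short sketch. The frame representation (a list of `(left, right)` tuples) and the names `swap_channels` and `frames_for_playback` are hypothetical.

```python
def swap_channels(frames):
    """Return stereo frames with the left and right samples exchanged."""
    return [(right, left) for left, right in frames]


def frames_for_playback(frames, swap_requested):
    """Apply the swap only when the user designated a channel swap."""
    return swap_channels(frames) if swap_requested else list(frames)
```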

再生部１５１は、例えば、加工部により加工されたステレオ音声データに基づいて、復唱モードが設定された場合に、第１音声の再生をする代わりに、第１音声と第３音声とをステレオフォニック再生してもよい。 For example, when the repeat mode is set, the reproduction unit 151 may stereophonically reproduce the first voice and the third voice, based on the stereo audio data processed by the processing unit, instead of reproducing the first voice alone.

上記構成によれば、再生部１５１は、第１音声と第３音声とをステレオフォニック再生することで第１音声と第３音声との違いを第２発話者に認識させることができる。このため、上記構成によれば、再生部１５１は、第１音声と第１音声を音声認識した第３音声との差異を認識させつつ第２発話者に復唱させることができる。 According to the above configuration, the reproduction unit 151 can make the second speaker aware of the difference between the first voice and the third voice by reproducing them stereophonically. The reproduction unit 151 can therefore have the second speaker repeat the utterance while noticing the differences between the first voice and the third voice, which is generated from the voice recognition result of the first voice.

記憶部１６０は、音声データ（ステレオ音声データを含む）を記憶する。また記憶部１６０は、例えば、音声データと関連付けて、音声データの認識結果を示すテキスト情報、音声データに関する発話者の位置情報、音声データの認識結果の認識精度を示す精度情報、音声データの認識結果の信頼度を示す信頼度情報および／または音声データの認識結果に対する応答情報などを記憶してもよい。また記憶部１６０は、例えば、辞書情報を記憶してもよい。 The storage unit 160 stores audio data (including stereo audio data). The storage unit 160 may also store, in association with the audio data, text information indicating the recognition result of the audio data, position information of the speaker of the audio data, accuracy information indicating the recognition accuracy of the recognition result, reliability information indicating the reliability of the recognition result, and/or response information for the recognition result. The storage unit 160 may also store, for example, dictionary information.

記憶部１６０は、データベースマネジメントシステム（ＤＢＭＳ）を利用して上記の情報を記憶してもよいし、ファイルシステムを利用して上記の情報を記憶してもよい。ＤＢＭＳを利用する場合は、上記の情報ごとにテーブルを設けて、テーブル間を関連付けてこれらの情報を管理してもよい。 The storage unit 160 may use a database management system (DBMS) to store the above information, or may use a file system to store the above information. When using the DBMS, a table may be provided for each of the above information, and the tables may be associated with each other to manage the information.
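As one possible illustration of the DBMS option, the sketch below uses SQLite with one table for audio data and one for recognition text, linked by a key. The table and column names are hypothetical; the description only states that a table may be provided per kind of information and that the tables may be associated with each other.

```python
import sqlite3


def create_store():
    """Create an in-memory database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE voice_data (
            id      INTEGER PRIMARY KEY,
            speaker TEXT,
            audio   BLOB
        );
        CREATE TABLE text_info (
            voice_id    INTEGER REFERENCES voice_data(id),
            text        TEXT,
            reliability REAL
        );
        """
    )
    return conn


def add_recognition(conn, speaker, audio, text, reliability):
    """Store audio data and its recognition result, linked by voice_id."""
    cur = conn.execute(
        "INSERT INTO voice_data (speaker, audio) VALUES (?, ?)",
        (speaker, audio),
    )
    voice_id = cur.lastrowid
    conn.execute(
        "INSERT INTO text_info (voice_id, text, reliability) VALUES (?, ?, ?)",
        (voice_id, text, reliability),
    )
    return voice_id
```

Position, accuracy, and response information could be added as further tables keyed on the same `voice_id`.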

＜４．動作例＞ <4. Operation example>
図７を参照して、記録装置１００の動作例を説明する。なお、以下に示す図７の動作例の処理の順番は一例であって、適宜、変更されてもよい。 An operation example of the recording device 100 will be described with reference to FIG. 7. The order of processing in the operation example of FIG. 7 shown below is only an example and may be changed as appropriate.

図７に示すように、記録装置１００の第１音声取得部１３１は、第１発話者による第１音声の第１音声データを取得する（Ｓ１０）。次に、制御部１１０は、ユーザ端末２００から受け付けた復唱モードの指定に基づいて、復唱モードを設定する（Ｓ１１）。 As shown in FIG. 7, the first voice acquisition unit 131 of the recording device 100 acquires the first voice data of the first voice by the first speaker (S10). Next, the control unit 110 sets the repeat mode based on the designation of the repeat mode received from the user terminal 200 (S11).

次に、再生部１５１は、第１音声を復唱する音声を取得するための復唱モードが設定された場合に、第１音声データに基づいて、第１音声を再生する（Ｓ１２）。次に、第２音声取得部１３２は、復唱モードが設定された場合に、復唱する音声の音声データとして第２発話者による第２音声の第２音声データを取得する（Ｓ１３）。 Next, the reproduction unit 151 reproduces the first voice based on the first voice data when the repeat mode for acquiring the voice to repeat the first voice is set (S12). Next, when the repeat mode is set, the second voice acquisition unit 132 acquires the second voice data of the second voice by the second speaker as the voice data of the voice to be repeated (S13).

次に、認識結果取得部１１２は、第１音声データと第２音声データとに基づいて、第１音声の第１認識結果を示す第１テキスト情報と、第２音声の第２認識結果を示す第２テキスト情報と、を取得する（Ｓ１５）。 Next, based on the first voice data and the second voice data, the recognition result acquisition unit 112 acquires first text information indicating the first recognition result of the first voice and second text information indicating the second recognition result of the second voice (S15).

次に、比較部１１３は、第１テキスト情報と第２テキスト情報とを比較する（Ｓ１６）。記録生成部１１４は、第１テキスト情報と第２テキスト情報と比較部１１３による比較結果に基づいて、テキストによる発話の記録データを生成する（Ｓ１７）。 Next, the comparison unit 113 compares the first text information with the second text information (S16). The record generation unit 114 generates textual record data of the utterance based on the first text information, the second text information, and the comparison result from the comparison unit 113 (S17).
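Steps S16 and S17 can be sketched as below, assuming the text information is already divided into matching sections. The function names, the exact-match comparison, and the default of keeping the first text when no choice is given are assumptions for illustration only.

```python
def compare_sections(first_texts, second_texts):
    """Return, per section, whether the two recognition results match (S16)."""
    return [a == b for a, b in zip(first_texts, second_texts)]


def generate_record(first_texts, second_texts, choices=None):
    """Build the record section by section (S17).

    choices maps a section index to "first" or "second" for sections whose
    texts differ. Sections without an explicit choice keep the first text,
    which is an assumed default.
    """
    choices = choices or {}
    record = []
    for i, (a, b) in enumerate(zip(first_texts, second_texts)):
        if a == b or choices.get(i, "first") == "first":
            record.append(a)
        else:
            record.append(b)
    return record
```

For example, with first texts `["hello", "whether report"]` and second texts `["hello", "weather report"]`, the comparison flags the second section, and `generate_record(..., {1: "second"})` keeps the repeated speaker's wording for it.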

＜５．ハードウェア構成＞ <5. Hardware configuration>
図８を参照して、上述してきた記録装置１００をコンピュータ８００により実現する場合のハードウェア構成の一例を説明する。なお、それぞれの装置の機能は、複数台の装置に分けて実現することもできる。 With reference to FIG. 8, an example of the hardware configuration in the case where the recording device 100 described above is realized by the computer 800 will be described. The functions of each device can also be divided among and realized by a plurality of devices.

図８に示すように、コンピュータ８００は、プロセッサ８０１と、メモリ８０３と、記憶装置８０５と、入力Ｉ／Ｆ部８０７と、データＩ／Ｆ部８０９と、通信Ｉ／Ｆ部８１１、表示装置８１３、音声入力装置８１７および音声出力装置８１９を含む。 As shown in FIG. 8, the computer 800 includes a processor 801, a memory 803, a storage device 805, an input I/F unit 807, a data I/F unit 809, a communication I/F unit 811, a display device 813, a voice input device 817, and a voice output device 819.

プロセッサ８０１は、メモリ８０３に記憶されているプログラムを実行することによりコンピュータ８００における様々な処理を制御する。例えば、記録装置１００の制御部１１０が備える各機能部などは、メモリ８０３に一時記憶されたプログラムをプロセッサ８０１が実行することにより実現可能である。 The processor 801 controls various processes in the computer 800 by executing a program stored in the memory 803. For example, each functional unit included in the control unit 110 of the recording device 100 can be realized by the processor 801 executing a program temporarily stored in the memory 803.

メモリ８０３は、例えばＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等の記憶媒体である。メモリ８０３は、プロセッサ８０１によって実行されるプログラムのプログラムコードや、プログラムの実行時に必要となるデータを一時的に記憶する。 The memory 803 is a storage medium such as, for example, a RAM (Random Access Memory). The memory 803 temporarily stores the program code of the program executed by the processor 801 and the data required when the program is executed.

記憶装置８０５は、例えばハードディスクドライブ（ＨＤＤ）やフラッシュメモリ等の不揮発性の記憶媒体である。記憶装置８０５は、オペレーティングシステムや、上記各構成を実現するための各種プログラムを記憶する。この他、記憶装置８０５は、音声データ、テキスト情報、位置情報、精度情報、信頼度情報または応答情報などを登録するテーブルと、このテーブルを管理するＤＢを記憶することも可能である。このようなプログラムやデータは、必要に応じてメモリ８０３にロードされることにより、プロセッサ８０１から参照される。 The storage device 805 is a non-volatile storage medium such as a hard disk drive (HDD) or a flash memory. The storage device 805 stores an operating system and various programs for realizing each of the above configurations. In addition, the storage device 805 can also store a table for registering voice data, text information, position information, accuracy information, reliability information, response information, and the like, and a DB for managing this table. Such programs and data are referenced from the processor 801 by being loaded into the memory 803 as needed.

入力Ｉ／Ｆ部８０７は、ユーザからの入力を受け付けるためのデバイスである。入力Ｉ／Ｆ部８０７の具体例としては、キーボードやマウス、タッチパネル、各種センサ、ウェアラブル・デバイス等が挙げられる。入力Ｉ／Ｆ部８０７は、例えばＵＳＢ（ＵｎｉｖｅｒｓａｌＳｅｒｉａｌＢｕｓ）等のインタフェースを介してコンピュータ８００に接続されても良い。 The input I / F unit 807 is a device for receiving input from the user. Specific examples of the input I / F unit 807 include a keyboard, a mouse, a touch panel, various sensors, a wearable device, and the like. The input I / F unit 807 may be connected to the computer 800 via an interface such as USB (Universal Serial Bus).

データＩ／Ｆ部８０９は、コンピュータ８００の外部からデータを入力するためのデバイスである。データＩ／Ｆ部８０９の具体例としては、各種記憶媒体に記憶されているデータを読み取るためのドライブ装置等がある。データＩ／Ｆ部８０９は、コンピュータ８００の外部に設けられることも考えられる。その場合、データＩ／Ｆ部８０９は、例えばＵＳＢ等のインタフェースを介してコンピュータ８００へと接続される。 The data I / F unit 809 is a device for inputting data from the outside of the computer 800. Specific examples of the data I / F unit 809 include a drive device for reading data stored in various storage media. It is also conceivable that the data I / F unit 809 is provided outside the computer 800. In that case, the data I / F unit 809 is connected to the computer 800 via an interface such as USB.

通信Ｉ／Ｆ部８１１は、コンピュータ８００の外部の装置と有線または無線により、インターネットＮを介したデータ通信を行うためのデバイスである。通信Ｉ／Ｆ部８１１は、コンピュータ８００の外部に設けられることも考えられる。その場合、通信Ｉ／Ｆ部８１１は、例えばＵＳＢ等のインタフェースを介してコンピュータ８００に接続される。 The communication I / F unit 811 is a device for performing data communication via the Internet N by wire or wirelessly with an external device of the computer 800. It is also conceivable that the communication I / F unit 811 is provided outside the computer 800. In that case, the communication I / F unit 811 is connected to the computer 800 via an interface such as USB.

表示装置８１３は、各種情報を表示するためのデバイスである。表示装置８１３の具体例としては、例えば液晶ディスプレイや有機ＥＬ（Ｅｌｅｃｔｒｏ－Ｌｕｍｉｎｅｓｃｅｎｃｅ）ディスプレイ、ウェアラブル・デバイスのディスプレイ等が挙げられる。表示装置８１３は、コンピュータ８００の外部に設けられても良い。その場合、表示装置８１３は、例えばディスプレイケーブル等を介してコンピュータ８００に接続される。また、入力Ｉ／Ｆ部８０７としてタッチパネルが採用される場合には、表示装置８１３は、入力Ｉ／Ｆ部８０７と一体化して構成することが可能である。 The display device 813 is a device for displaying various information. Specific examples of the display device 813 include a liquid crystal display, an organic EL (Electro-Luminescence) display, and the display of a wearable device. The display device 813 may be provided outside the computer 800. In that case, the display device 813 is connected to the computer 800 via, for example, a display cable. When a touch panel is adopted as the input I/F unit 807, the display device 813 can be integrated with the input I/F unit 807.

音声入力装置８１７は、マイクなどの音声を検出するための入力装置である。音声入力装置８１７は、例えば、音声信号を含めたアナログ振動信号を取得する単一または複数のマイクロフォン（マイクアレイ）、アナログ振動信号を増幅するアンプ、アナログ振動信号をデジタル信号に変換するＡ／Ｄ変換部などを備える。音声入力装置８１７は、例えば、ユーザが発する音声を検出する。 The voice input device 817 is an input device for detecting voice such as a microphone. The voice input device 817 is, for example, a single or a plurality of microphones (microphone arrays) that acquire an analog vibration signal including a voice signal, an amplifier that amplifies the analog vibration signal, and an A / D that converts the analog vibration signal into a digital signal. It is equipped with a conversion unit and the like. The voice input device 817 detects, for example, the voice emitted by the user.

音声出力装置８１９は、音声を出力するための出力装置であり、例えば、スピーカなどである。また音声出力装置８１９は、ヘッドフォンまたはイヤフォンに音をステレオ再生するための装置であってもよい。 The audio output device 819 is an output device for outputting audio, and is, for example, a speaker or the like. Further, the audio output device 819 may be a device for reproducing sound in stereo on headphones or earphones.

なお、本実施形態は、本発明を説明するための例示であり、本発明をその実施の形態のみに限定する趣旨ではない。また、本発明は、その要旨を逸脱しない限り、さまざまな変形が可能である。さらに、当業者であれば、以下に述べる各要素を均等なものに置換した実施の形態を採用することが可能であり、かかる実施の形態も本発明の範囲に含まれる。 Note that the present embodiment is an example for explaining the present invention and is not intended to limit the present invention to this embodiment alone. The present invention can be modified in various ways without departing from its gist. Furthermore, those skilled in the art can adopt embodiments in which the elements described below are replaced with equivalents, and such embodiments are also included in the scope of the present invention.

［変形例］ [Modification examples]
なお、本発明を上記実施形態に基づいて説明してきたが、以下のような場合も本発明に含まれる。 Although the present invention has been described based on the above embodiment, the following cases are also included in the present invention.

［変形例１］ [Modification 1]
上記実施形態に係る記録装置１００おける各構成の少なくとも一部は、ユーザ端末２００またはサーバ装置（不図示）に搭載させる議事録作成システム１専用のプログラムが備えてもよい。例えば、このプログラムに、記録装置１００の制御部１１０の各機能部や音声取得部１３０を備えさせて、出力部１５０に関してはユーザ端末２００に標準的に備える機能を利用して、ユーザ端末２００で全て実現してもよい。また、この際、制御部１１０の各機能部の中で比較的処理負荷の高い信頼度算出部１１８、推定部１１９または精度算定部１２０などはサーバ装置に搭載させてもよい。ユーザ端末２００は、サーバ装置のこれらの機能に対する処理の指示と指示に対する処理結果を受け取るだけとしてもよい。 At least a part of each component of the recording device 100 according to the above embodiment may be provided by a program dedicated to the minutes creation system 1 that is installed on the user terminal 200 or a server device (not shown). For example, this program may include the functional units of the control unit 110 and the voice acquisition unit 130 of the recording device 100, while the output unit 150 uses functions provided as standard in the user terminal 200, so that everything is realized on the user terminal 200. In this case, functional units of the control unit 110 with a relatively high processing load, such as the reliability calculation unit 118, the estimation unit 119, or the accuracy calculation unit 120, may be installed on the server device. The user terminal 200 may then only send processing instructions for these functions to the server device and receive the processing results.

［変形例２］ [Modification 2]
上記実施形態では、第１テキスト情報と第２テキスト情報とが不一致だった区間について、どちらのテキスト情報を第１発話者の発話の記録、すなわち議事録として採用するかユーザに選択させる例を示したが、これに限定されない。議事録作成システム１では、第１認識結果および第２認識結果それぞれの認識精度などに基づいて、自動的にどちらを採用するか選択してもよい。 The above embodiment shows an example in which, for a section where the first text information and the second text information do not match, the user selects which text information to adopt as the record of the utterance of the first speaker, that is, as the minutes. The invention is not limited to this. The minutes creation system 1 may automatically select which one to adopt based on, for example, the recognition accuracy of each of the first recognition result and the second recognition result.

制御部１１０は、選択部（不図示）を備える。選択部は、第１テキスト情報と第２テキスト情報とが比較部１１３による比較結果で不一致だった区間について、信頼度または認識精度の少なくともいずれかに基づいて、第１テキスト情報と第２テキスト情報のいずれを第１発話者の発話の記録として採用するかを選択する。 The control unit 110 includes a selection unit (not shown). For a section in which the first text information and the second text information do not match in the comparison result from the comparison unit 113, the selection unit selects, based on at least one of the reliability and the recognition accuracy, which of the first text information and the second text information to adopt as the record of the utterance of the first speaker.
選択部は、例えば、信頼度の高い方のテキスト情報を第１発話者の発話の記録として選択してもよい。 For example, the selection unit may select the text information with the higher reliability as the record of the utterance of the first speaker.

記録生成部１１４は、例えば、選択部による選択結果に基づいて、第１テキスト情報と第２テキスト情報とを区間ごとに組み合わせて、記録データを生成してもよい。 The record generation unit 114 may generate record data by combining the first text information and the second text information for each section based on the selection result by the selection unit, for example.

上記構成によれば、第１テキスト情報と第２テキスト情報のいずれかを選択する手間をユーザは省くことができるため、効率よく発話の記録を生成することができる。 According to the above configuration, the user can save the trouble of selecting either the first text information or the second text information, so that the utterance record can be efficiently generated.
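A minimal sketch of this automatic selection is shown below, assuming each section carries a text and a numeric reliability. The function name `select_by_reliability` and the rule that ties go to the first text are assumptions; the description only says the more reliable text may be chosen.

```python
def select_by_reliability(first_sections, second_sections):
    """Build the record by picking, per section, the more reliable text.

    Each argument is a list of (text, reliability) pairs, one per section.
    When the texts match, or the reliabilities tie, the first text is used
    (the tie rule is an assumption).
    """
    record = []
    for (text1, rel1), (text2, rel2) in zip(first_sections, second_sections):
        if text1 == text2 or rel1 >= rel2:
            record.append(text1)
        else:
            record.append(text2)
    return record
```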

［変形例３］ [Modification 3]
上記実施形態では示していないが、復唱モードが設定された場合、第２発話者が復唱している際に、精度算定部１２０は、第２音声取得部１３２が取得した第２音声データの第２音声の認識精度を随時算出してもよい。そして表示部１１６が、算出された認識精度をユーザ端末２００に随時表示させてもよい。このような構成によれば、表示部１１６は、第２発話者が復唱している際に、タイムリーにその第２音声の認識精度を表示させることができる。このため、上記構成によれば、第２発話者は、例えば、自身の音量や記録装置１００との距離をより精度よく認識できるよう見直しつつ、復唱することができる。 Although not shown in the above embodiment, when the repeat mode is set, the accuracy calculation unit 120 may calculate, as needed while the second speaker is repeating, the recognition accuracy of the second voice in the second voice data acquired by the second voice acquisition unit 132. The display unit 116 may then display the calculated recognition accuracy on the user terminal 200 as needed. With this configuration, the display unit 116 can display the recognition accuracy of the second voice in a timely manner while the second speaker is repeating. The second speaker can therefore repeat the utterance while adjusting, for example, his or her own volume or distance from the recording device 100 so that the voice is recognized more accurately.

１…議事録作成システム、１００…記録装置、１１０…制御部、１１１…音声認識部、１１２…認識結果取得部、１１３…比較部、１１４…記録生成部、１１５…発話データ生成部、１１６…表示部、１１７…受付部、１１８…信頼度算出部、１１９…推定部、１２０…精度算定部、１２１…音声データ生成部、１２２…加工部、１３０…音声取得部、１３１…第１音声取得部、１３２…第２音声取得部、１４０…通信部、１５０…出力部、１５１…再生部、１６０…記憶部、２００…ユーザ端末、３００…音声認識システム、８００…コンピュータ、８０１…プロセッサ、８０３…メモリ、８０５…記憶装置、８０７…入力Ｉ／Ｆ部、８０９…データＩ／Ｆ部、８１１…通信Ｉ／Ｆ部、８１３…表示装置、８１７…音声入力装置、８１９…音声出力装置。 1 ... Minutes creation system, 100 ... Recording device, 110 ... Control unit, 111 ... Voice recognition unit, 112 ... Recognition result acquisition unit, 113 ... Comparison unit, 114 ... Record generation unit, 115 ... Speech data generation unit, 116 ... Display unit, 117 ... reception unit, 118 ... reliability calculation unit, 119 ... estimation unit, 120 ... accuracy calculation unit, 121 ... voice data generation unit, 122 ... processing unit, 130 ... voice acquisition unit, 131 ... first voice acquisition Unit, 132 ... Second voice acquisition unit, 140 ... Communication unit, 150 ... Output unit, 151 ... Playback unit, 160 ... Storage unit, 200 ... User terminal, 300 ... Voice recognition system, 800 ... Computer, 801 ... Processor, 803 ... Memory, 805 ... Storage device, 807 ... Input I / F unit, 809 ... Data I / F unit, 811 ... Communication I / F unit, 813 ... Display device, 817 ... Voice input device, 819 ... Voice output device.

Claims

The first voice acquisition unit that acquires the first voice data of the first voice by the first speaker, and
A reproduction unit that reproduces the first voice based on the first voice data when a repeat mode for acquiring the voice that repeats the first voice is set.
A second voice acquisition unit that acquires the second voice data of the second voice by the second speaker as the voice data of the voice to be repeated when the repeat mode is set.
A recognition result acquisition unit that acquires, based on the first voice data and the second voice data, first text information indicating a first recognition result of the first voice and second text information indicating a second recognition result of the second voice, and
A record generation unit that generates, based on the first text information and the second text information, textual record data of the utterance of the first speaker, are provided.
Information processing apparatus.

An utterance data generation unit that generates, from the voice data, a plurality of first utterance data corresponding to a plurality of sections of the first voice data and a plurality of second utterance data corresponding to a plurality of sections of the second voice data is further provided,
The recognition result acquisition unit acquires the first text information and the second text information divided into each of the plurality of sections based on the first utterance data and the second utterance data.
The information processing apparatus further includes a comparison unit for comparing the first text information and the second text information for each of the plurality of sections.
The record generation unit generates the record data based on the comparison result by the comparison unit.
The information processing apparatus according to claim 1.

A display unit that displays the comparison result on the user terminal of the user,
A reception unit that accepts from the user terminal the selection of whether to adopt the first text information or the second text information as a record of the utterance of the first speaker for each of the plurality of sections.
The record generation unit further combines the first text information and the second text information for each section based on the selection accepted by the reception unit to generate the record data.
The information processing apparatus according to claim 2.

Further, a reliability calculation unit for calculating the reliability of each of the first recognition result and the second recognition result is provided.
The display unit causes the user terminal to display the reliability together with the comparison result.
The information processing apparatus according to claim 3.

An estimation unit that estimates, based on the first voice data and the second voice data, a first distance between the first speaker and the information processing apparatus and a second distance between the second speaker and the information processing apparatus, and
An accuracy calculation unit that calculates the recognition accuracy of each of the first recognition result and the second recognition result based on at least one of a combination of the first volume of the first voice and the second volume of the second voice, or a combination of the first distance and the second distance, are further provided.
The display unit causes the user terminal to display the recognition accuracy together with the comparison result.
The information processing apparatus according to claim 3 or 4.

The display unit causes the user terminal to display, for a section in which the first text information and the second text information do not match in the comparison result, an edit form for editing the first text information or the second text information of that section,
The reception unit receives, from the user terminal, the text information input by the user into the edit form, and
The record generation unit generates the record data by overwriting the first text information or the second text information with the text information received by the reception unit for the mismatched section.
The information processing apparatus according to claim 5.

An estimation unit that estimates a first distance between the first speaker and the information processing device and a second distance between the second speaker and the information processing device.
An accuracy calculation unit that calculates the recognition accuracy of each of the first recognition result and the second recognition result based on at least one of a combination of the first volume of the first voice and the second volume of the second voice, or a combination of the first distance and the second distance,
A reliability calculation unit that calculates the reliability of each of the first recognition result and the second recognition result,
A selection unit that selects, for a section in which the first text information and the second text information do not match in the comparison result, which of the first text information and the second text information is to be adopted as the record of the utterance of the first speaker, based on at least one of the reliability or the recognition accuracy, are further provided, and
The record generation unit further combines the first text information and the second text information for each section based on the selection result by the selection unit to generate the record data.
The information processing apparatus according to claim 2.

A voice data generation unit that generates a third voice data for outputting a third voice based on the first text information by using a voice synthesis process.
A processing unit that processes the first audio data and the third audio data into stereo audio data for stereophonically reproducing the first audio and the third audio.
When the repeat mode is set, the reproduction unit performs the stereophonic reproduction instead of the reproduction of the first audio based on the stereo audio data.
The information processing apparatus according to any one of claims 1 to 7.

On the computer
A first voice acquisition function that acquires the first voice data of the first voice by the first speaker, and
A reproduction function for reproducing the first voice based on the first voice data when a repeat mode for acquiring the voice for reciting the first voice is set.
A second voice acquisition function that acquires the second voice data of the second voice by the second speaker as the voice data of the voice to be repeated when the repeat mode is set.
A recognition result acquisition function that acquires, based on the first voice data and the second voice data, first text information indicating a first recognition result of the first voice and second text information indicating a second recognition result of the second voice, and
A record generation function that generates, based on the first text information and the second text information, textual record data of the utterance of the first speaker are realized.
program.

The computer
Acquire the first voice data of the first voice by the first speaker,
When the repeat mode for acquiring the voice to repeat the first voice is set, the first voice is reproduced based on the first voice data.
When the repeat mode is set, the second voice data of the second voice by the second speaker is acquired as the voice data of the voice to be repeated.
Based on the first voice data and the second voice data, acquire first text information indicating a first recognition result of the first voice and second text information indicating a second recognition result of the second voice,
Based on the first text information and the second text information, the recorded data of the utterance of the first speaker by text is generated.
Information processing method.