JP2023174514A

JP2023174514A - Information processing device, information processing method, and program

Info

Publication number: JP2023174514A
Application number: JP2023050529A
Authority: JP
Inventors: 将樹能勢; Masaki Nose; 克之大村; Katsuyuki Omura; 直嗣篠原; Naotsugu Shinohara; 徹福原; Toru Fukuhara; 紀子高橋; Noriko Takahashi
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2022-05-26
Filing date: 2023-03-27
Publication date: 2023-12-07

Abstract

To assist selection of an utterance to be targeted for training data. SOLUTION: In an information processing system 100, an information processing device includes: an acquisition unit that acquires voice data; a voice recognition unit that detects, from a voice related to the acquired voice data, an utterance section which is a section in which an utterance has been made; a determination unit that determines whether the utterance in the detected utterance section satisfies one or more conditions previously set for outputting candidates for training data; and an output unit that outputs content of a first utterance in the utterance section determined as satisfying the conditions as a candidate for the training data. SELECTED DRAWING: Figure 14

Description

本発明は、情報処理装置、情報処理方法、プログラムに関する。 The present invention relates to an information processing device, an information processing method, and a program.

従来では、特定の番組の番組音声（音声データ）と、当該番組に予め付されている字幕テキストと、番組音声の書き起こしと、を用いて音声言語コーパスを生成して、音声認識に使用させる音響モデルを学習する技術が知られている。 Conventionally, a technique is known in which a spoken language corpus is generated from the program audio (audio data) of a specific program, the subtitle text attached to the program in advance, and a transcription of the program audio, and is used to train an acoustic model for speech recognition.

上述した従来の技術では、番組の音声データを用いて教師データの生成を支援することが開示されている。しかしながら、作業者が、音声データを確認して教師データを生成する際、教師データの対象となる発話を選定することは負担となる場合があった。 The conventional technique described above discloses supporting the generation of training data using the audio data of a program. However, when a worker checks the audio data to generate training data, selecting the utterances to be used as training data can be burdensome.

開示の技術は、上記事情に鑑みたものであり、教師データの対象となる発話の選定を支援する、ことを目的とする。 The disclosed technology has been developed in view of the above-mentioned circumstances, and aims to support the selection of utterances to be used as training data.

開示の技術は、音声データを取得する取得部と、前記音声データに係る音声から、発話がされた区間である発話区間を検出する音声認識部と、検出された前記発話区間の発話が、教師データの候補を出力するために予め設定された１以上の条件を満たすかを判断する判断部と、前記判断部で前記条件を満たすと判断された前記発話区間における第１の発話の内容を、前記教師データの候補として出力する出力部と、を有する情報処理装置である。 The disclosed technology is an information processing device including: an acquisition unit that acquires voice data; a voice recognition unit that detects, from the voice related to the voice data, an utterance section, which is a section in which an utterance has been made; a determination unit that determines whether the utterance in the detected utterance section satisfies one or more conditions set in advance for outputting candidates for training data; and an output unit that outputs the content of a first utterance in the utterance section determined by the determination unit to satisfy the conditions as a candidate for the training data.

本発明の一実施形態によると、教師データの対象となる発話の選定を支援できる。 According to an embodiment of the present invention, it is possible to support selection of utterances to be used as training data.

情報処理システムのシステム構成の一例を示す図である。 FIG. 1 is a diagram illustrating an example of the system configuration of the information processing system.
重複発話について説明する第一の図である。 FIG. 2 is a first diagram illustrating overlapping utterances.
重複発話について説明する第二の図である。 FIG. 3 is a second diagram illustrating overlapping utterances.
相槌とフィラーについて説明する第一の図である。 FIG. 4 is a first diagram illustrating backchannels and fillers.
相槌とフィラーについて説明する第二の図である。 FIG. 5 is a second diagram illustrating backchannels and fillers.
音声データからの重複発話の除外について説明する第一の図である。 FIG. 6 is a first diagram illustrating the exclusion of overlapping utterances from audio data.
比較例を示す図である。 FIG. 7 is a diagram showing a comparative example.
音声データからの重複発話の除外について説明する第二の図である。 FIG. 8 is a second diagram illustrating the exclusion of overlapping utterances from audio data.
音声データからの相槌及びフィラーの除外について説明する図である。 FIG. 9 is a diagram illustrating the exclusion of backchannels and fillers from audio data.
孤立した相槌、孤立したフィラーについて説明する図である。 FIG. 10 is a diagram illustrating an isolated backchannel and an isolated filler.
音声データと、特定の条件を満たすとされる発話区間との関係を示す図である。 FIG. 11 is a diagram showing the relationship between audio data and utterance sections regarded as satisfying a specific condition.
情報処理装置のハードウェア構成の一例を示す図である。 FIG. 12 is a diagram illustrating an example of the hardware configuration of the information processing device.
端末装置のハードウェア構成の一例を示す図である。 FIG. 13 is a diagram illustrating an example of the hardware configuration of the terminal device.
情報処理システムの有する各装置の機能構成を説明する図である。 FIG. 14 is a diagram illustrating the functional configuration of each device included in the information processing system.
認識結果データ記憶部の一例を示す図である。 FIG. 15 is a diagram illustrating an example of the recognition result data storage unit.
情報処理装置の処理を説明するフローチャートである。 FIG. 16 is a flowchart illustrating the processing of the information processing device.
教師データの候補の表示例を示す図である。 FIG. 17 is a diagram illustrating a display example of training data candidates.
教師データの候補の表示例を示す他の図である。 FIG. 18 is another diagram illustrating a display example of training data candidates.
音声データの取得方法を説明する図である。 FIG. 19 is a diagram illustrating a method of acquiring audio data.

以下に図面を参照して、実施形態について説明する。図１は、情報処理システムのシムテム構成の一例を示す図である。 Embodiments will be described below with reference to the drawings. FIG. 1 is a diagram showing an example of a system configuration of an information processing system.

本実施形態の情報処理システム１００は、情報処理装置２００と、端末装置４００とを含み、両者はネットワーク等を介して接続されている。 The information processing system 100 of this embodiment includes an information processing device 200 and a terminal device 400, both of which are connected via a network or the like.

本実施形態の情報処理装置２００は、一般的なコンピュータであってよく、音声認識部２５０と、生成支援部２６０とを含む。音声認識部２５０は、音声データを取得し、取得した音声データに対して音声認識処理を行って、音声データから変換された文字列を取得する。 The information processing device 200 of this embodiment may be a general computer, and includes a speech recognition section 250 and a generation support section 260. The speech recognition unit 250 obtains speech data, performs speech recognition processing on the obtained speech data, and obtains a character string converted from the speech data.

なお、以下に説明では、音声データに対して音声認識処理を行って取得したデータを、認識結果データと表現する場合がある。認識結果データは、例えば、音声データに含まれる発話を特定するための識別情報である発話ＩＤと、発話が行われている発話区間の開始時刻と、発話区間の終了時刻と、発話区間の音声データから変換された文字列（テキスト）と、が対応付けられたデータである。 Note that in the following description, data obtained by performing speech recognition processing on audio data may be referred to as recognition result data. The recognition result data is, for example, data in which the following are associated with one another: an utterance ID, which is identification information for identifying an utterance included in the audio data; the start time of the utterance section in which the utterance is made; the end time of the utterance section; and the character string (text) converted from the audio data of the utterance section.
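
The record described above can be pictured as a small data structure. The following is a minimal sketch only; the class name `RecognitionResult`, its field names, and the use of seconds as the time unit are assumptions made for illustration and do not appear in the original description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionResult:
    """One utterance section, with the four associated items named in the text."""
    utterance_id: str   # identification information for the utterance
    start_time: float   # start of the utterance section (assumed to be in seconds)
    end_time: float     # end of the utterance section (assumed to be in seconds)
    text: str           # character string converted from the section's audio

    @property
    def duration(self) -> float:
        """Length of the utterance section."""
        return self.end_time - self.start_time


# Hypothetical example record.
example = RecognitionResult("u001", 12.0, 14.5, "これでいいですか")
```

Keeping the start and end times on each record makes it straightforward to compare utterance sections with one another later, for example when looking for overlaps.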

また、以下の説明では、発話区間の発話内容とは、発話区間の音声データと、発話区間の音声データから変換された文字列との少なくとも何れかを含むものとする。言い換えれば、発話区間の発話内容は、発話区間の音声データに対して音声認識処理を行った結果である認識結果データに含まれる。 Furthermore, in the following description, the utterance content of the utterance section includes at least one of the audio data of the utterance section and the character string converted from the audio data of the utterance section. In other words, the utterance content of the utterance section is included in recognition result data that is the result of voice recognition processing performed on the voice data of the utterance section.

生成支援部２６０は、認識結果データを用いて、音声認識処理の精度を高めるための教師データの生成を支援する。なお、本実施形態の教師データとは、後述する音声認識モデルに機械学習を行わせるための教師データである。 The generation support unit 260 uses the recognition result data to support generation of teacher data for improving the accuracy of speech recognition processing. Note that the teacher data in this embodiment is teacher data for causing a speech recognition model, which will be described later, to perform machine learning.

具体的には、生成支援部２６０は、発話区間毎の認識結果データのうち、発話区間における発話が、予め設定された１以上の条件を満たす発話である認識結果データを、教師データの候補に特定する。そして、生成支援部２６０は、教師データの候補に特定された認識結果データの一覧を、端末装置４００に表示させる。 Specifically, from among the recognition result data for each utterance section, the generation support unit 260 identifies, as training data candidates, the recognition result data in which the utterance in the utterance section satisfies one or more preset conditions. The generation support unit 260 then causes the terminal device 400 to display a list of the recognition result data identified as training data candidates.

端末装置４００は、例えば、タブレット型端末やスマートフォン等であってもよいし、情報処理装置２００と同様の一般的なコンピュータであってもよく、主に、教師データを生成する作業者等によって利用されてよい。 The terminal device 400 may be, for example, a tablet terminal or a smartphone, or may be a general computer similar to the information processing device 200, and may be used mainly by a worker or the like who generates training data.

情報処理装置２００は、端末装置４００に表示された認識結果データの一覧において、認識結果データが選択されると、選択された認識結果データに基づき教師データを生成する。 When recognition result data is selected in the list of recognition result data displayed on the terminal device 400, the information processing device 200 generates teacher data based on the selected recognition result data.

したがって、本実施形態では、教師データを生成する作業者は、教師データの候補とされた認識結果データから、教師データの生成に用いる認識結果データを選択するだけで、音声認識処理の精度を向上させるための教師データを生成することができる。 Therefore, in this embodiment, the worker who generates the training data can generate training data for improving the accuracy of speech recognition processing simply by selecting, from the recognition result data presented as training data candidates, the recognition result data to be used for generating the training data.

なお、図１の例では、情報処理装置２００が音声認識部２５０と生成支援部２６０とを含むものとしたが、これに限定されない。音声認識部２５０と生成支援部２６０とは、それぞれが別々の装置によって実現されてよい。具体的には、例えば、音声認識部２５０は、情報処理装置２００とは別の音声認識装置によって実現されてよい。 Note that in the example of FIG. 1, the information processing device 200 includes the speech recognition section 250 and the generation support section 260, but the present invention is not limited to this. The speech recognition unit 250 and the generation support unit 260 may be realized by separate devices. Specifically, for example, the speech recognition unit 250 may be realized by a speech recognition device different from the information processing device 200.

ここで、本実施形態における着目点について説明する。 Here, the points of interest in this embodiment will be explained.

図２は、重複発話について説明する第一の図である。図３は、重複発話について説明する第二の図である。図２では、例えば、話者１と話者２との発話の一部が重複した場合を示している。 FIG. 2 is a first diagram illustrating overlapping utterances. FIG. 3 is a second diagram illustrating overlapping utterances. FIG. 2 shows, for example, a case where the utterances of speaker 1 and speaker 2 partially overlap.

図２において、話者１の発話区間は、タイミングＴ１からタイミングＴ３までの区間であり、話者２の発話区間は、タイミングＴ２からタイミングＴ４までの区間である。また、タイミングＴ２からタイミングＴ３までの区間では、話者１の発話と話者２の発話とが重複している。 In FIG. 2, the speech section of speaker 1 is the section from timing T1 to timing T3, and the speech section of speaker 2 is the section from timing T2 to timing T4. Furthermore, in the section from timing T2 to timing T3, the utterances of speaker 1 and utterances of speaker 2 overlap.

本実施形態では、このように、複数の話者の発話が重複することを重複発話と表現する場合がある。 In this embodiment, such overlapping utterances of multiple speakers may be expressed as overlapping utterances.

このような重複区間は、例えば、複数人が参加する会議等において頻発する。図３では、各種の会議における会議時間と重複発話が行われた時間（重複発話が行われた時間）との割合の例を示す図である。なお、会議時間とは、音声データの録音が開始されてから終了するまでの期間を示す。 Such overlapping sections frequently occur, for example, in meetings where multiple people participate. FIG. 3 is a diagram illustrating an example of the ratio between the meeting time and the time during which overlapping utterances were made (time during which overlapping utterances were made) in various conferences. Note that the conference time refers to the period from the start to the end of audio data recording.

図３に示すように、一般的な会議の場合、全体の発話に対する重複発話の時間的割合は、図３に示すように、全体の発話時間の５％未満である。 As shown in FIG. 3, in a typical conference, overlapping utterances account for less than 5% of the total utterance time.

しかし、音声認識の学習データは、センテンス（意味を成す１つの発話）単位であるため、例えば、重複発話の時間がごく短時間であったとしても、その部分がセンテンス全体の学習に悪影響を与える可能性があり、そのセンテンス全体が学習に適さない場合がある。 However, because training data for speech recognition is handled in units of sentences (single meaningful utterances), even a very brief overlapping utterance may adversely affect the learning of the whole sentence, and the entire sentence may then be unsuitable for training.

図４は、相槌とフィラーについて説明する第一の図である。図５は、相槌とフィラーについて説明する第二の図である。 FIG. 4 is a first diagram illustrating backchannels and fillers. FIG. 5 is a second diagram illustrating backchannels and fillers.

会議等で行われる人同士の話し言葉では、フィラーや相槌も多発する。フィラーとは、図４に示すように、「あの」、「えーと」等の場繋ぎ的な表現の言葉、または驚いたときや落胆したときなどに発する感動詞のことである。 Fillers and backchannels also occur frequently in spoken conversation between people, such as in meetings. As shown in FIG. 4, a filler is a word used to bridge a pause, such as "ano" or "eeto" (roughly "um" or "uh"), or an interjection uttered when surprised or disappointed.

フィラー自体は音声認識の学習への悪影響はほぼ無視できるものの、教師データ等においては不要な情報である。 Although the filler itself has almost negligible negative effects on speech recognition learning, it is unnecessary information in teacher data and the like.

一方、相槌は受け取り側の意図を示すため、教師データ等において意味はあるが、必要以上多く発生するため、全ての相槌について作業者が確認することは大きな負担になる。 Backchannels, on the other hand, indicate the listener's intent and are therefore meaningful in training data and the like, but they occur more often than necessary, so having the worker check every backchannel is a heavy burden.

図５では、複数種類の会議における発話長をヒストグラム化した図の例である。図５から、ヒストグラム１～６に示すいずれの会議においても、発話長が１秒未満の発話の頻度が最も高いことがわかる。発話長が１秒未満の発話とは、相槌を含む発話であることを強く示す。 FIG. 5 is an example of histograms of utterance lengths in multiple types of conferences. FIG. 5 shows that, in every conference represented by histograms 1 to 6, utterances shorter than 1 second are the most frequent. An utterance length of less than 1 second strongly suggests that the utterance includes a backchannel.
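
As a rough illustration of the observation about utterance length, the share of short utterances in a recording could be computed as below. This is a sketch under assumptions: the function name, the list of durations, and the use of seconds are hypothetical, and only the 1-second threshold comes from the text.

```python
def short_utterance_ratio(durations, threshold=1.0):
    """Return the fraction of utterances whose length is below `threshold` seconds."""
    if not durations:
        return 0.0
    short = sum(1 for d in durations if d < threshold)
    return short / len(durations)


# Hypothetical utterance lengths in seconds for one meeting.
sample_durations = [0.4, 0.8, 2.5, 3.0]
ratio = short_utterance_ratio(sample_durations)
```

A high ratio would suggest that a large part of the recording consists of short utterances that are likely to contain backchannels.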

このように、人同士の会話の中には、相槌のように単調な発話が多く発生するため、これら全ての音声データに対して作業者が確認することは大きな負担になることがわかる。 As described above, conversations between people contain many monotonous utterances such as backchannels, so having a worker check all of this audio data is a heavy burden.

本実施形態では、これらの点に着目し、複数人の会話を録音した音声データに含まれる発話区間毎の認識結果データのうち、発話が重複発話、相槌、フィラーである発話区間の認識結果データを、教師データの候補から除外する。本実施形態では、このように、教師データの候補を選別することで、教師データの生成にかかる作業者の負荷を削減することができる。言い換えれば、本実施形態では、効率的に教師データを生成することができる。 Focusing on these points, this embodiment excludes from the training data candidates, among the recognition result data for each utterance section included in audio data of a recorded conversation between multiple people, the recognition result data of utterance sections in which the utterance is an overlapping utterance, a backchannel, or a filler. By screening the training data candidates in this way, this embodiment can reduce the burden on the worker who generates the training data. In other words, training data can be generated efficiently.

なお、本実施形態における、複数人の会話を録音した音声データとは、話者の口元とマイクとの距離が一定の距離以上離れている状態で録音された音声データであってよい。 Note that in this embodiment, the audio data obtained by recording a conversation between multiple people may be audio data recorded while the distance between the speaker's mouth and the microphone is a certain distance or more.

以下の説明では、話者の口元とマイクとの距離が一定の距離以上離れている状態を「ＦａｒＦｉｅｌｄ（遠方界）」と表現する場合がある。また、以下の説明では、話者の口元と、音声データを取得するマイクとの距離が一定の距離以上である状態で取得された音声データを、ＦａｒＦｉｅｌｄにおいて取得された音声データと表現する場合がある。音声データの取得方法の詳細は後述する。 In the following description, a state in which the distance between the speaker's mouth and the microphone is a certain distance or more may be expressed as a "Far Field." In addition, in the following explanation, audio data acquired when the distance between the speaker's mouth and the microphone that acquires the audio data is a certain distance or more is expressed as audio data acquired in the Far Field. There is. Details of how to obtain audio data will be described later.

次に、図６乃至図８を参照して、音声データから重複発話を除外した場合と、比較例とについて説明する。 Next, a case where overlapping utterances are excluded from audio data and a comparative example will be described with reference to FIGS. 6 to 8.

図６は、音声データからの重複発話の除外について説明する第一の図である。図６では、図２の例を参照して、重複発話の除外について説明する。 FIG. 6 is a first diagram illustrating the exclusion of overlapping utterances from audio data. In FIG. 6, the exclusion of overlapping utterances is described with reference to the example of FIG. 2.

図２の例では、話者１の発話区間は、タイミングＴ１からタイミングＴ３までの区間であり、話者２の発話区間は、タイミングＴ２からタイミングＴ４までの区間である。また、話者１と話者２との重複発話の区間は、タイミングＴ２からタイミングＴ３までの区間である。 In the example of FIG. 2, the speech section of speaker 1 is the section from timing T1 to timing T3, and the speech section of speaker 2 is the section from timing T2 to timing T4. Furthermore, the section of overlapping utterances between speaker 1 and speaker 2 is the section from timing T2 to timing T3.

そこで、本実施形態では、タイミングＴ１からタイミングＴ４までの音声データにおける発話区間毎の認識結果データのうち、発話が重複していない発話区間の認識結果データを抽出し、教師データの候補とする。なお、以下の説明では、他の話者の発話と重複していない発話を、単独発話と表現する。 Therefore, in the present embodiment, recognition result data for utterance sections in which utterances do not overlap among the recognition result data for each utterance section in the audio data from timing T1 to timing T4 is extracted and used as training data candidates. Note that in the following explanation, an utterance that does not overlap with the utterances of other speakers will be expressed as an independent utterance.

図７は、比較例を示す図である。図７では、タイミングＴ１からタイミングＴ３までの区間（話者１の発話区間）の音声データと、タイミングＴ２からタイミングＴ４までの区間（話者２の発話区間）の音声データに対して音声認識処理を行った場合を示している。 FIG. 7 is a diagram showing a comparative example. FIG. 7 shows a case where speech recognition processing is performed on the audio data of the section from timing T1 to timing T3 (speaker 1's utterance section) and on the audio data of the section from timing T2 to timing T4 (speaker 2's utterance section).

この場合、タイミングＴ１からタイミングＴ３までの区間（話者１の発話区間）の音声データと、タイミングＴ２からタイミングＴ４までの区間（話者２の発話区間）の音声データのそれぞれは、重複発話を含む音声データとなる。 In this case, the audio data of the section from timing T1 to timing T3 (speaker 1's utterance section) and the audio data of the section from timing T2 to timing T4 (speaker 2's utterance section) each include an overlapping utterance.

重複発話を含む音声データは、音韻が不明瞭であり、音声認識によって取得した文字列が不正確となる場合がある。また、重複発話を含む音声データに対して音声認識処理を行った場合、音韻が不明瞭であるため、認識結果データに含まれる発話内容が不正確である可能性が高い。このため、この認識結果データを教師データとして音声認識モデルを学習させても、音声認識の精度向上に対して寄与しない可能性がある。さらに、重複発話を単一発話に分離する技術も高難度であり、高い精度の担保が困難である。 Audio data that includes overlapping utterances has unclear phonemes, and the character string obtained by speech recognition may be inaccurate. When speech recognition processing is performed on such audio data, the utterance content included in the recognition result data is therefore likely to be inaccurate. Consequently, even if a speech recognition model is trained using this recognition result data as training data, it may not contribute to improving speech recognition accuracy. Furthermore, separating overlapping utterances into single utterances is technically very difficult, and high accuracy is hard to guarantee.

図８は、音声データからの重複発話の除外について説明する第二の図である。本実施形態では、単独発話となるタイミングＴ１からタイミングＴ２まで発話区間の認識結果データと、タイミングＴ３からタイミングＴ４までの発話区間の認識結果データとを、教師データの候補とする。 FIG. 8 is a second diagram illustrating the exclusion of overlapping utterances from audio data. In this embodiment, the recognition result data of the utterance section from timing T1 to timing T2 and the recognition result data of the utterance section from timing T3 to timing T4, both of which are solo utterances, are used as training data candidates.

また、本実施形態では、話者１の話者２の重複発話であるタイミングＴ２からタイミングＴ３までの発話区間の認識結果データを、教師データの候補から除外する。 Furthermore, in this embodiment, the recognition result data of the utterance section from timing T2 to timing T3, which is the overlapping utterance of speaker 1 and speaker 2, is excluded from the training data candidates.
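
The handling of the sections T1 to T4 described above can be sketched as simple interval arithmetic. This is an illustration under assumptions rather than the method of the embodiment: the function names, the tuple representation of a section, and the numeric values chosen for T1 to T4 are all hypothetical.

```python
def overlap(a, b):
    """Return the overlapping part of two (start, end) sections, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def solo_parts(section, other):
    """Return the parts of `section` that do not overlap with `other`."""
    ov = overlap(section, other)
    if ov is None:
        return [section]
    parts = []
    if section[0] < ov[0]:
        parts.append((section[0], ov[0]))   # solo part before the overlap
    if ov[1] < section[1]:
        parts.append((ov[1], section[1]))   # solo part after the overlap
    return parts


# Hypothetical timings: T1=0, T2=2, T3=5, T4=8 (seconds).
speaker1 = (0.0, 5.0)   # T1 to T3
speaker2 = (2.0, 8.0)   # T2 to T4

excluded = overlap(speaker1, speaker2)                                      # T2 to T3
candidates = solo_parts(speaker1, speaker2) + solo_parts(speaker2, speaker1)
```

With these values, the section from T2 to T3 is the overlapping part to be excluded, and the sections from T1 to T2 and from T3 to T4 remain as solo utterances.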

このようにすることで、本実施形態では、音韻が明瞭な音声データと、精度の高い音声認識処理によって取得した文字列とが対応付けられた認識結果データを教師データの候補とし、教師データを生成する作業者等に提示することができる。 In this way, this embodiment can take recognition result data in which audio data with clear phonemes is associated with a character string obtained by highly accurate speech recognition processing as training data candidates, and present them to the worker or the like who generates the training data.

次に、図９及び図１０を参照して、音声データからの相槌及びフィラーの除外について説明する。 Next, the exclusion of backchannels and fillers from audio data will be described with reference to FIGS. 9 and 10.

図９は、音声データからの相槌及びフィラーの除外について説明する図である。図９（Ａ）は、話者１の発話中に、話者２が相槌や短い発話（フィラー）を散発的に行った場合を示しており、図９（Ｂ）は、メインの話者である話者１に対して、話者２が相槌や短い発話（フィラー）を返す場合を示している。 FIG. 9 is a diagram illustrating the exclusion of backchannels and fillers from audio data. FIG. 9(A) shows a case where speaker 2 sporadically utters backchannels or short utterances (fillers) while speaker 1 is speaking, and FIG. 9(B) shows a case where speaker 2 responds with backchannels or short utterances (fillers) to speaker 1, who is the main speaker.

図９（Ａ）の例では、話者２の相槌やフィラーは、話者１の発話と重複している。しかしながら、相槌やフィラーは、上述したように、会話中に頻発するため、相槌やフィラーが重複している発話区間を重複発話として、教師データの候補から除外すると、教師データの候補となる認識結果データのデータ量が大幅に減少する。また、メインの話者の発話と重複する相槌やフィラーは、ノイズとして捉えることもできる。 In the example of FIG. 9(A), speaker 2's backchannels and fillers overlap with speaker 1's utterance. However, as described above, backchannels and fillers occur frequently during conversation, so if every utterance section overlapped by a backchannel or filler were treated as an overlapping utterance and excluded from the training data candidates, the amount of recognition result data available as candidates would decrease significantly. Backchannels and fillers that overlap the main speaker's utterance can also be regarded as noise.

そこで、本実施形態では、話者１の発話と話者２の相槌やフィラーが重複した場合は、重複発話とせず、相槌やフィラーのみの音声データに対して音声認識処理を行って取得した認識結果データを、教師データの候補から除外する。 Therefore, in this embodiment, when speaker 1's utterance overlaps with a backchannel or filler from speaker 2, it is not treated as an overlapping utterance; only the recognition result data obtained by performing speech recognition processing on the audio data of the backchannel or filler alone is excluded from the training data candidates.

具体的には、本実施形態では、図９（Ａ）の話者１の発話（メインの話者）を示す音声データと対応する認識結果データのみを教師データの候補とし、話者２の発話である相槌やフィラーを示す音声データと対応する認識結果データは、教師データの候補から除外する。 Specifically, in this embodiment, only the recognition result data corresponding to the audio data of the utterance of speaker 1 (the main speaker) in FIG. 9(A) is used as a training data candidate, and the recognition result data corresponding to the audio data of speaker 2's backchannels and fillers is excluded from the training data candidates.

したがって、図９（Ａ）の例では、話者１の発話区間の認識結果データのみが教師データの候補となる。 Therefore, in the example of FIG. 9A, only the recognition result data of the utterance section of speaker 1 is a candidate for teacher data.

また、本実施形態では、図９（Ｂ）に示すように、メインの話者である話者１に対して、話者２による相槌やフィラーが続いた場合、相槌や孤立したフィラーを示す音声データと対応する認識結果データを教師データの候補から除外する。 In this embodiment, as shown in FIG. 9(B), when backchannels or fillers by speaker 2 follow the utterance of speaker 1, who is the main speaker, the recognition result data corresponding to the audio data of the backchannels or isolated fillers is also excluded from the training data candidates.

具体的には、本実施形態では、図９（Ｂ）のメインの話者である話者１の発話区間の認識結果データのみを、教師データの候補とし、話者２の発話区間の認識結果データは、教師データの候補から除外する。 Specifically, in this embodiment, only the recognition result data of the utterance section of speaker 1, who is the main speaker in FIG. The data is excluded from the training data candidates.

さらに、本実施形態では、会話中における孤立した相槌と、孤立したフィラーに相当する音声データと対応する認識結果データも、教師データの候補から除外する。 Furthermore, in this embodiment, recognition result data corresponding to the audio data of isolated backchannels and isolated fillers in a conversation is also excluded from the training data candidates.

ここで、図１０を参照して、孤立した相槌、孤立したフィラーについて説明する。図１０は、孤立した相槌、孤立したフィラーについて説明する図である。図１０（Ａ）は、孤立した相槌について説明する図であり、図１０（Ｂ）は、孤立したフィラーについて説明する図である。 Isolated backchannels and isolated fillers will now be described with reference to FIG. 10. FIG. 10 is a diagram illustrating an isolated backchannel and an isolated filler. FIG. 10(A) illustrates an isolated backchannel, and FIG. 10(B) illustrates an isolated filler.

本実施形態における、孤立した相槌、孤立したフィラーとは、発話が連続している発話区間内で、フィラーや相槌以外の発話がない状態をいう。 In this embodiment, an isolated backchannel or isolated filler refers to a state in which an utterance section of continuous speech contains no utterance other than a filler or a backchannel.

図１０（Ａ）に示すように、発話区間Ｋ１における発話内容は、「ああ、そうだね」であり、相槌「ああ」の他に、「そうだね」という発話も含まれる。したがって、発話区間Ｋ１の認識結果データは、教師データの候補の対象となる。 As shown in FIG. 10(A), the utterance content in utterance section K1 is "Ah, that's right," which contains the utterance "that's right" in addition to the backchannel "Ah." Therefore, the recognition result data of utterance section K1 is eligible as a training data candidate.

また、発話区間Ｋ２における発話内容は、「ああ」であり、発話区間Ｋ３における発話内容は、「そうだね」である。この場合、発話区間Ｋ２は、発話内容が相槌のみであり、他の発話が含まれないため、孤立した相槌となる。したがって、発話区間Ｋ２の認識結果データは、教師データの候補から除外される。 The utterance content in utterance section K2 is "Ah," and the utterance content in utterance section K3 is "That's right." In this case, utterance section K2 contains only a backchannel and no other utterance, so it is an isolated backchannel. Therefore, the recognition result data of utterance section K2 is excluded from the training data candidates.

また、図１０（Ｂ）では、発話区間Ｋ４における発話内容は、「あのー、これでいいですか」であり、フィラー「あのー」の他に、「これでいいですか」という発話も含まれる。したがって、発話区間Ｋ４の認識結果データは、教師データの候補となる。 Furthermore, in FIG. 10B, the content of the utterance in the utterance section K4 is "Um, is this okay?" and includes the filler "Um," as well as the utterance "Is this okay?" Therefore, the recognition result data of the utterance section K4 is a candidate for teacher data.

また、発話区間Ｋ５における発話内容は、「あのー」であり、発話区間Ｋ６における発話内容は、「これでいいですか」である。この場合、発話区間Ｋ５は、発話内容がフィラーのみであり、他の発話が含まれないため、孤立したフィラーとなる。したがって、発話区間Ｋ５の認識結果データは、教師データの候補から除外される。 Further, the utterance content in the utterance section K5 is "umm", and the utterance content in the utterance section K6 is "Is this okay?". In this case, the utterance section K5 contains only filler utterances and does not include other utterances, so it becomes an isolated filler. Therefore, the recognition result data of the utterance section K5 is excluded from the training data candidates.
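
The distinction drawn above between sections K1 to K6 can be sketched as a check on the recognized text of each utterance section. This is a minimal illustration only: the word lists, the punctuation set, and the function name are hypothetical, and the original does not specify how backchannels and fillers are actually recognized.

```python
# Hypothetical word lists; a practical system would need a far fuller inventory.
BACKCHANNELS = {"ああ", "うん", "はい"}
FILLERS = {"あのー", "あの", "えーと"}
PUNCTUATION = "、。 　"


def is_isolated_backchannel_or_filler(text: str) -> bool:
    """Return True when the section holds nothing but backchannels or fillers."""
    remaining = "".join(ch for ch in text if ch not in PUNCTUATION)
    if not remaining:
        return False  # an empty section is not classified by this sketch
    # Try longer words first so that "あのー" is matched before "あの".
    words = sorted(BACKCHANNELS | FILLERS, key=len, reverse=True)
    while remaining:
        for word in words:
            if remaining.startswith(word):
                remaining = remaining[len(word):]
                break
        else:
            return False  # some other utterance content is present
    return True


sections = {
    "K1": "ああ、そうだね",
    "K2": "ああ",
    "K4": "あのー、これでいいですか",
    "K5": "あのー",
}
excluded = [k for k, text in sections.items() if is_isolated_backchannel_or_filler(text)]
```

Under these assumptions, only the sections whose entire content is a backchannel or filler (K2 and K5) are flagged for exclusion, while K1 and K4 remain eligible because they also contain other utterance content.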

このように、本実施形態では、複数人による発話を録音した音声データのうち、音声データが特定の条件を満たす発話区間の認識結果データを、教師データの候補とする。言い換えれば、本実施形態では、複数人による発話のうち、予め設定された１以上の条件（特定の条件）を満たす発話区間の発話内容を、教師データの候補とする。なお、特定の条件は、情報処理システム１００のユーザによって予め設定されていてよい。情報処理システム１００のユーザとは、例えば情報処理装置２００の管理者等であってもよいし、端末装置４００のユーザ（教師データを生成する作業者）であってもよい。 In this manner, in the present embodiment, recognition result data of a utterance section in which the voice data satisfies a specific condition, out of voice data recorded as utterances by a plurality of people, is used as a candidate for training data. In other words, in this embodiment, among the utterances by a plurality of people, the content of utterances in a utterance section that satisfies one or more preset conditions (specific conditions) is used as a candidate for training data. Note that the specific conditions may be set in advance by the user of the information processing system 100. The user of the information processing system 100 may be, for example, an administrator of the information processing device 200, or a user of the terminal device 400 (a worker who generates teacher data).

特定の条件とは、以下のうちの何れか１つである。 The specific condition is any one of the following.

・メインの話者の発話であり、且つ、複数人の発話が重複していない単独発話であること（以下、条件１と呼ぶ。） - The utterance is by the main speaker and is a solo utterance that does not overlap with the utterances of other speakers (hereinafter referred to as condition 1).
・メインの話者の発話ではなく、且つ、孤立した相槌又は孤立したフィラーのみからなる発話ではないこと（以下、条件２と呼ぶ。） - The utterance is not by the main speaker and does not consist only of an isolated backchannel or an isolated filler (hereinafter referred to as condition 2).
・一部に重複発話を含むメインの話者の発話において、単独発話となる部分が存在すること（以下、条件３と呼ぶ。） - The utterance is by the main speaker and partially includes an overlapping utterance, but contains a portion that is a solo utterance (hereinafter referred to as condition 3).

本実施形態では、この条件の何れか１つを満たす発話区間の認識結果データを、教師データの候補とする。 In this embodiment, recognition result data of an utterance section that satisfies any one of these conditions is used as a candidate for training data.

以下に、図１１を参照して、複数人の発話を録音した音声データと、特定の条件を満たすとされる発話区間との関係を具体的に説明する。 Below, with reference to FIG. 11, the relationship between audio data recorded from multiple people's utterances and utterance sections that satisfy a specific condition will be specifically described.

図１１は、音声データと、特定の条件を満たすとされる発話区間との関係を示す図である。 FIG. 11 is a diagram showing the relationship between audio data and utterance sections that satisfy a specific condition.

図１１では、複数人の発話を録音した音声データが示す音声波形１０と、発話区間毎の音声データから変換された文字列とを対応付けて示している。また、図１１において、領域１１は、メインの話者である話者１の発話を示し、領域１２は、メインの話者ではない話者２の発話を示す。 In FIG. 11, an audio waveform 10 represented by audio data obtained by recording the utterances of a plurality of people is shown in association with a character string converted from the audio data for each utterance section. Further, in FIG. 11, area 11 shows the utterances of speaker 1 who is the main speaker, and area 12 shows the utterances of speaker 2 who is not the main speaker.

図１１において、話者１による発話区間Ｋ１０、Ｋ１２は、特定の条件のうち、条件１を満たす。したがって、発話区間Ｋ１０、Ｋ１２の認識結果データは、教師データの候補とされる。また、話者１の発話区間Ｋ１４は、話者２の発話区間Ｋ１５と一部が重複している。したがって、本実施形態では、発話区間Ｋ１４のうち、発話区間Ｋ１５と重複していない部分のみが条件３を満たす発話区間となる。また、この発話区間の認識結果データは、教師データの候補とされる。 In FIG. 11, utterance sections K10 and K12 by speaker 1 satisfy condition 1 among the specific conditions. Therefore, the recognition result data of the utterance sections K10 and K12 are considered as training data candidates. Further, the speech section K14 of speaker 1 partially overlaps with the speech section K15 of speaker 2. Therefore, in this embodiment, only the portion of the utterance section K14 that does not overlap with the utterance section K15 becomes the utterance section that satisfies condition 3. Furthermore, the recognition result data of this utterance section is used as a candidate for teacher data.

また、話者１の発話区間Ｋ１６の発話内容は、相槌であるが、条件１を満たす。したがって、発話区間Ｋ１６の認識結果データは、教師データの候補とされる。 The utterance content of speaker 1's utterance section K16 is a backchannel, but it satisfies condition 1. Therefore, the recognition result data of utterance section K16 is used as a training data candidate.

また、図１１において、発話区間Ｋ１１、Ｋ１３、Ｋ１７は、話者２による孤立した相槌であり、特定の条件である条件１～３の何れも満たさない。したがって、発話区間Ｋ１１、Ｋ１３、Ｋ１７の認識結果データは、教師データの候補から除外される。 In FIG. 11, utterance sections K11, K13, and K17 are isolated backchannels by speaker 2 and satisfy none of conditions 1 to 3. Therefore, the recognition result data of utterance sections K11, K13, and K17 is excluded from the training data candidates.

さらに、発話区間Ｋ１５は、重複発話であり、特定の条件である条件１～３の何れも満たさない。したがって、発話区間Ｋ１５の認識結果データは、教師データの候補から除外される。 Furthermore, utterance section K15 is an overlapping utterance and satisfies none of conditions 1 to 3. Therefore, the recognition result data of utterance section K15 is excluded from the training data candidates.
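
The FIG. 11 walk-through can be summarized as a small decision function over conditions 1 to 3. This is a sketch under stated assumptions, not the determination logic of the embodiment: the class `Section`, its boolean fields, and the function name are hypothetical. In particular, the sketch assumes that an overlapping utterance by a speaker other than the main speaker fails condition 2, which is an interpretation chosen so that section K15 is excluded as the text describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    name: str
    is_main_speaker: bool
    overlaps_other: bool          # overlaps another speaker's utterance
    isolated_backchannel_or_filler: bool = False
    has_solo_part: bool = False   # part of the section is free of overlap


def satisfied_condition(s: Section) -> Optional[int]:
    """Return 1, 2, or 3 for the condition met, or None if the section is excluded."""
    if s.is_main_speaker and not s.overlaps_other:
        return 1
    # Assumption: condition 2 is applied only to non-overlapping utterances.
    if (not s.is_main_speaker and not s.overlaps_other
            and not s.isolated_backchannel_or_filler):
        return 2
    if s.is_main_speaker and s.overlaps_other and s.has_solo_part:
        return 3  # only the non-overlapping part becomes the candidate
    return None


fig11 = [
    Section("K10", True, False),
    Section("K11", False, False, isolated_backchannel_or_filler=True),
    Section("K14", True, True, has_solo_part=True),
    Section("K15", False, True),
    Section("K16", True, False, isolated_backchannel_or_filler=True),
]
results = {s.name: satisfied_condition(s) for s in fig11}
```

In this sketch K10 and K16 meet condition 1, K14 meets condition 3 for its non-overlapping part, and K11 and K15 meet none of the conditions, matching the outcome described for FIG. 11.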

本実施形態の情報処理システム１００において、情報処理装置２００は、このようにして、教師データの候補となる認識結果データを特定し、教師データの候補を端末装置４００に表示させる。そして、本実施形態の情報処理装置２００は、端末装置４００のユーザに選択された認識結果データを用いて、教師データを生成する。 In the information processing system 100 of the present embodiment, the information processing device 200 identifies recognition result data that are candidates for teacher data in this way, and causes the terminal device 400 to display the candidates for the teacher data. Then, the information processing device 200 of this embodiment generates teacher data using the recognition result data selected by the user of the terminal device 400.

以下に、本実施形態の情報処理システム１００の有する各装置について説明する。図１２は、情報処理装置のハードウェア構成の一例を示す図である。 Each device included in the information processing system 100 of this embodiment will be described below. FIG. 12 is a diagram illustrating an example of the hardware configuration of the information processing device.

情報処理装置２００は、コンピュータによって構築されており、図１２に示されているように、ＣＰＵ２０１、ＲＯＭ２０２、ＲＡＭ２０３、ＨＤ２０４、ＨＤＤ(Hard Disk Drive)コントローラ２０５、ディスプレイ２０６、外部機器接続Ｉ／Ｆ(Interface)２０８、ネットワークＩ／Ｆ２０９、バスラインＢ１、キーボード２１１、ポインティングデバイス２１２、ＤＶＤ－ＲＷ(Digital Versatile Disk Rewritable)ドライブ２１４、メディアＩ／Ｆ２１６を備えている。 The information processing device 200 is built on a computer and, as shown in FIG. 12, includes a CPU 201, a ROM 202, a RAM 203, an HD 204, an HDD (Hard Disk Drive) controller 205, a display 206, an external device connection I/F (Interface) 208, a network I/F 209, a bus line B1, a keyboard 211, a pointing device 212, a DVD-RW (Digital Versatile Disk Rewritable) drive 214, and a media I/F 216.

これらのうち、ＣＰＵ２０１は、情報処理装置２００全体の動作を制御する。ＲＯＭ２０２は、ＩＰＬ等のＣＰＵ２０１の駆動に用いられるプログラムを記憶する。ＲＡＭ２０３は、ＣＰＵ２０１のワークエリアとして使用される。ＨＤ２０４は、プログラム等の各種データを記憶する。ＨＤＤコントローラ２０５は、ＣＰＵ２０１の制御にしたがってＨＤ２０４に対する各種データの読み出し又は書き込みを制御する。 Among these, the CPU 201 controls the operation of the information processing device 200 as a whole. The ROM 202 stores programs used to drive the CPU 201, such as IPL. RAM 203 is used as a work area for CPU 201. The HD 204 stores various data such as programs. The HDD controller 205 controls reading and writing of various data to the HD 204 under the control of the CPU 201.

ディスプレイ（表示装置）２０６は、カーソル、メニュー、ウィンドウ、文字、又は画像などの各種情報を表示する。外部機器接続Ｉ／Ｆ２０８は、各種の外部機器を接続するためのインターフェースである。この場合の外部機器は、例えば、ＵＳＢ(Universal Serial Bus)メモリやプリンタ等である。ネットワークＩ／Ｆ２０９は、通信ネットワークを利用してデータ通信をするためのインターフェースである。バスラインＢ１は、図１２に示されているＣＰＵ２０１等の各構成要素を電気的に接続するためのアドレスバスやデータバス等である。 A display (display device) 206 displays various information such as a cursor, menu, window, characters, or images. External device connection I/F 208 is an interface for connecting various external devices. The external device in this case is, for example, a USB (Universal Serial Bus) memory, a printer, or the like. The network I/F 209 is an interface for data communication using a communication network. The bus line B1 is an address bus, a data bus, etc. for electrically connecting each component such as the CPU 201 shown in FIG. 12.

また、キーボード２１１は、文字、数値、各種指示などの入力のための複数のキーを備えた入力手段の一種である。ポインティングデバイス２１２は、各種指示の選択や実行、処理対象の選択、カーソルの移動などを行う入力手段の一種である。ＤＶＤ－ＲＷドライブ２１４は、着脱可能な記録媒体の一例としてのＤＶＤ－ＲＷ２１３に対する各種データの読み出し又は書き込みを制御する。なお、ＤＶＤ－ＲＷに限らず、ＤＶＤ－Ｒ等であってもよい。メディアＩ／Ｆ２１６は、フラッシュメモリ等の記録メディア２１５に対するデータの読み出し又は書き込み（記憶）を制御する。 Further, the keyboard 211 is a type of input means that includes a plurality of keys for inputting characters, numerical values, various instructions, and the like. The pointing device 212 is a type of input means for selecting and executing various instructions, selecting a processing target, moving a cursor, and the like. The DVD-RW drive 214 controls reading and writing of various data on the DVD-RW 213, which is an example of a removable recording medium. Note that it is not limited to DVD-RW, but may be DVD-R or the like. The media I/F 216 controls reading or writing (storage) of data to a recording medium 215 such as a flash memory.

図１３は、端末装置のハードウェア構成の一例を示す図である。図１３では、端末装置４００がスマートフォンである場合のハードウェア構成を示す。 FIG. 13 is a diagram illustrating an example of the hardware configuration of a terminal device. FIG. 13 shows a hardware configuration when the terminal device 400 is a smartphone.

図１３に示されているように、端末装置４００は、ＣＰＵ４０１、ＲＯＭ４０２、ＲＡＭ４０３、ＥＥＰＲＯＭ４０４、ＣＭＯＳセンサ４０５、撮像素子Ｉ／Ｆ４０６、加速度・方位センサ４０７、メディアＩ／Ｆ４０９、ＧＰＳ受信部４１１を備えている。 As shown in FIG. 13, the terminal device 400 includes a CPU 401, a ROM 402, a RAM 403, an EEPROM 404, a CMOS sensor 405, an image sensor I/F 406, an acceleration/direction sensor 407, a media I/F 409, and a GPS receiving unit 411.

これらのうち、ＣＰＵ４０１は、端末装置４００全体の動作を制御する。ＲＯＭ４０２は、ＣＰＵ４０１やＩＰＬ等のＣＰＵ４０１の駆動に用いられるプログラムを記憶する。ＲＡＭ４０３は、ＣＰＵ４０１のワークエリアとして使用される。ＥＥＰＲＯＭ４０４は、ＣＰＵ４０１の制御にしたがって、スマートフォン用プログラム等の各種データの読み出し又は書き込みを行う。 Among these, the CPU 401 controls the operation of the entire terminal device 400. The ROM 402 stores programs used to drive the CPU 401, such as an IPL. The RAM 403 is used as a work area for the CPU 401. The EEPROM 404 reads or writes various data, such as smartphone programs, under the control of the CPU 401.

ＣＭＯＳ(Complementary Metal Oxide Semiconductor)センサ４０５は、ＣＰＵ４０１の制御に従って被写体（主に自画像）を撮像して画像データを得る内蔵型の撮像手段の一種である。なお、ＣＭＯＳセンサではなく、ＣＣＤ(Charge Coupled Device)センサ等の撮像手段であってもよい。撮像素子Ｉ／Ｆ４０６は、ＣＭＯＳセンサ４０５の駆動を制御する回路である。加速度・方位センサ４０７は、地磁気を検知する電子磁気コンパスやジャイロコンパス、加速度センサ等の各種センサである。メディアＩ／Ｆ４０９は、フラッシュメモリ等の記録メディア４０８に対するデータの読み出し又は書き込み（記憶）を制御する。ＧＰＳ受信部４１１は、ＧＰＳ衛星からＧＰＳ信号を受信する。 The CMOS (Complementary Metal Oxide Semiconductor) sensor 405 is a type of built-in imaging means that images a subject (mainly a self-portrait) under the control of the CPU 401 to obtain image data. Note that an imaging means such as a CCD (Charge Coupled Device) sensor may be used instead of a CMOS sensor. The image sensor I/F 406 is a circuit that controls driving of the CMOS sensor 405. The acceleration/direction sensor 407 comprises various sensors, such as an electronic magnetic compass that detects geomagnetism, a gyrocompass, and an acceleration sensor. The media I/F 409 controls reading or writing (storage) of data to a recording medium 408 such as a flash memory. The GPS receiving unit 411 receives GPS signals from GPS satellites.

また、端末装置４００は、遠距離通信回路４１２、ＣＭＯＳセンサ４１３、撮像素子Ｉ／Ｆ４１４、マイク４１５、スピーカ４１６、音入出力Ｉ／Ｆ４１７、ディスプレイ４１８、外部機器接続Ｉ／Ｆ(Interface)４１９、近距離通信回路４２０、近距離通信回路４２０のアンテナ４２０ａ、及びタッチパネル４２１を備えている。 The terminal device 400 also includes a long distance communication circuit 412, a CMOS sensor 413, an image sensor I/F 414, a microphone 415, a speaker 416, a sound input/output I/F 417, a display 418, an external device connection I/F (Interface) 419, It includes a short-range communication circuit 420, an antenna 420a of the short-range communication circuit 420, and a touch panel 421.

これらのうち、遠距離通信回路４１２は、アンテナ４１２ａにより、通信ネットワークを介して、他の機器と通信する回路である。ＣＭＯＳセンサ４１３は、ＣＰＵ４０１の制御に従って被写体を撮像して画像データを得る内蔵型の撮像手段の一種である。撮像素子Ｉ／Ｆ４１４は、ＣＭＯＳセンサ４１３の駆動を制御する回路である。マイク４１５は、音を電気信号に変える内蔵型の回路である。スピーカ４１６は、電気信号を物理振動に変えて音楽や音声などの音を生み出す内蔵型の回路である。 Among these, the long-distance communication circuit 412 is a circuit that communicates with other devices via a communication network using an antenna 412a. The CMOS sensor 413 is a type of built-in imaging means that images a subject and obtains image data under the control of the CPU 401. The image sensor I/F 414 is a circuit that controls driving of the CMOS sensor 413. Microphone 415 is a built-in circuit that converts sound into electrical signals. The speaker 416 is a built-in circuit that converts electrical signals into physical vibrations to produce sounds such as music and voice.

音入出力Ｉ／Ｆ４１７は、ＣＰＵ４０１の制御に従ってマイク４１５及びスピーカ４１６との間で音信号の入出力を処理する回路である。ディスプレイ４１８は、被写体の画像や各種アイコン等を表示する液晶や有機ＥＬ(Electro Luminescence)などの表示手段の一種である。外部機器接続Ｉ／Ｆ４１９は、各種の外部機器を接続するためのインターフェースである。近距離通信回路４２０は、ＮＦＣ(Near Field Communication)やＢｌｕｅｔｏｏｔｈ（登録商標）等の通信回路である。タッチパネル４２１は、利用者がディスプレイ４１８を押下することで、端末装置４００を操作する入力手段の一種である。 The sound input/output I/F 417 is a circuit that processes input/output of sound signals between the microphone 415 and the speaker 416 under the control of the CPU 401 . The display 418 is a type of display means such as a liquid crystal or organic EL (Electro Luminescence) that displays images of the subject, various icons, and the like. The external device connection I/F 419 is an interface for connecting various external devices. The near field communication circuit 420 is a communication circuit such as NFC (Near Field Communication) or Bluetooth (registered trademark). The touch panel 421 is a type of input means by which the user operates the terminal device 400 by pressing the display 418.

また、端末装置４００は、バスライン４１０を備えている。バスライン４１０は、図１３に示されているＣＰＵ４０１等の各構成要素を電気的に接続するためのアドレスバスやデータバス等である。 The terminal device 400 also includes a bus line 410. The bus line 410 is an address bus, a data bus, etc. for electrically connecting each component such as the CPU 401 shown in FIG. 13.

次に、図１４を参照して、本実施形態の情報処理システム１００の有する各装置の機能について説明する。図１４は、情報処理システムの有する各装置の機能構成を説明する図である。 Next, with reference to FIG. 14, the functions of each device included in the information processing system 100 of this embodiment will be described. FIG. 14 is a diagram illustrating the functional configuration of each device included in the information processing system.

はじめに、情報処理装置２００の機能構成について説明する。本実施形態の情報処理装置２００は、音声認識部２５０、生成支援部２６０、通信制御部２６５、音声データ記憶部２７０、認識結果データ記憶部２８０、教師データ記憶部２９０を含む。音声認識部２５０、生成支援部２６０は、情報処理装置２００の有するＣＰＵ２０１がＨＤ２０４等に格納されたプログラムを読み出して実行することで実現される。音声データ記憶部２７０、認識結果データ記憶部２８０、教師データ記憶部２９０は、ＨＤ２０４等が有する記憶領域によって実現される。 First, the functional configuration of the information processing device 200 will be explained. The information processing device 200 of this embodiment includes a speech recognition section 250, a generation support section 260, a communication control section 265, a speech data storage section 270, a recognition result data storage section 280, and a teacher data storage section 290. The speech recognition unit 250 and the generation support unit 260 are realized by the CPU 201 of the information processing device 200 reading and executing a program stored in the HD 204 or the like. The voice data storage section 270, the recognition result data storage section 280, and the teacher data storage section 290 are realized by storage areas of the HD 204 and the like.

本実施形態の情報処理装置２００において、音声データ記憶部２７０は、情報処理装置２００が取得した音声データが格納される。認識結果データ記憶部２８０は、音声認識部２５０による音声認識処理の結果である認識結果データが格納される。認識結果データ記憶部２８０の詳細は後述する。 In the information processing device 200 of this embodiment, the audio data storage unit 270 stores audio data acquired by the information processing device 200. The recognition result data storage section 280 stores recognition result data that is the result of the speech recognition process by the speech recognition section 250. Details of the recognition result data storage section 280 will be described later.

なお、認識結果データ記憶部２８０において、認識結果データは、発話区間毎の発話ＩＤによって特定される音声データと対応付けられて格納されていてもよい。また、本実施形態では、音声データ記憶部２７０に格納された音声データに対し、発話区間毎の発話ＩＤが付与されていてもよい。本実施形態では、発話区間毎の認識結果データと、発話区間毎の音声データとが発話ＩＤによって対応付けられていればよい。 Note that in the recognition result data storage unit 280, the recognition result data may be stored in association with the voice data specified by the utterance ID of each utterance section. Furthermore, in the present embodiment, the speech data stored in the speech data storage section 270 may be given an utterance ID for each utterance section. In this embodiment, it is sufficient that the recognition result data for each utterance section and the audio data for each utterance section are correlated by the utterance ID.

また、認識結果データ記憶部２８０において、認識結果データは、教師データの候補とされたか否かを示す情報と対応付けられて格納されていてよい。言い換えれば、認識結果データ記憶部２８０において、認識結果データは、後述する判断部２６１による判断結果を示す情報と対応付けられて格納されていてよい。 Further, in the recognition result data storage unit 280, the recognition result data may be stored in association with information indicating whether or not the recognition result data is a candidate for teacher data. In other words, in the recognition result data storage unit 280, the recognition result data may be stored in association with information indicating a determination result by the determination unit 261, which will be described later.

教師データ記憶部２９０は、後述する生成部２６３により生成された教師データが格納される。教師データは、音声データと、音声データから変換された文字列とが対応付けられたデータであってよい。 The teacher data storage unit 290 stores teacher data generated by a generation unit 263, which will be described later. The teacher data may be data in which audio data is associated with a character string converted from the audio data.

音声認識部２５０は、取得部２５１、区間検出定部２５２、音声認識モデル２５３、学習部２５４を含む。取得部２５１は、音声データを取得する。取得部２５１が取得する音声データは、音声データ記憶部２７０から読み出された音声データであってもよいし、情報処理システム１００の外部装置から取得した音声データであってもよい。取得部２５１は、情報処理システム１００の外部装置から音声データを取得した場合には、取得した音声データを音声データ記憶部２７０に格納してよい。 The speech recognition section 250 includes an acquisition section 251, a section detection section 252, a speech recognition model 253, and a learning section 254. The acquisition unit 251 acquires audio data. The audio data acquired by the acquisition unit 251 may be audio data read from the audio data storage unit 270, or may be audio data acquired from an external device of the information processing system 100. When acquiring audio data from an external device of the information processing system 100, the acquisition unit 251 may store the acquired audio data in the audio data storage unit 270.

区間検出定部２５２は、取得された音声データに係る音声から発話区間を検出する。発話区間とは、発話が行われている区間を示す。本実施形態の区間検出部２５２は、音声データにおける発話区間を検出すると、特定された発話区間に対して発話ＩＤを付与し、発話区間の開始時刻と終了時刻とを、発話ＩＤとを対応付けてよい。 The section detection section 252 detects utterance sections from the audio of the acquired audio data. An utterance section is a section in which an utterance is being made. When the section detection section 252 of this embodiment detects an utterance section in the audio data, it may assign an utterance ID to the detected utterance section and associate the start time and end time of the utterance section with that utterance ID.

音声認識モデル２５３は、話者の口元とマイク等の集音装置との距離が一定の距離以上離れている状態において取得された音声データに対し、音声認識処理を行う音声認識器であってよく、音声認識処理の結果として文字列（テキスト）を取得する。言い換えれば、本実施形態の音声認識モデル２５３は、ＦａｒＦｉｅｌｄにおいて取得された音声データに対し、音声認識を行って、音声データを文字列に変換する音声認識器であってよい。音声認識モデル２５３の詳細は後述する。 The voice recognition model 253 may be a voice recognizer that performs voice recognition processing on voice data acquired when the distance between the speaker's mouth and a sound collection device such as a microphone is a certain distance or more. , obtain a character string (text) as a result of speech recognition processing. In other words, the speech recognition model 253 of this embodiment may be a speech recognizer that performs speech recognition on speech data acquired in the Far Field and converts the speech data into a character string. Details of the speech recognition model 253 will be described later.

音声認識モデル２５３によって取得された文字列は、発話ＩＤ、発話区間の開始時刻及び終了時刻と対応付けられた認識結果データとして認識結果データ記憶部２８０に格納されてよい。 The character string acquired by the speech recognition model 253 may be stored in the recognition result data storage unit 280 as recognition result data associated with the utterance ID and the start time and end time of the utterance section.

ＦａｒＦｉｅｌｄにおいて取得された音声データとは、具体的には、例えば、バウンダリーマイクのような卓上マイクを用いて収音した音声データである。 Specifically, the audio data acquired in the Far Field is, for example, audio data collected using a tabletop microphone such as a boundary microphone.

学習部２５４は、教師データ記憶部２９０に格納された教師データが入力されると、音声認識モデル２５３を学習させる。 The learning unit 254 causes the speech recognition model 253 to learn when the teacher data stored in the teacher data storage unit 290 is input.

生成支援部２６０は、判断部２６１、出力部２６２、生成部２６３を含む。判断部２６１は、認識結果データに含まれる発話ＩＤで特定される発話区間における発話が、特定の条件の何れか１つを満たすか否かを判断する。言い換えれば、判断部２６１は、特定の条件を示す情報に基づき、認識結果データを教師データの候補とするか否かを判断する。なお、特定の条件を示す情報は、判断部２６１において保持されていてよい。 The generation support section 260 includes a determination section 261, an output section 262, and a generation section 263. The determining unit 261 determines whether the utterance in the utterance section specified by the utterance ID included in the recognition result data satisfies any one of the specific conditions. In other words, the determination unit 261 determines whether or not the recognition result data is to be a candidate for teacher data, based on information indicating a specific condition. Note that information indicating a specific condition may be held in the determination unit 261.

出力部２６２は、判断部２６１による判断の結果を示す情報と共に、認識結果データを端末装置４００に出力する。言い換えれば、出力部２６２は、教師データの候補とされた認識結果データと、教師データの候補から除外された認識結果データとを、端末装置４００に出力する。 The output unit 262 outputs the recognition result data to the terminal device 400 together with information indicating the result of the determination by the determination unit 261. In other words, the output unit 262 outputs to the terminal device 400 both the recognition result data selected as teacher data candidates and the recognition result data excluded from the teacher data candidates.

生成部２６３は、端末装置４００における操作に応じて、認識結果データから教師データを生成し、教師データ記憶部２９０に格納する。具体的には、生成部２６３は、端末装置４００において認識結果データの選択が行われると、選択された認識結果データに含まれる、音声データから変換された文字列と、認識結果データと対応する音声データとを対応付けた教師データを生成する。 The generation unit 263 generates teacher data from the recognition result data in response to an operation on the terminal device 400 and stores it in the teacher data storage unit 290. Specifically, when recognition result data is selected on the terminal device 400, the generation unit 263 generates teacher data that associates the character string converted from the audio data, which is included in the selected recognition result data, with the audio data corresponding to that recognition result data.
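The pairing performed by the generation unit 263 can be sketched as follows. This is a minimal illustration, assuming that recognition results are dictionaries with `utterance_id` and `text` keys and that audio clips are available as bytes keyed by utterance ID; all names here (`TeacherSample`, `build_teacher_data`, `audio_by_id`) are hypothetical and not taken from the source.

```python
from typing import Dict, Iterable, List, TypedDict


class TeacherSample(TypedDict):
    """One teacher data item: audio paired with its transcribed text."""
    utterance_id: str
    audio: bytes
    text: str


def build_teacher_data(
    results: Iterable[Dict[str, str]],
    audio_by_id: Dict[str, bytes],
    selected_ids: Iterable[str],
) -> List[TeacherSample]:
    """Pair the text of each user-selected result with its audio via the utterance ID."""
    selected = set(selected_ids)
    samples: List[TeacherSample] = []
    for result in results:
        uid = result["utterance_id"]
        # Skip results the user did not select or that have no matching audio.
        if uid in selected and uid in audio_by_id:
            samples.append(
                {"utterance_id": uid, "audio": audio_by_id[uid], "text": result["text"]}
            )
    return samples


# Hypothetical data: two recognition results, of which the user selects one.
demo_results = [
    {"utterance_id": "u1", "text": "hello"},
    {"utterance_id": "u2", "text": "uh-huh"},
]
demo_audio = {"u1": b"\x00\x01", "u2": b"\x02\x03"}
demo_samples = build_teacher_data(demo_results, demo_audio, ["u1"])
```

Keying both stores by utterance ID mirrors the association described for the recognition result data storage unit 280 and the audio data storage unit 270.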

通信制御部２６５は、情報処理装置２００と外部装置との通信を制御する。具体的には、通信制御部２６５は、情報処理装置２００と端末装置４００との通信を制御する。 The communication control unit 265 controls communication between the information processing device 200 and external devices. Specifically, the communication control unit 265 controls communication between the information processing device 200 and the terminal device 400.

次に、端末装置４００の機能構成について説明する。端末装置４００は、入力受付部４５０、通信制御部４６０、表示制御部４７０を含む。 Next, the functional configuration of the terminal device 400 will be explained. Terminal device 400 includes an input reception section 450, a communication control section 460, and a display control section 470.

入力受付部４５０は、端末装置４００に対する各種の入力を受け付ける。具体的には、入力受付部４５０は、端末装置４００に表示された教師データの候補に対する選択を受け付ける。通信制御部４６０は、端末装置４００と外部装置との通信を制御する。表示制御部４７０は、端末装置４００のディスプレイ４１８における各種の表示を制御する。具体的には、表示制御部４７０は、ディスプレイ４１８に、教師データの候補とされた認識結果データと、教師データの候補から除外された認識結果データとを含む一覧画面を表示させる。 The input receiving unit 450 receives various inputs to the terminal device 400. Specifically, the input accepting unit 450 accepts selections from teacher data candidates displayed on the terminal device 400. Communication control unit 460 controls communication between terminal device 400 and external devices. The display control unit 470 controls various displays on the display 418 of the terminal device 400. Specifically, the display control unit 470 causes the display 418 to display a list screen including the recognition result data that are candidates for teacher data and the recognition result data that are excluded from the candidates for teacher data.

次に、図１５を参照して、本実施形態の認識結果データ記憶部２８０について説明する。図１５は、認識結果データ記憶部の一例を示す図である。 Next, with reference to FIG. 15, the recognition result data storage unit 280 of this embodiment will be described. FIG. 15 is a diagram showing an example of a recognition result data storage section.

本実施形態の認識結果データ記憶部２８０に格納された認識結果データは、情報の項目として、発話ＩＤ、開始時刻、終了時刻、発話内容を含み、項目「発話ＩＤ」と、項目「開始時刻」、「終了時刻」、「テキスト」が対応付けられている。 The recognition result data stored in the recognition result data storage unit 280 of this embodiment includes, as information items, an utterance ID, a start time, an end time, and the utterance content; the item "utterance ID" is associated with the items "start time", "end time", and "text".

項目「発話ＩＤ」の値は、開始時刻と終了時刻により特定される発話区間に取得された音声データを特定するための識別情報である。 The value of the item "utterance ID" is identification information for specifying the audio data acquired in the speech section specified by the start time and end time.

項目「開始時刻」、「終了時刻」の値は、それぞれ、発話区間の開始時刻と、発話区間の終了時刻とを示す。項目「テキスト」の値は、発話ＩＤによって特定される音声データに対して、音声認識モデル２５３が音声認識処理を行って取得した文字列である。言い換えれば、項目「テキスト」の値は、音声データから変換された文字列を示す。 The values of the items "start time" and "end time" indicate the start time and end time of the speech section, respectively. The value of the item "text" is a character string obtained by the speech recognition model 253 performing speech recognition processing on the speech data specified by the utterance ID. In other words, the value of the item "text" indicates a character string converted from audio data.

なお、認識結果データ記憶部２８０では、各認識結果データに対し、判断部２６１による判断結果を示す情報が付与されてよい。 Note that in the recognition result data storage section 280, information indicating the determination result by the determination section 261 may be added to each recognition result data.
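The record layout described for FIG. 15 could be modeled as shown below. This is a sketch under stated assumptions: the field names, the use of seconds for times, and the `is_candidate` field (standing in for the optional determination result) are illustrative choices, not the actual schema of the storage unit.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RecognitionResult:
    """One row of recognition result data (hypothetical field names)."""
    utterance_id: str   # item "utterance ID"
    start_time: float   # item "start time", in seconds (assumed unit)
    end_time: float     # item "end time", in seconds (assumed unit)
    text: str           # item "text": string converted from the audio
    is_candidate: Optional[bool] = None  # determination result; None until judged

    @property
    def duration(self) -> float:
        """Length of the utterance section in seconds."""
        return self.end_time - self.start_time


# A simple in-memory store keyed by utterance ID.
store: Dict[str, RecognitionResult] = {}
record = RecognitionResult("u1", start_time=1.5, end_time=4.0, text="hello")
store[record.utterance_id] = record

# Attach the determination result afterwards, as the storage unit may do.
store["u1"].is_candidate = True
```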

次に、図１６を参照して、本実施形態の情報処理装置２００の処理について説明する。図１６は、情報処理装置の処理を説明するフローチャートである。 Next, with reference to FIG. 16, processing of the information processing apparatus 200 of this embodiment will be described. FIG. 16 is a flowchart illustrating the processing of the information processing device.

本実施形態の情報処理装置２００は、音声認識部２５０の取得部２５１により、音声データを取得する（ステップＳ１６０１）。続いて、音声認識部２５０は、区間検出定部２５２により、発話区間を検出する（ステップＳ１６０２）。続いて、音声認識部２５０は、音声認識モデル２５３による音声認識処理により、検出された発話区間と対応する音声データから文字列を取得する（ステップＳ１６０３）。なお、この時点で、発話ＩＤ、発話区間の開始時刻及び終了時刻、音声データから変換された文字列とが対応付けられた認識結果データが認識結果データ記憶部２８０に格納されてよい。 The information processing device 200 of this embodiment acquires audio data using the acquisition unit 251 of the speech recognition section 250 (step S1601). Subsequently, the speech recognition section 250 detects utterance sections using the section detection section 252 (step S1602). Next, the speech recognition section 250 performs speech recognition processing with the speech recognition model 253 to obtain a character string from the audio data corresponding to each detected utterance section (step S1603). At this point, recognition result data that associates the utterance ID, the start time and end time of the utterance section, and the character string converted from the audio data may be stored in the recognition result data storage unit 280.

なお、図１６のステップＳ１６０１からステップＳ１６０３までの処理は、図１６のステップＳ１６０４以降の処理とは別に、独立したタイミングで実行されてもよい。言い換えれば、本実施形態では、音声認識部２５０の処理は、ステップＳ１６０４以降に示す生成支援部２６０の処理が実行される前に実行されていればよい。 Note that the processing from step S1601 to step S1603 in FIG. 16 may be executed at an independent timing, separately from the processing from step S1604 onward in FIG. 16. In other words, in this embodiment, the processing by the speech recognition section 250 only needs to be completed before the processing by the generation support section 260 from step S1604 onward is executed.

次に、情報処理装置２００は、生成支援部２６０の判断部２６１により、発話区間における発話を抽出する（ステップＳ１６０４）。言い換えれば、判断部２６１は、認識結果データに含まれる発話ＩＤで特定される発話区間における音声データを抽出する。 Next, the information processing device 200 uses the determination unit 261 of the generation support unit 260 to extract the utterance in the utterance section (step S1604). In other words, the determination unit 261 extracts audio data in the utterance section specified by the utterance ID included in the recognition result data.

続いて、判断部２６１は、抽出された発話がメインの話者の発話であるか否かを判断する（ステップＳ１６０５）。具体的には、判断部２６１は、抽出された音声データの音量が、他の発話区間の音量よりも小さい場合に、発話がメインの話者の発話であると判断してよい。なお、音声データの音量の大小は相対的であることが考えられるが、絶対的であってもよい。 Next, the determining unit 261 determines whether the extracted utterance is the utterance of the main speaker (step S1605). Specifically, the determination unit 261 may determine that the utterance is the utterance of the main speaker when the volume of the extracted audio data is lower than the volume of other utterance sections. Note that although the volume of the audio data may be relative, it may be absolute.

ステップＳ１６０５において、抽出された発話がメインの話者の発話でない場合、情報処理装置２００は、後述するステップＳ１６１０へ進む。 In step S1605, if the extracted utterance is not the utterance of the main speaker, the information processing device 200 proceeds to step S1610, which will be described later.

ステップＳ１６０５において、抽出された発話がメインの話者の発話であると判断された場合、判断部２６１は、発話が単独発話であるか否かを判断する（ステップＳ１６０６）。つまり、判断部２６１は、ステップＳ１６０６において、抽出された発話区間の発話が、特定の条件のうちの条件１を満たすか否かを判断する。 If it is determined in step S1605 that the extracted utterance is the utterance of the main speaker, the determination unit 261 determines whether the utterance is a solo utterance (step S1606). That is, in step S1606, the determining unit 261 determines whether the utterance in the extracted utterance section satisfies Condition 1 of the specific conditions.

具体的には、判断部２６１は、音声データに対する音声認識処理を行ったときの確信度に基づき、発話が単独発話であるか否かを判断してよい。例えば、発話が重複発話である場合、音声が不明瞭になり確信度が下がる。このため、本実施形態では、確信度が所定の閾値より高い場合に、この音声データが示す発話が単独発話であると判断し、確信度が閾値未満である場合に、この音声データが示す発話が重複発話であると判断してよい。なお、確信度とは、予測または出力がどのくらい確実であるかの統計的な尺度を示す値であってよい。 Specifically, the determination unit 261 may determine whether or not the utterance is a solo utterance based on the confidence level obtained when speech recognition processing is performed on the audio data. For example, when utterances overlap, the speech becomes unclear and the confidence level drops. Therefore, in this embodiment, when the confidence level is higher than a predetermined threshold, the utterance indicated by the audio data may be determined to be a solo utterance, and when the confidence level is less than the threshold, the utterance may be determined to be an overlapping utterance. Note that the confidence level may be a value indicating a statistical measure of how certain a prediction or output is.
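The confidence-based check can be sketched as a simple threshold test. The sketch below assumes the recognizer exposes per-token confidences in the range 0 to 1 and that an utterance-level score is their mean; the function names, the threshold value 0.8, and the choice to treat a score exactly equal to the threshold as a solo utterance are all assumptions, since the source leaves these details open.

```python
from statistics import mean
from typing import Sequence


def utterance_confidence(token_confidences: Sequence[float]) -> float:
    """Aggregate per-token confidences into one score (assumed: arithmetic mean)."""
    return mean(token_confidences) if token_confidences else 0.0


def is_solo_utterance(confidence: float, threshold: float = 0.8) -> bool:
    """Treat the utterance as solo when confidence reaches the threshold.

    The threshold value is hypothetical; the equal-to-threshold case is
    assigned to "solo" here only for definiteness.
    """
    return confidence >= threshold


clear_speech = is_solo_utterance(utterance_confidence([0.9, 0.95]))
overlapped_speech = is_solo_utterance(utterance_confidence([0.4, 0.6]))
```

In practice the threshold would need tuning against recordings where the overlap is known.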

ステップＳ１６０６において、発話が単独発話である場合、判断部２６１は、この発話が条件１を満たすものとし、この発話区間を特定する発話ＩＤを含む認識結果データを、教師データの候補に選択し（ステップＳ１６０７）、後述するステップＳ１６１３へ進む。言い換えれば、判断部２６１は、条件１を満たす発話を第１の発話として、第１の発話の発話内容を教師データの候補に選択する。 In step S1606, if the utterance is a solo utterance, the determination unit 261 assumes that this utterance satisfies condition 1, and selects the recognition result data including the utterance ID that specifies this utterance section as a candidate for teacher data ( Step S1607), the process advances to step S1613, which will be described later. In other words, the determination unit 261 selects the utterance that satisfies Condition 1 as the first utterance, and selects the utterance content of the first utterance as a candidate for teacher data.

具体的には、判断部２６１は、この発話区間を特定する発話ＩＤを含む認識結果データに対してフラグを立てる。フラグは、認識結果データ記憶部２８０において、認識結果データと紐付けられて格納されてもよい。 Specifically, the determination unit 261 sets a flag on recognition result data that includes the utterance ID that specifies this utterance section. The flag may be stored in the recognition result data storage unit 280 in association with the recognition result data.

また、ステップＳ１６０６において、発話が単独発話ではなかった場合、判断部２６１は、この発話の中に、重複発話ではない部分が含まれるか否かを判定する（ステップＳ１６０８）。言い換えれば、判断部２６１は、この発話に単独発話となる部分が含まれるか否かを判断する。つまり、判断部２６１は、ステップＳ１６０４で抽出された発話が、条件３を満たすか否かを判断する。 Further, if the utterance is not a solo utterance in step S1606, the determination unit 261 determines whether the utterance includes a portion that does not overlap with another utterance (step S1608). In other words, the determination unit 261 determines whether the utterance includes a portion that is a solo utterance. That is, the determination unit 261 determines whether the utterance extracted in step S1604 satisfies condition 3.

ステップＳ１６０８において、単独発話となる部分が含まれない場合、つまり、ステップＳ１６０４で抽出された発話が条件３を満たさない場合、生成支援部２６０は、後述するステップＳ１６１２へ進む。 In step S1608, if a portion that is a single utterance is not included, that is, if the utterance extracted in step S1604 does not satisfy condition 3, the generation support unit 260 proceeds to step S1612, which will be described later.

ステップＳ１６０８において、単独発話となる部分が含まれる場合、つまり、ステップＳ１６０４で抽出された発話が条件３を満たす場合、判断部２６１は、ステップＳ１６０４で抽出した発話から、単独発話となる部分を抽出し（ステップＳ１６０９）、ステップＳ１６０７へ進む。言い換えれば、判断部２６１は、条件３を満たす発話を第１の発話として、第１の発話の発話内容を教師データの候補に選択する。 In step S1608, if a portion that is a solo utterance is included, that is, if the utterance extracted in step S1604 satisfies condition 3, the determination unit 261 extracts the portion that is a solo utterance from the utterance extracted in step S1604 (step S1609) and proceeds to step S1607. In other words, the determination unit 261 treats the utterance that satisfies condition 3 as the first utterance and selects the utterance content of the first utterance as a candidate for teacher data.
このとき、メインの話者の発話と重複している発話が、孤立した相槌又はフィラーの場合、単独発話の抽出を行わずに、メインの話者の発話をそのまま第1の発話とする。 At this time, if the utterance that overlaps with the main speaker's utterance is an isolated backchannel response or filler, the solo portion is not extracted, and the main speaker's utterance is used as the first utterance as it is.

なお、判断部２６１は、ステップＳ１６０６と同様に、音声認識処理を行ったときの確信度に基づき、単独発話となる部分を抽出してよい。 Note that, similarly to step S1606, the determination unit 261 may extract portions that are single utterances based on the confidence level when performing the voice recognition process.

また、ここでは、判断部２６１は、ステップＳ１６０４で抽出した発話のうち、ステップＳ１６０９で抽出された単独発話に対応する発話内容のみを、教師データの候補とする。 Further, here, the determination unit 261 selects only the utterance content corresponding to the single utterance extracted in step S1609 from among the utterances extracted in step S1604 as candidates for teacher data.

ステップＳ１６０５において、抽出された発話がメインの話者の発話でないと判断された場合、判断部２６１は、この発話が単独発話であるか否かを判断する（ステップＳ１６１０）。ステップＳ１６１０において、単独発話ではないと判断された場合、生成支援部２６０は、後述するステップＳ１６１２へ進む。 If it is determined in step S1605 that the extracted utterance is not the utterance of the main speaker, the determination unit 261 determines whether or not this utterance is a solo utterance (step S1610). If it is determined in step S1610 that the utterance is not a solo utterance, the generation support section 260 proceeds to step S1612, which will be described later.

ステップＳ１６１０において、単独発話と判断された場合、判断部２６１は、ステップＳ１６０４で抽出された発話が、孤立した相槌又はフィラーであるか否かを判断する（ステップＳ１６１１）。つまり、判断部２６１は、ステップＳ１６０４で抽出した発話が条件２を満たすか否かを判断している。また、判断部２６１は、抽出された発話と対応する音声データの特徴量に基づき、発話が孤立した相槌又はフィラーであるか否かを判断してよい。 If the utterance is determined to be a solo utterance in step S1610, the determination unit 261 determines whether the utterance extracted in step S1604 is an isolated backchannel response or filler (step S1611). In other words, the determination unit 261 determines whether the utterance extracted in step S1604 satisfies condition 2. The determination unit 261 may determine whether the utterance is an isolated backchannel response or filler based on feature values of the audio data corresponding to the extracted utterance.

ステップＳ１６１１において、発話が孤立した相槌又はフィラーではないと判断された場合、判断部２６１は、ステップＳ１６０４で抽出された発話が条件２を満たすものとして、ステップＳ１６０７へ進む。言い換えれば、判断部２６１は、条件２を満たす発話を第１の発話として、第１の発話の発話内容を教師データの候補に選択する。 If it is determined in step S1611 that the utterance is not an isolated backchannel response or filler, the determination unit 261 regards the utterance extracted in step S1604 as satisfying condition 2 and proceeds to step S1607. In other words, the determination unit 261 treats the utterance that satisfies condition 2 as the first utterance and selects the utterance content of the first utterance as a candidate for teacher data.

ステップＳ１６１１において、発話が孤立した相槌又はフィラーであると判断された場合、判断部２６１は、ステップＳ１６０４で抽出された発話は特定の条件を満たさないものとし、この発話と対応する認識結果データを、教師データの候補から除外し（ステップＳ１６１２）、後述するステップＳ１６１３へ進む。言い換えれば、判断部２６１は、抽出された発話を、特定の条件を満たしていない第２の発話とする。 If it is determined in step S1611 that the utterance is an isolated backchannel response or filler, the determination unit 261 regards the utterance extracted in step S1604 as not satisfying the specific conditions, excludes the recognition result data corresponding to this utterance from the teacher data candidates (step S1612), and proceeds to step S1613, which will be described later. In other words, the determination unit 261 treats the extracted utterance as a second utterance that does not satisfy the specific conditions.

情報処理装置２００は、ステップＳ１６０２で検出された全ての発話区間について、ステップＳ１６０４からステップＳ１６１２までの処理を実行したか否かを判定する（ステップＳ１６１３）。ステップＳ１６１３において、全ての発話区間に対して処理が実行されていない場合、情報処理装置２００は、ステップＳ１６０４へ戻る。 The information processing device 200 determines whether the processes from step S1604 to step S1612 have been executed for all the speech sections detected in step S1602 (step S1613). In step S1613, if the process has not been executed for all speech sections, the information processing device 200 returns to step S1604.

ステップＳ１６１３において、全ての発話区間に対して処理が実行されていた場合、出力部２６２は、端末装置４００に対して、認識結果データと、判断部２６１による判断の結果とを端末装置４００に出力し（ステップＳ１６１４）、処理を終了する。 If the processing has been executed for all utterance sections in step S1613, the output unit 262 outputs the recognition result data and the result of the determination by the determination unit 261 to the terminal device 400 (step S1614), and the processing ends.
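The branching in steps S1605 to S1612 can be summarized as a small decision function. This is a simplified sketch, assuming each test (main speaker, solo utterance, presence of a solo portion, backchannel or filler) has already been evaluated and is supplied as a boolean; the class and field names are hypothetical, and the special case in which a main speaker's utterance overlaps only an isolated backchannel or filler is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CANDIDATE = "candidate"                  # whole utterance becomes a candidate
    CANDIDATE_PARTIAL = "candidate_partial"  # only the solo portion becomes a candidate
    EXCLUDED = "excluded"                    # removed from the candidates


@dataclass
class Utterance:
    """Pre-computed judgments for one utterance section (hypothetical fields)."""
    utterance_id: str
    is_main_speaker: bool
    is_solo: bool
    has_solo_part: bool = False
    is_backchannel_or_filler: bool = False


def judge(u: Utterance) -> Decision:
    if u.is_main_speaker:                      # S1605
        if u.is_solo:                          # S1606: condition 1
            return Decision.CANDIDATE          # S1607
        if u.has_solo_part:                    # S1608: condition 3
            return Decision.CANDIDATE_PARTIAL  # S1609, then S1607
        return Decision.EXCLUDED               # S1612
    if not u.is_solo:                          # S1610
        return Decision.EXCLUDED               # S1612
    if u.is_backchannel_or_filler:             # S1611
        return Decision.EXCLUDED               # S1612
    return Decision.CANDIDATE                  # condition 2, then S1607


# Sections modeled on the FIG. 11 walkthrough (boolean values inferred from the text).
fig11 = [
    Utterance("K10", is_main_speaker=True, is_solo=True),
    Utterance("K11", is_main_speaker=False, is_solo=True, is_backchannel_or_filler=True),
    Utterance("K14", is_main_speaker=True, is_solo=False, has_solo_part=True),
    Utterance("K15", is_main_speaker=False, is_solo=False),
    Utterance("K16", is_main_speaker=True, is_solo=True, is_backchannel_or_filler=True),
]
decisions = {u.utterance_id: judge(u) for u in fig11}  # loop corresponds to S1613
```

Note that K16 remains a candidate even though it is a backchannel response, because the backchannel check is only reached for speakers other than the main speaker.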

Note that in the example of FIG. 16, whether the utterance indicated by the voice data corresponding to each utterance section satisfies the specific conditions is determined after all the utterance sections included in the voice data have been detected, but the order of processing is not limited to this. In the present embodiment, for example, each time an utterance section is detected in the voice data, it may be determined whether the utterance indicated by the voice data corresponding to that section satisfies the specific conditions.

Also, in FIG. 16, whether the extracted utterance is a solo utterance is determined after determining whether it is an utterance by the main speaker, but the order of processing is not limited to this. For example, in the present embodiment, after an utterance is extracted, whether it is a solo utterance may be determined first, followed by whether it is an utterance by the main speaker.

Furthermore, the present embodiment has been described as treating the recognition result data corresponding to an utterance as a training data candidate when the utterance satisfies the specific conditions. Alternatively, the recognition result data of the corresponding utterance section may be treated as a training data candidate whenever the utterance is by the main speaker. In other words, regardless of whether the utterance overlaps another utterance or is an isolated backchannel response or filler, the content of an utterance by the main speaker may be treated as a training data candidate.

Next, a display example on the terminal device 400 will be described with reference to FIG. 17. FIG. 17 is a diagram showing a display example of training data candidates.

A screen 171 shown in FIG. 17 is an example of a screen (first screen) displayed on the terminal device 400 based on the data output to the terminal device 400 in step S1614 of FIG. 16. Note that the screen 171 may instead be displayed on the display 206 of the information processing device 200.

The screen 171 includes display areas 172, 173, and 174. In the display area 172, recognition result data and flags are displayed in association with each other. A flag is information indicating the result of the determination by the determination unit 261.

The display area 173 displays information indicating whether each item of recognition result data displayed in the display area 172 has been selected by the user of the terminal device 400. The display area 174 displays operation buttons for changing the page displayed on the screen 171.

In the display area 172, the flag "1" is assigned to the recognition result data containing the utterance IDs "0010", "0012", "0014", and "0016". In the example of FIG. 17, this shows that the content of the utterances (first utterances) identified by the utterance IDs "0010", "0012", "0014", and "0016" has been treated as training data candidates.

Note that the utterance identified by the utterance ID "0014" is an utterance by the main speaker that partly overlaps another utterance but also contains a solo portion. Therefore, in the display area 172, for the utterance identified by the utterance ID "0014", the content of the non-overlapping portion (the solo portion) is treated as a training data candidate.

In the display area 172, the flag "1" is not assigned to the recognition result data containing the utterance IDs "0011", "0013", "0015", and "0017". In the example of FIG. 17, this shows that the content of the utterances (second utterances) identified by the utterance IDs "0011", "0013", "0015", and "0017" has been excluded from the training data candidates.

In this way, the present embodiment lets the user of the terminal device 400 see which recognition result data are training data candidates and which have been excluded from the candidates.

In the display area 173, "○" is displayed in association with the recognition result data containing the utterance IDs "0010", "0012", "0014", and "0016". In the example of FIG. 17, this shows that the user of the terminal device 400 has selected the content of the utterances (first utterances) identified by the utterance IDs "0010", "0012", "0014", and "0016" as training data.

Note that in the example of FIG. 17 the content of the first utterances, which are training data candidates, is selected as training data, but on the screen 171 the content of a second utterance, which was excluded from the candidates, can also be selected as training data.

On the screen 171, when the user of the terminal device 400 selects an utterance ID contained in the recognition result data displayed in the display area 172, the voice data of the utterance section identified by that utterance ID may be played back. Playing back the voice data lets the user confirm, with a simple operation, whether the utterance content of a section treated as a training data candidate is correct. The user can also select training data after confirming whether the character string obtained from the voice data is correct.

In the present embodiment, the screen 171 may be displayed with the utterance sections whose recognition result data has the flag "1" already selected in the display area 173 (marked with "○"). In this case, the selection may be cancelled in the display area 173 by a user operation. Alternatively, the screen 171 may be displayed with those utterance sections unselected in the display area 173 (not marked with "○").

In the present embodiment, when the user of the terminal device 400 selects a character string converted from voice data on the screen 171, an editing screen for correcting the selected character string may be displayed. This allows the user to correct a character string converted from voice data if it contains an error.

The screen 171 may also be provided with an operation button for instructing the information processing device 200 to generate training data. When this operation button is selected on the screen 171 after the recognition result data to be used as training data has been selected, the information processing device 200 may generate training data using the utterance content contained in the recognition result data selected by the user.

Here, the generation of training data by the generation support unit 260 of the information processing device 200 will be described. When recognition result data to be used as training data is selected, the generation support unit 260 uses the generation unit 263 to acquire, from the recognition result data storage unit 280, the character string converted from voice data that is contained in the selected recognition result data. The generation unit 263 also acquires, from the voice data storage unit 270, the voice data associated with the utterance ID contained in the selected recognition result data. The generation unit 263 then generates training data with the acquired voice data as input data and the character string as correct-answer data, and stores it in the training data storage unit 290.
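As a rough illustration of this pairing step, the following sketch builds (voice data, character string) pairs from the selected utterance IDs. The in-memory dictionaries stand in for the recognition result data storage unit 280 and the voice data storage unit 270, and all class and function names are assumptions introduced for the example rather than names used by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    utterance_id: str
    text: str  # character string converted from the voice data


@dataclass
class TrainingExample:
    audio: bytes  # input data: voice data of the utterance section
    text: str     # correct-answer data: the (possibly corrected) string


def build_training_data(
    selected_ids: list[str],
    recognition_store: dict[str, RecognitionResult],  # stand-in for unit 280
    audio_store: dict[str, bytes],                    # stand-in for unit 270
) -> list[TrainingExample]:
    """Pair each selected utterance's voice data with its character string."""
    examples = []
    for uid in selected_ids:
        result = recognition_store[uid]  # look up the recognized string
        audio = audio_store[uid]         # look up voice data by utterance ID
        examples.append(TrainingExample(audio=audio, text=result.text))
    return examples
```

The returned list corresponds to what would be stored in the training data storage unit 290; keying both stores by utterance ID mirrors the association described in the text.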

In this way, in the present embodiment, only utterances that satisfy the specific conditions are presented to the user of the terminal device 400 as training data candidates, and training data is generated from the candidates the user selects, so highly accurate training data can be created. In addition, the user can generate training data simply by selecting from the presented candidates, which supports the generation of training data.

Next, another display example for outputting training data candidates will be described with reference to FIG. 18. FIG. 18 is another diagram showing a display example of training data candidates.

A screen 181 shown in FIG. 18 includes display areas 182, 183, and 174 and an operation button 184. The display area 182 displays the waveform of the voice data on which voice recognition processing has been performed. The display area 183 displays the character strings converted from the voice data, section by section, in association with the waveform of the voice data.

In the display area 183 of the present embodiment, the utterance content contained in recognition result data to which the flag "1" is assigned may be highlighted. In other words, in the display area 183, the display mode of the utterance content may differ depending on whether the recognition result data is a training data candidate or has been excluded from the candidates.

Specifically, in the display area 183, the utterance content corresponding to the utterance sections K10, K12, K14, and K17 is highlighted. The user of the terminal device 400 can therefore recognize that the utterance content corresponding to the utterance sections K10, K12, K14, and K17 is the utterance content of recognition result data selected as training data candidates.

Note that because part of the utterance in the utterance section K14 overlaps another utterance, only the solo portion, "email notification", is treated as a training data candidate, and of the character string converted from the voice data only "email notification" is highlighted.
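One way to find the solo portion of a partly overlapping utterance, as in the utterance section K14, is to subtract the other speakers' time intervals from the main speaker's interval. The sketch below is a minimal illustration that assumes each utterance is available as a `(start, end)` pair in seconds; the function name is hypothetical, and mapping the remaining time ranges back to words such as "email notification" would additionally require word-level timestamps, which the source does not describe.

```python
def non_overlapping_parts(
    main: tuple[float, float],
    others: list[tuple[float, float]],
) -> list[tuple[float, float]]:
    """Return the sub-intervals of `main` not covered by any interval in `others`."""
    main_start, main_end = main
    parts = []
    cursor = main_start  # start of the not-yet-covered part of `main`
    for start, end in sorted(others):
        if end <= cursor or start >= main_end:
            continue  # this interval does not touch the remaining part
        if start > cursor:
            parts.append((cursor, start))  # gap before the overlap is solo
        cursor = max(cursor, end)
        if cursor >= main_end:
            break
    if cursor < main_end:
        parts.append((cursor, main_end))  # trailing solo portion
    return parts
```

For example, under these assumptions `non_overlapping_parts((10.0, 14.0), [(9.0, 11.5)])` leaves the range from 11.5 to 14.0 seconds as the solo portion, and an utterance that is fully covered by another yields an empty list.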

In the display area 183, the display mode of the utterance content may also be varied according to the type of utterance, such as whether the utterance in the section is a solo utterance, an overlapping utterance, or an isolated backchannel response or filler. The display mode may likewise be varied according to whether the utterance is by the main speaker.

In the example of FIG. 18, the utterance in the utterance section K16 is by the main speaker and is an isolated backchannel response. The utterance in the utterance section K17 is by a speaker other than the main speaker and is an isolated backchannel response. The utterance content of the sections K16 and K17 is therefore displayed in different modes. Varying the display mode in this way lets the user of the terminal device 400 grasp the type of each utterance.

In the display area 183, the utterance content may be selectable for each utterance section. In this case, when the user selects utterance content in the display area 183, the utterance section is enclosed in a frame so that the user can see that it has been selected for use as training data.

On the screen 181, selecting a character string converted from voice data may display an editing screen for correcting the character string. This makes it possible to correct a character string displayed in the display area 183 if it contains an error. Note that the operation for selecting a character string to display the editing screen is preferably different from the operation for selecting utterance content to be used as training data.

Next, a method of acquiring voice data in which the utterances of a plurality of people are recorded will be described with reference to FIG. 19.

FIG. 19 is a diagram illustrating a method of acquiring voice data. Note that the example shown in FIG. 19 is only one such method, and voice data in which the utterances of a plurality of people are recorded may be acquired by other methods.

FIG. 19 shows a case in which voice data is recorded during a meeting. Specifically, in FIG. 19, a tabletop microphone 500 placed on a table 110 in a conference room R1 collects the utterances of each of the meeting participants P1 to P6 as voice data.

The tabletop microphone 500 may be a general sound collection device, and may include a storage device that stores the collected voice data and a communication device that transmits the voice data to the information processing device 200.

The voice data collected by the tabletop microphone 500 is transmitted to the information processing device 200 and subjected to voice recognition processing by the voice recognition unit 250.

Here, the tabletop microphone 500 is placed at the center of the table 110 installed in the conference room R1, and may be located at least a predetermined distance away from the mouths of the participants P1 to P6.

Therefore, the voice data acquired by the tabletop microphone 500 is voice data acquired in the far field.

In the present embodiment, voice data containing the utterances of a plurality of people is acquired in this way, and voice recognition processing is performed on this voice data.

As described above, in the present embodiment, voice recognition is performed to convert voice data into a character string for each utterance section, and the resulting recognition result data is stored in the recognition result data storage unit 280 together with the result of determining whether the utterance in each utterance section satisfies the specific conditions.

As a result, a worker who creates training data only needs to check the voice recognition results for each utterance section held in the recognition result data storage unit 280, which reduces the workload. The present embodiment can therefore reduce the cost of generating training data.

Furthermore, because the present embodiment allows training data to be generated efficiently, sufficient training data can be provided for training the voice recognition model 253, and the accuracy of voice recognition by the voice recognition model 253 can be improved.

By supporting the generation of training data in this way, the present embodiment makes it easy to generate highly accurate training data and can contribute to improving the accuracy of voice recognition based on machine learning.

Here, the voice recognition model 253 of the present embodiment will be described. The voice recognition model 253 may be configured with a DNN (Deep Neural Network) or the like, and may be an end-to-end model.

An end-to-end model converts input speech directly into text through a single neural network. Compared with a conventional speech recognition model, which combines separately optimized components such as an acoustic model, a language model, and a pronunciation dictionary, an end-to-end model has a simpler structure and therefore offers advantages such as ease of implementation and fast response.

Furthermore, compared with a conventional speech recognition model in which multiple components are optimized individually, an end-to-end model can learn efficiently from voice data that is ungrammatical and highly variable, like spoken language. Voice data acquired in the far field is one example of such data.

Therefore, the training data generated by the method of the present embodiment is useful for training when the voice recognition model 253 is an end-to-end model.

In addition, conventional speech recognition models often include a front end that performs acoustic processing (such as noise cancellation) as a preceding stage, whereas an end-to-end model can readily be trained on voice data that still contains noise, without noise cancellation.

Therefore, when the voice recognition model 253 is an end-to-end model, training it with the training data generated by the method of the present embodiment can improve the accuracy of voice recognition.

Each function of the embodiments described above can be implemented by one or more processing circuits. In this specification, a "processing circuit" includes a processor programmed by software to execute each function, such as a processor implemented by electronic circuitry, as well as devices designed to execute the functions described above, such as an ASIC (Application Specific Integrated Circuit), a DSP (digital signal processor), an FPGA (field programmable gate array), and conventional circuit modules.

The group of devices described in the embodiments represents only one of a plurality of computing environments for implementing the embodiments disclosed in this specification.

In one embodiment, the information processing device 200 includes a plurality of computing devices, such as a server cluster. The plurality of computing devices are configured to communicate with one another over any type of communication link, including a network or shared memory, and carry out the processing disclosed in this specification. Similarly, the information processing device 200 may include a plurality of computing devices configured to communicate with one another.

Furthermore, the information processing system 100 can be configured to share the disclosed processing steps in various combinations. For example, a process executed by the information processing device 200 may be executed by another information processing device. Similarly, the functions of the information processing device 200 can be executed by another information processing device. The elements of the information processing device and of the other information processing device may be combined into a single information processing device or divided among a plurality of devices.

The present invention has been described above on the basis of the embodiments, but the present invention is not limited to the requirements shown in those embodiments. These points can be changed without departing from the gist of the present invention and can be determined appropriately according to the form of application.

Aspects of the present invention are, for example, as follows.
<1>
An information processing device comprising:
an acquisition unit that acquires voice data;
a voice recognition unit that detects, from the voice related to the voice data, an utterance section, which is a section in which an utterance was made;
a determination unit that determines whether the utterance in the detected utterance section satisfies one or more conditions set in advance for outputting training data candidates; and
an output unit that outputs, as a candidate for the training data, the content of a first utterance in the utterance section determined by the determination unit to satisfy the conditions.
<2>
The information processing device according to <1>, wherein the output unit outputs a first screen that displays the content of the first utterance as a candidate for the training data.
<3>
The information processing device according to <2>, wherein the content of the first utterance is selectably displayed on the first screen as the content of an utterance to be used for the training data to be generated, and
the information processing device further comprises a generation unit that generates training data based on the content of the first utterance selected by the user on the first screen.
<4>
The information processing device according to <2> or <3>, wherein the first screen further includes the content of a second utterance in the utterance section determined by the determination unit not to satisfy the conditions, the content of the second utterance being selectable as utterance content to be used for the training data, and
the generation unit generates the training data based on the content of the second utterance when the user selects the content of the second utterance on the first screen.
<5>
The information processing device according to <4>, wherein the first screen identifiably displays that, of the content of the first utterance and the content of the second utterance, the content of the first utterance is a candidate for the training data.
<6>
The information processing device according to <4> or <5>, wherein the first screen is a screen on which the voice data corresponding to the content of the first utterance and to the content of the second utterance can be played back.
<7>
The information processing device according to any one of <1> to <6>, wherein the content of the utterance in the utterance section includes at least the voice data corresponding to the utterance section and a character string converted from that voice data.
<8>
The information processing device according to any one of <1> to <7>, wherein the voice data is voice data of a conversation among a plurality of speakers including a main speaker.
<9>
The information processing device according to <8>, wherein the one or more conditions include that the utterance is by the main speaker.
<10>
The information processing device according to <8> or <9>, wherein the one or more conditions include that the utterance is by the main speaker and does not overlap in time with another utterance.
<11>
The information processing device according to any one of <8> to <10>, wherein the one or more conditions include that the utterance is not an utterance of the main speaker and is not an utterance consisting only of backchannel responses or fillers.
<12>
The information processing device according to any one of <8> to <11>, wherein the one or more conditions include that the utterance is an utterance of the main speaker that partly overlaps in time with another utterance and that has a portion that does not overlap in time with another utterance.
<13>
The information processing device according to <12>, wherein, when the determination unit determines that the utterance in the detected utterance section is an utterance of the main speaker that partly overlaps in time with another utterance and has a portion that does not overlap in time with another utterance, the output unit outputs, as a candidate for the training data, the content of the portion of the main speaker's utterance that does not overlap in time with the other utterance.
<14>
The information processing device according to any one of <1> to <13>, wherein the one or more conditions are set by a user.
<15>
An information processing method performed by an information processing device, the method comprising:
acquiring voice data;
detecting, from the voice related to the voice data, an utterance section, which is a section in which an utterance was made;
determining whether the utterance in the detected utterance section satisfies one or more conditions set in advance for outputting training data candidates; and
outputting, as a candidate for the training data, the content of a first utterance in the utterance section determined to satisfy the conditions.
<16>
A program that causes an information processing device to execute processing comprising:
acquiring voice data;
detecting, from the voice related to the voice data, an utterance section, which is a section in which an utterance was made;
determining whether the utterance in the detected utterance section satisfies one or more conditions set in advance for outputting training data candidates; and
outputting, as a candidate for the training data, the content of a first utterance in the utterance section determined to satisfy the conditions.

100 Information processing system
200 Information processing device
250 Voice recognition unit
260 Generation support unit
261 Determination unit
262 Output unit
263 Generation unit
280 Recognition result data storage unit
290 Training data storage unit
400 Terminal device

Japanese Unexamined Patent Application Publication No. 2017-045027

Claims

An information processing device comprising:
an acquisition unit that acquires audio data;
a speech recognition unit that detects, from the audio related to the audio data, an utterance section, which is a section in which an utterance has been made;
a determination unit that determines whether the utterance in the detected utterance section satisfies one or more preset conditions for outputting candidates for teacher data; and
an output unit that outputs, as a candidate for the teacher data, the content of a first utterance in the utterance section determined by the determination unit to satisfy the conditions.

The information processing device according to claim 1, wherein the output unit outputs a first screen that displays the content of the first utterance as a candidate for the teacher data.

The information processing device according to claim 2, wherein the content of the first utterance is selectably displayed on the first screen as the content of an utterance to be used for the teacher data to be generated, and
the information processing device further comprises a generation unit that generates the teacher data based on the content of the first utterance selected by a user on the first screen.

The information processing device according to claim 3, wherein the first screen further displays the content of a second utterance in the utterance section determined by the determination unit not to satisfy the conditions, such that the content of the second utterance is also selectable as the content of an utterance to be used for the teacher data, and
the generation unit generates the teacher data based on the content of the second utterance when the content of the second utterance is selected by the user on the first screen.

The information processing device according to claim 4, wherein the first screen visibly indicates that, of the content of the first utterance and the content of the second utterance, the content of the first utterance is a candidate for the teacher data.

The information processing device according to claim 4, wherein the first screen allows playback of the audio data corresponding to the content of the first utterance and to the content of the second utterance.

The information processing device according to claim 1, wherein the content of the utterance in the utterance section includes at least audio data corresponding to the utterance section and a character string obtained by converting the audio data.

The information processing device according to claim 1, wherein the audio data is audio data of a conversation among a plurality of speakers including a main speaker.

The information processing device according to claim 8, wherein the one or more conditions include that the utterance is an utterance of the main speaker.

The information processing device according to claim 8, wherein the one or more conditions include that the utterance is an utterance of the main speaker and does not overlap temporally with any other utterance.

The information processing device according to claim 8, wherein the one or more conditions include that the utterance is not an utterance of the main speaker and is not an utterance consisting only of compliments or fillers.

The information processing device according to claim 8, wherein the one or more conditions include that the utterance is an utterance of the main speaker, part of which overlaps temporally with another utterance and part of which does not overlap temporally with any other utterance.

The information processing device according to claim 12, wherein, when the determination unit determines that the utterance in the detected utterance section is an utterance of the main speaker, part of which overlaps temporally with another utterance and part of which does not, the output unit outputs, as a candidate for the teacher data, the content of the part of the main speaker's utterance that does not overlap temporally with the other utterance.

The information processing device according to claim 1, wherein the one or more conditions are set by a user.

An information processing method performed by an information processing device, the method comprising:
acquiring audio data;
detecting, from the audio related to the audio data, an utterance section, which is a section in which an utterance has been made;
determining whether the utterance in the detected utterance section satisfies one or more preset conditions for outputting candidates for teacher data; and
outputting, as a candidate for the teacher data, the content of a first utterance in the utterance section determined to satisfy the conditions.

A program that causes an information processing device to execute a process comprising:
acquiring audio data;
detecting, from the audio related to the audio data, an utterance section, which is a section in which an utterance has been made;
determining whether the utterance in the detected utterance section satisfies one or more preset conditions for outputting candidates for teacher data; and
outputting, as a candidate for the teacher data, the content of a first utterance in the utterance section determined to satisfy the conditions.