JP7481894B2

JP7481894B2 - Speech text generation device, speech text generation program, and speech text generation method

Info

Publication number: JP7481894B2
Application number: JP2020083244A
Authority: JP
Inventors: 清栗原; 均伊藤; 信正清山
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-05-11
Filing date: 2020-05-11
Publication date: 2024-05-13
Anticipated expiration: 2040-05-11
Also published as: JP2021179468A

Description

The present invention relates to a speech text generation device, a speech text generation program, and a speech text generation method.

In recent years, in the fields of speech synthesis and speech recognition, methods that perform speech synthesis and speech recognition using a deep neural network (DNN) have become common.
For example, Patent Document 1 and others disclose a speech synthesis method that generates audio data from text data using a statistical model configured with a DNN.
Furthermore, Patent Document 2 and others disclose a speech recognition method that generates text data from audio data using an acoustic model or the like configured with a DNN.
Such DNN-based methods require a huge amount of training data to train the DNN model.
Conventionally, as a method for generating this training data, Patent Document 3 and others disclose a method that generates training data from the audio data and subtitle data (text data) of a broadcast program by associating audio data and text data that correspond in time.

Patent Document 1: JP 2019-219590 A
Patent Document 2: JP 2019-020597 A
Patent Document 3: Japanese Patent No. 6426971

When audio data and subtitle data (text data) are extracted in association with each other, by a conventional method, from broadcast data on which subtitle data (closed captions) is superimposed, such as a broadcast program, the following problems arise.
When a broadcast program is live, the subtitle creator types the subtitles on a keyboard after listening to the transmitted audio, so the subtitles are broadcast with a delay relative to the actual audio. Therefore, with the conventional method, a time lag occurs between the audio data and the subtitle data, and the training data cannot be generated correctly.
Furthermore, in live broadcasts, subtitles are added manually and the time lag between the audio data and the subtitle data is not constant, which makes it difficult to align and associate the audio data with the subtitle data.

Accordingly, an object of the present invention is to provide a speech text generation device, a speech text generation program, and a speech text generation method capable of generating audio data and text data for each utterance section from audio data containing a plurality of utterances and the corresponding text data.

In order to solve the above problem, the speech text generation device according to the present invention is configured to include a speech segment detection means, a speech recognition means, a matching means, a context information generation means, and a conversion means.

In this configuration, the speech text generation device uses the speech segment detection means to detect, from audio data consisting of a plurality of utterances, the delimiter positions of the section audio data for each utterance, based on acoustic features such as power.
The speech text generation device then uses the speech recognition means to perform speech recognition on each piece of section audio data in an utterance section.
The speech text generation device then uses the matching means to match the recognition result of the speech recognition means against the text data, which is the spoken content of the audio data, by a matching method such as DP matching, and thereby estimates the section text data corresponding to the time of the section audio data.

Furthermore, the speech text generation device uses the context information generation means to generate, from the section text data, context information for each phoneme that includes at least information on the phoneme and accent phrase information indicating features of the accent phrase containing the phoneme and of the accent phrases adjacent to it. This context information makes it possible to recognize the accent state of each phoneme.
The speech text generation device then uses the conversion means to convert the context information of the phoneme string into second section text data that includes characters representing the readings in the order in which the phonemes appear and predetermined characters representing prosody, which indicates the accent state. This makes it possible to generate the section text data and the second section text data corresponding to the time of the section audio data uttered by the speaker.

In order to solve the above problem, the speech text generation program according to the present invention can be realized as a program that causes a computer to function as each of the means described above.
In order to solve the above problem, the speech text generation method according to the present invention can be realized as a procedure that includes the operations of the means described above as steps.

The present invention provides the following excellent effects.
According to the present invention, even when there is a time lag between audio data consisting of a plurality of utterances and the text data corresponding to that audio data, the audio data and text data for each utterance can be extracted in association with each other.

FIG. 1 is a block diagram showing the configuration of a training data generation system including a speech text generation device according to the embodiment of the reference example.
FIG. 2 is a diagram showing an example of a selection screen for selecting a file of subtitled data on the upload terminal.
FIG. 3 is a diagram showing an example of an editing screen for correcting audio delimiter positions and text data on the editing terminal.
FIG. 4 is a flowchart showing the operation of the training data generation system according to the embodiment of the reference example.
FIG. 5 is a block diagram showing the configuration of a training data generation system including a speech text generation device according to the embodiment of the present invention.
FIG. 6 is an explanatory diagram of an example of PLP data, including reading kana and prosodic symbols, generated by the speech text generation device.
FIG. 7 is an explanatory diagram of examples of prosodic symbols.
FIG. 8 is a diagram (part 1) showing the features of each label of the context information.
FIG. 9 is a diagram (part 2) showing the features of each label of the context information.
FIG. 10 is a diagram showing an example of the format of the context information.
FIG. 11 is an explanatory diagram of the conditions for inserting prosodic symbols.
FIG. 12 is an explanatory diagram of the flow of generating PLP data from the context information.
FIG. 13 is a diagram showing an example of an editing screen for correcting audio delimiter positions and text data (PLP data) on the editing terminal.
FIG. 14 is a diagram showing an example of an editing screen for correcting audio delimiter positions and text data (mixed kana-kanji text and PLP data) on the editing terminal.
FIG. 15 is a flowchart showing the operation of the training data generation system according to the embodiment of the present invention.

Hereinafter, a reference example and an embodiment of the present invention will be described with reference to the drawings.
<<Embodiment of Reference Example>>
<Configuration of the training data generation system>
First, with reference to FIG. 1, the configuration of a training data generation system 100 according to the embodiment of the reference example will be described.

The training data generation system 100 generates, as training data, audio data in units of utterances and the text data corresponding to that audio data, for training a deep neural network (DNN) model used for speech synthesis or speech recognition.
The training data generation system 100 includes a subtitled data storage device 1, an upload terminal 2, a speech text generation device 3, and an editing terminal 4.

[Subtitled data storage device]
The subtitled data storage device 1 stores subtitled data that includes audio data consisting of a plurality of utterances and subtitle data corresponding to that audio data. The subtitled data is, for example, video and audio content in a data format such as XDCAM (registered trademark). Note that the subtitled data only needs to include at least audio data and the corresponding subtitle data, and need not include video data.
The subtitled data storage device 1 stores a plurality of pieces of subtitled data in advance, each as a single file.

[Upload terminal]
The upload terminal 2 is a client terminal that separates audio data and subtitle data (text data) from the subtitled data stored in the subtitled data storage device 1, or from subtitled data currently being broadcast (broadcast data), and transmits them to the speech text generation device 3.
The upload terminal 2 includes a file selection means 20, a file separation means 21, a broadcast data receiving means 22, and a broadcast data separation means 23.

The file selection means 20 selects, from among the files of subtitled data stored in the subtitled data storage device 1, the files from which training data is to be generated.
For example, the upload terminal 2 uses the file selection means 20 to display a file selection screen G1, as shown in FIG. 2, on a display device (not shown), and a file is selected when the operator operates an input means (not shown) such as a mouse.

The selection screen G1 shown in FIG. 2 is an example that displays columns for execution g1, identification name g2, date and time g3, file path g4, channel g5, and status g6, together with a start button B.
The execution g1 column is a checkbox column for selecting files. Here, when the execution g1 column is selected, the file selection means 20 displays a check mark to indicate that the file has been selected.
The identification name g2 column displays a name that identifies the subtitled data, for example, the file name of the subtitled data.
The date and time g3 column displays time information of the subtitled data. This time information is the date and time when the subtitled data was recorded (as audio or video), or the date and time when it was stored in the subtitled data storage device 1.
The file path g4 column displays the file path in the subtitled data storage device 1 where the subtitled data is stored.

The channel g5 column specifies a channel number. For example, if the subtitled data is XDCAM, the audio channel to be extracted is selected from a maximum of eight channels.
The status g6 column displays the upload status of the selected subtitled data. In this example, it shows that the file has only been selected and has not yet been uploaded (not sent). The status g6 column is updated to "transmission completed" once the file separation means 21, described later, has separated the audio data and subtitle data and uploaded them to the speech text generation device 3.
The start button B instructs the upload of the selected files. After files have been selected, the upload terminal 2 starts uploading them when the start button B is pressed using an input means such as a mouse.

Returning to FIG. 1, the description of the configuration of the upload terminal 2 continues.
The file selection means 20 reads the selected subtitled data from the subtitled data storage device 1 and outputs it to the file separation means 21.

The file separation means (separation means) 21 separates audio data and subtitle data from the subtitled data selected by the file selection means 20.
For example, when the subtitled data is XDCAM video content, the video data, audio data, and subtitle data are stored in the content in MXF (Material eXchange Format).
The file separation means 21 therefore extracts the audio stream from the MXF subtitled data and converts it into a WAV file, thereby separating the audio data.
The subtitle data is stored in the MXF content as an ARIB (Association of Radio Industries and Businesses) subtitle file.
The file separation means 21 therefore extracts the ARIB subtitle file from the MXF subtitled data and converts the ARIB subtitles into a character encoding (for example, UTF-8), thereby separating the subtitle data as text data.
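The audio half of this separation can be illustrated with a minimal sketch. The patent does not name a tool; the example below assumes the ffmpeg command-line program is used, and assumes that each audio channel of the MXF file is stored as its own audio stream, so that channel 1 maps to the stream specifier `0:a:0`. The function name `build_wav_extract_command` and its parameters are hypothetical. Extraction and decoding of the ARIB subtitles is not covered here.

```python
def build_wav_extract_command(mxf_path, wav_path, channel):
    """Build an ffmpeg argument list that writes one audio stream of an MXF file as WAV.

    Assumption (not from the source): each audio channel is a separate stream,
    so channel N corresponds to ffmpeg stream specifier 0:a:(N-1).
    """
    # The text mentions a maximum of eight audio channels for XDCAM.
    if not 1 <= channel <= 8:
        raise ValueError("channel must be between 1 and 8")
    return [
        "ffmpeg",
        "-i", mxf_path,                 # input MXF file
        "-map", f"0:a:{channel - 1}",   # select the requested audio stream
        "-vn",                          # drop video
        "-c:a", "pcm_s16le",            # 16-bit PCM, the usual WAV payload
        wav_path,
    ]
```

The returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.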

The file separation means 21 uploads the separated audio data and text data to the speech text generation device 3.
Here, the file separation means 21 transmits the audio data and the text data, in association with each other, to the speech text generation device 3 via the networks N and N_1 by a communication means (not shown).
After uploading the audio data and text data to the speech text generation device 3, the file separation means 21 updates the status g6 column of the selection screen G1 shown in FIG. 2 to "transmission completed".
This allows the operator to check the upload status of the selected file.

The broadcast data receiving means 22 receives subtitled data being broadcast by digital broadcasting (broadcast data) and demodulates it into stream data (a transport stream, TS).
For example, when a channel broadcasting subtitled broadcast data is specified externally, the broadcast data receiving means 22 analyzes the PSI/SI (Program Specific Information/Service Information) in the demodulated stream data and extracts the stream data corresponding to the specified channel.
The broadcast data receiving means 22 outputs the extracted stream data to the broadcast data separation means 23.

The broadcast data separation means (separation means) 23 separates audio data and subtitle data (text data) from the stream data received by the broadcast data receiving means 22.
The broadcast data separation means 23 extracts the audio data multiplexed into the stream data and the subtitle data, which is text data multiplexed into the stream data as closed captions.
The broadcast data separation means 23 uploads the separated audio data and text data to the speech text generation device 3.
Here, the broadcast data separation means 23 transmits the audio data and the text data, in association with each other, to the speech text generation device 3 via the networks N and N_1 by a communication means (not shown).

The configuration of the upload terminal 2 has been described above, but the upload terminal 2 is not limited to this configuration. For example, the upload terminal 2 may omit the broadcast data receiving means 22 and the broadcast data separation means 23, and be configured to separate audio data and text data from the subtitled data stored in the subtitled data storage device 1 and transmit them to the speech text generation device 3. Alternatively, the upload terminal 2 may omit the file selection means 20 and the file separation means 21, and be configured to separate audio data and text data from the broadcast data currently being broadcast and transmit them to the speech text generation device 3.

[Speech text generation device]
The speech text generation device 3 is a server that generates, as training data, the audio data of each utterance section (section audio data) and the text data corresponding to that audio data (section text data) from audio data consisting of a plurality of utterances and the text data corresponding to that audio data.
The speech text generation device 3 includes a speech text storage means 30, a speech segment detection means 31, a speech recognition means 32, and a matching means 33.

The speech text storage means 30 stores audio data consisting of a plurality of utterances in association with the text data corresponding to that audio data. The speech text storage means 30 can be configured with a general storage medium such as a hard disk.
The audio data and text data stored in the speech text storage means 30 are the data uploaded from the upload terminal 2, received via the networks N and N_1 by a communication means (not shown), and then stored.

The speech segment detection means 31 detects, from audio data consisting of a plurality of utterances, the delimiter positions of the audio data for each utterance (section audio data).
The speech segment detection means 31 detects utterance sections in the audio data stored in the speech text storage means 30, and detects a position between adjacent utterance sections (for example, the midpoint) as a delimiter position of the audio data.
A general method may be used to detect utterance sections. For example, the speech segment detection means 31 extracts power (power spectrum), an acoustic feature, from the audio data, treats a time section as an utterance section when its power is greater than a predetermined threshold, and treats all other sections as non-utterance sections.
The speech segment detection means 31 outputs the audio data and its delimiter positions to the speech recognition means 32 and the matching means 33.
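The power-threshold detection and midpoint delimiting described above can be sketched as follows. This is an illustration under stated assumptions, not the device's actual implementation: power is taken as the mean squared amplitude of non-overlapping frames, and the function names, the frame length, and the threshold value are all hypothetical.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_utterance_sections(samples, frame_len, threshold):
    """Return (start, end) sample indices of runs of frames whose power exceeds threshold."""
    sections = []
    start = None
    powers = frame_power(samples, frame_len)
    for idx, power in enumerate(powers):
        if power > threshold and start is None:
            start = idx * frame_len                      # utterance begins
        elif power <= threshold and start is not None:
            sections.append((start, idx * frame_len))    # utterance ends
            start = None
    if start is not None:                                # audio ends mid-utterance
        sections.append((start, len(powers) * frame_len))
    return sections


def delimiter_positions(sections):
    """Midpoint between the end of one utterance section and the start of the next."""
    return [
        (prev_end + next_start) // 2
        for (_, prev_end), (next_start, _) in zip(sections, sections[1:])
    ]
```

A practical detector would usually smooth the power curve or require a minimum silence duration before splitting; those refinements are omitted here.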

The speech recognition means 32 performs speech recognition on each piece of audio data delimited at the delimiter positions detected by the speech segment detection means 31 (section audio data).
A general speech recognition method may be used. The speech recognition means 32 recognizes the audio data using a language model, an acoustic model, and a pronunciation dictionary (none shown).
The speech recognition means 32 outputs the recognition result (mixed kanji-kana text) for each piece of section audio data to the matching means 33.

The matching means 33 matches the recognition result produced by the speech recognition means 32 against the text data that corresponds to the audio data and is stored in the speech text storage means 30.
The matching means 33 estimates the text data corresponding to each recognition result (section text data) by matching the recognition result against the text data in units of words or characters, for example by a matching method based on dynamic programming (DP matching). At this time, as a measure of similarity, the matching means 33 calculates the mismatch rate (matching error rate: MER) between the recognition result and the estimated section text data, which accounts for recognition errors, symbol insertions, and rewritings.
The matching means 33 treats section text data whose mismatch rate is less than a predetermined threshold as the text data corresponding to the audio data delimited at the delimiter positions (section audio data).
The matching means 33 then associates the section audio data delimited at the delimiter positions with the matched section text data.
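The matching step can be illustrated with a small character-level sketch. The patent does not give a formula for the mismatch rate, so the example assumes it is the Levenshtein edit distance divided by the length of the longer string. It also assumes the caption text has already been split into candidate sentences, which simplifies the alignment the device would perform over the whole text. All function names and the threshold value are hypothetical.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def mismatch_rate(recognized, candidate):
    """Assumed definition: edit distance normalized by the longer string's length."""
    longest = max(len(recognized), len(candidate))
    return 0.0 if longest == 0 else edit_distance(recognized, candidate) / longest


def match_section(recognized, candidates, threshold):
    """Return (text, rate) for the best candidate, or (None, rate) if none is below threshold."""
    if not candidates:
        return None, 1.0
    best = min(candidates, key=lambda c: mismatch_rate(recognized, c))
    rate = mismatch_rate(recognized, best)
    return (best, rate) if rate < threshold else (None, rate)
```

Because the comparison is made on the text itself, the result does not depend on how far the captions lag behind the audio, which is the property the device relies on.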

The matching means 33 transmits the associated section audio data and section text data to the editing terminal 4 via the networks N and N_2 by a communication means (not shown).
Note that, for section text data whose mismatch rate with the recognition result is equal to or greater than the predetermined threshold, the matching means 33 does not transmit the section text data or the corresponding section audio data to the editing terminal 4. Alternatively, the matching means 33 may transmit the section audio data to the editing terminal 4 with the section text data set to NULL data.

With the configuration described above, the speech text generation device 3 can generate, from the audio data and text data, section audio data and section text data associated with each utterance as training data. In doing so, the speech text generation device 3 can associate the section audio data, which is the spoken audio contained in the audio data, with the section text data corresponding to the subtitle data, regardless of any time lag.
Note that the speech text generation device 3 can be operated by a speech text generation program that causes a computer (not shown) to function as each of the means described above.

[Editing terminal]
The editing terminal 4 is a client terminal for correcting the audio data (section audio data) and text data (section text data) of each utterance section that have been associated with each other by the speech text generation device 3.
The editing terminal 4 includes a training data storage means 40 and a correction means 41.

The training data storage means 40 stores, in association with each other, the section audio data and section text data of each utterance section, which are the training data generated by the speech text generation device 3. The training data storage means 40 can be configured with a general storage medium such as a hard disk.
The section audio data and section text data stored in the training data storage means 40 are the data received from the speech text generation device 3 via the networks N and N_2 by a communication means (not shown), and then stored.

The correction means 41 corrects the training data (section audio data and section text data) in response to operations by the operator.
The correction means 41 displays an editing screen G2, as shown in FIG. 3, and corrects the section audio data and the section text data in response to operations by the operator.

FIG. 3 shows an example in which the editing screen G2 consists of a delimiter position correction screen g10 for correcting the delimiter positions of the section audio data and a text correction screen g11 for correcting the section text data.
The delimiter position correction screen g10 displays the audio waveform w of the section audio data to be corrected in time series together with the audio waveforms wf and wb of the preceding and following section audio data, and also displays the delimiter positions pf and pb before and after the section audio data to be corrected.
The delimiter position correction screen g10 has an interface for correcting the delimiter positions pf and pb through the operator's operation of a mouse or the like.
The delimiter position correction screen g10 further includes a play button b1, a stop button b2, a pause button b3, a 10-second back button b4, and a 10-second forward button b5, and has an interface that accepts instructions to play back the audio data from a position desired by the operator.

The text correction screen g11 displays the section text data to be corrected.
The text correction screen g11 has an interface for editing the text data through operation of a keyboard or the like.
The editing screen G2 also includes a back button b6 that switches the correction target to the sentence of the preceding section (section audio data and section text data), a forward button b7 that saves the corrections and advances the correction target to the next section, and a forward button b8 that advances to the next section without saving the corrections or without making any, providing an interface with which the operator switches to the desired correction target.

図１に戻って、編集端末４の構成について説明を続ける。
修正手段４１は、音声区切り修正手段４１０と、テキスト修正手段４１１と、を備える。 Returning to FIG. 1, the configuration of the editing terminal 4 will be further explained.
The correcting means 41 includes a voice segment correcting means 410 and a text correcting means 411 .

音声区切り修正手段４１０は、学習データ記憶手段４０に記憶されている区間音声データの区切り位置を修正するものである。
音声区切り修正手段４１０は、図３に示した編集画面Ｇ２の区切り位置修正画面ｇ１０において、修正対象の区間音声データの音声波形ｗを、前後の区間音声データの音声波形ｗｆ，ｗｂとともに時系列に表示する。
また、音声区切り修正手段４１０は、修正対象の区間音声データの区切り位置ｐｆ，ｐｂを表示する。 The voice segment correction means 410 corrects the segment positions of the section voice data stored in the learning data storage means 40 .
The audio segment correction means 410 displays the audio waveform w of the section audio data to be corrected in chronological order, together with the audio waveforms wf, wb of the preceding and following section audio data, on the delimiter position correction screen g10 of the editing screen G2 shown in FIG. 3.
Moreover, the audio segment correction means 410 displays the segment positions pf and pb of the audio data segment to be corrected.

音声区切り修正手段４１０は、再生ボタンｂ１、停止ボタンｂ２、一時停止ボタンｂ３、１０秒戻るボタンｂ４、１０秒進むボタンｂ５を操作者によって指示されることで、操作者が所望する位置からの音声データの再生、停止等を行う。これによって、操作者は、最適な音声データの区切り位置を判断することができる。
音声区切り修正手段４１０は、操作者の操作によって、例えば、マウス等で区切り位置ｐｆ，ｐｂの線を左右にドラッグすることで、区切り位置ｐｆ，ｐｂを修正する。 The audio segment correction means 410 plays or stops audio data from a position desired by the operator in response to an instruction from the play button b1, the stop button b2, the pause button b3, the 10-second back button b4, or the 10-second forward button b5. This allows the operator to determine the optimal segment position of the audio data.
The audio segmentation correction means 410 corrects the segmentation positions pf and pb by the operator dragging the lines of the segmentation positions pf and pb to the left and right with, for example, a mouse.

なお、音声区切り修正手段４１０は、前の区切り位置ｐｆを後ろ修正する、あるいは、後の区切り位置ｐｂを前に修正する場合、修正対象の区間音声データの音声波形ｗにおいて指定された位置で音声波形を削除すればよい。また、音声区切り修正手段４１０は、前の区切り位置ｐｆをさらに前に修正する、あるいは、後の区切り位置ｐｂをさらに後ろに修正する場合、修正対象の区間音声データの音声波形ｗに前後の区間音声データの音声波形の一部を付加すればよい。 When the preceding delimiter position pf is moved later, or the following delimiter position pb is moved earlier, the audio segment correction means 410 only has to cut the audio waveform w of the section audio data to be corrected at the specified position. When the preceding delimiter position pf is moved further earlier, or the following delimiter position pb is moved further later, the audio segment correction means 410 only has to append part of the audio waveform of the preceding or following section audio data to the audio waveform w of the section audio data to be corrected.
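
The trimming and extending behaviour described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that all sections are slices of one continuous sample array and that delimiter positions are sample indices. The function name `adjust_segment` and its parameters are hypothetical.

```python
def adjust_segment(samples, start, end, new_start=None, new_end=None):
    """Return (segment_samples, start, end) after moving the delimiters.

    samples    : the whole recording as a list of samples (assumption)
    start, end : current delimiter positions pf and pb as sample indices
    new_start  : corrected pf, or None to leave it unchanged
    new_end    : corrected pb, or None to leave it unchanged
    """
    total = len(samples)
    s = start if new_start is None else new_start
    e = end if new_end is None else new_end
    # Clamp to the recording so a drag past either end stays valid.
    s = max(0, min(s, total))
    e = max(0, min(e, total))
    if s >= e:
        raise ValueError("segment would be empty")
    # Moving pf later or pb earlier trims the waveform; moving pf earlier
    # or pb later pulls in samples that belonged to the neighbouring section.
    return samples[s:e], s, e
```

Because the section is re-sliced from the full recording, trimming and extending are the same operation here; a system that stores each section as a separate file would instead have to copy samples from the adjacent files.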

テキスト修正手段４１１は、学習データ記憶手段４０に記憶されている区間テキストデータを修正するものである。
テキスト修正手段４１１は、図３に示した編集画面Ｇ２のテキスト修正画面ｇ１１に、修正対象の区間テキストデータを表示する。
そして、テキスト修正手段４１１は、操作者のキーボード等の操作によって、区間テキストデータを一般的なテキスト編集によって修正する。
修正手段４１は、図３に示した編集画面Ｇ２の戻るボタンｂ６、進むボタンｂ８をマウス等の入力手段によって押下されることで、修正対象を時系列で前または後に変更する。
また、修正手段４１は、進むボタンｂ７をマウス等の入力手段によって押下されることで、修正した区間音声データおよび区間テキストデータで、学習データ記憶手段４０のデータを更新する。 The text correcting means 411 corrects the section text data stored in the learning data storage means 40 .
The text correction means 411 displays the section text data to be corrected on the text correction screen g11 of the editing screen G2 shown in FIG. 3.
Then, the text correction means 411 corrects the section text data by general text editing in response to the operator's operation of the keyboard or the like.
The correction means 41 changes the object of correction to the previous or next in the chronological order when the back button b6 or the forward button b8 on the editing screen G2 shown in FIG. 3 is pressed by an input means such as a mouse.
Furthermore, when a forward button b7 is pressed by input means such as a mouse, the correcting means 41 updates the data in the learning data storage means 40 with the corrected section voice data and section text data.
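
The difference between the three navigation buttons can be illustrated with a small state holder. This is only a sketch: the class name `CorrectionSession` is hypothetical, a plain list stands in for the learning data storage means 40, the list is assumed to be non-empty, and it is assumed that the back button b6 discards unsaved edits (the text does not say).

```python
class CorrectionSession:
    """Tracks which section is being corrected and whether edits are saved."""

    def __init__(self, stored_sections):
        self.stored = list(stored_sections)  # stand-in for storage means 40
        self.index = 0
        self.draft = self.stored[0]          # working copy shown on screen

    def edit(self, new_value):
        self.draft = new_value               # operator edits, not yet saved

    def _move(self, step):
        # Stay inside the list, then load the stored data for the new section.
        self.index = max(0, min(self.index + step, len(self.stored) - 1))
        self.draft = self.stored[self.index]

    def back(self):           # button b6: go to the previous section
        self._move(-1)

    def save_and_next(self):  # button b7: save the correction, then advance
        self.stored[self.index] = self.draft
        self._move(+1)

    def skip_to_next(self):   # button b8: advance without saving
        self._move(+1)
```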

以上説明したように、学習データ生成システム１００は、字幕付きデータ（放送データ）から、音声合成または音声認識に用いるＤＮＮのモデルを学習するための発話単位の音声データ（区間音声データ）およびその音声データに対応するテキストデータ（区間テキストデータ）を学習データとして生成することができる。
なお、学習データ生成システム１００において、編集端末４は必ずしも必須構成ではない。しかし、学習データの精度を高める点において、編集端末４を備えることが好ましい。
また、学習データ生成システム１００は、アップロード端末２と、発話音声テキスト生成装置３と、編集端末４と、を一体化した発話音声テキスト生成装置として構成してもよい。 As described above, the training data generation system 100 can generate training data from subtitled data (broadcast data), in the form of utterance-based voice data (section voice data) for training a DNN model used for voice synthesis or voice recognition, and text data corresponding to that voice data (section text data).
The editing terminal 4 is not necessarily a required component of the training data generation system 100. However, it is preferable to provide the editing terminal 4 in terms of improving the accuracy of the training data.
Furthermore, the training data generation system 100 may be configured as a speech text generation device that integrates the upload terminal 2, the speech text generation device 3, and the editing terminal 4.

＜学習データ生成システムの動作＞
次に、図４を参照（構成については適宜図１参照）して、参考例の実施形態に係る学習データ生成システム１００の動作（発話音声テキスト生成方法）について説明する。
なお、字幕付きデータ記憶装置１には、複数の発話音声からなる音声データとその音声データに対応する字幕データとを含んだ字幕付きデータが予め記憶されているものとする。 <Operation of the learning data generation system>
Next, the operation of the training data generation system 100 (a method for generating spoken voice text) according to the embodiment of the reference example will be described with reference to FIG. 4 (for the configuration, refer to FIG. 1 as appropriate).
It is assumed that the subtitled data storage device 1 stores subtitled data in advance, the subtitled data including audio data consisting of a plurality of spoken voices and subtitle data corresponding to the audio data.

ステップＳ１において、アップロード端末２は、字幕付きデータを取得する。ここでは、アップロード端末２は、字幕付きデータ記憶装置１から、ファイル選択手段２０によって、操作者が選択した字幕付きデータを取得する。あるいは、アップロード端末２は、放送データ受信手段２２によって、放送データを受信し、指定されたチャンネルに対応するストリームデータを抽出する。 In step S1, the upload terminal 2 acquires subtitled data. Here, the upload terminal 2 acquires subtitled data selected by the operator from the subtitled data storage device 1 using the file selection means 20. Alternatively, the upload terminal 2 receives broadcast data using the broadcast data receiving means 22 and extracts stream data corresponding to the specified channel.

ステップＳ２において、アップロード端末２は、字幕付きデータから、音声データとテキストデータ（字幕データ）とを分離する。
ステップＳ３において、アップロード端末２は、分離した音声データとテキストデータとを、発話音声テキスト生成装置３にアップロードする。 In step S2, the upload terminal 2 separates the audio data and the text data (subtitle data) from the subtitled data.
In step S3, the upload terminal 2 uploads the separated voice data and text data to the speech text generator 3.

ステップＳ４において、発話音声テキスト生成装置３は、ステップＳ３でアップロードされた音声データとテキストデータとを対応付けて音声テキスト記憶手段３０に記憶する。
ステップＳ５において、発話音声テキスト生成装置３は、音声区切り検出手段３１によって、複数の発話音声からなる音声データにおいて、発話ごとの音声データの区切り位置を検出する。
ステップＳ６において、発話音声テキスト生成装置３は、音声認識手段３２によって、ステップＳ５で検出された区切り位置で区分される音声データである区間音声データごとに音声認識を行う。これによって、発話単位の音声データに対応する音声認識結果が生成される。 In step S4, the speech text generator 3 stores the speech data and text data uploaded in step S3 in the speech text storage means 30 in association with each other.
In step S5, the speech text generation device 3 detects, by the speech segmentation detection means 31, segmentation positions of the speech data for each utterance in the speech data consisting of a plurality of utterances.
In step S6, the speech text generator 3 performs speech recognition for each section of speech data, which is speech data divided by the delimiter positions detected in step S5, by the speech recognition means 32. This generates a speech recognition result corresponding to the speech data in units of utterances.
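
This passage does not restate how the speech segmentation detection means 31 finds the delimiter positions in step S5. As a stand-in only, the sketch below marks a delimiter in the middle of every sufficiently long low-energy run. The energy threshold, the frame length, and the function name `detect_boundaries` are all assumptions made for illustration.

```python
def detect_boundaries(samples, frame_len, threshold, min_silent_frames):
    """Return sample indices at the middle of each long silent run.

    Silent runs that touch the start or the end of the recording are
    ignored, because they do not separate two utterances.
    """
    boundaries = []
    run_start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames + 1):
        if i < n_frames:
            frame = samples[i * frame_len:(i + 1) * frame_len]
            energy = sum(x * x for x in frame) / frame_len
            silent = energy < threshold
        else:
            silent = False  # sentinel frame that closes a trailing run
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            long_enough = i - run_start >= min_silent_frames
            if long_enough and run_start > 0 and i < n_frames:
                # Midpoint of the silent run, converted back to samples.
                boundaries.append((run_start + i) * frame_len // 2)
            run_start = None
    return boundaries
```

Each section between two consecutive delimiters would then be passed to the speech recognition means 32, as in step S6.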

ステップＳ７において、発話音声テキスト生成装置３は、マッチング手段３３によって、ステップＳ６で音声認識された区間音声データの認識結果と、複数の発話音声からなる音声データに対応付けられているテキストデータとをマッチングすることで、認識結果に対応するテキストデータ（区間テキストデータ）を推定する。
ステップＳ８において、発話音声テキスト生成装置３は、生成した学習データ（区間音声データ、区間テキストデータ）を編集端末４に送信し、編集端末４は、区間音声データと区間テキストデータとを対応付けて学習データ記憶手段４０に記憶する。 In step S7, the speech text generation device 3 estimates text data (section text data) corresponding to the recognition result by using the matching means 33 to match the recognition result of the section speech data generated in step S6 with text data corresponding to the speech data consisting of multiple speech sounds.
In step S8, the speech text generation device 3 transmits the generated training data (section speech data, section text data) to the editing terminal 4, and the editing terminal 4 stores the section speech data and the section text data in association with each other in the training data storage means 40.
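
The matching in step S7 can be pictured as locating, for each recognition result, the stretch of subtitle text it most plausibly corresponds to. The sketch below is one simple way to do this with Python's standard `difflib`; it is not claimed to be the algorithm used by the matching means 33. The function name `estimate_section_texts` and the left-to-right greedy strategy are assumptions.

```python
from difflib import SequenceMatcher


def estimate_section_texts(subtitle_text, recognized_segments):
    """For each recognition result, pick the matching span of subtitle text.

    The subtitle text is consumed left to right, so each section starts
    searching where the previous section ended.
    """
    sections = []
    cursor = 0
    for recognized in recognized_segments:
        remaining = subtitle_text[cursor:]
        matcher = SequenceMatcher(None, remaining, recognized, autojunk=False)
        blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
        if not blocks:
            sections.append("")  # nothing in common: leave for the operator
            continue
        # Span from the first matched character to the last matched one,
        # so that misrecognized characters in between are covered too.
        start = blocks[0].a
        end = blocks[-1].a + blocks[-1].size
        sections.append(remaining[start:end])
        cursor += end
    return sections
```

The output keeps the subtitle wording rather than the recognizer's wording, which is the point of the matching step: recognition errors are tolerated as long as enough characters line up. A stray match far ahead in the text can stretch a span too far, which is one reason manual correction on the editing terminal 4 remains useful.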

ステップＳ９において、編集端末４は、修正手段４１によって、区間音声データの区切り位置と、区間テキストデータの文字列とを、操作者の判断により必要に応じて修正する。
ここでは、編集端末４は、音声区切り修正手段４１０によって、区間音声データの区切り位置を修正し、テキスト修正手段４１１によって、区間テキストデータを修正する。
以上の動作によって、学習データ生成システム１００は、音声合成または音声認識に用いるＤＮＮのモデルを学習するための学習データを生成することができる。 In step S9, the editing terminal 4 uses the modifying means 41 to modify the delimiter positions of the section voice data and the character strings of the section text data as required at the discretion of the operator.
Here, the editing terminal 4 uses the voice segment correction means 410 to correct segment positions of the section voice data, and uses the text correction means 411 to correct the section text data.
Through the above operations, the training data generation system 100 can generate training data for training a DNN model used for speech synthesis or speech recognition.

≪本発明の実施形態≫
＜学習データ生成システムの構成＞
次に、図５を参照して、本発明の実施形態に係る学習データ生成システム１００Ｂの構成について説明する。
以下の参考文献に記載されている音声合成方式において、音声合成に用いるＤＮＮは、音声データと、それに対応する読み仮名および韻律記号とを学習データとして学習したものである。
（参考文献）栗原清、清山信正、熊野正、今井篤、“読み仮名と韻律記号を入力とする日本語End-to-End 音声合成方式の検討”、日本音響学会秋季研究発表会、1-4-1、Sep．2018．
この参考文献では、学習データとして、漢字仮名交じり文や片仮名のみのテキストデータよりも、読み仮名および韻律記号を用いる方が、音声合成結果の品質が向上する旨が記載されている。 <Embodiments of the present invention >
<Configuration of the learning data generation system>
Next, a configuration of a training data generation system 100B according to an embodiment of the present invention will be described with reference to FIG. 5.
In the speech synthesis method described in the following reference document, the DNN used for speech synthesis is trained using speech data and the corresponding kana pronunciation and prosodic symbols as training data.
(References) Kiyoshi Kurihara, Nobumasa Kiyoyama, Tadashi Kumano, Atsushi Imai, "Study on Japanese End-to-End Speech Synthesis Method Using Reading Kana and Prosodic Symbols as Input," Acoustical Society of Japan Autumn Meeting, 1-4-1, Sep. 2018.
This reference document describes that the quality of speech synthesis results is improved by using pronunciation kana and prosodic symbols as training data rather than text data containing a mixture of kanji and kana or only katakana.

図５に示す学習データ生成システム１００Ｂは、参考文献に記載の手法に対しても学習データを生成することを可能にするシステムである。
学習データ生成システム１００Ｂは、音声合成または音声認識に用いるディープニューラルネットワーク（ＤＮＮ）のモデルを学習するための発話単位の音声データおよびその音声データに対応する読み仮名および韻律記号を学習データとして生成するものである。 The training data generation system 100B shown in FIG. 5 is a system that makes it possible to generate training data for the techniques described in the reference documents.
The training data generation system 100B generates utterance-unit speech data and pronunciation and prosodic symbols corresponding to the speech data as training data for training a deep neural network (DNN) model used for speech synthesis or speech recognition.

ここで、図６および図７を参照して、学習データ生成システム１００Ｂが生成する読み仮名および韻律記号について説明する。
図６は、「こんにちは正午のニュースです」（漢字仮名交じり文）に対応する読み仮名と韻律記号とを記載した例を示している。
ここでは、「コンニチワショーゴノニュースデス」が読み仮名で、読み仮名の途中や末尾に付加されている記号が韻律記号である。
なお、読み仮名は、読みを表す文字であればよく、片仮名以外にも、平仮名、音素記号、発音記号、ローマ字等であってもよい。
韻律記号は、韻律を表す予め定めた文字であって、アクセント、句・フレーズの区切り、文末イントネーション、ポーズ等の位置や状態を示す記号である。 Here, the reading kana and prosodic symbols generated by the training data generation system 100B will be described with reference to FIG. 6 and FIG. 7.
FIG. 6 shows an example in which the reading kana and prosodic symbols corresponding to "こんにちは正午のニュースです" ("Hello, this is the noon news"; a mixed kanji and kana sentence) are written.
Here, "コンニチワショーゴノニュースデス" is the reading kana, and the symbols added in the middle or at the end of the reading kana are the prosodic symbols.
Note that the reading kana may be any characters that represent a reading; besides katakana, they may be hiragana, phoneme symbols, phonetic symbols, Roman letters, or the like.
The prosodic symbols are predetermined characters that represent prosody, and indicate the position or state of accents, phrase boundaries, sentence-final intonation, pauses, and the like.

図７に韻律記号の例を示す。アクセント位置の指定には、アクセント上昇を表す韻律記号「″」や、アクセント下降を表す韻律記号「＆」が用いられる。句・フレーズの区切り指定には、アクセント句の区切りを表す韻律記号「＃」が用いられる。文末イントネーションの指定には、通常の文末を表す韻律記号「（」や、疑問の文末を表す韻律記号「？」が用いられる。ポーズの指定には韻律記号「＿」が用いられる。なお、これらの韻律記号は例であり、他の記号を用いてもよい。また、これらの例では、韻律記号を１字で表しているが、２字以上で表してもよい。また、図７に示す韻律に加えて他の韻律の韻律記号を用いることもできる。 FIG. 7 shows examples of prosodic symbols. To specify accent positions, the prosodic symbol "″" representing an accent rise and the prosodic symbol "&" representing an accent fall are used. To specify phrase boundaries, the prosodic symbol "#" representing an accent phrase boundary is used. To specify sentence-final intonation, the prosodic symbol "(" representing a normal sentence end and the prosodic symbol "?" representing an interrogative sentence end are used. To specify a pause, the prosodic symbol "_" is used. Note that these prosodic symbols are only examples, and other symbols may be used. In these examples each prosodic symbol is a single character, but it may consist of two or more characters. In addition to the prosody shown in FIG. 7, prosodic symbols for other prosody may also be used.
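
The symbol set of FIG. 7 is small enough to hold in a lookup table. The sketch below uses the half-width forms of the symbols and separates a PLP string into its reading and its prosodic marks. The names `PROSODIC_SYMBOLS` and `split_plp` are hypothetical, and the sample string in the usage note is made up for illustration rather than taken from FIG. 6.

```python
# Prosodic symbols as listed for FIG. 7 (half-width forms assumed).
PROSODIC_SYMBOLS = {
    "\u2033": "accent rise",            # the double-prime character
    "&": "accent fall",
    "#": "accent phrase boundary",
    "(": "sentence end (normal)",
    "?": "sentence end (question)",
    "_": "pause",
}


def split_plp(plp):
    """Split PLP text into (reading, marks).

    marks is a list of (position, meaning), where position is the number
    of reading characters that precede the symbol.
    """
    reading = []
    marks = []
    for ch in plp:
        if ch in PROSODIC_SYMBOLS:
            marks.append((len(reading), PROSODIC_SYMBOLS[ch]))
        else:
            reading.append(ch)
    return "".join(reading), marks
```

For instance, `split_plp("ア&メガ#フ&ル(")` yields the reading `アメガフル` with a fall after the first character, an accent phrase boundary after the third, a fall after the fourth, and a normal sentence end after the fifth.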

図５に戻って説明を続ける。
学習データ生成システム１００Ｂは、字幕付きデータ記憶装置１と、アップロード端末２と、発話音声テキスト生成装置３Ｂと、編集端末４と、を備える。
字幕付きデータ記憶装置１、アップロード端末２および編集端末４は、図１で説明した構成と同じであるため、説明を省略する。 Returning to FIG. 5, the description continues.
The training data generation system 100B includes a subtitled data storage device 1, an upload terminal 2, a speech voice text generation device 3B, and an editing terminal 4.
The subtitled data storage device 1, the upload terminal 2, and the editing terminal 4 have the same configuration as those described in FIG. 1, and therefore a description thereof will be omitted.

〔発話音声テキスト生成装置〕
発話音声テキスト生成装置３Ｂは、複数の発話音声からなる音声データとその音声データに対応するテキストデータとから、発話区間の音声データ（区間音声データ）と、その音声データに対応するテキストデータである読み仮名および韻律記号とを学習データとして生成するサーバである。なお、読み仮名および韻律記号を、ＰＬＰ（Symbols of phoneme and linguistic phonological features）データと記載する場合がある。 [Speech text generation device]
The speech text generation device 3B is a server that generates speech data of an utterance section (section speech data) and reading kana and prosodic symbols, which are text data corresponding to the speech data, as learning data from speech data consisting of a plurality of utterances and text data corresponding to the speech data. Note that the reading kana and prosodic symbols may be referred to as PLP (Symbols of phoneme and linguistic phonological features) data.

発話音声テキスト生成装置３Ｂは、音声テキスト記憶手段３０と、音声区切り検出手段３１と、音声認識手段３２と、マッチング手段３３と、コンテキスト情報生成手段３４と、変換手段３５と、を備える。
音声テキスト記憶手段３０、音声区切り検出手段３１、音声認識手段３２およびマッチング手段３３は、図１で説明した構成と同じであるため、説明を省略する。なお、ここでは、マッチング手段３３は、区間テキストデータをコンテキスト情報生成手段３４に出力し、区間音声データを変換手段３５に出力することとする。 The spoken voice text generation device 3B includes a voice text storage means 30, a voice segment detection means 31, a voice recognition means 32, a matching means 33, a context information generation means 34, and a conversion means 35.
The voice text storage means 30, the voice segment detection means 31, the voice recognition means 32 and the matching means 33 have the same configuration as those described in Fig. 1, and therefore the description thereof will be omitted. Note that, in this embodiment, the matching means 33 outputs section text data to the context information generation means 34 and outputs section voice data to the conversion means 35.

コンテキスト情報生成手段３４は、マッチング手段３３で区間音声データに対応付けられた区間テキストデータ（漢字仮名交じり文）から、コンテキスト情報（コンテキストラベルデータ）を生成するものである。
コンテキスト情報は、音素の情報と、当該音素が含まれるアクセント句および当該アクセント句に隣接するアクセント句に関する特徴を示すアクセント句情報とを少なくとも含む音素ごとの情報（コンテキスト）を、予め定めた指標（ラベル）ごとに表した情報である。 The context information generating means 34 generates context information (context label data) from the section text data (a mixture of kanji and kana text) associated with the section voice data by the matching means 33 .
Context information is information that represents information (context) for each phoneme, which includes at least phoneme information and accent phrase information that indicates characteristics related to the accent phrase in which the phoneme is included and the accent phrases adjacent to the accent phrase, for each predetermined index (label).

図８および図９にコンテキスト情報の各ラベルの特徴を示す。ｎは、先頭の音素を１番目としたときの音素の順番を表す。ラベルｐ_ｎ、ａ_ｎ～ｋ_ｎは、ｎ番目の音素を現在位置としたときの特徴を示す。
ｐ_ｎは現在（ｎ番目）の音素を中心とした音素の並びを表す。ｐ_ｎ，１は２つ前の音素（先先行音素）、ｐ_ｎ，２は１つ前の音素（先行音素）、ｐ_ｎ，３は現在（ｎ番目）の音素、ｐ_ｎ，４は１つ後の音素（後続音素）、ｐ_ｎ，５は２つ後の音素（後後続音素）を表す。ａ_ｎは、アクセント型と位置に関する情報を示す。ｂ_ｎは、先行単語の品詞、活用形および活用型に関する情報を示す。ｃ_ｎは、現在の単語の品詞、活用形および活用型に関する情報を示す。ｄ_ｎは、後続単語の品詞、活用形および活用型に関する情報を示す。ｅ_ｎは、先行アクセント句の情報を示す。ｆ_ｎは、現在のアクセント句の情報を示す。ｇ_ｎは、後続アクセント句の情報を示す。ｈ_ｎは、先行呼気段落の情報を示す。ｉ_ｎは、現在の呼気段落の情報を示す。ｊ_ｎは、後続呼気段落の情報を示す。ｋ_ｎは、発話における呼気段落、アクセント句およびモーラ（音の分節）の数を示す。 FIGS. 8 and 9 show the characteristics of each label of the context information. n indicates the order of the phonemes when the first phoneme is numbered 1. The labels p_n and a_n to k_n indicate the characteristics when the n-th phoneme is taken as the current position.
p_n represents a sequence of phonemes centered on the current (n-th) phoneme. p_{n,1} represents the phoneme two before (pre-preceding phoneme), p_{n,2} the phoneme one before (preceding phoneme), p_{n,3} the current (n-th) phoneme, p_{n,4} the phoneme one after (succeeding phoneme), and p_{n,5} the phoneme two after (post-succeeding phoneme). a_n represents information on the accent type and position. b_n represents information on the part of speech, inflection form, and inflection type of the preceding word. c_n represents information on the part of speech, inflection form, and inflection type of the current word. d_n represents information on the part of speech, inflection form, and inflection type of the succeeding word. e_n represents information on the preceding accent phrase. f_n represents information on the current accent phrase. g_n represents information on the succeeding accent phrase. h_n represents information on the preceding breath group. i_n represents information on the current breath group. j_n represents information on the succeeding breath group. k_n represents the number of breath groups, accent phrases, and moras (sound segments) in the utterance.

このように、コンテキスト情報は、発話における音素の情報、当該音素の前後の音素の情報、当該音素のアクセント句情報等を含む。アクセント句情報は、発話において現在の音素が含まれるアクセント句に関する特徴、および、当該アクセント句に隣接するアクセント句に関する特徴等を示す。なお、位置は、現在の音素の位置を”０”として、現在の音素よりも前の位置は負の値により、現在の音素のよりも後の位置は正の値により表される。
図１０に、コンテキスト情報の形式例を示す。図１０のコンテキスト情報Ｌ_ｎは、音素列の中のｎ番目の音素の情報を示す。 In this way, the context information includes information on the phoneme in the utterance, information on the phonemes before and after the phoneme, accent phrase information on the phoneme, etc. The accent phrase information indicates features related to the accent phrase in the utterance in which the current phoneme is included, and features related to the accent phrase adjacent to the accent phrase, etc. The position is represented by taking the position of the current phoneme as "0", with positions before the current phoneme being represented by negative values and positions after the current phoneme being represented by positive values.
An example of the format of the context information is shown in FIG. 10. The context information L_n in FIG. 10 indicates the information on the n-th phoneme in the phoneme string.
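
FIG. 10 is not reproduced in the text, so the exact layout of L_n is not visible here. The sketch below assumes the HTS-style full-context label layout commonly emitted by Open JTalk (`p1^p2-p3+p4=p5/A:a1+a2+a3/.../E:e1_e2!e3_.../F:f1_...`) and extracts only the fields used later by the conversion conditions. The names `ContextLabel` and `parse_label`, and the sample label in the usage note, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed layout: p1^p2-p3+p4=p5/A:a1+a2+a3/.../E:e1_e2!e3_.../F:f1_...
LABEL_RE = re.compile(
    r"^(?P<p1>[^\^]+)\^(?P<p2>[^-]+)-(?P<p3>[^+]+)\+(?P<p4>[^=]+)=(?P<p5>[^/]+)"
    r"/A:(?P<a1>[^+]+)\+(?P<a2>[^+]+)\+(?P<a3>[^/]+)"
    r".*?/E:(?P<e1>[^_]+)_(?P<e2>[^!]+)!(?P<e3>[^_]+)_"
    r".*?/F:(?P<f1>[^_]+)_"
)


@dataclass
class ContextLabel:
    p: Tuple[str, ...]             # p_{n,1} .. p_{n,5}
    a: Tuple[Optional[int], ...]   # a_{n,1} .. a_{n,3}
    e3: Optional[int]              # e_{n,3}
    f1: Optional[int]              # f_{n,1}


def _num(field):
    """Convert a numeric field to int; undefined fields ("xx") become None."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_label(label):
    m = LABEL_RE.match(label)
    if m is None:
        raise ValueError("unrecognised label: " + label)
    g = m.groupdict()
    return ContextLabel(
        p=tuple(g["p%d" % i] for i in range(1, 6)),
        a=tuple(_num(g["a%d" % i]) for i in range(1, 4)),
        e3=_num(g["e3"]),
        f1=_num(g["f1"]),
    )
```

With a hand-written label such as `xx^sil-k+o=N/A:-4+1+5/B:xx-xx_xx/C:09_xx+xx/D:xx+xx_xx/E:xx_xx!xx_xx-xx/F:5_5#0_xx@1_1|1_5/G:xx_xx%xx_xx_xx`, the parser returns the current phoneme `k` as `p[2]`, the accent fields `(-4, 1, 5)`, and `f1 == 5`.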

図５に戻って説明を続ける。
コンテキスト情報生成手段３４は、区間テキストデータ（漢字仮名交じり文）から、音素ごとに、図１０に示すコンテキスト情報Ｌ_ｎ（ｎ＝１～Ｎ，Ｎ：音素数）を生成する。
漢字仮名交じり文のテキストデータからコンテキスト情報を生成する手法は、一般的な手法を用いればよい。例えば、参考文献「“Open JTalk”，[online]，[2020年3月6日検索]，インターネット<http://open-jtalk.sourceforge.net/>」に記載の技術を用いることができる。この技術の手法は、形態素解析の機能とアクセント辞典の機能やその他の言語処理の機能を持ち、漢字仮名交じり文からコンテキストラベルの形式で各ラベルに情報を反映する。
コンテキスト情報生成手段３４は、生成した音素列のコンテキスト情報を、変換手段３５に出力する。 Returning to FIG. 5, the description continues.
The context information generating means 34 generates the context information L_n (n = 1 to N, N: number of phonemes) shown in FIG. 10 for each phoneme from the section text data (a mixed kanji and kana sentence).
A general method may be used to generate context information from mixed kanji and kana text data. For example, the technology described in the reference ("Open JTalk", [online], [searched March 6, 2020], Internet <http://open-jtalk.sourceforge.net/>) can be used. This technology provides morphological analysis, an accent dictionary, and other language processing functions, and writes the information derived from a mixed kanji and kana sentence into each label in context-label format.
The context information generating means 34 outputs the generated context information of the phoneme string to the converting means 35 .

変換手段３５は、コンテキスト情報生成手段３４で生成された音素列のコンテキスト情報を、音素の出現順の読みを表す文字と韻律を表す予め定めた文字とを含むテキストデータ（第２の区間テキストデータ）に変換するものである。
ここでは、変換手段３５は、コンテキスト情報を、ＰＬＰデータ（読み仮名および韻律記号）に変換する。
変換手段３５は、音素列のコンテキスト情報Ｌ_１，…，Ｌ_ｎ，…，Ｌ_Ｎ（Ｎ：音素数）から、ｐ_ｎ，３（ｎ＝１～Ｎ，Ｎ：音素数）の音素（図８参照）を順番に抽出して、音素列を生成する。
そして、変換手段３５は、予め定めた条件に合致したとき、ｐ_ｎ，３の後ろに、予め定めた韻律記号を挿入する。
具体的には、変換手段３５は、図１１に示す条件（１）～（６）に合致する場合（適宜図８，図９参照）、所定の韻律記号を挿入する。 The conversion means 35 converts the context information of the phoneme string generated by the context information generation means 34 into text data (second section text data) including characters representing the readings in the order in which the phonemes appear and predetermined characters representing the prosody.
Here, the conversion means 35 converts the context information into PLP data (reading kana and prosodic symbols).
The conversion means 35 sequentially extracts the phonemes p_{n,3} (n = 1 to N, N: number of phonemes) (see FIG. 8) from the context information L_1, ..., L_n, ..., L_N (N: number of phonemes) of the phoneme string to generate a phoneme string.
Then, when a predetermined condition is met, the conversion means 35 inserts a predetermined prosodic symbol after p_{n,3}.
Specifically, the conversion means 35 inserts a predetermined prosodic symbol when any of the conditions (1) to (6) shown in FIG. 11 is met (see FIG. 8 and FIG. 9 as appropriate).

条件（１）は、コンテキスト情報Ｌ_ｎのａ_ｎ，３＝１、かつ、コンテキスト情報Ｌ_ｎ＋１のａ_{ｎ＋１，２}＝１という条件である。ａ_ｎ，３は、現在のアクセント句における現在のモーラの後ろからの位置を意味する。つまり、ａ_ｎ，３＝１とは、現在のモーラ位置が現在のアクセント句内において最も後ろであることを示す。ａ_ｎ，２は、現在のアクセント句における現在のモーラの先頭からの位置を意味する。つまり、ａ_{ｎ＋１，２}＝１とは、後続音素の位置を現在位置としたときに、現在のモーラ位置が現在のアクセント句内において先頭であることを示す。
この条件（１）を満たす場合、変換手段３５は、音素ｐ_ｎ，３の後ろに、アクセント句の区切りを示す韻律記号（“＃”）を挿入する。 Condition (1) is that a_{n,3} = 1 in the context information L_n and a_{n+1,2} = 1 in the context information L_{n+1}. a_{n,3} is the position of the current mora counted from the end of the current accent phrase. That is, a_{n,3} = 1 indicates that the current mora is the last one in the current accent phrase. a_{n,2} is the position of the current mora counted from the beginning of the current accent phrase. That is, a_{n+1,2} = 1 indicates that, when the succeeding phoneme is taken as the current position, the current mora is the first one in the current accent phrase.
If condition (1) is satisfied, the conversion means 35 inserts the prosodic symbol ("#") indicating an accent phrase boundary after the phoneme p_{n,3}.

条件（２）は、コンテキスト情報Ｌ_ｎのａ_ｎ，１＝０、かつ、ａ_ｎ，２≠ｆ_ｎ，１という条件である。ａ_ｎ，１＝０は、現在のアクセント句においてアクセント型（アクセント核の位置）と現在のモーラ位置とが一致することを示す。ａ_ｎ，２≠ｆ_ｎ，１は、現在のアクセント句のモーラ数と現在のアクセント句における現在のモーラの先頭からの位置とが不一致であることを示す。つまり、コンテキスト情報Ｌ_ｎの音素は、現在のアクセント句における最後のモーラではないことを示す。
この条件（２）を満たす場合、変換手段３５は、音素ｐ_ｎ，３の後ろに、アクセント下降を示す韻律記号（「＆」）を挿入する。 Condition (2) is that a_{n,1} = 0 and a_{n,2} ≠ f_{n,1} in the context information L_n. a_{n,1} = 0 indicates that the accent type (the position of the accent nucleus) matches the current mora position in the current accent phrase. a_{n,2} ≠ f_{n,1} indicates that the number of moras in the current accent phrase differs from the position of the current mora counted from the beginning of the current accent phrase. That is, the phoneme of the context information L_n does not belong to the last mora of the current accent phrase.
If condition (2) is satisfied, the conversion means 35 inserts the prosodic symbol ("&") indicating an accent fall after the phoneme p_{n,3}.

条件（３）は、コンテキスト情報Ｌ_ｎのａ_ｎ，２＝１、かつ、コンテキスト情報Ｌ_ｎ＋１のａ_{ｎ＋１，２}＝２という条件である。ａ_ｎ，２は、現在のアクセント句における現在のモーラの先頭からの位置を表す。ａ_ｎ，２＝１とは、現在のモーラ位置が現在のアクセント句内において先頭であることを示す。また、ａ_{ｎ＋１，２}＝２とは、後続音素の位置を現在位置としたときに、現在のモーラ位置が現在のアクセント句内において２番目であることを示す。
この条件（３）を満たす場合、変換手段３５は、音素ｐ_ｎ，３の後ろに、アクセント上昇を示す韻律記号（「”」）を挿入する。 Condition (3) is that a_{n,2} = 1 in the context information L_n and a_{n+1,2} = 2 in the context information L_{n+1}. a_{n,2} is the position of the current mora counted from the beginning of the current accent phrase. a_{n,2} = 1 indicates that the current mora is the first one in the current accent phrase. Likewise, a_{n+1,2} = 2 indicates that, when the succeeding phoneme is taken as the current position, the current mora is the second one in the current accent phrase.
If condition (3) is satisfied, the conversion means 35 inserts the prosodic symbol ("″") indicating an accent rise after the phoneme p_{n,3}.

条件（４）は、コンテキスト情報Ｌ_ｎの音素ｐ_ｎ，３がポーズを表す「ｐａｕ」であるという条件である。
この条件（４）を満たす場合、変換手段３５は、音素ｐ_ｎ，３の「ｐａｕ」を削除し、ポーズを表す韻律記号（「＿」）を挿入する。 Condition (4) is that the phoneme p_{n,3} of the context information L_n is "pau", which represents a pause.
If condition (4) is satisfied, the conversion means 35 deletes the "pau" of the phoneme p_{n,3} and inserts the prosodic symbol ("_") indicating a pause.

条件（５）は、コンテキスト情報Ｌ_ｎの音素ｐ_ｎ，３が無音を表す「ｓｉｌ」であり、かつ、ｎ＝Ｎであり、かつ、ｅ_ｎ，３＝０であるという条件である。ｎ＝Ｎとは、現在の音素が発話における最後の音素であることを示す。ｅ_ｎ，３＝０とは、文末イントネーションが疑問形ではない通常のイントネーションであることを示す。
この条件（５）を満たす場合、変換手段３５は、音素ｐ_ｎ，３の「ｓｉｌ」を削除し、文末（通常）を表す韻律記号（「（」）を挿入する。 Condition (5) is that the phoneme p_{n,3} of the context information L_n is "sil", which represents silence, and n = N, and e_{n,3} = 0. n = N indicates that the current phoneme is the last phoneme in the utterance. e_{n,3} = 0 indicates that the sentence-final intonation is normal, not interrogative.
If condition (5) is satisfied, the conversion means 35 deletes the "sil" of the phoneme p_{n,3} and inserts the prosodic symbol ("(") indicating a normal sentence end.

条件（６）は、コンテキスト情報Ｌ_ｎの音素ｐ_ｎ，３が無音を表す「ｓｉｌ」であり、かつ、ｎ＝Ｎであり、かつ、ｅ_ｎ，３＝１であるという条件である。ｎ＝Ｎとは、現在の音素が発話における最後の音素であることを示す。ｅ_ｎ，３＝１とは、文末イントネーションが疑問形のイントネーションであることを示す。
この条件（６）を満たす場合、変換手段３５は、音素ｐ_ｎ，３の「ｓｉｌ」を削除し、文末（疑問）を表す韻律記号（「？」）を挿入する。 Condition (6) is that the phoneme p_{n,3} of the context information L_n is "sil", which represents silence, and n = N, and e_{n,3} = 1. n = N indicates that the current phoneme is the last phoneme in the utterance. e_{n,3} = 1 indicates that the sentence-final intonation is interrogative.
If condition (6) is satisfied, the conversion means 35 deletes the "sil" of the phoneme p_{n,3} and inserts the prosodic symbol ("?") indicating an interrogative sentence end.

これによって、変換手段３５は、図１２に示すように、コンテキスト情報Ｌ_１，…，Ｌ_ｎ，…，Ｌ_Ｎ（Ｎ：音素数）を、音素列ｐ_１，３，ｐ_２，３，…，ｐ_Ｎ，３に韻律記号を挿入したテキストデータであるＰＬＰデータ（ＰＬＰ_Ｎ）に変換する。
なお、ここでは、ＰＬＰデータの読み仮名を音素記号（ｐ_１，３等を示す音素記号）で表した例で示しているが、変換手段３５は、音素記号を、平仮名、片仮名、発音記号、ローマ字等に変換してもよい。片仮名に変換した場合、図６に示したＰＬＰデータとなる。 As a result, as shown in FIG. 12, the conversion means 35 converts the context information L_1, ..., L_n, ..., L_N (N: number of phonemes) into PLP data (PLP_N), which is text data in which prosodic symbols are inserted into the phoneme string p_{1,3}, p_{2,3}, ..., p_{N,3}.
In this example, the reading of the PLP data is expressed in phoneme symbols (the phoneme symbols denoted by p_{1,3} and so on), but the conversion means 35 may convert the phoneme symbols into hiragana, katakana, phonetic symbols, Roman letters, or the like. When they are converted into katakana, the result is the PLP data shown in FIG. 6.
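
Conditions (1) to (6) can be written as one pass over the label sequence. The sketch below is an illustration under stated assumptions, not the patented implementation: each L_n is reduced to the fields the conditions refer to, the class `Ctx` and the function `to_plp` are hypothetical names, and the label values in the usage note are hand-made. The text does not say in which order symbols are emitted when several conditions hold for the same phoneme, so the sketch simply applies them in numerical order.

```python
from dataclasses import dataclass
from typing import List, Optional

RISE, FALL, BOUNDARY, PAUSE, END, QUESTION = "\u2033", "&", "#", "_", "(", "?"


@dataclass
class Ctx:
    """The fields of L_n that conditions (1) to (6) refer to."""
    p3: str                    # p_{n,3}: current phoneme
    a1: Optional[int] = None   # a_{n,1}: 0 on the accent nucleus mora
    a2: Optional[int] = None   # a_{n,2}: mora position from phrase start
    a3: Optional[int] = None   # a_{n,3}: mora position from phrase end
    e3: Optional[int] = None   # e_{n,3}: 1 if interrogative, 0 otherwise
    f1: Optional[int] = None   # f_{n,1}: moras in the current accent phrase


def to_plp(labels: List[Ctx]) -> str:
    out = []
    last = len(labels) - 1
    for n, cur in enumerate(labels):
        nxt = labels[n + 1] if n < last else None
        if cur.p3 == "pau":                                   # condition (4)
            out.append(PAUSE)
            continue
        if cur.p3 == "sil" and n == last and cur.e3 in (0, 1):
            # conditions (5) and (6): final silence becomes a sentence end
            out.append(END if cur.e3 == 0 else QUESTION)
            continue
        # Any other phoneme is emitted unchanged (a "sil" that matches
        # neither (5) nor (6) is kept, since the text does not cover it).
        out.append(cur.p3)
        if cur.a3 == 1 and nxt is not None and nxt.a2 == 1:   # condition (1)
            out.append(BOUNDARY)
        if cur.a1 == 0 and cur.a2 != cur.f1:                  # condition (2)
            out.append(FALL)
        if cur.a2 == 1 and nxt is not None and nxt.a2 == 2:   # condition (3)
            out.append(RISE)
    return "".join(out)
```

For a hand-made sequence of two accent phrases, `a o i` (three moras, nucleus on the second) followed by `e` (one mora) and a final `sil` with e_{n,3} = 0, the function returns `a″o&i#e(`. Condition (2) is applied exactly as worded, per phoneme; whether FIG. 11 restricts it to the last phoneme of a multi-phoneme mora cannot be told from the text, so the example uses single-phoneme moras.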

図５に戻って説明を続ける。
変換手段３５は、マッチング手段３３で区切られた区間音声データとその区間に対応するＰＬＰデータである区間ＰＬＰデータ（第２の区間テキストデータ）とを、図示を省略した通信手段によって、ネットワークＮ，Ｎ_２を介して、編集端末４に送信する。 Returning to FIG. 5, the description continues.
The conversion means 35 transmits the section audio data delimited by the matching means 33 and the section PLP data (second section text data), which is the PLP data corresponding to that section, to the editing terminal 4 via the networks N and N_2 using a communication means not shown.

以上説明したように構成することで、発話音声テキスト生成装置３Ｂは、音声データとテキストデータとから、発話ごとに対応付けた区間音声データと区間ＰＬＰデータとを学習データとして生成することができる。このとき、発話音声テキスト生成装置３Ｂは、音声データに含まれる発話音声である区間音声データを、時間のずれに関係なく字幕データに対応した区間ＰＬＰデータに対応付けることができる。
なお、発話音声テキスト生成装置３Ｂは、図示を省略したコンピュータを、前記した各手段として機能させるための学習宇データ生成プログラムで動作させることができる。 With the above-described configuration, the speech text generation device 3B can generate section speech data and section PLP data associated with each utterance as learning data from the speech data and text data. At this time, the speech text generation device 3B can associate the section speech data, which is the speech included in the speech data, with the section PLP data corresponding to the subtitle data regardless of the time lag.
The speech text generation device 3B can be operated by a learning data generation program for causing a computer (not shown) to function as each of the above-mentioned means.

学習データ生成システム１００Ｂでは、編集端末４は、発話区間ごとの音声データ（区間音声データ）とテキストデータ（ＰＬＰデータ）とを学習データ記憶手段４０に記憶する。そして、編集端末４は、図１３に示すように、編集画面Ｇ２のテキスト修正画面ｇ１１に、区間ＰＬＰデータを表し、修正を行う。
なお、発話音声テキスト生成装置３Ｂは、区間音声データと区間ＰＬＰデータとともに、区間テキストデータを編集端末４に送信することとしてもよい。
この場合、編集端末４は、修正手段４１によって、図１４に示すように編集画面Ｇ２Ｂを表示し、区間テキストデータと区間ＰＬＰデータとを修正対象とることができる。
図１４の例では、テキスト修正手段４１１が、テキスト修正画面を２つ（ｇ１１ａ，ｇ１１ｂ）表示し、テキスト修正画面ｇ１１ａにおいて区間テキストデータを修正し、テキスト修正画面ｇ１１ｂにおいて区間ＰＬＰデータを修正すればよい。 In the learning data generation system 100B, the editing terminal 4 stores voice data (section voice data) and text data (PLP data) for each speech section in the learning data storage means 40. Then, as shown in Fig. 13, the editing terminal 4 displays the section PLP data on a text correction screen g11 of the editing screen G2 and performs correction.
The speech text generation device 3B may transmit the section text data to the editing terminal 4 together with the section voice data and the section PLP data.
In this case, the editing terminal 4 displays the editing screen G2B as shown in FIG. 14 by the modifying means 41, and the section text data and the section PLP data can be subject to modification.
In the example of FIG. 14, the text correction means 411 displays two text correction screens (g11a, g11b), and the section text data is corrected on the text correction screen g11a, and the section PLP data is corrected on the text correction screen g11b.

＜学習データ生成システムの動作＞
次に、図１５を参照（構成については適宜図５参照）して、本発明の実施形態に係る学習データ生成システム１００Ｂの動作（発話音声テキスト生成方法）について説明する。
なお、ステップＳ１からＳ７までの動作は、図４で説明した学習データ生成システム１００と同じ動作であるため説明を省略する。 <Operation of the learning data generation system>
Next, the operation of the training data generation system 100B (a method for generating spoken voice text) according to an embodiment of the present invention will be described with reference to FIG. 15 (for the configuration, refer to FIG. 5 as appropriate).
The operations from steps S1 to S7 are the same as those of the learning data generation system 100 described with reference to FIG. 4, and therefore will not be described here.

ステップＳ７Ａにおいて、発話音声テキスト生成装置３Ｂは、コンテキスト情報生成手段３４によって、ステップＳ７で区間音声データに対応付けられた区間テキストデータ（漢字仮名交じり文）に対して、形態素解析および言語解析を行うことで、区間テキストデータから、音素ごとのコンテキスト情報（コンテキストラベルデータ）を生成する。
ステップＳ７Ｂにおいて、発話音声テキスト生成装置３Ｂは、変換手段３５によって、ステップ７Ａで生成されたコンテキスト情報から音素列を抽出するとともに、図１１に示した条件に従って、韻律記号を付加することで、区間音声データに対応した音素列のコンテキスト情報をＰＬＰデータ（区間ＰＬＰデータ；第２の区間テキストデータ）に変換する。 In step S7A, the speech text generation device 3B uses the context information generation means 34 to perform morphological analysis and linguistic analysis on the section text data (mixed kanji and kana sentences) associated with the section speech data in step S7, thereby generating context information (context label data) for each phoneme from the section text data.
In step S7B, the speech text generation device 3B uses the conversion means 35 to extract a phoneme string from the context information generated in step S7A and to add prosodic symbols in accordance with the conditions shown in FIG. 11, thereby converting the context information of the phoneme string corresponding to the section speech data into PLP data (section PLP data; second section text data).

In step S8A, the speech text generation device 3B transmits the generated learning data (section speech data, section PLP data) to the editing terminal 4, and the editing terminal 4 stores the section speech data and the section text data in association with each other in the learning data storage means 40.
In step S9B, the editing terminal 4 uses the correction means 41 to correct the segment positions of the section speech data and the character strings of the section PLP data as necessary, at the operator's discretion.
Specifically, the editing terminal 4 uses the speech segment correction means 410 to correct the segment positions of the section speech data, and uses the text correction means 411 to correct the section PLP data.
Through the above operations, the learning data generation system 100B can generate learning data for training a DNN model used for speech synthesis or speech recognition.
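
The storage and correction performed at the editing terminal in steps S8A and S9B can be pictured with the following minimal Python sketch. The class names `LearningDataItem` and `LearningDataStore`, the use of start and end times in seconds to represent segment positions, and the method names are assumptions made for illustration; the embodiment does not specify a data layout.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class LearningDataItem:
    """One hypothetical learning data entry for a single utterance."""
    start_sec: float   # segment start within the source speech (assumed unit)
    end_sec: float     # segment end within the source speech (assumed unit)
    section_text: str  # section text data (mixed kanji and kana)
    section_plp: str   # section PLP data (second section text data)


class LearningDataStore:
    """In-memory stand-in for the learning data storage means 40."""

    def __init__(self) -> None:
        self._items: Dict[int, LearningDataItem] = {}

    def put(self, item_id: int, item: LearningDataItem) -> None:
        self._items[item_id] = item

    def get(self, item_id: int) -> LearningDataItem:
        return self._items[item_id]

    def correct_boundaries(self, item_id: int,
                           start_sec: float, end_sec: float) -> None:
        """Operator-driven correction of the segment positions."""
        if not 0 <= start_sec < end_sec:
            raise ValueError("segment must satisfy 0 <= start < end")
        self._items[item_id] = replace(
            self._items[item_id], start_sec=start_sec, end_sec=end_sec)

    def correct_plp(self, item_id: int, new_plp: str) -> None:
        """Operator-driven correction of the section PLP data."""
        self._items[item_id] = replace(
            self._items[item_id], section_plp=new_plp)
```

Keeping each entry immutable and replacing it on correction is one possible design choice; it makes every operator correction an explicit, whole-record update rather than an in-place mutation.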

100, 100B Learning data generation system
1 Subtitled data storage device
2 Upload terminal
20 File selection means
21 File separation means (separation means)
22 Broadcast data receiving means
23 Broadcast data separation means (separation means)
3, 3B Speech text generation device
30 Speech text storage means
31 Speech segment detection means
32 Speech recognition means
33 Matching means
34 Context information generation means
35 Conversion means
4 Editing terminal
40 Learning data storage means
41 Correction means
410 Speech segment correction means
411 Text correction means

Claims

1. A speech text generation device comprising:
a speech segment detection means for detecting, from speech data consisting of a plurality of utterances, segment positions of section speech data for each utterance;
a speech recognition means for performing speech recognition on each piece of the section speech data;
a matching means for estimating section text data corresponding to the time of the section speech data by matching a recognition result of the speech recognition means against text data that is the spoken content of the speech data;
a context information generation means for generating, from the section text data, context information for each phoneme, the context information including at least information on the phoneme and accent phrase information indicating characteristics of the accent phrase that includes the phoneme and of an accent phrase adjacent to that accent phrase; and
a conversion means for converting the context information of the phoneme string into second section text data including characters representing the pronunciation of the phonemes in order of appearance and predetermined characters representing prosody.

2. The speech text generation device according to claim 1, further comprising a separation means for separating the speech data and the subtitle data that becomes the text data from subtitled data including a plurality of speech sounds and subtitles corresponding to the speech sounds.

3. The speech text generation device according to claim 1, further comprising a speech segment correction means for correcting the segment positions of the section speech data based on an operation by an operator.

4. The speech text generation device according to claim 1, further comprising a text correction means for correcting the section text data based on an operation by an operator.

5. A speech text generation program for causing a computer to function as the speech text generation device according to any one of claims 1 to 4.

6. A speech text generation method comprising:
a step of detecting, by a speech segment detection means, segment positions of section speech data for each utterance from speech data consisting of a plurality of utterances;
a step of performing, by a speech recognition means, speech recognition on each piece of the section speech data;
a step of estimating, by a matching means, section text data corresponding to the time of the section speech data by matching a recognition result of the speech recognition means against text data that is the spoken content of the speech data;
a step of generating, from the section text data, context information for each phoneme, the context information including at least information on the phoneme and accent phrase information indicating characteristics of the accent phrase that includes the phoneme and of an accent phrase adjacent to that accent phrase; and
a step of converting, by a conversion means, the context information of the phoneme string into second section text data including characters representing the pronunciation of the phonemes in order of appearance and predetermined characters representing prosody.