JP2023044436A

JP2023044436A - Synthetic voice generation data forming method, synthetic voice generation method, and synthetic voice generation device

Info

Publication number: JP2023044436A
Application number: JP2021152468A
Authority: JP
Inventors: 茂雄山形; Shigeo Yamagata
Original assignee: Tosho Printing Co Ltd
Current assignee: Tosho Printing Co Ltd
Priority date: 2021-09-17
Filing date: 2021-09-17
Publication date: 2023-03-30

Abstract

To provide a synthetic voice generation data forming method, a synthetic voice generation method, and a synthetic voice generation device for generating a synthetic voice with more natural utterance.

SOLUTION: A pause length is assigned to each pause position, a pause position being a symbol included in the text of text data or a position in the text that satisfies a predetermined condition, and synthetic voice generation data is formed in which pause length information indicating the pause length is inserted at the pause position. In the synthetic voice generation data, when the symbol included in the text is a comma, pause length information indicating a first pause length is inserted immediately after the comma, and when the symbol included in the text is a period, pause length information indicating a second pause length longer than the first pause length is inserted immediately after the period. In addition, immediately after a comma and a period included in text enclosed in brackets in the text data, pause length information indicating pause lengths shorter than those for a comma and a period outside the brackets is inserted, respectively.

SELECTED DRAWING: Figure 2

Description

The present disclosure relates to a synthetic speech generation data formation method, a synthetic speech generation method, and a synthetic speech generation device.

In recent years, various services have been provided that use technology for converting text data into speech data and uttering it. To this end, various techniques are used to generate synthetic speech data from text data. For example, corpus-based speech synthesis, which uses a speech corpus (a large-scale speech database containing speech data uttered by humans), is widely used (see Patent Document 1). In corpus-based speech synthesis, speech data uttered by a person is divided into predetermined units and stored in a database, and synthetic speech data is generated by concatenating the predetermined units of speech data extracted from the database at synthesis time (see, for example, Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2012-103668

However, although the speech synthesis method described above concatenates predetermined units of speech data uttered by a person, the result still falls short of the natural utterance of a person reading text aloud, and unnatural synthetic speech may be generated.
An object of the present disclosure is to provide a synthetic speech generation data formation method, a synthetic speech generation method, and a synthetic speech generation device that generate synthetic speech with more natural utterance.

To solve the above problem, a synthetic speech generation data formation method according to one aspect of the present disclosure assigns a pause length to each pause position, a pause position being a symbol included in the text of text data or a position in the text that satisfies a predetermined condition, and forms synthetic speech generation data in which pause length information indicating the pause length is inserted at the pause position.

To solve the above problem, a synthetic speech generation method according to one aspect of the present disclosure acquires synthetic speech generation data in which pause length information indicating the length of a pause is inserted at each pause position where a predetermined pause occurs in the text corresponding to text data; converts the sentences corresponding to the text data into phonetic notation; generates prosody information on intonation and duration using the phonetic notation; selects synthesis units corresponding to the phonetic notation from a speech database containing human-uttered speech data for each synthesis unit; and generates synthetic speech by concatenating the synthesis units, with a pause of the length given by the pause length information at each pause position included in the synthetic speech generation data, and adding the prosody information.

To solve the above problem, a synthetic speech generation device according to one aspect of the present disclosure includes: a synthetic speech generation data acquisition unit that acquires synthetic speech generation data in which pause length information indicating the length of a pause is inserted at each pause position where a predetermined pause occurs in the text corresponding to text data; a phonetic notation conversion unit that converts the sentences corresponding to the text data into phonetic notation; a prosody processing unit that generates prosody information on the intonation and duration of the sentences using the phonetic notation acquired from the phonetic notation conversion unit; and a speech synthesis unit that selects synthesis units corresponding to the phonetic notation acquired from the phonetic notation conversion unit from a speech database containing human-uttered speech data for each synthesis unit, and generates synthetic speech by concatenating the synthesis units, with a pause of the length given by the pause length information at each pause position included in the synthetic speech generation data, and adding the prosody information.

According to aspects of the present disclosure, it is possible to provide a synthetic speech generation data formation method, a synthetic speech generation method, and a synthetic speech generation device that generate synthetic speech with more natural utterance.

FIG. 1 is a table showing an example of rules for conversion to pause length information in the synthetic speech generation data formation method according to the first embodiment of the present disclosure.
FIG. 2 is a block diagram showing a configuration example of a synthetic speech generation data formation device that executes the synthetic speech generation data formation method according to the first embodiment of the present disclosure.
FIG. 3 is a schematic diagram showing an example of synthetic speech generation data formed using the synthetic speech generation data formation method according to the first embodiment of the present disclosure.
FIG. 4 is a block diagram showing a configuration example of a synthetic speech generation data formation device that executes the synthetic speech generation data formation method according to the second embodiment of the present disclosure.
FIG. 5 is a schematic diagram showing an example of synthetic speech generation data formed using the synthetic speech generation data formation method according to the second embodiment of the present disclosure.
FIG. 6 is a block diagram showing a configuration example of a synthetic speech generation device according to the third embodiment of the present disclosure.
FIG. 7 is a block diagram showing another configuration example of the synthetic speech generation device according to the third embodiment of the present disclosure.

The present disclosure is described below through embodiments, but the following embodiments do not limit the invention according to the claims. Not all combinations of the features described in the embodiments are essential to the solution of the invention. The drawings show the invention according to the claims schematically, and the configuration and function of each part differ from those of an actual method and device.

1. First Embodiment
A synthetic speech generation data formation method according to the first embodiment is described below. The first embodiment also describes a synthetic speech generation data formation program that causes a computer to execute the method, and a synthetic speech generation data formation device.

(1.1) Synthetic Speech Generation Data Formation Method
The synthetic speech generation data formation method according to the first embodiment is described below.
The method includes at least the following steps.
(A) Assign a pause length to each pause position, a pause position being a symbol included in the text of the text data or a position in the text that satisfies a predetermined condition.
(B) Form synthetic speech generation data in which pause length information indicating the pause length is inserted at the pause positions of the text.
These steps form synthetic speech generation data in which pause length information indicating a pause of appropriate length is inserted at predetermined positions in the text. When speech synthesis is performed using such data, the result, although synthetic, is speech with more natural utterance.
The pause length information has, for example, the following pause lengths.

(Comma)
In the synthetic speech generation data formation method according to the present embodiment, when the symbol included in the text is a comma (、), pause length information indicating a first pause length is preferably inserted immediately after the comma.
The first pause length is preferably 300 msec to 500 msec, more preferably 350 msec to 450 msec, and is, for example, 400 msec.

In the synthetic speech generation data formation method according to the present embodiment, when the symbol included in the text is a period (。), pause length information indicating a second pause length, longer than the first pause length, is preferably inserted immediately after the period.
The second pause length is preferably 900 msec to 1500 msec, more preferably 900 msec to 1100 msec, and is, for example, 1000 msec.
Forming synthetic speech generation data in which pause length information with a longer pause length than that of a comma is inserted at the position of each period makes it possible to generate synthetic speech with more natural utterance.
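The comma and period rules above can be sketched as follows. This is a minimal illustration rather than the implementation described in the publication: the source does not state how pause length information is encoded, so the SSML-style `<break time="...ms"/>` tag and the names `PAUSE_AFTER_MS`, `pause_tag`, and `insert_basic_pauses` are assumptions. The millisecond values are the example values given above.

```python
# Example pause lengths in milliseconds, taken from the example values in the text.
PAUSE_AFTER_MS = {
    "、": 400,   # first pause length (comma)
    "。": 1000,  # second pause length (period)
}


def pause_tag(ms: int) -> str:
    """Render pause length information as an SSML-style break tag (assumed format)."""
    return f'<break time="{ms}ms"/>'


def insert_basic_pauses(text: str) -> str:
    """Insert pause length information immediately after each comma and period."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in PAUSE_AFTER_MS:
            out.append(pause_tag(PAUSE_AFTER_MS[ch]))
    return "".join(out)
```

Because the period entry is larger than the comma entry, sentence boundaries receive the longer pause, which is the ordering the text requires.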

In the synthetic speech generation data formation method according to the present embodiment, when the symbol included in text located between corner brackets (「」) in the text data is a comma (、), pause length information indicating a third pause length, shorter than the first pause length (the pause length of a comma located outside the corner brackets), is preferably inserted immediately after that comma.
The third pause length is preferably 150 msec to 300 msec, more preferably 150 msec to 250 msec, and is, for example, 200 msec.

Likewise, when the symbol included in text located between corner brackets (「」) is a period (。), pause length information indicating a fourth pause length, shorter than the second pause length (the pause length of a period located outside the corner brackets), is preferably inserted immediately after that period.
The fourth pause length is preferably 450 msec to 900 msec, more preferably 650 msec to 750 msec, and is, for example, 700 msec.
Text located between corner brackets is often dialogue or similar. Making the pause lengths at periods and commas inside the corner brackets shorter than those at periods and commas outside the corner brackets therefore shortens the pauses in the speech corresponding to the bracketed text and yields synthetic speech with still more natural utterance.
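One way to picture the inside/outside distinction is to track whether the current position lies between corner brackets and pick the pause table accordingly. The sketch below is illustrative only: the tag format and the names `OUTSIDE_MS`, `INSIDE_MS`, and `insert_pauses_with_quotes` are assumptions, the values are the example values from the text, and the rule for consecutive symbols (such as 。 directly followed by 」) is deliberately left out.

```python
OUTSIDE_MS = {"、": 400, "。": 1000}  # first and second pause lengths
INSIDE_MS = {"、": 200, "。": 700}    # third and fourth pause lengths


def insert_pauses_with_quotes(text: str) -> str:
    """Insert shorter pauses after commas and periods inside corner brackets."""
    depth = 0  # number of currently open corner brackets
    out = []
    for ch in text:
        if ch == "「":
            depth += 1
        elif ch == "」":
            depth = max(depth - 1, 0)
        out.append(ch)
        table = INSIDE_MS if depth > 0 else OUTSIDE_MS
        if ch in table:
            out.append(f'<break time="{table[ch]}ms"/>')
    return "".join(out)
```

For example, in 彼は「はい、そうです。」と言い、去った。 the comma and period inside the brackets receive 200 msec and 700 msec, while those outside receive 400 msec and 1000 msec.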

(Brackets, etc.)
In the synthetic speech generation data formation method according to the present embodiment, when the symbol included in the text is a bracket, pause length information indicating a fifth pause length is preferably inserted at least immediately before the opening bracket (「). The pause length information is preferably inserted either only immediately before the opening bracket, or both immediately before the opening bracket and immediately after the closing bracket (」). When brackets are consecutive (for example, 」「), inserting the pause length information only immediately before the opening bracket prevents pause length information from being inserted twice between the closing bracket (」) and the opening bracket (「).
Here, "brackets" refers to corner brackets (including double corner brackets), parentheses, lenticular brackets, square brackets, curly brackets, and the like.

The fifth pause length is preferably 500 msec to 1000 msec, more preferably 500 msec to 600 msec, and is, for example, 500 msec.
As noted above, text located between brackets is often dialogue or wording that explains an important matter. Inserting pause length information at least immediately before the opening bracket therefore places a pause between the speech corresponding to the bracketed text and the surrounding speech, which makes it easier for the listener to focus on the bracketed text and yields synthetic speech with natural utterance.
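The bracket rule, including the case of consecutive brackets (」「), might be sketched as below. The tag format, the bracket sets, and the names `insert_bracket_pauses` and `after_closing` are assumptions made for illustration; the 500 msec value is the example value from the text.

```python
BRACKET_PAUSE_MS = 500  # fifth pause length (example value from the text)
OPENING = "「『（【［｛"   # illustrative set of opening brackets
CLOSING = "」』）】］｝"   # illustrative set of closing brackets


def insert_bracket_pauses(text: str, after_closing: bool = True) -> str:
    """Insert a pause before each opening bracket and, optionally, after each
    closing bracket, without duplicating the pause between adjacent brackets."""
    tag = f'<break time="{BRACKET_PAUSE_MS}ms"/>'
    out = []
    for i, ch in enumerate(text):
        if ch in OPENING:
            out.append(tag)  # always pause immediately before an opening bracket
        out.append(ch)
        if ch in CLOSING and after_closing:
            next_is_opening = i + 1 < len(text) and text[i + 1] in OPENING
            if not next_is_opening:  # otherwise the next opening bracket supplies the pause
                out.append(tag)
    return "".join(out)
```

With `after_closing=False` the sketch corresponds to the variant in which pause length information is inserted only before the opening bracket.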

(Headings)
In the synthetic speech generation data formation method according to the present embodiment, when the text is a heading, pause length information indicating a sixth pause length is inserted immediately after the heading (an example of a position that satisfies a predetermined condition). The sixth pause length is preferably longer than the other pause lengths inserted immediately before or after text other than headings (that is, the first to fifth pause lengths).
The sixth pause length is preferably 1500 msec to 4500 msec, and more preferably 2000 msec to 3000 msec. When the text contains several kinds of headings (for example, main headings, such as the heading at the start of each chapter, and subheadings), the sixth pause length immediately after a main heading is preferably longer than the sixth pause length immediately after a subheading. For example, the sixth pause length is 3000 msec immediately after a main heading and 2000 msec immediately after a subheading.
Inserting a relatively long pause immediately after a heading in this way makes it easier for the listener to focus on the speech corresponding to the heading text and yields synthetic speech with natural utterance.

(Groups of Sentences)
In the synthetic speech generation data formation method according to the present embodiment, when the text forms a semantic unit, pause length information is preferably inserted immediately after the group of sentences (an example of a position that satisfies a predetermined condition). Here, a "group of sentences" means, for example, a plurality of sentences that appear under one heading and explain related content. In this case, pause length information indicating a seventh pause length, longer than the other pause lengths inserted immediately before or after text other than headings (that is, the first to fifth pause lengths), is preferably inserted immediately after the group of sentences.

The seventh pause length is preferably about the same as that of a heading. When several kinds of headings, such as main headings and subheadings, are used, it is preferably longer than the pause length of the subheading, which is the shorter of the two. The seventh pause length is preferably 2500 msec to 4500 msec, and more preferably 3000 msec to 4000 msec. For example, when the sixth pause length is 3000 msec immediately after a main heading and 2000 msec immediately after a subheading, the seventh pause length is preferably 3000 msec.
Inserting a relatively long pause immediately after a group of sentences in this way makes breaks in the content of the speech easier for the listener to follow and yields synthetic speech with natural utterance.
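The heading and group-of-sentences rules operate on document structure rather than on individual symbols. The sketch below assumes the text has already been split into labeled blocks; the `Block` type, the labels `main_heading`, `sub_heading`, and `body`, and the tag format are hypothetical. Treating a body block that precedes a heading as the end of a group follows the example given elsewhere in the publication, while treating the last block of the text the same way is an additional assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # hypothetical labels: "main_heading", "sub_heading", or "body"
    text: str


# Sixth pause length: example values for main headings and subheadings.
HEADING_PAUSE_MS = {"main_heading": 3000, "sub_heading": 2000}
# Seventh pause length: example value for the end of a group of sentences.
GROUP_PAUSE_MS = 3000


def pause_tag(ms: int) -> str:
    return f'<break time="{ms}ms"/>'


def insert_structural_pauses(blocks: List[Block]) -> str:
    """Append a pause after each heading and after each group of sentences."""
    out = []
    for i, block in enumerate(blocks):
        out.append(block.text)
        is_last = i + 1 == len(blocks)
        if block.kind in HEADING_PAUSE_MS:
            out.append(pause_tag(HEADING_PAUSE_MS[block.kind]))
        elif is_last or blocks[i + 1].kind in HEADING_PAUSE_MS:
            # A body block followed by a heading (or by nothing) closes a group.
            out.append(pause_tag(GROUP_PAUSE_MS))
    return "".join(out)
```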

(Others)
As shown in FIG. 1, the synthetic speech generation data formation method according to the present embodiment can also insert pause length information at positions that satisfy a predetermined condition, according to each symbol or condition.
For example, pause length information is inserted immediately after symbols such as leaders (for example, the two-dot leader (‥) and the three-dot leader (…)), the question mark (？), the exclamation mark (！), the vertical bar (｜), the dash (―), and enclosed alphanumerics such as circled numbers and boxed numbers. FIG. 1 shows the conditions for inserting pause length information, example pause lengths (preferred ranges), and specific examples of pause length information. A leader indicates silence (a pause) in conversation, a lingering tone at the end of a sentence, or an omission within a sentence, and a dash likewise indicates a pause or the like, so both are preferably assigned a pause length longer than that of, for example, a period or comma. A pause length of 1000 msec to 1500 msec, for example 1000 msec, is preferably assigned immediately after a leader. A vertical bar often marks a break between sentences, so it is also preferably assigned a pause length longer than that of, for example, a period or comma. A pause length of 1000 msec to 1500 msec, for example 1000 msec, is preferably assigned immediately after a vertical bar. Enclosed alphanumerics often mark, for example, the start of a line in itemized text, so they are preferably assigned a pause length about as long as that of a comma. A pause length of 300 msec to 500 msec, for example 300 msec, is preferably assigned immediately after an enclosed alphanumeric.
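The per-symbol rules lend themselves to a lookup table. The sketch below covers only the symbols for which this passage states an example value; FIG. 1 is not reproduced in this text, so the table is not a copy of it. The names `SYMBOL_PAUSE_MS`, `is_enclosed_alphanumeric`, and `pause_for_symbol` are assumptions, and enclosed alphanumerics are approximated by the Unicode "Enclosed Alphanumerics" block (U+2460 to U+24FF).

```python
from typing import Optional

# Example values stated in the passage above (milliseconds).
SYMBOL_PAUSE_MS = {
    "‥": 1000,  # two-dot leader
    "…": 1000,  # three-dot leader
    "｜": 1000,  # vertical bar
}
ENCLOSED_ALNUM_PAUSE_MS = 300


def is_enclosed_alphanumeric(ch: str) -> bool:
    """True for characters in the Unicode Enclosed Alphanumerics block."""
    return 0x2460 <= ord(ch) <= 0x24FF


def pause_for_symbol(ch: str) -> Optional[int]:
    """Return the pause length to insert after ch, or None if no rule applies."""
    if ch in SYMBOL_PAUSE_MS:
        return SYMBOL_PAUSE_MS[ch]
    if is_enclosed_alphanumeric(ch):
        return ENCLOSED_ALNUM_PAUSE_MS
    return None
```

Keeping the rules in a table makes it straightforward to add entries for other symbols once their pause lengths are decided.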

When two or more symbols such as punctuation marks and brackets (so-called yakumono) appear consecutively, a pause may be assigned only after the last symbol in the run, with no pause between the consecutive symbols. In that case, it is preferable to insert, immediately after the last symbol, the longer of the pause lengths of the individual symbols.
For example, when text ending with a question mark (？) is written between an opening bracket (「) and a closing bracket (」), no pause is assigned between the question mark and the closing corner bracket that follows it, and a pause is assigned only immediately after the closing corner bracket. Of the question mark's pause length (900 to 1500 msec) and the closing corner bracket's pause length (500 to 1000 msec), the longer one, that of the question mark (for example, 1200 msec), is preferably assigned to that pause position.
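The rule for consecutive symbols amounts to taking the maximum pause length over a run of symbols and emitting it once, after the last symbol. The sketch below illustrates this under stated assumptions: the tag format and the names `PAUSE_MS` and `insert_pauses_merging_runs` are hypothetical, the values are example values from the text, and the pause before an opening bracket is omitted to keep the example small.

```python
# Example pause lengths (milliseconds) for symbols followed by a pause.
PAUSE_MS = {"、": 400, "。": 1000, "？": 1200, "」": 500}


def insert_pauses_merging_runs(text: str) -> str:
    """Insert one pause after each run of consecutive symbols, using the
    longest pause length among the symbols in the run."""
    out = []
    run_max = 0
    for i, ch in enumerate(text):
        out.append(ch)
        if ch in PAUSE_MS:
            run_max = max(run_max, PAUSE_MS[ch])
            next_is_symbol = i + 1 < len(text) and text[i + 1] in PAUSE_MS
            if not next_is_symbol:  # end of the run: emit a single pause
                out.append(f'<break time="{run_max}ms"/>')
                run_max = 0
    return "".join(out)
```

In 「本当？」と聞いた。 the run ？」 yields a single 1200 msec pause after the closing bracket, matching the example in the text.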

Pause lengths may also be assigned to other symbols not listed in FIG. 1.
In addition, when a conventional synthetic speech generation device produces a pause of minute length at a position other than a symbol included in the text or a position in the text that satisfies a predetermined condition, eighth pause length information may be inserted at the position of that pause. In this case, the eighth pause length is preferably 130 msec to 200 msec, more preferably 140 msec to 170 msec, and is, for example, 150 msec.

Furthermore, the text used to generate synthetic speech may contain annotation numbers and the like. For this reason, speech prohibition information that prevents text from being uttered may be inserted as tags before and after each annotation, so that synthetic speech generation data is generated without the annotation numbers and the like.
This prevents annotations, which are unrelated to the context of the text and disrupt the natural flow of the synthetic speech, from being uttered in the synthetic speech generated from the synthetic speech generation data.
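Annotation suppression can be pictured as wrapping each annotation marker in a pair of tags and later dropping everything between them. The source does not specify the tag syntax or what annotation markers look like, so the `<nospeak>` tag, the marker pattern `（注N）`, and the function names below are all assumptions for illustration.

```python
import re

# Hypothetical annotation marker: "（注" + digits + "）".
ANNOTATION = re.compile(r"（注\d+）")
# Hypothetical speech prohibition tag pair.
NOSPEAK = re.compile(r"<nospeak>.*?</nospeak>")


def mark_annotations(text: str) -> str:
    """Wrap each annotation marker in speech prohibition tags."""
    return ANNOTATION.sub(lambda m: f"<nospeak>{m.group(0)}</nospeak>", text)


def spoken_text(marked: str) -> str:
    """Return the text that would actually be uttered."""
    return NOSPEAK.sub("", marked)
```

Marking rather than deleting keeps the annotation in the data, so the original text remains recoverable while the synthesizer skips it.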

(1.2) Basic Configuration of the Synthetic Speech Generation Data Formation Program
The synthetic speech generation data formation program according to the present embodiment is described below. The synthetic speech generation data formation device 10, described later, forms synthetic speech generation data in accordance with a program that causes a computer to execute at least operations (a) and (b) below. The program is recorded non-transitorily on a recording medium such as a hard disk drive or memory, or on an optical disc such as a DVD or Blu-ray (registered trademark) disc. The program may be distributed via the Internet. The program may also be stored on a cloud server and executed via the Internet.

(a) Assigning a pause length to each pause position, a pause position being a symbol included in the text of the text data or a position in the text that satisfies a predetermined condition.
(b) Forming synthetic speech generation data in which pause length information indicating the pause length is inserted at the pause positions of the text.

(1.3) Basic Configuration of the Synthetic Speech Generation Data Formation Device
The synthetic speech generation data formation device 10, which executes the synthetic speech generation data formation method according to the first embodiment, is described below with reference to FIG. 2. FIG. 2 is a functional block diagram illustrating the basic configuration of the synthetic speech generation data formation device 10 and the function of each unit.

As shown in FIG. 2, the synthetic speech generation data formation device 10 includes a text data processing unit 11 and a pause setting unit 12. The device receives text data representing, for example, the content of a book, and through processing in the text data processing unit 11 and the pause setting unit 12 forms and outputs synthetic speech generation data in which information indicating pauses of appropriate length is inserted at predetermined positions in the text.

The text data input to the synthetic speech generation data formation device 10 is, for example, typesetting data, that is, data in which characters and the like are arranged according to a layout specification. Typesetting data contains tags indicating at least one of the following as they would appear in a book: headings, the arrangement of sentences, the positions of line breaks, and the width of blank lines. Typesetting data is therefore preferable, because it makes it easy for the synthetic speech generation data formation device 10 to identify the parts of the text that correspond to headings and the final part of a group of sentences (for example, the part immediately before a heading).
The text data input to the synthetic speech generation data formation device 10 may also be, for example, manuscript data containing only character information (with no typesetting tags or the like).

The synthetic speech generation data formation device 10 also includes an input/output unit that receives text data and outputs synthetic speech generation data, a storage unit that stores a program causing a computer to execute the synthetic speech generation data formation method described above, and a control unit that controls operation within the device (none of these is shown in FIG. 2). The text data processing unit 11 and the pause setting unit 12 are realized by a computer executing the synthetic speech generation data formation program.
Each unit of the synthetic speech generation data formation device 10 is described below.

<Text data processing unit>
When the input text data is typesetting data, the text data processing unit 11 extracts the text from the text data and analyzes it. Typesetting tags are inserted in the text data. In addition, when a plurality of sentences are written under one heading and explain related content, those sentences form a "group of sentences". The text data processing unit 11 detects headings (an example of the predetermined condition) according to tags indicating headings, tags indicating line breaks, and the like.
In this case, the language dictionary 51, the part-of-speech dictionary 52, and the like shown in FIG. 2 need not be used to analyze the text.
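The tag-based heading detection described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not name the typesetting tag set, so the tags `<h1>` and `<h2>` and the function `detect_headings` are hypothetical.

```python
import re

# Hypothetical heading tags; the actual typesetting tag set is not
# specified in the source.
HEADING_TAG = re.compile(r"<(h[12])>(.*?)</\1>", re.DOTALL)


def detect_headings(typeset_text):
    """Return (tag_name, heading_text) pairs in document order."""
    return [
        (m.group(1), m.group(2).strip())
        for m in HEADING_TAG.finditer(typeset_text)
    ]


sample = "<h1>Chapter 1</h1><p>Body text.</p><h2>Section A</h2><p>More text.</p>"
headings = detect_headings(sample)
print(headings)
```

Because the headings are read directly from the tags, no dictionary lookup is involved, which matches the remark that the language dictionary 51 and the part-of-speech dictionary 52 are unnecessary in this case.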

When the input text data is manuscript data, the text data processing unit 11 analyzes the text in the text data to detect headings and groups of sentences. The text data processing unit 11 may also use, for example, the language dictionary 51 and the part-of-speech dictionary 52 to analyze the text and detect headings and groups of sentences.
The text analysis may be performed using a trained model generated by machine learning. For example, the trained model is generated by machine learning that uses, as training data, text data into which tags indicating headings, tags indicating the end of a group of sentences, and the like have been inserted. By inputting text data without such tags into the trained model and analyzing it, headings and the final part of a group of sentences can be extracted from the text data.

<Pause setting unit>
The pause setting unit 12 treats the position of each symbol included in the text in the text data as a pause position, assigns a pause length to it, and inserts pause length information indicating the pause length at the pause position. The pause setting unit 12 also inserts pause length information at predetermined positions in the text that satisfy the predetermined conditions detected by the text data processing unit 11. The pause length information is inserted, for example, immediately before or immediately after a symbol in accordance with the example rule table shown in FIG. 1. The pause setting unit 12 inserts pause length information indicating a pause length that corresponds to the detected condition.
In this way, the pause setting unit 12 forms the synthetic speech generation data and outputs it. The pause setting unit 12 may also store the synthetic speech generation data in a storage unit (not shown). The synthetic speech generation data stored in the storage unit can be output via the input/output unit.
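A minimal sketch of symbol-based pause insertion, assuming a rule table of the kind referred to above. The numeric pause lengths are placeholders, not the FIG. 1 values, and the use of full-width parentheses as the enclosing brackets is an assumption. The only relationships taken from the document are that a period gets a longer pause than a comma and that both get shorter pauses inside brackets. The function name is hypothetical.

```python
# Placeholder pause lengths in msec (assumptions, not the FIG. 1 values).
OUTSIDE_BRACKETS = {"、": 300, "。": 800}   # comma < period
INSIDE_BRACKETS = {"、": 150, "。": 400}    # shorter than outside


def insert_punctuation_pauses(text, open_ch="（", close_ch="）"):
    """Insert a pause tag immediately after each comma and period."""
    out = []
    depth = 0  # current bracket nesting depth
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
        out.append(ch)
        table = INSIDE_BRACKETS if depth > 0 else OUTSIDE_BRACKETS
        if ch in table:
            out.append(f'<vtml_pause time="{table[ch]}"/>')
    return "".join(out)


result = insert_punctuation_pauses("あ、い（う、え。）お。")
print(result)
```

Tracking the nesting depth keeps the bracket rule local to one pass over the text, so the same function handles text with and without bracketed passages.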

FIG. 3 shows a specific example of synthetic speech generation data, represented as text, into which pause length information has been inserted. For the sake of explanation, FIG. 3 shows pause length information only for some of the symbols and some of the text that satisfies the predetermined conditions.
As shown in FIG. 3, in the text of the synthetic speech generation data, pause length information "<vtml_pause time="3000"/>" indicating a pause length of 3000 msec is inserted immediately after text P1, "Chapter 1: What Is a Lower-Class Elderly Person?", which is a major heading.
Similarly, pause length information "<vtml_pause time="3000"/>" indicating a pause length of 3000 msec is inserted immediately after text P4, which is the final part of a group of sentences.

In the text of the synthetic speech generation data, pause length information "<vtml_pause time="2000"/>" indicating a pause length of 2000 msec is inserted immediately after each subheading: text P2, "What exactly is a lower-class elderly person?"; text P5, "Concrete indicators of the lower-class elderly: the three 'lacks'"; and text P7, "Income is remarkably 'lacking'".
In the text of the synthetic speech generation data, pause length information "<vtml_pause time="1000"/>" indicating a pause length of 1000 msec is inserted immediately after text P3, "Heading toward the end of life…", a sentence that ends with a three-dot leader.

In the text of the synthetic speech generation data, pause length information "<vtml_pause time="500"/>" indicating a pause length of 500 msec is inserted immediately before the parenthesis P3.
In the text of the synthetic speech generation data, pause length information "<vtml_pause time="300"/>" indicating a pause length of 300 msec is inserted immediately after the circled number P6.
In this way, the pause setting unit 12 inserts pause length information indicating a suitable pause length at each symbol included in the text in the text data and at each position in the text that satisfies a predetermined condition.
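The FIG. 3 example can be approximated with the sketch below. The pause lengths (3000, 2000, 1000, 500, and 300 msec) are the ones quoted in the description. The segment labels, the function names, and the representation of the input as pre-classified `(kind, text)` pairs are assumptions made for illustration.

```python
import re

# Pause lengths in msec quoted in the FIG. 3 description; the segment
# labels are hypothetical.
SEGMENT_PAUSE_MS = {
    "major_heading": 3000,
    "group_end": 3000,
    "subheading": 2000,
}

CIRCLED_NUMBER = re.compile("([\u2460-\u2473])")  # circled 1 to circled 20
OPEN_PAREN = re.compile("([（(])")


def pause_tag(ms):
    return f'<vtml_pause time="{ms}"/>'


def insert_symbol_pauses(text):
    # 500 msec immediately before an opening parenthesis.
    text = OPEN_PAREN.sub(lambda m: pause_tag(500) + m.group(1), text)
    # 300 msec immediately after a circled number.
    text = CIRCLED_NUMBER.sub(lambda m: m.group(1) + pause_tag(300), text)
    # 1000 msec after a sentence that ends with a three-dot leader.
    if text.endswith("…"):
        text += pause_tag(1000)
    return text


def form_generation_data(segments):
    """segments: iterable of (kind, text); returns one line per segment."""
    lines = []
    for kind, text in segments:
        line = insert_symbol_pauses(text)
        ms = SEGMENT_PAUSE_MS.get(kind)
        if ms is not None:
            line += pause_tag(ms)
        lines.append(line)
    return lines


example = form_generation_data([
    ("major_heading", "Chapter 1"),
    ("body", "①First point（note）"),
    ("body", "Toward the end…"),
    ("group_end", "Final sentence."),
])
print("\n".join(example))
```

Separating symbol-level rules from segment-level rules mirrors the two kinds of pause positions in the description: symbols in the text, and positions that satisfy a condition such as being a heading.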

The pause setting unit 12 may also form synthetic speech generation data that contains the typesetting tags together with the pause length information.
The synthetic speech generation data forming apparatus described above can generate synthetic speech generation data for synthesizing speech with natural utterance, as if a person were reading the text aloud.

2. Second Embodiment
A synthetic speech generation data forming method according to a second embodiment is described below. The second embodiment also describes a synthetic speech generation data forming program that causes a computer to execute the synthetic speech generation data forming method, and a synthetic speech generation data forming apparatus.

(2.1) Synthetic speech generation data forming method
The synthetic speech generation data forming method according to the second embodiment is carried out by steps (A) and (B) of the synthetic speech generation data forming method according to the first embodiment together with the following step.
(C) Inserting, at a position in the text in the text data that satisfies a predetermined condition, acoustic information for adding an acoustic effect to the speech data
For example, when the text is a heading, the acoustic information is inserted immediately after the heading. The acoustic information includes, for example, link destination information indicating the link destination of acoustic data, that is, the location where the acoustic data is stored.

As described above, synthetic speech generation data is formed in which pause length information and acoustic information for adding acoustic effects to the speech data are inserted at predetermined positions in the text. When speech synthesis is performed using such synthetic speech generation data, speech with more natural utterance can be generated even though it is synthetic, and scene changes can be made easy for the listener to follow using the synthetic speech alone. Moreover, the link destination can be edited and acoustic settings such as reverb and echo can be configured simply by editing the text in the tags that represent the acoustic information, without any audio editing equipment, which makes it easy to generate and edit the synthetic speech generation data.

(2.2) Basic configuration of the synthetic speech generation data forming program
The synthetic speech generation data forming program according to this embodiment is described next. The synthetic speech generation data forming apparatus 20, described later, forms synthetic speech generation data in accordance with a program that causes a computer to execute operations (a) and (b) described in the first embodiment and the following operation (c).
(c) Inserting, at a position in the text in the text data that satisfies a predetermined condition, acoustic information for adding an acoustic effect to the speech data

(2.3) Basic configuration of the synthetic speech generation data forming apparatus
The synthetic speech generation data forming apparatus 20, which executes the synthetic speech generation data forming method according to the second embodiment, is described below with reference to FIG. 4. FIG. 4 is a functional block diagram illustrating the basic configuration of the synthetic speech generation data forming apparatus 20 and the function of each unit.

As shown in FIG. 4, the synthetic speech generation data forming apparatus 20 includes a sound setting unit 23 in addition to the text data processing unit 11 and the pause setting unit 12. In other words, the synthetic speech generation data forming apparatus 20 differs from the synthetic speech generation data forming apparatus 10 in that it includes the sound setting unit 23. By inserting acoustic information for adding acoustic effects to the speech data at positions that satisfy a predetermined condition, the synthetic speech generation data forming apparatus 20 makes it possible to add sound effects, background music, and acoustic effects such as reverb (reverberation) and echo to the synthetic speech.
The sound setting unit 23 is described below. The text data processing unit 11 and the pause setting unit 12 have the same configurations as the corresponding units described in the first embodiment, so their description is omitted.

<Sound setting unit>
The sound setting unit 23 inserts acoustic information at positions in the text in the text data that satisfy a predetermined condition. The acoustic information is used to add acoustic effects to the speech data obtained by reading the text aloud, that is, to apply effects such as reverb and echo or to add sound effects.
When the text is a heading, the sound setting unit 23 inserts acoustic information at least one of before and after the heading. The acoustic information includes, for example, link destination information indicating the link destination of sound effect data, that is, the location where the sound effect data is stored.
When the text is a heading, the sound setting unit 23 may also insert, before and after the heading, acoustic information indicating the start point or end point of an acoustic effect such as reverb or echo. In this case, a tag indicating the start of the acoustic effect is inserted as acoustic information before the heading, and a tag indicating the end of the acoustic effect is inserted as acoustic information after the heading.

FIG. 5 shows a specific example of synthetic speech generation data, represented as text, into which acoustic information has been inserted together with pause length information. For the sake of explanation, FIG. 5 shows pause length information and acoustic information only for some of the symbols and some of the text that satisfies the predetermined conditions.
As shown in FIG. 5, in the text of the synthetic speech generation data, two pieces of acoustic information are inserted immediately before text P1, "Chapter 1: What Is a Lower-Class Elderly Person?", which is a major heading: "<vtml_mark name="reverb_start"/>", which indicates the start of reverb as an acoustic effect, and "<vtml_mark name="sound:効果音ファイル.wav"/>", which indicates the link destination (the URL of the storage location) of the sound effect to be played. In addition, acoustic information "<vtml_mark name="reverb_end"/>" indicating the end of the reverb is inserted immediately after text P1.
The synthetic speech generation data forming apparatus described above can generate synthetic speech generation data for synthesizing speech with natural utterance, as if a person were reading the text aloud.
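The heading markup shown in FIG. 5 can be sketched as below. The `vtml_mark` tag and the `reverb_start`, `reverb_end`, and `sound:` names come from the description. The function names and the file name `chime.wav` are hypothetical.

```python
def mark(name):
    """Build an acoustic-information tag in the form used in FIG. 5."""
    return f'<vtml_mark name="{name}"/>'


def add_heading_sound(heading, sound_file=None, reverb=True):
    """Wrap a heading with reverb start/end marks and an optional sound link."""
    before = []
    if reverb:
        before.append(mark("reverb_start"))
    if sound_file:
        # Link destination of the sound effect to play at the heading.
        before.append(mark(f"sound:{sound_file}"))
    after = mark("reverb_end") if reverb else ""
    return "".join(before) + heading + after


# "chime.wav" is a hypothetical file name used only for this example.
marked = add_heading_sound("Chapter 1", sound_file="chime.wav")
print(marked)
```

Because the acoustic information is plain text inside a tag, changing the linked file or switching the reverb off is a text edit rather than an audio edit.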

The synthetic speech generation data forming apparatus described above can generate synthetic speech generation data that yields natural utterance, as if a person were reading the text aloud, and that includes acoustic information that makes scene changes easy for the listener to recognize.

3. Third Embodiment
A synthetic speech generation apparatus and a synthetic speech generation method according to a third embodiment are described below.
(3.1) Basic configuration of the synthetic speech generation method
The synthetic speech generation method according to the third embodiment is described below.
The synthetic speech generation method according to the third embodiment is carried out by at least the following steps.
(P) Acquiring synthetic speech generation data in which pause length information indicating the length of a pause is inserted at each pause position, that is, each position in the text corresponding to the text data where a predetermined pause is to be placed
(Q) Converting the sentences corresponding to the text data into phonetic transcription
(R) Generating prosodic information on intonation and duration using the phonetic transcription
(S) Selecting synthesis units corresponding to the phonetic transcription from a speech database that contains speech data uttered by humans for each synthesis unit
(T) Generating synthetic speech by connecting the synthesis units, with a pause of the length indicated by the pause length information placed at each pause position included in the synthetic speech generation data, and by adding the prosodic information

In the synthetic speech generation method according to the third embodiment, the following step may be executed before the synthetic speech generation data is acquired.
(O) Forming synthetic speech generation data in which pause length information indicating the length of a pause is inserted at each pause position, that is, each position in the text corresponding to the text data where a predetermined pause is to be placed
In other words, in the synthetic speech generation method according to the third embodiment, the synthetic speech generation data may be formed separately.

(3.2) Basic configuration of the synthetic speech generation apparatus
The synthetic speech generation apparatus 100, which executes the synthetic speech generation data forming method according to the third embodiment, is described below with reference to FIG. 6. FIG. 6 is a functional block diagram illustrating the basic configuration of the synthetic speech generation apparatus 100 and the function of each unit.

As shown in FIG. 6, the synthetic speech generation apparatus 100 includes a language processing unit 110, a prosody processing unit 120, and a speech synthesis unit 130. The synthetic speech generation apparatus 100 receives text data representing, for example, the content of a book, and generates synthetic speech with natural utterance through processing in the language processing unit 110, the prosody processing unit 120, and the speech synthesis unit 130. The text data input to the synthetic speech generation apparatus 100 is, for example, text data into which no pause length information has been inserted.

<Language processing unit>
The language processing unit 110 includes the functions of the units of the synthetic speech generation data forming apparatus 10 described in the first embodiment and of the synthetic speech generation data forming apparatus 20 described in the second embodiment.
The language processing unit 110 includes, for example, a text data processing unit 111, a pause setting unit 112, and a phonetic transcription conversion unit 114. Within the language processing unit 110, the text data processing unit 111 and the pause setting unit 112 form a synthetic speech generation data forming unit 115, which has the same functions as the synthetic speech generation data forming apparatus 10 shown in FIG. 2. The synthetic speech generation data forming unit 115 forms synthetic speech generation data in which pause length information is inserted at the symbols included in the text in the text data and at the positions in the text that satisfy predetermined conditions (the pause positions). The synthetic speech generation data forming unit 115 may instead have the same configuration as the synthetic speech generation data forming apparatus 20 shown in FIG. 4.

The text data processing unit 111 and the pause setting unit 112 of the language processing unit 110 have the same functions as the text data processing unit 11 and the pause setting unit 12 of the synthetic speech generation data forming apparatus 10. In other words, the language processing unit 110 differs from the synthetic speech generation data forming apparatus 10 described in the first embodiment in that it includes the phonetic transcription conversion unit 114.
The phonetic transcription conversion unit 114 of the language processing unit 110, the prosody processing unit 120, and the speech synthesis unit 130 are described below. Description of the text data processing unit 111 and the pause setting unit 112 is omitted.

<Language processing unit>
(Phonetic transcription conversion unit)
The phonetic transcription conversion unit 114 converts the sentences corresponding to the input text data into phonetic transcription. The text data input to the phonetic transcription conversion unit 114 is synthetic speech generation data in which the pause setting unit 112 has inserted pause length information at the predetermined pause positions in the text corresponding to the text data.
The phonetic transcription conversion unit 114 can communicate with, for example, the pronunciation dictionary 53, and converts the text corresponding to the input text data (the synthetic speech generation data) into phonetic transcription.
The synthetic speech generation apparatus 100 may further include a text data storage unit that stores text data (or synthetic speech generation data), and the language processing unit 110 may acquire the text data from the text data storage unit.
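One way to picture the input to the phonetic transcription conversion is to split the synthetic speech generation data into text runs and pause markers before any dictionary lookup. The sketch below does that. The token format, the function names, and the toy whitespace-based lexicon are assumptions; real Japanese text would need morphological analysis rather than `split()`, and the lookup is not the method of the pronunciation dictionary 53.

```python
import re

PAUSE_TAG = re.compile(r'<vtml_pause time="(\d+)"/>')


def tokenize(data):
    """Split generation data into ("text", str) and ("pause", msec) tokens."""
    tokens = []
    pos = 0
    for m in PAUSE_TAG.finditer(data):
        if m.start() > pos:
            tokens.append(("text", data[pos:m.start()]))
        tokens.append(("pause", int(m.group(1))))
        pos = m.end()
    if pos < len(data):
        tokens.append(("text", data[pos:]))
    return tokens


def to_phonetic(tokens, lexicon):
    """Replace text tokens with per-word lookups; pauses pass through unchanged."""
    out = []
    for kind, value in tokens:
        if kind == "text":
            # Toy lookup: unknown words are kept as written.
            out.append(("phones", [lexicon.get(w, w) for w in value.split()]))
        else:
            out.append((kind, value))
    return out


tokens = tokenize('Chapter one<vtml_pause time="3000"/>Body')
print(to_phonetic(tokens, {"Chapter": "CH AE P T ER", "one": "W AH N"}))
```

Keeping pauses as separate tokens lets the later stages place silence at exactly the positions marked in the data.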

<Prosody processing unit>
The prosody processing unit 120 includes a prosody information generation unit 121, which uses the phonetic transcription acquired from the phonetic transcription conversion unit 114 of the language processing unit 110 to generate prosodic information on the intonation and duration of the sentences. The prosody processing unit 120 outputs the generated prosodic information to the speech synthesis unit 130.

<Speech synthesis unit>
The speech synthesis unit 130 generates synthetic speech based on the text data (the synthetic speech generation data) and on a speech database 54 that contains speech data uttered by humans for each synthesis unit. The speech synthesis unit 130 includes a synthesis unit selection unit 131 and a synthesis unit connection unit 132.
Each unit of the speech synthesis unit 130 is described below.

(Synthesis unit selection unit)
The synthesis unit selection unit 131 selects and extracts, from the speech database 54, the synthesis units that correspond to the phonetic transcription acquired from the phonetic transcription conversion unit 114. The synthesis unit selection unit 131 sends the extracted synthesis units to the synthesis unit connection unit 132.

(Synthesis unit connection unit)
The synthesis unit connection unit 132 connects the synthesis units extracted by the synthesis unit selection unit 131 and adds the prosodic information to generate synthetic speech. In doing so, the speech synthesis unit 130 connects the synthesis units with a pause of the length indicated by the pause length information placed at each pause position included in the synthetic speech generation data, and thereby generates synthetic speech with natural utterance. When the synthetic speech generation data contains, together with the pause length information, information on the headings, the arrangement of sentences, the positions of line breaks, and the width of blank lines in the resulting book, synthetic speech with natural utterance may be generated by connecting the synthesis units with pauses that correspond to the headings, the line breaks, and the width of the blank lines.
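The connection step can be illustrated by inserting silence of the indicated length between waveform fragments. This is a minimal sketch under stated assumptions: the 16 kHz sample rate, the list-of-floats waveform representation, and the function names are hypothetical, and the addition of prosodic information is omitted.

```python
SAMPLE_RATE = 16000  # assumed sample rate in Hz


def silence(ms, sample_rate=SAMPLE_RATE):
    """Return ms milliseconds of silence as zero-valued samples."""
    return [0.0] * (sample_rate * ms // 1000)


def connect_units(items, sample_rate=SAMPLE_RATE):
    """Concatenate ("unit", samples) items, expanding ("pause", ms) to silence."""
    out = []
    for kind, value in items:
        if kind == "unit":
            out.extend(value)
        elif kind == "pause":
            out.extend(silence(value, sample_rate))
        else:
            raise ValueError(f"unknown item kind: {kind}")
    return out


waveform = connect_units([
    ("unit", [0.1] * 160),   # 10 ms fragment
    ("pause", 500),          # 500 ms pause from the pause length information
    ("unit", [0.2] * 80),    # 5 ms fragment
])
print(len(waveform))
```

At 16 kHz a 500 msec pause is 8000 samples, so the example waveform is 160 + 8000 + 80 samples long.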

(3.3) Modification 1
The third embodiment describes the synthetic speech generation apparatus 100, which receives the text data to be synthesized and performs speech synthesis based on that text data, but the configuration is not limited to this.
For example, a synthetic speech generation apparatus 100A according to Modification 1 may be an apparatus that receives separately generated synthetic speech generation data and performs speech synthesis. In this case, as shown in FIG. 7, the language processing unit 110A of the synthetic speech generation apparatus 100A need not have the text data processing unit 111 or the pause setting unit 112, as long as it includes at least the phonetic transcription conversion unit 114. The synthetic speech generation data generated by the synthetic speech generation data forming unit 115 of the third embodiment is input to the synthetic speech generation apparatus 100A. For this reason, a language processing unit 110A that includes the phonetic transcription conversion unit 114 can fulfill its function in the synthetic speech generation apparatus 100A.

(3.4) Modification 2
The language processing unit 110 may include a sound setting unit (not shown) together with, for example, the text data processing unit 111, the pause setting unit 112, and the phonetic transcription conversion unit 114. In this case, the sound setting unit has the same functions as the sound setting unit 23 of the synthetic speech generation data forming apparatus 20.
When the language processing unit 110 includes a sound setting unit, the speech synthesis unit 130 can, when connecting the synthesis units, overlay a sound effect acquired from the link destination in the acoustic information at a predetermined position (for example, the position of a heading), or apply an acoustic effect such as reverb to, for example, a heading.

(3.5) Modification 3
The third embodiment describes the case in which the language processing unit 110 of the synthetic speech generation apparatus 100 has the same functions as the synthetic speech generation data forming apparatus 10, but the configuration is not limited to this.
For example, the language processing unit 110 of the synthetic speech generation apparatus 100 may have the same functions as the synthetic speech generation data forming apparatus 20 according to the second embodiment.

Embodiments of the present disclosure have been described above, but the technical scope of the present disclosure is not limited to the scope described in those embodiments. Various changes or improvements can be made to the embodiments described above, and it is clear from the claims that forms incorporating such changes or improvements can also fall within the technical scope of the present disclosure.

10, 20 Synthetic speech generation data forming apparatus
11, 111 Text data processing unit
12, 112 Pause setting unit
30 Machine learning device
31 Text data conversion unit
32 Storage unit
33 Learning data extraction unit
34 Learning unit
51 Language dictionary
52 Part-of-speech dictionary
53 Pronunciation dictionary
54 Speech database
100 Synthetic speech generation apparatus
110 Language processing unit
114 Phonetic transcription conversion unit
120 Prosody processing unit
121 Prosody information generation unit
130 Speech synthesis unit
131 Synthesis unit selection unit
132 Synthesis unit connection unit

Claims

A synthetic speech generation data forming method comprising:
assigning a pause length, as a pause position, to a symbol included in text in text data and to a position in the text that satisfies a predetermined condition; and
forming synthetic speech generation data in which pause length information indicating the pause length is inserted at the pause position.

When the symbol included in the text is a comma, the pause length information indicating a first pause length is inserted as the pause length immediately after the comma, and
when the symbol included in the text is a period, the pause length information indicating a second pause length longer than the first pause length is inserted as the pause length immediately after the period. The method of forming synthetic speech generation data according to .

The method of forming synthetic speech generation data according to claim 2, wherein, when the symbol included in text located between square brackets in the text data is a comma, the pause length information indicating a third pause length shorter than the first pause length is inserted as the pause length immediately after that comma, and
when the symbol included in the text located between the square brackets is a period, the pause length information indicating a fourth pause length shorter than the second pause length is inserted as the pause length immediately after that period.

The method of forming synthetic speech generation data according to claim 1 or 2, wherein, when the symbol included in the text is a pair of parentheses, the pause length information indicating a fifth pause length is inserted as the pause length at least immediately before the opening parenthesis.

5. The synthetic speech generation data forming method according to any one of claims 1 to 4, wherein when the text is a headline, the pause length information indicating a sixth pause length is inserted immediately after the headline, the sixth pause length being longer than the other pause lengths inserted immediately before or after text other than the headline.

6. The synthetic speech generation data forming method according to any one of claims 1 to 5, wherein when the text forms a semantic unit, the pause length information indicating a seventh pause length is inserted immediately after that unit of text.
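The block-level rules of claims 4 to 6 can be sketched in the same style. The block kinds (`"headline"`, `"body"`), the choice of a paragraph as the semantic unit, the marker syntax, and all millisecond values are assumptions made for illustration; the claims require only that the sixth pause length exceed the other pause lengths placed around non-headline text.

```python
# Hypothetical pause lengths in milliseconds.
FIFTH_MS = 200     # before an opening parenthesis
SIXTH_MS = 1200    # after a headline
SEVENTH_MS = 800   # after a semantic unit (here assumed to be a paragraph)


def mark_block(kind: str, text: str) -> str:
    """Mark one block of text; kind is "headline" or "body"."""
    # Fifth pause length: immediately before each opening parenthesis.
    for opener in ("(", "（"):
        text = text.replace(opener, f"<pause={FIFTH_MS}ms>{opener}")
    # Sixth pause length after a headline, seventh after any other unit.
    trailing = SIXTH_MS if kind == "headline" else SEVENTH_MS
    return f"{text}<pause={trailing}ms>"
```

For example, `mark_block("body", "See (note) here")` places a short pause before the parenthesis and a longer one after the paragraph.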

7. The synthetic speech generation data forming method according to any one of claims 1 to 6, wherein acoustic information for adding a sound effect to the speech data is inserted at a position in the text of the text data that satisfies a predetermined condition.

8. The synthetic speech generation data forming method according to claim 7, wherein when the text is a headline, the acoustic information is inserted immediately after the headline.

9. The synthetic speech generation data forming method according to claim 8, wherein the acoustic information includes link destination information indicating a link destination of the acoustic data.
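Claims 7 to 9 can be illustrated by a short sketch that places acoustic information carrying a link destination immediately after each headline. The `Block` type, the `<sound href="..."/>` marker, and the example link are hypothetical; the claims do not specify how the acoustic information or the link destination is encoded.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "headline" or "body" (assumed block kinds)
    text: str


def add_sound_effects(blocks: list[Block], link: str) -> list[str]:
    """Return the block texts with a sound marker after every headline."""
    out = []
    for block in blocks:
        out.append(block.text)
        if block.kind == "headline":
            # Hypothetical marker holding the link destination of the acoustic data.
            out.append(f'<sound href="{link}"/>')
    return out
```

A downstream synthesizer could resolve the link to a sound file and play it where the marker appears; that step is outside this sketch.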

10. A synthetic speech generation method comprising:
acquiring synthetic speech generation data in which pause length information indicating the length of a pause is inserted at a pause position, the pause position being a position in the text corresponding to text data at which a predetermined pause is to be inserted;
converting the text corresponding to the text data into a phonetic transcription;
generating prosody information on intonation and duration using the phonetic transcription;
selecting synthesis units corresponding to the phonetic transcription from a speech database containing human-produced speech data for each synthesis unit; and
generating synthetic speech by connecting the synthesis units, with a pause of a length corresponding to the pause length information placed at the pause position included in the synthetic speech generation data, and adding the prosody information.
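The connection step of claim 10 can be sketched as a toy pipeline that reads marked-up text, looks up a waveform fragment per symbol, and inserts silence at each pause position. Everything here is an assumption made for illustration: the `<pause=Nms>` marker, the letter-by-letter "phonetic transcription", the dictionary standing in for the speech database, and the sample rate. Prosody generation is omitted entirely.

```python
import re

SAMPLE_RATE = 1000  # samples per second; toy value
PAUSE_RE = re.compile(r"<pause=(\d+)ms>")


def split_on_pauses(data: str):
    """Yield (text, pause_ms) pairs from marked-up text."""
    # With one capture group, re.split returns [text, ms, text, ms, ..., text].
    parts = PAUSE_RE.split(data)
    for i in range(0, len(parts), 2):
        pause_ms = int(parts[i + 1]) if i + 1 < len(parts) else 0
        yield parts[i], pause_ms


def to_phonetic(text: str) -> list[str]:
    """Toy stand-in for phonetic transcription: one symbol per letter."""
    return [c for c in text.lower() if c.isalpha()]


def synthesize(data: str, unit_db: dict[str, list[float]]) -> list[float]:
    """Concatenate synthesis units, inserting silence at each pause position."""
    wave: list[float] = []
    for text, pause_ms in split_on_pauses(data):
        for symbol in to_phonetic(text):
            wave.extend(unit_db[symbol])  # select and connect the unit
        wave.extend([0.0] * (SAMPLE_RATE * pause_ms // 1000))  # silence
    return wave
```

Keeping the pause length in the data rather than in the synthesizer means the same synthesis code can render different pause policies without change.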

11. A synthetic speech generation device comprising:
a synthetic speech generation data acquisition unit that acquires synthetic speech generation data in which pause length information indicating the length of a pause is inserted at a pause position, the pause position being a position in the text corresponding to text data at which a predetermined pause is to be inserted;
a phonetic transcription conversion unit that converts the text corresponding to the text data into a phonetic transcription;
a prosody processing unit that generates prosody information on the intonation and duration of the text using the phonetic transcription obtained from the phonetic transcription conversion unit; and
a speech synthesis unit that selects synthesis units corresponding to the phonetic transcription obtained from the phonetic transcription conversion unit from a speech database containing human-produced speech data for each synthesis unit, connects the synthesis units with a pause of a length corresponding to the pause length information placed at the pause position included in the synthetic speech generation data, and adds the prosody information to generate synthetic speech.

12. The synthetic speech generation device according to claim 11, further comprising a synthetic speech generation data formation unit that assigns a pause length, as the pause position, to a position that satisfies a predetermined condition among the symbols included in the text and the text itself, and that forms the synthetic speech generation data in which the pause length information indicating the pause length is inserted at the pause position,
wherein the synthetic speech generation data acquisition unit acquires the synthetic speech generation data from the synthetic speech generation data formation unit.

13. The synthetic speech generation device according to claim 11, wherein the text data includes information on the arrangement of the text, line feed positions, and blank line widths when the text is laid out as a book.

14. The synthetic speech generation device according to claim 13, further comprising a text data storage unit that stores the text data.