JP4778402B2

JP4778402B2 - Pause time length calculation device, program thereof, and speech synthesizer

Info

Publication number: JP4778402B2
Application number: JP2006301711A
Authority: JP
Inventors: 信正清山; 徹都木
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2006-11-07
Filing date: 2006-11-07
Publication date: 2011-09-21
Anticipated expiration: 2026-11-07
Also published as: JP2008116826A

Abstract

PROBLEM TO BE SOLVED: To provide a pause duration calculation device and its program, which calculate the pause duration between voice component data by which a natural listening feeling is obtained when the voice component data are connected in voice synthesis of the recording-and-editing system, and a voice synthesizer equipped with the pause duration calculation device.

SOLUTION: In voice synthesis by connecting voice component data in which a voice waveform produced by uttering a predetermined unit of text is recorded, the pause duration calculation device 40 comprises: an acoustic feature amount detector 410 for detecting a predetermined acoustic feature amount in the voice waveform recorded in the voice component data; an acoustic distance calculation section 430 for calculating an acoustic distance between the acoustic feature amount of the preceding voice component data and the acoustic feature amount of the following voice component data among the voice component data connected together; and a pause duration calculation section 440 for calculating the pause duration to be inserted between the voice component data by using a calculation formula set beforehand.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to speech synthesis by the recording-and-editing method, in which speech is synthesized by connecting speech component data that hold pre-recorded speech of uttered text.

In speech synthesis by the so-called recording-and-editing method, which connects speech component data in which words, phrases, fixed sentences, and the like have been recorded in advance, the vocabulary and sentence types that can be synthesized are limited. Because the speech units handled are relatively long, however, higher-quality speech is obtained. Even so, quality has been affected by the continuity of the acoustic features of the speech (pitch frequency, speech rate, power, spectral envelope, and so on) at the junctions between speech component data.

For this reason, when a word or phrase is inserted into a fixed sentence, or when a sentence is built by combining words and phrases, one of two approaches has generally been taken. One is to take care from the recording stage so that no discontinuity in acoustic features arises when the speech component data are connected, for example by recording speech uttered with the acoustic features before and after each component at connection time in mind. The other is to connect the speech component data with a somewhat long pause between them so that the discontinuity is not noticeable.

Meanwhile, in conventional rule-based speech synthesis, which can synthesize arbitrary text, the pitch frequency, duration, and power of each speech segment, as well as the pause time length (pause length) between speech segments, must be finely controlled to obtain more natural synthesized speech, and various prosody control methods have been proposed.

Among these, control rules based on the analysis of actually uttered speech data have also been proposed for setting the pause length. For example, Patent Document 1 proposes a method of setting the pause length according to information on the input synthesized speech. The rule-based speech synthesizer described in Patent Document 1 uses information such as the phoneme duration and fundamental frequency of the input synthesized speech together with preset pause-setting rules. Depending on the dependency relation between the preceding phrase and the following phrase and on the presence or absence of a comma, it decides whether to insert a pause at the phrase boundary. When a pause is inserted, it sets a reference value according to the type of pause, corrects the pause length for grammatical factors that affect it, and sets the result to an integer multiple of one mora length.

Patent Document 2 proposes a recording-and-editing speech synthesizer in which speech waveform data of phoneme strings are cut out from speech waveform data of recorded human utterances to build a phoneme-string database in advance, and the phoneme-string data corresponding to the phoneme strings that make up the text to be synthesized are then connected. When connecting phoneme-string data, the speech synthesizer described in Patent Document 2 uses, as the silent length at the junction, the shorter of the silent length at the rear end of the preceding phoneme-string data and the silent length at the front end of the following phoneme-string data, so that natural speech synthesis is possible even when one of the silent lengths is extremely long.

Patent Document 1: Japanese Patent No. 3060422 (paragraphs 0019 to 0023, FIG. 3)
Patent Document 2: JP 2002-41077 A (paragraph 0012)

However, in conventional recording-and-editing speech synthesis, recording so that no discontinuity arises in the acoustic features at the junctions between speech component data places an excessive burden on the speaker and the recording operators. They must listen to previously recorded speech, refer to a stimulus that serves as the target for the utterance, and proceed with recording while checking how well each take matches the speech already recorded.

When components are instead connected with a somewhat long pause as a safety margin so that discontinuities are not noticeable, the speech becomes drawn out and loses naturalness. Avoiding this by setting a pause of appropriate length requires repeated trial listening with different lengths, which is inefficient. Moreover, because fixed-length pauses are generally used, the reading tends to sound regular and give a so-called mechanical impression.

The rule-based speech synthesizer described in Patent Document 1, on the other hand, is intended for rule-based synthesis. It sets a reference value according to the input synthesized-speech information, the dependency relation between the preceding and following phrases, the presence or absence of a comma, and the type of pause, and then corrects the pause length for grammatical factors that affect it. Applying it to the recording-and-editing method is therefore conceivable.

However, if the device described in Patent Document 1 is applied to the recording-and-editing method, sentences built by combining speech component data into a fixed sentence, or by inserting words or phrases into a fixed sentence, share the same grammatical structure, and the same grammatical structure always yields the same pause length.

In rule-based synthesis, the acoustic features on either side of a pause can be controlled and matched in the first place, so giving the same pause length to sentences with the same grammatical structure causes no problem. In the recording-and-editing method, however, unless an appropriate pause length is set in view of the acoustic features of the speech at the junction between speech component data, a sense of discontinuity arises at the junction and the naturalness of the synthesized speech is impaired. In addition, giving the same pause length to sentences with the same grammatical structure again makes the reading regular and produces a so-called mechanical impression.

The speech synthesizer described in Patent Document 2 takes, as the silent length between connected phoneme-string data, the shorter of the silent lengths that happen to exist at the ends of the phoneme-string data as created. It does not adequately consider the continuity of acoustic features.

Accordingly, an object of the present invention is to provide a pause time length calculation device and a program therefor that, in recording-and-editing speech synthesis, calculate a pause time length between speech component data that gives a natural listening impression when the speech component data are connected, and to provide a speech synthesizer equipped with this pause time length calculation device.

To that end, the pause time length calculation device according to claim 1 calculates the pause time length to be inserted between mutually connected speech component data when speech is synthesized by connecting speech component data that record the speech waveform of an utterance of a predetermined unit of text. The device comprises acoustic feature acquisition means, acoustic distance calculation means, and pause time length calculation means.

With this configuration, the pause time length calculation device uses the acoustic feature acquisition means to acquire the acoustic features of the speech waveforms recorded in the preceding speech component data and in the following speech component data among the mutually connected speech component data. The acquired features are at least one of: the pitch frequency, which represents the pitch of the voice; the speech rate, which represents the speed of the utterance; the power, which represents the loudness of the speech; and the spectral envelope, which represents the timbre of the speech. Next, the acoustic distance calculation means calculates an acoustic distance representing the acoustic difference between the acoustic features of the preceding speech component data and those of the following speech component data. The pause time length calculation means then calculates, from this acoustic distance and a preset calculation formula, the pause time length to be inserted between the preceding and following speech component data.

The pause time length calculation device according to claim 2 is the device according to claim 1, configured to use, as the calculation formula, a regression equation whose explanatory variable is the acoustic distance calculated by the acoustic distance calculation means.

With this configuration, the pause time length calculation device can calculate the pause time length by a product-sum operation on the acoustic distances between the mutually connected speech component data and the regression coefficients.
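The product-sum form described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `pause_length_ms`, the presence of an intercept term, the millisecond unit, and the clamp at zero are not taken from the patent text, which says only that acoustic distances are combined with regression coefficients.

```python
from typing import Sequence


def pause_length_ms(distances: Sequence[float],
                    coefficients: Sequence[float],
                    intercept: float = 0.0) -> float:
    """Regression-style pause length: intercept + sum(coef_k * distance_k).

    `intercept` and the clamp at zero are assumptions of this sketch.
    """
    if len(distances) != len(coefficients):
        raise ValueError("one coefficient is needed per acoustic distance")
    # Product-sum of acoustic distances and regression coefficients.
    value = intercept + sum(c * d for c, d in zip(coefficients, distances))
    # A pause cannot be negative, so clamp (hypothetical safeguard).
    return max(0.0, value)
```

With two hypothetical distances `[2.0, 0.5]`, coefficients `[100.0, 40.0]`, and an intercept of `50.0`, the sketch evaluates 50 + 200 + 20. The coefficient values would have to be fitted to listening data; none are given in this section.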

The program according to claim 3 for calculating the pause time length between speech component data causes a computer to function as acoustic feature acquisition means, acoustic distance calculation means, and pause time length calculation means, in order to calculate the pause time length to be inserted between mutually connected speech component data when speech is synthesized by connecting speech component data that record the speech waveform of an utterance of a predetermined unit of text.

With this configuration, the program uses the acoustic feature acquisition means to acquire the acoustic features of the speech waveforms recorded in the preceding speech component data and in the following speech component data among the mutually connected speech component data. The acquired features are at least one of: the pitch frequency, which represents the pitch of the voice; the speech rate, which represents the speed of the utterance; the power, which represents the loudness of the speech; and the spectral envelope, which represents the timbre of the speech. Next, the acoustic distance calculation means calculates an acoustic distance representing the acoustic difference between the acoustic features of the preceding speech component data and those of the following speech component data. The pause time length calculation means then calculates, from this acoustic distance and a preset calculation formula, the pause time length to be inserted between the preceding and following speech component data.

The program can thereby calculate an appropriate pause time length according to the acoustic distance between the mutually connected speech component data.

The speech synthesizer according to claim 4 synthesizes speech by connecting speech component data that record the speech waveform of an utterance of a predetermined unit of text, and comprises speech component data storage means, reading information acquisition means, speech component data acquisition means, and the pause time length calculation device.

With this configuration, the speech synthesizer first uses the reading information acquisition means to acquire reading information. The reading information consists either of the text to be synthesized, which is read out continuously in a fixed order, or of information designating the speech component data that correspond to the predetermined units of text making up that text. Next, based on the acquired reading information, the speech component data acquisition means acquires the desired speech component data from the speech component data storage means, which stores speech component data with pre-recorded speech waveforms. The pause time length calculation device then calculates the pause time length to be inserted between the speech component data that make up the text to be synthesized, and sets it as the pause time between those speech component data.

The speech synthesizer can thereby create speech synthesis data in which a pause time corresponding to the acoustic distance between each pair of mutually connected speech component data is inserted between them.
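The flow just described (acquire reading information, fetch the components, compute and set a pause for each junction) can be sketched as below. The names `build_synthesis_plan`, `store`, and `pause_fn`, the dictionary layout, the F0 values, and the toy pause rule are all hypothetical; the sketch also assumes that no pause is set after the final component.

```python
from typing import Callable, Dict, List, Tuple


def build_synthesis_plan(
    reading_info: List[str],
    store: Dict[str, dict],
    pause_fn: Callable[[dict, dict], float],
) -> List[Tuple[str, float]]:
    """Pair each requested component with the pause to insert after it."""
    # Fetch the components named by the reading information, in order.
    components = [store[key] for key in reading_info]
    plan = []
    for idx, key in enumerate(reading_info):
        if idx + 1 < len(components):
            # Pause depends on this component and the one that follows it.
            pause_ms = pause_fn(components[idx], components[idx + 1])
        else:
            pause_ms = 0.0  # assumption: nothing follows the last component
        plan.append((key, pause_ms))
    return plan


# Made-up feature values, for illustration only.
store = {
    "k_hoso": {"start_f0": 210.0, "end_f0": 180.0},
    "40_yen": {"start_f0": 220.0, "end_f0": 170.0},
}
toy_pause = lambda a, b: 5.0 * abs(b["start_f0"] - a["end_f0"])
plan = build_synthesis_plan(["k_hoso", "40_yen"], store, toy_pause)
```

Passing the pause rule in as a callable keeps the plan builder independent of which acoustic features and which formula are actually used.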

According to the invention of claim 1 or claim 3, in recording-and-editing speech synthesis, the pause time inserted between mutually connected speech component data is calculated according to the acoustic distance between them. Speech reproduced with this pause inserted can therefore be synthesized speech with a natural listening impression, free of any sense of discontinuity or mechanical impression.

In addition, the pause time length is calculated according to the difference between the speech component data in pitch frequency (the pitch of the voice), in speech rate (the speed of the utterance), in power (the loudness of the speech), or in spectral envelope (the timbre of the speech). Speech reproduced with this pause inserted can therefore be synthesized speech with a natural listening impression that gives no sense of discontinuity in the pitch, utterance speed, loudness, or timbre represented by the acoustic feature used.

According to the invention of claim 2, the pause time length is calculated by a regression equation whose explanatory variable is the acoustic distance, so it can be computed simply as a product-sum of the acoustic distances and the coefficients of the regression equation.

According to the invention of claim 4, in recording-and-editing speech synthesis, the pause time inserted between mutually connected speech component data is calculated and set according to the acoustic distance between them. Reproducing the speech synthesis data created with this pause inserted therefore yields speech with a natural listening impression, free of any sense of discontinuity or mechanical impression.

Embodiments of the present invention are described in detail below with reference to the drawings as appropriate.
<Configuration of speech synthesizer>
First, the configuration of the speech synthesizer 100 equipped with the pause time length calculation device 40 according to the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the speech synthesizer of this embodiment.

The speech synthesizer 100 of this embodiment shown in FIG. 1 comprises a reading information input unit 10, a speech component data acquisition unit 20, a speech component data storage unit 30, the pause time length calculation device 40, a speech synthesis data storage unit 50, and a speech reproduction unit 60.

Before the details of each unit are described, the structure of the reading information and of the speech component data in this embodiment, and the principle by which the pause time length is set, are explained with reference to FIGS. 2 to 5.

First, the structure of the reading information is described with reference to FIG. 2. FIG. 2 is an explanatory diagram of the structure of the reading information.

In the example of reading information shown in FIG. 2, the unit is the phrase (bunsetsu). One or more phrases make up a sentence, and a plurality of sentences make up the reading information. The sentence number is denoted i, the number of sentences in the reading information N, the phrase number within each sentence j, and the number of phrases in sentence i M_i. The sentences are read out continuously in order of sentence number i, and the phrases of each sentence are read out continuously in order of phrase number j.

phr[i][j] denotes the single piece of speech component data corresponding to the j-th phrase of the i-th sentence.
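The indexing scheme above maps naturally onto a nested list. The sketch below is an illustration only: the function name `adjacent_pairs` is hypothetical, and it assumes that a junction (and hence a pause) exists between every pair of consecutive components in reading order, including across sentence boundaries.

```python
from typing import Iterator, List, Tuple


def adjacent_pairs(phr: List[List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (preceding, following) components in reading order.

    phr[i][j] is the component for phrase j of sentence i; sentences may
    have different numbers of phrases (M_i in the text).
    """
    # Flatten in order of sentence number i, then phrase number j.
    flat = [component for sentence in phr for component in sentence]
    # Each consecutive pair forms one junction.
    return zip(flat, flat[1:])
```

For reading information with N = 2 sentences such as `[["a", "b"], ["c"]]`, the junctions produced are ("a", "b") and ("b", "c").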

In this embodiment the speech component data are organized in units of phrases, but this is not a limitation. The unit may be a phoneme, word, morpheme, phrase, sentence, or the like, and speech component data may be built from a mixture of these units.

Next, an example of the structure of the speech component data is described with reference to FIGS. 3 and 4. FIG. 3 shows the data structure of the speech component data, and FIG. 4 schematically shows the structure of the speech waveform data contained in the speech component data.

The data structure of the speech component data shown in FIG. 3 comprises three groups. The basic data are the speech component number, the reading (text) data, the speech waveform data, the number of beats (morae), and the data length (total duration). The data on acoustic features are the leading silent length, trailing silent length, leading non-voiced length, trailing non-voiced length, leading pitch frequency, trailing pitch frequency, average speech rate, average power, leading spectral envelope, and trailing spectral envelope. The additional data (setting data) are the pause time length.

In this embodiment, the speech component data used for recording-and-editing speech synthesis are stored in the speech component data storage unit 30 (see FIG. 1) with their basic data set in advance.

The basic data hold the speech component number, which identifies the speech component; the reading data, that is, the text data indicating the content of the speech component; the speech waveform data, which are a recording of a speaker uttering that text; the number of beats (morae) of the text data; and the data length (the total duration of the speech waveform data).

In the example shown in FIG. 3, the speech component number is "123456", the reading data are "K放送" (read "kei-hōsō", "K Broadcasting"), the speech waveform data (see P_A in FIG. 4) are digital data sampled at a predetermined sampling frequency (for example, from several kHz to several tens of kHz), the number of beats is "6", and the data length is "1200 (ms)".

The data on acoustic features are intermediate data used by the pause time length calculation device 40 to calculate the pause time length. They are calculated by the acoustic feature detection unit 410 in the course of calculating the pause time length and are set only temporarily.

The pause time length, which is the additional data, is calculated and set by the pause time length calculation device 40 (see FIG. 1). It is determined from the difference in acoustic features (the acoustic distance) between the component and the speech component data connected after it in synthesis. Even for speech component data with the same speech component number, its value therefore differs according to the position, designated by sentence number i and phrase number j in the reading information, at which the speech component data are used.
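The three groups of fields described for FIG. 3 can be sketched as a single record type. The class name `SpeechComponent`, the field names, the units, and the use of `None` for values not yet computed are assumptions of this sketch; only the field list and the example values (number 123456, text "K放送", 6 morae, 1200 ms) come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeechComponent:
    # Basic data: set in advance and stored with the component.
    part_number: int
    text: str
    waveform: bytes
    mora_count: int
    data_length_ms: float
    # Acoustic feature data: intermediate values filled in during analysis.
    leading_silence_ms: Optional[float] = None
    trailing_silence_ms: Optional[float] = None
    leading_non_voiced_ms: Optional[float] = None
    trailing_non_voiced_ms: Optional[float] = None
    start_f0_hz: Optional[float] = None
    end_f0_hz: Optional[float] = None
    mean_speech_rate: Optional[float] = None
    mean_power: Optional[float] = None
    start_spectral_envelope: Optional[List[float]] = None
    end_spectral_envelope: Optional[List[float]] = None
    # Additional (setting) data: depends on the component that follows.
    pause_length_ms: Optional[float] = None


# Waveform left empty here; a real record would hold the sampled audio.
example = SpeechComponent(123456, "K放送", b"", 6, 1200.0)
```

Because the pause length depends on where the component is used, a practical design might store it per occurrence in the reading information rather than on the shared record; the sketch keeps it on the record only to mirror FIG. 3.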

Next, each item of data is described with reference to FIGS. 3 and 4.

FIG. 4 shows the speech waveform data of the preceding speech component data P_A, which come first in synthesis, and of the following speech component data P_B, which are connected after the preceding speech component data.

In the example shown in FIG. 4, the preceding speech component data P_A are a recording of the stock name "K放送" (kei-hōsō, "K Broadcasting"), and the following speech component data P_B are a recording of the stock price "40円" (yonjū-en, "40 yen"). FIG. 4 shows their speech waveforms.

In FIG. 4, the horizontal direction is the time axis, and time advances from left to right.

As shown in FIG. 4, a data start position, speech start position, voiced sound start position, voiced sound end position, speech end position, and data end position can be defined for the speech waveform of each of the speech component data P_A and P_B.

The data length is the total length from the data start position, where the speech waveform data begin, to the data end position, where they end.

The leading silent length and the trailing silent length are the lengths of the silent sections of the speech waveform data from the data start position to the speech start position and from the speech end position to the data end position, respectively.

The leading non-voiced length and the trailing non-voiced length are the lengths of the sections of the speech waveform data that contain no voiced sound, from the data start position to the voiced sound start position and from the voiced sound end position to the data end position, respectively. A non-voiced section includes both silent sections and unvoiced-sound sections.

The speech section length can be calculated by subtracting the leading silent length and the trailing silent length from the data length.

The speech start position, voiced sound start position, voiced sound end position, and speech end position are detected by the acoustic feature detection unit 410 (see FIG. 1) of the pause time length calculation device 40 through acoustic analysis of the speech waveform data. The length of each section can then be calculated from these detected positions together with the data start position and the data end position.
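The section lengths defined above follow directly from the six positions. The sketch below assumes all positions are expressed in the same unit (for example milliseconds or samples) and are already known; the function name `section_lengths` and the dictionary keys are hypothetical, and the detection of the positions themselves is not shown.

```python
def section_lengths(data_start, speech_start, voiced_start,
                    voiced_end, speech_end, data_end):
    """Derive the section lengths of one speech waveform from its positions."""
    positions = [data_start, speech_start, voiced_start,
                 voiced_end, speech_end, data_end]
    # The six positions must appear in this order along the time axis.
    if positions != sorted(positions):
        raise ValueError("positions must be in non-decreasing time order")
    data_length = data_end - data_start
    leading_silence = speech_start - data_start
    trailing_silence = data_end - speech_end
    return {
        "data_length": data_length,
        "leading_silence": leading_silence,
        "trailing_silence": trailing_silence,
        # Non-voiced sections cover silence plus any unvoiced sound.
        "leading_non_voiced": voiced_start - data_start,
        "trailing_non_voiced": data_end - voiced_end,
        # Speech section = data length minus both silent sections.
        "speech_section": data_length - leading_silence - trailing_silence,
    }
```

For a 1200 ms waveform with hypothetical positions 0, 50, 80, 1100, 1150, and 1200, the speech section comes to 1200 - 50 - 50 = 1100 ms.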

The pause time length is the duration of the pause (silence) inserted at the junction between the preceding speech component data P_A and the following speech component data P_B during speech reproduction. It is calculated by the pause time length calculation unit 440 (see FIG. 1) of the pause time length calculation device 40 and set in the preceding speech component data P_A by the pause time length setting unit 450 (see FIG. 1).

Next, the principle by which the present invention sets the pause time length is described with reference to FIG. 5 (and FIG. 3 as appropriate). FIG. 5 is an explanatory diagram of this principle.

In the present invention, the pause time length calculation device 40 (see FIG. 1) sets a pause (silence) at the junction between the preceding voice component data P_A and the following voice component data P_B, based on the difference between the acoustic feature quantities of P_A and P_B. This allows the voice waveforms contained in P_A and P_B to be connected and played back so that they sound natural.

FIG. 5 shows the case where the pitch frequency F0 is used as an example of the acoustic feature quantity.
When the preceding voice component data P_A and the following voice component data P_B are connected using the pitch frequency F0, the pause time length to be inserted at the junction is set based on the difference between the trailing pitch frequency ed.F0_A at the voiced-sound end position of P_A and the leading pitch frequency st.F0_B at the voiced-sound start position of P_B, that is, on their acoustic "distance" (hereinafter called the "acoustic distance" as appropriate).

If the acoustic distance between the voice component data P_A and P_B to be connected is large, playing the voice waveforms back to back produces a discontinuous, unnatural sound. Conversely, inserting a longer pause regardless of the acoustic distance can make the speech sound drawn out and gives the impression of mechanical reading.

The present invention therefore sets a longer pause at the junction the larger the acoustic distance between the preceding voice component data P_A and the following voice component data P_B, and a shorter pause the smaller the acoustic distance, so that playback sounds natural.
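
The monotonic relationship between acoustic distance and pause length can be illustrated with a small sketch. The calculation formula itself is not given in this passage, so the clamped linear mapping, the function name `pause_length_ms`, its parameters, and the use of an absolute pitch difference in Hz as the distance are all assumptions made only for illustration.

```python
def pause_length_ms(acoustic_distance: float,
                    min_pause_ms: float = 0.0,
                    max_pause_ms: float = 300.0,
                    ms_per_unit: float = 2.0) -> float:
    """Hypothetical monotone mapping: a larger distance gives a longer pause.

    The linear form and every constant here are illustrative assumptions,
    not the formula used by the pause time length calculation unit 440.
    """
    d = abs(acoustic_distance)
    return min(max_pause_ms, min_pause_ms + ms_per_unit * d)


# Made-up pitch values in Hz.
ed_f0_a = 180.0  # trailing pitch frequency of P_A
st_f0_b = 120.0  # leading pitch frequency of P_B
distance = abs(ed_f0_a - st_f0_b)  # assumed distance measure
pause = pause_length_ms(distance)
```

Any non-decreasing function of the distance would express the same principle; the upper clamp simply keeps the pause from growing without bound.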

The position in the voice waveform at which the acoustic feature quantity is referenced depends on which feature quantity is used; the details are described later.

Returning to FIG. 1, the configuration of each unit of the speech synthesizer 100 is described.
The read-out information input unit (read-out information acquisition means) 10 is an input unit for entering the read-out information to be synthesized. For example, it reads the read-out information from a storage device in which it is stored, such as a magnetic disk device, an optical disk device, or a flash memory. The read-out information may instead be entered over a communication line such as a network or telephone line, or through an input device such as a keyboard; the input means is not particularly limited.
The read-out information input unit 10 outputs the entered read-out information to the voice component data acquisition unit 20.

In this embodiment, the read-out information input unit receives the read-out information as information that designates the voice component data corresponding to each phrase. Alternatively, ordinary text data may be entered as the read-out information, split into units such as phrases by any suitable method, and associated with voice component data prepared in advance.

The voice component data acquisition unit (voice component data acquisition means) 20 receives the read-out information output from the read-out information input unit 10, reads the voice component data corresponding to each "phrase" of that information in order from the voice component data storage unit 30, and outputs the data to the acoustic feature quantity detection unit 410 of the pause time length calculation device 40.

The voice component data storage unit (voice component data storage means) 30 is a storage device, such as a magnetic disk device, an optical disk device, or a semiconductor memory, that stores voice component data. It holds in advance voice component data in which voice waveform data recorded from spoken phrases has been set.
The voice component data stored in the voice component data storage unit 30 is read out as needed by the voice component data acquisition unit 20.

The pause time length calculation device 40 comprises an acoustic feature quantity detection unit 410, a preceding voice component data storage unit 420, an acoustic distance calculation unit 430, a pause time length calculation unit 440, and a pause time length setting unit 450.
The pause time length calculation device 40 receives the voice component data output from the voice component data acquisition unit 20, detects acoustic feature quantities from that data, and uses them to calculate the acoustic distance between the voice component data to be connected. From the acoustic distance it calculates the pause time length between those data, adds (sets) the calculated pause time length to the voice component data, and stores the result in the speech synthesis data storage unit 50.
The detailed configuration of each unit of the pause time length calculation device 40 is described later.

The speech synthesis data storage unit 50 is a storage device, such as a magnetic disk device, an optical disk device, or a semiconductor memory, that stores the voice component data in which the pause time length has been set by the pause time length setting unit 450 of the pause time length calculation device 40.
The speech synthesis data storage unit 50 stores the voice component data with the pause time length set, in order, in association with the sentence number and phrase number designated by the read-out information entered at the read-out information input unit 10. Once voice component data with the pause time length set has been stored for every phrase in the read-out information, the speech synthesis data corresponding to the read-out information is formed in the speech synthesis data storage unit 50.
The speech synthesis data formed in the speech synthesis data storage unit 50 is read out by the voice reproduction unit 60.

The voice reproduction unit 60 reads the speech synthesis data formed in the speech synthesis data storage unit 50 for the read-out information, converts the voice waveform data contained in the voice component data, in the order given by the sentence and phrase numbers, into an analog voice waveform signal, and outputs that signal to the speaker 70.
After reproducing the voice waveform data contained in a piece of voice component data, the voice reproduction unit 60 inserts silence for the pause time length set in that voice component data and then reproduces the next voice component data.
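
The playback behavior described above, waveform followed by silence followed by the next waveform, can be sketched as a simple concatenation. The function name `render`, the tuple layout, and the choice to skip the pause after the final component are assumptions for illustration; silence is represented as zero-valued samples.

```python
def render(components, sample_rate_hz):
    """Concatenate waveforms, inserting each component's pause as zero samples.

    components: list of (samples, pause_ms) tuples in playback order, where
    pause_ms is the pause time length set in that voice component data.
    """
    out = []
    for index, (samples, pause_ms) in enumerate(components):
        out.extend(samples)
        # Assumption: no silence is appended after the final component.
        if index < len(components) - 1:
            out.extend([0.0] * round(sample_rate_hz * pause_ms / 1000.0))
    return out


# Two tiny made-up waveforms with a 1 ms pause between them at 8 kHz.
stream = render([([0.1, 0.2], 1.0), ([0.3], 0.0)], sample_rate_hz=8000)
```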

The speaker 70 receives the voice waveform signal output from the voice reproduction unit 60 and converts it into sound waves so that it can be heard.

In this embodiment, the speech synthesizer 100 has the voice reproduction unit convert the speech synthesis data, formed from voice component data with pause time lengths set, into a voice waveform signal and output it to the speaker for audible playback. Alternatively, the speech synthesis data may be transmitted over a communication line such as a network, or over broadcast waves, and reproduced on the receiving device side.

<Configuration of the pause time length calculation device>
Next, the configuration of the pause time length calculation device 40 of this embodiment is described in detail with reference to FIG. 6 (and FIG. 1 as appropriate). FIG. 6 is a block diagram showing the configuration of the pause time length calculation device of this embodiment.

The pause time length calculation device 40 shown in FIG. 6 comprises an acoustic feature quantity detection unit 410, a preceding voice component data storage unit 420, an acoustic distance calculation unit 430, a pause time length calculation unit 440, and a pause time length setting unit 450.

The acoustic feature quantity detection unit (acoustic feature quantity acquisition means) 410 comprises a framing processing unit 411, a spectrum analysis unit 412, a pitch frequency detection unit 413, a speech rate detection unit 414, a power detection unit 415, and a spectral envelope detection unit 416. It receives the voice component data output from the voice component data acquisition unit 20, analyzes the voice waveform data contained in it to detect acoustic feature quantities, and sets the data on the detected feature quantities (see FIG. 3) in the voice component data. It then outputs the voice component data with the feature data set to the acoustic distance calculation unit 430 and also stores it in the preceding voice component data storage unit 420.

Using the pitch frequency detection unit 413, the speech rate detection unit 414, the power detection unit 415, and the spectral envelope detection unit 416, the acoustic feature quantity detection unit 410 of this embodiment detects four acoustic feature quantities: the "pitch frequency", which represents the pitch of the voice; the "speech rate", which represents the speaking speed; the "power", which represents the loudness of the voice; and the "spectral envelope", which represents the timbre of the voice.

Although this embodiment detects all four of these acoustic feature quantities, only one or some of them may be detected, and other acoustic feature quantities, such as the duration of the phoneme at an end of the waveform, may also be detected.

Next, each part of the acoustic feature quantity detection unit 410 is described in detail.
The framing processing unit 411 performs framing: it cuts frames out of the voice waveform data contained in the input voice component data at a predetermined interval using a window function.
For framing, the frame length can be set to about 20 to 40 ms and the frame interval to about 5 to 20 ms, for example, and a Hamming window, Hanning window, triangular window, or the like can be used as the window function.
The framed voice waveform data is output to the spectrum analysis unit 412.
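
A minimal sketch of the framing step follows, assuming a Hamming window, a 25 ms frame length, and a 10 ms frame interval (values chosen from within the ranges given above). The function names and the policy of dropping a trailing partial frame are illustrative assumptions.

```python
import math


def hamming(n):
    """Hamming window coefficients of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def frame_signal(samples, sample_rate_hz, frame_ms=25.0, hop_ms=10.0):
    """Cut windowed frames out of the waveform at a fixed interval."""
    frame_len = int(sample_rate_hz * frame_ms / 1000.0)
    hop = int(sample_rate_hz * hop_ms / 1000.0)
    window = hamming(frame_len)
    frames = []
    # Assumption: a trailing partial frame is dropped rather than zero-padded.
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# 100 ms of a constant made-up signal at 8 kHz.
frames = frame_signal([1.0] * 800, sample_rate_hz=8000)
```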

The spectrum analysis unit 412 performs spectrum analysis on the framed voice waveform data output from the framing processing unit 411.
Fourier spectrum analysis, LPC analysis (linear predictive analysis), cepstrum analysis, or the like can be used as the analysis method, yielding a power spectrum, prediction coefficients, a cepstrum, or the like as the spectrum data.
The calculated spectrum data is output to the pitch frequency detection unit 413, the speech rate detection unit 414, the power detection unit 415, and the spectral envelope detection unit 416, which detect the acoustic feature quantities described above.
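
Of the analysis methods listed, the Fourier power spectrum is the simplest to illustrate. The sketch below uses a naive O(N^2) discrete Fourier transform so that it needs only the standard library; a real implementation would presumably use an FFT. The function name `power_spectrum` and the test tone are assumptions.

```python
import cmath
import math


def power_spectrum(frame):
    """Power |X[k]|^2 for bins 0..N/2 of a naive O(N^2) DFT of one frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


# A made-up 32-sample cosine with exactly 4 cycles, so its energy sits in bin 4.
tone = [math.cos(2.0 * math.pi * 4 * t / 32) for t in range(32)]
spec = power_spectrum(tone)
peak_bin = max(range(len(spec)), key=spec.__getitem__)
```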

Next, the configuration of the pitch frequency detection unit 413 is described with reference to FIGS. 6 and 7. FIG. 7 is an explanatory diagram showing how the pause time length is set based on the pitch frequency.
As shown in FIG. 6, the pitch frequency detection unit 413 comprises an end non-voiced length detection unit 413a and an end pitch frequency detection unit 413b.

In this embodiment, as shown in FIG. 7, when the pitch frequency is used as the acoustic feature quantity, the pause time length (phr[i][j].pau) is calculated from the trailing pitch frequency (phr[i][j].ed.F0) at the rear end of the voice waveform of the preceding voice component data P_A and the leading pitch frequency (phr[i][j+1].st.F0) at the front end of the voice waveform of the following voice component data P_B.
Because the pitch frequency cannot be extracted from unvoiced sound, the pitch frequency detected in the first frame of the voice section that contains voiced sound is taken as the leading pitch frequency, and the pitch frequency detected in the last frame of the voice section that contains voiced sound is taken as the trailing pitch frequency.

The pitch frequency can be obtained, for example, by computing the autocorrelation function of the power spectrum, extracting its first peak, and taking the frequency of that peak. It can also be obtained by cepstrum analysis: extract the peak in the high-quefrency region and take the reciprocal of the quefrency of that peak. Other methods of detecting the pitch frequency may also be used.
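
Since the text allows other pitch detection methods, the sketch below uses time-domain autocorrelation, a common alternative, rather than the power-spectrum autocorrelation or cepstrum methods named above. The function name `estimate_f0`, the search range of 50 to 500 Hz, and the test signal are assumptions; the sketch makes no voiced/unvoiced decision beyond returning None for a frame with no positive correlation peak.

```python
import math


def estimate_f0(frame, sample_rate_hz, f0_min=50.0, f0_max=500.0):
    """Return F0 in Hz from the strongest autocorrelation lag, or None."""
    lag_min = int(sample_rate_hz / f0_max)
    lag_max = min(int(sample_rate_hz / f0_min), len(frame) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate_hz / best_lag if best_lag else None


# A made-up 40 ms frame of a 200 Hz sine sampled at 8 kHz (period: 40 samples).
frame = [math.sin(2.0 * math.pi * 200.0 * t / 8000.0) for t in range(320)]
f0 = estimate_f0(frame, 8000)
```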

The end non-voiced length detection unit 413a analyzes the spectrum data of the framed voice waveform to determine, frame by frame, whether voiced sound is present. It detects the position of the first frame containing voiced sound as the voiced-sound start position and the position of the last frame containing voiced sound as the voiced-sound end position.

The leading non-voiced length can be calculated from the detected voiced-sound start position and the data start position. Put simply, if the data start position is defined as "0 (ms)", the voiced-sound start position equals the leading non-voiced length. Likewise, the trailing non-voiced length can be calculated from the detected voiced-sound end position and the data end position. With the data start position at "0", the data end position equals the data length, so the trailing non-voiced length is the data length minus the voiced-sound end position.

The end pitch frequency detection unit 413b detects the pitch frequency from the spectrum data of the frame at the voiced-sound start position detected by the end non-voiced length detection unit 413a and takes it as the leading pitch frequency. It likewise detects the pitch frequency from the spectrum data of the frame at the voiced-sound end position and takes it as the trailing pitch frequency.

The pitch frequency detection unit 413 sets the leading and trailing non-voiced lengths detected by the end non-voiced length detection unit 413a in phr[i][j].st.pos2 and phr[i][j].ed.pos2 of the voice component data, respectively, and sets the leading and trailing pitch frequencies detected by the end pitch frequency detection unit 413b in phr[i][j].st.F0 and phr[i][j].ed.F0 of the voice component data, respectively.

Although this embodiment uses the pitch frequency at the ends as the acoustic feature quantity, the average pitch frequency over all frames of each voice component's waveform in which a pitch frequency could be detected (that is, the frames containing voiced sound) may be calculated and used instead. In particular, the average pitch frequency may be used for voice components with a short data length, whereas the end pitch frequency is preferable when the data length is long. This allows an appropriate pause to be set at the junction between voice component data.
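
The choice between end pitch and average pitch can be sketched as below. The text gives no numeric boundary between "short" and "long" data, so the 500 ms threshold, the function name `pitch_features`, and the convention of marking unvoiced frames with None are all hypothetical.

```python
def pitch_features(frame_f0, data_length_ms, short_threshold_ms=500.0):
    """Return (leading, trailing) pitch values for one voice component.

    frame_f0: per-frame pitch in Hz, with None for frames where no pitch
    could be detected. short_threshold_ms is an assumed cut-off.
    """
    voiced = [f for f in frame_f0 if f is not None]
    if not voiced:
        return None
    if data_length_ms < short_threshold_ms:
        # Short component: use the average over all voiced frames at both ends.
        mean_f0 = sum(voiced) / len(voiced)
        return mean_f0, mean_f0
    # Long component: use the first and last voiced frames.
    return voiced[0], voiced[-1]


# Made-up per-frame pitch values.
contour = [None, 100.0, 120.0, 140.0, None]
short_result = pitch_features(contour, data_length_ms=300.0)
long_result = pitch_features(contour, data_length_ms=2000.0)
```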

Next, the configuration of the speech rate detection unit 414 is described with reference to FIGS. 6 and 8. FIG. 8 is an explanatory diagram showing how the pause time length is set based on the speech rate.
As shown in FIG. 6, the speech rate detection unit 414 comprises an end silence length detection unit 414a and an average speech rate detection unit 414b.

In this embodiment, when the speech rate is used as the acoustic feature quantity, the pause time length (phr[i][j].pau) is calculated from the average speech rate in the voice section of the voice waveform of the preceding voice component data P_A and the average speech rate in the voice section of the voice waveform of the following voice component data P_B.
As shown in FIG. 8, this embodiment uses as the speech rate the average speech rate calculated from the number of morae (phr[i][j].mora) appearing in the voice section and the voice section length. A speech rate defined differently, such as the number of phonemes per unit time, may be used instead.

The end silence length detection unit 414a analyzes the spectrum data of the framed voice waveform to determine, frame by frame, whether the voice waveform signal has power at or above a predetermined value. It detects the position of the first frame with power at or above the predetermined value as the voice start position and the position of the last such frame as the voice end position.

The leading silence length can then be calculated from the detected voice start position and the data start position, and the trailing silence length from the detected voice end position and the data end position.
The voice section length can be calculated by subtracting the leading silence length (phr[i][j].st.pos1) and the trailing silence length (phr[i][j].ed.pos1) from the data length (phr[i][j].time).
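
The threshold-based detection and the resulting lengths can be sketched as follows. The function name `silence_lengths`, the threshold value, and the convention that a frame's position is its start time (frame index times frame interval, with the data start at 0 ms) are assumptions for illustration.

```python
def silence_lengths(frame_powers, threshold, hop_ms, data_length_ms):
    """Return (leading silence, trailing silence, voice section length) in ms.

    A frame counts as speech when its power is at or above the threshold.
    """
    active = [k for k, p in enumerate(frame_powers) if p >= threshold]
    if not active:
        return None  # no frame reaches the threshold
    leading = active[0] * hop_ms                     # corresponds to st.pos1
    trailing = data_length_ms - active[-1] * hop_ms  # corresponds to ed.pos1
    return leading, trailing, data_length_ms - leading - trailing


# Made-up per-frame powers with a 10 ms frame interval.
result = silence_lengths([0.0, 0.1, 2.0, 3.0, 0.2, 0.0],
                         threshold=1.0, hop_ms=10.0, data_length_ms=60.0)
```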

The voice start and end positions may also be detected by criteria other than power, for example by determining whether each frame contains a phoneme, and detection may be based on the signal level of the voice waveform data instead of the spectrum data.

As shown in equation (1), the average speech rate detection unit 414b calculates the average speech rate phr[i][j].SR by dividing the number of morae (phr[i][j].mora) preset in the voice component data by the voice section length calculated by the procedure described above.
phr[i][j].SR =
phr[i][j].mora / (phr[i][j].time - phr[i][j].st.pos1 - phr[i][j].ed.pos1)
... (1)
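
Equation (1) translates directly into code. In this sketch the lengths are assumed to be in milliseconds, so the result is in morae per millisecond; the function name and the sample values are hypothetical.

```python
def average_speech_rate(mora, time_ms, st_pos1_ms, ed_pos1_ms):
    """Equation (1): morae divided by the voice section length."""
    voice_section_ms = time_ms - st_pos1_ms - ed_pos1_ms
    if voice_section_ms <= 0:
        raise ValueError("voice section length must be positive")
    return mora / voice_section_ms


# Made-up component: 6 morae, 1000 ms of data, 100 ms and 150 ms of end silence.
sr = average_speech_rate(mora=6, time_ms=1000.0, st_pos1_ms=100.0, ed_pos1_ms=150.0)
```

With these values the voice section is 750 ms long, so the rate corresponds to 8 morae per second.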

The speech rate detection unit 414 sets the leading and trailing silence lengths detected by the end silence length detection unit 414a in phr[i][j].st.pos1 and phr[i][j].ed.pos1 of the voice component data, respectively, and sets the average speech rate detected by the average speech rate detection unit 414b in phr[i][j].SR of the voice component data.

Although this embodiment uses the average speech rate of the voice section as the acoustic feature quantity, when the data length is short the leading and trailing silence lengths may be ignored and the average speech rate calculated by dividing the number of morae by the data length. In that case the silence lengths need not be detected.

When the data length is long, the speech rate at the ends may be detected and used instead of the average speech rate of the voice waveform. In that case, for example, the framed voice waveform is analyzed to detect the durations of the first and last morae, and the reciprocals of those durations are used as the leading and trailing speech rates, respectively. Alternatively, the number of morae appearing within a predetermined time from the front end and from the rear end may be detected.

Next, the configuration of the power detection unit 415 is described with reference to FIGS. 6 and 9. FIG. 9 is an explanatory diagram showing how the pause time length is set based on the power.
As shown in FIG. 6, the power detection unit 415 comprises an end silence length detection unit 415a and an average power detection unit 415b.

In this embodiment, as shown in FIG. 9, when the power is used as the acoustic feature quantity, the pause time length (phr[i][j].pau) is calculated from the average power in the voice section of the voice waveform of the preceding voice component data P_A and the average power in the voice section of the voice waveform of the following voice component data P_B.

Like the end silence length detection unit 414a of the speech rate detection unit 414, the end silence length detection unit 415a detects the voice start and end positions and calculates the leading and trailing silence lengths, so a detailed description is omitted. The power detection unit 415, the speech rate detection unit 414, and the spectral envelope detection unit 416 described later may, for example, share the end silence length detection unit 414a.
The voice section length can be calculated by subtracting the leading and trailing silence lengths from the data length.

The average power detection unit 415b uses the spectrum data to detect the power of each frame (phr[i][j].pwr[k], where k is the frame number) and, as shown in equation (2), calculates the average power (phr[i][j].PW) by averaging the power over all frames in the voice section.
phr[i][j].PW = sum( phr[i][j].pwr[k] ) / (number of frames in the voice section) ... (2)
Here, sum( ) in the numerator on the right-hand side denotes the sum of the power of the frames in the voice section.
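
Equation (2) is a plain average over the frames of the voice section. In the sketch below, the function name and the way the voice section is delimited (inclusive first and last frame indices) are assumptions.

```python
def average_power(pwr, first_frame, last_frame):
    """Equation (2): mean per-frame power over the voice section (inclusive)."""
    section = pwr[first_frame:last_frame + 1]
    if not section:
        raise ValueError("voice section contains no frames")
    return sum(section) / len(section)


# Made-up per-frame powers; frames 1..3 form the voice section.
pw = average_power([0.0, 2.0, 4.0, 6.0, 0.0], first_frame=1, last_frame=3)
```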

The power detection unit 415 sets the leading and trailing silence lengths detected by the end silence length detection unit 415a in phr[i][j].st.pos1 and phr[i][j].ed.pos1 of the voice component data, respectively, and sets the average power detected by the average power detection unit 415b in phr[i][j].PW of the voice component data.

Although this embodiment uses the average power of the voice section as the acoustic feature quantity, the power of the first and last frames of the voice section may be detected and used instead. Also, although this embodiment uses power as the acoustic feature quantity representing loudness, the loudness level, a perceptual measure of auditory volume, may be used in its place.

Next, the configuration of the spectral envelope detection unit 416 is described with reference to FIGS. 6 and 10. FIG. 10 is an explanatory diagram showing how the pause time length is set based on the spectral envelope.
As shown in FIG. 6, the spectral envelope detection unit 416 comprises an end silence length detection unit 416a and an end spectral envelope detection unit 416b.

In this embodiment, as shown in FIG. 10, when the spectral envelope is used as the acoustic feature quantity, the pause time length (phr[i][j].pau) is calculated from the trailing spectral envelope (phr[i][j].ed.SE) at the rear end of the voice waveform of the preceding voice component data P_A and the leading spectral envelope (phr[i][j+1].st.SE) at the front end of the voice waveform of the following voice component data P_B.

The spectral envelope can be obtained from the spectrum data calculated by the spectrum analysis unit 412. For example, when the Fourier transform is used for spectrum analysis, the Fourier transform coefficients can be used. A band-pass filter bank, a correlation function, LPC coefficients, the cepstrum, the mel-cepstrum, and the like can also be used. Dynamic features such as the first and second derivatives of these coefficients may be added as well.
The spectral envelope is expressed as a vector quantity made up of multiple coefficients.

Like the end silence length detection unit 414a of the speech speed detection unit 414, the end silence length detection unit 416a detects the voice start position and the voice end position and calculates the front-end silence length and the rear-end silence length, so a detailed description is omitted. Note that the speech speed detection unit 414, the power detection unit 415, and the spectral envelope detection unit 416 may share, for example, the end silence length detection unit 414a.
The voice section length can be calculated by subtracting the front-end silence length and the rear-end silence length from the data length.
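The relationship between the data length, the two end silence lengths, and the voice section length can be illustrated with a short sketch. It assumes, purely for illustration, that the voice start and end positions are found by comparing per-frame power with a threshold; the function name `end_silence_lengths`, the threshold value, and the frame-based units are hypothetical and are not taken from the embodiment.

```python
def end_silence_lengths(frame_power, threshold):
    """Return (front_silence, rear_silence, voice_length) in frames.

    Assumption: a frame belongs to the voice section when its power
    exceeds `threshold`; the embodiment does not prescribe this rule.
    """
    voiced = [k for k, p in enumerate(frame_power) if p > threshold]
    data_length = len(frame_power)
    if not voiced:
        # No frame exceeds the threshold: treat everything as front silence.
        return data_length, 0, 0
    start, end = voiced[0], voiced[-1]
    front_silence = start                   # frames before the voice start position
    rear_silence = data_length - 1 - end    # frames after the voice end position
    # Voice section length = data length - front silence - rear silence.
    voice_length = data_length - front_silence - rear_silence
    return front_silence, rear_silence, voice_length


# Hypothetical per-frame power values for one voice component.
power = [0.0, 0.1, 2.0, 3.5, 2.8, 0.2, 0.0, 0.0]
front, rear, voice_len = end_silence_lengths(power, threshold=0.5)
print(front, rear, voice_len)
```

Because all three detection units need the same start and end positions, a single function of this kind could be shared among them, in line with the sharing suggested above.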

The end spectral envelope detection unit 416b detects the spectral envelope from the spectrum data of the frame at the voice start position detected by the end silence length detection unit 416a and takes it as the front-end spectral envelope; likewise, it detects the spectral envelope from the spectrum data of the frame at the voice end position and takes it as the rear-end spectral envelope.

The spectral envelope detection unit 416 sets the front-end silence length and the rear-end silence length detected by the end silence length detection unit 416a in phr[i][j].st.pos1 and phr[i][j].ed.pos1 of the voice component data, respectively, and sets the front-end and rear-end spectral envelopes detected by the end spectral envelope detection unit 416b in phr[i][j].st.SE and phr[i][j].ed.SE of the voice component data, respectively.

In the present embodiment, the spectral envelope at each end is used as the acoustic feature amount; however, when the data length of a voice component is short, the average spectral envelope over the voice section may be used instead.

Returning to FIG. 6, the description of the configuration of the pause time length calculation device 40 continues.
The preceding voice component data storage unit 420 temporarily stores voice component data in which the acoustic feature amount detection unit 410 has set data on the acoustic feature amounts. When the next pause time length is calculated, this voice component data is read out by the pause time length calculation unit 440 as the preceding voice component data. In other words, the preceding voice component data storage unit 420 functions as a data delay means.
A semiconductor memory, for example, can be used as the preceding voice component data storage unit 420, but a storage device such as a magnetic disk device or an optical disk device can also be used.

The acoustic distance calculation unit (acoustic distance calculation means) 430 receives, as the subsequent voice component data, the voice component data in which the acoustic feature amount detection unit 410 has set data on the acoustic feature amounts, and reads out the voice component data stored in the preceding voice component data storage unit 420 for use as the preceding voice component data. It then calculates the acoustic distance based on the acoustic feature data set in the preceding voice component data and the acoustic feature data set in the subsequent voice component data, and outputs the result to the pause time length calculation unit 440.

The acoustic distance can be calculated by Equations (3) to (8), depending on the acoustic feature amount used.
First, when the pitch frequency is used as the acoustic feature amount, the acoustic distance (ΔF0[i][j]) is calculated by Equation (3) from the rear-end pitch frequency of the preceding voice component data (phr[i][j].ed.F0) and the front-end pitch frequency of the subsequent voice component data (phr[i][j+1].st.F0). In addition, the temporal distance (ΔFp[i][j]) between the ends at which the pitch frequencies used for the acoustic distance were detected is calculated by Equation (4) from the rear-end non-voiced length of the preceding voice component data (phr[i][j].ed.pos2) and the front-end non-voiced length of the subsequent voice component data (phr[i][j+1].st.pos2).

ΔF0[i][j] = |log(phr[i][j].ed.F0) - log(phr[i][j+1].st.F0)| ... (3)
ΔFp[i][j] = phr[i][j].ed.pos2 + phr[i][j+1].st.pos2 ... (4)
where log( ) denotes the common logarithm.

Taking this temporal distance between the ends into account allows the pause time length to be calculated more appropriately than when it is ignored.

Next, when the speech speed is used as the acoustic feature amount, the acoustic distance (ΔR[i][j]) is calculated by Equation (5) from the average speech speed of the preceding voice component data (phr[i][j].SR) and the average speech speed of the subsequent voice component data (phr[i][j+1].SR).

ΔR[i][j] = |phr[i][j].SR - phr[i][j+1].SR| ... (5)

Next, when power is used as the acoustic feature amount, the acoustic distance (ΔP[i][j]) is calculated by Equation (6) from the average power of the voice section of the preceding voice component data (phr[i][j].PW) and the average power of the voice section of the subsequent voice component data (phr[i][j+1].PW).

ΔP[i][j] = |phr[i][j].PW - phr[i][j+1].PW| ... (6)

Next, when the spectral envelope is used as the acoustic feature amount, the acoustic distance (ΔE[i][j]) is calculated by Equation (7) from the rear-end spectral envelope of the preceding voice component data (phr[i][j].ed.SE) and the front-end spectral envelope of the subsequent voice component data (phr[i][j+1].st.SE). In addition, the temporal distance (ΔEp[i][j]) between the ends at which the spectral envelopes used for the acoustic distance were detected is calculated by Equation (8) from the rear-end silence length of the preceding voice component data (phr[i][j].ed.pos1) and the front-end silence length of the subsequent voice component data (phr[i][j+1].st.pos1).

ΔE[i][j] = |phr[i][j].ed.SE - phr[i][j+1].st.SE| ... (7)
ΔEp[i][j] = phr[i][j].ed.pos1 + phr[i][j+1].st.pos1 ... (8)
Because the spectral envelope is a vector quantity, the distance in Equation (7) is calculated as the Euclidean distance between the two vectors.

As with the pitch frequency, taking this temporal distance between the ends into account allows the pause time length to be calculated more appropriately than when it is ignored.
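Equations (3) to (8) can be gathered into one small routine, sketched below. The field names (`st`, `ed`, `F0`, `pos1`, `pos2`, `SE`, `SR`, `PW`) follow the identifiers used in the text, while the `End` and `Component` classes, the function name `acoustic_distances`, and all numeric values are hypothetical and serve only as an illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class End:
    """Data held for one end (st = front, ed = rear) of a voice component."""
    F0: float             # pitch frequency at this end
    pos1: float           # silence length at this end
    pos2: float           # non-voiced length at this end
    SE: Sequence[float]   # spectral envelope vector at this end


@dataclass
class Component:
    st: End     # front end of the voice waveform
    ed: End     # rear end of the voice waveform
    SR: float   # average speech speed
    PW: float   # average power of the voice section


def acoustic_distances(prev: Component, nxt: Component) -> Dict[str, float]:
    """Distances between a preceding and a subsequent voice component."""
    # Euclidean distance between the two envelope vectors, Equation (7).
    d_env = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev.ed.SE, nxt.st.SE)))
    return {
        "dF0": abs(math.log10(prev.ed.F0) - math.log10(nxt.st.F0)),  # (3)
        "dFp": prev.ed.pos2 + nxt.st.pos2,                           # (4)
        "dR": abs(prev.SR - nxt.SR),                                 # (5)
        "dP": abs(prev.PW - nxt.PW),                                 # (6)
        "dE": d_env,                                                 # (7)
        "dEp": prev.ed.pos1 + nxt.st.pos1,                           # (8)
    }


# Hypothetical values; units are whatever the stored data uses.
p_a = Component(st=End(180.0, 0.05, 0.08, [0.0, 0.0]),
                ed=End(200.0, 0.10, 0.02, [1.0, 2.0]), SR=8.0, PW=60.0)
p_b = Component(st=End(100.0, 0.05, 0.03, [4.0, 6.0]),
                ed=End(120.0, 0.07, 0.04, [2.0, 2.0]), SR=6.5, PW=57.0)
d = acoustic_distances(p_a, p_b)
print(d)
```

Only the rear end of the preceding component and the front end of the subsequent component enter the pitch and envelope distances, whereas speech speed and power are compared as averages over the whole voice section.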

The pause time length calculation unit (pause time length calculation means) 440 includes a multiple regression calculation unit 441 and a regression coefficient storage unit 442. Based on the acoustic distance calculated by the acoustic distance calculation unit 430, it calculates the pause time length to be inserted at the junction between the preceding voice component data and the subsequent voice component data, and outputs the result to the pause time length setting unit 450.

The multiple regression calculation unit 441 calculates the pause time length by performing a regression calculation based on the acoustic distance calculated by the acoustic distance calculation unit 430 and the regression equation coefficients stored in advance in the regression coefficient storage unit 442, and outputs the result to the pause time length setting unit 450.
In the embodiment shown in FIG. 6, the pause time length is calculated by a multiple regression equation with a plurality of acoustic feature amounts as explanatory variables; when a single acoustic feature amount is used as the explanatory variable, the pause time length is calculated by a simple regression equation. The term "regression equation" in the claims covers both the multiple regression equation, with several explanatory variables, and the simple regression equation, with one explanatory variable.

Here, the pause time length (phr[i][j].pau) is calculated by the regression equations shown in Equations (9) to (13), using regression coefficients a0 to a16 and the like, depending on the acoustic feature amount used.

First, when only the pitch frequency is used as the acoustic feature amount, the multiple regression equation shown in Equation (9) is used.
phr[i][j].pau = a0 + a1×ΔF0[i][j] + a2×ΔFp[i][j] ... (9)

Next, when only the speech speed is used as the acoustic feature amount, the simple regression equation shown in Equation (10) is used.
phr[i][j].pau = a3 + a4×ΔR[i][j] ... (10)

Next, when only power is used as the acoustic feature amount, the simple regression equation shown in Equation (11) is used.
phr[i][j].pau = a5 + a6×ΔP[i][j] ... (11)

Next, when only the spectral envelope is used as the acoustic feature amount, the multiple regression equation shown in Equation (12) is used.
phr[i][j].pau = a7 + a8×ΔE[i][j] + a9×ΔEp[i][j] ... (12)

When all four of the pitch frequency, speech speed, power, and spectral envelope are used as acoustic feature amounts, the multiple regression equation shown in Equation (13) is used.
phr[i][j].pau = a10 + a11×ΔF0[i][j] + a12×ΔFp[i][j] + a13×ΔR[i][j] + a14×ΔP[i][j] + a15×ΔE[i][j] + a16×ΔEp[i][j] ... (13)

The regression equations are not limited to those above; the acoustic feature amounts may be combined as appropriate to define another multiple regression equation for calculating the pause time length.
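Equations (9) to (13) all share one form: an intercept plus a weighted sum of distances. The sketch below evaluates that form for any number of explanatory variables. The function name `pause_length` and every coefficient and distance value are made up for illustration; real coefficients would come from the regression coefficient storage unit 442.

```python
def pause_length(coeffs, distances):
    """Evaluate a regression equation: intercept plus weighted distances.

    `coeffs` is (intercept, weight_1, ..., weight_n) and `distances`
    holds the n explanatory variables in the same order.
    """
    if len(coeffs) != len(distances) + 1:
        raise ValueError("need one intercept plus one coefficient per distance")
    intercept, weights = coeffs[0], coeffs[1:]
    return intercept + sum(w * d for w, d in zip(weights, distances))


# Hypothetical coefficients a0, a1, a2 for Equation (9) (pitch frequency only),
# applied to hypothetical distances (dF0, dFp).
a_pitch = (0.10, 0.50, 0.80)
pau_pitch = pause_length(a_pitch, (0.30, 0.05))
print(pau_pitch)

# Equation (13) has the same form with a10 to a16 and six distances
# (dF0, dFp, dR, dP, dE, dEp); the numbers are again invented.
a_all = (0.05, 0.40, 0.60, 0.01, 0.002, 0.03, 0.50)
pau_all = pause_length(a_all, (0.30, 0.05, 1.5, 3.0, 5.0, 0.15))
print(pau_all)
```

Treating the equations as one parameterized form also covers other combinations of acoustic feature amounts, since only the coefficient tuple and the list of distances change.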

Here, a method of obtaining the coefficients a0 to a16 of the regression equations will be described with reference to FIG. 11, taking as an example the case where the pitch frequency is used as the acoustic feature amount. FIG. 11 is an explanatory diagram illustrating the relationship between the subjective evaluation experiment and the multiple regression analysis.

To determine the coefficients of the multiple regression equation, the optimal pause time length is first obtained by a subjective evaluation experiment for each of various combinations of voice component data. Separately, the acoustic distance and the temporal distance are calculated for each combination by the same procedure as in the acoustic distance calculation unit 430 described above. The coefficients of the multiple regression equation can then be determined by performing a multiple regression analysis of the optimal pause time lengths obtained in the subjective evaluation experiment against the calculated acoustic and temporal distances.

For example, as shown in FIG. 11, the optimal pause time length Pause1 between the preceding voice component data P_A corresponding to phrase 1a and the subsequent voice component data P_B corresponding to phrase 1b is obtained by a subjective evaluation experiment. Similarly, the optimal pause time length Pause2 between the preceding voice component data P_A corresponding to phrase 2a and the subsequent voice component data P_B corresponding to phrase 2b, the optimal pause time length PauseL between the preceding voice component data P_A corresponding to phrase La and the subsequent voice component data P_B corresponding to phrase Lb, and so on are obtained by subjective evaluation experiments.

In the subjective evaluation experiment, the optimal pause time length can be quantified by, for example, the relative method or the method of limits. The pause time length can also be obtained from a subjective evaluation experiment using another technique.

In addition, the acoustic distances (ΔF0_1, ΔF0_2, ΔF0_L, etc.) and the temporal distances (ΔFp_1, ΔFp_2, ΔFp_L, etc.) for each combination of preceding voice component data P_A and subsequent voice component data P_B are calculated by the procedure described above.

Applying these data to Equation (9) yields relational expressions such as Equation (14).
Pause1 = a0 + a1×ΔF0_1 + a2×ΔFp_1
Pause2 = a0 + a1×ΔF0_2 + a2×ΔFp_2
⋮
PauseL = a0 + a1×ΔF0_L + a2×ΔFp_L
... (14)

By applying the least squares method to the relational expressions shown in Equation (14), the coefficients a0, a1, and a2 of the regression equation can be calculated and determined.
When other regression equations such as those shown in Equations (10) to (13) are used, their coefficients can be determined by the same procedure.
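The least squares step can be sketched as follows. This is one possible way to carry it out, solving the normal equations with Gauss-Jordan elimination; the text does not specify a solver. The function name `fit_regression` is hypothetical, and the sample data are synthetic values generated from invented coefficients, not results of any subjective evaluation experiment.

```python
def fit_regression(rows, targets):
    """Least-squares fit of targets = a0 + a1*x1 + ... + an*xn.

    Solves the normal equations (X^T X) a = X^T y by Gauss-Jordan
    elimination with partial pivoting and returns [a0, a1, ..., an].
    """
    X = [[1.0, *row] for row in rows]     # prepend the intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[r] * x[c] for x in X) for c in range(n)]
         + [sum(x[r] * y for x, y in zip(X, targets))]
         for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("explanatory variables are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return [A[r][n] / A[r][r] for r in range(n)]


# Synthetic (dF0, dFp) pairs and pause lengths generated from the invented
# coefficients a0 = 0.1, a1 = 0.5, a2 = 0.8, so the fit should recover them.
samples = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.1), (0.3, 0.05), (0.1, 0.2)]
pauses = [0.1 + 0.5 * f0 + 0.8 * fp for f0, fp in samples]
a0, a1, a2 = fit_regression(samples, pauses)
print(a0, a1, a2)
```

With real experimental data the L equations are not satisfied exactly, and the same routine then returns the coefficients that minimize the sum of squared differences between the measured and predicted pause time lengths.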

The regression equation coefficients determined in advance in this way are stored in the regression coefficient storage unit 442 (see FIG. 6), and the pause time length can be calculated by reading these coefficients out of the regression coefficient storage unit 442 and using them.

The regression coefficient storage unit 442 stores the regression equation coefficients determined in advance on the basis of the subjective evaluation experiment described above, and the stored coefficients are read out as needed by the multiple regression calculation unit 441.
A storage device such as a magnetic disk device, an optical disk device, or a semiconductor memory can be used as the regression coefficient storage unit 442.

The pause time length setting unit 450 sets the pause time length calculated by the pause time length calculation unit 440 in the pause time length (phr[i][j].pau) of the preceding voice component data used in that calculation, and stores the voice component data with the pause time length set in the speech synthesis data storage unit 50 (see FIG. 1) in association with its sentence number and phrase number.

The speech synthesizer 100 described above can be implemented partly or wholly as dedicated hardware, but it can also be realized by having a general-purpose computer execute a program that operates the computer's arithmetic unit, storage device, input device, image display device, and so on. This program (the pause time length calculation program) can be distributed over a communication line or written to a recording medium such as a CD-ROM and distributed.

<Operation of the speech synthesizer>
Next, the operation of the speech synthesizer 100 of the present embodiment will be described with reference to FIG. 12 (and FIGS. 1 and 6 as appropriate). FIG. 12 is a flowchart showing the flow of processing in the speech synthesizer of the present embodiment.

First, the speech synthesizer 100 uses the read-out information input unit 10 to input the read-out information to be synthesized and outputs it to the voice component data acquisition unit 20 (step S10).

Using the voice component data acquisition unit 20, the speech synthesizer 100 sequentially acquires, from the voice component data storage unit 30, the voice component data corresponding to the phrases specified in the read-out information input in step S10, and outputs it to the acoustic feature amount detection unit 410 of the pause time length calculation device 40 (step S11).

Using the acoustic feature amount detection unit 410 of the pause time length calculation device 40, the speech synthesizer 100 detects the acoustic feature amounts and sets the data on the detected acoustic feature amounts in the voice component data. This voice component data is output to the acoustic distance calculation unit 430 as the subsequent voice component data and is also stored in the preceding voice component data storage unit 420 as the preceding voice component data for the next connection (step S12).

If the voice component data output by the acoustic feature amount detection unit 410 corresponds to the first phrase of the read-out information input in step S10 (Yes in step S13), there is no preceding voice component data for it and no pause time length needs to be calculated, so the process returns to step S11 to acquire the voice component data corresponding to the next phrase.

On the other hand, if the voice component data output by the acoustic feature amount detection unit 410 does not correspond to the first phrase of the read-out information (No in step S13), the speech synthesizer 100 uses the acoustic distance calculation unit 430 to calculate the acoustic distance between the acoustic feature amount set in the preceding voice component data stored in the preceding voice component data storage unit 420 and the acoustic feature amount set in the subsequent voice component data output by the acoustic feature amount detection unit 410, and outputs the result to the pause time length calculation unit 440 (step S14).

Next, the multiple regression calculation unit 441 of the pause time length calculation unit 440 calculates the pause time length based on the multiple regression equation coefficients stored in advance in the regression coefficient storage unit 442 and the acoustic distance between the voice component data calculated in step S14, and outputs the result to the pause time length setting unit 450 (step S15).

The pause time length setting unit 450 then sets the pause time length calculated in step S15 in the preceding voice component data (step S16) and stores that data in the speech synthesis data storage unit 50 in association with the sentence number and phrase number of the read-out information (step S17).

Once the voice component data with the pause time length set has been stored in step S17, it is checked whether any further phrase remains in the read-out information (step S18). If a phrase remains (Yes in step S18), the process returns to step S11, acquires the voice component data corresponding to the next phrase, and repeats the processing up to step S17.

On the other hand, if no further phrase remains (No in step S18), no pause time length needs to be set for the subsequent voice component data corresponding to the last phrase, so this subsequent voice component data is stored in the speech synthesis data storage unit 50 in association with the number of the last phrase in the last sentence of the read-out information (step S19).
This completes the speech synthesis data for the read-out information in the speech synthesis data storage unit 50.

When the speech synthesis data is complete, the voice reproduction unit 60 sequentially reads out the corresponding voice component data from the speech synthesis data storage unit 50 in accordance with the sentence numbers and phrase numbers, converts the voice waveform data contained in the voice component data into an analog voice waveform signal, and outputs it to the speaker 70 for audible reproduction. After inserting a pause (silence) of the pause time length set in that voice component data, it reproduces the next voice component data (step S20).

By the procedure described above, the speech synthesizer 100 can insert an appropriate pause between the voice component data specified by the read-out information and reproduce them as natural-sounding speech.
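The loop formed by steps S11 to S19 can be sketched as below. The sketch assumes that the acoustic feature amounts are already set in each component (step S12) and represents components as plain dictionaries; the function name `build_synthesis_data`, the dictionary keys, and the coefficient values are hypothetical, and `distance_fn` and `pause_fn` merely stand in for units 430 and 440.

```python
def build_synthesis_data(components, distance_fn, pause_fn):
    """Attach a pause time length to every component except the last.

    `components` are dicts in reading order whose acoustic feature
    amounts are assumed to be set already (step S12).
    """
    synthesis_data = []                  # role of storage unit 50
    previous = None                      # role of storage unit 420
    for component in components:         # S11: take phrases in order
        current = dict(component)        # copy so the input stays untouched
        if previous is not None:         # S13: the first phrase has no predecessor
            distance = distance_fn(previous, current)   # S14
            previous["pau"] = pause_fn(distance)        # S15 and S16
            synthesis_data.append(previous)             # S17
        previous = current               # held back by one phrase
    if previous is not None:
        synthesis_data.append(previous)  # S19: last phrase, no pause needed
    return synthesis_data


# Hypothetical example using speech speed only, in the form of Equation (10).
phrases = [{"id": "1a", "SR": 8.0}, {"id": "1b", "SR": 6.0}, {"id": "1c", "SR": 7.0}]
result = build_synthesis_data(
    phrases,
    distance_fn=lambda p, n: abs(p["SR"] - n["SR"]),
    pause_fn=lambda d: 0.1 + 0.05 * d,   # a3 = 0.1, a4 = 0.05 are made up
)
for item in result:
    print(item["id"], item.get("pau"))
```

The `previous` variable plays the part of the data delay: each component is held for one iteration so that its pause can be computed once its successor is known, and the last component is stored without a pause.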

In the present embodiment, the acoustic analysis of the voice waveform data and the calculation of the pause time length are performed back to back for each voice component data in turn. Alternatively, the pause time lengths may be calculated after the voice waveform data of the voice component data corresponding to all phrases in the read-out information have been acoustically analyzed.

To do this, for example, the acoustic feature amount detection unit 410 acoustically analyzes the voice waveform data of each voice component data, sets the data on the acoustic feature amounts in that voice component data, and stores it in, for example, the speech synthesis data storage unit 50. When the acoustic analysis of the voice component data corresponding to all phrases in the read-out information is finished, the acoustic distance calculation unit 430 sequentially reads out pairs of preceding and subsequent voice component data from the speech synthesis data storage unit 50 and calculates the acoustic distance; the pause time length calculation unit 440 calculates the pause time length from that acoustic distance; and the pause time length setting unit 450 sets the pause time length in the preceding voice component data and stores it in the speech synthesis data storage unit 50. When the pause time lengths between all voice component data have been set, the speech synthesis data storage unit 50 holds the completed speech synthesis data, made up of voice component data with pause time lengths set.
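The two-pass alternative can be sketched in a few lines. As before, this is only an illustration: the names `set_pauses_two_pass` and `mean_power`, the dictionary layout, and the coefficients are hypothetical, and the analysis step is reduced to averaging invented per-frame power values.

```python
def set_pauses_two_pass(components, analyze, distance_fn, pause_fn):
    """Analyze every component first, then set pauses pair by pair."""
    # Pass 1: acoustic analysis of all components (role of unit 410).
    analyzed = [analyze(c) for c in components]
    # Pass 2: one pause per adjacent (preceding, subsequent) pair.
    for prev, nxt in zip(analyzed, analyzed[1:]):
        prev["pau"] = pause_fn(distance_fn(prev, nxt))
    return analyzed


def mean_power(component):
    """Hypothetical analysis step: average of per-frame power values."""
    frames = component["frame_power"]
    return {"id": component["id"], "PW": sum(frames) / len(frames)}


parts = [
    {"id": "1a", "frame_power": [2.0, 4.0]},
    {"id": "1b", "frame_power": [1.0, 1.0]},
    {"id": "1c", "frame_power": [5.0]},
]
done = set_pauses_two_pass(
    parts,
    analyze=mean_power,
    distance_fn=lambda p, n: abs(p["PW"] - n["PW"]),
    pause_fn=lambda d: 0.1 + 0.02 * d,   # a5 = 0.1, a6 = 0.02 are made up
)
print([(c["id"], c.get("pau")) for c in done])
```

Compared with the sequential flow, no one-phrase delay buffer is needed, at the cost of holding the analysis results for all phrases before the first pause is computed.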

In the present embodiment, the acoustic feature amounts are obtained by having the acoustic feature amount detection unit 410 acoustically analyze the voice waveform data stored in advance in the voice component data. Alternatively, the voice waveform data may be acoustically analyzed beforehand to detect the data on the acoustic feature amounts shown in FIG. 3, and that data may be set in the voice component data and stored in the voice component data storage unit 30 for later use.
This removes the need to detect the acoustic feature amounts of the selected voice component data at every speech synthesis; they can be obtained simply by referring to the data set in the voice component data, which shortens the processing time required for speech synthesis.

FIG. 1 is a block diagram showing the configuration of the speech synthesizer of the present embodiment.
FIG. 2 is an explanatory diagram illustrating the structure of the read-out information.
FIG. 3 is a diagram showing the data structure of the voice component data.
FIG. 4 is a diagram schematically showing the structure of the voice waveform data contained in the voice component data.
FIG. 5 is an explanatory diagram illustrating the principle of setting the pause time length according to the present invention.
FIG. 6 is a block diagram showing the configuration of the pause time length calculation device of the present embodiment.
FIG. 7 is an explanatory diagram illustrating how the pause time length is set based on the pitch frequency.
FIG. 8 is an explanatory diagram illustrating how the pause time length is set based on the speech speed.
FIG. 9 is an explanatory diagram illustrating how the pause time length is set based on power.
FIG. 10 is an explanatory diagram illustrating how the pause time length is set based on the spectral envelope.
FIG. 11 is an explanatory diagram illustrating the relationship between the subjective evaluation experiment and the multiple regression analysis.
FIG. 12 is a flowchart showing the flow of processing in the speech synthesizer of the present embodiment.

Explanation of symbols

10 Read-out information input unit (read-out information acquisition means)
20 Voice component data acquisition unit (voice component data acquisition means)
30 Voice component data storage unit (voice component data storage means)
40 Pause time length calculation device
100 Speech synthesizer
410 Acoustic feature amount detection unit (acoustic feature amount acquisition means)
413 Pitch frequency detection unit
414 Speech speed detection unit
415 Power detection unit
416 Spectral envelope detection unit
430 Acoustic distance calculation unit (acoustic distance calculation means)
440 Pause time length calculation unit (pause time length calculation means)
441 Multiple regression calculation unit
450 Pause time length setting unit
P_A Preceding voice component data
P_B Subsequent voice component data

Claims

1. A pause time length calculation device that, when speech synthesis is performed by connecting speech component data in which speech waveforms of uttered predetermined units of text are recorded, calculates a pause time length to be inserted between the speech component data connected to each other, the device comprising:
acoustic feature acquisition means for acquiring, for each of the preceding speech component data and the succeeding speech component data among the speech component data connected to each other, at least one acoustic feature of pitch frequency, speech rate, power, or spectral envelope in the recorded speech waveform;
acoustic distance calculation means for calculating an acoustic distance that is the difference between the acoustic feature of the preceding speech component data and the acoustic feature of the succeeding speech component data, both acquired by the acoustic feature acquisition means; and
pause time length calculation means for calculating, based on the acoustic distance calculated by the acoustic distance calculation means and using a preset calculation formula, the pause time length to be inserted between the preceding speech component data and the succeeding speech component data.

2. The pause time length calculation device according to claim 1, wherein the calculation formula is a regression equation that uses the acoustic distance calculated by the acoustic distance calculation means as an explanatory variable.
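The chain described in claims 1 and 2 (features, then acoustic distance, then a regression formula) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `AcousticFeatures`, `acoustic_distance`, and `pause_length_s` are hypothetical, the distance is taken as an absolute per-feature difference, the result is floored at zero, and the coefficients are invented placeholders rather than values from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcousticFeatures:
    """Scalar features of one speech component (units are illustrative)."""
    pitch_hz: float
    speech_rate: float
    power_db: float


def acoustic_distance(preceding, following):
    """Per-feature distance; using the absolute difference is an assumption."""
    return (
        abs(preceding.pitch_hz - following.pitch_hz),
        abs(preceding.speech_rate - following.speech_rate),
        abs(preceding.power_db - following.power_db),
    )


def pause_length_s(distances, intercept, weights):
    """Linear regression: intercept + sum(weight_i * distance_i).

    Flooring at 0 seconds is an assumption, not something the claims state.
    """
    value = intercept + sum(w * d for w, d in zip(weights, distances))
    return max(0.0, value)


# Hypothetical coefficients; real ones would have to be fitted beforehand,
# for example by multiple regression on subjective evaluation results.
INTERCEPT = 0.2
WEIGHTS = (0.001, 0.01, 0.005)

p_a = AcousticFeatures(pitch_hz=200.0, speech_rate=8.0, power_db=60.0)
p_b = AcousticFeatures(pitch_hz=150.0, speech_rate=7.0, power_db=66.0)
pause = pause_length_s(acoustic_distance(p_a, p_b), INTERCEPT, WEIGHTS)
```

With these placeholder numbers the distances are 50, 1, and 6, so the pause comes out at roughly 0.29 seconds. A larger mismatch between the two components yields a longer pause only because the placeholder weights are positive.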

3. A pause time length calculation program for calculating a pause time length to be inserted between speech component data connected to each other when speech synthesis is performed by connecting speech component data in which speech waveforms of uttered predetermined units of text are recorded, the program causing a computer to function as:
acoustic feature acquisition means for acquiring, for each of the preceding speech component data and the succeeding speech component data among the speech component data connected to each other, at least one acoustic feature of pitch frequency, speech rate, power, or spectral envelope in the recorded speech waveform;
acoustic distance calculation means for calculating an acoustic distance that is the difference between the acoustic feature of the preceding speech component data and the acoustic feature of the succeeding speech component data, both acquired by the acoustic feature acquisition means; and
pause time length calculation means for calculating, based on the acoustic distance calculated by the acoustic distance calculation means and using a preset calculation formula, the pause time length to be inserted between the preceding speech component data and the succeeding speech component data.

4. A speech synthesizer that synthesizes speech by connecting speech component data in which speech waveforms of uttered predetermined units of text are recorded, the speech synthesizer comprising:
speech component data storage means for storing in advance the speech component data in which the speech waveforms are recorded;
reading information acquisition means for acquiring reading information consisting of text to be read out continuously in a determined order, or of information specifying the speech component data corresponding to the predetermined units of text constituting that text;
speech component data acquisition means for acquiring speech component data from the speech component data storage means based on the reading information acquired by the reading information acquisition means; and
the pause time length calculation device according to claim 1 or 2, which calculates the pause time length to be inserted between the speech component data that constitute the reading information and are acquired by the speech component data acquisition means,
wherein the pause time length calculated by the pause time length calculation device is set as the pause time length between the speech component data.
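As a rough illustration of how the elements of the synthesizer claim fit together, the following sketch looks up stored components in reading order and inserts silence of the calculated length at each joint. Everything named here is hypothetical: `synthesize`, the dictionary standing in for the storage means, the list of keys standing in for the reading information, and the `pause_calculator` callable standing in for the pause time length calculation device. Representing a pause as zero-valued samples is likewise an assumption.

```python
from typing import Callable, Dict, List, Sequence

Waveform = List[float]


def synthesize(reading_info: Sequence[str],
               storage: Dict[str, Waveform],
               pause_calculator: Callable[[Waveform, Waveform], float],
               sample_rate: int) -> Waveform:
    """Concatenate stored components, inserting a pause before each joint."""
    # Acquire the components named by the reading information, in order.
    components = [storage[key] for key in reading_info]
    output: Waveform = []
    for index, component in enumerate(components):
        if index > 0:
            # Pause length in seconds between the preceding and this component.
            pause_s = pause_calculator(components[index - 1], component)
            # Represent the pause as silent (zero) samples.
            output.extend([0.0] * max(0, round(pause_s * sample_rate)))
        output.extend(component)
    return output


# Toy data: two tiny "waveforms" and a constant 0.5 s pause at 4 samples/s.
storage = {"a": [0.1, 0.2], "b": [0.3]}
result = synthesize(["a", "b"], storage, lambda prev, nxt: 0.5, sample_rate=4)
```

Passing the pause calculator in as a callable keeps the concatenation step independent of which acoustic features or formula are used to obtain the pause length.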