JP2008139573A

JP2008139573A - Vocal quality conversion method, vocal quality conversion program and vocal quality conversion device

Info

Publication number: JP2008139573A
Application number: JP2006325884A
Authority: JP
Inventors: Satoshi Watanabe; 聡渡辺; Tsutomu Kaneyasu; 勉兼安; Takeshi Iwaki; 健岩木
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2006-12-01
Filing date: 2006-12-01
Publication date: 2008-06-19

Abstract

PROBLEM TO BE SOLVED: To provide a voice quality conversion method, program, and device capable of adding variation to synthesized speech by converting voiced sound into unvoiced sound.
SOLUTION: The method for converting voiced sound into unvoiced sound comprises: an input step of inputting an original sound waveform; a prediction analysis step of predicting a transfer function from the original sound waveform input in the input step; a residual signal extraction step of extracting a residual signal by using the original sound waveform and the output of the prediction analysis step; a white noise generation step of outputting a white noise signal corresponding to the power of the residual signal; and a speech synthesis step of performing speech synthesis based on the output of the white noise generation step and the transfer function predicted in the prediction analysis step.
COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a method, a program, and a device for converting voiced sound into unvoiced sound.

Conventionally, the following has been proposed as a technique for converting whispered speech (unvoiced sound) into normal speech (voiced sound) (Patent Document 1): "The whispered speech analysis means 12 analyzes the whispered speech in a large amount of training data and extracts whisper spectrum information. The normal speech analysis means 13 analyzes the normal speech in the training data and extracts normal spectrum information. The mapping function estimation means 14 estimates and stores a mapping function from a large number of pairs of whisper spectrum information and normal spectrum information. When whispered speech is input, the input speech analysis means 11 analyzes it and extracts input spectrum information. The spectrum conversion means 15 then converts the input spectrum information into converted spectrum information using the mapping function. The speech synthesis means 16 synthesizes normal speech from the converted spectrum information and outputs it."
Patent Document 1: JP 10-254473 A (abstract)

However, although the above prior art can turn unvoiced sound into voiced sound, it cannot turn voiced sound into unvoiced sound.
There has therefore been a demand for a voice quality conversion method, program, and device that can add variation to synthesized speech by converting voiced sound into unvoiced sound.

The voice quality conversion method according to the present invention is a method for converting voiced sound into unvoiced sound, characterized by comprising:
an input step of inputting an original sound waveform;
a prediction analysis step of predicting a transfer function from the original sound waveform input in the input step;
a residual signal extraction step of extracting a residual signal using the original sound waveform and the output of the prediction analysis step;
a white noise generation step of outputting a white noise signal; and
a speech synthesis step of performing speech synthesis based on the output of the white noise generation step and the transfer function predicted in the prediction analysis step.

According to the voice quality conversion method of the present invention, voiced sound can be converted into unvoiced sound and output, so the synthesized speech output gains a new variation: it "sounds like a whisper".

Embodiment 1.
FIG. 1 is a functional block diagram of a voice quality conversion device 100 according to Embodiment 1 of the present invention.
The voice quality conversion device 100 includes a waveform storage unit 101, an LPC coefficient calculation unit 102, a residual signal extraction unit 103, a white noise generator 104, a power calculation unit 105, an amplitude conversion unit 106, and a speech synthesis unit 107.
The waveform storage unit 101 receives original sound waveform data and stores it as digital data, for example at 16 kHz and 16 bits.
The LPC coefficient calculation unit 102 reads the original sound waveform data from the waveform storage unit 101, performs LPC analysis (linear prediction analysis), and calculates LPC coefficients. The analysis parameters are, for example, an analysis window length of 20 ms and an analysis order of 15.
The residual signal extraction unit 103 extracts a residual signal using an LPC analysis filter.
The white noise generator 104 generates white noise of an arbitrary time length.
The power calculation unit 105 has a function of calculating the average power (root mean square) of given signal sequence data, and a function of calculating, from the average powers of two sets of signal sequence data, the average amplitude ratio of the two.
The amplitude conversion unit 106 performs an amplitude conversion operation using given signal sequence data and the amplitude ratio value.
The speech synthesis unit 107 performs speech synthesis using an LPC synthesis filter.

The waveform storage unit 101, LPC coefficient calculation unit 102, residual signal extraction unit 103, white noise generator 104, power calculation unit 105, amplitude conversion unit 106, and speech synthesis unit 107 may be implemented in hardware such as circuit devices, or as software that includes buffers for waveform data input and output.
When they are implemented as software, a program that realizes the functions of these units is stored on an HDD (Hard Disk Drive) or the like, and an arithmetic unit such as a microcomputer or CPU reads the program and executes the processing corresponding to each unit's function according to the program's instructions. The processing flow may be configured in the same way as the flow shown in FIG. 1.
The same applies to the embodiments that follow.

Before describing the detailed configuration and operation of the voice quality conversion device 100, the characteristics of voiced and unvoiced sound are explained here to make the processing performed by the device easier to understand.

Voiced sound is speech accompanied by vibration of the vocal cords. The speech signal waveform of voiced sound exhibits a periodicity, called pitch, that corresponds to this vocal cord vibration.
Unvoiced sound, by contrast, is speech that is not accompanied by vocal cord vibration. Because the vocal cords do not vibrate, the speech signal shows no periodicity arising from such vibration.
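The presence or absence of pitch periodicity can be illustrated numerically. The following is a minimal sketch, not part of the patent: it assumes a pure sine wave as a stand-in for a voiced waveform and seeded Gaussian noise as a stand-in for an unvoiced one, and compares their normalized autocorrelation at the pitch lag. The function name and the signals are hypothetical.

```python
import math
import random

def normalized_autocorr(x, lag):
    """Autocorrelation of x at the given lag, divided by the zero-lag energy."""
    num = sum(x[t] * x[t - lag] for t in range(lag, len(x)))
    den = sum(v * v for v in x)
    return num / den if den else 0.0

period = 100  # assumed pitch period in samples (illustrative value)
voiced_like = [math.sin(2 * math.pi * t / period) for t in range(1600)]
rng = random.Random(0)
unvoiced_like = [rng.gauss(0.0, 1.0) for _ in range(1600)]

# A periodic signal correlates strongly with itself one period later;
# white noise does not.
print(normalized_autocorr(voiced_like, period))
print(normalized_autocorr(unvoiced_like, period))
```

A value near 1 at the pitch lag indicates strong periodicity, while a value near 0 indicates none.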

Normal speech contains many voiced sounds, vowels in particular. A whisper, on the other hand, is speech that contains no voiced sound. The speech signal waveform of a whisper therefore shows no pitch periodicity.
For example, even for the same utterance "konnichiwa", the waveform of the speech signal differs greatly between normal speech and a whisper, depending on whether pitch periodicity is present. However, both are heard by the human ear as "konnichiwa", so from a phonological standpoint they share common acoustic characteristics. In other words, normal speech and whispering can be said to be similar yet distinct kinds of speech.

Focusing on these characteristics of voiced and unvoiced sound, the present invention proposes a voice conversion method, program, and device that convert voiced sound into unvoiced sound by removing the pitch periodicity described above from the voiced sound.
Converting voiced sound into unvoiced sound gives the human ear the impression that normally uttered speech has been turned into a whisper. Because a "whisper" can thus be generated by speech synthesis, synthesized speech gains a new variation and its expressive range is widened.

The description now returns to the details of the voice quality conversion device 100.

FIG. 2 is a block diagram showing a specific configuration of the portion corresponding to the LPC coefficient calculation unit 102 through the residual signal extraction unit 103.
The LPC coefficient calculation unit 102 receives, from the waveform storage unit 101, data corresponding to an original sound waveform signal of a predetermined time length, and estimates linear prediction coefficients (LPC coefficients) α1 to αn for a predetermined order n (here assumed to be about n = 15). Obtaining the LPC coefficients α1 to αn corresponds to obtaining the transfer function of the vocal tract on the time axis.
Data corresponding to the residual signal can be obtained from the original sound waveform signal data and the predicted waveform signal data, in which the original sound waveform is predicted using the obtained LPC coefficients α1 to αn.
The block that calculates the LPC coefficients α1 to αn corresponds to the LPC coefficient calculation unit 102 in FIG. 1. The portion that obtains the residual signal from the original sound waveform signal and the predicted waveform signal corresponds to the residual signal extraction unit 103.

Next, the operation of the voice quality conversion device 100 is described step by step.

(1) Storage of original sound waveform data
The waveform storage unit 101 receives the original sound waveform data and temporarily stores it as digital data, for example at 16 kHz and 16 bits. The stored original sound waveform data is output to the LPC coefficient calculation unit 102 and the residual signal extraction unit 103.
As for storage granularity, the unit may store all of the original sound waveform data, or it may receive data of a predetermined frame length and output it to the LPC coefficient calculation unit 102 and the residual signal extraction unit 103 one frame at a time, repeating until the data ends.
Although not shown in FIG. 1, an input unit for receiving the original sound waveform data is provided as necessary. The input unit may be, for example, a wired digital or analog interface, or a network interface such as a LAN interface.
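The frame-by-frame option can be sketched as follows. This is an illustrative sketch only: it assumes non-overlapping frames and reuses the example figures of 16 kHz sampling and a 20 ms analysis length; the function name is hypothetical.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=20):
    """Split a sample sequence into consecutive, non-overlapping frames.

    The last frame may be shorter than the others.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # 16 kHz x 20 ms = 320 samples
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

frames = split_into_frames(list(range(1000)))
print(len(frames), len(frames[0]), len(frames[-1]))
```

Practical LPC front ends often use overlapping, windowed frames instead; the patent text does not specify either choice.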

(2) Calculation of LPC coefficients
The LPC coefficient calculation unit 102 receives original sound waveform data of a predetermined frame length from the waveform storage unit 101, performs LPC analysis using a configuration such as that shown in FIG. 2, and calculates the LPC coefficients α1 to αn.
The predicted waveform signal data, in which the original sound waveform is predicted using the obtained LPC coefficients α1 to αn, is output to the residual signal extraction unit 103.
Regardless of the granularity at which the waveform storage unit 101 stores the original sound waveform data, LPC analysis is performed on a predetermined frame length, so the data received from the waveform storage unit 101 for a single LPC analysis need only cover the frame length required for that analysis.
LPC analysis may be executed successively, starting the next frame as soon as the analysis of one frame is complete, or it may wait until the processing of step (3) and later has finished. Which approach to use depends on factors such as the computation speed and how synchronization of the audio output buffer is implemented, so a suitable approach may be chosen as appropriate.
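One common way to estimate LPC coefficients for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The patent does not state which estimation algorithm is used, so the following is only a sketch under that assumption. It uses the convention that the predicted sample is the sum of alpha_k times x[t-k], omits windowing, and checks itself on a synthetic second-order signal whose true coefficients are known; all names are hypothetical.

```python
import random

def autocorrelation(frame, max_lag):
    """r[lag] = sum over t of frame[t] * frame[t - lag], for lag = 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]

def lpc_coefficients(frame, order):
    """Return [alpha_1, ..., alpha_order] with x[t] ~ sum_k alpha_k * x[t-k]."""
    r = autocorrelation(frame, order)
    alpha = [0.0] * (order + 1)  # alpha[0] is unused
    err = r[0]                   # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:           # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))) / err
        updated = alpha[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = alpha[j] - k * alpha[i - j]
        alpha = updated
        err *= 1.0 - k * k
    return alpha[1:]

# Synthetic check: x[t] = 1.3 x[t-1] - 0.6 x[t-2] + noise (a stable AR(2) process).
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
estimated = lpc_coefficients(x, order=2)
print(estimated)  # should be close to [1.3, -0.6]
```

For speech, the same function would be called per frame with `order=15`, following the example parameters in the text.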

(3) Extraction of the residual signal
The residual signal extraction unit 103 extracts residual signal data using the original sound waveform data received from the waveform storage unit 101 and the predicted waveform data output by the LPC coefficient calculation unit 102.
The resulting residual signal data is output to the power calculation unit 105.
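The residual can be viewed as the difference between each original sample and its linear prediction. The sketch below assumes the prediction convention x_hat[t] = sum of alpha_k times x[t-k] and treats samples before the start of the frame as zero; both are assumptions for illustration, and the function names are hypothetical.

```python
def predict(frame, alpha):
    """Predict each sample from the preceding samples of the same frame.

    alpha[k-1] weights the sample k steps back; samples before the frame are 0.
    """
    return [sum(a * frame[t - k] for k, a in enumerate(alpha, start=1) if t - k >= 0)
            for t in range(len(frame))]

def residual(frame, alpha):
    """Residual = original sample minus predicted sample."""
    return [x - p for x, p in zip(frame, predict(frame, alpha))]

# A frame that follows x[t] = 0.5 * x[t-1] exactly leaves only the initial impulse.
frame = [1.0, 0.5, 0.25, 0.125]
print(residual(frame, [0.5]))
```

For voiced speech, the residual obtained this way retains the pitch pulses, which is why replacing it with noise removes the periodicity.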

(4) Output of white noise
The white noise generator 104 outputs white noise waveform data of an arbitrary time length to the power calculation unit 105 and the amplitude conversion unit 106.
The white noise may be output at the moment the residual signal extraction unit 103 outputs the residual signal data to the power calculation unit 105, or it may be output continuously at all times.
The time length of the white noise waveform data is discussed in step (5) below.

(5) Calculation of the average power ratio
The power calculation unit 105 calculates the average power (= Pr) of the residual signal from the residual signal data output by the residual signal extraction unit 103, and also calculates the average power (= Pn) of the white noise waveform data output by the white noise generator 104.
The time length of the data used in the calculation is matched to the time length over which the LPC coefficient calculation unit 102 performs LPC analysis. The calculation may use the mean square of the sample values, or the averages of the immediately preceding several frames may be retained and used to obtain a smoothed value.
Next, the power calculation unit 105 calculates the average amplitude ratio (= Pr / Pn) of the two sets of signal sequence data from their average powers.
The resulting average amplitude ratio is output to the amplitude conversion unit 106.
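The ratio calculation can be sketched as follows. This sketch assumes the "root mean square" definition of average power given in the description of the power calculation unit 105, under which Pr / Pn is directly an amplitude ratio. If mean-square values were used instead, the amplitude scale would be the square root of their ratio. The function names are hypothetical.

```python
import math

def rms(samples):
    """Root mean square of a non-empty sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def amplitude_ratio(residual, noise):
    """Pr / Pn, where Pr and Pn are the RMS values of residual and noise."""
    p_r = rms(residual)
    p_n = rms(noise)
    return p_r / p_n if p_n > 0 else 0.0

print(rms([3.0, -3.0, 3.0, -3.0]))               # RMS of a +/-3 square wave
print(amplitude_ratio([2.0, -2.0], [1.0, -1.0]))  # residual is twice as large
```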

(6) Amplitude conversion of the white noise
The amplitude conversion unit 106 performs an amplitude conversion operation using the average amplitude ratio (= Pr / Pn) received from the power calculation unit 105 and the white noise waveform data received from the white noise generator 104. Specifically, it multiplies each sample value of the white noise waveform data by the average amplitude ratio, which adjusts the power scale and yields new noise waveform data.
The resulting amplitude-converted noise waveform data is output to the speech synthesis unit 107.
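Steps (4) to (6) together amount to producing noise whose level matches that of the residual. The following sketch assumes Gaussian white noise from a seeded generator and RMS as the power measure; the patent specifies neither the noise distribution nor the generator, and the names are hypothetical.

```python
import math
import random

def rms(samples):
    """Root mean square of a non-empty sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def scaled_white_noise(residual, rng):
    """Return white noise of the same length and RMS level as the residual."""
    noise = [rng.gauss(0.0, 1.0) for _ in residual]
    ratio = rms(residual) / rms(noise)   # average amplitude ratio Pr / Pn
    return [ratio * v for v in noise]    # multiply every noise sample by the ratio

residual = [0.2, -0.1, 0.4, -0.3, 0.05, 0.0, -0.25, 0.15]
excitation = scaled_white_noise(residual, random.Random(1))
print(rms(residual), rms(excitation))  # the two levels should match
```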

(7) Speech synthesis
The speech synthesis unit 107 performs speech synthesis by passing the amplitude-converted noise waveform data received from the amplitude conversion unit 106 through an LPC synthesis filter constructed from the LPC coefficients α1 to αn calculated by the LPC coefficient calculation unit 102.
The synthesized speech waveform data is the final output of the voice quality conversion device 100.
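An LPC synthesis filter is an all-pole recursive filter driven by the excitation signal. The sketch below assumes the convention y[t] = e[t] + sum of alpha_k times y[t-k] and a zero initial filter state; a frame-based implementation would normally carry the state over from the previous frame. The function name is hypothetical.

```python
def lpc_synthesize(excitation, alpha):
    """All-pole filter: y[t] = e[t] + sum_k alpha[k-1] * y[t-k], zero initial state."""
    y = []
    for t, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(alpha, start=1):
            if t - k >= 0:
                acc += a * y[t - k]
        y.append(acc)
    return y

# A unit impulse through a one-coefficient filter decays geometrically.
print(lpc_synthesize([1.0, 0.0, 0.0, 0.0], [0.5]))
```

In the device, the excitation would be the amplitude-converted white noise rather than an impulse.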

In the synthesized speech obtained by the above processing, the periodicity of the pitch component has been removed because the residual signal has been replaced with white noise. To the human ear, the original sound waveform, which was voiced, therefore sounds as though it has been converted into an unvoiced sound like a whisper.
Because the LPC synthesis filter used by the speech synthesis unit 107 is constructed from the LPC coefficients α1 to αn calculated by the LPC coefficient calculation unit 102, the phonological character of the original sound waveform is preserved. For example, if the original sound is a voiced "konnichiwa", the converted synthesized speech is heard as a whispered "konnichiwa".

Embodiment 1 uses LPC analysis to predict the transfer function of the vocal tract, but LPC analysis is not essential; a method using, for example, PARCOR (partial autocorrelation) coefficients or LSP (line spectrum pair) coefficients may be used instead.
In other words, it is sufficient that the phonological character of the original sound waveform is preserved, and the method of predicting the transfer function is not limited to LPC analysis.
The same applies to the embodiments that follow.

As described above, according to Embodiment 1, voiced sound can be converted into unvoiced sound and output, so the synthesized speech output gains a new variation: it "sounds like a whisper".
More variation in synthesized speech makes speech-based user interfaces more approachable for people, and the technique can be expected to find use as a man-machine interface in the various devices encountered in daily life.

Embodiment 2.
Embodiment 2 of the present invention describes the configuration of a voice quality conversion device that can insert silent portions at arbitrary timings.

FIG. 3 is a functional block diagram of the voice quality conversion device 100 according to Embodiment 2 of the present invention.
The voice quality conversion device 100 according to Embodiment 2 additionally has a silence insertion unit 108 on the output side of the speech synthesis unit 107. The rest of the configuration is the same as that described for FIG. 1 of Embodiment 1, so the same reference numerals are used and the description is omitted.

The silence insertion unit 108 receives the voice-quality-converted synthesized speech waveform data from the speech synthesis unit 107, receives mora information and conversion rule information from outside the voice quality conversion device 100, and uses this information to insert silent portions into the synthesized speech before outputting it.
The mora information and the conversion rule information are described with reference to FIG. 4.

FIG. 4 illustrates the mora information and the conversion rule information.
A mora is a segmental unit of sound that has a fixed phonological duration, and corresponds to a "beat" of speech. Mora information is thus timing information about the beats of the uttered speech. In general, a geminate consonant (sokuon), a moraic nasal (hatsuon), and a long vowel (chōon) are each counted as one beat even if no sound is being produced.
For example, the original utterance "Okidenki-chan" ((1) in FIG. 4) is pronounced in seven beats: "o", "ki", "de", "n", "ki", "cha", "n".
The mora information expresses, on the time axis, where in the utterance the beat boundaries fall. For example, if "Okidenki-chan" above is uttered in 0.7 seconds, the mora information states that there are six beat boundaries at equal intervals of 0.1 seconds ((2) in FIG. 4).
Given this mora information, the silence insertion unit 108 knows at which timings the beat boundaries lie.
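The example in FIG. 4 (2) can be reproduced with a few lines of arithmetic. This sketch assumes equal-length morae, as in the example; real mora information would carry measured boundary times. The function name and the list representation are hypothetical.

```python
def evenly_spaced_boundaries(total_s, mora_count):
    """Times (in seconds) of the boundaries between morae, assuming equal lengths."""
    return [round(total_s * i / mora_count, 6) for i in range(1, mora_count)]

morae = ["o", "ki", "de", "n", "ki", "cha", "n"]      # "Okidenki-chan", 7 beats
boundaries = evenly_spaced_boundaries(0.7, len(morae))
print(boundaries)  # six boundaries, 0.1 s apart
```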

The conversion rule information specifies how many seconds of silence to insert at each beat boundary given by the mora information.
For example, as shown in FIG. 4 (3), an insertion rule can be set for individual beats: a 0.06-second silent portion is inserted at the end of the first and fourth beats, and a 0.04-second silent portion at the end of the second, third, and fifth beats.
Alternatively, the time corresponding to one mora can be fixed in advance and the rule expressed in rule-based form, such as "after every mora, insert a silent portion one mora long" or "after every mora, insert a silent portion two morae long, except after the second-to-last mora".
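Both styles of conversion rule can be represented as a mapping from beat number to silence duration. The dictionary layout and the function below are hypothetical illustrations; the patent does not define a data format for the conversion rule information.

```python
# Per-beat rule from FIG. 4 (3): beat number (1-based) -> seconds of silence after it.
per_beat_rule = {1: 0.06, 4: 0.06, 2: 0.04, 3: 0.04, 5: 0.04}

def rule_based(mora_count, mora_s, n_morae_silence, skip_from_end=()):
    """Build a per-beat rule from a general rule.

    Every beat gets n_morae_silence * mora_s seconds of silence after it,
    except beats whose position counted from the end (1 = last) is listed
    in skip_from_end.
    """
    rule = {}
    for beat in range(1, mora_count + 1):
        from_end = mora_count - beat + 1
        if from_end not in skip_from_end:
            rule[beat] = n_morae_silence * mora_s
    return rule

# "Two morae of silence after every mora, except after the second-to-last mora."
rule = rule_based(mora_count=7, mora_s=0.1, n_morae_silence=2, skip_from_end=(2,))
print(sorted(rule))
```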

The silence insertion unit 108 receives the voice-quality-converted synthesized speech waveform data from the speech synthesis unit 107, inserts silent portions at the positions determined by the mora information and the conversion rule information, and outputs the result.
To prevent noise, amplitude conversion such as a fade-out and fade-in may also be applied before and after each inserted silent portion. This yields smooth converted speech even when silent portions are inserted.
The output of the silence insertion unit 108 is the final output of the voice quality conversion device 100.
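Silence insertion with fades can be sketched as below. The sketch assumes boundaries and silence lengths expressed in samples and a linear fade shape; the patent only says "fade-in / fade-out" without fixing a shape or length, and the function name is hypothetical.

```python
def insert_silence(samples, boundaries, silence_lens, fade_len=0):
    """Insert silence_lens[i] zero samples at sample index boundaries[i].

    A linear fade-out is applied to the fade_len samples before each gap and a
    linear fade-in to the fade_len samples after it.
    """
    out = [float(s) for s in samples]
    # Work from the last boundary backwards so earlier indices stay valid.
    for pos, gap in sorted(zip(boundaries, silence_lens), reverse=True):
        for i in range(fade_len):
            gain = i / fade_len  # 0 next to the gap, rising with distance from it
            if pos - 1 - i >= 0:
                out[pos - 1 - i] *= gain
            if pos + i < len(out):
                out[pos + i] *= gain
        out[pos:pos] = [0.0] * gap
    return out

print(insert_silence([1, 1, 1, 1, 1, 1, 1, 1], [4], [2], fade_len=2))
```

At a 16 kHz sampling rate, a 0.06-second silent portion corresponds to 960 samples.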

In Embodiment 2 the conversion rule is described as being supplied from outside the voice quality conversion device 100, but storage means may instead be provided inside the device and the rule stored there in advance.
Since the mora information represents the beats of the original sound waveform, it can normally be provided only by the side that supplies the original sound waveform. However, if the original sound can be known in advance, for example because its content is limited, the mora information may be held inside the voice quality conversion device 100.
The mora information and the conversion rule information may be received from outside by, for example, providing an operation unit on the voice quality conversion device 100 and having the user enter them directly, or by obtaining them through transmission and reception means over a network, such as a LAN interface. They may also be read from storage means provided outside the voice quality conversion device 100.
Furthermore, when the original sound is itself synthesized speech, information such as phoneme durations is obtained in the course of speech synthesis, so in that case only the conversion rule information needs to be supplied from outside.

In Embodiment 2 the silence insertion unit 108 is provided inside the voice quality conversion device 100. The same effect can be obtained by giving the voice quality conversion device 100 the same configuration as in Embodiment 1 and connecting a device equivalent to the silence insertion unit 108 to its output terminal.

As described above, according to Embodiment 2, silent portions can be inserted into the devoiced synthesized speech to give the auditory effect of each sound being uttered in isolation, so the expressive range of the synthesized speech is widened further than in Embodiment 1.
In addition, applying processing such as fade-out and fade-in to the waveform before and after each inserted silent portion prevents the sound from breaking off unnaturally, so the result sounds smooth and places little strain on the listener.

Embodiment 3.
Embodiment 3 of the present invention describes the configuration of a voice quality conversion device that can lengthen the sound of an arbitrary portion.
Lengthening here means converting the utterance "Okidenki-chan" into a drawn-out "Ō-kī-dē-n-kī-chā-n", with each mora prolonged.

FIG. 5 is a functional block diagram of the voice quality conversion device 100 according to Embodiment 3. Only the differences from FIG. 1 described in Embodiment 1 are explained.
The voice quality conversion device 100 according to Embodiment 3 includes an LPC coefficient complementing unit 109.
The LPC coefficients α1 to αn calculated by the LPC coefficient calculation unit 102 are output to the LPC coefficient complementing unit 109.
The LPC coefficient complementing unit 109 receives the LPC coefficients α1 to αn, the mora information, and the conversion rule information, and performs the calculation described below to interpolate the LPC coefficients α1 to αn on the time axis.
The white noise generator 104 receives the mora information and the conversion rule information, and outputs white noise signal data whose length corresponds to the duration after lengthening.

Next, the interpolation of the LPC coefficients α1 to αn performed by the LPC coefficient complementing unit 109 is described in detail.

The LPC coefficient complementing unit 109 stretches the input LPC coefficients α1 to αn along the time axis. For example, let the LPC coefficients at a time t be α1(t), ..., αn(t).
If t is given at 5 ms intervals and the duration of the phoneme is 100 ms, the LPC coefficients of that phoneme are expressed by Equation 1.
α1(t), ..., αn(t)    (Equation 1)
t = 0, 5, 10, ..., 100

To stretch this duration to twice its length, 200 ms, the coefficients are converted as follows.
(1) Simple stretching
First, the coefficients are simply stretched by a factor of two on the time axis. The stretched LPC coefficients are expressed by Equation 2.
α1(t), ..., αn(t)    (Equation 2)
t = 0, 10, 20, ..., 200
(2) In Equation 2 the LPC coefficients are spaced 10 ms apart and no values exist for t = 5, 15, ..., 195, so these are filled in for each of the coefficients α1 to αn by Equation 3.
αn(t) = (αn(t − 5) + αn(t + 5)) / 2    (Equation 3)
t = 5, 15, 25, ..., 195

With the above interpolation, LPC coefficients at 5 ms intervals are obtained even after the duration has been doubled on the time axis.
The actual amount of stretching is determined by the mora information and the conversion rule information. That is, these pieces of information are used to stretch the coefficients so that they correspond to the duration of the original sound after lengthening. The same applies to the white noise.
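The two-step procedure of Equations 2 and 3 can be sketched as follows. This is only an illustrative sketch, not the implementation of the LPC coefficient complementing unit 109: the function name `stretch_lpc_2x` and the array layout (one row per 5 ms frame, one column per coefficient) are assumptions made for the example.

```python
import numpy as np


def stretch_lpc_2x(frames):
    """Double the duration of a coefficient track (hypothetical helper).

    frames: array of shape (T, n), one row per 5 ms frame and one column
    per coefficient (assumed layout). Returns an array of shape (2T - 1, n).
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 2 or len(frames) < 1:
        raise ValueError("expected a non-empty (T, n) array")
    out = np.empty((2 * len(frames) - 1, frames.shape[1]))
    # Equation 2: the original frames now sit 10 ms apart (even rows).
    out[0::2] = frames
    # Equation 3: each missing 5 ms point is the mean of its two neighbours.
    out[1::2] = (frames[:-1] + frames[1:]) / 2.0
    return out
```

For the 100 ms example in the text, 21 frames (t = 0, 5, ..., 100) become 41 frames (t = 0, 5, ..., 200). The same averaging could be applied to PARCOR or LSP tracks instead of raw LPC coefficients.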

The overall operation of the voice quality conversion apparatus 100 according to Embodiment 3 is substantially the same as that described in Embodiment 1. The differences are that, as described above, the output of the white noise generator 104 is based on the mora information and the conversion rule information, and that the speech synthesis unit 107 performs speech synthesis using the LPC coefficients after interpolation by the LPC coefficient complementing unit 109.
By performing the above processing, devoiced synthesized speech is obtained in which the portion specified by the mora information and the conversion rule information has been lengthened.

The voice quality conversion apparatus 100 according to Embodiment 3 lengthens the relevant mora by stretching and interpolating the spectral parameters on the time axis. For this reason, it is preferable to use PARCOR coefficients or LSP coefficients, which generally interpolate better than LPC coefficients.

As described above, according to Embodiment 3, devoiced synthesized speech can be lengthened, so the variation in the tone of the synthesized speech is broadened further.

Embodiment 4.
Embodiment 4 of the present invention describes the configuration of a voice quality conversion apparatus that determines, for each frame of the original sound waveform, whether the frame is voiced or unvoiced, and performs the devoicing process only on voiced frames.

FIG. 6 is a functional block diagram of the voice quality conversion apparatus 100 according to Embodiment 4.
The configuration of the voice quality conversion apparatus 100 in FIG. 6 is substantially the same as that in FIG. 1 of Embodiment 1. The main difference is that a voiced/unvoiced determination unit 110 is newly provided between the residual signal extraction unit 103 and the power calculation unit 105.
The residual signal extraction unit 103 is configured to output both the original sound waveform data and the residual signal waveform data to the voiced/unvoiced determination unit 110.
The voiced/unvoiced determination unit 110 receives the original sound waveform data or the residual signal waveform data from the residual signal extraction unit 103 and determines, for each frame, whether the frame is voiced or unvoiced, for example by computing the autocorrelation function of the frame.
If the frame is voiced, the residual signal waveform data received from the residual signal extraction unit 103 is output to the power calculation unit 105. The subsequent processing is the same as in Embodiment 1.
If the frame is unvoiced, no further devoicing is required, so the original sound waveform data is output as it is to the speech synthesis unit 107. The speech synthesis unit 107 outputs the received original sound waveform data unchanged.
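The text names the autocorrelation function only as one possible basis for the voiced/unvoiced decision and gives no details. The sketch below shows one common way such a test can be built: a frame is treated as voiced when its normalized autocorrelation has a strong peak at a lag within a plausible pitch range. The function name `is_voiced`, the pitch range of 60 to 400 Hz, the 16 kHz sample rate, and the threshold of 0.3 are all assumptions for illustration and do not come from the source.

```python
import numpy as np


def is_voiced(frame, sample_rate=16000, f0_min=60.0, f0_max=400.0, threshold=0.3):
    """Return True if the frame looks periodic (hypothetical decision rule)."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                      # remove any DC offset
    energy = float(np.dot(x, x))
    if energy == 0.0:
        return False                      # a silent frame is not voiced
    # Autocorrelation for lags 0..N-1, normalized so that lag 0 equals 1.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:] / energy
    lo = int(sample_rate / f0_max)        # shortest pitch period, in samples
    hi = min(int(sample_rate / f0_min), len(x) - 1)
    if hi <= lo:
        return False                      # frame too short to judge
    # Voiced speech shows a strong peak at the pitch lag; noise does not.
    return bool(ac[lo:hi + 1].max() > threshold)
```

Frames for which `is_voiced` returns True would go on to the power calculation unit 105, and the remaining frames would bypass the devoicing path.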

Embodiment 4 has been described for a configuration similar to FIG. 1 of Embodiment 1. However, the same processing can also be performed with the configurations of the other embodiments by providing the voiced/unvoiced determination unit 110 between the residual signal extraction unit 103 and the power calculation unit 105.

As described above, according to Embodiment 4, the devoicing conversion is applied only to voiced frames, so speech that does not need to be converted is not converted further, and the resulting loss of sound quality is avoided.
In addition, because the conversion is performed only when necessary, machine resources are used efficiently, and the voice quality conversion apparatus can be built with an arithmetic unit of low processing capability, which contributes to downsizing and cost reduction of the entire apparatus. The same effect is obtained when the apparatus is implemented as software, because the conversion can be executed by a CPU or the like with low processing capability.

Embodiment 5.
Embodiment 5 of the present invention describes the configuration of a voice quality conversion apparatus that can mix the original sound waveform with the synthesized waveform and output the result.

FIG. 7 is a functional block diagram of the voice quality conversion apparatus 100 according to Embodiment 5.
The configuration of the voice quality conversion apparatus 100 in FIG. 7 is substantially the same as that in FIG. 1 of Embodiment 1, except that a voiced/unvoiced mixing unit 111 is newly provided on the output side of the speech synthesis unit 107.
The voiced/unvoiced mixing unit 111 receives the original sound waveform data from the waveform storage unit 101 and the synthesized speech waveform data from the speech synthesis unit 107. It then mixes the received data according to Equation 4 and outputs the result. The output of the voiced/unvoiced mixing unit 111 is the final output of the voice quality conversion apparatus 100.
Sn = βOn + (1−β)UVn (Equation 4)
Sn: mixed speech waveform
On: original sound waveform
UVn: synthesized speech waveform (unvoiced waveform)
β: mixing coefficient (0 <= β <= 1)
Because the degree of mixing between the original sound (voiced) and the synthesized sound (unvoiced) can be set freely through the mixing coefficient β, the variation of the synthesized speech is broadened.
The value of the mixing coefficient β may be variable. It may change randomly over time, or the user may enter a setting, for example via a network.
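Equation 4 is a sample-by-sample weighted sum, which can be sketched as below. The function name `mix_voiced_unvoiced` and the use of NumPy arrays are assumptions for illustration; the sketch also assumes that the two waveforms have the same length, which may not hold when silence insertion or lengthening has been applied.

```python
import numpy as np


def mix_voiced_unvoiced(original, unvoiced, beta):
    """Compute Sn = beta * On + (1 - beta) * UVn for every sample n."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must satisfy 0 <= beta <= 1")
    o = np.asarray(original, dtype=float)
    uv = np.asarray(unvoiced, dtype=float)
    if o.shape != uv.shape:
        raise ValueError("waveforms must have the same length")  # assumption of this sketch
    return beta * o + (1.0 - beta) * uv
```

With β = 1 the output is the original voiced waveform, and with β = 0 it is the fully devoiced waveform.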

In Embodiments 1 to 5 above, the source of the original sound waveform may be recorded natural speech uttered normally, or may itself be synthesized speech.
In the latter case, duration information is obtained during speech synthesis, so the mora information can be provided in that process. In the former case, the mora information must be supplied separately from outside. As described in Embodiment 2, it may be supplied by direct input or by input via a network.
When the source of the original sound waveform is synthesized speech, the speech synthesis apparatus and the voice quality conversion apparatus according to the present invention may be connected via a network or the like so that the original sound waveform is input directly to the voice quality conversion apparatus. The mora information can also be supplied directly from the speech synthesis apparatus to the voice quality conversion apparatus.

The "conversion rule information" described in Embodiments 2 and 3 may be entered directly or via a network, or may be generated automatically by the voice quality conversion apparatus 100 itself. For example, the following rules are conceivable.
(1) Stretch each beat by a factor of N (for example, N = 2), leaving only the last two morae unchanged.
・ "おきでんきさん" (Okidenki-san) -> "お・・き・・で・・ん・・き・・さん"
・ "おきでんきちゃん" (Okidenki-chan) -> "お・・き・・で・・ん・・き・・｜ちゃ｜ん"
(2) Stretch only the second beat from the end. The length of the stretched part is one mora.
・ "おきでんきさん" (Okidenki-san) -> "おきでんきさーん" (Okidenki-saan)
・ "さいとーさーん" (Saitoo-saan) -> "さいとーさーーん" (Saitoo-saaan)
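The two example rules can be sketched as functions that map a list of morae to a target length per mora. The representation (a list of lengths in mora units, where 1.0 means unchanged) and both function names are assumptions made for this illustration; the source does not specify how the conversion rule information is encoded.

```python
def rule_stretch_all_but_last_two(morae, n=2):
    """Rule (1): every mora becomes n times as long, except the last two."""
    keep = min(2, len(morae))             # morae left unchanged at the end
    return [float(n)] * (len(morae) - keep) + [1.0] * keep


def rule_lengthen_second_from_last(morae):
    """Rule (2): only the second mora from the end gains one mora of length."""
    lengths = [1.0] * len(morae)
    if len(morae) >= 2:
        lengths[-2] = 2.0                 # original mora plus a one-mora extension
    return lengths
```

For the seven morae of "おきでんきさん", rule (1) with N = 2 leaves "さ" and "ん" unchanged, and rule (2) lengthens only "さ".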

One reason for applying modifications such as silence insertion or lengthening after devoicing is that unvoiced sound is easy to modify. Unvoiced sound is less prone to the discontinuities, such as those related to pitch synchronization, that are inherent to voiced sound.
Thus, the voice quality conversion apparatus according to the present invention also has the secondary effect of making speech easier to modify.

FIG. 1 is a functional block diagram of the voice quality conversion apparatus 100 according to Embodiment 1.
FIG. 2 is a block diagram showing a specific configuration of the portion corresponding to the LPC coefficient calculation unit 102 through the residual signal extraction unit 103.
FIG. 3 is a functional block diagram of the voice quality conversion apparatus 100 according to Embodiment 2.
FIG. 4 illustrates the mora information and the conversion rule information.
FIG. 5 is a functional block diagram of the voice quality conversion apparatus 100 according to Embodiment 3.
FIG. 6 is a functional block diagram of the voice quality conversion apparatus 100 according to Embodiment 4.
FIG. 7 is a functional block diagram of the voice quality conversion apparatus 100 according to Embodiment 5.

Explanation of symbols

100: voice quality conversion apparatus, 101: waveform storage unit, 102: LPC coefficient calculation unit, 103: residual signal extraction unit, 104: white noise generator, 105: power calculation unit, 106: amplitude conversion unit, 107: speech synthesis unit, 108: silence insertion unit, 109: LPC coefficient complementing unit, 110: voiced/unvoiced determination unit, 111: voiced/unvoiced mixing unit.

Claims

A voice quality conversion method for converting voiced sound to unvoiced sound, comprising:
an input step of inputting an original sound waveform;
a prediction analysis step of predicting a transfer function from the original sound waveform input in the input step;
a residual signal extraction step of extracting a residual signal using the original sound waveform and the output of the prediction analysis step;
a white noise generation step of outputting a white noise signal corresponding to the power of the residual signal; and
a speech synthesis step of performing speech synthesis based on the output of the white noise generation step and the transfer function predicted in the prediction analysis step.

The voice quality conversion method according to claim 1, further comprising a silence insertion step of receiving mora information and conversion rule information for inserting silence into the speech using the mora information, and inserting a silent portion into the speech synthesized in the speech synthesis step using these pieces of information.

The voice quality conversion method according to claim 2, wherein, in the silence insertion step, fade-in processing or fade-out processing is performed on the speech before and after the position where the silent portion is inserted.

The voice quality conversion method according to claim 1, further comprising a prediction complementing step of receiving mora information and conversion rule information for stretching the speech on the time axis using the mora information, and complementing, on the time axis, the transfer function predicted in the prediction analysis step using these pieces of information, wherein
in the white noise generation step, the mora information and the conversion rule information are received, and a white noise signal generated using these pieces of information is output, and
in the speech synthesis step, speech synthesis is performed based on the output of the white noise generation step and the transfer function complemented in the prediction complementing step.

The voice quality conversion method according to any one of claims 1 to 4, further comprising a determination step of determining, for each frame of the original sound waveform or the residual signal, whether the sound of the frame is voiced or unvoiced, wherein
in the speech synthesis step, speech synthesis is performed only for frames determined to be voiced in the determination step.

The voice quality conversion method according to any one of claims 1 to 5, further comprising a mixing step of mixing the output of the speech synthesis step and the original sound waveform at a predetermined ratio.

The voice quality conversion method according to any one of claims 1 to 6, wherein, in the input step, synthesized speech output by a speech synthesis apparatus is input as the original sound waveform.

A voice quality conversion program that causes a computer to execute the voice quality conversion method according to claim 1.

A voice quality conversion apparatus for converting voiced sound to unvoiced sound, comprising:
an input unit that inputs an original sound waveform;
a prediction analysis unit that predicts a transfer function from the original sound waveform input to the input unit;
a residual signal extraction unit that extracts a residual signal using the original sound waveform and the output of the prediction analysis unit;
a white noise generation unit that outputs a white noise signal corresponding to the power of the residual signal; and
a speech synthesis unit that performs speech synthesis based on the output of the white noise generation unit and the transfer function predicted by the prediction analysis unit.

The voice quality conversion apparatus according to claim 9, further comprising a silence insertion unit that receives mora information and conversion rule information for inserting silence into the speech using the mora information, and inserts a silent portion into the speech synthesized by the speech synthesis unit using these pieces of information.

The voice quality conversion apparatus according to claim 10, wherein the silence insertion unit performs fade-in processing or fade-out processing on the speech before and after the position where the silent portion is inserted.

The voice quality conversion apparatus according to claim 9, further comprising a prediction complementing unit that receives mora information and conversion rule information for stretching the speech on the time axis using the mora information, and complements, on the time axis, the transfer function predicted by the prediction analysis unit using these pieces of information, wherein
the white noise generation unit receives the mora information and the conversion rule information, and outputs a white noise signal generated using these pieces of information, and
the speech synthesis unit performs speech synthesis based on the output of the white noise generation unit and the transfer function complemented by the prediction complementing unit.

The voice quality conversion apparatus according to any one of claims 9 to 12, further comprising a determination unit that determines, for each frame of the original sound waveform or the residual signal, whether the sound of the frame is voiced or unvoiced, wherein
the speech synthesis unit performs speech synthesis only for frames determined to be voiced by the determination unit.

The voice quality conversion apparatus according to any one of claims 9 to 13, further comprising a mixing unit that mixes the output of the speech synthesis unit and the original sound waveform at a predetermined ratio.

The voice quality conversion apparatus according to any one of claims 9 to 14, wherein the input unit receives, as the original sound waveform, synthesized speech output by a speech synthesis apparatus.