JP2612867B2

JP2612867B2 - Voice pitch conversion method

Info

Publication number: JP2612867B2
Application number: JP62250706A
Authority: JP
Inventors: 徹都木; 尚夫桑原
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1987-10-06
Filing date: 1987-10-06
Publication date: 1997-05-21
Anticipated expiration: 2012-05-21
Also published as: JPH0193799A

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial Field of Application] The present invention relates to a voice pitch conversion method that controls the pitch frequency of speech in audio processing for broadcasting, film, music, and the like, for example to change the height, accent, or intonation of a voice or to add vibrato.

[Summary of the Invention] The present invention relates to a technique for temporarily recording a human voice, changing its pitch period, and outputting it again as speech. After the input speech is A/D converted, the pitch frequency of the voiced portions is extracted and the waveform is divided at each pitch interval. The period of each pitch is expanded or contracted, and the resulting segments are connected so that the utterance duration does not change. The connected waveform is then Fourier transformed, the distortion components caused by the pitch change are reduced in the frequency domain, and the result is returned to the time domain by an inverse Fourier transform and D/A converted. The method thereby allows the height and intonation of a voice to be changed freely while preserving the phonemic quality and naturalness of the original speech.

[Prior Art] A classic example of this kind of technique is to record speech on an analog tape recorder and change the playback speed. With this method, not only the pitch frequency but the entire frequency band, including the formant frequencies, changes uniformly, and the utterance duration changes at the same time.

That is, if the playback speed is R times the recording speed, the pitch and formant frequencies are all multiplied by R and the utterance duration is multiplied by 1/R.

Here, pitch gives the voice its height, and its variation over time characterizes accent and intonation. Formants characterize the phonemic quality of speech and differ greatly between individuals.

In contrast to the above conventional example, methods that use digital techniques and leave the utterance duration unchanged have also been developed.

That is, if a speech waveform written at a sampling frequency F is read out at a sampling frequency of F × R, the pitch and formant frequencies are multiplied by R. If the waveform is thinned out or repeated using a suitable time window and period, the utterance duration can be kept the same as that of the original speech. Such devices are called "harmonizers" and the like, and are in general use as sound effect units.

[Problems to Be Solved by the Invention] However, in both of the conventional examples described above, changing the pitch frequency unavoidably changes the formant frequencies as well.

When the formant frequencies change, individual differences in the voice become indistinct, and when the change is large the phonemic quality degrades and the voice sounds non-human. Consequently, unless this effect is being exploited deliberately, the change in formant frequency that accompanies a change in pitch frequency is harmful.

Furthermore, because conventional speech processing devices are mainly intended to control the height of the voice, they can easily control a change in the long-term average of the pitch frequency but cannot control changes in pitch frequency within a short time, such as intonation.

Accordingly, an object of the present invention is to solve the above conventional problems and to provide a voice pitch conversion method that keeps the formant frequencies unchanged even when the pitch frequency of the original speech is changed greatly and that reduces the frequency distortion accompanying the change, so that the height, accent, and the like of a voice can be controlled while preserving individuality and phonemic quality and without impairing its naturalness as human speech.

Another object of the present invention is to provide a voice pitch conversion method in which the pitch frequency can be controlled even within a short time, so that intonation, vibrato, and the like can be freely emphasized or replaced.

[Means for Solving the Problems] To this end, the present invention is characterized by: extracting voiced sections from input speech; extracting pitch periods from the voiced sections; obtaining linear prediction coefficients in each pitch section corresponding to the extracted pitch periods; calculating a spectral envelope from the linear prediction coefficients; expanding or contracting the waveform of each pitch section with the aid of the linear prediction coefficients; connecting the expanded or contracted waveforms by thinning out or repeating the waveforms of individual pitch sections so that the duration equals the utterance duration of the input speech; obtaining linear prediction coefficients from the connected waveform and calculating a spectral envelope from them; taking the difference between the spectral envelope calculated before the expansion or contraction and the spectral envelope calculated after it as a distortion component; transforming the connected waveform into the frequency domain by a Fourier transform; correcting the distortion component in each frequency component; returning the waveform to the time domain by an inverse Fourier transform; applying to the returned waveform a comb filter corresponding to its average pitch period; and connecting the result to the preceding and following unvoiced or silent sections to form a new speech waveform.

[Operation] With the above configuration, the pitch frequency can be changed while the frequency spectral envelope is kept the same as that of the original speech, in other words, without changing the formant frequencies of the original speech.

In addition, the pitch period can be changed for each individual pitch section.

[Embodiment] The present invention is described in detail below on the basis of the embodiment shown in the drawings.

FIG. 1 is a block diagram of a pitch frequency conversion system according to one embodiment of the present invention. In the figure, 2 denotes an analysis unit, 4 a pitch frequency control unit, 6 a waveform connection unit, and 8 a distortion correction unit. Each unit is implemented in a computer, and the pitch frequency conversion is carried out using memory such as ROM, RAM, or disk memory. The A/D-converted, sampled speech waveform is input to the analysis unit 2, which distinguishes sound from silence and voiced from unvoiced sound, and determines the pitch sections of the voiced sound.

Next, the pitch frequency control unit 4 applies the desired changes to the pitch sections obtained by the analysis unit 2, calculates a new sequence of pitch periods, and expands or contracts the waveform of each pitch according to its new pitch period. This controls the height, intonation, and so on of the voice.

The waveform connection unit 6 connects the waveforms of the pitches changed by the pitch frequency control unit 4, thinning them out or repeating them as needed so that the utterance duration does not change.

The distortion correction unit 8 successively obtains the short-time spectral envelope of the synthesized waveform produced by the waveform connection unit 6 for the voiced section, and corrects it so that it matches the spectral envelope of the original speech.

When the above pitch frequency conversion for one run of voiced sound is complete, the unvoiced and silent sections are connected and processing moves on to the next voiced section. The finally synthesized speech waveform is D/A converted to give the output speech.

The processing in each unit is described in detail below with reference to the flowchart shown in FIG. 2.

Speech that has been A/D converted with 12-bit resolution at a sampling frequency of 15 kHz is first divided, in step S1 of the analysis unit 2, into sound sections and silent sections according to the presence or absence of speech power. Next, in step S2, PARCOR analysis and zero-crossing analysis are applied to the samples of the sound sections to distinguish unvoiced consonant sections from voiced sections. This is done by examining the proportion of high-frequency components in the input with reference to the first-order PARCOR coefficient, and by examining the number of zero crossings. That is, the energy of an unvoiced consonant extends into the high-frequency region, so unvoiced consonants and voiced sounds are distinguished by examining the proportion of high-frequency components and the number of zero crossings, which increases at higher frequencies. Both PARCOR analysis and zero-crossing analysis are used in order to make the decision reliable.
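The two measurements used in step S2 can be sketched as follows. This is a minimal illustration, not the decision logic of the embodiment: the function names, the thresholds `zc_per_sample` and `k1_max`, and the rule that combines the two tests are all assumptions. The first-order PARCOR coefficient is taken here as the lag-1 normalized autocorrelation r(1)/r(0), which matches it up to the sign convention in use.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def first_parcor(frame):
    """Lag-1 normalized autocorrelation r(1)/r(0).

    Close to +1 when low frequencies dominate (voiced speech),
    small or negative when high frequencies dominate.
    """
    r0 = sum(x * x for x in frame)
    r1 = sum(a * b for a, b in zip(frame, frame[1:]))
    return r1 / r0 if r0 > 0 else 0.0

def looks_unvoiced(frame, zc_per_sample=0.3, k1_max=0.5):
    """Hypothetical rule: many zero crossings AND a weak lag-1 correlation."""
    zc_rate = zero_crossings(frame) / max(len(frame) - 1, 1)
    return zc_rate > zc_per_sample and first_parcor(frame) < k1_max
```

Requiring both indicators to agree mirrors the stated reason for using two analyses, namely to make the decision more reliable.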

The durations of the silent sections and the waveforms of the unvoiced consonant sections determined in steps S1 and S2 are stored unchanged in RAM, on a memory disk, or the like in steps S21 and S22, respectively.

Next, in step S3, linear prediction analysis is performed by passing the samples of the speech waveform in the voiced section through a so-called vocal tract inverse filter based on a speech production model. This linear prediction analysis yields linear prediction coefficients and a residual waveform. The residual waveform is stored in RAM, disk memory, or the like in step S23.

In step S4, a provisional pitch period is obtained from the periodicity found in the correlation of the residual waveform obtained in step S3 and from the spacing of the peaks of the original speech waveform.

Next, in step S5, as shown in FIG. 3, the point immediately before the waveform level rises sharply is taken as the start of a pitch, and, on the basis of the pitch period obtained above, the sample one before the start of the next pitch is taken as the end point, which defines one pitch section.

In step S6, a window of about 20 msec is applied, centered on the midpoint of the pitch section determined above. Windowing makes short-time spectral analysis possible with a finite number of samples, and linear prediction analysis is performed again on the windowed data. That is, the linear prediction coefficients $\alpha_1$ to $\alpha_p$ are calculated by obtaining the correlation function of the windowed samples. Here p is the order of the linear prediction analysis; in general p = 14 is used for male voices and about p = 10 for female voices.
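Computing linear prediction coefficients from the correlation function of a windowed frame is commonly done with the Levinson-Durbin recursion. The sketch below assumes that approach and a Hamming window; the text specifies neither the solver nor the window shape, and all function names are illustrative. The coefficients follow the convention in which x(m) is predicted as the sum of alpha_i · x(m − i).

```python
import math

def autocorrelation(x, max_lag):
    """r[lag] = sum_i x[i] * x[i + lag] for lag = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, p):
    """Solve the normal equations for alpha_1..alpha_p.

    Returns (alpha, residual_energy), with alpha[0] holding alpha_1.
    """
    a = [0.0] * (p + 1)          # a[i] holds alpha_i; a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        if err <= 0.0:           # degenerate (for example all-zero) frame
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err

def lpc_of_frame(frame, p):
    """Hamming-window a frame (assumed window shape) and fit order-p LPC."""
    n = len(frame)
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    windowed = [wi * xi for wi, xi in zip(w, frame)]
    return levinson_durbin(autocorrelation(windowed, p), p)
```

At the stated 15 kHz sampling rate, a 20 msec window corresponds to a 300-sample frame, so `lpc_of_frame(frame, 14)` would be called on 300 samples for a male voice.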

In steps S7 and S8, the sum of the squares of the samples in the pitch section divided by the length of the pitch section is defined as the normalized power, and it is stored in RAM, on a memory disk, or the like together with the length of the pitch section and the linear prediction coefficients.

When the processing of steps S6 to S8 for one pitch section is complete, the processing section is advanced by one pitch and the next pitch section is processed; these operations are repeated until the voiced section ends.

In the pitch frequency control unit 4, first, in step S9, the desired change is applied to each of the series of pitch periods obtained by the analysis unit 2, and a new sequence of pitch periods is calculated. Specifically, within a given voiced section, let $P_n$ be the period of the n-th pitch counted from the first, let $F_n = 1/P_n$ be its pitch frequency, and let L be the total number of pitches. In consideration of how humans perceive the height of a voice, the average pitch frequency $F_{AVE}$ is defined as the geometric mean of all the pitch frequencies.

That is,

$$F_{AVE} = (F_1 \times F_2 \times \cdots \times F_L)^{1/L} = (P_1 \times P_2 \times \cdots \times P_L)^{-1/L} \qquad (1)$$

In this case, if, for example, the average pitch frequency is to be multiplied by R in order to control the height of the voice, equation (1) shows that it suffices to multiply every pitch period by 1/R. To change the inflection, as with accent, the periods must be expanded or contracted by a different ratio for each pitch period. For this purpose, as shown in FIG. 4, the n-th pitch frequency $F_n$ is multiplied by $R_n$ for each pitch period.
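Equation (1) and the uniform scaling by R can be checked numerically. The following is a small sketch with illustrative function names; the geometric mean is computed through logarithms to avoid overflow or underflow in the product.

```python
import math

def mean_pitch_frequency(periods):
    """Geometric mean of F_n = 1 / P_n over a voiced section, equation (1)."""
    log_sum = sum(math.log(1.0 / p) for p in periods)
    return math.exp(log_sum / len(periods))

def scale_periods(periods, ratio):
    """Multiply every pitch frequency by `ratio`, i.e. every period by 1 / ratio."""
    return [p / ratio for p in periods]

periods = [0.01, 0.008, 0.0125]            # seconds: 100 Hz, 125 Hz, 80 Hz
f_ave = mean_pitch_frequency(periods)      # about 100 Hz
f_ave_raised = mean_pitch_frequency(scale_periods(periods, 1.2))  # about 120 Hz
```

Because the mean is geometric, scaling every period by 1/R scales the mean by exactly R regardless of how the individual periods are distributed.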

When inflection is to be emphasized or suppressed about the average pitch frequency of the original speech, as shown in FIG. 5, the $R_n$ given by equation (2) may be used:

$$R_n = (F_n / F_{AVE})^{C-1} \qquad (2)$$

Here inflection is emphasized if C > 1 and suppressed if 0 ≤ C < 1.
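Multiplying F_n by this ratio gives a new frequency F_AVE · (F_n / F_AVE)^C, so on a logarithmic frequency scale each deviation from the mean is simply multiplied by C. A short sketch with illustrative names:

```python
import math

def inflection_ratios(freqs, c):
    """R_n = (F_n / F_AVE) ** (C - 1), with F_AVE the geometric mean of freqs."""
    f_ave = math.exp(sum(math.log(f) for f in freqs) / len(freqs))
    return [(f / f_ave) ** (c - 1.0) for f in freqs]

def apply_ratios(freqs, ratios):
    """New pitch frequencies F_n * R_n."""
    return [f * r for f, r in zip(freqs, ratios)]

freqs = [80.0, 100.0, 125.0]                                    # F_AVE is 100 Hz
emphasized = apply_ratios(freqs, inflection_ratios(freqs, 2.0))  # 64, 100, 156.25
flattened = apply_ratios(freqs, inflection_ratios(freqs, 0.0))   # 100, 100, 100
```

With C = 1 every ratio is 1 and the contour is unchanged; for any C the geometric mean of the new frequencies stays at F_AVE.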

Next, in step S10, the waveform of each pitch is expanded or contracted to match the new pitch period obtained in step S9. Let k be the number of samples in a given pitch section of the original speech and k′ the number of samples corresponding to the changed pitch section length. When the pitch period is shortened, the waveform is truncated at the k′-th sample from the start of the pitch section. When the pitch period is lengthened, the samples from m = k + 1 to m = k′ are computed with the linear prediction coefficients $\alpha_1$ to $\alpha_p$ obtained by the analysis unit 2, as in equation (3), to obtain the continuation of the waveform.

$$x(m) = \alpha_1 x(m-1) + \alpha_2 x(m-2) + \cdots + \alpha_p x(m-p) \qquad (3)$$

In consideration of the characteristics of human speech, however, the continuation is multiplied by an exponentially decaying window coefficient.
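The truncation and the extension by equation (3) can be sketched as below. Two points are assumptions: the decay rate `decay` is a hypothetical parameter, since no value is given, and the window is applied after the whole continuation has been predicted rather than inside the recursion.

```python
import math

def resize_pitch_section(section, alpha, new_len, decay=0.05):
    """Truncate or extend one pitch section to new_len samples.

    section: the k samples of the original pitch section
    alpha:   linear prediction coefficients, alpha[0] = alpha_1
    decay:   hypothetical per-sample decay rate of the exponential window
    """
    k = len(section)
    if new_len <= k:
        return list(section[:new_len])       # shortened period: cut at k'
    x = list(section)
    for m in range(k, new_len):              # lengthened period: predict k+1..k'
        x.append(sum(a * x[m - 1 - i]
                     for i, a in enumerate(alpha) if m - 1 - i >= 0))
    for m in range(k, new_len):              # decaying window on the continuation
        x[m] *= math.exp(-decay * (m - k + 1))
    return x
```

For example, `resize_pitch_section([1.0, 0.5], [0.5], 4, decay=0.0)` continues the halving pattern and returns `[1.0, 0.5, 0.25, 0.125]`.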

In step S11, the normalized power obtained in step S7 is adjusted. That is, changing the pitch period generally changes the normalized power defined above, so each sample is multiplied by a constant so that the normalized power equals the value obtained in step S7.
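Since power scales with the square of the amplitude, the constant is the square root of the ratio of the target power to the current power. A minimal sketch with illustrative names:

```python
import math

def normalized_power(section):
    """Sum of squared samples divided by the section length (step S7)."""
    return sum(x * x for x in section) / len(section)

def match_power(section, target_power):
    """Scale a section so that its normalized power equals target_power."""
    current = normalized_power(section)
    if current == 0.0:
        return list(section)                 # silent section: nothing to scale
    gain = math.sqrt(target_power / current)
    return [gain * x for x in section]
```

Here `target_power` would be the value stored for the same pitch section before its period was changed.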

In the waveform connection unit 6, the utterance durations are first compared in step S12. Let T be the utterance duration of the original speech and $P_n$ the pitch period of the n-th pitch section, and let T′ and $P_n'$ be the corresponding values after the pitch frequency change. Then

$$T = P_1 + P_2 + \cdots + P_L \qquad (4)$$

$$T' = P_1' + P_2' + \cdots + P_L' \qquad (5)$$

Since changing the pitch frequency generally changes the utterance duration, T ≠ T′.

Therefore, let γ = T′/T, and in step S13 thin out or repeat pitch sections according to the value of γ. That is, if γ > 1, pitch sections are removed at a rate of one per γ/(γ − 1) pitches, and if γ < 1, the same waveform is repeated at a rate of one per γ/(1 − γ) pitches.

FIGS. 6(A) and 6(B) show the processing for γ = 1.5 and γ = 0.667, respectively. As shown in the figures, when γ = 1.5, one pitch in every three is removed, namely pitch sections 3 and 6 of the pitch-changed speech; when γ = 0.667, one pitch in every two is repeated, namely the waveforms of pitch sections 2, 4, and 6 of the pitch-changed speech.
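The thinning and repetition rule can be turned into a list of section indices to output. In the sketch below, rounding the multiples of the step to the nearest section index is an assumption made so that non-integer steps such as 0.667/0.333 fall on whole sections; at most one repetition per section is produced, so values of γ below 0.5 are not fully compensated.

```python
def connection_plan(num_sections, gamma):
    """Return 1-based pitch section indices in output order.

    gamma > 1: drop one section per gamma / (gamma - 1) sections.
    gamma < 1: repeat one section per gamma / (1 - gamma) sections.
    """
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    if gamma == 1.0:
        return list(range(1, num_sections + 1))
    step = gamma / abs(gamma - 1.0)
    marks = set()
    j = 1
    while round(j * step) <= num_sections:
        marks.add(round(j * step))           # sections to drop or repeat
        j += 1
    plan = []
    for n in range(1, num_sections + 1):
        if n in marks:
            if gamma < 1.0:
                plan.extend([n, n])          # repeat this section once
            # gamma > 1: the section is dropped
        else:
            plan.append(n)
    return plan
```

For six sections this gives `[1, 2, 4, 5]` at γ = 1.5 and `[1, 2, 2, 3, 4, 4, 5, 6, 6]` at γ = 0.667, matching the sections named for FIGS. 6(A) and 6(B). In both cases the total duration returns to six original pitch periods: 4 × 1.5 and 9 × 0.667.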

This keeps the utterance duration approximately equal to that of the original speech, and the result does not sound unnatural. In a waveform whose pitch periods have been changed, there is generally a large discontinuity in sample value between the last sample of one pitch section and the first sample of the next. Therefore, in step S14, the sections are joined continuously by fitting a cubic curve by the method of least squares to the data of several samples on either side of the junction, that is, around the last sample and the starting sample.

In the distortion correction unit 8, first, in step S15, the sum of squares $P_s$ of the M samples from point q to point q + M − 1 of the pitch-changed waveform is obtained, as shown in FIG. 7, and linear prediction analysis is performed on these M samples to obtain the linear prediction coefficients $\alpha_1'$ to $\alpha_p'$.

In steps S24 and S25, these linear prediction coefficients $\alpha_1'$ to $\alpha_p'$ and the linear prediction coefficients $\alpha_1$ to $\alpha_p$ obtained by the analysis unit 2 from the portion of the original speech corresponding to the same time interval are used to obtain the spectral envelopes $\hat{H}(K)$ and $H(K)$ by equations (6) and (7), respectively.

Here M is the number of samples in a time span of 20 to 30 msec; since the sampling frequency is 15 kHz, its value is about 300 to 450. N is a power of two larger than M and is set to 512.
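Equations (6) and (7) do not appear in this text. The sketch below therefore uses the standard all-pole form of a linear prediction spectral envelope, gain / |1 − Σ α_i e^{−j2πiK/N}|, as an assumed stand-in; the function names, the `gain` parameter, and the 0-based frequency index are illustrative choices. The helper `frame_sizes` reproduces the arithmetic for M and N.

```python
import cmath
import math

def lpc_envelope(alpha, n_fft, gain=1.0):
    """All-pole envelope sampled at K = 0..n_fft-1 (standard form, assumed).

    alpha[0] holds alpha_1, for a predictor x(m) = sum_i alpha_i * x(m - i).
    """
    env = []
    for k in range(n_fft):
        w = 2.0 * math.pi * k / n_fft
        denom = 1.0 - sum(a * cmath.exp(-1j * w * (i + 1))
                          for i, a in enumerate(alpha))
        env.append(gain / abs(denom))
    return env

def frame_sizes(fs_hz, frame_ms):
    """M = samples in frame_ms at fs_hz; N = smallest power of two above M."""
    m = round(fs_hz * frame_ms / 1000)
    n = 1
    while n <= m:
        n *= 2
    return m, n
```

With these definitions, `frame_sizes(15000, 20)` gives `(300, 512)` and `frame_sizes(15000, 30)` gives `(450, 512)`, in line with the values stated above.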

The spectral envelope $H(K)$ is a physical quantity that carries much of the phonemic quality and individuality of the original speech, that is, it characterizes the formant frequencies. Because of the distortion caused by changing the pitch period, however, $\hat{H}(K)$ does not necessarily coincide with $H(K)$. The following processing is performed to correct this distortion.

First, in step S16, the N samples from point q − (N − M)/2 to point q + (N + M)/2 − 1 shown in FIG. 7 are relabeled x(1) to x(N) and multiplied by the time window coefficient w(m) to give y(1) to y(N), as in equation (8):

$$y(m) = w(m)\,x(m), \qquad m = 1, \ldots, N \qquad (8)$$

where, with $L = (N - M)/2 + 1$ and $L' = (N + M)/2$,

$$w(m) = \begin{cases} 0.5\,\{1 - \cos(\pi m / L)\}, & 1 \le m \le L \\ 1, & L \le m \le L' \\ 0.5\,[1 + \cos\{\pi (m - L') / L\}], & L' \le m \le N \end{cases}$$

An N-point fast Fourier transform is applied to the resulting y(m) to convert it to the frequency domain, giving Y(K). Next, in step S17, the magnitude of Y(K) is modified using the ratio of the spectral envelopes $H(K)$ and $\hat{H}(K)$, as in equation (9):

$$\hat{Y}(K) = \frac{H(K)}{\hat{H}(K)}\, Y(K), \qquad K = 1, \ldots, N \qquad (9)$$

In step S18, the resulting $\hat{Y}(K)$ is converted by an inverse fast Fourier transform into the time-domain waveform $\hat{y}(1)$ to $\hat{y}(N)$, and in step S19 comb filtering is applied as in equation (10) to give $\tilde{y}(1)$ to $\tilde{y}(N)$. This attenuates the distortion components that arise at frequencies that are non-integer multiples of the pitch frequency.
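The window of equation (8) and the magnitude correction of equation (9) can be sketched as follows, assuming N − M is even. The function names and the small `floor` that guards against division by zero are assumptions. Because the correction factor is real and positive, it changes only the magnitude of each frequency component and leaves the phase as it was.

```python
import math

def taper_window(n, m):
    """w(1)..w(N) of equation (8), returned as a 0-based list of length n."""
    l1 = (n - m) // 2 + 1                    # L  in the text
    l2 = (n + m) // 2                        # L' in the text
    w = []
    for idx in range(1, n + 1):
        if idx <= l1:
            w.append(0.5 * (1.0 - math.cos(math.pi * idx / l1)))
        elif idx <= l2:
            w.append(1.0)
        else:
            w.append(0.5 * (1.0 + math.cos(math.pi * (idx - l2) / l1)))
    return w

def correct_spectrum(y_spec, h_orig, h_mod, floor=1e-12):
    """Equation (9): scale each component by original / modified envelope."""
    return [y * (h / max(hm, floor))
            for y, h, hm in zip(y_spec, h_orig, h_mod)]
```

For N = 512 and M = 300 the window rises over the first 107 samples, is flat up to sample 406, and falls over the remaining 106 samples, so the central M samples pass almost unweighted.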

$$\tilde{y}(m) = 0.25\,\{(1 - a)\,\hat{y}(m - K_p) + 2(1 + a)\,\hat{y}(m) + (1 - a)\,\hat{y}(m + K_p)\} \qquad (10)$$

where $\hat{y}(m) = \hat{y}(1)$ for $m \le 0$ and $\hat{y}(m) = \hat{y}(N)$ for $m > N$. Here $K_p$ is the number of samples in a pitch section corresponding to the average pitch period in the processing section after the pitch frequency change, and a is a constant between 0 and 1, for which a value of about 0.01 is used.
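The three-tap comb filter of equation (10) has unit gain at multiples of the pitch frequency and gain a midway between them. A minimal sketch, with an illustrative function name and 0-based indexing in place of the 1-based indexing of the text:

```python
def comb_filter(y, kp, a=0.01):
    """Equation (10), with out-of-range samples clamped to the end values."""
    n = len(y)

    def at(i):
        return y[min(max(i, 0), n - 1)]      # y(m) = y(1) for m <= 0, y(N) for m > N

    return [0.25 * ((1.0 - a) * at(m - kp)
                    + 2.0 * (1.0 + a) * at(m)
                    + (1.0 - a) * at(m + kp))
            for m in range(n)]
```

A component that repeats every Kp samples passes unchanged, since the three weights sum to 1. A component that changes sign every Kp samples, which lies halfway between two harmonics, is scaled by 0.25 · {2(1 + a) − 2(1 − a)} = a.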

For the central M samples of the N data points obtained from equation (10), the gain is adjusted by multiplying each sample by a constant so that their sum of squares Ps′ equals the Ps obtained earlier in step S15. This keeps the loudness of the voice the same. In addition, to reduce edge effects when the waveforms are connected, a Hanning window or triangular window that is 0 at both ends and 1 at the center is applied, and the resulting waveform is stored in RAM, on a memory disk, or the like.

Next, the point q shown in FIG. 7 is shifted later by M/2 points to move the processing section, and the series of operations from step S16 onward is performed. Then, as shown in FIG. 8, the first M/2 points of the M samples are overlapped with the last M/2 points of the immediately preceding processing frame and added, frame after frame.
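The half-overlapped addition of successive frames can be sketched as below, assuming M is even and each frame has already been windowed. The periodic Hanning window used in the example is one choice that is 0 at the start and 1 at the center; with a hop of M/2 its overlapping halves sum to exactly 1, so a constant input is reconstructed without ripple. The function names are illustrative.

```python
import math

def hanning(m):
    """Periodic Hanning window: 0 at index 0, 1 at index m / 2."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * j / m)) for j in range(m)]

def overlap_add(frames, m):
    """Add frames of length m whose start points are m / 2 samples apart."""
    hop = m // 2
    out = [0.0] * (hop * (len(frames) + 1)) if frames else []
    for idx, frame in enumerate(frames):
        start = idx * hop
        for j in range(m):
            out[start + j] += frame[j]
    return out

# Three identical windowed frames of a constant signal of value 1.
joined = overlap_add([hanning(8)] * 3, 8)
```

In `joined`, the samples covered by two frames (indices 4 to 11) are all equal to 1, while the first and last half-frames show the rise and fall of a single window.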

Repeating the same operations until the voiced section ends yields a speech waveform with the same spectral envelope as the original speech. The formant frequencies therefore remain unchanged, and the phonemic quality and individuality of the original speech can be preserved.

When Kp is larger than a certain value, better sound quality is obtained by setting M to 450 to 600 and at the same time increasing N to 1024.

When the processing of one voiced section is complete, it is connected in step S20 to the preceding and following unvoiced or silent sections, and processing moves on to the next sound section from step S2 onward. The finally synthesized speech is D/A converted to give the output speech.

[Effects of the Invention] As described above, according to the present invention the pitch frequency can be changed while the frequency spectral envelope of the speech is kept the same as that of the original speech, in other words, without changing the formant frequencies of the original speech. The pitch frequency can therefore be changed more naturally than with conventional techniques, without affecting the phonemic quality and individuality that depend on the formant structure.

Furthermore, whereas in conventional devices the amount of change in pitch frequency is constant over a long time, in the present invention the amount of change can be varied for each pitch to change the inflection, making it possible to control, for example, the intonation of conversation and the vibrato of a singing voice.

[Brief description of the drawings]

FIG. 1 is a block diagram of a system according to one embodiment of the present invention. FIG. 2 is a flowchart showing one embodiment of the present invention. FIG. 3 is a waveform diagram explaining how pitch sections are determined in the embodiment. FIGS. 4 and 5 are diagrams of pitch period sequences explaining the change of pitch frequency in the embodiment. FIG. 6 is a waveform diagram explaining the thinning out or repetition of pitch sections in the embodiment. FIG. 7 is a waveform diagram explaining the distortion correction of the embodiment. FIG. 8 is a waveform diagram explaining the overlapping of waveforms in the embodiment.

2: analysis unit; 4: pitch frequency control unit; 6: waveform connection unit; 8: distortion correction unit.

Continuation of the front page

(56) References
JP-A-59-82608 (JP, A)
JP-A-1-93795 (JP, A)
JP-A-1-93796 (JP, A)
桑原, 都木, "Voice quality conversion by analysis-synthesis and its application to improving hoarse voice," IEICE Technical Report SP86-57, pp. 45-52
都木, 梅田, "Voice quality conversion that corrects distortion from pitch change in the spectral domain," IEICE Technical Report EA87-82, pp. 49-56

Claims

(57) [Claims]

1. A voice pitch conversion method characterized by: extracting a voiced section from input speech; extracting pitch periods from the voiced section; obtaining linear prediction coefficients in each pitch section corresponding to the extracted pitch periods; calculating a spectral envelope using the linear prediction coefficients; expanding or contracting the waveform of each pitch section with the aid of the linear prediction coefficients; connecting the expanded or contracted waveforms by thinning out or repeating the waveforms of the pitch sections so that the duration equals the utterance duration of the input speech; obtaining linear prediction coefficients from the connected waveform and calculating a spectral envelope using those linear prediction coefficients; taking the difference between the spectral envelope calculated before the expansion or contraction of the waveform and the spectral envelope calculated after it as a distortion component; transforming the connected waveform into the frequency domain by a Fourier transform; correcting the distortion component in each frequency component of the frequency domain; returning the waveform to the time domain by an inverse Fourier transform; applying to the returned waveform a comb filter corresponding to the average pitch period of that waveform; and connecting the result to the preceding and following unvoiced or silent sections to form a new speech waveform.