JP3125937B2

JP3125937B2 - Voice pitch conversion method

Info

Publication number: JP3125937B2
Application number: JP03150785A
Authority: JP
Inventors: 貴夫小山; 憲也村上
Original assignee: NTT Data Corp
Current assignee: NTT Data Corp
Priority date: 1991-06-24
Filing date: 1991-06-24
Publication date: 2001-01-22
Anticipated expiration: 2016-01-22
Also published as: JPH04372999A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a voice pitch conversion method for an apparatus that performs pitch control, and more particularly to a voice pitch conversion method that takes into account both the peak positions of the input voice and the target pitch period of the conversion, and can therefore cope with erroneously given peak positions.

[0002]

[Prior Art] In a conventional voice pitch conversion apparatus that performs voice pitch control, as shown in FIG. 2, when the voice to be converted is input to the processing device from an input terminal 21, a pre-processing unit 22 performs pre-processing such as analog-to-digital conversion. Meanwhile, local peak positions 23 in the speech waveform, time-synchronized with that data, are obtained by visual inspection (or by an automatic peak position extraction technique). The data and the peak positions are input to and stored in a waveform cutout/superposition unit 24. A pitch pattern 25 that is the target of the voice pitch conversion is also input to the waveform cutout/superposition unit 24. Based on the stored local peak positions, the speech waveform is multiplied by a time window function, such as a Hanning window sized to fit the target pitch period of the synthesis and centered on each local peak position, to cut out waveform segments, which are then superimposed again in accordance with the target pitch pattern 25. The waveform cutout/superposition unit 24 thus constructs the synthesized speech, a post-processing unit 26 applies post-processing such as digital-to-analog conversion to that data, and the converted voice is obtained from an output terminal 27. The conventional pitch control method is discussed in, for example, Hirokawa and Hakoda, "A Study of Pitch Control Methods in Waveform-Editing Rule-Based Synthesis," Proceedings of the Acoustical Society of Japan, March 1990, 1-4-7.
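The conventional cutout step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `hann_symmetric` and `cut_segment` are hypothetical, and setting the window half-length equal to the target pitch period is an assumed way of making the window "fit" the target pitch; the text does not give the exact length.

```python
import math

def hann_symmetric(half_len):
    """Symmetric Hann window with 2*half_len + 1 samples, equal to 1 at the center."""
    total = 2 * half_len
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / total) for n in range(total + 1)]

def cut_segment(signal, peak, half_len):
    """Multiply the signal by a Hann window centered on `peak`.

    Samples that fall outside the signal are treated as zero.
    """
    window = hann_symmetric(half_len)
    segment = []
    for k, w in enumerate(window):
        idx = peak - half_len + k
        sample = signal[idx] if 0 <= idx < len(signal) else 0.0
        segment.append(w * sample)
    return segment

# Assumed usage: half_len is set to the target pitch period in samples,
# regardless of how far apart the neighboring input peaks actually are.
segment = cut_segment([1.0] * 200, peak=100, half_len=40)
```

Because the window depends only on the target period, it is the same on both sides of the peak, which is the property the invention goes on to change.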

[0003]

[Problems to Be Solved by the Invention] In the prior art described above, the window function used for waveform cutout is chosen solely so that it fits the target pitch. The pitch structure of the input speech waveform is therefore not taken into account, and that pitch structure may be damaged. Furthermore, when local peak positions are extracted automatically and a peak extraction error occurs, for example because the input voice has no pitch structure, cutting out the waveform with the conventional window function is expected to cause degradation such as noise in the synthesized speech. An object of the present invention is to remedy these problems by determining the shape of the waveform cutout window function in consideration of both the input speech waveform and the target pitch pattern, and thereby to provide a voice pitch conversion method that absorbs errors in the given peak positions and can construct high-quality synthesized speech.

[0004]

[Means for Solving the Problems] To achieve the above object, the voice pitch conversion method of the present invention is characterized in that, in the waveform cutout/superposition unit, among the local peak positions of the input voice extracted by the peak position extraction unit, the distances (wlf, wlb) between the peak position to which the time window function is applied and the peak positions immediately before and after it in the time series are obtained, and the window length is determined by multiplying those distances by a coefficient C obtained from the conversion rate R to the target pitch, thereby generating a left-right asymmetric window function (Equation (4)) that takes both the input voice and the converted voice into account.

[0005]

[Operation] In the present invention, when waveforms are superimposed in the waveform cutout/superposition unit, the distances between the local peak position to which the time window function is applied and the peak positions immediately before and after it in the time series are obtained, and the window length is determined by multiplying those distances by a coefficient obtained from the conversion rate to the target pitch. That is, the parts of the windowed section to the left (earlier in time) and to the right (later in time) of its turning point (the peak position) are equivalent, respectively, to the left half and the right half of a window function such as a Hanning window whose window length is twice the distance to the adjacent peak position on that side. Combining the two sides into a left-right asymmetric window function therefore allows the shape of the waveform cutout window function to be determined in consideration of both the input voice and the converted voice. This makes it possible to absorb errors in the given peak positions while preserving the pitch pattern of the original voice, so that high-quality synthesized speech at the waveform level can be obtained.
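The equivalence stated above, that each side of the asymmetric window is one half of an ordinary Hanning window twice as long as that side, can be illustrated with a short sketch. The function names are hypothetical, `left` and `right` stand for the already scaled distances (C times wlf and C times wlb), and the closing zero sample is an implementation choice of this sketch rather than something the text specifies.

```python
import math

def hann(length):
    """Periodic Hann window of `length` samples: 0 at n = 0, 1 at n = length / 2."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def asymmetric_from_halves(left, right):
    """Join the rising half of a Hann window of length 2*left with the
    falling half of a Hann window of length 2*right.

    The result has left + right + 1 samples and equals 1 at index `left`,
    which is where the peak position of interest is placed.
    """
    rising = hann(2 * left)[:left]               # 0 up to just below 1
    falling = hann(2 * right)[right:] + [0.0]    # 1 down to 0 (closing sample added)
    return rising + falling

window = asymmetric_from_halves(4, 8)   # short rise, long decay
```

When `left` equals `right` the result reduces to a symmetric Hann window, so the conventional case is a special case of this construction.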

[0006]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 3 is a configuration diagram showing part of a voice pitch conversion apparatus in one embodiment of the present invention, and FIG. 4 shows example peak positions in a speech waveform having a pitch structure in one embodiment of the present invention. The voice pitch conversion apparatus of this embodiment comprises an input device for inputting voice, an output device for outputting the result of voice pitch conversion of the input voice, an external storage device for storing processed data, the target pitch pattern used in voice pitch conversion, and the like, and the processing device shown in FIG. 3. The processing device comprises a CPU, memory, and the like and, as shown in FIG. 3, has a pre-processing unit 32, a peak position extraction unit 33, a waveform cutout/superposition unit 34, and a post-processing unit 36. With this configuration, when voice pitch conversion is performed, a voice signal is input from an input terminal 31 and passed to the pre-processing unit 32. The pre-processing unit 32 passes the signal through a low-pass filter, performs analog-to-digital conversion to turn the voice into digital data, and sends it to the peak position extraction unit 33. The peak position extraction unit 33 extracts peak positions from the speech waveform and inputs the peak position information to the waveform cutout/superposition unit 34. These peak positions are, for example, the parts marked with "*" in FIG. 4. The waveform cutout/superposition unit 34 stores these data, receives the target pitch pattern 35, cuts out the speech waveform based on the peak positions from the peak position extraction unit 33, superimposes the waveforms so as to match the target pitch pattern, and inputs the resulting digital data of the synthesized speech to the post-processing unit 36. The post-processing unit 36 converts the digital data from digital to analog, passes it through a low-pass filter, and outputs the converted voice to an output terminal 37.
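The text does not say how the peak position extraction unit 33 finds its peaks. Purely as an assumed stand-in, the sketch below picks local maxima above a threshold and keeps only the larger of two peaks that are closer than a minimum spacing. The function name and both parameters are hypothetical and are not taken from the patent.

```python
def pick_peaks(signal, min_distance, threshold):
    """Return indices of local maxima that reach `threshold`.

    If two candidates are closer than `min_distance` samples,
    only the larger one is kept.
    """
    peaks = []
    for n in range(1, len(signal) - 1):
        is_local_max = signal[n] > signal[n - 1] and signal[n] >= signal[n + 1]
        if signal[n] < threshold or not is_local_max:
            continue
        if peaks and n - peaks[-1] < min_distance:
            if signal[n] > signal[peaks[-1]]:
                peaks[-1] = n        # replace the weaker nearby peak
        else:
            peaks.append(n)
    return peaks

example = [0.0] * 50
example[10], example[12], example[30], example[45] = 1.0, 0.6, 0.9, 0.2
found = pick_peaks(example, min_distance=8, threshold=0.5)
```

A simple picker like this will sometimes return wrong positions, for instance on unvoiced input, which is the kind of error the window design in this embodiment is meant to tolerate.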

[0007] The processing of the waveform cutout/superposition unit 34 is now described in detail. FIG. 5 is a flowchart showing the processing of the waveform cutout/superposition unit in one embodiment of the present invention. The waveform cutout/superposition unit 34 of this embodiment receives, together with the information on the peak positions, the conversion rate of the conversion target or the target pitch pattern 35 (501, 502). From the peak position sequence and the conversion target thus input, the waveform superposition positions are determined (503), and the cutout position corresponding to each superposition position is determined (504). Next, a window function for waveform cutout is generated in consideration of the surroundings of that cutout position (505). The speech waveform is then cut out using the generated window function (506) and superimposed at the previously determined superposition position (507). These steps are repeated in sequence until synthesis to the target pitch pattern is complete (508), constructing the digital data of the synthesized speech, which is input to the post-processing unit 36.
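The loop of steps 503 to 508 can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the conversion target is given as a single period scale `c` (converted period = `c` times input period), the cutout position for each superposition position is the nearest interior input peak, and the output has the same length as the input. The names `asym_window` and `convert_pitch` are hypothetical.

```python
import math

def asym_window(left, right):
    """Asymmetric Hann-type window: rises over `left` samples, falls over `right`."""
    rising = [0.5 - 0.5 * math.cos(math.pi * n / left) for n in range(left)]
    falling = [0.5 + 0.5 * math.cos(math.pi * n / right) for n in range(right + 1)]
    return rising + falling

def convert_pitch(signal, peaks, c):
    """Overlap-add sketch; `peaks` must be strictly increasing with at least 3 entries."""
    out = [0.0] * len(signal)
    pos = float(peaks[1])                            # first superposition position (503)
    while pos < peaks[-2]:                           # repeat until done (508)
        # Cutout position: interior input peak nearest to `pos` (504)
        i = min(range(1, len(peaks) - 1), key=lambda k: abs(peaks[k] - pos))
        left = max(1, round(c * (peaks[i] - peaks[i - 1])))
        right = max(1, round(c * (peaks[i + 1] - peaks[i])))
        window = asym_window(left, right)            # window generation (505)
        center = round(pos)
        for k, w in enumerate(window):               # cut out (506) and superimpose (507)
            src = peaks[i] - left + k
            dst = center - left + k
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += w * signal[src]
        pos += c * (peaks[i + 1] - peaks[i])         # next position, one converted period on
    return out

# Impulse train with a period of 10 samples, converted to a period of 5 samples.
train = [1.0 if n % 10 == 5 else 0.0 for n in range(100)]
halved = convert_pitch(train, list(range(5, 100, 10)), c=0.5)
```

With `c` below 1 some input peaks are reused for more than one output position, and with `c` above 1 some are skipped, which is how the number of pitch periods adapts to the new spacing in this sketch.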

[0008] Next, the window function generation of step 505 in FIG. 5 is described in detail. FIG. 1 is a flowchart showing the generation of the waveform cutout window function in one embodiment of the present invention, FIG. 6 is an explanatory diagram of the left-right asymmetric window in one embodiment of the present invention, FIG. 7 is an explanatory diagram of shortening the pitch period with the window function in one embodiment of the present invention, and FIG. 8 is an explanatory diagram of lengthening the pitch period with the window function in one embodiment of the present invention. In FIG. 6, 61 is the peak position immediately before the peak position of interest 62, 62 is the peak position of interest, 63 is the peak position immediately after the peak position of interest 62, 64 is the window function, 65 is the distance (wlf) to the preceding peak position 61, and 66 is the distance (wlb) to the following peak position 63. In this embodiment, as shown in FIGS. 1 and 6, the window length (wlf + wlb), which gives the rough shape of the window function, is determined from the peak positions 61 and 63 that precede and follow the given waveform cutout position in the time series (101). If the distances to the preceding and following peak positions have the length of a typical pitch period, the processing of this embodiment continues as is; if they do not look like a pitch period, a fixed value is assigned to the window length as exception handling. Next, the period conversion rate is calculated from the conversion target (102), and the obtained window length is set to a value suited to the conversion (103). A window function is then generated from the resulting window length (104, 105). The defining equations of the window function of this embodiment are described next. First, the defining equation of the conventional window function fw(n) is shown in Equation (1).

[Equation 1]

where wlf is the number of samples to the preceding peak position and wlb is the number of samples to the following peak position. In this embodiment, the window function of Equation (1) is processed so that the window length is shortened when the pitch frequency is raised (the period shortened) and lengthened when the pitch frequency is lowered (the period lengthened). That is, Equation (1) is modified using a coefficient C obtained from the frequency conversion rate. C is given by Equation (2).

[Equation 2]

where the frequency conversion rate is R [%] (R > 0). When C is applied, the relationships between the frequency f and the converted frequency f', and between the period T and the converted period T', are given by Equation (3).

[Equation 3]

Next, the window function fw(n) applied in this embodiment is shown in Equation (4).

[Equation 4]

where wlf is the number of samples to the preceding peak position and wlb is the number of samples to the following peak position. In the defining equation (1) above, the waveform cutout window function is built on the Hanning window. As for this base window function, the asymmetric window function of this embodiment can also be constructed when a Hamming window, a Blackman window, or the like is applied, by using the phase term used in the defining equation (4). FIG. 7 shows an example of shortening the pitch period with such an asymmetric window function. In this case, the peak positions in the input voice 71 (marked *) move in the synthesized speech 72 to positions that match the target pitch pattern (marked △), and the spacing between peak positions is seen to shrink. FIG. 8 shows an example of lengthening the pitch period. In this case, the peak positions in the input voice 81 (marked *) move in the synthesized speech 82 to positions that match the target pitch pattern (marked △), and the spacing between peak positions is seen to widen. In this embodiment, the speech waveform is cut out using the window function generated in steps 104 and 105 on the basis of the defining equation (4) and superimposed at the previously determined superposition position, which yields synthesized speech that preserves the pitch structure of the original voice.
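Steps 101 to 105 can be sketched end to end as follows. The equations themselves are not reproduced in this text, so two things here are assumptions rather than the patent's formulas: the coefficient is taken as C = 100/R, which follows if R is read as the ratio of converted to original frequency in percent, and the "phase term" is taken as a phase that runs from 0 to pi over the C*wlf samples before the peak and from pi to 2*pi over the C*wlb samples after it. The pitch-period bounds and the fallback length used for exception handling are hypothetical parameters.

```python
import math

# Base window shapes written as functions of the phase (0 .. 2*pi).
BASES = {
    "hann": lambda t: 0.5 - 0.5 * math.cos(t),
    "hamming": lambda t: 0.54 - 0.46 * math.cos(t),
    "blackman": lambda t: 0.42 - 0.5 * math.cos(t) + 0.08 * math.cos(2.0 * t),
}

def period_scale(rate_percent):
    """Assumed form of coefficient C: converted period = C * original period."""
    if rate_percent <= 0:
        raise ValueError("conversion rate R must be positive")
    return 100.0 / rate_percent

def half_lengths(prev_peak, peak, next_peak, c, min_period, max_period, fallback):
    """Steps 101-103: distances to neighboring peaks, exception handling, scaling by C."""
    wlf = peak - prev_peak
    wlb = next_peak - peak
    if not min_period <= wlf <= max_period:
        wlf = fallback          # distance does not look like a pitch period
    if not min_period <= wlb <= max_period:
        wlb = fallback
    return max(1, round(c * wlf)), max(1, round(c * wlb))

def asymmetric_window(left, right, base="hann"):
    """Steps 104-105: evaluate the base shape on a piecewise-linear phase."""
    shape = BASES[base]
    phases = [math.pi * n / left for n in range(left)]
    phases += [math.pi + math.pi * n / right for n in range(right + 1)]
    return [shape(t) for t in phases]

c = period_scale(200)                                   # frequency doubled
left, right = half_lengths(100, 180, 260, c, 40, 320, 160)
window = asymmetric_window(left, right, base="hamming")
```

Swapping the base only changes the shape function; the phase mapping, and therefore the asymmetry, stays the same.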

[0009]

[Effects of the Invention] According to the present invention, the shape of the waveform cutout window function is determined in consideration of both the input speech waveform and the target pitch pattern. This preserves the pitch pattern of the original voice and absorbs errors in the given peak positions, so that high-quality synthesized speech at the waveform level can be constructed.

[0010]

[Brief Description of the Drawings]

FIG. 1 is a flowchart showing the generation of the waveform cutout window function in one embodiment of the present invention.

FIG. 2 is a configuration diagram showing part of a conventional voice pitch conversion apparatus.

FIG. 3 is a configuration diagram showing part of a voice pitch conversion apparatus in one embodiment of the present invention.

FIG. 4 is a diagram showing example peak positions in a speech waveform having a pitch structure in one embodiment of the present invention.

FIG. 5 is a flowchart showing the processing of the waveform cutout/superposition unit in one embodiment of the present invention.

FIG. 6 is an explanatory diagram of the left-right asymmetric window in one embodiment of the present invention.

FIG. 7 is an explanatory diagram of shortening the pitch period with the window function in one embodiment of the present invention.

FIG. 8 is an explanatory diagram of lengthening the pitch period with the window function in one embodiment of the present invention.

[Explanation of symbols]

21 Input terminal
22 Pre-processing unit
23 Peak position
24 Waveform cutout/superposition unit
25 Target pitch pattern
26 Post-processing unit
27 Output terminal
31 Input terminal
32 Pre-processing unit
33 Peak position extraction unit
34 Waveform cutout/superposition unit
35 Target pitch pattern
36 Post-processing unit
37 Output terminal
61 Preceding peak position
62 Peak position of interest
63 Following peak position
64 Window function
65 Distance to the preceding peak position (wlf)
66 Distance to the following peak position (wlb)
71 Input voice
72 Synthesized speech
81 Input voice
82 Synthesized speech

Continuation of the front page

(58) Fields searched (Int. Cl.⁷, DB name): G10L 21/00 - 21/04; G10H 1/043

Claims

(57) [Claims]

A voice pitch conversion method in which a speech waveform is cut out with a time window function on the basis of local peak positions in an input speech waveform, and the cut-out waveforms are superimposed again to perform voice pitch conversion to a target pitch pattern, wherein, among the local peak positions, the distances between the peak position to which the time window function is applied and the peak positions immediately before and after it in the time series are obtained, and the window length is determined by multiplying those distances by a coefficient obtained from the conversion rate to the target pitch.