JP2001117578A

JP2001117578A - Device and method for adding harmony sound

Info

Publication number: JP2001117578A
Application number: JP30027099A
Authority: JP
Inventors: Takayasu Kondo; 高康近藤; Rosukosu Alex; ロスコスアレックス; Keino Pedro; ケイノペドロ; Bonada Jody; ボナダジョーディ
Original assignee: Universitat Pompeu Fabra UPF; Yamaha Corp
Current assignee: Universitat Pompeu Fabra UPF; Yamaha Corp
Priority date: 1999-10-21
Filing date: 1999-10-21
Publication date: 2001-04-27
Anticipated expiration: 2019-10-21
Also published as: JP4757971B2

Abstract

PROBLEM TO BE SOLVED: To provide a harmony sound adding device which can add natural voice harmony sound to singing voice. SOLUTION: The device is equipped with a phoneme recognition part 2 which recognizes phonemes of an input voice signal, a phoneme feature parameter storage part 3 which stores phoneme feature parameters by the phonemes, a harmony parameter control part 6 which reads phoneme feature parameters corresponding to the phonemes according to the recognized phoneme information and outputs parameter control information on harmony sound to be added according to the phoneme feature parameters and harmony information showing the harmony sound to be added, and a harmony sound synthesis part 7 which performs processes such as pitch shifting, amplitude control, formant shifting, and spectrum tilt control for the input voice signal according to the number of channels of the harmony sound to be added and the read parameter control information and adds and outputs process results, and adds the harmony sound having the features of the phonemes of the input voice signal to the input voice signal and outputs the resulting signal.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a harmony sound adding device that adds, to an input voice signal, a harmony sound generated from that input voice signal.

[0002]

[Prior Art] An example of a conventional harmony sound adding device will be described with reference to FIG. 4. In the device shown in FIG. 4, the voice input from a microphone 101 is pitch-shifted in a pitch shift unit 102 once for each of the N channels of harmony sound to be added, the shifted signals are summed in an adder 103 to synthesize the harmony sound, and an output unit 104 outputs a signal to which the plural harmony sounds have been added. The harmony sound in this case is generated on the basis of MIDI (Musical Instrument Digital Interface) data indicating the harmony melody, musical score information, or the like (hereinafter collectively referred to as harmony information).

[0003] Some other conventional harmony sound adding devices allow a formant shift amount to be set for each note of the harmony. In such devices, controlling the formant shift amount makes it possible to convert between male and female voices, a so-called gender change.

[0004]

[Problems to Be Solved by the Invention] In conventional methods such as those described above, the pitch and amplitude of each added harmony sound are in many cases determined solely by the harmony information, so the harmony can sound monotonous and mechanical. Moreover, because the amplitude was controlled according to the amplitude of the input voice, the unnaturalness of the harmony became even more noticeable. To make the pitch more natural, it has been proposed to apply a fixed pitch change only at the onset of a note, or to add a fixed vibrato, but both are fixed changes and therefore still sound unnatural. Another approach considered is to hold several fixed patterns and switch among them at random at each note onset, but this too tends to add unnatural artifacts and has not produced good results.

[0005] In view of the above circumstances, an object of the present invention is to provide a harmony sound adding device and method capable of adding a natural vocal harmony to a voice sung in, for example, karaoke.

[0006]

[Means for Solving the Problems] To solve the above problems, the invention according to claim 1 comprises: phoneme recognition means for recognizing phonemes of an input voice signal; phoneme feature parameter storage means for storing phoneme feature parameters prepared in advance for each phoneme, each consisting of at least one of a pitch, an amplitude, a formant shift reference value, and a spectral tilt reference value; and harmony sound synthesis means for generating a harmony sound signal by pitch-shifting the input voice signal to a harmony interval and, when generating the harmony sound signal, reading from the phoneme feature parameter storage means the phoneme feature parameters corresponding to the phoneme indicated by the phoneme information recognized by the phoneme recognition means, and performing, for each phoneme, at least one of pitch shifting, amplitude control, formant shifting, and spectral tilt control according to the parameters read.

[0007] As described above, the main feature of the present invention is that, when synthesizing a harmony sound, feature parameters prepared in advance for each phoneme (pitch, amplitude, temporal change of the spectrum, and so on) are selected using the phoneme information obtained from the input voice signal by a predetermined phoneme recognition method and are factored into the control parameters of the harmony sound, so that a harmony sound having the characteristics of that phoneme is synthesized.

[0008]

[Embodiments of the Invention] Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an embodiment of a harmony sound adding device according to the present invention. The device shown in FIG. 1 applies the present invention to a karaoke apparatus. It performs predetermined processing on the voice input from a singer's microphone 1 to obtain one or more harmony sounds, combines them, and further mixes in the accompaniment sound output from an accompaniment performance unit 9 before output.

[0009] The microphone 1 picks up the singer's voice. The picked-up voice signal is input to a phoneme recognition unit 2, a voice analysis unit 2*, and a harmony sound synthesis unit 7. The phoneme recognition unit 2 extracts the phoneme of the input voice signal using a known phoneme recognition method, compares it with a plurality of pre-registered phoneme entries, and outputs the phoneme number of the most similar entry. Registering more phonemes gives better results but enlarges the phoneme feature parameter storage unit 3 described below, so about 20 to 100 phonemes is sufficient. The voice analysis unit 2* detects and outputs the pitch of the voice. This pitch information is used to compute the difference between the output pitch of each harmony sound specified by the harmony information 4 and the pitch of the input voice signal.
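The recognition and pitch-difference steps described above can be sketched as a nearest-template lookup plus an interval calculation. This is only an illustration: the text says no more than that a known phoneme recognition method is used, so the feature vectors, the Euclidean distance measure, and the names `TEMPLATES`, `recognize_phoneme`, and `pitch_difference_cents` are all assumptions.

```python
import math

# Hypothetical registered phoneme templates: phoneme number -> feature vector.
TEMPLATES = {
    0: [0.9, 0.1, 0.0],
    1: [0.2, 0.8, 0.1],
    2: [0.1, 0.2, 0.9],
}

def recognize_phoneme(features):
    """Return the phoneme number whose template is closest to `features`."""
    return min(TEMPLATES, key=lambda p: math.dist(TEMPLATES[p], features))

def pitch_difference_cents(input_hz, harmony_hz):
    """Signed interval from the input pitch to the harmony pitch, in cents."""
    return 1200.0 * math.log2(harmony_hz / input_hz)
```

A real device would register 20 to 100 templates, as the paragraph suggests, rather than the three shown here.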

[0010] The pitch detection of the voice analysis unit 2* may instead be performed by the voice recognition unit 2. This pitch detection is needed only when the difference information is required and may otherwise be omitted. For example, the difference information is unnecessary when the singing voice is frequency-converted so as to match the harmony melody specified by the harmony information 4, rather than being shifted by its difference from the harmony melody.

[0011] The phoneme feature parameter storage unit 3 stores, for each phoneme, a plurality of feature data items obtained in advance from singing data or the like. In this embodiment, the elements of the feature data are a pitch, an amplitude, a formant shift reference value, and a spectral tilt reference value. The feature data may also consist of at least one of these elements. Each element is described below.

[0012] (1) Pitch: Each phoneme has its own characteristic pitch fluctuation, and this data is used to reproduce it. Here, the characteristic pitch of each phoneme (preferably with the vibrato component removed) is expressed as a ratio to its average pitch and sampled at a fixed time interval (for example, 5 ms). If only a small amount of data can be stored, the data may consist of about 100 ms from the head of each phoneme followed by looped data for the remainder. The looped data is obtained by storing the data for a certain duration of a stable portion (for example, 200 ms) together with that duration (or number of samples) and repeating the data with that period.
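The compact storage scheme for pitch data (a head segment followed by a looped stable segment) can be sketched as a lookup function. The function name `pitch_ratio_at` and its list-based data layout are hypothetical; only the 5 ms sampling step and the head-plus-loop idea come from the text.

```python
def pitch_ratio_at(t_ms, head, loop, step_ms=5):
    """Pitch ratio (relative to the phoneme's average pitch) at time t_ms.

    head: ratios sampled every step_ms from the phoneme onset (about 100 ms).
    loop: ratios from a stable portion (e.g. 200 ms), repeated afterwards.
    """
    index = int(t_ms // step_ms)
    if index < len(head):
        return head[index]
    # Past the head segment: cycle through the stored stable portion.
    return loop[(index - len(head)) % len(loop)]
```

With 5 ms sampling, a 100 ms head is 20 values and a 200 ms loop is 40 values per phoneme.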

[0013] (2) Amplitude: Each phoneme has its own characteristic amplitude fluctuation, and this data is used to reproduce it. The amplitude data of the waveform for each phoneme can be held as the amplitude envelope of the phoneme, expressed as a ratio to the envelope's maximum or to its average level, and sampled in the same format as the pitch data above. Per-phoneme amplitude control of the harmony sound may be omitted when the harmony sound has a given amplitude from the outset (here, the amplitude corresponding to the input voice), as in methods that create the harmony sound by joining voice waveforms pitch-shifted to the harmony interval along the time axis. Control using amplitude data as feature data is particularly effective when the harmony sound can be controlled independently of the input voice amplitude, or when the adopted method would otherwise fix the amplitude of the harmony sound.
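The two normalizations mentioned for the amplitude envelope (ratio to the peak, or ratio to the average level) might look like the following. The function name `normalize_envelope` and its `reference` argument are illustrative assumptions, not part of the described device.

```python
def normalize_envelope(envelope, reference="peak"):
    """Express an amplitude envelope as ratios to its peak or mean level."""
    if reference == "peak":
        ref = max(envelope)
    elif reference == "mean":
        ref = sum(envelope) / len(envelope)
    else:
        raise ValueError("reference must be 'peak' or 'mean'")
    return [a / ref for a in envelope]
```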

[0014] (3) Formant shift reference value: A value set for each phoneme and used as the basis for assigning a different shift amount to each phoneme when formant shifting is performed. The reason the formant shift reference value is needed is as follows. If, for example, a large number of unspecified speakers are asked to say various words, the differences between speakers' average formant frequencies (for example, the frequency midway between the first and second formants) turn out to vary from phoneme to phoneme with nearly the same tendency. This follows from the mechanism of voice production. Consequently, applying the same shift amount to every phoneme when formant shifting does not yield a natural-sounding harmony. The formant shift reference value is therefore set per phoneme and carries no time-varying information. Any concrete data format may be used as long as the reference values are held in normalized form for each phoneme. As a simple example, suppose they are held as logarithmic values (for example, cents) and normalized to 100. If the average formant differences found for three phonemes are 20 cents, 10 cents, and 50 cents, then taking the maximum of 50 cents as 100 gives formant shift reference values of 40, 20, and 100, respectively.
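The normalization in the worked example (20, 10, and 50 cents becoming 40, 20, and 100) can be written as a short function. The name `formant_shift_reference` and the `scale` parameter are hypothetical labels for this sketch.

```python
def formant_shift_reference(avg_formant_diff_cents, scale=100.0):
    """Normalize per-phoneme formant differences so the largest equals `scale`."""
    peak = max(avg_formant_diff_cents)
    return [scale * d / peak for d in avg_formant_diff_cents]

print(formant_shift_reference([20, 10, 50]))  # [40.0, 20.0, 100.0]
```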

[0015] (4) Spectral tilt reference value: The slope of the spectrum (spectral tilt) differs from phoneme to phoneme, and this feature data determines the reference value for it. It is set for each phoneme and used as the basis for assigning a different tilt amount to each phoneme when spectral tilt control is performed. In this embodiment, the spectral slope of the added harmony sound is changed according to the difference between the pitch of the input voice and the pitch of the output voice; that is, spectral tilt control is performed. If tilt control were applied uniformly so that every phoneme received the same spectral slope, the same kind of unnaturalness as described for formant shifting could arise. In this embodiment, the spectral tilt reference value is therefore used to vary the tilt control conditions from phoneme to phoneme so that this problem can be corrected. This parameter is not time-series data but a single value per phoneme. Any representation may be used as long as the values for the phonemes are held in normalized form. A specific example is described next.

[0016] The spectral tilt reference value is obtained as follows. First, a speaker utters each phoneme (indexed by phoneme number p) at a plurality of pitches from low to high (indexed by frequency number f), and the uttered signals are spectrally analyzed. The spectral slope of the analysis result is then obtained for each phoneme and each pitch. The spectral slope Xfp for each phoneme and pitch (the spectral tilt coefficient, where f is the frequency number and p is the phoneme number) can be calculated by, for example, the following equation.

[Equation 1]

Here, i is the frequency index of the spectral analysis result, N is the maximum index, x is the frequency value at each index, y is the magnitude of the component at each index, and i = 0 denotes the lowest-frequency spectral component.
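Equation 1 itself is not reproduced in this text, so the following is not the patent's formula. As a stand-in, one common way to estimate a spectral slope from the quantities defined above (frequency x and magnitude y at each index i) is an ordinary least-squares line fit; the function name `spectral_slope` is hypothetical.

```python
def spectral_slope(freqs, mags):
    """Least-squares slope of magnitude versus frequency (assumed method)."""
    n = len(freqs)
    sum_x = sum(freqs)
    sum_y = sum(mags)
    sum_xy = sum(x * y for x, y in zip(freqs, mags))
    sum_xx = sum(x * x for x in freqs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
```

A spectrum whose magnitude falls by 2 units per 100 Hz would give a slope of -0.02 under this fit.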

[0017] Next, from the spectral slopes Xfp for each pitch and phoneme, the interval-to-slope rate (interval meaning pitch difference), that is, the rate of change of the spectral slope Xfp with respect to the interval, is obtained for each phoneme p. It is computed by the same method as the spectral slope Xfp (a calculation similar to the above equation) and is denoted Yp (p: phoneme number). The average interval-to-slope rate Y over all phonemes is then obtained from the per-phoneme rates Yp. Next, for each phoneme, Yp is divided by the average rate Y, and the resulting value Yp/Y is used as the tilt reference value. In addition, the values that the spectral tilt change amount Y obtained here takes as the pitch difference is varied are stored, as a tilt change amount table indexed by interval (pitch difference value), in a predetermined storage device in the phoneme feature parameter storage unit 3 or the harmony parameter control unit 6.
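The tilt reference value Yp/Y can be illustrated with a small helper. It assumes the average rate Y is the plain arithmetic mean of the per-phoneme rates Yp; the function name `tilt_reference_values` and the dictionary layout are hypothetical.

```python
def tilt_reference_values(rates_by_phoneme):
    """Map phoneme number -> Yp / Y, where Y is the mean of all Yp."""
    mean_rate = sum(rates_by_phoneme.values()) / len(rates_by_phoneme)
    return {p: yp / mean_rate for p, yp in rates_by_phoneme.items()}
```

A phoneme whose slope changes with interval exactly as fast as the average gets a reference value of 1.0.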

[0018] The above is the content stored in the phoneme feature parameter storage unit 3. Next, the harmony information 4 in FIG. 1 is information indicating the interval (pitch) of the harmony to be added. It may be supplied from MIDI-standard song data or the like, or included in the musical score information and held as sequence information. The formant shift degree setting unit 51 is a means by which the operator sets the formant shift amount with a predetermined control when the harmony is to sound not like the singer's own voice but, for example, like a female voice for a male singer or a male voice for a female singer. Alternatively, when the setting is to be changed from song to song, it may be included in the musical score information and held as sequence information, and the unit may read out and apply that value.

[0019] The harmony parameter control unit 6 generates and outputs the control parameters for the harmony sound synthesis unit 7 (harmony parameter control information) from the phoneme feature parameter values for the input phoneme output by the phoneme feature parameter storage unit 3, the harmony information 4, the setting of the formant shift degree setting unit 51, and values such as the harmony thickness degree specified by the operator through a predetermined control on the harmony thickness degree setting unit 5. The harmony parameter control information comprises a pitch, an amplitude, a formant shift amount, and a spectral tilt coefficient. Each parameter is described below.

[0020] (1) Pitch parameter: Obtained by adding or multiplying the interval specified by the harmony information 4 for each harmony sound and the pitch information of the phoneme feature parameters, this parameter determines the pitch shift amount of each harmony sound in the harmony sound synthesis unit 7. Because the pitch information of the phoneme feature parameters is added or multiplied in when generating the pitch parameter, the subtle pitch fluctuation specific to each phoneme can be controlled.
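The "adding or multiplying" wording is consistent with two equivalent views of the same operation: multiply frequency ratios, or add intervals in cents. A minimal sketch, assuming the harmony interval is given in cents and the phoneme pitch data is a ratio to the average pitch (both function names are hypothetical):

```python
import math

def pitch_shift_ratio(interval_cents, phoneme_pitch_ratio):
    """Multiplicative form: harmony interval times the per-phoneme fluctuation."""
    return 2.0 ** (interval_cents / 1200.0) * phoneme_pitch_ratio

def pitch_shift_cents(interval_cents, phoneme_pitch_ratio):
    """Additive form: the same quantity expressed in cents."""
    return interval_cents + 1200.0 * math.log2(phoneme_pitch_ratio)
```

The amplitude parameter could be combined in the same way, with levels in place of intervals.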

[0021] (2) Amplitude parameter: Obtained by adding or multiplying the amplitude specified by the harmony information 4 for each sound and the amplitude information of the phoneme feature parameters, this parameter determines the amplitude of each harmony sound in the harmony sound synthesis unit 7. Because the amplitude information of the phoneme feature parameters is added or multiplied in when generating the amplitude parameter, the subtle amplitude fluctuation specific to each phoneme can be controlled.

[0022] (3) Formant shift amount parameter: The formant shift amount for each harmony sound is determined from three pieces of information. The first is the per-phoneme formant shift reference value in the phoneme feature parameters. The second is the formant shift degree set by the formant shift degree setting unit 51. This parameter is an offset that determines how far the formant shift amount of a harmony sound is displaced depending on which of the harmony sounds it is. It can be held, for example, as a value in cents; that value is multiplied by the harmony thickness degree (taken as ranging from 0 to 1.0), and the product is the offset of the formant shift amount of each harmony sound. The third is a parameter determined on the basis of the harmony thickness degree set by the harmony thickness degree setting unit 5.

[0023] The formant shift amount is determined from these three pieces of information, that is, from the phoneme, the formant shift degree, and the thickness degree. Several patterns of formant shift reference values may be prepared in advance and made selectable through the harmony thickness degree setting unit 5. Shifting the formants of the individual harmony sounds according to the harmony thickness degree produces the effect of several people singing together, so the thickness of the harmony can be controlled.
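The text names the three inputs to the formant shift amount but does not give the exact rule for combining them. The sketch below shows one plausible combination, purely as an assumption: the shift degree is scaled by the phoneme's reference value (normalized to 100), and a per-channel offset in cents is scaled by the thickness degree, as the text describes for the offset. The function and parameter names are hypothetical.

```python
def formant_shift_cents(reference_value, shift_degree_cents,
                        channel_offset_cents, thickness):
    """Formant shift for one harmony channel, in cents (assumed combination).

    reference_value: per-phoneme reference value, normalized to 100.
    shift_degree_cents: setting from the formant shift degree control.
    channel_offset_cents: per-channel offset before thickness scaling.
    thickness: harmony thickness degree in the range 0.0 to 1.0.
    """
    if not 0.0 <= thickness <= 1.0:
        raise ValueError("thickness must be between 0.0 and 1.0")
    base = shift_degree_cents * reference_value / 100.0
    return base + channel_offset_cents * thickness
```

With thickness 0.0 every channel receives the same shift; raising it spreads the channels apart, which corresponds to the "several people singing" effect.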

[0024] (4) Spectral tilt coefficient: A parameter that determines the control conditions for spectral tilt control. To determine it, the difference between the input pitch and the output pitch of each sound is first obtained from the input voice pitch supplied by the voice analysis unit 2* and the interval information of each harmony sound in the harmony information 4. Based on that difference, the tilt change amount common to all phonemes corresponding to the difference is looked up in the tilt change amount table in the phoneme feature parameter storage unit 3. From this common tilt change amount and the tilt reference value in the phoneme feature parameter storage unit 3, the spectral tilt change amount for the input phoneme is obtained, and the spectral tilt coefficient is then obtained from that change amount and the pitch difference value. This makes it possible to specify a spectral tilt suited to both the phoneme and the interval of the input voice, avoiding the monotony of conventional harmony sounds, in which the spectral tilt is the same for every interval and every phoneme, and producing a natural harmony sound.
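The first two steps of the tilt calculation (table lookup by pitch difference, then scaling by the phoneme's tilt reference value) can be sketched as follows. Linear interpolation between table entries and multiplication by the reference value are assumptions, as are the function names. The final conversion to a spectral tilt coefficient is not specified in enough detail to reproduce, so it is left out.

```python
from bisect import bisect_right

def common_tilt_change(table, diff_cents):
    """Linearly interpolate a sorted (pitch_diff_cents, tilt_change) table."""
    xs = [x for x, _ in table]
    if diff_cents <= xs[0]:
        return table[0][1]
    if diff_cents >= xs[-1]:
        return table[-1][1]
    i = bisect_right(xs, diff_cents)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (diff_cents - x0) / (x1 - x0)

def phoneme_tilt_change(table, diff_cents, tilt_reference):
    """Scale the common tilt change by the phoneme's tilt reference value."""
    return common_tilt_change(table, diff_cents) * tilt_reference
```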

[0025] Next, the harmony sound synthesis unit 7 in FIG. 1 will be described. The harmony sound synthesis unit 7 adds harmony sounds to the input voice signal and has N processing circuits 1 to N (N channels) for adding harmony to the input voice. The processing circuits 1 to N together constitute a processing unit 71. The harmony sound synthesis unit 7 performs control according to the harmony parameter control information for each harmony sound supplied from the harmony parameter control unit 6, and the resulting harmony sounds are summed in an adder 72 and output.

[0026] The original sung voice signal is output, for example, by passing it through one of the harmony channels (for example, channel 1) with no processing applied. Alternatively, that one channel may apply only a delay to synchronize with the delay of the other channels, or only a pitch shift smaller than in the other channels, for example just enough to correct the input voice's deviation from the melody. A separate through path for the input voice may also be provided and mixed in at the adder 72.

[0027] Next, the mixing amplifier 8 adds the harmony sound from the harmony sound synthesis unit 7 to the accompaniment sound from the accompaniment performance unit 9 and outputs the result.

[0028] FIG. 2 is a block diagram showing an example configuration in which the harmony sound synthesis unit 7 of FIG. 1 is implemented using the SMS (Spectral Modeling Synthesis) analysis/synthesis method. SMS analysis/synthesis is described in, for example, Japanese Unexamined Patent Publication No. H7-325583, "Sound Analysis and Synthesis Method and Apparatus," and Japanese Unexamined Patent Publication No. H11-133995, "Voice Conversion Device."

[0029] The SMS analysis unit 701 cuts the input voice signal into frames of a predetermined length, converts each frame to a frequency spectrum by FFT (fast Fourier transform), extracts sinusoidal components and a residual component from the spectrum by SMS analysis, and outputs them frame by frame. The N processing units 1 (702-1), 2 (702-2), ..., N (702-N) are signal processing circuits corresponding to the individual harmony channels. Based on the N sets of harmony parameter control information supplied to them individually by the harmony parameter control unit 6, they apply processing such as amplitude control, pitch control, and spectral tilt to the sinusoidal and residual components and output the results. The adder 703 sums the results from the processing units 1 to N (702-1 to 702-N) and outputs the sum. The inverse FFT unit 704 converts the summed result into waveform data by inverse FFT and outputs it.

[0030] FIG. 3 is a block diagram showing the configuration of the processing units 1 to N (702-1 to 702-N) of FIG. 2. Each processing unit 702 (702-1 to 702-N) comprises a control unit 7021 and a residual component composite filter 7022. The control unit 7021 applies pitch shift, formant shift, spectral tilt, and amplitude control to the sine wave component based on the harmony parameter control information and outputs the result. The residual component composite filter 7022 applies a filtering process that controls the frequency components of the residual component, based on frequency-related information such as the pitch shift contained in the harmony parameter control information, and outputs the result.
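As a rough illustration of what a control unit of this kind might do to the sine wave component, the sketch below applies the four controls to a list of (frequency, amplitude) pairs. Everything here is an assumption made for the example: the function name `process_sinusoids`, the `envelope` callable that returns a spectral envelope amplitude at a given frequency, the choice to re-read amplitudes from that envelope after shifting, and the dB-per-octave tilt rule around `ref_hz`. The specification does not state how these operations are implemented.

```python
import math

def process_sinusoids(sinusoids, envelope, pitch_ratio, formant_ratio=1.0,
                      tilt_db_per_octave=0.0, gain=1.0, ref_hz=1000.0):
    """Apply pitch shift, formant shift, spectral tilt, and gain to sinusoids.

    sinusoids: list of (frequency_hz, amplitude) pairs with frequency > 0.
    envelope:  hypothetical callable mapping frequency_hz to an amplitude.
    """
    out = []
    for freq, _amp in sinusoids:
        # Pitch shift: move every partial by the same frequency ratio.
        new_freq = freq * pitch_ratio
        # Formant shift: read the envelope at a scaled frequency, so the
        # envelope itself is stretched by formant_ratio (1.0 keeps it fixed).
        amp = envelope(new_freq / formant_ratio)
        # Spectral tilt: gain in dB proportional to octaves above ref_hz.
        tilt_db = tilt_db_per_octave * math.log2(new_freq / ref_hz)
        # Amplitude control: overall gain for this harmony channel.
        out.append((new_freq, gain * amp * 10 ** (tilt_db / 20)))
    return out

# Usage: shift two partials up by a ratio of 1.25 with a -6 dB/octave tilt.
shifted = process_sinusoids([(200.0, 1.0), (400.0, 1.0)], lambda f: 1.0,
                            pitch_ratio=1.25, tilt_db_per_octave=-6.0,
                            ref_hz=250.0)
```

Reading amplitudes from the envelope rather than carrying over the original amplitudes is one common way to keep formants independent of pitch; other designs are equally possible.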

[0031] With the configurations of FIGS. 2 and 3, using the SMS analysis/synthesis method allows pitch, amplitude, formant shift, and tilt control to be performed in the frequency domain. The harmony sounds can also be combined by summing them in the frequency domain and then applying a single inverse FFT, so a large number of harmony sounds can be synthesized easily.
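The reason one inverse transform suffices is that the inverse Fourier transform is linear: transforming the sum of the channel spectra gives the same waveform as summing the separately transformed channels. The sketch below checks this with a naive inverse DFT standing in for the inverse FFT; the names `idft` and `mix_channels` are hypothetical.

```python
import cmath
import math

def idft(spectrum):
    """Naive inverse DFT of a full-length spectrum, returning the real part."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def mix_channels(channel_spectra):
    """Sum N channel spectra bin by bin, then apply one inverse transform."""
    n = len(channel_spectra[0])
    summed = [sum(ch[k] for ch in channel_spectra) for k in range(n)]
    return idft(summed)

# Two 8-bin channel spectra with conjugate-symmetric bins (real waveforms).
a = [0j] * 8
a[1], a[7] = 4 + 0j, 4 + 0j
b = [0j] * 8
b[2], b[6] = -2j, 2j

mixed = mix_channels([a, b])                            # one inverse transform
separate = [x + y for x, y in zip(idft(a), idft(b))]    # N inverse transforms
```

Here `mixed` and `separate` agree to within rounding error, while `mix_channels` performs one inverse transform regardless of the number of channels.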

[0032] (Operation of the embodiment) With the configuration described above, the phonemes of the singing voice input from the microphone 1 are recognized by the phoneme recognition unit 2. According to the recognized phoneme, the feature parameters for that phoneme are output from the phoneme feature parameter storage unit 3 and supplied to the harmony parameter control unit 6. The harmony parameter control unit 6 obtains the pitch shift amount needed to form the interval of the harmony sound. It then generates harmony parameters, including per-phoneme control, based on the supplied phoneme feature parameters and the output signals of the harmony thickness degree setting unit 5 and the formant shift degree setting unit 51, and supplies them to the harmony sound synthesis unit 7. As a result, when the harmony sound synthesis unit 7 generates a harmony sound by pitch-shifting the input voice to the harmony interval, it additionally applies fine pitch control, amplitude control, and formant shift corresponding to the phoneme. Spectral tilt is controlled according to both the phoneme and the interval. Through this processing, a natural, thick harmony sound is added to the singing voice, giving a resonant harmony effect that conventional devices could not provide.
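The flow from recognized phoneme to per-channel harmony parameters can be sketched as a table lookup followed by a small calculation. The sketch is illustrative only: the field names, the units, the table values, and the way the thickness degree and formant shift degree scale the per-phoneme reference values are all assumptions, since this passage does not give the formulas.

```python
from dataclasses import dataclass

@dataclass
class PhonemeFeatures:
    """Per-phoneme reference values (hypothetical fields and units)."""
    pitch_offset_cents: float   # fine pitch offset for harmony voices
    amplitude: float            # relative amplitude of harmony voices
    formant_shift_ref: float    # reference formant shift, as a fraction
    tilt_ref_db: float          # reference spectral tilt in dB

# Stand-in for the phoneme feature parameter storage; values are invented.
FEATURE_TABLE = {
    "a": PhonemeFeatures(0.0, 1.0, 0.05, -1.0),
    "i": PhonemeFeatures(3.0, 0.8, 0.02, -2.0),
}

def harmony_parameters(phoneme, intervals_semitones, thickness, formant_degree):
    """Build one parameter set per harmony channel for the given phoneme."""
    feat = FEATURE_TABLE[phoneme]
    params = []
    for semitones in intervals_semitones:
        # Harmony interval plus the phoneme-dependent fine pitch offset.
        cents = 100.0 * semitones + feat.pitch_offset_cents
        params.append({
            "pitch_ratio": 2.0 ** (cents / 1200.0),
            # Assumed rule: thickness degree scales the phoneme amplitude.
            "gain": feat.amplitude * thickness,
            # Assumed rule: formant degree scales the reference shift.
            "formant_ratio": 1.0 + formant_degree * feat.formant_shift_ref,
            # Tilt depends on both phoneme and interval (assumed linear rule).
            "tilt_db": feat.tilt_ref_db * abs(semitones) / 12.0,
        })
    return params

# Usage: two harmony channels, a major third and a fifth above, on "a".
channels = harmony_parameters("a", [4, 7], thickness=0.5, formant_degree=1.0)
```

Each dictionary in `channels` would correspond to the control information sent to one processing unit.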

[0033] The embodiment of the present invention can be implemented as a combination of a semiconductor integrated circuit for signal processing and a microprogram or the like loaded into it. It can also be realized as a combination of a computer and its peripheral devices with a program executed on that computer. When it is implemented with a computer and a program, the program executed by the computer can be recorded on a computer-readable recording medium and distributed.

[0034]

[Effects of the Invention] As described above, according to the present invention, the features of the singer's phonemes are reflected in the harmony sound, so a more natural harmony sound can be obtained. In addition, setting a formant shift amount for each harmony voice produces an effect as if several people were singing. Furthermore, if the formant shift amount is then made to depend on each phoneme, the phonological unnaturalness that arises with a fixed shift amount disappears, and a more natural effect can be obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the harmony sound adding device according to the present invention.

FIG. 2 is a block diagram showing the configuration of the harmony sound synthesis unit 7 of FIG. 1.

FIG. 3 is a block diagram showing the configuration of the processing units 1 to N of FIG. 2.

FIG. 4 is a block diagram showing an embodiment of a conventional harmony sound adding device.

[Explanation of symbols]

1: microphone, 2: phoneme recognition unit, 2*: voice analysis unit, 3: phoneme feature parameter storage unit, 4: harmony information, 5: harmony thickness degree setting unit, 6: harmony parameter control unit, 7: harmony sound synthesis unit, 8: mixing amplifier, 9: accompaniment performance unit, 701: SMS analysis unit, 702, 702-1 to 702-N: processing units, 703: adder, 704: inverse FFT unit, 7021: control unit, 7022: residual component composite filter.

─────────────────────────────────────────────────────

[Procedural amendment]

[Submission date] December 9, 1999

[Procedural amendment 1]

[Name of document to be amended] Specification

[Item to be amended] 0029

[Method of amendment] Change

[Content of amendment]

[0029] The SMS analysis unit 701 cuts the input voice signal into frames of a predetermined length, converts each frame into a frequency spectrum by FFT (Fast Fourier Transform), extracts a sine wave component and a residual component from the spectrum analysis result by SMS analysis, and outputs them frame by frame. The N processing units, namely processing unit 1 (702-1), processing unit 2 (702-2), ..., processing unit N (702-N), are signal processing circuits corresponding to the respective channels of the harmony sound. Based on the N pieces of harmony parameter control information supplied to the respective circuits from the harmony parameter control unit 6, they apply processes such as amplitude control, pitch control, and spectral tilt to the sine wave component and the residual component and output the results. The adder 703 adds the results processed by the processing units 1 to N (702-1 to 702-N) and outputs the sum. The inverse FFT unit 704 converts the summed result into waveform information by inverse FFT and outputs it.

Continued from the front page
(72) Inventor: Alex Rothkos, Merce 12, 08002 Barcelona, Spain
(72) Inventor: Pedro Keino, Merce 12, 08002 Barcelona, Spain
(72) Inventor: Jodi Bonada, Merce 12, 08002 Barcelona, Spain
F-terms (reference): 5D015 BB02 KK02 KK04 5D045 AB30 5D108 BD07 BF02 BF06 5D378 JC02 JC10 KK02 MM97

Claims

[Claims]

1. A harmony sound adding device comprising: phoneme recognition means for recognizing a phoneme of an input voice signal; phoneme feature parameter storage means for storing, for each phoneme, a phoneme feature parameter prepared in advance and comprising at least one of a pitch, an amplitude, a formant shift reference value, or a spectral tilt reference value; and harmony sound synthesis means for generating a harmony sound signal by pitch-shifting the input voice signal to a harmony pitch, wherein, when generating the harmony sound signal, the harmony sound synthesis means reads the phoneme feature parameter corresponding to the phoneme from the phoneme feature parameter storage means based on the phoneme information recognized by the phoneme recognition means, and performs at least one of pitch shift, amplitude control, formant shift, or spectral tilt control for each phoneme according to the read phoneme feature parameter.