JP3294192B2

JP3294192B2 - Voice conversion device and voice conversion method

Info

Publication number: JP3294192B2
Application number: JP17503898A
Authority: JP
Inventors: Hiraku Kayama (嘉山 啓); Xavier Serra (セラ ザビエル)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 1998-06-22
Filing date: 1998-06-22
Publication date: 2002-06-24
Anticipated expiration: 2018-06-22
Also published as: JP2000010599A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a voice conversion device and a voice conversion method, and more particularly to a voice conversion device and method that, in karaoke and similar applications, convert a singer's singing voice into the singing voice of a specific singer chosen as the conversion target, so that the song sounds as if another person were singing it.

[0002]

[Description of the Related Art] Various voice conversion devices that change the frequency characteristics and other properties of an input voice before outputting it have been developed. For example, some karaoke machines convert the pitch of the singer's voice so that a male voice becomes a female voice or vice versa (see, for example, Japanese National Publication of International Patent Application No. Hei 8-508581).

[0003]

[Problems to be Solved by the Invention] Conventional voice conversion devices do convert the voice (for example, male to female or female to male), but they merely change the voice quality. They therefore cannot convert a voice so that it resembles that of a specific singer (for example, a professional singer). A function that imitates not only the voice quality but also the singing style of a specific singer would be highly entertaining in a karaoke machine, but such processing was not possible with conventional voice conversion devices.

[0004] One conceivable way to solve these problems is a voice conversion device based on signal processing that represents a voice signal as a sine wave (SIN) component, which is a sum of sine waves, plus a residual (RESIDUAL) component that cannot be represented by sine waves. Such a device modifies the singer's voice signal (sine wave component and residual component) on the basis of the voice signal (sine wave component and residual component) of the specific singer who is the conversion target, generates a voice signal that reflects the voice quality and singing style of the imitated singer, and outputs it together with the accompaniment.

[0005] In a voice conversion device configured this way, the residual component contains a pitch component. When the sine wave component and the residual component are each converted and then combined, the listener therefore hears the pitch components contained in both. If these pitch components differ in frequency, the naturalness of the converted voice may be impaired. An object of the present invention is therefore to provide a voice conversion device and a voice conversion method that can convert a voice without impairing its naturalness.

[0006]

[Means for Solving the Problems] To solve the above problems, the configuration according to claim 1 comprises: sine wave component extraction means for extracting a sine wave component from an input voice signal; residual component extraction means for extracting, from the input voice signal, the residual component other than the sine wave component extracted by the sine wave component extraction means; sine wave component modification means for modifying the sine wave component extracted by the sine wave component extraction means on the basis of the sine wave component of a target voice signal; residual component modification means for modifying the residual component extracted by the residual component extraction means on the basis of the residual component of the target voice signal; removal means for removing the pitch component and its harmonic components from the residual component obtained by the residual component modification means; and synthesis means for combining the sine wave component modified by the sine wave component modification means with the residual component from which the pitch component and its harmonic components have been removed by the removal means.

[0007] The configuration according to claim 2 is the configuration according to claim 1, further comprising pitch determination means that sets, as the pitch of the attenuation peaks of the removal means, any one of the pitch of the sine wave component of the input voice signal, the pitch of the sine wave component of the target voice signal, and the pitch of the sine wave component obtained by the sine wave component modification means.

[0008] The configuration according to claim 3 is the configuration according to claim 1, wherein, when the residual component is held on the frequency axis, the removal means is a comb filter whose attenuation peaks have the pitch determined by the pitch determination means.

[0009] The configuration according to claim 4 is the configuration according to claim 1, wherein, when the residual component is held on the time axis, the removal means is a comb filter having a delay filter whose delay time is the reciprocal of the attenuation-peak pitch determined by the pitch determination means.
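As an illustration of a time-axis comb filter whose delay is the reciprocal of the pitch, the following minimal Python sketch may help. It assumes a simple feedforward structure y[n] = 0.5 * (x[n] - x[n - D]), which is only one possible comb filter; the function name `comb_filter_time`, the sample rate and the pitch value are hypothetical and do not come from the specification.

```python
import math

def comb_filter_time(x, sample_rate, pitch_hz):
    """Feedforward comb filter y[n] = 0.5 * (x[n] - x[n - D]).

    D is the delay 1 / pitch_hz expressed in samples, so the filter has
    zeros at pitch_hz and at every integer multiple of it (assumed form).
    """
    delay = max(1, round(sample_rate / pitch_hz))
    y = []
    for n, sample in enumerate(x):
        delayed = x[n - delay] if n >= delay else 0.0
        y.append(0.5 * (sample - delayed))
    return y

fs = 44100          # hypothetical sample rate in Hz
f0 = 441.0          # hypothetical pitch; fs / f0 is exactly 100 samples
tone = [math.sin(2 * math.pi * f0 * n / fs) for n in range(1000)]
filtered = comb_filter_time(tone, fs, f0)
# After the first `delay` samples the pitch component is cancelled.
leftover = max(abs(v) for v in filtered[100:])
```

With this structure a component midway between two harmonics (for example 1.5 times the pitch) passes with unit gain, which is why the non-harmonic part of the residual survives the filtering.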

[0010] The configuration according to claim 5 comprises: a component extraction step of extracting, from an input voice, a sine wave component and a residual component, which is the component other than the sine wave component; a sine wave component modification step of modifying the extracted sine wave component on the basis of the sine wave component of a target voice; a residual component modification step of modifying the extracted residual component on the basis of the residual component of the target voice; a removal step of removing the pitch component and its harmonic components from the residual component obtained in the residual component modification step; and a synthesis step of combining the sine wave component modified in the sine wave component modification step with the residual component obtained in the removal step, from which the pitch component and its harmonic components have been removed.

[0011] The configuration according to claim 6 is the configuration according to claim 5, further comprising a pitch determination step of setting, as the pitch of the attenuation peaks of the removal means, any one of the pitch of the sine wave component of the input voice, the pitch of the sine wave component of the target voice, and the pitch of the sine wave component obtained by the sine wave component modification means.

[0012] According to the present invention, the sine wave component and the residual component extracted from the input voice signal are each modified on the basis of the sine wave component or the residual component of the target voice signal. Then, before the modified sine wave component and residual component are combined, the pitch component and its harmonic components are removed from the residual component. Consequently only the pitch component of the sine wave component is ultimately heard, which improves the naturalness of the voice.

[0013]

[Embodiments of the Invention] [1] Overview of processing in the embodiment

First, an overview of the processing in the embodiment is given.

[1.1] Step S1

First, the voice (input voice signal) of the singer who wants to imitate (me) is subjected in real time to SMS (Spectral Modeling Synthesis) analysis, which includes an FFT (Fast Fourier Transform). The sine wave component (Sine component) is extracted frame by frame, and the residual component (Residual component) Rme is generated frame by frame from the input voice signal and the sine wave component. In parallel, it is determined whether the input voice signal is unvoiced (including silence). If it is unvoiced, steps S2 to S6 below are skipped and the input voice signal is output as is. The SMS analysis used here is a pitch-synchronous analysis in which the analysis window width is changed according to the pitch of the previous frame.

[0014] [1.2] Step S2

Next, if the input voice signal is voiced, the source attribute (Attribute) data, namely the pitch (Pitch), amplitude (Amplitude) and spectral shape (Spectral Shape), are extracted from the extracted sine wave component. The extracted pitch and amplitude are further separated into a vibrato component and the components other than the vibrato component.

[0015] [1.3] Step S3

From the previously stored attribute data of the singer to be imitated (Target) (target attribute data = pitch, amplitude and spectral shape), the target attribute data (= pitch, amplitude and spectral shape) of the frame corresponding to the frame of the input voice signal of the imitating singer (me) is retrieved. If no target attribute data exists for the frame corresponding to the frame of the input voice signal of the imitating singer (me), target attribute data is generated according to a predetermined Easy Synchronization Rule, as described in detail later, and the same processing is performed.

[0016] [1.4] Step S4

Next, the source attribute data of the imitating singer (me) and the target attribute data of the singer to be imitated are selected and combined as appropriate to obtain new attribute data (new attribute data = pitch, amplitude and spectral shape). When the device is used for plain voice conversion rather than imitation, the new attribute data can also be obtained by calculation from both the source attribute data and the target attribute data, for example as their arithmetic mean.
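The two ways of forming new attribute data in step S4 (selecting attributes from either singer, or averaging them) can be sketched as follows. This is only an illustration: the dictionary layout, the function names and the numeric values are assumptions, and the spectral shape is left out because it is a curve rather than a single number.

```python
def select_attributes(source, target, take_from_target):
    """Build new attribute data by picking each attribute from one singer.

    `take_from_target` is a hypothetical set of attribute names that
    should come from the target; everything else stays from the source.
    """
    return {name: (target[name] if name in take_from_target else source[name])
            for name in source}

def average_attributes(source, target):
    """Arithmetic mean of source and target attributes (plain conversion)."""
    return {name: 0.5 * (source[name] + target[name]) for name in source}

# Hypothetical per-frame scalar attributes.
me = {"pitch": 220.0, "amplitude": 0.25}
tar = {"pitch": 440.0, "amplitude": 0.75}

imitated = select_attributes(me, tar, {"pitch"})
blended = average_attributes(me, tar)
```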

[0017] [1.5] Step S5

Next, the sine wave component SINnew of the frame is obtained from the new attribute data. The amplitude, spectral shape and so on of SINnew are then modified to generate the sine wave component SINnew'.

[1.6] Step S6

The residual component Rme(f) of the input voice signal obtained in step S1 is modified on the basis of the target residual component Rtar(f) to obtain a new residual component Rnew(f).

[0018] [1.7] Step S7

One of the following is set as the optimum pitch of the comb filter (comb filter pitch: Pcomb): the pitch Pme-str of the sine wave component of the input voice signal obtained in step S1, the pitch Ptar-sta of the sine wave component of the singer to be imitated (Target), the pitch Pnew of the sine wave component SINnew generated in step S5, or the pitch Patt of the further modified sine wave component SINnew'. Basically, the pitch Patt is used.

[1.8] Step S8

A comb filter is then configured on the basis of the pitch Pcomb, and the residual component Rnew(f) obtained in step S6 is filtered with it. This removes the pitch component and its harmonic components from Rnew(f) and yields a new residual component Rnew'(f).
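Step S8 can be pictured, for a residual held on the frequency axis, as multiplying each frequency bin by a gain that drops to zero at the comb pitch and its multiples. The sketch below assumes a gain of |sin(pi * f / Pcomb)|; this particular gain curve, the function names and the bin spacing are illustrative assumptions, not the filter specified by the patent. Note that this gain also suppresses the bin at 0 Hz.

```python
import math

def comb_gain(freq_hz, pitch_hz):
    """Assumed comb response: zero at every integer multiple of pitch_hz."""
    return abs(math.sin(math.pi * freq_hz / pitch_hz))

def comb_filter_spectrum(residual_mag, bin_hz, pitch_hz):
    """Apply the comb gain to a magnitude spectrum sampled every bin_hz."""
    return [mag * comb_gain(k * bin_hz, pitch_hz)
            for k, mag in enumerate(residual_mag)]

# Hypothetical flat residual spectrum with 50 Hz bins, comb pitch 200 Hz.
r_new = [1.0] * 10
r_new_filtered = comb_filter_spectrum(r_new, 50.0, 200.0)
```

Bins 4 and 8 (200 Hz and 400 Hz) are attenuated to essentially zero, while bins midway between harmonics, such as bin 2 (100 Hz), keep their magnitude.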

[0019] [1.9] Step S9

The sine wave component SINnew' obtained in step S5 and the new residual component Rnew'(f) obtained in step S8 are combined, and an inverse FFT is then applied to obtain the converted voice signal.

[1.10] Summary

With the converted voice signal obtained by this processing, the reproduced singing voice of the imitating singer sounds as if it had been sung by another singer (the target singer). Furthermore, because the pitch component and its harmonic components are removed from the residual component Rnew(f), only the pitch component of the sine wave component is ultimately heard, so the naturalness of the voice is not impaired.

[0020] [2] Detailed configuration of the embodiment

Next, an embodiment of the invention is described with reference to the drawings. FIGS. 1 and 2 show the detailed configuration of the embodiment. In this embodiment, the voice conversion device (voice conversion method) according to the invention is applied to a karaoke machine and configured as a karaoke machine capable of imitation. In FIG. 1, the microphone 1 picks up the voice of the imitating singer (me) and outputs it as the input voice signal Sv to the input voice signal cutout unit 3.

[0021] In parallel, the analysis window generator 2 generates an analysis window AW (for example, a Hamming window) whose length is a fixed multiple (for example, 3.5 times) of the pitch period detected in the previous frame, and outputs it to the input voice signal cutout unit 3. In the initial state, or when the previous frame was unvoiced (including silence), an analysis window of a preset fixed length is output to the input voice signal cutout unit 3 as the analysis window AW. The input voice signal cutout unit 3 multiplies the input voice signal Sv by the analysis window AW, cuts Sv out frame by frame, and outputs the result to the fast Fourier transform unit 4 as the frame voice signal FSv. More specifically, the relationship between the input voice signal Sv and the frames is as shown in FIG. 3: each frame FL is set so as to partially overlap the preceding frame FL.
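The pitch-synchronous windowing described here can be sketched in Python as below. The sketch assumes the standard Hamming formula with coefficients 0.54 and 0.46; the fallback length of 1024 samples, the function names and the sample rate are hypothetical choices, since the specification only says that a preset fixed length is used when no previous pitch is available.

```python
import math

def hamming(length):
    """Standard Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def analysis_window(prev_pitch_hz, sample_rate, periods=3.5, default_len=1024):
    """Window spanning `periods` pitch periods of the previous frame.

    Falls back to `default_len` (an assumed value) in the initial state
    or after an unvoiced frame, signalled here by prev_pitch_hz = None.
    """
    if prev_pitch_hz is None or prev_pitch_hz <= 0:
        length = default_len
    else:
        length = max(2, round(periods * sample_rate / prev_pitch_hz))
    return hamming(length)

def cut_frame(signal, start, window):
    """Multiply the signal by the window; samples past the end count as 0."""
    return [(signal[start + i] if start + i < len(signal) else 0.0) * w
            for i, w in enumerate(window)]

aw = analysis_window(441.0, 44100)       # 3.5 periods of 100 samples
aw_default = analysis_window(None, 44100)
frame = cut_frame([1.0] * 400, 100, aw)  # last 50 samples are zero padded
```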

[0022] The frame voice signal FSv is analysed by the fast Fourier transform unit 4, and the peak detection unit 5 detects local peaks in the frequency spectrum output by the fast Fourier transform unit 4, as shown in FIG. 4. More specifically, the local peaks marked with × in the frequency spectrum of FIG. 4 are detected. Each local peak is expressed as a pair of a frequency value and an amplitude value. That is, as shown in FIG. 4, the local peaks of each frame are detected and expressed as (F0, A0), (F1, A1), (F2, A2), ..., (FN, AN).
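A minimal sketch of local peak detection on a magnitude spectrum follows. It simply reports bins that are larger than their neighbours as (frequency, amplitude) pairs; the function name, the optional noise floor and the example spectrum are assumptions, and a practical SMS analyser would normally refine each peak by interpolation, which is omitted here.

```python
def find_local_peaks(magnitudes, bin_hz, floor=0.0):
    """Return (frequency, amplitude) pairs for local maxima of the spectrum.

    A bin counts as a peak when it exceeds `floor`, is strictly larger
    than its left neighbour and at least as large as its right neighbour.
    """
    peaks = []
    for k in range(1, len(magnitudes) - 1):
        m = magnitudes[k]
        if m > floor and m > magnitudes[k - 1] and m >= magnitudes[k + 1]:
            peaks.append((k * bin_hz, m))
    return peaks

# Hypothetical spectrum with 100 Hz bins and two clear peaks.
spectrum = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 5.0, 2.0, 0.0]
local_peaks = find_local_peaks(spectrum, 100.0)
```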

[0023] As shown schematically in FIG. 3, the peaks of each frame are output as one set (hereinafter called a local peak set) to the unvoiced/voiced detection unit 6 and the peak linking unit 8. Based on the local peaks of each input frame, the unvoiced/voiced detection unit 6 detects an unvoiced frame from the magnitude of its high-frequency components (for sounds such as 't' and 'k') and outputs the unvoiced/voiced detection signal U/Vme to the pitch detection unit 7, the easy synchronization processing unit 22 and the crossfader 30. Alternatively, it detects an unvoiced frame from the number of zero crossings per unit time on the time axis (for sounds such as 's') and outputs the source unvoiced/voiced detection signal U/Vme to the pitch detection unit 7, the easy synchronization processing unit 22 and the crossfader 30.
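The two unvoiced cues mentioned here, high-frequency energy among the local peaks and the zero-crossing count, can be sketched as follows. The thresholds `zc_limit` and `hf_limit`, the 4 kHz split frequency and the function names are all hypothetical; the specification does not give concrete values or a decision rule.

```python
import math

def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))

def high_band_ratio(peaks, split_hz):
    """Share of total peak amplitude located at or above split_hz."""
    total = sum(amp for _, amp in peaks)
    if total == 0:
        return 0.0
    return sum(amp for freq, amp in peaks if freq >= split_hz) / total

def is_unvoiced(samples, peaks, zc_limit, hf_limit, split_hz=4000.0):
    """Flag a frame as unvoiced if either cue exceeds its assumed threshold."""
    return (zero_crossings(samples) > zc_limit
            or high_band_ratio(peaks, split_hz) > hf_limit)

# Hypothetical frames: a slow sine (voiced-like) and an alternating signal.
voiced_like = [math.sin(2 * math.pi * n / 50) for n in range(100)]
noisy_like = [1.0 if n % 2 == 0 else -1.0 for n in range(100)]
low_peaks = [(200.0, 1.0), (400.0, 0.5)]
```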

[0024] If the input frame is not detected as unvoiced, the unvoiced/voiced detection unit 6 passes the input local peak set unchanged to the pitch detection unit 7. Based on the input local peak set, the pitch detection unit 7 detects the pitch Pme of the frame to which the local peak set corresponds. A more specific method of detecting the frame pitch Pme is, for example, the one disclosed in Maher, R. C. and J. W. Beauchamp: "Fundamental Frequency Estimation of Musical Signals using a Two-Way Mismatch Procedure" (Journal of the Acoustical Society of America 95(4): 2254-2263).

[0025] Next, for the local peak set output from the peak detection unit 5, the peak linking unit 8 judges whether peaks in successive frames are linked, and peaks judged to be linked are joined by a linking process into a continuous data series. This linking process is described with reference to FIG. 5. Suppose that the local peaks shown in FIG. 5(A) were detected in the previous frame and that the local peaks shown in FIG. 5(B) are detected in the current frame.

[0026] In this case, the peak linking unit 8 checks whether a local peak corresponding to each of the local peaks (F0, A0), (F1, A1), (F2, A2), ..., (FN, AN) detected in the previous frame has also been detected in the current frame. Whether a corresponding local peak exists is judged by whether a local peak of the current frame is detected within a predetermined range centered on the frequency of the local peak detected in the previous frame. More specifically, in the example of FIG. 5, corresponding local peaks are detected for the local peaks (F0, A0), (F1, A1), (F2, A2), ..., but for the local peak (FK, AK) (see FIG. 5(A)) no corresponding local peak is detected (see FIG. 5(B)).

[0027] When the peak linking unit 8 detects corresponding local peaks, it joins them in time order and outputs them as one data series. When no corresponding local peak is detected, it substitutes data indicating that the frame has no corresponding local peak. FIG. 6 shows an example of how the local peak frequencies F0 and F1 change over several frames. The amplitudes A0, A1, A2, ... change in the same way. The data series output from the peak linking unit 8 consists of discrete values output once per frame interval.
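The linking rule described above (look for a current-frame peak inside a fixed range around each previous-frame frequency, and mark the track as missing otherwise) can be sketched like this. The greedy nearest-match strategy, the `None` marker for "no corresponding peak" and the range value are assumptions made for the illustration.

```python
def link_peaks(prev_peaks, cur_peaks, max_dev_hz):
    """For each previous peak, pick the nearest unused current peak.

    A current peak qualifies only if its frequency lies within
    max_dev_hz of the previous peak. Returns one entry per previous
    peak: the linked (frequency, amplitude) pair, or None if absent.
    """
    linked = []
    used = set()
    for f_prev, _amp in prev_peaks:
        best = None
        for i, (f_cur, _) in enumerate(cur_peaks):
            if i in used:
                continue
            dev = abs(f_cur - f_prev)
            if dev <= max_dev_hz and (
                    best is None or dev < abs(cur_peaks[best][0] - f_prev)):
                best = i
        if best is None:
            linked.append(None)   # no corresponding local peak in this frame
        else:
            used.add(best)
            linked.append(cur_peaks[best])
    return linked

# Hypothetical frames: the third track disappears in the current frame.
previous = [(200.0, 1.0), (400.0, 0.5), (900.0, 0.2)]
current = [(205.0, 0.9), (395.0, 0.6)]
tracks = link_peaks(previous, current, max_dev_hz=20.0)
```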

[0028] The peak values output from the peak linking unit 8 are hereinafter called the deterministic component. This means the part of the original signal (the voice signal Sv) that can be replaced deterministically by sine wave elements. Each of the replacing sine waves (strictly speaking, the frequency and amplitude that are the parameters of the sine wave) is called a sine wave component. Next, the interpolation synthesis unit 9 interpolates the deterministic component output from the peak linking unit 8 and synthesizes a waveform from the interpolated deterministic component by the so-called oscillator method. The interpolation interval corresponds to the sampling rate (for example, 44.1 kHz) of the final output signal output by the output unit 34 described later. The solid lines in FIG. 6 illustrate the result of interpolating the frequencies F0 and F1 of the sine wave components.

[0029] [2.1] Configuration of the interpolation synthesis unit

FIG. 7 shows the configuration of the interpolation synthesis unit 9. The interpolation synthesis unit 9 comprises a plurality of partial waveform generators 9a, each of which generates a sine wave with the frequency (F0, F1, ...) and amplitude of its assigned sine wave component. In this first embodiment, however, the sine wave components (F0, A0), (F1, A1), (F2, A2), ... change from moment to moment at the interpolation interval, so the waveform output by each partial waveform generator 9a follows those changes. That is, the peak linking unit 8 outputs the sine wave components (F0, A0), (F1, A1), (F2, A2), ... in sequence and each of them is interpolated, so each partial waveform generator 9a outputs a waveform whose frequency and amplitude vary within a given frequency range. The waveforms output from the partial waveform generators 9a are summed in the adder 9b. The output signal of the interpolation synthesis unit 9 is therefore the sine wave component synthesized signal SSS, which is the deterministic component extracted from the input voice signal Sv.
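The oscillator-bank idea, with one generator per sine wave component and parameters interpolated between frames, might be sketched as below. Linear interpolation of frequency and amplitude across the hop, the running-phase bookkeeping and all names are assumptions for illustration; the sketch also assumes every track is present in both frames (no missing peaks).

```python
import math

def synthesize_hop(prev_peaks, cur_peaks, hop, sample_rate, phases):
    """Additive synthesis of one hop of `hop` samples.

    prev_peaks / cur_peaks: lists of (frequency, amplitude), one pair per
    track, aligned by index. `phases` holds the running phase of each
    oscillator and is updated in place so hops join without clicks.
    """
    out = [0.0] * hop
    for i, ((f0, a0), (f1, a1)) in enumerate(zip(prev_peaks, cur_peaks)):
        phase = phases[i]
        for n in range(hop):
            t = n / hop
            freq = f0 + (f1 - f0) * t        # interpolated frequency
            amp = a0 + (a1 - a0) * t         # interpolated amplitude
            out[n] += amp * math.sin(phase)  # partial waveform generator
            phase += 2 * math.pi * freq / sample_rate
        phases[i] = phase
    return out                               # sum over all generators

# Hypothetical single track held at 441 Hz, amplitude 1, for one hop.
osc_phases = [0.0]
hop_out = synthesize_hop([(441.0, 1.0)], [(441.0, 1.0)], 100, 44100, osc_phases)
```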

[0030] [2.2] Operation of the residual component detection unit

Next, the residual component detection unit 10 generates the residual component signal SRD (a time waveform), which is the difference between the sine wave component synthesized signal SSS output from the interpolation synthesis unit 9 and the input voice signal Sv. This residual component signal SRD contains much of the unvoiced component of the voice, whereas the sine wave component synthesized signal SSS corresponds to the voiced component. To make the voice resemble that of the target (Target) singer, it is largely sufficient to process the voiced sounds, and there is little need to process the unvoiced sounds. In this embodiment, therefore, voice conversion is applied to the deterministic component, which corresponds to the voiced vowel component. More specifically, the residual component signal SRD is converted to a frequency waveform by the fast Fourier transform unit 11, and the resulting residual component signal (frequency waveform) is held in the residual component holding unit 12 as Rme(f).
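The residual computation amounts to a sample-by-sample subtraction followed by a transform to the frequency axis. The sketch below uses a naive discrete Fourier transform from the standard library purely to stay self-contained; a real implementation would use an FFT routine. The function names and the four-sample example are hypothetical.

```python
import cmath

def residual_signal(input_frame, sine_frame):
    """Time-domain residual: input minus the synthesized sine wave part."""
    return [x - s for x, s in zip(input_frame, sine_frame)]

def dft(samples):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, s in enumerate(samples))
            for k in range(n)]

# Hypothetical frame: a sine wave part plus a constant non-sinusoidal part.
sss = [0.0, 1.0, 0.0, -1.0]
sv = [s + 0.5 for s in sss]
srd = residual_signal(sv, sss)   # time waveform of the residual
rme = dft(srd)                   # frequency-axis residual
```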

[0031] [2.3] Operation of the average amplitude calculation unit

Meanwhile, as shown in FIG. 8(A), the N sine wave components (F0, A0), (F1, A1), (F2, A2), ..., (F(N-1), A(N-1)) output from the peak detection unit 5 via the peak linking unit 8 (hereinafter written collectively as Fn, An, with n = 0 to N-1) are held in the sine wave component holding unit 13. The amplitudes An are also input to the average amplitude calculation unit 14, which calculates the average amplitude Ame for each frame by the following equation.

Ame = Σ(An) / N

[0032] [2.4] Operation of the amplitude normalization unit

Next, the amplitude normalization unit 15 normalizes each amplitude An by the average amplitude Ame according to the following equation to obtain the normalized amplitude A'n.

A'n = An / Ame

[2.5] Operation of the spectral shape calculation unit

Then, as shown in FIG. 8(B), the spectral shape calculation unit 16 generates, as the spectral shape Sme(f), the envelope whose breakpoints are the sine wave components (Fn, A'n) given by the frequencies Fn and the normalized amplitudes A'n. The amplitude at a frequency between two breakpoints is calculated by interpolating between those two breakpoints, for example linearly. The interpolation method is not limited to linear interpolation.
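The average amplitude, the normalized amplitudes and the piecewise-linear spectral shape can be put together in a short sketch. Holding the first and last breakpoint values outside the breakpoint range is an assumption (the specification does not say what happens there), as are the function names and the example values.

```python
def average_amplitude(amps):
    """Ame = sum(An) / N."""
    return sum(amps) / len(amps)

def normalize_amplitudes(amps):
    """A'n = An / Ame."""
    ame = average_amplitude(amps)
    return [a / ame for a in amps]

def spectral_shape(freqs, norm_amps, f):
    """Evaluate the envelope Sme(f) by linear interpolation of breakpoints.

    freqs must be sorted in ascending order. Outside the breakpoint range
    the nearest breakpoint value is held (an assumed convention).
    """
    if f <= freqs[0]:
        return norm_amps[0]
    if f >= freqs[-1]:
        return norm_amps[-1]
    for i in range(len(freqs) - 1):
        f_lo, f_hi = freqs[i], freqs[i + 1]
        if f_lo <= f <= f_hi:
            a_lo, a_hi = norm_amps[i], norm_amps[i + 1]
            return a_lo + (a_hi - a_lo) * (f - f_lo) / (f_hi - f_lo)
    raise ValueError("freqs must be sorted in ascending order")

# Hypothetical frame with three sine wave components.
fn = [200.0, 400.0, 600.0]
an = [3.0, 2.0, 1.0]
a_norm = normalize_amplitudes(an)
s_300 = spectral_shape(fn, a_norm, 300.0)
```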

【００３３】［２．６］ピッチ正規化部の動作続いてピッチ正規化部１７においては、各周波数Ｆｎを
ピッチ検出部７において検出したピッチＰmeで正規化
し、正規化周波数Ｆ'ｎを求める。Ｆ'ｎ＝Ｆｎ／Ｐme これらの結果、元フレーム情報保持部１８は、入力音声
信号Ｓvに含まれる正弦波成分に対応する元属性データ
である平均アンプＡme、ピッチＰme、スペクトラル・シ
ェイプＳme(f)、正規化周波数Ｆ'ｎを保持することとな
る。なお、この場合において、正規化周波数Ｆ'ｎは、
倍音列の周波数の相対値を表しており、もし、フレーム
の倍音構造を完全倍音構造であるとして取り扱うなら
ば、保持する必要はない。この場合において、男声／女
声変換を行おうとしている場合には、この段階におい
て、男声→女声変換を行う場合には、ピッチをオクター
ブ上げ、女声→男声変換を行う場合にはピッチをオクタ
ーブ下げる男声／女声ピッチ制御処理を行うようにする
のが好ましい。
[2.6] Operation of the Pitch Normalization Unit
Subsequently, the pitch normalization unit 17 normalizes each frequency Fn by the pitch Pme detected by the pitch detection unit 7 to obtain the normalized frequency F'n.
F'n = Fn / Pme
As a result, the original frame information holding unit 18 holds the average amplitude Ame, the pitch Pme, the spectral shape Sme(f), and the normalized frequencies F'n, which are the original attribute data corresponding to the sine wave components contained in the input audio signal Sv. The normalized frequencies F'n represent the relative frequencies of the harmonic series and need not be held if the harmonic structure of the frame is treated as perfectly harmonic. If male/female voice conversion is intended, it is preferable to perform male/female pitch control at this stage: the pitch is raised by one octave for male-to-female conversion and lowered by one octave for female-to-male conversion.
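
A minimal sketch of the pitch normalization and the optional octave shift described above, assuming pitches and frequencies are expressed in Hz. The function names and the direction strings are hypothetical and introduced only for this example.

```python
def normalize_frequencies(freqs, p_me):
    """Return F'n = Fn / Pme, the harmonic frequencies relative to the pitch."""
    return [f / p_me for f in freqs]


def octave_shift(pitch, direction):
    """Male/female pitch control: one octave up or down (Hz assumed)."""
    if direction == "male_to_female":
        return pitch * 2.0  # one octave up doubles the frequency
    if direction == "female_to_male":
        return pitch / 2.0  # one octave down halves the frequency
    return pitch            # no conversion requested
```

For a perfectly harmonic frame the normalized frequencies come out as 1, 2, 3, ..., which is why the text says they need not be stored in that case.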

【００３４】つづいて、元フレーム情報保持部１８に保
持している元属性データのうち、平均アンプＡmeおよび
ピッチＰmeについては、さらに静的変化／ビブラート的
変化分離部１９により、フィルタリング処理などを行っ
て、静的変化成分とビブラート変化的成分とに分離して
保持する。なお、さらにビブラート変化的成分からより
高周波変化成分であるジッタ変化的成分を分離するよう
に構成することも可能である。より具体的には、平均ア
ンプＡmeを平均アンプ静的成分Ａme-sta及び平均アンプ
ビブラート的成分Ａme-vibとに分離して保持する。ま
た、ピッチＰmeをピッチ静的成分Ｐme-sta及びピッチビ
ブラート的成分Ｐme-vibとに分離して保持され、さら
に、ピッチ静的成分Ｐme-staは、ピッチ決定部４０へ供
給される。
Subsequently, of the original attribute data held in the original frame information holding unit 18, the average amplitude Ame and the pitch Pme are further processed by the static change / vibrato-like change separation unit 19, which applies filtering or similar processing to separate each into a static change component and a vibrato-like change component and holds them. It is also possible to further separate, from the vibrato-like change component, a jitter-like change component, which is a higher-frequency variation. More specifically, the average amplitude Ame is separated into an average amplitude static component Ame-sta and an average amplitude vibrato-like component Ame-vib, which are held. Likewise, the pitch Pme is separated into a pitch static component Pme-sta and a pitch vibrato-like component Pme-vib, which are held, and the pitch static component Pme-sta is also supplied to the pitch determination unit 40.
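
The specification says only that the separation uses "filtering or similar processing". One possible reading, shown here purely as an assumption, is a moving-average low-pass filter over the per-frame track: the smoothed track is taken as the static component and the remainder as the vibrato-like component. The function name and the window length are hypothetical.

```python
def separate_static_vibrato(track, window=5):
    """Split a per-frame track (pitch or amplitude) into static + vibrato parts.

    Assumption: a centered moving average stands in for the unspecified filter.
    By construction, static[i] + vibrato[i] reproduces track[i].
    """
    half = window // 2
    static = []
    for i in range(len(track)):
        lo = max(0, i - half)
        hi = min(len(track), i + half + 1)
        segment = track[lo:hi]
        static.append(sum(segment) / len(segment))
    vibrato = [x - s for x, s in zip(track, static)]
    return static, vibrato
```

Because the two parts sum back to the original track, this split is compatible with recombining a static component and a vibrato-like component additively later in the processing chain.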

【００３５】これらの結果、対応するフレームの元フレ
ーム情報データＩＮＦmeは、図８（Ｃ）に示すように、
入力音声信号Ｓvの正弦波成分に対応する元属性データ
である平均アンプ静的成分Ａme-sta、平均アンプビブラ
ート的成分Ａme-vib、ピッチ静的成分Ｐme-sta、ピッチ
ビブラート的成分Ｐme-vib、スペクトラル・シェイプＳ
me(f)、正規化周波数Ｆ'ｎ及び残差成分Ｒme（ｆ）の形
で保持されることとなる。
As a result, as shown in FIG. 8(C), the original frame information data INFme of the corresponding frame is held in the form of the average amplitude static component Ame-sta, the average amplitude vibrato-like component Ame-vib, the pitch static component Pme-sta, the pitch vibrato-like component Pme-vib, the spectral shape Sme(f), and the normalized frequencies F'n, which are the original attribute data corresponding to the sine wave components of the input audio signal Sv, together with the residual component Rme(f).

【００３６】一方、ものまねの対象（target）となる歌
唱者に対応するターゲット属性データから構成されるタ
ーゲットフレーム情報データＩＮＦtarは、予め分析さ
れてターゲットフレーム情報保持部２０を構成するハー
ドディスクなどに予め保持されている。この場合におい
て、ターゲットフレーム情報データＩＮＦtarのうち、
正弦波成分に対応するターゲット属性データとしては、
平均アンプ静的成分Ａtar-sta、平均アンプビブラート
的成分Ａtar-vib、ピッチ静的成分Ｐtar-sta、ピッチビ
ブラート的成分Ｐtar-vib、スペクトラル・シェイプＳt
ar(f)がある。また、ターゲットフレーム情報データＩ
ＮＦtarのうち、残差成分に対応するターゲット属性デ
ータとしては、残差成分Ｒtar(f)がある。これらのう
ち、ピッチ静的成分Ｐtar-staは、ピッチ決定部４０に
も供給される。
Meanwhile, the target frame information data INFtar, which consists of target attribute data for the singer to be imitated (the target), is analyzed in advance and stored on a hard disk or similar device constituting the target frame information holding unit 20. Within INFtar, the target attribute data corresponding to the sine wave components are the average amplitude static component Atar-sta, the average amplitude vibrato-like component Atar-vib, the pitch static component Ptar-sta, the pitch vibrato-like component Ptar-vib, and the spectral shape Star(f). The target attribute data corresponding to the residual component is the residual component Rtar(f). Of these, the pitch static component Ptar-sta is also supplied to the pitch determination unit 40.

【００３７】［２．７］キーコントロール／テンポチ
ェンジ部の動作次にキーコントロール／テンポチェンジ部２１は、シー
ケンサ３１からの同期信号ＳSYNCに基づいて、ターゲッ
トフレーム情報保持部２０から同期信号ＳSYNCに対応す
るフレームのターゲットフレーム情報ＩＮＦtarの読出
処理及び読み出したターゲットフレーム情報データＩＮ
Ｆtarを構成するターゲット属性データの補正処理を行
うとともに、読み出したターゲットフレーム情報ＩＮＦ
tarおよび当該フレームが無声であるか有声であるかを
表すターゲット無声／有声検出信号Ｕ／Ｖtarを出力す
る。
[2.7] Operation of the Key Control / Tempo Change Unit
Next, based on the synchronization signal SSYNC from the sequencer 31, the key control / tempo change unit 21 reads from the target frame information holding unit 20 the target frame information data INFtar of the frame corresponding to SSYNC, corrects the target attribute data constituting the read INFtar, and outputs the read INFtar together with a target unvoiced/voiced detection signal U/Vtar indicating whether the frame is unvoiced or voiced.

【００３８】より具体的には、キーコントロール／テン
ポチェンジ部２１の図示しないキーコントロールユニッ
トは、カラオケ装置のキーを基準より上げ下げした場
合、ターゲット属性データであるピッチ静的成分Ｐtar-
sta及びピッチビブラート的成分Ｐtar-vibについても、
同じだけ上げ下げする補正処理を行う。例えば、５０
［cent］だけキーを上げた場合には、ピッチ静的成分Ｐ
tar-sta及びピッチビブラート的成分Ｐtar-vibについて
も５０［cent］だけ上げなければならない。また、キー
コントロール／テンポチェンジ部２１の図示しないテン
ポチェンジユニットは、カラオケ装置のテンポを上げ下
げした場合には、変更後のテンポに相当するタイミング
で、ターゲットフレーム情報データＩＮＦtarの読み出
し処理を行う必要がある。
More specifically, when the key of the karaoke apparatus is raised or lowered from the reference, a key control unit (not shown) of the key control / tempo change unit 21 applies a correction that raises or lowers the pitch static component Ptar-sta and the pitch vibrato-like component Ptar-vib, which are target attribute data, by the same amount. For example, if the key is raised by 50 cents, Ptar-sta and Ptar-vib must also be raised by 50 cents. Similarly, when the tempo of the karaoke apparatus is raised or lowered, a tempo change unit (not shown) of the key control / tempo change unit 21 must read the target frame information data INFtar at timings corresponding to the changed tempo.
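
The key correction can be illustrated with the standard relation between cents and frequency ratio, where 1200 cents equal one octave. The sketch below assumes the pitch components are stored in Hz and are recombined additively, so scaling both by the same ratio shifts their sum by exactly the requested number of cents. The function names are hypothetical, and the specification does not state the unit in which the components are stored.

```python
def cents_to_ratio(cents):
    """Frequency ratio for a shift in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)


def apply_key_shift(p_sta, p_vib, cents):
    """Raise or lower both target pitch components by the same key shift.

    Assumption: both components are in Hz, so the shift is a multiplication.
    """
    ratio = cents_to_ratio(cents)
    return p_sta * ratio, p_vib * ratio
```

If the components were instead stored on a logarithmic scale, the correction would be an addition to the static component only. That alternative is not covered by this sketch.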

【００３９】この場合において、必要なフレームに対応
するタイミングに相当するターゲットフレーム情報デー
タＩＮＦtarが存在しない場合には、当該必要なフレー
ムのタイミングの前後のタイミングに存在する二つのフ
レームのターゲットフレーム情報データＩＮＦtarを読
み出し、これら二つのターゲットフレーム情報データＩ
ＮＦtarにより補間処理を行い、当該必要なタイミング
におけるフレームのターゲットフレーム情報データＩＮ
Ｆtar、ひいては、ターゲット属性データを生成する。
この場合において、ビブラート的成分（平均アンプビブ
ラート的成分Ａtar-vib及びピッチビブラート的成分Ｐt
ar-vib）に関しては、そのままでは、ビブラートの周期
自体が変化してしまい、不適当であるので、周期が変動
しないような補間処理を行う必要がある。又は、ターゲ
ット属性データとして、ビブラートの軌跡そのものを表
すデータではなく、ビブラート周期及びビブラート深さ
のパラメータを保持し、実際の軌跡を演算により求める
ようにすれば、この不具合を回避することができる。
In this case, if no target frame information data INFtar exists at the timing corresponding to a required frame, the target frame information data INFtar of the two frames lying immediately before and after that timing are read and interpolated to generate the target frame information data INFtar, and hence the target attribute data, of the frame at the required timing. For the vibrato-like components (the average amplitude vibrato-like component Atar-vib and the pitch vibrato-like component Ptar-vib), naive interpolation is inappropriate because it changes the vibrato period itself, so interpolation that leaves the period unchanged is required. Alternatively, this problem can be avoided by holding the vibrato period and vibrato depth as target attribute data, instead of data representing the vibrato trajectory itself, and computing the actual trajectory from those parameters.
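
The two ideas in this paragraph can be sketched as follows, under stated assumptions. Static attributes are interpolated linearly between the two neighbouring stored frames, and the vibrato-like component is regenerated from a stored rate and depth instead of being interpolated, so a tempo change does not stretch its period. The sinusoidal waveform, the parameter names, and the units (seconds, Hz) are assumptions of this sketch; the specification mentions only "vibrato period and vibrato depth".

```python
import math


def interpolate_static(t, t0, v0, t1, v1):
    """Linearly interpolate a static attribute between frames at t0 and t1."""
    w = (t - t0) / (t1 - t0)
    return (1.0 - w) * v0 + w * v1


def vibrato_value(t, rate_hz, depth, phase=0.0):
    """Vibrato trajectory computed from parameters (sinusoid assumed)."""
    return depth * math.sin(2.0 * math.pi * rate_hz * t + phase)


def target_pitch_at(t, t0, p0, t1, p1, rate_hz, depth):
    """Static part interpolated, vibrato part regenerated at true time t."""
    return interpolate_static(t, t0, p0, t1, p1) + vibrato_value(t, rate_hz, depth)
```

Evaluating `vibrato_value` at the playback time keeps the vibrato rate fixed regardless of how the frame timings are resampled.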

【００４０】［２．８］イージーシンクロナイゼーシ
ョン処理部の動作次にイージーシンクロナイゼーション処理部２２は、も
のまねをしようとする歌唱者のフレーム（以下、元フレ
ームという。）に元フレーム情報データＩＮＦmeが存在
するにもかかわらず、対応するものまねの対象となる歌
唱者のフレーム（以下、ターゲットフレームという。）
にターゲットフレーム情報データＩＮＦtarが存在しな
い場合には、当該ターゲットフレームの前後方向に存在
するフレームのターゲットフレーム情報データＩＮＦta
rを当該ターゲットフレームのターゲットフレーム情報
データＩＮＦtarとするイージーシンクロナイゼーショ
ン処理を行う。
[2.8] Operation of the Easy Synchronization Processing Unit
Next, when original frame information data INFme exists for a frame of the singer doing the imitation (hereinafter, the original frame) but no target frame information data INFtar exists for the corresponding frame of the singer being imitated (hereinafter, the target frame), the easy synchronization processing unit 22 performs easy synchronization processing, in which the target frame information data INFtar of a frame lying before or after the target frame is used as the target frame information data INFtar of that target frame.

【００４１】そして、イージーシンクロナイゼーション
処理部２２は、後述する置換済ターゲットフレーム情報
データＩＮＦtar-syncに含まれるターゲット属性データ
のうち正弦波成分に関するターゲット属性データ（平均
アンプ静的成分Ａtar-sync-sta、平均アンプビブラート
的成分Ａtar-sync-vib、ピッチ静的成分Ｐtar-sync-st
a、ピッチビブラート的成分Ｐtar-sync-vib及びスペク
トラル・シェイプＳtar-sync(f)）を正弦波成分属性デ
ータ選択部２３に出力する。また、イージーシンクロナ
イゼーション処理部２２は、後述する置換済ターゲット
フレーム情報データＩＮＦtar-syncに含まれるターゲッ
ト属性データのうち残差成分に関するターゲット属性デ
ータ（残差成分Ｒtar-sync(f)）を残差成分選択部２５
に出力する。
The easy synchronization processing unit 22 then outputs to the sine wave component attribute data selection unit 23 the target attribute data relating to the sine wave components (the average amplitude static component Atar-sync-sta, the average amplitude vibrato-like component Atar-sync-vib, the pitch static component Ptar-sync-sta, the pitch vibrato-like component Ptar-sync-vib, and the spectral shape Star-sync(f)) from among the target attribute data contained in the replaced target frame information data INFtar-sync, described later. It also outputs to the residual component selection unit 25 the target attribute data relating to the residual component (the residual component Rtar-sync(f)) from among the target attribute data contained in INFtar-sync.

【００４２】このイージーシンクロナイゼーション処理
部２２における処理においても、ビブラート的成分（平
均アンプビブラート的成分Ａtar-vib及びピッチビブラ
ート的成分Ｐtar-vib）に関しては、そのままでは、ビ
ブラートの周期自体が変化してしまい、不適当であるの
で、周期が変動しないような補間処理を行う必要があ
る。又は、ターゲット属性データとして、ビブラートの
軌跡そのものを表すデータではなく、ビブラート周期及
びビブラート深さのパラメータを保持し、実際の軌跡を
演算により求めるようにすれば、この不具合を回避する
ことができる。
In the processing by the easy synchronization processing unit 22 as well, leaving the vibrato-like components (the average amplitude vibrato-like component Atar-vib and the pitch vibrato-like component Ptar-vib) as they are changes the vibrato period itself, which is inappropriate, so interpolation that leaves the period unchanged is required. Alternatively, this problem can be avoided by holding the vibrato period and vibrato depth as target attribute data, instead of data representing the vibrato trajectory itself, and computing the actual trajectory from those parameters.

【００４３】［２．８．１］イージーシンクロナイゼ
ーション処理の詳細ここで、図９及び図１０を参照してイージーシンクロナ
イゼーション処理について詳細に説明する。図９は、イ
ージーシンクロナイゼーション処理のタイミングチャー
トであり、図１０はイージーシンクロナイゼーション処
理フローチャートである。まず、イージーシンクロナイ
ゼーション処理部２２は、シンクロナイゼーション処理
の方法を表すシンクロナイゼーションモード＝"０"とす
る（ステップＳ１１）。このシンクロナイゼーションモ
ード＝"０"は、元フレームに対応するターゲットフレー
ムにターゲットフレーム情報データＩＮＦtarが存在す
る通常処理の場合に相当する。
[2.8.1] Details of the Easy Synchronization Processing
The easy synchronization processing will now be described in detail with reference to FIGS. 9 and 10. FIG. 9 is a timing chart of the easy synchronization processing, and FIG. 10 is a flowchart of it. First, the easy synchronization processing unit 22 sets the synchronization mode, which indicates the method of synchronization processing, to "0" (step S11). Synchronization mode "0" corresponds to normal processing, in which target frame information data INFtar exists for the target frame corresponding to the original frame.

【００４４】そしてあるタイミングｔにおける元無声／
有声検出信号Ｕ／Ｖme(t)が無声（Ｕ）から有声（Ｖ）
に変化したか否かを判別する（ステップＳ１２）。例え
ば、図９に示すように、タイミングｔ＝ｔ1において
は、元無声／有声検出信号Ｕ／Ｖme(t)が無声（Ｕ）か
ら有声（Ｖ）に変化している。ステップＳ１２の判別に
おいて、元無声／有声検出信号Ｕ／Ｖme(t)が無声
（Ｕ）から有声（Ｖ）に変化している場合には（ステッ
プＳ１２；Ｙｅｓ）、タイミングｔの前回のタイミング
ｔ-1における元無声／有声検出信号Ｕ／Ｖme(t-1)が無
声（Ｕ）かつターゲット無声／有声検出信号Ｕ／Ｖtar
(t-1)が無声（Ｕ）であるか否かを判別する（ステップ
Ｓ１８）。
Then, it is determined whether the original unvoiced/voiced detection signal U/Vme(t) at a given timing t has changed from unvoiced (U) to voiced (V) (step S12). For example, as shown in FIG. 9, at timing t = t1 the signal U/Vme(t) changes from unvoiced (U) to voiced (V). If step S12 determines that U/Vme(t) has changed from unvoiced (U) to voiced (V) (step S12; Yes), it is then determined whether, at the preceding timing t-1, the original unvoiced/voiced detection signal U/Vme(t-1) was unvoiced (U) and the target unvoiced/voiced detection signal U/Vtar(t-1) was also unvoiced (U) (step S18).

【００４５】例えば、図９に示すように、タイミングｔ
＝ｔ0（＝ｔ1-1）においては、元無声／有声検出信号Ｕ
／Ｖme(t-1)が無声（Ｕ）かつターゲット無声／有声検
出信号Ｕ／Ｖtar(t-1)が無声（Ｕ）となっている。ステ
ップＳ１８の判別において、元無声／有声検出信号Ｕ／
Ｖme(t-1)が無声（Ｕ）かつターゲット無声／有声検出
信号Ｕ／Ｖtar(t-1)が無声（Ｕ）となっている場合には
（ステップＳ１８；Ｙｅｓ）、当該ターゲットフレーム
には、ターゲットフレーム情報データＩＮＦtarが存在
しないので、シンクロナイゼーションモード＝"１"と
し、置換用のターゲットフレーム情報データＩＮＦhold
を当該ターゲットフレームの後方向（Backward）に存在
するフレームのターゲットフレーム情報とする。
For example, as shown in FIG. 9, at timing t = t0 (= t1-1), the original unvoiced/voiced detection signal U/Vme(t-1) is unvoiced (U) and the target unvoiced/voiced detection signal U/Vtar(t-1) is unvoiced (U). If step S18 determines that U/Vme(t-1) is unvoiced (U) and U/Vtar(t-1) is unvoiced (U) (step S18; Yes), no target frame information data INFtar exists for the target frame, so the synchronization mode is set to "1" and the replacement target frame information data INFhold is set to the target frame information of a frame lying in the backward direction of the target frame.

【００４６】例えば、図９に示すように、タイミングｔ
＝ｔ1〜ｔ2のターゲットフレームには、ターゲットフレ
ーム情報データＩＮＦtarが存在しないので、シンクロ
ナイゼーションモード＝"１"とし、置換用ターゲットフ
レーム情報データＩＮＦholdを当該ターゲットフレーム
の後方向に存在するフレーム（すなわち、タイミングｔ
＝ｔ2〜ｔ3に存在するフレーム）のターゲットフレーム
情報データbackwardとする。そして、処理をステップＳ
１５に移行し、シンクロナイゼーションモード＝"０"で
あるか否かを判別する（ステップＳ１５）。
For example, as shown in FIG. 9, no target frame information data INFtar exists for the target frames at timings t = t1 to t2, so the synchronization mode is set to "1" and the replacement target frame information data INFhold is set to the target frame information data (backward) of a frame lying in the backward direction of the target frame, that is, a frame at timings t = t2 to t3. Processing then proceeds to step S15, where it is determined whether the synchronization mode is "0" (step S15).

【００４７】ステップＳ１５の判別において、シンクロ
ナイゼーションモード＝"０"である場合には、タイミン
グｔにおける元フレームに対応するターゲットフレーム
にターゲットフレーム情報データＩＮＦtar(t)が存在す
る場合、すなわち、通常処理であるので、置換済ターゲ
ットフレーム情報データＩＮＦtar-syncをターゲットフ
レーム情報データＩＮＦtar(t)とする。ＩＮＦtar-sync＝ＩＮＦtar(t) 例えば、図９に示すようにタイミングｔ＝ｔ2〜ｔ3のタ
ーゲットフレームには、ターゲットフレーム情報データ
ＩＮＦtarが存在するので、ＩＮＦtar-sync＝ＩＮＦtar(t) とする。
If step S15 determines that the synchronization mode is "0", target frame information data INFtar(t) exists for the target frame corresponding to the original frame at timing t, that is, normal processing applies, so the replaced target frame information data INFtar-sync is set to the target frame information data INFtar(t).
INFtar-sync = INFtar(t)
For example, as shown in FIG. 9, target frame information data INFtar exists for the target frames at timings t = t2 to t3, so INFtar-sync = INFtar(t).

【００４８】この場合において、以降の処理に用いられ
る置換済ターゲットフレーム情報データＩＮＦtar-sync
に含まれるターゲット属性データ（平均アンプ静的成分
Ａtar-sync-sta、平均アンプビブラート的成分Ａtar-sy
nc-vib、ピッチ静的成分Ｐtar-sync-sta、ピッチビブラ
ート的成分Ｐtar-sync-vib、スペクトラル・シェイプＳ
tar-sync(f)及び残差成分Ｒtar-sync(f)）は実質的に
は、以下の内容となる（ステップＳ１６）。Ａtar-sync-sta＝Ａtar-sta Ａtar-sync-vib＝Ａtar-vib Ｐtar-sync-sta＝Ｐtar-sta Ｐtar-sync-vib＝Ｐtar-vib Ｓtar-sync(f)＝Ｓtar(f) Rtar-sync(f)＝Ｒtar(f)
In this case, the target attribute data contained in the replaced target frame information data INFtar-sync used in subsequent processing (the average amplitude static component Atar-sync-sta, the average amplitude vibrato-like component Atar-sync-vib, the pitch static component Ptar-sync-sta, the pitch vibrato-like component Ptar-sync-vib, the spectral shape Star-sync(f), and the residual component Rtar-sync(f)) are in effect as follows (step S16).
Atar-sync-sta = Atar-sta
Atar-sync-vib = Atar-vib
Ptar-sync-sta = Ptar-sta
Ptar-sync-vib = Ptar-vib
Star-sync(f) = Star(f)
Rtar-sync(f) = Rtar(f)

【００４９】ステップＳ１５の判別において、シンクロ
ナイゼーションモード＝"１"またはシンクロナイゼーシ
ョンモード＝"２"である場合には、タイミングｔにおけ
る元フレームに対応するターゲットフレームにターゲッ
トフレーム情報データＩＮＦtar(t)が存在しない場合で
あるので、置換済ターゲットフレーム情報データＩＮＦ
tar-syncを置換用ターゲットフレーム情報データＩＮＦ
holdとする。ＩＮＦtar-sync＝ＩＮＦhold 例えば、図９に示すように、タイミングｔ＝ｔ1〜ｔ2の
ターゲットフレームには、ターゲットフレーム情報デー
タＩＮＦtarが存在せず、シンクロナイゼーションモー
ド＝"１"となるが、タイミングｔ＝ｔ2〜ｔ3のターゲッ
トフレームには、ターゲットフレーム情報データＩＮＦ
tarが存在するので、置換済ターゲットフレーム情報デ
ータＩＮＦtar-syncをタイミングｔ＝ｔ2〜ｔ3のターゲ
ットフレームのターゲットフレーム情報データである置
換用ターゲットフレーム情報データＩＮＦholdとする処
理Ｐ１を行い、以降の処理に用いられる置換済ターゲッ
トフレーム情報データＩＮＦtar-syncに含まれるターゲ
ット属性データは、平均アンプ静的成分Ａtar-sync-st
a、平均アンプビブラート的成分Ａtar-sync-vib、ピッ
チ静的成分Ｐtar-sync-sta、ピッチビブラート的成分Ｐ
tar-sync-vib、スペクトラル・シェイプＳtar-sync(f)
及び残差成分Ｒtar-sync(f)となる（ステップＳ１
６）。
If step S15 determines that the synchronization mode is "1" or "2", no target frame information data INFtar(t) exists for the target frame corresponding to the original frame at timing t, so the replaced target frame information data INFtar-sync is set to the replacement target frame information data INFhold.
INFtar-sync = INFhold
For example, as shown in FIG. 9, no target frame information data INFtar exists for the target frames at timings t = t1 to t2, so the synchronization mode is "1", whereas INFtar does exist for the target frames at timings t = t2 to t3. Processing P1 is therefore performed, which sets INFtar-sync to INFhold, the target frame information data of the target frames at timings t = t2 to t3. The target attribute data contained in the INFtar-sync used in subsequent processing are then the average amplitude static component Atar-sync-sta, the average amplitude vibrato-like component Atar-sync-vib, the pitch static component Ptar-sync-sta, the pitch vibrato-like component Ptar-sync-vib, the spectral shape Star-sync(f), and the residual component Rtar-sync(f) (step S16).

【００５０】また、図９に示すように、タイミングｔ＝
ｔ3〜ｔ4のターゲットフレームには、ターゲットフレー
ム情報データＩＮＦtarが存在せず、シンクロナイゼー
ションモード＝"２"となるが、タイミングｔ＝ｔ2〜ｔ3
のターゲットフレームには、ターゲットフレーム情報デ
ータＩＮＦtarが存在するので、置換済ターゲットフレ
ーム情報データＩＮＦtar-syncをタイミングｔ＝ｔ2〜
ｔ3のターゲットフレームのターゲットフレーム情報デ
ータである置換用ターゲットフレーム情報データＩＮＦ
holdとする処理Ｐ２を行い、以降の処理に用いられる置
換済ターゲットフレーム情報データＩＮＦtar-syncに含
まれるターゲット属性データは、平均アンプ静的成分Ａ
tar-sync-sta、平均アンプビブラート的成分Ａtar-sync
-vib、ピッチ静的成分Ｐtar-sync-sta、ピッチビブラー
ト的成分Ｐtar-sync-vib、スペクトラル・シェイプＳta
r-sync(f)及び残差成分Ｒtar-sync(f)となる（ステップ
Ｓ１６）。
Likewise, as shown in FIG. 9, no target frame information data INFtar exists for the target frames at timings t = t3 to t4, so the synchronization mode is "2", whereas INFtar does exist for the target frames at timings t = t2 to t3. Processing P2 is therefore performed, which sets INFtar-sync to INFhold, the target frame information data of the target frames at timings t = t2 to t3. The target attribute data contained in the INFtar-sync used in subsequent processing are then the average amplitude static component Atar-sync-sta, the average amplitude vibrato-like component Atar-sync-vib, the pitch static component Ptar-sync-sta, the pitch vibrato-like component Ptar-sync-vib, the spectral shape Star-sync(f), and the residual component Rtar-sync(f) (step S16).

【００５１】ステップＳ１２の判別において、元無声／
有声検出信号Ｕ／Ｖme(t)が無声（Ｕ）から有声（Ｖ）
に変化していない場合には（ステップＳ１２；Ｎｏ）、
ターゲット無声／有声検出信号Ｕ／Ｖtar(t)が有声
（Ｖ）から無声（Ｕ）に変化しているか否かを判別する
（ステップＳ１３）。ステップＳ１３の判別において、
ターゲット無声／有声検出信号Ｕ／Ｖtar(t)が有声
（Ｖ）から無声（Ｕ）に変化している場合には（ステッ
プＳ１３；Ｙｅｓ）、タイミングｔの前回のタイミング
ｔ-1における元無声／有声検出信号Ｕ／Ｖme(t-1)が有
声（Ｖ）かつターゲット無声／有声検出信号Ｕ／Ｖtar
(t-1)が有声（Ｖ）であるか否かを判別する（ステップ
Ｓ１９）。
If step S12 determines that the original unvoiced/voiced detection signal U/Vme(t) has not changed from unvoiced (U) to voiced (V) (step S12; No), it is determined whether the target unvoiced/voiced detection signal U/Vtar(t) has changed from voiced (V) to unvoiced (U) (step S13). If step S13 determines that U/Vtar(t) has changed from voiced (V) to unvoiced (U) (step S13; Yes), it is then determined whether, at the preceding timing t-1, the original unvoiced/voiced detection signal U/Vme(t-1) was voiced (V) and the target unvoiced/voiced detection signal U/Vtar(t-1) was also voiced (V) (step S19).

【００５２】例えば、図９に示すように、タイミングｔ
3においてターゲット無声／有声検出信号Ｕ／Ｖtar(t)
が有声（Ｖ）から無声（Ｕ）に変化し、タイミングｔ-1
＝ｔ2〜ｔ3においては、元無声／有声検出信号Ｕ／Ｖme
(t-1)が有声（Ｖ）かつターゲット無声／有声検出信号
Ｕ／Ｖtar(t-1)が有声（Ｖ）となっている。ステップＳ
１９の判別において、元無声／有声検出信号Ｕ／Ｖme(t
-1)が有声（Ｖ）かつターゲット無声／有声検出信号Ｕ
／Ｖtar(t-1)が有声（Ｖ）となっている場合には（ステ
ップＳ１９；Ｙｅｓ）、当該ターゲットフレームには、
ターゲットフレーム情報データＩＮＦtarが存在しない
ので、シンクロナイゼーションモード＝"２"とし、置換
用のターゲットフレーム情報データＩＮＦholdを当該タ
ーゲットフレームの前方向（forward）に存在するフレ
ームのターゲットフレーム情報とする。
For example, as shown in FIG. 9, at timing t3 the target unvoiced/voiced detection signal U/Vtar(t) changes from voiced (V) to unvoiced (U), and at the preceding timing t-1 (within t2 to t3) the original unvoiced/voiced detection signal U/Vme(t-1) is voiced (V) and the target unvoiced/voiced detection signal U/Vtar(t-1) is voiced (V). If step S19 determines that U/Vme(t-1) is voiced (V) and U/Vtar(t-1) is voiced (V) (step S19; Yes), no target frame information data INFtar exists for the target frame, so the synchronization mode is set to "2" and the replacement target frame information data INFhold is set to the target frame information of a frame lying in the forward direction of the target frame.

【００５３】例えば、図９に示すように、タイミングｔ
＝ｔ3〜ｔ4のターゲットフレームには、ターゲットフレ
ーム情報データＩＮＦtarが存在しないので、シンクロ
ナイゼーションモード＝"２"とし、置換用ターゲットフ
レーム情報データＩＮＦholdを当該ターゲットフレーム
の前方向に存在するフレーム（すなわち、タイミングｔ
＝ｔ2〜ｔ3に存在するフレーム）のターゲットフレーム
情報データforwardとする。そして、処理をステップＳ
１５に移行し、シンクロナイゼーションモード＝"０"で
あるか否かを判別して（ステップＳ１５）、以下、同様
の処理を行う。ステップＳ１３の判別において、ターゲ
ット無声／有声検出信号Ｕ／Ｖtar(t)が有声（Ｖ）から
無声（Ｕ）に変化していない場合には（ステップＳ１
３；Ｎｏ）、タイミングｔにおける元無声／有声検出信
号Ｕ／Ｖme(t)が有声（Ｖ）から無声（Ｕ）に変化し、
あるいは、ターゲット無声／有声検出信号Ｕ／Ｖtar(t)
が無声（Ｕ）から有声（Ｖ）に変化しているか否かを判
別する（ステップＳ１４）。
For example, as shown in FIG. 9, no target frame information data INFtar exists for the target frames at timings t = t3 to t4, so the synchronization mode is set to "2" and the replacement target frame information data INFhold is set to the target frame information data (forward) of a frame lying in the forward direction of the target frame, that is, a frame at timings t = t2 to t3. Processing then proceeds to step S15, where it is determined whether the synchronization mode is "0" (step S15), and the same processing as above follows. If step S13 determines that the target unvoiced/voiced detection signal U/Vtar(t) has not changed from voiced (V) to unvoiced (U) (step S13; No), it is determined whether the original unvoiced/voiced detection signal U/Vme(t) at timing t has changed from voiced (V) to unvoiced (U), or the target unvoiced/voiced detection signal U/Vtar(t) has changed from unvoiced (U) to voiced (V) (step S14).

【００５４】ステップＳ１４の判別において、タイミン
グｔにおける元無声／有声検出信号Ｕ／Ｖme(t)が有声
（Ｖ）から無声（Ｕ）に変化し、かつ、ターゲット無声
／有声検出信号Ｕ／Ｖtar(t)が無声（Ｕ）から有声
（Ｖ）に変化している場合には（ステップＳ１４；Ｙｅ
ｓ）、シンクロナイゼーションモード＝"０"とし、置換
用ターゲットフレーム情報データＩＮＦholdを初期化
（clear）し、処理をステップＳ１５に移行して、以
下、同様の処理を行う。ステップＳ１４の判別におい
て、タイミングｔにおける元無声／有声検出信号Ｕ／Ｖ
me(t)が有声（Ｖ）から無声（Ｕ）に変化せず、かつ、
ターゲット無声／有声検出信号Ｕ／Ｖtar(t)が無声
（Ｕ）から有声（Ｖ）に変化していない場合には（ステ
ップＳ１４；Ｎｏ）、そのまま処理をステップＳ１５に
移行し、以下同様の処理を行う。
If step S14 determines that the original unvoiced/voiced detection signal U/Vme(t) at timing t has changed from voiced (V) to unvoiced (U) and the target unvoiced/voiced detection signal U/Vtar(t) has changed from unvoiced (U) to voiced (V) (step S14; Yes), the synchronization mode is set to "0", the replacement target frame information data INFhold is initialized (cleared), and processing proceeds to step S15, followed by the same processing as above. If step S14 determines that U/Vme(t) has not changed from voiced (V) to unvoiced (U) and U/Vtar(t) has not changed from unvoiced (U) to voiced (V) (step S14; No), processing proceeds directly to step S15, followed by the same processing as above.
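
The flow of steps S11 to S19 can be read as a small state machine over frames. The following Python sketch is one interpretation, not the implementation of FIG. 10, and rests on several assumptions: the voiced flags are given as boolean lists, frames before the first one are treated as unvoiced, "backward" is taken to mean the next voiced target frame in time and "forward" the immediately preceding target frame (as in the FIG. 9 example), and the step S14 condition is treated as "either transition", which is the wording of paragraph [0053] and is consistent with the timing at t2 in FIG. 9. Two extra guards, marked in the comments, avoid holding a stale frame when both signals switch in the same frame. The function name is hypothetical.

```python
from typing import Any, List, Optional, Sequence


def easy_synchronize(me_voiced: Sequence[bool],
                     tar_voiced: Sequence[bool],
                     tar_info: Sequence[Any]) -> List[Optional[Any]]:
    """Return INFtar-sync for every frame (None where nothing is available)."""
    mode = 0     # step S11: synchronization mode "0" (normal processing)
    hold = None  # INFhold
    out = []
    for t in range(len(me_voiced)):
        prev_me = me_voiced[t - 1] if t > 0 else False
        prev_tar = tar_voiced[t - 1] if t > 0 else False
        if me_voiced[t] and not prev_me:                 # S12: source U -> V
            # S18: target was unvoiced at t-1. Assumed guard: still unvoiced at t.
            if not prev_tar and not tar_voiced[t]:
                mode = 1
                hold = next((tar_info[k] for k in range(t, len(tar_voiced))
                             if tar_voiced[k]), None)    # next voiced target frame
        elif prev_tar and not tar_voiced[t]:             # S13: target V -> U
            # S19: both voiced at t-1. Assumed guard: source still voiced at t.
            if prev_me and me_voiced[t]:
                mode = 2
                hold = tar_info[t - 1]                   # last voiced target frame
        elif (prev_me and not me_voiced[t]) or (tar_voiced[t] and not prev_tar):
            mode = 0                                     # S14: back to normal
            hold = None
        out.append(tar_info[t] if mode == 0 else hold)   # S15 / S16
    return out
```

With a source that starts singing two frames before the target and stops two frames after it, the held frame fills both gaps, which corresponds to processing P1 and P2 in FIG. 9.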

【００５５】［２．９］正弦波成分属性データ選択部
の動作続いて、正弦波成分属性データ選択部２３は、イージー
シンクロナイゼーション処理部２２から入力された置換
済ターゲットフレーム情報データＩＮＦtar-syncに含ま
れるターゲット属性データのうち正弦波成分に関するタ
ーゲット属性データ（平均アンプ静的成分Ａtar-sync-s
ta、平均アンプビブラート的成分Ａtar-sync-vib、ピッ
チ静的成分Ｐtar-sync-sta、ピッチビブラート的成分Ｐ
tar-sync-vib及びスペクトラル・シェイプＳtar-sync
(f)）及びコントローラ２９から入力される正弦波成分
属性データ選択情報に基づいて、新しい正弦波成分属性
データである新規アンプ成分Ａnew、新規ピッチ成分Ｐn
ew及び新規スペクトラル・シェイプＳnew(f)を生成す
る。
[2.9] Operation of the Sine Wave Component Attribute Data Selection Unit
Subsequently, the sine wave component attribute data selection unit 23 generates new sine wave component attribute data, namely a new amplitude component Anew, a new pitch component Pnew, and a new spectral shape Snew(f). It does so based on the target attribute data relating to the sine wave components (the average amplitude static component Atar-sync-sta, the average amplitude vibrato-like component Atar-sync-vib, the pitch static component Ptar-sync-sta, the pitch vibrato-like component Ptar-sync-vib, and the spectral shape Star-sync(f)) contained in the replaced target frame information data INFtar-sync input from the easy synchronization processing unit 22, and on sine wave component attribute data selection information input from the controller 29.

【００５６】すなわち、新規アンプ成分Ａnewについて
は、次式により生成する。Ａnew＝Ａ*-sta＋Ａ*-vib（ただし、*は、me又はtar-sy
nc）より具体的には、図８（Ｄ）に示すように、新規アンプ
成分Ａnewを元属性データの平均アンプ静的成分Ａme-st
aあるいはターゲット属性データの平均アンプ静的成分
Ａtar-sync-staのいずれか一方及び元属性データの平均
アンプビブラート的成分Ａme-vibあるいはターゲット属
性データの平均アンプビブラート的成分Ａtar-sync-vib
のいずれか一方の組み合わせとして生成する。また、新
規ピッチ成分Ｐnewについては、次式により生成する。Ｐnew＝Ｐ*-sta＋Ｐ*-vib（ただし、*は、me又はtar-sy
nc）
That is, the new amplitude component Anew is generated by the following equation.
Anew = A*-sta + A*-vib (where * is me or tar-sync)
More specifically, as shown in FIG. 8(D), the new amplitude component Anew is generated as the combination of either the average amplitude static component Ame-sta of the original attribute data or the average amplitude static component Atar-sync-sta of the target attribute data, and either the average amplitude vibrato-like component Ame-vib of the original attribute data or the average amplitude vibrato-like component Atar-sync-vib of the target attribute data. The new pitch component Pnew is generated by the following equation.
Pnew = P*-sta + P*-vib (where * is me or tar-sync)

【００５７】より具体的には、図８（Ｄ）に示すよう
に、新規ピッチ成分Ｐnewを元属性データのピッチ静的
成分Ｐme-staあるいはターゲット属性データのピッチ静
的成分Ｐtar-sync-staのいずれか一方及び元属性データ
のピッチビブラート的成分Ｐme-vibあるいはターゲット
属性データのピッチビブラート的成分Ｐtar-sync-vibの
いずれか一方の組み合わせとして生成する。また、新規
スペクトラル・シェイプＳnew(f)については、次式によ
り生成する。Ｓnew(f)＝Ｓ*(f)（ただし、*は、me又はtar-sync）
More specifically, as shown in FIG. 8(D), the new pitch component Pnew is generated as the combination of either the pitch static component Pme-sta of the original attribute data or the pitch static component Ptar-sync-sta of the target attribute data, and either the pitch vibrato-like component Pme-vib of the original attribute data or the pitch vibrato-like component Ptar-sync-vib of the target attribute data. The new spectral shape Snew(f) is generated by the following equation.
Snew(f) = S*(f) (where * is me or tar-sync)
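
The selection rule for Anew and Pnew, a static part from one singer plus a vibrato-like part from either singer, can be written compactly. This is a sketch only: the function name and the two boolean flags stand in for the selection information from the controller 29, whose actual format the specification does not describe.

```python
def select_component(me_sta, me_vib, tar_sta, tar_vib,
                     use_target_static, use_target_vibrato):
    """Return X*-sta + X*-vib, choosing each part from me or tar-sync."""
    sta = tar_sta if use_target_static else me_sta
    vib = tar_vib if use_target_vibrato else me_vib
    return sta + vib


def select_shape(s_me, s_tar, use_target_shape):
    """Return Snew(f) = S*(f): one of the two spectral shapes, unchanged."""
    return s_tar if use_target_shape else s_me
```

For instance, taking the static pitch from the target but the vibrato from the original singer keeps the target's melody line while preserving the user's own vibrato.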

【００５８】ところで、一般的にアンプ成分が大きい場
合には、高域まで伸びた抜けの明るい音となり、アンプ
成分が小さい場合には、逆にこもった音になる。そこ
で、新規スペクトラル・シェイプＳnew(f)に関しては、
このような状態をシミュレートすべく、図１１に示すよ
うに、スペクトラル・シェイプの高域成分、すなわち、
高域成分部分のスペクトラル・シェイプの傾きを新規ア
ンプ成分Ａnewの大きさに応じて補償するスペクトラル
チルト補償（spectral tilt correction）を行って、コ
ントロールすることにより、よりリアルな音声を再生す
ることができる。続いて、生成された新規アンプ成分Ａ
new、新規ピッチ成分Ｐnew及び新規スペクトラル・シェ
イプＳnew(f)について、必要に応じてコントローラ２９
から入力される正弦波成分属性データ変形情報に基づい
て、属性データ変形部２４によりさらなる変形を行う。
例えば、スペクトラル・シェイプを全体的に間延びさせ
る等の変形を行う。属性データ変形部２４は、変形後の
正弦波成分のピッチＰattをピッチ決定部４０へ供給す
る。
In general, a large amplitude component gives a bright sound that extends into the high range, whereas a small amplitude component gives a muffled sound. To simulate this for the new spectral shape Snew(f), spectral tilt correction is applied as shown in FIG. 11: the high-frequency part of the spectral shape, that is, the slope of the spectral shape in the high-frequency region, is compensated according to the magnitude of the new amplitude component Anew. Controlling the shape in this way makes it possible to reproduce more realistic speech. Next, the generated new amplitude component Anew, new pitch component Pnew, and new spectral shape Snew(f) are further modified as necessary by the attribute data deforming unit 24, based on sine wave component attribute data deformation information input from the controller 29. For example, the spectral shape may be stretched as a whole. The attribute data deforming unit 24 supplies the pitch Patt of the sine wave components after deformation to the pitch determination unit 40.

【００５９】［２．１０］残差成分選択部の動作一方、残差成分選択部２５は、イージーシンクロナイゼ
ーション処理部２２から入力された置換済ターゲットフ
レーム情報データＩＮＦtar-syncに含まれるターゲット
属性データのうち残差成分に関するターゲット属性デー
タ（残差成分Ｒtar-sync(f)）、残差成分保持部１２に
保持されている残差成分信号（周波数波形）Ｒme(f)及
びコントローラ２９から入力される残差成分属性データ
選択情報に基づいて新しい残差成分属性データである新
規残差成分Ｒnew(f)を生成する。すなわち、新規残差成
分Ｒnew(f)については、次式により生成する。Ｒnew(f)＝Ｒ*(f)（ただし、*は、me又はtar-sync）
[2.10] Operation of the Residual Component Selection Unit
Meanwhile, the residual component selection unit 25 generates a new residual component Rnew(f), which is new residual component attribute data. It does so based on the target attribute data relating to the residual component (the residual component Rtar-sync(f)) contained in the replaced target frame information data INFtar-sync input from the easy synchronization processing unit 22, the residual component signal (frequency waveform) Rme(f) held in the residual component holding unit 12, and residual component attribute data selection information input from the controller 29. That is, the new residual component Rnew(f) is generated by the following equation.
Rnew(f) = R*(f) (where * is me or tar-sync)

【００６０】この場合においては、me又はtar-syncのい
ずれを選択するかは、新規スペクトラル・シェイプＳne
w(f)と同一のものを選択するのがより好ましい。さら
に、新規残差成分Ｒnew(f)に関しても、新規スペクトラ
ル・シェイプと同様な状態をシミュレートすべく、図１
１に示したように、残差成分の高域成分、すなわち、高
域成分部分の残差成分の傾きを新規アンプ成分Ａnewの
大きさに応じて補償するスペクトラルチルト補償（spec
tral tilt correction）を行って、コントロールするこ
とにより、よりリアルな音声を再生することができる。
In this case, it is preferable to make the same choice between me and tar-sync as for the new spectral shape Snew(f). Furthermore, to simulate for the new residual component Rnew(f) the same behavior as for the new spectral shape, spectral tilt correction is applied as shown in FIG. 11: the high-frequency part of the residual component, that is, the slope of the residual component in the high-frequency region, is compensated according to the magnitude of the new amplitude component Anew. Controlling the residual in this way makes it possible to reproduce more realistic speech.

【００６１】［２．１１］正弦波成分生成部の動作続いて、正弦波成分生成部２６は、属性データ変形部２
４から出力された変形を伴わない、あるいは、変形を伴
う新規アンプ成分Ａnew、新規ピッチ成分Ｐnew及び新規
スペクトラル・シェイプＳnew(f)に基づいて、当該フレ
ームにおける新たな正弦波成分（Ｆ"０、Ａ"０）、
（Ｆ"１、Ａ"１）、（Ｆ"２、Ａ"２）、……、（Ｆ"(N-
1)、Ａ"(N-1)）のＮ個の正弦波成分（以下、これらをま
とめてＦ"ｎ、Ａ"ｎと表記する。ｎ＝０〜（Ｎ−
１）。）を求める。より具体的には、次式により新規周
波数Ｆ"ｎおよび新規アンプＡ"ｎを求める。Ｆ"ｎ＝Ｆ'ｎ×Ｐnew Ａ"ｎ＝Ｓnew(Ｆ"ｎ）×Ａnew なお、完全倍音構造のモデルとして捉えるのであれば、Ｆ"ｎ＝（ｎ＋１）×Ｐnew となる。
[2.11] Operation of the Sine Wave Component Generation Unit
Subsequently, based on the new amplitude component Anew, the new pitch component Pnew, and the new spectral shape Snew(f) output from the attribute data deforming unit 24, with or without deformation, the sine wave component generation unit 26 obtains N new sine wave components for the frame, (F"0, A"0), (F"1, A"1), (F"2, A"2), ..., (F"(N-1), A"(N-1)) (hereinafter collectively written F"n, A"n, where n = 0 to N-1). More specifically, the new frequencies F"n and new amplitudes A"n are obtained by the following equations.
F"n = F'n × Pnew
A"n = Snew(F"n) × Anew
If the voice is modeled as perfectly harmonic, then
F"n = (n + 1) × Pnew
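
The two equations of [2.11] translate directly into code. In the sketch below the spectral shape is passed as a callable, and the function names are hypothetical; only the formulas themselves come from the text.

```python
def generate_sinusoids(norm_freqs, p_new, a_new, s_new):
    """Return [(F"n, A"n)] with F"n = F'n * Pnew and A"n = Snew(F"n) * Anew."""
    components = []
    for f_norm in norm_freqs:
        freq = f_norm * p_new
        amp = s_new(freq) * a_new
        components.append((freq, amp))
    return components


def harmonic_frequencies(n_components, p_new):
    """Perfectly harmonic model: F"n = (n + 1) * Pnew for n = 0 .. N-1."""
    return [(n + 1) * p_new for n in range(n_components)]
```

Note that the amplitude is read from the spectral shape at the new frequency, so changing the pitch moves the harmonics along a fixed envelope, keeping the timbre with the selected spectral shape.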

[0062] [2.12] Operation of the sine wave component deforming unit
The new frequencies F"n and new amplitudes A"n thus obtained are further deformed, as necessary, by the sine wave component deforming unit 27 on the basis of sine wave component deformation information input from the controller 29. For example, only the new amplitudes A"n of the even-order components (= A"0, A"2, A"4, ...) are increased (for example, doubled). This gives the resulting converted voice further variety.
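As a minimal sketch of the example deformation above, the following Python function scales the amplitudes with even index n (A"0, A"2, A"4, ...) and leaves the others unchanged. The function name and the list-of-pairs representation are assumptions made for illustration only.

```python
from typing import List, Tuple


def boost_even_order(
    components: List[Tuple[float, float]],  # (F"n, A"n) pairs, n = 0 .. N-1
    gain: float = 2.0,                      # example value from the text: doubling
) -> List[Tuple[float, float]]:
    """Scale A"n for even n only; frequencies are left untouched."""
    return [
        (f, a * gain if n % 2 == 0 else a)
        for n, (f, a) in enumerate(components)
    ]
```

Note that the index follows the numbering of the text, so n = 0 is the lowest component.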

[0063] [2.13] Operation of the pitch determination unit
The pitch determination unit 40 for the comb filter takes one of the pitch Pme-sta from the pitch detection unit 7, the pitch Ptar-sta from the target frame information holding unit 20, the pitch Pnew from the sine wave component attribute data selection unit 23, and the pitch Patt from the attribute data deforming unit 24 (basically the pitch Patt) as the optimum pitch of the comb filter (comb filter pitch: Pcomb), and supplies it to the comb filter processing unit 41. The method of determining the comb filter pitch Pcomb is as follows. In the above description, the pitch Pcomb is generated from the pitch Patt obtained after attribute conversion by the attribute data deforming unit 24, but the invention is not limited to this. For example, when the voice conversion processing uses the target pitch Ptar-sta as the pitch of the sine wave component and Rme(f) as the new residual component Rnew(f), the pitch that is unwanted in the residual component is Pme-sta, so Pme-sta is used as Pcomb. Conversely, when the voice conversion processing uses Pme-sta as the pitch of the sine wave component and the target residual component Rtar-sync(f) as the new residual component Rnew(f), Ptar-sta is used as Pcomb.

[0064] When the attribute conversion that forms the final stage of the voice conversion processing applies a pitch shift such as an octave shift, Pme-sta may be used as Pcomb if the residual component of the input voice is used for the pitch shift, and Ptar-sta may be used if the residual component of the target is used. Furthermore, when the residual components of the input voice and the target voice are interpolated at an arbitrary ratio, the pitch obtained by interpolating Pme-sta and Ptar-sta at the same ratio is used as the comb filter pitch Pcomb. Thus, in order to filter the converted residual component with the comb filter and remove the pitch component and its harmonic components from it, the optimum pitch Pcomb must be determined for the comb filter used.
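The selection rule described in this section can be summarized by the following Python sketch. It is a simplified illustration under stated assumptions: the `ResidualSource` enum and the function name are hypothetical, only the residual-dependent cases are covered (not the default use of Patt), and the interpolation is assumed to be linear in Hz, which the text does not specify.

```python
from enum import Enum


class ResidualSource(Enum):
    """Which residual component feeds Rnew(f) (hypothetical labels)."""
    INPUT = "me"          # Rme(f), residual of the input voice
    TARGET = "tar-sync"   # Rtar-sync(f), residual of the target voice
    INTERPOLATED = "mix"  # blend of both residuals


def determine_comb_pitch(
    p_me_sta: float,            # pitch of the input voice, Hz
    p_tar_sta: float,           # pitch of the target voice, Hz
    source: ResidualSource,
    target_ratio: float = 0.0,  # share of the target residual in the blend, 0..1
) -> float:
    """Return Pcomb: the pitch that is still present in the chosen residual."""
    if source is ResidualSource.INPUT:
        return p_me_sta
    if source is ResidualSource.TARGET:
        return p_tar_sta
    if not 0.0 <= target_ratio <= 1.0:
        raise ValueError("target_ratio must lie in [0, 1]")
    # Same ratio as the residual blend; linear interpolation is an assumption.
    return (1.0 - target_ratio) * p_me_sta + target_ratio * p_tar_sta
```

The point of the rule is that the comb filter must notch the pitch carried by the residual, which is not necessarily the pitch of the synthesized sine wave component.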

[0065] [2.14] Operation of the comb filter processing unit
The comb filter processing unit 41 constructs a comb filter using the pitch Pcomb and filters the residual component Rnew(f) with it, thereby removing the pitch component and its harmonic components from Rnew(f), and supplies the result to the inverse fast Fourier transform unit 28 as a new residual component Rnew'(f). FIG. 12 is a conceptual diagram showing an example of the comb filter characteristic when the pitch Pcomb is 200 Hz. Thus, when the residual component is held on the frequency axis, the comb filter is constructed on the frequency axis based on the pitch Pcomb.
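A frequency-axis comb filter of this kind can be sketched as a per-bin gain with attenuation peaks at multiples of Pcomb. The exact characteristic of FIG. 12 is not reproduced here; the gain |sin(pi f / Pcomb)| used below is an assumed shape chosen only because it is zero at every multiple of Pcomb and one midway between them. The function names and the uniform bin spacing `bin_hz` are also assumptions.

```python
import math
from typing import List, Sequence


def comb_gain(f_hz: float, p_comb: float) -> float:
    """Assumed notch shape: 0 at k * Pcomb, 1 halfway between adjacent notches."""
    if p_comb <= 0.0:
        raise ValueError("p_comb must be positive")
    return abs(math.sin(math.pi * f_hz / p_comb))


def comb_filter_spectrum(
    residual: Sequence[complex],  # Rnew(f), one value per frequency bin
    bin_hz: float,                # frequency spacing between bins in Hz
    p_comb: float,                # Pcomb in Hz
) -> List[complex]:
    """Return Rnew'(f): the residual with Pcomb and its harmonics attenuated."""
    return [x * comb_gain(k * bin_hz, p_comb) for k, x in enumerate(residual)]


# Flat residual spectrum, 50 Hz bins, Pcomb = 200 Hz as in the FIG. 12 example.
filtered = comb_filter_spectrum([1.0] * 9, 50.0, 200.0)
```

With this assumed shape the bin at 0 Hz is attenuated as well, since 0 is also a multiple of Pcomb.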

[0066] [2.15] Operation of the inverse fast Fourier transform unit
Next, the inverse fast Fourier transform unit 28 stores the new frequencies F"n and new amplitudes A"n (= the new sine wave components) and the new residual component Rnew'(f) in an FFT buffer and performs inverse FFTs in sequence. The resulting time-axis signals are overlapped so that they partially coincide and are then added together, generating the converted voice signal, which is the time-axis signal of the new voiced sound.
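The overlap and addition step can be sketched as follows. This shows only the summation of already inverse-transformed frames; the frame length, hop size, and any windowing are not given in this section, so the `hop` parameter and the equal-length frames are assumptions of the sketch.

```python
from typing import List, Sequence


def overlap_add(frames: Sequence[Sequence[float]], hop: int) -> List[float]:
    """Sum time-axis frames placed `hop` samples apart into one signal."""
    if hop < 1:
        raise ValueError("hop must be at least 1 sample")
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * hop
        for j, sample in enumerate(frame):
            out[start + j] += sample  # overlapping regions accumulate
    return out
```

When the hop is smaller than the frame length, consecutive frames share samples, and the shared region is where the addition smooths the transition between frames.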

[0067] At this time, the mixing ratio of the sine wave component and the residual component is controlled on the basis of a sine wave component / residual component balance control signal input from the controller 29, giving a more realistic voiced signal. In general, increasing the mixing ratio of the residual component gives a rougher voice. In addition, when the new frequencies F"n and new amplitudes A"n (= the new sine wave components) and the new residual component Rnew(f) are stored in the FFT buffer, harmony can be obtained in the converted voice signal by further adding sine wave components converted to a different, suitable pitch. Moreover, by having the sequencer 31 supply a harmony pitch matched to the accompaniment, musical harmony that fits the accompaniment can be obtained.

[0068] [2.16] Operation of the crossfader
Next, on the basis of the original unvoiced/voiced detection signal U/Vme(t), the crossfader 30 outputs the input voice signal Sv to the mixer 33 unchanged when the input voice signal Sv is unvoiced (U). When the input voice signal Sv is voiced (V), it outputs the converted voice signal from the inverse fast Fourier transform unit 28 to the mixer 33. The crossfader 30 is used as the changeover switch because its crossfade operation prevents click noise when the switch changes over.
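The click-free switching can be sketched as a mix coefficient that ramps toward 0 (input voice) or 1 (converted voice) instead of jumping. The linear ramp, the per-sample voiced flag, and the `fade_len` parameter are assumptions for illustration; the text does not specify the fade curve or its duration.

```python
from typing import List, Sequence


def crossfade_select(
    dry: Sequence[float],    # input voice signal Sv
    wet: Sequence[float],    # converted voice signal
    voiced: Sequence[bool],  # U/V decision per sample (True = voiced)
    fade_len: int,           # number of samples for a full fade (assumed)
) -> List[float]:
    """Output dry when unvoiced and wet when voiced, ramping between them."""
    if fade_len < 1:
        raise ValueError("fade_len must be at least 1 sample")
    step = 1.0 / fade_len
    mix = 0.0                # start on the dry (unvoiced) side
    out = []
    for d, w, v in zip(dry, wet, voiced):
        target = 1.0 if v else 0.0
        if mix < target:
            mix = min(target, mix + step)
        elif mix > target:
            mix = max(target, mix - step)
        out.append((1.0 - mix) * d + mix * w)
    return out
```

A hard switch corresponds to `fade_len = 1`, which is the case that can produce an audible click.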

[0069] [2.17] Operation of the sequencer, sound source unit, mixer, and output unit
Meanwhile, the sequencer 31 outputs sound source control information for generating the karaoke accompaniment to the sound source unit 32, for example as MIDI (Musical Instrument Digital Interface) data. The sound source unit 32 generates an accompaniment signal from the sound source control information and outputs it to the mixer 33. The mixer 33 mixes either the input voice signal Sv or the converted voice signal with the accompaniment signal and outputs the mixed signal to the output unit 34. The output unit 34 has an amplifier (not shown), amplifies the mixed signal, and outputs it as an acoustic signal.

[0070] [3] Modifications of the embodiment
[3.1] First modification
In the above description, either the original attribute data or the target attribute data is selectively used as the attribute data. It is also possible to use both and interpolate between them to obtain a converted voice signal with intermediate attributes. With such a configuration, however, the converted voice may resemble neither the singer doing the imitation nor the target singer being imitated. In particular, when the spectral shape is obtained by interpolation, if the imitating singer pronounces "a" while the target singer pronounces "i", a sound that is neither "a" nor "i" may be output as the converted voice, so care is needed in handling this case.

[0071] [3.2] Second modification
The extraction of the sine wave component is not limited to the method used in this embodiment. What matters is that the sine waves contained in the voice signal can be extracted.
[3.3] Third modification
In this embodiment, the sine wave component and the residual component of the target are stored. Instead, the target voice itself may be stored, read out, and processed in real time to extract the sine wave component and the residual component. That is, the same processing that this embodiment applies to the voice of the imitating singer may be applied to the voice of the target singer.

[0072] [3.4] Fourth modification
In this embodiment, pitch, amplitude, and spectral shape are all handled as attribute data, but it is also possible to handle at least one of them.
[3.5] Fifth modification
In this embodiment, the residual component is held on the frequency axis, but it may instead be held on the time axis. FIG. 13 is a block diagram showing (part of) the configuration of this modification of the embodiment, and FIG. 14 is a block diagram showing an example of the configuration of the comb filter (delay filter). Parts corresponding to FIG. 1 are given the same reference numerals and their description is omitted. In the figures, the comb filter processing unit 42 constructs a comb filter (delay filter) whose delay time is the reciprocal of the pitch Pcomb determined by the pitch determination unit 40, filters the residual component Rnew(t) with it, and supplies the result to the subtractor 43 as the residual component Rnew''(t). The subtractor 43 subtracts the filtered residual component Rnew''(t) from the residual component Rnew(t), thereby removing the pitch component and its harmonic components from Rnew(t), and supplies the result to the IFFT processing unit 8 as a new residual component Rnew'(t). Thus, even when the residual component is processed on the time axis, the pitch component and its harmonic components can be removed from Rnew(t) as in the embodiment described above. As a result, only the pitch component of the sine wave component is heard in the final output voice, which improves the naturalness of the voice.
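The time-axis variant can be sketched as a delay of 1/Pcomb followed by a subtraction. The internal structure of the delay filter in FIG. 14 is not reproduced here; the sketch assumes the simplest case, in which Rnew''(t) is just Rnew(t) delayed by one pitch period, and it rounds the delay to a whole number of samples. The function name and the `sample_rate` parameter are assumptions.

```python
import math
from typing import List, Sequence


def remove_pitch_time_domain(
    residual: Sequence[float],  # Rnew(t), sampled
    sample_rate: float,         # samples per second (assumed parameter)
    p_comb: float,              # Pcomb in Hz
) -> List[float]:
    """Return Rnew'(t) = Rnew(t) - Rnew(t - 1/Pcomb)."""
    delay = round(sample_rate / p_comb)  # delay time 1/Pcomb in samples
    if delay < 1:
        raise ValueError("p_comb is too high for this sample rate")
    out = []
    for n, x in enumerate(residual):
        delayed = residual[n - delay] if n >= delay else 0.0  # Rnew''(t)
        out.append(x - delayed)                               # subtractor
    return out


# A 200 Hz tone plus its second harmonic, sampled at 8 kHz.
tone = [
    math.sin(2 * math.pi * 200 * n / 8000) + 0.5 * math.sin(2 * math.pi * 400 * n / 8000)
    for n in range(200)
]
cleaned = remove_pitch_time_domain(tone, 8000.0, 200.0)
```

Any component that repeats every 1/Pcomb seconds cancels once the delay line has filled, which covers the pitch and all of its harmonics. In this simple form, components midway between harmonics come out with up to twice their amplitude, so a practical design would probably add scaling.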

[0073] [4] Effects of the embodiment
As a result of the above, the singer's voice is output together with the karaoke accompaniment, but its voice quality and singing style are strongly influenced by the target and become those of the target itself. In this way, the output sounds as if the singer were imitating the target. In addition, because the pitch component and its harmonic components are removed from the residual component Rnew(f), only the pitch component of the sine wave component is ultimately heard, so the naturalness of the voice is not impaired.

【００７４】[0074]

[Effects of the Invention] As described above, according to the present invention, the sine wave component and the residual component extracted from the input voice signal are each deformed on the basis of the sine wave component or the residual component of the target voice, and the pitch component and its harmonic components are removed from the deformed residual component before the sine wave component and the residual component are synthesized. A converted voice reflecting the voice quality and singing style of the target singer being imitated can therefore be obtained easily from the voice of the imitating singer (the input voice), without impairing the naturalness of the synthesized voice.

[Brief description of the drawings]

FIG. 1 is a block diagram (part 1) showing the configuration of an embodiment of the present invention.

FIG. 2 is a block diagram (part 2) showing the configuration of an embodiment of the present invention.

FIG. 3 is a diagram showing the state of frames in the embodiment.

FIG. 4 is an explanatory diagram for describing peak detection in the frequency spectrum in the embodiment.

FIG. 5 is a diagram showing the linking of per-frame peak values in the embodiment.

FIG. 6 is a diagram showing how frequency values change in the embodiment.

FIG. 7 is a diagram showing how the deterministic component changes during processing in the embodiment.

FIG. 8 is an explanatory diagram of signal processing in the embodiment.

FIG. 9 is a timing chart of the easy synchronization processing.

FIG. 10 is a flowchart of the easy synchronization processing.

FIG. 11 is a diagram for describing spectral tilt correction of the spectral shape.

FIG. 12 is a conceptual diagram for describing the comb filter characteristic (when the pitch Pcomb is 200 Hz).

FIG. 13 is a block diagram showing (part of) the configuration of a voice conversion device according to a modification of the present invention.

FIG. 14 is a block diagram showing an example of the configuration of a comb filter (delay filter).

[Explanation of symbols]

1: microphone, 2: analysis window generation unit, 3: input voice signal cutout unit, 4: fast Fourier transform unit, 5: peak detection unit, 6: unvoiced/voiced detection unit, 7: pitch extraction unit, 8: peak linking unit, 9: interpolation/synthesis unit, 10: residual component detection unit, 11: fast Fourier transform unit, 12: residual component holding unit, 13: sine wave component holding unit, 14: average amplitude calculation unit, 15: amplitude normalization unit, 16: spectral shape calculation unit, 17: pitch normalization unit, 18: original frame information holding unit, 19: static change / vibrato-like change separation unit, 20: target frame information holding unit, 21: key control / tempo change unit, 22: easy synchronization processing unit, 23: sine wave component attribute data selection unit, 24: attribute data deforming unit, 25: residual component selection unit, 26: sine wave component generation unit, 27: sine wave component deforming unit, 28: inverse fast Fourier transform unit, 29: controller, 30: crossfader, 31: sequencer, 32: sound source unit, 33: mixer, 34: output unit, 40: pitch determination unit, 41, 42: comb filter processing unit, 43: subtractor

Continuation of front page. Examiner: Satoshi Watanabe. (56) References: JP-A-6-149242 (JP, A). (58) Field searched (Int. Cl.⁷, DB name): G10L 21/04

Claims

(57) [Claims]

1. A voice conversion device comprising: sine wave component extracting means for extracting a sine wave component from an input voice signal; residual component extracting means for extracting, from the input voice signal, a residual component other than the sine wave component extracted by the sine wave component extracting means; sine wave component deforming means for deforming the sine wave component extracted by the sine wave component extracting means on the basis of a sine wave component of a target voice signal; residual component deforming means for deforming the residual component extracted by the residual component extracting means on the basis of a residual component of the target voice signal; removing means for removing a pitch component and its harmonic components from the residual component obtained by the residual component deforming means; and synthesizing means for synthesizing the sine wave component deformed by the sine wave component deforming means with the residual component from which the pitch component and its harmonic components have been removed by the removing means.

2. The voice conversion device according to claim 1, further comprising pitch determining means for setting, as the pitch of the attenuation peaks in the removing means, one of the pitch of the sine wave component of the input voice signal, the pitch of the sine wave component of the target voice signal, and the pitch of the sine wave component obtained by the sine wave component deforming means.

3. The voice conversion device according to claim 1, wherein, when the residual component is held on a frequency axis, the removing means is a comb filter whose attenuation peaks have the pitch determined by the pitch determining means.

4. The voice conversion device according to claim 1, wherein, when the residual component is held on a time axis, the removing means is a comb filter having a delay filter whose delay time is the reciprocal of the attenuation peak pitch determined by the pitch determining means.

5. A voice conversion method comprising: a component extracting step of extracting, from an input voice, a sine wave component and a residual component other than the sine wave component; a sine wave component deforming step of deforming the extracted sine wave component on the basis of a sine wave component of a target voice; a residual component deforming step of deforming the extracted residual component on the basis of a residual component of the target voice; a removing step of removing a pitch component and its harmonic components from the residual component obtained in the residual component deforming step; and a synthesizing step of synthesizing the sine wave component deformed in the sine wave component deforming step with the residual component from which the pitch component and its harmonic components have been removed in the removing step.

6. The voice conversion method according to claim 5, further comprising a pitch determining step of setting, as the pitch of the attenuation peaks in said removing means, one of the pitch of the sine wave component of the input voice, the pitch of the sine wave component of the target voice, and the pitch of the sine wave component obtained by the sine wave component deforming means.