JP4430174B2 - Voice conversion device and voice conversion method

Info

Publication number: JP4430174B2
Application number: JP30026899A
Authority: JP
Inventors: 靖雄吉岡; セラザビエル; シーメンツマーク; ボナダジョルディ
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 1999-10-21
Filing date: 1999-10-21
Publication date: 2010-03-10
Anticipated expiration: 2019-10-21
Also published as: JP2001117597A

Abstract

PROBLEM TO BE SOLVED: To make the input voice of a singer resemble the singing of a target singer, and to reduce the volume of analysis data held for the target singer. SOLUTION: Input frame data FSMS corresponding to an input voice signal SV are extracted, and alignment is performed to synchronize the input frame data FSMS with the target frame data TGFL to be generated. The target frame data TGFL are generated from target frame generation data extracted in advance from a target voice, and a converted voice signal is generated and output on the basis of the input frame data FSMS and the target frame data TGFL.

Description

【０００１】
【発明の属する技術分野】
この発明は、処理対象となる音声を目標とする他の音声に近似させる音声変換装置、音声変換方法ならびに音声変換を行うに際し用いる他の音声に対応する音声変換用辞書を生成する音声変換用辞書の生成方法に係り、特にカラオケ装置に用いるのに好適な音声変換装置、音声変換方法及び音声変換用辞書の生成方法に関する。
【０００２】
【従来の技術】
入力された音声の周波数特性などを変えて出力する音声変換装置は種々開発されており、例えば、カラオケ装置の中には、歌い手の歌った歌声のピッチを変換して、男性の声を女性の声に、あるいはその逆に変換させるものもある（例えば、特表平８−５０８５８１号）。
【０００３】
【発明が解決しようとする課題】
しかしながら、従来の音声変換装置においては、音声の変換（例えば、男声→女声、女声→男声など）は行われるものの、単に声質を変えるだけに止まっていたので、例えば、特定の歌唱者（例えば、プロの歌手）の声に似せるように変換するということはできなかった。
また、声質だけでなく、歌い方までも特定の歌唱者に似させるという、ものまねのような機能があれば、カラオケ装置などにおいては大変に面白いが、従来の音声変換装置ではこのような処理は不可能であった。
【０００４】
そこで、発明者らは、声質を目標（ターゲット）とする歌唱者（ターゲット歌唱者）の声に似させるために、ターゲット歌唱者の音声を分析し、得られた分析データである正弦波成分属性ピッチ、アンプリチュード、スペクトル・シェイプ及び残差成分を１曲分全てのフレームについてターゲットフレームデータとして保持し、入力音声を分析して得られる入力ターゲットフレームデータとの同期をとって、変換処理を行うことによりターゲット歌唱者の声に似せるように変換を行う音声変換装置を提案している（特願平１０−１８３３３８号等参照）。
上記音声変換装置は、声質だけでなく、歌い方までも特定の歌唱者に似させることができるが、ターゲット歌唱者の分析データが一曲毎に必要となり、複数の曲の分析データを記憶させるような場合には、データ量が膨大になってしまうという不具合があった。
【０００５】
そこで、本発明の目的は、入力された歌唱者の音声を目標とする歌唱者の歌い方に似せることができるとともに、ターゲット歌唱者の分析データの容量を低減することが可能な音声変換装置、音声変換方法および音声変換用辞書の生成方法を提供することにある。
【０００６】
【課題を解決するための手段】
上記課題を解決するため、請求項１に記載の構成は、入力音声信号から周波数スペクトルに関する入力フレームデータを抽出する入力フレームデータ抽出手段と、前記入力音声信号から特徴ベクトルを抽出する特徴分析手段と、前記特徴ベクトルを予め決められたアルゴリズムにより解析して、音声変換対象となるターゲット音声のピッチおよび音素の時間的変化が規定されたターゲット挙動データと対応付けて、前記入力フレームデータに対応する前記ターゲット挙動データにおける時間的位置を判別するアライメント処理手段と、前記アライメント処理手段によって判別された時間的位置の前記ターゲット挙動データにおけるターゲット音声の音素が安定状態、あるいは、第１の音素から第２の音素に遷移する途中である遷移状態のいずれにあるかを判別する状態判別手段と、前記状態判別手段によって前記音素が遷移状態にあると判別された場合に、音素毎に複数のピッチに対応したスペクトル・シェイプを有するターゲット音素辞書のスペクトル・シェイプのうち、前記アライメント処理手段によって判別された時間的位置の前記ターゲット挙動データにおけるターゲット音声の音素について、前記第１の音素に対応する二つのスペクトル・シェイプであって当該時間的位置の前記ターゲット挙動データにおけるターゲット音声のピッチに近い二つのピッチに対応する二つのスペクトル・シェイプを用いて補間処理を行うことによって前記第１の音素のスペクトル・シェイプを算出し、前記第２の音素に対応する二つのスペクトル・シェイプであって当該時間的位置の前記ターゲット挙動データにおけるターゲット音声のピッチに近い二つのピッチに対応する二つのスペクトル・シェイプを用いて補間処理を行うことによって前記第２の音素のスペクトル・シェイプを算出し、前記第１の音素のスペクトル・シェイプと前記第２の音素のスペクトル・シェイプとを用いて前記入力フレームデータに対応するスペクトル・シェイプを算出するスペクトル・シェイプ補間手段と、前記スペクトル・シェイプ補間手段によって算出されたスペクトル・シェイプに基づいて変換音声信号を生成し出力する変換音声信号生成手段とを備えたことを特徴としている。
【０００７】
請求項２に記載の構成は、入力音声信号から周波数スペクトルに関する入力フレームデータを抽出する入力フレームデータ抽出手段と、前記入力音声信号から特徴ベクトルを抽出する特徴分析手段と、前記特徴ベクトルを予め決められたアルゴリズムにより解析して、音声変換対象となるターゲット音声のピッチおよび音素の時間的変化が規定されたターゲット挙動データと対応付けて、前記入力フレームデータに対応する前記ターゲット挙動データにおける時間的位置を判別するアライメント処理手段と、前記アライメント処理手段によって判別された時間的位置の前記ターゲット挙動データにおけるターゲット音声の音素が安定状態、あるいは、第１の音素から第２の音素に遷移する途中である遷移状態のいずれにあるかを判別する状態判別手段と、前記状態判別手段によって前記音素が遷移状態にあると判別された場合に、音素毎に複数のピッチに対応したスペクトル・シェイプを有するターゲット音素辞書のスペクトル・シェイプのうち、前記アライメント処理手段によって判別された時間的位置の前記ターゲット挙動データにおけるターゲット音声の音素について、前記第１の音素に対応する二つのスペクトル・シェイプであって前記入力フレームデータから得られるピッチに近い二つのピッチに対応する二つのスペクトル・シェイプを用いて補間処理を行うことによって前記第１の音素のスペクトル・シェイプを算出し、前記第２の音素に対応する二つのスペクトル・シェイプであって前記入力フレームデータから得られるピッチに近い二つのピッチに対応する二つのスペクトル・シェイプを用いて補間処理を行うことによって前記第２の音素のスペクトル・シェイプを算出し、前記第１の音素のスペクトル・シェイプと前記第２の音素のスペクトル・シェイプとを用いて前記入力フレームデータに対応するスペクトル・シェイプを算出するスペクトル・シェイプ補間手段と、前記スペクトル・シェイプ補間手段によって算出されたスペクトル・シェイプに基づいて変換音声信号を生成し出力する変換音声信号生成手段とを備えたことを特徴としている。
【００１３】
請求項３に記載の構成は、請求項１または請求項２に記載の音声変換装置において、前記スペクトル・シェイプ補間手段は、二つのスペクトル・シェイプを用いて補間を行うに際し、前記二つのスペクトル・シェイプ間における遷移関数を用いて補間処理を行うことを特徴としている。
【００１４】
請求項４に記載の構成は、請求項３に記載の音声変換装置において、前記遷移関数は、線形関数あるいは非線形関数として予め定義されていることを特徴としている。
【００１５】
請求項５に記載の構成は、請求項３に記載の音声変換装置において、前記二つのスペクトル・シェイプを周波数軸上でそれぞれ複数の領域に分け、各領域毎に前記遷移関数を定めることを特徴としている。
【００１６】
請求項６に記載の構成は、請求項３に記載の音声変換装置において、前記スペクトル・シェイプ補間手段は、前記第２の音素に対応させて前記遷移関数を定めることを特徴としている。
【００１８】
請求項７に記載の構成は、請求項３記載の音声変換装置において、前記スペクトル・シェイプ補間手段は、前記二つのスペクトル・シェイプを周波数軸上でそれぞれ複数の領域に分け、各領域に属する前記二つのスペクトル・シェイプ上の実在の周波数およびマグニチュードの組に対し、前記遷移関数としての線形関数を用いた補間処理を前記複数の領域にわたって行うことを特徴としている。
【００１９】
請求項８に記載の構成は、請求項７に記載の音声変換装置において、前記スペクトル・シェイプ補間手段は、前記各領域に属する一方のスペクトル・シェイプの周波数である第１周波数及び当該第１周波数に対応する他方のスペクトル・シェイプの周波数である第２周波数を前記線形関数を用いて補間することにより補間周波数を算出する周波数補間手段と、前記各領域に属する一方のスペクトル・シェイプのマグニチュードである第１マグニチュードおよび当該第１マグニチュードに対応する他方のスペクトル・シェイプのマグニチュードである第２マグニチュードを前記線形関数を用いて補間するマグニチュード補間手段とを備えたことを特徴としている。
【００２０】
請求項９に記載の構成は、請求項１に記載の音声変換装置において、前記ターゲット挙動データには、さらにターゲット音声のアンプリチュードの時間的変化が規定され、前記アライメント処理手段によって判別された前記ターゲット挙動データにおける時間的位置のアンプリチュードに応じて、前記スペクトル・シェイプ補間手段によって算出されたスペクトル・シェイプのスペクトル傾きを補正するスペクトル傾き補正手段を備え、前記変換音声信号生成手段は、前記スペクトル傾き補正手段によってスペクトルの傾きが補正されたスペクトル・シェイプに基づいて変換音声信号を生成し出力することを特徴としている。
【００２１】
請求項１０に記載の構成は、請求項２に記載の音声変換装置において、前記スペクトル・シェイプ補間手段によって算出されたスペクトル・シェイプのスペクトル傾きと、前記入力フレームデータから得られるスペクトル・シェイプのスペクトル傾きとの比較結果に応じて、前記スペクトル・シェイプ補間手段によって算出されたスペクトル・シェイプのスペクトル傾きを補正するスペクトル傾き補正手段を備えたことを特徴としている。
【００２２】
請求項１１の構成は、入力音声信号から周波数スペクトルに関する入力フレームデータを抽出する入力フレームデータ抽出過程と、前記入力音声信号から特徴ベクトルを抽出する特徴分析過程と、前記特徴ベクトルを予め決められたアルゴリズムにより解析して、音声変換対象となるターゲット音声のピッチおよび音素の時間的変化が規定されたターゲット挙動データと対応付けて、前記入力フレームデータに対応する前記ターゲット挙動データにおける時間的位置を判別するアライメント処理過程と、前記アライメント処理過程において判別された時間的位置の前記ターゲット挙動データにおけるターゲット音声の音素が安定状態、あるいは、第１の音素から第２の音素に遷移する途中である遷移状態のいずれにあるかを判別する状態判別過程と、前記状態判別過程において前記音素が遷移状態にあると判別された場合に、音素毎に複数のピッチに対応したスペクトル・シェイプを有するターゲット音素辞書のスペクトル・シェイプのうち、前記アライメント処理過程において判別された時間的位置の前記ターゲット挙動データにおけるターゲット音声の音素について、前記第１の音素に対応する二つのスペクトル・シェイプであって当該時間的位置の前記ターゲット挙動データにおけるターゲット音声のピッチに近い二つのピッチに対応する二つのスペクトル・シェイプを用いて補間処理を行うことによって前記第１の音素のスペクトル・シェイプを算出し、前記第２の音素に対応する二つのスペクトル・シェイプであって当該時間的位置の前記ターゲット挙動データにおけるターゲット音声のピッチに近い二つのピッチに対応する二つのスペクトル・シェイプを用いて補間処理を行うことによって前記第２の音素のスペクトル・シェイプを算出し、前記第１の音素のスペクトル・シェイプと前記第２の音素のスペクトル・シェイプとを用いて前記入力フレームデータに対応するスペクトル・シェイプを算出するスペクトル・シェイプ補間過程と、前記スペクトル・シェイプ補間過程において算出されたスペクトル・シェイプに基づいて変換音声信号を生成し出力する変換音声信号生成過程とを備えたことを特徴としている。
【００２３】
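The configuration according to claim 12 comprises: an input frame data extraction step of extracting input frame data relating to a frequency spectrum from an input voice signal; a feature analysis step of extracting a feature vector from the input voice signal; an alignment processing step of analyzing the feature vector with a predetermined algorithm, associating it with target behavior data that define the temporal changes of the pitch and phonemes of the target voice to be converted to, and determining the temporal position in the target behavior data that corresponds to the input frame data; a state determination step of determining whether the phoneme of the target voice in the target behavior data at the temporal position determined in the alignment processing step is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme; a spectrum shape interpolation step of, when the state determination step determines that the phoneme is in the transition state, for the phoneme of the target voice in the target behavior data at the temporal position determined in the alignment processing step and from among the spectrum shapes of a target phoneme dictionary that holds, for each phoneme, spectrum shapes corresponding to a plurality of pitches, calculating the spectrum shape of the first phoneme by interpolating between two spectrum shapes that correspond to the first phoneme and to two pitches close to the pitch obtained from the input frame data, calculating the spectrum shape of the second phoneme by interpolating between two spectrum shapes that correspond to the second phoneme and to two pitches close to the pitch obtained from the input frame data, and calculating the spectrum shape corresponding to the input frame data from the spectrum shape of the first phoneme and the spectrum shape of the second phoneme; and a converted voice signal generation step of generating and outputting a converted voice signal based on the spectrum shape calculated in the spectrum shape interpolation step.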
【００２７】
請求項１３に記載の構成は、請求項１１または請求項１２に記載の音声変換方法において、前記スペクトル・シェイプ補間過程は、二つのスペクトル・シェイプを用いて補間を行うに際し、前記二つのスペクトル・シェイプ間における遷移関数を用いて補間処理を行うことを特徴としている。
【００２８】
請求項１４に記載の構成は、請求項１３に記載の音声変換方法において、前記遷移関数は、線形関数あるいは非線形関数として予め定義されていることを特徴としている。
【００２９】
請求項１５に記載の構成は、請求項１３に記載の音声変換方法において、前記二つのスペクトル・シェイプを周波数軸上でそれぞれ複数の領域に分け、各領域毎に前記遷移関数を定めることを特徴としている。
【００３０】
請求項１６に記載の構成は、請求項１３に記載の音声変換方法において、前記スペクトル・シェイプ補間過程は、前記第２の音素に対応させて前記遷移関数を定めることを特徴としている。
【００３２】
請求項１７に記載の構成は、請求項１３に記載の音声変換方法において、前記スペクトル・シェイプ補間過程は、前記二つのスペクトル・シェイプを周波数軸上でそれぞれ複数の領域に分け、各領域に属する前記二つのスペクトル・シェイプ上の実在の周波数およびマグニチュードの組に対し、前記遷移関数としての線形関数を用いた補間処理を前記複数の領域にわたって行うことを特徴としている。
【００３３】
請求項１８に記載の構成は、請求項１７に記載の音声変換方法において、前記スペクトル・シェイプ補間過程は、前記各領域に属する一方のスペクトル・シェイプの周波数である第１周波数及び当該第１周波数に対応する他方のスペクトル・シェイプの周波数である第２周波数を前記線形関数を用いて補間することにより補間周波数を算出する周波数補間過程と、前記各領域に属する一方のスペクトル・シェイプのマグニチュードである第１マグニチュードおよび当該第１マグニチュードに対応する他方のスペクトル・シェイプのマグニチュードである第２マグニチュードを前記線形関数を用いて補間するマグニチュード補間過程とを備えたことを特徴としている。
【００３４】
請求項１９に記載の構成は、請求項１１に記載の音声変換方法において、前記ターゲット挙動データには、さらにターゲット音声のアンプリチュードの時間的変化が規定され、前記アライメント処理過程において判別された前記ターゲット挙動データにおける時間的位置のアンプリチュードに応じて、前記スペクトル・シェイプ補間過程において算出されたスペクトル・シェイプのスペクトル傾きを補正するスペクトル傾き補正過程を備え、前記変換音声信号生成過程は、前記スペクトル傾き補正過程においてスペクトルの傾きが補正されたスペクトル・シェイプに基づいて変換音声信号を生成し出力することを特徴としている。
【００３５】
請求項２０に記載の構成は、請求項１２に記載の音声変換方法において、前記スペクトル・シェイプ補間過程において算出されたスペクトル・シェイプのスペクトル傾きと、前記入力フレームデータから得られるスペクトル・シェイプのスペクトル傾きとの比較結果に応じて、前記スペクトル・シェイプ補間過程において算出されたスペクトル・シェイプのスペクトル傾きを補正するスペクトル傾き補正過程を備えたことを特徴としている。
【００３７】
【発明の実施の形態】
次に図面を参照して本発明の好適な実施形態について説明する。
［Ａ］第１実施形態
まず、本発明の第１実施形態について説明する。
［１］音声変換装置の全体構成
図１に実施形態の音声変換装置（音声変換方法）をカラオケ装置に適用し、ものまねを行うことができるカラオケ装置として構成した場合の例である。
音声変換装置１０は、歌唱者の音声が入力され、歌唱信号を出力する歌唱信号入力部１１と、予め定めたコードブックに基づいて歌唱信号から各種特徴ベクトルを抽出する認識特徴分析部１２と、歌唱信号のＳＭＳ（Spectral Modeling Synthesis）分析を行って入力ＳＭＳフレームデータおよび有声／無声情報を出力するＳＭＳ分析部１３と、各種コードブックおよび各音素の隠れマルコフモデル（ＨＭＭ）を予め記憶した認識用音素辞書記憶部１４と、曲に依存したターゲット挙動データを記憶するターゲット挙動データ記憶部１５と、キー情報、テンポ情報、似具合パラメータ、変換パラメータなどの各種パラメータを制御するためのパラメータコントロール部１６と、ターゲット挙動データ記憶部に記憶されたターゲット挙動データ、キー情報およびテンポ情報に基づいてデータ変換を行い、変換された持続時間付音素表記情報、ピッチ情報およびアンプリチュード（振幅）情報を生成し出力するデータ変換部１７と、を備えて構成されている。
【００３８】
また、音声変換装置１０は、抽出された特徴ベクトル、各音素のＨＭＭおよび持続時間付音素表記情報に基づいて歌唱者が対象としている曲中のどの部分を歌っているかをビタビアルゴリズムを用いて求め、アライメント情報（＝ターゲット歌手が歌うべき曲中の歌唱位置および音素）を検出するアライメント処理部１８と、ターゲット歌手に依存するスペクトル・シェイプ情報を記憶するターゲット音素辞書記憶部１９と、アライメント情報、ターゲット挙動データのピッチ情報、ターゲット挙動データのアンプリチュード情報、入力ＳＭＳフレームデータおよびターゲット音素辞書のスペクトル・シェイプ情報に基づいてターゲットのフレームデータ（以下、ターゲットフレームデータという。）ＴＧＦＬを生成し出力するターゲット・デコーダ部２０と、パラメータコントロール部１６から入力される似具合パラメータ、ターゲットフレームデータＴＧＦＬおよびＳＭＳフレームデータＦSMSに基づいてモーフィング処理を行い、モーフィングフレームデータＭＦＬを出力するモーフィング処理部２１と、モーフィングフレームデータＭＦＬおよびパラメータコントロール部１６より入力された変換パラメータに基づいて変換処理を行い、変換フレームデータＭＭＦＬを出力する変換処理部２２と、を備えて構成されている。
【００３９】
さらに、音声変換装置１０は、変換フレームデータＭＭＦＬのＳＭＳ合成を行い、変換音声信号である波形信号ＳWAVを出力するＳＭＳ合成部２３と、ＳＭＳ分析部１３からの有声／無声情報に基づいて波形信号ＳWAVあるいは入力された歌唱信号ＳVのいずれかを選択的に出力する選択部２４と、パラメータコントロール部１６からのキー情報およびテンポ情報に基づいて音源部２５を駆動するシーケンサ２６と、選択部２４から出力された波形信号ＳWAVあるいは歌唱信号ＳVと音源部２５からの出力信号であるミュージック信号ＳMSCを加算して出力する加算部２７と、加算部２７の出力信号を増幅等行ってカラオケ信号として出力する出力部２８と、を備えて構成されている。
【００４０】
ここで、音声変換装置の各部の構成の説明に先立ち、ＳＭＳ分析について説明する。
ＳＭＳ分析では、まず標本化された音声波形に窓関数を乗じた音声波形（Frame）を切り出し、高速フーリエ変換（FFT）を行って得られる周波数スペクトルから、正弦波成分と残差成分とを抽出する。
この場合において、正弦波成分とは、基本周波数（Pitch）および基本周波数の倍数にあたる周波数（倍音）の成分をいう。
そして、正弦波成分として本実施形態では、基本周波数、各成分の平均アンプリチュードおよびスペクトル包絡をエンベロープとして保持する。
また、残差成分とは、入力信号から正弦波成分を除いた成分であり、本実施形態では周波数領域のデータとして保持する。
さらに得られた正弦波成分および残差成分で示される周波数分析データは、フレーム単位で記憶されることとなる。このとき、フレーム間の時間間隔は固定（例えば、５ｍｓ）となっているので、フレームをカウントすることによって時間を特定することができるようになっている。さらに各フレームには曲の冒頭からの経過時間に相当するタイムスタンプが付されている。
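The frame bookkeeping described above can be sketched as follows. This is a minimal illustration, assuming the 5 ms frame interval given as an example in the text; the class name `SmsFrame`, its field names, and the helper `frame_index_at` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field

FRAME_INTERVAL_MS = 5  # example frame spacing given in the text


@dataclass
class SmsFrame:
    """One analysis frame; the field names are illustrative, not from the patent."""
    index: int
    pitch_hz: float
    amplitude: float
    spectrum_shape: list = field(default_factory=list)  # (frequency, magnitude) pairs
    residual: list = field(default_factory=list)        # frequency-domain residual

    @property
    def timestamp_ms(self) -> int:
        # Elapsed time from the start of the song, derived from the frame count.
        return self.index * FRAME_INTERVAL_MS


def frame_index_at(time_ms: int) -> int:
    """Index of the frame that covers the given elapsed time."""
    return time_ms // FRAME_INTERVAL_MS
```

Because the interval is fixed, the timestamp and the frame count carry the same information, which is why counting frames is enough to locate a position in the song.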
【００４１】
［２］音声変換装置の各部の構成
［２．１］認識用音素辞書記憶部
認識用音素辞書記憶部１４は、コードブック及び音素の隠れマルコフモデルを記憶している。
記憶しているコードブックは、歌唱信号を各種特徴ベクトル（より具体的には、メルケプストラム、差分メルケプストラム、エネルギー、差分エネルギー、ボイスネス（有声音尤度））にベクトル量子化するために用いられる。
また、本音声変換装置においては、アライメント処理を行うために音声認識の一手法である隠れマルコフモデル（ＨＭＭ）を用いており、ＨＭＭパラメータ（初期状態分布、状態遷移確率行列、観測シンボル確率行列）を各音素（/a/、/i/等）について求めたものが記憶されている。
【００４２】
［２．２］ターゲット挙動データ記憶部
ターゲット挙動データ記憶部１５はターゲット挙動データを記憶しており、このターゲット挙動データは、音声変換を行う曲それぞれに対応した曲依存のデータである。
具体的には、対象となる曲を物まねの対象となるターゲット歌手が歌ったものから、ピッチ、アンプリチュードの時間的変化を抽出したもの（なお、これらを静的変化成分、ビブラート的変化成分に分離して抽出しておくと、後処理の自由度がより高くなる）および対象となる曲の歌詞に基づいて歌詞を音素列の並びに置き換えた音素表記に持続時間を含めた持続時間付音素表記が含まれる。
例えば、持続時間付音素表記は、音素表記/n//a//k//i/……に対し、各々の持続時間、すなわち、/n/の持続時間、/a/の持続時間、/k/の持続時間、/i/の持続時間、……が含められる。
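As an illustration of the duration-annotated phoneme notation, the sketch below stores each phoneme with its duration and looks up the phoneme at a given elapsed time. The millisecond durations are made-up values and `phoneme_at` is a hypothetical helper; the patent does not specify a storage format.

```python
from bisect import bisect_right
from itertools import accumulate

# (phoneme, duration in ms); the durations are invented for illustration.
NOTATION = [("n", 50), ("a", 300), ("k", 40), ("i", 250)]


def phoneme_at(notation, time_ms):
    """Return the phoneme being sung at time_ms, or None past the end."""
    # Cumulative end time of each phoneme: [50, 350, 390, 640] for NOTATION.
    ends = list(accumulate(duration for _, duration in notation))
    idx = bisect_right(ends, time_ms)
    if idx >= len(notation):
        return None
    return notation[idx][0]
```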
【００４３】
［２．３］ターゲット音素辞書記憶部
ターゲット音素辞書記憶部は、物まね対象となるターゲット歌手の各音素に対応したスペクトル情報であるターゲット音素辞書を記憶しており、ターゲット音素辞書には、何種類かのピッチに対応したスペクトル・シェイプおよびスペクトル補間を行うためのアンカーポイント情報が含まれている。
ここで、ターゲット音素辞書記憶部１９に記憶されている音声変換用辞書としてのターゲット音素辞書の作成について図２及び図３を参照して説明する。
［２．３．１］ターゲット音素辞書
ターゲット音素辞書は、各音素毎にいくつかのピッチに対応してスペクトル・シェイプと、アンカーポイント情報を有している。
図２にターゲット音素辞書の説明図を示す。
図２（ｂ）、（ｃ）、（ｄ）は、ある音素におけるピッチｆ0i+1、ｆ0i、ｆ0i-1にそれぞれ対応するスペクトル・シェイプを示したものであり、一つの音素に対して複数の（上述の例の場合、３個）スペクトル・シェイプがターゲット音素辞書には含まれる。このように複数のピッチに対応したスペクトル・シェイプをターゲット音素辞書として持つ理由は、一般的に同一人物が同一の音素を発声したとしても、ピッチに応じてスペクトル・シェイプの形状は多少変化するものだからである。
また、図２（ｂ）、（ｃ）、（ｄ）中、点線は周波数軸上で複数の領域に分ける際の境界線であり、各領域の境界の周波数がアンカーポイントであり、アンカーポイント情報として当該周波数がターゲット音素辞書に含まれている。
【００４４】
［２．３．２］ターゲット音素辞書の作成
次にターゲット音素辞書の作成について説明する。
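First, for each phoneme, the target singer is recorded vocalizing continuously from the lowest pitch to the highest pitch that he or she can produce.
More specifically, as shown in FIG. 2(a), the singer vocalizes while raising the pitch over time.
The recording is made in this way in order to calculate a more accurate spectrum shape.
That is, the formants that actually exist do not necessarily appear in a spectrum shape obtained by analyzing a sample vocalized at one fixed pitch. Therefore, to make the formants appear accurately in the spectrum shape being sought, all of the analysis results within the range around a given pitch that can be regarded as having the same spectrum shape must be used.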
【００４５】
同じスペクトル・シェイプと見なせるピッチの周波数範囲を同じセグメントであるとすると、ｉ番目のセグメントの中心周波数ｆ0iは、
【数１】

ここで、ｆ_i ^(low)、ｆ_i ^(high)は、ある音素のｉ番目のセグメントの境界のピッチ周波数であり、ｆ_i ^(low)が低ピッチ側のピッチ周波数を表し、ｆ_i ^(high)が高ピッチ側のピッチ周波数を表す。
同じセグメントとみなせるピッチにおけるスペクトル・シェイプの全ての値（周波数及びマグニチュードの組）を一つにまとめる。
より具体的には、例えば、図３（ａ）に示すように、同じセグメントとみなせるピッチにおけるスペクトル・シェイプを同一の周波数軸／マグニチュード軸上にプロットする。
次に周波数軸上で周波数範囲［０，ｆ_S／２］を等間隔（例えば３０［Ｈｚ］）に分割する。ここで、ｆ_Sは、サンプリング周波数である。
【００４６】
このときの分割幅をＢＷ［Ｈｚ］、分割数をＢ（バンド番号ｂ∈［０，Ｂ−１］）とし、各分割範囲内に含まれる実際の周波数及びマグニチュードの組を
（ｘn、ｙn）
ここで、ｎ＝０、……、Ｎ−１である。
とすると、当該バンドｂの中心周波数ｆb及び平均マグニチュードＭbは、それぞれ、
【数２】

と計算される。
このようにして求めた
（ｆb、Ｍb）
ここで、ｂ＝０、……、Ｂ−１である。
の組が最終的なあるピッチにおけるスペクトル・シェイプである。
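The band-averaging step above can be sketched as follows. The equation image for this step is missing from the text, so the formulas here are assumptions: the band center is taken as (b + 0.5)·BW and the magnitude as the arithmetic mean of the y_n values that fall in the band. The function name `band_average` is hypothetical, and bands that contain no points are simply skipped.

```python
def band_average(points, fs, bw=30.0):
    """Collapse pooled (frequency, magnitude) pairs into one pair per band.

    points: pairs pooled from all frames of one pitch segment
    fs:     sampling frequency in Hz; bands cover [0, fs/2]
    bw:     band width in Hz (30 Hz is the example given in the text)
    """
    n_bands = int((fs / 2) // bw)
    sums = {}  # band index -> (sum of magnitudes, count)
    for x, y in points:
        b = int(x // bw)
        if 0 <= b < n_bands:
            total, count = sums.get(b, (0.0, 0))
            sums[b] = (total + y, count + 1)
    # Assumed formulas: center = (b + 0.5) * bw, magnitude = mean of y_n.
    return [((b + 0.5) * bw, total / count)
            for b, (total, count) in sorted(sums.items())]
```

Pooling many frames before averaging is what lets formants that are missed at one fixed pitch show up in the final shape, as the text explains.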
【００４７】
より具体的には、図３（ａ）に示した周波数及びマグニチュードの組を用いてスペクトル・シェイプを算出した場合には、図３（ｃ）に示すようにターゲット音素辞書に格納すべき、フォルマントがはっきりと現れた良好なスペクトル・シェイプが得られる。
これに対し図３（ｂ）に示すように、同じセグメントとみなすことができないようなピッチにおけるスペクトル・シェイプの全ての値（周波数及びマグニチュードの組）を一つにまとめ、まとめた周波数及びマグニチュードの組を用いてスペクトル・シェイプを算出した場合には、図３（ｄ）に示すように、図３（ｃ）の場合と比較してフォルマントがあまりはっきりしないスペクトル・シェイプが得られることとなる。
【００４８】
［２．４］ターゲット・デコーダ部
［２．４．１］ターゲット・デコーダ部の構成
図４にターゲット・デコーダ部の構成ブロック図を示す。
ターゲット・デコーダ部２０は、歌唱者及びターゲット歌唱者のピッチ、アライメントおよび既に処理済みのデコードフレームからデコードされるべきフレームに対応する音素が安定状態にあるかあるいは他の音素に移行する遷移状態にあるかを決定する安定状態／遷移状態決定部３１と、スムーズなフレームデータの生成のために既に処理済みのデコードフレームを格納するフレームメモリ部３２と、安定状態／遷移状態決定部３１における決定結果に基づいてデコードされるべきフレームに対応する音素が安定状態にある場合には現在の音素のスペクトル・シェイプを現在のターゲットのピッチ付近の二つのスペクトル・シェイプから後述のスペクトル補間の方法を用いて第１補間スペクトル・シェイプＳＳ１として生成し、デコードされるべきフレームに対応する音素が遷移状態にある場合には遷移元の音素のスペクトル・シェイプを現在のターゲットのピッチ付近の二つのスペクトル・シェイプから後述のスペクトル補間の方法を用いて第２補間スペクトル・シェイプＳＳ２として生成する第１スペクトル補間部３３と、を備えて構成されている。
【００４９】
また、ターゲット・デコーダ部２０は、安定状態／遷移状態決定部３１における決定結果に基づいてデコードされるべきフレームに対応する音素が遷移状態にある場合に遷移先の音素のスペクトル・シェイプを現在のターゲットのピッチ付近の二つのスペクトル・シェイプから後述のスペクトル補間の方法を用いて第３補間スペクトル・シェイプＳＳ３として生成する第２スペクトル補間部３４と、遷移元の音素及び遷移先の音素並びに歌唱者のピッチ、ターゲット歌唱者のピッチ及びスペクトル・シェイプなどを考慮に入れて遷移元の音素から遷移先の音素に遷移させる場合の遷移のさせかたを規定する遷移関数を発生する遷移関数発生部３５と、安定状態／遷移状態決定部３１における決定結果に基づいてデコードされるべきフレームに対応する音素が遷移状態にある場合に遷移関数発生部３５において発生された遷移関数並びに第２補間スペクトル・シェイプＳＳ２及び第３補間スペクトル・シェイプＳＳ３の二つのスペクトル・シェイプから後述のスペクトル補間の方法を用いて第４スペクトル・シェイプＳＳ４として生成する第３スペクトル補間部３６と、を備えて構成されている。
【００５０】
さらに、ターゲット・デコーダ部２０は、出力されるデコードフレームがよりリアルであるようにターゲットのピッチ及びフレームメモリ部３２に格納されている処理済みのデコードフレームに基づいてスペクトル・シェイプの微細構造を時間軸に沿って変化させ（例えば、マグニチュードを時間とともに少しずつ変化させる）、時間的変化が付加されたスペクトル・シェイプＳＳｔを出力する時間的変化付加部３７と、時間的変化付加部３７により時間的変化が付加されたスペクトル・シェイプＳＳｔをさらにリアルにするためにターゲットのアンプリチュードに対応させてスペクトル・シェイプＳＳｔのスペクトル傾きを補正してターゲットスペクトル・シェイプＳＳＴＧとして出力するスペクトル傾き補正部３８と、アライメント情報、ターゲットのピッチ及びアンプリチュードに基づいて出力するデコードフレームに対応するターゲットのピッチおよびアンプリチュードを算出するターゲットピッチ／アンプリチュード算出部３９と、を備えて構成されている。
【００５１】
［２．４．２］ターゲット・デコーダ部の詳細動作
ここで、ターゲット・デコーダ部２０の詳細動作について説明する。
この場合において、よりスムーズなフレームデータの生成の為、ターゲット・デコーダ部２０が出力すべきフレームデータ（デコードフレーム；ターゲットスペクトル・シェイプ）はフレームメモリ部に記憶される。
ターゲット・デコーダ部２０への入力情報としては、歌唱音声の情報（ピッチ、アンプリチュード、スペクトル・シェイプ、アライメント）、ターゲット挙動データ（ピッチ、アンプリチュード、持続時間付音素表記）、ターゲット音素辞書（スペクトル・シェイプ）が含まれている。
【００５２】
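The stable state / transition state determination unit 31 then determines, from the pitches of the singer and the target singer, the alignment information, and past decoded frames, whether the frame to be decoded is in a stable state (a state in which the phoneme can be identified as one particular phoneme, rather than being partway through a transition (change) from one phoneme to another), and notifies the first spectrum interpolation unit 33 and the second spectrum interpolation unit 34 of the result.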
第１スペクトル補間部３３は、安定状態／遷移状態決定部３１の通知に基づいて、デコードされるべきフレームが安定状態である場合には、現在の音素のスペクトル・シェイプを現在のターゲットのピッチ付近の２つのスペクトル・シェイプから、後述するスペクトル補間の方法を用いて補間されたスペクトル・シェイプである第１補間スペクトル・シェイプＳＳ１を算出し時間的変化付加部３７に出力する。
【００５３】
また、第１スペクトル補間部３３は、安定状態／遷移状態決定部３１の通知に基づいて、デコードされるべきフレームが遷移状態である場合には、遷移元の音素（第１の音素から第２の音素に遷移途中の場合における、第１の音素）のスペクトル・シェイプを現在のターゲットのピッチ付近の２つのスペクトル・シェイプから、後述するスペクトル補間の方法を用いて補間されたスペクトル・シェイプである第２補間スペクトル・シェイプＳＳ２を算出し、第３スペクトル補間部３６に出力する。
一方、第２スペクトル補間部３４は、安定状態／遷移状態決定部３１の通知に基づいて、デコードされるべきフレームが遷移状態である場合に、遷移先の音素（第１の音素から第２の音素に遷移途中の場合における、第２の音素）のスペクトル・シェイプを現在のターゲットのピッチ付近の２つのスペクトル・シェイプから、後述するスペクトル補間の方法を用いて補間されたスペクトル・シェイプである第３補間スペクトル・シェイプを算出し、第３スペクトル補間部３６に出力する。
【００５４】
これらの結果、第３スペクトル補間部３６は、安定状態／遷移状態決定部３１の通知に基づいて、デコードされるべきフレームが遷移状態である場合に、第２補間スペクトル・シェイプおよび第２スペクトル補間処理において算出された第３補間スペクトル・シェイプに基づいて後述するスペクトル補間の方法を用いて補間し、第４スペクトル・シェイプＳＳ４を算出し、時間的変化付加部３７に出力する。この第４スペクトル・シェイプＳＳ４は、二つの異なる音素の中間的な音素のスペクトル・シェイプに相当するものとなる。この場合において、第４スペクトル・シェイプＳＳ４を求めるべく補間を行う際には、単純にある時間に亘って対応する領域（その境界点はアンカー・ポイントで示される。）内で線形に補間を行うのではなく、遷移関数発生部３５において生成される遷移関数に従ってスペクトル補間を行うことにより、より現実に近いスペクトル補間を行うことができる。
【００５５】
例えば、遷移関数発生部３５は、音素/a/から音素/e/に変化する際には、１０フレームかけて対応する領域内（後述するアンカー・ポイント間）のスペクトルを時間的に線形に変化させ、また、音素/a/から音素/u/に変化する際には、５フレームかけて変化するが、ある周波数帯域内（後述するアンカー・ポイント間）のスペクトルについては、線形に変化させ、他の周波数帯域内（後述するアンカー・ポイント間）のスペクトルについては、指数関数的に変化させることにより、自然な音素間の移動をスムーズに実現することができる。
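The two kinds of transition described above can be sketched as functions that map a frame count to an interpolation position between 0 (source phoneme) and 1 (destination phoneme). The text only says "linear" and "exponential"; the normalized exponential form and the `rate` parameter below are assumptions, and both function names are hypothetical.

```python
import math


def linear_position(frame, n_frames):
    """Interpolation position after `frame` of `n_frames` frames, changing linearly."""
    return frame / n_frames


def exponential_position(frame, n_frames, rate=3.0):
    """Assumed exponential curve, normalized so it runs from 0 to 1."""
    t = frame / n_frames
    return math.expm1(rate * t) / math.expm1(rate)


# /a/ -> /e/: every band changes linearly over 10 frames.
a_to_e = [linear_position(k, 10) for k in range(11)]
# /a/ -> /u/: 5 frames; one band linear, another band exponential.
a_to_u_band0 = [linear_position(k, 5) for k in range(6)]
a_to_u_band1 = [exponential_position(k, 5) for k in range(6)]
```

Keeping one position curve per frequency band is what allows different regions between anchor points to move at different speeds during the same transition.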
このため、遷移関数発生処理においては、音素、ピッチに基づくとともに、歌唱者、ターゲットのピッチやスペクトル・シェイプ等を考慮に入れて、遷移関数を発生させる。
この場合において、後述するようにターゲット音素辞書の中にこれらの情報を含めてしまうように構成することも可能である。
次に時間的変化付加部３７は、入力された第１補間スペクトル・シェイプＳＳ１または第４補間スペクトル・シェイプＳＳ４に対し、ターゲット・デコーダ部２０より出力されるターゲットスペクトル・シェイプ（＝デコードフレーム）がより実在するフレームと近似するようにターゲットのピッチおよび過去のデコードフレームに基づいて、スペクトル・シェイプの微細構造を変化させ、時間的変化付加スペクトル・シェイプＳＳｔとしてスペクトル傾き補正部３８に出力する。
【００５６】
例えば、スペクトル・シェイプの微細構造としてのマグニチュードを時間的に少しづつ変化させるようにする。
スペクトル傾き補正部３８は、入力された時間的変化付加スペクトル・シェイプＳＳｔに対し、出力されるターゲットスペクトル・シェイプ（＝デコードフレーム）ＳＳＴＧがより実在するフレームと近似するようにターゲットのアンプリチュードに応じたスペクトル傾きを有するように補正を行い、補正後のスペクトル・シェイプをターゲットスペクトル・シェイプＳＳＴＧとして出力する。
スペクトル傾き補正処理としては、出力する音量が大きいときは一般的にスペクトル・シェイプの高域が豊か（リッチ）であり、音量が小さいときはスペクトル・シェイプの高域が乏しい（＝こもったような音）ことをシミュレートするために、スペクトル・シェイプの高域部の形状を音量に応じて変化させてやるのである。
そして、スペクトル傾き補正して得られるターゲットスペクトル・シェイプＳＳＴＧをフレームメモリ部３２に格納することとなる。
一方、ターゲットピッチ／アンプリチュード算出部３９は、出力するターゲットスペクトル・シェイプＳＳＴＧに対応するピッチＴＧＰ、アンプリチュードＴＧＡを算出し出力する。
【００５７】
［２．４．３］スペクトル補間処理
ここで、図５を参照してターゲット・デコーダ部のスペクトル補間処理について説明する。
［２．４．３．１］スペクトル補間処理の概要
まず、安定状態／遷移状態決定部３１における決定結果に基づいてデコードされるべきフレームに対応する音素が安定状態にある場合には、ターゲットデコーダ部２０は、当該音素に対応する二つのスペクトル・シェイプをターゲットの音素辞書から取り出し、また、デコードされるべきフレームに対応する音素が遷移状態にある場合には、遷移元の音素に対応する二つのスペクトル・シェイプをターゲットの音素辞書から取り出す。
図５（ａ）及び図５（ｂ）は、安定状態にある音素あるいは遷移元の音素に対応させてターゲット音素辞書から取り出された二つのスペクトル・シェイプであり、この二つのスペクトル・シェイプのピッチは異なっている。
例えば、求めたいスペクトル・シェイプがピッチ１４０［Ｈｚ］、音素/a/のものだとすると、図５（ａ）のスペクトル・シェイプは、ピッチ１００［Ｈｚ］の音素/a/に対応するものであり、図５（ｂ）のスペクトル・シェイプは、ピッチ２００［Ｈｚ］の音素/a/に対応するものである。すなわち、求めたいスペクトル・シェイプのピッチを挟むような前後のピッチでそれぞれ最も近いピッチを有する二つのスペクトル・シェイプであって、かつ、求めたいスペクトル・シェイプと同一の音素に対応する二つのスペクトル・シェイプを用いる。
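Selecting the two dictionary entries that bracket the desired pitch can be sketched as below, using the 100 Hz / 140 Hz / 200 Hz example from the text. How the interpolation position is derived from the pitches is not stated in the source; the linear-in-frequency position computed here is an assumption, as is the clamping behavior when the desired pitch lies outside the dictionary range. Both function names are hypothetical.

```python
from bisect import bisect_right


def bracket_pitches(dictionary_pitches, target_hz):
    """Return the nearest dictionary pitches below and above target_hz.

    dictionary_pitches must be sorted in ascending order and hold at least
    two entries. Outside the covered range the two outermost entries are used.
    """
    if len(dictionary_pitches) < 2:
        raise ValueError("need at least two dictionary pitches")
    hi = bisect_right(dictionary_pitches, target_hz)
    hi = min(max(hi, 1), len(dictionary_pitches) - 1)
    return dictionary_pitches[hi - 1], dictionary_pitches[hi]


def pitch_position(low_hz, high_hz, target_hz):
    """Assumed interpolation position: linear in frequency, clamped to [0, 1]."""
    x = (target_hz - low_hz) / (high_hz - low_hz)
    return min(max(x, 0.0), 1.0)
```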
【００５８】
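The two spectrum shapes obtained are interpolated by the first spectrum interpolation unit 33 using the spectrum interpolation method, yielding the desired spectrum shape shown in FIG. 5(e) (corresponding to the first spectrum shape SS1 or the second spectrum shape SS2). When the stable state / transition state determination unit 31 has determined that the phoneme corresponding to the frame to be decoded is in the stable state, the spectrum shape obtained is output as is to the temporal change adding unit 37.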
さらに安定状態／遷移状態決定部３１における決定結果に基づいてデコードされるべきフレームに対応する音素が遷移状態にある場合には、遷移先の音素に対応する二つのスペクトル・シェイプをターゲットの音素辞書から取り出す。
図５（ｃ）及び図５（ｄ）は、遷移先の音素に対応させてターゲット音素辞書から取り出された二つのスペクトル・シェイプであり、この二つのスペクトル・シェイプのピッチも図５（ａ）及び図５（ｂ）の場合と同様に異なっている。
そして得られた二つのスペクトル・シェイプを第２スペクトル補間部３４で補間することにより、図５（ｆ）に示すような所望のスペクトル・シェイプ（第３スペクトル・シェイプＳＳ３に相当）を得る。
さらにまた、安定状態／遷移状態決定部３１における決定結果に基づいてデコードされるべきフレームに対応する音素が遷移状態にある場合には、図５（ｅ）及び図５（ｆ）に示したスペクトル・シェイプを第３スペクトル補間部３６でスペクトル補間の方法で補間することにより、図５（ｇ）に示すような所望のスペクトル・シェイプ（第４スペクトル・シェイプＳＳ４に相当）を得る。
【００５９】
［２．４．３．２］スペクトル補間手法
ここで、スペクトル補間の手法について詳細に説明する。
スペクトル補間を用いる目的は、以下の二つに大別される。
（１）二つの時間的に連続するフレームのスペクトル・シェイプを補間し、時間的に二つのフレームの間にあるフレームのスペクトル・シェイプを求める。
（２）二つの異なる音のスペクトル・シェイプを補間し、中間的な音のスペクトル・シェイプを求める。
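As shown in FIG. 6(a), the two spectrum shapes on which the interpolation is based (hereinafter, for convenience, the first spectrum shape SS11 and the second spectrum shape SS12; these are entirely separate from the first spectrum shape SS1 and the second spectrum shape SS2 described above) are each divided on the frequency axis into a plurality of regions Z1, Z2, ....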
そして、各領域を区切る境界の周波数を各スペクトル・シェイプ毎にそれぞれ以下のように設定する。この設定した境界の周波数をアンカー・ポイントと呼んでいる。
第１スペクトル・シェイプＳＳ１１：ＲＢ1,1、ＲＢ2,1、……、ＲＢN,1
第２スペクトル・シェイプＳＳ１２：ＲＢ1,2、ＲＢ2,2、……、ＲＢM,2
【００６０】
図６（ｂ）に線形スペクトル補間の説明図を示す。
線形スペクトル補間は、補間位置により定義され、補間位置Ｘは、０から１までの範囲である。この場合において、補間位置Ｘ＝０は、第１スペクトル・シェイプＳＳ１１そのもの、補間位置Ｘ＝１は第２スペクトル・シェイプＳＳ１２そのものに相当する。
図６（ｂ）は、補間位置Ｘ＝０．３５の場合である。
また、図６（ｂ）において、縦軸上の白丸（○）は、スペクトル・シェイプを構成する周波数およびマグニチュードの組のそれぞれを示す。従って、紙面垂直方向にマグニチュード軸が存在すると考えるのが適当である。
補間位置Ｘ＝０の軸上の第１スペクトル・シェイプＳＳ１１の注目するある領域Ｚｉに対応するアンカー・ポイントが、
ＲＢi,1およびＲＢi+1,1
であり、当該領域Ｚｉに属する具体的な周波数およびマグニチュードの組のうちいずれかの組の周波数＝ｆi1であり、マグニチュード＝Ｓ1（ｆi1）であるものとする。
補間位置Ｘ＝１の軸上の第２スペクトル・シェイプＳＳ１２の注目するある領域Ｚｉに対応するアンカー・ポイントが、
ＲＢi,2およびＲＢi+1,2
であり、当該領域Ｚｉに属する具体的な周波数およびマグニチュードの組のうちいずれかの組の周波数＝ｆi2であり、マグニチュード＝Ｓ2（ｆi2）であるものとする。
ここで、スペクトル遷移関数ｆtrans1（ｘ）及びスペクトル遷移関数ｆtrans2（ｘ）を求める。
【００６１】
例えば、これらを最も簡単な線形関数で表すとすると、以下のようになる。
ｆtrans1（ｘ）＝ｍ1・ｘ＋ｂ1
ｆtrans2（ｘ）＝ｍ2・ｘ＋ｂ2
ここで、
ｍ1＝ＲＢi,2−ＲＢi,1
ｂ1＝ＲＢi,1
ｍ2＝ＲＢi+1,2−ＲＢi+1,1
ｂ2＝ＲＢi+1,1
である。
次に第１スペクトル・シェイプＳＳ１１上に実在する周波数およびマグニチュードの組に対応する補間スペクトル・シェイプ上の周波数およびマグニチュードの組を求める。
【００６２】
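First, for a frequency and magnitude pair that actually exists on the first spectrum shape SS11, specifically frequency $f_{i1}$ and magnitude $S_1(f_{i1})$, the corresponding frequency $f_{i1,2}$ and magnitude $S_2(f_{i1,2})$ on the second spectrum shape are calculated as follows.
[Equation 3]
$$f_{i1,2} = \frac{W_2}{W_1}\,(f_{i1} - RB_{i,1}) + RB_{i,2}$$
where
$W_1 = RB_{i+1,1} - RB_{i,1}$
$W_2 = RB_{i+1,2} - RB_{i,2}$.
To calculate the magnitude $S_2(f_{i1,2})$, let the nearest frequencies on either side of $f_{i1,2}$, among the frequency and magnitude pairs that actually exist on the second spectrum shape SS12, be denoted with the suffixes (+) and (-). Then
[Equation 4]
$$S_2(f_{i1,2}) = S_2(f_{i2}^{(-)}) + \frac{S_2(f_{i2}^{(+)}) - S_2(f_{i2}^{(-)})}{f_{i2}^{(+)} - f_{i2}^{(-)}}\,\bigl(f_{i1,2} - f_{i2}^{(-)}\bigr)$$
[0063]
From the above, with the interpolation position $x$, the frequency $f_{i1,x}$ and magnitude $S_x(f_{i1,x})$ on the interpolated spectrum shape that correspond to a pair actually existing on the first spectrum shape SS11 are given by the following equations.
[Equation 5]
$$f_{i1,x} = f_{i1} + (f_{i1,2} - f_{i1})\cdot x$$
$$S_x(f_{i1,x}) = S_1(f_{i1}) + \{S_2(f_{i1,2}) - S_1(f_{i1})\}\cdot x$$
The same calculation is carried out for every frequency and magnitude pair on the first spectrum shape SS11.
Next, the frequency and magnitude pairs on the interpolated spectrum shape that correspond to the pairs actually existing on the second spectrum shape SS12 are obtained.
[0064]
First, for a frequency and magnitude pair that actually exists on the second spectrum shape SS12, specifically frequency $f_{i2}$ and magnitude $S_2(f_{i2})$, the corresponding frequency $f_{i2,1}$ and magnitude $S_1(f_{i2,1})$ on the first spectrum shape are calculated as follows.
[Equation 6]
$$f_{i2,1} = \frac{W_1}{W_2}\,(f_{i2} - RB_{i,2}) + RB_{i,1}$$
where
$W_1 = RB_{i+1,1} - RB_{i,1}$
$W_2 = RB_{i+1,2} - RB_{i,2}$.
To calculate the magnitude $S_1(f_{i2,1})$, let the nearest frequencies on either side of $f_{i2,1}$, among the frequency and magnitude pairs that actually exist on the first spectrum shape SS11, be denoted with the suffixes (+) and (-). Then
[Equation 7]
$$S_1(f_{i2,1}) = S_1(f_{i1}^{(-)}) + \frac{S_1(f_{i1}^{(+)}) - S_1(f_{i1}^{(-)})}{f_{i1}^{(+)} - f_{i1}^{(-)}}\,\bigl(f_{i2,1} - f_{i1}^{(-)}\bigr)$$
From the above, with the interpolation position $x$, the frequency $f_{i2,x}$ and magnitude $S_x(f_{i2,x})$ on the interpolated spectrum shape that correspond to a pair actually existing on the second spectrum shape SS12 are given by the following equations.
[Equation 8]
$$f_{i2,x} = f_{i2} + (f_{i2} - f_{i2,1})\cdot(x - 1)$$
$$S_x(f_{i2,x}) = S_2(f_{i2}) + \{S_2(f_{i2}) - S_1(f_{i2,1})\}\cdot(x - 1)$$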
【００６５】
同様にして、第２スペクトル・シェイプＳＳ１２上の全ての周波数およびマグニチュードの組に対して算出する。
上述したように第１スペクトル・シェイプＳＳ１１上に実在する周波数ｆi1およびマグニチュードＳ1（ｆi1）の組に対応する補間スペクトル・シェイプ上の周波数＝ｆi1,x、マグニチュード＝Ｓx（ｆi1,x）並びに第２スペクトル・シェイプ上に実在する周波数ｆi2およびマグニチュードＳ2（ｆi2）の組に対応する補間スペクトル・シェイプ上の周波数ｆi2,xおよびマグニチュードＳx（ｆi2,x）の全ての算出結果を周波数順に並び替えることにより、補間スペクトル・シェイプを求める。
これらを全ての領域Ｚ1 、Ｚ2、……について行い、全周波数帯域の補間スペクトル・シェイプを算出する。
上述の例においては、スペクトル遷移関数ｆtrans1（ｘ）、ｆtrans2（ｘ）を線形な関数としたが、二次関数、指数関数など非線形な関数として定義あるいは関数に対応する変化をテーブルとして用意するように構成することも可能である。
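The linear interpolation procedure for a single region can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the frequency mapping between regions is assumed to be a proportional rescaling by the region widths W1 and W2, magnitudes at mapped frequencies are assumed to be linearly interpolated between the two nearest real points, and all function names are hypothetical.

```python
def map_frequency(f, src_lo, src_hi, dst_lo, dst_hi):
    """Rescale f from the source region to the destination region."""
    return (dst_hi - dst_lo) / (src_hi - src_lo) * (f - src_lo) + dst_lo


def magnitude_at(shape, f):
    """Magnitude at f, linearly interpolated between neighbouring real points.

    shape: (frequency, magnitude) pairs sorted by strictly increasing frequency.
    """
    if f <= shape[0][0]:
        return shape[0][1]
    if f >= shape[-1][0]:
        return shape[-1][1]
    for (f_lo, m_lo), (f_hi, m_hi) in zip(shape, shape[1:]):
        if f_lo <= f <= f_hi:
            return m_lo + (m_hi - m_lo) * (f - f_lo) / (f_hi - f_lo)


def interpolate_region(shape1, shape2, anchors1, anchors2, x):
    """Interpolated points for one region at position x (0 = shape1, 1 = shape2).

    shape1, shape2: real points of each shape that lie inside the region
    anchors1, anchors2: (lower, upper) anchor-point frequencies of the region
    """
    lo1, hi1 = anchors1
    lo2, hi2 = anchors2
    out = []
    for f1, m1 in shape1:  # project shape-1 points toward shape 2
        f12 = map_frequency(f1, lo1, hi1, lo2, hi2)
        m12 = magnitude_at(shape2, f12)
        out.append((f1 + (f12 - f1) * x, m1 + (m12 - m1) * x))
    for f2, m2 in shape2:  # project shape-2 points toward shape 1
        f21 = map_frequency(f2, lo2, hi2, lo1, hi1)
        m21 = magnitude_at(shape1, f21)
        out.append((f2 + (f2 - f21) * (x - 1), m2 + (m2 - m21) * (x - 1)))
    return sorted(out)  # reorder all results by frequency
```

Running this for every region and concatenating the results gives the full-band interpolated shape. A non-linear transition function would replace the plain `x` with a per-region function of `x`.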
【００６６】
また、アンカー・ポイントに応じてそれらの遷移関数を変更してやることによりより現実に近いスペクトル補間を行うことが可能である。
この場合、ターゲット音素辞書の内容は、アンカー・ポイントに付随した遷移関数情報を含めるように構成すればよい。
さらに遷移関数情報としては、遷移先の音素に応じて設定するようにすればよい。すなわち、遷移先の音素が音素Ｂの場合には、遷移関数Ｙを用い、遷移先の音素が音素Ｃの場合には、遷移関数Ｚを用いる等のように設定し、設定状態を音素辞書に組み込むようにすればよい。
さらに歌唱者、ターゲット歌手のピッチやスペクトル・シェイプ等を考慮に入れ、リアルタイムに最適な遷移関数を設定するようにしても良い。
【００６７】
［３］全体動作
次に音声変換装置１０の全体動作を順を追って説明する。
まず、歌唱信号入力部１１により、信号入力処理が行われ、歌唱者の歌った信号を入力する。
続いて認識特徴分析部１２により認識特徴分析処理が行われ、歌唱信号入力部１１を介して入力された歌唱信号ＳVを以降のアライメント処理部１８へ入力すべく、認識用音素辞書に含まれるコードブックに基づいてベクトル量子化を行い、各特徴ベクトルＶＣ（メルケプストラム、差分メルケプストラム、エネルギー、差分エネルギー、ボイスネス（有声音尤度）など）を算出する。
なお、差分メルケプストラムとは、前フレームと現在のフレームのメルケプストラムの差分値を示す。差分エネルギーとは、前フレームと現在のフレームの信号エネルギーの差分値を示す。ボイスネスとは、ゼロ交差数、ピッチ検出を行うときに求まる検出誤差等から総合的に求められる値、あるいは、総合的に重み付けして求められる値であり、有声音らしさを表す数値である。
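The difference features defined above can be sketched as below. The delta is the current frame's value minus the previous frame's value, as the text states. The zero-crossing count is one of the inputs the text names for the voiceness measure; how the inputs are weighted into a single voiceness value is not specified in the source, so that combination is left out. The function names are hypothetical.

```python
def delta(previous, current):
    """Element-wise difference between the current and the previous frame."""
    return [c - p for p, c in zip(previous, current)]


def signal_energy(samples):
    """Sum of squared sample values for one frame."""
    return sum(s * s for s in samples)


def zero_crossings(samples):
    """Number of sign changes between consecutive samples (zeros are ignored)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
```

A delta energy value is then `signal_energy(current_frame) - signal_energy(previous_frame)`, and a delta mel-cepstrum is `delta` applied to two consecutive mel-cepstrum vectors.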
【００６８】
一方、ＳＭＳ分析部１３は、歌唱信号入力部１１を介して入力された歌唱信号ＳVをＳＭＳ分析して、ＳＭＳフレームデータＦSMSを得て、ターゲット・デコーダ部２０およびモーフィング処理部２１に出力する。具体的には、ピッチに応じた窓幅で切り出した波形に対して、
（１）高速フーリエ変換（ＦＦＴ）処理
（２）ピーク検出処理
（３）有声／無声判別処理およびピッチ検出処理
（４）ピーク連携処理
（５）正弦波成分属性ピッチ、アンプリチュード、スペクトル・シェイプの計算処理
（６）残差成分算出処理
が行われる。
アライメント処理部１８は、認識特徴分析部１２により出力された各種特徴ベクトルＶＣ、認識用音素辞書１４からの各音素のＨＭＭおよびターゲット挙動データに含まれる持続時間付音素表記情報より、歌唱者が対象としている曲中のどの部分を歌っているかをビタビアルゴリズムを用いて求める。
これにより、アライメント情報が求まり、この結果、ターゲット歌手が歌うべきピッチ、アンプリチュード、音素を検出することができる。
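A compact Viterbi decoder illustrates how the most likely position in the phoneme sequence can be recovered from quantized feature symbols. This sketch simplifies the setup in the text: each phoneme is modeled as a single state, observations are vector-quantization codebook indices, and all probabilities are given as logarithms. The model values in the test are invented, and the function name `viterbi` is illustrative.

```python
def viterbi(observations, log_init, log_trans, log_emit):
    """Most likely state sequence for a discrete-observation HMM.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[s][o]  : log probability of state s emitting symbol o
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[s][observations[0]] for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        prev = score
        step, score = [], []
        for s in range(n_states):
            # Best predecessor of state s at this time step.
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            step.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[s][obs])
        backpointers.append(step)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]
```

With a left-to-right transition matrix built from the duration-annotated phoneme notation, the decoded state index at the current frame tells the system which phoneme of the song the singer has reached.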
【００６９】
この処理のなかで、歌唱者がある音素をターゲット歌唱者に比較して長く歌った場合には、持続時間付音素表記情報の持続時間を超えてある音素を歌っていると判断し、ループ処理に入る旨をアライメント情報に含めて出力することとなる。
これらの結果、ターゲット・デコーダ部２０は、アライメント処理部１８により出力されたアライメント情報およびターゲット音素辞書１９に含まれるスペクトル情報よりターゲット歌手のフレーム情報（ピッチ、アンプリチュード、スペクトル・シェイプ）であるターゲットスペクトル・シェイプＳＳＴＧ、ピッチＴＧＰ、アンプリチュードＴＧＡを算出し、ターゲットフレームデータＴＧＦＬとしてモーフィング処理部２１に出力する。
モーフィング処理部２１は、ターゲット・デコーダ部２０から出力されたターゲットフレームデータＴＧＦＬおよび歌唱信号ＳVに対応するＳＭＳフレームデータＦSMS並びにパラメータコントロール部１６から入力された似具合パラメータに基づいてモーフィング処理を行い、似具合パラメータに応じた所望のスペクトル・シェイプ、ピッチ、アンプリチュードを有するモーフィングフレームデータＭＦＬを生成し、変換処理部２２に出力する。
【００７０】
変換処理部２２は、パラメータコントロール部１６からの変換パラメータに従って、モーフィングフレームデータＭＦＬを変形し、変換フレームデータＭＭＦＬとしてＳＭＳ合成部２３に出力する。この場合において、出力アンプリチュードに応じたスペクトル傾き補正を行うことにより、よりリアルな出力音声を得ることが可能となる。
また、変換処理部２２で行う処理としては、例えば偶数倍音をなくす等の処理があげられる。
ＳＭＳ合成部２３は、変換フレームデータＭＭＦＬをフレームスペクトルに変換し、逆高速フーリエ変換（ＩＦＦＴ）、オーバーラップ処理および加算処理を行い、波形信号ＳWAVとして選択部２４に出力する。
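Based on the voiced/unvoiced information from the SMS analysis unit 13, the selection unit 24 outputs the singing signal SV unchanged to the addition unit 27 when the singer's voice corresponding to the singing signal SV is unvoiced, and outputs the waveform signal SWAV to the addition unit 27 when that voice is voiced.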
【００７１】
これらの動作と並行して、シーケンサ２６は、パラメータコントロール部１６の制御下で音源２５を駆動してミュージック信号ＳMSCを生成して加算部２７に出力する。
加算部２７は、選択部２４から出力された波形信号ＳWAVあるいは歌唱信号Ｓvと音源２５から出力されたミュージック信号ＳMSCとを適当な割合で混合して加算し、出力部２８に出力する。
出力部２８は、加算部２７の出力信号に基づいてカラオケ信号（音声＋ミュージック）を出力することとなる。
【００７２】
［Ｂ］第２実施形態
次に、本発明の第２実施形態について説明する。本第２実施形態が第１実施形態と異なる点は、第１実施形態のターゲット・デコーダ部においては、モーフィング処理部に出力されるスペクトル・シェイプは、ターゲット挙動データに含まれるターゲットのピッチ、アンプリチュードに基づいて算出していたが、本第２実施形態においては、歌唱者のピッチ及びスペクトル傾き情報に基づいて算出している点である。
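Accordingly, the SMS analysis unit of the second embodiment must calculate the spectral tilt in addition to the pitch, amplitude, and spectrum shape as sinusoidal component attributes. Apart from the target decoder unit, the configuration of each unit is the same as in the first embodiment.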
【００７３】
［１］ターゲット・デコーダ部
図７に第２実施形態のターゲット・デコーダ部の構成ブロック図を示す。図７において図４の第１実施形態と同様の部分には同一の符号を付し、その詳細な説明を省略する。
ターゲット・デコーダ部５０は、安定状態／遷移状態決定部３１と、フレームメモリ部３２と、第１スペクトル補間部３３と、第２スペクトル補間部３４と、遷移関数発生部３５と、第３スペクトル補間部３６と、出力されるデコードフレームがよりリアルであるように歌唱者のピッチ及びフレームメモリ部３２に格納されている処理済みのデコードフレームに基づいてスペクトル・シェイプの微細構造を時間軸に沿って変化させる（例えば、マグニチュードを時間とともに少しずつ変化させる）時間的変化付加部５７と、時間的変化付加部５７により時間的変化が付加されたスペクトル・シェイプをさらにリアルにするために歌唱者のスペクトル傾きと既に生成されたスペクトル・シェイプの傾きを比較し、スペクトル・シェイプのスペクトル傾きを補正して補正後のスペクトル・シェイプをターゲットスペクトル・シェイプＳＳＴＧとして出力し、フレームメモリ部３２にターゲットスペクトル・シェイプＳＳＴＧを格納するスペクトル傾き補正部５８と、ターゲットピッチ／アンプリチュード算出部３９と、を備えて構成されている。
【００７４】
［２］第２実施形態の動作
本第２実施形態の動作は全体としては、第１実施形態と同様であるので、主要部の動作のみを説明する。
ターゲット・デコーダ部５０の時間的変化付加部５７は、出力されるデコードフレームであるターゲットフレームがよりリアルであるように歌唱者のピッチ及びフレームメモリ部３２に格納されている処理済みのデコードフレームに基づいてスペクトル・シェイプ（第１スペクトル・シェイプＳＳ１あるいは第４スペクトル・シェイプＳＳ４）の微細構造を時間軸に沿って変化させて（例えば、マグニチュードを時間とともに少しずつ変化させて）、スペクトル傾き補正部５８に出力する。
スペクトル傾き補正部５８は、ターゲット・デコーダ部５０から出力するターゲットスペクトル・シェイプＳＳＴＧをさらにリアルにするために歌唱者のスペクトル傾きと既に生成されたスペクトル・シェイプの傾きを比較し、スペクトル・シェイプのスペクトル傾きを補正して補正後のスペクトル・シェイプをターゲットスペクトル・シェイプＳＳＴＧとして出力し、フレームメモリ部３２にターゲットスペクトル・シェイプＳＳＴＧを格納する。
より具体的には、歌唱者のスペクトル傾きと生成されたターゲットのスペクトル・シェイプのスペクトル傾きの差であるスペクトル傾き補正値（Tilt Correction値）を算出し、図８に示すように、スペクトル傾き補正値に応じた特性を有するスペクトル傾き補正フィルタを生成されたターゲットのスペクトル・シェイプに対してかける。
これにより、より自然なスペクトル・シェイプを得ることが可能となる。
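The tilt correction of the second embodiment can be sketched as follows. The correction value is the singer's spectral tilt minus the tilt of the generated target shape, as the text states. Everything else is an assumption: the tilt is estimated here as the least-squares slope of magnitude (in dB) against frequency, and the correction filter of FIG. 8, whose characteristic is not reproduced in the text, is stood in for by a gain that grows linearly with frequency. The function names are hypothetical.

```python
def spectral_tilt(shape):
    """Least-squares slope of magnitude (dB) versus frequency (Hz)."""
    n = len(shape)
    mean_f = sum(f for f, _ in shape) / n
    mean_m = sum(m for _, m in shape) / n
    num = sum((f - mean_f) * (m - mean_m) for f, m in shape)
    den = sum((f - mean_f) ** 2 for f, _ in shape)
    return num / den


def apply_tilt_correction(target_shape, singer_shape):
    """Tilt the generated target shape so its slope matches the singer's."""
    correction = spectral_tilt(singer_shape) - spectral_tilt(target_shape)
    # Assumed filter: gain in dB proportional to frequency.
    return [(f, m + correction * f) for f, m in target_shape]
```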
【００７５】
［Ｃ］実施形態の変形例
［１］第１変形例
ピッチ、アンプリチュードに関して、前もって静的変化成分と、ビブラート的変化成分（ビブラートを早さ、深さのパラメータとして有する）に分けた情報として持っていれば、例えば、同じ音素を歌唱者がターゲットに比較して長く歌った場合でも、適切なビブラートを付加したピッチ、アンプリチュードを生成することができるので、自然な音の伸びを得ることができる。
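The reason for this processing is as follows. Without it, when the singer holds a note longer than the target singer did, the vibrato stops partway through and the result sounds unnatural. Likewise, when the singer changes the tempo relative to the target singer and no separate vibrato component is held, raising the tempo speeds up the vibrato, which is equally unnatural.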
【００７６】
［２］第２変形例
以上の説明においては、ターゲット歌唱者の残差成分については、考慮していないものであったが、ターゲット歌唱者の残差成分を考慮する場合に、全てのフレームについて残差成分を保持することは、情報圧縮の観点からいっても本音声変換装置のシステムには適合しない。
そこで、残差について予め代表的なスペクトルエンベロープを用意し、これらのスペクトルエンベロープを特定するためのインデックス情報を持つようにすればよい。
より具体的には、ターゲット挙動データとして残差スペクトルエンベロープ情報インデックスを持たせ、例えば、歌唱経過時間０秒〜２秒の間は、残差スペクトルエンベロープ情報インデックス＝１のスペクトルエンベロープを使用し、歌唱経過時間２秒〜３秒までは残差スペクトルエンベロープ情報インデックス＝３のスペクトルエンベロープを使用する。
そして、残差スペクトルエンベロープ情報インデックスに対応するスペクトルエンベロープから実際の残差スペクトルを生成して、モーフィング処理において用いるようにすれば、残差についてもモーフィングを可能とすることができる。
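The index-based residual scheme can be sketched as a small time-range lookup, using the example values from the text (index 1 for 0 to 2 seconds, index 3 for 2 to 3 seconds). Treating each range as half-open, so that exactly 2 seconds belongs to the second range, is an assumption; the schedule layout and the function name are hypothetical.

```python
from bisect import bisect_right

# (start_s, end_s, residual envelope index), taken from the example in the text.
RESIDUAL_SCHEDULE = [(0.0, 2.0, 1), (2.0, 3.0, 3)]


def residual_envelope_index(schedule, elapsed_s):
    """Index of the representative residual envelope to use at elapsed_s.

    Ranges are treated as half-open [start, end); None means no entry applies.
    """
    starts = [start for start, _, _ in schedule]
    pos = bisect_right(starts, elapsed_s) - 1
    if pos < 0:
        return None
    _, end, index = schedule[pos]
    return index if elapsed_s < end else None
```

Storing one small index per time range instead of a residual spectrum per frame is what keeps the per-song target behavior data compact.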
【００７７】
【発明の効果】
本発明によれば、入力された歌唱者の音声を目標とするターゲット歌唱者の歌い方に似せることができるとともに、ターゲット歌唱者の分析データの容量を低減して、リアルタイムに処理を行うことが可能となる。
【図面の簡単な説明】
【図１】実施形態にかかる音声変換装置の概要構成ブロック図である。
【図２】ターゲット音素辞書の説明図（その１）である。
【図３】ターゲット音素辞書の説明図（その２）である。
【図４】第１実施形態のターゲット・デコーダ部の概要構成ブロック図である。
【図５】ターゲット・デコーダ部のスペクトル補間処理の説明図（その１）である。
【図６】ターゲット・デコーダ部のスペクトル補間処理の説明図（その２）である。
【図７】第２実施形態のターゲット・デコーダ部の概要構成ブロック図である。
【図８】第２実施形態のスペクトル傾き補正フィルタの特性説明図である。
【符号の説明】
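10: voice conversion device, 11: singing signal input unit, 12: recognition feature analysis unit, 13: SMS analysis unit, 14: recognition phoneme dictionary, 15: target behavior data, 16: parameter control unit, 17: data conversion unit, 18: alignment processing unit, 19: target phoneme dictionary, 20: target decoder unit, 21: morphing processing unit, 22: conversion processing unit, 23: SMS synthesis unit, 24: selection unit, 25: sound source, 26: sequencer, 27: addition unit, 28: output unit, 31: stable state / transition state determination unit, 32: frame memory unit, 33: first spectrum interpolation unit, 34: second spectrum interpolation unit, 35: transition function generation unit, 36: third spectrum interpolation unit, 37: temporal change adding unit, 38: spectral tilt correction unit, 39: target pitch/amplitude calculation unit, 50: target decoder unit, 57: temporal change adding unit, 58: spectral tilt correction unit, SS1: first spectrum shape, SS2: second spectrum shape, SS3: third spectrum shape, SS4: fourth spectrum shape, SSt: spectrum shape with temporal change added, SSTG: target spectrum shape.
[0001]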
BACKGROUND OF THE INVENTION
The present invention relates to a voice conversion device and a voice conversion method that make a voice to be processed approximate another, target voice, and to a method of generating the voice conversion dictionary, corresponding to that other voice, that is used when performing the conversion. In particular, it relates to a voice conversion device, a voice conversion method, and a voice conversion dictionary generation method suitable for use in a karaoke apparatus.
[0002]
[Prior art]
Various voice conversion devices that change the frequency characteristics or other properties of an input voice before outputting it have been developed. Some karaoke apparatuses, for example, convert the pitch of the singer's voice so that a male voice becomes a female voice or vice versa (see, for example, Published Japanese Translation of PCT Application No. H8-508581).
[0003]
[Problems to be solved by the invention]
Conventional voice conversion devices do convert voices (for example, male to female or female to male), but they merely change the voice quality. They cannot, for example, convert a voice so that it resembles that of a specific singer (for example, a professional singer).
A function that imitates a specific singer not only in voice quality but also in singing style would be very entertaining in a karaoke apparatus and similar devices, but conventional voice conversion devices cannot perform such processing.
[0004]
The inventors have therefore proposed a voice conversion device that makes the voice quality resemble that of a target singer. The device analyzes the target singer's voice; holds the resulting analysis data (the sinusoidal component attributes of pitch, amplitude, and spectrum shape, together with the residual component) as target frame data for every frame of a song; and performs the conversion in synchronization with the frame data obtained by analyzing the input voice, so that the output resembles the target singer's voice (see Japanese Patent Application No. H10-183338 and others).
This voice conversion device can make not only the voice quality but also the singing style resemble that of a specific singer. However, analysis data of the target singer are required for every song, so the amount of data becomes enormous when analysis data for many songs are stored.
[0005]
Accordingly, an object of the present invention is to provide a voice conversion device, a voice conversion method, and a voice conversion dictionary generation method that can make an input singer's voice resemble the singing style of a target singer while reducing the volume of the target singer's analysis data.
[0006]
[Means for Solving the Problems]
  In order to solve the above problems, the configuration according to claim 1 comprises: input frame data extraction means for extracting input frame data relating to a frequency spectrum from an input voice signal; feature analysis means for extracting a feature vector from the input voice signal; alignment processing means for analyzing the feature vector with a predetermined algorithm, associating it with target behavior data that define the temporal changes of the pitch and phonemes of the target voice to be converted to, and determining the temporal position in the target behavior data that corresponds to the input frame data; state determination means for determining whether the phoneme of the target voice in the target behavior data at the temporal position determined by the alignment processing means is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme; spectrum shape interpolation means for, when the state determination means determines that the phoneme is in the transition state, for the phoneme of the target voice in the target behavior data at the temporal position determined by the alignment processing means and from among the spectrum shapes of a target phoneme dictionary that holds, for each phoneme, spectrum shapes corresponding to a plurality of pitches, calculating the spectrum shape of the first phoneme by interpolating between two spectrum shapes that correspond to the first phoneme and to two pitches close to the pitch of the target voice in the target behavior data at that temporal position, calculating the spectrum shape of the second phoneme by interpolating between two spectrum shapes that correspond to the second phoneme and to two pitches close to the pitch of the target voice in the target behavior data at that temporal position, and calculating the spectrum shape corresponding to the input frame data from the spectrum shape of the first phoneme and the spectrum shape of the second phoneme; and converted voice signal generation means for generating and outputting a converted voice signal based on the spectrum shape calculated by the spectrum shape interpolation means.
[0007]
  The configuration according to claim 2 comprises: input frame data extraction means for extracting input frame data relating to a frequency spectrum from an input voice signal; feature analysis means for extracting a feature vector from the input voice signal; alignment processing means for analyzing the feature vector by a predetermined algorithm, associating it with target behavior data in which the temporal changes of the pitch and phonemes of the target speech are defined, and determining the temporal position in the target behavior data corresponding to the input frame data; state determination means for determining whether the phoneme of the target speech in the target behavior data at the temporal position determined by the alignment processing means is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme; spectrum shape interpolation means for, when the state determination means determines that the phoneme is in the transition state, referring to a target phoneme dictionary that holds spectrum shapes corresponding to a plurality of pitches for each phoneme, calculating the spectrum shape of the first phoneme by interpolation using the two spectrum shapes of the first phoneme that correspond to two pitches close to the pitch obtained from the input frame data, calculating the spectrum shape of the second phoneme by interpolation using the two spectrum shapes of the second phoneme that correspond to two pitches close to the pitch obtained from the input frame data, and calculating a spectrum shape corresponding to the input frame data using the spectrum shape of the first phoneme and the spectrum shape of the second phoneme; and converted voice signal generation means for generating and outputting a converted voice signal based on the spectrum shape calculated by the spectrum shape interpolation means.
[0013]
  The configuration according to claim 3 is characterized in that, in the speech conversion device according to claim 1 or 2, the spectrum shape interpolation means, when interpolating using the two spectrum shapes, performs the interpolation between the two spectrum shapes using a transition function.
[0014]
  The configuration according to claim 4 is characterized in that, in the speech conversion device according to claim 3, the transition function is defined in advance as a linear function or a nonlinear function.
[0015]
  The configuration according to claim 5 is characterized in that, in the speech conversion device according to claim 3, the two spectrum shapes are each divided into a plurality of regions on the frequency axis, and the transition function is defined for each region.
[0016]
  The configuration according to claim 6 is characterized in that, in the speech conversion device according to claim 3, the spectrum shape interpolation means determines the transition function in correspondence with the second phoneme.
[0018]
  The configuration according to claim 7 is characterized in that, in the speech conversion device according to claim 3, the spectrum shape interpolation means divides the two spectrum shapes into a plurality of regions on the frequency axis and, for the pairs of actual frequency and magnitude on the two spectrum shapes belonging to each region, performs interpolation using a linear function as the transition function over the plurality of regions.
[0019]
  The configuration according to claim 8 is characterized in that, in the speech conversion device according to claim 7, the spectrum shape interpolation means comprises: frequency interpolation means for calculating an interpolated frequency by interpolating, using the linear function, between a first frequency, which is a frequency on one of the spectrum shapes belonging to each region, and a second frequency, which is the frequency on the other spectrum shape corresponding to the first frequency; and magnitude interpolation means for interpolating, using the linear function, between a first magnitude, which is a magnitude on one of the spectrum shapes belonging to each region, and a second magnitude, which is the magnitude on the other spectrum shape corresponding to the first magnitude.
[0020]
  The configuration according to claim 9 is characterized in that, in the speech conversion device according to claim 1, the target behavior data further defines the temporal change in the amplitude of the target speech; the device further comprises spectrum tilt correction means for correcting the spectrum tilt of the spectrum shape calculated by the spectrum shape interpolation means in accordance with the amplitude at the temporal position in the target behavior data determined by the alignment processing means; and the converted speech signal generation means generates and outputs the converted speech signal based on the spectrum shape whose spectrum tilt has been corrected by the spectrum tilt correction means.
[0021]
  The configuration according to claim 10 is characterized in that the speech conversion device according to claim 2 further comprises spectrum tilt correction means for correcting the spectrum tilt of the spectrum shape calculated by the spectrum shape interpolation means in accordance with the result of comparing that spectrum tilt with the spectrum tilt of the spectrum shape obtained from the input frame data.
[0022]
  The configuration according to claim 11 comprises: an input frame data extraction process of extracting input frame data relating to a frequency spectrum from an input speech signal; a feature analysis process of extracting a feature vector from the input speech signal; an alignment process of analyzing the feature vector by a predetermined algorithm, associating it with target behavior data in which the temporal changes of the pitch and phonemes of the target speech are defined, and determining the temporal position in the target behavior data corresponding to the input frame data; a state determination process of determining whether the phoneme of the target speech in the target behavior data at the temporal position determined in the alignment process is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme; a spectrum shape interpolation process of, when the phoneme is determined to be in the transition state in the state determination process, referring to a target phoneme dictionary that holds spectrum shapes corresponding to a plurality of pitches for each phoneme, calculating the spectrum shape of the first phoneme by interpolation using the two spectrum shapes of the first phoneme that correspond to two pitches close to the pitch of the target speech in the target behavior data at the temporal position, calculating the spectrum shape of the second phoneme by interpolation using the two spectrum shapes of the second phoneme that correspond to two pitches close to that same pitch, and calculating a spectrum shape corresponding to the input frame data using the spectrum shape of the first phoneme and the spectrum shape of the second phoneme; and a converted speech signal generation process of generating and outputting a converted speech signal based on the spectrum shape calculated in the spectrum shape interpolation process.
[0023]
  The configuration according to claim 12 comprises: an input frame data extraction process of extracting input frame data relating to a frequency spectrum from an input speech signal; a feature analysis process of extracting a feature vector from the input speech signal; an alignment process of analyzing the feature vector by a predetermined algorithm, associating it with target behavior data in which the temporal changes of the pitch and phonemes of the target speech are defined, and determining the temporal position in the target behavior data corresponding to the input frame data; a state determination process of determining whether the phoneme of the target speech in the target behavior data at the temporal position determined in the alignment process is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme; a spectrum shape interpolation process of, when the phoneme is determined to be in the transition state in the state determination process, referring to a target phoneme dictionary that holds spectrum shapes corresponding to a plurality of pitches for each phoneme, calculating the spectrum shape of the first phoneme by interpolation using the two spectrum shapes of the first phoneme that correspond to two pitches close to the pitch obtained from the input frame data, calculating the spectrum shape of the second phoneme by interpolation using the two spectrum shapes of the second phoneme that correspond to two pitches close to the pitch obtained from the input frame data, and calculating a spectrum shape corresponding to the input frame data using the spectrum shape of the first phoneme and the spectrum shape of the second phoneme; and a converted speech signal generation process of generating and outputting a converted speech signal based on the spectrum shape calculated in the spectrum shape interpolation process.
[0027]
  The configuration according to claim 13 is characterized in that, in the speech conversion method according to claim 11 or 12, the spectrum shape interpolation process, when interpolating using the two spectrum shapes, performs the interpolation between the two spectrum shapes using a transition function.
[0028]
  The configuration according to claim 14 is characterized in that, in the speech conversion method according to claim 13, the transition function is defined in advance as a linear function or a nonlinear function.
[0029]
  The configuration according to claim 15 is characterized in that, in the speech conversion method according to claim 13, the two spectrum shapes are each divided into a plurality of regions on the frequency axis, and the transition function is defined for each region.
[0030]
  The configuration according to claim 16 is characterized in that, in the speech conversion method according to claim 13, the spectrum shape interpolation process determines the transition function in correspondence with the second phoneme.
[0032]
  The configuration according to claim 17 is characterized in that, in the speech conversion method according to claim 13, the spectrum shape interpolation process divides the two spectrum shapes into a plurality of regions on the frequency axis and, for the pairs of actual frequency and magnitude on the two spectrum shapes belonging to each region, performs interpolation using a linear function as the transition function over the plurality of regions.
[0033]
  The configuration according to claim 18 is characterized in that, in the speech conversion method according to claim 17, the spectrum shape interpolation process comprises: a frequency interpolation process of calculating an interpolated frequency by interpolating, using the linear function, between a first frequency, which is a frequency on one of the spectrum shapes belonging to each region, and a second frequency, which is the frequency on the other spectrum shape corresponding to the first frequency; and a magnitude interpolation process of interpolating, using the linear function, between a first magnitude, which is a magnitude on one of the spectrum shapes belonging to each region, and a second magnitude, which is the magnitude on the other spectrum shape corresponding to the first magnitude.
[0034]
  The configuration according to claim 19 is characterized in that, in the speech conversion method according to claim 11, the target behavior data further defines the temporal change in the amplitude of the target speech; the method further comprises a spectrum tilt correction process of correcting the spectrum tilt of the spectrum shape calculated in the spectrum shape interpolation process in accordance with the amplitude at the temporal position in the target behavior data determined in the alignment process; and the converted speech signal generation process generates and outputs the converted speech signal based on the spectrum shape whose spectrum tilt has been corrected in the spectrum tilt correction process.
[0035]
  The configuration according to claim 20 is characterized in that the speech conversion method according to claim 12 further comprises a spectrum tilt correction process of correcting the spectrum tilt of the spectrum shape calculated in the spectrum shape interpolation process in accordance with the result of comparing that spectrum tilt with the spectrum tilt of the spectrum shape obtained from the input frame data.
[0037]
DETAILED DESCRIPTION OF THE INVENTION
Next, preferred embodiments of the present invention will be described with reference to the drawings.
[A] First embodiment
First, a first embodiment of the present invention will be described.
[1] Overall configuration of voice conversion device
FIG. 1 shows an example in which the voice conversion device (voice conversion method) of the embodiment is applied to a karaoke device to form a karaoke device capable of impersonating a target singer.
The voice conversion device 10 comprises: a singing signal input unit 11 that receives a singer's voice and outputs a singing signal; a recognition feature analysis unit 12 that extracts various feature vectors from the singing signal based on predetermined code books; an SMS analysis unit 13 that performs SMS (Spectral Modeling Synthesis) analysis of the singing signal and outputs input SMS frame data and voiced/unvoiced information; a recognition phoneme dictionary storage unit 14 in which the various code books and the hidden Markov model (HMM) of each phoneme are stored in advance; a target behavior data storage unit 15 that stores song-dependent target behavior data; a parameter control unit 16 that controls various parameters such as key information, tempo information, similarity parameters, and conversion parameters; and a data conversion unit 17 that converts the target behavior data stored in the target behavior data storage unit 15 based on the key information and tempo information, and generates and outputs the converted phoneme notation information with duration, pitch information, and amplitude information.
[0038]
The voice conversion device 10 also comprises: an alignment processing unit 18 that uses the Viterbi algorithm to determine which part of the song the singer is singing, based on the extracted feature vectors, the HMM of each phoneme, and the phoneme notation information with duration, and thereby detects alignment information (the singing position and the phoneme in the song that the target singer should be singing); a target phoneme dictionary storage unit 19 that stores spectrum shape information dependent on the target singer; a target decoder unit 20 that generates and outputs target frame data TGFL based on the alignment information, the pitch information of the target behavior data, the amplitude information of the target behavior data, the input SMS frame data, and the spectrum shape information of the target phoneme dictionary; a morphing processing unit 21 that performs morphing based on the target frame data TGFL input from the target decoder unit 20, the similarity parameter input from the parameter control unit 16, and the SMS frame data FSMS, and outputs morphing frame data MFL; and a conversion processing unit 22 that performs conversion processing based on the morphing frame data MFL and the conversion parameters input from the parameter control unit 16, and outputs converted frame data MMFL.
[0039]
Further, the voice conversion device 10 comprises: an SMS synthesis unit 23 that performs SMS synthesis on the converted frame data MMFL and outputs a waveform signal SWAV as the converted voice signal; a selection unit 24 that selectively outputs either the waveform signal SWAV or the input singing signal SV based on the voiced/unvoiced information from the SMS analysis unit 13; a sequencer 26 that drives a sound source unit 25 based on the key information and tempo information from the parameter control unit 16; an adder 27 that adds the waveform signal SWAV or singing signal SV output from the selection unit 24 to the music signal SMSC output from the sound source unit 25; and an output unit 28 that amplifies the output signal of the adder 27 and outputs it as a karaoke signal.
[0040]
Here, prior to the description of the configuration of each part of the speech conversion apparatus, the SMS analysis will be described.
In SMS analysis, a frame is cut out by multiplying the sampled speech waveform by a window function, and a sine wave component and a residual component are extracted from the frequency spectrum obtained by applying a fast Fourier transform (FFT) to that frame.
Here, the sine wave component consists of the fundamental frequency (pitch) and the frequencies that are integer multiples of the fundamental frequency (overtones).
In the present embodiment, the fundamental frequency, the average amplitude of each component, and the spectrum envelope are held as the sine wave component.
The residual component is what remains after the sine wave component is removed from the input signal, and is held as frequency-domain data in this embodiment.
The frequency analysis data represented by the sine wave component and the residual component is stored frame by frame. Because the time interval between frames is fixed (for example, 5 ms), a time can be identified by counting frames. In addition, each frame carries a time stamp corresponding to the elapsed time from the beginning of the song.
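The frame bookkeeping described above can be illustrated with a minimal sketch. It is not the SMS analysis itself: the function names, the choice of a Hann window, and the test values are assumptions made for illustration, and only the 5 ms frame interval and the notion of overtones as integer multiples of the pitch come from the text.

```python
import math

FRAME_INTERVAL_S = 0.005  # 5 ms, the example frame interval given in the text


def frame_time(frame_index, interval=FRAME_INTERVAL_S):
    """Elapsed time of a frame, obtained simply by counting frames."""
    return frame_index * interval


def hann_window(length):
    """Periodic Hann window (an illustrative choice of window function)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / length) for k in range(length)]


def extract_frame(samples, start, length):
    """Cut out one frame by multiplying the samples by the window function."""
    window = hann_window(length)
    return [samples[start + k] * window[k] for k in range(length)]


def harmonic_frequencies(f0, sampling_rate):
    """Fundamental frequency and its integer multiples up to the Nyquist frequency."""
    count = int((sampling_rate / 2.0) // f0)
    return [f0 * h for h in range(1, count + 1)]
```

A real analyzer would follow `extract_frame` with an FFT and a peak-picking step to separate the sine wave component from the residual; those steps are omitted here.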
[0041]
[2] Configuration of each part of the voice conversion device
[2.1] Recognition phoneme dictionary storage unit
The recognition phoneme dictionary storage unit 14 stores code books and the hidden Markov models of the phonemes.
The stored code books are used to vector-quantize the singing signal into various feature vectors (specifically, mel cepstrum, differential mel cepstrum, energy, differential energy, and voiceness (the likelihood of being voiced)).
In addition, this voice conversion device performs alignment processing using hidden Markov models (HMMs), a technique from speech recognition, and the HMM parameters (initial state distribution, state transition probability matrix, and observation symbol probability matrix) are obtained for each phoneme (/a/, /i/, etc.).
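To show how the three HMM parameters named above are used together, the following is a minimal sketch of the textbook Viterbi algorithm for a discrete-observation HMM. It is not the patent's alignment implementation: the function name and argument layout are assumptions, the observations are assumed to be code book indices produced by vector quantization, and a practical system would work with log probabilities to avoid underflow.

```python
def viterbi(observations, initial, transition, emission):
    """Most likely state sequence for a discrete-observation HMM.

    initial[s]       : initial state distribution
    transition[s][t] : probability of moving from state s to state t
    emission[s][o]   : probability of observing symbol o in state s
    """
    n_states = len(initial)
    # best[s] is the probability of the best path that ends in state s.
    best = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    back_pointers = []
    for obs in observations[1:]:
        new_best, pointers = [], []
        for t in range(n_states):
            prev = max(range(n_states), key=lambda s: best[s] * transition[s][t])
            pointers.append(prev)
            new_best.append(best[prev] * transition[prev][t] * emission[t][obs])
        best = new_best
        back_pointers.append(pointers)
    # Trace back from the most probable final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With a left-to-right model in which each state stands for one phoneme, the returned path indicates which phoneme each input frame is aligned to.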
[0042]
[2.2] Target behavior data storage unit
The target behavior data storage unit 15 stores target behavior data, which is song-dependent data corresponding to each song on which voice conversion is performed.
Specifically, the data includes the temporal changes in pitch and amplitude observed when the target singer sang the target song (extracting these separately as a static change component and a vibrato-like change component increases the degree of freedom in post-processing), together with phoneme notation with duration, in which the lyrics of the target song are replaced by a phoneme sequence and each phoneme is given its duration.
For example, unlike the plain phoneme notation /n/ /a/ /k/ /i/ ..., the phoneme notation with duration also includes the duration of each phoneme, that is, the duration of /n/, the duration of /a/, the duration of /k/, the duration of /i/, and so on.
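One possible in-memory form of the phoneme notation with duration is a list of (phoneme, duration) pairs, sketched below. The representation, the function name `phoneme_at`, and the duration values are all hypothetical; the text only states that each phoneme carries a duration.

```python
def phoneme_at(notation, time_s):
    """Return the phoneme active at `time_s`, or None past the end.

    `notation` is a list of (phoneme, duration_in_seconds) pairs.
    """
    elapsed = 0.0
    for phoneme, duration in notation:
        elapsed += duration
        if time_s < elapsed:
            return phoneme
    return None


# Hypothetical durations for the example sequence /n/ /a/ /k/ /i/.
NOTATION = [("n", 0.08), ("a", 0.30), ("k", 0.06), ("i", 0.25)]
```

Scaling every duration by a common factor is one simple way such data could follow a tempo change, which is the kind of conversion attributed to the data conversion unit.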
[0043]
[2.3] Target phoneme dictionary storage unit
The target phoneme dictionary storage unit 19 stores a target phoneme dictionary, which is spectrum information corresponding to each phoneme of the target singer to be imitated. The target phoneme dictionary contains spectrum shapes corresponding to several pitches and the anchor point information used for spectral interpolation.
Here, creation of a target phoneme dictionary as a speech conversion dictionary stored in the target phoneme dictionary storage unit 19 will be described with reference to FIGS.
[2.3.1] Target phoneme dictionary
The target phoneme dictionary has a spectrum shape and anchor point information corresponding to several pitches for each phoneme.
FIG. 2 is an explanatory diagram of the target phoneme dictionary.
FIGS. 2(b), 2(c), and 2(d) show the spectrum shapes corresponding to pitches f0i+1, f0i, and f0i-1, respectively, for a certain phoneme. For each phoneme, several spectrum shapes (three in this example) are included in the target phoneme dictionary. Spectrum shapes for multiple pitches are held because, in general, even when the same person utters the same phoneme, the spectrum shape changes slightly with pitch.
In FIGS. 2(b), 2(c), and 2(d), the dotted lines are the boundaries used to divide the frequency axis into a plurality of regions. The frequency at each region boundary is an anchor point, and the anchor point information is included in the target phoneme dictionary.
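A minimal sketch of how one dictionary entry might be laid out is given below. The nesting, key names, pitches, anchor frequencies, and magnitude values are all hypothetical; the text specifies only that each phoneme has spectrum shapes for several pitches plus anchor point information marking region boundaries on the frequency axis.

```python
import bisect

# Hypothetical layout: phoneme -> anchor points and {pitch in Hz: spectrum shape},
# where a spectrum shape is a list of (frequency in Hz, magnitude) pairs.
TARGET_PHONEME_DICTIONARY = {
    "a": {
        "anchor_points": [0.0, 1000.0, 2500.0, 5000.0],
        "shapes": {
            110.0: [(500.0, -12.0), (1500.0, -20.0), (3000.0, -35.0)],
            220.0: [(500.0, -10.0), (1500.0, -22.0), (3000.0, -33.0)],
            440.0: [(500.0, -9.0), (1500.0, -25.0), (3000.0, -30.0)],
        },
    },
}


def region_index(anchor_points, frequency):
    """Index of the region (between consecutive anchor points) containing `frequency`."""
    index = bisect.bisect_right(anchor_points, frequency) - 1
    # Clamp so that frequencies at or beyond the outer anchors fall in the edge regions.
    return max(0, min(index, len(anchor_points) - 2))
```

Keeping the anchor points alongside the shapes lets a later interpolation step treat each frequency region independently.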
[0044]
[2.3.2] Creation of target phoneme dictionary
Next, creation of a target phoneme dictionary will be described.
First, for each phoneme, the target singer's voice is recorded while being produced continuously from the lowest pitch to the highest pitch that the singer can produce.
More specifically, as shown in FIG. 2(a), the phoneme is uttered so that the pitch rises with time.
The reason for recording in this way is to calculate a more accurate spectrum shape.
That is, a formant that actually exists does not always appear in a spectrum shape obtained by analyzing a sample uttered at one fixed pitch. Therefore, to make the formants appear accurately in the desired spectrum shape, it is necessary to use all the analysis results around a given pitch that fall within the range that can be regarded as having the same spectrum shape.
[0045]
If a range of pitch frequencies that can be regarded as having the same spectrum shape is called a segment, the center frequency f0i of the i-th segment is
[Expression 1]
$$ f_{0i} = \frac{f_i^{(low)} + f_i^{(high)}}{2} $$
where f_i^(low) and f_i^(high) are the pitch frequencies at the boundaries of the i-th segment of a phoneme: f_i^(low) is the pitch frequency on the low-pitch side, and f_i^(high) is the pitch frequency on the high-pitch side.
All spectral shape values (frequency and magnitude pairs) at a pitch that can be regarded as the same segment are combined into one.
More specifically, for example, as shown in FIG. 3A, spectrum shapes at a pitch that can be regarded as the same segment are plotted on the same frequency axis / magnitude axis.
Next, the frequency range [0, f_S/2] on the frequency axis is divided at equal intervals (for example, 30 Hz), where f_S is the sampling frequency.
[0046]
Let the division width be BW [Hz] and the number of divisions be B (band number b ∈ [0, B-1]), and let the actual frequency and magnitude pairs contained in a given band be
(x_n, y_n), n = 0, ..., N-1.
Then the center frequency f_b and the average magnitude M_b of band b are calculated as
[Expression 2]
$$ f_b = \frac{1}{N}\sum_{n=0}^{N-1} x_n, \qquad M_b = \frac{1}{N}\sum_{n=0}^{N-1} y_n $$
The pairs obtained in this way,
(f_b, M_b), b = 0, ..., B-1,
form the final spectrum shape at a certain pitch.
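The band-wise reduction described above can be sketched as follows. This is an illustration under stated assumptions: the band center frequency and average magnitude are taken to be the arithmetic means of the pairs that fall in each band, bands that contain no pairs are simply skipped, and the function name `spectrum_shape` is hypothetical.

```python
def spectrum_shape(pairs, band_width):
    """Reduce pooled (frequency, magnitude) pairs to one point per band.

    `pairs` holds every (frequency, magnitude) value collected from the frames
    whose pitch falls in one segment; `band_width` is BW in Hz.
    """
    bands = {}
    for freq, mag in pairs:
        bands.setdefault(int(freq // band_width), []).append((freq, mag))
    shape = []
    for b in sorted(bands):
        members = bands[b]
        n = len(members)
        f_b = sum(f for f, _ in members) / n  # assumed: mean frequency of the band
        m_b = sum(m for _, m in members) / n  # average magnitude of the band
        shape.append((f_b, m_b))
    return shape
```

For example, with a band width of 30 Hz the pairs (10, 1), (20, 3), and (40, 5) reduce to the two points (15, 2) and (40, 5).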
[0047]
More specifically, when the spectrum shape is calculated from the frequency and magnitude pairs shown in FIG. 3(a), a good spectrum shape in which the formants appear clearly is obtained, as shown in FIG. 3(c); this is the shape to be stored in the target phoneme dictionary.
In contrast, when all spectrum shape values (frequency and magnitude pairs) at pitches that cannot be regarded as belonging to the same segment are combined into one, as shown in FIG. 3(b), and the spectrum shape is calculated from the combined pairs, the resulting spectrum shape, shown in FIG. 3(d), has less distinct formants than that of FIG. 3(c).
[0048]
[2.4] Target decoder section
[2.4.1] Configuration of target decoder unit
FIG. 4 is a block diagram showing the configuration of the target decoder unit.
The target decoder unit 20 comprises: a stable state / transition state determination unit 31 that determines, from already processed decoded frames, whether the phoneme corresponding to the frame to be decoded is in a stable state or in a transition state in which it is changing to another phoneme; a frame memory unit 32 that stores already processed decoded frames so that frame data can be generated smoothly; and a first spectrum interpolation unit 33 that operates according to the determination result of the stable state / transition state determination unit 31. When the phoneme corresponding to the frame to be decoded is in the stable state, the first spectrum interpolation unit 33 generates the spectrum shape of the current phoneme as a first interpolated spectrum shape SS1 from the two spectrum shapes near the current target pitch, using the spectral interpolation method described later. When the phoneme is in the transition state, it generates the spectrum shape of the transition-source phoneme as a second interpolated spectrum shape SS2 from the two spectrum shapes near the current target pitch, using the same method.
[0049]
Further, the target decoder unit 20 comprises: a second spectrum interpolation unit 34 that, when the determination result of the stable state / transition state determination unit 31 indicates that the phoneme corresponding to the frame to be decoded is in the transition state, generates the spectrum shape of the transition-destination phoneme as a third interpolated spectrum shape SS3 from the two spectrum shapes near the current target pitch, using the spectral interpolation method described later; a transition function generator 35 that generates a transition function defining how the transition from the transition-source phoneme to the transition-destination phoneme proceeds, taking into account the transition-source phoneme, the transition-destination phoneme, the pitch of the singer, the pitch of the target singer, the spectrum shapes, and so on; and a third spectrum interpolation unit 36 that, when the phoneme corresponding to the frame to be decoded is in the transition state, generates a fourth spectrum shape SS4 from the transition function generated by the transition function generator 35 and the two spectrum shapes SS2 and SS3, using the spectral interpolation method described later.
[0050]
Furthermore, the target decoder unit 20 comprises: a temporal change addition unit 37 that, to make the output decoded frame more realistic, varies the fine structure of the spectrum shape along the time axis (for example, changes the magnitude little by little over time) based on the target pitch and the processed decoded frames stored in the frame memory unit 32, and outputs a spectrum shape SSt to which the temporal change has been added; a spectrum tilt correction unit 38 that, to make the spectrum shape SSt more realistic, corrects its spectrum tilt in accordance with the target amplitude and outputs the result as the target spectrum shape SSTG; and a target pitch / amplitude calculation unit 39 that calculates the target pitch and amplitude corresponding to the decoded frame to be output, based on the alignment information and the pitch and amplitude of the target.
[0051]
[2.4.2] Detailed operation of target decoder
Here, the detailed operation of the target decoder unit 20 will be described.
Here, so that frame data can be generated more smoothly, the frame data output by the target decoder unit 20 (decoded frames, that is, target spectrum shapes) are stored in the frame memory unit 32.
The input information to the target decoder unit 20 comprises information on the singing voice (pitch, amplitude, spectrum shape, alignment), the target behavior data (pitch, amplitude, phoneme notation with duration), and the target phoneme dictionary (spectrum shapes).
[0052]
The stable state / transition state determination unit 31 then determines, from the pitches of the singer and the target singer, the alignment information, and past decoded frames, whether the frame to be decoded is in a stable state (a state in which a single phoneme can be identified, rather than one in transition from one phoneme to another), and notifies the first spectrum interpolation unit 33 and the second spectrum interpolation unit 34 of the result.
When the notification from the stable state / transition state determination unit 31 indicates that the frame to be decoded is in the stable state, the first spectrum interpolation unit 33 calculates the first interpolated spectrum shape SS1, which is the spectrum shape of the current phoneme interpolated from the two spectrum shapes near the current target pitch using the spectral interpolation method described later, and outputs it to the temporal change addition unit 37.
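The idea of deriving a spectrum shape from the two stored shapes nearest the target pitch can be sketched as below. This is not the spectral interpolation method the text defers to a later section: it assumes that both stored shapes share the same frequency grid and that magnitudes are blended linearly with a weight given by the position of the target pitch between the two stored pitches. The function names are hypothetical.

```python
def neighbouring_pitches(pitches, target_pitch):
    """Two stored pitches bracketing the target (the outermost pair at the edges)."""
    ordered = sorted(pitches)
    for low, high in zip(ordered, ordered[1:]):
        if target_pitch <= high:
            return low, high
    return ordered[-2], ordered[-1]


def interpolate_shapes(shapes, target_pitch):
    """Blend the two stored spectrum shapes nearest to `target_pitch`.

    `shapes` maps a pitch in Hz to a list of (frequency, magnitude) pairs;
    at least two pitches must be stored, all on the same frequency grid.
    """
    low, high = neighbouring_pitches(shapes.keys(), target_pitch)
    weight = (target_pitch - low) / (high - low)
    weight = min(1.0, max(0.0, weight))  # no extrapolation outside the stored range
    return [
        (freq, (1.0 - weight) * m_low + weight * m_high)
        for (freq, m_low), (_, m_high) in zip(shapes[low], shapes[high])
    ]
```

A target pitch halfway between two stored pitches therefore yields magnitudes halfway between the two stored shapes.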
[0053]
When the notification from the stable state / transition state determination unit 31 indicates that the frame to be decoded is in the transition state, the first spectrum interpolation unit 33 calculates the second interpolated spectrum shape SS2, which is the spectrum shape of the transition-source phoneme (the first phoneme, when a transition from a first phoneme to a second phoneme is in progress) interpolated from the two spectrum shapes near the current target pitch using the spectral interpolation method described later, and outputs it to the third spectrum interpolation unit 36.
Likewise, when the frame to be decoded is in the transition state, the second spectrum interpolation unit 34 calculates the third interpolated spectrum shape SS3, which is the spectrum shape of the transition-destination phoneme (the second phoneme, when a transition from a first phoneme to a second phoneme is in progress) interpolated from the two spectrum shapes near the current target pitch using the spectral interpolation method described later, and outputs it to the third spectrum interpolation unit 36.
[0054]
As a result, when the notification from the stable state / transition state determination unit 31 indicates that the frame to be decoded is in the transition state, the third spectrum interpolation unit 36 interpolates between the second interpolated spectrum shape SS2 calculated by the first spectrum interpolation unit 33 and the third interpolated spectrum shape SS3 calculated by the second spectrum interpolation unit 34, using the spectrum interpolation method described later, and outputs the resulting fourth spectrum shape SS4 to the temporal change adding unit 37. The fourth spectrum shape SS4 corresponds to the spectrum shape of an intermediate phoneme between two different phonemes. In obtaining the fourth spectrum shape SS4, performing the interpolation according to the transition function generated by the transition function generation unit 35, rather than simply interpolating linearly within each corresponding region (whose boundaries are indicated by anchor points) over a fixed period of time, makes it possible to perform spectrum interpolation closer to reality.
[0055]
For example, for a change from phoneme /a/ to phoneme /e/, the transition function generation unit 35 changes the spectrum linearly in each corresponding region (between anchor points, described later) over 10 frames. For a change from phoneme /a/ to phoneme /u/, the change takes 5 frames, with the spectrum in one frequency band (between anchor points) changed linearly and the spectrum in the other frequency bands (between anchor points) changed exponentially. In this way, natural movement between phonemes can be realized smoothly.
To this end, the transition function generation unit 35 generates the transition function on the basis of the phonemes and the pitch, taking into account the pitch, spectrum shape, and the like of the singer and of the target.
In this case, as described later, it is also possible to configure such that these pieces of information are included in the target phoneme dictionary.
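The behaviour described above can be illustrated with a minimal sketch that maps a frame index inside a transition to an interpolation position between 0 and 1, either linearly or exponentially. The function name `transition_position`, the `curve` argument, and the steepness constant `k` are hypothetical; the text does not specify the exact shape of the exponential curve.

```python
import math

def transition_position(frame, n_frames, curve="linear", k=4.0):
    """Return an interpolation position in [0, 1] for a frame inside a transition.

    frame    : index of the current frame, 0 .. n_frames
    n_frames : length of the transition in frames (e.g. 10 for /a/ -> /e/)
    curve    : "linear" or "exponential" (hypothetical option names)
    k        : steepness of the exponential curve (assumed value)
    """
    t = min(max(frame / n_frames, 0.0), 1.0)
    if curve == "linear":
        return t
    if curve == "exponential":
        # Normalised so that the curve still runs from 0 to 1.
        return (math.exp(k * t) - 1.0) / (math.exp(k) - 1.0)
    raise ValueError(f"unknown curve: {curve}")

# One band changes linearly and another exponentially over a 5-frame transition.
linear_band = [transition_position(f, 5, "linear") for f in range(6)]
exp_band = [transition_position(f, 5, "exponential") for f in range(6)]
```

Both bands start at 0 and reach 1 on the last frame; only the path between the two end points differs from band to band.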
Next, the temporal change adding unit 37 changes the fine structure of the input first interpolated spectrum shape SS1 or fourth spectrum shape SS4, based on the target pitch and past decoded frames, so that the target spectrum shape (= decoded frame) output from the target decoder unit 20 approximates an actual frame, and outputs the result to the spectrum tilt correction unit 38 as the time-change-added spectrum shape SSt.
[0056]
For example, the magnitude of the fine structure of the spectrum shape is changed little by little over time.
The spectrum tilt correction unit 38 corrects the spectral tilt of the input time-change-added spectrum shape SSt according to the amplitude of the target, so that the output target spectrum shape (= decoded frame) SSTG approximates an actual frame, and outputs the corrected spectrum shape as the target spectrum shape SSTG.
In general, the high-frequency range of a spectrum shape is rich when the output volume is high and weak when the output volume is low. Spectral tilt correction simulates this by changing the shape of the high-frequency part of the spectrum shape according to the volume.
Then, the target spectrum shape SSTG obtained by correcting the spectrum tilt is stored in the frame memory unit 32.
On the other hand, the target pitch / amplitude calculation unit 39 calculates and outputs the pitch TGP and the amplitude TGA corresponding to the target spectrum shape SSTG to be output.
[0057]
[2.4.3] Spectral interpolation processing
Here, the spectrum interpolation processing of the target decoder unit will be described with reference to FIG.
[2.4.3.1] Overview of spectral interpolation processing
First, based on the determination result of the stable state / transition state determination unit 31, the target decoder unit 20 extracts from the target phoneme dictionary two spectrum shapes corresponding to the phoneme if the phoneme corresponding to the frame to be decoded is in a stable state, or two spectrum shapes corresponding to the transition-source phoneme if it is in the transition state.
FIGS. 5A and 5B show the two spectrum shapes extracted from the target phoneme dictionary for the stable-state phoneme or the transition-source phoneme; the two spectrum shapes differ in pitch.
For example, if the spectrum shape to be obtained has a pitch of 140 [Hz] and phoneme /a/, the spectrum shape of FIG. 5A corresponds to phoneme /a/ at a pitch of 100 [Hz], and the spectrum shape of FIG. 5B corresponds to phoneme /a/ at a pitch of 200 [Hz]. That is, the two spectrum shapes used are those that correspond to the same phoneme as the spectrum shape to be obtained and whose pitches are the nearest ones below and above the pitch of the spectrum shape to be obtained.
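The selection of the two sandwiching spectrum shapes can be sketched as follows. The dictionary layout (a pitch-sorted list of entries per phoneme), the function name `bracketing_entries`, and the placeholder shape names are hypothetical, and deriving the interpolation position linearly from the pitch is an assumption that the text does not state.

```python
import bisect

def bracketing_entries(dictionary, phoneme, pitch_hz):
    """Pick the two stored spectrum shapes whose pitches sandwich pitch_hz.

    dictionary: {phoneme: [(pitch_hz, spectrum_shape), ...]} sorted by pitch,
    with at least two entries per phoneme (hypothetical layout).
    """
    entries = dictionary[phoneme]
    pitches = [p for p, _ in entries]
    hi = bisect.bisect_left(pitches, pitch_hz)
    hi = min(max(hi, 1), len(entries) - 1)  # clamp at both ends of the table
    return entries[hi - 1], entries[hi]

target_dictionary = {
    "a": [(100.0, "shape_a_100"), (200.0, "shape_a_200"), (400.0, "shape_a_400")],
}
lower, upper = bracketing_entries(target_dictionary, "a", 140.0)

# Assumed mapping from pitch to interpolation position (not given in the text).
position = (140.0 - lower[0]) / (upper[0] - lower[0])
```

For the 140 [Hz] example, the entries at 100 [Hz] and 200 [Hz] are selected, matching FIGS. 5A and 5B.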
[0058]
The two spectrum shapes thus obtained are interpolated by the first spectrum interpolation unit 33 using the spectrum interpolation method, yielding the desired spectrum shape shown in FIG. 5E (the first spectrum shape SS1 or the second spectrum shape SS2). If the determination result of the stable state / transition state determination unit 31 indicates that the phoneme corresponding to the frame to be decoded is in the stable state, the spectrum shape thus obtained is output as is to the temporal change adding unit 37.
Further, when the determination result of the stable state / transition state determination unit 31 indicates that the phoneme corresponding to the frame to be decoded is in the transition state, the two spectrum shapes corresponding to the transition-destination phoneme are also extracted from the target phoneme dictionary.
FIGS. 5C and 5D show the two spectrum shapes extracted from the target phoneme dictionary for the transition-destination phoneme; these two spectrum shapes also differ in pitch, in the same way as the two shapes extracted for the stable-state or transition-source phoneme.
Then, the obtained two spectrum shapes are interpolated by the second spectrum interpolation unit 34 to obtain a desired spectrum shape (corresponding to the third spectrum shape SS3) as shown in FIG.
Furthermore, when the determination result of the stable state / transition state determination unit 31 indicates that the phoneme corresponding to the frame to be decoded is in the transition state, the spectrum shapes shown in FIG. 5 (e) and FIG. 5 (f) are interpolated by the third spectrum interpolation unit 36 using the spectrum interpolation method, yielding the desired spectrum shape (corresponding to the fourth spectrum shape SS4) shown in FIG. 5G.
[0059]
[2.4.3.2] Spectral interpolation method
Here, the method of spectrum interpolation will be described in detail.
Spectrum interpolation is used for two broad purposes.
(1) Interpolate the spectrum shapes of two temporally consecutive frames to obtain the spectrum shape of a frame that lies temporally between the two frames.
(2) Interpolate the spectrum shape of two different sounds to obtain the spectrum shape of an intermediate sound.
As shown in FIG. 6 (a), the two spectrum shapes on which the interpolation is based (hereinafter called, for convenience, the first spectrum shape SS11 and the second spectrum shape SS12; these are entirely separate from the first spectrum shape SS1 and the second spectrum shape SS2 described above) are each divided into a plurality of regions Z1, Z2, ... on the frequency axis.
The frequencies of the boundaries that delimit the regions are set for each spectrum shape as follows. A boundary frequency set in this way is called an anchor point.
First spectrum shape SS11: $RB_{1,1}, RB_{2,1}, \ldots, RB_{N,1}$
Second spectrum shape SS12: $RB_{1,2}, RB_{2,2}, \ldots, RB_{M,2}$
[0060]
FIG. 6B shows an explanatory diagram of linear spectrum interpolation.
Linear spectral interpolation is defined by the interpolation position, and the interpolation position X ranges from 0 to 1. In this case, the interpolation position X = 0 corresponds to the first spectrum shape SS11 itself, and the interpolation position X = 1 corresponds to the second spectrum shape SS12 itself.
FIG. 6B shows the case where the interpolation position X = 0.35.
In FIG. 6B, the white circles (◯) on the vertical axis indicate each of the frequency and magnitude sets that make up the spectrum shape. Therefore, it is appropriate to consider that the magnitude axis exists in the direction perpendicular to the paper surface.
On the axis at interpolation position X = 0, the anchor points bounding a region of interest Zi of the first spectrum shape SS11 are
$RB_{i,1}$ and $RB_{i+1,1}$,
and one particular frequency and magnitude pair belonging to the region Zi is assumed to have frequency $f_{i1}$ and magnitude $S_1(f_{i1})$.
On the axis at interpolation position X = 1, the anchor points bounding the region of interest Zi of the second spectrum shape SS12 are
$RB_{i,2}$ and $RB_{i+1,2}$,
and one particular frequency and magnitude pair belonging to the region Zi is assumed to have frequency $f_{i2}$ and magnitude $S_2(f_{i2})$.
Here, the spectrum transition function ftrans1 (x) and the spectrum transition function ftrans2 (x) are obtained.
[0061]
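For example, expressed as the simplest linear functions, these are as follows.
$f_{trans1}(x) = m_1 \cdot x + b_1$
$f_{trans2}(x) = m_2 \cdot x + b_2$
Here,
$m_1 = RB_{i,2} - RB_{i,1}$
$b_1 = RB_{i,1}$
$m_2 = RB_{i+1,2} - RB_{i+1,1}$
$b_2 = RB_{i+1,1}$
so that each function gives the anchor point of the first spectrum shape SS11 at x = 0 and the corresponding anchor point of the second spectrum shape SS12 at x = 1.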
Next, a set of frequency and magnitude on the interpolated spectrum shape corresponding to the set of frequency and magnitude actually existing on the first spectrum shape SS11 is obtained.
[0062]
First, for a frequency and magnitude pair existing on the first spectrum shape SS11, specifically frequency $f_{i1}$ and magnitude $S_1(f_{i1})$, the corresponding frequency $f_{i1,2}$ and magnitude $S_2(f_{i1,2})$ on the second spectrum shape are calculated as follows.
[Equation 3]

Here,
$W_1 = RB_{i+1,1} - RB_{i,1}$
$W_2 = RB_{i+1,2} - RB_{i,2}$
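The equation itself is not reproduced in this text. Assuming that a frequency keeps its relative position between the two anchor points of region Zi when it is mapped from one spectrum shape to the other, a form consistent with the definitions of W1 and W2 above would be the following. This is a reconstruction under that assumption, not the original equation.

```latex
f_{i1,2} = \left( f_{i1} - RB_{i,1} \right) \frac{W_2}{W_1} + RB_{i,2}
```

With this form, the anchor point $RB_{i,1}$ maps to $RB_{i,2}$ and $RB_{i+1,1}$ maps to $RB_{i+1,2}$.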
In calculating the magnitude $S_2(f_{i1,2})$, the frequencies closest to $f_{i1,2}$ on either side, among the frequency and magnitude pairs existing on the second spectrum shape SS12, are denoted with (+) and (-) suffixes. The magnitude is then given by the following.
[Equation 4]

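This equation is likewise not reproduced in this text. If the magnitude is taken by straight-line interpolation between the two neighbouring points, which is an assumption, it would read as follows, where $f^{(-)}$ and $f^{(+)}$ (symbols introduced here for illustration) are the nearest frequencies on the second spectrum shape SS12 below and above $f_{i1,2}$.

```latex
S_2(f_{i1,2}) = S_2\left(f^{(-)}\right)
  + \left\{ S_2\left(f^{(+)}\right) - S_2\left(f^{(-)}\right) \right\}
    \frac{f_{i1,2} - f^{(-)}}{f^{(+)} - f^{(-)}}
```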
[0063]
From the above, with the interpolation position denoted x, the frequency $f_{i1,x}$ and the magnitude $S_x(f_{i1,x})$ on the interpolated spectrum shape that correspond to a frequency and magnitude pair existing on the first spectrum shape SS11 are obtained as follows.
[Equation 5]

$S_x(f_{i1,x}) = S_1(f_{i1}) + \{S_2(f_{i1,2}) - S_1(f_{i1})\} \cdot x$
The same calculation is performed for all frequency and magnitude pairs on the first spectrum shape SS11.
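The expression for the frequency $f_{i1,x}$ is not reproduced in this text. One form that is consistent with the spectrum transition functions ftrans1 (x) and ftrans2 (x) defined earlier, offered here as a reconstruction rather than as the original equation, places the point at the same relative position inside the region whose boundaries have moved to ftrans1 (x) and ftrans2 (x). The symbol $W_x$ is introduced here for illustration.

```latex
W_x = f_{trans2}(x) - f_{trans1}(x), \qquad
f_{i1,x} = \left( f_{i1} - RB_{i,1} \right) \frac{W_x}{W_1} + f_{trans1}(x)
```

Under this form, x = 0 returns the original frequency $f_{i1}$, and x = 1 returns the corresponding frequency $f_{i1,2}$ on the second spectrum shape SS12.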
Subsequently, a set of frequency and magnitude on the interpolated spectrum shape corresponding to the set of frequency and magnitude actually existing on the second spectrum shape SS12 is obtained.
[0064]
First, for a frequency and magnitude pair existing on the second spectrum shape SS12, specifically frequency $f_{i2}$ and magnitude $S_2(f_{i2})$, the corresponding frequency $f_{i2,1}$ and magnitude $S_1(f_{i2,1})$ on the first spectrum shape are calculated as follows.
[Equation 6]

Here,
$W_1 = RB_{i+1,1} - RB_{i,1}$
$W_2 = RB_{i+1,2} - RB_{i,2}$
In calculating the magnitude $S_1(f_{i2,1})$, the frequencies closest to $f_{i2,1}$ on either side, among the frequency and magnitude pairs existing on the first spectrum shape SS11, are denoted with (+) and (-) suffixes. The magnitude is then given by the following.
[Equation 7]

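Neither of the two preceding equations is reproduced in this text. By symmetry with the mapping from the first spectrum shape to the second, and again assuming straight-line interpolation between neighbouring points, they would take the following form. This is a reconstruction; $f^{(-)}$ and $f^{(+)}$ are symbols introduced here for the nearest frequencies on the first spectrum shape SS11 below and above $f_{i2,1}$.

```latex
f_{i2,1} = \left( f_{i2} - RB_{i,2} \right) \frac{W_1}{W_2} + RB_{i,1}

S_1(f_{i2,1}) = S_1\left(f^{(-)}\right)
  + \left\{ S_1\left(f^{(+)}\right) - S_1\left(f^{(-)}\right) \right\}
    \frac{f_{i2,1} - f^{(-)}}{f^{(+)} - f^{(-)}}
```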
From the above, with the interpolation position denoted x, the frequency $f_{i2,x}$ and the magnitude $S_x(f_{i2,x})$ on the interpolated spectrum shape that correspond to a frequency and magnitude pair actually existing on the second spectrum shape SS12 are obtained as follows.
[Equation 8]

$S_x(f_{i2,x}) = S_2(f_{i2}) + \{S_2(f_{i2}) - S_1(f_{i2,1})\} \cdot (x - 1)$
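The expression for the frequency $f_{i2,x}$ is not reproduced in this text. A form consistent with the spectrum transition functions ftrans1 (x) and ftrans2 (x), given here as a reconstruction and not as the original equation, is the following, with $W_x$ introduced for illustration as the width of the region at interpolation position x.

```latex
W_x = f_{trans2}(x) - f_{trans1}(x), \qquad
f_{i2,x} = \left( f_{i2} - RB_{i,2} \right) \frac{W_x}{W_2} + f_{trans1}(x)
```

Under this form, x = 1 returns the original frequency $f_{i2}$, and x = 0 returns the corresponding frequency on the first spectrum shape SS11.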
[0065]
Similarly, calculation is performed for all frequency and magnitude sets on the second spectrum shape SS12.
As described above, the interpolated spectrum shape is obtained by sorting into frequency order all of the calculated results: the frequencies $f_{i1,x}$ and magnitudes $S_x(f_{i1,x})$ on the interpolated spectrum shape that correspond to the pairs of frequency $f_{i1}$ and magnitude $S_1(f_{i1})$ existing on the first spectrum shape SS11, and the frequencies $f_{i2,x}$ and magnitudes $S_x(f_{i2,x})$ that correspond to the pairs of frequency $f_{i2}$ and magnitude $S_2(f_{i2})$ existing on the second spectrum shape SS12.
These are performed for all the regions Z1, Z2,... To calculate the interpolated spectrum shape for all frequency bands.
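The region-by-region procedure can be sketched in code as below. This is an illustration under stated assumptions rather than the implementation of the device: spectrum shapes are plain lists of (frequency, magnitude) pairs, both shapes have the same number of anchor points, the anchor points span every point of each shape, no region has zero width, magnitudes at mapped frequencies are taken by straight-line interpolation between neighbouring points, and the transition functions are linear. The names `magnitude_at` and `interpolate_shapes` are hypothetical.

```python
import bisect

def magnitude_at(shape, f):
    """Linearly interpolate the magnitude of `shape` at frequency f.

    shape: list of (frequency, magnitude) pairs sorted by frequency.
    """
    freqs = [p[0] for p in shape]
    k = bisect.bisect_left(freqs, f)
    if k == 0:
        return shape[0][1]
    if k == len(shape):
        return shape[-1][1]
    (f_lo, s_lo), (f_hi, s_hi) = shape[k - 1], shape[k]
    return s_lo + (s_hi - s_lo) * (f - f_lo) / (f_hi - f_lo)

def interpolate_shapes(shape1, anchors1, shape2, anchors2, x):
    """Interpolate two spectrum shapes region by region at position x in [0, 1]."""
    if len(anchors1) != len(anchors2):
        raise ValueError("this sketch assumes the same number of anchor points")
    out = []
    for i in range(len(anchors1) - 1):
        lo1, hi1 = anchors1[i], anchors1[i + 1]
        lo2, hi2 = anchors2[i], anchors2[i + 1]
        w1, w2 = hi1 - lo1, hi2 - lo2
        lo_x = lo1 + (lo2 - lo1) * x      # ftrans1(x), linear
        hi_x = hi1 + (hi2 - hi1) * x      # ftrans2(x), linear
        wx = hi_x - lo_x
        last = i == len(anchors1) - 2     # the final region also owns its right edge
        # Points that exist on the first shape.
        for f, s in shape1:
            if lo1 <= f < hi1 or (last and f == hi1):
                s2 = magnitude_at(shape2, (f - lo1) * w2 / w1 + lo2)
                out.append(((f - lo1) * wx / w1 + lo_x, s + (s2 - s) * x))
        # Points that exist on the second shape.
        for f, s in shape2:
            if lo2 <= f < hi2 or (last and f == hi2):
                s1 = magnitude_at(shape1, (f - lo2) * w1 / w2 + lo1)
                out.append(((f - lo2) * wx / w2 + lo_x, s + (s - s1) * (x - 1)))
    return sorted(out)  # rearrange all results in order of frequency

# Two toy shapes whose single peak sits at 100 Hz and at 150 Hz respectively.
shape1, anchors1 = [(0.0, 1.0), (100.0, 3.0), (200.0, 1.0)], [0.0, 100.0, 200.0]
shape2, anchors2 = [(0.0, 2.0), (150.0, 6.0), (200.0, 2.0)], [0.0, 150.0, 200.0]
mid = interpolate_shapes(shape1, anchors1, shape2, anchors2, 0.5)
```

Because the anchor point that marks the peak moves from 100 Hz towards 150 Hz, the interpolated peak at x = 0.5 lies at 125 Hz with a magnitude halfway between the two peaks, instead of two separate peaks being cross-faded in place.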
In the above example, the spectrum transition functions ftrans1 (x) and ftrans2 (x) are linear functions, but they may also be defined as nonlinear functions such as quadratic or exponential functions, or the changes corresponding to such functions may be prepared as a table.
[0066]
Further, it is possible to perform spectrum interpolation closer to reality by changing the transition function according to the anchor point.
In this case, the contents of the target phoneme dictionary may be configured to include transition function information associated with the anchor point.
Furthermore, the transition function information may be set according to the transition-destination phoneme. That is, transition function Y is used when the transition-destination phoneme is phoneme B, and transition function Z is used when the transition-destination phoneme is phoneme C; this correspondence may be built into the target phoneme dictionary.
Furthermore, an optimum transition function may be set in real time in consideration of the pitch and spectrum shape of the singer and the target singer.
[0067]
[3] Overall operation
Next, the overall operation of the voice conversion device 10 will be described in order.
First, a signal input process is performed by the singing signal input unit 11, and a signal sung by the singer is input.
Subsequently, the recognition feature analysis unit 12 performs recognition feature analysis processing: in order to pass the singing signal SV input via the singing signal input unit 11 to the subsequent alignment processing unit 18, it performs vector quantization based on the codebook included in the recognition phoneme dictionary and calculates the feature vectors VC (mel cepstrum, difference mel cepstrum, energy, difference energy, voiceness (voiced sound likelihood), etc.).
The difference mel cepstrum is the difference between the mel cepstrum of the previous frame and that of the current frame. The difference energy is the difference between the signal energy of the previous frame and that of the current frame. The voiceness is a numerical value representing the likelihood of a voiced sound; it is obtained by comprehensively combining, or comprehensively weighting, values such as the number of zero crossings and the detection error obtained when performing pitch detection.
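The difference features and the zero-crossing count can be sketched as follows. The function names are hypothetical, and the way the cues are weighted into a single voiceness value is not given in the text, so it is not implemented here.

```python
def delta(current, previous):
    """Element-wise difference between the current and previous frame's features."""
    return [c - p for c, p in zip(current, previous)]

def zero_crossings(samples):
    """Count sign changes in a frame, one of the cues listed for voiceness."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

# Illustrative values only.
previous_mel_cepstrum = [1.0, 0.5, -0.2]
current_mel_cepstrum = [1.5, 0.25, -0.2]
difference_mel_cepstrum = delta(current_mel_cepstrum, previous_mel_cepstrum)
difference_energy = delta([0.75], [0.5])[0]
crossings = zero_crossings([1.0, -1.0, 1.0, -1.0])
```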
[0068]
On the other hand, the SMS analysis unit 13 performs SMS analysis on the singing signal SV input via the singing signal input unit 11 to obtain SMS frame data FSMS, and outputs the SMS frame data FSMS to the target decoder unit 20 and the morphing processing unit 21. Specifically, the following processing is performed on a waveform cut out with a window width that depends on the pitch.
(1) Fast Fourier transform (FFT) processing
(2) Peak detection processing
(3) Voiced / unvoiced discrimination processing and pitch detection processing
(4) Peak continuation processing
(5) Calculation of the sine wave component attributes (pitch, amplitude, and spectrum shape)
(6) Residual component calculation processing
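Steps (1) and (2) of the list above can be illustrated with a small standard-library sketch. A naive discrete Fourier transform stands in for the FFT, no window function is applied, and the peak criterion (a local maximum above a fixed threshold) is an assumption; the function names are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitude, O(n^2); a real system would use an FFT routine."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def detect_peaks(mags, threshold):
    """Return the bin indices that are local maxima above `threshold`."""
    return [
        k for k in range(1, len(mags) - 1)
        if mags[k] > threshold and mags[k - 1] < mags[k] >= mags[k + 1]
    ]

# Test frame: two sinusoids lying exactly on bins 4 and 12 of a 64-point frame.
n = 64
frame = [
    math.sin(2 * math.pi * 4 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
mags = magnitude_spectrum(frame)
peaks = detect_peaks(mags, threshold=1.0)
```

The detected peak bins would then be handed to the pitch detection and peak continuation steps, which are not sketched here.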
The alignment processing unit 18 uses the Viterbi algorithm to determine which part of the target song the singer is singing, from the various feature vectors VC output from the recognition feature analysis unit 12, the HMM of each phoneme from the recognition phoneme dictionary 14, and the phoneme notation information with duration included in the target behavior data.
Alignment information is thereby obtained, and as a result, the pitch, amplitude, and phoneme that the target singer would sing at that point can be detected.
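A forced alignment of this kind can be sketched with a simplified Viterbi search. In this illustration each phoneme of the song is a single state, every step either stays in the current state or advances to the next one, and transition probabilities are ignored; a real phoneme HMM would have several states per phoneme. The function name and the log-likelihood table are hypothetical.

```python
def viterbi_align(log_likelihood):
    """Align frames to a fixed left-to-right sequence of states.

    log_likelihood[t][s]: log-probability of frame t under state s, where the
    states are the phonemes of the target song in order.
    Returns the state index assigned to every frame.
    """
    n_frames, n_states = len(log_likelihood), len(log_likelihood[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_likelihood[0][0]          # the path starts in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            best, back[t][s] = (stay, s) if stay >= move else (move, s - 1)
            score[t][s] = best + log_likelihood[t][s]
    path = [n_states - 1]                       # the path ends in the last state
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Five frames, two phonemes: the first two frames fit phoneme 0 best.
frame_scores = [[-1, -5], [-1, -5], [-4, -1], [-5, -1], [-5, -1]]
alignment = viterbi_align(frame_scores)
```

The resulting state index per frame plays the role of the alignment information: it identifies the position in the target song for each input frame.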
[0069]
During this process, if the singer holds a phoneme longer than the target singer did, the alignment processing unit 18 determines that the singer is singing the phoneme beyond the duration given in the phoneme notation information with duration, and outputs the alignment information with an indication that loop processing is to be performed.
As a result, the target decoder unit 20 calculates the frame information of the target singer (pitch, amplitude, spectrum shape) from the alignment information output from the alignment processing unit 18 and the spectrum information contained in the target phoneme dictionary 19, and outputs the resulting spectrum shape SSTG, pitch TGP, and amplitude TGA to the morphing processing unit 21 as target frame data TGFL.
The morphing processing unit 21 performs morphing processing based on the target frame data TGFL output from the target decoder unit 20, the SMS frame data FSMS corresponding to the singing signal SV, and the similarity parameters input from the parameter control unit 16. It generates morphing frame data MFL having the desired spectrum shape, pitch, and amplitude according to the similarity parameters, and outputs it to the conversion processing unit 22.
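How a similarity parameter might steer the morphing can be sketched for pitch and amplitude as below. Linear blending with a single parameter is an assumption of this sketch; the text does not state the blending rule, and the frame layout and function name are hypothetical.

```python
def morph_frame(input_frame, target_frame, similarity):
    """Blend the pitch and amplitude of two frames.

    similarity = 0.0 keeps the singer's values and 1.0 yields the target's
    values (linear blending is assumed).
    """
    return {
        key: input_frame[key] + (target_frame[key] - input_frame[key]) * similarity
        for key in ("pitch", "amplitude")
    }

singer_frame = {"pitch": 220.0, "amplitude": 0.5}
target_frame = {"pitch": 440.0, "amplitude": 1.0}
morphed = morph_frame(singer_frame, target_frame, 0.5)
```

The spectrum shape could be blended in the same spirit, for instance with the region-wise spectrum interpolation described earlier, using the similarity parameter as the interpolation position.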
[0070]
The conversion processing unit 22 modifies the morphing frame data MFL in accordance with the conversion parameters from the parameter control unit 16 and outputs it as converted frame data MMFL to the SMS synthesis unit 23. In this case, a more realistic output sound can be obtained by performing spectral tilt correction according to the output amplitude.
Examples of the process performed by the conversion processing unit 22 include a process of eliminating even overtones.
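As a small illustration of such a modification, the sketch below zeroes every even-numbered harmonic in a list of harmonic magnitudes. Reading "even overtones" as the harmonics numbered 2, 4, 6, ... (with the fundamental as harmonic 1) is an assumption, and the function name is hypothetical.

```python
def remove_even_overtones(harmonic_magnitudes):
    """Zero the even-numbered harmonics.

    harmonic_magnitudes[0] is the fundamental (harmonic 1),
    harmonic_magnitudes[1] is harmonic 2, and so on.
    """
    return [
        0.0 if (index + 1) % 2 == 0 else magnitude
        for index, magnitude in enumerate(harmonic_magnitudes)
    ]

converted = remove_even_overtones([1.0, 0.8, 0.6, 0.4, 0.2])
```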
The SMS synthesis unit 23 converts the converted frame data MMFL into a frame spectrum, performs inverse fast Fourier transform (IFFT), overlap processing, and addition processing, and outputs the result to the selection unit 24 as a waveform signal SWAV.
When the voiced / unvoiced information from the SMS analysis unit 13 indicates that the voice of the singer corresponding to the singing signal SV is unvoiced, the selection unit 24 outputs the singing signal SV as is to the adding unit 27; when it indicates that the voice is voiced, the selection unit 24 outputs the waveform signal SWAV to the adding unit 27.
[0071]
In parallel with these operations, the sequencer 26 drives the sound source 25 under the control of the parameter control unit 16 to generate a music signal SMSC and outputs it to the adding unit 27.
The adding unit 27 mixes the waveform signal SWAV or the singing signal SV output from the selection unit 24 with the music signal SMSC output from the sound source 25 at an appropriate ratio, adds them, and outputs the result to the output unit 28.
The output unit 28 outputs a karaoke signal (voice + music) based on the output signal of the adding unit 27.
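The selection and mixing stages can be summarised in a short sketch. The function name and the two gain values are hypothetical; the text only says that the signals are mixed at an appropriate ratio.

```python
def select_and_mix(voiced, converted, original, music, voice_gain=0.5, music_gain=0.5):
    """Pick the converted waveform for voiced frames, the original otherwise,
    then mix the chosen voice signal with the music signal sample by sample."""
    voice = converted if voiced else original
    return [voice_gain * v + music_gain * m for v, m in zip(voice, music)]

converted_wave = [1.0, 1.0]   # stands in for the waveform signal SWAV
original_wave = [0.0, 0.0]    # stands in for the singing signal SV
music_wave = [0.5, 0.5]       # stands in for the music signal SMSC

voiced_out = select_and_mix(True, converted_wave, original_wave, music_wave)
unvoiced_out = select_and_mix(False, converted_wave, original_wave, music_wave)
```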
[0072]
[B] Second Embodiment
Next, a second embodiment of the present invention will be described. The second embodiment differs from the first embodiment in the following respect: in the target decoder unit of the first embodiment, the spectrum shape output to the morphing processing unit is calculated based on the target pitch and amplitude included in the target behavior data, whereas in the second embodiment it is calculated based on the singer's pitch and spectral tilt information.
Accordingly, the SMS analysis unit of the second embodiment must calculate the spectral tilt as a sine wave component attribute in addition to the pitch, amplitude, and spectrum shape; apart from the target decoder unit, however, the configuration of each part is the same as in the first embodiment.
[0073]
[1] Target decoder section
FIG. 7 shows a block diagram of the configuration of the target decoder unit of the second embodiment. In FIG. 7, the same parts as those in the first embodiment shown in FIG. 4 are denoted by the same reference numerals, and detailed description thereof is omitted.
The target decoder unit 50 comprises: a stable state / transition state determination unit 31; a frame memory unit 32; a first spectrum interpolation unit 33; a second spectrum interpolation unit 34; a transition function generation unit 35; a third spectrum interpolation unit 36; a temporal change adding unit 57 that, in order to make the output decoded frame more realistic, changes the fine structure of the spectrum shape along the time axis (for example, changes the magnitude gradually with time) based on the singer's pitch and the processed decoded frames stored in the frame memory unit 32; a spectrum tilt correction unit 58 that, in order to make the spectrum shape to which the temporal change has been added by the temporal change adding unit 57 more realistic, compares the singer's spectral tilt with the tilt of the already generated spectrum shape, corrects the spectral tilt, outputs the corrected spectrum shape as the target spectrum shape SSTG, and stores the target spectrum shape SSTG in the frame memory unit 32; and a target pitch / amplitude calculation unit 39.
[0074]
[2] Operation of the second embodiment
Since the operation of the second embodiment is generally the same as that of the first embodiment, only the operation of the main part will be described.
The temporal change adding unit 57 of the target decoder unit 50, in order to make the target frame (the output decoded frame) more realistic, changes the fine structure of the spectrum shape (the first spectrum shape SS1 or the fourth spectrum shape SS4) along the time axis (for example, changes the magnitude gradually with time) based on the singer's pitch and the processed decoded frames stored in the frame memory unit 32, and outputs the result to the spectrum tilt correction unit 58.
The spectrum tilt correction unit 58, in order to make the target spectrum shape SSTG output from the target decoder unit 50 more realistic, compares the singer's spectral tilt with the tilt of the already generated spectrum shape, corrects the spectral tilt of that spectrum shape, outputs the corrected spectrum shape as the target spectrum shape SSTG, and stores the target spectrum shape SSTG in the frame memory unit 32.
More specifically, a spectral tilt correction value (Tilt Correction value), which is the difference between the singer's spectral tilt and the spectral tilt of the generated target spectrum shape, is calculated, and a spectral tilt correction filter whose characteristic depends on this correction value, as shown in FIG. 8, is applied to the generated target spectrum shape.
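The correction can be sketched as follows. Measuring the tilt as a least-squares slope of magnitude (in dB) against frequency, and modelling the correction filter as a gain that changes linearly with frequency, are both assumptions of this sketch; the actual filter characteristic is the one shown in the drawings. The function names are hypothetical.

```python
def spectral_tilt(shape):
    """Least-squares slope of magnitude (dB) against frequency (Hz).

    shape: list of (frequency_hz, magnitude_db) pairs.
    """
    n = len(shape)
    mean_f = sum(f for f, _ in shape) / n
    mean_m = sum(m for _, m in shape) / n
    numerator = sum((f - mean_f) * (m - mean_m) for f, m in shape)
    denominator = sum((f - mean_f) ** 2 for f, _ in shape)
    return numerator / denominator

def correct_tilt(target_shape, singer_shape):
    """Tilt the target shape so that its slope matches the singer's slope."""
    correction = spectral_tilt(singer_shape) - spectral_tilt(target_shape)
    pivot = target_shape[0][0]  # the lowest frequency is left unchanged
    return [(f, m + correction * (f - pivot)) for f, m in target_shape]

singer_shape = [(0.0, 0.0), (1000.0, -10.0), (2000.0, -20.0)]
generated_target = [(0.0, 0.0), (1000.0, -5.0), (2000.0, -10.0)]
corrected_target = correct_tilt(generated_target, singer_shape)
```

In this example the singer's spectrum falls off twice as steeply as the generated target, so the correction lowers the high-frequency part of the target until the two slopes agree.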
This makes it possible to obtain a more natural spectrum shape.
[0075]
[C] Modification of Embodiment
[1] First modification
If the pitch and amplitude are held in advance as information separated into a static change component and a vibrato change component (with the vibrato held as speed and depth parameters), then even when, for example, the singer holds the same phoneme longer than the target singer did, a pitch and an amplitude with appropriate vibrato added can be generated, so that a naturally sustained sound is obtained.
This processing is performed because the result would otherwise become unnatural. For example, if the singer holds a sound longer than the target singer did, the vibrato would stop partway through; and if the singer changes the tempo relative to the target singer while the vibrato is not held as a separate component, raising the tempo would also speed up the vibrato.
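Keeping the vibrato as separate speed and depth parameters makes it possible to generate the pitch for any duration, as in the sketch below. A sinusoidal vibrato with its depth expressed in cents is an assumption, and the function name and parameter values are hypothetical.

```python
import math

def pitch_with_vibrato(static_pitch_hz, rate_hz, depth_cents, t_seconds):
    """Static pitch plus a sinusoidal vibrato of the given speed and depth."""
    cents = depth_cents * math.sin(2.0 * math.pi * rate_hz * t_seconds)
    return static_pitch_hz * 2.0 ** (cents / 1200.0)

# The vibrato keeps running for as long as the singer holds the note,
# independently of how long the target singer originally held it.
held_note = [pitch_with_vibrato(440.0, 5.0, 50.0, frame * 0.01) for frame in range(300)]
```

Because the vibrato speed is a parameter of its own, changing the tempo of the song changes only when the static component moves, not how fast the vibrato oscillates.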
[0076]
[2] Second modification
In the above description, the residual component of the target singer is not considered. When it is considered, however, retaining the residual component for every frame is not suited to the voice conversion device from the viewpoint of information compression.
Therefore, representative spectral envelopes may be prepared in advance for the residual, and index information for specifying these spectral envelopes may be provided.
More specifically, a residual spectrum envelope information index is provided as target behavior data. For example, the spectrum envelope with residual spectrum envelope information index = 1 is used while the elapsed singing time is from 0 to 2 seconds, and the spectrum envelope with residual spectrum envelope information index = 3 is used while the elapsed singing time is from 2 to 3 seconds.
If an actual residual spectrum is generated from the spectrum envelope corresponding to the residual spectrum envelope information index and used in the morphing process, the residual can be morphed.
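The index lookup by elapsed singing time can be sketched as follows, using the example values given above. The table layout and function name are hypothetical, and the behaviour exactly at a boundary time (here, 2 seconds selects the later index) is an assumption.

```python
import bisect

# (start time in seconds, residual spectrum envelope information index)
RESIDUAL_INDEX_TRACK = [(0.0, 1), (2.0, 3)]

def residual_envelope_index(elapsed_seconds, track=RESIDUAL_INDEX_TRACK):
    """Return the envelope index in force at the given elapsed singing time."""
    starts = [start for start, _ in track]
    k = bisect.bisect_right(starts, elapsed_seconds) - 1
    return track[max(k, 0)][1]

index_at_1s = residual_envelope_index(1.0)
index_at_2_5s = residual_envelope_index(2.5)
```

Only a handful of representative envelopes and this small table need to be stored, rather than a residual component for every frame.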
[0077]
[Effect of the invention]
According to the present invention, the voice of the singer can be made to resemble the target singer's voice and way of singing, and the volume of the target singer's analysis data can be reduced so that processing can be performed in real time.
[Brief description of the drawings]
FIG. 1 is a schematic configuration block diagram of a speech conversion apparatus according to an embodiment.
FIG. 2 is an explanatory diagram (part 1) of a target phoneme dictionary;
FIG. 3 is an explanatory diagram (part 2) of the target phoneme dictionary;
FIG. 4 is a schematic configuration block diagram of a target decoder unit of the first embodiment.
FIG. 5 is an explanatory diagram (part 1) of spectrum interpolation processing of a target decoder unit;
FIG. 6 is an explanatory diagram (part 2) of the spectrum interpolation process of the target decoder unit;
FIG. 7 is a schematic configuration block diagram of a target decoder unit according to a second embodiment.
FIG. 8 is a characteristic explanatory diagram of a spectral tilt correction filter according to a second embodiment.
[Explanation of symbols]
10 ... voice conversion device, 11 ... singing signal input unit, 12 ... recognition feature analysis unit, 13 ... SMS analysis unit, 14 ... recognition phoneme dictionary, 15 ... target behavior data, 16 ... parameter control unit, 17 ... data conversion unit, 18 ... alignment processing unit, 19 ... target phoneme dictionary, 20 ... target decoder unit, 21 ... morphing processing unit, 22 ... conversion processing unit, 23 ... SMS synthesis unit, 24 ... selection unit, 25 ... sound source, 26 ... sequencer, 27 ... adding unit, 28 ... output unit, 31 ... stable state / transition state determination unit, 32 ... frame memory unit, 33 ... first spectrum interpolation unit, 34 ... second spectrum interpolation unit, 35 ... transition function generation unit, 36 ... third spectrum interpolation unit, 37 ... temporal change adding unit, 38 ... spectrum tilt correction unit, 39 ... target pitch / amplitude calculation unit, 50 ... target decoder unit, 57 ... temporal change adding unit, 58 ... spectrum tilt correction unit, SS1 ... first spectrum shape, SS2 ... second spectrum shape, SS3 ... third spectrum shape, SS4 ... fourth spectrum shape, SSt ... time-change-added spectrum shape, SSTG ... target spectrum shape.

Claims

A voice conversion device comprising:
input frame data extraction means for extracting input frame data relating to a frequency spectrum from an input voice signal;
feature analysis means for extracting a feature vector from the input voice signal;
alignment processing means for analyzing the feature vector with a predetermined algorithm, associating it with target behavior data in which temporal changes in the pitch and phonemes of a target voice, which is the target of the voice conversion, are defined, and thereby determining the temporal position in the target behavior data that corresponds to the input frame data;
state determination means for determining whether the phoneme of the target voice in the target behavior data at the temporal position determined by the alignment processing means is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme;
spectrum shape interpolation means for, when the state determination means determines that the phoneme is in the transition state, and with respect to the phoneme of the target voice in the target behavior data at the temporal position determined by the alignment processing means: calculating a spectrum shape of the first phoneme by performing interpolation processing using, from among the spectrum shapes in a target phoneme dictionary that holds spectrum shapes corresponding to a plurality of pitches for each phoneme, two spectrum shapes that correspond to the first phoneme and to two pitches close to the pitch of the target voice in the target behavior data at the temporal position; calculating a spectrum shape of the second phoneme by performing interpolation processing using two spectrum shapes that correspond to the second phoneme and to two pitches close to the pitch of the target voice in the target behavior data at the temporal position; and calculating a spectrum shape corresponding to the input frame data using the spectrum shape of the first phoneme and the spectrum shape of the second phoneme; and
converted voice signal generating means for generating and outputting a converted voice signal based on the spectrum shape calculated by the spectrum shape interpolation means.

A voice conversion device comprising:
input frame data extraction means for extracting input frame data relating to a frequency spectrum from an input voice signal;
feature analysis means for extracting a feature vector from the input voice signal;
alignment processing means for analyzing the feature vector with a predetermined algorithm, associating it with target behavior data in which temporal changes in the pitch and phonemes of a target voice, which is the target of the voice conversion, are defined, and thereby determining the temporal position in the target behavior data that corresponds to the input frame data;
state determination means for determining whether the phoneme of the target voice in the target behavior data at the temporal position determined by the alignment processing means is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme;
spectrum shape interpolation means for, when the state determination means determines that the phoneme is in the transition state, and with respect to the phoneme of the target voice in the target behavior data at the temporal position determined by the alignment processing means: calculating a spectrum shape of the first phoneme by performing interpolation processing using, from among the spectrum shapes in a target phoneme dictionary that holds spectrum shapes corresponding to a plurality of pitches for each phoneme, two spectrum shapes that correspond to the first phoneme and to two pitches close to the pitch obtained from the input frame data; calculating a spectrum shape of the second phoneme by performing interpolation processing using two spectrum shapes that correspond to the second phoneme and to two pitches close to the pitch obtained from the input frame data; and calculating a spectrum shape corresponding to the input frame data using the spectrum shape of the first phoneme and the spectrum shape of the second phoneme; and
converted voice signal generating means for generating and outputting a converted voice signal based on the spectrum shape calculated by the spectrum shape interpolation means.

The voice conversion device according to claim 1 or 2,
wherein the spectrum shape interpolation means, when performing the interpolation processing using the two spectrum shapes, performs interpolation processing that uses a transition function between the two spectrum shapes.

The voice conversion device according to claim 3,
wherein the transition function is defined in advance as a linear function or a nonlinear function.

The voice conversion device according to claim 3,
wherein the two spectrum shapes are divided into a plurality of regions on the frequency axis, and the transition function is defined for each region.

The voice conversion device according to claim 3,
wherein the spectrum shape interpolation means defines the transition function in correspondence with the second phoneme.

The voice conversion device according to claim 3,
wherein the spectrum shape interpolation means divides the two spectrum shapes into a plurality of regions on the frequency axis and, over the plurality of regions, performs interpolation processing that uses a linear function as the transition function on each pair of actual frequency and magnitude on the two spectrum shapes belonging to each region.

The voice conversion device according to claim 7,
wherein the spectrum shape interpolation means comprises:
Frequency interpolation means for calculating an interpolated frequency by interpolating, using the linear function, a first frequency that is a frequency on one of the spectrum shapes belonging to each region and a second frequency that is the frequency on the other spectrum shape corresponding to the first frequency; and
Magnitude interpolation means for interpolating, using the linear function, a first magnitude that is the magnitude on the one spectrum shape corresponding to the first frequency and a second magnitude that is the magnitude on the other spectrum shape corresponding to the second frequency.
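
The separate frequency and magnitude interpolation recited above can be sketched as follows. This is a minimal illustration, assuming that the correspondence between points on the two spectrum shapes has already been established (point k on one shape corresponds to point k on the other) and that a single transition position `x` in [0, 1] applies to all regions. The function name `interpolate_points` and the data layout are hypothetical.

```python
def interpolate_points(shape_1, shape_2, x):
    """Linearly interpolate corresponding (frequency, magnitude) pairs.

    shape_1, shape_2: equal-length lists of (frequency_hz, magnitude_db)
    tuples, where entry k of shape_1 is assumed to correspond to entry k
    of shape_2. x = 0 returns shape_1 and x = 1 returns shape_2.
    """
    if len(shape_1) != len(shape_2):
        raise ValueError("the two spectrum shapes need the same number of points")
    result = []
    for (f1, m1), (f2, m2) in zip(shape_1, shape_2):
        freq = f1 + x * (f2 - f1)  # frequency interpolation
        mag = m1 + x * (m2 - m1)   # magnitude interpolation
        result.append((freq, mag))
    return result

first_shape = [(0.0, -10.0), (1000.0, -20.0)]
second_shape = [(0.0, -30.0), (2000.0, -40.0)]
quarter_way = interpolate_points(first_shape, second_shape, 0.25)
```

Interpolating the frequency as well as the magnitude lets a spectral feature such as a formant peak slide along the frequency axis during the transition instead of fading out at one frequency and fading in at another.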

The voice conversion device according to claim 1,
wherein the target behavior data further defines a temporal change in the amplitude of the target voice,
the device further comprises spectral tilt correction means for correcting the spectral tilt of the spectrum shape calculated by the spectrum shape interpolation means according to the amplitude at the temporal position in the target behavior data determined by the alignment processing means, and
the converted voice signal generating means generates and outputs the converted voice signal based on the spectrum shape whose spectral tilt has been corrected by the spectral tilt correction means.

The voice conversion device according to claim 2,
further comprising spectral tilt correction means for correcting the spectral tilt of the spectrum shape calculated by the spectrum shape interpolation means according to the result of comparing the spectral tilt of that spectrum shape with the spectral tilt of the spectrum shape obtained from the input frame data.
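
One way to picture a tilt correction driven by such a comparison is the sketch below. The claim does not state how the tilt is measured or how the correction is applied, so everything here is an assumption: tilt is estimated as the least-squares slope of magnitude (dB) over frequency (Hz), and the correction adds a linear-in-frequency offset that moves the calculated shape's tilt toward the input's tilt. The names `spectral_tilt`, `correct_tilt`, and the `strength` parameter are hypothetical.

```python
def spectral_tilt(freqs, mags_db):
    """Least-squares slope (dB per Hz) of magnitude versus frequency."""
    n = len(freqs)
    mean_f = sum(freqs) / n
    mean_m = sum(mags_db) / n
    num = sum((f - mean_f) * (m - mean_m) for f, m in zip(freqs, mags_db))
    den = sum((f - mean_f) ** 2 for f in freqs)
    return num / den

def correct_tilt(freqs, calculated_db, input_db, strength=1.0):
    """Shift the calculated shape's tilt toward the input shape's tilt.

    strength = 1.0 matches the input tilt fully and 0.0 leaves the
    calculated shape unchanged (an assumed control, not from the claim).
    The lowest frequency is used as the pivot, so its magnitude is kept.
    """
    delta = strength * (spectral_tilt(freqs, input_db) - spectral_tilt(freqs, calculated_db))
    pivot = freqs[0]
    return [m + delta * (f - pivot) for f, m in zip(freqs, calculated_db)]

freqs = [0.0, 1000.0, 2000.0]
calculated = [0.0, -10.0, -20.0]   # tilt of -0.01 dB/Hz
from_input = [0.0, -20.0, -40.0]   # tilt of -0.02 dB/Hz
corrected = correct_tilt(freqs, calculated, from_input)
```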

An input frame data extraction process for extracting input frame data related to the frequency spectrum from the input speech signal;
A feature analysis process for extracting a feature vector from the input speech signal;
An alignment process for determining the temporal position in target behavior data that corresponds to the input frame data, by analyzing the feature vector with a predetermined algorithm and associating it with the target behavior data, in which the temporal changes in the pitch and phonemes of the target voice to be converted to are defined;
A state determination process for determining whether the phoneme of the target voice in the target behavior data at the temporal position determined in the alignment process is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme;
A spectrum shape interpolation process for, when the phoneme is determined to be in the transition state in the state determination process, and for the phoneme of the target voice in the target behavior data at the temporal position determined in the alignment process: calculating a spectrum shape of the first phoneme by performing interpolation processing using, from among the spectrum shapes of a target phoneme dictionary having spectrum shapes corresponding to a plurality of pitches for each phoneme, two spectrum shapes that correspond to the first phoneme and to two pitches close to the pitch of the target voice in the target behavior data at the temporal position; calculating a spectrum shape of the second phoneme by performing interpolation processing using two spectrum shapes that correspond to the second phoneme and to two pitches close to the pitch of the target voice in the target behavior data at the temporal position; and calculating a spectrum shape corresponding to the input frame data using the spectrum shape of the first phoneme and the spectrum shape of the second phoneme;
A voice conversion method comprising: a converted voice signal generation process for generating and outputting a converted voice signal based on the spectrum shape calculated in the spectrum shape interpolation process.

An input frame data extraction process for extracting input frame data related to the frequency spectrum from the input speech signal;
A feature analysis process for extracting a feature vector from the input speech signal;
An alignment process for determining the temporal position in target behavior data that corresponds to the input frame data, by analyzing the feature vector with a predetermined algorithm and associating it with the target behavior data, in which the temporal changes in the pitch and phonemes of the target voice to be converted to are defined;
A state determination process for determining whether the phoneme of the target voice in the target behavior data at the temporal position determined in the alignment process is in a stable state or in a transition state partway through a transition from a first phoneme to a second phoneme;
A spectrum shape interpolation process for, when the phoneme is determined to be in the transition state in the state determination process, and for the phoneme of the target voice in the target behavior data at the temporal position determined in the alignment process: calculating a spectrum shape of the first phoneme by performing interpolation processing using, from among the spectrum shapes of a target phoneme dictionary having spectrum shapes corresponding to a plurality of pitches for each phoneme, two spectrum shapes that correspond to the first phoneme and to two pitches close to the pitch obtained from the input frame data; calculating a spectrum shape of the second phoneme by performing interpolation processing using two spectrum shapes that correspond to the second phoneme and to two pitches close to the pitch obtained from the input frame data; and calculating a spectrum shape corresponding to the input frame data using the spectrum shape of the first phoneme and the spectrum shape of the second phoneme;
A voice conversion method comprising: a converted voice signal generation process for generating and outputting a converted voice signal based on the spectrum shape calculated in the spectrum shape interpolation process.

The voice conversion method according to claim 11 or 12,
wherein the spectrum shape interpolation process performs the interpolation processing using a transition function between the two spectrum shapes when performing the interpolation using the two spectrum shapes.

The voice conversion method according to claim 13,
wherein the transition function is defined in advance as a linear function or a nonlinear function.

The voice conversion method according to claim 13,
wherein the two spectrum shapes are divided into a plurality of regions on the frequency axis, and the transition function is defined for each region.

The voice conversion method according to claim 13,
wherein the spectrum shape interpolation process defines the transition function in correspondence with the second phoneme.

The voice conversion method according to claim 13,
wherein, in the spectrum shape interpolation process, the two spectrum shapes are divided into a plurality of regions on the frequency axis and, over the plurality of regions, interpolation processing that uses a linear function as the transition function is performed on each pair of actual frequency and magnitude on the two spectrum shapes belonging to each region.

The voice conversion method according to claim 17,
wherein the spectrum shape interpolation process comprises:
A frequency interpolation process for calculating an interpolated frequency by interpolating, using the linear function, a first frequency that is a frequency on one of the spectrum shapes belonging to each region and a second frequency that is the frequency on the other spectrum shape corresponding to the first frequency; and
A magnitude interpolation process for interpolating, using the linear function, a first magnitude that is the magnitude on the one spectrum shape corresponding to the first frequency and a second magnitude that is the magnitude on the other spectrum shape corresponding to the second frequency.

The voice conversion method according to claim 11,
wherein the target behavior data further defines a temporal change in the amplitude of the target voice,
the method further comprises a spectral tilt correction process for correcting the spectral tilt of the spectrum shape calculated in the spectrum shape interpolation process according to the amplitude at the temporal position in the target behavior data determined in the alignment process, and
the converted voice signal generation process generates and outputs the converted voice signal based on the spectrum shape whose spectral tilt has been corrected in the spectral tilt correction process.

The voice conversion method according to claim 12,
further comprising a spectral tilt correction process for correcting the spectral tilt of the spectrum shape calculated in the spectrum shape interpolation process according to the result of comparing the spectral tilt of that spectrum shape with the spectral tilt of the spectrum shape obtained from the input frame data.