JP3555490B2

JP3555490B2 - Voice conversion system

Info

Publication number: JP3555490B2
Application number: JP08272899A
Authority: JP
Inventors: 章寺澤; 博昭竹山; 聖今井
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 1999-03-26
Filing date: 1999-03-26
Publication date: 2004-08-18
Anticipated expiration: 2019-03-26
Also published as: JP2000276200A

Abstract

PROBLEM TO BE SOLVED: To provide a voice quality conversion system that converts the voice quality of an input voice signal in nearly real time using signal processing techniques. SOLUTION: A voice analysis section 1 extracts the frequency spectrum of the input voice signal, and a voiced sound detecting section 2 discriminates voiced sound. A fundamental frequency detecting section 3 detects the fundamental frequency present in each interval that section 2 has judged to be voiced. A fundamental frequency processing section 4 converts the fundamental frequency detected by section 3 to a lower frequency. A sound source signal generating section 5 generates, based on the detection result of section 2, the sound source signal used to synthesize voice. A voice synthesizing section 7 synthesizes and outputs a voice signal from the sound source signal output by section 5 and from the frequency spectrum obtained in section 1 after a frequency process controlling section 6 has shifted it toward lower frequencies.

Description

【０００１】
【発明の属する技術分野】
本発明は、声質変換システムに関するものである。
【０００２】
【従来の技術】
現在、音声合成技術の応用が盛んに進められ、特にマルチメディア技術への応用としてインターネットでの情報提供のための音声活用や、カーナビゲーションのための音声メッセージ等の製品が開発されつつある。これら音声情報提供に対して、利用者の好みに合わせて音声を選択したい、利用者自身の発声を別の声質に変換して相手に伝えたい等の要求が利用者から出ている。これらの要求に対して、利用者の好みに合わせた音声情報提供や任意話者への声質変換する声質変換システムとしては、特開平９−２９２８９８号、特開平９−２５８７７９号、特開平９−３０５１９７号等に示されるシステムがある。これらの従来のシステムは、予め記憶しておいた平均標準パターンやターゲット話者音声パターンと入力音声のマッチングを行うことにより、声質を変換することを特徴としている。ところが、これらの方式では、変換に要する様々な音声パターンを予め記憶させておく必要があり、また入力音声と記憶させておいた音声パターンとの照合を行うための演算量が必要であると考えられる。そのため、これら従来装置を実現するためには、膨大な記憶容量をもつメモリと極めて処理能力の高い演算処理装置が必要である。
【０００３】
【発明が解決しようとする課題】
実際、声質変換システムを活用しようとした場合、膨大な音声パターンの記憶メモリ容量と処理能力の高い演算処理装置が必要である点は、活用しようとする製品の選択に支障をきたす恐れがある。また、声質変換後の音声は特定話者へ声質変換する必要もなく、単に利用者自身の発声を別の声質に変換すればよい分野も多い。
【０００４】
例えば住戸外に取り付けられるカメラ付きドアホンと住戸内に取り付けられるモニタ付きインターホンから構成されるインターホンシステムにおいて、住戸内の住人の性別にかかわらず、男性の声で住戸外の来客と応答することができるようにする場合等がある。また電話機の受話口に取り付けるアダプタ形状の装置等により、電話機に任意に取り付けて、相手によっては応対時の音声を男性の声にするような場合等がある。
【０００５】
本発明は、上記のような点に鑑みて為されたもので、その目的とするところは入力音声をメモリに記憶しておく音声パターンに変換するのではなく、入力音声信号を信号処理技術を活用して略リアルタイミングで声質を変換することができる声質変換システムを提供することにある。
【０００６】
【課題を解決するための手段】
上記目的を達成するために、請求項１の発明では、音声分析処理、音声合成処理の際に、周波数軸変換処理を制御するための周波数処理制御部と、入力音声を上記周波数処理制御部の制御により音声分析する音声分析部と、上記音声分析部による音声分析により得られる音声特徴パラメータから入力音声が有声音か否かを判別する有声音検出部と、上記有声音検出部で有声音と検出した場合、入力音声の基本周波数を検出する基本周波数検出部と、上記基本周波数検出部で検出された基本周波数を逓倍して基本周波数変換を行う基本周波数処理部と、上記有声音検出部で有声音を検出した場合、基本周波数処理部で基本周波数変換された得られた基本周波数に応じてパルス信号を発生させ、有声音を検出しなかった場合、白色雑音信号を発生させ、これらパルス信号及び白色雑音信号を音源信号として出力する音源信号発生部と、上記音声分析部で音声分析することにより得られた特徴パラメータと、上記音源信号発生部から得られた音源信号とを用いて、上記周波数処理制御部による周波数制御に従い音声合成する音声合成部とから成ることを特徴とする。
【０００７】
請求項２の発明では、請求項１の発明において、上記音声分析部をＭＬＳＡ分析フィルタで構成し、上記音声合成部をＭＬＳＡ合成フィルタで構成し、メル周波数軸変換をメルケプストラム分析時と音声合成時とで変えることで周波数軸変換処理を行うことを特徴とする。
【０００８】
請求項３の発明では、請求項１の発明において、上記音声分析部をフーリエ変換分析を用いたメルケプストラム分析部で構成し、上記音声合成部をＭＬＳＡ合成フィルタで構成し、メル周波数軸変換をメルケプストラム分析時と音声合成時とで変えることで周波数軸変換処理を行うことを特徴とする。
【０００９】
請求項４の発明では、請求項１乃至３の何れかの発明において、上記有声音検出部は、上記音声分析部により得られた音声特徴パラメータをフーリエ変換により周波数軸上のパラメータに変換して、所望周波数帯域の入力音声レベルを検出し、該検出レベルが閾値よりも大きい場合に有声音検出とすることを特徴とする。
【００１０】
請求項５の発明では、請求項１乃至３の何れかの発明において、上記有声音検出部は、上記音声分析部により得られた音声特徴パラメータを近似フーリエ変換により周波数軸上のパラメータに変換して、所望周波数帯域での入力音声レベルを検出し、該検出レベルが閾値よりも大きい場合に有声音検出とすることを特徴とする。
【００１１】
請求項６の発明では、請求項１乃至３の何れかの発明において、上記有声音検出部は、音声分析パラメータの対数パワーを用いて、対数パワー値が閾値よりも大きい場合に有声音検出とすることを特徴とする。
【００１２】
請求項７の発明では、請求項４乃至６の何れかの発明において、上記閾値は、音声入力信号に応じて設定されることを特徴とする。
【００１３】
請求項８の発明では、請求項２の発明において、上記基本周波数検出部は、上記ＭＬＳＡ分析フィルタから出力される残差信号の自己相関を用いたピーク検出の間隔により基本周波数を検出することを特徴とする。
【００１４】
請求項９の発明では、請求項３の発明において、上記基本周波数検出部は、上記メルケプストラム分析部により得られるメルケプストラムパラメータの高次成分のピーク検出の間隔により基本周波数を検出することを特徴とする。
【００１５】
請求項１０の発明では、請求項２の発明において、上記基本周波数検出部は、上記ＭＬＳＡ分析フィルタから出力される残差信号の零交差数解析により基本周波数を検出することを特徴とする。
【００１６】
請求項１１の発明では、請求項２の発明において、上記基本周波数検出部は、上記ＭＬＳＡ分析フィルタから出力される残差信号を入力とするニューラルネットワークにより基本周波数を推定検出することを特徴とする。
【００１７】
請求項１２の発明では、請求項１乃至１１の何れかの発明において、上記基本周波数検出部により検出した基本周波数と１時刻前の基本周波数との傾きが予め設定した傾き範囲を越えた場合に、該傾き範囲に入るように上記検出した基本周波数を補正する基本周波数補正処理部を付設したことを特徴とする。
【００１８】
請求項１３の発明では、請求項１乃至１２の何れかの発明において、上記基本周波数処理部は、検出される基本周波数に応じた、基本周波数変換処理を行うことを特徴とする。
【００１９】
請求項１４の発明では、請求項１乃至１３の何れかの発明において、上記音源信号発生部は、発生させるパルス信号の振幅に応じて上記白色雑音信号の振幅を制御することを特徴とする。
【００２０】
請求項１５の発明では、請求項１乃至１４の何れかの発明において、上記音声合成部より出力される合成音声信号に対してダウンサンプリングを行って再生音声の周波数帯域の制限を加えた合成音声信号を出力するダウンサンプリング部を付設したことを特徴とする。
【００２１】
【発明の実施の形態】
以下本発明を実施形態により説明する。
【００２２】
（実施形態１）
本実施形態装置は、図１に示す構成を基本構成とし、図示するように音声分析部１と、有声音検出部２と、基本周波数検出部３と、基本周波数処理部４と、音源信号発生部５と、周波数処理制御部６と、音声合成部７とから構成されており、音声分析部１では、入力音声信号の周波数スペクトルを音声特徴パラメータとして抽出し、有声音検出部２では、上記音声分析部１で抽出された周波数スペクトル（音声特徴パラメータ）を利用して有声音判別を行う。また、有声音検出部２で有声音と判別された音声区間は、入力音声信号に周期性のある基本周波数が存在すると考えられるので、基本周波数検出部３で基本周波数の検出を行う。ここで女性音声は、男性音声に比べて基本周波数が高いため、基本周波数処理部４では基本周波数検出部３で検出された基本周波数を低い周波数に変換する。音源信号発生部５では、有声音検出部２の検出結果に基づき、有声音検出区間において、基本周波数処理部４で処理された基本周波数に従いパルス信号を発生させ、それ以外の区間において、白色雑音信号を発生させ、これら信号を音声を合成するための音源信号としして出力する。音声合成部７では、上記音声分析部１で得られた周波数スペクトル（音声特徴パラメータ）を周波数処理制御部６で低域側に周波数シフトした周波数スペクトルと音源信号発生部５により発生させた音源信号を用いて音声を合成して合成音声信号を出力する。
【００２３】
ここで本実施形態を、インターホンに組み込んだり、電話機にアダプタとして付加し、通話を行う際に、発話者の希望に応じて当該システムを動作させ、発話者の音声信号を入力音声信号として入力し、その入力音声信号に基づいて、上述のように音声合成を行うことにより、略リアルタイムに声質を変換して通話することが可能となる。また、声質を変換して通話することにより、女性の単身住宅でも男性の声質で対応できるため、簡易的な防犯が可能となる。さらに、計算量が少なく実現することが可能であり、またメモリ量もほとんど必要ない。
【００２４】
（実施形態２）
本実施形態では、基本構成としては実施形態１と同じであるが、音声特徴パラメータを音声分析により抽出する音声分析部としてリアルタイムで適応メルケプストラム分析を行う適応デジタルフィルタであるＭＬＳＡ分析フィルタ１００を用い、音声合成部として、ＭＬＳＡ合成フィルタ７０を用いて構成する。
【００２５】
ＭＬＳＡ分析フィルタ１００及びＭＬＳＡ合成フィルタ７０はメル周波数軸上の分析処理を活用しているものである。ＭＬＳＡ分析フィルタはｐａｄｅ近似によりメル対数スペクトルを近似するデジタルフィルタであり、メル尺度を規定するパラメータαとメル対数化プストラム係数ｂ（ｍ）からなる複数の基本フィルタＦ（ｚ）＜図３（ａ）参照＞と、ｐａｄｅ係数ｐ_１ …とから図３（ｂ）のように構成される。また適応デジタルフィルタ技術を用いて、入力音声信号に応じて適応的にメル対数ケプストラム係数ｂ（ｍ）を基本フィルタＦ（ｚ）で算出することにより、ＭＬＳＡ分析フィルタ１００は、入力音声信号のメル対数スペクトルモデルを適応的に近似するフィルタとなり、その出力として、残差信号が得られる。特に、メル尺度を規定するパラメータαの選択により、人間の聴覚特性を生かした適応デジタルフィルタであると言える。そのため、従来の音声分析法に比べて分析次数を減らすことができ、例えば８ｋＨｚサンプリングでは、ｍ＝１２、α＝０．３１にとることにより、略リアルタイムで人間の聴覚特性に合わせた音声分析が行える。
【００２６】
音声合成部を構成するＭＬＳＡ合成フィルタ７０は、ＭＬＳＡ分析フィルタ１００の逆フィルタであり、メル周波数軸上の分析処理を活用しており、該周波数軸の伸縮を利用し、周波数処理制御部６では、メル周波数軸変換の伸縮パラメータを制御する。
【００２７】
而して本実施形態では、入力音声信号から音声分析部であるＭＬＳＡ分析フィルタ１００は音声分析して、メルケプストラムパラメータを音声特徴パラメータとして有声音検出部２へ出力する。このメルケプストラムパラメータに基づいて有声音検出部２では有声音判別を行い、一方有声音区間に対応して基本周波数検出部３ではＭＬＳＡ分析フィルタ１００からの残差信号から基本周波数の検出を行う。音源信号発生部５では、有声音検出部２の検出結果に基づき、有声音検出区間において、基本周波数処理部４で処理された基本周波数に従いパルス信号を発振出力し、それ以外の区間において、白色雑音信号を発振出力し、これら発振出力を音声合成のための音源信号としてＭＬＳＡ合成フィルタ７０へ出力する。ＭＬＳＡ合成フィルタ７０では、ＭＬＳＡ分析フィルタ１００からのメルケプストラムパラメータと、音源信号とを用いて、周波数処理制御部６の周波数制御処理による制御に従い音声合成を行い、合成音声信号を出力する。
【００２８】
ここで本実施形態に用いることができる基本周波数検出部３の例を次に説明する。
【００２９】
例１
図４は本例を示しており、本例の基本周波数検出部３は、図示するようにＭＬＳＡ分析フィルタ１００から出力される残差信号の自己相関を基本周波数が存在すると考えられる区間に対して計算する自己相関計算部３０と、該自己相関計算部３０で計算された自己関数のピークが出現する区間を検出するピーク検出部３１と、該ピーク検出部３１により検出された区間を用いて基本周波数を算出する基本周波数算出部３２とにより構成される。
【００３０】
本例の基本周波数検出部３では、ＭＬＳＡ分析フィルタ１００から出力される残差信号を利用することで、入力音声信号レベルを吸収することが可能となるため、常に一定に検出精度で基本周波数の検出が可能となる。
【００３１】
例２
本例の基本周波数検出部３は図５に示すようにＭＬＳＡ分析フィルタ１００から出力される残差信号の零交差を解析して零交差数値を求める零交差解析部３３と、零交差数値から基本周波数を算出する基本周波数算出部３４とにより構成される。
【００３２】
例３
本例の基本周波数検出部３は図６に示すようにＭＬＳＡ分析フィルタ１００から出力される残差信号を入力とする基本周波数検出ニューラルネットワーク３５からなり、この基本周波数検出ニューラルネットワーク３５は入力音声信号に対応したピッチの値を出力するように予め学習が行われているものであって、基本周波数を推定する。
【００３３】
上記の例１〜３の何れの基本周波数検出部３もＭＬＳＡ分析フィルタ１００の残差信号を利用することで、入力音声信号レベルを吸収することが可能となり、そのため常に一定に検出精度で基本周波数の検出ができることになる。
【００３４】
またＭＬＳＡ分析フィルタ１００による適応デジタルフィルタの精度の高い分析結果を利用することにより、高い精度の検出が可能となる。
【００３５】
更に例３の場合には、残差信号を予め学習した基本周波数検出ニューラルネットワーク３５を利用しているため、ニューラルネットワーク構成時の統計的な検出を行うことが可能となり、その結果精度の高い基本周波数の検出ができることになる。
【００３６】
（実施形態３）
上記実施形態２では音声合成部をＭＬＳＡ分析フィルタ１００で構成しているが、本実施形態では図７に示すようにメルケプストラム分析部１０１により構成している点で実施形態１とは相違する。
【００３７】
メルケプストラム分析部１０１は、入力音声信号に対してフーリエ変換、対数変換、メル周波数軸変換、逆フーリエ変換を行うメルケプストラム分析を行い、音声特徴パラメータとしてメルケプストラムパラメータを抽出するもので、音声合成部を構成するＭＬＳＡ合成フィルタ７０と互いにメル周波数軸上の分析処理を活用しており、その周波数軸の伸縮を利用し、周波数処理制御部６ではメル周波数軸変換のパラメータを制御するようになっている。
【００３８】
また基本周波数検出部３は、例えば図８に示すようにメルケプストラム分析部１０１から出力されるメルケプストラムパラメータの内、高次数部（高ケフレンシー部パラメータ）のピーク検出をピーク検出部３６で行い、その検出されたピークの区間から基本周波数算出部３７で基本周波数を算出するようになっている。その他の構成は実施形態２と同じであるので、ここでは説明は省略する。
【００３９】
而して本実施形態ではメルケプストラム分析部１０１から抽出される音声特徴パラメータであるメルケプストラムパラメータに基づいて有声音検出部２により有声音検出を行い、基本周波数検出部３で基本周波数を検出する。音源信号発生部５では実施形態２と同様に、有声音検出部２の検出結果に基づき、有声音検出区間において、基本周波数処理部４で処理された基本周波数に従いパルス信号を発振出力し、それ以外の区間において、白色雑音信号を発振出力し、これら発振出力を音声合成のための音源信号としてＭＬＳＡ合成フィルタ７０へ出力する。ＭＬＳＡ合成フィルタ７０では、メルケプストラム分析部１０１からのメルケプストラムパラメータと、音源信号とを用いて、周波数処理制御部６の周波数制御処理による制御に従い音声合成を行い、合成音声信号を出力する。
【００４０】
ここでメルケプストラムパラメータを用いて有声音を検出する本実施形態（上記実施形態２）に用いることができる有声検出部２の例を次に示す。
【００４１】
例１
本例の有声検出部２は図９に示すようにメルケプストラムパラメータをフーリエ変換し、メル対数軸上のスペクトルに変換するフーリエ変換部２０と、その変換結果から得られるメル対数スペクトルの指定周波数帯域、例えば図１０に示す８０Ｈｚ〜６００Ｈｚのレベル検出を行うレベル検出部２１と、検出したレベル値を予め設定しておいた有声音検出閾値とを比較してその閾値より入力音声のレベル値が大きい場合有声音を検出したとする比較部２２とで構成される。図１０はメル対数軸上のメル対数スペクトルの例と上述した指定周波数帯域の例を示しており、図示する指定周波数帯域は、音声の有声音の代表である母音のフォルマント周波数帯域を利用したものである。
【００４２】
本例の場合、日本語の特徴を生かし、有声音の代表的且つ勢力の大きい母音を誤り無く検出することにより、有声音検出性能を上げることが可能なものであり、またレベルを検知する周波数帯域を指定することにより、周囲騒音の影響にも強くなる。
【００４３】
例２
本例の有声音検出部２は図１１に示すように複数の指定周波数帯域のレベル検出部２１１〜２１ｎ及び夫々のレベル検出部２１１〜２１ｎに対応した閾値が設定された比較部２２１〜２１ｎを設け、指定帯域とその閾値は１つ決めておくだけでなく、各母音に対して、各々の指定帯域と各々の閾値を用意しており、どこかの指定帯域の一つでも閾値を超えると有声音検出と見なすようになっている。尚ＯＲは比較部２２１〜２２ｎの出力の論理和を取るオアゲートである。
【００４４】
例３
上記例１の有声音検出部２における比較部２２の閾値を本実施形態では、図１２に示すようにフーリエ変換部２０から出力されるメル対数スペクトルから入力音声信号のレベルを常時検出して、有声音検出の閾値を入力音声信号のレベルに応じて決定する閾値決定部２７を具備し、この閾値決定部２７で決定した閾値を比較部２２に与えるようにしてある。
【００４５】
勿論例２の各比較部２２１〜２２ｎの閾値を決定する場合にも本例の閾値決定部２７を用いても良い。
【００４６】
本例によれば、有声音検出の閾値を入力音声信号レベルに応じて決定変更することにより、入力音声のレベルの大小の影響や入力される周囲騒音の影響に対応することが可能となる。
【００４７】
例４
上記例１〜３はフーリエ変換を行ってメル対数スペクトルに変換するものであったが、この場合メルケプストラムパラメータをメル対数スペクトルに変換する際に必要なフーリエ変換の計算量が多い。そこで、本例の有声検出部２は同じ作用をするフーリエ変換近似算出法を用いて、フーリエ変換を行わずに、指定周波数帯域のレベルを検出するようにしたものである。
【００４８】
つまり、所望周波数帯域のみ一定値をまずとり、その他の帯域は０とする矩形スペクトル（対数スペクトル）を図１３（ａ）に示すように用意し、この矩形スペクトルに対してメル周波数軸変換を音声分析時と同じメル周波数軸伸縮パラメータにより行う。その結果図１３（ｂ）示すように得られるメル対数スペクトルの逆フーリエ変換を行い、所望帯域のみ値をもつメル周波数スペクトルのメルケプストラム係数ａ（１）…を得る。実際、この所望帯域のみ値をもつメル周波数スペクトルのメルケプストラム係数は、指定周波数帯域を決定しておけば前もって算出可能であり、音声検出する際に毎回計算する必要はない。
【００４９】
図１４は本例の有声音検出部２の構成を示しており、上述の所望の周波数帯域のスペクトルのメルケプストラム係数ａ（ｍ）を予め決定される所定周波数帯域に基づいて算出記憶している所定周指定周波数用メルケプストラム係数算出部２３と、音声分析部１から入力するメルケプストラムパラメータから入力音声信号のメルケプストラム係数ｃ（ｍ）を算出する入力音声信号用メルケプストラム係数算出部２４と、両メルケプストラム係数ａ（ｍ）、ｃ（ｍ）の積和演算（Σａ（ｍ）ｃ（ｍ））を行う積和部２５と、その結果積和演算の値を閾値と比較して、有声音を検出する比較部２２とからなる。上記の指定周波数帯域は、音声の有声音の代表である母音のフォルマント周波数帯域を利用したものであり、指定帯域とその閾値は１つ決めておくだけでなく、例２と同様に各母音に対して、各々の指定帯域と各々の閾値を用意し、どこかの指定帯域の一つでも閾値を超えると有声音検出とを見なすようにしても良い。
【００５０】
本例の場合も、日本語の特徴を生かし、有声音の代表的且つ勢力の大きい母音を誤り無く検出することにより、有声音検出性能を上げることが可能なものであり、またレベルを検出する周波数帯域を指定することにより、周囲騒音の影響にも強くなる。
【００５１】
例５
本例の有声音検出部２は、メルケプストラムパラメータの０次成分が入力音声信号の対数パワーを表していることに着目したもので、図１５に示すように音声分析部１から入力するメルケプストラムパラメータから入力音声信号のメルケプストラム係数ｃ（ｍ）を算出する入力音声信号用メルケプストラム係数算出部２４と、算出されたメルケプストラム係数ｃ（ｍ）からｍ＝０、つまり０次元（ｃ（０））のデータを抽出する０次元データ抽出部２６と、この抽出された値と閾値とを比較して有声音の検出を行う比較部２２とから構成される。
【００５２】
本例の場合、音声分析の結果を利用することで、音声パワーをリアルタイムで活用することが可能となる。
【００５３】
ところで、本実施形態での有声音検出部２の例１乃至５の構成は本実施形態と同様に音声特徴パラメータとしてメルケプストラムパラメータを用いる実施形態２の有声音検出部２として用いることができるのは勿論のことである。
【００５４】
（実施形態４）
本実施形態は実施形態１〜３における基本周波数検出部３で検出される基本周波数の検出誤りの影響を小さくするために、図１６に示すように基本周波数検出部３の後段に、基本周波数検出部３で検出した基本周波数と、１時刻前の基本周波数との傾きを計算し、この傾きが、予め設定しておいた傾きの範囲外の場合、基本周波数を誤検出したとして、予め設定しておいた傾きの範囲内に入る様に補正を加える処理を行う基本周波数補正部８を設け、この基本周波数補正部８で補正された基本周波数を、実施形態１〜３における、基本周波数処理部４へ出力するのである。
【００５５】
図１７は基本周波数補正の例を示しており、この例の場合現時点ｔで検出された基本周波数がｆ_ｔで、１時刻前ｔ−１で検出された基本周波数がｆ_ｔ−１であって、その時の傾きが予め設定してある傾きの範囲外にある場合を示しており、この場合基本周波数補正部８は基本周波数ｆ_ｔを予め設定しておいた傾きの範囲内に入るようにようにｆ_ｔ’に補正するのである。
【００５６】
尚その他の構成は実施形態１〜３の何れかの構成と同じ構成を採用することができるから、ここでは図示及び説明を省略する。
【００５７】
而して本実施形態では、検出される基本周波数の時間的な変動が急激な場合、誤検出した可能性が高いため、その補正を行うことで、合成された音声の声質を向上させることができる。またその補正された基本周波数の時間的な変化は緩やかなものとなり、急激な基本周波数変化によって発生する合成音声のイントネーションの不自然性を解消することが可能となる。
【００５８】
（実施形態５）
本実施形態は、実施形態１〜３（或いは実施形態４）において、基本周波数検出部３で検出された基本周波数を逓倍して高周波数から低周波数に変換するための基本周波数処理部４において、図示するように検出された基本周波数に応じて基本周波数の変換処理を行うか行わないかを決定し、基本周波数処理部４の制御を行う基本周波数処理制御部９を付加したものである。その他の構成は実施形態１〜３或いは実施形態４と同じ構成を採用することができるので、図示及び説明を省略する。
【００５９】
而して本実施形態では、入力音声が男性周波数帯域（低い周波数）の場合に、更に低周波数に変換されるのを防ぐことができ、また合成音声は、常に一般的な男性音声周波数帯域の音声となり、合成音声として、通常音声と違和感の無い音声を提供することができる。
【００６０】
（実施形態６）
本実施形態は、有声音検出部２の検出結果と、基本周波数処理部４の結果を用いて音源信号を発生させる音源信号発生部５において、音源信号のパワー集中を防ぐために、発生させるパルス信号Ｐの列（図１９（ｂ）参照）及び白色雑音信号ＷＮ（図１９（ａ）参照）の振幅を推定する推定機能と、パルス信号Ｐの振幅に対応して白色雑音信号ＷＮの発生の振幅を適応的に制御する処理機能とを設け、パルス信号の発生のタイミングを、基本周波数処理部４の結果に依存するものとし、音源信号のパワー集中を防ぐために、図２０に示すように白色雑音信号ＷＮの直後のパルス信号Ｐは、音源信号のパワー集中を防ぐため、白色雑音信号ＷＮの直後数ｍｓ間無音信号Ｓを発生させ、その後パルス信号Ｐを発生させる構成とする。
【００６１】
尚本実施形態は音源信号発生部５以外の構成は上記実施形態１乃至５の何れかの構成を採用すればよいので、その他の構成は図示せず、説明も省略する。
【００６２】
而して本実施形態では合成音声に急激なパワー変動に起因するクリック性の雑音が発生するのを防ぐことができ、またパルス信号Ｐと白色雑音信号ＷＮの振幅制御を行うことにより、合成音声の音質として滑らかな音声を提供することができる。
【００６３】
（実施形態７）
ところで、音声合成部により出力される音声は、入力音声信号の声質を変換した音声であり、周波数スペクトルの移動を伴った処理を行っているために、再生可能周波数の高周波帯域の処理の効果が少ないことにより合成音声に歪が生じる可能性があり、この歪を削除するために、本実施形態では、図２１に示すように音声合成部を構成するＭＬＳＡ合成フィルタ７０より出力した合成音声信号に対して、ダウンサンプリング部１０でサンプリング周波数制限を行い、上記の高周波帯域を再生周波数帯域から除外するようにしたものである。つまり本実施形態では、例えば１０ｋＨｚのサンプリング周波数で得られた合成音声信号をダウンサンプリング部１０により８ｋＨｚのサンプリング周波数でダウンサンプリングを行うようなっている。
【００６４】
尚その他の構成は実施形態２乃至６の何れかの構成と同じ構成を採用できるからここでは図示及び説明を省略する。またＭＬＳＡ合成フィルタ７０を用いず、他の音声合成手段を用いる、例えば実施形態１の構成に採用しても良い。
【００６５】
而して図２２（ａ）に示すように周波数軸変動大の周波数帯域と、周波数軸変動小の周波数帯域の内、合成音声に歪みが発生し易いスペクトル成分の高周波数帯域を図２２（ｂ）に示すようにダウンサンプリング部７１にてダウンサンプリングして再生周波数帯域から除外する。
【００６６】
このようにして本実施形態では、合成音声の歪み成分の影響が無くなり、合成音声の音質を向上させることができる。
【００６７】
【発明の効果】
請求項１の発明は、音声分析処理、音声合成処理の際に、周波数軸変換処理を制御するための周波数処理制御部と、入力音声を上記周波数処理制御部の制御により音声分析する音声分析部と、上記音声分析部による音声分析により得られる音声特徴パラメータから入力音声が有声音か否かを判別する有声音検出部と、上記有声音検出部で有声音と検出した場合、入力音声の基本周波数を検出する基本周波数検出部と、上記基本周波数検出部で検出された基本周波数を逓倍して基本周波数変換を行う基本周波数処理部と、上記有声音検出部で有声音を検出した場合、基本周波数処理部で基本周波数変換された得られた基本周波数に応じてパルス信号を発生させ、有声音を検出しなかった場合、白色雑音信号を発生させ、これらパルス信号及び白色雑音信号を音源信号として出力する音源信号発生部と、上記音声分析部で音声分析することにより得られた特徴パラメータと、上記音源信号発生部から得られた音源信号とを用いて、上記周波数処理制御部による周波数制御に従い音声合成する音声合成部とから成るので、大容量のメモリや複雑な演算処理が不要で、入力音声をリアルタイムに且つ少ない演算量で声質を変換することができ、その結果小型のシステムとして実現が可能となり、インターホンに内蔵したり、通常の電話機にアダプタとして取り付けるシステムとして構築することができ、更に声質変換後の音声が、入力音声を変換するので、決まった人の声になることがなく、簡易的な防犯装置にも有効に活用できるという効果がある。
【００６８】
請求項２の発明は、請求項１の発明において、上記音声分析部をＭＬＳＡ分析フィルタで構成し、上記音声合成部をＭＬＳＡ合成フィルタで構成し、メル周波数軸変換をメルケプストラム分析時と音声合成時とで変えることで周波数軸変換処理を行うので、人間の聴覚的特徴を生かした適応的な分析方法により、極めて簡易に音声分析が可能となり、またＭＬＳＡ分析フィルタと、ＭＬＳＡ合成フィルタの分析パラメータであるメル周波数軸変換パラメータを制御することにより、入力音声信号のメル対数スペクトル分布を変換することが可能となるという効果がある。
【００６９】
請求項３の発明は、請求項１の発明において、上記音声分析部をフーリエ変換分析を用いたメルケプストラム分析部で構成し、上記音声合成部をＭＬＳＡ合成フィルタで構成し、メル周波数軸変換をメルケプストラム分析時と音声合成時とで変えることで周波数軸変換処理を行うので、人間の聴覚的特徴を生かした精度の高い音声分析ができ、また分析において、メル周波数帯域分析を行うため、合成時のＭＬＳＡ分析フィルタと共に、メル周波数軸変換パラメータを制御することにより、入力音声信号のメル対数スペクトル分布を変換できるという効果がある。
【００７０】
請求項４の発明は、請求項１乃至３の何れかの発明において、上記有声音検出部は、上記音声分析部により得られた音声特徴パラメータをフーリエ変換により周波数軸上のパラメータに変換して、所望周波数帯域の入力音声レベルを検出し、該検出レベルが閾値よりも大きい場合に有声音検出とするので、有声音検出部の検出性能を上げることができ、特に日本語の特徴を生かし、有声音の代表的かつ勢力の大きい母音を誤り無く検出することにより有声音検出性能を上げることが可能となり、またレベルを検出する周波数帯域を指定することにより、周囲騒音の影響にも強くになるという効果がある。特に、検出性能を落とさずに計算量を下げることを可能とあるという効果がある。
【００７１】
請求項５の発明は、請求項１乃至３の何れかの発明において、上記有声音検出部は、上記音声分析部により得られた音声特徴パラメータを近似フーリエ変換により周波数軸上のパラメータに変換して、所望周波数帯域での入力音声レベルを検出し、該検出レベルが閾値よりも大きい場合に有声音検出とするので、有声音検出部の検出性能をあげることができ、請求項４の発明と同様に、特に日本語の特徴を生かし、有声音の代表的かつ勢力の大きい母音を誤り無く検出することにより有声音検出性能を上げることが可能となり、またレベルを検出する周波数帯域を指定することにより、周囲騒音の影響にも強くなるという効果がある。
【００７２】
請求項６の発明は、請求項１乃至３の何れかの発明において、上記有声音検出部が、音声分析パラメータの対数パワーを用いて、対数パワー値が閾値よりも大きい場合に有声音検出とするので、有声音検出に音声分析の結果を利用することができ、また、分析結果を利用することで、音声パワーをリアルタイムで活用することが可能となるという効果がある。
【００７３】
請求項７の発明は、請求項４乃至６の何れかの発明において、上記閾値を、入力音声信号に応じて設定するので、検出閾値を入力音声信号レベルに応じて変更することにより、入力音声のレベルの大小の影響や入力される周囲騒音の影響にも対応することが可能となるという効果がある。
【００７４】
請求項８の発明は、請求項２の発明において、上記基本周波数検出部は、上記ＭＬＳＡ分析フィルタから出力される残差信号の自己相関を用いたピーク検出の間隔により基本周波数を検出するので、入力音声信号レベルを吸収することが可能となり、そのため常に一定の検出精度で検出が可能となるという効果がある。
【００７５】
請求項９の発明は、請求項３の発明において、上記基本周波数検出部が、メルケプストラム分析部により得られるメルケプストラムパラメータの高次成分のピーク検出の間隔により基本周波数を検出するので、分析精度と同等の検出精度を保つことが可能となるという効果がある。
【００７６】
請求項１０の発明は、請求項２の発明において、上記基本周波数検出部が、上記ＭＬＳＡ分析フィルタから出力される残差信号の零交差数解析により基本周波数を検出するので、ＭＬＳＡ分析フィルタによる適応デジタルフィルタの精度の高い分析結果を利用することが可能となるという効果がある。
【００７７】
請求項１１の発明は、請求項２の発明において、上記基本周波数検出部が、上記ＭＬＳＡ分析フィルタから出力される残差信号を入力とするニューラルネットワークにより基本周波数を推定検出するので、入力音声信号の変化に対応でき、ニューラルネットワーク構成時の統計的な検出を行うことが可能となり、その結果精度の高い基本周波数検出が可能となるという効果がある。
【００７８】
請求項１２の発明は、請求項１乃至１１の何れかの発明において、上記基本周波数検出部により検出した基本周波数と１時刻前の基本周波数との傾きが予め設定した傾き範囲を越えた場合に、該傾き範囲に入るように上記検出した基本周波数を補正する基本周波数補正処理部を付設したので、検出された基本周波数の時間的な変動が急激で、誤検出した可能性が高い場合にも、補正を行うことで、合成された音声の音質を向上させることができ、また、その補正された基本周波数の時間的な変化を緩やかなものとして、急激な基本周波数変化によって発生する合成音声のイントネーションの不自然性を解消することが可能となるという効果がある。
【００７９】
請求項１３の発明は、請求項１乃至１２の何れかの発明において、上記基本周波数処理部が、検出される基本周波数に応じた、基本周波数変換処理を行うので、入力音声が声質変換に不適当な基本周波数帯域の場合に声質変換を行なわれるのを防ぐことができ、得られる合成音声が、常に所定の基本周波数帯域の音声となり、合成音声の音質として、通常音声と違和感の無い音声を提供することが可能となるという効果がある。
【００８０】
請求項１４の発明は、請求項１乃至１３の何れかの発明において、上記音源信号発生部が、発生させるパルス信号の振幅に応じて上記白色雑音信号の振幅を制御するので、合成音声に急激なパワー変動に起因するクリック性の雑音が発生しないようにでき、また、パルス信号と白色性雑音信号の振幅制御を行うことにより、合成音声の音質として、滑らかな音声を提供することが可能となるという効果がある。
【００８１】
請求項１５の発明は、請求項１乃至１４の何れかの発明において、上記音声合成部より出力される合成音声信号に対してダウンサンプリングを行って再生音声の周波数帯域の制限を加えた合成音声信号を出力するダウンサンプリング部を付設したので、スペクトルの処理に起因する再生可能周波数の高周波数部の処理の効果の少ない帯域における音声歪みの影響を、サンプリング周波数制限を行うことにより、音声歪みを起こす可能性のある周波数帯域を再生周波数帯域から除外することが可能となり、合成音声の音質に歪み成分の影響がなくなり、合成音声の音質を向上させることが可能となるという効果がある。
【図面の簡単な説明】
【図１】本発明の実施形態１の構成図である。
【図２】本発明の実施形態２の構成図である。
【図３】（ａ）は同上に用いるＭＬＳＡ分析フィルタを構成する基本フィルタの構成図である。（ｂ）は同上に用いるＭＬＳＡ分析フィルタの具体的例の構成図である。
【図４】同上に用いる基本周波数検出部の例１を示す構成図である。
【図５】同上に用いる基本周波数検出部の例２を示す構成図である。
【図６】同上に用いる基本周波数検出部の例３を示す構成図である。
【図７】本発明の実施形態３の構成図である。
【図８】同上に用いる基本周波数検出部の一例を示す構成図である。
【図９】同上に用いる有声音検出部の例１を示す構成図である。
【図１０】同上の音声有声音検出部のレベル検出の説明図である。
【図１１】同上に用いる有声音検出部の例２を示す構成図である。
【図１２】同上に用いる有声音検出部の例３を示す構成図である。
【図１３】同上に用いる有声音検出部の例４の原理説明図である。
【図１４】同上の有声音検出部の例４を示す構成図である。
【図１５】同上に用いる有声音検出部の例５を示す構成図である。
【図１６】本発明の実施形態４に要部の構成図である。
【図１７】同上に用いる基本周波数補正部の動作説明図である。
【図１８】本発明の実施形態５に要部の構成図である。
【図１９】本発明の実施形態６の音源信号発生部５の発生信号例の説明図である。
【図２０】同上の音源信号発生部の動作説明図である。
【図２１】本発明の実施形態７の要部の構成図である。
【図２２】同上のダウンサンプリング部の動作説明図である。
【符号の説明】
１音声分析部
２有声音検出部
３基本周波数検出部
４基本周波数処理部
５音源信号発生部
６周波数処理制御部
７音声合成部

[0001]
[Technical field of the invention]
The present invention relates to a voice conversion system.
[0002]
[Prior art]
At present, applications of speech synthesis technology are being actively pursued. In particular, as applications to multimedia technology, products such as voice-based information services on the Internet and voice messages for car navigation are being developed. Users of such voice information services have asked to be able to select a voice according to their preference, or to have their own utterances converted into a different voice quality before being conveyed to the other party. Voice quality conversion systems that provide voice information matched to the user's preference or convert voice quality to that of an arbitrary speaker are disclosed in JP-A-9-292898, JP-A-9-258779, and JP-A-9-305197, among others. These conventional systems convert voice quality by matching the input voice against a previously stored average standard pattern or target speaker voice pattern. In these methods, however, the various voice patterns required for conversion must be stored in advance, and a considerable amount of computation is thought to be needed to compare the input voice with the stored voice patterns. Realizing these conventional devices therefore requires a memory with an enormous storage capacity and an arithmetic processing device with extremely high processing capability.
[0003]
[Problems to be solved by the invention]
In practice, when a voice quality conversion system is to be put to use, the need for a large memory to store voice patterns and for a high-performance arithmetic processing unit may restrict the choice of products in which it can be used. Moreover, in many fields the converted voice need not be that of a specific speaker; it is enough simply to convert the user's own utterance into a different voice quality.
[0004]
For example, in an intercom system consisting of a camera-equipped door phone installed outside a dwelling unit and a monitor-equipped intercom installed inside it, one may want the resident to be able to answer a visitor outside in a male voice regardless of the resident's gender. Similarly, an adapter-shaped device attached to the earpiece of a telephone may be fitted to any telephone so that, depending on the caller, the voice used when answering is changed to a male voice.
[0005]
The present invention has been made in view of the above points. Its object is to provide a voice quality conversion system that, rather than converting the input voice into a voice pattern stored in memory, converts the voice quality of the input voice signal in substantially real time by means of signal processing techniques.
[0006]
[Means for Solving the Problems]
In order to achieve the above object, the invention of claim 1 comprises: a frequency processing control unit that controls frequency axis conversion processing during voice analysis processing and voice synthesis processing; a voice analysis unit that analyzes the input voice under the control of the frequency processing control unit; a voiced sound detection unit that determines, from the voice feature parameters obtained by the voice analysis unit, whether the input voice is a voiced sound; a fundamental frequency detection unit that detects the fundamental frequency of the input voice when the voiced sound detection unit detects a voiced sound; a fundamental frequency processing unit that performs fundamental frequency conversion by multiplying the fundamental frequency detected by the fundamental frequency detection unit; a sound source signal generation unit that generates a pulse signal according to the fundamental frequency converted by the fundamental frequency processing unit when the voiced sound detection unit detects a voiced sound, generates a white noise signal when no voiced sound is detected, and outputs the pulse signal and the white noise signal as a sound source signal; and a voice synthesis unit that synthesizes voice, in accordance with the frequency control by the frequency processing control unit, using the feature parameters obtained by the voice analysis unit and the sound source signal obtained from the sound source signal generation unit.
[0007]
According to a second aspect of the present invention, in the first aspect, the voice analysis unit is configured by an MLSA analysis filter, the voice synthesis unit is configured by an MLSA synthesis filter, and the frequency axis conversion processing is performed by using a different mel frequency axis conversion for mel-cepstral analysis than for voice synthesis.
[0008]
According to a third aspect of the present invention, in the first aspect of the present invention, the speech analysis unit is configured by a mel-cepstral analysis unit using Fourier transform analysis, the speech synthesis unit is configured by an MLSA synthesis filter, and the mel frequency axis conversion is performed. It is characterized in that the frequency axis conversion processing is performed by changing between mel-cepstral analysis and speech synthesis.
[0009]
According to a fourth aspect of the present invention, in any one of the first to third aspects, the voiced sound detection unit converts the voice feature parameters obtained by the voice analysis unit into parameters on the frequency axis by Fourier transform, detects the input voice level in a desired frequency band, and judges that a voiced sound is detected when the detected level is higher than a threshold.
[0010]
According to a fifth aspect of the present invention, in any one of the first to third aspects, the voiced sound detection unit converts the voice feature parameters obtained by the voice analysis unit into parameters on the frequency axis by an approximate Fourier transform, detects the input voice level in a desired frequency band, and judges that a voiced sound is detected when the detected level is higher than a threshold.
[0011]
According to a sixth aspect of the present invention, in any one of the first to third aspects, the voiced sound detection unit uses the logarithmic power of the voice analysis parameters and judges that a voiced sound is detected when the logarithmic power value is larger than a threshold.
[0012]
According to a seventh aspect of the present invention, in any one of the fourth to sixth aspects, the threshold value is set according to a voice input signal.
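As an illustration of log-power detection with a threshold that follows the input signal, a minimal sketch is given below. It is an assumption-laden example, not the patented method: the log power is computed directly from the samples rather than from the analysis parameters, and the running floor, the `margin`, and the `smoothing` factor are hypothetical choices, since the source only states that the threshold is set according to the input voice signal.

```python
import math


def log_power(frame):
    """Natural log of the mean squared amplitude of one frame."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-12)  # small offset avoids log(0)


def detect_voiced(frames, margin=2.0, smoothing=0.9):
    """Return one boolean per frame.

    The threshold is a running estimate of the background level plus a
    fixed margin (both hypothetical; not specified in the source).
    """
    floor = None
    flags = []
    for frame in frames:
        lp = log_power(frame)
        if floor is None:
            floor = lp  # initialise the background level from the first frame
        voiced = lp > floor + margin
        if not voiced:
            # Track the background level only while no voice is detected.
            floor = smoothing * floor + (1.0 - smoothing) * lp
        flags.append(voiced)
    return flags
```

Because the floor is updated only on frames judged unvoiced, a louder background raises the threshold over time while a short loud utterance does not.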
[0013]
According to an eighth aspect of the present invention, in the second aspect, the fundamental frequency detection unit detects the fundamental frequency from the interval between peaks detected in the autocorrelation of the residual signal output from the MLSA analysis filter.
[0014]
According to a ninth aspect of the present invention, in the third aspect, the fundamental frequency detection unit detects the fundamental frequency from the interval between peaks detected in the higher-order components of the mel-cepstral parameters obtained by the mel-cepstral analysis unit.
[0015]
According to a tenth aspect of the present invention, in the second aspect, the fundamental frequency detecting section detects a fundamental frequency by analyzing a number of zero crossings of a residual signal output from the MLSA analysis filter.
[0016]
According to an eleventh aspect of the present invention, in the second aspect, the fundamental frequency detection unit estimates the fundamental frequency with a neural network that takes the residual signal output from the MLSA analysis filter as its input.
[0017]
According to a twelfth aspect of the present invention, in any one of the first to eleventh aspects, a fundamental frequency correction processing unit is additionally provided which, when the slope between the fundamental frequency detected by the fundamental frequency detection unit and the fundamental frequency one time step earlier exceeds a preset slope range, corrects the detected fundamental frequency so that it falls within that range.
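The correction in this aspect can be pictured as a clamp on the frame-to-frame change of the fundamental frequency. The sketch below is only an illustration: the function name `correct_f0` and the limit `max_slope_hz` are hypothetical, and the source says only that the permitted slope range is set in advance.

```python
def correct_f0(f0_now, f0_prev, max_slope_hz=20.0):
    """Limit the change from the previous frame to +/- max_slope_hz.

    max_slope_hz is an assumed value; the source does not give one.
    """
    delta = f0_now - f0_prev
    if delta > max_slope_hz:
        return f0_prev + max_slope_hz  # rising too fast: pull down to the limit
    if delta < -max_slope_hz:
        return f0_prev - max_slope_hz  # falling too fast: pull up to the limit
    return f0_now  # within the allowed range: leave unchanged
```

For instance, a jump from 200 Hz to 400 Hz between adjacent frames, which is typical of an octave error, would be limited to 220 Hz under these assumed settings.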
[0018]
According to a thirteenth aspect of the present invention, in any one of the first to twelfth aspects, the fundamental frequency processing unit performs a fundamental frequency conversion process according to a detected fundamental frequency.
[0019]
According to a fourteenth aspect, in any one of the first to thirteenth aspects, the sound source signal generator controls the amplitude of the white noise signal according to the amplitude of a pulse signal to be generated.
[0020]
According to a fifteenth aspect of the present invention, in any one of the first to fourteenth aspects, a downsampling unit is additionally provided which downsamples the synthesized voice signal output from the voice synthesis unit and outputs a synthesized voice signal whose reproduced frequency band is thereby limited.
[0021]
[Embodiments of the invention]
Hereinafter, the present invention will be described with reference to embodiments.
[0022]
(Embodiment 1)
The apparatus of this embodiment has the basic configuration shown in FIG. 1. As illustrated, it consists of a voice analysis unit 1, a voiced sound detection unit 2, a fundamental frequency detection unit 3, a fundamental frequency processing unit 4, a sound source signal generation unit 5, a frequency processing control unit 6, and a voice synthesis unit 7. The voice analysis unit 1 extracts the frequency spectrum of the input voice signal as a voice feature parameter, and the voiced sound detection unit 2 discriminates voiced sound using the frequency spectrum (voice feature parameter) extracted by the voice analysis unit 1. In a voice section judged to be voiced by the voiced sound detection unit 2, the input voice signal is considered to contain a periodic fundamental frequency, so the fundamental frequency detection unit 3 detects the fundamental frequency. Because a female voice has a higher fundamental frequency than a male voice, the fundamental frequency processing unit 4 converts the fundamental frequency detected by the fundamental frequency detection unit 3 to a lower frequency. Based on the detection result of the voiced sound detection unit 2, the sound source signal generation unit 5 generates a pulse signal at the fundamental frequency processed by the fundamental frequency processing unit 4 in sections where voiced sound is detected and a white noise signal in all other sections, and outputs these signals as the sound source signal for synthesizing voice. The voice synthesis unit 7 synthesizes voice from the sound source signal generated by the sound source signal generation unit 5 and from the frequency spectrum (voice feature parameter) obtained by the voice analysis unit 1 after the frequency processing control unit 6 has shifted it toward lower frequencies, and outputs the synthesized voice signal.
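The sound source generation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `make_excitation`, the frame length, the conversion ratio of 0.5, and the noise amplitude are all assumptions introduced for the example. The source states only that voiced sections receive pulses at the converted fundamental frequency and that all other sections receive white noise.

```python
import random


def make_excitation(frames, fs=8000, frame_len=80, ratio=0.5,
                    noise_amp=0.1, seed=0):
    """Build an excitation signal from per-frame (voiced, f0_hz) decisions.

    ratio is the assumed multiplier applied to the detected fundamental
    frequency (0.5 lowers the pitch by one octave).
    """
    rng = random.Random(seed)
    out = []
    samples_to_pulse = 0  # countdown to the next pulse, kept across frames
    for voiced, f0_hz in frames:
        if voiced and f0_hz > 0:
            period = max(1, round(fs / (f0_hz * ratio)))
            for _ in range(frame_len):
                if samples_to_pulse <= 0:
                    out.append(1.0)
                    samples_to_pulse = period
                else:
                    out.append(0.0)
                samples_to_pulse -= 1
        else:
            # Unvoiced or silent frame: white (Gaussian) noise.
            out.extend(rng.gauss(0.0, noise_amp) for _ in range(frame_len))
            samples_to_pulse = 0  # restart the pulse train at the next voiced frame
    return out
```

Keeping the pulse countdown across frame boundaries means the pulse spacing stays regular even when the period is longer than one frame.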
[0023]
When this embodiment is built into an intercom or added to a telephone as an adapter, the system can be activated at the speaker's request during a call: the speaker's voice signal is taken in as the input voice signal and voice synthesis is performed on it as described above, so that the call can proceed with the voice quality converted in substantially real time. Because the call is conducted with a converted voice quality, even a woman living alone can answer in a male voice, which provides a simple form of crime prevention. Furthermore, the system can be realized with little computation and needs almost no memory.
[0024]
(Embodiment 2)
In the present embodiment, the basic configuration is the same as that of the first embodiment, but an MLSA analysis filter 100, which is an adaptive digital filter that performs adaptive mel-cepstral analysis in real time, is used as a voice analysis unit that extracts voice feature parameters by voice analysis. The MLSA synthesis filter 70 is used as the voice synthesis unit.
[0025]
The MLSA analysis filter 100 and the MLSA synthesis filter 70 both make use of analysis processing on the mel frequency axis. The MLSA analysis filter is a digital filter that approximates the mel log spectrum by Padé approximation. As shown in FIG. 3(b), it is built from a plurality of basic filters F(z) (see FIG. 3(a)), each consisting of the parameter α that defines the mel scale and the mel log cepstrum coefficients b(m), together with the Padé coefficients p₁, …. By using adaptive digital filter techniques to compute the mel log cepstrum coefficients b(m) in the basic filter F(z) adaptively according to the input voice signal, the MLSA analysis filter 100 becomes a filter that adaptively approximates the mel log spectrum model of the input voice signal, and a residual signal is obtained as its output. In particular, through the choice of the parameter α that defines the mel scale, it can be regarded as an adaptive digital filter that takes advantage of human auditory characteristics. The analysis order can therefore be reduced compared with conventional voice analysis methods; for example, with 8 kHz sampling, taking m = 12 and α = 0.31 allows voice analysis matched to human auditory characteristics to be performed in substantially real time.
[0026]
The MLSA synthesis filter 70 constituting the voice synthesis unit is the inverse filter of the MLSA analysis filter 100 and likewise makes use of analysis processing on the mel frequency axis. To exploit the expansion and contraction of this frequency axis, the frequency processing control unit 6 controls the expansion/contraction parameter of the mel frequency axis conversion.
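One common way to model a mel frequency axis conversion governed by a parameter α is the phase response of a first-order all-pass filter. The sketch below assumes that model, which the source does not spell out, and uses hypothetical function names. Under this assumption, analysing with one α and synthesising with a larger α moves each spectral feature to a lower frequency, which is the kind of shift the frequency processing control unit is described as producing.

```python
import math


def warp(omega, alpha):
    """Warped frequency (rad) for the all-pass (z^-1 - alpha) / (1 - alpha z^-1)."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def shifted_frequency(freq_hz, fs, alpha_analysis, alpha_synthesis):
    """Where a feature at freq_hz lands when the two alphas differ.

    The inverse of warping with alpha is warping with -alpha, so the
    synthesis step is modelled as warp(..., -alpha_synthesis).
    """
    omega = 2.0 * math.pi * freq_hz / fs
    warped = warp(omega, alpha_analysis)
    omega_out = warp(warped, -alpha_synthesis)
    return omega_out * fs / (2.0 * math.pi)
```

With equal values of α the mapping is the identity. With α = 0.31 at analysis and an assumed α = 0.40 at synthesis, a feature at 1000 Hz (8 kHz sampling) comes out at a lower frequency.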
[0027]
Thus, in this embodiment, the MLSA analysis filter 100 serving as the voice analysis unit analyzes the input voice signal and outputs mel-cepstral parameters to the voiced sound detection unit 2 as voice feature parameters. The voiced sound detection unit 2 discriminates voiced sound based on these mel-cepstral parameters, while the fundamental frequency detection unit 3 detects the fundamental frequency, for the voiced sections, from the residual signal output by the MLSA analysis filter 100. Based on the detection result of the voiced sound detection unit 2, the sound source signal generation unit 5 outputs a pulse signal at the fundamental frequency processed by the fundamental frequency processing unit 4 in sections where voiced sound is detected and a white noise signal in all other sections, and supplies these outputs to the MLSA synthesis filter 70 as the sound source signal for voice synthesis. The MLSA synthesis filter 70 performs voice synthesis using the mel-cepstral parameters from the MLSA analysis filter 100 and the sound source signal, under the frequency control of the frequency processing control unit 6, and outputs a synthesized voice signal.
[0028]
Here, an example of the fundamental frequency detector 3 that can be used in the present embodiment will be described next.
[0029]
Example 1
FIG. 4 shows this example. The fundamental frequency detection unit 3 of this example comprises an autocorrelation calculating section 30 that calculates the autocorrelation of the residual signal output from the MLSA analysis filter 100 over the range in which the fundamental frequency is considered to exist, a peak detecting section 31 that detects the interval at which a peak of the autocorrelation function calculated by the autocorrelation calculating section 30 appears, and a fundamental frequency calculation unit 32 that calculates the fundamental frequency from the interval detected by the peak detecting section 31.
[0030]
Because it uses the residual signal output from the MLSA analysis filter 100, the fundamental frequency detection unit 3 of this example can absorb differences in the level of the input audio signal, which makes detection possible regardless of that level.
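The autocorrelation approach of Example 1 can be sketched as below. The search range of 60 Hz to 400 Hz and the function name `autocorr_f0` are assumptions; the specification only says that the autocorrelation is computed over the range in which the fundamental frequency is considered to exist.

```python
def autocorr_f0(residual, sample_rate_hz, f0_min_hz=60.0, f0_max_hz=400.0):
    """Estimate the fundamental frequency from the lag of the largest
    autocorrelation peak inside an assumed pitch search range."""
    lag_min = int(sample_rate_hz / f0_max_hz)
    lag_max = min(int(sample_rate_hz / f0_min_hz), len(residual) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation of the residual at this lag.
        val = sum(residual[n] * residual[n - lag]
                  for n in range(lag, len(residual)))
        if val > best_val:
            best_lag, best_val = lag, val
    if best_lag is None:
        return None  # no positive peak found in the search range
    return sample_rate_hz / best_lag

# A pulse-like residual with a period of 64 samples at 8 kHz, i.e. 125 Hz.
pulse_train = [1.0 if n % 64 == 0 else 0.0 for n in range(512)]
f0_estimate = autocorr_f0(pulse_train, 8000.0)
```

Because the peak position does not depend on the overall scale of the residual, the estimate is unaffected by input level, which matches the property claimed for this example.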
[0031]
Example 2
As shown in FIG. 5, the fundamental frequency detector 3 of this example comprises a section that analyzes the zero crossings of the residual signal output from the MLSA analysis filter 100 to obtain a zero-crossing count, and a fundamental frequency calculator 34 that calculates the fundamental frequency from that count.
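A minimal sketch of the zero-crossing idea follows. It assumes a roughly sinusoidal signal, for which there are two zero crossings per period; how the specification maps the zero-crossing count of a real residual to a frequency is not stated, so this mapping and the name `zero_crossing_f0` are assumptions.

```python
import math

def zero_crossing_f0(signal, sample_rate_hz):
    """Estimate frequency as (number of sign changes) / (2 * duration)."""
    crossings = sum(1 for a, b in zip(signal, signal[1:])
                    if (a < 0.0) != (b < 0.0))
    duration_s = (len(signal) - 1) / sample_rate_hz
    return crossings / (2.0 * duration_s)

# Usage: a clean 100 Hz sinusoid sampled at 8 kHz stands in for the residual.
demo = [math.sin(2.0 * math.pi * 100.0 * n / 8000.0 + 0.1) for n in range(8000)]
estimate = zero_crossing_f0(demo, 8000.0)
```

This needs only comparisons and one division per frame, which is the attraction of the method when computation is limited.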
[0032]
Example 3
As shown in FIG. 6, the fundamental frequency detecting unit 3 of this example includes a fundamental frequency detecting neural network 35 that receives the residual signal output from the MLSA analysis filter 100 as its input. The network is trained in advance to output a pitch value corresponding to the fundamental frequency, and is used to estimate the fundamental frequency.
[0033]
Because they use the residual signal of the MLSA analysis filter 100, all of the fundamental frequency detectors 3 of Examples 1 to 3 can absorb differences in the level of the input audio signal, so that the fundamental frequency can always be detected with a constant detection accuracy.
[0034]
In addition, by using the highly accurate analysis result of the adaptive digital filter by the MLSA analysis filter 100, highly accurate detection becomes possible.
[0035]
Further, in the case of Example 3, the fundamental frequency detection neural network 35 is trained in advance on residual signals, so statistical detection is built in when the neural network is configured, and the fundamental frequency can be detected with high accuracy.
[0036]
(Embodiment 3)
In the second embodiment, the voice analysis unit is configured by the MLSA analysis filter 100. The present embodiment differs from the second embodiment in that the voice analysis unit is configured by the mel-cepstrum analysis unit 101, as shown in the drawings.
[0037]
The mel-cepstral analysis unit 101 performs mel-cepstral analysis, applying a Fourier transform, logarithmic transformation, mel-frequency axis transform, and inverse Fourier transform to the input speech signal, and extracts mel-cepstral parameters as speech feature parameters. It shares the analysis processing on the mel frequency axis with the MLSA synthesis filter 70 constituting the voice synthesis unit, and, exploiting the expansion and contraction of the frequency axis, the frequency processing control unit 6 controls the parameters of the mel frequency axis conversion.
[0038]
In addition, as shown in the drawings, the fundamental frequency detection unit 3 detects peaks in the high-order part (the high-quefrency parameters) of the mel-cepstrum parameters output from the mel-cepstrum analysis unit 101, and the fundamental frequency calculator 37 calculates the fundamental frequency from the detected peak interval. The rest of the configuration is the same as in the second embodiment, and its description is omitted here.
[0039]
Thus, in the present embodiment, the voiced sound detection unit 2 detects voiced sound based on the mel-cepstrum parameters, which are the speech feature parameters extracted by the mel-cepstrum analysis unit 101, and the fundamental frequency detection unit 3 detects the fundamental frequency. As in the second embodiment, the sound source signal generator 5 uses the detection result of the voiced sound detector 2 to generate a pulse signal at the fundamental frequency processed by the fundamental frequency processor 4 in voiced sections and a white noise signal in all other sections, and outputs these to the MLSA synthesis filter 70 as the sound source signal for speech synthesis. The MLSA synthesis filter 70 performs voice synthesis using the mel-cepstrum parameters from the mel-cepstrum analysis unit 101 and the sound source signal, under the frequency control of the frequency processing control unit 6, and outputs a synthesized voice signal.
[0040]
Here, examples of the voiced sound detection unit 2 that detects voiced sound using the mel-cepstral parameters, usable in the present embodiment (and in the second embodiment described above), will be described.
[0041]
Example 1
As shown in FIG. 9, the voiced sound detector 2 of this example comprises a Fourier transform unit 20 that applies a Fourier transform to the mel-cepstrum parameters to convert them into a spectrum on the mel logarithmic axis; a level detector 21 that detects the level of a designated frequency band of the resulting mel log spectrum, for example the 80 Hz to 600 Hz band shown in FIG. 10; and a comparison unit 22 that compares the detected level with a preset voiced sound detection threshold and judges that a voiced sound has been detected when the level of the input voice is larger than the threshold. FIG. 10 shows an example of the mel log spectrum on the mel logarithmic axis and an example of the designated frequency band. The designated frequency band shown in FIG. 10 is the formant frequency band of vowels, which are the representative voiced sounds of speech.
[0042]
In this example, voiced sound detection performance is improved by exploiting a characteristic of Japanese: vowels, which are the representative voiced sounds and carry large power, are detected without error. In addition, restricting level detection to a designated frequency band makes the detection more robust against ambient noise.
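The level detection and threshold comparison of this example can be sketched as follows. For simplicity the sketch assumes a log spectrum sampled on a linear frequency axis from 0 Hz to the Nyquist frequency, whereas the specification works on the mel logarithmic axis; the function names and the threshold value used below are hypothetical.

```python
def band_level(log_spectrum, sample_rate_hz, low_hz=80.0, high_hz=600.0):
    """Mean log-spectrum level over the bins inside the designated band.

    log_spectrum is assumed to hold equally spaced bins from 0 Hz up to
    and including the Nyquist frequency.
    """
    nyquist = sample_rate_hz / 2.0
    n_bins = len(log_spectrum)
    in_band = [v for k, v in enumerate(log_spectrum)
               if low_hz <= k * nyquist / (n_bins - 1) <= high_hz]
    return sum(in_band) / len(in_band)

def is_voiced(log_spectrum, sample_rate_hz, threshold):
    """Voiced when the band level is larger than the detection threshold."""
    return band_level(log_spectrum, sample_rate_hz) > threshold

# 65 bins at 8 kHz: bins 2..9 (125 Hz to 562.5 Hz) fall inside 80-600 Hz.
vowel_like = [5.0 if 2 <= k <= 9 else 0.0 for k in range(65)]
silence_like = [0.0] * 65
```

Energy outside the band, such as high-frequency background noise, never enters the comparison, which is the source of the noise robustness described above.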
[0043]
Example 2
As shown in FIG. 11, the voiced sound detection unit 2 of this example includes level detection units 211 to 21n for a plurality of designated frequency bands and comparison units 221 to 22n in which thresholds corresponding to the respective level detection units 211 to 21n are set. Rather than specifying only one designated band and its threshold, a designated band and a threshold are prepared for each vowel, and if the level in any one of the designated bands exceeds its threshold, a voiced sound is considered to be detected. OR denotes an OR gate that takes the logical sum of the outputs of the comparison units 221 to 22n.
[0044]
Example 3
In this example, as shown in the drawings, a threshold determination unit 27 is provided that constantly detects the level of the input voice signal from the mel log spectrum output from the Fourier transform unit 20 and determines the threshold for voiced sound detection according to that level. The threshold determined by the threshold determination unit 27 is supplied to the comparison unit 22 of the voiced sound detection unit 2 of Example 1.
[0045]
Of course, the threshold value determination unit 27 of the present example may be used when determining the threshold values of the respective comparison units 221 to 22n of the second example.
[0046]
According to this example, by determining and changing the threshold value for voiced sound detection according to the input audio signal level, it is possible to deal with the influence of the level of the input audio and the influence of the input ambient noise.
[0047]
Example 4
In Examples 1 to 3 above, a Fourier transform is performed to convert the mel cepstrum parameters to a mel logarithmic spectrum, and this Fourier transform requires a large amount of calculation. The voiced sound detection unit 2 of this example therefore detects the level of the designated frequency band without performing the Fourier transform, using an approximate calculation method that has the same effect.
[0048]
That is, a rectangular spectrum (logarithmic spectrum) in which only the desired frequency band takes a constant value and all other bands are 0 is first prepared, as shown in the drawings, and is converted to the mel frequency axis using the same mel frequency axis expansion and contraction parameters as in the analysis. The resulting mel log spectrum is then inverse Fourier transformed, as shown in FIG. 13(b), to obtain the mel cepstrum coefficients a(1)... of a mel frequency spectrum having only the desired band. In fact, these mel cepstrum coefficients can be calculated in advance once the designated frequency band is determined, and do not need to be recalculated every time voiced sound is detected.
[0049]
FIG. 14 shows the configuration of the voiced sound detection unit 2 of this example. It comprises a designated-frequency mel-cepstral coefficient calculating unit 23 that calculates and stores the mel-cepstral coefficients a(m) of the spectrum of the desired frequency band, based on a designated frequency band determined in advance; an input-voice-signal mel-cepstral coefficient calculating unit 24 that calculates the mel-cepstral coefficients c(m) of the input voice signal from the mel-cepstral parameters input from the voice analyzing unit 1; a sum-of-products unit 25 that performs the sum-of-products operation (Σa(m)c(m)) on the two sets of mel-cepstral coefficients a(m) and c(m); and a comparing unit 22 that detects a voiced sound from the result. The designated frequency band is the formant frequency band of vowels, which are the representative voiced sounds of speech. As in Example 2, instead of fixing only one designated band and its threshold, a designated band and a threshold may be prepared for each vowel, and a voiced sound may be considered detected if any one of the designated bands exceeds its threshold.
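The reason the sum of products Σa(m)c(m) can stand in for a band level is Parseval's relation between a spectrum and its cepstrum. The sketch below checks this numerically on a linear frequency axis with a naive inverse DFT; the mel warping, the truncation to a low order m, and all function names are simplifications or assumptions not taken from the specification.

```python
import math

def real_cepstrum(symmetric_spectrum):
    """Inverse DFT of a real, symmetric (log) spectrum of length N."""
    n_fft = len(symmetric_spectrum)
    return [sum(v * math.cos(2.0 * math.pi * k * n / n_fft)
                for k, v in enumerate(symmetric_spectrum)) / n_fft
            for n in range(n_fft)]

def band_mask(n_fft, sample_rate_hz, low_hz, high_hz):
    """Symmetric rectangular spectrum: 1 inside the band, 0 elsewhere."""
    return [1.0 if low_hz <= min(k, n_fft - k) * sample_rate_hz / n_fft <= high_hz
            else 0.0 for k in range(n_fft)]

def band_level_from_cepstrum(a, c):
    """Sum of products of the band coefficients a(m) and input coefficients c(m)."""
    return sum(am * cm for am, cm in zip(a, c))

n_fft, fs = 64, 8000.0
mask = band_mask(n_fft, fs, 80.0, 600.0)
a = real_cepstrum(mask)                      # computed once, in advance
log_spec = [1.0 + 0.5 * math.cos(2.0 * math.pi * k / n_fft)
            + 0.25 * math.cos(4.0 * math.pi * k / n_fft) for k in range(n_fft)]
c = real_cepstrum(log_spec)                  # per-frame input coefficients
direct = sum(m * s for m, s in zip(mask, log_spec)) / n_fft
via_cepstrum = band_level_from_cepstrum(a, c)
```

With all N coefficients the two values agree to rounding error; keeping only the first few coefficients, as a low-order analysis does, turns the equality into the approximation that the example relies on.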
[0050]
In this example too, voiced sound detection performance is improved by exploiting a characteristic of Japanese: vowels, which are the representative voiced sounds and carry large power, are detected without error. In addition, restricting level detection to a designated frequency band makes the detection more robust against ambient noise.
[0051]
Example 5
The voiced sound detector 2 of this example exploits the fact that the 0th-order component of the mel-cepstral parameters represents the logarithmic power of the input voice signal. As shown in the drawings, it comprises a unit that calculates the mel-cepstral coefficients c(m) of the input audio signal from the mel-cepstral parameters, a 0th-order data extraction unit 26 that extracts the value for m = 0, that is, the 0th-order coefficient c(0), and a comparison unit 22 that compares the extracted value with a threshold to detect a voiced sound.
[0052]
In the case of this example, it is possible to utilize the audio power in real time by using the result of the audio analysis.
[0053]
Of course, the configurations of Examples 1 to 5 of the voiced sound detection unit 2 in the present embodiment can also be used as the voiced sound detection unit 2 of the second embodiment, which, like the present embodiment, uses mel-cepstral parameters as the voice feature parameters.
[0054]
(Embodiment 4)
In the present embodiment, in order to reduce the influence of detection errors in the fundamental frequency detected by the fundamental frequency detector 3 of the first to third embodiments, a fundamental frequency correction unit 8 is added. The fundamental frequency correction unit 8 calculates the slope between the fundamental frequency detected by the fundamental frequency detector 3 and the fundamental frequency one time step earlier; if the slope is outside a preset slope range, it judges the fundamental frequency to be erroneously detected and corrects it so that it falls within the preset slope range. The fundamental frequency corrected by the fundamental frequency correction unit 8 is output to the fundamental frequency processing unit 4 of the first to third embodiments.
[0055]
FIG. 17 shows an example of the fundamental frequency correction. In this example, the fundamental frequency detected at the present time t is f_t and the fundamental frequency detected at time t-1, one time step earlier, is f_t-1. The slope between them is outside the preset slope range, so the fundamental frequency correction unit 8 corrects the fundamental frequency f_t to f_t' so that it falls within the preset slope range.
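One simple way to realize such a correction is to clamp the change between consecutive frames, as sketched below. The symmetric slope range, the choice of clamping to the edge of the range, and the names `correct_f0` and `max_step_hz` are assumptions; the specification only requires that the corrected value f_t' fall within the preset range.

```python
def correct_f0(f0_prev_hz, f0_now_hz, max_step_hz):
    """Limit the change from f_t-1 to f_t to an assumed symmetric range."""
    step = f0_now_hz - f0_prev_hz
    if step > max_step_hz:
        return f0_prev_hz + max_step_hz   # rising too fast: clamp upward step
    if step < -max_step_hz:
        return f0_prev_hz - max_step_hz   # falling too fast: clamp downward step
    return f0_now_hz                      # within range: keep the detected value
```

For instance, with a permitted step of 20 Hz per frame, a jump from 120 Hz to 240 Hz, typical of an octave error, would be corrected to 140 Hz, while a change from 120 Hz to 125 Hz passes through unchanged.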
[0056]
Note that other configurations can employ the same configurations as any of the first to third embodiments, and thus illustration and description are omitted here.
[0057]
Thus, in the present embodiment, a rapid temporal variation of the detected fundamental frequency is treated as a likely erroneous detection, and correcting it improves the voice quality of the synthesized voice. Further, the temporal change of the corrected fundamental frequency becomes gradual, which eliminates the unnatural intonation of the synthesized voice caused by sudden changes of the fundamental frequency.
[0058]
(Embodiment 5)
In the present embodiment, a fundamental frequency processing control unit 9 is added to Embodiments 1 to 3 (or Embodiment 4), in which the fundamental frequency processing unit 4 multiplies the fundamental frequency detected by the fundamental frequency detection unit 3 to convert it from a high frequency to a low frequency. As shown in the drawings, the fundamental frequency processing control unit 9 determines, according to the detected fundamental frequency, whether or not to perform the conversion of the fundamental frequency, and controls the fundamental frequency processing unit 4 accordingly. The other configurations can be the same as those of the first to third or fourth embodiments, so their illustration and description are omitted.
[0059]
Thus, in the present embodiment, when the input voice is already in the male frequency band (low frequency), it is prevented from being converted to a still lower frequency. The synthesized voice therefore always lies in the general male voice frequency band, and a synthesized voice that does not sound unnatural compared with a normal voice can be provided.
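The decision made by the fundamental frequency processing control unit can be sketched as a simple gate in front of the multiplication. The scale factor of 0.5 and the 160 Hz boundary below are hypothetical values chosen only for illustration; the specification gives no numbers.

```python
def convert_f0(f0_hz, scale=0.5, male_upper_hz=160.0):
    """Lower the fundamental frequency only when it lies above an assumed
    upper bound of the male voice band; otherwise pass it through."""
    if f0_hz <= male_upper_hz:
        return f0_hz          # already low: skip the conversion
    return f0_hz * scale      # high input: multiply down to a lower frequency
```

Under these assumed values, an input at 240 Hz is converted to 120 Hz, while an input already at 120 Hz is left at 120 Hz rather than being pushed down to 60 Hz.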
[0060]
(Embodiment 6)
In the present embodiment, the sound source signal generation unit 5, which generates a sound source signal using the detection result of the voiced sound detection unit 2 and the result of the fundamental frequency processing unit 4, is given two functions in order to prevent power concentration in the sound source signal: an estimation function that estimates the amplitudes of the generated train of pulse signals P (see FIG. 19B) and of the white noise signal WN (see FIG. 19A), and a processing function that adaptively controls the amplitude of the generated white noise signal WN in accordance with the amplitude of the pulse signal P. The timing at which the pulse signal is generated depends on the result of the fundamental frequency processing unit 4. In addition, as shown in the drawings, when a pulse signal P follows immediately after a white noise signal WN, a silence signal S lasting several ms is generated immediately after the white noise signal WN and the pulse signal P is generated only after it, again to prevent power concentration in the sound source signal.
[0061]
In this embodiment, the configuration other than the sound source signal generation unit 5 may be that of any of the first to fifth embodiments, so it is not shown and its description is omitted.
[0062]
Thus, in the present embodiment, click noise caused by abrupt power fluctuations is prevented from occurring in the synthesized speech, and controlling the amplitudes of the pulse signal P and the white noise signal WN gives the synthesized speech a smooth sound quality.
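The insertion of the silence signal S can be sketched at the level of segment bookkeeping. The segment representation, the 3 ms gap, and the choice to take the gap out of the start of the pulse segment (so that the total length is preserved) are all assumptions; the specification only says that the silence lasts several ms.

```python
def insert_silence(segments, sample_rate_hz, gap_ms=3.0):
    """Insert a short silence wherever a pulse segment directly follows noise.

    segments is a list of (kind, length_in_samples) with kind being
    "noise" or "pulse".
    """
    gap = int(sample_rate_hz * gap_ms / 1000.0)
    out = []
    for kind, length in segments:
        if kind == "pulse" and out and out[-1][0] == "noise":
            used = min(gap, length)
            out.append(("silence", used))
            if length - used > 0:
                out.append((kind, length - used))
        else:
            out.append((kind, length))
    return out

plan = insert_silence([("noise", 160), ("pulse", 160), ("pulse", 160)], 8000.0)
```

Only the noise-to-pulse boundary receives a gap; consecutive pulse segments are left untouched.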
[0063]
(Embodiment 7)
The voice output by the voice synthesis unit is the input voice signal with its voice quality converted. Because the processing that accompanies the shift of the frequency spectrum has little effect in part of the band, the synthesized speech may be distorted there. To eliminate this distortion, in the present embodiment, as shown in FIG. 21, the sampling frequency of the synthesized speech signal output from the MLSA synthesis filter 70 constituting the speech synthesis unit is limited by the down-sampling unit 10, so that the high-frequency band is excluded from the reproduction frequency band. Specifically, in the present embodiment, the downsampling unit 10 down-samples the synthesized voice signal, obtained at a sampling frequency of 10 kHz, to a sampling frequency of 8 kHz.
[0064]
Note that the other configuration can adopt the same configuration as any one of the second to sixth embodiments, so that illustration and description are omitted here. Further, instead of using the MLSA synthesis filter 70, another voice synthesizing means may be used, for example, the configuration of the first embodiment may be adopted.
[0065]
As shown in FIG. 22(a), the spectrum contains a frequency band with large frequency axis fluctuation and a frequency band with small frequency axis fluctuation. The high-frequency band, whose spectral components are likely to cause distortion in the synthesized speech, is removed from the reproduction frequency band by down-sampling in the down-sampling unit 71, as shown in the drawings.
[0066]
Thus, in the present embodiment, the influence of the distortion component of the synthesized voice is eliminated, and the sound quality of the synthesized voice can be improved.
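A 10 kHz to 8 kHz conversion of this kind can be sketched with a windowed-sinc interpolator whose cutoff sits at the new Nyquist frequency of 4 kHz. The specification does not describe how the down-sampling unit is built, so the Hann-windowed kernel, its width, and the function name are assumptions.

```python
import math

def resample_10k_to_8k(x, half_width=16):
    """Resample from 10 kHz to 8 kHz with a Hann-windowed sinc low-pass."""
    ratio = 10000.0 / 8000.0     # input samples per output sample
    cutoff = 8000.0 / 10000.0    # new Nyquist relative to the old Nyquist
    y = []
    for m in range(int(len(x) / ratio)):
        t = m * ratio            # output instant in input-sample units
        centre = int(math.floor(t))
        acc = 0.0
        for k in range(max(0, centre - half_width),
                       min(len(x) - 1, centre + half_width) + 1):
            d = t - k
            arg = math.pi * cutoff * d
            sinc = 1.0 if d == 0.0 else math.sin(arg) / arg
            window = 0.5 * (1.0 + math.cos(math.pi * d / (half_width + 1)))
            acc += x[k] * cutoff * sinc * window
        y.append(acc)
    return y

# A 1 kHz tone sampled at 10 kHz should come out as a 1 kHz tone at 8 kHz.
tone_10k = [math.sin(2.0 * math.pi * 1000.0 * n / 10000.0) for n in range(400)]
tone_8k = resample_10k_to_8k(tone_10k)
```

Components above 4 kHz fall outside the passband of the kernel and are attenuated, which is how the band prone to distortion is kept out of the reproduced signal.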
[0067]
[Effects of the Invention]
The invention of claim 1 comprises: a frequency processing control unit that controls frequency axis conversion processing during voice analysis and voice synthesis; a voice analysis unit that analyzes the input voice under the control of the frequency processing control unit; a voiced sound detection unit that determines whether or not the input voice is a voiced sound based on voice feature parameters obtained by the voice analysis unit; a fundamental frequency detection unit that detects the fundamental frequency of the input voice when the voiced sound detection unit detects a voiced sound; a fundamental frequency processing unit that performs fundamental frequency conversion by multiplying the fundamental frequency detected by the fundamental frequency detection unit; a sound source signal generation unit that generates a pulse signal according to the fundamental frequency obtained by the fundamental frequency conversion when a voiced sound is detected, generates a white noise signal when no voiced sound is detected, and outputs these pulse signals and white noise signals as a sound source signal; and a voice synthesis unit that synthesizes voice according to the frequency control of the frequency processing control unit, using the feature parameters obtained by the voice analysis unit and the sound source signal obtained from the sound source signal generation unit. Consequently, neither a large-capacity memory nor complicated arithmetic processing is required, and the voice quality of the input voice can be converted in real time with a small amount of calculation.
The system can be realized in a small size, built into an intercom, or constructed as an adapter that attaches to an ordinary telephone. Moreover, because the converted voice is derived from the input voice, it does not become the voice of one fixed person, so the system can be used effectively as a simple security device.
[0068]
According to the invention of claim 2, in the invention of claim 1, the voice analysis unit is configured by an MLSA analysis filter, the voice synthesis unit is configured by an MLSA synthesis filter, and the frequency axis conversion process is performed by changing the mel frequency axis conversion between analysis and synthesis. Voice analysis can therefore be performed very easily by an adaptive analysis method that exploits human auditory characteristics. In addition, by controlling the mel frequency axis conversion parameter, which is an analysis parameter of the MLSA analysis filter and the MLSA synthesis filter, the mel log spectrum distribution of the input audio signal can be converted.
[0069]
According to the invention of claim 3, in the invention of claim 1, the voice analysis unit is configured by a mel-cepstral analysis unit using Fourier transform analysis, the voice synthesis unit is configured by an MLSA synthesis filter, and the frequency axis conversion process is performed by changing the mel frequency axis conversion between mel-cepstral analysis and speech synthesis. Highly accurate speech analysis that exploits human auditory characteristics can therefore be performed. In addition, by controlling the mel frequency axis conversion parameter together with the MLSA filter used at synthesis time, the mel log spectrum distribution of the input audio signal can be converted.
[0070]
According to the invention of claim 4, in the invention of any one of claims 1 to 3, the voiced sound detection unit converts the voice feature parameters obtained by the voice analysis unit into parameters on the frequency axis by Fourier transform, detects the input voice level in a desired frequency band, and judges a voiced sound to be detected when the detected level is greater than a threshold. The detection performance of the voiced sound detection unit is thereby improved: vowels, which are the representative voiced sounds and carry large power, are detected without error, and specifying the frequency band in which the level is detected makes the detection more robust against ambient noise. In particular, there is an effect that the calculation amount can be reduced without lowering the detection performance.
[0071]
According to the invention of claim 5, in the invention of any one of claims 1 to 3, the voiced sound detecting unit converts the voice feature parameters obtained by the voice analyzing unit into parameters on the frequency axis by an approximate Fourier transform, detects the input voice level in the desired frequency band, and judges a voiced sound to be detected when the detected level is higher than the threshold. The detection performance of the voiced sound detection unit is thereby improved in the same way: exploiting a characteristic of Japanese in particular, vowels, which are the representative voiced sounds and carry large power, are detected without error, and specifying the frequency band for level detection makes the detection more robust against ambient noise.
[0072]
According to the invention of claim 6, in the invention of any one of claims 1 to 3, the voiced sound detection unit uses the logarithmic power contained in the voice analysis parameters and judges a voiced sound to be detected when the logarithmic power value is larger than a threshold. The voice analysis result can therefore be used directly for voiced sound detection, and the voice power can be used in real time.
[0073]
According to the invention of claim 7, in the invention of any one of claims 4 to 6, the threshold value is set according to the input audio signal, so that the influence of the magnitude of the input level and the influence of input ambient noise can be dealt with.
[0074]
According to the invention of claim 8, in the invention of claim 2, the fundamental frequency detecting section detects the fundamental frequency from the interval of peaks detected using the autocorrelation of the residual signal output from the MLSA analysis filter. Differences in the level of the input audio signal can therefore be absorbed, so that detection can always be performed with a constant detection accuracy.
[0075]
According to the invention of claim 9, in the invention of claim 3, the fundamental frequency detecting section detects the fundamental frequency from the interval of peaks detected in the higher-order components of the mel-cepstral parameters obtained by the mel-cepstral analyzing section, so that a comparable detection accuracy can be maintained.
[0076]
According to the invention of claim 10, in the invention of claim 2, the fundamental frequency detecting section detects the fundamental frequency by analyzing the number of zero crossings of the residual signal output from the MLSA analysis filter, so that the highly accurate analysis result of the adaptive digital filter can be used.
[0077]
According to the invention of claim 11, in the invention of claim 2, the fundamental frequency detecting section estimates and detects the fundamental frequency using a neural network that receives the residual signal output from the MLSA analysis filter. Statistical detection can therefore be built in when the neural network is configured, and as a result the fundamental frequency can be detected with high accuracy.
[0078]
According to the invention of claim 12, in the invention of any one of claims 1 to 11, a fundamental frequency correction processing unit is provided that corrects the detected fundamental frequency so that it falls within a preset slope range when the slope between the fundamental frequency detected by the fundamental frequency detection unit and the fundamental frequency one time step earlier exceeds that range. Even when the detected fundamental frequency fluctuates rapidly and is therefore likely to be an erroneous detection, the correction improves the sound quality of the synthesized voice. The temporal change of the corrected fundamental frequency also becomes gradual, which eliminates the unnatural intonation of the synthesized voice caused by rapid changes of the fundamental frequency.
[0079]
According to the invention of claim 13, in the invention of any one of claims 1 to 12, the fundamental frequency processing unit performs the fundamental frequency conversion process depending on the detected fundamental frequency. Voice quality conversion can therefore be prevented when the input is already in an appropriate fundamental frequency band, and the resulting synthesized voice is always a voice in a predetermined fundamental frequency band.
[0080]
According to the invention of claim 14, in the invention of any one of claims 1 to 13, the sound source signal generator controls the amplitude of the white noise signal in accordance with the amplitude of the pulse signal to be generated. Click noise due to power fluctuations can therefore be prevented, and controlling the amplitudes of the pulse signal and the white noise signal gives the synthesized voice a smooth sound quality.
[0081]
According to the invention of claim 15, in the invention of claim 1, a down-sampling unit is added that down-samples the synthesized speech signal output from the speech synthesis unit to limit the frequency band of the reproduced speech, and outputs the result. By limiting the sampling frequency, the frequency band in which audio distortion may occur, namely the high-frequency part of the reproducible band where the spectrum processing has little effect, can be excluded from the reproduction frequency band. The distortion component therefore no longer affects the sound quality of the synthesized voice, and that sound quality can be improved.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a first embodiment of the present invention.
FIG. 2 is a configuration diagram of a second embodiment of the present invention.
FIG. 3(a) is a configuration diagram of a basic filter that constitutes the MLSA analysis filter used in the above embodiment, and FIG. 3(b) is a block diagram of a specific example of that MLSA analysis filter.
FIG. 4 is a configuration diagram showing Example 1 of the fundamental frequency detection unit used in the above embodiment.
FIG. 5 is a configuration diagram showing Example 2 of the fundamental frequency detection unit used in the above embodiment.
FIG. 6 is a configuration diagram showing Example 3 of the fundamental frequency detection unit used in the above embodiment.
FIG. 7 is a configuration diagram of Embodiment 3 of the present invention.
FIG. 8 is a configuration diagram illustrating an example of the fundamental frequency detection unit used in the above embodiment.
FIG. 9 is a configuration diagram showing Example 1 of the voiced sound detection unit used in the above embodiment.
FIG. 10 is an explanatory diagram of level detection performed by the voiced sound detection unit of Example 1 above.
FIG. 11 is a configuration diagram showing Example 2 of the voiced sound detection unit used in the above embodiment.
FIG. 12 is a configuration diagram showing Example 3 of the voiced sound detection unit used in the above embodiment.
FIG. 13 is a diagram illustrating the principle of Example 4 of the voiced sound detection unit used in the above embodiment.
FIG. 14 is a configuration diagram showing Example 4 of the voiced sound detection unit used in the above embodiment.
FIG. 15 is a configuration diagram showing Example 5 of the voiced sound detection unit used in the above embodiment.
FIG. 16 is a configuration diagram of a main part according to a fourth embodiment of the present invention.
FIG. 17 is an explanatory diagram of the operation of the fundamental frequency correction unit used in the above embodiment.
FIG. 18 is a configuration diagram of a main part according to a fifth embodiment of the present invention.
FIG. 19 is an explanatory diagram of an example of a signal generated by a sound source signal generator 5 according to a sixth embodiment of the present invention.
FIG. 20 is an explanatory diagram of the operation of the above sound source signal generator.
FIG. 21 is a configuration diagram of a main part according to a seventh embodiment of the present invention.
FIG. 22 is an explanatory diagram of the operation of the downsampling unit of the above.
[Explanation of symbols]
1 Voice analysis unit
2 Voiced sound detection unit
3 Fundamental frequency detection unit
4 Fundamental frequency processing unit
5 Sound source signal generation unit
6 Frequency processing control unit
7 Voice synthesis unit

Claims

A voice quality conversion system comprising:
a frequency processing control unit that controls frequency axis conversion processing during voice analysis and voice synthesis;
a voice analysis unit that analyzes input voice under the control of the frequency processing control unit;
a voiced sound detection unit that determines, from voice feature parameters obtained by the voice analysis unit, whether or not the input voice is a voiced sound;
a fundamental frequency detection unit that detects the fundamental frequency of the input voice when the voiced sound detection unit detects a voiced sound;
a fundamental frequency processing unit that performs fundamental frequency conversion by multiplying the fundamental frequency detected by the fundamental frequency detection unit by a factor;
a sound source signal generation unit that generates a pulse signal according to the fundamental frequency obtained by the conversion in the fundamental frequency processing unit when the voiced sound detection unit detects a voiced sound, generates a white noise signal when no voiced sound is detected, and outputs the pulse signal and the white noise signal as a sound source signal; and
a voice synthesis unit that performs voice synthesis, under frequency control by the frequency processing control unit, using the feature parameters obtained by voice analysis in the voice analysis unit and the sound source signal obtained from the sound source signal generation unit.
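
For illustration only, the switching between a pulse signal and a white noise signal recited for the sound source signal generation unit might be sketched as follows. The function name, the 16 kHz sampling rate, the frame length, and the unit pulse amplitude are assumptions introduced for this sketch and are not taken from the specification; pulse phase is not carried over between frames.

```python
import random

def excitation_frame(voiced, f0_hz, factor, fs=16000, n=160, seed=0):
    """Return one frame of source signal (hypothetical helper).

    voiced: result of voiced sound detection for this frame
    f0_hz:  detected fundamental frequency (ignored when unvoiced)
    factor: multiplier applied by the fundamental frequency conversion
    """
    if voiced:
        # Pitch period, in samples, of the converted fundamental frequency.
        period = max(1, round(fs / (f0_hz * factor)))
        # Unit pulses spaced one converted pitch period apart.
        return [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Unvoiced frame: Gaussian white noise (seeded for repeatability).
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

With `factor` below 1 the pulses are spaced further apart, which corresponds to lowering the pitch of the synthesized voice.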

The voice quality conversion system according to claim 1, wherein the voice analysis unit is configured by an MLSA analysis filter, the voice synthesis unit is configured by an MLSA synthesis filter, and the frequency axis conversion processing is performed by changing the mel frequency axis conversion between the mel-cepstral analysis and the voice synthesis.

The voice quality conversion system according to claim 1, wherein the voice analysis unit is configured by a mel-cepstral analysis unit using Fourier transform analysis, the voice synthesis unit is configured by an MLSA synthesis filter, and the frequency axis conversion processing is performed.

The voice quality conversion system according to claim 1, wherein the voiced sound detection unit converts the voice feature parameters obtained by the voice analysis unit into parameters on the frequency axis by a Fourier transform, detects the input voice level in a desired frequency band, and detects a voiced sound when the detected level is higher than a threshold.

The voice quality conversion system according to claim 1, wherein the voiced sound detection unit converts the voice feature parameters obtained by the voice analysis unit into parameters on the frequency axis by an approximate Fourier transform, detects the input voice level in a desired frequency band, and detects a voiced sound when the detected level is higher than a threshold.

The voice quality conversion system according to any one of claims 1 to 3, wherein the voiced sound detection unit uses the logarithmic power of a voice analysis parameter and detects a voiced sound when the logarithmic power value is larger than a threshold.
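
As a rough illustration of threshold detection on logarithmic power, the sketch below computes the log power of a frame of samples and compares it with a threshold. It is an assumption of this sketch that the frame waveform stands in for the voice analysis parameter; the function names and the -30 dB threshold are likewise hypothetical.

```python
import math

def log_power_db(frame, eps=1e-12):
    """Mean power of the frame in dB; eps avoids log10(0) on silence."""
    mean_power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_power + eps)

def is_voiced(frame, threshold_db=-30.0):
    """Report a voiced sound when the log power exceeds the threshold."""
    return log_power_db(frame) > threshold_db
```

A fixed threshold is used here for simplicity; a threshold that follows the input signal level would replace the default argument.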

The voice quality conversion system according to claim 4, wherein the threshold is set according to a voice input signal.

The voice quality conversion system according to claim 2, wherein the fundamental frequency detection unit detects the fundamental frequency from the interval between peaks detected in the autocorrelation of a residual signal output from the MLSA analysis filter.
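
The following is a minimal sketch of autocorrelation-based fundamental frequency detection on a residual signal. As a simplification it takes the lag of the strongest autocorrelation peak within an assumed search range of 60 to 400 Hz, rather than measuring the spacing of several peaks; the function name and the search range are assumptions of this sketch.

```python
def autocorr_f0(residual, fs, f_min=60.0, f_max=400.0):
    """Estimate F0 in Hz from the strongest autocorrelation peak."""
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(residual) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        r = sum(residual[i] * residual[i + lag]
                for i in range(len(residual) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    # None means no positive peak was found in the search range.
    return fs / best_lag if best_lag else None

# Example: an ideal pulse train with an 80-sample period at 8 kHz.
pulses = [1.0 if i % 80 == 0 else 0.0 for i in range(400)]
```

For the example pulse train the strongest peak lies at a lag of 80 samples, which at 8 kHz corresponds to 100 Hz.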

The voice quality conversion system according to claim 3, wherein the fundamental frequency detection unit detects the fundamental frequency from the interval between peaks detected in the higher-order components of the mel-cepstral parameters obtained by the mel-cepstral analysis unit.

The voice quality conversion system according to claim 2, wherein the fundamental frequency detection unit detects the fundamental frequency by analyzing the number of zero crossings of a residual signal output from the MLSA analysis filter.
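
A zero-crossing count can be turned into a frequency estimate as sketched below. The sketch assumes, as an idealization, that the signal crosses zero twice per fundamental period, which holds for a sinusoid but only approximately for a real residual signal; the function name and the test signal are hypothetical.

```python
import math

def zero_crossing_f0(residual, fs):
    """Estimate F0 in Hz assuming two zero crossings per period."""
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(residual, residual[1:]) if a * b < 0)
    duration_s = len(residual) / fs
    return crossings / (2.0 * duration_s)

# Example: 0.1 s of a 100 Hz sinusoid sampled at 8 kHz
# (the phase offset keeps samples away from exact zeros).
demo = [math.sin(2 * math.pi * 100 * i / 8000 + 0.3) for i in range(800)]
```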

The voice quality conversion system according to claim 2, wherein the fundamental frequency detection unit estimates and detects the fundamental frequency by a neural network that receives a residual signal output from the MLSA analysis filter.

The voice quality conversion system according to claim 1, further comprising a fundamental frequency correction processing unit that, when the gradient between the fundamental frequency detected by the fundamental frequency detection unit and the fundamental frequency detected one time step earlier exceeds a preset slope range, corrects the detected fundamental frequency so that the gradient falls within the slope range.
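
The correction recited above can be illustrated by clamping the frame-to-frame change of the fundamental frequency. The sketch assumes a symmetric slope range expressed in Hz per frame; the function and parameter names are hypothetical.

```python
def limit_f0_slope(f0_prev, f0_now, max_step_hz):
    """Clamp the change from the previous F0 to +/- max_step_hz."""
    delta = f0_now - f0_prev
    if delta > max_step_hz:
        return f0_prev + max_step_hz   # rising too fast: cap the rise
    if delta < -max_step_hz:
        return f0_prev - max_step_hz   # falling too fast: cap the fall
    return f0_now                      # within range: leave unchanged
```

For example, with a previous value of 100 Hz and a limit of 20 Hz per frame, a detected value of 180 Hz is corrected to 120 Hz, while a detected value of 90 Hz passes through unchanged.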

The voice quality conversion system according to claim 1, wherein the fundamental frequency processing unit performs fundamental frequency conversion processing according to the detected fundamental frequency.

The voice quality conversion system according to claim 1, wherein the sound source signal generation unit controls the amplitude of the white noise signal in accordance with the amplitude of the pulse signal to be generated.

The voice quality conversion system according to any one of claims 1 to 14, further comprising a down-sampling unit that down-samples the synthesized voice signal output from the voice synthesis unit and outputs a synthesized voice signal in which the frequency band of the reproduced voice is limited.
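
As a simple illustration of band limiting by down-sampling, the sketch below averages each block of `factor` samples and keeps one output sample per block. The block average acts as a crude low-pass filter; it is an assumption of this sketch and not the filter described in the specification, and the function name is hypothetical.

```python
def downsample(signal, factor):
    """Reduce the sampling rate by an integer factor using block averaging."""
    out = []
    # Step through complete blocks only; a trailing partial block is dropped.
    for start in range(0, len(signal) - factor + 1, factor):
        block = signal[start:start + factor]
        out.append(sum(block) / factor)
    return out
```

Halving the sampling rate in this way halves the highest frequency that the output signal can represent.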