JP6468518B2

JP6468518B2 - Basic frequency pattern prediction apparatus, method, and program

Info

Publication number: JP6468518B2
Application number: JP2016032411A
Authority: JP
Inventors: 亀岡 弘和 (Hirokazu Kameoka); 田中 宏 (Kou Tanaka); 戸田 智基 (Tomoki Toda); 中村 哲 (Satoshi Nakamura)
Original assignee: Nara Institute of Science and Technology NUC; Nippon Telegraph and Telephone Corp
Current assignee: Nara Institute of Science and Technology NUC; Nippon Telegraph and Telephone Corp
Priority date: 2016-02-23
Filing date: 2016-02-23
Publication date: 2019-02-13
Anticipated expiration: 2036-02-23
Also published as: JP2017151223A

Description

The present invention relates to a fundamental frequency pattern prediction apparatus, method, and program, and more particularly to a fundamental frequency pattern prediction apparatus, method, and program for predicting the fundamental frequency pattern of a target speech from a source speech.

Speech is a highly convenient means of communicating with others, but physical constraints sometimes inevitably impose various barriers. For example, if even one of the vocal organs stops working normally, the speaker suffers a serious speech disorder that hinders spoken communication. Moreover, because speech production is a physical act, it is unsuited to highly confidential communication and is vulnerable to ambient noise. Removing these barriers requires extending speech production beyond physical and bodily constraints: for example, generating speech by operating the vocal organs beyond bodily limits, generating speech by specifying the appropriate articulatory movements, or generating normal speech from the vocal organ movements made while uttering speech too faint to hear.

For example, speaking-aid techniques have been proposed for laryngectomees, who have lost the larynx to laryngeal cancer or the like. These techniques convert the unnatural speech produced by alternative speaking methods that rely on the remaining organs into more natural speech (see Non-Patent Documents 1 to 3).

In addition, a technique for converting non-audible murmur speech into natural speech has been proposed and is expected to find application in highly confidential telephony (see Non-Patent Document 4). The techniques above share the problem of predicting the fundamental frequency (F₀) pattern of natural speech from a sequence of spectral features of speech, and each consists of a learning process and a conversion process. The learning process uses recordings of the same utterances in the speech to be converted (electrolaryngeal speech in the former case, non-audible murmur speech in the latter) and in normal speech. First, at each discrete time (hereinafter, frame), the spectral features of the speech to be converted, obtained from several neighboring frames, and the log F₀ of the normal speech together with its dynamic component (time derivative or time difference) are extracted, and a joint vector that pairs them is obtained by dynamic time warping based on a spectral distance measure. This is called parallel data. Using the parallel data of every frame, the joint probability density function of the spectral features of the speech to be converted and the static and dynamic components of the log F₀ of the normal speech is represented by a Gaussian mixture model (hereinafter, GMM). The GMM parameters can be learned with the Expectation-Maximization algorithm. In the conversion process, the learned GMM is used to convert the spectral feature sequence of the speech to be converted into the F₀ pattern of normal speech by a maximum-likelihood sequence conversion method that takes intra-sequence variation into account.
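The frame pairing by dynamic time warping mentioned above can be sketched as follows. This is a minimal illustration, assuming a Euclidean distance between feature vectors and the basic three-way step pattern; the function name `dtw_path` and both choices are assumptions for illustration, not details taken from the cited documents.

```python
import numpy as np


def dtw_path(source, target):
    """Align two feature sequences (frames x dims) with plain dynamic time warping.

    Returns a list of (source_frame, target_frame) index pairs.
    """
    n, m = len(source), len(target)
    # Pairwise Euclidean distance between every source and target frame.
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
            cost[i, j] = dist[i - 1, j - 1] + best_prev
    # Backtrack from the last cell to the first along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda ij: cost[ij])
        path.append((i - 1, j - 1))
    return path[::-1]


source = np.array([[0.0], [1.0], [2.0]])
target = np.array([[0.0], [0.0], [1.0], [2.0]])
print(dtw_path(source, target))
```

Each aligned pair would then supply one joint vector of source spectral features and target log F₀ features for GMM training.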

[Non-Patent Document 1]: Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Communication, vol. 54, no. 1, pp. 134-146, 2012.

[Non-Patent Document 2]: Kou Tanaka, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura, "A hybrid approach to electrolaryngeal speech enhancement based on noise reduction and statistical excitation generation," IEICE Transactions on Information and Systems, vol. E97-D, no. 6, pp. 1429-1437, Jun. 2014.

[Non-Patent Document 3]: Kou Tanaka, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura, "Direct F0 control of an electrolarynx based on statistical excitation feature prediction and its evaluation through simulation," Proc. INTERSPEECH, pp. 31-35, Sep. 2014.

[Non-Patent Document 4]: Tomoki Toda, Kiyohiro Shikano, "NAM-to-speech conversion with Gaussian mixture models," Proc. INTERSPEECH, pp. 244-247, 2005.

In the prior art, neither the learning process nor the conversion process used a model that takes the physical generation process of the speech F₀ pattern into account, so an unnatural F₀ pattern that a human could not physically utter could be generated. Making the prediction with the physical generation process of the F₀ pattern taken into account may yield a more natural F₀ pattern.

The F₀ pattern is produced by movements of the thyroid cartilage, which apply tension to the vocal cords. Non-Patent Document 5 proposes a technique that, based on a probabilistic model of this control mechanism, estimates parameters related to the thyroid cartilage movements, called phrase and accent commands. A key point of this technique is that the process generating the time series of phrase and accent commands is expressed by a hidden Markov model (hereinafter, HMM); through the design of the HMM topology and the learning of the transition probabilities, linguistic or a priori knowledge about the command sequence can be incorporated into the parameter estimation.

[Non-Patent Document 5]: Hirokazu Kameoka, Jonathan Le Roux, Yasunori Ohishi, "A statistical model of speech F₀ contours," ISCA Tutorial and Research Workshop on Statistical And Perceptual Audition (SAPA2010), pp. 43-48, Sep. 2010.

The present invention has been made to solve the above problem, and its object is to provide a fundamental frequency pattern prediction apparatus, method, and program capable of estimating the optimal F₀ pattern corresponding to a spectral feature sequence while taking into account the constraints of the physical generation process of the F₀ pattern.

To achieve the above object, a fundamental frequency pattern prediction apparatus according to a first invention takes time-series data of a source speech as input and predicts, for a target speech, a command sequence consisting of pairs, at each time, of a phrase command representing the fundamental frequency pattern produced by the translational movement of the thyroid cartilage and an accent command representing the fundamental frequency pattern produced by the rotational movement of the thyroid cartilage. The apparatus comprises: a learning unit that receives parallel data consisting of time-series data of a source speech and time-series data of a target speech of a learning sample and, based on the spectral feature vector at each time extracted from the time-series data of the source speech and a state sequence, estimated from the time-series data of the target speech, consisting of the indices of the hidden Markov model states that indicate the pair of the phrase command and the accent command at each time, learns the parameters of a joint generative model of the combination of the spectral feature vector at each time of the source speech and the state sequence of the target speech; and a conversion processing unit that receives time-series data of a source speech to be predicted and, based on the spectral feature vector at each time extracted from that time-series data and the parameters of the joint generative model learned by the learning unit, predicts the command sequence of the target speech corresponding to the source speech to be predicted.

In the fundamental frequency pattern prediction apparatus according to the first invention, the hidden Markov model may have a plurality of states p_m in which a phrase command occurs, a plurality of states a_n in which an accent command occurs, and states r_0 and r_1 in which neither a phrase command nor an accent command occurs, with the states connected so that the model transitions from state r_0 to one of the states p_m and then to state r_1, and from state r_1 to one of the states a_n and then to state r_0.
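The state connections described above can be written down as a table of allowed transitions. The sketch below is only an illustration: the function names, the uniform initial transition probabilities, and the optional self-loops on the rest and accent states are assumptions, not details stated in the claim.

```python
import numpy as np


def build_topology(num_phrase, num_accent, self_loops=True):
    """Return state names, a name-to-index map, and a boolean mask of allowed transitions."""
    names = (["r0"] + [f"p{m}" for m in range(num_phrase)]
             + ["r1"] + [f"a{n}" for n in range(num_accent)])
    index = {name: i for i, name in enumerate(names)}
    allowed = np.zeros((len(names), len(names)), dtype=bool)
    for m in range(num_phrase):
        allowed[index["r0"], index[f"p{m}"]] = True   # r0 -> p_m
        allowed[index[f"p{m}"], index["r1"]] = True   # p_m -> r1
    for n in range(num_accent):
        allowed[index["r1"], index[f"a{n}"]] = True   # r1 -> a_n
        allowed[index[f"a{n}"], index["r0"]] = True   # a_n -> r0
    if self_loops:
        # Assumption: rest and accent states may last several frames, while
        # phrase states (impulse-like commands) last a single frame.
        for name in names:
            if not name.startswith("p"):
                allowed[index[name], index[name]] = True
    return names, index, allowed


def uniform_transitions(allowed):
    """Turn the mask into a row-stochastic matrix with equal weight on allowed moves."""
    return allowed / allowed.sum(axis=1, keepdims=True)


names, index, allowed = build_topology(num_phrase=2, num_accent=3)
phi = uniform_transitions(allowed)
print(names)
print(phi.round(2))
```

Transitions outside the mask keep probability zero during learning, which is how the constraints on the command sequence would be enforced in such a sketch.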

In the fundamental frequency pattern prediction apparatus according to the first invention, the learning unit may learn the parameter λ of the joint generative model so as to increase an objective function expressed by the following equation.

Here, c consists of the spectral feature vectors c[k] at each time k of the source speech, s is the state sequence consisting of the states s_k at each time k, [...] is the parameter of the state output distribution P(c[k] | s_k) of state s_k, φ_{i,i'} is the transition probability between states i and i' of the hidden Markov model, and φ_i is the probability that the initial state is state i.
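The equation itself is not available in this text. One form consistent with the definitions above, assuming the objective is the joint log-likelihood of c and s under the usual HMM factorization (an assumption, not a quotation of the claim), would be:

```latex
\log P(c, s \mid \lambda)
  = \log \phi_{s_1}
  + \sum_{k=2}^{K} \log \phi_{s_{k-1},\, s_k}
  + \sum_{k=1}^{K} \log P\bigl(c[k] \mid s_k\bigr)
```

Under this reading, λ gathers the output distribution parameters, the transition probabilities φ_{i,i'}, and the initial state probabilities φ_i.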

In the fundamental frequency pattern prediction apparatus according to the first invention, the conversion processing unit may use the Viterbi algorithm to estimate the state sequence so as to increase the objective function, and may predict, from the estimated state sequence, the command sequence of the target speech corresponding to the source speech to be predicted.
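The Viterbi search referred to above can be sketched in the log domain as follows. The function name and the array layout (frames by states) are assumptions for illustration; the per-frame log-likelihoods log P(c[k] | s_k) are taken as a precomputed input rather than derived from any particular output distribution.

```python
import numpy as np


def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence of an HMM.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities, row = from, column = to
    log_emit:  (K, S) log-likelihood of frame k under state s
    Returns the state index path (K,) and its joint log score.
    """
    num_frames, num_states = log_emit.shape
    delta = np.empty((num_frames, num_states))
    back = np.zeros((num_frames, num_states), dtype=int)
    delta[0] = log_init + log_emit[0]
    for k in range(1, num_frames):
        scores = delta[k - 1][:, None] + log_trans    # scores[i, j]: come from i, go to j
        back[k] = np.argmax(scores, axis=0)
        delta[k] = scores[back[k], np.arange(num_states)] + log_emit[k]
    path = np.empty(num_frames, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for k in range(num_frames - 1, 0, -1):
        path[k - 1] = back[k, path[k]]
    return path, float(delta[-1, path[-1]])


log_init = np.log([0.9, 0.1])
log_trans = np.log([[0.5, 0.5], [0.5, 0.5]])
log_emit = np.log([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]])
path, score = viterbi(log_init, log_trans, log_emit)
print(path, score)
```

Mapping each state index on the returned path back to its phrase and accent command pair would then give the predicted command sequence.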

A fundamental frequency pattern prediction method according to a second invention is a method performed in a fundamental frequency pattern prediction apparatus that takes time-series data of a source speech as input and predicts, for a target speech, a command sequence consisting of pairs, at each time, of a phrase command representing the fundamental frequency pattern produced by the translational movement of the thyroid cartilage and an accent command representing the fundamental frequency pattern produced by the rotational movement of the thyroid cartilage. The method comprises: a step in which a learning unit receives parallel data consisting of time-series data of a source speech and time-series data of a target speech of a learning sample and, based on the spectral feature vector at each time extracted from the time-series data of the source speech and a state sequence, estimated from the time-series data of the target speech, consisting of the indices of the hidden Markov model states that indicate the pair of the phrase command and the accent command at each time, learns the parameters of a joint generative model of the combination of the spectral feature vector at each time of the source speech and the state sequence of the target speech; and a step in which a conversion processing unit receives time-series data of a source speech to be predicted and, based on the spectral feature vector at each time extracted from that time-series data and the parameters of the joint generative model learned by the learning unit, predicts the command sequence of the target speech corresponding to the source speech to be predicted.

In the fundamental frequency pattern prediction method according to the second invention, the hidden Markov model may have a plurality of states p_m in which a phrase command occurs, a plurality of states a_n in which an accent command occurs, and states r_0 and r_1 in which neither a phrase command nor an accent command occurs, with the states connected so that the model transitions from state r_0 to one of the states p_m and then to state r_1, and from state r_1 to one of the states a_n and then to state r_0.

In the fundamental frequency pattern prediction method according to the second invention, the learning unit may learn the parameter λ of the joint generative model so as to increase an objective function expressed by the following equation.

A program according to a third invention is a program for causing a computer to function as each unit of the fundamental frequency pattern prediction apparatus according to the first invention.

According to the fundamental frequency pattern prediction apparatus, method, and program of the present invention, the parameters of a joint generative model of the combination of the spectral feature vector at each time of the source speech and the state sequence of the target speech are learned based on the spectral feature vector at each time extracted from the time-series data of the source speech and a state sequence consisting of the indices of the hidden Markov model states that indicate the pair of the phrase command and the accent command at each time. The command sequence of the target speech corresponding to the source speech to be predicted is then predicted based on the spectral feature vector at each time extracted from the time-series data of that source speech and the learned parameters of the joint generative model. This makes it possible to estimate the optimal F₀ pattern corresponding to a spectral feature sequence while taking into account the constraints of the physical generation process of the F₀ pattern.

FIG. 1 is a diagram showing an example of the configuration of the HMM.

FIG. 2 is a diagram showing an example in which state a_n is divided.

FIG. 3 is a diagram showing an example of the state transitions when state p_0 in the state set is increased in the configuration of the HMM.

FIG. 4 is a block diagram showing the configuration of the fundamental frequency pattern prediction apparatus according to the embodiment of the present invention.

FIG. 5 is a block diagram showing the configuration of the learning unit 30.

FIG. 6 is a block diagram showing the configuration of the conversion processing unit 50.

FIG. 7 is a flowchart showing the learning processing routine in the fundamental frequency pattern prediction apparatus according to the embodiment of the present invention.

FIG. 8 is a flowchart showing the fundamental frequency pattern prediction processing routine in the fundamental frequency pattern prediction apparatus according to the embodiment of the present invention.

FIG. 9 is a diagram showing an example of the effect observed in an experiment on the embodiment of the present invention.

FIG. 10 is a diagram showing an example of an HMM in which the start and end states of a plurality of left-to-right HMMs are connected.

Embodiments of the present invention are described in detail below with reference to the drawings. The technique proposed in the embodiments belongs to the technical field of signal processing. It is a speech processing technique that aims to improve the naturalness of speech by predicting a fundamental frequency pattern from a feature sequence of the speech and replacing the fundamental frequency pattern of the original speech with the predicted one.

Related techniques 1 to 3 relevant to the embodiment of the present invention are described first.

<Related technique 1: F₀ pattern prediction from a speech feature sequence>

First, the method for predicting an F₀ pattern from a speech feature sequence is described.

Non-Patent Documents 1 to 3 propose a method for predicting an F₀ pattern from a speech feature sequence. The method consists of a process that learns the parameters of a joint probability distribution model of the speech feature sequence and the F₀ pattern, and a process that uses the learned model to convert a given speech feature sequence into an F₀ pattern.

<Learning process>

In the learning process, parallel data of a source speech (for example, electrolaryngeal speech) and a target speech (for example, natural speech) is assumed to be given. Let c[k] be the feature vector of the source speech, and let q[k] = (y[k], Δy[k])^T be the joint vector of the log F₀ of the target speech and its dynamic component (time derivative or time difference), where k is the discrete time index. As the speech feature c[k], for example, the mel-cepstra (vectors) of several frames before and after time k are concatenated into one vector, which is then reduced in dimension by principal component analysis. This method models the joint probability distribution of c[k] and q[k] with a Gaussian mixture model (GMM), and the learning process learns the parameters of the GMM from the given parallel data {c[k], q[k]}_{k=1}^{K}. The GMM parameters can be estimated with the Expectation-Maximization (EM) algorithm.
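The construction of c[k] and q[k] described above can be sketched as follows. This is a minimal illustration using NumPy only; the context width, the edge padding, the backward difference used for Δy[k], and all function names are assumptions chosen for the example.

```python
import numpy as np


def stack_context(mcep, width):
    """Concatenate frames k-width..k+width for every k (edges padded by repetition)."""
    num_frames = mcep.shape[0]
    padded = np.pad(mcep, ((width, width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + num_frames] for i in range(2 * width + 1)])


def pca_reduce(features, dim):
    """Project mean-centered features onto their first `dim` principal axes."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T


def static_delta(log_f0):
    """q[k] = (y[k], dy[k]) with dy[k] = y[k] - y[k-1] and dy[0] = 0 (assumed delta)."""
    delta = np.diff(log_f0, prepend=log_f0[0])
    return np.stack([log_f0, delta], axis=1)


rng = np.random.default_rng(0)
mcep = rng.normal(size=(50, 4))            # stand-in for source mel-cepstra
log_f0 = 5.0 + 0.1 * rng.normal(size=50)   # stand-in for aligned target log F0
c = pca_reduce(stack_context(mcep, width=2), dim=6)
q = static_delta(log_f0)
joint = np.hstack([c, q])                  # one joint vector per frame
print(joint.shape)
```

The rows of `joint` are the vectors on which a GMM would be fitted with EM; any standard GMM implementation could be used for that step.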

<Conversion process>

In the conversion process, given a speech feature sequence c = (c[1]^T, ..., c[K]^T)^T, the maximum-likelihood F₀ pattern y = (y[1]^T, ..., y[K]^T)^T is computed. Here, q = (q[1]^T, ..., q[K]^T)^T, and W is the transformation matrix that expresses the relationship between y and q.
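The role of W can be illustrated with a small sketch. It assumes that the dynamic component is the backward difference Δy[k] = y[k] - y[k-1], so that q = W y, and it replaces the GMM-based likelihood by a single Gaussian with diagonal covariance per sequence; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np


def build_delta_matrix(num_frames):
    """W such that W @ y interleaves y[k] and dy[k] = y[k] - y[k-1] (dy[0] = 0)."""
    w = np.zeros((2 * num_frames, num_frames))
    for k in range(num_frames):
        w[2 * k, k] = 1.0              # static row
        if k > 0:
            w[2 * k + 1, k] = 1.0      # delta row
            w[2 * k + 1, k - 1] = -1.0
    return w


def most_likely_trajectory(mean_q, var_q):
    """Maximize N(W y; mean_q, diag(var_q)) over y: solve (W^T P W) y = W^T P mean_q."""
    num_frames = len(mean_q) // 2
    w = build_delta_matrix(num_frames)
    precision = 1.0 / np.asarray(var_q, dtype=float)
    lhs = w.T @ (precision[:, None] * w)
    rhs = w.T @ (precision * np.asarray(mean_q, dtype=float))
    return np.linalg.solve(lhs, rhs)


y_true = np.array([5.0, 5.1, 5.3, 5.2])
q = build_delta_matrix(4) @ y_true
print(q)
print(most_likely_trajectory(q, np.ones(8)))
```

Because the static and dynamic targets are tied together through W, the solution is a smooth trajectory rather than a frame-by-frame estimate.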

<Related technique 2: F₀ pattern generation process model>

Next, the F₀ pattern generation process model is described.

Fujisaki's fundamental frequency (F₀) pattern generation process model (the Fujisaki model) is a known model that describes how the F₀ pattern of speech is generated (see Non-Patent Document 6).

[Non-Patent Document 6]: H. Fujisaki, "In Vocal Physiology: Voice Production, Mechanisms and Functions," Raven Press, 1988.

The Fujisaki model is a physical model that explains how the F₀ pattern is generated by movements of the thyroid cartilage. In the Fujisaki model, the sum of the vocal cord elongations that accompany two independent movements of the thyroid cartilage (translation and rotation) is interpreted as causing the temporal change of F₀, and the F₀ pattern is modeled on the assumption that the vocal cord elongation is proportional to the logarithm y(t) of the F₀ pattern. The F₀ pattern x_p(t) produced by the translational movement of the thyroid cartilage is called the phrase component, and the F₀ pattern x_a(t) produced by the rotational movement is called the accent component. In the Fujisaki model, the speech F₀ pattern y(t) is expressed as the sum of these two components and a baseline component b determined by the physical constraints of the vocal cords, that is, y(t) = x_p(t) + x_a(t) + b.

Each of the two components is assumed to be the output of a second-order critically damped system: x_p(t) = g_p(t) * u_p(t) and x_a(t) = g_a(t) * u_a(t), where * denotes convolution with respect to time t and g_p(t) and g_a(t) are the impulse responses of the two systems. Here, u_p(t) is called the phrase command function and consists of a sequence of delta functions (phrase commands), and u_a(t) is called the accent command function and consists of a sequence of rectangular waves (accent commands). These command sequences are subject to the following constraints: a phrase command occurs at the start of an utterance, two phrase commands do not occur in succession, and two different commands do not occur at the same time. α and β are the natural angular frequencies of the phrase control mechanism and the accent control mechanism, respectively, and are empirically known to be roughly α = 3 rad/s and β = 20 rad/s regardless of the speaker or the utterance content.
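For reference, the relations above can be collected in one place. The impulse responses below are the standard critically damped second-order form commonly used for this model; since the equations are not available in this text, their exact form here is an assumption.

```latex
\begin{align}
y(t)   &= x_p(t) + x_a(t) + b, \\
x_p(t) &= g_p(t) * u_p(t), \qquad
g_p(t) = \begin{cases} \alpha^2 t\, e^{-\alpha t} & (t \ge 0) \\ 0 & (t < 0) \end{cases} \\
x_a(t) &= g_a(t) * u_a(t), \qquad
g_a(t) = \begin{cases} \beta^2 t\, e^{-\beta t} & (t \ge 0) \\ 0 & (t < 0) \end{cases}
\end{align}
```

With α of about 3 rad/s the phrase component decays over roughly a second, while β of about 20 rad/s makes the accent component follow its rectangular command within a few hundred milliseconds.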

<Related technique 3: Parameter estimation for the F₀ pattern generation process model>

Next, the parameter estimation method for the F₀ pattern generation process model is described.

The Fujisaki model described above can be formulated as the following probabilistic model (see Non-Patent Documents 5 and 7).

[Non-Patent Document 7]: Kota Yoshizato, Hirokazu Kameoka, Daisuke Saito, Shigeki Sagayama, "Hidden Markov convolutive mixture model for pitch contour analysis of speech," in Proc. The 13th Annual Conference of the International Speech Communication Association (Interspeech 2012), Sep. 2012.

First, consider an HMM that outputs the pair of phrase and accent command functions o[k] = (u_p[k], u_a[k])^T, where k is the discrete time index. The state output distribution is a normal distribution, and the command function pair o[k] is assumed to be generated from the normal distribution of the state occupied at each time. Here, {s_k}_{k=1}^{K} is the state sequence of the HMM, and the mean vector of the normal distribution takes a value determined by the state transitions of the HMM. FIG. 1 shows the specific configuration of the HMM.

Dividing each state into several sub-states makes it possible to parameterize the duration of self-transitions. All the sub-states of a state share the same output distribution, and the number of sub-states is set to a sufficiently large value. FIG. 2 shows an example in which state a_n is divided. For example, as in FIG. 2, setting the state transition probability from a_{n,m} to a_{n,m+1} to 1 for all m ≠ 0 makes the transition probability from a_{n,0} to a_{n,m} correspond to the probability that state a_n persists for m steps, so the duration of the accent command can be controlled flexibly. Dividing p, r_0, and r_1 into sub-states in the same way makes it possible to parameterize the distributions of the phrase command duration and of the interval between commands. With this division in place, the state notation is redefined accordingly hereafter, and the configuration of the HMM above is rewritten in terms of it.

Given a state sequence s = {s_k}_{k=1}^{K}, this HMM outputs the pair of the phrase command function u_p[k] and the accent command function u_a[k]. As shown in Equations (3) and (5), u_p[k] and u_a[k] are convolved with g_p[k] and g_a[k], respectively, to yield the phrase component x_p[k] and the accent component x_a[k]; that is, x_p[k] = u_p[k] * g_p[k] and x_a[k] = u_a[k] * g_a[k], where * denotes convolution with respect to the discrete time k. The F₀ pattern x[k] is then written as the superposition of three components, x[k] = x_p[k] + x_a[k] + b, where b is a baseline component that does not depend on time.
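The discrete-time superposition above can be sketched numerically as follows. The sampled impulse response α² t e^(-αt) (and likewise for β), its scaling by the frame period, the 5 ms frame period, and the function names are assumptions for illustration; only the default values α = 3 rad/s and β = 20 rad/s come from the text.

```python
import numpy as np


def impulse_response(rate, num_taps, frame_period):
    """Sampled critically damped response rate^2 * t * exp(-rate * t), scaled by the frame period."""
    t = np.arange(num_taps) * frame_period
    return rate ** 2 * t * np.exp(-rate * t) * frame_period


def synthesize_log_f0(u_p, u_a, baseline, frame_period=0.005, alpha=3.0, beta=20.0):
    """x[k] = (u_p * g_p)[k] + (u_a * g_a)[k] + b, with causal convolution over frames."""
    u_p = np.asarray(u_p, dtype=float)
    u_a = np.asarray(u_a, dtype=float)
    num_frames = len(u_p)
    g_p = impulse_response(alpha, num_frames, frame_period)
    g_a = impulse_response(beta, num_frames, frame_period)
    x_p = np.convolve(u_p, g_p)[:num_frames]   # phrase component
    x_a = np.convolve(u_a, g_a)[:num_frames]   # accent component
    return x_p + x_a + baseline


num_frames = 400                               # 2 s at 5 ms per frame
u_p = np.zeros(num_frames)
u_p[0] = 100.0                                 # one phrase command at utterance start
u_a = np.zeros(num_frames)
u_a[60:140] = 0.4                              # one rectangular accent command
log_f0 = synthesize_log_f0(u_p, u_a, baseline=np.log(100.0))
print(log_f0[:5], log_f0.max())
```

The accent component rises and falls quickly around its rectangular command, while the phrase component forms a slow hump after the impulse, which is the qualitative behavior the model is meant to capture.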

また、実音声においては、いつも信頼のできるＦ_０の値が観測できるとは限らない。藤崎モデルのパラメータ推定を行うにあたっては、信頼のおける観測区間のＦ_０値のみを考慮に入れて、そうでない区間は無視することが望ましい。例えば音声の無声区間においては通常声帯の振動に伴う周期的な粗密波は観測されないので、仮に自動ピッチ抽出によって音声の無声区間から何らかの値がＦ_０の推定値得られたとしても、その値を声帯から発せられる信号のＦ_０の値と見なすのは適当ではない。そこで、提案モデルに観測Ｆ_０値の時刻ｋにおける不確かさの程度ｖ_ｎ ²［ｋ］を導入する。具体的には、観測Ｆ_０値ｙ［ｋ］を、真のＦ_０値ｘ［ｋ］とノイズ成分 In real speech, a reliable F ₀ value is not always observable. In estimating the parameters of the Fujisaki model, it is desirable to take into account only the F ₀ value of the reliable observation interval and ignore the other intervals. For example, since the periodic coarse / dense wave associated with the normal vocal cord vibration is not observed in the voiceless section, even if an estimated value of F ₀ is obtained from the voiceless section by automatic pitch extraction, the value is It is not appropriate to consider the value of F ₀ of a signal emitted from a vocal cord. Therefore, the degree of uncertainty v _n ² [k] at time k of the observed F ₀ value is introduced into the proposed model. Specifically, the observed F ₀ value y [k], the true F ₀ value x [k], and the noise component

との重ね合わせで With overlay

と表現することで、信頼のおける区間かどうかに関わらず全ての観測区間を統一的に扱える。また、 This means that all observation intervals can be handled uniformly regardless of whether they are reliable intervals. Also,

is regarded as a constant, and

is assumed to be uniformly distributed. Marginalizing out $x_n[k]$ then gives the probability density function of $y = \{y[k]\}_{k=1}^{K}$ conditioned on the output value sequence $o = \{o[k]\}_{k=1}^{K}$:

Given the state sequence $s = \{s[k]\}_{k=1}^{K}$ and the parameters representing the command amplitudes

the output value sequence $o$ is generated according to

Moreover, $P(s)$ can be written as a product of state transition probabilities,

$$P(s) = \phi_{s_1} \prod_{k=2}^{K} \phi_{s_{k-1}, s_k}$$

where $\phi_{s_1}$ denotes the probability that the initial state is $s_1$.
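The probability of a state sequence as a product of an initial-state probability and transition probabilities can be evaluated in the log domain as in the sketch below. The array layout (`log_init[i]` for the initial probability of state $i$, `log_trans[i, j]` for the transition from $i$ to $j$) and the function name are assumptions of this example.

```python
import numpy as np

def log_state_sequence_prob(s, log_init, log_trans):
    """log P(s) = log phi_{s_1} + sum_{k>=2} log phi_{s_{k-1}, s_k}.

    s         : integer state indices, length K
    log_init  : (I,) log initial-state probabilities
    log_trans : (I, I) log transition probabilities, row = from, column = to
    """
    s = np.asarray(s, dtype=int)
    # Fancy indexing picks one transition term per consecutive pair of states.
    return log_init[s[0]] + log_trans[s[:-1], s[1:]].sum()
```

Working with logarithms avoids numerical underflow when $K$ is large, since the product of many probabilities quickly becomes too small to represent.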

<Outline of the Embodiment of the Present Invention>

The technique according to the embodiment of the present invention combines the probabilistic models described in Non-Patent Documents 5 and 7 with the probabilistic models described in Non-Patent Documents 1 to 3, and predicts phrase and accent commands from a sequence of speech spectral features. This makes it possible to estimate the optimal $F_0$ pattern corresponding to the spectral feature sequence while satisfying the constraints of the physical generation process of the $F_0$ pattern.

<Principle of the Embodiment of the Present Invention>

To learn the model parameters, the state transition network of the HMM in Related Art 3 described above is first modified.

As shown in FIG. 3, the state $p_0$ in the state set is expanded to $p_0, \ldots, p_{M-1}$, and the mean of each state output distribution is made independent of the time $k$. As in FIG. 2, each $p_m$ is also divided into sub-states, $p_m = \{p_{m,0}, p_{m,1}, \ldots\}$. Different intensity values $A_p^{(m)}$ thereby correspond to $p_m$ for different $m$. Note that, unlike in Related Art 3, the mean $A_p^{(m)}$ of the output distribution of state $p_m$ is a variable that does not depend on $k$. Writing $r_1 = \{r_{i,j} \mid i \geq 1, j \geq 0\}$, $r_{i,j} = \{r_{i,j,l} \mid l \geq 0\}$, $a_i = \{a_{i,j} \mid j \geq 0\}$, and $a_{i,j} = \{a_{i,j,l} \mid l \geq 0\}$, the above example can be described as an HMM with the following configuration.

Next, using Related Art 3 with its HMM replaced by the HMM configured as above, the phrase and accent commands (the state sequence $s$) are estimated in advance for the target speech data. Let $c[k]$ denote the feature vector of the source speech, where $k$ is the discrete time index. For example, as in Non-Patent Documents 1 to 3, $c[k]$ is obtained by concatenating the mel-cepstrum vectors of several frames before and after time $k$ and reducing the dimension of the concatenated vector by principal component analysis. This yields paired data $\{c[k], s_k\}_{k=1}^{K}$ consisting of the feature vectors of the source speech and the state sequence of the phrase and accent commands.
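The construction of $c[k]$ described above (stacking neighbouring mel-cepstrum frames, then compressing with principal component analysis) can be sketched as follows. The mel-cepstra are assumed to be already available as a $K \times D$ array; the context width, the edge padding, the number of retained components, and the function names are assumptions of this example rather than values given in the source.

```python
import numpy as np

def stack_context(mcep, width):
    """Concatenate frames k-width .. k+width for every k.

    mcep : (K, D) mel-cepstrum matrix (assumed precomputed)
    Edge frames are repeated so that every k has a full context (an assumption).
    Returns a (K, (2 * width + 1) * D) matrix.
    """
    K, _ = mcep.shape
    padded = np.pad(mcep, ((width, width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + K] for i in range(2 * width + 1)])

def fit_pca(X, n_components):
    """Return the mean and the leading principal axes of X (rows are samples)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal axes ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project X onto the retained principal axes."""
    return (X - mean) @ components.T
```

The PCA basis would be fitted once on the training data and then reused unchanged at prediction time, so that source features seen in both phases live in the same space.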

The joint generative model of this paired data $\{c[k], s_k\}_{k=1}^{K}$ is described by an HMM obtained from the HMM above (FIG. 3) by taking $c[k]$ instead of $o[k]$ as the output symbol. In addition, the state output distribution is changed to a Gaussian mixture model (GMM). In the learning process, the parameters of these GMMs (the mixture weights, and the mean and covariance matrix of each normal distribution) are the parameters to be learned. That is, $\log P(c, s \mid \lambda)$ is given by

$$\log P(c, s \mid \lambda) = \log \phi_{s_1} + \sum_{k=2}^{K} \log \phi_{s_{k-1}, s_k} + \sum_{k=1}^{K} \log \sum_{m=1}^{M} \alpha_{m, s_k}\, \mathcal{N}\!\left(c[k];\, \mu_{m, s_k}, \Sigma_{m, s_k}\right) \tag{16}$$

where $\lambda_i = \{\alpha_{m,i}, \mu_{m,i}, \Sigma_{m,i} \mid m = 1, \ldots, M\}$ denotes the output distribution parameters of state $i$ (the GMM mixture weights, and the mean and covariance matrix of each normal distribution), and $\lambda = \{\lambda_i\}_{i=1}^{I}$ is the set of parameters to be estimated in the learning process. $\phi_{i,i'}$ denotes the probability of a transition from state $i$ to state $i'$ and is treated as a constant.
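To make the structure of $\log P(c, s \mid \lambda)$ concrete, the sketch below evaluates it as the sum of an initial-state term, transition terms, and per-frame GMM output terms. For brevity it assumes diagonal covariance matrices, whereas the description allows full covariance matrices; the container layout (`gmms[i] = (weights, means, variances)`) and the function names are likewise assumptions of this example.

```python
import numpy as np

def gmm_log_density(c, weights, means, variances):
    """log sum_m alpha_m N(c[k]; mu_m, diag(var_m)) for every row of c.

    c : (K, D); weights : (M,); means, variances : (M, D)
    Diagonal covariances are a simplification made for this sketch.
    """
    diff = c[:, None, :] - means[None, :, :]                     # (K, M, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)  # (M,)
    log_comp = np.log(weights) + log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)
    # Log-sum-exp over mixture components, stabilised by subtracting the max.
    top = log_comp.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(log_comp - top).sum(axis=1, keepdims=True)))[:, 0]

def joint_log_likelihood(c, s, gmms, log_init, log_trans):
    """Initial-state term + transition terms + GMM output terms."""
    s = np.asarray(s, dtype=int)
    total = log_init[s[0]] + log_trans[s[:-1], s[1:]].sum()
    for i in np.unique(s):
        weights, means, variances = gmms[int(i)]
        total += gmm_log_density(c[s == i], weights, means, variances).sum()
    return total
```

Because the transition probabilities are held constant, only the output terms depend on the parameters being learned, which is why the GMM parameters of each state can be treated separately.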

In the learning process, given the spectral feature sequence $c = (c[1], \ldots, c[K])$ and the state sequence $s = (s[1], \ldots, s[K])$, $\lambda$ is estimated so that $\log P(c, s \mid \lambda)$ reaches a local maximum. This can be achieved with the Expectation-Maximization (EM) algorithm.
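One way to read the learning step is the following: because the state sequence $s$ is given, the only remaining hidden variables are the mixture-component assignments, so EM can be run separately on the frames assigned to each state. The sketch below follows that reading. It uses diagonal covariances, random initialisation from the data, a variance floor, and a fixed iteration count in place of a convergence test; all of these, and the function names, are assumptions of this example.

```python
import numpy as np

def fit_gmm_em(X, n_mix, n_iter=50, var_floor=1e-6, seed=0):
    """Fit a diagonal-covariance GMM to the rows of X with EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    K, _ = X.shape
    # Initialisation (assumption): means drawn from the data, shared global variance.
    means = X[rng.choice(K, size=n_mix, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + var_floor, (n_mix, 1))
    weights = np.full(n_mix, 1.0 / n_mix)
    history = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each frame.
        diff = X[:, None, :] - means[None, :, :]
        log_comp = (np.log(weights)
                    - 0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)
                    - 0.5 * (diff ** 2 / variances).sum(axis=2))
        top = log_comp.max(axis=1, keepdims=True)
        log_lik = top + np.log(np.exp(log_comp - top).sum(axis=1, keepdims=True))
        resp = np.exp(log_comp - log_lik)
        history.append(float(log_lik.sum()))
        # M-step: re-estimate weights, means and variances from the posteriors.
        counts = resp.sum(axis=0) + 1e-12
        weights = counts / counts.sum()
        means = (resp.T @ X) / counts[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = (resp[:, :, None] * diff ** 2).sum(axis=0) / counts[:, None] + var_floor
    return weights, means, variances, history

def train_state_gmms(c, s, n_mix):
    """Fit one GMM per state; each state needs at least n_mix frames."""
    s = np.asarray(s, dtype=int)
    return {int(i): fit_gmm_em(c[s == i], n_mix)[:3] for i in np.unique(s)}
```

The returned `history` holds the log-likelihood before each update and can be used to implement a convergence test instead of the fixed iteration count.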

Converting the feature sequence of the source speech into phrase and accent commands amounts to estimating the state sequence of the HMM described above given the spectral feature sequence $c = (c[1], \ldots, c[K])$. The state sequence $s = (s[1], \ldots, s[K])$ is therefore obtained with the ordinary Viterbi algorithm, and the corresponding sequence of state output means is taken as the phrase and accent command sequence.
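The Viterbi decoding referred to above can be sketched as follows. The input `log_emit[k, i]` is assumed to hold $\log P(c[k] \mid s_k = i)$, computed beforehand from the learned output distributions; forbidden transitions can be encoded as `-inf` entries of `log_trans`. The array layout and function name are assumptions of this example.

```python
import numpy as np

def viterbi(log_emit, log_init, log_trans):
    """Most probable state path under an HMM, computed in the log domain.

    log_emit  : (K, I) log output probabilities per frame and state
    log_init  : (I,) log initial-state probabilities
    log_trans : (I, I) log transition probabilities, row = from, column = to
    """
    K, I = log_emit.shape
    delta = log_init + log_emit[0]          # best score of paths ending in each state
    back = np.zeros((K, I), dtype=int)      # best predecessor for each frame and state
    for k in range(1, K):
        scores = delta[:, None] + log_trans  # scores[i, j]: end in i, then move to j
        back[k] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[k]
    # Trace back from the best final state.
    path = np.empty(K, dtype=int)
    path[-1] = delta.argmax()
    for k in range(K - 1, 0, -1):
        path[k - 1] = back[k, path[k]]
    return path
```

The cost is proportional to $K I^2$, which stays small here because most entries of the transition matrix are forbidden by the command topology.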

<System Configuration>

Next, an embodiment of the present invention is described, taking as an example the case where the invention is applied to a fundamental frequency pattern prediction apparatus that predicts the fundamental frequency pattern of a target speech from the spectral feature sequence of a source speech.

As shown in FIG. 4, the fundamental frequency pattern prediction apparatus according to the embodiment of the present invention is implemented as a computer comprising a CPU, a RAM, and a ROM that stores programs for executing the learning processing routine and the fundamental frequency pattern prediction processing routine described later. Functionally, it is configured as follows.

As shown in FIG. 4, the fundamental frequency pattern prediction apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 accepts parallel data consisting of time-series data of the source speech (for example, electrolaryngeal speech) and time-series data of the target speech (for example, natural speech) of a learning sample. The input unit 10 also accepts time-series data of the source speech to be predicted.

The calculation unit 20 includes a learning unit 30, a parameter storage unit 40, and a conversion processing unit 50.

The learning unit 30 learns the parameters of the joint generative model of the combination of the spectral feature vector at each time of the source speech and the state sequence of the target speech. It does so from the parallel data accepted by the input unit 10, which consists of the time-series data of the source speech and of the target speech of the learning sample, based on (i) the spectral feature vector at each time extracted from the time-series data of the source speech and (ii) the state sequence estimated from the time-series data of the target speech, which consists of the indices of the HMM states at each time, each state indicating a pair of a phrase command and an accent command.

As shown in FIG. 5, the learning unit 30 includes a feature extraction unit 32, a fundamental frequency sequence extraction unit 34, a state sequence estimation unit 36, and a model parameter learning unit 38.

The feature extraction unit 32 extracts the spectral feature vector $c[k]$ of the source speech from the time-series data of the source speech of the learning sample accepted by the input unit 10, where $k$ is the discrete time index. For example, as in Non-Patent Documents 1 to 3, $c[k]$ is obtained by concatenating the mel-cepstrum vectors of several frames before and after time $k$ and reducing the dimension of the concatenated vector by principal component analysis.

The fundamental frequency sequence extraction unit 34 extracts the fundamental frequency $y[k]$ at each time $k$ of the target speech from the time-series data of the target speech of the learning sample accepted by the input unit 10, and sets $y = (y[1], \ldots, y[K])^{T}$.

This fundamental frequency extraction can be realized with well-known techniques. For example, the fundamental frequency is extracted every 8 ms using the method described in Non-Patent Document 8 (H. Kameoka, "Statistical speech spectrum model incorporating all-pole vocal tract model and F0 contour generating process model," in Tech. Rep. IEICE, 2010, in Japanese).

In addition, the vector combining $y$ with its dynamic component (time derivative or time difference), called the $F_0$ feature, is defined as $q[k] = (y[k], \Delta y[k])^{T}$.
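The $F_0$ feature $q[k] = (y[k], \Delta y[k])^{T}$ can be formed as in the short sketch below. The description leaves open whether $\Delta y[k]$ is a time derivative or a time difference; the central difference computed by `numpy.gradient` is used here purely as one possible choice, and the function name is an assumption.

```python
import numpy as np

def f0_features(y):
    """Return a (K, 2) array whose k-th row is (y[k], delta_y[k]).

    delta_y is a central difference in the interior and a one-sided
    difference at the two ends (the behaviour of numpy.gradient).
    """
    y = np.asarray(y, dtype=float)
    delta = np.gradient(y)
    return np.stack([y, delta], axis=1)
```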

Based on the fundamental frequency $y[k]$ extracted by the fundamental frequency sequence extraction unit 34, the state sequence estimation unit 36 estimates the state sequence $s$ consisting of the HMM state $s_k$ at each time $k$.

As described in the principle of the embodiment above and shown in FIG. 3, this HMM has a plurality of states $p_m$ in which a phrase command occurs, a plurality of states $a_n$ in which an accent command occurs, and states $r_0$ and $r_1$ in which neither a phrase command nor an accent command occurs. The states are connected so that state $r_0$ transitions to one of the states $p_m$ and then to state $r_1$, and state $r_1$ transitions to one of the states $a_n$ and then to state $r_0$.
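The connectivity described above can be encoded as a boolean matrix of allowed transitions, as in the following sketch. It collapses the sub-states of FIG. 2 into single states and instead allows self-loops on $r_0$, $r_1$, and each $a_n$ so that those states can last several frames; this simplification, the uniform transition probabilities, and all names are assumptions of the example and are not taken from the source.

```python
import numpy as np

def build_topology(n_phrase, n_accent):
    """Return state names and a boolean matrix allowed[i, j] (from i to j)."""
    names = (["r0", "r1"]
             + [f"p{m}" for m in range(n_phrase)]
             + [f"a{n}" for n in range(n_accent)])
    idx = {name: i for i, name in enumerate(names)}
    allowed = np.zeros((len(names), len(names)), dtype=bool)
    # Assumption: command-free states may persist over several frames.
    allowed[idx["r0"], idx["r0"]] = True
    allowed[idx["r1"], idx["r1"]] = True
    for m in range(n_phrase):
        allowed[idx["r0"], idx[f"p{m}"]] = True   # r0 -> p_m
        allowed[idx[f"p{m}"], idx["r1"]] = True   # p_m -> r1
    for n in range(n_accent):
        allowed[idx["r1"], idx[f"a{n}"]] = True   # r1 -> a_n
        allowed[idx[f"a{n}"], idx[f"a{n}"]] = True  # assumption: accent may persist
        allowed[idx[f"a{n}"], idx["r0"]] = True   # a_n -> r0
    return names, allowed

def uniform_log_transitions(allowed):
    """Equal probability for every allowed transition, -inf for forbidden ones."""
    probs = allowed / allowed.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore"):
        return np.log(probs)
```

A matrix built this way can be passed directly to a log-domain Viterbi decoder, where the `-inf` entries rule out paths that violate the phrase and accent ordering.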

The data $\{c[k], s_k\}_{k=1}^{K}$ is thereby obtained.

Based on the spectral feature vector $c[k]$ at each time $k$ extracted by the feature extraction unit 32 and the state sequence $s$ estimated by the state sequence estimation unit 36, the model parameter learning unit 38 learns the parameter $\lambda$ of the joint generative model according to the existing EM algorithm so as to increase the objective function $\log P(c, s)$ of equation (16) above.

The conversion processing unit 50 receives the time-series data of the source speech to be predicted from the input unit 10 and predicts the command sequence of the target speech corresponding to that source speech, based on the spectral feature vector at each time extracted from the time-series data and on the parameter $\lambda$ of the joint generative model learned by the learning unit 30. Here, the state sequence is estimated with the Viterbi algorithm so as to increase the objective function of equation (16) above, and the command sequence of the target speech corresponding to the source speech to be predicted is obtained from the estimated state sequence.

As shown in FIG. 6, the conversion processing unit 50 includes a feature extraction unit 52, a state sequence estimation unit 54, and a command sequence update unit 56.

In the same manner as the feature extraction unit 32, the feature extraction unit 52 extracts the spectral feature vector $c[k]$ at each time $k$ of the source speech from the time-series data of the source speech to be predicted accepted by the input unit 10.

Based on the parameter $\lambda$ of the joint generative model learned by the learning unit 30 and the spectral feature vectors $c[k]$ extracted by the feature extraction unit 52, the state sequence estimation unit 54 estimates the state sequence $s$ according to the Viterbi algorithm so as to increase the objective function of equation (16) above.

Based on the state sequence $s$ estimated by the state sequence estimation unit 54, the command sequence update unit 56 calculates the mean sequence corresponding to the state sequence $s$ as the phrase and accent command sequence and outputs it to the output unit 90.
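Turning an estimated state sequence into a command sequence amounts to looking up, for each frame, the command magnitudes associated with its state. In the sketch below, the state ordering and the magnitude values in `PHRASE_MEAN` and `ACCENT_MEAN` are hypothetical placeholders chosen for the example; in the apparatus they would come from the HMM used to estimate the commands.

```python
import numpy as np

# Hypothetical state inventory and command magnitudes (illustration only).
STATE_NAMES = ["r0", "r1", "p0", "p1", "a0", "a1"]
PHRASE_MEAN = np.array([0.0, 0.0, 0.3, 0.6, 0.0, 0.0])  # non-zero only in p_m states
ACCENT_MEAN = np.array([0.0, 0.0, 0.0, 0.0, 0.2, 0.5])  # non-zero only in a_n states

def commands_from_states(path, phrase_mean=PHRASE_MEAN, accent_mean=ACCENT_MEAN):
    """Map each state index in `path` to its (phrase, accent) command magnitudes."""
    path = np.asarray(path, dtype=int)
    return phrase_mean[path], accent_mean[path]
```

The two returned sequences play the role of $u_p[k]$ and $u_a[k]$, from which an $F_0$ pattern can be generated by the convolution and superposition of the Fujisaki model.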

<Operation of the Fundamental Frequency Pattern Prediction Apparatus>

Next, the operation of the fundamental frequency pattern prediction apparatus 100 according to the embodiment of the present invention is described. First, when the input unit 10 accepts parallel data consisting of the time-series data of the source speech and of the target speech of a learning sample, the fundamental frequency pattern prediction apparatus 100 executes the learning processing routine shown in FIG. 7.

First, in step S100, the input time-series data of the source speech is read, and the spectral feature vector $c[k]$ at each time $k$ is extracted.

Next, in step S102, the input time-series data of the target speech is read, the fundamental frequency $y[k]$ at each time $k$ of the target speech is extracted, and the vector $q[k]$ combining $y[k]$ with its dynamic component is formed.

In step S104, the state sequence $s$ of the HMM is estimated based on the fundamental frequency $y[k]$ extracted in step S102.

In step S106, the parameter $\lambda$ of the joint generative model is initialized.

In step S108, the expected values are calculated according to the parameter $\lambda$ of the joint generative model as initialized in step S106 or as last updated in step S110.

In step S110, the parameter $\lambda$ of the joint generative model is updated so as to increase the objective function of equation (16) above, based on the spectral feature vector $c[k]$ at each time $k$ extracted in step S100, the state sequence $s$ estimated in step S104, and the expected values calculated in step S108.

In step S112, it is determined whether a predetermined convergence condition is satisfied. If it is not satisfied, the process returns to step S108. If it is satisfied, the parameter $\lambda$ learned in step S110 is stored in the parameter storage unit 40 in step S114.

Next, when the time-series data of the source speech to be predicted is input to the fundamental frequency pattern prediction apparatus 100, the apparatus executes the fundamental frequency pattern prediction processing routine shown in FIG. 8.

First, in step S200, the input time-series data of the source speech to be predicted is read, and the spectral feature vector $c[k]$ at each time $k$ is extracted.

In step S202, the state sequence $s$ is estimated according to the Viterbi algorithm, based on the parameter $\lambda$ of the joint generative model learned by the learning unit 30 and the spectral feature vectors $c[k]$ extracted in step S200.

In step S204, based on the state sequence $s$ estimated in step S202, the mean sequence corresponding to the state sequence $s$ is calculated as the phrase and accent command sequence and output to the output unit 90.

<Experimental Results of the Present Embodiment>

An experiment was conducted as follows. Spectral feature sequences, $F_0$ patterns, and phrase and accent commands were extracted from speech signals. The model parameters (GMM parameters) were learned by the learning process using paired data of spectral feature sequences and phrase and accent command sequences. Spectral feature sequences were then converted into phrase and accent command sequences by the conversion process, and the extent to which the converted command sequences restored the original $F_0$ patterns was checked. FIG. 9 shows an example of the results. The dotted line is the $F_0$ pattern estimated from the speech signal, and the broken line is the $F_0$ pattern obtained from the phrase and accent command sequence converted from the spectral feature sequence. Although the spectral features do not contain much $F_0$ information, the original $F_0$ pattern is largely restored.

As described above, the fundamental frequency pattern prediction apparatus according to the embodiment of the present invention learns the parameters of a joint generative model of the combination of the spectral feature vector at each time of the source speech and the state sequence of the target speech, based on the spectral feature vector at each time extracted from the time-series data of the source speech and on the state sequence consisting of the indices of the hidden Markov model states at each time, each state indicating a pair of a phrase command and an accent command. It then predicts the command sequence of the target speech corresponding to the source speech to be predicted, based on the spectral feature vector at each time extracted from the time-series data of that source speech and on the learned parameters of the joint generative model. This makes it possible to estimate the optimal $F_0$ pattern corresponding to the spectral feature sequence while taking into account the constraints of the physical generation process of the $F_0$ pattern.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, in the embodiment described above the HMM has the state transition network shown in FIG. 3, but the invention is not limited to this. As shown in FIG. 10, the HMM may instead be formed by connecting the start and end states of a plurality of left-to-right HMMs, with one left-to-right HMM for each of the plurality of states $p_m$ in which a phrase command occurs. As in the HMM of FIG. 3, the means of the state output distributions of states $p_0, \ldots, p_{M-1}$ do not depend on the time $k$. Each state in FIG. 10 is also divided into sub-states, as in FIG. 2.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
30 Learning unit
32 Feature extraction unit
34 Fundamental frequency sequence extraction unit
36, 54 State sequence estimation unit
38 Model parameter learning unit
40 Parameter storage unit
50 Conversion processing unit
52 Feature extraction unit
56 Command sequence update unit
90 Output unit
100 Fundamental frequency pattern prediction apparatus

Claims

A fundamental frequency pattern prediction apparatus that takes time-series data of a source speech as input and predicts a command sequence consisting of pairs of a phrase command, which represents the fundamental frequency pattern generated by translational movement of the thyroid cartilage at each time of a target speech, and an accent command, which represents the fundamental frequency pattern generated by rotational movement of the thyroid cartilage, the apparatus comprising:
a learning unit that takes as input parallel data consisting of time-series data of a source speech and time-series data of a target speech of a learning sample, and that learns parameters of a joint generative model of the combination of the spectral feature vector at each time of the source speech and a state sequence of the target speech, based on the spectral feature vector at each time extracted from the time-series data of the source speech and on the state sequence, which is estimated from the time-series data of the target speech and consists of indices of the states of a hidden Markov model at each time, each state indicating a pair of the phrase command and the accent command; and
a conversion processing unit that takes time-series data of a source speech to be predicted as input and predicts the command sequence of the target speech corresponding to the source speech to be predicted, based on the spectral feature vector at each time extracted from the time-series data of the source speech to be predicted and on the parameters of the joint generative model learned by the learning unit.

The fundamental frequency pattern prediction apparatus according to claim 1, wherein the hidden Markov model has a plurality of states $p_m$ in which the phrase command occurs, a plurality of states $a_n$ in which the accent command occurs, and states $r_0$ and $r_1$ in which neither the phrase command nor the accent command occurs, the states being connected such that state $r_0$ transitions to one of the plurality of states $p_m$ and then to state $r_1$, and state $r_1$ transitions to one of the plurality of states $a_n$ and then to state $r_0$.

The fundamental frequency pattern prediction apparatus according to claim 1, wherein the learning unit learns the parameter $\lambda$ of the joint generative model so as to increase an objective function represented by the following expression,
where $c$ consists of the spectral feature vectors $c[k]$ at each time $k$ of the source speech, $s$ is the state sequence consisting of the states $s_k$ at each time $k$,
is a parameter of the state output distribution $P(c[k] \mid s_k)$ of state $s_k$, $\phi_{i,i'}$ is the transition probability between states $i$ and $i'$ of the hidden Markov model, and $\phi_i$ is the probability that the initial state is state $i$.

The fundamental frequency pattern prediction apparatus according to claim 3, wherein the conversion processing unit estimates the state sequence using the Viterbi algorithm so as to increase the objective function, and predicts, from the estimated state sequence, the command sequence of the target speech corresponding to the source speech to be predicted.

A fundamental frequency pattern prediction method in a fundamental frequency pattern prediction apparatus that takes time-series data of a source speech as input and predicts a command sequence consisting of pairs of a phrase command, which represents the fundamental frequency pattern generated by translational movement of the thyroid cartilage at each time of a target speech, and an accent command, which represents the fundamental frequency pattern generated by rotational movement of the thyroid cartilage, the method comprising:
a step in which a learning unit takes as input parallel data consisting of time-series data of a source speech and time-series data of a target speech of a learning sample, and learns parameters of a joint generative model of the combination of the spectral feature vector at each time of the source speech and a state sequence of the target speech, based on the spectral feature vector at each time extracted from the time-series data of the source speech and on the state sequence, which is estimated from the time-series data of the target speech and consists of indices of the states of a hidden Markov model at each time, each state indicating a pair of the phrase command and the accent command; and
a step in which a conversion processing unit takes time-series data of a source speech to be predicted as input and predicts the command sequence of the target speech corresponding to the source speech to be predicted, based on the spectral feature vector at each time extracted from the time-series data of the source speech to be predicted and on the parameters of the joint generative model learned by the learning unit.

The fundamental frequency pattern prediction method according to claim 5, wherein the hidden Markov model has a plurality of states $p_m$ in which the phrase command occurs, a plurality of states $a_n$ in which the accent command occurs, and states $r_0$ and $r_1$ in which neither the phrase command nor the accent command occurs, the states being connected such that state $r_0$ transitions to one of the plurality of states $p_m$ and then to state $r_1$, and state $r_1$ transitions to one of the plurality of states $a_n$ and then to state $r_0$.

The fundamental frequency pattern prediction method according to claim 5 or 6, wherein the learning unit learns the parameter $\lambda$ of the joint generative model so as to increase an objective function represented by the following expression,
where $c$ consists of the spectral feature vectors $c[k]$ at each time $k$ of the source speech, $s$ is the state sequence consisting of the states $s_k$ at each time $k$,
is a parameter of the state output distribution $P(c[k] \mid s_k)$ of state $s_k$, $\phi_{i,i'}$ is the transition probability between states $i$ and $i'$ of the hidden Markov model, and $\phi_i$ is the probability that the initial state is state $i$.

A program for causing a computer to function as each unit of the fundamental frequency pattern prediction apparatus according to any one of claims 1 to 4.