JP2021189402A

JP2021189402A - Voice processing program, voice processing device and voice processing method

Info

Publication number: JP2021189402A
Application number: JP2020097784A
Authority: JP
Inventors: Kentaro Tachibana (橘健太郎)
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2020-06-04
Filing date: 2020-06-04
Publication date: 2021-12-13

Abstract

To appropriately convert a voice uttered by an arbitrary speaker into the sound quality of a voice uttered by a target speaker.

SOLUTION: A voice processing device 100 includes: a voice analysis unit 20 that receives a first voice; a phoneme conversion unit 24 that extracts a first phoneme posterior probability based on the first voice; a normalization unit 26 that generates a second phoneme posterior probability by normalizing the first phoneme posterior probability for each time unit; a feature amount conversion unit 28 that generates a feature amount of the first voice based on the second phoneme posterior probability; and a voice generation unit 30 that generates a second voice based on the feature amount.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice processing program, a voice processing device, and a voice processing method.

Voice processing devices have been developed that convert a voice uttered by an arbitrary speaker into a voice having the voice quality of another speaker.

For example, a technique has been disclosed that creates synthesized speech by training a model on parallel data, that is, data whose utterance content is the same as or similar to that of the target speaker's voice signal (Patent Document 1). A technique has also been disclosed that converts the voice of an arbitrary speaker into the voice quality of a target speaker by applying machine learning (Non-Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Publication No. 2017-151230

Non-Patent Document 1: L. Sun, et al., "Phonetic Posterior for many-to-one voice conversion without parallel data training," in Proc. ICME, Seattle, USA, Jul. 2016.

In the prior art, however, the voice quality and phrasing of the synthesized voice sometimes depended on the source speaker. That is, one example of the problem is that a voice sufficiently similar to the voice quality of the target speaker could not be synthesized from the voice data of an arbitrary speaker.

One aspect of the present invention is a voice processing program that causes a computer to execute: a first step of receiving a first voice; a second step of extracting a first phoneme posterior probability based on the first voice; a third step of generating a second phoneme posterior probability by normalizing the first phoneme posterior probability for each time unit; a fourth step of generating a feature amount of the first voice based on the second phoneme posterior probability; and a fifth step of generating a second voice based on the feature amount.

Here, the third step preferably generates the second phoneme posterior probability such that, for each time unit, at least one of the maximum value, the average value, and the variance of the phoneme probabilities in the first phoneme posterior probability becomes a constant value.

Further, the feature amount is preferably a spectral envelope.

Further, the program preferably includes a sixth step of extracting a first fundamental frequency from the first voice, and the fourth step preferably generates the feature amount based on the second phoneme posterior probability and the first fundamental frequency.

Further, it is preferable to calculate a second fundamental frequency from the first fundamental frequency based on the slope of the first fundamental frequency, and to generate the feature amount based on the second phoneme posterior probability and the second fundamental frequency.

Another aspect of the present invention is a voice processing device including: voice acquisition means for receiving a first voice; phoneme posterior probability extraction means for extracting a first phoneme posterior probability based on the first voice; normalization means for generating a second phoneme posterior probability by normalizing the first phoneme posterior probability for each time unit; feature amount generation means for generating a feature amount of the first voice based on the second phoneme posterior probability; and voice generation means for generating a second voice based on the feature amount.

Another aspect of the present invention is a voice processing method including: a first step of receiving a first voice; a second step of extracting a first phoneme posterior probability based on the first voice; a third step of generating a second phoneme posterior probability by normalizing the first phoneme posterior probability for each time unit; a fourth step of generating a feature amount of the first voice based on the second phoneme posterior probability; and a fifth step of generating a second voice based on the feature amount.

According to the present invention, it is possible to provide a voice processing program, a voice processing device, and a voice processing method that appropriately convert a voice uttered by an arbitrary speaker into the sound quality of a voice uttered by a target speaker. Other objects of the embodiments of the present invention will become apparent by reference to the entire specification.

FIG. 1 is a diagram showing the configuration of the voice processing device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the configuration of the voice processing device according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of extraction of the first fundamental frequency in the embodiment of the present invention.
FIG. 4 is a diagram showing an example of extraction of utterance information in the embodiment of the present invention.
FIG. 5 is a diagram explaining the normalization of phoneme posterior probabilities in the embodiment of the present invention.

As shown in FIG. 1, the voice processing device 100 according to the embodiment of the present invention includes a processing unit 10, a storage unit 12, an input unit 14, an output unit 16, and a communication unit 18. The processing unit 10 includes means for performing arithmetic processing, such as a CPU. By executing the voice processing program stored in the storage unit 12, the processing unit 10 realizes the voice processing functions of the present embodiment. The storage unit 12 includes storage means such as a semiconductor memory or a memory card. The storage unit 12 is connected to the processing unit 10 so as to be accessible from it, and stores the voice processing program and the information required for its processing. The input unit 14 includes means for inputting information, for example a keyboard, a touch panel, or buttons that receive input from the user. The input unit 14 also includes voice input means that receives the voices of an arbitrary speaker and of a predetermined target speaker. The voice input means may include, for example, a microphone and an amplifier circuit. The output unit 16 includes means for outputting a user interface (UI) screen for receiving input information from an administrator, as well as processing results. The output unit 16 includes, for example, a display that presents images. The output unit 16 also includes voice output means that outputs the synthesized voice generated by the voice processing device 100. The voice output means may include, for example, a speaker and an amplifier. The communication unit 18 includes an interface for exchanging information with an external terminal (not shown) via a network 102. Communication by the communication unit 18 may be wired or wireless. The voice information subjected to voice processing may also be acquired from an external terminal via the communication unit 18.

In the present embodiment, voice processing is performed to convert voices uttered by a plurality of speakers into the sound quality of the voice of a predetermined speaker (the target speaker). FIG. 2 is a functional block diagram showing the configuration of the voice processing device 100. The voice processing device 100 functions as a voice analysis unit 20, an intonation extraction unit 22, a phoneme conversion unit 24, a normalization unit 26, a feature amount conversion unit 28, and a voice generation unit 30.

The voice analysis unit 20 performs a process of acquiring voice data. That is, the processing unit 10 of the voice processing device 100 functions as the voice analysis unit 20. The voice data may be acquired by converting the speaker's voice into data using the microphone of the input unit 14. Alternatively, voice data recorded in advance on an external computer or the like may be received via the communication unit 18. The acquired voice data is stored in the storage unit 12.

The voice data acquisition process is performed for both the voice uttered by an arbitrary speaker and the voice uttered by the target speaker. The voice from the arbitrary speaker and the voice from the target speaker do not need to have the same content (so-called parallel training data). However, when the two voices do have the same content (parallel training data), the voice conversion is more likely to be performed appropriately.

The voice analysis unit 20 also performs the voice analysis required for voice processing. For example, the voice analysis unit 20 performs cepstrum analysis based on the frequency characteristics of the input voice and obtains voice data such as mel-frequency cepstral coefficients (MFCC), which contain information on the spectral envelope (information indicating, for example, the fullness of the voice) and on the fine structure, as well as the fundamental frequency and resonance frequencies of the voice (information indicating, for example, voice pitch and hoarseness). Hereinafter, the fundamental frequency of the voice data obtained by the voice analysis unit 20 is referred to as the first fundamental frequency F0.

The intonation extraction unit 22 calculates the slope ΔF0 of the first fundamental frequency F0 obtained from the voice by the voice analysis unit 20. FIG. 3 shows an example of the first fundamental frequency F0 extracted from the voice for each time frame. The slope ΔF0 calculated by the intonation extraction unit 22 is input to the feature amount conversion unit 28, where it is used in one of the methods, described later, for calculating the second fundamental frequency F1.
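The slope calculation can be sketched as a frame-to-frame difference of the F0 contour. This is a minimal illustration, not the implementation of the embodiment: the function name `f0_slope`, the 5 ms frame shift, and the convention that unvoiced frames carry F0 = 0 are all assumptions made for the example.

```python
from typing import List


def f0_slope(f0: List[float], frame_shift_s: float = 0.005) -> List[float]:
    """Return the frame-to-frame slope of an F0 contour in Hz per second.

    Assumption: unvoiced frames are marked with 0.0. The slope is set to
    0.0 wherever the current or the previous frame is unvoiced, and for
    the first frame, which has no predecessor.
    """
    if not f0:
        return []
    slopes = [0.0]
    for prev, cur in zip(f0, f0[1:]):
        if prev > 0.0 and cur > 0.0:
            slopes.append((cur - prev) / frame_shift_s)
        else:
            slopes.append(0.0)
    return slopes


# Example contour: unvoiced, three rising voiced frames, unvoiced.
contour = [0.0, 200.0, 205.0, 210.0, 0.0]
delta_f0 = f0_slope(contour)
```

Zeroing the slope at voiced/unvoiced boundaries avoids the large spurious jumps that a plain difference would produce there; other treatments, such as interpolating F0 across unvoiced regions, are equally plausible.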

The phoneme conversion unit 24 performs a process of extracting utterance information from the voice data. That is, the processing unit 10 of the voice processing device 100 functions as the phoneme conversion unit 24 by extracting utterance information from the voice data output from the voice analysis unit 20. Here, the utterance information is information that includes phoneme posterior probabilities (PPG: Phonetic Posterior Grams) or phonemes.

The processing unit 10 configures the phoneme conversion unit 24 by machine learning so that it extracts and outputs phoneme posterior probabilities or phonemes from voice data. More specifically, a model is trained to take voice data, which is time-series information that changes with time, as input data, and to extract and output the phoneme posterior probabilities or phonemes of that voice data, as illustrated in FIG. 4. As shown in FIG. 4, the phoneme posterior probability is information indicating the probability of each phoneme in each time frame.
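As a small illustration of what "the probability of each phoneme in each time frame" means as a data structure, the sketch below turns per-frame scores into a PPG with a softmax and reads off the most probable phoneme per frame. The four-symbol phoneme inventory, the function names, and the assumption that the trained model emits unnormalized scores (logits) are hypothetical and chosen only for the example.

```python
import math
from typing import List, Sequence

# Hypothetical, deliberately tiny phoneme inventory.
PHONEMES = ["a", "i", "u", "sil"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert one frame of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_ppg(frame_logits: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build a PPG: one probability vector over phonemes per time frame."""
    return [softmax(frame) for frame in frame_logits]


def best_phonemes(ppg: Sequence[Sequence[float]]) -> List[str]:
    """Pick the most probable phoneme in each time frame."""
    return [PHONEMES[max(range(len(f)), key=f.__getitem__)] for f in ppg]


ppg = to_ppg([[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0]])
labels = best_phonemes(ppg)
```

The PPG keeps the full distribution per frame, whereas the phoneme sequence keeps only the argmax; the text allows either as utterance information.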

For example, a convolutional neural network (CNN) is applied to configure the phoneme conversion unit 24 so that it extracts and outputs phoneme posterior probabilities or phonemes from the input voice data. A convolutional neural network restricts the connections between perceptrons instead of fully connecting them, and uses weight sharing, so that processing equivalent to image convolution is expressed within the framework of a neural network. Alternatively, for example, a recurrent neural network (RNN) may be applied to configure the phoneme conversion unit 24 so that it extracts and outputs phoneme posterior probabilities or phonemes from the input voice data. Other neural network components, such as a pooling layer, may also be incorporated into the phoneme conversion unit 24.

Further, the network architecture described in Reference 1 may be applied (Reference 1: Yuki Saito, Kei Akuzawa, Kentaro Tachibana, "Joint adversarial training of speech recognition and generation models for many-to-one voice conversion using phoneme posterior probabilities," Proceedings of the Acoustical Society of Japan 2019 Autumn Meeting, 2-4-2, pp. 963-966, September 2019, http://sython.org/papers/ASJ/saito2019asja_dena.pdf). That is, a deep neural network (DNN) is applied and trained to predict a phoneme label sequence from an input speech feature sequence, such as MFCCs, contained in the voice data. The pairs of input speech features and phoneme labels are extracted from a speech corpus containing many speakers. The phoneme posterior probability sequence represents the posterior probabilities of the phoneme labels given the input speech features. The phoneme conversion unit 24 is trained to minimize the phoneme recognition error, which is defined as the softmax cross-entropy L_SCE between the phoneme labels and the phoneme posterior probabilities. At this time, it is preferable to apply a training method based on domain-adversarial training (DAT) that improves the speaker invariance of the phoneme posterior probabilities input to the phoneme conversion unit 24. That is, it is preferable to define the speech corpora used for training the phoneme conversion unit 24 as a multi-speaker domain and a target-speaker domain, respectively, and to reduce the difference between these two domains by DAT. Specifically, the DNN architecture of the speech recognition model shown in Table 1 of Reference 1 may be applied. Other applicable architectures include the following.

Reference 2 (CLDNN): Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
Reference 3 (LAS): Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, attend and spell. arXiv preprint arXiv:1508.01211.
Reference 4 (Transformer-ASR): Karita, S., Chen, N., Hayashi, T., Hori, T., Inaguma, H., Jiang, Z., ... & Watanabe, S. (2019). A comparative study on transformer vs rnn in speech applications. arXiv preprint arXiv:1909.06317.
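The phoneme recognition error described above, a softmax cross-entropy between phoneme labels and phoneme posterior probabilities, can be sketched as follows. This is only an illustration under stated assumptions: the function name, the averaging over frames, and the use of integer label indices are choices made for the example, and the domain-adversarial term is not shown.

```python
import math
from typing import Sequence


def softmax_cross_entropy(
    frame_logits: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Mean softmax cross-entropy over time frames.

    frame_logits[t][k] is the unnormalized score of phoneme k at frame t,
    and labels[t] is the index of the correct phoneme at frame t.
    """
    if len(frame_logits) != len(labels) or not labels:
        raise ValueError("need one label per frame and at least one frame")
    total = 0.0
    for logits, label in zip(frame_logits, labels):
        m = max(logits)
        # log of the softmax denominator, computed stably
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[label]  # equals -log softmax(logits)[label]
    return total / len(labels)


confident = softmax_cross_entropy([[5.0, 0.0, 0.0]], [0])
mistaken = softmax_cross_entropy([[5.0, 0.0, 0.0]], [1])
```

A frame whose highest score falls on the correct label yields a small loss (`confident`), and the same scores with a different correct label yield a large one (`mistaken`), which is the quantity the training minimizes.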

It is preferable to apply an activation function to the convolutional neural network, the recurrent neural network, or a combination thereof. As the activation function, for example, ReLU, a sigmoid function, a softmax function, or a polynomial can be applied. A loss function may also be applied in accordance with the activation function. As the loss function, a softmax function or a Connectionist Temporal Classification (CTC) loss function may be used.

The normalization unit 26 performs a process of normalizing the utterance information extracted by the phoneme conversion unit 24. That is, the processing unit 10 of the voice processing device 100 functions as the normalization unit 26 by normalizing the utterance information output from the phoneme conversion unit 24.

The processing unit 10 treats the phoneme posterior probability (PPG), which indicates the probability of each phoneme in each unit, as a first phoneme posterior probability and normalizes it to generate a second phoneme posterior probability. FIG. 5 shows an example of first phoneme posterior probabilities extracted from the voices of two different women. The maximum value of the first phoneme posterior probability differs from time frame to time frame. That is, as shown in the figure, the maximum value of the first phoneme posterior probability obtained from the voice of woman 1 differs from that obtained from the voice of woman 2. Because the posterior probabilities are not aligned in each time frame in this way, voice quality is considered to deteriorate when voices uttered by a plurality of speakers are converted into the sound quality of the voice of a predetermined speaker (the target speaker).

Therefore, in the present embodiment, the utterance information extracted by the phoneme conversion unit 24 is normalized. For example, it is preferable to normalize so that the maximum value of the phoneme posterior probabilities in each time frame becomes a predetermined constant value. Specifically, dividing the posterior probability of each phoneme by the maximum value in each time frame normalizes the probabilities so that the maximum in each time frame becomes 1. It is also preferable, for example, to normalize so that the average value of the phoneme posterior probabilities in each time frame becomes a predetermined constant value, or so that their variance in each time frame becomes a predetermined constant value. Specifically, in each time frame, the posterior probability Y is calculated by applying the transformation of equation (1) to the posterior probability X of each phoneme using the average value μ and the standard deviation σ of X, which normalizes Y to a mean of 0 and a variance of 1.

(Equation 1)
Y = (X − μ) / σ ... (1)
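The two normalizations described above, division by the per-frame maximum and the standardization of equation (1), can be sketched as follows. The function names are hypothetical, and two details are assumptions not stated in the text: the variance is the population variance over the phonemes of a frame, and a small `eps` guards against division by zero when all probabilities in a frame are equal.

```python
import math
from typing import List, Sequence


def normalize_max(ppg: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each time frame so that its largest posterior becomes 1."""
    out = []
    for frame in ppg:
        peak = max(frame)
        out.append([p / peak for p in frame])
    return out


def normalize_zscore(
    ppg: Sequence[Sequence[float]], eps: float = 1e-12
) -> List[List[float]]:
    """Apply Y = (X - mu) / sigma within each time frame (equation (1))."""
    out = []
    for frame in ppg:
        mu = sum(frame) / len(frame)
        sigma = math.sqrt(sum((p - mu) ** 2 for p in frame) / len(frame))
        out.append([(p - mu) / (sigma + eps) for p in frame])
    return out


# One frame each from two speakers whose peak posteriors differ.
speaker_1 = [[0.60, 0.30, 0.10]]
speaker_2 = [[0.90, 0.06, 0.04]]
aligned_1 = normalize_max(speaker_1)
aligned_2 = normalize_max(speaker_2)
```

After `normalize_max`, both speakers' frames peak at exactly 1 even though the raw peaks were 0.60 and 0.90, which is the alignment across speakers that the normalization is meant to provide. Note that the standardized values are no longer probabilities: they can be negative and do not sum to 1.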

By normalizing the first phoneme posterior probability in the normalization unit 26 to calculate the second phoneme posterior probability in this way, a second phoneme posterior probability that is normalized in the same manner regardless of the speaker can be input to the feature amount conversion unit 28. That is, a voice can be generated based on second phoneme posterior probabilities whose distributions have been made similar across different speakers.

The feature amount conversion unit 28 performs a process of generating voice data of the target speaker, using as input data the slope ΔF0 of the first fundamental frequency F0 obtained by the intonation extraction unit 22 and the utterance information generated by the phoneme conversion unit 24. That is, the processing unit 10 of the voice processing device 100 functions as the feature amount conversion unit 28 by reconstructing voice data from the utterance information, which includes the phoneme posterior probabilities (PPG: Phonetic Posterior Grams) or phonemes generated by the phoneme conversion unit 24, and from the slope ΔF0 of the first fundamental frequency F0.

Besides the slope ΔF0, the first fundamental frequency F0 standardized to a mean of 0 and a variance of 1, or a quantized first fundamental frequency F0, may be applied. The quantization may use an existing coding algorithm such as μ-law, or may divide the values based on the cumulative distribution function of the first fundamental frequency F0. Further, the input is not limited to the slope ΔF0, which represents a dynamic value; the first fundamental frequency F0, which represents a static value, may be used alone, or both may be used together.
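The standardization and μ-law quantization of F0 mentioned above can be sketched as follows. The text names μ-law only as one possible coding algorithm, so the specifics here are assumptions for illustration: the function names, μ = 255, the use of voiced frames only, and the choice to map the standardized F0 into [-1, 1] by clipping at ±3 standard deviations before companding.

```python
import math
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Shift and scale values to mean 0 and variance 1 (population variance)."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0.0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def mu_law_code(x: float, mu: int = 255) -> int:
    """Compand x in [-1, 1] with the mu-law curve and return a code in 0..mu."""
    x = max(-1.0, min(1.0, x))  # clip to the valid input range
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1.0) / 2.0 * mu + 0.5)


def quantize_f0(
    voiced_f0: Sequence[float], clip_sigma: float = 3.0, mu: int = 255
) -> List[int]:
    """Standardize voiced F0 values, then quantize them with mu-law codes."""
    return [mu_law_code(z / clip_sigma, mu) for z in standardize(voiced_f0)]


codes = quantize_f0([100.0, 150.0, 200.0, 250.0])
```

μ-law spends more codes near zero, that is, near the speaker's mean F0 after standardization, and fewer at the extremes. The alternative named in the text, splitting by the cumulative distribution function, would instead place bin edges at quantiles of the observed F0 values.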

The processing unit 10 configures the feature amount conversion unit 28 by machine learning so that it generates voice data with the voice quality of the target speaker based on the slope ΔF0 of the first fundamental frequency F0 and the time-series information of the second phoneme posterior probability. For example, a convolutional neural network (CNN) is applied to configure the feature amount conversion unit 28 so that it takes the slope ΔF0 of the first fundamental frequency F0 and the time-series information of the second phoneme posterior probability as input data and generates and outputs voice data of a voice having the voice quality of the target speaker. A convolutional neural network restricts the connections between perceptrons instead of fully connecting them, and uses weight sharing, so that processing equivalent to image convolution is expressed within the framework of a neural network. Alternatively, for example, a recurrent neural network (RNN) may be applied to configure the feature amount conversion unit 28. Other neural network components, such as a pooling layer, may also be incorporated into the feature amount conversion unit 28. It is preferable to apply an activation function to the convolutional neural network, the recurrent neural network, or a combination thereof. As the activation function, for example, ReLU, a sigmoid function, a softmax function, or a polynomial can be applied.

The feature amount conversion unit 28 may also be configured by applying the technique described in Reference 1. Specifically, the architecture of the speech generation model shown in Table 1 of Reference 1 may be applied.

The voice loss function used for training may be a truth value indicating whether or not the voice data generated by the feature amount conversion unit 28 is voice data generated from the target speaker's voice (for example, 1 if it is the target speaker's voice data and 0 if it is not), or may be a likelihood value indicating how likely it is that the voice data generated by the feature amount conversion unit 28 is the target speaker's voice data.

The processing unit 10 takes as input data the generated voice data and voice data extracted in advance from the target speaker's voice, and trains the feature amount conversion unit 28 by machine learning using a voice loss function that indicates whether the voice data generated by the feature amount conversion unit 28 is voice data extracted from the target speaker's voice. A convolutional neural network (Conv) can be applied to generate the voice loss function. A convolutional neural network layer restricts the connections between perceptrons instead of fully connecting them, and uses weight sharing, so that processing equivalent to image convolution is expressed within the framework of a neural network. Alternatively, for example, a recurrent neural network (RNN) may be applied. Other neural network components, such as a pooling layer, may also be incorporated. It is preferable to apply an activation function to the convolutional neural network, the recurrent neural network, or a combination thereof. As the activation function, for example, ReLU, a sigmoid function, a softmax function, or a polynomial can be applied. As the voice loss function, it is preferable to use the L1 norm or the L2 norm of the error between the generated feature amount and the feature amount extracted from natural speech. When a GAN is applied, the adversarial loss function may be based on the least-squares error or may be a sigmoid function. In this case, the architecture of the speaker authentication model shown in Table 1 of Reference 1 may be applied.
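The L1-norm and L2-norm losses between generated and natural feature amounts can be sketched as follows. The function names and the representation of a feature sequence as a list of per-frame vectors are assumptions for illustration; practical training code often averages over elements or uses the squared L2 norm instead, which the text does not specify.

```python
import math
from typing import List, Sequence


def _errors(
    generated: Sequence[Sequence[float]], natural: Sequence[Sequence[float]]
) -> List[float]:
    """Flatten the element-wise differences between two feature sequences."""
    if len(generated) != len(natural):
        raise ValueError("sequences must have the same number of frames")
    diffs = []
    for g_frame, n_frame in zip(generated, natural):
        if len(g_frame) != len(n_frame):
            raise ValueError("frames must have the same dimensionality")
        diffs.extend(g - n for g, n in zip(g_frame, n_frame))
    return diffs


def l1_loss(generated, natural) -> float:
    """L1 norm of the error: sum of absolute differences."""
    return sum(abs(d) for d in _errors(generated, natural))


def l2_loss(generated, natural) -> float:
    """L2 norm of the error: square root of the sum of squared differences."""
    return math.sqrt(sum(d * d for d in _errors(generated, natural)))


generated = [[1.0, 2.0], [3.0, 4.0]]
natural = [[1.0, 0.0], [0.0, 4.0]]
loss_l1 = l1_loss(generated, natural)
loss_l2 = l2_loss(generated, natural)
```

The L1 norm penalizes all deviations linearly, while the L2 norm weights large deviations more heavily; which one suits spectral-envelope features better is an empirical question.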

In the above description, the process outputs a voice loss function indicating whether the voice data generated by the feature amount conversion unit 28, which combines the voice data generated based on the time-series information of the second phoneme posterior probability with the second fundamental frequency F1 generated based on the slope ΔF0 of the fundamental frequency F0 or on the first fundamental frequency F0, is voice data generated from the voice of the target speaker. However, the present invention is not limited to this. That is, a truth value or a likelihood value may be calculated from the difference between the voice data generated based on the time-series information of the second phoneme posterior probability and the voice data generated from the voice of the target speaker. The feature amount conversion unit 28 may then receive feedback of the voice loss function and, using the voice loss function as one of its input data, be trained by machine learning so that it generates voice data matched to the voice quality of the target speaker based on the time-series information of the second phoneme posterior probability in such a way that the voice loss function becomes smaller. Likewise, a truth value or a likelihood value may be calculated from the difference between the second fundamental frequency F1, generated based on the slope ΔF0 of the fundamental frequency F0 or on the first fundamental frequency F0, and the fundamental frequency F0 obtained from the voice of the target speaker. The feature amount conversion unit 28 may then receive feedback of the voice loss function indicating the degree of matching output from the speaker matching identification unit 32 and, using the voice loss function as one of its input data, be trained by machine learning so that it generates voice data including information on the second fundamental frequency F1 matched to the voice quality of the target speaker based on the slope ΔF0 of the first fundamental frequency F0 in such a way that the voice loss function becomes smaller. In this way, the voice data obtained from the time-series information of the second phoneme posterior probability may be combined with the voice data including the information on the second fundamental frequency F1, matched to the voice quality of the target speaker based on the slope ΔF0 of the first fundamental frequency F0, and output to the voice generation unit 30.
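The idea of deriving a likelihood value or a truth value from the difference between generated data and target-speaker data can be illustrated as below. The text does not specify how the difference is mapped to a score, so everything here is an assumption made for illustration: the mean squared difference, the Gaussian-shaped (unnormalized) score, the `sigma` scale, and the `threshold` are all hypothetical choices.

```python
import math

def mean_squared_difference(generated, target):
    """Average squared difference between two equal-length feature sequences."""
    return sum((g - t) ** 2 for g, t in zip(generated, target)) / len(target)

def likelihood_from_difference(generated, target, sigma=1.0):
    """Map the difference to a score in (0, 1]; identical data gives 1.0.

    Assumption: an unnormalized Gaussian-shaped score, exp(-d / (2 * sigma^2)).
    """
    d = mean_squared_difference(generated, target)
    return math.exp(-d / (2.0 * sigma ** 2))

def truth_value(generated, target, threshold=0.5, sigma=1.0):
    """True when the score reaches the (hypothetical) threshold."""
    return likelihood_from_difference(generated, target, sigma) >= threshold

# Hypothetical feature values, for illustration only.
print(likelihood_from_difference([1.0, 2.0], [1.0, 2.0]))
print(truth_value([0.0], [2.0]))
```

The same mapping could be applied either to the voice data or to the fundamental frequency contours (the second fundamental frequency F1 against the target speaker's F0), since both cases in the paragraph above are defined in terms of a difference.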

Alternatively, the second fundamental frequency F1 may be calculated arithmetically by adding, to the fundamental frequency of the target speaker's voice, an amount of change based on the slope ΔF0 of the first fundamental frequency F0 extracted from the voice of an arbitrary speaker. Instead of the slope ΔF0, a first fundamental frequency F0 standardized to have a mean of 0 and a variance of 1, or a quantized first fundamental frequency F0, may be used. The quantization may use an existing coding algorithm such as μ-law, or the quantization intervals may be obtained by partitioning based on the cumulative distribution function of the first fundamental frequency F0. Furthermore, the input is not limited to the slope ΔF0, which represents a dynamic value; the first fundamental frequency F0 alone, which represents a static value, may be used, or both may be used together. The voice data of the target speaker may then be generated by inputting the second fundamental frequency F1, together with the time-series information of the second phoneme posterior probability, to the feature amount conversion unit 28 as input data.
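The fundamental frequency variants described above can be sketched as follows. This is a rough illustration under stated assumptions, not the method of the embodiment: all function names are hypothetical, every frame is assumed to be voiced, the "amount of change based on ΔF0" is assumed to be the accumulated frame-to-frame slope, the target speaker's fundamental frequency is represented by a single base value, and the μ-law input is assumed to be already scaled to [-1, 1].

```python
import math

def slope(f0):
    """Frame-to-frame slope (delta F0); the first frame has no predecessor, so it is 0."""
    return [0.0] + [f0[t] - f0[t - 1] for t in range(1, len(f0))]

def second_f0(f0_source, target_base_f0):
    """F1[t] = target base F0 + accumulated slope of the source F0 (assumed rule)."""
    out, offset = [], 0.0
    for d in slope(f0_source):
        offset += d
        out.append(target_base_f0 + offset)
    return out

def standardize(f0):
    """Shift and scale the contour so that its mean is 0 and its variance is 1."""
    mean = sum(f0) / len(f0)
    var = sum((x - mean) ** 2 for x in f0) / len(f0)
    std = math.sqrt(var) or 1.0  # avoid division by zero for a flat contour
    return [(x - mean) / std for x in f0]

def mu_law_quantize(x, mu=255, levels=256):
    """Quantize x in [-1, 1] to an integer index using mu-law companding."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1.0) / 2.0 * (levels - 1) + 0.5)

def cdf_quantize(f0, num_bins):
    """Assign each value to a bin by its rank, i.e. by the empirical CDF (equal-mass bins)."""
    order = sorted(range(len(f0)), key=lambda i: f0[i])
    bins = [0] * len(f0)
    for rank, i in enumerate(order):
        bins[i] = min(rank * num_bins // len(f0), num_bins - 1)
    return bins

# Hypothetical contour in Hz, for illustration only.
print(second_f0([100.0, 110.0, 105.0], 200.0))   # [200.0, 210.0, 205.0]
print(cdf_quantize([300.0, 100.0, 200.0, 400.0], 2))  # [1, 0, 0, 1]
```

Rank-based binning gives each bin roughly the same number of frames, which is one way to read "partitioning based on the cumulative distribution function"; a real system would also need a rule for unvoiced frames, which the text above does not address.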

The voice generation unit 30 converts the voice data generated by the feature amount conversion unit 28 into voice and outputs it. Because the feature amount conversion unit 28 has been trained, through generative adversarial networks (GANs) with the speaker matching identification unit 32, to convert the utterance information extracted by the phoneme conversion unit 24 into voice data having the voice quality of the target speaker, the voice generated by the voice generation unit 30 has the voice quality of the target speaker.

As described above, the voice processing device 100 of the present embodiment makes it possible to provide a voice processing device and a voice processing program that appropriately convert voice uttered by an arbitrary speaker into the sound quality of voice uttered by a target speaker.

10 processing unit, 12 storage unit, 14 input unit, 16 output unit, 18 communication unit, 20 voice analysis unit, 22 inflection extraction unit, 24 phoneme conversion unit, 26 normalization unit, 28 feature amount conversion unit, 30 voice generation unit, 100 voice processing device, 102 network.

Claims

A voice processing program that causes a computer to execute:
a first step of receiving a first voice;
a second step of extracting a first phoneme posterior probability based on the first voice;
a third step of generating a second phoneme posterior probability by normalizing the first phoneme posterior probability at each time point;
a fourth step of generating a feature amount of the first voice based on the second phoneme posterior probability; and
a fifth step of generating a second voice based on the feature amount.

The voice processing program according to claim 1,
wherein the third step generates the second phoneme posterior probability such that, at each time point, at least one of the maximum value, the average value, and the variance value of the phoneme probabilities of the first phoneme posterior probability becomes a constant value.

The voice processing program according to claim 1 or 2,
wherein the feature amount is a spectral envelope.

The voice processing program according to any one of claims 1 to 3,
further comprising a sixth step of extracting a first fundamental frequency from the first voice,
wherein the fourth step generates the feature amount based on the second phoneme posterior probability and the first fundamental frequency.

The voice processing program according to claim 4,
wherein a second fundamental frequency is calculated from the first fundamental frequency based on the slope of the first fundamental frequency, and the feature amount is generated based on the second phoneme posterior probability and the second fundamental frequency.

A voice processing device comprising:
a voice acquisition means for receiving a first voice;
a phoneme posterior probability extraction means for extracting a first phoneme posterior probability based on the first voice;
a normalization means for generating a second phoneme posterior probability by normalizing the first phoneme posterior probability at each time point;
a feature amount generation means for generating a feature amount of the first voice based on the second phoneme posterior probability; and
a voice generation means for generating a second voice based on the feature amount.

A voice processing method comprising:
a first step of receiving a first voice;
a second step of extracting a first phoneme posterior probability based on the first voice;
a third step of generating a second phoneme posterior probability by normalizing the first phoneme posterior probability at each time point;
a fourth step of generating a feature amount of the first voice based on the second phoneme posterior probability; and
a fifth step of generating a second voice based on the feature amount.