JP2015212845A

JP2015212845A - Voice processing device, voice processing method, and filter produced by voice processing method

Info

Publication number: JP2015212845A
Application number: JP2015164768A
Authority: JP
Inventors: 大和大谷; Yamato Otani; 正統田村; Masanori Tamura; 眞弘森田; Shinko Morita
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2015-08-24
Filing date: 2015-08-24
Publication date: 2015-11-26

Abstract

PROBLEM TO BE SOLVED: To implement a voice processing device capable of appropriately controlling filter characteristics when emphasizing a voice.

SOLUTION: The voice processing device includes: histogram calculation means which calculates a first histogram from a first voice feature quantity extracted from voice data and calculates a second histogram from a second voice feature quantity different from the first voice feature quantity; cumulative frequency calculation means which calculates a first cumulative frequency by accumulating the frequencies in the first histogram and a second cumulative frequency by accumulating the frequencies in the second histogram; and filter generation means which generates, on the basis of the first and second cumulative frequencies, a filter whose characteristics bring the second cumulative frequency closer to the first cumulative frequency.

Description

Embodiments of the present invention relate to a voice processing device, a voice processing method, and a filter created by the voice processing method.

Speech waveforms synthesized by speech synthesis techniques tend to sound muffled compared with actual human speech. To address this, it has been proposed to apply a filter to the speech feature quantities before they are converted into a speech waveform, thereby emphasizing the unevenness (the peaks and valleys) of the speech spectrum.

Conventionally, in processing that emphasizes the unevenness of the speech spectrum, the filter correction amount between the input LSP coefficients and LSP coefficients having a flat frequency characteristic has been determined using two sets of interpolation functions set by the user.

In the method described above, however, the filter characteristics used to enhance the speech are adjusted through interpolation functions set by the user. As a result, the filter characteristics for emphasizing the unevenness of the speech spectrum could not be controlled appropriately.

Japanese Patent Laid-Open No. 9-230869

Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” Proc. of ICASSP, June 2000, p.1315-1318.
Tomoki Toda, Alan W. Black, Keiichi Tokuda, “Voice conversion based on maximum likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech and Language Processing, Nov. 2007, Vol.15, No.8, p.2222-2235.

The problem to be solved by the invention is to realize a speech processing apparatus that can appropriately control the filter characteristics used when enhancing speech.

The speech processing apparatus of the embodiment includes: histogram calculation means that calculates a first histogram from a first speech feature quantity extracted from speech data and calculates a second histogram from a second speech feature quantity different from the first speech feature quantity; cumulative frequency calculation means that calculates a first cumulative frequency by accumulating the frequencies of the first histogram and a second cumulative frequency by accumulating the frequencies of the second histogram; and filter creation means that creates, based on the first and second cumulative frequencies, a filter having the characteristic of bringing the second cumulative frequency closer to the first cumulative frequency.

FIG. 1 is a block diagram showing the speech processing apparatus of the first embodiment.
FIG. 2 is a flowchart of the speech processing apparatus of the embodiment (filter creation unit).
FIG. 3 is a diagram showing the first normalized cumulative frequency distribution of the embodiment.
FIG. 4 is a flowchart of the speech processing apparatus of the embodiment (speech synthesis unit).
FIG. 5 is a diagram showing the first and second normalized cumulative frequency distributions of the embodiment.
FIG. 6 is a diagram showing the normalized cumulative frequency distributions of the first, third, and fourth speech feature quantities of the embodiment.
FIG. 7 is a diagram showing the spectra of speech waveforms of the embodiment.
FIG. 8 is a block diagram showing the speech processing apparatus of Modification 1.
FIG. 9 is a block diagram showing the speech processing apparatus of Modification 3.

Embodiments of the present invention are described below with reference to the drawings.

(First embodiment)
The speech processing apparatus of the first embodiment assumes speech synthesis, in which a speech waveform is generated from arbitrary text. Its purpose is to bring the sound quality of the artificial speech waveform generated by speech synthesis closer to that of target real speech data by using a filter to emphasize the unevenness of the speech spectrum. Here, the filter for emphasizing the unevenness of the speech spectrum is created offline, and a speech waveform that reads out arbitrary text is generated online using this filter.

In the offline processing that creates the filter, first and second histograms are calculated, respectively, from a first speech feature quantity extracted from the target real speech data and a second speech feature quantity generated using the context information of this real speech data and a speech synthesis dictionary. A filter is then created based on a first cumulative frequency, calculated by accumulating the frequencies of the first histogram, and a second cumulative frequency, calculated by accumulating the frequencies of the second histogram. The speech processing apparatus of this embodiment creates the filter not through manual adjustment by the user but on the criterion of bringing the second cumulative frequency closer to the first cumulative frequency obtained from the target real speech data. This allows the filter characteristics to be controlled appropriately.

In the online processing that generates a speech waveform for arbitrary text, the text to be read out is analyzed and a third speech feature quantity for speech synthesis is generated using the speech synthesis dictionary. This third speech feature quantity is then converted into a fourth speech feature quantity using the filter created in the offline processing. Finally, a speech waveform in which the unevenness of the speech spectrum is emphasized is generated from the fourth speech feature quantity.

In this embodiment, the third speech feature quantity for speech synthesis is extracted by the same method as the second speech feature quantity generated during filter creation. Therefore, converting the third speech feature quantity into the fourth speech feature quantity with a filter created on the criterion of bringing the second cumulative frequency closer to the first cumulative frequency brings the cumulative frequency of the fourth speech feature quantity itself closer to the first cumulative frequency. Closer cumulative frequencies mean closer spectral characteristics of the speech feature quantities. As a result, the sound quality of the artificial speech waveform generated from the fourth speech feature quantity can be brought closer to that of the target real speech data.

(Block configuration)
FIG. 1 is a block diagram showing the speech processing apparatus according to the first embodiment. The speech processing apparatus of this embodiment generates a speech waveform from arbitrary text using hidden Markov models. It includes a filter creation unit 101 that creates a filter offline and a speech synthesis unit 102 that synthesizes a speech waveform online using the created filter.

The filter creation unit 101 includes: a first feature quantity extraction unit 103 that extracts a spectrum-related first speech feature quantity from the real speech data stored in a speech data storage unit 111; a first histogram calculation unit 104 that calculates a first histogram from the first speech feature quantity; a first cumulative frequency calculation unit 105 that calculates a first cumulative frequency from the first histogram; a second feature quantity extraction unit 107 that generates a spectrum-related second speech feature quantity using the context information stored in the speech data storage unit 111 and the hidden Markov models stored in a speech synthesis dictionary 106; a second histogram calculation unit 108 that calculates a second histogram from the second speech feature quantity; a second cumulative frequency calculation unit 109 that calculates a second cumulative frequency from the second histogram; and a filter creation processing unit 110 that creates, based on the first and second cumulative frequencies, a filter for converting a third speech feature quantity into a fourth speech feature quantity.

The speech data storage unit 111 stores the real speech data that serves as the target when designing the filter, together with the context information of this real speech data. The context information is linguistic information about the utterance content of the real speech data, such as phonological information, position in the sentence, part of speech, and dependency targets. The speech synthesis dictionary 106 stores the hidden Markov models used when the second feature quantity extraction unit 107 and the third feature quantity extraction unit 113 generate speech feature quantities.

The speech synthesis unit 102 includes: a text analysis unit 112 that analyzes a first text to be read out and extracts context information; a third feature quantity extraction unit 113 that generates a spectrum-related third speech feature quantity using the context information and the hidden Markov models of the speech synthesis dictionary 106; a feature quantity conversion unit 114 that converts the third speech feature quantity into a fourth speech feature quantity using the filter created by the filter creation unit 101; a sound source feature quantity extraction unit 115 that generates feature quantities relating to the sound source (sound source feature quantities) using the context information and the hidden Markov models of the speech synthesis dictionary 106; and a waveform generation unit 116 that generates a speech waveform from the fourth speech feature quantity and the sound source feature quantities.

(Flowchart: filter creation unit)
FIG. 2 is a flowchart of the offline filter creation in the speech processing apparatus according to this embodiment. First, in step S1, the first feature quantity extraction unit 103 acquires the real speech data from the speech data storage unit 111 and divides the acquired speech waveform into frames about 20 to 30 ms long.
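The framing in step S1 can be sketched as follows. This is a minimal illustration only: the text states just that frames are about 20 to 30 ms long, so the 16 kHz sampling rate, the 25 ms frame length, the absence of overlap, and the name `split_into_frames` are all assumptions.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=25):
    """Split a waveform (a sequence of samples) into fixed-length frames.

    Assumptions not taken from the source: 16 kHz sampling, 25 ms frames,
    no overlap between frames, and a trailing partial frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


waveform = [0.0] * 16000  # one second of silence as a stand-in for real speech
frames = split_into_frames(waveform)
print(len(frames), len(frames[0]))
```

Practical analysis front ends usually use overlapping, windowed frames; the non-overlapping split here only keeps the sketch short.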

Next, in step S2, the first feature quantity extraction unit 103 performs acoustic analysis of each frame and extracts the first speech feature quantity. The first speech feature quantity is a spectrum-related feature quantity representing the timbre and phonological information of the speech. For example, a discrete spectrum obtained by Fourier transforming the speech data, LPC coefficients, a cepstrum, a mel-cepstrum, LSP coefficients, or mel-LSP coefficients can be used. This embodiment uses mel-LSP coefficients as the first speech feature quantity. The mel-LSP coefficients are extracted by converting the spectrum obtained by a short-time Fourier transform to the mel scale and then performing LSP analysis.

The first speech feature quantity has D dimensions, and the first speech feature quantity y_n extracted from the n-th frame is expressed by equation (1), where T denotes transposition.

y_n = [y_n(1), y_n(2), ..., y_n(D)]^T    (1)

In step S3, the first histogram calculation unit 104 calculates the first histogram from the first speech feature quantities of N frames in total. Step S3 proceeds as follows. First, the first histogram calculation unit 104 calculates the maximum value y_max(d) and the minimum value y_min(d) for each dimension of the first speech feature quantity (step S201), where d denotes the dimension. It then sets I + 1 classes within the range between these maximum and minimum values (step S202) and calculates the frequency of the first speech feature quantity in each class, obtaining the histogram of each dimension expressed by equation (2) (step S203).
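Steps S201 to S203 can be sketched as below. The text does not say how the classes are laid out between the minimum and maximum, so equal-width classes are an assumption here, as are the function name and the argument `num_classes` (which stands for I + 1).

```python
def histogram_per_dimension(features, num_classes):
    """Return a list of (class_edges, counts), one entry per dimension.

    `features` is a list of N frames, each a list of D values.
    Assumption: the classes are `num_classes` equal-width intervals that
    span [min, max] of each dimension; the source gives no class layout.
    """
    n_dims = len(features[0])
    result = []
    for d in range(n_dims):
        values = [frame[d] for frame in features]
        lo, hi = min(values), max(values)              # step S201
        width = (hi - lo) / num_classes
        if width == 0:                                 # constant dimension
            width = 1.0
        edges = [lo + i * width for i in range(num_classes + 1)]  # step S202
        counts = [0] * num_classes
        for v in values:                               # step S203
            # The maximum value falls into the last class.
            i = min(int((v - lo) / width), num_classes - 1)
            counts[i] += 1
        result.append((edges, counts))
    return result


features = [[0.0], [0.5], [1.0], [2.0], [4.0]]  # five frames, one dimension
edges, counts = histogram_per_dimension(features, num_classes=4)[0]
print(edges, counts)
```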

In step S4, the first cumulative frequency calculation unit 105 calculates the first normalized cumulative frequency. Specifically, it obtains the cumulative frequency by accumulating the frequency of each class of the first histogram (step S204) and normalizes it by dividing by the total number of frames N (step S205). The normalized first cumulative frequency (first normalized cumulative frequency) is expressed by equation (3).

After normalization, the cumulative frequency takes values in the range 0 to 1.
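Steps S204 and S205 amount to a running sum of the class frequencies divided by the frame count. A minimal sketch, assuming the histogram of one dimension is a plain list of class frequencies (the function name is hypothetical):

```python
from itertools import accumulate


def normalized_cumulative_frequency(counts):
    """Accumulate class frequencies (S204) and divide by the total N (S205)."""
    total = sum(counts)
    return [c / total for c in accumulate(counts)]


cdf = normalized_cumulative_frequency([2, 1, 1, 1])
print(cdf)
```

The result is non-decreasing and ends at 1, which is what makes the later lookup by cumulative frequency well defined.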

Next, in step S5, the second feature quantity extraction unit 107 acquires the context information of the speech data stored in the speech data storage unit 111.

In step S6, the second feature quantity extraction unit 107 generates the spectrum-related second speech feature quantity using the context information acquired in step S5 and the hidden Markov models of the speech synthesis dictionary 106. In this embodiment, the second speech feature quantity is a mel-LSP, like the first speech feature quantity. The second speech feature quantity also has D dimensions, and the second speech feature quantity x_m for the m-th frame is expressed by equation (4).

x_m = [x_m(1), x_m(2), ..., x_m(D)]^T    (4)

In step S7, the second histogram is calculated from the second speech feature quantities of M frames in total. Steps S206 to S208 are the same as steps S201 to S203, respectively, so their description is omitted. In step S206, the maximum and minimum values of the first speech feature quantity may be used in place of those of the second speech feature quantity.

In step S8, the normalized second cumulative frequency (second normalized cumulative frequency) expressed by equation (5) is obtained.

Steps S209 and S210 are the same as steps S204 and S205, respectively, so their description is omitted.

Next, in step S9, the filter creation processing unit 110 creates, based on the first and second normalized cumulative frequencies, a filter that converts the third speech feature quantity (described later) into the fourth speech feature quantity. The filter is created on the criterion of bringing the second cumulative frequency closer to the first cumulative frequency calculated from the real speech data.

Step S9 proceeds as follows. First, K normalized cumulative frequencies p_k (0 ≤ k < K) are set (step S211). For example, with K = 11, they are set in increments of 0.1 as in equation (6).

Note that p_k may be set in advance rather than during step S9.
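The example setting of step S211 (K = 11, increments of 0.1) can be written as a small helper. Spacing the levels evenly from 0 to 1 is an assumption read off that example, and `make_levels` is a hypothetical name.

```python
def make_levels(num_levels=11):
    """Return K evenly spaced normalized cumulative frequencies p_0 .. p_{K-1}.

    Assumption: the levels run from 0 to 1 inclusive, which matches the
    example of K = 11 with a step of 0.1.
    """
    return [k / (num_levels - 1) for k in range(num_levels)]


levels = make_levels()
print(levels)
```

Dividing an integer index by K - 1 avoids the rounding drift that repeatedly adding 0.1 would introduce.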

Next, for every p_k (0 ≤ k < K), the class i that satisfies equation (7) is searched for in the first normalized cumulative frequency distribution (step S212).

Similarly, the class j that satisfies equation (8) is searched for in the second normalized cumulative frequency distribution (step S212).

Next, the speech feature quantity value ȳ(p_k, d) corresponding to p_k in the first normalized cumulative frequency distribution is obtained by the linear interpolation of equation (9) (step S213).

Here, i(k) is the class found in step S212, and y(i(k), d) is the speech feature quantity value corresponding to class i(k) in the first normalized cumulative frequency distribution. FIG. 3 shows the relationship between p_k and ȳ(p_k, d) on the first normalized cumulative frequency distribution.

Similarly, the value x̄(p_k, d) corresponding to p_k in the second normalized cumulative frequency distribution is obtained by the linear interpolation of equation (10) (step S213).
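The class search of step S212 and the linear interpolation of step S213 can be sketched for one dimension as follows. The sketch assumes that the searched class i is the one whose normalized cumulative frequency brackets p_k from below, and that the value is interpolated linearly between classes i and i + 1; the function name and arguments are hypothetical. The same routine would serve for both the first and the second distribution.

```python
from bisect import bisect_right


def value_at_level(p, class_values, cdf):
    """Return the feature value whose normalized cumulative frequency is `p`.

    `class_values[i]` is the feature value representing class i and
    `cdf[i]` is its normalized cumulative frequency (non-decreasing).
    Assumption: the searched class i satisfies cdf[i] <= p < cdf[i + 1].
    """
    if p <= cdf[0]:
        return class_values[0]
    if p >= cdf[-1]:
        return class_values[-1]
    i = bisect_right(cdf, p) - 1                  # class search (step S212)
    weight = (p - cdf[i]) / (cdf[i + 1] - cdf[i])
    # Linear interpolation between neighbouring classes (step S213).
    return class_values[i] + weight * (class_values[i + 1] - class_values[i])


class_values = [0.0, 1.0, 2.0, 3.0]
cdf = [0.25, 0.5, 0.75, 1.0]
print(value_at_level(0.625, class_values, cdf))
```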

In step S214, the filter creation processing unit 110 stores the speech feature quantity values calculated in step S213 as the filter. The filter T(d) for the d-th dimension of the feature quantity is expressed by equation (11).

Here, the values of the filter T(d) may be replaced as in equations (12) and (13), using the maximum and minimum values of the first and second speech feature quantities.
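One way to picture the stored filter and the optional replacement is shown below. The layout is an assumption: the filter for one dimension is held as two lists indexed by level k, and the replacement is taken to overwrite the entries for the lowest and highest levels with the minimum and maximum feature values. The text does not reproduce equations (11) to (13), so this is an illustrative reading rather than the stated formulas.

```python
def build_filter(x_levels, y_levels, x_range=None, y_range=None):
    """Return the filter for one dimension as (x_levels, y_levels).

    x_levels[k] / y_levels[k] are the second / first feature values at
    level p_k. Assumption: when a (min, max) range is given, it replaces
    the entries for the lowest and highest level of that feature.
    """
    xs, ys = list(x_levels), list(y_levels)
    if x_range is not None:
        xs[0], xs[-1] = x_range
    if y_range is not None:
        ys[0], ys[-1] = y_range
    return xs, ys


xs, ys = build_filter([0.1, 0.5, 0.9], [0.2, 0.6, 1.2],
                      x_range=(0.0, 1.0), y_range=(0.0, 1.5))
print(xs, ys)
```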

Through the above processing, the speech processing apparatus of this embodiment creates a filter T(d) for each dimension of the speech feature quantity. Using the predetermined normalized cumulative frequencies p_k, the filter T(d) stores the correspondence between the first and second normalized cumulative frequencies. This lets the feature quantity conversion unit 114, described later, use the filter T(d) to perform a conversion that brings the second normalized cumulative frequency closer to the first normalized cumulative frequency.

(Flowchart: speech synthesis unit)
FIG. 4 is a flowchart of how the speech processing apparatus according to this embodiment uses the filter to generate a speech waveform in which the unevenness of the speech spectrum is emphasized. First, in step S41, the text analysis unit 112 analyzes the first text to be read out and extracts context information. The context information includes phoneme information, accent phrase length, part-of-speech information, and the like, and can be extracted by syntactic analysis.

Next, in step S42, the third feature quantity extraction unit 113 generates the third speech feature quantity expressed by equation (14), using the extracted context information and the hidden Markov models of the speech synthesis dictionary 106.

x̃_t = [x̃_t(1), x̃_t(2), ..., x̃_t(D)]^T    (14)

The third speech feature quantity is a spectrum-related feature quantity and, like the first and second speech feature quantities, is a mel-LSP. It is extracted by the same method as the second speech feature quantity.

Next, in step S43, the feature quantity conversion unit 114 converts the third speech feature quantity into the fourth speech feature quantity using the filter T(d) created in the offline processing.

Step S43 proceeds as follows. First, for each dimension of the third speech feature quantity, the feature quantity conversion unit 114 searches for the k(d) that satisfies equation (15) (step S401).

Next, the feature quantity conversion unit 114 converts the third speech feature quantity x̃_t(d) of each dimension into the fourth speech feature quantity ỹ_t(d) (step S402). This conversion is expressed by equation (16).

The operation of equation (16) is explained with reference to FIG. 5. First, in the second normalized cumulative frequency distribution shown in FIG. 5(a), the normalized cumulative frequency p of the third speech feature quantity x̃_t(d) before conversion is obtained by linear interpolation using x̄(p_{k(d)}, d), x̄(p_{k(d)+1}, d), p_{k(d)}, and p_{k(d)+1}. Next, in the first normalized cumulative frequency distribution shown in FIG. 5(b), the converted speech feature quantity ỹ_t(d) corresponding to this normalized cumulative frequency p is obtained by linear interpolation using ȳ(p_{k(d)}, d), ȳ(p_{k(d)+1}, d), p_{k(d)}, and p_{k(d)+1}. Equation (16) combines these two steps.
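The two interpolation steps described for FIG. 5 can be sketched for a single dimension as follows. Because both steps interpolate linearly within the same interval between p_{k(d)} and p_{k(d)+1}, the intermediate cumulative frequency p cancels out, and the conversion reduces to a piecewise-linear map from the x̄ entries to the ȳ entries. The sketch assumes that the x̄ entries are strictly increasing and that the input lies within their range; the function and argument names are hypothetical.

```python
from bisect import bisect_right


def convert_feature(x, x_levels, y_levels):
    """Convert one third-feature value into a fourth-feature value.

    x_levels[k] and y_levels[k] are the filter entries for level p_k in one
    dimension. Assumptions: x_levels is strictly increasing and x lies
    within [x_levels[0], x_levels[-1]].
    """
    if not x_levels[0] <= x <= x_levels[-1]:
        raise ValueError("x is outside the range covered by the filter")
    # Step S401: find k with x_levels[k] <= x <= x_levels[k + 1].
    k = min(bisect_right(x_levels, x) - 1, len(x_levels) - 2)
    # Step S402: the same interpolation weight applies on both distributions,
    # so the intermediate cumulative frequency p is never computed explicitly.
    weight = (x - x_levels[k]) / (x_levels[k + 1] - x_levels[k])
    return y_levels[k] + weight * (y_levels[k + 1] - y_levels[k])


x_levels = [0.0, 1.0, 2.0]
y_levels = [0.0, 2.0, 3.0]
print(convert_feature(0.5, x_levels, y_levels))
```

In this toy filter the first interval stretches values (slope 2) and the second compresses them (slope 0.5), which is how the mapping reshapes the distribution of the synthesized feature toward that of the real speech.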

FIG. 6 shows the normalized cumulative frequency distributions of the third speech feature quantity before and after conversion. The figure shows that the shape of the normalized cumulative frequency distribution calculated from the fourth speech feature quantity ỹ_t(d) approaches the shape of the first normalized cumulative frequency distribution calculated from the real speech data. In other words, the spectral characteristics of the fourth speech feature quantity have come closer to those of the real speech data stored in the speech data storage unit 111. This is because the third speech feature quantity before conversion is extracted by the same method as the second speech feature quantity, and the filter T(d) is designed on the criterion of bringing the second normalized cumulative frequency closer to the first normalized cumulative frequency.

If the third speech feature quantity x̃_t(d) generated in step S42 exceeds the maximum value or falls below the minimum value of the second speech feature quantity, it can either be output without conversion or be replaced with that maximum or minimum value and then converted.
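The two options for out-of-range values can be expressed as a small pre-processing step. The function name, the `mode` argument, and the returned flag are hypothetical choices made for this sketch.

```python
def handle_out_of_range(x, x_min, x_max, mode="clamp"):
    """Return (value, needs_conversion) for a third-feature value `x`.

    mode="clamp":  replace an out-of-range x by the nearest bound, then
                   convert it with the filter as usual.
    mode="bypass": leave an out-of-range x unconverted.
    """
    if x_min <= x <= x_max:
        return x, True
    if mode == "bypass":
        return x, False
    return (x_min if x < x_min else x_max), True


print(handle_out_of_range(1.2, 0.0, 1.0))
print(handle_out_of_range(1.2, 0.0, 1.0, mode="bypass"))
```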

In step S44, the sound source feature quantity extraction unit 115 generates sound source feature quantities using the context information and the hidden Markov models of the speech synthesis dictionary 106. Sound source feature quantities include aperiodic components and the fundamental frequency.

Finally, in step S45, the waveform generation unit 116 generates a speech waveform from the fourth speech feature quantity ỹ_t(d) and the sound source feature quantities. FIG. 7 shows the spectra of the speech waveforms before and after conversion. This figure also shows that conversion using the filter of this embodiment emphasizes the unevenness of the speech spectrum.

(Effects)
As described above, the speech processing apparatus according to this embodiment creates the filter from the first cumulative frequency calculated from the real speech data and the second cumulative frequency calculated using the speech synthesis dictionary, on the criterion of bringing the second cumulative frequency closer to the first cumulative frequency. This allows the filter characteristics to be controlled appropriately.

In addition, because the user does not need to adjust the filter characteristics manually, the speech processing apparatus according to this embodiment reduces the time required to create the filter.

Furthermore, the speech processing apparatus according to this embodiment creates the filter on the criterion of bringing the second cumulative frequency, calculated using the speech synthesis dictionary, closer to the first cumulative frequency, calculated from the real speech data. It then uses this filter to convert the third speech feature quantity for speech synthesis into the fourth speech feature quantity. This brings the sound quality of the speech waveform generated from the fourth speech feature quantity closer to that of the real speech data.

(Modification 1)
In this embodiment, two histogram calculation units are provided, the first histogram calculation unit 104 and the second histogram calculation unit 108, but they can be combined into one. The same applies to the first cumulative frequency calculation unit 105 and the second cumulative frequency calculation unit 109.

In this embodiment, spectrum-related mel-LSPs are used as the first to third speech feature quantities. Aperiodic components, which represent the degree of periodicity and aperiodicity contained in the speech, or the fundamental frequency, which represents the pitch of the voice, can also be used as speech feature quantities. Changes of a feature quantity in the time direction, the degree of change in the frequency direction, differences between dimensions of a feature quantity, or logarithmic values may also be used.

As shown in FIG. 8, the second feature amount extraction unit 107 may extract the second speech feature amount using the context information extracted by the text analysis unit 112. In this case, the second and third speech feature amounts are identical, and the filter creation unit 101 creates a filter T(d) for each text to be read out. This makes it possible to create the optimum filter for each text.

In the present embodiment, the cumulative frequencies are normalized, but the filter can also be created without normalization.
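As a small illustration of what normalization changes, the sketch below accumulates two hypothetical histograms that have the same shape but different total counts. The bin counts and the function name are made up for the example.

```python
from itertools import accumulate


def cumulative(counts, normalize=True):
    """Accumulate histogram counts; optionally divide by the total count."""
    cum = list(accumulate(counts))
    if normalize:
        total = cum[-1]
        cum = [c / total for c in cum]
    return cum


real_counts = [10, 30, 40, 20]   # hypothetical histogram of 100 real-speech frames
synth_counts = [1, 3, 4, 2]      # hypothetical histogram of 10 synthesized frames

raw_real = cumulative(real_counts, normalize=False)    # [10, 40, 80, 100]
raw_synth = cumulative(synth_counts, normalize=False)  # [1, 4, 8, 10]
norm_real = cumulative(real_counts)
norm_synth = cumulative(synth_counts)
```

After normalization both curves run from 0 to 1 and can be compared level by level. Without it, the raw cumulative counts are directly comparable only when the two feature sets contain the same number of samples.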

The feature amount conversion unit 114 may also apply the filter to specific dimensions only, rather than to all dimensions. For example, if the speech feature amount has 50 dimensions in total, dimensions 1 to 30 can be converted using the filter T(d) while the remaining dimensions 31 to 50 are left unconverted.
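The 50-dimension example can be sketched as below. The per-dimension filters are represented as plain callables, and the doubling function is only a stand-in for T(d); both are assumptions made for illustration.

```python
def convert_frame(frame, filters, num_filtered_dims):
    """Apply filters[d] to the first `num_filtered_dims` dimensions of one
    feature frame and pass the remaining dimensions through unchanged."""
    return [
        filters[d](x) if d < num_filtered_dims else x
        for d, x in enumerate(frame)
    ]


# Hypothetical 50-dimensional frame and placeholder filters.
frame = [float(d) for d in range(50)]
filters = [lambda x: 2.0 * x] * 50     # stand-in for T(d), not the real filter
converted = convert_frame(frame, filters, num_filtered_dims=30)
```

The list index `d` is zero-based here, so indices 0 to 29 correspond to dimensions 1 to 30 in the text.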

(Modification 2)
In the filter creation processing unit 110, the coefficients $\hat{a}_d$ and $\hat{b}_d$ that satisfy Equation (17) can be used as the d-th dimension filter T(d) that brings the second normalized cumulative frequency distribution closer to the first normalized cumulative frequency distribution.

Solving Equation (17) yields Equation (18).

The feature amount conversion unit 114 uses Equation (19) to convert the third speech feature amount $\tilde{x}_t(d)$ of each dimension into the fourth speech feature amount $\tilde{y}_t(d)$.
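Equations (17) to (19) are not reproduced in this text, so their exact form cannot be confirmed here. Purely as an illustration, the sketch below assumes that the two coefficients define an affine map y = a_d x + b_d, fitted by ordinary least squares to pairs of feature values that share the same normalized cumulative frequency in the second and first distributions. The fitting criterion, the function names, and the value pairs are all assumptions.

```python
def fit_affine(pairs):
    """Ordinary least-squares fit of y = a * x + b to (x, y) pairs.
    Assumed stand-in for solving Equation (17); not taken from the source."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def convert(x, a, b):
    """Assumed form of the conversion: an affine map of one feature value."""
    return a * x + b


# Hypothetical (synthetic value, real value) pairs at equal cumulative levels.
pairs = [(4.5, 2.5), (5.0, 4.0), (5.5, 5.5), (6.0, 7.0)]
a_d, b_d = fit_affine(pairs)
```

A two-coefficient filter of this kind is more compact than a lookup table, at the cost of being unable to follow a nonlinear relationship between the two distributions.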

(Modification 3)
The present embodiment has described speech enhancement in text-to-speech synthesis, but speech enhancement can also be used for other purposes. FIG. 9 is a block diagram of a speech processing apparatus that has a function of converting the voice quality of input speech data. The purpose of this apparatus is to bring the voice quality of the pre-conversion speech data input to the voice quality conversion unit 121 closer to the voice quality of the actual speech data stored in the speech data storage unit 111. For example, if the user's actual speech data is stored in the speech data storage unit 111, the voice quality of an arbitrary speech waveform input to the voice quality conversion unit 121 can be converted so that it approaches the user's voice quality.

This speech processing apparatus includes a voice quality conversion unit 121 that converts the voice quality of speech data. The second feature amount extraction unit 117 and the third feature amount extraction unit 118 extract the second and third speech feature amounts, respectively, from the speech data. The voice quality conversion processing unit 119 converts the voice quality of the third speech feature amount using the voice quality conversion filter 125, which is a filter for converting voice quality. The feature amount conversion unit 114 converts the third speech feature amount after voice quality conversion into a fourth speech feature amount in which the peaks and valleys of the speech spectrum are emphasized by the filter T(d).

In this modification, the second feature amount extraction unit 117 and the third feature amount extraction unit 118 extract speech feature amounts by the same method. The voice quality conversion processing unit 124 and the voice quality conversion processing unit 119 also convert voice quality by the same method, so the speech feature amount input to the second histogram calculation unit 108 and the speech feature amount input to the feature amount conversion unit 114 are identical. The filter T(d) is created on the criterion of bringing the cumulative frequency of the second speech feature amount, whose voice quality has been converted by the voice quality conversion processing unit 124, closer to the first cumulative frequency calculated from the actual speech data. Conversion using this filter T(d) brings the sound quality of the speech waveform generated from the fourth speech feature amount closer to that of the actual speech data.

As described above, the speech enhancement processing described in the present embodiment can be applied not only to speech synthesis but also to speech feature amounts used in voice quality conversion, speech coding, and the like.

Note that some or all of the functions of the present embodiment described above can be realized by software processing.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

101, 122, 123 Filter creation unit
102 Speech synthesis unit
103 First feature amount extraction unit
104 First histogram calculation unit
105 First cumulative frequency calculation unit
106 Speech synthesis dictionary
107, 117 Second feature amount extraction unit
108 Second histogram calculation unit
109 Second cumulative frequency calculation unit
110 Filter creation processing unit
111 Speech data storage unit
112 Text analysis unit
113, 118 Third feature amount extraction unit
114 Feature amount conversion unit
115, 120 Sound source feature amount extraction unit
116 Waveform generation unit
119, 124 Voice quality conversion processing unit
121 Voice quality conversion unit
125 Voice quality conversion filter

Claims

A speech processing apparatus comprising:
histogram calculating means for calculating a first histogram from a first speech feature amount related to a spectrum extracted from actual speech data that serves as the target when designing a filter, and for calculating a second histogram from a second speech feature amount related to a spectrum generated using context information of the actual speech data and a speech synthesis dictionary;
cumulative frequency calculation means for calculating a first cumulative frequency obtained by accumulating the frequencies of the first histogram and a second cumulative frequency obtained by accumulating the frequencies of the second histogram;
filter creating means for creating, based on the first and second cumulative frequencies, a filter having a characteristic that brings the second cumulative frequency closer to the first cumulative frequency; and
feature amount conversion means for converting, using the filter created by the filter creating means, a third speech feature amount related to a spectrum generated using context information of text to be read out and the speech synthesis dictionary into a fourth speech feature amount.

A speech processing apparatus comprising:
histogram calculating means for calculating a first histogram from a first speech feature amount related to a spectrum extracted from actual speech data that serves as the target when designing a filter, and for calculating a second histogram from a second speech feature amount related to a spectrum generated using context information of text to be read out and a speech synthesis dictionary;
cumulative frequency calculation means for calculating a first cumulative frequency obtained by accumulating the frequencies of the first histogram and a second cumulative frequency obtained by accumulating the frequencies of the second histogram;
filter creating means for creating, based on the first and second cumulative frequencies, a filter having a characteristic that brings the second cumulative frequency closer to the first cumulative frequency; and
feature amount conversion means for converting, using the filter created by the filter creating means, a third speech feature amount related to a spectrum generated using the context information of the text to be read out and the speech synthesis dictionary into a fourth speech feature amount.

The speech processing apparatus according to claim 1, wherein the filter creating means sets a predetermined value within the range of the first and second cumulative frequencies, and creates the filter using the value of the speech feature amount at which the cumulative frequency equals the predetermined value in the distribution of the first cumulative frequency and the value of the speech feature amount at which the cumulative frequency equals the predetermined value in the distribution of the second cumulative frequency.

The speech processing apparatus according to claim 1 or 2, wherein the first and second cumulative frequencies calculated by the cumulative frequency calculation means are normalized by the total number of the first speech feature amounts and the total number of the second speech feature amounts, respectively.

3. The speech processing apparatus according to claim 1, wherein the first to third speech feature values are any one of a spectrum envelope, a parameter indicating the spectrum envelope, and a parameter indicating the periodicity / non-periodicity of the speech.

A speech processing method comprising:
calculating a first histogram from a first speech feature amount related to a spectrum extracted from actual speech data that serves as the target when designing a filter, and calculating a second histogram from a second speech feature amount related to a spectrum generated using context information of the actual speech data and a speech synthesis dictionary;
calculating a first cumulative frequency obtained by accumulating the frequencies of the first histogram and a second cumulative frequency obtained by accumulating the frequencies of the second histogram;
creating, based on the first and second cumulative frequencies, a filter having a characteristic that brings the second cumulative frequency closer to the first cumulative frequency; and
converting, using the created filter, a third speech feature amount related to a spectrum generated using context information of text to be read out and the speech synthesis dictionary into a fourth speech feature amount.

A speech processing method comprising:
calculating a first histogram from a first speech feature amount related to a spectrum extracted from actual speech data that serves as the target when designing a filter, and calculating a second histogram from a second speech feature amount related to a spectrum generated using context information of text to be read out and a speech synthesis dictionary;
calculating a first cumulative frequency obtained by accumulating the frequencies of the first histogram and a second cumulative frequency obtained by accumulating the frequencies of the second histogram;
creating, based on the first and second cumulative frequencies, a filter having a characteristic that brings the second cumulative frequency closer to the first cumulative frequency; and
converting, using the created filter, a third speech feature amount related to a spectrum generated using the context information of the text to be read out and the speech synthesis dictionary into a fourth speech feature amount.