JP2016151736A

JP2016151736A - Speech processing device and program

Info

Publication number: JP2016151736A
Application number: JP2015030459A
Authority: JP
Inventors: Nobumasa Seiyama; Reiko Saito; Atsushi Imai; Toru Tsugi
Original assignee: Nippon Hoso Kyokai NHK; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2015-02-19
Filing date: 2015-02-19
Publication date: 2016-08-22

Abstract

PROBLEM TO BE SOLVED: To perform style conversion on speech easily and accurately.
SOLUTION: A speech processing device obtains a first acoustic feature generation value, which is an acoustic feature quantity for each frame of a time series, based on the linguistic feature quantity of a text and a statistical model of acoustic feature quantities generated for a first style. It likewise obtains a second acoustic feature generation value, which is an acoustic feature quantity for each frame of a time series, based on the linguistic feature quantity of the text and a statistical model of acoustic feature quantities generated for a second style. The device associates the frames of the first acoustic feature generation value with the frames of the second acoustic feature generation value, and generates processing information from the difference between the two values for each pair of associated frames. The device then associates the frames of the acoustic feature quantity of speech data, produced by reading the text aloud, with the frames of the first acoustic feature generation value, and processes the acoustic feature quantity of each frame of the speech data based on the processing information generated from the first acoustic feature generation value of the corresponding frame.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech processing device and a program.

In recent years, methods for synthesizing speech from text using statistical models have been developed (see, for example, Non-Patent Document 1). Research based on this approach is being actively pursued, and methods for synthesizing speech with a given style have also been proposed (see, for example, Non-Patent Documents 2, 3, and 4).

Separately, methods for converting acoustic feature quantities of speech, such as pitch and spectrum, have been proposed (see, for example, Patent Documents 1 and 2 and Non-Patent Document 5). Methods for controlling emotion, as one aspect of speech style, have also been proposed. In one such method, the same sentence is read aloud both without emotion and with emotion, the two readings are acoustically analyzed to obtain pitch, power, and speaking rate, and the analysis results are inspected and compared by hand to create a simple conversion rule for each quantity. The conversion rules are then applied to another emotionless utterance to add emotion (see, for example, Non-Patent Document 6).

Another method for controlling emotion is as follows. In advance, readings of the same sentence as an emotionless utterance and as an emotional utterance are acoustically analyzed to obtain time series of acoustic feature quantities such as pitch and spectrum, and the correspondence between the two time series is obtained. Based on this correspondence, a neural network is trained that takes as input the time series of acoustic feature quantities of the emotionless utterance together with the emotion type, and outputs the time series of acoustic feature quantities with emotion. Inputting to the trained network the time series of acoustic feature quantities obtained by acoustic analysis of an emotionless utterance of an arbitrary sentence, together with the desired emotion type, yields a new time series of acoustic feature quantities. Synthesizing speech from this time series produces an utterance with the desired emotion (see, for example, Patent Document 3).
As yet another method, there is a technique that learns, for each vowel, the spectral change between calm speech and emotional speech and applies the learned vowel spectral change to the calm speech of an arbitrary utterance (see, for example, Non-Patent Document 7).

Patent Document 1: Japanese Patent No. 2612867
Patent Document 2: Japanese Patent No. 2612869
Patent Document 3: JP-A-7-72900

Non-Patent Document 1: Keiichi Tokuda and 4 others, "Speech parameter generation algorithms for HMM-based speech synthesis", in Proc. ICASSP, IEEE, 2000
Non-Patent Document 2: Makoto Tachibana and 3 others, "Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing", IEICE Transactions on Information and Systems, E88-D(11), p.2484-2491, 2005
Non-Patent Document 3: Takashi Nose and 3 others, "A style Control Technique for HMM-Based Expressive Speech Synthesis", IEICE Transactions on Information and Systems, E90-D(9), p.1406-1413, 2007
Non-Patent Document 4: Yamato Otani and 5 others, "Study on how to give emotions to arbitrary speakers based on addition model in HMM speech synthesis" (in Japanese), Proceedings of the Acoustical Society of Japan, 2-7-2, p.233-236, 2014
Non-Patent Document 5: Hideki Kawahara and 2 others, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds", Speech Communication, 27(3), 1999
Non-Patent Document 6: Yoshinori Kitahara and 1 other, "Prosodic Control to Express Emotions for Man-Machine Speech Interaction", IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Vol.E75-A, No.2, p.155-163, 1992
Non-Patent Document 7: Atsushi Imai and 2 others, "Examination of emotional speech processing method based on spectrum change of vowels" (in Japanese), Proceedings of the 2014 Annual Convention of the Institute of Image Information and Television Engineers, 2014

The techniques of Non-Patent Documents 1 to 4 are all methods for synthesizing speech from text; they do not convert speech.
Patent Documents 1 and 2 and Non-Patent Document 5 concern basic techniques for converting acoustic feature quantities such as pitch and spectrum; to convert speech into a desired style, target values must be supplied by some other means.
The technique of Non-Patent Document 6 controls speech processing through simple, manually created rules, so it is difficult to adequately control acoustic feature quantities that vary in a complex manner over time. In addition, it controls only prosody (pitch, power, and speaking rate) and cannot control the spectrum.
The technique of Patent Document 3 uses a neural network to learn emotion-related parameters, which requires an enormous amount of training data and training time.
The technique of Non-Patent Document 7 converts calm speech into emotional speech by processing the spectrum of vowels, which strongly affect how speech is heard, but it applies no such processing to consonants.

The present invention has been made in view of these circumstances and provides a speech processing device and a program that can perform speech style conversion easily and accurately.

One aspect of the present invention is a speech processing device comprising: a language analysis unit that acquires the linguistic feature quantity of the text indicated by text data; a first acoustic feature generation unit that generates acoustic feature quantities in time-series frame units based on the linguistic feature quantity acquired by the language analysis unit and a statistical model of acoustic feature quantities generated from speech data of utterances in a first style; a second acoustic feature generation unit that generates acoustic feature quantities in time-series frame units based on the linguistic feature quantity acquired by the language analysis unit and a statistical model of acoustic feature quantities generated from speech data of utterances in a second style; a processing information generation unit that associates frames of a first acoustic feature generation value, which is the acoustic feature quantity generated by the first acoustic feature generation unit, with frames of a second acoustic feature generation value, which is the acoustic feature quantity generated by the second acoustic feature generation unit, based on the similarity between the two values, and generates processing information from the difference between the first and second acoustic feature generation values for each pair of associated frames; an acoustic analysis unit that acquires acoustic feature quantities in time-series frame units from speech data of the text indicated by the text data; and a speech processing unit that associates frames of the acoustic feature quantity acquired by the acoustic analysis unit with frames of the first acoustic feature generation value based on their similarity, and processes the acoustic feature quantity of each frame based on the processing information that the processing information generation unit generated using the first acoustic feature generation value of the corresponding frame.
According to this invention, the speech processing device obtains the first acoustic feature generation value, an acoustic feature quantity in time-series frame units, based on the linguistic feature quantity of the text of the original speech and the statistical model of acoustic feature quantities generated for the first style. It likewise obtains the second acoustic feature generation value based on the linguistic feature quantity of the text of the original speech and the statistical model of acoustic feature quantities generated for the second style. The device associates the frames of the first acoustic feature generation value with the frames of the second acoustic feature generation value according to the similarity of their values, and generates processing information from the difference between the two values for each pair of associated frames. The device acquires acoustic feature quantities in time-series frame units from the speech data of the original speech, and associates the frames of the acoustic feature quantity of the original speech with the frames of the first acoustic feature generation value based on the similarity of their values. The device then processes the acoustic feature quantity of each frame of the original speech based on the processing information generated using the first acoustic feature generation value of the corresponding frame.
As a result, the speech processing device easily converts the style of the original speech while preserving its phonemic quality and naturalness.
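
The frame association and difference step described in this aspect can be illustrated with a minimal sketch. It assumes dynamic time warping (DTW) with Euclidean distance as the similarity-based alignment, takes the difference as second minus first, and averages the differences when several second-style frames map to one first-style frame. The text here specifies none of these choices, and the names `dtw_path` and `processing_info` are hypothetical.

```python
from math import sqrt


def dtw_path(a, b):
    """Align two sequences of feature vectors with DTW (an assumed alignment method).

    Returns a list of (i, j) index pairs covering every frame of a and b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sqrt(sum((p - q) ** 2 for p, q in zip(a[i - 1], b[j - 1])))
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def processing_info(first, second):
    """Per-frame difference (second - first) for each frame of `first`.

    Assumption: differences are averaged when several frames of `second`
    are aligned to the same frame of `first`.
    """
    dim = len(first[0])
    sums = [[0.0] * dim for _ in first]
    counts = [0] * len(first)
    for i, j in dtw_path(first, second):
        for k in range(dim):
            sums[i][k] += second[j][k] - first[i][k]
        counts[i] += 1
    return [[s / c for s in row] for row, c in zip(sums, counts)]


# Toy one-dimensional features: two first-style frames, three second-style frames.
first_style = [[0.0], [10.0]]
second_style = [[1.0], [1.0], [11.0]]
deltas = processing_info(first_style, second_style)
```

Indexing the result by first-style frame keeps it directly usable once the original speech has been aligned to the first acoustic feature generation value.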

One aspect of the present invention is the speech processing device described above, wherein the acoustic feature quantity includes at least one of information on pitch and information on the frequency spectrum.
According to this invention, the speech processing device changes the style by processing one or both of the pitch and the frequency spectrum of the original speech.
As a result, the speech processing device can change the pitch of the original speech and convert the style by altering intonation and accent while preserving the phonemic quality and naturalness of the original speech. It can also change the frequency spectrum of the original speech and convert the style by altering voice quality, again while preserving phonemic quality and naturalness. Alternatively, it can change both the pitch and the frequency spectrum, altering intonation, accent, and voice quality to convert the style of the original speech while preserving its phonemic quality and naturalness.

One aspect of the present invention is the speech processing device described above, wherein the style of the speech data of the text indicated by the text data is the first style.
According to this invention, the speech processing device uses a statistical model generated from utterances in the same style as the original speech and a statistical model generated from utterances in the desired style to generate the first and second acoustic feature generation values from the text of the original speech, and generates processing information from their difference.
As a result, the speech processing device can convert the original speech into the desired style with high accuracy.

One aspect of the present invention is the speech processing device described above, wherein the speech processing unit synthesizes speech data based on the processed acoustic feature quantity.
According to this invention, the speech processing device synthesizes speech from acoustic feature quantities that have been processed for style conversion.
As a result, the speech processing device can output speech generated by converting the style of the original speech.

One aspect of the present invention is a program for causing a computer to function as a speech processing device comprising: language analysis means for acquiring the linguistic feature quantity of the text indicated by text data; first acoustic feature generation means for generating acoustic feature quantities in time-series frame units based on the linguistic feature quantity acquired by the language analysis means and a statistical model of acoustic feature quantities generated from speech data of utterances in a first style; second acoustic feature generation means for generating acoustic feature quantities in time-series frame units based on the linguistic feature quantity acquired by the language analysis means and a statistical model of acoustic feature quantities generated from speech data of utterances in a second style; processing information generation means for associating frames of a first acoustic feature generation value, which is the acoustic feature quantity generated by the first acoustic feature generation means, with frames of a second acoustic feature generation value, which is the acoustic feature quantity generated by the second acoustic feature generation means, based on the similarity between the two values, and generating processing information from the difference between the first and second acoustic feature generation values for each pair of associated frames; acoustic analysis means for acquiring acoustic feature quantities in time-series frame units from speech data of the text indicated by the text data; and speech processing means for associating frames of the acoustic feature quantity acquired by the acoustic analysis means with frames of the first acoustic feature generation value based on their similarity, and processing the acoustic feature quantity of each frame based on the processing information that the processing information generation means generated using the first acoustic feature generation value of the corresponding frame.

According to the present invention, speech style conversion can be performed easily and accurately.

FIG. 1 is a functional block diagram of a speech processing device according to one embodiment of the present invention.
FIG. 2 is a diagram for explaining the acoustic feature quantities used by the speech processing device according to the embodiment.
FIG. 3 is a diagram showing the acoustic feature quantities used by the speech processing device according to the embodiment.
FIG. 4 is a diagram showing the linguistic feature quantities used by the speech processing device according to the embodiment.
FIG. 5 is a processing flow showing the learning process in the speech processing device according to the embodiment.
FIG. 6 is a processing flow showing the speech processing in the speech processing device according to the embodiment.
FIG. 7 is a diagram for explaining the process of acquiring acoustic feature quantities from text in the speech processing device according to the embodiment.
FIG. 8 is a diagram for explaining the processing information generation process in the speech processing device according to the embodiment.
FIG. 9 is a diagram for explaining the processing of input speech in the speech processing device according to the embodiment.
FIG. 10 is a diagram showing input speech and the processed speech obtained by style conversion using the speech processing device according to the embodiment.
FIG. 11 is a diagram showing the specifications of a subjective evaluation experiment performed on the speech processing device according to the embodiment.
FIG. 12 is a diagram showing the judged emotions for the learning data used to generate the statistical models used in the subjective evaluation experiment performed on the speech processing device according to the embodiment.
FIG. 13 is a diagram showing the results of the subjective evaluation experiment performed on the speech processing device according to the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
The speech processing device of this embodiment temporarily records input speech, converts its acoustic feature quantities, and outputs it again as speech in a different style. Styles include, for example, emotions such as anger and joy, and manners of speaking such as news-reading, polite, curt, formal, and casual. The device uses statistical models of acoustic feature quantities, created in advance for the style of the input speech and for the desired style, to generate frame-unit acoustic feature quantities for each of the two styles from the text of the input speech. It associates the frames of the acoustic feature quantities generated for the input speech style with the frames of those generated for the desired style and, following this association, calculates the difference values between the acoustic feature quantities of the input speech style and those of the desired style. It also associates the frame-unit acoustic feature quantities obtained from the input speech with the frame-unit acoustic feature quantities generated for the input speech style from the text of the input speech. It then adds, to the acoustic feature quantity of each frame of the input speech, the difference value calculated using the acoustic feature quantity of the corresponding frame, thereby modifying the acoustic feature quantities of the input speech, and outputs speech that reflects the modification. In this way, the device enables style conversion while preserving the phonemic quality and naturalness of the original speech.
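
The final addition step can be sketched as follows, under stated assumptions: the alignment between input frames and generated frames is supplied as a plain list of indices, the difference values are already indexed by generated frame, and the function name `apply_processing` is hypothetical. How the alignment itself is computed is not shown here.

```python
def apply_processing(input_feats, alignment, deltas):
    """Add to each input frame the difference value of its aligned generated frame.

    input_feats: per-frame feature vectors analyzed from the input speech.
    alignment:   alignment[t] is the index of the generated input-style frame
                 matched to input frame t (assumed to be given).
    deltas:      per-generated-frame difference values (desired style minus
                 input style, an assumed sign convention).
    """
    return [
        [x + d for x, d in zip(frame, deltas[alignment[t]])]
        for t, frame in enumerate(input_feats)
    ]


# Toy example: three input frames mapped onto two generated frames.
converted = apply_processing(
    [[100.0], [110.0], [120.0]],
    alignment=[0, 0, 1],
    deltas=[[5.0], [-3.0]],
)
```

Because only a difference is added, frame-level detail that exists in the input but not in the statistical models is carried through unchanged, which is consistent with the stated goal of preserving the character of the original speech.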

FIG. 1 is a functional block diagram showing the configuration of a speech processing device 1 according to one embodiment of the present invention; only the functional blocks relevant to this embodiment are shown. The speech processing device 1 is realized by one or more computer devices. When it is realized by multiple computer devices, which functional unit is realized by which computer device is arbitrary. A single functional unit may also be realized by multiple computer devices. As shown in the figure, the speech processing device 1 comprises a learning unit 2, a storage unit 3, and a speech processing unit 4.

The learning unit 2 comprises a first speech storage unit 21, a first acoustic analysis unit 22, a first learning language analysis unit 23, a first statistical model learning unit 24, a second speech storage unit 25, a second acoustic analysis unit 26, a second learning language analysis unit 27, and a second statistical model learning unit 28. The storage unit 3 comprises a first statistical model storage unit 31 and a second statistical model storage unit 32.

The first speech storage unit 21 stores first learning speech data. The first learning speech data is speech data for learning, obtained by reading aloud the text indicated by first learning text data in the pre-conversion style (first style). The pre-conversion style is the style of the input speech data supplied to the speech processing unit 4, and this input speech data is the speech data to be style-converted. When there are multiple pre-conversion styles, information indicating the style type (for example, "calm") is attached to the first learning speech data.
The first acoustic analysis unit 22 reads the first learning speech data from the first speech storage unit 21 and acquires acoustic feature quantities in time-series frame units from it.
The first learning language analysis unit 23 acquires the linguistic feature quantity of the text indicated by the first learning text data.

The first statistical model learning unit 24 generates a statistical model of the pre-conversion style using the acoustic feature quantities that the first acoustic analysis unit 22 acquired from the first learning speech data and the linguistic feature quantities that the first learning language analysis unit 23 acquired from the first learning text data, and writes the generated model to the first statistical model storage unit 31. When there are multiple types of pre-conversion style, the first statistical model learning unit 24 generates a statistical model for each type. For example, it generates a statistical model for the "calm" style using the acoustic feature quantities of the first learning speech data labeled "calm" and the linguistic feature quantities of the first learning text data. It attaches information indicating the style type to each statistical model generated per style type and writes the model to the first statistical model storage unit 31.

The second speech storage unit 25 stores second learning speech data. The second learning speech data is speech data for learning, obtained by reading aloud the text indicated by second learning text data in the post-conversion style (second style). The second learning text data may be the same as or different from the first learning text data. The post-conversion style is the style of the speech data that is desired as the result of processing in the speech processing unit 4. When there are multiple post-conversion styles, information indicating the style type (for example, "anger", "surprise", or "joy") is attached to the second learning speech data.
The second acoustic analysis unit 26 reads the second learning speech data from the second speech storage unit 25 and acquires acoustic feature quantities in time-series frame units from it.
The second learning language analysis unit 27 acquires the linguistic feature quantity of the text indicated by the second learning text data by the same processing as the first learning language analysis unit 23.

The second statistical model learning unit 28 generates a statistical model of the post-conversion style using the acoustic feature quantities that the second acoustic analysis unit 26 acquired from the second learning speech data and the language feature quantities that the second learning language analysis unit 27 acquired from the second learning text data, and writes the generated statistical model to the second statistical model storage unit 32. When there are multiple post-conversion styles, the second statistical model learning unit 28 generates a statistical model for each style. For example, it generates the statistical model for the style "anger" using the acoustic feature quantities of the second learning speech data labeled "anger" and the language feature quantities of the second learning text data. The second statistical model learning unit 28 attaches information indicating the style type to each statistical model before writing it to the second statistical model storage unit 32.

The statistical models generated by the first statistical model learning unit 24 and the second statistical model learning unit 28 are statistical models of acoustic feature quantities. For example, an acoustic model based on a three-state HMM (Hidden Markov Model) can be used as the statistical model. This acoustic model is created for each unit obtained by clustering phonemes that reflect the language feature quantities with a suitable decision tree (hereinafter, "phoneme clustering unit").

The speech processing unit 4 comprises a speech processing language analysis unit 41 (language analysis unit), a first statistical model selection unit 42, a first acoustic feature quantity generation unit 43, a second statistical model selection unit 44, a second acoustic feature quantity generation unit 45, a processing information generation unit 46, a speech processing acoustic analysis unit 47 (acoustic analysis unit), and a speech processing execution unit 48.

The speech processing language analysis unit 41 acquires the language feature quantities of the text indicated by the input speech text data by the same processing as the first learning language analysis unit 23 and the second learning language analysis unit 27. The input speech text data is text data of the sentences that make up the utterance content of the input speech data.
The first statistical model selection unit 42 reads from the first statistical model storage unit 31 the statistical model corresponding to the style indicated by the pre-conversion style data. The pre-conversion style data indicates the style of the input speech data.
The first acoustic feature quantity generation unit 43 generates time-series, frame-by-frame acoustic feature quantities using the statistical model read by the first statistical model selection unit 42 and the language feature quantities output by the speech processing language analysis unit 41. The generated acoustic feature quantities are referred to as the first acoustic feature quantity generation values.

The second statistical model selection unit 44 reads from the second statistical model storage unit 32 the statistical model corresponding to the style indicated by the post-conversion style data. The post-conversion style data indicates the style of the speech data to be obtained as a result of processing the input speech data.
The second acoustic feature quantity generation unit 45 generates time-series, frame-by-frame acoustic feature quantities using the statistical model read by the second statistical model selection unit 44 and the language feature quantities output by the speech processing language analysis unit 41. The generated acoustic feature quantities are referred to as the second acoustic feature quantity generation values.

The processing information generation unit 46 includes a first corresponding frame detection unit 461 and a processing information calculation unit 462. The first corresponding frame detection unit 461 associates, frame by frame and based on the similarity of their values, the first acoustic feature quantity generation values generated by the first acoustic feature quantity generation unit 43 with the second acoustic feature quantity generation values generated by the second acoustic feature quantity generation unit 45. For each pair of corresponding frames, the processing information calculation unit 462 creates processing information for the acoustic feature quantities based on the difference between the first acoustic feature quantity generation value and the second acoustic feature quantity generation value.

The speech processing acoustic analysis unit 47 acquires the acoustic feature quantities of the input speech data.
The speech processing execution unit 48 includes a second corresponding frame detection unit 481, a processing information addition unit 482, and a speech synthesis unit 483. The second corresponding frame detection unit 481 associates, frame by frame and based on the similarity of their values, the first acoustic feature quantity generation values generated by the first acoustic feature quantity generation unit 43 with the acoustic feature quantities acquired by the speech processing acoustic analysis unit 47. The processing information addition unit 482 processes the acoustic feature quantity of each frame acquired by the speech processing acoustic analysis unit 47 based on the processing information that the processing information generation unit 46 generated from the first acoustic feature quantity generation value of the corresponding frame. The speech synthesis unit 483 synthesizes speech data from the acoustic feature quantities obtained by the processing in the processing information addition unit 482 and outputs it as output speech data.

Existing speech synthesis technology such as HTS (HMM-based speech synthesis system) can be used for the learning of the first and second statistical models in the learning unit 2, for the first and second statistical models produced by that learning, for the language analysis in the speech processing language analysis unit 41, and for the acoustic feature quantity generation in the first acoustic feature quantity generation unit 43 and the second acoustic feature quantity generation unit 45.

FIG. 2 is a diagram for explaining the acoustic feature quantities used in the present embodiment. The figure shows a speech waveform together with the corresponding phoneme notation. From the speech waveform, a pitch (fundamental frequency) and a frequency spectrum (hereinafter, "spectrum") are obtained for each frame. Any conventional technique can be used to obtain the pitch and the spectrum from the speech waveform. In the present embodiment, the frame length is 25 ms (milliseconds) and the frame shift is 5 ms.
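
The framing described above can be illustrated with a minimal sketch. Only the 25 ms frame length and the 5 ms frame shift come from the text; the function name `frame_starts`, the 16 kHz sampling rate in the usage line, and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
def frame_starts(num_samples, sample_rate, frame_ms=25.0, shift_ms=5.0):
    """Return the start sample index of every full analysis frame.

    frame_ms and shift_ms follow the embodiment (25 ms / 5 ms).
    Dropping a trailing partial frame is an assumption of this sketch.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if num_samples < frame_len:
        return []
    return list(range(0, num_samples - frame_len + 1, shift))


# One second of audio at an assumed 16 kHz sampling rate.
starts = frame_starts(num_samples=16000, sample_rate=16000)
print(len(starts), starts[:3])
```

Because the shift is much shorter than the frame length, consecutive frames overlap by 20 ms, so pitch and spectrum are sampled every 5 ms.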

FIG. 3 is a diagram showing the acoustic feature quantities used in the present embodiment. The acoustic feature quantity shown in the figure is 153-dimensional information comprising static and dynamic characteristics, and is of the general kind also used in conventional techniques such as Non-Patent Document 1 and HTS. The static characteristic of a frame is 51-dimensional information consisting of a 1-dimensional pitch and 50-dimensional spectral coefficients obtained from the speech waveform of that frame. The dynamic characteristic consists of the first-order difference of the static characteristic (51 dimensions) and the second-order difference of the static characteristic (51 dimensions). The first-order difference of a frame is the difference between the static characteristic of that frame and the static characteristic of an adjacent frame. The second-order difference of a frame is the difference between the first-order difference of that frame and the first-order difference of an adjacent frame.
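
As a rough illustration of how the 51-dimensional static characteristic expands to 153 dimensions, the sketch below appends first-order and second-order differences to each frame. It assumes the "adjacent frame" is the preceding one and that the first frame's differences are zero; the text does not fix either choice, and toolkits such as HTS may use different difference windows. The function name `add_dynamic_features` is hypothetical.

```python
def add_dynamic_features(static_frames):
    """Append first- and second-order differences to each static frame.

    static_frames: non-empty list of equal-length lists of floats
    (51 values per frame in the embodiment: 1 pitch + 50 spectral coefficients).
    Returns frames three times as long (153 values in the embodiment).
    """
    def diff(frames):
        # Difference with the preceding frame; zeros for the first frame (assumption).
        out = [[0.0] * len(frames[0])]
        for prev, cur in zip(frames, frames[1:]):
            out.append([c - p for c, p in zip(cur, prev)])
        return out

    delta = diff(static_frames)   # first-order difference
    delta2 = diff(delta)          # second-order difference
    return [s + d + dd for s, d, dd in zip(static_frames, delta, delta2)]


# Four synthetic frames whose 51 values all equal the frame index.
frames = [[float(i)] * 51 for i in range(4)]
features = add_dynamic_features(frames)
print(len(features), len(features[0]))
```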

FIG. 4 is a diagram showing the language feature quantities used in the present embodiment. From a sentence written in mixed kanji and kana, morphological analysis yields accent phrase boundaries, breath group boundaries, accent information, and part-of-speech information. The sentence is also converted to monophone notation and then, combined with the accent information from the morphological analysis, to monophone accent notation. From the monophone accent notation and the part-of-speech information, the context-dependent phoneme notation used as the language feature quantity is obtained. This context-dependent phoneme notation is a language feature quantity that is also commonly used in conventional techniques such as HTS.

The context-dependent phoneme notation contains, for each phoneme in the time series given by the monophone notation, phoneme information, accent information, part-of-speech information, accent phrase information, breath group information, and syllable count information. The phoneme information gives the sequence of five phonemes centered on the current phoneme. The accent information gives the position within the accent phrase, counted in moras. The part-of-speech information gives the parts of speech of the current word and of the preceding and following words. The accent phrase information gives the accent types of the current accent phrase and of the preceding and following accent phrases, and the position of the current accent phrase. The breath group information gives the number of accent phrases and the number of moras in the current breath group and in the preceding and following breath groups, and the position of the current breath group. The syllable count information gives the numbers of breath groups, accent phrases, and moras.
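
Of the fields listed above, the phoneme information is the simplest to illustrate: a window of five phonemes centered on the current one. The sketch below builds that window only. The padding symbol `"sil"` at the utterance edges and the function name `quinphone_contexts` are assumptions for illustration, and the other fields (accent, part of speech, accent phrase, breath group, counts) are not modeled.

```python
def quinphone_contexts(phonemes, pad="sil"):
    """Return, for each phoneme, the 5-phoneme window centered on it.

    The pad symbol used beyond the utterance edges is an assumption.
    """
    padded = [pad, pad] + list(phonemes) + [pad, pad]
    return [tuple(padded[i:i + 5]) for i in range(len(phonemes))]


for window in quinphone_contexts(["k", "o", "N"]):
    print(window)
```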

Next, the operation of the speech processing device 1 is described. The following uses the case where the pre-conversion style is "calm" and the post-conversion style is "anger" as an example.
FIG. 5 is a diagram showing the processing flow of the preliminary learning performed by the speech processing device 1.
First, the first learning speech data, which is learning speech data in the style "calm", is stored in the first speech storage unit 21, and the second learning speech data, which is learning speech data in the style "anger", is stored in the second speech storage unit 25. The first and second learning speech data are recordings of the same person reading text aloud in the "calm" and "anger" styles, respectively. The text read in the "calm" style and the text read in the "anger" style may be the same or different. The acoustic feature quantities of each phoneme are affected by the phonemes before and after it. It is therefore desirable to use, for the first and second learning speech data, utterances of phoneme-balanced sentences, which contain a wide variety of phoneme sequences in good balance. For example, the 503 phoneme-balanced sentences proposed in References 1 and 2 below can be used as the text to be read.

(Reference 1) Kenichi Iso, Takao Watanabe, Hisao Kuwabara, "Design of a sentence set for a speech database", Proceedings of the Acoustical Society of Japan Spring Meeting, pp. 89-90, March 1988
(Reference 2) Yoshinori Sagisaka, Noriyoshi Uratani, "ATR speech and language database", Journal of the Acoustical Society of Japan, Vol. 48, No. 12, pp. 878-882, 1992

The first acoustic analysis unit 22 reads from the first speech storage unit 21 the first learning speech data to which the style information "calm" is attached. It acquires the frame-by-frame acoustic feature quantities of each sentence from the speech waveform indicated by that data (step S110). The first learning language analysis unit 23 acquires the context-dependent phoneme notation from each sentence of the utterance content of the first learning speech data, as indicated by the first learning text data, and uses it as the language feature quantity (step S120). The context-dependent phoneme notation derived from the text may differ from the speech waveform actually recorded when the text was read aloud in pause positions, accent phrase boundaries, accent positions, and so on. The context-dependent phoneme notation acquired by the first learning language analysis unit 23 is therefore checked and corrected manually.

The first statistical model learning unit 24 generates a statistical model by statistical-model machine learning using, for each sentence, the acoustic feature quantities obtained in step S110 and the language feature quantities obtained in step S120 (step S130). The techniques of Non-Patent Document 1 and Reference 3 can be used for this machine learning, but any conventional technique may be used. For example, the first statistical model learning unit 24 obtains, for each phoneme clustering unit, the mean and variance of the 153-dimensional acoustic feature quantities, the transition probabilities between states, and the output probabilities, and obtains a three-state HMM acoustic model by statistical-model machine learning. The first statistical model learning unit 24 attaches information indicating the style "calm" to the resulting statistical model, which consists of the acoustic models for the individual phoneme clustering units, and writes it to the first statistical model storage unit 31.

(Reference 3) Takayoshi Yoshimura and four others, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis", in Proc. EUROSPEECH, pp. 2347-2350, 1999

The second acoustic analysis unit 26 reads from the second speech storage unit 25 the second learning speech data to which the style information "anger" is attached. It acquires the frame-by-frame acoustic feature quantities of each sentence from the speech waveform indicated by that data (step S140). The second learning language analysis unit 27, by the same processing as the first learning language analysis unit 23, acquires the context-dependent phoneme notation from each sentence of the utterance content of the second learning speech data, as indicated by the second learning text data, and uses it as the language feature quantity (step S150). The context-dependent phoneme notation acquired by the second learning language analysis unit 27 is likewise checked and corrected manually.

The second statistical model learning unit 28 generates a statistical model by the same machine learning as the first statistical model learning unit 24 using, for each sentence, the acoustic feature quantities obtained in step S140 and the language feature quantities obtained in step S150 (step S160). The second statistical model learning unit 28 attaches information indicating the style "anger" to the generated statistical model and writes it to the second statistical model storage unit 32.

The speech processing device 1 may execute steps S110 and S120 in parallel or in the reverse order. Likewise, it may execute steps S140 and S150 in parallel or in the reverse order. It may also execute steps S110 to S130 and steps S140 to S160 in parallel or in the reverse order. When there is only one pre-conversion style and one post-conversion style, the style information need not be attached to the first learning speech data, the second learning speech data, or the statistical models.
When the text read in the "calm" style and the text read in the "anger" style are the same, only one of the first learning text data and the second learning text data may be input to the speech processing device 1. In that case, the first learning language analysis unit 23 or the second learning language analysis unit 27 outputs the language feature quantities it obtains to both the first statistical model learning unit 24 and the second statistical model learning unit 28.

FIG. 6 is a diagram showing the processing flow of the speech processing performed by the speech processing device 1.
The speech processing device 1 receives the input speech data, the input speech text data indicating the utterance content of the input speech data, the pre-conversion style data, and the post-conversion style data. When the style of the input speech data or the style of the speech data to be obtained by the conversion is fixed in advance, input of the pre-conversion style data or the post-conversion style data, respectively, can be omitted. The speaker of the input speech data may differ from the speaker of the first and second learning speech data.

The speech processing language analysis unit 41 acquires the language feature quantities of the text indicated by the input speech text data by the same processing as the first learning language analysis unit 23 and the second learning language analysis unit 27 (step S210).

The first statistical model selection unit 42 reads from the first statistical model storage unit 31 the statistical model for the style "calm" indicated by the pre-conversion style data (step S220). The first acoustic feature quantity generation unit 43 generates the first acoustic feature quantity generation values, which are time-series acoustic feature quantities, using the statistical model read by the first statistical model selection unit 42 and the language feature quantities output by the speech processing language analysis unit 41 (step S230).

The second statistical model selection unit 44 reads from the second statistical model storage unit 32 the statistical model for the style "anger" indicated by the post-conversion style data (step S240). The second acoustic feature quantity generation unit 45 generates the second acoustic feature quantity generation values, which are time-series acoustic feature quantities, using the statistical model read by the second statistical model selection unit 44 and the language feature quantities output by the speech processing language analysis unit 41 (step S250).

The first corresponding frame detection unit 461 of the processing information generation unit 46 associates, frame by frame and based on the similarity of their values, the first acoustic feature quantity generation values generated by the first acoustic feature quantity generation unit 43 with the second acoustic feature quantity generation values generated by the second acoustic feature quantity generation unit 45 (step S260). For example, dynamic programming (DTW, dynamic time warping) is used for this association. The processing information calculation unit 462 calculates, for each pair of corresponding frames, the difference between the first acoustic feature quantity generation value and the second acoustic feature quantity generation value, and uses it as the processing information for the acoustic feature quantities (step S270).

The speech processing acoustic analysis unit 47 acquires the acoustic feature quantities of the speech waveform indicated by the input speech data (step S280). The second corresponding frame detection unit 481 of the speech processing execution unit 48 associates, frame by frame and based on the similarity of their values, the first acoustic feature quantity generation values generated by the first acoustic feature quantity generation unit 43 with the acoustic feature quantities acquired by the speech processing acoustic analysis unit 47 (step S290). The processing information addition unit 482 adds, to the acoustic feature quantity of each frame acquired by the speech processing acoustic analysis unit 47, the processing information that the processing information generation unit 46 generated from the first acoustic feature quantity generation value of the corresponding frame (step S300). The speech synthesis unit 483 synthesizes speech data from the acoustic feature quantities of the input speech data to which the processing information has been added, and treats the result as the processed speech data. The speech synthesis unit 483 outputs the processed speech data as the output speech data of the speech processing device 1 (step S310).
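
Step S300 amounts to a per-frame vector addition. The sketch below assumes that the alignment from step S290 is available as a list giving, for each input frame, the index of the corresponding first-generation-value frame, and that one processing vector is stored per such index. Both data layouts and the function name `apply_processing_info` are hypothetical; how several processing vectors for the same generated frame are handled is not covered here.

```python
def apply_processing_info(input_frames, alignment, info_by_generated_frame):
    """Add processing information to each input frame (step S300, sketched).

    input_frames: list of feature vectors from the input speech.
    alignment[t]: index of the first-generation-value frame matched to input frame t
                  (assumed output of step S290).
    info_by_generated_frame: dict from that index to a processing vector
                  (assumed storage layout).
    """
    processed = []
    for t, frame in enumerate(input_frames):
        info = info_by_generated_frame[alignment[t]]
        processed.append([x + d for x, d in zip(frame, info)])
    return processed


inputs = [[100.0, 1.0], [110.0, 2.0], [120.0, 3.0]]
alignment = [0, 0, 1]                 # two input frames match generated frame 0
info = {0: [20.0, 0.5], 1: [30.0, -1.0]}
print(apply_processing_info(inputs, alignment, info))
```

Because the processing information is a difference between two model-generated values, adding it shifts the input speaker's own features rather than replacing them, which is consistent with the statement that the input speaker may differ from the training speaker.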

The speech processing device 1 may execute steps S220 to S230 and steps S240 to S250 in parallel or in the reverse order. It may also execute steps S210 to S270 and step S280 in parallel or in the reverse order.

The speech processing shown in FIG. 6 is now described using diagrams of the data.
FIG. 7 is a diagram for explaining how the speech processing device 1 obtains acoustic feature quantities from text. The figure covers steps S210 to S250 of FIG. 6. This processing generates the first and second acoustic feature quantity generation values from the mixed kanji and kana text indicated by the input speech text data.

In step S210 of FIG. 6, the speech processing language analysis unit 41 obtains language feature quantities in context-dependent phoneme notation from the mixed kanji and kana text indicated by the input speech text data.
In step S230 of FIG. 6, the first acoustic feature quantity generation unit 43 concatenates the acoustic models of the phoneme clustering units given by the statistical model for the style "calm", following the sequence of context-dependent phonemes. It selects the combination for which the connection probability is minimal and thereby generates an acoustic feature quantity for every 5 ms frame shift. The acoustic feature quantity generated for each frame is the 51-dimensional static characteristic consisting of a 1-dimensional pitch and 50-dimensional spectral coefficients. The method of Non-Patent Document 1, for example, can be used for this generation, but any existing technique for obtaining acoustic feature quantities from text data and an acoustic model may be used. The generated acoustic feature quantities are the first acoustic feature quantity generation values for each frame of the time series in the style "calm".

Similarly, in step S250 of FIG. 6, the second acoustic feature quantity generation unit 45 obtains the second acoustic feature quantity generation values for each frame of the time series in the style "anger". That is, it concatenates the acoustic models of the phoneme clustering units given by the statistical model for the style "anger", following the sequence of context-dependent phonemes, selects the combination for which the connection probability is minimal, and generates a 51-dimensional acoustic feature quantity for every 5 ms frame shift. The generated acoustic feature quantities are the second acoustic feature quantity generation values for each frame of the time series in the style "anger".

FIG. 8 is a diagram for explaining the processing information generation in the speech processing device 1. The figure covers steps S260 to S270 of FIG. 6. Through the processing shown in FIG. 7, the first acoustic feature quantity generation unit 43 generates the first acoustic feature quantity generation values for every 5 ms frame shift in the style "calm", and the second acoustic feature quantity generation unit 45 generates the second acoustic feature quantity generation values for every 5 ms frame shift in the style "anger". The first acoustic feature quantity generation value of the i-th frame (i is an integer of 1 or more) is written Ai, and the second acoustic feature quantity generation value of the i-th frame is written Bi. In step S260 of FIG. 6, the first corresponding frame detection unit 461 of the processing information generation unit 46 associates the first acoustic feature quantity generation values A1, A2, ... with the second acoustic feature quantity generation values B1, B2, ... by dynamic programming (DTW) or a similar method, using a distance measure based on the 50-dimensional spectral coefficients. The first corresponding frame detection unit 461 performs this association over the entire sentence.
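
A minimal sketch of such an alignment follows, written as textbook dynamic time warping rather than as the device's actual implementation. It assumes each frame is a list whose element 0 is the pitch and whose remaining elements are the spectral coefficients, uses Euclidean distance over the spectral part only, and allows the three standard step directions; the names `spectral_distance` and `dtw_path` are hypothetical. Indices in the code are 0-based, whereas the text counts frames from 1.

```python
import math


def spectral_distance(a, b):
    # Euclidean distance over elements 1.. only; element 0 is assumed to be the pitch.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[1:], b[1:])))


def dtw_path(A, B, dist=spectral_distance):
    """Return the DTW alignment of sequences A and B as (i, j) index pairs."""
    n, m = len(A), len(B)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(A[i - 1], B[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


# Toy frames: [pitch, one spectral coefficient]. The second A frame matches two B frames.
A = [[0.0, 0.0], [0.0, 1.0], [0.0, 5.0]]
B = [[0.0, 0.0], [0.0, 1.0], [0.0, 1.1], [0.0, 5.0]]
print(dtw_path(A, B))
```

In the toy input, the middle frame of `A` is paired with two consecutive frames of `B`, which is the one-to-many case that the processing information calculation has to handle.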

In step S270 of FIG. 6, the processing information calculation unit 462 calculates the difference between the first acoustic feature generation value and the second acoustic feature generation value of each pair of associated frames and uses it as processing information. When the first acoustic feature generation value of one frame corresponds to the second acoustic feature generation values of a plurality of frames, the processing information calculation unit 462 generates the processing information as if that frame of the first acoustic feature generation value were present once for each corresponding frame of the second acoustic feature generation values. When the first acoustic feature generation values of a plurality of frames correspond to the second acoustic feature generation value of one frame, the processing information calculation unit 462 generates processing information for each of those first acoustic feature generation values from its difference from the second acoustic feature generation value of the corresponding frame. The processing information of the i-th frame is denoted Ci.

For example, since the first acoustic feature generation value A1 corresponds to the second acoustic feature generation value B1, the processing information calculation unit 462 calculates their difference and uses it as processing information C1. That is, C1 = B1 − A1.
The first acoustic feature generation value A2 corresponds to both B2 and B3, so A2 is treated as being present for two frames, one for each of B2 and B3. The processing information calculation unit 462 calculates the difference between A2 and B2 as processing information C2 and the difference between A2 and B3 as processing information C3. That is, C2 = B2 − A2 and C3 = B3 − A2.
Since the first acoustic feature generation value A3 corresponds to B4, the processing information calculation unit 462 calculates their difference as processing information C4. That is, C4 = B4 − A3.
The first acoustic feature generation values A4 and A5 both correspond to B5. The processing information calculation unit 462 therefore calculates the difference between A4 and B5 as processing information C5 and the difference between A5 and B5 as processing information C6. That is, C5 = B5 − A4 and C6 = B5 − A5.
The first acoustic feature generation value A6 corresponds to both B6 and B7. Therefore, C7 = B6 − A6 and C8 = B7 − A6.
The processing information calculation unit 462 repeats the same processing to generate the processing information C1, C2, ….
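The worked example above amounts to emitting one difference vector per aligned frame pair. The following sketch reproduces it under stated assumptions: the function name `processing_info`, the 0-based indexing, and the one-dimensional toy values are illustrative only, and the alignment path is written out by hand from the example rather than computed.

```python
def processing_info(A, B, path):
    """For every aligned pair (i, j), emit C = B[j] - A[i] element-wise.

    Returns (C, c_by_a): C lists the difference vectors in path order, and
    c_by_a maps each A-frame index to the positions in C generated from it.
    """
    C = []
    c_by_a = {}
    for i, j in path:
        C.append([b - a for a, b in zip(A[i], B[j])])
        c_by_a.setdefault(i, []).append(len(C) - 1)
    return C, c_by_a


# Toy one-dimensional frames standing in for the 51-dimensional features.
A = [[10.0], [20.0], [30.0], [40.0], [50.0], [60.0]]           # A1..A6
B = [[11.0], [22.0], [23.0], [34.0], [45.0], [66.0], [67.0]]   # B1..B7
# Alignment from the example, 0-based: A2-(B2,B3), (A4,A5)-B5, A6-(B6,B7).
path = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 4), (5, 5), (5, 6)]

C, c_by_a = processing_info(A, B, path)
```

Here `C` has eight entries, matching C1 to C8, and `c_by_a[1]` holds the positions of C2 and C3, both of which were generated from A2.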

FIG. 9 is a diagram for explaining the processing of input speech in the speech processing device 1. It shows the processing of steps S280 to S300 in FIG. 6.
In step S280 of FIG. 6, the acoustic analysis unit for speech processing 47 obtains acoustic features with a frame length of 25 ms and a frame shift of 5 ms from the input speech data, whose style is “calm”. The acoustic features obtained here are 51-dimensional static features consisting of a 1-dimensional pitch and 50-dimensional spectral coefficients. The acoustic feature of the i-th frame obtained from the input speech data is denoted Di.
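To make the framing of step S280 concrete, the sketch below computes the sample ranges of 25 ms frames taken every 5 ms. The sampling rate of 16 kHz, the choice to drop a trailing partial frame, and the name `frame_bounds` are assumptions; the text specifies only the frame length and the frame shift, and the pitch and spectral analysis applied to each frame is not shown.

```python
def frame_bounds(num_samples, sample_rate, frame_ms=25, shift_ms=5):
    """Return (start, end) sample indices, end exclusive, for each analysis frame.

    Frames that would run past the end of the signal are dropped (assumption).
    """
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    bounds = []
    start = 0
    while start + frame_len <= num_samples:
        bounds.append((start, start + frame_len))
        start += shift
    return bounds


# One second of audio at an assumed 16 kHz sampling rate.
bounds = frame_bounds(num_samples=16000, sample_rate=16000)
```

At 16 kHz a frame spans 400 samples and consecutive frames start 80 samples apart, so neighbouring frames overlap by 20 ms.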

In step S290 of FIG. 6, the second corresponding frame detection unit 481 of the speech processing unit 48 associates the acoustic features D1, D2, … of the input speech data with the first acoustic feature generation values A1, A2, … in the same manner as the first corresponding frame detection unit 461, that is, by dynamic programming (DTW) or the like using a distance measure based on the 50-dimensional spectral coefficients.

Although the frames are associated over the entire sentence, the association may instead be performed phoneme by phoneme. In that case, alignment processing is first performed using the input speech and its phoneme sequence to determine which part of the speech corresponds to which phoneme.
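The phoneme-by-phoneme variant can be sketched as running an aligner inside each phoneme segment and shifting the local indices back to sentence positions. Everything below is hypothetical: the names `align_per_phoneme` and `linear_align`, the representation of phoneme boundaries as `(start, end)` frame ranges, and the linear stand-in aligner, which would be replaced by DTW over the segment's frames in practice. Segments are assumed to be non-empty and listed in the same phoneme order on both sides.

```python
def linear_align(xs, ys):
    """Stand-in aligner: pair frames by stretching the shorter segment linearly.
    Returns local 0-based (i, j) pairs; both segments must be non-empty."""
    n, m = len(xs), len(ys)
    steps = max(n, m)
    return [(k * n // steps, k * m // steps) for k in range(steps)]


def align_per_phoneme(xs, ys, segs_x, segs_y, align=linear_align):
    """Align xs and ys one phoneme at a time.

    segs_x and segs_y give, per phoneme, the (start, end) frame range
    (end exclusive) in xs and ys respectively.
    """
    path = []
    for (sx, ex), (sy, ey) in zip(segs_x, segs_y):
        for i, j in align(xs[sx:ex], ys[sy:ey]):
            path.append((sx + i, sy + j))  # shift back to sentence-level indices
    return path
```

Restricting the search to one phoneme at a time keeps a frame of one phoneme from being paired with a frame of another, at the cost of depending on the accuracy of the phoneme boundaries.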

In step S300 of FIG. 6, the processing information adding unit 482 adds, to the acoustic feature of each frame of the input speech data, the processing information generated from the first acoustic feature generation value associated with that acoustic feature. When the acoustic feature of one frame of the input speech data corresponds to the first acoustic feature generation values of a plurality of frames, the processing information adding unit 482 adds to that acoustic feature the average of the processing information generated from each of the corresponding first acoustic feature generation values. When the acoustic features of a plurality of frames of the input speech data correspond to the first acoustic feature generation value of one frame, the processing information adding unit 482 adds the processing information generated from that first acoustic feature generation value to each of those acoustic features. The acoustic feature of the i-th frame obtained by this processing is denoted Ei.

For example, since the acoustic feature D1 corresponds to the first acoustic feature generation value A1, the processing information adding unit 482 adds the processing information C1 generated from A1 to D1 to obtain the acoustic feature E1. That is, E1 = D1 + C1.
The acoustic features D2 and D3 correspond to the first acoustic feature generation value A2, from which the processing information C2 and C3 was generated. The processing information adding unit 482 therefore adds C2, the first item of processing information generated using A2, to D2 to obtain E2, and adds C3, the second item generated using A2, to D3 to obtain E3. That is, E2 = D2 + C2 and E3 = D3 + C3.
Since the acoustic feature D4 corresponds to the first acoustic feature generation value A3, the processing information adding unit 482 adds the processing information C4 generated using A3 to D4 to obtain E4. That is, E4 = D4 + C4.
The acoustic feature D5 corresponds to two frames, A4 and A5, so the processing information adding unit 482 adds to D5 the average of the processing information C5 generated from A4 and the processing information C6 generated from A5 to obtain E5. That is, E5 = D5 + Avg(C5 + C6), where Avg(x + y) denotes the average of x and y.
The acoustic feature D6 corresponds to the first acoustic feature generation value A6, from which the processing information C7 and C8 was generated. The processing information adding unit 482 adds the average of C7 and C8 to D6 to obtain E6. That is, E6 = D6 + Avg(C7 + C8).
The processing information adding unit 482 repeats the same processing and changes the time-series acoustic features of the input speech data into the acoustic features E1, E2, ….
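The addition rule of step S300 can be sketched as follows. The function name `apply_processing_info` and the data layout are assumptions, and one case is filled in by assumption because the text does not cover it: when several input frames share one generated frame and their count differs from the number of processing vectors, each input frame takes the vector at the proportional position. When the counts match, this reduces to the in-order pairing of the example (D2 with C2, D3 with C3).

```python
def apply_processing_info(D, path_da, c_by_a):
    """Add processing information to input-speech frames.

    D: list of input frames; path_da: 0-based (d, a) alignment pairs that cover
    every frame of D; c_by_a: maps an A-frame index to the list of processing
    vectors generated from that frame.
    """
    a_of_d, d_of_a = {}, {}
    for d, a in path_da:
        a_of_d.setdefault(d, []).append(a)
        d_of_a.setdefault(a, []).append(d)
    E = []
    for d, frame in enumerate(D):
        picked = []
        for a in a_of_d[d]:
            cs, ds = c_by_a[a], d_of_a[a]
            if len(ds) == 1:
                picked.extend(cs)  # sole partner of this A frame: average all its vectors
            else:
                k = ds.index(d)    # several D frames share this A frame
                picked.append(cs[k * len(cs) // len(ds)])  # proportional pick (assumption)
        mean = [sum(col) / len(picked) for col in zip(*picked)]
        E.append([x + m for x, m in zip(frame, mean)])
    return E


# The example from the text with toy one-dimensional values C1..C8 = 1..8.
D = [[0.0]] * 6
path_da = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (4, 4), (5, 5)]
c_by_a = {0: [[1.0]], 1: [[2.0], [3.0]], 2: [[4.0]],
          3: [[5.0]], 4: [[6.0]], 5: [[7.0], [8.0]]}
E = apply_processing_info(D, path_da, c_by_a)
```

With these toy values, E5 receives the average of C5 and C6 and E6 receives the average of C7 and C8, as in the example.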

In step S310 of FIG. 6, the speech synthesis unit 483 synthesizes speech data from the acoustic features E1, E2, …, obtaining processed speech data in which the pitch and spectrum of the input speech data have been converted. The methods of Patent Documents 1 and 2 can be used for this conversion, for example, but any conventional speech synthesis technique may be used instead. The speech synthesis unit 483 compresses the processed speech data so that its duration matches the duration corresponding to the number of frames of the second acoustic feature generation values, and outputs the result as output speech data. This yields a speech waveform that has been style-converted to the emotion “anger”.

FIG. 10 is a diagram showing input speech data and the processed speech data obtained by style-converting it with the speech processing device 1. From top to bottom, it shows the pitch, the spectrum, the phoneme labels, and the speech waveform. The horizontal axis is time. The vertical axis is frequency for the pitch and the spectrum and volume for the speech waveform, so each panel shows a change over time.

In the embodiment described above, both pitch and spectrum are used as acoustic features, but only one of them may be used.

Emotional expression also differs from speaker to speaker. Statistical models of the pre-conversion style and the post-conversion style may therefore be created for each speaker, and the speech processing may be told which speaker's learning speech data the statistical models to be used were trained on. This makes it possible to specify which speaker's emotional expression is imparted to the input speech data.

FIG. 11 is a diagram showing the specifications of a subjective evaluation experiment conducted on the speech processing device 1.
Ten sentences whose meaning had been judged emotionless in a preliminary experiment were read aloud in the “calm” style and used as input speech data. The speech processing device 1 converted this input speech data into each of the styles “joy”, “anger”, and “sorrow”, yielding 30 sentences of speech data. The subjects were five men and five women, and the converted speech data was presented to them through loudspeakers in an ordinary laboratory.
The experimental procedure was as follows. Speech data was selected at random from all 30 sentences processed by the speech processing device 1 and presented to the subjects. For each presented item, the subject chose how its emotional expression sounded from eight categories: the six emotions “joy”, “surprise”, “anger”, “disgust”, “sorrow”, and “fear”, plus “no emotion” and “unknown”. Each item of speech data was judged by ten subjects.

FIG. 12 is a diagram showing the emotions judged for the learning data used to generate the statistical models used in the subjective evaluation experiment on the speech processing device 1. The speaker of the learning speech data is the same as the speaker of the input speech data.
For the speech data used as the first learning speech data to generate the statistical model of the style “no emotion”, 85.65% of the subjects judged the style to be “no emotion”.
For the speech data used as the second learning speech data to generate the statistical model of the style “joy”, 80.92% of the subjects judged the style to be “joy”.
Similarly, for the speech data used as the second learning speech data to generate the statistical model of the style “anger”, 78.00% of the subjects judged the style to be “anger”.
For the speech data used as the second learning speech data to generate the statistical model of the post-conversion style “sorrow”, 61.12% of the subjects judged the style to be “sorrow (sadness)”.
Thus, the rate at which the style of the learning data was judged correctly ranged from 61.12% to 85.65%.

FIG. 13 is a diagram showing the results of the subjective evaluation experiment conducted on the speech processing device 1.
For the speech data that the speech processing device 1 obtained by converting input speech data of the style “calm” using the statistical model of the pre-conversion style “no emotion” and the statistical model of the post-conversion style “joy”, 49.0% of the subjects judged the style to be “joy”.
For the speech data obtained in the same way using the statistical model of the pre-conversion style “no emotion” and the statistical model of the post-conversion style “anger”, 71.0% of the subjects judged the style to be “anger”.
For the speech data obtained in the same way using the statistical model of the pre-conversion style “no emotion” and the statistical model of the post-conversion style “sorrow”, 77.0% of the subjects judged the style to be “sorrow”.
Accordingly, the correct answer rate was 49.0% for “joy”, 71.0% for “anger”, and 77.0% for “sorrow”, and the average correct answer rate was 65.7%.
Thus, when emotion extracted from the same speaker was imparted to speech uttered by a specific speaker, the correct answer rate of the subjective evaluation was 60% or more on average, confirming the effectiveness of this embodiment.
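The reported average can be checked directly from the three per-style rates. The short sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# Per-style correct answer rates reported for FIG. 13, in percent.
rates = {"joy": 49.0, "anger": 71.0, "sorrow": 77.0}

average = sum(rates.values()) / len(rates)
print(round(average, 1))  # average correct answer rate, in percent

# Styles whose individual rate falls below the 60% level cited in the text.
below_60 = [style for style, rate in rates.items() if rate < 60.0]
print(below_60)
```

The average rounds to the 65.7% stated in the text. The 60% level is met by the average and by “anger” and “sorrow” individually, while “joy” alone stays below it.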

According to the embodiment described above, the speech processing device 1 can perform style conversion of speech easily and accurately.

The speech processing device 1 described above contains a computer system. The operation of the speech processing device 1 is stored in the form of a program on a computer-readable recording medium, and the processing described above is carried out by the computer system reading and executing this program. The computer system referred to here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The “computer system” also includes a homepage providing environment (or display environment) if a WWW system is used.
The “computer-readable recording medium” refers to portable media such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, and to storage devices such as a hard disk built into a computer system. The “computer-readable recording medium” further includes media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period of time, such as the volatile memory inside a computer system serving as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Speech processing device
2 Learning unit
3 Storage unit
4 Speech processing section
21 First speech storage unit
22 First acoustic analysis unit
23 First learning language analysis unit
24 First statistical model learning unit
25 Second speech storage unit
26 Second acoustic analysis unit
27 Second learning language analysis unit
28 Second statistical model learning unit
31 First statistical model storage unit
32 Second statistical model storage unit
41 Language analysis unit for speech processing (language analysis unit)
42 First statistical model selection unit
43 First acoustic feature generation unit
44 Second statistical model selection unit
45 Second acoustic feature generation unit
46 Processing information generation unit
47 Acoustic analysis unit for speech processing (acoustic analysis unit)
48 Speech processing unit
461 First corresponding frame detection unit
462 Processing information calculation unit
481 Second corresponding frame detection unit
482 Processing information adding unit
483 Speech synthesis unit

Claims

1. A speech processing device comprising:
a language analysis unit that acquires language features of a sentence indicated by text data;
a first acoustic feature generation unit that generates acoustic features for each frame of a time series, based on the language features acquired by the language analysis unit and a statistical model of acoustic features generated from speech data uttered in a first style;
a second acoustic feature generation unit that generates acoustic features for each frame of a time series, based on the language features acquired by the language analysis unit and a statistical model of acoustic features generated from speech data uttered in a second style;
a processing information generation unit that associates frames of first acoustic feature generation values, which are the acoustic features generated by the first acoustic feature generation unit, with frames of second acoustic feature generation values, which are the acoustic features generated by the second acoustic feature generation unit, based on the similarity between the first acoustic feature generation values and the second acoustic feature generation values, and generates processing information from the difference between the first acoustic feature generation value and the second acoustic feature generation value of the associated frames;
an acoustic analysis unit that acquires acoustic features for each frame of a time series from speech data of the sentence indicated by the text data; and
a speech processing unit that associates frames of the acoustic features acquired by the acoustic analysis unit with frames of the first acoustic feature generation values based on their similarity, and processes the acoustic feature of each frame based on the processing information that the processing information generation unit generated using the first acoustic feature generation value of the corresponding frame.

2. The speech processing device according to claim 1, wherein the acoustic features include at least one of information on pitch and information on the frequency spectrum.

3. The speech processing device according to claim 1 or 2, wherein the style of the speech data of the sentence indicated by the text data is the first style.

4. The speech processing device according to any one of claims 1 to 3, wherein the speech processing unit synthesizes speech data based on the processed acoustic features.

5. A program for causing a computer to function as a speech processing device comprising:
language analysis means for acquiring language features of a sentence indicated by text data;
first acoustic feature generation means for generating acoustic features for each frame of a time series, based on the language features acquired by the language analysis means and a statistical model of acoustic features generated from speech data uttered in a first style;
second acoustic feature generation means for generating acoustic features for each frame of a time series, based on the language features acquired by the language analysis means and a statistical model of acoustic features generated from speech data uttered in a second style;
processing information generation means for associating frames of first acoustic feature generation values, which are the acoustic features generated by the first acoustic feature generation means, with frames of second acoustic feature generation values, which are the acoustic features generated by the second acoustic feature generation means, based on the similarity between the first acoustic feature generation values and the second acoustic feature generation values, and generating processing information from the difference between the first acoustic feature generation value and the second acoustic feature generation value of the associated frames;
acoustic analysis means for acquiring acoustic features for each frame of a time series from speech data of the sentence indicated by the text data; and
speech processing means for associating frames of the acoustic features acquired by the acoustic analysis means with frames of the first acoustic feature generation values based on their similarity, and processing the acoustic feature of each frame based on the processing information that the processing information generation means generated using the first acoustic feature generation value of the corresponding frame.