JP2001350491A

JP2001350491A - Method and device for voice processing

Info

Publication number: JP2001350491A
Application number: JP2000170708A
Authority: JP
Inventors: Masaaki Yamada; 雅章山田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-06-07
Filing date: 2000-06-07
Publication date: 2001-12-21

Abstract

PROBLEM TO BE SOLVED: To learn a prosody estimation model that takes fluctuation in the utterance environment into account. SOLUTION: Each time learning data is input, prosody information of the data is obtained (S3), estimation factors of the data are obtained (S4), and utterance date and time information of the data is obtained (S5). A prosody estimation model that estimates prescribed prosody information is then learned using the prosody information, the estimation factors, and the utterance date and time information of the data (S8).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a speech processing method and apparatus, and more particularly to a technique for improving the quality of synthesized speech.

[0002]

[Prior art] Rule-based speech synthesis is broadly divided into a prosody generation step, which estimates prosody information such as duration, fundamental frequency (F0), and power from input text, and a waveform generation step, which generates a speech waveform from the generated prosody information.

[0003] In the prosody generation step, corpus-based techniques have come into use in recent years. These techniques statistically learn the relationship between text and prosody information from a large amount of learning data. By assuming a model (a prosody estimation model) that takes predetermined estimation factors as input and outputs predetermined prosody information, the predetermined prosody information can be estimated.
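In symbols, such a model can be written as a function from estimation factors to a prosody value. The notation below is an illustrative assumption and is not taken from the source. The additive form on the right corresponds to Quantification Theory Type I, one of the statistical methods the embodiment names for learning the model.

```latex
\hat{y} = f(x_1, \dots, x_J), \qquad
\hat{y}_i = \bar{y} + \sum_{j=1}^{J} \sum_{k=1}^{K_j} a_{jk}\, \delta_i(j,k)
```

Here $\hat{y}$ is the estimated prosody value (for example, a phoneme duration), $x_j$ is the $j$-th estimation factor, $\bar{y}$ is the mean over the learning data, $\delta_i(j,k)$ is 1 when sample $i$ falls in category $k$ of factor $j$ and 0 otherwise, and the coefficients $a_{jk}$ are chosen to minimize the squared error over the learning data.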

[0004]

[Problems to be solved by the invention] The above technique, however, has the following problem.

[0005] Prosody information such as duration, fundamental frequency, and power fluctuates under the influence of the utterance environment. To build an accurate prosody estimation model with the above technique, a large amount of learning data recorded in a stable utterance environment must therefore be prepared. Preparing a large amount of learning data, however, takes a long recording time, and an accurate prosody estimation model cannot be generated unless a stable utterance environment is maintained throughout that time. Maintaining a stable utterance environment over a long period is very difficult: even for the same speaker, the utterance environment varies from day to day with the speaker's physical condition, familiarity, fatigue, and so on. Synthesized speech generated with a prosody estimation model affected by such variation sounds unnatural and awkward.

[0006] The present invention has been made in view of the above problem. Its object is to make it possible to learn a prosody estimation model that takes fluctuation in the utterance environment into account, and thereby to generate natural synthesized speech that does not sound awkward.

[0007]

[Means for solving the problems] To achieve the above object, a speech processing method according to one aspect of the present invention comprises: an input step of inputting speech information; a first obtaining step of obtaining prosody information from the speech information; a second obtaining step of obtaining, as one of the estimation factors, information indicating a change in the utterance environment of the speech information; and a learning step of learning, using the prosody information and the estimation factors, a prosody estimation model for estimating the prosody information.

[0008] To achieve the above object, a speech processing method according to another aspect of the present invention comprises: an analysis step of analyzing character information; an obtaining step of obtaining speech units corresponding to the character information; a setting step of setting, as one of the estimation factors, information indicating a change in the utterance environment; and an estimation step of estimating prosody information of the speech units using the estimation factors and a prosody estimation model that estimates predetermined prosody information.

[0009] To achieve the above object, a speech processing apparatus according to another aspect of the present invention comprises: input means for inputting speech information; first obtaining means for obtaining prosody information from the speech information; second obtaining means for obtaining, as one of the estimation factors, information indicating a change in the utterance environment of the speech information; and learning means for learning, using the prosody information and the estimation factors, a prosody estimation model for estimating the prosody information.

[0010] To achieve the above object, a speech processing apparatus according to yet another aspect of the present invention comprises: analysis means for analyzing character information; obtaining means for obtaining speech units corresponding to the character information; setting means for setting, as one of the estimation factors, information indicating a change in the utterance environment; and estimation means for estimating prosody information of the speech units using the estimation factors and a prosody estimation model that estimates predetermined prosody information.

[0011]

[Embodiments of the invention] Preferred embodiments of the present invention are described below with reference to the accompanying drawings.

[0012] [First embodiment] FIG. 1 is a block diagram showing the hardware configuration of the speech synthesis apparatus of this embodiment. In FIG. 1, reference numeral 11 denotes a central processing unit that performs numerical computation, control, and similar processing, and that uses various control programs to control the processing procedures described in the flowcharts of FIGS. 2 and 4. Reference numeral 12 denotes a learning unit, which uses a large amount of learning data to learn a prosody estimation model that estimates predetermined prosody information. The learning unit 12 includes a model management unit 19 that manages the learned prosody estimation model. The learning unit 12 may be implemented in hardware or in software. When it is implemented in software, a control program that realizes the software is stored in the storage device 13, and the central processing unit 11 realizes the functions of the learning unit 12 on the basis of this control program.

[0013] Reference numeral 13 denotes a storage device comprising semiconductor memory, a hard disk, or the like. It stores a control program for realizing the processing procedures described in the flowcharts of FIGS. 2 and 4, and a control program for controlling a graphical user interface that supports the input of learning data and of text to be synthesized. The storage device 13 includes a learning data storage unit 18 that stores a large amount of learning data.

[0014] Reference numeral 14 denotes an output device comprising a display, a speaker, and the like. The speaker outputs the synthesized speech, and the display shows the graphical user interface described above. Reference numeral 15 denotes an input device comprising a keyboard, a microphone, and the like. The keyboard is used to enter or specify the text to be synthesized (in Japanese or another language), and the microphone is used to input learning data. Reference numeral 16 denotes an internal bus.

[0015] Reference numeral 17 denotes a speech synthesis unit, which generates synthesized speech from input text according to the processing procedure described with reference to FIG. 4. The speech synthesis unit 17 may be implemented in hardware or in software. When it is implemented in software, a control program that realizes the software is stored in the storage device 13, and the central processing unit 11 realizes the functions of the speech synthesis unit 17 on the basis of this control program.

[0016] Next, the processing operation of the speech synthesis apparatus of this embodiment, configured as described above, is explained.

[0017] FIG. 2 is a flowchart explaining the procedure for learning the prosody estimation model in this embodiment. FIG. 2 is used to explain the procedure for learning a prosody estimation model that estimates duration, which is one kind of prosody information.

[0018] First, in step S1, the learning unit 12 initializes a loop counter i to 0.

[0019] In step S2, the microphone of the input device 15 takes in the speech waveform of a learning text whose unit is a word or a sentence, supplies this waveform to the learning unit 12 as the i-th learning data, and stores the waveform in the learning data storage unit 18.

[0020] In step S3, the learning unit 12 acoustically analyzes the i-th learning data in real time, obtains a phoneme sequence (with the phoneme as the phonological unit) and the prosody information of each phoneme, and stores them in the learning data storage unit 18. For example, when the speech waveform of the learning text "arashi" (あらし, "storm") is analyzed, the phoneme sequence /a/, /r/, /a/, /sh/, /i/ is obtained from the waveform. The prosody information obtained includes the duration (length), fundamental frequency (pitch), and power (loudness) of each phoneme. FIG. 3 shows an example in which the duration of each phoneme is stored as prosody information.
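As an illustration of the per-phoneme duration obtained in step S3, the following minimal sketch derives durations from phoneme boundary times. The helper name `durations_from_boundaries` and the boundary values are hypothetical; the source does not specify how the acoustic analysis segments the waveform.

```python
from typing import List, Tuple


def durations_from_boundaries(phonemes: List[str],
                              boundaries_ms: List[float]) -> List[Tuple[str, float]]:
    """Pair each phoneme with its duration, given n+1 boundary times for n phonemes."""
    if len(boundaries_ms) != len(phonemes) + 1:
        raise ValueError("need exactly one more boundary than phonemes")
    return [(p, boundaries_ms[i + 1] - boundaries_ms[i])
            for i, p in enumerate(phonemes)]


# Phoneme sequence for "arashi"; the boundary times are invented for illustration.
arashi = durations_from_boundaries(["a", "r", "a", "sh", "i"],
                                   [0, 80, 120, 210, 300, 390])
```

Fundamental frequency and power could be attached to each phoneme in the same way, once the analysis provides them per segment.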

[0021] In step S4, the learning unit 12 obtains estimation factors from the i-th learning data and stores them in the learning data storage unit 18. In this embodiment, the presence or absence of an accent nucleus, the mora position, and the like are obtained as estimation factors. FIG. 3 shows an example in which these are stored as estimation factors.

[0022] In step S5, the learning unit 12 obtains the utterance date and time information of the i-th learning data and stores it in the learning data storage unit 18. For example, the date and time at which the learning data was stored in the learning data storage unit 18 in step S2 is used as the utterance date and time information. This information is an estimation factor that accounts for variation over time in the utterance environment of the learning data. FIG. 3 shows an example in which the utterance date and time information of each phoneme is stored as one of the estimation factors.

[0023] In step S6, the learning unit 12 adds 1 to the value of the loop counter i. Then, in step S7, it determines whether the value of the loop counter i equals a preset total number of learning data items. If the two are not equal, the learning unit 12 determines that learning data remains to be input, returns to step S2, and repeats the above processing. In this way, the learning unit 12 accumulates the speech waveforms, prosody information, estimation factors, and utterance date and time information of a large amount of learning data in the learning data storage unit 18, as shown in FIG. 3.
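The collection loop of steps S1 to S7 can be sketched as follows. This is a minimal sketch under stated assumptions: the record layout `PhonemeRecord`, the function `collect`, and the stand-in `fake_analyse` are all hypothetical names, and the analysis result is invented. Only the overall flow (analyze each utterance, stamp it with its storage time, accumulate) follows the description.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class PhonemeRecord:
    phoneme: str
    duration_ms: float      # prosody information (step S3)
    accent_nucleus: bool    # estimation factor (step S4)
    mora_position: int      # estimation factor (step S4)
    uttered_at: datetime    # utterance date and time (step S5)


def collect(total: int,
            analyse: Callable[[int], List[Dict]],
            clock: Callable[[], datetime]) -> List[PhonemeRecord]:
    """Gather records for `total` utterances, stamping each with its storage time."""
    store: List[PhonemeRecord] = []
    for i in range(total):          # S1, S6, S7: counter runs from 0 up to the total
        phonemes = analyse(i)       # S2-S4: hypothetical analysis of utterance i
        stamp = clock()             # S5: time at which the utterance is stored
        for p in phonemes:
            store.append(PhonemeRecord(uttered_at=stamp, **p))
    return store


# Stand-in analysis result and a fixed clock, both invented for illustration.
def fake_analyse(i: int) -> List[Dict]:
    return [{"phoneme": "a", "duration_ms": 80.0,
             "accent_nucleus": False, "mora_position": 1}]


records = collect(3, fake_analyse, lambda: datetime(2000, 6, 7, 10, 0))
```

Passing the clock in as a parameter keeps the sketch testable; a real recorder would presumably read the system time when the waveform is stored.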

[0024] In step S8, the learning unit 12 uses the data accumulated in the learning data storage unit 18 to learn a prosody estimation model that estimates duration, which is one kind of prosody information. A statistical method such as Quantification Theory Type I or a regression tree is used to learn the prosody estimation model. The learned prosody estimation model is stored in the model management unit 19 of the learning unit 12.
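To make the learning step concrete, here is a minimal regression-tree sketch over categorical estimation factors. It is an illustrative stand-in, not the patent's implementation: the functions `fit_tree` and `predict`, the factor names, the `session` label standing in for utterance date and time, and the toy durations are all assumptions.

```python
from statistics import mean
from typing import Dict, List

Row = Dict[str, object]


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(rows: List[Row], targets: List[float], depth: int = 2) -> dict:
    """Greedy tree over equality tests; each leaf holds the mean target."""
    if depth == 0 or len(set(targets)) <= 1:
        return {"leaf": mean(targets)}
    best = None
    for factor in rows[0]:
        for value in sorted({r[factor] for r in rows}, key=repr):
            left = [t for r, t in zip(rows, targets) if r[factor] == value]
            right = [t for r, t in zip(rows, targets) if r[factor] != value]
            if not left or not right:
                continue                      # split must separate the data
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, factor, value)
    if best is None:
        return {"leaf": mean(targets)}
    _, factor, value = best
    yes = [(r, t) for r, t in zip(rows, targets) if r[factor] == value]
    no = [(r, t) for r, t in zip(rows, targets) if r[factor] != value]
    return {"factor": factor, "value": value,
            "yes": fit_tree([r for r, _ in yes], [t for _, t in yes], depth - 1),
            "no": fit_tree([r for r, _ in no], [t for _, t in no], depth - 1)}


def predict(tree: dict, row: Row) -> float:
    """Walk the tree until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["yes"] if row[tree["factor"]] == tree["value"] else tree["no"]
    return tree["leaf"]


# Toy learning data: durations (ms) differ by accent nucleus and by session.
rows = [{"accent_nucleus": True, "session": "set1"},
        {"accent_nucleus": False, "session": "set1"},
        {"accent_nucleus": True, "session": "set2"},
        {"accent_nucleus": False, "session": "set2"}]
tree = fit_tree(rows, [100, 80, 110, 90], depth=2)
```

In this toy data the second session is uniformly 10 ms longer, so a tree that can split on the session label reproduces that shift, which is the effect the date and time factor is meant to capture.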

[0025] As described above, according to this embodiment, a prosody estimation model that estimates duration, which is one kind of prosody information, can be learned using estimation factors such as the presence or absence of an accent nucleus, the mora position, and utterance date and time information. In particular, by making utterance date and time information one of the estimation factors, a prosody estimation model can be learned that takes into account variation over time in the utterance environment of the learning data.

[0026] The above embodiment describes an example of learning a prosody estimation model that estimates duration, one kind of prosody information, using estimation factors such as the presence or absence of an accent nucleus, the mora position, and utterance date and time information, but this embodiment is not limited to this. By using the above estimation factors, a prosody estimation model that takes into account variation over time in the utterance environment of the learning data can also be learned for other prosody information such as fundamental frequency and power.

[0027] FIG. 4 is a flowchart explaining the procedure of the speech synthesis processing in this embodiment. FIG. 4 is used to explain the procedure for synthesizing speech from input text using a prosody estimation model that estimates duration, which is one kind of prosody information.

[0028] First, in step S11, the speech synthesis unit 17 analyzes the input text (a character string made up of units such as words, phrases, or sentences).

[0029] In step S12, the speech synthesis unit 17 sets estimation factors such as the presence or absence of an accent nucleus and the mora position on the basis of the analysis result of step S11.

[0030] In step S13, the speech synthesis unit 17 searches the learning data storage unit 18 and obtains a plurality of speech units corresponding to the phoneme sequence of the input text, together with the prosody information of each speech unit. The prosody information obtained here includes duration, fundamental frequency, power, and the like.

[0031] In step S14, the speech synthesis unit 17 sets the utterance date and time information of a predetermined speech unit obtained in step S13. In this embodiment, for example, the utterance date and time information of the first speech unit obtained in step S13 is set.

[0032] In step S15, the speech synthesis unit 17 estimates duration, which is one kind of prosody information, using the estimation factors set in step S12, the utterance date and time information set in step S14, and the prosody estimation model held by the model management unit 19 of the learning unit 12.
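Steps S14 and S15 can be sketched as follows: the utterance date and time of the first retrieved speech unit is attached to every phoneme's estimation factors, and the combined factors are passed to the model. The function `estimate_durations`, the factor keys, and `toy_model` are hypothetical; the toy model merely stands in for a learned prosody estimation model.

```python
from datetime import datetime
from typing import Callable, Dict, List


def estimate_durations(factors: List[Dict[str, object]],
                       unit_dates: List[datetime],
                       model: Callable[[Dict[str, object]], float]) -> List[float]:
    """Estimate one duration per phoneme with a shared utterance date and time."""
    if not unit_dates:
        raise ValueError("at least one speech unit is required")
    uttered_at = unit_dates[0]                       # S14: first retrieved unit
    return [model({**f, "uttered_at": uttered_at})   # S15: factors plus date and time
            for f in factors]


# Invented stand-in for a learned model: accented phonemes are longer,
# and data recorded on the 8th is assumed to run 5 ms longer.
def toy_model(x: Dict[str, object]) -> float:
    base = 100.0 if x["accent_nucleus"] else 80.0
    return base + (5.0 if x["uttered_at"].day == 8 else 0.0)


estimated = estimate_durations(
    [{"accent_nucleus": True}, {"accent_nucleus": False}],
    [datetime(2000, 6, 8, 9, 0), datetime(2000, 6, 7, 9, 0)],
    toy_model,
)
```

Using a single date and time for the whole utterance means every phoneme is estimated as if recorded in the same session, which keeps the estimated prosody consistent across the utterance.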

[0033] In step S16, the speech synthesis unit 17 replaces the durations obtained in step S13 with the durations estimated in step S15, and then edits the waveforms of the speech units obtained in step S13 and concatenates them on the basis of the replaced durations and the other prosody information. In this embodiment, the waveform of each speech unit is edited using PSOLA (the Pitch-Synchronous Overlap-Add method).
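The duration replacement at the start of step S16 can be sketched as below. The sketch covers only the bookkeeping (swap in the estimated duration, keep the other prosody values, and compute the time-scaling ratio a waveform editor would need); it does not implement PSOLA itself. The names `UnitProsody`, `retarget`, and `stretch_factors`, and the sample values, are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class UnitProsody:
    duration_ms: float
    f0_hz: float
    power_db: float


def retarget(units: List[UnitProsody], estimated_ms: List[float]) -> List[UnitProsody]:
    """Replace each unit's stored duration with the estimated one."""
    if len(units) != len(estimated_ms):
        raise ValueError("one estimated duration per speech unit is required")
    return [replace(u, duration_ms=d) for u, d in zip(units, estimated_ms)]


def stretch_factors(units: List[UnitProsody], estimated_ms: List[float]) -> List[float]:
    """Ratio of target to stored duration (above 1 lengthens, below 1 shortens)."""
    return [d / u.duration_ms for u, d in zip(units, estimated_ms)]


stored = [UnitProsody(duration_ms=80.0, f0_hz=120.0, power_db=60.0)]
targets = retarget(stored, [100.0])
ratios = stretch_factors(stored, [100.0])
```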

[0034] In step S17, the speech synthesis unit 17 supplies the synthesized speech generated in step S16 to the output device 14. The synthesized speech is output from the speaker of the output device 14.

[0035] As described above, according to this embodiment, duration, which is one kind of prosody information, can be estimated using a prosody estimation model learned with variation over time in the utterance environment taken into account. This makes it possible to generate, from input text, natural synthesized speech that does not sound awkward and in which the influence of fluctuation in the utterance environment is suppressed.

[0036] The above embodiment describes an example of estimating the durations for input text using a prosody estimation model learned with variation over time in the utterance environment taken into account, but this embodiment is not limited to this. Prosody information other than duration, such as fundamental frequency and power, can also be estimated.

[0037] The above embodiment describes an example in which the date and time at which the learning data was actually input is used as the utterance date and time information, which is one of the estimation factors, but this embodiment is not limited to this. For example, as shown in FIG. 6, a label representing a predetermined period (in units of hours, days, or the like) may be used as the utterance date and time information. FIG. 6 shows an example of learning data for learning a prosody estimation model that estimates duration, which is one kind of prosody information. The presence or absence of an accent nucleus, the mora position, the utterance date and time information, and so on in FIG. 6 are the estimation factors of this prosody estimation model. For the utterance date and time information of each learning data item, a different label (set 1 to set 7) is assigned, for example, for each predetermined time interval.
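One way to derive such period labels from a timestamp is shown below. This is a minimal sketch: the function `period_label`, the "setN" label format, the recording start time, and the one-hour period are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def period_label(uttered_at: datetime, start: datetime, period: timedelta) -> str:
    """Map a timestamp to a label such as 'set1' for the period it falls in."""
    index = (uttered_at - start) // period   # whole periods elapsed since start
    if index < 0:
        raise ValueError("utterance precedes the start of recording")
    return f"set{index + 1}"


recording_start = datetime(2000, 6, 7, 9, 0)
one_hour = timedelta(hours=1)
first = period_label(datetime(2000, 6, 7, 9, 30), recording_start, one_hour)
third = period_label(datetime(2000, 6, 7, 11, 15), recording_start, one_hour)
```

Coarse labels of this kind turn a continuous timestamp into a categorical factor, which suits methods that operate on categories.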

[0038] The above embodiment also describes an example that uses the presence or absence of an accent nucleus, the mora position, utterance date and time information, and the like as estimation factors, but this embodiment is not limited to this. Any estimation factor that can account for the influence of fluctuation in the utterance environment of the learning data may be used in addition to the utterance date and time information: for example, the speaker's gender, whether the learning data was obtained by uttering words or by uttering sentences, or whether it was obtained from read-aloud text or from conversational text. Increasing the variety of such estimation factors makes it possible to generate a more accurate prosody estimation model, and using such a model makes it possible to generate synthesized speech that is highly natural and does not sound awkward. The accent type and the number of morae may also be used as estimation factors.
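Additional categorical factors such as these are commonly turned into 0/1 indicator columns before a regression on categories is fitted. The sketch below shows that encoding; the function `one_hot`, the factor names, and the category values are hypothetical examples, and the source does not prescribe any particular encoding.

```python
from typing import Dict, List, Tuple


def one_hot(rows: List[Dict[str, object]]) -> Tuple[List[Tuple[str, str]], List[List[int]]]:
    """Return (factor, category) column labels and a 0/1 indicator matrix."""
    columns = sorted({(k, str(v)) for r in rows for k, v in r.items()})
    matrix = [[1 if k in r and str(r[k]) == v else 0 for k, v in columns]
              for r in rows]
    return columns, matrix


columns, matrix = one_hot([
    {"gender": "female", "style": "read"},
    {"gender": "male", "style": "dialogue"},
])
```

Each added factor contributes one column per category, so the amount of learning data needed grows with the number of factors.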

[0039] Needless to say, the object of the present invention is also achieved by supplying a system or apparatus with a storage medium (or recording medium) on which the program code of software realizing the functions of the above embodiment is recorded, and having a computer (or a CPU or MPU) of the system or apparatus read and execute the program code stored on the storage medium. In this case, the program code read from the storage medium itself realizes the functions of the above embodiment, and the storage medium storing the program code constitutes the present invention. It also goes without saying that the invention covers not only the case where the functions of the above embodiment are realized by the computer executing the program code it has read, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing on the basis of the instructions of the program code, and the functions of the above embodiment are realized by that processing.

[0040] Furthermore, it goes without saying that the invention also covers the case where the program code read from the storage medium is written into memory provided on a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on the function expansion card or in the function expansion unit performs part or all of the actual processing on the basis of the instructions of the program code, and the functions of the above embodiment are realized by that processing.

[0041]

[Effects of the invention] As described above, according to the present invention, a prosody estimation model that takes fluctuation in the utterance environment into account can be learned.

[0042] Further, according to the present invention, natural synthesized speech that does not sound awkward can be generated with the influence of fluctuation in the utterance environment suppressed.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the hardware configuration of the speech synthesis apparatus of this embodiment.

FIG. 2 is a flowchart explaining the procedure for learning the prosody estimation model in this embodiment.

FIG. 3 is a diagram showing an example of prosody information, estimation factors, and utterance date and time information in this embodiment.

FIG. 4 is a flowchart explaining the procedure of the speech synthesis processing in this embodiment.

FIG. 5 is a diagram showing an example of prosody information, estimation factors, and utterance date and time information in another embodiment.

Claims

[Claims]

1. A speech processing method comprising: an input step of inputting speech information; a first obtaining step of obtaining prosody information from the speech information; a second obtaining step of obtaining, as one of the estimation factors, information indicating a change in the utterance environment of the speech information; and a learning step of learning, using the prosody information and the estimation factors, a prosody estimation model for estimating the prosody information.

2. The speech processing method according to claim 1, wherein the first obtaining step obtains the prosody information for each phoneme.

3. The speech processing method according to claim 1, wherein the first obtaining step obtains any one of duration, fundamental frequency, and power as the prosody information.

4. The speech processing method according to claim 1, wherein the information indicating a change in the utterance environment of the speech information is information indicating the utterance date and time of the speech information.

5. The speech processing method according to claim 1, wherein the second obtaining step obtains accent information of the learning data as one of the estimation factors.

6. The speech processing method according to claim 1, wherein the second obtaining step obtains mora information of the learning data as one of the estimation factors.

7. A speech processing method comprising: an analysis step of analyzing character information; an obtaining step of obtaining speech units corresponding to the character information; a setting step of setting, as one of the estimation factors, information indicating a change in the utterance environment; and an estimation step of estimating prosody information of the speech units using the estimation factors and a prosody estimation model that estimates predetermined prosody information.

8. The speech processing method according to claim 7, wherein the prosody estimation model estimates any one of duration, fundamental frequency, and power.

9. The speech processing method according to claim 7, wherein the information indicating a change in the utterance environment is information indicating the utterance date and time of the speech units.

10. The speech processing method according to claim 7, wherein the setting step sets accent information of the character information as one of the estimation factors.

11. The speech processing method according to claim 7, wherein the setting step sets mora information of the character information as one of the estimation factors.

12. The speech processing method according to any one of claims 7 to 11, further comprising a synthesis step of synthesizing speech corresponding to the character information using the speech units obtained in the obtaining step and the prosody information estimated in the estimation step.

13. A speech processing apparatus comprising: input means for inputting speech information; first obtaining means for obtaining prosody information from the speech information; second obtaining means for obtaining, as one of the estimation factors, information indicating a change in the utterance environment of the speech information; and learning means for learning, using the prosody information and the estimation factors, a prosody estimation model for estimating the prosody information.

14. The speech processing apparatus according to claim 13, wherein the first obtaining means obtains the prosody information for each phoneme.

15. The speech processing apparatus according to claim 13, wherein the first obtaining means obtains any one of duration, fundamental frequency, and power as the prosody information.

16. The speech processing apparatus according to claim 13, wherein the information indicating a change in the utterance environment of the speech information is information indicating the utterance date and time of the speech information.

17. The speech processing apparatus according to claim 13, wherein the second obtaining means obtains accent information of the learning data as one of the estimation factors.

18. The speech processing apparatus according to claim 13, wherein the second obtaining means obtains mora information of the learning data as one of the estimation factors.

19. A speech processing apparatus comprising: analysis means for analyzing character information; obtaining means for obtaining speech units corresponding to the character information; setting means for setting, as one of the estimation factors, information indicating a change in the utterance environment; and estimation means for estimating prosody information of the speech units using the estimation factors and a prosody estimation model that estimates predetermined prosody information.

20. The speech processing apparatus according to claim 19, wherein the prosody estimation model estimates any one of duration, fundamental frequency, and power.

21. The speech processing apparatus according to claim 19, wherein the information indicating a change in the utterance environment is information indicating the utterance date and time of the speech units.

22. The speech processing apparatus according to claim 19, wherein the setting means sets accent information of the character information as one of the estimation factors.

23. The speech processing apparatus according to claim 19, wherein the setting means sets mora information of the character information as one of the estimation factors.

24. The speech processing apparatus according to any one of claims 19 to 22, further comprising synthesis means for synthesizing speech corresponding to the character information using the speech units obtained by the obtaining means and the prosody information estimated by the estimation means.

25. A storage medium storing a control program for causing a computer to carry out the speech processing method according to any one of claims 1 to 12.