JP6442982B2

JP6442982B2 - Fundamental frequency adjustment device, method, and program, and speech synthesis device, method, and program

Info

Publication number: JP6442982B2
Application number: JP2014219547A
Authority: JP
Inventors: 淳哉斎藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-10-28
Filing date: 2014-10-28
Publication date: 2018-12-26
Anticipated expiration: 2034-10-28
Also published as: JP2016085408A

Description

The disclosed technology relates to a fundamental frequency adjustment device, method, and program, and to a speech synthesis device, method, and program.

To output speech synthesized from text as the speech the user expects, there are techniques that adjust the accent strength of the speech according to the user's designation. Accent is defined by the pitch of the voice, and pitch is determined by the fundamental frequency (F0), so the accent strength is adjusted by adjusting the value of the fundamental frequency.

In the related art for adjusting the fundamental frequency, the fundamental frequency at the center of each vowel is estimated from the linguistic information of the sentence using a statistical method such as Quantification Theory Type I. A slope line of the fundamental frequency from the first vowel of a word to its last vowel is obtained. For each vowel, the fundamental frequency is adjusted by multiplying the fundamental frequency component that exceeds the slope line by a value corresponding to the accent strength, and the fundamental frequency between the adjusted values is filled in by linear interpolation.
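The related-art procedure described above can be sketched as follows. This is a minimal illustration, not the cited publication's implementation: the function names, the `gain` parameter (standing in for "a value corresponding to the accent strength"), and the fixed number of interpolated frames between vowel centers are all assumptions.

```python
def adjust_related_art(vowel_f0, gain):
    """Scale the part of each vowel-centre F0 that lies above the slope line."""
    n = len(vowel_f0)
    first, last = vowel_f0[0], vowel_f0[-1]
    adjusted = []
    for i, f0 in enumerate(vowel_f0):
        # Slope line from the word-initial vowel to the word-final vowel.
        slope = first + (last - first) * i / (n - 1) if n > 1 else first
        excess = f0 - slope
        # Only the component above the slope line is multiplied.
        adjusted.append(slope + excess * gain if excess > 0 else f0)
    return adjusted


def interpolate(points, frames_between):
    """Linearly interpolate between consecutive adjusted vowel-centre values."""
    out = []
    for a, b in zip(points, points[1:]):
        for k in range(frames_between):
            out.append(a + (b - a) * k / frames_between)
    out.append(points[-1])
    return out


centres = adjust_related_art([100.0, 130.0, 90.0], gain=2.0)
contour = interpolate(centres, frames_between=2)
```

The resulting contour is piecewise linear, which is exactly the property the disclosure criticizes: no small fluctuations survive between the vowel centers.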

JP 2001-249677 A

Keiichi Tokuda, "Basics of Speech Synthesis by HMM", IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers, October 19, 2000, pp. 43-50
Takao Kobayashi et al., "Trends in Corpus-Based Speech Synthesis Technology [IV]: HMM Speech Synthesis Method", Journal of the Institute of Electronics, Information and Communication Engineers, 2004, Vol. 87, No. 4, pp. 322-327

In the related art, the adjusted fundamental frequency of the speech is a collection of simple line segments rather than a complex contour containing microprosody, the small fluctuations characteristic of the human voice, so the naturalness of the speech is impaired. By using hidden Markov model (HMM) data for emphasis, it is possible to adjust the fundamental frequency of speech while preserving microprosody. However, emphasis HMM data is difficult to prepare.

In one aspect, the disclosed technology aims to adjust the fundamental frequency of speech while preserving microprosody, without using emphasis HMM data.

In the disclosed technology, a fundamental frequency pattern estimation unit estimates the fundamental frequency pattern of the speech corresponding to a text, using information of a hidden Markov model corresponding to the text. A fundamental frequency changing unit changes the fundamental frequency values of a designated portion of the estimated pattern to values corresponding to a designated accent strength. A re-estimation unit uses the information of the hidden Markov model to re-estimate a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency values of the designated portion are the changed values.

In one aspect, the disclosed technology has the effect of adjusting the fundamental frequency of speech while preserving microprosody, without using emphasis HMM data.

FIG. 1 is a block diagram showing an example of the main functions of a computer according to the embodiment.
FIG. 2 is a conceptual diagram showing an example of a hidden Markov model database (HMM DB) according to the embodiment.
FIG. 3 is a block diagram showing an example of the configuration of the electrical system of the computer according to the embodiment.
FIG. 4 is a flowchart showing an example of the flow of the fundamental frequency adjustment process according to the embodiment.
FIG. 5 is a flowchart showing an example of the flow of the fundamental frequency (F0) pattern estimation process according to the embodiment.
FIG. 6A is a conceptual diagram showing an example of a user interface according to the embodiment.
FIG. 6B is a conceptual diagram showing an example of a user interface according to the embodiment.
FIG. 7 is a conceptual diagram showing an example of a portion of a hidden Markov model (HMM) corresponding to a sentence according to the embodiment.
FIG. 8A is a conceptual diagram showing an example of an F0 pattern estimated using the HMM.
FIG. 8B is a conceptual diagram showing an example of an F0 pattern in which F0 has been partially changed.
FIG. 8C is a conceptual diagram showing an example of an F0 pattern re-estimated using the HMM.
FIG. 9 is a flowchart showing an example of the flow of the F0 pattern partial change process according to the embodiment.
FIG. 10 is a flowchart showing an example of the flow of the F0 pattern re-estimation process according to the embodiment.

Hereinafter, an example of an embodiment of the disclosed technology is described in detail with reference to the drawings. The following description uses a computer, which is a general-purpose device, as an example of the fundamental frequency adjustment device according to the disclosed technology, but the disclosed technology is not limited to this. The disclosed technology can also be applied to, for example, a dedicated device for fundamental frequency adjustment, or a board on which a device for fundamental frequency adjustment is mounted.

As an example, the computer 10 shown in FIG. 1 includes a detection unit 12, a language processing unit 14, a parameter estimation unit 16, an accent strength-fundamental frequency (F0) conversion unit 18, an F0 designation unit 20, an F0 re-estimation unit 22, and an analysis-synthesis unit 24. The computer 10 shown in FIG. 1 also includes, as an example, a hidden Markov model database (HMM DB) 30.

The detection unit 12 detects the Japanese text entered by the user in the user interface, the designation of the portion whose accent is to be changed, and the designation of the accent strength for that portion. The language processing unit 14 processes the detected Japanese text to obtain linguistic information. The parameter estimation unit 16 includes an F0 pattern estimation unit and a mel-cepstrum pattern estimation unit. Using hidden Markov models (HMMs), which are the processing units of speech synthesis, the parameter estimation unit 16 generates an HMM corresponding to the sentence expressed by the Japanese text, and uses that sentence HMM to estimate an F0 pattern and a mel-cepstrum pattern as output sequences. The accent strength-F0 conversion unit 18 converts the designated accent strength into higher or lower F0 values. The F0 designation unit 20 changes the designated portion of the estimated F0 pattern to the converted F0. The accent strength-F0 conversion unit 18 and the F0 designation unit 20 are an example of the fundamental frequency changing unit of the disclosed technology. The F0 re-estimation unit 22 re-estimates the unchanged portion of the F0 pattern using the same HMM that was used to estimate the F0 pattern. The analysis-synthesis unit 24 synthesizes a speech signal using the re-estimated F0 pattern and the estimated mel-cepstrum pattern.

FIG. 2 shows a conceptual diagram of the HMM DB 30. Trained context-dependent HMMs 32 are stored in the HMM DB 30 in advance as the processing-unit HMMs. A context-dependent HMM 32 is a model that takes the context of a phoneme into account. The acoustic characteristics of a phoneme change greatly under the influence of its context, and context-dependent HMMs are used as the processing units of speech synthesis to deal with this problem. A phoneme has a plurality of context-dependent HMMs 32, one for each context in which the phoneme occurs. The context includes, for example, the preceding phoneme, the current phoneme, the following phoneme, the mora position of the current phoneme within its accent phrase, the preceding part of speech, the current part of speech, and the following part of speech.

As shown in FIG. 3 as an example, the computer 10 includes a CPU (Central Processing Unit) 60, a primary storage unit 62, a secondary storage unit 64, an external interface 70, a keyboard 72, a mouse 74, a display 76, and a speaker 78. The CPU 60, the primary storage unit 62, the secondary storage unit 64, the external interface 70, the keyboard 72, the mouse 74, the display 76, and the speaker 78 are connected to one another via a bus 80.

The keyboard 72 and the mouse 74 accept user operations and input information to the computer 10. The display 76 and the speaker 78 present information to the user. External devices are connected to the external interface 70, which handles the exchange of various information between the external devices and the CPU 60.

The primary storage unit 62 is a volatile memory such as a RAM (Random Access Memory). The secondary storage unit 64 is a non-volatile memory such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive).

As an example, the secondary storage unit 64 stores a fundamental frequency adjustment program 66 that includes a detection subprogram 66A, a language processing subprogram 66B, a parameter estimation subprogram 66C, an accent strength-F0 conversion subprogram 66D, an F0 designation subprogram 66E, an F0 re-estimation subprogram 66F, and an analysis-synthesis subprogram 66G. The secondary storage unit 64 also has an HMM DB storage area 68 in which the information constituting the HMM DB 30 is stored.

The CPU 60 reads the detection subprogram 66A, the language processing subprogram 66B, the parameter estimation subprogram 66C, the accent strength-F0 conversion subprogram 66D, the F0 designation subprogram 66E, the F0 re-estimation subprogram 66F, and the analysis-synthesis subprogram 66G from the secondary storage unit 64 and loads them into the primary storage unit 62. By executing the detection subprogram 66A, the CPU 60 operates as the detection unit 12 shown in FIG. 1. By executing the language processing subprogram 66B, the CPU 60 operates as the language processing unit 14 shown in FIG. 1. By executing the parameter estimation subprogram 66C, the CPU 60 operates as the parameter estimation unit 16 shown in FIG. 1. By executing the accent strength-F0 conversion subprogram 66D, the CPU 60 operates as the accent strength-F0 conversion unit 18 shown in FIG. 1. By executing the F0 designation subprogram 66E, the CPU 60 operates as the F0 designation unit 20 shown in FIG. 1. By executing the F0 re-estimation subprogram 66F, the CPU 60 operates as the F0 re-estimation unit 22 shown in FIG. 1. By executing the analysis-synthesis subprogram 66G, the CPU 60 operates as the analysis-synthesis unit 24 shown in FIG. 1. The CPU 60 also reads the information stored in the HMM DB storage area 68 and loads it into the primary storage unit 62 as, for example, the HMM DB 30 shown in FIG. 2. As a result, the computer 10 that has executed the fundamental frequency adjustment program 66 functions as a fundamental frequency adjustment device.

Next, the operation of the computer 10 serving as the fundamental frequency adjustment device is described. When the fundamental frequency adjustment program 66 is started, for example by a user instruction, the fundamental frequency adjustment process shown in FIG. 4 begins. In the fundamental frequency adjustment process illustrated in FIG. 4, F0 pattern estimation is performed first, in step 100. FIG. 5 illustrates the details of the F0 pattern estimation of step 100.

In step 102, the CPU 60 displays the user interface 91 illustrated in FIG. 6A on the display 76. The user interface 91 includes a text box 92 into which the Japanese text and the accent strength designation are entered, and a "Speech synthesis" button 94 that is selected to instruct speech synthesis. The user interface 91 also includes a "Re-enter" button 95 that is selected to re-enter the Japanese text when speech synthesis is to be executed again, and an "End" button 96 that is selected to end the fundamental frequency adjustment process. The CPU 60 detects the Japanese text and the accent strength designation that the user has entered in the text box 92 of the user interface 91 using the keyboard 72. In the example of FIG. 6A, the Japanese text is "今日はいい天気です" ("The weather is nice today"), and the accent strength designated for the portion "今日は" ("today") is "strong". That is, the CPU 60 detects the sentence to be synthesized, the portion of the sentence whose accent is to be changed, and the accent strength of that portion, and the detected sentence, portion, and accent strength are displayed in the text box 92.

When the user clicks the "Speech synthesis" button 94 with the mouse 74, the CPU 60, in step 104, analyzes the sentence expressed by the Japanese text detected in step 102 and obtains the linguistic information of the sentence, such as the reading, accent phrase positions, accent phrase boundaries, and parts of speech.

Next, in step 106, the CPU 60 estimates the F0 pattern and the mel-cepstrum pattern of the sentence expressed by the Japanese text, using the HMM corresponding to that sentence. Specifically, the linguistic information obtained in step 104 is used as context to convert the Japanese text into context-dependent labels, and the context-dependent HMM 32 corresponding to each label is selected from the HMM DB 30. The selected context-dependent HMMs 32 are concatenated in order to generate the HMM corresponding to the sentence. FIG. 7 illustrates one frame of the HMM corresponding to the sentence. One frame corresponds to, for example, 8 ms of speech.
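The label lookup and concatenation in step 106 can be pictured with a small data-structure sketch. The class names, field names, and numeric values below are hypothetical and chosen only for illustration; the source does not specify how the HMM DB 30 is keyed or what a context-dependent HMM 32 stores beyond the mean vector and covariance matrix of each state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextLabel:
    """Context of one phoneme (illustrative subset of the listed contexts)."""
    prev_phoneme: str
    phoneme: str
    next_phoneme: str
    mora_position: int
    part_of_speech: str


@dataclass
class ContextDependentHMM:
    """Per-state mean and variance of the F0 output distribution."""
    means: list
    variances: list


def build_sentence_hmm(labels, hmm_db):
    """Select one HMM per label and concatenate their states in order."""
    means, variances = [], []
    for label in labels:
        hmm = hmm_db[label]  # lookup by full context
        means.extend(hmm.means)
        variances.extend(hmm.variances)
    return ContextDependentHMM(means, variances)


# Toy database with made-up statistics.
hmm_db = {
    ContextLabel("sil", "k", "y", 1, "noun"): ContextDependentHMM([120.0, 125.0], [4.0, 4.0]),
    ContextLabel("k", "y", "o", 1, "noun"): ContextDependentHMM([130.0], [9.0]),
}
labels = list(hmm_db)
sentence_hmm = build_sentence_hmm(labels, hmm_db)
```

Using a frozen dataclass as the dictionary key makes two labels with identical context fields resolve to the same stored model, which matches the idea that one phoneme has several models distinguished only by context.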

In FIG. 7, reference numeral 35 denotes the HMM corresponding to the sentence, 36 a state, 37 the state sequence, 38 a mean vector, and 39 a covariance matrix. Let Q denote the state sequence of the sentence HMM 35 when time t runs from 1 to T and the state transitions through q_1, ..., q_T. μ_q1, ..., μ_qT are the mean vectors of the output probability distributions corresponding to the states q_1, ..., q_T, and U_q1, ..., U_qT are the covariance matrices of the output probability distributions corresponding to the states q_1, ..., q_T. The mean vectors and covariance matrices, which are the information of the sentence HMM 35, are each assigned to the corresponding state. Let M, obtained by transposing each of the mean vectors μ_q1, ..., μ_qT, arranging them, and transposing the result, be given by Equation (1); let U, the transpose of the diagonal matrix whose diagonal components are the covariance matrices U_q1, ..., U_qT, be given by Equation (2); and let o_1, ..., o_T be the output vectors corresponding to the states q_1, ..., q_T. With the output sequence O, obtained by transposing each output vector, arranging them, and transposing the result, given by Equation (3), the O that maximizes Equation (4) is sought, where Equation (4) is the probability that the output sequence is O when the state sequence is Q and the parameters of the sentence HMM 35 are λ. The F0 pattern can be estimated by finding the O that maximizes Equation (4). Here λ = (A, B, π), where A is the state transition probability, B is the output probability distribution, and π is the initial state probability. In the general form, (2π)^3T would be (2π)^(3T × number of parameter dimensions), but because the fundamental frequency parameter has one dimension here, (2π)^3T is used. The probability that the state sequence Q transitions through q_1, ..., q_T and the output vectors O = {o_1, ..., o_T} are output is given by A × B. In the equations below, bold type denotes a vector or matrix, and a superscript T at the upper right of a vector or matrix denotes transposition.

[Equations (1) to (4) are not reproduced in this text.]
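Equations (1) to (4) do not survive in this text. Based on the verbal definitions in the preceding paragraph and on the usual formulation of HMM-based parameter generation, they plausibly take the following form. This is a reconstruction under those assumptions, not the typeset equations of the publication; the transpose is written as ⊤ here to avoid confusion with the frame count T.

```latex
\begin{align}
\boldsymbol{M} &= \left[\boldsymbol{\mu}_{q_1}^{\top}, \ldots, \boldsymbol{\mu}_{q_T}^{\top}\right]^{\top} \tag{1}\\
\boldsymbol{U} &= \operatorname{diag}\left[\boldsymbol{U}_{q_1}, \ldots, \boldsymbol{U}_{q_T}\right] \tag{2}\\
\boldsymbol{O} &= \left[\boldsymbol{o}_1^{\top}, \ldots, \boldsymbol{o}_T^{\top}\right]^{\top} \tag{3}\\
P(\boldsymbol{O}\mid Q,\lambda) &= \frac{1}{\sqrt{(2\pi)^{3T}\lvert\boldsymbol{U}\rvert}}
  \exp\!\left(-\tfrac{1}{2}(\boldsymbol{O}-\boldsymbol{M})^{\top}\boldsymbol{U}^{-1}(\boldsymbol{O}-\boldsymbol{M})\right) \tag{4}
\end{align}
```

Because each covariance matrix is symmetric, the block-diagonal U equals its own transpose, so the transposition mentioned in the text does not change Equation (2).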

However, if the F0 pattern is estimated simply by finding the O that maximizes Equation (4), F0 changes discontinuously whenever the state transitions. To prevent such discontinuities, the dynamic feature vectors Δc_t and Δ^2c_t of F0 are taken into account. If the static features of an F0 pattern free of discontinuous changes are given by Equation (5), the dynamic feature vector Δc_t is given by Equation (6) and Δ^2c_t by Equation (7).

[Equations (5) to (7) are not reproduced in this text.]

w_t^(1)(τ) and w_t^(2)(τ) are weighting coefficients for computing the dynamic features. L^(1) and L^(2) express, in units of the sampling time τ, the time span before and after time t that is taken into account when computing Δc_t and Δ^2c_t at time t, respectively.
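Equations (5) to (7) do not survive in this text. Given the definitions of the weights w_t^(1)(τ), w_t^(2)(τ) and the spans L^(1), L^(2), a plausible form is the following. The symmetric summation limits are an assumption; the publication may use different lower and upper limits.

```latex
\begin{align}
\boldsymbol{C} &= [c_1, c_2, \ldots, c_T]^{\top} \tag{5}\\
\Delta c_t &= \sum_{\tau=-L^{(1)}}^{L^{(1)}} w_t^{(1)}(\tau)\, c_{t+\tau} \tag{6}\\
\Delta^{2} c_t &= \sum_{\tau=-L^{(2)}}^{L^{(2)}} w_t^{(2)}(\tau)\, c_{t+\tau} \tag{7}
\end{align}
```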

Expressing the relationships of Equations (5) to (7) in matrix form gives Equation (8). The static feature C is transformed by the transformation matrix W into the output vector O, which includes the dynamic features. The transformation matrix W is given by Equation (9), and the weights w_t corresponding to c_t, Δc_t, and Δ^2c_t are given by Equation (10).

[Equations (8) to (10) are not reproduced in this text.]
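The relationship O = WC can be made concrete with a small sketch that builds W row by row. The window weights (-0.5, 0, 0.5) and (1, -2, 1) and the clamping of indices at the sequence edges are assumptions chosen for illustration; the source does not state the values of w_t^(1)(τ), w_t^(2)(τ), L^(1), or L^(2).

```python
DELTA_WIN = (-0.5, 0.0, 0.5)   # assumed weights w^(1)(tau) for tau = -1, 0, 1
DELTA2_WIN = (1.0, -2.0, 1.0)  # assumed weights w^(2)(tau) for tau = -1, 0, 1


def build_w(num_frames):
    """Return the 3T x T matrix W as a list of rows.

    Each frame t contributes three rows: static, delta, delta-delta.
    """
    rows = []
    for t in range(num_frames):
        static = [0.0] * num_frames
        static[t] = 1.0
        rows.append(static)
        for window in (DELTA_WIN, DELTA2_WIN):
            row = [0.0] * num_frames
            for offset, weight in zip((-1, 0, 1), window):
                # Clamp at the edges so a constant contour has zero dynamics.
                col = min(max(t + offset, 0), num_frames - 1)
                row[col] += weight
            rows.append(row)
    return rows


def apply_w(w, c):
    """Compute O = W C for a static F0 sequence c."""
    return [sum(w_ij * c_j for w_ij, c_j in zip(row, c)) for row in w]


c = [1.0, 2.0, 4.0]
o = apply_w(build_w(len(c)), c)
# o holds [c_t, delta c_t, delta^2 c_t] for t = 0, 1, 2 in that order.
```

For the middle frame this gives the static value 2.0, the delta 0.5 × (4.0 - 1.0) = 1.5, and the second-order delta 1.0 - 2 × 2.0 + 4.0 = 1.0.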

Taking the logarithm transforms Equation (4) into Equation (11), and P(O | Q, λ) is maximized with respect to C for the state sequence Q. That is, the optimal C is obtained by solving Equation (12), which is obtained by partial differentiation with respect to C. The logarithm of Equation (4) is taken because the logarithm is a strictly increasing function, so the value of C that maximizes P is the same as the value of C that maximizes log P. Replacing O with WC, as in Equation (8), turns Equation (12) into Equation (13).

[Equations (11) to (13) are not reproduced in this text.]

Expression (14), which is contained in Equation (13), is rewritten as shown in Equation (15).

[Equations (14) and (15) are not reproduced in this text.]

Since Equations (16) and (17) hold in general, the first term of Equation (15) is rewritten using Equation (16) as shown in Equation (18), and the second term of Equation (15) is rewritten using Equation (17) as shown in Equation (19). With these rewrites, Equation (12) becomes Equation (20), and Equation (21) holds.

[Equations (16) to (21) are not reproduced in this text.]
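Since Equations (11) to (21) do not survive in this text, the derivation they describe can be sketched as follows. It follows the usual maximum-likelihood parameter generation argument and is an assumption about the intermediate steps, not a transcription; in particular, the final line is presumably what the text calls Equation (21), and the two matrix-derivative identities in the third line are presumably Equations (16) and (17).

```latex
\begin{align*}
\log P(\boldsymbol{O}\mid Q,\lambda)
  &= -\tfrac{1}{2}(\boldsymbol{O}-\boldsymbol{M})^{\top}\boldsymbol{U}^{-1}(\boldsymbol{O}-\boldsymbol{M})
     -\tfrac{1}{2}\log\lvert\boldsymbol{U}\rvert - \tfrac{3T}{2}\log 2\pi \\
(\boldsymbol{W}\boldsymbol{C}-\boldsymbol{M})^{\top}\boldsymbol{U}^{-1}(\boldsymbol{W}\boldsymbol{C}-\boldsymbol{M})
  &= \boldsymbol{C}^{\top}\boldsymbol{W}^{\top}\boldsymbol{U}^{-1}\boldsymbol{W}\boldsymbol{C}
     - 2\,\boldsymbol{M}^{\top}\boldsymbol{U}^{-1}\boldsymbol{W}\boldsymbol{C}
     + \boldsymbol{M}^{\top}\boldsymbol{U}^{-1}\boldsymbol{M} \\
\frac{\partial}{\partial \boldsymbol{x}}\,\boldsymbol{x}^{\top}\boldsymbol{A}\boldsymbol{x}
  &= (\boldsymbol{A}+\boldsymbol{A}^{\top})\boldsymbol{x},
\qquad
\frac{\partial}{\partial \boldsymbol{x}}\,\boldsymbol{a}^{\top}\boldsymbol{x} = \boldsymbol{a} \\
\frac{\partial \log P}{\partial \boldsymbol{C}}
  &= -\boldsymbol{W}^{\top}\boldsymbol{U}^{-1}\boldsymbol{W}\boldsymbol{C}
     + \boldsymbol{W}^{\top}\boldsymbol{U}^{-1}\boldsymbol{M} = \boldsymbol{0} \\
\boldsymbol{W}^{\top}\boldsymbol{U}^{-1}\boldsymbol{W}\boldsymbol{C}
  &= \boldsymbol{W}^{\top}\boldsymbol{U}^{-1}\boldsymbol{M}
\end{align*}
```

The second line uses the symmetry of U, and the fourth line uses the symmetry of WᵀU⁻¹W, so that the quadratic term differentiates to twice that matrix times C.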

Solving Equation (21) yields the static feature C of the F0 pattern. Equation (21) can be solved using, for example, Cholesky decomposition or QR decomposition.
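As a rough illustration of solving for C with a Cholesky decomposition, the sketch below assumes that Equation (21) has the standard normal-equation form WᵀU⁻¹WC = WᵀU⁻¹M, that U is diagonal, and that the delta windows are (-0.5, 0, 0.5) and (1, -2, 1) with edge clamping. All function names are hypothetical, and none of these choices is stated in the source.

```python
import math


def build_w(num_frames):
    """3T x T window matrix: static, delta, delta-delta row for each frame."""
    windows = ((0.0, 1.0, 0.0), (-0.5, 0.0, 0.5), (1.0, -2.0, 1.0))  # assumed
    rows = []
    for t in range(num_frames):
        for window in windows:
            row = [0.0] * num_frames
            for offset, weight in zip((-1, 0, 1), window):
                col = min(max(t + offset, 0), num_frames - 1)  # clamp at edges
                row[col] += weight
            rows.append(row)
    return rows


def normal_equations(w, means, variances):
    """Return A = W^T U^-1 W and b = W^T U^-1 M for a diagonal U."""
    n = len(w[0])
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for row, mean, var in zip(w, means, variances):
        for i, w_i in enumerate(row):
            if w_i == 0.0:
                continue
            b[i] += w_i * mean / var
            for j, w_j in enumerate(row):
                a[i][j] += w_i * w_j / var
    return a, b


def cholesky_solve(a, b):
    """Solve A x = b for symmetric positive definite A via A = L L^T."""
    n = len(b)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def generate_f0(means, variances):
    """means/variances: [static, delta, delta2] statistics for each frame."""
    w = build_w(len(means) // 3)
    a, b = normal_equations(w, means, variances)
    return cholesky_solve(a, b)


# Constant static mean with zero dynamic means: the contour stays flat.
flat = generate_f0([5.0, 0.0, 0.0] * 4, [1.0, 0.1, 0.1] * 4)
# A step in the static means is smoothed by the dynamic-feature terms.
step = generate_f0([0.0, 0.0, 0.0] * 2 + [10.0, 0.0, 0.0] * 2, [1.0, 0.1, 0.1] * 4)
```

A production system would exploit the band structure of WᵀU⁻¹W instead of the dense factorization used here; the dense version is kept only because it is short.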

For example, when the Japanese text "富士通では…" ("At Fujitsu ...") is input, the F0 pattern shown by the curve drawn as thick line 41 in FIG. 8A is estimated. The portions enclosed by the broken-line circles 42 are small fluctuations of the F0 pattern called microprosody, which contribute to the naturalness of the speech.

The mel-cepstrum pattern can be estimated in the same manner as the F0 pattern.

Next, in step 200 of FIG. 4, the F0 values included in the F0 pattern estimated in step 106 are partially changed. FIG. 9 illustrates the details of the F0 pattern partial change process of FIG. 4. In step 202, the CPU 60 changes, according to a predetermined rule, the F0 at the center of each mora included in the accent phrase designated by the accent strength designation detected in step 102.

For example, when the accent type information of the accent phrase (type 0: low-high-high, type 1: high-low-low, and so on) is used to make the accent strength of the accent phrase strong, the rule is defined so that the difference between high and low F0 values within the accent phrase becomes larger. Specifically, it is determined in advance that, for example, several hertz are added to the F0 at the center of each accent-high mora of the accent phrase and several hertz are subtracted from the F0 at the center of each accent-low mora. For example, as shown in FIG. 8B, when the accent strength of "富士通では" (read "fu-ji-tsu-u-de-wa") is designated as strong by the symbol "'" attached to "ji", 5 Hz, for example, is added to the F0 at the center of the accent-high mora "ji", and 5 Hz, for example, is subtracted from the F0 at the center of each of the other moras, which are accent-low. In FIG. 8B, thick line 43H marks the portion to which several hertz were added, and thick lines 43L mark the portions from which several hertz were subtracted. Similarly, when the accent type information of the accent phrase is used to make the accent strength weak, the rule may be defined so that the difference between high and low F0 values within the accent phrase becomes smaller.
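The rule for a strong accent can be written down in a few lines. In this sketch the function name, the `(centre_frame, level)` representation of moras, and the F0 values are hypothetical. The 5 Hz step is the example value from the text. The handling of "weak" (simply reversing the sign) is an assumption, since the source only says that the high-low difference should become smaller.

```python
def change_mora_centres(f0, moras, strength, step_hz=5.0):
    """Return {frame index: new F0} for the centre frame of each mora.

    f0       : list of F0 values in Hz, one per frame
    moras    : list of (centre_frame, level) with level "high" or "low"
    strength : "strong" widens the high-low gap, "weak" narrows it
    """
    direction = {"strong": 1.0, "weak": -1.0}[strength]
    changed = {}
    for centre, level in moras:
        offset = step_hz if level == "high" else -step_hz
        changed[centre] = f0[centre] + direction * offset
    return changed


f0 = [110.0, 112.0, 118.0, 125.0, 121.0, 114.0]
moras = [(1, "low"), (3, "high"), (5, "low")]
changed = change_mora_centres(f0, moras, "strong")
# Frames that were not overwritten; these are the ones to re-estimate later.
unchanged_frames = [t for t in range(len(f0)) if t not in changed]
```

Only the mora centers are touched, so every other frame remains free to be re-estimated from the HMM statistics.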

In step 204, the CPU 60 overwrites the corresponding elements c_t of the F0 pattern C estimated in step 106 with the F0 values changed in step 202. In step 206, the CPU 60 records, in the primary storage unit 62, the times t of the elements c_t of C that were not overwritten in step 204.

Next, in step 300 of FIG. 4, the F0 of the portions that were not changed in step 200 is re-estimated. FIG. 10 illustrates the details of the F0 pattern re-estimation process of FIG. 4. In step 302, the CPU 60 solves Equation (22) for expression (23). Here, expression (24) denotes the matrix consisting only of the rows I of a matrix A. I = {I_1, I_2, ...}, where I_1, I_2, ... correspond to the rows of C that were not overwritten in step 204, that is, the unchanged rows of C indicated by the times t of the elements c_t recorded in the primary storage unit 62 in step 206.

[Equations (22) to (24) are not reproduced in this text.]

To explain the process of solving Equation (22) for expression (23), define expressions (25) and (26); Equation (22) can then be regarded as Equation (27).

[Equations (25) to (27) are not reproduced in this text.]

Equation (27) has the form of a system of simultaneous linear equations in C, so elementary matrix operations can be used to reorder the matrix elements without changing the solution, giving Equation (28). Expression (29) denotes the rows changed in step 200. The matrix elements are reordered so that C is split into expressions (30) and (31).

[Equations (28) to (31) are not reproduced in this text.]

Considering expression (32) as split into expressions (33) and (34), Equation (28) becomes Equation (35). Expression (36) denotes the matrix containing only the columns I of a matrix A.

[Equations (32) to (36) are not reproduced in this text.]

Rearranging Equation (35) gives Equation (37), which makes it possible to solve for expression (39) as shown in Equation (38).

[Equations (37) to (39) are not reproduced in this text.]
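Although Equations (22) to (39) do not survive in this text, the described steps (keep only the rows I of the linear system, split the columns into unchanged and changed parts, and move the changed part to the right-hand side) are consistent with solving A_II C_I = b_I - A_IJ C_J, where A = WᵀU⁻¹W, b = WᵀU⁻¹M, I is the set of unchanged frames, and J is the set of changed frames. The sketch below implements that reading; the symbols A_II, A_IJ, J, the function names, and the toy matrix are assumptions, and Gaussian elimination is used here only for brevity.

```python
def solve_linear(a, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def reestimate(a, b, fixed):
    """Solve rows I of A C = b for the free entries C_I, holding C_J fixed.

    a, b  : A = W^T U^-1 W and b = W^T U^-1 M from the first estimation
    fixed : dict {frame index: changed F0 value} (the set J)
    """
    n = len(b)
    free = [i for i in range(n) if i not in fixed]  # the set I
    a_ii = [[a[i][k] for k in free] for i in free]
    rhs = [b[i] - sum(a[i][j] * v for j, v in fixed.items()) for i in free]
    c_free = solve_linear(a_ii, rhs)
    c = [0.0] * n
    for i, value in zip(free, c_free):
        c[i] = value
    for j, value in fixed.items():
        c[j] = value  # merge the changed values with the re-estimated ones
    return c


# Toy system whose unconstrained solution is [1, 2, 3].
a = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [0.0, 0.0, 4.0]
c = reestimate(a, b, {1: 5.0})
```

Raising the middle value from 2 to 5 pulls both neighbors upward (to 2.5 and 4.5) through the off-diagonal coupling. In the real system that coupling comes from the dynamic-feature rows of W, which is how the unchanged frames adapt smoothly to the changed ones.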

In step 304, the CPU 60 merges the F0 pattern estimated in step 302 with the F0 values changed in step 202. FIG. 8C illustrates the merged F0 pattern. The thin broken line 47 shows the F0 pattern before re-estimation (see also FIG. 8B), the thick line 44 shows the re-estimated F0 pattern, and the thin solid lines 43L and 43H show the partially changed F0. The curve formed by joining the thick line 44 and the thin solid lines 43L and 43H is the merged F0 pattern. The F0 values other than the designated ones are re-estimated using the mean vectors and covariance matrices of the HMM with which F0 was originally estimated with the dynamic features of F0 introduced. As a result, in the merged F0 pattern the accent strength is adjusted and the microprosody is preserved in the portions enclosed by the broken-line circles 42.

In step 402 of FIG. 4, the CPU 60 synthesizes a speech signal using the F0 pattern integrated in step 304 and the mel-cepstrum pattern estimated in step 106. The speech signal is synthesized using, for example, an analysis-synthesis method, which models the speech production process and synthesizes speech from its feature parameters.

In step 404, the CPU 60 causes the speaker 78 to output speech using the speech signal synthesized in step 402.

In step 406, when the CPU 60 detects that the user has clicked the end button 96 of the user interface 91 shown in FIG. 6A with the mouse 74, the CPU 60 ends the fundamental frequency adjustment process. When the CPU 60 instead detects in step 406 that the user has clicked the re-input button 95 with the mouse 74, the process returns to step 100 and the fundamental frequency adjustment process continues. The user can correct the Japanese notation and the accent strength designation in the text box 92 of the user interface 91 using the keyboard 72.

Although an example in which the user inputs the Japanese notation and the accent strength designation with the keyboard 72 in step 102 has been described, the disclosed technology is not limited to this. For example, the Japanese notation and the accent strength designation may be stored in a file in advance, and the CPU 60 may read them from that file. A notation in another language, such as English, may be used instead of the Japanese notation.

Although an example in which an HMM corresponding to a sentence is generated in step 106 has been described, the disclosed technology is not limited to this. The HMM corresponding to a sentence is one example of the HMM corresponding to the text in the disclosed technology. The HMM corresponding to the text may be, for example, an HMM corresponding to a phrase or a word.

Although an example in which the F0 pattern is estimated in step 106 with the dynamic features of the F0 pattern introduced has been described, the disclosed technology is not limited to this. The F0 pattern may be estimated without considering its dynamic features.

Although an example in which the F0 at the center of a mora, which is the representative portion, is changed in step 202 has been described, the disclosed technology is not limited to this. For example, a syllable, a phoneme, or a vowel may be the target of the change instead of a mora, and the same effect is obtained as when a mora is the target. The representative portion may be the head or the tail of the mora instead of its center, and the same effect is obtained as when the center is the representative portion. Furthermore, instead of the F0 corresponding to a single c_t of the mora, the F0 values corresponding to a plurality of consecutive c_t of the mora may be changed, and the same effect is obtained as when the F0 corresponding to a single c_t is changed.
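
The choice of representative portion described above can be sketched as a small helper that returns the frame indices t whose c_t would be changed. The function name `representative_frames`, the half-open frame range, and the `position` and `width` parameters are hypothetical and serve only to illustrate the center, head, and tail variants and the variant with several consecutive frames.

```python
def representative_frames(start, end, position="center", width=1):
    """Return the frame indices of the representative portion of one unit.

    start, end: first frame and one-past-last frame of a mora (or syllable,
    phoneme, or vowel). width is the number of consecutive frames to change.
    """
    length = end - start
    if length <= 0:
        raise ValueError("the unit must contain at least one frame")
    if not 1 <= width <= length:
        raise ValueError("width must be between 1 and the unit length")
    if position == "head":
        first = start
    elif position == "tail":
        first = end - width
    elif position == "center":
        # Center the window; ties are resolved toward the earlier frame.
        first = start + (length - width) // 2
    else:
        raise ValueError("position must be 'head', 'center', or 'tail'")
    return list(range(first, first + width))
```

With a mora that spans frames 10 to 19, the default call selects frame 14, while `width=3` selects frames 13 to 15.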

Because F0 cannot be defined for unvoiced sounds, in a mora consisting of an unvoiced consonant followed by a voiced vowel, the F0 at the center of the voiced vowel may be changed. Alternatively, F0 may be changed only in moras that consist solely of voiced sounds.
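
A minimal sketch of this selection rule follows. It assumes that each mora is given as a list of phoneme segments with a voicing flag; the tuple layout, the function name `target_frame_for_mora`, and the `voiced_only` switch are hypothetical.

```python
def target_frame_for_mora(phones, voiced_only=False):
    """Return the frame whose F0 should be changed, or None if there is none.

    phones: list of (start_frame, end_frame, is_voiced) tuples for one mora,
    in time order, with end_frame exclusive.
    voiced_only: if True, only moras made up solely of voiced sounds qualify.
    """
    if not phones:
        return None
    if all(voiced for _, _, voiced in phones):
        # Fully voiced mora: use the center of the whole mora.
        start, end = phones[0][0], phones[-1][1]
        return start + (end - start - 1) // 2
    if voiced_only:
        return None
    voiced_segments = [(s, e) for s, e, voiced in phones if voiced]
    if not voiced_segments:
        # F0 is undefined for a mora without any voiced segment.
        return None
    # Unvoiced consonant + voiced vowel: use the center of the vowel.
    s, e = voiced_segments[-1]
    return s + (e - s - 1) // 2
```

For a mora whose unvoiced consonant covers frames 0 to 3 and whose vowel covers frames 4 to 9, the function returns frame 6; with `voiced_only=True` the same mora is skipped.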

Although an example has been described in which, in step 202, the accent type information of the accent phrase is used to add several hertz to the F0 at the center of each high mora and to subtract several hertz from the F0 at the center of each low mora, the disclosed technology is not limited to this. For example, F0 may be adjusted by obtaining the slope line of F0 from the first vowel of a word to the last vowel of the word and, for each vowel, multiplying the F0 component at the phoneme center that exceeds the slope line by a predetermined value corresponding to the accent strength.
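
The slope-line variant can be sketched as follows. The sketch assumes one F0 value per vowel, sampled at the phoneme center, and a straight line drawn between the first and last vowels of the word. The function name `adjust_above_slope` and the parameter `factor`, which stands for the predetermined value associated with an accent strength, are hypothetical.

```python
def adjust_above_slope(times, f0, factor):
    """Scale the part of each vowel's F0 that lies above the slope line.

    times: center time of each vowel, from the first to the last vowel.
    f0: F0 value in hertz at each of those times.
    factor: multiplier chosen according to the accent strength (assumed).
    """
    if len(times) != len(f0) or len(f0) < 2:
        raise ValueError("need matching times and F0 values for two or more vowels")
    t_first, t_last = times[0], times[-1]
    if t_last == t_first:
        raise ValueError("first and last vowels must be at different times")
    slope = (f0[-1] - f0[0]) / (t_last - t_first)
    adjusted = []
    for t, value in zip(times, f0):
        baseline = f0[0] + slope * (t - t_first)
        excess = value - baseline
        if excess > 0:
            # Only the component above the line is multiplied.
            adjusted.append(baseline + excess * factor)
        else:
            adjusted.append(value)
    return adjusted
```

Values on or below the line, including the two endpoints that define it, are left unchanged, so a factor greater than 1 raises only the peaks of the word and a factor smaller than 1 flattens them.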

After step 106, as shown in FIG. 6B, an intermediate notation corresponding to the Japanese notation may be displayed in the text box 93 of the user interface 91. Here, the mark "’" indicates that the accent strength is strong.

The accent strength need not be designated only by the user; it may also be estimated by the CPU 60. For example, the estimation assigns a strong accent strength to the important accent phrases in a sentence. More specifically, the accent strength of a proper noun may be estimated as strong and that of anything other than a proper noun as medium. Alternatively, the accent strength at the head of a breath group may be estimated as strong and that elsewhere as medium. The accent strength of an accent phrase containing an auxiliary morpheme such as "rashii" (らしい), "denai" (でない), or "darou" (だろう) may be estimated as weak. The accent strength designation estimated by the CPU 60 may be displayed in the text box 92 of FIG. 6A and in the text boxes 92 and 93 of FIG. 6B, for example, in a color different from that of the accent strength designation made by the user. The accent strength designation estimated by the CPU 60 may also be changed by the user.
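
The estimation rules listed above can be combined into a simple rule-based sketch. The function name, the input flags, and in particular the order in which the rules are applied (auxiliary morphemes first) are assumptions made for illustration; the description presents the rules as separate alternatives and does not state a priority among them.

```python
# Auxiliary morphemes named in the description as examples.
AUXILIARY_MORPHEMES = ("らしい", "でない", "だろう")


def estimate_accent_strength(morphemes, has_proper_noun=False,
                             is_breath_group_head=False):
    """Estimate 'weak', 'medium', or 'strong' for one accent phrase.

    morphemes: surface forms of the morphemes in the accent phrase.
    has_proper_noun: True if the phrase contains a proper noun.
    is_breath_group_head: True if the phrase starts a breath group.
    """
    if any(m in AUXILIARY_MORPHEMES for m in morphemes):
        return "weak"
    if has_proper_noun or is_breath_group_head:
        return "strong"
    return "medium"
```

A phrase flagged as containing a proper noun is estimated as strong, a phrase containing だろう as weak, and any other phrase as medium. The result is only an initial designation, which the user may still change.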

If it is determined in step 406 that the fundamental frequency adjustment process is not to be ended, the user may correct the intermediate notation and its accent strength, rather than the Japanese notation, on the user interface 91 of FIG. 6B.

The computer 10 of the disclosed technology may be a computer that operates stand-alone, and the disclosed technology can be used for e-learning audio, for exhibition guidance audio in art galleries and museums, and the like. In this case, for example, the user inputs to the computer 10, using the keyboard 72, a character string corresponding to the speech that the user wishes to play back with adjusted F0. The disclosed technology can also be used for reading e-mail aloud. In this case, for example, a character string contained in an e-mail is acquired from an e-mail application running on the computer 10 as the character string to be played back as speech with adjusted F0.

The computer 10 of the disclosed technology may also be a computer that operates as a server, and the disclosed technology can be used for e-learning audio, for exhibition guidance audio in art galleries and museums, and the like. In this case, for example, the user inputs, using the keyboard of a client connected to the computer 10, a character string corresponding to the speech that the user wishes to play back with adjusted F0, and the speech is also played back on the client. The disclosed technology can also be used as a voice dialogue agent for a smartphone or an in-vehicle terminal. In this case, for example, the user inputs a question by voice to the smartphone or the in-vehicle terminal. The computer 10, which is connected to the smartphone or the in-vehicle terminal via a network, recognizes the input speech, adjusts the F0 of the speech corresponding to the character string of the answer to the question using the disclosed technology, and transmits the speech to the smartphone or the in-vehicle terminal, which then plays it back.

According to the disclosed technology, the F0 values other than the designated ones are re-estimated using the mean vectors and covariance matrices of the HMM with the dynamic features of F0 introduced. F0, that is, the accent strength, is therefore adjusted while the microprosody is preserved in the adjusted F0. Consequently, the naturalness of speech whose F0 has been adjusted by the disclosed technology is not impaired. For the same reason, continuity with the portions other than the accent phrase for which the accent strength is designated is not impaired either. Moreover, because the disclosed technology does not use an HMM trained on training data concerning accent strength, no such training data needs to be collected.

Regarding the above embodiment, the following appendices are further disclosed.

(Appendix 1)
A fundamental frequency adjustment device comprising:
a fundamental frequency pattern estimation unit that estimates a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
a fundamental frequency changing unit that changes the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength; and
a re-estimation unit that, using the information of the hidden Markov model, re-estimates a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value.

(Appendix 2)
The fundamental frequency adjustment device according to appendix 1, wherein the information of the hidden Markov model is a mean vector and a covariance matrix corresponding to a state of the hidden Markov model.

(Appendix 3)
The fundamental frequency adjustment device according to appendix 1 or 2, wherein the re-estimation unit re-estimates the fundamental frequency pattern using the information of the hidden Markov model, dynamic features of the estimated fundamental frequency pattern, and the changed value of the fundamental frequency of the designated portion.

(Appendix 4)
The fundamental frequency adjustment device according to any one of appendices 1 to 3, wherein the designated portion in the estimated fundamental frequency pattern and the designated accent strength are designated by at least one of:
designation by a user; and
estimation based on linguistic information obtained from the text.

(Appendix 5)
The fundamental frequency adjustment device according to any one of appendices 1 to 4, wherein the designated portion is the center of a mora, a syllable, a phoneme, or a vowel included in a designated representative portion in the estimated fundamental frequency pattern.

(Appendix 6)
The fundamental frequency adjustment device according to any one of appendices 1 to 5, wherein the value of the fundamental frequency of the designated portion is determined based on accent type information of an accent phrase included in the designated portion in the estimated fundamental frequency pattern.

(Appendix 7)
A fundamental frequency adjustment method in which a computer:
estimates a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
changes the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength; and
re-estimates, using the information of the hidden Markov model, a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value.

(Appendix 8)
The fundamental frequency adjustment method according to appendix 7, wherein the information of the hidden Markov model is a mean vector and a covariance matrix corresponding to a state of the hidden Markov model.

(Appendix 9)
The fundamental frequency adjustment method according to appendix 7 or 8, wherein the fundamental frequency pattern is re-estimated using the information of the hidden Markov model, dynamic features of the estimated fundamental frequency pattern, and the changed value of the fundamental frequency of the designated portion.

(Appendix 10)
The fundamental frequency adjustment method according to any one of appendices 7 to 9, wherein the designated portion in the estimated fundamental frequency pattern and the designated accent strength are designated by at least one of:
designation by a user; and
estimation based on linguistic information obtained from the text.

(Appendix 11)
The fundamental frequency adjustment method according to any one of appendices 7 to 10, wherein the designated portion is the center of a mora, a syllable, a phoneme, or a vowel included in the designated portion in the estimated fundamental frequency pattern.

(Appendix 12)
The fundamental frequency adjustment method according to any one of appendices 7 to 11, wherein the value of the fundamental frequency of the designated portion is determined based on accent type information of an accent phrase included in the designated portion in the estimated fundamental frequency pattern.

(Appendix 13)
A program for causing a computer to execute a fundamental frequency adjustment process comprising:
estimating a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
changing the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength; and
re-estimating, using the information of the hidden Markov model, a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value.

(Appendix 14)
The program according to appendix 13, wherein the information of the hidden Markov model is a mean vector and a covariance matrix corresponding to a state of the hidden Markov model.

(Appendix 15)
The program according to appendix 13 or 14, wherein the fundamental frequency pattern is re-estimated using the information of the hidden Markov model, dynamic features of the estimated fundamental frequency pattern, and the changed value of the fundamental frequency of the designated portion.

(Appendix 16)
The program according to any one of appendices 13 to 15, wherein the designated portion in the estimated fundamental frequency pattern and the designated accent strength are designated by at least one of:
designation by a user; and
estimation based on linguistic information obtained from the text.

(Appendix 17)
The program according to any one of appendices 13 to 16, wherein the designated portion is the center of a mora, a syllable, a phoneme, or a vowel included in a designated representative portion in the estimated fundamental frequency pattern.

(Appendix 18)
The program according to any one of appendices 13 to 17, wherein the value of the fundamental frequency of the designated portion is determined based on accent type information of an accent phrase included in the designated portion in the estimated fundamental frequency pattern.

(Appendix 19)
A speech synthesis device comprising:
a fundamental frequency pattern estimation unit that estimates a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
a fundamental frequency changing unit that changes the value of the fundamental frequency of a representative portion of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength;
a re-estimation unit that, using the information of the hidden Markov model, re-estimates a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the representative portion has the changed value; and
a speech synthesis unit that synthesizes a speech signal based on an integrated fundamental frequency pattern and a mel-cepstrum pattern estimated using the information of the hidden Markov model.

(Appendix 20)
A speech synthesis method in which a computer:
estimates a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
changes the value of the fundamental frequency of a representative portion of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength;
re-estimates, using the information of the hidden Markov model, a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the representative portion has the changed value; and
synthesizes a speech signal based on an integrated fundamental frequency pattern and a mel-cepstrum pattern estimated using the information of the hidden Markov model.

(Appendix 21)
A program for causing a computer to execute a speech synthesis process comprising:
estimating a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
changing the value of the fundamental frequency of a representative portion of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength;
re-estimating, using the information of the hidden Markov model, a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the representative portion has the changed value; and
synthesizing a speech signal based on an integrated fundamental frequency pattern and a mel-cepstrum pattern estimated using the information of the hidden Markov model.

10 Computer
16 Parameter estimation unit
18 Accent strength-F0 conversion unit
20 F0 designation unit
22 F0 re-estimation unit
24 Analysis-synthesis unit
30 HMM DB
60 CPU
62 Primary storage unit
64 Secondary storage unit
68 HMM DB storage area

Claims

A fundamental frequency adjustment device comprising:
a fundamental frequency pattern estimation unit that estimates a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
a fundamental frequency changing unit that changes the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength; and
a re-estimation unit that, using the information of the hidden Markov model, re-estimates a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value.

The fundamental frequency adjustment device according to claim 1, wherein the information of the hidden Markov model is a mean vector and a covariance matrix corresponding to a state of the hidden Markov model.

The fundamental frequency adjustment device according to claim 1 or 2, wherein the re-estimation unit re-estimates the fundamental frequency pattern using the information of the hidden Markov model, dynamic features of the estimated fundamental frequency pattern, and the changed value of the fundamental frequency of the designated portion.

The fundamental frequency adjustment device according to any one of claims 1 to 3, wherein the designated portion in the estimated fundamental frequency pattern and the designated accent strength are designated by at least one of:
designation by a user; and
estimation based on linguistic information obtained from the text.

The fundamental frequency adjustment device according to any one of claims 1 to 4, wherein the designated portion is the center of a mora, a syllable, a phoneme, or a vowel included in a designated representative portion in the estimated fundamental frequency pattern.

The value of the fundamental frequency of the designated part is determined based on accent type information of an accent phrase included in the designated part in the estimated fundamental frequency pattern. The fundamental frequency adjusting device described in 1.

A fundamental frequency adjustment method in which a computer:
estimates a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
changes the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength; and
re-estimates, using the information of the hidden Markov model, a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value.

A program for causing a computer to execute a fundamental frequency adjustment process comprising:
estimating a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
changing the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength; and
re-estimating, using the information of the hidden Markov model, a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value.

A speech synthesis device comprising:
a fundamental frequency pattern estimation unit that estimates a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
a fundamental frequency changing unit that changes the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength;
a re-estimation unit that, using the information of the hidden Markov model, re-estimates a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value; and
a speech synthesis unit that synthesizes a speech signal based on an integrated fundamental frequency pattern and a mel-cepstrum pattern estimated using the information of the hidden Markov model.

A speech synthesis method in which a computer:
estimates a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
changes the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength;
re-estimates, using the information of the hidden Markov model, a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value; and
synthesizes a speech signal based on an integrated fundamental frequency pattern and a mel-cepstrum pattern estimated using the information of the hidden Markov model.

A program for causing a computer to execute a speech synthesis process comprising:
estimating a fundamental frequency pattern of speech corresponding to a text using information of a hidden Markov model corresponding to the text;
changing the value of the fundamental frequency of a designated portion in the estimated fundamental frequency pattern to a value corresponding to a designated accent strength;
re-estimating, using the information of the hidden Markov model, a fundamental frequency pattern that corresponds to the text and in which the fundamental frequency of the designated portion has the changed value; and
synthesizing a speech signal based on an integrated fundamental frequency pattern and a mel-cepstrum pattern estimated using the information of the hidden Markov model.