JP5328703B2

JP5328703B2 - Prosody pattern generator

Info

Publication number: JP5328703B2
Application number: JP2010066289A
Authority: JP
Inventors: 啓吾川島; 裕久田崎; 正山浦; 訓古田; 貴弘大塚
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2010-03-23
Filing date: 2010-03-23
Publication date: 2013-10-30
Anticipated expiration: 2030-03-23
Also published as: JP2011197542A

Description

The present invention relates to a prosodic pattern generation device that generates a prosodic pattern from a user's input utterance.

Techniques have been proposed that use prosodic information extracted from a user's utterance to correct a prosodic pattern into the one the user intends. Here, prosodic information means the prosody itself (pitch, phoneme duration, power, etc.) extracted from the utterance, and a prosodic pattern means prosody that has been prepared in a quality and format usable for synthesized speech.
For example, Patent Document 1 discloses an audio output device that determines the pitch pattern of speech input from a microphone, generates pitch pattern candidates from input text data, compares the pitch pattern determined from the input speech with the candidates, and sets the pitch pattern with a high degree of matching. Patent Document 2 discloses a speech synthesizer that, when registering an unknown word in a notation-compatible prosody dictionary, acquires the notation and reading of the unknown word, acquires voice information corresponding to the unknown word via voice input, and registers the prosody extracted from that voice information in the dictionary.

Patent Document 1: JP 2009-53522 A
Patent Document 2: JP 2009-48008 A

Conventionally, however, the corrected pitch pattern is selected from pitch pattern candidates generated from text data. A pitch pattern generated from text data is a rule-based prosodic pattern, that is, an artificial prosodic pattern generated according to rules, and therefore lacks naturalness. In addition, because the pitch pattern (prosodic pattern) candidates are derived from text data, it is difficult to generate candidates that correspond to the text for unknown words, new words, and the like.

Furthermore, because prosody is extracted from an utterance of the unknown word and used as that word's prosody, the prosody (prosodic information) cannot always be extracted successfully from the uttered speech, owing to the user's voice quality, the recording environment, and so on. In addition, since speakers are not limited to professional narrators, stable utterances cannot always be obtained and the extracted prosody (prosodic information) may be locally degraded, so the user may have to repeat the utterance many times.

The present invention has been made to solve the above problems. A prosodic pattern dictionary is prepared in advance that stores a plurality of real voice prosody patterns, where a real voice prosody pattern is a prosodic pattern obtained by applying maintenance, such as correction of extraction errors, to real voice prosody information, which is prosodic information extracted from speech data. By selecting from this dictionary a real voice prosody pattern close to the real voice prosody information of the input speech data, the invention aims to provide a prosodic pattern generation device that generates and outputs a highly natural real voice prosody pattern even when no real voice prosody pattern of similar text exists, as with unknown or new words, and even when the user's voice quality or the recording environment prevents stable utterances or stable real voice prosody information from being obtained.
In the following description, unless otherwise specified, the real voice prosody pattern and the real voice pitch pattern are referred to simply as the prosodic pattern and the pitch pattern.

The prosodic pattern generation device according to the present invention comprises: a real voice prosody information extraction unit that accepts input of speech data and text data, extracts prosodic information from the speech data, and generates real voice prosody information in which the prosodic information is associated with the text data; a prosodic pattern dictionary that stores a plurality of prosodic patterns; a similar prosodic pattern search unit that searches the prosodic pattern dictionary for one or more prosodic patterns similar to part or all of the real voice prosody information and outputs them as similar prosodic patterns; a real voice prosody information processing unit that processes the real voice prosody information input from the real voice prosody information extraction unit so that it approximates the prosodic patterns stored in the prosodic pattern dictionary, and outputs the result to the similar prosodic pattern search unit; a similar prosodic pattern presentation unit that converts the similar prosodic patterns into a format the user can recognize, presents them, and asks the user to select one; and a prosodic pattern output unit that outputs the similar prosodic pattern selected by the user from among those presented by the similar prosodic pattern presentation unit.

According to the invention, prosodic patterns similar to part or all of the real voice prosody information generated from the speech data and text data are retrieved from the prosodic pattern dictionary and presented to the user as similar prosodic patterns, and the similar prosodic pattern selected by the user is output. A prosodic pattern similar to the real voice prosody information can therefore be selected from a dictionary of prosodic patterns that have already been corrected and maintained, so a highly natural prosodic pattern can be generated. A highly natural prosodic pattern can also be generated when no similar text data exists, as with unknown or new words. Furthermore, a stable prosodic pattern can be generated regardless of the sound collection conditions.

A block diagram showing the configuration of a speech synthesis system having the prosodic pattern generation device according to Embodiment 1 of the present invention.
An explanatory diagram showing the similarity calculation method and the similar pitch pattern search method of the prosodic pattern generation device according to Embodiment 1.
A flowchart showing the operation of the prosodic pattern generation device according to Embodiment 1.
A block diagram showing another configuration example of the real voice pitch information extraction unit according to Embodiment 1.
A block diagram showing another configuration example of the similar pitch pattern presentation unit according to Embodiment 1.
An explanatory diagram showing generation of a pitch pattern from an utterance containing degradation by the prosodic pattern generation device according to Embodiment 1.
A block diagram showing the configuration of a speech synthesis system having the prosodic pattern generation device according to Embodiment 2.
A flowchart showing the operation of the prosodic pattern generation device according to Embodiment 2.
A block diagram showing the configuration of a speech synthesis system having the prosodic pattern generation device according to Embodiment 3.
A flowchart showing the operation of the prosodic pattern generation device according to Embodiment 3.
An explanatory diagram showing an example of processing a similar pitch pattern in the prosodic pattern generation device according to Embodiment 3.

Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech synthesis system having a prosodic pattern generation device according to Embodiment 1 of the present invention. Embodiment 1 is described using a pitch pattern generation device that handles only pitch as the prosodic information and prosodic pattern. The prosodic pattern generation device is therefore referred to below as a pitch pattern generation device. The same applies to Embodiments 2 and 3.
The speech synthesis system 100 includes an input device 1, a pitch pattern generation device (prosodic pattern generation device) 2, and an output device 3.

Next, each component is described in detail. The input device 1 includes a voice input unit 11, a text input unit 12, and a pitch pattern selection unit 13.
The voice input unit 11 is, for example, a microphone; it collects an utterance, generates voice data, and outputs the voice data to the pitch pattern generation device 2. Besides recording the utterance directly from a microphone, it may read voice data from a recording device (not shown) that stores prerecorded utterance data, or collect voice data from stream data; any configuration that yields voice data of the utterance may be used. The voice data may be the user's own voice or another person's voice. The text input unit 12 collects text corresponding to the utterance (for example, reading, notation, and pause information), creates text data, and outputs it to the pitch pattern generation device 2. The pitch pattern selection unit 13 outputs the pitch pattern selection result entered by the user to the pitch pattern generation device 2 as pitch pattern selection information.

The pitch pattern generation device (prosodic pattern generation device) 2 includes a real voice pitch information extraction unit (real voice prosody information extraction unit) 21, a similar pitch pattern search unit (similar prosodic pattern search unit) 22, a pitch pattern dictionary (prosodic pattern dictionary) 23, a similar pitch pattern presentation unit (similar prosodic pattern presentation unit) 24, and a pitch pattern output unit (prosodic pattern output unit) 25. It searches for similar pitch patterns based on the voice data and text data input from the input device 1, outputs the retrieved similar pitch patterns to the output device 3 for presentation to the user, and outputs, as the pitch pattern, the similar pitch pattern corresponding to the pitch pattern selection information input from the input device 1.

The real voice pitch information extraction unit 21 analyzes the voice data input from the voice input unit 11 to extract pitch information, generates real voice pitch information in which the text data input from the text input unit 12 is associated with that pitch information, and outputs it to the similar pitch pattern search unit 22. Known methods such as the cepstrum method or the autocorrelation function method can be used to extract the pitch information. Because these methods are well known, a detailed description is omitted.
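As an illustration of the autocorrelation function method mentioned above, the following is a minimal sketch that estimates the fundamental frequency of a single frame. The function name `estimate_pitch_autocorr`, the search range `fmin`/`fmax`, and the voicing threshold `0.3` are assumptions made for this example and do not come from the patent.

```python
import math


def estimate_pitch_autocorr(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Return an F0 estimate in Hz for one frame, or None if it looks unvoiced.

    Hypothetical helper: picks the lag with the largest normalized
    autocorrelation inside the lag range implied by [fmin, fmax].
    """
    n = len(frame)
    if n == 0:
        return None
    mean = sum(frame) / n
    x = [s - mean for s in frame]          # remove DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return None                        # silent frame
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    # Assumed voicing threshold: weak periodicity is treated as unvoiced.
    if best_lag is None or best_corr < 0.3:
        return None
    return sample_rate / best_lag
```

Applying such a function to successive frames (for example every 5 ms) would give a pitch track of the kind the extraction unit associates with the text data.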

The similar pitch pattern search unit 22 has a similarity calculation unit 22a. The similarity calculation unit 22a calculates the similarity between the real voice pitch information input from the real voice pitch information extraction unit 21 and each pitch pattern stored in the pitch pattern dictionary 23 (hereinafter referred to as an accumulated pitch pattern). Based on that similarity, the similar pitch pattern search unit 22 searches for one or more accumulated pitch patterns similar to a part of the real voice pitch information (for example, a unit of continuous voiced sections) or to the whole, and outputs the retrieved accumulated pitch patterns to the similar pitch pattern presentation unit 24 as similar pitch patterns.

An example of the similarity calculation and the similar pitch pattern search is now described concretely with reference to FIG. 2. FIG. 2 is an explanatory diagram showing the similarity calculation method and the similar pitch pattern search method of the pitch (prosodic) pattern generation device according to Embodiment 1.
As shown in FIG. 2, when real voice pitch information for "interchange" is input, the similarity calculation unit 22a reads out the plurality of accumulated pitch patterns stored in the pitch pattern dictionary 23 (pitch pattern 1, pitch pattern 2, pitch pattern 3, ... in FIG. 2).

Next, the input real voice pitch information and the accumulated pitch pattern 1 that was read out are each divided into voiced sections, and a distance and likelihood (d_11, d_12, d_13, d_14) is calculated for each voiced section. The sum of the distances and likelihoods of the voiced sections (d_11 + d_12 + d_13 + d_14) is then calculated. This sum is taken as the similarity between the real voice pitch information and the accumulated pitch pattern.

This processing is carried out for all the remaining accumulated pitch patterns 2, 3, ..., giving the similarities (d_21 + d_22 + d_23 + d_24), (d_31 + d_32 + d_33 + d_34), and so on. The similar pitch pattern search unit 22 compares the similarities calculated by the similarity calculation unit 22a and takes as similar pitch patterns the accumulated pitch patterns that rank highest in similarity or whose similarity is at or above a threshold.
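The per-section scoring and ranking described above can be sketched as follows. This is only an illustration under stated assumptions: each voiced section is a list of pitch values already resampled to the same length as its counterpart, the per-section score is a mean squared difference (so a smaller sum means a closer match, and ranking is ascending), and the names `section_distance`, `pattern_distance`, and `search_similar` are hypothetical.

```python
def section_distance(a, b):
    """Mean squared difference between two equal-length pitch sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pattern_distance(query_sections, stored_sections):
    """Sum of per-section distances, i.e. d_k1 + d_k2 + ... for pattern k."""
    return sum(section_distance(q, s)
               for q, s in zip(query_sections, stored_sections))


def search_similar(query_sections, dictionary, top_k=3):
    """Return the names of the top_k closest accumulated pitch patterns.

    `dictionary` maps a pattern name to its list of voiced sections.
    Patterns with a different number of voiced sections are skipped
    (an assumption made to keep the sketch simple).
    """
    scored = []
    for name, sections in dictionary.items():
        if len(sections) != len(query_sections):
            continue
        scored.append((pattern_distance(query_sections, sections), name))
    scored.sort()                      # smallest summed distance first
    return [name for _, name in scored[:top_k]]
```

A threshold-based variant would keep every pattern whose summed distance falls below a chosen limit instead of slicing to `top_k`.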

The pitch pattern dictionary 23 is a memory in which a plurality of entries are registered. Each entry consists of a pitch pattern obtained by applying maintenance to real voice pitch information extracted from voice data, such as correcting portions degraded by extraction errors or pitch fluctuation, together with the text data (notation, reading, etc.) corresponding to that pitch pattern. The pitch patterns stored in the pitch pattern dictionary 23 are referred to as accumulated pitch patterns.

The similar pitch pattern presentation unit 24 outputs the similar pitch patterns input from the similar pitch pattern search unit 22 to the pitch pattern output unit 25. The similar pitch pattern presentation unit 24 also has an output conversion unit 24a, which converts the similar pitch patterns into a format the user can check and outputs them to the output device 3 as output similar pitch patterns.
Formats the user can check include a table giving the change of the pitch pattern over time as numerical values, or a graph plotting it. Other configurations are also possible: for example, a known speech synthesizer may be used to generate synthesized speech from the pitch pattern and output it to an output device 3, such as a speaker, on which the synthesized speech can be heard, or a pitch pattern for a speech synthesizer may be output to an output device 3 that itself has speech synthesis and audio output functions.
The output similar pitch patterns sent to the output device 3 are presented to the user via the similar pitch pattern output unit 31. The user selects the most suitable pitch pattern from the presented output similar pitch patterns and enters the pitch pattern selection result via the pitch pattern selection unit 13.

Based on the pitch pattern selection information input from the pitch pattern selection unit 13, the pitch pattern output unit 25 selects the pitch pattern to be output from among the similar pitch patterns input from the similar pitch pattern presentation unit 24, and outputs it.

The output device 3 has a similar pitch pattern output unit 31 and is, for example, a device capable of displaying a plotted graph. It presents to the user the output similar pitch patterns input from the similar pitch pattern presentation unit 24.

Next, the operation of the speech synthesis system 100 is described. FIG. 3 is a flowchart showing the operation of the pitch (prosodic) pattern generation device according to Embodiment 1 of the present invention. The description below follows this flowchart.
Voice data and text data are input from the voice input unit 11 and the text input unit 12 to the real voice pitch information extraction unit 21 (step ST1). The real voice pitch information extraction unit 21 analyzes the input voice data and extracts pitch information (step ST2). The pitch information is, for example, data obtained by extracting the pitch at fixed time intervals. The extraction points are not limited to fixed time intervals; they can be changed to suit the application and the data, for example to several representative points obtained by dividing a voiced section into equal intervals, or to points where the pitch changes sharply.

The real voice pitch information extraction unit 21 then associates the extracted pitch information with the text data and outputs the result to the similar pitch pattern search unit 22 as real voice pitch information (step ST3). When the real voice pitch information is input in step ST3, the similarity calculation unit 22a of the similar pitch pattern search unit 22 reads the accumulated pitch patterns from the pitch pattern dictionary 23 and calculates the similarity between each accumulated pitch pattern and the input real voice pitch information (step ST4). Based on the similarities calculated in step ST4, the similar pitch pattern search unit 22 searches for one or more accumulated pitch patterns similar to a part of the real voice pitch information (for example, a unit of continuous voiced sections) or to the whole, and outputs them to the similar pitch pattern presentation unit 24 as similar pitch patterns (step ST5).

The similar pitch pattern presentation unit 24 outputs the similar pitch patterns input in step ST5 to the pitch pattern output unit 25. At the same time, the output conversion unit 24a converts the similar pitch patterns into a format the user can check and outputs them as output similar pitch patterns to the similar pitch pattern output unit 31 of the output device 3, so that the retrieved similar pitch patterns are presented to the user (step ST6).

The user then selects the desired pitch pattern from the output similar pitch patterns presented by the similar pitch pattern output unit 31, and the pitch pattern selection information representing that choice is input to the pitch pattern output unit 25 via the pitch pattern selection unit 13 (step ST7). Based on the pitch pattern selection information input in step ST7, the pitch pattern output unit 25 selects the pitch pattern to be output from among the similar pitch patterns input in step ST5 and outputs it (step ST8), and the processing ends.

Next, other configuration examples of the real voice pitch information extraction unit 21 are described. FIG. 4 is a block diagram showing another configuration example of the real voice pitch information extraction unit.
The real voice pitch information extraction unit 21 may additionally be provided with an additional information acquisition unit 21a, which performs language analysis on the text data to obtain additional information and attaches it to the real voice pitch information. Specifically, when text data containing only the notation is input, the text data is analyzed linguistically to obtain additional information such as the reading and part-of-speech information, which is then attached to the real voice pitch information.

As a further configuration example, the real voice pitch information extraction unit 21 may additionally be provided with a segmentation unit 21b, which uses a known speech recognition technique together with the input text data to segment the uttered voice data, for example phoneme by phoneme, as phoneme information. The real voice pitch information extraction unit 21 then extracts real voice pitch information associated with the phoneme information obtained by the segmentation unit 21b.

As a further configuration example, the real voice pitch information extraction unit 21 may additionally be provided with a pitch pattern generation designation unit 21c, which designates the portion of the voice data for which a pitch pattern is to be generated. For example, when the pitch pattern of the "interchange" portion is to be generated for the synthesized speech "This is the interchange.", the pitch pattern generation designation unit 21c designates, within the user's utterance data "This is the interchange.", only the "interchange" portion as the target for real voice pitch information extraction.
The example of FIG. 4 shows the additional information acquisition unit 21a, the segmentation unit 21b, and the pitch pattern generation designation unit 21c all provided together, but they need not all be provided at once; any of them may be selected and combined as appropriate.

Next, details of the similarity calculation unit 22a of the similar pitch pattern search unit 22 and examples of the calculation method are described.
When the real voice pitch information and the accumulated pitch patterns have different data formats (for example, the real voice pitch information holds a pitch value every 5 ms whereas the accumulated pitch patterns hold one every 50 ms), the similarity calculation unit 22a converts the real voice pitch information into the data format of the accumulated pitch patterns before calculating the similarity.
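The format conversion in the 5 ms versus 50 ms example could look like the sketch below. The patent does not state how the conversion is performed; averaging the voiced frames inside each 50 ms block, and representing unvoiced frames as `0.0`, are assumptions chosen for illustration, as is the function name `downsample_pitch`.

```python
def downsample_pitch(track, src_ms=5, dst_ms=50):
    """Convert a pitch track sampled every src_ms to one sampled every dst_ms.

    Each output value is the mean of the voiced (non-zero) frames in the
    corresponding block; a block with no voiced frame yields 0.0.
    """
    if dst_ms % src_ms != 0:
        raise ValueError("dst_ms must be a multiple of src_ms")
    step = dst_ms // src_ms
    out = []
    for start in range(0, len(track), step):
        voiced = [f for f in track[start:start + step] if f > 0]
        out.append(sum(voiced) / len(voiced) if voiced else 0.0)
    return out
```

Once both sequences share the same frame interval, they can be compared with any of the distance measures discussed in this section.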

To calculate the distance or likelihood, the similarity calculation unit 22a can use, for example, the squared error or inner product between pitch patterns after each has been normalized to a predetermined n-dimensional vector, or processing such as DP (dynamic programming) matching or a statistical method using an HMM (hidden Markov model). DP matching and HMM-based statistical methods are well known, so a detailed description is omitted. When an HMM is used, for example, an HMM of the real voice pitch information having one or more states and Gaussian distributions is created for each voiced section of the voice data (or for each unit such as a phoneme, syllable, word, or sentence) and used to calculate the likelihood.
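As one concrete instance of DP matching, the sketch below computes a dynamic time warping distance between two pitch sequences of possibly different lengths. The absolute difference as local cost and the unconstrained step pattern are assumptions for this example; the patent does not specify them, and `dtw_distance` is a hypothetical name.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between pitch sequences a and b.

    cost[i][j] holds the minimum accumulated cost of aligning the first
    i values of a with the first j values of b.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because the alignment may repeat frames, two contours with the same shape but different speaking rates can still obtain a small distance.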

Next, examples of the similarity calculation method are given.
<Calculation method examples>
1. Restrict the similarity calculation to accumulated pitch patterns that have the same number of voiced sections as the real voice pitch information.
2. Concatenate several voiced sections of the real voice pitch information and calculate the similarity treating them as one voiced section.
3. Concatenate several voiced sections of the accumulated pitch pattern and calculate the similarity treating them as one voiced section.
Examples 2 and 3 take into account that a partial voiced/unvoiced utterance error or a pitch extraction error may split what is really a single voiced section, or merge two voiced sections into one.
4. Calculate the similarity from the distances and likelihoods of the voiced sections that remain after excluding those suspected of containing a partial voiced/unvoiced utterance error or a pitch extraction error (for example, use only the voiced sections with the shortest distances or highest likelihoods).
5. Use a fixed time unit, rather than only the voiced section, as the unit for calculating the distance and likelihood. If the real voice pitch information and the accumulated pitch patterns are available segmented by phoneme, syllable, word, or sentence, calculate the similarity from the distances and likelihoods in those units.
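Calculation method example 4 can be sketched as follows: only the best-matching voiced sections contribute to the score, so a section corrupted by an utterance or extraction error does not dominate. The number of sections kept (`keep`) and the use of a mean instead of a sum are assumptions for illustration, and `robust_score` is a hypothetical name.

```python
def robust_score(section_distances, keep=3):
    """Score a pattern from its `keep` smallest per-section distances.

    Sections with large distances are assumed to be the ones affected by
    voiced/unvoiced or pitch extraction errors and are ignored.
    """
    if not section_distances:
        raise ValueError("at least one section distance is required")
    kept = sorted(section_distances)[:keep]
    return sum(kept) / len(kept)
```

Using a mean keeps scores comparable between patterns for which a different number of sections ended up being kept.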

Next, another configuration example of the search operation of the similar pitch pattern search unit 22 is shown.
Besides searching for similar pitch patterns using the real voice pitch information alone, the similar pitch pattern search unit 22 may also receive the text data of the uttered speech and perform a search that combines voiced/unvoiced information, phoneme information, syllable information, part-of-speech information, the position at which a word appears, and the like. The search is then carried out over linguistically close accumulated pitch patterns, for which the real voice pitch information and the accumulated pitch pattern are expected to match easily. For example, searching among accumulated pitch patterns whose text has the same consonant types (plosive consonants, fricative consonants, and so on) yields patterns with similar pitch transitions between consonants and vowels, and for a sentence-final utterance, searching among pitch patterns of sentence-final text data yields a pitch pattern with a sense of closure.
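
A minimal sketch of such a linguistically restricted search is shown below. The record layout (`Entry`), its field names, and the fallback to the whole dictionary when nothing matches are assumptions made for illustration; the patent does not specify how the dictionary is indexed.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """Hypothetical dictionary record: text plus linguistic attributes."""
    text: str
    consonant_types: tuple   # e.g. ("plosive", "fricative")
    sentence_final: bool
    contour: list            # pitch values in Hz

def linguistic_candidates(entries, consonant_types, sentence_final):
    """Keep entries whose consonant types and sentence position match.

    Falls back to the full dictionary when no entry matches (an assumption
    of this sketch, so that the pitch-based search always has candidates).
    """
    same = [e for e in entries
            if e.consonant_types == consonant_types
            and e.sentence_final == sentence_final]
    return same or list(entries)
```

The pitch-based similarity calculation would then run only over the returned entries, which both reduces the search cost and favors patterns with similar consonant-vowel transitions.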

Furthermore, the similar pitch pattern search unit 22 can be configured to adjust the accumulated pitch patterns before using them in the similar pitch pattern search. Examples of adjustment are shown below.
1. The accumulated pitch pattern is stretched or compressed on the time axis to match the time length of the input speech, and then the similar pitch pattern search is performed. This allows similar pitch patterns to be found even among pitch patterns with different speaking rates.
2. The average pitch of the accumulated pitch pattern is adjusted to match the average pitch of the real voice pitch information, and then the similar pitch pattern search is performed. This allows similar pitch patterns to be found even among pitch patterns with different voice heights.
3. The pitch variation range of the accumulated pitch pattern is adjusted to match the pitch variation range of the real voice pitch information, and then the similar pitch pattern search is performed. This allows similar pitch patterns to be found even among pitch patterns with different degrees of intonation.
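
The three adjustments can be combined in one step, as in the following sketch. It assumes contours are non-empty lists of pitch values in Hz and works on a linear Hz scale for simplicity (a real system might well work in the log-frequency domain); the function names are hypothetical.

```python
def resample(contour, n):
    """Adjustment 1: linearly stretch or compress a contour to n points."""
    if n == 1 or len(contour) == 1:
        return [contour[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def adjust_to_voice(stored, voice):
    """Match a stored contour to the voice's length, mean pitch, and range."""
    stretched = resample(stored, len(voice))
    s_mean = sum(stretched) / len(stretched)
    v_mean = sum(voice) / len(voice)
    s_range = max(stretched) - min(stretched)
    v_range = max(voice) - min(voice)
    # Adjustment 3: scale the variation range (left unscaled if the stored
    # contour is flat, to avoid dividing by zero).
    scale = v_range / s_range if s_range > 0 else 1.0
    # Adjustment 2: re-center on the voice's average pitch.
    return [v_mean + (f - s_mean) * scale for f in stretched]
```

With `stored = [100, 200, 100]` and `voice = [200, 220, 240, 220, 200]`, the adjusted contour has the same length, mean, and range as the voice, so only differences in contour shape remain to be scored.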

Next, another configuration example of the similar pitch pattern presentation unit 24 is shown. FIG. 5 is a block diagram showing another configuration example of the similar pitch pattern presentation unit 24.
The output conversion unit 24a of the similar pitch pattern presentation unit 24 may additionally be provided with a text data synthesis unit 24b which, when the similar pitch pattern for output is produced, is given the text data preceding and following the text data of the pitch pattern to be generated and combines them into the similar pitch pattern for output. For example, when the text data of the pitch pattern to be generated is "インターチェンジ" ("interchange"), inputting the text data "次のインターチェンジです。" ("Next is the interchange."), which anticipates the use in synthesized speech, produces a similar pitch pattern for output for the whole sentence that uses the generated pitch pattern for "インターチェンジ".

Further, the similar pitch pattern presentation unit 24 may be provided with a trigger input unit 24c that accepts a selection instruction for a similar pitch pattern entered by the user. The similar pitch pattern presentation unit 24 outputs the selected similar pitch pattern to the output device 3 as the similar pitch pattern for output.

Further, the similar pitch pattern presentation unit 24 may be provided with a narrowing unit 24d with which the user narrows down the similar pitch patterns. For example, the user narrows down the similar pitch patterns with the narrowing unit 24d under conditions such as "the voice pitch is close to the utterance" or "the portions where the voice pitch changes are close to the utterance", and the pitch patterns remaining after narrowing are output to the output device 3 as similar pitch patterns for output.
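
As an illustration of one narrowing condition, the sketch below keeps only the candidates whose average pitch is close to that of the utterance. The function name, the dictionary-of-contours layout, and the threshold `max_diff_hz` are assumptions of this sketch and are not taken from the patent.

```python
def mean(values):
    return sum(values) / len(values)

def narrow_by_voice_height(candidates, voice, max_diff_hz=15.0):
    """Keep candidates whose mean pitch is within max_diff_hz of the voice.

    candidates maps a label to a pitch contour (list of Hz values);
    voice is the pitch contour of the user's utterance.
    """
    target = mean(voice)
    return {name: contour for name, contour in candidates.items()
            if abs(mean(contour) - target) <= max_diff_hz}
```

Other conditions, such as agreement in where the pitch rises or falls, could be expressed as further filters of the same shape.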

As described above, Embodiment 1 comprises the real voice pitch information extraction unit 21, which extracts real voice pitch information from voice data based on the user's utterance; the similar pitch pattern search unit 22, which has the similarity calculation unit 22a that calculates the similarity between the real voice pitch information and the accumulated pitch patterns stored in the pitch pattern dictionary 23 and which searches for one or more accumulated pitch patterns close to the real voice pitch information; and the pitch pattern output unit 25, which outputs the similar pitch pattern selected by the user as the pitch pattern. A pitch pattern close to the real voice pitch information of the input voice data can therefore be retrieved from the pitch pattern dictionary 23, which stores a plurality of real voice pitch patterns that have been prepared by, for example, correcting extraction errors, and a highly natural pitch pattern can be generated.

Also, according to Embodiment 1, the similar pitch pattern search unit 22 selects accumulated pitch patterns close to the real voice pitch information of the input voice data, so a highly natural pitch pattern can be generated even when no accumulated pitch pattern with similar text data exists, as with unknown or newly coined words.

Furthermore, Embodiment 1 comprises the pitch pattern dictionary 23, which stores in advance a plurality of real voice pitch patterns prepared by, for example, correcting extraction errors in real voice pitch information extracted from voice data, and the similar pitch pattern search unit 22, which searches for pitch patterns close to the real voice pitch information of the input voice data. As illustrated in the explanatory diagram of FIG. 6, a pitch pattern free of degradation can therefore be generated even from real voice pitch information obtained from a user or in a recording environment for which stable utterances or pitch patterns cannot be obtained.

Furthermore, in Embodiment 1, when the real voice pitch information extraction unit 21 is additionally provided with the additional information acquisition unit 21a, which performs linguistic analysis of the input text data and acquires additional information, additional information such as the reading and part-of-speech information, which is useful for the similar pitch pattern search and for generating synthesized speech from the similar pitch pattern for output, can be obtained even when the input information is limited, for example to the written form only.

Furthermore, in Embodiment 1, when the real voice pitch information extraction unit 21 is additionally provided with the segmentation unit 21b, which uses speech recognition technology to segment the utterance data phoneme by phoneme on the basis of the input text data, the similar pitch pattern search can be performed in units of the segmented phoneme information, which improves the accuracy of the similarity calculation.

Furthermore, in Embodiment 1, when the real voice pitch information extraction unit 21 is additionally provided with the pitch pattern generation designation unit 21c, which designates the portion of the voice data for which a pitch pattern is to be generated, the user can utter the sentence in which the generated pitch pattern will actually be used, and real voice pitch information can be extracted from an utterance that reflects the preceding and following linguistic context. A highly natural pitch pattern that connects well with its surroundings can therefore be generated.

Furthermore, in Embodiment 1, when the similarity in the similar pitch pattern search is calculated by concatenating a plurality of voiced sound sections in the real voice pitch information, or in an accumulated pitch pattern in the pitch pattern dictionary 23, and treating them as one section, a pitch pattern close to the one originally desired can be found even if voiced sound sections of the extracted real voice pitch information have been split or merged because of partial voiced/unvoiced utterance errors or pitch extraction errors.

Furthermore, in Embodiment 1, when the similar pitch pattern search is configured so that the similarity is calculated from the distances and likelihoods of only some sections rather than of all sections, the similarity can be calculated from the sections that remain after excluding those estimated to contain partial voiced/unvoiced utterance errors or pitch extraction errors.

Furthermore, in Embodiment 1, when the similar pitch pattern search is configured to use text data in addition to the real voice pitch information, the search targets can be narrowed to text data in the pitch pattern dictionary 23 for which the similarity between the real voice pitch information and the pitch pattern is expected to be high. This reduces the amount of search processing. In addition, searching accumulated pitch patterns with similar text data makes pitch patterns with similar transitions easier to obtain, which improves the accuracy of the similar pitch pattern search.

Furthermore, in Embodiment 1, when the similar pitch pattern search is configured to modify the speaking rate, average pitch, pitch variation range, and so on of the accumulated pitch patterns to match the real voice pitch information, a pitch pattern close to the real voice pitch information can be generated even for real voice pitch information that differs greatly from the accumulated pitch patterns, such as that of a speaker of the opposite sex.

Furthermore, in Embodiment 1, when the output conversion unit 24a of the similar pitch pattern presentation unit 24 is additionally provided with the text data synthesis unit 24b, which is given the text data preceding and following the text data of the pitch pattern to be generated and combines their pitch patterns into the similar pitch pattern for output, the similar pitch patterns can be checked in the linguistic context of the sentence in which the generated pitch pattern will actually be used, and a highly natural similar pitch pattern that connects well with its surroundings can be selected.

Furthermore, in Embodiment 1, when the similar pitch pattern presentation unit 24 is additionally provided with the trigger input unit 24c, which accepts the user's selection of the similar pitch pattern to be output, similar pitch patterns for output need not be generated for all similar pitch patterns, so the processing amount can be reduced. Operations and procedures are also simplified, for example when the user wants to check a similar pitch pattern again.

Furthermore, in Embodiment 1, when the similar pitch pattern presentation unit 24 is additionally provided with the narrowing unit 24d, which narrows down the presented similar pitch patterns according to conditions desired by the user, the processing amount is reduced and the user can select a similar pitch pattern more easily.

Furthermore, in Embodiment 1, the output conversion unit 24a of the similar pitch pattern presentation unit 24 outputs synthesized speech in a form that the user can check. The user can therefore make a selection by listening to synthesized speech that uses the pitch pattern actually to be generated, and even a user without expertise or experience in speech information such as pitch can generate a pitch pattern easily.

Although Embodiment 1 has been described using pitch patterns, the invention is not limited to this; for example, the same processing can also be applied to parameters for controlling a pitch pattern.

Although Embodiment 1 has been described using pitch information and pitch patterns that handle pitch only, it is also applicable to pitch information and pitch patterns that combine pitch with other prosodic features such as phoneme duration and power.

Although Embodiment 1 has been described with Japanese as the input language, it is also applicable to other languages such as English and Chinese.

Embodiment 2.
The pitch pattern generation device shown in Embodiment 1 uses the real voice pitch information extracted by the real voice pitch information extraction unit 21 as it is for the similar pitch pattern search. Embodiment 2 shows a configuration in which the real voice pitch information is first processed so as to bring it closer to the accumulated pitch patterns in the pitch pattern dictionary 23 and is then used for the similar pitch pattern search.

FIG. 7 is a block diagram showing the configuration of a speech synthesis system having a pitch (prosody) pattern generation device according to Embodiment 2, in which a real voice pitch information processing unit (real voice prosody information processing unit) 26 is added to the pitch pattern generation device 2 of Embodiment 1. In the following, components identical or equivalent to those of the speech synthesis system according to Embodiment 1 are given the same reference numerals as in Embodiment 1, and their description is omitted or simplified.

The real voice pitch information extraction unit 21 outputs the extracted real voice pitch information to the real voice pitch information processing unit 26. The real voice pitch information processing unit 26 processes the real voice pitch information input from the real voice pitch information extraction unit 21, for example its average pitch, so as to bring it closer to the accumulated pitch patterns, and outputs the result as processed real voice pitch information to the similar pitch pattern search unit 22. Based on the processed real voice pitch information input from the real voice pitch information processing unit 26, the similar pitch pattern search unit 22 searches the pitch pattern dictionary 23 for accumulated pitch patterns similar to part or all of the real voice pitch information, and outputs the retrieved accumulated pitch patterns as similar pitch patterns to the similar pitch pattern presentation unit 24.

Next, specific processing methods of the real voice pitch information processing unit 26 are listed below.
A. The average pitch of the real voice pitch information is adjusted to match the average pitch of the pitch pattern dictionary 23 to be searched. This enables a search even when an utterance with a greatly different pitch, such as one by a speaker of the opposite sex, is input.
B. The pitch variation range of the real voice pitch information is adjusted to match the pitch variation range of the pitch pattern dictionary 23 to be searched. This enables a search even for utterances with a greatly different pitch, such as those by a speaker of the opposite sex.
C. The average time length of the real voice pitch information (speech) is stretched or compressed on the time axis to match the average time length of the utterances in the pitch pattern dictionary 23 to be searched. This enables a search even for utterances with a greatly different speaking rate.
D. Instead of processing based on the average value or variation range of the real voice pitch information as in A to C above, the real voice pitch information may be processed based on preset parameters, to allow for cases in which the user deliberately changes the prosody (higher or lower, faster or slower, and so on) when speaking.
E. Pitch values in the input real voice pitch information that are considered to be extraction errors are corrected, and the result is used as the processed real voice pitch information.
For example, sections in which the extracted real voice pitch information is estimated to be abnormal are identified: sections judged, from comparison with the surrounding pitch or from the user's vocal range, to contain double-pitch or half-pitch errors; sections where pitch has been extracted only locally because of devoicing or voice leakage; and sections where the pitch wavers because of a trembling voice. For such sections, interpolation from the surrounding pitch or averaging is performed, or the sections are marked as ones to be disregarded by the similar pitch pattern search unit 22 during the search.
F. Pitch characteristics that tend to degrade in the user's utterances are assumed in advance, and one or more pieces of processed real voice pitch information are generated using conversion policies that compensate for them. For example, the variation range of the pitch pattern is reduced (enlarged) on the assumption that the intonation becomes too large (small); the speaking rate is slowed on the assumption that longer sentences tend to be spoken faster; or the pitch is processed to remain high from the beginning to the end of a word on the assumption that the utterance is gloomy relative to a pitch pattern dictionary 23 of cheerful utterances.
G. In addition to the real voice pitch information input from the real voice pitch information extraction unit 21, text data such as the reading of the utterance, voiced/unvoiced information, and phoneme durations is input, and interpolation that makes use of this text data is performed. For example, when the text data indicates a vowel sequence section, which is inherently voiced, but a section without pitch exists within it, this is judged to be a pitch dropout or extraction error and the pitch is interpolated.
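
Methods E and G can be illustrated with the following sketch. It assumes a frame-wise pitch contour in Hz in which 0 marks a frame with no extracted pitch, uses the median of the voiced frames as a stand-in for "the surrounding pitch", and takes a per-frame list of booleans derived from the text data to say where voicing is expected. The function names and the `tolerance` parameter are hypothetical.

```python
import statistics

def fix_octave_errors(contour, tolerance=0.25):
    """Method E: halve or double frames that look like double/half pitch."""
    voiced = [f for f in contour if f > 0]
    if not voiced:
        return list(contour)
    med = statistics.median(voiced)
    fixed = []
    for f in contour:
        if f > 0 and abs(f / med - 2.0) <= 2.0 * tolerance:
            f = f / 2.0          # likely double-pitch error
        elif f > 0 and abs(f / med - 0.5) <= 0.5 * tolerance:
            f = f * 2.0          # likely half-pitch error
        fixed.append(f)
    return fixed

def interpolate_gaps(contour, should_be_voiced):
    """Method G: fill pitch dropouts where the text says voicing is expected."""
    out = list(contour)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0 and should_be_voiced[i]:
            j = i
            while j < n and out[j] == 0 and should_be_voiced[j]:
                j += 1
            # Interpolate only when a voiced frame exists on both sides.
            if i > 0 and j < n and out[i - 1] > 0 and out[j] > 0:
                left, right = out[i - 1], out[j]
                span = j - i + 1
                for k in range(i, j):
                    out[k] = left + (right - left) * (k - i + 1) / span
            i = j
        else:
            i += 1
    return out
```

Frames that cannot be repaired this way could instead be flagged and skipped during the similarity calculation, as method E also allows.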

Next, the operation of the pitch pattern generation device of Embodiment 2 is described with reference to the flowchart of FIG. 8. Steps that perform the same processing as in Embodiment 1 are given the same reference symbols as in FIG. 3, and their description is omitted or simplified.
The real voice pitch information extraction unit 21 associates the extracted pitch information with the text data input from the text input unit 12 and outputs it as real voice pitch information to the real voice pitch information processing unit 26 (step ST11).

The real voice pitch information processing unit 26 analyzes the real voice pitch information input in step ST11, applies processing that brings it closer to the accumulated pitch patterns in the pitch pattern dictionary 23, such as correcting the average pitch, correcting the size of the variation range, and correcting portions estimated to be extraction errors, and outputs the result as processed real voice pitch information to the similar pitch pattern search unit 22 (step ST12). When the processed real voice pitch information is input in step ST12, the similarity calculation unit 22a of the similar pitch pattern search unit 22 reads the accumulated pitch patterns from the pitch pattern dictionary 23 and calculates the similarity between each accumulated pitch pattern read and the input processed real voice pitch information (step ST13).
The subsequent processing is the same as in Embodiment 1, so its description is omitted.
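
Steps ST12 and ST13 might look like the sketch below. It assumes the dictionary is a mapping from a label to a single pitch contour in Hz, applies only the average-pitch correction in ST12, and uses a mean absolute difference over the overlapping frames as the distance in ST13. These simplifications and all names are assumptions, not the patent's implementation.

```python
def mean(values):
    return sum(values) / len(values)

def process_voice(voice, dictionary):
    """ST12: shift the voice's average pitch to the dictionary's average."""
    dict_mean = mean([f for contour in dictionary.values() for f in contour])
    shift = dict_mean - mean(voice)
    return [f + shift for f in voice]

def rank_by_similarity(processed, dictionary):
    """ST13: order dictionary labels from most to least similar."""
    def distance(contour):
        # Only the overlapping frames are compared (a simplification).
        n = min(len(processed), len(contour))
        return sum(abs(a - b) for a, b in zip(processed, contour)) / n
    return sorted(dictionary, key=lambda name: distance(dictionary[name]))
```

For a low-pitched rising utterance such as `[100, 120, 140]` and a dictionary of higher-pitched patterns, the shift in ST12 lets the rising pattern rank first even though the raw pitch values are far apart.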

As described above, Embodiment 2 is provided with the real voice pitch information processing unit 26, which processes the real voice pitch information to bring it closer to the accumulated pitch patterns in the pitch pattern dictionary 23, and uses the processed real voice pitch information for the similar pitch pattern search. Even real voice pitch information of a user that differs greatly from the accumulated pitch patterns in the pitch pattern dictionary 23, for example in the speaker's sex or speaking style, can therefore be used for the similar pitch pattern search.

Also, according to Embodiment 2, pitch values in the input real voice pitch information that are considered to be extraction errors are corrected, and the result is used as processed real voice pitch information for the similar pitch pattern search. This improves the accuracy of the similar pitch pattern search based on real voice pitch information from users or recording environments for which stable utterances or pitch cannot be obtained, and thereby reduces the burden on the user, such as having to read the text again.

Furthermore, according to Embodiment 2, the real voice pitch information is processed based on preset parameters rather than on its own average value or variation range. Even when the user deliberately changes the pitch (higher or lower, faster or slower, and so on) when speaking, the information can therefore be processed in a way that reflects that change before being used for the similar pitch pattern search.

Furthermore, according to Embodiment 2, the real voice pitch information is processed using conversion policies that compensate for pitch characteristics that tend to degrade in the user's utterances (for example, a tendency toward exaggerated intonation). The real voice pitch information can therefore be brought closer to the accumulated pitch patterns in the pitch pattern dictionary 23 without the user having to change his or her usual way of speaking.

Furthermore, according to Embodiment 2, text data such as the reading of the utterance, voiced/unvoiced information, and phoneme durations is input in addition to the real voice pitch information, and correction is performed in combination with this text data. It is therefore possible to detect the presence of pitch extraction errors from a mismatch in the number of unvoiced or voiced sound sections, and to identify and estimate the locations of pitch extraction errors within voiced sound sections from the phoneme durations, which improves the correction accuracy.

Although Embodiment 2 shows a configuration in which the real voice pitch information processing unit 26 processes the real voice pitch information, a means by which the user processes the real voice pitch information may also be provided. This allows the user to correct, as appropriate, real voice pitch information affected by utterance errors or pitch extraction errors.

Although Embodiment 2 shows a configuration in which one kind of processing is applied to the real voice pitch information, it may also be configured to apply a plurality of kinds of processing. This increases the number and variety of similar pitch patterns retrieved in the similar pitch pattern search, making it easier to generate a pitch pattern close to what the user wants.

Although Embodiment 2 has been described using pitch patterns, the invention is not limited to this; for example, the same processing can also be applied to parameters for controlling a pitch pattern.

Although Embodiment 2 has been described using pitch information and pitch patterns that handle pitch only, it is also applicable to pitch information and pitch patterns that combine pitch with other prosodic features such as phoneme duration and power.

Embodiment 3.
Embodiments 1 and 2 show configurations in which the similar pitch patterns retrieved by the similar pitch pattern search unit 22 are presented to the user, or output as the pitch pattern, without modification. Embodiment 3 shows a configuration in which the similar pitch patterns retrieved by the similar pitch pattern search unit 22 are processed to bring them closer to the real voice pitch information before being presented to the user or output as the pitch pattern.

FIG. 9 is a block diagram showing the configuration of the pitch (prosody) pattern generation device according to Embodiment 3, in which a pitch pattern candidate generation unit (prosody pattern candidate generation unit) 27 is added to the pitch pattern generation device 2 of Embodiment 2. In the following, components identical or equivalent to those of the pitch pattern generation devices according to Embodiments 1 and 2 are given the same reference numerals as in those embodiments, and their description is omitted or simplified.

Based on the real voice pitch information input from the real voice pitch information extraction unit 21, the similar pitch pattern search unit 22 searches the pitch pattern dictionary 23 for one or more accumulated pitch patterns similar to part or all of the real voice pitch information, and outputs the retrieved accumulated pitch patterns as similar pitch patterns to the pitch pattern candidate generation unit 27. The pitch pattern candidate generation unit 27 processes the similar pitch patterns input from the similar pitch pattern search unit 22, generates one or more pitch pattern candidates close to the real voice pitch information of the input voice data, treats each candidate as a new similar pitch pattern, and outputs it to the similar pitch pattern presentation unit 24.

Next, the operation of the pitch pattern generation device according to the third embodiment will be described with reference to the flowchart of FIG. 10. Steps that perform the same processing as in the second embodiment are given the same reference numerals as used in FIG. 8, and their description is omitted or simplified.
Based on the similarity calculated in step ST13, the similar pitch pattern search unit 22 searches for one or more pitch patterns that are similar to a part of the real voice pitch information (for example, a unit of consecutive voiced sound sections) or to the whole, and outputs them to the pitch pattern candidate generation unit 27 as similar pitch patterns (step ST21).

The pitch pattern candidate generation unit 27 processes the similar pitch patterns input in step ST21, generates a similar pitch pattern close to the real voice pitch information, and outputs it to the similar pitch pattern presentation unit 24 (step ST22).
The subsequent processing is the same as in the first and second embodiments, so its description is omitted.

Next, a specific example of the pitch pattern candidate generation process is shown. FIG. 11 is an explanatory diagram showing the pitch pattern candidate generation process of the pitch (prosodic) pattern generation device according to the third embodiment.
In the pitch pattern candidate generation process, for example, a plurality of similar pitch patterns are connected to generate one or more pitch pattern candidates close to the real voice pitch information of the voice data input to the input device 1. In the example of FIG. 11, when real voice pitch information for the utterance 「インターチェンジ」 ("interchange") is input, the similar pitch pattern search unit 22 retrieves two similar pitch patterns from the pitch pattern dictionary 23. The first is the pattern of the utterance 「演奏」 ("performance"), which has two voiced sound sections and is similar to the real voice pitch information of 「インター」 ("inter"), corresponding to the first and second voiced sound sections. The second is the pattern of the utterance 「ハイツ」 ("heights"), which has two voiced sound sections and is similar to the real voice pitch information of 「チェンジ」 ("change"), corresponding to the third and fourth voiced sound sections. Both are output to the pitch pattern candidate generation unit 27. The pitch pattern candidate generation unit 27 connects the two input similar pitch patterns to obtain a new similar pitch pattern close to the real voice pitch information of 「インターチェンジ」.
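The FIG. 11 example can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the distance measure (mean absolute pitch difference after linear resampling) is a stand-in, since this passage does not say how the similarity of step ST13 is computed; the function names, the romanized keys `ensou` (演奏) and `haitsu` (ハイツ), and all pitch values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def resample(contour: Sequence[float], n: int) -> List[float]:
    """Linearly resample a pitch contour to n points."""
    if n == 1 or len(contour) == 1:
        return [contour[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(contour) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def distance(target: Sequence[float], stored: Sequence[float]) -> float:
    """Assumed dissimilarity: mean absolute pitch difference (Hz)."""
    aligned = resample(stored, len(target))
    return sum(abs(t - s) for t, s in zip(target, aligned)) / len(target)


def build_candidate(
    parts: Sequence[Sequence[float]], dictionary: Dict[str, List[float]]
) -> Tuple[List[str], List[float]]:
    """Pick the closest stored pattern for each part, then connect them."""
    names = [
        min(dictionary, key=lambda k: distance(part, dictionary[k]))
        for part in parts
    ]
    contour: List[float] = []
    for name in names:
        contour.extend(dictionary[name])  # plain connection, no smoothing
    return names, contour


# Hypothetical dictionary entries and real-voice parts (values in Hz).
dictionary = {
    "ensou": [120.0, 140.0, 150.0, 145.0],
    "haitsu": [150.0, 130.0, 110.0, 100.0],
    "flat": [125.0, 125.0, 125.0, 125.0],
}
parts = [
    [118.0, 138.0, 152.0, 146.0],  # stands in for the "inter" part
    [148.0, 128.0, 112.0, 98.0],   # stands in for the "change" part
]
names, contour = build_candidate(parts, dictionary)
```

With these made-up values the two parts select `ensou` and `haitsu`, and the candidate is simply their concatenation. A real device would work on voiced sound sections extracted from speech rather than on hand-written lists.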

In addition to connecting similar pitch patterns based on the number of voiced sound sections as shown in FIG. 11, the following processing methods can be applied.
a. If real voice pitch information and pitch patterns in the pitch pattern dictionary 23 segmented in units of phonemes, syllables, words, or sentences are available, similar pitch patterns are connected using those units.
b. Instead of connecting the pitch patterns as they are, the pitch may be modified by smoothing or interpolation before connection so that the transition at the connection portion is smooth.
c. The similar pitch patterns to be connected are modified before connection so that their average pitch, pitch variation range, and speaking rate join well.
d. In addition to the text data of the pitch pattern to be generated, the preceding and following text data are given, and the pitch at the connection portions between the similar pitch pattern and the preceding and following pitch patterns is processed, for example by smoothing.
e. A means by which the user corrects the similar pitch pattern is provided, for example for local pitch correction or for interpolation by weighted addition of the real voice pitch information and the similar pitch pattern.
f. Taking as input the location pointed out by the user (time information, phoneme information, etc.) and the content pointed out by the user (accent position, voice pitch, intonation, etc.), the similar pitch pattern is processed by applying a difference corresponding to the indicated location and content.
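Items b and e above can be illustrated with a short sketch. The text names smoothing or interpolation and weighted addition but gives no formulas, so the linear interpolation across the connection and the single scalar weight used here are assumptions, as are the function names and the pitch values.

```python
from typing import List, Sequence


def join_with_interpolation(
    left: Sequence[float], right: Sequence[float], width: int
) -> List[float]:
    """Connect two contours, replacing `width` frames on each side of the
    connection with a straight line between the nearest untouched frames."""
    if width < 1 or len(left) <= width or len(right) <= width:
        raise ValueError("width must be >= 1 and shorter than both contours")
    joined = list(left) + list(right)
    start = len(left) - width - 1  # last untouched frame of the left contour
    end = len(left) + width        # first untouched frame of the right contour
    for i in range(start + 1, end):
        t = (i - start) / (end - start)
        joined[i] = joined[start] * (1 - t) + joined[end] * t
    return joined


def blend(
    real_voice: Sequence[float], pattern: Sequence[float], weight: float
) -> List[float]:
    """Weighted addition: weight=1.0 gives the real-voice contour,
    weight=0.0 gives the similar pitch pattern unchanged."""
    if len(real_voice) != len(pattern):
        raise ValueError("contours must have the same number of frames")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [weight * r + (1 - weight) * p for r, p in zip(real_voice, pattern)]


# Hypothetical contours in Hz.
smoothed = join_with_interpolation([100.0] * 4, [200.0] * 4, width=1)
mixed = blend([100.0, 200.0], [200.0, 100.0], weight=0.25)
```

In `smoothed`, the step from 100 Hz to 200 Hz at the connection becomes a ramp over the two frames next to it, while frames further away are unchanged. In `mixed`, a weight of 0.25 keeps the result closer to the similar pitch pattern than to the real voice.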

As described above, according to the third embodiment, the device is provided with the pitch pattern candidate generation unit 27, which combines partial similar pitch patterns retrieved by the similar pitch pattern search unit 22, processes them to approximate the real voice pitch information of the input voice data, and treats the result as one of the similar pitch patterns. As a result, even with a pitch pattern dictionary of the same size, the variety of similar pitch patterns increases, and a similar pitch pattern close to the real voice pitch information is more easily obtained.

Further, according to the third embodiment, when partial similar pitch patterns are combined after the pitch has been modified so that the transition at the connection portion is smooth, the sense of discontinuity caused by connecting similar pitch patterns taken from utterances with different surrounding linguistic environments can be eliminated.

Further, according to the third embodiment, when partial similar pitch patterns are combined after the average pitch, pitch variation range, and speaking rate of each similar pitch pattern have been modified so that they join well, a well-connected and highly natural similar pitch pattern can be generated even when connecting similar pitch patterns that are highly similar to the real voice pitch information but differ in intonation or other respects.
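One possible way to make the average pitch and pitch variation range of two patterns agree before connection is a linear shift and scale, sketched below. The source does not specify the transformation, so this is an assumption; the function name and the values are hypothetical, and speaking-rate adjustment (time scaling) is not covered.

```python
from typing import List, Sequence


def match_mean_and_range(
    pattern: Sequence[float], target_mean: float, target_range: float
) -> List[float]:
    """Shift and scale a contour so that its mean equals target_mean and,
    for a non-flat contour, its max-min range equals target_range."""
    mean = sum(pattern) / len(pattern)
    span = max(pattern) - min(pattern)
    scale = target_range / span if span > 0 else 1.0  # leave flat contours flat
    return [(p - mean) * scale + target_mean for p in pattern]


# Hypothetical example: a pattern from a higher-pitched utterance is brought
# to the mean (120 Hz) and range (20 Hz) of the pattern it will follow.
adjusted = match_mean_and_range([200.0, 220.0, 240.0], 120.0, 20.0)
```

Because the transformation is linear, the shape of the contour (rise, fall, accent position) is preserved while its register and excursion are changed.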

Further, according to the third embodiment, when pitch pattern candidate generation is configured so that the text data preceding and following the similar pitch pattern to be generated are given in addition to its own text data, and the pitch at the connection portions between the generated similar pitch pattern and the preceding and following pitch patterns is processed to improve connectivity, the surrounding linguistic environment of the sentence in which the generated pitch pattern is actually used is taken into account, and a well-connected and highly natural similar pitch pattern can be generated.

Further, according to the third embodiment, when pitch pattern candidate generation is configured to include a means by which the user corrects the similar pitch pattern, the pitch pattern desired by the user can be generated easily.

Note that when a pitch pattern generated by the pitch pattern candidate generation unit 27 is selected, it may be stored in the pitch pattern dictionary 23 and used in subsequent searches as a new pitch pattern. This increases the variety of pitch patterns and allows pitch patterns reflecting the user's individuality to be added to the pitch pattern dictionary 23.

Although the third embodiment shows a configuration in which the pitch pattern candidate generation unit 27 is added to the pitch pattern generation device of the second embodiment, the pitch pattern candidate generation unit 27 may instead be added to the pitch pattern generation device of the first embodiment.

Although the third embodiment has been described using pitch patterns, the invention is not limited to this; for example, the same processing can also be applied to parameters for controlling the pitch pattern.

Although the third embodiment has been described using pitch information and pitch patterns that handle pitch only, it is also applicable to pitch information and pitch patterns that combine pitch with other prosodic features such as phoneme duration and power.

DESCRIPTION OF SYMBOLS: 1 input device, 2 pitch pattern generation device, 3 output device, 11 voice input unit, 12 text input unit, 13 pitch pattern selection unit, 21 real voice pitch information extraction unit, 21a additional information acquisition unit, 21b segmentation unit, 21c pitch pattern generation designation unit, 22 similar pitch pattern search unit, 22a similarity calculation unit, 23 pitch pattern dictionary, 24 similar pitch pattern presentation unit, 24a output conversion unit, 24b text data synthesis unit, 24c trigger input unit, 24d narrowing unit, 25 pitch pattern output unit, 26 real voice pitch information processing unit, 27 pitch pattern candidate generation unit, 31 similar pitch pattern output unit, 100 speech synthesis system.

Claims

1. A prosodic pattern generation device comprising:
a real voice prosody information extraction unit that accepts input of voice data and text data, extracts prosody information from the voice data, and generates real voice prosody information in which the prosody information is associated with the text data;
a prosodic pattern dictionary storing a plurality of prosodic patterns;
a similar prosodic pattern search unit that searches the prosodic pattern dictionary for one or more prosodic patterns similar to part or all of the real voice prosody information and outputs them as similar prosodic patterns;
a real voice prosody information processing unit that processes the real voice prosody information input from the real voice prosody information extraction unit so that it approximates the prosodic patterns stored in the prosodic pattern dictionary, and outputs the processed information to the similar prosodic pattern search unit;
a similar prosodic pattern presentation unit that converts the similar prosodic patterns into a user-recognizable format, presents them to the user, and requests the user to select a similar prosodic pattern; and
a prosodic pattern output unit that outputs the similar prosodic pattern selected by the user from among the similar prosodic patterns presented by the similar prosodic pattern presentation unit.

2. A prosodic pattern generation device comprising:
a real voice prosody information extraction unit that accepts input of voice data and text data, extracts prosody information from the voice data, and generates real voice prosody information in which the prosody information is associated with the text data;
a prosodic pattern dictionary storing a plurality of prosodic patterns;
a similar prosodic pattern search unit that searches the prosodic pattern dictionary for one or more prosodic patterns similar to part or all of the real voice prosody information and outputs them as similar prosodic patterns;
a similar prosodic pattern presentation unit that converts the similar prosodic patterns into a user-recognizable format, presents them to the user, and requests the user to select a similar prosodic pattern;
a prosodic pattern candidate generation unit that processes the similar prosodic patterns to generate one or more prosodic pattern candidates approximating the real voice prosody information, and outputs the prosodic pattern candidates to the similar prosodic pattern presentation unit as similar prosodic patterns; and
a prosodic pattern output unit that outputs the similar prosodic pattern selected by the user from among the similar prosodic patterns presented by the similar prosodic pattern presentation unit.

3. The prosodic pattern generation device according to claim 1, wherein the similar prosodic pattern presentation unit generates and presents synthesized speech based on the similar prosodic pattern.