JP4829477B2

JP4829477B2 - Voice quality conversion device, voice quality conversion method, and voice quality conversion program

Info

Publication number: JP4829477B2
Application number: JP2004079079A
Authority: JP
Inventors: 康行三井
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-03-18
Filing date: 2004-03-18
Publication date: 2011-12-07
Anticipated expiration: 2024-03-18
Also published as: JP2005266349A

Abstract

PROBLEM TO BE SOLVED: To provide a voice quality conversion method that allows free conversion without depending on the utterance content of the conversion target speaker.

SOLUTION: A speech signal of the target speaker is input to an input part 11, and a phonetic symbol string whose utterance content is the same as or similar to that of the target speaker's voice is input to a phonetic symbol string input part 12. A speech synthesis part 14 generates a synthesized sound according to the input phonetic symbol string, using the speech synthesis database held in a speech synthesis data storage part 13. A feature parameter extraction part 15 extracts feature parameters by analyzing the target speaker's voice, and a feature parameter extraction part 16 extracts feature parameters by analyzing the generated synthesized sound. A conversion function generation part 17 uses both sets of extracted feature parameters to identify a function that converts the spectrum shape of the synthesized sound into the spectrum shape of the target speaker's voice. A voice quality conversion part 18 converts the voice quality of an input signal with the identified conversion function.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a voice quality conversion device, a voice quality conversion method, and a voice quality conversion program.

Voice quality conversion techniques, which convert speech uttered by one speaker into speech having the voice quality of another speaker, have long been studied. For example, Patent Document 1 discloses a technique for converting the voice quality of speaker A into that of speaker B. FIG. 38 shows the voice quality conversion method of Patent Document 1. In this method, the speech of speaker A is analyzed by LPC analysis 101 and quantized by vector quantization 105 using speaker A's codebook 103. Likewise, the speech of speaker B is analyzed by LPC analysis 102 and quantized by vector quantization 106 using speaker B's codebook 104. The two quantized data sequences are aligned on the time axis and associated (mapped) by DP-matching association 107. Based on this mapping, conversion codebook creation 108, which uses a histogram, produces a spectrum conversion codebook 109. The voice quality of speaker A is then converted using the spectrum conversion codebook 109.
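The histogram-based codebook creation described above can be illustrated roughly as follows. This is only a sketch under assumptions: the text does not say how the histogram is turned into a conversion codebook, so the weighted-average rule, the function name `build_mapping_codebook`, and the data layout are all hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def build_mapping_codebook(aligned_pairs, codebook_b):
    """Build a conversion codebook from DP-aligned code indices.

    aligned_pairs: iterable of (index_a, index_b), one pair per aligned frame.
    codebook_b:    list of speaker B code vectors (lists of floats).
    Returns {index_a: vector}, where each vector is the average of speaker B
    code vectors weighted by how often they were aligned with index_a.
    (The weighting rule is an assumption, not taken from the source.)
    """
    histogram = defaultdict(Counter)
    for index_a, index_b in aligned_pairs:
        histogram[index_a][index_b] += 1

    dim = len(codebook_b[0])
    mapping = {}
    for index_a, counts in histogram.items():
        total = sum(counts.values())
        mapping[index_a] = [
            sum(n * codebook_b[index_b][d] for index_b, n in counts.items()) / total
            for d in range(dim)
        ]
    return mapping
```

Each entry of speaker A's codebook is thus replaced by a vector expressed in speaker B's acoustic space; converting an utterance then amounts to quantizing with A's codebook and looking up the mapped vectors.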

More recently, voice quality conversion has also been studied for speech synthesizers, which output the content of input text as speech using a speech synthesis database generated and accumulated from the speech data of one or more speakers, in order to synthesize speech with a desired voice quality. The advantage of this approach is that generating synthesized speech with a given voice quality does not require recording speech data from a speaker with that voice quality and building and storing a database each time. If the databases of one or more standard speakers are stored in advance, synthesized speech with the desired voice quality can be generated by converting such a database so that it has the target voice quality.

For example, Patent Document 2 discloses a technique that creates a mapping codebook between a prestored speech database and the speech of a target speaker, converts the speech database using this mapping codebook as the conversion function, and uses the converted database as a database for text-to-speech synthesis. FIG. 39 shows the voice quality conversion method of Patent Document 2. This voice quality conversion device stores in advance a speech database 203 containing acoustic feature parameters of a plurality of registered speakers, together with its codebook 204. The device comprises the following means. Selection means 201 selects, based on an input speech signal of at least one word from the target speaker, the registered speaker closest to the target speaker whose voice quality is to be reproduced. Generation means 202 calculates a mapping codebook 205 from the selected speaker to the target speaker by computing the difference between the acoustic space of the selected speaker and that of the target speaker. Mapping processing means 206, based on an input character string to be synthesized, quantizes the acoustic feature parameters of the selected speaker's speech stored in the speech database 203 using the selected speaker's codebook, and generates the acoustic feature parameters of the target speaker's speech signal corresponding to the character string from the correspondence between the selected speaker's codebook and the mapping codebook. Speech synthesis means 207 generates and outputs the target speaker's speech signal corresponding to the character string from the acoustic feature parameters generated by the mapping processing means 206. The generation means 202 calculates the mapping codebook from the selected speaker to the target speaker using the moving vector field smoothing method.

General techniques of speech signal processing and speech recognition are described in Non-Patent Document 1 and Non-Patent Document 2.

Patent Document 1: JP-A-1-97997 (FIG. 4)
Patent Document 2: JP-A-8-248994 (FIG. 1)
Non-Patent Document 1: Sadaoki Furui, "Digital Speech Processing", Tokai University Press, 1985
Non-Patent Document 2: Seiichi Nakagawa, "Speech Recognition by Probabilistic Models", IEICE, 1988

In the voice quality conversion method disclosed in Patent Document 1, converting voice quality from source speaker A to target speaker B requires creating the conversion function from original speech data (or feature data) in which A and B utter exactly the same content. Consequently, carrying out voice quality conversion by this method requires speech data of identical content from speaker A and speaker B every time, so free voice quality conversion was not possible.

In the voice quality conversion and speech synthesis method disclosed in Patent Document 2, the mapping codebook is obtained by taking the difference between the acoustic space of the target speaker's speech and that of the speech database and applying the moving vector field smoothing method. When the database contains overwhelmingly more information than the target speaker's speech, a very sparse set of correspondences must therefore be smoothed to produce the overall mapping. As a result, the correspondences are inaccurate, and the converted speech suffers from degraded sound quality or low similarity to the voice quality of the target speech.

Accordingly, a main object of the present invention is to provide a device, a method, and a program that realize free voice quality conversion that does not depend on the utterance content of the conversion target speaker.

Another object of the present invention is to provide a device, a method, and a program that make it possible to identify a conversion function capable of converting to a desired voice quality with high accuracy.

Still another object of the present invention is to provide a device, a method, and a program that make it possible to automatically generate a synthesized sound having the same or similar utterance content as the conversion target speaker's voice.

To achieve the above objects, a voice quality conversion device according to a first aspect of the present invention comprises: a target speaker voice input unit that inputs the speech of a speaker serving as the conversion target (referred to as "target speaker voice"); a speech synthesis data storage unit that stores data for speech synthesis; and a phonetic symbol input unit that inputs a phonetic symbol string describing utterance content that is the same as or similar to that of the input target speaker voice. The device also comprises: a speech synthesis unit that synthesizes and outputs speech according to the input phonetic symbol string, based on the speech synthesis data stored in the speech synthesis data storage unit; a synthesized sound feature parameter extraction unit that extracts feature parameters of the speech signal output from the speech synthesis unit (referred to as "synthesized sound"); and a target speaker voice feature parameter extraction unit that extracts feature parameters of the target speaker voice. The device further comprises a conversion function generation unit that receives the feature parameters of the synthesized sound from the synthesized sound feature parameter extraction unit and the feature parameters of the target speaker voice from the target speaker voice feature parameter extraction unit, obtains, in the space representing the feature parameters, the correspondence on the time axis between a first portion of the synthesized sound and a second portion of the target speaker voice, and identifies a conversion function that converts the spectrum shape in the first portion of the synthesized sound into the spectrum shape in the second portion of the target speaker voice.

A voice quality conversion method according to a second aspect of the present invention is a method of converting voice quality with a voice quality conversion device. The method includes the steps of: inputting the speech of a speaker serving as the voice quality conversion target (referred to as "target speaker voice"); and inputting a phonetic symbol string describing the utterance content of the target speaker voice and creating a synthesized sound from the phonetic symbol string using speech synthesis data stored in advance in a storage unit. The method further includes the steps of: analyzing the synthesized sound and the target speaker voice and extracting the feature parameters of each; associating the feature parameters of the target speaker voice with those of the synthesized sound on the time axis; and generating, based on the associated feature parameters of the target speaker voice and the synthesized sound, a conversion function for converting the spectrum shape of the synthesized sound into the spectrum shape of the target speaker voice.

A voice quality conversion method according to a third aspect of the present invention is likewise a method of converting voice quality with a voice quality conversion device. The method includes the steps of: inputting the speech of a speaker serving as the voice quality conversion target (referred to as "target speaker voice"); performing speech recognition on the input target speaker voice; and generating, from the speech recognition result, a phonetic symbol string describing the utterance content of the target speaker voice. The method then repeats the following steps until a convergence condition is met: (a) creating a synthesized sound from the phonetic symbol string using speech synthesis data stored in advance in a storage unit; (b) analyzing the synthesized sound and the target speaker voice and extracting the feature parameters of each; (c) associating the feature parameters of the target speaker voice with those of the synthesized sound on the time axis; (d) generating, based on the two associated sets of feature parameters, a conversion function that converts the spectrum shape of the synthesized sound into the spectrum shape of the target speaker voice; (e) converting the voice quality of the speech to be converted using the generated conversion function; and (f) storing the voice quality conversion result in the storage unit as data for speech synthesis.
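The repeated steps (a) to (f) can be pictured as a simple control loop. The sketch below is illustrative only: every callable passed in (`synthesize`, `extract`, `align`, `fit`, `convert`) is a hypothetical stand-in for the corresponding step, features are assumed to be scalars, and the convergence condition (mean absolute gap between aligned features below `tol`) is an assumption, since the text does not define one here.

```python
def refine_until_converged(target_feats, symbols, store,
                           synthesize, extract, align, fit, convert,
                           tol=0.01, max_iter=50):
    """Repeat steps (a) to (f) until the assumed convergence test passes.

    Returns (store, iterations): the final speech synthesis data and the
    number of loop passes that were started.
    """
    for iteration in range(1, max_iter + 1):
        synth_feats = extract(synthesize(symbols, store))   # steps (a), (b)
        pairs = align(synth_feats, target_feats)            # step (c)
        # Assumed convergence test: mean absolute gap of aligned features.
        gap = sum(abs(t - s) for s, t in pairs) / len(pairs)
        if gap < tol:
            return store, iteration
        conversion = fit(pairs)                             # step (d)
        store = convert(store, conversion)                  # steps (e), (f)
    return store, max_iter
```

Because the converted result is written back as synthesis data in step (f), each pass synthesizes from data that is already closer to the target, which is why a loop with a stopping test is a natural fit.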

A voice quality conversion program according to a fourth aspect of the present invention is a program executed by a computer constituting a voice quality conversion device. The program causes the computer to execute: a process of inputting the speech of a speaker serving as the voice quality conversion target (referred to as "target speaker voice"); and a process of inputting a phonetic symbol string describing the utterance content of the target speaker voice and creating a synthesized sound from the phonetic symbol string using speech synthesis data stored in advance in a storage unit. The program further causes the computer to execute: a process of analyzing the synthesized sound and the target speaker voice and extracting the feature parameters of each; a process of associating the feature parameters of the target speaker voice with those of the synthesized sound on the time axis; and a process of generating, based on the associated feature parameters of the target speaker voice and the synthesized sound, a conversion function for converting the spectrum shape of the synthesized sound into the spectrum shape of the target speaker voice.

A voice quality conversion program according to a fifth aspect of the present invention is likewise a program executed by a computer constituting a voice quality conversion device. The program causes the computer to execute: a process of inputting the speech of a speaker serving as the voice quality conversion target (referred to as "target speaker voice"); a process of performing speech recognition on the input target speaker voice; and a process of generating, from the speech recognition result, a phonetic symbol string describing the utterance content of the target speaker voice. The program further causes the computer to repeat the following processes until a convergence condition is met: (a) creating a synthesized sound from the phonetic symbol string using speech synthesis data stored in advance in a storage unit; (b) analyzing the synthesized sound and the target speaker voice and extracting the feature parameters of each; (c) associating the feature parameters of the target speaker voice with those of the synthesized sound on the time axis; (d) generating, based on the two associated sets of feature parameters, a conversion function that converts the spectrum shape of the synthesized sound into the spectrum shape of the target speaker voice; (e) converting the voice quality of the speech to be converted using the generated conversion function; and (f) storing the voice quality conversion result in the storage unit as data for speech synthesis.

According to the present invention, conversion to the voice quality of the target speaker voice can be performed without preparing in advance a source speech having the same or similar utterance content as the target speaker voice. The burden on the user is therefore reduced. This is because inputting a phonetic symbol string representing the same or similar utterance content as the target speaker voice makes it possible to generate a synthesized sound with that utterance content.

In addition, the training speech can be aligned, and the conversion function for voice quality conversion can be identified, with high accuracy. Consequently, high-quality speech whose voice quality is closer to the desired one than in the prior art is obtained. One reason is that estimating the data used for creating the synthesized sound from information on the target speaker voice allows the synthesized sound to be created in a way that makes the conversion function to the target speaker voice easy to identify. Another reason is that storing the data used when creating the synthesized sound allows this data to be used for generating the conversion function, in addition to the feature parameters extracted from the speech.

Furthermore, a synthesized sound having the same or similar utterance content as the target speaker voice can be created without preparing in advance a phonetic symbol string representing that utterance content. Processing is therefore automated and becomes faster. This is because performing speech recognition on the target speaker voice automatically generates a phonetic symbol string representing the same or similar utterance content as the target speaker voice.

[First Embodiment]
FIG. 1 shows a block diagram of a voice quality conversion device according to the first embodiment of the present invention. The voice quality conversion device includes a target speaker voice input unit 11, a phonetic symbol string input unit 12, a speech synthesis data storage unit 13, a speech synthesis unit 14, a target speaker voice feature parameter extraction unit 15, a synthesized sound feature parameter extraction unit 16, a conversion function generation unit 17, and a voice quality conversion unit 18.

The target speaker voice input unit 11 inputs speech data having the target voice quality. The phonetic symbol string input unit 12 inputs a phonetic symbol string describing the utterance content of the speech to be created. The input phonetic symbol string is assumed to be written so that speech with the same or similar utterance content as the target speaker voice input from the target speaker voice input unit 11 is synthesized. The speech synthesis data storage unit 13 stores data used by the speech synthesis unit 14, such as units of speech or syllables and duration information. The speech synthesis unit 14 calculates data for synthesis from the data held in the speech synthesis data storage unit 13 according to the data corresponding to the phonetic symbol string, and uses it to synthesize and output speech.

The target speaker voice feature parameter extraction unit 15 performs spectrum analysis on the speech data input from the target speaker voice input unit 11 and extracts feature parameters. The synthesized sound feature parameter extraction unit 16 performs spectrum analysis on the speech data input from the speech synthesis unit 14 and extracts feature parameters. The feature parameters extracted from the target speaker voice and the synthesized sound include at least one of, for example, LPC (linear predictive coding) coefficients as described in Non-Patent Document 1, formant frequencies and bandwidths, and prosodic data. The conversion function generation unit 17 identifies a function that converts the voice quality of the synthesized sound into the voice quality of the target speaker voice, using the feature parameters extracted from the target speaker voice and those extracted from the synthesized sound. The voice quality conversion unit 18 converts the input signal to be converted using the identified conversion function and outputs the converted output signal.
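As one concrete example of the feature parameters named above, LPC coefficients for a single analysis frame can be computed with the standard autocorrelation method and the Levinson-Durbin recursion. The sketch below is a generic textbook procedure, not something the patent prescribes; the function names, the absence of windowing and pre-emphasis, and the sign convention are assumptions made for illustration.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the autocorrelation of one analysis frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for the given autocorrelation values.

    Returns (a, err), where a[0] == 1 and sample x[n] is predicted as
    -(a[1]*x[n-1] + ... + a[order]*x[n-order]); err is the residual energy.
    """
    a = [1.0] + [0.0] * order
    err = float(r[0])
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err


def lpc_coefficients(frame, order):
    """Convenience wrapper: frame samples to LPC coefficients."""
    return levinson_durbin(autocorrelation(frame, order), order)[0]
```

In a real front end the frame would normally be windowed first, and the same routine would be applied to both the target speaker voice and the synthesized sound so that the two feature sequences are directly comparable.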

Next, the operation of the voice quality conversion device according to the first embodiment of the present invention is described with reference to FIG. 1 and FIG. 13. FIG. 13 is a flowchart showing the operation of the voice quality conversion device according to the first embodiment. First, speech data of a speaker having the voice quality desired by the user is prepared as the target speech and input to the target speaker voice input unit 11 (step S11). Next, a phonetic symbol string is created so that it has the same or similar utterance content as the target speaker voice input to the target speaker voice input unit 11, and is input to the phonetic symbol string input unit 12. Data for synthesis is calculated from the speech synthesis data storage unit 13 according to the data corresponding to the input phonetic symbol string, and the speech synthesis unit 14 generates a synthesized sound. The durations of the target speaker voice and the synthesized sound do not have to match. The rules associating speech synthesis data with phonetic symbols are created in advance so that, when a phonetic symbol is specified, the speech synthesis data corresponding to that symbol is calculated from the speech synthesis data stored in the speech synthesis data storage unit 13, such as speech segment data and duration information (step S12).

The target speaker voice feature parameter extraction unit 15 analyzes the target speaker voice and extracts its feature parameters. The synthesized sound feature parameter extraction unit 16 analyzes the synthesized sound and extracts its feature parameters (step S13). To correct the time-axis mismatch between the synthesized sound and the target speech, the two are aligned on the time axis using the feature parameters extracted from the target speaker voice and those extracted from the synthesized sound. One possible alignment method is time-axis warping by DP matching (step S14). After all the training data (pairs of target speaker voice and synthesized sound) have been aligned on the time axis, the conversion function generation unit 17 uses the time-aligned feature parameters to create a function that converts the voice quality of the synthesized sound into the voice quality of the target speaker voice (step S15).
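Steps S14 and S15 can be sketched as follows. The alignment is a basic DP matching (dynamic time warping) over frame-wise feature vectors with a Euclidean local distance. The form of the conversion function is not specified in this passage, so the per-dimension affine fit shown here is purely an assumed stand-in; `dp_match` and `fit_affine` are hypothetical names.

```python
from math import dist, inf


def dp_match(synth, target):
    """Align two feature-vector sequences; return a list of (i, j) frame pairs."""
    n, m = len(synth), len(target)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(synth[i - 1], target[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Backtrack from the end of both sequences to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def fit_affine(xs, ys):
    """Least-squares fit of y = a*x + b for one feature dimension (assumed model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0.0:
        return 1.0, mean_y - mean_x  # constant input: fall back to a pure shift
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var
    return a, mean_y - a * mean_x
```

A typical use would be to run `dp_match` on the two feature sequences, collect the aligned values of each dimension along the returned path, and call `fit_affine` once per dimension. Because the path absorbs duration differences, the two utterances do not need to have the same length.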

Next, to determine which conversion function is applied to which part of the input signal to be converted, the input signal to be converted (segment data) is associated with the created conversion functions (step S16). The input signal to be converted is then converted using the conversion functions associated with it (step S17). The converted data is output as the converted output signal (step S18).
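Steps S16 to S18 amount to a lookup followed by a per-segment transformation. The sketch below assumes, for illustration only, that conversion functions are keyed by a segment label and that segments without a matching function pass through unchanged; neither detail is stated in the text, and `convert_segments` is a hypothetical name.

```python
def convert_segments(segments, functions):
    """Apply the conversion function associated with each segment.

    segments:  list of (label, frames), where frames is a list of feature values.
    functions: dict mapping a label to a callable that converts one frame.
    Segments whose label has no function are passed through unchanged
    (an assumed fallback, not taken from the source).
    """
    converted = []
    for label, frames in segments:
        convert = functions.get(label, lambda frame: frame)   # step S16
        converted.append((label, [convert(f) for f in frames]))  # step S17
    return converted  # step S18: the converted output signal
```

For example, `convert_segments([("a", [1.0, 2.0])], {"a": lambda x: 2 * x + 1})` converts only the frames of the segment labelled "a".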

According to the first embodiment of the present invention, a synthesized sound with the same or similar utterance content as the target speaker voice is generated by speech synthesis and then analyzed, so a conversion function can be generated after highly accurate alignment regardless of the utterance content of the target speaker voice. Free voice quality conversion that does not depend on the utterance content of the conversion target speaker can therefore be realized. Because it is no longer necessary to input a source speech matched to the utterance content of the target speaker voice each time, the burden on the user is reduced and processing is faster.

[Second Embodiment]
FIG. 2 shows a block diagram of a voice quality conversion device according to the second embodiment of the present invention. In FIG. 2, the same reference numerals as in FIG. 1 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the second embodiment adds a speech recognition unit 19 to the voice quality conversion device according to the first embodiment. The speech recognition unit 19 performs speech recognition on the target speaker voice input through the target speaker voice input unit 11a and outputs the utterance content of the target speaker voice, together with phonetic symbols, to the phonetic symbol string input unit 12a.

Next, the operation of the voice quality conversion device according to the second embodiment of the present invention is described with reference to FIG. 2 and FIG. 14. In FIG. 14, steps S21 and S23 to S28 operate in the same way as steps S11 and S13 to S18 in FIG. 13, respectively, so only the newly added steps are described. The target speaker voice input from the target speaker voice input unit 11a in step S21 is recognized by the speech recognition unit 19 (step S221), a phonetic symbol string is generated (step S222), and a synthesized sound is created using the generated phonetic symbol string (step S223).

According to the second embodiment of the present invention, the speech recognition result for the target speaker voice is input as the phonetic symbol string for generating the synthesized sound, so the user does not need to enter a phonetic symbol string as text or the like, and processing can be automated. The present invention is effective when many sentences are used as target speaker voice in order to improve alignment accuracy.

[Third Embodiment]
The voice quality conversion device according to the third embodiment of the present invention includes, in addition to the configuration of the voice quality conversion device according to the first embodiment, a segmentation information storage unit that stores the segmentation information used in the speech synthesis unit and outputs it to the conversion function generation unit. With this configuration, the conversion function generation unit uses the segmentation information, which makes it possible to identify a conversion function that converts to the desired voice quality with high accuracy. In the following description, segmentation information consists mainly of phonemes (phoneme labels) associated with time information, and may additionally carry prosodic information such as a pitch pattern or power.

FIG. 3 shows a block diagram of a voice quality conversion device according to the third embodiment of the present invention. In FIG. 3, the target speaker voice input unit 11, the phonetic symbol string input unit 12, the speech synthesis data storage unit 13, the target speaker voice feature parameter extraction unit 15, the synthesized sound feature parameter extraction unit 16, and the voice quality conversion unit 18 are equivalent to those in FIG. 1, and their description is omitted. The voice quality conversion device according to the third embodiment further includes a synthesized sound segmentation information storage unit 20. The synthesized sound segmentation information storage unit 20 stores the segmentation information used when the speech synthesis unit 14a synthesizes speech according to the phonetic symbol string, and outputs it to the conversion function generation unit 17a.

Next, the operation of the voice quality conversion device according to the third embodiment of the present invention will be described with reference to FIG. 3 and FIG. 15. In FIG. 15, steps with the same reference numerals as in FIG. 13 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S12a, the segmentation information 23a used when the speech synthesis unit 14 synthesized the speech is stored in the synthesized sound segmentation information storage unit 20 and output to the conversion function generation unit 17a. The segmentation information 23a input to the conversion function generation unit 17a is used for the time axis association (step S14a) and the creation of the conversion function (step S15a).

According to the third embodiment of the present invention, the segmentation information used to generate the synthesized sound is also used when generating the conversion function. This simplifies the processing for associating the synthesized sound feature parameters with the target speaker voice feature parameters and improves its accuracy.
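As a rough illustration of segmentation information in the sense used here (phoneme labels tied to time information), the sketch below stores each segment with its label and time span and derives a phoneme label for every analysis frame. The `Segment` class, the millisecond units, and the `"sil"` default are assumptions made for this example; the document does not specify a data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    label: str     # phoneme label
    start_ms: int  # segment start time in milliseconds (inclusive)
    end_ms: int    # segment end time in milliseconds (exclusive)


def frame_labels(
    segments: List[Segment], frame_shift_ms: int, n_frames: int, default: str = "sil"
) -> List[str]:
    """Give each analysis frame the label of the segment containing its start time."""
    labels = []
    for i in range(n_frames):
        t = i * frame_shift_ms
        labels.append(
            next((s.label for s in segments if s.start_ms <= t < s.end_ms), default)
        )
    return labels


segments = [Segment("k", 0, 40), Segment("o", 40, 100)]
labels = frame_labels(segments, frame_shift_ms=20, n_frames=6)
# labels == ["k", "k", "o", "o", "o", "sil"]
```

Per-frame labels of this kind could, for instance, restrict the time axis association so that synthesized frames are only matched against target frames of the same phoneme.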

[Fourth Embodiment]
FIG. 4 shows a block diagram of a voice quality conversion device according to the fourth embodiment of the present invention. In FIG. 4, the same reference numerals as those in FIG. 2 denote the same or corresponding parts, and their description is omitted. Compared with FIG. 2, the voice quality conversion device according to the fourth embodiment further includes a synthesized sound segmentation information storage unit 20. The synthesized sound segmentation information storage unit 20 is the same as that described with reference to FIG. 3.

Next, the operation of the voice quality conversion device according to the fourth embodiment of the present invention will be described with reference to FIG. 4 and FIG. 16. In FIG. 16, steps with the same reference numerals as in FIG. 13 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S223a, the segmentation information 23b used when the speech synthesis unit 14a synthesized the speech is output and input to the conversion function generation unit 17a. The input segmentation information 23b is used for the time axis association (step S24a) and the creation of the conversion function (step S25b).

According to the fourth embodiment of the present invention, a phonetic symbol string is generated automatically by speech recognition of the target speaker voice, and the segmentation information used when generating the synthesized sound from that string is also used when generating the conversion function. The processing can therefore be automated, while the processing for associating the synthesized sound feature parameters with the target speaker voice feature parameters is simplified and made more accurate.

[Fifth Embodiment]
FIG. 5 shows a block diagram of a voice quality conversion device according to the fifth embodiment of the present invention. In FIG. 5, the same reference numerals as those in FIG. 1 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the fifth embodiment further includes a segmentation information input unit 22. The segmentation information input unit 22 analyzes the target speaker voice input from the target speaker voice input unit 11 by means not shown, extracts segmentation information 23c, and outputs it to the speech synthesis unit 14b.

Next, the operation of the voice quality conversion device according to the fifth embodiment of the present invention will be described with reference to FIG. 5 and FIG. 17. In FIG. 17, steps with the same reference numerals as in FIG. 13 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S12b, segmentation information 23c based on the target speaker voice input from the target speaker voice input unit 11 is input. The input segmentation information 23c is output to the speech synthesis unit 14b, which synthesizes speech that has the utterance content of the phonetic symbol string input from the phonetic symbol string input unit 12 and the same utterance duration as the target speaker voice.

According to the fifth embodiment of the present invention, the synthesized sound is generated from segmentation information based on the target speaker voice, so the duration information of the generated synthesized sound equals that of the target speaker voice. The time axis association in the conversion function generation unit can therefore be omitted or simplified.

[Sixth Embodiment]
FIG. 6 shows a block diagram of a voice quality conversion device according to the sixth embodiment of the present invention. In FIG. 6, the same reference numerals as those in FIG. 2 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the sixth embodiment further includes a segmentation information input unit 22. The segmentation information input unit 22 analyzes the target speaker voice input from the target speaker voice input unit 11a by means not shown, extracts segmentation information 23c, and outputs it to the speech synthesis unit 14b.

Next, the operation of the voice quality conversion device according to the sixth embodiment of the present invention will be described with reference to FIG. 6 and FIG. 18. In FIG. 18, steps with the same reference numerals as in FIG. 13 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S222a, segmentation information 23c based on the target speaker voice input from the target speaker voice input unit 11a is input. The input segmentation information 23c is output to the speech synthesis unit 14b, which synthesizes speech that has the utterance content of the phonetic symbol string input from the phonetic symbol string input unit 12a and the same utterance duration as the target speaker voice.

According to the sixth embodiment of the present invention, the synthesized sound is generated from segmentation information based on the target speaker voice, so the duration information of the generated synthesized sound equals that of the target speaker voice. The time axis association in the conversion function generation unit can therefore be omitted or simplified. Furthermore, because the phonetic symbol string is generated from the target speaker voice by speech recognition, the processing can be automated, and consistency between the phonetic symbol string and the duration information extracted from the target speaker voice is easier to achieve.

[Seventh Embodiment]
FIG. 7 shows a block diagram of a voice quality conversion device according to the seventh embodiment of the present invention. In FIG. 7, the same reference numerals as those in FIG. 2 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the seventh embodiment further includes a target speaker voice segmentation information storage unit 21. The target speaker voice segmentation information storage unit 21 stores the segmentation information of the target speaker voice input from the target speaker voice input unit 11a and outputs it to the conversion function generation unit 17b.

Next, the operation of the voice quality conversion device according to the seventh embodiment of the present invention will be described with reference to FIG. 7 and FIG. 19. In FIG. 19, steps with the same reference numerals as in FIG. 14 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S221a, the segmentation information 23d extracted as a result of recognizing the target speaker voice signal in the voice recognition unit 19 is output. The segmentation information 23d is output to the conversion function generation unit 17b and is used for the time axis association (step S24a) and the creation of the conversion function (step S25a).

According to the seventh embodiment of the present invention, the segmentation information of the target speaker voice is extracted at the same time as the target speaker voice is recognized, and is input to the conversion function generation unit. Segmentation information can therefore be used for conversion function generation in addition to the feature parameters of the target speaker voice, which simplifies the processing for associating the synthesized sound feature parameters with the target speaker voice feature parameters and improves its accuracy.

[Eighth Embodiment]
FIG. 8 shows a block diagram of a voice quality conversion device according to the eighth embodiment of the present invention. In FIG. 8, the same reference numerals as those in FIG. 2 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the eighth embodiment further includes a target speaker voice segmentation information storage unit 21a. The target speaker voice segmentation information storage unit 21a stores the segmentation information of the target speaker voice input from the target speaker voice input unit 11a and outputs it to the speech synthesis unit 14b.

Next, the operation of the voice quality conversion device according to the eighth embodiment of the present invention will be described with reference to FIG. 8 and FIG. 20. In FIG. 20, steps with the same reference numerals as in FIG. 14 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S221a, the segmentation information 23e extracted as a result of recognizing the target speaker voice in the voice recognition unit 19 is output. In step S223b, the speech synthesis unit 14b receives the segmentation information 23e and synthesizes speech that has the utterance content of the phonetic symbol string input from the phonetic symbol string input unit 12a and is based on the segmentation information 23e of the target speaker voice.

According to the eighth embodiment of the present invention, the segmentation information of the target speaker voice is extracted at the same time as that voice is recognized, and is input to the speech synthesis unit. The synthesized sound can therefore be generated using segmentation information based on the target speaker voice, which simplifies the processing for associating the synthesized sound feature parameters with the target speaker voice feature parameters and improves its accuracy.

[Ninth Embodiment]
FIG. 9 shows a block diagram of a voice quality conversion device according to the ninth embodiment of the present invention. In FIG. 9, the same reference numerals as those in FIG. 2 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the ninth embodiment further includes a target speaker voice segmentation information storage unit 21b. The target speaker voice segmentation information storage unit 21b stores the segmentation information of the target speaker voice input from the target speaker voice input unit 11a and outputs it to the speech synthesis unit 14b and the conversion function generation unit 17b.

Next, the operation of the voice quality conversion device according to the ninth embodiment of the present invention will be described with reference to FIG. 9 and FIG. 21. In FIG. 21, steps with the same reference numerals as in FIG. 14 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S221a, the segmentation information 23f extracted as a result of recognizing the target speaker voice in the voice recognition unit 19 is output. The segmentation information 23f is output to the speech synthesis unit 14b, which synthesizes speech that has the utterance content of the phonetic symbol string input from the phonetic symbol string input unit 12a and is based on the segmentation information 23f of the target speaker voice. The segmentation information 23f is also output to the conversion function generation unit 17b and is used for the time axis association (step S24a) and the creation of the conversion function (step S25a).

According to the ninth embodiment of the present invention, the segmentation information of the target speaker voice is extracted at the same time as that voice is recognized, and is input to both the speech synthesis unit and the conversion function generation unit. The synthesized sound can therefore be generated using segmentation information based on the target speaker voice, and the segmentation information can be used for conversion function generation in addition to the feature parameters of the target speaker voice. This simplifies the processing for associating the synthesized sound feature parameters with the target speaker voice feature parameters and improves its accuracy.

[Tenth Embodiment]
FIG. 10 shows a block diagram of a voice quality conversion device according to the tenth embodiment of the present invention. In FIG. 10, the same reference numerals as those in FIG. 2 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the tenth embodiment further includes a synthesized sound segmentation information storage unit 20 and a target speaker voice segmentation information storage unit 21. The synthesized sound segmentation information storage unit 20 is the same as the one described with reference to FIG. 3, and the target speaker voice segmentation information storage unit 21 is the same as the one described with reference to FIG. 7.

Next, the operation of the voice quality conversion device according to the tenth embodiment of the present invention will be described with reference to FIG. 10 and FIG. 22. In FIG. 22, steps with the same reference numerals as in FIG. 13 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S221a, the segmentation information 23d extracted as a result of recognizing the target speaker voice in the voice recognition unit 19 is output. The segmentation information 23d is output to the conversion function generation unit 17c. In step S223a, the segmentation information 23g used when the speech synthesis unit 14a synthesized the speech is stored and output to the conversion function generation unit 17c. The segmentation information 23f and 23g input to the conversion function generation unit 17c is used for the time axis association (step S24b) and the creation of the conversion function (step S25b).

According to the tenth embodiment of the present invention, the conversion function generation unit can use the segmentation information of both the synthesized sound and the target speaker voice. Segmentation information can therefore be used for conversion function generation in addition to the feature parameters of the synthesized sound and the target speaker voice, which simplifies the processing for associating the synthesized sound feature parameters with the target speaker voice feature parameters and improves its accuracy.
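When segment boundaries are available for both the synthesized sound and the target speaker voice, one simple way to associate frames is to map them linearly inside each matching phoneme segment. The sketch below illustrates that idea only; the linear mapping, the frame-index representation, and the name `align_by_segments` are assumptions for this example and are not the association procedure specified in this document.

```python
from typing import List, Tuple

# (phoneme label, first frame index, end frame index exclusive)
Boundary = Tuple[str, int, int]


def align_by_segments(
    synth: List[Boundary], target: List[Boundary]
) -> List[Tuple[int, int]]:
    """Pair each synthesized frame with a target frame inside the same phoneme segment."""
    pairs = []
    for (label_s, s0, s1), (label_t, t0, t1) in zip(synth, target):
        if label_s != label_t:
            raise ValueError(f"label mismatch: {label_s} vs {label_t}")
        n_s, n_t = s1 - s0, t1 - t0
        for k in range(n_s):
            # Relative position k / n_s in the synthesized segment,
            # scaled to the length of the target segment.
            pairs.append((s0 + k, t0 + (k * n_t) // n_s))
    return pairs


synth_seg = [("a", 0, 2), ("i", 2, 6)]
target_seg = [("a", 0, 4), ("i", 4, 6)]
pairs = align_by_segments(synth_seg, target_seg)
# pairs == [(0, 0), (1, 2), (2, 4), (3, 4), (4, 5), (5, 5)]
```

A mapping like this could serve as the association itself or as a constraint that narrows the search of a finer matching step.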

[Eleventh Embodiment]
FIG. 11 shows a block diagram of a voice quality conversion device according to the eleventh embodiment of the present invention. In FIG. 11, the same reference numerals as those in FIG. 2 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the eleventh embodiment further includes a synthesized sound segmentation information storage unit 20 and a target speaker voice segmentation information storage unit 21a. These are the same as the synthesized sound segmentation information storage unit 20 of FIG. 3 and the target speaker voice segmentation information storage unit 21a of FIG. 8, respectively.

Next, the operation of the voice quality conversion device according to the eleventh embodiment of the present invention will be described with reference to FIG. 11 and FIG. 23. In FIG. 23, steps with the same reference numerals as in FIG. 20 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S223c, the segmentation information 23g used when the speech synthesis unit 14c synthesized the speech is stored and input to the conversion function generation unit 17a. The input segmentation information 23g is used for the time axis association (step S24a) and the creation of the conversion function (step S25a).

According to the eleventh embodiment of the present invention, the segmentation information of the target speaker voice is extracted at the same time as that voice is recognized, and is input to the speech synthesis unit, so the synthesized sound can be generated using segmentation information based on the target speaker voice. In addition, the segmentation information used to generate the synthesized sound is also used when generating the conversion function, so segmentation information can be used for conversion function generation in addition to the feature parameters of the synthesized sound. This simplifies the processing for associating the synthesized sound feature parameters with the target speaker voice feature parameters and improves its accuracy.

[Twelfth Embodiment]
FIG. 12 shows a block diagram of a voice quality conversion device according to the twelfth embodiment of the present invention. In FIG. 12, the same reference numerals as those in FIG. 2 denote the same or corresponding parts, and their description is omitted. The voice quality conversion device according to the twelfth embodiment further includes a synthesized sound segmentation information storage unit 20 and a target speaker voice segmentation information storage unit 21b. These are the same as the synthesized sound segmentation information storage unit 20 of FIG. 3 and the target speaker voice segmentation information storage unit 21b of FIG. 9, respectively.

Next, the operation of the voice quality conversion device according to the twelfth embodiment of the present invention will be described with reference to FIG. 12 and FIG. 24. In FIG. 24, steps with the same reference numerals as in FIG. 21 perform the same or equivalent processing, and their description is omitted; only the newly added steps are described. In step S223c, the segmentation information 23g used when the speech synthesis unit 14c synthesized the speech is stored and input to the conversion function generation unit 17c. The input segmentation information 23g is used for the time axis association (step S24b) and the creation of the conversion function (step S25b).

According to the twelfth embodiment of the present invention, the segmentation information of the target speaker voice is extracted at the same time as that voice is recognized, and is input to both the speech synthesis unit and the conversion function generation unit, so the synthesized sound can be generated using segmentation information based on the target speaker voice. Segmentation information can also be used for conversion function generation in addition to the feature parameters of the target speaker voice. Moreover, because the conversion function generation unit can use the segmentation information of both the synthesized sound and the target speaker voice, that information is available for conversion function generation alongside the feature parameters of both. This simplifies the processing for associating the synthesized sound feature parameters with the target speaker voice feature parameters and improves its accuracy.

FIG. 25 is a block diagram of a voice quality conversion device according to a first example of the present invention. The voice quality conversion device includes a target speaker voice input unit 11, a phonetic symbol string input unit 12, a speech synthesis data storage unit 13, a speech synthesis unit 14, LPC analysis units 25, 26, and 30, a DP matching unit 27, a spectrum shape conversion unit 28, a conversion function derivation unit 29, a corresponding frame search unit 31, and a spectrum conversion unit 32. In FIG. 25, the same reference numerals as those in FIG. 1 denote the same or equivalent components, and their description is omitted unless otherwise noted.

In this example, it is assumed that a database for speech synthesis has been created in advance from at least one speaker and is stored in the speech synthesis data storage unit 13. The database contains data such as speech segments, durations, and pitch patterns. It is not necessary to store all of these in the database, however; one or two of them may be sufficient.

Assume now that the utterance content C of the target speaker voice is known, and that a phonetic symbol string capable of generating a synthesized sound with the same or similar utterance content has been input in advance from the phonetic symbol string input unit 12. According to this phonetic symbol string, the speech synthesis unit 14 synthesizes a synthesized sound B whose voice quality is that of the segments in the speech synthesis data. The target speaker voice input from the target speaker voice input unit 11 may, for example, be spoken into a recording device at the start of operation, or be taken from voice data stored in advance. There may also be more than one target speaker voice.

Next, the LPC analysis unit 25 performs LPC analysis on the voice of speaker A with utterance content C (hereinafter, target A), input from the target speaker voice input unit 11, for each analysis frame of a length such as 20 msec, and extracts the LPC coefficients 35a of target A. The LPC analysis unit 26 likewise performs LPC analysis on the synthesized voice with utterance content C created by the speech synthesis unit 14 (hereinafter, synthesized sound B) and extracts the LPC coefficients 35b of synthesized sound B. This embodiment describes LPC analysis as the analysis method (feature parameter extraction), but other options include spectral analysis such as LSP (line spectrum pair) analysis and PARCOR (partial auto-correlation) analysis, pitch extraction methods such as the zero-crossing count method and the autocorrelation method, and combinations of these. Besides LPC coefficients, the extracted feature parameters may be autocorrelation coefficients, cepstrum coefficients, and the like, or prosodic parameters such as pitch frequency.
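The frame-wise LPC analysis described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the autocorrelation method with a Hamming window and the Levinson-Durbin recursion, non-overlapping 20 msec frames, and a prediction order of 10. The function names (`autocorrelation`, `levinson_durbin`, `lpc_frames`) and the default order are assumptions introduced here.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err) where a[0] == 1.0 and the prediction error filter is
    A(z) = sum_k a[k] z^-k; err is the residual prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:          # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err          # reflection (PARCOR-type) coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def lpc_frames(signal, sample_rate, frame_ms=20.0, order=10):
    """LPC coefficients for consecutive non-overlapping frames (assumed framing)."""
    frame_len = int(sample_rate * frame_ms / 1000.0)
    coeffs = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (frame_len - 1)))
                    for i, x in enumerate(frame)]
        a, _ = levinson_durbin(autocorrelation(windowed, order), order)
        coeffs.append(a)
    return coeffs
```

Applying `lpc_frames` once to target A and once to synthesized sound B would give two coefficient sequences playing the role of 35a and 35b in the text.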

Next, the flow of processing in the DP matching unit 27, the spectrum shape conversion unit 28, the conversion function deriving unit 29, the corresponding frame search unit 31, and the spectrum conversion unit 32, which form the main part of voice quality conversion, is described. FIG. 26 illustrates the flow of voice data processing in this main part. The LPC-analyzed target speaker voice (target A) and synthesized sound B are DP-matched by the DP matching unit 27 to obtain their correspondence. Using this correspondence, the spectrum shape conversion unit 28 and the conversion function deriving unit 29 generate a conversion function from synthesized sound B to target A.

Each speech segment to be converted is associated with a segment of the synthesized sound by the corresponding frame search unit 31. The association is made, for example, by selecting the synthesized-sound segment whose distance from the segment to be converted is small in the feature space of their LPC coefficients. The segment to be converted is then spectrally converted by the previously obtained conversion function that corresponds to the associated synthesized-sound segment, producing a speech segment with the target voice quality. Each unit is described in more detail below.

The DP (dynamic programming) matching unit 27 uses the LPC coefficients 35a of target A and the LPC coefficients 35b of synthesized sound B to warp the time axis by DP matching so that the time axes of target A and synthesized sound B are aligned. This yields a frame-by-frame correspondence between target A and synthesized sound B. As the measure of distance in the feature space used for DP matching, the sum of squares of the difference vector, the WLR (weighted likelihood ratio) measure, the maximum likelihood spectral distance, the LPC cepstrum distance, and the like can be used.

FIG. 27 schematically shows DP matching for each analysis frame. Let the frames of the target speaker voice (target A) be A1, A2, A3, A4, A5, ..., Am, ..., and the frames of synthesized sound B be B1, B2, B3, B4, B5, ..., Bn, .... The target speaker voice (target A) and synthesized sound B are LPC-analyzed frame by frame. Then, for example, among the feature-space distance between frames A1 and B1, that between frames A1 and B2, and that between frames A2 and B1, the correspondence with the shortest distance is selected. Supposing the correspondence between frames A2 and B1 is selected, the next step selects, for example, the shortest among the feature-space distance between frames A3 and B1, that between frames A2 and B2, and that between frames A3 and B2. Correspondences of minimum feature-space distance are obtained successively in this way, and DP matching between the target speaker voice (target A) and synthesized sound B is carried out.
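The time-axis matching of FIG. 27 can be sketched with a standard dynamic time warping recursion. Note that this sketch minimizes the cumulative distance over the whole path and then backtracks, which is the usual DP formulation rather than a literal transcription of the step-by-step selection described above. The squared-difference distance is one of the measures the text lists; the function names and the three allowed step directions are assumptions.

```python
def sq_dist(u, v):
    """Sum of squares of the difference vector between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dp_match(seq_a, seq_b, dist=sq_dist):
    """Align two frame sequences; returns (total_cost, path).

    path is a list of (index_in_a, index_in_b) pairs from (0, 0) to the last
    frames, moving by one step in a, in b, or in both (assumed step pattern).
    """
    m, n = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * n for _ in range(m)]
    back = [[None] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = dist(seq_a[i], seq_b[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best, prev = min(candidates)
            cost[i][j] = best + d
            back[i][j] = prev
    # Backtrack from the final frame pair to recover the DP path.
    path = [(m - 1, n - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()
    return cost[m - 1][n - 1], path
```

Passing the per-frame LPC coefficient vectors of target A and synthesized sound B as `seq_a` and `seq_b` gives the frame correspondence used in the later steps; a different `dist` (for example an LPC cepstrum distance) can be substituted without changing the recursion.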

Next, the spectrum shape conversion unit 28 uses the LPC coefficients 35a of target A and the LPC coefficients 35b of synthesized sound B, associated frame by frame, to identify a conversion function that converts the voice quality of synthesized sound B into that of target A. To realize such a conversion, the frequency-domain data of synthesized sound B should be converted by the conversion function so that the frequency characteristic of each analysis frame of synthesized sound B becomes as close as possible to that of target A. Considered per analysis frame, this means identifying a conversion function that converts the frequency characteristic of an analysis frame Bn of synthesized sound B (the n-th analysis frame) into the frequency characteristic of the analysis frame Am of target A (the m-th analysis frame) associated with it on the time axis. To do so, each analysis frame is Fourier-transformed to obtain the shape of the waveform in the frequency domain (the spectral envelope), warping is performed on the frequency axis by DP matching or the like, and the conversion function from synthesized sound B to target A is obtained. In other words, a conversion is performed that makes the spectral shape of analysis frame Bn of synthesized sound B match the spectral shape of analysis frame Am of target A.

Spectral shape conversion is now described further. FIG. 28 schematically shows DP matching on the frequency axis. Let the spectral envelopes of the target speaker voice (target A) and synthesized sound B be SA and SB, respectively. The correspondence between spectral envelope SA and spectral envelope SB on the frequency axis is obtained by DP matching. The path P0, ..., Pi, ..., Pn representing this correspondence constitutes the conversion function: mapping the spectrum of synthesized sound B nonlinearly along the path P0, ..., Pi, ..., Pn yields the spectrum of the target speaker voice (target A). Preprocessing such as high-frequency or low-frequency emphasis may be applied to spectral envelopes SA and SB beforehand, with DP matching then performed on the preprocessed envelopes.
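As a rough sketch of how a frequency-axis path could act as a conversion function, the snippet below applies a given path to an envelope of synthesized sound B. The representation of the path as `(target_bin, source_bin)` pairs, the averaging when several source bins map to one target bin, and the name `warp_envelope` are all assumptions made for illustration; the text itself does not fix these details.

```python
def warp_envelope(sb, path, num_bins=None):
    """Map envelope values of B onto the frequency bins of A along a DP path.

    sb   : list of envelope values of synthesized sound B, one per frequency bin
    path : list of (target_bin, source_bin) pairs, e.g. from DP matching of SA and SB
    """
    if num_bins is None:
        num_bins = max(t for t, _ in path) + 1
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for target_bin, source_bin in path:
        sums[target_bin] += sb[source_bin]
        counts[target_bin] += 1
    # Average when several source bins land on the same target bin (assumed rule).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

Because the path is stored independently of the envelope it was derived from, the same path can later be applied to the spectrum of any segment frame that has been associated with the corresponding frame of synthesized sound B.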

As described above, in identifying the conversion function, the spectral envelope SBn is derived from the result of the LPC analysis of the n-th analysis frame Bn of synthesized sound B by the LPC analysis unit 26. Similarly, the spectral envelope SAm of the analysis frame Am of target A associated with Bn on the time axis is identified. A conversion function that converts SBn into SAm is then created. This gives the conversion function from the n-th analysis frame Bn of synthesized sound B to the analysis frame Am of target A associated with it on the time axis; applying the same procedure to all frames yields the conversion function for the entire synthesized sound B.

The above description uses a single sentence pair, target A and synthesized sound B, as training data and obtains a conversion function for each frame. When more training data is available, the LPC coefficients of target A and synthesized sound B can instead be clustered in the feature space, for example by phoneme, and a conversion function identified for each cluster representative point. Another possible method is to represent the feature space by a probability density distribution such as a GMM (Gaussian mixture model) and make the association at the level of the density distributions.

FIG. 29 schematically shows conversion functions between phonemes clustered in the feature space. In the feature parameter space of synthesized sound B and in that of target A, phonemes (for example a, i, u, e, o, s) are each classified into clusters, and the representative point of each cluster in the feature space of synthesized sound B is converted by a conversion function into the representative point of the corresponding cluster in the feature space of target A.
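The cluster-based variant of FIG. 29 can be illustrated as follows. The sketch takes the cluster mean as the representative point and models the per-cluster conversion function as a simple offset between the representative points of B and A. Both choices, as well as the function names, are assumptions for illustration; the text does not specify how the representative point or the per-cluster function is computed.

```python
from collections import defaultdict

def centroids(features, labels):
    """Mean feature vector per phoneme label (assumed representative point)."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        for k, x in enumerate(vec):
            sums[lab][k] += x
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}

def cluster_offsets(cent_b, cent_a):
    """Per-phoneme offset that moves B's representative point onto A's."""
    return {lab: [a - b for a, b in zip(cent_a[lab], cent_b[lab])]
            for lab in cent_b if lab in cent_a}

def convert(vec, label, offsets):
    """Apply the offset of the cluster that the frame's label belongs to."""
    return [x + d for x, d in zip(vec, offsets[label])]
```

Because the clusters are formed per phoneme on each side independently, this variant needs only the phoneme labels of the frames and no frame-level time alignment between target A and synthesized sound B.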

Next, the speech synthesis segment data to be converted is associated with the conversion functions. For this purpose, the LPC analysis unit 30 performs LPC analysis on the speech synthesis segment signals in advance and obtains the LPC coefficients for each analysis frame of the segment data.

For the LPC coefficients of each analysis frame of the segment data output by the LPC analysis unit 30, the corresponding frame search unit 31 finds, among the per-frame LPC coefficients 35b of synthesized sound B, the frame whose distance in the feature space is smallest, and takes it as the corresponding frame.

The conversion function deriving unit 29 takes the per-frame conversion functions of synthesized sound B obtained by the spectrum shape conversion unit 28 as the conversion functions for the segment data. When more training data is used, the conversion functions may be set per cluster rather than per frame, as in the identification of conversion functions described with reference to FIG. 29.

The spectrum conversion unit 32 selects, from the conversion functions for the segments obtained by the conversion function deriving unit 29, the conversion function for the corresponding frame found above, and converts the spectrum of the segment signal (segment data) with the selected function. This yields converted segment signals (segment data from which synthesized sound can be created) having the voice quality of target A.
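The corresponding-frame search and the selection of a conversion function can be sketched as a nearest-neighbour lookup. The squared Euclidean distance over LPC coefficient vectors and the names `nearest_frame` and `assign_functions` are assumptions; the conversion functions are treated as opaque objects indexed by frame of synthesized sound B.

```python
def nearest_frame(unit_vec, synth_frames):
    """Index of the frame of synthesized sound B closest to a segment frame."""
    best_idx, best_d = None, float("inf")
    for idx, vec in enumerate(synth_frames):
        d = sum((a - b) ** 2 for a, b in zip(unit_vec, vec))
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx

def assign_functions(unit_frames, synth_frames, functions):
    """Pick, for every segment frame, the conversion function of its corresponding frame.

    functions[k] is the conversion function identified for frame k of synthesized sound B.
    """
    return [functions[nearest_frame(u, synth_frames)] for u in unit_frames]
```

The search is linear in the number of frames of synthesized sound B for each segment frame; with per-cluster conversion functions the same lookup would be done against cluster representative points instead.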

Furthermore, by replacing the pre-conversion segment data in the speech synthesis database stored in the speech synthesis data storage unit 13 with this converted segment data, a database is completed from which synthesized sound with the voice quality of target A can be created for arbitrary text.

In the first embodiment, the phonetic symbol string for creating the synthesized sound is assumed to be known in advance. Alternatively, the utterance content of the target speaker voice may be given as text data, such as a sentence in mixed kanji and kana, analyzed by morphological analysis or the like, and the phonetic symbol string generated from the result.

The input signal to be converted is not limited to the segment data described above; other segment data may also be used. For example, the segment data stored in the speech synthesis data storage unit 13 may be data for travel conversation, while the segment data serving as the input signal is data for general conversation.

In addition, synthesized sound created from the above segment data, synthesized sound created from other segment data, speech uttered by a user, and the like can also be used as the input signal to be converted.

As a further variation of the first embodiment, the conversion function may be identified without performing the time-axis association between the synthesized sound and the target speaker voice in the conversion function generation unit. This is feasible for the following reason. In the present invention, processing is performed with the utterance contents of the synthesized sound and the target speaker voice being the same or similar, so much of their phoneme information is identical and the distributions of their feature parameters in the feature space resemble each other. Therefore, when conversion functions are obtained by clustering per phoneme, for example, they can be generated without time-axis correspondence between the synthesized sound and the target speaker voice. Because this method skips the time-axis association, it can substantially improve processing speed.

FIG. 30 is a block diagram of the voice quality conversion device according to the second embodiment of the present invention. The device includes a target speaker voice input unit 11, a phonetic symbol string input unit 12, a speech synthesis data storage unit 13, a speech synthesis unit 14a, a speech recognition unit 19, a synthesized sound segmentation information storage unit 20, a target speaker voice segmentation information storage unit 21, LPC analysis units 25, 26, and 30, a DP matching unit 27a, a spectrum shape conversion unit 28, a conversion function deriving unit 29, a corresponding frame search unit 31, and a spectrum conversion unit 32. In FIG. 30, reference numerals identical to those in FIG. 25 denote identical or equivalent components, and their description is omitted unless otherwise noted.

The speech recognition unit 19 performs speech recognition on the input target speaker voice signal, thereby generating a phonetic symbol string for producing a synthesized sound with the same or similar utterance content as the target speaker voice signal, and outputs it to the phonetic symbol string input unit 12. If the target speaker voice has utterance content C, recognizing it automatically generates a phonetic symbol string with utterance content C. Consequently, even when the phonetic symbol string is not known in advance, a synthesized sound with the same or similar utterance content as the target speaker voice can be generated, and once the target speaker voice alone has been input, the same processing as in the first embodiment can be carried out automatically. One speech recognition method usable here is recognition based on an HMM (hidden Markov model), as described in Non-Patent Document 2.

The synthesized sound segmentation information storage unit 20 outputs to the DP matching unit 27a the segmentation information that the speech synthesis unit 14a calculates from the speech synthesis data and uses when synthesizing speech, so that it can be exploited when generating the conversion function. If the target speaker voice has utterance content C, the speech recognition unit 19 recognizes it, automatically generates a phonetic symbol string with utterance content C, and outputs the string to the phonetic symbol string input unit 12. When the phonetic symbol string is input to the speech synthesis unit 14a, the speech synthesis unit 14a calculates, according to this string, segmentation information corresponding to utterance content C from the speech synthesis data storage unit 13. The synthesized sound segmentation information storage unit 20 stores this segmentation information and outputs it to the DP matching unit 27a as segmentation information 23i for use in generating the conversion function. For example, by associating the temporal change of the feature parameters extracted from the target speaker voice with the segmentation information and the temporal change of the feature parameters extracted from the synthesized sound, the time-axis association between the target speaker voice and the synthesized sound can be made simpler and more accurate. The segmentation information takes the form of a table, as shown in FIG. 31, that associates frame numbers with phonemes (for example, frame numbers 1, 2, ..., 10, 11, ..., 50, 51, ... correspond to phonemes "i", "i", ..., "i", "e", ..., "u", "s", ..., respectively). As another use, when the pairs of target speaker voice and synthesized sound span multiple sentences, the phoneme segmentation information can be used for clustering.

The target speaker voice segmentation information storage unit 21 receives the segmentation information of the target speaker voice that the speech recognition unit 19 outputs when it recognizes the target speaker voice signal. Suppose a target speaker voice with utterance content C is input to the target speaker voice input unit 11. The speech recognition unit 19 recognizes this voice, generates a phonetic symbol string describing utterance content C, and outputs it to the phonetic symbol string input unit 12; at the same time it outputs the segmentation information resulting from recognition to the target speaker voice segmentation information storage unit 21, which stores it. This segmentation information 23h is output to the DP matching unit 27a and used in associating the synthesized sound with the target speaker voice. Specifically, by associating the temporal change of the feature parameters extracted from the synthesized sound with the segmentation information and the temporal change of the feature parameters extracted from the target speaker voice, the time-axis association between the target speaker voice and the synthesized sound can be made simpler and more accurate. The segmentation information has the same form as the table shown in FIG. 31.

Like the DP matching unit 27 described in the first embodiment, the DP matching unit 27a uses the LPC coefficients 35a of target A and the LPC coefficients 35b of synthesized sound B to warp the time axis by DP matching so that the time axes of target A and synthesized sound B are aligned. In addition, the DP matching unit 27a uses the segmentation information 23i output by the synthesized sound segmentation information storage unit 20 and the segmentation information 23h output by the target speaker voice segmentation information storage unit 21 to make the time-axis association by DP matching simpler and more accurate.

Next, an example of time-axis association by DP matching using segmentation information is described. FIG. 32 shows a first example of such association. The analyzed target A (phonemes "a", "s", "u") is laid out along the vertical axis and the analyzed synthesized sound B (phonemes "a", "s", "u") along the horizontal axis. As segmentation information 23h, a phoneme boundary P1 between phonemes "a" and "s" of target A and a phoneme boundary Q1 between phonemes "s" and "u" of target A are set by the speech synthesis unit 14a. As segmentation information 23i, a phoneme boundary P2 between phonemes "a" and "s" of synthesized sound B and a phoneme boundary Q2 between phonemes "s" and "u" of target A are set by the speech recognition unit 19. The intersection of phoneme boundaries P1 and P2 is taken as constraint point P, and the intersection of phoneme boundaries Q1 and Q2 as constraint point Q. DP matching is performed on phonemes "a", "s", and "u" in turn to obtain the DP path, under the constraint that the DP path passes through constraint points P and Q. That is, by determining the DP path so that it passes through the constraint points, DP matching with constraints is performed and the phoneme boundaries are associated with each other. The path need not pass exactly through the constraint points; a looser constraint that it pass near them may be imposed instead.
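One way to force the DP path through constraint points such as P and Q is to run DP matching separately on each stretch between consecutive constraint points and concatenate the partial paths. The sketch below does this with a compact DTW routine; it implements only the hard constraint (the path passes exactly through each point), not the looser "near the constraint point" variant. The names `dtw_path` and `constrained_path`, the squared-difference distance, and the representation of a constraint point as a pair of frame indices are assumptions.

```python
def dtw_path(a, b):
    """Minimum-cost alignment path between two sequences of feature vectors."""
    m, n = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * n for _ in range(m)]
    back = {}
    for i in range(m):
        for j in range(n):
            d = sum((x - y) ** 2 for x, y in zip(a[i], b[j]))
            prevs = [(cost[p][q], (p, q))
                     for p, q in ((i - 1, j), (i, j - 1), (i - 1, j - 1))
                     if p >= 0 and q >= 0]
            if prevs:
                best, back[(i, j)] = min(prevs)
                cost[i][j] = best + d
            else:
                cost[i][j] = d          # starting cell (0, 0)
    path = [(m - 1, n - 1)]
    while path[-1] in back:
        path.append(back[path[-1]])
    return path[::-1]

def constrained_path(a, b, anchors):
    """DP path forced through each (index_in_a, index_in_b) anchor, in order."""
    points = [(0, 0)] + list(anchors) + [(len(a) - 1, len(b) - 1)]
    full = []
    for (i0, j0), (i1, j1) in zip(points, points[1:]):
        seg = [(i + i0, j + j0) for i, j in dtw_path(a[i0:i1 + 1], b[j0:j1 + 1])]
        if full:
            seg = seg[1:]               # the anchor itself is already in the path
        full.extend(seg)
    return full
```

Splitting at the constraint points also shrinks each DP table to a single phoneme-sized block, which is one way the association becomes both simpler and less prone to drifting across phoneme boundaries.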

Segmentation information describes, for a given utterance content, the start time and end time of each phoneme, the phoneme label, and so on. Because it gives each phoneme label together with that phoneme's start and end times, it clearly indicates the phoneme boundaries in target A or synthesized sound B. Therefore, when obtaining the frame-to-frame correspondence on the time axis (the DP path) between target A and synthesized sound B, imposing constraints based on the segmentation information makes an accurate association possible even where DP matching alone would be ambiguous near phoneme boundaries.
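The relation between time-stamped segmentation information and the frame-number table of FIG. 31 can be sketched as below. The tuple layout `(label, start_ms, end_ms)`, the 20 msec frame length, 1-based frame numbers, and the rule of labelling a frame by the phoneme that contains its centre are all assumptions made for illustration.

```python
def frame_labels(segments, frame_ms=20.0):
    """Build a {frame_number: phoneme_label} table from segmentation information.

    segments: list of (label, start_ms, end_ms) tuples in time order.
    Frame k (1-based) covers [(k - 1) * frame_ms, k * frame_ms).
    """
    total_ms = max(end for _, _, end in segments)
    num_frames = int(total_ms / frame_ms)
    table = {}
    for k in range(num_frames):
        centre = (k + 0.5) * frame_ms
        for label, start, end in segments:
            if start <= centre < end:
                table[k + 1] = label
                break
    return table
```

For example, a 200 msec "i" followed by a 60 msec "e" gives frames 1 to 10 labelled "i" and frames 11 to 13 labelled "e", matching the layout of the example table in the text.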

Next, another example of time-axis association by DP matching using segmentation information is described. FIG. 33 shows a second example. Target A and synthesized sound B are laid out as in FIG. 32. FIG. 33 differs from FIG. 32 in that segmentation information 23h is not input; that is, the target speaker voice segmentation information storage unit 21 is absent and target A has no phoneme boundaries. In this case, the DP path is first obtained by ordinary DP matching, and then the estimated phoneme boundary P3 of target A is obtained as the point corresponding to where the DP path crosses phoneme boundary P1 of synthesized sound B. Likewise, the estimated phoneme boundary Q3 of target A is obtained as the point corresponding to where the DP path crosses phoneme boundary Q1 of synthesized sound B. In other words, after DP matching without constraints, the positions matched to the phoneme boundaries of the voice that has segmentation information can be taken as estimates of the phoneme boundaries of the voice that lacks it. DP matching alone normally cannot determine phoneme boundaries, so this method of estimating them is useful, for example, when selecting conversion functions.
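The boundary estimation of FIG. 33 amounts to reading the DP path at the frames where synthesized sound B has a known boundary. A minimal sketch, assuming the path is a list of `(frame_of_A, frame_of_B)` pairs, each boundary is given as the index of the first frame of the new phoneme in B, and the first matching frame of A is taken as the estimate (the function name and this tie-breaking rule are assumptions):

```python
def estimate_boundaries(path, synth_boundaries):
    """Estimate phoneme boundaries of target A from those of synthesized sound B.

    path             : list of (frame_of_A, frame_of_B) pairs from unconstrained DP matching
    synth_boundaries : frame indices in B at which a new phoneme starts
    Returns the frame indices in A matched to each boundary of B.
    """
    first_a = {}
    for a_idx, b_idx in path:
        if b_idx not in first_a:        # keep the earliest frame of A matched to b_idx
            first_a[b_idx] = a_idx
    return [first_a[b] for b in synth_boundaries]
```

The estimated boundaries can then be turned into per-frame phoneme labels for target A, which is what a cluster-based choice of conversion function needs.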

The description so far has applied segmentation information to DP matching, but it can also be applied when the spectrum conversion unit 32 converts the input signal to be converted (the segment signal). That is, when determining which conversion function should convert each frame of the input signal, the label information attached to that frame's segmentation information makes it easy to tell which set (for example, a per-phoneme cluster) the frame belongs to. FIG. 34 illustrates this. In FIG. 34, the label information ("i", "i", "e", "u") is associated with the clusters in the feature parameter space, so that each label leads directly to the representative point of the corresponding cluster in the feature space.

Next, an example in which the voice quality conversion method is repeated to improve accuracy is described. FIG. 35 is a flowchart of the voice quality conversion method according to the third embodiment of the present invention. In FIG. 35, steps S31 to S39 perform processing equivalent to steps S21, S221, S222, S223, S23, S24, S25, S26, and S27 of FIG. 14, respectively, and their description is omitted. In step S40, the converted signal is registered in the speech synthesis data storage unit 13 as speech synthesis data.

In step S41, it is determined whether the iteration has reached a predetermined convergence condition. If it has not, the process returns to step S34 and a synthesized sound is generated using the converted speech synthesis data. If it has, the series of processes ends in step S42. Examples of the convergence condition include the distance in LPC coefficient space between the target speaker voice and the synthesized sound falling to or below a fixed value, or remaining at or below that value for a fixed number of consecutive iterations. Alternatively, the convergence condition may be that the sum of the differences from the previous iteration in quantities such as spectrum and power falls to or below a fixed value, or remains at or below that value for a fixed number of iterations. The condition may also simply be that a fixed number of iterations has been performed.
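
The convergence conditions listed above can be combined in a small helper, sketched below under stated assumptions. The class name `ConvergenceChecker` and its parameters (`threshold`, `patience`, `max_iterations`) are hypothetical; the distance passed to `update` may be either the LPC-space distance between the target speaker voice and the synthesized sound or the summed difference from the previous iteration, since both conditions have the same "at or below a fixed value, possibly for several consecutive iterations" form.

```python
class ConvergenceChecker:
    """Tracks a per-iteration distance and reports when iteration should stop."""

    def __init__(self, threshold, patience=1, max_iterations=None):
        self.threshold = threshold            # the "fixed value"
        self.patience = patience              # required consecutive hits
        self.max_iterations = max_iterations  # optional iteration cap
        self.iterations = 0
        self.streak = 0

    def update(self, distance):
        """Record one iteration; return True once a stop condition holds."""
        self.iterations += 1
        self.streak = self.streak + 1 if distance <= self.threshold else 0
        if self.streak >= self.patience:
            return True
        return (self.max_iterations is not None
                and self.iterations >= self.max_iterations)


checker = ConvergenceChecker(threshold=0.5, patience=2, max_iterations=10)
results = [checker.update(d) for d in [1.0, 0.4, 0.6, 0.4, 0.3]]
# The streak is reset by 0.6, so only the last update reports convergence.
```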

In the third example, when the voice quality of the speech synthesis data is converted with segments as the input signal to be converted, as in the first example, the converted speech synthesis data is used to generate again a synthesized sound whose utterance content is the same as or similar to that of the target speaker voice, a conversion function is generated in the same way as in the first example, and the segments are converted; this process is repeated multiple times. Furthermore, by combining this with the second example, once the target speaker voice has been input, all of the iterations can be carried out automatically, and the conversion accuracy (similarity of voice quality) can be raised progressively. Although FIG. 35 shows a case that does not use segmentation information, segmentation information may of course be used.
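
The repeated procedure can be outlined as a loop, sketched below. Everything here is an illustrative assumption: the function `refine`, its callable parameters (`synthesize`, `fit`, `convert`, `distance`), and the scalar toy data stand in for the synthesis, conversion function generation, segment conversion, and convergence test of FIG. 35, and only a simple threshold-or-iteration-cap condition is used.

```python
def refine(units, target, synthesize, fit, convert, distance,
           threshold, max_iterations):
    """Repeat synthesis, function fitting, and segment conversion until converged."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    for iteration in range(1, max_iterations + 1):
        synthesized = synthesize(units)       # generate the synthesized sound
        function = fit(synthesized, target)   # derive the conversion function
        units = convert(function, units)      # convert and re-register the segments
        if distance(target, synthesize(units)) <= threshold:  # convergence check
            break
    return units, iteration


# Toy stand-ins: a "segment" is one number and each pass closes half of the gap.
units, passes = refine(
    units=[0.0],
    target=[8.0],
    synthesize=lambda u: list(u),
    fit=lambda s, t: (t[0] - s[0]) / 2,
    convert=lambda shift, u: [x + shift for x in u],
    distance=lambda t, s: abs(t[0] - s[0]),
    threshold=0.5,
    max_iterations=10,
)
print(units, passes)  # [7.5] 4
```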

In the first to third examples described above, speech synthesis data consisting of speech data from multiple speakers may be stored in the speech synthesis data storage unit 13 so that the speech synthesis data to be used can be selected according to the voice quality of the input target speaker. For example, male segment data may be used when the target speaker voice is male, and female segment data when it is female. This avoids extreme conversions and reduces the degradation in sound quality caused by the conversion process.

Next, a device that generates synthesized sound using the synthesized sound data produced by the voice quality conversion device, voice quality conversion method, or voice quality conversion program of the present invention is described. FIG. 36 is a block diagram of a synthesized sound generating device according to the fourth example of the present invention. In FIG. 36, the symbol string input unit 41, the speech synthesis output unit 42, and the data storage unit 43 have functions equivalent to those of the phonetic symbol string input unit 12, the speech synthesis unit 14, and the speech synthesis data storage unit 13 of FIG. 1, respectively. The speech synthesis output unit 42, however, additionally has the function of outputting the generated synthesized sound. The data storage unit 43 stores, as synthesized sound data, the converted output signal generated by the voice quality conversion device, method, or program described above. The speech synthesis output unit 42 generates and outputs a synthesized sound using the synthesized sound data read from the data storage unit 43 according to the symbol string input to the symbol string input unit 41. Because the output synthesized sound is generated from the converted output signal, that is, from data converted to the voice quality of the target speaker, it has the voice quality of the target speaker.

Next, another device that generates synthesized sound using the synthesized sound data produced by the voice quality conversion device, voice quality conversion method, or voice quality conversion program of the present invention is described. FIG. 37 is a block diagram of a synthesized sound generating device according to the fifth example of the present invention. In FIG. 37, the voice input unit 51 receives a speech signal uttered by the user. The conversion data storage unit 53 stores the LPC coefficients of each analysis frame together with the corresponding conversion function, as generated by the voice quality conversion device, method, or program described above. The voice conversion output unit 52 analyzes the speech signal output from the voice input unit 51 frame by frame, for example by LPC analysis, searches the conversion data storage unit 53 for the frame closest to the analyzed frame in parameter space, reads out the corresponding conversion function, and uses it to convert the spectrum of the speech signal for output. Because the output synthesized sound is generated from data converted to the voice quality of the target speaker, it has the voice quality of the target speaker.
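
The frame-by-frame lookup performed by the voice conversion output unit 52 can be sketched as a nearest-neighbor search. The function name `convert_frame`, the table layout of (stored parameters, conversion function) pairs, the Euclidean distance, and the toy gain functions are assumptions for illustration; the text above only requires that the stored frame closest in parameter space be found and its conversion function applied.

```python
import math


def convert_frame(params, spectrum, table):
    """Apply the conversion function stored with the parameters nearest to params.

    table holds (stored_params, conversion_function) pairs.
    """
    _, function = min(table, key=lambda entry: math.dist(params, entry[0]))
    return function(spectrum)


# Hypothetical stored data: two analysis frames with simple gain functions.
table = [
    ([0.0, 0.0], lambda s: [x * 0.5 for x in s]),
    ([1.0, 1.0], lambda s: [x * 2.0 for x in s]),
]
# The input frame is closest to the second stored frame, so its function is used.
print(convert_frame([0.9, 1.2], [1.0, 3.0], table))  # [2.0, 6.0]
```

A linear scan is enough for a sketch; a large conversion data store would more likely use an indexed search, which is a design choice outside what the source specifies.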

The present invention is applicable to uses such as freely changing voice quality in communication equipment such as telephones and transceivers. It can also be used on personal computers, mobile phones, and the like to give the synthesized sound that reads out e-mail or chat text the voice quality the user wants. Furthermore, when animation voice recording, foreign film dubbing, and the like are performed by text-to-speech synthesis, it can be used to generate character voices whose voice quality suits each character.

FIG. 1 is a block diagram showing the configuration of the voice quality conversion device according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the voice quality conversion device according to the second embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the voice quality conversion device according to the third embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of the voice quality conversion device according to the fourth embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the voice quality conversion device according to the fifth embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of the voice quality conversion device according to the sixth embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the voice quality conversion device according to the seventh embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of the voice quality conversion device according to the eighth embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the voice quality conversion device according to the ninth embodiment of the present invention.
FIG. 10 is a block diagram showing the configuration of the voice quality conversion device according to the tenth embodiment of the present invention.
FIG. 11 is a block diagram showing the configuration of the voice quality conversion device according to the eleventh embodiment of the present invention.
FIG. 12 is a block diagram showing the configuration of the voice quality conversion device according to the twelfth embodiment of the present invention.
FIG. 13 is a flowchart showing the operation of the voice quality conversion device according to the first embodiment of the present invention.
FIG. 14 is a flowchart showing the operation of the voice quality conversion device according to the second embodiment of the present invention.
FIG. 15 is a flowchart showing the operation of the voice quality conversion device according to the third embodiment of the present invention.
FIG. 16 is a flowchart showing the operation of the voice quality conversion device according to the fourth embodiment of the present invention.
FIG. 17 is a flowchart showing the operation of the voice quality conversion device according to the fifth embodiment of the present invention.
FIG. 18 is a flowchart showing the operation of the voice quality conversion device according to the sixth embodiment of the present invention.
FIG. 19 is a flowchart showing the operation of the voice quality conversion device according to the seventh embodiment of the present invention.
FIG. 20 is a flowchart showing the operation of the voice quality conversion device according to the eighth embodiment of the present invention.
FIG. 21 is a flowchart showing the operation of the voice quality conversion device according to the ninth embodiment of the present invention.
FIG. 22 is a flowchart showing the operation of the voice quality conversion device according to the tenth embodiment of the present invention.
FIG. 23 is a flowchart showing the operation of the voice quality conversion device according to the eleventh embodiment of the present invention.
FIG. 24 is a flowchart showing the operation of the voice quality conversion device according to the twelfth embodiment of the present invention.
FIG. 25 is a block diagram showing the configuration of the voice quality conversion device according to the first example of the present invention.
FIG. 26 is a diagram explaining the flow of speech data processing in the main part of the voice quality conversion.
FIG. 27 is a diagram schematically showing DP matching for each analysis frame.
FIG. 28 is a diagram schematically showing DP matching on the frequency axis.
FIG. 29 is a diagram schematically showing conversion functions between phonemes clustered in the feature space.
FIG. 30 is a block diagram of the voice quality conversion device according to the second example of the present invention.
FIG. 31 is a diagram showing the correspondence between frame numbers and phonemes.
FIG. 32 is a diagram showing a first example of time-axis association by DP matching using segmentation information.
FIG. 33 is a diagram showing a second example of time-axis association by DP matching using segmentation information.
FIG. 34 is a diagram showing the correspondence between label information and clusters.
FIG. 35 is a flowchart of the voice quality conversion method according to the third example of the present invention.
FIG. 36 is a block diagram of the synthesized sound generating device according to the fourth example of the present invention.
FIG. 37 is a block diagram of the synthesized sound generating device according to the fifth example of the present invention.
FIG. 38 is a block diagram of the voice quality conversion method of the first conventional example.
FIG. 39 is a block diagram of the voice quality conversion method of the second conventional example.

Explanation of symbols

11, 11a Target speaker voice input unit
12, 12a Phonetic symbol string input unit
13 Speech synthesis data storage unit
14, 14a, 14b, 14c Speech synthesis unit
15 Target speaker voice feature parameter extraction unit
16 Synthesized sound feature parameter extraction unit
17, 17a, 17b, 17c Conversion function generation unit
18 Voice quality conversion unit
19 Speech recognition unit
20 Synthesized sound segmentation information storage unit
21, 21a, 21b Target speaker voice segmentation information storage unit
22 Segmentation information input unit
23a, 23b, 23c, 23d, 23e, 23f, 23g, 23h, 23i Segmentation information
25, 26, 30 LPC analysis unit
27, 27a DP matching unit
28 Spectrum shape conversion unit
29 Conversion function derivation unit
31 Corresponding frame search unit
32 Spectrum conversion unit
35a, 35b LPC coefficients
41 Symbol string input unit
42 Speech synthesis output unit
43 Data storage unit
51 Voice input unit
52 Voice conversion output unit
53 Conversion data storage unit

Claims

A voice quality conversion device comprising:
a target speaker voice input unit for inputting the voice of a speaker who is the target of voice quality conversion (referred to as "target speaker voice");
a speech synthesis data storage unit for storing speech synthesis data;
a phonetic symbol string input unit for inputting a phonetic symbol string describing utterance content that is the same as or similar to that of the input target speaker voice;
a speech synthesis unit that synthesizes and outputs speech from the speech synthesis data stored in the speech synthesis data storage unit according to the input phonetic symbol string;
a synthesized sound feature parameter extraction unit that extracts feature parameters of the speech signal output from the speech synthesis unit (referred to as "synthesized sound");
a target speaker voice feature parameter extraction unit that extracts feature parameters of the target speaker voice; and
a conversion function generation unit that receives the feature parameters of the synthesized sound from the synthesized sound feature parameter extraction unit and the feature parameters of the target speaker voice from the target speaker voice feature parameter extraction unit, associates, in a space representing the feature parameters, a first part of the synthesized sound with a second part of the target speaker voice on the time axis, and identifies a conversion function that converts the spectrum shape of the first part of the synthesized sound into the spectrum shape of the second part of the target speaker voice.

The voice quality conversion device according to claim 1, further comprising a voice quality conversion unit that receives a voice signal to be converted, converts it into a voice signal having the voice quality of the target speaker voice using the conversion function identified by the conversion function generation unit, and outputs the result.

The voice quality conversion device according to claim 1, comprising a speech recognition unit that receives and recognizes the target speaker voice input by the target speaker voice input unit, generates the phonetic symbol string, and outputs the phonetic symbol string to the speech synthesis unit.

The voice quality conversion device according to claim 1, further comprising a first segmentation information storage unit that stores segmentation information, which is used when the synthesized sound is generated by the speech synthesis unit and includes at least a phoneme string corresponding to time information, and outputs the segmentation information to the conversion function generation unit.

The voice quality conversion device according to any one of claims 1 to 3, further comprising a segmentation information input unit that inputs segmentation information including at least a phoneme string corresponding to time information of the synthesized sound created by the speech synthesis unit, and outputs the segmentation information to the speech synthesis unit.

The voice quality conversion device according to claim 3, further comprising a second segmentation information storage unit that stores segmentation information including at least a phoneme string corresponding to time information of the target speaker voice recognized by the speech recognition unit, and outputs the segmentation information to at least one of the conversion function generation unit and the speech synthesis unit.

The voice quality conversion device according to any one of claims 1 to 3, wherein the speech synthesis data stored in the speech synthesis data storage unit includes speech data of a plurality of different speakers.

A synthesized sound generating device comprising:
a data storage unit that stores, as speech synthesis data, the converted signal output from the voice quality conversion unit of the voice quality conversion device according to claim 2;
a symbol string input unit for inputting a symbol string describing utterance content; and
a speech synthesis output unit that synthesizes and outputs speech from the speech synthesis data in the data storage unit according to the input symbol string.

A synthesized sound generating device comprising:
a conversion data storage unit that stores the feature parameters of the synthesized sound in the voice quality conversion device according to claim 1 in association with the conversion function identified for those feature parameters;
a voice input unit for inputting a voice signal to be converted; and
a voice conversion output unit that obtains feature parameters of the input voice signal to be converted, searches the conversion data storage unit for the feature parameters of the synthesized sound that correspond to the obtained feature parameters, reads the conversion function associated with those feature parameters, and converts the voice signal to be converted using that conversion function.

A voice quality conversion method performed by a voice quality conversion device, comprising:
inputting the voice of a speaker who is the target of voice quality conversion (referred to as "target speaker voice");
inputting a phonetic symbol string describing the utterance content of the target speaker voice, and creating a synthesized sound from the phonetic symbol string using speech synthesis data stored in advance in a storage unit;
analyzing the synthesized sound and extracting feature parameters of the synthesized sound;
analyzing the target speaker voice and extracting feature parameters of the target speaker voice;
associating the feature parameters of the target speaker voice and the feature parameters of the synthesized sound on the time axis; and
generating, based on the associated feature parameters of the target speaker voice and the synthesized sound, a conversion function for converting the spectrum shape of the synthesized sound into the spectrum shape of the target speaker voice.

The voice quality conversion method according to claim 10, further comprising converting a voice signal to be converted into a voice signal having the voice quality of the target speaker voice using the conversion function.

The voice quality conversion method according to claim 10, further comprising recognizing the input target speaker voice, wherein the phonetic symbol string is generated from the result of the speech recognition.

The voice quality conversion method, wherein segmentation information including at least a phoneme string corresponding to time information of the synthesized sound is input, and the synthesized sound is generated based on the segmentation information.

The voice quality conversion method according to any one of the above, comprising a step of performing the association on the time axis using first segmentation information that is used when generating the synthesized sound and includes at least a phoneme string corresponding to time information.

The voice quality conversion method according to any one of …, comprising a step of performing the association on the time axis using second segmentation information including at least a phoneme string corresponding to time information of the speech-recognized target speaker voice.

The voice quality conversion method according to any one of claims 14 and 14, comprising a step of generating the synthesized sound based on second segmentation information including at least a phoneme string corresponding to time information of the speech-recognized target speaker voice.

The voice quality conversion method according to claim 10, further comprising a step of performing the association on the time axis using second segmentation information including at least a phoneme string corresponding to time information of the speech-recognized target speaker voice, and generating the synthesized sound based on the second segmentation information.

A voice quality conversion method performed by a voice quality conversion device, comprising:
inputting the voice of a speaker who is the target of voice quality conversion (referred to as "target speaker voice");
recognizing the input target speaker voice; and
generating, from the result of the speech recognition, a phonetic symbol string describing the utterance content of the target speaker voice,
wherein the following steps are repeated until a predetermined convergence condition is reached:
(A) creating a synthesized sound from the phonetic symbol string using speech synthesis data stored in advance in a storage unit;
(B) analyzing the synthesized sound and the target speaker voice and extracting their respective feature parameters;
(C) associating the feature parameters of the target speaker voice with the feature parameters of the synthesized sound on the time axis;
(D) generating, based on the two associated sets of feature parameters, a conversion function for converting the spectrum shape of the synthesized sound into the spectrum shape of the target speaker voice;
(E) converting the voice quality of the voice to be converted using the generated conversion function; and
(F) storing the result of the voice quality conversion in the storage unit as the speech synthesis data.

A program that causes a computer constituting a voice quality conversion device to execute:
a process of inputting the voice of a speaker who is the target of voice quality conversion (referred to as "target speaker voice");
a process of inputting a phonetic symbol string describing the utterance content of the target speaker voice and creating a synthesized sound from the phonetic symbol string using speech synthesis data stored in advance in a storage unit;
a process of analyzing the synthesized sound and the target speaker voice and extracting their respective feature parameters;
a process of associating the feature parameters of the target speaker voice and the feature parameters of the synthesized sound on the time axis; and
a process of generating, based on the associated feature parameters of the target speaker voice and the synthesized sound, a conversion function for converting the spectrum shape of the synthesized sound into the spectrum shape of the target speaker voice.

The program according to claim 19, further causing the computer to execute a process of converting a voice signal to be converted into a voice signal having the voice quality of the target speaker voice using the conversion function.

The program according to claim 19, further causing the computer to execute:
a process of recognizing the input target speaker voice; and
a process of generating the phonetic symbol string from the result of the speech recognition.

The program according to any one of claims 19 to 21, further causing the computer to execute a process of inputting segmentation information including at least a phoneme string corresponding to time information of the synthesized sound, and generating the synthesized sound based on the segmentation information.

The program according to any one of claims 19 to 21, causing the computer to execute a process of performing the association on the time axis using first segmentation information that is used when generating the synthesized sound and includes at least a phoneme string corresponding to time information.

The program according to any one of claims 19 to 21 and 23, causing the computer to execute a process of performing the association on the time axis using second segmentation information including at least a phoneme string corresponding to time information of the speech-recognized target speaker voice.

The program according to any one of claims 19 to 21 and 23, causing the computer to execute a process of generating the synthesized sound based on second segmentation information including at least a phoneme string corresponding to time information of the speech-recognized target speaker voice.

The program according to any one of claims 19 to 21 and 23, causing the computer to execute a process of performing the association on the time axis using second segmentation information including at least a phoneme string corresponding to time information of the speech-recognized target speaker voice, and generating the synthesized sound based on the second segmentation information.

A program that causes a computer constituting a voice quality conversion device to execute:
a process of inputting the voice of a speaker who is the target of voice quality conversion (referred to as "target speaker voice");
a process of recognizing the input target speaker voice; and
a process of generating, from the result of the speech recognition, a phonetic symbol string describing the utterance content of the target speaker voice,
and to repeat the following processes until a predetermined convergence condition is reached:
(A) a process of creating a synthesized sound from the phonetic symbol string using speech synthesis data stored in advance in a storage unit;
(B) a process of analyzing the synthesized sound and the target speaker voice and extracting their respective feature parameters;
(C) a process of associating the feature parameters of the target speaker voice and the feature parameters of the synthesized sound on the time axis;
(D) a process of generating, based on the two associated sets of feature parameters, a conversion function for converting the spectrum shape of the synthesized sound into the spectrum shape of the target speaker voice;
(E) a process of converting the voice quality of the voice to be converted using the generated conversion function; and
(F) a process of storing the result of the voice quality conversion in the storage unit as the speech synthesis data.