JPH01211799A

JPH01211799A - Regular synthesizing device for multilingual voice

Info

Publication number: JPH01211799A
Application number: JP63037948A
Authority: JP
Inventors: Masanobu Abe; 匡伸阿部; Hisao Kuwabara; 尚夫桑原; Kiyohiro Kano; 清宏鹿野
Original assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Current assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date: 1988-02-19
Filing date: 1988-02-19
Publication date: 1989-08-24

Abstract

PURPOSE: To obtain a voice signal that has individual features for plural languages by providing a regular synthesizing means, which synthesizes the voice signal of a 1st standard speaker for each of the plural languages, and a voice converting means, which converts the voice signal outputted by a selecting means into the voice signal of a 2nd speaker whose individual features are to be added.

CONSTITUTION: A switching part 100 selects one of the languages L1 to Ln, e.g., L1, and outputs the voice signal s11 of the standard speaker A1 of the selected language L1 from a multilingual regular synthesizing group 104 to a voice quality conversion part 101. The voice quality conversion part 101 receives the voice signal s11, refers to the data in a voice individual information file 102 on the speaker B whose individuality is to be given to the voice, converts the voice signal s11 of the standard speaker A1 into the voice signal s4 of the speaker B, and outputs it. Consequently, the individual features of a speaker are given to the regularly synthesized voices of the respective languages.

Description

DETAILED DESCRIPTION OF THE INVENTION [Field of Industrial Application] The present invention relates to a speech rule synthesis device, and in particular to a rule synthesis device that handles multiple languages.

[Prior Art and Problems to Be Solved by the Invention] Conventionally, rule synthesis systems have been developed separately for each language (for example, Japanese, English, and German). However, these systems are designed solely with native speakers in mind and provide no functions intended for non-native speakers.

That is, in a rule synthesis system developed for a given language, both the speech units used for synthesis and the prosodic control rules are built from native speakers. Its synthesized speech is therefore more fluent than speech produced by a non-native speaker. On the other hand, the voice qualities such a system can output are limited to one or a few types (for example, a male voice, a female voice, a childlike voice, or an elderly-sounding voice).

Consequently, a rule synthesis system cannot synthesize speech on behalf of a particular person; that is, it cannot synthesize speech close to that person's voice quality and output it as if the person were speaking in that language.

The reasons include the following:

(1) The speech units used for rule synthesis are created from several hundred utterances produced by one speaker, which places a heavy burden on that speaker.

(2) Creating speech units is difficult to automate fully and is labor-intensive, so creating speech units for a large number of speakers is practically impossible.

(3) If a set of speech units for rule synthesis were created for each speaker, the amount of memory needed to store the sets would be enormous.

The present invention has been made to solve the above problems, and its object is to give the rule-synthesized speech of each language the personal characteristics of a speaker.

[Means for Solving the Problems] The speech rule synthesis device that handles multiple languages according to the present invention includes: a plurality of rule synthesis means, each of which receives a character information signal from the outside for one of a plurality of languages and synthesizes the speech signal of a first speaker of that language by referring to the speech unit signal set of the first speaker, who is the standard speaker for that language; selection means for selecting and outputting one of the speech signals of the first speakers of the respective languages; conversion signal file means storing the conversion signals needed to convert the speech of the first speaker into the speech of a second speaker whose personal characteristics are to be imparted to the speech; and voice conversion means for converting the speech signal of the first speaker output from the selection means into a speech signal of the second speaker on the basis of the conversion signals stored in the conversion signal file means.

[Operation] In the speech rule synthesis device that handles multiple languages according to the present invention, the rule synthesis means first synthesizes by rule the speech signal of the first speaker, who is the standard speaker for each language. Then, for the language selected by the selection means, the voice conversion means converts that first speech signal into the speech signal of a second speaker having personal characteristics, in accordance with the conversion signals stored in the conversion signal file means. A speech signal having personal characteristics can therefore be obtained for multiple languages.

[Embodiment of the Invention] FIG. 1 is a block diagram showing one embodiment of a rule synthesis device that handles multiple languages according to the present invention.

Referring to FIG. 1, this rule synthesis device includes input sections 10 to n0, a multilingual rule synthesis group 104, a switching unit 100, a voice quality conversion unit 101, a voice personal information file 102, and an output section 103.

This rule synthesis device can handle n languages, L1 to Ln (for example, English, German, Chinese, and Japanese).

The multilingual rule synthesis group 104 includes, for each language, a rule synthesis unit that processes the signal from the input section to which the character information signal of that language is supplied. For example, for language L1, a character information signal s10 is supplied from the outside to input section 10; it contains a character string for language L1 (characters, accent types, and the like) and prosodic signals. Rule synthesis unit 11, which includes the speech unit set information file 12 of the standard speaker A1 of language L1, receives the character information signal s10, refers to the speech unit set information file 12, and synthesizes by rule the speech signal s11 of the standard speaker A1 of that language.

For the languages other than L1, the rule synthesis units 21 to n1 likewise synthesize by rule the speech signals s21 to sn1 of the standard speakers A2 to An of the respective languages.

Here, the speech unit set information files 12 to n2 are databases in which the information serving as units of speech, such as phonemes and syllables, is stored in advance for the standard speakers A1 to An of the languages L1 to Ln, respectively.

The switching unit 100 selects one of the languages L1 to Ln, for example language L1, and outputs the speech signal s11 of the standard speaker A1 of the selected language L1, received from the multilingual rule synthesis group 104, to the voice quality conversion unit 101.

The voice quality conversion unit 101 receives the speech signal s11, refers to the data in the pre-registered voice personal information file 102 for speaker B, whose individuality is to be imparted to the speech, converts the speech signal s11 of the standard speaker A1 into the speech signal s4 of speaker B, and outputs it.
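The data flow described above (rule synthesis units 11 to n1, switching unit 100, voice quality conversion unit 101, voice personal information file 102) can be sketched roughly as follows. This is only an illustrative sketch: the names `Utterance`, `make_rule_synthesizer`, and `MultilingualSynthesizer` are hypothetical, the "speech signal" is a labeled record rather than a waveform, and keying the personal information file by a (standard speaker, target speaker) pair is an assumption, not something the patent specifies.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Utterance:
    """Stand-in for a speech signal; a real system would carry waveform samples."""
    language: str
    speaker: str
    text: str


def make_rule_synthesizer(language: str, standard_speaker: str) -> Callable[[str], Utterance]:
    """Hypothetical stand-in for one rule synthesis unit (11 ... n1)."""
    def synthesize(text: str) -> Utterance:
        # A real unit would consult its speech unit set information file here.
        return Utterance(language, standard_speaker, text)
    return synthesize


class MultilingualSynthesizer:
    def __init__(self,
                 synthesizers: Dict[str, Callable[[str], Utterance]],
                 personal_file: Dict[Tuple[str, str], dict]) -> None:
        self.synthesizers = synthesizers    # multilingual rule synthesis group 104
        self.personal_file = personal_file  # voice personal information file 102 (assumed keying)

    def speak(self, language: str, text: str, target_speaker: str) -> Utterance:
        # Switching unit 100: pick the output of the selected language's unit.
        standard = self.synthesizers[language](text)
        key = (standard.speaker, target_speaker)
        if key not in self.personal_file:
            raise KeyError(f"no conversion data registered for {key}")
        # Voice quality conversion unit 101 (placeholder: only the speaker label changes).
        return replace(standard, speaker=target_speaker)


system = MultilingualSynthesizer(
    synthesizers={"L1": make_rule_synthesizer("L1", "A1"),
                  "L2": make_rule_synthesizer("L2", "A2")},
    personal_file={("A1", "B"): {}, ("A2", "B"): {}},
)
result = system.speak("L1", "hello", "B")
```

The point of the structure is that only one set of speech units per language is stored, while each additional target speaker adds only conversion data.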

The voice quality conversion unit 101 uses a voice quality conversion method based on vector quantization. In this method, the conversion between the voice quality of the standard speaker on which the rule synthesis unit is based and that of the speaker whose individuality is to be imparted to the speech is performed by means of conversion codebooks, which define the correspondence between the two speakers' codebooks.

The conversion codebooks contain the power, pitch frequency, and spectral information of the speech of the speaker whose individuality is to be imparted, and represent the features of that speech in discrete form. The voice personal information file 102 in FIG. 1 holds the contents of these conversion codebooks for each speaker whose individuality is to be imparted to the speech.

The following describes, as an example, the case in which language L1 is English and the speech signal s11 of an American speaker A1, synthesized by the English rule synthesis unit 11, is converted into the speech signal s4 of a Japanese speaker B whose individuality is to be imparted.

FIG. 2 is a flow diagram showing the procedure for creating the conversion codebooks.

The procedure for obtaining the conversion codebooks 41, 42, and 43 is described below with reference to FIG. 2.

First, in steps 301 and 302, the American speaker A1 and the Japanese speaker B utter the same words, and LPC analysis is applied to each speaker's speech to obtain the power, pitch frequency, and spectral parameters. Next, the spectral parameters are vector-quantized in steps 303 and 304, the power is scalar-quantized in steps 305 and 306, and the pitch frequency is scalar-quantized in steps 307 and 308.
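The two quantization operations named in these steps can be illustrated with a minimal sketch. The patent does not state the distance measure or how the codebooks are trained, so the squared Euclidean distance and the toy codebook values below are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def vector_quantize(vector: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the code vector nearest to `vector`.

    Squared Euclidean distance is an assumption made for this sketch.
    """
    def dist2(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: dist2(vector, codebook[i]))


def scalar_quantize(value: float, levels: Sequence[float]) -> int:
    """Return the index of the quantization level nearest to `value`."""
    return min(range(len(levels)), key=lambda i: abs(value - levels[i]))


# Toy codebooks (illustrative values only).
spectral_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
pitch_levels = [80.0, 120.0, 160.0, 200.0]

spectral_index = vector_quantize([0.9, 1.2], spectral_codebook)
pitch_index = scalar_quantize(118.0, pitch_levels)
```

Each analysis frame is thus reduced to three small integers (spectrum, power, and pitch indices), which is what makes a table-based mapping between speakers possible.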

To establish the time correspondence between the utterances of speaker A1 and speaker B, DP matching by the Double Split method is performed on the spectral parameters in step 309. Based on the time alignment obtained in this way, the correspondence between speaker A1 and speaker B is determined for each feature in steps 310, 311, and 312, and histograms are created. The conversion codebooks 41 and 43 for the spectral parameters and the power are obtained as linear combinations of speaker B's feature vectors, using the histograms as weights. The conversion codebook 42 for the pitch frequency is created from the feature vector of speaker B that gives the maximum value of the histogram.
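The histogram step can be sketched as follows, assuming the DP matching has already produced aligned pairs of code indices (one index from speaker A's codebook, one from speaker B's). Dividing the weighted sum by the row total is an assumption of this sketch, since the text only says "linear combination", and all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def build_histogram(pairs: Sequence[Tuple[int, int]], n_a: int, n_b: int) -> List[List[int]]:
    """hist[i][j] counts how often code i of speaker A aligned with code j of speaker B."""
    hist = [[0] * n_b for _ in range(n_a)]
    for a, b in pairs:
        hist[a][b] += 1
    return hist


def weighted_mapping(hist: List[List[int]],
                     codebook_b: List[Sequence[float]]) -> List[Optional[List[float]]]:
    """Spectrum/power style: histogram-weighted combination of B's code vectors.

    Normalizing by the row total is an assumption made for this sketch.
    """
    dim = len(codebook_b[0])
    mapping: List[Optional[List[float]]] = []
    for row in hist:
        total = sum(row)
        if total == 0:
            mapping.append(None)  # code of speaker A never observed in training
            continue
        mapping.append([sum(w * codebook_b[j][d] for j, w in enumerate(row)) / total
                        for d in range(dim)])
    return mapping


def argmax_mapping(hist: List[List[int]],
                   levels_b: Sequence[float]) -> List[Optional[float]]:
    """Pitch style: take B's level at the histogram maximum for each code of A."""
    return [levels_b[max(range(len(row)), key=row.__getitem__)] if sum(row) else None
            for row in hist]


aligned = [(0, 0), (0, 0), (0, 1), (1, 1)]  # toy alignment result
hist = build_histogram(aligned, n_a=2, n_b=2)
spectrum_map = weighted_mapping(hist, [[0.0, 0.0], [3.0, 6.0]])
pitch_map = argmax_mapping(hist, [100.0, 200.0])
```

With the toy alignment above, code 0 of speaker A maps to a blend of B's two code vectors, while its pitch entry is simply B's most frequent level.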

FIG. 3 is a flow diagram showing the voice quality conversion procedure in the voice quality conversion unit 101.

The voice quality conversion method using the conversion codebooks is described below with reference to FIG. 3, taking language L1 (English) as an example. In step 401, the speech s11 of speaker A1 is subjected to LPC analysis, and the power, pitch frequency, and spectral parameters are extracted. Next, the spectral parameters are vector-quantized with speaker A1's spectral codebook in step 402, the power is scalar-quantized with speaker A1's power codebook in step 403, and the pitch frequency is scalar-quantized with speaker A1's pitch frequency codebook in step 404. The conversion codebooks 41, 42, and 43 described above are used when these quantized parameters are decoded. That is, the spectral conversion codebook 41 from speaker A1 to speaker B is used in step 405, the power conversion codebook 43 in step 406, and the pitch frequency conversion codebook 42 in step 407. The speech of speaker B is then synthesized from the converted parameters in step 408. In this way, rule-synthesized English speech with the voice quality of the Japanese speaker B is obtained.
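The quantize-then-decode idea in steps 402 to 407 can be sketched per analysis frame as below. LPC analysis (step 401) and waveform synthesis (step 408) are omitted; the dictionary layout, the nearest-neighbor rule, and every numeric value are assumptions made for illustration, and `convert_frame` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def nearest_vector(vector: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Index of the nearest code vector (squared Euclidean distance, an assumption)."""
    return min(range(len(codebook)),
               key=lambda i: sum((x - y) ** 2 for x, y in zip(vector, codebook[i])))


def nearest_level(value: float, levels: Sequence[float]) -> int:
    """Index of the nearest scalar quantization level."""
    return min(range(len(levels)), key=lambda i: abs(value - levels[i]))


def convert_frame(frame: Dict, codebooks_a: Dict, conversion: Dict) -> Dict:
    """Encode one frame with speaker A1's codebooks, decode with the conversion codebooks."""
    i_spec = nearest_vector(frame["spectrum"], codebooks_a["spectrum"])  # step 402
    i_pow = nearest_level(frame["power"], codebooks_a["power"])          # step 403
    i_pitch = nearest_level(frame["pitch"], codebooks_a["pitch"])        # step 404
    return {
        "spectrum": conversion["spectrum"][i_spec],  # step 405, codebook 41
        "power": conversion["power"][i_pow],         # step 406, codebook 43
        "pitch": conversion["pitch"][i_pitch],       # step 407, codebook 42
    }


# Toy data: speaker A1's codebooks and the A1-to-B conversion codebooks.
codebooks_a = {"spectrum": [[0.0, 0.0], [1.0, 1.0]],
               "power": [10.0, 20.0],
               "pitch": [100.0, 150.0]}
conversion = {"spectrum": [[0.5, 0.5], [2.0, 2.0]],
              "power": [12.0, 24.0],
              "pitch": [120.0, 180.0]}

converted = convert_frame({"spectrum": [0.9, 1.1], "power": 11.0, "pitch": 140.0},
                          codebooks_a, conversion)
```

Because the index found with speaker A1's codebook is looked up directly in the conversion codebook, conversion at run time is a table lookup per parameter.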

FIG. 4 is a schematic block diagram showing the hardware configuration of a rule synthesis system that includes the rule synthesis device handling multiple languages according to the present invention.

Referring to FIG. 4, this rule synthesis system includes an amplifier 1, a low-pass filter 2, an A/D converter 3, and a computer system 4. The amplifier 1 amplifies the input speech signal, and the low-pass filter 2 removes aliasing noise from the amplified speech signal. The A/D converter 3 converts the speech signal into a 16-bit digital signal using a 12 kHz sampling signal. The computer system 4 includes a rule synthesis device (arithmetic processing unit) 5, a magnetic disk 6, terminals 7, and a printer 8. The speech rule synthesis device that handles multiple languages according to the present invention is implemented within the rule synthesis device 5 of FIG. 4.
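As a small worked example of what the stated A/D parameters imply: a 12 kHz sampling rate gives a Nyquist frequency of 6 kHz, which the low-pass filter must respect to avoid aliasing, and 16-bit samples give 24,000 bytes per second of speech. The signed-integer encoding with clipping shown in `to_int16` is an assumption of this sketch; the patent states only the sampling rate and word length.

```python
SAMPLE_RATE_HZ = 12_000   # sampling signal of A/D converter 3
BITS_PER_SAMPLE = 16      # word length of the digital signal

# Highest frequency representable without aliasing.
nyquist_hz = SAMPLE_RATE_HZ / 2

# Raw storage needed for one second of digitized speech.
bytes_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8


def to_int16(sample: float) -> int:
    """Quantize a sample in [-1.0, 1.0] to a signed 16-bit integer.

    Signed encoding and clipping are assumptions made for this sketch.
    """
    clipped = max(-1.0, min(1.0, sample))
    return round(clipped * 32767)
```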

The rule synthesis device handling multiple languages described above is particularly useful as a multilingual proxy-speech rule synthesis system, which outputs speech in many languages in place of a speaker uttering them personally.

[Effects of the Invention] As described above, the present invention includes rule synthesis means for synthesizing, for multiple languages, the speech signal of a first speaker who is the standard speaker, and voice conversion means for converting the speech signal output from the selection means into the speech signal of a second speaker whose personal characteristics are to be imparted. A speech signal having personal characteristics can therefore be obtained for multiple languages.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing one embodiment of a rule synthesis device that handles multiple languages according to the present invention. FIG. 2 is a flow diagram showing the procedure for creating the conversion codebooks. FIG. 3 is a flow diagram showing the voice quality conversion procedure in the voice quality conversion unit. FIG. 4 is a schematic block diagram showing the hardware configuration of a rule synthesis system that includes the rule synthesis device handling multiple languages according to the present invention.

In the figures, 1 denotes an amplifier, 2 a low-pass filter, 3 an A/D converter, 4 a computer system, 5 a rule synthesis device, 10 to n0 input sections, 11 to n1 rule synthesis units, 12 to n2 speech unit set information files, 100 a switching unit, 101 a voice quality conversion unit, 102 a voice personal information file, 103 an output section, s10 to sn0 character information signals, s11 to sn1 the speech signals of speakers A1 to An, s3 the voice personal information signal of speaker B, and s4 the speech signal of speaker B.

Claims

[Scope of Claims] A speech rule synthesis device that handles multiple languages, comprising: a plurality of rule synthesis means, each of which, for one of a plurality of languages, includes a speech unit signal set of a first speaker who is the standard speaker for that language, receives a character information signal from the outside, and synthesizes a speech signal of the first speaker by referring to the speech unit signal set; selection means, connected to the plurality of rule synthesis means, for selecting one language from the plurality of languages and outputting the speech signal of the first speaker for the selected language; conversion signal file means storing the conversion signals needed to convert the speech of the first speaker into the speech of a second speaker whose personal characteristics are to be imparted to the speech; and voice conversion means for receiving the synthesized speech signal of the first speaker in the selected language and converting it into a speech signal of the second speaker on the basis of the conversion signals stored in the conversion signal file means.