JP2010224419A

JP2010224419A - Voice synthesizer, method, and program

Info

Publication number: JP2010224419A
Application number: JP2009073997A
Authority: JP
Inventors: Nobuyuki Nishizawa; 信行西澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2009-03-25
Filing date: 2009-03-25
Publication date: 2010-10-07
Anticipated expiration: 2029-03-25
Also published as: JP5376643B2

Abstract

PROBLEM TO BE SOLVED: To reproduce the features of an original voice more faithfully in synthesis, in an environment where the available data size or communication bit rate is extremely limited.

SOLUTION: This voice synthesizer synthesizes a voice based on a series of information for voice synthesis. The information for voice synthesis consists of common information for synthesizing a voice regardless of the target voice quality, and correction information that takes effect when a voice of a predetermined voice quality is synthesized. The voice synthesizer includes a correction means that modifies the contents of the common information based on the correction information when synthesizing a voice of the predetermined voice quality, and a voice synthesis means that synthesizes a voice waveform based on the corrected information. It further includes a control command generation means that generates a control command for voice waveform generation based on the common information, a control command correction means that corrects the generated control command based on the correction information when synthesizing a voice of the predetermined voice quality, and a voice waveform generation means that generates the voice waveform based on the corrected control command.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a speech synthesizer, method, and program that generate a speech waveform based on features of speech recorded in advance.

CELP (Code Excited Linear Prediction) is known as a highly efficient coding method specialized for speech. CELP is based on knowledge of the physical characteristics of speech waveforms and does not directly use the linguistic constraints of speech, so it can encode speech of any style in any language with high efficiency. However, the bit rate of the encoded speech is at least several kbps (kilobits per second). In contrast, techniques that synthesize speech from linguistic information generally belong to speech synthesis technology. A typical application of speech synthesis is text-to-speech conversion. Here, however, a device whose input consists of symbols describing phoneme types and prosodic features (obtained, for example, by analyzing text) and which generates a speech waveform from them is specifically called a speech synthesizer, and the symbols that make up its input are called speech synthesis symbols. Speech synthesis symbols can take various forms. Here we consider a notation that simultaneously describes the phonological information making up an utterance and the prosodic information expressed mainly as pauses and voice pitch. An example of such speech synthesis symbols is JEITA (Japan Electronics and Information Technology Industries Association) standard IT-4002, "Symbols for Japanese Text-to-Speech Synthesis" (Non-Patent Document 1).

There are various methods for generating a speech waveform in a speech synthesizer. Two representative methods are described here: concatenative synthesis and analysis-synthesis.

In concatenative synthesis, a large amount of speech is collected in advance and its fragments (hereinafter, speech units) are stored in a database. At synthesis time, speech units that are close to each parameter of the specified synthesis target information and that connect well with the preceding and following speech units are selected from the unit database and used for synthesis. Each speech unit is annotated with parameters such as phoneme information, acoustic parameters, and its context of occurrence in the speech corpus.

In concatenative synthesis, the speech units to be used are selected (hereinafter, unit selection) based on the synthesis target information specified by the speech synthesis symbol string. Unit selection is driven by a distortion measure called cost, an index of how far the speech waveform synthesized from the selected units degrades from the target synthesized waveform. The cost can usually be divided into a target cost, which indicates the error between the synthesis target information and a speech unit, and a concatenation cost, which indicates the degree of discontinuity between adjacent speech units. Units are selected so as to minimize the total cost.
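The cost-minimizing unit selection described above is commonly solved by dynamic programming. The following is a minimal sketch under stated assumptions, not the method of this publication: each unit is reduced to a single number (for example a pitch value), and `select_units`, `target_cost`, and `join_cost` are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Sequence


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[float]],
    target_cost: Callable[[float, float], float],
    join_cost: Callable[[float, float], float],
) -> List[int]:
    """Pick one candidate per position so that the summed cost is minimal.

    targets[i]    -- target feature for position i (toy: a single number)
    candidates[i] -- features of the database units available for position i
    Returns the index of the chosen candidate at each position.
    """
    if not targets:
        return []
    # best[i][j]: lowest total cost of any path ending in candidate j at position i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[i]:
            # Cheapest way to arrive at `cand` from any candidate of position i-1.
            arrive = [best[i - 1][k] + join_cost(prev, cand)
                      for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(arrive)), key=arrive.__getitem__)
            row.append(arrive[k_min] + target_cost(targets[i], cand))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]
```

With a zero join cost the choice at each position depends only on the target cost. A large join cost can make the search prefer a unit that matches its own target slightly worse but connects more smoothly to its neighbor.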

The unit database may store speech waveforms as they are, or may store units compressed by CELP or a similar method. When waveforms are stored, the final output waveform is synthesized by concatenating units in the waveform domain. When units are compressed with CELP or the like, they can be concatenated on the decoded waveforms, or alternatively concatenated first at the parameter level before decoding and then decoded together to generate the speech waveform.

In contrast, the analysis-synthesis method operates on speech feature parameters obtained by analyzing speech, and generates a speech waveform from those parameters by signal processing. The focus here is on methods that synthesize a waveform by signal processing based on a source-filter model, which combines an excitation source and a filter as CELP does. In a source-filter model, the waveform is synthesized by driving a filter that shapes the resonance of speech with a suitable excitation source. Unlike CELP, however, we mainly consider excitation by relatively simple sources such as an impulse train or a white noise source. In what follows, the source parameters and filter parameters are collectively called speech synthesis parameters. The speech synthesis parameters consist of several parameters, such as MFCCs (Mel-Frequency Cepstral Coefficients), which represent spectral features, and the fundamental frequency (F0) of the waveform, which corresponds to voice pitch. The filter may be an AR (autoregressive) filter, or an MLSA (Mel Log Spectrum Approximation) filter (Non-Patent Document 2), which takes MFCCs directly as its parameters.

To synthesize sounds such as consonants, the speech synthesis parameters must change over time. In this method, the parameters are therefore typically updated at a fixed period, for example about 5 ms, so that speech is synthesized while its characteristics change. One such period is generally called a frame. To synthesize speech with this configuration, the value of each speech synthesis parameter must be determined for every frame from the speech synthesis symbols.

One way to do this is to model the temporal change of the speech synthesis parameter time series with a suitable model, generate the model parameters by first predicting them from the speech synthesis symbols, and then generate the parameter time series from the resulting model. This makes it possible to synthesize arbitrary speech. The model is hereinafter called a speech generation model. For example, suppose the speech synthesis parameters of a phoneme are divided in time into three states. Let d1, d2, and d3, in order from the first state, be the vectors representing the statistical features of the number of frames in each state, and concatenate their elements into a single vector d. Likewise, let v1, v2, and v3, in order from the first state, be the vectors representing the statistical features of each state. The features of the speech synthesis parameters needed to synthesize that phoneme can then be expressed by the four vectors d, v1, v2, and v3, which make up the parameters of the speech generation model.

Assume that every phoneme can be represented by four vectors in this way, and create an optimal codebook for each vector in advance. Alternatively, v1, v2, and v3 may share a single codebook. At synthesis time, the optimal codebook vectors that make up the speech generation model parameters of each phoneme are first predicted from the speech synthesis symbols, and a speech generation model is built for each phoneme. These models are then concatenated in time order to form a speech generation model for one utterance, and the optimal speech synthesis parameter time series is obtained from that model. The speech waveform is generated by controlling the source and filter according to this time series.
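To make the role of d, v1, v2, and v3 concrete, the sketch below expands per-phoneme models into a frame-by-frame parameter sequence. It is a deliberate simplification and not the generation procedure of this publication: d is assumed to be reduced to an integer frame count per state, each v is treated as a plain mean vector that is held constant for the length of its state, and no smoothing across state boundaries is applied. The names `expand_phoneme` and `expand_utterance` are hypothetical.

```python
from typing import List, Sequence, Tuple

FRAME_PERIOD_MS = 5  # frame period given in the text as an example (about 5 ms)

# One phoneme model: (d, [v1, v2, v3]) with d = frame count per state (assumption).
Phoneme = Tuple[Sequence[int], Sequence[Sequence[float]]]


def expand_phoneme(d: Sequence[int],
                   states: Sequence[Sequence[float]]) -> List[List[float]]:
    """Hold each state's vector for the number of frames assigned to that state."""
    if len(d) != len(states):
        raise ValueError("need exactly one frame count per state")
    frames: List[List[float]] = []
    for n_frames, v in zip(d, states):
        frames.extend(list(v) for _ in range(n_frames))
    return frames


def expand_utterance(phonemes: Sequence[Phoneme]) -> List[List[float]]:
    """Concatenate the phoneme models in time order into one parameter sequence."""
    frames: List[List[float]] = []
    for d, states in phonemes:
        frames.extend(expand_phoneme(d, states))
    return frames
```

The duration of the resulting utterance in milliseconds is `len(frames) * FRAME_PERIOD_MS`. A practical system would normally generate a smooth trajectory from the state statistics rather than a piecewise-constant one.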

By using a speech synthesis method such as those above, a speech waveform can be generated from data of about several hundred bps expressed as a speech synthesis symbol string.

Non-Patent Document 1: "Symbols for Japanese Text-to-Speech Synthesis", JEITA standard IT-4002, March 2005.
Non-Patent Document 2: Satoshi Imai, Kazuo Sumita, Chieko Furuichi, "Mel Log Spectrum Approximation (MLSA) Filter for Speech Synthesis", IEICE Transactions (A), J66-A, 2, pp. 122-129, Feb. 1983.

Consider reproducing a voice message prepared in advance in an environment where the available data size or communication bit rate is very limited. The main target here is storing speech data in a medium with limited storage capacity, such as a two-dimensional barcode. Preferences for voice quality and reading style differ from user to user, so to increase user satisfaction it is desirable to prepare variations of the speech in various voice qualities and reading styles in advance. For the sake of explanation, all differences between individual variations are treated below as differences in "voice quality". Accordingly, "voice quality" below is not limited to its phonetic definition but denotes the individuality of a voice in a more general sense.

If there were no limit on data size, the speech could simply be encoded independently for each voice quality with a conventional speech coding technique. However, conventional speech coding requires a certain data size to maintain quality, and because the speech is encoded separately for each voice quality, the data size is proportional to the number of voice qualities. It is therefore difficult to increase the number of voice qualities while maintaining quality within a limited data size.

With speech synthesis technology, on the other hand, speech can be synthesized from a speech synthesis symbol string alone, although the speech synthesizer itself is fairly large. A speech synthesizer that outputs speech of a given voice quality can be built by collecting a large amount of speech data of that voice quality in advance, for example on the order of tens of minutes to tens of hours. A speech synthesis symbol string is usually much smaller than speech-coded data and can easily be recorded in a two-dimensional barcode or the like. Furthermore, if data for the speech synthesizer is built in advance from speech of several voice qualities and installed in the device, the voice quality of the output can be changed relatively easily by switching that data at the time of use. In general, however, a speech synthesizer predicts the features of the target speech solely from the linguistic information in the speech synthesis symbol string. Compared with speech-coded natural speech, the result reproduces the original speech poorly and is much less natural.

As described above, the prior art cannot achieve both naturalness of the voice message and diversity of voice quality when storage capacity is limited.

Accordingly, an object of the present invention is to provide a speech synthesizer, method, and program that can synthesize speech reproducing the features of the original speech more faithfully in an environment where the available data size or communication bit rate is very limited.

To achieve the above object, a speech synthesizer according to the present invention synthesizes speech based on a series of speech synthesis information. The speech synthesis information consists of common information for synthesizing speech regardless of the target voice quality, and correction information that takes effect when speech of a predetermined voice quality is synthesized. The speech synthesizer comprises correction means that, when speech of the predetermined voice quality is synthesized, changes the contents of the common information based on the correction information, and speech synthesis means that synthesizes a speech waveform based on the corrected information.

To achieve the above object, a speech synthesizer according to the present invention synthesizes speech based on a series of speech synthesis information. The speech synthesis information consists of common information for synthesizing speech regardless of the target voice quality, and correction information that takes effect when speech of a predetermined voice quality is synthesized. The speech synthesizer comprises control command generation means that generates a control command for speech waveform generation based on the common information, control command correction means that, when speech of the predetermined voice quality is synthesized, corrects the generated control command based on the correction information, and speech waveform generation means that generates a speech waveform based on the corrected control command.

It is also preferable that the common information consists of phonological symbols and prosodic symbols.

It is also preferable that the speech synthesizer further comprises means for selecting the voice quality of the speech to be synthesized from among a plurality of voice qualities.

To achieve the above object, a speech synthesis method according to the present invention synthesizes speech based on a series of speech synthesis information. The speech synthesis information consists of common information for synthesizing speech regardless of the target voice quality, and correction information that takes effect when speech of a predetermined voice quality is synthesized. The method comprises a correction step that, when speech of the predetermined voice quality is synthesized, changes the contents of the common information based on the correction information, and a speech synthesis step that synthesizes a speech waveform based on the corrected information.

To achieve the above object, a speech synthesis method according to the present invention synthesizes speech based on a series of speech synthesis information. The speech synthesis information consists of common information for synthesizing speech regardless of the target voice quality, and correction information that takes effect when speech of a predetermined voice quality is synthesized. The method comprises a control command generation step that generates a control command for speech waveform generation based on the common information, a control command correction step that, when speech of the predetermined voice quality is synthesized, corrects the generated control command based on the correction information, and a speech waveform generation step that generates a speech waveform based on the corrected control command.

To achieve the above object, a program according to the present invention causes a computer to function as the speech synthesizer described above.

According to the present invention, the speech synthesis information is divided into common information, which is independent of voice quality, and correction information, which differs for each voice quality. Speech with the same utterance content can then be encoded efficiently by sharing the common information and adding only the correction information. If the correction information is created so that the command information for waveform generation approaches the features of the original natural speech, highly natural speech that reproduces the features of the original speech can be output.

In addition, when the common information consists only of linguistic information, as in conventional speech synthesis, conventional synthesis remains possible even if the speech synthesis information contains no correction information for the voice quality the user wants. Naturalness is reduced, but speech of the voice quality desired by the user can still be synthesized.

Furthermore, by making the common information itself the target of correction, the common information can be shared even when the linguistic expression differs in part of the utterance.

With these features, multiple utterances with similar content but different voice qualities can be stored in less capacity than conventional speech coding requires, and can be synthesized with the features of the original speech reproduced more faithfully than when conventional speech synthesis is used as is.

FIG. 1 is a block diagram of a speech synthesizer according to the first embodiment of the present invention.
FIG. 2 is a flowchart of speech synthesis according to the first embodiment.
FIG. 3 is a block diagram of a speech synthesizer according to the second embodiment of the present invention.
FIG. 4 is a flowchart of speech synthesis according to the second embodiment.
FIG. 5 is a block diagram of a speech synthesizer according to the third embodiment of the present invention.
FIG. 6 is a flowchart of speech synthesis according to the third embodiment.

The best mode for carrying out the present invention is described in detail below with reference to the drawings. In what follows, a "unit speech" is the minimum unit of speech processed by the speech synthesizer according to the present invention. Specific examples of unit speech are phonemes, syllables, and words. Here, however, unit speech is assumed to be classified with regard to differences in phonological context, such as the types of the preceding and following phonemes, and differences in prosodic features, such as accent, intonation, and speaking rate. "Speech synthesis symbols" are a series of symbols that describe the type of each unit speech contained in one utterance.

FIG. 1 is a block diagram of a speech synthesizer according to the first embodiment of the present invention. As shown in FIG. 1, the speech synthesizer 1 comprises a preprocessing unit 11 for the speech synthesis information, a common information correction unit 12, and a speech waveform synthesis unit 13. FIG. 2 shows a flowchart of speech synthesis according to the first embodiment.

S11. First, the preprocessing unit 11 processes the speech synthesis information and extracts the common information and the correction information. An example of common information is a notation that describes phonological and prosodic information for speech synthesis together, such as the symbols defined in JEITA IT-4002. Speech recognition technology may be used to create the common information. In general, however, the speech is usually a reading of a script prepared in advance, so common information of this kind can be created automatically, with relatively high accuracy, from a script written in mixed kanji and kana by using Japanese text-to-speech conversion technology. Errors from the automatic processing and discrepancies from the actual utterance can easily be corrected by hand. The correction information includes identification information that identifies the target voice quality.
S12. The common information correction unit 12 takes as input the common information and correction information output by the preprocessing unit 11. It checks whether the voice quality for which the speech waveform synthesis unit 13 generates waveforms matches the target voice quality of the correction information.
S13. If they match, the common information correction unit 12 corrects the common information based on the correction information and outputs it. If the voice quality for which the speech waveform synthesis unit 13 generates waveforms differs from the target voice quality of the correction information, the common information correction unit 12 outputs the input common information unchanged.
S14. The speech waveform synthesis unit 13 synthesizes a speech waveform based on the output of the common information correction unit 12, and the waveform is output as the output of the speech synthesizer 1.
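The decision made in steps S12 and S13 can be sketched as follows. This is an illustrative sketch only: the data layout (`SynthesisInfo`, `Correction`), the string voice identifier, and the `apply_ops` callback are assumptions introduced here, since the publication does not specify a concrete encoding.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Correction:
    voice_id: str  # identification information for the target voice quality
    ops: list      # correction codes; their format is left open in this sketch


@dataclass
class SynthesisInfo:
    common: List[str]                        # common information (symbol string)
    correction: Optional[Correction] = None  # correction information, if any


def correct_common_info(
    info: SynthesisInfo,
    synthesizer_voice_id: str,
    apply_ops: Callable[[List[str], list], List[str]],
) -> List[str]:
    """S12-S13: correct the common information only when the voice quality matches."""
    corr = info.correction
    if corr is not None and corr.voice_id == synthesizer_voice_id:
        # S13, match: apply the correction to a copy of the common information.
        return apply_ops(list(info.common), corr.ops)
    # S13, no match (or no correction present): pass the common information through.
    return list(info.common)
```

The returned symbol list corresponds to the input of step S14. Because the pass-through branch needs nothing beyond the common information, a synthesizer whose voice quality has no matching correction still produces speech.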

An example of correction information is a sequence of codes, each indicating one of the following: replace the i-th symbol from the beginning of the common information with symbol A; insert symbol B immediately before the j-th symbol from the beginning; or delete the k-th symbol from the beginning. The common information is corrected by performing the corresponding operations in order, starting from the first code. For example, suppose common information in syllable units is used, and the common information contains "De'nsha." ("train") with a head-high accent, where the symbol "'" marks the position of the accent nucleus. If the voice quality targeted by the correction information reads the word with a flat accent, correction information consisting of a single code that deletes the second symbol turns the head-high "De'nsha." into the flat "Densha.". Similar corrections can handle differences in pause insertion position and small differences in wording, such as "De'sune." versus "De'suwa.".
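The three kinds of correction codes described above behave like a small edit script. The sketch below applies such a script to a symbol list. The tuple encoding of the codes and the function name `apply_corrections` are hypothetical, positions are 1-based as in the text, and each position is assumed to refer to the sequence as it stands after the preceding codes have been applied (the publication does not state this explicitly).

```python
from typing import List, Sequence, Tuple


def apply_corrections(symbols: Sequence[str], codes: Sequence[Tuple]) -> List[str]:
    """Apply replace / insert / delete codes in order, starting from the first code."""
    out = list(symbols)
    for code in codes:
        kind, pos = code[0], code[1]
        idx = pos - 1  # the text counts symbols from 1
        if not 0 <= idx < len(out):
            raise IndexError(f"position {pos} is outside the symbol string")
        if kind == "replace":   # ("replace", i, A): i-th symbol becomes A
            out[idx] = code[2]
        elif kind == "insert":  # ("insert", j, B): B goes right before the j-th symbol
            out.insert(idx, code[2])
        elif kind == "delete":  # ("delete", k): remove the k-th symbol
            del out[idx]
        else:
            raise ValueError(f"unknown correction code: {kind!r}")
    return out


# The accent example from the text: deleting the second symbol (the accent
# nucleus mark) turns the head-high reading into the flat one.
head_high = ["de", "'", "n", "sha", "."]
flat = apply_corrections(head_high, [("delete", 2)])
```

A single code is enough for the accent example, which illustrates why correction information of this kind can be much smaller than a second full symbol string.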

FIG. 3 is a block diagram of a speech synthesizer according to the second embodiment of the present invention. As shown in FIG. 3, the speech synthesizer 1 comprises a preprocessing unit 11, a speech synthesis control command generation unit 14, a control command correction unit 15, and a speech waveform generation unit 16. The combination of the speech synthesis control command generation unit 14 and the speech waveform generation unit 16 corresponds to the speech waveform synthesis unit 13 of the first embodiment. FIG. 4 shows a flowchart of speech synthesis according to the second embodiment.

S21. The preprocessing unit 11 processes the speech synthesis information and extracts the common information and the correction information.
S22. The speech synthesis control command generation unit 14 converts the extracted common information into a speech synthesis control command for speech waveform generation.
S23. When the control command is input, the control command correction unit 15 checks whether the voice quality for which the speech waveform generation unit 16 generates waveforms matches the target voice quality of the correction information.
S24. If they match, the control command correction unit 15 corrects the control command based on the separately input correction information. If they do not match, the control command correction unit 15 outputs the control command unchanged.
S25. The speech waveform generation unit 16 generates a speech waveform based on the output of the control command correction unit 15, and the waveform is output as the output of the speech synthesizer 1.

As an example of the control command, when concatenative synthesis is used as the synthesis method, the control command may consist of unit IDs. In this case, the unit selection required by concatenative synthesis is performed by the speech synthesis control command generation unit 14, and the units are concatenated by the speech waveform generation unit 16. When analysis-synthesis is used, an example of the control command is the speech synthesis parameter time series, specifically the time series of MFCC and F0 parameters. In this case, the speech synthesis control command generation unit 14 generates the speech synthesis parameter time series from the speech generation model.

Alternatively, the parameters of the speech generation model may themselves serve as the control command. In this case, the speech waveform generation unit 16 generates the speech synthesis parameter time series from the speech generation model.

An example of the correction information in this case is a sequence of codes, each representing a replacement operation on a particular value of one of the parameters. In this embodiment as well, the correction information includes identification information that identifies the target voice quality.
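
One conceivable way to lay out such a code sequence is sketched below in Python. The byte layout is purely an assumption made for illustration (a 1-byte voice-quality ID, a 2-byte operation count, then 5 bytes per replacement operation); the description does not specify any encoding, and the function names and the quantization step `scale` are hypothetical.

```python
import struct

# Assumed layout, not taken from the description:
HEADER = struct.Struct("<BH")   # voice-quality ID, number of replacement operations
OP = struct.Struct("<HBh")      # frame index, parameter index, quantized replacement value


def encode_correction(voice_id, ops):
    """Pack (frame, parameter, quantized value) replacement operations into bytes."""
    data = HEADER.pack(voice_id, len(ops))
    for frame, param, value in ops:
        data += OP.pack(frame, param, value)
    return data


def decode_correction(data):
    """Recover the voice-quality ID and the list of replacement operations."""
    voice_id, count = HEADER.unpack_from(data, 0)
    ops = [OP.unpack_from(data, HEADER.size + i * OP.size) for i in range(count)]
    return voice_id, ops


def apply_correction(params, data, voice_id, scale=0.01):
    """Replace parameter values only when the target voice quality matches."""
    corrected = [list(frame) for frame in params]
    target, ops = decode_correction(data)
    if target != voice_id:
        return corrected            # different voice quality: leave values unchanged
    for frame, param, quantized in ops:
        corrected[frame][param] = quantized * scale
    return corrected
```

With this assumed layout, each replacement operation costs 5 bytes, so the size of the correction information grows linearly with the number of corrected values.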

By constructing this correction information so that the spectral features and the fundamental-frequency features of the speech output from the speech synthesizer 1 approach the corresponding features of the original speech, the synthesizer can output speech closer to the characteristics of the original speech than when no correction is performed.

For example, a value correction is tried for every element of every feature vector, and the trial correction that brings the features of the synthesized speech closest to those of the original speech is adopted as one element of the correction information. The correction information is created by repeating this and accumulating elements until (1) the number of corrections reaches a predetermined count or the total size of the correction information reaches a predetermined limit, or (2) the difference in features between the synthesized speech and the original speech falls below a predetermined criterion.
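
The iterative procedure above can be sketched as a greedy search. The Python below is a minimal sketch under several assumptions: the squared-error `distance` stands in for whatever feature comparison is actually used, the candidate replacement values come from a fixed hypothetical list `candidates`, and the comparison is made directly on the parameter vectors rather than on re-synthesized speech. Only the count limit of stopping condition (1) is modeled, not the size limit.

```python
from typing import List, Sequence, Tuple

Frames = List[List[float]]
Edit = Tuple[int, int, float]   # (frame index, element index, replacement value)


def distance(a: Frames, b: Frames) -> float:
    """Assumed feature difference: sum of squared element-wise differences."""
    return sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))


def build_correction(synth: Frames, target: Frames, candidates: Sequence[float],
                     max_edits: int, tolerance: float) -> List[Edit]:
    current = [list(frame) for frame in synth]
    edits: List[Edit] = []
    # Stop on condition (1), the edit count limit, or (2), a small enough difference.
    while len(edits) < max_edits and distance(current, target) >= tolerance:
        best, best_dist = None, distance(current, target)
        for t, frame in enumerate(current):
            for k, old in enumerate(frame):
                for value in candidates:        # try every candidate replacement
                    frame[k] = value
                    d = distance(current, target)
                    if d < best_dist:
                        best, best_dist = (t, k, value), d
                frame[k] = old                  # undo the trial
        if best is None:
            break                               # no trial improves the result
        t, k, value = best
        current[t][k] = value                   # keep the best trial as one element
        edits.append(best)
    return edits
```

For instance, with `synth = [[0.0, 0.0], [0.0, 0.0]]`, `target = [[1.0, 0.0], [0.0, 2.0]]` and `candidates = [1.0, 2.0]`, the first accepted edit is the one that removes the largest error, `(1, 1, 2.0)`, followed by `(0, 0, 1.0)`.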

Alternatively, the device may combine the two correction processes described above, and a control command for speech synthesis may be configured in which both kinds of correction information coexist. FIG. 5 is a block diagram of a speech synthesizer according to the third embodiment of the present invention, which has this configuration. As shown in FIG. 5, the speech synthesizer 1 includes a preprocessing unit 11, a common information correction unit 12, a speech synthesis control command generation unit 14, a control command correction unit 15, and a speech waveform generation unit 16. FIG. 6 is a flowchart of speech synthesis according to the third embodiment.

S31. The preprocessing unit 11 processes the speech synthesis information and extracts the common information and the correction information.
S32. The common information correction unit 12 checks whether the target voice quality for waveform generation in the speech waveform generation unit 16 matches the target voice quality of the correction information.
S33. If they match, the common information correction unit 12 corrects the common information based on the correction information and outputs it. If the target voice quality for waveform generation in the speech waveform generation unit 16 differs from the target voice quality of the correction information, the common information correction unit 12 outputs the input common information as it is.
S34. The speech synthesis control command generation unit 14 converts the extracted common information into a speech synthesis control command for speech waveform generation.
S35. When the control command is input, the control command correction unit 15 checks whether the target voice quality for waveform generation in the speech waveform generation unit 16 matches the target voice quality of the correction information.
S36. If they match, the control command correction unit 15 corrects the control command based on the separately input correction information. If they do not match, the control command correction unit 15 outputs the control command as it is.
S37. The speech waveform generation unit 16 generates a speech waveform based on the output of the control command correction unit 15, and the waveform is output as the output of the speech synthesizer 1.
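
Steps S31 to S37 apply corrections at two levels, which the following Python sketch illustrates. Everything here is hypothetical: `MixedCorrection` is an assumed container holding both kinds of correction information, the symbols are toy strings in which a trailing apostrophe marks an accent, and `to_control_command` is a toy stand-in for S34 that emits one pitch value per symbol.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MixedCorrection:
    """Hypothetical container mixing both kinds of correction information."""
    target_voice: str
    symbol_edits: Dict[int, str] = field(default_factory=dict)     # applied in S33
    command_edits: Dict[int, float] = field(default_factory=dict)  # applied in S36


def to_control_command(symbols: List[str]) -> List[float]:
    """Toy S34: an accented symbol (trailing apostrophe) gets a higher pitch value."""
    return [120.0 if s.endswith("'") else 100.0 for s in symbols]


def run_pipeline(symbols: List[str], correction: Optional[MixedCorrection],
                 voice: str) -> List[float]:
    # S32 and S35 both test the same condition: does the voice quality match?
    matches = correction is not None and correction.target_voice == voice
    if matches:   # S33: coarse correction of the common information
        symbols = [correction.symbol_edits.get(i, s) for i, s in enumerate(symbols)]
    command = to_control_command(symbols)   # S34
    if matches:   # S36: fine correction of the control command
        command = [correction.command_edits.get(i, v) for i, v in enumerate(command)]
    return command   # S37 would pass this command to waveform generation
```

In this sketch a single symbol-level edit changes the command value through S34, while a command-level edit sets a value that no symbol could produce, which mirrors the coarse and fine roles of the two correction stages.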

When the common information consists of phonological information and prosodic information for speech synthesis, directly correcting the common information can change the output synthesized speech substantially with only a small amount of correction information. However, correcting the common information cannot make the fine adjustments to the speech that are possible by correcting the commands to the speech waveform generation unit 16, so combining the two is effective.

In this case, the correction information can be created, for example, by first creating correction information for the common information and then, for the portions of the resulting synthesized speech where quality problems occur, adding correction information that corrects the commands to the speech waveform generation unit 16.

A series of speech synthesis information may also be formed by bundling one set of common information with correction information for several voice qualities. In this case, when the speech waveform generation process generates a waveform in the user's preferred voice quality, chosen from several voice qualities prepared in advance, a voice quality for which correction information is available yields speech closer to the characteristics of the original speech, and a voice quality without correction information still yields speech of that target voice quality at the level achievable with a conventional speech synthesizer. By creating correction information only for voice qualities expected to be popular, the needs of users as a whole can be met efficiently while keeping the overall data size small.
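
The bundling of one set of common information with correction information for several voice qualities can be sketched as a lookup keyed by voice quality. The Python below is an illustrative sketch only; the function name `apply_for_voice`, the dictionary layout, and the representation of an edit as an (index, symbol) pair are assumptions, not part of the description.

```python
from typing import Dict, List, Tuple

Edit = Tuple[int, str]   # (position in the common information, replacement symbol)


def apply_for_voice(common: List[str],
                    corrections: Dict[str, List[Edit]],
                    voice: str) -> Tuple[List[str], bool]:
    """Return the information to synthesize from and whether a correction was applied."""
    edits = corrections.get(voice)
    if edits is None:
        # No correction information for this voice quality: fall back to the
        # uncorrected common information (conventional synthesis quality).
        return list(common), False
    corrected = list(common)
    for index, symbol in edits:
        corrected[index] = symbol
    return corrected, True
```

Because the common information is stored once and only the per-voice edits are added, the total data grows with the number of corrected voice qualities rather than with the number of selectable ones.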

All of the embodiments described above illustrate the present invention and do not limit it; the present invention can be implemented with various other modifications and changes. Accordingly, the scope of the present invention is defined only by the claims and their equivalents.

DESCRIPTION OF SYMBOLS
1 Speech synthesizer
11 Preprocessing unit
12 Common information correction unit
13 Speech waveform synthesis unit
14 Speech synthesis control command generation unit
15 Control command correction unit
16 Speech waveform generation unit

Claims

A speech synthesizer that synthesizes speech based on a series of speech synthesis information,
The speech synthesis information is composed of common information for synthesizing speech regardless of the target voice quality, and correction information that functions when voice synthesis is performed for a predetermined voice quality,
When synthesizing speech of a predetermined voice quality, correction means for changing the content of the common information based on the correction information;
Voice synthesis means for synthesizing a voice waveform based on the corrected information;
A speech synthesizer characterized by comprising:

A speech synthesizer that synthesizes speech based on a series of speech synthesis information,
The speech synthesis information is composed of common information for synthesizing speech regardless of the target voice quality, and correction information that functions when voice synthesis is performed for a predetermined voice quality,
Control command generating means for generating a control command for generating a voice waveform based on the common information;
When synthesizing speech of a predetermined voice quality, control command correcting means for correcting the generated control command based on the correction information;
Voice waveform generation means for generating a voice waveform based on the corrected control command;
A speech synthesizer characterized by comprising:

The speech synthesizer according to claim 1, wherein the common information includes a phoneme symbol and a prosodic symbol.

The speech synthesizer according to claim 1, further comprising means for selecting a voice quality of the voice to be synthesized from a plurality of voice qualities.

A speech synthesis method for synthesizing speech based on a series of speech synthesis information,
The speech synthesis information is composed of common information for synthesizing speech regardless of the target voice quality, and correction information that functions when voice synthesis is performed for a predetermined voice quality,
When synthesizing speech of a predetermined voice quality, a correction step for changing the content of the common information based on the correction information;
A speech synthesis method comprising a speech synthesis step of synthesizing a speech waveform based on the corrected information.

A speech synthesis method for synthesizing speech based on a series of speech synthesis information,
The speech synthesis information is composed of common information for synthesizing speech regardless of the target voice quality, and correction information that functions when voice synthesis is performed for a predetermined voice quality,
A control command generating step for generating a control command for generating a voice waveform based on the common information;
When synthesizing speech of a predetermined voice quality, a control command correction step for correcting the generated control command based on the correction information;
A voice waveform generation step of generating a voice waveform based on the corrected control command;
A speech synthesis method comprising:

A program that causes a computer to function as the speech synthesizer according to any one of claims 1 to 4.