JPH1091195A - Method of analyzing and synthesizing speech

Info

Publication number: JPH1091195A
Application number: JP9124571A
Authority: JP
Inventors: Sunao Aizawa; 直相澤; Yoshiteru Uchiyama; 喜照内山
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1996-05-15
Filing date: 1997-05-14
Publication date: 1998-04-10

Abstract

PROBLEM TO BE SOLVED: To enable good speech synthesis even for female voices, and to do so with a small amount of computation, in speech analysis and synthesis using the LPC (linear predictive coding) method. SOLUTION: The amplitude value and time information at each peak point of the original waveform are obtained, the peak with the maximum amplitude in each unit oscillation is determined, and the zero-amplitude point lying toward the origin on the time axis from that maximum peak is found; information representing the period of the unit waveform is then extracted using that point as a reference (steps s1-s5). For the first unit waveform, the information representing the period and the n largest-amplitude pulses are selected and sent to the synthesis side for synthesis. For the second and subsequent unit waveforms, the pulse train obtained from the previous unit waveform is smeared into a rounded, mountain-like shape, the smeared pulse train is subtracted from the residual waveform of the unit waveform being processed, and n pulses are extracted from the resulting difference in descending order of amplitude. On the synthesis side, synthesis is performed using the pulse train obtained by adding these n pulses to the rounded version of the pulse train used for the previous synthesis (steps s9-s14).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech analysis and synthesis method that, for unit waveforms repeating periodically at each unit time, obtained in particular from a female voice, obtains linear prediction coefficients based on the correlation between unit waveforms, extracts from the residual waveform obtained in the linear prediction both information indicating the period of the unit waveform and the remaining error component information, and performs speech synthesis based on this information.

[0002]

2. Description of the Related Art

One algorithm for synthesizing human speech is the LPC method. The LPC method can also be described as an algorithm that compresses input speech data to reduce its volume, records it, and reproduces it.

[0003] The LPC method can compress speech data at a high compression ratio for vowels and similar sounds in a male voice, and can synthesize them with high accuracy. For a female voice, however, it cannot achieve as high a compression ratio as for a male voice, and the synthesized sound tends to be degraded relative to the original.

[0004] That is, for each frame of several tens of msec, the LPC method converts the input speech into the following three elements: Y1 (LPC coefficients), Y2 (period and intensity of the voiced sound), and Y3 (error component).

[0005] (1) Y1 (LPC coefficients): Viewed over a short time span of several tens of msec, speech (vowels in particular) usually repeats almost the same waveform, as can be seen from the example vowel waveform in FIG. 8(a). The LPC coefficients are the linear prediction coefficients obtained from such correlated repetition.
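As a rough illustration of what "linear prediction coefficients obtained from correlated repetition" means, the sketch below estimates them with the autocorrelation method and the Levinson-Durbin recursion and then forms the prediction residual. The text does not say which estimation procedure is used, so this choice, and the names `autocorr`, `levinson_durbin`, and `prediction_residual`, are assumptions made only for illustration.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sample list."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for a predictor x[n] ~ sum_k a[k] * x[n-k].

    Returns (coefficients a[1..order], final prediction error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err == 0:
            break  # signal is perfectly predicted already
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def prediction_residual(x, coeffs):
    """Residual e[n] = x[n] - sum_k coeffs[k-1] * x[n-k], treating x[n<0] as 0."""
    out = []
    for n in range(len(x)):
        pred = sum(c * x[n - k - 1] for k, c in enumerate(coeffs) if n - k - 1 >= 0)
        out.append(x[n] - pred)
    return out
```

For a strongly correlated signal the residual is small almost everywhere, which is why the elements Y2 and Y3 described next can be encoded compactly.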

[0006] (2) Y2 (period and intensity of the voiced sound): The prediction error (residual) of the linear prediction can be divided into two components, and Y2 is one of them. It consists of a time component (Y21) and an intensity component (Y22) in the residual waveform shown in FIG. 8(b). The time component is the time t between the points at which the speech waveform changes periodically (p1, p2, p3, ... in FIG. 8(b)), and the intensity component represents the amplitude h of the input speech waveform.

[0007] (3) Y3 (error component): This is the other of the two components into which the prediction error (residual) of the linear prediction can be divided. Of the error between the linearly predicted waveform and the actual waveform, it is the noise-like component that appears random at first glance; it is the portion labeled N in FIG. 8(b).

[0008] The LPC method converts input speech into these three elements Y1, Y2, and Y3. In practice, however, for sounds whose waveform is highly correlated, such as vowels in a male voice (this excludes weakly correlated sounds such as fricatives), the error component N of Y3 is almost absent. Y3 is therefore ignored, and the compressed data consists only of the elements Y1, Y21, and Y22 plus a binary flag indicating speech or non-speech. For highly correlated speech it thus suffices to send data that omits Y3, as in FIG. 8(c), to the synthesis side, so a high compression ratio is obtained and the speech can still be synthesized with high accuracy.

[0009] For example, consider the "a" in a male voice saying "aah". As shown in FIG. 8(a), the speech waveform of "a" recurs periodically for some time. When the same waveform repeats, the sound "a" can be synthesized by storing the LPC coefficients that reproduce the first period of the waveform, information giving the amplitude (Y22), and information giving the repetition period of the unit waveform (Y21). Here, one period of the waveform is called a unit waveform; in FIG. 8 the unit waveforms are labeled 10, 20, 30, and 40. In a male voice, moreover, the waveform within each unit waveform often simply decays while keeping a similar shape.

[0010] Thus, for a male voice with a highly correlated waveform, the elements Y1, Y21, and Y22 and the binary speech/non-speech flag are sent to the synthesis side as compressed speech data at very short intervals, and the synthesis side synthesizes from this data, producing speech that a human listener finds almost indistinguishable from the original.

[0011]

PROBLEMS TO BE SOLVED BY THE INVENTION

In a female voice, however, the residual waveform of an original waveform such as that in FIG. 9(a) has the characteristic, shown in FIG. 9(b), that compared with a male voice the pulses at the points p1, p2, ... of FIG. 8 (the leading pulse of each unit waveform) are not distinct and cannot be clearly separated from the error component N. For a female voice, therefore, the error component N of Y3 cannot be ignored. In other words, in a female voice the error component Y3 also contains important elements that characterize the voice, so synthesis that ignores Y3 yields a sound greatly degraded from the original. One remedy would be to send the data including the error component to the synthesis side as is, but this is problematic in terms of data compression.

[0012] Apart from the above method, there is another speech analysis and synthesis technique called "analysis by synthesis". Without using the residual waveform, this technique derives a unit waveform from the original waveform, subtracts the unit waveform from the original waveform, and searches for pulse positions while examining the difference. It can send the data needed for synthesis to the synthesis side at a high compression ratio even for a female voice, and it permits highly accurate synthesis. However, the search requires processing such as determining the position and the unit-waveform amplitude that minimize the difference, so its drawback is an extremely large amount of computation.

[0013] An object of the present invention is, in speech analysis and synthesis using the LPC method, to achieve speech synthesis as good as that of "analysis by synthesis" even for a female voice without resorting to an analysis-by-synthesis style of procedure, and to do so with a small amount of computation.

[0014]

MEANS FOR SOLVING THE PROBLEMS

As set forth in claim 1, the speech analysis and synthesis method of the present invention is a method in which the original waveform to be processed is divided into frames of a predetermined duration; linear prediction coefficients based on the correlation between unit waveforms are obtained from the waveforms that repeat periodically at roughly constant intervals within a frame (hereinafter, unit waveforms); information indicating the period and intensity of the unit waveform and the remaining error component information are extracted from the residual waveform, which represents the error component of the linear prediction; and speech is synthesized based on this information. The method is characterized in that: the original waveform of a frame, represented on plane coordinates with mutually orthogonal time and amplitude axes, is examined along the time axis for peak points, and the amplitude value and time information of each peak point are obtained; from these peak points, the peak having the maximum amplitude in each unit waveform is determined; for the wave containing the maximum peak of each unit waveform, the zero-amplitude point lying toward the origin on the time axis is determined; using this zero-amplitude point as a reference, information indicating the period of the unit waveform is extracted from the residual waveform in its vicinity and sent to the speech synthesis side; information needed for synthesis is extracted as pulse information from the error component of the residual waveform corresponding to each unit waveform; and speech synthesis is performed based on this pulse information and the period information, together with the linear prediction coefficients.

[0015] As a result, even for speech such as a female voice, in which the information indicating the period of the unit waveform cannot be distinguished from the error component in the residual waveform, the period information and the error components needed for synthesis can be extracted accurately, the data can be compressed at a high ratio, and a good synthesized sound can be obtained.

[0016] In claim 1, the process of extracting information indicating the period of the unit waveform from the nearby residual waveform with the zero-amplitude point as a reference may simply take the zero-amplitude point itself as the period of the unit waveform. Alternatively, it may extract large-amplitude information from the residual waveform in the vicinity of the zero-amplitude point and derive the period information from the extracted information.

[0017] This makes it possible to accurately extract the information indicating the period of each unit waveform in the residual waveform.

[0018] In the process of extracting the information needed for synthesis as pulse information from the error component of the residual waveform corresponding to each unit waveform, and performing speech synthesis based on this pulse information and the period information together with the linear prediction coefficients, the following is done for at least the first unit waveform in a frame in which an unvoiced state changes to a voiced sound, and for frames in the vicinity of a large change in the linear prediction coefficients. The number of peaks in the waveform is counted and compared with a preset number. If the count is equal to or greater than the preset number, only the minimum information needed for synthesis is efficiently selected from the error component and sent to the synthesis side as pulse information. If the count is less than the preset number, pulse information is obtained not only from this minimum but also from the remaining error components and sent to the synthesis side.

[0019] With this arrangement, a speech waveform that varies in many ways within a unit waveform is judged to require a large amount of information for synthesis, and by efficiently selecting only the minimum necessary information from the residual waveform, the compression ratio is raised while a good synthesized sound is still obtained. Conversely, a speech waveform with few waves in each unit waveform is judged to require little information for synthesis, and a good synthesized sound is obtained by efficiently selecting the necessary error components of the residual waveform.

[0020] The process of selecting only the minimum information needed for synthesis from the error component selects a preset number of error components, in descending order of absolute amplitude, from the residual waveform corresponding to at least the first unit waveform in the frame in which the unvoiced state changes to a voiced sound, and sends them to the synthesis side as pulse information.
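The selection rule in this paragraph can be sketched as follows. This is a minimal illustration, assuming the residual of one unit waveform is available as a list of samples; the function name `select_largest_pulses` and the `(index, value)` pulse representation are hypothetical and not taken from the patent.

```python
def select_largest_pulses(residual, n):
    """Keep the n residual samples with the largest absolute amplitude.

    Returns (sample_index, value) pairs in time order. Both the name and
    the pair representation are illustrative assumptions.
    """
    ranked = sorted(range(len(residual)), key=lambda i: abs(residual[i]), reverse=True)
    chosen = sorted(ranked[:n])  # restore time order for transmission
    return [(i, residual[i]) for i in chosen]


residual = [0.1, -0.9, 0.3, 0.7, -0.2]
pulses = select_largest_pulses(residual, 2)
```

Only these `n` pairs would need to be sent to the synthesis side for the first unit waveform, which is where the compression comes from.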

[0021] This makes it possible to extract accurately and efficiently the minimum information needed for speech synthesis, giving data compressed at a high ratio together with a good synthesized sound.

[0022] In the process of extracting the information needed for synthesis as pulse information from the error component of the residual waveform corresponding to the first unit waveform, when the amplitude of the original waveform corresponding to the residual being selected is heading from positive to negative, the error component information for that residual is made a negative pulse; when the amplitude of the original waveform is heading from negative to positive, the error component information is made a positive pulse.
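One possible reading of this sign rule is sketched below. The text only says that the pulse sign follows the direction in which the original waveform is heading; estimating that direction from the two neighboring samples, and keeping the residual's own sign when the waveform is locally flat, are assumptions of this sketch, as is the name `signed_pulse`.

```python
def signed_pulse(original, residual, i):
    """Give the selected residual sample i a sign matching the local
    direction of the original waveform (rising -> positive pulse,
    falling -> negative pulse).

    The neighbor-difference slope estimate is an illustrative assumption.
    """
    magnitude = abs(residual[i])
    prev = original[i - 1] if i > 0 else original[i]
    nxt = original[i + 1] if i + 1 < len(original) else original[i]
    if nxt > prev:
        return magnitude
    if nxt < prev:
        return -magnitude
    return residual[i]  # flat region: keep the residual's own sign (assumption)


original = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0]
residual = [0.2, -0.3, 0.4, 0.1, -0.6, 0.5]
falling = signed_pulse(original, residual, 2)  # original heads from positive to negative
rising = signed_pulse(original, residual, 4)   # original heads from negative to positive
```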

[0023] This prevents the shape of the synthesized waveform from differing greatly from that of the original waveform, even temporarily, and yields a good synthesized sound.

[0024] In the same process of extracting the signal needed for synthesis as pulse information from the error component of the residual waveform corresponding to each unit waveform and performing speech synthesis based on this pulse information and the period information together with the linear prediction coefficients, the following is done for the second and subsequent unit waveforms of the frame being processed. The pulse information used to synthesize the immediately preceding unit waveform is subtracted from the actual residual waveform of the current unit waveform; a predetermined number of pulses are selected from the resulting difference in descending order of absolute amplitude; and the selected pulse information is passed from the analysis side to the synthesis side. The synthesis side adds the pulse information received from the analysis side to the pulse information used to synthesize the preceding unit waveform, and uses the sum to synthesize the current unit waveform.
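The analysis-side and synthesis-side steps for the second and subsequent unit waveforms can be sketched as below. This is a simplified illustration that assumes every unit waveform's residual has already been aligned at its leading pulse, so that sample indices are comparable between units; the names `analyze_unit` and `update_pulses` and the `{index: value}` pulse dictionaries are hypothetical.

```python
def analyze_unit(residual, previous_pulses, n):
    """Analysis side: subtract the pulses used for the previous unit
    waveform from the current residual and keep the n largest differences.
    """
    diff = [r - previous_pulses.get(i, 0.0) for i, r in enumerate(residual)]
    ranked = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    return {i: diff[i] for i in ranked[:n]}


def update_pulses(previous_pulses, delta):
    """Synthesis side: add the received pulses to the pulses that were
    used to synthesize the previous unit waveform.
    """
    merged = dict(previous_pulses)
    for i, v in delta.items():
        merged[i] = merged.get(i, 0.0) + v
    return merged


# Each pass transmits only one new pulse (n = 1), yet the synthesis-side
# pulse train accumulates detail from one unit waveform to the next.
residual = [1.0, 0.0, 0.5, -0.25]
state = {0: 1.0}                        # pulses used for the first unit waveform
delta = analyze_unit(residual, state, 1)
state = update_pulses(state, delta)
delta2 = analyze_unit(residual, state, 1)
state2 = update_pulses(state, delta2)
```

Because only the difference from what the synthesis side already holds is coded, the per-unit payload stays at `n` pulses while the reconstructed pulse train keeps improving.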

[0025] With this arrangement, when synthesizing the speech waveform "aah", for example, the amount of information is small at first and the synthesized waveform may not closely resemble the original, but information accumulates as time passes, so the synthesized waveform becomes extremely close to the original and a better synthesized sound is obtained overall.

[0026] In the process of subtracting the pulse information used to synthesize the preceding unit waveform from the actual residual waveform of the current unit waveform, each pulse making up that pulse information is smeared into a mountain shape, and the smeared pulse information is subtracted from the actual residual waveform of the current unit waveform. Likewise, in the synthesis-side process of adding the pulse information received from the analysis side to the pulse information used to synthesize the preceding unit waveform, each pulse making up the latter is smeared into a mountain shape, and the pulse information received from the analysis side is added to the smeared pulse information.
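A minimal sketch of the smearing step follows. The passage does not give the exact mountain shape, so the three-tap triangular kernel `(0.25, 0.5, 0.25)` used here is purely an assumed example, as are the names `smear`, `subtract_previous`, and `synthesis_update`.

```python
def smear(pulses, length, kernel=(0.25, 0.5, 0.25)):
    """Spread each pulse {index: value} over neighboring samples.

    The triangular kernel is an illustrative assumption; the text only
    says the pulses are broken down into a mountain shape.
    """
    out = [0.0] * length
    half = len(kernel) // 2
    for i, v in pulses.items():
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < length:
                out[j] += v * w
    return out


def subtract_previous(residual, previous_pulses):
    """Analysis side: residual minus the smeared previous pulse train."""
    smeared = smear(previous_pulses, len(residual))
    return [r - s for r, s in zip(residual, smeared)]


def synthesis_update(previous_pulses, delta, length):
    """Synthesis side: smeared previous pulse train plus received pulses."""
    out = smear(previous_pulses, length)
    for i, v in delta.items():
        out[i] += v
    return out
```

A wider kernel would tolerate a larger timing offset between successive unit waveforms at the cost of a blurrier excitation.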

[0027] Processing the pulse information after smearing it into a mountain shape has the following benefit when the synthesis side forms the sum of the information accumulated so far and the information of the current unit waveform. When the pulses obtained from the preceding unit waveforms are superimposed on the residual waveform of the current unit waveform with the leading pulses aligned, the pulse trains of the unit waveforms may be slightly offset from one another overall; the smearing absorbs this offset.

[0028] Furthermore, in the process of performing speech synthesis based on the information indicating the period and intensity of the unit waveform and the error component information, the envelope of the synthesized waveform is smoothed. To smooth the envelope, a processing section is set on the synthesized waveform; for each frame in the section, the average of the maximum peak values of the unit waveforms in that frame is computed; and the peak value of every unit waveform throughout the section is set to a predetermined value derived from these averages.
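The envelope smoothing can be sketched as below. The passage leaves open how the "predetermined value" is derived from the per-frame averages; this sketch assumes it is simply the mean of those averages, and it measures each unit waveform's peak as its largest absolute sample. Both choices, and the name `smooth_envelope`, are assumptions.

```python
def smooth_envelope(frames):
    """frames: list of frames, each a list of unit waveforms (sample lists).

    Rescales every unit waveform in the section so that its peak equals a
    common target derived from the per-frame average peaks.
    """
    def peak(unit):
        return max(abs(s) for s in unit)

    # Average of the unit-waveform peaks within each frame.
    frame_means = [sum(peak(u) for u in frame) / len(frame) for frame in frames]
    # Assumed rule: the target is the mean of the per-frame averages.
    target = sum(frame_means) / len(frame_means)

    smoothed = []
    for frame in frames:
        new_frame = []
        for unit in frame:
            p = peak(unit)
            new_frame.append([s * target / p for s in unit] if p else list(unit))
        smoothed.append(new_frame)
    return smoothed


section = [
    [[0.0, 1.0, -0.5], [0.0, 3.0, -1.0]],   # frame 1: unit peaks 1.0 and 3.0
    [[0.0, 2.0, 0.0], [0.0, -2.0, 1.0]],    # frame 2: unit peaks 2.0 and 2.0
]
flat = smooth_envelope(section)
```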

[0029] As a result, the envelope connecting the peak values of the frames in the processing section of the synthesized waveform becomes a flat curve free of audibly intrusive fluctuation. This prevents the phenomenon in which the synthesized version of speech with a long sustained vowel, such as "ooi", sounds rough and gurgling, as if something were caught in the throat.

[0030]

EMBODIMENTS OF THE INVENTION

Embodiments of the present invention are described below. As noted above, the residual waveform of a female voice, shown in FIG. 9, has the characteristic that, compared with a male voice, the pulses at p1, p2, ... of FIG. 8 are not distinct and cannot be clearly separated from the error component N. For a female voice, therefore, the error component N of Y3 cannot be ignored.

[0031] That is, in a female voice the error component Y3 also contains important elements that characterize the voice, so synthesis that ignores Y3 yields a sound greatly degraded from the original. Sending the data including the error component to the synthesis side as is would avoid this, but is problematic in terms of data compression.

[0032] The present invention therefore efficiently extracts, from the residual waveform of a female speech signal, the leading pulse of each unit waveform (the pulse corresponding to p1, p2, ... in FIG. 8), extracts those error components that matter for speech synthesis, sends both to the synthesis side, and has the synthesis side regenerate a pulse train suited to synthesis. An embodiment of the invention is described below.

[0033] FIG. 1 shows an example of the characteristics of the female speech waveform used in this embodiment. FIG. 1(a) is the original waveform of one frame (about several tens of msec) of a female voice saying something like "aah". As in FIG. 8, the frame is assumed to consist of about four unit waveforms, but to simplify the drawing only unit waveforms 10 and 20 are shown. Unit waveform 10 is the first unit waveform after the change from an unvoiced state to a voiced sound. It is followed by unit waveform 20 and then by several more unit waveforms not shown here; four unit waveforms, for example, make up one frame, and several further frames follow to form a given speech waveform.

[0034] FIG. 1(b) shows the corresponding residual waveform. As it shows, a characteristic of the female voice is that there is almost no difference between the leading pulses, that is, the pulses corresponding to p1, p2, ... of the male voice described with FIG. 8 (the Y21 and Y22 components), and the other error components (the Y3 component labeled N in FIG. 8). FIGS. 1(c) to 1(f) are described later.

[0035] The process of extracting from the error component of such a residual waveform, as pulses, the pulse marking the start of each unit waveform (the leading pulse) and, in addition, the minimum error components needed for speech synthesis is described below with reference to the flowchart of FIG. 2 and to FIG. 1.

[0036] First, the original waveform shown in FIG. 1(a) is examined along the time axis for amplitude peaks, and the value and time of each peak are obtained (step s1). In the example of FIG. 1(a), examining the waveform along the time axis yields p11, p12, p13, ... as the amplitude peaks of the individual waves. The amplitude and time of each peak are thus obtained: amplitude h1 at time t1 for point p11, amplitude h2 at time t2 for point p12, amplitude h3 at time t3 for point p13, and so on. FIG. 1(c) plots the peak information obtained in this way on time and amplitude coordinates. As FIG. 1(c) shows, the peak values of the original waveform oscillate with a constant period. One repetition of this oscillation is called a unit oscillation here (in effect, one unit oscillation corresponds to one unit waveform), and the peak with the largest amplitude among the peaks of each unit oscillation is selected (step s2). In this case points p11 and p14 are the largest peaks of their respective unit oscillations, so p11 is the maximum peak of unit waveform 10 and p14 is the maximum peak of unit waveform 20.
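Steps s1 and s2 can be sketched as follows. The paragraph does not spell out how the peak sequence is divided into unit oscillations, so this sketch assumes that a peak is the maximum of its unit oscillation when no larger peak lies within `min_gap` samples of it; `min_gap` and the function names `find_peaks` and `unit_maxima` are hypothetical, and only positive peaks are considered.

```python
def find_peaks(wave):
    """Step s1: positive local maxima as (time_index, amplitude) pairs."""
    return [
        (t, wave[t])
        for t in range(1, len(wave) - 1)
        if wave[t - 1] < wave[t] >= wave[t + 1] and wave[t] > 0
    ]


def unit_maxima(peaks, min_gap):
    """Step s2: keep each peak that is not exceeded by any other peak
    closer than min_gap samples (assumed segmentation rule).
    """
    result = []
    for t, h in peaks:
        if all(h >= h2 for t2, h2 in peaks if abs(t2 - t) < min_gap):
            result.append((t, h))
    return result


# Two unit oscillations, each with a large peak followed by decaying ones.
wave = [0, 5, 0, 2, 0, 1, -1, -2, 0, 4, 0, 2, 0, 1, 0]
peaks = find_peaks(wave)
maxima = unit_maxima(peaks, min_gap=6)
```

Here `maxima` plays the role of the points p11 and p14 in FIG. 1(a).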

[0037] Next, from the time of the largest peak selected in each unit oscillation, the original waveform is traced back along the time axis toward the origin o to find the point where the waveform crosses the time axis (the zero-amplitude point) (step s3). In FIG. 1(a), tracing back toward the origin o from time t1 of point p11 gives the zero-amplitude crossing tx1; likewise, tracing back from time t4 of point p14 gives the zero-amplitude crossing tx2; and so on for each unit oscillation.
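Step s3 amounts to a backward search from each maximum peak. The sketch below assumes a sampled waveform and a positive peak, and returns the first sample index (walking back toward the origin) at which the amplitude is zero or negative; the name `zero_crossing_before` is hypothetical, and no sub-sample interpolation is attempted.

```python
def zero_crossing_before(wave, t_peak):
    """Walk back from a positive peak toward the time origin and return
    the first index where the waveform touches or crosses zero.

    Falls back to index 0 if no crossing is found (assumption).
    """
    for t in range(t_peak, -1, -1):
        if wave[t] <= 0:
            return t
    return 0


wave = [0, 5, 0, 2, 0, 1, -1, -2, 0, 4, 0, 2, 0, 1, 0]
tx1 = zero_crossing_before(wave, 1)  # maximum peak of the first unit oscillation
tx2 = zero_crossing_before(wave, 9)  # maximum peak of the second unit oscillation
```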

[0038] Once the crossing points tx1, tx2, ... have been found, a window of Δt on either side of each of these points is set on the time axis of the residual waveform shown in FIG. 1(b), the window is scanned, and the error component with the maximum amplitude within it is detected (step s4). The time Δt is set in advance to an empirically suitable value, for example a given fraction of the duration of the unit oscillation (unit waveform).

【００３９】 As noted earlier, the Y2 and Y3 components are not clearly distinguishable in a female voice, but some Y2 component is still present. Therefore, the part with the largest amplitude within the Δt window is detected and taken as the head position information of each unit waveform, giving the head pulses Ps1, Ps2, ... shown in FIG. 1(d) (step s5).

【００４０】 In some unit vibrations, the head position cannot be extracted even after scanning the Δt window. In that case it is determined from the temporal positions of the head pulses detected in the preceding and following unit vibrations. Because head pulses normally occur at nearly equal intervals, the position of the unclear head pulse is determined from the interval between the detected head pulses. The amplitude of that head pulse is set to the amplitude of the head pulse of the preceding unit vibration.
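
One way to fill in a missing head pulse is sketched below. It assumes the head pulses are held as a list of `(position, amplitude)` pairs with `None` where detection failed, and that the spacing is estimated as the average interval between the first and last detected pulses; the patent only says the interval of the detected head pulses is used, so the averaging rule and the names are assumptions.

```python
def fill_missing_head_pulses(pulses):
    """pulses: list of (position, amplitude) tuples, or None where the scan
    found no head pulse. At least two pulses must be known."""
    known = [(k, p[0]) for k, p in enumerate(pulses) if p is not None]
    (k0, pos0), (k1, pos1) = known[0], known[-1]
    period = (pos1 - pos0) / (k1 - k0)  # average spacing per unit vibration
    filled = []
    for k, p in enumerate(pulses):
        if p is None:
            position = round(pos0 + (k - k0) * period)
            # Amplitude is copied from the preceding head pulse, as in the text;
            # 0.0 is an assumed fallback when there is no preceding pulse.
            amplitude = filled[-1][1] if filled else 0.0
            p = (position, amplitude)
        filled.append(p)
    return filled
```

For example, `fill_missing_head_pulses([(10, 1.0), None, (50, 0.8)])` places the missing pulse at position 30 with amplitude 1.0.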

【００４１】 In this way, the head position of each unit waveform can be identified from the original waveform and its residual waveform. As an alternative to extracting the head position information from the residual waveform, the point in each unit waveform where the waveform crosses the time axis (zero amplitude) can be taken directly as the head of the unit waveform and used as the head position information of the original waveform.

【００４２】 The head position information obtained for each unit waveform is passed to the synthesis side. Together with it, the minimum error components needed for synthesis (the representative error components) are extracted from the error component N of the residual waveform shown in FIG. 1(b).

【００４３】 The basic processing is described first. It is performed for each unit waveform (unit waveforms 10 and 20 in FIG. 1), and it differs depending on whether the unit waveform being processed is the first one or the second or a later one.

【００４４】 When the unit waveform being processed is judged to be the first unit waveform (step s6), then in addition to the head position information, the n pulses with the largest absolute amplitude are selected from the error components (corresponding to the Y3 component) and sent to the synthesis side as pulse position and pulse magnitude data (step s7). FIG. 1(e) shows an example of the head pulse and the n newly selected pulses for unit waveform 10. In FIG. 1(e), a pulse in the negative direction corresponds to a large negative error component.
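
The selection in step s7 amounts to a top-n pick by absolute value. A minimal sketch, assuming the residual of one unit waveform is a list of samples and that pulses are reported as `(position, magnitude)` pairs in time order (the data layout and function name are assumptions):

```python
def select_largest_pulses(residual, n):
    """Return the n samples with the largest absolute amplitude as
    (position, magnitude) pairs, sorted by position. Signs are preserved."""
    order = sorted(range(len(residual)), key=lambda i: abs(residual[i]), reverse=True)
    return sorted((i, residual[i]) for i in order[:n])


# Hypothetical residual of the first unit waveform.
pulses = select_largest_pulses([0.1, -0.9, 0.3, 0.0, 0.7, -0.2], 3)
```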

【００４５】 On the synthesis side, the n pulses for synthesizing the first unit waveform are placed on the time axis using the head position data obtained in step s5 and the pulse position and magnitude data obtained in step s7 (step s8). The waveform is then synthesized from this pulse train and the LPC coefficients described earlier (step s14).

【００４６】 If the unit waveform being processed is not the first one, processing proceeds to step s9. For unit waveform 20, for example, the head pulse and the n largest error components are not selected and sent to the synthesis side as was done for the first unit waveform. Instead, the information obtained from the preceding unit waveforms and already sent to the synthesis side (the information accumulated so far) is used, and the synthesis side obtains its pulse train by adding the information for the current unit waveform to it. This is described below.

【００４７】 The pulse trains of adjacent unit waveforms are strongly correlated. If the n pulses obtained from unit waveform 10 in step s7 (shown as X1 in FIG. 1(e)) are overlaid on the residual pulse train of unit waveform 20 (FIG. 1(b)) with the head pulse positions aligned, they should match closely. In many cases, however, the pulse trains of the two unit waveforms are slightly offset from each other as a whole. To absorb this offset, every pulse of the pulse train obtained from the preceding unit waveform (here unit waveform 10) is spread into a mountain shape (step s9), and the spread pulses are used to extract the pulses of unit waveform 20.

【００４８】 Spreading a pulse into a mountain shape means turning a pulse like that in FIG. 3(a) into the shape in FIG. 3(b). For example, if the amplitude h of the pulse in FIG. 3(a) is reduced to 1/2 as in FIG. 3(b), the remaining 1/2 of the amplitude is distributed to the two sides. This is done for every pulse obtained from unit waveform 10.
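
The spreading operation can be sketched as below. The text fixes only the example of keeping h/2 at the center and distributing the rest to both sides; splitting the remainder equally onto the two immediately adjacent samples, and the dense-list representation of the pulse train, are assumptions made for this illustration.

```python
def spread_pulses(pulse_train, center_fraction=0.5):
    """Spread every pulse into a 'mountain': keep center_fraction of its
    amplitude in place and put half of the remainder on each neighbor."""
    out = [0.0] * len(pulse_train)
    side = (1.0 - center_fraction) / 2.0
    for i, h in enumerate(pulse_train):
        if h == 0:
            continue
        out[i] += h * center_fraction
        if i > 0:
            out[i - 1] += h * side
        if i + 1 < len(pulse_train):
            out[i + 1] += h * side
    return out


spread = spread_pulses([0.0, 0.0, 4.0, 0.0, 0.0])  # [0.0, 1.0, 2.0, 1.0, 0.0]
```

Note that a pulse at the very first or last sample loses the share that would fall outside the unit waveform in this sketch.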

【００４９】 The pulse train of unit waveform 10, the same one that was sent to the synthesis side, is spread into a mountain shape and subtracted from the actual residual waveform of unit waveform 20 (whose residual pulse train is shown as X2 in FIG. 1(e)) to obtain a difference pulse train (step s10). From this difference pulse train, the n error components with the largest absolute amplitude are extracted in descending order, and their position and magnitude data are sent to the synthesis side (step s11). The n pulses extracted from unit waveform 20 are shown as X3 in FIG. 1(f).
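
Steps s10 and s11 on the analysis side can be sketched as follows, assuming the residual of the current unit waveform and the already-spread previous pulse train are equal-length lists aligned at the head pulse (the alignment, names, and data layout are assumptions):

```python
def extract_difference_pulses(residual, spread_previous, n):
    """Subtract the spread previous pulse train from the current residual
    (step s10) and return the n largest differences by absolute value as
    (position, magnitude) pairs in time order (step s11)."""
    diff = [r - s for r, s in zip(residual, spread_previous)]
    order = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    return sorted((i, diff[i]) for i in order[:n])


# Hypothetical values: the previous pulse train already explains most of the residual.
x3 = extract_difference_pulses(
    residual=[0.0, 1.0, 2.75, 0.5, -1.0],
    spread_previous=[0.0, 1.0, 2.0, 1.0, 0.0],
    n=2,
)
```

Only the corrections relative to the previous unit waveform are transmitted, which is what keeps the per-waveform data volume at n pulses.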

【００５０】 On the synthesis side, each pulse of the pulse train used to synthesize the preceding unit waveform (here unit waveform 10) is spread into a mountain shape (step s12), and the n pulses obtained in step s11, with their positions and magnitudes (X3 in FIG. 1(f)), are added to this spread pulse train (step s13). The waveform is then synthesized from the resulting pulse train and the LPC coefficients (step s14).
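
Steps s13 and s14 on the synthesis side can be sketched as below. The patent does not spell out the synthesis filter; the code assumes the conventional all-pole LPC form y[t] = e[t] + Σ a[k]·y[t-1-k], and the sign convention of the coefficients, the function names, and the data layout are all assumptions.

```python
def update_excitation(spread_previous, new_pulses):
    """Step s13: add the received (position, magnitude) pulses to the
    spread pulse train kept from the previous unit waveform."""
    excitation = list(spread_previous)
    for position, magnitude in new_pulses:
        excitation[position] += magnitude
    return excitation


def lpc_synthesize(excitation, lpc):
    """Step s14 (assumed form): all-pole filter driven by the pulse train,
    y[t] = e[t] + sum_k lpc[k] * y[t - 1 - k]."""
    out = []
    for t, e in enumerate(excitation):
        y = e
        for k, a in enumerate(lpc):
            if t - 1 - k >= 0:
                y += a * out[t - 1 - k]
        out.append(y)
    return out


excitation = update_excitation([0.0, 1.0, 2.0, 1.0, 0.0], [(2, 0.75), (4, -1.0)])
waveform = lpc_synthesize(excitation, lpc=[0.5])
```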

【００５１】 For the second and subsequent unit waveforms, the pulse information up to the previous unit waveform is thus accumulated, and speech is synthesized from the accumulated information. For example, when synthesizing a sustained "aah" (あー) sound, the amount of information is small at first and the synthesized waveform may not resemble the original closely, but as information accumulates over time the synthesized waveform comes to approximate the original very closely, and the synthesized sound as a whole is very good.

【００５２】 Sending all error components of the residual waveform to the synthesis side would give high-accuracy synthesis, but for data compression the data volume must be reduced as far as possible. One way to do this is to select, in each unit waveform, the head pulse and the n largest error components, send them to the synthesis side, and synthesize each unit waveform only from the pulse train sent for it. However, because the pulses to be extracted are correlated to some degree between unit waveforms, accumulating information by adding the current pulse information to the previous pulse information, as described above, is more advantageous for speech synthesis than extracting pulse information for each unit waveform independently.

【００５３】 The above is the basic processing of the present invention for extracting, from the Y3 component of the residual waveform, the components that must be passed to the synthesis side. The following additional processing yields an even better synthesized sound.

【００５４】 As described above, for the first unit waveform 10 the n pulses with the largest amplitude are simply selected, in addition to the head pulse, and sent to the synthesis side. Simply selecting n pulses, however, can cause problems.

【００５５】 High-accuracy synthesis strictly requires sending all error components of the residual waveform, and simply selecting the n largest ones does not always give an accurate synthesized sound. When a unit waveform of the original is as shown in FIG. 4(a), for example, the waveform synthesized from the pulse information obtained by the above processing may, as in FIG. 4(b), have amplitude of the opposite sign to the original in some parts. To deal with this, the present invention selects the n large-amplitude pulses as follows.

【００５６】 When extracting the n large-amplitude pulses from the error components of the residual waveform, the direction of the original waveform's amplitude is examined. Where the amplitude is moving from positive toward negative, a negative pulse is selected; where it is moving from negative toward positive, a positive pulse is selected.

【００５７】 Specifically, suppose the residual waveform for a unit waveform of the original shown in FIG. 5(a) is as in FIG. 5(b). If a large pulse to be extracted lies in the residual waveform section corresponding to point A, where the amplitude of the original waveform is about to change from positive to negative, that pulse is extracted as a negative pulse, as shown in FIG. 5(c). (This section is marked w in the figure; it is a range of predetermined size centered on point A.) Likewise, if a large pulse to be extracted lies in the section w corresponding to point B, where the amplitude of the original waveform is about to change from negative to positive, it is extracted as a positive pulse, as shown in FIG. 5(c). Only near the head pulse of the unit waveform (around point C in FIG. 5(a)) are pulses chosen by absolute value regardless of sign.
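
The sign-aware selection can be sketched as follows. The sketch reads "extracted as a negative pulse" as "only negative pulses are eligible inside the window w around A" (and correspondingly for B); this reading, the window representation, and the treatment of samples outside every window (no sign constraint) are assumptions, since the text does not spell them out.

```python
def select_sign_aware(residual, n, windows, head_region):
    """windows: list of (start, end, sign); sign is -1 for a window around a
    positive-to-negative transition of the original waveform (point A) and
    +1 for a negative-to-positive one (point B). Ranges are half-open.
    head_region: (start, end) near the head pulse, where sign is ignored."""
    def allowed(i):
        if head_region[0] <= i < head_region[1]:
            return True
        for start, end, sign in windows:
            if start <= i < end:
                return residual[i] * sign > 0
        return True  # outside every window: no constraint (assumption)

    candidates = [i for i in range(len(residual)) if residual[i] != 0 and allowed(i)]
    chosen = sorted(candidates, key=lambda i: abs(residual[i]), reverse=True)[:n]
    return sorted((i, residual[i]) for i in chosen)


# Hypothetical residual: the +0.9 near A and the -0.8 near B are rejected.
picked = select_sign_aware(
    residual=[1.2, 0.0, 0.9, -0.4, 0.0, -0.8, 0.6, 0.1],
    n=3,
    windows=[(2, 4, -1), (5, 7, +1)],
    head_region=(0, 1),
)
```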

【００５８】 This procedure, selecting the n large pulses from the residual waveform while checking the direction of the original waveform's amplitude (positive to negative, or negative to positive) and taking pulses of the corresponding sign, is applied only to the unit waveforms near the start, where an unvoiced state changes to voiced sound. For the second and subsequent unit waveforms, better results are obtained by processing with the information accumulated from the preceding unit waveforms, as described above.

【００５９】 The procedure is limited to the unit waveforms near the start, that is, the first few unit waveforms after the change from unvoiced to voiced sound, because unlike later unit waveforms they have no accumulated information, and the aim is to obtain a synthesized waveform close to the original from a small amount of information. The procedure may also be applied to the few frames immediately following a frame in which the linear prediction coefficients (LPC coefficient sequence), compared frame by frame, change greatly.

【００６０】 With the above processing, the sign of the amplitude of the waveform synthesized from the pulse information extracted from the residual of the first unit waveform no longer differs greatly from that of the original, and a synthesized waveform relatively close to the original is obtained even around the first unit waveform. The following processing can be added to improve the synthesized sound further.

【００６１】 For the first few unit waveforms after a change from unvoiced to voiced sound (unit waveform 10 in FIG. 1(a)), or for unit waveforms near a frame in which the LPC coefficients change greatly, the number of waveform peaks is counted and compared with a preset threshold. Different processing is applied depending on whether the count is at or above the threshold or below it, that is, on whether the unit waveform is a complex waveform with many peaks (a large amount of information) or a simple waveform with few peaks (a small amount of information). This is explained with reference to FIG. 6.

【００６２】 Suppose the threshold is 4. If the first unit waveform has four peaks, p11 to p14, as in FIG. 6(a), the amount of information is judged to be relatively large, and it is reduced as follows. Besides the error component section s0 used to obtain the head pulse, only the error components in the sections s1, s2, s3, s4 and s5 shown in FIG. 6(b) are selected as the minimum that must be sent to the synthesis side. From s0, the pulses with large absolute value are sent; from the pulse trains in s1 to s5, the pulses selected with the sign taken into account as described above are sent. Error components in all other sections are ignored.

【００６３】 If instead the first unit waveform has only two peaks, p11 and p12, as in FIG. 6(c), which is below the threshold, the amount of information is judged to be small to begin with, so pulses may be selected from a wider range than above. As shown in FIG. 6(d), candidate pulses are taken from s0 without regard to sign, from the pulse trains in s1, s2 and s3 with the sign taken into account as described above, and from the pulse trains in s11, s12, s13 and s14 without regard to sign; the n pulses with the largest absolute value among these candidates are sent to the synthesis side. In other words, for a unit waveform with little information, information from more sections of its residual pulse train is sent to the synthesis side.
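
The threshold decision can be sketched as below. Counting a peak as a positive local maximum, and representing the sections by labels, are assumptions made for the illustration; the section names follow FIG. 6 only loosely.

```python
def count_peaks(samples):
    """Count positive local maxima (assumed definition of a waveform peak)."""
    return sum(
        1
        for i in range(1, len(samples) - 1)
        if samples[i - 1] < samples[i] > samples[i + 1] and samples[i] > 0
    )


def sections_to_search(samples, threshold, core_sections, extra_sections):
    """At or above the threshold, search only the core sections; below it,
    also search the extra sections so that more of the residual is covered."""
    if count_peaks(samples) >= threshold:
        return list(core_sections)
    return list(core_sections) + list(extra_sections)


complex_unit = [0, 1, 0, 2, 0, 1, 0, 3, 0]  # four peaks
simple_unit = [0, 1, 0, 0, 2, 0]            # two peaks
wide = sections_to_search(simple_unit, 4, ["s0", "s1"], ["s11", "s12"])
```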

【００６４】 A speech waveform that varies in many ways within a unit waveform requires a large amount of information for synthesis, and simply selecting the n largest pulses does not efficiently pick out the information useful for reproducing the waveform. Efficiently selecting only the minimum necessary information from the residual waveform, as described above, raises the compression ratio while still giving a good synthesized sound. Conversely, a speech waveform with little variation within a unit waveform is judged to require little information, and selecting from many sections of the error components of the residual waveform gives an accurate, good synthesized sound.

【００６５】 With the above processing, the synthesis side obtains a synthesized waveform even closer to the original, and a very good synthesized sound results.

【００６６】 The synthesized sound obtained by the processing described so far can, in passages where a vowel is sustained, such as "ooi" (おーい), take on a rattling quality as if something were caught in the throat. This occurs especially often with female voices.

【００６７】 This is thought to happen because, while the envelope of the original waveform shows little variation in maximum amplitude, in the waveform synthesized by the above processing the maximum amplitude of each unit waveform may vary more than in the original, or the maximum amplitude of the unit waveforms may vary greatly across several frames.

【００６８】 FIG. 7(a) shows, for several frames of the original waveform of a certain utterance, the envelope 21 connecting the positive maximum amplitudes of the unit waveforms in each frame and the envelope 22 connecting the negative maximum amplitudes. As the figure shows, these envelopes are almost straight, with hardly any unevenness. In the synthesized waveform obtained by the above processing, by contrast, the envelope 23 connecting the positive maximum amplitudes of the unit waveforms and the envelope 24 connecting the negative maximum amplitudes can be strongly uneven curves, as shown in FIG. 7(b). This is thought to cause the rattling quality in sustained vowels. It can be dealt with by the following processing.

【００６９】 First, when a continuous voiced portion lasts for a predetermined time or longer, a processing target section D is set that excludes the rising and falling parts of the voiced portion of the synthesized waveform, as shown in FIG. 7(b). Within section D, the average of the maximum amplitudes of the unit waveforms is computed for each frame (a few tens of milliseconds). Because the positive envelope 23 and the negative envelope 24 of the synthesized waveform are not symmetric, the positive and negative sides are processed separately; the method is the same for both, so only the positive side is described here.

【００７０】 To compute the average of the maximum amplitudes of the unit waveforms in one frame: if a frame contains four unit waveforms, for example, the maximum amplitudes of those four are averaged. The per-frame averages are then examined over several consecutive frames, and if they differ from frame to frame, the following processing is applied to bring them to a constant value.

【００７１】 That is, the per-frame averages of the maximum amplitude obtained above are themselves averaged over several frames; calling this average Ha, the positive-valued part of each unit waveform in each frame is multiplied by Ha/H, where H is the maximum amplitude of that unit waveform.
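
The positive-side scaling can be sketched as follows, assuming the processing target section is given as a list of frames, each frame a list of unit waveforms, each unit waveform a list of samples (this nesting and the function name are assumptions). The negative side would be handled the same way with the minimum values.

```python
def flatten_positive_envelope(frames):
    """Scale the positive part of every unit waveform by Ha / H, where H is
    that unit waveform's maximum and Ha is the mean over frames of the
    per-frame mean maximum."""
    frame_means = [sum(max(unit) for unit in frame) / len(frame) for frame in frames]
    ha = sum(frame_means) / len(frame_means)
    result = []
    for frame in frames:
        new_frame = []
        for unit in frame:
            h = max(unit)
            gain = ha / h if h > 0 else 1.0  # leave units without a positive peak unchanged
            new_frame.append([s * gain if s > 0 else s for s in unit])
        result.append(new_frame)
    return result


# Hypothetical section of two frames with two unit waveforms each; Ha is 3.0.
section = [[[0.0, 2.0, -1.0], [0.0, 4.0, -1.0]], [[0.0, 3.0, -1.0], [0.0, 3.0, -2.0]]]
flattened = flatten_positive_envelope(section)
```

After the call every unit waveform in the section has the same positive peak, while the negative samples are untouched.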

【００７２】 The envelope connecting the peak values in the processing target section then becomes flat like that of the original waveform, which prevents the rattling quality in the synthesized sound for speech with a sustained vowel such as "ooi" (おーい). The same processing is applied to the negative side. If the envelope of the original waveform had unevenness of its own, this processing may remove it, but this causes no audible problem.

【００７３】 The method of flattening the envelope connecting the peak values in the processing target section is not limited to the one described above; other methods can also be used. The envelope smoothing described above is more effective when applied to runs of a predetermined number or more of consecutive voiced frames (for example, 20 frames or more).

【００７４】 The program for implementing the present invention described above can be stored on a storage medium such as a floppy disk, and the present invention includes such a storage medium.

【００７５】[0075]

[Effects of the Invention] As described above, according to the present invention, even for speech such as a female voice, in which the component Y2 indicating the period of the unit waveform in the residual waveform cannot be distinguished from the error component Y3, the information indicating period and intensity and the error components needed for synthesis can be extracted accurately. Data can therefore be compressed at a high ratio while a good synthesized sound is obtained.

【００７６】 In at least the first unit waveform of a frame in which an unvoiced state changes to voiced sound, the number of waveform peaks is counted and compared with a preset number. If the count is at or above the preset number, only the information indicating the period and the minimum information needed for synthesis, efficiently selected from the error components, are sent to the synthesis side. If the count is below the preset number, the information indicating the period is sent together with error component information obtained not only from the minimum error components needed for synthesis but also from the other error components. Thus, when a large amount of information is needed for synthesis, efficiently selecting only the minimum necessary information from the residual waveform raises the compression ratio for a comparable quality of synthesized speech.

【００７７】 In the second and subsequent unit waveforms of a frame in which an unvoiced state changes to voiced sound, speech is synthesized from the information indicating the period and from the error component information (pulse train) accumulated by processing all preceding unit waveforms, with its pulses spread into a mountain shape. When synthesizing a sustained "aah" (あー) sound, for example, the amount of information is small at first and the synthesized waveform may not resemble the original closely, but as information accumulates over time the synthesized waveform comes to approximate the original very closely, and a better synthesized sound is obtained overall.

【００７８】 When a preset number of large-amplitude signals are selected from the residual waveform corresponding to at least the first unit waveform after a change from unvoiced to voiced sound, or to a unit waveform near a frame in which the LPC coefficients change greatly, the error component information for a signal is made negative if the amplitude of the original waveform at the corresponding position is heading from positive to negative, and positive if it is heading from negative to positive. The sign of the extracted pulses therefore no longer differs greatly from that of the original waveform, and a good synthesized sound is obtained.

[0079] Further, according to the present invention, in the process of performing speech synthesis based on the information indicating the period of the unit waveform and the error component information, the peak amplitude of each frame of the synthesized waveform is set to a certain constant value. The envelope connecting the per-frame peak values over the processing target section of the synthesized waveform thereby becomes as flat as that of the original waveform. This prevents the phenomenon in which the synthesized sound for speech with a long sustained vowel, such as "Oi", becomes a gurgling voice as if something were caught in the throat, and an extremely good synthesized sound can be obtained.
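As a rough illustration of this peak equalization, the sketch below scales every frame so that its peak matches a single target value. The choice of target (the mean of the per-frame peaks) and the name `flatten_envelope` are assumptions made for the example; the paragraph above only states that the peaks are set to a certain constant value. Frames are assumed to be non-empty lists of samples.

```python
def flatten_envelope(frames):
    """Scale each frame so that all frame peaks equal one target value.

    Assumption: the target is the mean of the per-frame peak magnitudes.
    """
    peaks = [max(abs(sample) for sample in frame) for frame in frames]
    target = sum(peaks) / len(peaks)
    flattened = []
    for frame, peak in zip(frames, peaks):
        # A silent frame (peak 0) is left unchanged to avoid dividing by zero.
        gain = target / peak if peak > 0 else 1.0
        flattened.append([sample * gain for sample in frame])
    return flattened, target
```

After this step the line connecting the frame peaks is flat by construction, which is the property the paragraph relies on to suppress the gurgling quality in long vowels.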

[Brief description of the drawings]

FIG. 1 is a diagram explaining the processing of the embodiment of the present invention for extracting the information necessary for speech synthesis from the speech waveform of a certain woman.

FIG. 2 is a flowchart explaining the basic processing of the embodiment of the present invention.

FIG. 3 is a diagram explaining the process of breaking the extracted information (pulses) necessary for speech synthesis into a mountain shape in the embodiment of the present invention.

FIG. 4 is a diagram explaining the problem caused by a difference in amplitude sign between the original waveform to be processed and the synthesized speech waveform in the embodiment of the present invention.

FIG. 5 is a diagram explaining the process that eliminates the problem caused by the difference in amplitude sign between the original waveform to be processed and the synthesized speech waveform.

FIG. 6 is a diagram explaining the process of extracting the information necessary for speech synthesis according to the number of peaks within a unit waveform in the embodiment of the present invention.

FIG. 7 is a diagram explaining the processing performed when the envelope of the synthesized speech waveform becomes uneven compared with the envelope of the original waveform to be processed in the embodiment of the present invention.

FIG. 8 is a diagram showing an example of the residual waveform for a male speech waveform and of the information necessary for speech synthesis obtained from that residual waveform.

FIG. 9 is a diagram showing an example of the residual waveform for a female speech waveform.

[Explanation of symbols]

10, 20, ...: unit waveforms
p11, p12, ...: peak values within each unit waveform
Δt: time for detecting the leading pulse
Ps1, Ps2, Ps3: leading pulses
A: point where the original waveform changes from positive to negative
B: point where the original waveform changes from negative to positive

Claims

[Claims]

1. A speech analysis/synthesis method in which an original waveform to be processed is divided into frames at predetermined time intervals; linear prediction coefficients are obtained based on the correlation between waveforms that repeat periodically at substantially constant time intervals within one frame (hereinafter referred to as unit waveforms); information indicating the period and intensity of the unit waveforms and other error component information are extracted from the residual waveform representing the error component of this linear prediction; and speech synthesis is performed based on this information, wherein: for the original waveform of one frame, expressed on plane coordinates with mutually orthogonal time and amplitude axes, the peak points of the waveform are found along the time axis and the amplitude value and time information of each peak point are obtained; from these peak points, the peak value having the maximum amplitude in each unit waveform is obtained; for the waveform having the maximum peak value in each unit waveform, the point at which its amplitude becomes 0 in the direction of the origin of the time axis is obtained; with this zero-amplitude point as a reference, information indicating the period of the unit waveform is extracted from the residual waveform in its vicinity and sent to the speech synthesis side; information necessary for synthesis is extracted as pulse information from the error component of the residual waveform corresponding to each unit waveform; and speech synthesis is performed based on the pulse information and the information indicating the period, together with the linear prediction coefficients.

2. The speech analysis/synthesis method according to claim 1, wherein, in the process of extracting information indicating the period of the unit waveform from the residual waveform in the vicinity of the zero-amplitude point taken as a reference, the zero-amplitude point is set as the information indicating the period of the unit waveform.

3. The speech analysis/synthesis method according to claim 1, wherein, in the process of extracting information indicating the period of the unit waveform from the residual waveform in the vicinity of the zero-amplitude point taken as a reference, information having a large absolute value is extracted, and the information indicating the period is obtained based on the extracted information.

4. The speech analysis/synthesis method according to any one of claims 1 to 3, wherein, in the process of extracting information necessary for synthesis as pulse information from the error component of the residual waveform corresponding to each unit waveform and performing speech synthesis based on the pulse information and the information indicating the period together with the linear prediction coefficients: the number of waveform peaks is counted in at least the leading unit waveform of a frame where the state changes from unvoiced to voiced and of a frame where the linear prediction coefficients change greatly; it is judged whether the number of peaks is larger than a preset number; when the number of peaks is equal to or larger than the preset number, only the minimum information necessary for synthesis is efficiently selected from the error component and sent to the synthesis side as pulse information; and when the number of peaks is smaller than the preset number, pulse information from other error components is obtained in addition to the minimum pulse information necessary for synthesis and sent to the synthesis side.

5. The speech analysis/synthesis method according to claim 4, wherein, in the process of selecting only the minimum information necessary for synthesis from the error component, a preset number of samples are selected, in descending order of the absolute value of amplitude, from the error component of the residual waveform corresponding to at least the leading unit waveform in a frame where the state changes from unvoiced to voiced, and are sent to the synthesis side as pulse information.

6. The speech analysis/synthesis method according to claim 4, wherein, in the process of extracting information necessary for synthesis as pulse information from the error component of the residual waveform corresponding to the leading unit waveform, when the amplitude of the original waveform corresponding to the residual waveform to be selected is heading from positive to negative, the error component information for that residual waveform is taken as negative pulse information, and when the amplitude of the original waveform corresponding to the residual waveform to be selected is heading from negative to positive, the error component information for that residual waveform is taken as positive pulse information.

7. The speech analysis/synthesis method according to claim 1, wherein, in the process of extracting information necessary for synthesis as pulse information from the error component of the residual waveform corresponding to each unit waveform and performing speech synthesis based on the pulse information and the information indicating the period together with the linear prediction coefficients: for the second and subsequent unit waveforms of the frame to be processed, the pulse information used to synthesize the immediately preceding unit waveform is subtracted from the actual residual waveform of the current unit waveform; from the residual waveform after this subtraction, a preset number of samples are selected in descending order of the absolute value of amplitude, and the selected pulse information is passed from the analysis side to the synthesis side; and on the synthesis side, the pulse information obtained by adding the pulse information passed from the analysis side to the pulse information used to synthesize the immediately preceding unit waveform is used to synthesize the current unit waveform.

8. The speech analysis/synthesis method according to claim 7, wherein, in the process of subtracting the pulse information used to synthesize the immediately preceding unit waveform from the actual residual waveform of the unit waveform, each pulse constituting the pulse information used to synthesize the immediately preceding unit waveform is broken into a mountain shape, and the pulse information thus broken into a mountain shape is subtracted from the actual residual waveform of the unit waveform.

9. The speech analysis/synthesis method according to claim 7, wherein, in the synthesis-side process of adding the pulse information passed from the analysis side to the pulse information used to synthesize the immediately preceding unit waveform, each pulse constituting the pulse information used to synthesize the immediately preceding unit waveform is broken into a mountain shape, and the pulse information passed from the analysis side is added to the pulse information thus broken into a mountain shape.

10. The speech analysis/synthesis method according to any one of the above claims up to claim 9, wherein the process of performing speech synthesis based on the information indicating the period of the unit waveform and the error component information includes a process of smoothing the envelope of the synthesized waveform.

11. The speech analysis/synthesis method according to claim 10, wherein the process of smoothing the envelope of the synthesized waveform sets a processing target section for the synthesized waveform, obtains, for each frame in the processing target section, the average of the maximum peak values of the unit waveforms in that frame, and sets the peak value of each unit waveform over the entire processing target section to a predetermined value obtained based on the obtained average peak values.