JPS5979299A

JPS5979299A - Coding of voice analysis

Info

Publication number: JPS5979299A
Application number: JP57191287A
Authority: JP
Inventors: 高井　紀代; 今井　良彦
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1982-10-29
Filing date: 1982-10-29
Publication date: 1984-05-08

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of industrial application

The present invention relates to a method of analyzing and encoding the speech data supplied to a speech synthesizer.

Configuration of the conventional example and its problems

Various methods have been proposed for speech analysis and synthesis. Among them, the analysis-synthesis method using linear prediction coefficients, or the equivalent partial autocorrelation coefficients, is superior in terms of information compression and the quality of the synthesized speech.

The speech is sampled (1) to obtain speech digital data (2). The speech digital data (2) is analyzed (3) for each analysis time window (hereinafter called a frame) using linear prediction, partial autocorrelation, or the like, yielding parameters (4) that represent the features of the speech. The parameters (4) are the fundamental frequency, the amplitude, the linear prediction coefficients or the equivalent partial autocorrelation coefficients, the voiced/unvoiced decision value, the total power (the RMS of the input speech digital data, hereinafter E), and the residual power (the RMS of the linear prediction error, hereinafter R). The obtained parameters (4) are encoded (5) to compress the data, and speech is synthesized (7) using the encoded parameters (6).

FIG. 2 shows in detail the speech synthesizer used for the synthesis (7) in FIG. 1. The speech parameters are stored in encoded form in a parameter ROM, and the start address of each word is stored in an address table. When the number of the speech to be uttered is input from the address designation terminal, the address in the parameter ROM at which the corresponding speech parameters are stored is obtained from the address table, and the parameters are read out of the parameter ROM. Because the parameters read out are bit-packed, they are separated according to the bit length of each parameter and stored temporarily in a parameter RAM as the code value of each parameter. Because the parameters are stored as codes, they are decoded using a decoding ROM. An interpolation circuit linearly interpolates between the decoded parameters and the parameter values of the previous frame, updates the parameters at intervals shorter than the actual analysis time window, and controls the following devices: the pitch counter, the voiced sound source circuit, the unvoiced sound source circuit, and the digital filter. The pitch counter is controlled by the fundamental frequency used when synthesizing voiced sound, and applies a pulse, which is the voiced sound source, once per period, that is, once every (sampling frequency / fundamental frequency) samples. The unvoiced sound source circuit generates the random noise that serves as the sound source when synthesizing unvoiced sound. The digital filter is controlled by the partial autocorrelation coefficients (hereinafter called K parameters) and models the shape of the vocal tract. Speech is synthesized using these, converted to an analog signal by a D/A converter, and the synthesized waveform is obtained at the output terminal.
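The signal path described above (pulse or noise source feeding a filter controlled by K parameters) can be sketched roughly as follows. This is a minimal illustration, not the circuit of FIG. 2: the function names, the uniform noise source, and the sign convention of the lattice recursion are assumptions, and interpolation and D/A conversion are left out.

```python
import random

def pulse_train(n_samples, fs, f0, amplitude=1.0):
    """Voiced source: one pulse every fs/f0 samples (one pitch period)."""
    period = max(1, round(fs / f0))
    return [amplitude if n % period == 0 else 0.0 for n in range(n_samples)]

def noise(n_samples, amplitude=1.0, seed=0):
    """Unvoiced source: random noise (uniform distribution is an assumption)."""
    rng = random.Random(seed)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def lattice_synthesize(excitation, k):
    """All-pole lattice filter driven by K parameters k[0..p-1].

    Sign conventions for K parameters differ between texts; with the one
    used here, a single coefficient k1 gives y[n] = e[n] - k1 * y[n-1].
    """
    p = len(k)
    b = [0.0] * (p + 1)  # b[i]: backward signal of stage i at the previous sample
    out = []
    for e in excitation:
        f = e
        for i in range(p, 0, -1):
            f = f - k[i - 1] * b[i - 1]       # forward signal entering stage i-1
            b[i] = b[i - 1] + k[i - 1] * f    # backward signal for the next sample
        b[0] = f
        out.append(f)
    return out

# One 10 ms frame at an assumed 10 kHz sampling rate, 125 Hz pitch.
frame = lattice_synthesize(pulse_train(100, 10000, 125), [0.5, -0.3])
```

In a fuller sketch the excitation would switch between `pulse_train` and `noise` according to the voiced/unvoiced decision of each frame.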

FIG. 3 shows the configuration of the decoding ROM of FIG. 2. The configuration differs depending on how bits are assigned to each parameter during encoding. FIG. 3 shows the case in which K parameters up to the 10th order are used, with 5 bits for the amplitude, 6 bits for the fundamental frequency, 7 bits for K1, 6 bits for K2, 5 bits for K3, 4 bits each for K4 to K6, and 3 bits each for K7 to K10.

FIG. 4 shows the parameter configuration for one frame under the above bit allocation. For the fundamental frequency and the K parameters, when the values of a frame are the same as, or close to, those of the preceding frame, the values of that frame are not stored; instead a repeat bit is stored as 1, meaning that the values of the preceding frame are used as they are. This reduces the number of bits needed for the parameters.
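The repeat-bit mechanism can be illustrated with a small frame reader. This is a sketch under stated assumptions: the field order, the position of the repeat bit, the field names, and the use of a '0'/'1' string as the bit stream are all hypothetical, and any other per-frame flags are left out. Only the bit widths come from the FIG. 3 example.

```python
# Bit widths follow the FIG. 3 example; names and field order are assumptions.
AMPLITUDE_BITS = 5
REPEATABLE_FIELDS = [("pitch", 6), ("k1", 7), ("k2", 6), ("k3", 5),
                     ("k4", 4), ("k5", 4), ("k6", 4),
                     ("k7", 3), ("k8", 3), ("k9", 3), ("k10", 3)]

def read_bits(bits, pos, width):
    """Read `width` bits from a '0'/'1' string starting at `pos`."""
    return int(bits[pos:pos + width], 2), pos + width

def encode_frame(frame, repeat):
    """Pack one frame; pitch and K codes are omitted when `repeat` is set."""
    bits = ("1" if repeat else "0") + format(frame["amplitude"], "05b")
    if not repeat:
        for name, width in REPEATABLE_FIELDS:
            bits += format(frame[name], "0{}b".format(width))
    return bits

def decode_frame(bits, pos, previous):
    """Unpack one frame, reusing `previous` values when the repeat bit is 1."""
    repeat, pos = read_bits(bits, pos, 1)
    frame = {}
    frame["amplitude"], pos = read_bits(bits, pos, AMPLITUDE_BITS)
    for name, width in REPEATABLE_FIELDS:
        if repeat:
            frame[name] = previous[name]
        else:
            frame[name], pos = read_bits(bits, pos, width)
    return frame, pos
```

With this layout a repeated frame costs 6 bits instead of 54, which is the saving the repeat bit is meant to provide.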

Conventionally, the fundamental frequency, the amplitude, and each order of the K parameters are encoded and decoded separately with different bit allocations, but the spectral sensitivity of the K parameters is not taken into account, and only a single encoding and decoding table is used for any one parameter.

In general, the residual power R is expressed as the sum of the squared differences between the speech data predicted using the extracted parameters and the original waveform, and it is well known that it indicates the precision of the extracted parameters. That is, when the residual power R is large, the precision of the parameters of that frame is low and the distortion is large. Conversely, when the residual power R is small, the parameters have been extracted with good precision.

However, because the residual power depends on the amplitude of the input data, it must be normalized by the total power E of the input data. That is, when R/E is large, the parameters of that frame have low precision, and when R/E is small, the parameters of that frame have high precision. High-precision parameters need to be encoded finely with many bits, whereas for low-precision parameters a reduced bit allocation has little effect on the synthesis quality.
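As an illustration of how R/E might be obtained for one frame, the sketch below runs the Levinson-Durbin recursion on the frame's autocorrelation and returns the K parameters together with the normalized residual. It is not taken from the patent: the function names are hypothetical, no analysis window is applied, and the ratio is computed as residual energy over total energy (the text describes E and R as RMS values in one place and R as a sum of squares in another, so the exact definition is an assumption).

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (no window applied)."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def normalized_residual(x, order=10):
    """Return (k, r_over_e) for one frame via Levinson-Durbin.

    k        : list of `order` reflection (K) coefficients
    r_over_e : prediction-error energy divided by total energy, in [0, 1]
    """
    r = autocorr(x, order)
    if r[0] == 0.0:
        return [0.0] * order, 1.0  # silent frame: nothing to predict
    a = [1.0] + [0.0] * order      # predictor polynomial A(z)
    err = r[0]                     # prediction-error energy so far
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err if err > 0.0 else 0.0
        ks.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)       # each stage removes a fraction k^2 of the error
    return ks, err / r[0]
```

A strongly correlated frame gives a small ratio, while a noise-like frame gives a ratio close to 1, which matches the reading of R/E as an indicator of how well the frame is modeled.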

Object of the invention

In view of the above, the object of the present invention is to propose a speech analysis encoding method that can improve sound quality without increasing the amount of data.

Constitution of the invention

To achieve the above object, the present invention sets a threshold H and encodes with a low bit allocation when R/E > H and with a high bit allocation when R/E < H. At the same bit rate as the conventional method, this allows more bits than before to be given to the high-precision parameters, so that synthesized speech of higher quality than before is obtained. Alternatively, by encoding with the conventional bit allocation when R/E < H and with a smaller bit allocation than before when R/E > H, the conventional quality can be obtained at a lower bit rate than before.
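The selection rule can be written down in a few lines. The sketch below is only an illustration: the function name and the flag polarity (1 for the low-bit table) are assumptions, and the example tables use the K parameter bit counts given for the embodiment elsewhere in the text.

```python
def select_allocation(r_over_e, threshold_h, high_bits, low_bits):
    """Return (coding_flag, bit_allocation) for one frame.

    r_over_e    : residual power normalized by total power for the frame
    threshold_h : the threshold H
    The flag polarity (1 = low-bit table) is an assumption for illustration.
    """
    if r_over_e > threshold_h:
        return 1, low_bits   # low precision: spend fewer bits
    return 0, high_bits      # high precision: spend more bits

# Bits per K parameter (K1..K10) as given for the embodiment in the text.
HIGH_BITS = [7, 6, 5, 4, 4, 4, 3, 3, 3, 3]
LOW_BITS = [b - 1 for b in HIGH_BITS]

flag, bits = select_allocation(0.55, 0.4, HIGH_BITS, LOW_BITS)
```

The decoder needs the flag to know which table to use, which is why one extra bit per frame is spent on it.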

Description of the embodiments

An embodiment of the present invention is described below with reference to the drawings. When speech data sampled at 10 kHz is subjected to 10th-order partial autocorrelation analysis with an analysis time window of 10 ms, encoding is performed with different bit allocations for the case R/E > 0.4 and the case R/E ≤ 0.4, as shown in FIGS. 5(a) and 5(b), which show the configuration of the decoding ROM. For example, when R/E < 0.4 the precision of the parameters is high, so encoding uses the conventional bit allocation, namely 5 bits for the amplitude, 6 bits for the fundamental frequency, 7 bits for K1, 6 bits for K2, 5 bits for K3, 4 bits each for K4 to K6, and 3 bits each for K7 to K10. When R/E > 0.4, each K parameter is encoded with 1 bit fewer than before, namely 6 bits for K1, 5 bits for K2, 4 bits for K3, 3 bits each for K4 to K6, and 2 bits each for K7 to K10. The same bit allocation is used for the amplitude and the fundamental frequency in both cases. One bit is needed as information on whether the frame has R/E > 0.4 or R/E ≤ 0.4, that is, as information for selecting the decode table. This bit is called the coding flag.

Table 1 shows the percentages of frames with R/E > 0.4 and frames with R/E < 0.4 when a 20-second weather forecast was analyzed. The number of bits needed to store one second of data follows from these percentages. Conventionally, 55 bits are needed to store one frame, and because the analysis time window is 10 ms, 100 frames must be stored per second, giving 55 × 100 = 5500 bits/second. With the present method the figure is 47 × 33 + 56 × 67 = 5303 bits/second. That is, a compression of about 200 bits per second becomes possible while synthesis quality comparable to the conventional method is obtained.

[Table 1]

FIGS. 6(a) and 6(b) show the parameter configuration for one frame. FIG. 6(a) is the case R/E < 0.4, with 56 bits per frame; FIG. 6(b) is the case R/E > 0.4, with 47 bits per frame.
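The bit-rate comparison can be checked with a short calculation. This is a sketch, not part of the patent: the helper name is hypothetical, and the 67% / 33% split between the two frame classes is an assumption inferred from the calculation in the text, since Table 1 itself is not reproduced here. The split is consistent with the stated saving of about 200 bits per second.

```python
FRAMES_PER_SECOND = 100  # follows from the 10 ms analysis time window

def bits_per_second(classes):
    """classes: list of (bits_per_frame, percent_of_frames) pairs summing to 100%."""
    assert sum(pct for _, pct in classes) == 100
    return sum(bits * pct for bits, pct in classes) * FRAMES_PER_SECOND // 100

conventional = bits_per_second([(55, 100)])
# 56-bit frames for R/E < 0.4 and 47-bit frames for R/E > 0.4;
# the 67/33 split is an assumed reading of Table 1.
adaptive = bits_per_second([(56, 67), (47, 33)])
saving = conventional - adaptive
```

Note that the high-precision frames cost one bit more than the conventional 55 because of the coding flag, so the saving comes entirely from the low-precision frames.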

Although the example here uses a single threshold H and two decoding tables, the frames may instead be divided into multiple stages according to the degree of parameter precision.

Next, consider analysis and encoding under the same conditions as in FIGS. 5 and 6. For frames with R/E > 0.4, the precision of the parameters is low, so the parameters of the preceding frame are used as they are. In that case only 12 bits are needed per frame; the configuration is shown in FIG. 7.

The decoding ROM is the same as in the conventional example and has the configuration shown in FIG. 3. In this embodiment, the number of bits per second is 56 × 67 + 12 × 33 = 4148 bits/second, so a data compression of about 1350 bits/second becomes possible.
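As a check on the arithmetic for this second embodiment, assuming 100 frames per second (10 ms window), 67% of frames coded with 56 bits and 33% with 12 bits (the split is inferred from the calculation, not read from Table 1), and the conventional rate of 5500 bits/second:

```latex
\begin{aligned}
\text{conventional:}\quad & 55 \times 100 = 5500\ \text{bits/s} \\
\text{this embodiment:}\quad & 56 \times 67 + 12 \times 33 = 3752 + 396 = 4148\ \text{bits/s} \\
\text{saving:}\quad & 5500 - 4148 = 1352 \approx 1350\ \text{bits/s}
\end{aligned}
```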

Effects of the invention

As described above, according to the present invention, encoding at a low bit rate is performed only where the parameter extraction precision is low, so data can be compressed with almost no degradation of sound quality. In addition, by reallocating the bits saved where the extraction precision is low to the places where the extraction precision is high, sound quality can be improved without increasing the amount of data.

[Brief description of the drawings]

FIG. 1 is a flowchart of the speech analysis and synthesis method; FIG. 2 is a block diagram of the speech synthesizer; FIG. 3 shows an example of the configuration of the decoding ROM in the conventional example; FIG. 4 shows an example of the parameter configuration for one frame in the conventional example; FIG. 5 shows the configuration of the decoding ROM in an embodiment of the present invention; FIG. 6 shows the parameter configuration for one frame corresponding to FIG. 5; and FIG. 7 shows the parameter configuration for one frame in another embodiment.

... parameter ROM, ... parameter RAM, ... decoding ROM

Agent: Yoshihiro Morimoto

[FIG. 5(b): R/E > 0.4]

Claims

[Claims]

1. A speech analysis encoding method characterized in that speech is subjected to linear predictive analysis; parameters such as linear prediction coefficients, total power, and residual power are extracted for each analysis time window; and, according to the ratio of total power to residual power, encoding is performed with a smaller bit allocation when the value of residual power / total power is large and with a larger bit allocation when the value of residual power / total power is small.