CN1248191C - Phoneme changing method based on digital signal processing

Info

Publication number: CN1248191C
Application number: CNB031370144A
Authority: CN
Inventors: 李明; 刘建; 汪俊杰; 庹凌云; 颜永红; 孙宝海
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2003-06-19
Filing date: 2003-06-19
Publication date: 2006-03-29
Anticipated expiration: 2023-06-19
Also published as: CN1567428A

Abstract

The present invention discloses a voice changing method based on digital signal processing. The method comprises the following steps: (1) the original speech signal whose voice is to be changed is selected; (2) the pitch period length of the original speech signal is obtained; (3) the position of each pitch period in the entire original speech signal is located according to that pitch period length; (4) pitch periods are deleted or inserted between the pitch periods of the original speech signal to obtain a shortened or lengthened speech signal; (5) the shortened or lengthened speech signal is linearly stretched or compressed to the same length as the original speech signal to obtain the voice-changed speech signal. The method is simple and practical, requires little computation, is suitable for real-time implementation on a DSP chip, and yields highly natural changed speech. Because the changed speech has the same length as the original speech, the changed speech signal can readily be transmitted in real time.

Description

A voice changing method based on digital signal processing

Technical field

The present invention relates to a voice changing method and, more particularly, to a voice changing method based on digital signal processing.

Background technology

The fundamental frequency and the formants are two very important features of speech. The fundamental frequency is the frequency at which the vocal cords vibrate when a voiced sound is produced. Its height is directly related to the speaker's sex: in general, the fundamental frequency of a male voice is lower and that of a female voice is higher. Age also has some influence: the fundamental frequency of the elderly is lower than that of young adults, and that of young adults is lower than that of children. Changing the fundamental frequency therefore changes how speech sounds and affects a listener's judgment of the speaker's age and even sex.

A formant is a resonance frequency of the glottal wave in the vocal tract. Formant frequencies correlate strongly with the length of the vocal tract: the longer the vocal tract, the lower the formant frequencies, and vice versa. A man's vocal tract is generally somewhat longer than a woman's, so the formant frequencies of a male voice are correspondingly lower than those of a female voice. Changing the formants can therefore also affect a listener's judgment of the speaker.

Most methods for modifying formant frequencies are based on parametric synthesis algorithms. These methods generally share the same problems: the computational load is heavy, manual intervention is required, and the synthesized speech sounds unnatural.

Many methods already exist for changing the fundamental frequency. Widely used ones include the PSOLA algorithm (Pitch Synchronous Overlap and Add), the hybrid harmonic/stochastic model method (Hybrid Harmonic/Stochastic Model), and the autoregressive linear prediction coefficient method (Auto-Regressive LPC). PSOLA is the most widely used because it is simple, computationally light, and produces highly natural synthesized speech. Because of limitations inherent in the PSOLA algorithm, however, when the fundamental frequency has to be changed over a large range the speech exhibits considerable aliasing, which causes strong noise. The other two methods yield somewhat less natural speech after the fundamental frequency is changed, and both are computationally heavy, which makes real-time implementation on a DSP chip difficult. In addition, all of these methods change the length of the original speech, which causes serious problems for real-time transmission of the changed speech. Moreover, existing voice changing methods are mostly based on analog circuits and are not suitable for implementation on a digital signal processor (DSP).

Summary of the invention

An object of the present invention is to provide an improved voice changing method that obtains the voice changing effect by changing the fundamental frequency and the formant frequencies of the speech. A further object is to provide an improved voice changing method in which the voice-changed speech signal has the same length as the original speech signal.

The objects of the present invention are achieved as follows:

A voice changing method based on digital signal processing comprises the following steps:

(1) selecting the original speech signal whose voice is to be changed;

(2) when the original speech signal is periodic, calculating its fundamental frequency value and the length of the pitch period corresponding to that value; when the original speech is not periodic, taking a frequency value between 65 Hz and 500 Hz and using the period length corresponding to that frequency value as the pitch period length;

(3) locating the position of each pitch period in the entire original speech signal according to the pitch period length obtained in step (2);

(4) deleting or inserting pitch periods between the pitch periods of the original speech signal to obtain a shortened or lengthened speech signal;

(5) linearly stretching or compressing the shortened or lengthened speech signal obtained in step (4) to the same length as the original speech signal to obtain the voice-changed speech signal.

In step (4), pitch periods are deleted or inserted periodically between the pitch periods of the original speech signal.

When the desired fundamental frequency of the changed speech is $p$ times that of the original speech and $p > 1$, one pitch period is inserted every $(p-1)^{-1}$ pitch periods, the inserted pitch period being a copy of the pitch period adjacent to the insertion point. When the desired fundamental frequency of the changed speech is $p$ times that of the original speech and $0 < p < 1$, the current pitch period is deleted every $(1-p)^{-1}$ pitch periods. Preferably, $1 < p \le 2$ or $0.5 \le p < 1$.
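One way to see why these intervals give the factor $p$ is the following sketch. It assumes, for illustration only, a segment of $K$ pitch periods that all have the same length $T$ samples, so that the segment has $N = KT$ samples and, at sampling rate $f_s$, fundamental frequency $f_0 = f_s/T$. The symbols $K$, $K'$, $T$, $T'$, $f_s$ and $f_0'$ are introduced here and are not part of the original text.

$$
\begin{aligned}
p > 1:\quad & K' = K + (p-1)K = pK, \\
0 < p < 1:\quad & K' = K - (1-p)K = pK, \\
& M = K'T = pN .
\end{aligned}
$$

Rescaling the $M$ samples back to $N$ samples in step (5) divides every period length by $M/N = p$, so

$$
T' = \frac{T}{p}, \qquad f_0' = \frac{f_s}{T'} = p\,f_0 .
$$

In both cases the signal ends up with $pK$ periods in the original duration, which is why inserting periods raises the fundamental frequency and deleting periods lowers it.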

The linear stretching/compression in step (5) is performed as follows. Let the length of the original speech be $N$ and the length of the shortened or lengthened speech signal obtained in step (4) be $M$; the scaling ratio is then $r = M/N$. Let the sequence of the shortened or lengthened speech signal be $x(m)$, where $1 \le m \le M$, and let the sequence of the voice-changed speech signal be $y(n)$, where $1 \le n \le N$. Let

$$A_n = nr, \qquad B_n = \lfloor A_n \rfloor, \qquad C_n = B_n + 1,$$

where $\lfloor A_n \rfloor$ is the largest integer not greater than $A_n$. Then

$$y(n) = x(B_n) + (A_n - B_n)\,[x(C_n) - x(B_n)],$$

where $y(n)$ is the $n$-th sample of the voice-changed speech sequence.

The present invention is a voice changing method based on digital signal processing. The method is simple and practical, requires very little computation, is suitable for real-time implementation on a DSP chip, and produces highly natural changed speech. Moreover, the changed speech has the same length as the original speech, which facilitates real-time transmission of the changed speech signal.

Description of drawings

Fig. 1 is a flow chart of the voice changing method of the present invention;

Fig. 2 is an example of locating the pitch periods of a speech signal.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

The flow chart of the voice changing method of the present invention is shown in Fig. 1. First, one frame of speech is input; the frame length can be adjusted as appropriate to the needs of the application.

The fundamental frequency of the input original speech signal is then estimated. This embodiment uses the subharmonic summation method (Summation of Sub-Harmonic Method) for fundamental frequency estimation, so a fundamental frequency value is obtained whether or not the original speech is periodic. When the original speech is periodic, that is, when a pitch period exists, a meaningful fundamental frequency value is obtained. When the original speech is not periodic, that is, when no pitch period exists, as in unvoiced or silent segments, the value obtained is in effect a random number between 65 Hz and 500 Hz; the present invention nevertheless uses the period length corresponding to this frequency value as the "pseudo" pitch period length of the speech segment.
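The text does not spell out the estimator, so the following is only a minimal sketch in the spirit of subharmonic summation, not the embodiment's actual implementation. It scores each candidate frequency between 65 Hz and 500 Hz by a decaying sum of spectral magnitudes at its harmonics and returns the best candidate, so it always yields a value in that range, even for silence. The function name `estimate_f0`, the candidate step, the number of harmonics and the decay factor are all assumptions made for illustration.

```python
import cmath
import math


def estimate_f0(frame, sample_rate, f_min=65.0, f_max=500.0, step=1.0,
                n_harmonics=5, decay=0.84):
    """Pick the candidate in [f_min, f_max] whose harmonics carry the most
    decay-weighted spectral magnitude. Always returns a value in range.
    All default parameter values are illustrative assumptions."""

    def magnitude(freq):
        # Magnitude of the discrete-time Fourier transform at one frequency.
        w = -2j * math.pi * freq / sample_rate
        return abs(sum(s * cmath.exp(w * i) for i, s in enumerate(frame)))

    best_f, best_score = f_min, -1.0
    f = f_min
    while f <= f_max:
        score = 0.0
        for h in range(1, n_harmonics + 1):
            if h * f < sample_rate / 2:  # ignore harmonics above Nyquist
                score += decay ** (h - 1) * magnitude(h * f)
        if score > best_score:
            best_f, best_score = f, score
        f += step
    return best_f


# A 50 ms frame at 8 kHz containing 100 Hz plus two weaker harmonics.
fs = 8000
frame = [math.sin(2 * math.pi * 100 * i / fs)
         + 0.5 * math.sin(2 * math.pi * 200 * i / fs)
         + 0.25 * math.sin(2 * math.pi * 300 * i / fs)
         for i in range(400)]
f0 = estimate_f0(frame, fs, step=5.0)  # coarse 5 Hz grid keeps this fast
```

For an aperiodic frame this sketch returns whichever candidate happens to score highest, which plays the role of the "pseudo" fundamental frequency described above.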

As this processing shows, the present invention in effect treats aperiodic unvoiced segments as if they were voiced. This is acceptable because unvoiced speech resembles white noise, and periodically deleting or inserting speech within it has almost no effect on how a listener perceives it. Processing voiced and unvoiced speech uniformly also simplifies the algorithm and, more importantly, avoids the failure of the voice change that would result from misclassifying voiced speech as unvoiced.

The pitch period length of a speech signal, in samples, equals the sampling rate of the speech divided by the fundamental frequency of the signal. The position of each pitch period in the entire original speech signal is located according to this period length.

Take speech sampled at 8000 samples per second as an example, with 1000 samples as one processing unit, that is, one frame of speech. If the average fundamental frequency of this frame is 100 Hz, its pitch period is 80 samples, and the pitch periods are located using this length.
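The arithmetic above can be checked with a few lines. The helper name `pitch_period_length` is hypothetical, and rounding to the nearest whole sample is an assumption, since the text does not say how a non-integer quotient is handled.

```python
def pitch_period_length(sample_rate_hz, f0_hz):
    """Pitch period length in samples: sampling rate divided by F0."""
    return round(sample_rate_hz / f0_hz)  # rounding is an assumption


period = pitch_period_length(8000, 100)    # the example in the text
shortest = pitch_period_length(8000, 500)  # upper end of the 65-500 Hz range
longest = pitch_period_length(8000, 65)    # lower end of the 65-500 Hz range
```

At 8000 samples per second, the 65 Hz to 500 Hz range therefore corresponds to period lengths from 16 to about 123 samples.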

In general, each pitch period has one maximum, and locating the pitch periods by these maxima is the most convenient and reliable approach. First, a maximum is found near the middle of the frame; then, moving outward in both directions in steps of one pitch period length, a further maximum is found at each step, until the maxima of all pitch periods have been found. In this embodiment each pitch period is defined to start at the second zero crossing after its maximum and to end at the start of the next pitch period. The second zero crossing after each maximum is therefore found, and this delimits the region of each pitch period. The locating process is illustrated in Fig. 2, in which the solid lines mark the extremum of each pitch period, the dashed lines mark the second zero crossing after each extremum, and the span between two adjacent dashed lines is one pitch period. For the speech signal chosen above, for example, a maximum is first found near the 500th sample of the current frame; maxima are then sought about 80 samples before and 80 samples after this maximum, and all maxima of the frame are found in this way. Finally, the second zero crossing after each maximum is found, which delimits the region of each pitch period.
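The locating procedure can be sketched as follows. This is an illustrative reading of the description, not the embodiment's code: the function names, the search tolerance around each expected maximum (a quarter period) and the sign-change test used for zero crossings are assumptions.

```python
import math


def find_peaks(frame, period, tol=None):
    """Find one maximum per pitch period, starting near the frame centre."""
    n = len(frame)
    if tol is None:
        tol = max(1, period // 4)  # hypothetical search tolerance

    def argmax_in(lo, hi):
        lo, hi = max(lo, 0), min(hi, n)
        return max(range(lo, hi), key=lambda i: frame[i])

    centre = n // 2
    first = argmax_in(centre - period // 2, centre + period // 2 + 1)
    peaks = [first]
    pos = first - period                 # walk towards the frame start
    while pos - tol >= 0:
        pos = argmax_in(pos - tol, pos + tol + 1)
        peaks.append(pos)
        pos -= period
    pos = first + period                 # walk towards the frame end
    while pos + tol < n:
        pos = argmax_in(pos - tol, pos + tol + 1)
        peaks.append(pos)
        pos += period
    return sorted(peaks)


def second_zero_crossing_after(frame, start):
    """Index of the second sign change after `start`, or None if absent."""
    count = 0
    for i in range(start + 1, len(frame)):
        if (frame[i - 1] >= 0) != (frame[i] >= 0):
            count += 1
            if count == 2:
                return i
    return None


def locate_pitch_periods(frame, period):
    """Return (start, end) index pairs; each period runs from one second
    zero crossing to the next, with `end` exclusive."""
    starts = []
    for peak in find_peaks(frame, period):
        z = second_zero_crossing_after(frame, peak)
        if z is not None and (not starts or z > starts[-1]):
            starts.append(z)
    return list(zip(starts[:-1], starts[1:]))


# 1000-sample frame of a 100 Hz sine at 8000 samples per second
# (period 80 samples); the 0.25 offset keeps samples off exact zeros.
frame = [math.sin(2 * math.pi * (i + 0.25) / 80) for i in range(1000)]
periods = locate_pitch_periods(frame, 80)
```

Samples before the first located boundary and after the last one are not assigned to any period in this sketch; how the frame edges are handled is not stated in the text.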

For a speech signal that has no pitch period, once its "pseudo" pitch period length has been obtained, the positions of its pseudo pitch periods can be located by the same method of finding maxima and zero crossings.

Whether pitch periods are inserted or deleted depends on the voice changing requirement. To raise the fundamental frequency, pitch periods are inserted; to lower it, pitch periods are deleted. For example, if the desired fundamental frequency of the changed speech is 1.5 times that of the original speech, a copy of the pitch period adjacent to the insertion point is inserted every 1/(1.5-1) = 2 pitch periods; if the desired fundamental frequency of the changed speech is 0.8 times that of the original speech, the current pitch period is deleted every 1/(1-0.8) = 5 pitch periods. This yields a lengthened or shortened speech signal.
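A minimal sketch of this insertion/deletion step is given below. It assumes the pitch periods have already been located as contiguous `(start, end)` index pairs with `end` exclusive; the function name `change_period_count` is hypothetical, and rounding the interval to a whole number of periods is an assumption for factors where 1/|p-1| is not an integer.

```python
def change_period_count(signal, periods, p):
    """Insert (p > 1) or delete (0 < p < 1) one pitch period every
    1/|p - 1| periods. `periods` is a list of contiguous (start, end)
    index pairs, end exclusive."""
    if p == 1 or not periods:
        return list(signal)
    k = round(1 / abs(p - 1))  # interval in periods; rounding is an assumption
    out = list(signal[:periods[0][0]])   # samples before the first period
    for idx, (start, end) in enumerate(periods, start=1):
        segment = list(signal[start:end])
        if p > 1:
            out.extend(segment)
            if idx % k == 0:
                out.extend(segment)      # copy of the adjacent period
        elif idx % k != 0:
            out.extend(segment)          # every k-th period is dropped
    out.extend(signal[periods[-1][1]:])  # samples after the last period
    return out


# Ten artificial "periods" of 10 samples each.
signal = list(range(100))
periods = [(10 * i, 10 * i + 10) for i in range(10)]
longer = change_period_count(signal, periods, 1.5)   # one copy every 2 periods
shorter = change_period_count(signal, periods, 0.8)  # drop every 5th period
```

With ten periods, the factor 1.5 adds five periods (150 samples in total) and the factor 0.8 removes two (80 samples in total), so the length changes by the factor p in both cases.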

Finally, the lengthened speech signal is compressed to the same length as the original speech, or the shortened speech signal is stretched to the same length as the original speech, giving the desired voice-changed speech signal. Suppose, for example, that the lengthened speech obtained in the previous step is 1400 samples long, with sequence $x(m)$, where $1 \le m \le 1400$. The original speech is 1000 samples long, so the voice-changed speech signal should also be 1000 samples long; let its sequence be $y(n)$, where $1 \le n \le 1000$. The scaling ratio is $r = 1400/1000 = 1.4$. Let $A_n = nr$, $B_n = \lfloor A_n \rfloor$ and $C_n = B_n + 1$, where $\lfloor A_n \rfloor$ is the largest integer not greater than $A_n$, as shown in the following table:

n	1	2	3	…	500	501	…	999	1000
$A_n$	1.4	2.8	4.2	…	700	701.4	…	1398.6	1400
$B_n$	1	2	4	…	700	701	…	1398	1400
$C_n$	2	3	5	…	701	702	…	1399	1400 (1401)

Applying the formula $y(n) = x(B_n) + (A_n - B_n)\,[x(C_n) - x(B_n)]$ gives the compressed, voice-changed speech, where $y(n)$ is the $n$-th sample of the voice-changed speech sequence. In the lower right corner of the table, the calculated value of $C_n$, 1401, exceeds the maximum allowed value of $m$, 1400; in this embodiment it is replaced by 1400, the nearest valid value. Finally, the voice-changed speech is output, yielding a speech signal that both has the voice changing effect and has the same length as the original speech.
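The interpolation can be sketched as below. The function name `linear_resample` is hypothetical. Integer arithmetic is used for $B_n$ and for the fractional part $A_n - B_n$ to avoid floating-point rounding at exact multiples, and $C_n$ is clamped to $M$ as in the embodiment. The case $B_n < 1$, which arises when stretching a shortened signal, is not covered by the text; returning the first sample there is an assumption.

```python
def linear_resample(x, n_out):
    """Linearly stretch or compress x (length M) to n_out samples (length N).
    Indices follow the 1-based convention of the text."""
    m_len = len(x)
    y = []
    for n in range(1, n_out + 1):
        # A_n = n * M / N; b is B_n = floor(A_n), rem / N is A_n - B_n.
        b, rem = divmod(n * m_len, n_out)
        frac = rem / n_out
        if b < 1:
            y.append(x[0])        # assumption: no sample exists before x(1)
            continue
        c = min(b + 1, m_len)     # C_n, with 1401 replaced by 1400 (clamped to M)
        y.append(x[b - 1] + frac * (x[c - 1] - x[b - 1]))
    return y


# With x(m) = m, y(n) reproduces A_n, so the table can be checked directly.
x = [float(m) for m in range(1, 1401)]
y = linear_resample(x, 1000)
stretched = linear_resample([0.0, 10.0], 4)
```

Using the ramp $x(m) = m$ as input is only a convenient check: the first three outputs then equal 1.4, 2.8 and 4.2, and the last equals 1400.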

Using the same steps, a shortened speech signal can likewise be stretched to the same length as the original speech.

Claims

1. A voice changing method based on digital signal processing, comprising the following steps:

(1) selecting the original speech signal whose voice is to be changed;

2. The voice changing method according to claim 1, characterized in that, in step (4), pitch periods are deleted or inserted periodically between the pitch periods of the original speech signal.

3. The voice changing method according to claim 2, characterized in that, when the desired fundamental frequency of the changed speech is $p$ times that of the original speech and $p > 1$, one pitch period is inserted every $(p-1)^{-1}$ pitch periods, the inserted pitch period being a copy of the pitch period adjacent to the insertion point.

4. The voice changing method according to claim 2, characterized in that, when the desired fundamental frequency of the changed speech is $p$ times that of the original speech and $0 < p < 1$, the current pitch period is deleted every $(1-p)^{-1}$ pitch periods.

5. The voice changing method according to claim 3, characterized in that $1 < p \le 2$.

6. The voice changing method according to claim 4, characterized in that $0.5 \le p < 1$.

7. The voice changing method according to claim 1, characterized in that the linear stretching/compression in step (5) is performed as follows:

the length of the original speech is $N$ and the length of the shortened or lengthened speech signal obtained in step (4) is $M$, so that the scaling ratio is $r = M/N$; the sequence of the shortened or lengthened speech signal is $x(m)$, where $1 \le m \le M$; the sequence of the voice-changed speech signal is $y(n)$, where $1 \le n \le N$;

let $A_n = nr$ and $B_n = \lfloor A_n \rfloor$; when $1 \le B_n < M$, $C_n = B_n + 1$, and when $B_n = M$, $C_n = B_n$, where $\lfloor A_n \rfloor$ is the largest integer not greater than $A_n$; then $y(n) = x(B_n) + (A_n - B_n)\,[x(C_n) - x(B_n)]$, where $y(n)$ is the $n$-th sample of the voice-changed speech sequence.