CN103531205B - Asymmetric voice conversion method based on deep neural network feature mapping - Google Patents
Abstract
The invention discloses an asymmetric voice conversion method based on deep neural network feature mapping, belonging to the field of voice conversion technology. For the asymmetric (non-parallel) data of the source speech and the target speech, the pre-training function of a deep network is first used to model the data probabilistically; by distilling the higher-order statistical characteristics contained in the speech signal, a good candidate space for the network coefficients is provided. Next, a small amount of symmetric (parallel) data is used for incremental learning, and the network weight coefficients are corrected through the optimized back-propagated error, thereby realizing the mapping of the characteristic parameters. The invention optimizes the network coefficient structure, uses it as the initial parameter values of a deep forward prediction network, and then, during the incremental learning on a small amount of symmetric data, optimizes the network structure parameters by back-propagation, achieving the mapping of the speaker's personal characteristic parameters.
Description
Technical field
The invention belongs to the field of voice conversion technology, and specifically relates to an asymmetric voice conversion method based on deep neural network feature mapping.
Background art
Voice conversion technology, briefly stated, transforms the speech of one speaker (called the source) by some means so that it sounds as if it were spoken by another speaker (called the target). Voice conversion is a branch of an interdisciplinary field: its content involves knowledge from phonetics, semantics, psychoacoustics, and related areas, and it also covers many aspects of speech signal processing, such as speech analysis and synthesis, speaker recognition, speech coding, and speech enhancement.
The ultimate goal of voice conversion is to provide an instant voice service that automatically and rapidly adapts to any speaker; such a system should need little or no user training to work well for all users and under various conditions. However, the voice conversion technology of the present stage cannot yet achieve this. On the one hand, current systems strictly constrain how users form sentences (i.e., they require symmetric data for training); on the other hand, they also require a large amount of data to train the system.
For the above problems, some countermeasures already exist. For the "asymmetric data" problem, for example, one scholar proposed first partitioning the feature spaces of the source and target speakers with a vector quantization algorithm, then comparing template distances after vocal tract length normalization to select the source codewords corresponding to the speaker, and finally searching, within the same codeword space, for the closest matching speech frames by k-nearest neighbors. As another example, Salor et al. proposed solving the problem with a dynamic programming algorithm, whose core idea is to build a cost function that simultaneously minimizes the error between source and target and between the target's previous frame and the current frame. For the "reducing the data volume" problem, Helander et al. proposed considering the coupling relationship between characteristic parameters during modeling and using this relationship to improve system robustness when data are scarce. In addition, others have proposed learning the traditional Gaussian mixture model with variational Bayesian analysis techniques, strengthening the modeling capability of this model under sparse data.
A search of the prior art finds Chinese patent application No. ZL201210229540.8, published on October 17, 2012, entitled "A method of voice conversion based on LPC and RBF neural networks". That application relates to a voice conversion method based on LPC and an RBF neural network, comprising the following steps: A. pre-processing the speech; B. performing pitch detection on the voiced frames; C. converting the voiced frames after pitch detection; D. extracting voiced-frame parameters from the converted pitch; E. computing the extracted voiced-frame parameters to obtain a voiced frame, then synthesizing this frame to obtain the converted voiced frame. That application proposes a voice conversion scheme of high quality and moderate computational cost, but its shortcoming is that it decomposes the speech to be converted into unvoiced and voiced sounds and further splits the voiced sounds into pitch, energy, LPC and LSF coefficients for conversion; the added measurement of energy increases measurement difficulty and error, and easily leads to unsatisfactory speech quality after conversion.
Summary of the invention
The object of the invention is to overcome the deficiencies of prior-art voice conversion systems, which strictly constrain how users form sentences, require a large amount of training data, and yield unsatisfactory converted speech quality, by providing an asymmetric voice conversion method based on deep neural network feature mapping. With the technical scheme provided by the present invention, the two relatively independent problems that voice conversion systems face in real environments, namely the sharp performance degradation under asymmetric-data conditions and under data scarcity, are integrated and studied under a unified theoretical framework. A deep neural network is trained on the raw data in an unsupervised manner to distill the higher-order statistical information contained therein, and on this basis supervised forward prediction training is carried out, finally improving the generalization capability of the voice conversion system under practical conditions.
The basic principle of the invention is as follows: for the asymmetric data of the source speech and the target speech, the pre-training function of a deep neural network is first used to model the data probabilistically; by distilling the higher-order statistical characteristics contained in the speech signal, a good candidate space for the network coefficients is provided. Next, a small amount of symmetric data is used for incremental learning, and the network weight coefficients are corrected through the optimized back-propagated error, thereby realizing the mapping of the characteristic parameters.
Specifically, the present invention is realized with the following technical scheme, comprising the following steps:
1) On the basis of the existing source speech signals, collecting source speech signals having the same semantic content as the collected target speech signals, so as to form training speech signals comprising asymmetric source speech signals, symmetric source speech signals, and target speech signals.
A harmonic plus stochastic model is used to decompose the training speech signals, obtaining respectively the pitch frequency track of the asymmetric source speech signals and the amplitude and phase values of their harmonic vocal tract spectrum parameters, the pitch frequency track of the symmetric source speech signals, the pitch frequency track of the target speech signals, the amplitude and phase values of the harmonic vocal tract spectrum parameters of the symmetric source speech signals, and the amplitude and phase values of the harmonic vocal tract spectrum parameters of the target speech signals.
According to the pitch frequency track of the symmetric source speech signals and the pitch frequency track of the target speech signals, a Gaussian model of the source speech pitch frequency and a Gaussian model of the target speech pitch frequency are established.
2) The amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech signals, of the symmetric source speech signals, and of the target speech signals are each subjected to dimensionality reduction: the vocal tract parameters are converted into linear prediction parameters, which in turn produce line spectral frequency parameters suitable for voice conversion.
3) The line spectral frequency parameters of the asymmetric source speech signals obtained in step 2) are used to train a deep belief network without supervision, obtaining a trained deep belief network.
4) A dynamic time warping algorithm is used to align the line spectral frequency parameters of the symmetric source speech signals obtained in step 2) with the line spectral frequency parameters of the target speech signals.
5) The aligned line spectral frequency parameters of the symmetric source speech signals and the line spectral frequency parameters of the target speech signals are used to perform incremental supervised training of a deep forward prediction network, obtaining a trained deep forward prediction network.
6) The harmonic plus stochastic model is used to decompose the source speech signal to be converted, obtaining its pitch frequency track and the amplitude and phase values of its harmonic vocal tract spectrum parameters.
The amplitude and phase values of the harmonic vocal tract spectrum parameters of the source speech signal to be converted are subjected to dimensionality reduction: the vocal tract parameters are converted into linear prediction parameters, producing line spectral frequency parameters suitable for voice conversion. The deep belief network trained in step 3) then performs feature mapping on the line spectral frequency parameters of the source speech signal to be converted, yielding new characteristic parameters of the source speech signal to be converted. Finally, the deep forward prediction network trained in step 5) is regarded as a general function mapping and applied to the new characteristic parameters of the source speech signal to be converted, yielding the line spectral frequency parameters of the converted speech signal.
Using the Gaussian model of the source speech pitch frequency and the Gaussian model of the target speech pitch frequency obtained in step 1), a Gaussian transformation is applied to the pitch frequency track of the source speech signal to be converted, yielding the pitch frequency track of the converted speech signal.
7) The line spectral frequency parameters of the converted speech signal are inversely converted into harmonic plus noise model coefficients and, together with the pitch frequency track of the converted speech signal, used for speech synthesis, yielding the converted speech signal.
A further feature of the above technical scheme is that in step 1), the process of decomposing the original speech signal with the harmonic plus stochastic model is as follows:
1-1) dividing the original speech signal into frames of fixed duration and estimating the pitch frequency with the correlation method;
1-2) for voiced signals, setting a maximum voiced frequency component to divide the main energy regions of the harmonic part and the random part, then estimating the discrete amplitude and phase values of the harmonic vocal tract spectrum parameters with the least-squares algorithm;
1-3) for unvoiced signals, analyzing them directly with the classical linear prediction analysis method to obtain linear prediction coefficients.
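Step 1-1)'s correlation-method pitch estimate can be sketched as follows (a NumPy illustration; the voicing threshold and search range are illustrative choices, not values given by the patent):

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Correlation-method pitch estimate for one frame, as in step 1-1):
    pick the autocorrelation peak inside the plausible pitch-lag range.
    The voicing threshold (0.3) is an illustrative choice."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi + 1])
    # weak periodicity relative to lag 0 -> declare the frame unvoiced
    if ac[0] <= 0 or ac[lag] / ac[0] < 0.3:
        return 0.0
    return fs / lag
```

Unvoiced frames return 0, matching the convention of setting the pitch frequency to zero for unvoiced frames.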
A further feature of the above technical scheme is that in step 2), the process of converting the vocal tract parameters into linear prediction parameters and then producing line spectral frequency parameters suitable for voice conversion is as follows:
2-1) squaring the amplitude values of the discrete harmonic vocal tract spectrum parameters and interpreting the results as samples of the discrete power spectrum;
2-2) according to the one-to-one correspondence between the power spectral density function and the autocorrelation function, obtaining a Toeplitz matrix equation in the linear prediction coefficients, and solving this equation to obtain the linear prediction coefficients;
2-3) converting the linear prediction coefficients into line spectral frequency coefficients.
A further feature of the above technical scheme is that in step 3) the unsupervised training of the deep belief network takes one of the following two forms:
3-1) forming any two adjacent layers into a restricted Boltzmann machine and training it with the contrastive divergence method, then stacking all the Boltzmann machines to constitute a complete deep belief network; the set of weight coefficients in this network constitutes the candidate space of network parameters;
3-2) splicing two deep feedforward networks, one forward and one reversed, into a combined network with an auto-encoder/decoder structure, placing the line spectral frequency coefficients of the speech signal at both the input and the output, and learning the network structure parameters under a regularized stochastic gradient descent criterion.
A further feature of the above technical scheme is that in step 4) the alignment criterion is: for two characteristic parameter sequences of unequal length, the dynamic time warping algorithm nonlinearly maps the time axis of one onto the time axis of the other, thereby realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized and the search area is restricted, finally obtaining the time matching function.
A further feature of the above technical scheme is that in step 5) the process of incremental supervised training of the deep forward prediction network is as follows:
5-1) adding a network output layer, with amplitude-limited soft output characteristics, on top of the deep belief network trained in step 3), thus constituting a deep feedforward network;
5-2) processing the line spectral frequency coefficients of the aligned symmetric source speech signals in the manner of step 3-2), and extracting the parameters of the middle layer of the network as new characteristic parameters of the symmetric source speech signals;
5-3) taking the new characteristic parameters of the symmetric source speech signals and the line spectral frequency coefficients of the target speech signals as the input and output of the deep feedforward network, and adjusting the network weight coefficients under the criterion of minimizing the back-propagated error, completing the incremental training of the network.
A further feature of the above technical scheme is that the speech synthesis process in step 7) is as follows:
7-1) using the amplitude and phase values of the discrete harmonic vocal tract spectrum parameters of the voiced signal as the amplitude and phase values of sinusoidal signals, and superimposing these to obtain the reconstructed voiced signal; interpolation and phase compensation are used so that the time-domain waveform of the reconstructed voiced signal is not distorted;
7-2) passing a white noise signal through an all-pole filter to obtain the reconstructed unvoiced signal;
7-3) superimposing the reconstructed voiced signal and the reconstructed unvoiced signal to obtain the converted speech signal.
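Steps 7-1) and 7-2) can be sketched per frame as follows (a NumPy illustration; the cross-frame interpolation and phase compensation of step 7-1) are omitted):

```python
import numpy as np

def synth_voiced_frame(amps, phases, f0, fs, n_samples):
    """Step 7-1): reconstruct one voiced frame by summing sinusoids at
    harmonic frequencies with the given amplitudes and phases."""
    n = np.arange(n_samples)
    w0 = 2 * np.pi * f0 / fs
    frame = np.zeros(n_samples)
    for l, (A, th) in enumerate(zip(amps, phases), start=1):
        frame += A * np.cos(l * w0 * n + th)
    return frame

def synth_unvoiced_frame(lpc_a, n_samples, seed=0):
    """Step 7-2): reconstruct an unvoiced frame by filtering white noise
    through the all-pole filter 1/A(z) in direct form."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)
    out = np.zeros(n_samples)
    p = len(lpc_a) - 1
    for i in range(n_samples):
        acc = noise[i]
        for k in range(1, min(i, p) + 1):
            acc -= lpc_a[k] * out[i - k]   # feedback through past outputs
        out[i] = acc
    return out
```

Superimposing the two outputs frame by frame (step 7-3) yields the converted speech signal.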
The beneficial effects of the invention are as follows: the asymmetric voice conversion method based on deep neural network feature mapping makes full use of the common features of the "asymmetric data" and "data scarcity" problems and designs a data acquisition and integration method covering both situations. On this basis, a deep belief network learns the structural features of the asymmetric data, optimizing the network coefficient structure, which then serves as the initial parameter values of the deep forward prediction network; during the incremental learning on a small amount of symmetric data, back-propagation optimizes the network structure parameters, realizing the mapping of the speaker's personal characteristic parameters.
Brief description of the drawings
Fig. 1 is a block diagram of the training and conversion stages of the voice conversion system according to the invention;
Fig. 2 is a schematic diagram of the pre-training modes of the deep belief network according to the invention.
Detailed description of the invention
The invention is described in further detail below with reference to the accompanying drawings and an example.
To handle the "asymmetric data" and "data scarcity" problems of real environments effectively, the invention designs the following data acquisition and integration scheme for subsequent operations. In most application scenarios, the collection of the target speaker's speech data is generally passive and therefore relatively difficult, which often leads to data scarcity; by contrast, since the collection of the source speaker's speech data is more active, it is relatively easy and the data volume is comparatively sufficient. Therefore, on the basis of the existing source speech data, the source speaker re-records, according to the collected speech of the target speaker, a small amount of speech data with the same semantic content as a reference (the source speaker records a small amount of speech incrementally). In this way, although the source and target data as a whole are asymmetric, they contain a small amount of symmetric data.
Therefore, with reference to Fig. 1 and Fig. 2, the asymmetric voice conversion method based on deep neural network feature mapping of this embodiment includes a training stage and a conversion stage; steps 1) to 5) below constitute the training stage, and steps 6) to 7) the conversion stage:
1) On the basis of the existing source speech signals, collecting source speech signals having the same semantic content as the collected target speech signals, so as to form training speech signals comprising asymmetric source speech signals, symmetric source speech signals, and target speech signals.
The harmonic plus stochastic model is used to decompose the training speech signals, obtaining respectively the pitch frequency tracks of the asymmetric source, symmetric source, and target speech signals, and the amplitude and phase values of their harmonic vocal tract spectrum parameters.
The steps for decomposing the original speech signal with the harmonic plus stochastic model are as follows:
A. Divide the speech signal into frames, with frame length 20 ms and frame shift 10 ms.
B. Estimate the pitch frequency in each frame with the correlation method; if the frame is unvoiced, set the pitch frequency equal to zero.
C. For a voiced frame (i.e., a frame whose pitch frequency is not zero), assume that the speech signal s_h(n) can be formed by superimposing a series of sine waves:

s_h(n) = Σ_{l=-L}^{L} C_l·e^{jlω_0·n} (1)

where L is the number of sine waves, {C_l} are the complex amplitudes of the sine waves, ω_0 is the pitch frequency, and n denotes the n-th sample of the speech. Let s_h denote the vector formed by the samples of s_h(n) within one frame; then formula (1) can be rewritten as:

s_h = Bx, with B = [b_{-L}, …, b_L], b_l = [1, e^{jlω_0}, …, e^{jlω_0(N-1)}]^T, x = [C_{-L}, …, C_L]^T (2)

where N denotes the total number of samples in one frame of speech. The {C_l} can be determined by the least-squares algorithm, i.e., by minimizing:

ε = Σ_n w²(n)·|s(n) − s_h(n)|² (3)

where s(n) is the actual speech signal, w(n) is a window function (usually a Hamming window), and ε denotes the error. Rewriting the window function in matrix form,

W = diag[w(0), w(1), …, w(N−1)] (4)

the optimal x is then obtained as:

x = (B^H·W^H·W·B)^{-1}·B^H·W^H·W·s (5)

where the superscript H denotes conjugate transposition and s is the vector formed by the samples of the actual speech signal s(n) within one frame.
D. Having obtained {C_l}, the harmonic amplitudes and phase values are:

AM_l = 2|C_l| = 2|C_{-l}|, θ_l = arg C_l = −arg C_{-l} (6)

E. For unvoiced frames, the original speech frame signal is analyzed directly with the classical linear prediction analysis method to obtain the corresponding linear prediction coefficients.
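The least-squares estimation of steps C and D can be sketched as follows (a NumPy illustration under the stated sinusoidal model, not the patent's implementation; for simplicity the harmonic count is capped just below the Nyquist frequency rather than at a maximum voiced frequency):

```python
import numpy as np

def harmonic_ls(frame, f0, fs):
    """Least-squares estimate of the harmonic complex amplitudes {C_l} of
    one voiced frame under s_h(n) = sum_l C_l e^{j l w0 n}, converted to
    amplitude/phase via AM_l = 2|C_l|, theta_l = arg C_l."""
    N = len(frame)
    w0 = 2.0 * np.pi * f0 / fs
    L = int(np.pi / w0 - 1e-6)               # harmonics kept below Nyquist
    n = np.arange(N) - N // 2                # centred analysis instants
    l = np.arange(-L, L + 1)
    B = np.exp(1j * np.outer(n, l * w0))     # N x (2L+1) sinusoidal basis
    w = np.hamming(N)                        # analysis window w(n)
    # x = (B^H W^H W B)^{-1} B^H W^H W s, solved as weighted least squares
    C, *_ = np.linalg.lstsq(B * w[:, None], (frame * w).astype(complex),
                            rcond=None)
    amps = 2.0 * np.abs(C[L + 1:])           # AM_l for l = 1..L
    phases = np.angle(C[L + 1:])             # theta_l for l = 1..L
    return amps, phases
```

For a frame that really is a single windowed harmonic, the estimate is exact up to rounding, since the basis then spans the signal.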
Since the pitch frequency tracks of the symmetric source speech signals and of the target speech signals can be considered to obey single Gaussian distributions, a Gaussian model of the source speech pitch frequency and a Gaussian model of the target speech pitch frequency can be established from the pitch frequency tracks of the symmetric source speech signals and of the target speech signals.
From these Gaussian models, the model parameters can be estimated, namely the mean μ_y and variance σ_y of the Gaussian model of the source speech pitch frequency, and the mean μ_x and variance σ_x of the Gaussian model of the target speech pitch frequency.
2) The amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech signals, of the symmetric source speech signals, and of the target speech signals are each subjected to dimensionality reduction: the vocal tract parameters are converted into linear prediction parameters, producing line spectral frequency parameters suitable for voice conversion.
The reason for step 2) is that the dimension of the original harmonic plus noise model parameters is high and inconvenient for subsequent computation, so dimensionality reduction is necessary. Since the pitch contour is a one-dimensional parameter, the main objects of the reduction are the vocal tract amplitude spectrum parameters and the phase parameters. At the same time, the goal of the reduction is to convert the vocal tract parameters into the classical linear prediction parameters and then produce line spectral frequency parameters suitable for the voice conversion system. The solution procedure is as follows:
A. Square each of the L discrete amplitude values AM_l and interpret the results as samples PW(ω_l) of the discrete power spectrum, where ω_l denotes the frequency at the l-th integer multiple of the pitch frequency.
B. By the Wiener-Khinchin theorem, the autocorrelation function and the power spectral density function form a Fourier transform pair, i.e. R_k = Σ_l PW(ω_l)·e^{jkω_l}. A preliminary estimate of the linear prediction coefficients can therefore be obtained by solving:

[R_0 R_1 … R_{p−1}; R_1 R_0 … R_{p−2}; …; R_{p−1} R_{p−2} … R_0]·[a_1; a_2; …; a_p] = −[R_1; R_2; …; R_p] (7)

where a_1, a_2, …, a_p are the coefficients of the p-th order linear prediction filter A(z), and R_0 ~ R_p are the values of the autocorrelation function at the first p integer delays.
C. Convert the all-pole model represented by the p-th order linear prediction coefficients into the time-domain impulse response function h*[n]:

h*[n] is the impulse response of H*(z) = 1/A(z) (8)

and it can be proved that the autocorrelation sequence R* estimated from h* satisfies:

R*_k = Σ_n h*[n]·h*[n+k] (9)

D. In the case where the Itakura-Saito distance is minimized, the real autocorrelation R and the estimated autocorrelation R* are related as follows:

R'_k = R_k + (R_k − R*_k) (10)

Substituting R'_k of (10) for R_k in equation (7) and re-solving gives a re-estimate of the linear prediction coefficients.
E. Evaluate the error with the Itakura-Saito criterion; if the error is greater than the set threshold, repeat steps C to E; otherwise, stop the iteration.
The linear prediction coefficients thus obtained are converted into line spectral frequency parameters by jointly solving the following two equations:

P(z) = A(z) + z^{-(p+1)}·A(z^{-1}) (11)
Q(z) = A(z) − z^{-(p+1)}·A(z^{-1}) (12)
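A compact sketch of steps A-B and of the LPC-to-LSF conversion of step 2-3) follows (a NumPy illustration; the Itakura-Saito re-estimation loop of steps C-E is omitted, and an even prediction order is assumed so that the trivial roots at z = ±1 can be deflated):

```python
import numpy as np

def lpc_from_harmonics(amps, w0, order=10):
    """Steps A-B: read squared harmonic amplitudes as power-spectrum
    samples, build an autocorrelation from them (Wiener-Khinchin), and
    solve the Toeplitz normal equations by the Levinson-Durbin recursion.
    Initial estimate only; no Itakura-Saito refinement."""
    freqs = w0 * np.arange(1, len(amps) + 1)           # frequencies l*w0
    pw = np.asarray(amps, float) ** 2                  # PW(w_l) = AM_l^2
    k = np.arange(order + 1)
    R = (pw * np.cos(np.outer(k, freqs))).sum(axis=1)  # R_k from sampled PSD
    a = np.zeros(order + 1)
    a[0], err = 1.0, R[0]
    for i in range(1, order + 1):                      # Levinson-Durbin
        kref = -(R[i] + a[1:i] @ R[i-1:0:-1]) / err
        a[1:i+1] = a[1:i+1] + kref * a[i-1::-1]
        err *= 1.0 - kref ** 2
    return a                                           # A(z), with a[0] = 1

def lsf_from_lpc(a):
    """Step 2-3): LSFs as the positive root angles of P(z) and Q(z) after
    deflating the trivial roots at z = -1 and z = +1 (even order assumed)."""
    P = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    Q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    P = np.polydiv(P, [1.0, 1.0])[0]                   # remove root at z = -1
    Q = np.polydiv(Q, [1.0, -1.0])[0]                  # remove root at z = +1
    ang = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    return np.sort(ang[ang > 0])
```

Because the sampled power spectrum is nonnegative, the resulting Toeplitz system is positive definite and the recursion yields a minimum-phase A(z), so the p line spectral frequencies lie strictly inside (0, π) and are monotonically increasing.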
3) The line spectral frequency parameters of the asymmetric source speech signals obtained in step 2) are used to train the deep belief network without supervision, obtaining the trained deep belief network.
The above step is the "pre-training". The "pre-training" process can take two forms. The first (as shown in Fig. 2a): for a complete deep belief network, in bottom-up order, any two adjacent network layers form a restricted Boltzmann machine (the lower layer is called the input layer, the upper layer the hidden layer, with undirected connections between the two); driven by the raw input data, the structural parameters between the layers are learned with the contrastive divergence method. In addition, the data transfer between the restricted Boltzmann machines satisfies the condition that the output of the hidden layer of the lower Boltzmann machine serves as the input to the input layer of the Boltzmann machine above it. Iterating upward in this manner continues until all structural parameters of the designed network have been "pre-trained". The second (as shown in Fig. 2b): two deep feedforward networks, one forward and one reversed, are spliced into a combined network with an auto-encoder/decoder structure, the line spectral frequency parameters of the speech signal are placed simultaneously at the input and output of this network, and the network structure parameters are "pre-trained" under a regularized stochastic gradient descent criterion.
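The training unit of the first scheme can be sketched as follows (a mean-field simplification of contrastive divergence with binary units, purely illustrative; real-valued line spectral frequency inputs would need Gaussian visible units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) update, in place, for a restricted
    Boltzmann machine with weights W, visible biases b, hidden biases c.
    Mean-field probabilities replace binary sampling for brevity."""
    ph0 = sigmoid(v0 @ W + c)            # hidden activation probabilities
    pv1 = sigmoid(ph0 @ W.T + b)         # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)
    n = len(v0)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

def recon_error(v, W, b, c):
    """Mean squared one-step reconstruction error."""
    return np.mean((v - sigmoid(sigmoid(v @ W + c) @ W.T + b)) ** 2)
```

Stacking such machines, with each hidden layer's output feeding the next input layer, gives the layer-by-layer pre-training described above.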
4) The dynamic time warping algorithm is used to align the line spectral frequency parameters of the symmetric source speech signals obtained in step 2) with the line spectral frequency parameters of the target speech signals.
"Alignment" means that the corresponding line spectral frequencies of source and target attain the minimum distortion distance under a set distortion criterion. The purpose of this is to associate the feature sequences of the source and target speakers at the parameter level, so that the subsequent statistical model can learn the mapping rules between them.
The alignment criterion is: for two characteristic parameter sequences of unequal length, the dynamic time warping algorithm nonlinearly maps the time axis of one onto the time axis of the other, realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized and the search area is restricted, finally obtaining the time matching function.
The dynamic time warping algorithm is briefly outlined as follows:
For the pronunciation of the same sentence, assume that the acoustic personal characteristic parameter sequence of the source speaker is X = {x_1, x_2, …, x_{N_x}} and the characteristic parameter sequence of the target speaker is Y = {y_1, y_2, …, y_{N_y}}, with N_x ≠ N_y. Taking the characteristic parameter sequence of the source speaker as the reference template, the dynamic time warping algorithm searches for a time warping function n_x = φ(n_y) that nonlinearly maps the time axis n_y of the target feature sequence onto the time axis n_x of the source characteristic parameter sequence, so that the total cumulative distortion is minimized; mathematically:

D = min_φ Σ_{n_y=1}^{N_y} d(y_{n_y}, x_{φ(n_y)}) (13)

where d(y_{n_y}, x_{φ(n_y)}) denotes some measure of the distance between the target speaker characteristic parameter of frame n_y and the source speaker characteristic parameter of frame φ(n_y). During the warping, the warping function φ(n_y) must satisfy the following constraints, the boundary conditions and the continuity condition being respectively:

φ(1) = 1, φ(N_y) = N_x (14)
0 ≤ φ(n_y + 1) − φ(n_y) ≤ 2 (15)

Dynamic time warping is an optimization algorithm: it turns a multi-stage decision process into multiple single-stage decision processes, i.e., into multiple subproblems decided one by one, in order to simplify computation. The process generally proceeds stage by stage, and its recursion can be expressed as:

D(n_y+1, n_x) = d(n_y+1, n_x) + min[D(n_y, n_x)·g(n_y, n_x), D(n_y, n_x−1), D(n_y, n_x−2)] (16)

where g(n_y, n_x) is a weighting factor ensuring that the values of n_y, n_x satisfy the constraints of the time warping function.
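The recursion and its backtracking can be sketched as plain textbook DTW (a NumPy illustration with symmetric unit steps rather than the weighted g(n_y, n_x) form above):

```python
import numpy as np

def dtw_align(X, Y):
    """Dynamic time warping between two sequences of feature vectors,
    returning the frame-index pairs of the minimum cumulative-distortion
    path and the total cost. Textbook form, illustrative only."""
    Nx, Ny = len(X), len(Y)
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # local cost
    D = np.full((Nx, Ny), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(Nx):
        for j in range(Ny):
            if i == j == 0:
                continue
            prev = min(D[i-1, j] if i else np.inf,
                       D[i, j-1] if j else np.inf,
                       D[i-1, j-1] if i and j else np.inf)
            D[i, j] = d[i, j] + prev
    # backtrack the optimal warping path from the end point
    path, i, j = [(Nx - 1, Ny - 1)], Nx - 1, Ny - 1
    while (i, j) != (0, 0):
        choices = [(D[i-1, j-1], i-1, j-1) if i and j else (np.inf, 0, 0),
                   (D[i-1, j], i-1, j) if i else (np.inf, 0, 0),
                   (D[i, j-1], i, j-1) if j else (np.inf, 0, 0)]
        _, i, j = min(choices)
        path.append((i, j))
    return path[::-1], D[-1, -1]
```

The returned path plays the role of the time matching function: each source frame is paired with the target frames it aligns to.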
5) The aligned line spectral frequency parameters of the symmetric source speech signals and the line spectral frequency parameters of the target speech signals are used to perform incremental supervised training of the deep forward prediction network, obtaining the trained deep forward prediction network.
The above process of incrementally training the deep feedforward network with a small amount of symmetric data comprises three aspects. First, a network output layer with amplitude-limited soft output characteristics is added on top of the trained deep belief network, constituting a deep feedforward network. Second, the line spectral frequency parameters of the source are used as the input and output of the combined network with the encoding/decoding structure; on the basis of the "pre-training", the output data of the middle layer of the network (as shown in Fig. 2b) are extracted and treated as new characteristic parameters. These new characteristic parameters retain the higher-order statistics of the original line spectral frequency parameters and therefore have better discriminability. Third, the new characteristic parameters of the symmetric source and the line spectral frequency coefficients of the target are used as the input and output parameters of the deep forward network, and the network weight coefficients are adjusted in a supervised manner under the criterion of minimizing the back-propagated error, completing the incremental training of the network.
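The error back-propagation of the third aspect can be sketched with a single hidden layer standing in for the deep stack (a NumPy illustration; all sizes, rates, and epoch counts are illustrative choices, not the patent's):

```python
import numpy as np

def finetune(W_pre, c_pre, src, tgt, lr=0.05, epochs=500):
    """Supervised incremental training sketch: a hidden layer initialised
    from pre-trained parameters (W_pre, c_pre) gets a new linear output
    layer, and both are adjusted by back-propagating the squared error
    between mapped source features and target parameters."""
    rng = np.random.default_rng(1)
    W1, b1 = W_pre.copy(), c_pre.copy()
    W2 = 0.01 * rng.standard_normal((W1.shape[1], tgt.shape[1]))
    b2 = np.zeros(tgt.shape[1])
    n = len(src)
    for _ in range(epochs):
        h = np.tanh(src @ W1 + b1)        # hidden activations
        y = h @ W2 + b2                   # linear output layer
        e = y - tgt                       # prediction error
        gW2, gb2 = h.T @ e / n, e.mean(axis=0)
        dh = (e @ W2.T) * (1 - h ** 2)    # backprop through tanh
        gW1, gb1 = src.T @ dh / n, dh.mean(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    return W1, b1, W2, b2

def predict(params, x):
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1 + b1) @ W2 + b2
```

At conversion time, predict() plays the role of the general function mapping applied in step 6).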
6) The harmonic plus stochastic model is used to decompose the source speech signal to be converted, obtaining the pitch-frequency track of the source speech signal to be converted and the amplitude and phase values of its harmonic vocal-tract spectral parameters. The technical details are identical to those of step 1).
The amplitude and phase values of the harmonic vocal-tract spectral parameters of the source speech signal to be converted are then dimension-reduced: the vocal-tract parameters are converted into linear prediction parameters, which in turn produce the linear spectral frequency parameters suitable for voice conversion. The technical details are identical to those of step 2).
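The dimension-reduction just referenced can be sketched as follows (a sketch under the usual assumptions: the squared harmonic amplitudes are read as power-spectrum samples, the inverse FFT of the mirrored spectrum yields the autocorrelation, and the resulting Toeplitz system is solved with the Levinson-Durbin recursion; the final LPC-to-LSF conversion is omitted):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Toeplitz (Yule-Walker) normal equations for the LPC
    polynomial a(z) = 1 + a1*z^-1 + ... via the Levinson-Durbin recursion."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err                 # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)           # residual prediction error
    return a

def harmonic_amps_to_lpc(amps, order):
    """Harmonic amplitudes -> squared power samples -> autocorrelation -> LPC."""
    power = np.asarray(amps, float) ** 2
    # mirror into a full symmetric spectrum so the inverse FFT is real
    spectrum = np.concatenate([power, power[-2:0:-1]])
    r = np.fft.ifft(spectrum).real
    return levinson_durbin(r, order)

# sanity check: AR(1) autocorrelation r[k] = 0.5**k should give a = [1, -0.5, 0]
r = 0.5 ** np.arange(5)
a = levinson_durbin(r, 2)
```

A flat harmonic spectrum (all amplitudes equal) likewise yields the trivial predictor a = [1, 0, ..., 0], since its autocorrelation is an impulse.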
The deep belief network trained in step 3) is then used to perform feature mapping on the linear spectral frequency parameters of the source speech signal to be converted, obtaining the new feature parameters of the source speech signal to be converted. Finally, the deep feedforward prediction network trained in step 5) is regarded as a general function-mapping model, and the new feature parameters of the source speech signal to be converted are mapped and converted, yielding the linear spectral frequency parameters of the converted speech signal. Specifically, the linear spectral frequency parameters of the source speech signal to be converted are placed at both the input and the output of the combined network with the encoder-decoder structure, and the middle-layer parameters are extracted as the new feature parameters; the trained deep feedforward network then maps the source feature parameters, taking the new feature parameters of the source speech signal to be converted as its input for prediction, and the linear spectral frequency parameters of the converted speech signal are finally produced at the network output.
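The two-stage mapping just described — middle-layer feature extraction followed by forward prediction — can be sketched as a data-flow example (random weights stand in for the trained networks; only the shapes and the order of operations reflect the method):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical weights of the encoder half of the trained codec network
# (in the method these come from the pre-training of step 3).
W_enc = [rng.normal(0, 0.3, (10, 6)), rng.normal(0, 0.3, (6, 3))]

def encode(lsf):
    """Map LSF frames to the middle-layer 'new feature parameters'."""
    h = lsf
    for W in W_enc:
        h = np.tanh(h @ W)
    return h                        # bottleneck output = new features

# Hypothetical trained forward-prediction network (new features -> target LSF)
W_map = rng.normal(0, 0.3, (3, 10))

def convert(lsf_frames):
    feats = encode(lsf_frames)      # stage 1: codec feature mapping
    return np.tanh(feats @ W_map)   # stage 2: deep feedforward prediction

lsf = rng.uniform(-1, 1, (5, 10))   # 5 frames of 10-dimensional LSF-like data
out = convert(lsf)                  # converted LSF parameters, same shape
```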
The Gaussian model of the source speech pitch frequency and the Gaussian model of the target speech pitch frequency obtained in step 1) are used to apply a Gaussian transformation to the pitch-frequency track of the source speech signal to be converted, yielding the pitch-frequency track of the converted speech signal. The pitch-frequency transfer function is:
where f'0 is the pitch frequency after conversion and 2πf0 = ω0.
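The patent's exact transfer function is the formula given above; a standard single-Gaussian mean-variance mapping in the log-frequency domain, stated here as an assumption rather than as the patent's own equation, can be sketched as:

```python
import numpy as np

def gaussian_pitch_transform(f0, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Mean/variance Gaussian mapping of a pitch track, applied in the
    log-frequency domain (assumed standard form, not the patent's own)."""
    log_f0 = np.log(f0)
    return np.exp(mu_tgt + (log_f0 - mu_src) * sigma_tgt / sigma_src)

# source model: mean/std of log-F0 estimated from the source pitch tracks
src_f0 = np.array([100.0, 110.0, 120.0, 105.0])
mu_s, sd_s = np.log(src_f0).mean(), np.log(src_f0).std()
# target model (hypothetical): a speaker pitched exactly one octave higher
tgt_f0 = np.array([200.0, 220.0, 240.0, 210.0])
mu_t, sd_t = np.log(tgt_f0).mean(), np.log(tgt_f0).std()

converted = gaussian_pitch_transform(src_f0, mu_s, sd_s, mu_t, sd_t)
```

For this contrived octave-shifted target, the mapping reduces to doubling every pitch value, which makes the transform easy to verify by hand.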
7) The linear spectral frequency parameters of the converted speech signal are inverse-transformed into harmonic plus noise model coefficients, which are then combined with the pitch-frequency track of the converted speech signal for speech synthesis, yielding the converted speech signal. The detailed steps are as follows:
A. The obtained Al, f0 and θl are used to synthesize the kth frame of speech according to the definition of the sinusoidal model, namely:
B. To reduce the error produced at frame transitions, the splicing-adding (overlap-add) method is used to synthesize the whole utterance; that is, for any two adjacent frames:
where N is the number of samples contained in one speech frame. Interpolation and phase compensation are used so that the reconstructed voiced signal exhibits no distortion in the time-domain waveform.
C. For unvoiced frames, an approximate reconstructed signal is obtained by passing a white-noise signal through an all-pole filter whose coefficients are the linear prediction coefficients obtained in step e of training step 1).
D. The reconstructed voiced signal and unvoiced signal are added together, yielding the synthesized speech.
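Steps A and B can be sketched as follows (a minimal sketch: each voiced frame is a sum of harmonics A_l·cos(2π·l·f0·t + θ_l), and adjacent frames are cross-faded with complementary linear windows as a simple form of the splicing-adding method):

```python
import numpy as np

def synth_frame(amps, phases, f0, fs, n):
    """Synthesize n samples of one voiced frame under the sinusoidal model:
    s(t) = sum_l A_l * cos(2*pi*l*f0*t + theta_l)."""
    t = np.arange(n) / fs
    l = np.arange(1, len(amps) + 1)[:, None]          # harmonic numbers
    return (amps[:, None]
            * np.cos(2 * np.pi * l * f0 * t + phases[:, None])).sum(axis=0)

def overlap_add(frame_a, frame_b):
    """Cross-fade two adjacent frames with complementary linear windows so no
    discontinuity is produced at the frame boundary (splicing-adding step)."""
    n = len(frame_a)
    w = np.arange(n) / n
    return (1 - w) * frame_a + w * frame_b

fs, n = 8000, 160                                      # 20 ms frames at 8 kHz
amps, phases = np.array([1.0, 0.5]), np.zeros(2)
a = synth_frame(amps, phases, 100.0, fs, n)            # current frame
b = synth_frame(amps, phases, 110.0, fs, n)            # next frame, new pitch
x = overlap_add(a, b)                                  # boundary region
```

A full synthesizer would additionally interpolate amplitudes and compensate phases between frames, as step B prescribes, and add the filtered-noise unvoiced component of step C.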
The asymmetric voice conversion method based on deep neural network feature mapping of the present invention can be used for personalized voice disguise in secure communication: for example, some parameters of a speaker's voice are altered according to a determined rule by the voice conversion technique, and the original voice is synthesized by the inverse transformation at the receiving end; if the transmission is intercepted, the eavesdropper hears the voice of another speaker, achieving the function of speaker disguise. It can also be applied in multimedia entertainment: in film dubbing, especially dubbing into another language, the dubbing actor is often not the original performer, so the dubbed voice usually differs greatly from the personal characteristics of the original performer and the dubbing effect is unsatisfactory; if the dubbed speech is further converted so that it carries the personal characteristics of the original performer, the dubbing effect is greatly improved. For speech-enhancement systems, particularly for patients whose vocal organs such as the vocal cords are diseased or damaged, speech quality is severely degraded, the listener can hardly understand it, and normal communication and exchange are seriously affected; if such severely degraded speech can be converted into a clearly intelligible voice, the normal life of such patients is greatly facilitated.
Although the present invention is disclosed above with preferred embodiments, the embodiments are not intended to limit the present invention. Any equivalent change or modification made without departing from the spirit and scope of the present invention also belongs to the protection scope of the present invention. The protection scope of the present invention shall therefore be defined by the claims of this application.
Claims (5)
1. An asymmetric voice conversion method based on deep neural network feature mapping, characterized in that it comprises the following steps:
1) on the basis of the existing source speech signals, source speech signals having the same semantic content as the collected target speech signals are gathered, forming a training speech signal set comprising asymmetric source speech signals, symmetric source speech signals and target speech signals;
the harmonic plus stochastic model is used to decompose the training speech signals, respectively obtaining the pitch-frequency track of the asymmetric source speech signal, the amplitude and phase values of the harmonic vocal-tract spectral parameters of the asymmetric source speech signal, the pitch-frequency track of the symmetric source speech signal, the pitch-frequency track of the target speech signal, the amplitude and phase values of the harmonic vocal-tract spectral parameters of the symmetric source speech signal, and the amplitude and phase values of the harmonic vocal-tract spectral parameters of the target speech signal;
according to the pitch-frequency track of the symmetric source speech signal and the pitch-frequency track of the target speech signal, the Gaussian model of the source speech pitch frequency and the Gaussian model of the target speech pitch frequency are established;
2) dimension reduction is applied respectively to the amplitude and phase values of the harmonic vocal-tract spectral parameters of the asymmetric source speech signal, the symmetric source speech signal and the target speech signal; the vocal-tract parameters are converted into linear prediction parameters, which in turn produce the linear spectral frequency parameters suitable for voice conversion;
3) the linear spectral frequency parameters of the asymmetric source speech signal obtained in step 2) are used to carry out unsupervised training on the deep belief network, obtaining the trained deep belief network;
the unsupervised training of the deep belief network is carried out in the following two stages:
3-1) every two adjacent layers form a restricted Boltzmann machine, which is trained with the contrastive divergence method; all the Boltzmann machines are then stacked to constitute a complete deep belief network, and the set of weight coefficients in this network constitutes the candidate space of network parameters;
3-2) two deep feedforward networks are spliced in forward and reverse order to constitute a combined network with an auto-encoder/decoder structure; the linear spectral frequency coefficients of the speech signal are placed at both the input and the output, and the structural parameters of the network are learned under the regularized stochastic gradient descent criterion;
4) the dynamic time warping algorithm is used to align the linear spectral frequency parameters of the symmetric source speech signal obtained in step 2) with the linear spectral frequency parameters of the target speech signal;
5) the linear spectral frequency parameters of the aligned symmetric source speech signal and the linear spectral frequency parameters of the target speech signal are used to carry out incremental supervised training on the deep feedforward prediction network, obtaining the trained deep feedforward prediction network;
the process of carrying out incremental supervised training on the deep feedforward prediction network is as follows:
5-1) a network output layer with an amplitude-limited soft-output characteristic is added on top of the deep belief network trained in step 3), thereby constituting the deep feedforward network;
5-2) the linear spectral frequency coefficients of the aligned symmetric source speech signal are processed in the manner of step 3-2), and the middle-layer parameters of the network are extracted as the new feature parameters of the symmetric source speech signal;
5-3) the new feature parameters of the symmetric source speech signal and the linear spectral frequency coefficients of the target speech signal serve as the input and output of the deep feedforward network; the network weight coefficients are adjusted under the criterion of minimizing the back-propagated transmission error, completing the incremental training of the network;
6) the harmonic plus stochastic model is used to decompose the source speech signal to be converted, obtaining the pitch-frequency track of the source speech signal to be converted and the amplitude and phase values of its harmonic vocal-tract spectral parameters;
dimension reduction is applied to the amplitude and phase values of the harmonic vocal-tract spectral parameters of the source speech signal to be converted: the vocal-tract parameters are converted into linear prediction parameters, which in turn produce the linear spectral frequency parameters suitable for voice conversion; the deep belief network trained in step 3) then performs feature mapping on the linear spectral frequency parameters of the source speech signal to be converted, obtaining the new feature parameters of the source speech signal to be converted; finally, the deep feedforward prediction network trained in step 5) is regarded as a general function-mapping model, and the new feature parameters of the source speech signal to be converted are mapped and converted, yielding the linear spectral frequency parameters of the converted speech signal;
the Gaussian model of the source speech pitch frequency and the Gaussian model of the target speech pitch frequency obtained in step 1) are used to apply a Gaussian transformation to the pitch-frequency track of the source speech signal to be converted, yielding the pitch-frequency track of the converted speech signal;
7) the linear spectral frequency parameters of the converted speech signal are inverse-transformed into harmonic plus noise model coefficients, which are then combined with the pitch-frequency track of the converted speech signal for speech synthesis, yielding the converted speech signal.
2. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in said step 1), the process of decomposing the original speech signal with the harmonic plus stochastic model is as follows:
1-1) the original speech signal is divided into frames of fixed duration, and the pitch frequency is estimated with the correlation method;
1-2) for the voiced signal, a maximum voiced frequency component is set to separate the main energy regions of the harmonic component and the random component; the least-squares algorithm is then used to estimate the amplitude and phase values of the discrete harmonic vocal-tract spectral parameters;
1-3) for the unvoiced signal, the classical linear prediction analysis method is applied directly, yielding the linear prediction coefficients.
3. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in said step 2), the process of converting the vocal-tract parameters into linear prediction parameters and then producing the linear spectral frequency parameters suitable for voice conversion is as follows:
2-1) the amplitude values of the discrete harmonic vocal-tract spectral parameters are squared and construed as sampled values of the discrete power spectrum;
2-2) according to the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients is obtained, and the linear prediction coefficients are obtained by solving this equation;
2-3) the linear prediction coefficients are converted into linear spectral frequency coefficients.
4. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in said step 4), the alignment criterion is as follows: for two feature parameter sequences of unequal length, the dynamic time warping algorithm maps the time axis of one nonlinearly onto the time axis of the other, thereby realizing a one-to-one matching relation; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized and the search region is restricted, finally obtaining the time-matching function.
5. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that the speech synthesis process in said step 7) is as follows:
7-1) the amplitude and phase values of the discrete harmonic vocal-tract spectral parameters of the voiced signal are used as the amplitude and phase values of sinusoidal signals, which are superposed to obtain the reconstructed voiced signal; interpolation and phase compensation are used so that the reconstructed voiced signal exhibits no distortion in the time-domain waveform;
7-2) a white-noise signal is passed through an all-pole filter to obtain the reconstructed unvoiced signal;
7-3) the reconstructed voiced signal and the reconstructed unvoiced signal are superposed, yielding the converted speech signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310468769.1A CN103531205B (en) | 2013-10-09 | 2013-10-09 | The asymmetrical voice conversion method mapped based on deep neural network feature |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103531205A CN103531205A (en) | 2014-01-22 |
CN103531205B true CN103531205B (en) | 2016-08-31 |
Family
ID=49933157
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104464744A (en) * | 2014-11-19 | 2015-03-25 | 河海大学常州校区 | Cluster voice transforming method and system based on mixture Gaussian random process |
CN104392717A (en) * | 2014-12-08 | 2015-03-04 | 常州工学院 | Sound track spectrum Gaussian mixture model based rapid voice conversion system and method |
CN104867489B (en) * | 2015-04-27 | 2019-04-26 | 苏州大学张家港工业技术研究院 | A kind of simulation true man read aloud the method and system of pronunciation |
CN105005783B (en) * | 2015-05-18 | 2019-04-23 | 电子科技大学 | The method of classification information is extracted from higher-dimension asymmetric data |
CN105118498B (en) * | 2015-09-06 | 2018-07-31 | 百度在线网络技术(北京)有限公司 | The training method and device of phonetic synthesis model |
CN106203624B (en) * | 2016-06-23 | 2019-06-21 | 上海交通大学 | Vector Quantization and method based on deep neural network |
CN106057192A (en) * | 2016-07-07 | 2016-10-26 | Tcl集团股份有限公司 | Real-time voice conversion method and apparatus |
CN107871497A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | Audio recognition method and device |
WO2018085697A1 (en) | 2016-11-04 | 2018-05-11 | Google Llc | Training neural networks using a variational information bottleneck |
US10902312B2 (en) * | 2017-03-28 | 2021-01-26 | Qualcomm Incorporated | Tracking axes during model conversion |
CN107464569A (en) * | 2017-07-04 | 2017-12-12 | 清华大学 | Vocoder |
CN107545903B (en) * | 2017-07-19 | 2020-11-24 | 南京邮电大学 | Voice conversion method based on deep learning |
CN107886967B (en) * | 2017-11-18 | 2018-11-13 | 中国人民解放军陆军工程大学 | A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network |
JP6733644B2 (en) * | 2017-11-29 | 2020-08-05 | ヤマハ株式会社 | Speech synthesis method, speech synthesis system and program |
CN108417207B (en) * | 2018-01-19 | 2020-06-30 | 苏州思必驰信息科技有限公司 | Deep hybrid generation network self-adaption method and system |
CN109147806B (en) * | 2018-06-05 | 2021-11-12 | 安克创新科技股份有限公司 | Voice tone enhancement method, device and system based on deep learning |
CN110164414B (en) * | 2018-11-30 | 2023-02-14 | 腾讯科技(深圳)有限公司 | Voice processing method and device and intelligent equipment |
CN109637551A (en) * | 2018-12-26 | 2019-04-16 | 出门问问信息科技有限公司 | Phonetics transfer method, device, equipment and storage medium |
CN110085255B (en) * | 2019-03-27 | 2021-05-28 | 河海大学常州校区 | Speech conversion Gaussian process regression modeling method based on deep kernel learning |
CN114223032A (en) * | 2019-05-17 | 2022-03-22 | 重庆中嘉盛世智能科技有限公司 | Memory, microphone, audio data processing method, device, equipment and system |
CN110992739B (en) * | 2019-12-26 | 2021-06-01 | 上海松鼠课堂人工智能科技有限公司 | Student on-line dictation system |
CN111524526B (en) * | 2020-05-14 | 2023-11-17 | 中国工商银行股份有限公司 | Voiceprint recognition method and voiceprint recognition device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103049792A (en) * | 2011-11-26 | 2013-04-17 | 微软公司 | Discriminative pretraining of Deep Neural Network |
CN103280224A (en) * | 2013-04-24 | 2013-09-04 | 东南大学 | Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm |
Non-Patent Citations (1)
Title |
---|
Chen, Y., Chu, M., Chang, E., Liu, J., Liu, R. "Voice conversion with smoothed GMM and MAP adaptation." INTERSPEECH, 2003, pp. 2413–2416. * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2019-03-29. Address after: No. 3, Courtyard No. 5, Di Kam Road, Haidian District, Beijing. Patentee after: BYZORO NETWORK LTD. Address before: No. 1 Wushan Road, Xinbei District, Changzhou 213022, Jiangsu Province. Patentee before: Changzhou Polytechnic College |