CN103531205B - Asymmetric voice conversion method based on deep neural network feature mapping - Google Patents

Asymmetric voice conversion method based on deep neural network feature mapping

Info

Publication number
CN103531205B
CN103531205B · CN201310468769.1A · CN201310468769A
Authority
CN
China
Prior art keywords
voice signal
parameter
signal
voice
network
Prior art date
Legal status
Active
Application number
CN201310468769.1A
Other languages
Chinese (zh)
Other versions
CN103531205A (en)
Inventor
鲍静益
徐宁
Current Assignee
BYZORO NETWORK LTD.
Original Assignee
Changzhou Institute of Technology
Priority date
Filing date
Publication date
Application filed by Changzhou Institute of Technology filed Critical Changzhou Institute of Technology
Priority to CN201310468769.1A priority Critical patent/CN103531205B/en
Publication of CN103531205A publication Critical patent/CN103531205A/en
Application granted granted Critical
Publication of CN103531205B publication Critical patent/CN103531205B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses an asymmetric voice conversion method based on deep neural network feature mapping, belonging to the field of voice conversion technology. For asymmetric (non-parallel) source and target speech data, the method first applies the pre-training function of a deep network to model the data probabilistically; by distilling the higher-order statistical characteristics contained in the speech signal, it provides a good candidate space for the network coefficients. Second, a small amount of symmetric (parallel) data is used for incremental learning, and the network weight coefficients are corrected through the optimized back-propagated error, thereby realizing the mapping of the characteristic parameters. The invention optimizes the network coefficient structure and uses it as the initial parameter values of a deep forward prediction network; then, during incremental learning on the small amount of symmetric data, backward propagation optimizes the network structure parameters, realizing the mapping of the speakers' individual characteristic parameters.

Description

Asymmetric voice conversion method based on deep neural network feature mapping
Technical field
The invention belongs to the field of voice conversion technology, and specifically relates to an asymmetric voice conversion method based on deep neural network feature mapping.
Background technology
Voice conversion technology, briefly, transforms the voice of one speaker (the source) by some means so that it sounds like the voice of another speaker (the target). Voice conversion is an interdisciplinary branch: its content involves knowledge from fields such as phonetics, semantics, and psychoacoustics, and it also covers many aspects of speech signal processing, such as speech analysis and synthesis, speaker recognition, speech coding, and speech enhancement.
The ultimate goal of voice conversion is to provide an instant service that automatically and rapidly adapts to any speaker's voice; such a system would need little or no user-specific training to work well for all users under all kinds of conditions. However, current voice conversion technology does not yet achieve this. On the one hand, present systems strictly constrain the way users form sentences (i.e., parallel, or "symmetric", training data are required); on the other hand, they also need a large amount of data to train the system.
Some countermeasures already exist for the above problems. For the "asymmetric data" (non-parallel corpus) problem, some researchers proposed first partitioning the feature spaces of the source and target speakers with a vector quantization algorithm, then comparing template distances after vocal tract length normalization to select the source codewords corresponding to the speaker, and finally finding the closest matching speech frames within the same codeword space by k-nearest-neighbor search. Salor et al. proposed solving the problem with a dynamic programming algorithm whose core idea is to build a cost function that simultaneously minimizes the error between source and target and the error between the target's previous and current frames. For the "reducing the data volume" problem, Helander et al. proposed modeling the coupling relations between characteristic parameters and exploiting them to improve system robustness when data are scarce. In addition, others proposed learning the traditional Gaussian mixture model with variational Bayesian analysis to strengthen its modeling ability when data are sparse.
A search found Chinese patent application No. ZL201210229540.8, published October 17, 2012, entitled "Method of voice conversion based on LPC and RBF neural network". That application relates to a voice conversion method based on LPC and an RBF neural network, comprising the following steps: A. pre-processing the speech; B. performing pitch detection on the voiced frames; C. converting the voiced frames after pitch detection; D. extracting voiced-frame parameters from the converted pitch; E. computing the extracted voiced-frame parameters to obtain a voiced frame, and then synthesizing this frame to obtain the converted voiced frame. That application proposes a voice conversion scheme of high quality and moderate computational cost, but its shortcoming is that it decomposes the speech to be converted into unvoiced and voiced parts and further splits the voiced part into pitch, energy, LPC and LSF coefficients for conversion; adding the energy measurement increases the measurement difficulty and error, which easily leads to unsatisfactory converted speech quality.
Summary of the invention
The object of the invention is to overcome the deficiencies of prior-art voice conversion systems, which strictly constrain the way users form sentences, require a large amount of training data, and still yield unsatisfactory converted speech quality, by providing an asymmetric voice conversion method based on deep neural network feature mapping. With the technical scheme provided by the invention, and aimed at the sharp performance degradation that voice conversion systems suffer in real environments under non-parallel data and data scarcity, the two hitherto independent problems above are integrated and studied under a unified theoretical framework: a deep neural network is first trained on the raw data in an unsupervised fashion to distill the higher-order statistical information it contains, and supervised forward prediction training is then performed on this basis, finally improving the generalization of the voice conversion system under practical conditions.
The basic principle of the invention is as follows. For asymmetric source and target speech data, the pre-training function of a deep neural network is first used to model the data probabilistically; by distilling the higher-order statistical characteristics contained in the speech signal, a good candidate space for the network coefficients is provided. Second, a small amount of symmetric data is used for incremental learning, and the network weight coefficients are corrected through the optimized back-propagated error, thereby realizing the mapping of the characteristic parameters.
Specifically, the invention is realized by the following technical scheme, comprising the following steps:
1) On the basis of the existing source speech signals, source speech signals having the same semantic content as the collected target speech signals are recorded, forming training speech signals that comprise asymmetric source speech signals, symmetric source speech signals, and target speech signals;
the training speech signals are decomposed with a harmonic plus stochastic model, yielding the pitch frequency trajectory of the asymmetric source speech signal, the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech signal, the pitch frequency trajectory of the symmetric source speech signal, the pitch frequency trajectory of the target speech signal, the amplitude and phase values of the harmonic vocal tract spectrum parameters of the symmetric source speech signal, and the amplitude and phase values of the harmonic vocal tract spectrum parameters of the target speech signal;
from the pitch frequency trajectories of the symmetric source speech signal and of the target speech signal, a Gaussian model of the source pitch frequency and a Gaussian model of the target pitch frequency are established;
2) dimension reduction is applied to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech signal, of the symmetric source speech signal, and of the target speech signal; the vocal tract parameters are converted into linear prediction parameters, which in turn produce line spectral frequency (LSF) parameters suitable for voice conversion;
3) the LSF parameters of the asymmetric source speech signal obtained in step 2) are used to train a deep belief network without supervision, yielding a trained deep belief network;
4) using the dynamic time warping algorithm, the LSF parameters of the symmetric source speech signal obtained in step 2) are aligned with the LSF parameters of the target speech signal;
5) the aligned LSF parameters of the symmetric source speech signal and the LSF parameters of the target speech signal are used to perform incremental supervised training of a deep forward prediction network, yielding a trained deep forward prediction network;
6) the source speech signal to be converted is decomposed with the harmonic plus stochastic model, yielding its pitch frequency trajectory and the amplitude and phase values of its harmonic vocal tract spectrum parameters;
dimension reduction is applied to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the source speech signal to be converted, the vocal tract parameters are converted into linear prediction parameters, and LSF parameters suitable for voice conversion are produced; the deep belief network trained in step 3) then performs feature mapping on the LSF parameters of the source speech signal to be converted, giving its new characteristic parameters; finally the deep forward prediction network trained in step 5) is regarded as a general function-mapping device, and the new characteristic parameters of the source speech signal to be converted are mapped and converted, giving the LSF parameters of the converted speech signal;
using the Gaussian models of the source and target pitch frequencies obtained in step 1), the pitch frequency trajectory of the source speech signal to be converted is Gaussian-transformed, giving the pitch frequency trajectory of the converted speech signal;
7) the LSF parameters of the converted speech signal are transformed back into harmonic plus noise model coefficients and, together with the pitch frequency trajectory of the converted speech signal, used for speech synthesis, yielding the converted speech signal.
A further feature of the above technical scheme is that in step 1), the process of decomposing the original speech signal with the harmonic plus stochastic model is as follows:
1-1) the original speech signal is divided into frames of fixed duration, and the pitch frequency is estimated with the correlation method;
1-2) for voiced signals, a maximum voiced frequency component is set to divide the main energy regions of the harmonic component and the stochastic component; a least-squares algorithm is then used to estimate the amplitude and phase values of the discrete harmonic vocal tract spectrum parameters;
1-3) unvoiced signals are analyzed directly with the classical linear prediction analysis method, yielding linear prediction coefficients.
A further feature of the above technical scheme is that in step 2), the process of converting the vocal tract parameters into linear prediction parameters and producing the LSF parameters suitable for voice conversion is as follows:
2-1) the amplitude values of the discrete harmonic vocal tract spectrum parameters are squared and interpreted as sampled values of a discrete power spectrum;
2-2) from the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients is obtained, and solving this equation yields the linear prediction coefficients;
2-3) the linear prediction coefficients are converted to line spectral frequency coefficients.
A further feature of the above technical scheme is that in step 3), the unsupervised training of the deep belief network takes the following two forms:
3-1) every two adjacent layers form a restricted Boltzmann machine that is trained with the contrastive divergence method; all the Boltzmann machines are then stacked into a complete deep belief network, and the set of weight coefficients in this network constitutes the candidate space of network parameters;
3-2) two deep feedforward networks, one forward and one reversed, are spliced together into a combined network with an auto-encoder/decoder structure; the LSF coefficients of the speech signal are placed at both its input and its output, and the network structure parameters are learned under a regularized stochastic gradient descent criterion.
A further feature of the above technical scheme is that in step 4), the alignment criterion is: for two feature parameter sequences of unequal length, the dynamic time warping algorithm maps the time axis of one nonlinearly onto the time axis of the other, realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, finally yielding the time-matching function.
A further feature of the above technical scheme is that in step 5), the incremental supervised training of the deep forward prediction network proceeds as follows:
5-1) a network output layer with an amplitude-limited soft output characteristic is added on top of the deep belief network trained in step 3), forming a deep feedforward network;
5-2) the LSF coefficients of the aligned symmetric source speech signal are processed in the manner of step 3-2), and the mid-layer parameters of the network are extracted as the new characteristic parameters of the symmetric source speech signal;
5-3) the new characteristic parameters of the symmetric source speech signal and the LSF coefficients of the target speech signal are taken as the input and output of the deep feedforward network, and the network weight coefficients are adjusted under the criterion of minimizing the back-propagated error, completing the incremental training of the network.
A further feature of the above technical scheme is that the speech synthesis process in step 7) is as follows:
7-1) the amplitude and phase values of the discrete harmonic vocal tract spectrum parameters of the voiced signal are used as the amplitudes and phases of sinusoids, which are superposed to obtain the reconstructed voiced signal; interpolation and phase compensation are used so that the reconstructed voiced signal has no distortion in its time-domain waveform;
7-2) a white noise signal is passed through an all-pole filter to obtain the reconstructed unvoiced signal;
7-3) the reconstructed voiced signal and the reconstructed unvoiced signal are superposed, yielding the converted speech signal.
The beneficial effects of the invention are as follows. The asymmetric voice conversion method based on deep neural network feature mapping makes full use of the common features of the "asymmetric data" and "data scarcity" problems and designs a data acquisition and integration scheme covering both situations. On this basis, a deep belief network learns the structural features of the asymmetric data, optimizing the network coefficient structure, which serves as the initial parameter values of the deep forward prediction network; then, during incremental learning on a small amount of symmetric data, backward propagation optimizes the network structure parameters, realizing the mapping of the speaker's individual characteristic parameters.
Accompanying drawing explanation
Fig. 1 is a block diagram of the training and conversion stages of the voice conversion system according to the invention;
Fig. 2 is a schematic diagram of the pre-training modes of the deep belief network according to the invention.
Detailed description of the invention
The invention is described in further detail below with reference to the accompanying drawings and an example.
To handle the "asymmetric data" and "data scarcity" problems in real environments effectively, the invention designs the following data acquisition and integration scheme for subsequent operation. In most application scenarios, the collection of the target speaker's voice data is generally passive and therefore relatively difficult, which often leads to a shortage of data; by contrast, the collection of the source speaker's voice data is more proactive, so it is relatively easy and the data volume is usually sufficient. Therefore, on the basis of the existing source speech data, the source speaker re-records, according to the collected target speech, a small amount of voice data containing the same semantic content as a reference (the source speaker records a small amount of speech incrementally). In this way, although the source and target data as a whole are asymmetric, they contain a small amount of symmetric data.
Accordingly, with reference to Fig. 1 and Fig. 2, the asymmetric voice conversion method based on deep neural network feature mapping of this embodiment comprises a training stage and a conversion stage; steps 1) to 5) below constitute the training stage, and steps 6) to 7) the conversion stage:
1) On the basis of the existing source speech signals, source speech signals with the same semantic content as the collected target speech signals are recorded, forming training speech signals that comprise asymmetric source speech signals, symmetric source speech signals, and target speech signals.
The training speech signals are decomposed with the harmonic plus stochastic model, yielding the pitch frequency trajectory of the asymmetric source speech signal, the amplitude and phase values of its harmonic vocal tract spectrum parameters, the pitch frequency trajectories of the symmetric source speech signal and of the target speech signal, and the amplitude and phase values of the harmonic vocal tract spectrum parameters of the symmetric source speech signal and of the target speech signal.
The harmonic plus stochastic model decomposes the original speech signal in the following steps:
A. The speech signal is divided into frames, with a frame length of 20 ms and a frame shift of 10 ms.
B. The pitch frequency is estimated in every frame with the correlation method; if a frame is unvoiced, its pitch frequency is set to zero.
C. For voiced frames (i.e., frames whose pitch frequency is not zero), assume that the speech signal $s_h(n)$ can be represented as a superposition of sinusoids:

$$s_h(n) = \sum_{l=-L}^{L} C_l\, e^{j l \omega_0 n} \qquad (1)$$

where L is the number of sinusoids, $\{C_l\}$ are the complex amplitudes of the sinusoids, $\omega_0$ is the fundamental frequency, and n indexes the samples of the speech. Let $\mathbf{s}_h$ denote the vector formed by the samples of $s_h(n)$ within one frame; then (1) can be rewritten in matrix form as

$$\mathbf{s}_h = B\Delta \qquad (2)$$

where N is the total number of samples in one frame, B is the matrix of complex exponential basis vectors, and Δ is the vector of the $\{C_l\}$. The $\{C_l\}$ can be determined by a least-squares algorithm, i.e. by minimizing

$$\varepsilon = \sum_{n=-N/2}^{N/2} w^2(n)\,\bigl(s(n) - s_h(n)\bigr)^2 \qquad (3)$$

where s(n) is the actual speech signal, w(n) is a window function (typically a Hamming window), and ε denotes the error. Rewriting the window function in matrix form as well,

$$W = \mathrm{diag}\bigl(w(n)\bigr) \qquad (4)$$

the optimum Δ is obtained as

$$WB\Delta = W\mathbf{s} \;\Rightarrow\; \Delta_{opt} = \bigl(B^H W^H W B\bigr)^{-1} B^H W^H W\, \mathbf{s} \qquad (5)$$

where the superscript H denotes the conjugate transpose and $\mathbf{s}$ is the vector formed by the samples of the actual speech signal s(n) within the frame.

D. With $\{C_l\}$ obtained, the harmonic amplitudes and phase values are

$$AM_l = 2|C_l| = 2|C_{-l}|, \qquad \theta_l = \arg C_l = -\arg C_{-l} \qquad (6)$$

E. For unvoiced frames, the original frame signal is analyzed directly with the classical linear prediction analysis method to obtain the corresponding linear prediction coefficients.
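For illustration, a minimal Python sketch of steps C and D follows; the function and variable names are hypothetical, NumPy and a Hamming window are assumed, and the code simply forms the basis matrix of (2), solves the weighted least-squares problem (5), and reads off amplitudes and phases per (6).

```python
import numpy as np

def analyze_voiced_frame(s, w0, L):
    """Weighted least-squares harmonic analysis of one voiced frame,
    a sketch of equations (1)-(6). s: frame samples, w0: fundamental
    angular frequency in rad/sample, L: number of harmonics."""
    N = len(s)
    n = np.arange(N) - N // 2                  # frame centered at n = 0
    w = np.hamming(N)                          # window w(n) of eq. (3)
    l = np.arange(-L, L + 1)
    B = np.exp(1j * w0 * np.outer(n, l))       # basis matrix of eq. (2)
    WB, Ws = np.diag(w) @ B, w * s
    # Delta_opt = (B^H W^H W B)^{-1} B^H W^H W s   (eq. 5; W real diagonal)
    C = np.linalg.solve(WB.conj().T @ WB, WB.conj().T @ Ws)
    C_pos = C[L + 1:]                          # keep harmonics l = 1..L
    return 2 * np.abs(C_pos), np.angle(C_pos)  # AM_l, theta_l  (eq. 6)
```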
Since the pitch frequency trajectories of the symmetric source speech signal and of the target speech signal can be regarded as obeying single Gaussian distributions, Gaussian models of the source and target pitch frequencies can be established from these trajectories.
From these models one estimates the Gaussian parameters, namely the mean μx and standard deviation σx of the source pitch frequency model, and the mean μy and standard deviation σy of the target pitch frequency model.
2) Dimension reduction is applied to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech signal, of the symmetric source speech signal, and of the target speech signal; the vocal tract parameters are converted into linear prediction parameters, which in turn produce the LSF parameters suitable for voice conversion.
The reason for step 2 is that the original harmonic plus noise model parameters have a high dimensionality that is inconvenient for subsequent computation, so their dimension must be reduced. Since the pitch contour is one-dimensional, the main objects of dimension reduction are the vocal tract amplitude spectrum parameters and the phase parameters. The goal of the reduction is to convert the vocal tract parameters into classical linear prediction parameters and from them produce the LSF parameters suitable for the voice conversion system. The solution procedure is as follows:
A. Each of the L discrete amplitude values $AM_l$ is squared and interpreted as a sampled value $PW(\omega_l)$ of a discrete power spectrum, where $\omega_l$ denotes the frequency at the l-th integer multiple of the fundamental.

B. By the Wiener–Khinchin theorem, the autocorrelation function and the power spectral density function form a Fourier transform pair, i.e. $R(n) \leftrightarrow PW(\omega)$. A preliminary estimate of the linear prediction coefficients is therefore obtained by solving

$$\begin{bmatrix} R_0 & R_1 & \cdots & R_{p-1} \\ R_1 & R_0 & \cdots & R_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ R_{p-1} & R_{p-2} & \cdots & R_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = -\begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_p \end{bmatrix} \qquad (7)$$

where $a_1, a_2, \ldots, a_p$ are the coefficients of the p-th order linear prediction filter A(z), and $R_0 \sim R_p$ are the values of the autocorrelation function at the integer lags 0 through p.

C. The all-pole model represented by the p-th order linear prediction coefficients is converted into a time-domain impulse response $h^*[n]$:

$$h^*[n] = \frac{1}{L}\,\mathrm{Re}\left\{\sum_l \frac{1}{A\bigl(e^{j\omega_l}\bigr)}\, e^{j\omega_l n}\right\} \qquad (8)$$

where $A\bigl(e^{j\omega_l}\bigr) = A(z)\big|_{z=e^{j\omega_l}} = 1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_p z^{-p}$. It can be shown that $h^*$ and the estimated autocorrelation sequence $R^*$ satisfy

$$\sum_{i=0}^{p} a_i R^*(n-i) = h^*[-n] \qquad (9)$$

D. When the Itakura–Saito distance is minimized, the true R and the estimated $R^*$ are related by

$$\sum_{i=0}^{p} a_i R^*(n-i) = \sum_{i=0}^{p} a_i R(n-i) \qquad (10)$$

Substituting (9) into (10) and re-solving the Toeplitz system (7) yields a re-estimated set of linear prediction coefficients.

E. The error is evaluated with the Itakura–Saito criterion; if it exceeds a preset threshold, steps C–E are repeated, otherwise the iteration stops.

The linear prediction coefficients obtained are converted into LSF parameters by jointly solving the following two equations:

$$P(z) = A(z) + z^{-(p+1)}A(z^{-1}), \qquad Q(z) = A(z) - z^{-(p+1)}A(z^{-1}) \qquad (12)$$
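A compact sketch of steps A and B and of the conversion (12), assuming NumPy/SciPy; the Itakura–Saito re-estimation iteration of steps C–E is omitted, and all names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lsf_from_harmonics(AM, w_l, p=16):
    """Discrete power spectrum -> autocorrelation -> LPC -> LSF,
    a sketch of steps A-B and equation (12).
    AM: harmonic amplitudes, w_l: their frequencies in rad/sample."""
    PW = AM ** 2                                        # step A
    # Autocorrelation as an inverse Fourier sum over the discrete spectrum
    R = np.array([np.sum(PW * np.cos(w_l * k)) for k in range(p + 1)])
    a = solve_toeplitz(R[:p], -R[1:p + 1])              # eq. (7)
    A = np.concatenate(([1.0], a))                      # A(z) coefficients
    # P(z) = A(z) + z^{-(p+1)} A(z^{-1}),  Q(z) = A(z) - ...   (eq. 12)
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    # The LSFs are the unit-circle root angles of P and Q, interleaved
    lsf = np.sort(np.angle(np.concatenate((np.roots(P), np.roots(Q)))))
    return lsf[(lsf > 0) & (lsf < np.pi)]
```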
3) The LSF parameters of the asymmetric source speech signal obtained in step 2) are used to train the deep belief network without supervision, yielding the trained deep belief network.
The above step is the "pre-training", which takes two forms. In the first (Fig. 2a), for a complete deep belief network, every two adjacent layers, taken bottom-up, form a restricted Boltzmann machine (the lower layer is called the input layer and the upper one the hidden layer, with undirected connections between them); driven by the raw input data, the structural parameters between the layers are learned with the contrastive divergence method. Data transfer between the restricted Boltzmann machines satisfies the condition that the hidden-layer output of the lower Boltzmann machine serves as the input-layer input of the one above. Iterating upward in this manner continues until all the structural parameters of the designed network have been "pre-trained". In the second form (Fig. 2b), two deep feedforward networks, one forward and one reversed, are spliced into a combined network with an auto-encoder/decoder structure; the LSF parameters of the speech signal are simultaneously placed at the input and the output of this network, and the network structure parameters are "pre-trained" under a regularized stochastic gradient descent criterion.
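A minimal sketch of one contrastive-divergence (CD-1) update for a single restricted Boltzmann machine, the building block of the first pre-training form; binary units and NumPy are assumed (the Gaussian-visible variant needed for real-valued LSF inputs is omitted), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01):
    """One CD-1 update of RBM weights W and biases b (visible), c (hidden).
    v0: a batch of visible vectors, one per row. Updates in place."""
    h0_prob = sigmoid(v0 @ W + c)                      # up-pass
    h0 = (rng.random(h0_prob.shape) < h0_prob) * 1.0   # sample hidden units
    v1 = sigmoid(h0 @ W.T + b)                         # down-pass (reconstruction)
    h1_prob = sigmoid(v1 @ W + c)                      # second up-pass
    # Gradient approximation: <v h>_data - <v h>_reconstruction
    W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / len(v0)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (h0_prob - h1_prob).mean(axis=0)
```

In the stacked arrangement described above, the hidden-layer probabilities produced by one trained RBM become the visible-layer input of the next.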
4) Using the dynamic time warping algorithm, the LSF parameters of the symmetric source speech signal obtained in step 2) are aligned with the LSF parameters of the target speech signal.
"Alignment" means that corresponding source and target LSFs have minimal distortion distance under a preset distortion criterion. Its purpose is to associate the feature sequences of the source and target speakers at the parameter level, so that the subsequent statistical model can learn the mapping rules between them.
The alignment criterion is: for two feature parameter sequences of unequal length, the dynamic time warping algorithm maps the time axis of one nonlinearly onto the time axis of the other, realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, finally yielding the time-matching function.
The dynamic time warping algorithm is briefly outlined as follows.

For utterances of the same sentence, suppose the acoustic feature parameter sequence of the source speaker is $X = \{x_1, x_2, \ldots, x_{N_x}\}$ and that of the target speaker is $Y = \{y_1, y_2, \ldots, y_{N_y}\}$, with $N_x \neq N_y$. Taking the feature parameter sequence of the source speaker as the reference template, the dynamic time warping algorithm searches for a time warping function $\varphi(n_y)$ that maps the time axis $n_y$ of the target feature sequence nonlinearly onto the time axis $n_x$ of the source feature sequence so that the total cumulative distortion is minimal; mathematically,

$$D = \min_{\varphi(n_y)} \sum_{n_y=1}^{N_y} d\bigl(y_{n_y},\, x_{\varphi(n_y)}\bigr) \qquad (13)$$

where $d\bigl(y_{n_y}, x_{\varphi(n_y)}\bigr)$ denotes some measured distance between the target speaker's feature parameters of frame $n_y$ and the source speaker's feature parameters of frame $\varphi(n_y)$. During warping, the function $\varphi(n_y)$ must satisfy constraints; the boundary conditions and the continuity condition are, respectively,

$$\varphi(1) = 1, \qquad \varphi(N_y) = N_x \qquad (14)$$

$$\varphi(n_y + 1) - \varphi(n_y) \in \{0, 1, 2\} \qquad (15)$$

Dynamic time warping is an optimization algorithm: it turns a multistage decision process into multiple single-stage decision processes, i.e. into a sequence of subproblems decided one by one, in order to simplify the computation. The process generally proceeds stage by stage, and its recursion can be expressed as

$$D(n_y+1, n_x) = d(n_y+1, n_x) + \min\bigl[D(n_y, n_x)\,g(n_y, n_x),\; D(n_y, n_x-1),\; D(n_y, n_x-2)\bigr] \qquad (16)$$

where $g(n_y, n_x)$ is a weighting factor that enforces the constraints of the time warping function on the values of $n_y, n_x$.
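The recursion (16) can be sketched as follows; this is a simplified illustration assuming NumPy, with the weighting g(·) dropped and a plain Euclidean frame distance used for d(·,·).

```python
import numpy as np

def dtw_align(X, Y):
    """Align target frames Y to source frames X per eqs. (13)-(16),
    returning a list of (target_index, source_index) pairs."""
    Nx, Ny = len(X), len(Y)
    D = np.full((Ny + 1, Nx + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ny + 1):                 # forward pass, eq. (16)
        for j in range(1, Nx + 1):
            d = np.linalg.norm(Y[i - 1] - X[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i - 1, j - 1],
                              D[i - 1, j - 2] if j >= 2 else np.inf)
    path, i, j = [], Ny, Nx                    # backtrack from the end point
    while i > 0:
        path.append((i - 1, j - 1))
        steps = [(i - 1, j), (i - 1, j - 1), (i - 1, j - 2)]
        i, j = min((s for s in steps if s[1] >= 0), key=lambda s: D[s])
    return path[::-1]
```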
5) The aligned LSF parameters of the symmetric source speech signal and the LSF parameters of the target speech signal are used to perform incremental supervised training of the deep forward prediction network, yielding the trained deep forward prediction network.
The incremental training of the deep feedforward network on a small amount of symmetric data involves three aspects. First, a network output layer with an amplitude-limited soft output characteristic is added on top of the trained deep belief network, forming a deep feedforward network. Second, the source LSF parameters are used as both the input and the output of the combined network with the codec structure; on the basis of the "pre-training", the output data of the network's middle layer (Fig. 2b) are extracted and treated as new characteristic parameters. These new parameters retain the higher-order statistics of the original LSF parameters and therefore have better discriminability. Third, the new characteristic parameters of the symmetric source and the target LSF coefficients serve as the input and output parameters of the deep forward network, and the network weight coefficients are adjusted in a supervised way under the criterion of minimizing the back-propagated error, completing the incremental training of the network.
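A minimal sketch of one such supervised weight adjustment; the shapes and names are hypothetical, sigmoid hidden layers are assumed, and a tanh output stands in for the amplitude-limited soft output layer. Ws and bs are the weight matrices and bias vectors initialized from the pre-trained deep belief network plus the new output layer.

```python
import numpy as np

def finetune_step(x, y, Ws, bs, lr=1e-3):
    """One backprop step of the deep forward prediction network:
    x is a new characteristic parameter vector of the symmetric source,
    y the aligned target LSF vector. Updates Ws, bs in place."""
    acts = [x]                                        # forward pass
    for W, b in zip(Ws[:-1], bs[:-1]):
        acts.append(1.0 / (1.0 + np.exp(-(acts[-1] @ W + b))))
    out = np.tanh(acts[-1] @ Ws[-1] + bs[-1])         # limited-amplitude output
    acts.append(out)
    delta = (out - y) * (1.0 - out ** 2)              # backward pass
    for k in range(len(Ws) - 1, -1, -1):
        grad_W, grad_b = np.outer(acts[k], delta), delta
        if k > 0:                                     # propagate the error down
            delta = (Ws[k] @ delta) * acts[k] * (1.0 - acts[k])
        Ws[k] -= lr * grad_W
        bs[k] -= lr * grad_b
    return 0.5 * np.sum((out - y) ** 2)               # squared transfer error
```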
6) The source speech signal to be converted is decomposed with the harmonic plus stochastic model, yielding its pitch frequency trajectory and the amplitude and phase values of its harmonic vocal tract spectrum parameters. The technical details are identical to those of step 1).
Dimension reduction is applied to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the source speech signal to be converted; the vocal tract parameters are converted into linear prediction parameters, which in turn produce the LSF parameters suitable for voice conversion. The technical details are identical to those of step 2).
The deep belief network trained in step 3) then performs feature mapping on the LSF parameters of the source speech signal to be converted, giving its new characteristic parameters; finally the deep forward prediction network trained in step 5) is regarded as a general function-mapping device, and the new characteristic parameters of the source speech signal to be converted are mapped and converted, giving the LSF parameters of the converted speech signal. Specifically, the LSF parameters of the source speech signal to be converted are placed at the input and output of the combined network with the codec structure, and the middle-layer parameters are extracted as the new characteristic parameters; the trained deep feedforward network, which maps the source characteristic parameters, then takes the new characteristic parameters as input and performs prediction, finally outputting the LSF parameters of the converted speech signal at the network output.
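Schematically, the conversion stage chains the two trained networks; in this sketch, `encoder` and `predictor` are assumed to be callables wrapping the trained codec half-network and the fine-tuned prediction network.

```python
def convert_frame(lsf_src, encoder, predictor):
    """Map one source LSF vector to a converted LSF vector:
    first extract the mid-layer feature, then predict the target LSF."""
    new_feature = encoder(lsf_src)   # mid-layer representation (step 3)
    return predictor(new_feature)    # converted LSF parameters (step 5)
```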
Using the Gaussian models of the source and target pitch frequencies obtained in step 1), the pitch frequency trajectory of the source speech signal to be converted is Gaussian-transformed, giving the pitch frequency trajectory of the converted speech signal. The pitch conversion function is

$$\log f_0' = \mu_y + \frac{\sigma_y}{\sigma_x}\bigl(\log f_0 - \mu_x\bigr) \qquad (17)$$

where $f_0'$ is the converted pitch frequency and $\omega_0 = 2\pi f_0$.
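A small sketch of (17), assuming the Gaussian statistics are estimated as the mean and standard deviation of voiced-frame log-F0 from the training data; the names are illustrative.

```python
import numpy as np

def convert_f0(f0_src, logf0_src_train, logf0_tgt_train):
    """Single-Gaussian log-F0 conversion per eq. (17); unvoiced frames
    (f0 == 0) are left at zero."""
    mu_x, sigma_x = logf0_src_train.mean(), logf0_src_train.std()
    mu_y, sigma_y = logf0_tgt_train.mean(), logf0_tgt_train.std()
    out = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    out[voiced] = np.exp(mu_y + (sigma_y / sigma_x)
                         * (np.log(f0_src[voiced]) - mu_x))
    return out
```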
7) The LSF parameters of the converted speech signal are transformed back into harmonic plus noise model coefficients and, together with the pitch frequency trajectory of the converted speech signal, used for speech synthesis, yielding the converted speech signal. The detailed steps are as follows:

A. The obtained $AM_l$, $f_0$, and $\theta_l$ synthesize the k-th frame of speech according to the definition of the sinusoidal model:

$$s^{(k)}(n) = \sum_{l=1}^{L^{(k)}} AM_l^{(k)} \cos\bigl(2\pi l f_0^{(k)} n + \theta_l^{(k)}\bigr) \qquad (18)$$

B. To reduce the error produced at frame transitions, the whole utterance is synthesized by overlap-add: for any two adjacent frames,

$$s(kN+m) = \Bigl(\frac{N-m}{N}\Bigr) s^{(k)}(m) + \Bigl(\frac{m}{N}\Bigr) s^{(k+1)}(m-N), \qquad 0 \le m \le N \qquad (19)$$

where N is the number of samples contained in one frame of speech. Interpolation and phase compensation are used so that the reconstructed voiced signal has no distortion in its time-domain waveform.

C. For unvoiced frames, a white noise signal is passed through an all-pole filter (whose coefficients are the linear prediction coefficients obtained in step E of training-stage step 1)), giving an approximate reconstructed signal.

D. The reconstructed voiced and unvoiced signals are added, giving the synthesized speech.
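A sketch of steps A and B, assuming NumPy, with f0 expressed in cycles per sample (f0_Hz / fs); each frame is given as a tuple (f0, AM, theta), and the names are illustrative.

```python
import numpy as np

def synthesize_voiced(frames, N):
    """Sinusoidal synthesis with a linear cross-fade between adjacent
    frames, per eqs. (18)-(19). N: samples per frame."""
    def one_frame(f0, AM, theta, n):
        l = np.arange(1, len(AM) + 1)          # harmonic indices, eq. (18)
        return (AM * np.cos(2 * np.pi * f0 * np.outer(n, l) + theta)).sum(axis=1)
    out = np.zeros(N * (len(frames) - 1))
    m = np.arange(N)
    for k in range(len(frames) - 1):
        s_k = one_frame(*frames[k], m)         # s^(k)(m)
        s_k1 = one_frame(*frames[k + 1], m - N)  # s^(k+1)(m - N)
        out[k * N:(k + 1) * N] = ((N - m) / N) * s_k + (m / N) * s_k1  # eq. (19)
    return out
```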
The asymmetric voice conversion method based on deep neural network feature mapping of the invention can be used in secure communication to disguise speech in a personalized way: through voice conversion, some parameters of the speaker's voice are changed by a fixed rule, and the inverse transform is applied at the receiving end to synthesize the original speech, so that an eavesdropper in the transmission path hears the voice of a different speaker, achieving speaker disguise. It can also be applied in multimedia entertainment: in film dubbing, especially dubbing into another language, the dubbing actor is usually not the original performer, so the dubbed voice often differs greatly from the performer's individual characteristics and the result is unsatisfactory; if the dubbed speech is voice-converted so that it regains the performer's individual characteristics, the dubbing effect is much better. It is also useful for speech enhancement: for patients whose vocal organs, such as the vocal cords, are diseased or damaged, speech quality is badly impaired and hard for others to understand, which severely affects normal communication; converting such badly impaired speech into clearly intelligible sound would greatly ease the daily life of such patients.
Although the invention is disclosed above with preferred embodiments, the embodiments are not intended to limit the invention. Any equivalent changes or modifications made without departing from the spirit and scope of the invention also belong to the protection scope of the invention. The protection scope of the invention should therefore be defined by the claims of this application.

Claims (5)

1. An asymmetric voice conversion method based on deep neural network feature mapping, characterized by comprising the following steps:
1) on the basis of the existing source speech signals, recording, according to the collected target speech signals, source speech signals with the same semantic content, to form training speech signals comprising asymmetric source speech signals, symmetric source speech signals, and target speech signals;
decomposing the training speech signals with a harmonic plus stochastic model to obtain the pitch frequency trajectory of the asymmetric source speech signal, the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech signal, the pitch frequency trajectory of the symmetric source speech signal, the pitch frequency trajectory of the target speech signal, the amplitude and phase values of the harmonic vocal tract spectrum parameters of the symmetric source speech signal, and the amplitude and phase values of the harmonic vocal tract spectrum parameters of the target speech signal;
establishing, from the pitch frequency trajectories of the symmetric source speech signal and of the target speech signal, a Gaussian model of the source pitch frequency and a Gaussian model of the target pitch frequency;
2) applying dimension reduction to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech signal, of the symmetric source speech signal, and of the target speech signal, converting the vocal tract parameters into linear prediction parameters, and thereby producing line spectral frequency parameters suitable for voice conversion;
3) training a deep belief network without supervision using the line spectral frequency parameters of the asymmetric source speech signal obtained in step 2), to obtain a trained deep belief network;
wherein the unsupervised training of the deep belief network takes the following two forms:
3-1) forming a restricted Boltzmann machine from every two adjacent layers and training it with the contrastive divergence method, then stacking all the Boltzmann machines into a complete deep belief network, the set of weight coefficients in this network constituting the candidate space of network parameters;
3-2) splicing two deep feedforward networks, one forward and one reversed, into a combined network with an auto-encoder/decoder structure, placing the line spectral frequency coefficients of the speech signal at both input and output, and learning the network structure parameters under a regularized stochastic gradient descent criterion;
4) aligning, with the dynamic time warping algorithm, the line spectral frequency parameters of the symmetric source speech signal obtained in step 2) with the line spectral frequency parameters of the target speech signal;
5) performing incremental supervised training of a deep forward prediction network using the aligned line spectral frequency parameters of the symmetric source speech signal and the line spectral frequency parameters of the target speech signal, to obtain a trained deep forward prediction network;
wherein the incremental supervised training of the deep forward prediction network proceeds as follows:
5-1) adding, on top of the deep belief network trained in step 3), a network output layer with an amplitude-limited soft output characteristic, thereby forming a deep feedforward network;
5-2) processing the line spectral frequency coefficients of the aligned symmetric source speech signal in the manner of step 3-2), and extracting the mid-layer parameters of the network as the new characteristic parameters of the symmetric source speech signal;
5-3) taking the new characteristic parameters of the symmetric source speech signal and the line spectral frequency coefficients of the target speech signal as the input and output of the deep feedforward network, and adjusting the network weight coefficients under the criterion of minimizing the back-propagated error, completing the incremental training of the network;
6) decomposing the source speech signal to be converted with the harmonic plus stochastic model to obtain the pitch frequency trajectory of the source speech signal to be converted and the amplitude and phase values of its harmonic vocal tract spectrum parameters;
applying dimension reduction to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the source speech signal to be converted, converting the vocal tract parameters into linear prediction parameters, and thereby producing line spectral frequency parameters suitable for voice conversion; then performing feature mapping on the line spectral frequency parameters of the source speech signal to be converted with the deep belief network trained in step 3), to obtain the new characteristic parameters of the source speech signal to be converted; finally regarding the deep forward prediction network trained in step 5) as a general function-mapping device, and mapping and converting the new characteristic parameters of the source speech signal to be converted, to obtain the line spectral frequency parameters of the converted speech signal;
performing, with the Gaussian model of the source pitch frequency and the Gaussian model of the target pitch frequency obtained in step 1), a Gaussian transform of the pitch frequency trajectory of the source speech signal to be converted, to obtain the pitch frequency trajectory of the converted speech signal;
7) inverse-transforming the line spectral frequency parameters of the converted speech signal into harmonic plus noise model coefficients and performing speech synthesis together with the pitch frequency trajectory of the converted speech signal, to obtain the converted speech signal.
2. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that in step 1), the process of decomposing the original speech signal with the harmonic plus stochastic model is as follows:
1-1) dividing the original speech signal into frames of fixed duration and estimating the pitch frequency with the correlation method;
1-2) for voiced signals, setting a maximum voiced frequency component to divide the main energy regions of the harmonic component and the stochastic component, and then estimating the amplitude and phase values of the discrete harmonic vocal tract spectrum parameters with a least-squares algorithm;
1-3) for unvoiced signals, analyzing them directly with the classical linear prediction analysis method to obtain linear prediction coefficients.
3. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that in step 2), the process of converting the vocal tract parameters into linear prediction parameters and producing the line spectral frequency parameters suitable for voice conversion is as follows:
2-1) squaring the amplitude values of the discrete harmonic vocal tract spectrum parameters and interpreting them as sampled values of a discrete power spectrum;
2-2) obtaining, from the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients, and obtaining the linear prediction coefficients by solving this equation;
2-3) converting the linear prediction coefficients to line spectral frequency coefficients.
4. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that in step 4), the alignment criterion is: for two feature parameter sequences of unequal length, the dynamic time warping algorithm maps the time axis of one nonlinearly onto the time axis of the other, realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, finally yielding the time-matching function.
5. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that the speech synthesis process in step 7) is as follows:
7-1) using the amplitude and phase values of the discrete harmonic vocal tract spectrum parameters of the voiced signal as the amplitudes and phases of sinusoids and superposing them to obtain the reconstructed voiced signal, interpolation and phase compensation being used so that the reconstructed voiced signal has no distortion in its time-domain waveform;
7-2) passing a white noise signal through an all-pole filter to obtain the reconstructed unvoiced signal;
7-3) superposing the reconstructed voiced signal and the reconstructed unvoiced signal to obtain the converted speech signal.
CN201310468769.1A 2013-10-09 2013-10-09 Asymmetric voice conversion method based on deep neural network feature mapping Active CN103531205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310468769.1A CN103531205B (en) 2013-10-09 2013-10-09 Asymmetric voice conversion method based on deep neural network feature mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310468769.1A CN103531205B (en) 2013-10-09 2013-10-09 Asymmetric voice conversion method based on deep neural network feature mapping

Publications (2)

Publication Number Publication Date
CN103531205A CN103531205A (en) 2014-01-22
CN103531205B true CN103531205B (en) 2016-08-31

Family

ID=49933157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310468769.1A Active CN103531205B (en) 2013-10-09 2013-10-09 Asymmetric voice conversion method based on deep neural network feature mapping

Country Status (1)

Country Link
CN (1) CN103531205B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464744A (en) * 2014-11-19 2015-03-25 河海大学常州校区 Cluster voice transforming method and system based on mixture Gaussian random process
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN104867489B (en) * 2015-04-27 2019-04-26 苏州大学张家港工业技术研究院 A kind of simulation true man read aloud the method and system of pronunciation
CN105005783B (en) * 2015-05-18 2019-04-23 电子科技大学 The method of classification information is extracted from higher-dimension asymmetric data
CN105118498B (en) * 2015-09-06 2018-07-31 百度在线网络技术(北京)有限公司 The training method and device of phonetic synthesis model
CN106203624B (en) * 2016-06-23 2019-06-21 上海交通大学 Vector Quantization and method based on deep neural network
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
WO2018085697A1 (en) 2016-11-04 2018-05-11 Google Llc Training neural networks using a variational information bottleneck
US10902312B2 (en) * 2017-03-28 2021-01-26 Qualcomm Incorporated Tracking axes during model conversion
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107545903B (en) * 2017-07-19 2020-11-24 南京邮电大学 Voice conversion method based on deep learning
CN107886967B (en) * 2017-11-18 2018-11-13 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
JP6733644B2 (en) * 2017-11-29 2020-08-05 ヤマハ株式会社 Speech synthesis method, speech synthesis system and program
CN108417207B (en) * 2018-01-19 2020-06-30 苏州思必驰信息科技有限公司 Deep hybrid generation network self-adaption method and system
CN109147806B (en) * 2018-06-05 2021-11-12 安克创新科技股份有限公司 Voice tone enhancement method, device and system based on deep learning
CN110164414B (en) * 2018-11-30 2023-02-14 腾讯科技(深圳)有限公司 Voice processing method and device and intelligent equipment
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium
CN110085255B (en) * 2019-03-27 2021-05-28 河海大学常州校区 Speech conversion Gaussian process regression modeling method based on deep kernel learning
CN114223032A (en) * 2019-05-17 2022-03-22 重庆中嘉盛世智能科技有限公司 Memory, microphone, audio data processing method, device, equipment and system
CN110992739B (en) * 2019-12-26 2021-06-01 上海松鼠课堂人工智能科技有限公司 Student on-line dictation system
CN111524526B (en) * 2020-05-14 2023-11-17 中国工商银行股份有限公司 Voiceprint recognition method and voiceprint recognition device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049792A (en) * 2011-11-26 2013-04-17 微软公司 Discriminative pretraining of Deep Neural Network
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen, Y., Chu, M., Chang, E., Liu, J., Liu, R.: "Voice conversion with smoothed GMM and MAP adaptation," Proc. INTERSPEECH, 2003, pp. 2413-2416 *

Also Published As

Publication number Publication date
CN103531205A (en) 2014-01-22


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190329

Address after: No. 3, courtyard No. 5, di Kam Road, Haidian District, Beijing

Patentee after: BYZORO NETWORK LTD.

Address before: 213022 Wushan Road, Xinbei District, Changzhou, Jiangsu Province, No. 1

Patentee before: Changzhou Institute of Technology

TR01 Transfer of patent right