CN103531205A - Asymmetrical voice conversion method based on deep neural network feature mapping - Google Patents


Publication number
CN103531205A
CN103531205A (application CN201310468769.1A; granted as CN103531205B)
Authority
CN
China
Prior art keywords
voice signal
parameter
network
deep layer
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310468769.1A
Other languages
Chinese (zh)
Other versions
CN103531205B (en)
Inventor
鲍静益 (Bao Jingyi)
徐宁 (Xu Ning)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BYZORO NETWORK LTD.
Original Assignee
Changzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Institute of Technology filed Critical Changzhou Institute of Technology
Priority to CN201310468769.1A priority Critical patent/CN103531205B/en
Publication of CN103531205A publication Critical patent/CN103531205A/en
Application granted granted Critical
Publication of CN103531205B publication Critical patent/CN103531205B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an asymmetric voice conversion method based on deep neural network feature mapping, belonging to the technical field of voice conversion. The method targets non-parallel (asymmetric) data of source speech and target speech. It comprises the following steps: first, probabilistic modeling is performed using the pre-training capability of a deep network, extracting high-order statistical features of the speech signal to provide a candidate space of network coefficients; second, incremental learning is performed on a small amount of parallel (symmetric) data, and the network weight coefficients are corrected according to the optimized transmission error so as to realize the mapping of feature parameters. The optimized network coefficient structure serves as the initial parameter values of a deep feedforward prediction network, whose structural parameters are further optimized by back-propagation during incremental learning on the small parallel set, thereby realizing the mapping of the speaker's individual feature parameters.

Description

Asymmetric voice conversion method based on deep neural network feature mapping
Technical field
The invention belongs to the field of voice conversion technology, and specifically relates to an asymmetric voice conversion method based on deep neural network feature mapping.
Background technology
Voice conversion, briefly, transforms the voice of one speaker (called the source) by some means so that it sounds as if spoken by another speaker (called the target). Voice conversion is an interdisciplinary subject: its content involves knowledge from fields such as phonetics, semantics and psychoacoustics, and it also touches many aspects of speech signal processing, such as speech analysis and synthesis, speaker recognition, and speech coding and enhancement.
The ultimate goal of voice conversion is to provide an instant voice service that adapts quickly and automatically to any speaker: a system that needs little or no user-specific training and works well for all users under various conditions. The voice conversion technology of the present stage has not yet achieved this. On the one hand, current systems strictly constrain the sentences a user may utter (parallel data are required for training); on the other hand, they demand a large amount of data for training.
Several countermeasures to the above problems have been proposed. For the "non-parallel data" problem, for example, one approach first partitions the source and target speakers' feature spaces by vector quantization, then compares template distances after vocal tract length normalization, selects the codewords corresponding to the source and target speakers, and finally finds the closest matching speech frames within the same codeword space by a nearest-neighbor algorithm. Salor et al. propose solving this class of problems with a dynamic programming algorithm; its core idea is to construct a cost function that simultaneously minimizes the error between source and target and between the previous and current target frames. For the "reduced data volume" problem, Helander et al. propose considering the coupling relations between feature parameters during modeling and exploiting them to improve the robustness of the system when data are scarce. In addition, others propose studying the traditional Gaussian mixture model with variational Bayesian analysis, strengthening its modeling ability under data sparsity.
Retrieval found Chinese patent application No. ZL201210229540.8, published on October 17, 2012, entitled "A voice conversion method based on LPC and RBF neural networks". That application relates to a voice conversion method based on LPC and RBF neural networks, comprising the following steps: A. pre-processing the speech; B. performing pitch detection on the voiced frames; C. converting the voiced frames after pitch detection; D. extracting voiced-frame parameters from the converted pitch; E. computing the extracted voiced-frame parameters to obtain a voiced frame, and synthesizing the converted speech from it. That application proposes a high-quality voice conversion scheme of moderate computational cost, but its shortcoming is this: it decomposes the speech to be converted into unvoiced and voiced parts, further divides the voiced part into pitch, energy, LPC and LSF coefficients for conversion, and thereby adds the measurement of energy, which increases measurement difficulty and error and easily leads to unsatisfactory quality of the converted speech.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art, in which voice conversion systems strictly constrain the sentences a user may utter, require a large amount of training data, and still yield unsatisfactory converted speech quality, by providing an asymmetric voice conversion method based on deep neural network feature mapping. The technical scheme of the invention addresses the sharp degradation of system performance that voice conversion systems face in practical environments under non-parallel data and scarce data volume. It studies these two relatively independent problems comprehensively within a unified theoretical framework: a deep neural network performs unsupervised training on the raw data and distills the high-order statistical characteristics contained therein, and supervised feedforward prediction training is carried out on this basis, finally improving the generalization ability of the voice conversion system in practical environments.
The basic principle of the invention is as follows: for the non-parallel data of source speech and target speech, the pre-training capability of a deep neural network is first used to model them probabilistically, distilling the high-order statistical characteristics contained in the speech signals to provide a candidate space of network coefficients; second, incremental learning is performed on a small amount of parallel (symmetric) data, and the network weight coefficients are adjusted according to the optimized transmission error, thereby realizing the mapping of feature parameters.
Specifically, the present invention is realized by the following technical scheme, comprising the following steps:
1) On the basis of the existing source speech signals, source speech signals with the same semantic content as the collected target speech signals are recorded, forming a training speech set that comprises asymmetric (non-parallel) source speech signals, symmetric (parallel) source speech signals, and target speech signals;
The harmonic plus stochastic model is used to decompose the training speech signals, obtaining respectively the pitch contour of the asymmetric source speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters; the pitch contour of the symmetric source speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters; and the pitch contour of the target speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters;
According to the pitch contours of the symmetric source speech and of the target speech, Gaussian models of the source pitch frequency and of the target pitch frequency are established;
2) Dimensionality reduction is applied respectively to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech, the symmetric source speech and the target speech: the vocal tract parameters are converted into linear prediction parameters, which in turn yield line spectral frequency (LSF) parameters suitable for voice conversion;
3) The LSF parameters of the asymmetric source speech obtained in step 2) are used to train a deep belief network in an unsupervised manner, yielding a trained deep belief network;
4) The dynamic time warping algorithm is used to align the LSF parameters of the symmetric source speech obtained in step 2) with those of the target speech;
5) The aligned LSF parameters of the symmetric source speech and of the target speech are used to perform incremental supervised training of a deep feedforward prediction network, yielding a trained deep feedforward prediction network;
6) The harmonic plus stochastic model is used to decompose the source speech signal to be converted, obtaining its pitch contour and the amplitude and phase values of its harmonic vocal tract spectrum parameters;
Dimensionality reduction is applied to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the source speech to be converted, converting the vocal tract parameters into linear prediction parameters and then into LSF parameters suitable for voice conversion. The deep belief network trained in step 3) then performs feature mapping on the LSF parameters of the source speech to be converted, yielding its new feature parameters. Finally, the deep feedforward prediction network trained in step 5) is regarded as a general functional mapping and applied to these new feature parameters, yielding the LSF parameters of the converted speech;
Using the Gaussian models of the source and target pitch frequencies obtained in step 1), a Gaussian transformation is applied to the pitch contour of the source speech to be converted, yielding the pitch contour of the converted speech;
7) The LSF parameters of the converted speech are converted back into harmonic plus noise model coefficients, and speech synthesis is carried out together with the pitch contour of the converted speech, yielding the converted speech signal.
The above technical scheme is further characterized in that, in step 1), the harmonic plus stochastic model decomposes the original speech signal as follows:
1-1) The original speech signal is divided into frames of fixed duration, and the pitch frequency is estimated by the autocorrelation method;
1-2) For voiced signals, a maximum voiced frequency component is set to delimit the main energy regions of the harmonic and stochastic components; a least-squares algorithm then estimates the discrete amplitude and phase values of the harmonic vocal tract spectrum parameters;
1-3) For unvoiced signals, classical linear prediction analysis is applied directly, yielding the linear prediction coefficients.
The above technical scheme is further characterized in that, in step 2), the vocal tract parameters are converted into linear prediction parameters and then into LSF parameters suitable for voice conversion as follows:
2-1) The amplitude values of the discrete harmonic vocal tract spectrum parameters are squared and regarded as sampled values of the discrete power spectrum;
2-2) From the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients is obtained; solving this equation yields the linear prediction coefficients;
2-3) The linear prediction coefficients are converted into line spectral frequency coefficients.
The above technical scheme is further characterized in that, in step 3), the unsupervised training of the deep belief network takes one of the following two forms:
3-1) Every two adjacent layers form a restricted Boltzmann machine, which is trained by the contrastive divergence method; all the Boltzmann machines are then stacked to form a complete deep belief network, whose set of weight coefficients constitutes the candidate space of network parameters;
3-2) Two deep feedforward networks are spliced back to back to form a combined network with an auto-encoder/decoder structure; the LSF coefficients of the speech signal are placed simultaneously at the input and output ends, and the network structural parameters are learned under a regularized stochastic gradient descent criterion.
The above technical scheme is further characterized in that, in step 4), the alignment criterion is: for two feature parameter sequences of unequal length, the dynamic time warping algorithm nonlinearly maps the time axis of one onto the time axis of the other, realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, finally yielding the time matching function.
The above technical scheme is further characterized in that, in step 5), the incremental supervised training of the deep feedforward prediction network proceeds as follows:
5-1) A network output layer with soft, amplitude-limited output characteristics is added on top of the deep belief network trained in step 3), forming a deep feedforward network;
5-2) The aligned LSF coefficients of the symmetric source speech are processed in the manner of step 3-2), and the outputs of the middle layer of the network are extracted as the new feature parameters of the symmetric source speech;
5-3) The new feature parameters of the symmetric source speech and the LSF coefficients of the target speech serve as the input and output of the deep feedforward network; the network weight coefficients are adjusted under the criterion of minimizing the back-propagated error, completing the incremental training of the network.
The above technical scheme is further characterized in that, in step 7), speech synthesis proceeds as follows:
7-1) The amplitude and phase values of the discrete harmonic vocal tract spectrum parameters of the voiced signal are used as the amplitudes and phases of sinusoids, which are superposed to obtain the reconstructed voiced signal; interpolation and phase compensation are used so that the reconstructed voiced signal exhibits no distortion in its time-domain waveform;
7-2) For the unvoiced signal, a white noise signal is passed through an all-pole filter, yielding the reconstructed unvoiced signal;
7-3) The reconstructed voiced signal and the reconstructed unvoiced signal are superposed, yielding the converted speech signal.
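A minimal sketch of synthesis steps 7-1) and 7-2) follows. This is our own simplified illustration, not the patent's implementation: the interpolation and phase compensation of step 7-1) are omitted, the all-pole filtering is written as an explicit recursion, and all function names are ours.

```python
import numpy as np

def synth_voiced(amps, phases, f0, fs, n_samples):
    """Reconstruct a voiced frame as a superposition of harmonic
    sinusoids with the given amplitudes and phases (step 7-1)."""
    n = np.arange(n_samples)
    w0 = 2 * np.pi * f0 / fs
    s = np.zeros(n_samples)
    for l, (A, th) in enumerate(zip(amps, phases), start=1):
        s += A * np.cos(l * w0 * n + th)
    return s

def synth_unvoiced(lpc, n_samples, gain=1.0, seed=0):
    """Reconstruct an unvoiced frame: white noise through the all-pole
    filter 1/A(z), where lpc = [1, a_1, ..., a_p] (step 7-2)."""
    rng = np.random.default_rng(seed)
    e = gain * rng.standard_normal(n_samples)
    p = len(lpc) - 1
    y = np.zeros(n_samples)
    for i in range(n_samples):
        acc = e[i]
        for k in range(1, p + 1):
            if i - k >= 0:
                acc -= lpc[k] * y[i - k]
        y[i] = acc
    return y
```

Superposing the two outputs frame by frame (with overlap-add in practice) gives the converted signal of step 7-3).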
The beneficial effects of the invention are as follows: the asymmetric voice conversion method based on deep neural network feature mapping makes full use of the common features of the "non-parallel data" and "scarce data volume" problems, designing a data collection and integration scheme that covers both situations. On this basis, a deep belief network learns the structural features of the non-parallel data and optimizes the network coefficient structure, which serves as the initial parameter values of a deep feedforward prediction network; the network structural parameters are then further optimized by backward conduction during incremental learning on a small amount of parallel data, realizing the mapping of the speaker's individual feature parameters.
Brief description of the drawings
Fig. 1 is a block diagram of the training and conversion stages of the voice conversion system according to the invention;
Fig. 2 is a schematic diagram of the pre-training modes of the deep belief network according to the invention.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings and examples.
In order to effectively handle the "non-parallel data" and "scarce data volume" problems of practical environments, the invention designs the following data collection and integration scheme for subsequent operations. In most application scenarios, the collection of the target speaker's voice data is generally passive and therefore difficult, often resulting in a scarce data volume; by contrast, the collection of the source speaker's voice data is more active, so it is relatively easy and the data volume is comparatively sufficient. Therefore, on the basis of the existing source speech data, the source speaker records, according to the collected target speaker's speech, a small amount of additional reference voice data with the same semantic content (the source speaker records a small amount of speech incrementally). In this way, although the source and target data are asymmetric overall, they contain a small amount of parallel data.
Therefore, with reference to Figs. 1 and 2, the asymmetric voice conversion method based on deep neural network feature mapping of this embodiment comprises a training stage and a conversion stage; steps 1) to 5) below constitute the training stage, and steps 6) to 7) the conversion stage:
1) On the basis of the existing source speech signals, source speech signals with the same semantic content as the collected target speech signals are recorded, forming a training speech set that comprises asymmetric (non-parallel) source speech signals, symmetric (parallel) source speech signals, and target speech signals.
The harmonic plus stochastic model is used to decompose the training speech signals, obtaining respectively the pitch contour of the asymmetric source speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters; the pitch contour of the symmetric source speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters; and the pitch contour of the target speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters.
The concrete steps of decomposing the original speech signal with the harmonic plus stochastic model are as follows:
A. The speech signal is divided into frames of length 20 ms with a frame shift of 10 ms.
B. In every frame the pitch frequency is estimated by the autocorrelation method; if the frame is unvoiced, the pitch frequency is set to zero.
C. For a voiced frame (i.e., a frame whose pitch frequency is non-zero), the speech signal s_h(n) is assumed to be formed by the superposition of a series of sinusoids:

s_h(n) = Σ_{l=-L}^{L} C_l e^{j l ω_0 n}   (1)

where L is the number of sinusoids, {C_l} are the complex amplitudes of the sinusoids, ω_0 is the pitch frequency, and n denotes the n-th sample of the speech. Let s_h denote the vector formed by the samples of s_h(n) within one frame; then formula (1) can be rewritten as

s_h = B Δ,   with B_{n,l} = e^{j l ω_0 n} and Δ = [C_{-L}, …, C_L]^T   (2)
where N denotes the total number of samples in one frame. The {C_l} can be determined by the least-squares algorithm, i.e., by minimizing

ε = Σ_{n=-N/2}^{N/2} w²(n) · (s(n) - s_h(n))²   (3)

where s(n) is the real speech signal, w(n) is a window function (generally a Hamming window), and ε denotes the error. Writing the window function in matrix form as

W = diag(w(-N/2), …, w(N/2))   (4)

the optimal Δ is obtained as follows:

W B Δ = W s ⇒ Δ_opt = (B^H W^H W B)^{-1} B^H W^H W s   (5)

where the superscript H denotes conjugate transposition, and s is the vector formed by the samples of the real speech signal s(n) within one frame.
D. Having obtained {C_l}, the harmonic amplitude and phase values are as follows:

AM_l = 2|C_l| = 2|C_{-l}|,   θ_l = arg C_l = -arg C_{-l}   (6)
E. For unvoiced frames, the raw speech frame is analyzed directly with classical linear prediction analysis, yielding the corresponding linear prediction coefficients.
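Steps B and C above can be sketched in code. The following is a simplified illustration under our own naming: frame-wise autocorrelation pitch estimation, and the weighted least-squares fit of the harmonic amplitudes corresponding to eq. (5). The voicing threshold and search range are assumed values, not taken from the patent.

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, f0_min=60.0, f0_max=400.0, v_threshold=0.3):
    """Estimate the pitch of one frame by the autocorrelation method (step B).
    Returns 0.0 for frames judged unvoiced, as the patent prescribes."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                       # normalise so ac[0] == 1
    lo = int(fs / f0_max)                 # shortest admissible pitch period
    hi = min(int(fs / f0_min), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag if ac[lag] > v_threshold else 0.0

def harmonic_ls(frame, f0, fs, n_harm):
    """Weighted least-squares fit of the complex harmonic amplitudes C_l
    (eqs. (1)-(5)): columns of B are e^{j l w0 n}, W is a Hamming window."""
    N = len(frame)
    n = np.arange(N) - N // 2
    w0 = 2 * np.pi * f0 / fs
    l = np.arange(-n_harm, n_harm + 1)
    B = np.exp(1j * np.outer(n, l * w0))          # N x (2L+1) basis matrix
    W = np.diag(np.hamming(N))
    C, *_ = np.linalg.lstsq(W @ B, W @ frame, rcond=None)
    amp = 2 * np.abs(C[n_harm + 1:])              # AM_l = 2|C_l|, eq. (6)
    phase = np.angle(C[n_harm + 1:])              # theta_l = arg C_l
    return amp, phase
```

For a 200 Hz sinusoid sampled at 8 kHz, `estimate_f0_autocorr` recovers 200 Hz and `harmonic_ls` recovers a first-harmonic amplitude of 1.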
Since the pitch contour of the symmetric source speech and the pitch contour of the target speech can be considered to obey single Gaussian distributions, the Gaussian models of the source pitch frequency and of the target pitch frequency can be established from these two pitch contours.
From the above Gaussian models the model parameters can be estimated, namely the mean μ_y and variance σ_y of the Gaussian model of the source pitch frequency, and the mean μ_x and variance σ_x of the Gaussian model of the target pitch frequency.
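The single-Gaussian pitch transformation implied above can be sketched as a conventional mean-variance mapping. This is our own minimal illustration (names are ours; working directly on F0 rather than log-F0, and using the standard deviation in the mapping, are our simplifications):

```python
import numpy as np

def train_f0_gauss(f0_track):
    """Fit a single Gaussian to the voiced (non-zero) F0 values of a contour."""
    voiced = f0_track[f0_track > 0]
    return voiced.mean(), voiced.std()

def convert_f0(f0_track, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Gaussian mean-variance mapping of the pitch contour;
    unvoiced frames (F0 == 0) pass through unchanged."""
    return np.where(f0_track > 0,
                    mu_tgt + (sigma_tgt / sigma_src) * (f0_track - mu_src),
                    0.0)
```

By construction, a source frame at the source mean pitch is mapped exactly to the target mean pitch.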
2) Dimensionality reduction is applied respectively to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech, the symmetric source speech and the target speech: the vocal tract parameters are converted into linear prediction parameters, which in turn yield LSF parameters suitable for voice conversion.
The reason for step 2) is that the original harmonic plus noise model parameters are of high dimension and inconvenient for subsequent computation, so their dimensionality must be reduced. Since the pitch contour is a one-dimensional parameter, the main objects of dimensionality reduction are the vocal tract amplitude spectrum parameters and the phase parameters. The goal of the reduction is to convert the vocal tract parameters into classical linear prediction parameters and then into line spectral frequency parameters suitable for a voice conversion system. The solution procedure is as follows:
A. The L discrete amplitude values AM_l are squared and regarded as sampled values PW(ω_l) of the discrete power spectrum, where ω_l denotes the frequency value at the l-th integer multiple of the pitch frequency.
B. By the Wiener-Khinchin theorem, the autocorrelation function and the power spectral density function form a Fourier transform pair:

R(n) = Σ_l PW(ω_l) e^{j ω_l n}

A preliminary estimate of the linear prediction coefficients can therefore be obtained by solving the Toeplitz (Yule-Walker) equations:

Σ_{i=0}^{p} a_i R(n-i) = 0,   n = 1, …, p,   with a_0 = 1   (7)
where a_1, a_2, …, a_p are the coefficients of the p-th order linear prediction filter A(z), and R_0 to R_p are the values of the autocorrelation function at the first p integer discrete points.
C. The all-pole model represented by the p-th order linear prediction coefficients is converted into a time-domain impulse response function h*[n]:

h*[n] = (1/L) Re{ Σ_l (1 / A(e^{j ω_l})) e^{j ω_l n} }   (8)

where A(e^{j ω_l}) = A(z)|_{z = e^{j ω_l}} = 1 + a_1 z^{-1} + a_2 z^{-2} + … + a_p z^{-p}. It can be proved that h* and the estimated autocorrelation sequence R* satisfy:
Σ_{i=0}^{p} a_i R*(n-i) = h*[-n]   (9)
When the Itakura-Saito distance is minimized, the true R and the estimated R* are related as follows:

Σ_{i=0}^{p} a_i R*(n-i) = Σ_{i=0}^{p} a_i R(n-i)   (10)
D. Formula (9) is therefore substituted into formula (10), and formula (7) is re-estimated, giving:

Σ_{i=0}^{p} a_i R(n-i) = h*[-n]   (11)
E. The error is assessed with the Itakura-Saito criterion; if it exceeds the set threshold, steps C to E are repeated; otherwise the iteration stops.
The resulting linear prediction coefficients are converted into line spectral frequency parameters by simultaneously solving the two equations below:

P(z) = A(z) + z^{-(p+1)} A(z^{-1})
Q(z) = A(z) - z^{-(p+1)} A(z^{-1})   (12)
3) The LSF parameters of the asymmetric source speech obtained in step 2) are used to train a deep belief network in an unsupervised manner, yielding a trained deep belief network.
The above step is the "pre-training", and this pre-training process takes two forms. The first (as shown in Fig. 2a): for a complete deep belief network, in bottom-up order, every two adjacent network layers form a restricted Boltzmann machine (the lower layer is called the input layer, the upper layer the hidden layer, with undirected connections between them); driven by the raw input data, the structural parameters between layers are learned with the contrastive divergence method. Moreover, the data transfer between the stacked restricted Boltzmann machines satisfies the following condition: the hidden-layer output of the lower Boltzmann machine serves as the input-layer input of the Boltzmann machine above it. Iterating step by step in this manner continues until all the designed network structural parameters have been pre-trained. The second form (as shown in Fig. 2b): two deep feedforward networks are spliced back to back into a combined network with an auto-encoder/decoder structure; the LSF parameters of the speech signal are placed simultaneously at the input and output ends of this network, and the network structural parameters are pre-trained under a regularized stochastic gradient descent criterion.
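The first pre-training form (restricted Boltzmann machines trained with contrastive divergence, Fig. 2a) can be sketched as follows. This is a minimal Bernoulli-Bernoulli RBM with one-step contrastive divergence (CD-1); real-valued LSF inputs would normally call for Gaussian visible units, and all names and hyper-parameters here are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli restricted Boltzmann machine, CD-1 training."""
    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible bias
        self.c = np.zeros(n_hid)   # hidden bias
        self.lr = lr

    def cd1(self, v0):
        """One CD-1 update on a batch v0 (shape: batch x n_vis);
        returns the mean squared reconstruction error."""
        ph0 = sigmoid(v0 @ self.W + self.c)            # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ self.W.T + self.b)          # reconstruction
        ph1 = sigmoid(pv1 @ self.W + self.c)           # negative phase
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - pv1) ** 2))
```

Stacking follows the condition stated above: the hidden probabilities `sigmoid(data @ rbm.W + rbm.c)` of a trained RBM become the input data of the RBM above it.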
4) The dynamic time warping algorithm is used to align the LSF parameters of the symmetric source speech obtained in step 2) with those of the target speech.
So-called "alignment" means making the corresponding source and target line spectral frequencies attain the minimum distortion distance under the chosen distortion criterion. The purpose is to associate the source and target speakers' feature sequences at the parameter level, facilitating the subsequent statistical model in learning the mapping rules between them.
The alignment criterion is: for two feature parameter sequences of unequal length, the dynamic time warping algorithm nonlinearly maps the time axis of one onto the time axis of the other, realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, finally yielding the time matching function.
The dynamic time warping algorithm is briefly summarized as follows. For the utterance of the same sentence, suppose the source speaker's acoustic feature parameter sequence is

X = [x_1, x_2, …, x_{N_x}]

and the target speaker's feature parameter sequence is

Y = [y_1, y_2, …, y_{N_y}]

with N_x ≠ N_y. Taking the source speaker's feature parameter sequence as the reference template, the dynamic time warping algorithm searches for the time warping function

n_x = φ(n_y)

that nonlinearly maps the time axis n_y of the target feature sequence onto the time axis n_x of the source feature parameter sequence, so that the total cumulative distortion is minimized; mathematically this can be expressed as

D = min_{φ(n_y)} Σ_{n_y=1}^{N_y} d(y_{n_y}, x_{φ(n_y)})   (13)

where d(y_{n_y}, x_{φ(n_y)}) denotes some measure of distance between the target speaker's feature parameter at frame n_y and the source speaker's feature parameter at frame φ(n_y). During the warping process, the warping function φ(·) must satisfy the following constraints, with boundary conditions and continuity conditions respectively:

φ(1) = 1,   φ(N_y) = N_x   (14)

φ(n_y + 1) - φ(n_y) ∈ {0, 1, 2}   (15)
Dynamic time warping is an optimization algorithm: it turns a multi-stage decision process into a sequence of single-stage decision processes, i.e., into a series of sub-problems decided one by one, so as to simplify the computation. The warping generally starts from the last stage, i.e., it is a backward process, and its recursion can be expressed as:
D(n_y + 1, n_x) = d(n_y + 1, n_x) + min[ D(n_y, n_x)·g(n_y, n_x), D(n_y, n_x − 1), D(n_y, n_x − 2) ]   (16)
where g(n_y, n_x) enforces the constraint that the values of n_y and n_x satisfy the time-alignment conditions of the warping function.
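As an illustration only (not part of the patent), the backward recursion of formula (16), with predecessor columns n_x, n_x − 1 and n_x − 2, can be sketched in NumPy; all function and variable names here are assumptions:

```python
import numpy as np

def dtw_align(X, Y):
    """Align target frames Y (shape N_y x d) to source frames X (N_x x d).

    Returns phi, an integer array of length N_y mapping each target frame
    n_y to a source frame phi[n_y], minimising the cumulative Euclidean
    distortion under the boundary and continuity constraints.
    """
    Nx, Ny = len(X), len(Y)
    d = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)  # (Ny, Nx) frame distances
    D = np.full((Ny, Nx), np.inf)
    back = np.zeros((Ny, Nx), dtype=int)
    D[0, 0] = d[0, 0]                                          # boundary: phi(1) = 1
    for i in range(1, Ny):
        for j in range(Nx):
            # continuity: predecessor column j, j-1 or j-2 (cf. formula (16))
            for k in (j, j - 1, j - 2):
                if k >= 0 and D[i - 1, k] + d[i, j] < D[i, j]:
                    D[i, j] = D[i - 1, k] + d[i, j]
                    back[i, j] = k
    phi = np.zeros(Ny, dtype=int)
    phi[-1] = Nx - 1                                           # boundary: phi(N_y) = N_x
    for i in range(Ny - 1, 0, -1):
        phi[i - 1] = back[i, phi[i]]
    return phi
```

The resulting index array phi plays the role of the time-matching function: pairing y_{n_y} with x_{phi[n_y]} gives the one-to-one frame correspondence used for the supervised training data.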
5) Use the aligned linear spectral frequency parameters of the symmetric source speech signal and of the target speech signal to perform incremental supervised training of the deep feedforward prediction network, obtaining a trained deep feedforward prediction network.
The above incremental training of the deep feedforward network on a small amount of symmetric data comprises three aspects. First, an output layer with amplitude-limited ("soft") output characteristics is added on top of the trained deep belief network, forming a deep feedforward network. Second, the source linear spectral frequency parameters are fed to both the input and the output of the combined network with encoder-decoder structure; on the basis of the pre-training, the output data of the network's middle layer (as shown in Figure 2b) are extracted and treated as new feature parameters. These new feature parameters retain the higher-order statistics of the original linear spectral frequency parameters and therefore discriminate better. Third, the new source feature parameters and the target linear spectral frequency coefficients are used as the input and output of the deep feedforward network, and the network weight coefficients are adjusted in a supervised manner under the criterion of minimizing the back-propagated transmission error, completing the incremental training of the network.
6) Decompose the source speech signal to be converted with the harmonic plus stochastic model, obtaining the fundamental frequency track of the source speech signal to be converted and the amplitude and phase values of its harmonic vocal-tract spectrum parameters. The technical details are identical to those of step 1).
Apply dimension reduction to the amplitude and phase values of the harmonic vocal-tract spectrum parameters of the source speech signal to be converted, converting the vocal-tract parameters into linear prediction parameters and then producing the linear spectral frequency parameters suitable for voice conversion. The technical details are identical to those of step 2).
Then the deep belief network trained in step 3) performs feature mapping on the linear spectral frequency parameters of the source speech signal to be converted, yielding its new feature parameters; finally, the deep feedforward prediction network trained in step 5) is regarded as a general mapping function and applied to these new feature parameters, producing the linear spectral frequency parameters of the converted speech signal. Specifically, the linear spectral frequency parameters of the source speech signal to be converted are placed at the input and output of the combined network with encoder-decoder structure, and the middle-layer parameters are extracted as new feature parameters; the trained deep feedforward network then maps the source feature parameters, i.e. the new feature parameters of the source speech signal to be converted are provided as input to this model for prediction, and the network output finally gives the linear spectral frequency parameters of the converted speech signal.
Using the Gaussian model of the source fundamental frequency and the Gaussian model of the target fundamental frequency obtained in step 1), a Gaussian transformation is applied to the fundamental frequency track of the source speech signal to be converted, yielding the fundamental frequency track of the converted speech signal. The fundamental frequency transfer function is:
log f0′ = μ_y + (σ_y / σ_x)·(log f0 − μ_x)   (17)
where f0′ is the fundamental frequency after conversion and f0 is the fundamental frequency of the source speech; μ_x, σ_x and μ_y, σ_y are the means and standard deviations of the source and target log-f0 Gaussian models, respectively.
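Formula (17) amounts to a per-frame mean/variance normalization in the log-f0 domain. A minimal sketch (illustrative names; the convention that unvoiced frames are marked f0 = 0 is an assumption):

```python
import numpy as np

def convert_f0(f0_track, mu_x, sigma_x, mu_y, sigma_y):
    """Formula (17): map a source f0 track onto the target log-f0 Gaussian.

    mu_x, sigma_x / mu_y, sigma_y: mean and standard deviation of log f0
    for the source / target speaker (from the Gaussian models of step 1).
    Unvoiced frames, marked f0 == 0, are passed through unchanged.
    """
    f0_track = np.asarray(f0_track, dtype=float)
    out = np.zeros_like(f0_track)
    voiced = f0_track > 0
    out[voiced] = np.exp(mu_y + (sigma_y / sigma_x) * (np.log(f0_track[voiced]) - mu_x))
    return out
```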
7) Convert the linear spectral frequency parameters of the converted speech signal back into harmonic plus noise model coefficients, then perform speech synthesis together with the converted fundamental frequency track, obtaining the converted speech signal. The detailed steps are as follows:
A. Using the obtained AM_l, f0, θ_l, synthesize the speech of the k-th frame according to the sinusoidal-model definition, namely:

s^(k)(n) = Σ_{l=1}^{L^(k)} AM_l^(k) · cos( 2π·l·f0^(k)·n + θ_l^(k) )   (18)
B. To reduce the error produced by inter-frame alternation, the whole utterance is synthesized by the overlap-add method; that is, for any two adjacent frames:

s(kN + m) = ((N − m)/N)·s^(k)(m) + (m/N)·s^(k+1)(m − N),  0 ≤ m ≤ N   (19)
where N denotes the number of samples contained in one frame of speech. Interpolation and phase compensation are used so that the reconstructed voiced signal exhibits no distortion in its time-domain waveform.
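Formulas (18) and (19) can be sketched together as follows (illustrative NumPy, not the patent's implementation; f0 is assumed to be given in cycles per sample, and each frame is generated over 2N + 1 samples centred on its frame boundary so that the triangular weights reproduce the cross-fade of formula (19)):

```python
import numpy as np

def synth_frame(AM, theta, f0, n):
    """Formula (18): sum of L harmonics at sample indices n (f0 in cycles/sample)."""
    l = np.arange(1, len(AM) + 1)[:, None]                     # harmonic numbers 1..L
    return np.sum(AM[:, None] * np.cos(2 * np.pi * l * f0 * n[None, :] + theta[:, None]),
                  axis=0)

def overlap_add(frames, N):
    """Formula (19): triangular cross-fade of adjacent frames over N samples.

    frames: list of (AM, theta, f0) tuples, one per frame of N samples.
    """
    K = len(frames)
    out = np.zeros((K + 1) * N + 1)
    n = np.arange(-N, N + 1)                                   # each frame spans 2N+1 samples
    w = 1.0 - np.abs(n) / N                                    # weights (N - |n|)/N
    for k, (AM, theta, f0) in enumerate(frames):
        out[k * N : k * N + 2 * N + 1] += w * synth_frame(AM, theta, f0, n)
    return out
```

At any sample between two frame centres the two triangular weights sum to one, so when adjacent frames carry identical parameters the cross-fade reproduces the frame signal exactly, which is the point of formula (19).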
C. For unvoiced frames, an approximate reconstruction is obtained by passing a white-noise signal through an all-pole filter whose coefficients are the linear prediction coefficients obtained in step 1)e of the training stage.
D. Adding the reconstructed voiced signal and the reconstructed unvoiced signal yields the synthetic speech.
The asymmetric voice conversion method based on deep neural network feature mapping of the present invention can be used to disguise voice identity in secure communication: voice conversion technology alters certain parameters of a speaker's voice according to a fixed rule, and the receiving end applies the inverse transformation to resynthesize the original voice, so that an eavesdropper during transmission hears the voice of a different speaker, achieving speaker disguise. It can also be applied in multimedia entertainment, for example film dubbing, especially dubbing into another language: the dubbing actor is usually not the original performer, so the dub often differs greatly from the performer's personal characteristics and the result is unsatisfactory; if the dub is further voice-converted so that it regains the performer's personal characteristics, the dubbing result is far more satisfactory. It is likewise useful for speech-enhancement systems, especially for patients whose vocal organs, such as the vocal cords, are diseased or damaged: their speech quality is severely degraded and difficult for others to understand, seriously affecting normal communication; converting such badly degraded speech into clearly intelligible speech would greatly ease these patients' daily lives.
Although the present invention is disclosed above by way of preferred embodiments, the embodiments are not intended to limit the invention. Any equivalent change or modification made without departing from the spirit and scope of the invention likewise falls within the protection scope of the invention. The protection scope of the invention shall therefore be defined by the claims of this application.

Claims (7)

1. An asymmetric voice conversion method based on deep neural network feature mapping, characterized by comprising the steps of:
1) on the basis of an existing source speech signal, collecting a source speech signal with the same semantic content as the collected target speech signal, forming training speech signals comprising an asymmetric source speech signal, a symmetric source speech signal and a target speech signal;
decomposing the training speech signals with a harmonic plus stochastic model to obtain, respectively, the fundamental frequency track of the asymmetric source speech signal and the amplitude and phase values of its harmonic vocal-tract spectrum parameters; the fundamental frequency track of the symmetric source speech signal and the amplitude and phase values of its harmonic vocal-tract spectrum parameters; and the fundamental frequency track of the target speech signal and the amplitude and phase values of its harmonic vocal-tract spectrum parameters;
establishing a Gaussian model of the source fundamental frequency and a Gaussian model of the target fundamental frequency from the fundamental frequency tracks of the symmetric source speech signal and of the target speech signal;
2) applying dimension reduction respectively to the amplitude and phase values of the harmonic vocal-tract spectrum parameters of the asymmetric source speech signal, of the symmetric source speech signal and of the target speech signal, converting the vocal-tract parameters into linear prediction parameters and then producing linear spectral frequency parameters suitable for voice conversion;
3) using the linear spectral frequency parameters of the asymmetric source speech signal obtained in step 2) to perform unsupervised training of a deep belief network, obtaining a trained deep belief network;
4) using a dynamic time warping algorithm to align the linear spectral frequency parameters of the symmetric source speech signal and of the target speech signal obtained in step 2);
5) using the aligned linear spectral frequency parameters of the symmetric source speech signal and of the target speech signal to perform incremental supervised training of a deep feedforward prediction network, obtaining a trained deep feedforward prediction network;
6) decomposing the source speech signal to be converted with the harmonic plus stochastic model, obtaining the fundamental frequency track of the source speech signal to be converted and the amplitude and phase values of its harmonic vocal-tract spectrum parameters;
applying dimension reduction to the amplitude and phase values of the harmonic vocal-tract spectrum parameters of the source speech signal to be converted, converting the vocal-tract parameters into linear prediction parameters and producing linear spectral frequency parameters suitable for voice conversion; then using the deep belief network trained in step 3) to perform feature mapping on the linear spectral frequency parameters of the source speech signal to be converted, obtaining its new feature parameters; finally regarding the deep feedforward prediction network trained in step 5) as a general mapping function and applying it to the new feature parameters of the source speech signal to be converted, obtaining the linear spectral frequency parameters of the converted speech signal;
using the Gaussian models of the source and target fundamental frequencies obtained in step 1) to apply a Gaussian transformation to the fundamental frequency track of the source speech signal to be converted, obtaining the fundamental frequency track of the converted speech signal;
7) converting the linear spectral frequency parameters of the converted speech signal back into harmonic plus noise model coefficients, then performing speech synthesis together with the converted fundamental frequency track, obtaining the converted speech signal.
2. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 1), the process of decomposing the original speech signal with the harmonic plus stochastic model is as follows:
1-1) dividing the original speech signal into frames of fixed duration and estimating the fundamental frequency with the correlation method;
1-2) for voiced signals, setting a maximum voiced frequency component to delimit the main energy regions of the harmonic and stochastic components, then estimating the amplitude and phase values of the discrete harmonic vocal-tract spectrum parameters with a least-squares algorithm;
1-3) for unvoiced signals, analyzing them directly with the classical linear prediction analysis method to obtain the linear prediction coefficients.
3. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 2), the process of converting the vocal-tract parameters into linear prediction parameters and then producing the linear spectral frequency parameters suitable for voice conversion is as follows:
2-1) squaring the amplitude values of the discrete harmonic vocal-tract spectrum parameters and regarding them as sampled values of the discrete power spectrum;
2-2) obtaining, from the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients, and solving this equation to obtain the linear prediction coefficients;
2-3) converting the linear prediction coefficients into linear spectral frequency coefficients.
4. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 3), the unsupervised training of the deep belief network is divided into the following two parts:
3-1) forming a restricted Boltzmann machine from each pair of adjacent layers and training it by the contrastive divergence method, then stacking all the Boltzmann machines to form a complete deep belief network; the weight coefficients set in this network form a candidate space of network parameters;
3-2) splicing two deep feedforward networks back to back to form a combined network with an auto-encoder/decoder structure, placing the linear spectral frequency coefficients of the speech signal simultaneously at the input and the output, and learning the network structure parameters under a regularized stochastic gradient descent criterion.
5. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 4), the alignment criterion is: for two feature-parameter sequences of unequal length, the dynamic time warping algorithm nonlinearly maps the time axis of one sequence onto the time axis of the other, establishing a one-to-one matching relationship; during alignment of the available parameter sets, a preset cumulative distortion function is iteratively optimized over a restricted search region, finally yielding the time-matching function.
6. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 4, characterized in that, in step 5), the process of incremental supervised training of the deep feedforward prediction network is as follows:
5-1) adding, on top of the deep belief network trained in step 3), an output layer with amplitude-limited soft output characteristics, thereby forming a deep feedforward network;
5-2) processing the aligned linear spectral frequency coefficients of the symmetric source speech signal in the manner of step 3-2) and extracting the middle-layer parameters of the network as the new feature parameters of the symmetric source speech signal;
5-3) using the new feature parameters of the symmetric source speech signal and the linear spectral frequency coefficients of the target speech signal as the input and output of the deep feedforward network, adjusting the network weight coefficients under the criterion of minimizing the back-propagated transmission error, completing the incremental training of the network.
7. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 7), the speech synthesis process is as follows:
7-1) using the amplitude and phase values of the discrete harmonic vocal-tract spectrum parameters of the voiced signal as the amplitudes and phases of sinusoidal signals and superposing them to obtain the reconstructed voiced signal; interpolation and phase compensation are used so that the reconstructed voiced signal exhibits no distortion in its time-domain waveform;
7-2) passing a white-noise signal through an all-pole filter to obtain the reconstructed unvoiced signal;
7-3) superposing the reconstructed voiced signal and the reconstructed unvoiced signal to obtain the converted speech signal.
CN201310468769.1A 2013-10-09 2013-10-09 The asymmetrical voice conversion method mapped based on deep neural network feature Active CN103531205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310468769.1A CN103531205B (en) 2013-10-09 2013-10-09 The asymmetrical voice conversion method mapped based on deep neural network feature


Publications (2)

Publication Number Publication Date
CN103531205A true CN103531205A (en) 2014-01-22
CN103531205B CN103531205B (en) 2016-08-31

Family

ID=49933157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310468769.1A Active CN103531205B (en) 2013-10-09 2013-10-09 The asymmetrical voice conversion method mapped based on deep neural network feature

Country Status (1)

Country Link
CN (1) CN103531205B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN104464744A (en) * 2014-11-19 2015-03-25 河海大学常州校区 Cluster voice transforming method and system based on mixture Gaussian random process
CN104867489A (en) * 2015-04-27 2015-08-26 苏州大学张家港工业技术研究院 Method and system for simulating reading and pronunciation of real person
CN105005783A (en) * 2015-05-18 2015-10-28 电子科技大学 Method of extracting classification information from high dimensional asymmetric data
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
CN106203624A (en) * 2016-06-23 2016-12-07 上海交通大学 Vector Quantization based on deep neural network and method
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN108417207A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 A kind of depth mixing generation network self-adapting method and system
CN109147806A (en) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 Speech quality Enhancement Method, device and system based on deep learning
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium
CN109923560A (en) * 2016-11-04 2019-06-21 谷歌有限责任公司 Neural network is trained using variation information bottleneck
CN110085255A (en) * 2019-03-27 2019-08-02 河海大学常州校区 Voice conversion learns Gaussian process regression modeling method based on depth kernel
CN110164414A (en) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 Method of speech processing, device and smart machine
CN110520835A (en) * 2017-03-28 2019-11-29 高通股份有限公司 The tracking axis during model conversion
CN110992739A (en) * 2019-12-26 2020-04-10 上海乂学教育科技有限公司 Student on-line dictation system
CN111418005A (en) * 2017-11-29 2020-07-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program
CN111524526A (en) * 2020-05-14 2020-08-11 中国工商银行股份有限公司 Voiceprint recognition method and device
WO2020232578A1 (en) * 2019-05-17 2020-11-26 Xu Junli Memory, microphone, audio data processing method and apparatus, and device and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103049792A (en) * 2011-11-26 2013-04-17 微软公司 Discriminative pretraining of Deep Neural Network
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049792A (en) * 2011-11-26 2013-04-17 微软公司 Discriminative pretraining of Deep Neural Network
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, Y., CHU, M., CHANG, E., LIU, J., LIU, R.: "Voice conversion with smoothed GMM and MAP adaptation", 《INTERSPEECH》 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464744A (en) * 2014-11-19 2015-03-25 河海大学常州校区 Cluster voice transforming method and system based on mixture Gaussian random process
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN104867489A (en) * 2015-04-27 2015-08-26 苏州大学张家港工业技术研究院 Method and system for simulating reading and pronunciation of real person
CN104867489B (en) * 2015-04-27 2019-04-26 苏州大学张家港工业技术研究院 A kind of simulation true man read aloud the method and system of pronunciation
CN105005783A (en) * 2015-05-18 2015-10-28 电子科技大学 Method of extracting classification information from high dimensional asymmetric data
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105118498B (en) * 2015-09-06 2018-07-31 百度在线网络技术(北京)有限公司 The training method and device of phonetic synthesis model
CN106203624A (en) * 2016-06-23 2016-12-07 上海交通大学 Vector Quantization based on deep neural network and method
CN106203624B (en) * 2016-06-23 2019-06-21 上海交通大学 Vector Quantization and method based on deep neural network
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
US11681924B2 (en) 2016-11-04 2023-06-20 Google Llc Training neural networks using a variational information bottleneck
CN109923560A (en) * 2016-11-04 2019-06-21 谷歌有限责任公司 Neural network is trained using variation information bottleneck
CN110520835B (en) * 2017-03-28 2023-11-24 高通股份有限公司 Tracking axes during model conversion
CN110520835A (en) * 2017-03-28 2019-11-29 高通股份有限公司 The tracking axis during model conversion
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning
CN107545903B (en) * 2017-07-19 2020-11-24 南京邮电大学 Voice conversion method based on deep learning
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN111418005A (en) * 2017-11-29 2020-07-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program
CN111418005B (en) * 2017-11-29 2023-08-11 雅马哈株式会社 Voice synthesis method, voice synthesis device and storage medium
CN108417207A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 A kind of depth mixing generation network self-adapting method and system
CN109147806A (en) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 Speech quality Enhancement Method, device and system based on deep learning
CN109147806B (en) * 2018-06-05 2021-11-12 安克创新科技股份有限公司 Voice tone enhancement method, device and system based on deep learning
CN110164414B (en) * 2018-11-30 2023-02-14 腾讯科技(深圳)有限公司 Voice processing method and device and intelligent equipment
CN110164414A (en) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 Method of speech processing, device and smart machine
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium
CN110085255A (en) * 2019-03-27 2019-08-02 河海大学常州校区 Voice conversion learns Gaussian process regression modeling method based on depth kernel
CN110085255B (en) * 2019-03-27 2021-05-28 河海大学常州校区 Speech conversion Gaussian process regression modeling method based on deep kernel learning
WO2020232578A1 (en) * 2019-05-17 2020-11-26 Xu Junli Memory, microphone, audio data processing method and apparatus, and device and system
CN110992739B (en) * 2019-12-26 2021-06-01 上海松鼠课堂人工智能科技有限公司 Student on-line dictation system
CN110992739A (en) * 2019-12-26 2020-04-10 上海乂学教育科技有限公司 Student on-line dictation system
CN111524526A (en) * 2020-05-14 2020-08-11 中国工商银行股份有限公司 Voiceprint recognition method and device
CN111524526B (en) * 2020-05-14 2023-11-17 中国工商银行股份有限公司 Voiceprint recognition method and voiceprint recognition device

Also Published As

Publication number Publication date
CN103531205B (en) 2016-08-31

Similar Documents

Publication Publication Date Title
CN103531205A (en) Asymmetrical voice conversion method based on deep neural network feature mapping
CN110136731B (en) Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN105023580B (en) Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method
CN111081268A (en) Phase-correlated shared deep convolutional neural network speech enhancement method
CN110534120B (en) Method for repairing surround sound error code under mobile network environment
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN103035236B (en) High-quality voice conversion method based on modeling of signal timing characteristics
CN105139864A (en) Voice recognition method and voice recognition device
CN103117059A (en) Voice signal characteristics extracting method based on tensor decomposition
CN110459241A (en) A kind of extracting method and system for phonetic feature
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN102789779A (en) Speech recognition system and recognition method thereof
CN109065073A (en) Speech-emotion recognition method based on depth S VM network model
CN106782599A (en) The phonetics transfer method of post filtering is exported based on Gaussian process
CN107248414A (en) A kind of sound enhancement method and device based on multiframe frequency spectrum and Non-negative Matrix Factorization
CN114495969A (en) Voice recognition method integrating voice enhancement
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
CN114626424B (en) Data enhancement-based silent speech recognition method and device
Li et al. Speech intelligibility enhancement using non-parallel speaking style conversion with stargan and dynamic range compression
Miao et al. A blstm and wavenet-based voice conversion method with waveform collapse suppression by post-processing
CN115910091A (en) Method and device for separating generated voice by introducing fundamental frequency clues
Liu et al. Spectral envelope estimation used for audio bandwidth extension based on RBF neural network
Zhong A Framework for Piano Online Education Based on Multi-Modal AI Technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190329

Address after: No. 3, courtyard No. 5, di Kam Road, Haidian District, Beijing

Patentee after: BYZORO NETWORK LTD.

Address before: 213022 Wushan Road, Xinbei District, Changzhou, Jiangsu Province, No. 1

Patentee before: Changzhou Polytechnic College