CN103531205A - Asymmetrical voice conversion method based on deep neural network feature mapping - Google Patents


Publication number
CN103531205A
CN103531205A (application CN201310468769.1A; granted as CN103531205B)
Authority
CN
China
Prior art keywords
voice signal
parameter
network
deep layer
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310468769.1A
Other languages
Chinese (zh)
Other versions
CN103531205B (en)
Inventor
鲍静益 (Bao Jingyi)
徐宁 (Xu Ning)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BYZORO NETWORK LTD.
Original Assignee
Changzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Institute of Technology filed Critical Changzhou Institute of Technology
Priority to CN201310468769.1A priority Critical patent/CN103531205B/en
Publication of CN103531205A publication Critical patent/CN103531205A/en
Application granted granted Critical
Publication of CN103531205B publication Critical patent/CN103531205B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an asymmetric voice conversion method based on deep neural network feature mapping, belonging to the technical field of voice conversion. The method targets non-parallel (asymmetric) data of source speech and target speech. It comprises the following steps: first, probabilistic modeling is performed using the pre-training capability of a deep network, extracting high-order statistical features of the speech signal to provide a candidate space of network coefficients; second, incremental learning is performed on a small amount of parallel (symmetric) data, and the network weight coefficients are corrected according to the optimized transmission error so as to realize the mapping of feature parameters. The optimized network coefficient structure serves as the initial parameter values of a deep feedforward prediction network, whose structural parameters are further optimized by back-propagation during incremental learning on the small parallel set, thereby realizing the mapping of the speaker's individual feature parameters.

Description

Asymmetric voice conversion method based on deep neural network feature mapping
Technical field
The invention belongs to the field of voice conversion technology, and specifically relates to an asymmetric voice conversion method based on deep neural network feature mapping.
Background technology
Voice conversion, briefly, transforms the voice of one speaker (called the source) by some means so that it sounds as if spoken by another speaker (called the target). Voice conversion is an interdisciplinary subject: its content involves knowledge from fields such as phonetics, semantics and psychoacoustics, and it also touches many aspects of speech signal processing, such as speech analysis and synthesis, speaker recognition, and speech coding and enhancement.
The ultimate goal of voice conversion is to provide an instant voice service that adapts quickly and automatically to any speaker: a system that needs little or no user-specific training and works well for all users under various conditions. The voice conversion technology of the present stage has not yet achieved this. On the one hand, current systems strictly constrain the sentences a user may utter (parallel data are required for training); on the other hand, they demand a large amount of data for training.
Several countermeasures to the above problems have been proposed. For the "non-parallel data" problem, for example, one approach first partitions the source and target speakers' feature spaces by vector quantization, then compares template distances after vocal tract length normalization, selects the codewords corresponding to the source and target speakers, and finally finds the closest matching speech frames within the same codeword space by a nearest-neighbor algorithm. Salor et al. propose solving this class of problems with a dynamic programming algorithm; its core idea is to construct a cost function that simultaneously minimizes the error between source and target and between the previous and current target frames. For the "reduced data volume" problem, Helander et al. propose considering the coupling relations between feature parameters during modeling and exploiting them to improve the robustness of the system when data are scarce. In addition, others propose studying the traditional Gaussian mixture model with variational Bayesian analysis, strengthening its modeling ability under data sparsity.
Retrieval found Chinese patent application No. ZL201210229540.8, published on October 17, 2012, entitled "A voice conversion method based on LPC and RBF neural networks". That application relates to a voice conversion method based on LPC and RBF neural networks, comprising the following steps: A. pre-processing the speech; B. performing pitch detection on the voiced frames; C. converting the voiced frames after pitch detection; D. extracting voiced-frame parameters from the converted pitch; E. computing the extracted voiced-frame parameters to obtain a voiced frame, and synthesizing the converted speech from it. That application proposes a high-quality voice conversion scheme of moderate computational cost, but its shortcoming is this: it decomposes the speech to be converted into unvoiced and voiced parts, further divides the voiced part into pitch, energy, LPC and LSF coefficients for conversion, and thereby adds the measurement of energy, which increases measurement difficulty and error and easily leads to unsatisfactory quality of the converted speech.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art, in which voice conversion systems strictly constrain the sentences a user may utter, require a large amount of training data, and still yield unsatisfactory converted speech quality, by providing an asymmetric voice conversion method based on deep neural network feature mapping. The technical scheme of the invention addresses the sharp degradation of system performance that voice conversion systems face in practical environments under non-parallel data and scarce data volume. It studies these two relatively independent problems comprehensively within a unified theoretical framework: a deep neural network performs unsupervised training on the raw data and distills the high-order statistical characteristics contained therein, and supervised feedforward prediction training is carried out on this basis, finally improving the generalization ability of the voice conversion system in practical environments.
The basic principle of the invention is as follows: for the non-parallel data of source speech and target speech, the pre-training capability of a deep neural network is first used to model them probabilistically, distilling the high-order statistical characteristics contained in the speech signals to provide a candidate space of network coefficients; second, incremental learning is performed on a small amount of parallel (symmetric) data, and the network weight coefficients are adjusted according to the optimized transmission error, thereby realizing the mapping of feature parameters.
Specifically, the present invention is realized by the following technical scheme, comprising the following steps:
1) On the basis of the existing source speech signals, source speech signals with the same semantic content as the collected target speech signals are recorded, forming a training speech set that comprises asymmetric (non-parallel) source speech signals, symmetric (parallel) source speech signals, and target speech signals;
The harmonic plus stochastic model is used to decompose the training speech signals, obtaining respectively the pitch contour of the asymmetric source speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters; the pitch contour of the symmetric source speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters; and the pitch contour of the target speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters;
According to the pitch contours of the symmetric source speech and of the target speech, Gaussian models of the source pitch frequency and of the target pitch frequency are established;
2) Dimensionality reduction is applied respectively to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech, the symmetric source speech and the target speech: the vocal tract parameters are converted into linear prediction parameters, which in turn yield line spectral frequency (LSF) parameters suitable for voice conversion;
3) The LSF parameters of the asymmetric source speech obtained in step 2) are used to train a deep belief network in an unsupervised manner, yielding a trained deep belief network;
4) The dynamic time warping algorithm is used to align the LSF parameters of the symmetric source speech obtained in step 2) with those of the target speech;
5) The aligned LSF parameters of the symmetric source speech and of the target speech are used to perform incremental supervised training of a deep feedforward prediction network, yielding a trained deep feedforward prediction network;
6) The harmonic plus stochastic model is used to decompose the source speech signal to be converted, obtaining its pitch contour and the amplitude and phase values of its harmonic vocal tract spectrum parameters;
Dimensionality reduction is applied to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the source speech to be converted, converting the vocal tract parameters into linear prediction parameters and then into LSF parameters suitable for voice conversion. The deep belief network trained in step 3) then performs feature mapping on the LSF parameters of the source speech to be converted, yielding its new feature parameters. Finally, the deep feedforward prediction network trained in step 5) is regarded as a general functional mapping and applied to these new feature parameters, yielding the LSF parameters of the converted speech;
Using the Gaussian models of the source and target pitch frequencies obtained in step 1), a Gaussian transformation is applied to the pitch contour of the source speech to be converted, yielding the pitch contour of the converted speech;
7) The LSF parameters of the converted speech are converted back into harmonic plus noise model coefficients, and speech synthesis is carried out together with the pitch contour of the converted speech, yielding the converted speech signal.
The above technical scheme is further characterized in that, in step 1), the harmonic plus stochastic model decomposes the original speech signal as follows:
1-1) The original speech signal is divided into frames of fixed duration, and the pitch frequency is estimated by the autocorrelation method;
1-2) For voiced signals, a maximum voiced frequency component is set to delimit the main energy regions of the harmonic and stochastic components; a least-squares algorithm then estimates the discrete amplitude and phase values of the harmonic vocal tract spectrum parameters;
1-3) For unvoiced signals, classical linear prediction analysis is applied directly, yielding the linear prediction coefficients.
The above technical scheme is further characterized in that, in step 2), the vocal tract parameters are converted into linear prediction parameters and then into LSF parameters suitable for voice conversion as follows:
2-1) The amplitude values of the discrete harmonic vocal tract spectrum parameters are squared and regarded as sampled values of the discrete power spectrum;
2-2) From the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients is obtained; solving this equation yields the linear prediction coefficients;
2-3) The linear prediction coefficients are converted into line spectral frequency coefficients.
The above technical scheme is further characterized in that, in step 3), the unsupervised training of the deep belief network takes one of the following two forms:
3-1) Every two adjacent layers form a restricted Boltzmann machine, which is trained by the contrastive divergence method; all the Boltzmann machines are then stacked to form a complete deep belief network, whose set of weight coefficients constitutes the candidate space of network parameters;
3-2) Two deep feedforward networks are spliced back to back to form a combined network with an auto-encoder/decoder structure; the LSF coefficients of the speech signal are placed simultaneously at the input and output ends, and the network structural parameters are learned under a regularized stochastic gradient descent criterion.
The above technical scheme is further characterized in that, in step 4), the alignment criterion is: for two feature parameter sequences of unequal length, the dynamic time warping algorithm nonlinearly maps the time axis of one onto the time axis of the other, realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, finally yielding the time matching function.
The above technical scheme is further characterized in that, in step 5), the incremental supervised training of the deep feedforward prediction network proceeds as follows:
5-1) A network output layer with soft, amplitude-limited output characteristics is added on top of the deep belief network trained in step 3), forming a deep feedforward network;
5-2) The aligned LSF coefficients of the symmetric source speech are processed in the manner of step 3-2), and the outputs of the middle layer of the network are extracted as the new feature parameters of the symmetric source speech;
5-3) The new feature parameters of the symmetric source speech and the LSF coefficients of the target speech serve as the input and output of the deep feedforward network; the network weight coefficients are adjusted under the criterion of minimizing the back-propagated error, completing the incremental training of the network.
The above technical scheme is further characterized in that, in step 7), speech synthesis proceeds as follows:
7-1) The amplitude and phase values of the discrete harmonic vocal tract spectrum parameters of the voiced signal are used as the amplitudes and phases of sinusoids, which are superposed to obtain the reconstructed voiced signal; interpolation and phase compensation are used so that the reconstructed voiced signal exhibits no distortion in its time-domain waveform;
7-2) For the unvoiced signal, a white noise signal is passed through an all-pole filter, yielding the reconstructed unvoiced signal;
7-3) The reconstructed voiced signal and the reconstructed unvoiced signal are superposed, yielding the converted speech signal.
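A minimal sketch of synthesis steps 7-1) and 7-2) follows. This is our own simplified illustration, not the patent's implementation: the interpolation and phase compensation of step 7-1) are omitted, the all-pole filtering is written as an explicit recursion, and all function names are ours.

```python
import numpy as np

def synth_voiced(amps, phases, f0, fs, n_samples):
    """Reconstruct a voiced frame as a superposition of harmonic
    sinusoids with the given amplitudes and phases (step 7-1)."""
    n = np.arange(n_samples)
    w0 = 2 * np.pi * f0 / fs
    s = np.zeros(n_samples)
    for l, (A, th) in enumerate(zip(amps, phases), start=1):
        s += A * np.cos(l * w0 * n + th)
    return s

def synth_unvoiced(lpc, n_samples, gain=1.0, seed=0):
    """Reconstruct an unvoiced frame: white noise through the all-pole
    filter 1/A(z), where lpc = [1, a_1, ..., a_p] (step 7-2)."""
    rng = np.random.default_rng(seed)
    e = gain * rng.standard_normal(n_samples)
    p = len(lpc) - 1
    y = np.zeros(n_samples)
    for i in range(n_samples):
        acc = e[i]
        for k in range(1, p + 1):
            if i - k >= 0:
                acc -= lpc[k] * y[i - k]
        y[i] = acc
    return y
```

Superposing the two outputs frame by frame (with overlap-add in practice) gives the converted signal of step 7-3).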
The beneficial effects of the invention are as follows: the asymmetric voice conversion method based on deep neural network feature mapping makes full use of the common features of the "non-parallel data" and "scarce data volume" problems, designing a data collection and integration scheme that covers both situations. On this basis, a deep belief network learns the structural features of the non-parallel data and optimizes the network coefficient structure, which serves as the initial parameter values of a deep feedforward prediction network; the network structural parameters are then further optimized by backward conduction during incremental learning on a small amount of parallel data, realizing the mapping of the speaker's individual feature parameters.
Brief description of the drawings
Fig. 1 is a block diagram of the training and conversion stages of the voice conversion system according to the invention;
Fig. 2 is a schematic diagram of the pre-training modes of the deep belief network according to the invention.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings and examples.
In order to effectively handle the "non-parallel data" and "scarce data volume" problems of practical environments, the invention designs the following data collection and integration scheme for subsequent operations. In most application scenarios, the collection of the target speaker's voice data is generally passive and therefore difficult, often resulting in a scarce data volume; by contrast, the collection of the source speaker's voice data is more active, so it is relatively easy and the data volume is comparatively sufficient. Therefore, on the basis of the existing source speech data, the source speaker records, according to the collected target speaker's speech, a small amount of additional reference voice data with the same semantic content (the source speaker records a small amount of speech incrementally). In this way, although the source and target data are asymmetric overall, they contain a small amount of parallel data.
Therefore, with reference to Figs. 1 and 2, the asymmetric voice conversion method based on deep neural network feature mapping of this embodiment comprises a training stage and a conversion stage; steps 1) to 5) below constitute the training stage, and steps 6) to 7) the conversion stage:
1) On the basis of the existing source speech signals, source speech signals with the same semantic content as the collected target speech signals are recorded, forming a training speech set that comprises asymmetric (non-parallel) source speech signals, symmetric (parallel) source speech signals, and target speech signals.
The harmonic plus stochastic model is used to decompose the training speech signals, obtaining respectively the pitch contour of the asymmetric source speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters; the pitch contour of the symmetric source speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters; and the pitch contour of the target speech and the amplitude and phase values of its harmonic vocal tract spectrum parameters.
The concrete steps of decomposing the original speech signal with the harmonic plus stochastic model are as follows:
A. The speech signal is divided into frames of length 20 ms with a frame shift of 10 ms.
B. In every frame the pitch frequency is estimated by the autocorrelation method; if the frame is unvoiced, the pitch frequency is set to zero.
C. For a voiced frame (i.e., a frame whose pitch frequency is non-zero), the speech signal s_h(n) is assumed to be formed by the superposition of a series of sinusoids:

s_h(n) = Σ_{l=-L}^{L} C_l e^{j l ω_0 n}   (1)

where L is the number of sinusoids, {C_l} are the complex amplitudes of the sinusoids, ω_0 is the pitch frequency, and n denotes the n-th sample of the speech. Let s_h denote the vector formed by the samples of s_h(n) within one frame; then formula (1) can be rewritten as

s_h = B Δ,   with B_{n,l} = e^{j l ω_0 n} and Δ = [C_{-L}, …, C_L]^T   (2)
where N denotes the total number of samples in one frame. The {C_l} can be determined by the least-squares algorithm, i.e., by minimizing

ε = Σ_{n=-N/2}^{N/2} w²(n) · (s(n) - s_h(n))²   (3)

where s(n) is the real speech signal, w(n) is a window function (generally a Hamming window), and ε denotes the error. Writing the window function in matrix form as

W = diag(w(-N/2), …, w(N/2))   (4)

the optimal Δ is obtained as follows:

W B Δ = W s ⇒ Δ_opt = (B^H W^H W B)^{-1} B^H W^H W s   (5)

where the superscript H denotes conjugate transposition, and s is the vector formed by the samples of the real speech signal s(n) within one frame.
D. Having obtained {C_l}, the harmonic amplitude and phase values are as follows:

AM_l = 2|C_l| = 2|C_{-l}|,   θ_l = arg C_l = -arg C_{-l}   (6)
E. For unvoiced frames, the raw speech frame is analyzed directly with classical linear prediction analysis, yielding the corresponding linear prediction coefficients.
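Steps B and C above can be sketched in code. The following is a simplified illustration under our own naming: frame-wise autocorrelation pitch estimation, and the weighted least-squares fit of the harmonic amplitudes corresponding to eq. (5). The voicing threshold and search range are assumed values, not taken from the patent.

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, f0_min=60.0, f0_max=400.0, v_threshold=0.3):
    """Estimate the pitch of one frame by the autocorrelation method (step B).
    Returns 0.0 for frames judged unvoiced, as the patent prescribes."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                       # normalise so ac[0] == 1
    lo = int(fs / f0_max)                 # shortest admissible pitch period
    hi = min(int(fs / f0_min), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag if ac[lag] > v_threshold else 0.0

def harmonic_ls(frame, f0, fs, n_harm):
    """Weighted least-squares fit of the complex harmonic amplitudes C_l
    (eqs. (1)-(5)): columns of B are e^{j l w0 n}, W is a Hamming window."""
    N = len(frame)
    n = np.arange(N) - N // 2
    w0 = 2 * np.pi * f0 / fs
    l = np.arange(-n_harm, n_harm + 1)
    B = np.exp(1j * np.outer(n, l * w0))          # N x (2L+1) basis matrix
    W = np.diag(np.hamming(N))
    C, *_ = np.linalg.lstsq(W @ B, W @ frame, rcond=None)
    amp = 2 * np.abs(C[n_harm + 1:])              # AM_l = 2|C_l|, eq. (6)
    phase = np.angle(C[n_harm + 1:])              # theta_l = arg C_l
    return amp, phase
```

For a 200 Hz sinusoid sampled at 8 kHz, `estimate_f0_autocorr` recovers 200 Hz and `harmonic_ls` recovers a first-harmonic amplitude of 1.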
Since the pitch contour of the symmetric source speech and the pitch contour of the target speech can be considered to obey single Gaussian distributions, the Gaussian models of the source pitch frequency and of the target pitch frequency can be established from these two pitch contours.
From the above Gaussian models the model parameters can be estimated, namely the mean μ_y and variance σ_y of the Gaussian model of the source pitch frequency, and the mean μ_x and variance σ_x of the Gaussian model of the target pitch frequency.
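The single-Gaussian pitch transformation implied above can be sketched as a conventional mean-variance mapping. This is our own minimal illustration (names are ours; working directly on F0 rather than log-F0, and using the standard deviation in the mapping, are our simplifications):

```python
import numpy as np

def train_f0_gauss(f0_track):
    """Fit a single Gaussian to the voiced (non-zero) F0 values of a contour."""
    voiced = f0_track[f0_track > 0]
    return voiced.mean(), voiced.std()

def convert_f0(f0_track, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Gaussian mean-variance mapping of the pitch contour;
    unvoiced frames (F0 == 0) pass through unchanged."""
    return np.where(f0_track > 0,
                    mu_tgt + (sigma_tgt / sigma_src) * (f0_track - mu_src),
                    0.0)
```

By construction, a source frame at the source mean pitch is mapped exactly to the target mean pitch.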
2) Dimensionality reduction is applied respectively to the amplitude and phase values of the harmonic vocal tract spectrum parameters of the asymmetric source speech, the symmetric source speech and the target speech: the vocal tract parameters are converted into linear prediction parameters, which in turn yield LSF parameters suitable for voice conversion.
The reason for step 2) is that the original harmonic plus noise model parameters are of high dimension and inconvenient for subsequent computation, so their dimensionality must be reduced. Since the pitch contour is a one-dimensional parameter, the main objects of dimensionality reduction are the vocal tract amplitude spectrum parameters and the phase parameters. The goal of the reduction is to convert the vocal tract parameters into classical linear prediction parameters and then into line spectral frequency parameters suitable for a voice conversion system. The solution procedure is as follows:
A. The L discrete amplitude values AM_l are squared and regarded as sampled values PW(ω_l) of the discrete power spectrum, where ω_l denotes the frequency value at the l-th integer multiple of the pitch frequency.
B. By the Wiener-Khinchin theorem, the autocorrelation function and the power spectral density function form a Fourier transform pair:

R(n) = Σ_l PW(ω_l) e^{j ω_l n}

A preliminary estimate of the linear prediction coefficients can therefore be obtained by solving the Toeplitz (Yule-Walker) equations:

Σ_{i=0}^{p} a_i R(n-i) = 0,   n = 1, …, p,   with a_0 = 1   (7)
where a_1, a_2, …, a_p are the coefficients of the p-th order linear prediction filter A(z), and R_0 to R_p are the values of the autocorrelation function at the first p integer discrete points.
C. The all-pole model represented by the p-th order linear prediction coefficients is converted into a time-domain impulse response function h*[n]:

h*[n] = (1/L) Re{ Σ_l (1 / A(e^{j ω_l})) e^{j ω_l n} }   (8)

where A(e^{j ω_l}) = A(z)|_{z = e^{j ω_l}} = 1 + a_1 z^{-1} + a_2 z^{-2} + … + a_p z^{-p}. It can be proved that h* and the estimated autocorrelation sequence R* satisfy:
Σ_{i=0}^{p} a_i R*(n-i) = h*[-n]   (9)
When the Itakura-Saito distance is minimized, the true R and the estimated R* are related as follows:

Σ_{i=0}^{p} a_i R*(n-i) = Σ_{i=0}^{p} a_i R(n-i)   (10)
D. Formula (9) is therefore substituted into formula (10), and formula (7) is re-estimated, giving:

Σ_{i=0}^{p} a_i R(n-i) = h*[-n]   (11)
E. The error is assessed with the Itakura-Saito criterion; if it exceeds the set threshold, steps C to E are repeated; otherwise the iteration stops.
The resulting linear prediction coefficients are converted into line spectral frequency parameters by simultaneously solving the two equations below:

P(z) = A(z) + z^{-(p+1)} A(z^{-1})
Q(z) = A(z) - z^{-(p+1)} A(z^{-1})   (12)
3) The LSF parameters of the asymmetric source speech obtained in step 2) are used to train a deep belief network in an unsupervised manner, yielding a trained deep belief network.
The above step is the "pre-training", and this pre-training process takes two forms. The first (as shown in Fig. 2a): for a complete deep belief network, in bottom-up order, every two adjacent network layers form a restricted Boltzmann machine (the lower layer is called the input layer, the upper layer the hidden layer, with undirected connections between them); driven by the raw input data, the structural parameters between layers are learned with the contrastive divergence method. Moreover, the data transfer between the stacked restricted Boltzmann machines satisfies the following condition: the hidden-layer output of the lower Boltzmann machine serves as the input-layer input of the Boltzmann machine above it. Iterating step by step in this manner continues until all the designed network structural parameters have been pre-trained. The second form (as shown in Fig. 2b): two deep feedforward networks are spliced back to back into a combined network with an auto-encoder/decoder structure; the LSF parameters of the speech signal are placed simultaneously at the input and output ends of this network, and the network structural parameters are pre-trained under a regularized stochastic gradient descent criterion.
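The first pre-training form (restricted Boltzmann machines trained with contrastive divergence, Fig. 2a) can be sketched as follows. This is a minimal Bernoulli-Bernoulli RBM with one-step contrastive divergence (CD-1); real-valued LSF inputs would normally call for Gaussian visible units, and all names and hyper-parameters here are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli-Bernoulli restricted Boltzmann machine, CD-1 training."""
    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible bias
        self.c = np.zeros(n_hid)   # hidden bias
        self.lr = lr

    def cd1(self, v0):
        """One CD-1 update on a batch v0 (shape: batch x n_vis);
        returns the mean squared reconstruction error."""
        ph0 = sigmoid(v0 @ self.W + self.c)            # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ self.W.T + self.b)          # reconstruction
        ph1 = sigmoid(pv1 @ self.W + self.c)           # negative phase
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - pv1) ** 2))
```

Stacking follows the condition stated above: the hidden probabilities `sigmoid(data @ rbm.W + rbm.c)` of a trained RBM become the input data of the RBM above it.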
4) The dynamic time warping algorithm is used to align the LSF parameters of the symmetric source speech obtained in step 2) with those of the target speech.
So-called "alignment" means making the corresponding source and target line spectral frequencies attain the minimum distortion distance under the chosen distortion criterion. The purpose is to associate the source and target speakers' feature sequences at the parameter level, facilitating the subsequent statistical model in learning the mapping rules between them.
The alignment criterion is: for two feature parameter sequences of unequal length, the dynamic time warping algorithm nonlinearly maps the time axis of one onto the time axis of the other, realizing a one-to-one matching relationship; during the alignment of the existing parameter sets, a preset cumulative distortion function is iteratively optimized with a restricted search region, finally yielding the time matching function.
The dynamic time warping algorithm is briefly summarized as follows. For the utterance of the same sentence, suppose the source speaker's acoustic feature parameter sequence is

X = [x_1, x_2, …, x_{N_x}]

and the target speaker's feature parameter sequence is

Y = [y_1, y_2, …, y_{N_y}]

with N_x ≠ N_y. Taking the source speaker's feature parameter sequence as the reference template, the dynamic time warping algorithm searches for the time warping function

n_x = φ(n_y)

that nonlinearly maps the time axis n_y of the target feature sequence onto the time axis n_x of the source feature parameter sequence, so that the total cumulative distortion is minimized; mathematically this can be expressed as

D = min_{φ(n_y)} Σ_{n_y=1}^{N_y} d(y_{n_y}, x_{φ(n_y)})   (13)

where d(y_{n_y}, x_{φ(n_y)}) denotes some measure of distance between the target speaker's feature parameter at frame n_y and the source speaker's feature parameter at frame φ(n_y). During the warping process, the warping function φ(·) must satisfy the following constraints, with boundary conditions and continuity conditions respectively:

φ(1) = 1,   φ(N_y) = N_x   (14)

φ(n_y + 1) - φ(n_y) ∈ {0, 1, 2}   (15)
Dynamic time warping is an optimization algorithm: it turns a multi-stage decision process into a sequence of single-stage decision processes, i.e., into a series of sub-problems decided one by one, so as to simplify the computation. The warping generally starts from the last stage, i.e., it is a backward process, and its recursion can be expressed as:
D(n_y + 1, n_x) = d(n_y + 1, n_x) + min[ D(n_y, n_x)·g(n_y, n_x), D(n_y, n_x − 1), D(n_y, n_x − 2) ]   (16)
where g(n_y, n_x) enforces the constraint that the values of n_y and n_x satisfy the time-alignment conditions of the warping function.
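As an illustration only (not part of the patent), the backward recursion of formula (16), with predecessor columns n_x, n_x − 1 and n_x − 2, can be sketched in NumPy; all function and variable names here are assumptions:

```python
import numpy as np

def dtw_align(X, Y):
    """Align target frames Y (shape N_y x d) to source frames X (N_x x d).

    Returns phi, an integer array of length N_y mapping each target frame
    n_y to a source frame phi[n_y], minimising the cumulative Euclidean
    distortion under the boundary and continuity constraints.
    """
    Nx, Ny = len(X), len(Y)
    d = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)  # (Ny, Nx) frame distances
    D = np.full((Ny, Nx), np.inf)
    back = np.zeros((Ny, Nx), dtype=int)
    D[0, 0] = d[0, 0]                                          # boundary: phi(1) = 1
    for i in range(1, Ny):
        for j in range(Nx):
            # continuity: predecessor column j, j-1 or j-2 (cf. formula (16))
            for k in (j, j - 1, j - 2):
                if k >= 0 and D[i - 1, k] + d[i, j] < D[i, j]:
                    D[i, j] = D[i - 1, k] + d[i, j]
                    back[i, j] = k
    phi = np.zeros(Ny, dtype=int)
    phi[-1] = Nx - 1                                           # boundary: phi(N_y) = N_x
    for i in range(Ny - 1, 0, -1):
        phi[i - 1] = back[i, phi[i]]
    return phi
```

The resulting index array phi plays the role of the time-matching function: pairing y_{n_y} with x_{phi[n_y]} gives the one-to-one frame correspondence used for the supervised training data.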
5) Use the aligned linear spectral frequency parameters of the symmetric source speech signal and of the target speech signal to perform incremental supervised training of the deep feedforward prediction network, obtaining a trained deep feedforward prediction network.
The above incremental training of the deep feedforward network on a small amount of symmetric data comprises three aspects. First, an output layer with amplitude-limited ("soft") output characteristics is added on top of the trained deep belief network, forming a deep feedforward network. Second, the source linear spectral frequency parameters are fed to both the input and the output of the combined network with encoder-decoder structure; on the basis of the pre-training, the output data of the network's middle layer (as shown in Figure 2b) are extracted and treated as new feature parameters. These new feature parameters retain the higher-order statistics of the original linear spectral frequency parameters and therefore discriminate better. Third, the new source feature parameters and the target linear spectral frequency coefficients are used as the input and output of the deep feedforward network, and the network weight coefficients are adjusted in a supervised manner under the criterion of minimizing the back-propagated transmission error, completing the incremental training of the network.
6) Decompose the source speech signal to be converted with the harmonic plus stochastic model, obtaining the fundamental frequency track of the source speech signal to be converted and the amplitude and phase values of its harmonic vocal-tract spectrum parameters. The technical details are identical to those of step 1).
Apply dimension reduction to the amplitude and phase values of the harmonic vocal-tract spectrum parameters of the source speech signal to be converted, converting the vocal-tract parameters into linear prediction parameters and then producing the linear spectral frequency parameters suitable for voice conversion. The technical details are identical to those of step 2).
Then the deep belief network trained in step 3) performs feature mapping on the linear spectral frequency parameters of the source speech signal to be converted, yielding its new feature parameters; finally, the deep feedforward prediction network trained in step 5) is regarded as a general mapping function and applied to these new feature parameters, producing the linear spectral frequency parameters of the converted speech signal. Specifically, the linear spectral frequency parameters of the source speech signal to be converted are placed at the input and output of the combined network with encoder-decoder structure, and the middle-layer parameters are extracted as new feature parameters; the trained deep feedforward network then maps the source feature parameters, i.e. the new feature parameters of the source speech signal to be converted are provided as input to this model for prediction, and the network output finally gives the linear spectral frequency parameters of the converted speech signal.
Using the Gaussian model of the source fundamental frequency and the Gaussian model of the target fundamental frequency obtained in step 1), a Gaussian transformation is applied to the fundamental frequency track of the source speech signal to be converted, yielding the fundamental frequency track of the converted speech signal. The fundamental frequency transfer function is:
log f0′ = μ_y + (σ_y / σ_x)·(log f0 − μ_x)   (17)
where f0′ is the fundamental frequency after conversion and f0 is the fundamental frequency of the source speech; μ_x, σ_x and μ_y, σ_y are the means and standard deviations of the source and target log-f0 Gaussian models, respectively.
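Formula (17) amounts to a per-frame mean/variance normalization in the log-f0 domain. A minimal sketch (illustrative names; the convention that unvoiced frames are marked f0 = 0 is an assumption):

```python
import numpy as np

def convert_f0(f0_track, mu_x, sigma_x, mu_y, sigma_y):
    """Formula (17): map a source f0 track onto the target log-f0 Gaussian.

    mu_x, sigma_x / mu_y, sigma_y: mean and standard deviation of log f0
    for the source / target speaker (from the Gaussian models of step 1).
    Unvoiced frames, marked f0 == 0, are passed through unchanged.
    """
    f0_track = np.asarray(f0_track, dtype=float)
    out = np.zeros_like(f0_track)
    voiced = f0_track > 0
    out[voiced] = np.exp(mu_y + (sigma_y / sigma_x) * (np.log(f0_track[voiced]) - mu_x))
    return out
```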
7) Convert the linear spectral frequency parameters of the converted speech signal back into harmonic plus noise model coefficients, then perform speech synthesis together with the converted fundamental frequency track, obtaining the converted speech signal. The detailed steps are as follows:
A. Using the obtained AM_l, f0, θ_l, synthesize the speech of the k-th frame according to the sinusoidal-model definition, namely:

s^(k)(n) = Σ_{l=1}^{L^(k)} AM_l^(k) · cos( 2π·l·f0^(k)·n + θ_l^(k) )   (18)
B. To reduce the error produced by inter-frame alternation, the whole utterance is synthesized by the overlap-add method; that is, for any two adjacent frames:

s(kN + m) = ((N − m)/N)·s^(k)(m) + (m/N)·s^(k+1)(m − N),  0 ≤ m ≤ N   (19)
where N denotes the number of samples contained in one frame of speech. Interpolation and phase compensation are used so that the reconstructed voiced signal exhibits no distortion in its time-domain waveform.
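Formulas (18) and (19) can be sketched together as follows (illustrative NumPy, not the patent's implementation; f0 is assumed to be given in cycles per sample, and each frame is generated over 2N + 1 samples centred on its frame boundary so that the triangular weights reproduce the cross-fade of formula (19)):

```python
import numpy as np

def synth_frame(AM, theta, f0, n):
    """Formula (18): sum of L harmonics at sample indices n (f0 in cycles/sample)."""
    l = np.arange(1, len(AM) + 1)[:, None]                     # harmonic numbers 1..L
    return np.sum(AM[:, None] * np.cos(2 * np.pi * l * f0 * n[None, :] + theta[:, None]),
                  axis=0)

def overlap_add(frames, N):
    """Formula (19): triangular cross-fade of adjacent frames over N samples.

    frames: list of (AM, theta, f0) tuples, one per frame of N samples.
    """
    K = len(frames)
    out = np.zeros((K + 1) * N + 1)
    n = np.arange(-N, N + 1)                                   # each frame spans 2N+1 samples
    w = 1.0 - np.abs(n) / N                                    # weights (N - |n|)/N
    for k, (AM, theta, f0) in enumerate(frames):
        out[k * N : k * N + 2 * N + 1] += w * synth_frame(AM, theta, f0, n)
    return out
```

At any sample between two frame centres the two triangular weights sum to one, so when adjacent frames carry identical parameters the cross-fade reproduces the frame signal exactly, which is the point of formula (19).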
C. For unvoiced frames, an approximate reconstruction is obtained by passing a white-noise signal through an all-pole filter whose coefficients are the linear prediction coefficients obtained in step 1)e of the training stage.
D. Adding the reconstructed voiced signal and the reconstructed unvoiced signal yields the synthetic speech.
The asymmetric voice conversion method based on deep neural network feature mapping of the present invention can be used to disguise voice identity in secure communication: voice conversion technology alters certain parameters of a speaker's voice according to a fixed rule, and the receiving end applies the inverse transformation to resynthesize the original voice, so that an eavesdropper during transmission hears the voice of a different speaker, achieving speaker disguise. It can also be applied in multimedia entertainment, for example film dubbing, especially dubbing into another language: the dubbing actor is usually not the original performer, so the dub often differs greatly from the performer's personal characteristics and the result is unsatisfactory; if the dub is further voice-converted so that it regains the performer's personal characteristics, the dubbing result is far more satisfactory. It is likewise useful for speech-enhancement systems, especially for patients whose vocal organs, such as the vocal cords, are diseased or damaged: their speech quality is severely degraded and difficult for others to understand, seriously affecting normal communication; converting such badly degraded speech into clearly intelligible speech would greatly ease these patients' daily lives.
Although the present invention is disclosed above by way of preferred embodiments, the embodiments are not intended to limit the invention. Any equivalent change or modification made without departing from the spirit and scope of the invention likewise falls within the protection scope of the invention. The protection scope of the invention shall therefore be defined by the claims of this application.

Claims (7)

1. An asymmetric voice conversion method based on deep neural network feature mapping, characterized by comprising the steps of:
1) on the basis of an existing source speech signal, collecting a source speech signal with the same semantic content as the collected target speech signal, forming training speech signals comprising an asymmetric source speech signal, a symmetric source speech signal and a target speech signal;
decomposing the training speech signals with a harmonic plus stochastic model to obtain, respectively, the fundamental frequency track of the asymmetric source speech signal and the amplitude and phase values of its harmonic vocal-tract spectrum parameters; the fundamental frequency track of the symmetric source speech signal and the amplitude and phase values of its harmonic vocal-tract spectrum parameters; and the fundamental frequency track of the target speech signal and the amplitude and phase values of its harmonic vocal-tract spectrum parameters;
establishing a Gaussian model of the source fundamental frequency and a Gaussian model of the target fundamental frequency from the fundamental frequency tracks of the symmetric source speech signal and of the target speech signal;
2) applying dimension reduction respectively to the amplitude and phase values of the harmonic vocal-tract spectrum parameters of the asymmetric source speech signal, of the symmetric source speech signal and of the target speech signal, converting the vocal-tract parameters into linear prediction parameters and then producing linear spectral frequency parameters suitable for voice conversion;
3) using the linear spectral frequency parameters of the asymmetric source speech signal obtained in step 2) to perform unsupervised training of a deep belief network, obtaining a trained deep belief network;
4) using a dynamic time warping algorithm to align the linear spectral frequency parameters of the symmetric source speech signal and of the target speech signal obtained in step 2);
5) using the aligned linear spectral frequency parameters of the symmetric source speech signal and of the target speech signal to perform incremental supervised training of a deep feedforward prediction network, obtaining a trained deep feedforward prediction network;
6) decomposing the source speech signal to be converted with the harmonic plus stochastic model, obtaining the fundamental frequency track of the source speech signal to be converted and the amplitude and phase values of its harmonic vocal-tract spectrum parameters;
applying dimension reduction to the amplitude and phase values of the harmonic vocal-tract spectrum parameters of the source speech signal to be converted, converting the vocal-tract parameters into linear prediction parameters and producing linear spectral frequency parameters suitable for voice conversion; then using the deep belief network trained in step 3) to perform feature mapping on the linear spectral frequency parameters of the source speech signal to be converted, obtaining its new feature parameters; finally regarding the deep feedforward prediction network trained in step 5) as a general mapping function and applying it to the new feature parameters of the source speech signal to be converted, obtaining the linear spectral frequency parameters of the converted speech signal;
using the Gaussian models of the source and target fundamental frequencies obtained in step 1) to apply a Gaussian transformation to the fundamental frequency track of the source speech signal to be converted, obtaining the fundamental frequency track of the converted speech signal;
7) converting the linear spectral frequency parameters of the converted speech signal back into harmonic plus noise model coefficients, then performing speech synthesis together with the converted fundamental frequency track, obtaining the converted speech signal.
2. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 1), the process of decomposing the original speech signal with the harmonic plus stochastic model is as follows:
1-1) dividing the original speech signal into frames of fixed duration and estimating the fundamental frequency with the correlation method;
1-2) for voiced signals, setting a maximum voiced frequency component to delimit the main energy regions of the harmonic and stochastic components, then estimating the amplitude and phase values of the discrete harmonic vocal-tract spectrum parameters with a least-squares algorithm;
1-3) for unvoiced signals, analyzing them directly with the classical linear prediction analysis method to obtain the linear prediction coefficients.
3. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 2), the process of converting the vocal-tract parameters into linear prediction parameters and then producing the linear spectral frequency parameters suitable for voice conversion is as follows:
2-1) squaring the amplitude values of the discrete harmonic vocal-tract spectrum parameters and regarding them as sampled values of the discrete power spectrum;
2-2) obtaining, from the one-to-one correspondence between the power spectral density function and the autocorrelation function, a Toeplitz matrix equation in the linear prediction coefficients, and solving this equation to obtain the linear prediction coefficients;
2-3) converting the linear prediction coefficients into linear spectral frequency coefficients.
4. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 3), the unsupervised training of the deep belief network is divided into the following two parts:
3-1) forming a restricted Boltzmann machine from each pair of adjacent layers and training it by the contrastive divergence method, then stacking all the Boltzmann machines to form a complete deep belief network; the weight coefficients set in this network form a candidate space of network parameters;
3-2) splicing two deep feedforward networks back to back to form a combined network with an auto-encoder/decoder structure, placing the linear spectral frequency coefficients of the speech signal simultaneously at the input and the output, and learning the network structure parameters under a regularized stochastic gradient descent criterion.
5. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 4), the alignment criterion is: for two feature-parameter sequences of unequal length, the dynamic time warping algorithm nonlinearly maps the time axis of one sequence onto the time axis of the other, establishing a one-to-one matching relationship; during alignment of the available parameter sets, a preset cumulative distortion function is iteratively optimized over a restricted search region, finally yielding the time-matching function.
6. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 4, characterized in that, in step 5), the process of incremental supervised training of the deep feedforward prediction network is as follows:
5-1) adding, on top of the deep belief network trained in step 3), an output layer with amplitude-limited soft output characteristics, thereby forming a deep feedforward network;
5-2) processing the aligned linear spectral frequency coefficients of the symmetric source speech signal in the manner of step 3-2) and extracting the middle-layer parameters of the network as the new feature parameters of the symmetric source speech signal;
5-3) using the new feature parameters of the symmetric source speech signal and the linear spectral frequency coefficients of the target speech signal as the input and output of the deep feedforward network, adjusting the network weight coefficients under the criterion of minimizing the back-propagated transmission error, completing the incremental training of the network.
7. The asymmetric voice conversion method based on deep neural network feature mapping according to claim 1, characterized in that, in step 7), the speech synthesis process is as follows:
7-1) using the amplitude and phase values of the discrete harmonic vocal-tract spectrum parameters of the voiced signal as the amplitudes and phases of sinusoidal signals and superposing them to obtain the reconstructed voiced signal; interpolation and phase compensation are used so that the reconstructed voiced signal exhibits no distortion in its time-domain waveform;
7-2) passing a white-noise signal through an all-pole filter to obtain the reconstructed unvoiced signal;
7-3) superposing the reconstructed voiced signal and the reconstructed unvoiced signal to obtain the converted speech signal.
CN201310468769.1A 2013-10-09 2013-10-09 The asymmetrical voice conversion method mapped based on deep neural network feature Active CN103531205B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310468769.1A CN103531205B (en) 2013-10-09 2013-10-09 The asymmetrical voice conversion method mapped based on deep neural network feature


Publications (2)

Publication Number Publication Date
CN103531205A true CN103531205A (en) 2014-01-22
CN103531205B CN103531205B (en) 2016-08-31

Family

ID=49933157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310468769.1A Active CN103531205B (en) 2013-10-09 2013-10-09 The asymmetrical voice conversion method mapped based on deep neural network feature

Country Status (1)

Country Link
CN (1) CN103531205B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN104464744A (en) * 2014-11-19 2015-03-25 河海大学常州校区 Cluster voice transforming method and system based on mixture Gaussian random process
CN104867489A (en) * 2015-04-27 2015-08-26 苏州大学张家港工业技术研究院 Method and system for simulating reading and pronunciation of real person
CN105005783A (en) * 2015-05-18 2015-10-28 电子科技大学 Method of extracting classification information from high dimensional asymmetric data
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
CN106203624A (en) * 2016-06-23 2016-12-07 上海交通大学 Vector Quantization based on deep neural network and method
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN108417207A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 A kind of depth mixing generation network self-adapting method and system
CN109147806A (en) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 Speech quality Enhancement Method, device and system based on deep learning
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium
CN109923560A (en) * 2016-11-04 2019-06-21 谷歌有限责任公司 Neural network is trained using variation information bottleneck
CN110085255A (en) * 2019-03-27 2019-08-02 河海大学常州校区 Voice conversion learns Gaussian process regression modeling method based on depth kernel
CN110164414A (en) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 Method of speech processing, device and smart machine
CN110520835A (en) * 2017-03-28 2019-11-29 高通股份有限公司 The tracking axis during model conversion
CN110992739A (en) * 2019-12-26 2020-04-10 上海乂学教育科技有限公司 Student on-line dictation system
CN111418005A (en) * 2017-11-29 2020-07-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program
CN111524526A (en) * 2020-05-14 2020-08-11 中国工商银行股份有限公司 Voiceprint recognition method and device
WO2020232578A1 (en) * 2019-05-17 2020-11-26 Xu Junli Memory, microphone, audio data processing method and apparatus, and device and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103049792A (en) * 2011-11-26 2013-04-17 微软公司 Discriminative pretraining of Deep Neural Network
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049792A (en) * 2011-11-26 2013-04-17 微软公司 Discriminative pretraining of Deep Neural Network
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103280224A (en) * 2013-04-24 2013-09-04 东南大学 Voice conversion method under asymmetric corpus condition on basis of adaptive algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, Y., CHU, M., CHANG, E., LIU, J., LIU, R.: "Voice conversion with smoothed GMM and MAP adaptation", 《INTERSPEECH》 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464744A (en) * 2014-11-19 2015-03-25 河海大学常州校区 Cluster voice transforming method and system based on mixture Gaussian random process
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN104867489A (en) * 2015-04-27 2015-08-26 苏州大学张家港工业技术研究院 Method and system for simulating reading and pronunciation of real person
CN104867489B (en) * 2015-04-27 2019-04-26 苏州大学张家港工业技术研究院 A kind of simulation true man read aloud the method and system of pronunciation
CN105005783A (en) * 2015-05-18 2015-10-28 电子科技大学 Method of extracting classification information from high dimensional asymmetric data
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105118498B (en) * 2015-09-06 2018-07-31 百度在线网络技术(北京)有限公司 The training method and device of phonetic synthesis model
CN106203624A (en) * 2016-06-23 2016-12-07 上海交通大学 Vector Quantization based on deep neural network and method
CN106203624B (en) * 2016-06-23 2019-06-21 上海交通大学 Vector Quantization and method based on deep neural network
CN106057192A (en) * 2016-07-07 2016-10-26 Tcl集团股份有限公司 Real-time voice conversion method and apparatus
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
US11681924B2 (en) 2016-11-04 2023-06-20 Google Llc Training neural networks using a variational information bottleneck
CN109923560A (en) * 2016-11-04 2019-06-21 谷歌有限责任公司 Neural network is trained using variation information bottleneck
CN110520835B (en) * 2017-03-28 2023-11-24 高通股份有限公司 Tracking axes during model conversion
CN110520835A (en) * 2017-03-28 2019-11-29 高通股份有限公司 The tracking axis during model conversion
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 A kind of phonetics transfer method based on deep learning
CN107545903B (en) * 2017-07-19 2020-11-24 南京邮电大学 Voice conversion method based on deep learning
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN111418005A (en) * 2017-11-29 2020-07-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program
CN111418005B (en) * 2017-11-29 2023-08-11 雅马哈株式会社 Voice synthesis method, voice synthesis device and storage medium
CN108417207A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 A kind of depth mixing generation network self-adapting method and system
CN109147806A (en) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 Speech quality Enhancement Method, device and system based on deep learning
CN109147806B (en) * 2018-06-05 2021-11-12 安克创新科技股份有限公司 Voice tone enhancement method, device and system based on deep learning
CN110164414B (en) * 2018-11-30 2023-02-14 腾讯科技(深圳)有限公司 Voice processing method and device and intelligent equipment
CN110164414A (en) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 Method of speech processing, device and smart machine
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium
CN110085255A (en) * 2019-03-27 2019-08-02 河海大学常州校区 Voice conversion learns Gaussian process regression modeling method based on depth kernel
CN110085255B (en) * 2019-03-27 2021-05-28 河海大学常州校区 Speech conversion Gaussian process regression modeling method based on deep kernel learning
WO2020232578A1 (en) * 2019-05-17 2020-11-26 Xu Junli Memory, microphone, audio data processing method and apparatus, and device and system
CN110992739B (en) * 2019-12-26 2021-06-01 上海松鼠课堂人工智能科技有限公司 Student on-line dictation system
CN110992739A (en) * 2019-12-26 2020-04-10 上海乂学教育科技有限公司 Student on-line dictation system
CN111524526A (en) * 2020-05-14 2020-08-11 中国工商银行股份有限公司 Voiceprint recognition method and device
CN111524526B (en) * 2020-05-14 2023-11-17 中国工商银行股份有限公司 Voiceprint recognition method and voiceprint recognition device

Also Published As

Publication number Publication date
CN103531205B (en) 2016-08-31

Similar Documents

Publication Publication Date Title
CN103531205A (en) Asymmetrical voice conversion method based on deep neural network feature mapping
CN110136731B (en) Cavity causal convolution generation confrontation network end-to-end bone conduction voice blind enhancement method
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN105023580B (en) Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method
CN111081268A (en) Phase-correlated shared deep convolutional neural network speech enhancement method
CN110534120B (en) Method for repairing surround sound error code under mobile network environment
CN102568476B (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN103035236B (en) High-quality voice conversion method based on modeling of signal timing characteristics
CN105139864A (en) Voice recognition method and voice recognition device
CN103117059A (en) Voice signal characteristics extracting method based on tensor decomposition
CN110459241A (en) A kind of extracting method and system for phonetic feature
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN102789779A (en) Speech recognition system and recognition method thereof
CN109065073A (en) Speech-emotion recognition method based on depth S VM network model
CN106782599A (en) The phonetics transfer method of post filtering is exported based on Gaussian process
CN107248414A (en) A kind of sound enhancement method and device based on multiframe frequency spectrum and Non-negative Matrix Factorization
CN114495969A (en) Voice recognition method integrating voice enhancement
CN103886859B (en) Phonetics transfer method based on one-to-many codebook mapping
CN114626424B (en) Data enhancement-based silent speech recognition method and device
Li et al. Speech intelligibility enhancement using non-parallel speaking style conversion with stargan and dynamic range compression
Miao et al. A blstm and wavenet-based voice conversion method with waveform collapse suppression by post-processing
CN115910091A (en) Method and device for separating generated voice by introducing fundamental frequency clues
Liu et al. Spectral envelope estimation used for audio bandwidth extension based on RBF neural network
Zhong A Framework for Piano Online Education Based on Multi-Modal AI Technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20190329

Address after: No. 3, courtyard No. 5, di Kam Road, Haidian District, Beijing

Patentee after: BYZORO NETWORK LTD.

Address before: 213022 Wushan Road, Xinbei District, Changzhou, Jiangsu Province, No. 1

Patentee before: Changzhou Polytechnic College