CN102982809A - Conversion method for sound of speaker - Google Patents

Conversion method for sound of speaker Download PDF

Info

Publication number
CN102982809A
CN102982809A (application CN201210528629.4A); granted publication CN102982809B
Authority
CN
China
Prior art keywords
speaker
characteristic
voice signal
fundamental frequency
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105286294A
Other languages
Chinese (zh)
Other versions
CN102982809B (en)
Inventor
陈凌辉
戴礼荣
凌震华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201210528629.4A priority Critical patent/CN102982809B/en
Publication of CN102982809A publication Critical patent/CN102982809A/en
Application granted granted Critical
Publication of CN102982809B publication Critical patent/CN102982809B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a speaker voice conversion method comprising a training stage and a conversion stage. In the training stage, fundamental frequency features, speaker features and content features are extracted from the training speech signals of a source speaker and a target speaker; a fundamental frequency conversion function is constructed from the fundamental frequency features, and a speaker conversion function is constructed from the speaker features. In the conversion stage, fundamental frequency features and spectral features are extracted from the source speaker's speech signal to be converted; the fundamental frequency conversion function and the speaker conversion function obtained in the training stage are used to convert the fundamental frequency features and speaker features extracted from that signal, yielding converted fundamental frequency features and speaker features; the target speaker's speech is then synthesized from the converted fundamental frequency features and speaker features together with the content features of the signal to be converted. The method is easy to implement, and the converted speech achieves high quality and similarity.

Description

Speaker voice conversion method
Technical field
The invention belongs to the field of signal processing technology and relates to converting one speaker's voice signal, without changing the content information it carries, into a voice signal that is perceived as belonging to another speaker; it relates in particular to a speaker voice conversion method that separates the speaker information in the speech signal from the content information.
Background art
In today's information age, human-computer interaction has long been a research hotspot in computer science, and an efficient and intelligent interaction environment has become a pressing need of current information technology applications. Speech is one of the most important and most convenient means of human communication, and spoken interaction promises to make human-machine interaction as "friendly" as interaction between people. Human-machine dialogue based on speech recognition, speech synthesis and natural language understanding is universally acknowledged to be a difficult and challenging high-tech field, but its application prospects are very bright.
As one of the core technologies of human-computer interaction, speech synthesis has made significant progress in recent years in both technology and applications. Synthesis systems based on large corpora now achieve good quality and naturalness, so more is being demanded of them: diversified speech synthesis covering multiple speakers, multiple speaking styles, multiple emotions and multiple languages. Existing synthesis systems, however, are mostly single-purpose: a system typically includes only one or two speakers, uses a read-aloud or news-broadcast style, and targets one specific language. Such uniform synthetic speech greatly limits the practical application of speech synthesis systems in education, entertainment, toys and elsewhere. Research on diversified speech synthesis has therefore gradually become one of the main directions in the speech synthesis field.
The most direct way to build a synthesis system with multiple speakers, multiple styles and multiple emotions is to record a voice corpus for each person and each style and to build a personalized synthesis system for each of them. Because the workload of building a dedicated corpus for every speaker, every style and every emotion is prohibitive, this approach is impractical. Against this background, speaker voice conversion was proposed. Speaker voice conversion adjusts the parameters that carry speaker characteristic information (fundamental frequency, duration, spectral parameters, etc.) so that an utterance of one person (the source speaker) sounds as if it were spoken by another person (the target speaker), while the meaning expressed by the source speaker remains unchanged. By training on a small amount of recorded speech and converting the source speaker's voice into synthetic speech of the target speaker, speaker voice conversion enables rapid construction of personalized speech synthesis systems.
The main challenges in building a speaker voice conversion system are the similarity and the quality of the converted speech. The currently mainstream approach, voice conversion based on a joint-space Gaussian mixture model, benefits from a statistical modelling framework and therefore has comparatively good robustness and generalization; however, it is a typical feature-mapping method from machine learning that does not exploit the distinctive property of speech signals that speaker information and content information coexist, and statistical modelling brings problems of its own, such as dependence on the amount of data, insufficient modelling accuracy, and the destruction of the fine structure of the acoustic parameters, all of which sharply degrade the converted speech. Another mainstream technique, formant-based frequency warping, uses the speaker's formant structure, which mainly reflects the speaker information, and preserves as much signal detail as possible during conversion, which guarantees the quality of the converted speech; but formant extraction and modelling are difficult, so such methods require much manual intervention and have poor robustness.
In summary, because traditional speaker voice conversion methods lack an effective representation and an effective modelling of the speaker-specific acoustic information in the speech signal, they demand large amounts of modelling data and their conversion functions inevitably also transform the content of the signal; as a result, the quality and similarity of the converted speech are not yet satisfactory.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the invention is the poor speech quality and low similarity of existing speaker voice conversion methods.
(2) Technical scheme
The present invention proposes a speaker voice conversion method for converting a voice signal spoken by a source speaker so that the converted speech sounds as if it were spoken by a target speaker different from the source speaker, characterized in that the method comprises a training stage and a conversion stage, wherein
the training stage comprises:
step A1, extracting fundamental frequency features and spectral features from the training speech signals of the source speaker and the target speaker respectively, the spectral features comprising speaker features and content features;
step A2, constructing a fundamental frequency conversion function from the source speaker's voice to the target speaker's voice according to the fundamental frequency features of the two speakers' training speech signals;
step A3, constructing a speaker conversion function according to the speaker features of the source and target speakers extracted in step A1;
and the conversion stage comprises:
step B1, extracting fundamental frequency features and spectral features from the source speaker's speech signal to be converted, the spectral features comprising speaker features and content features;
step B2, using respectively the fundamental frequency conversion function and the speaker conversion function obtained in the training stage to convert the fundamental frequency features and speaker features extracted from the speech signal to be converted in step B1, obtaining converted fundamental frequency features and speaker features;
step B3, synthesizing the target speaker's speech from the converted fundamental frequency features and speaker features obtained in step B2 and the content features extracted from the speech signal to be converted in step B1.
According to one embodiment of the invention, the method of extracting the fundamental frequency features and spectral features of a speech signal in steps A1 and B1 comprises:
step a1, based on the source-filter model of the speech signal, segmenting the signal into frames of 20-30 ms and extracting the fundamental frequency and spectral parameters from each frame;
step a2, separating the speaker features and content features in the spectral parameters with a neural network. The network adopts a symmetric multi-layer structure with 2K-1 layers in total (K a natural number), comprising: a bottom input layer, which receives the acoustic features to be separated; a top output layer, which outputs the reconstructed acoustic features; and 2K-3 hidden layers in between, each containing several nodes that emulate the processing of neurons. The layers from the input layer up to the K-th hidden layer (counted from the bottom) form the encoding network, which extracts high-level information from the input acoustic features; the K-th hidden layer is the coding layer, whose nodes are divided into two parts, one related to the speaker and the other to the content, their outputs being the speaker features and content features respectively; the hidden layers above the K-th form the decoding network, which reconstructs the acoustic spectral parameters from the high-level speaker features and content features.
According to one embodiment of the invention, step a2 includes training the neural network on a speech database so that it can extract and separate speaker features and content features from acoustic features. Training the neural network comprises:
step b1, initializing the network weights of the neural network by pre-training;
step b2, for the output of each node of the coding layer of the neural network, using a discriminability criterion to measure its discrimination between different speakers and between different contents, and taking the nodes with high discrimination between speakers and low discrimination between contents as speaker-related nodes and the remaining nodes as content-related nodes;
step b3, designing a specific discriminative objective function to fine-tune the network weights, so that the network can separate speaker information from content information in acoustic features.
According to one embodiment of the invention, the speech database is built by the following steps:
step c1, setting up a corpus containing a plurality of sentences;
step c2, recording speech signals of a plurality of speakers reading the sentences in the corpus, building the speech database, and preprocessing the speech signals in the database to remove distortions;
step c3, segmenting the preprocessed speech signals in the database with hidden Markov models, each segment after cutting being treated as a frame, so as to obtain frame-level speaker labels and content labels for each speech signal;
step c4, randomly combining the speech signals of the speech database to construct the training data of the neural network.
(3) Beneficial effects
The speaker voice conversion method of the present invention has the following advantages:
1. The invention is the first to propose using a deep neural network to separate the speaker information and content information in a speech signal, which can serve different speech processing tasks such as speech recognition, speaker recognition and voice conversion.
2. When converting a speaker's voice, the invention considers only the speaker factor and excludes interference from the content factor, so the conversion is easier to realize and the quality and similarity of the converted speech are greatly improved.
3. The separator adopted by the invention needs to be trained only once; after training it can extract speaker features and content features from speech of unseen speakers, so a single training can be reused repeatedly without retraining the model.
Brief description of the drawings
Fig. 1 is a flowchart of the speaker voice conversion method of the present invention;
Fig. 2 is a block diagram of the feature extraction step of the present invention;
Fig. 3 is a schematic diagram of the neural network structure used for feature separation in the present invention;
Fig. 4 is a flowchart of the neural network training of the present invention;
Fig. 5 is a flowchart of database construction in the present invention;
Fig. 6 is a schematic diagram of the discrimination of cepstral features between different speakers and different contents in the present invention;
Fig. 7 is a schematic diagram of the discrimination of the extracted speaker features and content features between different speakers and different contents in the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
From a physiological point of view, existing research has confirmed that when the human brain perceives a speech signal, the perception of speaker information and the perception of spoken content are completed in different regions of the cerebral cortex. This indicates that the brain decomposes speaker and content information at a high level, i.e. the information in a speech signal is separable. Separating speaker information from content information is of great significance for speech signal processing: the separated information can be applied respectively to speaker recognition, speech recognition and other targeted applications.
The essence of speaker voice conversion is to keep the content of what the speaker says unchanged while changing only the identity of the speaker who says it. Based on this consideration, the invention separates the information in the speech signal into speaker features and content features so that the speaker component can be manipulated. In the present invention, "speaker features" refers to features that reflect the speaker's characteristics and distinguish different speakers, and "content features" refers to features that reflect the meaning the speech signal is intended to express.
To this end, the invention uses a technique based on a deep neural network to decompose the acoustic features of the speech signal into speaker features and content features at a high level, so that speaker voice conversion can be realized more simply and completely and the converted speech achieves significantly better quality and similarity.
Fig. 1 is the flowchart of the speaker voice conversion method of the invention. As shown in the figure, the method comprises two stages, a training stage and a conversion stage, which are introduced in turn below.
(1) Training stage
The training stage mainly comprises three steps:
Step A1: feature extraction.
This step extracts features from the training speech signals of the source speaker and the target speaker respectively. The features include fundamental frequency features and spectral features, and in the present invention the spectral features are divided into speaker features and content features.
Step A2: training of the fundamental frequency conversion function.
This step constructs the fundamental frequency conversion function from the source speaker's voice to the target speaker's voice according to the fundamental frequency features of the two speakers' training speech signals.
According to one embodiment, this step computes the mean and variance of the log-domain distribution of the fundamental frequency features of the source and target speakers' training speech, and constructs the fundamental frequency conversion function from these statistics.
Because each speaker's fundamental frequency parameters follow a Gaussian distribution in the log domain, the fundamental frequency conversion is preferably carried out in the present invention with a simple linear transformation in the log domain.
Step A3: training of the spectral conversion function.
This step constructs the speaker conversion function according to the speaker features in the spectral features extracted from the training speech signals of the source and target speakers.
As stated above, voice conversion requires keeping the spoken content unchanged and changing only the speaker information; therefore the invention only needs to train a conversion function for the speaker features (the speaker conversion function).
Because different speakers cannot keep exactly the same duration when recording the same sentence, sentences of different durations must be normalized to the same duration before supervised conversion learning can be performed (feature alignment). The present invention uses the dynamic time warping (DTW) algorithm for duration alignment, and the speaker-feature conversion can be modelled with methods such as a linear regression model or a joint-space Gaussian mixture model.
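The patent does not give pseudo-code for the alignment step; the following minimal sketch (function names and the simple Euclidean local cost are assumptions) illustrates how dynamic time warping can align two feature sequences of different durations before training the speaker-feature conversion function.

    import numpy as np

    def dtw_align(X, Y):
        """Align feature sequences X (Tx, D) and Y (Ty, D) with plain DTW.

        Returns index pairs (i, j) so that X[i] and Y[j] form the
        frame-aligned training pairs used to train the conversion function.
        """
        Tx, Ty = len(X), len(Y)
        # local cost: Euclidean distance between every pair of frames
        cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
        acc = np.full((Tx, Ty), np.inf)
        acc[0, 0] = cost[0, 0]
        for i in range(Tx):
            for j in range(Ty):
                if i == 0 and j == 0:
                    continue
                prev = min(
                    acc[i - 1, j] if i > 0 else np.inf,
                    acc[i, j - 1] if j > 0 else np.inf,
                    acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                )
                acc[i, j] = cost[i, j] + prev
        # backtrack the optimal warping path
        i, j = Tx - 1, Ty - 1
        path = [(i, j)]
        while (i, j) != (0, 0):
            candidates = [(a, b) for a, b in
                          [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
                          if a >= 0 and b >= 0]
            i, j = min(candidates, key=lambda ab: acc[ab])
            path.append((i, j))
        return path[::-1]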
(2) Conversion stage
The conversion stage comprises three steps:
Step B1: feature extraction.
As in the training stage, this step extracts features from the source speaker's speech signal to be converted; the features include fundamental frequency features and spectral features, and the spectral features are divided into speaker features and content features.
Step B2: feature conversion.
The fundamental frequency conversion function and the speaker conversion function obtained in the training stage are used respectively to convert the fundamental frequency features and speaker features extracted from the speech signal to be converted in step B1, yielding converted fundamental frequency features and speaker features.
For the fundamental frequency conversion, specifically, the training stage computes the log-domain means μ_x, μ_y and variances σ_x², σ_y² of the fundamental frequency of the source and target speakers' speech signals on the training set. The conversion function applied during fundamental frequency conversion is:
log f0_y = μ_y + (σ_y / σ_x) · (log f0_x − μ_x)
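As a concrete illustration of the log-domain linear transform above, a minimal sketch follows; the variable names are assumptions, and unvoiced frames (f0 = 0) are simply passed through unchanged.

    import numpy as np

    def train_f0_stats(log_f0_voiced):
        """Mean and standard deviation of voiced log-F0 values for one speaker."""
        return np.mean(log_f0_voiced), np.std(log_f0_voiced)

    def convert_f0(f0_src, mu_x, sigma_x, mu_y, sigma_y):
        """Apply log f0_y = mu_y + (sigma_y / sigma_x) * (log f0_x - mu_x)."""
        f0_out = np.zeros_like(f0_src)
        voiced = f0_src > 0                       # unvoiced frames stay zero
        log_f0 = np.log(f0_src[voiced])
        f0_out[voiced] = np.exp(mu_y + sigma_y / sigma_x * (log_f0 - mu_x))
        return f0_out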
For the conversion of the speaker features, suppose the time-aligned speaker features of the source and target speakers are X = {x_1, x_2, ..., x_T} and Y = {y_1, y_2, ..., y_T}, used as training data. The present invention adopts one of two schemes. The first scheme uses a linear regression model F(x_t) = A·x_t + b as the spectral conversion function, whose parameters are computed by the following formula:
[A, b] = Y Xᵀ (X Xᵀ)⁻¹
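A minimal least-squares sketch of the linear-regression scheme is given below; it follows the formula above under the assumption that X is augmented with a constant row so that the bias b is estimated jointly with A (the names are illustrative only).

    import numpy as np

    def train_linear_regression(X, Y):
        """Estimate A, b of F(x) = A x + b from aligned columns X, Y (D x T)."""
        T = X.shape[1]
        Xa = np.vstack([X, np.ones((1, T))])          # augment with ones for the bias
        W = Y @ Xa.T @ np.linalg.pinv(Xa @ Xa.T)      # [A, b], shape (D_y, D_x + 1)
        A, b = W[:, :-1], W[:, -1]
        return A, b

    def convert_linear(x_t, A, b):
        """Apply the trained spectral conversion function to one frame."""
        return A @ x_t + b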
The second scheme, based on the joint-space Gaussian mixture model, trains a Gaussian mixture model on the joint features Z = [Xᵀ, Yᵀ]ᵀ; it describes the distribution of the joint feature space in the form
P(z) = Σ_m w_m · N(z; μ_m, Σ_m),
where
μ_m = [μ_m^(x); μ_m^(y)],   Σ_m = [Σ_m^(xx), Σ_m^(xy); Σ_m^(yx), Σ_m^(yy)].
From this model the conversion function is derived:
F(x_t) = Σ_m h_m(x_t) · [ μ_m^(y) + Σ_m^(yx) (Σ_m^(xx))⁻¹ (x_t − μ_m^(x)) ],
where h_m(x_t) is the posterior probability of mixture component m given x_t.
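The sketch below illustrates the joint-space GMM scheme with scikit-learn; full-covariance components are assumed, and the partitioning of the joint mean and covariance follows the block structure above. It is an illustrative sketch, not the patent's implementation.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_joint_gmm(X, Y, n_components=8):
        """Fit a GMM on joint vectors z_t = [x_t; y_t]; X and Y are (T, D)."""
        Z = np.concatenate([X, Y], axis=1)
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm.fit(Z)
        return gmm

    def convert_gmm(x, gmm, dim_x):
        """Apply F(x) = sum_m h_m(x) [mu_y + S_yx S_xx^-1 (x - mu_x)]."""
        # posterior h_m(x) of each component given the source frame only
        log_post = np.zeros(gmm.n_components)
        for m in range(gmm.n_components):
            mu_x = gmm.means_[m, :dim_x]
            S_xx = gmm.covariances_[m, :dim_x, :dim_x]
            diff = x - mu_x
            log_post[m] = (np.log(gmm.weights_[m])
                           - 0.5 * diff @ np.linalg.solve(S_xx, diff)
                           - 0.5 * np.linalg.slogdet(2 * np.pi * S_xx)[1])
        h = np.exp(log_post - log_post.max())
        h /= h.sum()
        # component-wise regression, weighted by the posteriors
        y_hat = np.zeros(gmm.means_.shape[1] - dim_x)
        for m in range(gmm.n_components):
            mu_x, mu_y = gmm.means_[m, :dim_x], gmm.means_[m, dim_x:]
            S_xx = gmm.covariances_[m, :dim_x, :dim_x]
            S_yx = gmm.covariances_[m, dim_x:, :dim_x]
            y_hat += h[m] * (mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
        return y_hat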
Step B3: speech synthesis.
This step synthesizes the target speaker's speech from the converted fundamental frequency features and speaker features obtained in step B2, together with the content features extracted from the speech signal to be converted in step B1.
The present invention uses a synthesizer based on the source-filter model, which requires the excitation (i.e. the fundamental frequency) and the vocal-tract response (the spectral parameters) as input to generate the converted speech. Therefore the converted speaker features and the content features of the speech signal to be converted are first used to reconstruct the converted spectral parameters (the spectral reconstruction process is described below), and the converted speech is then generated by the synthesizer. The present invention adopts the STRAIGHT analysis-synthesis framework for speech generation.
(3) Feature extraction
The method of the invention has been introduced above as a whole; the feature extraction step used in the method is now described in detail.
As mentioned above, feature extraction in the invention covers the extraction of fundamental frequency features, speaker features and content features. The fundamental frequency features are extracted with conventional pitch-extraction methods; the method of extracting the speaker features and content features is the core of the invention.
3.1 Basic steps
Fig. 2 is a block diagram of the feature extraction step of the invention. As shown in Fig. 2, feature extraction is divided into two steps:
Step a1: acoustic feature extraction.
Based on the source-filter model of the speech signal, and considering that speech is stationary over short intervals but non-stationary over long ones, the signal is segmented into frames of 20-30 ms. From each frame, an existing speech analysis algorithm (such as STRAIGHT) is used to extract the fundamental frequency and spectral parameters (such as line spectral pairs or mel cepstra).
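A minimal framing sketch follows; the 25 ms window and 5 ms shift are taken from the embodiment described later, and the analysis algorithm itself (e.g. STRAIGHT) is not reproduced here.

    import numpy as np

    def frame_signal(wav, sample_rate=16000, frame_ms=25, shift_ms=5):
        """Cut a waveform into overlapping, Hamming-windowed short-time frames."""
        frame_len = int(sample_rate * frame_ms / 1000)
        shift = int(sample_rate * shift_ms / 1000)
        n_frames = 1 + max(0, (len(wav) - frame_len) // shift)
        window = np.hamming(frame_len)
        return np.stack([window * wav[i * shift: i * shift + frame_len]
                         for i in range(n_frames)])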
Step a2: speaker feature and content feature extraction.
Since the differences between speakers are reflected mainly in the vocal-tract structure, i.e. mainly in the spectral parameters of the acoustic features, the invention concentrates on separating the speaker-related features and content-related features from the spectral features. In addition, considering that speaker characteristics are supra-segmental (long-term) features, the invention concatenates the features of several consecutive frames into one supra-segmental feature and feeds it into the feature separator, so that the speaker-related features in the speech signal can be extracted effectively and separated more cleanly from the content-related features. The concrete feature separation method is as follows.
3.2 Feature separation algorithm
The invention separates the speaker features and content features in the acoustic spectral parameters with a deep neural network. Fig. 3 is a schematic diagram of the network structure used for feature separation. As shown in Fig. 3, the network adopts a symmetric multi-layer structure with 2K-1 layers in total (K a natural number), comprising: a bottom input layer, which receives the acoustic features to be separated; a top output layer, which outputs the reconstructed acoustic features; and 2K-3 hidden layers in between, each containing several nodes that emulate the processing of neurons.
The layers from the input layer up to the K-th hidden layer (counted from the bottom) form the encoding network (the encoder), which extracts high-level information from the input acoustic features; the K-th hidden layer is the coding layer, whose nodes are divided into two parts, one related to the speaker and the other to the content, their outputs being the speaker features and content features respectively. The hidden layers above the K-th form the decoding network (the decoder), whose function is the opposite of the encoder's: it reconstructs the acoustic spectral parameters from the high-level speaker features and content features.
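A minimal PyTorch sketch of such a symmetric network is shown below. The layer sizes (264-500-400-300-200-300-400-500-24) and the split of the 200-node coding layer into 100 speaker nodes and 100 content nodes are taken from the embodiment described later; everything else, including the sigmoid non-linearity and the linear output layer, is an assumption.

    import torch.nn as nn

    class FeatureSeparator(nn.Module):
        """Symmetric encoder/decoder whose middle (coding) layer is split into
        speaker-related and content-related node groups."""

        def __init__(self, in_dim=264, out_dim=24,
                     enc_dims=(500, 400, 300, 200), n_speaker_nodes=100):
            super().__init__()
            enc_layers, dims = [], (in_dim,) + enc_dims
            for i in range(len(enc_dims)):
                enc_layers += [nn.Linear(dims[i], dims[i + 1]), nn.Sigmoid()]
            self.encoder = nn.Sequential(*enc_layers)

            dec_dims = enc_dims[::-1][1:] + (out_dim,)        # 300, 400, 500, 24
            dec_layers, dims = [], (enc_dims[-1],) + dec_dims
            for i in range(len(dec_dims)):
                dec_layers.append(nn.Linear(dims[i], dims[i + 1]))
                if i < len(dec_dims) - 1:                     # linear output layer
                    dec_layers.append(nn.Sigmoid())
            self.decoder = nn.Sequential(*dec_layers)
            self.n_spk = n_speaker_nodes

        def forward(self, x):
            code = self.encoder(x)                            # coding-layer output
            speaker_feat = code[:, :self.n_spk]               # speaker-related nodes
            content_feat = code[:, self.n_spk:]               # content-related nodes
            recon = self.decoder(code)                        # reconstructed spectrum
            return speaker_feat, content_feat, recon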
The deep neural network of Fig. 3 adopted by the invention emulates the way the human nervous system processes speech signals; it must be trained so that it acquires the specific ability required here, namely extracting and separating speaker features and content features from acoustic features. The training of the network of Fig. 3 is carried out on the speech database built with the database-construction method proposed by the invention, which is described in the database-construction part below.
Fig. 4 is the detailed flowchart of the neural network training in the invention. The training process is divided into three steps:
Step b1: pre-training.
Because optimizing a deep neural network is relatively difficult, the network weights must be initialized by pre-training before the actual training. The invention adopts an unsupervised learning scheme and trains the network layer by layer with a greedy algorithm, obtaining initial model parameters quickly. In the training of each layer, a de-noising auto-encoder can be used to initialize the weights, i.e. a certain amount of noise corruption is added to the input features, which makes the network training more robust and prevents over-fitting. Specifically, at the input layer the input features follow a Gaussian distribution, so a moderate amount of Gaussian noise is added to every dimension of the input and the layer is trained with the minimum mean-square-error criterion; in every layer above the first, the input features follow a binary distribution, so some dimensions of the input are set to zero with a certain probability and the layer is trained with the minimum cross-entropy criterion. After pre-training has produced a stack of K de-noising auto-encoders, the stack is flipped upward, yielding the symmetric auto-encoder structure.
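A minimal sketch of pre-training one layer as a de-noising auto-encoder follows, with Gaussian corruption and a mean-square-error loss for the real-valued input layer and masking corruption with a cross-entropy loss for higher layers, as described above; the optimizer choice, noise level and masking probability are assumptions.

    import torch
    import torch.nn as nn

    def pretrain_layer(data_loader, in_dim, hid_dim, binary_input=False,
                       noise_std=0.1, mask_prob=0.2, epochs=10, lr=1e-3):
        """Greedy pre-training of one encoder layer as a de-noising auto-encoder."""
        enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        dec = nn.Linear(hid_dim, in_dim)
        opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()),
                              lr=lr, momentum=0.9)
        mse, bce = nn.MSELoss(), nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            for x in data_loader:                              # x: (batch, in_dim)
                if binary_input:
                    # higher layers: randomly zero out some input dimensions
                    corrupted = x * (torch.rand_like(x) > mask_prob).float()
                else:
                    # input layer: add a small amount of Gaussian noise
                    corrupted = x + noise_std * torch.randn_like(x)
                recon = dec(enc(corrupted))
                loss = bce(recon, x) if binary_input else mse(recon, x)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return enc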
Step b2: coding-layer adjustment.
After pre-training, the neural network already has a certain ability to extract high-level information: at the coding layer, some nodes show stronger speaker discrimination while others show stronger content discrimination. This step picks out these nodes with objective criteria and uses their outputs as the corresponding features. A discriminability criterion such as Fisher's ratio can be used for the selection. Specifically, on the training set of the speech database, the criterion is computed for the output of every coding-layer node to measure its discrimination between different speakers and between different contents; the nodes with high discrimination between speakers and low discrimination between contents are taken as speaker-related nodes, and the remaining nodes as content-related nodes.
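A sketch of the node-selection step using a Fisher-ratio-like criterion is given below; the patent does not spell out the exact form of the criterion, so the between-class/within-class variance ratio used here and the helper names are assumptions.

    import numpy as np

    def fisher_ratio(node_outputs, labels):
        """Between-class variance over within-class variance of one node's output."""
        classes = np.unique(labels)
        overall_mean = node_outputs.mean()
        between = np.mean([(node_outputs[labels == c].mean() - overall_mean) ** 2
                           for c in classes])
        within = np.mean([node_outputs[labels == c].var() for c in classes])
        return between / (within + 1e-12)

    def split_coding_nodes(codes, speaker_labels, content_labels, n_speaker_nodes):
        """Rank coding-layer nodes by speaker vs. content discriminability."""
        scores = np.array([
            fisher_ratio(codes[:, k], speaker_labels)
            - fisher_ratio(codes[:, k], content_labels)
            for k in range(codes.shape[1])
        ])
        order = np.argsort(-scores)               # most speaker-discriminative first
        return order[:n_speaker_nodes], order[n_speaker_nodes:]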
Step b3: fine-tuning.
The invention needs to separate the speaker-related and content-related features from the input acoustic spectral parameters and apply them to speaker voice conversion. To this end, a specific discriminative objective function is designed to train the network so that it acquires this ability, which requires introducing a contrastive competition mechanism into the input training samples. In the network structure of Fig. 3, two samples x_1 and x_2 are fed into the input layer in parallel at each step; at the coding layer they produce speaker features c_s1, c_s2 and content features c_c1, c_c2 respectively, and the decoding network then reconstructs the input acoustic features x̂_1 and x̂_2. The objective function used to train the network therefore contains the following three parts:
Reconstruction error: on the one hand, because speaker voice conversion has to recover acoustic spectral parameters from the high-level features, the decoding network must have a good reconstruction ability, which directly affects the quality of the synthesized speech, so the reconstruction error must be constrained in the training objective; on the other hand, constraining the reconstruction error also guarantees the completeness of the information in the speaker features and content features output by the encoder. The present invention adopts the following error form:
L_r = Σ_{i∈{1,2}} |x_i − x̂_i|²
Speaker-feature cost: to make the speaker features strongly discriminative with respect to the speaker and non-discriminative with respect to the content, a criterion can be designed that makes the speaker-feature error between samples of the same speaker as small as possible and the error between different speakers as large as possible; it can be expressed as
L_sc = δ_s · E_s + (1 − δ_s) · exp(−λ_s · E_s)
where E_s = |c_s1 − c_s2|² and δ_s is the speaker label of the two input samples: δ_s = 1 indicates that the two inputs come from the same speaker, and δ_s = 0 that they come from two different speakers.
Content-feature cost: analogously to the speaker-feature cost, a discriminative cost function for the content features is constructed:
L_cc = δ_c · E_c + (1 − δ_c) · exp(−λ_c · E_c)
Combining the three costs above gives the final objective function used for fine-tuning:
L = α · L_r + β · L_sc + ζ · L_cc
where α, β and ζ are weights adjusting the proportions of the three costs. The training objective of the neural network is to adjust the network weights so that this objective function is as small as possible; the invention trains with the error back-propagation algorithm and updates the network weights with a gradient-descent algorithm with momentum.
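The sketch below implements the contrastive fine-tuning loss defined above, paired with the FeatureSeparator sketch given earlier; the weights α, β, ζ and the λ values are hyper-parameters to be tuned on the validation set, and their defaults here are placeholders. The reconstruction targets t1, t2 are the current-frame spectra corresponding to each stacked input, as in the embodiment.

    import torch

    def fine_tune_loss(model, x1, x2, t1, t2, same_speaker, same_content,
                       alpha=1.0, beta=1.0, zeta=1.0, lam_s=1.0, lam_c=1.0):
        """L = alpha*L_r + beta*L_sc + zeta*L_cc for one batch of input pairs.

        same_speaker / same_content are float tensors (1.0 if the two inputs
        share the speaker / the content, else 0.0).
        """
        spk1, con1, rec1 = model(x1)
        spk2, con2, rec2 = model(x2)

        # reconstruction error over both inputs (targets are current-frame spectra)
        l_r = ((t1 - rec1) ** 2).sum(dim=1) + ((t2 - rec2) ** 2).sum(dim=1)

        # speaker-feature cost: small for the same speaker, large for different ones
        e_s = ((spk1 - spk2) ** 2).sum(dim=1)
        l_sc = same_speaker * e_s + (1 - same_speaker) * torch.exp(-lam_s * e_s)

        # content-feature cost: same construction on the content features
        e_c = ((con1 - con2) ** 2).sum(dim=1)
        l_cc = same_content * e_c + (1 - same_content) * torch.exp(-lam_c * e_c)

        return (alpha * l_r + beta * l_sc + zeta * l_cc).mean()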
(4) Construction of the speaker speech database
The neural network used in the invention requires a large amount of training data: it must cover many speakers, and material of sufficiently rich content must be recorded for each speaker.
It should be pointed out in particular that this large amount of training data is not data of the source or target speaker of the training process shown in Fig. 1. In practice, obtaining a large amount of data from the source or target speaker of Fig. 1 is unrealistic or too demanding, whereas obtaining the large amount of training data needed by the neural network described here is feasible and realistic.
Fig. 5 is the flowchart of database construction in the invention. It is divided into four steps:
Step c1: set up a corpus containing a plurality of sentences.
Since the separation network should be robust and able to handle any speaker and any content, the invention designs a phoneme-balanced corpus; moreover, the number of sentences should not be too large, usually within 100, so that data from a large number of speakers can be collected. Phoneme balance means that the corpus covers all phonemes of the language and that the numbers of occurrences of the individual phonemes are relatively balanced.
Step c2: record speech signals of a plurality of speakers reading the sentences in the corpus, build the speech database, and preprocess the speech signals in the database to remove distortions.
Because the network must be able to distinguish speakers, data from a large number of speakers have to be recorded for training. In the recording stage, cost and similar constraints make it impossible to hire that many professional announcers, so recordings can only be collected from ordinary speakers, which makes the recording quality uneven. After recording is finished, the recordings therefore require some preprocessing, such as amplitude normalization, channel equalization and removal of microphone-pop noise, to guarantee the quality of the corpus.
Step c3: segment the preprocessed speech signals in the database with hidden Markov models; each segment obtained by the cutting is treated as a frame, so that frame-level speaker labels and content labels are obtained for each speech signal.
As can be seen from the above, the fine-tuning stage of the neural network training is a supervised learning process and needs the speaker label and content label of every input training frame. The speech signals in the database therefore need frame-level annotation, i.e. segmental cutting. Specifically, an existing context-dependent hidden Markov model of the kind used in speech synthesis can be employed to realize the segmentation. Before cutting, the model is first adapted to each speaker's acoustic space from that speaker's recordings with the maximum-likelihood linear regression algorithm; the adapted model is then used to decode that speaker's recordings with the Viterbi algorithm, which yields the boundary information of every model state.
Step c4: randomly combine the speech signals of the speech database to construct the training data of the neural network.
As described above, the training data of the neural network fall into four classes: same speaker with same content, same speaker with different content, different speakers with same content, and different speakers with different content. Because there are many combinations of speaker and content attributes, the invention randomly selects and combines samples from the training data during the training stage and feeds them into the network, as sketched below.
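A minimal sketch of assembling training pairs by random combination follows; the frame records are assumed to carry speaker and content labels obtained from the HMM segmentation, and the record layout is an assumption. Purely random sampling mixes the four combinations of same/different speaker and same/different content.

    import random

    def sample_training_pairs(frames, n_pairs):
        """frames: list of (feature_vector, speaker_id, content_id) records.

        Returns pairs (x1, x2, same_speaker, same_content) covering the four
        combinations of same/different speaker and same/different content.
        """
        pairs = []
        for _ in range(n_pairs):
            a, b = random.choice(frames), random.choice(frames)
            pairs.append((a[0], b[0],
                          float(a[1] == b[1]),      # same speaker?
                          float(a[2] == b[2])))     # same content?
        return pairs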
(5) Specific embodiment
Following the method described above, the invention has built a speaker voice conversion system as an embodiment. First, a phoneme-balanced corpus of 100 sentences was designed, and 81 speakers (40 male and 41 female) were recruited to record it; after processing, this formed the final training corpus. The recorded audio files are monaural with a 16 kHz sampling rate. Of the 81 speakers, the data of 60 randomly chosen speakers (30 male, 30 female) were used as the training set of the neural network, the data of another 10 speakers (5 male, 5 female) as the validation set, and the data of the remaining 11 speakers as the test set for evaluating the speaker voice conversion. For acoustic feature extraction, the waveform was framed with a 25 ms Hamming window shifted by 5 ms, and from every frame one fundamental frequency value and a 24-dimensional mel-cepstral parameter vector were extracted as acoustic features.
In the stage of training the neural network for feature separation, the input vector of the network is the supra-segmental feature formed by concatenating the current frame with the 5 frames before and after it, i.e. 11 frames or 264 dimensions in total; because the output only has to reconstruct the current frame of the input, the output layer has 24 dimensions. The network contains 7 hidden layers with 500, 400, 300, 200, 300, 400 and 500 nodes respectively; in the middle layer, the first 100 nodes output the speaker features and the remaining 100 nodes output the content features. In the pre-training stage, 4 stacked de-noising auto-encoders of sizes 264-500, 500-400, 400-300 and 300-200 are used, bottom-up, to initialize the network weights; the output of each auto-encoder serves as the input of the next, the initial weights are obtained by unsupervised learning, and the weights are finally flipped to obtain the initialization of the whole network. It should be noted that when the first layer is flipped to become the topmost layer of the whole network, the output has only 24 dimensions, so only the weights corresponding to the current frame at the input layer need to be flipped upward. In addition, before the middle layer is flipped, the discrimination of every node's output between different speakers and between different contents (the Fisher's ratio mentioned above) has to be computed, and the nodes and network weights are rearranged accordingly. After pre-training, fine-tuning is carried out as described above; during this process the weights of the objective function are tuned on the validation set to obtain their optimal values.
After the feature separator has been trained, the speaker voice conversion system can be built. Two speakers are chosen arbitrarily from the test set, and 50 of their sentences are used as training data: the required features are extracted as described above, and the conversion functions of the fundamental frequency and of the speaker features are trained (this embodiment uses the direct linear regression model); the remaining 50 sentences are used as test data to verify the effect of the speaker voice conversion.
Fisher's ratio is used to measure the discrimination of the extracted features between different speakers and between different contents. Fisher's ratio measures the ratio of the between-class distance of a feature to its within-class distance: the larger the ratio, the more discriminative the feature is under the given classification. Fig. 6 and Fig. 7 show the discrimination of the mel-cepstral coefficients and of the separated features, respectively, between different speakers (solid line) and different contents (dotted line). It can be seen that in the input acoustic features, apart from the low dimensions showing some content discrimination, the other dimensions are not strongly discriminative, whereas the extracted features (the first 100 dimensions being the speaker features and the remaining 100 dimensions the content features) exhibit the desired discrimination for the respective classification after training. In the voice conversion experiment, speech synthesized directly from the target speaker's speaker features and the source speaker's content features has a cepstral error of 4.39 dB, while speech synthesized from the linearly transformed source speaker features and the source content features has a cepstral error of 5.64 dB and sounds subjectively close to the target speaker's voice.
The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (11)

1. A speaker voice conversion method for converting a voice signal spoken by a source speaker so that the converted speech sounds as if it were spoken by a target speaker different from the source speaker, characterized in that the method comprises a training stage and a conversion stage, wherein
the training stage comprises:
step A1, extracting fundamental frequency features and spectral features from the training speech signals of the source speaker and the target speaker respectively, the spectral features comprising speaker features and content features;
step A2, constructing a fundamental frequency conversion function from the source speaker's voice to the target speaker's voice according to the fundamental frequency features of the two speakers' training speech signals;
step A3, constructing a speaker conversion function according to the speaker features of the source and target speakers extracted in step A1;
and the conversion stage comprises:
step B1, extracting fundamental frequency features and spectral features from the source speaker's speech signal to be converted, the spectral features comprising speaker features and content features;
step B2, using respectively the fundamental frequency conversion function and the speaker conversion function obtained in the training stage to convert the fundamental frequency features and speaker features extracted from the speech signal to be converted in step B1, obtaining converted fundamental frequency features and speaker features;
step B3, synthesizing the target speaker's speech from the converted fundamental frequency features and speaker features obtained in step B2 and the content features extracted from the speech signal to be converted in step B1.
2. The speaker voice conversion method according to claim 1, characterized in that in step A2 the means and variances of the log-domain distributions of the fundamental frequency features of the source and target speakers' training speech are computed, and the fundamental frequency conversion function from the source speaker's voice to the target speaker's voice is constructed from these statistics.
3. The speaker voice conversion method according to claim 2, characterized in that the fundamental frequency conversion function is a linear transformation function.
4. The speaker voice conversion method according to claim 1, characterized in that the method of extracting the fundamental frequency features and spectral features of a speech signal in steps A1 and B1 comprises:
step a1, based on the source-filter model of the speech signal, segmenting the signal into frames of 20-30 ms and extracting the fundamental frequency and spectral parameters from each frame;
step a2, separating the speaker features and content features in the spectral parameters with a neural network, the network adopting a symmetric multi-layer structure with 2K-1 layers in total (K a natural number) and comprising: a bottom input layer, which receives the acoustic features to be separated; a top output layer, which outputs the reconstructed acoustic features; and 2K-3 hidden layers in between, each containing several nodes that emulate the processing of neurons; the layers from the input layer up to the K-th hidden layer (counted from the bottom) forming the encoding network, which extracts high-level information from the input acoustic features; the K-th hidden layer being the coding layer, whose nodes are divided into two parts, one related to the speaker and the other to the content, their outputs being the speaker features and content features respectively; and the hidden layers above the K-th forming the decoding network, which reconstructs the acoustic spectral parameters from the high-level speaker features and content features.
5. The speaker voice conversion method according to claim 4, characterized in that step a2 includes training the neural network on a speech database so that it can extract and separate speaker features and content features from acoustic features.
6. The speaker voice conversion method according to claim 5, characterized in that training the neural network comprises:
step b1, initializing the network weights of the neural network by pre-training;
step b2, for the output of each node of the coding layer of the neural network, using a discriminability criterion to measure its discrimination between different speakers and between different contents, and taking the nodes with high discrimination between speakers and low discrimination between contents as speaker-related nodes and the remaining nodes as content-related nodes;
step b3, designing a specific discriminative objective function to fine-tune the network weights, so that the network can separate speaker information from content information in acoustic features.
7. The speaker voice conversion method according to claim 5, characterized in that step b1 adopts an unsupervised learning scheme and trains the neural network layer by layer with a greedy algorithm.
8. The speaker voice conversion method according to claim 7, characterized in that step b1 comprises:
at the input layer, where the input features follow a Gaussian distribution, adding an amount of Gaussian noise to every dimension of the input and training with the minimum mean-square-error criterion; in every layer above the first, where the input features follow a binary distribution, setting some dimensions of the input to zero with a certain probability and training with the minimum cross-entropy criterion; and after pre-training has produced a stack of K de-noising auto-encoders, flipping the stack upward to obtain the symmetric auto-encoder structure.
9. The speaker voice conversion method according to claim 6, characterized in that step b2 adopts the Fisher's ratio criterion as the discriminability criterion.
10. The speaker voice conversion method according to claim 9, characterized in that step b3 comprises:
designing a discriminative objective function with a contrastive competition mechanism and fine-tuning the network weights of the neural network with the error back-propagation algorithm, so that the network can separate speaker information from content information in acoustic features.
11. The speaker voice conversion method according to claim 5, characterized in that the speech database is built by the following steps:
step c1, setting up a corpus containing a plurality of sentences;
step c2, recording speech signals of a plurality of speakers reading the sentences in the corpus, building the speech database, and preprocessing the speech signals in the database to remove distortions;
step c3, segmenting the preprocessed speech signals in the database with hidden Markov models, each segment after cutting being treated as a frame, so as to obtain frame-level speaker labels and content labels for each speech signal;
step c4, randomly combining the speech signals of the speech database to construct the training data of the neural network.
CN201210528629.4A 2012-12-11 2012-12-11 Conversion method for sound of speaker Active CN102982809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210528629.4A CN102982809B (en) 2012-12-11 2012-12-11 Conversion method for sound of speaker


Publications (2)

Publication Number Publication Date
CN102982809A true CN102982809A (en) 2013-03-20
CN102982809B CN102982809B (en) 2014-12-10

Family

ID=47856718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210528629.4A Active CN102982809B (en) 2012-12-11 2012-12-11 Conversion method for sound of speaker

Country Status (1)

Country Link
CN (1) CN102982809B (en)

Cited By (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514883A (en) * 2013-09-26 2014-01-15 华南理工大学 Method for achieving self-adaptive switching of male voice and female voice
CN103531205A (en) * 2013-10-09 2014-01-22 常州工学院 Asymmetrical voice conversion method based on deep neural network feature mapping
CN103886859A (en) * 2014-02-14 2014-06-25 河海大学常州校区 Voice conversion method based on one-to-many codebook mapping
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN104464725A (en) * 2014-12-30 2015-03-25 福建星网视易信息系统有限公司 Method and device for singing imitation
CN105023570A (en) * 2014-04-30 2015-11-04 安徽科大讯飞信息科技股份有限公司 method and system of transforming speech

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1835074A (en) * 2006-04-07 2006-09-20 安徽中科大讯飞信息科技有限公司 Speaker conversion method combining high-level description information and model adaptation
US20090089063A1 (en) * 2007-09-29 2009-04-02 Fan Ping Meng Voice conversion method and system
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under non-parallel text condition

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ling-Hui Chen, et al.: "Non-Parallel Training for Voice Conversion Based on FT-GMM", 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 27 May 2011 (2011-05-27), pages 5116-5119 *
Ling-Hui Chen, et al.: "GMM-based Voice Conversion with Explicit Modelling on Feature Transform", 2010 7th International Symposium on Chinese Spoken Language Processing (ISCSLP), 3 December 2010 (2010-12-03), pages 364-368 *
Xu Xiaofeng et al.: "Research on a Voice Conversion System Based on Speaker-Independent Modeling", Signal Processing (《信号处理》), vol. 25, no. 8, 31 August 2009 (2009-08-31), pages 171-174 *

Cited By (84)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143327B (en) * 2013-07-10 2015-12-09 腾讯科技(深圳)有限公司 Acoustic model training method and apparatus
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
WO2015003436A1 (en) * 2013-07-10 2015-01-15 Tencent Technology (Shenzhen) Company Limited Method and device for parallel processing in model training
US9508347B2 (en) 2013-07-10 2016-11-29 Tencent Technology (Shenzhen) Company Limited Method and device for parallel processing in model training
CN103514883A (en) * 2013-09-26 2014-01-15 华南理工大学 Method for achieving self-adaptive switching of male voice and female voice
CN103531205A (en) * 2013-10-09 2014-01-22 常州工学院 Asymmetrical voice conversion method based on deep neural network feature mapping
CN103531205B (en) * 2013-10-09 2016-08-31 常州工学院 Asymmetrical voice conversion method based on deep neural network feature mapping
CN103886859B (en) * 2014-02-14 2016-08-17 河海大学常州校区 Voice conversion method based on one-to-many codebook mapping
CN103886859A (en) * 2014-02-14 2014-06-25 河海大学常州校区 Voice conversion method based on one-to-many codebook mapping
CN105023574B (en) * 2014-04-30 2018-06-15 科大讯飞股份有限公司 Method and system for realizing synthesized speech enhancement
CN105023570A (en) * 2014-04-30 2015-11-04 安徽科大讯飞信息科技股份有限公司 Method and system of transforming speech
CN105023570B (en) * 2014-04-30 2018-11-27 科大讯飞股份有限公司 Method and system for realizing sound conversion
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Vocal tract spectrum Gaussian mixture model based rapid voice conversion system and method
CN104464725B (en) * 2014-12-30 2017-09-05 福建凯米网络科技有限公司 Singing imitation method and apparatus
CN104464725A (en) * 2014-12-30 2015-03-25 福建星网视易信息系统有限公司 Method and device for singing imitation
CN106384587A (en) * 2015-07-24 2017-02-08 科大讯飞股份有限公司 Voice recognition method and system thereof
CN106384587B (en) * 2015-07-24 2019-11-15 科大讯飞股份有限公司 A kind of audio recognition method and system
CN105321526A (en) * 2015-09-23 2016-02-10 联想(北京)有限公司 Audio processing method and electronic device
CN105321526B (en) * 2015-09-23 2020-07-24 联想(北京)有限公司 Audio processing method and electronic equipment
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN105390141B (en) * 2015-10-14 2019-10-18 科大讯飞股份有限公司 Sound converting method and device
CN105206257B (en) * 2015-10-14 2019-01-18 科大讯飞股份有限公司 A kind of sound converting method and device
CN105390141A (en) * 2015-10-14 2016-03-09 科大讯飞股份有限公司 Sound conversion method and sound conversion device
CN106228976B (en) * 2016-07-22 2019-05-31 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN106228976A (en) * 2016-07-22 2016-12-14 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
US11651578B2 (en) 2016-11-29 2023-05-16 Iflytek Co., Ltd. End-to-end modelling method and system
WO2018098892A1 (en) * 2016-11-29 2018-06-07 科大讯飞股份有限公司 End-to-end modelling method and system
CN107068157A (en) * 2017-02-21 2017-08-18 中国科学院信息工程研究所 A kind of information concealing method and system based on audio carrier
CN107068157B (en) * 2017-02-21 2020-04-10 中国科学院信息工程研究所 Information hiding method and system based on audio carrier
US11854563B2 (en) 2017-05-24 2023-12-26 Modulate, Inc. System and method for creating timbres
CN111201565A (en) * 2017-05-24 2020-05-26 调节股份有限公司 System and method for sound-to-sound conversion
CN107464569A (en) * 2017-07-04 2017-12-12 清华大学 Vocoder
CN107545903A (en) * 2017-07-19 2018-01-05 南京邮电大学 Voice conversion method based on deep learning
CN107545903B (en) * 2017-07-19 2020-11-24 南京邮电大学 Voice conversion method based on deep learning
CN107481735A (en) * 2017-08-28 2017-12-15 中国移动通信集团公司 Method for converting audio vocalization, server, and computer-readable storage medium
CN107507619B (en) * 2017-09-11 2021-08-20 厦门美图之家科技有限公司 Voice conversion method and device, electronic equipment and readable storage medium
CN107507619A (en) * 2017-09-11 2017-12-22 厦门美图之家科技有限公司 Voice conversion method and device, electronic equipment, and readable storage medium
CN107452369B (en) * 2017-09-28 2021-03-19 百度在线网络技术(北京)有限公司 Method and device for generating speech synthesis model
US10978042B2 (en) 2017-09-28 2021-04-13 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for generating speech synthesis model
CN107464554A (en) * 2017-09-28 2017-12-12 百度在线网络技术(北京)有限公司 Speech synthesis model generation method and device
CN107452369A (en) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 Speech synthesis model generation method and device
CN109935225A (en) * 2017-12-15 2019-06-25 富泰华工业(深圳)有限公司 Character information processor and method, computer storage medium and mobile terminal
CN108108357A (en) * 2018-01-12 2018-06-01 京东方科技集团股份有限公司 Accent conversion method and device, electronic equipment
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 Song synthesis method oriented to timbre conversion
CN108550372A (en) * 2018-03-24 2018-09-18 上海诚唐展览展示有限公司 System for converting astronomical electrical signals into audio
CN108847249A (en) * 2018-05-30 2018-11-20 苏州思必驰信息科技有限公司 Sound conversion optimization method and system
CN108847249B (en) * 2018-05-30 2020-06-05 苏州思必驰信息科技有限公司 Sound conversion optimization method and system
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
CN109616131B (en) * 2018-11-12 2023-07-07 南京南大电子智慧型服务机器人研究院有限公司 Digital real-time voice changing method
CN109377978A (en) * 2018-11-12 2019-02-22 南京邮电大学 Many-to-many voice conversion method under non-parallel text condition based on i vector
CN109616131A (en) * 2018-11-12 2019-04-12 南京南大电子智慧型服务机器人研究院有限公司 Digital real-time voice changing method
CN109377978B (en) * 2018-11-12 2021-01-26 南京邮电大学 Many-to-many speaker conversion method based on i vector under non-parallel text condition
CN109326283B (en) * 2018-11-23 2021-01-26 南京邮电大学 Many-to-many voice conversion method based on text encoder under non-parallel text condition
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Many-to-many voice conversion method based on text encoder under non-parallel text condition
CN109637551A (en) * 2018-12-26 2019-04-16 出门问问信息科技有限公司 Phonetics transfer method, device, equipment and storage medium
CN109599091B (en) * 2019-01-14 2021-01-26 南京邮电大学 STARWGAN-GP and x-vector based many-to-many speaker conversion method
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Many-to-many voice conversion method based on STARWGAN-GP and x-vector
CN109817198B (en) * 2019-03-06 2021-03-02 广州多益网络股份有限公司 Speech synthesis method, apparatus and storage medium
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multi-voice training method for speech synthesis, speech synthesis method and device
CN110060690A (en) * 2019-04-04 2019-07-26 南京邮电大学 Many-to-many voice conversion method based on STARGAN and ResNet
CN110060701A (en) * 2019-04-04 2019-07-26 南京邮电大学 Many-to-many voice conversion method based on VAWGAN-AC
CN110060701B (en) * 2019-04-04 2023-01-31 南京邮电大学 Many-to-many voice conversion method based on VAWGAN-AC
CN110010144A (en) * 2019-04-24 2019-07-12 厦门亿联网络技术股份有限公司 Voice signals enhancement method and device
CN111951810A (en) * 2019-05-14 2020-11-17 国际商业机器公司 High quality non-parallel many-to-many voice conversion
CN110459232A (en) * 2019-07-24 2019-11-15 浙江工业大学 Voice conversion method based on cycle generative adversarial network
CN110600012A (en) * 2019-08-02 2019-12-20 特斯联(北京)科技有限公司 Fuzzy speech semantic recognition method and system for artificial intelligence learning
CN110570873B (en) * 2019-09-12 2022-08-05 Oppo广东移动通信有限公司 Voiceprint wake-up method and device, computer equipment and storage medium
CN110600013A (en) * 2019-09-12 2019-12-20 苏州思必驰信息科技有限公司 Training method and device for non-parallel corpus voice conversion data enhancement model
CN110570873A (en) * 2019-09-12 2019-12-13 Oppo广东移动通信有限公司 voiceprint wake-up method and device, computer equipment and storage medium
CN111433847B (en) * 2019-12-31 2023-06-09 深圳市优必选科技股份有限公司 Voice conversion method, training method, intelligent device and storage medium
CN111433847A (en) * 2019-12-31 2020-07-17 深圳市优必选科技股份有限公司 Speech conversion method and training method, intelligent device and storage medium
WO2021134520A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Voice conversion method, voice conversion training method, intelligent device and storage medium
CN111462769B (en) * 2020-03-30 2023-10-27 深圳市达旦数生科技有限公司 End-to-end accent conversion method
CN111462769A (en) * 2020-03-30 2020-07-28 深圳市声希科技有限公司 End-to-end accent conversion method
CN111883149B (en) * 2020-07-30 2022-02-01 四川长虹电器股份有限公司 Voice conversion method and device with emotion and rhythm
CN111883149A (en) * 2020-07-30 2020-11-03 四川长虹电器股份有限公司 Voice conversion method and device with emotion and rhythm
US11996117B2 (en) 2020-10-08 2024-05-28 Modulate, Inc. Multi-stage adaptive system for content moderation
CN112309365A (en) * 2020-10-21 2021-02-02 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112309365B (en) * 2020-10-21 2024-05-10 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-shot voice conversion system and method based on deep learning and simple acoustic features
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function
CN113345452A (en) * 2021-04-27 2021-09-03 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model
CN113345452B (en) * 2021-04-27 2024-04-26 北京搜狗科技发展有限公司 Voice conversion method, training method, device and medium of voice conversion model

Also Published As

Publication number Publication date
CN102982809B (en) 2014-12-10

Similar Documents

Publication Publication Date Title
CN102982809B (en) Conversion method for sound of speaker
CN106228977B (en) Multi-mode fusion song emotion recognition method based on deep learning
Gerosa et al. A review of ASR technologies for children's speech
Jin et al. Speech emotion recognition with acoustic and lexical features
CN105741832B (en) Spoken language evaluation method and system based on deep learning
CN106297773B (en) Neural network acoustic model training method
Kandali et al. Emotion recognition from Assamese speeches using MFCC features and GMM classifier
CN102332263B (en) Close neighbor principle based speaker recognition method for synthesizing emotional model
CN103544963A (en) Voice emotion recognition method based on kernel semi-supervised discriminant analysis
Lee et al. Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams
CN107369440A (en) Training method and device of a speaker identification model for short utterances
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN102800314A (en) English sentence recognition and evaluation system with feedback guidance, and method thereof
CN103366618A (en) Scene device for Chinese learning training based on artificial intelligence and virtual reality
CN104240706B (en) Speaker recognition method based on GMM Token ratio similarity score correction
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
Wang et al. Child Speech Disorder Detection with Siamese Recurrent Network Using Speech Attribute Features.
CN109300339A (en) Spoken English practice method and system
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
Polur et al. Investigation of an HMM/ANN hybrid structure in pattern recognition application using cepstral analysis of dysarthric (distorted) speech signals
Sheikh et al. Advancing stuttering detection via data augmentation, class-balanced loss and multi-contextual deep learning
CN102880906B (en) Chinese vowel pronunciation method based on DIVA neural network model
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
Zhao et al. [Retracted] Standardized Evaluation Method of Pronunciation Teaching Based on Deep Learning
Rafi et al. Relative significance of speech sounds in speaker verification systems

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant