CN108777140B - Voice conversion method based on VAE under non-parallel corpus training - Google Patents

Voice conversion method based on VAE under non-parallel corpus training

Info

Publication number
CN108777140B
CN108777140B (application CN201810393556.XA)
Authority
CN
China
Prior art keywords
characteristic
frame
bottleneck
training
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810393556.XA
Other languages
Chinese (zh)
Other versions
CN108777140A (en)
Inventor
李燕萍 (Li Yanping)
凌云志 (Ling Yunzhi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201810393556.XA priority Critical patent/CN108777140B/en
Publication of CN108777140A publication Critical patent/CN108777140A/en
Application granted granted Critical
Publication of CN108777140B publication Critical patent/CN108777140B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02: Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Abstract

The invention discloses a voice conversion method based on a VAE under non-parallel corpus training. Under non-parallel text conditions, Bottleneck features are extracted with a deep neural network, and the conversion function is then learned and modeled with a variational auto-encoder, so that many-to-many speaker conversion can be performed at the conversion stage. The invention has three advantages: 1) the dependence on parallel texts is removed, and no alignment operation is required during training; 2) the conversion systems of multiple source-target speaker pairs can be integrated into a single conversion model, realizing many-to-many conversion; 3) a many-to-many conversion system under non-parallel text conditions provides technical support for applying the technology to practical voice interaction.

Description

Voice conversion method based on VAE under non-parallel corpus training
Technical Field
The invention belongs to the field of speech signal processing, and particularly relates to a voice conversion method based on a Variational Auto-Encoder (VAE) model under non-parallel corpus training.
Background
Voice conversion is a research branch of speech signal processing that draws on speaker recognition, speech synthesis and related fields. Its aim is to change the personalized information of an utterance while keeping the semantic information unchanged, so that the speech of one specific speaker (the source speaker) sounds like that of another specific speaker (the target speaker). The main tasks of voice conversion are to extract the characteristic parameters of the two speakers' voices, map and convert them, and then decode and reconstruct the converted parameters into converted speech, while ensuring both the auditory quality of the converted speech and the accuracy of the converted personality characteristics. Voice conversion has been studied for many years and various methods have emerged, among which statistical conversion methods represented by the Gaussian mixture model (GMM) have become the classic approach. However, such algorithms still have drawbacks. The classic GMM-based methods mostly address one-to-one conversion tasks: the training sentences of the source and target speakers must have the same content, Dynamic Time Warping (DTW) must be performed to align the spectral features frame by frame, and only then can the mapping between the spectral features be learned, which makes these methods inflexible in practical applications. Moreover, when the mapping function is trained with a GMM, global variables are considered and iterating over the training data sharply increases the computational load; a GMM achieves a good conversion effect only when the training data are sufficient, which is unsuitable for limited computing resources and equipment.
In recent years, research in deep learning has improved both the training speed and the effectiveness of deep neural networks. Researchers keep proposing new models and learning methods with strong modeling capability that can learn deeper features from complex data.
AHOcoder is a speech codec (speech analysis/synthesis system) developed by Daniel Erro at the AHOLAB signal processing laboratory of the University of the Basque Country. AHOcoder decomposes 16 kHz, 16-bit monophonic WAV speech into three parts: the fundamental frequency (F0), the spectrum (Mel cepstral coefficients, MFCC) and the maximum voiced frequency.
The fundamental frequency is an important parameter affecting the prosodic characteristics of speech. For fundamental-frequency conversion, the voice conversion method designed in the invention adopts the traditional Gaussian normalization method. Assuming that the logarithmic fundamental frequencies of the voiced segments of the source and target speakers obey Gaussian distributions, the mean and standard deviation of these two distributions are computed. The following formula then converts the logarithmic fundamental frequency of the source speaker's voiced segments into that of the target speaker, while unvoiced segments are left unchanged.
log F0_conv = μ_tgt + (σ_tgt / σ_src) × (log F0_src − μ_src)
where μ_src and σ_src denote the mean and standard deviation of the logarithmic fundamental frequency of the source speaker's voiced segments, μ_tgt and σ_tgt denote those of the target speaker's voiced segments, F0_src denotes the fundamental frequency of the source speaker, and F0_conv denotes the converted fundamental frequency.
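The following is a minimal NumPy sketch of this statistics-matching conversion, assuming the log-F0 values of the voiced frames are already available as arrays and treating σ as a standard deviation; the array names and the example statistics are illustrative only:

    import numpy as np

    def convert_lf0(lf0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
        """Gaussian-normalized log-F0 conversion for voiced frames.

        lf0_src: log fundamental frequency of the source speaker's voiced frames.
        mu/sigma: mean and standard deviation of the log-F0 Gaussians
        estimated on the source and target training data.
        """
        return mu_tgt + (sigma_tgt / sigma_src) * (lf0_src - mu_src)

    # Illustrative usage with made-up statistics:
    lf0_src = np.log(np.array([110.0, 120.0, 130.0]))   # source voiced frames
    mu_src, sigma_src = lf0_src.mean(), lf0_src.std()
    mu_tgt, sigma_tgt = np.log(220.0), 0.12             # assumed target statistics
    lf0_conv = convert_lf0(lf0_src, mu_src, sigma_src, mu_tgt, sigma_tgt)
    print(np.exp(lf0_conv))                              # converted F0 in Hz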
Disclosure of Invention
In order to solve the problems, the invention provides a voice conversion method based on VAE under non-parallel corpus training, which gets rid of the dependence on parallel texts, realizes the conversion of multiple speakers to multiple speakers, improves the flexibility, and solves the technical problem that the voice conversion is difficult to realize under the condition of limited resources and equipment.
The invention adopts the following technical scheme. A voice conversion method based on a VAE under non-parallel corpus training comprises the following steps:
training:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
2) performing differential processing on the extracted Mel cepstrum characteristic parameter X of each frame, splicing the result with the original characteristic parameter X, and then splicing the obtained characteristic parameter in the time domain with those of the preceding and following frames to form the joint characteristic parameter x_n;
3) using the joint characteristic parameters x_n and the speaker classification label features y_n to train a Deep Neural Network (DNN), adjusting the DNN weights to reduce the classification error until the network converges, thereby obtaining a DNN based on the speaker recognition task, and extracting the Bottleneck feature b_n of each frame;
4) using the joint characteristic parameters x_n and the Bottleneck feature b_n corresponding to each frame to train a Variational Auto-Encoder (VAE) model until training converges, and extracting the sampling feature z_n of each frame from the hidden space z of the VAE model;
5) splicing the sampling feature z_n with the speaker label feature y_n corresponding to each frame to obtain the training data of a Bottleneck feature mapping network (a BP network), using the Bottleneck feature b_n of each frame as supervision information to guide the training of this mapping network, and minimizing its output error with a stochastic gradient descent algorithm to obtain the Bottleneck feature mapping network;
the trained DNN network, VAE network and Bottleneck feature mapping network are combined to form a voice conversion system based on VAE and Bottleneck features;
a voice conversion step:
6) passing the joint characteristic parameters X_p of the speech to be converted through the encoder module of the VAE model to obtain the sampling feature z_n of each frame in the hidden space z;
7) splicing the sampling feature z_n with the label feature y_n of the target speaker frame by frame and inputting the result into the Bottleneck feature mapping network to obtain the Bottleneck feature b̂_n of the target speaker;
8) splicing the Bottleneck feature b̂_n with the sampling feature z_n frame by frame and reconstructing, through the decoder module of the VAE model, the joint characteristic parameters X_p' of the converted speech;
9) The speech signal is reconstructed using an AHOcoder sound codec.
Preferably, the extracting the mel cepstrum features of the speeches participating in the training in the step 1) is to extract the mel cepstrum features of the speeches participating in the training respectively by using an AHOcoder sound codec, and read the mel cepstrum features into a Matlab platform.
Preferably, obtaining the joint characteristic parameters in step 2) specifically comprises: computing the first-order and second-order differences of each extracted frame's characteristic parameter X and splicing them with the original feature to obtain X_t = (X, ΔX, Δ²X), and then splicing X_t in the time domain with the characteristic parameters of the preceding and following frames to form the joint characteristic parameter x_n = (X_{t-1}, X_t, X_{t+1}).
Preferably, extracting the Bottleneck feature b_n in step 3) comprises the following steps:
31) obtaining, on the MATLAB platform, the joint characteristic parameter x_n and the speaker classification label feature y_n corresponding to each frame;
32) performing unsupervised pre-training of the DNN with a layer-by-layer greedy pre-training method, where the hidden-layer activation function is the ReLU function;
33) setting the DNN output layer as a softmax classification output, using the speaker classification label feature y_n as the supervision information for supervised training of the DNN, and adjusting the network weights with the stochastic gradient descent algorithm to minimize the error between the DNN classification output and y_n until convergence, obtaining a DNN based on the speaker recognition task;
34) feeding the joint characteristic parameters x_n frame by frame into the DNN with a feed-forward pass, and extracting the activation values of the Bottleneck layer for each frame, i.e. the Bottleneck feature b_n corresponding to the Mel cepstrum characteristic parameters of that frame.
Preferably, the training of the VAE model in step 4) includes the following steps:
41) using the joint characteristic parameters x_n as training data for the encoder module of the VAE model and the Bottleneck feature b_n as training data for decoding and reconstruction in the decoder module, with b_n serving as control information for the speech spectrum reconstruction process in the decoder; that is, b_n and the sampling feature z_n are spliced frame by frame and the decoder module of the VAE model is trained to reconstruct the speech spectral features;
42) optimizing the KL divergence and the mean square error in the parameter estimation process of the VAE model with an ADAM optimizer to adjust the network weights, obtaining a VAE speech spectrum conversion model;
43) inputting the joint characteristic parameters x_n frame by frame into the VAE speech spectrum conversion model and obtaining the hidden sampling feature z_n through the sampling process.
Preferably, obtaining the Bottleneck feature mapping network in step 5) comprises the following steps:
51) splicing the sampling feature z_n with the speaker classification label feature y_n corresponding to each frame as the training data of the Bottleneck feature mapping network; the mapping network adopts an input layer-hidden layer-output layer structure, the hidden-layer activation function is the sigmoid function, and the output layer is a linear output;
52) according to the mean-square-error minimization criterion, optimizing the weights of the Bottleneck feature mapping network with a stochastic gradient descent algorithm using error back-propagation, minimizing the error between the Bottleneck feature b̂_n output by the network and the Bottleneck feature b_n corresponding to each frame.
Preferably, obtaining the joint characteristic parameters X_p of the speech to be converted in step 6) comprises: extracting the Mel cepstrum characteristic parameters of the speech to be converted with AHOcoder, performing first-order and second-order differencing of the extracted characteristic parameters of each frame on the MATLAB platform and splicing them with the original features, and then splicing the result in the time domain with the characteristic parameters of the preceding and following frames to form the joint characteristic parameters, i.e. the spectral characteristic parameters X_p of the speech to be converted.
Preferably, reconstructing the speech signal in step 9) specifically comprises: restoring the converted speech characteristic parameters X_p' to Mel cepstrum form, i.e. removing the time-domain splicing terms and the difference terms, and then synthesizing the converted speech with the AHOcoder sound codec.
The invention has the following beneficial effects: the invention relates to a voice conversion method based on VAE under non-parallel corpus training, which gets rid of the dependence on parallel texts, realizes the conversion of multiple speakers to multiple speakers, improves the flexibility and solves the technical problem that the voice conversion is difficult to realize under the condition of limited resources and equipment. The invention has the advantages that:
1) The VAE model can, through modeling and learning, separate in its hidden layer the phoneme information that is unrelated to the speaker's personality from the speech spectral features. This allows the VAE model to learn voice conversion from non-parallel speech data, removing the limitation of traditional voice conversion models that the source and target speakers must be trained on parallel corpus data with aligned spectral features. This greatly improves the practicality and flexibility of the voice conversion system and also facilitates the design of cross-language voice conversion systems;
2) the voice conversion network obtained by training the VAE model can handle multiple conversion scenarios; compared with a traditional one-to-one voice conversion system, a single trained model can complete multiple conversion tasks, greatly improving the efficiency of conversion-model training;
3) in the decoder module of the VAE model, the Bottleneck feature b_n is used as the speaker's personality characteristic to reconstruct the converted speech spectral features; compared with a system that represents the speaker's personality with the classification label feature y_n, the resulting converted speech has a better conversion effect and sound quality.
Drawings
FIG. 1 is a block diagram of the system training process of the present invention;
FIG. 2 is a block diagram of the system conversion process of the present invention;
FIG. 3 is a block diagram of a DNN network based on speaker recognition tasks in accordance with the present invention;
FIG. 4 is a block diagram of a VAE voice spectral feature conversion network of the present invention;
FIG. 5 is a block diagram of a Bottleneck feature mapping network of the present invention;
FIG. 6 is a schematic diagram of a VAE model variational Bayesian process parameter estimation;
FIG. 7 is a comparison graph of MCD values of converted speech under different conversion situations based on a VAE model using different features to characterize the personality of a speaker.
Detailed Description
The technical solution of the present invention is further explained with reference to the embodiments according to the drawings.
The invention adopts the following technical scheme. In the voice conversion method based on a VAE under non-parallel corpus training, the Mel cepstrum features of speech are extracted with the AHOcoder speech codec and spliced with their first-order and second-order difference features on the MATLAB platform; the characteristic parameters of the preceding and following frames are then spliced to form the joint characteristic parameter x_n. x_n is used as training data to train a DNN based on the speaker recognition task; after the network training has converged, x_n is fed into the DNN frame by frame and the Bottleneck-layer output of each frame is obtained, i.e. the Bottleneck characteristic parameter b_n containing the speaker's personality characteristics. x_n is used as training data for the encoder module of the VAE model and b_n as training data for decoding and reconstruction in the decoder module; the VAE model is trained so that its encoder module yields, in the hidden space z, the phoneme information z_n carrying the semantic features (the sampling features), and its decoder module reconstructs the speech spectral features from this semantic phoneme information z_n together with the Bottleneck feature b_n containing the speaker's personality characteristics. The joint feature formed by splicing the semantic phoneme information z_n with the speaker classification label feature y_n is used as training data of a BP network to train the Bottleneck feature mapping network of the target speaker, the goal being to minimize the error between the network output and the Bottleneck feature b_n of each frame. During conversion, the spectral features of the speech to be converted first pass through the encoder module of the VAE model to obtain the corresponding semantic phoneme information z_n, which is spliced frame by frame with the classification label feature y_n of the target speaker to form a joint feature; this is fed into the BP network to obtain the target speaker's Bottleneck feature b̂_n for each frame. The semantic phoneme information z_n and the target speaker's frame-wise Bottleneck features b̂_n are then spliced frame by frame and reconstructed into converted speech spectral features by the decoder module of the VAE model, and finally the speech is synthesized with AHOdecoder. The method specifically comprises a training step and a voice conversion step:
FIG. 1 is a block diagram of a training process of a system according to the present invention, the training steps being:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
The Mel cepstrum features of each speaker's training speech are extracted separately with the AHOcoder sound codec and read into the Matlab platform. The invention uses 19-dimensional Mel cepstrum features; the speech content of each speaker can differ, and no DTW alignment is needed.
2) Perform differential processing on the extracted Mel cepstrum characteristic parameter X of each frame, splice the result with the original characteristic parameters, and then splice the obtained characteristic parameters in the time domain with those of the preceding and following frames to form the joint characteristic parameter x_n.
The first-order and second-order differences of each extracted frame's characteristic parameter X are computed and spliced with the original feature to obtain the 57-dimensional characteristic parameter X_t = (X, ΔX, Δ²X); X_t is then spliced in the time domain with the characteristic parameters of the previous and next frames to form the 171-dimensional joint characteristic parameter x_n = (X_{t-1}, X_t, X_{t+1}).
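A sketch of this feature preparation step, assuming the 19-dimensional Mel cepstral frames are already loaded as a NumPy array of shape (T, 19); the simple numerical difference used here stands in for whatever delta computation is actually used on the MATLAB platform, and the edge padding for the first and last frames is an assumption:

    import numpy as np

    def make_joint_features(mcc):
        """Build 171-dim joint features from (T, 19) Mel-cepstral frames:
        append delta and delta-delta (57 dims), then splice each frame with
        its left and right neighbours (3 * 57 = 171 dims)."""
        delta  = np.gradient(mcc, axis=0)          # first-order difference
        delta2 = np.gradient(delta, axis=0)        # second-order difference
        xt = np.concatenate([mcc, delta, delta2], axis=1)     # (T, 57)

        # pad by repeating edge frames so every frame has a left/right context
        padded = np.pad(xt, ((1, 1), (0, 0)), mode='edge')
        xn = np.concatenate([padded[:-2], padded[1:-1], padded[2:]], axis=1)
        return xn                                   # (T, 171)

    mcc = np.random.randn(100, 19)                 # placeholder for real features
    print(make_joint_features(mcc).shape)          # (100, 171)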
3) Use the joint characteristic parameters x_n and the speaker classification label features y_n to train the DNN, adjusting the DNN weights to reduce the classification error until the network converges, obtaining a DNN based on the speaker recognition task, and extract the Bottleneck feature b_n of each frame.
The structure of the Bottleneck feature extraction DNN used in the invention is shown in FIG. 3. The number of input-layer nodes corresponds to the dimension of the speech spectral features participating in training, and the output is the softmax classification output over speakers, with the number of output nodes determined by the number of speakers participating in training. Extracting the Bottleneck feature b_n comprises the following steps:
31) Obtain, on the MATLAB platform, the joint characteristic parameter x_n and the speaker classification label feature y_n corresponding to each frame; at this stage the source and target speakers are not distinguished, and the characteristic parameters of each frame are distinguished only by the speaker classification label feature y_n;
32) The DNN is a fully-connected neural network with a 9-layer structure: the input layer has 171 nodes, corresponding to the 171-dimensional features x_n of each frame, followed by 7 hidden layers of 1200 nodes each except for the narrower 57-node layer, which is the Bottleneck layer. The connection weights between the layers of the DNN are pre-trained without supervision by a layer-by-layer greedy pre-training method, and the hidden-layer activation function is the ReLU function, which is biologically closer to a brain neuron, namely:
f(x)=max(0,x)
the Re L U function is considered to have the expressive power of more primitive features because of its unilateral inhibition, sparse activation, and relatively broad excitatory boundaries.
The activation value of the (k+1)-th hidden layer is: h_{k+1} = f(w_k h_k + B_k), where h_{k+1} and h_k are the activation values of the (k+1)-th and k-th hidden layers respectively, w_k is the connection weight between the (k+1)-th and k-th layers, and B_k is the bias of the k-th layer.
33) The DNN output layer is set as a softmax classification output. Spectral characteristic parameters of 100 sentences from each of 5 speakers are selected for training, so the output layer has 5 nodes, corresponding to the label features of the 5 speakers. The speaker classification label feature y_n is used as supervision information for supervised training of the DNN; the network weights are adjusted with the stochastic gradient descent algorithm to minimize the error between the DNN classification output and y_n until convergence, yielding a DNN based on the speaker recognition task, i.e. the Bottleneck feature extraction network;
34) The joint characteristic parameters x_n are fed frame by frame into the DNN with a feed-forward pass, and the activation values of the Bottleneck layer for each frame are extracted, i.e. the Bottleneck feature b_n corresponding to that frame's characteristic parameters. In the invention the Bottleneck layer is the fourth hidden layer, namely: b_n = f(w_3 h_3 + B_3), where h_3 is the activation value of the 3rd hidden layer, w_3 is the connection weight between the 3rd and 4th layers, and B_3 is the bias of the 3rd layer.
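A PyTorch sketch of a bottleneck DNN of this kind (171-dimensional input, seven ReLU hidden layers with a 57-node fourth layer as the Bottleneck layer, softmax output over 5 speakers). The layer-by-layer greedy pre-training is omitted, and the 1200-node width of the non-Bottleneck hidden layers, the learning rate and the random batch are assumptions for illustration:

    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        """Speaker-classification DNN whose 4th hidden layer (57 nodes)
        provides the Bottleneck feature b_n for each input frame."""
        def __init__(self, in_dim=171, n_speakers=5, width=1200, bn_dim=57):
            super().__init__()
            dims = [in_dim, width, width, width, bn_dim, width, width, width]
            self.hidden = nn.ModuleList(
                nn.Linear(dims[i], dims[i + 1]) for i in range(7))
            self.out = nn.Linear(width, n_speakers)

        def forward(self, x, return_bottleneck=False):
            b_n = None
            for i, layer in enumerate(self.hidden):
                x = torch.relu(layer(x))
                if i == 3:                    # 4th hidden layer = Bottleneck
                    b_n = x
            logits = self.out(x)              # CrossEntropyLoss applies softmax
            return (logits, b_n) if return_bottleneck else logits

    # Supervised fine-tuning with speaker labels y_n (random data as placeholder)
    model = BottleneckDNN()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x_n = torch.randn(32, 171)
    y_n = torch.randint(0, 5, (32,))
    loss = nn.CrossEntropyLoss()(model(x_n), y_n)
    loss.backward(); opt.step()

    # After training: extract b_n frame by frame
    _, b_n = model(x_n, return_bottleneck=True)
    print(b_n.shape)                          # torch.Size([32, 57])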
4) Use the joint characteristic parameters x_n and the Bottleneck feature b_n corresponding to each frame to train the VAE model until training converges, and extract the sampling feature z_n of each frame from the hidden space z of the VAE model.
The Variational Auto-Encoder (VAE) used in the invention is a generative learning method; the concrete structure of the model is shown in FIG. 4, where x_{s,n} denotes a characteristic parameter of the source speech, x̂_{t,n} denotes the characteristic parameters of the converted target-speaker speech, b_n denotes the Bottleneck feature of the frame corresponding to the target speaker, μ and σ are the vector representations of the means and covariances of the components of the Gaussian distribution, z denotes the hidden space of the VAE model obtained through the sampling process, and z_n is the sampling feature. The parameter estimation process of VAE model training is shown in FIG. 6. VAE model training comprises the following steps:
41) Use the joint characteristic parameters x_n as training data for the encoder module of the VAE model and the Bottleneck feature b_n as training data for decoding and reconstruction in the decoder module, with b_n serving as control information for the speech spectrum reconstruction process in the decoder; that is, b_n and the sampling feature z_n are spliced frame by frame and the decoder module of the VAE model is trained to reconstruct the speech spectral features.
The encoder of the VAE model has an input layer of 171 nodes followed by two hidden layers: the first has 500 nodes and the second has 64 nodes, of which the first 32 compute the mean of each component of the mixed Gaussian distribution and the last 32 compute the variance of each component (i.e. the neural network computes a Gaussian mixture that better fits the input signal);
42) According to the variational Bayes principle in the VAE model, the KL (Kullback-Leibler) divergence and the mean square error in the parameter estimation process of the VAE model shown in FIG. 4 are optimized with an ADAM optimizer to adjust the network weights, obtaining a VAE speech spectrum conversion model;
43) The joint characteristic parameters x_n are fed frame by frame into the VAE speech spectrum conversion model, and the hidden sampling feature z_n is obtained through the sampling process.
More intuitively, the decoder module of the VAE model modulates the semantic phoneme information z_n with the speaker personality feature b_n.
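A PyTorch sketch of a VAE of this form, with the decoder conditioned on the Bottleneck feature b_n. The encoder sizes (171 → 500 → 32 means + 32 variances) follow the description above; the decoder's 500-unit hidden layer, the diagonal-Gaussian (rather than mixture) sampling, and the unit weighting of the KL term are simplifying assumptions:

    import torch
    import torch.nn as nn

    class ConditionalVAE(nn.Module):
        """VAE whose decoder is conditioned on the Bottleneck feature b_n."""
        def __init__(self, x_dim=171, h_dim=500, z_dim=32, b_dim=57):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
            self.dec = nn.Sequential(
                nn.Linear(z_dim + b_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

        def encode(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sampling z_n
            return z, mu, logvar

        def forward(self, x, b):
            z, mu, logvar = self.encode(x)
            x_rec = self.dec(torch.cat([z, b], dim=-1))   # reconstruct from (z_n, b_n)
            return x_rec, mu, logvar

    def vae_loss(x, x_rec, mu, logvar):
        mse = ((x - x_rec) ** 2).sum(dim=-1).mean()
        kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=-1)).mean()
        return mse + kl

    model = ConditionalVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # ADAM, as in the patent
    x_n, b_n = torch.randn(32, 171), torch.randn(32, 57)  # placeholder batches
    x_rec, mu, logvar = model(x_n, b_n)
    loss = vae_loss(x_n, x_rec, mu, logvar)
    loss.backward(); opt.step()
    print(loss.item())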
5) Splice the sampling feature z_n with the speaker classification label feature y_n corresponding to each frame to obtain the training data of the Bottleneck feature mapping network (a BP network), use the Bottleneck feature b_n of each frame as supervision information to guide the training of this mapping network, and minimize its output error with a stochastic gradient descent algorithm to obtain the Bottleneck feature mapping network.
the Bottleneeck feature mapping network of the target speaker used in the invention adopts a BP network, and the structure is shown in FIG. 5, wherein the input parameter is zn+ynWherein z isnFor variational self-encoder hidden layer features, ynA label characteristic for a speaker participating in training; output as Bottleneck feature b of the target speakern. The method for obtaining the Bottleneck feature mapping network comprises the following steps:
51) sampling characteristic z of hidden space of VAEnClassification label characteristic y corresponding to speaker of each framenSplicing the data to be used as training data of a Bottleneck feature mapping network, wherein the Bottleneck feature mapping network adopts a three-layer feedforward fully-connected neural network and comprises an input layer, a hidden layer and an output layer, the number of nodes of the input layer is 37, and 32 nodes correspond to sampling features z in a VAE model n5 nodes correspond to the classification label characteristic y of the 5-dimensional speaker formed by the five speakers participating in the trainingn(ii) a The output layer is 57 nodes and corresponds to 57-dimensional Bottleneck characteristics; the middle part of the system comprises a hidden layer, the number of nodes is 1200, a hidden layer activation function is a sigmoid function to introduce nonlinear change, and an output layer is linear output; the expression of the sigmoid function is:
f(x)=1/(1+ex)
52) According to the mean-square-error minimization criterion, the weights of the Bottleneck feature mapping network are optimized with a stochastic gradient descent algorithm using error back-propagation, minimizing the error between the Bottleneck feature b̂_n output by the network and the Bottleneck feature b_n of each frame, i.e. E = (1/N) Σ_n ‖b̂_n − b_n‖². Optimizing the weights of the whole network finally yields a BP mapping network that maps the sampling feature z_n together with the target speaker's classification label feature y_n to the target speaker's Bottleneck feature b̂_n.
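A PyTorch sketch of this BP mapping network (37 inputs = 32-dimensional z_n plus a 5-dimensional one-hot y_n, one 1200-node sigmoid hidden layer, 57 linear outputs), trained with MSE and stochastic gradient descent; the batch data and learning rate are placeholders:

    import torch
    import torch.nn as nn

    # BP mapping network: [z_n ; y_n] (32 + 5 = 37 dims) -> b_n (57 dims),
    # one 1200-unit sigmoid hidden layer and a linear output layer.
    bp_net = nn.Sequential(
        nn.Linear(37, 1200), nn.Sigmoid(), nn.Linear(1200, 57))

    opt = torch.optim.SGD(bp_net.parameters(), lr=1e-2)
    z_n = torch.randn(32, 32)                      # sampled latent features
    y_n = torch.eye(5)[torch.randint(0, 5, (32,))] # one-hot speaker labels
    b_n = torch.randn(32, 57)                      # supervision: Bottleneck features

    b_hat = bp_net(torch.cat([z_n, y_n], dim=-1))
    loss = nn.MSELoss()(b_hat, b_n)
    loss.backward(); opt.step()
    print(loss.item())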
The trained DNN network, VAE network and Bottleneck feature mapping network are combined to form a voice conversion system based on VAE and Bottleneck features;
the voice conversion is realized according to the spectrum conversion flow shown in fig. 2, and the voice conversion step includes:
6) The joint characteristic parameters X_p of the source speaker's speech to be converted pass through the encoder module of the VAE model to obtain the sampling feature z_n of each frame in the hidden space z.
The joint characteristic parameters X_p of the speech to be converted are obtained as follows: the Mel cepstrum characteristic parameters of the speech to be converted are extracted with AHOcoder; the first-order and second-order differences of each frame's characteristic parameters are computed on the MATLAB platform and spliced with the original features; the result is then spliced in the time domain with the characteristic parameters of the preceding and following frames to form the joint characteristic parameters, i.e. the spectral characteristic parameters X_p of the speech to be converted.
7) The sampling feature z_n and the classification label feature y_n of the target speaker are spliced frame by frame and input into the Bottleneck feature mapping network to obtain the target speaker's Bottleneck feature b̂_n.
8) The Bottleneck feature b̂_n and the sampling feature z_n are spliced frame by frame, and the decoder module of the VAE model reconstructs the joint characteristic parameters X_p' of the converted speech.
9) The speech signal is reconstructed using an AHOcoder sound codec.
The speech signal is reconstructed as follows: the converted speech characteristic parameters X_p' are restored to Mel cepstrum form, i.e. the time-domain splicing terms and the difference terms are removed, and then the speech codec AHOcoder is used to synthesize the converted speech.
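Putting the pieces together, the conversion stage can be sketched as follows. The stand-in modules only mirror the layer sizes given above (the decoder's hidden width is an assumption), using the posterior mean as z_n is a common simplification, and the AHOcoder analysis/synthesis steps happen outside this snippet:

    import torch
    import torch.nn as nn

    def convert(vae_encoder_mu, bp_net, vae_decoder, x_p, y_tgt):
        """Conversion stage of the described system (a sketch):
        1. encode the source joint features x_p into latent features z_n;
        2. map [z_n ; y_tgt] through the BP network to target Bottleneck features;
        3. decode [z_n ; b_hat] back into converted joint features X_p'."""
        with torch.no_grad():
            z_n = vae_encoder_mu(x_p)
            y_rep = y_tgt.unsqueeze(0).expand(len(z_n), -1)
            b_hat = bp_net(torch.cat([z_n, y_rep], dim=-1))
            x_conv = vae_decoder(torch.cat([z_n, b_hat], dim=-1))
        return x_conv

    # Stand-in modules with the layer sizes used in the description
    vae_encoder_mu = nn.Sequential(nn.Linear(171, 500), nn.ReLU(), nn.Linear(500, 32))
    bp_net = nn.Sequential(nn.Linear(37, 1200), nn.Sigmoid(), nn.Linear(1200, 57))
    vae_decoder = nn.Sequential(nn.Linear(89, 500), nn.ReLU(), nn.Linear(500, 171))

    x_p = torch.randn(200, 171)        # joint features of the speech to be converted
    y_tgt = torch.eye(5)[2]            # classification label of the target speaker
    x_conv = convert(vae_encoder_mu, bp_net, vae_decoder, x_p, y_tgt)
    print(x_conv.shape)                # (200, 171); strip splicing/deltas, then AHOdecoder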
Mel-Cepstrum Distortion (MCD) is an objective measure of speech conversion quality: the smaller the MCD value between the converted speech and the target speech, the better the conversion performance of the system. FIG. 7 compares the MCD values of converted speech for different conversion scenarios obtained by the VAE model trained on non-parallel corpora when different characteristic parameters are used to characterize the speaker's personality; it can be seen that voice conversion using the Bottleneck feature to characterize the speaker's personality performs better than the system using the speaker label.
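For reference, a sketch of the commonly used MCD computation between time-aligned target and converted cepstra; the (10/ln 10)·sqrt(2·Σ) constant is the usual convention, it is assumed the energy coefficient has already been excluded from the inputs, and this may differ from the exact protocol behind FIG. 7:

    import numpy as np

    def mel_cepstral_distortion(mcc_ref, mcc_conv):
        """Frame-averaged Mel-cepstral distortion in dB between time-aligned
        target (reference) and converted cepstra of shape (T, D).
        Uses the common definition (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)."""
        diff = mcc_ref - mcc_conv
        return np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

    ref = np.random.randn(100, 19)                  # placeholder target cepstra
    conv = ref + 0.1 * np.random.randn(100, 19)     # placeholder converted cepstra
    print(mel_cepstral_distortion(ref, conv))       # smaller is better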
Compared with other deep learning models such as the deep belief network (DBN) and the convolutional neural network (CNN), the variational auto-encoder (VAE) can, through the variational Bayes principle, learn during training a probability distribution that fits the original input signal in its encoder, obtain the features of the original signal's hidden space through a sampling process, and reconstruct the original signal from the sampled features in its decoder, so that the error between the reconstructed and original signals (or the difference between their probability distributions) is as small as possible. This property of the VAE model can be applied to style transfer. In voice conversion, the VAE model can separate, in the hidden space, the phoneme information that is unrelated to the speaker's personality but related to the semantics, and the speech spectral signal can be reconstructed by combining this hidden-space information with parameters representing the speaker's personality. In the invention, the speaker's personality is represented by the Bottleneck feature extracted by a DNN based on the speaker recognition task; a mapping network trained as a BP network learns the mapping from the joint feature composed of the phoneme information and the speaker label to the Bottleneck feature, so that the target speaker's Bottleneck feature is obtained indirectly from the source speaker's speech spectral features; finally, the phoneme information in the hidden space and the target speaker's Bottleneck feature are reconstructed into converted speech spectral features by the decoder module of the VAE model.
Aiming at the problems that the traditional Gaussian-mixture-model spectrum conversion method requires parallel corpora and that DTW alignment must be carried out before model training, the invention provides a method that combines the characteristics of the VAE model with a BP network to realize voice conversion under non-parallel corpora. The method has three key points: first, a DNN based on the speaker recognition task is used to extract the Bottleneck feature b_n representing the speaker's personality; second, a BP neural network is used to establish the mapping between the joint feature composed of the sampling feature z_n and the speaker classification label feature y_n and the Bottleneck feature b̂_n; third, the decoder module of the trained VAE model reconstructs the joint feature composed of the Bottleneck feature b̂_n and the sampling feature z_n into converted speech spectral features.
The innovations of the method are: ① the characteristics of the VAE model are used to separate, in the hidden space, the phoneme information that is unrelated to the speaker's personality but related to the semantics, so that voice conversion under non-parallel corpus training can be realized and various conversion tasks for different speakers can be completed with a single model training; ② the Bottleneck feature extracted by a DNN based on the speaker recognition task is used as the speaker's personality characteristic in the reconstruction process of the VAE decoder module, improving voice conversion performance.
Some principles of the method can also be adopted in medical auxiliary systems, for example in speech-production aids for patients who cannot vocalize normally because of physiological defects or diseases of the vocal organs. The invention has good extensibility and provides a solution to specific problems in voice conversion, including many-to-many (M2M) voice conversion.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications can be made without departing from the spirit of the invention, and such modifications are to be considered as within the scope of the invention.

Claims (8)

1. A voice conversion method based on VAE under the training of non-parallel corpus is characterized by comprising the following steps:
training:
1) respectively extracting Mel cepstrum characteristic parameters X of the speaker voices participating in training by using an AHOcoder sound codec;
2) performing differential processing on the extracted Mel cepstrum characteristic parameter X of each frame, splicing the result with the original characteristic parameter X, and splicing the obtained characteristic parameter X_t in the time domain with the characteristic parameters of the preceding and following frames to form the joint characteristic parameter x_n;
3) using the joint characteristic parameters x_n and the speaker classification label features y_n to train the DNN, adjusting the DNN weights to reduce the classification error until the network converges to obtain a DNN based on the speaker recognition task, and extracting the bottleneck feature b_n of each frame;
4) using the joint characteristic parameters x_n and the bottleneck feature b_n corresponding to each frame to train the VAE model until training converges, and extracting the sampling feature z_n of each frame from the hidden space z of the VAE model;
5) splicing the sampling feature z_n with the speaker classification label feature y_n corresponding to each frame to obtain the training data of the bottleneck feature mapping network, using the bottleneck feature b_n of each frame as supervision information to guide the training of the mapping network, and minimizing its output error with a stochastic gradient descent algorithm to obtain the bottleneck feature mapping network;
a voice conversion step:
6) passing the joint characteristic parameters X_p of the speech to be converted through the encoder module of the VAE model to obtain the sampling feature z_n of each frame in the hidden space z;
7) splicing the sampling feature z_n with the classification label feature y_n of the target speaker frame by frame and inputting the result into the bottleneck feature mapping network to obtain the bottleneck feature b̂_n of the target speaker;
8) splicing the bottleneck feature b̂_n with the sampling feature z_n frame by frame and reconstructing, through the decoder module of the VAE model, the joint characteristic parameters X_p′ of the converted speech;
9) The speech signal is reconstructed using an AHOcoder sound codec.
2. The method according to claim 1, wherein the extracting Mel cepstral features of the speaker's voice involved in the training in step 1) is performed by using an AHOcoder voice codec to extract Mel cepstral features of the speaker's voice involved in the training, and reading the Mel cepstral features into a Matlab platform.
3. The method according to claim 1, wherein obtaining the joint characteristic parameters in step 2) specifically comprises: computing the first-order and second-order differences of each extracted frame's characteristic parameter X and splicing them with the original characteristic parameter X to obtain X_t = (X, ΔX, Δ²X), and then splicing X_t in the time domain with the characteristic parameters of the preceding and following frames to form the joint characteristic parameter x_n = (X_{t-1}, X_t, X_{t+1}).
4. The VAE-based voice conversion method under non-parallel corpus training according to claim 1, wherein extracting the bottleneck feature b_n in step 3) comprises the following steps:
31) obtaining, on the MATLAB platform, the joint characteristic parameter x_n and the speaker classification label feature y_n corresponding to each frame;
32) performing unsupervised pre-training of the DNN with a layer-by-layer greedy pre-training method, where the hidden-layer activation function is the ReLU function;
33) setting the DNN output layer as a softmax classification output, using the speaker classification label feature y_n as the supervision information for supervised training of the DNN, and adjusting the network weights with the stochastic gradient descent algorithm to minimize the error between the DNN classification output and y_n until convergence, obtaining a DNN based on the speaker recognition task;
34) feeding the joint characteristic parameters x_n frame by frame into the DNN with a feed-forward pass, where the DNN is a fully-connected neural network with a 9-layer structure whose input layer has 171 nodes corresponding to the 171-dimensional features x_n of each frame, followed by 7 hidden layers of 1200 nodes each except for the narrower 57-node layer, which is the bottleneck layer, and extracting the activation values of the bottleneck layer for each frame, i.e. the bottleneck feature b_n corresponding to the Mel cepstrum characteristic parameters of each frame.
5. The method according to claim 1, wherein the VAE model training in step 4) comprises the following steps:
41) using the joint characteristic parameters x_n as training data for the encoder module of the VAE model and the bottleneck feature b_n as training data for decoding and reconstruction in the decoder module, with b_n serving as control information for the speech spectrum reconstruction process in the decoder; that is, b_n and the sampling feature z_n are spliced frame by frame and the decoder module of the VAE model is trained to reconstruct the speech spectral features;
42) optimizing the KL divergence and the mean square error in the parameter estimation process of the VAE model with an ADAM optimizer to adjust the network weights of the VAE model, obtaining a VAE speech spectrum conversion model;
43) inputting the joint characteristic parameters x_n frame by frame into the VAE speech spectrum conversion model and obtaining the hidden sampling feature z_n through the sampling process.
6. The method according to claim 1, wherein the obtaining of the bottleneck feature mapping network in step 5) comprises the following steps:
51) splicing the sampling feature z_n of the VAE speech spectrum conversion model with the speaker classification label feature y_n corresponding to each frame as the training data of the bottleneck feature mapping network, the mapping network adopting an input layer-hidden layer-output layer structure with a sigmoid hidden-layer activation function and a linear output layer;
52) according to the mean-square-error minimization criterion, optimizing the weights of the bottleneck feature mapping network with a stochastic gradient descent algorithm using error back-propagation, minimizing the error between the bottleneck feature b̂_n output by the network and the bottleneck feature b_n corresponding to each frame.
7. The VAE-based voice conversion method under non-parallel corpus training according to claim 1, wherein obtaining the joint characteristic parameters X_p of the speech to be converted in step 6) comprises: extracting the Mel cepstrum characteristic parameters of the speech to be converted with AHOcoder, performing first-order and second-order differencing of the extracted characteristic parameters of each frame on the MATLAB platform and splicing them with the original features, and then splicing the result in the time domain with the characteristic parameters of the preceding and following frames to form the joint characteristic parameters, i.e. the spectral characteristic parameters X_p of the speech to be converted.
8. The method according to claim 1, wherein reconstructing the speech signal in step 9) specifically comprises: restoring the converted speech characteristic parameters X_p' to Mel cepstrum form, i.e. removing the time-domain splicing terms and the difference terms, and then synthesizing the converted speech with the AHOcoder sound codec.
CN201810393556.XA 2018-04-27 2018-04-27 Voice conversion method based on VAE under non-parallel corpus training Active CN108777140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810393556.XA CN108777140B (en) 2018-04-27 2018-04-27 Voice conversion method based on VAE under non-parallel corpus training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810393556.XA CN108777140B (en) 2018-04-27 2018-04-27 Voice conversion method based on VAE under non-parallel corpus training

Publications (2)

Publication Number Publication Date
CN108777140A CN108777140A (en) 2018-11-09
CN108777140B true CN108777140B (en) 2020-07-28

Family

ID=64026673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810393556.XA Active CN108777140B (en) 2018-04-27 2018-04-27 Voice conversion method based on VAE under non-parallel corpus training

Country Status (1)

Country Link
CN (1) CN108777140B (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377978B (en) * 2018-11-12 2021-01-26 南京邮电大学 Many-to-many speaker conversion method based on i vector under non-parallel text condition
CN109326283B (en) * 2018-11-23 2021-01-26 南京邮电大学 Many-to-many voice conversion method based on text encoder under non-parallel text condition
CN109377986B (en) * 2018-11-29 2022-02-01 四川长虹电器股份有限公司 Non-parallel corpus voice personalized conversion method
CN109584893B (en) * 2018-12-26 2021-09-14 南京邮电大学 VAE and i-vector based many-to-many voice conversion system under non-parallel text condition
CN109671442B (en) * 2019-01-14 2023-02-28 南京邮电大学 Many-to-many speaker conversion method based on STARGAN and x vectors
CN109599091B (en) * 2019-01-14 2021-01-26 南京邮电大学 Star-WAN-GP and x-vector based many-to-many speaker conversion method
CN110033096B (en) * 2019-03-07 2021-04-02 北京大学 State data generation method and system for reinforcement learning
CN110070895B (en) * 2019-03-11 2021-06-22 江苏大学 Mixed sound event detection method based on factor decomposition of supervised variational encoder
CN110047501B (en) * 2019-04-04 2021-09-07 南京邮电大学 Many-to-many voice conversion method based on beta-VAE
CN110060690B (en) * 2019-04-04 2023-03-24 南京邮电大学 Many-to-many speaker conversion method based on STARGAN and ResNet
CN110060691B (en) * 2019-04-16 2023-02-28 南京邮电大学 Many-to-many voice conversion method based on i-vector and VARSGAN
CN110085254A (en) * 2019-04-22 2019-08-02 南京邮电大学 Multi-to-multi phonetics transfer method based on beta-VAE and i-vector
US11854562B2 (en) 2019-05-14 2023-12-26 International Business Machines Corporation High-quality non-parallel many-to-many voice conversion
CN110164463B (en) * 2019-05-23 2021-09-10 北京达佳互联信息技术有限公司 Voice conversion method and device, electronic equipment and storage medium
CN110211575B (en) * 2019-06-13 2021-06-04 思必驰科技股份有限公司 Voice noise adding method and system for data enhancement
CN110648658B (en) * 2019-09-06 2022-04-08 北京达佳互联信息技术有限公司 Method and device for generating voice recognition model and electronic equipment
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device
CN111627420B (en) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 Method and device for synthesizing emotion voice of specific speaker under extremely low resource
CN111724809A (en) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 Vocoder implementation method and device based on variational self-encoder
CN112017644B (en) * 2020-10-21 2021-02-12 南京硅基智能科技有限公司 Sound transformation system, method and application
CN112382271B (en) * 2020-11-30 2024-03-26 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN113032558B (en) * 2021-03-11 2023-08-29 昆明理工大学 Variable semi-supervised hundred degree encyclopedia classification method integrating wiki knowledge
CN113299267B (en) * 2021-07-26 2021-10-15 北京语言大学 Voice stimulation continuum synthesis method and device based on variational self-encoder
CN113571039B (en) * 2021-08-09 2022-04-08 北京百度网讯科技有限公司 Voice conversion method, system, electronic equipment and readable storage medium
CN113763987A (en) * 2021-09-06 2021-12-07 中国科学院声学研究所 Training method and device of voice conversion model
CN113763924B (en) * 2021-11-08 2022-02-15 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
CN114360557B (en) * 2021-12-22 2022-11-01 北京百度网讯科技有限公司 Voice tone conversion method, model training method, device, equipment and medium
CN115457969A (en) * 2022-09-06 2022-12-09 平安科技(深圳)有限公司 Speech conversion method, apparatus, computer device and medium based on artificial intelligence
WO2024069726A1 (en) * 2022-09-27 2024-04-04 日本電信電話株式会社 Learning device, conversion device, training method, conversion method, and program

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2880592B2 (en) * 1990-10-30 1999-04-12 インターナショナル・ビジネス・マシーンズ・コーポレイション Editing apparatus and method for composite audio information
CN102063899B (en) * 2010-10-27 2012-05-23 南京邮电大学 Method for voice conversion under unparallel text condition
CN103258531B (en) * 2013-05-29 2015-11-11 安宁 A kind of harmonic characteristic extracting method of the speech emotion recognition had nothing to do for speaker
CN104123933A (en) * 2014-08-01 2014-10-29 中国科学院自动化研究所 Self-adaptive non-parallel training based voice conversion method
CN104361620B (en) * 2014-11-27 2017-07-28 韩慧健 A kind of mouth shape cartoon synthetic method based on aggregative weighted algorithm
WO2016207978A1 (en) * 2015-06-23 2016-12-29 株式会社大入 Method and device for manufacturing book with audio, and method and device for reproducing acoustic waveform
US20170069306A1 (en) * 2015-09-04 2017-03-09 Foundation of the Idiap Research Institute (IDIAP) Signal processing method and apparatus based on structured sparsity of phonological features
CN106778700A (en) * 2017-01-22 2017-05-31 福州大学 One kind is based on change constituent encoder Chinese Sign Language recognition methods
CN107301859B (en) * 2017-06-21 2020-02-21 南京邮电大学 Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering
CN107274029A (en) * 2017-06-23 2017-10-20 深圳市唯特视科技有限公司 A kind of future anticipation method of interaction medium in utilization dynamic scene

Also Published As

Publication number Publication date
CN108777140A (en) 2018-11-09

Similar Documents

Publication Publication Date Title
CN108777140B (en) Voice conversion method based on VAE under non-parallel corpus training
CN107545903B (en) Voice conversion method based on deep learning
Morgan Deep and wide: Multiple layers in automatic speech recognition
Hossain et al. Implementation of back-propagation neural network for isolated Bangla speech recognition
Sun et al. Voice conversion using deep bidirectional long short-term memory based recurrent neural networks
CN112767958B (en) Zero-order learning-based cross-language tone conversion system and method
JP6911208B2 (en) Speaking style transfer
US11538455B2 (en) Speech style transfer
Luo et al. Emotional voice conversion using deep neural networks with MCC and F0 features
Luo et al. Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform
Azizah et al. Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages
CN110930981A (en) Many-to-one voice conversion system
Pascual et al. Multi-output RNN-LSTM for multiple speaker speech synthesis and adaptation
Moon et al. Mist-tacotron: End-to-end emotional speech synthesis using mel-spectrogram image style transfer
Cai et al. Research on English pronunciation training based on intelligent speech recognition
Lai et al. Phone-aware LSTM-RNN for voice conversion
Bi et al. Deep feed-forward sequential memory networks for speech synthesis
Swain et al. A DCRNN-based ensemble classifier for speech emotion recognition in Odia language
Zheng et al. An improved speech emotion recognition algorithm based on deep belief network
Zen Deep learning in speech synthesis.
Luo et al. Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform.
Zhao et al. Research on voice cloning with a few samples
Xie et al. Voice conversion with SI-DNN and KL divergence based mapping without parallel training data
Coto-Jiménez et al. LSTM deep neural networks postfiltering for improving the quality of synthetic voices
Banerjee et al. Intelligent stuttering speech recognition: A succinct review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant