WO2019227586A1 - Voice model training method, speaker recognition method, apparatus, device and medium
Voice model training method, speaker recognition method, apparatus, device and medium
- Publication number: WO2019227586A1 (application PCT/CN2018/094406, CN2018094406W)
- Authority: WO — WIPO (PCT)
- Prior art keywords: model, target, training, voiceprint feature, feature vector
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
- G10L17/06—Decision making techniques; Pattern matching strategies
Definitions
- The present application relates to the field of speech processing, and in particular to a speech model training method, a speaker recognition method, and a corresponding apparatus, device, and medium.
- The embodiments of the present application provide a speech model training method, apparatus, device, and medium to solve the problem of low accuracy in current speaker recognition.
- The embodiments of the present application further provide a speaker recognition method, apparatus, device, and medium to solve the problem of low speaker recognition accuracy.
- An embodiment of the present application provides a speech model training method, including:
- the target voiceprint feature vector is input to a deep neural network for training to obtain a target speaker speech recognition model.
- An embodiment of the present application provides a voice model training device, including:
- a general background model acquisition module, configured to perform general background model training based on pre-prepared training speech data to obtain a general background model;
- a target voiceprint feature model acquisition module, configured to adaptively process the target speaker's voice data based on the general background model to obtain a corresponding target voiceprint feature model;
- a target voiceprint feature vector acquisition module, configured to obtain a target voiceprint feature vector of the target speaker's voice data based on the target voiceprint feature model;
- a target speaker speech recognition model acquisition module, configured to input the target voiceprint feature vector into a deep neural network for training to obtain the target speaker speech recognition model.
- An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
- when the processor executes the computer-readable instructions, the following steps are implemented:
- the target voiceprint feature vector is input to a deep neural network for training to obtain a target speaker speech recognition model.
- This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions.
- when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
- the target voiceprint feature vector is input to a deep neural network for training to obtain a target speaker speech recognition model.
- An embodiment of the present application provides a speaker recognition method, including:
- An embodiment of the present application provides a speaker recognition device, including:
- a to-be-recognized voice data acquisition module configured to obtain the to-be-recognized voice data, the to-be-recognized voice data being associated with a user identifier;
- a to-be-recognized voiceprint feature model acquisition module configured to adaptively process the to-be-recognized voice data based on the general background model to obtain the to-be-recognized voiceprint feature model;
- a to-be-recognized voiceprint feature vector acquisition module configured to obtain a corresponding to-be-recognized voiceprint feature vector based on the to-be-recognized voiceprint feature model
- a recognition module configured to obtain the target speaker speech recognition model corresponding to the user identifier, and to use the target speaker speech recognition model to recognize the to-be-recognized voiceprint feature vector and obtain a recognition probability value; if the recognition probability value is greater than a preset probability value, the speaker is determined to be the user himself; the target speaker speech recognition model is obtained by using the above speech model training method.
- An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
- when the processor executes the computer-readable instructions, the following steps are implemented:
- This embodiment of the present application provides one or more non-volatile readable storage media storing computer-readable instructions.
- when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
- FIG. 1 is a flowchart of a speech model training method according to an embodiment of the present application.
- FIG. 2 is a flowchart of step S10 in FIG. 1.
- FIG. 3 is a flowchart of step S20 in FIG. 1.
- FIG. 4 is a flowchart of step S30 in FIG. 1.
- FIG. 5 is a flowchart of step S40 in FIG. 1.
- FIG. 6 is a principle block diagram of a speech model training device according to an embodiment of the present application.
- FIG. 7 is a flowchart of a speaker recognition method according to an embodiment of the present application.
- FIG. 8 is a principle block diagram of a speaker recognition device in an embodiment of the present application.
- FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
- FIG. 1 shows a flowchart of a speech model training method according to an embodiment of the present application.
- the speech model training method can be applied to computer equipment of financial institutions such as banks, securities, investment, and insurance, or other institutions that need to perform speaker recognition, for training the speech model, so as to use the trained speech model for speaker recognition.
- the computer device is a device that can perform human-computer interaction with a user, including, but not limited to, a computer, a smart phone, and a tablet.
- the speech model training method includes the following steps:
- S10 Perform a general background model training based on the training speech data prepared in advance to obtain a general background model.
- the training speech data is speech data used for training a general background model.
- The training voice data may be recording data collected by a recording module integrated in a computer device, or by a recording device connected to the computer device, from a large number of users without identifiers, or an open-source voice data set on the Internet may be used directly as the training data set.
- The Universal Background Model (UBM) is a Gaussian Mixture Model (GMM) that represents the voice feature distribution of a large number of non-specific speakers. Because UBM training usually uses a large amount of speech data that is unrelated to specific speakers and to channels, the UBM can generally be considered a model that has nothing to do with any specific speaker: it merely fits the distribution of human speech features and does not represent a particular speaker.
- The Gaussian mixture model is a model that uses Gaussian probability density functions (that is, normal distribution curves) to accurately quantify things, decomposing one thing into several components based on Gaussian probability density functions (that is, normal distribution curves).
- In this embodiment, pre-prepared training voice data is used to train the general background model.
- The expression of the general background model is a Gaussian probability density function: $p(x) = \sum_{k=1}^{K} C_k\,N(x; m_k, R_k)$, where x represents the training speech data, K represents the number of Gaussian distributions that make up the general background model, $C_k$ represents the coefficient of the k-th mixture Gaussian, and $N(x; m_k, R_k)$ represents a Gaussian distribution with D-dimensional mean vector $m_k$ and D×D diagonal covariance matrix $R_k$.
- Training the general background model is actually a matter of finding the parameters ($C_k$, $m_k$ and $R_k$) in the expression.
- Since the general background model is expressed as a Gaussian probability density function, the expectation-maximization algorithm (Expectation Maximization Algorithm, EM algorithm) may be employed to obtain the parameters ($C_k$, $m_k$ and $R_k$) in the expression.
- the EM algorithm is an iterative algorithm used to perform maximum likelihood estimation or maximum posterior probability estimation on a probability parameter model containing hidden variables.
- hidden variables refer to unobservable random variables, but hidden variables can be inferred from samples of observable variables.
- Because the training process is unobservable (or hidden), the parameters in the universal background model are actually hidden variables.
- the parameters in the universal background model can be obtained based on the maximum likelihood estimation or the maximum posterior probability estimation. After obtaining the parameters, the universal background model can be obtained.
- The general background model provides an important basis for subsequently obtaining the target voiceprint feature model based on it when the target speaker's voice data is scarce or insufficient.
- step S10 the universal background model is trained based on the training speech data prepared in advance to obtain the universal background model, including the following steps:
- The training voice data is voice data directly collected by a built-in recording module of a computer device or by an external recording device; it cannot be recognized directly by a computer and therefore cannot be used directly for training a general background model.
- Therefore, the training voice data is first converted into training speech features that the computer can recognize.
- the training speech feature may specifically be Mel Frequency Cepstrum Coefficient (MFCC).
- the process of acquiring training voice features is as follows:
- Preprocessing the training voice data can better extract the training voice features of the training voice data, so that the extracted training voice features can better represent the training voice data.
- the preprocessing specifically includes:
- Pre-emphasis the training voice data.
- pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end. With the increase of the signal rate, the signal is greatly damaged in the transmission process.
- the damaged signal needs to be compensated.
- the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the beginning of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission.
- Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
- Pre-emphasis processing can eliminate interference caused by the vocal cords and lips during the speaker's utterance, effectively compensate the suppressed high-frequency part of the training voice data, and highlight the high-frequency formants of the training voice data, strengthening its signal amplitude, which helps to extract training speech features.
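- As a rough illustration of the pre-emphasis step described above, the following Python sketch applies the commonly used first-order filter y[n] = x[n] − α·x[n−1]; the coefficient 0.97 and the function name are illustrative assumptions rather than values taken from this application.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Boost the high-frequency components of a 1-D speech signal (assumed coefficient)."""
    # y[n] = x[n] - alpha * x[n-1]; the first sample is kept unchanged.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```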
- Frame the pre-emphasized training speech data.
- Framing refers to the speech processing technology that cuts the entire voice signal into several segments.
- the size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length.
- Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames.
- Framed processing of the training voice data can divide the training voice data into several pieces of voice data, and the training voice data can be subdivided to facilitate the extraction of training voice features.
- the framed training speech data is windowed. After the training speech data is framed, discontinuities will appear at the beginning and end of each frame, so the more frames there are, the greater the error with the original signal.
- the use of windowing can solve this problem, making the framed training speech data continuous, and enabling each frame to exhibit the characteristics of a periodic function.
- the windowing process specifically refers to the processing of training speech data using a window function.
- the windowing function can select the Hamming window.
- The windowing formula is $s'_n = s_n \times \left(0.54 - 0.46\cos\frac{2\pi n}{N-1}\right)$, where N is the Hamming window length, n is the time index, $s_n$ is the signal amplitude in the time domain, and $s'_n$ is the signal amplitude in the time domain after windowing. Windowing the training speech data makes the signal of the framed training speech data continuous in the time domain, which helps to extract the training speech features of the training speech data.
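- A minimal sketch of the framing and Hamming-windowing steps is given below; the 25 ms frame length, 10 ms frame shift and 16 kHz sampling rate are illustrative assumptions within the 10–30 ms range mentioned above.

```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames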
- Perform a fast Fourier transform (FFT) on the pre-processed training voice data to obtain the frequency spectrum of the training voice data, and obtain the power spectrum of the training voice data according to the frequency spectrum.
- performing fast Fourier transform on the pre-processed training voice data specifically includes the following process: First, a formula for calculating a frequency spectrum is used to calculate the pre-processed training voice data to obtain a frequency spectrum of the training voice data.
- The formula for calculating the spectrum is $s(k) = \sum_{n=1}^{N} s(n)\,e^{-2\pi i k n / N}$, $1 \le k \le N$, where N is the frame size, $s(k)$ is the signal amplitude in the frequency domain, $s(n)$ is the signal amplitude in the time domain, n is the time index, and i is the imaginary unit.
- a formula for calculating a power spectrum is used to calculate a spectrum of the acquired training voice data, and a power spectrum of the training voice data is obtained.
- The formula for calculating the power spectrum of the training speech data is $P(k) = \frac{1}{N}\,\lvert s(k)\rvert^{2}$, $1 \le k \le N$, where N is the frame size and $s(k)$ is the signal amplitude in the frequency domain.
- In this way, the training speech data is converted from signal amplitudes in the time domain to signal amplitudes in the frequency domain, and the power spectrum of the training speech data is then obtained from the frequency-domain amplitudes, so that the training speech features can be extracted from the power spectrum of the training speech data.
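- A short sketch of the spectrum and power-spectrum calculation described above, computed per frame with NumPy's real FFT; the FFT size of 512 is an assumed value.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Compute the per-frame power spectrum P(k) = |s(k)|^2 / N from windowed frames."""
    spectrum = np.fft.rfft(frames, n=n_fft)   # signal amplitude in the frequency domain
    return (np.abs(spectrum) ** 2) / n_fft    # power spectrum of each frame
```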
- a Mel scale filter bank is used to process the power spectrum of the training speech data, and a Mel power spectrum of the training speech data is obtained.
- Processing the power spectrum of the training speech data with the Mel scale filter bank amounts to performing a Mel frequency analysis of the power spectrum.
- the Mel frequency analysis is an analysis based on human auditory perception.
- The human ear acts like a filter bank and only pays attention to certain specific frequency components (that is, human hearing is selective with respect to frequency): it only lets signals of certain frequencies pass through and directly ignores frequency signals it does not want to perceive.
- The Mel scale filter bank includes multiple filters that are not uniformly distributed on the frequency axis: there are many filters in the low-frequency region, densely distributed, while in the high-frequency region the number of filters becomes smaller and the distribution is sparse.
- the resolution of the Mel scale filter bank in the low frequency part is high, which is consistent with the hearing characteristics of the human ear, which is also the physical meaning of the Mel scale.
- The frequency-domain signal is segmented by the Mel scale filter bank so that each frequency segment corresponds to an energy value; if the number of filters is 22, the Mel power spectrum of the training speech data will consist of 22 corresponding energy values.
- The Mel power spectrum retains the frequency portions that are closely related to the characteristics of the human ear, and these portions reflect the characteristics of the training speech data well.
- Cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
- By performing cepstrum analysis on the Mel power spectrum of the training speech data, features that were originally too high in dimension and difficult to use directly can be converted into data that can be used in the model training process.
- The training speech feature used directly in this method is the Mel frequency cepstrum coefficient obtained in this way.
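- The Mel filtering and cepstrum-analysis steps can be sketched as follows, assuming librosa for the Mel filter bank, SciPy's DCT for the cepstrum, and the power-spectrum frames from the previous sketch as input; the 26 filters and 13 retained coefficients are common but assumed choices, not values specified in this application.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_power_spectrum(pow_frames, sample_rate=16000, n_fft=512,
                             n_mels=26, n_mfcc=13):
    """Convert per-frame power spectra into MFCC features (frames x n_mfcc)."""
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    mel_power = pow_frames @ mel_fb.T        # Mel power spectrum of each frame
    log_mel = np.log(mel_power + 1e-10)      # logarithm before the inverse transform
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```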
- S12 Use the training speech features to perform a general background model training to obtain a general background model.
- After acquiring the training voice features (such as MFCC features), they can be expressed in the form of vectors (matrices), and a computer device can directly read the training voice features in vector form and perform general background model training.
- the EM algorithm is a commonly used mathematical method for calculating the probability density function containing hidden variables, and will not be repeated here.
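- As a sketch of this step, a diagonal-covariance Gaussian mixture model can be fitted to the pooled training speech features with scikit-learn, whose EM-based GaussianMixture plays the role of the universal background model here; the component count and the random stand-in data are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for the stacked MFCC features of many unlabeled speakers,
# shape (num_frames, feature_dim); replace with real training speech features.
features = np.random.randn(5000, 13)

# A diagonal-covariance GMM trained with EM serves as the universal background model.
ubm = GaussianMixture(n_components=64, covariance_type='diag', max_iter=200)
ubm.fit(features)
# ubm.weights_, ubm.means_ and ubm.covariances_ correspond to C_k, m_k and R_k.
```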
- S20 Adaptively process the target speaker's voice data based on the general background model to obtain the corresponding target voiceprint feature model.
- the target speaker's voice data refers to the voice data required for training the target voiceprint feature model.
- the target voiceprint feature model refers to the voiceprint feature model related to some target speakers. Understandably, when the voiceprint feature model of some speakers needs to be trained, these speakers are the target speakers.
- The voice data of each target speaker can carry a corresponding user identifier, which is an identifier used to uniquely identify the user and can specifically be the target speaker's ID number, phone number, or the like.
- the target speaker's voice data is difficult to obtain in some scenarios (for example, in a scenario where a bank or the like processes a business), so the data sample of the target speaker's voice data is relatively small.
- A target voiceprint feature model trained directly on such a small number of data samples of the target speaker's voice data performs very poorly in the subsequent calculation of the target voiceprint feature vector and cannot reflect the voice (voiceprint) characteristics of the target speaker's voice data. Therefore, in this embodiment, the universal background model is used to adaptively process the target speaker's speech data to obtain the corresponding target voiceprint feature model, so that the accuracy of the obtained target voiceprint feature model is higher.
- the general background model is a Gaussian mixture model representing the distribution of a large number of non-specific speaker voice features.
- In adaptive processing, the portion of non-specific speaker voice features in the general background model that is similar to the target speaker's voice data is trained together with the target speaker's voice data, which "supplements" the target speaker's voice data well for training the target voiceprint feature model.
- That is, adaptive processing refers to a processing method that uses the portion of non-specific speaker voice features in the general background model that is similar to the target speaker's voice data as if it were target speaker voice data.
- The adaptive processing can specifically use the maximum a posteriori estimation algorithm (Maximum A Posteriori, MAP for short).
- Maximum a posteriori estimation obtains an estimate of a quantity that is difficult to observe directly, based on empirical data.
- The prior probability and Bayes' theorem are used to obtain the posterior probability.
- For the objective function (that is, the expression representing the target voiceprint feature model), the parameter values at which the likelihood function is maximized can be obtained (a gradient descent algorithm can be used to find this maximum), so that adaptive processing of the target speaker's speech data within the general background model is realized using the maximum a posteriori estimation algorithm.
- Understandably, before the target speaker's voice data can be used for calculation and training, the corresponding speech features should first be extracted.
- When the general background model adaptively processes the target speaker's speech data, the target speaker's speech data should therefore be understood as the speech features already extracted from it, and these speech features should be of the same type as the speech features used to train the general background model, for example MFCC features in both cases.
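- The adaptive processing can be sketched as the classical relevance-MAP update of the UBM means, one common formulation of maximum a posteriori adaptation; the relevance factor of 16 and the reuse of a fitted scikit-learn GaussianMixture (as in the earlier sketch) are assumptions, and a full implementation would typically also adapt or share the weights and covariances.

```python
import numpy as np

def map_adapt_means(ubm, target_features, relevance=16.0):
    """Relevance-MAP adaptation of UBM means to one target speaker's features."""
    post = ubm.predict_proba(target_features)        # responsibilities, shape (T, K)
    n_k = post.sum(axis=0)                           # soft frame counts per Gaussian
    # First-order statistics: expected feature value under each Gaussian.
    e_k = (post.T @ target_features) / np.maximum(n_k[:, None], 1e-10)
    alpha = n_k / (n_k + relevance)                  # data-dependent adaptation weight
    # New means interpolate between the speaker statistics and the UBM means.
    return alpha[:, None] * e_k + (1.0 - alpha[:, None]) * ubm.means_
```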
- step S20 the target speaker's voice data is adaptively processed based on the general background model to obtain the corresponding target voiceprint feature model, including the following steps:
- In the expression of the general background model, $p(x) = \sum_{k=1}^{K} C_k\,N(x; m_k, R_k)$, x represents the training speech data, K represents the number of Gaussian distributions that make up the general background model, $C_k$ represents the k-th mixture coefficient, and $N(x; m_k, R_k)$ represents a Gaussian distribution with D-dimensional mean vector $m_k$ and D×D diagonal covariance matrix $R_k$.
- the general background model is represented by a Gaussian probability density function.
- Since the covariance matrices $R_k$ among the parameters of the general background model are represented as vectors (matrices), singular value decomposition can be used to perform feature dimensionality reduction on the general background model and remove the noise data it contains.
- Singular value decomposition is an important matrix factorization in linear algebra; it is a generalization of the diagonalization of normal matrices in matrix analysis and has important applications in signal processing and statistics.
- In this embodiment, singular value decomposition is used to perform feature dimensionality reduction on the general background model.
- The singular value decomposition can be written as $R_k = U\Sigma V^{T} = \sum_i \sigma_i\,u_i v_i^{T}$, where each $\sigma_i$ before a term on the right side of the equation is a singular value, $\Sigma$ is a diagonal matrix, U is a square matrix whose column vectors $u_i$ are orthogonal and are called left singular vectors, V is a square matrix whose column vectors $v_i$ are orthogonal and are called right singular vectors, and T denotes matrix transposition. Each $u_i v_i^{T}$ is a matrix of rank 1, and the singular values satisfy $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n > 0$. A larger singular value indicates that its corresponding sub-item $\sigma_i u_i v_i^{T}$ represents a more important feature in $R_k$, and features with smaller singular values are considered less important.
- Accordingly, feature dimensionality reduction can be performed on the matrices in the parameters of the general background model: the sub-items with smaller singular values are removed, reducing the general background model with a higher feature dimension to a target background model with a lower feature dimension.
- This feature dimensionality reduction does not weaken the ability of the model to express features; it actually enhances it, because the feature dimensions removed during singular value decomposition have small singular values, and these small-singular-value features are in fact the noise part introduced when training the general background model. Therefore, using singular value decomposition to perform feature dimensionality reduction on the general background model removes the feature dimensions representing the noise part and yields the target background model (the target background model is an optimized general background model; it can replace the original universal background model to adaptively process the target speaker's speech data and achieve better results).
- The target background model represents the speech features of the training speech data well in a lower feature dimension, and when calculations involving the target background model are performed (such as using the target background model to adaptively process the target speaker's speech data), the amount of calculation is greatly reduced and efficiency is improved.
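- A small sketch of the dimensionality-reduction idea using NumPy's SVD: the matrix is decomposed, the sub-items with the smallest singular values are dropped, and a low-rank approximation is rebuilt; the retained rank r is an assumed parameter.

```python
import numpy as np

def svd_reduce(matrix, r):
    """Keep only the r largest singular values of a matrix and rebuild a low-rank version."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Each term s[i] * u[:, i] @ vt[i, :] has rank 1; small s[i] are treated as noise.
    return (u[:, :r] * s[:r]) @ vt[:r, :]
```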
- S22 Use the target background model to adaptively process the target speaker's speech data to obtain the corresponding target voiceprint feature model.
- the general background model used for adaptively processing the target speaker's speech data is specifically the target background model.
- The target background model refers to the optimized universal background model obtained by dimensionality reduction of the original general background model through singular value decomposition.
- For the process of adaptively processing the target speaker's voice data, refer to step S20; details are not described herein again.
- S30 Obtain a target voiceprint feature vector of the target speaker's voice data based on the target voiceprint feature model.
- the target voiceprint feature model is a model for calculating the target voiceprint feature vector.
- the target voiceprint feature vector refers to the feature vector obtained by the target voiceprint feature model and represents the target speaker's voice data.
- the target voiceprint feature model is actually a hybrid Gaussian model (GMM) corresponding to the target speaker's speech data.
- The expression of the target voiceprint feature model is similar to that of the general background model; only the specific parameter values in the expression differ.
- the target voiceprint feature vector can be obtained when the target background model is known. The obtained target voiceprint feature vector can still retain the key voiceprint features related to the target speaker's speech data in the lower dimension.
- step S30 obtaining a target voiceprint feature vector of target speaker voice data based on the target voiceprint feature model includes the following steps:
- S31 Acquire a voiceprint feature vector space of the target speaker's voice data based on the target voiceprint feature model.
- The means among the target voiceprint feature model parameters (the means of the general background model are denoted $m_k$, and the means of the target voiceprint feature model are denoted $m_k'$) are concatenated into a supervector M(i) of A×K dimensions, and the means ($m_k$) among the target background model parameters are concatenated to form a supervector $M_0$ of A×K dimensions; the voiceprint feature vector space T is an (A×K)×F-dimensional matrix describing the overall change.
- The parameters of the voiceprint feature vector space T contain hidden variables and cannot be obtained directly, but they can be obtained from the known M(i) and $M_0$: the EM algorithm can be used to iteratively calculate the voiceprint feature vector space T from M(i) and $M_0$.
- The target voiceprint feature vector reflects the voiceprint features of the target speaker's voice data in a lower dimension.
- When it is used in related calculations, it greatly reduces the amount of calculation and improves efficiency, while the target voiceprint feature vector can still retain the key voiceprint features related to the target speaker's voice data in that lower dimension.
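- The relation behind steps S31–S32 is commonly written as M(i) = M0 + T·w(i) in i-vector-style modelling, where w(i) is the low-dimensional voiceprint feature vector; the following is a deliberately simplified least-squares sketch of extracting w(i) once T is known, whereas a full implementation would estimate T with the EM algorithm and use posterior statistics rather than a plain pseudo-inverse.

```python
import numpy as np

def extract_voiceprint_vector(M_i, M_0, T):
    """Very simplified extraction of w from M(i) = M_0 + T @ w (least-squares sketch)."""
    # M_i, M_0: concatenated mean supervectors, shape (A*K,); T: shape (A*K, F).
    return np.linalg.pinv(T) @ (M_i - M_0)
```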
- the target voiceprint feature vector is input to a deep neural network for training to obtain a target speaker speech recognition model.
- the Deep Neural Networks (DNN) model includes an input layer, a hidden layer, and an output layer composed of neurons.
- the deep neural network model includes the weights and biases of each neuron connection between layers. These weights and biases determine the nature and recognition effect of the DNN model.
- the target speaker recognition model refers to a model capable of identifying a specific target speaker to be recognized.
- the target voiceprint feature vector is input to a deep neural network model for training, and the network parameters (ie weights and biases) of the deep neural network model are updated to obtain a target speaker speech recognition model.
- The target voiceprint feature vector includes most of the key voiceprint features of the target speaker's voice data in a lower feature dimension and can represent the target speaker's voice data to a certain extent.
- Training the target voiceprint feature vector in the DNN model further extracts features of the target speaker's voice data, performing deep feature extraction based on the target voiceprint feature vector; these deep features are expressed through the network parameters of the target speaker recognition model, so that the deep features of a target voiceprint feature vector can be extracted according to the target speaker recognition model, and a very accurate recognition effect can be achieved when speaker recognition is performed based on these deep features.
- the dimensionality of the target voiceprint feature vector used in training is not high, which can greatly improve the efficiency of model training, and the features of fewer dimensions can represent the target speaker's speech data.
- step S40 the target voiceprint feature vector is input into a deep neural network for training to obtain a target speaker speech recognition model, including the following steps:
- the DNN model is initialized.
- This initialization operation is to set initial values of weights and offsets in the DNN model.
- The initial values may be set to small values, for example in the range [-0.3, 0.3], or empirical values may be used directly to set the initial weights and offsets.
- Reasonable initialization of the DNN model can make the DNN model have a more flexible adjustment ability in the early stage, and the model can be effectively adjusted during the DNN model training process, so that the trained DNN model has a better recognition effect.
- the target voiceprint feature vector is grouped into the deep neural network model, and the output value of the deep neural network model is obtained according to the forward propagation algorithm.
- The forward propagation formula is $a^{i,l} = \sigma(W^{l} a^{i,l-1} + b^{l})$, where $a^{i,l}$ is the output of the i-th group of samples of the target voiceprint feature vector at the current layer l of the deep neural network model, $\sigma$ is the activation function, $W^{l}$ is the weight, $a^{i,l-1}$ is the output of the previous layer l-1, and $b^{l}$ is the bias.
- the target voiceprint feature vector is first divided into a preset number of samples, and then grouped and input into the DNN model for training, that is, the grouped samples are respectively input into the DNN model for training.
- The DNN's forward propagation algorithm is a series of linear operations and activation operations performed in the DNN model based on the weights W, biases b, and input values (vectors $x_i$) of each neuron: starting from the input layer, layer-by-layer calculations are performed until the output layer produces the output value.
- the output value of each layer of the network in the DNN model can be calculated until the output value of the last layer is calculated.
- the total number of layers in a DNN model is L.
- Using the weights W, offsets b, and input value vectors $x_i$ of the connected neurons, the output value $a^{i,L}$ of the output layer is obtained (i denotes the i-th group of input samples of the target voiceprint feature vector).
- the activation function used here can be a sigmoid or tanh activation function.
- Forward propagation is performed layer by layer according to the number of layers to obtain the final output value $a^{i,L}$ of the network in the DNN model (that is, the output value of the deep neural network model).
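- The forward propagation described above can be sketched in a few lines of NumPy, assuming a sigmoid activation and layer sizes chosen purely for illustration (the 400-dimensional input, 64-unit hidden layer and single output are assumptions, as is initializing the weights uniformly in [-0.3, 0.3]).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Layer-by-layer computation of a^{i,l} = sigma(W^l a^{i,l-1} + b^l)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a  # output value a^{i,L} of the last layer

# Assumed layer sizes: a 400-dimensional voiceprint feature vector, one hidden layer, one output.
sizes = [400, 64, 1]
weights = [np.random.uniform(-0.3, 0.3, (sizes[i + 1], sizes[i])) for i in range(2)]
biases = [np.zeros(sizes[i + 1]) for i in range(2)]
output = forward(np.random.randn(400), weights, biases)
```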
- Based on the output value $a^{i,L}$, the network parameters of the DNN model (the connection weights W and biases b of each neuron) are adjusted to obtain a target speaker speech recognition model with excellent speaker recognition ability.
- Specifically, a label value can be set in advance for $a^{i,L}$ (the label value is the expected output set according to the actual situation).
- The output value is compared with the label value to obtain the error produced when the target voiceprint feature vector is trained in the DNN model; a suitable error function is then constructed (for example, using the mean square error to measure the error), and error back propagation is performed according to this error function to adjust and update the weights W and offsets b of each layer of the DNN model.
- That is, the back-propagation algorithm is used to update the weights W and offsets b of each layer of the DNN model: the minimum of the error function is sought according to the back-propagation algorithm so as to optimize and update the weights W and offsets b of each layer of the DNN model and obtain the target speaker speech recognition model.
- the iteration step size of the model training is set to ⁇ , the maximum number of iterations MAX, and the stop iteration threshold ⁇ .
- The sensitivity $\delta^{i,l}$ is a common factor that appears in every parameter update, so the error can be calculated by using the sensitivity $\delta^{i,l}$ to update the network parameters in the DNN model.
- When the change in the parameters during iteration becomes smaller than the stop iteration threshold ε, the training can be stopped; or, when the training reaches the maximum number of iterations MAX, the training is stopped.
- In this way the weights W and biases b of each layer of the DNN model are updated, so that the target speaker speech recognition model finally obtained can perform speaker recognition based on the target voiceprint feature vector.
- Steps S41-S43 use the target voiceprint feature vector to train the DNN model, so that the target speaker speech recognition model obtained through training can effectively perform speaker recognition, and an accurate speaker recognition effect can be achieved with the lower-dimensional target voiceprint feature vector.
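- One gradient-descent update with error back propagation and a mean-square-error function might look as follows; the layer sizes, learning rate and single-sample usage are toy assumptions chosen only to illustrate the procedure of steps S41-S43, not the actual training configuration of this application.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, label, weights, biases, lr=0.1):
    """One error back-propagation update of all layer weights and biases (MSE error)."""
    # Forward pass: store every layer's activation a^{l}.
    acts = [x]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))
    # Output-layer sensitivity for the mean-square-error function.
    delta = (acts[-1] - label) * acts[-1] * (1 - acts[-1])
    for l in range(len(weights) - 1, -1, -1):
        grad_W = np.outer(delta, acts[l])
        if l > 0:
            # Propagate the sensitivity to the previous layer using the pre-update weights.
            prev_delta = (weights[l].T @ delta) * acts[l] * (1 - acts[l])
        weights[l] -= lr * grad_W
        biases[l] -= lr * delta
        if l > 0:
            delta = prev_delta
    return weights, biases

# Toy usage with assumed sizes: 400-dimensional input, one hidden layer, one output unit.
sizes = [400, 64, 1]
weights = [np.random.uniform(-0.3, 0.3, (sizes[i + 1], sizes[i])) for i in range(2)]
biases = [np.zeros(sizes[i + 1]) for i in range(2)]
backprop_step(np.random.randn(400), np.array([1.0]), weights, biases)
```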
- the target speaker speech recognition model further extracts the deep features of the target voiceprint feature vector during the model training process.
- the trained weights and offsets in the model reflect the deep features based on the target voiceprint feature vector.
- the target speaker speech recognition model can perform deep feature recognition based on the speaker's target voiceprint feature vector to achieve accurate speaker recognition.
- In the speech model training method provided by this embodiment, a general background model is first obtained, and singular value decomposition is then used to perform feature dimensionality reduction on it, reducing the general background model with a higher feature dimension to a target background model.
- The target background model is used to adaptively supplement the target speaker's speech data, so that a target voiceprint feature model representing the target speaker's speech data can be obtained even when the amount of data is small. Then, based on the target voiceprint feature model, a target voiceprint feature vector of the target speaker's voice data is obtained.
- The target voiceprint feature vector reflects the target speaker's voice data in a lower dimension; in calculations involving the target voiceprint feature vector, this greatly reduces the amount of calculation and improves efficiency, while ensuring that the target voiceprint feature vector still retains the key voiceprint features related to the target speaker's voice data in that lower dimension.
- the target voiceprint feature vector is input to the deep neural network for training to obtain the target speaker's speech recognition model.
- The target voiceprint feature vector can describe the speech features well, and deep features can be extracted based on it.
- In addition, the dimension of the target voiceprint feature vector used in training is not high, which greatly improves the efficiency of model training; features of fewer dimensions can still represent the target speaker's voice data, and a target speaker speech recognition model with higher recognition accuracy is obtained.
- FIG. 6 shows a principle block diagram of a speech model training device corresponding to the speech model training method in the embodiment.
- The speech model training device includes a general background model acquisition module 10, a target voiceprint feature model acquisition module 20, a target voiceprint feature vector acquisition module 30, and a target model acquisition module 40.
- The implementation functions of the universal background model acquisition module 10, the target voiceprint feature model acquisition module 20, the target voiceprint feature vector acquisition module 30, and the target model acquisition module 40 correspond one-to-one to the steps of the speech model training method in the embodiment; to avoid redundant description, they are not detailed one by one in this embodiment.
- the general background model acquisition module 10 is configured to perform a general background model training based on training voice data prepared in advance to obtain a general background model.
- the target voiceprint feature model acquisition module 20 is configured to adaptively process the target speaker's voice data based on a common background model to obtain a corresponding target voiceprint feature model.
- the target voiceprint feature vector acquisition module 30 is configured to obtain a target voiceprint feature vector of target speaker voice data based on the target voiceprint feature model.
- a target model acquisition module 40 is configured to input a target voiceprint feature vector into a deep neural network for training, and obtain a target speaker speech recognition model.
- the general background model acquisition module 10 includes a training speech feature unit 11 and a general background model acquisition unit 12.
- the training speech feature unit 11 is configured to obtain training speech features based on the training speech data.
- a general background model acquisition unit 12 is configured to use the training voice feature to perform a general background model training to obtain a general background model.
- the training speech feature unit 11 includes a preprocessing subunit 111, a power spectrum acquisition subunit 112, a Mel power spectrum subunit 113, and a training speech feature determination subunit 114.
- the pre-processing sub-unit 111 is configured to pre-process the training voice data.
- the power spectrum acquisition subunit 112 is configured to perform a fast Fourier transform on the preprocessed training voice data, obtain a frequency spectrum of the training voice data, and obtain a power spectrum of the training voice data according to the frequency spectrum.
- the Mel power spectrum subunit 113 is configured to process the power spectrum of the training speech data by using a Mel scale filter bank, and obtain a Mel power spectrum of the training speech data.
- the training speech feature determination subunit 114 is configured to perform cepstrum analysis on the Mel power spectrum to obtain Mel frequency cepstrum coefficients of training speech data, and determine the obtained Mel frequency cepstrum coefficients as training speech features.
- the target voiceprint feature model acquisition module 20 includes a target background model acquisition unit 21 and a target voiceprint feature model acquisition unit 22.
- the target background model obtaining unit 21 is configured to perform dimensionality reduction processing on the general background model by using singular value decomposition to obtain a target background model.
- the target voiceprint feature model acquisition unit 22 is configured to adaptively process the target speaker's voice data by using the target background model to obtain a corresponding target voiceprint feature model.
- the target voiceprint feature vector acquisition module 30 includes a voiceprint feature vector space acquisition unit 31 and a target voiceprint feature vector acquisition unit 32.
- the voiceprint feature vector space acquisition unit 31 is configured to acquire a voiceprint feature vector space of the target speaker's voice data based on the target voiceprint feature model.
- the target voiceprint feature vector acquisition unit 32 is configured to acquire a target voiceprint feature vector according to the voiceprint feature vector space.
- the target model acquisition module 40 includes an initialization unit 41, an output value acquisition unit 42, and an update unit 43.
- the initialization unit 41 is configured to initialize a deep neural network model.
- An output value obtaining unit 42 is configured to group the target voiceprint feature vector into the deep neural network model, and obtain the output value of the deep neural network model according to the forward propagation algorithm.
- The forward propagation formula used is $a^{i,l} = \sigma(W^{l} a^{i,l-1} + b^{l})$, where $a^{i,l}$ is the output of the i-th group of samples of the target voiceprint feature vector at layer l of the deep neural network model, l is the current layer of the deep neural network model, $\sigma$ is the activation function, W is the weight, l-1 is the previous layer of the current layer of the deep neural network model, and b is the bias.
- An update unit 43 is configured to perform error back propagation based on the output value of the deep neural network model, update the weights and offsets of each layer of the deep neural network model, and obtain a target speaker speech recognition model.
- FIG. 7 shows a flowchart of the speaker recognition method in this embodiment.
- the speaker recognition method can be applied to the computer equipment of financial institutions such as banks, securities, investment, and insurance, or other institutions that need to perform speaker recognition in order to perform speaker recognition and achieve artificial intelligence purposes.
- the speaker recognition method includes the following steps:
- S50 Acquire speech data to be identified, and the speech data to be identified is associated with a user identifier.
- the to-be-recognized voice data refers to the voice data of the user to be identified.
- the user identifier is an identifier for uniquely identifying the user.
- the user identifier may be an identifier that can uniquely identify the user, such as a user ID number and a user phone number.
- the to-be-recognized voice data may be specifically acquired through a built-in recording module of a computer device or an external recording device.
- The to-be-recognized voice data is associated with a user identifier, and the corresponding target speaker speech recognition model can be found through the user identifier.
- The target speaker speech recognition model then recognizes the to-be-recognized voice data and judges, based on it, whether the speaker is the user himself, thereby realizing speaker recognition.
- S60 Perform adaptive processing on the speech data to be identified based on the general background model, and obtain a voiceprint feature model to be identified.
- The to-be-recognized voiceprint feature model is the voiceprint feature model related to the to-be-recognized voice data, obtained by adaptively processing the to-be-recognized voice data through the target background model derived from the general background model.
- steps in this embodiment are similar to steps S21-S22, please refer to steps S21-S22, which will not be repeated here.
- the purpose of this step is to obtain the voiceprint feature model to be identified, so as to obtain the voiceprint feature vector to be identified according to the model.
- S70 Obtain a corresponding voiceprint feature vector based on the voiceprint feature model to be identified.
- the voiceprint feature vector to be identified refers to a feature vector obtained through the voiceprint feature model to be identified and representing the speech data to be identified.
- steps in this embodiment are similar to steps S31-S32. Please refer to steps S31-S32, which will not be repeated here.
- Steps S50-S70 are to obtain the voiceprint feature vector to be recognized which can represent the voice data to be recognized, to perform speaker recognition in the target speaker's voice recognition model based on the voiceprint feature vector to be recognized, and determine whether the voice data to be recognized belongs to the user.
- S80 Acquire the target speaker speech recognition model corresponding to the user identifier according to the user identifier, and use the target speaker speech recognition model to recognize the to-be-recognized voiceprint feature vector to obtain a recognition probability value; if the recognition probability value is greater than a preset probability value, the speaker is determined to be the user himself; here, the target speaker speech recognition model is obtained by using the speech model training method of the embodiment.
- a target speaker voice recognition model corresponding to the user identifier is obtained according to the user identifier.
- the target speaker voice recognition model is a recognition model stored in a database in advance, and the recognition model is related to the target speaker voice data. That is, it is associated with the user identification corresponding to the target speaker's voice data. Therefore, a corresponding target speaker recognition model can be obtained based on the user identification.
- the voiceprint feature vector to be recognized is input to the target speaker's voice recognition model for recognition, and the recognition probability value of the voiceprint feature vector to be recognized in the recognition model can be obtained.
- If the recognition probability value is greater than a preset probability value, the to-be-recognized voice data represented by the to-be-recognized voiceprint feature vector is considered to be the user's own voice, and it can be determined that the to-be-recognized voice data was uttered by the user himself, thereby realizing speaker recognition.
- the preset probability value refers to a preset reference threshold for judging whether the speech data to be recognized is issued by the user himself or herself, and is expressed by a probability value, for example, the preset probability value is 95%.
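- The decision rule in step S80 can be sketched as a simple comparison of the model's output probability with the preset threshold; the `recognition_model` object and its `predict_probability` call are hypothetical stand-ins for the trained target speaker speech recognition model looked up by user identifier, and the 0.95 threshold follows the example value given above.

```python
def is_target_user(recognition_model, voiceprint_vector, preset_probability=0.95):
    """Return True if the model's recognition probability exceeds the preset threshold."""
    # `recognition_model.predict_probability` is a hypothetical interface standing in for
    # the target speaker speech recognition model retrieved via the user identifier.
    probability = recognition_model.predict_probability(voiceprint_vector)
    return probability > preset_probability
```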
- In this embodiment, a corresponding to-be-recognized voiceprint feature model is obtained from the to-be-recognized voice data, and the to-be-recognized voiceprint feature vector is input into the target speaker speech recognition model corresponding to the user identifier associated with the to-be-recognized voice data, where recognition is performed to realize speaker recognition.
- the speaker speech recognition model can fully describe speech features with a lower-dimensional target voiceprint feature vector, and the speaker recognition method can achieve higher recognition accuracy when performing speaker speech recognition.
- FIG. 8 shows a principle block diagram of a speaker recognition device corresponding to the speaker recognition method in the embodiment.
- the speaker recognition device includes a voice data acquisition module 50 to be identified, a voiceprint feature model acquisition module 60 to be identified, a voiceprint feature vector acquisition module 70 to be identified, and a recognition module 80.
- The implementation functions of the to-be-recognized voice data acquisition module 50, the to-be-recognized voiceprint feature model acquisition module 60, the to-be-recognized voiceprint feature vector acquisition module 70, and the recognition module 80 correspond one-to-one to the steps of the speaker recognition method in the embodiment; to avoid redundant description, they are not detailed one by one in this embodiment.
- the to-be-recognized voice data acquisition module 50 is configured to obtain the to-be-recognized voice data, and the to-be-recognized voice data is associated with a user identifier.
- the to-be-recognized voiceprint feature model acquisition module 60 is configured to adaptively process the to-be-recognized voice data based on the general background model to obtain the to-be-recognized voiceprint feature model.
- the to-be-recognized voiceprint feature vector acquisition module 70 is configured to obtain the corresponding to-be-recognized voiceprint feature vector based on the to-be-recognized voiceprint feature model.
- the recognition module 80 is configured to obtain a target speaker's speech recognition model corresponding to the user's identification based on the user's identification, and use the target speaker's speech recognition model to identify the voiceprint feature vector to be identified to obtain a recognition probability value; if the recognition probability value is greater than The preset probability value is determined to be the user himself; wherein the target speaker's speech recognition model is obtained by using the speech model training method of the embodiment.
- This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions.
- When the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the speech model training method in the embodiment; to avoid repetition, they are not repeated here.
- When executed, the one or more processors also realize the functions of each module/unit in the speech model training device in the embodiment; to avoid duplication, these are not repeated here.
- When executed, the one or more processors implement the functions of each step in the speaker recognition method in the embodiment; to avoid repetition, they are not listed one by one here.
- When the computer-readable instructions are executed by one or more processors, the functions of each module/unit in the speaker recognition device in the embodiment are likewise realized; to avoid repetition, they are not repeated one by one here.
- The computer-readable storage medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and the like.
- FIG. 9 is a schematic diagram of a terminal device according to an embodiment of the present application.
- the terminal device 90 of this embodiment includes a processor 91, a memory 92, and computer-readable instructions 93 stored in the memory 92 and executable on the processor 91.
- When the computer-readable instructions 93 are executed by the processor 91, the speech model training method in the embodiment is implemented; to avoid repetition, details are not described here one by one.
- When the computer-readable instructions 93 are executed by the processor 91, the functions of each module/unit in the speech model training device in the embodiment are implemented; to avoid repetition, details are not described here one by one.
- the computer-readable instructions 93 are executed by the processor 91, the functions of the steps in the speaker recognition method in the embodiment are implemented. To avoid repetition, details are not described here one by one.
- the computer-readable instructions 93 are executed by the processor 91, the functions of each module / unit in the speaker recognition device in the embodiment are realized. To avoid repetition, we will not repeat them here.
Abstract
The present application relates to a voice model training method, and to a speaker recognition method, apparatus, device and medium. The voice model training method comprises: performing general background model training on the basis of training voice data prepared in advance to obtain a general background model; performing adaptive processing on the voice data of a target speaker on the basis of the general background model to obtain a corresponding target voiceprint feature model; obtaining a target voiceprint feature vector of the target speaker's voice data on the basis of the target voiceprint feature model; and inputting the target voiceprint feature vector into a deep neural network for training to obtain a speech recognition model of the target speaker. By using this voice model training method, the target speaker speech recognition model used to perform speaker recognition can provide an accurate recognition result.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810549432.6 | 2018-05-31 | ||
CN201810549432.6A CN108777146A (zh) | 2018-05-31 | 2018-05-31 | Speech model training method, speaker recognition method, apparatus, device and medium
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019227586A1 true WO2019227586A1 (fr) | 2019-12-05 |
Family
ID=64028243
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/094406 WO2019227586A1 (fr) | 2018-05-31 | 2018-07-04 | Speech model training method, and speaker recognition method, apparatus, device and medium
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108777146A (fr) |
WO (1) | WO2019227586A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP4053835A4 (fr) * | 2020-01-16 | 2023-02-22 | Tencent Technology (Shenzhen) Company Limited | Speech recognition method and apparatus, and device and storage medium
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109686382A (zh) * | 2018-12-29 | 2019-04-26 | 平安科技(深圳)有限公司 | Speaker clustering method and apparatus
CN110084371B (zh) * | 2019-03-27 | 2021-01-15 | 平安国际智慧城市科技股份有限公司 | Machine-learning-based model iterative update method, apparatus and computer device
CN110428842A (zh) * | 2019-08-13 | 2019-11-08 | 广州国音智能科技有限公司 | Speech model training method, apparatus, device and computer-readable storage medium
CN110491373A (zh) * | 2019-08-19 | 2019-11-22 | Oppo广东移动通信有限公司 | Model training method, apparatus, storage medium and electronic device
CN110781519B (zh) * | 2019-10-31 | 2023-10-31 | 东华大学 | Secure desensitization method for voice data publishing
CN110956957B (zh) * | 2019-12-23 | 2022-05-17 | 思必驰科技股份有限公司 | Training method and system for a speech enhancement model
CN111816185A (zh) * | 2020-07-07 | 2020-10-23 | 广东工业大学 | Method and apparatus for recognizing speakers in mixed speech
CN111883139A (zh) * | 2020-07-24 | 2020-11-03 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for screening target speech
CN112669836B (zh) * | 2020-12-10 | 2024-02-13 | 鹏城实验室 | Command recognition method, apparatus and computer-readable storage medium
CN112562648A (zh) * | 2020-12-10 | 2021-03-26 | 平安科技(深圳)有限公司 | Meta-learning-based adaptive speech recognition method, apparatus, device and medium
CN112669820B (zh) * | 2020-12-16 | 2023-08-04 | 平安科技(深圳)有限公司 | Speech-recognition-based exam cheating identification method, apparatus and computer device
CN112820299B (zh) * | 2020-12-29 | 2021-09-14 | 马上消费金融股份有限公司 | Voiceprint recognition model training method, apparatus and related device
CN112687290B (zh) * | 2020-12-30 | 2022-09-20 | 同济大学 | Compressed automatic cough detection method and embedded device
CN113077798B (zh) * | 2021-04-01 | 2022-11-22 | 山西云芯新一代信息技术研究院有限公司 | Emergency call device for elderly people living at home
CN114049900B (zh) * | 2021-12-08 | 2023-07-25 | 马上消费金融股份有限公司 | Model training method, identity recognition method, apparatus and electronic device
CN115240688B (zh) * | 2022-07-15 | 2024-09-03 | 西安电子科技大学 | Real-time speech information extraction method for a target speaker based on voiceprint features
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150149165A1 (en) * | 2013-11-27 | 2015-05-28 | International Business Machines Corporation | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors |
CN105575394A (zh) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint recognition method based on the global variability space and deep learning hybrid modeling
JP2016143043A (ja) * | 2015-02-05 | 2016-08-08 | 日本電信電話株式会社 | Speech model training method, noise suppression method, speech model training apparatus, noise suppression apparatus, speech model training program, and noise suppression program
CN106847292A (zh) * | 2017-02-16 | 2017-06-13 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus
CN107146601A (zh) * | 2017-04-07 | 2017-09-08 | 南京邮电大学 | Back-end i-vector enhancement method for speaker recognition systems
CN107610707A (zh) * | 2016-12-15 | 2018-01-19 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105096940B (zh) * | 2015-06-30 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | Method and apparatus for performing speech recognition
US10366687B2 (en) * | 2015-12-10 | 2019-07-30 | Nuance Communications, Inc. | System and methods for adapting neural network acoustic models |
CN107564513B (zh) * | 2016-06-30 | 2020-09-08 | 阿里巴巴集团控股有限公司 | Speech recognition method and apparatus
CN107785015A (zh) * | 2016-08-26 | 2018-03-09 | 阿里巴巴集团控股有限公司 | Speech recognition method and apparatus
KR101843074B1 (ko) * | 2016-10-07 | 2018-03-28 | 서울대학교산학협력단 | Method and system for extracting speaker recognition features using a VAE
CN107680600B (zh) * | 2017-09-11 | 2019-03-19 | 平安科技(深圳)有限公司 | Voiceprint model training method, and speech recognition method, apparatus, device and medium
2018
- 2018-05-31 CN CN201810549432.6A patent/CN108777146A/zh active Pending
- 2018-07-04 WO PCT/CN2018/094406 patent/WO2019227586A1/fr active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150149165A1 (en) * | 2013-11-27 | 2015-05-28 | International Business Machines Corporation | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors |
JP2016143043A (ja) * | 2015-02-05 | 2016-08-08 | 日本電信電話株式会社 | Speech model training method, noise suppression method, speech model training apparatus, noise suppression apparatus, speech model training program, and noise suppression program
CN105575394A (zh) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint recognition method based on the global variability space and deep learning hybrid modeling
CN107610707A (zh) * | 2016-12-15 | 2018-01-19 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus
CN106847292A (zh) * | 2017-02-16 | 2017-06-13 | 平安科技(深圳)有限公司 | Voiceprint recognition method and apparatus
CN107146601A (zh) * | 2017-04-07 | 2017-09-08 | 南京邮电大学 | Back-end i-vector enhancement method for speaker recognition systems
Non-Patent Citations (1)
Title |
---|
LI JINGYANG ET AL.: "A speaker verification method based on GMM-DNN", COMPUTER APPLICATIONS AND SOFTWARE, vol. 13, no. 12, 31 December 2016 (2016-12-31), pages 131 - 132 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP4053835A4 (fr) * | 2020-01-16 | 2023-02-22 | Tencent Technology (Shenzhen) Company Limited | Speech recognition method and apparatus, and device and storage medium
Also Published As
Publication number | Publication date |
---|---|
CN108777146A (zh) | 2018-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019227586A1 (fr) | Speech model training method, and speaker recognition method, apparatus, device and medium
WO2019227574A1 (fr) | Speech model training method, and speech recognition method, apparatus and device, and medium
Michelsanti et al. | Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification | |
Hsu et al. | Learning latent representations for speech generation and transformation | |
CN109074822B (zh) | Specific sound recognition method, device and storage medium
CN108922513B (zh) | Voice distinguishing method and apparatus, computer device and storage medium
Stöter et al. | Countnet: Estimating the number of concurrent speakers using supervised learning | |
CN110459225B (zh) | Speaker identification system based on CNN fused features
WO2019232829A1 (fr) | Voiceprint recognition method and apparatus, computer device, and storage medium
Stöter et al. | Classification vs. regression in supervised learning for single channel speaker count estimation | |
CN108922544B (zh) | Universal vector training method, and voice clustering method, apparatus, device and medium
Uria et al. | A deep neural network for acoustic-articulatory speech inversion | |
WO2018223727A1 (fr) | Voiceprint recognition method, apparatus and device, and medium
CN109065028A (zh) | Speaker clustering method and apparatus, computer device and storage medium
CN111968666B (zh) | Hearing aid speech enhancement method based on a deep domain-adaptive network
CN111899757B (zh) | Single-channel speech separation method and system for target speaker extraction
CN108922543B (zh) | Model library establishing method, and speech recognition method, apparatus, device and medium
KR102026226B1 (ko) | Method and system for signal-level feature extraction using a deep-learning-based variational inference model
Nainan et al. | Enhancement in speaker recognition for optimized speech features using GMM, SVM and 1-D CNN | |
WO2019232833A1 (fr) | Voice differentiation method and device, computer device and storage medium
CN111666996B (zh) | High-accuracy device source identification method based on the attention mechanism
WO2019232867A1 (fr) | Voice discrimination method and apparatus, computer device, and storage medium
Li et al. | A Convolutional Neural Network with Non-Local Module for Speech Enhancement. | |
Meutzner et al. | A generative-discriminative hybrid approach to multi-channel noise reduction for robust automatic speech recognition | |
CN115064175A (zh) | Speaker recognition method
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18921206 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18921206 Country of ref document: EP Kind code of ref document: A1 |