WO2019227574A1 - Voice model training method, voice recognition method, device and equipment, and medium - Google Patents

Voice model training method, voice recognition method, device and equipment, and medium

Info

Publication number
WO2019227574A1
WO2019227574A1 (PCT/CN2018/094348; CN2018094348W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
training
speech
model
voice
Prior art date
Application number
PCT/CN2018/094348
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019227574A1 publication Critical patent/WO2019227574A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0631 Creating reference templates; Clustering
    • G10L 2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique, using neural networks
    • G10L 25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of analysis window

Definitions

  • The present application relates to the field of speech recognition technology, and in particular, to a speech model training method, a speech recognition method, an apparatus, a device, and a medium.
  • The embodiments of the present application provide a speech model training method, apparatus, device, and medium, so as to solve the problem of the low accuracy of current speech recognition.
  • a speech model training method includes:
  • the target voiceprint feature recognition model and the target voice feature recognition model are stored in a database in association.
  • a voice model training device includes:
  • a training voice feature extraction module configured to obtain training voice data, and extract training voice features based on the training voice data
  • a target background model acquisition module configured to acquire a target background model based on the training speech feature
  • a target voice feature extraction module configured to obtain target voice data, and extract target voice features based on the target voice data
  • a target voiceprint feature recognition model acquisition module configured to adaptively process the target voice feature using the target background model to obtain a target voiceprint feature recognition model
  • a speech feature recognition acquisition module configured to input the target speech feature into a deep neural network for training, and obtain a target speech feature recognition model
  • a model storage module is configured to store the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • the processor executes the computer-readable instructions, the following steps are implemented:
  • the target voiceprint feature recognition model and the target voice feature recognition model are stored in a database in association.
  • One or more non-volatile readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the target voiceprint feature recognition model and the target voice feature recognition model are stored in a database in association.
  • the embodiments of the present application provide a method, a device, a device, and a medium for speech recognition to solve the problem of low accuracy of current speech recognition.
  • a speech recognition method includes:
  • the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by using the foregoing speech model training method;
  • if the target score is greater than a preset score threshold, it is determined that the speech data to be recognized is the target speech data corresponding to the user identifier.
  • a voice recognition device includes:
  • a to-be-recognized voice data acquisition module configured to obtain the to-be-recognized voice data, the to-be-recognized voice data being associated with a user identifier;
  • a model acquisition module is configured to query a database based on the user ID to obtain a target voiceprint feature recognition model and a target voice feature recognition model that are stored in an associated manner.
  • wherein the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by using the foregoing speech model training method;
  • a to-be-recognized speech feature extraction module configured to extract speech features to be recognized based on the speech data to be recognized;
  • a first score acquisition module configured to input the speech feature to be recognized into a target speech feature recognition model to obtain a first score
  • a second score obtaining module configured to input the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score
  • a target score obtaining module configured to multiply the first score with a preset first weighted ratio to obtain a first weighted score, multiply the second score with a preset second weighted ratio to obtain a second weighted score, and add the first weighted score and the second weighted score to obtain a target score;
  • a voice determination module is configured to determine, if the target score is greater than a preset score threshold, the voice data to be identified is target voice data corresponding to the user identifier.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • the processor executes the computer-readable instructions, the following steps are implemented:
  • the target voiceprint feature recognition model and the target voice feature recognition model are obtained by using the voice model training method.
  • if the target score is greater than a preset score threshold, it is determined that the speech data to be recognized is the target speech data corresponding to the user identifier.
  • One or more non-volatile readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the target voiceprint feature recognition model and the target voice feature recognition model are obtained by using the voice model training method.
  • if the target score is greater than a preset score threshold, it is determined that the speech data to be recognized is the target speech data corresponding to the user identifier.
  • FIG. 1 is an application environment diagram of a speech model training method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a speech model training method according to an embodiment of the present application.
  • FIG. 3 is a specific flowchart of step S10 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S11 in FIG. 3;
  • FIG. 5 is a specific flowchart of step S20 in FIG. 2;
  • FIG. 6 is a specific flowchart of step S50 in FIG. 2;
  • FIG. 7 is a schematic diagram of a voice model training device according to an embodiment of the present application.
  • FIG. 8 is a flowchart of a speech recognition method according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a voice recognition device according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 illustrates an application environment of a speech model training method provided by an embodiment of the present application.
  • the application environment of the speech model training method includes a server and a client, where the server and the client are connected through a network.
  • A client, also called a user terminal, refers to a program corresponding to the server that provides local services to the user.
  • The client is installed on a computer device that can interact with the user, including, but not limited to, computers, smartphones, tablets, and other devices.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • the server includes, but is not limited to, a file server, a database server, an application server, and a web server.
  • FIG. 2 shows a flowchart of a speech model training method according to an embodiment of the present application. This embodiment is described by taking the application of the speech model training method on a server as an example.
  • The speech model training method includes the following steps:
  • S10 Acquire training voice data, and extract training voice features based on the training voice data.
  • the training speech data is speech data used for training the target background model.
  • The training voice data may be recording data collected by recording a large number of unidentified users through a recording module integrated in a computer device or a recording device connected to the computer device, or an open-source voice data training set on the Internet may be used directly as the training voice data.
  • After the training voice data is acquired, it cannot be directly recognized by a computer or used directly to train the target background model. Therefore, training voice features need to be extracted from the training voice data, converting the training voice data into training voice features that a computer can recognize.
  • the training speech feature may specifically be Mel Frequency Cepstrum Coefficient (MFCC).
  • the MFCC feature has 39 dimensions (represented in the form of a vector), which can better describe the training speech data.
  • step S10 extracting a training voice feature based on the training voice data includes the following steps:
  • the training voice data is pre-processed when the training voice features are extracted.
  • the process of preprocessing the training voice data can better extract the training voice features of the training voice data, so that the extracted training voice features can better represent the training voice data.
  • preprocessing the training voice data includes the following steps:
  • S111 Perform pre-emphasis processing on the training voice data.
  • The pre-emphasis formula is s'(n) = s(n) − a·s(n−1), where s(n) is the signal amplitude in the time domain at time n, s'(n) is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient, whose value satisfies 0.9 < a < 1.0.
  • pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end. With the increase of the signal rate, the signal is greatly damaged in the transmission process.
  • the damaged signal needs to be compensated.
  • the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the beginning of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission.
  • Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
  • By pre-emphasizing the training voice data, the server can eliminate interference caused by the vocal cords and lips during the speaker's vocalization, effectively compensate the suppressed high-frequency part of the training voice data, highlight the high-frequency formants (resonance peaks), enhance the signal amplitude of the training voice data, and help extract the training voice features.
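  • As an illustrative sketch (not part of the original disclosure), the pre-emphasis step can be written in Python roughly as follows; the coefficient 0.97 is an assumed value within the stated range 0.9 < a < 1.0.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter s'(n) = s(n) - a * s(n-1)."""
    emphasized = np.empty_like(signal, dtype=np.float64)
    emphasized[0] = signal[0]                      # first sample has no predecessor
    emphasized[1:] = signal[1:] - a * signal[:-1]  # boost high-frequency components
    return emphasized
```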
  • S112 Perform frame processing on the pre-emphasized training voice data.
  • framed processing is performed on the pre-emphasized training voice data.
  • Framing refers to the speech processing technology that cuts the entire voice signal into several segments.
  • the size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length.
  • Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames.
  • Framed processing of the training voice data can divide the training voice data into several pieces of voice data, and the training voice data can be subdivided to facilitate the extraction of training voice features.
  • S113 Perform windowing processing on the training voice data after framing. After the training voice data is framed, discontinuities appear at the beginning and end of each frame, so the more frames there are, the greater the error with respect to the original signal.
  • the use of windowing can solve this problem, making the framed training speech data continuous, and making each frame exhibit the characteristics of a periodic function.
  • the windowing process specifically refers to the processing of training speech data using a window function.
  • the windowing function can select the Hamming window.
  • The formula for windowing is s'(n) = s(n) × (0.54 − 0.46·cos(2πn/(N−1))), 0 ≤ n ≤ N−1, where N is the Hamming window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • In steps S111-S113, pre-emphasis, framing, and windowing preprocessing are performed on the training voice data, which helps to extract the training voice features from the training voice data, so that the extracted training voice features can better represent the training voice data.
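  • A minimal sketch of the framing and Hamming-windowing steps is given below; the 25 ms frame length and 12.5 ms frame shift are illustrative values within the 10-30 ms frame and half-frame-shift ranges mentioned above, and the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 12.5) -> np.ndarray:
    """Cut the pre-emphasized signal into overlapping frames and apply a Hamming window."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    frame_shift = int(round(sample_rate * shift_ms / 1000.0))
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames                    # shape: (num_frames, frame_len)
```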
  • S12 Perform a fast Fourier transform (FFT) on the pre-processed training voice data to obtain the frequency spectrum of the training voice data, and obtain the power spectrum of the training voice data according to the frequency spectrum.
  • Specifically, performing the fast Fourier transform on the pre-processed training voice data includes the following process: first, the formula for calculating the frequency spectrum is applied to the pre-processed training voice data to obtain the frequency spectrum of the training voice data.
  • The formula for calculating the frequency spectrum is s(k) = Σ_{n=1..N} s(n)·e^(−2πi·nk/N), 1 ≤ k ≤ N, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time, and i is the imaginary unit.
  • a formula for calculating a power spectrum is used to calculate a spectrum of the acquired training voice data, and a power spectrum of the training voice data is obtained.
  • The formula for calculating the power spectrum is P(k) = |s(k)|²/N, 1 ≤ k ≤ N, where N is the frame size and s(k) is the signal amplitude in the frequency domain.
  • Through the fast Fourier transform, the training speech data is converted from signal amplitudes in the time domain into signal amplitudes in the frequency domain, and the power spectrum of the training speech data is then obtained from the frequency-domain signal amplitudes, so that the training speech features can be extracted from the power spectrum of the training speech data.
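  • The spectrum and power-spectrum computation can be sketched as follows; the 512-point FFT size is an assumption, and numpy's real FFT is used in place of the per-sample formula above.

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Return the per-frame power spectrum |s(k)|^2 / N of the windowed frames."""
    spectrum = np.fft.rfft(frames, n=n_fft)    # s(k): frequency-domain signal amplitude
    return (np.abs(spectrum) ** 2) / n_fft     # power spectrum, shape (num_frames, n_fft//2 + 1)
```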
  • S13 Use the Mel scale filter bank to process the power spectrum of the training speech data, and obtain the Mel power spectrum of the training speech data.
  • Processing the power spectrum of the training speech data with the Mel scale filter bank amounts to performing a Mel frequency analysis of the power spectrum.
  • the Mel frequency analysis is an analysis based on human auditory perception.
  • The human ear acts like a filter bank and only pays attention to certain specific frequency components (that is, human hearing is selective with respect to frequency); it only lets signals of certain frequencies pass through and directly ignores frequency signals it does not want to perceive.
  • The Mel scale filter bank includes multiple filters that are not uniformly distributed on the frequency axis: there are many filters in the low-frequency region and they are densely distributed, while in the high-frequency region the number of filters becomes smaller and the distribution is sparse.
  • the resolution of the Mel scale filter bank in the low frequency part is high, which is consistent with the hearing characteristics of the human ear, which is also the physical meaning of the Mel scale.
  • The frequency-domain signal is segmented by the Mel scale filter bank so that each frequency segment corresponds to one energy value; if the number of filters is 22, the Mel power spectrum of the training speech data will correspondingly consist of 22 energy values.
  • the acquired Mel power spectrum retains a frequency portion closely related to the characteristics of the human ear, and this frequency portion can well reflect the characteristics of the training speech data.
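  • A sketch of a 22-filter Mel scale filter bank is shown below (22 filters matches the example above; the FFT size and sample rate are assumptions). Multiplying the power spectrum by the transpose of this filter bank yields one energy value per filter, i.e. the Mel power spectrum.

```python
import numpy as np

def mel_filterbank(n_filters: int = 22, n_fft: int = 512,
                   sample_rate: int = 16000) -> np.ndarray:
    """Triangular filters spaced evenly on the Mel scale: dense at low, sparse at high frequencies."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bin_idx = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bin_idx[m - 1], bin_idx[m], bin_idx[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# mel_power = power_spec @ mel_filterbank().T   # one energy value per frame and per filter
```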
  • S14 Perform cepstrum analysis on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the training speech data, and determine the obtained Mel frequency cepstrum coefficient as the training speech feature.
  • Cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier-transform spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
  • Through cepstrum analysis, the features contained in the Mel power spectrum of the training speech data, whose original feature dimension is too high to be used directly, can be converted into training speech features that can be used directly in the model training process.
  • the training speech feature is the Mel frequency cepstrum coefficient.
  • In steps S11-S14, the training voice features are extracted based on the training voice data.
  • the training voice feature may specifically be a Mel frequency cepstrum coefficient, which can well reflect the training voice data.
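  • The cepstral analysis step can be sketched as a logarithm followed by a discrete cosine transform; keeping 13 coefficients (to which deltas and delta-deltas are usually appended to reach the 39 dimensions mentioned above) is an assumption for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def mel_power_to_mfcc(mel_power: np.ndarray, n_mfcc: int = 13) -> np.ndarray:
    """Cepstrum analysis of the Mel power spectrum: log followed by a type-II DCT."""
    log_mel = np.log(mel_power + 1e-10)                              # avoid log(0)
    return dct(log_mel, type=2, axis=-1, norm='ortho')[:, :n_mfcc]   # Mel frequency cepstrum coefficients
```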
  • the Universal Background Model is a Gaussian Mixture Model (GMM) that represents a large number of non-specific speaker voice feature distributions.
  • A Gaussian mixture model is a model that uses Gaussian probability density functions (normal distribution curves) to quantify things precisely, decomposing one thing into several Gaussian probability density functions.
  • the target background model is a model obtained by reducing the feature dimension of the general background model.
  • training a general background model based on the training voice feature can obtain a target background model.
  • The target background model represents the speech features of the training speech data in a lower feature dimension, so that when performing calculations related to the target background model (such as using the target background model to adaptively process the target speaker's speech data), the amount of calculation is greatly reduced and efficiency is improved.
  • step S20 obtaining the target background model based on the training speech features includes the following steps:
  • S21 Use the training speech features to train a general background model.
  • The expression of the general background model is a Gaussian probability density function: p(x) = Σ_{k=1..K} C_k·N(x; m_k, R_k), where x represents the training speech features, K represents the number of Gaussian distributions that make up the general background model, C_k represents the coefficient of the k-th mixture Gaussian, and N(x; m_k, R_k) represents a Gaussian distribution whose mean m_k is a D-dimensional vector and whose covariance R_k is a D×D diagonal matrix.
  • training the general background model is actually to find the parameters (C k , m k and R k ) in the expression.
  • Since the expression of the general background model is a Gaussian probability density function, the expectation-maximization algorithm (EM algorithm) may be employed to obtain the parameters (C_k, m_k and R_k) in the expression.
  • the EM algorithm is an iterative algorithm used to perform maximum likelihood estimation or maximum posterior probability estimation on a probability parameter model containing hidden variables.
  • hidden variables refer to unobservable random variables, but hidden variables can be inferred from samples of observable variables.
  • In the training process, the parameters cannot be observed directly (they are hidden), so the parameters in the universal background model are actually hidden variables.
  • the parameters in the universal background model can be obtained based on the maximum likelihood estimation or the maximum posterior probability estimation. After obtaining the parameters, the universal background model can be obtained.
  • the EM algorithm is a commonly used mathematical method for calculating the probability density function containing hidden variables, and the mathematical method is not described in detail here.
  • Obtaining the general background model provides an important basis for subsequently obtaining the target voiceprint feature recognition model based on the general background model when the target speaker's voice data is scarce or insufficient.
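  • As a hedged sketch of step S21, the universal background model can be trained with an off-the-shelf Gaussian mixture implementation that runs the EM algorithm internally; the number of Gaussians (64) is an assumption, and the diagonal covariance option corresponds to the diagonal matrices R_k described above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(training_features: np.ndarray, n_components: int = 64) -> GaussianMixture:
    """EM training of the universal background model on (num_frames, 39) MFCC vectors."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',   # diagonal covariance matrices R_k
                          max_iter=200)
    ubm.fit(training_features)                      # EM: estimate C_k, m_k, R_k
    return ubm                                      # weights_ ~ C_k, means_ ~ m_k, covariances_ ~ R_k
```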
  • S22 Use singular value decomposition to perform feature dimensionality reduction processing on the general background model to obtain the target background model.
  • In the expression of the general background model, p(x) = Σ_{k=1..K} C_k·N(x; m_k, R_k), x represents the training speech features, K represents the number of Gaussian distributions that make up the general background model, C_k represents the coefficient of the k-th mixture Gaussian, and N(x; m_k, R_k) represents a Gaussian distribution whose mean m_k is a D-dimensional vector and whose covariance R_k is a D×D diagonal matrix.
  • the general background model is represented by a Gaussian probability density function.
  • The covariance matrices R_k in the parameters of the general background model are represented as matrices, and singular value decomposition can be used to perform feature dimensionality reduction processing on the general background model so as to remove the noise data in the general background model.
  • Singular value decomposition (SVD) is an important matrix factorization in linear algebra; it is a generalization of the unitary diagonalization of normal matrices in matrix analysis and has important applications in signal processing and statistics.
  • In this embodiment, singular value decomposition is used to perform feature dimensionality reduction on the general background model.
  • The singular value decomposition of the covariance matrix is R_k = UΣV^T = Σ_i σ_i·u_i·v_i^T, where each coefficient σ_i on the right side of the equation is a singular value, Σ is a diagonal matrix, U is a square matrix whose column vectors u_i are orthogonal and are called the left singular vectors, V is a square matrix whose column vectors v_i are orthogonal and are called the right singular vectors, and T denotes matrix transposition. Each u_i·v_i^T is a matrix of rank 1, and the singular values satisfy σ_1 ≥ σ_2 ≥ … ≥ σ_n > 0.
  • A larger singular value indicates that the sub-item σ_i·u_i·v_i^T corresponding to that singular value represents a more important feature in R_k, while features whose singular values are smaller are considered less important features.
  • Therefore, feature dimensionality reduction can be performed on the matrices in the parameters of the general background model: the sub-items with smaller singular values are removed, reducing the general background model with a higher feature dimension to the target background model with a lower feature dimension.
  • This feature dimensionality reduction process does not weaken the ability of the general background model to express features, but actually enhances it, because the feature dimensions removed during singular value decomposition are those with small singular values, and these small-singular-value features are in fact the noise part introduced when training the general background model. Therefore, using singular value decomposition to perform feature dimensionality reduction on the general background model removes the feature dimensions representing the noise part and yields the target background model (the target background model is an optimized general background model; it can replace the original universal background model to adaptively process the target speaker's speech data and achieve better results).
  • The target background model expresses the speech features of the training speech data well in a lower feature dimension, so the amount of calculation is greatly reduced and efficiency is improved when performing calculations related to the target background model (such as using the target background model to adaptively process the target speaker's speech data).
  • In steps S21-S22, obtaining the general background model provides an important basis for subsequently obtaining the target voiceprint feature recognition model based on the general background model when the target speaker's voice data is scarce or insufficient, and applying the singular value decomposition feature reduction method to the general background model yields the target background model, which expresses the speech features of the training speech data in a lower feature dimension and improves efficiency in calculations related to the target background model.
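  • A minimal sketch of the singular-value-decomposition dimensionality reduction is given below; keeping the terms that account for 95% of the singular-value energy is an illustrative threshold, since the text only states that the sub-items with small singular values are treated as noise and removed.

```python
import numpy as np

def reduce_by_svd(R_k: np.ndarray, energy_kept: float = 0.95) -> np.ndarray:
    """Drop the sub-items sigma_i * u_i * v_i^T of R_k whose singular values are small."""
    u, s, vt = np.linalg.svd(R_k)                  # R_k = U diag(s) V^T, s sorted descending
    cumulative = np.cumsum(s) / np.sum(s)
    r = int(np.searchsorted(cumulative, energy_kept)) + 1   # number of terms to keep
    return u[:, :r] @ np.diag(s[:r]) @ vt[:r, :]   # lower-dimensional approximation of R_k
```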
  • S30 Obtain target voice data, and extract target voice features based on the target voice data.
  • the target voice data refers to voice data associated with a specific target user.
  • the target user is associated with a user ID, and the corresponding user can be uniquely identified by the user ID. Understandably, when it is necessary to train a target voiceprint feature recognition model or a target voice feature recognition model related to certain users, these users are the target users.
  • a user ID is an identifier that uniquely identifies a user.
  • target voice data is acquired.
  • the target voice data cannot be directly recognized by a computer and cannot be used for model training. Therefore, it is necessary to extract target speech features based on the target speech data, and convert the target speech data into target speech features that can be recognized by a computer.
  • the target speech feature may specifically be a Mel frequency cepstrum coefficient. For specific extraction processes, refer to S11-S14, and details are not described herein again.
  • S40 Use the target background model to adaptively process the target voice features to obtain the target voiceprint feature recognition model.
  • the target voiceprint feature recognition model refers to a voiceprint feature recognition model related to the target user.
  • the target voice data is difficult to obtain in some scenarios (for example, in a scenario where a bank or the like processes a service), so there are fewer data samples based on the target voice features provided by the target voice data.
  • the target voiceprint feature recognition model obtained by directly training the target voice features with few data samples has a very poor effect in the subsequent calculation of the target voiceprint features, and cannot reflect the voice (voiceprint) features of the target voice features. Therefore, in this embodiment, a target background model is required to adaptively process a target voice feature to obtain a corresponding target voiceprint feature recognition model, so that the accuracy of the target voiceprint feature recognition model obtained is higher.
  • the target background model is a Gaussian mixture model representing a large number of non-specific speech feature distributions.
  • In the adaptive processing, a large number of non-specific speech features in the target background model are adaptively added to the target speech features; this is equivalent to taking the part of the non-specific speech features in the target background model that is close to the target speech features and training it together with the target speech features, which can well "supplement" the target speech features for training the target voiceprint feature recognition model.
  • adaptive processing refers to a method of processing a part of non-specific speech features in the target background model that are close to the target speech features as target speech features.
  • The adaptive processing may specifically use a maximum a posteriori estimation algorithm (Maximum A Posteriori, referred to as MAP).
  • Maximum a posteriori estimation obtains an estimate of a quantity that is difficult to observe directly, on the basis of empirical data.
  • In maximum a posteriori estimation, the prior probability and Bayes' theorem are used to obtain the posterior probability.
  • Specifically, the objective function (that is, the expression representing the target voiceprint feature recognition model) is taken as the likelihood function of the posterior probability, and the parameter values at which the likelihood function is maximal are obtained (the gradient descent algorithm can be used to find the maximum of the likelihood function). In this way, the part of the non-specific speech features in the target background model that is close to the target speech features is in effect trained together with the target speech features, and the target voiceprint feature recognition model corresponding to the target speech features is obtained from the parameter values found when the likelihood function is maximized.
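  • The classic closed-form mean-only MAP adaptation shown below is a sketch of this adaptive processing (the original text maximizes the likelihood with gradient descent; the relevance factor of 16 is an assumption): the UBM means are pulled towards the scarce target speech features in proportion to how much target data each Gaussian actually sees.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, target_features: np.ndarray,
                    relevance: float = 16.0) -> np.ndarray:
    """MAP-adapt the UBM means towards the target speech features (mean-only adaptation)."""
    gamma = ubm.predict_proba(target_features)        # posteriors of each frame, shape (T, K)
    n_k = gamma.sum(axis=0) + 1e-10                   # soft frame count per Gaussian
    e_k = gamma.T @ target_features / n_k[:, None]    # per-Gaussian mean of the target data
    alpha = n_k / (n_k + relevance)                   # adaptation coefficient per Gaussian
    return alpha[:, None] * e_k + (1.0 - alpha[:, None]) * ubm.means_
```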
  • S50 Input target voice features into a deep neural network for training, and obtain a target voice feature recognition model.
  • the target speech feature recognition model refers to a speech feature recognition model related to the target user.
  • a deep neural network (DNN) model includes an input layer, a hidden layer, and an output layer composed of neurons.
  • the deep neural network model includes the weights and biases of each neuron connection between layers. These weights and biases determine the nature and recognition effect of the DNN model.
  • a target speech feature is input into a deep neural network model for training, and network parameters (ie weights and biases) of the deep neural network model are updated to obtain a target speech feature recognition model.
  • the target speech features include key speech features of the target speech data.
  • the target speech features are trained in a DNN model to further extract the features of the target speech data, and perform deep feature extraction based on the target speech features.
  • the deep features are expressed by network parameters in the target speech feature recognition model, and based on the extracted deep features, a more accurate recognition effect can be achieved when the target speech recognition model is subsequently used for recognition.
  • step S50 the target voice feature is input into a deep neural network for training, and the target voice feature recognition model is obtained, including the following steps:
  • the DNN model is initialized.
  • This initialization operation is to set initial values of weights and offsets in the DNN model.
  • The initial values may be set to small values, for example in the interval [-0.3, 0.3].
  • Reasonable initialization of the DNN model can make the DNN model have more flexible adjustment ability in the early stage, and the model can be adjusted effectively during the DNN model training process, so that the trained DNN model has a better recognition effect.
  • the target voice features are grouped into the deep neural network model, and the output value of the deep neural network model is obtained according to the forward propagation algorithm.
  • the target voice feature is first divided into a preset number of samples, and then grouped and input into the DNN model for training, that is, the grouped samples are respectively input into the DNN model for training.
  • The DNN forward propagation algorithm performs a series of linear operations and activation operations in the DNN model based on the weights W, biases b, and input values (vectors x_i) of each neuron; starting from the input layer, calculations are carried out layer by layer until the output layer produces its output value.
  • the output value of each layer of the network in the DNN model can be calculated until the output value of the output layer (that is, the output value of the DNN model) is calculated.
  • the activation function specifically used here may be a sigmoid or tanh activation function.
  • Forward propagation is performed layer by layer according to the number of layers to obtain the final output value a_{i,L} of the network in the DNN model (that is, the output value of the deep neural network model). The network parameters of the DNN model (the connection weights W and biases b of each neuron) can then be adjusted according to this output value in order to obtain a target speech feature recognition model that performs more accurate speech recognition.
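  • A sketch of the forward propagation described above is given below (layer sizes and the sigmoid activation are assumptions consistent with the text): each layer applies a linear operation followed by an activation, and the last activation is the output value a_{i,L}.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x: np.ndarray, weights: list, biases: list) -> list:
    """Layer-by-layer computation a^l = sigmoid(W^l a^{l-1} + b^l); returns all activations."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations            # activations[-1] is the DNN output value a_{i,L}
```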
  • S53 Perform error back propagation based on the output value of the deep neural network model, update the weights and offsets of each layer of the deep neural network model, and obtain the target speech feature recognition model.
  • The calculation formula for updating the weights is W^l = W^l − (α/m)·Σ_{i=1..m} δ^{i,l}·(a^{i,l−1})^T, where l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target speech features, and δ^{i,l} is the sensitivity of the current layer.
  • Specifically, a label value can be set in advance (the label value is compared with the output value a_{i,L} according to the actual situation to obtain an error value), the error generated when the target speech features are trained in the DNN model is calculated, a suitable error function is constructed based on the error (for example, an error function that uses the mean square error to measure the error), and error back-propagation is performed according to the error function to adjust and update the weight W and the bias b of each layer of the DNN model.
  • the back-propagation algorithm is used to update the weights W and offsets b of each layer of the DNN model, and the minimum value of the error function is calculated according to the back-propagation algorithm to optimize and update the weights W and offsets b of each layer of the DNN model.
  • Before training, the iteration step size of the model training is set to α, the maximum number of iterations to MAX, and the stop-iteration threshold to ε.
  • The sensitivity δ^{i,l} is a common factor that appears in every parameter update, so the error can be propagated by means of the sensitivity δ^{i,l} to update the network parameters in the DNN model.
  • When the changes of the weights W and biases b are all smaller than the stop-iteration threshold ε, the training can be stopped; alternatively, when the training reaches the maximum number of iterations MAX, the training is stopped.
  • Through the error back-propagation, the weights W and biases b of each layer of the DNN model are updated, so that the obtained target speech feature recognition model can perform speech recognition.
  • Steps S51-S53 train the DNN model by using the target speech features, so that the target speech feature recognition model obtained through training can recognize speech.
  • the target speech feature recognition model further extracts the deep features of the target speech feature during the model training process.
  • the trained weights and offsets in the model reflect the deep features based on the target speech feature. Therefore, the target speech feature recognition model can recognize based on the deep features learned through training, and achieve more accurate speech recognition.
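  • A single-sample sketch (m = 1) of the error back-propagation update is shown below; it uses the mean-square-error derivative and the sigmoid derivative a(1 − a), and plays the role of the weight and bias update described above.

```python
import numpy as np

def backward_update(activations: list, label: np.ndarray,
                    weights: list, biases: list, alpha: float) -> None:
    """One back-propagation pass updating W and b in place (single sample, step size alpha)."""
    a_out = activations[-1]
    delta = (a_out - label) * a_out * (1.0 - a_out)          # output-layer sensitivity
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        grad_W = np.outer(delta, a_prev)                     # dE/dW^l
        grad_b = delta                                       # dE/db^l
        if l > 0:                                            # propagate sensitivity backwards
            delta = (weights[l].T @ delta) * a_prev * (1.0 - a_prev)
        weights[l] -= alpha * grad_W
        biases[l] -= alpha * grad_b
```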
  • S60 Associate the target voiceprint feature recognition model and the target voice feature recognition model in a database.
  • the two models are associated and stored in a database. Specifically, the association between the models is performed through the user ID of the target user, and the target voiceprint feature recognition model and the target voice feature recognition model corresponding to the same user ID are stored in a database in the form of a file.
  • By storing the two models in association, the target voiceprint feature recognition model and the target voice feature recognition model corresponding to the user identifier can be called during the voice recognition stage, so that the two models can be combined for voice recognition, overcoming the errors introduced when each model performs recognition separately and further improving the accuracy of speech recognition.
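  • One way to realize the associated storage (a sketch only; the table and column names are assumptions) is to keep, for each user identifier, the file paths of the two trained models in the same database row:

```python
import sqlite3

def store_models(db_path: str, user_id: str,
                 voiceprint_model_file: str, speech_model_file: str) -> None:
    """Store the two model files for one user ID in association in a database."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS voice_models (
                        user_id TEXT PRIMARY KEY,
                        voiceprint_model_path TEXT,
                        speech_model_path TEXT)""")
    conn.execute("INSERT OR REPLACE INTO voice_models VALUES (?, ?, ?)",
                 (user_id, voiceprint_model_file, speech_model_file))
    conn.commit()
    conn.close()
```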
  • a target background model is obtained by using the extracted training speech features.
  • The target background model is obtained from the general background model using the singular value decomposition feature dimensionality reduction method; it expresses the speech features of the training speech data well in a lower feature dimension, which improves efficiency when performing calculations related to the target background model.
  • the target background model is used to adaptively process the extracted target speech features to obtain a voiceprint feature recognition model.
  • Because the target background model covers speech features of the training speech data in multiple dimensions, the target speech features, for which only a small amount of data may be available, can be adaptively supplemented through the target background model, so that the target voiceprint feature recognition model can still be obtained when the amount of data is small.
  • The target voiceprint feature recognition model can recognize voiceprint features that represent the target voice features in lower dimensions and thereby perform voice recognition. The target speech features are then input into the deep neural network for training to obtain the target speech feature recognition model, which deeply learns the target speech features and can perform speech recognition with high accuracy. Finally, the target voiceprint feature recognition model and the target voice feature recognition model are stored in the database in association, and the two models together form an overall voice model; this voice model organically combines the target voiceprint feature recognition model and the target voice feature recognition model, and using the overall voice model for speech recognition improves the accuracy of speech recognition.
  • FIG. 7 is a schematic diagram of a speech model training device that corresponds to the speech model training method in the embodiment.
  • the speech model training device includes a training speech feature extraction module 10, a target background model acquisition module 20, a target speech feature extraction module 30, a target voiceprint feature recognition model acquisition module 40, a speech feature recognition acquisition module 50, and Model storage module 60.
  • The functions implemented by the training speech feature extraction module 10, target background model acquisition module 20, target speech feature extraction module 30, target voiceprint feature recognition model acquisition module 40, speech feature recognition acquisition module 50, and model storage module 60 correspond one-to-one to the steps of the speech model training method in the embodiment; to avoid redundant description, this embodiment does not detail them one by one.
  • Training voice feature extraction module 10 configured to obtain training voice data, and extract training voice features based on the training voice data
  • a target background model acquisition module 20 configured to acquire a target background model based on the training speech features
  • a target voice feature extraction module 30 configured to obtain target voice data, and extract target voice features based on the target voice data
  • the target voiceprint feature recognition model acquisition module 40 is configured to adaptively process the target voice feature using the target background model to obtain the target voiceprint feature recognition model;
  • Speech feature recognition acquisition module 50 configured to input target speech features into a deep neural network for training, and obtain a target speech feature recognition model
  • the model storage module 60 is configured to store the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
  • the training speech feature extraction module 10 includes a preprocessing unit 11, a power spectrum acquisition unit 12, a Mel power spectrum acquisition unit 13, and a training speech feature determination unit 14.
  • the preprocessing unit 11 is configured to preprocess the training voice data.
  • a power spectrum obtaining unit 12 is configured to perform a fast Fourier transform on the pre-processed training voice data, obtain a frequency spectrum of the training voice data, and obtain a power spectrum of the training voice data according to the frequency spectrum.
  • the Mel power spectrum obtaining unit 13 is configured to process the power spectrum of the training speech data by using a Mel scale filter bank, and obtain a Mel power spectrum of the training speech data.
  • the training speech feature determining unit 14 is configured to perform cepstrum analysis on the Mel power spectrum, obtain Mel frequency cepstrum coefficients of training speech data, and determine the obtained Mel frequency cepstrum coefficients as training speech features.
  • the pre-processing unit 11 includes a pre-emphasis sub-unit 111, a frame sub-unit 112, and a windowing sub-unit 113.
  • the pre-emphasis sub-unit 111 is configured to perform pre-emphasis processing on the training voice data.
  • the frame sub-unit 112 is configured to perform frame processing on the pre-emphasized training voice data.
  • a windowing sub-unit 113 is configured to perform windowing processing on the framed processing speech data.
  • the target background model acquisition module 20 includes a general background model acquisition unit 21 and a target background model acquisition unit 22.
  • the universal background model obtaining unit 21 is configured to use the training voice feature to perform a universal background model training to obtain a universal background model.
  • the target background model obtaining unit 22 is configured to perform dimensionality reduction processing on the general background model by using singular value decomposition to obtain a target background model.
  • the speech feature recognition acquisition module 50 includes an initialization unit 51, an output value acquisition unit 52, and a target speech feature recognition model acquisition unit 53.
  • the initialization unit 51 is configured to initialize a deep neural network model.
  • An output value obtaining unit 52 is configured to group the target speech features into the deep neural network model, and obtain the output values of the deep neural network model according to the forward propagation algorithm.
  • That is, the i-th group of samples of the target speech features is input into the deep neural network model to obtain the corresponding output value.
  • FIG. 8 shows a flowchart of a speech recognition method in an embodiment.
  • the speech recognition method can be applied to the computer equipment of financial institutions such as banks, securities, investment, and insurance, or other institutions that need to perform speech recognition to achieve the purpose of speech recognition by artificial intelligence.
  • the computer device is a device that can perform human-computer interaction with a user, including, but not limited to, a computer, a smart phone, and a tablet.
  • the speech recognition method includes the following steps:
  • S71 Acquire speech data to be identified, and the speech data to be identified is associated with a user identifier.
  • the voice data to be identified refers to voice data of a user to be identified.
  • the user identifier is an identifier for uniquely identifying the user.
  • the user identifier may be an identifier that can uniquely identify the user, such as an ID card number or a phone number.
  • acquiring the voice data to be identified may be specifically collected through a recording module built in a computer device or an external recording device.
  • The voice data to be recognized is associated with the user identifier, so that it can be judged, on the basis of the models stored in association with that user identifier, whether the voice data to be recognized is the user's own voice.
  • S72 Query the database based on the user ID to obtain the target voiceprint feature recognition model and target voice feature recognition model that are stored in association.
  • The target voiceprint feature recognition model and the target voice feature recognition model are models obtained by the voice model training method provided in the foregoing embodiment.
  • a database is queried according to the user identifier, and a target voiceprint feature recognition model and a target voice feature recognition model associated with the user identifier are obtained in the database.
  • The target voiceprint feature recognition model and target voice feature recognition model stored in association are stored in the database in the form of files; after the database is queried, the model files corresponding to the user identifier are called, so that the computer device can use the target voiceprint feature recognition model and the target voice feature recognition model for voice recognition.
  • After the voice data to be recognized is acquired, it cannot be directly recognized by a computer and cannot be directly used for voice recognition. Therefore, the corresponding speech features to be recognized need to be extracted from the voice data to be recognized, converting the voice data to be recognized into speech features that a computer can recognize.
  • the feature of the speech to be recognized may specifically be a Mel frequency cepstrum coefficient, and the specific extraction process refers to S11-S14, which is not described in detail here.
  • S74 Input the speech feature to be recognized into the target speech feature recognition model, and obtain a first score.
  • Specifically, the speech features to be recognized are input into the target speech feature recognition model, and the model performs recognition and calculation on them to obtain the first score.
  • S75 Input the speech data to be recognized into the target voiceprint feature recognition model, and obtain a second score.
  • the voice data to be recognized is input into the target voiceprint feature recognition model for recognition.
  • Specifically, a similarity comparison (such as cosine similarity) is performed between the voiceprint features to be recognized and the target voiceprint features corresponding to the target voice features: the higher the similarity, the closer the voiceprint features to be recognized are to the target voiceprint features, and the more likely the speech is the user's own voice. The target voiceprint features corresponding to the target voice features used in training the target voiceprint feature recognition model can be computed in the same way as the voiceprint features to be recognized are obtained from the voice data to be recognized, and the cosine similarity between the voiceprint features to be recognized and the target voiceprint features is taken as the second score.
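  • The cosine-similarity comparison can be sketched as follows; the voiceprint vectors are assumed to have already been derived from the voice data to be recognized and from the target voiceprint feature recognition model respectively.

```python
import numpy as np

def cosine_second_score(voiceprint_to_identify: np.ndarray,
                        target_voiceprint: np.ndarray) -> float:
    """Cosine similarity between the voiceprint to be identified and the target voiceprint."""
    num = float(np.dot(voiceprint_to_identify, target_voiceprint))
    den = float(np.linalg.norm(voiceprint_to_identify) * np.linalg.norm(target_voiceprint))
    return num / den if den > 0.0 else 0.0    # closer to 1 means more likely the same user
```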
  • S76 Multiply the first score with a preset first weighted ratio to obtain a first weighted score, multiply the second score with a preset second weighted ratio, obtain a second weighted score, and sum the first weighted score and The second weighted scores are added to obtain the target score.
  • The weighting overcomes the respective shortcomings of the target voiceprint feature recognition model and the target voice feature recognition model in a targeted manner. Understandably, when the target voice feature recognition model is used to obtain the first score, the speech features to be recognized have a high dimension and therefore include some interfering speech features (such as noise), so there is a certain error between the first score and the actual result; when the target voiceprint feature recognition model is used to obtain the second score, the voiceprint features to be recognized have a low dimension, so some features that can represent the voice data to be recognized are inevitably lost, and the second score obtained by using that model alone also has a certain error with respect to the actual result.
  • Because of these errors in the first score and the second score, the first score is multiplied by a preset first weighted proportion to obtain a first weighted score, the second score is multiplied by a preset second weighted proportion to obtain a second weighted score, and the first weighted score and the second weighted score are added to obtain the target score, which is the final output score. In this way, the error of the first score and the error of the second score can be overcome; the two errors can be considered to cancel each other out, so that the target score is closer to the actual result and the accuracy of speech recognition is improved.
  • It is then judged whether the target score is greater than a preset score threshold. If the target score is greater than the preset score threshold, the speech data to be recognized is considered to be the target speech data corresponding to the user identifier, that is, the user's own speech data; if the target score is not greater than the preset score threshold, the speech data to be recognized is not considered to be the user's own speech data.
  • the preset score threshold refers to a preset threshold used to measure whether the speech data to be identified is target speech data corresponding to the user identifier, and the threshold is expressed in the form of a score. For example, if the preset score threshold is set to 0.95, the speech data to be recognized with a target score greater than 0.95 is the target speech data corresponding to the user identification, and the speech data to be recognized with a target score not greater than 0.95 is not considered to be the user's own corresponding Voice data.
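  • The weighted fusion and threshold decision can be sketched as follows; the weights 0.6 and 0.4 are illustrative assumptions, and 0.95 is the example threshold given above.

```python
def is_target_speaker(first_score: float, second_score: float,
                      w1: float = 0.6, w2: float = 0.4,
                      threshold: float = 0.95) -> bool:
    """Fuse the two scores with preset weights and compare against the score threshold."""
    target_score = first_score * w1 + second_score * w2
    return target_score > threshold    # True: the speech is judged to be the user's own
```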
  • In this embodiment, the extracted speech features to be recognized are input into the speech model, the first score related to the target speech feature recognition model and the second score related to the target voiceprint feature recognition model are obtained, the target score is obtained through a weighted operation, and the speech recognition result is obtained from the target score.
  • The first score reflects the probability of the speech recognition result from the higher-dimensional speech features to be recognized; because the dimension is high, some interfering speech features (such as noise) are included, so there is an error between the first score and the actual output that affects the speech recognition result.
  • The second score reflects the probability of the speech recognition result from the lower-dimensional voiceprint features; because the dimension is low, some key speech features are inevitably lost, so there is an error between the second score and the actual output that affects the speech recognition result.
  • The target score obtained by the weighted operation addresses the respective shortcomings of the target speech feature recognition model and the target voiceprint feature recognition model and overcomes the errors of the first score and the second score; the two errors can be considered to cancel each other out, so that the target score is closer to the actual result and the accuracy of speech recognition is improved.
  • FIG. 9 is a schematic diagram of a speech recognition device corresponding to the speech recognition method in the embodiment.
  • the voice recognition device includes a to-be-recognized voice data acquisition module 70, a model acquisition module 80, a to-be-recognized speech feature extraction module 90 and a first score acquisition module 100, a second score acquisition module 110, and a target score acquisition module. 120 and a voice determination module 130.
  • the realized functions of the to-be-recognized voice data acquisition module 70, model acquisition module 80, to-be-recognized voice feature extraction module 90, first score acquisition module 100, second score acquisition module 110, target score acquisition module 120, and voice determination module 130 The steps corresponding to the speech recognition method in the embodiment are one-to-one. To avoid redundant descriptions, this embodiment does not detail them one by one.
  • the to-be-recognized voice data acquisition module 70 is configured to obtain the to-be-recognized voice data, and the to-be-recognized voice data is associated with a user identifier.
  • the model acquisition module 80 is configured to query a database based on the user identifier to obtain the target voiceprint feature recognition model and the target voice feature recognition model stored in association with it.
  • the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by the speech model training method provided in the foregoing embodiment.
  • the to-be-recognized voice feature extraction module 90 is configured to extract the to-be-recognized voice features based on the to-be-recognized voice data.
  • the first score obtaining module 100 is configured to input a voice feature to be recognized into a target voice feature recognition model, and obtain a first score.
  • a second score obtaining module 110 is configured to input the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score.
  • the target score acquisition module 120 is configured to multiply the first score by a preset first weighting ratio to obtain a first weighted score, multiply the second score by a preset second weighting ratio to obtain a second weighted score, and add the first weighted score and the second weighted score to obtain the target score.
  • the voice determining module 130 is configured to determine that the voice data to be recognized is target voice data corresponding to a user identifier if the target score is greater than a preset score threshold.
  • This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the speech model training method in the embodiment. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions are executed, the one or more processors implement the functions of each module/unit of the speech model training device in the embodiment. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions are executed, the one or more processors implement the steps of the speech recognition method in the embodiment. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions are executed by the one or more processors, the functions of each module/unit of the speech recognition device in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the computer device 140 of this embodiment includes a processor 141, a memory 142, and computer-readable instructions 143 stored in the memory 142 and executable on the processor 141.
  • When the computer-readable instructions 143 are executed by the processor 141, the steps of the speech model training method in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions 143 are executed by the processor 141, the functions of each module/unit of the speech model training device in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions 143 are executed by the processor 141, the steps of the speech recognition method in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions 143 are executed by the processor 141, the functions of each module/unit of the speech recognition device in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • the computer device 140 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server.
  • the computer device may include, but is not limited to, a processor 141 and a memory 142.
  • FIG. 10 is only an example of the computer device 140 and does not constitute a limitation on it; the computer device 140 may include more or fewer components than shown in the figure, combine certain components, or have different components.
  • the computer device may also include input and output devices, network access devices, buses, and the like.
  • the processor 141 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the memory 142 may be an internal storage unit of the computer device 140, such as a hard disk or a memory of the computer device 140.
  • the memory 142 may also be an external storage device of the computer device 140, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 140.
  • the memory 142 may also include both an internal storage unit of the computer device 140 and an external storage device.
  • the memory 142 is used to store the computer-readable instructions 143 and other programs and data required by the computer device.
  • the memory 142 may also be used to temporarily store data that has been output or is to be output.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware or in the form of software functional units.

Abstract

A voice model training method, a voice recognition method, device and equipment, and a medium. The voice model training method comprises: acquiring training voice data, and extracting a training voice feature; acquiring a target background model on the basis of the training voice feature; acquiring target voice data, and extracting a target voice feature; carrying out adaptive processing on the target voice feature by using the target background model, thus acquiring a target voiceprint feature recognition model; inputting the target voice feature into a deep neural network for training, thus acquiring a target voice feature recognition model; and storing the target voiceprint feature recognition model and the target voice feature recognition model in a database in an associative manner.

Description

Speech model training method, speech recognition method, apparatus, device and medium
This application claims priority to Chinese Patent Application No. 201810551458.4, filed on May 31, 2018 and entitled "Speech Model Training Method, Speech Recognition Method, Apparatus, Device and Medium".
Technical field
The present application relates to the field of speech recognition technology, and in particular, to a speech model training method, a speech recognition method, an apparatus, a device, and a medium.
Background
At present, speech recognition is mostly performed on the basis of speech features. Some of these features have too high a dimension and contain too much non-critical information, while others have too low a dimension and cannot fully reflect the characteristics of the speech. As a result, the accuracy of current speech recognition is low and speech cannot be recognized effectively, which restricts the application of speech recognition.
Summary of the invention
The embodiments of the present application provide a speech model training method, apparatus, device, and medium, so as to solve the problem of low accuracy of current speech recognition.
A speech model training method includes:
acquiring training speech data, and extracting training speech features based on the training speech data;
acquiring a target background model based on the training speech features;
acquiring target speech data, and extracting target speech features based on the target speech data;
adaptively processing the target speech features using the target background model to obtain a target voiceprint feature recognition model;
inputting the target speech features into a deep neural network for training to obtain a target speech feature recognition model; and
storing the target voiceprint feature recognition model and the target speech feature recognition model in a database in association.
A speech model training apparatus includes:
a training speech feature extraction module, configured to acquire training speech data and extract training speech features based on the training speech data;
a target background model acquisition module, configured to acquire a target background model based on the training speech features;
a target speech feature extraction module, configured to acquire target speech data and extract target speech features based on the target speech data;
a target voiceprint feature recognition model acquisition module, configured to adaptively process the target speech features using the target background model to obtain a target voiceprint feature recognition model;
a speech feature recognition model acquisition module, configured to input the target speech features into a deep neural network for training to obtain a target speech feature recognition model; and
a model storage module, configured to store the target voiceprint feature recognition model and the target speech feature recognition model in a database in association.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:
acquiring training speech data, and extracting training speech features based on the training speech data;
acquiring a target background model based on the training speech features;
acquiring target speech data, and extracting target speech features based on the target speech data;
adaptively processing the target speech features using the target background model to obtain a target voiceprint feature recognition model;
inputting the target speech features into a deep neural network for training to obtain a target speech feature recognition model; and
storing the target voiceprint feature recognition model and the target speech feature recognition model in a database in association.
One or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
acquiring training speech data, and extracting training speech features based on the training speech data;
acquiring a target background model based on the training speech features;
acquiring target speech data, and extracting target speech features based on the target speech data;
adaptively processing the target speech features using the target background model to obtain a target voiceprint feature recognition model;
inputting the target speech features into a deep neural network for training to obtain a target speech feature recognition model; and
storing the target voiceprint feature recognition model and the target speech feature recognition model in a database in association.
The embodiments of the present application further provide a speech recognition method, apparatus, device, and medium, so as to solve the problem of low accuracy of current speech recognition.
A speech recognition method includes:
acquiring speech data to be recognized, the speech data to be recognized being associated with a user identifier;
querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target speech feature recognition model stored in association, the target voiceprint feature recognition model and the target speech feature recognition model being models obtained by the above speech model training method;
extracting speech features to be recognized based on the speech data to be recognized;
inputting the speech features to be recognized into the target speech feature recognition model to obtain a first score;
inputting the speech data to be recognized into the target voiceprint feature recognition model to obtain a second score;
multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score; and
if the target score is greater than a preset score threshold, determining that the speech data to be recognized is the target speech data corresponding to the user identifier.
A speech recognition apparatus includes:
a to-be-recognized speech data acquisition module, configured to acquire speech data to be recognized, the speech data to be recognized being associated with a user identifier;
a model acquisition module, configured to query a database based on the user identifier to obtain a target voiceprint feature recognition model and a target speech feature recognition model stored in association, the target voiceprint feature recognition model and the target speech feature recognition model being models obtained by the above speech model training method;
a to-be-recognized speech feature extraction module, configured to extract speech features to be recognized based on the speech data to be recognized;
a first score acquisition module, configured to input the speech features to be recognized into the target speech feature recognition model to obtain a first score;
a second score acquisition module, configured to input the speech data to be recognized into the target voiceprint feature recognition model to obtain a second score;
a target score acquisition module, configured to multiply the first score by a preset first weighting ratio to obtain a first weighted score, multiply the second score by a preset second weighting ratio to obtain a second weighted score, and add the first weighted score and the second weighted score to obtain a target score; and
a speech determination module, configured to determine, if the target score is greater than a preset score threshold, that the speech data to be recognized is the target speech data corresponding to the user identifier.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:
acquiring speech data to be recognized, the speech data to be recognized being associated with a user identifier;
querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target speech feature recognition model stored in association, the target voiceprint feature recognition model and the target speech feature recognition model being models obtained by the above speech model training method;
extracting speech features to be recognized based on the speech data to be recognized;
inputting the speech features to be recognized into the target speech feature recognition model to obtain a first score;
inputting the speech data to be recognized into the target voiceprint feature recognition model to obtain a second score;
multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score; and
if the target score is greater than a preset score threshold, determining that the speech data to be recognized is the target speech data corresponding to the user identifier.
One or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
acquiring speech data to be recognized, the speech data to be recognized being associated with a user identifier;
querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target speech feature recognition model stored in association, the target voiceprint feature recognition model and the target speech feature recognition model being models obtained by the above speech model training method;
extracting speech features to be recognized based on the speech data to be recognized;
inputting the speech features to be recognized into the target speech feature recognition model to obtain a first score;
inputting the speech data to be recognized into the target voiceprint feature recognition model to obtain a second score;
multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score; and
if the target score is greater than a preset score threshold, determining that the speech data to be recognized is the target speech data corresponding to the user identifier.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a diagram of an application environment of a speech model training method according to an embodiment of the present application;
FIG. 2 is a flowchart of a speech model training method according to an embodiment of the present application;
FIG. 3 is a specific flowchart of step S10 in FIG. 2;
FIG. 4 is a specific flowchart of step S11 in FIG. 3;
FIG. 5 is a specific flowchart of step S20 in FIG. 2;
FIG. 6 is a specific flowchart of step S50 in FIG. 2;
FIG. 7 is a schematic diagram of a speech model training apparatus according to an embodiment of the present application;
FIG. 8 is a flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
FIG. 1 shows the application environment of the speech model training method provided by an embodiment of the present application. The application environment includes a server and a client, which are connected through a network. The client refers to a program corresponding to the server that provides local services to the user; it is installed on a computer device capable of human-computer interaction, including but not limited to computers, smartphones, tablets, and other devices. The server can be implemented by an independent server or by a server cluster composed of multiple servers, and includes but is not limited to a file server, a database server, an application server, and a web server.
As shown in FIG. 2, FIG. 2 is a flowchart of the speech model training method in an embodiment of the present application. This embodiment is described by taking the application of the speech model training method on the server as an example. The speech model training method includes the following steps:
S10: Acquire training speech data, and extract training speech features based on the training speech data.
The training speech data is speech data used for training the target background model. The training speech data may be recordings of a large number of unidentified users collected by a recording module integrated in the computer device or a recording device connected to it, or an open-source speech training set available online may be used directly as the training speech data.
In this embodiment, the training speech data is acquired; it cannot be directly recognized by a computer and cannot be used directly to train the target background model. Therefore, training speech features need to be extracted from the training speech data, converting it into training speech features that a computer can recognize. The training speech features may specifically be Mel Frequency Cepstrum Coefficients (MFCC). The MFCC feature has 39 dimensions (represented as a vector) and can describe the training speech data well.
In an embodiment, as shown in FIG. 3, in step S10, extracting training speech features based on the training speech data includes the following steps:
S11: Preprocess the training speech data.
In this embodiment, the training speech data is preprocessed when the training speech features are extracted. Preprocessing the training speech data allows the training speech features to be extracted more effectively, so that the extracted features are more representative of the training speech data.
In an embodiment, as shown in FIG. 4, in step S11, preprocessing the training speech data includes the following steps:
S111: Perform pre-emphasis processing on the training speech data.
In this embodiment, the pre-emphasis is computed as s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain at the current moment, s_{n−1} is the signal amplitude at the previous moment, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0. Pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is heavily attenuated during transmission; to obtain a good signal waveform at the receiving end, the attenuated signal needs to be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the start of the transmission line to compensate for their excessive attenuation during transmission. Pre-emphasis has no effect on noise, so it effectively improves the output signal-to-noise ratio. By pre-emphasizing the training speech data, the server can remove interference caused by the vocal cords and lips during speech production, effectively compensate the suppressed high-frequency part of the training speech data, highlight the high-frequency formants, and strengthen the signal amplitude of the training speech data, which helps to extract the training speech features.
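A minimal NumPy sketch of the pre-emphasis formula s'_n = s_n − a·s_{n−1} might look as follows; the coefficient value 0.97 is an assumption within the 0.9–1.0 range stated above.

```python
import numpy as np

def pre_emphasize(signal, a=0.97):
    # s'_n = s_n - a * s_{n-1}; the first sample has no predecessor and is kept as-is.
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```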
S112: Perform framing on the pre-emphasized training speech data.
In this embodiment, the pre-emphasized training speech data is divided into frames. Framing is a speech processing technique that cuts the whole speech signal into several segments; each frame is 10–30 ms long, and the frame shift is about half a frame length. The frame shift is the overlapping region between two adjacent frames, which avoids excessive variation between adjacent frames. Framing divides the training speech data into several segments of speech data, subdividing it and facilitating the extraction of the training speech features.
S113: Perform windowing on the framed training speech data.
In this embodiment, the framed training speech data is windowed. After framing, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the deviation from the original signal. Windowing solves this problem: it makes the framed training speech data continuous and allows each frame to exhibit the characteristics of a periodic function. Windowing refers to processing the training speech data with a window function; a Hamming window may be chosen, in which case the windowing formula is
s'_n = s_n × (0.54 − 0.46·cos(2πn/(N − 1))), 0 ≤ n ≤ N − 1,
where N is the Hamming window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing. By windowing the training speech data, the server makes the framed training speech data continuous in the time domain, which helps to extract the training speech features.
In steps S111–S113, the training speech data is preprocessed by pre-emphasis, framing, and windowing, which helps to extract the training speech features from the training speech data so that the extracted features are more representative of it.
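Framing with roughly half-frame overlap and Hamming windowing could be sketched as below. The 25 ms frame length and 10 ms shift are assumptions, not values fixed by the text (which only requires 10–30 ms frames and about a half-frame shift), and the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    shift_len = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
    # Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift_len: i * shift_len + frame_len]
                       for i in range(n_frames)])
    return frames * window
```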
S12: Perform a fast Fourier transform on the preprocessed training speech data to obtain the spectrum of the training speech data, and obtain the power spectrum of the training speech data from the spectrum.
The fast Fourier transform (FFT) is the collective name for efficient, fast computer algorithms for computing the discrete Fourier transform. It greatly reduces the number of multiplications the computer needs; the more sampling points are transformed, the more significant the savings of the FFT algorithm.
In this embodiment, performing the fast Fourier transform on the preprocessed training speech data specifically includes the following process. First, the spectrum of the training speech data is computed from the preprocessed data using the spectrum formula
s(k) = Σ_{n=1}^{N} s(n)·e^{−2πi·kn/N}, 1 ≤ k ≤ N,
where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time index, and i is the imaginary unit. Then the power spectrum of the training speech data is computed from the obtained spectrum using the power spectrum formula
P(k) = |s(k)|² / N, 1 ≤ k ≤ N,
where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the training speech data from signal amplitudes in the time domain to signal amplitudes in the frequency domain, and then obtaining the power spectrum from the frequency-domain amplitudes, provides an important technical premise for extracting the training speech features from the power spectrum of the training speech data.
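The spectrum and power-spectrum computation can be sketched with NumPy's FFT; the 512-point FFT size is an assumption.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    # s(k): frequency-domain amplitudes of each windowed frame.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    # Power spectrum estimate |s(k)|^2 / N for each frame.
    return (np.abs(spectrum) ** 2) / n_fft
```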
S13: Process the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data.
Processing the power spectrum of the training speech data with a Mel-scale filter bank is a Mel-frequency analysis of the power spectrum, and Mel-frequency analysis is based on human auditory perception. Observation shows that the human ear behaves like a filter bank and only pays attention to certain frequency components (human hearing is selective with respect to frequency); that is, the ear only lets signals of certain frequencies pass and simply ignores certain frequencies it does not want to perceive. Specifically, the Mel-scale filter bank contains multiple filters that are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and more sparsely distributed. Understandably, the Mel-scale filter bank has high resolution in the low-frequency part, which matches the auditory characteristics of the human ear; this is also the physical meaning of the Mel scale. The frequency-domain signal is segmented with the Mel-scale filter bank so that each frequency band finally corresponds to one energy value; if the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the training speech data are obtained. Through the Mel-frequency analysis of the power spectrum, the obtained Mel power spectrum retains the frequency parts that are closely related to the characteristics of the human ear, and these frequency parts reflect the characteristics of the training speech data well.
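A Mel-scale filter bank with triangular filters spaced evenly on the Mel scale could be built as in the sketch below. The 22 filters match the example above, while the 16 kHz sample rate and the standard Mel conversion 2595·log10(1 + f/700) are assumptions the text does not spell out. The energies would then be obtained as `power_frames @ fbank.T`.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=512, sample_rate=16000):
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter centre frequencies: evenly spaced on the Mel scale, dense at low frequencies.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):        # rising edge of the triangular filter
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):       # falling edge of the triangular filter
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank
```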
S14: Perform cepstral analysis on the Mel power spectrum to obtain the Mel frequency cepstrum coefficients of the training speech data, and determine the obtained Mel frequency cepstrum coefficients as the training speech features.
The cepstrum is the inverse Fourier transform of the logarithm of the Fourier transform spectrum of a signal; since the Fourier spectrum is generally a complex spectrum, the cepstrum is also called the complex cepstrum. Through cepstral analysis on the Mel power spectrum, the features contained in the Mel power spectrum of the training speech data, whose original dimension is too high to be used directly, are converted into training speech features that can be used directly in the model training process; these training speech features are the Mel frequency cepstrum coefficients.
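Cepstral analysis on the Mel power spectrum is conventionally implemented as a discrete cosine transform of the log filter-bank energies; a sketch under that assumption (keeping 13 coefficients, which together with delta and delta-delta features would give the 39 dimensions mentioned earlier) is:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel_energies(mel_energies, n_ceps=13):
    # mel_energies: (n_frames, n_filters) from applying the Mel filter bank
    # to the power spectrum of each frame.
    log_mel = np.log(mel_energies + 1e-10)           # avoid log(0)
    # Type-II DCT along the filter axis plays the role of the inverse transform
    # in the cepstral analysis; keep the first n_ceps coefficients.
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```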
In steps S11–S14, the training speech features are extracted from the training speech data. The training speech features may specifically be Mel frequency cepstrum coefficients, which reflect the training speech data well.
S20: Acquire a target background model based on the training speech features.
The Universal Background Model (UBM) is a Gaussian Mixture Model (GMM) that represents the speech feature distribution of a large number of non-specific speakers. Because UBM training usually uses a large amount of speaker-independent and channel-independent speech data, the UBM can generally be regarded as a model unrelated to any specific speaker; it only fits the distribution of human speech features and does not represent a specific speaker. A Gaussian mixture model quantifies things precisely with Gaussian probability density functions (normal distribution curves), decomposing one thing into several models based on Gaussian probability density functions. The target background model is the model obtained from the universal background model after feature dimensionality reduction.
In this embodiment, after acquiring the training speech features (such as MFCC features), the universal background model is trained based on the training speech features to obtain the target background model. Compared with the universal background model, the target background model exhibits the speech features of the training speech data well with a lower feature dimension, and calculations related to the target background model (such as using it to adaptively process the target speaker's speech data) require far less computation and are more efficient.
In an embodiment, as shown in FIG. 5, in step S20, acquiring the target background model based on the training speech features includes the following steps:
S21: Perform universal background model training with the training speech features to obtain a universal background model.
In this embodiment, the universal background model is trained with the training speech features. The universal background model is expressed as a mixture of Gaussian probability density functions:
P(x) = Σ_{k=1}^{K} C_k · N(x; m_k, R_k),
where x denotes a training speech feature, K denotes the number of Gaussian distributions composing the universal background model, C_k denotes the coefficient of the k-th mixture Gaussian, and N(x; m_k, R_k) denotes a Gaussian distribution with a D-dimensional mean vector m_k and a D×D diagonal covariance matrix R_k. From this expression it can be seen that training the universal background model actually amounts to finding the parameters (C_k, m_k and R_k) in the expression. Since the expression is a Gaussian probability density function, the Expectation Maximization algorithm (EM algorithm) can be used to find these parameters. The EM algorithm is an iterative algorithm for maximum likelihood estimation or maximum a posteriori estimation of probabilistic parameter models containing latent variables. In statistics, a latent variable is an unobservable random variable about which inferences can nevertheless be made from samples of observable variables; in the training of the universal background model, the training process is unobservable (hidden), so the parameters of the universal background model are in fact latent variables. With the EM algorithm, the parameters of the universal background model can be found based on maximum likelihood or maximum a posteriori estimation, and once the parameters are found, the universal background model is obtained. The EM algorithm is a common mathematical method for computing probability density functions containing latent variables and is not described further here. Obtaining the universal background model provides an important basis for subsequently obtaining the corresponding target voiceprint feature recognition model when the target speaker's speech data is scarce or insufficient.
S22: Perform feature dimensionality reduction on the universal background model by singular value decomposition to obtain the target background model.
From the expression of the universal background model,
P(x) = Σ_{k=1}^{K} C_k · N(x; m_k, R_k),
where x denotes a training speech feature, K denotes the number of Gaussian distributions composing the universal background model, C_k denotes the coefficient of the k-th mixture Gaussian, and N(x; m_k, R_k) denotes a Gaussian distribution with a D-dimensional mean vector m_k and a D×D diagonal covariance matrix R_k, it can be seen that the universal background model is represented by Gaussian probability density functions and that the covariance matrix R_k among its parameters is represented as a matrix. Therefore, singular value decomposition can be used to perform feature dimensionality reduction on the universal background model and remove the noise data in it. Singular value decomposition is an important matrix factorization in linear algebra; it is a generalization of the unitary diagonalization of normal matrices in matrix analysis and has important applications in fields such as signal processing and statistics.
In this embodiment, singular value decomposition is used to reduce the feature dimension of the universal background model. Specifically, the matrix corresponding to the parameter covariance matrix R_k in the universal background model is decomposed by singular value decomposition, expressed as R_k = σ_1·u_1·v_1^T + σ_2·u_2·v_2^T + ... + σ_n·u_n·v_n^T, where the coefficient σ before each term on the right-hand side is a singular value (the σ form a diagonal matrix), u is a square matrix whose vectors are orthogonal (the left singular matrix), v is a square matrix whose vectors are orthogonal (the right singular matrix), and T denotes matrix transposition. In this equation each u·v^T is a matrix of rank 1, and the singular values satisfy σ_1 ≥ σ_2 ≥ ... ≥ σ_n > 0. Understandably, the larger a singular value is, the more important the feature represented by its term σ·u·v^T is in R_k, and features with smaller singular values are regarded as less important. When training the universal background model, the influence of noise data is inevitable, so the trained universal background model not only has a high feature dimension but is also not objective and accurate enough. Singular value decomposition can reduce the feature dimension of the matrices in the universal background model parameters, reducing the universal background model with a higher feature dimension to a target background model with a lower feature dimension and removing the terms with smaller singular values. It should be noted that this dimensionality reduction does not weaken the ability of the features to express the universal background model; it actually strengthens it, because the feature dimensions removed in the singular value decomposition are those with relatively small σ, and these small-σ features are in fact the noise part introduced when training the universal background model. Therefore, applying singular value decomposition to the universal background model removes the feature dimensions represented by the noise part and yields the target background model (an optimized universal background model that can replace the original one for adaptively processing the target speaker's speech data and achieve better results). The target background model exhibits the speech features of the training speech data well with a lower feature dimension, and calculations related to the target background model (such as adaptively processing the target speaker's speech data with it) require far less computation and are more efficient.
In steps S21–S22, obtaining the universal background model provides an important basis for subsequently obtaining the corresponding target voiceprint feature recognition model when the target speaker's speech data is scarce or insufficient, and the target background model is obtained by applying the singular-value-decomposition-based feature dimensionality reduction to the universal background model. The target background model exhibits the speech features of the training speech data well with a lower feature dimension, which improves efficiency in calculations related to the target background model.
S30: Acquire target speech data, and extract target speech features based on the target speech data.
The target speech data refers to speech data associated with a specific target user. The target user is associated with a user identifier, and the corresponding user can be uniquely identified by the user identifier. Understandably, when target voiceprint feature recognition models or target speech feature recognition models related to certain users need to be trained, these users are the target users. The user identifier is an identifier used to uniquely identify a user.
In this embodiment, the target speech data is acquired; it cannot be directly recognized by a computer and cannot be used for model training. Therefore, target speech features need to be extracted from the target speech data, converting it into target speech features that a computer can recognize. The target speech features may specifically be Mel frequency cepstrum coefficients; for the specific extraction process, see steps S11–S14, which are not repeated here.
S40: Adaptively process the target speech features with the target background model to obtain a target voiceprint feature recognition model.
The target voiceprint feature recognition model refers to the voiceprint feature recognition model related to the target user.
In this embodiment, target speech data is difficult to obtain in some scenarios (such as when handling business at a bank), so there are relatively few data samples of target speech features derived from the target speech data. A target voiceprint feature recognition model trained directly on such a small number of target speech feature samples performs very poorly when subsequently computing target voiceprint features and cannot reflect the speech (voiceprint) characteristics of the target speech features. Therefore, in this embodiment, the target background model is used to adaptively process the target speech features and obtain the corresponding target voiceprint feature recognition model, so that the obtained model is more accurate. The target background model is a Gaussian mixture model representing the distribution of a large number of non-specific speech features; adaptively adding a large number of non-specific speech features from the target background model to the target speech features is equivalent to training part of the non-specific speech features of the target background model together with the target speech features, which can well "supplement" the target speech features for training the target voiceprint feature recognition model.
Adaptive processing here refers to a method of treating the part of the non-specific speech features in the target background model that is close to the target speech features as target speech features; it can specifically be implemented with the Maximum A Posteriori (MAP) estimation algorithm. Maximum a posteriori estimation obtains an estimate of a quantity that is difficult to observe from empirical data; during estimation, the prior probability and Bayes' theorem are used to obtain the posterior probability. The objective function (the expression representing the target voiceprint feature recognition model) is the likelihood function of the posterior probability, and the parameter values that maximize this likelihood function are found (a gradient descent algorithm can be used to find the maximum of the likelihood function). This achieves the effect of training the part of the non-specific speech features in the target background model that is close to the target speech features together with the target speech features, and the target voiceprint feature recognition model corresponding to the target speech features is obtained from the parameter values that maximize the likelihood function.
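As an illustration of MAP adaptation, the classical relevance-MAP update of the mixture means is sketched below on top of a fitted scikit-learn GMM. This is a common realisation of the idea described above, not necessarily the exact procedure of this embodiment, and the relevance factor value is an assumption.

```python
import numpy as np

def map_adapt_means(ubm, target_features, relevance=16.0):
    # ubm: a fitted sklearn GaussianMixture acting as the target background model.
    # target_features: (n_frames, dim) speech features of the target user.
    post = ubm.predict_proba(target_features)            # responsibilities per component
    n_k = post.sum(axis=0)                                # soft frame counts per component
    e_k = (post.T @ target_features) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]            # data-dependent adaptation weight
    # New mean: interpolate between target-data statistics and the UBM means.
    return alpha * e_k + (1.0 - alpha) * ubm.means_
```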
S50: Input the target speech features into a deep neural network for training to obtain a target speech feature recognition model.
The target speech feature recognition model refers to the speech feature recognition model related to the target user. A Deep Neural Network (DNN) model includes an input layer, hidden layers, and an output layer composed of neurons. The deep neural network model includes the weights and biases of the connections between neurons in adjacent layers; these weights and biases determine the properties and recognition effect of the DNN model.
In this embodiment, the target speech features are input into the deep neural network model for training, the network parameters (i.e., the weights and biases) of the deep neural network model are updated, and the target speech feature recognition model is obtained. The target speech features include the key speech features of the target speech data. By training on the target speech features in the DNN model, the features of the target speech data are further extracted, and deep features are extracted on the basis of the target speech features. These deep features are expressed by the network parameters in the target speech feature recognition model; based on the extracted deep features, a more accurate recognition effect can be achieved when the target speech feature recognition model is subsequently used for recognition.
在一实施例中,如图6所示,步骤S50中,将目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型,包括如下步骤:In an embodiment, as shown in FIG. 6, in step S50, the target voice feature is input into a deep neural network for training, and the target voice feature recognition model is obtained, including the following steps:
S51:初始化深度神经网络模型。S51: Initialize a deep neural network model.
本实施例中,初始化DNN模型,该初始化操作即设置DNN模型中权值和偏置的初始值,该初始值可以设置为较小的值,如设置在区间[-0.3-0.3]之间。合理的初始化DNN模型可以使DNN模型在初期有较 灵活的调整能力,可以在DNN模型训练过程中对模型进行有效的调整,使得训练出的DNN模型识别效果较好。In this embodiment, the DNN model is initialized. This initialization operation is to set initial values of weights and offsets in the DNN model. The initial value may be set to a smaller value, such as between [-0.3-0.3]. Reasonable initialization of the DNN model can make the DNN model have more flexible adjustment ability in the early stage, and the model can be adjusted effectively during the DNN model training process, so that the trained DNN model has a better recognition effect.
S52:将目标语音特征分组输入到深度神经网络模型中,根据前向传播算法获取深度神经网络模型的输出值,目标语音特征的第i组样本在深度神经网络模型的当前层的输出值用公式表示为a^{i,l}=σ(W^l·a^{i,l-1}+b^l),其中,a为输出值,i表示输入的目标语音特征的第i组样本,l为深度神经网络模型的当前层,σ为激活函数,W为权值,l-1为深度神经网络模型的当前层的上一层,b为偏置。S52: Group the target speech features and input them into the deep neural network model, and obtain the output value of the deep neural network model according to the forward propagation algorithm; the output value of the i-th group of samples of the target speech features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l·a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target speech features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer above the current layer of the deep neural network model, and b is the bias.
本实施例中,先将目标语音特征分成预设组数的样本,再分组输入到DNN模型中进行训练,即把分组后的样本分别输入到DNN模型进行训练。DNN的前向传播算法是根据DNN模型中连接各个神经元的权值W,偏置b和输入值(向量x i)在DNN模型中进行的一系列线性运算和激活运算,从输入层开始,一层层运算,一直运算到输出层,得到输出层的输出值为止。根据前向传播算法可以计算DNN模型中网络每一层的输出值,直至算到输出层的输出值(即DNN模型的输出值)。 In this embodiment, the target voice feature is first divided into a preset number of samples, and then grouped and input into the DNN model for training, that is, the grouped samples are respectively input into the DNN model for training. The DNN's forward propagation algorithm is a series of linear operations and activation operations performed in the DNN model based on the weights W, bias b, and input values (vector x i ) of each neuron in the DNN model, starting from the input layer, Layer by layer calculations are performed until the output layer gets the output value of the output layer. According to the forward propagation algorithm, the output value of each layer of the network in the DNN model can be calculated until the output value of the output layer (that is, the output value of the DNN model) is calculated.
具体地,设DNN模型的总层数为L,DNN模型中连接各个神经元的权值W、偏置b和输入值向量x^i,输出层的输出值为a^{i,L}(i表示输入的目标语音特征的第i组样本),则a^1=x^i(第一层的输出为在输入层输入的目标语音特征,即输入值向量x^i),根据前向传播算法可知输出a^{i,l}=σ(W^l·a^{i,l-1}+b^l),其中,l表示深度神经网络模型的当前层,σ为激活函数,这里具体采用的激活函数可以是sigmoid或者tanh激活函数。根据上述计算a^{i,l}的公式按层数逐层进行前向传播,获取DNN模型中网络最终的输出值a^{i,L}(即深度神经网络模型的输出值),有了输出值a^{i,L}即可以根据该输出值对DNN模型中的网络参数(连接各个神经元的权值W,偏置b)进行调整,以获取语音识别能力较准确的目标语音特征识别模型。Specifically, let the total number of layers of the DNN model be L, let W and b denote the weights and biases connecting the neurons in the DNN model, let x^i denote the input value vector, and let a^{i,L} denote the output value of the output layer (i denotes the i-th group of samples of the input target speech features). Then a^1 = x^i (the output of the first layer is the target speech feature fed to the input layer, i.e., the input value vector x^i), and according to the forward propagation algorithm the output is a^{i,l} = σ(W^l·a^{i,l-1} + b^l), where l denotes the current layer of the deep neural network model and σ is the activation function; the activation function used here may be the sigmoid or tanh function. Forward propagation is performed layer by layer according to the above formula for a^{i,l} until the final output value a^{i,L} of the network in the DNN model (i.e., the output value of the deep neural network model) is obtained. With the output value a^{i,L}, the network parameters of the DNN model (the weights W and biases b connecting the neurons) can be adjusted according to a^{i,L} to obtain a target speech feature recognition model with relatively accurate speech recognition capability.
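A minimal sketch of the forward propagation a^{i,l} = σ(W^l·a^{i,l-1} + b^l) described above; the sigmoid activation and the column-vector layout are assumptions.

```python
# Illustrative sketch of forward propagation through the DNN layers; sigmoid is an
# assumed choice of activation function (the text also allows tanh).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Propagate one input column vector x through all layers and return the
    activations a^{i,l} and pre-activations z^{i,l} of every layer."""
    activations, pre_activations = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b            # z^{i,l} = W^l a^{i,l-1} + b^l
        a = sigmoid(z)           # a^{i,l} = sigma(z^{i,l})
        pre_activations.append(z)
        activations.append(a)
    return activations, pre_activations
```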
S53:基于深度神经网络模型的输出值进行误差反传,更新深度神经网络模型各层的权值和偏置,获取目标语音特征识别模型,其中,更新权值的计算公式为W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T,l为深度神经网络模型的当前层,W为权值,α为迭代步长,m为输入的目标语音特征的样本总数,δ^{i,l}为当前层的灵敏度;δ^{i,l}=(W^{l+1})^T·δ^{i,l+1}∘σ'(z^{i,l}),z^{i,l}=W^l·a^{i,l-1}+b^l,a^{i,l-1}为上一层的输出,T表示矩阵转置运算,∘表示两个矩阵对应元素相乘的运算(Hadamard积),更新偏置的计算公式为b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}。S53: Perform error back propagation based on the output value of the deep neural network model, and update the weights and biases of each layer of the deep neural network model to obtain the target speech feature recognition model, where the formula for updating the weights is W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T, in which l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target speech features, and δ^{i,l} is the sensitivity of the current layer; δ^{i,l} = (W^{l+1})^T·δ^{i,l+1} ∘ σ'(z^{i,l}), z^{i,l} = W^l·a^{i,l-1} + b^l, a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of two matrices (Hadamard product); the formula for updating the biases is b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}.
本实施例中,在根据前向传播算法获取DNN模型的输出值a^{i,L}后,可以根据a^{i,L}与预先设置好标签值(该标签值是根据实际情况设置的用于与输出值进行比较,获取误差的值)的目标语音特征,计算目标语音特征在该DNN模型中训练时产生的误差,并根据该误差构建合适的误差函数(如采用均方差来度量误差的误差函数),根据误差函数进行误差反传,以调整更新DNN模型各层的权值W和偏置b。In this embodiment, after the output value a^{i,L} of the DNN model is obtained according to the forward propagation algorithm, the error produced when the target speech features are trained in the DNN model can be calculated from a^{i,L} and the target speech features with preset label values (the label values are set according to the actual situation and are used for comparison with the output values to obtain the error). A suitable error function is then constructed from this error (for example, an error function that measures the error by the mean squared error), and error back propagation is performed according to the error function to adjust and update the weights W and biases b of each layer of the DNN model.
更新DNN模型各层的权值W和偏置b采用的是后向传播算法,根据后向传播算法求误差函数的极小值,以优化更新DNN模型各层的权值W和偏置b,获取目标语音特征识别模型。具体地,设置模型训练的迭代步长为α,最大迭代次数MAX与停止迭代阈值∈。在后向传播算法中,灵敏度δ^{i,l}是每次更新参数都会出现的公共因子,因此可以借助灵敏度δ^{i,l}计算误差,以更新DNN模型中的网络参数。已知a^1=x^i(第一层的输出为在输入层输入的目标语音特征,即输入值向量x^i),则先求出输出层L的灵敏度δ^{i,L},δ^{i,L}=(a^{i,L}-y^i)∘σ'(z^{i,L}),z^{i,l}=W^l·a^{i,l-1}+b^l,其中i表示输入的目标语音特征的第i组样本,y为标签值(即用来与输出值a^{i,L}相比较的值),∘表示两个矩阵对应元素相乘的运算(Hadamard积)。再根据δ^{i,L}求出深度神经网络模型的第l层的灵敏度δ^{i,l},根据后向传播算法可以计算得出δ^{i,l}=(W^{l+1})^T·δ^{i,l+1}∘σ'(z^{i,l}),得到深度神经网络模型的第l层的灵敏度δ^{i,l}后,即可更新DNN模型各层的权值W和偏置b,更新后的权值为W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T,更新后的偏置为b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l},其中,α为模型训练的迭代步长,m为输入的目标语音特征的样本总数,T表示矩阵转置运算。当所有W和b的变化值都小于停止迭代阈值∈时,即可停止训练;或者,训练达到最大迭代次数MAX时,停止训练。通过目标语音特征在DNN模型中的输出值和预先设置好的标签值之间产生的误差,能够实现DNN模型各层的权值W和偏置b的更新,使得获取的目标语音特征识别模型能够进行语音识别。The weights W and biases b of each layer of the DNN model are updated by the back-propagation algorithm: the minimum of the error function is sought according to the back-propagation algorithm so as to optimize and update the weights W and biases b of each layer of the DNN model and obtain the target speech feature recognition model. Specifically, the iteration step size of model training is set to α, together with the maximum number of iterations MAX and the stop-iteration threshold ∈. In the back-propagation algorithm, the sensitivity δ^{i,l} is a common factor that appears in every parameter update, so the error can be calculated by means of δ^{i,l} to update the network parameters in the DNN model. Given a^1 = x^i (the output of the first layer is the target speech feature fed to the input layer, i.e., the input value vector x^i), the sensitivity of the output layer L is first obtained as δ^{i,L} = (a^{i,L} - y^i) ∘ σ'(z^{i,L}), with z^{i,l} = W^l·a^{i,l-1} + b^l, where i denotes the i-th group of samples of the input target speech features, y is the label value (i.e., the value compared with the output value a^{i,L}), and ∘ denotes element-wise multiplication of two matrices (Hadamard product). The sensitivity of the l-th layer of the deep neural network model is then obtained from δ^{i,L}: according to the back-propagation algorithm, δ^{i,l} = (W^{l+1})^T·δ^{i,l+1} ∘ σ'(z^{i,l}). Once the sensitivity δ^{i,l} of the l-th layer is obtained, the weights W and biases b of each layer of the DNN model can be updated; the updated weight is W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T and the updated bias is b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}, where α is the iteration step size of model training, m is the total number of samples of the input target speech features, and T denotes the matrix transposition operation. Training stops when all changes of W and b are smaller than the stop-iteration threshold ∈, or when the maximum number of iterations MAX is reached. Through the error between the output values of the target speech features in the DNN model and the preset label values, the weights W and biases b of each layer of the DNN model can be updated, so that the obtained target speech feature recognition model is able to perform speech recognition.
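A minimal sketch of the sensitivity-based back-propagation update described above, reusing the forward() helper from the sketch after step S52; the mean-squared-error comparison with label values and the sigmoid derivative are assumptions.

```python
# Illustrative sketch of one back-propagation update of W^l and b^l over a batch,
# using the sensitivities delta^{i,l} from the text (sigmoid derivative assumed).
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backprop_update(weights, biases, batch, alpha=0.1):
    """batch is a list of (x, y) column-vector pairs; alpha is the iteration step size."""
    m = len(batch)
    grad_W = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        activations, zs = forward(weights, biases, x)          # forward() from the earlier sketch
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])   # output-layer sensitivity
        grad_W[-1] += delta @ activations[-2].T
        grad_b[-1] += delta
        for l in range(2, len(weights) + 1):                    # propagate sensitivity backwards
            delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
            grad_W[-l] += delta @ activations[-l - 1].T
            grad_b[-l] += delta
    # W^l <- W^l - (alpha/m) * sum_i delta^{i,l} (a^{i,l-1})^T ; likewise for b^l.
    new_W = [W - (alpha / m) * gW for W, gW in zip(weights, grad_W)]
    new_b = [b - (alpha / m) * gb for b, gb in zip(biases, grad_b)]
    return new_W, new_b
```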
步骤S51-S53采用目标语音特征对DNN模型进行训练,使得训练获取的目标语音特征识别模型可以对语音进行识别。具体地,目标语音特征识别模型在模型训练过程中进一步提取了目标语音特征的深层特征,模型中训练好的权值和偏置体现了该基于目标语音特征的深层特征。因此,目标语音特征识别模型能够基于训练学习到的深层特征进行识别,实现较为精确的语音识别。Steps S51-S53 train the DNN model by using the target speech features, so that the target speech feature recognition model obtained through training can recognize speech. Specifically, the target speech feature recognition model further extracts the deep features of the target speech feature during the model training process. The trained weights and offsets in the model reflect the deep features based on the target speech feature. Therefore, the target speech feature recognition model can recognize based on the deep features learned through training, and achieve more accurate speech recognition.
S60:将目标声纹特征识别模型和目标语音特征识别模型关联存储在数据库中。S60: Associate the target voiceprint feature recognition model and the target voice feature recognition model in a database.
本实施例中,在获取目标声纹特征识别模型和目标语音特征识别模型后,将该两个模型关联存储在数据库中。具体地,通过目标用户的用户标识进行模型间的关联存储,把相同的用户标识对应的目标声纹特征识别模型和目标语音特征识别模型以文件的形式存储到数据库中。通过将该两个模型进行关联存储,可以在语音的识别阶段调用用户标识对应的目标声纹特征识别模型和目标语音特征识别模型,以结合该两个模型进行语音识别,克服各个模型单独进行识别时存在的误差,进一步地提高语音识别的准确率。In this embodiment, after the target voiceprint feature recognition model and the target speech feature recognition model are obtained, the two models are stored in association in a database. Specifically, the association between the models is established through the user identifier of the target user, and the target voiceprint feature recognition model and target speech feature recognition model corresponding to the same user identifier are stored in the database in the form of files. By storing the two models in association, the target voiceprint feature recognition model and target speech feature recognition model corresponding to a user identifier can be called at the speech recognition stage, so that the two models are combined for speech recognition, overcoming the errors that exist when each model performs recognition alone and further improving the accuracy of speech recognition.
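As an illustration of the associated storage described above, the following sketch keys both trained models to a user identifier; the SQLite schema, column names and use of pickle are assumptions and not part of this application.

```python
# Illustrative sketch: associating the two trained models with a user ID in a database
# (schema, file format and serialization are assumed for illustration only).
import pickle, sqlite3

def store_models(db_path, user_id, voiceprint_model, speech_model):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS voice_models (
                        user_id TEXT PRIMARY KEY,
                        voiceprint_model BLOB,
                        speech_model BLOB)""")
    conn.execute("INSERT OR REPLACE INTO voice_models VALUES (?, ?, ?)",
                 (user_id, pickle.dumps(voiceprint_model), pickle.dumps(speech_model)))
    conn.commit()
    conn.close()

def load_models(db_path, user_id):
    conn = sqlite3.connect(db_path)
    row = conn.execute("SELECT voiceprint_model, speech_model FROM voice_models "
                       "WHERE user_id = ?", (user_id,)).fetchone()
    conn.close()
    return (pickle.loads(row[0]), pickle.loads(row[1])) if row else None
```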
本实施例所提供的语音模型训练方法中,通过提取的训练语音特征获取目标背景模型,该目标背景模型由通用背景模型采用奇异值分解的特征降维方法得到,该目标背景模型以较低特征维度良好展现了训练语音数据的语音特征,在进行与目标背景模型相关的计算时能够提高效率。采用该目标背景模型对提取的目标语音特征进行自适应处理,获取目标声纹特征识别模型。目标背景模型涵盖训练语音数据多个维度的语音特征,可以通过该目标背景模型对数据量较少的目标语音特征进行自适应补充处理,使得在数据量很少的情况下,同样能够得到目标声纹特征识别模型。该目标声纹特征识别模型能够识别采用较低维度表示目标语音特征的声纹特征,从而进行语音识别。然后将目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型,该目标语音特征识别模型深度学习了目标语音特征,能够进行准确率较高的语音识别。最后将目标声纹特征识别模型和目标语音特征识别模型关联存储在数据库中,将两个模型关联存储作为一个总的语音模型,该语音模型有机结合了目标声纹特征识别模型和目标语音特征识别模型,采用由该总的语音模型进行语音识别时,能够提高语音识别的精确率。In the speech model training method provided in this embodiment, a target background model is obtained from the extracted training speech features; the target background model is obtained from a universal background model by a feature dimensionality reduction method based on singular value decomposition, so it represents the speech features of the training speech data well with a lower feature dimension, which improves efficiency in computations involving the target background model. The target background model is then used to adaptively process the extracted target speech features to obtain the target voiceprint feature recognition model. Because the target background model covers speech features of the training speech data in multiple dimensions, it can adaptively supplement target speech features that have only a small amount of data, so that a target voiceprint feature recognition model can still be obtained even when the amount of data is small. The target voiceprint feature recognition model can recognize voiceprint features that represent the target speech features in a lower dimension, and thereby perform speech recognition. The target speech features are then input into a deep neural network for training to obtain the target speech feature recognition model, which deeply learns the target speech features and can perform speech recognition with high accuracy. Finally, the target voiceprint feature recognition model and the target speech feature recognition model are stored in association in a database; the two associated models serve as one overall speech model that organically combines the target voiceprint feature recognition model and the target speech feature recognition model, and performing speech recognition with this overall speech model improves the accuracy of speech recognition.
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其 功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
图7示出与实施例中语音模型训练方法一一对应的语音模型训练装置的示意图。如图7所示,该语音模型训练装置包括训练语音特征提取模块10、目标背景模型获取模块20、目标语音特征提取模块30、目标声纹特征识别模型获取模块40、语音特征识别获取模块50和模型存储模块60。其中,训练语音特征提取模块10、目标背景模型获取模块20、目标语音特征提取模块30、目标声纹特征识别模型获取模块40、语音特征识别获取模块50和模型存储模块60的实现功能与实施例中语音模型训练方法对应的步骤一一对应,为避免赘述,本实施例不一一详述。FIG. 7 is a schematic diagram of a speech model training device corresponding one-to-one to the speech model training method in the embodiment. As shown in FIG. 7, the speech model training device includes a training speech feature extraction module 10, a target background model acquisition module 20, a target speech feature extraction module 30, a target voiceprint feature recognition model acquisition module 40, a speech feature recognition acquisition module 50 and a model storage module 60. The functions implemented by the training speech feature extraction module 10, the target background model acquisition module 20, the target speech feature extraction module 30, the target voiceprint feature recognition model acquisition module 40, the speech feature recognition acquisition module 50 and the model storage module 60 correspond one-to-one to the steps of the speech model training method in the embodiment; to avoid repetition, they are not described in detail here one by one.
训练语音特征提取模块10,用于获取训练语音数据,基于训练语音数据提取训练语音特征;Training voice feature extraction module 10, configured to obtain training voice data, and extract training voice features based on the training voice data;
目标背景模型获取模块20,用于基于训练语音特征获取目标背景模型;A target background model acquisition module 20, configured to acquire a target background model based on the training speech features;
目标语音特征提取模块30,用于获取目标语音数据,基于目标语音数据提取目标语音特征;A target voice feature extraction module 30, configured to obtain target voice data, and extract target voice features based on the target voice data;
目标声纹特征识别模型获取模块40,用于采用目标背景模型对目标语音特征进行自适应处理,获取目标声纹特征识别模型;The target voiceprint feature recognition model acquisition module 40 is configured to adaptively process the target voice feature using the target background model to obtain the target voiceprint feature recognition model;
语音特征识别获取模块50,用于将目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型;Speech feature recognition acquisition module 50, configured to input target speech features into a deep neural network for training, and obtain a target speech feature recognition model;
模型存储模块60,用于将目标声纹特征识别模型和目标语音特征识别模型关联存储在数据库中。The model storage module 60 is configured to store the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
优选地,训练语音特征提取模块10包括预处理单元11、功率谱获取单元12、梅尔功率谱获取单元13和训练语音特征确定单元14。Preferably, the training speech feature extraction module 10 includes a preprocessing unit 11, a power spectrum acquisition unit 12, a Mel power spectrum acquisition unit 13, and a training speech feature determination unit 14.
预处理单元11,用于对训练语音数据进行预处理。The preprocessing unit 11 is configured to preprocess the training voice data.
功率谱获取单元12,用于对预处理后的训练语音数据作快速傅里叶变换,获取训练语音数据的频谱,并根据频谱获取训练语音数据的功率谱。A power spectrum obtaining unit 12 is configured to perform a fast Fourier transform on the pre-processed training voice data, obtain a frequency spectrum of the training voice data, and obtain a power spectrum of the training voice data according to the frequency spectrum.
梅尔功率谱获取单元13,用于采用梅尔刻度滤波器组处理训练语音数据的功率谱,获取训练语音数据的梅尔功率谱。The Mel power spectrum obtaining unit 13 is configured to process the power spectrum of the training speech data by using a Mel scale filter bank, and obtain a Mel power spectrum of the training speech data.
训练语音特征确定单元14,用于在梅尔功率谱上进行倒谱分析,获取训练语音数据的梅尔频率倒谱系数,并将获取到的梅尔频率倒谱系数确定为训练语音特征。The training speech feature determining unit 14 is configured to perform cepstrum analysis on the Mel power spectrum, obtain Mel frequency cepstrum coefficients of training speech data, and determine the obtained Mel frequency cepstrum coefficients as training speech features.
优选地,预处理单元11包括预加重子单元111、分帧子单元112和加窗子单元113。Preferably, the pre-processing unit 11 includes a pre-emphasis sub-unit 111, a frame sub-unit 112, and a windowing sub-unit 113.
预加重子单元111,用于对训练语音数据作预加重处理。The pre-emphasis sub-unit 111 is configured to perform pre-emphasis processing on the training voice data.
分帧子单元112,用于对预加重后的训练语音数据进行分帧处理。The frame sub-unit 112 is configured to perform frame processing on the pre-emphasized training voice data.
加窗子单元113,用于对分帧处理后的训练语音数据进行加窗处理。A windowing sub-unit 113 is configured to perform windowing processing on the framed processing speech data.
优选地,目标背景模型获取模块20包括通用背景模型获取单元21和目标背景模型获取单元22。Preferably, the target background model acquisition module 20 includes a general background model acquisition unit 21 and a target background model acquisition unit 22.
通用背景模型获取单元21,用于采用训练语音特征进行通用背景模型训练,获取通用背景模型。The universal background model obtaining unit 21 is configured to use the training voice feature to perform a universal background model training to obtain a universal background model.
目标背景模型获取单元22,用于采用奇异值分解对通用背景模型进行特征降维处理,获取目标背景模型。The target background model obtaining unit 22 is configured to perform dimensionality reduction processing on the general background model by using singular value decomposition to obtain a target background model.
优选地,语音特征识别获取模块50包括初始化单元51、输出值获取单元52和目标语音特征识别模型获取单元53。Preferably, the speech feature recognition acquisition module 50 includes an initialization unit 51, an output value acquisition unit 52, and a target speech feature recognition model acquisition unit 53.
初始化单元51,用于初始化深度神经网络模型。The initialization unit 51 is configured to initialize a deep neural network model.
输出值获取单元52,用于将目标语音特征分组输入到深度神经网络模型中,根据前向传播算法获取深度神经网络模型的输出值,目标语音特征的第i组样本在深度神经网络模型的当前层的输出值用公式表示为a^{i,l}=σ(W^l·a^{i,l-1}+b^l),其中,a为输出值,i表示输入的目标语音特征的第i组样本,l为深度神经网络模型的当前层,σ为激活函数,W为权值,l-1为深度神经网络模型的当前层的上一层,b为偏置。An output value acquisition unit 52, configured to group the target speech features and input them into the deep neural network model, and obtain the output value of the deep neural network model according to the forward propagation algorithm; the output value of the i-th group of samples of the target speech features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l·a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target speech features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer above the current layer of the deep neural network model, and b is the bias.
目标语音特征识别模型获取单元53,用于基于深度神经网络模型的输出值进行误差反传,更新深度神经网络模型各层的权值和偏置,获取目标语音特征识别模型,其中,更新权值的计算公式为W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T,l为深度神经网络模型的当前层,W为权值,α为迭代步长,m为输入的目标语音特征的样本总数,δ^{i,l}为当前层的灵敏度;δ^{i,l}=(W^{l+1})^T·δ^{i,l+1}∘σ'(z^{i,l}),z^{i,l}=W^l·a^{i,l-1}+b^l,a^{i,l-1}为上一层的输出,T表示矩阵转置运算,∘表示两个矩阵对应元素相乘的运算(Hadamard积),更新偏置的计算公式为b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}。A target speech feature recognition model acquisition unit 53, configured to perform error back propagation based on the output value of the deep neural network model, and update the weights and biases of each layer of the deep neural network model to obtain the target speech feature recognition model, where the formula for updating the weights is W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T, in which l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target speech features, and δ^{i,l} is the sensitivity of the current layer; δ^{i,l} = (W^{l+1})^T·δ^{i,l+1} ∘ σ'(z^{i,l}), z^{i,l} = W^l·a^{i,l-1} + b^l, a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of two matrices (Hadamard product); the formula for updating the biases is b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}.
图8示出在一实施例中语音识别方法的一流程图。该语音识别方法可应用在银行、证券、投资和保险等金融机构或者需进行语音识别的其他机构的计算机设备上,以达到人工智能的语音识别目的。其中,该计算机设备是可与用户进行人机交互的设备,包括但不限于电脑、智能手机和平板等设备。如图8所示,该语音识别方法包括如下步骤:FIG. 8 shows a flowchart of a speech recognition method in an embodiment. The speech recognition method can be applied to the computer equipment of financial institutions such as banks, securities, investment, and insurance, or other institutions that need to perform speech recognition to achieve the purpose of speech recognition by artificial intelligence. The computer device is a device that can perform human-computer interaction with a user, including, but not limited to, a computer, a smart phone, and a tablet. As shown in FIG. 8, the speech recognition method includes the following steps:
S71:获取待识别语音数据,待识别语音数据与用户标识相关联。S71: Acquire speech data to be identified, and the speech data to be identified is associated with a user identifier.
其中,待识别语音数据是指待进行识别的用户的语音数据,用户标识是用于唯一识别用户的标识,该用户标识可以是身份证号或电话号码等能够唯一识别用户的标识。The voice data to be identified refers to voice data of a user to be identified. The user identifier is an identifier for uniquely identifying the user. The user identifier may be an identifier that can uniquely identify the user, such as an ID card number or a phone number.
本实施例中,获取待识别语音数据,具体可以是通过计算机设备内置的录音模块或者外部的录音设备采集,该待识别语音数据与用户标识相关联,可以根据与用户标识相关联的待识别语音数据判断是不是用户本人发出的语音,实现语音识别。In this embodiment, the speech data to be recognized may specifically be collected through a recording module built into the computer device or through an external recording device. The speech data to be recognized is associated with a user identifier, and whether the speech was uttered by the user himself or herself can be judged from the speech data to be recognized that is associated with the user identifier, thereby realizing speech recognition.
S72:基于用户标识查询数据库,获取关联存储的目标声纹特征识别模型和目标语音特征识别模型,目标声纹特征识别模型和目标语音特征识别模型是上述实施例提供的语音模型训练方法获取的模型。S72: Query the database based on the user ID to obtain the target voiceprint feature recognition model and target voice feature recognition model that are stored in association. The target voiceprint feature recognition model and target voice feature recognition model are models obtained by the voice model training method provided in the foregoing embodiment. .
本实施例中,根据用户标识查询数据库,在数据库中获取与用户标识相关联的目标声纹特征识别模型和目标语音特征识别模型。关联存储的目标声纹特征识别模型和目标语音特征识别模型在数据库中以文件的形式存储,在对数据库查询后调用与用户标识相对应的模型的文件,以使计算机设备可根据文件存储的目标声纹特征识别模型和目标语音特征识别模型进行语音识别。In this embodiment, the database is queried according to the user identifier, and the target voiceprint feature recognition model and target speech feature recognition model associated with the user identifier are obtained from the database. The associatively stored target voiceprint feature recognition model and target speech feature recognition model are stored in the database in the form of files; after the database is queried, the model files corresponding to the user identifier are called, so that the computer device can perform speech recognition based on the stored target voiceprint feature recognition model and target speech feature recognition model.
S73:基于待识别语音数据,提取待识别语音特征。S73: Based on the speech data to be identified, extract speech features to be identified.
本实施例中,获取待识别语音数据,该待识别语音数据不能被计算机直接识别,无法进行语音识别。因此,需根据该待识别语音数据提取相应的待识别语音特征,将待识别语音数据转化为计算机能够识别的待识别语音特征。该待识别语音特征具体可以是梅尔频率倒谱系数,具体提取过程参S11-S14,在此不在赘述。In this embodiment, to-be-recognized voice data is acquired, and the to-be-recognized voice data cannot be directly recognized by a computer, and voice recognition cannot be performed. Therefore, it is necessary to extract corresponding to-be-recognized speech features according to the to-be-recognized voice data, and convert the to-be-recognized voice data into to-be-recognized voice features that can be recognized by a computer. The feature of the speech to be recognized may specifically be a Mel frequency cepstrum coefficient, and the specific extraction process refers to S11-S14, which is not described in detail here.
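For reference, a compact sketch of the Mel-frequency cepstral coefficient pipeline referred to above (pre-emphasis, framing, windowing, FFT, Mel filter bank, cepstral analysis); the frame length, hop size, number of filters and number of coefficients are assumed example values, not values specified in this application.

```python
# Illustrative sketch of MFCC extraction for the speech data to be recognized
# (all numeric parameters below are assumed example values).
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])          # pre-emphasis
    frames = np.stack([emph[i:i + frame_len]
                       for i in range(0, len(emph) - frame_len, hop)]).astype(np.float64)
    frames *= np.hamming(frame_len)                                        # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft                # power spectrum
    # Mel-scale filter bank
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        fbank[m - 1, bins[m - 1]:bins[m]] = np.linspace(0, 1, bins[m] - bins[m - 1], endpoint=False)
        fbank[m - 1, bins[m]:bins[m + 1]] = np.linspace(1, 0, bins[m + 1] - bins[m], endpoint=False)
    mel_power = np.log(power @ fbank.T + 1e-10)                            # Mel power spectrum
    return dct(mel_power, type=2, axis=1, norm='ortho')[:, :n_ceps]        # cepstral analysis
```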
S74:将待识别语音特征输入到目标语音特征识别模型,获取第一得分。S74: Input the speech feature to be recognized into the target speech feature recognition model, and obtain a first score.
本实施例中,采用目标语音特征识别模型对待识别语音特征进行识别,将待识别语音特征输入到目标语音特征识别模型中,经过该模型内部的网络参数(权值和偏置)对待识别语音特征进行计算,获取第一得分。In this embodiment, the target speech feature recognition model is used to recognize the speech features to be recognized: the speech features to be recognized are input into the target speech feature recognition model and computed through the network parameters (weights and biases) inside the model to obtain the first score.
S75:将待识别语音数据输入到目标声纹特征识别模型中,获取第二得分。S75: Input the speech data to be recognized into the target voiceprint feature recognition model, and obtain a second score.
本实施例中,将待识别语音数据输入到目标声纹特征识别模型中进行识别,具体地,先采用目标声纹特征模型提取待识别语音数据中的待识别声纹特征,可以通过以下公式计算获取待识别声纹特征:M(i)=M_0+T·w(i),其中M_0是由目标背景模型参数中的均值(m_k)连接组成的A×K维超矢量(目标背景模型是采用上述实施例提供的语音模型训练方法获取的目标背景模型,目标背景模型中的均值是降维过的,降维后均值表示为A维矢量),M(i)是由目标声纹特征识别模型参数中的均值(m_k′)连接组成的A×K维超矢量,T是(A×K)×F维的描述总体变化的矩阵,表示待识别声纹特征的向量空间,w(i)表示一个符合标准正态分布的F维矢量,该w(i)即为待识别声纹特征。由于向量空间T的参数含有隐变量,无法直接得到,但是能够根据已知的M(i)和M_0,采用EM算法迭代计算求出空间T,再根据M(i)=M_0+T·w(i)的关系式获取待识别声纹特征。获取待识别声纹特征后,根据该待识别声纹特征与目标语音特征对应的目标声纹特征进行相似度的比较(如余弦相似度),相似度越高,则认为该待识别声纹特征与目标声纹特征越接近,也就代表是用户本人语音的可能性越大。同样根据上述采用待识别语音数据求得待识别声纹特征的方法,可以计算得到训练目标声纹特征识别模型过程中采用的目标语音特征对应的目标声纹特征,通过计算待识别声纹特征与目标声纹特征的余弦相似度,将余弦相似度作为第二得分。In this embodiment, the speech data to be recognized is input into the target voiceprint feature recognition model for recognition. Specifically, the target voiceprint feature model is first used to extract the voiceprint feature to be recognized from the speech data to be recognized, which can be computed by the formula M(i) = M_0 + T·w(i), where M_0 is an A×K-dimensional supervector formed by concatenating the means (m_k) of the target background model parameters (the target background model is the one obtained by the speech model training method provided in the foregoing embodiment; the means in the target background model have been dimensionality-reduced and are represented as A-dimensional vectors), M(i) is an A×K-dimensional supervector formed by concatenating the means (m_k′) of the target voiceprint feature recognition model parameters, T is an (A×K)×F-dimensional matrix describing the total variability and represents the vector space of the voiceprint features to be recognized, and w(i) is an F-dimensional vector obeying the standard normal distribution; this w(i) is the voiceprint feature to be recognized. Because the parameters of the vector space T contain latent variables, T cannot be obtained directly, but it can be computed iteratively from the known M(i) and M_0 using the EM algorithm, and the voiceprint feature to be recognized is then obtained from the relation M(i) = M_0 + T·w(i). After the voiceprint feature to be recognized is obtained, its similarity to the target voiceprint feature corresponding to the target speech features is compared (for example, by cosine similarity); the higher the similarity, the closer the voiceprint feature to be recognized is to the target voiceprint feature, and the more likely the speech is the user's own. Likewise, by the same method used above to obtain the voiceprint feature to be recognized from the speech data to be recognized, the target voiceprint feature corresponding to the target speech features used in training the target voiceprint feature recognition model can be computed; the cosine similarity between the voiceprint feature to be recognized and the target voiceprint feature is then calculated and taken as the second score.
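A minimal sketch of the cosine-similarity comparison used for the second score, assuming both voiceprint features are plain F-dimensional vectors.

```python
# Illustrative sketch: second score as the cosine similarity between the voiceprint
# feature w(i) to be recognized and the enrolled target voiceprint feature.
import numpy as np

def cosine_similarity(w_test, w_target):
    return float(np.dot(w_test, w_target) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_target) + 1e-10))
```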
S76:将第一得分与预设的第一加权比例相乘,获取第一加权得分,将第二得分与预设的第二加权比例相乘,获取第二加权得分,将第一加权得分和第二加权得分相加,获取目标得分。S76: Multiply the first score with a preset first weighted ratio to obtain a first weighted score, multiply the second score with a preset second weighted ratio, obtain a second weighted score, and sum the first weighted score and The second weighted scores are added to obtain the target score.
本实施例中,根据目标声纹特征识别模型和目标语音特征识别模型各自存在的不足进行针对性的克服。可以理解地,在采用目标语音特征识别模型识别并获取第一得分时,由于待识别语音特征维度较高,包含了部分干扰语音特征(如噪音等),使得在单独采用该模型得到的第一得分与实际结果存在一定的误差;在采用目标声纹特征识别模型识别并获取第二得分时,由于待识别声纹特征的维度较低,难以避免地丢失了部分能够代表待识别语音数据的特征,使得在单独采用该模型得到的第二得分与实际结果存在一定的误差。由于第一得分和第二得分的误差是由维度较高和维度较低两个相反的原因造成的,因此针对第一得分的误差和第二得分的误差造成的原因,将第一得分与预设的第一加权比例相乘,获取第一加权得分,将第二得分与预设的第二加权比例相乘,获取第二加权得分,将第一加权得分和第二加权得分相加,获取目标得分,该目标得分即最终输出的得分。采用该加权的处理方式恰好可以克服第一得分的误差和第二得分的误差,可以认为两个误差之间相互抵消掉,使得目标得分更接近实际结果,能够提高语音识别的准确率。In this embodiment, the respective shortcomings of the target voiceprint feature recognition model and the target speech feature recognition model are overcome in a targeted way. Understandably, when the target speech feature recognition model is used to obtain the first score, the speech features to be recognized have a relatively high dimension and include some interfering speech features (such as noise), so the first score obtained with this model alone deviates to some extent from the actual result. When the target voiceprint feature recognition model is used to obtain the second score, the voiceprint features to be recognized have a relatively low dimension, so some features representing the speech data to be recognized are unavoidably lost, and the second score obtained with this model alone also deviates to some extent from the actual result. Since the errors of the first score and of the second score arise from two opposite causes, a higher dimension and a lower dimension respectively, the first score is multiplied by a preset first weighting proportion to obtain a first weighted score, the second score is multiplied by a preset second weighting proportion to obtain a second weighted score, and the first weighted score and the second weighted score are added to obtain the target score, which is the final output score. This weighted processing compensates for the error of the first score and the error of the second score; the two errors can be regarded as cancelling each other out, so that the target score is closer to the actual result and the accuracy of speech recognition can be improved.
S77:若目标得分大于预设得分阈值,则确定待识别语音数据为用户标识对应的目标语音数据。S77: If the target score is greater than a preset score threshold, determine that the speech data to be recognized is target speech data corresponding to the user identification.
本实施例中,判断目标得分是否大于预设得分阈值,若目标得分大于预设得分阈值,则认为待识别语音数据为用户标识对应的目标语音数据,即确定为用户本人的语音数据;若目标得分不大于预设得分阈值,则不认为该待识别语音数据为用户本人的语音数据。In this embodiment, it is judged whether the target score is greater than a preset score threshold. If the target score is greater than the preset score threshold, the speech data to be identified is considered as the target speech data corresponding to the user identification, that is, the user's own speech data is determined; If the score is not greater than the preset score threshold, the voice data to be recognized is not considered to be the voice data of the user himself.
其中,预设得分阈值是指预先设置的用于衡量待识别语音数据是否为用户标识对应的目标语音数据的阈值,该阈值以分数的形式表示。例如,将预设得分阈值设置为0.95,则目标得分大于0.95的待识别语音数据为与用户标识对应的目标语音数据,目标得分不大于0.95的待识别语音数据不认为用户标识对应的用户本人的语音数据。The preset score threshold refers to a preset threshold used to measure whether the speech data to be identified is target speech data corresponding to the user identifier, and the threshold is expressed in the form of a score. For example, if the preset score threshold is set to 0.95, the speech data to be recognized with a target score greater than 0.95 is the target speech data corresponding to the user identification, and the speech data to be recognized with a target score not greater than 0.95 is not considered to be the user's own corresponding Voice data.
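A minimal sketch of steps S76-S77, the weighted fusion and thresholding; the 0.5/0.5 weighting proportions are assumptions, while 0.95 follows the example threshold above.

```python
# Illustrative sketch of weighted score fusion and thresholding (S76-S77);
# the weighting proportions here are assumed example values.
def is_target_speaker(first_score, second_score, w1=0.5, w2=0.5, score_threshold=0.95):
    target_score = first_score * w1 + second_score * w2
    return target_score > score_threshold, target_score

# Example usage with hypothetical scores from the two models.
accepted, score = is_target_speaker(0.97, 0.96)
```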
本实施例所提供的语音识别方法中,根据提取的待识别语音特征输入到语音模型中,得到与目标语音特征识别模型相关的第一得分和与目标声纹特征识别模型相关的第二得分,并通过加权运算获取目标得分,由目标得分得出语音识别结果。第一得分从较高维度的目标语音特征反映了语音识别结果的概率,由于该维度较高,包含了部分干扰语音特征(如噪音等),使得第一得分与实际输出存在误差,影响语音识别结果;第二得分从较低维度的声纹特征反映了语音识别结果的概率,由于声纹特征的维度较低,难以避免地丢失了部分关键语音特征,使得第二得分与实际输出存在误差,影响语音识别结果。采用加权运算获取的目标得分能够针对目标语音特征识别模型和目标声纹特征识别模型各自的不足,克服第一得分和第二得分的误差,可以认为将两个误差相互抵消掉,使得目标得分更接近实际结果,提高语音识别的精确率。In the speech recognition method provided in this embodiment, the extracted speech features to be recognized are input into the speech model to obtain a first score related to the target speech feature recognition model and a second score related to the target voiceprint feature recognition model, the target score is obtained through a weighted operation, and the speech recognition result is derived from the target score. The first score reflects the probability of the speech recognition result from the higher-dimensional target speech features; because this dimension is high and includes some interfering speech features (such as noise), the first score deviates from the actual output and affects the speech recognition result. The second score reflects the probability of the speech recognition result from the lower-dimensional voiceprint features; because the dimension of the voiceprint features is low, some key speech features are unavoidably lost, so the second score also deviates from the actual output and affects the speech recognition result. The target score obtained through the weighted operation addresses the respective shortcomings of the target speech feature recognition model and the target voiceprint feature recognition model and compensates for the errors of the first score and the second score; the two errors can be regarded as cancelling each other out, so that the target score is closer to the actual result and the accuracy of speech recognition is improved.
图9示出与实施例中语音识别方法一一对应的语音识别装置的示意图。如图9所示,该语音识别装置包括待识别语音数据获取模块70、模型获取模块80、待识别语音特征提取模块90和第一得分获取模块100、第二得分获取模块110、目标得分获取模块120和语音确定模块130。其中,待识别语音数据获取模块70、模型获取模块80、待识别语音特征提取模块90和第一得分获取模块100、第二得分获取模块110、目标得分获取模块120和语音确定模块130的实现功能与实施例中语音识别方法对应的步骤一一对应,为避免赘述,本实施例不一一详述。FIG. 9 is a schematic diagram of a speech recognition device corresponding to the speech recognition method in the embodiment. As shown in FIG. 9, the voice recognition device includes a to-be-recognized voice data acquisition module 70, a model acquisition module 80, a to-be-recognized speech feature extraction module 90 and a first score acquisition module 100, a second score acquisition module 110, and a target score acquisition module. 120 and a voice determination module 130. Among them, the realized functions of the to-be-recognized voice data acquisition module 70, model acquisition module 80, to-be-recognized voice feature extraction module 90, first score acquisition module 100, second score acquisition module 110, target score acquisition module 120, and voice determination module 130 The steps corresponding to the speech recognition method in the embodiment are one-to-one. To avoid redundant descriptions, this embodiment does not detail them one by one.
待识别语音数据获取模块70,用于获取待识别语音数据,待识别语音数据与用户标识相关联。The to-be-recognized voice data acquisition module 70 is configured to obtain the to-be-recognized voice data, and the to-be-recognized voice data is associated with a user identifier.
模型获取模块80,用于基于用户标识查询数据库,获取关联存储的目标声纹特征识别模型和目标语音特征识别模型,目标声纹特征识别模型和目标语音特征识别模型是采用上述实施例提供的语音模型训练方法获取的模型。A model acquisition module 80, configured to query a database based on the user identifier and obtain the associatively stored target voiceprint feature recognition model and target speech feature recognition model, where the target voiceprint feature recognition model and the target speech feature recognition model are models obtained by the speech model training method provided in the foregoing embodiment.
待识别语音特征提取模块90,用于基于待识别语音数据,提取待识别语音特征。The to-be-recognized voice feature extraction module 90 is configured to extract the to-be-recognized voice features based on the to-be-recognized voice data.
第一得分获取模块100,用于将待识别语音特征输入到目标语音特征识别模型,获取第一得分。The first score obtaining module 100 is configured to input a voice feature to be recognized into a target voice feature recognition model, and obtain a first score.
第二得分获取模块110,用于将待识别语音数据输入到目标声纹特征识别模型中,获取第二得分。A second score obtaining module 110 is configured to input the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score.
目标得分获取模块120,用于将第一得分与预设的第一加权比例相乘,获取第一加权得分,将第二得分与预设的第二加权比例相乘,获取第二加权得分,将第一加权得分和第二加权得分相加,获取目标得分。A target score obtaining module 120, configured to multiply a first score with a preset first weighted ratio, obtain a first weighted score, multiply a second score with a preset second weighted ratio, and obtain a second weighted score; Add the first weighted score and the second weighted score to obtain the target score.
语音确定模块130,用于若目标得分大于预设得分阈值,则确定待识别语音数据为用户标识对应的目标语音数据。The voice determining module 130 is configured to determine that the voice data to be recognized is target voice data corresponding to a user identifier if the target score is greater than a preset score threshold.
本实施例提供一个或多个存储有计算机可读指令的非易失性可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现实施例中语音模型训练方法,为避免重复,这里不再赘述。或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现实施例中语音模型训练装置的各模块/单元的功能,为避免重复,这里不再赘述。或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现实施例中语音识别方法中各步骤的功能,为避免重复,此处不一一赘述。或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现实施例中语音识别装置中各模块/单元的功能,为避免重复,此处不一一赘述。This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the speech model training method in the embodiment; to avoid repetition, details are not repeated here. Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the modules/units of the speech model training device in the embodiment; to avoid repetition, details are not repeated here. Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the steps of the speech recognition method in the embodiment; to avoid repetition, details are not repeated here one by one. Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the modules/units of the speech recognition device in the embodiment; to avoid repetition, details are not repeated here one by one.
图10是本申请一实施例提供的计算机设备的示意图。如图10所示,该实施例的计算机设备140包括:处理器141、存储器142以及存储在存储器142中并可在处理器141上运行的计算机可读指令143,该计算机可读指令143被处理器141执行时实现实施例中的语音模型训练方法,为避免重复,此处不一一赘述。或者,该计算机可读指令143被处理器141执行时实现实施例中语音模型训练装置中各模块/单元的功能,为避免重复,此处不一一赘述。或者,该计算机可读指令143被处理器141执行时实现实施例中语音识别方法中各步骤的功能,为避免重复,此处不一一赘述。或者,该计算机可读指令143被处理器141执行时实现实施例中语音识别装置中各模块/单元的功能,为避免重复,此处不一一赘述。FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application. As shown in FIG. 10, the computer device 140 of this embodiment includes a processor 141, a memory 142, and computer-readable instructions 143 stored in the memory 142 and executable on the processor 141. When executed by the processor 141, the computer-readable instructions 143 implement the speech model training method in the embodiment; to avoid repetition, details are not repeated here one by one. Alternatively, when executed by the processor 141, the computer-readable instructions 143 implement the functions of the modules/units of the speech model training device in the embodiment; to avoid repetition, details are not repeated here one by one. Alternatively, when executed by the processor 141, the computer-readable instructions 143 implement the functions of the steps of the speech recognition method in the embodiment; to avoid repetition, details are not repeated here one by one. Alternatively, when executed by the processor 141, the computer-readable instructions 143 implement the functions of the modules/units of the speech recognition device in the embodiment; to avoid repetition, details are not repeated here one by one.
计算机设备140可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。计算机设备可包括,但不仅限于,处理器141、存储器142。本领域技术人员可以理解,图10仅仅是计算机设备140的示例,并不构成对计算机设备140的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如计算机设备还可以包括输入输出设备、网络接入设备、总线等。The computer device 140 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The computer equipment may include, but is not limited to, a processor 141 and a memory 142. Those skilled in the art can understand that FIG. 10 is only an example of the computer device 140, and does not constitute a limitation on the computer device 140. It may include more or fewer components than shown in the figure, or combine some components or different components. For example, computer equipment may also include input and output equipment, network access equipment, and buses.
所称处理器141可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The so-called processor 141 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
存储器142可以是计算机设备140的内部存储单元,例如计算机设备140的硬盘或内存。存储器142也可以是计算机设备140的外部存储设备,例如计算机设备140上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,存储器142还可以既包括计算机设备140的内部存储单元也包括外部存储设备。存储器142用于存储计算机可读指令143以及计算机设备所需的其他程序和数据。存储器142还可以用于暂时地存储已经输出或者将要输出的数据。The memory 142 may be an internal storage unit of the computer device 140, such as a hard disk or a memory of the computer device 140. The memory 142 may also be an external storage device of the computer device 140, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, and a flash memory card (Flash) provided on the computer device 140. Card) and so on. Further, the memory 142 may also include both an internal storage unit of the computer device 140 and an external storage device. The memory 142 is used to store the computer-readable instructions 143 and other programs and data required by the computer device. The memory 142 may also be used to temporarily store data that has been output or is to be output.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that, for the convenience and brevity of the description, only the above-mentioned division of functional units and modules is used as an example. In practical applications, the above functions can be assigned by different functional units, Module completion, that is, dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现, 也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware or in the form of software functional units.
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。The above-mentioned embodiments are only used to describe the technical solution of the present application, but not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still implement the foregoing implementations. The technical solutions described in the examples are modified, or some of the technical features are equivalently replaced; and these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the application, and should be included in Within the scope of this application.

Claims (20)

  1. 一种语音模型训练方法,其特征在于,包括:A speech model training method, comprising:
    获取训练语音数据,基于所述训练语音数据提取训练语音特征;Acquiring training voice data, and extracting training voice features based on the training voice data;
    基于所述训练语音特征获取目标背景模型;Obtaining a target background model based on the training speech features;
    获取目标语音数据,基于所述目标语音数据提取目标语音特征;Acquiring target voice data, and extracting target voice features based on the target voice data;
    采用所述目标背景模型对所述目标语音特征进行自适应处理,获取目标声纹特征识别模型;Using the target background model to adaptively process the target voice feature to obtain a target voiceprint feature recognition model;
    将所述目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型;Inputting the target speech features into a deep neural network for training, and obtaining a target speech feature recognition model;
    将所述目标声纹特征识别模型和所述目标语音特征识别模型关联存储在数据库中。The target voiceprint feature recognition model and the target voice feature recognition model are associatedly stored in a database.
  2. 根据权利要求1所述的语音模型训练方法,其特征在于,所述基于所述训练语音数据提取训练语音特征,包括:The method for training a speech model according to claim 1, wherein the extracting a training speech feature based on the training speech data comprises:
    对所述训练语音数据进行预处理;Preprocessing the training speech data;
    对预处理后的训练语音数据作快速傅里叶变换,获取训练语音数据的频谱,并根据所述频谱获取训练语音数据的功率谱;Performing fast Fourier transform on the pre-processed training voice data to obtain a frequency spectrum of the training voice data, and obtaining a power spectrum of the training voice data according to the frequency spectrum;
    采用梅尔刻度滤波器组处理所述训练语音数据的功率谱,获取训练语音数据的梅尔功率谱;Using a Mel scale filter bank to process the power spectrum of the training speech data, and obtain a Mel power spectrum of the training speech data;
    在所述梅尔功率谱上进行倒谱分析,获取训练语音数据的梅尔频率倒谱系数,并将获取到的梅尔频率倒谱系数确定为所述训练语音特征。A cepstrum analysis is performed on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of training speech data, and the obtained Mel frequency cepstrum coefficient is determined as the training speech feature.
  3. 根据权利要求2所述的语音模型训练方法,其特征在于,所述对所述训练语音数据进行预处理,包括:The method for training a speech model according to claim 2, wherein the preprocessing the training speech data comprises:
    对所述训练语音数据作预加重处理;Pre-emphasis the training voice data;
    对预加重后的所述训练语音数据进行分帧处理;Performing frame processing on the pre-emphasis training voice data;
    对分帧处理后的所述训练语音数据进行加窗处理。Perform windowing processing on the training speech data after frame processing.
  4. 根据权利要求1所述的语音模型训练方法,其特征在于,所述基于所述训练语音特征获取目标背景模型,包括:The method for training a speech model according to claim 1, wherein the acquiring a target background model based on the training speech features comprises:
    采用所述训练语音特征进行通用背景模型训练,获取通用背景模型;Using the training speech feature to perform a general background model training to obtain a general background model;
    采用奇异值分解对所述通用背景模型进行特征降维处理,获取所述目标背景模型。Singular value decomposition is used to perform feature reduction processing on the universal background model to obtain the target background model.
  5. 根据权利要求1所述的语音模型训练方法,其特征在于,所述将所述目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型,包括:The method for training a speech model according to claim 1, wherein the inputting the target speech features into a deep neural network for training to obtain the target speech feature recognition model comprises:
    初始化深度神经网络模型;Initialize the deep neural network model;
    将所述目标语音特征分组输入到所述深度神经网络模型中,根据前向传播算法获取深度神经网络模型的输出值,目标语音特征的第i组样本在深度神经网络模型的当前层的输出值用公式表示为a^{i,l}=σ(W^l·a^{i,l-1}+b^l),其中,a为输出值,i表示输入的目标语音特征的第i组样本,l为深度神经网络模型的当前层,σ为激活函数,W为权值,l-1为深度神经网络模型的当前层的上一层,b为偏置;Grouping the target speech features and inputting them into the deep neural network model, and obtaining the output value of the deep neural network model according to the forward propagation algorithm, wherein the output value of the i-th group of samples of the target speech features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l·a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target speech features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer above the current layer of the deep neural network model, and b is the bias;
    基于深度神经网络模型的输出值进行误差反传,更新深度神经网络模型各层的权值和偏置,获取所述目标语音特征识别模型,其中,更新权值的计算公式为W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T,l为深度神经网络模型的当前层,W为权值,α为迭代步长,m为输入的目标语音特征的样本总数,δ^{i,l}为当前层的灵敏度;δ^{i,l}=(W^{l+1})^T·δ^{i,l+1}∘σ'(z^{i,l}),z^{i,l}=W^l·a^{i,l-1}+b^l,a^{i,l-1}为上一层的输出,T表示矩阵转置运算,∘表示两个矩阵对应元素相乘的运算(Hadamard积),更新偏置的计算公式为b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}。Performing error back propagation based on the output value of the deep neural network model, and updating the weights and biases of each layer of the deep neural network model to obtain the target speech feature recognition model, wherein the formula for updating the weights is W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T, in which l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target speech features, and δ^{i,l} is the sensitivity of the current layer; δ^{i,l} = (W^{l+1})^T·δ^{i,l+1} ∘ σ'(z^{i,l}), z^{i,l} = W^l·a^{i,l-1} + b^l, a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of two matrices (Hadamard product); the formula for updating the biases is b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}.
  6. 一种语音识别方法,其特征在于,包括:A speech recognition method, comprising:
    获取待识别语音数据,所述待识别语音数据与用户标识相关联;Obtaining to-be-recognized voice data, which is associated with a user identifier;
    基于所述用户标识查询数据库,获取关联存储的目标声纹特征识别模型和目标语音特征识别模型,所述目标声纹特征识别模型和所述目标语音特征识别模型是采用权利要求1-5任一项所述语音模型训练方法获取的模型;Querying a database based on the user identifier to obtain an associatively stored target voiceprint feature recognition model and target voice feature recognition model, wherein the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by the speech model training method according to any one of claims 1-5;
    基于所述待识别语音数据,提取待识别语音特征;Extracting features to be recognized based on the to-be-recognized voice data;
    将所述待识别语音特征输入到目标语音特征识别模型,获取第一得分;Inputting the speech feature to be recognized into a target speech feature recognition model to obtain a first score;
    将所述待识别语音数据输入到目标声纹特征识别模型中,获取第二得分;Inputting the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score;
    将所述第一得分与预设的第一加权比例相乘,获取第一加权得分,将所述第二得分与预设的第二加权比例相乘,获取第二加权得分,将所述第一加权得分和所述第二加权得分相加,获取目标得分;Multiplying the first score with a preset first weighted ratio to obtain a first weighted score, multiplying the second score with a preset second weighted ratio to obtain a second weighted score, and Adding a weighted score and the second weighted score to obtain a target score;
    若所述目标得分大于预设得分阈值,则确定所述待识别语音数据为所述用户标识对应的目标语音数据。If the target score is greater than a preset score threshold, it is determined that the speech data to be recognized is target speech data corresponding to the user identification.
  7. 一种语音模型训练装置,其特征在于,包括:A voice model training device, comprising:
    训练语音特征提取模块,用于获取训练语音数据,基于所述训练语音数据提取训练语音特征;A training voice feature extraction module, configured to obtain training voice data, and extract training voice features based on the training voice data;
    目标背景模型获取模块,用于基于所述训练语音特征获取目标背景模型;A target background model acquisition module, configured to acquire a target background model based on the training speech feature;
    目标语音特征提取模块,用于获取目标语音数据,基于所述目标语音数据提取目标语音特征;A target voice feature extraction module, configured to obtain target voice data, and extract target voice features based on the target voice data;
    目标声纹特征识别模型获取模块,用于采用所述目标背景模型对所述目标语音特征进行自适应处理,获取目标声纹特征识别模型;A target voiceprint feature recognition model acquisition module, configured to adaptively process the target voice feature using the target background model to obtain a target voiceprint feature recognition model;
    语音特征识别获取模块,用于将所述目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型;A speech feature recognition acquisition module, configured to input the target speech feature into a deep neural network for training, and obtain a target speech feature recognition model;
    模型存储模块,用于将所述目标声纹特征识别模型和所述目标语音特征识别模型关联存储在数据库中。A model storage module is configured to store the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
  8. 一种语音识别装置,其特征在于,包括:A voice recognition device, comprising:
    待识别语音数据获取模块,用于获取待识别语音数据,所述待识别语音数据与用户标识相关联;A to-be-recognized voice data acquisition module, configured to obtain the to-be-recognized voice data, the to-be-recognized voice data being associated with a user identifier;
    模型获取模块,用于基于所述用户标识查询数据库,获取关联存储的目标声纹特征识别模型和目标语音特征识别模型,所述目标声纹特征识别模型和所述目标语音特征识别模型是采用权利要求1-5任一项所述语音模型训练方法获取的模型;A model acquisition module, configured to query a database based on the user identifier and obtain an associatively stored target voiceprint feature recognition model and target voice feature recognition model, wherein the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by the speech model training method according to any one of claims 1-5;
    待识别语音特征提取模块,用于基于所述待识别语音数据,提取待识别语音特征;Speech feature extraction module for extracting speech features based on the speech data to be identified;
    第一得分获取模块,用于将所述待识别语音特征输入到目标语音特征识别模型,获取第一得分;A first score acquisition module, configured to input the speech feature to be recognized into a target speech feature recognition model to obtain a first score;
    第二得分获取模块,用于将所述待识别语音数据输入到目标声纹特征识别模型中,获取第二得分;A second score obtaining module, configured to input the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score;
    目标得分获取模块,用于将所述第一得分与预设的第一加权比例相乘,获取第一加权得分,将所述第二得分与预设的第二加权比例相乘,获取第二加权得分,将所述第一加权得分和所述第二加权得分相加,获取目标得分;A target score obtaining module, configured to multiply the first score with a preset first weighted ratio, obtain a first weighted score, multiply the second score with a preset second weighted ratio, and obtain a second Weighted score, adding the first weighted score and the second weighted score to obtain a target score;
    语音确定模块,用于若所述目标得分大于预设得分阈值,则确定所述待识别语音数据为所述用户标识对应的目标语音数据。A voice determination module is configured to determine, if the target score is greater than a preset score threshold, the voice data to be identified is target voice data corresponding to the user identifier.
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    获取训练语音数据,基于所述训练语音数据提取训练语音特征;Acquiring training voice data, and extracting training voice features based on the training voice data;
    基于所述训练语音特征获取目标背景模型;Obtaining a target background model based on the training speech features;
    获取目标语音数据,基于所述目标语音数据提取目标语音特征;Acquiring target voice data, and extracting target voice features based on the target voice data;
    采用所述目标背景模型对所述目标语音特征进行自适应处理,获取目标声纹特征识别模型;Using the target background model to adaptively process the target voice feature to obtain a target voiceprint feature recognition model;
    将所述目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型;Inputting the target speech features into a deep neural network for training, and obtaining a target speech feature recognition model;
    将所述目标声纹特征识别模型和所述目标语音特征识别模型关联存储在数据库中。The target voiceprint feature recognition model and the target voice feature recognition model are associatedly stored in a database.
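For illustration only, the following sketch outlines the claim-9 training flow end to end. It is not part of the claims: extract_features(), train_ubm_with_svd(), map_adapt(), and train_dnn() are hypothetical helper names (the underlying procedures are detailed in claims 10-13 and sketched after them), the adaptive processing is assumed to be MAP-style adaptation, and the database layer is reduced to a plain dictionary keyed by user identifier.

```python
# Hypothetical orchestration of the claim-9 training steps; all helper functions
# are placeholders for the procedures detailed in claims 10-13.
def train_speech_models(train_audio, target_audio, user_id, db):
    train_feats = extract_features(train_audio)            # training voice features
    target_background_model = train_ubm_with_svd(train_feats)
    target_feats = extract_features(target_audio)          # target voice features
    # Adapt the background model to the target speaker -> voiceprint feature model.
    voiceprint_model = map_adapt(target_background_model, target_feats)
    # Train a deep neural network on the target features -> voice feature model.
    voice_feature_model = train_dnn(target_feats)
    # Store both models in association under the same user identifier.
    db[user_id] = {"voiceprint": voiceprint_model, "voice_feature": voice_feature_model}
    return db[user_id]
```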
  10. The computer device according to claim 9, characterized in that the extracting training voice features based on the training voice data comprises:
    Preprocessing the training voice data;
    Performing a fast Fourier transform on the preprocessed training voice data to obtain a frequency spectrum of the training voice data, and obtaining a power spectrum of the training voice data from the frequency spectrum;
    Processing the power spectrum of the training voice data with a Mel-scale filter bank to obtain a Mel power spectrum of the training voice data;
    Performing cepstral analysis on the Mel power spectrum to obtain Mel-frequency cepstral coefficients of the training voice data, and determining the obtained Mel-frequency cepstral coefficients as the training voice features.
  11. The computer device according to claim 10, characterized in that the preprocessing the training voice data comprises:
    Performing pre-emphasis processing on the training voice data;
    Performing framing processing on the pre-emphasized training voice data;
    Performing windowing processing on the framed training voice data.
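For illustration only, a minimal NumPy sketch of the feature pipeline described in claims 10 and 11 (pre-emphasis, framing, windowing, fast Fourier transform, Mel filter bank, cepstral analysis). The sample rate, frame length, hop size, filter count, pre-emphasis coefficient, and the use of a Hamming window and a DCT for the cepstral step are assumptions made for this sketch, not values taken from the claims.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filt=26, n_ceps=13):
    # Pre-emphasis (claim 11).
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing (claim 11).
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # FFT -> frequency spectrum -> power spectrum (claim 10).
    n_fft = 512
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # Mel-scale filter bank applied to the power spectrum -> Mel power spectrum.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv_mel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    mel_power = np.dot(power, fbank.T)
    # Cepstral analysis: log of the Mel power spectrum followed by a DCT.
    log_mel = np.log(mel_power + 1e-10)
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filt)))
    return np.dot(log_mel, dct.T)  # one row of Mel-frequency cepstral coefficients per frame
```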
  12. The computer device according to claim 9, characterized in that the obtaining a target background model based on the training voice features comprises:
    Performing universal background model training using the training voice features to obtain a universal background model;
    Performing feature dimensionality reduction on the universal background model using singular value decomposition to obtain the target background model.
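For illustration only, a sketch of claim 12 under stated assumptions: the universal background model is taken to be a diagonal-covariance Gaussian mixture model trained with scikit-learn, and the singular value decomposition is applied to the matrix of component means to obtain a reduced-rank representation. The mixture count, the rank, and the choice of reducing the mean matrix are assumptions, not details given in the claims.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_target_background_model(train_feats, n_mix=64, rank=8):
    """train_feats: (n_frames, n_dims) matrix of pooled training voice features."""
    # Universal background model: a GMM fitted on all training speakers' features.
    ubm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                          max_iter=200, random_state=0).fit(train_feats)
    # Dimensionality reduction via SVD of the stacked component means (assumed target of claim 12).
    U, S, Vt = np.linalg.svd(ubm.means_, full_matrices=False)
    reduced_means = U[:, :rank] @ np.diag(S[:rank])    # rank-limited projection of the means
    return ubm, reduced_means, Vt[:rank]                # components of the target background model
```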
  13. The computer device according to claim 9, characterized in that the inputting the target voice features into a deep neural network for training to obtain a target voice feature recognition model comprises:
    Initializing a deep neural network model;
    Inputting the target voice features into the deep neural network model in groups, and obtaining output values of the deep neural network model according to a forward propagation algorithm, where the output value of the i-th group of samples of the target voice features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target voice features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer preceding the current layer of the deep neural network model, and b is the bias;
    Performing error back-propagation based on the output values of the deep neural network model, updating the weights and biases of each layer of the deep neural network model, and obtaining the target voice feature recognition model, where the weight update formula is
    W^l = W^l - (α/m) Σ_{i=1}^{m} δ^{i,l} (a^{i,l-1})^T,
    where l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target voice features, and δ^{i,l} is the sensitivity of the current layer, with δ^{i,l} = (W^{l+1})^T δ^{i,l+1} ∘ σ'(z^{i,l}) and z^{i,l} = W^l a^{i,l-1} + b^l, where a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of the corresponding entries of two matrices (the Hadamard product); the bias update formula is
    b^l = b^l - (α/m) Σ_{i=1}^{m} δ^{i,l}.
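For illustration only, a minimal NumPy sketch of the claim-13 procedure: a forward pass computing a^{i,l} = σ(W^l a^{i,l-1} + b^l), followed by error back-propagation with the batch updates W^l ← W^l − (α/m) Σ_i δ^{i,l}(a^{i,l-1})^T and b^l ← b^l − (α/m) Σ_i δ^{i,l}. The sigmoid activation, squared-error loss, and the column-per-sample layout are assumptions made for this sketch, not requirements of the claims.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, b, X, Y, alpha=0.1):
    """One forward/backward pass over one group of samples.

    W, b -- lists of per-layer weight matrices and bias column vectors
    X, Y -- (m, d_in) feature batch and (m, d_out) targets
    """
    m = X.shape[0]
    # Forward propagation: a^{i,l} = sigmoid(W^l a^{i,l-1} + b^l), one column per sample.
    a = [X.T]
    for Wl, bl in zip(W, b):
        a.append(sigmoid(Wl @ a[-1] + bl))
    # Output-layer sensitivity for a squared-error loss; sigma'(z) = a * (1 - a) for sigmoid.
    delta = (a[-1] - Y.T) * a[-1] * (1 - a[-1])
    # Back-propagate the sensitivities and apply the batch weight/bias updates.
    for layer in range(len(W) - 1, -1, -1):
        grad_W = (alpha / m) * (delta @ a[layer].T)           # (alpha/m) * sum_i delta^{i,l} (a^{i,l-1})^T
        grad_b = (alpha / m) * delta.sum(axis=1, keepdims=True)
        if layer > 0:
            # Sensitivity of the previous layer: (W^l)^T delta^{i,l} ∘ sigma'(z of previous layer).
            delta = (W[layer].T @ delta) * a[layer] * (1 - a[layer])
        W[layer] -= grad_W
        b[layer] -= grad_b
    return W, b
```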
  14. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    Obtaining voice data to be recognized, the voice data to be recognized being associated with a user identifier;
    Querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target voice feature recognition model stored in association, the target voiceprint feature recognition model and the target voice feature recognition model being models obtained by the voice model training method according to any one of claims 1-5;
    Extracting voice features to be recognized based on the voice data to be recognized;
    Inputting the voice features to be recognized into the target voice feature recognition model to obtain a first score;
    Inputting the voice data to be recognized into the target voiceprint feature recognition model to obtain a second score;
    Multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score;
    If the target score is greater than a preset score threshold, determining that the voice data to be recognized is the target voice data corresponding to the user identifier.
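For illustration only, a sketch of the scoring and decision steps of claims 8 and 14. The weighting ratios and the threshold value are assumptions, and dnn_score() and voiceprint_score() are hypothetical callables standing in for the target voice feature recognition model and the target voiceprint feature recognition model obtained earlier.

```python
def verify_speaker(features, audio, dnn_score, voiceprint_score,
                   w1=0.5, w2=0.5, threshold=0.8):
    first_score = dnn_score(features)        # score from the target voice feature recognition model
    second_score = voiceprint_score(audio)   # score from the target voiceprint feature recognition model
    # Weighted fusion of the two scores, then comparison against the preset threshold.
    target_score = w1 * first_score + w2 * second_score
    return target_score > threshold          # True: utterance matches the enrolled user identifier
```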
  15. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
    Obtaining training voice data, and extracting training voice features based on the training voice data;
    Obtaining a target background model based on the training voice features;
    Obtaining target voice data, and extracting target voice features based on the target voice data;
    Adaptively processing the target voice features using the target background model to obtain a target voiceprint feature recognition model;
    Inputting the target voice features into a deep neural network for training to obtain a target voice feature recognition model;
    Storing the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
  16. The non-volatile readable storage medium according to claim 15, characterized in that the extracting training voice features based on the training voice data comprises:
    Preprocessing the training voice data;
    Performing a fast Fourier transform on the preprocessed training voice data to obtain a frequency spectrum of the training voice data, and obtaining a power spectrum of the training voice data from the frequency spectrum;
    Processing the power spectrum of the training voice data with a Mel-scale filter bank to obtain a Mel power spectrum of the training voice data;
    Performing cepstral analysis on the Mel power spectrum to obtain Mel-frequency cepstral coefficients of the training voice data, and determining the obtained Mel-frequency cepstral coefficients as the training voice features.
  17. The non-volatile readable storage medium according to claim 16, characterized in that the preprocessing the training voice data comprises:
    Performing pre-emphasis processing on the training voice data;
    Performing framing processing on the pre-emphasized training voice data;
    Performing windowing processing on the framed training voice data.
  18. The non-volatile readable storage medium according to claim 15, characterized in that the obtaining a target background model based on the training voice features comprises:
    Performing universal background model training using the training voice features to obtain a universal background model;
    Performing feature dimensionality reduction on the universal background model using singular value decomposition to obtain the target background model.
  19. The non-volatile readable storage medium according to claim 15, characterized in that the inputting the target voice features into a deep neural network for training to obtain a target voice feature recognition model comprises:
    Initializing a deep neural network model;
    Inputting the target voice features into the deep neural network model in groups, and obtaining output values of the deep neural network model according to a forward propagation algorithm, where the output value of the i-th group of samples of the target voice features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target voice features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer preceding the current layer of the deep neural network model, and b is the bias;
    Performing error back-propagation based on the output values of the deep neural network model, updating the weights and biases of each layer of the deep neural network model, and obtaining the target voice feature recognition model, where the weight update formula is
    W^l = W^l - (α/m) Σ_{i=1}^{m} δ^{i,l} (a^{i,l-1})^T,
    where l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target voice features, and δ^{i,l} is the sensitivity of the current layer, with δ^{i,l} = (W^{l+1})^T δ^{i,l+1} ∘ σ'(z^{i,l}) and z^{i,l} = W^l a^{i,l-1} + b^l, where a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of the corresponding entries of two matrices (the Hadamard product); the bias update formula is
    b^l = b^l - (α/m) Σ_{i=1}^{m} δ^{i,l}.
  20. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
    Obtaining voice data to be recognized, the voice data to be recognized being associated with a user identifier;
    Querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target voice feature recognition model stored in association, the target voiceprint feature recognition model and the target voice feature recognition model being models obtained by the voice model training method according to any one of claims 1-5;
    Extracting voice features to be recognized based on the voice data to be recognized;
    Inputting the voice features to be recognized into the target voice feature recognition model to obtain a first score;
    Inputting the voice data to be recognized into the target voiceprint feature recognition model to obtain a second score;
    Multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score;
    If the target score is greater than a preset score threshold, determining that the voice data to be recognized is the target voice data corresponding to the user identifier.
PCT/CN2018/094348 2018-05-31 2018-07-03 Voice model training method, voice recognition method, device and equipment, and medium WO2019227574A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810551458.4A CN108922515A (en) 2018-05-31 2018-05-31 Speech model training method, audio recognition method, device, equipment and medium
CN201810551458.4 2018-05-31

Publications (1)

Publication Number Publication Date
WO2019227574A1 true WO2019227574A1 (en) 2019-12-05

Family

ID=64420091

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094348 WO2019227574A1 (en) 2018-05-31 2018-07-03 Voice model training method, voice recognition method, device and equipment, and medium

Country Status (2)

Country Link
CN (1) CN108922515A (en)
WO (1) WO2019227574A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448726A (en) * 2019-01-14 2019-03-08 李庆湧 A kind of method of adjustment and system of voice control accuracy rate
CN109817246B (en) * 2019-02-27 2023-04-18 平安科技(深圳)有限公司 Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
CN112116909A (en) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 Voice recognition method, device and system
CN110706690B (en) * 2019-09-16 2024-06-25 平安科技(深圳)有限公司 Speech recognition method and device thereof
CN110928583B (en) * 2019-10-10 2020-12-29 珠海格力电器股份有限公司 Terminal awakening method, device, equipment and computer readable storage medium
CN110942779A (en) * 2019-11-13 2020-03-31 苏宁云计算有限公司 Noise processing method, device and system
CN113457096B (en) * 2020-03-31 2022-06-24 荣耀终端有限公司 Method for detecting basketball movement based on wearable device and wearable device
CN113223537B (en) * 2020-04-30 2022-03-25 浙江大学 Voice training data iterative updating method based on stage test feedback
CN111883175B (en) * 2020-06-09 2022-06-07 河北悦舒诚信息科技有限公司 Voiceprint library-based oil station service quality improving method
CN112599136A (en) * 2020-12-15 2021-04-02 江苏惠通集团有限责任公司 Voice recognition method and device based on voiceprint recognition, storage medium and terminal
CN112669820B (en) * 2020-12-16 2023-08-04 平安科技(深圳)有限公司 Examination cheating recognition method and device based on voice recognition and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194455A (en) * 2010-03-17 2011-09-21 博石金(北京)信息技术有限公司 Voiceprint identification method irrelevant to speak content
CN104217152A (en) * 2014-09-23 2014-12-17 陈包容 Implementation method and device for mobile terminal to enter application program under stand-by state
CN104992705A (en) * 2015-05-20 2015-10-21 普强信息技术(北京)有限公司 English oral automatic grading method and system
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024455B (en) * 2009-09-10 2014-09-17 索尼株式会社 Speaker recognition system and method
US9401148B2 (en) * 2013-11-04 2016-07-26 Google Inc. Speaker verification using neural networks
CN105895104B (en) * 2014-05-04 2019-09-03 讯飞智元信息科技有限公司 Speaker adaptation recognition methods and system

Also Published As

Publication number Publication date
CN108922515A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
WO2019227574A1 (en) Voice model training method, voice recognition method, device and equipment, and medium
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
US11996091B2 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
WO2018107810A1 (en) Voiceprint recognition method and apparatus, and electronic device and medium
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
WO2019232851A1 (en) Method and apparatus for training speech differentiation model, and computer device and storage medium
WO2019237517A1 (en) Speaker clustering method and apparatus, and computer device and storage medium
CN111968666B (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
Poorjam et al. Height estimation from speech signals using i-vectors and least-squares support vector regression
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
CN112767927A (en) Method, device, terminal and storage medium for extracting voice features
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
WO2022143723A1 (en) Voice recognition model training method, voice recognition method, and corresponding device
CN106297768B (en) Speech recognition method
CN116580708A (en) Intelligent voice processing method and system
CN115565548A (en) Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment
Medikonda et al. Higher order information set based features for text-independent speaker identification
Medikonda et al. An information set-based robust text-independent speaker authentication
CN108573698B (en) Voice noise reduction method based on gender fusion information
Kangala et al. A Fractional Ebola Optimization Search Algorithm Approach for Enhanced Speaker Diarization.
Ali et al. The identification and localization of speaker using fusion techniques and machine learning techniques
Prasanna Kumar et al. An unsupervised approach for co-channel speech separation using Hilbert–Huang transform and Fuzzy C-Means clustering
Asaei et al. Investigation of kNN classifier on posterior features towards application in automatic speech recognition
Mavaddati Blind Voice Separation Based on Empirical Mode Decomposition and Grey Wolf Optimizer Algorithm.
Feng et al. Underwater acoustic feature extraction based on restricted Boltzmann machine

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920316

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18920316

Country of ref document: EP

Kind code of ref document: A1