WO2019227574A1 - Voice model training method, voice recognition method, device and equipment, and medium - Google Patents

Voice model training method, voice recognition method, device and equipment, and medium

Info

Publication number
WO2019227574A1
WO2019227574A1 (PCT/CN2018/094348; CN2018094348W)
Authority
WO
WIPO (PCT)
Prior art keywords
target
training
speech
model
voice
Prior art date
Application number
PCT/CN2018/094348
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019227574A1 publication Critical patent/WO2019227574A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0631 Creating reference templates; Clustering
    • G10L 2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique, using neural networks
    • G10L 25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of analysis window

Definitions

  • The present application relates to the field of speech recognition technology, and in particular, to a speech model training method, a speech recognition method, an apparatus, a device, and a medium.
  • The embodiments of the present application provide a speech model training method, apparatus, device, and medium, so as to solve the problem of the low accuracy of current speech recognition.
  • a speech model training method includes:
  • the target voiceprint feature recognition model and the target voice feature recognition model are stored in a database in association.
  • a voice model training device includes:
  • a training voice feature extraction module configured to obtain training voice data, and extract training voice features based on the training voice data
  • a target background model acquisition module configured to acquire a target background model based on the training speech feature
  • a target voice feature extraction module configured to obtain target voice data, and extract target voice features based on the target voice data
  • a target voiceprint feature recognition model acquisition module configured to adaptively process the target voice feature using the target background model to obtain a target voiceprint feature recognition model
  • a speech feature recognition acquisition module configured to input the target speech feature into a deep neural network for training, and obtain a target speech feature recognition model
  • a model storage module is configured to store the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • the processor executes the computer-readable instructions, the following steps are implemented:
  • the target voiceprint feature recognition model and the target voice feature recognition model are stored in a database in association.
  • One or more non-volatile readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the target voiceprint feature recognition model and the target voice feature recognition model are stored in a database in association.
  • the embodiments of the present application provide a method, a device, a device, and a medium for speech recognition to solve the problem of low accuracy of current speech recognition.
  • a speech recognition method includes:
  • the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by using the foregoing speech model training method;
  • if the target score is greater than a preset score threshold, it is determined that the speech data to be recognized is the target speech data corresponding to the user identifier.
  • a voice recognition device includes:
  • a to-be-recognized voice data acquisition module configured to obtain the to-be-recognized voice data, the to-be-recognized voice data being associated with a user identifier;
  • a model acquisition module is configured to query a database based on the user ID to obtain a target voiceprint feature recognition model and a target voice feature recognition model that are stored in an associated manner.
  • wherein the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by using the foregoing speech model training method;
  • a to-be-recognized speech feature extraction module configured to extract speech features to be recognized based on the speech data to be recognized;
  • a first score acquisition module configured to input the speech feature to be recognized into a target speech feature recognition model to obtain a first score
  • a second score obtaining module configured to input the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score
  • a target score obtaining module configured to multiply the first score with a preset first weighted ratio to obtain a first weighted score, multiply the second score with a preset second weighted ratio to obtain a second weighted score, and add the first weighted score and the second weighted score to obtain a target score;
  • a voice determination module is configured to determine, if the target score is greater than a preset score threshold, the voice data to be identified is target voice data corresponding to the user identifier.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • the processor executes the computer-readable instructions, the following steps are implemented:
  • the target voiceprint feature recognition model and the target voice feature recognition model are obtained by using the voice model training method.
  • if the target score is greater than a preset score threshold, it is determined that the speech data to be recognized is the target speech data corresponding to the user identifier.
  • One or more non-volatile readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the target voiceprint feature recognition model and the target voice feature recognition model are obtained by using the voice model training method.
  • if the target score is greater than a preset score threshold, it is determined that the speech data to be recognized is the target speech data corresponding to the user identifier.
  • FIG. 1 is an application environment diagram of a speech model training method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a speech model training method according to an embodiment of the present application.
  • FIG. 3 is a specific flowchart of step S10 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S11 in FIG. 3;
  • FIG. 5 is a specific flowchart of step S20 in FIG. 2;
  • FIG. 6 is a specific flowchart of step S50 in FIG. 2;
  • FIG. 7 is a schematic diagram of a voice model training device according to an embodiment of the present application.
  • FIG. 8 is a flowchart of a speech recognition method according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a voice recognition device according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 illustrates an application environment of a speech model training method provided by an embodiment of the present application.
  • the application environment of the speech model training method includes a server and a client, where the server and the client are connected through a network.
  • A client, also called a user terminal, refers to a program corresponding to the server that provides local services to the user.
  • The client is installed on a computer device that can interact with the user, including, but not limited to, computers, smartphones, tablets, and other devices.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • the server includes, but is not limited to, a file server, a database server, an application server, and a web server.
  • FIG. 2 shows a flowchart of a speech model training method according to an embodiment of the present application. This embodiment is described by taking the application of the speech model training method on a server as an example.
  • The speech model training method includes the following steps:
  • S10 Acquire training voice data, and extract training voice features based on the training voice data.
  • the training speech data is speech data used for training the target background model.
  • The training voice data may be recording data collected by recording a large number of unidentified users through a recording module integrated in a computer device or a recording device connected to the computer device, or an open-source voice data training set on the Internet may be used directly as the training voice data.
  • After the training voice data is acquired, it cannot be directly recognized by a computer or used directly to train the target background model. Therefore, training voice features need to be extracted from the training voice data, converting the training voice data into training voice features that a computer can recognize.
  • the training speech feature may specifically be Mel Frequency Cepstrum Coefficient (MFCC).
  • the MFCC feature has 39 dimensions (represented in the form of a vector), which can better describe the training speech data.
  • step S10 extracting a training voice feature based on the training voice data includes the following steps:
  • the training voice data is pre-processed when the training voice features are extracted.
  • the process of preprocessing the training voice data can better extract the training voice features of the training voice data, so that the extracted training voice features can better represent the training voice data.
  • preprocessing the training voice data includes the following steps:
  • S111 Perform pre-emphasis processing on the training voice data.
  • The pre-emphasis formula is s'(n) = s(n) − a·s(n−1), where s(n) is the signal amplitude in the time domain at time n, s'(n) is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient, whose value satisfies 0.9 < a < 1.0.
  • pre-emphasis is a signal processing method that compensates the high-frequency component of the input signal at the transmitting end. With the increase of the signal rate, the signal is greatly damaged in the transmission process.
  • the damaged signal needs to be compensated.
  • the idea of the pre-emphasis technology is to enhance the high-frequency component of the signal at the beginning of the transmission line to compensate for the excessive attenuation of the high-frequency component during transmission.
  • Pre-emphasis has no effect on noise, so it can effectively improve the output signal-to-noise ratio.
  • By pre-emphasizing the training voice data, the server can eliminate interference caused by the vocal cords and lips during the speaker's vocalization, effectively compensate the suppressed high-frequency part of the training voice data, highlight the high-frequency formants (resonance peaks), enhance the signal amplitude of the training voice data, and help extract the training voice features.
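  • As an illustrative sketch (not part of the original disclosure), the pre-emphasis step can be written in Python roughly as follows; the coefficient 0.97 is an assumed value within the stated range 0.9 < a < 1.0.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter s'(n) = s(n) - a * s(n-1)."""
    emphasized = np.empty_like(signal, dtype=np.float64)
    emphasized[0] = signal[0]                      # first sample has no predecessor
    emphasized[1:] = signal[1:] - a * signal[:-1]  # boost high-frequency components
    return emphasized
```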
  • S112 Perform frame processing on the pre-emphasized training voice data.
  • framed processing is performed on the pre-emphasized training voice data.
  • Framing refers to the speech processing technology that cuts the entire voice signal into several segments.
  • the size of each frame is in the range of 10-30ms, and the frame shift is about 1/2 frame length.
  • Frame shift refers to the overlapping area between two adjacent frames, which can avoid the problem of excessive changes in adjacent two frames.
  • Framed processing of the training voice data can divide the training voice data into several pieces of voice data, and the training voice data can be subdivided to facilitate the extraction of training voice features.
  • S113 Perform windowing processing on the training voice data after framing. After the training voice data is framed, discontinuities appear at the beginning and end of each frame, so the more frames there are, the greater the error with respect to the original signal.
  • the use of windowing can solve this problem, making the framed training speech data continuous, and making each frame exhibit the characteristics of a periodic function.
  • the windowing process specifically refers to the processing of training speech data using a window function.
  • the windowing function can select the Hamming window.
  • The formula for windowing is s'(n) = s(n) × (0.54 − 0.46·cos(2πn/(N−1))), 0 ≤ n ≤ N−1, where N is the Hamming window length, n is the time, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
  • In steps S111-S113, pre-emphasis, framing, and windowing preprocessing are performed on the training voice data, which helps to extract the training voice features from the training voice data, so that the extracted training voice features can better represent the training voice data.
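  • A minimal sketch of the framing and Hamming-windowing steps is given below; the 25 ms frame length and 12.5 ms frame shift are illustrative values within the 10-30 ms frame and half-frame-shift ranges mentioned above, and the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 12.5) -> np.ndarray:
    """Cut the pre-emphasized signal into overlapping frames and apply a Hamming window."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    frame_shift = int(round(sample_rate * shift_ms / 1000.0))
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames                    # shape: (num_frames, frame_len)
```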
  • S12 Perform a fast Fourier transform (FFT) on the pre-processed training voice data to obtain the frequency spectrum of the training voice data, and obtain the power spectrum of the training voice data according to the frequency spectrum.
  • Specifically, performing the fast Fourier transform on the pre-processed training voice data includes the following process: first, the formula for calculating the frequency spectrum is applied to the pre-processed training voice data to obtain the frequency spectrum of the training voice data.
  • The formula for calculating the frequency spectrum is s(k) = Σ_{n=1..N} s(n)·e^(−2πi·nk/N), 1 ≤ k ≤ N, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time, and i is the imaginary unit.
  • a formula for calculating a power spectrum is used to calculate a spectrum of the acquired training voice data, and a power spectrum of the training voice data is obtained.
  • The formula for calculating the power spectrum is P(k) = |s(k)|²/N, 1 ≤ k ≤ N, where N is the frame size and s(k) is the signal amplitude in the frequency domain.
  • Through the fast Fourier transform, the training speech data is converted from signal amplitudes in the time domain into signal amplitudes in the frequency domain, and the power spectrum of the training speech data is then obtained from the frequency-domain signal amplitudes, so that the training speech features can be extracted from the power spectrum of the training speech data.
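  • The spectrum and power-spectrum computation can be sketched as follows; the 512-point FFT size is an assumption, and numpy's real FFT is used in place of the per-sample formula above.

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Return the per-frame power spectrum |s(k)|^2 / N of the windowed frames."""
    spectrum = np.fft.rfft(frames, n=n_fft)    # s(k): frequency-domain signal amplitude
    return (np.abs(spectrum) ** 2) / n_fft     # power spectrum, shape (num_frames, n_fft//2 + 1)
```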
  • S13 Use the Mel scale filter bank to process the power spectrum of the training speech data, and obtain the Mel power spectrum of the training speech data.
  • Processing the power spectrum of the training speech data with the Mel scale filter bank amounts to performing a Mel frequency analysis of the power spectrum.
  • the Mel frequency analysis is an analysis based on human auditory perception.
  • The human ear acts like a filter bank and only pays attention to certain specific frequency components (that is, human hearing is selective with respect to frequency); it only lets signals of certain frequencies pass through and directly ignores frequency signals it does not want to perceive.
  • The Mel scale filter bank includes multiple filters that are not uniformly distributed on the frequency axis: there are many filters in the low-frequency region and they are densely distributed, while in the high-frequency region the number of filters becomes smaller and the distribution is sparse.
  • the resolution of the Mel scale filter bank in the low frequency part is high, which is consistent with the hearing characteristics of the human ear, which is also the physical meaning of the Mel scale.
  • The frequency-domain signal is segmented by the Mel scale filter bank so that each frequency segment corresponds to one energy value; if the number of filters is 22, the Mel power spectrum of the training speech data will correspondingly consist of 22 energy values.
  • the acquired Mel power spectrum retains a frequency portion closely related to the characteristics of the human ear, and this frequency portion can well reflect the characteristics of the training speech data.
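  • A sketch of a 22-filter Mel scale filter bank is shown below (22 filters matches the example above; the FFT size and sample rate are assumptions). Multiplying the power spectrum by the transpose of this filter bank yields one energy value per filter, i.e. the Mel power spectrum.

```python
import numpy as np

def mel_filterbank(n_filters: int = 22, n_fft: int = 512,
                   sample_rate: int = 16000) -> np.ndarray:
    """Triangular filters spaced evenly on the Mel scale: dense at low, sparse at high frequencies."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bin_idx = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bin_idx[m - 1], bin_idx[m], bin_idx[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# mel_power = power_spec @ mel_filterbank().T   # one energy value per frame and per filter
```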
  • S14 Perform cepstrum analysis on the Mel power spectrum to obtain the Mel frequency cepstrum coefficient of the training speech data, and determine the obtained Mel frequency cepstrum coefficient as the training speech feature.
  • Cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier-transform spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
  • Through cepstrum analysis, the features contained in the Mel power spectrum of the training speech data, whose original feature dimension is too high to be used directly, can be converted into training speech features that can be used directly in the model training process.
  • the training speech feature is the Mel frequency cepstrum coefficient.
  • In steps S11-S14, the training voice features are extracted based on the training voice data.
  • the training voice feature may specifically be a Mel frequency cepstrum coefficient, which can well reflect the training voice data.
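  • The cepstral analysis step can be sketched as a logarithm followed by a discrete cosine transform; keeping 13 coefficients (to which deltas and delta-deltas are usually appended to reach the 39 dimensions mentioned above) is an assumption for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def mel_power_to_mfcc(mel_power: np.ndarray, n_mfcc: int = 13) -> np.ndarray:
    """Cepstrum analysis of the Mel power spectrum: log followed by a type-II DCT."""
    log_mel = np.log(mel_power + 1e-10)                              # avoid log(0)
    return dct(log_mel, type=2, axis=-1, norm='ortho')[:, :n_mfcc]   # Mel frequency cepstrum coefficients
```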
  • the Universal Background Model is a Gaussian Mixture Model (GMM) that represents a large number of non-specific speaker voice feature distributions.
  • A Gaussian mixture model is a model that uses Gaussian probability density functions (normal distribution curves) to quantify things precisely, decomposing one thing into several Gaussian probability density functions.
  • the target background model is a model obtained by reducing the feature dimension of the general background model.
  • training a general background model based on the training voice feature can obtain a target background model.
  • The target background model represents the speech features of the training speech data in a lower feature dimension, so that when performing calculations related to the target background model (such as using the target background model to adaptively process the target speaker's speech data), the amount of calculation is greatly reduced and efficiency is improved.
  • step S20 obtaining the target background model based on the training speech features includes the following steps:
  • S21 Use the training speech features to train a general background model.
  • The expression of the general background model is a Gaussian probability density function: p(x) = Σ_{k=1..K} C_k·N(x; m_k, R_k), where x represents the training speech features, K represents the number of Gaussian distributions that make up the general background model, C_k represents the coefficient of the k-th mixture Gaussian, and N(x; m_k, R_k) represents a Gaussian distribution whose mean m_k is a D-dimensional vector and whose covariance R_k is a D×D diagonal matrix.
  • training the general background model is actually to find the parameters (C k , m k and R k ) in the expression.
  • Since the expression of the general background model is a Gaussian probability density function, the expectation-maximization algorithm (EM algorithm) may be employed to obtain the parameters (C_k, m_k and R_k) in the expression.
  • the EM algorithm is an iterative algorithm used to perform maximum likelihood estimation or maximum posterior probability estimation on a probability parameter model containing hidden variables.
  • hidden variables refer to unobservable random variables, but hidden variables can be inferred from samples of observable variables.
  • In the training process, the parameters cannot be observed directly (they are hidden), so the parameters in the universal background model are actually hidden variables.
  • the parameters in the universal background model can be obtained based on the maximum likelihood estimation or the maximum posterior probability estimation. After obtaining the parameters, the universal background model can be obtained.
  • the EM algorithm is a commonly used mathematical method for calculating the probability density function containing hidden variables, and the mathematical method is not described in detail here.
  • Obtaining the general background model provides an important basis for subsequently obtaining the target voiceprint feature recognition model based on the general background model when the target speaker's voice data is scarce or insufficient.
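  • As a hedged sketch of step S21, the universal background model can be trained with an off-the-shelf Gaussian mixture implementation that runs the EM algorithm internally; the number of Gaussians (64) is an assumption, and the diagonal covariance option corresponds to the diagonal matrices R_k described above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(training_features: np.ndarray, n_components: int = 64) -> GaussianMixture:
    """EM training of the universal background model on (num_frames, 39) MFCC vectors."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',   # diagonal covariance matrices R_k
                          max_iter=200)
    ubm.fit(training_features)                      # EM: estimate C_k, m_k, R_k
    return ubm                                      # weights_ ~ C_k, means_ ~ m_k, covariances_ ~ R_k
```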
  • S22 Use singular value decomposition to perform feature dimensionality reduction processing on the general background model to obtain the target background model.
  • In the expression of the general background model, p(x) = Σ_{k=1..K} C_k·N(x; m_k, R_k), x represents the training speech features, K represents the number of Gaussian distributions that make up the general background model, C_k represents the coefficient of the k-th mixture Gaussian, and N(x; m_k, R_k) represents a Gaussian distribution whose mean m_k is a D-dimensional vector and whose covariance R_k is a D×D diagonal matrix.
  • the general background model is represented by a Gaussian probability density function.
  • The covariance matrices R_k in the parameters of the general background model are represented as matrices, and singular value decomposition can be used to perform feature dimensionality reduction processing on the general background model so as to remove the noise data in the general background model.
  • Singular value decomposition (SVD) is an important matrix factorization in linear algebra; it is a generalization of the unitary diagonalization of normal matrices in matrix analysis and has important applications in signal processing and statistics.
  • In this embodiment, singular value decomposition is used to perform feature dimensionality reduction on the general background model.
  • The singular value decomposition of the covariance matrix is R_k = UΣV^T = Σ_i σ_i·u_i·v_i^T, where each coefficient σ_i on the right side of the equation is a singular value, Σ is a diagonal matrix, U is a square matrix whose column vectors u_i are orthogonal and are called the left singular vectors, V is a square matrix whose column vectors v_i are orthogonal and are called the right singular vectors, and T denotes matrix transposition. Each u_i·v_i^T is a matrix of rank 1, and the singular values satisfy σ_1 ≥ σ_2 ≥ … ≥ σ_n > 0.
  • A larger singular value indicates that the sub-item σ_i·u_i·v_i^T corresponding to that singular value represents a more important feature in R_k, while features whose singular values are smaller are considered less important features.
  • Therefore, feature dimensionality reduction can be performed on the matrices in the parameters of the general background model: the sub-items with smaller singular values are removed, reducing the general background model with a higher feature dimension to the target background model with a lower feature dimension.
  • This feature dimensionality reduction process does not weaken the ability of the general background model to express features, but actually enhances it, because the feature dimensions removed during singular value decomposition are those with small singular values, and these small-singular-value features are in fact the noise part introduced when training the general background model. Therefore, using singular value decomposition to perform feature dimensionality reduction on the general background model removes the feature dimensions representing the noise part and yields the target background model (the target background model is an optimized general background model; it can replace the original universal background model to adaptively process the target speaker's speech data and achieve better results).
  • The target background model expresses the speech features of the training speech data well in a lower feature dimension, so the amount of calculation is greatly reduced and efficiency is improved when performing calculations related to the target background model (such as using the target background model to adaptively process the target speaker's speech data).
  • In steps S21-S22, obtaining the general background model provides an important basis for subsequently obtaining the target voiceprint feature recognition model based on the general background model when the target speaker's voice data is scarce or insufficient, and applying the singular value decomposition feature reduction method to the general background model yields the target background model, which expresses the speech features of the training speech data in a lower feature dimension and improves efficiency in calculations related to the target background model.
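  • A minimal sketch of the singular-value-decomposition dimensionality reduction is given below; keeping the terms that account for 95% of the singular-value energy is an illustrative threshold, since the text only states that the sub-items with small singular values are treated as noise and removed.

```python
import numpy as np

def reduce_by_svd(R_k: np.ndarray, energy_kept: float = 0.95) -> np.ndarray:
    """Drop the sub-items sigma_i * u_i * v_i^T of R_k whose singular values are small."""
    u, s, vt = np.linalg.svd(R_k)                  # R_k = U diag(s) V^T, s sorted descending
    cumulative = np.cumsum(s) / np.sum(s)
    r = int(np.searchsorted(cumulative, energy_kept)) + 1   # number of terms to keep
    return u[:, :r] @ np.diag(s[:r]) @ vt[:r, :]   # lower-dimensional approximation of R_k
```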
  • S30 Obtain target voice data, and extract target voice features based on the target voice data.
  • the target voice data refers to voice data associated with a specific target user.
  • the target user is associated with a user ID, and the corresponding user can be uniquely identified by the user ID. Understandably, when it is necessary to train a target voiceprint feature recognition model or a target voice feature recognition model related to certain users, these users are the target users.
  • a user ID is an identifier that uniquely identifies a user.
  • target voice data is acquired.
  • the target voice data cannot be directly recognized by a computer and cannot be used for model training. Therefore, it is necessary to extract target speech features based on the target speech data, and convert the target speech data into target speech features that can be recognized by a computer.
  • the target speech feature may specifically be a Mel frequency cepstrum coefficient. For specific extraction processes, refer to S11-S14, and details are not described herein again.
  • S40 Use the target background model to adaptively process the target voice features to obtain the target voiceprint feature recognition model.
  • the target voiceprint feature recognition model refers to a voiceprint feature recognition model related to the target user.
  • the target voice data is difficult to obtain in some scenarios (for example, in a scenario where a bank or the like processes a service), so there are fewer data samples based on the target voice features provided by the target voice data.
  • the target voiceprint feature recognition model obtained by directly training the target voice features with few data samples has a very poor effect in the subsequent calculation of the target voiceprint features, and cannot reflect the voice (voiceprint) features of the target voice features. Therefore, in this embodiment, a target background model is required to adaptively process a target voice feature to obtain a corresponding target voiceprint feature recognition model, so that the accuracy of the target voiceprint feature recognition model obtained is higher.
  • the target background model is a Gaussian mixture model representing a large number of non-specific speech feature distributions.
  • In the adaptive processing, a large number of non-specific speech features in the target background model are adaptively added to the target speech features; this is equivalent to taking the part of the non-specific speech features in the target background model that is close to the target speech features and training it together with the target speech features, which can well "supplement" the target speech features for training the target voiceprint feature recognition model.
  • adaptive processing refers to a method of processing a part of non-specific speech features in the target background model that are close to the target speech features as target speech features.
  • The adaptive processing may specifically use a maximum a posteriori estimation algorithm (Maximum A Posteriori, referred to as MAP).
  • Maximum a posteriori estimation obtains an estimate of a quantity that is difficult to observe directly, on the basis of empirical data.
  • In maximum a posteriori estimation, the prior probability and Bayes' theorem are used to obtain the posterior probability.
  • Specifically, the objective function (that is, the expression representing the target voiceprint feature recognition model) is taken as the likelihood function of the posterior probability, and the parameter values at which the likelihood function is maximal are obtained (the gradient descent algorithm can be used to find the maximum of the likelihood function). In this way, the part of the non-specific speech features in the target background model that is close to the target speech features is in effect trained together with the target speech features, and the target voiceprint feature recognition model corresponding to the target speech features is obtained from the parameter values found when the likelihood function is maximized.
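  • The classic closed-form mean-only MAP adaptation shown below is a sketch of this adaptive processing (the original text maximizes the likelihood with gradient descent; the relevance factor of 16 is an assumption): the UBM means are pulled towards the scarce target speech features in proportion to how much target data each Gaussian actually sees.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, target_features: np.ndarray,
                    relevance: float = 16.0) -> np.ndarray:
    """MAP-adapt the UBM means towards the target speech features (mean-only adaptation)."""
    gamma = ubm.predict_proba(target_features)        # posteriors of each frame, shape (T, K)
    n_k = gamma.sum(axis=0) + 1e-10                   # soft frame count per Gaussian
    e_k = gamma.T @ target_features / n_k[:, None]    # per-Gaussian mean of the target data
    alpha = n_k / (n_k + relevance)                   # adaptation coefficient per Gaussian
    return alpha[:, None] * e_k + (1.0 - alpha[:, None]) * ubm.means_
```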
  • S50 Input target voice features into a deep neural network for training, and obtain a target voice feature recognition model.
  • the target speech feature recognition model refers to a speech feature recognition model related to the target user.
  • a deep neural network (DNN) model includes an input layer, a hidden layer, and an output layer composed of neurons.
  • the deep neural network model includes the weights and biases of each neuron connection between layers. These weights and biases determine the nature and recognition effect of the DNN model.
  • a target speech feature is input into a deep neural network model for training, and network parameters (ie weights and biases) of the deep neural network model are updated to obtain a target speech feature recognition model.
  • the target speech features include key speech features of the target speech data.
  • the target speech features are trained in a DNN model to further extract the features of the target speech data, and perform deep feature extraction based on the target speech features.
  • the deep features are expressed by network parameters in the target speech feature recognition model, and based on the extracted deep features, a more accurate recognition effect can be achieved when the target speech recognition model is subsequently used for recognition.
  • step S50 the target voice feature is input into a deep neural network for training, and the target voice feature recognition model is obtained, including the following steps:
  • the DNN model is initialized.
  • This initialization operation is to set initial values of weights and offsets in the DNN model.
  • The initial values may be set to small values, for example in the interval [-0.3, 0.3].
  • Reasonable initialization of the DNN model can make the DNN model have more flexible adjustment ability in the early stage, and the model can be adjusted effectively during the DNN model training process, so that the trained DNN model has a better recognition effect.
  • the target voice features are grouped into the deep neural network model, and the output value of the deep neural network model is obtained according to the forward propagation algorithm.
  • the target voice feature is first divided into a preset number of samples, and then grouped and input into the DNN model for training, that is, the grouped samples are respectively input into the DNN model for training.
  • The DNN forward propagation algorithm performs a series of linear operations and activation operations in the DNN model based on the weights W, biases b, and input values (vectors x_i) of each neuron; starting from the input layer, calculations are carried out layer by layer until the output layer produces its output value.
  • the output value of each layer of the network in the DNN model can be calculated until the output value of the output layer (that is, the output value of the DNN model) is calculated.
  • the activation function specifically used here may be a sigmoid or tanh activation function.
  • Forward propagation is performed layer by layer according to the number of layers to obtain the final output value a_{i,L} of the network in the DNN model (that is, the output value of the deep neural network model). The network parameters of the DNN model (the connection weights W and biases b of each neuron) can then be adjusted according to this output value in order to obtain a target speech feature recognition model that performs more accurate speech recognition.
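  • A sketch of the forward propagation described above is given below (layer sizes and the sigmoid activation are assumptions consistent with the text): each layer applies a linear operation followed by an activation, and the last activation is the output value a_{i,L}.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x: np.ndarray, weights: list, biases: list) -> list:
    """Layer-by-layer computation a^l = sigmoid(W^l a^{l-1} + b^l); returns all activations."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations            # activations[-1] is the DNN output value a_{i,L}
```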
  • S53 Perform error back propagation based on the output value of the deep neural network model, update the weights and offsets of each layer of the deep neural network model, and obtain the target speech feature recognition model.
  • The calculation formula for updating the weights is W^l = W^l − (α/m)·Σ_{i=1..m} δ^{i,l}·(a^{i,l−1})^T, where l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target speech features, and δ^{i,l} is the sensitivity of the current layer.
  • Specifically, a label value can be set in advance (the label value is compared with the output value a_{i,L} according to the actual situation to obtain an error value), the error generated when the target speech features are trained in the DNN model is calculated, a suitable error function is constructed based on the error (for example, an error function that uses the mean square error to measure the error), and error back-propagation is performed according to the error function to adjust and update the weight W and the bias b of each layer of the DNN model.
  • the back-propagation algorithm is used to update the weights W and offsets b of each layer of the DNN model, and the minimum value of the error function is calculated according to the back-propagation algorithm to optimize and update the weights W and offsets b of each layer of the DNN model.
  • Before training, the iteration step size of the model training is set to α, the maximum number of iterations to MAX, and the stop-iteration threshold to ε.
  • The sensitivity δ^{i,l} is a common factor that appears in every parameter update, so the error can be propagated by means of the sensitivity δ^{i,l} to update the network parameters in the DNN model.
  • When the changes of the weights W and biases b are all smaller than the stop-iteration threshold ε, the training can be stopped; alternatively, when the training reaches the maximum number of iterations MAX, the training is stopped.
  • Through the error back-propagation, the weights W and biases b of each layer of the DNN model are updated, so that the obtained target speech feature recognition model can perform speech recognition.
  • Steps S51-S53 train the DNN model by using the target speech features, so that the target speech feature recognition model obtained through training can recognize speech.
  • the target speech feature recognition model further extracts the deep features of the target speech feature during the model training process.
  • the trained weights and offsets in the model reflect the deep features based on the target speech feature. Therefore, the target speech feature recognition model can recognize based on the deep features learned through training, and achieve more accurate speech recognition.
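  • A single-sample sketch (m = 1) of the error back-propagation update is shown below; it uses the mean-square-error derivative and the sigmoid derivative a(1 − a), and plays the role of the weight and bias update described above.

```python
import numpy as np

def backward_update(activations: list, label: np.ndarray,
                    weights: list, biases: list, alpha: float) -> None:
    """One back-propagation pass updating W and b in place (single sample, step size alpha)."""
    a_out = activations[-1]
    delta = (a_out - label) * a_out * (1.0 - a_out)          # output-layer sensitivity
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]
        grad_W = np.outer(delta, a_prev)                     # dE/dW^l
        grad_b = delta                                       # dE/db^l
        if l > 0:                                            # propagate sensitivity backwards
            delta = (weights[l].T @ delta) * a_prev * (1.0 - a_prev)
        weights[l] -= alpha * grad_W
        biases[l] -= alpha * grad_b
```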
  • S60 Associate the target voiceprint feature recognition model and the target voice feature recognition model in a database.
  • the two models are associated and stored in a database. Specifically, the association between the models is performed through the user ID of the target user, and the target voiceprint feature recognition model and the target voice feature recognition model corresponding to the same user ID are stored in a database in the form of a file.
  • By storing the two models in association, the target voiceprint feature recognition model and the target voice feature recognition model corresponding to the user identifier can be called during the voice recognition stage, so that the two models can be combined for voice recognition, overcoming the errors introduced when each model performs recognition separately and further improving the accuracy of speech recognition.
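  • One way to realize the associated storage (a sketch only; the table and column names are assumptions) is to keep, for each user identifier, the file paths of the two trained models in the same database row:

```python
import sqlite3

def store_models(db_path: str, user_id: str,
                 voiceprint_model_file: str, speech_model_file: str) -> None:
    """Store the two model files for one user ID in association in a database."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS voice_models (
                        user_id TEXT PRIMARY KEY,
                        voiceprint_model_path TEXT,
                        speech_model_path TEXT)""")
    conn.execute("INSERT OR REPLACE INTO voice_models VALUES (?, ?, ?)",
                 (user_id, voiceprint_model_file, speech_model_file))
    conn.commit()
    conn.close()
```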
  • a target background model is obtained by using the extracted training speech features.
  • The target background model is obtained from the general background model using the singular value decomposition feature dimensionality reduction method; it expresses the speech features of the training speech data well in a lower feature dimension, which improves efficiency when performing calculations related to the target background model.
  • the target background model is used to adaptively process the extracted target speech features to obtain a voiceprint feature recognition model.
  • Because the target background model covers speech features of the training speech data in multiple dimensions, the target speech features, for which only a small amount of data may be available, can be adaptively supplemented through the target background model, so that the target voiceprint feature recognition model can still be obtained when the amount of data is small.
  • The target voiceprint feature recognition model can recognize voiceprint features that represent the target voice features in lower dimensions and thereby perform voice recognition. The target speech features are then input into the deep neural network for training to obtain the target speech feature recognition model, which deeply learns the target speech features and can perform speech recognition with high accuracy. Finally, the target voiceprint feature recognition model and the target voice feature recognition model are stored in the database in association, and the two models together form an overall voice model; this voice model organically combines the target voiceprint feature recognition model and the target voice feature recognition model, and using the overall voice model for speech recognition improves the accuracy of speech recognition.
  • FIG. 7 is a schematic diagram of a speech model training device that corresponds to the speech model training method in the embodiment.
  • the speech model training device includes a training speech feature extraction module 10, a target background model acquisition module 20, a target speech feature extraction module 30, a target voiceprint feature recognition model acquisition module 40, a speech feature recognition acquisition module 50, and Model storage module 60.
  • The functions implemented by the training speech feature extraction module 10, target background model acquisition module 20, target speech feature extraction module 30, target voiceprint feature recognition model acquisition module 40, speech feature recognition acquisition module 50, and model storage module 60 correspond one-to-one to the steps of the speech model training method in the embodiment; to avoid redundant description, this embodiment does not detail them one by one.
  • Training voice feature extraction module 10 configured to obtain training voice data, and extract training voice features based on the training voice data
  • a target background model acquisition module 20 configured to acquire a target background model based on the training speech features
  • a target voice feature extraction module 30 configured to obtain target voice data, and extract target voice features based on the target voice data
  • the target voiceprint feature recognition model acquisition module 40 is configured to adaptively process the target voice feature using the target background model to obtain the target voiceprint feature recognition model;
  • Speech feature recognition acquisition module 50 configured to input target speech features into a deep neural network for training, and obtain a target speech feature recognition model
  • the model storage module 60 is configured to store the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
  • the training speech feature extraction module 10 includes a preprocessing unit 11, a power spectrum acquisition unit 12, a Mel power spectrum acquisition unit 13, and a training speech feature determination unit 14.
  • the preprocessing unit 11 is configured to preprocess the training voice data.
  • a power spectrum obtaining unit 12 is configured to perform a fast Fourier transform on the pre-processed training voice data, obtain a frequency spectrum of the training voice data, and obtain a power spectrum of the training voice data according to the frequency spectrum.
  • the Mel power spectrum obtaining unit 13 is configured to process the power spectrum of the training speech data by using a Mel scale filter bank, and obtain a Mel power spectrum of the training speech data.
  • the training speech feature determining unit 14 is configured to perform cepstrum analysis on the Mel power spectrum, obtain Mel frequency cepstrum coefficients of training speech data, and determine the obtained Mel frequency cepstrum coefficients as training speech features.
  • the pre-processing unit 11 includes a pre-emphasis sub-unit 111, a frame sub-unit 112, and a windowing sub-unit 113.
  • the pre-emphasis sub-unit 111 is configured to perform pre-emphasis processing on the training voice data.
  • the frame sub-unit 112 is configured to perform frame processing on the pre-emphasized training voice data.
  • a windowing sub-unit 113 is configured to perform windowing processing on the framed processing speech data.
  • the target background model acquisition module 20 includes a general background model acquisition unit 21 and a target background model acquisition unit 22.
  • the universal background model obtaining unit 21 is configured to use the training voice feature to perform a universal background model training to obtain a universal background model.
  • the target background model obtaining unit 22 is configured to perform dimensionality reduction processing on the general background model by using singular value decomposition to obtain a target background model.
  • the speech feature recognition acquisition module 50 includes an initialization unit 51, an output value acquisition unit 52, and a target speech feature recognition model acquisition unit 53.
  • the initialization unit 51 is configured to initialize a deep neural network model.
  • An output value obtaining unit 52 is configured to group the target speech features into the deep neural network model, and obtain the output values of the deep neural network model according to the forward propagation algorithm.
  • That is, the i-th group of samples of the target speech features is input into the deep neural network model to obtain the corresponding output value.
  • FIG. 8 shows a flowchart of a speech recognition method in an embodiment.
  • the speech recognition method can be applied to the computer equipment of financial institutions such as banks, securities, investment, and insurance, or other institutions that need to perform speech recognition to achieve the purpose of speech recognition by artificial intelligence.
  • the computer device is a device that can perform human-computer interaction with a user, including, but not limited to, a computer, a smart phone, and a tablet.
  • the speech recognition method includes the following steps:
  • S71 Acquire speech data to be identified, and the speech data to be identified is associated with a user identifier.
  • the voice data to be identified refers to voice data of a user to be identified.
  • the user identifier is an identifier for uniquely identifying the user.
  • the user identifier may be an identifier that can uniquely identify the user, such as an ID card number or a phone number.
  • acquiring the voice data to be identified may be specifically collected through a recording module built in a computer device or an external recording device.
  • The voice data to be recognized is associated with the user identifier, so that it can be judged, on the basis of the models stored in association with that user identifier, whether the voice data to be recognized is the user's own voice.
  • S72 Query the database based on the user ID to obtain the target voiceprint feature recognition model and target voice feature recognition model that are stored in association.
  • The target voiceprint feature recognition model and the target voice feature recognition model are models obtained by the voice model training method provided in the foregoing embodiment.
  • a database is queried according to the user identifier, and a target voiceprint feature recognition model and a target voice feature recognition model associated with the user identifier are obtained in the database.
  • The target voiceprint feature recognition model and target voice feature recognition model stored in association are stored in the database in the form of files; after the database is queried, the model files corresponding to the user identifier are called, so that the computer device can use the target voiceprint feature recognition model and the target voice feature recognition model for voice recognition.
  • After the voice data to be recognized is acquired, it cannot be directly recognized by a computer and cannot be directly used for voice recognition. Therefore, the corresponding speech features to be recognized need to be extracted from the voice data to be recognized, converting the voice data to be recognized into speech features that a computer can recognize.
  • the feature of the speech to be recognized may specifically be a Mel frequency cepstrum coefficient, and the specific extraction process refers to S11-S14, which is not described in detail here.
  • S74 Input the speech feature to be recognized into the target speech feature recognition model, and obtain a first score.
  • Specifically, the speech features to be recognized are input into the target speech feature recognition model, and the model performs recognition and calculation on them to obtain the first score.
  • S75 Input the speech data to be recognized into the target voiceprint feature recognition model, and obtain a second score.
  • the voice data to be recognized is input into the target voiceprint feature recognition model for recognition.
  • Specifically, a similarity comparison (such as cosine similarity) is performed between the voiceprint features to be recognized and the target voiceprint features corresponding to the target voice features: the higher the similarity, the closer the voiceprint features to be recognized are to the target voiceprint features, and the more likely the speech is the user's own voice. The target voiceprint features corresponding to the target voice features used in training the target voiceprint feature recognition model can be computed in the same way as the voiceprint features to be recognized are obtained from the voice data to be recognized, and the cosine similarity between the voiceprint features to be recognized and the target voiceprint features is taken as the second score.
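  • The cosine-similarity comparison can be sketched as follows; the voiceprint vectors are assumed to have already been derived from the voice data to be recognized and from the target voiceprint feature recognition model respectively.

```python
import numpy as np

def cosine_second_score(voiceprint_to_identify: np.ndarray,
                        target_voiceprint: np.ndarray) -> float:
    """Cosine similarity between the voiceprint to be identified and the target voiceprint."""
    num = float(np.dot(voiceprint_to_identify, target_voiceprint))
    den = float(np.linalg.norm(voiceprint_to_identify) * np.linalg.norm(target_voiceprint))
    return num / den if den > 0.0 else 0.0    # closer to 1 means more likely the same user
```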
  • S76 Multiply the first score with a preset first weighted ratio to obtain a first weighted score, multiply the second score with a preset second weighted ratio, obtain a second weighted score, and sum the first weighted score and The second weighted scores are added to obtain the target score.
  • The weighting overcomes the respective shortcomings of the target voiceprint feature recognition model and the target voice feature recognition model in a targeted manner. Understandably, when the target voice feature recognition model is used to obtain the first score, the speech features to be recognized have a high dimension and therefore include some interfering speech features (such as noise), so there is a certain error between the first score and the actual result; when the target voiceprint feature recognition model is used to obtain the second score, the voiceprint features to be recognized have a low dimension, so some features that can represent the voice data to be recognized are inevitably lost, and the second score obtained by using that model alone also has a certain error with respect to the actual result.
  • Because of these errors in the first score and the second score, the first score is multiplied by a preset first weighted proportion to obtain a first weighted score, the second score is multiplied by a preset second weighted proportion to obtain a second weighted score, and the first weighted score and the second weighted score are added to obtain the target score, which is the final output score. In this way, the error of the first score and the error of the second score can be overcome; the two errors can be considered to cancel each other out, so that the target score is closer to the actual result and the accuracy of speech recognition is improved.
  • It is then judged whether the target score is greater than a preset score threshold. If the target score is greater than the preset score threshold, the speech data to be recognized is considered to be the target speech data corresponding to the user identifier, that is, the user's own speech data; if the target score is not greater than the preset score threshold, the speech data to be recognized is not considered to be the user's own speech data.
  • the preset score threshold refers to a preset threshold used to measure whether the speech data to be identified is target speech data corresponding to the user identifier, and the threshold is expressed in the form of a score. For example, if the preset score threshold is set to 0.95, the speech data to be recognized with a target score greater than 0.95 is the target speech data corresponding to the user identification, and the speech data to be recognized with a target score not greater than 0.95 is not considered to be the user's own corresponding Voice data.
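  • The weighted fusion and threshold decision can be sketched as follows; the weights 0.6 and 0.4 are illustrative assumptions, and 0.95 is the example threshold given above.

```python
def is_target_speaker(first_score: float, second_score: float,
                      w1: float = 0.6, w2: float = 0.4,
                      threshold: float = 0.95) -> bool:
    """Fuse the two scores with preset weights and compare against the score threshold."""
    target_score = first_score * w1 + second_score * w2
    return target_score > threshold    # True: the speech is judged to be the user's own
```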
  • In this embodiment, the extracted speech features to be recognized are input into the speech model, the first score related to the target speech feature recognition model and the second score related to the target voiceprint feature recognition model are obtained, the target score is obtained through a weighted operation, and the speech recognition result is obtained from the target score.
  • The first score reflects the probability of the speech recognition result from the higher-dimensional speech features to be recognized; because the dimension is high, some interfering speech features (such as noise) are included, so there is an error between the first score and the actual output that affects the speech recognition result.
  • The second score reflects the probability of the speech recognition result from the lower-dimensional voiceprint features; because the dimension is low, some key speech features are inevitably lost, so there is an error between the second score and the actual output that affects the speech recognition result.
  • The target score obtained by the weighted operation addresses the respective shortcomings of the target speech feature recognition model and the target voiceprint feature recognition model and overcomes the errors of the first score and the second score; the two errors can be considered to cancel each other out, so that the target score is closer to the actual result and the accuracy of speech recognition is improved.
  • FIG. 9 is a schematic diagram of a speech recognition device corresponding to the speech recognition method in the embodiment.
  • the voice recognition device includes a to-be-recognized voice data acquisition module 70, a model acquisition module 80, a to-be-recognized speech feature extraction module 90 and a first score acquisition module 100, a second score acquisition module 110, and a target score acquisition module. 120 and a voice determination module 130.
  • the realized functions of the to-be-recognized voice data acquisition module 70, model acquisition module 80, to-be-recognized voice feature extraction module 90, first score acquisition module 100, second score acquisition module 110, target score acquisition module 120, and voice determination module 130 The steps corresponding to the speech recognition method in the embodiment are one-to-one. To avoid redundant descriptions, this embodiment does not detail them one by one.
  • the to-be-recognized voice data acquisition module 70 is configured to obtain the to-be-recognized voice data, and the to-be-recognized voice data is associated with a user identifier.
  • the model acquisition module 80 is configured to query a database based on the user identifier to obtain the target voiceprint feature recognition model and the target voice feature recognition model stored in association with it.
  • the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by the speech model training method provided in the foregoing embodiment.
  • the to-be-recognized voice feature extraction module 90 is configured to extract the to-be-recognized voice features based on the to-be-recognized voice data.
  • the first score obtaining module 100 is configured to input a voice feature to be recognized into a target voice feature recognition model, and obtain a first score.
  • a second score obtaining module 110 is configured to input the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score.
  • the target score acquisition module 120 is configured to multiply the first score by a preset first weighting ratio to obtain a first weighted score, multiply the second score by a preset second weighting ratio to obtain a second weighted score, and add the first weighted score and the second weighted score to obtain the target score.
  • the voice determining module 130 is configured to determine that the voice data to be recognized is target voice data corresponding to a user identifier if the target score is greater than a preset score threshold.
  • This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions.
  • When the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the speech model training method in the embodiment. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions are executed, the one or more processors implement the functions of each module/unit of the speech model training device in the embodiment. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions are executed, the one or more processors implement the steps of the speech recognition method in the embodiment. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions are executed by the one or more processors, the functions of each module/unit of the speech recognition device in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the computer device 140 of this embodiment includes a processor 141, a memory 142, and computer-readable instructions 143 stored in the memory 142 and executable on the processor 141.
  • When the computer-readable instructions 143 are executed by the processor 141, the steps of the speech model training method in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions 143 are executed by the processor 141, the functions of each module/unit of the speech model training device in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions 143 are executed by the processor 141, the steps of the speech recognition method in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • Alternatively, when the computer-readable instructions 143 are executed by the processor 141, the functions of each module/unit of the speech recognition device in the embodiment are implemented. To avoid repetition, details are not repeated here.
  • the computer device 140 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server.
  • the computer device may include, but is not limited to, a processor 141 and a memory 142.
  • FIG. 10 is only an example of the computer device 140 and does not constitute a limitation on it; the computer device 140 may include more or fewer components than shown in the figure, combine certain components, or have different components.
  • the computer device may also include input and output devices, network access devices, buses, and the like.
  • the processor 141 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the memory 142 may be an internal storage unit of the computer device 140, such as a hard disk or a memory of the computer device 140.
  • the memory 142 may also be an external storage device of the computer device 140, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 140.
  • the memory 142 may also include both an internal storage unit of the computer device 140 and an external storage device.
  • the memory 142 is used to store the computer-readable instructions 143 and other programs and data required by the computer device.
  • the memory 142 may also be used to temporarily store data that has been output or is to be output.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware or in the form of software functional units.

Abstract

A voice model training method, a voice recognition method, device and equipment, and a medium. The voice model training method comprises: acquiring training voice data, and extracting a training voice feature; acquiring a target background model on the basis of the training voice feature; acquiring target voice data, and extracting a target voice feature; carrying out adaptive processing on the target voice feature by using the target background model, thus acquiring a target voiceprint feature recognition model; inputting the target voice feature into a deep neural network for training, thus acquiring a target voice feature recognition model; and storing the target voiceprint feature recognition model and the target voice feature recognition model in a database in an associative manner.

Description

Speech model training method, speech recognition method, apparatus, device and medium
This application claims priority to Chinese Patent Application No. 201810551458.4, filed on May 31, 2018 and entitled "Speech Model Training Method, Speech Recognition Method, Apparatus, Device and Medium".
Technical field
The present application relates to the field of speech recognition technology, and in particular, to a speech model training method, a speech recognition method, an apparatus, a device, and a medium.
Background
At present, speech recognition is mostly performed on the basis of speech features. Some of these features have too high a dimension and contain too much non-critical information, while others have too low a dimension and cannot fully reflect the characteristics of the speech. As a result, the accuracy of current speech recognition is low and speech cannot be recognized effectively, which restricts the application of speech recognition.
Summary of the invention
The embodiments of the present application provide a speech model training method, apparatus, device, and medium, so as to solve the problem of low accuracy of current speech recognition.
A speech model training method includes:
acquiring training speech data, and extracting training speech features based on the training speech data;
acquiring a target background model based on the training speech features;
acquiring target speech data, and extracting target speech features based on the target speech data;
adaptively processing the target speech features using the target background model to obtain a target voiceprint feature recognition model;
inputting the target speech features into a deep neural network for training to obtain a target speech feature recognition model; and
storing the target voiceprint feature recognition model and the target speech feature recognition model in a database in association.
A speech model training apparatus includes:
a training speech feature extraction module, configured to acquire training speech data and extract training speech features based on the training speech data;
a target background model acquisition module, configured to acquire a target background model based on the training speech features;
a target speech feature extraction module, configured to acquire target speech data and extract target speech features based on the target speech data;
a target voiceprint feature recognition model acquisition module, configured to adaptively process the target speech features using the target background model to obtain a target voiceprint feature recognition model;
a speech feature recognition model acquisition module, configured to input the target speech features into a deep neural network for training to obtain a target speech feature recognition model; and
a model storage module, configured to store the target voiceprint feature recognition model and the target speech feature recognition model in a database in association.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:
acquiring training speech data, and extracting training speech features based on the training speech data;
acquiring a target background model based on the training speech features;
acquiring target speech data, and extracting target speech features based on the target speech data;
adaptively processing the target speech features using the target background model to obtain a target voiceprint feature recognition model;
inputting the target speech features into a deep neural network for training to obtain a target speech feature recognition model; and
storing the target voiceprint feature recognition model and the target speech feature recognition model in a database in association.
One or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
acquiring training speech data, and extracting training speech features based on the training speech data;
acquiring a target background model based on the training speech features;
acquiring target speech data, and extracting target speech features based on the target speech data;
adaptively processing the target speech features using the target background model to obtain a target voiceprint feature recognition model;
inputting the target speech features into a deep neural network for training to obtain a target speech feature recognition model; and
storing the target voiceprint feature recognition model and the target speech feature recognition model in a database in association.
The embodiments of the present application further provide a speech recognition method, apparatus, device, and medium, so as to solve the problem of low accuracy of current speech recognition.
A speech recognition method includes:
acquiring speech data to be recognized, the speech data to be recognized being associated with a user identifier;
querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target speech feature recognition model stored in association, the target voiceprint feature recognition model and the target speech feature recognition model being models obtained by the above speech model training method;
extracting speech features to be recognized based on the speech data to be recognized;
inputting the speech features to be recognized into the target speech feature recognition model to obtain a first score;
inputting the speech data to be recognized into the target voiceprint feature recognition model to obtain a second score;
multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score; and
if the target score is greater than a preset score threshold, determining that the speech data to be recognized is the target speech data corresponding to the user identifier.
A speech recognition apparatus includes:
a to-be-recognized speech data acquisition module, configured to acquire speech data to be recognized, the speech data to be recognized being associated with a user identifier;
a model acquisition module, configured to query a database based on the user identifier to obtain a target voiceprint feature recognition model and a target speech feature recognition model stored in association, the target voiceprint feature recognition model and the target speech feature recognition model being models obtained by the above speech model training method;
a to-be-recognized speech feature extraction module, configured to extract speech features to be recognized based on the speech data to be recognized;
a first score acquisition module, configured to input the speech features to be recognized into the target speech feature recognition model to obtain a first score;
a second score acquisition module, configured to input the speech data to be recognized into the target voiceprint feature recognition model to obtain a second score;
a target score acquisition module, configured to multiply the first score by a preset first weighting ratio to obtain a first weighted score, multiply the second score by a preset second weighting ratio to obtain a second weighted score, and add the first weighted score and the second weighted score to obtain a target score; and
a speech determination module, configured to determine, if the target score is greater than a preset score threshold, that the speech data to be recognized is the target speech data corresponding to the user identifier.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:
acquiring speech data to be recognized, the speech data to be recognized being associated with a user identifier;
querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target speech feature recognition model stored in association, the target voiceprint feature recognition model and the target speech feature recognition model being models obtained by the above speech model training method;
extracting speech features to be recognized based on the speech data to be recognized;
inputting the speech features to be recognized into the target speech feature recognition model to obtain a first score;
inputting the speech data to be recognized into the target voiceprint feature recognition model to obtain a second score;
multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score; and
if the target score is greater than a preset score threshold, determining that the speech data to be recognized is the target speech data corresponding to the user identifier.
One or more non-volatile readable storage media storing computer-readable instructions are provided. When the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
acquiring speech data to be recognized, the speech data to be recognized being associated with a user identifier;
querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target speech feature recognition model stored in association, the target voiceprint feature recognition model and the target speech feature recognition model being models obtained by the above speech model training method;
extracting speech features to be recognized based on the speech data to be recognized;
inputting the speech features to be recognized into the target speech feature recognition model to obtain a first score;
inputting the speech data to be recognized into the target voiceprint feature recognition model to obtain a second score;
multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score; and
if the target score is greater than a preset score threshold, determining that the speech data to be recognized is the target speech data corresponding to the user identifier.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a diagram of an application environment of a speech model training method according to an embodiment of the present application;
FIG. 2 is a flowchart of a speech model training method according to an embodiment of the present application;
FIG. 3 is a specific flowchart of step S10 in FIG. 2;
FIG. 4 is a specific flowchart of step S11 in FIG. 3;
FIG. 5 is a specific flowchart of step S20 in FIG. 2;
FIG. 6 is a specific flowchart of step S50 in FIG. 2;
FIG. 7 is a schematic diagram of a speech model training apparatus according to an embodiment of the present application;
FIG. 8 is a flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
FIG. 1 shows the application environment of the speech model training method provided by an embodiment of the present application. The application environment includes a server and a client, which are connected through a network. The client refers to a program corresponding to the server that provides local services to the user; it is installed on a computer device capable of human-computer interaction, including but not limited to computers, smartphones, tablets, and other devices. The server can be implemented by an independent server or by a server cluster composed of multiple servers, and includes but is not limited to a file server, a database server, an application server, and a web server.
As shown in FIG. 2, FIG. 2 is a flowchart of the speech model training method in an embodiment of the present application. This embodiment is described by taking the application of the speech model training method on the server as an example. The speech model training method includes the following steps:
S10: Acquire training speech data, and extract training speech features based on the training speech data.
The training speech data is speech data used for training the target background model. The training speech data may be recordings of a large number of unidentified users collected by a recording module integrated in the computer device or a recording device connected to it, or an open-source speech training set available online may be used directly as the training speech data.
In this embodiment, the training speech data is acquired; it cannot be directly recognized by a computer and cannot be used directly to train the target background model. Therefore, training speech features need to be extracted from the training speech data, converting it into training speech features that a computer can recognize. The training speech features may specifically be Mel Frequency Cepstrum Coefficients (MFCC). The MFCC feature has 39 dimensions (represented as a vector) and can describe the training speech data well.
In an embodiment, as shown in FIG. 3, in step S10, extracting training speech features based on the training speech data includes the following steps:
S11: Preprocess the training speech data.
In this embodiment, the training speech data is preprocessed when the training speech features are extracted. Preprocessing the training speech data allows the training speech features to be extracted more effectively, so that the extracted features are more representative of the training speech data.
In an embodiment, as shown in FIG. 4, in step S11, preprocessing the training speech data includes the following steps:
S111: Perform pre-emphasis processing on the training speech data.
In this embodiment, the pre-emphasis is computed as s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain at the current moment, s_{n−1} is the signal amplitude at the previous moment, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0. Pre-emphasis is a signal processing method that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is heavily attenuated during transmission; to obtain a good signal waveform at the receiving end, the attenuated signal needs to be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the start of the transmission line to compensate for their excessive attenuation during transmission. Pre-emphasis has no effect on noise, so it effectively improves the output signal-to-noise ratio. By pre-emphasizing the training speech data, the server can remove interference caused by the vocal cords and lips during speech production, effectively compensate the suppressed high-frequency part of the training speech data, highlight the high-frequency formants, and strengthen the signal amplitude of the training speech data, which helps to extract the training speech features.
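A minimal NumPy sketch of the pre-emphasis formula s'_n = s_n − a·s_{n−1} might look as follows; the coefficient value 0.97 is an assumption within the 0.9–1.0 range stated above.

```python
import numpy as np

def pre_emphasize(signal, a=0.97):
    # s'_n = s_n - a * s_{n-1}; the first sample has no predecessor and is kept as-is.
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```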
S112: Perform framing on the pre-emphasized training speech data.
In this embodiment, the pre-emphasized training speech data is divided into frames. Framing is a speech processing technique that cuts the whole speech signal into several segments; each frame is 10–30 ms long, and the frame shift is about half a frame length. The frame shift is the overlapping region between two adjacent frames, which avoids excessive variation between adjacent frames. Framing divides the training speech data into several segments of speech data, subdividing it and facilitating the extraction of the training speech features.
S113: Perform windowing on the framed training speech data.
In this embodiment, the framed training speech data is windowed. After framing, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the deviation from the original signal. Windowing solves this problem: it makes the framed training speech data continuous and allows each frame to exhibit the characteristics of a periodic function. Windowing refers to processing the training speech data with a window function; a Hamming window may be chosen, in which case the windowing formula is
s'_n = s_n × (0.54 − 0.46·cos(2πn/(N − 1))), 0 ≤ n ≤ N − 1,
where N is the Hamming window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing. By windowing the training speech data, the server makes the framed training speech data continuous in the time domain, which helps to extract the training speech features.
In steps S111–S113, the training speech data is preprocessed by pre-emphasis, framing, and windowing, which helps to extract the training speech features from the training speech data so that the extracted features are more representative of it.
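Framing with roughly half-frame overlap and Hamming windowing could be sketched as below. The 25 ms frame length and 10 ms shift are assumptions, not values fixed by the text (which only requires 10–30 ms frames and about a half-frame shift), and the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)
    shift_len = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
    # Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift_len: i * shift_len + frame_len]
                       for i in range(n_frames)])
    return frames * window
```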
S12: Perform a fast Fourier transform on the preprocessed training speech data to obtain the spectrum of the training speech data, and obtain the power spectrum of the training speech data from the spectrum.
The fast Fourier transform (FFT) is the collective name for efficient, fast computer algorithms for computing the discrete Fourier transform. It greatly reduces the number of multiplications the computer needs; the more sampling points are transformed, the more significant the savings of the FFT algorithm.
In this embodiment, performing the fast Fourier transform on the preprocessed training speech data specifically includes the following process. First, the spectrum of the training speech data is computed from the preprocessed data using the spectrum formula
s(k) = Σ_{n=1}^{N} s(n)·e^{−2πi·kn/N}, 1 ≤ k ≤ N,
where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time index, and i is the imaginary unit. Then the power spectrum of the training speech data is computed from the obtained spectrum using the power spectrum formula
P(k) = |s(k)|² / N, 1 ≤ k ≤ N,
where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the training speech data from signal amplitudes in the time domain to signal amplitudes in the frequency domain, and then obtaining the power spectrum from the frequency-domain amplitudes, provides an important technical premise for extracting the training speech features from the power spectrum of the training speech data.
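The spectrum and power-spectrum computation can be sketched with NumPy's FFT; the 512-point FFT size is an assumption.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    # s(k): frequency-domain amplitudes of each windowed frame.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    # Power spectrum estimate |s(k)|^2 / N for each frame.
    return (np.abs(spectrum) ** 2) / n_fft
```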
S13: Process the power spectrum of the training speech data with a Mel-scale filter bank to obtain the Mel power spectrum of the training speech data.
Processing the power spectrum of the training speech data with a Mel-scale filter bank is a Mel-frequency analysis of the power spectrum, and Mel-frequency analysis is based on human auditory perception. Observation shows that the human ear behaves like a filter bank and only pays attention to certain frequency components (human hearing is selective with respect to frequency); that is, the ear only lets signals of certain frequencies pass and simply ignores certain frequencies it does not want to perceive. Specifically, the Mel-scale filter bank contains multiple filters that are not uniformly distributed on the frequency axis: there are many densely distributed filters in the low-frequency region, while in the high-frequency region the filters become fewer and more sparsely distributed. Understandably, the Mel-scale filter bank has high resolution in the low-frequency part, which matches the auditory characteristics of the human ear; this is also the physical meaning of the Mel scale. The frequency-domain signal is segmented with the Mel-scale filter bank so that each frequency band finally corresponds to one energy value; if the number of filters is 22, 22 energy values corresponding to the Mel power spectrum of the training speech data are obtained. Through the Mel-frequency analysis of the power spectrum, the obtained Mel power spectrum retains the frequency parts that are closely related to the characteristics of the human ear, and these frequency parts reflect the characteristics of the training speech data well.
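A Mel-scale filter bank with triangular filters spaced evenly on the Mel scale could be built as in the sketch below. The 22 filters match the example above, while the 16 kHz sample rate and the standard Mel conversion 2595·log10(1 + f/700) are assumptions the text does not spell out. The energies would then be obtained as `power_frames @ fbank.T`.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=512, sample_rate=16000):
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter centre frequencies: evenly spaced on the Mel scale, dense at low frequencies.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):        # rising edge of the triangular filter
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):       # falling edge of the triangular filter
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank
```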
S14: Perform cepstral analysis on the Mel power spectrum to obtain the Mel frequency cepstrum coefficients of the training speech data, and determine the obtained Mel frequency cepstrum coefficients as the training speech features.
The cepstrum is the inverse Fourier transform of the logarithm of the Fourier transform spectrum of a signal; since the Fourier spectrum is generally a complex spectrum, the cepstrum is also called the complex cepstrum. Through cepstral analysis on the Mel power spectrum, the features contained in the Mel power spectrum of the training speech data, whose original dimension is too high to be used directly, are converted into training speech features that can be used directly in the model training process; these training speech features are the Mel frequency cepstrum coefficients.
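Cepstral analysis on the Mel power spectrum is conventionally implemented as a discrete cosine transform of the log filter-bank energies; a sketch under that assumption (keeping 13 coefficients, which together with delta and delta-delta features would give the 39 dimensions mentioned earlier) is:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel_energies(mel_energies, n_ceps=13):
    # mel_energies: (n_frames, n_filters) from applying the Mel filter bank
    # to the power spectrum of each frame.
    log_mel = np.log(mel_energies + 1e-10)           # avoid log(0)
    # Type-II DCT along the filter axis plays the role of the inverse transform
    # in the cepstral analysis; keep the first n_ceps coefficients.
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```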
In steps S11–S14, the training speech features are extracted from the training speech data. The training speech features may specifically be Mel frequency cepstrum coefficients, which reflect the training speech data well.
S20: Acquire a target background model based on the training speech features.
The Universal Background Model (UBM) is a Gaussian Mixture Model (GMM) that represents the speech feature distribution of a large number of non-specific speakers. Because UBM training usually uses a large amount of speaker-independent and channel-independent speech data, the UBM can generally be regarded as a model unrelated to any specific speaker; it only fits the distribution of human speech features and does not represent a specific speaker. A Gaussian mixture model quantifies things precisely with Gaussian probability density functions (normal distribution curves), decomposing one thing into several models based on Gaussian probability density functions. The target background model is the model obtained from the universal background model after feature dimensionality reduction.
In this embodiment, after acquiring the training speech features (such as MFCC features), the universal background model is trained based on the training speech features to obtain the target background model. Compared with the universal background model, the target background model exhibits the speech features of the training speech data well with a lower feature dimension, and calculations related to the target background model (such as using it to adaptively process the target speaker's speech data) require far less computation and are more efficient.
In an embodiment, as shown in FIG. 5, in step S20, acquiring the target background model based on the training speech features includes the following steps:
S21: Perform universal background model training with the training speech features to obtain a universal background model.
In this embodiment, the universal background model is trained with the training speech features. The universal background model is expressed as a mixture of Gaussian probability density functions:
P(x) = Σ_{k=1}^{K} C_k · N(x; m_k, R_k),
where x denotes a training speech feature, K denotes the number of Gaussian distributions composing the universal background model, C_k denotes the coefficient of the k-th mixture Gaussian, and N(x; m_k, R_k) denotes a Gaussian distribution with a D-dimensional mean vector m_k and a D×D diagonal covariance matrix R_k. From this expression it can be seen that training the universal background model actually amounts to finding the parameters (C_k, m_k and R_k) in the expression. Since the expression is a Gaussian probability density function, the Expectation Maximization algorithm (EM algorithm) can be used to find these parameters. The EM algorithm is an iterative algorithm for maximum likelihood estimation or maximum a posteriori estimation of probabilistic parameter models containing latent variables. In statistics, a latent variable is an unobservable random variable about which inferences can nevertheless be made from samples of observable variables; in the training of the universal background model, the training process is unobservable (hidden), so the parameters of the universal background model are in fact latent variables. With the EM algorithm, the parameters of the universal background model can be found based on maximum likelihood or maximum a posteriori estimation, and once the parameters are found, the universal background model is obtained. The EM algorithm is a common mathematical method for computing probability density functions containing latent variables and is not described further here. Obtaining the universal background model provides an important basis for subsequently obtaining the corresponding target voiceprint feature recognition model when the target speaker's speech data is scarce or insufficient.
S22: Perform feature dimensionality reduction on the universal background model by singular value decomposition to obtain the target background model.
From the expression of the universal background model,
P(x) = Σ_{k=1}^{K} C_k · N(x; m_k, R_k),
where x denotes a training speech feature, K denotes the number of Gaussian distributions composing the universal background model, C_k denotes the coefficient of the k-th mixture Gaussian, and N(x; m_k, R_k) denotes a Gaussian distribution with a D-dimensional mean vector m_k and a D×D diagonal covariance matrix R_k, it can be seen that the universal background model is represented by Gaussian probability density functions and that the covariance matrix R_k among its parameters is represented as a matrix. Therefore, singular value decomposition can be used to perform feature dimensionality reduction on the universal background model and remove the noise data in it. Singular value decomposition is an important matrix factorization in linear algebra; it is a generalization of the unitary diagonalization of normal matrices in matrix analysis and has important applications in fields such as signal processing and statistics.
In this embodiment, singular value decomposition is used to reduce the feature dimension of the universal background model. Specifically, the matrix corresponding to the parameter covariance matrix R_k in the universal background model is decomposed by singular value decomposition, expressed as R_k = σ_1·u_1·v_1^T + σ_2·u_2·v_2^T + ... + σ_n·u_n·v_n^T, where the coefficient σ before each term on the right-hand side is a singular value (the σ form a diagonal matrix), u is a square matrix whose vectors are orthogonal (the left singular matrix), v is a square matrix whose vectors are orthogonal (the right singular matrix), and T denotes matrix transposition. In this equation each u·v^T is a matrix of rank 1, and the singular values satisfy σ_1 ≥ σ_2 ≥ ... ≥ σ_n > 0. Understandably, the larger a singular value is, the more important the feature represented by its term σ·u·v^T is in R_k, and features with smaller singular values are regarded as less important. When training the universal background model, the influence of noise data is inevitable, so the trained universal background model not only has a high feature dimension but is also not objective and accurate enough. Singular value decomposition can reduce the feature dimension of the matrices in the universal background model parameters, reducing the universal background model with a higher feature dimension to a target background model with a lower feature dimension and removing the terms with smaller singular values. It should be noted that this dimensionality reduction does not weaken the ability of the features to express the universal background model; it actually strengthens it, because the feature dimensions removed in the singular value decomposition are those with relatively small σ, and these small-σ features are in fact the noise part introduced when training the universal background model. Therefore, applying singular value decomposition to the universal background model removes the feature dimensions represented by the noise part and yields the target background model (an optimized universal background model that can replace the original one for adaptively processing the target speaker's speech data and achieve better results). The target background model exhibits the speech features of the training speech data well with a lower feature dimension, and calculations related to the target background model (such as adaptively processing the target speaker's speech data with it) require far less computation and are more efficient.
In steps S21–S22, obtaining the universal background model provides an important basis for subsequently obtaining the corresponding target voiceprint feature recognition model when the target speaker's speech data is scarce or insufficient, and the target background model is obtained by applying the singular-value-decomposition-based feature dimensionality reduction to the universal background model. The target background model exhibits the speech features of the training speech data well with a lower feature dimension, which improves efficiency in calculations related to the target background model.
S30: Acquire target speech data, and extract target speech features based on the target speech data.
The target speech data refers to speech data associated with a specific target user. The target user is associated with a user identifier, and the corresponding user can be uniquely identified by the user identifier. Understandably, when target voiceprint feature recognition models or target speech feature recognition models related to certain users need to be trained, these users are the target users. The user identifier is an identifier used to uniquely identify a user.
In this embodiment, the target speech data is acquired; it cannot be directly recognized by a computer and cannot be used for model training. Therefore, target speech features need to be extracted from the target speech data, converting it into target speech features that a computer can recognize. The target speech features may specifically be Mel frequency cepstrum coefficients; for the specific extraction process, see steps S11–S14, which are not repeated here.
S40: Adaptively process the target speech features with the target background model to obtain a target voiceprint feature recognition model.
The target voiceprint feature recognition model refers to the voiceprint feature recognition model related to the target user.
In this embodiment, target speech data is difficult to obtain in some scenarios (such as when handling business at a bank), so there are relatively few data samples of target speech features derived from the target speech data. A target voiceprint feature recognition model trained directly on such a small number of target speech feature samples performs very poorly when subsequently computing target voiceprint features and cannot reflect the speech (voiceprint) characteristics of the target speech features. Therefore, in this embodiment, the target background model is used to adaptively process the target speech features and obtain the corresponding target voiceprint feature recognition model, so that the obtained model is more accurate. The target background model is a Gaussian mixture model representing the distribution of a large number of non-specific speech features; adaptively adding a large number of non-specific speech features from the target background model to the target speech features is equivalent to training part of the non-specific speech features of the target background model together with the target speech features, which can well "supplement" the target speech features for training the target voiceprint feature recognition model.
Adaptive processing here refers to a method of treating the part of the non-specific speech features in the target background model that is close to the target speech features as target speech features; it can specifically be implemented with the Maximum A Posteriori (MAP) estimation algorithm. Maximum a posteriori estimation obtains an estimate of a quantity that is difficult to observe from empirical data; during estimation, the prior probability and Bayes' theorem are used to obtain the posterior probability. The objective function (the expression representing the target voiceprint feature recognition model) is the likelihood function of the posterior probability, and the parameter values that maximize this likelihood function are found (a gradient descent algorithm can be used to find the maximum of the likelihood function). This achieves the effect of training the part of the non-specific speech features in the target background model that is close to the target speech features together with the target speech features, and the target voiceprint feature recognition model corresponding to the target speech features is obtained from the parameter values that maximize the likelihood function.
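As an illustration of MAP adaptation, the classical relevance-MAP update of the mixture means is sketched below on top of a fitted scikit-learn GMM. This is a common realisation of the idea described above, not necessarily the exact procedure of this embodiment, and the relevance factor value is an assumption.

```python
import numpy as np

def map_adapt_means(ubm, target_features, relevance=16.0):
    # ubm: a fitted sklearn GaussianMixture acting as the target background model.
    # target_features: (n_frames, dim) speech features of the target user.
    post = ubm.predict_proba(target_features)            # responsibilities per component
    n_k = post.sum(axis=0)                                # soft frame counts per component
    e_k = (post.T @ target_features) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]            # data-dependent adaptation weight
    # New mean: interpolate between target-data statistics and the UBM means.
    return alpha * e_k + (1.0 - alpha) * ubm.means_
```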
S50: Input the target speech features into a deep neural network for training to obtain a target speech feature recognition model.
The target speech feature recognition model refers to the speech feature recognition model related to the target user. A Deep Neural Network (DNN) model includes an input layer, hidden layers, and an output layer composed of neurons. The deep neural network model includes the weights and biases of the connections between neurons in adjacent layers; these weights and biases determine the properties and recognition effect of the DNN model.
In this embodiment, the target speech features are input into the deep neural network model for training, the network parameters (i.e., the weights and biases) of the deep neural network model are updated, and the target speech feature recognition model is obtained. The target speech features include the key speech features of the target speech data. By training on the target speech features in the DNN model, the features of the target speech data are further extracted, and deep features are extracted on the basis of the target speech features. These deep features are expressed by the network parameters in the target speech feature recognition model; based on the extracted deep features, a more accurate recognition effect can be achieved when the target speech feature recognition model is subsequently used for recognition.
在一实施例中,如图6所示,步骤S50中,将目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型,包括如下步骤:In an embodiment, as shown in FIG. 6, in step S50, the target voice feature is input into a deep neural network for training, and the target voice feature recognition model is obtained, including the following steps:
S51:初始化深度神经网络模型。S51: Initialize a deep neural network model.
本实施例中,初始化DNN模型,该初始化操作即设置DNN模型中权值和偏置的初始值,该初始值可以设置为较小的值,如设置在区间[-0.3-0.3]之间。合理的初始化DNN模型可以使DNN模型在初期有较 灵活的调整能力,可以在DNN模型训练过程中对模型进行有效的调整,使得训练出的DNN模型识别效果较好。In this embodiment, the DNN model is initialized. This initialization operation is to set initial values of weights and offsets in the DNN model. The initial value may be set to a smaller value, such as between [-0.3-0.3]. Reasonable initialization of the DNN model can make the DNN model have more flexible adjustment ability in the early stage, and the model can be adjusted effectively during the DNN model training process, so that the trained DNN model has a better recognition effect.
S52:将目标语音特征分组输入到深度神经网络模型中,根据前向传播算法获取深度神经网络模型的输出值,目标语音特征的第i组样本在深度神经网络模型的当前层的输出值用公式表示为a^{i,l}=σ(W^l·a^{i,l-1}+b^l),其中,a为输出值,i表示输入的目标语音特征的第i组样本,l为深度神经网络模型的当前层,σ为激活函数,W为权值,l-1为深度神经网络模型的当前层的上一层,b为偏置。S52: Group the target speech features and input them into the deep neural network model, and obtain the output value of the deep neural network model according to the forward propagation algorithm; the output value of the i-th group of samples of the target speech features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l·a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target speech features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer above the current layer of the deep neural network model, and b is the bias.
本实施例中,先将目标语音特征分成预设组数的样本,再分组输入到DNN模型中进行训练,即把分组后的样本分别输入到DNN模型进行训练。DNN的前向传播算法是根据DNN模型中连接各个神经元的权值W,偏置b和输入值(向量x i)在DNN模型中进行的一系列线性运算和激活运算,从输入层开始,一层层运算,一直运算到输出层,得到输出层的输出值为止。根据前向传播算法可以计算DNN模型中网络每一层的输出值,直至算到输出层的输出值(即DNN模型的输出值)。 In this embodiment, the target voice feature is first divided into a preset number of samples, and then grouped and input into the DNN model for training, that is, the grouped samples are respectively input into the DNN model for training. The DNN's forward propagation algorithm is a series of linear operations and activation operations performed in the DNN model based on the weights W, bias b, and input values (vector x i ) of each neuron in the DNN model, starting from the input layer, Layer by layer calculations are performed until the output layer gets the output value of the output layer. According to the forward propagation algorithm, the output value of each layer of the network in the DNN model can be calculated until the output value of the output layer (that is, the output value of the DNN model) is calculated.
具体地,设DNN模型的总层数为L,DNN模型中连接各个神经元的权值W、偏置b和输入值向量x^i,输出层的输出值为a^{i,L}(i表示输入的目标语音特征的第i组样本),则a^1=x^i(第一层的输出为在输入层输入的目标语音特征,即输入值向量x^i),根据前向传播算法可知输出a^{i,l}=σ(W^l·a^{i,l-1}+b^l),其中,l表示深度神经网络模型的当前层,σ为激活函数,这里具体采用的激活函数可以是sigmoid或者tanh激活函数。根据上述计算a^{i,l}的公式按层数逐层进行前向传播,获取DNN模型中网络最终的输出值a^{i,L}(即深度神经网络模型的输出值),有了输出值a^{i,L}即可以根据该输出值对DNN模型中的网络参数(连接各个神经元的权值W,偏置b)进行调整,以获取语音识别能力较准确的目标语音特征识别模型。Specifically, let the total number of layers of the DNN model be L, let W and b denote the weights and biases connecting the neurons in the DNN model, let x^i denote the input value vector, and let a^{i,L} denote the output value of the output layer (i denotes the i-th group of samples of the input target speech features). Then a^1 = x^i (the output of the first layer is the target speech feature fed to the input layer, i.e., the input value vector x^i), and according to the forward propagation algorithm the output is a^{i,l} = σ(W^l·a^{i,l-1} + b^l), where l denotes the current layer of the deep neural network model and σ is the activation function; the activation function used here may be the sigmoid or tanh function. Forward propagation is performed layer by layer according to the above formula for a^{i,l} until the final output value a^{i,L} of the network in the DNN model (i.e., the output value of the deep neural network model) is obtained. With the output value a^{i,L}, the network parameters of the DNN model (the weights W and biases b connecting the neurons) can be adjusted according to a^{i,L} to obtain a target speech feature recognition model with relatively accurate speech recognition capability.
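A minimal sketch of the forward propagation a^{i,l} = σ(W^l·a^{i,l-1} + b^l) described above; the sigmoid activation and the column-vector layout are assumptions.

```python
# Illustrative sketch of forward propagation through the DNN layers; sigmoid is an
# assumed choice of activation function (the text also allows tanh).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Propagate one input column vector x through all layers and return the
    activations a^{i,l} and pre-activations z^{i,l} of every layer."""
    activations, pre_activations = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b            # z^{i,l} = W^l a^{i,l-1} + b^l
        a = sigmoid(z)           # a^{i,l} = sigma(z^{i,l})
        pre_activations.append(z)
        activations.append(a)
    return activations, pre_activations
```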
S53:基于深度神经网络模型的输出值进行误差反传,更新深度神经网络模型各层的权值和偏置,获取目标语音特征识别模型,其中,更新权值的计算公式为W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T,l为深度神经网络模型的当前层,W为权值,α为迭代步长,m为输入的目标语音特征的样本总数,δ^{i,l}为当前层的灵敏度;δ^{i,l}=(W^{l+1})^T·δ^{i,l+1}∘σ'(z^{i,l}),z^{i,l}=W^l·a^{i,l-1}+b^l,a^{i,l-1}为上一层的输出,T表示矩阵转置运算,∘表示两个矩阵对应元素相乘的运算(Hadamard积),更新偏置的计算公式为b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}。S53: Perform error back propagation based on the output value of the deep neural network model, and update the weights and biases of each layer of the deep neural network model to obtain the target speech feature recognition model, where the formula for updating the weights is W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T, in which l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target speech features, and δ^{i,l} is the sensitivity of the current layer; δ^{i,l} = (W^{l+1})^T·δ^{i,l+1} ∘ σ'(z^{i,l}), z^{i,l} = W^l·a^{i,l-1} + b^l, a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of two matrices (Hadamard product); the formula for updating the biases is b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}.
本实施例中,在根据前向传播算法获取DNN模型的输出值a^{i,L}后,可以根据a^{i,L}与预先设置好标签值(该标签值是根据实际情况设置的用于与输出值进行比较,获取误差的值)的目标语音特征,计算目标语音特征在该DNN模型中训练时产生的误差,并根据该误差构建合适的误差函数(如采用均方差来度量误差的误差函数),根据误差函数进行误差反传,以调整更新DNN模型各层的权值W和偏置b。In this embodiment, after the output value a^{i,L} of the DNN model is obtained according to the forward propagation algorithm, the error produced when the target speech features are trained in the DNN model can be calculated from a^{i,L} and the target speech features with preset label values (the label values are set according to the actual situation and are used for comparison with the output values to obtain the error). A suitable error function is then constructed from this error (for example, an error function that measures the error by the mean squared error), and error back propagation is performed according to the error function to adjust and update the weights W and biases b of each layer of the DNN model.
更新DNN模型各层的权值W和偏置b采用的是后向传播算法,根据后向传播算法求误差函数的极小值,以优化更新DNN模型各层的权值W和偏置b,获取目标语音特征识别模型。具体地,设置模型训练的迭代步长为α,最大迭代次数MAX与停止迭代阈值∈。在后向传播算法中,灵敏度δ^{i,l}是每次更新参数都会出现的公共因子,因此可以借助灵敏度δ^{i,l}计算误差,以更新DNN模型中的网络参数。已知a^1=x^i(第一层的输出为在输入层输入的目标语音特征,即输入值向量x^i),则先求出输出层L的灵敏度δ^{i,L},δ^{i,L}=(a^{i,L}-y^i)∘σ'(z^{i,L}),z^{i,l}=W^l·a^{i,l-1}+b^l,其中i表示输入的目标语音特征的第i组样本,y为标签值(即用来与输出值a^{i,L}相比较的值),∘表示两个矩阵对应元素相乘的运算(Hadamard积)。再根据δ^{i,L}求出深度神经网络模型的第l层的灵敏度δ^{i,l},根据后向传播算法可以计算得出δ^{i,l}=(W^{l+1})^T·δ^{i,l+1}∘σ'(z^{i,l}),得到深度神经网络模型的第l层的灵敏度δ^{i,l}后,即可更新DNN模型各层的权值W和偏置b,更新后的权值为W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T,更新后的偏置为b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l},其中,α为模型训练的迭代步长,m为输入的目标语音特征的样本总数,T表示矩阵转置运算。当所有W和b的变化值都小于停止迭代阈值∈时,即可停止训练;或者,训练达到最大迭代次数MAX时,停止训练。通过目标语音特征在DNN模型中的输出值和预先设置好的标签值之间产生的误差,能够实现DNN模型各层的权值W和偏置b的更新,使得获取的目标语音特征识别模型能够进行语音识别。The weights W and biases b of each layer of the DNN model are updated by the back-propagation algorithm: the minimum of the error function is sought according to the back-propagation algorithm so as to optimize and update the weights W and biases b of each layer of the DNN model and obtain the target speech feature recognition model. Specifically, the iteration step size of model training is set to α, together with the maximum number of iterations MAX and the stop-iteration threshold ∈. In the back-propagation algorithm, the sensitivity δ^{i,l} is a common factor that appears in every parameter update, so the error can be calculated by means of δ^{i,l} to update the network parameters in the DNN model. Given a^1 = x^i (the output of the first layer is the target speech feature fed to the input layer, i.e., the input value vector x^i), the sensitivity of the output layer L is first obtained as δ^{i,L} = (a^{i,L} - y^i) ∘ σ'(z^{i,L}), with z^{i,l} = W^l·a^{i,l-1} + b^l, where i denotes the i-th group of samples of the input target speech features, y is the label value (i.e., the value compared with the output value a^{i,L}), and ∘ denotes element-wise multiplication of two matrices (Hadamard product). The sensitivity of the l-th layer of the deep neural network model is then obtained from δ^{i,L}: according to the back-propagation algorithm, δ^{i,l} = (W^{l+1})^T·δ^{i,l+1} ∘ σ'(z^{i,l}). Once the sensitivity δ^{i,l} of the l-th layer is obtained, the weights W and biases b of each layer of the DNN model can be updated; the updated weight is W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T and the updated bias is b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}, where α is the iteration step size of model training, m is the total number of samples of the input target speech features, and T denotes the matrix transposition operation. Training stops when all changes of W and b are smaller than the stop-iteration threshold ∈, or when the maximum number of iterations MAX is reached. Through the error between the output values of the target speech features in the DNN model and the preset label values, the weights W and biases b of each layer of the DNN model can be updated, so that the obtained target speech feature recognition model is able to perform speech recognition.
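A minimal sketch of the sensitivity-based back-propagation update described above, reusing the forward() helper from the sketch after step S52; the mean-squared-error comparison with label values and the sigmoid derivative are assumptions.

```python
# Illustrative sketch of one back-propagation update of W^l and b^l over a batch,
# using the sensitivities delta^{i,l} from the text (sigmoid derivative assumed).
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backprop_update(weights, biases, batch, alpha=0.1):
    """batch is a list of (x, y) column-vector pairs; alpha is the iteration step size."""
    m = len(batch)
    grad_W = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        activations, zs = forward(weights, biases, x)          # forward() from the earlier sketch
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])   # output-layer sensitivity
        grad_W[-1] += delta @ activations[-2].T
        grad_b[-1] += delta
        for l in range(2, len(weights) + 1):                    # propagate sensitivity backwards
            delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
            grad_W[-l] += delta @ activations[-l - 1].T
            grad_b[-l] += delta
    # W^l <- W^l - (alpha/m) * sum_i delta^{i,l} (a^{i,l-1})^T ; likewise for b^l.
    new_W = [W - (alpha / m) * gW for W, gW in zip(weights, grad_W)]
    new_b = [b - (alpha / m) * gb for b, gb in zip(biases, grad_b)]
    return new_W, new_b
```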
步骤S51-S53采用目标语音特征对DNN模型进行训练,使得训练获取的目标语音特征识别模型可以对语音进行识别。具体地,目标语音特征识别模型在模型训练过程中进一步提取了目标语音特征的深层特征,模型中训练好的权值和偏置体现了该基于目标语音特征的深层特征。因此,目标语音特征识别模型能够基于训练学习到的深层特征进行识别,实现较为精确的语音识别。Steps S51-S53 train the DNN model by using the target speech features, so that the target speech feature recognition model obtained through training can recognize speech. Specifically, the target speech feature recognition model further extracts the deep features of the target speech feature during the model training process. The trained weights and offsets in the model reflect the deep features based on the target speech feature. Therefore, the target speech feature recognition model can recognize based on the deep features learned through training, and achieve more accurate speech recognition.
S60:将目标声纹特征识别模型和目标语音特征识别模型关联存储在数据库中。S60: Associate the target voiceprint feature recognition model and the target voice feature recognition model in a database.
本实施例中,在获取目标声纹特征识别模型和目标语音特征识别模型后,将该两个模型关联存储在数据库中。具体地,通过目标用户的用户标识进行模型间的关联存储,把相同的用户标识对应的目标声纹特征识别模型和目标语音特征识别模型以文件的形式存储到数据库中。通过将该两个模型进行关联存储,可以在语音的识别阶段调用用户标识对应的目标声纹特征识别模型和目标语音特征识别模型,以结合该两个模型进行语音识别,克服各个模型单独进行识别时存在的误差,进一步地提高语音识别的准确率。In this embodiment, after the target voiceprint feature recognition model and the target speech feature recognition model are obtained, the two models are stored in association in a database. Specifically, the association between the models is established through the user identifier of the target user, and the target voiceprint feature recognition model and target speech feature recognition model corresponding to the same user identifier are stored in the database in the form of files. By storing the two models in association, the target voiceprint feature recognition model and target speech feature recognition model corresponding to a user identifier can be called at the speech recognition stage, so that the two models are combined for speech recognition, overcoming the errors that exist when each model performs recognition alone and further improving the accuracy of speech recognition.
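As an illustration of the associated storage described above, the following sketch keys both trained models to a user identifier; the SQLite schema, column names and use of pickle are assumptions and not part of this application.

```python
# Illustrative sketch: associating the two trained models with a user ID in a database
# (schema, file format and serialization are assumed for illustration only).
import pickle, sqlite3

def store_models(db_path, user_id, voiceprint_model, speech_model):
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS voice_models (
                        user_id TEXT PRIMARY KEY,
                        voiceprint_model BLOB,
                        speech_model BLOB)""")
    conn.execute("INSERT OR REPLACE INTO voice_models VALUES (?, ?, ?)",
                 (user_id, pickle.dumps(voiceprint_model), pickle.dumps(speech_model)))
    conn.commit()
    conn.close()

def load_models(db_path, user_id):
    conn = sqlite3.connect(db_path)
    row = conn.execute("SELECT voiceprint_model, speech_model FROM voice_models "
                       "WHERE user_id = ?", (user_id,)).fetchone()
    conn.close()
    return (pickle.loads(row[0]), pickle.loads(row[1])) if row else None
```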
本实施例所提供的语音模型训练方法中,通过提取的训练语音特征获取目标背景模型,该目标背景模型由通用背景模型采用奇异值分解的特征降维方法得到,该目标背景模型以较低特征维度良好展现了训练语音数据的语音特征,在进行与目标背景模型相关的计算时能够提高效率。采用该目标背景模型对提取的目标语音特征进行自适应处理,获取目标声纹特征识别模型。目标背景模型涵盖训练语音数据多个维度的语音特征,可以通过该目标背景模型对数据量较少的目标语音特征进行自适应补充处理,使得在数据量很少的情况下,同样能够得到目标声纹特征识别模型。该目标声纹特征识别模型能够识别采用较低维度表示目标语音特征的声纹特征,从而进行语音识别。然后将目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型,该目标语音特征识别模型深度学习了目标语音特征,能够进行准确率较高的语音识别。最后将目标声纹特征识别模型和目标语音特征识别模型关联存储在数据库中,将两个模型关联存储作为一个总的语音模型,该语音模型有机结合了目标声纹特征识别模型和目标语音特征识别模型,采用由该总的语音模型进行语音识别时,能够提高语音识别的精确率。In the speech model training method provided in this embodiment, a target background model is obtained from the extracted training speech features; the target background model is obtained from a universal background model by a feature dimensionality reduction method based on singular value decomposition, so it represents the speech features of the training speech data well with a lower feature dimension, which improves efficiency in computations involving the target background model. The target background model is then used to adaptively process the extracted target speech features to obtain the target voiceprint feature recognition model. Because the target background model covers speech features of the training speech data in multiple dimensions, it can adaptively supplement target speech features that have only a small amount of data, so that a target voiceprint feature recognition model can still be obtained even when the amount of data is small. The target voiceprint feature recognition model can recognize voiceprint features that represent the target speech features in a lower dimension, and thereby perform speech recognition. The target speech features are then input into a deep neural network for training to obtain the target speech feature recognition model, which deeply learns the target speech features and can perform speech recognition with high accuracy. Finally, the target voiceprint feature recognition model and the target speech feature recognition model are stored in association in a database; the two associated models serve as one overall speech model that organically combines the target voiceprint feature recognition model and the target speech feature recognition model, and performing speech recognition with this overall speech model improves the accuracy of speech recognition.
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其 功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
图7示出与实施例中语音模型训练方法一一对应的语音模型训练装置的示意图。如图7所示,该语音模型训练装置包括训练语音特征提取模块10、目标背景模型获取模块20、目标语音特征提取模块30、目标声纹特征识别模型获取模块40、语音特征识别获取模块50和模型存储模块60。其中,训练语音特征提取模块10、目标背景模型获取模块20、目标语音特征提取模块30、目标声纹特征识别模型获取模块40、语音特征识别获取模块50和模型存储模块60的实现功能与实施例中语音模型训练方法对应的步骤一一对应,为避免赘述,本实施例不一一详述。FIG. 7 is a schematic diagram of a speech model training device corresponding one-to-one to the speech model training method in the embodiment. As shown in FIG. 7, the speech model training device includes a training speech feature extraction module 10, a target background model acquisition module 20, a target speech feature extraction module 30, a target voiceprint feature recognition model acquisition module 40, a speech feature recognition acquisition module 50 and a model storage module 60. The functions implemented by the training speech feature extraction module 10, the target background model acquisition module 20, the target speech feature extraction module 30, the target voiceprint feature recognition model acquisition module 40, the speech feature recognition acquisition module 50 and the model storage module 60 correspond one-to-one to the steps of the speech model training method in the embodiment; to avoid repetition, they are not described in detail here one by one.
训练语音特征提取模块10,用于获取训练语音数据,基于训练语音数据提取训练语音特征;Training voice feature extraction module 10, configured to obtain training voice data, and extract training voice features based on the training voice data;
目标背景模型获取模块20,用于基于训练语音特征获取目标背景模型;A target background model acquisition module 20, configured to acquire a target background model based on the training speech features;
目标语音特征提取模块30,用于获取目标语音数据,基于目标语音数据提取目标语音特征;A target voice feature extraction module 30, configured to obtain target voice data, and extract target voice features based on the target voice data;
目标声纹特征识别模型获取模块40,用于采用目标背景模型对目标语音特征进行自适应处理,获取目标声纹特征识别模型;The target voiceprint feature recognition model acquisition module 40 is configured to adaptively process the target voice feature using the target background model to obtain the target voiceprint feature recognition model;
语音特征识别获取模块50,用于将目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型;Speech feature recognition acquisition module 50, configured to input target speech features into a deep neural network for training, and obtain a target speech feature recognition model;
模型存储模块60,用于将目标声纹特征识别模型和目标语音特征识别模型关联存储在数据库中。The model storage module 60 is configured to store the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
优选地,训练语音特征提取模块10包括预处理单元11、功率谱获取单元12、梅尔功率谱获取单元13和训练语音特征确定单元14。Preferably, the training speech feature extraction module 10 includes a preprocessing unit 11, a power spectrum acquisition unit 12, a Mel power spectrum acquisition unit 13, and a training speech feature determination unit 14.
预处理单元11,用于对训练语音数据进行预处理。The preprocessing unit 11 is configured to preprocess the training voice data.
功率谱获取单元12,用于对预处理后的训练语音数据作快速傅里叶变换,获取训练语音数据的频谱,并根据频谱获取训练语音数据的功率谱。A power spectrum obtaining unit 12 is configured to perform a fast Fourier transform on the pre-processed training voice data, obtain a frequency spectrum of the training voice data, and obtain a power spectrum of the training voice data according to the frequency spectrum.
梅尔功率谱获取单元13,用于采用梅尔刻度滤波器组处理训练语音数据的功率谱,获取训练语音数据的梅尔功率谱。The Mel power spectrum obtaining unit 13 is configured to process the power spectrum of the training speech data by using a Mel scale filter bank, and obtain a Mel power spectrum of the training speech data.
训练语音特征确定单元14,用于在梅尔功率谱上进行倒谱分析,获取训练语音数据的梅尔频率倒谱系数,并将获取到的梅尔频率倒谱系数确定为训练语音特征。The training speech feature determining unit 14 is configured to perform cepstrum analysis on the Mel power spectrum, obtain Mel frequency cepstrum coefficients of training speech data, and determine the obtained Mel frequency cepstrum coefficients as training speech features.
优选地,预处理单元11包括预加重子单元111、分帧子单元112和加窗子单元113。Preferably, the pre-processing unit 11 includes a pre-emphasis sub-unit 111, a frame sub-unit 112, and a windowing sub-unit 113.
预加重子单元111,用于对训练语音数据作预加重处理。The pre-emphasis sub-unit 111 is configured to perform pre-emphasis processing on the training voice data.
分帧子单元112,用于对预加重后的训练语音数据进行分帧处理。The frame sub-unit 112 is configured to perform frame processing on the pre-emphasized training voice data.
加窗子单元113,用于对分帧处理后的训练语音数据进行加窗处理。A windowing sub-unit 113 is configured to perform windowing processing on the framed processing speech data.
优选地,目标背景模型获取模块20包括通用背景模型获取单元21和目标背景模型获取单元22。Preferably, the target background model acquisition module 20 includes a general background model acquisition unit 21 and a target background model acquisition unit 22.
通用背景模型获取单元21,用于采用训练语音特征进行通用背景模型训练,获取通用背景模型。The universal background model obtaining unit 21 is configured to use the training voice feature to perform a universal background model training to obtain a universal background model.
目标背景模型获取单元22,用于采用奇异值分解对通用背景模型进行特征降维处理,获取目标背景模型。The target background model obtaining unit 22 is configured to perform dimensionality reduction processing on the general background model by using singular value decomposition to obtain a target background model.
优选地,语音特征识别获取模块50包括初始化单元51、输出值获取单元52和目标语音特征识别模型获取单元53。Preferably, the speech feature recognition acquisition module 50 includes an initialization unit 51, an output value acquisition unit 52, and a target speech feature recognition model acquisition unit 53.
初始化单元51,用于初始化深度神经网络模型。The initialization unit 51 is configured to initialize a deep neural network model.
输出值获取单元52,用于将目标语音特征分组输入到深度神经网络模型中,根据前向传播算法获取深度神经网络模型的输出值,目标语音特征的第i组样本在深度神经网络模型的当前层的输出值用公式表示为a^{i,l}=σ(W^l·a^{i,l-1}+b^l),其中,a为输出值,i表示输入的目标语音特征的第i组样本,l为深度神经网络模型的当前层,σ为激活函数,W为权值,l-1为深度神经网络模型的当前层的上一层,b为偏置。An output value acquisition unit 52, configured to group the target speech features and input them into the deep neural network model, and obtain the output value of the deep neural network model according to the forward propagation algorithm; the output value of the i-th group of samples of the target speech features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l·a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target speech features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer above the current layer of the deep neural network model, and b is the bias.
目标语音特征识别模型获取单元53,用于基于深度神经网络模型的输出值进行误差反传,更新深度神经网络模型各层的权值和偏置,获取目标语音特征识别模型,其中,更新权值的计算公式为W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T,l为深度神经网络模型的当前层,W为权值,α为迭代步长,m为输入的目标语音特征的样本总数,δ^{i,l}为当前层的灵敏度;δ^{i,l}=(W^{l+1})^T·δ^{i,l+1}∘σ'(z^{i,l}),z^{i,l}=W^l·a^{i,l-1}+b^l,a^{i,l-1}为上一层的输出,T表示矩阵转置运算,∘表示两个矩阵对应元素相乘的运算(Hadamard积),更新偏置的计算公式为b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}。A target speech feature recognition model acquisition unit 53, configured to perform error back propagation based on the output value of the deep neural network model, and update the weights and biases of each layer of the deep neural network model to obtain the target speech feature recognition model, where the formula for updating the weights is W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T, in which l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target speech features, and δ^{i,l} is the sensitivity of the current layer; δ^{i,l} = (W^{l+1})^T·δ^{i,l+1} ∘ σ'(z^{i,l}), z^{i,l} = W^l·a^{i,l-1} + b^l, a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of two matrices (Hadamard product); the formula for updating the biases is b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}.
图8示出在一实施例中语音识别方法的一流程图。该语音识别方法可应用在银行、证券、投资和保险等金融机构或者需进行语音识别的其他机构的计算机设备上,以达到人工智能的语音识别目的。其中,该计算机设备是可与用户进行人机交互的设备,包括但不限于电脑、智能手机和平板等设备。如图8所示,该语音识别方法包括如下步骤:FIG. 8 shows a flowchart of a speech recognition method in an embodiment. The speech recognition method can be applied to the computer equipment of financial institutions such as banks, securities, investment, and insurance, or other institutions that need to perform speech recognition to achieve the purpose of speech recognition by artificial intelligence. The computer device is a device that can perform human-computer interaction with a user, including, but not limited to, a computer, a smart phone, and a tablet. As shown in FIG. 8, the speech recognition method includes the following steps:
S71:获取待识别语音数据,待识别语音数据与用户标识相关联。S71: Acquire speech data to be identified, and the speech data to be identified is associated with a user identifier.
其中,待识别语音数据是指待进行识别的用户的语音数据,用户标识是用于唯一识别用户的标识,该用户标识可以是身份证号或电话号码等能够唯一识别用户的标识。The voice data to be identified refers to voice data of a user to be identified. The user identifier is an identifier for uniquely identifying the user. The user identifier may be an identifier that can uniquely identify the user, such as an ID card number or a phone number.
本实施例中,获取待识别语音数据,具体可以是通过计算机设备内置的录音模块或者外部的录音设备采集,该待识别语音数据与用户标识相关联,可以根据与用户标识相关联的待识别语音数据判断是不是用户本人发出的语音,实现语音识别。In this embodiment, the speech data to be recognized may specifically be collected through a recording module built into the computer device or through an external recording device. The speech data to be recognized is associated with a user identifier, and whether the speech was uttered by the user himself or herself can be judged from the speech data to be recognized that is associated with the user identifier, thereby realizing speech recognition.
S72:基于用户标识查询数据库,获取关联存储的目标声纹特征识别模型和目标语音特征识别模型,目标声纹特征识别模型和目标语音特征识别模型是上述实施例提供的语音模型训练方法获取的模型。S72: Query the database based on the user ID to obtain the target voiceprint feature recognition model and target voice feature recognition model that are stored in association. The target voiceprint feature recognition model and target voice feature recognition model are models obtained by the voice model training method provided in the foregoing embodiment. .
本实施例中,根据用户标识查询数据库,在数据库中获取与用户标识相关联的目标声纹特征识别模型和目标语音特征识别模型。关联存储的目标声纹特征识别模型和目标语音特征识别模型在数据库中以文件的形式存储,在对数据库查询后调用与用户标识相对应的模型的文件,以使计算机设备可根据文件存储的目标声纹特征识别模型和目标语音特征识别模型进行语音识别。In this embodiment, the database is queried according to the user identifier, and the target voiceprint feature recognition model and target speech feature recognition model associated with the user identifier are obtained from the database. The associatively stored target voiceprint feature recognition model and target speech feature recognition model are stored in the database in the form of files; after the database is queried, the model files corresponding to the user identifier are called, so that the computer device can perform speech recognition based on the stored target voiceprint feature recognition model and target speech feature recognition model.
S73:基于待识别语音数据,提取待识别语音特征。S73: Based on the speech data to be identified, extract speech features to be identified.
本实施例中,获取待识别语音数据,该待识别语音数据不能被计算机直接识别,无法进行语音识别。因此,需根据该待识别语音数据提取相应的待识别语音特征,将待识别语音数据转化为计算机能够识别的待识别语音特征。该待识别语音特征具体可以是梅尔频率倒谱系数,具体提取过程参S11-S14,在此不在赘述。In this embodiment, to-be-recognized voice data is acquired, and the to-be-recognized voice data cannot be directly recognized by a computer, and voice recognition cannot be performed. Therefore, it is necessary to extract corresponding to-be-recognized speech features according to the to-be-recognized voice data, and convert the to-be-recognized voice data into to-be-recognized voice features that can be recognized by a computer. The feature of the speech to be recognized may specifically be a Mel frequency cepstrum coefficient, and the specific extraction process refers to S11-S14, which is not described in detail here.
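For reference, a compact sketch of the Mel-frequency cepstral coefficient pipeline referred to above (pre-emphasis, framing, windowing, FFT, Mel filter bank, cepstral analysis); the frame length, hop size, number of filters and number of coefficients are assumed example values, not values specified in this application.

```python
# Illustrative sketch of MFCC extraction for the speech data to be recognized
# (all numeric parameters below are assumed example values).
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])          # pre-emphasis
    frames = np.stack([emph[i:i + frame_len]
                       for i in range(0, len(emph) - frame_len, hop)]).astype(np.float64)
    frames *= np.hamming(frame_len)                                        # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft                # power spectrum
    # Mel-scale filter bank
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        fbank[m - 1, bins[m - 1]:bins[m]] = np.linspace(0, 1, bins[m] - bins[m - 1], endpoint=False)
        fbank[m - 1, bins[m]:bins[m + 1]] = np.linspace(1, 0, bins[m + 1] - bins[m], endpoint=False)
    mel_power = np.log(power @ fbank.T + 1e-10)                            # Mel power spectrum
    return dct(mel_power, type=2, axis=1, norm='ortho')[:, :n_ceps]        # cepstral analysis
```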
S74:将待识别语音特征输入到目标语音特征识别模型,获取第一得分。S74: Input the speech feature to be recognized into the target speech feature recognition model, and obtain a first score.
本实施例中,采用目标语音特征识别模型对待识别语音特征进行识别,将待识别语音特征输入到目标语音特征识别模型中,经过该模型内部的网络参数(权值和偏置)对待识别语音特征进行计算,获取第一得分。In this embodiment, the target speech feature recognition model is used to recognize the speech features to be recognized: the speech features to be recognized are input into the target speech feature recognition model and computed through the network parameters (weights and biases) inside the model to obtain the first score.
S75:将待识别语音数据输入到目标声纹特征识别模型中,获取第二得分。S75: Input the speech data to be recognized into the target voiceprint feature recognition model, and obtain a second score.
本实施例中,将待识别语音数据输入到目标声纹特征识别模型中进行识别,具体地,先采用目标声纹特征模型提取待识别语音数据中的待识别声纹特征,可以通过以下公式计算获取待识别声纹特征:M(i)=M_0+T·w(i),其中M_0是由目标背景模型参数中的均值(m_k)连接组成的A×K维超矢量(目标背景模型是采用上述实施例提供的语音模型训练方法获取的目标背景模型,目标背景模型中的均值是降维过的,降维后均值表示为A维矢量),M(i)是由目标声纹特征识别模型参数中的均值(m_k′)连接组成的A×K维超矢量,T是(A×K)×F维的描述总体变化的矩阵,表示待识别声纹特征的向量空间,w(i)表示一个符合标准正态分布的F维矢量,该w(i)即为待识别声纹特征。由于向量空间T的参数含有隐变量,无法直接得到,但是能够根据已知的M(i)和M_0,采用EM算法迭代计算求出空间T,再根据M(i)=M_0+T·w(i)的关系式获取待识别声纹特征。获取待识别声纹特征后,根据该待识别声纹特征与目标语音特征对应的目标声纹特征进行相似度的比较(如余弦相似度),相似度越高,则认为该待识别声纹特征与目标声纹特征越接近,也就代表是用户本人语音的可能性越大。同样根据上述采用待识别语音数据求得待识别声纹特征的方法,可以计算得到训练目标声纹特征识别模型过程中采用的目标语音特征对应的目标声纹特征,通过计算待识别声纹特征与目标声纹特征的余弦相似度,将余弦相似度作为第二得分。In this embodiment, the speech data to be recognized is input into the target voiceprint feature recognition model for recognition. Specifically, the target voiceprint feature model is first used to extract the voiceprint feature to be recognized from the speech data to be recognized, which can be computed by the formula M(i) = M_0 + T·w(i), where M_0 is an A×K-dimensional supervector formed by concatenating the means (m_k) of the target background model parameters (the target background model is the one obtained by the speech model training method provided in the foregoing embodiment; the means in the target background model have been dimensionality-reduced and are represented as A-dimensional vectors), M(i) is an A×K-dimensional supervector formed by concatenating the means (m_k′) of the target voiceprint feature recognition model parameters, T is an (A×K)×F-dimensional matrix describing the total variability and represents the vector space of the voiceprint features to be recognized, and w(i) is an F-dimensional vector obeying the standard normal distribution; this w(i) is the voiceprint feature to be recognized. Because the parameters of the vector space T contain latent variables, T cannot be obtained directly, but it can be computed iteratively from the known M(i) and M_0 using the EM algorithm, and the voiceprint feature to be recognized is then obtained from the relation M(i) = M_0 + T·w(i). After the voiceprint feature to be recognized is obtained, its similarity to the target voiceprint feature corresponding to the target speech features is compared (for example, by cosine similarity); the higher the similarity, the closer the voiceprint feature to be recognized is to the target voiceprint feature, and the more likely the speech is the user's own. Likewise, by the same method used above to obtain the voiceprint feature to be recognized from the speech data to be recognized, the target voiceprint feature corresponding to the target speech features used in training the target voiceprint feature recognition model can be computed; the cosine similarity between the voiceprint feature to be recognized and the target voiceprint feature is then calculated and taken as the second score.
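A minimal sketch of the cosine-similarity comparison used for the second score, assuming both voiceprint features are plain F-dimensional vectors.

```python
# Illustrative sketch: second score as the cosine similarity between the voiceprint
# feature w(i) to be recognized and the enrolled target voiceprint feature.
import numpy as np

def cosine_similarity(w_test, w_target):
    return float(np.dot(w_test, w_target) /
                 (np.linalg.norm(w_test) * np.linalg.norm(w_target) + 1e-10))
```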
S76:将第一得分与预设的第一加权比例相乘,获取第一加权得分,将第二得分与预设的第二加权比例相乘,获取第二加权得分,将第一加权得分和第二加权得分相加,获取目标得分。S76: Multiply the first score with a preset first weighted ratio to obtain a first weighted score, multiply the second score with a preset second weighted ratio, obtain a second weighted score, and sum the first weighted score and The second weighted scores are added to obtain the target score.
本实施例中,根据目标声纹特征识别模型和目标语音特征识别模型各自存在的不足进行针对性的克服。可以理解地,在采用目标语音特征识别模型识别并获取第一得分时,由于待识别语音特征维度较高,包含了部分干扰语音特征(如噪音等),使得在单独采用该模型得到的第一得分与实际结果存在一定的误差;在采用目标声纹特征识别模型识别并获取第二得分时,由于待识别声纹特征的维度较低,难以避免地丢失了部分能够代表待识别语音数据的特征,使得在单独采用该模型得到的第二得分与实际结果存在一定的误差。由于第一得分和第二得分的误差是由维度较高和维度较低两个相反的原因造成的,因此针对第一得分的误差和第二得分的误差造成的原因,将第一得分与预设的第一加权比例相乘,获取第一加权得分,将第二得分与预设的第二加权比例相乘,获取第二加权得分,将第一加权得分和第二加权得分相加,获取目标得分,该目标得分即最终输出的得分。采用该加权的处理方式恰好可以克服第一得分的误差和第二得分的误差,可以认为两个误差之间相互抵消掉,使得目标得分更接近实际结果,能够提高语音识别的准确率。In this embodiment, the respective shortcomings of the target voiceprint feature recognition model and the target speech feature recognition model are overcome in a targeted way. Understandably, when the target speech feature recognition model is used to obtain the first score, the speech features to be recognized have a relatively high dimension and include some interfering speech features (such as noise), so the first score obtained with this model alone deviates to some extent from the actual result. When the target voiceprint feature recognition model is used to obtain the second score, the voiceprint features to be recognized have a relatively low dimension, so some features representing the speech data to be recognized are unavoidably lost, and the second score obtained with this model alone also deviates to some extent from the actual result. Since the errors of the first score and of the second score arise from two opposite causes, a higher dimension and a lower dimension respectively, the first score is multiplied by a preset first weighting proportion to obtain a first weighted score, the second score is multiplied by a preset second weighting proportion to obtain a second weighted score, and the first weighted score and the second weighted score are added to obtain the target score, which is the final output score. This weighted processing compensates for the error of the first score and the error of the second score; the two errors can be regarded as cancelling each other out, so that the target score is closer to the actual result and the accuracy of speech recognition can be improved.
S77:若目标得分大于预设得分阈值,则确定待识别语音数据为用户标识对应的目标语音数据。S77: If the target score is greater than a preset score threshold, determine that the speech data to be recognized is target speech data corresponding to the user identification.
本实施例中,判断目标得分是否大于预设得分阈值,若目标得分大于预设得分阈值,则认为待识别语音数据为用户标识对应的目标语音数据,即确定为用户本人的语音数据;若目标得分不大于预设得分阈值,则不认为该待识别语音数据为用户本人的语音数据。In this embodiment, it is judged whether the target score is greater than a preset score threshold. If the target score is greater than the preset score threshold, the speech data to be identified is considered as the target speech data corresponding to the user identification, that is, the user's own speech data is determined; If the score is not greater than the preset score threshold, the voice data to be recognized is not considered to be the voice data of the user himself.
其中,预设得分阈值是指预先设置的用于衡量待识别语音数据是否为用户标识对应的目标语音数据的阈值,该阈值以分数的形式表示。例如,将预设得分阈值设置为0.95,则目标得分大于0.95的待识别语音数据为与用户标识对应的目标语音数据,目标得分不大于0.95的待识别语音数据不认为用户标识对应的用户本人的语音数据。The preset score threshold refers to a preset threshold used to measure whether the speech data to be identified is target speech data corresponding to the user identifier, and the threshold is expressed in the form of a score. For example, if the preset score threshold is set to 0.95, the speech data to be recognized with a target score greater than 0.95 is the target speech data corresponding to the user identification, and the speech data to be recognized with a target score not greater than 0.95 is not considered to be the user's own corresponding Voice data.
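A minimal sketch of steps S76-S77, the weighted fusion and thresholding; the 0.5/0.5 weighting proportions are assumptions, while 0.95 follows the example threshold above.

```python
# Illustrative sketch of weighted score fusion and thresholding (S76-S77);
# the weighting proportions here are assumed example values.
def is_target_speaker(first_score, second_score, w1=0.5, w2=0.5, score_threshold=0.95):
    target_score = first_score * w1 + second_score * w2
    return target_score > score_threshold, target_score

# Example usage with hypothetical scores from the two models.
accepted, score = is_target_speaker(0.97, 0.96)
```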
本实施例所提供的语音识别方法中,根据提取的待识别语音特征输入到语音模型中,得到与目标语音特征识别模型相关的第一得分和与目标声纹特征识别模型相关的第二得分,并通过加权运算获取目标得分,由目标得分得出语音识别结果。第一得分从较高维度的目标语音特征反映了语音识别结果的概率,由于该维度较高,包含了部分干扰语音特征(如噪音等),使得第一得分与实际输出存在误差,影响语音识别结果;第二得分从较低维度的声纹特征反映了语音识别结果的概率,由于声纹特征的维度较低,难以避免地丢失了部分关键语音特征,使得第二得分与实际输出存在误差,影响语音识别结果。采用加权运算获取的目标得分能够针对目标语音特征识别模型和目标声纹特征识别模型各自的不足,克服第一得分和第二得分的误差,可以认为将两个误差相互抵消掉,使得目标得分更接近实际结果,提高语音识别的精确率。In the speech recognition method provided in this embodiment, the extracted speech features to be recognized are input into the speech model to obtain a first score related to the target speech feature recognition model and a second score related to the target voiceprint feature recognition model, the target score is obtained through a weighted operation, and the speech recognition result is derived from the target score. The first score reflects the probability of the speech recognition result from the higher-dimensional target speech features; because this dimension is high and includes some interfering speech features (such as noise), the first score deviates from the actual output and affects the speech recognition result. The second score reflects the probability of the speech recognition result from the lower-dimensional voiceprint features; because the dimension of the voiceprint features is low, some key speech features are unavoidably lost, so the second score also deviates from the actual output and affects the speech recognition result. The target score obtained through the weighted operation addresses the respective shortcomings of the target speech feature recognition model and the target voiceprint feature recognition model and compensates for the errors of the first score and the second score; the two errors can be regarded as cancelling each other out, so that the target score is closer to the actual result and the accuracy of speech recognition is improved.
图9示出与实施例中语音识别方法一一对应的语音识别装置的示意图。如图9所示,该语音识别装置包括待识别语音数据获取模块70、模型获取模块80、待识别语音特征提取模块90和第一得分获取模块100、第二得分获取模块110、目标得分获取模块120和语音确定模块130。其中,待识别语音数据获取模块70、模型获取模块80、待识别语音特征提取模块90和第一得分获取模块100、第二得分获取模块110、目标得分获取模块120和语音确定模块130的实现功能与实施例中语音识别方法对应的步骤一一对应,为避免赘述,本实施例不一一详述。FIG. 9 is a schematic diagram of a speech recognition device corresponding to the speech recognition method in the embodiment. As shown in FIG. 9, the voice recognition device includes a to-be-recognized voice data acquisition module 70, a model acquisition module 80, a to-be-recognized speech feature extraction module 90 and a first score acquisition module 100, a second score acquisition module 110, and a target score acquisition module. 120 and a voice determination module 130. Among them, the realized functions of the to-be-recognized voice data acquisition module 70, model acquisition module 80, to-be-recognized voice feature extraction module 90, first score acquisition module 100, second score acquisition module 110, target score acquisition module 120, and voice determination module 130 The steps corresponding to the speech recognition method in the embodiment are one-to-one. To avoid redundant descriptions, this embodiment does not detail them one by one.
待识别语音数据获取模块70,用于获取待识别语音数据,待识别语音数据与用户标识相关联。The to-be-recognized voice data acquisition module 70 is configured to obtain the to-be-recognized voice data, and the to-be-recognized voice data is associated with a user identifier.
模型获取模块80,用于基于用户标识查询数据库,获取关联存储的目标声纹特征识别模型和目标语音特征识别模型,目标声纹特征识别模型和目标语音特征识别模型是采用上述实施例提供的语音模型训练方法获取的模型。A model acquisition module 80, configured to query a database based on the user identifier and obtain the associatively stored target voiceprint feature recognition model and target speech feature recognition model, where the target voiceprint feature recognition model and the target speech feature recognition model are models obtained by the speech model training method provided in the foregoing embodiment.
待识别语音特征提取模块90,用于基于待识别语音数据,提取待识别语音特征。The to-be-recognized voice feature extraction module 90 is configured to extract the to-be-recognized voice features based on the to-be-recognized voice data.
第一得分获取模块100,用于将待识别语音特征输入到目标语音特征识别模型,获取第一得分。The first score obtaining module 100 is configured to input a voice feature to be recognized into a target voice feature recognition model, and obtain a first score.
第二得分获取模块110,用于将待识别语音数据输入到目标声纹特征识别模型中,获取第二得分。A second score obtaining module 110 is configured to input the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score.
目标得分获取模块120,用于将第一得分与预设的第一加权比例相乘,获取第一加权得分,将第二得分与预设的第二加权比例相乘,获取第二加权得分,将第一加权得分和第二加权得分相加,获取目标得分。A target score obtaining module 120, configured to multiply a first score with a preset first weighted ratio, obtain a first weighted score, multiply a second score with a preset second weighted ratio, and obtain a second weighted score; Add the first weighted score and the second weighted score to obtain the target score.
语音确定模块130,用于若目标得分大于预设得分阈值,则确定待识别语音数据为用户标识对应的目标语音数据。The voice determining module 130 is configured to determine that the voice data to be recognized is target voice data corresponding to a user identifier if the target score is greater than a preset score threshold.
本实施例提供一个或多个存储有计算机可读指令的非易失性可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现实施例中语音模型训练方法,为避免重复,这里不再赘述。或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现实施例中语音模型训练装置的各模块/单元的功能,为避免重复,这里不再赘述。或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现实施例中语音识别方法中各步骤的功能,为避免重复,此处不一一赘述。或者,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器实现实施例中语音识别装置中各模块/单元的功能,为避免重复,此处不一一赘述。This embodiment provides one or more non-volatile readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the speech model training method in the embodiment; to avoid repetition, details are not repeated here. Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the modules/units of the speech model training device in the embodiment; to avoid repetition, details are not repeated here. Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the steps of the speech recognition method in the embodiment; to avoid repetition, details are not repeated here one by one. Alternatively, when the computer-readable instructions are executed by one or more processors, the one or more processors implement the functions of the modules/units of the speech recognition device in the embodiment; to avoid repetition, details are not repeated here one by one.
图10是本申请一实施例提供的计算机设备的示意图。如图10所示,该实施例的计算机设备140包括:处理器141、存储器142以及存储在存储器142中并可在处理器141上运行的计算机可读指令143,该计算机可读指令143被处理器141执行时实现实施例中的语音模型训练方法,为避免重复,此处不一一赘述。或者,该计算机可读指令143被处理器141执行时实现实施例中语音模型训练装置中各模块/单元的功能,为避免重复,此处不一一赘述。或者,该计算机可读指令143被处理器141执行时实现实施例中语音识别方法中各步骤的功能,为避免重复,此处不一一赘述。或者,该计算机可读指令143被处理器141执行时实现实施例中语音识别装置中各模块/单元的功能,为避免重复,此处不一一赘述。FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application. As shown in FIG. 10, the computer device 140 of this embodiment includes a processor 141, a memory 142, and computer-readable instructions 143 stored in the memory 142 and executable on the processor 141. When executed by the processor 141, the computer-readable instructions 143 implement the speech model training method in the embodiment; to avoid repetition, details are not repeated here one by one. Alternatively, when executed by the processor 141, the computer-readable instructions 143 implement the functions of the modules/units of the speech model training device in the embodiment; to avoid repetition, details are not repeated here one by one. Alternatively, when executed by the processor 141, the computer-readable instructions 143 implement the functions of the steps of the speech recognition method in the embodiment; to avoid repetition, details are not repeated here one by one. Alternatively, when executed by the processor 141, the computer-readable instructions 143 implement the functions of the modules/units of the speech recognition device in the embodiment; to avoid repetition, details are not repeated here one by one.
计算机设备140可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。计算机设备可包括,但不仅限于,处理器141、存储器142。本领域技术人员可以理解,图10仅仅是计算机设备140的示例,并不构成对计算机设备140的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如计算机设备还可以包括输入输出设备、网络接入设备、总线等。The computer device 140 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The computer equipment may include, but is not limited to, a processor 141 and a memory 142. Those skilled in the art can understand that FIG. 10 is only an example of the computer device 140, and does not constitute a limitation on the computer device 140. It may include more or fewer components than shown in the figure, or combine some components or different components. For example, computer equipment may also include input and output equipment, network access equipment, and buses.
所称处理器141可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The so-called processor 141 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
存储器142可以是计算机设备140的内部存储单元,例如计算机设备140的硬盘或内存。存储器142也可以是计算机设备140的外部存储设备,例如计算机设备140上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,存储器142还可以既包括计算机设备140的内部存储单元也包括外部存储设备。存储器142用于存储计算机可读指令143以及计算机设备所需的其他程序和数据。存储器142还可以用于暂时地存储已经输出或者将要输出的数据。The memory 142 may be an internal storage unit of the computer device 140, such as a hard disk or a memory of the computer device 140. The memory 142 may also be an external storage device of the computer device 140, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, and a flash memory card (Flash) provided on the computer device 140. Card) and so on. Further, the memory 142 may also include both an internal storage unit of the computer device 140 and an external storage device. The memory 142 is used to store the computer-readable instructions 143 and other programs and data required by the computer device. The memory 142 may also be used to temporarily store data that has been output or is to be output.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that, for the convenience and brevity of the description, only the above-mentioned division of functional units and modules is used as an example. In practical applications, the above functions can be assigned by different functional units, Module completion, that is, dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现, 也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware or in the form of software functional units.
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。The above-mentioned embodiments are only used to describe the technical solution of the present application, but not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still implement the foregoing implementations. The technical solutions described in the examples are modified, or some of the technical features are equivalently replaced; and these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the application, and should be included in Within the scope of this application.

Claims (20)

  1. 一种语音模型训练方法,其特征在于,包括:A speech model training method, comprising:
    获取训练语音数据,基于所述训练语音数据提取训练语音特征;Acquiring training voice data, and extracting training voice features based on the training voice data;
    基于所述训练语音特征获取目标背景模型;Obtaining a target background model based on the training speech features;
    获取目标语音数据,基于所述目标语音数据提取目标语音特征;Acquiring target voice data, and extracting target voice features based on the target voice data;
    采用所述目标背景模型对所述目标语音特征进行自适应处理,获取目标声纹特征识别模型;Using the target background model to adaptively process the target voice feature to obtain a target voiceprint feature recognition model;
    将所述目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型;Inputting the target speech features into a deep neural network for training, and obtaining a target speech feature recognition model;
    将所述目标声纹特征识别模型和所述目标语音特征识别模型关联存储在数据库中。The target voiceprint feature recognition model and the target voice feature recognition model are associatedly stored in a database.
  2. 根据权利要求1所述的语音模型训练方法,其特征在于,所述基于所述训练语音数据提取训练语音特征,包括:The method for training a speech model according to claim 1, wherein the extracting a training speech feature based on the training speech data comprises:
    对所述训练语音数据进行预处理;Preprocessing the training speech data;
    对预处理后的训练语音数据作快速傅里叶变换,获取训练语音数据的频谱,并根据所述频谱获取训练语音数据的功率谱;Performing fast Fourier transform on the pre-processed training voice data to obtain a frequency spectrum of the training voice data, and obtaining a power spectrum of the training voice data according to the frequency spectrum;
    采用梅尔刻度滤波器组处理所述训练语音数据的功率谱,获取训练语音数据的梅尔功率谱;Using a Mel scale filter bank to process the power spectrum of the training speech data, and obtain a Mel power spectrum of the training speech data;
    在所述梅尔功率谱上进行倒谱分析,获取训练语音数据的梅尔频率倒谱系数,并将获取到的梅尔频率倒谱系数确定为所述训练语音特征。A cepstrum analysis is performed on the Mel power spectrum to obtain a Mel frequency cepstrum coefficient of training speech data, and the obtained Mel frequency cepstrum coefficient is determined as the training speech feature.
  3. 根据权利要求2所述的语音模型训练方法,其特征在于,所述对所述训练语音数据进行预处理,包括:The method for training a speech model according to claim 2, wherein the preprocessing the training speech data comprises:
    对所述训练语音数据作预加重处理;Pre-emphasis the training voice data;
    对预加重后的所述训练语音数据进行分帧处理;Performing frame processing on the pre-emphasis training voice data;
    对分帧处理后的所述训练语音数据进行加窗处理。Perform windowing processing on the training speech data after frame processing.
  4. 根据权利要求1所述的语音模型训练方法,其特征在于,所述基于所述训练语音特征获取目标背景模型,包括:The method for training a speech model according to claim 1, wherein the acquiring a target background model based on the training speech features comprises:
    采用所述训练语音特征进行通用背景模型训练,获取通用背景模型;Using the training speech feature to perform a general background model training to obtain a general background model;
    采用奇异值分解对所述通用背景模型进行特征降维处理,获取所述目标背景模型。Singular value decomposition is used to perform feature reduction processing on the universal background model to obtain the target background model.
  5. 根据权利要求1所述的语音模型训练方法,其特征在于,所述将所述目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型,包括:The method for training a speech model according to claim 1, wherein the inputting the target speech features into a deep neural network for training to obtain the target speech feature recognition model comprises:
    初始化深度神经网络模型;Initialize the deep neural network model;
    将所述目标语音特征分组输入到所述深度神经网络模型中,根据前向传播算法获取深度神经网络模型的输出值,目标语音特征的第i组样本在深度神经网络模型的当前层的输出值用公式表示为a^{i,l}=σ(W^l·a^{i,l-1}+b^l),其中,a为输出值,i表示输入的目标语音特征的第i组样本,l为深度神经网络模型的当前层,σ为激活函数,W为权值,l-1为深度神经网络模型的当前层的上一层,b为偏置;Grouping the target speech features and inputting them into the deep neural network model, and obtaining the output value of the deep neural network model according to the forward propagation algorithm, wherein the output value of the i-th group of samples of the target speech features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l·a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target speech features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer above the current layer of the deep neural network model, and b is the bias;
    基于深度神经网络模型的输出值进行误差反传,更新深度神经网络模型各层的权值和偏置,获取所述目标语音特征识别模型,其中,更新权值的计算公式为W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T,l为深度神经网络模型的当前层,W为权值,α为迭代步长,m为输入的目标语音特征的样本总数,δ^{i,l}为当前层的灵敏度;δ^{i,l}=(W^{l+1})^T·δ^{i,l+1}∘σ'(z^{i,l}),z^{i,l}=W^l·a^{i,l-1}+b^l,a^{i,l-1}为上一层的输出,T表示矩阵转置运算,∘表示两个矩阵对应元素相乘的运算(Hadamard积),更新偏置的计算公式为b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}。Performing error back propagation based on the output value of the deep neural network model, and updating the weights and biases of each layer of the deep neural network model to obtain the target speech feature recognition model, wherein the formula for updating the weights is W^l = W^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}·(a^{i,l-1})^T, in which l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target speech features, and δ^{i,l} is the sensitivity of the current layer; δ^{i,l} = (W^{l+1})^T·δ^{i,l+1} ∘ σ'(z^{i,l}), z^{i,l} = W^l·a^{i,l-1} + b^l, a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of two matrices (Hadamard product); the formula for updating the biases is b^l = b^l - (α/m)·Σ_{i=1}^{m} δ^{i,l}.
  6. 一种语音识别方法,其特征在于,包括:A speech recognition method, comprising:
    获取待识别语音数据,所述待识别语音数据与用户标识相关联;Obtaining to-be-recognized voice data, which is associated with a user identifier;
    基于所述用户标识查询数据库,获取关联存储的目标声纹特征识别模型和目标语音特征识别模型,所述目标声纹特征识别模型和所述目标语音特征识别模型是采用权利要求1-5任一项所述语音模型训练方法获取的模型;Querying a database based on the user identifier to obtain an associatively stored target voiceprint feature recognition model and target voice feature recognition model, wherein the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by the speech model training method according to any one of claims 1-5;
    基于所述待识别语音数据,提取待识别语音特征;Extracting features to be recognized based on the to-be-recognized voice data;
    将所述待识别语音特征输入到目标语音特征识别模型,获取第一得分;Inputting the speech feature to be recognized into a target speech feature recognition model to obtain a first score;
    将所述待识别语音数据输入到目标声纹特征识别模型中,获取第二得分;Inputting the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score;
    将所述第一得分与预设的第一加权比例相乘,获取第一加权得分,将所述第二得分与预设的第二加权比例相乘,获取第二加权得分,将所述第一加权得分和所述第二加权得分相加,获取目标得分;Multiplying the first score with a preset first weighted ratio to obtain a first weighted score, multiplying the second score with a preset second weighted ratio to obtain a second weighted score, and Adding a weighted score and the second weighted score to obtain a target score;
    若所述目标得分大于预设得分阈值,则确定所述待识别语音数据为所述用户标识对应的目标语音数据。If the target score is greater than a preset score threshold, it is determined that the speech data to be recognized is target speech data corresponding to the user identification.
  7. 一种语音模型训练装置,其特征在于,包括:A voice model training device, comprising:
    训练语音特征提取模块,用于获取训练语音数据,基于所述训练语音数据提取训练语音特征;A training voice feature extraction module, configured to obtain training voice data, and extract training voice features based on the training voice data;
    目标背景模型获取模块,用于基于所述训练语音特征获取目标背景模型;A target background model acquisition module, configured to acquire a target background model based on the training speech feature;
    目标语音特征提取模块,用于获取目标语音数据,基于所述目标语音数据提取目标语音特征;A target voice feature extraction module, configured to obtain target voice data, and extract target voice features based on the target voice data;
    目标声纹特征识别模型获取模块,用于采用所述目标背景模型对所述目标语音特征进行自适应处理,获取目标声纹特征识别模型;A target voiceprint feature recognition model acquisition module, configured to adaptively process the target voice feature using the target background model to obtain a target voiceprint feature recognition model;
    语音特征识别获取模块,用于将所述目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型;A speech feature recognition acquisition module, configured to input the target speech feature into a deep neural network for training, and obtain a target speech feature recognition model;
    模型存储模块,用于将所述目标声纹特征识别模型和所述目标语音特征识别模型关联存储在数据库中。A model storage module is configured to store the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
  8. 一种语音识别装置,其特征在于,包括:A voice recognition device, comprising:
    待识别语音数据获取模块,用于获取待识别语音数据,所述待识别语音数据与用户标识相关联;A to-be-recognized voice data acquisition module, configured to obtain the to-be-recognized voice data, the to-be-recognized voice data being associated with a user identifier;
    模型获取模块,用于基于所述用户标识查询数据库,获取关联存储的目标声纹特征识别模型和目标语音特征识别模型,所述目标声纹特征识别模型和所述目标语音特征识别模型是采用权利要求1-5任一项所述语音模型训练方法获取的模型;A model acquisition module, configured to query a database based on the user identifier and obtain an associatively stored target voiceprint feature recognition model and target voice feature recognition model, wherein the target voiceprint feature recognition model and the target voice feature recognition model are models obtained by the speech model training method according to any one of claims 1-5;
    待识别语音特征提取模块,用于基于所述待识别语音数据,提取待识别语音特征;Speech feature extraction module for extracting speech features based on the speech data to be identified;
    第一得分获取模块,用于将所述待识别语音特征输入到目标语音特征识别模型,获取第一得分;A first score acquisition module, configured to input the speech feature to be recognized into a target speech feature recognition model to obtain a first score;
    第二得分获取模块,用于将所述待识别语音数据输入到目标声纹特征识别模型中,获取第二得分;A second score obtaining module, configured to input the speech data to be recognized into a target voiceprint feature recognition model to obtain a second score;
    目标得分获取模块,用于将所述第一得分与预设的第一加权比例相乘,获取第一加权得分,将所述第二得分与预设的第二加权比例相乘,获取第二加权得分,将所述第一加权得分和所述第二加权得分相加,获取目标得分;A target score obtaining module, configured to multiply the first score with a preset first weighted ratio, obtain a first weighted score, multiply the second score with a preset second weighted ratio, and obtain a second Weighted score, adding the first weighted score and the second weighted score to obtain a target score;
    语音确定模块,用于若所述目标得分大于预设得分阈值,则确定所述待识别语音数据为所述用户标识对应的目标语音数据。A voice determination module is configured to determine, if the target score is greater than a preset score threshold, the voice data to be identified is target voice data corresponding to the user identifier.
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    获取训练语音数据,基于所述训练语音数据提取训练语音特征;Acquiring training voice data, and extracting training voice features based on the training voice data;
    基于所述训练语音特征获取目标背景模型;Obtaining a target background model based on the training speech features;
    获取目标语音数据,基于所述目标语音数据提取目标语音特征;Acquiring target voice data, and extracting target voice features based on the target voice data;
    采用所述目标背景模型对所述目标语音特征进行自适应处理,获取目标声纹特征识别模型;Using the target background model to adaptively process the target voice feature to obtain a target voiceprint feature recognition model;
    将所述目标语音特征输入到深度神经网络中进行训练,获取目标语音特征识别模型;Inputting the target speech features into a deep neural network for training, and obtaining a target speech feature recognition model;
    将所述目标声纹特征识别模型和所述目标语音特征识别模型关联存储在数据库中。The target voiceprint feature recognition model and the target voice feature recognition model are associatedly stored in a database.
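For illustration only, the following sketch outlines the claim-9 training flow end to end. It is not part of the claims: extract_features(), train_ubm_with_svd(), map_adapt(), and train_dnn() are hypothetical helper names (the underlying procedures are detailed in claims 10-13 and sketched after them), the adaptive processing is assumed to be MAP-style adaptation, and the database layer is reduced to a plain dictionary keyed by user identifier.

```python
# Hypothetical orchestration of the claim-9 training steps; all helper functions
# are placeholders for the procedures detailed in claims 10-13.
def train_speech_models(train_audio, target_audio, user_id, db):
    train_feats = extract_features(train_audio)            # training voice features
    target_background_model = train_ubm_with_svd(train_feats)
    target_feats = extract_features(target_audio)          # target voice features
    # Adapt the background model to the target speaker -> voiceprint feature model.
    voiceprint_model = map_adapt(target_background_model, target_feats)
    # Train a deep neural network on the target features -> voice feature model.
    voice_feature_model = train_dnn(target_feats)
    # Store both models in association under the same user identifier.
    db[user_id] = {"voiceprint": voiceprint_model, "voice_feature": voice_feature_model}
    return db[user_id]
```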
  10. The computer device according to claim 9, characterized in that the extracting training voice features based on the training voice data comprises:
    Preprocessing the training voice data;
    Performing a fast Fourier transform on the preprocessed training voice data to obtain a frequency spectrum of the training voice data, and obtaining a power spectrum of the training voice data from the frequency spectrum;
    Processing the power spectrum of the training voice data with a Mel-scale filter bank to obtain a Mel power spectrum of the training voice data;
    Performing cepstral analysis on the Mel power spectrum to obtain Mel-frequency cepstral coefficients of the training voice data, and determining the obtained Mel-frequency cepstral coefficients as the training voice features.
  11. The computer device according to claim 10, characterized in that the preprocessing the training voice data comprises:
    Performing pre-emphasis processing on the training voice data;
    Performing framing processing on the pre-emphasized training voice data;
    Performing windowing processing on the framed training voice data.
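For illustration only, a minimal NumPy sketch of the feature pipeline described in claims 10 and 11 (pre-emphasis, framing, windowing, fast Fourier transform, Mel filter bank, cepstral analysis). The sample rate, frame length, hop size, filter count, pre-emphasis coefficient, and the use of a Hamming window and a DCT for the cepstral step are assumptions made for this sketch, not values taken from the claims.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filt=26, n_ceps=13):
    # Pre-emphasis (claim 11).
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing (claim 11).
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # FFT -> frequency spectrum -> power spectrum (claim 10).
    n_fft = 512
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # Mel-scale filter bank applied to the power spectrum -> Mel power spectrum.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv_mel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    mel_power = np.dot(power, fbank.T)
    # Cepstral analysis: log of the Mel power spectrum followed by a DCT.
    log_mel = np.log(mel_power + 1e-10)
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filt)))
    return np.dot(log_mel, dct.T)  # one row of Mel-frequency cepstral coefficients per frame
```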
  12. The computer device according to claim 9, characterized in that the obtaining a target background model based on the training voice features comprises:
    Performing universal background model training using the training voice features to obtain a universal background model;
    Performing feature dimensionality reduction on the universal background model using singular value decomposition to obtain the target background model.
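For illustration only, a sketch of claim 12 under stated assumptions: the universal background model is taken to be a diagonal-covariance Gaussian mixture model trained with scikit-learn, and the singular value decomposition is applied to the matrix of component means to obtain a reduced-rank representation. The mixture count, the rank, and the choice of reducing the mean matrix are assumptions, not details given in the claims.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_target_background_model(train_feats, n_mix=64, rank=8):
    """train_feats: (n_frames, n_dims) matrix of pooled training voice features."""
    # Universal background model: a GMM fitted on all training speakers' features.
    ubm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                          max_iter=200, random_state=0).fit(train_feats)
    # Dimensionality reduction via SVD of the stacked component means (assumed target of claim 12).
    U, S, Vt = np.linalg.svd(ubm.means_, full_matrices=False)
    reduced_means = U[:, :rank] @ np.diag(S[:rank])    # rank-limited projection of the means
    return ubm, reduced_means, Vt[:rank]                # components of the target background model
```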
  13. The computer device according to claim 9, characterized in that the inputting the target voice features into a deep neural network for training to obtain a target voice feature recognition model comprises:
    Initializing a deep neural network model;
    Inputting the target voice features into the deep neural network model in groups, and obtaining output values of the deep neural network model according to a forward propagation algorithm, where the output value of the i-th group of samples of the target voice features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target voice features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer preceding the current layer of the deep neural network model, and b is the bias;
    Performing error back-propagation based on the output values of the deep neural network model, updating the weights and biases of each layer of the deep neural network model, and obtaining the target voice feature recognition model, where the weight update formula is
    W^l = W^l - (α/m) Σ_{i=1}^{m} δ^{i,l} (a^{i,l-1})^T,
    where l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target voice features, and δ^{i,l} is the sensitivity of the current layer, with δ^{i,l} = (W^{l+1})^T δ^{i,l+1} ∘ σ'(z^{i,l}) and z^{i,l} = W^l a^{i,l-1} + b^l, where a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of the corresponding entries of two matrices (the Hadamard product); the bias update formula is
    b^l = b^l - (α/m) Σ_{i=1}^{m} δ^{i,l}.
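For illustration only, a minimal NumPy sketch of the claim-13 procedure: a forward pass computing a^{i,l} = σ(W^l a^{i,l-1} + b^l), followed by error back-propagation with the batch updates W^l ← W^l − (α/m) Σ_i δ^{i,l}(a^{i,l-1})^T and b^l ← b^l − (α/m) Σ_i δ^{i,l}. The sigmoid activation, squared-error loss, and the column-per-sample layout are assumptions made for this sketch, not requirements of the claims.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, b, X, Y, alpha=0.1):
    """One forward/backward pass over one group of samples.

    W, b -- lists of per-layer weight matrices and bias column vectors
    X, Y -- (m, d_in) feature batch and (m, d_out) targets
    """
    m = X.shape[0]
    # Forward propagation: a^{i,l} = sigmoid(W^l a^{i,l-1} + b^l), one column per sample.
    a = [X.T]
    for Wl, bl in zip(W, b):
        a.append(sigmoid(Wl @ a[-1] + bl))
    # Output-layer sensitivity for a squared-error loss; sigma'(z) = a * (1 - a) for sigmoid.
    delta = (a[-1] - Y.T) * a[-1] * (1 - a[-1])
    # Back-propagate the sensitivities and apply the batch weight/bias updates.
    for layer in range(len(W) - 1, -1, -1):
        grad_W = (alpha / m) * (delta @ a[layer].T)           # (alpha/m) * sum_i delta^{i,l} (a^{i,l-1})^T
        grad_b = (alpha / m) * delta.sum(axis=1, keepdims=True)
        if layer > 0:
            # Sensitivity of the previous layer: (W^l)^T delta^{i,l} ∘ sigma'(z of previous layer).
            delta = (W[layer].T @ delta) * a[layer] * (1 - a[layer])
        W[layer] -= grad_W
        b[layer] -= grad_b
    return W, b
```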
  14. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    Obtaining voice data to be recognized, the voice data to be recognized being associated with a user identifier;
    Querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target voice feature recognition model stored in association, the target voiceprint feature recognition model and the target voice feature recognition model being models obtained by the voice model training method according to any one of claims 1-5;
    Extracting voice features to be recognized based on the voice data to be recognized;
    Inputting the voice features to be recognized into the target voice feature recognition model to obtain a first score;
    Inputting the voice data to be recognized into the target voiceprint feature recognition model to obtain a second score;
    Multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score;
    If the target score is greater than a preset score threshold, determining that the voice data to be recognized is the target voice data corresponding to the user identifier.
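For illustration only, a sketch of the scoring and decision steps of claims 8 and 14. The weighting ratios and the threshold value are assumptions, and dnn_score() and voiceprint_score() are hypothetical callables standing in for the target voice feature recognition model and the target voiceprint feature recognition model obtained earlier.

```python
def verify_speaker(features, audio, dnn_score, voiceprint_score,
                   w1=0.5, w2=0.5, threshold=0.8):
    first_score = dnn_score(features)        # score from the target voice feature recognition model
    second_score = voiceprint_score(audio)   # score from the target voiceprint feature recognition model
    # Weighted fusion of the two scores, then comparison against the preset threshold.
    target_score = w1 * first_score + w2 * second_score
    return target_score > threshold          # True: utterance matches the enrolled user identifier
```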
  15. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
    Obtaining training voice data, and extracting training voice features based on the training voice data;
    Obtaining a target background model based on the training voice features;
    Obtaining target voice data, and extracting target voice features based on the target voice data;
    Adaptively processing the target voice features using the target background model to obtain a target voiceprint feature recognition model;
    Inputting the target voice features into a deep neural network for training to obtain a target voice feature recognition model;
    Storing the target voiceprint feature recognition model and the target voice feature recognition model in a database in association.
  16. The non-volatile readable storage medium according to claim 15, characterized in that the extracting training voice features based on the training voice data comprises:
    Preprocessing the training voice data;
    Performing a fast Fourier transform on the preprocessed training voice data to obtain a frequency spectrum of the training voice data, and obtaining a power spectrum of the training voice data from the frequency spectrum;
    Processing the power spectrum of the training voice data with a Mel-scale filter bank to obtain a Mel power spectrum of the training voice data;
    Performing cepstral analysis on the Mel power spectrum to obtain Mel-frequency cepstral coefficients of the training voice data, and determining the obtained Mel-frequency cepstral coefficients as the training voice features.
  17. The non-volatile readable storage medium according to claim 16, characterized in that the preprocessing the training voice data comprises:
    Performing pre-emphasis processing on the training voice data;
    Performing framing processing on the pre-emphasized training voice data;
    Performing windowing processing on the framed training voice data.
  18. The non-volatile readable storage medium according to claim 15, characterized in that the obtaining a target background model based on the training voice features comprises:
    Performing universal background model training using the training voice features to obtain a universal background model;
    Performing feature dimensionality reduction on the universal background model using singular value decomposition to obtain the target background model.
  19. The non-volatile readable storage medium according to claim 15, characterized in that the inputting the target voice features into a deep neural network for training to obtain a target voice feature recognition model comprises:
    Initializing a deep neural network model;
    Inputting the target voice features into the deep neural network model in groups, and obtaining output values of the deep neural network model according to a forward propagation algorithm, where the output value of the i-th group of samples of the target voice features at the current layer of the deep neural network model is expressed as a^{i,l} = σ(W^l a^{i,l-1} + b^l), where a is the output value, i denotes the i-th group of samples of the input target voice features, l is the current layer of the deep neural network model, σ is the activation function, W is the weight, l-1 is the layer preceding the current layer of the deep neural network model, and b is the bias;
    Performing error back-propagation based on the output values of the deep neural network model, updating the weights and biases of each layer of the deep neural network model, and obtaining the target voice feature recognition model, where the weight update formula is
    W^l = W^l - (α/m) Σ_{i=1}^{m} δ^{i,l} (a^{i,l-1})^T,
    where l is the current layer of the deep neural network model, W is the weight, α is the iteration step size, m is the total number of samples of the input target voice features, and δ^{i,l} is the sensitivity of the current layer, with δ^{i,l} = (W^{l+1})^T δ^{i,l+1} ∘ σ'(z^{i,l}) and z^{i,l} = W^l a^{i,l-1} + b^l, where a^{i,l-1} is the output of the previous layer, T denotes the matrix transposition operation, and ∘ denotes element-wise multiplication of the corresponding entries of two matrices (the Hadamard product); the bias update formula is
    b^l = b^l - (α/m) Σ_{i=1}^{m} δ^{i,l}.
  20. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
    Obtaining voice data to be recognized, the voice data to be recognized being associated with a user identifier;
    Querying a database based on the user identifier to obtain a target voiceprint feature recognition model and a target voice feature recognition model stored in association, the target voiceprint feature recognition model and the target voice feature recognition model being models obtained by the voice model training method according to any one of claims 1-5;
    Extracting voice features to be recognized based on the voice data to be recognized;
    Inputting the voice features to be recognized into the target voice feature recognition model to obtain a first score;
    Inputting the voice data to be recognized into the target voiceprint feature recognition model to obtain a second score;
    Multiplying the first score by a preset first weighting ratio to obtain a first weighted score, multiplying the second score by a preset second weighting ratio to obtain a second weighted score, and adding the first weighted score and the second weighted score to obtain a target score;
    If the target score is greater than a preset score threshold, determining that the voice data to be recognized is the target voice data corresponding to the user identifier.
PCT/CN2018/094348 2018-05-31 2018-07-03 Voice model training method, voice recognition method, device and equipment, and medium WO2019227574A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810551458.4A CN108922515A (en) 2018-05-31 2018-05-31 Speech model training method, audio recognition method, device, equipment and medium
CN201810551458.4 2018-05-31

Publications (1)

Publication Number Publication Date
WO2019227574A1 true WO2019227574A1 (en) 2019-12-05

Family

ID=64420091

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094348 WO2019227574A1 (en) 2018-05-31 2018-07-03 Voice model training method, voice recognition method, device and equipment, and medium

Country Status (2)

Country Link
CN (1) CN108922515A (en)
WO (1) WO2019227574A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448726A (en) * 2019-01-14 2019-03-08 李庆湧 A kind of method of adjustment and system of voice control accuracy rate
CN109817246B (en) * 2019-02-27 2023-04-18 平安科技(深圳)有限公司 Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
CN112116909A (en) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 Voice recognition method, device and system
CN110706690B (en) * 2019-09-16 2024-06-25 平安科技(深圳)有限公司 Speech recognition method and device thereof
CN110928583B (en) * 2019-10-10 2020-12-29 珠海格力电器股份有限公司 Terminal awakening method, device, equipment and computer readable storage medium
CN110942779A (en) * 2019-11-13 2020-03-31 苏宁云计算有限公司 Noise processing method, device and system
CN113457096B (en) * 2020-03-31 2022-06-24 荣耀终端有限公司 Method for detecting basketball movement based on wearable device and wearable device
CN113223537B (en) * 2020-04-30 2022-03-25 浙江大学 Voice training data iterative updating method based on stage test feedback
CN111883175B (en) * 2020-06-09 2022-06-07 河北悦舒诚信息科技有限公司 Voiceprint library-based oil station service quality improving method
CN112599136A (en) * 2020-12-15 2021-04-02 江苏惠通集团有限责任公司 Voice recognition method and device based on voiceprint recognition, storage medium and terminal
CN112669820B (en) * 2020-12-16 2023-08-04 平安科技(深圳)有限公司 Examination cheating recognition method and device based on voice recognition and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194455A (en) * 2010-03-17 2011-09-21 博石金(北京)信息技术有限公司 Voiceprint identification method irrelevant to speak content
CN104217152A (en) * 2014-09-23 2014-12-17 陈包容 Implementation method and device for mobile terminal to enter application program under stand-by state
CN104992705A (en) * 2015-05-20 2015-10-21 普强信息技术(北京)有限公司 English oral automatic grading method and system
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024455B (en) * 2009-09-10 2014-09-17 索尼株式会社 Speaker recognition system and method
US9401148B2 (en) * 2013-11-04 2016-07-26 Google Inc. Speaker verification using neural networks
CN105895104B (en) * 2014-05-04 2019-09-03 讯飞智元信息科技有限公司 Speaker adaptation recognition methods and system

Also Published As

Publication number Publication date
CN108922515A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
WO2019227574A1 (en) Voice model training method, voice recognition method, device and equipment, and medium
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
US11996091B2 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
WO2018107810A1 (en) Voiceprint recognition method and apparatus, and electronic device and medium
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
WO2019232851A1 (en) Method and apparatus for training speech differentiation model, and computer device and storage medium
WO2019237517A1 (en) Speaker clustering method and apparatus, and computer device and storage medium
CN111968666B (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
Poorjam et al. Height estimation from speech signals using i-vectors and least-squares support vector regression
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
CN112767927A (en) Method, device, terminal and storage medium for extracting voice features
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
WO2022143723A1 (en) Voice recognition model training method, voice recognition method, and corresponding device
CN106297768B (en) Speech recognition method
CN116580708A (en) Intelligent voice processing method and system
CN115565548A (en) Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment
Medikonda et al. Higher order information set based features for text-independent speaker identification
Medikonda et al. An information set-based robust text-independent speaker authentication
CN108573698B (en) Voice noise reduction method based on gender fusion information
Kangala et al. A Fractional Ebola Optimization Search Algorithm Approach for Enhanced Speaker Diarization.
Ali et al. The identification and localization of speaker using fusion techniques and machine learning techniques
Prasanna Kumar et al. An unsupervised approach for co-channel speech separation using Hilbert–Huang transform and Fuzzy C-Means clustering
Asaei et al. Investigation of kNN classifier on posterior features towards application in automatic speech recognition
Mavaddati Blind Voice Separation Based on Empirical Mode Decomposition and Grey Wolf Optimizer Algorithm.
Feng et al. Underwater acoustic feature extraction based on restricted Boltzmann machine

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920316

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18920316

Country of ref document: EP

Kind code of ref document: A1