WO2020177380A1 - Voiceprint detection method, apparatus and device based on short text, and storage medium

Voiceprint detection method, apparatus and device based on short text, and storage medium

Info

Publication number
WO2020177380A1
WO2020177380A1 (PCT/CN2019/117731)
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
voice signal
neural network
vector
deep neural
Prior art date
Application number
PCT/CN2019/117731
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
周新宇
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020177380A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • This application relates to the field of information technology, and in particular to a short text-based voiceprint detection method, device, equipment and storage medium.
  • Voiceprint detection is a common and effective identity recognition method that can be applied to a series of scenarios requiring identity authentication, such as online payment, voiceprint lock control, survival authentication, and Internet of Things device verification. It is especially useful in remote verification where video image verification is inconvenient, since it is not restricted by the device at all.
  • the embodiments of the present application provide a short text-based voiceprint detection method, device, equipment, and storage medium to solve the problems of lengthy voice signals, large amounts of sample information, and high computing resource requirements in existing voiceprint detection methods.
  • a voiceprint detection method based on short text including:
  • the Mel frequency cepstrum coefficients are passed as input into a pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the training samples and speech signals are both short texts.
  • the acquiring training samples and using the training samples to train a preset deep neural network includes:
  • the Mel frequency cepstrum coefficients with user labels are used as input vectors to the preset deep neural network for training;
  • the Mel frequency cepstral coefficients with user labels are used as input vectors and passed into the modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition result for each Mel frequency cepstral coefficient reaches the specified threshold, at which point iteration stops.
  • the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
  • the comparing the voiceprint vector of the voice signal with a pre-stored voiceprint vector in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result includes:
  • the preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient includes:
  • the discrete cosine transform is performed on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the speech signal.
  • a voiceprint detection device based on short text including:
  • the training module is used to obtain training samples, and use the training samples to train a preset deep neural network
  • the signal acquisition module is used to acquire the voice signal to be recognized
  • the feature extraction module is configured to preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
  • the feature acquisition module is used to pass the Mel frequency cepstrum coefficients as input into a pre-trained deep neural network and to acquire the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the detection module is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and output a voiceprint detection result according to the comparison result;
  • the training samples and speech signals are both short texts.
  • the detection module includes:
  • the comparison unit is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library
  • the first result output unit is configured to obtain user information corresponding to the pre-stored voiceprint vector if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, and output the user information;
  • the second result output unit is configured to output a prompt message that the detection fails if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library.
  • the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
  • the Mel frequency cepstrum coefficients are passed as input into a pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the training samples and speech signals are both short texts.
  • one or more non-volatile readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the Mel frequency cepstrum coefficients are passed as input into a pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the training samples and speech signals are both short texts.
  • FIG. 1 is a flowchart of a voiceprint detection method based on short text in an embodiment of the present application;
  • FIG. 2 is a flowchart of step S101 in the voiceprint detection method based on short text in an embodiment of the present application;
  • FIG. 3 is a flowchart of step S103 in the voiceprint detection method based on short text in an embodiment of the present application;
  • FIG. 4 is a flowchart of step S105 in the short text-based voiceprint detection method in an embodiment of the present application;
  • FIG. 5 is a functional block diagram of a voiceprint detection device based on short text in an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a computer device in an embodiment of the present application.
  • the voiceprint detection method based on short text provided by the embodiment of the present application is applied to a server.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for voiceprint detection based on short text is provided, which includes the following steps:
  • step S101 a training sample is obtained, and the training sample is used to train a preset deep neural network.
  • the embodiment of the application redesigned a deep neural network suitable for short text.
  • the deep neural network includes an input layer, a four-layer fully connected layer, and an output layer.
  • Each fully connected layer takes a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
  • the deep neural network is not limited by the model structure, and can use short texts as training samples and input vectors, thereby reducing data requirements.
  • here, a short text is a voice signal of relatively short length, for example a sentence-length voice signal.
  • optionally, the short text may be defined by a specified length, i.e., a voice signal whose length is less than or equal to the specified length.
  • voice samples of multiple users are collected as training samples, and a preset deep neural network is trained based on the training samples.
  • the step S101 includes:
  • step S201 voice samples of multiple users are obtained as training samples.
  • voice samples corresponding to multiple users can be collected in advance in specific application scenarios.
  • voice samples corresponding to each user can be collected through channels such as professional knowledge bases, network databases, etc., as training samples.
  • step S202 the training samples of each user are preprocessed, and feature extraction is performed on the preprocessed training samples to obtain MFCC features.
  • the MFCC feature (Mel-scale Frequency Cepstral Coefficients, MFCC for short) is a discriminative component of the speech signal: a cepstral parameter extracted in the Mel-scale frequency domain. Because it takes into account the human ear's perception of different frequencies, it is especially suitable for speech recognition and speaker recognition.
  • the embodiment of the present application designs a deep neural network based on the MFCC feature, and uses the MFCC feature as the input of the deep neural network. Before training the deep neural network, first perform preprocessing and feature extraction on the user samples to obtain corresponding MFCC features. The preprocessing and feature extraction of the training samples of the user are the same as step S103. For details, please refer to the description of step S103, which will not be repeated here.
  • a set of 128-dimensional MFCC features corresponding to the training sample is obtained by performing feature extraction on the preprocessed training sample.
  • the 128-dimensional MFCC feature is used as the input vector of the deep neural network.
  • step S203 the MFCC feature of each user is tagged with a user tag.
  • the user tag is used to identify the speaker to which the MFCC feature belongs.
  • Different users have different user tags for their corresponding MFCC features.
  • the 128-dimensional MFCC feature of each user needs to be tagged with a corresponding user label.
  • The following example illustrates this. Assume there are three users: user 1, user 2, and user 3. In step S203, user 1's MFCC features are tagged with the user label "01", user 2's MFCC features with the user label "02", and user 3's MFCC features with the user label "03".
  • the user tag may also be a tag of other forms.
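  • As a purely hypothetical illustration of step S203 (none of the names below come from the patent), tagging can be as simple as pairing each user's 128-dimensional MFCC feature vectors with that user's label; the integer index form is convenient for the cross-entropy style training loss mentioned later.

```python
# mfcc_by_user is assumed to map a user label ("01", "02", "03", ...) to a list
# of that user's 128-dimensional MFCC feature vectors.
label_to_index = {label: i for i, label in enumerate(sorted(mfcc_by_user))}
labelled_features = [(mfcc, label_to_index[user_label])
                     for user_label, mfcc_list in mfcc_by_user.items()
                     for mfcc in mfcc_list]
```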
  • step S204 the MFCC feature with the user tag is used as an input vector into a preset deep neural network for training.
  • the 128-dimensional MFCC feature with the same user label is used as an input vector, and then passed into a preset deep neural network for training, and the recognition result of the user is obtained.
  • the preset deep neural network includes an input layer, a four-layer fully connected layer, and an output layer.
  • Each fully connected layer takes a 12-dimensional input and uses the maxout excitation function; the output expression of hidden-layer node i is h_i(x) = max_{j ∈ [1, k]} z_{ij}, with z_{ij} = x^T W_{·ij} + b_{ij}, where:
  • b represents the bias values;
  • W represents the three-dimensional matrix composed of the parameters, with size d × m × k;
  • d represents the number of nodes in the input layer;
  • m represents the number of nodes in the hidden layer;
  • k represents the number of hidden hidden-layer nodes corresponding to each hidden-layer node, and the k hidden hidden-layer nodes all have linear outputs;
  • each node of the maxout excitation function takes the maximum value among the output values of its k hidden hidden-layer nodes.
  • In this embodiment, the number of nodes m in each fully connected layer is 12; for each of the 12 nodes, the maximum of the output values of its k hidden hidden-layer nodes generated by the maxout excitation function is taken, and the maxima corresponding to the 12 nodes are combined into the output vector of the fully connected layer.
  • the embodiment of the present application uses the maxout excitation function to give the fully connected layers of the deep neural network a non-linear transformation.
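  • To make the maxout computation concrete, the following is a minimal Python/PyTorch sketch of one maxout fully connected layer as described above: each of the m output nodes takes the maximum over k linearly computed "hidden hidden" nodes, with the parameters W (d × m × k) and bias b folded into one linear map. PyTorch and the class name Maxout are illustrative choices, not part of the patent.

```python
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: each of the m output nodes is the max over k linear 'hidden hidden' nodes."""
    def __init__(self, d_in, m_out, k):
        super().__init__()
        self.m_out, self.k = m_out, k
        # One linear map produces all m*k candidate activations
        # (the d x m x k parameter tensor W plus the bias b of the text).
        self.linear = nn.Linear(d_in, m_out * k)

    def forward(self, x):
        z = self.linear(x)                       # (batch, m*k) candidate outputs z_ij
        z = z.view(-1, self.m_out, self.k)       # (batch, m, k)
        return z.max(dim=2).values               # max over the k candidates per node
```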
  • the deep neural network includes four fully connected layers, which are respectively denoted as the first fully connected layer, the second fully connected layer, the third fully connected layer, and the fourth fully connected layer.
  • the MFCC features with user labels are first passed through the first fully connected layer; then the output vector of the first fully connected layer is used as the input vector of the second fully connected layer, the output vector of the second fully connected layer is used as the input vector of the third fully connected layer, the output vector of the third fully connected layer is used as the input vector of the fourth fully connected layer, and the output vector of the fourth fully connected layer is used as the input vector of the output layer.
  • the embodiment of the present application adopts a discard strategy, that is, a dropout strategy, when training the third and fourth fully connected layers;
  • the first discard probability and the second discard probability are set according to actual requirements; in the embodiment of the present application, both are preferably 0.5.
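  • Putting the layer description together, here is a hedged sketch of the full network: an input layer, four 12-node maxout fully connected layers (each with a 12-dimensional input), dropout with probability 0.5 applied when training the third and fourth layers, and an output layer over the enrolled users. It reuses the Maxout module sketched above; how the 128-dimensional MFCC input is reduced to 12 dimensions by the input layer, the value of k, and the linear output layer are assumptions, not details fixed by the patent.

```python
import torch.nn as nn

class ShortTextVoiceprintDNN(nn.Module):
    """Sketch of the described short-text voiceprint network."""
    def __init__(self, num_users, k=3):
        super().__init__()
        self.input_layer = nn.Linear(128, 12)    # input layer (assumed linear projection of the 128-dim MFCC)
        self.fc1 = Maxout(12, 12, k)
        self.fc2 = Maxout(12, 12, k)
        self.fc3 = Maxout(12, 12, k)
        self.fc4 = Maxout(12, 12, k)
        self.drop = nn.Dropout(p=0.5)            # discard (dropout) strategy; active only in training mode
        self.output_layer = nn.Linear(12, num_users)

    def forward(self, x, return_dvector=False):
        h = self.input_layer(x)
        h = self.fc1(h)
        h = self.fc2(h)
        h = self.drop(self.fc3(h))               # dropout on the 3rd fully connected layer
        h = self.drop(self.fc4(h))               # dropout on the 4th fully connected layer
        if return_dvector:
            return h                             # d-vector: output of the last fully connected layer
        return self.output_layer(h)              # per-user recognition scores
```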
  • step S205 a preset loss function is used to calculate the error between the recognition result of each MFCC feature through the deep neural network and the corresponding user tag, and the parameters of the deep neural network are modified according to the error .
  • each fully connected layer uses a maxout excitation function, which includes a three-dimensional parameter matrix W and a bias value b.
  • the error between the recognition result of each MFCC feature and the corresponding user tag is calculated using a preset loss function, and the parameter matrix W and the bias value b of the maxout excitation function in the deep neural network are modified by propagating the error back.
  • the loss function includes, but is not limited to, a cross-entropy loss function and a squared loss function.
  • In step S206, the MFCC features with user tags are passed as input vectors into the modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition result for each MFCC feature reaches the specified threshold, at which point iteration stops.
  • the deep neural network whose parameters have been modified in step S205 is used for the next training, that is, the MFCC features with user tags are used as the input vector and then passed into the modified deep neural network for training.
  • the training process is the same as that in step S204.
  • Steps S204, S205, and S206 are repeated until the accuracy of the deep neural network's recognition results for the MFCC features of all users reaches the specified threshold, that is, until the probability that the deep neural network's recognition result for each MFCC feature matches the corresponding user label reaches the specified threshold. This indicates that every parameter in the deep neural network has been adjusted adequately, the training of the deep neural network is deemed complete, and iteration stops.
  • the trained deep neural network can be used to extract the voiceprint vector from the speech signal.
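  • A minimal sketch of the training loop of steps S204-S206, assuming the network and data layout of the earlier sketches (features: a float tensor of labelled 128-dimensional MFCC vectors; labels: the integer user tags). The optimizer, learning rate, accuracy threshold and epoch cap are illustrative values; the patent only specifies computing the error with a loss function, correcting the parameters, and iterating until the recognition accuracy reaches the threshold.

```python
import torch
import torch.nn as nn

def train_until_threshold(model, features, labels, threshold=0.95, lr=1e-3, max_epochs=200):
    """Repeat S204-S206: forward pass, loss, parameter correction, until accuracy >= threshold."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()        # the cross-entropy loss mentioned in the text
    model.train()
    for epoch in range(max_epochs):
        optimizer.zero_grad()
        logits = model(features)
        loss = criterion(logits, labels)
        loss.backward()                      # propagate the error back to W and b of every maxout layer
        optimizer.step()
        accuracy = (logits.argmax(dim=1) == labels).float().mean().item()
        if accuracy >= threshold:            # stop iterating once the specified threshold is reached
            break
    return model
```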
  • step S102 a voice signal to be recognized is acquired.
  • the voice signal to be recognized is a short text, that is, a short-length voice signal, such as a sentence-length voice signal, so as to reduce the requirements for data.
  • the acquired voice signal to be recognized should come from the user to be recognized.
  • the voice signal to be recognized may be one voice signal or multiple voice signals.
  • step S103 preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the MFCC feature.
  • In step S103, before using the deep neural network, feature extraction is first performed on the speech signal to be recognized to obtain the corresponding MFCC feature.
  • the step S103 includes:
  • step S301 framing processing is performed on the waveform of the voice signal to be recognized.
  • Framing refers to cutting the indefinite-length waveform of the voice signal into small fixed-length segments, usually 10-30 milliseconds per frame. The speech signal changes rapidly, whereas the Fourier transform is suited to analyzing stationary signals; by framing the waveform of the speech signal, the side-lobe intensity after the Fourier transform can be reduced and the quality of the resulting spectrum improved.
  • step S302 after framing processing, windowing processing is performed on each frame signal.
  • each frame signal is windowed to smooth the speech signal.
  • a Hamming window can be used for smoothing. Compared with a rectangular window function, the Hamming window enhances the continuity of the left and right ends of the speech signal, and can effectively reduce the intensity of side lobes and spectrum leakage after Fourier transform.
  • step S303 the discrete Fourier transform is performed on each frame signal after the windowing process to obtain the frequency spectrum corresponding to the frame signal.
  • step S304 the power spectrum of the speech signal is calculated according to the spectrum corresponding to all frame signals.
  • the spectrum obtained from the discrete Fourier transform describes the energy distribution of the signal in the frequency domain;
  • the energy differs across frequency bands, and different phonemes have different energy spectra, so the modulus square of the frequency spectrum of the speech signal is taken to obtain the power spectrum of the speech signal.
  • step S305 the Mel filter bank is calculated according to the power spectrum.
  • the Mel filter bank is a set of nonlinearly distributed filter banks, which are densely distributed in the low-frequency part and sparsely distributed in the high-frequency part, which can better meet the human hearing characteristics.
  • a filter bank consisting of n triangular filters is applied to the voice signal, that is, the power spectrum of the voice signal is multiplied by the set of n triangular filters, so that the power spectrum of the voice signal is transformed into an n-dimensional vector.
  • the triangular filter can eliminate the effect of harmonics, highlight the formant of the original voice signal, and thereby reduce the amount of data.
  • step S306 logarithmic operation is performed on the output of each Mel filter to obtain logarithmic energy.
  • Each element of the n-dimensional vector obtained in step S305 is the output of one mel filter in the mel filter bank. The embodiment of the present application further takes the logarithm of each element of this n-dimensional vector to obtain the logarithmic energy output by the Mel filter bank, that is, the log-mel filter bank energies. The logarithmic energy is used for the subsequent cepstrum analysis.
  • step S307 the discrete cosine transform is performed on the logarithmic energy to obtain the MFCC feature of the speech signal.
  • After obtaining the logarithmic energy of the voice signal in step S306 above, the embodiment of the present application performs a discrete cosine transform on the logarithmic energy and takes the low-order 128-dimensional coefficients of the output as the MFCC feature of the voice signal.
  • the output of the discrete cosine transform has a good energy-compaction property: the larger values are concentrated in the low-order part near the upper-left corner, while the remaining part contains a large number of values that are zero or close to zero.
  • the embodiment of the present application takes the low-order 128-dimensional values of the output as the MFCC feature, so that the amount of data can be further compressed.
  • the MFCC feature does not depend on the nature of the signal and imposes no restrictions on the input signal; it is highly robust, conforms to the auditory characteristics of the human ear, and still offers good recognition performance when the signal-to-noise ratio is reduced.
  • the MFCC feature is used as the sound feature of the voice signal to be recognized, and is transmitted to the deep neural network for recognition, which can improve the accuracy of deep neural network recognition.
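  • The following is a minimal NumPy/SciPy sketch of the feature-extraction pipeline of steps S301-S307 (framing, Hamming windowing, DFT, power spectrum, triangular mel filter bank, log energy, DCT). The sample rate, frame length, hop size, FFT size and number of filters are assumptions; the patent only fixes the final 128-dimensional MFCC feature, and since it does not fully specify how the per-frame DCT coefficients are reduced to that single 128-dimensional vector, this sketch simply keeps the first 128 coefficients of the flattened output.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale: dense at low, sparse at high frequencies.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def extract_mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=26, n_coeffs=128):
    """Steps S301-S307 on a 1-D waveform; returns (up to) a 128-dimensional MFCC feature vector."""
    # S301: cut the waveform into fixed-length frames (here 25 ms frames with a 10 ms hop).
    frames = np.array([signal[i:i + frame_len]
                       for i in range(0, len(signal) - frame_len + 1, hop)])
    # S302: apply a Hamming window to smooth each frame.
    frames = frames * np.hamming(frame_len)
    # S303: discrete Fourier transform of every windowed frame.
    spectrum = np.fft.rfft(frames, n=n_fft)
    # S304: modulus squared of the spectrum gives the power spectrum.
    power = (np.abs(spectrum) ** 2) / n_fft
    # S305: multiply the power spectrum by the triangular mel filter bank (n-dimensional vector per frame).
    fbank_out = power @ mel_filterbank(n_filters, n_fft, sr).T
    # S306: logarithm of each filter output (log-mel filter bank energies).
    log_energy = np.log(fbank_out + 1e-10)
    # S307: discrete cosine transform, then keep the low-order coefficients as the MFCC feature.
    mfcc = dct(log_energy, type=2, axis=1, norm='ortho')
    return mfcc.flatten()[:n_coeffs]   # for very short signals this may hold fewer than 128 values
```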
  • step S104 the MFCC feature is input to a pre-trained deep neural network, and the output vector of the deep neural network in the last fully connected layer is obtained as the voiceprint vector of the speech signal.
  • Each element in the voiceprint vector represents the characteristics of the voice signal.
  • the MFCC feature of the voice signal is obtained, the MFCC feature is passed as an input to a pre-trained deep neural network, and the voice signal is recognized based on the MFCC feature through the deep neural network.
  • the pre-trained deep neural network includes four fully connected layers, each containing 12 nodes, and a 12-dimensional output vector is obtained through the maxout excitation function.
  • the output vector of the neural network in the last fully connected layer is obtained as the d-vector vector of the speech signal.
  • the d-vector vector is the voiceprint vector of the voice signal, and each element in it represents the voiceprint feature of the voice signal.
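  • For illustration, extracting the d-vector of one utterance with the sketches above might look like the following; model and extract_mfcc are the names from the earlier sketches and are assumptions, not names used by the patent.

```python
import torch

def extract_voiceprint(model, signal):
    """Return the d-vector of one short-text utterance: the output of the trained
    network's last fully connected layer (see the earlier sketches for the assumed
    model and extract_mfcc helpers)."""
    model.eval()                                   # disable dropout at recognition time
    with torch.no_grad():
        mfcc = torch.tensor(extract_mfcc(signal), dtype=torch.float32).unsqueeze(0)
        return model(mfcc, return_dvector=True).squeeze(0)   # 12-dimensional voiceprint vector
```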
  • step S105 the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vector in the voiceprint model library, and the voiceprint detection result is output according to the comparison result.
  • the voiceprint model library is set according to needs in combination with the application scenarios of identity authentication, such as online payment, voiceprint lock control, and survival authentication.
  • the user who needs to be authenticated is identified in advance through the deep neural network, and the voiceprint vector is extracted and entered into the voiceprint model library.
  • the voiceprint vector of the voice signal to be recognized is compared with the pre-stored voiceprint vector in the voiceprint model library to perform speaker discrimination of the voice signal.
  • the step S105 includes:
  • step S401 the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vector in the voiceprint model library.
  • the embodiment of the present application compares the voiceprint vector of the voice signal with each pre-stored voiceprint vector in the voiceprint model library to determine whether the elements in the two are the same.
  • step S402 if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, user information corresponding to the pre-stored voiceprint vector is obtained, and the user information is output.
  • If a matching pre-stored voiceprint vector exists, the speaker of the voice signal is an authenticated user in the voiceprint model library; the user information corresponding to the pre-stored voiceprint vector is obtained and output, thereby completing the recognition of the voice signal to be recognized.
  • step S403 if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, a prompt message indicating that the detection fails is output.
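  • A minimal sketch of the comparison of steps S401-S403, assuming the voiceprint model library is a dictionary mapping user information to the pre-stored d-vectors produced with the earlier sketches. The patent only asks whether the two vectors are "the same"; the element-wise tolerance used here is an assumption.

```python
import torch

def detect_voiceprint(d_vector, model_library, tolerance=1e-3):
    """S401-S403: compare the voiceprint vector with every pre-stored vector in the library."""
    for user_info, stored in model_library.items():
        if torch.allclose(d_vector, stored, atol=tolerance):
            return user_info          # S402: matching pre-stored vector found, output the user information
    return None                       # S403: no match, the detection fails
```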
  • the short text-based voiceprint detection method described in the embodiments of the present application can be applied to a series of application scenarios that need to be combined with identity authentication, such as online payment, voiceprint lock control, and survival authentication, and can also be used in IoT device verification.
  • Especially in remote verification where video image verification is inconvenient, the method is not restricted by equipment at all, and identity can be confirmed by telephone, which can greatly reduce the cost of remote verification.
  • The embodiment of the present application redesigns in advance a deep neural network suited to short text and trains the preset deep neural network with short-text training samples. When performing voiceprint detection, the voice signal to be recognized is obtained, the voice signal being a short text; the voice signal to be recognized is preprocessed, and feature extraction is performed on the preprocessed voice signal to obtain the MFCC feature; the MFCC feature is passed as input into the pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal; the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vectors in the voiceprint model library, and the voiceprint detection result is output according to the comparison result. Voiceprint detection based on short text is thus realized, the input vector of the model is greatly reduced, and the problems of lengthy voice signals, large amounts of sample information, and high computing resource requirements in existing voiceprint detection methods are solved.
  • a short text-based voiceprint detection device is provided, and the short text-based voiceprint detection device corresponds to the short text-based voiceprint detection method in the foregoing embodiment.
  • the short text-based voiceprint detection device includes a training module, a signal acquisition module, a feature extraction module, a feature acquisition module, and a detection module.
  • the detailed description of each functional module is as follows:
  • the training module 51 is used to obtain training samples and to use the training samples to train a preset deep neural network;
  • the signal acquisition module 52 is used to acquire the voice signal to be recognized
  • the feature extraction module 53 is configured to preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
  • the feature acquisition module 54 is used to pass the Mel frequency cepstrum coefficients as input into a pre-trained deep neural network and to acquire the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
  • the detection module 55 is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and output a voiceprint detection result according to the comparison result;
  • the training samples and speech signals are both short texts.
  • the training module 51 includes:
  • the sample acquisition unit is used to acquire voice samples of multiple users as training samples
  • the feature extraction unit is configured to preprocess the training samples of each user, and perform feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
  • the tag unit is used to tag the Mel frequency cepstrum coefficient of each user with a user tag
  • the training unit is used to input the Mel frequency cepstrum coefficients with user tags as input vectors into the preset deep neural network for training;
  • the parameter modification unit is used to calculate, using a preset loss function, the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag, and to modify the parameters of the deep neural network according to the error;
  • the training unit is also used to pass the Mel frequency cepstrum coefficients with user labels as input vectors into the modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition result for each Mel frequency cepstrum coefficient reaches the specified threshold, at which point iteration stops.
  • the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a dropout strategy.
  • the feature extraction module 53 includes:
  • the framing unit is configured to perform framing processing on the waveform diagram of the voice signal to be recognized
  • the windowing unit is used to perform windowing processing on each frame of signal after framing processing
  • a transforming unit for performing discrete Fourier transform on each frame signal after windowing processing to obtain the frequency spectrum corresponding to the frame signal
  • a power spectrum calculation unit configured to calculate the power spectrum of the voice signal according to the spectrum corresponding to all frame signals
  • a filter bank calculation unit for calculating a mel filter bank according to the power spectrum
  • Logarithmic unit used to perform logarithmic operation on the output of each mel filter to obtain logarithmic energy
  • the cosine transform unit is configured to perform discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the voice signal.
  • the detection module 55 includes:
  • the comparison unit is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library
  • the first result output unit is configured to obtain user information corresponding to the pre-stored voiceprint vector if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, and output the user information;
  • the second result output unit is configured to output a prompt message that the detection fails if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library.
  • each module in the aforementioned short text-based voiceprint detection device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in hardware in, or be independent of, the processor of the computer device, or may be stored in software in the memory of the computer device, so that the processor can call them and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 6.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a short text-based voiceprint detection method.
  • a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • the training samples and speech signals are both short texts.
  • one or more non-volatile readable storage media storing computer readable instructions are provided.
  • When the computer readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the training samples and speech signals are both short texts.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), memory-bus dynamic RAM (RDRAM), and so on.

Abstract

Disclosed are a voiceprint detection method, apparatus and device based on a short text, and a storage medium. The method comprises: training a preset deep neural network by means of a training sample; acquiring a speech signal to be recognized; preprocessing the speech signal to be recognized, and carrying out feature extraction on the preprocessed speech signal to obtain a mel-frequency cepstral coefficient; taking the mel-frequency cepstral coefficient as an input and transmitting same into the pretrained deep neural network, and acquiring an output vector of the deep neural network on a last full connection layer and taking same as a voiceprint vector of the speech signal; and comparing the voiceprint vector of the speech signal with a voiceprint vector prestored in a voiceprint model library, and outputting a voiceprint detection result according to a comparison result, wherein the training sample and the speech signal are both a short text. The present application solves the problems of a redundant speech signal, a large quantity of sample information and a high requirement for computing resources in an existing voiceprint detection method.

Description

Voiceprint detection method, device, equipment and storage medium based on short text
This application is based on, and claims priority from, the Chinese invention patent application filed on March 6, 2019 with application number 201910167882.3, titled "Short text-based voiceprint detection method, device, equipment and storage medium".
Technical field
This application relates to the field of information technology, and in particular to a short text-based voiceprint detection method, device, equipment and storage medium.
Background
Voiceprint detection is a common and effective identity recognition method that can be applied to a series of scenarios requiring identity authentication, such as online payment, voiceprint lock control, survival authentication, and Internet of Things device verification. It is especially useful in remote verification where video image verification is inconvenient, since it is not restricted by the device at all. When verifying, using content and voiceprint detection for double verification can greatly raise the difficulty of attack and improve security. Currently, commonly used voiceprint detection methods include, but are not limited to, the template matching method, the probability model method, the artificial neural network method, and the I-vector model method. In these methods, however, the structure of the model itself makes it difficult to complete training with short text, so only long text carrying more features can usually be used as the model input vector. Yet the longer the speech signal, the more features it carries, the larger the amount of sample information required during training, and the more computing resources are occupied.
Summary of the invention
The embodiments of the present application provide a short text-based voiceprint detection method, device, equipment, and storage medium to solve the problems of lengthy voice signals, large amounts of sample information, and high computing resource requirements in existing voiceprint detection methods.
A voiceprint detection method based on short text, including:
obtaining training samples, and using the training samples to train a preset deep neural network;
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain Mel frequency cepstrum coefficients;
passing the Mel frequency cepstrum coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the speech signal are both short texts.
Optionally, obtaining training samples and using the training samples to train the preset deep neural network includes:
obtaining voice samples of multiple users as training samples;
preprocessing the training samples of each user, and performing feature extraction on the preprocessed training samples to obtain Mel frequency cepstrum coefficients;
tagging the Mel frequency cepstrum coefficients of each user with a user label;
passing the Mel frequency cepstrum coefficients with user labels as input vectors into the preset deep neural network for training;
using a preset loss function to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user label, and modifying the parameters of the deep neural network according to the error;
passing the Mel frequency cepstrum coefficients with user labels as input vectors into the modified deep neural network for the next training iteration, until the accuracy of the deep neural network's recognition result for each Mel frequency cepstrum coefficient reaches the specified threshold, at which point iteration stops.
Optionally, the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
Optionally, comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library and outputting the voiceprint detection result according to the comparison result includes:
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
if there is a pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal, acquiring the user information corresponding to that pre-stored voiceprint vector and outputting the user information;
if there is no pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal, outputting a prompt message that the detection has failed.
Optionally, preprocessing the voice signal to be recognized and performing feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficients includes:
performing framing processing on the waveform of the voice signal to be recognized;
after framing, performing windowing processing on each frame of the signal;
performing a discrete Fourier transform on each windowed frame to obtain the frequency spectrum corresponding to that frame;
calculating the power spectrum of the voice signal from the spectra of all frames;
calculating a mel filter bank from the power spectrum;
performing a logarithmic operation on the output of each mel filter to obtain the logarithmic energy;
performing a discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficients of the speech signal.
A voiceprint detection device based on short text, including:
a training module, used to obtain training samples and use the training samples to train a preset deep neural network;
a signal acquisition module, used to acquire the voice signal to be recognized;
a feature extraction module, used to preprocess the voice signal to be recognized and perform feature extraction on the preprocessed voice signal to obtain Mel frequency cepstrum coefficients;
a feature acquisition module, used to pass the Mel frequency cepstrum coefficients as input into a pre-trained deep neural network and acquire the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the voice signal, where each element in the voiceprint vector represents a feature of the voice signal;
a detection module, used to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in a voiceprint model library and output a voiceprint detection result according to the comparison result;
wherein the training samples and the speech signal are both short texts.
Optionally, the detection module includes:
a comparison unit, used to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
a first result output unit, used to acquire the user information corresponding to a pre-stored voiceprint vector and output the user information if there is a pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal;
a second result output unit, used to output a prompt message that the detection has failed if there is no pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal.
Optionally, the deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy.
A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
obtaining training samples, and using the training samples to train a preset deep neural network;
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain Mel frequency cepstrum coefficients;
passing the Mel frequency cepstrum coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the speech signal are both short texts.
One or more non-volatile readable storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining training samples, and using the training samples to train a preset deep neural network;
obtaining a voice signal to be recognized;
preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain Mel frequency cepstrum coefficients;
passing the Mel frequency cepstrum coefficients as input into the pre-trained deep neural network, and obtaining the output vector of the deep neural network at the last fully connected layer as the voiceprint vector of the speech signal, where each element in the voiceprint vector represents a feature of the voice signal;
comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
wherein the training samples and the speech signal are both short texts.
The details of one or more embodiments of the present application are presented in the following drawings and description; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of the drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
FIG. 1 is a flowchart of a voiceprint detection method based on short text in an embodiment of the present application;
FIG. 2 is a flowchart of step S101 in the voiceprint detection method based on short text in an embodiment of the present application;
FIG. 3 is a flowchart of step S103 in the voiceprint detection method based on short text in an embodiment of the present application;
FIG. 4 is a flowchart of step S105 in the short text-based voiceprint detection method in an embodiment of the present application;
FIG. 5 is a functional block diagram of a voiceprint detection device based on short text in an embodiment of the present application;
FIG. 6 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The voiceprint detection method based on short text provided by the embodiments of the present application is applied to a server. The server can be implemented as an independent server or as a server cluster composed of multiple servers. In one embodiment, as shown in FIG. 1, a voiceprint detection method based on short text is provided, which includes the following steps:
In step S101, training samples are obtained, and the training samples are used to train a preset deep neural network.
Here, the embodiment of the present application redesigns a deep neural network suited to short text. The deep neural network includes an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses the maxout excitation function, and the third and fourth fully connected layers are trained with a discard (dropout) strategy. In this way, the deep neural network is not limited by the model structure and can use short texts as training samples and input vectors, thereby reducing the data requirements. Here, a short text is a voice signal of relatively short length, for example a sentence-length voice signal. Optionally, the short text may be defined by a specified length, i.e., a voice signal whose length is less than or equal to the specified length. The embodiment of the present application collects voice samples of multiple users as training samples and trains the preset deep neural network on these training samples. Optionally, as shown in FIG. 2, step S101 includes:
In step S201, voice samples of multiple users are obtained as training samples.
In this embodiment, for a given application scenario, voice samples corresponding to multiple users can be collected in advance in that scenario; for example, voice samples corresponding to each user can be collected through channels such as professional knowledge bases and network databases, and used as training samples.
In step S202, the training samples of each user are preprocessed, and feature extraction is performed on the preprocessed training samples to obtain MFCC features.
Here, the MFCC feature (Mel-scale Frequency Cepstral Coefficients, MFCC for short) is a discriminative component of the speech signal: a cepstral parameter extracted in the Mel-scale frequency domain. Because it takes into account the human ear's perception of different frequencies, it is especially suitable for speech recognition and speaker recognition. The embodiment of the present application designs the deep neural network around the MFCC feature and uses the MFCC feature as the input of the deep neural network. Before training the deep neural network, the user samples are first preprocessed and feature-extracted to obtain the corresponding MFCC features. The preprocessing and feature extraction of the user's training samples are the same as in step S103; for details, please refer to the description of step S103, which is not repeated here.
In the embodiment of the present application, feature extraction on the preprocessed training sample yields a set of 128-dimensional MFCC features corresponding to the training sample. The 128-dimensional MFCC feature is used as the input vector of the deep neural network.
In step S203, the MFCC features of each user are tagged with a user label.
In the embodiment of the present application, the user label identifies the speaker to whom the MFCC feature belongs. Different users' MFCC features carry different user labels. Before training the deep neural network, the 128-dimensional MFCC features of each user need to be tagged with the corresponding user label. The following example illustrates this. Assume there are three users: user 1, user 2, and user 3. In step S203, user 1's MFCC features are tagged with the user label "01", user 2's MFCC features with the user label "02", and user 3's MFCC features with the user label "03". It should be understood that this is only an example of the present application and is not intended to limit it; in other embodiments, the user label may take other forms.
In step S204, the MFCC features with user labels are passed as input vectors into the preset deep neural network for training.
During training, for each user, the 128-dimensional MFCC features carrying that user's label are used as an input vector and passed into the preset deep neural network for training, and the recognition result for that user is obtained.
在这里，所述预设的深度神经网络包括输入层、四层全连接层以及输出层。每一全连接层为12维输入，使用的是maxout激发函数，其隐含层节点的输出表达式为：Here, the preset deep neural network includes an input layer, four fully connected layers, and an output layer. Each fully connected layer has a 12-dimensional input and uses the maxout excitation function; the output expression of its hidden layer nodes is:
$$z_{ij} = x^{\mathrm{T}} W_{\cdot ij} + b_{ij}, \qquad j = 1,\dots,k$$

$$h_i(x) = \max_{j \in \{1,\dots,k\}} z_{ij}$$
在上式中，b表示偏置值，W表示由参数组成的三维矩阵，尺寸为d×m×k，d表示输入层的节点个数，m表示隐含层的节点个数，k表示每个隐含层节点对应的隐隐含层的节点个数，所述k个隐隐含层的节点都是线性输出的。maxout激发函数的每个节点均为取所述k个隐隐含层节点输出值中的最大值。In the above formulas, b denotes the bias values and W denotes a three-dimensional matrix of parameters with size d×m×k, where d is the number of input-layer nodes, m is the number of hidden-layer nodes, and k is the number of sub-hidden-layer nodes corresponding to each hidden-layer node; the k sub-hidden-layer nodes all have linear outputs. Each node of the maxout excitation function takes the maximum of the output values of its k sub-hidden-layer nodes.
在本申请实施例中，每一全连接层的节点个数m为12，12个节点中的每一个节点，取maxout激发函数生成的k个隐隐含层节点输出值中的最大值，组合该12个节点对应的最大值，作为该全连接层的输出向量。本申请实施例通过使用maxout激发函数，使得深度神经网络的全连接层为非线性转换。In this embodiment of the present application, the number of nodes m in each fully connected layer is 12. Each of the 12 nodes takes the maximum of the output values of its k sub-hidden-layer nodes generated by the maxout excitation function, and the maxima of the 12 nodes are combined as the output vector of that fully connected layer. By using the maxout excitation function, the embodiment of the present application makes the fully connected layers of the deep neural network perform a nonlinear transformation.
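The following illustrative sketch (not part of the original application) shows one way the maxout computation described above can be written in Python with NumPy; the variable names x, W and b and the value k = 3 are assumptions of this sketch, while the d×m×k shape of W follows the description.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout activation: for each of the m hidden nodes, take the maximum of the
    k linear sub-unit outputs z_ij = x·W[:, i, j] + b[i, j]."""
    # x: (d,) input vector; W: (d, m, k) parameter tensor; b: (m, k) bias values
    z = np.einsum('d,dmk->mk', x, W) + b   # linear outputs of the k sub-units per node
    return z.max(axis=1)                   # (m,) output: max over the k sub-units

# Example with the dimensions used in the embodiment: 12-node layer, k sub-units per node
d, m, k = 12, 12, 3
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W = rng.standard_normal((d, m, k)) * 0.1
b = np.zeros((m, k))
print(maxout(x, W, b).shape)  # (12,)
```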
进一步地，在本申请实施例中，所述深度神经网络包括四层全连接层，分别记为第一全连接层、第二全连接层、第三全连接层、第四全连接层。在进行训练时，首先将所述带有用户标签的MFCC特征经过第一全连接层，然后将第一全连接层的输出向量作为第二全连接层的输入向量，将第二全连接层的输出向量作为第三全连接层的输入向量，将第三全连接层的输出向量作为第四全连接层的输入向量，将第四全连接层的输出向量作为输出层的输入向量。在第三全连接层和第四全连接层进行训练时，本申请实施例采用丢弃策略，即dropout策略。第二全连接层的输出向量传入第三全连接层时，按照预设第一丢弃概率随机丢弃第三全连接层的输出向量中的元素。应当理解，丢弃是指把这些元素从网络中"抹去"，相当于在本次训练中，这些被"抹去"的元素不参与本次训练。然后使用第三全连接层的maxout激发函数对剩余的元素进行训练，生成第三全连接层的输出向量。再按照预设第二丢弃概率随机丢弃第三全连接层得到的输出向量中的元素，将剩余的元素输入第四全连接层进行训练。在这里，所述第一丢弃概率和第二丢弃概率根据实际需求设定，本申请实施例优选为0.5。通过使用dropout策略，有效地削弱了隐含层节点间的联合适应性，增强了泛化能力，从而防止了深度神经网络在训练过程中过拟合，有利于提升深度神经网络的训练效果。Further, in this embodiment of the present application, the deep neural network includes four fully connected layers, denoted the first, second, third, and fourth fully connected layers. During training, the MFCC features with user labels first pass through the first fully connected layer; the output vector of the first fully connected layer is then used as the input vector of the second fully connected layer, the output vector of the second fully connected layer as the input vector of the third fully connected layer, the output vector of the third fully connected layer as the input vector of the fourth fully connected layer, and the output vector of the fourth fully connected layer as the input vector of the output layer. When training the third and fourth fully connected layers, this embodiment of the present application adopts a discard strategy, namely the dropout strategy. When the output vector of the second fully connected layer is passed into the third fully connected layer, elements of the incoming vector are randomly discarded according to a preset first discard probability. It should be understood that discarding means "erasing" these elements from the network, which is equivalent to the "erased" elements not taking part in the current training pass. The maxout excitation function of the third fully connected layer is then trained on the remaining elements to generate the output vector of the third fully connected layer. Elements of the output vector obtained by the third fully connected layer are then randomly discarded according to a preset second discard probability, and the remaining elements are input into the fourth fully connected layer for training. Here, the first discard probability and the second discard probability are set according to actual requirements, and in this embodiment of the present application both are preferably 0.5. Using the dropout strategy effectively weakens the co-adaptation between hidden-layer nodes and enhances the generalization ability, thereby preventing the deep neural network from overfitting during training and helping to improve its training effect.
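As a minimal illustration of the discard (dropout) step described above, the sketch below randomly erases elements of a 12-dimensional layer output with the preferred probability of 0.5; the simple binary mask without rescaling and all variable names are assumptions of this sketch rather than details of the application.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                             # the preferred discard probability described above
h = rng.standard_normal(12)         # output vector of a 12-node fully connected layer
mask = rng.random(12) >= p          # keep each element with probability 1 - p
h_dropped = h * mask                # "erased" elements do not take part in this training pass
print(mask.astype(int))
print(h_dropped)
```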
在步骤S205中,采用预设的损失函数计算每一所述MFCC特征经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数。In step S205, a preset loss function is used to calculate the error between the recognition result of each MFCC feature through the deep neural network and the corresponding user tag, and the parameters of the deep neural network are modified according to the error .
所述深度神经网络经过四层全连接层后，第四全连接层的输出向量作为输出层的输入。输出层为softmax层，softmax层能够根据第四全连接层的输出向量进行分类，得到MFCC特征的识别结果。所述识别结果为所述深度神经网络预测所述MFCC特征所属的用户。如前所述，每一全连接层采用maxout激发函数，maxout激发函数包括一个三维的参数矩阵W和偏置值b。在通过步骤S204完成对每一所述MFCC特征的训练得到所述MFCC特征对应的识别结果后，采用预设的损失函数计算每一所述MFCC特征的识别结果与对应的用户标签之间的误差，并基于所述误差返回去修改所述深度神经网络中maxout激发函数的参数矩阵W和偏置值b。可选地，所述损失函数包括但不限于互熵损失函数、平方损失函数。After the deep neural network passes through the four fully connected layers, the output vector of the fourth fully connected layer serves as the input of the output layer. The output layer is a softmax layer, which classifies the output vector of the fourth fully connected layer to obtain the recognition result of the MFCC feature; the recognition result is the user to which the deep neural network predicts the MFCC feature belongs. As described above, each fully connected layer uses the maxout excitation function, which includes a three-dimensional parameter matrix W and bias values b. After the training of each MFCC feature is completed in step S204 and its recognition result is obtained, a preset loss function is used to calculate the error between the recognition result of each MFCC feature and the corresponding user label, and the error is propagated back to modify the parameter matrix W and bias values b of the maxout excitation functions in the deep neural network. Optionally, the loss function includes but is not limited to the cross-entropy loss function and the squared loss function.
在步骤S206中,将带有用户标签的MFCC特征作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练,直至所述深度神经网络对每一MFCC特征的识别结果的准确率达到指定阈值,停止迭代。In step S206, the MFCC feature with the user tag is used as an input vector to pass into the modified deep neural network for the next iteration training, until the accuracy of the recognition result of each MFCC feature by the deep neural network reaches the specified Threshold, stop iteration.
通过步骤S205修改参数后的深度神经网络，用于进行下一次训练，即将带有用户标签的MFCC特征作为输入向量再次传入参数修改后的深度神经网络进行训练，训练过程和步骤S204的相同，具体参见上面的叙述，此处不再赘述。重复迭代步骤S204、S205、S206，直至所述深度神经网络对所有用户的MFCC特征的识别结果的准确率达到指定阈值，即所述深度神经网络对每一所述MFCC特征的识别结果与对应的用户标签相同的概率达到所述指定阈值，则说明所述深度神经网络中的各个参数已经调整到位，确定所述深度神经网络已训练完成，停止迭代。The deep neural network whose parameters have been modified in step S205 is used for the next round of training: the MFCC features with user labels are again passed as input vectors into the parameter-modified deep neural network. The training process is the same as in step S204 and is not repeated here. Steps S204, S205, and S206 are iterated until the accuracy of the deep neural network's recognition results for the MFCC features of all users reaches a specified threshold, that is, until the probability that the recognition result of each MFCC feature matches the corresponding user label reaches the specified threshold. This indicates that the parameters of the deep neural network have been adjusted in place; the deep neural network is then determined to be fully trained, and the iteration stops.
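The following sketch, written with PyTorch as an assumed implementation framework, illustrates how the described four-layer maxout network with dropout before the third and fourth layers, the softmax/cross-entropy output of step S205, and the accuracy-threshold stopping rule of step S206 could fit together. The class and function names, the value k = 3, the learning rate, and the random stand-in data are all assumptions; mapping the 128-dimensional MFCC input down to 12-node layers follows the feature size and layer size given above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxoutLayer(nn.Module):
    """One fully connected maxout layer: k linear sub-units per node; the node output is their maximum."""
    def __init__(self, in_dim, out_dim, k=3):
        super().__init__()
        self.out_dim, self.k = out_dim, k
        self.linear = nn.Linear(in_dim, out_dim * k)

    def forward(self, x):
        z = self.linear(x).view(-1, self.out_dim, self.k)
        return z.max(dim=2).values

class ShortTextDNN(nn.Module):
    """Sketch of the described network: four 12-node maxout layers, dropout before the
    third and fourth layers, and a softmax output over the enrolled users."""
    def __init__(self, num_users, in_dim=128, hidden=12, k=3, p=0.5):
        super().__init__()
        self.fc1 = MaxoutLayer(in_dim, hidden, k)
        self.fc2 = MaxoutLayer(hidden, hidden, k)
        self.fc3 = MaxoutLayer(hidden, hidden, k)
        self.fc4 = MaxoutLayer(hidden, hidden, k)
        self.drop = nn.Dropout(p)
        self.out = nn.Linear(hidden, num_users)

    def forward(self, x):
        h = self.fc2(self.fc1(x))
        h = self.fc3(self.drop(h))       # first discard step before the third layer
        h = self.fc4(self.drop(h))       # second discard step before the fourth layer
        return self.out(h)               # logits; softmax is applied inside the loss below

def train_until_threshold(model, features, labels, threshold=0.95, max_epochs=200, lr=1e-2):
    """Iterate training until the label accuracy reaches the specified threshold (step S206)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(features), labels)   # error against the user labels (step S205)
        loss.backward()
        optimizer.step()                                   # modify the network parameters from the error
        model.eval()
        with torch.no_grad():
            accuracy = (model(features).argmax(dim=1) == labels).float().mean().item()
        if accuracy >= threshold:
            break
    return model

# Toy run with random stand-in MFCC features for three users (10 feature vectors each)
torch.manual_seed(0)
features = torch.randn(30, 128)
labels = torch.arange(3).repeat_interleave(10)
model = train_until_threshold(ShortTextDNN(num_users=3), features, labels)
```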
训练好的深度神经网络可用于对语音信号提取声纹向量。The trained deep neural network can be used to extract the voiceprint vector from the speech signal.
在步骤S102中,获取待识别的语音信号。In step S102, a voice signal to be recognized is acquired.
所述待识别的语音信号为短文本,即长度较短的语音信号,比如一个句子长度的语音信号,以降低对数据的要求。在每一次识别过程中,所获取的待识别的语音信号应当为一个待识别用户的。所述待识别的语音信号可以是一条语音信号或者多条语音信号。The voice signal to be recognized is a short text, that is, a short-length voice signal, such as a sentence-length voice signal, so as to reduce the requirements for data. In each recognition process, the acquired voice signal to be recognized should be of a user to be recognized. The voice signal to be recognized may be one voice signal or multiple voice signals.
在步骤S103中,对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到MFCC特征。In step S103, preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the MFCC feature.
在使用深度神经网络之前,首先对待识别的语音信号进行特征提取,得到对应的MFCC特征。可选地,如图3所示,所述步骤S103包括:Before using the deep neural network, first perform feature extraction on the speech signal to be recognized to obtain the corresponding MFCC feature. Optionally, as shown in FIG. 3, the step S103 includes:
在步骤S301中,对所述待识别的语音信号的波形图执行分帧处理。In step S301, framing processing is performed on the waveform of the voice signal to be recognized.
在这里，分帧处理是指将不定长度的语音信号的波形图切分成长度固定的小段，通常取10-30毫秒为一帧。由于语音信号是快速变化的，而傅里叶变换适用于分析平稳的信号。通过对语音信号的波形图进行分帧，可以降低傅里叶变换后旁瓣的强度，提高获取的频谱质量。Here, framing refers to cutting the waveform of the voice signal, which has an indefinite length, into short segments of fixed length, usually 10-30 milliseconds per frame. The speech signal changes rapidly, whereas the Fourier transform is suited to analyzing stationary signals; by framing the waveform of the speech signal, the intensity of the side lobes after the Fourier transform can be reduced and the quality of the obtained spectrum improved.
在步骤S302中,在分帧处理之后,对每一帧信号执行加窗处理。In step S302, after framing processing, windowing processing is performed on each frame signal.
本申请实施例通过对每一帧信号进行加窗处理,以平滑该语音信号。可选地,可以使用汉明窗加以平滑,相比于矩形窗函数,汉明窗加强了语音信号左端和右端的连续性,可以有效地减弱傅里叶变换后旁瓣的强度以及频谱泄露。In the embodiment of the present application, each frame signal is windowed to smooth the speech signal. Optionally, a Hamming window can be used for smoothing. Compared with a rectangular window function, the Hamming window enhances the continuity of the left and right ends of the speech signal, and can effectively reduce the intensity of side lobes and spectrum leakage after Fourier transform.
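Steps S301 and S302 can be sketched as follows (illustrative only, not taken from the application); the 25 ms frame length, the 10 ms hop, and the 16 kHz sampling rate are assumptions chosen within the 10-30 ms range mentioned above.

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a waveform into fixed-length frames and smooth each frame with a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([signal[i * hop_len:i * hop_len + frame_len] for i in range(num_frames)])
    return frames * np.hamming(frame_len)   # windowing reduces spectral leakage at the frame edges

# Stand-in for a short, sentence-length utterance sampled at 16 kHz
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
frames = frame_and_window(speech, sample_rate=16000)
print(frames.shape)   # (number of frames, samples per frame)
```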
在步骤S303中,对加窗处理后的每一帧信号执行离散傅里叶变换,得到该帧信号对应的频谱。In step S303, the discrete Fourier transform is performed on each frame signal after the windowing process to obtain the frequency spectrum corresponding to the frame signal.
由于语音信号在时域上的变化很难看出语音信号的特性,因此需要将语音信号转换成频域上的能量分布来观察。不同的能量分布表示不同语音的特性。在对每一帧语音信号进行加窗处理后,再进行离散傅里叶变换,得到该帧信号在频谱上的能量分布。对分帧加窗后的各帧信号进行离散傅里叶变换得到各帧的频谱,进而得到语音信号的频谱。Since it is difficult to see the characteristics of the voice signal when the voice signal changes in the time domain, it is necessary to convert the voice signal into an energy distribution in the frequency domain for observation. Different energy distributions represent the characteristics of different voices. After windowing is performed on each frame of speech signal, discrete Fourier transform is performed to obtain the energy distribution of the frame signal on the frequency spectrum. Discrete Fourier transform is performed on each frame signal after frame division and windowing to obtain the frequency spectrum of each frame, and then the frequency spectrum of the speech signal.
在步骤S304中,根据所有帧信号对应的频谱计算所述语音信号的功率谱。In step S304, the power spectrum of the speech signal is calculated according to the spectrum corresponding to all frame signals.
在完成离散傅里叶变换后，得到的能量分布是频域信号。每一个频带范围的能量大小不一，不同音素的能量谱也不一样，需要对所述语音信号的频谱取模平方得到所述语音信号的功率谱。After completing the discrete Fourier transform, the energy distribution obtained is a frequency domain signal. The energy of each frequency band is different, and the energy spectrum of different phonemes is also different. It is necessary to take the modulus square of the frequency spectrum of the speech signal to obtain the power spectrum of the speech signal.
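Steps S303 and S304 can be sketched as below; the FFT size of 512 and the normalization by the FFT length are assumptions of this illustration rather than parameters stated in the application.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Per-frame discrete Fourier transform followed by the squared magnitude (power)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)   # frequency-domain representation of each frame
    return (np.abs(spectrum) ** 2) / n_fft            # power spectrum of each frame

rng = np.random.default_rng(0)
frames = rng.standard_normal((98, 400))               # stand-in for windowed frames
power = power_spectrum(frames)
print(power.shape)                                    # (98, 257): n_fft / 2 + 1 bins per frame
```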
在步骤S305中,根据所述功率谱计算梅尔滤波器组。In step S305, the Mel filter bank is calculated according to the power spectrum.
在这里，梅尔滤波器组是一组非线性分布的滤波器组，其在低频部分分布密集，在高频部分分布稀疏，可以更好地满足人耳听觉特性。本申请实施例将一组包括n个三角滤波器的滤波器组作用到所述语音信号，即将所述语音信号的功率谱乘以一组n个三角滤波器，以将所述语音信号的功率谱转化为n维向量。在这里，所述三角滤波器能够消除谐波的作用，突显原有语音信号的共振峰，进而降低数据量。Here, the mel filter bank is a set of non-linearly distributed filters, dense in the low-frequency part and sparse in the high-frequency part, which better matches the hearing characteristics of the human ear. In this embodiment of the present application, a filter bank of n triangular filters is applied to the voice signal, that is, the power spectrum of the voice signal is multiplied by the set of n triangular filters so as to convert the power spectrum of the voice signal into an n-dimensional vector. Here, the triangular filters can eliminate the effect of harmonics and highlight the formants of the original voice signal, thereby reducing the amount of data.
在步骤S306中,对每一个所述梅尔滤波器的输出执行对数运算,得到对数能量。In step S306, logarithmic operation is performed on the output of each Mel filter to obtain logarithmic energy.
通过步骤S305得到的n维向量中的每一个元素为梅尔滤波器组中的一个梅尔滤波器的输出，本申请实施例进一步对所得到的n维向量中的每一个元素进行取对数运算，得到所述梅尔滤波器组输出的对数能量，即log-mel filter bank energies。所述对数能量应用于后续进行倒谱分析。Each element of the n-dimensional vector obtained in step S305 is the output of one mel filter in the mel filter bank. In this embodiment of the present application, a logarithm is further taken of each element of the obtained n-dimensional vector to obtain the logarithmic energies output by the mel filter bank, that is, the log-mel filter bank energies. The logarithmic energies are used in the subsequent cepstrum analysis.
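The sketch below illustrates steps S305 and S306: building a bank of n triangular filters spaced on the mel scale, applying it to the power spectrum, and taking the logarithm. Using n = 128 filters keeps the dimensionality consistent with the 128-dimensional feature described later, but this count, the FFT size, and the small constant added before the logarithm are assumptions of the sketch; practical systems often use fewer filters and a larger FFT size.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Build n triangular filters spaced evenly on the mel scale (dense at low frequencies)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):
            fbank[i - 1, b] = (b - left) / max(center - left, 1)   # rising edge of the triangle
        for b in range(center, right):
            fbank[i - 1, b] = (right - b) / max(right - center, 1) # falling edge of the triangle
    return fbank

rng = np.random.default_rng(0)
power = rng.random((98, 257))                       # stand-in power spectrum (n_fft = 512)
fbank = mel_filterbank(n_filters=128, n_fft=512, sample_rate=16000)
log_energies = np.log(power @ fbank.T + 1e-10)      # log-mel filter bank energies per frame
print(log_energies.shape)                           # (98, 128)
```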
在步骤S307中,对所述对数能量执行离散余弦变换,得到所述语音信号的MFCC特征。In step S307, the discrete cosine transform is performed on the logarithmic energy to obtain the MFCC feature of the speech signal.
在通过上述步骤S306得到所述语音信号的对数能量后，本申请实施例对所述对数能量进行离散余弦变换，并取输出结果中的低128维的系数，作为所述语音信号的MFCC特征。在这里，通过离散余弦变换得到的输出结果具有很好的能量聚集效应，较大的值集中在靠近左上角的低能量部分，其余部分产生大量的0或者接近0的数。本申请实施例取输出结果中低128维的值，作为MFCC特征，从而可以进一步压缩数据量。After the logarithmic energies of the voice signal are obtained through step S306 above, this embodiment of the present application performs a discrete cosine transform on the logarithmic energies and takes the low 128-dimensional coefficients of the output as the MFCC feature of the voice signal. Here, the output of the discrete cosine transform has a good energy-compaction effect: the larger values are concentrated in the low-energy part near the upper-left corner, while the remaining part produces a large number of values that are zero or close to zero. This embodiment of the present application takes the low 128-dimensional values of the output as the MFCC feature, so that the amount of data can be further compressed.
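Step S307 can then be sketched as a discrete cosine transform of the log energies, keeping the lowest 128 coefficients. Averaging the per-frame coefficients into a single 128-dimensional vector for the utterance is an assumption of this sketch, since the application does not specify how the per-frame features are pooled.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
log_energies = rng.random((98, 128))                              # stand-in log-mel filter bank energies
mfcc = dct(log_energies, type=2, axis=1, norm='ortho')[:, :128]   # keep the lowest 128 coefficients
feature = mfcc.mean(axis=0)                                       # one 128-dimensional vector per utterance
print(feature.shape)                                              # (128,)
```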
其中,MFCC特征不依赖于信号的性质,对输入信号不做任何的限制,具有较高的鲁棒性,符合人耳的听觉系数,当信噪比降低时仍然具有较好的识别性能,以所述MFCC特征作为所述待识别的语音信号的声音特征,传入深度神经网络中进行识别,可以提高深度神经网络识别的准确度。Among them, the MFCC feature does not depend on the nature of the signal and does not impose any restrictions on the input signal. It has high robustness and conforms to the hearing coefficient of the human ear. It still has good recognition performance when the signal-to-noise ratio is reduced. The MFCC feature is used as the sound feature of the voice signal to be recognized, and is transmitted to the deep neural network for recognition, which can improve the accuracy of deep neural network recognition.
在步骤S104中,将所述MFCC特征作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征。In step S104, the MFCC feature is input to a pre-trained deep neural network, and the output vector of the deep neural network in the last fully connected layer is obtained as the voiceprint vector of the speech signal. Each element in the voiceprint vector represents the characteristics of the voice signal.
在得到所述语音信号的MFCC特征之后,将所述MFCC特征作为输入传入至预先训练好的深度神经网络,通过所述深度神经网络基于所述MFCC特征对所述语音信号进行识别。在这里,所述预先训练好的深度神经网络中的包括四层全连接层,每一层全连接层包括12个节点,通过激发函数maxout函数得到一个12维的输出向量。当所述深度神经网络完成对所述语音信号的识别后,获取所述神经网络在最后一层全连接层的输出向量,作为所述语音信号的d-vector向量。所述d-vector向量为所述语音信号的声纹向量,其中的每个元素表示所述语音信号的声纹特征。After the MFCC feature of the voice signal is obtained, the MFCC feature is passed as an input to a pre-trained deep neural network, and the voice signal is recognized based on the MFCC feature through the deep neural network. Here, the pre-trained deep neural network includes four fully connected layers, each fully connected layer includes 12 nodes, and a 12-dimensional output vector is obtained through the excitation function maxout function. After the deep neural network completes the recognition of the speech signal, the output vector of the neural network in the last fully connected layer is obtained as the d-vector vector of the speech signal. The d-vector vector is the voiceprint vector of the voice signal, and each element in it represents the voiceprint feature of the voice signal.
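Assuming the ShortTextDNN module from the training sketch after step S206, the d-vector extraction described here can be sketched as running the trained network only up to the fourth fully connected layer; the function name and the reuse of that earlier, assumed module are illustrative choices, not details of the application.

```python
import torch

def extract_d_vector(model, mfcc_feature):
    """Run the trained network up to the fourth fully connected layer and return its
    output as the voiceprint (d-vector) of the utterance, skipping the softmax output layer."""
    model.eval()
    with torch.no_grad():
        x = mfcc_feature.unsqueeze(0)        # (1, 128) input vector
        h = model.fc2(model.fc1(x))
        h = model.fc3(model.drop(h))         # dropout is inactive in eval mode
        h = model.fc4(model.drop(h))
        return h.squeeze(0)                  # 12-dimensional d-vector from the last fully connected layer
```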
在步骤S105中,将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果。In step S105, the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vector in the voiceprint model library, and the voiceprint detection result is output according to the comparison result.
在这里,所述声纹模型库根据需要结合身份认证的应用场景进行设置,比如网络支付、声纹锁控、生存认证等。所述声纹模型库中有多个预存声纹向量及其对应的用户信息。在具体的应用场景中,预先通过所述深度神经网络对需要进行认证的用户进行识别,提取声纹向量,并录入至所述声纹模型库中。Here, the voiceprint model library is set according to needs in combination with the application scenarios of identity authentication, such as online payment, voiceprint lock control, and survival authentication. There are multiple pre-stored voiceprint vectors and their corresponding user information in the voiceprint model library. In a specific application scenario, the user who needs to be authenticated is identified in advance through the deep neural network, and the voiceprint vector is extracted and entered into the voiceprint model library.
在进行声纹检测时,将所述待识别的语音信号的声纹向量与所述声纹模型库中的预存声纹向量进行比对,以执行对所述语音信号的语者辨别。可选地,如图4所示,所述步骤S105包括:When performing voiceprint detection, the voiceprint vector of the voice signal to be recognized is compared with the pre-stored voiceprint vector in the voiceprint model library to perform speaker discrimination of the voice signal. Optionally, as shown in FIG. 4, the step S105 includes:
在步骤S401中,将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对。In step S401, the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vector in the voiceprint model library.
在这里,本申请实施例将所述语音信号的声纹向量与声纹模型库中的每一预存声纹向量进行比对,判断两者中的元素是否相同。Here, the embodiment of the present application compares the voiceprint vector of the voice signal with each pre-stored voiceprint vector in the voiceprint model library to determine whether the elements in the two are the same.
在步骤S402中,若所述声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时,获取所述预存声纹向量对应的用户信息,输出所述用户信息。In step S402, if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, user information corresponding to the pre-stored voiceprint vector is obtained, and the user information is output.
若声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时，表明所述待识别的语音信号的说话人已录入到声纹模型库中，所述语音信号属于所述声纹模型库中已认证的用户，获取所述预存声纹向量对应的用户信息，输出所述用户信息，从而完成对所述待识别的语音信号的识别。If there is a pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal, it indicates that the speaker of the voice signal to be recognized has already been enrolled in the voiceprint model library and that the voice signal belongs to an authenticated user in the voiceprint model library. The user information corresponding to the pre-stored voiceprint vector is obtained and output, thereby completing the recognition of the voice signal to be recognized.
在步骤S403中,若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时,输出检测失败的提示信息。In step S403, if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, a prompt message indicating that the detection fails is output.
若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时，表明所述待识别的语音信号的说话者未录入到声纹模型库中，所述语音信号不属于所述声纹模型库中已认证的用户，则输出校验失败的提示信息。If there is no pre-stored voiceprint vector in the voiceprint model library that is the same as the voiceprint vector of the voice signal, it indicates that the speaker of the voice signal to be recognized has not been enrolled in the voiceprint model library and that the voice signal does not belong to an authenticated user in the voiceprint model library; a prompt message indicating that the verification has failed is then output.
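The comparison of steps S401 to S403 can be sketched as follows; matching by element-wise equality (within a small numerical tolerance) follows the description above, while the dictionary structure of the model library and the tolerance value are assumptions of the sketch. Deployed systems commonly score similarity (for example cosine similarity) against a threshold rather than requiring equal vectors.

```python
import numpy as np

def detect(voiceprint, model_library, atol=1e-6):
    """Compare the extracted voiceprint vector against every pre-stored vector in the library.
    Returns the matching user's information, or None if verification fails."""
    for user_info, stored_vector in model_library.items():
        if np.allclose(voiceprint, stored_vector, atol=atol):   # element-wise comparison
            return user_info
    return None

# Toy library: user information mapped to previously enrolled d-vectors
rng = np.random.default_rng(0)
library = {"user_01": rng.standard_normal(12), "user_02": rng.standard_normal(12)}
probe = library["user_02"].copy()
print(detect(probe, library) or "detection failed")
```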
本申请实施例所述的基于短文本的声纹检测方法可应用于网络支付、声纹锁控、生存认证等一系列需要结合身份认证的应用场景,也可用于在物联网设备验证中。尤其在采用视频图像验证不方便的远程验证中,完全不受设备的限制,通过电话即可确认身份,可以极大地减小远程验证的成本。The short text-based voiceprint detection method described in the embodiments of the present application can be applied to a series of application scenarios that need to be combined with identity authentication, such as online payment, voiceprint lock control, and survival authentication, and can also be used in IoT device verification. Especially in remote verification where video image verification is inconvenient, it is not restricted by equipment at all, and the identity can be confirmed by telephone, which can greatly reduce the cost of remote verification.
综上所述，本申请实施例通过预先重新设计适用于短文本的深度神经网络，然后采用短文本的训练样本对预设的深度神经网络进行训练；在进行声纹检测时，获取待识别的语音信号，所述语音信号为短文本；对所述待识别的语音信号进行预处理，并对预处理后的所述语音信号进行特征提取，得到MFCC特征；将所述MFCC特征作为输入传入预先训练好的深度神经网络，获取所述深度神经网络在最后一层全连接层的输出向量，作为所述语音信号的声纹向量，所述声纹向量中的各个元素表示所述语音信号的特征；将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对，并根据比对结果输出声纹检测结果；从而实现了基于短文本的声纹检测，大大地缩小了模型的输入向量，解决了现有声纹检测方法中语音信号冗长、样本信息量大、运算资源要求高的问题。In summary, the embodiment of the present application redesigns in advance a deep neural network suitable for short text and then trains the preset deep neural network with short-text training samples. When performing voiceprint detection, a voice signal to be recognized is acquired, the voice signal being a short text; the voice signal to be recognized is preprocessed and feature extraction is performed on the preprocessed voice signal to obtain MFCC features; the MFCC features are passed as input into the pre-trained deep neural network, and the output vector of the deep neural network at the last fully connected layer is obtained as the voiceprint vector of the voice signal, each element of the voiceprint vector representing a feature of the voice signal; the voiceprint vector of the voice signal is compared with the pre-stored voiceprint vectors in the voiceprint model library, and the voiceprint detection result is output according to the comparison result. Voiceprint detection based on short text is thus realized, the input vector of the model is greatly reduced, and the problems of lengthy voice signals, large amounts of sample information, and high computing-resource requirements in existing voiceprint detection methods are solved.
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution. The execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.
在一实施例中,提供一种基于短文本的声纹检测装置,该基于短文本的声纹检测装置与上述实施例中基于短文本的声纹检测方法一一对应。如图5所示,该基于短文本的声纹检测装置包括训练模块、信息获取模块、特征提取模块、特征获取模块、检测模块。各功能模块详细说明如下:In one embodiment, a short text-based voiceprint detection device is provided, and the short text-based voiceprint detection device corresponds to the short text-based voiceprint detection method in the foregoing embodiment. As shown in FIG. 5, the short text-based voiceprint detection device includes a training module, an information acquisition module, a feature extraction module, a feature acquisition module, and a detection module. The detailed description of each functional module is as follows:
训练模块51，用于获取训练样本，采用所述训练样本对预设的深度神经网络进行训练；The training module 51 is configured to obtain training samples and to use the training samples to train a preset deep neural network;
信号获取模块52,用于获取待识别的语音信号;The signal acquisition module 52 is used to acquire the voice signal to be recognized;
特征提取模块53,用于对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数;The feature extraction module 53 is configured to preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
特征获取模块54,用于将所述梅尔频率倒谱系数作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;The feature acquisition module 54 is used to input the Mel frequency cepstrum coefficients into a pre-trained deep neural network, and acquire the output vector of the deep neural network in the last fully connected layer as the speech signal The voiceprint vector of, where each element in the voiceprint vector represents the feature of the voice signal;
检测模块55,用于将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;The detection module 55 is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and output a voiceprint detection result according to the comparison result;
其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
可选地,所述训练模块51包括:Optionally, the training module 51 includes:
样本获取单元,用于获取多个用户的语音样本作为训练样本;The sample acquisition unit is used to acquire voice samples of multiple users as training samples;
特征提取单元,用于对每一个所述用户的训练样本进行预处理,对预处理后的训练样本进行特征提取,得到梅尔频率倒谱系数;The feature extraction unit is configured to preprocess the training samples of each user, and perform feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
标签单元,用于对每一个所述用户的梅尔频率倒谱系数打上用户标签;The tag unit is used to tag the Mel frequency cepstrum coefficient of each user with a user tag;
训练单元,用于将带有用户标签的梅尔频率倒谱系数作为输入向量传入预设的深度神经网络进行训练;The training unit is used to input the Mel frequency cepstrum coefficients with user tags as input vectors into the preset deep neural network for training;
参数修改单元,用于采用预设的损失函数计算每一所述梅尔频率倒谱系数经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数;The parameter modification unit is used to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag using a preset loss function, and modify the Parameters of deep neural network;
所述训练单元还用于，将带有用户标签的梅尔频率倒谱系数作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练，直至所述深度神经网络对每一梅尔频率倒谱系数的识别结果的准确率达到指定阈值，停止迭代。The training unit is further configured to pass the mel-frequency cepstral coefficients with user labels as input vectors into the parameter-modified deep neural network for the next iteration of training, until the accuracy of the deep neural network's recognition result for each mel-frequency cepstral coefficient reaches a specified threshold, at which point the iteration stops.
可选地,所述深度神经网络包括输入层、四层全连接层以及输出层,每一全连接层为12维输入,采用maxout激发函数,且第三全连接层和第四全连接层采用dropout策略进行训练。Optionally, the deep neural network includes an input layer, a four-layer fully connected layer, and an output layer, each fully connected layer is a 12-dimensional input, using a maxout excitation function, and the third fully connected layer and the fourth fully connected layer use Dropout strategy for training.
可选地,所述特征提取模块53包括:Optionally, the feature extraction module 53 includes:
分帧单元,用于对所述待识别的语音信号的波形图执行分帧处理;The framing unit is configured to perform framing processing on the waveform diagram of the voice signal to be recognized;
加窗单元,用于在分帧处理之后,对每一帧信号执行加窗处理;The windowing unit is used to perform windowing processing on each frame of signal after framing processing;
变换单元,用于对加窗处理后的每一帧信号执行离散傅里叶变换,得到该帧信号对应的频谱;A transforming unit for performing discrete Fourier transform on each frame signal after windowing processing to obtain the frequency spectrum corresponding to the frame signal;
功率谱计算单元,用于根据所有帧信号对应的频谱计算所述语音信号的功率谱;A power spectrum calculation unit, configured to calculate the power spectrum of the voice signal according to the spectrum corresponding to all frame signals;
滤波器组计算单元,用于根据所述功率谱计算梅尔滤波器组;A filter bank calculation unit for calculating a mel filter bank according to the power spectrum;
对数单元,用于对每一个所述梅尔滤波器的输出执行对数运算,得到对数能量;Logarithmic unit, used to perform logarithmic operation on the output of each mel filter to obtain logarithmic energy;
余弦变换单元,用于对所述对数能量执行离散余弦变换,得到所述语音信号的梅尔频率倒谱系数。The cosine transform unit is configured to perform discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the voice signal.
可选地,所述检测模块55包括:Optionally, the detection module 55 includes:
比对单元,用于将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对;The comparison unit is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library;
第一结果输出单元,用于若所述声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时,获取所述预存声纹向量对应的用户信息,输出所述用户信息;The first result output unit is configured to obtain user information corresponding to the pre-stored voiceprint vector if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, and output the user information;
第二结果输出单元,用于若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时,输出检测失败的提示信息。The second result output unit is configured to output a prompt message that the detection fails if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library.
关于基于短文本的声纹检测装置的具体限定可以参见上文中对于基于短文本的声纹检测方法的限定,在此不再赘述。上述基于短文本的声纹检测装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。For the specific limitation of the voiceprint detection device based on short text, please refer to the above limitation on the voiceprint detection method based on short text, which will not be repeated here. Each module in the aforementioned short text-based voiceprint detection device can be implemented in whole or in part by software, hardware, and a combination thereof. The foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图6所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种基于短文本的声纹检测方法。In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 6. The computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by the processor to realize a short text-based voiceprint detection method.
在一个实施例中,提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机可读指令,处理器执行计算机可读指令时实现以下步骤:In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, and the processor implements the following steps when the processor executes the computer-readable instructions:
获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;Obtaining training samples, and using the training samples to train a preset deep neural network;
获取待识别的语音信号;Obtain the voice signal to be recognized;
对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到MFCC特征;Preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain MFCC features;
将所述MFCC特征作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;Use the MFCC feature as input into a pre-trained deep neural network, and obtain the output vector of the deep neural network in the last fully connected layer as the voiceprint vector of the speech signal. In the voiceprint vector Each element of represents the characteristics of the voice signal;
将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and outputting the voiceprint detection result according to the comparison result;
其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
在一个实施例中,提供了一个或多个存储有计算机可读指令的非易失性可读存储介质,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:In one embodiment, one or more non-volatile readable storage media storing computer readable instructions are provided. When the computer readable instructions are executed by one or more processors, the one or more Each processor performs the following steps:
获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;Obtaining training samples, and using the training samples to train a preset deep neural network;
获取待识别的语音信号;Obtain the voice signal to be recognized;
对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到MFCC特征;Preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain MFCC features;
将所述MFCC特征作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;Use the MFCC feature as input into a pre-trained deep neural network, and obtain the output vector of the deep neural network in the last fully connected layer as the voiceprint vector of the speech signal. In the voiceprint vector Each element of represents the characteristics of the voice signal;
将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检 测结果;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and outputting the voiceprint detection result according to the comparison result;
其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一非易失性计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。A person of ordinary skill in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a non-volatile computer. In a readable storage medium, when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that for the convenience and conciseness of description, only the division of the above-mentioned functional units and modules is used as an example. In practical applications, the above-mentioned functions can be allocated to different functional units and modules as required. Module completion means dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above.
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that it can still implement the foregoing The technical solutions recorded in the examples are modified, or some of the technical features are equivalently replaced; these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the application, and should be included in Within the scope of protection of this application.

Claims (20)

  1. 一种基于短文本的声纹检测方法,其特征在于,包括:A voiceprint detection method based on short text, which is characterized in that it includes:
    获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;Obtaining training samples, and using the training samples to train a preset deep neural network;
    获取待识别的语音信号;Obtain the voice signal to be recognized;
    对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数;Preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
    将所述梅尔频率倒谱系数作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;The Mel frequency cepstrum coefficient is passed into a pre-trained deep neural network as input, and the output vector of the deep neural network in the last fully connected layer is obtained as the voiceprint vector of the speech signal. Each element in the voiceprint vector represents the feature of the voice signal;
    将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and outputting the voiceprint detection result according to the comparison result;
    其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
  2. 如权利要求1所述的基于短文本的声纹检测方法,其特征在于,所述获取训练样本,采用所述训练样本对预设的深度神经网络进行训练包括:The method for voiceprint detection based on short text according to claim 1, wherein said acquiring training samples and using said training samples to train a preset deep neural network comprises:
    获取多个用户的语音样本作为训练样本;Acquire voice samples of multiple users as training samples;
    对每一个所述用户的训练样本进行预处理,对预处理后的训练样本进行特征提取,得到梅尔频率倒谱系数;Preprocessing the training samples of each user, and performing feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
    对每一个所述用户的梅尔频率倒谱系数打上用户标签;Labeling a user tag on the Mel frequency cepstrum coefficient of each user;
    将带有用户标签的梅尔频率倒谱系数作为输入向量传入预设的深度神经网络进行训练;The Mel frequency cepstrum coefficients with user labels are used as input vectors to the preset deep neural network for training;
    采用预设的损失函数计算每一所述梅尔频率倒谱系数经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数;Using a preset loss function to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag, and modify the parameters of the deep neural network according to the error;
    将带有用户标签的梅尔频率倒谱系数作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练,直至所述深度神经网络对每一梅尔频率倒谱系数的识别结果的准确率达到指定阈值,停止迭代。The mel frequency cepstral coefficients with user labels are used as input vectors to pass into the modified deep neural network for the next iterative training, until the deep neural network has an accurate recognition result of each mel frequency cepstral coefficient If the rate reaches the specified threshold, stop iteration.
  3. 如权利要求2所述的基于短文本的声纹检测方法,其特征在于,所述深度神经网络包括输入层、四层全连接层以及输出层,每一全连接层为12维输入,采用maxout激发函数,且第三全连接层和第四全连接层采用丢弃策略进行训练。The voiceprint detection method based on short text according to claim 2, wherein the deep neural network includes an input layer, a four-layer fully connected layer, and an output layer, each fully connected layer is a 12-dimensional input, using maxout Excitation function, and the third fully connected layer and the fourth fully connected layer adopt the discarding strategy for training.
  4. 如权利要求1至3任一项所述的基于短文本的声纹检测方法,其特征在于,所述将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果包括:The voiceprint detection method based on short text according to any one of claims 1 to 3, wherein the voiceprint vector of the voice signal is compared with a pre-stored voiceprint vector in a voiceprint model library , And output voiceprint detection results according to the comparison results, including:
    将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library;
    若所述声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时,获取所述预存声纹向 量对应的用户信息,输出所述用户信息;If there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, acquiring user information corresponding to the pre-stored voiceprint vector, and outputting the user information;
    若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时,输出检测失败的提示信息。If there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, a prompt message indicating that the detection fails is output.
  5. 如权利要求1至3任一项所述的基于短文本的声纹检测方法,其特征在于,所述对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数包括:The voiceprint detection method based on short text according to any one of claims 1 to 3, wherein the preprocessing is performed on the voice signal to be recognized, and the preprocessed voice signal is performed Feature extraction, obtained Mel frequency cepstrum coefficients include:
    对所述待识别的语音信号的波形图执行分帧处理;Performing framing processing on the waveform diagram of the voice signal to be recognized;
    在分帧处理之后,对每一帧信号执行加窗处理;After framing processing, perform windowing processing on each frame of signal;
    对加窗处理后的每一帧信号执行离散傅里叶变换,得到该帧信号对应的频谱;Perform discrete Fourier transform on each frame signal after windowing processing to obtain the frequency spectrum corresponding to the frame signal;
    根据所有帧信号对应的频谱计算所述语音信号的功率谱;Calculating the power spectrum of the voice signal according to the frequency spectrum corresponding to all frame signals;
    根据所述功率谱计算梅尔滤波器组;Calculating a mel filter bank according to the power spectrum;
    对每一个所述梅尔滤波器的输出执行对数运算,得到对数能量;Perform logarithmic operation on the output of each mel filter to obtain logarithmic energy;
    对所述对数能量执行离散余弦变换,得到所述语音信号的梅尔频率倒谱系数。The discrete cosine transform is performed on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the speech signal.
  6. 一种基于短文本的声纹检测装置,其特征在于,包括:A voiceprint detection device based on short text, which is characterized in that it comprises:
    训练模块,用于获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;The training module is used to obtain training samples, and use the training samples to train a preset deep neural network;
    信号获取模块,用于获取待识别的语音信号;The signal acquisition module is used to acquire the voice signal to be recognized;
    特征提取模块,用于对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数;The feature extraction module is configured to preprocess the voice signal to be recognized, and perform feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
    特征获取模块,用于将所述梅尔频率倒谱系数作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;The feature acquisition module is used to input the Mel frequency cepstrum coefficients into a pre-trained deep neural network, and acquire the output vector of the deep neural network in the last fully connected layer as the voice signal A voiceprint vector, where each element in the voiceprint vector represents a feature of the voice signal;
    检测模块,用于将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;The detection module is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and output a voiceprint detection result according to the comparison result;
    其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
  7. 如权利要求6所述的基于短文本的声纹检测装置,其特征在于,所述训练模块包括:The voiceprint detection device based on short text according to claim 6, wherein the training module comprises:
    样本获取单元,用于获取多个用户的语音样本作为训练样本;The sample acquisition unit is used to acquire voice samples of multiple users as training samples;
    特征提取单元,用于对每一个所述用户的训练样本进行预处理,对预处理后的训练样本进行特征提取,得到梅尔频率倒谱系数;The feature extraction unit is configured to preprocess the training samples of each user, and perform feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
    标签单元,用于对每一个所述用户的梅尔频率倒谱系数打上用户标签;The tag unit is used to tag the Mel frequency cepstrum coefficient of each user with a user tag;
    训练单元,用于将带有用户标签的梅尔频率倒谱系数作为输入向量传入预设的深度神经网络进行训练;The training unit is used to input the Mel frequency cepstrum coefficients with user tags as input vectors into the preset deep neural network for training;
    参数修改单元,用于采用预设的损失函数计算每一所述梅尔频率倒谱系数经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数;The parameter modification unit is used to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag using a preset loss function, and modify the Parameters of deep neural network;
    所述训练单元还用于,将带有用户标签的梅尔频率倒谱系数作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练,直至所述深度神经网络对每一梅尔频率倒谱系数的识别结果的准确率达到指定阈值,停止迭代。The training unit is also used to pass the Mel frequency cepstrum coefficients with user labels as an input vector to the modified deep neural network for the next iterative training, until the deep neural network performs the next iteration of training for each Mel frequency The accuracy of the recognition result of the cepstral coefficient reaches the specified threshold, and the iteration is stopped.
  8. 如权利要求7所述的基于短文本的声纹检测装置,其特征在于,所述深度神经网络包括输入层、四层全连接层以及输出层,每一全连接层为12维输入,采用maxout激发函数,且第三全连接层和第四全连接层采用丢弃策略进行训练。The voiceprint detection device based on short text according to claim 7, wherein the deep neural network includes an input layer, a four-layer fully connected layer, and an output layer, each fully connected layer is a 12-dimensional input, using maxout Excitation function, and the third fully connected layer and the fourth fully connected layer adopt the discarding strategy for training.
  9. 如权利要求6至8任一项所述的基于短文本的声纹检测装置,其特征在于,所述检测模块包括:The voiceprint detection device based on short text according to any one of claims 6 to 8, wherein the detection module comprises:
    比对单元,用于将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对;The comparison unit is configured to compare the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library;
    第一结果输出单元,用于若所述声纹模型库中存在与所述语音信号的声纹向量相同的预存声纹向量时,获取所述预存声纹向量对应的用户信息,输出所述用户信息;The first result output unit is configured to obtain user information corresponding to the pre-stored voiceprint vector if there is a pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library, and output the user information;
    第二结果输出单元,用于若所述声纹模型库中不存在与所述语音信号的声纹向量相同的预存声纹向量时,输出检测失败的提示信息。The second result output unit is configured to output a prompt message that the detection fails if there is no pre-stored voiceprint vector that is the same as the voiceprint vector of the voice signal in the voiceprint model library.
  10. 如权利要求6至8任一项所述的基于短文本的声纹检测装置,其特征在于,所述特征提取模块包括:The voiceprint detection device based on short text according to any one of claims 6 to 8, wherein the feature extraction module comprises:
    分帧单元,用于对所述待识别的语音信号的波形图执行分帧处理;The framing unit is configured to perform framing processing on the waveform diagram of the voice signal to be recognized;
    加窗单元,用于在分帧处理之后,对每一帧信号执行加窗处理;The windowing unit is used to perform windowing processing on each frame of signal after framing processing;
    变换单元,用于对加窗处理后的每一帧信号执行离散傅里叶变换,得到该帧信号对应的频谱;A transforming unit for performing discrete Fourier transform on each frame signal after windowing processing to obtain the frequency spectrum corresponding to the frame signal;
    功率谱计算单元,用于根据所有帧信号对应的频谱计算所述语音信号的功率谱;A power spectrum calculation unit, configured to calculate the power spectrum of the voice signal according to the spectrum corresponding to all frame signals;
    滤波器组计算单元,用于根据所述功率谱计算梅尔滤波器组;A filter bank calculation unit for calculating a mel filter bank according to the power spectrum;
    对数单元,用于对每一个所述梅尔滤波器的输出执行对数运算,得到对数能量;Logarithmic unit, used to perform logarithmic operation on the output of each mel filter to obtain logarithmic energy;
    余弦变换单元,用于对所述对数能量执行离散余弦变换,得到所述语音信号的梅尔频率倒谱系数。The cosine transform unit is configured to perform discrete cosine transform on the logarithmic energy to obtain the Mel frequency cepstrum coefficient of the voice signal.
  11. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor executes the computer-readable instructions as follows step:
    获取训练样本,采用所述训练样本对预设的深度神经网络进行训练;Obtaining training samples, and using the training samples to train a preset deep neural network;
    获取待识别的语音信号;Obtain the voice signal to be recognized;
    对所述待识别的语音信号进行预处理,并对预处理后的所述语音信号进行特征提取,得到梅尔频率倒谱系数;Preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain the Mel frequency cepstrum coefficient;
    将所述梅尔频率倒谱系数作为输入传入预先训练好的深度神经网络,获取所述深度神经网络在最后 一层全连接层的输出向量,作为所述语音信号的声纹向量,所述声纹向量中的各个元素表示所述语音信号的特征;The Mel frequency cepstral coefficients are passed into a pre-trained deep neural network as input, and the output vector of the deep neural network in the last fully connected layer is obtained as the voiceprint vector of the speech signal. Each element in the voiceprint vector represents the feature of the voice signal;
    将所述语音信号的声纹向量与声纹模型库中的预存声纹向量进行比对,并根据比对结果输出声纹检测结果;Comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vector in the voiceprint model library, and outputting the voiceprint detection result according to the comparison result;
    其中,所述训练样本和语音信号均为短文本。Wherein, the training samples and speech signals are both short texts.
  12. 如权利要求11所述的计算机设备,其特征在于,所述获取训练样本,采用所述训练样本对预设的深度神经网络进行训练包括:The computer device according to claim 11, wherein said obtaining training samples and using said training samples to train a preset deep neural network comprises:
    获取多个用户的语音样本作为训练样本;Acquire voice samples of multiple users as training samples;
    对每一个所述用户的训练样本进行预处理,对预处理后的训练样本进行特征提取,得到梅尔频率倒谱系数;Preprocessing the training samples of each user, and performing feature extraction on the preprocessed training samples to obtain the Mel frequency cepstrum coefficient;
    对每一个所述用户的梅尔频率倒谱系数打上用户标签;Labeling a user tag on the Mel frequency cepstrum coefficient of each user;
    将带有用户标签的梅尔频率倒谱系数作为输入向量传入预设的深度神经网络进行训练;The Mel frequency cepstrum coefficients with user labels are used as input vectors to the preset deep neural network for training;
    采用预设的损失函数计算每一所述梅尔频率倒谱系数经过所述深度神经网络的识别结果与对应的用户标签之间的误差,并根据所述误差修改所述深度神经网络的参数;Using a preset loss function to calculate the error between the recognition result of each Mel frequency cepstrum coefficient through the deep neural network and the corresponding user tag, and modify the parameters of the deep neural network according to the error;
    将带有用户标签的梅尔频率倒谱系数作为输入向量传入参数修改后的深度神经网络进行下一次迭代训练,直至所述深度神经网络对每一梅尔频率倒谱系数的识别结果的准确率达到指定阈值,停止迭代。The mel frequency cepstral coefficients with user labels are used as input vectors to pass into the modified deep neural network for the next iterative training, until the deep neural network has an accurate recognition result of each mel frequency cepstral coefficient If the rate reaches the specified threshold, stop iteration.
13. The computer device according to claim 12, wherein the deep neural network comprises an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy.
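A sketch of how the network described in claim 13 could be assembled in PyTorch. The number of maxout pieces, the dropout probability and the exact placement of dropout are assumptions; the claim only fixes four fully connected layers with 12-dimensional inputs, the maxout activation, and dropout training on the third and fourth fully connected layers.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: a linear map into several pieces followed by an element-wise max."""
    def __init__(self, in_dim, out_dim, pieces=2):
        super().__init__()
        self.out_dim, self.pieces = out_dim, pieces
        self.linear = nn.Linear(in_dim, out_dim * pieces)

    def forward(self, x):
        y = self.linear(x)
        return y.view(*x.shape[:-1], self.out_dim, self.pieces).max(dim=-1).values

class VoiceprintDNN(nn.Module):
    """Input layer, four fully connected (maxout) layers, output layer; dropout is
    applied to the third and fourth fully connected layers during training."""
    def __init__(self, num_users, dim=12, dropout_p=0.5):
        super().__init__()
        self.fc1, self.fc2 = Maxout(dim, dim), Maxout(dim, dim)
        self.fc3, self.fc4 = Maxout(dim, dim), Maxout(dim, dim)
        self.drop = nn.Dropout(dropout_p)
        self.out = nn.Linear(dim, num_users)        # output layer over enrolled users

    def last_fc_output(self, x):
        """Output of the fourth fully connected layer, used as the voiceprint vector."""
        x = self.fc2(self.fc1(x))
        return self.fc4(self.drop(self.fc3(x)))

    def forward(self, x):
        return self.out(self.drop(self.last_fc_output(x)))
```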
14. The computer device according to any one of claims 11 to 13, wherein comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result, comprises:
    comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
    if the voiceprint model library contains a pre-stored voiceprint vector identical to the voiceprint vector of the voice signal, acquiring the user information corresponding to that pre-stored voiceprint vector and outputting the user information;
    if the voiceprint model library contains no pre-stored voiceprint vector identical to the voiceprint vector of the voice signal, outputting a prompt message indicating that the detection has failed.
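Read literally, the comparison step in claim 14 is a library lookup that returns either the matched user's information or a failure prompt. A minimal sketch follows, assuming a small numeric tolerance stands in for "identical" vectors:

```python
import numpy as np

def compare_with_library(voiceprint, library, tolerance=1e-6):
    """library: iterable of (user_info, pre-stored voiceprint vector) pairs."""
    for user_info, stored in library:
        # treat vectors as identical when they agree within the tolerance
        if np.allclose(voiceprint, stored, atol=tolerance):
            return {"result": "match", "user": user_info}
    # no identical pre-stored vector: report a detection failure
    return {"result": "failure", "message": "voiceprint detection failed"}
```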
15. The computer device according to any one of claims 11 to 13, wherein preprocessing the voice signal to be recognized and performing feature extraction on the preprocessed voice signal to obtain mel-frequency cepstral coefficients comprises:
    performing framing on the waveform of the voice signal to be recognized;
    after framing, performing windowing on each frame of the signal;
    performing a discrete Fourier transform on each windowed frame to obtain the frequency spectrum of that frame;
    calculating the power spectrum of the voice signal from the frequency spectra of all frames;
    calculating a mel filter bank according to the power spectrum;
    performing a logarithmic operation on the output of each mel filter to obtain log energy;
    performing a discrete cosine transform on the log energy to obtain the mel-frequency cepstral coefficients of the voice signal.
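The seven feature-extraction steps of claim 15 correspond to a standard MFCC computation. A numpy sketch is given below; the frame length, frame step, Hamming window, filter count and number of retained coefficients are assumptions, and `mel_filter_bank` is the helper sketched earlier.

```python
import numpy as np

def extract_mfcc(signal, sample_rate=16000, frame_len=400, frame_step=160,
                 nfft=512, num_filters=26, num_ceps=12):
    """Framing -> windowing -> DFT -> power spectrum -> mel filter bank -> log -> DCT."""
    # 1. framing: pad so the last frame is complete, then slice into overlapping frames
    num_frames = 1 + max(0, int(np.ceil((len(signal) - frame_len) / frame_step)))
    pad = (num_frames - 1) * frame_step + frame_len - len(signal)
    signal = np.append(signal, np.zeros(max(pad, 0)))
    frames = np.stack([signal[i * frame_step: i * frame_step + frame_len]
                       for i in range(num_frames)])
    # 2. windowing each frame (a Hamming window is assumed)
    frames = frames * np.hamming(frame_len)
    # 3. discrete Fourier transform per frame and 4. power spectrum
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft
    # 5. mel filter bank applied to the power spectrum
    filtered = power @ mel_filter_bank(num_filters, nfft, sample_rate).T
    # 6. logarithm of each filter's output
    log_energy = np.log(np.maximum(filtered, 1e-10))
    # 7. discrete cosine transform (type II), keeping the first num_ceps coefficients
    n = np.arange(num_filters)
    basis = np.cos(np.pi * np.outer(np.arange(num_ceps), 2 * n + 1) / (2 * num_filters))
    return log_energy @ basis.T
```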
16. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining training samples, and training a preset deep neural network with the training samples;
    obtaining a voice signal to be recognized;
    preprocessing the voice signal to be recognized, and performing feature extraction on the preprocessed voice signal to obtain mel-frequency cepstral coefficients;
    feeding the mel-frequency cepstral coefficients as input into the pre-trained deep neural network, and taking the output vector of the last fully connected layer of the deep neural network as the voiceprint vector of the voice signal, where each element of the voiceprint vector represents a feature of the voice signal;
    comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in a voiceprint model library, and outputting a voiceprint detection result according to the comparison result;
    wherein both the training samples and the voice signal are short texts.
17. The non-volatile readable storage media according to claim 16, wherein obtaining training samples and training a preset deep neural network with the training samples comprises:
    acquiring voice samples of multiple users as training samples;
    preprocessing the training samples of each user, and performing feature extraction on the preprocessed training samples to obtain mel-frequency cepstral coefficients;
    attaching a user label to the mel-frequency cepstral coefficients of each user;
    feeding the labeled mel-frequency cepstral coefficients as input vectors into the preset deep neural network for training;
    using a preset loss function to calculate the error between the recognition result produced by the deep neural network for each set of mel-frequency cepstral coefficients and the corresponding user label, and modifying the parameters of the deep neural network according to the error;
    feeding the labeled mel-frequency cepstral coefficients as input vectors into the parameter-modified deep neural network for the next training iteration, and stopping the iteration once the accuracy of the deep neural network's recognition results for the mel-frequency cepstral coefficients reaches a specified threshold.
18. The non-volatile readable storage media according to claim 17, wherein the deep neural network comprises an input layer, four fully connected layers, and an output layer; each fully connected layer takes a 12-dimensional input and uses a maxout activation function, and the third and fourth fully connected layers are trained with a dropout strategy.
19. The non-volatile readable storage media according to any one of claims 16 to 18, wherein comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library, and outputting a voiceprint detection result according to the comparison result, comprises:
    comparing the voiceprint vector of the voice signal with the pre-stored voiceprint vectors in the voiceprint model library;
    if the voiceprint model library contains a pre-stored voiceprint vector identical to the voiceprint vector of the voice signal, acquiring the user information corresponding to that pre-stored voiceprint vector and outputting the user information;
    if the voiceprint model library contains no pre-stored voiceprint vector identical to the voiceprint vector of the voice signal, outputting a prompt message indicating that the detection has failed.
20. The non-volatile readable storage media according to any one of claims 16 to 18, wherein preprocessing the voice signal to be recognized and performing feature extraction on the preprocessed voice signal to obtain mel-frequency cepstral coefficients comprises:
    performing framing on the waveform of the voice signal to be recognized;
    after framing, performing windowing on each frame of the signal;
    performing a discrete Fourier transform on each windowed frame to obtain the frequency spectrum of that frame;
    calculating the power spectrum of the voice signal from the frequency spectra of all frames;
    calculating a mel filter bank according to the power spectrum;
    performing a logarithmic operation on the output of each mel filter to obtain log energy;
    performing a discrete cosine transform on the log energy to obtain the mel-frequency cepstral coefficients of the voice signal.
PCT/CN2019/117731 2019-03-06 2019-11-13 Voiceprint detection method, apparatus and device based on short text, and storage medium WO2020177380A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910167882.3A CN110010133A (en) 2019-03-06 2019-03-06 Vocal print detection method, device, equipment and storage medium based on short text
CN201910167882.3 2019-03-06

Publications (1)

Publication Number Publication Date
WO2020177380A1 (en) 2020-09-10

Family

ID=67166562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117731 WO2020177380A1 (en) 2019-03-06 2019-11-13 Voiceprint detection method, apparatus and device based on short text, and storage medium

Country Status (2)

Country Link
CN (1) CN110010133A (en)
WO (1) WO2020177380A1 (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text
CN110751944A (en) * 2019-09-19 2020-02-04 平安科技(深圳)有限公司 Method, device, equipment and storage medium for constructing voice recognition model
CN110570871A (en) * 2019-09-20 2019-12-13 平安科技(深圳)有限公司 TristouNet-based voiceprint recognition method, device and equipment
CN110767239A (en) * 2019-09-20 2020-02-07 平安科技(深圳)有限公司 Voiceprint recognition method, device and equipment based on deep learning
CN110880327A (en) * 2019-10-29 2020-03-13 平安科技(深圳)有限公司 Audio signal processing method and device
CN110875043B (en) * 2019-11-11 2022-06-17 广州国音智能科技有限公司 Voiceprint recognition method and device, mobile terminal and computer readable storage medium
CN111128234B (en) * 2019-12-05 2023-02-14 厦门快商通科技股份有限公司 Spliced voice recognition detection method, device and equipment
CN111145736B (en) * 2019-12-09 2022-10-04 华为技术有限公司 Speech recognition method and related equipment
CN111462757B (en) * 2020-01-15 2024-02-23 北京远鉴信息技术有限公司 Voice signal-based data processing method, device, terminal and storage medium
CN113223536B (en) * 2020-01-19 2024-04-19 Tcl科技集团股份有限公司 Voiceprint recognition method and device and terminal equipment
CN111227839B (en) * 2020-01-19 2023-08-18 中国电子科技集团公司电子科学研究院 Behavior recognition method and device
CN111326161B (en) * 2020-02-26 2023-06-30 北京声智科技有限公司 Voiceprint determining method and device
CN111341320B (en) * 2020-02-28 2023-04-14 中国工商银行股份有限公司 Phrase voice voiceprint recognition method and device
CN111341307A (en) * 2020-03-13 2020-06-26 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113470653A (en) * 2020-03-31 2021-10-01 华为技术有限公司 Voiceprint recognition method, electronic equipment and system
CN111583935A (en) * 2020-04-02 2020-08-25 深圳壹账通智能科技有限公司 Loan intelligent delivery method, device and storage medium
CN111326163B (en) * 2020-04-15 2023-02-14 厦门快商通科技股份有限公司 Voiceprint recognition method, device and equipment
CN111524522B (en) * 2020-04-23 2023-04-07 上海依图网络科技有限公司 Voiceprint recognition method and system based on fusion of multiple voice features
CN111488947B (en) * 2020-04-28 2024-02-02 深圳力维智联技术有限公司 Fault detection method and device for power system equipment
CN112185347A (en) * 2020-09-27 2021-01-05 北京达佳互联信息技术有限公司 Language identification method, language identification device, server and storage medium
CN112242137A (en) * 2020-10-15 2021-01-19 上海依图网络科技有限公司 Training of human voice separation model and human voice separation method and device
CN112259114A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Voice processing method and device, computer storage medium and electronic equipment
CN112071322B (en) * 2020-10-30 2022-01-25 北京快鱼电子股份公司 End-to-end voiceprint recognition method, device, storage medium and equipment
CN112562691A (en) * 2020-11-27 2021-03-26 平安科技(深圳)有限公司 Voiceprint recognition method and device, computer equipment and storage medium
CN112562656A (en) * 2020-12-16 2021-03-26 咪咕文化科技有限公司 Signal classification method, device, equipment and storage medium
CN112802481A (en) * 2021-04-06 2021-05-14 北京远鉴信息技术有限公司 Voiceprint verification method, voiceprint recognition model training method, device and equipment
CN113407768B (en) * 2021-06-24 2024-02-02 深圳市声扬科技有限公司 Voiceprint retrieval method, voiceprint retrieval device, voiceprint retrieval system, voiceprint retrieval server and storage medium
CN114003885B (en) * 2021-11-01 2022-08-26 浙江大学 Intelligent voice authentication method, system and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185379B (en) * 2015-06-17 2017-08-18 百度在线网络技术(北京)有限公司 voiceprint authentication method and device
CN105869644A (en) * 2016-05-25 2016-08-17 百度在线网络技术(北京)有限公司 Deep learning based voiceprint authentication method and device
WO2019023877A1 (en) * 2017-07-31 2019-02-07 深圳和而泰智能家居科技有限公司 Specific sound recognition method and device, and storage medium
CN108417217B (en) * 2018-01-11 2021-07-13 思必驰科技股份有限公司 Speaker recognition network model training method, speaker recognition method and system
CN108877812B (en) * 2018-08-16 2021-04-02 桂林电子科技大学 Voiceprint recognition method and device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150301796A1 (en) * 2014-04-17 2015-10-22 Qualcomm Incorporated Speaker verification
CN105788592A (en) * 2016-04-28 2016-07-20 乐视控股(北京)有限公司 Audio classification method and apparatus thereof
CN107808664A (en) * 2016-08-30 2018-03-16 富士通株式会社 Audio recognition method, speech recognition equipment and electronic equipment based on sparse neural network
CN107610707A (en) * 2016-12-15 2018-01-19 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN107527620A (en) * 2017-07-25 2017-12-29 平安科技(深圳)有限公司 Electronic installation, the method for authentication and computer-readable recording medium
CN110010133A (en) * 2019-03-06 2019-07-12 平安科技(深圳)有限公司 Vocal print detection method, device, equipment and storage medium based on short text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何建超 (HE, Jianchao): "基于高层信息融合的短语音说话人识别方法研究 (The Research of High-Level Information Fusion Based Speaker Recognition Algorithm Using Short Utterance)", Chinese Master's Theses Full-Text Database (Electronic Journals), 15 April 2017 (2017-04-15), XP055732004, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN110010133A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
US20200321008A1 (en) Voiceprint recognition method and device based on memory bottleneck feature
Liu et al. An MFCC‐based text‐independent speaker identification system for access control
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
WO2020224114A1 (en) Residual delay network-based speaker confirmation method and apparatus, device and medium
WO2020244153A1 (en) Conference voice data processing method and apparatus, computer device and storage medium
WO2021000408A1 (en) Interview scoring method and apparatus, and device and storage medium
CN110378228A (en) Video data handling procedure, device, computer equipment and storage medium are examined in face
CN109346086A (en) Method for recognizing sound-groove, device, computer equipment and computer readable storage medium
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
WO2021042537A1 (en) Voice recognition authentication method and system
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
Dawood et al. A robust voice spoofing detection system using novel CLS-LBP features and LSTM
Ismail et al. Development of a regional voice dataset and speaker classification based on machine learning
Karthikeyan Adaptive boosted random forest-support vector machine based classification scheme for speaker identification
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN113869212A (en) Multi-modal in-vivo detection method and device, computer equipment and storage medium
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
Kuznetsov et al. Methods of countering speech synthesis attacks on voice biometric systems in banking
Tai et al. Seef-aldr: A speaker embedding enhancement framework via adversarial learning based disentangled representation
Khanum et al. A novel speaker identification system using feed forward neural networks
Nguyen et al. Vietnamese speaker authentication using deep models
CN113178196B (en) Audio data extraction method and device, computer equipment and storage medium
Al-karawi Real-time adaptive training for forensic speaker verification in reverberation conditions
Revathi et al. Real time implementation of voice based robust person authentication using TF features and CNN

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19918346; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 19918346; Country of ref document: EP; Kind code of ref document: A1