WO2020224114A1 - Residual delay network-based speaker confirmation method and apparatus, device and medium - Google Patents

Residual delay network-based speaker confirmation method and apparatus, device and medium

Info

Publication number
WO2020224114A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio information
delay network
residual delay
audio
feature vector
Prior art date
Application number
PCT/CN2019/103155
Other languages
French (fr)
Chinese (zh)
Inventor
彭俊清
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020224114A1 publication Critical patent/WO2020224114A1/en

Classifications

    • G10L17/02: Speaker identification or verification; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/04: Speaker identification or verification; training, enrolment or model building
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, the extracted parameters being spectral information of each sub-band
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, the extracted parameters being the cepstrum
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • This application relates to the field of information technology, and in particular to a method, device, equipment and medium for speaker confirmation based on residual time delay network.
  • Voiceprint recognition, also known as speaker recognition, is a type of biometric technology. It mainly addresses two problems: speaker identification and speaker verification. Speaker identification determines which of several speakers a given utterance comes from (a one-of-many problem), while speaker verification determines whether a given utterance was spoken by a designated person under test (a one-to-one problem). Speaker verification is widely used in many fields and is in broad demand in industries and sectors such as banking, non-bank finance, public security, the military, and other civilian security authentication.
  • Depending on whether the detected speech must have specified content, speaker verification is divided into two approaches: text-dependent verification and text-independent verification.
  • In recent years there have been continuous breakthroughs in text-independent speaker verification methods, and their accuracy has improved greatly compared with the past. However, in some constrained situations, for example when the collected effective speech of the speaker is short, their accuracy is still unsatisfactory.
  • The embodiments of the present application provide a method, apparatus, device, and medium for speaker verification based on a residual delay network, to solve the problem that existing text-independent speaker verification methods have poor accuracy on short audio.
  • A speaker verification method based on a residual delay network, including:
  • constructing a residual delay network, and training the residual delay network with a preset training sample set;
  • acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
  • performing preprocessing on the audio information set of the test user;
  • performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients corresponding to the registered audio and to the test audio, respectively;
  • passing the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • passing the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
  • outputting the speaker verification result according to the score.
  • Further, the residual delay network is obtained by replacing the session frame level in a delay network with residual delay network blocks, and the residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
  • Further, training the residual delay network with a preset training sample set includes:
  • collecting multiple pieces of audio information from several speakers as a training sample set;
  • performing preprocessing on the audio information in the training sample set;
  • performing feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
  • passing the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtaining the recognition result output by the residual delay network;
  • using a preset loss function to calculate the error between the recognition result obtained by passing the Mel frequency cepstral coefficients of each piece of audio information through the residual delay network and the corresponding speaker label, and modifying the parameters of the residual delay network according to the error;
  • passing the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
  • Further, performing preprocessing on the audio information in the training sample set includes:
  • adding a speaker label to each piece of audio information, and classifying by speaker label to obtain the audio information set of each speaker;
  • removing from the training sample set the speakers, and their audio information sets, whose number of pieces of audio information is less than a first preset threshold;
  • performing voice activity detection on each piece of audio information in the remaining audio information sets, and deleting the non-speech parts according to the voice activity detection results to obtain the speech-part duration;
  • removing from the audio information set the audio information whose speech-part duration is less than a second preset threshold.
  • Further, outputting the speaker verification result according to the score includes:
  • comparing the score with a preset score threshold;
  • if the score is greater than or equal to the preset score threshold, outputting indication information that the feature vector to be tested and the registered feature vector come from the same speaker;
  • if the score is less than the preset score threshold, outputting indication information that the feature vector to be tested and the registered feature vector come from different speakers.
  • a speaker confirmation device based on residual time delay network including:
  • the training module is used to construct a residual delay network, and use a preset training sample set to train the residual delay network;
  • An acquiring module configured to acquire an audio information set of a test user, the audio information set includes registered audio and test audio;
  • a preprocessing module for performing preprocessing on the audio information set of the test user
  • the feature extraction module is configured to perform feature extraction on the pre-processed audio information set to obtain Mel frequency cepstral coefficients corresponding to the registered audio and Mel frequency cepstral coefficients corresponding to the test audio respectively;
  • the first feature acquisition module is configured to pass the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtain the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • the second feature acquisition module is configured to pass the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtain the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • the score obtaining module is configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtain the score output by the probabilistic linear discriminant analysis model;
  • the speaker confirmation module is used to output the speaker confirmation result according to the score.
  • the training module includes:
  • the collection unit is used to collect multiple audio information of several speakers as a training sample set
  • a preprocessing unit configured to perform preprocessing on the audio information in the training sample set
  • the feature extraction unit is configured to perform feature extraction on each of the audio information after preprocessing to obtain the corresponding Mel frequency cepstral coefficient
  • a training unit configured to pass the Mel frequency cepstral coefficient corresponding to each audio information as an input vector into a preset residual delay network for training, and obtain the recognition result output by the residual delay network;
  • the parameter modification unit is used to calculate, with a preset loss function, the error between the recognition result obtained by passing the Mel frequency cepstral coefficients of each piece of audio information through the residual delay network and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
  • the training unit is also used to input the Mel frequency cepstrum coefficient corresponding to each audio information as an input vector to the modified residual delay network to perform the next training.
  • the preprocessing unit includes:
  • the tag subunit is used to add a speaker tag to each of the audio information, classify according to the speaker tag, and obtain the audio information set of each speaker;
  • the first elimination subunit is used to eliminate audio information sets and speakers whose number of audio information is less than a first preset threshold from the training sample set;
  • the detection subunit is used to perform voice activity detection on each audio information in the remaining audio information set, and delete the non-voice part according to the voice activity detection result to obtain the voice part duration;
  • the second culling subunit is used for culling the audio information whose voice duration is less than the second preset threshold from the audio information set.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • constructing a residual delay network, and training the residual delay network with a preset training sample set;
  • acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
  • performing preprocessing on the audio information set of the test user;
  • performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients corresponding to the registered audio and to the test audio, respectively;
  • passing the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • passing the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
  • outputting the speaker verification result according to the score.
  • One or more non-volatile readable storage media storing computer-readable instructions are provided; when executed, the computer-readable instructions perform the following steps:
  • constructing a residual delay network, and training the residual delay network with a preset training sample set;
  • acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
  • performing preprocessing on the audio information set of the test user;
  • performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients corresponding to the registered audio and to the test audio, respectively;
  • passing the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • passing the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
  • outputting the speaker verification result according to the score.
  • FIG. 1 is a flowchart of a speaker verification method based on a residual delay network in an embodiment of the present application
  • Figure 2(a) is a schematic structural diagram of a delay network in an embodiment of the present application
  • Figure 2(b) is a schematic structural diagram of a residual network in an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of a residual delay network block in an embodiment of the present application.
  • FIG. 4 is a flowchart of step S101 in the speaker verification method based on the residual time delay network in an embodiment of the present application
  • FIG. 5 is a flowchart of step S402 in the speaker verification method based on the residual delay network in an embodiment of the present application
  • FIG. 6 is a flowchart of step S108 in the speaker verification method based on the residual time delay network in an embodiment of the present application
  • FIG. 7 is a schematic block diagram of a speaker confirmation device based on a residual time delay network in an embodiment of the present application.
  • Fig. 8 is a schematic diagram of a computer device in an embodiment of the present application.
  • the speaker confirmation method based on the residual delay network provided by the embodiment of the present application is applied to a server.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for speaker confirmation based on a residual delay network is provided, which includes the following steps:
  • step S101 a residual delay network is constructed, and a preset training sample set is used to train the residual delay network.
  • The residual delay network (Res-TDNN) provided by the embodiments of this application combines the time-delay neural network (TDNN) and the residual network (ResNet), taking the time-delay neural network as its basic structure.
  • As shown in Figure 2(a), the time-delay neural network TDNN includes a session frame level and a session segment level; the session segment level includes a statistics pooling layer (Statistics-Pooling), several embedding layers, and a classification output layer (log-softmax).
  • The structure of the residual network ResNet is shown in Figure 2(b). It includes two mappings, namely identity mapping and residual mapping, and connects the two mapping structures by a shortcut connection, which overcomes the problems of decreasing training-set accuracy and degrading network performance as the network deepens.
  • In the figure, the curved part is the identity mapping, represented by x; the remaining part is the residual mapping, represented by F(x).
  • The embodiment of this application combines the characteristics of the ResNet and TDNN networks and integrates the residual mapping of ResNet into the TDNN network, obtaining a residual delay network block (Res-TDNN block).
  • As shown in Figure 3, the residual delay network block combines the traditional TDNN network structure with identity mapping and residual mapping, and the activation function adopts, for example, a parameterized rectified linear unit (Parametric Rectified Linear Unit, PReLU).
  • This structure effectively transfers the residual of the previous layer to deeper layers, preventing the gradient from becoming too small as it propagates layer by layer, which would hinder training and trap the network in a local optimum. At the same time, as in the ResNet network, the depth of the network can be increased while the number of nodes in each layer is reduced, lowering the overall number of network parameters without reducing network performance.
  • In the embodiment of this application, the residual delay network block replaces the session frame level in the traditional TDNN network while the session slice level is kept unchanged, yielding the residual delay network, namely the Res-TDNN network. A minimal sketch of such a block follows.
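  • To make the block structure concrete, the following is a minimal sketch of one Res-TDNN block, assuming a PyTorch implementation; the channel count, context width, and dilation are illustrative assumptions, not values specified by this application.

```python
import torch
import torch.nn as nn

class ResTDNNBlock(nn.Module):
    """One residual delay network block: a TDNN layer (a 1-D convolution
    over time) as the residual mapping F(x), plus an identity shortcut,
    with a PReLU activation as suggested above. Sizes are illustrative."""
    def __init__(self, channels: int, context: int = 5, dilation: int = 1):
        super().__init__()
        # A TDNN layer is a time-dilated 1-D convolution; the padding keeps
        # the number of frames unchanged so the shortcut addition lines up.
        self.tdnn = nn.Conv1d(channels, channels, kernel_size=context,
                              dilation=dilation,
                              padding=(context // 2) * dilation)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.PReLU()  # parameterized ReLU (PReLU)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames). x is the identity mapping;
        # self.bn(self.tdnn(x)) is the residual mapping F(x).
        return self.act(x + self.bn(self.tdnn(x)))
```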
  • The training sample set used for training the Res-TDNN network includes multiple pieces of audio information from several speakers. To facilitate understanding, the training process of the Res-TDNN network is described in detail below. As shown in FIG. 4, the training of the residual delay network with the preset training sample set in step S101 includes:
  • step S401 multiple audio information of several speakers are collected as a training sample set.
  • the embodiments of the present application may obtain audio information according to actual needs or application scenarios.
  • the audio information is obtained from a preset audio library, and a large amount of audio information is collected in advance in the preset audio library.
  • the training sample set can also be obtained by connecting to a communication device to collect telephone recordings. It is understandable that in this embodiment, the training sample set can also be obtained in a variety of ways, which will not be repeated here.
  • each speaker corresponds to an audio information set, and the audio information set includes multiple audio information.
  • step S402 preprocessing is performed on the audio information in the training sample set.
  • the step S402 includes:
  • step S501 a speaker tag is added to each of the audio information, and classification is performed according to the speaker tag to obtain an audio information set of each speaker.
  • each speaker corresponds to a speaker tag
  • the speaker tag is the identification information of the speaker, which is used to distinguish different speakers. Add a speaker tag corresponding to the speaker to the audio information of the same speaker to mark the speaker to which each audio information belongs.
  • Suppose there are K speakers, namely speaker spkr 1, speaker spkr 2, ..., speaker spkr K, with corresponding labels label 1, label 2, ..., label K.
  • the audio information of speaker spkr 1 is added with label 1
  • the audio information of speaker spkr 2 is added with label 2
  • the audio information of speaker spkr K is added with label K.
  • K is a positive integer.
  • step S502 the audio information set and the speaker whose number of audio information is less than the first preset threshold are removed from the training sample set.
  • Specifically, for each speaker, the number of pieces of audio information in the corresponding audio information set is counted, and this number is compared with the first preset threshold.
  • The first preset threshold is the criterion for deciding, based on the number of pieces of audio information, whether a speaker is eliminated. If the number of pieces of audio information in a speaker's audio information set is less than the first preset threshold, the speaker is excluded from the training sample set.
  • For example, the first preset threshold may be 4.
  • When a speaker has too few pieces of audio information, this embodiment removes the speaker and the corresponding audio information set from the training sample set, thereby ensuring the number of pieces of audio information for each speaker, which helps reduce the computation of the residual delay network and improves its training effect.
  • step S503 the voice activity detection is performed on each audio information in the remaining audio information set, and the non-voice part is deleted according to the voice activity detection result to obtain the voice part duration.
  • Voice activity detection (Voice Activity Detection, VAD), also called voice endpoint detection or voice boundary detection, refers to detecting which signals in the audio information are the speaker's voice components and which are non-voice components, such as silence and noise.
  • According to the result of the voice activity detection, long non-speech parts are identified and removed from the audio information, reducing the data volume of the training samples without reducing audio quality.
  • step S504 the audio information whose voice duration is less than the second preset threshold is excluded from the audio information set.
  • the duration of the voice part in the audio information is further obtained according to the result of the voice activity detection, and the voice duration is compared with the second preset threshold.
  • The second preset threshold is the criterion for deciding, based on speech duration, whether a piece of audio information is eliminated. If the speech duration of a piece of audio information in a speaker's audio information set is less than the second preset threshold, that piece is excluded from the set.
  • For example, the second preset threshold may be 1 second. If the speech duration of a piece of the speaker's audio is less than 1 second, the speaker may have spoken too fast or the content may be too short, so the sample is not representative.
  • In this case, the piece of audio information is removed from the speaker's audio information set.
  • Removing audio information whose speech-part duration is less than the second preset threshold effectively eliminates extreme cases and guarantees the length of the audio information in each speaker's set, which helps improve the training effect and generalization ability of the residual delay network.
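  • The application does not fix a particular VAD algorithm, so the following is a simplified energy-based sketch of steps S503 and S504 in Python; the frame length, hop, and energy floor are assumptions for illustration only.

```python
import numpy as np

def speech_duration(signal: np.ndarray, sr: int, frame_ms: int = 25,
                    hop_ms: int = 10, energy_floor: float = 1e-4) -> float:
    """Crude energy-based VAD: return the seconds of voiced audio left
    after dropping low-energy (silence/noise) frames. `signal` is assumed
    to be a mono waveform normalized to [-1, 1]."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = [float(np.mean(signal[i:i + frame] ** 2))
                for i in range(0, len(signal) - frame + 1, hop)]
    voiced = sum(e > energy_floor for e in energies)
    return voiced * hop / sr

# Step S504: keep only utterances with at least 1 second of speech
# (the example second preset threshold given above).
# audio_set = [a for a in audio_set if speech_duration(a, sr=16000) >= 1.0]
```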
  • the speaker and its audio information set remaining after preprocessing through the above steps S501 to S504 are used as the training sample set for training the residual delay network in the embodiment of the present application.
  • the whole training process includes several trainings, each training includes K speakers, and a total of N pieces of audio information.
  • step S403 feature extraction is performed on each of the pre-processed audio information to obtain the corresponding Mel frequency cepstrum coefficient.
  • Mel-scale frequency cepstral coefficients (MFCC) are a speech feature: cepstral parameters extracted in the Mel-scale frequency domain. The parameters take into account the human ear's differing sensitivity to different frequencies, making them especially suitable for speech recognition and speaker recognition.
  • the MFCC feature is used as the input of the residual delay network.
  • the process of feature extraction includes, but is not limited to, framing processing, windowing processing, discrete Fourier transform, power spectrum calculation, Mel filter bank calculation, logarithmic energy calculation, and discrete cosine transform.
  • This embodiment uses 23-dimensional MFCC features to further compress the amount of data computed by the residual delay network.
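  • As an illustration of step S403, the sketch below extracts 23-dimensional MFCC features with librosa, which wraps the pipeline listed above (framing, windowing, Fourier transform, Mel filter bank, log energy, discrete cosine transform); the use of librosa is an assumption, not part of the application.

```python
import librosa

def extract_mfcc(path: str, n_mfcc: int = 23):
    """Load an utterance and return its MFCC matrix, one frame per row."""
    y, sr = librosa.load(path, sr=None)  # keep the native sample rate
    # librosa returns shape (n_mfcc, frames); transpose so each row is the
    # 23-dimensional input vector for one frame of the residual delay network.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
```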
  • step S404 the Mel frequency cepstral coefficient corresponding to each audio information is input as an input vector to a preset residual delay network for training, and the recognition result output by the residual delay network is obtained.
  • the corresponding MFCC feature is used as an input vector and passed into the preset residual delay network for training, and the recognition result of the audio information is obtained.
  • The residual delay network includes stacked frame-level Res-TDNN blocks, a Statistics-Pooling layer, a segment-level layer, and a log-softmax layer.
  • During training, the 23-dimensional MFCC features of an audio signal are first input to the Res-TDNN blocks of the residual delay network for feature extraction; the resulting feature matrix is then passed through the Statistics-Pooling layer and the segment-level layer; the feature vector output by the segment-level layer is taken as the feature vector of the audio signal and carries its feature information.
  • The feature vector of the audio signal is then input to the log-softmax layer for classification.
  • The recognition result output by the log-softmax layer is a one-dimensional probability vector; if this training round involves K speakers, the probability vector includes K elements.
  • Each element corresponds to one speaker and represents the relative probability among the different speakers.
  • The audio information is attributed to the speaker corresponding to the element with the highest probability.
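  • The following sketch assembles the pipeline just described: stacked frame-level Res-TDNN blocks (reusing the ResTDNNBlock sketch above), a statistics-pooling layer that concatenates the per-channel mean and standard deviation over frames, segment-level embedding layers, and a log-softmax classifier. All layer widths and the speaker count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResTDNN(nn.Module):
    def __init__(self, feat_dim: int = 23, channels: int = 512,
                 embed_dim: int = 512, num_speakers: int = 1000):
        super().__init__()
        # Frame level: a front-end TDNN layer followed by Res-TDNN blocks
        # (ResTDNNBlock is the block sketched earlier).
        self.frame_level = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2),
            ResTDNNBlock(channels),
            ResTDNNBlock(channels, dilation=2),
            ResTDNNBlock(channels, dilation=3),
        )
        self.segment1 = nn.Linear(2 * channels, embed_dim)  # embedding layer
        self.segment2 = nn.Linear(embed_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, x: torch.Tensor):
        # x: (batch, feat_dim, frames)
        h = self.frame_level(x)
        # Statistics pooling: mean and std over the time axis turn a
        # variable-length frame sequence into one fixed-size vector.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        embed = self.segment1(stats)          # segment-level feature vector
        logits = self.classifier(torch.relu(self.segment2(embed)))
        return embed, torch.log_softmax(logits, dim=1)
```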
  • Steps S403 and S404 are performed on each of the N pieces of audio information in this training round until all N pieces have been traversed.
  • Then step S405 is executed.
  • step S405, a preset loss function is used to calculate the error between the recognition result obtained by passing the Mel frequency cepstral coefficients of each piece of audio information through the residual delay network and the corresponding speaker label, and the parameters of the residual delay network are modified according to the error.
  • The loss function is calculated in the loss layer of the residual delay network. Assuming there are K speakers and N pieces of audio information in each training round, the loss function takes the multi-class cross-entropy form E = -Σ_{n=1}^{N} Σ_{k=1}^{K} d_{nk} ln P(spkr_k | x_1^{(n)}, ..., x_T^{(n)}), where P(spkr_k | x_1^{(n)}, ..., x_T^{(n)}) is the probability output by the network that the T input frames of the nth audio belong to speaker k.
  • T represents the frame length of a piece of audio information;
  • x^{(n)} represents the nth audio among the N audios;
  • d_{nk} represents the label function: if the frames contained in the nth piece of audio information all come from speaker k, the value of d_{nk} is 1; otherwise it is 0.
  • The value of the frame length T is related to the length of the audio information and is determined by the TDNN network structure. Experiments usually intercept fixed-length audio; for example, with 4-second segments, T is 400.
  • In each training round, the above loss function is used to obtain the error between the recognition result of each piece of audio information and the corresponding preset label, and the error is propagated back to modify the parameters of the residual delay network, including the parameters of the Res-TDNN blocks, the Statistics-Pooling layer, and the segment-level layer.
  • the embodiment of the present application uses a back propagation algorithm to calculate the gradient of the residual delay network, and uses a stochastic gradient descent method to update the parameters of the residual delay network, so as to encourage it to continuously learn features until convergence.
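  • A minimal training-step sketch of steps S404 to S406 follows, pairing the network's log-softmax output with a negative log-likelihood loss (the cross-entropy form given above) and updating parameters by stochastic gradient descent through backpropagation; the learning rate and speaker count are assumptions.

```python
import torch

K = 1000                                       # illustrative speaker count
model = ResTDNN(num_speakers=K)                # the network sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.NLLLoss()                   # pairs with log-softmax output

def training_step(mfcc_batch: torch.Tensor, speaker_labels: torch.Tensor):
    """mfcc_batch: (batch, 23, frames); speaker_labels: (batch,) in [0, K)."""
    _, log_probs = model(mfcc_batch)
    loss = loss_fn(log_probs, speaker_labels)  # error vs. the speaker labels
    optimizer.zero_grad()
    loss.backward()                            # backpropagation (step S405)
    optimizer.step()                           # parameter modification
    return loss.item()                         # then repeat (step S406)
```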
  • step S406 the Mel frequency cepstrum coefficient corresponding to each audio information is used as an input vector and passed into the residual delay network after parameter modification to perform the next training.
  • the residual time delay network after the parameters are modified in step S405 is used for the next training.
  • K speakers are randomly selected from the preprocessed training sample set, and N audio information with preset labels are selected for training.
  • The training process is the same as that of steps S404 and S405 and is described above, so it is not repeated here. Steps S404, S405, and S406 are repeated for 50 to 150 iterations, so that the residual delay network learns the key features of the audio information and achieves better model performance.
  • The number of training iterations can be adjusted according to the size of the training set and is not limited here.
  • step S102 is executed.
  • step S102 an audio information set of the test user is obtained, where the audio information set includes registered audio and test audio.
  • the server may obtain the test user and its audio information according to actual needs or application scenarios, and obtain the test user's audio information set.
  • the test user and its audio information are acquired from a preset audio library, and a large number of users and their audio information are collected in advance in the preset audio library. You can also collect telephone recordings as the audio information of the test user by connecting to a communication device. It is understandable that the embodiment of the present application may also obtain the audio information set of the test user in various ways, which will not be repeated here.
  • The audio information of the test user includes test audio and registration audio.
  • The test audio is the audio on which speaker verification is performed through the residual delay network, and the registration audio is the audio from which the speaker feature database is built through the residual delay network.
  • the acquired test users may include one or more; the acquired test audio/registered audio may include one or more.
  • step S103 preprocessing is performed on the audio information set of the test user.
  • the step S103 includes:
  • Similar to the preprocessing of step S402, test users whose number of pieces of audio information is less than the first preset threshold are eliminated together with their audio information sets, and audio information whose speech-part duration is less than the second preset threshold is eliminated.
  • The details are not repeated here.
  • step S104 feature extraction is performed on the preprocessed audio information set, and Mel frequency cepstral coefficients corresponding to registered audio and Mel frequency cepstral coefficients corresponding to test audio are obtained respectively.
  • step S104 is the same as the above step S403.
  • this embodiment uses 23-dimensional MFCC features for testing.
  • step S105, the Mel frequency cepstral coefficients of the registered audio are passed into the trained residual delay network as an input vector, and the feature vector output by the residual delay network at the session slice level is obtained as the registered feature vector of the test user.
  • Specifically, the MFCC features of the registered audio are passed as input to the pre-trained residual delay network, and recognition of the registered audio is performed based on the MFCC features through the network.
  • The pre-trained residual delay network includes Res-TDNN blocks, a Statistics-Pooling layer, a segment-level layer, and a log-softmax layer.
  • After the residual delay network completes recognition of the registered audio, the vector output after embedding the registered audio at the segment-level layer is obtained as the registered feature vector of the registered audio.
  • The registered feature vector is the audio feature vector of the test user in the speaker feature database; each of its elements represents a voiceprint feature of the registered audio.
  • The speaker feature database can be set up according to identity-authentication application scenarios, such as online payment, voiceprint lock control, and proof-of-life authentication, to store the audio feature information of registered users that needs to be filed, namely the above registered feature vectors.
  • step S106, the Mel frequency cepstral coefficients of the test audio are passed into the trained residual delay network as an input vector, and the feature vector output by the residual delay network at the session slice level is obtained as the feature vector to be tested of the test user.
  • Specifically, the MFCC features of the test audio are obtained and passed as input to the pre-trained residual delay network, and recognition of the test audio is performed based on the MFCC features. After the residual delay network completes recognition of the test audio, the vector output after embedding the test audio at the segment-level layer is obtained as the feature vector to be tested of the test audio.
  • The feature vector to be tested is the audio feature vector with which the test user performs speaker verification through the residual delay network; each of its elements represents a voiceprint feature of the test audio.
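  • As a sketch of steps S105 and S106, the function below runs enrollment or test MFCCs through the trained network and keeps only the segment-level embedding, discarding the log-softmax classification output; the function and file names are illustrative.

```python
import torch

@torch.no_grad()
def extract_embedding(model: torch.nn.Module, mfcc) -> torch.Tensor:
    """mfcc: (frames, 23) array from extract_mfcc above; returns (embed_dim,)."""
    model.eval()
    x = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)  # (1, 23, T)
    embed, _ = model(x)              # keep the segment-level output only
    return embed.squeeze(0)

# enrolled = extract_embedding(model, extract_mfcc("registered.wav"))
# probe    = extract_embedding(model, extract_mfcc("test.wav"))
```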
  • step S107 the registered feature vector and the feature vector to be tested are input into a preset probabilistic linear discriminant analysis model, and the score output by the probabilistic linear discriminant analysis model is obtained.
  • Specifically, the feature vector to be tested and the registered feature vector are input into a preset probabilistic linear discriminant analysis (Probabilistic Linear Discriminant Analysis, PLDA) model.
  • The PLDA model calculates the similarity between the feature vector to be tested and the registered feature vector to obtain a score. The higher the score, the higher the consistency between the two feature vectors; the lower the score, the lower the consistency.
  • step S108 the speaker confirmation result is output according to the score.
  • the step S108 includes:
  • step S601 the score is compared with a preset score threshold.
  • the preset score threshold is set based on experience as a criterion for judging whether the feature vector to be tested and the registered feature vector come from the same speaker.
  • step S602, if the score is greater than or equal to the preset score threshold, the indication information that the feature vector to be tested and the registered feature vector come from the same speaker is output.
  • this embodiment determines that the feature vector to be tested and the registered feature vector are from the same speaker, and outputs indication information indicating that the speaker confirmation result is the same speaker.
  • step S603 if the score is less than the preset score threshold, output the indication information that the feature vector to be tested and the registered feature vector are from different speakers.
  • this embodiment determines that the feature vector to be tested and the registered feature vector are from different speakers, and outputs information indicating that the speaker confirmation result is a different speaker.
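  • The decision logic of steps S107 and S108 is sketched below. The application scores with a PLDA model; since the PLDA parameters are not given here, cosine similarity is used as a simplified stand-in that preserves the logic: a higher score means higher consistency, and the score is compared against a preset threshold. The threshold value is an assumption.

```python
import torch
import torch.nn.functional as F

def confirm_speaker(enrolled: torch.Tensor, probe: torch.Tensor,
                    threshold: float = 0.6) -> bool:
    """Return True if the two embeddings are judged to be the same speaker."""
    score = F.cosine_similarity(enrolled, probe, dim=0)
    # Score >= threshold: same speaker; otherwise different speakers.
    return bool(score.item() >= threshold)
```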
  • In summary, this embodiment constructs a residual delay network and extracts the registered feature vector from the preprocessed registered audio through that network to establish a speaker feature database. When performing speaker verification, the feature vector to be tested is extracted from the preprocessed test audio through the residual delay network and passed, together with the registered feature vector from the speaker feature database, into the PLDA model to calculate a score; the score is compared with a preset score threshold, and the speaker verification result is output according to the comparison. Because the residual delay network replaces the session frame level of a traditional delay network with residual delay network blocks, it requires a smaller training set than traditional TDNN-plus-PLDA speaker verification methods and the model is easier to train, which effectively reduces the training cost.
  • In addition, this method can increase the depth of the network while reducing the number of nodes in each layer, so that even though the overall number of network parameters decreases, network performance is not affected.
  • The residual delay network extracts key features, which effectively reduces noise interference; in short-audio speaker verification it achieves significantly better results than the traditional PLDA model.
  • a speaker verification device based on a residual delay network is provided.
  • the speaker verification device based on the residual delay network is the same as the speaker verification method based on the residual delay network in the foregoing embodiment.
  • The speaker verification device based on the residual time delay network includes a training module, an acquisition module, a preprocessing module, a feature extraction module, a first feature acquisition module, a second feature acquisition module, a score acquisition module, and a speaker confirmation module.
  • each functional module is as follows:
  • the training module 71 is configured to construct a residual delay network, and use a preset training sample set to train the residual delay network;
  • the obtaining module 72 is configured to obtain an audio information set of a test user, where the audio information set includes registered audio and test audio;
  • the preprocessing module 73 is configured to perform preprocessing on the audio information set of the test user
  • the feature extraction module 74 is configured to perform feature extraction on the pre-processed audio information set to obtain Mel frequency cepstral coefficients corresponding to registered audio and Mel frequency cepstral coefficients corresponding to test audio respectively;
  • the first feature acquisition module 75 is configured to pass the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtain the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • the second feature acquisition module 76 is configured to pass the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtain the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • the score obtaining module 77 is configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtain the score output by the probabilistic linear discriminant analysis model;
  • the speaker confirmation module 78 is configured to output the speaker confirmation result according to the score.
  • Further, the residual delay network is obtained by replacing the session frame level in the delay network with residual delay network blocks, and the residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of the residual network.
  • the training module 71 includes:
  • the collection unit is used to collect multiple audio information of several speakers as a training sample set
  • a preprocessing unit configured to perform preprocessing on the audio information in the training sample set
  • the feature extraction unit is configured to perform feature extraction on each of the audio information after preprocessing to obtain the corresponding Mel frequency cepstral coefficient
  • a training unit configured to pass the Mel frequency cepstral coefficient corresponding to each audio information as an input vector into a preset residual delay network for training, and obtain the recognition result output by the residual delay network;
  • the parameter modification unit is used to calculate, with a preset loss function, the error between the recognition result obtained by passing the Mel frequency cepstral coefficients of each piece of audio information through the residual delay network and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
  • the training unit is also used to input the Mel frequency cepstrum coefficient corresponding to each audio information as an input vector to the modified residual delay network to perform the next training.
  • the preprocessing unit includes:
  • the tag subunit is used to add a speaker tag to each of the audio information, classify according to the speaker tag, and obtain the audio information set of each speaker;
  • the first elimination subunit is used to eliminate audio information sets and speakers whose number of audio information is less than a first preset threshold from the training sample set;
  • the detection subunit is used to perform voice activity detection on each audio information in the remaining audio information set, and delete the non-voice part according to the voice activity detection result to obtain the voice part duration;
  • the second culling subunit is used for culling the audio information whose voice duration is less than the second preset threshold from the audio information set.
  • Further, the speaker confirmation module 78 includes:
  • the comparison unit is used to compare the score with a preset score threshold
  • the first confirmation unit is configured to output the indication information that the feature vector to be tested and the registered feature vector are from the same speaker if the score is greater than or equal to the preset score threshold;
  • the second confirmation unit is configured to output indication information indicating that the feature vector to be tested and the registered feature vector are from different speakers if the score is less than the preset score threshold.
  • each module in the above-mentioned speaker confirmation device based on the residual time delay network can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer-readable instructions are executed by the processor, a method of speaker verification based on the residual delay network is realized.
  • A computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • constructing a residual delay network, and training the residual delay network with a preset training sample set;
  • acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
  • performing preprocessing on the audio information set of the test user;
  • performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients corresponding to the registered audio and to the test audio, respectively;
  • passing the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • passing the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
  • outputting the speaker verification result according to the score.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A residual delay network-based speaker confirmation method and apparatus, a device and a medium. Said method comprises: constructing a residual delay network, and training the residual delay network by using a preset training sample set (S101); acquiring an audio information set of a test user, the audio information set comprising registered audio and test audio (S102); performing pre-processing on the audio information set of the test user (S103); performing feature extraction on the pre-processed audio information set to obtain Mel frequency cepstrum coefficients of the registered audio and the test audio, respectively (S104); transmitting the Mel frequency cepstrum coefficient of the registered audio as an input vector to the trained residual delay network, and acquiring a feature vector outputted by the residual delay network at a session slice level as a registered feature vector of the test user (S105); transmitting the Mel frequency cepstrum coefficient of the test audio as an input vector to the trained residual delay network, and acquiring a feature vector outputted by the residual delay network at a session slice level as a feature vector to be tested of the test user (S106); inputting, into a preset probability linear discriminant analysis model, the registered feature vector and the feature vector to be tested, and acquiring a score outputted by the probability linear discriminant analysis model (S107); and outputting a speaker confirmation result according to the score (S108). Said method solves the problem of the poor accuracy of the existing text-independent speaker confirmation method in terms of short audio.

Description

基于残差时延网络的说话人确认方法、装置、设备及介质Speaker confirmation method, device, equipment and medium based on residual time delay network
本申请以2019年5月9日提交的申请号为201910384582.0,名称为“基于残差时延网络的说话人确认方法、装置、设备及介质”的中国发明专利申请为基础,并要求其优先权。This application is based on the Chinese invention patent application filed on May 9, 2019 with the application number 201910384582.0, titled "Speaker verification method, device, equipment and medium based on residual time delay network", and claims its priority .
技术领域Technical field
本申请涉及信息技术领域,尤其涉及一种基于残差时延网络的说话人确认方法、装置、设备及介质。This application relates to the field of information technology, and in particular to a method, device, equipment and medium for speaker confirmation based on residual time delay network.
背景技术Background technique
声纹识别,也称为话说人识别,是生物识别技术中的一种。声纹识别主要解决两大类问题,即说话人辨认和说话人确认。说话人辨认技术是用以判断某段语音来自若干说话人中的哪一个,是“多选一问题”,而说话人确认技术是判定某段语音是不是属于指定被检测人所说的,是“一对一问题”。说话人确认广泛应用于诸多领域,在银行、非银金融、公安、军队及其他民用安全认证等行业和部门有着广泛的需求。Voiceprint recognition, also known as speaking person recognition, is a type of biometric technology. Voiceprint recognition mainly solves two major problems, namely speaker identification and speaker confirmation. Speaker recognition technology is used to determine which of several speakers a certain speech comes from, which is a "choose one question", while speaker confirmation technology is to determine whether a certain speech belongs to the designated person to be detected. "One to one question". Speaker Confirmation is widely used in many fields, and has a wide range of needs in industries and sectors such as banking, non-bank finance, public security, military and other civilian safety certification.
说话人确认依照被检测语音是否需要指定内容分为文本相关确认和文本无关确认两种方式。近年来文本无关说话人确认方法不断突破,其准确性较之以往有了极大的提升。然而在某些受限情况下,比如采集到的说话人有效语音较短的情况下,其准确性还不尽如人意。Speaker confirmation can be divided into two methods: text-related confirmation and text-independent confirmation according to whether the detected voice needs to specify the content. In recent years, there have been continuous breakthroughs in text-independent speaker verification methods, and its accuracy has been greatly improved compared with the past. However, in some limited situations, such as when the collected speaker's effective voice is relatively short, its accuracy is not satisfactory.
因此,寻找一种提高文本无关说话人确认在短音频方面的准确率的方法成为本领域技术人员亟需解决的问题。Therefore, finding a method to improve the accuracy of text-independent speaker confirmation in short audio has become an urgent problem for those skilled in the art.
发明内容Summary of the invention
本申请实施例提供了一种基于残差时延网络的说话人确认方法、装置、设备及介质,以解决现有文本无关说话人确认方法在短音频方面的准确率欠佳的问题。The embodiments of the present application provide a method, device, device, and medium for speaker verification based on a residual delay network to solve the problem of poor accuracy of the existing text-independent speaker verification method in terms of short audio.
一种基于残差时延网络的说话人确认方法,包括:A speaker confirmation method based on residual delay network, including:
构建残差时延网络,采用预设的训练样本集对所述残差时延网络进行训练;Construct a residual delay network, and use a preset training sample set to train the residual delay network;
获取测试用户的音频信息集,所述音频信息集包括注册音频和测试音频;Acquiring an audio information set of the test user, where the audio information set includes registered audio and test audio;
对所述测试用户的音频信息集执行预处理;Perform preprocessing on the audio information set of the test user;
对预处理后的所述音频信息集执行特征提取,分别得到注册音频对应的梅尔频率倒谱系数和测试音频对应的梅尔频率倒谱系数;Perform feature extraction on the preprocessed audio information set to obtain Mel frequency cepstral coefficients corresponding to registered audio and Mel frequency cepstral coefficients corresponding to test audio respectively;
将所述注册音频的梅尔频率倒谱系数作为输入向量传入训练好的所述残差时延网络,获取所述残差时延网络在会话切片级输出的特征向量,作为所述测试用户的注册特征向量;The Mel frequency cepstrum coefficient of the registered audio is passed into the trained residual delay network as an input vector, and the feature vector output by the residual delay network at the session slice level is obtained as the test user Registered feature vector;
将所述测试音频的梅尔频率倒谱系数作为输入向量传入训练好的所述残差时延网络,获取所述残差时延网络在会话切片级输出的特征向量,作为所述测试用户的待测试特征向量;The Mel frequency cepstrum coefficients of the test audio are fed into the trained residual delay network as an input vector, and the feature vector output by the residual delay network at the session slice level is obtained as the test user The feature vector to be tested;
将所述注册特征向量和待测试特征向量输入预设的概率线性判别分析模型,并获取所述概率线性判别分析模型输出的分值;Input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtain a score output by the probabilistic linear discriminant analysis model;
根据所述分值输出说话人确认结果。The speaker confirmation result is output according to the score.
进一步地,所述残差时延网络通过将残差时延网络块替换时延网络中的会话帧间级得到,所述残差时延网络块通过结合时延网络的结构与残差网络的恒等映射、残差映射得到。Further, the residual delay network is obtained by replacing the residual delay network block with the session inter-frame level in the delay network, and the residual delay network block is obtained by combining the structure of the delay network and the residual network. Identity mapping and residual mapping are obtained.
进一步地,所述采用预设的训练样本集对所述残差时延网络进行训练包括:Further, the training of the residual delay network using a preset training sample set includes:
收集若干个说话人的多个音频信息作为训练样本集;Collect multiple audio information of several speakers as a training sample set;
对所述训练样本集中的音频信息执行预处理;Perform preprocessing on the audio information in the training sample set;
对预处理后的每一所述音频信息进行特征提取,得到对应的梅尔频率倒频谱系数;Perform feature extraction on each of the audio information after preprocessing to obtain the corresponding Mel frequency cepstrum coefficient;
将每一所述音频信息对应的梅尔频率倒频谱系数作为输入向量传入预设的残差时延网络进行训练,获取所述残差时延网络输出的识别结果;Passing the Mel frequency cepstral coefficient corresponding to each audio information as an input vector to a preset residual delay network for training, and obtaining a recognition result output by the residual delay network;
采用预设的损失函数计算每一所述音频信息对应的梅尔频率倒谱系数经过所述残差时延网络的识别结果与对应的说话人标签之间的误差,并根据所述误差修改所述残差时延网络的参数;Use a preset loss function to calculate the error between the recognition result of the Mel frequency cepstrum coefficient corresponding to each of the audio information through the residual delay network and the corresponding speaker tag, and modify the error according to the error. State the parameters of the residual delay network;
将每一所述音频信息对应的梅尔频率倒频谱系数作为输入向量传入参数修改后的残差时延网络执行下一次训练。The Mel frequency cepstrum coefficient corresponding to each audio information is used as an input vector and passed into the parameter-modified residual delay network to perform the next training.
进一步地,所述对所述训练样本集中的音频信息执行预处理包括:Further, the performing preprocessing on the audio information in the training sample set includes:
对每一所述音频信息添加说话人标签,根据所述说话人标签进行分类,得到每一个说话人的音频信息集;Add a speaker tag to each of the audio information, and classify according to the speaker tag to obtain an audio information set of each speaker;
将音频信息个数小于第一预设阈值的音频信息集及说话人从所述训练样本集中剔除;Removing audio information sets and speakers whose number of audio information is less than the first preset threshold from the training sample set;
对剩余音频信息集中的每一个音频信息执行语音活动检测,并根据语音活动检测结果删除非语音部分,得到语音部分时长;Perform voice activity detection on each audio information in the remaining audio information set, and delete the non-voice part according to the voice activity detection result to obtain the voice part time;
将语音部分时长少于第二预设阈值的音频信息从所述音频信息集中剔除。The audio information whose speech duration is less than the second preset threshold is removed from the audio information set.
进一步地,所述根据所述分值输出说话人确认结果包括:Further, the outputting the speaker confirmation result according to the score includes:
比对所述分值与预设分数阈值;Comparing the score with a preset score threshold;
若所述分值大于或等于所述预设分数阈值时,输出所述待测试特征向量和注册特征向量来自同一个说话人的指示信息;If the score is greater than or equal to the preset score threshold, output the indication information that the feature vector to be tested and the registered feature vector are from the same speaker;
若所述分值小于所述预设分数阈值时,输出所述待测试特征向量和注册特征向量来自不同的说话人的指示信息。If the score is less than the preset score threshold, output the indication information that the feature vector to be tested and the registered feature vector are from different speakers.
A speaker confirmation apparatus based on a residual delay network, including:
a training module, configured to construct a residual delay network and train the residual delay network with a preset training sample set;
an acquisition module, configured to acquire an audio information set of a test user, the audio information set including registered audio and test audio;
a preprocessing module, configured to perform preprocessing on the audio information set of the test user;
a feature extraction module, configured to perform feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
a first feature acquisition module, configured to feed the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and to take the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
a second feature acquisition module, configured to feed the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and to take the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
a score acquisition module, configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model and obtain the score output by the model;
a speaker confirmation module, configured to output the speaker confirmation result according to the score.
Further, the training module includes:
a collection unit, configured to collect multiple pieces of audio information from several speakers as a training sample set;
a preprocessing unit, configured to perform preprocessing on the audio information in the training sample set;
a feature extraction unit, configured to perform feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel-frequency cepstral coefficients;
a training unit, configured to feed the Mel-frequency cepstral coefficients corresponding to each piece of audio information as input vectors into a preset residual delay network for training, and to obtain the recognition results output by the residual delay network;
a parameter modification unit, configured to use a preset loss function to compute, for each piece of audio information, the error between the recognition result produced by the residual delay network from its Mel-frequency cepstral coefficients and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
the training unit is further configured to feed the Mel-frequency cepstral coefficients corresponding to each piece of audio information as input vectors into the residual delay network with the modified parameters to perform the next round of training.
Further, the preprocessing unit includes:
a labeling subunit, configured to add a speaker label to each piece of audio information and classify the audio information according to the speaker labels to obtain an audio information set for each speaker;
a first removal subunit, configured to remove from the training sample set any speaker, together with the corresponding audio information set, whose number of audio items is less than a first preset threshold;
a detection subunit, configured to perform voice activity detection on each piece of audio information in the remaining audio information sets and delete the non-speech parts according to the detection results to obtain the duration of the speech part;
a second removal subunit, configured to remove from the audio information set any audio information whose speech duration is less than a second preset threshold.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
constructing a residual delay network, and training the residual delay network with a preset training sample set;
acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
performing preprocessing on the audio information set of the test user;
performing feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
feeding the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
feeding the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
outputting the speaker confirmation result according to the score.
One or more non-volatile readable storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
constructing a residual delay network, and training the residual delay network with a preset training sample set;
acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
performing preprocessing on the audio information set of the test user;
performing feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
feeding the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
feeding the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
outputting the speaker confirmation result according to the score.
The details of one or more embodiments of the present application are set forth in the following drawings and description; other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a speaker confirmation method based on a residual delay network in an embodiment of the present application;
Fig. 2(a) is a schematic structural diagram of a time-delay network in an embodiment of the present application, and Fig. 2(b) is a schematic structural diagram of a residual network in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a residual delay network block in an embodiment of the present application;
Fig. 4 is a flowchart of step S101 of the speaker confirmation method based on a residual delay network in an embodiment of the present application;
Fig. 5 is a flowchart of step S402 of the speaker confirmation method based on a residual delay network in an embodiment of the present application;
Fig. 6 is a flowchart of step S108 of the speaker confirmation method based on a residual delay network in an embodiment of the present application;
Fig. 7 is a schematic block diagram of a speaker confirmation apparatus based on a residual delay network in an embodiment of the present application;
Fig. 8 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The speaker confirmation method based on a residual delay network provided by the embodiments of the present application is applied to a server. The server can be implemented as an independent server or as a server cluster composed of multiple servers. In one embodiment, as shown in Fig. 1, a speaker confirmation method based on a residual delay network is provided, including the following steps.
In step S101, a residual delay network is constructed, and the residual delay network is trained with a preset training sample set.
The residual delay network (Res-TDNN) provided by the embodiments of this application combines the time-delay neural network (TDNN) and the residual network (ResNet), using the TDNN as the basic structure.
Here, the structure of the TDNN is shown in Fig. 2(a) and comprises a frame level and a segment level; the segment level includes a statistics pooling layer (Statistics-Pooling), several embedding layers, and a classification output layer (log-softmax).
The structure of the residual network ResNet is shown in Fig. 2(b). It contains two mappings, an identity mapping and a residual mapping, connected by a shortcut connection; this overcomes the problem that, as the network grows deeper, training-set accuracy drops and network performance degrades. The curved part of the figure is the identity mapping mentioned above, denoted x; the remaining part is the residual mapping, denoted F(x). The two parts together form a building block, and reusing this structure effectively deepens the network and improves its performance.
The embodiments of this application combine the characteristics of ResNet and TDNN by integrating the residual mapping of ResNet into the TDNN; the result, shown in Fig. 3, is called a residual delay network block (Res-TDNN block). In Fig. 3, the Res-TDNN block combines the traditional TDNN structure with the identity mapping and the residual mapping, and the activation function is, for example, the Parametric Rectified Linear Unit (PReLU). This structure effectively passes the residual of the preceding layer on to deeper layers, preventing the gradient from becoming too small as it propagates layer by layer to influence training, which would trap the network in a local optimum. At the same time, in combination with ResNet, the network depth can be increased while the number of nodes per layer is reduced, lowering the overall parameter count without degrading performance.
The embodiments of this application replace the frame level of the traditional TDNN with the residual delay network blocks while keeping the segment level unchanged, thereby obtaining the residual delay network, i.e., the Res-TDNN network.
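As a concrete illustration of the structure just described, the following minimal PyTorch sketch shows what one Res-TDNN block could look like, assuming a dilated 1-D convolution over frames as the TDNN layer; the channel count, context width, and dilation are illustrative assumptions, not values fixed by the application.

```python
import torch
import torch.nn as nn

class ResTDNNBlock(nn.Module):
    """One Res-TDNN block: a TDNN layer (a dilated 1-D convolution over
    frames) as the residual mapping F(x), an identity shortcut x, and a
    PReLU activation applied to F(x) + x."""

    def __init__(self, channels: int = 512, context: int = 5, dilation: int = 1):
        super().__init__()
        # Padding keeps the number of frames unchanged so x and F(x) can be added.
        padding = (context - 1) // 2 * dilation
        self.tdnn = nn.Conv1d(channels, channels, kernel_size=context,
                              dilation=dilation, padding=padding)
        self.activation = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        return self.activation(self.tdnn(x) + x)  # residual mapping plus identity
```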
The training sample set used to train the Res-TDNN network includes multiple pieces of audio information from several speakers. For ease of understanding, the training process of the Res-TDNN network is described in detail below. As shown in Fig. 4, training the residual delay network with a preset training sample set in step S101 includes the following steps.
In step S401, multiple pieces of audio information from several speakers are collected as a training sample set.
Here, the embodiments of this application can obtain audio information according to actual needs or application scenarios. For example, the audio information can be obtained from a preset audio library in which a large amount of audio information has been collected in advance. The training sample set can also be obtained by connecting to communication equipment and collecting telephone recordings. It is understandable that the training sample set can be obtained in various other ways in this embodiment, which will not be elaborated here.
In the training sample set, each speaker corresponds to one audio information set, and each audio information set includes multiple pieces of audio information.
In step S402, preprocessing is performed on the audio information in the training sample set.
Here, since the audio information in the training sample set may contain noise and little useful information, the training sample set needs to be preprocessed to improve the quality of the training samples. Optionally, as shown in Fig. 5, step S402 includes the following steps.
In step S501, a speaker label is added to each piece of audio information, and the audio information is classified according to the speaker labels to obtain an audio information set for each speaker.
In this embodiment, each speaker corresponds to one speaker label, which is the speaker's identification information and is used to distinguish different speakers. The speaker label of a given speaker is added to all of that speaker's audio information, so that each piece of audio information is marked with the speaker it belongs to.
For example, suppose there are K speakers: speaker spkr 1, speaker spkr 2, ..., speaker spkr K, with the corresponding labels label 1, label 2, ..., label K. Then label 1 is added to the audio information of speaker spkr 1, label 2 is added to the audio information of speaker spkr 2, ..., and label K is added to the audio information of speaker spkr K, where K is a positive integer.
In step S502, any speaker, together with the corresponding audio information set, whose number of audio items is less than a first preset threshold is removed from the training sample set.
Further, in order to reduce the amount of computation during training of the residual delay network and improve the training effect, for each speaker the number of audio items in the speaker's audio information set is counted and compared with a first preset threshold. Here, the first preset threshold is the criterion for deciding, based on the number of audio items, whether to remove a speaker. If the number of audio items in a speaker's audio information set is less than the first preset threshold, the speaker is excluded from the training sample set. For example, the first preset threshold may be 4: if a speaker's audio information set contains fewer than 4 items, this embodiment removes the speaker and the corresponding audio information set from the training sample set. This guarantees the number of audio items per speaker, helps reduce the computation of the residual delay network, and improves its training effect.
In step S503, voice activity detection is performed on each piece of audio information in the remaining audio information sets, and the non-speech parts are deleted according to the detection results to obtain the duration of the speech part.
Here, voice activity detection (VAD), also known as speech endpoint detection or speech boundary detection, refers to detecting which signals in the audio information are the speaker's speech and which are non-speech components such as silence and noise. This embodiment identifies and removes long non-speech parts from the audio information according to the VAD results, reducing the data volume of the training samples without degrading the audio quality.
In step S504, audio information whose speech duration is less than a second preset threshold is removed from the audio information set.
After the long non-speech parts are removed, the duration of the speech part of the audio information, i.e., the speech duration, is obtained from the VAD results and compared with a second preset threshold. Here, the second preset threshold is the criterion for deciding, based on the speech duration, whether to remove a piece of audio information. If the speech duration of an item in a speaker's audio information set is less than the second preset threshold, the item is excluded from the audio information set. Optionally, the second preset threshold may be 1 second: if the speech duration of one of a speaker's audio items is less than 1 second, the speaker may have spoken too fast or said too little, so the item is not representative, and this embodiment removes it from the speaker's audio information set. For example, for speaker spkr j with audio information set M j = {x j1, x j2, x j3, ..., x jm}, if the speech duration of item x ji measured by VAD is less than 1 second, then x ji is removed from M j, where j and m are positive integers and i = 1, 2, ..., m.
By removing audio information whose speech duration is less than the second preset threshold from the audio information set, this embodiment effectively excludes extreme cases and guarantees the length of the audio information in each speaker's audio information set, which helps improve the training effect and the generalization ability of the residual delay network. A sketch of the filtering logic follows.
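The following Python sketch illustrates steps S501-S504 under stated assumptions: the example thresholds of 4 utterances and 1 second are taken from the text, while the speech_duration helper (a VAD wrapper that strips non-speech parts and returns the remaining speech duration in seconds) is hypothetical and must be supplied by the caller.

```python
from typing import Callable, Dict, List

MIN_UTTERANCES = 4        # first preset threshold (example value from the text)
MIN_SPEECH_SECONDS = 1.0  # second preset threshold (example value from the text)

def preprocess(samples: Dict[str, List[str]],
               speech_duration: Callable[[str], float]) -> Dict[str, List[str]]:
    """samples maps a speaker label to that speaker's audio file paths;
    speech_duration is the hypothetical VAD helper described above."""
    kept: Dict[str, List[str]] = {}
    for speaker, utterances in samples.items():
        # Step S502: drop speakers with too few pieces of audio information.
        if len(utterances) < MIN_UTTERANCES:
            continue
        # Steps S503-S504: drop items whose speech part is too short.
        long_enough = [u for u in utterances
                       if speech_duration(u) >= MIN_SPEECH_SECONDS]
        if long_enough:
            kept[speaker] = long_enough
    return kept
```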
The speakers and audio information sets remaining after preprocessing through steps S501 to S504 serve as the training sample set for training the residual delay network in the embodiments of this application. The whole training process includes several rounds of training; each round involves K speakers and a total of N pieces of audio information.
In step S403, feature extraction is performed on each piece of preprocessed audio information to obtain the corresponding Mel-frequency cepstral coefficients.
The Mel-frequency cepstral coefficients (MFCC features) are a type of speech feature: cepstral parameters extracted in the Mel-scale frequency domain whose design takes into account how the human ear perceives different frequencies, making them particularly suitable for speech recognition and speaker recognition. In this embodiment, the MFCC features are the input of the residual delay network. Before training or using the residual delay network, feature extraction is first performed on each piece of audio information to obtain the corresponding MFCC features. Optionally, the feature extraction process includes, but is not limited to, framing, windowing, the discrete Fourier transform, power spectrum computation, Mel filter bank computation, logarithmic energy computation, and the discrete cosine transform. Here, this embodiment uses 23-dimensional MFCC features to further compress the amount of data the residual network has to process.
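A minimal sketch of this extraction step, assuming the librosa toolkit (the application does not name one) and common 25 ms / 10 ms framing at 16 kHz; only the 23-dimensional feature count comes from the text.

```python
import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 23) -> np.ndarray:
    """Return a (frames, 23) matrix of MFCC features for one utterance."""
    signal, sr = librosa.load(path, sr=16000)  # resample to 16 kHz (assumption)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
    return mfcc.T  # one row per frame: the network's frame-level input vectors
```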
In step S404, the Mel-frequency cepstral coefficients corresponding to each piece of audio information are fed as input vectors into the preset residual delay network for training, and the recognition results output by the residual delay network are obtained.
During training, for each piece of audio information, the corresponding MFCC features are fed as one input vector into the preset residual delay network for training, yielding the recognition result for that audio information.
As mentioned above, the residual delay network includes stacked frame-level Res-TDNN blocks, a Statistics-Pooling layer, a segment-level layer, and a log-softmax layer. The 23-dimensional MFCC features of an audio signal are first input to the Res-TDNN blocks of the residual delay network for feature extraction; the resulting feature matrix is then input to the Statistics-Pooling layer and the segment-level layer for further feature extraction. The feature vector output by the segment-level layer serves as the feature vector of the audio signal and contains its characteristic information. This feature vector is further input to the log-softmax layer for classification. The recognition result output by the log-softmax layer is a one-dimensional probability vector. If there are K speakers in this round of training, the probability vector contains K elements, one per speaker, representing the relative probabilities between the speakers: the larger an element's value, the more likely it is that the MFCC features, and hence the audio information, belong to the corresponding speaker, so the audio information can be clearly predicted to belong to the speaker with the highest probability.
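The forward pass just described could be sketched as follows, reusing the ResTDNNBlock class from the earlier sketch; the number of blocks and the layer widths are illustrative assumptions, not values fixed by the application.

```python
import torch
import torch.nn as nn

class StatisticsPooling(nn.Module):
    """Pool frame-level features (batch, channels, frames) into one
    segment-level vector by concatenating mean and standard deviation
    over the frame axis."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

class ResTDNN(nn.Module):
    """Skeleton of the network: stacked Res-TDNN blocks at the frame level,
    statistics pooling, a segment-level embedding layer, and a log-softmax
    output over the K training speakers."""
    def __init__(self, n_mfcc: int = 23, channels: int = 512,
                 embed_dim: int = 512, n_speakers: int = 1000):
        super().__init__()
        self.input_proj = nn.Conv1d(n_mfcc, channels, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(*[ResTDNNBlock(channels) for _ in range(4)])
        self.pool = StatisticsPooling()
        self.embedding = nn.Linear(2 * channels, embed_dim)  # segment level
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_mfcc, frames)
        h = self.blocks(self.input_proj(x))
        emb = self.embedding(self.pool(h))  # the feature vector used at test time
        return torch.log_softmax(self.classifier(emb), dim=1)  # K log-probabilities
```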
Steps S403 and S404 above are performed on each of the N pieces of audio information in this round of training until all N pieces have been traversed; then step S405 is executed.
In step S405, a preset loss function is used to compute, for each piece of audio information, the error between the recognition result produced by the residual delay network from its Mel-frequency cepstral coefficients and the corresponding speaker label, and the parameters of the residual delay network are modified according to the error.
In this embodiment, the loss function is computed in the loss layer of the residual delay network. Assuming that each round of training involves K speakers and N pieces of audio information in total, the loss function is:
$$E = -\sum_{n=1}^{N}\sum_{k=1}^{K} d_{nk}\,\ln P\!\left(\mathrm{spkr}_k \mid x_1^{(n)},\ldots,x_T^{(n)}\right)$$
In the above formula, $P(\mathrm{spkr}_k \mid x_1^{(n)},\ldots,x_T^{(n)})$ denotes the probability that the T tested frames come from speaker spkr k, where T is the frame length of one piece of audio information, $x^{(n)}$ is the n-th of the N audio clips, and $x_t^{(n)}$ is one frame-length signal of the n-th audio clip; $d_{nk}$ is the label function, whose value is 1 if the frames contained in the n-th of the N pieces of audio information all come from speaker k, and 0 otherwise.
The value of the frame length T above is related to the length of the audio information and is determined by the TDNN structure; experiments usually cut out fixed-length audio, e.g., 4 seconds, in which case T is 400.
After one round of training is completed and the recognition results of the N pieces of audio information are obtained, the above loss function is used to compute the error between the recognition result of each piece of audio information and the corresponding preset label, and the error is propagated back to modify the parameters of the residual delay network, including the parameters of the Res-TDNN blocks, the Statistics-Pooling layer, and the segment-level layer. Optionally, the embodiments of this application use the back-propagation algorithm to compute the gradients of the residual delay network and the stochastic gradient descent method to update its parameters, driving it to keep learning features until convergence.
In step S406, the Mel-frequency cepstral coefficients corresponding to each piece of audio information are fed as input vectors into the residual delay network with the modified parameters to perform the next round of training.
The residual delay network whose parameters were modified in step S405 is used for the next round of training. In each round, K speakers with a total of N labeled pieces of audio information are randomly selected from the preprocessed training sample set; the training procedure is the same as in steps S404 and S405 and is not repeated here. Steps S404, S405, and S406 are repeated for 50-150 iterations, so that the residual delay network can learn the key features of the audio information and achieve good model performance. The number of iterations can be adjusted according to the size of the training set and is not limited here. A sketch of this loop follows.
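A minimal training-loop sketch of steps S404-S406, assuming the ResTDNN skeleton above; nn.NLLLoss applied to the log-softmax outputs realizes the loss E given earlier (up to averaging over the batch), while the learning rate and batch format are assumptions.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, batches, epochs: int = 100, lr: float = 0.01) -> None:
    """batches yields (mfcc, speaker_ids) pairs, where mfcc has shape
    (batch, 23, T) and speaker_ids holds the integer speaker labels."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()  # with log-softmax outputs this is the loss E above
    for _ in range(epochs):   # the text suggests 50-150 iterations
        for mfcc, speaker_ids in batches:
            optimizer.zero_grad()
            loss = criterion(model(mfcc), speaker_ids)
            loss.backward()   # back-propagate the error
            optimizer.step()  # stochastic gradient descent update
```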
After training is completed, the trained residual delay network is used for testing, and step S102 is executed.
In step S102, an audio information set of a test user is acquired, the audio information set including registered audio and test audio.
Optionally, the server can obtain test users and their audio information according to actual needs or application scenarios to form the test user's audio information set. For example, the test users and their audio information can be obtained from a preset audio library in which a large number of users and their audio information have been collected in advance. Telephone recordings can also be collected as the test user's audio information by connecting to communication equipment. It is understandable that the embodiments of this application can obtain the test user's audio information set in various other ways, which will not be elaborated here.
In this embodiment, the test user's audio information set includes test audio and registered audio. The test audio is the audio information on which speaker confirmation is performed through the residual delay network, and the registered audio is the audio information from which the speaker feature library is built through the residual delay network. Optionally, there may be one or more test users, and one or more pieces of test audio or registered audio.
In step S103, preprocessing is performed on the audio information set of the test user.
Here, since the test user's audio information may contain noise and little useful information, it needs to be preprocessed to improve the speed and accuracy of recognition by the residual delay network. Optionally, step S103 includes:
removing any test user, together with the corresponding audio information set, whose number of audio items is less than the first preset threshold;
performing voice activity detection on each piece of audio information in the remaining test users' audio information sets, deleting the non-speech parts according to the detection results to obtain the duration of the speech part, and removing from the test user's audio information set any audio information whose speech duration is less than the second preset threshold.
The above steps are the same as step S402, i.e., removing test users whose number of audio items is less than the first preset threshold together with their audio information sets, and removing audio information whose speech duration is less than the second preset threshold; see the description of the above embodiment for details, which is not repeated here.
In step S104, feature extraction is performed on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively.
Optionally, step S104 is the same as step S403 above; see the description of the above embodiment for details, which is not repeated here. Here, this embodiment uses 23-dimensional MFCC features for testing.
In step S105, the Mel-frequency cepstral coefficients of the registered audio are fed as an input vector into the trained residual delay network, and the feature vector output by the residual delay network at the segment level is taken as the registered feature vector of the test user.
After the MFCC features of the registered audio are obtained, they are fed as input into the pre-trained residual delay network, which recognizes the registered audio based on the MFCC features. Here, the pre-trained residual delay network includes the Res-TDNN blocks, the Statistics-Pooling layer, the segment-level layer, and the log-softmax layer. After the residual delay network has processed the registered audio, the output vector produced by the embedding feature extraction that the segment-level layer performs on the registered audio is taken as the registered feature vector. The registered feature vector is the test user's audio feature vector in the speaker feature library, and each of its elements represents a voiceprint feature of the registered audio. Here, the speaker feature library can be set up as needed for identity authentication scenarios such as online payment, voiceprint lock control, and proof-of-life authentication, and stores the audio feature information, i.e., the registered feature vectors, of the registered users who need to be on file.
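Under the same assumed ResTDNN skeleton, extracting the segment-level feature vector (rather than the classification scores) could look like this minimal sketch:

```python
import torch

@torch.no_grad()
def extract_embedding(model: "ResTDNN", mfcc: torch.Tensor) -> torch.Tensor:
    """mfcc: (23, T) features of one utterance. Runs the frame-level blocks
    and statistics pooling, then returns the segment-level embedding-layer
    output instead of the log-softmax classification scores."""
    model.eval()
    h = model.blocks(model.input_proj(mfcc.unsqueeze(0)))  # add batch dimension
    return model.embedding(model.pool(h)).squeeze(0)       # the feature vector
```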
In step S106, the Mel-frequency cepstral coefficients of the test audio are fed as an input vector into the trained residual delay network, and the feature vector output by the residual delay network at the segment level is taken as the feature vector to be tested of the test user.
After the MFCC features of the test audio are obtained, they are fed as input into the pre-trained residual delay network, which recognizes the test audio based on the MFCC features. After the residual delay network has processed the test audio, the output vector produced by the embedding feature extraction that the segment-level layer performs on the test audio is taken as the feature vector to be tested. The feature vector to be tested is the audio feature vector with which the test user performs speaker confirmation through the residual delay network, and each of its elements represents a voiceprint feature of the test audio.
In step S107, the registered feature vector and the feature vector to be tested are input into a preset probabilistic linear discriminant analysis model, and the score output by the probabilistic linear discriminant analysis model is obtained.
When speaker confirmation is performed, the feature vector to be tested and the registered feature vector are input into a preset probabilistic linear discriminant analysis model. Here, probabilistic linear discriminant analysis (PLDA) is a channel compensation algorithm. This embodiment uses the PLDA model to compute the degree of similarity between the feature vector to be tested and the registered feature vector, yielding a score. The higher the score, the more consistent the feature vector to be tested and the registered feature vector; the lower the score, the less consistent they are.
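The application treats the PLDA model as preset, so the sketch below deliberately swaps in a simpler stand-in scorer, cosine similarity between length-normalized embeddings, purely to show the enrol-versus-test scoring interface; a real system would instead return the log-likelihood ratio of a trained PLDA model.

```python
import numpy as np

def score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Stand-in scorer: cosine similarity of length-normalized embeddings.
    Higher means the two vectors are more consistent; this is NOT PLDA,
    only a placeholder with the same inputs and output."""
    e = enroll / np.linalg.norm(enroll)
    t = test / np.linalg.norm(test)
    return float(np.dot(e, t))
```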
In step S108, the speaker confirmation result is output according to the score.
As mentioned above, the higher the score, the more consistent the feature vector to be tested and the registered feature vector, and the lower the score, the less consistent they are. This embodiment sets a score threshold, compares the score with the preset score threshold, and outputs the speaker confirmation result according to the comparison. Optionally, as shown in Fig. 6, step S108 includes the following steps.
In step S601, the score is compared with a preset score threshold.
Here, the preset score threshold is set empirically and serves as the criterion for judging whether the feature vector to be tested and the registered feature vector come from the same speaker.
In step S602, if the score is greater than or equal to the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from the same speaker is output.
As mentioned above, the higher the score, the more consistent the feature vector to be tested and the registered feature vector. When the score is greater than or equal to the preset score threshold, this embodiment determines that the feature vector to be tested and the registered feature vector come from the same speaker, and outputs an indication that the speaker confirmation result is the same speaker.
In step S603, if the score is less than the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from different speakers is output.
When the score is less than the preset score threshold, this embodiment determines that the feature vector to be tested and the registered feature vector come from different speakers, and outputs an indication that the speaker confirmation result is different speakers.
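The decision logic of steps S601-S603 then reduces to a single comparison; the threshold must be supplied by the caller, since the application sets it empirically.

```python
def confirm(score_value: float, threshold: float) -> str:
    """Steps S601-S603: compare the score with the preset score threshold."""
    if score_value >= threshold:
        return "same speaker"        # step S602
    return "different speakers"      # step S603
```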
In summary, this embodiment constructs a residual delay network, uses it to extract registered feature vectors from the preprocessed registered audio, and builds a speaker feature library. During speaker confirmation, the residual delay network extracts the feature vector to be tested from the preprocessed test audio, which is fed into the PLDA model together with the registered feature vector from the speaker feature library to compute a score; the score is compared with the preset score threshold, and finally the speaker confirmation result is output according to the comparison. Because the residual delay network replaces the frame level of the traditional time-delay network with residual delay network blocks, compared with the traditional TDNN-plus-PLDA speaker confirmation method, a smaller training set is required and the model is easier to train, effectively reducing the training cost. In addition, this method can increase the network depth while reducing the number of nodes per layer, so even a drop in the overall parameter count does not affect network performance; extracting key features through the residual delay network effectively reduces noise interference, and on speaker confirmation from short audio it achieves results significantly better than the traditional PLDA model.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a speaker confirmation apparatus based on a residual delay network is provided, corresponding one-to-one to the speaker confirmation method based on a residual delay network in the above embodiments. As shown in Fig. 7, the apparatus includes a training module, an acquisition module, a preprocessing module, a feature extraction module, a first feature acquisition module, a second feature acquisition module, a score acquisition module, and a speaker confirmation module. Each functional module is described in detail as follows:
the training module 71, configured to construct a residual delay network and train the residual delay network with a preset training sample set;
the acquisition module 72, configured to acquire an audio information set of a test user, the audio information set including registered audio and test audio;
the preprocessing module 73, configured to perform preprocessing on the audio information set of the test user;
the feature extraction module 74, configured to perform feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
the first feature acquisition module 75, configured to feed the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and to take the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
the second feature acquisition module 76, configured to feed the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and to take the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
the score acquisition module 77, configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model and obtain the score output by the model;
the speaker confirmation module 78, configured to output the speaker confirmation result according to the score.
The residual delay network is obtained by replacing the frame level of a time-delay network with residual delay network blocks, and a residual delay network block is obtained by combining the structure of the time-delay network with the identity mapping and residual mapping of the residual network.
Optionally, the training module 71 includes:
a collection unit, configured to collect multiple pieces of audio information from several speakers as a training sample set;
a preprocessing unit, configured to perform preprocessing on the audio information in the training sample set;
a feature extraction unit, configured to perform feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel-frequency cepstral coefficients;
a training unit, configured to feed the Mel-frequency cepstral coefficients corresponding to each piece of audio information as input vectors into a preset residual delay network for training, and to obtain the recognition results output by the residual delay network;
a parameter modification unit, configured to use a preset loss function to compute, for each piece of audio information, the error between the recognition result produced by the residual delay network from its Mel-frequency cepstral coefficients and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
the training unit is further configured to feed the Mel-frequency cepstral coefficients corresponding to each piece of audio information as input vectors into the residual delay network with the modified parameters to perform the next round of training.
Optionally, the preprocessing unit includes:
a labeling subunit, configured to add a speaker label to each piece of audio information and classify the audio information according to the speaker labels to obtain an audio information set for each speaker;
a first removal subunit, configured to remove from the training sample set any speaker, together with the corresponding audio information set, whose number of audio items is less than a first preset threshold;
a detection subunit, configured to perform voice activity detection on each piece of audio information in the remaining audio information sets and delete the non-speech parts according to the detection results to obtain the duration of the speech part;
a second removal subunit, configured to remove from the audio information set any audio information whose speech duration is less than a second preset threshold.
Optionally, the speaker confirmation module 78 includes:
a comparison unit, configured to compare the score with a preset score threshold;
a first confirmation unit, configured to output, if the score is greater than or equal to the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
a second confirmation unit, configured to output, if the score is less than the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from different speakers.
For specific limitations on the speaker confirmation apparatus based on a residual delay network, reference can be made to the above limitations on the speaker confirmation method based on a residual delay network, which are not repeated here. Each module of the above apparatus can be implemented in whole or in part by software, hardware, or a combination thereof. The above modules can be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a speaker confirmation method based on a residual delay network.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
constructing a residual delay network, and training the residual delay network with a preset training sample set;
acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
performing preprocessing on the audio information set of the test user;
performing feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
feeding the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
feeding the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
outputting the speaker confirmation result according to the score.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is used only as an example. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the scope of protection of this application.

Claims (20)

  1. A speaker confirmation method based on a residual delay network, comprising:
    constructing a residual delay network, and training the residual delay network with a preset training sample set;
    acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
    preprocessing the audio information set of the test user;
    performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients of the registered audio and the Mel frequency cepstral coefficients of the test audio, respectively;
    feeding the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
    feeding the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
    inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the probabilistic linear discriminant analysis model;
    outputting a speaker confirmation result according to the score.
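As a reading aid for the scoring step of claim 1, the following is one simplified sketch of probabilistic linear discriminant analysis scoring using the common two-covariance Gaussian formulation. The between-class and within-class covariances `sigma_b` and `sigma_w` are assumed to have been estimated on training embeddings; the claim itself does not fix any particular PLDA variant.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, sigma_b, sigma_w):
    """Two-covariance PLDA log-likelihood ratio for mean-centered embeddings.

    Compares the hypothesis that x1 and x2 share one latent speaker factor
    (cross-covariance sigma_b) against the independent-speaker hypothesis.
    """
    d = len(x1)
    sigma_tot = sigma_b + sigma_w
    z = np.concatenate([x1, x2])
    # Joint covariance when both vectors come from the same speaker
    same = np.block([[sigma_tot, sigma_b], [sigma_b, sigma_tot]])
    # Joint covariance when the vectors come from different speakers
    diff = np.block([[sigma_tot, np.zeros((d, d))], [np.zeros((d, d)), sigma_tot]])
    return (multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=same)
            - multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=diff))
```

A positive log-likelihood ratio favors the same-speaker hypothesis, which is why the decision in the later claims reduces to comparing this score against a preset threshold.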
  2. The speaker confirmation method based on a residual delay network according to claim 1, wherein the residual delay network is obtained by replacing the session inter-frame levels of a delay network with residual delay network blocks, and each residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
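Claim 2 describes the building block only structurally. A plausible PyTorch sketch is given below, under the assumption that the time-delay layers are realized as dilated 1-D convolutions and that the block adds a residual network's identity mapping to the layers' residual mapping; the channel count, context size, and dilation are illustrative values, not ones stated in the claim.

```python
import torch
import torch.nn as nn

class ResidualTDNNBlock(nn.Module):
    """Time-delay (dilated Conv1d) layers wrapped with an identity skip connection."""

    def __init__(self, channels=512, context=3, dilation=2):
        super().__init__()
        pad = (context - 1) // 2 * dilation  # keep the frame count unchanged
        self.tdnn = nn.Sequential(
            nn.Conv1d(channels, channels, context, dilation=dilation, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, context, dilation=dilation, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):                # x: (batch, channels, frames)
        residual = self.tdnn(x)          # residual mapping F(x)
        return self.relu(x + residual)   # identity mapping x plus F(x)
```

Stacking blocks of this shape in place of the frame-level layers of a plain delay network yields one possible realization of the residual delay network the claim refers to.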
  3. The speaker confirmation method based on a residual delay network according to claim 1 or 2, wherein training the residual delay network with the preset training sample set comprises:
    collecting multiple pieces of audio information from several speakers as the training sample set;
    preprocessing the audio information in the training sample set;
    performing feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtaining the recognition result output by the residual delay network;
    using a preset loss function to calculate, for each piece of audio information, the error between the recognition result produced by the residual delay network for its Mel frequency cepstral coefficients and the corresponding speaker label, and modifying the parameters of the residual delay network according to the error;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
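A condensed PyTorch sketch of this training loop (claim 3) might look as follows. Cross-entropy over speaker identities is used as a stand-in for the unspecified preset loss function, and `model` and `train_loader` (yielding MFCC batches with integer speaker labels) are assumed to exist rather than taken from the disclosure.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    """Feed MFCC inputs, score against speaker labels, and update the network."""
    criterion = nn.CrossEntropyLoss()  # stand-in for the preset loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for mfcc, speaker_label in train_loader:
            logits = model(mfcc)                     # recognition result
            loss = criterion(logits, speaker_label)  # error vs. the speaker label
            optimizer.zero_grad()
            loss.backward()                          # gradients of the error
            optimizer.step()                         # modify the network parameters
```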
  4. The speaker confirmation method based on a residual delay network according to claim 3, wherein preprocessing the audio information in the training sample set comprises:
    adding a speaker label to each piece of audio information, and classifying by speaker label to obtain an audio information set for each speaker;
    removing from the training sample set any audio information set, together with its speaker, whose number of pieces of audio information is less than a first preset threshold;
    performing voice activity detection on each piece of audio information in the remaining audio information sets, deleting the non-speech portions according to the detection result, and obtaining the duration of the speech portion;
    removing from the audio information set any audio information whose speech portion is shorter than a second preset threshold.
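The preprocessing of claim 4 reduces to two filters over the training data. The sketch below uses a crude energy-based voice activity detector as a placeholder for whatever VAD the implementation actually employs, and the two threshold values are illustrative assumptions, not figures from the claim.

```python
import numpy as np

MIN_UTTERANCES = 8        # assumed first preset threshold
MIN_SPEECH_SECONDS = 2.0  # assumed second preset threshold

def speech_duration(samples, sr, frame_len=400, hop=160, energy_thresh=1e-4):
    """Crude energy-based VAD: total duration (s) of frames above the threshold."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len, hop)]
    voiced = sum(np.mean(f ** 2) > energy_thresh for f in frames)
    return voiced * hop / sr

def filter_training_set(speaker_to_utterances, sr):
    """Drop sparse speakers, then drop utterances with too little speech."""
    kept = {}
    for speaker, utts in speaker_to_utterances.items():
        if len(utts) < MIN_UTTERANCES:
            continue  # remove speakers with too few audio files
        utts = [u for u in utts if speech_duration(u, sr) >= MIN_SPEECH_SECONDS]
        kept[speaker] = utts
    return kept
```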
  5. The speaker confirmation method based on a residual delay network according to claim 1 or 2, wherein outputting the speaker confirmation result according to the score comprises:
    comparing the score with a preset score threshold;
    if the score is greater than or equal to the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
    if the score is less than the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from different speakers.
  6. A speaker confirmation apparatus based on a residual delay network, comprising:
    a training module, configured to construct a residual delay network and train the residual delay network with a preset training sample set;
    an acquisition module, configured to acquire an audio information set of a test user, the audio information set including registered audio and test audio;
    a preprocessing module, configured to preprocess the audio information set of the test user;
    a feature extraction module, configured to perform feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients of the registered audio and of the test audio, respectively;
    a first feature acquisition module, configured to feed the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and take the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
    a second feature acquisition module, configured to feed the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and take the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
    a score acquisition module, configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtain the score output by the probabilistic linear discriminant analysis model;
    a speaker confirmation module, configured to output a speaker confirmation result according to the score.
  7. The speaker confirmation apparatus based on a residual delay network according to claim 6, wherein the residual delay network is obtained by replacing the session inter-frame levels of a delay network with residual delay network blocks, and each residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
  8. The speaker confirmation apparatus based on a residual delay network according to claim 6 or 7, wherein the training module comprises:
    a collection unit, configured to collect multiple pieces of audio information from several speakers as the training sample set;
    a preprocessing unit, configured to preprocess the audio information in the training sample set;
    a feature extraction unit, configured to perform feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
    a training unit, configured to feed the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtain the recognition result output by the residual delay network;
    a parameter modification unit, configured to use a preset loss function to calculate, for each piece of audio information, the error between the recognition result produced by the residual delay network for its Mel frequency cepstral coefficients and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
    the training unit being further configured to feed the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
  9. The speaker confirmation apparatus based on a residual delay network according to claim 8, wherein the preprocessing unit comprises:
    a labeling subunit, configured to add a speaker label to each piece of audio information and classify by speaker label to obtain an audio information set for each speaker;
    a first removal subunit, configured to remove from the training sample set any audio information set, together with its speaker, whose number of pieces of audio information is less than a first preset threshold;
    a detection subunit, configured to perform voice activity detection on each piece of audio information in the remaining audio information sets, delete the non-speech portions according to the detection result, and obtain the duration of the speech portion;
    a second removal subunit, configured to remove from the audio information set any audio information whose speech portion is shorter than a second preset threshold.
  10. The speaker confirmation apparatus based on a residual delay network according to claim 6 or 7, wherein the speaker confirmation module comprises:
    a comparison unit, configured to compare the score with a preset score threshold;
    a first confirmation unit, configured to output, if the score is greater than or equal to the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
    a second confirmation unit, configured to output, if the score is less than the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from different speakers.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    constructing a residual delay network, and training the residual delay network with a preset training sample set;
    acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
    preprocessing the audio information set of the test user;
    performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients of the registered audio and the Mel frequency cepstral coefficients of the test audio, respectively;
    feeding the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
    feeding the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
    inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the probabilistic linear discriminant analysis model;
    outputting a speaker confirmation result according to the score.
  12. The computer device according to claim 11, wherein the residual delay network is obtained by replacing the session inter-frame levels of a delay network with residual delay network blocks, and each residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
  13. The computer device according to claim 11 or 12, wherein training the residual delay network with the preset training sample set comprises:
    collecting multiple pieces of audio information from several speakers as the training sample set;
    preprocessing the audio information in the training sample set;
    performing feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtaining the recognition result output by the residual delay network;
    using a preset loss function to calculate, for each piece of audio information, the error between the recognition result produced by the residual delay network for its Mel frequency cepstral coefficients and the corresponding speaker label, and modifying the parameters of the residual delay network according to the error;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
  14. The computer device according to claim 13, wherein preprocessing the audio information in the training sample set comprises:
    adding a speaker label to each piece of audio information, and classifying by speaker label to obtain an audio information set for each speaker;
    removing from the training sample set any audio information set, together with its speaker, whose number of pieces of audio information is less than a first preset threshold;
    performing voice activity detection on each piece of audio information in the remaining audio information sets, deleting the non-speech portions according to the detection result, and obtaining the duration of the speech portion;
    removing from the audio information set any audio information whose speech portion is shorter than a second preset threshold.
  15. The computer device according to claim 11 or 12, wherein outputting the speaker confirmation result according to the score comprises:
    comparing the score with a preset score threshold;
    if the score is greater than or equal to the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
    if the score is less than the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from different speakers.
  16. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    constructing a residual delay network, and training the residual delay network with a preset training sample set;
    acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
    preprocessing the audio information set of the test user;
    performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients of the registered audio and the Mel frequency cepstral coefficients of the test audio, respectively;
    feeding the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
    feeding the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
    inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the probabilistic linear discriminant analysis model;
    outputting a speaker confirmation result according to the score.
  17. The computer-readable storage medium according to claim 16, wherein the residual delay network is obtained by replacing the session inter-frame levels of a delay network with residual delay network blocks, and each residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
  18. The computer-readable storage medium according to claim 16 or 17, wherein training the residual delay network with the preset training sample set comprises:
    collecting multiple pieces of audio information from several speakers as the training sample set;
    preprocessing the audio information in the training sample set;
    performing feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtaining the recognition result output by the residual delay network;
    using a preset loss function to calculate, for each piece of audio information, the error between the recognition result produced by the residual delay network for its Mel frequency cepstral coefficients and the corresponding speaker label, and modifying the parameters of the residual delay network according to the error;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
  19. The computer-readable storage medium according to claim 18, wherein preprocessing the audio information in the training sample set comprises:
    adding a speaker label to each piece of audio information, and classifying by speaker label to obtain an audio information set for each speaker;
    removing from the training sample set any audio information set, together with its speaker, whose number of pieces of audio information is less than a first preset threshold;
    performing voice activity detection on each piece of audio information in the remaining audio information sets, deleting the non-speech portions according to the detection result, and obtaining the duration of the speech portion;
    removing from the audio information set any audio information whose speech portion is shorter than a second preset threshold.
  20. The computer-readable storage medium according to claim 16 or 17, wherein outputting the speaker confirmation result according to the score comprises:
    comparing the score with a preset score threshold;
    if the score is greater than or equal to the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
    if the score is less than the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from different speakers.
PCT/CN2019/103155 2019-05-09 2019-08-29 Residual delay network-based speaker confirmation method and apparatus, device and medium WO2020224114A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910384582.0A CN110232932B (en) 2019-05-09 2019-05-09 Speaker confirmation method, device, equipment and medium based on residual delay network
CN201910384582.0 2019-05-09

Publications (1)

Publication Number Publication Date
WO2020224114A1

Family

ID=67860506

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103155 WO2020224114A1 (en) 2019-05-09 2019-08-29 Residual delay network-based speaker confirmation method and apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN110232932B (en)
WO (1) WO2020224114A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081278A (en) * 2019-12-18 2020-04-28 公安部第三研究所 Method and system for testing conversation quality of talkback terminal
CN111133507B (en) * 2019-12-23 2023-05-23 深圳市优必选科技股份有限公司 Speech synthesis method, device, intelligent terminal and readable medium
CN111916074A (en) * 2020-06-29 2020-11-10 厦门快商通科技股份有限公司 Cross-device voice control method, system, terminal and storage medium
CN111885275B (en) * 2020-07-23 2021-11-26 海尔优家智能科技(北京)有限公司 Echo cancellation method and device for voice signal, storage medium and electronic device
CN112992157A (en) * 2021-02-08 2021-06-18 贵州师范大学 Neural network noisy line identification method based on residual error and batch normalization
CN112992155B (en) * 2021-03-02 2022-10-14 复旦大学 Far-field voice speaker recognition method and device based on residual error neural network
CN113178196B (en) * 2021-04-20 2023-02-07 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113724731B (en) * 2021-08-30 2024-01-05 中国科学院声学研究所 Method and device for carrying out audio discrimination by utilizing audio discrimination model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005034395A2 (en) * 2003-09-17 2005-04-14 Nielsen Media Research, Inc. Methods and apparatus to operate an audience metering device with voice commands
CN101226743A (en) * 2007-12-05 2008-07-23 浙江大学 Method for recognizing speaker based on conversion of neutral and affection sound-groove model
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
CN108281146B (en) * 2017-12-29 2020-11-13 歌尔科技有限公司 Short voice speaker identification method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034472A (en) * 2009-09-28 2011-04-27 戴红霞 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
CN106683680A (en) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 Speaker recognition method and device and computer equipment and computer readable media
US20180350351A1 (en) * 2017-05-31 2018-12-06 Intel Corporation Feature extraction using neural network accelerator
CN108109613A (en) * 2017-12-12 2018-06-01 苏州思必驰信息科技有限公司 For the audio training of Intelligent dialogue voice platform and recognition methods and electronic equipment
CN108694949A (en) * 2018-03-27 2018-10-23 佛山市顺德区中山大学研究院 Method for distinguishing speek person and its device based on reorder super vector and residual error network
CN109166586A (en) * 2018-08-02 2019-01-08 平安科技(深圳)有限公司 A kind of method and terminal identifying speaker

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735470A (en) * 2020-12-28 2021-04-30 携程旅游网络技术(上海)有限公司 Audio cutting method, system, device and medium based on time delay neural network
CN112735470B (en) * 2020-12-28 2024-01-23 携程旅游网络技术(上海)有限公司 Audio cutting method, system, equipment and medium based on time delay neural network
CN112613468A (en) * 2020-12-31 2021-04-06 平安国际智慧城市科技股份有限公司 Epidemic situation investigation method based on artificial intelligence and related equipment
CN112613468B (en) * 2020-12-31 2024-04-05 深圳平安智慧医健科技有限公司 Epidemic situation investigation method based on artificial intelligence and related equipment

Also Published As

Publication number Publication date
CN110232932A (en) 2019-09-13
CN110232932B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
WO2020224114A1 (en) Residual delay network-based speaker confirmation method and apparatus, device and medium
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
WO2021164147A1 (en) Artificial intelligence-based service evaluation method and apparatus, device and storage medium
Reynolds An overview of automatic speaker recognition technology
US9502038B2 (en) Method and device for voiceprint recognition
CN109473105A (en) The voice print verification method, apparatus unrelated with text and computer equipment
WO2014114116A1 (en) Method and system for voiceprint recognition
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN108564956B (en) Voiceprint recognition method and device, server and storage medium
US20190325880A1 (en) System for text-dependent speaker recognition and method thereof
Chakroun et al. Robust text-independent speaker recognition with short utterances using Gaussian mixture models
Revathi et al. Text independent speaker recognition and speaker independent speech recognition using iterative clustering approach
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Zewoudie et al. The Use of Audio Fingerprints for Authentication of Speakers on Speech Operated Interfaces
CN111063359B (en) Telephone return visit validity judging method, device, computer equipment and medium
Nirjon et al. sMFCC: exploiting sparseness in speech for fast acoustic feature extraction on mobile devices--a feasibility study
Singh et al. Combining evidences from Hilbert envelope and residual phase for detecting replay attacks
Akinrinmade et al. Creation of a Nigerian voice corpus for indigenous speaker recognition
Pickersgill et al. Investigation of DNN prediction of power spectral envelopes for speech coding & ASR
Hossan et al. Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization
Balpande et al. Speaker recognition based on mel-frequency cepstral coefficients and vector quantization
Sailaja et al. Text Independent Speaker Identification Using Finite Doubly Truncated Gaussian Mixture Model
Wei Adaptive Speaker Recognition Based on Hidden Markov Model Parameter Optimization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927645

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927645

Country of ref document: EP

Kind code of ref document: A1