WO2020224114A1 - Residual delay network-based speaker confirmation method and apparatus, device and medium - Google Patents

Residual delay network-based speaker confirmation method and apparatus, device and medium

Info

Publication number
WO2020224114A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio information
delay network
residual delay
audio
feature vector
Prior art date
Application number
PCT/CN2019/103155
Other languages
French (fr)
Chinese (zh)
Inventor
彭俊清
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020224114A1 publication Critical patent/WO2020224114A1/en

Classifications

    • G10L17/02: Speaker identification or verification; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/04: Speaker identification or verification; training, enrolment or model building
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, the extracted parameters being spectral information of each sub-band
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, the extracted parameters being the cepstrum
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • This application relates to the field of information technology, and in particular to a method, device, equipment and medium for speaker confirmation based on residual time delay network.
  • Voiceprint recognition, also known as speaker recognition, is a type of biometric technology. It mainly addresses two problems: speaker identification and speaker verification. Speaker identification determines which of several speakers a given utterance comes from (a one-of-many problem), while speaker verification determines whether a given utterance was spoken by a designated person under test (a one-to-one problem). Speaker verification is widely used in many fields and is in broad demand in industries and sectors such as banking, non-bank finance, public security, the military, and other civilian security authentication.
  • Depending on whether the detected speech must have specified content, speaker verification is divided into two approaches: text-dependent verification and text-independent verification.
  • In recent years there have been continuous breakthroughs in text-independent speaker verification methods, and their accuracy has improved greatly compared with the past. However, in some constrained situations, for example when the collected effective speech of the speaker is short, their accuracy is still unsatisfactory.
  • The embodiments of the present application provide a method, apparatus, device, and medium for speaker verification based on a residual delay network, to solve the problem that existing text-independent speaker verification methods have poor accuracy on short audio.
  • A speaker verification method based on a residual delay network, including:
  • constructing a residual delay network, and training the residual delay network with a preset training sample set;
  • acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
  • performing preprocessing on the audio information set of the test user;
  • performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients corresponding to the registered audio and to the test audio, respectively;
  • passing the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • passing the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
  • outputting the speaker verification result according to the score.
  • Further, the residual delay network is obtained by replacing the session frame level in a delay network with residual delay network blocks, and the residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
  • Further, training the residual delay network with a preset training sample set includes:
  • collecting multiple pieces of audio information from several speakers as a training sample set;
  • performing preprocessing on the audio information in the training sample set;
  • performing feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
  • passing the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtaining the recognition result output by the residual delay network;
  • using a preset loss function to calculate the error between the recognition result obtained by passing the Mel frequency cepstral coefficients of each piece of audio information through the residual delay network and the corresponding speaker label, and modifying the parameters of the residual delay network according to the error;
  • passing the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
  • Further, performing preprocessing on the audio information in the training sample set includes:
  • adding a speaker label to each piece of audio information, and classifying by speaker label to obtain the audio information set of each speaker;
  • removing from the training sample set the speakers, and their audio information sets, whose number of pieces of audio information is less than a first preset threshold;
  • performing voice activity detection on each piece of audio information in the remaining audio information sets, and deleting the non-speech parts according to the voice activity detection results to obtain the speech-part duration;
  • removing from the audio information set the audio information whose speech-part duration is less than a second preset threshold.
  • Further, outputting the speaker verification result according to the score includes:
  • comparing the score with a preset score threshold;
  • if the score is greater than or equal to the preset score threshold, outputting indication information that the feature vector to be tested and the registered feature vector come from the same speaker;
  • if the score is less than the preset score threshold, outputting indication information that the feature vector to be tested and the registered feature vector come from different speakers.
  • a speaker confirmation device based on residual time delay network including:
  • the training module is used to construct a residual delay network, and use a preset training sample set to train the residual delay network;
  • An acquiring module configured to acquire an audio information set of a test user, the audio information set includes registered audio and test audio;
  • a preprocessing module for performing preprocessing on the audio information set of the test user
  • the feature extraction module is configured to perform feature extraction on the pre-processed audio information set to obtain Mel frequency cepstral coefficients corresponding to the registered audio and Mel frequency cepstral coefficients corresponding to the test audio respectively;
  • the first feature acquisition module is configured to pass the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtain the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • the second feature acquisition module is configured to pass the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtain the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • the score obtaining module is configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtain the score output by the probabilistic linear discriminant analysis model;
  • the speaker confirmation module is used to output the speaker confirmation result according to the score.
  • the training module includes:
  • the collection unit is used to collect multiple audio information of several speakers as a training sample set
  • a preprocessing unit configured to perform preprocessing on the audio information in the training sample set
  • the feature extraction unit is configured to perform feature extraction on each of the audio information after preprocessing to obtain the corresponding Mel frequency cepstral coefficient
  • a training unit configured to pass the Mel frequency cepstral coefficient corresponding to each audio information as an input vector into a preset residual delay network for training, and obtain the recognition result output by the residual delay network;
  • the parameter modification unit is used to calculate, with a preset loss function, the error between the recognition result obtained by passing the Mel frequency cepstral coefficients of each piece of audio information through the residual delay network and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
  • the training unit is also used to input the Mel frequency cepstrum coefficient corresponding to each audio information as an input vector to the modified residual delay network to perform the next training.
  • the preprocessing unit includes:
  • the tag subunit is used to add a speaker tag to each of the audio information, classify according to the speaker tag, and obtain the audio information set of each speaker;
  • the first elimination subunit is used to eliminate audio information sets and speakers whose number of audio information is less than a first preset threshold from the training sample set;
  • the detection subunit is used to perform voice activity detection on each audio information in the remaining audio information set, and delete the non-voice part according to the voice activity detection result to obtain the voice part duration;
  • the second culling subunit is used for culling the audio information whose voice duration is less than the second preset threshold from the audio information set.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • constructing a residual delay network, and training the residual delay network with a preset training sample set;
  • acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
  • performing preprocessing on the audio information set of the test user;
  • performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients corresponding to the registered audio and to the test audio, respectively;
  • passing the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • passing the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
  • outputting the speaker verification result according to the score.
  • One or more non-volatile readable storage media storing computer-readable instructions are provided; when executed, the computer-readable instructions perform the following steps:
  • constructing a residual delay network, and training the residual delay network with a preset training sample set;
  • acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
  • performing preprocessing on the audio information set of the test user;
  • performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients corresponding to the registered audio and to the test audio, respectively;
  • passing the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • passing the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
  • outputting the speaker verification result according to the score.
  • FIG. 1 is a flowchart of a speaker verification method based on a residual delay network in an embodiment of the present application
  • Figure 2(a) is a schematic structural diagram of a delay network in an embodiment of the present application
  • Figure 2(b) is a schematic structural diagram of a residual network in an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of a residual delay network block in an embodiment of the present application.
  • FIG. 4 is a flowchart of step S101 in the speaker verification method based on the residual time delay network in an embodiment of the present application
  • FIG. 5 is a flowchart of step S402 in the speaker verification method based on the residual delay network in an embodiment of the present application
  • FIG. 6 is a flowchart of step S108 in the speaker verification method based on the residual time delay network in an embodiment of the present application
  • FIG. 7 is a schematic block diagram of a speaker confirmation device based on a residual time delay network in an embodiment of the present application.
  • Fig. 8 is a schematic diagram of a computer device in an embodiment of the present application.
  • the speaker confirmation method based on the residual delay network provided by the embodiment of the present application is applied to a server.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for speaker confirmation based on a residual delay network is provided, which includes the following steps:
  • step S101 a residual delay network is constructed, and a preset training sample set is used to train the residual delay network.
  • The residual delay network (Res-TDNN) provided by the embodiments of this application combines the time-delay neural network (TDNN) and the residual network (ResNet), taking the time-delay neural network as its basic structure.
  • As shown in Figure 2(a), the time-delay neural network TDNN includes a session frame level and a session segment level; the session segment level includes a statistics pooling layer (Statistics-Pooling), several embedding layers, and a classification output layer (log-softmax).
  • The structure of the residual network ResNet is shown in Figure 2(b). It includes two mappings, namely identity mapping and residual mapping, and connects the two mapping structures by a shortcut connection, which overcomes the problems of decreasing training-set accuracy and degrading network performance as the network deepens.
  • In the figure, the curved part is the identity mapping, represented by x; the remaining part is the residual mapping, represented by F(x).
  • The embodiment of this application combines the characteristics of the ResNet and TDNN networks and integrates the residual mapping of ResNet into the TDNN network, obtaining a residual delay network block (Res-TDNN block).
  • As shown in Figure 3, the residual delay network block combines the traditional TDNN network structure with identity mapping and residual mapping, and the activation function adopts, for example, a parameterized rectified linear unit (Parametric Rectified Linear Unit, PReLU).
  • This structure effectively transfers the residual of the previous layer to deeper layers, preventing the gradient from becoming too small as it propagates layer by layer, which would hinder training and trap the network in a local optimum. At the same time, as in the ResNet network, the depth of the network can be increased while the number of nodes in each layer is reduced, lowering the overall number of network parameters without reducing network performance.
  • In the embodiment of this application, the residual delay network block replaces the session frame level in the traditional TDNN network while the session slice level is kept unchanged, yielding the residual delay network, namely the Res-TDNN network. A minimal sketch of such a block follows.
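  • To make the block structure concrete, the following is a minimal sketch of one Res-TDNN block, assuming a PyTorch implementation; the channel count, context width, and dilation are illustrative assumptions, not values specified by this application.

```python
import torch
import torch.nn as nn

class ResTDNNBlock(nn.Module):
    """One residual delay network block: a TDNN layer (a 1-D convolution
    over time) as the residual mapping F(x), plus an identity shortcut,
    with a PReLU activation as suggested above. Sizes are illustrative."""
    def __init__(self, channels: int, context: int = 5, dilation: int = 1):
        super().__init__()
        # A TDNN layer is a time-dilated 1-D convolution; the padding keeps
        # the number of frames unchanged so the shortcut addition lines up.
        self.tdnn = nn.Conv1d(channels, channels, kernel_size=context,
                              dilation=dilation,
                              padding=(context // 2) * dilation)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.PReLU()  # parameterized ReLU (PReLU)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames). x is the identity mapping;
        # self.bn(self.tdnn(x)) is the residual mapping F(x).
        return self.act(x + self.bn(self.tdnn(x)))
```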
  • The training sample set used for training the Res-TDNN network includes multiple pieces of audio information from several speakers. To facilitate understanding, the training process of the Res-TDNN network is described in detail below. As shown in FIG. 4, the training of the residual delay network with the preset training sample set in step S101 includes:
  • step S401 multiple audio information of several speakers are collected as a training sample set.
  • the embodiments of the present application may obtain audio information according to actual needs or application scenarios.
  • the audio information is obtained from a preset audio library, and a large amount of audio information is collected in advance in the preset audio library.
  • the training sample set can also be obtained by connecting to a communication device to collect telephone recordings. It is understandable that in this embodiment, the training sample set can also be obtained in a variety of ways, which will not be repeated here.
  • each speaker corresponds to an audio information set, and the audio information set includes multiple audio information.
  • step S402 preprocessing is performed on the audio information in the training sample set.
  • the step S402 includes:
  • step S501 a speaker tag is added to each of the audio information, and classification is performed according to the speaker tag to obtain an audio information set of each speaker.
  • each speaker corresponds to a speaker tag
  • the speaker tag is the identification information of the speaker, which is used to distinguish different speakers. Add a speaker tag corresponding to the speaker to the audio information of the same speaker to mark the speaker to which each audio information belongs.
  • Suppose there are K speakers, namely speaker spkr 1, speaker spkr 2, ..., speaker spkr K, with corresponding labels label 1, label 2, ..., label K.
  • the audio information of speaker spkr 1 is added with label 1
  • the audio information of speaker spkr 2 is added with label 2
  • the audio information of speaker spkr K is added with label K.
  • K is a positive integer.
  • step S502 the audio information set and the speaker whose number of audio information is less than the first preset threshold are removed from the training sample set.
  • Specifically, for each speaker, the number of pieces of audio information in the corresponding audio information set is counted, and this number is compared with the first preset threshold.
  • The first preset threshold is the criterion for deciding, based on the number of pieces of audio information, whether a speaker is eliminated. If the number of pieces of audio information in a speaker's audio information set is less than the first preset threshold, the speaker is excluded from the training sample set.
  • For example, the first preset threshold may be 4.
  • When a speaker has too few pieces of audio information, this embodiment removes the speaker and the corresponding audio information set from the training sample set, thereby ensuring the number of pieces of audio information for each speaker, which helps reduce the computation of the residual delay network and improves its training effect.
  • step S503 the voice activity detection is performed on each audio information in the remaining audio information set, and the non-voice part is deleted according to the voice activity detection result to obtain the voice part duration.
  • Voice activity detection (Voice Activity Detection, VAD), also called voice endpoint detection or voice boundary detection, refers to detecting which signals in the audio information are the speaker's voice components and which are non-voice components, such as silence and noise.
  • According to the result of the voice activity detection, long non-speech parts are identified and removed from the audio information, reducing the data volume of the training samples without reducing audio quality.
  • step S504 the audio information whose voice duration is less than the second preset threshold is excluded from the audio information set.
  • the duration of the voice part in the audio information is further obtained according to the result of the voice activity detection, and the voice duration is compared with the second preset threshold.
  • The second preset threshold is the criterion for deciding, based on speech duration, whether a piece of audio information is eliminated. If the speech duration of a piece of audio information in a speaker's audio information set is less than the second preset threshold, that piece is excluded from the set.
  • For example, the second preset threshold may be 1 second. If the speech duration of a piece of the speaker's audio is less than 1 second, the speaker may have spoken too fast or the content may be too short, so the sample is not representative.
  • In this case, the piece of audio information is removed from the speaker's audio information set.
  • Removing audio information whose speech-part duration is less than the second preset threshold effectively eliminates extreme cases and guarantees the length of the audio information in each speaker's set, which helps improve the training effect and generalization ability of the residual delay network.
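  • The application does not fix a particular VAD algorithm, so the following is a simplified energy-based sketch of steps S503 and S504 in Python; the frame length, hop, and energy floor are assumptions for illustration only.

```python
import numpy as np

def speech_duration(signal: np.ndarray, sr: int, frame_ms: int = 25,
                    hop_ms: int = 10, energy_floor: float = 1e-4) -> float:
    """Crude energy-based VAD: return the seconds of voiced audio left
    after dropping low-energy (silence/noise) frames. `signal` is assumed
    to be a mono waveform normalized to [-1, 1]."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = [float(np.mean(signal[i:i + frame] ** 2))
                for i in range(0, len(signal) - frame + 1, hop)]
    voiced = sum(e > energy_floor for e in energies)
    return voiced * hop / sr

# Step S504: keep only utterances with at least 1 second of speech
# (the example second preset threshold given above).
# audio_set = [a for a in audio_set if speech_duration(a, sr=16000) >= 1.0]
```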
  • the speaker and its audio information set remaining after preprocessing through the above steps S501 to S504 are used as the training sample set for training the residual delay network in the embodiment of the present application.
  • the whole training process includes several trainings, each training includes K speakers, and a total of N pieces of audio information.
  • step S403 feature extraction is performed on each of the pre-processed audio information to obtain the corresponding Mel frequency cepstrum coefficient.
  • Mel-scale frequency cepstral coefficients (MFCC) are a speech feature: cepstral parameters extracted in the Mel-scale frequency domain. The parameters take into account the human ear's differing sensitivity to different frequencies, making them especially suitable for speech recognition and speaker recognition.
  • the MFCC feature is used as the input of the residual delay network.
  • the process of feature extraction includes, but is not limited to, framing processing, windowing processing, discrete Fourier transform, power spectrum calculation, Mel filter bank calculation, logarithmic energy calculation, and discrete cosine transform.
  • This embodiment uses 23-dimensional MFCC features to further compress the amount of data computed by the residual delay network.
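  • As an illustration of step S403, the sketch below extracts 23-dimensional MFCC features with librosa, which wraps the pipeline listed above (framing, windowing, Fourier transform, Mel filter bank, log energy, discrete cosine transform); the use of librosa is an assumption, not part of the application.

```python
import librosa

def extract_mfcc(path: str, n_mfcc: int = 23):
    """Load an utterance and return its MFCC matrix, one frame per row."""
    y, sr = librosa.load(path, sr=None)  # keep the native sample rate
    # librosa returns shape (n_mfcc, frames); transpose so each row is the
    # 23-dimensional input vector for one frame of the residual delay network.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
```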
  • step S404 the Mel frequency cepstral coefficient corresponding to each audio information is input as an input vector to a preset residual delay network for training, and the recognition result output by the residual delay network is obtained.
  • the corresponding MFCC feature is used as an input vector and passed into the preset residual delay network for training, and the recognition result of the audio information is obtained.
  • The residual delay network includes stacked frame-level Res-TDNN blocks, a Statistics-Pooling layer, a segment-level layer, and a log-softmax layer.
  • During training, the 23-dimensional MFCC features of an audio signal are first input to the Res-TDNN blocks of the residual delay network for feature extraction; the resulting feature matrix is then passed through the Statistics-Pooling layer and the segment-level layer; the feature vector output by the segment-level layer is taken as the feature vector of the audio signal and carries its feature information.
  • The feature vector of the audio signal is then input to the log-softmax layer for classification.
  • The recognition result output by the log-softmax layer is a one-dimensional probability vector; if this training round involves K speakers, the probability vector includes K elements.
  • Each element corresponds to one speaker and represents the relative probability among the different speakers.
  • The audio information is attributed to the speaker corresponding to the element with the highest probability.
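  • The following sketch assembles the pipeline just described: stacked frame-level Res-TDNN blocks (reusing the ResTDNNBlock sketch above), a statistics-pooling layer that concatenates the per-channel mean and standard deviation over frames, segment-level embedding layers, and a log-softmax classifier. All layer widths and the speaker count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResTDNN(nn.Module):
    def __init__(self, feat_dim: int = 23, channels: int = 512,
                 embed_dim: int = 512, num_speakers: int = 1000):
        super().__init__()
        # Frame level: a front-end TDNN layer followed by Res-TDNN blocks
        # (ResTDNNBlock is the block sketched earlier).
        self.frame_level = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2),
            ResTDNNBlock(channels),
            ResTDNNBlock(channels, dilation=2),
            ResTDNNBlock(channels, dilation=3),
        )
        self.segment1 = nn.Linear(2 * channels, embed_dim)  # embedding layer
        self.segment2 = nn.Linear(embed_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, x: torch.Tensor):
        # x: (batch, feat_dim, frames)
        h = self.frame_level(x)
        # Statistics pooling: mean and std over the time axis turn a
        # variable-length frame sequence into one fixed-size vector.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        embed = self.segment1(stats)          # segment-level feature vector
        logits = self.classifier(torch.relu(self.segment2(embed)))
        return embed, torch.log_softmax(logits, dim=1)
```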
  • Steps S403 and S404 are performed on each of the N pieces of audio information in this training round until all N pieces have been traversed.
  • Then step S405 is executed.
  • step S405, a preset loss function is used to calculate the error between the recognition result obtained by passing the Mel frequency cepstral coefficients of each piece of audio information through the residual delay network and the corresponding speaker label, and the parameters of the residual delay network are modified according to the error.
  • The loss function is calculated in the loss layer of the residual delay network. Assuming there are K speakers and N pieces of audio information in each training round, the loss function takes the multi-class cross-entropy form E = -Σ_{n=1}^{N} Σ_{k=1}^{K} d_{nk} ln P(spkr_k | x_1^{(n)}, ..., x_T^{(n)}), where P(spkr_k | x_1^{(n)}, ..., x_T^{(n)}) is the probability output by the network that the T input frames of the nth audio belong to speaker k.
  • T represents the frame length of a piece of audio information;
  • x^{(n)} represents the nth audio among the N audios;
  • d_{nk} represents the label function: if the frames contained in the nth piece of audio information all come from speaker k, the value of d_{nk} is 1; otherwise it is 0.
  • The value of the frame length T is related to the length of the audio information and is determined by the TDNN network structure. Experiments usually intercept fixed-length audio; for example, with 4-second segments, T is 400.
  • In each training round, the above loss function is used to obtain the error between the recognition result of each piece of audio information and the corresponding preset label, and the error is propagated back to modify the parameters of the residual delay network, including the parameters of the Res-TDNN blocks, the Statistics-Pooling layer, and the segment-level layer.
  • the embodiment of the present application uses a back propagation algorithm to calculate the gradient of the residual delay network, and uses a stochastic gradient descent method to update the parameters of the residual delay network, so as to encourage it to continuously learn features until convergence.
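  • A minimal training-step sketch of steps S404 to S406 follows, pairing the network's log-softmax output with a negative log-likelihood loss (the cross-entropy form given above) and updating parameters by stochastic gradient descent through backpropagation; the learning rate and speaker count are assumptions.

```python
import torch

K = 1000                                       # illustrative speaker count
model = ResTDNN(num_speakers=K)                # the network sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.NLLLoss()                   # pairs with log-softmax output

def training_step(mfcc_batch: torch.Tensor, speaker_labels: torch.Tensor):
    """mfcc_batch: (batch, 23, frames); speaker_labels: (batch,) in [0, K)."""
    _, log_probs = model(mfcc_batch)
    loss = loss_fn(log_probs, speaker_labels)  # error vs. the speaker labels
    optimizer.zero_grad()
    loss.backward()                            # backpropagation (step S405)
    optimizer.step()                           # parameter modification
    return loss.item()                         # then repeat (step S406)
```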
  • step S406 the Mel frequency cepstrum coefficient corresponding to each audio information is used as an input vector and passed into the residual delay network after parameter modification to perform the next training.
  • the residual time delay network after the parameters are modified in step S405 is used for the next training.
  • K speakers are randomly selected from the preprocessed training sample set, and N audio information with preset labels are selected for training.
  • The training process is the same as that of steps S404 and S405 and is described above, so it is not repeated here. Steps S404, S405, and S406 are repeated for 50 to 150 iterations, so that the residual delay network learns the key features of the audio information and achieves better model performance.
  • The number of training iterations can be adjusted according to the size of the training set and is not limited here.
  • step S102 is executed.
  • step S102 an audio information set of the test user is obtained, where the audio information set includes registered audio and test audio.
  • the server may obtain the test user and its audio information according to actual needs or application scenarios, and obtain the test user's audio information set.
  • the test user and its audio information are acquired from a preset audio library, and a large number of users and their audio information are collected in advance in the preset audio library. You can also collect telephone recordings as the audio information of the test user by connecting to a communication device. It is understandable that the embodiment of the present application may also obtain the audio information set of the test user in various ways, which will not be repeated here.
  • The audio information of the test user includes test audio and registration audio.
  • The test audio is the audio on which speaker verification is performed through the residual delay network, and the registration audio is the audio from which the speaker feature database is built through the residual delay network.
  • the acquired test users may include one or more; the acquired test audio/registered audio may include one or more.
  • step S103 preprocessing is performed on the audio information set of the test user.
  • the step S103 includes:
  • Similar to the preprocessing of step S402, test users whose number of pieces of audio information is less than the first preset threshold are eliminated together with their audio information sets, and audio information whose speech-part duration is less than the second preset threshold is eliminated.
  • The details are not repeated here.
  • step S104 feature extraction is performed on the preprocessed audio information set, and Mel frequency cepstral coefficients corresponding to registered audio and Mel frequency cepstral coefficients corresponding to test audio are obtained respectively.
  • step S104 is the same as the above step S403.
  • this embodiment uses 23-dimensional MFCC features for testing.
  • step S105, the Mel frequency cepstral coefficients of the registered audio are passed into the trained residual delay network as an input vector, and the feature vector output by the residual delay network at the session slice level is obtained as the registered feature vector of the test user.
  • Specifically, the MFCC features of the registered audio are passed as input to the pre-trained residual delay network, and recognition of the registered audio is performed based on the MFCC features through the network.
  • The pre-trained residual delay network includes Res-TDNN blocks, a Statistics-Pooling layer, a segment-level layer, and a log-softmax layer.
  • After the residual delay network completes recognition of the registered audio, the vector output after embedding the registered audio at the segment-level layer is obtained as the registered feature vector of the registered audio.
  • The registered feature vector is the audio feature vector of the test user in the speaker feature database; each of its elements represents a voiceprint feature of the registered audio.
  • The speaker feature database can be set up according to identity-authentication application scenarios, such as online payment, voiceprint lock control, and proof-of-life authentication, to store the audio feature information of registered users that needs to be filed, namely the above registered feature vectors.
  • step S106, the Mel frequency cepstral coefficients of the test audio are passed into the trained residual delay network as an input vector, and the feature vector output by the residual delay network at the session slice level is obtained as the feature vector to be tested of the test user.
  • Specifically, the MFCC features of the test audio are obtained and passed as input to the pre-trained residual delay network, and recognition of the test audio is performed based on the MFCC features. After the residual delay network completes recognition of the test audio, the vector output after embedding the test audio at the segment-level layer is obtained as the feature vector to be tested of the test audio.
  • The feature vector to be tested is the audio feature vector with which the test user performs speaker verification through the residual delay network; each of its elements represents a voiceprint feature of the test audio.
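  • As a sketch of steps S105 and S106, the function below runs enrollment or test MFCCs through the trained network and keeps only the segment-level embedding, discarding the log-softmax classification output; the function and file names are illustrative.

```python
import torch

@torch.no_grad()
def extract_embedding(model: torch.nn.Module, mfcc) -> torch.Tensor:
    """mfcc: (frames, 23) array from extract_mfcc above; returns (embed_dim,)."""
    model.eval()
    x = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)  # (1, 23, T)
    embed, _ = model(x)              # keep the segment-level output only
    return embed.squeeze(0)

# enrolled = extract_embedding(model, extract_mfcc("registered.wav"))
# probe    = extract_embedding(model, extract_mfcc("test.wav"))
```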
  • step S107 the registered feature vector and the feature vector to be tested are input into a preset probabilistic linear discriminant analysis model, and the score output by the probabilistic linear discriminant analysis model is obtained.
  • Specifically, the feature vector to be tested and the registered feature vector are input into a preset probabilistic linear discriminant analysis (Probabilistic Linear Discriminant Analysis, PLDA) model.
  • The PLDA model calculates the similarity between the feature vector to be tested and the registered feature vector to obtain a score. The higher the score, the higher the consistency between the two feature vectors; the lower the score, the lower the consistency.
  • step S108 the speaker confirmation result is output according to the score.
  • the step S108 includes:
  • step S601 the score is compared with a preset score threshold.
  • the preset score threshold is set based on experience as a criterion for judging whether the feature vector to be tested and the registered feature vector come from the same speaker.
  • step S602, if the score is greater than or equal to the preset score threshold, the indication information that the feature vector to be tested and the registered feature vector come from the same speaker is output.
  • this embodiment determines that the feature vector to be tested and the registered feature vector are from the same speaker, and outputs indication information indicating that the speaker confirmation result is the same speaker.
  • step S603 if the score is less than the preset score threshold, output the indication information that the feature vector to be tested and the registered feature vector are from different speakers.
  • this embodiment determines that the feature vector to be tested and the registered feature vector are from different speakers, and outputs information indicating that the speaker confirmation result is a different speaker.
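  • The decision logic of steps S107 and S108 is sketched below. The application scores with a PLDA model; since the PLDA parameters are not given here, cosine similarity is used as a simplified stand-in that preserves the logic: a higher score means higher consistency, and the score is compared against a preset threshold. The threshold value is an assumption.

```python
import torch
import torch.nn.functional as F

def confirm_speaker(enrolled: torch.Tensor, probe: torch.Tensor,
                    threshold: float = 0.6) -> bool:
    """Return True if the two embeddings are judged to be the same speaker."""
    score = F.cosine_similarity(enrolled, probe, dim=0)
    # Score >= threshold: same speaker; otherwise different speakers.
    return bool(score.item() >= threshold)
```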
  • In summary, this embodiment constructs a residual delay network and extracts the registered feature vector from the preprocessed registered audio through that network to establish a speaker feature database. When performing speaker verification, the feature vector to be tested is extracted from the preprocessed test audio through the residual delay network and passed, together with the registered feature vector from the speaker feature database, into the PLDA model to calculate a score; the score is compared with a preset score threshold, and the speaker verification result is output according to the comparison. Because the residual delay network replaces the session frame level of a traditional delay network with residual delay network blocks, it requires a smaller training set than traditional TDNN-plus-PLDA speaker verification methods and the model is easier to train, which effectively reduces the training cost.
  • In addition, this method can increase the depth of the network while reducing the number of nodes in each layer, so that even though the overall number of network parameters decreases, network performance is not affected.
  • The residual delay network extracts key features, which effectively reduces noise interference; in short-audio speaker verification it achieves significantly better results than the traditional PLDA model.
  • a speaker verification device based on a residual delay network is provided.
  • the speaker verification device based on the residual delay network is the same as the speaker verification method based on the residual delay network in the foregoing embodiment.
  • The speaker verification device based on the residual time delay network includes a training module, an acquisition module, a preprocessing module, a feature extraction module, a first feature acquisition module, a second feature acquisition module, a score acquisition module, and a speaker confirmation module.
  • each functional module is as follows:
  • the training module 71 is configured to construct a residual delay network, and use a preset training sample set to train the residual delay network;
  • the obtaining module 72 is configured to obtain an audio information set of a test user, where the audio information set includes registered audio and test audio;
  • the preprocessing module 73 is configured to perform preprocessing on the audio information set of the test user
  • the feature extraction module 74 is configured to perform feature extraction on the pre-processed audio information set to obtain Mel frequency cepstral coefficients corresponding to registered audio and Mel frequency cepstral coefficients corresponding to test audio respectively;
  • the first feature acquisition module 75 is configured to pass the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtain the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • the second feature acquisition module 76 is configured to pass the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtain the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • the score obtaining module 77 is configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtain the score output by the probabilistic linear discriminant analysis model;
  • the speaker confirmation module 78 is configured to output the speaker confirmation result according to the score.
  • Further, the residual delay network is obtained by replacing the session frame level in the delay network with residual delay network blocks, and the residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of the residual network.
  • the training module 71 includes:
  • the collection unit is used to collect multiple audio information of several speakers as a training sample set
  • a preprocessing unit configured to perform preprocessing on the audio information in the training sample set
  • the feature extraction unit is configured to perform feature extraction on each of the audio information after preprocessing to obtain the corresponding Mel frequency cepstral coefficient
  • a training unit configured to pass the Mel frequency cepstral coefficient corresponding to each audio information as an input vector into a preset residual delay network for training, and obtain the recognition result output by the residual delay network;
  • the parameter modification unit is used to calculate, with a preset loss function, the error between the recognition result obtained by passing the Mel frequency cepstral coefficients of each piece of audio information through the residual delay network and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
  • the training unit is also used to input the Mel frequency cepstrum coefficient corresponding to each audio information as an input vector to the modified residual delay network to perform the next training.
  • the preprocessing unit includes:
  • the tag subunit is used to add a speaker tag to each of the audio information, classify according to the speaker tag, and obtain the audio information set of each speaker;
  • the first elimination subunit is used to eliminate audio information sets and speakers whose number of audio information is less than a first preset threshold from the training sample set;
  • the detection subunit is used to perform voice activity detection on each audio information in the remaining audio information set, and delete the non-voice part according to the voice activity detection result to obtain the voice part duration;
  • the second culling subunit is used for culling the audio information whose voice duration is less than the second preset threshold from the audio information set.
  • Further, the speaker confirmation module 78 includes:
  • the comparison unit is used to compare the score with a preset score threshold
  • the first confirmation unit is configured to output the indication information that the feature vector to be tested and the registered feature vector are from the same speaker if the score is greater than or equal to the preset score threshold;
  • the second confirmation unit is configured to output indication information indicating that the feature vector to be tested and the registered feature vector are from different speakers if the score is less than the preset score threshold.
  • each module in the above-mentioned speaker confirmation device based on the residual time delay network can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer-readable instructions are executed by the processor, a method of speaker verification based on the residual delay network is realized.
  • A computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • constructing a residual delay network, and training the residual delay network with a preset training sample set;
  • acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
  • performing preprocessing on the audio information set of the test user;
  • performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients corresponding to the registered audio and to the test audio, respectively;
  • passing the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
  • passing the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and obtaining the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
  • inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
  • outputting the speaker verification result according to the score.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A residual delay network-based speaker confirmation method and apparatus, a device and a medium. Said method comprises: constructing a residual delay network, and training the residual delay network by using a preset training sample set (S101); acquiring an audio information set of a test user, the audio information set comprising registered audio and test audio (S102); performing pre-processing on the audio information set of the test user (S103); performing feature extraction on the pre-processed audio information set to obtain Mel frequency cepstrum coefficients of the registered audio and the test audio, respectively (S104); transmitting the Mel frequency cepstrum coefficient of the registered audio as an input vector to the trained residual delay network, and acquiring a feature vector outputted by the residual delay network at a session slice level as a registered feature vector of the test user (S105); transmitting the Mel frequency cepstrum coefficient of the test audio as an input vector to the trained residual delay network, and acquiring a feature vector outputted by the residual delay network at a session slice level as a feature vector to be tested of the test user (S106); inputting, into a preset probability linear discriminant analysis model, the registered feature vector and the feature vector to be tested, and acquiring a score outputted by the probability linear discriminant analysis model (S107); and outputting a speaker confirmation result according to the score (S108). Said method solves the problem of the poor accuracy of the existing text-independent speaker confirmation method in terms of short audio.

Description

基于残差时延网络的说话人确认方法、装置、设备及介质Speaker confirmation method, device, equipment and medium based on residual time delay network
本申请以2019年5月9日提交的申请号为201910384582.0,名称为“基于残差时延网络的说话人确认方法、装置、设备及介质”的中国发明专利申请为基础,并要求其优先权。This application is based on the Chinese invention patent application filed on May 9, 2019 with the application number 201910384582.0, titled "Speaker verification method, device, equipment and medium based on residual time delay network", and claims its priority .
技术领域Technical field
本申请涉及信息技术领域,尤其涉及一种基于残差时延网络的说话人确认方法、装置、设备及介质。This application relates to the field of information technology, and in particular to a method, device, equipment and medium for speaker confirmation based on residual time delay network.
背景技术Background technique
声纹识别,也称为话说人识别,是生物识别技术中的一种。声纹识别主要解决两大类问题,即说话人辨认和说话人确认。说话人辨认技术是用以判断某段语音来自若干说话人中的哪一个,是“多选一问题”,而说话人确认技术是判定某段语音是不是属于指定被检测人所说的,是“一对一问题”。说话人确认广泛应用于诸多领域,在银行、非银金融、公安、军队及其他民用安全认证等行业和部门有着广泛的需求。Voiceprint recognition, also known as speaking person recognition, is a type of biometric technology. Voiceprint recognition mainly solves two major problems, namely speaker identification and speaker confirmation. Speaker recognition technology is used to determine which of several speakers a certain speech comes from, which is a "choose one question", while speaker confirmation technology is to determine whether a certain speech belongs to the designated person to be detected. "One to one question". Speaker Confirmation is widely used in many fields, and has a wide range of needs in industries and sectors such as banking, non-bank finance, public security, military and other civilian safety certification.
说话人确认依照被检测语音是否需要指定内容分为文本相关确认和文本无关确认两种方式。近年来文本无关说话人确认方法不断突破,其准确性较之以往有了极大的提升。然而在某些受限情况下,比如采集到的说话人有效语音较短的情况下,其准确性还不尽如人意。Speaker confirmation can be divided into two methods: text-related confirmation and text-independent confirmation according to whether the detected voice needs to specify the content. In recent years, there have been continuous breakthroughs in text-independent speaker verification methods, and its accuracy has been greatly improved compared with the past. However, in some limited situations, such as when the collected speaker's effective voice is relatively short, its accuracy is not satisfactory.
因此,寻找一种提高文本无关说话人确认在短音频方面的准确率的方法成为本领域技术人员亟需解决的问题。Therefore, finding a method to improve the accuracy of text-independent speaker confirmation in short audio has become an urgent problem for those skilled in the art.
发明内容Summary of the invention
本申请实施例提供了一种基于残差时延网络的说话人确认方法、装置、设备及介质,以解决现有文本无关说话人确认方法在短音频方面的准确率欠佳的问题。The embodiments of the present application provide a method, device, device, and medium for speaker verification based on a residual delay network to solve the problem of poor accuracy of the existing text-independent speaker verification method in terms of short audio.
一种基于残差时延网络的说话人确认方法,包括:A speaker confirmation method based on residual delay network, including:
构建残差时延网络,采用预设的训练样本集对所述残差时延网络进行训练;Construct a residual delay network, and use a preset training sample set to train the residual delay network;
获取测试用户的音频信息集,所述音频信息集包括注册音频和测试音频;Acquiring an audio information set of the test user, where the audio information set includes registered audio and test audio;
对所述测试用户的音频信息集执行预处理;Perform preprocessing on the audio information set of the test user;
对预处理后的所述音频信息集执行特征提取,分别得到注册音频对应的梅尔频率倒谱系数和测试音频对应的梅尔频率倒谱系数;Perform feature extraction on the preprocessed audio information set to obtain Mel frequency cepstral coefficients corresponding to registered audio and Mel frequency cepstral coefficients corresponding to test audio respectively;
将所述注册音频的梅尔频率倒谱系数作为输入向量传入训练好的所述残差时延网络,获取所述残差时延网络在会话切片级输出的特征向量,作为所述测试用户的注册特征向量;The Mel frequency cepstrum coefficient of the registered audio is passed into the trained residual delay network as an input vector, and the feature vector output by the residual delay network at the session slice level is obtained as the test user Registered feature vector;
将所述测试音频的梅尔频率倒谱系数作为输入向量传入训练好的所述残差时延网络,获取所述残差时延网络在会话切片级输出的特征向量,作为所述测试用户的待测试特征向量;The Mel frequency cepstrum coefficients of the test audio are fed into the trained residual delay network as an input vector, and the feature vector output by the residual delay network at the session slice level is obtained as the test user The feature vector to be tested;
将所述注册特征向量和待测试特征向量输入预设的概率线性判别分析模型,并获取所述概率线性判别分析模型输出的分值;Input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtain a score output by the probabilistic linear discriminant analysis model;
根据所述分值输出说话人确认结果。The speaker confirmation result is output according to the score.
进一步地,所述残差时延网络通过将残差时延网络块替换时延网络中的会话帧间级得到,所述残差时延网络块通过结合时延网络的结构与残差网络的恒等映射、残差映射得到。Further, the residual delay network is obtained by replacing the residual delay network block with the session inter-frame level in the delay network, and the residual delay network block is obtained by combining the structure of the delay network and the residual network. Identity mapping and residual mapping are obtained.
进一步地,所述采用预设的训练样本集对所述残差时延网络进行训练包括:Further, the training of the residual delay network using a preset training sample set includes:
收集若干个说话人的多个音频信息作为训练样本集;Collect multiple audio information of several speakers as a training sample set;
对所述训练样本集中的音频信息执行预处理;Perform preprocessing on the audio information in the training sample set;
对预处理后的每一所述音频信息进行特征提取,得到对应的梅尔频率倒频谱系数;Perform feature extraction on each of the audio information after preprocessing to obtain the corresponding Mel frequency cepstrum coefficient;
将每一所述音频信息对应的梅尔频率倒频谱系数作为输入向量传入预设的残差时延网络进行训练,获取所述残差时延网络输出的识别结果;Passing the Mel frequency cepstral coefficient corresponding to each audio information as an input vector to a preset residual delay network for training, and obtaining a recognition result output by the residual delay network;
采用预设的损失函数计算每一所述音频信息对应的梅尔频率倒谱系数经过所述残差时延网络的识别结果与对应的说话人标签之间的误差,并根据所述误差修改所述残差时延网络的参数;Use a preset loss function to calculate the error between the recognition result of the Mel frequency cepstrum coefficient corresponding to each of the audio information through the residual delay network and the corresponding speaker tag, and modify the error according to the error. State the parameters of the residual delay network;
将每一所述音频信息对应的梅尔频率倒频谱系数作为输入向量传入参数修改后的残差时延网络执行下一次训练。The Mel frequency cepstrum coefficient corresponding to each audio information is used as an input vector and passed into the parameter-modified residual delay network to perform the next training.
进一步地,所述对所述训练样本集中的音频信息执行预处理包括:Further, the performing preprocessing on the audio information in the training sample set includes:
对每一所述音频信息添加说话人标签,根据所述说话人标签进行分类,得到每一个说话人的音频信息集;Add a speaker tag to each of the audio information, and classify according to the speaker tag to obtain an audio information set of each speaker;
将音频信息个数小于第一预设阈值的音频信息集及说话人从所述训练样本集中剔除;Removing audio information sets and speakers whose number of audio information is less than the first preset threshold from the training sample set;
对剩余音频信息集中的每一个音频信息执行语音活动检测,并根据语音活动检测结果删除非语音部分,得到语音部分时长;Perform voice activity detection on each audio information in the remaining audio information set, and delete the non-voice part according to the voice activity detection result to obtain the voice part time;
将语音部分时长少于第二预设阈值的音频信息从所述音频信息集中剔除。The audio information whose speech duration is less than the second preset threshold is removed from the audio information set.
进一步地,所述根据所述分值输出说话人确认结果包括:Further, the outputting the speaker confirmation result according to the score includes:
比对所述分值与预设分数阈值;Comparing the score with a preset score threshold;
若所述分值大于或等于所述预设分数阈值时,输出所述待测试特征向量和注册特征向量来自同一个说话人的指示信息;If the score is greater than or equal to the preset score threshold, output the indication information that the feature vector to be tested and the registered feature vector are from the same speaker;
若所述分值小于所述预设分数阈值时,输出所述待测试特征向量和注册特征向量来自不同的说话人的指示信息。If the score is less than the preset score threshold, output the indication information that the feature vector to be tested and the registered feature vector are from different speakers.
A speaker confirmation apparatus based on a residual delay network, including:
a training module, configured to construct a residual delay network and train the residual delay network with a preset training sample set;
an acquisition module, configured to acquire an audio information set of a test user, the audio information set including registered audio and test audio;
a preprocessing module, configured to perform preprocessing on the audio information set of the test user;
a feature extraction module, configured to perform feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
a first feature acquisition module, configured to feed the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and to take the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
a second feature acquisition module, configured to feed the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and to take the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
a score acquisition module, configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model and obtain the score output by the model;
a speaker confirmation module, configured to output the speaker confirmation result according to the score.
Further, the training module includes:
a collection unit, configured to collect multiple pieces of audio information from several speakers as a training sample set;
a preprocessing unit, configured to perform preprocessing on the audio information in the training sample set;
a feature extraction unit, configured to perform feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel-frequency cepstral coefficients;
a training unit, configured to feed the Mel-frequency cepstral coefficients corresponding to each piece of audio information as input vectors into a preset residual delay network for training, and to obtain the recognition results output by the residual delay network;
a parameter modification unit, configured to use a preset loss function to compute, for each piece of audio information, the error between the recognition result produced by the residual delay network from its Mel-frequency cepstral coefficients and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
the training unit is further configured to feed the Mel-frequency cepstral coefficients corresponding to each piece of audio information as input vectors into the residual delay network with the modified parameters to perform the next round of training.
Further, the preprocessing unit includes:
a labeling subunit, configured to add a speaker label to each piece of audio information and classify the audio information according to the speaker labels to obtain an audio information set for each speaker;
a first removal subunit, configured to remove from the training sample set any speaker, together with the corresponding audio information set, whose number of audio items is less than a first preset threshold;
a detection subunit, configured to perform voice activity detection on each piece of audio information in the remaining audio information sets and delete the non-speech parts according to the detection results to obtain the duration of the speech part;
a second removal subunit, configured to remove from the audio information set any audio information whose speech duration is less than a second preset threshold.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
constructing a residual delay network, and training the residual delay network with a preset training sample set;
acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
performing preprocessing on the audio information set of the test user;
performing feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
feeding the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
feeding the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
outputting the speaker confirmation result according to the score.
One or more non-volatile readable storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
constructing a residual delay network, and training the residual delay network with a preset training sample set;
acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
performing preprocessing on the audio information set of the test user;
performing feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
feeding the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
feeding the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
outputting the speaker confirmation result according to the score.
The details of one or more embodiments of the present application are set forth in the following drawings and description; other features and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a speaker confirmation method based on a residual delay network in an embodiment of the present application;
Fig. 2(a) is a schematic structural diagram of a time-delay network in an embodiment of the present application, and Fig. 2(b) is a schematic structural diagram of a residual network in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a residual delay network block in an embodiment of the present application;
Fig. 4 is a flowchart of step S101 of the speaker confirmation method based on a residual delay network in an embodiment of the present application;
Fig. 5 is a flowchart of step S402 of the speaker confirmation method based on a residual delay network in an embodiment of the present application;
Fig. 6 is a flowchart of step S108 of the speaker confirmation method based on a residual delay network in an embodiment of the present application;
Fig. 7 is a schematic block diagram of a speaker confirmation apparatus based on a residual delay network in an embodiment of the present application;
Fig. 8 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The speaker confirmation method based on a residual delay network provided by the embodiments of the present application is applied to a server. The server can be implemented as an independent server or as a server cluster composed of multiple servers. In one embodiment, as shown in Fig. 1, a speaker confirmation method based on a residual delay network is provided, including the following steps.
In step S101, a residual delay network is constructed, and the residual delay network is trained with a preset training sample set.
The residual delay network (Res-TDNN) provided by the embodiments of this application combines the time-delay neural network (TDNN) and the residual network (ResNet), using the TDNN as the basic structure.
Here, the structure of the TDNN is shown in Fig. 2(a) and comprises a frame level and a segment level; the segment level includes a statistics pooling layer (Statistics-Pooling), several embedding layers, and a classification output layer (log-softmax).
The structure of the residual network ResNet is shown in Fig. 2(b). It contains two mappings, an identity mapping and a residual mapping, connected by a shortcut connection; this overcomes the problem that, as the network grows deeper, training-set accuracy drops and network performance degrades. The curved part of the figure is the identity mapping mentioned above, denoted x; the remaining part is the residual mapping, denoted F(x). The two parts together form a building block, and reusing this structure effectively deepens the network and improves its performance.
The embodiments of this application combine the characteristics of ResNet and TDNN by integrating the residual mapping of ResNet into the TDNN; the result, shown in Fig. 3, is called a residual delay network block (Res-TDNN block). In Fig. 3, the Res-TDNN block combines the traditional TDNN structure with the identity mapping and the residual mapping, and the activation function is, for example, the Parametric Rectified Linear Unit (PReLU). This structure effectively passes the residual of the preceding layer on to deeper layers, preventing the gradient from becoming too small as it propagates layer by layer to influence training, which would trap the network in a local optimum. At the same time, in combination with ResNet, the network depth can be increased while the number of nodes per layer is reduced, lowering the overall parameter count without degrading performance.
The embodiments of this application replace the frame level of the traditional TDNN with the residual delay network blocks while keeping the segment level unchanged, thereby obtaining the residual delay network, i.e., the Res-TDNN network.
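As a concrete illustration of the structure just described, the following minimal PyTorch sketch shows what one Res-TDNN block could look like, assuming a dilated 1-D convolution over frames as the TDNN layer; the channel count, context width, and dilation are illustrative assumptions, not values fixed by the application.

```python
import torch
import torch.nn as nn

class ResTDNNBlock(nn.Module):
    """One Res-TDNN block: a TDNN layer (a dilated 1-D convolution over
    frames) as the residual mapping F(x), an identity shortcut x, and a
    PReLU activation applied to F(x) + x."""

    def __init__(self, channels: int = 512, context: int = 5, dilation: int = 1):
        super().__init__()
        # Padding keeps the number of frames unchanged so x and F(x) can be added.
        padding = (context - 1) // 2 * dilation
        self.tdnn = nn.Conv1d(channels, channels, kernel_size=context,
                              dilation=dilation, padding=padding)
        self.activation = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        return self.activation(self.tdnn(x) + x)  # residual mapping plus identity
```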
The training sample set used to train the Res-TDNN network includes multiple pieces of audio information from several speakers. For ease of understanding, the training process of the Res-TDNN network is described in detail below. As shown in Fig. 4, training the residual delay network with a preset training sample set in step S101 includes the following steps.
In step S401, multiple pieces of audio information from several speakers are collected as a training sample set.
Here, the embodiments of this application can obtain audio information according to actual needs or application scenarios. For example, the audio information can be obtained from a preset audio library in which a large amount of audio information has been collected in advance. The training sample set can also be obtained by connecting to communication equipment and collecting telephone recordings. It is understandable that the training sample set can be obtained in various other ways in this embodiment, which will not be elaborated here.
In the training sample set, each speaker corresponds to one audio information set, and each audio information set includes multiple pieces of audio information.
In step S402, preprocessing is performed on the audio information in the training sample set.
Here, since the audio information in the training sample set may contain noise and little useful information, the training sample set needs to be preprocessed to improve the quality of the training samples. Optionally, as shown in Fig. 5, step S402 includes the following steps.
In step S501, a speaker label is added to each piece of audio information, and the audio information is classified according to the speaker labels to obtain an audio information set for each speaker.
In this embodiment, each speaker corresponds to one speaker label, which is the speaker's identification information and is used to distinguish different speakers. The speaker label of a given speaker is added to all of that speaker's audio information, so that each piece of audio information is marked with the speaker it belongs to.
For example, suppose there are K speakers: speaker spkr 1, speaker spkr 2, ..., speaker spkr K, with the corresponding labels label 1, label 2, ..., label K. Then label 1 is added to the audio information of speaker spkr 1, label 2 is added to the audio information of speaker spkr 2, ..., and label K is added to the audio information of speaker spkr K, where K is a positive integer.
In step S502, any speaker, together with the corresponding audio information set, whose number of audio items is less than a first preset threshold is removed from the training sample set.
Further, in order to reduce the amount of computation during training of the residual delay network and improve the training effect, for each speaker the number of audio items in the speaker's audio information set is counted and compared with a first preset threshold. Here, the first preset threshold is the criterion for deciding, based on the number of audio items, whether to remove a speaker. If the number of audio items in a speaker's audio information set is less than the first preset threshold, the speaker is excluded from the training sample set. For example, the first preset threshold may be 4: if a speaker's audio information set contains fewer than 4 items, this embodiment removes the speaker and the corresponding audio information set from the training sample set. This guarantees the number of audio items per speaker, helps reduce the computation of the residual delay network, and improves its training effect.
In step S503, voice activity detection is performed on each piece of audio information in the remaining audio information sets, and the non-speech parts are deleted according to the detection results to obtain the duration of the speech part.
Here, voice activity detection (VAD), also known as speech endpoint detection or speech boundary detection, refers to detecting which signals in the audio information are the speaker's speech and which are non-speech components such as silence and noise. This embodiment identifies and removes long non-speech parts from the audio information according to the VAD results, reducing the data volume of the training samples without degrading the audio quality.
In step S504, audio information whose speech duration is less than a second preset threshold is removed from the audio information set.
After the long non-speech parts are removed, the duration of the speech part of the audio information, i.e., the speech duration, is obtained from the VAD results and compared with a second preset threshold. Here, the second preset threshold is the criterion for deciding, based on the speech duration, whether to remove a piece of audio information. If the speech duration of an item in a speaker's audio information set is less than the second preset threshold, the item is excluded from the audio information set. Optionally, the second preset threshold may be 1 second: if the speech duration of one of a speaker's audio items is less than 1 second, the speaker may have spoken too fast or said too little, so the item is not representative, and this embodiment removes it from the speaker's audio information set. For example, for speaker spkr j with audio information set M j = {x j1, x j2, x j3, ..., x jm}, if the speech duration of item x ji measured by VAD is less than 1 second, then x ji is removed from M j, where j and m are positive integers and i = 1, 2, ..., m.
By removing audio information whose speech duration is less than the second preset threshold from the audio information set, this embodiment effectively excludes extreme cases and guarantees the length of the audio information in each speaker's audio information set, which helps improve the training effect and the generalization ability of the residual delay network. A sketch of the filtering logic follows.
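The following Python sketch illustrates steps S501-S504 under stated assumptions: the example thresholds of 4 utterances and 1 second are taken from the text, while the speech_duration helper (a VAD wrapper that strips non-speech parts and returns the remaining speech duration in seconds) is hypothetical and must be supplied by the caller.

```python
from typing import Callable, Dict, List

MIN_UTTERANCES = 4        # first preset threshold (example value from the text)
MIN_SPEECH_SECONDS = 1.0  # second preset threshold (example value from the text)

def preprocess(samples: Dict[str, List[str]],
               speech_duration: Callable[[str], float]) -> Dict[str, List[str]]:
    """samples maps a speaker label to that speaker's audio file paths;
    speech_duration is the hypothetical VAD helper described above."""
    kept: Dict[str, List[str]] = {}
    for speaker, utterances in samples.items():
        # Step S502: drop speakers with too few pieces of audio information.
        if len(utterances) < MIN_UTTERANCES:
            continue
        # Steps S503-S504: drop items whose speech part is too short.
        long_enough = [u for u in utterances
                       if speech_duration(u) >= MIN_SPEECH_SECONDS]
        if long_enough:
            kept[speaker] = long_enough
    return kept
```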
The speakers and audio information sets remaining after preprocessing through steps S501 to S504 serve as the training sample set for training the residual delay network in the embodiments of this application. The whole training process includes several rounds of training; each round involves K speakers and a total of N pieces of audio information.
In step S403, feature extraction is performed on each piece of preprocessed audio information to obtain the corresponding Mel-frequency cepstral coefficients.
The Mel-frequency cepstral coefficients (MFCC features) are a type of speech feature: cepstral parameters extracted in the Mel-scale frequency domain whose design takes into account how the human ear perceives different frequencies, making them particularly suitable for speech recognition and speaker recognition. In this embodiment, the MFCC features are the input of the residual delay network. Before training or using the residual delay network, feature extraction is first performed on each piece of audio information to obtain the corresponding MFCC features. Optionally, the feature extraction process includes, but is not limited to, framing, windowing, the discrete Fourier transform, power spectrum computation, Mel filter bank computation, logarithmic energy computation, and the discrete cosine transform. Here, this embodiment uses 23-dimensional MFCC features to further compress the amount of data the residual network has to process.
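A minimal sketch of this extraction step, assuming the librosa toolkit (the application does not name one) and common 25 ms / 10 ms framing at 16 kHz; only the 23-dimensional feature count comes from the text.

```python
import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 23) -> np.ndarray:
    """Return a (frames, 23) matrix of MFCC features for one utterance."""
    signal, sr = librosa.load(path, sr=16000)  # resample to 16 kHz (assumption)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
    return mfcc.T  # one row per frame: the network's frame-level input vectors
```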
In step S404, the Mel-frequency cepstral coefficients corresponding to each piece of audio information are fed as input vectors into the preset residual delay network for training, and the recognition results output by the residual delay network are obtained.
During training, for each piece of audio information, the corresponding MFCC features are fed as one input vector into the preset residual delay network for training, yielding the recognition result for that audio information.
As mentioned above, the residual delay network includes stacked frame-level Res-TDNN blocks, a Statistics-Pooling layer, a segment-level layer, and a log-softmax layer. The 23-dimensional MFCC features of an audio signal are first input to the Res-TDNN blocks of the residual delay network for feature extraction; the resulting feature matrix is then input to the Statistics-Pooling layer and the segment-level layer for further feature extraction. The feature vector output by the segment-level layer serves as the feature vector of the audio signal and contains its characteristic information. This feature vector is further input to the log-softmax layer for classification. The recognition result output by the log-softmax layer is a one-dimensional probability vector. If there are K speakers in this round of training, the probability vector contains K elements, one per speaker, representing the relative probabilities between the speakers: the larger an element's value, the more likely it is that the MFCC features, and hence the audio information, belong to the corresponding speaker, so the audio information can be clearly predicted to belong to the speaker with the highest probability.
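The forward pass just described could be sketched as follows, reusing the ResTDNNBlock class from the earlier sketch; the number of blocks and the layer widths are illustrative assumptions, not values fixed by the application.

```python
import torch
import torch.nn as nn

class StatisticsPooling(nn.Module):
    """Pool frame-level features (batch, channels, frames) into one
    segment-level vector by concatenating mean and standard deviation
    over the frame axis."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)

class ResTDNN(nn.Module):
    """Skeleton of the network: stacked Res-TDNN blocks at the frame level,
    statistics pooling, a segment-level embedding layer, and a log-softmax
    output over the K training speakers."""
    def __init__(self, n_mfcc: int = 23, channels: int = 512,
                 embed_dim: int = 512, n_speakers: int = 1000):
        super().__init__()
        self.input_proj = nn.Conv1d(n_mfcc, channels, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(*[ResTDNNBlock(channels) for _ in range(4)])
        self.pool = StatisticsPooling()
        self.embedding = nn.Linear(2 * channels, embed_dim)  # segment level
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_mfcc, frames)
        h = self.blocks(self.input_proj(x))
        emb = self.embedding(self.pool(h))  # the feature vector used at test time
        return torch.log_softmax(self.classifier(emb), dim=1)  # K log-probabilities
```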
Steps S403 and S404 above are performed on each of the N pieces of audio information in this round of training until all N pieces have been traversed; then step S405 is executed.
In step S405, a preset loss function is used to compute, for each piece of audio information, the error between the recognition result produced by the residual delay network from its Mel-frequency cepstral coefficients and the corresponding speaker label, and the parameters of the residual delay network are modified according to the error.
In this embodiment, the loss function is computed in the loss layer of the residual delay network. Assuming that each round of training involves K speakers and N pieces of audio information in total, the loss function is:
$$E = -\sum_{n=1}^{N}\sum_{k=1}^{K} d_{nk}\,\ln P\!\left(\mathrm{spkr}_k \mid x_1^{(n)},\ldots,x_T^{(n)}\right)$$
In the above formula, $P(\mathrm{spkr}_k \mid x_1^{(n)},\ldots,x_T^{(n)})$ denotes the probability that the T tested frames come from speaker spkr k, where T is the frame length of one piece of audio information, $x^{(n)}$ is the n-th of the N audio clips, and $x_t^{(n)}$ is one frame-length signal of the n-th audio clip; $d_{nk}$ is the label function, whose value is 1 if the frames contained in the n-th of the N pieces of audio information all come from speaker k, and 0 otherwise.
The value of the frame length T above is related to the length of the audio information and is determined by the TDNN structure; experiments usually cut out fixed-length audio, e.g., 4 seconds, in which case T is 400.
After one round of training is completed and the recognition results of the N pieces of audio information are obtained, the above loss function is used to compute the error between the recognition result of each piece of audio information and the corresponding preset label, and the error is propagated back to modify the parameters of the residual delay network, including the parameters of the Res-TDNN blocks, the Statistics-Pooling layer, and the segment-level layer. Optionally, the embodiments of this application use the back-propagation algorithm to compute the gradients of the residual delay network and the stochastic gradient descent method to update its parameters, driving it to keep learning features until convergence.
In step S406, the Mel-frequency cepstral coefficients corresponding to each piece of audio information are fed as input vectors into the residual delay network with the modified parameters to perform the next round of training.
The residual delay network whose parameters were modified in step S405 is used for the next round of training. In each round, K speakers with a total of N labeled pieces of audio information are randomly selected from the preprocessed training sample set; the training procedure is the same as in steps S404 and S405 and is not repeated here. Steps S404, S405, and S406 are repeated for 50-150 iterations, so that the residual delay network can learn the key features of the audio information and achieve good model performance. The number of iterations can be adjusted according to the size of the training set and is not limited here. A sketch of this loop follows.
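A minimal training-loop sketch of steps S404-S406, assuming the ResTDNN skeleton above; nn.NLLLoss applied to the log-softmax outputs realizes the loss E given earlier (up to averaging over the batch), while the learning rate and batch format are assumptions.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, batches, epochs: int = 100, lr: float = 0.01) -> None:
    """batches yields (mfcc, speaker_ids) pairs, where mfcc has shape
    (batch, 23, T) and speaker_ids holds the integer speaker labels."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()  # with log-softmax outputs this is the loss E above
    for _ in range(epochs):   # the text suggests 50-150 iterations
        for mfcc, speaker_ids in batches:
            optimizer.zero_grad()
            loss = criterion(model(mfcc), speaker_ids)
            loss.backward()   # back-propagate the error
            optimizer.step()  # stochastic gradient descent update
```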
After training is completed, the trained residual delay network is used for testing, and step S102 is executed.
In step S102, an audio information set of a test user is acquired, the audio information set including registered audio and test audio.
Optionally, the server can obtain test users and their audio information according to actual needs or application scenarios to form the test user's audio information set. For example, the test users and their audio information can be obtained from a preset audio library in which a large number of users and their audio information have been collected in advance. Telephone recordings can also be collected as the test user's audio information by connecting to communication equipment. It is understandable that the embodiments of this application can obtain the test user's audio information set in various other ways, which will not be elaborated here.
In this embodiment, the test user's audio information set includes test audio and registered audio. The test audio is the audio information on which speaker confirmation is performed through the residual delay network, and the registered audio is the audio information from which the speaker feature library is built through the residual delay network. Optionally, there may be one or more test users, and one or more pieces of test audio or registered audio.
In step S103, preprocessing is performed on the audio information set of the test user.
Here, since the test user's audio information may contain noise and little useful information, it needs to be preprocessed to improve the speed and accuracy of recognition by the residual delay network. Optionally, step S103 includes:
removing any test user, together with the corresponding audio information set, whose number of audio items is less than the first preset threshold;
performing voice activity detection on each piece of audio information in the remaining test users' audio information sets, deleting the non-speech parts according to the detection results to obtain the duration of the speech part, and removing from the test user's audio information set any audio information whose speech duration is less than the second preset threshold.
The above steps are the same as step S402, i.e., removing test users whose number of audio items is less than the first preset threshold together with their audio information sets, and removing audio information whose speech duration is less than the second preset threshold; see the description of the above embodiment for details, which is not repeated here.
In step S104, feature extraction is performed on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively.
Optionally, step S104 is the same as step S403 above; see the description of the above embodiment for details, which is not repeated here. Here, this embodiment uses 23-dimensional MFCC features for testing.
In step S105, the Mel-frequency cepstral coefficients of the registered audio are fed as an input vector into the trained residual delay network, and the feature vector output by the residual delay network at the segment level is taken as the registered feature vector of the test user.
After the MFCC features of the registered audio are obtained, they are fed as input into the pre-trained residual delay network, which recognizes the registered audio based on the MFCC features. Here, the pre-trained residual delay network includes the Res-TDNN blocks, the Statistics-Pooling layer, the segment-level layer, and the log-softmax layer. After the residual delay network has processed the registered audio, the output vector produced by the embedding feature extraction that the segment-level layer performs on the registered audio is taken as the registered feature vector. The registered feature vector is the test user's audio feature vector in the speaker feature library, and each of its elements represents a voiceprint feature of the registered audio. Here, the speaker feature library can be set up as needed for identity authentication scenarios such as online payment, voiceprint lock control, and proof-of-life authentication, and stores the audio feature information, i.e., the registered feature vectors, of the registered users who need to be on file.
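Under the same assumed ResTDNN skeleton, extracting the segment-level feature vector (rather than the classification scores) could look like this minimal sketch:

```python
import torch

@torch.no_grad()
def extract_embedding(model: "ResTDNN", mfcc: torch.Tensor) -> torch.Tensor:
    """mfcc: (23, T) features of one utterance. Runs the frame-level blocks
    and statistics pooling, then returns the segment-level embedding-layer
    output instead of the log-softmax classification scores."""
    model.eval()
    h = model.blocks(model.input_proj(mfcc.unsqueeze(0)))  # add batch dimension
    return model.embedding(model.pool(h)).squeeze(0)       # the feature vector
```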
In step S106, the Mel-frequency cepstral coefficients of the test audio are fed as an input vector into the trained residual delay network, and the feature vector output by the residual delay network at the segment level is taken as the feature vector to be tested of the test user.
After the MFCC features of the test audio are obtained, they are fed as input into the pre-trained residual delay network, which recognizes the test audio based on the MFCC features. After the residual delay network has processed the test audio, the output vector produced by the embedding feature extraction that the segment-level layer performs on the test audio is taken as the feature vector to be tested. The feature vector to be tested is the audio feature vector with which the test user performs speaker confirmation through the residual delay network, and each of its elements represents a voiceprint feature of the test audio.
In step S107, the registered feature vector and the feature vector to be tested are input into a preset probabilistic linear discriminant analysis model, and the score output by the probabilistic linear discriminant analysis model is obtained.
When speaker confirmation is performed, the feature vector to be tested and the registered feature vector are input into a preset probabilistic linear discriminant analysis model. Here, probabilistic linear discriminant analysis (PLDA) is a channel compensation algorithm. This embodiment uses the PLDA model to compute the degree of similarity between the feature vector to be tested and the registered feature vector, yielding a score. The higher the score, the more consistent the feature vector to be tested and the registered feature vector; the lower the score, the less consistent they are.
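The application treats the PLDA model as preset, so the sketch below deliberately swaps in a simpler stand-in scorer, cosine similarity between length-normalized embeddings, purely to show the enrol-versus-test scoring interface; a real system would instead return the log-likelihood ratio of a trained PLDA model.

```python
import numpy as np

def score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Stand-in scorer: cosine similarity of length-normalized embeddings.
    Higher means the two vectors are more consistent; this is NOT PLDA,
    only a placeholder with the same inputs and output."""
    e = enroll / np.linalg.norm(enroll)
    t = test / np.linalg.norm(test)
    return float(np.dot(e, t))
```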
In step S108, the speaker confirmation result is output according to the score.
As mentioned above, the higher the score, the more consistent the feature vector to be tested and the registered feature vector, and the lower the score, the less consistent they are. This embodiment sets a score threshold, compares the score with the preset score threshold, and outputs the speaker confirmation result according to the comparison. Optionally, as shown in Fig. 6, step S108 includes the following steps.
In step S601, the score is compared with a preset score threshold.
Here, the preset score threshold is set empirically and serves as the criterion for judging whether the feature vector to be tested and the registered feature vector come from the same speaker.
In step S602, if the score is greater than or equal to the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from the same speaker is output.
As mentioned above, the higher the score, the more consistent the feature vector to be tested and the registered feature vector. When the score is greater than or equal to the preset score threshold, this embodiment determines that the feature vector to be tested and the registered feature vector come from the same speaker, and outputs an indication that the speaker confirmation result is the same speaker.
In step S603, if the score is less than the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from different speakers is output.
When the score is less than the preset score threshold, this embodiment determines that the feature vector to be tested and the registered feature vector come from different speakers, and outputs an indication that the speaker confirmation result is different speakers.
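The decision logic of steps S601-S603 then reduces to a single comparison; the threshold must be supplied by the caller, since the application sets it empirically.

```python
def confirm(score_value: float, threshold: float) -> str:
    """Steps S601-S603: compare the score with the preset score threshold."""
    if score_value >= threshold:
        return "same speaker"        # step S602
    return "different speakers"      # step S603
```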
In summary, this embodiment constructs a residual delay network, uses it to extract registered feature vectors from the preprocessed registered audio, and builds a speaker feature library. During speaker confirmation, the residual delay network extracts the feature vector to be tested from the preprocessed test audio, which is fed into the PLDA model together with the registered feature vector from the speaker feature library to compute a score; the score is compared with the preset score threshold, and finally the speaker confirmation result is output according to the comparison. Because the residual delay network replaces the frame level of the traditional time-delay network with residual delay network blocks, compared with the traditional TDNN-plus-PLDA speaker confirmation method, a smaller training set is required and the model is easier to train, effectively reducing the training cost. In addition, this method can increase the network depth while reducing the number of nodes per layer, so even a drop in the overall parameter count does not affect network performance; extracting key features through the residual delay network effectively reduces noise interference, and on speaker confirmation from short audio it achieves results significantly better than the traditional PLDA model.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a speaker confirmation apparatus based on a residual delay network is provided, corresponding one-to-one to the speaker confirmation method based on a residual delay network in the above embodiments. As shown in Fig. 7, the apparatus includes a training module, an acquisition module, a preprocessing module, a feature extraction module, a first feature acquisition module, a second feature acquisition module, a score acquisition module, and a speaker confirmation module. Each functional module is described in detail as follows:
the training module 71, configured to construct a residual delay network and train the residual delay network with a preset training sample set;
the acquisition module 72, configured to acquire an audio information set of a test user, the audio information set including registered audio and test audio;
the preprocessing module 73, configured to perform preprocessing on the audio information set of the test user;
the feature extraction module 74, configured to perform feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
the first feature acquisition module 75, configured to feed the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and to take the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
the second feature acquisition module 76, configured to feed the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and to take the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
the score acquisition module 77, configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model and obtain the score output by the model;
the speaker confirmation module 78, configured to output the speaker confirmation result according to the score.
The residual delay network is obtained by replacing the frame level of a time-delay network with residual delay network blocks, and a residual delay network block is obtained by combining the structure of the time-delay network with the identity mapping and residual mapping of the residual network.
Optionally, the training module 71 includes:
a collection unit, configured to collect multiple pieces of audio information from several speakers as a training sample set;
a preprocessing unit, configured to perform preprocessing on the audio information in the training sample set;
a feature extraction unit, configured to perform feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel-frequency cepstral coefficients;
a training unit, configured to feed the Mel-frequency cepstral coefficients corresponding to each piece of audio information as input vectors into a preset residual delay network for training, and to obtain the recognition results output by the residual delay network;
a parameter modification unit, configured to use a preset loss function to compute, for each piece of audio information, the error between the recognition result produced by the residual delay network from its Mel-frequency cepstral coefficients and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
the training unit is further configured to feed the Mel-frequency cepstral coefficients corresponding to each piece of audio information as input vectors into the residual delay network with the modified parameters to perform the next round of training.
Optionally, the preprocessing unit includes:
a labeling subunit, configured to add a speaker label to each piece of audio information and classify the audio information according to the speaker labels to obtain an audio information set for each speaker;
a first removal subunit, configured to remove from the training sample set any speaker, together with the corresponding audio information set, whose number of audio items is less than a first preset threshold;
a detection subunit, configured to perform voice activity detection on each piece of audio information in the remaining audio information sets and delete the non-speech parts according to the detection results to obtain the duration of the speech part;
a second removal subunit, configured to remove from the audio information set any audio information whose speech duration is less than a second preset threshold.
Optionally, the speaker confirmation module 78 includes:
a comparison unit, configured to compare the score with a preset score threshold;
a first confirmation unit, configured to output, if the score is greater than or equal to the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
a second confirmation unit, configured to output, if the score is less than the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from different speakers.
For specific limitations on the speaker confirmation apparatus based on a residual delay network, reference can be made to the above limitations on the speaker confirmation method based on a residual delay network, which are not repeated here. Each module of the above apparatus can be implemented in whole or in part by software, hardware, or a combination thereof. The above modules can be embedded in or independent of the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a speaker confirmation method based on a residual delay network.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
constructing a residual delay network, and training the residual delay network with a preset training sample set;
acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
performing preprocessing on the audio information set of the test user;
performing feature extraction on the preprocessed audio information set to obtain the Mel-frequency cepstral coefficients of the registered audio and of the test audio, respectively;
feeding the Mel-frequency cepstral coefficients of the registered audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the registered feature vector of the test user;
feeding the Mel-frequency cepstral coefficients of the test audio as an input vector into the trained residual delay network, and taking the feature vector output by the residual delay network at the segment level as the feature vector to be tested of the test user;
inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the model;
outputting the speaker confirmation result according to the score.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is used only as an example. In practical applications, the above functions may be allocated to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the scope of protection of this application.

Claims (20)

  1. A speaker confirmation method based on a residual delay network, comprising:
    constructing a residual delay network, and training the residual delay network with a preset training sample set;
    acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
    preprocessing the audio information set of the test user;
    performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients of the registered audio and the Mel frequency cepstral coefficients of the test audio, respectively;
    feeding the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
    feeding the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
    inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the probabilistic linear discriminant analysis model;
    outputting a speaker confirmation result according to the score.
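As a reading aid for the scoring step of claim 1, the following is one simplified sketch of probabilistic linear discriminant analysis scoring using the common two-covariance Gaussian formulation. The between-class and within-class covariances `sigma_b` and `sigma_w` are assumed to have been estimated on training embeddings; the claim itself does not fix any particular PLDA variant.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, sigma_b, sigma_w):
    """Two-covariance PLDA log-likelihood ratio for mean-centered embeddings.

    Compares the hypothesis that x1 and x2 share one latent speaker factor
    (cross-covariance sigma_b) against the independent-speaker hypothesis.
    """
    d = len(x1)
    sigma_tot = sigma_b + sigma_w
    z = np.concatenate([x1, x2])
    # Joint covariance when both vectors come from the same speaker
    same = np.block([[sigma_tot, sigma_b], [sigma_b, sigma_tot]])
    # Joint covariance when the vectors come from different speakers
    diff = np.block([[sigma_tot, np.zeros((d, d))], [np.zeros((d, d)), sigma_tot]])
    return (multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=same)
            - multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=diff))
```

A positive log-likelihood ratio favors the same-speaker hypothesis, which is why the decision in the later claims reduces to comparing this score against a preset threshold.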
  2. The speaker confirmation method based on a residual delay network according to claim 1, wherein the residual delay network is obtained by replacing the session inter-frame levels of a delay network with residual delay network blocks, and each residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
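Claim 2 describes the building block only structurally. A plausible PyTorch sketch is given below, under the assumption that the time-delay layers are realized as dilated 1-D convolutions and that the block adds a residual network's identity mapping to the layers' residual mapping; the channel count, context size, and dilation are illustrative values, not ones stated in the claim.

```python
import torch
import torch.nn as nn

class ResidualTDNNBlock(nn.Module):
    """Time-delay (dilated Conv1d) layers wrapped with an identity skip connection."""

    def __init__(self, channels=512, context=3, dilation=2):
        super().__init__()
        pad = (context - 1) // 2 * dilation  # keep the frame count unchanged
        self.tdnn = nn.Sequential(
            nn.Conv1d(channels, channels, context, dilation=dilation, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, context, dilation=dilation, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):                # x: (batch, channels, frames)
        residual = self.tdnn(x)          # residual mapping F(x)
        return self.relu(x + residual)   # identity mapping x plus F(x)
```

Stacking blocks of this shape in place of the frame-level layers of a plain delay network yields one possible realization of the residual delay network the claim refers to.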
  3. The speaker confirmation method based on a residual delay network according to claim 1 or 2, wherein training the residual delay network with the preset training sample set comprises:
    collecting multiple pieces of audio information from several speakers as the training sample set;
    preprocessing the audio information in the training sample set;
    performing feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtaining the recognition result output by the residual delay network;
    using a preset loss function to calculate, for each piece of audio information, the error between the recognition result produced by the residual delay network for its Mel frequency cepstral coefficients and the corresponding speaker label, and modifying the parameters of the residual delay network according to the error;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
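A condensed PyTorch sketch of this training loop (claim 3) might look as follows. Cross-entropy over speaker identities is used as a stand-in for the unspecified preset loss function, and `model` and `train_loader` (yielding MFCC batches with integer speaker labels) are assumed to exist rather than taken from the disclosure.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    """Feed MFCC inputs, score against speaker labels, and update the network."""
    criterion = nn.CrossEntropyLoss()  # stand-in for the preset loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for mfcc, speaker_label in train_loader:
            logits = model(mfcc)                     # recognition result
            loss = criterion(logits, speaker_label)  # error vs. the speaker label
            optimizer.zero_grad()
            loss.backward()                          # gradients of the error
            optimizer.step()                         # modify the network parameters
```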
  4. The speaker confirmation method based on a residual delay network according to claim 3, wherein preprocessing the audio information in the training sample set comprises:
    adding a speaker label to each piece of audio information, and classifying by speaker label to obtain an audio information set for each speaker;
    removing from the training sample set any audio information set, together with its speaker, whose number of pieces of audio information is less than a first preset threshold;
    performing voice activity detection on each piece of audio information in the remaining audio information sets, deleting the non-speech portions according to the detection result, and obtaining the duration of the speech portion;
    removing from the audio information set any audio information whose speech portion is shorter than a second preset threshold.
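The preprocessing of claim 4 reduces to two filters over the training data. The sketch below uses a crude energy-based voice activity detector as a placeholder for whatever VAD the implementation actually employs, and the two threshold values are illustrative assumptions, not figures from the claim.

```python
import numpy as np

MIN_UTTERANCES = 8        # assumed first preset threshold
MIN_SPEECH_SECONDS = 2.0  # assumed second preset threshold

def speech_duration(samples, sr, frame_len=400, hop=160, energy_thresh=1e-4):
    """Crude energy-based VAD: total duration (s) of frames above the threshold."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len, hop)]
    voiced = sum(np.mean(f ** 2) > energy_thresh for f in frames)
    return voiced * hop / sr

def filter_training_set(speaker_to_utterances, sr):
    """Drop sparse speakers, then drop utterances with too little speech."""
    kept = {}
    for speaker, utts in speaker_to_utterances.items():
        if len(utts) < MIN_UTTERANCES:
            continue  # remove speakers with too few audio files
        utts = [u for u in utts if speech_duration(u, sr) >= MIN_SPEECH_SECONDS]
        kept[speaker] = utts
    return kept
```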
  5. The speaker confirmation method based on a residual delay network according to claim 1 or 2, wherein outputting the speaker confirmation result according to the score comprises:
    comparing the score with a preset score threshold;
    if the score is greater than or equal to the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
    if the score is less than the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from different speakers.
  6. A speaker confirmation apparatus based on a residual delay network, comprising:
    a training module, configured to construct a residual delay network and train the residual delay network with a preset training sample set;
    an acquisition module, configured to acquire an audio information set of a test user, the audio information set including registered audio and test audio;
    a preprocessing module, configured to preprocess the audio information set of the test user;
    a feature extraction module, configured to perform feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients of the registered audio and of the test audio, respectively;
    a first feature acquisition module, configured to feed the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and take the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
    a second feature acquisition module, configured to feed the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and take the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
    a score acquisition module, configured to input the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtain the score output by the probabilistic linear discriminant analysis model;
    a speaker confirmation module, configured to output a speaker confirmation result according to the score.
  7. The speaker confirmation apparatus based on a residual delay network according to claim 6, wherein the residual delay network is obtained by replacing the session inter-frame levels of a delay network with residual delay network blocks, and each residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
  8. The speaker confirmation apparatus based on a residual delay network according to claim 6 or 7, wherein the training module comprises:
    a collection unit, configured to collect multiple pieces of audio information from several speakers as the training sample set;
    a preprocessing unit, configured to preprocess the audio information in the training sample set;
    a feature extraction unit, configured to perform feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
    a training unit, configured to feed the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtain the recognition result output by the residual delay network;
    a parameter modification unit, configured to use a preset loss function to calculate, for each piece of audio information, the error between the recognition result produced by the residual delay network for its Mel frequency cepstral coefficients and the corresponding speaker label, and to modify the parameters of the residual delay network according to the error;
    the training unit being further configured to feed the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
  9. The speaker confirmation apparatus based on a residual delay network according to claim 8, wherein the preprocessing unit comprises:
    a labeling subunit, configured to add a speaker label to each piece of audio information and classify by speaker label to obtain an audio information set for each speaker;
    a first removal subunit, configured to remove from the training sample set any audio information set, together with its speaker, whose number of pieces of audio information is less than a first preset threshold;
    a detection subunit, configured to perform voice activity detection on each piece of audio information in the remaining audio information sets, delete the non-speech portions according to the detection result, and obtain the duration of the speech portion;
    a second removal subunit, configured to remove from the audio information set any audio information whose speech portion is shorter than a second preset threshold.
  10. The speaker confirmation apparatus based on a residual delay network according to claim 6 or 7, wherein the speaker confirmation module comprises:
    a comparison unit, configured to compare the score with a preset score threshold;
    a first confirmation unit, configured to output, if the score is greater than or equal to the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
    a second confirmation unit, configured to output, if the score is less than the preset score threshold, an indication that the feature vector to be tested and the registered feature vector come from different speakers.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    constructing a residual delay network, and training the residual delay network with a preset training sample set;
    acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
    preprocessing the audio information set of the test user;
    performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients of the registered audio and the Mel frequency cepstral coefficients of the test audio, respectively;
    feeding the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
    feeding the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
    inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the probabilistic linear discriminant analysis model;
    outputting a speaker confirmation result according to the score.
  12. The computer device according to claim 11, wherein the residual delay network is obtained by replacing the session inter-frame levels of a delay network with residual delay network blocks, and each residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
  13. The computer device according to claim 11 or 12, wherein training the residual delay network with the preset training sample set comprises:
    collecting multiple pieces of audio information from several speakers as the training sample set;
    preprocessing the audio information in the training sample set;
    performing feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtaining the recognition result output by the residual delay network;
    using a preset loss function to calculate, for each piece of audio information, the error between the recognition result produced by the residual delay network for its Mel frequency cepstral coefficients and the corresponding speaker label, and modifying the parameters of the residual delay network according to the error;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
  14. The computer device according to claim 13, wherein preprocessing the audio information in the training sample set comprises:
    adding a speaker label to each piece of audio information, and classifying by speaker label to obtain an audio information set for each speaker;
    removing from the training sample set any audio information set, together with its speaker, whose number of pieces of audio information is less than a first preset threshold;
    performing voice activity detection on each piece of audio information in the remaining audio information sets, deleting the non-speech portions according to the detection result, and obtaining the duration of the speech portion;
    removing from the audio information set any audio information whose speech portion is shorter than a second preset threshold.
  15. The computer device according to claim 11 or 12, wherein outputting the speaker confirmation result according to the score comprises:
    comparing the score with a preset score threshold;
    if the score is greater than or equal to the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
    if the score is less than the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from different speakers.
  16. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    constructing a residual delay network, and training the residual delay network with a preset training sample set;
    acquiring an audio information set of a test user, the audio information set including registered audio and test audio;
    preprocessing the audio information set of the test user;
    performing feature extraction on the preprocessed audio information set to obtain the Mel frequency cepstral coefficients of the registered audio and the Mel frequency cepstral coefficients of the test audio, respectively;
    feeding the Mel frequency cepstral coefficients of the registered audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the registered feature vector of the test user;
    feeding the Mel frequency cepstral coefficients of the test audio into the trained residual delay network as an input vector, and taking the feature vector output by the residual delay network at the session slice level as the feature vector to be tested of the test user;
    inputting the registered feature vector and the feature vector to be tested into a preset probabilistic linear discriminant analysis model, and obtaining the score output by the probabilistic linear discriminant analysis model;
    outputting a speaker confirmation result according to the score.
  17. The computer-readable storage medium according to claim 16, wherein the residual delay network is obtained by replacing the session inter-frame levels of a delay network with residual delay network blocks, and each residual delay network block is obtained by combining the structure of the delay network with the identity mapping and residual mapping of a residual network.
  18. The computer-readable storage medium according to claim 16 or 17, wherein training the residual delay network with the preset training sample set comprises:
    collecting multiple pieces of audio information from several speakers as the training sample set;
    preprocessing the audio information in the training sample set;
    performing feature extraction on each piece of preprocessed audio information to obtain the corresponding Mel frequency cepstral coefficients;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into a preset residual delay network as an input vector for training, and obtaining the recognition result output by the residual delay network;
    using a preset loss function to calculate, for each piece of audio information, the error between the recognition result produced by the residual delay network for its Mel frequency cepstral coefficients and the corresponding speaker label, and modifying the parameters of the residual delay network according to the error;
    feeding the Mel frequency cepstral coefficients corresponding to each piece of audio information into the parameter-modified residual delay network as an input vector to perform the next round of training.
  19. The computer-readable storage medium according to claim 18, wherein preprocessing the audio information in the training sample set comprises:
    adding a speaker label to each piece of audio information, and classifying by speaker label to obtain an audio information set for each speaker;
    removing from the training sample set any audio information set, together with its speaker, whose number of pieces of audio information is less than a first preset threshold;
    performing voice activity detection on each piece of audio information in the remaining audio information sets, deleting the non-speech portions according to the detection result, and obtaining the duration of the speech portion;
    removing from the audio information set any audio information whose speech portion is shorter than a second preset threshold.
  20. The computer-readable storage medium according to claim 16 or 17, wherein outputting the speaker confirmation result according to the score comprises:
    comparing the score with a preset score threshold;
    if the score is greater than or equal to the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from the same speaker;
    if the score is less than the preset score threshold, outputting an indication that the feature vector to be tested and the registered feature vector come from different speakers.
PCT/CN2019/103155 2019-05-09 2019-08-29 Residual delay network-based speaker confirmation method and apparatus, device and medium WO2020224114A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910384582.0A CN110232932B (en) 2019-05-09 2019-05-09 Speaker confirmation method, device, equipment and medium based on residual delay network
CN201910384582.0 2019-05-09

Publications (1)

Publication Number Publication Date
WO2020224114A1

Family

ID=67860506

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103155 WO2020224114A1 (en) 2019-05-09 2019-08-29 Residual delay network-based speaker confirmation method and apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN110232932B (en)
WO (1) WO2020224114A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081278A (en) * 2019-12-18 2020-04-28 公安部第三研究所 Method and system for testing conversation quality of talkback terminal
CN111133507B (en) * 2019-12-23 2023-05-23 深圳市优必选科技股份有限公司 Speech synthesis method, device, intelligent terminal and readable medium
CN111916074A (en) * 2020-06-29 2020-11-10 厦门快商通科技股份有限公司 Cross-device voice control method, system, terminal and storage medium
CN111885275B (en) * 2020-07-23 2021-11-26 海尔优家智能科技(北京)有限公司 Echo cancellation method and device for voice signal, storage medium and electronic device
CN112992157A (en) * 2021-02-08 2021-06-18 贵州师范大学 Neural network noisy line identification method based on residual error and batch normalization
CN112992155B (en) * 2021-03-02 2022-10-14 复旦大学 Far-field voice speaker recognition method and device based on residual error neural network
CN113178196B (en) * 2021-04-20 2023-02-07 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113724731B (en) * 2021-08-30 2024-01-05 中国科学院声学研究所 Method and device for carrying out audio discrimination by utilizing audio discrimination model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005034395A2 (en) * 2003-09-17 2005-04-14 Nielsen Media Research, Inc. Methods and apparatus to operate an audience metering device with voice commands
CN101226743A (en) * 2007-12-05 2008-07-23 浙江大学 Method for recognizing speaker based on conversion of neutral and affection sound-groove model
CN107464568B (en) * 2017-09-25 2020-06-30 四川长虹电器股份有限公司 Speaker identification method and system based on three-dimensional convolution neural network text independence
CN108281146B (en) * 2017-12-29 2020-11-13 歌尔科技有限公司 Short voice speaker identification method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034472A (en) * 2009-09-28 2011-04-27 戴红霞 Speaker recognition method based on Gaussian mixture model embedded with time delay neural network
CN106683680A (en) * 2017-03-10 2017-05-17 百度在线网络技术(北京)有限公司 Speaker recognition method and device and computer equipment and computer readable media
US20180350351A1 (en) * 2017-05-31 2018-12-06 Intel Corporation Feature extraction using neural network accelerator
CN108109613A (en) * 2017-12-12 2018-06-01 苏州思必驰信息科技有限公司 For the audio training of Intelligent dialogue voice platform and recognition methods and electronic equipment
CN108694949A (en) * 2018-03-27 2018-10-23 佛山市顺德区中山大学研究院 Method for distinguishing speek person and its device based on reorder super vector and residual error network
CN109166586A (en) * 2018-08-02 2019-01-08 平安科技(深圳)有限公司 A kind of method and terminal identifying speaker

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735470A (en) * 2020-12-28 2021-04-30 携程旅游网络技术(上海)有限公司 Audio cutting method, system, device and medium based on time delay neural network
CN112735470B (en) * 2020-12-28 2024-01-23 携程旅游网络技术(上海)有限公司 Audio cutting method, system, equipment and medium based on time delay neural network
CN112613468A (en) * 2020-12-31 2021-04-06 平安国际智慧城市科技股份有限公司 Epidemic situation investigation method based on artificial intelligence and related equipment
CN112613468B (en) * 2020-12-31 2024-04-05 深圳平安智慧医健科技有限公司 Epidemic situation investigation method based on artificial intelligence and related equipment

Also Published As

Publication number Publication date
CN110232932A (en) 2019-09-13
CN110232932B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
WO2020224114A1 (en) Residual delay network-based speaker confirmation method and apparatus, device and medium
WO2020177380A1 (en) Voiceprint detection method, apparatus and device based on short text, and storage medium
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
WO2021164147A1 (en) Artificial intelligence-based service evaluation method and apparatus, device and storage medium
Reynolds An overview of automatic speaker recognition technology
US9502038B2 (en) Method and device for voiceprint recognition
CN109473105A (en) The voice print verification method, apparatus unrelated with text and computer equipment
WO2014114116A1 (en) Method and system for voiceprint recognition
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
WO2019232826A1 (en) I-vector extraction method, speaker recognition method and apparatus, device, and medium
CN108564956B (en) Voiceprint recognition method and device, server and storage medium
US20190325880A1 (en) System for text-dependent speaker recognition and method thereof
Chakroun et al. Robust text-independent speaker recognition with short utterances using Gaussian mixture models
Revathi et al. Text independent speaker recognition and speaker independent speech recognition using iterative clustering approach
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Zewoudie et al. The Use of Audio Fingerprints for Authentication of Speakers on Speech Operated Interfaces
CN111063359B (en) Telephone return visit validity judging method, device, computer equipment and medium
Nirjon et al. sMFCC: exploiting sparseness in speech for fast acoustic feature extraction on mobile devices--a feasibility study
Singh et al. Combining evidences from Hilbert envelope and residual phase for detecting replay attacks
Akinrinmade et al. Creation of a Nigerian voice corpus for indigenous speaker recognition
Pickersgill et al. Investigation of DNN prediction of power spectral envelopes for speech coding & ASR
Hossan et al. Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization
Balpande et al. Speaker recognition based on mel-frequency cepstral coefficients and vector quantization
Sailaja et al. Text Independent Speaker Identification Using Finite Doubly Truncated Gaussian Mixture Model
Wei Adaptive Speaker Recognition Based on Hidden Markov Model Parameter Optimization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927645

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927645

Country of ref document: EP

Kind code of ref document: A1