CN114913859B - Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium

Publication number: CN114913859B
Application number: CN202210536790.XA
Authority: CN (China)
Legal status: Active (granted)
Other versions: CN114913859A
Original language: Chinese (zh)
Prior art keywords: voiceprint, audio data, sample, sub-audio
Inventor: 赵情恩
Assignee: Beijing Baidu Netcom Science and Technology Co Ltd

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure provides a voiceprint recognition method, a voiceprint recognition device, electronic equipment and a storage medium, and relates to the technical field of artificial intelligence such as deep learning and voice technology. The specific implementation scheme is as follows: acquiring target audio data to be identified, and acquiring corresponding local audio features and global audio features based on the target audio data; inputting the local audio features into a student network of the voiceprint recognition model to obtain first voiceprint features output by the student network; inputting the global audio features into a teacher network of the voiceprint recognition model to obtain second voiceprint features output by the teacher network; and determining target voiceprint features corresponding to the target audio data based on the first voiceprint features and the second voiceprint features. By utilizing the student network and the teacher network, the target voiceprint features corresponding to the target audio data are acquired based on both the local audio features and the global audio features corresponding to the target audio data, so that the accuracy of voiceprint recognition is improved.

Description

Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and voice, and particularly relates to a voiceprint recognition method, a voiceprint recognition device, electronic equipment and a storage medium.
Background
Voiceprints, like fingerprints, are information specific to a person: different people speak differently, and no two people's voiceprints are identical, so the identity of a speaker can be determined through voiceprint recognition. In practice, voiceprint recognition is often disturbed by multiple factors such as the speaker and the environment, which affects its accuracy to a certain extent. Therefore, how to improve the accuracy of voiceprint recognition has become an important research direction.
Disclosure of Invention
The disclosure provides a voiceprint recognition method, a voiceprint recognition device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a voiceprint recognition method, the method including: acquiring target audio data to be identified, and acquiring corresponding local audio features and global audio features based on the target audio data; inputting the local audio features into a student network of a voiceprint recognition model to obtain first voiceprint features output by the student network; inputting the global audio features into a teacher network of the voiceprint recognition model to obtain second voiceprint features output by the teacher network; and determining a target voiceprint feature corresponding to the target audio data based on the first voiceprint feature and the second voiceprint feature.
According to another aspect of the present disclosure, there is provided a model training method for voiceprint recognition, the method comprising: acquiring a training sample set, wherein the training sample set comprises a plurality of sample audio data; based on the sample audio data, obtaining corresponding sample local audio features and sample global audio features; and taking the sample local audio features as training samples of the student network in the voiceprint recognition model, taking the sample global audio features as training samples of the teacher network in the voiceprint recognition model, and training the teacher network and the student network in the voiceprint recognition model to obtain a trained voiceprint recognition model.
According to another aspect of the present disclosure, there is provided a voiceprint recognition apparatus, the apparatus comprising: the first acquisition module is used for acquiring target audio data to be identified and acquiring corresponding local audio features and global audio features based on the target audio data; the first processing module is used for inputting the local audio features into a student network of the voiceprint recognition model so as to obtain first voiceprint features output by the student network; the second processing module is used for inputting the global audio feature into a teacher network of the voiceprint recognition model to obtain a second voiceprint feature output by the teacher network; and the first determining module is used for determining target voiceprint features corresponding to the target audio data based on the first voiceprint features and the second voiceprint features.
According to another aspect of the present disclosure, there is provided a model training apparatus for voiceprint recognition, the apparatus comprising: the second acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of sample audio data; the third acquisition module is used for acquiring corresponding sample local audio characteristics and sample global audio characteristics based on the sample audio data; the training module is used for taking the sample local audio features as training samples of the student networks in the voiceprint recognition model, taking the sample global audio features as training samples of the teacher networks in the voiceprint recognition model, and training the teacher networks and the student networks in the voiceprint recognition model to obtain a trained voiceprint recognition model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the voiceprint recognition method of the present disclosure or to perform the model training method for voiceprint recognition of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the voiceprint recognition method disclosed by the embodiments of the present disclosure or to perform the model training method for voiceprint recognition disclosed by the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of voiceprint recognition of the present disclosure, or the steps of the model training method for voiceprint recognition of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a voiceprint recognition method according to a first embodiment of the present disclosure;
FIG. 2 is a flow chart of a voiceprint recognition method according to a second embodiment of the present disclosure;
FIG. 3 is a flow diagram of a model training method for voiceprint recognition according to a third embodiment of the present disclosure;
FIG. 4 is a flow diagram of a model training method for voiceprint recognition according to a fourth embodiment of the present disclosure;
FIG. 5 is a framework diagram of a model training method for voiceprint recognition according to a fourth embodiment of the present disclosure;
fig. 6 is a schematic structural view of a voiceprint recognition device according to a fifth embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a model training apparatus for voiceprint recognition according to a sixth embodiment of the present disclosure;
Fig. 8 is a block diagram of an electronic device used to implement a voiceprint recognition method or a model training method for voiceprint recognition in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment of the disclosure provides a voiceprint recognition method capable of improving the accuracy of voiceprint recognition and a model training method for voiceprint recognition, wherein the voiceprint recognition method comprises the following steps: acquiring target audio data to be identified, and acquiring corresponding local audio features and global audio features based on the target audio data; inputting the local audio features into a student network of the voiceprint recognition model to obtain first voiceprint features output by the student network; inputting the global audio features into a teacher network of the voiceprint recognition model to obtain second voiceprint features output by the teacher network; and determining target voiceprint features corresponding to the target audio data based on the first voiceprint features and the second voiceprint features. Therefore, by utilizing the student network and the teacher network, the target voiceprint features corresponding to the target audio data are acquired based on the features of the local audio features and the global audio features corresponding to the target audio data, so that the voiceprint recognition accuracy is improved.
The disclosure provides a voiceprint recognition method, a model training method and device for voiceprint recognition, electronic equipment, a non-transitory computer readable storage medium and a computer program product, and relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and voice.
Artificial intelligence is a discipline that studies how to enable computers to simulate certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), and covers both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and other major directions.
Voiceprint recognition methods, model training methods for voiceprint recognition, apparatus, electronic devices, non-transitory computer-readable storage media, and computer program products of embodiments of the present disclosure are described below with reference to the accompanying drawings.
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the user involved all comply with the relevant laws and regulations and do not violate public order and good morals.
Fig. 1 is a flow chart of a voiceprint recognition method according to a first embodiment of the present disclosure. It should be noted that, in the voiceprint recognition method of the present embodiment, the execution body is a voiceprint recognition device, and the voiceprint recognition device may be implemented by software and/or hardware, and the voiceprint recognition device may be configured in an electronic device, where the electronic device may include, but is not limited to, a terminal device, a server, and the embodiment does not specifically limit the electronic device.
As shown in fig. 1, the voiceprint recognition method may include:
step 101, obtaining target audio data to be identified, and obtaining corresponding local audio features and global audio features based on the target audio data.
The target audio data to be identified is a continuous piece of speech that requires voiceprint recognition, such as a sentence or a passage of speech.
In an example embodiment, the target audio data to be identified may be obtained in various public and legal manners, for example, the target audio data to be identified may be obtained from a public data set, or the target audio data to be identified may also be obtained from a user after authorization of the user, which is not limited in this disclosure.
For example, after user authorization, the voiceprint recognition device may obtain target audio data to be voiceprint-recognized that is collected in real time by an audio collection device such as a recorder or a microphone; for instance, when a user wakes up an electronic device such as a mobile phone or a tablet, the audio data collected by the collection device in that electronic device is obtained.
Or the voiceprint recognition device can download target audio data which needs to be voiceprint recognized from the internet after the user authorization, or download target audio data which needs to be voiceprint recognized from social software, and the like. The present disclosure is not limited in this regard.
The local audio features are features only including local information in the target audio data, for example, the local audio features corresponding to the target audio data may include features of first sub-audio data of each frame after the target audio data is divided into multiple frames of first sub-audio data, and features of first sub-audio data of each frame are obtained only according to the first sub-audio data of the frame or first sub-audio data of several frames before and after the first sub-audio data of the frame.
The global audio feature is a feature containing global information of the target audio data, for example, the global audio feature corresponding to the target audio data may include a feature of the first sub-audio data of each frame after the target audio data is divided into multiple frames of the first sub-audio data, and the feature of the first sub-audio data of each frame is obtained according to the first sub-audio data of the frame and the whole target audio data.
The local audio feature and the global audio feature are low-level features and can be obtained with traditional digital signal processing techniques, for example, MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction), or Fbank (filter bank) features.
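For illustration only, and not as part of the claimed solution, the following sketch extracts frame-level Fbank features with the librosa library; the disclosure does not prescribe a toolkit, and the sampling rate, window length, frame shift, and filter-bank size below are assumptions:

    import numpy as np
    import librosa

    def extract_fbank(wav_path: str, n_mels: int = 80) -> np.ndarray:
        """Return a (num_frames, n_mels) matrix of log Mel filter-bank features."""
        y, sr = librosa.load(wav_path, sr=16000)                 # assumed 16 kHz mono audio
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                             hop_length=160,     # 25 ms window, 10 ms shift
                                             n_mels=n_mels)
        return np.log(mel + 1e-6).T                              # one feature vector per frame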
Step 102, inputting the local audio features into a student network of the voiceprint recognition model to obtain first voiceprint features output by the student network.
Step 103, inputting the global audio feature into a teacher network of the voiceprint recognition model to obtain a second voiceprint feature output by the teacher network.
The voiceprint recognition model can comprise a teacher network and a student network, and the network structures of the teacher network and the student network can be the same. The network structures of the teacher network and the student network can be constructed as needed, as long as the voiceprint feature extraction function can be realized. For example, the teacher network and the student network may be formed by sequentially connecting the following network layers: a Conv1D (one-dimensional convolution) layer, a ReLU (rectified linear unit) activation function layer, a BN (batch normalization) layer, three SE (Squeeze-and-Excitation)-Res2Block layers, a Conv1D layer, a ReLU layer, an attentive statistics pooling layer, a BN layer, an FC (fully connected) layer, and a BN layer.
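A minimal PyTorch sketch of this layer sequence is given below, purely for illustration: the SE-Res2Block and the attentive statistics pooling layer are simplified stand-ins, and the channel sizes, kernel sizes, and embedding dimension are assumptions rather than values fixed by the disclosure:

    import torch
    import torch.nn as nn

    class SimpleSEBlock(nn.Module):
        """Simplified stand-in for an SE-Res2Block: 1-D conv + squeeze-and-excitation gate + residual."""
        def __init__(self, channels: int = 512):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=1),
                nn.ReLU(), nn.BatchNorm1d(channels))
            self.se = nn.Sequential(                      # squeeze-and-excitation channel gate
                nn.AdaptiveAvgPool1d(1),
                nn.Conv1d(channels, channels // 8, 1), nn.ReLU(),
                nn.Conv1d(channels // 8, channels, 1), nn.Sigmoid())

        def forward(self, x):
            y = self.conv(x)
            return x + y * self.se(y)

    class AttentiveStatsPooling(nn.Module):
        """Attention-weighted mean and standard deviation over the time axis."""
        def __init__(self, channels: int = 512):
            super().__init__()
            self.attn = nn.Sequential(
                nn.Conv1d(channels, 128, 1), nn.Tanh(), nn.Conv1d(128, channels, 1))

        def forward(self, x):                              # x: (batch, channels, time)
            w = torch.softmax(self.attn(x), dim=2)
            mu = (x * w).sum(dim=2)
            sigma = ((x ** 2 * w).sum(dim=2) - mu ** 2).clamp(min=1e-6).sqrt()
            return torch.cat([mu, sigma], dim=1)           # (batch, 2 * channels)

    class VoiceprintBackbone(nn.Module):
        """Conv1D -> ReLU -> BN -> 3x SE blocks -> Conv1D -> ReLU -> pooling -> BN -> FC -> BN."""
        def __init__(self, feat_dim: int = 80, channels: int = 512, emb_dim: int = 192):
            super().__init__()
            self.front = nn.Sequential(
                nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2),
                nn.ReLU(), nn.BatchNorm1d(channels),
                SimpleSEBlock(channels), SimpleSEBlock(channels), SimpleSEBlock(channels),
                nn.Conv1d(channels, channels, kernel_size=1), nn.ReLU())
            self.pool = AttentiveStatsPooling(channels)
            self.post = nn.Sequential(
                nn.BatchNorm1d(2 * channels), nn.Linear(2 * channels, emb_dim),
                nn.BatchNorm1d(emb_dim))

        def forward(self, feats):                          # feats: (batch, feat_dim, time)
            return self.post(self.pool(self.front(feats))) # voiceprint embedding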
In an example embodiment, the voiceprint recognition model including the teacher network and the student network is a pre-trained model, and is used for voiceprint recognition on arbitrary audio data to obtain voiceprint features corresponding to the audio data. The training process may refer to the following embodiments, which are not described herein.
The teacher network and the student network can each comprise a voiceprint feature extraction layer and a voiceprint distribution prediction layer that are sequentially connected. The voiceprint feature extraction layer is used to extract voiceprint features from the audio data; a voiceprint feature is a high-level attribute representation characterizing the speaker, which can be used to distinguish speaker information such as gender, accent, physiological structure, and pronunciation habits, and can be extracted from the low-level local audio features or global audio features. The voiceprint distribution prediction layer is used to predict the voiceprint distribution probability corresponding to the voiceprint features, where the voiceprint distribution probability represents the posterior probability of the voiceprint features over a plurality of speakers.
When training the teacher network and the student network in the voiceprint recognition model, the sample local audio features corresponding to the sample audio data in the sample training set can be used as training samples of the student network, and the sample global audio features corresponding to the sample audio data in the sample training set can be used as training samples of the teacher network. After training, the voiceprint feature extraction layer in the teacher network can extract voiceprint features from the sample global audio features, and the voiceprint distribution prediction layer in the teacher network can predict the corresponding voiceprint distribution probability based on those voiceprint features; likewise, the student network can extract voiceprint features from the sample local audio features and predict the corresponding voiceprint distribution probability based on them.
In an example embodiment, the local audio features corresponding to the target audio data may be input into the student network to obtain the first voiceprint features output by the voiceprint feature extraction layer in the student network, and the global audio features corresponding to the target audio data may be input into the teacher network to obtain the second voiceprint features output by the voiceprint feature extraction layer in the teacher network.
Step 104, determining a target voiceprint feature corresponding to the target audio data based on the first voiceprint feature and the second voiceprint feature.
In an example embodiment, the first voiceprint feature output by the student network and the second voiceprint feature output by the teacher network may be represented in a vector form, and then, after the first voiceprint feature and the second voiceprint feature are acquired, an average value of the first voiceprint feature and the second voiceprint feature may be taken as the target voiceprint feature.
Or the first voiceprint feature and the second voiceprint feature can be respectively assigned corresponding weights, and then the weighted sum of the first voiceprint feature and the second voiceprint feature is taken as the target voiceprint feature. I.e. the sum of the product of the first voiceprint feature and its corresponding weight and the product of the second voiceprint feature and its corresponding weight is used as the target voiceprint feature.
In an example embodiment, in order to realize identity verification of a speaker, a voiceprint feature corresponding to pre-stored audio data may be further obtained, for example, a voiceprint feature pre-stored when a user performs account registration, after a target voiceprint feature corresponding to target audio data of the speaker is obtained, the target voiceprint feature may be compared with a voiceprint feature corresponding to pre-stored audio data, and whether the speaker corresponding to the target audio data and the registered user are the same person or not is determined according to a similarity between the two voiceprint features.
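As a hedged illustration of the fusion and verification steps above, the sketch below takes a weighted sum of the two voiceprint features and then compares the result with an enrolled voiceprint using cosine similarity; the weights, the choice of cosine similarity, and the acceptance threshold are assumptions, since the disclosure only requires "a similarity between the two voiceprint features":

    import numpy as np

    def fuse_voiceprints(first: np.ndarray, second: np.ndarray,
                         w_first: float = 0.5, w_second: float = 0.5) -> np.ndarray:
        """Weighted sum of the student (first) and teacher (second) voiceprint features."""
        return w_first * first + w_second * second

    def is_same_speaker(target: np.ndarray, enrolled: np.ndarray,
                        threshold: float = 0.7) -> bool:
        """Accept if the cosine similarity between the two voiceprints exceeds a threshold."""
        cos = float(np.dot(target, enrolled) /
                    (np.linalg.norm(target) * np.linalg.norm(enrolled) + 1e-12))
        return cos >= threshold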
In summary, the voiceprint recognition method of the exemplary embodiment obtains target audio data to be recognized, obtains corresponding local audio features and global audio features based on the target audio data, inputs the local audio features into a student network of a voiceprint recognition model to obtain first voiceprint features output by the student network, inputs the global audio features into a teacher network of the voiceprint recognition model to obtain second voiceprint features output by the teacher network, and determines target voiceprint features corresponding to the target audio data based on the first voiceprint features and the second voiceprint features. Therefore, by utilizing the student network and the teacher network, the target voiceprint features corresponding to the target audio data are acquired based on the features of the local audio features and the global audio features corresponding to the target audio data, so that the voiceprint recognition accuracy is improved.
In the following, with reference to fig. 2, a process of obtaining local audio features and global audio features corresponding to target audio data in the voiceprint recognition method provided by the present disclosure is further described.
Fig. 2 is a flow chart of a voiceprint recognition method according to a second embodiment of the present disclosure. As shown in fig. 2, the voiceprint recognition method may include the steps of:
in step 201, target audio data to be identified is acquired.
The specific implementation process and principle of step 201 may refer to the description of the foregoing embodiments, which is not repeated herein.
Step 202, framing the target audio data to obtain multi-frame first sub-audio data.
In an example embodiment, the target audio data may be framed, i.e., split into fixed-length segments, to obtain multi-frame first sub-audio data. Since the Fourier transform is used to convert the audio data from a time-domain signal to a frequency-domain signal during audio feature extraction, and the Fourier transform is suited to stationary signals, 20 to 40 milliseconds (ms) of audio is generally taken as one frame to ensure the short-time stationarity of the audio data; for example, the length of each frame may be 25 ms, which is not limited by the disclosure.
When framing, adjacent frames overlap, i.e., part of the first sub-audio data is shared between consecutive frames, so that data at the window boundaries of the target audio data is not omitted; the offset between consecutive frames is called the frame shift. Half the frame length is typically taken as the frame shift; for example, with a frame length of 25 ms the frame shift may be 10 ms. The present disclosure is not limited in this regard.
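Purely as an illustration of this step, the function below splits a waveform into overlapping frames using the 25 ms frame length and 10 ms frame shift from the example above; the sampling rate is an assumption:

    import numpy as np

    def frame_signal(signal: np.ndarray, sr: int = 16000,
                     frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
        """Split a 1-D waveform into overlapping frames of shape (num_frames, frame_len).

        Assumes the signal is at least one frame long."""
        frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
        shift = int(sr * shift_ms / 1000)       # 160 samples at 16 kHz
        num_frames = 1 + (len(signal) - frame_len) // shift
        return np.stack([signal[i * shift: i * shift + frame_len]
                         for i in range(num_frames)])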
Step 203, extracting the features of the first sub-audio data of each frame to obtain feature vectors corresponding to the first sub-audio data of each frame.
In an exemplary embodiment, the feature extraction may be performed on the first sub-audio data of each frame by using a conventional digital signal processing technology, for example, MFCC, PLP or Fbank, to obtain a feature vector corresponding to the first sub-audio data of the corresponding frame. The present disclosure is not limited in this regard.
Step 204, for the first sub-audio data of each frame, based on the corresponding feature vector and the average value of the feature vectors corresponding to the first sub-audio data of at least one frame before and after, obtaining the local audio feature corresponding to the first sub-audio data.
In an example embodiment, the following manner may be adopted to average the feature vectors corresponding to the first sub-audio data of each frame to obtain the local audio features corresponding to the first sub-audio data of each frame: and for each frame of first sub-audio data, acquiring an average value corresponding to the frame of first sub-audio data, wherein the average value is an average value of the feature vectors corresponding to the n frames of first sub-audio data before the frame of first sub-audio data and the feature vectors corresponding to the n frames of first sub-audio data after the frame of first sub-audio data, and subtracting the corresponding average value from the feature vectors corresponding to the frame of first sub-audio data to obtain local audio features corresponding to the frame of first sub-audio data. Wherein n is an integer greater than or equal to 1.
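Purely as a sketch of this step, the function below subtracts from each frame's feature vector the mean of the feature vectors within a window of n frames around it (a sliding-window mean normalization); the default window size, the handling of utterance boundaries, and the inclusion of the center frame in the window are assumptions:

    import numpy as np

    def local_normalize(feats: np.ndarray, n: int = 150) -> np.ndarray:
        """Subtract from each frame's feature vector the mean of the feature vectors
        within n frames before and after it (sliding-window mean normalization)."""
        out = np.empty_like(feats)
        for t in range(len(feats)):
            lo, hi = max(0, t - n), min(len(feats), t + n + 1)
            out[t] = feats[t] - feats[lo:hi].mean(axis=0)
        return out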
Step 205, for the first sub-audio data of each frame, acquiring a global audio feature corresponding to the first sub-audio data based on the corresponding feature vector and an average value of the feature vectors corresponding to the first sub-audio data of each frame.
The steps 204 and 205 may be executed simultaneously or sequentially, and the execution timing of the steps 204 and 205 is not limited in this embodiment.
In an example embodiment, the following manner may be adopted to average the feature vectors corresponding to the first sub-audio data of each frame to obtain the global audio feature corresponding to the first sub-audio data of each frame: and obtaining an average value of the feature vectors corresponding to the first sub-audio data of all frames after the target audio data is framed, and subtracting the average value from the feature vector corresponding to the first sub-audio data of each frame to obtain the global audio feature corresponding to the first sub-audio data of the frame.
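A corresponding sketch for the global audio feature simply subtracts the mean feature vector computed over all frames of the utterance from each frame's feature vector:

    import numpy as np

    def global_normalize(feats: np.ndarray) -> np.ndarray:
        """Subtract the mean feature vector over all frames of the utterance."""
        return feats - feats.mean(axis=0, keepdims=True)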
Through the process, the local audio features and the global audio features corresponding to the first sub-audio data of each frame in the target audio data are accurately acquired, and a foundation is laid for improving the accuracy of voiceprint recognition.
Step 206, inputting the local audio features into the student network of the voiceprint recognition model to obtain first voiceprint features output by the student network.
It should be noted that, the local audio features corresponding to the target audio data include the local audio features corresponding to the first sub-audio data of each frame, and the local audio features corresponding to the first sub-audio data of each frame may be input into the student network of the voiceprint recognition model, so as to obtain the first voiceprint features corresponding to the first sub-audio data of each frame output by the student network.
Step 207, inputting the global audio feature into the teacher network of the voiceprint recognition model to obtain a second voiceprint feature output by the teacher network.
The steps 206 and 207 may be executed simultaneously or sequentially, and the execution timing of the steps 206 and 207 is not limited in this embodiment.
It should be noted that, the global audio features corresponding to the target audio data, including the global audio features corresponding to the first sub-audio data of each frame, may input the global audio features corresponding to the first sub-audio data of each frame into the teacher network of the voiceprint recognition model, so as to obtain the second voiceprint features corresponding to the first sub-audio data of each frame output by the teacher network.
Step 208, determining a target voiceprint feature corresponding to the target audio data based on the first voiceprint feature and the second voiceprint feature.
The specific implementation and principles of steps 206-208 may refer to the description of the foregoing embodiments, and are not repeated herein.
In an example embodiment, after the first voiceprint feature corresponding to the first sub-audio data of each frame is obtained, for example, an average value of the first voiceprint features corresponding to the first sub-audio data of each frame may be used as the first voiceprint feature used when the target voiceprint feature is finally determined; alternatively, corresponding weights may be assigned to the first sub-audio data of each frame, and a weighted sum of the first voiceprint features corresponding to the first sub-audio data of each frame may be used as the first voiceprint feature used when the target voiceprint feature is finally determined. Similarly, after the second voiceprint feature corresponding to the first sub-audio data of each frame is obtained, an average value or a weighted sum of the second voiceprint features corresponding to the first sub-audio data of each frame may be used as the second voiceprint feature used when the target voiceprint feature is finally determined.
In summary, the voiceprint recognition method of the exemplary embodiment obtains target audio data to be recognized; frames the target audio data to obtain multi-frame first sub-audio data; performs feature extraction on the first sub-audio data of each frame to obtain feature vectors corresponding to the first sub-audio data of each frame; for the first sub-audio data of each frame, obtains the corresponding local audio feature based on its feature vector and the average of the feature vectors corresponding to at least one frame of first sub-audio data before and after it; for the first sub-audio data of each frame, obtains the corresponding global audio feature based on its feature vector and the average of the feature vectors corresponding to the first sub-audio data of all frames; inputs the local audio features into the student network of the voiceprint recognition model to obtain the first voiceprint features output by the student network; inputs the global audio features into the teacher network of the voiceprint recognition model to obtain the second voiceprint features output by the teacher network; and determines the target voiceprint feature corresponding to the target audio data based on the first voiceprint features and the second voiceprint features. In this way, the student network and the teacher network are used to obtain the target voiceprint feature corresponding to the target audio data from both the local and the global audio features, which improves the accuracy of voiceprint recognition.
In an example embodiment, a model training method for voiceprint recognition is also provided. Fig. 3 is a flow diagram of a model training method for voiceprint recognition according to a third embodiment of the present disclosure.
It should be noted that, in the model training method for voiceprint recognition provided in the embodiment of the present disclosure, the execution body is a model training device for voiceprint recognition, which is hereinafter referred to as a model training device. The model training apparatus may be implemented by software and/or hardware, and the model training apparatus may be configured on an electronic device, where the electronic device may include, but is not limited to, a terminal device, a server, and the like, and the embodiment is not specifically limited to the electronic device.
As shown in fig. 3, the model training method for voiceprint recognition can include the steps of:
Step 301, a training sample set is obtained, the training sample set comprising a plurality of sample audio data.
The training sample set may be audio data obtained from open-source data sets such as VoxCeleb (a large-scale audio-visual data set of human voice) and CN-Celeb (a speaker recognition voiceprint voice database), and each sample audio data is a continuous piece of speech, such as a sentence or a passage of speech.
Step 302, based on the sample audio data, obtaining a corresponding sample local audio feature and a sample global audio feature.
The method for obtaining the sample local audio features and the sample global audio features corresponding to the sample audio data is the same as the method for obtaining the local audio features and the global audio features corresponding to the target audio data; reference may be made to the description of the above embodiments, which is not repeated herein.
Step 303, taking the local audio features of the sample as training samples of the student network in the voiceprint recognition model, taking the global audio features of the sample as training samples of the teacher network in the voiceprint recognition model, and training the teacher network and the student network in the voiceprint recognition model to obtain a trained voiceprint recognition model.
The voiceprint recognition model can comprise a teacher network and a student network, and the network structures of the teacher network and the student network can be the same. The network structures of the teacher network and the student network can be constructed according to the needs, and the voiceprint feature extraction function can be realized.
In an example embodiment, after the sample local audio feature and the sample global audio feature corresponding to the sample audio data are obtained, the sample local audio feature corresponding to the sample audio data in the sample training set may be used as a training sample of the student network, the global audio feature corresponding to the sample audio data in the sample training set may be used as a training sample of the teacher network, and self-supervision training may be performed on the teacher network and the student network in the voiceprint recognition model.
It should be noted that the teacher network and the student network in the trained voiceprint recognition model may be used to execute the voiceprint recognition method described above. The process of executing the foregoing voiceprint recognition method by using the teacher network and the student network in the trained voiceprint recognition model may refer to the description of the embodiment of the voiceprint recognition method, which is not repeated herein.
In summary, the model training method for voiceprint recognition provided by the example embodiment obtains a training sample set including a plurality of sample audio data; obtains, based on the sample audio data, the corresponding sample local audio features and sample global audio features; and uses the sample local audio features as training samples of the student network in the voiceprint recognition model and the sample global audio features as training samples of the teacher network in the voiceprint recognition model, training the teacher network and the student network to obtain a trained voiceprint recognition model. Training of the teacher network and the student network based on the sample local audio features and the sample global audio features corresponding to the sample audio data in the training sample set is thereby achieved, yielding a teacher network and a student network for voiceprint recognition, so that the target voiceprint feature corresponding to target audio data can later be obtained by utilizing the student network and the teacher network based on the local audio features and the global audio features corresponding to the target audio data. In addition, the sample audio data in the training sample set do not need to be labeled, which reduces the labeling cost.
Through the analysis, the teacher network and the student network in the voiceprint recognition model can be subjected to self-supervision training, and in the model training method for voiceprint recognition provided by the disclosure, the process of self-supervision training on the teacher network and the student network in the voiceprint recognition model is further described below with reference to fig. 4. Fig. 4 is a flow diagram of a model training method for voiceprint recognition according to a fourth embodiment of the present disclosure.
As shown in fig. 4, the model training method for voiceprint recognition can include the steps of:
step 401, a training sample set is obtained, the training sample set comprising a plurality of sample audio data.
Step 402, based on the sample audio data, obtaining a corresponding sample local audio feature and a sample global audio feature.
In an example embodiment, the corresponding sample local audio features may be acquired based on the sample audio data in the following manner: framing the sample audio data to obtain multi-frame fourth sub-audio data; extracting the characteristics of the fourth sub-audio data of each frame to obtain a characteristic vector corresponding to the fourth sub-audio data of each frame; and for the fourth sub-audio data of each frame, acquiring the sample local audio characteristics corresponding to the fourth sub-audio data based on the corresponding characteristic vectors and the average value of the characteristic vectors corresponding to the fourth sub-audio data of at least one frame before and after the fourth sub-audio data.
The process of acquiring the local audio features corresponding to the target audio data in the above embodiment is also applicable to the process of acquiring the sample local audio features corresponding to the sample audio data in this example embodiment, and will not be described herein.
In an example embodiment, the corresponding sample local audio features may also be obtained based on the sample audio data in the following manner: performing enhancement processing on the sample audio data to obtain enhanced audio data corresponding to the sample audio data; framing the sample audio data and the enhanced audio data to obtain multi-frame second sub-audio data; extracting the characteristics of the second sub-audio data of each frame to obtain a characteristic vector corresponding to the second sub-audio data of each frame; and for the second sub-audio data of each frame, acquiring the sample local audio characteristics corresponding to the second sub-audio data based on the corresponding characteristic vectors and the average value of the characteristic vectors corresponding to the second sub-audio data of at least one frame before and after the corresponding characteristic vectors. And the sample local audio features corresponding to the sample audio data comprise sample local audio features corresponding to the second sub-audio data of each frame.
The manner of enhancing the sample audio data may include: random erasure, mixing in ambient noise, time scaling by different factors (i.e., stretching or compressing the time axis), clipping and slicing, and the like. It should be noted that sample audio data obtained from some open-source data sets may already contain many types of noise, such as street, exhibition hall, kitchen, playground, in-car, airport, and room reverberation, and the operation of mixing in environmental noise may then be omitted.
Random erasure, for example, may be performed in the following manner: determining the number of random numbers to be generated according to the duration of the sample audio data, and generating a random erasing parameter set based on the number of the random numbers to be generated, wherein the random erasing parameter set comprises a plurality of random numbers, duration parameters corresponding to the random numbers and frequency point parameters corresponding to the random numbers, and further processing the sample audio data based on the random erasing parameter set to realize random erasing of the sample audio data.
The number of the corresponding random numbers to be generated can be preset for each duration interval, and then the number of the random numbers to be generated is determined according to the duration interval of the duration of the sample audio data. For example, the number of random numbers to be generated corresponding to a duration interval of 4000ms-5000ms is preset to be 20, and if the duration of the sample audio data is 4500ms, the number of random numbers to be generated corresponding to the sample audio data is 20.
Or the number range of the corresponding random numbers to be generated can be set for each duration interval in advance, and then the number of the random numbers to be generated is randomly determined according to the duration interval of the duration of the sample audio data. For example, the number of random numbers to be generated corresponding to the duration interval of 4000ms-5000ms is preset to be 10-20, and if the duration of the sample audio data is 4800ms, the number of random numbers to be generated may be 12 or 15, which is not limited in the disclosure.
The duration parameter may be a time length for erasing the sample audio data in a time domain. The frequency point parameter may be a plurality of frequency points around each random number. The random numbers contained in the random erasure parameter set, the duration parameters and the frequency point parameters corresponding to each random number can be randomly generated, and accordingly, the duration parameters corresponding to each random number can be the same or different, and the frequency point parameters corresponding to each random number can be the same or different. The present disclosure is not limited in this regard.
When the sample audio data is processed based on the random erasure parameter set, the time domain of the sample audio data can be erased based on a plurality of random numbers contained in the random erasure parameter set and a time length parameter corresponding to each random number, and the frequency domain of the sample audio data can be erased based on the plurality of random numbers and a frequency point parameter corresponding to each random number, so that the random erasure of the sample audio data is realized.
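A hedged sketch of the random-erasure augmentation described above follows: a number of random positions is chosen according to the duration of the audio, short spans are zeroed in the time domain, and a few frequency bins are zeroed in the spectrogram. All parameter ranges (erasure density, span lengths, number of bins) are assumptions; the disclosure fixes only the overall idea:

    from typing import Optional
    import numpy as np

    def random_erase(wav: np.ndarray, spec: np.ndarray, sr: int = 16000,
                     rng: Optional[np.random.Generator] = None):
        """Zero short random spans in the time domain and random frequency bins in the
        spectrogram; the number of erasures grows with the duration of the audio."""
        rng = rng or np.random.default_rng()
        duration_ms = 1000.0 * len(wav) / sr
        num_points = max(1, int(duration_ms // 250))                    # assumed: ~1 erasure per 250 ms
        wav, spec = wav.copy(), spec.copy()
        for _ in range(num_points):
            span = int(rng.integers(int(0.005 * sr), int(0.02 * sr)))   # 5-20 ms time-domain erasure
            start = int(rng.integers(0, max(1, len(wav) - span)))
            wav[start:start + span] = 0.0
            n_bins = int(rng.integers(1, 4))                            # 1-3 frequency bins erased
            bins = rng.integers(0, spec.shape[0], size=n_bins)
            spec[bins, :] = 0.0
        return wav, spec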
After the enhanced audio data corresponding to the sample audio data is obtained, the sample audio data and the corresponding enhanced audio data can be framed to obtain multi-frame second sub-audio data, then feature extraction is performed on the second sub-audio data of each frame, and further average normalization is performed on feature vectors corresponding to the second sub-audio data of each frame to obtain local audio features of the sample.
In this way, the sample local audio features corresponding to the second sub-audio data of each frame in the sample audio data and in the corresponding enhanced audio data are accurately acquired, which lays a foundation for improving the accuracy of voiceprint recognition through training on the sample local audio features. Performing data enhancement on the sample audio data is equivalent to adding noise interference, so training the teacher network and the student network with sample local audio features acquired from both the sample audio data and the corresponding enhanced audio data improves the anti-interference capability of the two networks. In addition, since the random erasure parameter set differs each time data enhancement (e.g., random erasure) is performed, the resulting data differ each time, which enhances the diversity of the audio data.
In an example embodiment, the corresponding sample global audio feature may be obtained based on the sample audio data in the following manner: framing the sample audio data to obtain multi-frame third sub-audio data; extracting the characteristics of the third sub-audio data of each frame to obtain a characteristic vector corresponding to the third sub-audio data of each frame; and for the third sub-audio data of each frame, acquiring the sample global audio feature corresponding to the third sub-audio data based on the corresponding feature vector and the average value of the feature vectors corresponding to the third sub-audio data of each frame.
The process of acquiring the global audio feature corresponding to the target audio data in the above embodiment is also applicable to the process of acquiring the sample global audio feature corresponding to the sample audio data in the example embodiment, which is not described herein again.
Therefore, the sample global audio characteristics corresponding to the third sub audio data of each frame in the sample audio data are accurately obtained, and a foundation is laid for improving the accuracy of voiceprint recognition based on sample global audio characteristic training.
It should be noted that, the sample local audio features corresponding to the sample audio data include sample local audio features corresponding to the second sub audio data of each frame, and the sample global audio features corresponding to the sample audio data include sample global audio features corresponding to the third sub audio data of each frame, when the duration of the second sub audio data and the duration of the third sub audio data are equal in the actual training process, in order to ensure that the number of sample local audio features corresponding to the second sub audio data and the number of sample global audio features corresponding to the third sub audio data are equal, when the sample audio data is subjected to framing to obtain the third sub audio data, the sample audio data may be multiplexed so that the number of the second sub audio data and the number of the third sub audio data are equal.
Step 403, inputting the sample local audio feature into the student network branch to obtain a first voiceprint distribution probability corresponding to the sample local audio feature, and inputting the sample global audio feature into the teacher network branch to obtain a second voiceprint distribution probability corresponding to the sample global audio feature.
The first voiceprint distribution probability represents the posterior probability of voiceprint features corresponding to the local audio features of the sample on a plurality of speakers; and the second voiceprint distribution probability represents the posterior probability of voiceprint features corresponding to the global audio features of the sample on a plurality of speakers.
Referring to fig. 5, the voiceprint recognition model includes a teacher network branch and a student network branch; the teacher network branch includes the teacher network, the student network branch includes the student network, and the teacher network and the student network have the same network structure. In fig. 5, P_s represents the first voiceprint distribution probability and P_t represents the second voiceprint distribution probability.
In an example embodiment, a sample local audio feature corresponding to second sub-audio data of each frame may be input to a student network branch for forward computation to obtain a first voiceprint distribution probability corresponding to each sample local audio feature output by the student network branch, and a sample global audio feature corresponding to third sub-audio data of each frame may be input to a teacher network branch for forward computation to obtain a second voiceprint distribution probability corresponding to each sample global audio feature output by the teacher network branch.
Step 404, adjusting model parameters of the student network based on the loss between the first voiceprint distribution probability and the second voiceprint distribution probability.
And step 405, adjusting the model parameters of the teacher network based on the adjusted model parameters of the student network to obtain a trained voiceprint recognition model based on the adjusted student network and the teacher network.
In an example embodiment, the model parameters of the student network may be adjusted based on the loss between the first voiceprint distribution probability corresponding to each local audio feature of each sample and the second voiceprint distribution probability corresponding to each global audio feature of each sample, and then the model parameters of the teacher network may be adjusted based on the adjusted model parameters of the student network.
In an example embodiment, in order to make the voiceprint distribution probabilities output by the teacher network branch and the student network branch as consistent as possible, cross entropy may be used to calculate the loss between the first voiceprint distribution probability and the second voiceprint distribution probability, and the model parameters of the student network are then updated by back-propagation according to the stochastic gradient descent criterion; after the model parameters of the student network are updated, the model parameters of the teacher network may be updated through an EMA (exponential moving average) strategy.
That is, the model parameters of the student network may be adjusted in a manner shown in the following formula (1):
min_{θ_s} H(P_t(x'), P_s(x))    (1)

wherein θ_s represents the model parameters of the student network; P_t(x') represents the second voiceprint distribution probability corresponding to the sample global audio feature x' output by the teacher network branch; P_s(x) represents the first voiceprint distribution probability corresponding to the sample local audio feature x output by the student network branch; and the loss function H(P_t(x'), P_s(x)) denotes -P_t(x')·log P_s(x). The meaning of formula (1) is that the value of the loss function H(P_t(x'), P_s(x)) is minimized by adjusting the model parameters θ_s of the student network.
The model parameters of the teacher network may be adjusted based on the adjusted model parameters of the student network in a manner shown in the following formula (2):
θ_t ← λ·θ_t + (1 − λ)·θ_s    (2)

wherein λ represents a hyperparameter whose value can be chosen as needed, for example between 0.996 and 1; θ_t represents the model parameters of the teacher network; the meaning of formula (2) is that λ·θ_t + (1 − λ)·θ_s is taken as the new model parameters of the teacher network.
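For illustration only, the following PyTorch sketch shows how a single training step under formulas (1) and (2) might look; it assumes that both network branches output normalized voiceprint distribution probabilities (e.g., after the normalization layers described further below), and the optimizer choice and the default value of λ are assumptions:

    import torch

    def train_step(student, teacher, optimizer, local_feats, global_feats, lam: float = 0.996):
        # forward: P_s from local features (student branch), P_t from global features (teacher branch)
        p_s = student(local_feats)
        with torch.no_grad():
            p_t = teacher(global_feats)
        # formula (1): cross entropy H(P_t, P_s) = -sum P_t * log P_s, minimized w.r.t. the student
        loss = -(p_t * torch.log(p_s + 1e-8)).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # formula (2): theta_t <- lambda * theta_t + (1 - lambda) * theta_s (EMA update of the teacher)
        with torch.no_grad():
            for t_param, s_param in zip(teacher.parameters(), student.parameters()):
                t_param.mul_(lam).add_(s_param, alpha=1.0 - lam)
        return loss.item()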
The above-mentioned manner of adjusting the model parameters of the student network and the teacher network is merely illustrative, and should not be construed as limiting the present solution.
According to the above training manner for the teacher network and the student network, the model parameters of the student network are adjusted based on the loss between the first voiceprint distribution probability and the second voiceprint distribution probability, and the model parameters of the teacher network are adjusted based on the adjusted model parameters of the student network. The two different learning strategies ensure the effectiveness of training and learning: part of the knowledge learned by the student network is distilled into the teacher network through this training strategy, and the two networks assist, verify, and correct each other so that the voiceprint distribution probabilities output by the two network branches become as close as possible. Self-supervised training of the teacher network and the student network is thereby realized, and no label needs to be annotated for each sample audio data in the training sample set, which reduces the labeling cost. In addition, the trained voiceprint recognition model can use a teacher network and a student network with the same network structure to accurately obtain, from two different kinds of features corresponding to the target audio data, namely the local audio features and the global audio features, the intrinsic and invariant high-level attribute representation corresponding to the target audio data, namely the target voiceprint feature, thereby improving the accuracy of voiceprint recognition.
In an example embodiment, referring to fig. 5, the student network branch may further include a first normalization layer connected to the student network. Accordingly, inputting the sample local audio feature into the student network branch in step 403 to obtain the first voiceprint distribution probability corresponding to the sample local audio feature may include: inputting the sample local audio feature into the student network to obtain the first initial voiceprint distribution probability, corresponding to the sample local audio feature, output by the student network; and inputting the first initial voiceprint distribution probability into the first normalization layer to obtain the first voiceprint distribution probability, corresponding to the sample local audio feature, output by the first normalization layer.
The first initial voiceprint distribution probability represents the posterior probability of voiceprint features corresponding to sample local audio features predicted by the student network on a plurality of speakers.
In an example embodiment, in order to make the output smoother, a constant factor τs may be added in the calculation of the first normalization layer. Correspondingly, the first normalization layer may process the first initial voiceprint distribution probability, corresponding to the sample local audio feature, output by the student network in a manner shown in the following formula (3), so as to obtain the first voiceprint distribution probability corresponding to the sample local audio feature:
Ps(x)(i) = exp(gθs(x)(i)/τs) / Σk exp(gθs(x)(k)/τs), k = 1, ..., K (3)
wherein τs may take a value as required, for example, a value between 0.04 and 0.07; Ps(x)(i) represents the posterior probability, on the i-th speaker, of the voiceprint feature corresponding to the sample local audio feature; gθs represents the student network; gθs(x)(i) represents the posterior probability on the i-th speaker in the first initial voiceprint distribution probability, corresponding to the sample local audio feature, output by the student network; gθs(x)(k) represents the posterior probability on the k-th speaker in that first initial voiceprint distribution probability; and K represents the total number of speakers.
Through the process, the first normalization layer is utilized to normalize the first initial voiceprint distribution probability corresponding to the sample local audio features output by the student network, and the first voiceprint distribution probability corresponding to the sample local audio features is obtained.
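As an illustration only, the first normalization layer of formula (3) can be sketched as a softmax with the sharpening constant τs; the function name and the choice τs = 0.05 (within the 0.04-0.07 range mentioned above) are assumptions made for the sketch:

import numpy as np

def first_normalization_layer(initial_distribution, tau_s=0.05):
    """Sharpened softmax over K speakers (formula (3)): maps the first initial
    voiceprint distribution probability to the first voiceprint distribution
    probability."""
    scaled = np.asarray(initial_distribution, dtype=np.float64) / tau_s
    scaled -= scaled.max()          # subtract the max for numerical stability
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()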
In an example embodiment, the student network may include a voiceprint feature extraction layer and a voiceprint distribution prediction layer that are sequentially connected, and correspondingly, inputting the sample local audio feature into the student network to obtain a first initial voiceprint distribution probability corresponding to the sample local audio feature output by the student network may include: inputting the sample local audio features into a voiceprint feature extraction layer in the student network to obtain third voiceprint features corresponding to the sample local audio features output by the voiceprint feature extraction layer; and inputting the third voiceprint feature into a voiceprint distribution prediction layer in the student network to obtain a first initial voiceprint distribution probability corresponding to the sample local audio feature output by the voiceprint distribution prediction layer.
Therefore, by training the student network, the student network can learn how to extract voiceprint features from local audio features and how to predict the voiceprint distribution probabilities corresponding to those voiceprint features, so that the trained student network can be used to extract the first voiceprint feature from the local audio features corresponding to the target audio data to be identified.
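A minimal sketch of the two-layer student network structure described above is given below, assuming a PyTorch-style module; the layer types and dimensions (for example, a simple fully connected extraction layer with an 80-dimensional input and a 256-dimensional voiceprint feature) are illustrative assumptions, not the disclosed architecture:

import torch.nn as nn

class StudentNetwork(nn.Module):
    """Voiceprint feature extraction layer followed by a voiceprint
    distribution prediction layer, connected in sequence."""
    def __init__(self, feat_dim=80, embed_dim=256, num_speakers=1000):
        super().__init__()
        # voiceprint feature extraction layer: sample local audio feature -> third voiceprint feature
        self.feature_extraction = nn.Sequential(
            nn.Linear(feat_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # voiceprint distribution prediction layer: voiceprint feature ->
        # first initial voiceprint distribution probability (as unnormalized logits here)
        self.distribution_prediction = nn.Linear(embed_dim, num_speakers)

    def forward(self, local_audio_feature):
        voiceprint_feature = self.feature_extraction(local_audio_feature)
        return self.distribution_prediction(voiceprint_feature)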
In an example embodiment, the structure of the teacher network branch may be the same as the structure of the student network branch, i.e., the teacher network branch may include a teacher network and a second normalization layer, and the structure of the teacher network may be the same as the structure of the student network, i.e., the teacher network may include a voiceprint feature extraction layer and a voiceprint distribution prediction layer.
In an example embodiment, to make training more stable, the teacher network branch may include, in addition to the teacher network and the second normalization layer, a centering layer, where the centering layer is connected to the teacher network and to the second normalization layer. That is, the teacher network branch may include the teacher network, a centering layer connected to the teacher network, and a second normalization layer connected to the centering layer.
Accordingly, in step 403, inputting the sample global audio feature into the teacher network branch to obtain the second voiceprint distribution probability corresponding to the sample global audio feature may include: inputting the sample global audio feature corresponding to each frame of third sub-audio data into the teacher network to obtain the second initial voiceprint distribution probability corresponding to each frame of third sub-audio data output by the teacher network; inputting the second initial voiceprint distribution probability corresponding to each frame of third sub-audio data into the centering layer, and updating the initial voiceprint distribution probability corresponding to the previous sample audio data output by the centering layer by using the average value of the second initial voiceprint distribution probabilities corresponding to each frame of third sub-audio data, so as to obtain a third initial voiceprint distribution probability; and inputting the third initial voiceprint distribution probability into the second normalization layer to obtain the second voiceprint distribution probability output by the second normalization layer.
The second initial voiceprint distribution probability represents the posterior probability of voiceprint features corresponding to the sample global audio features predicted by the teacher network on a plurality of speakers.
That is, the third initial voiceprint distribution probability can be obtained in a manner shown in the following formula (4):
c ← m·c' + (1-m)·(1/B)·Σj gθt(x'j), j = 1, ..., B (4)
wherein m is a hyperparameter and may take a value as required, for example, 0.9; B is the number of frames in each batch of data in the iterative training process, that is, the number of frames of third sub-audio data obtained after framing the sample audio data; c represents the initial voiceprint distribution probability corresponding to the current sample audio data output by the centering layer, that is, the third initial voiceprint distribution probability; c' represents the initial voiceprint distribution probability corresponding to the previous sample audio data output by the centering layer, and the initial value of c' may be set as required; gθt represents the teacher network; and gθt(x'j) represents the second initial voiceprint distribution probability corresponding to the j-th frame of third sub-audio data x'j output by the teacher network.
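A minimal sketch of the centering layer update of formula (4) follows; the class and attribute names are illustrative, and the layer simply maintains an exponential moving average of the per-sample mean of the teacher outputs:

import numpy as np

class CenteringLayer:
    """Running center c over speaker posteriors, updated per sample audio data
    as c <- m*c' + (1-m)*(1/B)*sum_j g_theta_t(x'_j)."""
    def __init__(self, num_speakers, m=0.9):
        self.m = m
        self.c = np.zeros(num_speakers)   # initial value of c', set as required

    def update(self, teacher_outputs):
        # teacher_outputs: shape (B, K), one second initial voiceprint
        # distribution probability per frame of third sub-audio data
        batch_mean = np.asarray(teacher_outputs, dtype=np.float64).mean(axis=0)
        self.c = self.m * self.c + (1.0 - self.m) * batch_mean
        return self.c                     # third initial voiceprint distribution probability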
In the iterative training process, for the current sample audio data, the student network branch inputs the first initial voiceprint distribution probability corresponding to each frame of second sub-audio data into the first normalization layer and outputs the first voiceprint distribution probability corresponding to each frame of second sub-audio data. Meanwhile, the teacher network branch inputs the second initial voiceprint distribution probability corresponding to each frame of third sub-audio data into the centering layer, inputs the resulting third initial voiceprint distribution probability corresponding to the current sample audio data into the second normalization layer, and outputs the second voiceprint distribution probability corresponding to the current sample audio data. When the model parameters of the student network are adjusted, they may be adjusted based on the losses between each first voiceprint distribution probability and this second voiceprint distribution probability.
Through the above process, the second voiceprint distribution probability is obtained by using the teacher network branch. By adding the centering layer into the teacher network branch and using it to update the initial voiceprint distribution probability corresponding to the previous sample audio data with the average value of the second initial voiceprint distribution probabilities corresponding to each frame of third sub-audio data, the third initial voiceprint distribution probability corresponding to the current sample audio data is obtained, and outliers in the second initial voiceprint distribution probabilities corresponding to each frame of third sub-audio data can be removed, so that training is more stable and the voiceprint recognition accuracy of the trained voiceprint recognition model is improved. In addition, normalization of the third initial voiceprint distribution probability output by the centering layer is realized by processing it with the second normalization layer.
In addition, since the training sample set used for training the teacher network and the student network in the voiceprint recognition model in the embodiment of the disclosure is unlabeled, training may become unstable or collapse. In an example embodiment, based on the idea of curriculum learning, the voiceprint recognition model can therefore learn from easy training samples first and gradually progress to more complex training samples and knowledge, so that the number of iteration steps is reduced, the training process is accelerated, unstable or collapsed training is avoided, and the voiceprint recognition model obtains better generalization performance.
Specifically, a strategy of gradually increasing the data volume and/or the number of data enhancement modes may be adopted in different stages of training, so that the voiceprint recognition model learns from easy training samples first and gradually progresses to complex training samples and knowledge.
Take, as an example, a strategy of gradually increasing both the amount of data and the number of data enhancement modes in different stages of training. Accordingly, in an example embodiment, the number of training sample sets used when training the voiceprint recognition model may be N, where N is an integer greater than 1, and the training process may be divided into P stages, where P is an integer greater than 1. In each stage, M of the N training sample sets are obtained, where M is an integer greater than 0, and one stage of training is performed on the teacher network and the student network in the voiceprint recognition model based on the M training sample sets. As the number of stages increases, M increases, and the number of data enhancement modes applied to the sample audio data in the M training sample sets increases. After the P stages of training, the voiceprint recognition model obtained after the P stages of training is determined as the trained voiceprint recognition model.
For example, assume that the training process of the voiceprint recognition model includes 50 training rounds and that there are 3 training sample sets, each containing 100,000 sample audio data. Then the first 10 rounds of training may be set as the first stage, rounds 11 to 30 as the second stage, and rounds 31 to 50 as the third stage. In the first stage, any 1 of the 3 training sample sets is used for training, and when data enhancement is performed on the sample audio data in this 1 training sample set, only the random erasure enhancement mode is adopted; in the second stage, any 2 of the 3 training sample sets are used for training, and the random erasure and mixed environmental noise enhancement modes are adopted when data enhancement is performed on the sample audio data in these 2 training sample sets; in the third stage, all 3 training sample sets are used for training, and the random erasure, mixed environmental noise, and scale transformation enhancement modes are adopted when data enhancement is performed on the sample audio data in the 3 training sample sets. In each training round, the teacher network and the student network in the voiceprint recognition model are trained in the manner described in the foregoing embodiments.
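The staged schedule in this example can be sketched as a small lookup table; the round boundaries, set counts, and augmentation names below simply mirror the example, while the function and constant names are illustrative and the augmentation routines themselves are assumed to exist elsewhere:

# A minimal sketch of the staged (curriculum) schedule described above.
CURRICULUM = [
    # (last round of stage, number of training sample sets, enhancement modes)
    (10, 1, ["random_erasure"]),
    (30, 2, ["random_erasure", "mix_environmental_noise"]),
    (50, 3, ["random_erasure", "mix_environmental_noise", "scale_transform"]),
]

def stage_config(training_round):
    """Return (num_sample_sets, enhancement_modes) for the given round (1-based)."""
    for last_round, num_sets, modes in CURRICULUM:
        if training_round <= last_round:
            return num_sets, modes
    return CURRICULUM[-1][1], CURRICULUM[-1][2]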
In summary, the model training method for voiceprint recognition provided by the example embodiment obtains a training sample set including a plurality of sample audio data; obtains the corresponding sample local audio features and sample global audio features based on the sample audio data; inputs the sample local audio features into the student network branch to obtain the first voiceprint distribution probability corresponding to the sample local audio features, and inputs the sample global audio features into the teacher network branch to obtain the second voiceprint distribution probability corresponding to the sample global audio features; adjusts the model parameters of the student network based on the loss between the first voiceprint distribution probability and the second voiceprint distribution probability; and adjusts the model parameters of the teacher network based on the adjusted model parameters of the student network, so as to obtain a trained voiceprint recognition model based on the adjusted student network and teacher network. In this way, the teacher network and the student network in the voiceprint recognition model are trained in a self-supervised manner based on the sample local audio features and the sample global audio features corresponding to the sample audio data, a teacher network and a student network for voiceprint recognition are obtained, and the trained student network and teacher network can then be used to accurately acquire the target voiceprint feature corresponding to the target audio data based on the corresponding local audio features and global audio features, thereby improving the accuracy of voiceprint recognition. In addition, no label needs to be marked for each sample audio data in the training sample set, so that the labeling cost is reduced.
The voiceprint recognition apparatus provided by the present disclosure is described below with reference to fig. 6.
Fig. 6 is a schematic structural view of a voiceprint recognition device according to a fifth embodiment of the present disclosure.
As shown in fig. 6, the voiceprint recognition apparatus 600 provided by the present disclosure includes: a first acquisition module 601, a first processing module 602, a second processing module 603, and a first determination module 604.
The first obtaining module 601 is configured to obtain target audio data to be identified, and obtain corresponding local audio features and global audio features based on the target audio data;
The first processing module 602 is configured to input the local audio feature into a student network of the voiceprint recognition model to obtain a first voiceprint feature output by the student network;
the second processing module 603 is configured to input the global audio feature into a teacher network of the voiceprint recognition model to obtain a second voiceprint feature output by the teacher network.
The first determining module 604 is configured to determine a target voiceprint feature corresponding to the target audio data based on the first voiceprint feature and the second voiceprint feature.
Note that, the voiceprint recognition apparatus 600 provided in this embodiment may perform the voiceprint recognition method of the foregoing embodiment. The voiceprint recognition apparatus 600 may be implemented by software and/or hardware, and the voiceprint recognition apparatus 600 may be configured in an electronic device, where the electronic device may include, but is not limited to, a terminal device, a server, and the like, and the embodiment is not specifically limited to the electronic device.
In an example embodiment, the first acquisition module 601 includes:
The first framing sub-module is used for framing the target audio data to obtain multi-frame first sub-audio data;
The first feature extraction sub-module is used for carrying out feature extraction on the first sub-audio data of each frame so as to obtain feature vectors corresponding to the first sub-audio data of each frame;
The first acquisition sub-module is used for acquiring, for each frame of first sub-audio data, the local audio feature corresponding to the first sub-audio data based on the corresponding feature vector and the average value of the feature vectors corresponding to at least one frame of first sub-audio data before and after it;
The second obtaining sub-module is used for obtaining the global audio feature corresponding to the first sub-audio data of each frame based on the corresponding feature vector and the average value of the feature vectors corresponding to the first sub-audio data of each frame.
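A minimal sketch of the first and second acquisition sub-modules described above follows; concatenating each frame's feature vector with the corresponding average is an assumption made for illustration, since the disclosure only specifies which feature vectors and averages are used:

import numpy as np

def local_and_global_features(frame_vectors, context=1):
    """For each frame's feature vector, the local audio feature combines it with
    the mean over `context` neighbouring frames on each side, and the global
    audio feature combines it with the mean over all frames."""
    frame_vectors = np.asarray(frame_vectors)           # shape (T, D)
    global_mean = frame_vectors.mean(axis=0)
    local_feats, global_feats = [], []
    for t, vec in enumerate(frame_vectors):
        lo, hi = max(0, t - context), min(len(frame_vectors), t + context + 1)
        local_mean = frame_vectors[lo:hi].mean(axis=0)  # neighbouring-frame average
        local_feats.append(np.concatenate([vec, local_mean]))
        global_feats.append(np.concatenate([vec, global_mean]))
    return np.stack(local_feats), np.stack(global_feats)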
It should be noted that the foregoing description of the embodiments of the voiceprint recognition method is also applicable to the voiceprint recognition device provided in the present disclosure, and will not be repeated here.
The voiceprint recognition device provided by the example embodiment obtains target audio data to be recognized and obtains the corresponding local audio features and global audio features based on the target audio data; inputs the local audio features into the student network of the voiceprint recognition model to obtain the first voiceprint feature output by the student network; inputs the global audio features into the teacher network of the voiceprint recognition model to obtain the second voiceprint feature output by the teacher network; and determines the target voiceprint feature corresponding to the target audio data based on the first voiceprint feature and the second voiceprint feature. Therefore, by utilizing the student network and the teacher network, the target voiceprint feature corresponding to the target audio data is acquired based on both the local audio features and the global audio features corresponding to the target audio data, so that the voiceprint recognition accuracy is improved.
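For illustration, the inference flow through the apparatus can be sketched as follows, assuming the student and teacher networks are callables that map features to voiceprint embeddings; how the first and second voiceprint features are fused into the target voiceprint feature is not fixed here, and the simple averaging shown is purely an assumption:

import numpy as np

def recognize_voiceprint(student_net, teacher_net, local_feature, global_feature):
    """Obtain the first and second voiceprint features and combine them into
    the target voiceprint feature (averaging is an illustrative placeholder)."""
    first_voiceprint = np.asarray(student_net(local_feature))    # from the student network
    second_voiceprint = np.asarray(teacher_net(global_feature))  # from the teacher network
    target_voiceprint = (first_voiceprint + second_voiceprint) / 2.0
    return target_voiceprint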
In an example embodiment, a model training apparatus for voiceprint recognition is also provided. The model training apparatus for voiceprint recognition provided by the present disclosure is described below with reference to fig. 7.
Fig. 7 is a schematic structural view of a model training apparatus for voiceprint recognition according to a sixth embodiment of the present disclosure.
As shown in fig. 7, the model training apparatus 700 for voiceprint recognition provided by the present disclosure includes: a second acquisition module 701, a third acquisition module 702, and a training module 703.
The second obtaining module 701 is configured to obtain a training sample set, where the training sample set includes a plurality of sample audio data;
a third obtaining module 702, configured to obtain, based on the sample audio data, a corresponding sample local audio feature and a sample global audio feature;
The training module 703 is configured to train the teacher network and the student network in the voiceprint recognition model by using the local audio features of the sample as training samples of the student network in the voiceprint recognition model and using the global audio features of the sample as training samples of the teacher network in the voiceprint recognition model, so as to obtain a trained voiceprint recognition model.
It should be noted that, the model training device 700 for voiceprint recognition, abbreviated as the model training device, provided in this embodiment may perform the model training method for voiceprint recognition in the foregoing embodiment. The model training apparatus may be implemented by software and/or hardware, and the model training apparatus may be configured on an electronic device, where the electronic device may include, but is not limited to, a terminal device, a server, and the like, and the embodiment is not specifically limited to the electronic device.
In an example embodiment, the voiceprint recognition model includes a teacher network branch including a teacher network and a student network branch including a student network; the teacher network and the student network have the same network structure;
training module 703, comprising:
The first processing submodule is used for inputting the sample local audio features into the student network branch to obtain a first voiceprint distribution probability corresponding to the sample local audio features, and inputting the sample global audio features into the teacher network branch to obtain a second voiceprint distribution probability corresponding to the sample global audio features;
The first adjusting submodule is used for adjusting model parameters of the student network based on loss between the first voiceprint distribution probability and the second voiceprint distribution probability;
And the second adjusting sub-module is used for adjusting the model parameters of the teacher network based on the adjusted model parameters of the student network so as to obtain a trained voiceprint recognition model based on the adjusted student network and the teacher network.
In an example embodiment, the third acquisition module 702 includes:
The second processing sub-module is used for carrying out enhancement processing on the sample audio data so as to obtain enhanced audio data corresponding to the sample audio data;
the second sub-frame sub-module is used for framing the sample audio data and the enhanced audio data to obtain multi-frame second sub-audio data;
The second feature extraction sub-module is used for carrying out feature extraction on the second sub-audio data of each frame so as to obtain feature vectors corresponding to the second sub-audio data of each frame;
and the third acquisition sub-module is used for acquiring the sample local audio characteristics corresponding to the second sub-audio data of each frame based on the corresponding characteristic vector and the average value of the characteristic vectors corresponding to the second sub-audio data of at least one frame before and after the second sub-audio data.
In an example embodiment, the student network branch further comprises a first normalization layer connected to the student network;
a first processing sub-module comprising:
The first processing unit is used for inputting the sample local audio features into the student network so as to obtain first initial voiceprint distribution probability corresponding to the sample local audio features output by the student network;
the second processing unit is used for inputting the first initial voiceprint distribution probability into the first normalization layer to obtain the first voiceprint distribution probability, corresponding to the sample local audio features, output by the first normalization layer.
In an example embodiment, the student network includes a voiceprint feature extraction layer and a voiceprint distribution prediction layer connected in sequence;
A first processing unit comprising:
the first processing subunit is used for inputting the sample local audio features into a voiceprint feature extraction layer in the student network so as to obtain third voiceprint features corresponding to the sample local audio features output by the voiceprint feature extraction layer;
The second processing subunit is configured to input the third voiceprint feature into a voiceprint distribution prediction layer in the student network, so as to obtain a first initial voiceprint distribution probability corresponding to the sample local audio feature output by the voiceprint distribution prediction layer.
In an example embodiment, the third acquisition module 702 includes:
the third framing sub-module is used for framing the sample audio data to obtain multi-frame third sub-audio data;
the third feature extraction sub-module is used for carrying out feature extraction on the third sub-audio data of each frame so as to obtain feature vectors corresponding to the third sub-audio data of each frame;
And the fourth acquisition sub-module is used for acquiring the sample global audio feature corresponding to the third sub-audio data of each frame based on the corresponding feature vector and the average value of the feature vectors corresponding to the third sub-audio data of each frame.
In an example embodiment, the teacher network branch further includes a centralized layer connected to the teacher network, and a second normalization layer connected to the centralized layer;
a first processing sub-module comprising:
The third processing unit is used for inputting the sample global audio characteristics corresponding to the third sub-audio data of each frame into the teacher network so as to obtain the second initial voiceprint distribution probability corresponding to the third sub-audio data of each frame output by the teacher network;
the fourth processing unit is used for inputting the second initial voiceprint distribution probability corresponding to the third sub-audio data of each frame into the centralization layer so as to update the initial voiceprint distribution probability corresponding to the previous sample audio data output by the centralization layer by using the average value of the second initial voiceprint distribution probability corresponding to the third sub-audio data of each frame to obtain the third initial voiceprint distribution probability;
And the fifth processing unit is used for inputting the third initial voiceprint distribution probability into the second normalization layer to obtain second voiceprint distribution probability output by the second normalization layer.
In an example embodiment, the number of training sample sets is N, N being an integer greater than 1; the apparatus further comprises:
a fourth obtaining module, configured to obtain M training sample sets of the N training sample sets;
The training module 703 includes a training unit, configured to perform one-stage training on a teacher network and a student network in the voiceprint recognition model based on M training sample sets; wherein M is an integer greater than 0;
model training apparatus 700 further comprises: the second determining module is used for determining the voiceprint recognition model obtained after training of the P stages as the trained voiceprint recognition model; P is an integer greater than 1, and as the number of stages increases, M increases, and the number of data enhancement modes applied to the sample audio data in the M training sample sets increases.
It should be noted that the foregoing description of the embodiment of the model training method for voiceprint recognition is also applicable to the model training device for voiceprint recognition provided in the present disclosure, and is not repeated herein.
According to the model training device for voiceprint recognition provided by the example embodiment, a training sample set including a plurality of sample audio data is obtained; the corresponding sample local audio features and sample global audio features are obtained based on the sample audio data; and the teacher network and the student network in the voiceprint recognition model are trained by using the sample local audio features as training samples of the student network and the sample global audio features as training samples of the teacher network, so as to obtain a trained voiceprint recognition model. Training of the teacher network and the student network in the voiceprint recognition model based on the sample local audio features and the sample global audio features corresponding to the sample audio data in the training sample set is thus realized, and a teacher network and a student network for voiceprint recognition are obtained. The student network and the teacher network in the trained voiceprint recognition model can then be used to acquire the target voiceprint feature corresponding to the target audio data based on the corresponding local audio features and global audio features, so that the accuracy of voiceprint recognition can be improved. In addition, no label needs to be marked for each sample audio data in the training sample set, so that the labeling cost is reduced.
Based on the above embodiments, the present disclosure further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the voiceprint recognition method of the present disclosure or to perform the model training method for voiceprint recognition of the present disclosure.
Based on the above embodiments, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the voiceprint recognition method disclosed by the embodiments of the present disclosure or to perform the model training method for voiceprint recognition disclosed by the embodiments of the present disclosure.
Based on the above embodiments, the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the voiceprint recognition method of the present disclosure, or implements the steps of the model training method for voiceprint recognition of the present disclosure.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 may include a computing unit 801 that may perform various suitable actions and processes according to computer programs stored in a Read Only Memory (ROM) 802 or loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a voiceprint recognition method or a model training method for voiceprint recognition. For example, in some embodiments, the voiceprint recognition method or model training method for voiceprint recognition can be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the voiceprint recognition method or model training method for voiceprint recognition described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a voiceprint recognition method or a model training method for voiceprint recognition by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (23)

1. A voiceprint recognition method comprising:
acquiring target audio data to be identified, and acquiring corresponding local audio features and global audio features based on the target audio data;
inputting the local audio features into a student network of a voiceprint recognition model to obtain first voiceprint features output by the student network;
Inputting the global audio features into a teacher network of the voiceprint recognition model to obtain second voiceprint features output by the teacher network;
And determining a target voiceprint feature corresponding to the target audio data based on the first voiceprint feature and the second voiceprint feature.
2. The method of claim 1, wherein the acquiring corresponding local audio features and global audio features based on the target audio data comprises:
framing the target audio data to obtain multi-frame first sub-audio data;
extracting the characteristics of the first sub-audio data of each frame to obtain characteristic vectors corresponding to the first sub-audio data of each frame;
For the first sub-audio data of each frame, acquiring local audio features corresponding to the first sub-audio data based on corresponding feature vectors and average values of the feature vectors corresponding to the first sub-audio data of at least one frame before and after the first sub-audio data;
And for the first sub-audio data of each frame, acquiring global audio features corresponding to the first sub-audio data based on the corresponding feature vectors and the average value of the feature vectors corresponding to the first sub-audio data of each frame.
3. A model training method for voiceprint recognition, comprising:
Acquiring a training sample set, wherein the training sample set comprises a plurality of sample audio data;
based on the sample audio data, obtaining corresponding sample local audio features and sample global audio features;
And taking the sample local audio features as training samples of the student network in the voiceprint recognition model, taking the sample global audio features as training samples of the teacher network in the voiceprint recognition model, and training the teacher network and the student network in the voiceprint recognition model to obtain a trained voiceprint recognition model.
4. A method according to claim 3, wherein the voiceprint recognition model includes a teacher network branch including the teacher network and a student network branch including a student network; the teacher network and the student network have the same network structure;
The training of the teacher network and the student network in the voiceprint recognition model by using the sample local audio features as training samples of the student network in the voiceprint recognition model and using the sample global audio features as training samples of the teacher network in the voiceprint recognition model includes:
Inputting the sample local audio features into the student network branch to obtain a first voiceprint distribution probability corresponding to the sample local audio features, and inputting the sample global audio features into the teacher network branch to obtain a second voiceprint distribution probability corresponding to the sample global audio features;
Based on the loss between the first voiceprint distribution probability and the second voiceprint distribution probability, adjusting model parameters of the student network;
And adjusting the model parameters of the teacher network based on the adjusted model parameters of the student network to obtain a trained voiceprint recognition model based on the adjusted student network and the teacher network.
5. The method of claim 4, wherein the acquiring corresponding sample local audio features and sample global audio features based on the sample audio data comprises:
Performing enhancement processing on the sample audio data to obtain enhanced audio data corresponding to the sample audio data;
framing the sample audio data and the enhanced audio data to obtain multi-frame second sub-audio data;
performing feature extraction on the second sub-audio data of each frame to obtain feature vectors corresponding to the second sub-audio data of each frame;
and for the second sub-audio data of each frame, acquiring the sample local audio characteristics corresponding to the second sub-audio data based on the corresponding characteristic vector and the average value of the characteristic vectors corresponding to the second sub-audio data of at least one frame before and after the corresponding characteristic vector.
6. The method of claim 4 or 5, wherein the student network branch further comprises a first normalization layer connected to the student network;
Inputting the sample local audio feature into the student network branch to obtain a first voiceprint distribution probability corresponding to the sample local audio feature, including:
Inputting the sample local audio features into the student network to obtain first initial voiceprint distribution probability corresponding to the sample local audio features output by the student network;
And inputting the first initial voiceprint distribution probability into the first normalization layer to obtain the first voiceprint distribution probability, corresponding to the sample local audio features, output by the first normalization layer.
7. The method of claim 6, wherein the student network comprises a voiceprint feature extraction layer and a voiceprint distribution prediction layer connected in sequence;
The step of inputting the sample local audio feature into the student network to obtain a first initial voiceprint distribution probability corresponding to the sample local audio feature output by the student network, includes:
inputting the sample local audio features into a voiceprint feature extraction layer in the student network to obtain third voiceprint features corresponding to the sample local audio features output by the voiceprint feature extraction layer;
And inputting the third voiceprint feature into a voiceprint distribution prediction layer in the student network to obtain a first initial voiceprint distribution probability corresponding to the sample local audio feature output by the voiceprint distribution prediction layer.
8. The method of claim 4, wherein the acquiring corresponding sample local audio features and sample global audio features based on the sample audio data comprises:
Framing the sample audio data to obtain multi-frame third sub-audio data;
extracting the characteristics of the third sub-audio data of each frame to obtain characteristic vectors corresponding to the third sub-audio data of each frame;
and for the third sub-audio data of each frame, acquiring a sample global audio feature corresponding to the third sub-audio data based on the corresponding feature vector and the average value of the feature vectors corresponding to the third sub-audio data of each frame.
9. The method of claim 8, wherein the teacher network branch further comprises a centralized layer connected to the teacher network, and a second normalization layer connected to the centralized layer;
Inputting the sample global audio feature into the teacher network branch to obtain a second voiceprint distribution probability corresponding to the sample global audio feature, including:
Inputting the sample global audio characteristics corresponding to the third sub-audio data of each frame into the teacher network to obtain a second initial voiceprint distribution probability corresponding to the third sub-audio data of each frame output by the teacher network;
Inputting the second initial voiceprint distribution probability corresponding to the third sub-audio data of each frame into the centralization layer, and updating the initial voiceprint distribution probability corresponding to the previous sample audio data output by the centralization layer by using the average value of the second initial voiceprint distribution probabilities corresponding to the third sub-audio data of each frame to obtain a third initial voiceprint distribution probability;
And inputting the third initial voiceprint distribution probability into the second normalization layer to acquire the second voiceprint distribution probability output by the second normalization layer.
10. The method of claim 5, wherein the number of training sample sets is N, N being an integer greater than 1; the method further comprises the steps of:
obtaining M training sample sets in N training sample sets, and training a teacher network and a student network in the voiceprint recognition model in one stage based on the M training sample sets; wherein M is an integer greater than 0;
determining a voiceprint recognition model obtained after training of P stages as the trained voiceprint recognition model; p is an integer greater than 1, and as the number of stages increases, the number of M increases, increasing the manner in which data enhancement is performed on the sample audio data in the M training sample sets.
11. A voiceprint recognition apparatus comprising:
The first acquisition module is used for acquiring target audio data to be identified and acquiring corresponding local audio features and global audio features based on the target audio data;
The first processing module is used for inputting the local audio features into a student network of the voiceprint recognition model so as to obtain first voiceprint features output by the student network;
the second processing module is used for inputting the global audio feature into a teacher network of the voiceprint recognition model to obtain a second voiceprint feature output by the teacher network;
and the first determining module is used for determining target voiceprint features corresponding to the target audio data based on the first voiceprint features and the second voiceprint features.
12. The apparatus of claim 11, wherein the first acquisition module comprises:
The first framing sub-module is used for framing the target audio data to obtain multi-frame first sub-audio data;
The first feature extraction sub-module is used for carrying out feature extraction on the first sub-audio data of each frame so as to obtain feature vectors corresponding to the first sub-audio data of each frame;
The first acquisition sub-module is used for acquiring local audio features corresponding to the first sub-audio data of each frame based on the corresponding feature vectors and the average value of the feature vectors corresponding to the first sub-audio data of at least one frame before and after the first sub-audio data;
And the second acquisition sub-module is used for acquiring global audio features corresponding to the first sub-audio data based on the corresponding feature vectors and the average value of the feature vectors corresponding to the first sub-audio data for each frame.
13. A model training apparatus for voiceprint recognition, comprising:
the second acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of sample audio data;
the third acquisition module is used for acquiring corresponding sample local audio characteristics and sample global audio characteristics based on the sample audio data;
The training module is used for taking the sample local audio features as training samples of the student networks in the voiceprint recognition model, taking the sample global audio features as training samples of the teacher networks in the voiceprint recognition model, and training the teacher networks and the student networks in the voiceprint recognition model to obtain a trained voiceprint recognition model.
14. The apparatus of claim 13, wherein the voiceprint recognition model comprises a teacher network branch and a student network branch, the teacher network branch comprising the teacher network and the student network branch comprising a student network; the teacher network and the student network have the same network structure;
the training module comprises:
The first processing submodule is used for inputting the sample local audio features into the student network branch to obtain a first voiceprint distribution probability corresponding to the sample local audio features, and inputting the sample global audio features into the teacher network branch to obtain a second voiceprint distribution probability corresponding to the sample global audio features;
The first adjusting submodule is used for adjusting model parameters of the student network based on loss between the first voiceprint distribution probability and the second voiceprint distribution probability;
and the second adjustment sub-module is used for adjusting the model parameters of the teacher network based on the adjusted model parameters of the student network so as to obtain a trained voiceprint recognition model based on the adjusted student network and the teacher network.
15. The apparatus of claim 14, wherein the third acquisition module comprises:
The second processing sub-module is used for carrying out enhancement processing on the sample audio data so as to obtain enhanced audio data corresponding to the sample audio data;
The second sub-frame sub-module is used for framing the sample audio data and the enhanced audio data to obtain multi-frame second sub-audio data;
the second feature extraction sub-module is used for carrying out feature extraction on the second sub-audio data of each frame so as to obtain feature vectors corresponding to the second sub-audio data of each frame;
and the third acquisition sub-module is used for acquiring the sample local audio characteristics corresponding to the second sub-audio data based on the corresponding characteristic vector and the average value of the characteristic vectors corresponding to the second sub-audio data of at least one frame before and after each frame.
16. The apparatus of claim 14 or 15, wherein the student network branch further comprises a first normalization layer connected to the student network;
the first processing sub-module includes:
The first processing unit is used for inputting the sample local audio features into the student network so as to obtain first initial voiceprint distribution probability corresponding to the sample local audio features output by the student network;
And the second processing unit is used for inputting the first initial voiceprint distribution probability into the first normalization layer so as to acquire the first voiceprint distribution probability, corresponding to the sample local audio features, output by the first normalization layer.
17. The apparatus of claim 16, wherein the student network comprises a voiceprint feature extraction layer and a voiceprint distribution prediction layer connected in sequence;
The first processing unit includes:
the first processing subunit is used for inputting the sample local audio features into a voiceprint feature extraction layer in the student network so as to obtain third voiceprint features corresponding to the sample local audio features output by the voiceprint feature extraction layer;
And the second processing subunit is used for inputting the third voiceprint feature into a voiceprint distribution prediction layer in the student network so as to acquire a first initial voiceprint distribution probability corresponding to the sample local audio feature output by the voiceprint distribution prediction layer.
18. The apparatus of claim 14, wherein the third acquisition module comprises:
a third framing sub-module, configured to frame the sample audio data to obtain multi-frame third sub-audio data;
a third feature extraction sub-module, configured to perform feature extraction on the third sub-audio data of each frame, so as to obtain feature vectors corresponding to the third sub-audio data of each frame;
And a fourth obtaining sub-module, configured to obtain, for each frame of the third sub-audio data, a sample global audio feature corresponding to the third sub-audio data based on the corresponding feature vector and an average value of feature vectors corresponding to the third sub-audio data for each frame.
19. The apparatus of claim 18, wherein the teacher network branch further comprises a centralized layer connected to the teacher network, and a second normalization layer connected to the centralized layer;
the first processing sub-module includes:
the third processing unit is used for inputting the sample global audio feature corresponding to each frame of the third sub-audio data into the teacher network so as to obtain a second initial voiceprint distribution probability, output by the teacher network, corresponding to each frame of the third sub-audio data;
the fourth processing unit is configured to input the second initial voiceprint distribution probabilities corresponding to the frames of the third sub-audio data into the centering layer, so as to update the initial voiceprint distribution probability that the centering layer retains for previous sample audio data by using the average value of the second initial voiceprint distribution probabilities corresponding to the frames of the third sub-audio data, thereby obtaining a third initial voiceprint distribution probability;
and the fifth processing unit is configured to input the third initial voiceprint distribution probability into the second normalization layer so as to obtain the second voiceprint distribution probability output by the second normalization layer.
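One way to read the centering layer in claim 19 is as a running center, in the spirit of DINO-style self-distillation, that is updated with the mean of the current frames' probabilities and then applied to them before the second normalization layer. The exponential-moving-average update, the momentum value, and the subtraction step below are all assumptions layered on that reading.

```python
import numpy as np

class CenteringLayer:
    """Running center updated with the mean of the current frame-level probabilities."""
    def __init__(self, dim: int, momentum: float = 0.9):
        self.center = np.zeros(dim)      # stands in for the value kept from previous samples
        self.momentum = momentum

    def __call__(self, frame_probs: np.ndarray) -> np.ndarray:
        batch_mean = frame_probs.mean(axis=0)                               # average over frames
        self.center = self.momentum * self.center + (1.0 - self.momentum) * batch_mean
        return frame_probs - self.center   # centered scores, one reading of the "third" probability

# usage: one row of second initial probabilities per frame of the third sub-audio data
center = CenteringLayer(dim=1000)
third_init = center(np.random.rand(98, 1000))   # fed to the second normalization layer next
```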
20. The apparatus of claim 15, wherein the number of training sample sets is N, N being an integer greater than 1; the apparatus further comprises:
a fourth obtaining module, configured to obtain M training sample sets of the N training sample sets;
the training module comprises a training unit, wherein the training unit is used for training the teacher network and the student network in the voiceprint recognition model in one stage based on the M training sample sets, wherein M is an integer greater than 0;
the apparatus further comprises: a second determining module, configured to determine the voiceprint recognition model obtained after training in P stages as the trained voiceprint recognition model; P is an integer greater than 1, and as the number of stages increases, the value of M increases and the number of data-enhancement methods applied to the sample audio data in the M training sample sets increases.
21. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-2 or to perform the method of any one of claims 3-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-2 or to perform the method of any one of claims 3-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of claims 1-2 or implements the steps of the method of any of claims 3-10.
CN202210536790.XA 2022-05-17 2022-05-17 Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium Active CN114913859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210536790.XA CN114913859B (en) 2022-05-17 2022-05-17 Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114913859A (en) 2022-08-16
CN114913859B (en) 2024-06-04

Family

ID=82768007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210536790.XA Active CN114913859B (en) 2022-05-17 2022-05-17 Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114913859B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496980B (en) * 2023-12-29 2024-03-26 南京邮电大学 Voiceprint recognition method based on local and global cross-channel fusion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6427137B2 (en) * 1999-08-31 2002-07-30 Accenture Llp System, method and article of manufacture for a voice analysis system that detects nervousness for preventing fraud
US11195057B2 (en) * 2014-03-18 2021-12-07 Z Advanced Computing, Inc. System and method for extremely efficient image and pattern recognition and artificial intelligence platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001016892A1 (en) * 1999-08-31 2001-03-08 Accenture Llp System, method, and article of manufacture for a border crossing system that allows selective passage based on voice analysis
CN113257230A (en) * 2021-06-23 2021-08-13 北京世纪好未来教育科技有限公司 Voice processing method and device and computer storage medium
CN113488058A (en) * 2021-06-23 2021-10-08 武汉理工大学 Voiceprint recognition method based on short voice
CN114333848A (en) * 2022-01-12 2022-04-12 北京百度网讯科技有限公司 Voiceprint recognition method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Robust Speech Recognition System Based on Deep Neural Networks; Zhang Yunyao; China Master's Theses Full-text Database, Information Science and Technology Series; 2021-09-15 (No. 09); full text *
Few-shot voiceprint recognition method based on a deep transfer model; Sun Cunwei; Wen Chang; Xie Kai; He Jianbiao; Computer Engineering and Design; 2018-12-16 (No. 12); full text *

Also Published As

Publication number Publication date
CN114913859A (en) 2022-08-16

Similar Documents

Publication Publication Date Title
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN107610709B (en) Method and system for training voiceprint recognition model
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN105976812B (en) A kind of audio recognition method and its equipment
CN109859772B (en) Emotion recognition method, emotion recognition device and computer-readable storage medium
CN112259106B (en) Voiceprint recognition method and device, storage medium and computer equipment
CN111402891B (en) Speech recognition method, device, equipment and storage medium
WO2021051577A1 (en) Speech emotion recognition method, apparatus, device, and storage medium
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN111710337B (en) Voice data processing method and device, computer readable medium and electronic equipment
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN114127849A (en) Speech emotion recognition method and device
CN111192659A (en) Pre-training method for depression detection and depression detection method and device
CN108986798A (en) Processing method, device and the equipment of voice data
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
Mian Qaisar Isolated speech recognition and its transformation in visual signs
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN111402922A (en) Audio signal classification method, device, equipment and storage medium based on small samples
Gorrostieta et al. Attention-based Sequence Classification for Affect Detection.
CN112017690B (en) Audio processing method, device, equipment and medium
CN116978359A (en) Phoneme recognition method, device, electronic equipment and storage medium
CN115132170A (en) Language classification method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant