CN110853653B - Voiceprint recognition method based on self-attention and transfer learning - Google Patents


Info

Publication number
CN110853653B
Authority
CN
China
Prior art keywords
data set
attention
model
application scene
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911150646.7A
Other languages
Chinese (zh)
Other versions
CN110853653A (en)
Inventor
高登科 (Gao Dengke)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Zhiyun Technology Co., Ltd.
Original Assignee
Zhongke Zhiyun Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Zhiyun Technology Co., Ltd.
Priority to CN201911150646.7A
Publication of CN110853653A
Application granted
Publication of CN110853653B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a voiceprint recognition method based on self-attention and transfer learning, which comprises: obtaining open-source English speech data and constructing a primary base data set; obtaining open-source Chinese speech data and constructing a secondary base data set; obtaining application-scenario speech data and constructing an application-scenario data set; training a primary base model on the primary base data set using a self-attention model; performing transfer fine-tuning of the primary base model on the secondary base data set to obtain a secondary base model; and finally performing transfer fine-tuning of the secondary base model on the application-scenario data to obtain a final model adapted to the specific application scenario. The method learns not only robustness to noise, reverberation, and channel variation, but also the pronunciation characteristics of Chinese and recognition ability better suited to real application scenarios, and therefore satisfies the demands of real-world deployment.

Description

Voiceprint recognition method based on self-attention and transfer learning
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a voiceprint recognition method based on self-attention and transfer learning.
Background
Biometric identification is a technology that verifies identity by means of human physiological characteristics. Because such characteristics cannot be lost or forgotten, are unique and invariant, resist counterfeiting, and are convenient to use, biometric identification is widely applied in access control, attendance, finance, public security, and terminal electronic devices.
Voiceprint recognition, a type of biometric identification, identifies a person from the characteristics of the speaker's sound wave. Because it is independent of accent and language, contact-free, and natural in use, it has attracted increasingly wide attention and application in recent years.
At present, voiceprint recognition based on traditional methods achieves low accuracy, while voiceprint recognition based on deep learning depends excessively on massive, high-dimensional, high-quality speech data. Both are easily affected by environmental noise, reverberation, and audio channels, and lack the generalization ability needed for real-world application.
Therefore, to solve these problems, the invention provides a voiceprint recognition method based on self-attention and transfer learning.
Disclosure of Invention
The object of the invention is to provide a voiceprint recognition method based on self-attention and transfer learning that learns not only robustness to noise, reverberation, and channel variation, but also the pronunciation characteristics of Chinese and recognition ability better suited to real application scenarios, thereby satisfying the demands of real-world deployment.
The invention is mainly realized by the following technical scheme: a voiceprint recognition method based on self-attention and transfer learning, comprising: obtaining open-source English speech data and constructing a primary base data set; obtaining open-source Chinese speech data and constructing a secondary base data set; obtaining application-scenario speech data and constructing an application-scenario data set; training a primary base model on the primary base data set using a self-attention model; performing transfer fine-tuning of the primary base model on the secondary base data set to obtain a secondary base model; and finally performing transfer fine-tuning of the secondary base model on the application-scenario data to obtain a final model adapted to the specific application scenario.
To better implement the method, data enhancement in both the time domain and the frequency domain is further performed on the primary base data set, the secondary base data set, and the application-scenario data set.
To better realize the invention, further, in the time domain, the tempo and pitch of the primary base data set, the secondary base data set, and the application-scenario data set are respectively adjusted to vary the audio speed, after which random noise is added; in the frequency domain, a random warping factor is applied to the spectral features of each audio clip using Vocal Tract Length Perturbation (VTLP).
To better implement the invention, further, the primary base data set is collected under unconstrained conditions.
To better realize the method, further, a self-attention model is introduced in the spatial dimension, and the audio features that contribute to the recognition result are selected through the autocorrelation of the spatial dimension; a self-attention model is likewise introduced in the feature dimension, and the feature components that contribute to the recognition result are selected through the autocorrelation among feature dimensions.
To better implement the present invention, further, the audio features pass through attention over the feature dimension first and then attention over the spatial dimension, in cascade.
To better implement the present invention, further, the decision threshold is controlled according to the difference between the enrollment channel and the verification channel: a higher threshold is selected when enrollment and verification come from the same channel, and a correspondingly lower threshold is selected according to the magnitude of the difference when they come from different channels.
The invention has the beneficial effects that:
(1) The method addresses the problems of low voiceprint recognition accuracy, weak robustness to real environments (noise, reverberation, and the like), weak channel robustness, and excessive dependence on massive real-scenario data. On the basis of data enhancement, self-attention, transfer learning, and dynamic thresholding, it constructs a random-digit voiceprint recognition algorithm that can complete identification from a short, simple utterance by the user.
(2) Data enhancement: data enhancement in the time domain and the frequency domain is applied to all data sets, greatly reducing the amount of audio data required while greatly improving the robustness of the algorithm to environment, channel, and speech rate.
(3) The self-attention model selects the features most useful for identification along the two dimensions of space and feature, improving the feature-extraction ability of the algorithm and enhancing robustness to noise, reverberation, and channel variation.
(4) Cascaded transfer learning learns not only robustness to noise, reverberation, and channel variation, but also the pronunciation characteristics of Chinese and recognition ability better suited to real application scenarios.
(5) The cross-channel dynamic threshold is adjusted dynamically according to the difference between the enrollment and verification channels, greatly extending the generalization ability of the algorithm across channels.
(6) The key point of the invention is that the proposed random-digit voiceprint recognition algorithm needs only a small amount of application-scenario audio data. Starting from publicly available network data, it improves data quality through data enhancement, and achieves high voiceprint recognition accuracy and strong generalization over noise, reverberation, channel, and speech rate through the self-attention model and cascaded transfer learning. A cross-channel dynamic threshold technique is also proposed, further extending the cross-channel ability of the algorithm.
Drawings
FIG. 1 is a flow diagram of the time domain data enhancement of the present invention;
FIG. 2 is a flow chart of frequency domain data enhancement of the present invention;
FIG. 3 is a spatial attention flow diagram of the present invention;
FIG. 4 is a channel attention flow diagram of the present invention;
FIG. 5 is a flow chart of the dual attention fusion of the present invention;
FIG. 6 is a flow diagram of cascaded migration learning of the present invention;
FIG. 7 is a cross-channel dynamic threshold flow diagram of the present invention.
Detailed Description
Example 1:
A voiceprint recognition method based on self-attention and transfer learning comprises: obtaining open-source English speech data and constructing a primary base data set; obtaining open-source Chinese speech data and constructing a secondary base data set; and obtaining application-scenario speech data and constructing an application-scenario data set. As shown in FIG. 6, a primary base model is trained on the primary base data set using the self-attention model; transfer fine-tuning of the primary base model is then performed on the secondary base data set to obtain a secondary base model; finally, transfer fine-tuning of the secondary base model is performed on the application-scenario data to obtain a final model adapted to the specific application scenario. Through this cascaded fine-tuning, the model learns robustness to noise, reverberation, and channel variation, learns the pronunciation characteristics of Chinese, and acquires recognition ability better suited to real application scenarios.
The invention performs base training on the public English data set and secondary fine-tuning on the public Chinese data set and the application-scenario data set; the model thus learns not only robustness to noise, reverberation, and channel variation, but also the pronunciation characteristics of Chinese and recognition ability better suited to real application scenarios. A minimal sketch of this cascaded fine-tuning is given below.
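What follows is a minimal, illustrative PyTorch sketch of the cascaded transfer fine-tuning, under stated assumptions: the network `SpeakerNet`, its `embed_dim` attribute, the data loaders, and all epoch counts and learning rates are hypothetical names and values chosen for illustration, since the patent does not specify a particular architecture or training schedule.

```python
import torch
import torch.nn as nn

def train_stage(model, loader, epochs, lr, num_speakers):
    """Fine-tune `model` on one stage of the cascade: attach a fresh
    classification head for this stage's speaker labels and train end to end."""
    # `embed_dim` and the use of `classifier` inside the forward pass are
    # assumptions about the hypothetical SpeakerNet.
    model.classifier = nn.Linear(model.embed_dim, num_speakers)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for features, speaker_ids in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), speaker_ids)
            loss.backward()
            optimizer.step()
    return model

# Stage 1: base training on the open-source English data set.
model = SpeakerNet()  # hypothetical self-attention speaker-embedding network
model = train_stage(model, english_loader, epochs=50, lr=1e-3,
                    num_speakers=num_english_speakers)

# Stage 2: transfer fine-tuning on the open-source Chinese data set
# (smaller learning rate so that stage-1 knowledge is preserved).
model = train_stage(model, chinese_loader, epochs=20, lr=1e-4,
                    num_speakers=num_chinese_speakers)

# Stage 3: transfer fine-tuning on the application-scenario data set.
final_model = train_stage(model, scene_loader, epochs=10, lr=1e-5,
                          num_speakers=num_scene_speakers)
```

Each stage reuses the previous stage's weights and only replaces the classification head, which is one common way to realize the "transfer fine-tuning" the embodiment describes.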
Example 2:
In this embodiment, optimization is performed on the basis of Embodiment 1. A large amount of open-source English speech data (SITW, VoxCeleb1, VoxCeleb2, and the like) is obtained, and a primary voiceprint base data set is constructed. Because this data set is collected under unconstrained conditions, it provides good robustness to noise, reverberation, and channel variation.
A large amount of open-source Chinese speech data (AISHELL, Primewords, ST-CMDS, THCHS-30, and the like) is obtained to construct a secondary voiceprint base data set; being a Chinese data set, it adapts the model to the pronunciation characteristics of Chinese.
A small amount of application-scenario speech data is collected to construct an application-scenario voiceprint data set; because this data set is collected in the scenario where the method will actually be deployed, it matches the practical application well.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
In this embodiment, optimization is performed on the basis of Embodiment 1 or 2. As shown in FIG. 1 and FIG. 2, data enhancement in the time domain and the frequency domain is performed on the primary base data set, the secondary base data set, and the application-scenario data set. As shown in FIG. 1, the time-domain audio data is enhanced: the tempo and pitch are controlled to adjust the audio speed, and random noise is added. As shown in FIG. 2, the frequency-domain audio data is enhanced: a random warping factor is applied to the spectral features of each audio clip using Vocal Tract Length Perturbation.
The invention obtains the English and Chinese public data sets, collects a small amount of application-scenario data, and enhances all of the data sets along the two dimensions of time domain and frequency domain. This greatly reduces the amount of audio data required while greatly improving the robustness of the algorithm to environment, channel, and speech rate. A minimal sketch of both enhancement steps follows.
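What follows is a minimal, illustrative Python sketch of the two enhancement steps. The librosa calls realize the tempo/pitch adjustment and noise addition; the VTLP function uses a simplified linear frequency warp (full VTLP uses a piecewise-linear map), and the warp-factor range [0.9, 1.1], the noise level, and the pitch-shift range are assumptions, not values specified by the invention.

```python
import numpy as np
import librosa

def augment_time_domain(y, sr, rng=None):
    """Adjust tempo and pitch to vary the audio speed, then add random noise."""
    if rng is None:
        rng = np.random.default_rng()
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    noise = rng.normal(0.0, 0.005, size=y.shape)  # noise level is an assumption
    return y + noise

def augment_vtlp(spec, rng=None):
    """Apply a random warping factor to the frequency axis of a magnitude
    spectrogram `spec` of shape (freq_bins, frames); simplified linear VTLP."""
    if rng is None:
        rng = np.random.default_rng()
    alpha = rng.uniform(0.9, 1.1)  # random warping factor (assumed range)
    bins = np.arange(spec.shape[0], dtype=float)
    # Resample each frame's spectrum at the warped bin positions.
    return np.stack([np.interp(bins * alpha, bins, frame) for frame in spec.T],
                    axis=1)
```

In use, `augment_time_domain` would be applied to each raw waveform before feature extraction, and `augment_vtlp` to the resulting spectral features, with a fresh random factor per utterance.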
The rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
This embodiment is optimized on the basis of any of Embodiments 1 to 3. As shown in FIGS. 3 to 5, the self-attention model is as follows:
a. As shown in FIG. 3, the spatial attention mechanism: a self-attention model is introduced in the spatial dimension, and the audio features that contribute to the recognition result are selected through the autocorrelation of the spatial dimension.
b. As shown in FIG. 4, the channel attention mechanism: a self-attention model is introduced in the feature dimension, and the feature components that contribute to the recognition result are selected through the autocorrelation among feature dimensions.
c. As shown in FIG. 5, two-level attention fusion: the audio features pass through attention over the feature dimension and attention over the spatial dimension in cascade, which improves the feature-extraction ability and enhances robustness to noise, reverberation, and channel variation. A minimal sketch of the two attention modules and their cascade is given below.
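What follows is a minimal, illustrative PyTorch sketch of the two self-attention modules and their cascade, in the style of dual-attention networks: each module forms an autocorrelation matrix over its own dimension and uses it to reweight the features. The tensor layout and the residual connections are assumptions for illustration, not details specified by the patent.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Select useful feature components via autocorrelation among feature dims."""
    def forward(self, x):                        # x: (batch, channels, positions)
        attn = torch.softmax(x @ x.transpose(1, 2), dim=-1)   # (b, ch, ch)
        return x + attn @ x                      # residual connection (assumed)

class SpatialSelfAttention(nn.Module):
    """Select useful positions via autocorrelation of the spatial dimension."""
    def forward(self, x):                        # x: (batch, channels, positions)
        attn = torch.softmax(x.transpose(1, 2) @ x, dim=-1)   # (b, pos, pos)
        return x + x @ attn.transpose(1, 2)      # residual connection (assumed)

class DualAttention(nn.Module):
    """Cascade: feature-dimension attention first, then spatial attention."""
    def __init__(self):
        super().__init__()
        self.channel = ChannelSelfAttention()
        self.spatial = SpatialSelfAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))

# Example: a batch of 8 utterances, 256 feature channels, 200 frames.
out = DualAttention()(torch.randn(8, 256, 200))  # output shape: (8, 256, 200)
```

The cascade order (feature dimension first, then spatial) follows the description in this embodiment and in claim 5.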
Other parts of this embodiment are the same as any of embodiments 1 to 3, and thus are not described again.
Example 5:
This embodiment is optimized on the basis of any of Embodiments 1 to 4. As shown in FIG. 7, voiceprint recognition in practice is strongly channel-dependent. The data and the model are already optimized with this problem in mind, giving the algorithm relatively good generalization across channels; to extend cross-channel generalization further, the invention designs a cross-channel dynamic threshold technique. The decision threshold is controlled according to the difference between the enrollment channel and the verification channel: a higher threshold is selected when both come from the same channel, and a correspondingly lower threshold is selected according to the magnitude of the difference when they come from different channels. The specific threshold values are determined through batch tests.
Cross-channel dynamic threshold: difference thresholds between the enrollment and verification channels are obtained statistically, and the recognition decision is adjusted dynamically according to the enrollment/verification channel difference. By dynamically adjusting the threshold in this way, the generalization ability of the algorithm across channels is greatly extended. A minimal sketch of the decision rule follows.
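What follows is a minimal, illustrative sketch of a channel-dependent decision rule. Cosine scoring, the channel names, and the threshold values are assumptions; the patent states only that a higher threshold is used for same-channel trials and a lower one for cross-channel trials, with the values determined through batch tests.

```python
import numpy as np

# Hypothetical thresholds per (enrollment channel, verification channel) pair,
# as they might be obtained from batch tests on held-out trials.
THRESHOLDS = {
    ("telephone", "telephone"): 0.70,    # same channel: higher threshold
    ("microphone", "microphone"): 0.70,
    ("telephone", "microphone"): 0.55,   # different channels: lower threshold
    ("microphone", "telephone"): 0.55,
}

def verify(enroll_emb, verify_emb, enroll_channel, verify_channel):
    """Accept or reject a trial using a channel-dependent threshold."""
    score = float(np.dot(enroll_emb, verify_emb)
                  / (np.linalg.norm(enroll_emb) * np.linalg.norm(verify_emb)))
    return score >= THRESHOLDS[(enroll_channel, verify_channel)]
```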
Other parts of this embodiment are the same as any of embodiments 1 to 4, and thus are not described again.
The above description presents only preferred embodiments of the present invention and is not intended to limit it in any way; all simple modifications and equivalent variations of the above embodiments made according to the technical spirit of the present invention fall within the scope of the present invention.

Claims (6)

1. A voiceprint recognition method based on self-attention and transfer learning, characterized by comprising the steps of: obtaining open-source English speech data and constructing a primary base data set; obtaining open-source Chinese speech data and constructing a secondary base data set; obtaining application-scenario speech data and constructing an application-scenario data set; training a primary base model on the primary base data set using the self-attention model; performing transfer fine-tuning of the primary base model on the secondary base data set to obtain a secondary base model; and finally performing transfer fine-tuning of the secondary base model on the application-scenario data to obtain a final model adapted to the specific application scenario;
introducing a self-attention model in the spatial dimension, and selecting, through the autocorrelation of the spatial dimension, the audio features that contribute to the recognition result; and introducing a self-attention model in the feature dimension, and selecting, through the autocorrelation among feature dimensions, the feature components that contribute to the recognition result.
2. The voiceprint recognition method based on self-attention and transfer learning of claim 1, wherein data enhancement in the time domain and the frequency domain is performed on the primary base data set, the secondary base data set, and the application-scenario data set.
3. The voiceprint recognition method based on self-attention and transfer learning of claim 2, wherein, in the time domain, the tempo and pitch of the primary base data set, the secondary base data set, and the application-scenario data set are respectively adjusted to vary the audio speed, after which random noise is added; and, in the frequency domain, a random warping factor is applied to the spectral features of each audio clip using vocal tract length perturbation.
4. The voiceprint recognition method based on self-attention and transfer learning of claim 1, wherein the primary base data set is collected under unconstrained conditions.
5. The voiceprint recognition method based on self-attention and transfer learning of any one of claims 1 to 4, wherein the audio features pass through attention over the feature dimension first and then attention over the spatial dimension, in cascade.
6. The voiceprint recognition method based on self-attention and transfer learning of claim 1, wherein the decision threshold is controlled according to the difference between the enrollment channel and the verification channel: a higher threshold is selected when enrollment and verification come from the same channel, and a correspondingly lower threshold is selected according to the magnitude of the difference when they come from different channels.
CN201911150646.7A 2019-11-21 2019-11-21 Voiceprint recognition method based on self-attention and transfer learning Active CN110853653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911150646.7A CN110853653B (en) 2019-11-21 2019-11-21 Voiceprint recognition method based on self-attention and transfer learning


Publications (2)

Publication Number Publication Date
CN110853653A CN110853653A (en) 2020-02-28
CN110853653B (en) 2022-04-12

Family

ID=69603266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911150646.7A Active CN110853653B (en) 2019-11-21 2019-11-21 Voiceprint recognition method based on self-attention and transfer learning

Country Status (1)

Country Link
CN (1) CN110853653B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488058B (en) * 2021-06-23 2023-03-24 武汉理工大学 Voiceprint recognition method based on short voice

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705791B (en) * 2016-08-08 2021-06-04 中国电信股份有限公司 Incoming call identity confirmation method and device based on voiceprint recognition and voiceprint recognition system
CN107610709B (en) * 2017-08-01 2021-03-19 百度在线网络技术(北京)有限公司 Method and system for training voiceprint recognition model
CN108305296B (en) * 2017-08-30 2021-02-26 深圳市腾讯计算机系统有限公司 Image description generation method, model training method, device and storage medium
CN109637545B (en) * 2019-01-17 2023-05-30 哈尔滨工程大学 Voiceprint recognition method based on one-dimensional convolution asymmetric bidirectional long-short-time memory network
CN110111803B (en) * 2019-05-09 2021-02-19 南京工程学院 Transfer learning voice enhancement method based on self-attention multi-kernel maximum mean difference



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant