CN110853653B - Voiceprint recognition method based on self-attention and transfer learning - Google Patents


Info

Publication number
CN110853653B
Authority
CN
China
Prior art keywords
data set
attention
model
application scene
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911150646.7A
Other languages
Chinese (zh)
Other versions
CN110853653A (en)
Inventor
高登科 (Gao Dengke)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Zhiyun Technology Co., Ltd.
Original Assignee
Zhongke Zhiyun Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Zhiyun Technology Co., Ltd.
Priority to CN201911150646.7A
Publication of CN110853653A
Application granted
Publication of CN110853653B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a voiceprint recognition method based on self-attention and transfer learning, which comprises: obtaining open-source English speech data and constructing a primary base data set; obtaining open-source Chinese speech data and constructing a secondary base data set; obtaining application-scenario speech data and constructing an application-scenario data set; training a primary base model on the primary base data set using a self-attention model; performing transfer fine-tuning of the primary base model on the secondary base data set to obtain a secondary base model; and finally performing transfer fine-tuning of the secondary base model on the application-scenario data to obtain a final model adapted to the specific application scenario. The method learns not only robustness to noise, reverberation, and channel variation, but also the pronunciation characteristics of Chinese and recognition ability better suited to real application scenarios, and therefore satisfies the demands of real-world deployment.

Description

Voiceprint recognition method based on self-attention and transfer learning
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a voiceprint recognition method based on self-attention and transfer learning.
Background
Biometric identification is a technology that verifies identity by means of human physiological characteristics. Because such characteristics cannot be lost or forgotten, are unique and invariant, resist counterfeiting, and are convenient to use, biometric identification is widely applied in access control, attendance, finance, public security, and terminal electronic devices.
Voiceprint recognition, a type of biometric identification, identifies a person from the characteristics of the speaker's sound wave. Because it is independent of accent and language, contact-free, and natural in use, it has attracted increasingly wide attention and application in recent years.
At present, voiceprint recognition based on traditional methods achieves low accuracy, while voiceprint recognition based on deep learning depends excessively on massive, high-dimensional, high-quality speech data. Both are easily affected by environmental noise, reverberation, and audio channels, and lack the generalization ability needed for real-world application.
Therefore, to solve these problems, the invention provides a voiceprint recognition method based on self-attention and transfer learning.
Disclosure of Invention
The object of the invention is to provide a voiceprint recognition method based on self-attention and transfer learning that learns not only robustness to noise, reverberation, and channel variation, but also the pronunciation characteristics of Chinese and recognition ability better suited to real application scenarios, thereby satisfying the demands of real-world deployment.
The invention is mainly realized by the following technical scheme: a voiceprint recognition method based on self-attention and transfer learning, comprising: obtaining open-source English speech data and constructing a primary base data set; obtaining open-source Chinese speech data and constructing a secondary base data set; obtaining application-scenario speech data and constructing an application-scenario data set; training a primary base model on the primary base data set using a self-attention model; performing transfer fine-tuning of the primary base model on the secondary base data set to obtain a secondary base model; and finally performing transfer fine-tuning of the secondary base model on the application-scenario data to obtain a final model adapted to the specific application scenario.
To better implement the method, data enhancement in both the time domain and the frequency domain is further performed on the primary base data set, the secondary base data set, and the application-scenario data set.
To better realize the invention, further, in the time domain, the tempo and pitch of the primary base data set, the secondary base data set, and the application-scenario data set are respectively adjusted to vary the audio speed, after which random noise is added; in the frequency domain, a random warping factor is applied to the spectral features of each audio clip using Vocal Tract Length Perturbation (VTLP).
To better implement the invention, further, the primary base data set is collected under unconstrained conditions.
To better realize the method, further, a self-attention model is introduced in the spatial dimension, and the audio features that contribute to the recognition result are selected through the autocorrelation of the spatial dimension; a self-attention model is likewise introduced in the feature dimension, and the feature components that contribute to the recognition result are selected through the autocorrelation among feature dimensions.
To better implement the present invention, further, the audio features pass through attention over the feature dimension first and then attention over the spatial dimension, in cascade.
To better implement the present invention, further, the decision threshold is controlled according to the difference between the enrollment channel and the verification channel: a higher threshold is selected when enrollment and verification come from the same channel, and a correspondingly lower threshold is selected according to the magnitude of the difference when they come from different channels.
The invention has the beneficial effects that:
(1) The method addresses the problems of low voiceprint recognition accuracy, weak robustness to real environments (noise, reverberation, and the like), weak channel robustness, and excessive dependence on massive real-scenario data. On the basis of data enhancement, self-attention, transfer learning, and dynamic thresholding, it constructs a random-digit voiceprint recognition algorithm that can complete identification from a short, simple utterance by the user.
(2) Data enhancement: data enhancement in the time domain and the frequency domain is applied to all data sets, greatly reducing the amount of audio data required while greatly improving the robustness of the algorithm to environment, channel, and speech rate.
(3) The self-attention model selects the features most useful for identification along the two dimensions of space and feature, improving the feature-extraction ability of the algorithm and enhancing robustness to noise, reverberation, and channel variation.
(4) Cascaded transfer learning learns not only robustness to noise, reverberation, and channel variation, but also the pronunciation characteristics of Chinese and recognition ability better suited to real application scenarios.
(5) The cross-channel dynamic threshold is adjusted dynamically according to the difference between the enrollment and verification channels, greatly extending the generalization ability of the algorithm across channels.
(6) The key point of the invention is that the proposed random-digit voiceprint recognition algorithm needs only a small amount of application-scenario audio data. Starting from publicly available network data, it improves data quality through data enhancement, and achieves high voiceprint recognition accuracy and strong generalization over noise, reverberation, channel, and speech rate through the self-attention model and cascaded transfer learning. A cross-channel dynamic threshold technique is also proposed, further extending the cross-channel ability of the algorithm.
Drawings
FIG. 1 is a flow diagram of the time domain data enhancement of the present invention;
FIG. 2 is a flow chart of frequency domain data enhancement of the present invention;
FIG. 3 is a spatial attention flow diagram of the present invention;
FIG. 4 is a channel attention flow diagram of the present invention;
FIG. 5 is a flow chart of the dual attention fusion of the present invention;
FIG. 6 is a flow diagram of cascaded migration learning of the present invention;
FIG. 7 is a cross-channel dynamic threshold flow diagram of the present invention.
Detailed Description
Example 1:
A voiceprint recognition method based on self-attention and transfer learning comprises: obtaining open-source English speech data and constructing a primary base data set; obtaining open-source Chinese speech data and constructing a secondary base data set; and obtaining application-scenario speech data and constructing an application-scenario data set. As shown in FIG. 6, a primary base model is trained on the primary base data set using the self-attention model; transfer fine-tuning of the primary base model is then performed on the secondary base data set to obtain a secondary base model; finally, transfer fine-tuning of the secondary base model is performed on the application-scenario data to obtain a final model adapted to the specific application scenario. Through this cascaded fine-tuning, the model learns robustness to noise, reverberation, and channel variation, learns the pronunciation characteristics of Chinese, and acquires recognition ability better suited to real application scenarios.
The invention performs base training on the public English data set and secondary fine-tuning on the public Chinese data set and the application-scenario data set; the model thus learns not only robustness to noise, reverberation, and channel variation, but also the pronunciation characteristics of Chinese and recognition ability better suited to real application scenarios. A minimal sketch of this cascaded fine-tuning is given below.
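What follows is a minimal, illustrative PyTorch sketch of the cascaded transfer fine-tuning, under stated assumptions: the network `SpeakerNet`, its `embed_dim` attribute, the data loaders, and all epoch counts and learning rates are hypothetical names and values chosen for illustration, since the patent does not specify a particular architecture or training schedule.

```python
import torch
import torch.nn as nn

def train_stage(model, loader, epochs, lr, num_speakers):
    """Fine-tune `model` on one stage of the cascade: attach a fresh
    classification head for this stage's speaker labels and train end to end."""
    # `embed_dim` and the use of `classifier` inside the forward pass are
    # assumptions about the hypothetical SpeakerNet.
    model.classifier = nn.Linear(model.embed_dim, num_speakers)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for features, speaker_ids in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), speaker_ids)
            loss.backward()
            optimizer.step()
    return model

# Stage 1: base training on the open-source English data set.
model = SpeakerNet()  # hypothetical self-attention speaker-embedding network
model = train_stage(model, english_loader, epochs=50, lr=1e-3,
                    num_speakers=num_english_speakers)

# Stage 2: transfer fine-tuning on the open-source Chinese data set
# (smaller learning rate so that stage-1 knowledge is preserved).
model = train_stage(model, chinese_loader, epochs=20, lr=1e-4,
                    num_speakers=num_chinese_speakers)

# Stage 3: transfer fine-tuning on the application-scenario data set.
final_model = train_stage(model, scene_loader, epochs=10, lr=1e-5,
                          num_speakers=num_scene_speakers)
```

Each stage reuses the previous stage's weights and only replaces the classification head, which is one common way to realize the "transfer fine-tuning" the embodiment describes.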
Example 2:
In this embodiment, optimization is performed on the basis of Embodiment 1. A large amount of open-source English speech data (SITW, VoxCeleb1, VoxCeleb2, and the like) is obtained, and a primary voiceprint base data set is constructed. Because this data set is collected under unconstrained conditions, it provides good robustness to noise, reverberation, and channel variation.
A large amount of open-source Chinese speech data (AISHELL, Primewords, ST-CMDS, THCHS-30, and the like) is obtained to construct a secondary voiceprint base data set; being a Chinese data set, it adapts the model to the pronunciation characteristics of Chinese.
A small amount of application-scenario speech data is collected to construct an application-scenario voiceprint data set; because this data set is collected in the scenario where the method will actually be deployed, it matches the practical application well.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
In this embodiment, optimization is performed on the basis of Embodiment 1 or 2. As shown in FIG. 1 and FIG. 2, data enhancement in the time domain and the frequency domain is performed on the primary base data set, the secondary base data set, and the application-scenario data set. As shown in FIG. 1, the time-domain audio data is enhanced: the tempo and pitch are controlled to adjust the audio speed, and random noise is added. As shown in FIG. 2, the frequency-domain audio data is enhanced: a random warping factor is applied to the spectral features of each audio clip using Vocal Tract Length Perturbation.
The invention obtains the English and Chinese public data sets, collects a small amount of application-scenario data, and enhances all of the data sets along the two dimensions of time domain and frequency domain. This greatly reduces the amount of audio data required while greatly improving the robustness of the algorithm to environment, channel, and speech rate. A minimal sketch of both enhancement steps follows.
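What follows is a minimal, illustrative Python sketch of the two enhancement steps. The librosa calls realize the tempo/pitch adjustment and noise addition; the VTLP function uses a simplified linear frequency warp (full VTLP uses a piecewise-linear map), and the warp-factor range [0.9, 1.1], the noise level, and the pitch-shift range are assumptions, not values specified by the invention.

```python
import numpy as np
import librosa

def augment_time_domain(y, sr, rng=None):
    """Adjust tempo and pitch to vary the audio speed, then add random noise."""
    if rng is None:
        rng = np.random.default_rng()
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    noise = rng.normal(0.0, 0.005, size=y.shape)  # noise level is an assumption
    return y + noise

def augment_vtlp(spec, rng=None):
    """Apply a random warping factor to the frequency axis of a magnitude
    spectrogram `spec` of shape (freq_bins, frames); simplified linear VTLP."""
    if rng is None:
        rng = np.random.default_rng()
    alpha = rng.uniform(0.9, 1.1)  # random warping factor (assumed range)
    bins = np.arange(spec.shape[0], dtype=float)
    # Resample each frame's spectrum at the warped bin positions.
    return np.stack([np.interp(bins * alpha, bins, frame) for frame in spec.T],
                    axis=1)
```

In use, `augment_time_domain` would be applied to each raw waveform before feature extraction, and `augment_vtlp` to the resulting spectral features, with a fresh random factor per utterance.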
The rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
This embodiment is optimized on the basis of any of Embodiments 1 to 3. As shown in FIGS. 3 to 5, the self-attention model is as follows:
a. As shown in FIG. 3, the spatial attention mechanism: a self-attention model is introduced in the spatial dimension, and the audio features that contribute to the recognition result are selected through the autocorrelation of the spatial dimension.
b. As shown in FIG. 4, the channel attention mechanism: a self-attention model is introduced in the feature dimension, and the feature components that contribute to the recognition result are selected through the autocorrelation among feature dimensions.
c. As shown in FIG. 5, two-level attention fusion: the audio features pass through attention over the feature dimension and attention over the spatial dimension in cascade, which improves the feature-extraction ability and enhances robustness to noise, reverberation, and channel variation. A minimal sketch of the two attention modules and their cascade is given below.
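What follows is a minimal, illustrative PyTorch sketch of the two self-attention modules and their cascade, in the style of dual-attention networks: each module forms an autocorrelation matrix over its own dimension and uses it to reweight the features. The tensor layout and the residual connections are assumptions for illustration, not details specified by the patent.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Select useful feature components via autocorrelation among feature dims."""
    def forward(self, x):                        # x: (batch, channels, positions)
        attn = torch.softmax(x @ x.transpose(1, 2), dim=-1)   # (b, ch, ch)
        return x + attn @ x                      # residual connection (assumed)

class SpatialSelfAttention(nn.Module):
    """Select useful positions via autocorrelation of the spatial dimension."""
    def forward(self, x):                        # x: (batch, channels, positions)
        attn = torch.softmax(x.transpose(1, 2) @ x, dim=-1)   # (b, pos, pos)
        return x + x @ attn.transpose(1, 2)      # residual connection (assumed)

class DualAttention(nn.Module):
    """Cascade: feature-dimension attention first, then spatial attention."""
    def __init__(self):
        super().__init__()
        self.channel = ChannelSelfAttention()
        self.spatial = SpatialSelfAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))

# Example: a batch of 8 utterances, 256 feature channels, 200 frames.
out = DualAttention()(torch.randn(8, 256, 200))  # output shape: (8, 256, 200)
```

The cascade order (feature dimension first, then spatial) follows the description in this embodiment and in claim 5.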
Other parts of this embodiment are the same as any of embodiments 1 to 3, and thus are not described again.
Example 5:
This embodiment is optimized on the basis of any of Embodiments 1 to 4. As shown in FIG. 7, voiceprint recognition in practice is strongly channel-dependent. The data and the model are already optimized with this problem in mind, giving the algorithm relatively good generalization across channels; to extend cross-channel generalization further, the invention designs a cross-channel dynamic threshold technique. The decision threshold is controlled according to the difference between the enrollment channel and the verification channel: a higher threshold is selected when both come from the same channel, and a correspondingly lower threshold is selected according to the magnitude of the difference when they come from different channels. The specific threshold values are determined through batch tests.
Cross-channel dynamic threshold: difference thresholds between the enrollment and verification channels are obtained statistically, and the recognition decision is adjusted dynamically according to the enrollment/verification channel difference. By dynamically adjusting the threshold in this way, the generalization ability of the algorithm across channels is greatly extended. A minimal sketch of the decision rule follows.
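What follows is a minimal, illustrative sketch of a channel-dependent decision rule. Cosine scoring, the channel names, and the threshold values are assumptions; the patent states only that a higher threshold is used for same-channel trials and a lower one for cross-channel trials, with the values determined through batch tests.

```python
import numpy as np

# Hypothetical thresholds per (enrollment channel, verification channel) pair,
# as they might be obtained from batch tests on held-out trials.
THRESHOLDS = {
    ("telephone", "telephone"): 0.70,    # same channel: higher threshold
    ("microphone", "microphone"): 0.70,
    ("telephone", "microphone"): 0.55,   # different channels: lower threshold
    ("microphone", "telephone"): 0.55,
}

def verify(enroll_emb, verify_emb, enroll_channel, verify_channel):
    """Accept or reject a trial using a channel-dependent threshold."""
    score = float(np.dot(enroll_emb, verify_emb)
                  / (np.linalg.norm(enroll_emb) * np.linalg.norm(verify_emb)))
    return score >= THRESHOLDS[(enroll_channel, verify_channel)]
```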
Other parts of this embodiment are the same as any of embodiments 1 to 4, and thus are not described again.
The above description presents only preferred embodiments of the present invention and is not intended to limit it in any way; all simple modifications and equivalent variations of the above embodiments made according to the technical spirit of the present invention fall within the scope of the present invention.

Claims (6)

1. A voiceprint recognition method based on self-attention and transfer learning, characterized by comprising the steps of: obtaining open-source English speech data and constructing a primary base data set; obtaining open-source Chinese speech data and constructing a secondary base data set; obtaining application-scenario speech data and constructing an application-scenario data set; training a primary base model on the primary base data set using the self-attention model; performing transfer fine-tuning of the primary base model on the secondary base data set to obtain a secondary base model; and finally performing transfer fine-tuning of the secondary base model on the application-scenario data to obtain a final model adapted to the specific application scenario;
introducing a self-attention model in the spatial dimension, and selecting, through the autocorrelation of the spatial dimension, the audio features that contribute to the recognition result; and introducing a self-attention model in the feature dimension, and selecting, through the autocorrelation among feature dimensions, the feature components that contribute to the recognition result.
2. The voiceprint recognition method based on self-attention and transfer learning of claim 1, wherein data enhancement in the time domain and the frequency domain is performed on the primary base data set, the secondary base data set, and the application-scenario data set.
3. The voiceprint recognition method based on self-attention and transfer learning of claim 2, wherein, in the time domain, the tempo and pitch of the primary base data set, the secondary base data set, and the application-scenario data set are respectively adjusted to vary the audio speed, after which random noise is added; and, in the frequency domain, a random warping factor is applied to the spectral features of each audio clip using vocal tract length perturbation.
4. The voiceprint recognition method based on self-attention and transfer learning of claim 1, wherein the primary base data set is collected under unconstrained conditions.
5. The voiceprint recognition method based on self-attention and transfer learning of any one of claims 1 to 4, wherein the audio features pass through attention over the feature dimension first and then attention over the spatial dimension, in cascade.
6. The voiceprint recognition method based on self-attention and transfer learning of claim 1, wherein the decision threshold is controlled according to the difference between the enrollment channel and the verification channel: a higher threshold is selected when enrollment and verification come from the same channel, and a correspondingly lower threshold is selected according to the magnitude of the difference when they come from different channels.
CN201911150646.7A 2019-11-21 2019-11-21 Voiceprint recognition method based on self-attention and transfer learning Active CN110853653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911150646.7A CN110853653B (en) 2019-11-21 2019-11-21 Voiceprint recognition method based on self-attention and transfer learning


Publications (2)

Publication Number Publication Date
CN110853653A CN110853653A (en) 2020-02-28
CN110853653B (en) 2022-04-12

Family

ID=69603266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911150646.7A Active CN110853653B (en) 2019-11-21 2019-11-21 Voiceprint recognition method based on self-attention and transfer learning

Country Status (1)

Country Link
CN (1) CN110853653B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488058B (en) * 2021-06-23 2023-03-24 武汉理工大学 Voiceprint recognition method based on short voice

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107705791B (en) * 2016-08-08 2021-06-04 中国电信股份有限公司 Incoming call identity confirmation method and device based on voiceprint recognition and voiceprint recognition system
CN107610709B (en) * 2017-08-01 2021-03-19 百度在线网络技术(北京)有限公司 Method and system for training voiceprint recognition model
CN108305296B (en) * 2017-08-30 2021-02-26 深圳市腾讯计算机系统有限公司 Image description generation method, model training method, device and storage medium
CN109637545B (en) * 2019-01-17 2023-05-30 哈尔滨工程大学 Voiceprint recognition method based on one-dimensional convolution asymmetric bidirectional long-short-time memory network
CN110111803B (en) * 2019-05-09 2021-02-19 南京工程学院 Transfer learning voice enhancement method based on self-attention multi-kernel maximum mean difference



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant