CN116072125B - Method and system for constructing self-supervision speaker recognition model in noise environment - Google Patents

Method and system for constructing self-supervision speaker recognition model in noise environment

Info

Publication number
CN116072125B
Authority
CN
China
Prior art keywords
module
extracting
omega
training
voice
Prior art date
Legal status
Active
Application number
CN202310364542.6A
Other languages
Chinese (zh)
Other versions
CN116072125A (en)
Inventor
张葛祥
曾鑫
姚光乐
杨强
方祖林
陈柯屹
Current Assignee
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202310364542.6A
Publication of CN116072125A
Application granted
Publication of CN116072125B
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The application provides a method and a system for constructing a self-supervision speaker recognition model in a noise environment, wherein the method comprises the following steps: S1, randomly intercepting a section of voice; S2, inputting the intercepted voice into a convolution filter layer to obtain a feature map; S3, inputting the feature map into an attention mechanism module and a residual module; S4, inputting the result of S3 into an attention mechanism module and a residual module again; S5, extracting the acoustic features; S6, training a dual encoder using a contrastive learning method; S7, inputting the acoustic features into the dual encoder to obtain feature vectors; S8, extracting feature vectors from all original voices and clustering them to generate pseudo labels; S9, performing supervised training of the dual encoder with the pseudo labels; S10, repeating S7-S9 until the equal error rate no longer decreases, completing the model construction. The application can effectively suppress the noise information present in the channel and spatial dimensions of the acoustic features and reduce the influence of noisy labels on self-supervision speaker recognition accuracy.

Description

Method and system for constructing self-supervision speaker recognition model in noise environment
Technical Field
The application relates to the technical field of speaker recognition, in particular to a method and a system for constructing a self-supervision speaker recognition model in a noise environment.
Background
Speaker recognition, as an important component of biometric recognition, is widely used in the security, medical, financial and smart-home fields. At present, in a quiet laboratory environment with sufficient voice data, speaker recognition technology has achieved satisfactory results. In real-world applications, however, the various noises in the environment and the lack of labeled voice data significantly degrade system performance compared with a clean environment with sufficient labeled voice data, which seriously hinders the practical application of speaker recognition technology.
Most current voice denoising schemes are based on deep neural networks that are large and computationally expensive, which makes them hard to attach to specific tasks such as speaker recognition. Consequently, current speaker recognition algorithms cannot adequately meet the requirements of noisy speaker recognition in real scenes, and their recognition accuracy needs to be improved.
As for self-supervision methods for speaker recognition, most schemes rely on contrastive learning or iterative learning, both of which generate many noisy labels that degrade the final model performance. How to avoid the influence of noisy labels on speaker recognition accuracy is therefore very important.
Disclosure of Invention
The application provides a method and a system for constructing a self-supervision speaker recognition model in a noise environment, which can effectively suppress the noise information present in the channel and spatial dimensions of the acoustic features and reduce the influence of noisy labels on self-supervision speaker recognition accuracy.
One aspect of the embodiment of the application discloses a method for constructing a self-supervision speaker recognition model in a noise environment, which comprises the following steps:
S1, randomly intercepting a section of voice from the original voice;
S2, inputting the intercepted voice into an interpretable convolution filter layer, and outputting a feature map P_x(ω);
S3, inputting P_x(ω) into an attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into a residual module to obtain P_x1(ω);
S4, inputting P_x1(ω) into an attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into a residual module to obtain P_x2(ω);
S5, extracting the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
S6, training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
S7, inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
S8, extracting the feature vectors from all original voices, and then clustering to generate pseudo labels;
S9, performing supervised training of the dual encoder with the pseudo labels;
S10, repeating steps S7-S9 until the equal error rate (EER) no longer decreases, completing the model construction.
In some embodiments, in S1, the original audio is loaded, the original speech data is read, and the sampling frequency of the speech is 16000 Hz.
In some embodiments, in S1, the length of the intercepted section of speech is 4800 ms; if the speech is shorter, zero padding is performed at both ends of the speech.
In some embodiments, in S2, the interpretable convolution filter layer is a band-pass filter G_ω(n, f_1, f_2) = rect(f/2f_2) - rect(f/2f_1), where rect is rectangular band-pass filtering, n is the speech signal length, and f_1 and f_2 are the low cut-off frequency and the high cut-off frequency, respectively.
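As a concrete reading of this formula: rect(f/2f_2) equals 1 for |f| ≤ f_2 and 0 otherwise, so the difference keeps exactly the band between f_1 and f_2. A minimal numpy illustration (a sketch; the cut-off values are made up for the example):

import numpy as np

def rect(x):
    # unit rectangle: 1 inside |x| <= 1/2, else 0
    return (np.abs(x) <= 0.5).astype(float)

f = np.linspace(0, 8000, 8001)                 # frequency axis in Hz
f1, f2 = 300.0, 3400.0                         # example low and high cut-offs
g = rect(f / (2 * f2)) - rect(f / (2 * f1))    # 1 only where f1 < f <= f2
assert g[2000] == 1.0 and g[100] == 0.0 and g[5000] == 0.0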
In some embodiments, in S5, based on a multi-scale method, P_x(ω), P_x1(ω) and P_x2(ω) are spliced in time sequence to extract the acoustic features P_CAM(ω).
Another aspect of the embodiments of the present application discloses a system for constructing a self-supervision speaker recognition model in a noise environment, comprising:
the voice intercepting module is used for intercepting a section of voice randomly in the original voice;
the filtering module is used for inputting the intercepted voice into an interpretable convolution filter layer and outputting a feature map P_x(ω);
the first extraction module is used for inputting P_x(ω) into an attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into a residual module to obtain P_x1(ω);
the second extraction module is used for inputting P_x1(ω) into an attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into a residual module to obtain P_x2(ω);
the third extraction module is used for splicing P_x(ω), P_x1(ω) and P_x2(ω) in time sequence based on a multi-scale method to extract the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
the first training module is used for training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
the fourth extraction module is used for inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
the clustering module is used for extracting the feature vectors from all original voices and then clustering to generate pseudo labels;
the second training module is used for performing supervised training of the dual encoder with the pseudo labels;
and the model construction module is used for repeatedly controlling the fourth extraction module, the clustering module and the second training module to work until the equal error rate (EER) no longer decreases, completing the model construction.
In some embodiments, the self-supervising speaker recognition model construction system in a noisy environment further comprises:
the processor is respectively connected with the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module; a memory coupled to the processor and storing a computer program executable on the processor; when the processor executes the computer program, the processor controls the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module to work so as to realize the self-supervision speaker identification model building method under the noise environment.
In summary, the application has at least the following advantages:
the method improves the existing Sincnet characteristics, reduces the interference of noise information on the acoustic characteristics by using a CBAM module and a multi-scale method, and improves the robustness of the acoustic characteristics. Wherein the CBAM attention mechanism module can effectively suppress noise information present in the acoustic signature channel and space. While acoustic features output at different depths have different implications, such as shallow features may represent speaker speech speed, accent, etc., deep features represent gene contours, etc. Therefore, the extracted shallow layer and deep layer features are spliced in time sequence by using a multi-scale method, and acoustic features with various information can be acquired.
The application builds a dual-encoder network structure and a sample screening strategy on top of the mainstream iterative self-supervised learning method to improve the model's recognition ability. During model training, the two networks in the dual encoder learn different feature extraction capabilities from the same samples and complement each other, while the sample screening strategy filters out most erroneous labels, preventing the model from learning wrong information that would harm its performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a self-supervision speaker recognition model construction in a noise environment according to the present application.
Fig. 2 is a schematic diagram of an equal error rate (EER) curve according to the present application.
Fig. 3 is a schematic diagram of an ECAPA-TDNN network structure according to the present application.
Fig. 4 is a schematic diagram of an MFA-Conformer network structure according to the present application.
Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in numerous different ways without departing from the spirit or scope of the embodiments of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
The following disclosure provides many different implementations, or examples, for implementing different configurations of embodiments of the application. In order to simplify the disclosure of embodiments of the present application, components and arrangements of specific examples are described below. Of course, they are merely examples and are not intended to limit embodiments of the present application. Furthermore, embodiments of the present application may repeat reference numerals and/or letters in the various examples, which are for the purpose of brevity and clarity, and which do not themselves indicate the relationship between the various embodiments and/or arrangements discussed.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an aspect of the embodiment of the present application discloses a method for constructing a self-supervision speaker recognition model in a noise environment, where the speaker recognition model includes an acoustic feature extraction network module and a dual-encoder self-supervision network;
the acoustic feature extraction network module is:
1.1 A section of the input speech is randomly intercepted; its length is 4800 milliseconds, and if the speech is insufficient, zero padding is performed at both ends of the speech.
1.2 The intercepted speech is input into the SincNet filter layer G_ω(n, f_1, f_2) = rect(f/2f_2) - rect(f/2f_1) to obtain a feature map P_x(ω), where rect is rectangular band-pass filtering, n is the speech signal length, and f_1 and f_2 are the low cut-off frequency and the high cut-off frequency, respectively, both of which are learnable.
1.3 The feature map P_x(ω) obtained in 1.2 is input into the CBAM (Convolutional Block Attention Module) attention mechanism module to obtain P_y(ω), which is then input into a residual module (composed of two convolution layers) to obtain P_x1(ω).
1.4 P_x1(ω) is input into the CBAM module to obtain P_y1(ω), which is then input into a residual module to obtain P_x2(ω).
1.5 The acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω) are finally extracted by splicing the three feature maps in time sequence (a code sketch of steps 1.2 to 1.5 is given below).
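A minimal PyTorch sketch of steps 1.2 to 1.5 follows. It uses the published SincNet idea of parameterising each band-pass filter by two learnable cut-offs (realised in the time domain) and the published CBAM design; the class names, layer sizes and the (batch, channels, time) layout are illustrative assumptions, not taken from the patent.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincFilterLayer(nn.Module):
    # Step 1.2: learnable band-pass filter bank; each filter is the difference
    # of two sinc low-pass responses with cut-offs f1 < f2.
    def __init__(self, n_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.f1 = nn.Parameter(torch.linspace(30, 4000, n_filters) / sample_rate)
        self.band = nn.Parameter(torch.full((n_filters,), 200.0 / sample_rate))
        self.register_buffer("n", torch.arange(kernel_size).float() - kernel_size // 2)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                    # x: (B, 1, samples)
        f1 = self.f1.abs()
        f2 = f1 + self.band.abs()
        arg = 2 * math.pi * self.n
        g = (torch.sin(f2.unsqueeze(1) * arg) - torch.sin(f1.unsqueeze(1) * arg)) \
            / (math.pi * self.n + 1e-9)                      # ideal band-pass response
        g[:, self.kernel_size // 2] = 2 * (f2 - f1)          # limit value at the centre tap
        g = g * self.window                                  # smooth the truncation
        return F.conv1d(x, g.unsqueeze(1), padding=self.kernel_size // 2)

class CBAM1d(nn.Module):
    # Steps 1.3/1.4: channel attention then spatial attention on a (B, C, T) map.
    def __init__(self, channels, reduction=8, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 1, bias=False), nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1, bias=False))
        self.spatial = nn.Conv1d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean(2, keepdim=True)) + self.mlp(x.amax(2, keepdim=True)))
        x = x * ca                                           # suppress noisy channels
        sa = torch.sigmoid(self.spatial(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                        # suppress noisy time regions

class ResidualBlock(nn.Module):
    # The residual module: two convolution layers with a skip connection.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

# Step 1.5: splice the shallow and deep maps along the time axis (the reading of
# "spliced in time sequence" assumed here) to form P_CAM(omega).
sinc, cbam1, res1 = SincFilterLayer(), CBAM1d(80), ResidualBlock(80)
cbam2, res2 = CBAM1d(80), ResidualBlock(80)
wave = torch.randn(2, 1, 16000)                              # a one-second dummy batch
p_x = sinc(wave)                                             # P_x(omega)
p_x1 = res1(cbam1(p_x))                                      # P_x1(omega)
p_x2 = res2(cbam2(p_x1))                                     # P_x2(omega)
p_cam = torch.cat([p_x, p_x1, p_x2], dim=2)                  # P_CAM(omega)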
The dual-encoder self-supervision network module works as follows:
2.1 A dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network is trained using a contrastive learning method (a sketch of this step and of the clustering in 2.3 is given after step 2.5).
2.2 The acoustic features P_CAM(ω) are input into the dual encoder to obtain the feature vectors.
2.3 The feature vectors are extracted for all speech in the dataset and then clustered (K-means) to obtain pseudo labels.
2.4 The dual encoder is trained in a supervised way with the pseudo labels; when a sample's loss is greater than a threshold (a hyper-parameter set from experience), no parameter update is performed for that sample.
2.5 Steps 2.2, 2.3 and 2.4 are repeated until the equal error rate (EER) no longer decreases.
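The patent does not spell out the contrastive objective or the clustering details. The sketch below assumes a standard InfoNCE-style loss for step 2.1 (two augmented views of each utterance, one per encoder, with the matching pairs as positives) and scikit-learn K-means for step 2.3; the encoder objects, embedding size and cluster count are illustrative assumptions.

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def contrastive_step(view1, view2, ecapa, conformer, tau=0.07):
    # step 2.1: one augmented view of each utterance through each encoder
    z1 = F.normalize(ecapa(view1), dim=1)                  # (B, D) embeddings
    z2 = F.normalize(conformer(view2), dim=1)              # (B, D)
    logits = z1 @ z2.t() / tau                             # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)

# step 2.3: K-means over the embeddings of all utterances yields the pseudo labels;
# the cluster count is a hyper-parameter (e.g. the estimated number of speakers).
embeddings = np.random.randn(10000, 192).astype(np.float32)   # stand-in for real embeddings
pseudo_labels = KMeans(n_clusters=800, n_init=10, random_state=0).fit_predict(embeddings)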
In a further technical solution, the threshold is dynamically adjusted during the training of the speaker recognition model, taking the values 6, 5, 4, 2 and 1 in turn. Within each iteration, i.e., while the same batch of pseudo labels is used, the threshold is adjusted once the error rate no longer drops and the next iteration is entered; the threshold is selected sequentially from these five values.
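One natural reading of the screening rule in 2.4, sketched here, masks out per-sample losses above the current threshold so that suspected noisy labels contribute no gradient, while the threshold walks down the sequence 6, 5, 4, 2, 1 between iterations. The function and variable names are illustrative.

import torch
import torch.nn.functional as F

THRESHOLDS = [6.0, 5.0, 4.0, 2.0, 1.0]       # the dynamic schedule described above

def screened_loss(logits, pseudo_labels, threshold):
    # per-sample cross entropy against the current pseudo labels
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    keep = per_sample <= threshold            # screen out suspected noisy labels
    if keep.any():
        return per_sample[keep].mean()
    return logits.sum() * 0.0                 # whole batch screened: zero gradient, no update

# outer loop: for each threshold in THRESHOLDS, train with screened_loss until the
# EER on a validation split stops decreasing, then regenerate the pseudo labels.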
In a specific embodiment, a method for constructing a self-supervision speaker recognition model in a noise environment includes the following steps:
1. The original audio is loaded and the original voice data is read; the sampling frequency of the voice in this embodiment is 16000 Hz.
2. Sending the read voice data into an acoustic feature extraction network module, wherein the specific steps are as follows:
a1: a section of input speech is randomly intercepted, the length of the section is 4800 milliseconds, and if the section is insufficient, zero padding is carried out on two ends of the speech.
A2: inputting intercepted voice into a band-pass filter (SincNet filter) layerObtaining a characteristic diagramWhere rect is rectangular bandpass filtering, n is the speech signal length,andthe low cut-off frequency and the high cut-off frequency, respectively, are both learnable.
A3: the obtained characteristic diagramInput to the CBAM module to obtainThen input into a residual error module (composed of two layers of convolution networks) to obtainThe CBAM attention mechanism module may effectively suppress noise information present in the acoustic signature path and space.
A4: will beInput to the CBAM module to obtainThen input into a residual error module to obtain
A5: the acoustic features with various information can be obtained by splicing the extracted shallow layer and deep layer features in time sequence by using a multi-scale method, namely splicing in time sequenceAndextracting acoustic features
3. The extracted acoustic features are sent to the dual-encoder module for speaker feature vector extraction, where the dual-encoder module consists of an ECAPA-TDNN and an MFA-Conformer.
4. All speaker feature vectors of the dataset are clustered using the k-means algorithm to generate pseudo labels.
The acoustic features are trained in the dual-encoder module to obtain the speaker feature vectors. During model training, the two networks in the dual encoder learn different feature extraction capabilities from the same samples, forming complementary advantages, while the sample screening strategy filters out most erroneous labels, preventing the model from learning wrong information that would harm its performance. The ECAPA-TDNN and MFA-Conformer network structures are shown in fig. 3 and fig. 4, respectively.
5. When the method is applied, the trained speaker recognition model is used to obtain the speaker's feature vector, and cosine similarity is computed between this feature vector and the feature vectors already in the database, as shown in the following formula:
d = (x · y) / (||x|| ||y||)
where d is the cosine distance between the extracted feature vector x and a database feature vector y, and the speaker is determined according to d and a decision threshold.
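A short numpy sketch of this scoring step (the 0.6 acceptance threshold is purely illustrative; in practice it is tuned on development data):

import numpy as np

def cosine_score(x, y):
    # d = <x, y> / (||x|| * ||y||); higher means more likely the same speaker
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

test_emb = np.random.randn(192)               # stand-ins for real speaker embeddings
enrolled_emb = np.random.randn(192)
same_speaker = cosine_score(test_emb, enrolled_emb) > 0.6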
Simulation experiment:
the speaker data set used was Free ST Chinese Mandarin Corpus Chinese data set, the Noise data set was Noise92 Noise data set, free ST Chinese Mandarin Corpus Chinese data set was itself clean speech data set, and the factor factory Noise in Noise92 Noise data set and Free ST ChineseMandarin Corpus data set were selected to be Noise-containing data set with a signal-to-Noise ratio of 10 dB. The Free ST Chinese Mandarin Corpus dataset contained 855 people, 120 voices per person, using 90% of the 855 people as training sets and 10% as test sets. The error rate of the test result and the like obtained by using the method reaches 3.94%, and the error rate conversion curve is shown in fig. 2, wherein the error rate is improved by 16.7% compared with the test result and the like without using an acoustic feature extraction network and a double encoder module.
Another aspect of the embodiments of the present application discloses a system for constructing a self-supervision speaker recognition model in a noise environment, comprising:
the voice intercepting module is used for intercepting a section of voice randomly in the original voice;
the filtering module is used for inputting the intercepted voice into an interpretable convolution filter layer and outputting a feature map P_x(ω);
the first extraction module is used for inputting P_x(ω) into an attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into a residual module to obtain P_x1(ω);
the second extraction module is used for inputting P_x1(ω) into an attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into a residual module to obtain P_x2(ω);
the third extraction module is used for splicing P_x(ω), P_x1(ω) and P_x2(ω) in time sequence based on a multi-scale method to extract the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
the first training module is used for training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
the fourth extraction module is used for inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
the clustering module is used for extracting the feature vectors from all original voices and then clustering to generate pseudo labels;
the second training module is used for performing supervised training of the dual encoder with the pseudo labels;
and the model construction module is used for repeatedly controlling the fourth extraction module, the clustering module and the second training module to work until the equal error rate (EER) no longer decreases, completing the model construction.
In some embodiments, the self-supervising speaker recognition model construction system in a noisy environment further comprises:
the processor is respectively connected with the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module; a memory coupled to the processor and storing a computer program executable on the processor; when the processor executes the computer program, the processor controls the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module to work so as to realize the self-supervision speaker identification model building method under the noise environment.
The above embodiments are provided to illustrate the present application and not to limit the present application, so that the modification of the exemplary values or the replacement of equivalent elements should still fall within the scope of the present application.
From the foregoing detailed description, it will be apparent to those skilled in the art that the present application can be practiced without these specific details, and that the present application meets the requirements of the patent statutes.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application. The foregoing description of the preferred embodiment of the application is not intended to be limiting, but rather to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.
It should be noted that the above description of the flow is only for the purpose of illustration and description, and does not limit the application scope of the present specification. Various modifications and changes to the flow may be made by those skilled in the art under the guidance of this specification. However, such modifications and variations are still within the scope of the present description.
While the basic concepts have been described above, it will be apparent to those of ordinary skill in the art after reading this application that the above disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations of the application may occur to one of ordinary skill in the art. Such modifications, improvements, and modifications are intended to be suggested within the present disclosure, and therefore, such modifications, improvements, and adaptations are intended to be within the spirit and scope of the exemplary embodiments of the present disclosure.
Meanwhile, the present application uses specific words to describe embodiments of the present application. For example, "one embodiment," "an embodiment," and/or "some embodiments" means a particular feature, structure, or characteristic in connection with at least one embodiment of the application. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the application may be combined as suitable.
Furthermore, those of ordinary skill in the art will appreciate that aspects of the application are illustrated and described in the context of a number of patentable categories or conditions, including any novel and useful processes, machines, products, or materials, or any novel and useful improvements thereof. Accordingly, aspects of the present application may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or a combination of hardware and software. The above hardware or software may be referred to as a "unit," module, "or" system. Furthermore, aspects of the present application may take the form of a computer program product embodied in one or more computer-readable media, wherein the computer-readable program code is embodied therein.
Computer program code required for operation of portions of the present application may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP and ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), the connection may be made to an external computer (for example, through the Internet), or a service such as software as a service (SaaS) may be used in a cloud computing environment.
Furthermore, the order in which the elements and sequences are presented, the use of numerical letters, or other designations are used in the application is not intended to limit the sequence of the processes and methods unless specifically recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of example, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the application. For example, while the implementation of the various components described above may be embodied in a hardware device, it may also be implemented as a purely software solution, e.g., an installation on an existing server or mobile device.
Likewise, it should be noted that in order to simplify the presentation of the disclosure and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, the inventive subject matter should be provided with fewer features than the single embodiments described above.

Claims (6)

1. A method for constructing a self-supervision speaker recognition model in a noise environment is characterized by comprising the following steps:
S1, randomly intercepting a section of voice from the original voice;
S2, inputting the intercepted voice into an interpretable convolution filter layer, and outputting a feature map P_x(ω);
S3, inputting P_x(ω) into the attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into the residual module to obtain P_x1(ω);
S4, inputting P_x1(ω) into the attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into the residual module to obtain P_x2(ω);
S5, splicing P_x(ω), P_x1(ω) and P_x2(ω) in time sequence based on a multi-scale method to extract the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
S6, training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
S7, inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
S8, extracting the feature vectors from all original voices, and then clustering to generate pseudo labels;
S9, performing supervised training of the dual encoder with the pseudo labels;
S10, repeating steps S7-S9 until the equal error rate (EER) no longer decreases, completing the model construction.
2. The method for constructing a self-supervision speaker recognition model in a noise environment according to claim 1, wherein in S1, the original audio is loaded, the original speech data is read, and the sampling frequency of the speech is 16000 Hz.
3. The method of claim 1, wherein in S1, the length of the intercepted section of speech is 4800 ms, and if the speech is insufficient, zero padding is performed at both ends of the speech.
4. The method of claim 1, wherein in S2, the interpretable convolution filter layer is a band-pass filter G_ω(n, f_1, f_2) = rect(f/2f_2) - rect(f/2f_1), where rect is rectangular band-pass filtering, n is the speech signal length, and f_1 and f_2 are the low cut-off frequency and the high cut-off frequency, respectively.
5. A system for constructing a self-supervising speaker recognition model in a noisy environment, comprising:
the voice intercepting module is used for intercepting a section of voice randomly in the original voice;
a filtering module for inputting the intercepted voice into an interpretable convolution filter layer and outputting a feature map P_x(ω);
a first extraction module for inputting P_x(ω) into the attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into the residual module to obtain P_x1(ω);
a second extraction module for inputting P_x1(ω) into the attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into the residual module to obtain P_x2(ω);
a third extraction module for splicing P_x(ω), P_x1(ω) and P_x2(ω) in time sequence based on a multi-scale method to extract the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
a first training module for training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
a fourth extraction module for inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
a clustering module for extracting the feature vectors from all original voices and then clustering to generate pseudo labels;
a second training module for performing supervised training of the dual encoder with the pseudo labels;
and a model construction module for repeatedly controlling the fourth extraction module, the clustering module and the second training module to work until the equal error rate (EER) no longer decreases, completing the model construction.
6. The system for constructing a self-supervising speaker recognition model in a noisy environment of claim 5, further comprising:
the processor is respectively connected with the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module;
a memory coupled to the processor and storing a computer program executable on the processor; when the processor executes the computer program, the processor controls the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module to work so as to realize the self-supervision speaker identification model building method in the noise environment according to any one of claims 1 to 4.
CN202310364542.6A 2023-04-07 2023-04-07 Method and system for constructing self-supervision speaker recognition model in noise environment Active CN116072125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310364542.6A CN116072125B (en) 2023-04-07 2023-04-07 Method and system for constructing self-supervision speaker recognition model in noise environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310364542.6A CN116072125B (en) 2023-04-07 2023-04-07 Method and system for constructing self-supervision speaker recognition model in noise environment

Publications (2)

Publication Number Publication Date
CN116072125A CN116072125A (en) 2023-05-05
CN116072125B (en) 2023-10-17

Family

ID=86177183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310364542.6A Active CN116072125B (en) 2023-04-07 2023-04-07 Method and system for constructing self-supervision speaker recognition model in noise environment

Country Status (1)

Country Link
CN (1) CN116072125B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114038469B (en) * 2021-08-03 2023-06-20 成都理工大学 Speaker identification method based on multi-class spectrogram characteristic attention fusion network


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11557283B2 (en) * 2021-03-26 2023-01-17 Mitsubishi Electric Research Laboratories, Inc. Artificial intelligence system for capturing context by dilated self-attention

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461173A (en) * 2020-03-06 2020-07-28 华南理工大学 Attention mechanism-based multi-speaker clustering system and method
CN111145760A (en) * 2020-04-02 2020-05-12 支付宝(杭州)信息技术有限公司 Method and neural network model for speaker recognition
WO2022122121A1 (en) * 2020-12-08 2022-06-16 Huawei Technologies Co., Ltd. End-to-end streaming acoustic trigger apparatus and method
CN114464195A (en) * 2021-12-23 2022-05-10 厦门快商通科技股份有限公司 Voiceprint recognition model training method and device for self-supervision learning and readable medium
CN114678030A (en) * 2022-03-17 2022-06-28 重庆邮电大学 Voiceprint identification method and device based on depth residual error network and attention mechanism
CN115050373A (en) * 2022-04-29 2022-09-13 思必驰科技股份有限公司 Dual path embedded learning method, electronic device, and storage medium
CN115101076A (en) * 2022-05-26 2022-09-23 燕山大学 Speaker clustering method based on multi-scale channel separation convolution characteristic extraction
CN115116446A (en) * 2022-06-21 2022-09-27 成都理工大学 Method for constructing speaker recognition model in noise environment
CN114898772A (en) * 2022-06-22 2022-08-12 辽宁工程技术大学 Method for classifying acoustic scenes based on feature layering and improved ECAPA-TDNN
CN115602152A (en) * 2022-12-14 2023-01-13 成都启英泰伦科技有限公司(Cn) Voice enhancement method based on multi-stage attention network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification; Yang Zhang, et al.; https://arxiv.org/abs/2203.15249; full text *
Underwater acoustic target recognition technology based on MFA-Conformer; XuYang Wang, et al.; 2022 2nd International Conference on Electronic Information Engineering and Computer Technology (EIECT); full text *
Research on speech denoising methods based on deep learning; 李蕊 (Li Rui); China Master's Theses Full-text Database (Information Science and Technology), No. 12; full text *

Also Published As

Publication number Publication date
CN116072125A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
US20210287663A1 (en) Method and apparatus with a personalized speech recognition model
US10957309B2 (en) Neural network method and apparatus
CN111160533B (en) Neural network acceleration method based on cross-resolution knowledge distillation
WO2019179036A1 (en) Deep neural network model, electronic device, identity authentication method, and storage medium
US11875799B2 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2019179029A1 (en) Electronic device, identity verification method and computer-readable storage medium
US10026395B1 (en) Methods and systems for extracting auditory features with neural networks
US20220208198A1 (en) Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
WO2019136909A1 (en) Voice living-body detection method based on deep learning, server and storage medium
CN111542841A (en) System and method for content identification
CN116072125B (en) Method and system for constructing self-supervision speaker recognition model in noise environment
US11545154B2 (en) Method and apparatus with registration for speaker recognition
US20200293886A1 (en) Authentication method and apparatus with transformation model
EP3674974A1 (en) Apparatus and method with user verification
CN113362822B (en) Black box voice confrontation sample generation method with auditory masking
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN111933148A (en) Age identification method and device based on convolutional neural network and terminal
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
US11397868B2 (en) Fungal identification by pattern recognition
CN111524522B (en) Voiceprint recognition method and system based on fusion of multiple voice features
WO2020238681A1 (en) Audio processing method and device, and man-machine interactive system
Lei et al. Identity vector extraction by perceptual wavelet packet entropy and convolutional neural network for voice authentication
CN113763966B (en) End-to-end text irrelevant voiceprint recognition method and system
JP2016162437A (en) Pattern classification device, pattern classification method and pattern classification program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant