CN116072125B - Method and system for constructing self-supervision speaker recognition model in noise environment - Google Patents

Method and system for constructing self-supervision speaker recognition model in noise environment

Info

Publication number
CN116072125B
Authority
CN
China
Prior art keywords
module
extracting
omega
training
voice
Prior art date
Legal status
Active
Application number
CN202310364542.6A
Other languages
Chinese (zh)
Other versions
CN116072125A (en)
Inventor
张葛祥
曾鑫
姚光乐
杨强
方祖林
陈柯屹
Current Assignee
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202310364542.6A
Publication of CN116072125A
Application granted
Publication of CN116072125B
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The application provides a method and a system for constructing a self-supervision speaker recognition model in a noise environment, wherein the method comprises the following steps: S1, randomly intercepting a section of voice; S2, inputting the intercepted voice into a convolution filter layer to obtain a feature map; S3, inputting the feature map into an attention mechanism module and a residual module; S4, inputting the result of S3 into an attention mechanism module and a residual module again; S5, extracting the acoustic features; S6, training a dual encoder using a contrastive learning method; S7, inputting the acoustic features into the dual encoder to obtain feature vectors; S8, extracting feature vectors from all original voices and clustering them to generate pseudo labels; S9, performing supervised training of the dual encoder with the pseudo labels; S10, repeating S7-S9 until the equal error rate no longer decreases, completing the model construction. The application can effectively suppress the noise information present in the channel and spatial dimensions of the acoustic features and reduce the influence of noisy labels on self-supervision speaker recognition accuracy.

Description

Method and system for constructing self-supervision speaker recognition model in noise environment
Technical Field
The application relates to the technical field of speaker recognition, in particular to a method and a system for constructing a self-supervision speaker recognition model in a noise environment.
Background
Speaker recognition, as an important component of biometric recognition, is widely used in the security, medical, financial and smart-home fields. At present, in a quiet laboratory environment with sufficient voice data, speaker recognition technology has achieved satisfactory results. In real-world applications, however, the various noises in the environment and the lack of labeled voice data significantly degrade system performance compared with a clean environment with sufficient labeled voice data, which seriously hinders the practical application of speaker recognition technology.
Most current voice denoising schemes are based on deep neural networks that are large and computationally expensive, which makes them hard to attach to specific tasks such as speaker recognition. Consequently, current speaker recognition algorithms cannot adequately meet the requirements of noisy speaker recognition in real scenes, and their recognition accuracy needs to be improved.
As for self-supervision methods for speaker recognition, most schemes rely on contrastive learning or iterative learning, both of which generate many noisy labels that degrade the final model performance. How to avoid the influence of noisy labels on speaker recognition accuracy is therefore very important.
Disclosure of Invention
The application provides a method and a system for constructing a self-supervision speaker recognition model in a noise environment, which can effectively suppress the noise information present in the channel and spatial dimensions of the acoustic features and reduce the influence of noisy labels on self-supervision speaker recognition accuracy.
One aspect of the embodiment of the application discloses a method for constructing a self-supervision speaker recognition model in a noise environment, which comprises the following steps:
S1, randomly intercepting a section of voice from the original voice;
S2, inputting the intercepted voice into an interpretable convolution filter layer, and outputting a feature map P_x(ω);
S3, inputting P_x(ω) into an attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into a residual module to obtain P_x1(ω);
S4, inputting P_x1(ω) into an attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into a residual module to obtain P_x2(ω);
S5, extracting the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
S6, training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
S7, inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
S8, extracting the feature vectors from all original voices, and then clustering to generate pseudo labels;
S9, performing supervised training of the dual encoder with the pseudo labels;
S10, repeating steps S7-S9 until the equal error rate (EER) no longer decreases, completing the model construction.
In some embodiments, in S1, the original audio is loaded, the original speech data is read, and the sampling frequency of the speech is 16000 Hz.
In some embodiments, in S1, the length of the intercepted section of speech is 4800 ms; if the speech is shorter, zero padding is performed at both ends of the speech.
In some embodiments, in S2, the interpretable convolution filter layer is a band-pass filter G_ω(n, f_1, f_2) = rect(f/2f_2) - rect(f/2f_1), where rect is rectangular band-pass filtering, n is the speech signal length, and f_1 and f_2 are the low cut-off frequency and the high cut-off frequency, respectively.
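As a concrete reading of this formula: rect(f/2f_2) equals 1 for |f| ≤ f_2 and 0 otherwise, so the difference keeps exactly the band between f_1 and f_2. A minimal numpy illustration (a sketch; the cut-off values are made up for the example):

import numpy as np

def rect(x):
    # unit rectangle: 1 inside |x| <= 1/2, else 0
    return (np.abs(x) <= 0.5).astype(float)

f = np.linspace(0, 8000, 8001)                 # frequency axis in Hz
f1, f2 = 300.0, 3400.0                         # example low and high cut-offs
g = rect(f / (2 * f2)) - rect(f / (2 * f1))    # 1 only where f1 < f <= f2
assert g[2000] == 1.0 and g[100] == 0.0 and g[5000] == 0.0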
In some embodiments, in S5, based on a multi-scale method, P_x(ω), P_x1(ω) and P_x2(ω) are spliced in time sequence to extract the acoustic features P_CAM(ω).
Another aspect of the embodiments of the present application discloses a system for constructing a self-supervision speaker recognition model in a noise environment, comprising:
the voice intercepting module is used for intercepting a section of voice randomly in the original voice;
the filtering module is used for inputting the intercepted voice into an interpretable convolution filter layer and outputting a feature map P_x(ω);
the first extraction module is used for inputting P_x(ω) into an attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into a residual module to obtain P_x1(ω);
the second extraction module is used for inputting P_x1(ω) into an attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into a residual module to obtain P_x2(ω);
the third extraction module is used for splicing P_x(ω), P_x1(ω) and P_x2(ω) in time sequence based on a multi-scale method to extract the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
the first training module is used for training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
the fourth extraction module is used for inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
the clustering module is used for extracting the feature vectors from all original voices and then clustering to generate pseudo labels;
the second training module is used for performing supervised training of the dual encoder with the pseudo labels;
and the model construction module is used for repeatedly controlling the fourth extraction module, the clustering module and the second training module to work until the equal error rate (EER) no longer decreases, completing the model construction.
In some embodiments, the self-supervising speaker recognition model construction system in a noisy environment further comprises:
the processor is respectively connected with the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module; a memory coupled to the processor and storing a computer program executable on the processor; when the processor executes the computer program, the processor controls the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module to work so as to realize the self-supervision speaker identification model building method under the noise environment.
In summary, the application has at least the following advantages:
the method improves the existing Sincnet characteristics, reduces the interference of noise information on the acoustic characteristics by using a CBAM module and a multi-scale method, and improves the robustness of the acoustic characteristics. Wherein the CBAM attention mechanism module can effectively suppress noise information present in the acoustic signature channel and space. While acoustic features output at different depths have different implications, such as shallow features may represent speaker speech speed, accent, etc., deep features represent gene contours, etc. Therefore, the extracted shallow layer and deep layer features are spliced in time sequence by using a multi-scale method, and acoustic features with various information can be acquired.
The application builds a dual-encoder network structure and a sample screening strategy on top of the mainstream iterative self-supervised learning method to improve the model's recognition ability. During model training, the two networks in the dual encoder learn different feature extraction capabilities from the same samples and complement each other, while the sample screening strategy filters out most erroneous labels, preventing the model from learning wrong information that would harm its performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a self-supervision speaker recognition model construction in a noise environment according to the present application.
Fig. 2 is a schematic diagram of an equal error rate (EER) curve according to the present application.
Fig. 3 is a schematic diagram of an ECAPA-TDNN network structure according to the present application.
Fig. 4 is a schematic diagram of an MFA-Conformer network structure according to the present application.
Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in numerous different ways without departing from the spirit or scope of the embodiments of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
The following disclosure provides many different implementations, or examples, for implementing different configurations of embodiments of the application. In order to simplify the disclosure of embodiments of the present application, components and arrangements of specific examples are described below. Of course, they are merely examples and are not intended to limit embodiments of the present application. Furthermore, embodiments of the present application may repeat reference numerals and/or letters in the various examples, which are for the purpose of brevity and clarity, and which do not themselves indicate the relationship between the various embodiments and/or arrangements discussed.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an aspect of the embodiment of the present application discloses a method for constructing a self-supervision speaker recognition model in a noise environment, where the speaker recognition model includes an acoustic feature extraction network module and a dual-encoder self-supervision network;
the acoustic feature extraction network module is:
1.1 A section of the input speech is randomly intercepted; its length is 4800 milliseconds, and if the speech is insufficient, zero padding is performed at both ends of the speech.
1.2 The intercepted speech is input into the SincNet filter layer G_ω(n, f_1, f_2) = rect(f/2f_2) - rect(f/2f_1) to obtain a feature map P_x(ω), where rect is rectangular band-pass filtering, n is the speech signal length, and f_1 and f_2 are the low cut-off frequency and the high cut-off frequency, respectively, both of which are learnable.
1.3 The feature map P_x(ω) obtained in 1.2 is input into the CBAM (Convolutional Block Attention Module) attention mechanism module to obtain P_y(ω), which is then input into a residual module (composed of two convolution layers) to obtain P_x1(ω).
1.4 P_x1(ω) is input into the CBAM module to obtain P_y1(ω), which is then input into a residual module to obtain P_x2(ω).
1.5 The acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω) are finally extracted by splicing the three feature maps in time sequence (a code sketch of steps 1.2 to 1.5 is given below).
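A minimal PyTorch sketch of steps 1.2 to 1.5 follows. It uses the published SincNet idea of parameterising each band-pass filter by two learnable cut-offs (realised in the time domain) and the published CBAM design; the class names, layer sizes and the (batch, channels, time) layout are illustrative assumptions, not taken from the patent.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincFilterLayer(nn.Module):
    # Step 1.2: learnable band-pass filter bank; each filter is the difference
    # of two sinc low-pass responses with cut-offs f1 < f2.
    def __init__(self, n_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.f1 = nn.Parameter(torch.linspace(30, 4000, n_filters) / sample_rate)
        self.band = nn.Parameter(torch.full((n_filters,), 200.0 / sample_rate))
        self.register_buffer("n", torch.arange(kernel_size).float() - kernel_size // 2)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                    # x: (B, 1, samples)
        f1 = self.f1.abs()
        f2 = f1 + self.band.abs()
        arg = 2 * math.pi * self.n
        g = (torch.sin(f2.unsqueeze(1) * arg) - torch.sin(f1.unsqueeze(1) * arg)) \
            / (math.pi * self.n + 1e-9)                      # ideal band-pass response
        g[:, self.kernel_size // 2] = 2 * (f2 - f1)          # limit value at the centre tap
        g = g * self.window                                  # smooth the truncation
        return F.conv1d(x, g.unsqueeze(1), padding=self.kernel_size // 2)

class CBAM1d(nn.Module):
    # Steps 1.3/1.4: channel attention then spatial attention on a (B, C, T) map.
    def __init__(self, channels, reduction=8, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 1, bias=False), nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1, bias=False))
        self.spatial = nn.Conv1d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean(2, keepdim=True)) + self.mlp(x.amax(2, keepdim=True)))
        x = x * ca                                           # suppress noisy channels
        sa = torch.sigmoid(self.spatial(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                        # suppress noisy time regions

class ResidualBlock(nn.Module):
    # The residual module: two convolution layers with a skip connection.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

# Step 1.5: splice the shallow and deep maps along the time axis (the reading of
# "spliced in time sequence" assumed here) to form P_CAM(omega).
sinc, cbam1, res1 = SincFilterLayer(), CBAM1d(80), ResidualBlock(80)
cbam2, res2 = CBAM1d(80), ResidualBlock(80)
wave = torch.randn(2, 1, 16000)                              # a one-second dummy batch
p_x = sinc(wave)                                             # P_x(omega)
p_x1 = res1(cbam1(p_x))                                      # P_x1(omega)
p_x2 = res2(cbam2(p_x1))                                     # P_x2(omega)
p_cam = torch.cat([p_x, p_x1, p_x2], dim=2)                  # P_CAM(omega)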
The dual-encoder self-supervision network module works as follows:
2.1 A dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network is trained using a contrastive learning method (a sketch of this step and of the clustering in 2.3 is given after step 2.5).
2.2 The acoustic features P_CAM(ω) are input into the dual encoder to obtain the feature vectors.
2.3 The feature vectors are extracted for all speech in the dataset and then clustered (K-means) to obtain pseudo labels.
2.4 The dual encoder is trained in a supervised way with the pseudo labels; when a sample's loss is greater than a threshold (a hyper-parameter set from experience), no parameter update is performed for that sample.
2.5 Steps 2.2, 2.3 and 2.4 are repeated until the equal error rate (EER) no longer decreases.
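The patent does not spell out the contrastive objective or the clustering details. The sketch below assumes a standard InfoNCE-style loss for step 2.1 (two augmented views of each utterance, one per encoder, with the matching pairs as positives) and scikit-learn K-means for step 2.3; the encoder objects, embedding size and cluster count are illustrative assumptions.

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def contrastive_step(view1, view2, ecapa, conformer, tau=0.07):
    # step 2.1: one augmented view of each utterance through each encoder
    z1 = F.normalize(ecapa(view1), dim=1)                  # (B, D) embeddings
    z2 = F.normalize(conformer(view2), dim=1)              # (B, D)
    logits = z1 @ z2.t() / tau                             # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)

# step 2.3: K-means over the embeddings of all utterances yields the pseudo labels;
# the cluster count is a hyper-parameter (e.g. the estimated number of speakers).
embeddings = np.random.randn(10000, 192).astype(np.float32)   # stand-in for real embeddings
pseudo_labels = KMeans(n_clusters=800, n_init=10, random_state=0).fit_predict(embeddings)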
In a further technical solution, the threshold is dynamically adjusted during the training of the speaker recognition model, taking the values 6, 5, 4, 2 and 1 in turn. Within each iteration, i.e., while the same batch of pseudo labels is used, the threshold is adjusted once the error rate no longer drops and the next iteration is entered; the threshold is selected sequentially from these five values.
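One natural reading of the screening rule in 2.4, sketched here, masks out per-sample losses above the current threshold so that suspected noisy labels contribute no gradient, while the threshold walks down the sequence 6, 5, 4, 2, 1 between iterations. The function and variable names are illustrative.

import torch
import torch.nn.functional as F

THRESHOLDS = [6.0, 5.0, 4.0, 2.0, 1.0]       # the dynamic schedule described above

def screened_loss(logits, pseudo_labels, threshold):
    # per-sample cross entropy against the current pseudo labels
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    keep = per_sample <= threshold            # screen out suspected noisy labels
    if keep.any():
        return per_sample[keep].mean()
    return logits.sum() * 0.0                 # whole batch screened: zero gradient, no update

# outer loop: for each threshold in THRESHOLDS, train with screened_loss until the
# EER on a validation split stops decreasing, then regenerate the pseudo labels.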
In a specific embodiment, a method for constructing a self-supervision speaker recognition model in a noise environment includes the following steps:
1. The original audio is loaded and the original voice data is read; the sampling frequency of the voice in this embodiment is 16000 Hz.
2. Sending the read voice data into an acoustic feature extraction network module, wherein the specific steps are as follows:
a1: a section of input speech is randomly intercepted, the length of the section is 4800 milliseconds, and if the section is insufficient, zero padding is carried out on two ends of the speech.
A2: inputting intercepted voice into a band-pass filter (SincNet filter) layerObtaining a characteristic diagramWhere rect is rectangular bandpass filtering, n is the speech signal length,andthe low cut-off frequency and the high cut-off frequency, respectively, are both learnable.
A3: the obtained characteristic diagramInput to the CBAM module to obtainThen input into a residual error module (composed of two layers of convolution networks) to obtainThe CBAM attention mechanism module may effectively suppress noise information present in the acoustic signature path and space.
A4: will beInput to the CBAM module to obtainThen input into a residual error module to obtain
A5: the acoustic features with various information can be obtained by splicing the extracted shallow layer and deep layer features in time sequence by using a multi-scale method, namely splicing in time sequenceAndextracting acoustic features
3. The extracted acoustic features are sent to the dual-encoder module for speaker feature vector extraction, where the dual-encoder module consists of an ECAPA-TDNN and an MFA-Conformer.
4. All speaker feature vectors of the dataset are clustered using the k-means algorithm to generate pseudo labels.
The acoustic features are trained in the dual-encoder module to obtain the speaker feature vectors. During model training, the two networks in the dual encoder learn different feature extraction capabilities from the same samples, forming complementary advantages, while the sample screening strategy filters out most erroneous labels, preventing the model from learning wrong information that would harm its performance. The ECAPA-TDNN and MFA-Conformer network structures are shown in fig. 3 and fig. 4, respectively.
5. When the method is applied, the trained speaker recognition model is used to obtain the speaker's feature vector, and cosine similarity is computed between this feature vector and the feature vectors already in the database, as shown in the following formula:
d = (x · y) / (||x|| ||y||)
where d is the cosine distance between the extracted feature vector x and a database feature vector y, and the speaker is determined according to d and a decision threshold.
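A short numpy sketch of this scoring step (the 0.6 acceptance threshold is purely illustrative; in practice it is tuned on development data):

import numpy as np

def cosine_score(x, y):
    # d = <x, y> / (||x|| * ||y||); higher means more likely the same speaker
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

test_emb = np.random.randn(192)               # stand-ins for real speaker embeddings
enrolled_emb = np.random.randn(192)
same_speaker = cosine_score(test_emb, enrolled_emb) > 0.6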
Simulation experiment:
the speaker data set used was Free ST Chinese Mandarin Corpus Chinese data set, the Noise data set was Noise92 Noise data set, free ST Chinese Mandarin Corpus Chinese data set was itself clean speech data set, and the factor factory Noise in Noise92 Noise data set and Free ST ChineseMandarin Corpus data set were selected to be Noise-containing data set with a signal-to-Noise ratio of 10 dB. The Free ST Chinese Mandarin Corpus dataset contained 855 people, 120 voices per person, using 90% of the 855 people as training sets and 10% as test sets. The error rate of the test result and the like obtained by using the method reaches 3.94%, and the error rate conversion curve is shown in fig. 2, wherein the error rate is improved by 16.7% compared with the test result and the like without using an acoustic feature extraction network and a double encoder module.
Another aspect of the embodiments of the present application discloses a system for constructing a self-supervision speaker recognition model in a noise environment, comprising:
the voice intercepting module is used for intercepting a section of voice randomly in the original voice;
the filtering module is used for inputting the intercepted voice into an interpretable convolution filter layer and outputting a feature map P_x(ω);
the first extraction module is used for inputting P_x(ω) into an attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into a residual module to obtain P_x1(ω);
the second extraction module is used for inputting P_x1(ω) into an attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into a residual module to obtain P_x2(ω);
the third extraction module is used for splicing P_x(ω), P_x1(ω) and P_x2(ω) in time sequence based on a multi-scale method to extract the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
the first training module is used for training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
the fourth extraction module is used for inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
the clustering module is used for extracting the feature vectors from all original voices and then clustering to generate pseudo labels;
the second training module is used for performing supervised training of the dual encoder with the pseudo labels;
and the model construction module is used for repeatedly controlling the fourth extraction module, the clustering module and the second training module to work until the equal error rate (EER) no longer decreases, completing the model construction.
In some embodiments, the self-supervising speaker recognition model construction system in a noisy environment further comprises:
the processor is respectively connected with the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module; a memory coupled to the processor and storing a computer program executable on the processor; when the processor executes the computer program, the processor controls the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module to work so as to realize the self-supervision speaker identification model building method under the noise environment.
The above embodiments are provided to illustrate the present application and not to limit the present application, so that the modification of the exemplary values or the replacement of equivalent elements should still fall within the scope of the present application.
From the foregoing detailed description, it will be apparent to those skilled in the art that the present application can be practiced without these specific details, and that the present application meets the requirements of the patent statutes.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application. The foregoing description of the preferred embodiment of the application is not intended to be limiting, but rather to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.
It should be noted that the above description of the flow is only for the purpose of illustration and description, and does not limit the application scope of the present specification. Various modifications and changes to the flow may be made by those skilled in the art under the guidance of this specification. However, such modifications and variations are still within the scope of the present description.
While the basic concepts have been described above, it will be apparent to those of ordinary skill in the art after reading this application that the above disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations of the application may occur to one of ordinary skill in the art. Such modifications, improvements, and modifications are intended to be suggested within the present disclosure, and therefore, such modifications, improvements, and adaptations are intended to be within the spirit and scope of the exemplary embodiments of the present disclosure.
Meanwhile, the present application uses specific words to describe embodiments of the present application. For example, "one embodiment," "an embodiment," and/or "some embodiments" means a particular feature, structure, or characteristic in connection with at least one embodiment of the application. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the application may be combined as suitable.
Furthermore, those of ordinary skill in the art will appreciate that aspects of the application are illustrated and described in the context of a number of patentable categories or conditions, including any novel and useful processes, machines, products, or materials, or any novel and useful improvements thereof. Accordingly, aspects of the present application may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or a combination of hardware and software. The above hardware or software may be referred to as a "unit," module, "or" system. Furthermore, aspects of the present application may take the form of a computer program product embodied in one or more computer-readable media, wherein the computer-readable program code is embodied therein.
Computer program code required for operation of portions of the present application may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP and ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), the connection may be made to an external computer (for example, through the Internet), or a service such as software as a service (SaaS) may be used in a cloud computing environment.
Furthermore, the order in which the elements and sequences are presented, the use of numerical letters, or other designations are used in the application is not intended to limit the sequence of the processes and methods unless specifically recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of example, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the application. For example, while the implementation of the various components described above may be embodied in a hardware device, it may also be implemented as a purely software solution, e.g., an installation on an existing server or mobile device.
Likewise, it should be noted that in order to simplify the presentation of the disclosure and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, the inventive subject matter should be provided with fewer features than the single embodiments described above.

Claims (6)

1. A method for constructing a self-supervision speaker recognition model in a noise environment is characterized by comprising the following steps:
S1, randomly intercepting a section of voice from the original voice;
S2, inputting the intercepted voice into an interpretable convolution filter layer, and outputting a feature map P_x(ω);
S3, inputting P_x(ω) into the attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into the residual module to obtain P_x1(ω);
S4, inputting P_x1(ω) into the attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into the residual module to obtain P_x2(ω);
S5, splicing P_x(ω), P_x1(ω) and P_x2(ω) in time sequence based on a multi-scale method to extract the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
S6, training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
S7, inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
S8, extracting the feature vectors from all original voices, and then clustering to generate pseudo labels;
S9, performing supervised training of the dual encoder with the pseudo labels;
S10, repeating steps S7-S9 until the equal error rate (EER) no longer decreases, completing the model construction.
2. The method for constructing a self-supervision speaker recognition model in a noise environment according to claim 1, wherein in S1, the original audio is loaded, the original speech data is read, and the sampling frequency of the speech is 16000 Hz.
3. The method of claim 1, wherein in S1, the length of the intercepted section of speech is 4800 ms, and if the speech is insufficient, zero padding is performed at both ends of the speech.
4. The method of claim 1, wherein in S2, the interpretable convolution filter layer is a band-pass filter G_ω(n, f_1, f_2) = rect(f/2f_2) - rect(f/2f_1), where rect is rectangular band-pass filtering, n is the speech signal length, and f_1 and f_2 are the low cut-off frequency and the high cut-off frequency, respectively.
5. A system for constructing a self-supervising speaker recognition model in a noisy environment, comprising:
the voice intercepting module is used for intercepting a section of voice randomly in the original voice;
a filtering module for inputting the intercepted voice into an interpretable convolution filter layer and outputting a feature map P_x(ω);
a first extraction module for inputting P_x(ω) into the attention mechanism module to obtain P_y(ω), and then inputting P_y(ω) into the residual module to obtain P_x1(ω);
a second extraction module for inputting P_x1(ω) into the attention mechanism module to obtain P_y1(ω), and then inputting P_y1(ω) into the residual module to obtain P_x2(ω);
a third extraction module for splicing P_x(ω), P_x1(ω) and P_x2(ω) in time sequence based on a multi-scale method to extract the acoustic features P_CAM(ω) = P_x(ω) + P_x1(ω) + P_x2(ω);
a first training module for training a dual encoder consisting of an ECAPA-TDNN network and an MFA-Conformer network using a contrastive learning method;
a fourth extraction module for inputting the acoustic features P_CAM(ω) into the dual encoder to obtain the feature vectors;
a clustering module for extracting the feature vectors from all original voices and then clustering to generate pseudo labels;
a second training module for performing supervised training of the dual encoder with the pseudo labels;
and a model construction module for repeatedly controlling the fourth extraction module, the clustering module and the second training module to work until the equal error rate (EER) no longer decreases, completing the model construction.
6. The system for constructing a self-supervising speaker recognition model in a noisy environment of claim 5, further comprising:
the processor is respectively connected with the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module;
a memory coupled to the processor and storing a computer program executable on the processor; when the processor executes the computer program, the processor controls the voice intercepting module, the filtering module, the first extracting module, the second extracting module, the third extracting module, the first training module, the fourth extracting module, the clustering module, the second training module and the model building module to work so as to realize the self-supervision speaker identification model building method in the noise environment according to any one of claims 1 to 4.
CN202310364542.6A 2023-04-07 2023-04-07 Method and system for constructing self-supervision speaker recognition model in noise environment Active CN116072125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310364542.6A CN116072125B (en) 2023-04-07 2023-04-07 Method and system for constructing self-supervision speaker recognition model in noise environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310364542.6A CN116072125B (en) 2023-04-07 2023-04-07 Method and system for constructing self-supervision speaker recognition model in noise environment

Publications (2)

Publication Number Publication Date
CN116072125A CN116072125A (en) 2023-05-05
CN116072125B (en) 2023-10-17

Family

ID=86177183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310364542.6A Active CN116072125B (en) 2023-04-07 2023-04-07 Method and system for constructing self-supervision speaker recognition model in noise environment

Country Status (1)

Country Link
CN (1) CN116072125B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114038469B (en) * 2021-08-03 2023-06-20 成都理工大学 Speaker identification method based on multi-class spectrogram characteristic attention fusion network


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11557283B2 (en) * 2021-03-26 2023-01-17 Mitsubishi Electric Research Laboratories, Inc. Artificial intelligence system for capturing context by dilated self-attention

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461173A (en) * 2020-03-06 2020-07-28 华南理工大学 Attention mechanism-based multi-speaker clustering system and method
CN111145760A (en) * 2020-04-02 2020-05-12 支付宝(杭州)信息技术有限公司 Method and neural network model for speaker recognition
WO2022122121A1 (en) * 2020-12-08 2022-06-16 Huawei Technologies Co., Ltd. End-to-end streaming acoustic trigger apparatus and method
CN114464195A (en) * 2021-12-23 2022-05-10 厦门快商通科技股份有限公司 Voiceprint recognition model training method and device for self-supervision learning and readable medium
CN114678030A (en) * 2022-03-17 2022-06-28 重庆邮电大学 Voiceprint identification method and device based on depth residual error network and attention mechanism
CN115050373A (en) * 2022-04-29 2022-09-13 思必驰科技股份有限公司 Dual path embedded learning method, electronic device, and storage medium
CN115101076A (en) * 2022-05-26 2022-09-23 燕山大学 Speaker clustering method based on multi-scale channel separation convolution characteristic extraction
CN115116446A (en) * 2022-06-21 2022-09-27 成都理工大学 Method for constructing speaker recognition model in noise environment
CN114898772A (en) * 2022-06-22 2022-08-12 辽宁工程技术大学 Method for classifying acoustic scenes based on feature layering and improved ECAPA-TDNN
CN115602152A (en) * 2022-12-14 2023-01-13 成都启英泰伦科技有限公司(Cn) Voice enhancement method based on multi-stage attention network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification; Yang Zhang, et al.; https://arxiv.org/abs/2203.15249; full text *
Underwater acoustic target recognition technology based on MFA-Conformer; XuYang Wang, et al.; 2022 2nd International Conference on Electronic Information Engineering and Computer Technology (EIECT); full text *
Research on speech denoising methods based on deep learning; 李蕊 (Li Rui); China Master's Theses Full-text Database (Information Science and Technology), No. 12; full text *

Also Published As

Publication number Publication date
CN116072125A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
US20210287663A1 (en) Method and apparatus with a personalized speech recognition model
US10957309B2 (en) Neural network method and apparatus
CN111160533B (en) Neural network acceleration method based on cross-resolution knowledge distillation
WO2019179036A1 (en) Deep neural network model, electronic device, identity authentication method, and storage medium
US11875799B2 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2019179029A1 (en) Electronic device, identity verification method and computer-readable storage medium
US10026395B1 (en) Methods and systems for extracting auditory features with neural networks
US20220208198A1 (en) Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
WO2019136909A1 (en) Voice living-body detection method based on deep learning, server and storage medium
CN111542841A (en) System and method for content identification
CN116072125B (en) Method and system for constructing self-supervision speaker recognition model in noise environment
US11545154B2 (en) Method and apparatus with registration for speaker recognition
US20200293886A1 (en) Authentication method and apparatus with transformation model
EP3674974A1 (en) Apparatus and method with user verification
CN113362822B (en) Black box voice confrontation sample generation method with auditory masking
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN111933148A (en) Age identification method and device based on convolutional neural network and terminal
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
US11397868B2 (en) Fungal identification by pattern recognition
CN111524522B (en) Voiceprint recognition method and system based on fusion of multiple voice features
WO2020238681A1 (en) Audio processing method and device, and man-machine interactive system
Lei et al. Identity vector extraction by perceptual wavelet packet entropy and convolutional neural network for voice authentication
CN113763966B (en) End-to-end text irrelevant voiceprint recognition method and system
JP2016162437A (en) Pattern classification device, pattern classification method and pattern classification program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant