CN113823293B - Speaker recognition method and system based on voice enhancement - Google Patents


Info

Publication number
CN113823293B
CN113823293B
Authority
CN
China
Prior art keywords
voice
speaker
features
data
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111140239.5A
Other languages
Chinese (zh)
Other versions
CN113823293A (en)
Inventor
熊盛武
张欣冉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Application filed by Wuhan University of Technology WUT
Priority to CN202111140239.5A
Publication of CN113823293A
Application granted
Publication of CN113823293B


Abstract

The invention provides a speaker recognition method and system based on voice enhancement. The method comprises the following steps: S1, collecting a large amount of original voice data; S2, removing interference noise and irrelevant speaker sounds contained in the original voice data; S3, extracting MFCC features and GFCC features and fusing them to obtain the acoustic features of the voice; S4, constructing a speaker recognition model based on a convolutional neural network and training it with acoustic features extracted from the large amount of original voice data; S5, collecting registered voice samples for registration, acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the registered speaker features. The invention can improve the recognition accuracy of the voiceprint recognition system.

Description

Speaker recognition method and system based on voice enhancement
Technical Field
The invention relates to the field of pattern recognition, in particular to a speaker recognition method and system based on voice enhancement.
Background
Voiceprint recognition is a technology that extracts a speaker's voice characteristics and speech content information to automatically verify the speaker's identity. With the wide application of artificial intelligence in daily life, voiceprint recognition has taken on an increasingly prominent role, for example in voice-based authentication for personal smart devices (e.g., cell phones, vehicles, and notebook computers), in securing banking transactions and remote payments, and in automatic identity tagging.
However, because of the complexity of real-life background noise, the voice used for recognition always contains various noises, which degrades voiceprint recognition performance. How to overcome noise in the voice to be recognized is therefore a problem that must be solved before voiceprint recognition technology can be applied in real life.
Disclosure of Invention
The invention provides a speaker recognition method and system based on voice enhancement, which solve, or at least partially solve, the technical problem of poor voiceprint recognition performance in the prior art.
In order to solve the above technical problem, a first aspect of the present invention provides a speaker recognition method based on speech enhancement, including:
s1: collecting a large amount of original voice data;
S2: removing interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
S3: extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
S4: constructing a speaker recognition model based on a convolutional neural network, and training it with acoustic features extracted from a large amount of original voice data to obtain a trained model;
S5: collecting registered voice samples, performing voice enhancement and feature extraction with the methods of S2 and S3, and inputting the results into the trained model to obtain the depth feature of each registered voice sample, which is taken as each speaker's feature and stored; then acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the stored speaker features.
In one embodiment, step S1 uses a recording mode to collect the original voice data.
In one embodiment, step S2 uses a generative adversarial network to remove the interference noise and irrelevant speaker sounds contained in the original speech data, thereby implementing end-to-end speech enhancement.
In one embodiment, step S3 includes:
S3.1: detecting voice activity endpoints in the enhanced voice data and removing long silence segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
S3.5: taking the logarithm of these energy values to compute the logarithmic energy output by each filter bank;
S3.6: applying a discrete cosine transform to the logarithmic energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC and GFCC features of the speech signal to obtain its acoustic features.
In one embodiment, step S4 includes:
The collected large amount of original voice data is subjected to voice enhancement, acoustic features are then extracted from the enhanced voice data to serve as training data, and the training data are input into the speaker recognition model for training to obtain a trained model.
In one embodiment, the registration data in step S5 includes h voice samples for each speaker, and identifying the identity of the speaker to be identified according to the similarity between the feature of the speaker to be identified and the registered speaker features includes:
performing voice enhancement and feature extraction on each voice sample in the registration data, then extracting the depth feature of each voice sample from the obtained acoustic features with the convolutional neural network of the speaker recognition model;
averaging the h depth features of each speaker and storing the mean in a database as that speaker's feature;
performing voice enhancement and feature extraction on the voice data of the speaker to be identified, then inputting the result into the trained model to obtain the feature of the speaker to be identified;
computing the cosine similarity between the feature of the speaker to be identified and every speaker feature stored in the database; if the maximum cosine similarity exceeds a set threshold, the speaker in the database corresponding to that similarity is taken as the identity of the speaker; otherwise the speaker is rejected.
Based on the same inventive concept, a second aspect of the present invention provides a speaker recognition system based on speech enhancement, comprising:
the voice acquisition module is used for acquiring a large amount of original voice data;
the voice enhancement module is used for removing the interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module is used for extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
The model training module is used for constructing a speaker recognition model based on the convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module is used for collecting registered voice samples, performing voice enhancement and feature extraction with the voice enhancement module and the voice feature extraction module, inputting the results into the trained model to obtain the depth feature of each registered voice sample, and storing that depth feature as each speaker's feature; it then acquires the voice data of the speaker to be identified, performs voice enhancement and feature extraction with the same modules, inputs the result into the trained model to obtain the feature of the speaker to be identified, and identifies the speaker's identity according to the similarity between that feature and the stored speaker features.
The above technical solutions in the embodiments of the present application at least have one or more of the following technical effects:
The invention provides a speaker recognition method based on voice enhancement that uses an end-to-end speech enhancement method to remove noise and irrelevant speaker sounds from the voice, uses the more noise-robust GFCC features in the voiceprint recognition process, and fuses MFCC and GFCC features to obtain the acoustic features of the voice, thereby improving noise robustness. A speaker recognition model based on a convolutional neural network is constructed and trained with the training data; registered voice samples are collected, each registered speaker's feature is extracted and stored, and the identity of the speaker to be identified is recognized according to the similarity between the feature of the speaker to be identified and the stored speaker features. This solves the prior-art problem of poor voiceprint recognition performance caused by noise contained in the voice and improves the recognition accuracy of voiceprint recognition.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a speaker recognition method based on speech enhancement according to an embodiment of the present invention;
FIG. 2 is a flow chart of MFCC speech feature extraction in an implementation of the present invention;
FIG. 3 is a flow chart of GFCC speech feature extraction in an implementation of the present invention;
fig. 4 is a block diagram of a speaker recognition system based on speech enhancement according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a speaker recognition method based on voice enhancement that solves the prior-art problem of poor recognition performance caused by noise in the voice to be recognized, which prevents accurate feature extraction.
The main concept of the invention is as follows:
First, a large amount of original voice data is collected, and the interference noise and irrelevant speaker sounds contained in it are removed to obtain enhanced voice data. MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features are then extracted from the enhanced voice data and fused to obtain the acoustic features of the voice. Next, a speaker recognition model based on a convolutional neural network is constructed and trained with acoustic features extracted from the large amount of original voice data to obtain a trained model. Registered voice samples are collected, voice enhancement and feature extraction are performed with the methods of S2 and S3, and the results are input into the trained model to obtain the depth feature of each registered voice sample, which is stored as each speaker's feature. Finally, the voice data of the speaker to be identified is acquired, voice enhancement and feature extraction are performed with the methods of S2 and S3, the result is input into the trained model to obtain the feature of the speaker to be identified, and the speaker's identity is identified according to the similarity between that feature and the stored speaker features.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the invention.
Example 1
The embodiment of the invention provides a speaker recognition method based on voice enhancement, which comprises the following steps:
s1: collecting a large amount of original voice data;
S2: removing interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
S3: extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
S4: constructing a speaker recognition model based on a convolutional neural network, and training it with acoustic features extracted from a large amount of original voice data to obtain a trained model;
S5: collecting registered voice samples, performing voice enhancement and feature extraction with the methods of S2 and S3, and inputting the results into the trained model to obtain the depth feature of each registered voice sample, which is taken as each speaker's feature and stored; then acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the stored speaker features.
Specifically, in the speaker recognition model training module, a convolutional neural network is used as the network model and softmax is used as the classifier; the trained model is an offline model. The registered voice data includes a plurality of speakers, each with h voice samples.
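The patent fixes the model family (a convolutional neural network trained with a softmax classifier) but not the exact architecture. The following is a minimal sketch of such an embedding network, assuming PyTorch; the layer sizes, kernel shapes and the 256-dimensional depth feature are illustrative choices, not values taken from the patent.

```python
# Hedged sketch: CNN speaker-embedding model with a softmax training head.
# All layer dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers: int, embed_dim: int = 256):
        super().__init__()
        # Input: (batch, 1, frames, feature_dim) maps of fused MFCC+GFCC features
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)            # pool over time and frequency
        self.embed = nn.Linear(64, embed_dim)          # depth (speaker) feature
        self.head = nn.Linear(embed_dim, n_speakers)   # softmax head, used only in training

    def forward(self, x):
        h = self.pool(self.conv(x)).flatten(1)
        emb = self.embed(h)            # used at registration/recognition time
        return emb, self.head(emb)     # logits go to softmax cross-entropy
```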
Referring to fig. 1, a flowchart of the speaker recognition method based on speech enhancement according to an implementation of the present invention is shown.
In one embodiment, step S1 uses a recording mode to collect the original voice data.
In one embodiment, step S2 uses a generative adversarial network to remove the interference noise and irrelevant speaker sounds contained in the original speech data, thereby implementing end-to-end speech enhancement.
In the generative adversarial network, the generator is a fully convolutional encoder-decoder structure that removes the noise in the speech to generate a clean speech waveform; the discriminator network sets a threshold based on the clean and noisy speech waveforms to judge whether a generated speech waveform is clean, and when the score of the generated waveform reaches the threshold, the generated waveform is considered sufficiently clean.
The invention implements an end-to-end speech enhancement method within the generative adversarial framework to remove interference noise and irrelevant speaker sounds from the speech.
In a specific implementation, clean speech is mixed with common everyday noise at random signal-to-noise ratios to obtain the noisy speech corresponding to the clean speech; the clean speech data set and the corresponding noisy speech data set are then used to train the generative adversarial network that performs end-to-end speech enhancement.
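As a small illustration of this data preparation step, the sketch below mixes a clean utterance with a noise clip at a random signal-to-noise ratio; the -10 dB to 10 dB default range follows the typical range given in the training example later in the text.

```python
# Hedged sketch: build the noisy counterpart of a clean utterance by mixing
# in noise at a random SNR. NumPy only; the SNR range follows the text below.
import numpy as np

def mix_at_random_snr(clean, noise, low_db=-10.0, high_db=10.0):
    noise = np.resize(noise, clean.shape)      # repeat or trim noise to the clean length
    snr_db = np.random.uniform(low_db, high_db)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise               # noisy speech paired with `clean`
```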
The speech enhancement model training process is described in detail below, taking as an example the training of a model on a data set of 1,000 clean speech utterances.
The clean speech set and the everyday noise data set are mixed at random signal-to-noise ratios (typically between -10 dB and 10 dB) to obtain the noisy speech set corresponding to the clean speech set. The noisy speech is passed through the generator network to produce generated clean speech; the generated clean speech and the real clean speech are then fed to the discriminator network, which judges whether its input is real clean speech: the discriminator should output 0 for generated clean speech and 1 for real clean speech. Error gradients obtained from the loss function are then back-propagated to update the parameters until the discriminator can no longer distinguish generated clean speech from real clean speech, at which point the generator network is the trained speech enhancement network. Intuitively, the discriminator keeps telling the generator how to adjust so that the clean speech it generates becomes more realistic.
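A minimal sketch of one adversarial training step as described above, in the style of SEGAN-like waveform enhancement. The Generator (encoder-decoder) and conditional Discriminator modules, the least-squares losses and the L1 weight of 100 are assumptions for illustration; the patent does not specify them.

```python
# Hedged sketch of one GAN training step for speech enhancement (PyTorch).
import torch
import torch.nn.functional as F

def train_step(gen, disc, opt_g, opt_d, noisy, clean):
    # Discriminator: (clean, noisy) pairs should score 1, (generated, noisy) pairs 0.
    fake = gen(noisy).detach()
    d_real = disc(clean, noisy)
    d_fake = disc(fake, noisy)
    loss_d = F.mse_loss(d_real, torch.ones_like(d_real)) + \
             F.mse_loss(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator while staying close to the clean target.
    fake = gen(noisy)
    d_out = disc(fake, noisy)
    loss_g = F.mse_loss(d_out, torch.ones_like(d_out)) + 100.0 * F.l1_loss(fake, clean)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

Training alternates this step over the paired data set until the discriminator can no longer separate generated from real clean speech; the generator is then kept as the enhancement network.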
In one embodiment, step S3 includes:
S3.1: detecting voice activity endpoints in the enhanced voice data and removing long silence segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
S3.5: taking the logarithm of these energy values to compute the logarithmic energy output by each filter bank;
S3.6: applying a discrete cosine transform to the logarithmic energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC and GFCC features of the speech signal to obtain its acoustic features.
In a specific implementation, the preprocessing comprises pre-emphasis, framing and windowing. The specific steps of the feature extraction are as follows:
S301: performing voice activity detection (VAD) on the enhanced speech to remove long silence segments;
S302: pre-emphasizing the speech signal with a high-pass filter H(z) = 1 - μz^(-1), where μ is the pre-emphasis coefficient, typically 0.97, and z is the z-transform variable.
S303: the sampling frequency of the voice signal is 16 kHz; 512 sampling points are grouped into one frame, corresponding to a duration of 512/16000 × 1000 = 32 ms. Adjacent frames overlap by 256 sampling points, i.e. half of the 512-point frame length.
S304: assuming the framed signal is s(n), n = 0, 1, ..., N-1, with N the frame length, each frame is multiplied by a Hamming window: x(n) = s(n) × W(n), where W(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
S305: performing a fast Fourier transform on each framed and windowed signal x(n) to obtain the spectrum of each frame, then taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal. Since the speech signal is stored in discrete form, its discrete Fourier transform is X(k) = Σ_{n=0}^{T-1} x(n)·e^(-j2πnk/T), 0 ≤ k ≤ T-1, where x(n) is the input speech signal and T is the number of points of the Fourier transform.
S306: passing the power spectrum |X(k)|² obtained by the fast Fourier transform through a set of Mel-scale triangular filters H_m(k), 0 ≤ m ≤ M, where M is the number of filters: the power spectrum is multiplied by each filter and accumulated to obtain the energy of the frame in the frequency band of that filter, E(m) = Σ_{k=0}^{T-1} |X(k)|²·H_m(k).
S307: taking the logarithm of the energy values, the logarithmic energy output by each filter bank is s(m) = ln(Σ_{k=0}^{T-1} |X(k)|²·H_m(k)), 0 ≤ m ≤ M, where T is the number of points of the Fourier transform, M is the number of filters, |X(k)|² is the power spectrum obtained in S305, and H_m(k), 0 ≤ m ≤ M, is the set of Mel-scale triangular filters.
S308: substituting the logarithmic energy of S307 into a discrete cosine transform yields the L-order Mel-frequency cepstral coefficients (MFCC): C(l) = Σ_{m=1}^{M} s(m)·cos(πl(m - 0.5)/M), l = 1, 2, ..., L, where L is the MFCC order, typically 12-16, and M is the number of triangular filters.
S309: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform (DCT) to obtain the GFCC features of the voice signal.
S310: concatenating the MFCC and GFCC features of the speech signal to obtain its fused GMCC feature (a code sketch of steps S301-S310 follows below).
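A compact sketch of steps S301-S310 is given below. The Mel filterbank and DCT follow the formulas above; the Gammatone filterbank is built from the standard ERB-based magnitude approximation of a 4th-order Gammatone filter, and cubic-root compression is used as a concrete form of the exponential-compression step; both are assumptions, since the patent does not give the exact filter design.

```python
# Hedged sketch of GMCC extraction: MFCC (Mel filterbank) + GFCC (Gammatone
# filterbank approximation), concatenated per frame. Requires numpy, scipy, librosa.
import numpy as np
import librosa
from scipy.fftpack import dct

def gmcc(wav, sr=16000, n_fft=512, hop=256, n_mfcc=13, n_gfcc=13, n_filt=26):
    wav = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])          # pre-emphasis H(z) = 1 - 0.97 z^-1
    frames = librosa.util.frame(wav, frame_length=n_fft, hop_length=hop).T
    frames = frames * np.hamming(n_fft)                          # Hamming window (S304)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2              # |X(k)|^2 (S305)

    # MFCC: Mel triangular filters -> log energy -> DCT (S306-S308)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_filt)
    mfcc = dct(np.log(power @ mel_fb.T + 1e-10), norm='ortho')[:, :n_mfcc]

    # GFCC: Gammatone filterbank -> cubic-root compression -> DCT (S309)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    fc = np.linspace(100, sr / 2 - 100, n_filt)                  # assumed center-frequency spacing
    erb = 24.7 * (4.37 * fc / 1000 + 1)                          # ERB bandwidths
    gt_fb = np.array([(1 + ((freqs - f) / (1.019 * b)) ** 2) ** -2
                      for f, b in zip(fc, erb)])                 # 4th-order magnitude approximation
    gfcc = dct((power @ gt_fb.T + 1e-10) ** (1.0 / 3.0), norm='ortho')[:, :n_gfcc]

    return np.hstack([mfcc, gfcc])                               # fused GMCC feature (S310)
```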
Fig. 2 and Fig. 3 are flowcharts of MFCC and GFCC speech feature extraction, respectively, in an implementation of the present invention.
In one embodiment, step S4 includes:
Performing voice enhancement on the large amount of collected original voice data, extracting acoustic features from the enhanced data as training data, and inputting the training data into the speaker recognition model for training to obtain a trained model.
Specifically, training the model is an offline process. The speaker recognition model is trained as follows:
Training samples are collected by recording; the collected voice samples are processed by the voice preprocessing modules (the voice enhancement module and the voice feature extraction module) to obtain the GMCC features of the voice; the GMCC features are used as the model input, and the speaker recognition model is trained with a convolutional neural network structure and softmax classification.
The speaker recognition model training process is described below, taking the training of a model containing 1000 speakers as an example.
Samples are collected for each speaker, 100 samples per speaker. All voice samples are passed through the voice preprocessing modules (the voice enhancement module and the voice feature extraction module) to obtain the GMCC features of the voice as training data for the convolutional neural network (the speaker recognition model), and all training data are randomly split 5:1 into a training set and a validation set. The convolutional network is trained on the training set; when the recognition accuracy of the trained network on the validation set is basically unchanged, training is finished, otherwise training continues. The trained convolutional network is the offline speaker recognition model.
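An illustrative offline training loop matching this description: a 5:1 random split, softmax cross-entropy training, and stopping once validation accuracy is basically unchanged. It assumes the SpeakerCNN sketch above and a dataset yielding (GMCC feature map, speaker id) pairs; the batch size, learning rate and patience are guesses, not values from the patent.

```python
# Hedged sketch of the offline training procedure (PyTorch).
import torch
from torch.utils.data import DataLoader, random_split

def train_offline(model, dataset, max_epochs=100, patience=5):
    n_val = len(dataset) // 6                                   # 5:1 train/validation split
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_dl = DataLoader(train_set, batch_size=64, shuffle=True)
    val_dl = DataLoader(val_set, batch_size=64)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_acc, stale = 0.0, 0
    for _ in range(max_epochs):
        model.train()
        for feats, labels in train_dl:
            _, logits = model(feats)
            loss = torch.nn.functional.cross_entropy(logits, labels)
            opt.zero_grad(); loss.backward(); opt.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for feats, labels in val_dl:
                _, logits = model(feats)
                correct += (logits.argmax(1) == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc + 1e-3:
            best_acc, stale = acc, 0                            # accuracy still improving
        else:
            stale += 1
        if stale >= patience:                                   # accuracy basically unchanged
            break
    return model
```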
In one embodiment, the registration data in step S5 includes h voice samples for each speaker, and identifying the identity of the speaker to be identified according to the similarity between the feature of the speaker to be identified and the registered speaker features includes:
performing voice enhancement and feature extraction on each voice sample in the registration data, then extracting the depth feature of each voice sample from the obtained acoustic features with the convolutional neural network of the speaker recognition model;
averaging the h depth features of each speaker and storing the mean in a database as that speaker's feature;
performing voice enhancement and feature extraction on the voice data of the speaker to be identified, then inputting the result into the trained model to obtain the feature of the speaker to be identified;
computing the cosine similarity between the feature of the speaker to be identified and every speaker feature stored in the database; if the maximum cosine similarity exceeds a set threshold, the speaker in the database corresponding to that similarity is taken as the identity of the speaker; otherwise the speaker is rejected.
Registration mode:
Registration samples are collected by recording; the collected registration samples are passed through voice preprocessing to obtain the GMCC features of the voice; the deep feature of each voice sample is extracted from its GMCC features with the offline speaker recognition model; and the registration data (i.e., each speaker's feature) are generated and stored in a database.
For example, 10 speakers are registered with 20 voice samples each; the voice preprocessing module processes all voice samples to obtain the GMCC features of the voice; the deep features of the 200 voice samples are obtained from the GMCC features with the offline speaker recognition model; the 20 deep features of each speaker are then averaged and used as that speaker's feature; and the 10 speaker features are saved in the database as speaker_0, speaker_1, ..., speaker_9.
Recognition mode:
A sample to be identified is collected by recording; the sample is passed through voice preprocessing to obtain its GMCC features; the deep feature of the sample is obtained from the GMCC features with the offline speaker recognition model and used as the feature of the speaker to be identified; the cosine similarities between this feature and all speaker features in the database are computed, and if the maximum cosine similarity exceeds the set threshold, the speaker in the database corresponding to that similarity is taken as the identified speaker; otherwise the sample is rejected.
For example, a piece of the speaker's voice data is collected; its GMCC features are obtained by the voice preprocessing module; the deep feature of the voice data is obtained from the GMCC features with the offline speaker recognition model and used as the speaker feature; the cosine similarities between this feature and the 10 speaker features stored in the database are computed, giving cos_0, cos_1, ..., cos_9; the maximum cos_max among the 10 cosine similarities and the index speaker_x of the corresponding speaker are found; if cos_max is greater than the set threshold, the speaker is accepted as speaker_x, otherwise the speaker is identified as unregistered.
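The registration and recognition flow above reduces to the following sketch: average the h depth features per speaker at registration, then match by maximum cosine similarity against a threshold. The embed callable stands for the whole pipeline (speech enhancement, GMCC extraction and the trained model's depth feature); the 0.7 threshold is an illustrative value, not one given in the patent.

```python
# Hedged sketch of registration (enrollment) and recognition by cosine similarity.
import numpy as np

database = {}  # speaker id -> averaged depth feature

def register(speaker_id, samples, embed):
    feats = np.stack([embed(s) for s in samples])   # h deep features for this speaker
    database[speaker_id] = feats.mean(axis=0)       # speaker feature = mean of deep features

def recognize(sample, embed, threshold=0.7):
    q = embed(sample)
    best_id, best_cos = None, -1.0
    for sid, ref in database.items():
        cos = q @ ref / (np.linalg.norm(q) * np.linalg.norm(ref) + 1e-10)
        if cos > best_cos:
            best_id, best_cos = sid, cos
    # Accept the best match only above the threshold; otherwise reject as unregistered.
    return best_id if best_cos > threshold else None
```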
In summary, the invention realizes a speaker recognition method based on voice enhancement through voice collection, voice enhancement, voice feature extraction, speaker model training, speaker registration and speaker recognition.
Compared with the prior art, the invention has the beneficial effects that:
The speaker recognition method and system based on voice enhancement provided by the invention use an end-to-end speech enhancement method to remove noise and irrelevant speaker sounds from the voice and use the more noise-robust GFCC features in the voiceprint recognition process, improving the noise robustness of the whole system, solving the problem of poor voiceprint recognition performance caused by noise contained in the voice, and improving the recognition accuracy of the voiceprint recognition system.
Example two
Based on the same inventive concept, this embodiment provides a speaker recognition system based on speech enhancement, please refer to fig. 4, which includes:
A voice acquisition module 201, configured to acquire a large amount of original voice data;
the voice enhancement module 202 is configured to remove the interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module 203 is configured to extract MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fuse the MFCC and GFCC features to obtain the acoustic features of the voice;
The model training module 204 is configured to construct a speaker recognition model based on the convolutional neural network, and train the speaker recognition model by using acoustic features extracted from a large amount of original speech data as training data to obtain a trained model;
the speaker recognition module 205 is configured to register and recognize speakers: it collects registered voice samples, performs voice enhancement and feature extraction with the voice enhancement module and the voice feature extraction module, inputs the results into the trained model to obtain the depth feature of each registered voice sample, and stores that depth feature as each speaker's feature; it then acquires the voice data of the speaker to be identified, performs voice enhancement and feature extraction with the same modules, inputs the result into the trained model to obtain the feature of the speaker to be identified, and identifies the speaker's identity according to the similarity between that feature and the stored speaker features.
Because the system described in the second embodiment of the present invention implements the speaker recognition method based on speech enhancement of the first embodiment, a person skilled in the art can understand the specific structure and variations of the system from the method described in the first embodiment, so the details are not repeated here. All systems used by the method of the first embodiment of the present invention fall within the scope of the invention.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical schemes described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for speaker recognition based on speech enhancement, comprising:
s1: collecting a large amount of original voice data;
S2: removing interference noise and irrelevant speaker sound contained in the original voice data to obtain enhanced voice data;
S3: extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
S4: constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
S5: collecting registered voice samples, performing voice enhancement and feature extraction with the methods of S2 and S3, and inputting the results into the trained model to obtain the depth feature of each registered voice sample, which is taken as each speaker's feature and stored; acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the stored speaker features;
step S2 uses a generative adversarial network to remove the interference noise and irrelevant speaker sounds contained in the original voice data, thereby achieving end-to-end speech enhancement; the generative adversarial network is obtained as follows: clean speech is mixed with common everyday noise at random signal-to-noise ratios to obtain the noisy speech corresponding to the clean speech, and the clean speech data set and the corresponding noisy speech data set are used for training to obtain the generative adversarial network;
The step S3 comprises the following steps:
S3.1: detecting voice activity endpoints in the enhanced voice data and removing long silence segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
S3.5: taking the logarithm of these energy values to compute the logarithmic energy output by each filter bank;
S3.6: applying a discrete cosine transform to the logarithmic energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC and GFCC features of the speech signal to obtain its acoustic features.
2. The speaker recognition method as recited in claim 1, wherein step S1 performs the collection of the original voice data by recording.
3. The speaker recognition method as claimed in claim 1, wherein step S4 comprises:
a large amount of original voice data is subjected to voice enhancement, acoustic features are extracted from the enhanced voice data to serve as training data, and the training data are input into the speaker recognition model for training to obtain a trained model.
4. The speaker recognition method as claimed in claim 1, wherein the registration data includes h voice samples for each speaker, and identifying in step S5 the identity of the speaker to be recognized based on the similarity between the feature of the speaker to be recognized and the registered speaker features comprises:
performing voice enhancement and feature extraction on each voice sample in the registration data, then extracting the depth feature of each voice sample from the obtained acoustic features with the convolutional neural network of the speaker recognition model;
averaging the h depth features of each speaker and storing the mean in a database as that speaker's feature;
performing voice enhancement and feature extraction on the voice data of the speaker to be identified, then inputting the result into the trained model to obtain the feature of the speaker to be identified;
computing the cosine similarity between the feature of the speaker to be identified and every speaker feature stored in the database; if the maximum cosine similarity exceeds a set threshold, the speaker in the database corresponding to that similarity is taken as the identity of the speaker; otherwise the speaker is rejected.
5. A speech enhancement-based speaker recognition system, comprising:
the voice acquisition module is used for acquiring a large amount of original voice data;
the voice enhancement module is used for removing the interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module is used for extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
The model training module is used for constructing a speaker recognition model based on the convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module is used for collecting registered voice samples, performing voice enhancement and feature extraction with the voice enhancement module and the voice feature extraction module, inputting the results into the trained model to obtain the depth feature of each registered voice sample, which is taken as each speaker's feature and stored; acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the voice enhancement module and the voice feature extraction module, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the stored speaker features;
the voice enhancement module removes the interference noise and irrelevant speaker sounds contained in the original voice data with a generative adversarial network, achieving end-to-end speech enhancement; the generative adversarial network is obtained as follows: clean speech is mixed with common everyday noise at random signal-to-noise ratios to obtain the noisy speech corresponding to the clean speech, and the clean speech data set and the corresponding noisy speech data set are used for training to obtain the generative adversarial network;
The voice feature extraction module is specifically configured to execute the following steps:
S3.1: detecting voice activity endpoints in the enhanced voice data and removing long silence segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
S3.5: taking the logarithm of these energy values to compute the logarithmic energy output by each filter bank;
S3.6: applying a discrete cosine transform to the logarithmic energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC and GFCC features of the speech signal to obtain its acoustic features.
CN202111140239.5A 2021-09-28 Speaker recognition method and system based on voice enhancement Active CN113823293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111140239.5A CN113823293B (en) 2021-09-28 Speaker recognition method and system based on voice enhancement


Publications (2)

Publication Number Publication Date
CN113823293A CN113823293A (en) 2021-12-21
CN113823293B true CN113823293B (en) 2024-04-26



Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568472A (en) * 2010-12-15 2012-07-11 盛乐信息技术(上海)有限公司 Voice synthesis system with speaker selection and realization method thereof
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
CN104835498A (en) * 2015-05-25 2015-08-12 重庆大学 Voiceprint identification method based on multi-type combination characteristic parameters
CA3179080A1 (en) * 2016-09-19 2018-03-22 Pindrop Security, Inc. Channel-compensated low-level features for speaker recognition
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN110299142A (en) * 2018-05-14 2019-10-01 桂林远望智能通信科技有限公司 A kind of method for recognizing sound-groove and device based on the network integration
CN109147810A (en) * 2018-09-30 2019-01-04 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network
CN109410974A (en) * 2018-10-23 2019-03-01 百度在线网络技术(北京)有限公司 Sound enhancement method, device, equipment and storage medium
CN109326302A (en) * 2018-11-14 2019-02-12 桂林电子科技大学 A kind of sound enhancement method comparing and generate confrontation network based on vocal print
CN109524020A (en) * 2018-11-20 2019-03-26 上海海事大学 A kind of speech enhan-cement processing method
CN109712628A (en) * 2019-03-15 2019-05-03 哈尔滨理工大学 A kind of voice de-noising method and audio recognition method based on RNN
CN110428849A (en) * 2019-07-30 2019-11-08 珠海亿智电子科技有限公司 A kind of sound enhancement method based on generation confrontation network
KR20210036692A (en) * 2019-09-26 2021-04-05 국방과학연구소 Method and apparatus for robust speech enhancement training using adversarial training
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion
CN112820301A (en) * 2021-03-15 2021-05-18 中国科学院声学研究所 Unsupervised cross-domain voiceprint recognition method fusing distribution alignment and counterstudy

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A fast speaker recognition method based on convolutional neural networks; Cai Qian et al.; Radio Engineering; Vol. 50, No. 6, pp. 447-451 *
A survey of monaural speech denoising and dereverberation; Lan Tian; Peng Chuan; Li Sen; Ye Wenzheng; Li Meng; Hui Guoqiang; Lyu Yilan; Qian Yuxin; Liu Qiao; Journal of Computer Research and Development; No. 5; full text *
Application of a dual-microphone-array speech enhancement algorithm to speaker recognition; Mao Wei; Zeng Qingning; Long Chao; Technical Acoustics; 2018-06-15; No. 3; full text *
Experimental design of speaker recognition based on neural networks; Yang Yao; Chen Xiao; Research and Exploration in Laboratory; No. 9; full text *
MFCC speaker recognition based on endpoint detection and Gaussian filter banks; Wang Meng; Wang Fulong; Computer Systems & Applications; 2016-10-15; No. 10; full text *

Similar Documents

Publication Publication Date Title
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
Wang et al. Channel pattern noise based playback attack detection algorithm for speaker recognition
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
CN101923855A (en) Test-irrelevant voice print identifying system
CN103065629A (en) Speech recognition system of humanoid robot
CN108564956B (en) Voiceprint recognition method and device, server and storage medium
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
CN112382300A (en) Voiceprint identification method, model training method, device, equipment and storage medium
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
Nandyal et al. MFCC based text-dependent speaker identification using BPNN
Revathi et al. Text independent speaker recognition and speaker independent speech recognition using iterative clustering approach
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
Neelima et al. Mimicry voice detection using convolutional neural networks
CN111524520A (en) Voiceprint recognition method based on error reverse propagation neural network
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN113823293B (en) Speaker recognition method and system based on voice enhancement
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
Khetri et al. Automatic speech recognition for marathi isolated words
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
Wang et al. Robust Text-independent Speaker Identification in a Time-varying Noisy Environment.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant