CN113823293B - Speaker recognition method and system based on voice enhancement - Google Patents


Info

Publication number
CN113823293B
CN113823293B
Authority
CN
China
Prior art keywords
voice
speaker
features
data
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111140239.5A
Other languages
Chinese (zh)
Other versions
CN113823293A (en)
Inventor
熊盛武
张欣冉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Application filed by Wuhan University of Technology WUT
Priority to CN202111140239.5A
Publication of CN113823293A
Application granted
Publication of CN113823293B


Abstract

The invention provides a speaker recognition method and system based on voice enhancement. The method comprises the following steps: S1, collecting a large amount of original voice data; S2, removing interference noise and irrelevant speaker sounds contained in the original voice data; S3, extracting MFCC features and GFCC features and fusing them to obtain the acoustic features of the voice; S4, constructing a speaker recognition model based on a convolutional neural network and training it with acoustic features extracted from the large amount of original voice data; S5, collecting registered voice samples for registration, acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the registered speaker features. The invention can improve the recognition accuracy of the voiceprint recognition system.

Description

Speaker recognition method and system based on voice enhancement
Technical Field
The invention relates to the field of pattern recognition, in particular to a speaker recognition method and system based on voice enhancement.
Background
Voiceprint recognition is a technology that extracts a speaker's voice characteristics and speech content information to automatically verify the speaker's identity. With the wide application of artificial intelligence in daily life, voiceprint recognition has taken on an increasingly prominent role, for example in voice-based authentication for personal smart devices (e.g., cell phones, vehicles, and notebook computers), in securing banking transactions and remote payments, and in automatic identity tagging.
However, because of the complexity of real-life background noise, the voice used for recognition always contains various noises, which degrades voiceprint recognition performance. How to overcome noise in the voice to be recognized is therefore a problem that must be solved before voiceprint recognition technology can be applied in real life.
Disclosure of Invention
The invention provides a speaker recognition method and system based on voice enhancement, which solve, or at least partially solve, the technical problem of poor voiceprint recognition performance in the prior art.
In order to solve the above technical problem, a first aspect of the present invention provides a speaker recognition method based on speech enhancement, including:
s1: collecting a large amount of original voice data;
S2: removing interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
S3: extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
S4: constructing a speaker recognition model based on a convolutional neural network, and training it with acoustic features extracted from a large amount of original voice data to obtain a trained model;
S5: collecting registered voice samples, performing voice enhancement and feature extraction with the methods of S2 and S3, and inputting the results into the trained model to obtain the depth feature of each registered voice sample, which is taken as each speaker's feature and stored; then acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the stored speaker features.
In one embodiment, step S1 uses a recording mode to collect the original voice data.
In one embodiment, step S2 uses a generative adversarial network to remove the interference noise and irrelevant speaker sounds contained in the original speech data, thereby implementing end-to-end speech enhancement.
In one embodiment, step S3 includes:
S3.1: detecting voice activity endpoints in the enhanced voice data and removing long silence segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
S3.5: taking the logarithm of these energy values to compute the logarithmic energy output by each filter bank;
S3.6: applying a discrete cosine transform to the logarithmic energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC and GFCC features of the speech signal to obtain its acoustic features.
In one embodiment, step S4 includes:
The collected large amount of original voice data is subjected to voice enhancement, acoustic features are then extracted from the enhanced voice data to serve as training data, and the training data are input into the speaker recognition model for training to obtain a trained model.
In one embodiment, the registration data in step S5 includes h voice samples for each speaker, and identifying the identity of the speaker to be identified according to the similarity between the feature of the speaker to be identified and the registered speaker features includes:
performing voice enhancement and feature extraction on each voice sample in the registration data, then extracting the depth feature of each voice sample from the obtained acoustic features with the convolutional neural network of the speaker recognition model;
averaging the h depth features of each speaker and storing the mean in a database as that speaker's feature;
performing voice enhancement and feature extraction on the voice data of the speaker to be identified, then inputting the result into the trained model to obtain the feature of the speaker to be identified;
computing the cosine similarity between the feature of the speaker to be identified and every speaker feature stored in the database; if the maximum cosine similarity exceeds a set threshold, the speaker in the database corresponding to that similarity is taken as the identity of the speaker; otherwise the speaker is rejected.
Based on the same inventive concept, a second aspect of the present invention provides a speaker recognition system based on speech enhancement, comprising:
the voice acquisition module is used for acquiring a large amount of original voice data;
the voice enhancement module is used for removing the interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module is used for extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
The model training module is used for constructing a speaker recognition model based on the convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module is used for collecting registered voice samples, performing voice enhancement and feature extraction with the voice enhancement module and the voice feature extraction module, inputting the results into the trained model to obtain the depth feature of each registered voice sample, and storing that depth feature as each speaker's feature; it then acquires the voice data of the speaker to be identified, performs voice enhancement and feature extraction with the same modules, inputs the result into the trained model to obtain the feature of the speaker to be identified, and identifies the speaker's identity according to the similarity between that feature and the stored speaker features.
The above technical solutions in the embodiments of the present application at least have one or more of the following technical effects:
The invention provides a speaker recognition method based on voice enhancement that uses an end-to-end speech enhancement method to remove noise and irrelevant speaker sounds from the voice, uses the more noise-robust GFCC features in the voiceprint recognition process, and fuses MFCC and GFCC features to obtain the acoustic features of the voice, thereby improving noise robustness. A speaker recognition model based on a convolutional neural network is constructed and trained with the training data; registered voice samples are collected, each registered speaker's feature is extracted and stored, and the identity of the speaker to be identified is recognized according to the similarity between the feature of the speaker to be identified and the stored speaker features. This solves the prior-art problem of poor voiceprint recognition performance caused by noise contained in the voice and improves the recognition accuracy of voiceprint recognition.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a speaker recognition method based on speech enhancement according to an embodiment of the present invention;
FIG. 2 is a flow chart of MFCC speech feature extraction in an implementation of the present invention;
FIG. 3 is a flow chart of GFCC speech feature extraction in an implementation of the present invention;
fig. 4 is a block diagram of a speaker recognition system based on speech enhancement according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a speaker recognition method based on voice enhancement that solves the prior-art problem of poor recognition performance caused by noise in the voice to be recognized, which prevents accurate feature extraction.
The main concept of the invention is as follows:
First, a large amount of original voice data is collected, and the interference noise and irrelevant speaker sounds contained in it are removed to obtain enhanced voice data. MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features are then extracted from the enhanced voice data and fused to obtain the acoustic features of the voice. Next, a speaker recognition model based on a convolutional neural network is constructed and trained with acoustic features extracted from the large amount of original voice data to obtain a trained model. Registered voice samples are collected, voice enhancement and feature extraction are performed with the methods of S2 and S3, and the results are input into the trained model to obtain the depth feature of each registered voice sample, which is stored as each speaker's feature. Finally, the voice data of the speaker to be identified is acquired, voice enhancement and feature extraction are performed with the methods of S2 and S3, the result is input into the trained model to obtain the feature of the speaker to be identified, and the speaker's identity is identified according to the similarity between that feature and the stored speaker features.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the invention.
Example 1
The embodiment of the invention provides a speaker recognition method based on voice enhancement, which comprises the following steps:
s1: collecting a large amount of original voice data;
S2: removing interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
S3: extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
S4: constructing a speaker recognition model based on a convolutional neural network, and training it with acoustic features extracted from a large amount of original voice data to obtain a trained model;
S5: collecting registered voice samples, performing voice enhancement and feature extraction with the methods of S2 and S3, and inputting the results into the trained model to obtain the depth feature of each registered voice sample, which is taken as each speaker's feature and stored; then acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the stored speaker features.
Specifically, in the speaker recognition model training module, a convolutional neural network is used as the network model and softmax is used as the classifier; the trained model is an offline model. The registered voice data includes a plurality of speakers, each with h voice samples.
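The patent fixes the model family (a convolutional neural network trained with a softmax classifier) but not the exact architecture. The following is a minimal sketch of such an embedding network, assuming PyTorch; the layer sizes, kernel shapes and the 256-dimensional depth feature are illustrative choices, not values taken from the patent.

```python
# Hedged sketch: CNN speaker-embedding model with a softmax training head.
# All layer dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    def __init__(self, n_speakers: int, embed_dim: int = 256):
        super().__init__()
        # Input: (batch, 1, frames, feature_dim) maps of fused MFCC+GFCC features
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)            # pool over time and frequency
        self.embed = nn.Linear(64, embed_dim)          # depth (speaker) feature
        self.head = nn.Linear(embed_dim, n_speakers)   # softmax head, used only in training

    def forward(self, x):
        h = self.pool(self.conv(x)).flatten(1)
        emb = self.embed(h)            # used at registration/recognition time
        return emb, self.head(emb)     # logits go to softmax cross-entropy
```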
Referring to fig. 1, a flowchart of the speaker recognition method based on speech enhancement according to an implementation of the present invention is shown.
In one embodiment, step S1 uses a recording mode to collect the original voice data.
In one embodiment, step S2 uses a generative adversarial network to remove the interference noise and irrelevant speaker sounds contained in the original speech data, thereby implementing end-to-end speech enhancement.
In the generative adversarial network, the generator is a fully convolutional encoder-decoder structure that removes the noise in the speech to generate a clean speech waveform; the discriminator network sets a threshold based on the clean and noisy speech waveforms to judge whether a generated speech waveform is clean, and when the score of the generated waveform reaches the threshold, the generated waveform is considered sufficiently clean.
The invention implements an end-to-end speech enhancement method within the generative adversarial framework to remove interference noise and irrelevant speaker sounds from the speech.
In a specific implementation, clean speech is mixed with common everyday noise at random signal-to-noise ratios to obtain the noisy speech corresponding to the clean speech; the clean speech data set and the corresponding noisy speech data set are then used to train the generative adversarial network that performs end-to-end speech enhancement.
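As a small illustration of this data preparation step, the sketch below mixes a clean utterance with a noise clip at a random signal-to-noise ratio; the -10 dB to 10 dB default range follows the typical range given in the training example later in the text.

```python
# Hedged sketch: build the noisy counterpart of a clean utterance by mixing
# in noise at a random SNR. NumPy only; the SNR range follows the text below.
import numpy as np

def mix_at_random_snr(clean, noise, low_db=-10.0, high_db=10.0):
    noise = np.resize(noise, clean.shape)      # repeat or trim noise to the clean length
    snr_db = np.random.uniform(low_db, high_db)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise               # noisy speech paired with `clean`
```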
The speech enhancement model training process is described in detail below, taking as an example the training of a model on a data set of 1,000 clean speech utterances.
The clean speech set and the everyday noise data set are mixed at random signal-to-noise ratios (typically between -10 dB and 10 dB) to obtain the noisy speech set corresponding to the clean speech set. The noisy speech is passed through the generator network to produce generated clean speech; the generated clean speech and the real clean speech are then fed to the discriminator network, which judges whether its input is real clean speech: the discriminator should output 0 for generated clean speech and 1 for real clean speech. Error gradients obtained from the loss function are then back-propagated to update the parameters until the discriminator can no longer distinguish generated clean speech from real clean speech, at which point the generator network is the trained speech enhancement network. Intuitively, the discriminator keeps telling the generator how to adjust so that the clean speech it generates becomes more realistic.
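A minimal sketch of one adversarial training step as described above, in the style of SEGAN-like waveform enhancement. The Generator (encoder-decoder) and conditional Discriminator modules, the least-squares losses and the L1 weight of 100 are assumptions for illustration; the patent does not specify them.

```python
# Hedged sketch of one GAN training step for speech enhancement (PyTorch).
import torch
import torch.nn.functional as F

def train_step(gen, disc, opt_g, opt_d, noisy, clean):
    # Discriminator: (clean, noisy) pairs should score 1, (generated, noisy) pairs 0.
    fake = gen(noisy).detach()
    d_real = disc(clean, noisy)
    d_fake = disc(fake, noisy)
    loss_d = F.mse_loss(d_real, torch.ones_like(d_real)) + \
             F.mse_loss(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator while staying close to the clean target.
    fake = gen(noisy)
    d_out = disc(fake, noisy)
    loss_g = F.mse_loss(d_out, torch.ones_like(d_out)) + 100.0 * F.l1_loss(fake, clean)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

Training alternates this step over the paired data set until the discriminator can no longer separate generated from real clean speech; the generator is then kept as the enhancement network.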
In one embodiment, step S3 includes:
S3.1: detecting voice activity endpoints in the enhanced voice data and removing long silence segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
S3.5: taking the logarithm of these energy values to compute the logarithmic energy output by each filter bank;
S3.6: applying a discrete cosine transform to the logarithmic energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC and GFCC features of the speech signal to obtain its acoustic features.
In a specific implementation, the preprocessing comprises pre-emphasis, framing and windowing. The specific steps of the feature extraction are as follows:
S301: performing voice activity detection (VAD) on the enhanced speech to remove long silence segments;
S302: pre-emphasizing the speech signal with a high-pass filter H(z) = 1 - μz^(-1), where μ is the pre-emphasis coefficient, typically 0.97, and z is the z-transform variable.
S303: the sampling frequency of the voice signal is 16 kHz; 512 sampling points are grouped into one frame, corresponding to a duration of 512/16000 × 1000 = 32 ms. Adjacent frames overlap by 256 sampling points, i.e. half of the 512-point frame length.
S304: assuming the framed signal is s(n), n = 0, 1, ..., N-1, with N the frame length, each frame is multiplied by a Hamming window: x(n) = s(n) × W(n), where W(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
S305: performing a fast Fourier transform on each framed and windowed signal x(n) to obtain the spectrum of each frame, then taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal. Since the speech signal is stored in discrete form, its discrete Fourier transform is X(k) = Σ_{n=0}^{T-1} x(n)·e^(-j2πnk/T), 0 ≤ k ≤ T-1, where x(n) is the input speech signal and T is the number of points of the Fourier transform.
S306: passing the power spectrum |X(k)|² obtained by the fast Fourier transform through a set of Mel-scale triangular filters H_m(k), 0 ≤ m ≤ M, where M is the number of filters: the power spectrum is multiplied by each filter and accumulated to obtain the energy of the frame in the frequency band of that filter, E(m) = Σ_{k=0}^{T-1} |X(k)|²·H_m(k).
S307: taking the logarithm of the energy values, the logarithmic energy output by each filter bank is s(m) = ln(Σ_{k=0}^{T-1} |X(k)|²·H_m(k)), 0 ≤ m ≤ M, where T is the number of points of the Fourier transform, M is the number of filters, |X(k)|² is the power spectrum obtained in S305, and H_m(k), 0 ≤ m ≤ M, is the set of Mel-scale triangular filters.
S308: substituting the logarithmic energy of S307 into a discrete cosine transform yields the L-order Mel-frequency cepstral coefficients (MFCC): C(l) = Σ_{m=1}^{M} s(m)·cos(πl(m - 0.5)/M), l = 1, 2, ..., L, where L is the MFCC order, typically 12-16, and M is the number of triangular filters.
S309: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform (DCT) to obtain the GFCC features of the voice signal.
S310: concatenating the MFCC and GFCC features of the speech signal to obtain its fused GMCC feature (a code sketch of steps S301-S310 follows below).
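A compact sketch of steps S301-S310 is given below. The Mel filterbank and DCT follow the formulas above; the Gammatone filterbank is built from the standard ERB-based magnitude approximation of a 4th-order Gammatone filter, and cubic-root compression is used as a concrete form of the exponential-compression step; both are assumptions, since the patent does not give the exact filter design.

```python
# Hedged sketch of GMCC extraction: MFCC (Mel filterbank) + GFCC (Gammatone
# filterbank approximation), concatenated per frame. Requires numpy, scipy, librosa.
import numpy as np
import librosa
from scipy.fftpack import dct

def gmcc(wav, sr=16000, n_fft=512, hop=256, n_mfcc=13, n_gfcc=13, n_filt=26):
    wav = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])          # pre-emphasis H(z) = 1 - 0.97 z^-1
    frames = librosa.util.frame(wav, frame_length=n_fft, hop_length=hop).T
    frames = frames * np.hamming(n_fft)                          # Hamming window (S304)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2              # |X(k)|^2 (S305)

    # MFCC: Mel triangular filters -> log energy -> DCT (S306-S308)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_filt)
    mfcc = dct(np.log(power @ mel_fb.T + 1e-10), norm='ortho')[:, :n_mfcc]

    # GFCC: Gammatone filterbank -> cubic-root compression -> DCT (S309)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    fc = np.linspace(100, sr / 2 - 100, n_filt)                  # assumed center-frequency spacing
    erb = 24.7 * (4.37 * fc / 1000 + 1)                          # ERB bandwidths
    gt_fb = np.array([(1 + ((freqs - f) / (1.019 * b)) ** 2) ** -2
                      for f, b in zip(fc, erb)])                 # 4th-order magnitude approximation
    gfcc = dct((power @ gt_fb.T + 1e-10) ** (1.0 / 3.0), norm='ortho')[:, :n_gfcc]

    return np.hstack([mfcc, gfcc])                               # fused GMCC feature (S310)
```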
Fig. 2 and Fig. 3 are flowcharts of MFCC and GFCC speech feature extraction, respectively, in an implementation of the present invention.
In one embodiment, step S4 includes:
Performing voice enhancement on the large amount of collected original voice data, extracting acoustic features from the enhanced data as training data, and inputting the training data into the speaker recognition model for training to obtain a trained model.
Specifically, training the model is an offline process. The speaker recognition model is trained as follows:
Training samples are collected by recording; the collected voice samples are processed by the voice preprocessing modules (the voice enhancement module and the voice feature extraction module) to obtain the GMCC features of the voice; the GMCC features are used as the model input, and the speaker recognition model is trained with a convolutional neural network structure and softmax classification.
The speaker recognition model training process is described below, taking the training of a model containing 1000 speakers as an example.
Samples are collected for each speaker, 100 samples per speaker. All voice samples are passed through the voice preprocessing modules (the voice enhancement module and the voice feature extraction module) to obtain the GMCC features of the voice as training data for the convolutional neural network (the speaker recognition model), and all training data are randomly split 5:1 into a training set and a validation set. The convolutional network is trained on the training set; when the recognition accuracy of the trained network on the validation set is basically unchanged, training is finished, otherwise training continues. The trained convolutional network is the offline speaker recognition model.
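An illustrative offline training loop matching this description: a 5:1 random split, softmax cross-entropy training, and stopping once validation accuracy is basically unchanged. It assumes the SpeakerCNN sketch above and a dataset yielding (GMCC feature map, speaker id) pairs; the batch size, learning rate and patience are guesses, not values from the patent.

```python
# Hedged sketch of the offline training procedure (PyTorch).
import torch
from torch.utils.data import DataLoader, random_split

def train_offline(model, dataset, max_epochs=100, patience=5):
    n_val = len(dataset) // 6                                   # 5:1 train/validation split
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_dl = DataLoader(train_set, batch_size=64, shuffle=True)
    val_dl = DataLoader(val_set, batch_size=64)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_acc, stale = 0.0, 0
    for _ in range(max_epochs):
        model.train()
        for feats, labels in train_dl:
            _, logits = model(feats)
            loss = torch.nn.functional.cross_entropy(logits, labels)
            opt.zero_grad(); loss.backward(); opt.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for feats, labels in val_dl:
                _, logits = model(feats)
                correct += (logits.argmax(1) == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc + 1e-3:
            best_acc, stale = acc, 0                            # accuracy still improving
        else:
            stale += 1
        if stale >= patience:                                   # accuracy basically unchanged
            break
    return model
```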
In one embodiment, the registration data in step S5 includes h voice samples for each speaker, and identifying the identity of the speaker to be identified according to the similarity between the feature of the speaker to be identified and the registered speaker features includes:
performing voice enhancement and feature extraction on each voice sample in the registration data, then extracting the depth feature of each voice sample from the obtained acoustic features with the convolutional neural network of the speaker recognition model;
averaging the h depth features of each speaker and storing the mean in a database as that speaker's feature;
performing voice enhancement and feature extraction on the voice data of the speaker to be identified, then inputting the result into the trained model to obtain the feature of the speaker to be identified;
computing the cosine similarity between the feature of the speaker to be identified and every speaker feature stored in the database; if the maximum cosine similarity exceeds a set threshold, the speaker in the database corresponding to that similarity is taken as the identity of the speaker; otherwise the speaker is rejected.
Registration mode:
Registration samples are collected by recording; the collected registration samples are passed through voice preprocessing to obtain the GMCC features of the voice; the deep feature of each voice sample is extracted from its GMCC features with the offline speaker recognition model; and the registration data (i.e., each speaker's feature) are generated and stored in a database.
For example, 10 speakers are registered with 20 voice samples each; the voice preprocessing module processes all voice samples to obtain the GMCC features of the voice; the deep features of the 200 voice samples are obtained from the GMCC features with the offline speaker recognition model; the 20 deep features of each speaker are then averaged and used as that speaker's feature; and the 10 speaker features are saved in the database as speaker_0, speaker_1, ..., speaker_9.
Recognition mode:
A sample to be identified is collected by recording; the sample is passed through voice preprocessing to obtain its GMCC features; the deep feature of the sample is obtained from the GMCC features with the offline speaker recognition model and used as the feature of the speaker to be identified; the cosine similarities between this feature and all speaker features in the database are computed, and if the maximum cosine similarity exceeds the set threshold, the speaker in the database corresponding to that similarity is taken as the identified speaker; otherwise the sample is rejected.
For example, a piece of the speaker's voice data is collected; its GMCC features are obtained by the voice preprocessing module; the deep feature of the voice data is obtained from the GMCC features with the offline speaker recognition model and used as the speaker feature; the cosine similarities between this feature and the 10 speaker features stored in the database are computed, giving cos_0, cos_1, ..., cos_9; the maximum cos_max among the 10 cosine similarities and the index speaker_x of the corresponding speaker are found; if cos_max is greater than the set threshold, the speaker is accepted as speaker_x, otherwise the speaker is identified as unregistered.
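The registration and recognition flow above reduces to the following sketch: average the h depth features per speaker at registration, then match by maximum cosine similarity against a threshold. The embed callable stands for the whole pipeline (speech enhancement, GMCC extraction and the trained model's depth feature); the 0.7 threshold is an illustrative value, not one given in the patent.

```python
# Hedged sketch of registration (enrollment) and recognition by cosine similarity.
import numpy as np

database = {}  # speaker id -> averaged depth feature

def register(speaker_id, samples, embed):
    feats = np.stack([embed(s) for s in samples])   # h deep features for this speaker
    database[speaker_id] = feats.mean(axis=0)       # speaker feature = mean of deep features

def recognize(sample, embed, threshold=0.7):
    q = embed(sample)
    best_id, best_cos = None, -1.0
    for sid, ref in database.items():
        cos = q @ ref / (np.linalg.norm(q) * np.linalg.norm(ref) + 1e-10)
        if cos > best_cos:
            best_id, best_cos = sid, cos
    # Accept the best match only above the threshold; otherwise reject as unregistered.
    return best_id if best_cos > threshold else None
```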
In summary, the invention realizes a speaker recognition method based on voice enhancement through voice collection, voice enhancement, voice feature extraction, speaker model training, speaker registration and speaker recognition.
Compared with the prior art, the invention has the beneficial effects that:
The speaker recognition method and system based on voice enhancement provided by the invention use an end-to-end speech enhancement method to remove noise and irrelevant speaker sounds from the voice and use the more noise-robust GFCC features in the voiceprint recognition process, improving the noise robustness of the whole system, solving the problem of poor voiceprint recognition performance caused by noise contained in the voice, and improving the recognition accuracy of the voiceprint recognition system.
Example two
Based on the same inventive concept, this embodiment provides a speaker recognition system based on speech enhancement, please refer to fig. 4, which includes:
A voice acquisition module 201, configured to acquire a large amount of original voice data;
the voice enhancement module 202 is configured to remove the interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module 203 is configured to extract MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fuse the MFCC and GFCC features to obtain the acoustic features of the voice;
The model training module 204 is configured to construct a speaker recognition model based on the convolutional neural network, and train the speaker recognition model by using acoustic features extracted from a large amount of original speech data as training data to obtain a trained model;
the speaker recognition module 205 is configured to register and recognize speakers: it collects registered voice samples, performs voice enhancement and feature extraction with the voice enhancement module and the voice feature extraction module, inputs the results into the trained model to obtain the depth feature of each registered voice sample, and stores that depth feature as each speaker's feature; it then acquires the voice data of the speaker to be identified, performs voice enhancement and feature extraction with the same modules, inputs the result into the trained model to obtain the feature of the speaker to be identified, and identifies the speaker's identity according to the similarity between that feature and the stored speaker features.
Because the system described in the second embodiment of the present invention implements the speaker recognition method based on speech enhancement of the first embodiment, a person skilled in the art can understand the specific structure and variations of the system from the method described in the first embodiment, so the details are not repeated here. All systems used by the method of the first embodiment of the present invention fall within the scope of the invention.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical schemes described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for speaker recognition based on speech enhancement, comprising:
s1: collecting a large amount of original voice data;
S2: removing interference noise and irrelevant speaker sound contained in the original voice data to obtain enhanced voice data;
S3: extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
S4: constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
S5: collecting registered voice samples, performing voice enhancement and feature extraction with the methods of S2 and S3, and inputting the results into the trained model to obtain the depth feature of each registered voice sample, which is taken as each speaker's feature and stored; acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the stored speaker features;
step S2 uses a generative adversarial network to remove the interference noise and irrelevant speaker sounds contained in the original voice data, thereby achieving end-to-end speech enhancement; the generative adversarial network is obtained as follows: clean speech is mixed with common everyday noise at random signal-to-noise ratios to obtain the noisy speech corresponding to the clean speech, and the clean speech data set and the corresponding noisy speech data set are used for training to obtain the generative adversarial network;
The step S3 comprises the following steps:
S3.1: detecting voice activity endpoints in the enhanced voice data and removing long silence segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
S3.5: taking the logarithm of these energy values to compute the logarithmic energy output by each filter bank;
S3.6: applying a discrete cosine transform to the logarithmic energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC and GFCC features of the speech signal to obtain its acoustic features.
2. The speaker recognition method as recited in claim 1, wherein step S1 performs the collection of the original voice data by recording.
3. The speaker recognition method as claimed in claim 1, wherein step S4 comprises:
a large amount of original voice data is subjected to voice enhancement, acoustic features are extracted from the enhanced voice data to serve as training data, and the training data are input into the speaker recognition model for training to obtain a trained model.
4. The speaker recognition method as claimed in claim 1, wherein the registration data includes h voice samples for each speaker, and identifying in step S5 the identity of the speaker to be recognized based on the similarity between the feature of the speaker to be recognized and the registered speaker features comprises:
performing voice enhancement and feature extraction on each voice sample in the registration data, then extracting the depth feature of each voice sample from the obtained acoustic features with the convolutional neural network of the speaker recognition model;
averaging the h depth features of each speaker and storing the mean in a database as that speaker's feature;
performing voice enhancement and feature extraction on the voice data of the speaker to be identified, then inputting the result into the trained model to obtain the feature of the speaker to be identified;
computing the cosine similarity between the feature of the speaker to be identified and every speaker feature stored in the database; if the maximum cosine similarity exceeds a set threshold, the speaker in the database corresponding to that similarity is taken as the identity of the speaker; otherwise the speaker is rejected.
5. A speech enhancement-based speaker recognition system, comprising:
the voice acquisition module is used for acquiring a large amount of original voice data;
the voice enhancement module is used for removing the interference noise and irrelevant speaker sounds contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module is used for extracting MFCC features and Gammatone-filter-based cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC and GFCC features to obtain the acoustic features of the voice;
The model training module is used for constructing a speaker recognition model based on the convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module is used for collecting registered voice samples, performing voice enhancement and feature extraction with the voice enhancement module and the voice feature extraction module, inputting the results into the trained model to obtain the depth feature of each registered voice sample, which is taken as each speaker's feature and stored; acquiring voice data of the speaker to be identified, performing voice enhancement and feature extraction with the voice enhancement module and the voice feature extraction module, inputting the result into the trained model to obtain the feature of the speaker to be identified, and identifying the speaker's identity according to the similarity between that feature and the stored speaker features;
the voice enhancement module removes the interference noise and irrelevant speaker sounds contained in the original voice data with a generative adversarial network, achieving end-to-end speech enhancement; the generative adversarial network is obtained as follows: clean speech is mixed with common everyday noise at random signal-to-noise ratios to obtain the noisy speech corresponding to the clean speech, and the clean speech data set and the corresponding noisy speech data set are used for training to obtain the generative adversarial network;
The voice feature extraction module is specifically configured to execute the following steps:
S3.1: detecting voice activity endpoints in the enhanced voice data and removing long silence segments;
S3.2: preprocessing the voice obtained in step S3.1;
S3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared modulus of the spectrum to obtain the power spectrum of the voice signal;
S3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
S3.5: taking the logarithm of these energy values to compute the logarithmic energy output by each filter bank;
S3.6: applying a discrete cosine transform to the logarithmic energies to obtain the L-order Mel-frequency cepstral coefficients;
S3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
S3.8: concatenating the MFCC and GFCC features of the speech signal to obtain its acoustic features.
CN202111140239.5A 2021-09-28 Speaker recognition method and system based on voice enhancement Active CN113823293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111140239.5A CN113823293B (en) 2021-09-28 Speaker recognition method and system based on voice enhancement


Publications (2)

Publication Number Publication Date
CN113823293A CN113823293A (en) 2021-12-21
CN113823293B true CN113823293B (en) 2024-04-26



Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568472A (en) * 2010-12-15 2012-07-11 盛乐信息技术(上海)有限公司 Voice synthesis system with speaker selection and realization method thereof
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
CN104835498A (en) * 2015-05-25 2015-08-12 重庆大学 Voiceprint identification method based on multi-type combination characteristic parameters
CA3179080A1 (en) * 2016-09-19 2018-03-22 Pindrop Security, Inc. Channel-compensated low-level features for speaker recognition
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN110299142A (en) * 2018-05-14 2019-10-01 桂林远望智能通信科技有限公司 A kind of method for recognizing sound-groove and device based on the network integration
CN109147810A (en) * 2018-09-30 2019-01-04 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network
CN109410974A (en) * 2018-10-23 2019-03-01 百度在线网络技术(北京)有限公司 Sound enhancement method, device, equipment and storage medium
CN109326302A (en) * 2018-11-14 2019-02-12 桂林电子科技大学 A kind of sound enhancement method comparing and generate confrontation network based on vocal print
CN109524020A (en) * 2018-11-20 2019-03-26 上海海事大学 A kind of speech enhan-cement processing method
CN109712628A (en) * 2019-03-15 2019-05-03 哈尔滨理工大学 A kind of voice de-noising method and audio recognition method based on RNN
CN110428849A (en) * 2019-07-30 2019-11-08 珠海亿智电子科技有限公司 A kind of sound enhancement method based on generation confrontation network
KR20210036692A (en) * 2019-09-26 2021-04-05 국방과학연구소 Method and apparatus for robust speech enhancement training using adversarial training
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion
CN112820301A (en) * 2021-03-15 2021-05-18 中国科学院声学研究所 Unsupervised cross-domain voiceprint recognition method fusing distribution alignment and counterstudy

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A fast speaker recognition method based on convolutional neural networks; Cai Qian et al.; Radio Engineering; Vol. 50, No. 6, pp. 447-451 *
A survey of monaural speech denoising and dereverberation; Lan Tian; Peng Chuan; Li Sen; Ye Wenzheng; Li Meng; Hui Guoqiang; Lyu Yilan; Qian Yuxin; Liu Qiao; Journal of Computer Research and Development; No. 5; full text *
Application of a dual-microphone-array speech enhancement algorithm to speaker recognition; Mao Wei; Zeng Qingning; Long Chao; Technical Acoustics; 2018-06-15; No. 3; full text *
Experimental design of speaker recognition based on neural networks; Yang Yao; Chen Xiao; Research and Exploration in Laboratory; No. 9; full text *
MFCC speaker recognition based on endpoint detection and Gaussian filter banks; Wang Meng; Wang Fulong; Computer Systems & Applications; 2016-10-15; No. 10; full text *

Similar Documents

Publication Publication Date Title
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
Wang et al. Channel pattern noise based playback attack detection algorithm for speaker recognition
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
CN101923855A (en) Test-irrelevant voice print identifying system
CN103065629A (en) Speech recognition system of humanoid robot
CN108564956B (en) Voiceprint recognition method and device, server and storage medium
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
CN112382300A (en) Voiceprint identification method, model training method, device, equipment and storage medium
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
Nandyal et al. MFCC based text-dependent speaker identification using BPNN
Revathi et al. Text independent speaker recognition and speaker independent speech recognition using iterative clustering approach
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
Neelima et al. Mimicry voice detection using convolutional neural networks
CN111524520A (en) Voiceprint recognition method based on error reverse propagation neural network
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN113823293B (en) Speaker recognition method and system based on voice enhancement
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method
Nagakrishnan et al. Generic speech based person authentication system with genuine and spoofed utterances: different feature sets and models
Khetri et al. Automatic speech recognition for marathi isolated words
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
Wang et al. Robust Text-independent Speaker Identification in a Time-varying Noisy Environment.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant