CN113823293A - Speaker recognition method and system based on voice enhancement - Google Patents

Speaker recognition method and system based on voice enhancement

Info

Publication number
CN113823293A
CN113823293A (application CN202111140239.5A)
Authority
CN
China
Prior art keywords
speaker
voice
feature
data
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111140239.5A
Other languages
Chinese (zh)
Other versions
CN113823293B (en)
Inventor
熊盛武 (Xiong Shengwu)
张欣冉 (Zhang Xinran)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202111140239.5A priority Critical patent/CN113823293B/en
Publication of CN113823293A publication Critical patent/CN113823293A/en
Application granted granted Critical
Publication of CN113823293B publication Critical patent/CN113823293B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Filters That Use Time-Delay Elements (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a speaker recognition method and system based on voice enhancement. The method comprises the following steps: S1, collecting a large amount of original voice data; S2, removing the interference noise and irrelevant speakers' voices contained in the original voice data; S3, extracting MFCC features and GFCC features and fusing them to obtain the acoustic features of the voice; S4, constructing a speaker recognition model based on a convolutional neural network and training it on acoustic features extracted from the large amount of original voice data; S5, collecting registered voice samples for registration, acquiring the voice data of the speaker to be recognized, performing voice enhancement and feature extraction with the methods of S2 and S3, inputting the result into the trained model to obtain the feature of the speaker to be recognized, and recognizing the identity of the speaker to be recognized according to the similarity between this feature and the registered speaker features. The invention improves the recognition accuracy of voiceprint recognition systems.

Description

Speaker recognition method and system based on voice enhancement
Technical Field
The invention relates to the field of pattern recognition, in particular to a speaker recognition method and system based on voice enhancement.
Background
Voiceprint recognition is a technology that extracts a speaker's voice characteristics and speech content to automatically verify the speaker's identity. With the wide application of artificial intelligence in daily life, voiceprint recognition has taken on a growing role, for example in voice-based authentication on personal smart devices (e.g., mobile phones, vehicles, and notebook computers), in securing bank transactions and remote payments, and in automatic identity tagging.
However, because background noise in real life is complex, the voice to be recognized almost always contains various noises, which degrades voiceprint recognition performance. Overcoming the noise in the voice to be recognized is therefore an urgent problem for applying voiceprint recognition technology in practice.
Disclosure of Invention
The invention provides a speaker recognition method and system based on voice enhancement, which are used to solve, or at least partially solve, the technical problem of poor voiceprint recognition performance in the prior art.
In order to solve the above technical problem, a first aspect of the present invention provides a speaker recognition method based on speech enhancement, including:
s1: collecting a large amount of original voice data;
s2: removing interference noise and irrelevant speaker voice contained in original voice data to obtain enhanced voice data;
s3: extracting MFCC features and Gammatone-filter cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC features and the GFCC features to obtain the acoustic features of the voice;
s4: constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
s5: collecting registered voice samples, performing voice enhancement and feature extraction by adopting methods of S2 and S3, inputting a trained model to obtain the depth feature of each registered voice sample, taking the depth feature as the speaker feature of each speaker, and storing the speaker feature; obtaining voice data of the speaker to be recognized, performing voice enhancement and feature extraction by adopting methods of S2 and S3, inputting the trained model to obtain the feature of the speaker to be recognized, and recognizing the identity of the speaker to be recognized according to the similarity between the feature of the speaker to be recognized and the stored feature of the speaker.
In one embodiment, step S1 is performed by recording raw voice data.
In one embodiment, step S2 uses a generative adversarial network (GAN) to remove the interference noise and irrelevant speakers' voices contained in the original voice data, achieving end-to-end voice enhancement.
In one embodiment, step S3 includes:
s3.1: performing voice activity detection on the enhanced voice data to remove long silent segments;
s3.2: preprocessing the voice obtained in step S3.1;
s3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared magnitude of the spectrum to obtain the power spectrum of the voice signal;
s3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
s3.5: taking the logarithm of the energy of each frame in each triangular filter's frequency band to obtain the log energy output by each filter;
s3.6: substituting the log energy into a discrete cosine transform to obtain the L-order Mel-frequency cepstral coefficients;
s3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
s3.8: concatenating the MFCC features and GFCC features of the voice signal to obtain the acoustic features of the voice signal.
In one embodiment, step S4 includes:
performing voice enhancement on the large amount of collected original voice data, extracting acoustic features from the enhanced data as training data, and inputting the training data into the speaker recognition model for training to obtain the trained model.
in one embodiment, the step S5 of registering data including h voice samples of each speaker, and identifying the identity of the speaker to be identified according to the similarity between the characteristics of the speaker to be identified and the characteristics of the registered speaker comprises:
after voice enhancement and feature extraction are carried out on each voice sample in the registration data, the depth feature of each voice sample is extracted from the obtained acoustic feature through a convolutional neural network of a speaker recognition model;
averaging the h depth features of each speaker to serve as the speaker feature of each speaker, and storing the speaker feature in a database;
after voice data of a speaker to be recognized is subjected to voice enhancement and feature extraction, inputting a trained model to obtain the feature of the speaker to be recognized;
and calculating the cosine similarity between the feature of the speaker to be recognized and all speaker features stored in the database; if the maximum cosine similarity is greater than a set threshold, the speaker in the database corresponding to that cosine similarity is the recognized identity; otherwise, the speaker is rejected.
Based on the same inventive concept, the second aspect of the present invention provides a speaker recognition system based on speech enhancement, comprising:
the voice acquisition module is used for acquiring a large amount of original voice data;
the voice enhancement module is used for removing interference noise and irrelevant speaker voice contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module is used for extracting MFCC features and Gammatone-filter cepstral coefficient (GFCC) features from the enhanced voice data and fusing the MFCC features and the GFCC features to obtain the acoustic features of the voice;
the model training module is used for constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module is used for collecting the registered voice samples, inputting the trained model to obtain the depth characteristic of each registered voice sample after voice enhancement and characteristic extraction are carried out by adopting the methods of the voice enhancement module and the voice characteristic extraction module, taking the depth characteristic as the speaker characteristic of each speaker, and storing the speaker characteristic of each speaker; obtaining the voice data of the speaker to be recognized, inputting the trained model to obtain the characteristics of the speaker to be recognized after performing voice enhancement and characteristic extraction by adopting the methods of a voice enhancement module and a voice characteristic extraction module, and recognizing the identity of the speaker to be recognized according to the similarity between the characteristics of the speaker to be recognized and the stored characteristics of the speaker.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a speaker recognition method based on voice enhancement, which uses an end-to-end voice enhancement method to remove noise in voice and irrelevant speaker voice, uses GFCC (noise robust character) characteristics with more noise robustness in the voiceprint recognition process, fuses the MFCC characteristics and the GFCC characteristics to obtain acoustic characteristics of voice, can improve the noise robustness, constructs a speaker recognition model based on a convolutional neural network, trains the model by using training data, collects registered voice samples, extracts and stores the speaker characteristics of each registered speaker, and recognizes the identity of the speaker to be recognized according to the similarity between the characteristics of the speaker to be recognized and the stored characteristics of the speaker. The problem of among the prior art because the noise that contains in the pronunciation leads to the voiceprint recognition effect not good is solved, improve the discernment rate of accuracy of voiceprint discernment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for speaker recognition based on speech enhancement according to an embodiment of the present invention;
FIG. 2 is a flow chart of the voice feature MFCC extraction in the practice of the present invention;
FIG. 3 is a flow chart of the extraction of the GFCC speech feature in the practice of the present invention;
FIG. 4 is a block diagram of a speaker recognition system based on speech enhancement in accordance with an embodiment of the present invention.
Detailed Description
The invention aims to provide a speaker recognition method based on voice enhancement that solves the prior-art problem of poor recognition performance caused by noise in the voice to be recognized, which prevents accurate feature extraction.
The main concept of the invention is as follows:
firstly, collecting a large amount of original voice data, and then removing interference noise and irrelevant speaker voice contained in the original voice data to obtain enhanced voice data; extracting MFCC characteristics and cepstrum coefficient GFCC characteristics based on a Gamma-tone filter from the enhanced voice data, and fusing the MFCC characteristics and the GFCC characteristics to obtain acoustic characteristics of voice; then, constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model; collecting registered voice samples, performing voice enhancement and feature extraction by adopting methods of S2 and S3, inputting a trained model to obtain the depth feature of each registered voice sample, taking the depth feature as the speaker feature of each speaker, and storing the speaker feature; and then acquiring voice data of the speaker to be recognized, performing voice enhancement and feature extraction by adopting methods of S2 and S3, inputting the trained model to obtain the feature of the speaker to be recognized, and recognizing the identity of the speaker to be recognized according to the similarity between the feature of the speaker to be recognized and the stored feature of the speaker.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment of the invention provides a speaker recognition method based on voice enhancement, which comprises the following steps:
s1: collecting a large amount of original voice data;
s2: removing interference noise and irrelevant speaker voice contained in original voice data to obtain enhanced voice data;
s3: extracting MFCC features and Gammatone-filter cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC features and the GFCC features to obtain the acoustic features of the voice;
s4: constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
s5: collecting registered voice samples, performing voice enhancement and feature extraction by adopting methods of S2 and S3, inputting a trained model to obtain the depth feature of each registered voice sample as the speaker feature of each speaker, and storing the speaker feature of each speaker; obtaining voice data of the speaker to be recognized, performing voice enhancement and feature extraction by adopting methods of S2 and S3, inputting the trained model to obtain the feature of the speaker to be recognized, and recognizing the identity of the speaker to be recognized according to the similarity between the feature of the speaker to be recognized and the stored feature of the speaker.
Specifically, in the speaker recognition model training module, the network model uses a convolutional neural network, the classifier uses softmax, and the trained model is an offline model. The registered voice data includes a plurality of speakers, each speaker including h voice samples.
Please refer to fig. 1, which is a flowchart of a speaker recognition method based on speech enhancement according to an embodiment of the present invention.
In one embodiment, step S1 is performed by recording raw voice data.
In one embodiment, step S2 uses a generative adversarial network (GAN) to remove the interference noise and irrelevant speakers' voices contained in the original voice data, achieving end-to-end voice enhancement.
The generator of the generative adversarial network is a fully convolutional encoder-decoder structure used to remove noise from the voice and generate a clean speech waveform. The adversarial (discriminator) network uses the clean and noisy speech waveforms to set a threshold for judging whether a generated waveform is clean; when the generated waveform's score reaches that threshold, the generated waveform is considered sufficiently clean.
The invention implements an end-to-end voice enhancement method within a generative adversarial framework to remove interference noise and irrelevant speakers' voices from the voice.
In a specific implementation, clean speech is mixed with common real-life noise at random signal-to-noise ratios to obtain noisy speech corresponding to the clean speech; the clean speech data set and the corresponding noisy speech data set are then used to train the generative adversarial network that performs end-to-end voice enhancement.
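As a concrete illustration of the mixing step, the numpy sketch below scales a noise clip against a clean utterance so that their power ratio equals a randomly drawn SNR. The function name and the tile-and-trim handling of short noise clips are illustrative choices, and the default SNR range follows the -10 dB to 10 dB range used in the training example below.

```python
import numpy as np

def mix_at_random_snr(clean, noise, snr_db_range=(-10.0, 10.0)):
    """Mix a clean utterance with a noise clip at a random SNR (in dB)."""
    snr_db = np.random.uniform(*snr_db_range)
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```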
The voice enhancement model training process is described in detail below, taking a model trained on a data set of 1000 clean utterances as an example.
The clean speech set and the real-life noise set are mixed at random signal-to-noise ratios (typically between -10 dB and 10 dB) to obtain the noisy speech set corresponding to the clean speech set. Noisy speech passes through the generator network to produce generated clean speech, and the discriminator network judges whether its input is real clean speech: for generated clean speech the discriminator should output 0, and for real clean speech, 1. Parameters are then updated by back-propagating the error gradient obtained from the loss function, until the discriminator can no longer reliably distinguish generated clean speech from real clean speech; the generator network is then the trained voice enhancement network. Intuitively, the discriminator tells the generator how to adjust so that the clean speech it generates becomes more realistic.
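A minimal PyTorch sketch of this adversarial training loop is shown below. The patent fixes only the fully convolutional encoder-decoder generator and the 0/1 discriminator targets, so the placeholder layer shapes, the least-squares (MSE) adversarial loss, and the optimizer settings here are assumptions for illustration, not the claimed architecture.

```python
import torch
import torch.nn as nn

# Placeholder 1-D convolutional generator/discriminator; the text only
# specifies a fully convolutional encoder-decoder generator, so these
# bodies are illustrative stand-ins.
G = nn.Sequential(nn.Conv1d(1, 16, 31, padding=15), nn.PReLU(),
                  nn.Conv1d(16, 1, 31, padding=15))
D = nn.Sequential(nn.Conv1d(1, 16, 31, stride=4), nn.PReLU(),
                  nn.Conv1d(16, 1, 31, stride=4),
                  nn.AdaptiveAvgPool1d(1), nn.Flatten())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
mse = nn.MSELoss()

def train_step(noisy, clean):          # shape: (batch, 1, samples)
    # Discriminator: real clean speech -> 1, generated speech -> 0.
    opt_d.zero_grad()
    fake = G(noisy).detach()
    loss_d = (mse(D(clean), torch.ones(clean.size(0), 1)) +
              mse(D(fake), torch.zeros(clean.size(0), 1)))
    loss_d.backward()
    opt_d.step()
    # Generator: adjust so the discriminator scores its output as clean.
    opt_g.zero_grad()
    loss_g = mse(D(G(noisy)), torch.ones(noisy.size(0), 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

In practice the generator would be the full encoder-decoder described above, and a waveform reconstruction term is often added to the generator loss.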
In one embodiment, step S3 includes:
s3.1: performing voice activity detection on the enhanced voice data to remove long silent segments;
s3.2: preprocessing the voice obtained in step S3.1;
s3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared magnitude of the spectrum to obtain the power spectrum of the voice signal;
s3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
s3.5: taking the logarithm of the energy of each frame in each triangular filter's frequency band to obtain the log energy output by each filter;
s3.6: substituting the log energy into a discrete cosine transform to obtain the L-order Mel-frequency cepstral coefficients;
s3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
s3.8: concatenating the MFCC features and GFCC features of the voice signal to obtain the acoustic features of the voice signal.
In a specific implementation process, the preprocessing includes pre-emphasis, framing, and windowing. The specific steps of feature extraction are as follows:
s301: performing voice activity endpoint detection (VAD) on the enhanced voice to eliminate a long mute period;
s302: the speech signal is pre-emphasized by passing it through a high-pass filter: h (z) ═ 1-. mu.z-1H (z) is a high-pass filter; μ pre-emphasis factor, typically taken as 0.97; z is a speech signal.
S303: the sampling frequency of the voice signal is 16 kHz, and every 512 samples are grouped into one frame, corresponding to a duration of 512/16000 × 1000 = 32 ms. An overlap region is formed between two adjacent frames; it contains 256 samples, half the 512-sample frame length.
S304: assuming the framed signal is $s(n)$, $n = 0, 1, \ldots, N-1$, with $N$ the frame length in samples, each frame is multiplied by a Hamming window:
$x(n) = s(n) \times W(n)$,
$W(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$
where $W(n)$ is the Hamming window.
S305: a fast Fourier transform is applied to each framed and windowed signal $x(n)$ to obtain the spectrum of each frame, and the squared magnitude of the spectrum gives the power spectrum of the voice signal. The discrete Fourier transform of the speech signal (the speech signal is stored in discrete form) is
$X(k) = \sum_{n=0}^{T-1} x(n)\, e^{-j 2\pi nk/T}, \quad 0 \le k < T,$
where $x(n)$ is the input speech signal and $T$ is the number of points of the Fourier transform.
S306: the power spectrum $|X(k)|^2$ obtained by the fast Fourier transform is passed through a set of Mel-scale triangular filters $H_m(k)$, $0 \le m < M$, where $M$ is the number of filters. The power spectrum is multiplied by each filter and accumulated to obtain the energy of the frame in that filter's frequency band:
$E(m) = \sum_{k=0}^{T-1} |X(k)|^2 H_m(k).$
S307: taking the logarithm of the energy values, the log energy output by each filter is calculated as
$s(m) = \ln\!\left(\sum_{k=0}^{T-1} |X(k)|^2 H_m(k)\right), \quad 0 \le m < M,$
where $T$ is the number of points of the Fourier transform, $M$ is the number of filters, $|X(k)|^2$ is the power spectrum obtained in S305, and $H_m(k)$, $0 \le m < M$, is the set of Mel-scale triangular filters.
S308: substituting the log energy of S307 into the discrete cosine transform gives the L-order Mel-frequency cepstral coefficients (MFCC):
$C(l) = \sum_{m=0}^{M-1} s(m)\,\cos\!\left(\frac{\pi l\,(m + 0.5)}{M}\right), \quad l = 1, 2, \ldots, L,$
where $L$ is the order of the MFCC, usually taken as 12-16, and $M$ is the number of triangular filters.
S309: the power spectrum obtained by the fast Fourier transform is passed through a Gammatone filterbank, then subjected to exponential compression and a discrete cosine transform (DCT) to obtain the GFCC features of the voice signal.
S310: the MFCC features and GFCC features of the voice signal are concatenated to obtain the GMCC features of the voice signal.
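GFCC extraction mirrors the MFCC path with a Gammatone filterbank and cube-root compression (a common reading of the "exponential compression" above). No single standard library routine is implied by the text, so the filterbank matrix gt_fb in this sketch is assumed to be precomputed by the caller; the second function is the S310 fusion.

```python
import numpy as np
from scipy.fft import dct

def gfcc_from_power(power, gt_fb, L=13):
    """S309 sketch: Gammatone filterbank energies, exponential (cube-root)
    compression, then DCT. gt_fb is an assumed (n_filters, n_bins)
    Gammatone filterbank matrix supplied by the caller."""
    energies = power @ gt_fb.T
    compressed = np.power(energies + 1e-12, 1.0 / 3.0)
    return dct(compressed, type=2, norm='ortho', axis=1)[:, :L]

def fuse_features(mfcc, gfcc):
    """S310: concatenate per-frame MFCC and GFCC vectors into GMCC features."""
    return np.concatenate([mfcc, gfcc], axis=1)
```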
Fig. 2 and fig. 3 are a flow chart of voice feature MFCC extraction and a flow chart of voice feature GFCC extraction, respectively, in the implementation of the present invention.
In one embodiment, step S4 includes:
performing voice enhancement on the large amount of collected original voice data, extracting acoustic features from the enhanced data as training data, and inputting the training data into the speaker recognition model for training to obtain the trained model.
Specifically, training the model is an offline process. The speaker recognition model is trained as follows:
training samples are collected by recording; the collected voice samples pass through the voice preprocessing module (the voice enhancement module and the voice feature extraction module) to obtain the GMCC features of the voice; the GMCC features are taken as the model input, and the speaker recognition model is trained with a convolutional neural network structure and softmax classification.
The following describes the speaker recognition model training process by taking training a model containing 1000 speakers as an example.
Samples are collected for each speaker, 100 samples per speaker. All voice samples pass through the voice preprocessing module (the voice enhancement module and the voice feature extraction module) to obtain the GMCC features of the voice, which serve as training data for the convolutional neural network (the speaker recognition model); all training data are randomly split 5:1 into a training set and a validation set. The convolutional network is trained on the training set; when the recognition accuracy of the trained network on the validation set stops improving, training is complete; otherwise, training continues. The trained convolutional network is the offline speaker recognition model.
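A condensed PyTorch sketch of this offline training setup follows. The patent specifies only a convolutional network with softmax classification and a random 5:1 train/validation split, so the architecture, the 256-dimensional deep-feature layer, and the stand-in data shapes are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

n_speakers, n_frames, feat_dim = 1000, 300, 26     # placeholder shapes
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(64, 256), nn.ReLU(),   # 256-d deep-feature layer
    nn.Linear(256, n_speakers))                    # softmax applied in the loss
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                    # includes log-softmax

gmcc = torch.randn(600, 1, n_frames, feat_dim)     # stand-in GMCC feature maps
labels = torch.randint(0, n_speakers, (600,))
train_set, val_set = random_split(TensorDataset(gmcc, labels), [500, 100])  # 5:1

for epoch in range(10):                            # until val accuracy plateaus
    for x, y in DataLoader(train_set, batch_size=64, shuffle=True):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```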
In one embodiment, the registration data includes h voice samples for each speaker, and recognizing the identity of the speaker to be recognized according to the similarity between the feature of the speaker to be recognized and the registered speaker features in step S5 comprises:
after voice enhancement and feature extraction are carried out on each voice sample in the registration data, the depth feature of each voice sample is extracted from the obtained acoustic feature through a convolutional neural network of a speaker recognition model;
averaging the h depth features of each speaker to serve as the speaker feature of each speaker, and storing the speaker feature in a database;
after voice data of a speaker to be recognized is subjected to voice enhancement and feature extraction, inputting a trained model to obtain the feature of the speaker to be recognized;
and calculating the cosine similarity between the feature of the speaker to be recognized and all speaker features stored in the database; if the maximum cosine similarity is greater than a set threshold, the speaker in the database corresponding to that cosine similarity is the recognized identity; otherwise, the speaker is rejected.
Registration mode:
registration samples are collected by recording; the collected registration samples pass through the voice preprocessing module to obtain the GMCC features of the voice; the deep feature of each voice sample is extracted from the GMCC features with the offline speaker recognition model; the registration data (i.e., the speaker feature of each speaker) is generated and stored in a database.
For example, samples of 10 speakers are collected (20 voice samples per person); the voice preprocessing module processes all voice samples to obtain the GMCC features of the voice; the GMCC features are passed through the offline speaker recognition model to obtain the deep features of the 200 voice samples; the 20 deep features of each speaker are then averaged to give that speaker's feature; and the 10 speaker features are saved in the database as speaker0, speaker1, ..., speaker9.
Recognition mode:
a sample to be recognized is collected by recording; its GMCC features are obtained through the voice preprocessing module; the GMCC features are passed through the offline speaker recognition model to obtain the deep feature of the sample, which is taken as the feature of the speaker to be recognized; the cosine similarity between this feature and all speaker features in the database is calculated; if the maximum cosine similarity is greater than a certain threshold, the speaker in the database corresponding to that similarity is the recognized speaker; otherwise, the speaker is rejected.
For example, a piece of the speaker's voice data is collected; its GMCC features are obtained through the voice preprocessing module; the GMCC features are passed through the offline speaker recognition model to obtain the deep feature of the voice data, which serves as the speaker feature; the cosine similarities between this feature and the 10 speaker features stored in the database are computed, giving cos0, cos1, ..., cos9; the maximum value cos_max and the index speaker_x of the corresponding speaker are found; if the maximum exceeds the set threshold, the speaker is accepted as speaker_x, and otherwise the speaker is treated as unregistered.
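The registration averaging and cosine-similarity decision can be sketched as follows; model_embed stands for the trained network truncated at its deep-feature layer, and the 0.7 threshold is an assumed value, since the patent leaves the threshold as a parameter to be set.

```python
import numpy as np

def enroll(model_embed, samples_per_speaker):
    """Average the h deep features of each speaker's registration samples.
    model_embed maps a GMCC feature map to its deep-feature vector."""
    return {spk: np.mean([model_embed(s) for s in samples], axis=0)
            for spk, samples in samples_per_speaker.items()}

def identify(model_embed, utterance, enrolled, threshold=0.7):
    """Cosine similarity against every stored speaker feature; accept the
    best match only if it clears the threshold (0.7 is an assumed value)."""
    q = model_embed(utterance)
    def cos(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    best_spk, best_sim = max(((s, cos(q, f)) for s, f in enrolled.items()),
                             key=lambda t: t[1])
    return best_spk if best_sim > threshold else None   # None = rejected
```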
In summary, the invention realizes a speaker recognition method based on speech enhancement through speech acquisition, speech enhancement, speech feature extraction, speaker model training, speaker registration and speaker recognition.
Compared with the prior art, the invention has the beneficial effects that:
the end-to-end voice enhancement method is used to remove noise in voice and irrelevant speaker voice, GFCC features with noise robustness are used in the voiceprint recognition process, the noise robustness of the whole system is improved, the problem of poor voiceprint recognition effect caused by noise contained in voice can be solved, and the recognition accuracy of the voiceprint recognition system is improved.
Example two
Based on the same inventive concept, the present embodiment provides a speaker recognition system based on speech enhancement, please refer to fig. 4, the system includes:
a voice collecting module 201, configured to collect a large amount of original voice data;
the voice enhancement module 202 is configured to remove interference noise and irrelevant speaker voice included in the original voice data to obtain enhanced voice data;
a speech feature extraction module 203, configured to extract MFCC features and cepstrum coefficient GFCC features based on a Gammatone filter from the enhanced speech data, and fuse the MFCC features and the GFCC features to obtain acoustic features of the speech;
the model training module 204 is used for constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module 205 is used for registering and recognizing speakers, collecting registered voice samples, performing voice enhancement and feature extraction by adopting the methods of the voice enhancement module and the voice feature extraction module, inputting the trained model to obtain the depth feature of each registered voice sample, and storing the depth feature as the speaker feature of each speaker; obtaining the voice data of the speaker to be recognized, inputting the trained model to obtain the characteristics of the speaker to be recognized after performing voice enhancement and characteristic extraction by adopting the methods of a voice enhancement module and a voice characteristic extraction module, and recognizing the identity of the speaker to be recognized according to the similarity between the characteristics of the speaker to be recognized and the stored characteristics of the speaker.
Since the system described in the second embodiment of the present invention is a system adopted for implementing the speaker recognition method based on speech enhancement according to the first embodiment of the present invention, a person skilled in the art can understand the specific structure and deformation of the system based on the method described in the first embodiment of the present invention, and thus the details are not described herein again. All systems adopted by the method of the first embodiment of the present invention are within the intended protection scope of the present invention.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A speaker recognition method based on speech enhancement is characterized by comprising the following steps:
s1: collecting a large amount of original voice data;
s2: removing interference noise and irrelevant speaker voice contained in original voice data to obtain enhanced voice data;
s3: extracting MFCC features and Gammatone-filter cepstral coefficient (GFCC) features from the enhanced voice data, and fusing the MFCC features and the GFCC features to obtain the acoustic features of the voice;
s4: constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
s5: collecting registered voice samples, performing voice enhancement and feature extraction by adopting methods of S2 and S3, inputting a trained model to obtain the depth feature of each registered voice sample, taking the depth feature as the speaker feature of each speaker, and storing the speaker feature; obtaining voice data of the speaker to be recognized, performing voice enhancement and feature extraction by adopting methods of S2 and S3, inputting the trained model to obtain the feature of the speaker to be recognized, and recognizing the identity of the speaker to be recognized according to the similarity between the feature of the speaker to be recognized and the stored feature of the speaker.
2. The speaker recognition method as claimed in claim 1, wherein in step S1 the original voice data is collected by recording.
3. The speaker recognition method as claimed in claim 1, wherein step S2 employs a generative adversarial network to remove the interference noise and irrelevant speakers' voices contained in the original voice data, thereby achieving end-to-end voice enhancement.
4. The speaker recognition method as claimed in claim 1, wherein step S3 comprises:
s3.1: performing voice activity detection on the enhanced voice data to remove long silent segments;
s3.2: preprocessing the voice obtained in step S3.1;
s3.3: performing a fast Fourier transform on the preprocessed voice to obtain the spectrum of each frame, and taking the squared magnitude of the spectrum to obtain the power spectrum of the voice signal;
s3.4: passing the power spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filters to obtain the energy of each frame in the frequency band of each triangular filter;
s3.5: taking the logarithm of the energy of each frame in each triangular filter's frequency band to obtain the log energy output by each filter;
s3.6: substituting the log energy into a discrete cosine transform to obtain the L-order Mel-frequency cepstral coefficients;
s3.7: passing the power spectrum obtained by the fast Fourier transform through a Gammatone filterbank, then applying exponential compression and a discrete cosine transform to obtain the GFCC features of the voice signal;
s3.8: concatenating the MFCC features and GFCC features of the voice signal to obtain the acoustic features of the voice signal.
5. The speaker recognition method as claimed in claim 1, wherein the step S4 comprises:
performing voice enhancement on the large amount of original voice data, extracting acoustic features from the enhanced data as training data, and inputting the training data into the speaker recognition model for training to obtain the trained model.
6. The speaker recognition method according to claim 1, wherein the registration data includes h voice samples for each speaker, and in step S5, recognizing the identity of the speaker to be recognized according to the similarity between the feature of the speaker to be recognized and the registered speaker features comprises:
after voice enhancement and feature extraction are carried out on each voice sample in the registration data, the depth feature of each voice sample is extracted from the obtained acoustic feature through a convolutional neural network of a speaker recognition model;
averaging the h depth features of each speaker to serve as the speaker feature of each speaker, and storing the speaker feature in a database;
after voice data of a speaker to be recognized is subjected to voice enhancement and feature extraction, inputting a trained model to obtain the feature of the speaker to be recognized;
calculating the cosine similarity between the feature of the speaker to be recognized and all speaker features stored in the database; if the maximum cosine similarity is greater than a set threshold, the speaker in the database corresponding to that cosine similarity is the recognized identity; otherwise, the speaker is rejected.
7. A system for speaker recognition based on speech enhancement, comprising:
the voice acquisition module is used for acquiring a large amount of original voice data;
the voice enhancement module is used for removing interference noise and irrelevant speaker voice contained in the original voice data to obtain enhanced voice data;
the voice feature extraction module is used for extracting MFCC features and Gammatone-filter cepstral coefficient (GFCC) features from the enhanced voice data and fusing the MFCC features and the GFCC features to obtain the acoustic features of the voice;
the model training module is used for constructing a speaker recognition model based on a convolutional neural network, taking acoustic features extracted from a large amount of original voice data as training data, and training the speaker recognition model to obtain a trained model;
the speaker recognition module is used for collecting the registered voice samples, performing voice enhancement and feature extraction by adopting the methods of the voice enhancement module and the voice feature extraction module, inputting the trained model to obtain the depth feature of each registered voice sample, taking the depth feature as the speaker feature of each speaker and storing the speaker feature; obtaining the voice data of the speaker to be recognized, inputting the trained model to obtain the characteristics of the speaker to be recognized after performing voice enhancement and characteristic extraction by adopting the methods of a voice enhancement module and a voice characteristic extraction module, and recognizing the identity of the speaker to be recognized according to the similarity between the characteristics of the speaker to be recognized and the stored characteristics of the speaker.
CN202111140239.5A 2021-09-28 2021-09-28 Speaker recognition method and system based on voice enhancement Active CN113823293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111140239.5A CN113823293B (en) 2021-09-28 2021-09-28 Speaker recognition method and system based on voice enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111140239.5A CN113823293B (en) 2021-09-28 2021-09-28 Speaker recognition method and system based on voice enhancement

Publications (2)

Publication Number Publication Date
CN113823293A 2021-12-21
CN113823293B 2024-04-26

Family

ID=78921390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111140239.5A Active CN113823293B (en) 2021-09-28 2021-09-28 Speaker recognition method and system based on voice enhancement

Country Status (1)

Country Link
CN (1) CN113823293B (en)



Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102568472A (en) * 2010-12-15 2012-07-11 盛乐信息技术(上海)有限公司 Voice synthesis system with speaker selection and realization method thereof
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
CN104835498A (en) * 2015-05-25 2015-08-12 重庆大学 Voiceprint identification method based on multi-type combination characteristic parameters
CA3179080A1 (en) * 2016-09-19 2018-03-22 Pindrop Security, Inc. Channel-compensated low-level features for speaker recognition
CN107464568A (en) * 2017-09-25 2017-12-12 四川长虹电器股份有限公司 Based on the unrelated method for distinguishing speek person of Three dimensional convolution neutral net text and system
CN110299142A (en) * 2018-05-14 2019-10-01 桂林远望智能通信科技有限公司 A kind of method for recognizing sound-groove and device based on the network integration
US20190043529A1 (en) * 2018-06-06 2019-02-07 Intel Corporation Speech classification of audio for wake on voice
CN109147810A (en) * 2018-09-30 2019-01-04 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network
CN109410974A (en) * 2018-10-23 2019-03-01 百度在线网络技术(北京)有限公司 Sound enhancement method, device, equipment and storage medium
CN109326302A (en) * 2018-11-14 2019-02-12 桂林电子科技大学 A kind of sound enhancement method comparing and generate confrontation network based on vocal print
CN109524020A (en) * 2018-11-20 2019-03-26 上海海事大学 A kind of speech enhan-cement processing method
CN109712628A (en) * 2019-03-15 2019-05-03 哈尔滨理工大学 A kind of voice de-noising method and audio recognition method based on RNN
CN110428849A (en) * 2019-07-30 2019-11-08 珠海亿智电子科技有限公司 A kind of sound enhancement method based on generation confrontation network
KR20210036692A (en) * 2019-09-26 2021-04-05 국방과학연구소 Method and apparatus for robust speech enhancement training using adversarial training
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion
CN112820301A (en) * 2021-03-15 2021-05-18 中国科学院声学研究所 Unsupervised cross-domain voiceprint recognition method fusing distribution alignment and counterstudy

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
杨瑶; 陈晓: "Experimental design of speaker recognition based on neural networks", Research and Exploration in Laboratory, no. 09
毛维; 曾庆宁; 龙超: "Application of a dual microphone-array speech enhancement algorithm in speaker recognition", Technical Acoustics, no. 03, 15 June 2018
王萌; 王福龙: "MFCC speaker recognition based on endpoint detection and Gaussian filter banks", Computer Systems & Applications, no. 10, 15 October 2016
蓝天; 彭川; 李森; 叶文政; 李萌; 惠国强; 吕忆蓝; 钱宇欣; 刘峤: "A survey of monaural speech denoising and dereverberation research", Journal of Computer Research and Development, no. 05
蔡倩 et al.: "A fast speaker recognition method based on convolutional neural networks", Radio Engineering, vol. 50, no. 6, p. 447

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024082928A1 (en) * 2022-10-21 2024-04-25 腾讯科技(深圳)有限公司 Voice processing method and apparatus, and device and medium
CN115631743A (en) * 2022-12-07 2023-01-20 中诚华隆计算机技术有限公司 High-precision voice recognition method and system based on voice chip
CN115631743B (en) * 2022-12-07 2023-03-21 中诚华隆计算机技术有限公司 High-precision voice recognition method and system based on voice chip
CN116434759A (en) * 2023-04-11 2023-07-14 兰州交通大学 Speaker identification method based on SRS-CL network
CN116434759B (en) * 2023-04-11 2024-03-01 兰州交通大学 Speaker identification method based on SRS-CL network

Also Published As

Publication number Publication date
CN113823293B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN108597496B (en) Voice generation method and device based on generation type countermeasure network
CN102163427B (en) Method for detecting audio exceptional event based on environmental model
CN102509547B Method and system for voiceprint recognition based on vector quantization
CN113823293B (en) Speaker recognition method and system based on voice enhancement
US20070129941A1 (en) Preprocessing system and method for reducing FRR in speaking recognition
CN102324232A (en) Method for recognizing sound-groove and system based on gauss hybrid models
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
CN111933148A (en) Age identification method and device based on convolutional neural network and terminal
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN115394318A (en) Audio detection method and device
CN111508504A (en) Speaker recognition method based on auditory center perception mechanism
KR100779242B1 (en) Speaker recognition methods of a speech recognition and speaker recognition integrated system
Neelima et al. Mimicry voice detection using convolutional neural networks
CN111524520A (en) Voiceprint recognition method based on error reverse propagation neural network
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
CN110415707B (en) Speaker recognition method based on voice feature fusion and GMM
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium
Sukor et al. Speaker identification system using MFCC procedure and noise reduction method
CN114038469B (en) Speaker identification method based on multi-class spectrogram characteristic attention fusion network
Khetri et al. Automatic speech recognition for marathi isolated words
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium
CN106971725B (en) Voiceprint recognition method and system with priority
Shofiyah et al. Voice recognition system for home security keys with mel-frequency cepstral coefficient method and backpropagation artificial neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant