CN112542174A - VAD-based multi-dimensional characteristic parameter voiceprint identification method - Google Patents

VAD-based multi-dimensional characteristic parameter voiceprint identification method

Info

Publication number
CN112542174A
CN112542174A
Authority
CN
China
Prior art keywords
characteristic parameters
frame
signal
frequency
mfcc
Prior art date
Legal status
Pending
Application number
CN202011557161.2A
Other languages
Chinese (zh)
Inventor
邓立新
孙明铭
濮勇
徐艳君
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202011557161.2A
Publication of CN112542174A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a VAD-based multi-dimensional characteristic parameter voiceprint identification method. In step S1, the input speech signal is read, pre-emphasized, framed, and windowed, converting it into a preprocessed speech signal. In step S2, endpoint detection accurately locates the start and end frames of the framed preprocessed signal and removes the silent segments. In step S3, the MFCC characteristic parameters, normalized MFCC characteristic parameters, GFCC characteristic parameters, and PNCC characteristic parameters of the endpoint-detected speech signal are extracted and combined to form multi-dimensional characteristic parameters. The method improves the accuracy of endpoint detection, reduces the amount of training data in the template-training phase, strengthens resistance to noise interference, and effectively improves the efficiency of voiceprint recognition.

Description

VAD-based multi-dimensional characteristic parameter voiceprint identification method
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a multi-dimensional characteristic parameter voiceprint recognition method based on VAD.
Background
Voiceprint recognition, also known as speaker recognition, is one of the biometric identification techniques. It divides into two categories: speaker identification and speaker verification. Its theoretical basis is that every voice has unique characteristics, by which the voices of different speakers can be effectively distinguished.
The main flow of voiceprint recognition is to read the speech files of the speakers in the training set and extract discriminative feature information from the read speech data through specific filters. Common feature extraction methods include Mel-frequency cepstral coefficients (MFCC), Gammatone-filter cepstral coefficients (GFCC), power-normalized cepstral coefficients (PNCC), identity vectors (i-vectors), and the like. Template training is then performed with methods such as Gaussian mixture models (GMM), dynamic time warping (DTW), or artificial-neural-network template matching. Finally, features extracted from the audio data in the test set are matched against the trained templates, achieving voiceprint recognition.
In recent years, in order to improve the accuracy of voiceprint recognition, the following two types of feature extraction methods are mainly used in the field of speaker recognition.
(1) Training with a single feature extraction method chosen according to the audio type and the signal-to-noise ratio. For example, with mainstream MFCC feature extraction, endpoint detection sometimes fails to locate the start and end points of the speech accurately when the frame length and frame shift change, and the standard cepstral parameters (MFCC) reflect only the static characteristics of the speech parameters and perform poorly in terms of robustness.
(2) Computing a difference spectrum from the extracted static features by a difference method to represent the dynamic characteristics of the speech parameters, and then combining the dynamic and static features. This method can effectively improve the recognition performance of the system, but its recognition accuracy still drops sharply when the speech information contains strong noise interference.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a multi-dimensional characteristic parameter voiceprint recognition method based on VAD that improves the accuracy of endpoint detection, reduces the amount of training data in the template-training stage, strengthens resistance to noise interference, and effectively improves the efficiency of voiceprint recognition.
The invention provides a VAD-based multi-dimensional characteristic parameter voiceprint recognition method, comprising the following steps:
step S1, reading, pre-emphasizing, framing, and windowing the input speech signal, converting it into a preprocessed speech signal;
step S2, accurately detecting the start and end frames of the framed preprocessed signal through endpoint detection and removing the silent segments;
step S3, extracting the MFCC characteristic parameters, normalized MFCC characteristic parameters, GFCC characteristic parameters, and PNCC characteristic parameters of the endpoint-detected speech signal, and combining them to form the multi-dimensional characteristic parameters.
As a further technical solution of the present invention, in step S1 the speech signal reading reads the wav-format audio files in the training set, using the read() function of the scipy.io.wavfile module in Python to obtain a one-dimensional array representing the audio information.
Further, in step S1, the pre-emphasis boosts the high-frequency components with the filter

$$H(z) = 1 - u z^{-1}$$

wherein u is the pre-emphasis coefficient, with a value range of 0.9 to 1.
Further, in step S1, the framing and windowing proceed on the premise that the parameter model of a speech signal is approximately stationary over 10 ms to 30 ms, so that 1 second contains 33 to 100 frames; adjacent frames overlap, the overlap being the frame shift, and the ratio of frame shift to frame length is 1/3 to 1/2. Finally, each frame is multiplied by a Hamming window, whose expression is

$$w(n) = (1-a) - a\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

where a is the Hamming window coefficient.
Further, the endpoint detection in step S2 adopts the spectral entropy method; entropy characterizes the degree of order of a signal, and the spectral entropy method detects speech endpoints by measuring the flatness of the spectrum. Let the speech signal be $x(i)$; after windowing and framing, the $n$-th frame $x_n(m)$ is obtained, whose FFT is

$$X_n(k) = \sum_{m=0}^{N-1} x_n(m)\, e^{-j 2\pi k m / N}$$

where k indexes the k-th spectral line. The short-time energy of the speech frame in the frequency domain is

$$E_n = \sum_{k=0}^{N/2} X_n(k)\, X_n^{*}(k)$$

wherein N is the FFT length and only the positive-frequency part is taken. The energy spectrum of the k-th spectral line is $Y_n(k) = |X_n(k)|^2$, and the normalized spectral probability density function of each frequency component is

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)}$$

The short-time spectral entropy of the frame is then

$$H_n = -\sum_{k=0}^{N/2} p_n(k) \ln p_n(k)$$

The spectral entropy of each frame is computed in this way; the audio information from step S1 is trimmed by the spectral entropy method, and the regions rich in speech information are retained.
Further, in step S2, the endpoint detection adds a result verification mechanism: when the spectral entropy method fails, the framed speech signal is screened through an energy valve to remove the silent segments, the valve value being a threshold computed from E, the array of per-frame energies of the speech signal.
Furthermore, in step S3, the MFCC characteristic parameters are extracted as follows:
an FFT is applied to each frame of the preprocessed speech signal to obtain the spectrum of each frame, and the power spectrum of the speech signal is obtained by taking the squared modulus of the spectrum;
the energy spectrum is passed through a Mel triangular filter bank whose center frequencies are uniformly spaced on the Mel scale; the bank contains 22 to 26 filters, with the base vertices of each filter at the center frequencies of the adjacent filters, and the approximate relation between Mel frequency and frequency is

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

wherein f is the frequency;
the logarithmic energy output by each Mel triangular filter is computed, and the MFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
Further, in step S3, the GFCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
the logarithmic energy output by each filter is computed, and the GFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
Further, in step S3, the PNCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
each frame is smoothed by averaging over the 2 frames on its left and right to obtain the average power;
the average power is normalized, and the PNCC characteristic parameters are obtained through an exponential (power-law) function that better matches the auditory characteristics of the human ear, followed by the discrete cosine transform (DCT).
Further, in step S3, the multi-dimensional characteristic parameters formed by the combination have 52 dimensions.
The invention has the advantages that speech frames are selected through endpoint detection with an added energy valve, which eliminates the interference of silent frames and noisy frames on the feature extraction results and strengthens the reliability of the endpoint detection results; and that combining the normalized MFCC characteristic parameters, the MFCC characteristic parameters, the Gammatone-filter (GFCC) characteristic parameters, and the power-normalized cepstral (PNCC) characteristic parameters into a multi-dimensional feature coefficient markedly improves recognition accuracy.
Drawings
FIG. 1 is a flow chart of voiceprint recognition in accordance with the present invention;
FIG. 2 is a schematic diagram illustrating the effect of the endpoint detection method of the present invention;
FIG. 3 is a diagram of a multi-dimensional feature set according to the present invention.
Detailed Description
Referring to fig. 1, the present embodiment provides a VAD-based multi-dimensional characteristic parameter voiceprint recognition method. Building on conventional characteristic parameter extraction and training over a speech library, it improves the feature extraction stage and mainly comprises three parts: speech signal preprocessing, endpoint detection, and characteristic parameter extraction. The method includes the following steps:
step S1, reading, pre-emphasizing, framing and windowing the input voice signal, and converting the voice signal into a voice preprocessing signal;
step S2, accurately detecting the start and stop frames of the framed voice preprocessing signal through end point detection, and removing the mute section;
and S3, extracting MFCC characteristic parameters, MFCC standardized characteristic parameters, GFCC characteristic parameters and PNCC characteristic parameters of the voice signal after the endpoint detection, and combining the MFCC characteristic parameters, the MFCC standardized characteristic parameters and the PNCC characteristic parameters to form multi-dimensional characteristic parameters.
In step S1, the speech signal reading reads the wav-format audio files in the training set, using the read() function of the scipy.io.wavfile module in Python to obtain a one-dimensional array representing the audio information.
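As an illustrative sketch of this step (the patent names only the scipy.io wavfile routine; the file name below is a placeholder):

    from scipy.io import wavfile

    # Read one wav file from the training set: returns the sampling rate and,
    # for mono audio, a one-dimensional array of samples.
    fs, signal = wavfile.read("sample.wav")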
In step S1, because the frequency response of the glottal pulse approximates a second-order low-pass filter and the oral-cavity radiation response approximates a low-order high-pass filter, the spectrum of a speech signal rolls off toward high frequencies, and the high-frequency components are usually much smaller than the low-frequency components. To increase the high-frequency resolution of the speech signal and highlight the formants of the high-frequency part, the speech signal is pre-emphasized. Pre-emphasis compensates for the loss of high-frequency components by passing the input speech signal through a high-pass filter, namely

$$H(z) = 1 - u z^{-1}$$

wherein u is the pre-emphasis coefficient, with a value range of 0.9 to 1, typically 0.97.
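A minimal Python sketch of this filter, applied sample by sample as y(n) = x(n) - u*x(n-1):

    import numpy as np

    def pre_emphasis(x, u=0.97):
        # y(0) = x(0); y(n) = x(n) - u * x(n-1) for n >= 1
        return np.append(x[0], x[1:] - u * x[:-1])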
In step S1, the framing and windowing proceed on the premise that the parameter model of a speech signal is approximately stationary over 10 ms to 30 ms, so that 1 second contains 33 to 100 frames; adjacent frames overlap, the overlap being the frame shift, and the ratio of frame shift to frame length is 1/3 to 1/2. Finally, each frame is multiplied by a Hamming window, whose expression is

$$w(n) = (1-a) - a\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

where a is the Hamming window coefficient, taken as 0.46.
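A sketch of framing and windowing under these constraints; the 25 ms frame length and 10 ms shift in the usage comment are illustrative choices consistent with the stated 10-30 ms and 1/3-1/2 ranges, not values fixed by the patent:

    import numpy as np

    def frame_and_window(x, frame_len, frame_shift, a=0.46):
        # Cut the signal into overlapping frames (assumes len(x) >= frame_len).
        n_frames = 1 + (len(x) - frame_len) // frame_shift
        frames = np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                           for i in range(n_frames)])
        # Hamming window w(n) = (1 - a) - a*cos(2*pi*n/(N-1)) with a = 0.46.
        window = (1 - a) - a * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
        return frames * window

    # e.g. at fs = 16 kHz: frames = frame_and_window(signal, 400, 160)  # 25 ms / 10 ms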
In step S2, the endpoint detection adopts the spectral entropy method; entropy characterizes the degree of order of a signal, and the spectral entropy method detects speech endpoints by measuring the flatness of the spectrum. Let the speech signal be $x(i)$; after windowing and framing, the $n$-th frame $x_n(m)$ is obtained, whose FFT is

$$X_n(k) = \sum_{m=0}^{N-1} x_n(m)\, e^{-j 2\pi k m / N}$$

where k indexes the k-th spectral line. The short-time energy of the speech frame in the frequency domain is

$$E_n = \sum_{k=0}^{N/2} X_n(k)\, X_n^{*}(k)$$

wherein N is the FFT length and only the positive-frequency part is taken. The energy spectrum of the k-th spectral line is $Y_n(k) = |X_n(k)|^2$, and the normalized spectral probability density function of each frequency component is

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)}$$

The short-time spectral entropy of the frame is then

$$H_n = -\sum_{k=0}^{N/2} p_n(k) \ln p_n(k)$$

The spectral entropy of each frame is computed in this way; the audio information from step S1 is trimmed by the spectral entropy method, and the regions rich in speech information are retained, as shown in fig. 2.
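A sketch of the per-frame spectral entropy computation described above (the decision threshold separating speech from silence is not given numerically in the text and is left out here):

    import numpy as np

    def spectral_entropy(frames, nfft=512):
        # Energy spectrum Y_n(k) over the positive frequencies of each frame.
        Y = np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
        # Normalized spectral probability density p_n(k).
        p = Y / (Y.sum(axis=1, keepdims=True) + 1e-10)
        # Short-time spectral entropy H_n = -sum_k p_n(k) * ln p_n(k).
        return -(p * np.log(p + 1e-10)).sum(axis=1)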
In step S2, the endpoint detection adds a result verification mechanism: when the spectral entropy method fails, the framed speech signal is screened through an energy valve to remove the silent segments, the valve value being a threshold computed from E, the array of per-frame energies of the speech signal.
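A sketch of the fallback energy valve; because the valve formula survives only as an image in the source, the threshold rule below (a fixed fraction of the frame-energy range) is an assumption, not the patented formula:

    import numpy as np

    def energy_valve(frames, ratio=0.25):
        # E: array of per-frame energies of the speech signal.
        E = np.sum(frames.astype(np.float64) ** 2, axis=1)
        # Assumed threshold rule -- NOT the formula from the patent image.
        threshold = E.min() + ratio * (E.max() - E.min())
        return frames[E > threshold]  # keep frames above the valve, drop silence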
In step S3, the MFCC characteristic parameters are extracted as follows:
an FFT is applied to each frame of the preprocessed speech signal to obtain the spectrum of each frame, and the power spectrum of the speech signal is obtained by taking the squared modulus of the spectrum;
the energy spectrum is passed through a Mel triangular filter bank whose center frequencies are uniformly spaced on the Mel scale; the bank contains 22 to 26 filters, with the base vertices of each filter at the center frequencies of the adjacent filters, and the approximate relation between Mel frequency and frequency is

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

wherein f is the frequency;
the logarithmic energy output by each Mel triangular filter is computed, and the MFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions;
the normalized MFCC characteristic parameters are obtained by standardizing the MFCC characteristic parameters column by column: the mean of each attribute (column) is subtracted and the result is divided by its standard deviation, so that for each attribute all data cluster around 0 with unit variance, yielding normalized features with the same dimension as the MFCC characteristic parameters.
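A sketch of the MFCC and normalized-MFCC step using the third-party python_speech_features and scikit-learn packages; the patent does not name these libraries, so they stand in here for any equivalent 13-dimensional MFCC routine and per-column standardization:

    from python_speech_features import mfcc
    from sklearn.preprocessing import scale

    # 13 cepstral coefficients per frame through a 26-filter Mel bank.
    mfcc_feat = mfcc(signal, samplerate=fs, numcep=13, nfilt=26)
    # Column-wise standardization: subtract the mean and divide by the standard
    # deviation, so each attribute clusters around 0 with unit variance.
    mfcc_norm = scale(mfcc_feat)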
In step S3, the GFCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
the logarithmic energy output by each filter is computed, and the GFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
In step S3, the PNCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
each frame is smoothed by averaging over the 2 frames on its left and right to obtain the average power;
the average power is normalized, and the PNCC characteristic parameters are obtained through an exponential (power-law) function that better matches the auditory characteristics of the human ear, followed by the discrete cosine transform (DCT).
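A simplified sketch of these PNCC-specific steps; the power-law exponent 1/15 is the value commonly used for PNCC and is assumed here, since the patent only calls it an exponential function matching human hearing:

    import numpy as np
    from scipy.fftpack import dct

    def pncc_like(energies, numcep=13):
        energies = np.asarray(energies, dtype=np.float64)  # (frames x 20) Gammatone outputs
        M = energies.shape[0]
        smoothed = np.empty_like(energies)
        for m in range(M):
            lo, hi = max(0, m - 2), min(M, m + 3)          # 2 frames on each side
            smoothed[m] = energies[lo:hi].mean(axis=0)     # average power
        normalized = smoothed / (smoothed.mean() + 1e-10)  # normalize the average power
        nonlinear = normalized ** (1.0 / 15.0)             # assumed power-law exponent
        return dct(nonlinear, type=2, axis=1, norm='ortho')[:, :numcep]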
As shown in fig. 3, in step S3 the MFCC characteristic parameters, the normalized MFCC characteristic parameters, the GFCC characteristic parameters, and the PNCC characteristic parameters are combined to form the multi-dimensional characteristic parameters. Each selected characteristic parameter set has 13 dimensions, so the combined multi-dimensional characteristic parameters have 52 dimensions.
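A sketch of the final combination, assuming the four 13-dimensional parameter sets (mfcc_feat, mfcc_norm, and the gfcc_feat and pncc_feat produced by the sketches above) were computed over the same frames:

    import numpy as np

    # Frame-wise concatenation of the four 13-dimensional feature sets.
    multi_feat = np.hstack([mfcc_feat, mfcc_norm, gfcc_feat, pncc_feat])
    assert multi_feat.shape[1] == 52  # 4 x 13 = 52 dimensions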
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is intended to be protected by the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims (10)

1. A multi-dimensional characteristic parameter voiceprint recognition method based on VAD, characterized by comprising the following steps:
step S1, reading, pre-emphasizing, framing, and windowing the input speech signal, converting it into a preprocessed speech signal;
step S2, accurately detecting the start and end frames of the framed preprocessed signal through endpoint detection and removing the silent segments;
step S3, extracting the MFCC characteristic parameters, normalized MFCC characteristic parameters, GFCC characteristic parameters, and PNCC characteristic parameters of the endpoint-detected speech signal, and combining them to form the multi-dimensional characteristic parameters.
2. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S1 the speech signal reading reads the wav-format audio files in the training set, using the read() function of the scipy.io.wavfile module in Python to obtain a one-dimensional array representing the audio information; and wherein in step S1 the pre-emphasis boosts the high-frequency components with the filter

$$H(z) = 1 - u z^{-1}$$

wherein u is the pre-emphasis coefficient, with a value range of 0.9 to 1.
3. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S1 the framing and windowing proceed on the premise that the parameter model of a speech signal is approximately stationary over 10 ms to 30 ms, so that 1 second contains 33 to 100 frames; adjacent frames overlap, the overlap being the frame shift, and the ratio of frame shift to frame length is 1/3 to 1/2; finally, each frame is multiplied by a Hamming window, whose expression is

$$w(n) = (1-a) - a\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

wherein a is the Hamming window coefficient.
4. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein the endpoint detection in step S2 adopts the spectral entropy method; entropy characterizes the degree of order of a signal, and spectral-entropy endpoint detection detects speech endpoints by measuring the flatness of the spectrum; the speech signal is $x(i)$, and after windowing and framing the $n$-th frame $x_n(m)$ is obtained, whose FFT is

$$X_n(k) = \sum_{m=0}^{N-1} x_n(m)\, e^{-j 2\pi k m / N}$$

where k indexes the k-th spectral line; the short-time energy of the speech frame in the frequency domain is

$$E_n = \sum_{k=0}^{N/2} X_n(k)\, X_n^{*}(k)$$

wherein N is the FFT length and only the positive-frequency part is taken; the energy spectrum of the k-th spectral line is $Y_n(k) = |X_n(k)|^2$; the normalized spectral probability density function of each frequency component is

$$p_n(k) = \frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)}$$

and the short-time spectral entropy of the frame is

$$H_n = -\sum_{k=0}^{N/2} p_n(k) \ln p_n(k)$$

the spectral entropy of each frame is computed in this way, the audio information from step S1 is trimmed by the spectral entropy method, and the regions rich in speech information are retained.
5. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S2 the endpoint detection adds a result verification mechanism: when the spectral entropy method fails, the framed speech signal is screened through an energy valve to remove the silent segments, the valve value being a threshold computed from E, the array of per-frame energies of the speech signal.
6. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S3 the MFCC characteristic parameters are extracted as follows:
an FFT is applied to each frame of the preprocessed speech signal to obtain the spectrum of each frame, and the power spectrum of the speech signal is obtained by taking the squared modulus of the spectrum;
the energy spectrum is passed through a Mel triangular filter bank whose center frequencies are uniformly spaced on the Mel scale; the bank contains 22 to 26 filters, with the base vertices of each filter at the center frequencies of the adjacent filters, and the approximate relation between Mel frequency and frequency is

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

wherein f is the frequency;
the logarithmic energy output by each Mel triangular filter is computed, and the MFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
7. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S3 the GFCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling;
the logarithmic energy output by each filter is computed, and the GFCC feature vector is obtained through the discrete cosine transform (DCT), returning the default 13 cepstral dimensions.
8. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein in step S3 the PNCC characteristic parameters are extracted as follows:
the Fourier transform of each signal frame is computed, and the absolute value of the output spectrum is taken;
the energy spectrum is passed through a Gammatone filter bank containing 20 filters, whose output response is an N × M matrix, wherein N is the number of filter channels and M is the number of frames after sampling.
9. The method according to claim 8, wherein each frame is smoothed by averaging over the 2 frames on its left and right to obtain the average power; the average power is normalized, and the PNCC characteristic parameters are obtained through an exponential function and the discrete cosine transform.
10. The multi-dimensional characteristic parameter voiceprint recognition method based on VAD according to claim 1, wherein the multi-dimensional characteristic parameters formed by the combination in step S3 have 52 dimensions.
CN202011557161.2A 2020-12-25 2020-12-25 VAD-based multi-dimensional characteristic parameter voiceprint identification method Pending CN112542174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011557161.2A CN112542174A (en) 2020-12-25 2020-12-25 VAD-based multi-dimensional characteristic parameter voiceprint identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011557161.2A CN112542174A (en) 2020-12-25 2020-12-25 VAD-based multi-dimensional characteristic parameter voiceprint identification method

Publications (1)

Publication Number Publication Date
CN112542174A true CN112542174A (en) 2021-03-23

Family

ID=75017464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011557161.2A Pending CN112542174A (en) 2020-12-25 2020-12-25 VAD-based multi-dimensional characteristic parameter voiceprint identification method

Country Status (1)

Country Link
CN (1) CN112542174A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104835498A (en) * 2015-05-25 2015-08-12 重庆大学 Voiceprint identification method based on multi-type combination characteristic parameters
CN108922541A (en) * 2018-05-25 2018-11-30 南京邮电大学 Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
CN110349598A (en) * 2019-07-15 2019-10-18 桂林电子科技大学 A kind of end-point detecting method under low signal-to-noise ratio environment
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王萌 (Wang Meng); 王福龙 (Wang Fulong): "MFCC speaker recognition based on endpoint detection and Gaussian filter banks", 计算机系统应用 (Computer Systems & Applications), no. 10 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113179442A (en) * 2021-04-20 2021-07-27 浙江工业大学 Voice recognition-based audio stream replacement method in video
CN113179442B (en) * 2021-04-20 2022-04-29 浙江工业大学 Voice recognition-based audio stream replacement method in video
CN113205803A (en) * 2021-04-22 2021-08-03 上海顺久电子科技有限公司 Voice recognition method and device with adaptive noise reduction capability
CN113205803B (en) * 2021-04-22 2024-05-03 上海顺久电子科技有限公司 Voice recognition method and device with self-adaptive noise reduction capability
CN114038469A (en) * 2021-08-03 2022-02-11 成都理工大学 Speaker identification method based on multi-class spectrogram feature attention fusion network
CN114038469B (en) * 2021-08-03 2023-06-20 成都理工大学 Speaker identification method based on multi-class spectrogram characteristic attention fusion network
CN113823290A (en) * 2021-08-31 2021-12-21 杭州电子科技大学 Multi-feature fusion voiceprint recognition method
CN115273863A (en) * 2022-06-13 2022-11-01 广东职业技术学院 Compound network class attendance system and method based on voice recognition and face recognition

Similar Documents

Publication Publication Date Title
CN112542174A (en) VAD-based multi-dimensional characteristic parameter voiceprint identification method
CN107610715B (en) Similarity calculation method based on multiple sound characteristics
CN106935248B (en) Voice similarity detection method and device
JP4802135B2 (en) Speaker authentication registration and confirmation method and apparatus
CN102543073B (en) Shanghai dialect phonetic recognition information processing method
CN108986824B (en) Playback voice detection method
CN102968990B (en) Speaker identifying method and system
CN110931022B (en) Voiceprint recognition method based on high-low frequency dynamic and static characteristics
CN109256127B (en) Robust voice feature extraction method based on nonlinear power transformation Gamma chirp filter
CN108198545B (en) Speech recognition method based on wavelet transformation
CN110299141B (en) Acoustic feature extraction method for detecting playback attack of sound record in voiceprint recognition
CN103646649A (en) High-efficiency voice detecting method
Vyas A Gaussian mixture model based speech recognition system using Matlab
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
CN112233657A (en) Speech enhancement method based on low-frequency syllable recognition
Nijhawan et al. A new design approach for speaker recognition using MFCC and VAD
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
Tazi A robust speaker identification system based on the combination of GFCC and MFCC methods
Kumar et al. Text dependent speaker identification in noisy environment
Wang et al. Robust Text-independent Speaker Identification in a Time-varying Noisy Environment.
Shu-Guang et al. Isolated word recognition in reverberant environments
CN108962249B (en) Voice matching method based on MFCC voice characteristics and storage medium
Bonifaco et al. Comparative analysis of filipino-based rhinolalia aperta speech using mel frequency cepstral analysis and Perceptual Linear Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination