CN113393847A - Voiceprint recognition method based on fusion of Fbank features and MFCC features - Google Patents


Publication number
CN113393847A
Authority
CN
China
Prior art keywords
fbank
mfcc
feature
fusion
voiceprint recognition
Prior art date
Legal status
Granted
Application number
CN202110586134.6A
Other languages
Chinese (zh)
Other versions
CN113393847B (en)
Inventor
周后盘
赵将焜
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110586134.6A
Publication of CN113393847A
Application granted
Publication of CN113393847B
Status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a voiceprint recognition method based on fusion of Fbank features and MFCC features. A speech data set is preprocessed, Fbank features and MFCC features are extracted, and feature fusion is performed on the resulting 40-dimensional Fbank features and 12-dimensional MFCC features. Tested on a generalized end-to-end model, the proposed fusion method outperforms single Fbank or MFCC features, while reducing feature dimensionality, redundancy, storage space, and training complexity.

Description

Voiceprint recognition method based on fusion of Fbank features and MFCC features
Technical Field
The invention relates to the fields of speech signal processing and artificial intelligence, and in particular to a voiceprint recognition method based on fusion of Fbank features and MFCC features.
Background
Voiceprint recognition, also known as speaker recognition, is a technique that extracts features representing a speaker's identity from a speech signal and recognizes the speaker's identity based on those features. Like fingerprint recognition and face recognition, voiceprint recognition is a biometric application with important fields of use, and it offers the advantages of convenient, contactless collection and low cost. It can be applied in finance, smart locks, wake-up by a specific speaker, and other fields. As its range of application expands, the demands placed on voiceprint recognition keep rising, so improving its performance is of great significance.
The voiceprint recognition process is generally divided into three modules: feature extraction, model construction, and scoring/decision. In the feature extraction module, commonly used voiceprint features include MFCC, Fbank, LPC, and PLP. Most current methods train on a single class of features; the only common feature fusion method is to select two different features and concatenate them directly.
Disclosure of Invention
Aiming at the excessive dimensionality and redundancy caused by directly concatenating heterogeneous features in the prior art, the invention provides a voiceprint recognition method based on fusion of Fbank features and MFCC features.
The voiceprint recognition method based on fusion of Fbank features and MFCC features disclosed by the invention specifically comprises the following steps:
step one, preparing a speech data set and preprocessing the data set;
step two, extracting the Fbank features:
applying a fast Fourier transform to the preprocessed sequence of speech frames, squaring the magnitude to obtain the power spectrum, passing it through a Mel filter bank, and taking the logarithm to obtain the Fbank features;
step three, extracting the MFCC features:
performing a discrete cosine transform on the Fbank features to obtain the MFCC features;
step four, feature fusion:
fusing the obtained 40-dimensional Fbank features with the 12-dimensional MFCC features.
Preferably, the Mel filter bank contains 40 filters.
Preferably, the MFCC features are obtained by performing a discrete cosine transform on the Fbank features, specifically: the 1st to 12th groups of coefficients are extracted and the DCT is performed to obtain the 12-dimensional MFCC features.
Preferably, the feature fusion of the 40-dimensional Fbank features and the 12-dimensional MFCC features is performed as follows: the MFCC features of groups 1-12 are embedded into groups 1-12 of the 40-dimensional Fbank features.
Preferably, preprocessing the data set specifically comprises pre-emphasis, framing, and windowing, finally outputting a sequence of speech frames.
Preferably, the framing adopts a 25 ms frame length and a 10 ms frame shift.
Preferably, a Hamming window is used for windowing.
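For illustration, steps two and three above can be sketched in NumPy as follows. This is a minimal sketch under common defaults: the triangular Mel filterbank construction, the FFT size of 256, and the unnormalized DCT-II basis are standard textbook choices that the patent text does not spell out.

```python
import numpy as np

def fbank_and_mfcc(frames, sample_rate=8000, n_fft=256, n_mels=40, n_mfcc=12):
    """Compute 40-dim Fbank and 12-dim MFCC features from windowed frames.

    `frames` is a (num_frames, frame_len) array of preprocessed speech frames.
    """
    # Step two: fast Fourier transform and squared magnitude -> power spectrum
    mag = np.abs(np.fft.rfft(frames, n_fft, axis=1))
    power = (mag ** 2) / n_fft

    # Triangular Mel filter bank with 40 filters (textbook construction)
    high_mel = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    mel_pts = np.linspace(0.0, high_mel, n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Filter bank energies plus logarithm -> 40-dim Fbank features
    fbank = np.log(power @ fb.T + 1e-10)

    # Step three: DCT of the log energies, keeping coefficients 1-12 -> MFCC
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2.0 * n_mels)))
    mfcc = fbank @ basis.T
    return fbank, mfcc
```

With 25 ms frames at the 8 kHz sampling rate used later in the description, each frame has 200 samples and is zero-padded to the 256-point FFT.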
Compared with the prior art, the invention has the following beneficial effects: tested on a generalized end-to-end model, the proposed feature fusion method outperforms single Fbank or MFCC features, while reducing feature dimensionality, redundancy, storage space, and training complexity.
Drawings
FIG. 1 is a flow chart of the Fbank and MFCC feature extraction according to the present invention;
FIG. 2 is a schematic diagram of the feature fusion method proposed by the present invention.
Detailed Description
The following detailed description of the embodiments of the present invention is provided with reference to the accompanying drawings.
Fig. 1 depicts the flow of extracting the Fbank features and the MFCC features in voiceprint recognition. As shown in Fig. 1, the extraction process comprises preprocessing, fast Fourier transform, and squaring the magnitude to obtain the power spectrum; passing through the Mel filter bank and taking the logarithm yields the Fbank features (a in Fig. 1), and a further discrete cosine transform (DCT) yields the MFCC features (b in Fig. 1).
The preprocessing comprises pre-emphasis, framing, and windowing. The specific details are as follows: the sampling rate is 8 kHz, the frame length is 25 ms, the frame shift is 10 ms, and a Hamming window is adopted.
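A minimal NumPy sketch of this preprocessing follows; the pre-emphasis coefficient of 0.97 is a common default assumed here, as the patent does not state one.

```python
import numpy as np

def preprocess(signal, sample_rate=8000, frame_ms=25, shift_ms=10, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a 1-D speech signal."""
    # Pre-emphasis: boost high frequencies (coefficient 0.97 is an assumption)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing: 25 ms frames with a 10 ms shift
    frame_len = int(sample_rate * frame_ms / 1000)    # 200 samples at 8 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 80 samples at 8 kHz
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(num_frames)[:, None])
    frames = emphasized[idx]

    # Windowing: apply a Hamming window to every frame
    return frames * np.hamming(frame_len)
```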
Fig. 2 depicts the specific process of the feature fusion method of the invention. The speech signal is preprocessed and fast Fourier transformed, the magnitude is squared to obtain the power spectrum, and the power spectrum is passed through the Mel filter bank. The Mel filter bank is a bank of filters fitted to the receiving characteristics of the human ear; with 40 filters, the 40-dimensional Fbank features are obtained. The 1st-12th dimensions of the Fbank features are taken and DCT transformed to obtain the 12-dimensional MFCC features, which are then embedded into positions 1-12 of the Fbank features to obtain the fused features.
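The fusion step itself reduces to an embedding, sketched below: the fused feature stays 40-dimensional, instead of the 52 dimensions that direct concatenation of the two features would produce.

```python
import numpy as np

def fuse(fbank, mfcc):
    """Embed the 12-dim MFCC features into dimensions 1-12 of the
    40-dim Fbank features (the fusion of Fig. 2)."""
    assert fbank.shape[1] == 40 and mfcc.shape[1] == 12
    fused = fbank.copy()
    fused[:, :12] = mfcc  # overwrite the first 12 Fbank dims with the MFCC
    return fused
```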
The proposed feature fusion method was evaluated on LSTM and BiLSTM network models trained with the generalized end-to-end loss. Compared with single MFCC or Fbank features, it is verified to improve the performance of voiceprint recognition, which is beneficial to its application.
The experiments compare single MFCC features, single Fbank features, and the fused features on both the Bi-LSTM and LSTM models; the results show that the proposed feature fusion method effectively improves speaker recognition performance. Table 1 shows the results on the Bi-LSTM model, and Table 2 shows the results on the LSTM model.
Table 1: results on the Bi-LSTM model (provided as an image in the original publication).
Table 2: results on the LSTM model (provided as an image in the original publication).
This experiment compares single MFCC features, single Fbank features, and the feature fusion method proposed by the invention. The results on the Bi-LSTM and LSTM models prove that the proposed method improves speaker recognition performance. The Equal Error Rate (EER) is adopted as the evaluation criterion; the lower the EER, the better the result.
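The EER criterion can be sketched as a threshold sweep that finds the operating point where the false-acceptance rate equals the false-rejection rate; this is a generic implementation for illustration, not code from the patent.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Equal Error Rate: sweep decision thresholds and return the error
    rate at the point where false acceptance and false rejection are
    (closest to) equal. `labels` marks target trials as true."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        far = np.mean(accept[~labels])   # false acceptance rate
        frr = np.mean(~accept[labels])   # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

A perfectly separating system has an EER of 0; random scoring approaches 0.5.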

Claims (7)

1. A voiceprint recognition method based on fusion of Fbank features and MFCC features, characterized by comprising the following steps:
step one, preparing a speech data set and preprocessing the data set;
step two, extracting the Fbank features:
applying a fast Fourier transform to the preprocessed sequence of speech frames, squaring the magnitude to obtain the power spectrum, passing it through a Mel filter bank, and taking the logarithm to obtain the Fbank features;
step three, extracting the MFCC features:
performing a discrete cosine transform on the Fbank features to obtain the MFCC features;
step four, feature fusion:
fusing the obtained 40-dimensional Fbank features with the 12-dimensional MFCC features.
2. The voiceprint recognition method based on fusion of Fbank features and MFCC features according to claim 1, wherein the Mel filter bank contains 40 filters.
3. The voiceprint recognition method based on fusion of Fbank features and MFCC features according to claim 1, wherein the MFCC features are obtained by performing a discrete cosine transform on the Fbank features, specifically: the 1st to 12th groups of coefficients are extracted and the DCT is performed to obtain the 12-dimensional MFCC features.
4. The voiceprint recognition method based on fusion of Fbank features and MFCC features according to claim 1, wherein the feature fusion of the 40-dimensional Fbank features and the 12-dimensional MFCC features specifically comprises: the MFCC features of groups 1-12 are embedded into groups 1-12 of the 40-dimensional Fbank features.
5. The voiceprint recognition method based on fusion of Fbank features and MFCC features according to claim 1, wherein preprocessing the data set specifically comprises pre-emphasis, framing, and windowing, finally outputting a sequence of speech frames.
6. The voiceprint recognition method based on fusion of Fbank features and MFCC features according to claim 5, wherein the framing adopts a 25 ms frame length and a 10 ms frame shift.
7. The voiceprint recognition method based on fusion of Fbank features and MFCC features according to claim 5, wherein a Hamming window is used for windowing.
CN202110586134.6A 2021-05-27 2021-05-27 Voiceprint recognition method based on fusion of Fbank features and MFCC features Active CN113393847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110586134.6A CN113393847B (en) 2021-05-27 2021-05-27 Voiceprint recognition method based on fusion of Fbank features and MFCC features


Publications (2)

Publication Number Publication Date
CN113393847A true CN113393847A (en) 2021-09-14
CN113393847B CN113393847B (en) 2022-11-15

Family

ID=77619314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110586134.6A Active CN113393847B (en) 2021-05-27 2021-05-27 Voiceprint recognition method based on fusion of Fbank features and MFCC features

Country Status (1)

Country Link
CN (1) CN113393847B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782034A (en) * 2021-09-27 2021-12-10 镁佳(北京)科技有限公司 Audio identification method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869644A (en) * 2016-05-25 2016-08-17 百度在线网络技术(北京)有限公司 Deep learning based voiceprint authentication method and device
JP2017037222A (en) * 2015-08-11 2017-02-16 日本電信電話株式会社 Feature amount vector calculation device, voice recognition device, feature amount spectrum calculation method, and feature amount vector calculation program
CN108305641A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN108922556A (en) * 2018-07-16 2018-11-30 百度在线网络技术(北京)有限公司 sound processing method, device and equipment
CN111724899A (en) * 2020-06-28 2020-09-29 湘潭大学 Parkinson audio intelligent detection method and system based on Fbank and MFCC fusion characteristics
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion
CN111863003A (en) * 2020-07-24 2020-10-30 苏州思必驰信息科技有限公司 Voice data enhancement method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Zheng et al.: "Optimization method of feature extraction in speaker recognition systems", Journal of Xiamen University (Natural Science Edition) *


Also Published As

Publication number Publication date
CN113393847B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
McLaren et al. Advances in deep neural network approaches to speaker recognition
CN109036382B (en) Audio feature extraction method based on KL divergence
CN102968990B (en) Speaker identifying method and system
CN103345923B (en) A kind of phrase sound method for distinguishing speek person based on rarefaction representation
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
CN110931022B (en) Voiceprint recognition method based on high-low frequency dynamic and static characteristics
CN109215665A (en) A kind of method for recognizing sound-groove based on 3D convolutional neural networks
CN109119072A (en) Civil aviaton's land sky call acoustic model construction method based on DNN-HMM
CN108281146A (en) A kind of phrase sound method for distinguishing speek person and device
CN103794207A (en) Dual-mode voice identity recognition method
CN1877697A (en) Method for identifying speaker based on distributed structure
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
CN113393847B (en) Voiceprint recognition method based on fusion of Fbank features and MFCC features
CN109448702A (en) Artificial cochlea's auditory scene recognition methods
CN104778948A (en) Noise-resistant voice recognition method based on warped cepstrum feature
CN106782503A (en) Automatic speech recognition method based on physiologic information in phonation
CN115101076B (en) Speaker clustering method based on multi-scale channel separation convolution feature extraction
CN105845143A (en) Speaker confirmation method and speaker confirmation system based on support vector machine
CN112017658A (en) Operation control system based on intelligent human-computer interaction
CN104464738A (en) Vocal print recognition method oriented to smart mobile device
CN115472168B (en) Short-time voice voiceprint recognition method, system and equipment for coupling BGCC and PWPE features
CN110197657A (en) A kind of dynamic speech feature extracting method based on cosine similarity
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN109003613A (en) The Application on Voiceprint Recognition payment information method for anti-counterfeit of combining space information
CN113628639A (en) Voice emotion recognition method based on multi-head attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant