CN112053694A - Voiceprint recognition method based on CNN and GRU network fusion - Google Patents
Voiceprint recognition method based on CNN and GRU network fusion
- Publication number
- CN112053694A (application CN202010719665.3A)
- Authority
- CN
- China
- Prior art keywords
- voice
- spectrogram
- cnn
- gru network
- voiceprint recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L17/02 — Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04 — Speaker identification or verification: training, enrolment or model building
- G10L17/14 — Speaker identification or verification: use of phonemic categorisation or speech recognition prior to speaker recognition or verification
- G10L21/0216 — Speech enhancement: noise filtering characterised by the method used for estimating noise
- G10L25/12 — Speech or voice analysis: extracted parameters being prediction coefficients
- G10L25/18 — Speech or voice analysis: extracted parameters being spectral information of each sub-band
- G10L25/27 — Speech or voice analysis: characterised by the analysis technique
- G10L25/30 — Speech or voice analysis: analysis technique using neural networks
- G10L25/45 — Speech or voice analysis: characterised by the type of analysis window
Abstract
The invention discloses a voiceprint recognition method based on the fusion of a CNN and a GRU network (C-GRU for short), which comprises the following steps: preprocessing the voice signal sample to be recognized and then enhancing it with an adaptive filtering algorithm; generating a spectrogram of the voice segment to be recognized; inputting the spectrogram into the trained C-GRU network model to extract voiceprint features; and feeding the extracted features into a softmax function to obtain identity classification information for the voice signal segment to be recognized. The C-GRU feature extraction method avoids information loss in the frequency domain and, by exploiting the GRU network's strong temporal feature extraction capability, achieves higher recognition accuracy and faster convergence.
Description
Technical Field
The invention relates to a voiceprint recognition method, in particular to a voiceprint recognition method based on CNN and GRU network fusion (C-GRU for short).
Background Art
In recent years, biometric identification technology has become a highly reliable and convenient way of authenticating identity, attracting attention from both inside and outside the industry. Voice is one of the everyday ways people communicate. It has been scientifically established that each person's vocal organs differ, and that different growth environments influence those organs differently, so each person's voice is unique. Recognizing identity by voice has clear advantages: voice is very easy to collect, and the equipment required is cheap and readily available, so the approach has great potential, can be used for remote identity authentication, is easily accepted by users, and so on.
In terms of content, voiceprint recognition techniques can be divided into two directions: text-dependent and text-independent. In a text-dependent method, the speaker must speak a fixed script, and the text content of the training speech must be the same as that of the test speech. Although such a method can be trained to perform well, its biggest disadvantage is that the speaker must pronounce the fixed text exactly; once the spoken content deviates from the text or is not pronounced as required, recognition performance is difficult to guarantee, which greatly limits the method's adoption in practical applications.
Traditional voiceprint recognition usually adopts the universal background model (GMM-UBM): a speaker-independent model is first trained on the speech of a large number of speakers; the MAP adaptation algorithm is then used to effectively mitigate the problems of scarce speech data and multi-channel mismatch that afflict the conventional Gaussian mixture model; finally, the recognition model is trained with a maximum a posteriori or maximum likelihood regression criterion. However, this model consumes a significant amount of storage when modeling each speaker. Neural networks are a basic topic of current deep learning research, and as deep learning gradually penetrates more fields, voiceprint recognition research has likewise turned to deep learning. Conventional deep learning methods for voiceprint recognition mainly use the convolutional neural network (CNN) and the long short-term memory network (LSTM). A CNN-based voiceprint recognition system ignores the original sequential structure of speech when extracting voiceprint features, while the LSTM network model, although it does account for the speech feature sequence, is extremely difficult to train because of the enormous computation demanded by its three gates.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a voiceprint recognition method based on the fusion of a CNN and a GRU network. It exploits the CNN's ability to extract features autonomously to avoid the frequency-domain information loss caused by traditional speaker voice feature extraction, and at the same time exploits the GRU network's strong temporal feature extraction, achieving high-accuracy voiceprint recognition through the fusion of the two networks.
The invention is realized by adopting the following technical scheme:
a voiceprint recognition method based on CNN and GRU network fusion comprises the following steps:
step 1, acquiring a voice segment to be recognized;
step 2, preprocessing the original voice signal to generate a spectrogram of the voice segment to be recognized;
step 3, inputting the spectrogram into a time-sequence-aware combined neural network voiceprint recognition model to obtain identity classification information of the voice segment to be recognized;
the training method of the fused CNN and GRU network voiceprint recognition model specifically comprises the following steps:
step 201, acquiring a training set of voice signals and a test set of the voice signals;
step 202, performing voice signal preprocessing by methods such as pre-emphasis, framing, windowing, and endpoint detection;
step 203, improving the signal-to-noise ratio of the voice signal with an improved RLS algorithm;
step 204, converting each voice segment of the voice segment training set and the voice signal testing set through operations such as discrete Fourier transform and the like to obtain a spectrogram training set and a spectrogram testing set;
step 205, inputting a training set of the spectrogram into a CNN and GRU network to be trained, and training the CNN and GRU network to be trained;
step 206, inputting the testing set of the spectrogram into the trained CNN and GRU network, finishing the training of the CNN and GRU network if the output testing result meets the preset condition, or returning to the step 205 to perform the training again until the testing result meets the preset condition;
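Steps 201-206 describe a train-then-test loop that repeats until a preset condition is met. A minimal sketch of that control flow follows; the names `model_step`, `evaluate`, and `target_acc` are illustrative assumptions, not terms from the patent:

```python
def train_until(model_step, evaluate, max_rounds=20, target_acc=0.95):
    """Steps 205-206: train on the spectrogram training set, then test;
    repeat until the test result meets the preset condition."""
    for round_no in range(1, max_rounds + 1):
        model_step()                  # step 205: one training pass
        if evaluate() >= target_acc:  # step 206: check the preset condition
            return round_no           # training finished
    return max_rounds                 # stop after max_rounds regardless

# toy usage: simulated test accuracies that improve each round
acc = iter([0.50, 0.80, 0.96])
rounds = train_until(lambda: None, lambda: next(acc))
```

With the simulated accuracies above, the preset condition (0.95) is first met on the third round.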
the spectrogram generating process comprises the following steps:
step 301, pre-emphasizing the voice segment based on the first-order digital filter H(z) = 1 − αz^{-1}, where α is the filter coefficient;
step 302, performing framing processing on the pre-emphasized voice segment while maintaining smooth transitions and continuity between frames;
step 303, performing a Fourier transform on each frame of the signal based on the formula X(n,k) = Σ_{m=0}^{M−1} x_m(n)e^{−j2πmk/M}, where M is the number of sampling points per frame and the sequence formed by the M sampling points of the nth frame of voice is x_0(n), x_1(n), …, x_{M−1}(n);
step 304, calculating the energy spectral density of each frame of the signal based on the formula E(n,k) = |X(n,k)|² = X_R(n,k)² + X_I(n,k)², where X(n,k) is the complex sequence obtained after the M-point FFT of the nth frame of voice;
step 305, taking the logarithm of the energy spectral density of step 304 to obtain the spectrogram grey levels Q(a,b);
step 306, normalizing the spectrogram based on the formula Q′(a,b) = (Q(a,b) − Q_min(a,b)) / (Q_max(a,b) − Q_min(a,b)) to obtain the normalized spectrogram, where Q_max(a,b) and Q_min(a,b) are respectively the maximum and minimum values of the spectrogram grey scale;
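The spectrogram pipeline above can be sketched end-to-end in NumPy. This is a minimal illustration, not the patent's implementation: the frame length, hop size, and Hamming window are assumptions, and a small constant is added before the logarithm for numerical safety.

```python
import numpy as np

def spectrogram(x, alpha=0.97, frame_len=256, hop=128):
    # Step 301: pre-emphasis H(z) = 1 - alpha * z^-1
    x = np.append(x[0], x[1:] - alpha * x[:-1])
    # Step 302: overlapping frames for smooth inter-frame transitions
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)               # windowing
    # Step 303: M-point Fourier transform of each frame
    X = np.fft.rfft(frames, axis=1)
    # Steps 304-305: energy spectral density E = |X|^2, then log compression
    E = np.log(np.abs(X) ** 2 + 1e-10)
    # Step 306: min-max normalisation of the grey levels to [0, 1]
    return (E - E.min()) / (E.max() - E.min())

sg = spectrogram(np.random.randn(4000))           # toy 4000-sample signal
```

For a 4000-sample input with these assumed parameters, the result is a 30-frame by 129-bin image with values in [0, 1].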
the improved RLS algorithm comprises:
step 401, obtaining the prediction error based on the formula e(n) = d(n) − x^T(n)ω(n−1), so as to perform enhancement processing on the speech signal;
step 403, improving the forgetting factor so that the algorithm has a faster tracking speed and a smaller steady-state error;
step 404, updating the filter coefficients based on the formula ω(n) = ω(n−1) + k(n)e(n);
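The RLS steps above can be sketched on a toy system-identification problem. The gain-vector computation (elided in the excerpt) is the standard RLS step, and the forgetting-factor update uses the variable form λ(n) = λ_min + (1 − λ_min)·2^{L(n)}, L(n) = −round(μe²(n)) quoted in the description; the parameter values and the test signal are illustrative assumptions.

```python
import numpy as np

def vff_rls(d, x, order=4, lam_min=0.9, mu=50.0, delta=100.0):
    """RLS adaptive filter with a variable forgetting factor
    lam(n) = lam_min + (1 - lam_min) * 2**L(n),  L(n) = -round(mu * e(n)**2)."""
    w = np.zeros(order)                            # filter coefficients omega
    P = delta * np.eye(order)                      # inverse correlation matrix
    for n in range(order - 1, len(x)):
        u = x[n - order + 1:n + 1][::-1]           # regressor [x(n), ..., x(n-order+1)]
        e = d[n] - u @ w                           # step 401: prediction error
        lam = lam_min + (1 - lam_min) * 2.0 ** (-round(mu * e * e))  # step 403
        k = P @ u / (lam + u @ P @ u)              # gain vector (standard RLS step)
        w = w + k * e                              # step 404: coefficient update
        P = (P - np.outer(k, u @ P)) / lam
    return w

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
h_true = np.array([0.5, -0.3, 0.2, 0.1])           # unknown FIR system to identify
d = np.convolve(x, h_true)[:len(x)]                # noiseless desired signal
w_est = vff_rls(d, x)
```

On this noiseless toy problem the estimated coefficients converge to the true system, which is the behavior the variable forgetting factor is meant to preserve while tracking changes quickly.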
In summary, the present invention discloses a voiceprint recognition method based on the fusion of a CNN and a GRU network, comprising the following steps. Step 1: obtain the voice segment to be recognized. Step 2: preprocess the original voice signal and generate a spectrogram of the voice segment to be recognized. Step 3: input the spectrogram into a time-sequence-aware combined neural network voiceprint recognition model to obtain identity classification information for the voice segment. The feature extraction method based on the fusion of the CNN and the GRU network avoids the frequency-domain information loss caused by traditional speaker voice feature extraction and, by exploiting the GRU network's strong temporal feature extraction, achieves high-accuracy voiceprint recognition.
The technical scheme provided by the implementation of the invention has the beneficial effects that at least:
The voiceprint recognition method based on the fusion of the CNN and the GRU network overcomes the insufficient feature extraction of a single neural network and the accuracy limits a single network faces on complex problems, while also effectively improving training efficiency.
Drawings
FIG. 1 is a flow chart of a voiceprint recognition method based on the fusion of a CNN and a GRU network implemented by the present invention;
FIG. 2 is a schematic diagram of a spectrogram implemented in the present invention;
FIG. 3 is a comparison chart of the improved RLS algorithm implemented in the present invention;
fig. 4 is a schematic structural diagram of CNN and GRU network convergence implemented in the present invention;
Detailed Description
The present disclosure is described in further detail below with reference to the attached drawing figures.
As shown in fig. 1, the present invention discloses a voiceprint recognition method based on the fusion of a CNN and a GRU network, comprising the following steps. First, a speech segment to be recognized is acquired (step 1). Then, the original voice signal is preprocessed to generate a spectrogram of the segment (step 2). Finally, the spectrogram is input into a time-sequence-aware combined neural network voiceprint recognition model to obtain identity classification information for the segment (step 3).
The resulting spectrogram is shown schematically in fig. 2, and the specific generation method is as follows. First, the speech signal is preprocessed to obtain frame-by-frame short-time speech (step 301). A Fourier transform X(n,k) = Σ_{m=0}^{M−1} x_m(n)e^{−j2πmk/M} is applied to each frame, where M is the number of sampling points per frame and the M sampling points of the nth frame form the sequence x_0(n), x_1(n), …, x_{M−1}(n) (step 302). The energy spectral density of each frame is then computed as E(n,k) = |X(n,k)|² = X_R(n,k)² + X_I(n,k)², where X(n,k) is the complex sequence obtained from the M-point FFT of the nth frame (step 303). The logarithm of the energy spectral density of step 303 is taken (step 304). Finally, the spectrogram is normalized by Q′(a,b) = (Q(a,b) − Q_min(a,b)) / (Q_max(a,b) − Q_min(a,b)), where Q_max(a,b) and Q_min(a,b) are respectively the maximum and minimum spectrogram grey-scale values (step 305).
A comparison of the improved RLS algorithm is shown in fig. 3. The cost function of the exponentially weighted RLS algorithm with forgetting factor λ is J_n = Σ_{i=1}^{n} λ^{n−i} e²(i). The smaller λ is, the stronger the tracking of time-varying parameters, but also the more sensitive the algorithm is to noise and the larger the steady-state error; the larger λ is, the weaker the tracking, but the less sensitive the algorithm is to noise. An RLS algorithm based on a variable forgetting factor retains strong tracking capability while keeping the parameter estimation error small; the algorithm is as follows:
e(n) = d(n) − x^T(n)ω(n−1)
ω(n) = ω(n−1) + k(n)e(n)
λ(n) = λ_min + (1 − λ_min)·2^{L(n)}
L(n) = −round[μe²(n)]
In summary, when the error becomes smaller, λ(n) approaches 1 and the parameter error is reduced; conversely, when the squared error becomes larger, λ(n) falls to the minimum value λ_min.
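The limiting behavior of the variable forgetting factor can be checked numerically; λ_min and μ below are illustrative values, not ones prescribed by the patent:

```python
lam_min, mu = 0.9, 50.0

def lam(e):
    # lam(n) = lam_min + (1 - lam_min) * 2**L(n), with L(n) = -round(mu * e**2)
    return lam_min + (1 - lam_min) * 2.0 ** (-round(mu * e * e))

small = lam(0.0)   # error -> 0: forgetting factor approaches 1 (small steady-state error)
large = lam(2.0)   # large squared error: forgetting factor falls to lam_min (fast tracking)
```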
Based on the above understanding, the following formula is proposed as the correction function, with the parameter a introduced to control the function shape; the specific function is as follows:
In the formula, the constants m and n control the value range of the function, and the constants a and b control the convergence speed and improve the shape of the function. When e_t is large, λ(n) approaches n; when e_t = 0, λ(n) = m + n, i.e. n < λ(n) < m + n. Experiments show that 0.8 < λ(n) < 1 is preferred, so m and n finally take the values m = 0.2 and n = 0.8.
The structure of the resulting C-GRU network is shown schematically in fig. 4, and the training process is as follows. First, a training set and a test set of speech signals are obtained (step 201). The speech signals are preprocessed by methods such as pre-emphasis, framing, windowing, and endpoint detection (step 202), and enhanced by the improved RLS algorithm (step 203). Each voice segment of the training set and the test set is then converted, through operations such as the discrete Fourier transform, into a spectrogram training set and a spectrogram test set (step 204). The spectrogram training set is input into the CNN and GRU network to be trained, and the network is trained (step 205). Finally, the spectrogram test set is input into the trained CNN and GRU network; if the output test result meets the preset condition, training is complete, otherwise the process returns to step 205 until the test result satisfies the preset condition (step 206).
The following scheme can be adopted during training. The jth spectrogram of the ith speaker is assigned the label i − 1. The spectrograms are fed into the network for training with an input dimension of 128 × 128, corresponding to the length and height of the spectrogram. The data then enter the C-GRU network, in which the CNN uses 100 convolution kernels of size 5 × 5 and a pooling layer of size 2 × 2; the most salient features in the feature map are extracted, and the network model uses max pooling. To avoid losing timing information in the spectrogram, the signals are pooled only along the frequency axis. To prevent overfitting, Dropout is added inside the GRU units: connections between neurons within a GRU unit and between different GRU units are temporarily dropped at a fixed ratio, which can be set to 0.2. Finally, classification and identification are performed by a softmax classifier.
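The data path described above (spectrogram → convolution → frequency-only pooling → GRU over time steps → softmax) can be sketched with toy sizes in NumPy. All weights are random and the dimensions are scaled down from the patent's 128 × 128 input and 100 kernels; this only illustrates how the tensors flow, not a trained model, and the particular GRU gating convention used is one common variant.

```python
import numpy as np

rng = np.random.default_rng(1)
n_speakers, M = 5, 16                         # toy sizes (patent: 128 x 128 input)

def conv2d_valid(img, kernels):
    """'Valid' 2-D convolution with a bank of kernels, followed by ReLU."""
    kh, kw = kernels.shape[1:]
    H, W = img.shape
    out = np.empty((kernels.shape[0], H - kh + 1, W - kw + 1))
    for c, k in enumerate(kernels):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return np.maximum(out, 0.0)

def pool_freq(x, p=2):
    """Max-pool along the frequency axis only, keeping every time step."""
    C, F, T = x.shape
    return x[:, :F - F % p].reshape(C, F // p, p, T).max(axis=2)

def gru_last(seq, Wz, Wr, Wh):
    """Minimal GRU; returns the hidden state after the last time step."""
    h = np.zeros(Wz.shape[0])
    for x in seq:
        xh = np.concatenate([x, h])
        z = 1.0 / (1.0 + np.exp(-Wz @ xh))            # update gate
        r = 1.0 / (1.0 + np.exp(-Wr @ xh))            # reset gate
        h_cand = np.tanh(Wh @ np.concatenate([x, r * h]))
        h = (1.0 - z) * h + z * h_cand
    return h

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

img = rng.standard_normal((M, M))                     # toy "spectrogram" (freq x time)
feat = pool_freq(conv2d_valid(img, 0.1 * rng.standard_normal((3, 5, 5))))
seq = feat.transpose(2, 0, 1).reshape(feat.shape[2], -1)  # one feature vector per time step
hdim, idim = 8, seq.shape[1]
weights = [0.1 * rng.standard_normal((hdim, idim + hdim)) for _ in range(3)]
h_last = gru_last(seq, *weights)
probs = softmax(0.1 * rng.standard_normal((n_speakers, hdim)) @ h_last)
```

Because pooling is applied only on the frequency axis, the number of GRU time steps equals the time dimension of the convolution output, which is the property the patent relies on to preserve timing information.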
The present disclosure uses the TIMIT speech database, whose corpus has a 16 kHz sampling rate and 16-bit depth and contains 10 sentences from each of 630 speakers. In the tests, 80% of the data set is used as training samples and the remaining 20% as the test set. The network model was iterated 20 times; the results are shown in table 1:
The accuracy of the C-GRU network is superior to that of the other two network structures; a single feature has a large influence on voiceprint recognition and cannot meet practical requirements.
The foregoing is one of the exemplary embodiments of the present disclosure; those skilled in the art may make various modifications without departing from the spirit and scope of the present invention.
Claims (4)
1. A voiceprint recognition method based on CNN and GRU network fusion is characterized by comprising the following steps:
step 1, acquiring a voice segment to be recognized;
step 2, preprocessing an original voice signal to generate a spectrogram of a voice fragment to be recognized;
and 3, inputting the spectrogram into a combined neural network voiceprint recognition model related to the time sequence to obtain identity classification information of the voice fragment to be recognized.
2. The method for voiceprint recognition based on CNN and GRU network convergence according to claim 1, wherein the training method for the CNN and GRU network voiceprint recognition model comprises the following steps:
step 201, acquiring a training set of voice signals and a test set of the voice signals;
step 202, performing voice signal preprocessing by methods such as pre-emphasis, framing, windowing, and endpoint detection;
step 203, improving the signal-to-noise ratio of the voice signal with an improved RLS algorithm;
step 204, converting each voice segment of the voice segment training set and the voice signal testing set through operations such as discrete Fourier transform and the like to obtain a spectrogram training set and a spectrogram testing set;
step 205, inputting a training set of the spectrogram into a CNN and GRU network to be trained, and training the CNN and GRU network to be trained;
and step 206, inputting the testing set of the spectrogram into the trained CNN and GRU network, finishing the training of the CNN and GRU network if the output testing result meets the preset condition, or returning to the step 205 to perform the training again until the testing result meets the preset condition.
3. The method of claim 2, wherein the improved RLS algorithm comprises:
step 301, obtaining the prediction error based on the formula e(n) = d(n) − x^T(n)ω(n−1), so as to perform enhancement processing on the speech signal;
step 303, based on the formulaThe forgetting factor is improved to have a faster tracking speed and a smaller steady-state error;
step 304, updating the filter coefficients based on the formula ω(n) = ω(n−1) + k(n)e(n).
4. The method for voiceprint recognition based on CNN and GRU network convergence according to claim 1, wherein the generation process of the spectrogram comprises:
step 401, pre-emphasizing the voice segment based on the first-order digital filter H(z) = 1 − αz^{-1}, where α is the filter coefficient;
step 402, performing framing processing on the pre-emphasized voice segment, and maintaining smooth transition and continuity between frames;
step 403, performing a Fourier transform on each frame of the signal based on the formula X(n,k) = Σ_{m=0}^{M−1} x_m(n)e^{−j2πmk/M}, where M is the number of sampling points per frame and the sequence formed by the M sampling points of the nth frame of voice is x_0(n), x_1(n), …, x_{M−1}(n);
step 404, calculating the energy spectral density of each frame of the signal based on the formula E(n,k) = |X(n,k)|² = X_R(n,k)² + X_I(n,k)², where X(n,k) is the complex sequence obtained after the M-point FFT of the nth frame of voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010719665.3A CN112053694A (en) | 2020-07-23 | 2020-07-23 | Voiceprint recognition method based on CNN and GRU network fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010719665.3A CN112053694A (en) | 2020-07-23 | 2020-07-23 | Voiceprint recognition method based on CNN and GRU network fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112053694A true CN112053694A (en) | 2020-12-08 |
Family
ID=73602385
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010719665.3A Pending CN112053694A (en) | 2020-07-23 | 2020-07-23 | Voiceprint recognition method based on CNN and GRU network fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112053694A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107276562A (en) * | 2017-06-16 | 2017-10-20 | 国网重庆市电力公司潼南区供电分公司 | One kind is based on improvement adaptive equalization hybrid algorithm RLS LMS transformer noise-eliminating methods |
CN108985167A (en) * | 2018-06-14 | 2018-12-11 | 兰州交通大学 | The gyro denoising method of improved RLS adaptive-filtering |
CN109524014A (en) * | 2018-11-29 | 2019-03-26 | 辽宁工业大学 | A kind of Application on Voiceprint Recognition analysis method based on depth convolutional neural networks |
CN109523993A (en) * | 2018-11-02 | 2019-03-26 | 成都三零凯天通信实业有限公司 | A kind of voice languages classification method merging deep neural network with GRU based on CNN |
CN110299142A (en) * | 2018-05-14 | 2019-10-01 | 桂林远望智能通信科技有限公司 | A kind of method for recognizing sound-groove and device based on the network integration |
CN110634491A (en) * | 2019-10-23 | 2019-12-31 | 大连东软信息学院 | Series connection feature extraction system and method for general voice task in voice signal |
- 2020-07-23: application CN202010719665.3A filed; patent CN112053694A/en, status Pending
Non-Patent Citations (1)
Title |
---|
Chang Tieyuan et al., "Research on an improved RLS algorithm with fast tracking capability", Computer Engineering and Applications * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112613481A (en) * | 2021-01-04 | 2021-04-06 | 上海明略人工智能(集团)有限公司 | Bearing abrasion early warning method and system based on frequency spectrum |
CN113129897A (en) * | 2021-04-08 | 2021-07-16 | 杭州电子科技大学 | Voiceprint recognition method based on attention-mechanism recurrent neural network |
CN113129897B (en) * | 2021-04-08 | 2024-02-20 | 杭州电子科技大学 | Voiceprint recognition method based on attention-mechanism recurrent neural network |
CN113314144A (en) * | 2021-05-19 | 2021-08-27 | 中国南方电网有限责任公司超高压输电公司广州局 | Voice recognition and power equipment fault early warning method, system, terminal and medium |
CN113409795A (en) * | 2021-08-19 | 2021-09-17 | 北京世纪好未来教育科技有限公司 | Training method, voiceprint recognition method and device and electronic equipment |
CN113823291A (en) * | 2021-09-07 | 2021-12-21 | 广西电网有限责任公司贺州供电局 | Voiceprint recognition method and system applied to power operation |
WO2023070874A1 (en) * | 2021-10-28 | 2023-05-04 | 中国科学院深圳先进技术研究院 | Voiceprint recognition method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
CN112053694A (en) | Voiceprint recognition method based on CNN and GRU network fusion | |
CN110400579B (en) | Speech emotion recognition based on a directional self-attention mechanism and a bidirectional long short-term network | |
CN109524014A (en) | A voiceprint recognition analysis method based on deep convolutional neural networks | |
CN110459225B (en) | Speaker recognition system based on CNN fusion characteristics | |
CN105096955B (en) | A fast speaker identification method and system based on model-growth clustering | |
CN111161744B (en) | Speaker clustering method that jointly optimizes deep representation learning and speaker identity estimation | |
CN111462729B (en) | Fast language identification method based on phoneme log-likelihood ratio and sparse representation | |
CN105206270A (en) | Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM) | |
CN113129897B (en) | Voiceprint recognition method based on attention-mechanism recurrent neural network | |
CN110299132B (en) | Voice digital recognition method and device | |
Todkar et al. | Speaker recognition techniques: A review | |
CN113763965B (en) | Speaker identification method with multiple attention feature fusion | |
CN112735435A (en) | Voiceprint open set identification method with unknown class internal division capability | |
CN108564956A (en) | A voiceprint recognition method and device, server and storage medium | |
Khdier et al. | Deep learning algorithms based voiceprint recognition system in noisy environment | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Koolagudi et al. | Speaker recognition in the case of emotional environment using transformation of speech features | |
CN111524520A (en) | Voiceprint recognition method based on error reverse propagation neural network | |
CN110415685A (en) | A speech recognition method | |
Singh et al. | A critical review on automatic speaker recognition | |
CN113516987B (en) | Speaker recognition method, speaker recognition device, storage medium and equipment | |
CN115064175A (en) | Speaker recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20201208 |