CN109524014A - Voiceprint recognition and analysis method based on a deep convolutional neural network - Google Patents
Voiceprint recognition and analysis method based on a deep convolutional neural network
- Publication number
- CN109524014A (application number CN201811439719.XA)
- Authority
- CN
- China
- Prior art keywords
- voice signal
- neural networks
- convolutional neural
- frame
- voiceprint recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Abstract
The present invention discloses a voiceprint recognition and analysis method based on a deep convolutional neural network, comprising: Step 1: collecting the voice signal of a known speaker, pre-processing the voice signal to generate a grayscale spectrogram, and extracting characteristic parameters from the grayscale spectrogram; Step 2: building a deep convolutional neural network on the characteristic parameters of the grayscale spectrogram and training it; Step 3: collecting a voice signal to be identified, obtaining the characteristic parameters of its grayscale spectrogram according to Step 1, and identifying the speaker of the voice signal using the trained convolutional neural network. The method extracts the characteristic parameters of the voice signal and, through training and recognition with the deep convolutional neural network, correctly identifies the speaker, achieving good results and effectively improving the accuracy and efficiency of voiceprint recognition.
Description
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a voiceprint recognition and analysis method based on a deep convolutional neural network.
Background art
With the advance of science and technology and the rapid development of artificial intelligence, voiceprint recognition has become increasingly important in numerous fields. For example, in finance, telephone speech authentication is used to verify a user's identity; in security, the voiceprint serves as authorization for entering and leaving sensitive facilities; in police and judicial work, the voiceprint is an effective auxiliary means of establishing a suspect's identity; in the military, personnel are identified by voiceprint; and in medicine, the voiceprint assists in diagnosing certain related diseases. Voiceprint signals are extremely convenient to collect and are present throughout people's daily lives, so research on high-performance voiceprint recognition systems has significant practical value. To improve the accuracy and efficiency of voiceprint recognition, it is therefore important to design a voiceprint recognition and analysis method based on machine learning.
Summary of the invention
The present invention designs and develops a voiceprint recognition and analysis method based on a deep convolutional neural network. It extracts the characteristic parameters of the voice signal and, through training and recognition with the deep convolutional neural network, correctly identifies the speaker, effectively improving the accuracy and efficiency of voiceprint recognition.
The technical solution provided by the invention is as follows:
A voiceprint recognition and analysis method based on a deep convolutional neural network, comprising the following steps:
Step 1: collect the voice signal of a known speaker, pre-process the voice signal to generate a grayscale spectrogram, and extract characteristic parameters from the grayscale spectrogram;
Step 2: build a deep convolutional neural network on the characteristic parameters of the grayscale spectrogram and train it. The network comprises 5 hidden layers — 3 convolutional layers and 2 down-sampling layers — with convolution and down-sampling alternating:
the first convolutional layer consists of 8 feature maps with 5 × 5 convolution kernels; the convolution is computed without zero-padding the edges;
the first down-sampling layer consists of 8 feature maps with a 2 × 2 kernel, realizing down-sampling and local averaging;
the second convolutional layer consists of 20 feature maps with 5 × 5 convolution kernels, each feature map consisting of 10 × 10 neurons;
the second down-sampling layer consists of 20 feature maps with a 2 × 2 kernel;
the third convolutional layer applies a 5 × 5 convolution kernel to each feature map and flattens the resulting feature maps into a vector;
Step 3: collect the voice signal to be identified, obtain the characteristic parameters of its grayscale spectrogram according to Step 1, and identify the speaker of the voice signal using the trained convolutional neural network.
Preferably, in Step 1, the pre-processing of the voice signal comprises sampling and quantization, pre-emphasis, framing and windowing, and endpoint detection.
Preferably, the sampling and quantization of the voice signal comprise: digitizing the voice signal at a sampling rate of 8 kHz, with each sample represented by 8 bits.
Preferably, the pre-emphasis of the voice signal comprises:
passing the digital voice signal obtained after sampling and quantization through a first-order high-pass filter to apply pre-emphasis, which highlights the high-frequency components; the transfer function of the first-order high-pass filter is:
H(z) = 1 − 0.9375z⁻¹
where z is the complex variable of the z-transform of the voice signal.
Preferably, the framing and windowing of the voice signal comprise:
splitting the continuous speech signal into frames of 10–30 ms;
applying a Hamming window to each frame, the window function being:
W(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
where W(n) is the Hamming window function applied to each frame of the voice signal and N is the number of samples per frame.
Preferably, the endpoint detection of the voice signal comprises:
removing the silent segments of the voice signal using the short-time energy and short-time zero-crossing rate methods.
Preferably, in Step 1, the generation of the grayscale spectrogram comprises:
decomposing each frame of the voice signal into an amplitude spectrum by the discrete Fourier transform:
X(n, k) = Σ_{p=0}^{M−1} x_p(n) e^{−j2πpk/M}, k = 0, 1, …, M − 1
where M is the number of samples per frame, X(n, k) is the sequence obtained from the Fourier transform of the n-th frame, k is the Fourier transform index, e is the base of the natural logarithm, and x_p(n) is the signal at the p-th sample of the n-th frame;
obtaining the energy density spectrum of the complex sequence of each frame after the Fourier transform:
E(n, k) = |X(n, k)|² = X_R(n, k)² + X_I(n, k)²
where E(n, k) is the energy density spectrum of the complex sequence obtained from the Fourier transform of the n-th frame, X_R(n, k) is its real part, and X_I(n, k) is its imaginary part;
taking the logarithm of the energy spectral density:
10 log₁₀ E(n, k) = 10 log₁₀ |X(n, k)|² = 20 log₁₀ |X(n, k)|;
mapping the log-domain energy spectral density to pixel values Q(n, m) between 0 and 255 to obtain the grayscale spectrogram:
Q(n, m) = round(255 × (T(n, m) − T_min(n)) / (T_max(n) − T_min(n)))
where T(n, m) is the m-th log-domain energy spectral density value of the n-th frame, T_max(n) is the maximum of the log-domain energy spectral density values of the n-th frame, and T_min(n) is their minimum.
Preferably, Mel-frequency cepstral coefficient (MFCC) parameters are used as the characteristic parameters of the grayscale spectrogram, and the extraction of the MFCC parameters comprises:
applying the discrete cosine transform to the log energy spectral density and discarding the DC component; the remaining coefficients are the MFCC parameters.
Preferably, Step 3 comprises:
initializing the decision values A₁, A₂, …, A_ω, …, A_S of the voice signals corresponding to the S speakers so that A₁ = A₂ = … = A_ω = … = A_S = 0;
obtaining the set of grayscale-spectrogram features of the voice signal to be identified according to Step 1 and inputting them one by one into the trained convolutional neural network; whenever a grayscale-spectrogram feature of the voice signal under test is recognized as belonging to the voice of the ω-th speaker, setting A_ω = A_ω + 1;
outputting the speaker whose voice signal corresponds to the decision value max(A₁, A₂, …, A_ω, …, A_S).
Preferably, the continuous speech signal is split into frames with a frame length of 10 ms.
The beneficial effects of the present invention are as follows:
The voiceprint recognition and analysis method based on a deep convolutional neural network extracts the characteristic parameters of the voice signal and, through training and recognition with the deep convolutional neural network, correctly identifies the speaker, achieving good results and effectively improving the accuracy and efficiency of voiceprint recognition.
Brief description of the drawings
Fig. 1 is a schematic diagram of the voiceprint recognition and analysis framework based on a deep convolutional neural network according to the present invention.
Fig. 2 is a flow chart of the complete voiceprint recognition algorithm based on a deep convolutional neural network according to the present invention.
Fig. 3 is an overall structural diagram of the recognition model according to the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it by referring to the text of the specification.
Referring to Fig. 1, the schematic diagram of the voiceprint recognition and analysis framework based on a deep convolutional neural network: the raw speech of the speaker is taken as input, and the speaker's voice information is pre-processed; the pre-processed voice information is transformed frame by frame with the Fourier transform, the energy spectral density of each frame is calculated, its logarithm is taken, and the log-domain energy spectral density is mapped to a grayscale spectrogram; features are extracted from the voiceprint features in the spectrogram; a deep convolutional neural network (CNN) is built and trained for classification on the characteristic parameters of the training samples; finally, the test samples are recognized by template matching, yielding the recognition and analysis result.
Referring to Fig. 2, the flow chart of the complete voiceprint recognition and analysis algorithm of the present invention.
The speech signal pre-processing proceeds as follows:
To compensate for the attenuation introduced during sound acquisition and its influence on the voice signal, the high-frequency components of the signal must be emphasized before processing; this also reduces the influence of noise, flattens the spectrum of the voice signal, and improves the signal-to-noise ratio. The digital voice signal obtained after sampling is passed through a first-order high-pass filter with transfer function
H(z) = 1 − 0.9375z⁻¹
to realize the pre-emphasis.
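In the time domain, the filter H(z) = 1 − 0.9375z⁻¹ corresponds to the difference equation y[t] = x[t] − 0.9375·x[t−1]. A minimal Python sketch of this step (the function name and the pass-through handling of the first sample are illustrative choices, not specified in the patent):

```python
def pre_emphasis(signal, alpha=0.9375):
    """First-order high-pass pre-emphasis: y[t] = x[t] - alpha * x[t-1],
    the time-domain form of H(z) = 1 - alpha * z^-1."""
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]
```

Applied to a constant (purely low-frequency) signal, the filter leaves only a small residue after the first sample, which is exactly the intended suppression of low frequencies.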
The system's speech sampling frequency is 8 kHz, and accordingly a frame length of 10 ms is used when extracting the Mel-frequency cepstral coefficient (MFCC) parameters. To reduce the prediction error at both ends of the signal and avoid spectral leakage, each frame is windowed with a Hamming window, whose function is
W(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
where W(n) is the Hamming window function applied to each frame of the voice signal and N is the number of samples per frame.
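The framing and windowing step above can be sketched as follows. At 8 kHz, a 10 ms frame holds 80 samples; non-overlapping frames are assumed here, since the patent does not state a frame shift:

```python
import math

def frame_and_window(signal, sample_rate=8000, frame_ms=10):
    """Split a signal into non-overlapping frames of frame_ms milliseconds
    and apply the Hamming window W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    n_samples = int(sample_rate * frame_ms / 1000)  # 80 samples at 8 kHz / 10 ms
    frames = []
    for start in range(0, len(signal) - n_samples + 1, n_samples):
        frame = signal[start:start + n_samples]
        frames.append([frame[n] * (0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1)))
                       for n in range(n_samples)])
    return frames
```

The window tapers each frame toward its ends (W(0) = W(N−1) = 0.08), which is what suppresses the leakage at the frame boundaries.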
Besides the effective speech segments, a speaker's voice usually also contains silent segments, whose presence reduces the accuracy and efficiency of voiceprint recognition. Endpoint detection is used to eliminate the silent segments; in this method they are removed by combining the short-time energy and the short-time zero-crossing rate.
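A frame can be treated as silence when both its short-time energy and its short-time zero-crossing rate are low. The sketch below illustrates the idea; the threshold values and the per-frame decision rule are illustrative assumptions, since the patent does not fix them:

```python
def short_time_energy(frame):
    """Short-time energy: sum of squared samples of the frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

def drop_silence(frames, energy_thr=0.01, zcr_thr=0.1):
    """Keep only frames that look like speech: either enough energy
    (voiced sounds) or enough zero crossings (unvoiced fricatives)."""
    return [f for f in frames
            if short_time_energy(f) >= energy_thr or zero_crossing_rate(f) >= zcr_thr]
```

Combining the two measures is what lets the detector keep low-energy but high-zero-crossing unvoiced consonants that an energy-only rule would discard.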
After sampling and quantization, pre-emphasis, framing and windowing, and endpoint detection, the voice signal is ready for spectrogram generation.
The spectrogram generation specifically comprises:
Each frame is decomposed into an amplitude spectrum by the discrete Fourier transform; for the n-th frame the Fourier transform is
X(n, k) = Σ_{p=0}^{M−1} x_p(n) e^{−j2πpk/M}, k = 0, 1, …, M − 1
where M is the number of samples per frame, X(n, k) is the sequence obtained from the Fourier transform of the n-th frame, k is the Fourier transform index, e is the base of the natural logarithm, and x_p(n) is the signal at the p-th sample of the n-th frame.
Next, the energy density spectrum of the complex sequence X(n, k), k = 0, 1, …, M − 1, obtained from the M-point Fourier transform of each frame, is calculated as
E(n, k) = |X(n, k)|² = X_R(n, k)² + X_I(n, k)²
where E(n, k) is the energy density spectrum of the complex sequence obtained from the Fourier transform of the n-th frame, X_R(n, k) is its real part, and X_I(n, k) is its imaginary part.
Then the logarithm of the energy spectral density is taken and converted to decibel form:
10 log₁₀ E(n, k) = 10 log₁₀ |X(n, k)|² = 20 log₁₀ |X(n, k)|
After these steps, suppose the n-th frame has yielded m values, denoted T(n, m), each of which is a log-domain energy spectral density. Let the maximum of T(n, m) be T_max(n) and the minimum be T_min(n); the log-domain energy spectral density can then be mapped to pixel values Q(n, m) between 0 and 255 by
Q(n, m) = round(255 × (T(n, m) − T_min(n)) / (T_max(n) − T_min(n)))
Finally, with n as the abscissa, m as the ordinate, and Q(n, m) as the pixel value, the resulting two-dimensional image is the grayscale spectrogram.
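The per-frame computation above — DFT, energy density, decibel conversion, min-max mapping to pixels — can be sketched in plain Python. The small 1e-12 floor inside the logarithm is an implementation convenience to avoid log10(0), not part of the patent:

```python
import cmath
import math

def frame_to_log_energy(frame):
    """DFT of one frame, then energy density E(n,k) = |X(n,k)|^2
    converted to decibels: 10*log10(E)."""
    M = len(frame)
    spectrum = []
    for k in range(M):
        X = sum(frame[p] * cmath.exp(-2j * math.pi * p * k / M) for p in range(M))
        E = X.real ** 2 + X.imag ** 2
        spectrum.append(10 * math.log10(E + 1e-12))  # floor avoids log10(0)
    return spectrum

def to_gray(T):
    """Min-max map a list of log-energy values onto 0-255 pixel values."""
    t_min, t_max = min(T), max(T)
    return [round(255 * (t - t_min) / (t_max - t_min)) for t in T]
```

Stacking `to_gray(frame_to_log_energy(f))` column by column over the frames yields the grayscale spectrogram image described in the text.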
After pre-processing and spectrogram generation, the characteristic parameters are extracted. MFCC parameters are used as the characteristic parameters for voiceprint recognition, and they are computed as follows:
the discrete cosine transform of the log energy spectral density is computed, yielding D(n, m); its DC component is discarded, and the remaining coefficients are the MFCC parameters.
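The DCT step can be sketched as follows. A type-II DCT is assumed, since the patent does not specify the variant; dropping coefficient 0 removes the DC component:

```python
import math

def dct_drop_dc(log_energy):
    """Type-II DCT of the log-energy values; coefficient 0 (the DC
    component) is discarded, the rest serve as the MFCC-style parameters."""
    M = len(log_energy)
    coeffs = [sum(log_energy[p] * math.cos(math.pi * k * (p + 0.5) / M)
                  for p in range(M))
              for k in range(M)]
    return coeffs[1:]  # cast out the DC component
```

For a constant input, all retained coefficients are (numerically) zero — the information was entirely in the discarded DC term, which is why dropping it removes only the overall energy offset.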
Next, the deep convolutional neural network (CNN) model is built; this is the core of the method. The structure of the CNN is designed as follows: the network contains 5 hidden layers, namely 3 convolutional layers and 2 down-sampling layers, plus 1 fully connected layer. Convolution and down-sampling alternate, and the computation proceeds as follows:
The first hidden layer performs convolution with 8 feature maps, each using a 5 × 5 convolution kernel, so each feature map obtained after the mapping is 28 × 28; the convolution is computed without zero-padding the edges.
The second hidden layer realizes down-sampling and local averaging; it likewise consists of 8 feature maps and down-samples with a 2 × 2 kernel, so each resulting feature map is 14 × 14. A down-sampling layer does not change the number of feature maps, only the size of each; this is a form of dimensionality reduction.
The third hidden layer performs the second convolution; it consists of 20 feature maps, again with 5 × 5 kernels, and each feature map consists of 10 × 10 neurons. The convolution operation is the same as in the first convolutional layer, except that during the mapping one feature map may be connected to several feature maps of the previous layer.
The fourth hidden layer performs the second down-sampling; it consists of 20 feature maps with a 2 × 2 down-sampling template, yielding maps of size 5 × 5.
The fifth hidden layer flattens the resulting feature maps into a vector: a 5 × 5 convolution is applied to each feature map, yielding a 120-dimensional output vector.
The last layer of the network is a fully connected layer, whose output vector is obtained by back-propagation training.
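The layer sizes above follow from simple arithmetic: a valid (unpadded) k × k convolution shrinks a map from side s to s − k + 1, and a 2 × 2 down-sampling halves it. The sketch below checks the chain; the 32 × 32 input size is inferred from the 28 × 28 first-layer maps, since the patent does not state it explicitly:

```python
def conv_size(s, k):
    """Output side length of a valid (no zero-padding) k x k convolution."""
    return s - k + 1

def pool_size(s, k):
    """Output side length of non-overlapping k x k down-sampling."""
    return s // k

s = 32                 # inferred input spectrogram patch size
s = conv_size(s, 5)    # conv1: 8 maps of 28 x 28
assert s == 28
s = pool_size(s, 2)    # pool1: 8 maps of 14 x 14
s = conv_size(s, 5)    # conv2: 20 maps of 10 x 10
s = pool_size(s, 2)    # pool2: 20 maps of 5 x 5
s = conv_size(s, 5)    # conv3: each map collapses to 1 x 1
assert s == 1          # flattened into the 120-dimensional vector
```

The final 5 × 5 convolution over 5 × 5 maps collapses each map to a single value, which is how the flattening into the 120-dimensional vector arises.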
Next, the characteristic parameters of the training samples are used for classification training; the purpose of training is to obtain the connection weights between neurons and the neuron biases in the network model, which together constitute the model library. Training uses a supervised learning algorithm: the characteristic parameters of the training samples are labeled before training, the CNN model assigning label value i − 1 to the characteristic parameters of the i-th speaker; all characteristic parameters of a given speaker share the same label value, which represents that speaker's ID. The training model uses the cross-entropy function as the cost function, and the output layer uses a Softmax classifier. Gradients are computed with the back-propagation algorithm, and the CNN model can be trained with any general gradient-based technique.
Referring to Fig. 3, the overall structural diagram of the recognition model. The recognition performed by the network model corresponds to the recognition stage of voiceprint recognition:
initialize the decision values A₁, A₂, …, A_ω, …, A_S of the voice signals corresponding to the S speakers so that A₁ = A₂ = … = A_ω = … = A_S = 0;
obtain the set of grayscale-spectrogram features of the voice signal to be identified according to Step 1 and input them one by one into the trained convolutional neural network; whenever a grayscale-spectrogram feature of the voice signal under test is recognized as belonging to the voice of the ω-th speaker, set A_ω = A_ω + 1;
output the speaker whose voice signal corresponds to the decision value max(A₁, A₂, …, A_ω, …, A_S).
The specific recognition and analysis process is as follows: suppose the speech of the speaker under test yields N characteristic-parameter spectrograms after the spectrogram generation process above. These N spectrograms are input one by one into the CNN model; before being input, each spectrogram goes through the same pre-processing steps as in the training stage. The CNN model then outputs the speaker for each characteristic-parameter spectrogram, so the N spectrograms correspond to N speaker votes; the speaker appearing most often is taken as the speaker of the speech under test, achieving the goal of voiceprint recognition.
As an illustration: suppose the database contains speakers A, B, C, D, E, F and G. During training, the characteristic parameters obtained after pre-processing the speech of A, B, C, D, E, F and G form the training set, with the speakers A–G as the output vector, yielding the deep convolutional neural network model. The speech of the speaker under test is collected and, after the spectrogram generation process above, yields N characteristic-parameter spectrograms; these are input one by one into the CNN model (each going through the same pre-processing steps as in the training stage), a vote is taken, and finally a set of results is obtained. The voting process is as follows:
initialize the decision values of samples A, B, C, D, E, F and G: A = B = C = D = E = F = G = 0;
input the characteristic parameters of the voice signal under test into the CNN model; if the voice signal is recognized as speaker A, then A = A + 1; if as speaker B, then B = B + 1;
and so on;
if the voice signal is recognized as speaker G, then G = G + 1;
the final output is max(A, B, C, D, E, F, G): the speaker with the most votes is the speaker of the voice signal under test.
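The voting stage above amounts to a majority vote over the per-spectrogram predictions. A minimal sketch, with the per-spectrogram classifier passed in as a function (an interface assumed here for illustration, not specified in the patent):

```python
def identify_speaker(spectrograms, classify, speakers):
    """Majority vote: classify each spectrogram, tally one vote per
    prediction, and return the speaker with the most votes."""
    votes = {s: 0 for s in speakers}
    for spec in spectrograms:
        votes[classify(spec)] += 1
    return max(votes, key=votes.get)
```

For example, with a toy classifier that maps value 1 to "A" and everything else to "B", three spectrograms [1, 2, 2] produce two votes for "B", so "B" is returned.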
The voiceprint recognition and analysis method based on a deep convolutional neural network provided by the invention extracts the characteristic parameters of the voice signal and, through training and recognition with the deep convolutional neural network, correctly identifies the speaker, achieving good results and effectively improving the accuracy and efficiency of voiceprint recognition.
Although the embodiments of the present invention have been disclosed above, they are not limited to the applications listed in the description and the embodiments; the invention can be applied in all fields suitable for it, and those skilled in the art can easily realize further modifications. Therefore, without departing from the general concept defined by the claims and their equivalent scope, the present invention is not limited to the specific details and illustrations shown and described herein.
Claims (10)
1. A voiceprint recognition and analysis method based on a deep convolutional neural network, characterized by comprising the following steps:
Step 1: collecting the voice signal of a known speaker, pre-processing the voice signal to generate a grayscale spectrogram, and extracting characteristic parameters from the grayscale spectrogram;
Step 2: building a deep convolutional neural network on the characteristic parameters of the grayscale spectrogram and training it, the network comprising 5 hidden layers — 3 convolutional layers and 2 down-sampling layers — with convolution and down-sampling alternating:
the first convolutional layer consists of 8 feature maps with 5 × 5 convolution kernels, the convolution being computed without zero-padding the edges;
the first down-sampling layer consists of 8 feature maps with a 2 × 2 kernel, realizing down-sampling and local averaging;
the second convolutional layer consists of 20 feature maps with 5 × 5 convolution kernels, each feature map consisting of 10 × 10 neurons;
the second down-sampling layer consists of 20 feature maps with a 2 × 2 kernel;
the third convolutional layer applies a 5 × 5 convolution kernel to each feature map and flattens the resulting feature maps into a vector;
Step 3: collecting the voice signal to be identified, obtaining the characteristic parameters of its grayscale spectrogram according to Step 1, and identifying the speaker of the voice signal using the trained convolutional neural network.
2. The voiceprint recognition and analysis method based on a deep convolutional neural network according to claim 1, characterized in that, in Step 1, the pre-processing of the voice signal comprises sampling and quantization, pre-emphasis, framing and windowing, and endpoint detection.
3. The voiceprint recognition and analysis method based on a deep convolutional neural network according to claim 2, characterized in that the sampling and quantization of the voice signal comprise: digitizing the voice signal at a sampling rate of 8 kHz, with each sample represented by 8 bits.
4. The voiceprint recognition and analysis method based on a deep convolutional neural network according to claim 3, characterized in that the pre-emphasis of the voice signal comprises:
passing the digital voice signal obtained after sampling and quantization through a first-order high-pass filter to apply pre-emphasis, highlighting the high-frequency components, the transfer function of the first-order high-pass filter being:
H(z) = 1 − 0.9375z⁻¹
where z is the complex variable of the z-transform of the voice signal.
5. The voiceprint recognition and analysis method based on a deep convolutional neural network according to claim 4, characterized in that the framing and windowing of the voice signal comprise:
splitting the continuous speech signal into frames with a frame length of 10–30 ms;
applying a Hamming window to each frame, the window function being:
W(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
where W(n) is the Hamming window function applied to each frame of the voice signal and N is the number of samples per frame.
6. The voiceprint recognition and analysis method based on a deep convolutional neural network according to claim 5, characterized in that the endpoint detection of the voice signal comprises:
removing the silent segments of the voice signal using the short-time energy and short-time zero-crossing rate methods.
7. the Application on Voiceprint Recognition analysis method based on depth convolutional neural networks as described in any one of claim 2-6,
It is characterized in that, in step 1, the generation of the gray scale sound spectrograph includes:
Each frame voice signal is resolved into amplitude spectrum by discrete Fourier transform:
Wherein, M is the sampling number of each frame, and X (n, k) is the sequence that n-th frame voice signal passes through that Fourier transformation obtains, k
For Fourier transformation parameter, e is the truth of a matter of natural logrithm, xpIt (n) is the signal of p-th of sampled point of n-th frame voice signal;
obtaining the energy density spectrum of the complex sequence obtained from each frame of the voice signal after the Fourier transform:
E(n,k) = |X(n,k)|^2 = X_R(n,k)^2 + X_I(n,k)^2
wherein E(n,k) is the energy density spectrum of the complex sequence obtained from the n-th frame of the voice signal after the Fourier transform, X_R(n,k) is the real part of that complex sequence, and X_I(n,k) is its imaginary part;
taking the logarithm of the energy spectral density:
10·log10 E(n,k) = 10·log10 |X(n,k)|^2 = 20·log10 |X(n,k)|;
mapping the energy spectral density in logarithmic form to pixel values Q(n,m) between 0 and 255 to obtain the grayscale spectrogram:
Q(n,m) = round(255 × (T(n,m) - T_min(n,m)) / (T_max(n,m) - T_min(n,m)))
wherein T(n,m) is the m-th logarithmic energy spectral density value of the n-th frame of the voice signal, T_max(n,m) is the maximum among the logarithmic energy spectral density values of the n-th frame of the voice signal, and T_min(n,m) is the minimum among the logarithmic energy spectral density values of the n-th frame of the voice signal.
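The spectrogram pipeline of claim 7 (DFT, energy density spectrum, logarithm, 0-255 mapping) can be sketched as follows. Per-frame min-max normalization is assumed from the definitions of T_max and T_min; the small epsilon guarding against log(0) and division by zero is an implementation detail, not part of the claim.

```python
import numpy as np

def grayscale_spectrogram(frames):
    """DFT each windowed frame, take the energy density |X|^2,
    convert to dB (10*log10), then linearly map each frame's dB
    values onto 0-255 pixel values Q(n, m)."""
    X = np.fft.rfft(frames, axis=1)                    # X(n, k), one row per frame
    E = np.abs(X) ** 2                                 # energy density spectrum E(n, k)
    T = 10 * np.log10(E + 1e-12)                       # logarithmic form (epsilon avoids log 0)
    Tmin = T.min(axis=1, keepdims=True)
    Tmax = T.max(axis=1, keepdims=True)
    Q = np.round(255 * (T - Tmin) / (Tmax - Tmin + 1e-12))
    return Q.astype(np.uint8)                          # grayscale pixel values
```

`rfft` keeps only the non-redundant half of the spectrum (M/2 + 1 bins for real input), which is sufficient since the energy spectrum of a real signal is symmetric.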
8. The voiceprint recognition analysis method based on a deep convolutional neural network as claimed in claim 7, wherein Mel-frequency cepstral coefficient parameters are used as the characteristic parameters of the grayscale spectrogram, and the acquisition of the Mel-frequency cepstral coefficient parameters comprises:
taking the logarithm of the energy spectral density and performing a discrete cosine transform, then casting out the DC component; the remainder constitutes the Mel-frequency cepstral coefficient parameters.
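A minimal sketch of claim 8's coefficient extraction, applying a DCT-II directly to the log energy spectrum and casting out the DC term. Note that the claim as written omits the Mel filterbank that a conventional MFCC front end would insert before the DCT; `n_coeffs=13` is an illustrative choice, not part of the claim.

```python
import numpy as np

def cepstral_coeffs(log_energy, n_coeffs=13):
    """DCT-II of the log energy spectrum, discarding the 0th (DC)
    coefficient. log_energy is a length-N vector (or batch of them)."""
    N = log_energy.shape[-1]
    k = np.arange(N)
    # DCT-II basis: basis[u, k] = cos(pi * u * (2k + 1) / (2N))
    basis = np.cos(np.pi * np.outer(np.arange(N), 2 * k + 1) / (2 * N))
    c = log_energy @ basis.T                   # all N DCT coefficients
    return c[..., 1:n_coeffs + 1]              # cast out the DC component
```

A quick sanity check: a constant log spectrum carries all its energy in the DC term, so every retained coefficient is (numerically) zero.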
9. The voiceprint recognition analysis method based on a deep convolutional neural network as claimed in claim 8, wherein step 3 comprises:
initializing the decision values A1, A2, ..., Aω, ..., AS of the voice signals corresponding to the S speakers, so that A1 = A2 = ... = Aω = ... = AS = 0;
obtaining the feature set of the voice signal to be identified from its grayscale spectrogram according to step 1, and inputting the features one by one into the trained convolutional neural network; whenever a feature of the grayscale spectrogram of the voice signal to be identified is identified as belonging to the voice signal of the ω-th speaker, setting Aω = Aω + 1;
outputting the speaker corresponding to the maximum decision value max(A1, A2, ..., Aω, ..., AS).
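Claim 9's tallying rule reduces to a majority vote over per-frame CNN decisions. A sketch, assuming (as an illustration, not something the claim specifies) that the network outputs one score per speaker for each spectrogram feature:

```python
import numpy as np

def vote_speaker(frame_scores):
    """frame_scores: (n_frames, S) array of per-frame CNN outputs,
    one column per speaker. Each frame increments the decision value
    A_w of its winning speaker; the speaker with max(A_1..A_S) wins."""
    S = frame_scores.shape[1]
    A = np.zeros(S, dtype=int)                 # A_1 ... A_S initialised to 0
    for scores in frame_scores:
        A[np.argmax(scores)] += 1              # A_w = A_w + 1 for the winning class
    return int(np.argmax(A))                   # index of max(A_1, ..., A_S)
```

Because the vote aggregates many short-frame decisions, occasional per-frame misclassifications are outvoted by the majority.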
10. The voiceprint recognition analysis method based on a deep convolutional neural network as claimed in claim 5, wherein the continuous speech signal is split into multi-frame voice signals with a frame length of 10 ms.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811439719.XA CN109524014A (en) | 2018-11-29 | 2018-11-29 | A kind of Application on Voiceprint Recognition analysis method based on depth convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109524014A true CN109524014A (en) | 2019-03-26 |
Family
ID=65793759
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811439719.XA Pending CN109524014A (en) | 2018-11-29 | 2018-11-29 | A kind of Application on Voiceprint Recognition analysis method based on depth convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109524014A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160099010A1 (en) * | 2014-10-03 | 2016-04-07 | Google Inc. | Convolutional, long short-term memory, fully connected deep neural networks |
CN106952649A (en) * | 2017-05-14 | 2017-07-14 | 北京工业大学 | Speaker recognition method based on convolutional neural networks and spectrogram |
WO2018053518A1 (en) * | 2016-09-19 | 2018-03-22 | Pindrop Security, Inc. | Channel-compensated low-level features for speaker recognition |
CN108597539A (en) * | 2018-02-09 | 2018-09-28 | 桂林电子科技大学 | Speech emotion recognition method based on parameter migration and spectrogram |
CN108831485A (en) * | 2018-06-11 | 2018-11-16 | 东北师范大学 | Speaker recognition method based on spectrogram statistical features |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110265035B (en) * | 2019-04-25 | 2021-08-06 | 武汉大晟极科技有限公司 | Speaker recognition method based on deep learning |
CN110265035A (en) * | 2019-04-25 | 2019-09-20 | 武汉大晟极科技有限公司 | Speaker recognition method based on deep learning |
CN111951809A (en) * | 2019-05-14 | 2020-11-17 | 深圳子丸科技有限公司 | Multi-person voiceprint identification method and system |
CN110277100A (en) * | 2019-06-19 | 2019-09-24 | 南京邮电大学 | Improved voiceprint recognition method based on Alexnet, storage medium and terminal |
CN110349588A (en) * | 2019-07-16 | 2019-10-18 | 重庆理工大学 | LSTM network voiceprint recognition method based on word embedding |
CN110534118A (en) * | 2019-07-29 | 2019-12-03 | 安徽继远软件有限公司 | Transformer/reactor fault diagnosis method based on voiceprint recognition and neural network |
CN110534118B (en) * | 2019-07-29 | 2021-10-08 | 安徽继远软件有限公司 | Transformer/reactor fault diagnosis method based on voiceprint recognition and neural network |
CN111124108A (en) * | 2019-11-22 | 2020-05-08 | Oppo广东移动通信有限公司 | Model training method, gesture control method, device, medium and electronic equipment |
CN111048097A (en) * | 2019-12-19 | 2020-04-21 | 中国人民解放军空军研究院通信与导航研究所 | Twin network voiceprint recognition method based on 3D convolution |
WO2021127990A1 (en) * | 2019-12-24 | 2021-07-01 | 广州国音智能科技有限公司 | Voiceprint recognition method based on voice noise reduction and related apparatus |
CN111210835A (en) * | 2020-01-08 | 2020-05-29 | 华南理工大学 | Multi-channel voice noise reduction method based on auditory model and information source direction |
CN111210835B (en) * | 2020-01-08 | 2023-07-18 | 华南理工大学 | Multichannel voice noise reduction method based on auditory model and information source direction |
CN111210807B (en) * | 2020-02-21 | 2023-03-31 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111210807A (en) * | 2020-02-21 | 2020-05-29 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111429921A (en) * | 2020-03-02 | 2020-07-17 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN111429921B (en) * | 2020-03-02 | 2023-01-03 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN113470653A (en) * | 2020-03-31 | 2021-10-01 | 华为技术有限公司 | Voiceprint recognition method, electronic equipment and system |
CN111524525B (en) * | 2020-04-28 | 2023-06-16 | 平安科技(深圳)有限公司 | Voiceprint recognition method, device, equipment and storage medium of original voice |
CN111524525A (en) * | 2020-04-28 | 2020-08-11 | 平安科技(深圳)有限公司 | Original voice voiceprint recognition method, device, equipment and storage medium |
CN111862989B (en) * | 2020-06-01 | 2024-03-08 | 北京捷通华声科技股份有限公司 | Acoustic feature processing method and device |
CN111862989A (en) * | 2020-06-01 | 2020-10-30 | 北京捷通华声科技股份有限公司 | Acoustic feature processing method and device |
CN112053694A (en) * | 2020-07-23 | 2020-12-08 | 哈尔滨理工大学 | Voiceprint recognition method based on CNN and GRU network fusion |
CN112201258A (en) * | 2020-10-15 | 2021-01-08 | 杭州电子科技大学 | AMBP-based noise robustness camouflage voice detection method |
CN112712814A (en) * | 2020-12-04 | 2021-04-27 | 中国南方电网有限责任公司 | Voiceprint recognition method based on deep learning algorithm |
CN112614492A (en) * | 2020-12-09 | 2021-04-06 | 通号智慧城市研究设计院有限公司 | Voiceprint recognition method, system and storage medium based on time-space information fusion |
CN112767949A (en) * | 2021-01-18 | 2021-05-07 | 东南大学 | Voiceprint recognition system based on binary weight convolutional neural network |
CN112883562A (en) * | 2021-02-01 | 2021-06-01 | 上海交通大学三亚崖州湾深海科技研究院 | Method for repairing ocean platform actual measurement stress spectrum based on neural network algorithm |
CN112883562B (en) * | 2021-02-01 | 2023-02-24 | 上海交通大学三亚崖州湾深海科技研究院 | Ocean platform actual measurement stress spectrum repairing method based on neural network algorithm |
CN112786059A (en) * | 2021-03-11 | 2021-05-11 | 合肥市清大创新研究院有限公司 | Voiceprint feature extraction method and device based on artificial intelligence |
CN113129897A (en) * | 2021-04-08 | 2021-07-16 | 杭州电子科技大学 | Voiceprint recognition method based on attention mechanism recurrent neural network |
CN113129897B (en) * | 2021-04-08 | 2024-02-20 | 杭州电子科技大学 | Voiceprint recognition method based on attention mechanism cyclic neural network |
WO2023036016A1 (en) * | 2021-09-07 | 2023-03-16 | 广西电网有限责任公司贺州供电局 | Voiceprint recognition method and system applied to electric power operation |
CN113823291A (en) * | 2021-09-07 | 2021-12-21 | 广西电网有限责任公司贺州供电局 | Voiceprint recognition method and system applied to power operation |
CN114333850A (en) * | 2022-03-15 | 2022-04-12 | 清华大学 | Voice voiceprint visualization method and device |
CN114598565A (en) * | 2022-05-10 | 2022-06-07 | 深圳市发掘科技有限公司 | Kitchen electrical equipment remote control system and method and computer equipment |
CN115206335A (en) * | 2022-09-15 | 2022-10-18 | 北京中环高科环境治理有限公司 | Noise monitoring method for automatic sample retention and evidence collection |
CN115206335B (en) * | 2022-09-15 | 2022-12-02 | 北京中环高科环境治理有限公司 | Noise monitoring method for automatic sample retention and evidence collection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109524014A (en) | Voiceprint recognition analysis method based on a deep convolutional neural network | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization | |
Kumar et al. | Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm | |
CN106952649A (en) | Speaker recognition method based on convolutional neural networks and spectrogram | |
CN108831485A (en) | Speaker recognition method based on spectrogram statistical features | |
CN107610707A (en) | Voiceprint recognition method and device | |
CN107393554A (en) | Feature extraction method fusing inter-class standard deviation for acoustic scene classification | |
CN111128209B (en) | Speech enhancement method based on mixed masking learning target | |
CN109559736A (en) | Automatic dubbing method for film actors based on adversarial networks | |
CN109256138A (en) | Auth method, terminal device and computer readable storage medium | |
CN112053694A (en) | Voiceprint recognition method based on CNN and GRU network fusion | |
CN113658583B (en) | Whispered speech conversion method, system and device based on generative adversarial networks | |
Shi et al. | End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network. | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN110782902A (en) | Audio data determination method, apparatus, device and medium | |
CN102496366B (en) | Text-independent speaker identification method | |
Jing et al. | Speaker recognition based on principal component analysis of LPCC and MFCC | |
Xue et al. | Cross-modal information fusion for voice spoofing detection | |
Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Wu et al. | A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling | |
CN110136741A (en) | Single-channel speech enhancement method based on multi-scale context | |
CN115064175A (en) | Speaker recognition method | |
Chelali et al. | MFCC and vector quantization for Arabic fricatives speech/speaker recognition | |
Lavania et al. | Reviewing Human-Machine Interaction through Speech Recognition approaches and Analyzing an approach for Designing an Efficient System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190326 |