CN109272990B - Voice recognition method based on convolutional neural network - Google Patents

Voice recognition method based on convolutional neural network

Info

Publication number
CN109272990B
CN109272990B (application CN201811112506.6A)
Authority
CN
China
Prior art keywords
layer
convolutional
layers
acoustic model
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811112506.6A
Other languages
Chinese (zh)
Other versions
CN109272990A (en)
Inventor
曹毅
张威
翟明浩
刘晨
黄子龙
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN201811112506.6A priority Critical patent/CN109272990B/en
Publication of CN109272990A publication Critical patent/CN109272990A/en
Application granted granted Critical
Publication of CN109272990B publication Critical patent/CN109272990B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 2015/0631: Creating reference templates; Clustering

Abstract

The invention provides a voice recognition method based on a convolutional neural network. The method is better at extracting high-level features, has a simple modeling process, is easy to train, generalizes well, and can be widely applied to various voice recognition scenarios. It includes: S1: preprocessing an input original voice signal; S2: extracting key characteristic parameters that reflect the characteristics of the voice signal to form a feature vector sequence; S3: constructing an acoustic model in an end-to-end manner based on a DCNN model, with connectionist temporal classification (CTC) as the loss function; S4: training the acoustic model to obtain a trained acoustic model; S5: inputting the feature vector sequence to be recognized obtained in step S2 into the trained acoustic model to obtain a recognition result; S6: performing subsequent operations based on the recognition result obtained in step S5 to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech.

Description

Voice recognition method based on convolutional neural network
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition method based on a convolutional neural network.
Background
In speech recognition technology, the GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) has long played the leading role as the acoustic model. Because of the characteristics of the GMM-HMM, however, the acoustic model must be aligned before training: the data of every frame has to be aligned with its corresponding label. This alignment process is cumbersome and the training time is long. Moreover, since the model combines a GMM with an HMM, the concrete modeling process is relatively complicated to implement, which limits its use in practical speech recognition applications.
Disclosure of Invention
In order to solve the problems of long training time, complicated modeling and limited applicability in existing acoustic models, the invention provides a voice recognition method based on a convolutional neural network. The method is better at extracting high-level features, has a simple modeling process, is easy to train, generalizes well, and can be applied more widely to various voice recognition scenarios.
The technical scheme of the invention is as follows: the speech recognition method based on the convolutional neural network comprises the following steps:
s1: inputting original voice, preprocessing the original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing an acoustic model;
s4: training the acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into the trained acoustic model to obtain a recognition result;
s6: performing subsequent operations based on the recognition result obtained in step S5 to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech;
the method is characterized in that:
in step S3, the acoustic model is constructed in an end-to-end manner based on a DCNN network model, with connectionist temporal classification (CTC) as the loss function.
It is further characterized in that:
the structure of the acoustic model comprises a plurality of convolutional layers, two fully connected layers and a CTC loss function, arranged in sequence;
the first and second convolutional layers use 32 convolution kernels each to extract voice features; the third and fourth convolutional layers use 64 kernels each; from the fifth layer onward, several consecutive convolutional layers with 128 kernels each extract higher-level voice features;
if the number of convolutional layers is even, every two consecutive convolutional layers, starting from the first, are followed by one pooling layer; if the number of convolutional layers is odd, every two consecutive convolutional layers are followed by one pooling layer and the last three consecutive convolutional layers are followed by a single pooling operation;
the plurality of convolutional layers comprises 8 or 9 layers in total;
the training formula of the acoustic model in step S4 is as follows:
H_i = W_i * X + b_i

wherein:

i = 1, ..., k,

H_i represents the i-th output feature map,

W_i represents the weight of the i-th feature map,

X represents the feature map input from the previous layer,

b_i represents the bias term of the i-th feature map;
the CTC loss function is:
L(μ, x) = -ln p(μ|x) = -ln Σ_{π ∈ B^-1(μ)} p(π|x)

wherein:

y_k^t = exp(a_k^t) / Σ_{k'} exp(a_{k'}^t)

is the softmax function applied to the network output a^t at frame t,

p(π|x) = ∏_{t=1}^{T} y_{π(t)}^t

represents the probability of the output path π given the input x,

p(μ|x) = Σ_{π ∈ B^-1(μ)} p(π|x)

represents the probability of the output label sequence μ, that is, the sum of the probabilities of all paths that map to it,

x represents the input,

X = x_1, x_2, ..., x_T represents the input sequence, the subscripts denoting frames 1 to T,

Y = y^1, y^2, ..., y^T represents the output corresponding to X,

y^i = y_1^i, y_2^i, ..., y_K^i represents the conditional probability distribution over the K output labels at the i-th frame of the output sequence,

π represents an output path,

μ represents the output label sequence,

a many-to-one relationship exists between π and μ,

B represents the mapping from paths to label sequences;
the convolution kernels of the convolutional layers are all set to 3×3 in size;
the pooling layers perform a 2×2 maximum pooling operation with a stride of 2;
in step S2, a spectrogram is used as the voice feature; the frame length of the spectrogram is set to 25 ms, the frame shift to 10 ms, and a Hamming window is used as the window function.
In the technical scheme provided by the invention, the acoustic model is constructed on the basis of a DCNN model. Because it rests on a single network model, the number of parameters is greatly reduced, the constructed acoustic model is easier to train, the training time is shorter and the training speed is higher; and because a convolutional neural network generalizes well, the model fits audio data collected in a wide variety of scenarios, giving the technical scheme a broad range of application. The invention constructs the acoustic model in an end-to-end manner with connectionist temporal classification (CTC) as the loss function, so the traditional prior alignment operation is not needed: training requires only an input sequence and an output sequence, the probability of the predicted sequence is output directly, and no external post-processing is required, which reduces the training time and simplifies the modeling process.
Drawings
FIG. 1 is a schematic diagram of a network architecture according to the present invention;
FIG. 2 is a schematic diagram of a speech feature spectrogram extraction process;
FIG. 3 is a schematic diagram of a network of the present invention including 7 convolutional layers;
FIG. 4 is a schematic diagram of a network of the present invention including 8 convolutional layers;
fig. 5 is a schematic diagram of a network including 9 convolutional layers according to the present invention.
Detailed Description
As shown in figs. 1 to 5, the technical solution of the invention implements the acoustic model in an end-to-end manner based on a DCNN (Deep Convolutional Neural Network) model and the CTC (Connectionist Temporal Classification) method; it comprises the following steps:
s1: inputting original voice, preprocessing an original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing the acoustic model; the acoustic model is constructed in an end-to-end manner based on a DCNN network model, with connectionist temporal classification (CTC) as the loss function;
the structure of the acoustic model comprises a plurality of convolutional layers, two fully connected layers and a CTC loss function, arranged in sequence. The convolutional layers are arranged as follows: the first and second layers use 32 convolution kernels each to extract voice features; the third and fourth layers use 64 kernels each; from the fifth layer onward, several consecutive layers with 128 kernels each extract higher-level voice features. If the number of convolutional layers is even, every two consecutive convolutional layers, starting from the first, are followed by one pooling layer; if the number is odd, every two consecutive convolutional layers are followed by one pooling layer and the last three consecutive convolutional layers are followed by a single pooling operation. The convolution kernels are all 3×3 in size, and the pooling layers perform 2×2 max pooling with a stride of 2. Using multiple 128-kernel convolutional layers to extract the higher-level voice features keeps the number of parameters of the acoustic model under control, prevents the network from overfitting and keeps the acoustic model practical; the pooling operation mainly reduces the dimensionality of the voice feature maps and the number of parameters, and also improves robustness to noise (a minimal code sketch of this structure is given after step S6 below);
s4: training an acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into a trained acoustic model to obtain a recognition result;
s6: constructing a language model and training it to obtain a trained language model; inputting the recognition result obtained in step S5 into the trained language model and performing the subsequent speech decoding operation to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech;
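As referenced in step S3 above, the following is a minimal Keras sketch of the 8-convolutional-layer variant of the acoustic model. The input shape, the number of units in the fully connected layers and all names are illustrative assumptions; the patent fixes only the kernel counts, the 3×3 kernels, the 2×2 pooling and the overall layer order.

# Minimal sketch (assumptions noted above); not the patent's reference implementation.
from tensorflow.keras import layers, models

def build_dcnn_8(num_labels, input_shape=(200, 161, 1)):
    x_in = layers.Input(shape=input_shape, name="spectrogram")
    x = x_in
    # First and second layers: 32 kernels each, then 2x2 max pooling (stride 2).
    for filters in (32, 32):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Third and fourth layers: 64 kernels each, then pooling.
    for filters in (64, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Fifth to eighth layers: 128 kernels each, pooling after every two layers.
    for _ in range(2):
        for _ in range(2):
            x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Collapse the frequency axis, keep the (downsampled) time axis for CTC.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    # Two fully connected layers, then per-frame label posteriors.
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    y = layers.Dense(num_labels + 1, activation="softmax")(x)  # +1 for the CTC blank
    return models.Model(x_in, y)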
the method comprises the steps of forming a feature vector sequence by analyzing key feature parameters aiming at an input voice signal, establishing a recognition network by a trained acoustic model, a trained language model and a dictionary, and searching an optimal path in the network according to a search algorithm, wherein the path is a word string capable of outputting the voice signal with the maximum probability, and the word string is a character of recognized original voice.
In step S2, a spectrogram is used as the voice feature. The extraction process is as follows: the original voice is input; the signal is divided into frames and windowed, with a frame length of 25 ms, a frame shift of 10 ms and a Hamming window as the window function; a fast Fourier transform then converts the voice signal from the time domain to the frequency domain; and the logarithm is taken to obtain the spectrogram. The spectrogram combines the time domain and the frequency domain and is a visual representation of the time-frequency distribution of the speech energy; it effectively exploits the correlation between the two domains, so the feature vector sequence obtained from spectrogram analysis preserves the original characteristics well, and feeding it into the acoustic model makes the subsequent operations more accurate. Compared with other window functions, the Hamming window effectively reduces spectral leakage, so the data input to the acoustic model is more accurate.
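A minimal sketch of the spectrogram extraction just described (25 ms frames, 10 ms shift, Hamming window, FFT, logarithm) follows; the sampling rate and the small constant added before the logarithm are assumptions not stated in the patent.

# Minimal sketch; the 16 kHz sampling rate and the 1e-10 floor are assumptions.
import numpy as np
from scipy.signal import stft

def log_spectrogram(waveform, sample_rate=16000):
    frame_len = int(0.025 * sample_rate)    # 25 ms -> 400 samples at 16 kHz
    frame_shift = int(0.010 * sample_rate)  # 10 ms -> 160 samples
    _, _, spec = stft(waveform, fs=sample_rate, window="hamming",
                      nperseg=frame_len, noverlap=frame_len - frame_shift)
    return np.log(np.abs(spec).T + 1e-10)   # shape: (frames, frequency bins)

# Hypothetical usage on one second of silence:
features = log_spectrogram(np.zeros(16000))
print(features.shape)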
The CTC loss function is:
L(μ, x) = -ln p(μ|x) = -ln Σ_{π ∈ B^-1(μ)} p(π|x)

wherein:

y_k^t = exp(a_k^t) / Σ_{k'} exp(a_{k'}^t)

is the softmax function applied to the network output a^t at frame t,

p(π|x) = ∏_{t=1}^{T} y_{π(t)}^t

represents the probability of the output path π given the input x,

p(μ|x) = Σ_{π ∈ B^-1(μ)} p(π|x)

represents the probability of the output label sequence μ, that is, the sum of the probabilities of all paths that map to it,

x represents the input,

X = x_1, x_2, ..., x_T represents the input sequence, the subscripts denoting frames 1 to T,

Y = y^1, y^2, ..., y^T represents the output corresponding to X,

y^i = y_1^i, y_2^i, ..., y_K^i represents the conditional probability distribution over the K output labels at the i-th frame of the output sequence,

π represents an output path,

μ represents the output label sequence,

a many-to-one relationship exists between π and μ, μ = B(π),

B represents the mapping from paths to label sequences.

Finally, one decoding mode of CTC, maximum path decoding, is given:

π* = argmax_π p(π|x), μ* = B(π*),

where π* represents the path that yields the maximum probability.
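A minimal NumPy sketch of the quantities defined above follows: p(π|x) as a product of per-frame softmax outputs, and maximum path decoding followed by the many-to-one mapping B (collapse repeats, then remove blanks). The tiny posterior matrix and the blank index are invented purely for illustration.

# Illustrative sketch; the posterior values and BLANK index are assumptions.
import numpy as np

BLANK = 0  # index of the CTC blank label (any fixed index works)

def path_probability(posteriors, path):
    # p(pi|x) = product over frames t of y^t_{pi(t)} for one alignment path.
    return float(np.prod([posteriors[t, k] for t, k in enumerate(path)]))

def max_path_decode(posteriors):
    # pi* = argmax_pi p(pi|x), taken frame by frame, then mu* = B(pi*).
    best_path = posteriors.argmax(axis=1)
    labels, prev = [], None
    for k in best_path:                     # collapse repeats, drop blanks
        if k != prev and k != BLANK:
            labels.append(int(k))
        prev = k
    return labels

posteriors = np.array([[0.1, 0.8, 0.1],     # frame 1: label 1 most likely
                       [0.6, 0.3, 0.1],     # frame 2: blank most likely
                       [0.2, 0.1, 0.7]])    # frame 3: label 2 most likely
print(max_path_decode(posteriors))          # -> [1, 2]
print(path_probability(posteriors, [1, 0, 2]))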
The training formula of the acoustic model in step S4 is as follows:
H_i = W_i * X + b_i

wherein:

i = 1, ..., k,

H_i represents the i-th output feature map,

W_i represents the weight of the i-th feature map,

X represents the feature map input from the previous layer,

b_i represents the bias term of the i-th feature map.
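A minimal NumPy/SciPy sketch of the training formula H_i = W_i * X + b_i follows: each output feature map H_i is obtained by convolving the previous layer's feature map X with its own 3×3 kernel W_i and adding the bias b_i. The kernel values and the input size are invented for illustration.

# Illustrative sketch; kernels, biases and the 8x8 input are invented values.
import numpy as np
from scipy.signal import correlate2d

def conv_feature_maps(X, kernels, biases):
    # X: (H, W) input feature map; kernels: (k, 3, 3); biases: (k,)
    return np.stack([correlate2d(X, W_i, mode="same") + b_i
                     for W_i, b_i in zip(kernels, biases)])

X = np.random.rand(8, 8)
kernels = np.random.rand(4, 3, 3)   # k = 4 output feature maps
biases = np.zeros(4)
H = conv_feature_maps(X, kernels, biases)
print(H.shape)  # (4, 8, 8)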
The acoustic model is trained on the open-source 30-hour speech data set of Tsinghua University. The data set is divided into a training set, a validation set and a test set containing 10000, 893 and 2495 sentences respectively, all recorded in a clean, noise-free environment.
As shown in fig. 3, this network structure contains 7 convolutional layers and takes as input the feature vector sequence obtained from spectrogram analysis. The first and second layers use 32 convolution kernels each to extract voice features, and the third and fourth layers use 64 kernels each; the fifth, sixth and seventh layers are 3 consecutive convolutional layers with 128 kernels each. The first and second convolutional layers are followed by a pooling layer, the third and fourth convolutional layers are followed by a pooling layer, and the fifth, sixth and seventh convolutional layers are followed by a pooling layer; two consecutive fully connected layers and the CTC loss function are then connected in turn.
As shown in fig. 4, this network structure contains 8 convolutional layers and takes as input the feature vector sequence obtained from spectrogram analysis. The first and second layers use 32 convolution kernels each to extract voice features, and the third and fourth layers use 64 kernels each; the fifth to eighth layers are 4 consecutive convolutional layers with 128 kernels each. Every two consecutive convolutional layers are followed by a pooling layer; the last pooling layer is then connected in turn to two consecutive fully connected layers and the CTC loss function.
As shown in fig. 5, this network structure contains 9 convolutional layers and takes as input the feature vector sequence obtained from spectrogram analysis. The first and second layers use 32 convolution kernels each to extract voice features, and the third and fourth layers use 64 kernels each; the fifth to ninth layers are 5 consecutive convolutional layers with 128 kernels each. The first and second convolutional layers are followed by a pooling layer, the third and fourth by a pooling layer, and the fifth and sixth (two consecutive 128-kernel convolutional layers) by a pooling layer; the seventh, eighth and ninth layers (three consecutive 128-kernel convolutional layers) are followed by a final pooling layer, which is then connected in turn to two consecutive fully connected layers and the CTC loss function.
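The pooling-placement rule shared by the three figures (a pooling layer after every two consecutive convolutional layers, with the last three convolutional layers sharing a single pooling layer when the total is odd) can be sketched as a small helper; the textual layer descriptions it prints are illustrative only.

# Illustrative helper; it only prints a layer plan, not a framework model.
def conv_pool_plan(filters_per_layer):
    n = len(filters_per_layer)
    plan, i = [], 0
    while i < n:
        # Take the last three conv layers together when the total is odd.
        group = 3 if (n % 2 == 1 and n - i == 3) else 2
        for f in filters_per_layer[i:i + group]:
            plan.append(f"conv 3x3, {f} kernels")
        plan.append("maxpool 2x2, stride 2")
        i += group
    return plan

# The 9-convolutional-layer variant of fig. 5:
for step in conv_pool_plan([32, 32, 64, 64, 128, 128, 128, 128, 128]):
    print(step)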
The network models were built with the deep learning library Keras. In an experimental environment with a GTX-1070Ti graphics card (compute capability 6.1) and an Intel i7-7700K CPU, the 7-, 8- and 9-convolutional-layer network structures were tested against a traditional acoustic model built on the GMM-HMM with the Kaldi speech recognition toolkit, using the 30-hour speech data set of Tsinghua University; the results are shown in Table 1:
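A minimal sketch of attaching the CTC loss to such a Keras model with keras.backend.ctc_batch_cost follows, in the spirit of the experiments described above. The maximum label length, the optimizer and its learning rate are assumptions, and `base` is assumed to be a model that outputs per-frame label posteriors (for example the build_dcnn_8 sketch given earlier).

# Minimal sketch under the assumptions stated above.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras import backend as K

def add_ctc(base, max_label_len=32):
    labels = layers.Input(shape=(max_label_len,), name="labels")
    input_len = layers.Input(shape=(1,), name="input_length")
    label_len = layers.Input(shape=(1,), name="label_length")
    # ctc_batch_cost(y_true, y_pred, input_length, label_length)
    ctc = layers.Lambda(
        lambda args: K.ctc_batch_cost(args[0], args[1], args[2], args[3]),
        name="ctc_loss")([labels, base.output, input_len, label_len])
    model = models.Model([base.input, labels, input_len, label_len], ctc)
    # The Lambda layer already computes the loss, so the compiled loss just
    # passes it through.
    model.compile(optimizer=optimizers.Adam(1e-4),
                  loss=lambda y_true, y_pred: y_pred)
    return model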
Table 1: Test results (the table is provided as an image in the original publication and is not reproduced here)
As the results show, the error rate of the acoustic model of the present technical scheme is lower than that of the traditional acoustic model built on the GMM-HMM; within the present scheme, the error rate keeps decreasing and the fitting capability of the network keeps improving as the network depth increases. The fitting capability is best when the acoustic model has 9 convolutional layers, and with 8 or 9 layers the number of parameters remains within an acceptable range.

Claims (6)

1. The speech recognition method based on the convolutional neural network comprises the following steps:
s1: inputting original voice, preprocessing the original voice signal, and performing related transformation processing;
s2: extracting key characteristic parameters reflecting the characteristics of the voice signals to form a characteristic vector sequence;
s3: constructing an acoustic model;
s4: training the acoustic model to obtain a trained acoustic model;
s5: inputting the feature vector sequence to be recognized obtained in the step S2 into the trained acoustic model to obtain a recognition result;
s6: performing subsequent operations based on the recognition result obtained in step S5 to obtain the word string that the speech signal outputs with maximum probability, where the word string is the recognized text of the original speech;
the method is characterized in that:
in step S3, constructing the acoustic model in an end-to-end manner based on a DCNN network model, with connectionist temporal classification (CTC) as the loss function;
the structure of the acoustic model comprises a plurality of convolutional layers, two fully connected layers and a CTC loss function, arranged in sequence;
the first and second convolutional layers use 32 convolution kernels each to extract voice features; the third and fourth convolutional layers use 64 kernels each; from the fifth layer onward, several consecutive convolutional layers with 128 kernels each extract higher-level voice features;
if the number of convolutional layers is even, every two consecutive convolutional layers, starting from the first, are followed by one pooling layer; if the number of convolutional layers is odd, every two consecutive convolutional layers are followed by one pooling layer and the last three consecutive convolutional layers are followed by a single pooling operation;
the plurality of convolutional layers comprises 8 or 9 layers in total.
2. The convolutional neural network-based speech recognition method of claim 1, wherein: the training formula of the acoustic model in step S4 is as follows:
H_i = W_i * X + b_i

wherein:

i = 1, ..., k,

H_i represents the i-th output feature map,

W_i represents the weight of the i-th feature map,

X represents the feature map input from the previous layer,

b_i represents the bias term of the i-th feature map.
3. The convolutional neural network-based speech recognition method of claim 1, wherein: the CTC loss function is:
L(μ, x) = -ln p(μ|x) = -ln Σ_{π ∈ B^-1(μ)} p(π|x)

wherein:

y_k^t = exp(a_k^t) / Σ_{k'} exp(a_{k'}^t)

is the softmax function applied to the network output a^t at frame t,

p(π|x) = ∏_{t=1}^{T} y_{π(t)}^t

represents the probability of the output path π given the input x,

p(μ|x) = Σ_{π ∈ B^-1(μ)} p(π|x)

represents the probability of the output label sequence μ, that is, the sum of the probabilities of all paths that map to it,

x represents the input,

X = x_1, x_2, ..., x_T represents the input sequence, the subscripts denoting frames 1 to T,

Y = y^1, y^2, ..., y^T represents the output corresponding to X,

y^i = y_1^i, y_2^i, ..., y_K^i represents the conditional probability distribution over the K output labels at the i-th frame of the output sequence,

π represents an output path,

μ represents the output label sequence,

a many-to-one relationship exists between π and μ,

and B represents the mapping from paths to label sequences.
4. The convolutional neural network-based speech recognition method of claim 1, wherein: the convolution kernels of the convolutional layers are all set to 3 × 3 in size.
5. The convolutional neural network-based speech recognition method of claim 1, wherein: the pooling layers perform a 2×2 maximum pooling operation with a stride of 2.
6. The convolutional neural network-based speech recognition method of claim 1, wherein: in step S2, a spectrogram is used as a voice feature; the frame length in the spectrogram is set to be 25ms, the frame shift is set to be 10ms, and the window function uses a Hamming window function.
CN201811112506.6A 2018-09-25 2018-09-25 Voice recognition method based on convolutional neural network Active CN109272990B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811112506.6A CN109272990B (en) 2018-09-25 2018-09-25 Voice recognition method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811112506.6A CN109272990B (en) 2018-09-25 2018-09-25 Voice recognition method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN109272990A CN109272990A (en) 2019-01-25
CN109272990B true CN109272990B (en) 2021-11-05

Family

ID=65197268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811112506.6A Active CN109272990B (en) 2018-09-25 2018-09-25 Voice recognition method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN109272990B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109861932B (en) * 2019-02-15 2021-09-03 中国人民解放军战略支援部队信息工程大学 Short wave Morse message automatic identification method based on intelligent image analysis
CN111951785B (en) * 2019-05-16 2024-03-15 武汉Tcl集团工业研究院有限公司 Voice recognition method and device and terminal equipment
CN110148408A (en) * 2019-05-29 2019-08-20 上海电力学院 A kind of Chinese speech recognition method based on depth residual error
CN110176228A (en) * 2019-05-29 2019-08-27 广州伟宏智能科技有限公司 A kind of small corpus audio recognition method and system
CN110197666B (en) * 2019-05-30 2022-05-10 广东工业大学 Voice recognition method and device based on neural network
CN112102817A (en) * 2019-06-18 2020-12-18 杭州中软安人网络通信股份有限公司 Speech recognition system
CN112133292A (en) * 2019-06-25 2020-12-25 南京航空航天大学 End-to-end automatic voice recognition method for civil aviation land-air communication field
CN110364184B (en) * 2019-07-15 2022-01-28 西安音乐学院 Intonation evaluation method based on deep convolutional neural network DCNN and CTC algorithm
CN110992941A (en) * 2019-10-22 2020-04-10 国网天津静海供电有限公司 Power grid dispatching voice recognition method and device based on spectrogram
CN110956201B (en) * 2019-11-07 2023-07-25 江南大学 Convolutional neural network-based image distortion type classification method
CN111009235A (en) * 2019-11-20 2020-04-14 武汉水象电子科技有限公司 Voice recognition method based on CLDNN + CTC acoustic model
CN110853629A (en) * 2019-11-21 2020-02-28 中科智云科技有限公司 Speech recognition digital method based on deep learning
CN110931046A (en) * 2019-11-29 2020-03-27 福州大学 Audio high-level semantic feature extraction method and system for overlapped sound event detection
CN110930985B (en) * 2019-12-05 2024-02-06 携程计算机技术(上海)有限公司 Telephone voice recognition model, method, system, equipment and medium
CN110930996B (en) * 2019-12-11 2023-10-31 广州市百果园信息技术有限公司 Model training method, voice recognition method, device, storage medium and equipment
CN111048082B (en) * 2019-12-12 2022-09-06 中国电子科技集团公司第二十八研究所 Improved end-to-end speech recognition method
CN110827801B (en) * 2020-01-09 2020-04-17 成都无糖信息技术有限公司 Automatic voice recognition method and system based on artificial intelligence
CN111243578A (en) * 2020-01-10 2020-06-05 中国科学院声学研究所 Chinese mandarin character-voice conversion method based on self-attention mechanism
CN111210807B (en) * 2020-02-21 2023-03-31 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111246026A (en) * 2020-03-11 2020-06-05 兰州飞天网景信息产业有限公司 Recording processing method based on convolutional neural network and connectivity time sequence classification
CN112068555A (en) * 2020-08-27 2020-12-11 江南大学 Voice control type mobile robot based on semantic SLAM method
CN111986661B (en) * 2020-08-28 2024-02-09 西安电子科技大学 Deep neural network voice recognition method based on voice enhancement in complex environment
CN112466297B (en) * 2020-11-19 2022-09-30 重庆兆光科技股份有限公司 Speech recognition method based on time domain convolution coding and decoding network
CN112786019A (en) * 2021-01-04 2021-05-11 中国人民解放军32050部队 System and method for realizing voice transcription through image recognition mode
CN113808581B (en) * 2021-08-17 2024-03-12 山东大学 Chinese voice recognition method based on acoustic and language model training and joint optimization

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10936862B2 (en) * 2016-11-14 2021-03-02 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Towards end-to-end speech recognition with deep convolutional neural networks; Ying Zhang et al.; INTERSPEECH 2016; 2016-09-12; see Sections 1-3, Fig. 1 and Fig. 3 *
Research progress and prospects of speech recognition technology; Wang Haikun et al.; Telecommunications Science; 2018-02-20 (No. 2); full text *

Also Published As

Publication number Publication date
CN109272990A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109272990B (en) Voice recognition method based on convolutional neural network
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN108597539B (en) Speech emotion recognition method based on parameter migration and spectrogram
US11030998B2 (en) Acoustic model training method, speech recognition method, apparatus, device and medium
CN109410914B (en) Method for identifying Jiangxi dialect speech and dialect point
WO2018227781A1 (en) Voice recognition method, apparatus, computer device, and storage medium
CN109065032B (en) External corpus speech recognition method based on deep convolutional neural network
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN110444208A (en) A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm
Muhammad et al. Speech recognition for English to Indonesian translator using hidden Markov model
CN103065629A (en) Speech recognition system of humanoid robot
CN111798840B (en) Voice keyword recognition method and device
CN111402928B (en) Attention-based speech emotion state evaluation method, device, medium and equipment
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN107093422A (en) A kind of audio recognition method and speech recognition system
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN114999460A (en) Lightweight Chinese speech recognition method combined with Transformer
Liu et al. Hierarchical component-attention based speaker turn embedding for emotion recognition
Mhiri et al. A low latency ASR-free end to end spoken language understanding system
CN116010874A (en) Emotion recognition method based on deep learning multi-mode deep scale emotion feature fusion
Reddy et al. Indian sign language generation from live audio or text for tamil
Liu et al. Keyword retrieving in continuous speech using connectionist temporal classification
Al-Rababah et al. Automatic detection technique for speech recognition based on neural networks inter-disciplinary
Hu et al. Speaker Recognition Based on 3DCNN-LSTM.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant