Background
With the rapid development of computer technology, human dependence on computers continues to grow, and how to interact with computers more naturally has become a research hotspot. Speech, the most common and natural way of communicating in daily life, carries a large amount of information, such as a speaker's accent and emotional state. The ability of a computer to classify and recognize speech is an important component of computer speech processing, a key prerequisite for a natural human-computer interaction interface, and of great research and application value. Speech classification is an important research direction and plays an important role in speech recognition, speech content detection and other tasks. It is also the basis and premise for deeper processing of audio: for a given piece of audio, classification can first determine the acoustic environment and the speaker's gender, accent, emotion and the like, providing a basis for adapting the speech model. A reliable speech classification method is therefore crucial.
Speech classification covers a number of different tasks, such as speech emotion recognition, accent recognition, speaker recognition and speech environment discrimination. The main challenge of speech classification is the high dimensionality of speech. Conventional methods typically extract task- or database-specific audio features to reduce the dimensionality of the data fed into the classifier. However, such feature extraction requires substantial knowledge of speech signal processing, and because it effectively filters the signal it can discard useful information. Moreover, conventional classification algorithms such as support vector machines are often ill-suited to multi-class tasks. These are the difficulties our work needs to overcome.
Deep neural networks are currently one of the most important tools for processing big data, especially high-dimensional data. By constructing multi-layer nonlinear mapping functions and training the connection weights, a deep neural network can learn features of audio data and perform classification. Because of its feedback and learning mechanisms, it can adjust its parameters according to the output error. Deep neural networks have gradually spread to many subject areas and have been successfully applied to machine translation, speech recognition, object recognition and other fields.
Disclosure of Invention
To address the above shortcomings, the invention provides a speech classification method based on a deep neural network, which solves the problems in the prior art that existing methods target only a single task or rely on hand-crafted feature extraction, and that high-dimensional data are difficult to process.
In order to achieve this purpose, the invention adopts the following technical scheme:
A speech classification method based on a deep neural network, characterized by comprising the following steps:
S1: performing a short-time Fourier transform on the speech data to convert it into the corresponding spectrogram, and partitioning the complete spectrogram along the frequency axis to obtain a set of local frequency-domain information;
S2: establishing an algorithm model based on a convolutional neural network and an attention mechanism, taking the complete spectrogram and the local frequency-domain information respectively as inputs of the model for feature learning, and extracting local and global features with the convolutional neural network from the local and complete spectrogram information;
S3: fusing the global and local feature expressions with an attention mechanism to form the final feature expression, and inputting it into a softmax classifier to obtain the prediction of the class to which the speech belongs;
S4: training the network on labelled speech data with gradient descent and back-propagation, and saving the network parameters;
S5: predicting unlabelled speech with the trained model, the model outputting the class with the highest probability as the final prediction result.
Further, the spectrogram conversion and frequency-domain partitioning in S1 specifically include the following steps:
Perform a short-time Fourier transform on the original audio: divide the given original audio into M short segments, compute the short-time spectrum of each segment and take its modulus, finally obtaining the complete spectrogram expression S:
S(m, n) = |Σ_k x_m(k)·e^(−j2πnk/N)|,  k = 0, 1, …, N−1,  m = 1, …, M   (1)
where x_m denotes the m-th short audio segment and N is the length of each segment. Formula (1) shows that the spectrogram is a two-dimensional matrix whose two dimensions represent, respectively, the temporal order of the speech and the frequency axis from low to high frequency; the value at each point is the corresponding amplitude.
Partitioning the complete spectrogram along the frequency axis yields a set of local spectra together with the global spectrogram, i.e. a set of inputs based on different frequency-domain distributions: {s_1, s_2, …, s_n, S}.
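As an illustration of S1, the following is a minimal NumPy sketch that computes a magnitude spectrogram in the spirit of formula (1) and partitions it into frequency bands; the segment length n_fft, hop size, band count and function names are illustrative assumptions rather than part of the claimed method.

```python
import numpy as np

def spectrogram(audio, n_fft=512, hop=256):
    """Magnitude spectrogram as in formula (1): split the signal into short
    segments, apply a window and an FFT, and take the modulus."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop              # M short segments
    frames = np.stack([audio[m * hop : m * hop + n_fft] * window
                       for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T            # (freq bins, time)

def split_bands(S, n_bands=4):
    """Partition the complete spectrogram S along the frequency axis,
    giving the input set {s_1, ..., s_n, S}."""
    local_bands = np.array_split(S, n_bands, axis=0)
    return local_bands + [S]

# Usage with one second of dummy audio at an assumed 16 kHz sampling rate.
audio = np.random.randn(16000)
inputs = split_bands(spectrogram(audio))
print([x.shape for x in inputs])
```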
Further, the feature extraction with the convolutional neural network in S2 specifically includes the following steps:
For the several local inputs, a convolutional neural network is used to extract features of the different frequency bands, giving a set of local expressions:
a_n = f(w_n·s_n + b_n)   (2)
In the above formula, each local input s_n has its own convolution parameters w_n and b_n, and f is the activation function; the resulting set of local features is {a_1, a_2, …, a_n}.
For the complete global frequency-domain information, a convolutional neural network is likewise used to extract the global feature, with the calculation formula:
a = g(w·S + b)   (3)
where the global input S has a corresponding convolution weight w and bias b, g is the activation function used for the global branch, and a is the global feature extracted by the convolutional neural network.
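The following is a minimal PyTorch sketch of the feature extraction in formulas (2) and (3): each frequency band s_n and the complete spectrogram S pass through their own small convolutional branch. The module name ConvBranch, the layer sizes and the feature dimension are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """One convolutional branch: maps one spectrogram (band) to a feature vector,
    i.e. a_n = f(w_n * s_n + b_n) in formula (2) / a = g(w * S + b) in formula (3)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # collapse each map to one value
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, x):                     # x: (batch, 1, freq, time)
        return self.proj(self.conv(x).flatten(1))

# Separate parameters per band (w_n, b_n) and for the global input (w, b).
local_branches = nn.ModuleList([ConvBranch() for _ in range(4)])
global_branch = ConvBranch()
```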
Equations (2) and (3) mainly involve the convolution and pooling operations of the convolutional neural network. The convolution operation is:
a_ij = f(Σ_m Σ_n w_mn·s_(i+m)(j+n) + b),  m = 1, …, M; n = 1, …, N   (4)
where M and N define the size of the convolution kernel, m and n index the rows and columns within the kernel, f is the function applied to the output of the convolution kernel, a_ij is the feature value at row i, column j of the current layer, s_ij is the input of the current layer at row i, column j, w denotes the parameters of the convolution kernel, and b is the corresponding bias value.
The convolution operation in equation (4) plays an important role in the convolutional network. Because the weights are shared, the extracted features are largely insensitive to small deformations of the input: if the input changes slightly, the features produced by the network do not change much.
The pooling operation is:
p = σ(a)   (5)
where σ denotes the pooling function; the three most common choices are taking the maximum, the minimum or the average within the receptive field (the region covered by the kernel). a is the input of the pooling layer and p is the output after pooling.
The pooling operation in equation (5) greatly reduces the number of values passed to later layers, and hence the number of weights in the network, which helps prevent overfitting.
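To make equations (4) and (5) concrete, the following NumPy sketch implements a single-channel convolution and a max-pooling step by hand; the kernel size, pooling window and variable names are illustrative assumptions.

```python
import numpy as np

def conv2d(s, w, b, f=np.tanh):
    """Formula (4): a_ij = f(sum_m sum_n w_mn * s_(i+m)(j+n) + b)."""
    M, N = w.shape
    H, W = s.shape
    a = np.empty((H - M + 1, W - N + 1))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            a[i, j] = f(np.sum(w * s[i:i + M, j:j + N]) + b)
    return a

def max_pool(a, k=2):
    """Formula (5): p = sigma(a), with sigma = maximum within each k x k receptive field."""
    H, W = a.shape[0] // k * k, a.shape[1] // k * k
    return a[:H, :W].reshape(H // k, k, W // k, k).max(axis=(1, 3))

s = np.random.randn(8, 8)                    # a small input patch
a = conv2d(s, w=np.random.randn(3, 3), b=0.1)
p = max_pool(a)                              # shared weights + pooling shrink the output
print(a.shape, p.shape)                      # (6, 6) (3, 3)
```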
Further, the attention-based fusion of the local and global feature expressions in S3 specifically includes the following steps:
Guided by each local feature, an attention mechanism is applied to re-derive a new global feature expression. First, a "coefficient" is assigned to every component of the global information:
α_i = g(W_2·f(W_1·[a_n, p_i] + b_1) + b_2),  i = 1, …, m   (6)
In the above formula, p_i denotes one of the m components of the global feature a, and α_i is the coefficient of that component given the current local feature a_n, representing its degree of importance. The attention mechanism learns this coefficient through a two-layer mapping: the first layer uses the weight W_1, the bias b_1 and the activation function f to learn a mapping, and the second layer uses the weight W_2, the bias b_2 and the activation function g to learn a mapping on the result of the first layer.
Equation (6) expresses the essential operation of the attention mechanism: guided by the local feature a_n, a different coefficient α_i is assigned to each component p_i of the global feature a, indicating the importance of that component. The aim is that, through training, the network finds the most representative components.
The computed importance coefficients are then multiplied with the corresponding components to form a new piece of global information:
a′_n = (α_1·p_1, α_2·p_2, …, α_m·p_m)   (7)
In this way, applying the attention mechanism with each of the n local features yields n new pieces of global information, which are added element-wise to the initial global feature a to obtain the final feature expression:
A = a + a′_1 + a′_2 + … + a′_n   (8)
The final feature expression A is fed into a softmax classifier, and the class with the largest probability value is taken as the predicted class of the speech data.
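A minimal PyTorch sketch of the attention fusion described by formulas (6)–(8) and the softmax classifier follows. It assumes the global feature a is kept as m component vectors p_i (for example, the columns of the global convolutional feature map) and that the local features a_n come from the local branches; the two-layer scoring network, the dimensions and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention fusion of formulas (6)-(8): each component p_i of the global
    feature a is scored against a guiding local feature a_n, rescaled by its
    coefficient alpha_i, and the rescaled copies are added element-wise to a."""
    def __init__(self, dim=64, hidden=32, n_classes=10):
        super().__init__()
        self.score = nn.Sequential(              # two-layer mapping W_1/b_1, W_2/b_2
            nn.Linear(2 * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, a, local_feats):
        # a: (batch, m, dim) global feature with m components;
        # local_feats: list of (batch, dim) local features a_n
        A = a
        for a_n in local_feats:
            pair = torch.cat([a_n.unsqueeze(1).expand(-1, a.size(1), -1), a], dim=-1)
            alpha = torch.softmax(self.score(pair), dim=1)   # coefficients alpha_i, (batch, m, 1)
            A = A + alpha * a                                # a'_n added element-wise, formula (8)
        logits = self.classifier(A.mean(dim=1))              # pool components (assumed), then classify
        return torch.log_softmax(logits, dim=-1)             # softmax classifier output
```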
After adopting the above scheme, the invention has the following beneficial effects:
(1) Conventional speech classification methods adopt a different feature extraction algorithm for each individual problem, whereas the invention performs feature learning directly on the spectrogram through a deep neural network and can autonomously learn different audio features for different tasks.
(2) Training deep neural networks usually requires large amounts of data, yet the publicly available speech datasets are relatively small. Building on previous research on deep neural networks, the invention further provides an algorithm model that fuses a convolutional neural network with an attention mechanism, further improving the recognition rate on multiple tasks.
Taking two speech classification tasks, accent recognition and speaker recognition, as examples:
model (model)
|
Accuracy (%)
|
i-Vector
|
74.50
|
Convolutional network and attention model
|
79.32
|
VGG-11
|
54.40
|
ResNet-18
|
61.66
|
ResNet-34
|
58.47 |
Table 1 compares the model of the present invention with other methods on accent recognition, where i-Vector is a classical feature extraction algorithm and VGG and ResNet are representative convolutional neural network models.
Model                                     | Accuracy (%)
------------------------------------------|-------------
MFCC                                      | 91.00
Convolutional network and attention model | 98.04
VGG-11                                    | 75.21
ResNet-18                                 | 75.04
ResNet-34                                 | 66.05
Table 2 compares the model of the present invention with other methods on speaker recognition, where MFCC is a classical feature extraction algorithm and VGG and ResNet are representative convolutional neural network models.
The above experimental results show that:
1) On multiple speech classification problems, the features learned by the proposed model achieve better recognition results than conventional feature extraction algorithms.
2) Compared with other neural network methods, the proposed method further develops the application of the attention mechanism in convolutional neural networks, increases the robustness and generalization ability of the model, and improves the classification accuracy on multiple speech problems.
Detailed description of the preferred embodiments
The technical solution in the embodiments of the present invention will be described in detail below with reference to the drawings; the embodiments described herein are merely some embodiments of the present invention and do not represent all possible embodiments.
Referring to fig. 1, the core model of the deep-neural-network-based speech classification is a deep neural network composed of several convolution blocks that use an attention mechanism. It combines two parts. The first is a convolutional neural network, which uses multi-layer nonlinear functions to learn the mapping between input data and features; this deep learning component can automatically learn task-relevant features. The second is an attention mechanism, which assigns different weights to the pieces of local information and thereby produces a weighted expression of them. By combining the deep learning algorithm with the attention mechanism, the invention effectively improves the accuracy of speech classification.
The speech classification method based on the deep neural network comprises the following steps:
step S1: carrying out short-time Fourier transform on the original audio, and dividing the given original audio into M sections of short audio; calculating the short-time energy of each short audio segment and performing modulus extraction to finally obtain a complete spectrogram expression S, wherein the expression S of the spectrogram is as follows:
where N is expressed as the short audio length per segment size.
The complete spectrogram information is partitioned along the direction of frequency domain change, so that the method can obtainA set of local and global spectral information is collected, that is, a set of input data combinations based on different frequency domain distributions are obtained: { s1,s2,…,sn,S}。
The complete spectrogram and the frequency-domain-partitioned spectrogram are shown in fig. 2. The partitioned spectrogram is obtained by dividing the spectrogram into frequency intervals, giving the distribution information within the different frequency bands.
Step S2: establish an algorithm model based on a convolutional neural network and an attention mechanism, and use the complete spectrogram and the local frequency-domain information respectively as inputs of the model for feature learning. For the several local inputs, a convolutional neural network is used to extract features of the different frequency bands, giving a set of local expressions:
a_n = f(w_n·s_n + b_n)   (2)
In the above formula, each local input s_n has its own convolution parameters w_n and b_n, and f is the activation function; the resulting set of local features is {a_1, a_2, …, a_n}.
For the complete global frequency-domain information, a convolutional neural network is likewise used to extract the global feature, with the calculation formula:
a = g(w·S + b)   (3)
where the global input S has a corresponding convolution weight w and bias b, g is the activation function used for the global branch, and a is the global feature extracted by the convolutional neural network.
Step S3: on the basis of the local and global features obtained in step S2, apply an attention mechanism, guided by each local feature, to re-derive a new global feature expression. First, a "coefficient" is assigned to every component of the global information:
α_i = g(W_2·f(W_1·[a_n, p_i] + b_1) + b_2),  i = 1, …, m   (6)
In the above formula, p_i denotes one of the m components of the global feature a, and α_i is the coefficient of that component given the current local feature a_n, representing its degree of importance. The attention mechanism learns this coefficient through a two-layer mapping: the first layer uses the weight W_1, the bias b_1 and the activation function f to learn a mapping, and the second layer uses the weight W_2, the bias b_2 and the activation function g to learn a mapping on the result of the first layer.
The computed importance coefficients are then multiplied with the corresponding components to form a new piece of global information:
a′_n = (α_1·p_1, α_2·p_2, …, α_m·p_m)   (7)
In this way, applying the attention mechanism with each of the n local features yields n new pieces of global information, which are added element-wise to the initial global feature a to obtain the final feature expression:
A = a + a′_1 + a′_2 + … + a′_n   (8)
The final feature expression A is fed into a softmax classifier, and the class with the largest probability value is taken as the predicted class of the speech data.
Referring to fig. 3, the basic structure of an attention-based convolution block is shown: features are first extracted from the local and global information, the information is then re-fused using the attention mechanism, and the final feature expression A is obtained.
Step S4: use labelled speech data to train the network with gradient descent and back-propagation, and save the network parameters. When the model is first built, the parameters in the network are initialized randomly; the network parameters are then updated from the errors produced on the labelled speech data until the network becomes stable, and the best parameters are retained.
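A minimal sketch of the training procedure in step S4, under the assumption that the model exposes a standard PyTorch forward pass returning log-probabilities; the data loader, learning rate, epoch count and checkpoint file name are placeholders, not part of the invention.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=20, lr=1e-3, ckpt="speech_cls.pt"):
    """Gradient descent + back-propagation on labelled speech data (step S4)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()                       # model is assumed to return log-probabilities
    best = float("inf")
    for epoch in range(epochs):
        total = 0.0
        for spectrograms, labels in loader:        # batches of labelled spectrograms
            optimizer.zero_grad()
            loss = criterion(model(spectrograms), labels)
            loss.backward()                        # back-propagation
            optimizer.step()                       # gradient-descent update
            total += loss.item()
        if total < best:                           # keep the best parameters
            best = total
            torch.save(model.state_dict(), ckpt)
```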
Step S5: use the trained model and parameters to predict unlabelled speech; the model outputs the class with the highest probability as the final prediction result.
Referring to fig. 4, the complete flow of the invention from step S1 to step S5 is shown. If there is further audio to be identified, steps S1 to S5 are repeated, and the classifier finally outputs the class with the highest probability value as the prediction result.
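Correspondingly, a minimal prediction sketch for step S5; the model, checkpoint name and spectrogram preprocessing are assumed to match the training sketch above.

```python
import torch

def predict(model, spectrogram, ckpt="speech_cls.pt"):
    """Step S5: load the saved parameters and output the most probable class."""
    model.load_state_dict(torch.load(ckpt))
    model.eval()
    with torch.no_grad():
        log_probs = model(spectrogram.unsqueeze(0))   # add a batch dimension
    return int(log_probs.argmax(dim=-1))              # class with the highest probability
```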