CN107578775B - Multi-classification voice method based on deep neural network - Google Patents
- Publication number: CN107578775B (application CN201710801016.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- classification
- network
- neural network
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention discloses a multi-task speech classification method based on deep learning, relating to the technical field of speech processing and comprising the following steps: S1, perform time-frequency analysis on the speech data to obtain the corresponding spectrogram; S2, build a neural network model based on a convolutional neural network and a residual network, and feed the spectrogram to the network as input to extract features; S3, feed the extracted features into several different softmax classifiers to obtain an initialized model; S4, digitize the speech samples and their corresponding labels, and train the initialized model on this data set to obtain a trained network model; S5, use the trained model to predict unlabeled speech data, obtaining a probability value for each class, and select the class with the highest probability as the classification result. The invention addresses the low classification efficiency of existing audio classification methods, which process tasks independently and ignore the correlations among speech tasks.
Description
Technical Field
The invention relates to the technical field of sound signal processing, in particular to a voice multi-classification method based on a deep neural network.
Background
Sound provides us with a great deal of information about its source and the surrounding environment. The human auditory system can separate and recognize complex sounds, and it would be useful if a machine could perform similar functions (audio classification and recognition), such as speech recognition in noise. Audio classification is an important field of pattern recognition and has been applied successfully in many areas, such as professional education and entertainment. In recent years, different audio classification tasks, such as accent recognition, speaker recognition, and speech emotion recognition, have achieved considerable success.
However, most audio classification methods treat each task separately and ignore the correlations between tasks. For example, the accent recognition task and speaker recognition are typically treated as independent classification problems. In fact, for the same piece of speech data, once the speaker is identified, the speaker's accent is also determined. It is therefore desirable to exploit this relationship to improve the classification performance of both tasks simultaneously.
In recent years, deep learning has driven a surge in artificial intelligence; thanks to the strong abstraction capability of deep neural networks over data, neural network learning methods have been applied successfully in fields such as speech signal processing. In our work, convolutional neural networks are used to learn speech features, improving accuracy on multi-classification tasks.
A spectrogram is a detailed and accurate representation of speech containing time and frequency information. The general form of a spectrogram has three dimensions: time, frequency, and amplitude (encoded as color).
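As a concrete illustration of these three dimensions, the short sketch below computes a log-magnitude spectrogram with plain NumPy; the 16 kHz sample rate, 512-sample frames, and the synthetic 440 Hz tone are illustrative choices, not parameters taken from the patent.

```python
import numpy as np

# A 1-second synthetic tone at 16 kHz stands in for a real speech recording.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t)

frame, hop = 512, 256
n_frames = 1 + (len(signal) - frame) // hop
window = np.hanning(frame)

# Short-time Fourier transform: one magnitude spectrum per frame gives the
# three dimensions described above -- frequency (rows), time (columns),
# and amplitude (cell values).
spec = np.stack(
    [np.abs(np.fft.rfft(window * signal[i * hop : i * hop + frame]))
     for i in range(n_frames)],
    axis=1,
)
log_spec = 20 * np.log10(spec + 1e-10)   # log amplitude, as usually plotted
freqs = np.fft.rfftfreq(frame, d=1 / sr)  # frequency axis in Hz
```

The resulting `log_spec` array (frequency bins x time frames) is the kind of two-dimensional image that step S2 feeds to the convolutional network.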
Disclosure of Invention
The invention aims to solve the low classification efficiency of existing audio classification methods, which process tasks independently and ignore the correlations among speech tasks.
The technical scheme of the invention is as follows:
A multi-task speech classification method based on deep learning comprises the following steps:
S1: perform time-frequency analysis on the speech data to obtain the corresponding spectrogram.
S2: build a neural network model based on a convolutional neural network and a residual network, and feed the spectrogram to the network as input to extract features.
S3: feed the extracted features into several different softmax classifiers to obtain an initialized model.
S4: digitize the speech samples and their corresponding labels, and train the initialized model on this data set to obtain a trained network model.
S5: use the trained model to predict unlabeled speech data, obtaining per-class probability values, and select the class with the highest probability as the classification result.
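Steps S3 and S5 amount to attaching several independent softmax heads to one shared feature vector and taking the arg-max per task. A minimal NumPy sketch, in which the 200-dimensional feature, the 7 emotion classes, the 2 speech/song classes, and the random weights are all illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
shared = rng.standard_normal(200)          # 200-dim shared feature from S2

# Two task heads, e.g. 7 emotion classes and 2 speech/song classes.
W_emotion, b_emotion = rng.standard_normal((7, 200)), np.zeros(7)
W_style,   b_style   = rng.standard_normal((2, 200)), np.zeros(2)

# S3/S5: per-task class probabilities from separate softmax classifiers.
p_emotion = softmax(W_emotion @ shared + b_emotion)
p_style   = softmax(W_style @ shared + b_style)

# S5: pick the highest-probability class in each task.
pred_emotion = int(np.argmax(p_emotion))
pred_style   = int(np.argmax(p_style))
```

Because both heads read the same shared feature vector, training them jointly is what lets the correlated tasks reinforce each other.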
Further, in S2, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation. The convolution operation can be expressed as:

a^l_{i,j} = f( Σ_{m=1}^{M} Σ_{n=1}^{N} k^l_{m,n} · a^{l-1}_{i+m-1, j+n-1} + b^l )   (1)

where M and N define the size of the convolution kernel; i and j index the rows and columns that locate a pixel; f is the activation function; l ∈ (1, L) indexes the layers of the convolutional neural network; a^l_{i,j} denotes the feature at row i, column j of layer l; k^l_{m,n} denotes the parameters of the layer-l convolution kernel; and b^l is the corresponding bias.
the meaning of formula (1) is: the product of different parts of the input feature map and the convolution kernel obtains a new feature map under the action of the convolution kernel function, and the formula ensures that the feature extraction is independent of the position, namely the statistical property of one part of the input feature map is the same as that of other parts.
The pooling operation of the convolutional neural network can be represented by the following equation:

a^l = f(β^l · down(a^{l-1}) + b^l)   (2)

where a^{l-1} is the input feature map, down denotes the down-sampling mode, and β^l is the corresponding parameter. The meaning of equation (2) is that pooling the input feature map, i.e., aggregating features at different positions of the image, reduces the number of parameters in the network.
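Equation (2) can be sketched in a few lines, here with mean-pooling as the `down` operation (the patent does not fix a particular down-sampling mode, so that choice is an assumption):

```python
import numpy as np

def pool(a_prev, beta=1.0, b=0.0, f=np.tanh, size=2):
    """Mean-pool (the 'down' operation), then scale, bias, and activate, per eq. (2)."""
    H, W = a_prev.shape
    H2, W2 = H // size, W // size
    # Aggregate each non-overlapping size x size block into its mean.
    down = a_prev[:H2 * size, :W2 * size].reshape(H2, size, W2, size).mean(axis=(1, 3))
    return f(beta * down + b)

# A 4x4 map pooled 2x2 (identity activation): each output is a block mean.
p = pool(np.arange(16.0).reshape(4, 4), f=lambda v: v)
```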
The basic residual block of the residual network in S2 can be represented by the following formula:
y=F(x,W)+x (3)
where F denotes a two-layer convolutional network, W its parameters, x the input of the residual block, and y the output of the basic residual block.
Equation (3) means that an input x passes through two forward convolutional layers to produce an output F(x, W), which is then added to x through a shortcut connection to give the output y.
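A minimal numeric sketch of equation (3); plain matrix products stand in for the two convolutional layers so the shortcut addition stays shape-compatible. This is an illustration of the residual structure, not the patent's exact layer configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = F(x, W) + x, per equation (3): a two-layer transform plus shortcut."""
    h = relu(W1 @ x)   # first layer of F
    Fx = W2 @ h        # second layer of F (no activation before the add)
    return Fx + x      # shortcut connection adds the input back

x = np.array([1.0, -2.0, 3.0])
d = len(x)
# With zero weights F(x, W) vanishes and the block reduces to the identity,
# which is exactly the easy-to-learn default residual learning relies on.
y = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
```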
The formula of the basic architecture model used in S2 is represented as:
y=F1(x,W1)*F2(x,W2)+x (4)
where * denotes element-wise multiplication, F1 and F2 are two convolutional branches, x is the input of the basic structure, and W1, W2 are the parameters of the two branches.
The meaning of formula (4) is that an input x passes through the two convolutional branches to produce outputs F1(x, W1) and F2(x, W2); the two are multiplied element-wise and the result is added to x through a shortcut to give the output y.
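Equation (4) differs from the plain residual block only in that the two branch outputs are multiplied element-wise before the shortcut is added. A sketch follows; the activation choices (a tanh branch gated by a sigmoid branch) are assumptions borrowed from common gating designs, since the patent does not spell them out:

```python
import numpy as np

def gated_residual_block(x, W1, W2):
    """y = F1(x, W1) * F2(x, W2) + x, per equation (4)."""
    F1 = np.tanh(W1 @ x)                   # candidate features
    F2 = 1.0 / (1.0 + np.exp(-(W2 @ x)))   # gate values in (0, 1)
    return F1 * F2 + x                     # element-wise product + shortcut

x = np.array([0.5, -1.0])
# With zero weights the tanh branch vanishes, so the gate passes nothing
# and the block again reduces to the identity.
y = gated_residual_block(x, np.zeros((2, 2)), np.zeros((2, 2)))
```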
Specifically, step S4 (digitizing the speech samples and their corresponding labels, and training the initialized model on the data set to obtain the trained network model) comprises the following sub-steps:
S41: and analyzing the time domain and the frequency domain of each voice sample, extracting a spectrogram, and digitizing a plurality of marks corresponding to a plurality of tasks of the voice sample.
S42: on the basis of the initialized multi-task classification model obtained in step S3, the current speech classification task is learned to obtain a trained multi-task classification model.
S43: and the trained multi-task classification model is used for multi-task classification of the voice data, the probability value of each voice in each task is given, and the category with the larger probability value is selected as a classification result.
After the scheme is adopted, the invention has the beneficial effects that:
(1) Feature extraction is a key preprocessing operation on the speech data: the spectrogram is extracted from each utterance and, in the concrete implementation, converted by the neural network into 200-dimensional shared features.
(2) During classification, the neural network is expected to learn intrinsic characteristics of the speech so that every classification category is predicted correctly, which motivates the proposed network structure for obtaining a better speech representation. In particular, compared with models that also perform multi-class classification, such as the SVM and classical neural network structures, the proposed model performs better; and for single-task classification models, realizing the two tasks independently on the same model yields lower accuracy than the multi-task classification model.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification, and the auxiliary task is sentence and song classification.
Model | Accuracy
---|---
SVM | 48.01%
Single-task model | 56.33%
Multi-task model | 62.39%
Table 1 compares the accuracy of the single-task and multi-task models on the main task. SVM is a classic machine-learning classification method; the single-task model performs single-task classification with an emotion classification accuracy of 56.33%, while realizing the two tasks simultaneously on the multi-task model raises the emotion recognition accuracy by 6.06%.
Network architecture | Emotion recognition accuracy | Speech vs. song classification accuracy
---|---|---
Convolutional neural network | 53.73% | 92.24%
Residual network | 57.21% | 94.62%
Gate-based residual network | 62.39% | 93.13%
Table 2 compares the accuracy of multi-task models built on different neural network structures for speech emotion recognition on sentences and songs. The gate-based residual network is the model proposed by this patent.
The above experimental results prove that:
1) For models that also perform multi-class classification, such as the SVM and classical neural network structures, the proposed model performs better.
2) For single-task classification models, realizing the two tasks independently on the same model yields lower accuracy than the multi-task classification model.
(3) Compared with models based on other, non-neural-network methods, extracting speech features with a deep neural network initializes the multi-task classification model well, improves the model's robustness, and improves the recognition performance of each task. Since the audio signal itself may be affected by noise, the neural network's good generalization ability to noise is an advantage. In addition, single-task models such as audio emotion classifiers are very sensitive to new speakers, whereas multi-task classification is affected less, because speaker characteristics are learned as well.
Drawings
FIG. 1 is a diagram of a multitask model according to the present invention;
FIG. 2 is a spectrogram of speech containing the "angry" emotion;
FIG. 3 is a spectrogram of speech containing the "happy" emotion;
FIG. 4 is a diagram of the basic structure of the residual network of the present invention;
FIG. 5 is a diagram of the basic structure of the neural network in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the core of the deep-neural-network-based multi-task speech classification is a multi-task classification model, which is used to classify two types of tasks.
The multitask speech classification method based on deep learning comprises the following steps:
and S1, performing time-frequency analysis operation on the voice data to obtain a corresponding spectrogram.
S2: build a neural network model based on a convolutional neural network and a residual network, take the spectrogram as the network input, and extract features. In this step, shared features serving multiple tasks are extracted by constructing a two-task network structure. The multi-task setting of the invention targets two pairs of classification tasks: the first simultaneously distinguishes the emotion contained in the speech and whether the speech belongs to a song or a sentence; the second simultaneously distinguishes the speaker and the speaker's accent.
As shown in fig. 3, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation. The convolution operation can be expressed as:

a^l_{i,j} = f( Σ_{m=1}^{M} Σ_{n=1}^{N} k^l_{m,n} · a^{l-1}_{i+m-1, j+n-1} + b^l )   (1)

where M and N define the size of the convolution kernel; i and j index the rows and columns that locate a pixel; f is the activation function; l ∈ (1, L) indexes the layers of the convolutional neural network; a^l_{i,j} denotes the feature at row i, column j of layer l; k^l_{m,n} denotes the parameters of the layer-l convolution kernel; and b^l is the corresponding bias.

The meaning of formula (1) is: sliding the convolution kernel over different parts of the input feature map produces a new feature map under the activation, and the formula ensures that feature extraction is position-independent, i.e., the statistics of one part of the input feature map are the same as those of the other parts. The pooling operation of the convolutional neural network can be represented by the following equation:
a^l = f(β^l · down(a^{l-1}) + b^l)   (2)
where down denotes the down-sampling mode and β^l is the corresponding parameter.
The meaning of equation (2) is that pooling the input feature map, i.e., aggregating features at different positions of the image, reduces the number of parameters in the network.
As shown in fig. 4, the basic residual block of the residual network in S2 can be represented by the following formula:
y=F(x,W)+x (3)
where F denotes a two-layer convolutional network, W its parameters, x the input of the residual block, and y the output of the basic residual block.
Equation (3) means that an input x passes through two forward convolutional layers to produce an output F(x, W), which is then added to x through a shortcut connection to give the output y.
As shown in fig. 5, the formula of the basic architecture model of the deep neural network used in S2 is expressed as:
y=F1(x,W1)*F2(x,W2)+x (4)
where * denotes element-wise multiplication, F1 and F2 are two convolutional branches, x is the input of the basic structure, and W1, W2 are the parameters of the two branches.
The meaning of formula (4) is that an input x passes through the two convolutional branches to produce outputs F1(x, W1) and F2(x, W2); the two are multiplied element-wise and the result is added to x through a shortcut to give the output y.
S3: feed the extracted features into several different softmax classifiers to obtain an initialized model.
S4: digitize the speech samples and their corresponding labels, and train the initialized model on the data set to obtain the trained network model. S4 comprises the following sub-steps:
s41: perform time-domain and frequency-domain analysis on each speech sample, extract its spectrogram, and digitize the multiple labels corresponding to the sample's multiple tasks;
s42: learn the current speech classification tasks on the basis of the initialized multi-task classification model obtained in step S3 to obtain a trained multi-task classification model;
s43: use the trained multi-task classification model for multi-task classification of speech data: the model gives a probability value for each class of each task for every utterance, and the class with the largest probability is selected as the classification result.
S5: use the trained model to predict unlabeled speech data, obtain per-class probability values, and select the class with the highest probability as the classification result.
Figs. 2 and 3 show spectrograms containing the two emotions "angry" and "happy"; the difference in spectrogram amplitude between them is obvious in the 10 kHz to 15 kHz range.
Fig. 4 and 5 show a neural network method proposed by the present invention, which specifically includes:
(1) The basic structure of the two models in figs. 4 and 5 is a convolutional neural network, which comprises two operations. One is the convolution operation, which can be expressed as:

a^l_{p,q} = f( Σ_{m=1}^{M} Σ_{n=1}^{N} k^l_{m,n} · a^{l-1}_{p+m-1, q+n-1} + b^l )

where M and N define the size of the convolution kernel, p and q index the rows and columns that locate a pixel, f is the activation function, l ∈ (1, L) indexes the layers of the convolutional neural network, k denotes the parameters of the convolution kernel, and b is the corresponding bias.

The other operation is pooling, which can be expressed as:

a^l = f(β^l · down(a^{l-1}) + b^l)

where down denotes the down-sampling operation and β is the corresponding parameter.
(2) Fig. 4 shows a basic residual block of a residual network, which can also be expressed by the following formula:
y=F(x,W)+x
where F is the convolutional layer function, x is the input to a residual block, and W is the parameter.
(3) Fig. 5 shows the basic architecture of our neural network, which can also be expressed by the following formula:
y=F1(x,W1)*F2(x,W2)+x
where * denotes element-wise multiplication, F1 and F2 are two convolutional branches, x is the input to this basic structure, and W1, W2 are the parameters of the two convolutional layers.
The existing audio classification problem mainly addresses a single sample with a single label, i.e., the trained model performs only one classification task. For example, speech emotion classification as a single task only determines which emotion an audio clip expresses. However, different speakers understand emotions differently, so different speakers express the same emotion differently. Multi-task classification realizes several different tasks simultaneously; for example, this work completes the speech emotion classification task together with the speaker classification task. That is, given a trained model and an input utterance, two results are obtained: the person who spoke the utterance and the emotion it contains. In other words, emotional characteristics and speaker characteristics are learned simultaneously when the model is trained.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification, and the auxiliary task is sentence and song classification.
Model | Accuracy
---|---
SVM | 48.01%
Single-task model | 56.33%
Multi-task model | 62.39%
Table 1 compares the accuracy of the single-task and multi-task models on the main task. SVM is a classic machine-learning classification method; the single-task model performs single-task classification with an emotion classification accuracy of 56.33%, while realizing the two tasks simultaneously on the multi-task model raises the emotion recognition accuracy by 6.06%.
Network architecture | Emotion recognition accuracy | Speech vs. song classification accuracy
---|---|---
Convolutional neural network | 53.73% | 92.24%
Residual network | 57.21% | 94.62%
Gate-based residual network | 62.39% | 93.13%
Table 2 compares the accuracy of multi-task models built on different neural network structures for speech emotion recognition on sentences and songs. The gate-based residual network is the model proposed by this patent.
The above experimental results prove that:
(1) For models that also perform multi-class classification, such as the SVM and classical neural network structures, the proposed model performs better.
(2) For single-task classification models, realizing the two tasks independently on the same model yields lower accuracy than the multi-task classification model.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (2)
1. A multitask speech classification method based on deep learning is characterized in that: the method comprises the following steps:
s1: performing time-frequency analysis operation on the voice data to obtain a corresponding spectrogram;
s2: establishing a neural network model based on a convolutional neural network and a residual error network, taking a spectrogram as network input, and extracting characteristics;
in S2, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation, the convolution operation being expressed as:

a^l_{i,j} = f( Σ_{m=1}^{M} Σ_{n=1}^{N} k^l_{m,n} · a^{l-1}_{i+m-1, j+n-1} + b^l )   (1)

where M and N define the size of the convolution kernel; i and j index the rows and columns that locate a pixel; f is the activation function; l ∈ (1, L) indexes the layers of the convolutional neural network; a^l_{i,j} denotes the feature at row i, column j of layer l; k^l_{m,n} denotes the parameters of the layer-l convolution kernel; and b^l is the bias function of layer l;
the pooling operation of the convolutional neural network can be represented by the following equation:
a^l = f(β^l · down(a^{l-1}) + b^l)   (2)

where a^{l-1} is the input of the l-th layer, f is the pooling-layer function, down denotes the down-sampling mode, and β^l is the corresponding parameter;
the basic residual block of the residual network in S2 can be represented by the following formula:
y=F(x,W)+x (3)
wherein F represents a two-layer convolutional network, W is a parameter of the convolutional network, x is an input of a residual block, and y represents a basic residual block output;
the formula of the basic architecture model used in S2 is represented as:
y=F1(x,W1)*F2(x,W2)+x (4)
where * denotes element-wise multiplication, F1 and F2 are two convolutional layers, x is the input of the basic structure, W1 and W2 are the parameters of the two convolutional layers, and y denotes the output;
s3: inputting the extracted features into a plurality of different softmax classifiers, thereby obtaining an initialized model;
s4: digitizing the voice sample and the corresponding marks, and training an initialized model by using the data set to obtain a trained network model;
s5: and predicting the unmarked voice data by the trained model to obtain a classified probability value, and selecting the class with a higher probability value as a classification result.
2. The deep-learning-based multi-task speech classification method of claim 1, characterized in that step S4 (digitizing the speech samples and the corresponding labels, and training the initialized model on the data set to obtain the trained network model) comprises the following steps:
s41: performing time domain and frequency domain analysis on each voice sample, extracting a spectrogram, and digitizing a plurality of marks corresponding to a plurality of tasks of the voice sample;
s42: learning a current speech classification task on the basis of the initialized multi-task classification model obtained in the step S3 to obtain a trained multi-task classification model;
s43: and the trained multi-task classification model is used for multi-task classification of the voice data, the probability value of each voice in each task is given, and the category with the larger probability value is selected as a classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710801016.6A CN107578775B (en) | 2017-09-07 | 2017-09-07 | Multi-classification voice method based on deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107578775A CN107578775A (en) | 2018-01-12 |
CN107578775B true CN107578775B (en) | 2021-02-12 |
Family
ID=61031600
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1300831A1 (en) * | 2001-10-05 | 2003-04-09 | Sony International (Europe) GmbH | Method for detecting emotions involving subspace specialists |
CN106847309A (en) * | 2017-01-09 | 2017-06-13 | 华南理工大学 | A kind of speech-emotion recognition method |
CN106875007A (en) * | 2017-01-25 | 2017-06-20 | 上海交通大学 | End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection |
CN106952649A (en) * | 2017-05-14 | 2017-07-14 | 北京工业大学 | Method for distinguishing speek person based on convolutional neural networks and spectrogram |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10127927B2 (en) * | 2014-07-28 | 2018-11-13 | Sony Interactive Entertainment Inc. | Emotional speech processing |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210212; Termination date: 20210907 |