WO2015180368A1 - Variable factor decomposition method for semi-supervised speech features - Google Patents
- Publication number
- WO2015180368A1 (PCT/CN2014/088539)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
Disclosed is a variable factor decomposition method for semi-supervised speech features. The speech features are divided into four types: emotion-related features, gender-related features, age-related features, and features related to other factors such as noise and language. The method comprises: first, pre-processing the speech to obtain a spectrogram, feeding spectrogram blocks of different sizes into an unsupervised feature learning network (a sparse autoencoder, SAE), and pre-training to obtain convolution kernels of different sizes; then, convolving the entire spectrogram with these kernels of different sizes to obtain a set of feature maps, and applying max pooling to the feature maps; and finally, stacking the pooled features to form a local invariant feature y. This y serves as the input of a semi-supervised convolutional neural network and is decomposed into the four types of features by minimizing four different loss function terms. The invention addresses the low recognition accuracy caused by emotion, gender, and age features being mixed together in speech; the decomposed features can serve different speech-based recognition tasks separately, and the method can be extended to decompose more factors.
Description
The invention belongs to the field of speech recognition, and in particular relates to a method for decomposing speech features.
As computers reach into every corner of daily life, computing platforms of all kinds need more convenient input media, and speech is naturally one of the best choices for users. In general, speech carries many kinds of information: the speaker's identity, the spoken content, and the speaker's emotion, gender, and age. In recent years, maturing applications have driven the development of speech-based recognition technology for human emotion, gender, age, and spoken content. For example, a traditional call center usually connects a customer to a random agent and cannot provide personalized service according to the caller's emotion, gender, and age; this motivates judging a caller's emotion, gender, and age from the voice and providing more personalized voice services on that basis. However, in existing speech-based emotion, gender, and age recognition tasks, the features extracted by traditional methods are often entangled with emotion, gender, age, spoken content, language, and other factors that are hard to separate from one another, which leads to poor recognition performance.
In the paper by Dong Yu et al. entitled "Feature Learning in Deep Neural Networks—Studies on Speech Recognition Tasks", a deep neural network is used to learn a deep feature, but this feature may mix many factors such as emotion, gender, and age; if it is used for speech emotion recognition, the recognition rate may be affected by the other factors it contains. At present, no feature extraction method can separately extract the different types of features in a speech signal. To overcome this defect of the prior art, the present invention uses semi-supervised feature learning based on convolutional neural networks to decompose speech features into four categories: emotion-related features, gender-related features, age-related features, and other-factor-related features, which can be used separately for different speech-based recognition needs. With further extension, the invention can also be used to decompose more factors.
Summary of the invention
The object of the present invention is to provide a variable factor decomposition method for semi-supervised speech features, such that the decomposed features are not disturbed by factors irrelevant to the recognition task and more prominently reflect the differences between the target categories, thereby improving recognition accuracy.
To solve the above technical problem, the present invention first preprocesses the speech to obtain a spectrogram, then obtains local invariant features through unsupervised learning based on convolutional neural networks, and then adopts a semi-supervised learning method that, under the constraints of four loss functions (a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function), decomposes the local invariant features obtained by unsupervised learning into four categories: emotion-related features, gender-related features, age-related features, and other-factor-related features. These can be used for emotion recognition, gender recognition, and age recognition respectively, effectively improving recognition accuracy. The specific technical solution is as follows:
A variable factor decomposition method for semi-supervised speech features, characterized by comprising the following steps:
Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply principal component analysis (PCA) for dimensionality reduction and whitening, and extract spectrogram blocks of different sizes from the result.
Step 2, unsupervised local invariant feature learning: feed the spectrogram blocks of different sizes into the SAE for unsupervised feature learning and pre-train convolution kernels of different sizes; then convolve the entire spectrogram with each kernel size to obtain a set of feature maps, apply max pooling to the feature maps, and finally stack the pooled features to form a local invariant feature y.
Step 3, semi-supervised feature learning based on convolutional neural networks: use the local invariant feature y as the input of a semi-supervised learning algorithm and, through four different loss functions, decompose y into four types of features: emotion-related features, gender-related features, age-related features, and features related to other factors including noise and language. The loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function.
For the reconstruction error function, all four types of features participate in reconstructing the local invariant feature y, and the error is the mean squared error. For the discriminant loss function, class prediction is first performed on the labeled data, and the difference between the predicted labels and the true labels is taken as the value of the loss. The orthogonal loss function aims to make the four types of features mutually orthogonal, so that they represent different directions of the input local invariant feature y. The saliency loss function aims to learn features that reflect only the differences between the target categories and are more class-discriminative. The parameters of the four loss functions, including biases and weights, are obtained by minimizing the total loss, yielding the four types of features.
The invention has beneficial effects. The semi-supervised feature learning of the invention decomposes the local invariant features into four categories (emotion-related, gender-related, age-related, and other-factor-related features), so that different types of features serve different recognition needs and mutual interference between feature types is avoided. In particular, the semi-supervised loss function consists of a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function, so that the learned features better describe the differences between the target categories without interference from irrelevant factors. The invention solves the problem of low recognition rates caused by different speech features being mixed together and can effectively improve recognition accuracy.
Figure 1 is a flow chart of the speech feature decomposition.
Figure 2 is a flow chart of the unsupervised feature learning.
Figure 3 is a structure diagram of the semi-supervised speech feature decomposition.
Figure 1 shows the overall idea of the method. First, the speech is preprocessed to obtain a spectrogram; spectrogram blocks of different sizes are fed into the unsupervised feature learning network SAE, and convolution kernels of different sizes are pre-trained; convolution and pooling operations then form the local invariant feature y. As the input of a semi-supervised convolutional neural network, y is decomposed into four types of features by minimizing four different loss function terms.
The preprocessed speech signal is divided into spectrogram blocks of size li × hi, where i indexes the block sizes. The blocks of different sizes are fed into the unsupervised feature learning network SAE to pre-train convolution kernels of different sizes; the entire spectrogram is then convolved with each kernel size to obtain a set of feature maps, max pooling is applied to the feature maps, and the pooled features are finally stacked to form the local invariant feature y, as shown in Figure 2. As the input of a semi-supervised convolutional neural network, y is decomposed into four types of features through four different loss function terms. The semi-supervised loss function consists of a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function. The parameters of the four loss function terms are obtained by minimizing the total loss, decomposing y into four types of features that serve different recognition needs, as shown in Figure 3. All features participate in the reconstruction, while each type of feature is constrained by its corresponding discriminant loss function.
The invention first preprocesses the speech, obtains a set of local invariant features with an unsupervised learning algorithm based on convolutional neural networks, and then uses a semi-supervised learning algorithm based on convolutional neural networks to decompose the local invariant features into four types: emotion-related features, gender-related features, age-related features, and other-factor-related features. The specific steps are as follows:
Step 1: first convert the time-domain signal into a spectrogram, using a 20 ms window with 10 ms overlap; then apply PCA for dimensionality reduction and whitening, keeping 60 principal components, which finally produces a 60 × n spectrogram. A number of non-overlapping 60 × 15 segments are extracted from it, and from each 60 × 15 segment, spectrogram blocks of two sizes, 60 × 6 and 60 × 10, are extracted.
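Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the FFT-magnitude spectrogram, the 16 kHz sampling rate, and the block-extraction stride are assumptions, since the patent does not specify how the spectrogram values are computed.

```python
import numpy as np

def spectrogram_frames(signal, sr=16000, win_ms=20, hop_ms=10):
    """Split a 1-D signal into 20 ms frames with 10 ms hop and take
    magnitude FFT bins (illustrative stand-in for the spectrogram)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))       # (n_frames, win // 2 + 1)

def pca_whiten(spec, n_components=60, eps=1e-5):
    """Project frame vectors onto the top principal components and whiten,
    producing the 60 x n spectrogram."""
    x = spec - spec.mean(axis=0)
    cov = x.T @ x / len(x)
    vals, vecs = np.linalg.eigh(cov)                 # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_components]
    w = vecs[:, order] / np.sqrt(vals[order] + eps)  # whitening projection
    return (x @ w).T                                 # (60, n_frames)

def extract_blocks(spec60, seg_len=15, widths=(6, 10)):
    """Cut non-overlapping 60 x 15 segments, then all 60 x 6 / 60 x 10
    blocks inside each segment (stride 1 is an assumption)."""
    blocks = {w: [] for w in widths}
    for s in range(0, spec60.shape[1] - seg_len + 1, seg_len):
        seg = spec60[:, s:s + seg_len]
        for w in widths:
            for t in range(seg_len - w + 1):
                blocks[w].append(seg[:, t:t + w])
    return {w: np.stack(b) for w, b in blocks.items()}
```

One second of 16 kHz audio gives 99 frames, hence a 60 × 99 whitened spectrogram and six non-overlapping 60 × 15 segments.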
Step 2: feed the 60 × 6 and 60 × 10 spectrogram blocks into the SAE and learn, for each size, 120 convolution kernels as large as the input (60 × 6 and 60 × 10, respectively). Convolving the entire 60 × 15 spectrogram with these kernels yields 120 feature maps of size 1 × 10 and 120 of size 1 × 6; max pooling over every two frames then gives 120 features of size 1 × 5 and 120 of size 1 × 3. That is, the 60 × 6 kernels yield 600 features and the 60 × 10 kernels yield 360 features, and the total of 960 features serves as the semi-supervised input. The general steps of unsupervised feature learning are as follows.
The auto-encoder (AE) minimizes the total reconstruction loss:

JAE = Σx L(x, x′)  (1)

where x is an input spectrogram block (unlabeled here). h(x) is the encoding function, h(x) = s(ωx + α), where ω is the weight matrix and α is the bias; g is the decoding function, x′ = g(h(x)) = s(ωTh(x) + δ), where ωT is the transpose of ω and δ is the bias. L(x, x′) is the loss function, L(x, x′) = ||x − x′||2, the mean squared error.
The sparse autoencoder (SAE) adds a sparsity penalty term to the AE objective:

JSAE = JAE + λ Σj KL(ρ || ρ′j)  (2)
where KL(ρ || ρ′j) is the relative entropy between a Bernoulli random variable with mean ρ and one with mean ρ′j, and is used to control sparsity. ρ is the sparsity parameter, ρ′j is the average activation of hidden neuron j, n2 is the number of hidden nodes, m is the number of input nodes, and λ is a parameter controlling the sparsity term.
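The SAE objective can be illustrated with a small numeric sketch. The tied-weight decoder follows the AE definitions above; the exact weighting between the reconstruction and sparsity terms is an assumption.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def kl_bernoulli(rho, rho_hat):
    """Relative entropy between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sae_loss(X, W, alpha, delta, rho=0.05, lam=3.0):
    """Sparse autoencoder objective: mean squared reconstruction error
    plus lam * sum_j KL(rho || rho_hat_j) over hidden units."""
    H = sigmoid(X @ W.T + alpha)              # encode: h(x) = s(wx + a)
    Xr = sigmoid(H @ W + delta)               # decode: x' = s(w^T h(x) + d)
    recon = np.sum((X - Xr) ** 2, axis=1).mean()
    rho_hat = H.mean(axis=0)                  # average activation per hidden unit
    return recon + lam * kl_bernoulli(rho, rho_hat).sum()
```

With lam = 0 the loss reduces to the AE reconstruction error; a positive lam adds the sparsity penalty.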
Suppose there are n different input sizes, denoted li × hi (i = 1, 2, …, n). Minimizing the SAE objective for each size gives different convolution kernels (ωi, αi). Each kernel (ωi, αi) is then convolved over all li × hi blocks of the entire spectrogram:
fi(x) = s(conv(ωi, x) + αi)  (3)
The feature maps obtained by convolution are then divided into non-overlapping regions P = {p1, p2, …, pq}, and max pooling is applied to each region:

mi(pk) = max over t ∈ pk of fi(t)  (4)
For the i-th convolution kernel, the pooled features are stacked:

Fi(x) = [mi(p1), mi(p2), …, mi(pq)]  (5)
The pooled features of all convolution kernels are stacked to form the local invariant feature y:
y = F(x) = [F1(x), F2(x), …, Fn(x)]  (6)
The local invariant feature y serves as the input of the semi-supervised learning below.
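The convolution, pooling, and stacking of Step 2 can be sketched as follows for one 60 × 15 segment. The random kernels are stand-ins for the pre-trained SAE kernels; the sigmoid activation follows s(·) as defined above.

```python
import numpy as np

def conv_full_height(spec, kernels, bias):
    """Valid convolution of full-height kernels (60 x w) over a 60 x 15
    segment: each kernel yields a 1 x (15 - w + 1) feature map, eq. (3)."""
    k, h, w = kernels.shape
    t = spec.shape[1] - w + 1
    maps = np.empty((k, t))
    for j in range(t):
        maps[:, j] = np.tensordot(kernels, spec[:, j:j + w],
                                  axes=([1, 2], [0, 1]))
    return 1.0 / (1.0 + np.exp(-(maps + bias[:, None])))  # s(conv(w, x) + a)

def maxpool_2(maps):
    """Max pooling over non-overlapping pairs of frames, eqs. (4)-(5)."""
    k, t = maps.shape
    return maps[:, :t - t % 2].reshape(k, -1, 2).max(axis=2)

rng = np.random.default_rng(2)
seg = rng.standard_normal((60, 15))
y_parts = []
for w in (6, 10):                        # the two pre-trained kernel sizes
    kernels = rng.standard_normal((120, 60, w)) * 0.01
    bias = np.zeros(120)
    pooled = maxpool_2(conv_full_height(seg, kernels, bias))
    y_parts.append(pooled.reshape(-1))
y = np.concatenate(y_parts)              # local invariant feature, eq. (6)
```

The 60 × 6 kernels give 120 × 5 = 600 features and the 60 × 10 kernels give 120 × 3 = 360, reproducing the 960-dimensional y described above.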
Step 3: the above unsupervised learning yields the local invariant feature y. Semi-supervised learning based on convolutional neural networks (part of the input carries category labels) then decomposes y into four types of features: emotion-related, gender-related, age-related, and other-factor-related features. The loss function of the semi-supervised learning consists of four parts.
First, four encoding functions h(e)(y), h(s)(y), h(a)(y), h(o)(y) map y into four types of features, related to emotion, gender, age, and other factors respectively. The four encoding functions are as follows:
h(e)(y) = s(Ey + e)  (7)
h(s)(y) = s(Sy + s)  (8)
h(a)(y) = s(Ay + a)  (9)
h(o)(y) = s(Oy + o)  (10)
All four types of features participate in reconstructing y:
y′ = g([h(e)(y), h(s)(y), h(a)(y), h(o)(y)]) = s(ETh(e)(y) + STh(s)(y) + ATh(a)(y) + OTh(o)(y) + γ)  (11)
where γ is a compensation parameter that shifts the reconstruction toward the mean of y.
The reconstruction error function is therefore:
LRECON(y, y′) = ||y − y′||2  (12)
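The four encoding heads (7)-(10) and the joint reconstruction (11)-(12) can be sketched as follows; the 100-dimensional feature size per head is an arbitrary illustrative choice, not specified by the patent.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class FactorDecomposer:
    """Sketch of the four encoding heads (7)-(10) and the joint
    reconstruction (11); dimensions are illustrative."""

    def __init__(self, d_in=960, d_feat=100, seed=0):
        rng = np.random.default_rng(seed)
        # E, S, A, O: encoding weights; e, s, a, o: biases; gamma: offset
        self.W = {k: rng.standard_normal((d_feat, d_in)) * 0.01 for k in "esao"}
        self.b = {k: np.zeros(d_feat) for k in "esao"}
        self.gamma = np.zeros(d_in)

    def encode(self, y):
        """h(e), h(s), h(a), h(o): one sigmoid head per factor."""
        return {k: sigmoid(self.W[k] @ y + self.b[k]) for k in "esao"}

    def reconstruct(self, y):
        """y' = s(E^T h(e) + S^T h(s) + A^T h(a) + O^T h(o) + gamma)."""
        h = self.encode(y)
        pre = sum(self.W[k].T @ h[k] for k in "esao") + self.gamma
        return sigmoid(pre)

    def recon_error(self, y):
        """L_RECON(y, y') = ||y - y'||^2, eq. (12)."""
        return float(np.sum((y - self.reconstruct(y)) ** 2))
```

All four heads contribute to the reconstruction, mirroring the requirement that every feature type participates in reconstructing y.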
Next, the labeled data are used for category prediction. The input data are (x, z), where x is a spectrogram block and z = {z1, z2, z3} denotes the emotion label, gender label, and age label, respectively; z′ = {z′1, z′2, z′3} denotes the corresponding predicted labels. For example, equation (13) below predicts the j-th component z′1j of the emotion label by mapping h(e)(y) through U1j.
z′1j = s(U1jh(e)(y) + b1j)  (13)
z′2j = s(U2jh(s)(y) + b2j)  (14)
z′3j = s(U3jh(a)(y) + b3j)  (15)
The discriminant loss functions for the emotion, gender, and age labels are LDISCE(z1, z′1), LDISCS(z2, z′2), and LDISCA(z3, z′3), respectively, each measuring the difference between the predicted and true labels over that label's classes.
The total discriminant loss function is:

LDISC(z, z′) = LDISCE(z1, z′1) + LDISCS(z2, z′2) + LDISCA(z3, z′3)  (19)
where C1, C2, and C3 denote the number of categories of the emotion, gender, and age labels, respectively. Note that in this step the emotion-related features are constrained by the emotion discriminant penalty function, the gender-related features by the gender discriminant penalty function, and the age-related features by the age discriminant penalty function.
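The prediction heads (13)-(15) and the total discriminant loss (19) can be sketched as follows. Since the per-label loss formulas are not reproduced in this text, a squared-error form is assumed for each head; the class counts are illustrative.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def head_predict(h, U, b):
    """Per-class predictions z'_j = s(U_j h + b_j), as in (13)-(15)."""
    return sigmoid(U @ h + b)

def disc_loss(z_true, z_pred):
    """One label's discriminant loss over its classes. A squared-error
    form is ASSUMED here; the patent's exact formula is not shown."""
    return float(np.sum((z_true - z_pred) ** 2))

def total_disc_loss(labels, preds):
    """Total discriminant loss (19): sum over emotion, gender, age heads."""
    return sum(disc_loss(labels[k], preds[k])
               for k in ("emotion", "gender", "age"))
```

Only labeled examples contribute to this term, which is what makes the overall scheme semi-supervised.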
In addition, h(e)(y), h(s)(y), h(a)(y), and h(o)(y) should represent different directions of variation of y as far as possible; for example, to make h(e)(y) and h(s)(y) represent different directions, the corresponding encodings can be made as orthogonal as possible. This is enforced by the orthogonal loss function JORTH(y).
Finally, a saliency loss function can be used so that the four learned feature types better reflect the distinctions between the different target categories and are more stable. The saliency of each input i is measured by the summed saliency of the weights in φ(i), the weight set associated with input i, where ωk is the k-th weight and saliency is measured through the mean squared error (MSE). For the three feature types h(e)(y), h(s)(y), and h(a)(y), both the reconstruction error and the discriminant loss enter the saliency loss function; for h(o)(y), only the reconstruction error is considered. The resulting saliency loss function is denoted JSAL(y).
The reconstruction error function, discriminant loss function, orthogonal loss function, and saliency loss function together constitute the total loss function:
LLOSS(θ) = Σx∈D, y=F(x) LRECON(y, y′) + βJORTH(y) + Σ(x,z)∈S LDISC(z, z′) + ηJSAL(y)  (23)
where D is the entire data set (including both unlabeled and labeled data) and S is the labeled data set. β adjusts the contribution of the orthogonal loss function, β ∈ [0, 1]; η adjusts the contribution of the saliency loss function, η ∈ [0, 1]. The contribution weights β and η are set by a grid search with a step size of 0.1. The parameters are θ = {E, S, A, O, U, e, s, a, o, γ, b}.
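The grid search over the contribution weights β and η can be sketched as follows, with a stand-in loss in place of the full objective (23); the fixed component values are purely illustrative.

```python
import numpy as np
from itertools import product

def grid_search(loss_fn, step=0.1):
    """Grid search over beta, eta in [0, 1] with the patent's step of 0.1,
    returning the pair minimizing the given loss."""
    grid = np.round(np.arange(0.0, 1.0 + step / 2, step), 10)
    return min(product(grid, grid), key=lambda be: loss_fn(*be))

# illustrative stand-in for the total loss (23) with fixed component terms
recon, orth, disc, sal = 4.0, 2.0, 1.0, 3.0
loss = lambda beta, eta: recon + beta * orth + disc + eta * sal
beta, eta = grid_search(loss)
```

The grid has 11 × 11 = 121 candidate (β, η) pairs; with the stand-in loss above, which only grows with β and η, the search returns (0.0, 0.0).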
The parameter weights and biases of the four loss functions are obtained by minimizing the total loss function, thereby decomposing y into the four types of features.
Claims (1)
- A variable factor decomposition method for semi-supervised speech features, characterized by comprising the following steps:
  Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply principal component analysis (PCA) for dimensionality reduction and whitening, and extract spectrogram blocks of different sizes from the result;
  Step 2, unsupervised local invariant feature learning: feed the spectrogram blocks of different sizes into the SAE for unsupervised feature learning to pre-train convolution kernels of different sizes; then convolve the entire spectrogram with each kernel size to obtain a set of feature maps, apply max pooling to the feature maps, and finally stack the pooled features to form a local invariant feature y;
  Step 3, semi-supervised feature learning based on convolutional neural networks: use the local invariant feature y as the input of a semi-supervised learning algorithm and, through four different loss functions, decompose y into four types of features: emotion-related features, gender-related features, age-related features, and features related to other factors including noise and language; the loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function.
  For the reconstruction error function, all four types of features participate in reconstructing the local invariant feature y, and the error is the mean squared error; for the discriminant loss function, class prediction is first performed on the labeled data, and the difference between the predicted labels and the true labels is taken as the value of the loss; the orthogonal loss function aims to make the four types of features mutually orthogonal, representing different directions of the input local invariant feature y; the saliency loss function aims to learn features that reflect only the differences between the target categories and are more class-discriminative; the parameters of the four loss functions, including biases and weights, are obtained by minimizing the total loss, yielding the four types of features.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410229537.5A CN104021373B (en) | 2014-05-27 | 2014-05-27 | Semi-supervised speech feature variable factor decomposition method |
CN201410229537.5 | 2014-05-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015180368A1 true WO2015180368A1 (en) | 2015-12-03 |
Family
ID=51438118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/088539 WO2015180368A1 (en) | 2014-05-27 | 2014-10-14 | Variable factor decomposition method for semi-supervised speech features |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104021373B (en) |
WO (1) | WO2015180368A1 (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104021373B (en) * | 2014-05-27 | 2017-02-15 | Jiangsu University | Semi-supervised speech feature variable factor decomposition method
CN104408470B (en) * | 2014-12-01 | 2017-07-25 | ThunderSoft Co., Ltd. | Gender screening method based on average-face pre-learning
CN105989368A (en) * | 2015-02-13 | 2016-10-05 | Spreadtrum Communications (Tianjin) Co., Ltd. | Target detection method and apparatus, and mobile terminal
CN105070288B (en) * | 2015-07-02 | 2018-08-07 | Baidu Online Network Technology (Beijing) Co., Ltd. | In-vehicle voice command recognition method and apparatus
CN105321525B (en) * | 2015-09-30 | 2019-02-22 | Beijing University of Posts and Telecommunications | System and method for reducing VoIP communication resource overhead
CN105550679B (en) * | 2016-02-29 | 2019-02-15 | Shenzhen Qianhai Yongyida Robot Co., Ltd. | Judgment method for robot periodic monitoring recording
US10579860B2 (en) * | 2016-06-06 | 2020-03-03 | Samsung Electronics Co., Ltd. | Learning model for salient facial region detection
CN110089135A (en) * | 2016-10-19 | 2019-08-02 | Audible Reality Inc. | System and method for generating audio image
CN106847309A (en) * | 2017-01-09 | 2017-06-13 | South China University of Technology | Speech emotion recognition method
CN108461092B (en) * | 2018-03-07 | 2022-03-08 | Yanshan University | Method for analyzing Parkinson's disease voice
CN110148400B (en) * | 2018-07-18 | 2023-03-17 | Tencent Technology (Shenzhen) Co., Ltd. | Pronunciation type recognition method, model training method, device and equipment
US11606663B2 (en) | 2018-08-29 | 2023-03-14 | Audible Reality Inc. | System for and method of controlling a three-dimensional audio engine
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | Communication University of China | Speech emotion recognition method and system
CN110070895B (en) * | 2019-03-11 | 2021-06-22 | Jiangsu University | Mixed sound event detection method based on factor decomposition of supervised variational encoder
CN110705339A (en) * | 2019-04-15 | 2020-01-17 | China University of Petroleum (East China) | C-C3D-based sign language recognition method
CN110297928A (en) * | 2019-07-02 | 2019-10-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Expression picture recommendation method, apparatus, device and storage medium
CN111009262A (en) * | 2019-12-24 | 2020-04-14 | Ctrip Computer Technology (Shanghai) Co., Ltd. | Voice gender identification method and system
CN114037059A (en) * | 2021-11-05 | 2022-02-11 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Pre-training model, model generation method, data processing method and data processing device
CN115240649B (en) * | 2022-07-19 | 2023-04-18 | Yu Zhenhua | Voice recognition method and system based on deep learning
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1150852A (en) * | 1994-06-06 | 1997-05-28 | Motorola, Inc. | Speech-recognition system utilizing neural networks and method of using same
CN1151218A (en) * | 1994-06-03 | 1997-06-04 | Motorola, Inc. | Method of training neural networks used for speech recognition
CN1275746A (en) * | 1994-04-28 | 2000-12-06 | Motorola, Inc. | Equipment for converting text into audio signal by using neural network
CN1280697A (en) * | 1998-02-03 | 2001-01-17 | Siemens AG | Method for voice data transmission
CN1975856A (en) * | 2006-10-30 | 2007-06-06 | Zou Cairong | Speech emotion recognition method based on support vector machine
CN103400145A (en) * | 2013-07-19 | 2013-11-20 | Beijing Institute of Technology | Audio-visual fusion emotion recognition method based on hint neural networks
CN104021373A (en) * | 2014-05-27 | 2014-09-03 | Jiangsu University | Semi-supervised speech feature variable factor decomposition method
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8582807B2 (en) * | 2010-03-15 | 2013-11-12 | Nec Laboratories America, Inc. | Systems and methods for determining personal characteristics |
CN102222500A (en) * | 2011-05-11 | 2011-10-19 | Beihang University | Feature extraction and modeling method for Chinese speech emotion incorporating emotion points
2014
- 2014-05-27 CN CN201410229537.5A patent/CN104021373B/en active Active
- 2014-10-14 WO PCT/CN2014/088539 patent/WO2015180368A1/en active Application Filing
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11093818B2 (en) | 2016-04-11 | 2021-08-17 | International Business Machines Corporation | Customer profile learning based on semi-supervised recurrent neural network using partially labeled sequence data |
CN106803069A (en) * | 2016-12-29 | 2017-06-06 | Nanjing University of Posts and Telecommunications | Crowd happiness level recognition method based on deep learning
CN106803069B (en) * | 2016-12-29 | 2021-02-09 | Nanjing University of Posts and Telecommunications | Crowd happiness level recognition method based on deep learning
CN106919710A (en) * | 2017-03-13 | 2017-07-04 | Southeast University | Dialect classification method based on convolutional neural networks
CN108021910A (en) * | 2018-01-04 | 2018-05-11 | Qingdao Agricultural University | Analysis method for pseudocarps based on spectral recognition and deep learning
CN110503128A (en) * | 2018-05-18 | 2019-11-26 | Baidu USA LLC | Waveform synthesis from spectrograms using convolutional generative adversarial networks
CN108899075A (en) * | 2018-06-28 | 2018-11-27 | ZhongAn Information Technology Service Co., Ltd. | DSA image detection method, apparatus and device based on deep learning
CN109117943A (en) * | 2018-07-24 | 2019-01-01 | University of Science and Technology of China | Method for enhancing network representation learning by utilizing multi-attribute information
CN109117943B (en) * | 2018-07-24 | 2022-09-30 | University of Science and Technology of China | Method for enhancing network representation learning by utilizing multi-attribute information
CN109065021A (en) * | 2018-10-18 | 2018-12-21 | Jiangsu Normal University | End-to-end dialect identification method based on conditional deep convolutional generative adversarial network
CN109543727A (en) * | 2018-11-07 | 2019-03-29 | Fudan University | Semi-supervised anomaly detection method based on competitive reconstruction learning
CN109559736A (en) * | 2018-12-05 | 2019-04-02 | China Jiliang University | Automatic dubbing method for movie actors based on adversarial network
CN109559736B (en) * | 2018-12-05 | 2022-03-08 | China Jiliang University | Automatic dubbing method for movie actors based on adversarial network
CN110009025A (en) * | 2019-03-27 | 2019-07-12 | Henan University of Technology | Semi-supervised additive noise autoencoder for speech lie detection
CN110009025B (en) * | 2019-03-27 | 2023-03-24 | Henan University of Technology | Semi-supervised additive noise autoencoder for speech lie detection
CN110084850A (en) * | 2019-04-04 | 2019-08-02 | Southeast University | Dynamic scene visual localization method based on image semantic segmentation
CN110363139A (en) * | 2019-07-15 | 2019-10-22 | Shanghai Dianji Industrial Co., Ltd. | Digital signal processing method and system
CN110738168A (en) * | 2019-10-14 | 2020-01-31 | Chang'an University | Distributed strain microcrack detection system and method based on stacked convolutional autoencoder
CN111179941A (en) * | 2020-01-06 | 2020-05-19 | iFlytek Co., Ltd. | Smart device wake-up method, registration method and apparatus
CN111179941B (en) * | 2020-01-06 | 2022-10-04 | iFlytek Co., Ltd. | Smart device wake-up method, registration method and apparatus
CN111832650A (en) * | 2020-07-14 | 2020-10-27 | Xidian University | Semi-supervised image classification method based on generative adversarial network with local aggregation coding
CN111832650B (en) * | 2020-07-14 | 2023-08-01 | Xidian University | Semi-supervised image classification method based on generative adversarial network with local aggregation coding
CN112735478A (en) * | 2021-01-29 | 2021-04-30 | South China University of Technology | Speech emotion recognition method based on additive angular penalty focal loss
Also Published As
Publication number | Publication date |
---|---|
CN104021373B (en) | 2017-02-15 |
CN104021373A (en) | 2014-09-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2015180368A1 (en) | Variable factor decomposition method for semi-supervised speech features | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
Palo et al. | Wavelet based feature combination for recognition of emotions | |
Hsu et al. | Unsupervised learning of disentangled and interpretable representations from sequential data | |
Mao et al. | Learning salient features for speech emotion recognition using convolutional neural networks | |
Deng et al. | Recognizing emotions from whispered speech based on acoustic feature transfer learning | |
CN110136690A (en) | Speech synthesis method, device and computer readable storage medium | |
CN105047194A (en) | Self-learning spectrogram feature extraction method for speech emotion recognition | |
CN111461176A (en) | Multi-mode fusion method, device, medium and equipment based on normalized mutual information | |
Wei et al. | A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model | |
CN115359576A (en) | Multi-modal emotion recognition method and device, electronic equipment and storage medium | |
CN115393933A (en) | Video face emotion recognition method based on frame attention mechanism | |
Xia et al. | Audiovisual speech recognition: A review and forecast | |
Dua et al. | Optimizing integrated features for Hindi automatic speech recognition system | |
Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
Singkul et al. | Vector learning representation for generalized speech emotion recognition | |
JP2015175859A (en) | Pattern recognition device, pattern recognition method, and pattern recognition program | |
CN111462762B (en) | Speaker vector regularization method and device, electronic equipment and storage medium | |
Cetin | Accent recognition using a spectrogram image feature-based convolutional neural network | |
CN116434759B (en) | Speaker identification method based on SRS-CL network | |
CN110226201A (en) | Speech recognition using periodic indications | |
Tailor et al. | Deep learning approach for spoken digit recognition in Gujarati language | |
CN114626424A (en) | Data enhancement-based silent speech recognition method and device | |
CN114913871A (en) | Target object classification method, system, electronic device and storage medium | |
Mu et al. | Self-supervised disentangled representation learning for robust target speech extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14893660; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 14893660; Country of ref document: EP; Kind code of ref document: A1 |