CN104021373A - Semi-supervised speech feature variable factor decomposition method - Google Patents

Semi-supervised speech feature variable factor decomposition method

Info

Publication number
CN104021373A
CN104021373A (application CN201410229537.5A)
Authority
CN
China
Prior art keywords
feature
loss function
semi-supervised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410229537.5A
Other languages
Chinese (zh)
Other versions
CN104021373B (en)
Inventor
毛启容
黄正伟
薛文韬
于永斌
詹永照
苟建平
邢玉萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201410229537.5A priority Critical patent/CN104021373B/en
Publication of CN104021373A publication Critical patent/CN104021373A/en
Priority to PCT/CN2014/088539 priority patent/WO2015180368A1/en
Application granted granted Critical
Publication of CN104021373B publication Critical patent/CN104021373B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semi-supervised speech feature variable factor decomposition method. Speech features are divided into four types: emotion-related features, gender-related features, age-related features, and features related to other factors such as noise and language. First, the speech is preprocessed to obtain a spectrogram; spectrogram patches of different sizes are input to an unsupervised feature-learning network (SAE), and pre-training yields convolution kernels of different sizes. These kernels are then used to convolve the whole spectrogram, producing a number of feature maps; max pooling is applied to the feature maps, and the pooled features are finally stacked to form a local invariant feature y. y serves as the input of a semi-supervised convolutional neural network, which decomposes y into four feature types by minimizing four different loss-function terms. The method solves the problem of low recognition accuracy caused by the mixing of emotion, gender, age, and other factors in speech features; it can serve different speech-signal-based recognition needs and can also be extended to decompose more factors.

Description

Semi-supervised speech feature variable factor decomposition method
Technical field
The invention belongs to the field of speech recognition, and specifically relates to a method for decomposing speech features.
Background technology
As computers penetrate every corner of life, all kinds of computing platforms need more convenient input means, and speech has naturally become one of the best choices for users. In general, speech carries a great deal of information, such as the speaker's identity, the spoken content, and the speaker's emotion, gender, and age. In recent years, the steady improvement of various applications has driven the development of speech-based recognition technologies for emotion, gender, age, spoken content, and other attributes. For example, a traditional call center usually assigns a random agent to provide telephone consultation and cannot offer personalized service according to the user's emotion, gender, and age; this has motivated systems that judge a customer's emotion, gender, and age from the voice and, on that basis, provide more personalized voice services. However, in existing speech-based emotion, gender, and age recognition tasks, the features extracted by traditional methods often mix factors such as emotion, gender, age, spoken content, and language, which are difficult to separate from one another, so the recognition performance is poor.
In the paper by Dong Yu et al. entitled "Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks", a deep neural network is used to learn deep features, but such a feature may mix several factors, such as emotion, gender, and age; if this feature is used for speech emotion recognition, the recognition rate may be affected by the other factors it contains. So far, no feature extraction method has been able to separately extract the different types of features in a speech signal. To overcome this defect of the prior art, the present invention uses semi-supervised feature learning based on convolutional neural networks to decompose speech features into four types: emotion-related features, gender-related features, age-related features, and other-factor-related features, which can serve different speech-based recognition tasks. After further extension, the invention can also decompose more factors.
Summary of the invention
The object of the present invention is to provide a semi-supervised speech feature variable factor decomposition method, so that the decomposed features are not disturbed by factors irrelevant to the recognition task and more clearly reflect the differences between the classes of the recognition target, thereby improving recognition accuracy.
To solve the above technical problem, the present invention first preprocesses the speech to obtain a spectrogram, then obtains local invariant features through unsupervised learning based on convolutional neural networks, and then adopts a semi-supervised learning method in which four loss functions (a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function) constrain the local invariant features obtained by the unsupervised learning and decompose them into four types: emotion-related features, gender-related features, age-related features, and other-factor-related features, which can be used respectively for emotion recognition, gender recognition, and age recognition and can effectively improve recognition accuracy. The concrete technical scheme is as follows:
A semi-supervised speech feature variable factor decomposition method, characterized by comprising the following steps:
Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply principal component analysis (PCA) for dimensionality reduction and whitening, and extract spectrogram patches of different sizes from the result;
Step 2, unsupervised local invariant feature learning: take said spectrogram patches as the input of an unsupervised feature-learning sparse autoencoder (SAE); by inputting patches of different sizes, pre-training yields convolution kernels of different sizes; convolve the whole spectrogram with each of said kernels to obtain a number of feature maps, then apply max pooling to said feature maps, and finally stack the pooled features to form a local invariant feature y;
Step 3, semi-supervised feature learning based on convolutional neural networks: take said local invariant feature y as the input of a semi-supervised learning algorithm and, using the method of semi-supervised learning based on convolutional neural networks, decompose y into four feature types through four different loss functions; said four feature types comprise emotion-related features, gender-related features, age-related features, and other-factor-related features covering noise and language; the loss function of said semi-supervised learning consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function;
For said reconstruction error function, all four feature types participate in reconstructing the local invariant feature y, and the error is the squared error; for said discrimination loss function, class prediction is first performed on the labeled data, and the difference between the predicted labels and the true labels is taken as the value of the discrimination loss; for said orthogonality loss function, the aim is to make the four feature types mutually orthogonal so that they represent different directions of the input local invariant feature y; for said saliency loss function, the aim is to learn features that reflect only the differences between the classes of the recognition target and are more class-discriminative; the parameters of the four loss functions, comprising biases and weights, are obtained by minimizing said loss function, thereby obtaining said four feature types.
The present invention has beneficial effects. The semi-supervised feature learning of the invention decomposes the local invariant feature into four feature types, namely emotion-related, gender-related, age-related, and other-factor-related features, so that different feature types serve different recognition needs and the mutual interference between feature types is avoided. In particular, the loss function of the semi-supervised learning consists of a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function, so that the learned features better describe the differences between the classes of the recognition target and are not disturbed by irrelevant factors. The invention thus solves the problem of low recognition rate caused by mixing different speech features together and can effectively improve recognition accuracy.
Brief description of the drawings
Fig. 1 is the flowchart of speech feature decomposition.
Fig. 2 is the flowchart of unsupervised feature learning.
Fig. 3 is the diagram of semi-supervised speech feature decomposition.
Embodiment
Fig. 1 shows the general idea of the method of the invention. First, the speech is preprocessed to obtain a spectrogram; spectrogram patches of different sizes are input to the unsupervised feature-learning network SAE, and pre-training yields convolution kernels of different sizes; then, through convolution and pooling operations, the local invariant feature y is formed. y serves as the input of the semi-supervised convolutional neural network, which decomposes y into four feature types by minimizing four different loss-function terms.
The preprocessed speech signal is divided into spectrogram patches of size $l_i \times h_i$, where i indexes the patch sizes. The patches of different sizes are input to the unsupervised feature-learning network SAE, and pre-training yields convolution kernels of different sizes; the whole spectrogram is then convolved with each of these kernels to obtain a number of feature maps, max pooling is applied to the feature maps, and the pooled features are finally stacked to form the local invariant feature y, as shown in Fig. 2. y serves as the input of the semi-supervised convolutional neural network, which decomposes y into four feature types through four different loss-function terms. The semi-supervised loss function consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function. The parameters of the four loss-function terms are obtained by minimizing the loss function, and the decomposition yields four feature types serving different recognition needs, as shown in Fig. 3. All features participate in the reconstruction, and each feature type is constrained by its corresponding discrimination loss.
The present invention first preprocesses the speech and uses the unsupervised learning algorithm based on convolutional neural networks to obtain a set of local invariant features; it then uses the semi-supervised learning algorithm based on convolutional neural networks to decompose the local invariant features into four feature types: emotion-related, gender-related, age-related, and other-factor-related features. The concrete steps are as follows:
Step 1: the time-domain signal is first converted into a spectrogram, with a window size of 20 ms and an overlap of 10 ms; then PCA is used for dimensionality reduction and whitening, keeping 60 principal components, which finally produces a 60 × n spectrogram. Several non-overlapping 60 × 15 segments are extracted from it. From each 60 × 15 segment, patches of two sizes, 60 × 6 and 60 × 10, are extracted.
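As a concrete illustration of Step 1, the following Python sketch builds the 60 × n whitened spectrogram and cuts it into 60 × 15 segments. The STFT settings beyond the stated 20 ms window and 10 ms overlap, the log-magnitude step, and the function name `preprocess` are assumptions for illustration, not part of the patent.

```python
import numpy as np
from scipy.signal import stft

def preprocess(signal, sr=16000, n_keep=60):
    """Sketch of Step 1: 20 ms windows, 10 ms overlap, then PCA to 60
    whitened components, producing a 60 x n 'spectrogram' that is cut
    into non-overlapping 60 x 15 segments."""
    nper = int(0.020 * sr)                       # 20 ms window
    nover = int(0.010 * sr)                      # 10 ms overlap
    _, _, Z = stft(signal, fs=sr, nperseg=nper, noverlap=nover)
    S = np.log(np.abs(Z) + 1e-10).T              # frames x freq bins (assumed log-magnitude)
    S -= S.mean(axis=0)                          # center before PCA
    U, _, _ = np.linalg.svd(S, full_matrices=False)
    white = U[:, :n_keep] * np.sqrt(S.shape[0])  # 60 whitened components per frame
    spectrogram = white.T                        # 60 x n
    # non-overlapping 60 x 15 segments; 60x6 and 60x10 patches for SAE
    # training are later cut from each segment
    return [spectrogram[:, t:t + 15]
            for t in range(0, spectrogram.shape[1] - 14, 15)]
```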
Step 2: the 60 × 6 and 60 × 10 patches are input to the SAE separately, and learning yields 120 convolution kernels of size 60 × 6 and 120 of size 60 × 10, the same sizes as their inputs. These two sets of kernels are then convolved over the whole 60 × 15 segment, producing 120 feature maps of size 1 × 10 and 120 of size 1 × 6; max pooling is then applied every two frames, giving 120 features of size 1 × 5 and 120 of size 1 × 3. The 60 × 6 kernels thus yield 600 features and the 60 × 10 kernels yield 360 features; these 960 features in total serve as the semi-supervised input. The general unsupervised feature-learning procedure is introduced next.
The objective function of the autoencoder (AE, Auto-Encoder) is as follows:
$J_{AE}(\theta) = \sum_{x \in D} L(x, g(h(x)))$  (1)
where x is an input spectrogram patch, unlabeled here. $h(x) = s(\omega x + \alpha)$ is the encoding function, where ω is the weight matrix and α is the bias; g is the decoding function, $x' = g(h(x)) = s(\omega^{T} h(x) + \delta)$, where $\omega^{T}$ is the transpose of ω and δ is a bias. $L(x, x') = \|x - x'\|^{2}$ is the loss function, the squared error.
The sparse autoencoder (SAE) adds a sparsity penalty to the objective function of the AE. The objective function of the SAE is as follows:
$J_{SAE}(\theta) = \sum_{x \in D} L(x, g(h(x))) + \lambda \sum_{j=1}^{n_2} KL(\rho \| \rho'_j)$  (2)
where $KL(\rho \| \rho'_j)$ is the relative entropy between two Bernoulli random variables with means ρ and $\rho'_j$, used to control sparsity; ρ is the sparsity parameter; $\rho'_j$ is the average activation of hidden neuron j; $n_2$ is the number of hidden nodes and m is the number of input nodes; λ is the parameter controlling the strength of the sparsity penalty.
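A minimal NumPy sketch of the SAE objective (2), assuming sigmoid activations and the tied-weight decoder of equation (1); the function name `sae_loss` and the default values of ρ and λ are illustrative only.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sae_loss(X, W, a, d, rho=0.05, lam=3.0):
    """J_SAE of eq. (2): squared reconstruction error plus a KL
    sparsity penalty on the average hidden activations.

    X : (m, d_in) batch of flattened spectrogram patches
    W : (n2, d_in) encoder weights (decoder reuses W.T, as in eq. (1))
    a : (n2,) encoder bias;  d : (d_in,) decoder bias
    """
    H = sigmoid(X @ W.T + a)                           # h(x) = s(Wx + a)
    Xr = sigmoid(H @ W + d)                            # g(h(x)) = s(W^T h(x) + d)
    recon = np.sum((X - Xr) ** 2)                      # sum of L(x, g(h(x)))
    rho_j = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)    # mean activation of unit j
    kl = np.sum(rho * np.log(rho / rho_j) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_j)))
    return recon + lam * kl                            # eq. (2)
```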
Suppose there are n different input sizes, denoted $l_i \times h_i$ (i = 1, 2, ..., n). Minimizing $J_{SAE}(\theta)$ yields the corresponding convolution kernels $(\omega_i, \alpha_i)$. Each kernel $(\omega_i, \alpha_i)$ is convolved with all $l_i \times h_i$ patches of the whole spectrogram:
$f^{i}(x) = s(\mathrm{conv}(\omega_i, x) + \alpha_i)$  (3)
The feature maps obtained by the convolution are then divided into non-overlapping regions $P = \{p_1, p_2, \ldots, p_q\}$, and max pooling is performed over each region:
$F_j^{i}(x) = \max_{k \in p_j} f_k^{i}(x), \quad j \in [1, q]$  (4)
For the i-th convolution kernel, the pooled features are stacked:
$F^{i}(x) = [F_1^{i}(x), F_2^{i}(x), \ldots, F_q^{i}(x)]$  (5)
The pooled features of all convolution kernels are stacked to form the local invariant feature y:
$y = F(x) = [F^{1}(x), F^{2}(x), \ldots, F^{n}(x)]$  (6)
The local invariant feature y serves as the input of the semi-supervised learning below.
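The convolution and pooling of equations (3) to (6) can be sketched as follows; the helper name `local_invariant_feature` is hypothetical, and sigmoid is assumed for s. With 120 kernels of size 60 × 6 and 120 of size 60 × 10 applied to a 60 × 15 segment and pooling over every two frames, this yields the 600 + 360 = 960 features described above.

```python
import numpy as np

def local_invariant_feature(spectrogram, kernels, biases, pool=2):
    """Eqs. (3)-(6): convolve full-height kernels over a spectrogram
    segment, max-pool non-overlapping regions, and stack the results.

    spectrogram : (60, T0) whitened segment (e.g. 60 x 15)
    kernels     : list of (60, w) SAE-learned kernels
    biases      : matching list of scalar biases
    """
    feats = []
    for W, b in zip(kernels, biases):
        w = W.shape[1]
        T = spectrogram.shape[1] - w + 1          # valid positions
        fmap = np.array([np.sum(W * spectrogram[:, t:t + w]) + b
                         for t in range(T)])      # conv(w_i, x) + a_i
        fmap = 1.0 / (1.0 + np.exp(-fmap))        # f^i(x), eq. (3)
        pooled = [fmap[j:j + pool].max()          # eq. (4), regions p_j
                  for j in range(0, T - pool + 1, pool)]
        feats.extend(pooled)                      # eqs. (5)-(6): stack
    return np.array(feats)                        # local invariant feature y
```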
Step 3: the above unsupervised learning yields the local invariant feature y. Semi-supervised learning based on convolutional neural networks (part of the input carries class labels) then decomposes y into four feature types: emotion-related features, gender-related features, age-related features, and other-factor-related features. The loss function of this semi-supervised learning consists of four parts.
First, four encoding functions $h^{(e)}(y)$, $h^{(s)}(y)$, $h^{(a)}(y)$, $h^{(o)}(y)$ map y to the four feature types, associated with emotion, gender, age, and other factors respectively. The four encoding functions are as follows:
$h^{(e)}(y) = s(Ey + e)$  (7)
$h^{(s)}(y) = s(Sy + s)$  (8)
$h^{(a)}(y) = s(Ay + a)$  (9)
$h^{(o)}(y) = s(Oy + o)$  (10)
All four feature types participate in reconstructing y:
$y' = g([h^{(e)}(y), h^{(s)}(y), h^{(a)}(y), h^{(o)}(y)]) = s(E^{T} h^{(e)}(y) + S^{T} h^{(s)}(y) + A^{T} h^{(a)}(y) + O^{T} h^{(o)}(y) + \gamma)$  (11)
where γ is a compensation parameter near the mean of y.
The reconstruction error function is therefore:
$L_{RECON}(y, y') = \|y - y'\|^{2}$  (12)
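A sketch of the four encoders (7)-(10) and the joint reconstruction (11)-(12), assuming sigmoid activations; the function name and argument shapes are illustrative.

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def encode_decode(y, E, e, S, s_, A, a, O, o, gamma):
    """Eqs. (7)-(12): map y to the four feature types, then reconstruct
    y from all four of them and measure the squared error."""
    he = sigmoid(E @ y + e)                       # emotion, eq. (7)
    hs = sigmoid(S @ y + s_)                      # gender, eq. (8)
    ha = sigmoid(A @ y + a)                       # age, eq. (9)
    ho = sigmoid(O @ y + o)                       # other factors, eq. (10)
    y_rec = sigmoid(E.T @ he + S.T @ hs +
                    A.T @ ha + O.T @ ho + gamma)  # eq. (11)
    l_recon = np.sum((y - y_rec) ** 2)            # eq. (12)
    return (he, hs, ha, ho), y_rec, l_recon
```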
Next, class prediction is performed using the labeled data. The input data are (x, z), where x is a spectrogram patch and $z = \{z_1, z_2, z_3\}$ denotes the emotion, gender, and age labels respectively; $z' = \{z'_1, z'_2, z'_3\}$ denotes the predicted emotion, gender, and age labels. For example, equation (13) predicts the j-th component $z'_{1j}$ of the emotion label by applying the mapping $U_{1j}$ to $h^{(e)}(y)$:
$z'_{1j} = s(U_{1j} h^{(e)}(y) + b_{1j})$  (13)
$z'_{2j} = s(U_{2j} h^{(s)}(y) + b_{2j})$  (14)
$z'_{3j} = s(U_{3j} h^{(a)}(y) + b_{3j})$  (15)
The discrimination loss functions for the emotion, gender, and age labels are then, respectively:
$L_{DISCE}(z_1, z'_1) = \sum_{j=1}^{C_1} z_{1j} \log z'_{1j} + (1 - z_{1j}) \log(1 - z'_{1j})$  (16)
$L_{DISCS}(z_2, z'_2) = \sum_{j=1}^{C_2} z_{2j} \log z'_{2j} + (1 - z_{2j}) \log(1 - z'_{2j})$  (17)
$L_{DISCA}(z_3, z'_3) = \sum_{j=1}^{C_3} z_{3j} \log z'_{3j} + (1 - z_{3j}) \log(1 - z'_{3j})$  (18)
The total discrimination loss function is:
$L_{DISC}(z, z') = L_{DISCE}(z_1, z'_1) + L_{DISCS}(z_2, z'_2) + L_{DISCA}(z_3, z'_3)$  (19)
where $C_1$, $C_2$, $C_3$ are the numbers of classes of the emotion, gender, and age labels respectively. Note in particular that in this step the emotion-related feature is constrained by the emotion discrimination loss of equation (16), the gender-related feature by the gender discrimination loss of equation (17), and the age-related feature by the age discrimination loss of equation (18).
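A sketch of one discrimination branch, covering the prediction equations (13)-(15) and the cross-entropy terms (16)-(18). It is written as the usual negated cross-entropy so that minimizing it matches the intent of the text; the small epsilon is an added numerical safeguard.

```python
import numpy as np

def disc_loss(h, U, b, z):
    """One discrimination branch: predict the label from the matching
    feature (eqs. (13)-(15)) and score it with cross-entropy over the
    C classes (eqs. (16)-(18)), negated so that smaller is better."""
    z_pred = 1.0 / (1.0 + np.exp(-(U @ h + b)))   # sigmoid prediction
    eps = 1e-12                                   # numerical safeguard
    return -np.sum(z * np.log(z_pred + eps) +
                   (1 - z) * np.log(1 - z_pred + eps))
```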
In addition, in order that $h^{(e)}(y)$, $h^{(s)}(y)$, $h^{(a)}(y)$, $h^{(o)}(y)$ represent as far as possible different directions of variation of y (for example, to make $h^{(e)}(y)$ and $h^{(s)}(y)$ represent different directions, their Jacobians with respect to y can be made as orthogonal as possible), an orthogonality loss is imposed. The orthogonality loss function is:
$J_{ORTH}(y) = \sum_{i,j} \Big( \frac{\partial h_i^{(e)}(y)}{\partial y} \cdot \frac{\partial h_j^{(s)}(y)}{\partial y} \Big)^2 + \sum_{i,k} \Big( \frac{\partial h_i^{(e)}(y)}{\partial y} \cdot \frac{\partial h_k^{(a)}(y)}{\partial y} \Big)^2 + \sum_{i,l} \Big( \frac{\partial h_i^{(e)}(y)}{\partial y} \cdot \frac{\partial h_l^{(o)}(y)}{\partial y} \Big)^2 + \sum_{j,k} \Big( \frac{\partial h_j^{(s)}(y)}{\partial y} \cdot \frac{\partial h_k^{(a)}(y)}{\partial y} \Big)^2 + \sum_{j,l} \Big( \frac{\partial h_j^{(s)}(y)}{\partial y} \cdot \frac{\partial h_l^{(o)}(y)}{\partial y} \Big)^2 + \sum_{k,l} \Big( \frac{\partial h_k^{(a)}(y)}{\partial y} \cdot \frac{\partial h_l^{(o)}(y)}{\partial y} \Big)^2$  (20)
where i, j, k, l index the units of $h^{(e)}$, $h^{(s)}$, $h^{(a)}$, $h^{(o)}$ respectively.
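For sigmoid encoders $h = s(Wy + b)$, row i of the Jacobian $\partial h/\partial y$ is $h_i(1 - h_i)W_i$, so equation (20) can be sketched directly; the function name `orth_loss` is hypothetical.

```python
import numpy as np

def orth_loss(y, weights, biases):
    """Eq. (20): sum of squared dot products between the Jacobian rows
    of every pair of the four encoders (emotion, gender, age, other)."""
    jacobians = []
    for W, b in zip(weights, biases):                 # [E, S, A, O], [e, s, a, o]
        h = 1.0 / (1.0 + np.exp(-(W @ y + b)))
        jacobians.append((h * (1 - h))[:, None] * W)  # rows of dh/dy
    total = 0.0
    for m in range(4):
        for n in range(m + 1, 4):                     # the six encoder pairs
            total += np.sum((jacobians[m] @ jacobians[n].T) ** 2)
    return total
```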
Finally, a saliency loss function can be used so that the four learned feature types better reflect the distinctions between the classes of the recognition target and are more stable. The saliency of each input i is measured by the sum of the saliencies of its weights. Specifically, the saliency of input i is:
$S_i = \sum_{k \in \varphi(i)} \mathrm{Saliency}(\omega_k) = \frac{1}{2} \sum_{k \in \varphi(i)} \frac{\partial^2 \mathrm{MSE}}{\partial \omega_k^2} \omega_k^2$  (21)
where φ(i) is the set of weights associated with input i, $\omega_k$ is the k-th weight, and MSE is the squared error. For the three feature types $h^{(e)}(y)$, $h^{(s)}(y)$ and $h^{(a)}(y)$, the saliency loss function takes both the reconstruction error and the discrimination loss into account; for $h^{(o)}(y)$, only the reconstruction error is considered. The saliency loss function is therefore:
$J_{SAL}(y) = -\frac{1}{2} \sum_k \frac{\partial^2 \|y - y'\|^2}{\partial \omega_k^2} \omega_k^2 - \frac{1}{2} \sum_{k \in h^{(e)}(y)} \frac{\partial^2 L_{DISCE}(z_1, z'_1)}{\partial \omega_k^2} \omega_k^2 - \frac{1}{2} \sum_{k \in h^{(s)}(y)} \frac{\partial^2 L_{DISCS}(z_2, z'_2)}{\partial \omega_k^2} \omega_k^2 - \frac{1}{2} \sum_{k \in h^{(a)}(y)} \frac{\partial^2 L_{DISCA}(z_3, z'_3)}{\partial \omega_k^2} \omega_k^2$  (22)
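The second derivatives in equations (21) and (22) can be estimated in several ways; the sketch below uses central finite differences, which is an assumption (the text does not fix the estimator), and the function name `weight_saliency` is hypothetical.

```python
import numpy as np

def weight_saliency(loss_fn, w, eps=1e-4):
    """Eq. (21): Saliency(w_k) = 0.5 * d^2L/dw_k^2 * w_k^2, with the
    diagonal second derivative estimated by central finite differences
    (an assumed estimator)."""
    base = loss_fn(w)
    sal = np.empty_like(w)
    for k in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[k] += eps
        wm[k] -= eps
        d2 = (loss_fn(wp) - 2.0 * base + loss_fn(wm)) / eps ** 2
        sal[k] = 0.5 * d2 * w[k] ** 2             # per-weight saliency
    return sal
```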
The reconstruction error function, the discrimination loss function, the orthogonality loss function, and the saliency loss function together form the total loss function:
$L_{LOSS}(\theta) = \sum_{x \in D,\, y = F(x)} L_{RECON}(y, y') + \beta J_{ORTH}(y) + \sum_{(x,z) \in S} L_{DISC}(z, z') + \eta J_{SAL}(y)$  (23)
where D is the whole dataset (comprising both unlabeled and labeled data) and S is the labeled dataset. β adjusts the contribution of the orthogonality loss function, β ∈ [0, 1]; η adjusts the contribution of the saliency loss function, η ∈ [0, 1]. The contribution weights β and η are set by grid search with a step of 0.1. The parameter set is θ = {E, S, A, O, U, e, s, a, o, γ, b}.
The weights and biases of the four loss-function terms are obtained by minimizing the total loss function, and the decomposition thereby yields the four feature types.
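Finally, a sketch of how the pieces compose into the total loss (23). The `params` layout, the example β and η values, and the omission of the saliency term (noted in the comment) are all illustrative; in practice β and η are grid-searched over [0, 1] in steps of 0.1 as stated above, and θ is obtained by minimizing this function.

```python
def total_loss(D, S, params, beta=0.3, eta=0.2):
    """Eq. (23), composed from the sketches above. `params` bundles the
    SAE kernels, the encoder matrices/biases, and the label predictors;
    its layout is assumed for illustration."""
    loss = 0.0
    for x in D:                                   # labeled + unlabeled segments
        y = local_invariant_feature(x, *params["kernels"])
        _, _, l_recon = encode_decode(y, *params["enc"])
        loss += l_recon + beta * orth_loss(y, params["W4"], params["b4"])
    for x, (z1, z2, z3) in S:                     # labeled subset only
        y = local_invariant_feature(x, *params["kernels"])
        (he, hs, ha, ho), _, _ = encode_decode(y, *params["enc"])
        loss += (disc_loss(he, *params["Ue"], z1) +   # emotion branch
                 disc_loss(hs, *params["Us"], z2) +   # gender branch
                 disc_loss(ha, *params["Ua"], z3))    # age branch
    # the eta * J_SAL(y) term of eq. (22) would be added here via
    # weight_saliency(); omitted to keep the sketch short
    return loss
```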

Claims (1)

1. A semi-supervised speech feature variable factor decomposition method, characterized by comprising the following steps:
Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply principal component analysis (PCA) for dimensionality reduction and whitening, and extract spectrogram patches of different sizes from the result;
Step 2, unsupervised local invariant feature learning: take said spectrogram patches as the input of an unsupervised feature-learning sparse autoencoder (SAE); by inputting patches of different sizes, pre-training yields convolution kernels of different sizes; convolve the whole spectrogram with each of said kernels to obtain a number of feature maps, then apply max pooling to said feature maps, and finally stack the pooled features to form a local invariant feature y;
Step 3, semi-supervised feature learning based on convolutional neural networks: take said local invariant feature y as the input of a semi-supervised learning algorithm and, using the method of semi-supervised learning based on convolutional neural networks, decompose y into four feature types through four different loss functions; said four feature types comprise emotion-related features, gender-related features, age-related features, and other-factor-related features covering noise and language; the loss function of said semi-supervised learning consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function;
For said reconstruction error function, all four feature types participate in reconstructing the local invariant feature y, and the error is the squared error; for said discrimination loss function, class prediction is first performed on the labeled data, and the difference between the predicted labels and the true labels is taken as the value of the discrimination loss; for said orthogonality loss function, the aim is to make the four feature types mutually orthogonal so that they represent different directions of the input local invariant feature y; for said saliency loss function, the aim is to learn features that reflect only the differences between the classes of the recognition target and are more class-discriminative; the parameters of the four loss functions, comprising biases and weights, are obtained by minimizing said loss function, thereby obtaining said four feature types.
CN201410229537.5A 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method Active CN104021373B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410229537.5A CN104021373B (en) 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method
PCT/CN2014/088539 WO2015180368A1 (en) 2014-05-27 2014-10-14 Variable factor decomposition method for semi-supervised speech features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410229537.5A CN104021373B (en) 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method

Publications (2)

Publication Number Publication Date
CN104021373A true CN104021373A (en) 2014-09-03
CN104021373B CN104021373B (en) 2017-02-15

Family

ID=51438118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410229537.5A Active CN104021373B (en) 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method

Country Status (2)

Country Link
CN (1) CN104021373B (en)
WO (1) WO2015180368A1 (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408470A (en) * 2014-12-01 2015-03-11 中科创达软件股份有限公司 Gender detection method based on average face preliminary learning
CN105070288A (en) * 2015-07-02 2015-11-18 百度在线网络技术(北京)有限公司 Vehicle-mounted voice instruction recognition method and device
WO2015180368A1 (en) * 2014-05-27 2015-12-03 江苏大学 Variable factor decomposition method for semi-supervised speech features
CN105321525A (en) * 2015-09-30 2016-02-10 北京邮电大学 System and method for reducing VOIP (voice over internet protocol) communication resource overhead
CN105550679A (en) * 2016-02-29 2016-05-04 深圳前海勇艺达机器人有限公司 Judgment method for robot to cyclically monitor recording
CN105989368A (en) * 2015-02-13 2016-10-05 展讯通信(天津)有限公司 Target detection method and apparatus, and mobile terminal
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN108461092A (en) * 2018-03-07 2018-08-28 燕山大学 A method of to Parkinson's disease speech analysis
CN109564618A (en) * 2016-06-06 2019-04-02 三星电子株式会社 Learning model for the detection of significant facial area
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN110070895A (en) * 2019-03-11 2019-07-30 江苏大学 A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition
CN110089135A (en) * 2016-10-19 2019-08-02 奥蒂布莱现实有限公司 System and method for generating audio image
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN110297928A (en) * 2019-07-02 2019-10-01 百度在线网络技术(北京)有限公司 Recommended method, device, equipment and the storage medium of expression picture
CN110503128A (en) * 2018-05-18 2019-11-26 百度(美国)有限责任公司 The spectrogram that confrontation network carries out Waveform composition is generated using convolution
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN111009262A (en) * 2019-12-24 2020-04-14 携程计算机技术(上海)有限公司 Voice gender identification method and system
CN114037059A (en) * 2021-11-05 2022-02-11 北京百度网讯科技有限公司 Pre-training model, model generation method, data processing method and data processing device
CN115240649A (en) * 2022-07-19 2022-10-25 于振华 Voice recognition method and system based on deep learning
US11606663B2 (en) 2018-08-29 2023-03-14 Audible Reality Inc. System for and method of controlling a three-dimensional audio engine

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11093818B2 (en) 2016-04-11 2021-08-17 International Business Machines Corporation Customer profile learning based on semi-supervised recurrent neural network using partially labeled sequence data
CN106803069B (en) * 2016-12-29 2021-02-09 南京邮电大学 Crowd happiness degree identification method based on deep learning
CN106919710A (en) * 2017-03-13 2017-07-04 东南大学 A kind of dialect sorting technique based on convolutional neural networks
CN108021910A (en) * 2018-01-04 2018-05-11 青岛农业大学 The analysis method of Pseudocarps based on spectrum recognition and deep learning
CN108899075A (en) * 2018-06-28 2018-11-27 众安信息技术服务有限公司 A kind of DSA image detecting method, device and equipment based on deep learning
CN109117943B (en) * 2018-07-24 2022-09-30 中国科学技术大学 Method for enhancing network representation learning by utilizing multi-attribute information
CN109065021B (en) * 2018-10-18 2023-04-18 江苏师范大学 End-to-end dialect identification method for generating countermeasure network based on conditional deep convolution
CN109543727B (en) * 2018-11-07 2022-12-20 复旦大学 Semi-supervised anomaly detection method based on competitive reconstruction learning
CN109559736B (en) * 2018-12-05 2022-03-08 中国计量大学 Automatic dubbing method for movie actors based on confrontation network
CN110009025B (en) * 2019-03-27 2023-03-24 河南工业大学 Semi-supervised additive noise self-encoder for voice lie detection
CN110084850B (en) * 2019-04-04 2023-05-23 东南大学 Dynamic scene visual positioning method based on image semantic segmentation
CN110363139B (en) * 2019-07-15 2020-09-18 上海点积实业有限公司 Digital signal processing method and system
CN110738168B (en) * 2019-10-14 2023-02-14 长安大学 Distributed strain micro crack detection system and method based on stacked convolution self-encoder
CN111179941B (en) * 2020-01-06 2022-10-04 科大讯飞股份有限公司 Intelligent device awakening method, registration method and device
CN111832650B (en) * 2020-07-14 2023-08-01 西安电子科技大学 Image classification method based on generation of antagonism network local aggregation coding semi-supervision
CN112735478B (en) * 2021-01-29 2023-07-18 华南理工大学 Voice emotion recognition method based on additive angle punishment focus loss

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
US20110222724A1 (en) * 2010-03-15 2011-09-15 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN102222500A (en) * 2011-05-11 2011-10-19 北京航空航天大学 Extracting method and modeling method for Chinese speech emotion combining emotion points
CN103400145A (en) * 2013-07-19 2013-11-20 北京理工大学 Voice-vision fusion emotion recognition method based on hint nerve networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU675389B2 (en) * 1994-04-28 1997-01-30 Motorola, Inc. A method and apparatus for converting text into audible signals using a neural network
US5509103A (en) * 1994-06-03 1996-04-16 Motorola, Inc. Method of training neural networks used for speech recognition
WO1995034064A1 (en) * 1994-06-06 1995-12-14 Motorola Inc. Speech-recognition system utilizing neural networks and method of using same
CN1120469C (en) * 1998-02-03 2003-09-03 西门子公司 Method for voice data transmission
CN104021373B (en) * 2014-05-27 2017-02-15 江苏大学 Semi-supervised speech feature variable factor decomposition method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
US20110222724A1 (en) * 2010-03-15 2011-09-15 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN102222500A (en) * 2011-05-11 2011-10-19 北京航空航天大学 Extracting method and modeling method for Chinese speech emotion combining emotion points
CN103400145A (en) * 2013-07-19 2013-11-20 北京理工大学 Voice-vision fusion emotion recognition method based on hint nerve networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NWE T L et al.: "Speech emotion recognition using hidden Markov models", Speech Communication *
ZHANG Shiqing et al.: "Speech emotion recognition based on an improved supervised manifold learning algorithm", Journal of Electronics & Information Technology (电子与信息学报) *
MAO Qirong et al.: "Small-sample speech emotion recognition method combining an overcomplete dictionary with PCA", Journal of Jiangsu University (江苏大学学报) *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015180368A1 (en) * 2014-05-27 2015-12-03 江苏大学 Variable factor decomposition method for semi-supervised speech features
CN104408470B (en) * 2014-12-01 2017-07-25 中科创达软件股份有限公司 The sex-screening method learnt in advance based on average face
CN104408470A (en) * 2014-12-01 2015-03-11 中科创达软件股份有限公司 Gender detection method based on average face preliminary learning
CN105989368A (en) * 2015-02-13 2016-10-05 展讯通信(天津)有限公司 Target detection method and apparatus, and mobile terminal
CN105070288A (en) * 2015-07-02 2015-11-18 百度在线网络技术(北京)有限公司 Vehicle-mounted voice instruction recognition method and device
WO2017000489A1 (en) * 2015-07-02 2017-01-05 百度在线网络技术(北京)有限公司 On-board voice command identification method and apparatus, and storage medium
US10446150B2 (en) 2015-07-02 2019-10-15 Baidu Online Network Technology (Beijing) Co. Ltd. In-vehicle voice command recognition method and apparatus, and storage medium
CN105070288B (en) * 2015-07-02 2018-08-07 百度在线网络技术(北京)有限公司 Vehicle-mounted voice instruction identification method and device
CN105321525B (en) * 2015-09-30 2019-02-22 北京邮电大学 A kind of system and method reducing VOIP communication resource expense
CN105321525A (en) * 2015-09-30 2016-02-10 北京邮电大学 System and method for reducing VOIP (voice over internet protocol) communication resource overhead
CN105550679A (en) * 2016-02-29 2016-05-04 深圳前海勇艺达机器人有限公司 Judgment method for robot to cyclically monitor recording
CN105550679B (en) * 2016-02-29 2019-02-15 深圳前海勇艺达机器人有限公司 A kind of judgment method of robot cycle monitoring recording
CN109564618B (en) * 2016-06-06 2023-11-24 三星电子株式会社 Method and system for facial image analysis
CN109564618A (en) * 2016-06-06 2019-04-02 三星电子株式会社 Learning model for the detection of significant facial area
CN110089135A (en) * 2016-10-19 2019-08-02 奥蒂布莱现实有限公司 System and method for generating audio image
US11516616B2 (en) 2016-10-19 2022-11-29 Audible Reality Inc. System for and method of generating an audio image
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN108461092A (en) * 2018-03-07 2018-08-28 燕山大学 A method of to Parkinson's disease speech analysis
CN108461092B (en) * 2018-03-07 2022-03-08 燕山大学 Method for analyzing Parkinson's disease voice
CN110503128A (en) * 2018-05-18 2019-11-26 百度(美国)有限责任公司 The spectrogram that confrontation network carries out Waveform composition is generated using convolution
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN110148400B (en) * 2018-07-18 2023-03-17 腾讯科技(深圳)有限公司 Pronunciation type recognition method, model training method, device and equipment
US11606663B2 (en) 2018-08-29 2023-03-14 Audible Reality Inc. System for and method of controlling a three-dimensional audio engine
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
WO2020181998A1 (en) * 2019-03-11 2020-09-17 江苏大学 Method for detecting mixed sound event on basis of factor decomposition of supervised variational encoder
CN110070895A (en) * 2019-03-11 2019-07-30 江苏大学 A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110297928A (en) * 2019-07-02 2019-10-01 百度在线网络技术(北京)有限公司 Recommended method, device, equipment and the storage medium of expression picture
CN111009262A (en) * 2019-12-24 2020-04-14 携程计算机技术(上海)有限公司 Voice gender identification method and system
CN114037059A (en) * 2021-11-05 2022-02-11 北京百度网讯科技有限公司 Pre-training model, model generation method, data processing method and data processing device
CN115240649A (en) * 2022-07-19 2022-10-25 于振华 Voice recognition method and system based on deep learning

Also Published As

Publication number Publication date
CN104021373B (en) 2017-02-15
WO2015180368A1 (en) 2015-12-03

Similar Documents

Publication Publication Date Title
CN104021373A (en) Semi-supervised speech feature variable factor decomposition method
Hsu et al. Unsupervised learning of disentangled and interpretable representations from sequential data
US10699719B1 (en) System and method for taxonomically distinguishing unconstrained signal data segments
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN110164452A (en) A kind of method of Application on Voiceprint Recognition, the method for model training and server
CN106104674A (en) Mixing voice identification
CN111128209B (en) Speech enhancement method based on mixed masking learning target
Sprechmann et al. Supervised non-euclidean sparse NMF via bilevel optimization with applications to speech enhancement
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
Lee et al. Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities
Wiem et al. Unsupervised single channel speech separation based on optimized subspace separation
Li et al. A si-sdr loss function based monaural source separation
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
Das et al. Towards Transferable Speech Emotion Representation: on loss functions for cross-lingual latent representations
US20200211569A1 (en) Audio signal processing
McVicar et al. Learning to separate vocals from polyphonic mixtures via ensemble methods and structured output prediction
Yue et al. Equilibrium optimizer for emotion classification from english speech signals
CN106205636A (en) A kind of speech emotion recognition Feature fusion based on MRMR criterion
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Roberts et al. Deep learning-based single-ended quality prediction for time-scale modified audio
Szekrényes et al. Classification of formal and informal dialogues based on turn-taking and intonation using deep neural networks
Medikonda et al. An information set-based robust text-independent speaker authentication
Bisio et al. Speaker recognition exploiting D2D communications paradigm: Performance evaluation of multiple observations approaches
US12014728B2 (en) Dynamic combination of acoustic model states
Franzoni et al. Crowd emotional sounds: spectrogram-based analysis using convolutional neural network.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant