WO2015180368A1 - Variable factor decomposition method for semi-supervised speech features - Google Patents

Variable factor decomposition method for semi-supervised speech features

Info

Publication number
WO2015180368A1
WO2015180368A1 (PCT/CN2014/088539; CN2014088539W)
Authority
WO
WIPO (PCT)
Prior art keywords
features
feature
loss function
speech
semi
Prior art date
Application number
PCT/CN2014/088539
Other languages
French (fr)
Chinese (zh)
Inventor
毛启容
黄正伟
薛文韬
詹永照
苟建平
Original Assignee
江苏大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江苏大学 (Jiangsu University)
Publication of WO2015180368A1 publication Critical patent/WO2015180368A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks

Definitions

  • Step 3, semi-supervised feature learning based on a convolutional neural network: the locally invariant feature y is used as the input of a semi-supervised learning algorithm, and a convolutional-neural-network-based semi-supervised learning method decomposes it through four different loss functions into four types of features: emotion-related features, gender-related features, age-related features, and features related to other factors including noise and language. The loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function.
  • Step 1: first convert the time-domain signal into a spectrogram, using a 20 ms window with 10 ms overlap; then apply PCA dimensionality reduction and whitening with 60 principal components, finally producing a 60×n spectrogram. A number of non-overlapping 60×15 spectrogram segments are extracted from it, and from each 60×15 segment spectral blocks of two sizes, 60×6 and 60×10, are extracted.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a variable factor decomposition method for semi-supervised speech features. The speech features are divided into four types: emotion-related features, gender-related features, age-related features, and features related to noise, language, and other factors. The method comprises: first, preprocessing the speech to obtain a spectrogram, inputting spectrogram blocks of different sizes into an unsupervised feature learning network SAE, and pre-training to obtain convolution kernels of different sizes; then convolving the entire spectrogram with the convolution kernels of each size to obtain a number of feature maps and applying max pooling to them; and finally stacking the features to form a locally invariant feature y. y serves as the input of a semi-supervised convolutional neural network and is decomposed into four types of features by minimizing four different loss function terms. The present invention solves the problem of low recognition accuracy caused by emotion, gender, and age speech features being mixed together, can serve different speech-signal-based recognition needs separately, and can also be used to decompose more factors.

Description

Semi-supervised speech feature variable factor decomposition method

Technical field

The invention belongs to the field of speech recognition, and in particular relates to a method for decomposing speech features.

Background art

As computers penetrate every corner of life, all types of computing platforms need simpler input media, and voice is naturally one of the best choices for users. Speech generally carries many kinds of information, including the speaker's identity, the spoken content, and the speaker's emotion, gender, and age. In recent years, the steady maturing of applications has driven the development of speech-signal-based recognition technology for human emotion, gender, age, spoken content, and so on. For example, a traditional call center usually connects a customer to a randomly chosen agent for telephone consultation and cannot provide personalized service according to the user's emotion, gender, and age; this raises the question of whether a customer's emotion, gender, and age can be judged from the voice, so that more personalized voice services can be provided on that basis. However, in existing speech-signal-based emotion, gender, and age recognition tasks, the features extracted by traditional feature extraction methods are often a mixture of emotion, gender, age, spoken content, language, and other factors that are difficult to separate from one another, leading to poor recognition performance.

In the paper by Dong Yu et al., Feature Learning in Deep Neural Networks—Studies on Speech Recognition Tasks, a deep neural network is used to learn a deep feature, but this feature may mix many factors such as emotion, gender, and age; if it is used for speech emotion recognition, the recognition rate may be affected by the other factors contained in the feature. At present, no feature extraction method can separately extract the different types of features in a speech signal. To overcome this defect of the prior art, the present invention decomposes speech features, through semi-supervised feature learning based on convolutional neural networks, into four categories (emotion-related features, gender-related features, age-related features, and other-factor-related features) that can be used separately for different speech-signal-based recognition needs. With further extension, the present invention can also be used to decompose more factors.
Summary of the invention

The object of the present invention is to provide a semi-supervised speech feature variable factor decomposition method, so that the decomposed features are not disturbed by factors unrelated to the recognition task and more prominently reflect the differences between the target categories, thereby improving recognition accuracy.

To solve the above technical problem, the present invention first preprocesses the speech to obtain a spectrogram, then obtains locally invariant features through unsupervised learning based on a convolutional neural network, and then adopts a semi-supervised learning method in which the constraints of four loss functions (a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function) decompose the locally invariant features obtained by the unsupervised learning into four categories: emotion-related features, gender-related features, age-related features, and other-factor-related features. These can be used for emotion recognition, gender recognition, and age recognition respectively, and can effectively improve recognition accuracy. The specific technical solution is as follows:

A semi-supervised speech feature variable factor decomposition method, characterized by comprising the following steps:

Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply PCA for principal component analysis dimensionality reduction and whitening, and extract spectral blocks of different sizes from it;

Step 2, unsupervised locally invariant feature learning: use the spectral blocks as the input of the unsupervised feature learning network SAE; by inputting spectral blocks of different sizes, pre-train convolution kernels of different sizes; then convolve the entire spectrogram with the convolution kernels of each size to obtain a number of feature maps; apply max pooling to the feature maps; and finally stack the features to form a locally invariant feature y;

Step 3, semi-supervised feature learning based on a convolutional neural network: use the locally invariant feature y as the input of a semi-supervised learning algorithm, and use a convolutional-neural-network-based semi-supervised learning method to decompose y into four types of features through four different loss functions. The four types of features comprise emotion-related features, gender-related features, age-related features, and features related to other factors including noise and language. The loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function.

For the reconstruction error function, all four types of features participate in reconstructing the locally invariant feature y, and the error is the mean square error. For the discriminant loss function, class prediction is first performed on the labeled data, and the difference between the predicted label and the true label is then computed as the value of the discriminant loss function. The purpose of the orthogonal loss function is to make the four types of features mutually orthogonal, so that they represent different directions of the input locally invariant feature y. The purpose of the saliency loss function is to learn features that reflect only the differences between the target categories and are more class-discriminative. The parameters of the four loss functions, including the biases and the weights, are obtained by minimizing the loss function, thereby yielding the four types of features.

The invention has beneficial effects. The semi-supervised feature learning of the present invention decomposes the locally invariant features into four types of features (emotion-related, gender-related, age-related, and other-factor-related), so that different types of features serve different recognition needs, avoiding the drawback of mutual interference between different types of features. In particular, the loss function of the semi-supervised learning consists of four parts, the reconstruction error function, the discriminant loss function, the orthogonal loss function, and the saliency loss function, so that the learned features better describe the differences between the target categories and are not disturbed by irrelevant factors. The invention solves the problem of low recognition rates caused by different speech features being mixed together, and can effectively improve recognition accuracy.
Brief description of the drawings

Figure 1 is a flow chart of the speech feature decomposition.

Figure 2 is a flow chart of the unsupervised feature learning.

Figure 3 is a structural diagram of the semi-supervised speech feature decomposition.

Detailed description
Figure 1 shows the overall idea of the method of the present invention. First, the speech is preprocessed to obtain a spectrogram; spectral blocks of different sizes are input into the unsupervised feature learning network SAE, which is pre-trained to yield convolution kernels of different sizes; convolution and pooling operations then form the locally invariant feature y. As the input of a semi-supervised convolutional neural network, y is decomposed into four types of features by minimizing four different loss function terms.

The preprocessed speech signal is divided into spectral blocks of size l_i × h_i, where i indexes the block sizes. The spectral blocks of different sizes are input into the unsupervised feature learning network SAE, which is pre-trained to yield convolution kernels of different sizes. The entire spectrogram is then convolved with the convolution kernels of each size, producing a number of feature maps; max pooling is applied to the feature maps, and the features are finally stacked to form the locally invariant feature y, as shown in Figure 2. As the input of a semi-supervised convolutional neural network, y is decomposed into four types of features through four different loss function terms. The semi-supervised loss function consists of four parts: the reconstruction error function, the discriminant loss function, the orthogonal loss function, and the saliency loss function. Minimizing the loss function yields the parameters of the four loss function terms, decomposing y into four types of features that serve different recognition needs, as shown in Figure 3. All features participate in the reconstruction, while each type of feature participates in the constraint of its corresponding discriminant loss function.
The present invention first preprocesses the speech and obtains a set of locally invariant features with an unsupervised learning algorithm based on a convolutional neural network, and then uses a semi-supervised learning algorithm based on a convolutional neural network to decompose the locally invariant features into four types of features: emotion-related features, gender-related features, age-related features, and other-factor-related features. The specific steps are as follows:
Step 1: first convert the time-domain signal into a spectrogram, using a window size of 20 ms with 10 ms overlap; then apply PCA dimensionality reduction and whitening, with 60 principal components, finally producing a 60×n spectrogram. A number of non-overlapping 60×15 spectrogram segments are extracted from it. From each 60×15 segment, spectral blocks of two sizes, 60×6 and 60×10, are extracted.
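As a minimal illustration of this preprocessing step (not part of the patent: the 20 ms/10 ms framing and the 60-component whitened PCA follow the text, while the log-magnitude spectrogram and the choice to take every time shift of each block are assumptions), a sketch in Python:

    import numpy as np
    from scipy.signal import stft
    from sklearn.decomposition import PCA

    def preprocess(signal, fs):
        # Time-domain signal -> PCA-whitened 60 x n spectrogram.
        nper = int(0.020 * fs)                 # 20 ms window
        nover = int(0.010 * fs)                # 10 ms overlap
        _, _, Z = stft(signal, fs=fs, nperseg=nper, noverlap=nover)
        logspec = np.log(np.abs(Z) + 1e-8).T   # frames x frequency bins (assumed log scale)
        pca = PCA(n_components=60, whiten=True)
        return pca.fit_transform(logspec).T    # 60 x n spectrogram

    def extract_blocks(spec60):
        # Non-overlapping 60 x 15 segments, then 60 x 6 and 60 x 10 blocks.
        segs = [spec60[:, i:i + 15] for i in range(0, spec60.shape[1] - 14, 15)]
        b6 = [seg[:, t:t + 6] for seg in segs for t in range(10)]
        b10 = [seg[:, t:t + 10] for seg in segs for t in range(6)]
        return np.array(b6), np.array(b10)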
Step 2: input the 60×6 and 60×10 spectral blocks into the SAE separately, learning in each case 120 convolution kernels of the same size as the input, 60×6 and 60×10 respectively. The entire 60×15 spectrogram segment is then convolved with these kernels, yielding 120 feature maps of size 1×10 and 120 of size 1×6. Max pooling over every two frames then gives 120 features of size 1×5 and 120 of size 1×3. That is, the 60×6 kernels yield 600 features and the 60×10 kernels yield 360 features; these 960 features in total serve as the semi-supervised input. The general steps of the unsupervised feature learning are as follows.
The objective function of the auto-encoder (AE) is as follows:

J_AE(ω, α, δ) = Σ_{x∈D} L(x, x′)   (1)

where x is an input spectral block (unlabeled here); h(x) is the encoding function, h(x) = s(ωx + α), where ω is the weight matrix, α is the bias, and s is the logistic sigmoid s(t) = 1/(1 + e^(−t)); g is the decoding function, x′ = g(h(x)) = s(ω^T h(x) + δ), where ω^T is the transpose of ω and δ is a bias; and L(x, x′) = ‖x − x′‖^2 is the mean square error loss.
The sparse auto-encoder (SAE) adds a sparsity loss term to the objective function of the AE. The objective function of the SAE is as follows:

J_SAE(ω, α, δ) = Σ_{x∈D} L(x, x′) + λ Σ_{j=1}^{n_2} KL(ρ ‖ ρ′_j)   (2)

where KL(ρ ‖ ρ′_j) = ρ log(ρ/ρ′_j) + (1 − ρ) log((1 − ρ)/(1 − ρ′_j)) is the relative entropy between two Bernoulli random variables with mean ρ and mean ρ′_j, used to control sparsity; ρ is the sparsity parameter; ρ′_j = (1/m) Σ_{i=1}^{m} h_j(x_i) is the average activation of hidden neuron j; n_2 is the number of hidden nodes; m is the number of input nodes; and λ is the parameter controlling the sparsity term.
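A compact sketch of this objective (illustrative only; the batch layout and variable names are assumptions, while the sigmoid, the squared-error reconstruction, and the KL sparsity penalty follow equations (1) and (2)):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def sae_loss(X, W, a_enc, b_dec, rho=0.05, lam=0.1):
        # X: m x dim matrix of flattened spectral blocks (one row per input).
        # W: n2 x dim weights; a_enc: n2 encoder biases; b_dec: dim decoder biases.
        H = sigmoid(X @ W.T + a_enc)       # encodings h(x) = s(wx + a)
        Xr = sigmoid(H @ W + b_dec)        # reconstructions x' = s(w^T h(x) + d)
        recon = np.sum((X - Xr) ** 2)      # sum of L(x, x') = ||x - x'||^2
        rho_j = H.mean(axis=0)             # average activation of each hidden unit
        kl = np.sum(rho * np.log(rho / rho_j)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_j)))
        return recon + lam * kl            # eq. (2)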
Suppose there are n different input sizes, denoted l_i × h_i (i = 1, 2, …, n). Minimizing J_SAE over the blocks of each size yields a different convolution kernel (ω_i, α_i) for each size. All spectral blocks l_i × h_i of the entire spectrogram are then convolved with the kernel (ω_i, α_i):
f_i(x) = s(conv(ω_i, x) + α_i)   (3)
The feature map obtained by the convolution is then divided into non-overlapping regions P = {p_1, p_2, …, p_q}, and max pooling is performed over each region:

F_i^(k)(x) = max_{t∈p_k} f_i(x)_t,   k = 1, …, q   (4)
For the i-th convolution kernel, the pooled features are stacked:

F_i(x) = [F_i^(1)(x), F_i^(2)(x), …, F_i^(q)(x)]   (5)
The pooled features of all convolution kernels are stacked to form the locally invariant feature y:

y = F(x) = [F_1(x), F_2(x), …, F_n(x)]   (6)
The locally invariant feature y then serves as the input of the semi-supervised learning described below.
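The convolution, pooling, and stacking of equations (3) through (6) can be sketched as follows (an illustration under the dimensions of step 2: full-height kernels slide only along the time axis and pooling is over non-overlapping pairs of frames; the code itself is not from the patent):

    import numpy as np

    def conv_pool_stack(segment, kernel_sets, bias_sets):
        # segment: one 60 x 15 spectrogram segment.
        # kernel_sets[i]: (120, 60, w_i) kernels with w_i in {6, 10};
        # bias_sets[i]: (120,) biases.
        feats = []
        for kernels, biases in zip(kernel_sets, bias_sets):
            n_k, _, width = kernels.shape
            T = segment.shape[1] - width + 1     # valid positions along time
            fmap = np.empty((n_k, T))
            for t in range(T):                   # eq. (3)
                patch = segment[:, t:t + width]
                act = np.tensordot(kernels, patch, axes=([1, 2], [0, 1])) + biases
                fmap[:, t] = 1.0 / (1.0 + np.exp(-act))
            # eqs. (4)-(5): max pooling over pairs of frames, then stacking
            pooled = fmap[:, :T - T % 2].reshape(n_k, -1, 2).max(axis=2)
            feats.append(pooled.ravel())
        return np.concatenate(feats)             # eq. (6): here 600 + 360 = 960 values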
Step 3: through the above unsupervised learning, the locally invariant feature y is obtained. Then, through semi-supervised learning based on a convolutional neural network (part of the input carries category labels), y is decomposed into four types of features: emotion-related features, gender-related features, age-related features, and other-factor-related features. The loss function of the semi-supervised learning consists of four parts.

First, y is mapped into four types of features through four encoding functions h^(e)(y), h^(s)(y), h^(a)(y), and h^(o)(y), related to emotion, gender, age, and other factors, respectively. The four encoding functions are as follows:
h^(e)(y) = s(Ey + e)   (7)

h^(s)(y) = s(Sy + s)   (8)

h^(a)(y) = s(Ay + a)   (9)

h^(o)(y) = s(Oy + o)   (10)
All four types of features participate in reconstructing y:

y′ = g([h^(e)(y), h^(s)(y), h^(a)(y), h^(o)(y)]) = s(E^T h^(e)(y) + S^T h^(s)(y) + A^T h^(a)(y) + O^T h^(o)(y) + γ)   (11)

where γ is a compensation parameter that keeps the reconstruction close to the mean of y.
The reconstruction error function is therefore:

L_RECON(y, y′) = ‖y − y′‖^2   (12)
Next, the labeled data are used for category prediction. The input data are (x, z), where x is a spectral block and z = {z_1, z_2, z_3} denote the emotion label, gender label, and age label, respectively; z′ = {z′_1, z′_2, z′_3} denote the predicted emotion, gender, and age labels. For example, equation (13) below predicts the j-th component z′_1j of the emotion label by mapping h^(e)(y) through U_1j.
z′_1j = s(U_1j h^(e)(y) + b_1j)   (13)

z′_2j = s(U_2j h^(s)(y) + b_2j)   (14)

z′_3j = s(U_3j h^(a)(y) + b_3j)   (15)
The discriminant loss functions for the emotion label, gender label, and age label are therefore:

L_DISCE(z_1, z′_1) = Σ_{j=1}^{C_1} (z_1j − z′_1j)^2   (16)

L_DISCS(z_2, z′_2) = Σ_{j=1}^{C_2} (z_2j − z′_2j)^2   (17)

L_DISCA(z_3, z′_3) = Σ_{j=1}^{C_3} (z_3j − z′_3j)^2   (18)
The total discriminant loss function is:

L_DISC(z, z′) = L_DISCE(z_1, z′_1) + L_DISCS(z_2, z′_2) + L_DISCA(z_3, z′_3)   (19)

where C_1, C_2, and C_3 denote the numbers of categories of the emotion label, gender label, and age label, respectively. In particular, in this step the emotion-related features are constrained by the emotion discriminant penalty function (16), the gender-related features by the gender discriminant penalty function (17), and the age-related features by the age discriminant penalty function (18).
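A sketch of the four-branch encoding, reconstruction, and discriminant terms of equations (7) through (19) (illustrative only; the matrix shapes, the parameter dictionary, and the handling of unlabeled samples are assumptions consistent with the text):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def forward(y, P):
        # P holds weights E, S, A, O (k x d), biases e, s, a, o (k,),
        # gamma (d,), label weights U1, U2, U3 (C_i x k), biases b1, b2, b3.
        he = sigmoid(P['E'] @ y + P['e'])              # eq. (7)
        hs = sigmoid(P['S'] @ y + P['s'])              # eq. (8)
        ha = sigmoid(P['A'] @ y + P['a'])              # eq. (9)
        ho = sigmoid(P['O'] @ y + P['o'])              # eq. (10)
        y_rec = sigmoid(P['E'].T @ he + P['S'].T @ hs  # eq. (11)
                        + P['A'].T @ ha + P['O'].T @ ho + P['gamma'])
        z_pred = (sigmoid(P['U1'] @ he + P['b1']),     # eqs. (13)-(15)
                  sigmoid(P['U2'] @ hs + P['b2']),
                  sigmoid(P['U3'] @ ha + P['b3']))
        return (he, hs, ha, ho), y_rec, z_pred

    def losses(y, z, P):
        # z is a tuple (z1, z2, z3) of label vectors, or None if unlabeled.
        feats, y_rec, z_pred = forward(y, P)
        l_recon = np.sum((y - y_rec) ** 2)             # eq. (12)
        l_disc = 0.0 if z is None else sum(            # eqs. (16)-(19)
            np.sum((zt - zp) ** 2) for zt, zp in zip(z, z_pred))
        return l_recon, l_disc, feats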
In addition, so that h^(e)(y), h^(s)(y), h^(a)(y), and h^(o)(y) represent, as far as possible, different directions of variation of y (for example, so that h^(e)(y) and h^(s)(y) represent different directions), the feature vectors such as h^(e)(y) and h^(s)(y) can be made as orthogonal as possible. The orthogonal loss function is:

J_ORTH(y) = Σ_{c≠c′} (h^(c)(y)^T h^(c′)(y))^2,   c, c′ ∈ {e, s, a, o}   (20)
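A minimal sketch of such a pairwise orthogonality penalty, assuming the squared-inner-product form written above for equation (20):

    from itertools import combinations
    import numpy as np

    def orth_loss(feats):
        # feats: the four feature vectors (h_e, h_s, h_a, h_o) of one input y.
        return sum(float(np.dot(f1, f2)) ** 2
                   for f1, f2 in combinations(feats, 2))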
Finally, a saliency loss function can be used so that the learned four types of features better reflect the discriminability between the different categories of the recognition target and are more stable. The saliency of each input i is measured by the sum of the saliencies of its weights. Specifically, the saliency of input i is:

SAL(i) = Σ_{ω_k∈φ(i)} (1/2) (∂^2 MSE/∂ω_k^2) ω_k^2   (21)

where φ(i) is the set of weights associated with input i, ω_k is the k-th weight, and MSE is the mean square error. For the three types of features h^(e)(y), h^(s)(y), and h^(a)(y), both the reconstruction error and the discriminant loss are taken into account in the saliency loss function, whereas for h^(o)(y) only the reconstruction error is considered. The saliency loss function is therefore:

J_SAL(y) = Σ_{c∈{e,s,a}} [SAL_RECON(h^(c)) + SAL_DISC(h^(c))] + SAL_RECON(h^(o))   (22)

where SAL_RECON and SAL_DISC denote the saliency of equation (21) computed with the reconstruction error and with the corresponding discriminant loss, respectively, as the error term.
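The weight saliency of equation (21) can be approximated numerically; the sketch below uses a central finite-difference second derivative (the diagonal-Hessian saliency form itself is a reconstruction from the surrounding text, and loss_fn is an assumed, hypothetical helper mapping a flat weight vector to the chosen error term):

    import numpy as np

    def weight_saliency(loss_fn, w, eps=1e-4):
        # Saliency of each weight, eq. (21): 0.5 * d2E/dw_k^2 * w_k^2,
        # with the second derivative taken by central finite differences.
        base = loss_fn(w)
        sal = np.empty_like(w)
        for k in range(w.size):
            wp, wm = w.copy(), w.copy()
            wp[k] += eps
            wm[k] -= eps
            d2 = (loss_fn(wp) - 2.0 * base + loss_fn(wm)) / eps ** 2
            sal[k] = 0.5 * d2 * w[k] ** 2
        return sal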
The reconstruction error function, the discriminant loss function, the orthogonal loss function, and the saliency loss function thus constitute the total loss function:

L_LOSS(θ) = Σ_{x∈D, y=F(x)} L_RECON(y, y′) + βJ_ORTH(y) + Σ_{(x,z)∈S} L_DISC(z, z′) + ηJ_SAL(y)   (23)

where D is the entire data set (including both unlabeled and labeled data) and S is the labeled data set. β adjusts the contribution of the orthogonal loss function, β ∈ [0, 1]; η adjusts the contribution of the saliency loss function, η ∈ [0, 1]. The contribution weight parameters β and η are set by a grid search with step size 0.1. The parameter set is θ = {E, S, A, O, U, e, s, a, o, γ, b}.
By minimizing the loss function, the parameters of the four loss function terms, the weights and biases, are obtained, and the four types of features are thereby decomposed.
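Putting the pieces together, a schematic of the total loss of equation (23) and of the grid search over the contribution weights β and η (the losses and orth_loss sketches are from above; saliency_loss, train_model, and validate are assumed, hypothetical helpers not defined by the patent):

    import numpy as np

    def saliency_loss(P):
        # Placeholder for J_SAL of eqs. (21)-(22); see the saliency sketch above.
        return 0.0

    def total_loss(D, S, P, beta, eta):
        # Eq. (23). D: all locally invariant features y;
        # S: labeled pairs (y, (z1, z2, z3)).
        loss = 0.0
        for y in D:
            l_recon, _, feats = losses(y, None, P)
            loss += l_recon + beta * orth_loss(feats)
        for y, z in S:
            _, l_disc, _ = losses(y, z, P)
            loss += l_disc
        return loss + eta * saliency_loss(P)

    # Grid search over beta, eta in [0, 1] with step 0.1, as stated in the text.
    # train_model and validate are assumed helpers, not part of the patent.
    best, best_acc = (0.0, 0.0), -np.inf
    for beta in np.arange(0.0, 1.01, 0.1):
        for eta in np.arange(0.0, 1.01, 0.1):
            acc = validate(train_model(total_loss, beta, eta))
            if acc > best_acc:
                best, best_acc = (beta, eta), acc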

Claims (1)

  1. A semi-supervised speech feature variable factor decomposition method, characterized by comprising the following steps:

    Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply PCA for principal component analysis dimensionality reduction and whitening, and extract spectral blocks of different sizes from it;

    Step 2, unsupervised locally invariant feature learning: use the spectral blocks as the input of the unsupervised feature learning network SAE; by inputting spectral blocks of different sizes, pre-train convolution kernels of different sizes; then convolve the entire spectrogram with the convolution kernels of each size to obtain a number of feature maps; apply max pooling to the feature maps; and finally stack the features to form a locally invariant feature y;

    Step 3, semi-supervised feature learning based on a convolutional neural network: use the locally invariant feature y as the input of a semi-supervised learning algorithm, and use a convolutional-neural-network-based semi-supervised learning method to decompose y into four types of features through four different loss functions; the four types of features comprise emotion-related features, gender-related features, age-related features, and features related to other factors including noise and language; the loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function;

    for the reconstruction error function, all four types of features participate in reconstructing the locally invariant feature y, and the error is the mean square error; for the discriminant loss function, class prediction is first performed on the labeled data, and the difference between the predicted label and the true label is then computed as the value of the discriminant loss function; the purpose of the orthogonal loss function is to make the four types of features mutually orthogonal, representing different directions of the input locally invariant feature y; the purpose of the saliency loss function is to learn features that reflect only the differences between the target categories and are more class-discriminative; the parameters of the four loss functions, including the biases and the weights, are obtained by minimizing the loss function, thereby yielding the four types of features.
PCT/CN2014/088539 2014-05-27 2014-10-14 Variable factor decomposition method for semi-supervised speech features WO2015180368A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410229537.5A CN104021373B (en) 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method
CN201410229537.5 2014-05-27

Publications (1)

Publication Number Publication Date
WO2015180368A1 true WO2015180368A1 (en) 2015-12-03

Family

ID=51438118

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/088539 WO2015180368A1 (en) 2014-05-27 2014-10-14 Variable factor decomposition method for semi-supervised speech features

Country Status (2)

Country Link
CN (1) CN104021373B (en)
WO (1) WO2015180368A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106803069A (en) * 2016-12-29 2017-06-06 南京邮电大学 Crowd's level of happiness recognition methods based on deep learning
CN106919710A (en) * 2017-03-13 2017-07-04 东南大学 A kind of dialect sorting technique based on convolutional neural networks
CN108021910A (en) * 2018-01-04 2018-05-11 青岛农业大学 The analysis method of Pseudocarps based on spectrum recognition and deep learning
CN108899075A (en) * 2018-06-28 2018-11-27 众安信息技术服务有限公司 A kind of DSA image detecting method, device and equipment based on deep learning
CN109065021A (en) * 2018-10-18 2018-12-21 江苏师范大学 The end-to-end dialect identification method of confrontation network is generated based on condition depth convolution
CN109117943A (en) * 2018-07-24 2019-01-01 中国科学技术大学 Utilize the method for more attribute informations enhancing network characterisation study
CN109543727A (en) * 2018-11-07 2019-03-29 复旦大学 A kind of semi-supervised method for detecting abnormality based on competition reconstruct study
CN109559736A (en) * 2018-12-05 2019-04-02 中国计量大学 A kind of film performer's automatic dubbing method based on confrontation network
CN110009025A (en) * 2019-03-27 2019-07-12 河南工业大学 A kind of semi-supervised additive noise self-encoding encoder for voice lie detection
CN110084850A (en) * 2019-04-04 2019-08-02 东南大学 A kind of dynamic scene vision positioning method based on image, semantic segmentation
CN110363139A (en) * 2019-07-15 2019-10-22 上海点积实业有限公司 A kind of digital signal processing method and system
CN110503128A (en) * 2018-05-18 2019-11-26 百度(美国)有限责任公司 The spectrogram that confrontation network carries out Waveform composition is generated using convolution
CN110738168A (en) * 2019-10-14 2020-01-31 长安大学 distributed strain micro crack detection system and method based on stacked convolution self-encoder
CN111179941A (en) * 2020-01-06 2020-05-19 科大讯飞股份有限公司 Intelligent device awakening method, registration method and device
CN111832650A (en) * 2020-07-14 2020-10-27 西安电子科技大学 Image classification method based on generation of confrontation network local aggregation coding semi-supervision
CN112735478A (en) * 2021-01-29 2021-04-30 华南理工大学 Voice emotion recognition method based on additive angle punishment focus loss
US11093818B2 (en) 2016-04-11 2021-08-17 International Business Machines Corporation Customer profile learning based on semi-supervised recurrent neural network using partially labeled sequence data

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021373B (en) * 2014-05-27 2017-02-15 江苏大学 Semi-supervised speech feature variable factor decomposition method
CN104408470B (en) * 2014-12-01 2017-07-25 中科创达软件股份有限公司 The sex-screening method learnt in advance based on average face
CN105989368A (en) * 2015-02-13 2016-10-05 展讯通信(天津)有限公司 Target detection method and apparatus, and mobile terminal
CN105070288B (en) * 2015-07-02 2018-08-07 百度在线网络技术(北京)有限公司 Vehicle-mounted voice instruction identification method and device
CN105321525B (en) * 2015-09-30 2019-02-22 北京邮电大学 A kind of system and method reducing VOIP communication resource expense
CN105550679B (en) * 2016-02-29 2019-02-15 深圳前海勇艺达机器人有限公司 A kind of judgment method of robot cycle monitoring recording
US10579860B2 (en) * 2016-06-06 2020-03-03 Samsung Electronics Co., Ltd. Learning model for salient facial region detection
CN110089135A (en) * 2016-10-19 2019-08-02 奥蒂布莱现实有限公司 System and method for generating audio image
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN108461092B (en) * 2018-03-07 2022-03-08 燕山大学 Method for analyzing Parkinson's disease voice
CN110148400B (en) * 2018-07-18 2023-03-17 腾讯科技(深圳)有限公司 Pronunciation type recognition method, model training method, device and equipment
US11606663B2 (en) 2018-08-29 2023-03-14 Audible Reality Inc. System for and method of controlling a three-dimensional audio engine
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN110070895B (en) * 2019-03-11 2021-06-22 江苏大学 Mixed sound event detection method based on factor decomposition of supervised variational encoder
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110297928A (en) * 2019-07-02 2019-10-01 百度在线网络技术(北京)有限公司 Recommended method, device, equipment and the storage medium of expression picture
CN111009262A (en) * 2019-12-24 2020-04-14 携程计算机技术(上海)有限公司 Voice gender identification method and system
CN114037059A (en) * 2021-11-05 2022-02-11 北京百度网讯科技有限公司 Pre-training model, model generation method, data processing method and data processing device
CN115240649B (en) * 2022-07-19 2023-04-18 于振华 Voice recognition method and system based on deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1150852A (en) * 1994-06-06 1997-05-28 摩托罗拉公司 Speech-recognition system utilizing neural networks and method of using same
CN1151218A (en) * 1994-06-03 1997-06-04 摩托罗拉公司 Method of training neural networks used for speech recognition
CN1275746A (en) * 1994-04-28 2000-12-06 摩托罗拉公司 Equipment for converting text into audio signal by using nervus network
CN1280697A (en) * 1998-02-03 2001-01-17 西门子公司 Method for voice data transmission
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN103400145A (en) * 2013-07-19 2013-11-20 北京理工大学 Voice-vision fusion emotion recognition method based on hint nerve networks
CN104021373A (en) * 2014-05-27 2014-09-03 江苏大学 Semi-supervised speech feature variable factor decomposition method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8582807B2 (en) * 2010-03-15 2013-11-12 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN102222500A (en) * 2011-05-11 2011-10-19 北京航空航天大学 Extracting method and modeling method for Chinese speech emotion combining emotion points

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1275746A (en) * 1994-04-28 2000-12-06 摩托罗拉公司 Equipment for converting text into audio signal by using nervus network
CN1151218A (en) * 1994-06-03 1997-06-04 摩托罗拉公司 Method of training neural networks used for speech recognition
CN1150852A (en) * 1994-06-06 1997-05-28 摩托罗拉公司 Speech-recognition system utilizing neural networks and method of using same
CN1280697A (en) * 1998-02-03 2001-01-17 西门子公司 Method for voice data transmission
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN103400145A (en) * 2013-07-19 2013-11-20 北京理工大学 Voice-vision fusion emotion recognition method based on hint nerve networks
CN104021373A (en) * 2014-05-27 2014-09-03 江苏大学 Semi-supervised speech feature variable factor decomposition method

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11093818B2 (en) 2016-04-11 2021-08-17 International Business Machines Corporation Customer profile learning based on semi-supervised recurrent neural network using partially labeled sequence data
CN106803069A (en) * 2016-12-29 2017-06-06 南京邮电大学 Crowd's level of happiness recognition methods based on deep learning
CN106803069B (en) * 2016-12-29 2021-02-09 南京邮电大学 Crowd happiness degree identification method based on deep learning
CN106919710A (en) * 2017-03-13 2017-07-04 东南大学 A kind of dialect sorting technique based on convolutional neural networks
CN108021910A (en) * 2018-01-04 2018-05-11 青岛农业大学 The analysis method of Pseudocarps based on spectrum recognition and deep learning
CN110503128A (en) * 2018-05-18 2019-11-26 百度(美国)有限责任公司 The spectrogram that confrontation network carries out Waveform composition is generated using convolution
CN108899075A (en) * 2018-06-28 2018-11-27 众安信息技术服务有限公司 A kind of DSA image detecting method, device and equipment based on deep learning
CN109117943A (en) * 2018-07-24 2019-01-01 中国科学技术大学 Utilize the method for more attribute informations enhancing network characterisation study
CN109117943B (en) * 2018-07-24 2022-09-30 中国科学技术大学 Method for enhancing network representation learning by utilizing multi-attribute information
CN109065021A (en) * 2018-10-18 2018-12-21 江苏师范大学 The end-to-end dialect identification method of confrontation network is generated based on condition depth convolution
CN109543727A (en) * 2018-11-07 2019-03-29 复旦大学 A kind of semi-supervised method for detecting abnormality based on competition reconstruct study
CN109559736A (en) * 2018-12-05 2019-04-02 中国计量大学 A kind of film performer's automatic dubbing method based on confrontation network
CN109559736B (en) * 2018-12-05 2022-03-08 中国计量大学 Automatic dubbing method for movie actors based on confrontation network
CN110009025A (en) * 2019-03-27 2019-07-12 河南工业大学 A kind of semi-supervised additive noise self-encoding encoder for voice lie detection
CN110009025B (en) * 2019-03-27 2023-03-24 河南工业大学 Semi-supervised additive noise self-encoder for voice lie detection
CN110084850A (en) * 2019-04-04 2019-08-02 东南大学 A kind of dynamic scene vision positioning method based on image, semantic segmentation
CN110363139A (en) * 2019-07-15 2019-10-22 上海点积实业有限公司 A kind of digital signal processing method and system
CN110738168A (en) * 2019-10-14 2020-01-31 长安大学 distributed strain micro crack detection system and method based on stacked convolution self-encoder
CN111179941A (en) * 2020-01-06 2020-05-19 科大讯飞股份有限公司 Intelligent device awakening method, registration method and device
CN111179941B (en) * 2020-01-06 2022-10-04 科大讯飞股份有限公司 Intelligent device awakening method, registration method and device
CN111832650A (en) * 2020-07-14 2020-10-27 西安电子科技大学 Image classification method based on generation of confrontation network local aggregation coding semi-supervision
CN111832650B (en) * 2020-07-14 2023-08-01 西安电子科技大学 Image classification method based on generation of antagonism network local aggregation coding semi-supervision
CN112735478A (en) * 2021-01-29 2021-04-30 华南理工大学 Voice emotion recognition method based on additive angle punishment focus loss

Also Published As

Publication number Publication date
CN104021373B (en) 2017-02-15
CN104021373A (en) 2014-09-03

Similar Documents

Publication Publication Date Title
WO2015180368A1 (en) Variable factor decomposition method for semi-supervised speech features
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
Palo et al. Wavelet based feature combination for recognition of emotions
Hsu et al. Unsupervised learning of disentangled and interpretable representations from sequential data
Mao et al. Learning salient features for speech emotion recognition using convolutional neural networks
Deng et al. Recognizing emotions from whispered speech based on acoustic feature transfer learning
CN110136690A (en) Phoneme synthesizing method, device and computer readable storage medium
CN105047194B (en) A kind of self study sound spectrograph feature extracting method for speech emotion recognition
CN111461176A (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
Wei et al. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model
CN115359576A (en) Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Xia et al. Audiovisual speech recognition: A review and forecast
Dua et al. Optimizing integrated features for Hindi automatic speech recognition system
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Singkul et al. Vector learning representation for generalized speech emotion recognition
JP2015175859A (en) Pattern recognition device, pattern recognition method, and pattern recognition program
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
Cetin Accent recognition using a spectrogram image feature-based convolutional neural network
CN116434759B (en) Speaker identification method based on SRS-CL network
CN110226201A (en) The voice recognition indicated using the period
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
CN114626424A (en) Data enhancement-based silent speech recognition method and device
CN114913871A (en) Target object classification method, system, electronic device and storage medium
Mu et al. Self-supervised disentangled representation learning for robust target speech extraction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14893660

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14893660

Country of ref document: EP

Kind code of ref document: A1