WO2015180368A1 - Variable factor decomposition method for semi-supervised speech features - Google Patents
- Publication number
- WO2015180368A1 (PCT/CN2014/088539)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
Disclosed is a variable factor decomposition method for semi-supervised speech features. The speech features are divided into four types: emotion-related features, gender-related features, age-related features, and features related to other factors such as noise and language. The method comprises: first, pre-processing the speech to obtain a spectrogram, feeding spectrogram blocks of different sizes into an unsupervised feature learning network (a sparse autoencoder, SAE), and pre-training to obtain convolution kernels of different sizes; then, convolving the entire spectrogram with these kernels of different sizes to obtain a set of feature maps, and applying max pooling to the feature maps; and finally, stacking the pooled features to form a local invariant feature y. This y serves as the input of a semi-supervised convolutional neural network and is decomposed into the four types of features by minimizing four different loss function terms. The invention addresses the low recognition accuracy caused by emotion, gender, and age features being mixed together in speech; the decomposed features can serve different speech-based recognition tasks separately, and the method can be extended to decompose more factors.
Description
The invention belongs to the field of speech recognition, and in particular relates to a method for decomposing speech features.
As computers reach into every corner of daily life, computing platforms of all kinds need more convenient input media, and speech is naturally one of the best choices for users. In general, speech carries many kinds of information: the speaker's identity, the spoken content, and the speaker's emotion, gender, and age. In recent years, maturing applications have driven the development of speech-based recognition technology for human emotion, gender, age, and spoken content. For example, a traditional call center usually connects a customer to a random agent and cannot provide personalized service according to the caller's emotion, gender, and age; this motivates judging a caller's emotion, gender, and age from the voice and providing more personalized voice services on that basis. However, in existing speech-based emotion, gender, and age recognition tasks, the features extracted by traditional methods are often entangled with emotion, gender, age, spoken content, language, and other factors that are hard to separate from one another, which leads to poor recognition performance.
In the paper by Dong Yu et al. entitled "Feature Learning in Deep Neural Networks—Studies on Speech Recognition Tasks", a deep neural network is used to learn a deep feature, but this feature may mix many factors such as emotion, gender, and age; if it is used for speech emotion recognition, the recognition rate may be affected by the other factors it contains. At present, no feature extraction method can separately extract the different types of features in a speech signal. To overcome this defect of the prior art, the present invention uses semi-supervised feature learning based on convolutional neural networks to decompose speech features into four categories: emotion-related features, gender-related features, age-related features, and other-factor-related features, which can be used separately for different speech-based recognition needs. With further extension, the invention can also be used to decompose more factors.
Summary of the invention
The object of the present invention is to provide a variable factor decomposition method for semi-supervised speech features, such that the decomposed features are not disturbed by factors irrelevant to the recognition task and more prominently reflect the differences between the target categories, thereby improving recognition accuracy.
To solve the above technical problem, the present invention first preprocesses the speech to obtain a spectrogram, then obtains local invariant features through unsupervised learning based on convolutional neural networks, and then adopts a semi-supervised learning method that, under the constraints of four loss functions (a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function), decomposes the local invariant features obtained by unsupervised learning into four categories: emotion-related features, gender-related features, age-related features, and other-factor-related features. These can be used for emotion recognition, gender recognition, and age recognition respectively, effectively improving recognition accuracy. The specific technical solution is as follows:
A variable factor decomposition method for semi-supervised speech features, characterized by comprising the following steps:
Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply principal component analysis (PCA) for dimensionality reduction and whitening, and extract spectrogram blocks of different sizes from the result.
Step 2, unsupervised local invariant feature learning: feed the spectrogram blocks of different sizes into the SAE for unsupervised feature learning and pre-train convolution kernels of different sizes; then convolve the entire spectrogram with each kernel size to obtain a set of feature maps, apply max pooling to the feature maps, and finally stack the pooled features to form a local invariant feature y.
Step 3, semi-supervised feature learning based on convolutional neural networks: use the local invariant feature y as the input of a semi-supervised learning algorithm and, through four different loss functions, decompose y into four types of features: emotion-related features, gender-related features, age-related features, and features related to other factors including noise and language. The loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function.
For the reconstruction error function, all four types of features participate in reconstructing the local invariant feature y, and the error is the mean squared error. For the discriminant loss function, class prediction is first performed on the labeled data, and the difference between the predicted labels and the true labels is taken as the value of the loss. The orthogonal loss function aims to make the four types of features mutually orthogonal, so that they represent different directions of the input local invariant feature y. The saliency loss function aims to learn features that reflect only the differences between the target categories and are more class-discriminative. The parameters of the four loss functions, including biases and weights, are obtained by minimizing the total loss, yielding the four types of features.
The invention has beneficial effects. The semi-supervised feature learning of the invention decomposes the local invariant features into four categories (emotion-related, gender-related, age-related, and other-factor-related features), so that different types of features serve different recognition needs and mutual interference between feature types is avoided. In particular, the semi-supervised loss function consists of a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function, so that the learned features better describe the differences between the target categories without interference from irrelevant factors. The invention solves the problem of low recognition rates caused by different speech features being mixed together and can effectively improve recognition accuracy.
Figure 1 is a flow chart of the speech feature decomposition.
Figure 2 is a flow chart of the unsupervised feature learning.
Figure 3 is a structure diagram of the semi-supervised speech feature decomposition.
Figure 1 shows the overall idea of the method. First, the speech is preprocessed to obtain a spectrogram; spectrogram blocks of different sizes are fed into the unsupervised feature learning network SAE, and convolution kernels of different sizes are pre-trained; convolution and pooling operations then form the local invariant feature y. As the input of a semi-supervised convolutional neural network, y is decomposed into four types of features by minimizing four different loss function terms.
The preprocessed speech signal is divided into spectrogram blocks of size li × hi, where i indexes the block sizes. The blocks of different sizes are fed into the unsupervised feature learning network SAE to pre-train convolution kernels of different sizes; the entire spectrogram is then convolved with each kernel size to obtain a set of feature maps, max pooling is applied to the feature maps, and the pooled features are finally stacked to form the local invariant feature y, as shown in Figure 2. As the input of a semi-supervised convolutional neural network, y is decomposed into four types of features through four different loss function terms. The semi-supervised loss function consists of a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function. The parameters of the four loss function terms are obtained by minimizing the total loss, decomposing y into four types of features that serve different recognition needs, as shown in Figure 3. All features participate in the reconstruction, while each type of feature is constrained by its corresponding discriminant loss function.
The invention first preprocesses the speech, obtains a set of local invariant features with an unsupervised learning algorithm based on convolutional neural networks, and then uses a semi-supervised learning algorithm based on convolutional neural networks to decompose the local invariant features into four types: emotion-related features, gender-related features, age-related features, and other-factor-related features. The specific steps are as follows:
Step 1: first convert the time-domain signal into a spectrogram, using a 20 ms window with 10 ms overlap; then apply PCA for dimensionality reduction and whitening, keeping 60 principal components, which finally produces a 60 × n spectrogram. A number of non-overlapping 60 × 15 segments are extracted from it, and from each 60 × 15 segment, spectrogram blocks of two sizes, 60 × 6 and 60 × 10, are extracted.
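Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the FFT-magnitude spectrogram, the 16 kHz sampling rate, and the block-extraction stride are assumptions, since the patent does not specify how the spectrogram values are computed.

```python
import numpy as np

def spectrogram_frames(signal, sr=16000, win_ms=20, hop_ms=10):
    """Split a 1-D signal into 20 ms frames with 10 ms hop and take
    magnitude FFT bins (illustrative stand-in for the spectrogram)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))       # (n_frames, win // 2 + 1)

def pca_whiten(spec, n_components=60, eps=1e-5):
    """Project frame vectors onto the top principal components and whiten,
    producing the 60 x n spectrogram."""
    x = spec - spec.mean(axis=0)
    cov = x.T @ x / len(x)
    vals, vecs = np.linalg.eigh(cov)                 # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_components]
    w = vecs[:, order] / np.sqrt(vals[order] + eps)  # whitening projection
    return (x @ w).T                                 # (60, n_frames)

def extract_blocks(spec60, seg_len=15, widths=(6, 10)):
    """Cut non-overlapping 60 x 15 segments, then all 60 x 6 / 60 x 10
    blocks inside each segment (stride 1 is an assumption)."""
    blocks = {w: [] for w in widths}
    for s in range(0, spec60.shape[1] - seg_len + 1, seg_len):
        seg = spec60[:, s:s + seg_len]
        for w in widths:
            for t in range(seg_len - w + 1):
                blocks[w].append(seg[:, t:t + w])
    return {w: np.stack(b) for w, b in blocks.items()}
```

One second of 16 kHz audio gives 99 frames, hence a 60 × 99 whitened spectrogram and six non-overlapping 60 × 15 segments.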
Step 2: feed the 60 × 6 and 60 × 10 spectrogram blocks into the SAE and learn, for each size, 120 convolution kernels as large as the input (60 × 6 and 60 × 10, respectively). Convolving the entire 60 × 15 spectrogram with these kernels yields 120 feature maps of size 1 × 10 and 120 of size 1 × 6; max pooling over every two frames then gives 120 features of size 1 × 5 and 120 of size 1 × 3. That is, the 60 × 6 kernels yield 600 features and the 60 × 10 kernels yield 360 features, and the total of 960 features serves as the semi-supervised input. The general steps of unsupervised feature learning are as follows.
The auto-encoder (AE) minimizes the total reconstruction loss:

JAE = Σx L(x, x′)  (1)

where x is an input spectrogram block (unlabeled here). h(x) is the encoding function, h(x) = s(ωx + α), where ω is the weight matrix and α is the bias; g is the decoding function, x′ = g(h(x)) = s(ωTh(x) + δ), where ωT is the transpose of ω and δ is the bias. L(x, x′) is the loss function, L(x, x′) = ||x − x′||2, the mean squared error.
The sparse autoencoder (SAE) adds a sparsity penalty term to the AE objective:

JSAE = JAE + λ Σj KL(ρ || ρ′j)  (2)
where KL(ρ || ρ′j) is the relative entropy between a Bernoulli random variable with mean ρ and one with mean ρ′j, and is used to control sparsity. ρ is the sparsity parameter, ρ′j is the average activation of hidden neuron j, n2 is the number of hidden nodes, m is the number of input nodes, and λ is a parameter controlling the sparsity term.
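The SAE objective can be illustrated with a small numeric sketch. The tied-weight decoder follows the AE definitions above; the exact weighting between the reconstruction and sparsity terms is an assumption.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def kl_bernoulli(rho, rho_hat):
    """Relative entropy between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sae_loss(X, W, alpha, delta, rho=0.05, lam=3.0):
    """Sparse autoencoder objective: mean squared reconstruction error
    plus lam * sum_j KL(rho || rho_hat_j) over hidden units."""
    H = sigmoid(X @ W.T + alpha)              # encode: h(x) = s(wx + a)
    Xr = sigmoid(H @ W + delta)               # decode: x' = s(w^T h(x) + d)
    recon = np.sum((X - Xr) ** 2, axis=1).mean()
    rho_hat = H.mean(axis=0)                  # average activation per hidden unit
    return recon + lam * kl_bernoulli(rho, rho_hat).sum()
```

With lam = 0 the loss reduces to the AE reconstruction error; a positive lam adds the sparsity penalty.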
Suppose there are n different input sizes, denoted li × hi (i = 1, 2, …, n). Minimizing the SAE objective for each size gives different convolution kernels (ωi, αi). Each kernel (ωi, αi) is then convolved over all li × hi blocks of the entire spectrogram:
fi(x) = s(conv(ωi, x) + αi)  (3)
The feature maps obtained by convolution are then divided into non-overlapping regions P = {p1, p2, …, pq}, and max pooling is applied to each region:

mi(pk) = max over t ∈ pk of fi(t)  (4)
For the i-th convolution kernel, the pooled features are stacked:

Fi(x) = [mi(p1), mi(p2), …, mi(pq)]  (5)
The pooled features of all convolution kernels are stacked to form the local invariant feature y:
y = F(x) = [F1(x), F2(x), …, Fn(x)]  (6)
The local invariant feature y serves as the input of the semi-supervised learning below.
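The convolution, pooling, and stacking of Step 2 can be sketched as follows for one 60 × 15 segment. The random kernels are stand-ins for the pre-trained SAE kernels; the sigmoid activation follows s(·) as defined above.

```python
import numpy as np

def conv_full_height(spec, kernels, bias):
    """Valid convolution of full-height kernels (60 x w) over a 60 x 15
    segment: each kernel yields a 1 x (15 - w + 1) feature map, eq. (3)."""
    k, h, w = kernels.shape
    t = spec.shape[1] - w + 1
    maps = np.empty((k, t))
    for j in range(t):
        maps[:, j] = np.tensordot(kernels, spec[:, j:j + w],
                                  axes=([1, 2], [0, 1]))
    return 1.0 / (1.0 + np.exp(-(maps + bias[:, None])))  # s(conv(w, x) + a)

def maxpool_2(maps):
    """Max pooling over non-overlapping pairs of frames, eqs. (4)-(5)."""
    k, t = maps.shape
    return maps[:, :t - t % 2].reshape(k, -1, 2).max(axis=2)

rng = np.random.default_rng(2)
seg = rng.standard_normal((60, 15))
y_parts = []
for w in (6, 10):                        # the two pre-trained kernel sizes
    kernels = rng.standard_normal((120, 60, w)) * 0.01
    bias = np.zeros(120)
    pooled = maxpool_2(conv_full_height(seg, kernels, bias))
    y_parts.append(pooled.reshape(-1))
y = np.concatenate(y_parts)              # local invariant feature, eq. (6)
```

The 60 × 6 kernels give 120 × 5 = 600 features and the 60 × 10 kernels give 120 × 3 = 360, reproducing the 960-dimensional y described above.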
Step 3: the above unsupervised learning yields the local invariant feature y. Semi-supervised learning based on convolutional neural networks (part of the input carries category labels) then decomposes y into four types of features: emotion-related, gender-related, age-related, and other-factor-related features. The loss function of the semi-supervised learning consists of four parts.
First, four encoding functions h(e)(y), h(s)(y), h(a)(y), h(o)(y) map y into four types of features, related to emotion, gender, age, and other factors respectively. The four encoding functions are as follows:
h(e)(y) = s(Ey + e)  (7)
h(s)(y) = s(Sy + s)  (8)
h(a)(y) = s(Ay + a)  (9)
h(o)(y) = s(Oy + o)  (10)
All four types of features participate in reconstructing y:
y′ = g([h(e)(y), h(s)(y), h(a)(y), h(o)(y)]) = s(ETh(e)(y) + STh(s)(y) + ATh(a)(y) + OTh(o)(y) + γ)  (11)
where γ is a compensation parameter that shifts the reconstruction toward the mean of y.
The reconstruction error function is therefore:
LRECON(y, y′) = ||y − y′||2  (12)
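The four encoding heads (7)-(10) and the joint reconstruction (11)-(12) can be sketched as follows; the 100-dimensional feature size per head is an arbitrary illustrative choice, not specified by the patent.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class FactorDecomposer:
    """Sketch of the four encoding heads (7)-(10) and the joint
    reconstruction (11); dimensions are illustrative."""

    def __init__(self, d_in=960, d_feat=100, seed=0):
        rng = np.random.default_rng(seed)
        # E, S, A, O: encoding weights; e, s, a, o: biases; gamma: offset
        self.W = {k: rng.standard_normal((d_feat, d_in)) * 0.01 for k in "esao"}
        self.b = {k: np.zeros(d_feat) for k in "esao"}
        self.gamma = np.zeros(d_in)

    def encode(self, y):
        """h(e), h(s), h(a), h(o): one sigmoid head per factor."""
        return {k: sigmoid(self.W[k] @ y + self.b[k]) for k in "esao"}

    def reconstruct(self, y):
        """y' = s(E^T h(e) + S^T h(s) + A^T h(a) + O^T h(o) + gamma)."""
        h = self.encode(y)
        pre = sum(self.W[k].T @ h[k] for k in "esao") + self.gamma
        return sigmoid(pre)

    def recon_error(self, y):
        """L_RECON(y, y') = ||y - y'||^2, eq. (12)."""
        return float(np.sum((y - self.reconstruct(y)) ** 2))
```

All four heads contribute to the reconstruction, mirroring the requirement that every feature type participates in reconstructing y.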
Next, the labeled data are used for category prediction. The input data are (x, z), where x is a spectrogram block and z = {z1, z2, z3} denotes the emotion label, gender label, and age label, respectively; z′ = {z′1, z′2, z′3} denotes the corresponding predicted labels. For example, equation (13) below predicts the j-th component z′1j of the emotion label by mapping h(e)(y) through U1j.
z′1j = s(U1jh(e)(y) + b1j)  (13)
z′2j = s(U2jh(s)(y) + b2j)  (14)
z′3j = s(U3jh(a)(y) + b3j)  (15)
The discriminant loss functions for the emotion, gender, and age labels are LDISCE(z1, z′1), LDISCS(z2, z′2), and LDISCA(z3, z′3), respectively, each measuring the difference between the predicted and true labels over that label's classes.
The total discriminant loss function is:

LDISC(z, z′) = LDISCE(z1, z′1) + LDISCS(z2, z′2) + LDISCA(z3, z′3)  (19)
where C1, C2, and C3 denote the number of categories of the emotion, gender, and age labels, respectively. Note that in this step the emotion-related features are constrained by the emotion discriminant penalty function, the gender-related features by the gender discriminant penalty function, and the age-related features by the age discriminant penalty function.
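The prediction heads (13)-(15) and the total discriminant loss (19) can be sketched as follows. Since the per-label loss formulas are not reproduced in this text, a squared-error form is assumed for each head; the class counts are illustrative.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def head_predict(h, U, b):
    """Per-class predictions z'_j = s(U_j h + b_j), as in (13)-(15)."""
    return sigmoid(U @ h + b)

def disc_loss(z_true, z_pred):
    """One label's discriminant loss over its classes. A squared-error
    form is ASSUMED here; the patent's exact formula is not shown."""
    return float(np.sum((z_true - z_pred) ** 2))

def total_disc_loss(labels, preds):
    """Total discriminant loss (19): sum over emotion, gender, age heads."""
    return sum(disc_loss(labels[k], preds[k])
               for k in ("emotion", "gender", "age"))
```

Only labeled examples contribute to this term, which is what makes the overall scheme semi-supervised.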
In addition, h(e)(y), h(s)(y), h(a)(y), and h(o)(y) should represent different directions of variation of y as far as possible; for example, to make h(e)(y) and h(s)(y) represent different directions, the corresponding encodings can be made as orthogonal as possible. This is enforced by the orthogonal loss function JORTH(y).
Finally, a saliency loss function can be used so that the four learned feature types better reflect the distinctions between the different target categories and are more stable. The saliency of each input i is measured by the summed saliency of the weights in φ(i), the weight set associated with input i, where ωk is the k-th weight and saliency is measured through the mean squared error (MSE). For the three feature types h(e)(y), h(s)(y), and h(a)(y), both the reconstruction error and the discriminant loss enter the saliency loss function; for h(o)(y), only the reconstruction error is considered. The resulting saliency loss function is denoted JSAL(y).
The reconstruction error function, discriminant loss function, orthogonal loss function, and saliency loss function together constitute the total loss function:
LLOSS(θ) = Σx∈D, y=F(x) LRECON(y, y′) + βJORTH(y) + Σ(x,z)∈S LDISC(z, z′) + ηJSAL(y)  (23)
where D is the entire data set (including both unlabeled and labeled data) and S is the labeled data set. β adjusts the contribution of the orthogonal loss function, β ∈ [0, 1]; η adjusts the contribution of the saliency loss function, η ∈ [0, 1]. The contribution weights β and η are set by a grid search with a step size of 0.1. The parameters are θ = {E, S, A, O, U, e, s, a, o, γ, b}.
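The grid search over the contribution weights β and η can be sketched as follows, with a stand-in loss in place of the full objective (23); the fixed component values are purely illustrative.

```python
import numpy as np
from itertools import product

def grid_search(loss_fn, step=0.1):
    """Grid search over beta, eta in [0, 1] with the patent's step of 0.1,
    returning the pair minimizing the given loss."""
    grid = np.round(np.arange(0.0, 1.0 + step / 2, step), 10)
    return min(product(grid, grid), key=lambda be: loss_fn(*be))

# illustrative stand-in for the total loss (23) with fixed component terms
recon, orth, disc, sal = 4.0, 2.0, 1.0, 3.0
loss = lambda beta, eta: recon + beta * orth + disc + eta * sal
beta, eta = grid_search(loss)
```

The grid has 11 × 11 = 121 candidate (β, η) pairs; with the stand-in loss above, which only grows with β and η, the search returns (0.0, 0.0).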
The parameter weights and biases of the four loss functions are obtained by minimizing the total loss function, thereby decomposing y into the four types of features.
Claims (1)
- A variable factor decomposition method for semi-supervised speech features, characterized by comprising the following steps:
  Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply principal component analysis (PCA) for dimensionality reduction and whitening, and extract spectrogram blocks of different sizes from the result;
  Step 2, unsupervised local invariant feature learning: feed the spectrogram blocks of different sizes into the SAE for unsupervised feature learning to pre-train convolution kernels of different sizes; then convolve the entire spectrogram with each kernel size to obtain a set of feature maps, apply max pooling to the feature maps, and finally stack the pooled features to form a local invariant feature y;
  Step 3, semi-supervised feature learning based on convolutional neural networks: use the local invariant feature y as the input of a semi-supervised learning algorithm and, through four different loss functions, decompose y into four types of features: emotion-related features, gender-related features, age-related features, and features related to other factors including noise and language; the loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discriminant loss function, an orthogonal loss function, and a saliency loss function.
  For the reconstruction error function, all four types of features participate in reconstructing the local invariant feature y, and the error is the mean squared error; for the discriminant loss function, class prediction is first performed on the labeled data, and the difference between the predicted labels and the true labels is taken as the value of the loss; the orthogonal loss function aims to make the four types of features mutually orthogonal, representing different directions of the input local invariant feature y; the saliency loss function aims to learn features that reflect only the differences between the target categories and are more class-discriminative; the parameters of the four loss functions, including biases and weights, are obtained by minimizing the total loss, yielding the four types of features.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410229537.5A CN104021373B (en) | 2014-05-27 | 2014-05-27 | Semi-supervised speech feature variable factor decomposition method |
CN201410229537.5 | 2014-05-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015180368A1 true WO2015180368A1 (en) | 2015-12-03 |
Family
ID=51438118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/088539 WO2015180368A1 (en) | 2014-05-27 | 2014-10-14 | Variable factor decomposition method for semi-supervised speech features |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104021373B (en) |
WO (1) | WO2015180368A1 (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104021373B (en) * | 2014-05-27 | 2017-02-15 | Jiangsu University | Semi-supervised speech feature variable factor decomposition method
CN104408470B (en) * | 2014-12-01 | 2017-07-25 | ThunderSoft Co., Ltd. | Gender screening method based on average-face pre-learning
CN105989368A (en) * | 2015-02-13 | 2016-10-05 | Spreadtrum Communications (Tianjin) Co., Ltd. | Target detection method and apparatus, and mobile terminal
CN105070288B (en) * | 2015-07-02 | 2018-08-07 | Baidu Online Network Technology (Beijing) Co., Ltd. | In-vehicle voice command recognition method and apparatus
CN105321525B (en) * | 2015-09-30 | 2019-02-22 | Beijing University of Posts and Telecommunications | System and method for reducing VoIP communication resource overhead
CN105550679B (en) * | 2016-02-29 | 2019-02-15 | Shenzhen Qianhai Yongyida Robot Co., Ltd. | Judgment method for robot periodic monitoring recording
US10579860B2 (en) * | 2016-06-06 | 2020-03-03 | Samsung Electronics Co., Ltd. | Learning model for salient facial region detection
CN110089135A (en) * | 2016-10-19 | 2019-08-02 | Audible Reality Inc. | System and method for generating audio image
CN106847309A (en) * | 2017-01-09 | 2017-06-13 | South China University of Technology | Speech emotion recognition method
CN108461092B (en) * | 2018-03-07 | 2022-03-08 | Yanshan University | Method for analyzing Parkinson's disease voice
CN110148400B (en) * | 2018-07-18 | 2023-03-17 | Tencent Technology (Shenzhen) Co., Ltd. | Pronunciation type recognition method, model training method, device and equipment
US11606663B2 (en) | 2018-08-29 | 2023-03-14 | Audible Reality Inc. | System for and method of controlling a three-dimensional audio engine
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | Communication University of China | Speech emotion recognition method and system
CN110070895B (en) * | 2019-03-11 | 2021-06-22 | Jiangsu University | Mixed sound event detection method based on factor decomposition of supervised variational encoder
CN110705339A (en) * | 2019-04-15 | 2020-01-17 | China University of Petroleum (East China) | C-C3D-based sign language recognition method
CN110297928A (en) * | 2019-07-02 | 2019-10-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Expression picture recommendation method, apparatus, device and storage medium
CN111009262A (en) * | 2019-12-24 | 2020-04-14 | Ctrip Computer Technology (Shanghai) Co., Ltd. | Voice gender identification method and system
CN114037059A (en) * | 2021-11-05 | 2022-02-11 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Pre-training model, model generation method, data processing method and data processing device
CN115240649B (en) * | 2022-07-19 | 2023-04-18 | Yu Zhenhua | Voice recognition method and system based on deep learning
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1150852A (en) * | 1994-06-06 | 1997-05-28 | Motorola, Inc. | Speech-recognition system utilizing neural networks and method of using same
CN1151218A (en) * | 1994-06-03 | 1997-06-04 | Motorola, Inc. | Method of training neural networks used for speech recognition
CN1275746A (en) * | 1994-04-28 | 2000-12-06 | Motorola, Inc. | Equipment for converting text into audio signal by using neural network
CN1280697A (en) * | 1998-02-03 | 2001-01-17 | Siemens AG | Method for voice data transmission
CN1975856A (en) * | 2006-10-30 | 2007-06-06 | Zou Cairong | Speech emotion recognition method based on support vector machine
CN103400145A (en) * | 2013-07-19 | 2013-11-20 | Beijing Institute of Technology | Audio-visual fusion emotion recognition method based on hint neural networks
CN104021373A (en) * | 2014-05-27 | 2014-09-03 | Jiangsu University | Semi-supervised speech feature variable factor decomposition method
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8582807B2 (en) * | 2010-03-15 | 2013-11-12 | Nec Laboratories America, Inc. | Systems and methods for determining personal characteristics |
CN102222500A (en) * | 2011-05-11 | 2011-10-19 | Beihang University | Feature extraction and modeling method for Chinese speech emotion incorporating emotion points
2014
- 2014-05-27 CN CN201410229537.5A patent/CN104021373B/en active Active
- 2014-10-14 WO PCT/CN2014/088539 patent/WO2015180368A1/en active Application Filing
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11093818B2 (en) | 2016-04-11 | 2021-08-17 | International Business Machines Corporation | Customer profile learning based on semi-supervised recurrent neural network using partially labeled sequence data |
CN106803069A (en) * | 2016-12-29 | 2017-06-06 | Nanjing University of Posts and Telecommunications | Crowd happiness level recognition method based on deep learning
CN106803069B (en) * | 2016-12-29 | 2021-02-09 | Nanjing University of Posts and Telecommunications | Crowd happiness level recognition method based on deep learning
CN106919710A (en) * | 2017-03-13 | 2017-07-04 | Southeast University | Dialect classification method based on convolutional neural networks
CN108021910A (en) * | 2018-01-04 | 2018-05-11 | Qingdao Agricultural University | Analysis method for pseudocarps based on spectral recognition and deep learning
CN110503128A (en) * | 2018-05-18 | 2019-11-26 | Baidu USA LLC | Waveform synthesis from spectrograms using convolutional generative adversarial networks
CN108899075A (en) * | 2018-06-28 | 2018-11-27 | ZhongAn Information Technology Service Co., Ltd. | DSA image detection method, apparatus and device based on deep learning
CN109117943A (en) * | 2018-07-24 | 2019-01-01 | University of Science and Technology of China | Method for enhancing network representation learning by utilizing multi-attribute information
CN109117943B (en) * | 2018-07-24 | 2022-09-30 | University of Science and Technology of China | Method for enhancing network representation learning by utilizing multi-attribute information
CN109065021A (en) * | 2018-10-18 | 2018-12-21 | Jiangsu Normal University | End-to-end dialect identification method based on conditional deep convolutional generative adversarial network
CN109543727A (en) * | 2018-11-07 | 2019-03-29 | Fudan University | Semi-supervised anomaly detection method based on competitive reconstruction learning
CN109559736A (en) * | 2018-12-05 | 2019-04-02 | China Jiliang University | Automatic dubbing method for movie actors based on adversarial network
CN109559736B (en) * | 2018-12-05 | 2022-03-08 | China Jiliang University | Automatic dubbing method for movie actors based on adversarial network
CN110009025A (en) * | 2019-03-27 | 2019-07-12 | Henan University of Technology | Semi-supervised additive noise autoencoder for speech lie detection
CN110009025B (en) * | 2019-03-27 | 2023-03-24 | Henan University of Technology | Semi-supervised additive noise autoencoder for speech lie detection
CN110084850A (en) * | 2019-04-04 | 2019-08-02 | Southeast University | Dynamic scene visual localization method based on image semantic segmentation
CN110363139A (en) * | 2019-07-15 | 2019-10-22 | Shanghai Dianji Industrial Co., Ltd. | Digital signal processing method and system
CN110738168A (en) * | 2019-10-14 | 2020-01-31 | Chang'an University | Distributed strain microcrack detection system and method based on stacked convolutional autoencoder
CN111179941A (en) * | 2020-01-06 | 2020-05-19 | iFlytek Co., Ltd. | Smart device wake-up method, registration method and apparatus
CN111179941B (en) * | 2020-01-06 | 2022-10-04 | iFlytek Co., Ltd. | Smart device wake-up method, registration method and apparatus
CN111832650A (en) * | 2020-07-14 | 2020-10-27 | Xidian University | Semi-supervised image classification method based on generative adversarial network with local aggregation coding
CN111832650B (en) * | 2020-07-14 | 2023-08-01 | Xidian University | Semi-supervised image classification method based on generative adversarial network with local aggregation coding
CN112735478A (en) * | 2021-01-29 | 2021-04-30 | South China University of Technology | Speech emotion recognition method based on additive angular penalty focal loss
Also Published As
Publication number | Publication date |
---|---|
CN104021373B (en) | 2017-02-15 |
CN104021373A (en) | 2014-09-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2015180368A1 (en) | Variable factor decomposition method for semi-supervised speech features | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
Palo et al. | Wavelet based feature combination for recognition of emotions | |
Hsu et al. | Unsupervised learning of disentangled and interpretable representations from sequential data | |
Mao et al. | Learning salient features for speech emotion recognition using convolutional neural networks | |
Deng et al. | Recognizing emotions from whispered speech based on acoustic feature transfer learning | |
CN110136690A (en) | Speech synthesis method, device and computer readable storage medium | |
CN105047194A (en) | Self-learning spectrogram feature extraction method for speech emotion recognition | |
CN111461176A (en) | Multi-mode fusion method, device, medium and equipment based on normalized mutual information | |
Wei et al. | A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model | |
CN115359576A (en) | Multi-modal emotion recognition method and device, electronic equipment and storage medium | |
CN115393933A (en) | Video face emotion recognition method based on frame attention mechanism | |
Xia et al. | Audiovisual speech recognition: A review and forecast | |
Dua et al. | Optimizing integrated features for Hindi automatic speech recognition system | |
Zheng et al. | MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios | |
Singkul et al. | Vector learning representation for generalized speech emotion recognition | |
JP2015175859A (en) | Pattern recognition device, pattern recognition method, and pattern recognition program | |
CN111462762B (en) | Speaker vector regularization method and device, electronic equipment and storage medium | |
Cetin | Accent recognition using a spectrogram image feature-based convolutional neural network | |
CN116434759B (en) | Speaker identification method based on SRS-CL network | |
CN110226201A (en) | Speech recognition using periodic indications | |
Tailor et al. | Deep learning approach for spoken digit recognition in Gujarati language | |
CN114626424A (en) | Data enhancement-based silent speech recognition method and device | |
CN114913871A (en) | Target object classification method, system, electronic device and storage medium | |
Mu et al. | Self-supervised disentangled representation learning for robust target speech extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14893660; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 14893660; Country of ref document: EP; Kind code of ref document: A1 |