CN104021373A - Semi-supervised speech feature variable factor decomposition method - Google Patents
- Publication number: CN104021373A (application CN201410229537.5A; granted publication CN104021373B)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention discloses a semi-supervised speech feature variable factor decomposition method. Speech features are divided into four types: emotion-related features, gender-related features, age-related features, and features related to other factors such as noise and language. First, the speech signal is preprocessed to obtain a spectrogram; spectrogram blocks of different sizes are fed into an unsupervised feature learning network (a sparse autoencoder, SAE), and pre-training yields convolution kernels of different sizes. These kernels are then used to convolve the whole spectrogram, producing a set of feature maps; max pooling is applied to the feature maps, and the pooled features are stacked to form a local invariant feature y. This feature y serves as the input of a semi-supervised convolutional neural network, which decomposes it into the four feature types by minimizing four different loss terms. The method solves the problem of low recognition accuracy caused by the mixing of emotion, gender, age, and other speech factors; it can serve different recognition tasks based on speech signals and can be extended to decompose additional factors.
Description
Technical field
The invention belongs to the field of speech recognition and specifically relates to a method for decomposing speech features.
Background technology
As computers penetrate every corner of life, computing platforms of all kinds need more convenient input media, and voice has naturally become one of users' best choices. In general, speech carries much information: the speaker's identity, the spoken content, and the speaker's emotion, gender, and age. In recent years, the maturing of related applications has driven the development of speech-based recognition technology for emotion, gender, age, and spoken content. For example, a traditional call center usually assigns a random agent to provide telephone counseling and cannot offer personalized service according to the user's emotion, gender, and age; this has motivated judging a client's emotion, gender, and age from the voice alone and providing more personalized voice service on that basis. However, in existing speech-based emotion, gender, and age recognition tasks, the features extracted by traditional methods often mix factors such as emotion, gender, age, spoken content, and language, which are difficult to separate from one another, so recognition performance suffers.
In the paper "Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks" by Dong Yu et al., a deep neural network is used to learn deep features, but such a feature may mix several factors, such as emotion, gender, and age; if it is used for speech emotion recognition, the recognition rate may suffer from the influence of the other factors present in the feature. To date, no feature extraction method has been able to separately extract the different types of features in a speech signal. To overcome this defect of the prior art, the present invention uses semi-supervised feature learning based on convolutional neural networks to decompose speech features into four classes: emotion-related features, gender-related features, age-related features, and other-factor-related features, which can serve different speech-based recognition tasks. After further extension, the invention can also decompose additional factors.
Summary of the invention
The object of the present invention is to provide a semi-supervised speech feature variable factor decomposition method, so that the decomposed features are not disturbed by factors irrelevant to the recognition task and more distinctly reflect the differences between target classes, thereby improving recognition accuracy.
To solve the above technical problem, the present invention first preprocesses the speech to obtain a spectrogram, then obtains a local invariant feature through unsupervised learning based on convolutional neural networks, and finally applies a semi-supervised learning method: constrained by four loss terms (a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function), the local invariant feature obtained by unsupervised learning is decomposed into four classes: emotion-related features, gender-related features, age-related features, and other-factor-related features. These can be used respectively for emotion recognition, gender recognition, and age recognition, effectively improving recognition accuracy. The concrete technical scheme is as follows:
A semi-supervised speech feature variable factor decomposition method, characterized by comprising the following steps:
Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply PCA for dimensionality reduction and whitening, and extract spectrogram blocks of different sizes from the result;
Step 2, unsupervised local invariant feature learning: use the spectrogram blocks as input to the unsupervised feature learning network SAE; by feeding in blocks of different sizes, pre-training yields convolution kernels of different sizes; convolve the whole spectrogram with each kernel to obtain a set of feature maps, apply max pooling to the feature maps, and finally stack the pooled features to form the local invariant feature y;
Step 3, semi-supervised feature learning based on convolutional neural networks: use the local invariant feature y as input to a semi-supervised learning algorithm, and decompose y into four classes of features by means of four different loss terms; the four classes comprise emotion-related features, gender-related features, age-related features, and other-factor-related features covering noise and language; the loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function;
For the reconstruction error function, all four feature classes participate in reconstructing the local invariant feature y, with squared error as the error measure. For the discrimination loss function, classification is first predicted on the labeled data, and the difference between the predicted and true labels is taken as the value of the loss. The orthogonality loss function makes the four feature classes mutually orthogonal, so that they represent different directions of the input feature y. The saliency loss function drives learning toward features that reflect only the differences between target classes and are therefore more discriminative. Minimizing the total loss yields the parameters (biases and weights) of the four loss terms and thereby the four classes of features.
The present invention has beneficial effects. The semi-supervised feature learning decomposes the local invariant feature into four classes (emotion-related, gender-related, age-related, and other-factor-related features), so that different feature types serve different recognition tasks, avoiding mutual interference between them. In particular, because the semi-supervised loss function consists of a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function, the learned features better describe the differences between target classes and are not disturbed by irrelevant factors. The invention thus solves the problem of low recognition rates caused by mixed speech features and effectively improves recognition accuracy.
Brief description of the drawings
Fig. 1 is the speech feature decomposition flowchart.
Fig. 2 is the unsupervised feature learning flowchart.
Fig. 3 is the semi-supervised speech feature decomposition diagram.
Embodiment
Fig. 1 shows the overall idea of the method. First, the speech is preprocessed to obtain a spectrogram; spectrogram blocks of different sizes are fed into the unsupervised feature learning network SAE, and pre-training yields convolution kernels of different sizes; convolution and pooling operations then form the local invariant feature y. As input to the semi-supervised convolutional neural network, y is decomposed into four classes of features by minimizing four different loss terms.
The preprocessed speech signal is divided into spectrogram blocks of size l_i × h_i, where i indexes the block size. The blocks of different sizes are fed into the unsupervised feature learning network SAE, and pre-training yields convolution kernels of different sizes. Each kernel is then convolved with the whole spectrogram to obtain a set of feature maps; max pooling is applied to the feature maps, and the pooled features are stacked to form the local invariant feature y, as shown in Fig. 2. As input to the semi-supervised convolutional neural network, y is decomposed into four classes of features by four different loss terms. The semi-supervised loss function consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function. Minimizing the loss function yields the parameters of the four loss terms, and the decomposition produces four classes of features serving different recognition tasks, as shown in Fig. 3. All features participate in the reconstruction, while each feature type is constrained by its corresponding discrimination loss.
The present invention first preprocesses the speech, uses an unsupervised learning algorithm based on convolutional neural networks to obtain a group of local invariant features, and then uses a semi-supervised learning algorithm based on convolutional neural networks to decompose the local invariant feature into four classes: emotion-related features, gender-related features, age-related features, and other-factor-related features. The concrete steps are as follows:
Step 1: the time-domain signal is first converted into a spectrogram, using a window size of 20 ms with a 10 ms overlap. PCA dimensionality reduction and whitening with 60 principal components then yields a 60 × n spectrogram. From it, several non-overlapping 60 × 15 spectrogram segments are extracted, and from each 60 × 15 segment, spectrogram blocks of two sizes, 60 × 6 and 60 × 10, are extracted.
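Step 1 can be sketched in a few lines of NumPy. This is a hedged illustration (the function names, the Hann window, and the synthetic test signal are assumptions not stated in the patent), covering the 20 ms / 10 ms framing, PCA whitening to 60 components, and extraction of non-overlapping 60 × 15 segments:

```python
import numpy as np

def spectrogram(signal, sr=16000, win_ms=20, hop_ms=10):
    """Magnitude spectrogram with 20 ms windows and 10 ms hop, as in Step 1."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [signal[i:i + win] * np.hanning(win)
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

def pca_whiten(spec, n_components=60, eps=1e-5):
    """Project each frame onto the top principal components and whiten."""
    x = spec - spec.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:n_components]
    w = vecs[:, order] / np.sqrt(vals[order] + eps)
    return w.T @ x  # (60, n) whitened spectrogram

def extract_patches(spec60, width=15):
    """Cut the 60 x n spectrogram into non-overlapping 60 x 15 segments."""
    n = spec60.shape[1] // width
    return [spec60[:, i * width:(i + 1) * width] for i in range(n)]
```

For example, a 2 s signal at 16 kHz gives 199 frames, a 60 × 199 whitened spectrogram, and thirteen 60 × 15 segments.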
Step 2: the 60 × 6 and 60 × 10 spectrogram blocks are fed into the SAE, which learns 120 convolution kernels of each size (60 × 6 and 60 × 10, matching the input blocks). Each set of kernels is then convolved with the whole 60 × 15 segment, yielding 120 feature maps of size 1 × 10 and 120 of size 1 × 6; max pooling over every two frames then gives 120 features of size 1 × 5 and 120 of size 1 × 3. The 60 × 6 kernels thus contribute 600 features and the 60 × 10 kernels contribute 360, for a total of 960 features, which form the input to the semi-supervised stage. The general unsupervised feature learning procedure is introduced next.
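The convolution and pooling arithmetic of Step 2 can be checked with a small sketch (random kernels stand in for the SAE-pretrained ones; all names are illustrative): full-height kernels of widths 6 and 10 over a 60 × 15 segment give 1 × 10 and 1 × 6 maps, and pooling every two frames yields 600 + 360 = 960 features.

```python
import numpy as np

def conv_full_height(patch, kernels):
    """Valid 2-D convolution where each kernel spans the full 60-bin height,
    so each kernel yields a 1 x (15 - kw + 1) feature map."""
    kh, kw = kernels.shape[1:]
    out_w = patch.shape[1] - kw + 1
    maps = np.empty((kernels.shape[0], out_w))
    for k, ker in enumerate(kernels):
        for t in range(out_w):
            maps[k, t] = np.sum(ker * patch[:, t:t + kw])
    return 1.0 / (1.0 + np.exp(-maps))  # sigmoid activation

def max_pool(maps, pool=2):
    """Non-overlapping max pooling over every two frames."""
    w = maps.shape[1] // pool
    return maps[:, :w * pool].reshape(maps.shape[0], w, pool).max(axis=2)

rng = np.random.default_rng(1)
patch = rng.standard_normal((60, 15))
k6 = rng.standard_normal((120, 60, 6)) * 0.01
k10 = rng.standard_normal((120, 60, 10)) * 0.01

f6 = max_pool(conv_full_height(patch, k6))     # 120 maps of 1x10 -> 1x5
f10 = max_pool(conv_full_height(patch, k10))   # 120 maps of 1x6  -> 1x3
y = np.concatenate([f6.ravel(), f10.ravel()])  # 600 + 360 = 960 features
```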
The objective function of the autoencoder (AE) is:

J_AE(θ) = Σ_{x∈D} L(x, g(h(x)))   (1)

where x is an input spectrogram block (unlabeled here); h(x) = s(ωx + α) is the encoding function, with weight matrix ω and bias α; g(·) is the decoding function, x′ = g(h(x)) = s(ω^T h(x) + δ), where ω^T is the transpose of ω and δ is a bias; and L(x, x′) = ||x − x′||² is the squared-error loss.
The sparse autoencoder (SAE) adds a sparsity penalty to the AE objective. The objective function of the SAE is:

J_SAE(θ) = J_AE(θ) + λ Σ_{j=1}^{n₂} KL(ρ ‖ ρ′_j)   (2)

where KL(ρ ‖ ρ′_j) = ρ log(ρ/ρ′_j) + (1 − ρ) log((1 − ρ)/(1 − ρ′_j)) is the relative entropy between two Bernoulli random variables with means ρ and ρ′_j, used to control sparsity; ρ is the sparsity parameter; ρ′_j is the average activation of hidden neuron j; n₂ is the number of hidden nodes; m is the number of input nodes; and λ controls the weight of the sparsity penalty.
Suppose there are n different input sizes, denoted l_i × h_i (i = 1, 2, …, n). Minimizing J_SAE(θ) yields a convolution kernel (ω_i, α_i) for each size. Each kernel (ω_i, α_i) is convolved with all l_i × h_i spectrogram blocks of the whole spectrogram:

f_i(x) = s(conv(ω_i, x) + α_i)   (3)
The feature maps obtained by convolution are then divided into non-overlapping regions P = {p₁, p₂, …, p_q}, and max pooling is applied to each region:

F_i^{p_j}(x) = max_{t∈p_j} f_i(x)_t   (4)

For the i-th convolution kernel, the pooled features are stacked:

F_i(x) = [F_i^{p_1}(x), F_i^{p_2}(x), …, F_i^{p_q}(x)]   (5)

The pooled features of all convolution kernels are stacked to form the local invariant feature y:

y = F(x) = [F₁(x), F₂(x), …, F_n(x)]   (6)
The local invariant feature y serves as the input of the semi-supervised learning that follows.
Step 3: through the above unsupervised learning, the local invariant feature y has been obtained. Semi-supervised learning based on convolutional neural networks (part of the input carries class labels) then decomposes y into four classes of features: emotion-related features, gender-related features, age-related features, and other-factor-related features. The loss function of this semi-supervised learning consists of four parts.
First, four encoding functions h^(e)(y), h^(s)(y), h^(a)(y), h^(o)(y) map y to the four feature classes, associated with emotion, gender, age, and other factors respectively:

h^(e)(y) = s(Ey + e)   (7)
h^(s)(y) = s(Sy + s)   (8)
h^(a)(y) = s(Ay + a)   (9)
h^(o)(y) = s(Oy + o)   (10)
All four feature classes participate in reconstructing y:

y′ = g([h^(e)(y), h^(s)(y), h^(a)(y), h^(o)(y)])
   = s(E^T h^(e)(y) + S^T h^(s)(y) + A^T h^(a)(y) + O^T h^(o)(y) + γ)   (11)

where γ is a compensating bias near the mean of y.
The reconstruction error function is then:

L_RECON(y, y′) = ||y − y′||²   (12)
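Equations (7) through (12) can be illustrated with randomly initialized weights; the dimensions 960 and 100 and all variable names here are assumptions for illustration, not values fixed by the patent:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

dim_y, dim_h = 960, 100   # local invariant feature size / per-class code size (assumed)
rng = np.random.default_rng(2)
E, S, A, O = (rng.standard_normal((dim_h, dim_y)) * 0.01 for _ in range(4))
e_b, s_b, a_b, o_b = (np.zeros(dim_h) for _ in range(4))
gamma = np.zeros(dim_y)   # compensating bias of eq. (11)

y = rng.standard_normal(dim_y)
# Four encodings, eqs. (7)-(10): one per factor (emotion, gender, age, other).
he, hs, ha, ho = (sigmoid(W @ y + b)
                  for W, b in ((E, e_b), (S, s_b), (A, a_b), (O, o_b)))
# Joint reconstruction from all four codes, eq. (11).
y_rec = sigmoid(E.T @ he + S.T @ hs + A.T @ ha + O.T @ ho + gamma)
recon_loss = np.sum((y - y_rec) ** 2)  # eq. (12)
```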
Next, the labeled data are used for class prediction. The input is a pair (x, z), where x is a spectrogram block and z = {z₁, z₂, z₃} are the emotion label, gender label, and age label respectively; z′ = {z′₁, z′₂, z′₃} are the corresponding predicted labels. For example, equation (13) predicts the j-th component z′_{1j} of the emotion label from h^(e)(y) through the mapping U_{1j}:

z′_{1j} = s(U_{1j} h^(e)(y) + b_{1j})   (13)
z′_{2j} = s(U_{2j} h^(s)(y) + b_{2j})   (14)
z′_{3j} = s(U_{3j} h^(a)(y) + b_{3j})   (15)
The discrimination loss functions for the emotion, gender, and age labels are respectively:

L_DISCE(z₁, z′₁) = Σ_{j=1}^{C₁} (z_{1j} − z′_{1j})²   (16)
L_DISCS(z₂, z′₂) = Σ_{j=1}^{C₂} (z_{2j} − z′_{2j})²   (17)
L_DISCA(z₃, z′₃) = Σ_{j=1}^{C₃} (z_{3j} − z′_{3j})²   (18)

The total discrimination loss is:

L_DISC(z, z′) = L_DISCE(z₁, z′₁) + L_DISCS(z₂, z′₂) + L_DISCA(z₃, z′₃)   (19)

where C₁, C₂, C₃ are the numbers of classes of the emotion, gender, and age labels respectively. Note in particular that in this step the emotion-related features are constrained by the emotion discrimination loss (16), the gender-related features by the gender discrimination loss (17), and the age-related features by the age discrimination loss (18).
In addition, h^(e)(y), h^(s)(y), h^(a)(y), h^(o)(y) should represent, as far as possible, different directions of variation of y. For example, to make h^(e)(y) and h^(s)(y) represent different directions, the weight matrices E and S can be made as orthogonal as possible. The orthogonality loss function is:

J_ORTH(y) = Σ_{(i,j)} ||W_i W_j^T||_F²   (20)

where the sum runs over all pairs of the encoder weight matrices W ∈ {E, S, A, O} and ||·||_F is the Frobenius norm.
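One common way to realize such an orthogonality penalty (stated here as an assumption, not as the patent's exact formula) is to penalize overlap between the encoder weight matrices pairwise:

```python
import numpy as np
from itertools import combinations

def orth_loss(mats):
    """Sum over matrix pairs of ||W_i W_j^T||_F^2; this is zero exactly when
    the rows of different matrices are mutually orthogonal, i.e. when the
    four feature classes pick out different directions of y."""
    return sum(float(np.sum((Wi @ Wj.T) ** 2))
               for Wi, Wj in combinations(mats, 2))
```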
Finally, the saliency loss function can be used so that the four learned feature classes better reflect the distinctions between target classes and are more stable across those classes. The saliency of each input i is measured by summing the saliency of its weights: concretely, it is computed over φ(i), the set of weights connected to input i, where ω_k is the k-th such weight and MSE is the squared error (equation (21)). For the three feature classes h^(e)(y), h^(s)(y), h^(a)(y), both the reconstruction error and the discrimination loss enter the saliency loss function, while for h^(o)(y) only the reconstruction error is considered; together these terms form the saliency loss J_SAL (equation (22)).
The reconstruction error function, discrimination loss function, orthogonality loss function, and saliency loss function together form the total loss function:

L_LOSS(θ) = Σ_{x∈D, y=F(x)} L_RECON(y, y′) + β J_ORTH(y) + Σ_{(x,z)∈S} L_DISC(z, z′) + η J_SAL(y)   (23)

where D is the whole data set (both unlabeled and labeled data) and S is the labeled data set. β ∈ [0, 1] adjusts the contribution of the orthogonality loss, and η ∈ [0, 1] adjusts the contribution of the saliency loss; the contribution weights β and η are set by grid search with step 0.1. The parameters are θ = {E, S, A, O, U, e, s, a, o, γ, b}.
Minimizing the loss function yields the weights and biases of the four loss terms, and the decomposition thereby produces the four classes of features.
Claims (1)
1. A semi-supervised speech feature variable factor decomposition method, characterized by comprising the following steps:
Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply PCA for dimensionality reduction and whitening, and extract spectrogram blocks of different sizes from the result;
Step 2, unsupervised local invariant feature learning: use the spectrogram blocks as input to the unsupervised feature learning network SAE; by feeding in blocks of different sizes, pre-training yields convolution kernels of different sizes; convolve the whole spectrogram with each kernel to obtain a set of feature maps, apply max pooling to the feature maps, and finally stack the pooled features to form the local invariant feature y;
Step 3, semi-supervised feature learning based on convolutional neural networks: use the local invariant feature y as input to a semi-supervised learning algorithm, and decompose y into four classes of features by means of four different loss terms; the four classes comprise emotion-related features, gender-related features, age-related features, and other-factor-related features covering noise and language; the loss function of the semi-supervised learning consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function;
For the reconstruction error function, all four feature classes participate in reconstructing the local invariant feature y, with squared error as the error measure. For the discrimination loss function, classification is first predicted on the labeled data, and the difference between the predicted and true labels is taken as the value of the loss. The orthogonality loss function makes the four feature classes mutually orthogonal, so that they represent different directions of the input feature y. The saliency loss function drives learning toward features that reflect only the differences between target classes and are therefore more discriminative. Minimizing the total loss yields the parameters (biases and weights) of the four loss terms and thereby the four classes of features.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410229537.5A (CN104021373B) | 2014-05-27 | 2014-05-27 | Semi-supervised speech feature variable factor decomposition method |
PCT/CN2014/088539 (WO2015180368A1) | 2014-05-27 | 2014-10-14 | Variable factor decomposition method for semi-supervised speech features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104021373A | 2014-09-03 |
CN104021373B | 2017-02-15 |
Family
ID=51438118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410229537.5A Active CN104021373B (en) | 2014-05-27 | 2014-05-27 | Semi-supervised speech feature variable factor decomposition method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104021373B (en) |
WO (1) | WO2015180368A1 (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104408470A (en) * | 2014-12-01 | 2015-03-11 | 中科创达软件股份有限公司 | Gender detection method based on average face preliminary learning |
CN105070288A (en) * | 2015-07-02 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Vehicle-mounted voice instruction recognition method and device |
WO2015180368A1 (en) * | 2014-05-27 | 2015-12-03 | 江苏大学 | Variable factor decomposition method for semi-supervised speech features |
CN105321525A (en) * | 2015-09-30 | 2016-02-10 | 北京邮电大学 | System and method for reducing VOIP (voice over internet protocol) communication resource overhead |
CN105550679A (en) * | 2016-02-29 | 2016-05-04 | 深圳前海勇艺达机器人有限公司 | Judgment method for robot to cyclically monitor recording |
CN105989368A (en) * | 2015-02-13 | 2016-10-05 | 展讯通信(天津)有限公司 | Target detection method and apparatus, and mobile terminal |
CN106847309A (en) * | 2017-01-09 | 2017-06-13 | 华南理工大学 | A kind of speech-emotion recognition method |
CN108461092A (en) * | 2018-03-07 | 2018-08-28 | 燕山大学 | A method of to Parkinson's disease speech analysis |
CN109564618A (en) * | 2016-06-06 | 2019-04-02 | 三星电子株式会社 | Learning model for the detection of significant facial area |
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | 中国传媒大学 | A kind of speech-emotion recognition method and system |
CN110070895A (en) * | 2019-03-11 | 2019-07-30 | 江苏大学 | A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition |
CN110089135A (en) * | 2016-10-19 | 2019-08-02 | 奥蒂布莱现实有限公司 | System and method for generating audio image |
CN110148400A (en) * | 2018-07-18 | 2019-08-20 | 腾讯科技(深圳)有限公司 | The pronunciation recognition methods of type, the training method of model, device and equipment |
CN110297928A (en) * | 2019-07-02 | 2019-10-01 | 百度在线网络技术(北京)有限公司 | Recommended method, device, equipment and the storage medium of expression picture |
CN110503128A (en) * | 2018-05-18 | 2019-11-26 | 百度(美国)有限责任公司 | The spectrogram that confrontation network carries out Waveform composition is generated using convolution |
CN110705339A (en) * | 2019-04-15 | 2020-01-17 | 中国石油大学(华东) | C-C3D-based sign language identification method |
CN111009262A (en) * | 2019-12-24 | 2020-04-14 | 携程计算机技术(上海)有限公司 | Voice gender identification method and system |
CN114037059A (en) * | 2021-11-05 | 2022-02-11 | 北京百度网讯科技有限公司 | Pre-training model, model generation method, data processing method and data processing device |
CN115240649A (en) * | 2022-07-19 | 2022-10-25 | 于振华 | Voice recognition method and system based on deep learning |
US11606663B2 (en) | 2018-08-29 | 2023-03-14 | Audible Reality Inc. | System for and method of controlling a three-dimensional audio engine |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11093818B2 (en) | 2016-04-11 | 2021-08-17 | International Business Machines Corporation | Customer profile learning based on semi-supervised recurrent neural network using partially labeled sequence data |
CN106803069B (en) * | 2016-12-29 | 2021-02-09 | 南京邮电大学 | Crowd happiness degree identification method based on deep learning |
CN106919710A (en) * | 2017-03-13 | 2017-07-04 | 东南大学 | A kind of dialect sorting technique based on convolutional neural networks |
CN108021910A (en) * | 2018-01-04 | 2018-05-11 | 青岛农业大学 | The analysis method of Pseudocarps based on spectrum recognition and deep learning |
CN108899075A (en) * | 2018-06-28 | 2018-11-27 | 众安信息技术服务有限公司 | A kind of DSA image detecting method, device and equipment based on deep learning |
CN109117943B (en) * | 2018-07-24 | 2022-09-30 | 中国科学技术大学 | Method for enhancing network representation learning by utilizing multi-attribute information |
CN109065021B (en) * | 2018-10-18 | 2023-04-18 | 江苏师范大学 | End-to-end dialect identification method for generating countermeasure network based on conditional deep convolution |
CN109543727B (en) * | 2018-11-07 | 2022-12-20 | 复旦大学 | Semi-supervised anomaly detection method based on competitive reconstruction learning |
CN109559736B (en) * | 2018-12-05 | 2022-03-08 | 中国计量大学 | Automatic dubbing method for movie actors based on confrontation network |
CN110009025B (en) * | 2019-03-27 | 2023-03-24 | 河南工业大学 | Semi-supervised additive noise self-encoder for voice lie detection |
CN110084850B (en) * | 2019-04-04 | 2023-05-23 | 东南大学 | Dynamic scene visual positioning method based on image semantic segmentation |
CN110363139B (en) * | 2019-07-15 | 2020-09-18 | 上海点积实业有限公司 | Digital signal processing method and system |
CN110738168B (en) * | 2019-10-14 | 2023-02-14 | 长安大学 | Distributed strain micro crack detection system and method based on stacked convolution self-encoder |
CN111179941B (en) * | 2020-01-06 | 2022-10-04 | 科大讯飞股份有限公司 | Intelligent device awakening method, registration method and device |
CN111832650B (en) * | 2020-07-14 | 2023-08-01 | 西安电子科技大学 | Image classification method based on generation of antagonism network local aggregation coding semi-supervision |
CN112735478B (en) * | 2021-01-29 | 2023-07-18 | 华南理工大学 | Voice emotion recognition method based on additive angle punishment focus loss |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1975856A (en) * | 2006-10-30 | 2007-06-06 | 邹采荣 | Speech emotion identifying method based on supporting vector machine |
US20110222724A1 (en) * | 2010-03-15 | 2011-09-15 | Nec Laboratories America, Inc. | Systems and methods for determining personal characteristics |
CN102222500A (en) * | 2011-05-11 | 2011-10-19 | 北京航空航天大学 | Extracting method and modeling method for Chinese speech emotion combining emotion points |
CN103400145A (en) * | 2013-07-19 | 2013-11-20 | 北京理工大学 | Voice-vision fusion emotion recognition method based on hint nerve networks |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU675389B2 (en) * | 1994-04-28 | 1997-01-30 | Motorola, Inc. | A method and apparatus for converting text into audible signals using a neural network |
US5509103A (en) * | 1994-06-03 | 1996-04-16 | Motorola, Inc. | Method of training neural networks used for speech recognition |
WO1995034064A1 (en) * | 1994-06-06 | 1995-12-14 | Motorola Inc. | Speech-recognition system utilizing neural networks and method of using same |
CN1120469C (en) * | 1998-02-03 | 2003-09-03 | 西门子公司 | Method for voice data transmission |
CN104021373B (en) * | 2014-05-27 | 2017-02-15 | 江苏大学 | Semi-supervised speech feature variable factor decomposition method |
- 2014-05-27 CN CN201410229537.5A patent/CN104021373B/en active Active
- 2014-10-14 WO PCT/CN2014/088539 patent/WO2015180368A1/en active Application Filing
Non-Patent Citations (3)
Title |
---|
Nwe T L et al.: "Speech emotion recognition using hidden Markov models", Speech Communication * |
Zhang Shiqing et al.: "Speech emotion recognition based on an improved supervised manifold learning algorithm", Journal of Electronics & Information Technology * |
Mao Qirong et al.: "Small-sample speech emotion recognition method combining an over-complete dictionary with PCA", Journal of Jiangsu University * |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015180368A1 (en) * | 2014-05-27 | 2015-12-03 | 江苏大学 | Variable factor decomposition method for semi-supervised speech features |
CN104408470B (en) * | 2014-12-01 | 2017-07-25 | 中科创达软件股份有限公司 | Gender detection method based on average-face pre-learning |
CN104408470A (en) * | 2014-12-01 | 2015-03-11 | 中科创达软件股份有限公司 | Gender detection method based on average face preliminary learning |
CN105989368A (en) * | 2015-02-13 | 2016-10-05 | 展讯通信(天津)有限公司 | Target detection method and apparatus, and mobile terminal |
CN105070288A (en) * | 2015-07-02 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Vehicle-mounted voice instruction recognition method and device |
WO2017000489A1 (en) * | 2015-07-02 | 2017-01-05 | 百度在线网络技术(北京)有限公司 | On-board voice command identification method and apparatus, and storage medium |
US10446150B2 (en) | 2015-07-02 | 2019-10-15 | Baidu Online Network Technology (Beijing) Co. Ltd. | In-vehicle voice command recognition method and apparatus, and storage medium |
CN105070288B (en) * | 2015-07-02 | 2018-08-07 | 百度在线网络技术(北京)有限公司 | Vehicle-mounted voice instruction identification method and device |
CN105321525B (en) * | 2015-09-30 | 2019-02-22 | 北京邮电大学 | System and method for reducing VOIP communication resource overhead |
CN105321525A (en) * | 2015-09-30 | 2016-02-10 | 北京邮电大学 | System and method for reducing VOIP (voice over internet protocol) communication resource overhead |
CN105550679A (en) * | 2016-02-29 | 2016-05-04 | 深圳前海勇艺达机器人有限公司 | Judgment method for robot to cyclically monitor recording |
CN105550679B (en) * | 2016-02-29 | 2019-02-15 | 深圳前海勇艺达机器人有限公司 | Judgment method for robot cyclic monitoring of recording |
CN109564618B (en) * | 2016-06-06 | 2023-11-24 | 三星电子株式会社 | Method and system for facial image analysis |
CN109564618A (en) * | 2016-06-06 | 2019-04-02 | 三星电子株式会社 | Learning model for the detection of significant facial area |
CN110089135A (en) * | 2016-10-19 | 2019-08-02 | 奥蒂布莱现实有限公司 | System and method for generating audio image |
US11516616B2 (en) | 2016-10-19 | 2022-11-29 | Audible Reality Inc. | System for and method of generating an audio image |
CN106847309A (en) * | 2017-01-09 | 2017-06-13 | 华南理工大学 | Speech emotion recognition method |
CN108461092A (en) * | 2018-03-07 | 2018-08-28 | 燕山大学 | A method of to Parkinson's disease speech analysis |
CN108461092B (en) * | 2018-03-07 | 2022-03-08 | 燕山大学 | Method for analyzing Parkinson's disease voice |
CN110503128A (en) * | 2018-05-18 | 2019-11-26 | 百度(美国)有限责任公司 | Spectrogram-to-waveform synthesis using convolutional generative adversarial networks |
CN110148400A (en) * | 2018-07-18 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Pronunciation type recognition method, model training method, device and equipment |
CN110148400B (en) * | 2018-07-18 | 2023-03-17 | 腾讯科技(深圳)有限公司 | Pronunciation type recognition method, model training method, device and equipment |
US11606663B2 (en) | 2018-08-29 | 2023-03-14 | Audible Reality Inc. | System for and method of controlling a three-dimensional audio engine |
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | 中国传媒大学 | Speech emotion recognition method and system |
WO2020181998A1 (en) * | 2019-03-11 | 2020-09-17 | 江苏大学 | Method for detecting mixed sound event on basis of factor decomposition of supervised variational encoder |
CN110070895A (en) * | 2019-03-11 | 2019-07-30 | 江苏大学 | Mixed sound event detection method based on supervised variational encoder factor decomposition |
CN110705339A (en) * | 2019-04-15 | 2020-01-17 | 中国石油大学(华东) | C-C3D-based sign language identification method |
CN110297928A (en) * | 2019-07-02 | 2019-10-01 | 百度在线网络技术(北京)有限公司 | Recommended method, device, equipment and the storage medium of expression picture |
CN111009262A (en) * | 2019-12-24 | 2020-04-14 | 携程计算机技术(上海)有限公司 | Voice gender identification method and system |
CN114037059A (en) * | 2021-11-05 | 2022-02-11 | 北京百度网讯科技有限公司 | Pre-training model, model generation method, data processing method and data processing device |
CN115240649A (en) * | 2022-07-19 | 2022-10-25 | 于振华 | Voice recognition method and system based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN104021373B (en) | 2017-02-15 |
WO2015180368A1 (en) | 2015-12-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104021373A (en) | Semi-supervised speech feature variable factor decomposition method | |
Hsu et al. | Unsupervised learning of disentangled and interpretable representations from sequential data | |
US10699719B1 (en) | System and method for taxonomically distinguishing unconstrained signal data segments | |
CN106952649A (en) | Method for distinguishing speek person based on convolutional neural networks and spectrogram | |
CN110164452A (en) | Voiceprint recognition method, model training method, and server | |
CN106104674A (en) | Mixing voice identification | |
CN111128209B (en) | Speech enhancement method based on mixed masking learning target | |
Sprechmann et al. | Supervised non-euclidean sparse NMF via bilevel optimization with applications to speech enhancement | |
CN103456302A (en) | Emotion speaker recognition method based on emotion GMM model weight synthesis | |
Lee et al. | Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities | |
Wiem et al. | Unsupervised single channel speech separation based on optimized subspace separation | |
Li et al. | A si-sdr loss function based monaural source separation | |
JP2020071482A (en) | Word sound separation method, word sound separation model training method and computer readable medium | |
Das et al. | Towards Transferable Speech Emotion Representation: on loss functions for cross-lingual latent representations | |
US20200211569A1 (en) | Audio signal processing | |
McVicar et al. | Learning to separate vocals from polyphonic mixtures via ensemble methods and structured output prediction | |
Yue et al. | Equilibrium optimizer for emotion classification from english speech signals | |
CN106205636A (en) | Speech emotion recognition feature fusion method based on the MRMR criterion | |
CN113707172B (en) | Single-channel voice separation method, system and computer equipment of sparse orthogonal network | |
Roberts et al. | Deep learning-based single-ended quality prediction for time-scale modified audio | |
Szekrényes et al. | Classification of formal and informal dialogues based on turn-taking and intonation using deep neural networks | |
Medikonda et al. | An information set-based robust text-independent speaker authentication | |
Bisio et al. | Speaker recognition exploiting D2D communications paradigm: Performance evaluation of multiple observations approaches | |
US12014728B2 (en) | Dynamic combination of acoustic model states | |
Franzoni et al. | Crowd emotional sounds: spectrogram-based analysis using convolutional neural network. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |