CN104021373A - Semi-supervised speech feature variable factor decomposition method - Google Patents

Semi-supervised speech feature variable factor decomposition method

Info

Publication number
CN104021373A
CN104021373A (application CN201410229537.5A)
Authority
CN
China
Prior art keywords
feature
loss function
semi-supervised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410229537.5A
Other languages
Chinese (zh)
Other versions
CN104021373B (en)
Inventor
毛启容
黄正伟
薛文韬
于永斌
詹永照
苟建平
邢玉萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201410229537.5A priority Critical patent/CN104021373B/en
Publication of CN104021373A publication Critical patent/CN104021373A/en
Priority to PCT/CN2014/088539 priority patent/WO2015180368A1/en
Application granted granted Critical
Publication of CN104021373B publication Critical patent/CN104021373B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semi-supervised speech feature variable factor decomposition method. Speech features are divided into four types: emotion-related features, gender-related features, age-related features, and features related to other factors such as noise and language. First, the speech is preprocessed to obtain a spectrogram; spectrogram patches of different sizes are input to an unsupervised feature-learning network (SAE), and pre-training yields convolution kernels of different sizes. These kernels are then used to convolve the whole spectrogram, producing a number of feature maps; max pooling is applied to the feature maps, and the pooled features are finally stacked to form a local invariant feature y. y serves as the input of a semi-supervised convolutional neural network, which decomposes y into four feature types by minimizing four different loss-function terms. The method solves the problem of low recognition accuracy caused by the mixing of emotion, gender, age, and other factors in speech features; it can serve different speech-signal-based recognition needs and can also be extended to decompose more factors.

Description

Semi-supervised speech feature variable factor decomposition method
Technical field
The invention belongs to the field of speech recognition, and specifically relates to a method for decomposing speech features.
Background technology
As computers penetrate every corner of life, all kinds of computing platforms need more convenient input means, and speech has naturally become one of the best choices for users. In general, speech carries a great deal of information, such as the speaker's identity, the spoken content, and the speaker's emotion, gender, and age. In recent years, the steady improvement of various applications has driven the development of speech-based recognition technologies for emotion, gender, age, spoken content, and other attributes. For example, a traditional call center usually assigns a random agent to provide telephone consultation and cannot offer personalized service according to the user's emotion, gender, and age; this has motivated systems that judge a customer's emotion, gender, and age from the voice and, on that basis, provide more personalized voice services. However, in existing speech-based emotion, gender, and age recognition tasks, the features extracted by traditional methods often mix factors such as emotion, gender, age, spoken content, and language, which are difficult to separate from one another, so the recognition performance is poor.
In the paper by Dong Yu et al. entitled "Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks", a deep neural network is used to learn deep features, but such a feature may mix several factors, such as emotion, gender, and age; if this feature is used for speech emotion recognition, the recognition rate may be affected by the other factors it contains. So far, no feature extraction method has been able to separately extract the different types of features in a speech signal. To overcome this defect of the prior art, the present invention uses semi-supervised feature learning based on convolutional neural networks to decompose speech features into four types: emotion-related features, gender-related features, age-related features, and other-factor-related features, which can serve different speech-based recognition tasks. After further extension, the invention can also decompose more factors.
Summary of the invention
The object of the present invention is to provide a semi-supervised speech feature variable factor decomposition method, so that the decomposed features are not disturbed by factors irrelevant to the recognition task and more clearly reflect the differences between the classes of the recognition target, thereby improving recognition accuracy.
To solve the above technical problem, the present invention first preprocesses the speech to obtain a spectrogram, then obtains local invariant features through unsupervised learning based on convolutional neural networks, and then adopts a semi-supervised learning method in which four loss functions (a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function) constrain the local invariant features obtained by the unsupervised learning and decompose them into four types: emotion-related features, gender-related features, age-related features, and other-factor-related features, which can be used respectively for emotion recognition, gender recognition, and age recognition and can effectively improve recognition accuracy. The concrete technical scheme is as follows:
A semi-supervised speech feature variable factor decomposition method, characterized by comprising the following steps:
Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply principal component analysis (PCA) for dimensionality reduction and whitening, and extract spectrogram patches of different sizes from the result;
Step 2, unsupervised local invariant feature learning: take said spectrogram patches as the input of an unsupervised feature-learning sparse autoencoder (SAE); by inputting patches of different sizes, pre-training yields convolution kernels of different sizes; convolve the whole spectrogram with each of said kernels to obtain a number of feature maps, then apply max pooling to said feature maps, and finally stack the pooled features to form a local invariant feature y;
Step 3, semi-supervised feature learning based on convolutional neural networks: take said local invariant feature y as the input of a semi-supervised learning algorithm and, using the method of semi-supervised learning based on convolutional neural networks, decompose y into four feature types through four different loss functions; said four feature types comprise emotion-related features, gender-related features, age-related features, and other-factor-related features covering noise and language; the loss function of said semi-supervised learning consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function;
For said reconstruction error function, all four feature types participate in reconstructing the local invariant feature y, and the error is the squared error; for said discrimination loss function, class prediction is first performed on the labeled data, and the difference between the predicted labels and the true labels is taken as the value of the discrimination loss; for said orthogonality loss function, the aim is to make the four feature types mutually orthogonal so that they represent different directions of the input local invariant feature y; for said saliency loss function, the aim is to learn features that reflect only the differences between the classes of the recognition target and are more class-discriminative; the parameters of the four loss functions, comprising biases and weights, are obtained by minimizing said loss function, thereby obtaining said four feature types.
The present invention has beneficial effects. The semi-supervised feature learning of the invention decomposes the local invariant feature into four feature types, namely emotion-related, gender-related, age-related, and other-factor-related features, so that different feature types serve different recognition needs and the mutual interference between feature types is avoided. In particular, the loss function of the semi-supervised learning consists of a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function, so that the learned features better describe the differences between the classes of the recognition target and are not disturbed by irrelevant factors. The invention thus solves the problem of low recognition rate caused by mixing different speech features together and can effectively improve recognition accuracy.
Brief description of the drawings
Fig. 1 is the flowchart of speech feature decomposition.
Fig. 2 is the flowchart of unsupervised feature learning.
Fig. 3 is the diagram of semi-supervised speech feature decomposition.
Embodiment
Fig. 1 shows the general idea of the method of the invention. First, the speech is preprocessed to obtain a spectrogram; spectrogram patches of different sizes are input to the unsupervised feature-learning network SAE, and pre-training yields convolution kernels of different sizes; then, through convolution and pooling operations, the local invariant feature y is formed. y serves as the input of the semi-supervised convolutional neural network, which decomposes y into four feature types by minimizing four different loss-function terms.
The preprocessed speech signal is divided into spectrogram patches of size $l_i \times h_i$, where i indexes the patch sizes. The patches of different sizes are input to the unsupervised feature-learning network SAE, and pre-training yields convolution kernels of different sizes; the whole spectrogram is then convolved with each of these kernels to obtain a number of feature maps, max pooling is applied to the feature maps, and the pooled features are finally stacked to form the local invariant feature y, as shown in Fig. 2. y serves as the input of the semi-supervised convolutional neural network, which decomposes y into four feature types through four different loss-function terms. The semi-supervised loss function consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function. The parameters of the four loss-function terms are obtained by minimizing the loss function, and the decomposition yields four feature types serving different recognition needs, as shown in Fig. 3. All features participate in the reconstruction, and each feature type is constrained by its corresponding discrimination loss.
The present invention first preprocesses the speech and uses the unsupervised learning algorithm based on convolutional neural networks to obtain a set of local invariant features; it then uses the semi-supervised learning algorithm based on convolutional neural networks to decompose the local invariant features into four feature types: emotion-related, gender-related, age-related, and other-factor-related features. The concrete steps are as follows:
Step 1: the time-domain signal is first converted into a spectrogram, with a window size of 20 ms and an overlap of 10 ms; then PCA is used for dimensionality reduction and whitening, keeping 60 principal components, which finally produces a 60 × n spectrogram. Several non-overlapping 60 × 15 segments are extracted from it. From each 60 × 15 segment, patches of two sizes, 60 × 6 and 60 × 10, are extracted.
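As a concrete illustration of Step 1, the following Python sketch builds the 60 × n whitened spectrogram and cuts it into 60 × 15 segments. The STFT settings beyond the stated 20 ms window and 10 ms overlap, the log-magnitude step, and the function name `preprocess` are assumptions for illustration, not part of the patent.

```python
import numpy as np
from scipy.signal import stft

def preprocess(signal, sr=16000, n_keep=60):
    """Sketch of Step 1: 20 ms windows, 10 ms overlap, then PCA to 60
    whitened components, producing a 60 x n 'spectrogram' that is cut
    into non-overlapping 60 x 15 segments."""
    nper = int(0.020 * sr)                       # 20 ms window
    nover = int(0.010 * sr)                      # 10 ms overlap
    _, _, Z = stft(signal, fs=sr, nperseg=nper, noverlap=nover)
    S = np.log(np.abs(Z) + 1e-10).T              # frames x freq bins (assumed log-magnitude)
    S -= S.mean(axis=0)                          # center before PCA
    U, _, _ = np.linalg.svd(S, full_matrices=False)
    white = U[:, :n_keep] * np.sqrt(S.shape[0])  # 60 whitened components per frame
    spectrogram = white.T                        # 60 x n
    # non-overlapping 60 x 15 segments; 60x6 and 60x10 patches for SAE
    # training are later cut from each segment
    return [spectrogram[:, t:t + 15]
            for t in range(0, spectrogram.shape[1] - 14, 15)]
```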
Step 2: the 60 × 6 and 60 × 10 patches are input to the SAE separately, and learning yields 120 convolution kernels of size 60 × 6 and 120 of size 60 × 10, the same sizes as their inputs. These two sets of kernels are then convolved over the whole 60 × 15 segment, producing 120 feature maps of size 1 × 10 and 120 of size 1 × 6; max pooling is then applied every two frames, giving 120 features of size 1 × 5 and 120 of size 1 × 3. The 60 × 6 kernels thus yield 600 features and the 60 × 10 kernels yield 360 features; these 960 features in total serve as the semi-supervised input. The general unsupervised feature-learning procedure is introduced next.
The objective function of the autoencoder (AE, Auto-Encoder) is as follows:
$J_{AE}(\theta) = \sum_{x \in D} L(x, g(h(x)))$  (1)
where x is an input spectrogram patch, unlabeled here. $h(x) = s(\omega x + \alpha)$ is the encoding function, where ω is the weight matrix and α is the bias; g is the decoding function, $x' = g(h(x)) = s(\omega^{T} h(x) + \delta)$, where $\omega^{T}$ is the transpose of ω and δ is a bias. $L(x, x') = \|x - x'\|^{2}$ is the loss function, the squared error.
The sparse autoencoder (SAE) adds a sparsity penalty to the objective function of the AE. The objective function of the SAE is as follows:
$J_{SAE}(\theta) = \sum_{x \in D} L(x, g(h(x))) + \lambda \sum_{j=1}^{n_2} KL(\rho \| \rho'_j)$  (2)
where $KL(\rho \| \rho'_j)$ is the relative entropy between two Bernoulli random variables with means ρ and $\rho'_j$, used to control sparsity; ρ is the sparsity parameter; $\rho'_j$ is the average activation of hidden neuron j; $n_2$ is the number of hidden nodes and m is the number of input nodes; λ is the parameter controlling the strength of the sparsity penalty.
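A minimal NumPy sketch of the SAE objective (2), assuming sigmoid activations and the tied-weight decoder of equation (1); the function name `sae_loss` and the default values of ρ and λ are illustrative only.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sae_loss(X, W, a, d, rho=0.05, lam=3.0):
    """J_SAE of eq. (2): squared reconstruction error plus a KL
    sparsity penalty on the average hidden activations.

    X : (m, d_in) batch of flattened spectrogram patches
    W : (n2, d_in) encoder weights (decoder reuses W.T, as in eq. (1))
    a : (n2,) encoder bias;  d : (d_in,) decoder bias
    """
    H = sigmoid(X @ W.T + a)                           # h(x) = s(Wx + a)
    Xr = sigmoid(H @ W + d)                            # g(h(x)) = s(W^T h(x) + d)
    recon = np.sum((X - Xr) ** 2)                      # sum of L(x, g(h(x)))
    rho_j = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)    # mean activation of unit j
    kl = np.sum(rho * np.log(rho / rho_j) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_j)))
    return recon + lam * kl                            # eq. (2)
```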
Suppose there are n different input sizes, denoted $l_i \times h_i$ (i = 1, 2, ..., n). Minimizing $J_{SAE}(\theta)$ yields the corresponding convolution kernels $(\omega_i, \alpha_i)$. Each kernel $(\omega_i, \alpha_i)$ is convolved with all $l_i \times h_i$ patches of the whole spectrogram:
$f^{i}(x) = s(\mathrm{conv}(\omega_i, x) + \alpha_i)$  (3)
The feature maps obtained by the convolution are then divided into non-overlapping regions $P = \{p_1, p_2, \ldots, p_q\}$, and max pooling is performed over each region:
$F_j^{i}(x) = \max_{k \in p_j} f_k^{i}(x), \quad j \in [1, q]$  (4)
For the i-th convolution kernel, the pooled features are stacked:
$F^{i}(x) = [F_1^{i}(x), F_2^{i}(x), \ldots, F_q^{i}(x)]$  (5)
The pooled features of all convolution kernels are stacked to form the local invariant feature y:
$y = F(x) = [F^{1}(x), F^{2}(x), \ldots, F^{n}(x)]$  (6)
The local invariant feature y serves as the input of the semi-supervised learning below.
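The convolution and pooling of equations (3) to (6) can be sketched as follows; the helper name `local_invariant_feature` is hypothetical, and sigmoid is assumed for s. With 120 kernels of size 60 × 6 and 120 of size 60 × 10 applied to a 60 × 15 segment and pooling over every two frames, this yields the 600 + 360 = 960 features described above.

```python
import numpy as np

def local_invariant_feature(spectrogram, kernels, biases, pool=2):
    """Eqs. (3)-(6): convolve full-height kernels over a spectrogram
    segment, max-pool non-overlapping regions, and stack the results.

    spectrogram : (60, T0) whitened segment (e.g. 60 x 15)
    kernels     : list of (60, w) SAE-learned kernels
    biases      : matching list of scalar biases
    """
    feats = []
    for W, b in zip(kernels, biases):
        w = W.shape[1]
        T = spectrogram.shape[1] - w + 1          # valid positions
        fmap = np.array([np.sum(W * spectrogram[:, t:t + w]) + b
                         for t in range(T)])      # conv(w_i, x) + a_i
        fmap = 1.0 / (1.0 + np.exp(-fmap))        # f^i(x), eq. (3)
        pooled = [fmap[j:j + pool].max()          # eq. (4), regions p_j
                  for j in range(0, T - pool + 1, pool)]
        feats.extend(pooled)                      # eqs. (5)-(6): stack
    return np.array(feats)                        # local invariant feature y
```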
Step 3: the above unsupervised learning yields the local invariant feature y. Semi-supervised learning based on convolutional neural networks (part of the input carries class labels) then decomposes y into four feature types: emotion-related features, gender-related features, age-related features, and other-factor-related features. The loss function of this semi-supervised learning consists of four parts.
First, four encoding functions $h^{(e)}(y)$, $h^{(s)}(y)$, $h^{(a)}(y)$, $h^{(o)}(y)$ map y to the four feature types, associated with emotion, gender, age, and other factors respectively. The four encoding functions are as follows:
$h^{(e)}(y) = s(Ey + e)$  (7)
$h^{(s)}(y) = s(Sy + s)$  (8)
$h^{(a)}(y) = s(Ay + a)$  (9)
$h^{(o)}(y) = s(Oy + o)$  (10)
All four feature types participate in reconstructing y:
$y' = g([h^{(e)}(y), h^{(s)}(y), h^{(a)}(y), h^{(o)}(y)]) = s(E^{T} h^{(e)}(y) + S^{T} h^{(s)}(y) + A^{T} h^{(a)}(y) + O^{T} h^{(o)}(y) + \gamma)$  (11)
where γ is a compensation parameter near the mean of y.
The reconstruction error function is therefore:
$L_{RECON}(y, y') = \|y - y'\|^{2}$  (12)
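A sketch of the four encoders (7)-(10) and the joint reconstruction (11)-(12), assuming sigmoid activations; the function name and argument shapes are illustrative.

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

def encode_decode(y, E, e, S, s_, A, a, O, o, gamma):
    """Eqs. (7)-(12): map y to the four feature types, then reconstruct
    y from all four of them and measure the squared error."""
    he = sigmoid(E @ y + e)                       # emotion, eq. (7)
    hs = sigmoid(S @ y + s_)                      # gender, eq. (8)
    ha = sigmoid(A @ y + a)                       # age, eq. (9)
    ho = sigmoid(O @ y + o)                       # other factors, eq. (10)
    y_rec = sigmoid(E.T @ he + S.T @ hs +
                    A.T @ ha + O.T @ ho + gamma)  # eq. (11)
    l_recon = np.sum((y - y_rec) ** 2)            # eq. (12)
    return (he, hs, ha, ho), y_rec, l_recon
```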
Next, class prediction is performed using the labeled data. The input data are (x, z), where x is a spectrogram patch and $z = \{z_1, z_2, z_3\}$ denotes the emotion, gender, and age labels respectively; $z' = \{z'_1, z'_2, z'_3\}$ denotes the predicted emotion, gender, and age labels. For example, equation (13) predicts the j-th component $z'_{1j}$ of the emotion label by applying the mapping $U_{1j}$ to $h^{(e)}(y)$:
$z'_{1j} = s(U_{1j} h^{(e)}(y) + b_{1j})$  (13)
$z'_{2j} = s(U_{2j} h^{(s)}(y) + b_{2j})$  (14)
$z'_{3j} = s(U_{3j} h^{(a)}(y) + b_{3j})$  (15)
The discrimination loss functions for the emotion, gender, and age labels are then, respectively:
$L_{DISCE}(z_1, z'_1) = \sum_{j=1}^{C_1} z_{1j} \log z'_{1j} + (1 - z_{1j}) \log(1 - z'_{1j})$  (16)
$L_{DISCS}(z_2, z'_2) = \sum_{j=1}^{C_2} z_{2j} \log z'_{2j} + (1 - z_{2j}) \log(1 - z'_{2j})$  (17)
$L_{DISCA}(z_3, z'_3) = \sum_{j=1}^{C_3} z_{3j} \log z'_{3j} + (1 - z_{3j}) \log(1 - z'_{3j})$  (18)
The total discrimination loss function is:
$L_{DISC}(z, z') = L_{DISCE}(z_1, z'_1) + L_{DISCS}(z_2, z'_2) + L_{DISCA}(z_3, z'_3)$  (19)
where $C_1$, $C_2$, $C_3$ are the numbers of classes of the emotion, gender, and age labels respectively. Note in particular that in this step the emotion-related feature is constrained by the emotion discrimination loss of equation (16), the gender-related feature by the gender discrimination loss of equation (17), and the age-related feature by the age discrimination loss of equation (18).
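A sketch of one discrimination branch, covering the prediction equations (13)-(15) and the cross-entropy terms (16)-(18). It is written as the usual negated cross-entropy so that minimizing it matches the intent of the text; the small epsilon is an added numerical safeguard.

```python
import numpy as np

def disc_loss(h, U, b, z):
    """One discrimination branch: predict the label from the matching
    feature (eqs. (13)-(15)) and score it with cross-entropy over the
    C classes (eqs. (16)-(18)), negated so that smaller is better."""
    z_pred = 1.0 / (1.0 + np.exp(-(U @ h + b)))   # sigmoid prediction
    eps = 1e-12                                   # numerical safeguard
    return -np.sum(z * np.log(z_pred + eps) +
                   (1 - z) * np.log(1 - z_pred + eps))
```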
In addition, in order that $h^{(e)}(y)$, $h^{(s)}(y)$, $h^{(a)}(y)$, $h^{(o)}(y)$ represent as far as possible different directions of variation of y (for example, to make $h^{(e)}(y)$ and $h^{(s)}(y)$ represent different directions, their Jacobians with respect to y can be made as orthogonal as possible), an orthogonality loss is imposed. The orthogonality loss function is:
$J_{ORTH}(y) = \sum_{i,j} \Big( \frac{\partial h_i^{(e)}(y)}{\partial y} \cdot \frac{\partial h_j^{(s)}(y)}{\partial y} \Big)^2 + \sum_{i,k} \Big( \frac{\partial h_i^{(e)}(y)}{\partial y} \cdot \frac{\partial h_k^{(a)}(y)}{\partial y} \Big)^2 + \sum_{i,l} \Big( \frac{\partial h_i^{(e)}(y)}{\partial y} \cdot \frac{\partial h_l^{(o)}(y)}{\partial y} \Big)^2 + \sum_{j,k} \Big( \frac{\partial h_j^{(s)}(y)}{\partial y} \cdot \frac{\partial h_k^{(a)}(y)}{\partial y} \Big)^2 + \sum_{j,l} \Big( \frac{\partial h_j^{(s)}(y)}{\partial y} \cdot \frac{\partial h_l^{(o)}(y)}{\partial y} \Big)^2 + \sum_{k,l} \Big( \frac{\partial h_k^{(a)}(y)}{\partial y} \cdot \frac{\partial h_l^{(o)}(y)}{\partial y} \Big)^2$  (20)
where i, j, k, l index the units of $h^{(e)}$, $h^{(s)}$, $h^{(a)}$, $h^{(o)}$ respectively.
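For sigmoid encoders $h = s(Wy + b)$, row i of the Jacobian $\partial h/\partial y$ is $h_i(1 - h_i)W_i$, so equation (20) can be sketched directly; the function name `orth_loss` is hypothetical.

```python
import numpy as np

def orth_loss(y, weights, biases):
    """Eq. (20): sum of squared dot products between the Jacobian rows
    of every pair of the four encoders (emotion, gender, age, other)."""
    jacobians = []
    for W, b in zip(weights, biases):                 # [E, S, A, O], [e, s, a, o]
        h = 1.0 / (1.0 + np.exp(-(W @ y + b)))
        jacobians.append((h * (1 - h))[:, None] * W)  # rows of dh/dy
    total = 0.0
    for m in range(4):
        for n in range(m + 1, 4):                     # the six encoder pairs
            total += np.sum((jacobians[m] @ jacobians[n].T) ** 2)
    return total
```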
Finally, a saliency loss function can be used so that the four learned feature types better reflect the distinctions between the classes of the recognition target and are more stable. The saliency of each input i is measured by the sum of the saliencies of its weights. Specifically, the saliency of input i is:
$S_i = \sum_{k \in \varphi(i)} \mathrm{Saliency}(\omega_k) = \frac{1}{2} \sum_{k \in \varphi(i)} \frac{\partial^2 \mathrm{MSE}}{\partial \omega_k^2} \omega_k^2$  (21)
where φ(i) is the set of weights associated with input i, $\omega_k$ is the k-th weight, and MSE is the squared error. For the three feature types $h^{(e)}(y)$, $h^{(s)}(y)$ and $h^{(a)}(y)$, the saliency loss function takes both the reconstruction error and the discrimination loss into account; for $h^{(o)}(y)$, only the reconstruction error is considered. The saliency loss function is therefore:
$J_{SAL}(y) = -\frac{1}{2} \sum_k \frac{\partial^2 \|y - y'\|^2}{\partial \omega_k^2} \omega_k^2 - \frac{1}{2} \sum_{k \in h^{(e)}(y)} \frac{\partial^2 L_{DISCE}(z_1, z'_1)}{\partial \omega_k^2} \omega_k^2 - \frac{1}{2} \sum_{k \in h^{(s)}(y)} \frac{\partial^2 L_{DISCS}(z_2, z'_2)}{\partial \omega_k^2} \omega_k^2 - \frac{1}{2} \sum_{k \in h^{(a)}(y)} \frac{\partial^2 L_{DISCA}(z_3, z'_3)}{\partial \omega_k^2} \omega_k^2$  (22)
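The second derivatives in equations (21) and (22) can be estimated in several ways; the sketch below uses central finite differences, which is an assumption (the text does not fix the estimator), and the function name `weight_saliency` is hypothetical.

```python
import numpy as np

def weight_saliency(loss_fn, w, eps=1e-4):
    """Eq. (21): Saliency(w_k) = 0.5 * d^2L/dw_k^2 * w_k^2, with the
    diagonal second derivative estimated by central finite differences
    (an assumed estimator)."""
    base = loss_fn(w)
    sal = np.empty_like(w)
    for k in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[k] += eps
        wm[k] -= eps
        d2 = (loss_fn(wp) - 2.0 * base + loss_fn(wm)) / eps ** 2
        sal[k] = 0.5 * d2 * w[k] ** 2             # per-weight saliency
    return sal
```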
The reconstruction error function, the discrimination loss function, the orthogonality loss function, and the saliency loss function together form the total loss function:
$L_{LOSS}(\theta) = \sum_{x \in D,\, y = F(x)} L_{RECON}(y, y') + \beta J_{ORTH}(y) + \sum_{(x,z) \in S} L_{DISC}(z, z') + \eta J_{SAL}(y)$  (23)
where D is the whole dataset (comprising both unlabeled and labeled data) and S is the labeled dataset. β adjusts the contribution of the orthogonality loss function, β ∈ [0, 1]; η adjusts the contribution of the saliency loss function, η ∈ [0, 1]. The contribution weights β and η are set by grid search with a step of 0.1. The parameter set is θ = {E, S, A, O, U, e, s, a, o, γ, b}.
The weights and biases of the four loss-function terms are obtained by minimizing the total loss function, and the decomposition thereby yields the four feature types.
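Finally, a sketch of how the pieces compose into the total loss (23). The `params` layout, the example β and η values, and the omission of the saliency term (noted in the comment) are all illustrative; in practice β and η are grid-searched over [0, 1] in steps of 0.1 as stated above, and θ is obtained by minimizing this function.

```python
def total_loss(D, S, params, beta=0.3, eta=0.2):
    """Eq. (23), composed from the sketches above. `params` bundles the
    SAE kernels, the encoder matrices/biases, and the label predictors;
    its layout is assumed for illustration."""
    loss = 0.0
    for x in D:                                   # labeled + unlabeled segments
        y = local_invariant_feature(x, *params["kernels"])
        _, _, l_recon = encode_decode(y, *params["enc"])
        loss += l_recon + beta * orth_loss(y, params["W4"], params["b4"])
    for x, (z1, z2, z3) in S:                     # labeled subset only
        y = local_invariant_feature(x, *params["kernels"])
        (he, hs, ha, ho), _, _ = encode_decode(y, *params["enc"])
        loss += (disc_loss(he, *params["Ue"], z1) +   # emotion branch
                 disc_loss(hs, *params["Us"], z2) +   # gender branch
                 disc_loss(ha, *params["Ua"], z3))    # age branch
    # the eta * J_SAL(y) term of eq. (22) would be added here via
    # weight_saliency(); omitted to keep the sketch short
    return loss
```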

Claims (1)

1. A semi-supervised speech feature variable factor decomposition method, characterized by comprising the following steps:
Step 1, preprocessing: preprocess the speech samples to obtain a spectrogram, then apply principal component analysis (PCA) for dimensionality reduction and whitening, and extract spectrogram patches of different sizes from the result;
Step 2, unsupervised local invariant feature learning: take said spectrogram patches as the input of an unsupervised feature-learning sparse autoencoder (SAE); by inputting patches of different sizes, pre-training yields convolution kernels of different sizes; convolve the whole spectrogram with each of said kernels to obtain a number of feature maps, then apply max pooling to said feature maps, and finally stack the pooled features to form a local invariant feature y;
Step 3, semi-supervised feature learning based on convolutional neural networks: take said local invariant feature y as the input of a semi-supervised learning algorithm and, using the method of semi-supervised learning based on convolutional neural networks, decompose y into four feature types through four different loss functions; said four feature types comprise emotion-related features, gender-related features, age-related features, and other-factor-related features covering noise and language; the loss function of said semi-supervised learning consists of four parts: a reconstruction error function, a discrimination loss function, an orthogonality loss function, and a saliency loss function;
For said reconstruction error function, all four feature types participate in reconstructing the local invariant feature y, and the error is the squared error; for said discrimination loss function, class prediction is first performed on the labeled data, and the difference between the predicted labels and the true labels is taken as the value of the discrimination loss; for said orthogonality loss function, the aim is to make the four feature types mutually orthogonal so that they represent different directions of the input local invariant feature y; for said saliency loss function, the aim is to learn features that reflect only the differences between the classes of the recognition target and are more class-discriminative; the parameters of the four loss functions, comprising biases and weights, are obtained by minimizing said loss function, thereby obtaining said four feature types.
CN201410229537.5A 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method Active CN104021373B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410229537.5A CN104021373B (en) 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method
PCT/CN2014/088539 WO2015180368A1 (en) 2014-05-27 2014-10-14 Variable factor decomposition method for semi-supervised speech features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410229537.5A CN104021373B (en) 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method

Publications (2)

Publication Number Publication Date
CN104021373A true CN104021373A (en) 2014-09-03
CN104021373B CN104021373B (en) 2017-02-15

Family

ID=51438118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410229537.5A Active CN104021373B (en) 2014-05-27 2014-05-27 Semi-supervised speech feature variable factor decomposition method

Country Status (2)

Country Link
CN (1) CN104021373B (en)
WO (1) WO2015180368A1 (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408470A (en) * 2014-12-01 2015-03-11 中科创达软件股份有限公司 Gender detection method based on average face preliminary learning
CN105070288A (en) * 2015-07-02 2015-11-18 百度在线网络技术(北京)有限公司 Vehicle-mounted voice instruction recognition method and device
WO2015180368A1 (en) * 2014-05-27 2015-12-03 江苏大学 Variable factor decomposition method for semi-supervised speech features
CN105321525A (en) * 2015-09-30 2016-02-10 北京邮电大学 System and method for reducing VOIP (voice over internet protocol) communication resource overhead
CN105550679A (en) * 2016-02-29 2016-05-04 深圳前海勇艺达机器人有限公司 Judgment method for robot to cyclically monitor recording
CN105989368A (en) * 2015-02-13 2016-10-05 展讯通信(天津)有限公司 Target detection method and apparatus, and mobile terminal
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN108461092A (en) * 2018-03-07 2018-08-28 燕山大学 A method of to Parkinson's disease speech analysis
CN109564618A (en) * 2016-06-06 2019-04-02 三星电子株式会社 Learning model for the detection of significant facial area
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN110070895A (en) * 2019-03-11 2019-07-30 江苏大学 A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition
CN110089135A (en) * 2016-10-19 2019-08-02 奥蒂布莱现实有限公司 System and method for generating audio image
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN110297928A (en) * 2019-07-02 2019-10-01 百度在线网络技术(北京)有限公司 Recommended method, device, equipment and the storage medium of expression picture
CN110503128A (en) * 2018-05-18 2019-11-26 百度(美国)有限责任公司 The spectrogram that confrontation network carries out Waveform composition is generated using convolution
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN111009262A (en) * 2019-12-24 2020-04-14 携程计算机技术(上海)有限公司 Voice gender identification method and system
CN114037059A (en) * 2021-11-05 2022-02-11 北京百度网讯科技有限公司 Pre-training model, model generation method, data processing method and data processing device
CN115240649A (en) * 2022-07-19 2022-10-25 于振华 Voice recognition method and system based on deep learning
US11606663B2 (en) 2018-08-29 2023-03-14 Audible Reality Inc. System for and method of controlling a three-dimensional audio engine

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11093818B2 (en) 2016-04-11 2021-08-17 International Business Machines Corporation Customer profile learning based on semi-supervised recurrent neural network using partially labeled sequence data
CN106803069B (en) * 2016-12-29 2021-02-09 南京邮电大学 Crowd happiness degree identification method based on deep learning
CN106919710A (en) * 2017-03-13 2017-07-04 东南大学 A kind of dialect sorting technique based on convolutional neural networks
CN108021910A (en) * 2018-01-04 2018-05-11 青岛农业大学 The analysis method of Pseudocarps based on spectrum recognition and deep learning
CN108899075A (en) * 2018-06-28 2018-11-27 众安信息技术服务有限公司 A kind of DSA image detecting method, device and equipment based on deep learning
CN109117943B (en) * 2018-07-24 2022-09-30 中国科学技术大学 Method for enhancing network representation learning by utilizing multi-attribute information
CN109065021B (en) * 2018-10-18 2023-04-18 江苏师范大学 End-to-end dialect identification method for generating countermeasure network based on conditional deep convolution
CN109543727B (en) * 2018-11-07 2022-12-20 复旦大学 Semi-supervised anomaly detection method based on competitive reconstruction learning
CN109559736B (en) * 2018-12-05 2022-03-08 中国计量大学 Automatic dubbing method for movie actors based on confrontation network
CN110009025B (en) * 2019-03-27 2023-03-24 河南工业大学 Semi-supervised additive noise self-encoder for voice lie detection
CN110084850B (en) * 2019-04-04 2023-05-23 东南大学 Dynamic scene visual positioning method based on image semantic segmentation
CN110363139B (en) * 2019-07-15 2020-09-18 上海点积实业有限公司 Digital signal processing method and system
CN110738168B (en) * 2019-10-14 2023-02-14 长安大学 Distributed strain micro crack detection system and method based on stacked convolution self-encoder
CN111179941B (en) * 2020-01-06 2022-10-04 科大讯飞股份有限公司 Intelligent device awakening method, registration method and device
CN111832650B (en) * 2020-07-14 2023-08-01 西安电子科技大学 Image classification method based on generation of antagonism network local aggregation coding semi-supervision
CN112735478B (en) * 2021-01-29 2023-07-18 华南理工大学 Voice emotion recognition method based on additive angle punishment focus loss

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
US20110222724A1 (en) * 2010-03-15 2011-09-15 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN102222500A (en) * 2011-05-11 2011-10-19 北京航空航天大学 Extracting method and modeling method for Chinese speech emotion combining emotion points
CN103400145A (en) * 2013-07-19 2013-11-20 北京理工大学 Voice-vision fusion emotion recognition method based on hint nerve networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU675389B2 (en) * 1994-04-28 1997-01-30 Motorola, Inc. A method and apparatus for converting text into audible signals using a neural network
US5509103A (en) * 1994-06-03 1996-04-16 Motorola, Inc. Method of training neural networks used for speech recognition
WO1995034064A1 (en) * 1994-06-06 1995-12-14 Motorola Inc. Speech-recognition system utilizing neural networks and method of using same
CN1120469C (en) * 1998-02-03 2003-09-03 西门子公司 Method for voice data transmission
CN104021373B (en) * 2014-05-27 2017-02-15 江苏大学 Semi-supervised speech feature variable factor decomposition method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
US20110222724A1 (en) * 2010-03-15 2011-09-15 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN102222500A (en) * 2011-05-11 2011-10-19 北京航空航天大学 Extracting method and modeling method for Chinese speech emotion combining emotion points
CN103400145A (en) * 2013-07-19 2013-11-20 北京理工大学 Voice-vision fusion emotion recognition method based on hint nerve networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NWE T L et al.: "Speech emotion recognition using hidden Markov models", Speech Communication *
ZHANG Shiqing et al.: "Speech emotion recognition based on an improved supervised manifold learning algorithm", Journal of Electronics & Information Technology (电子与信息学报) *
MAO Qirong et al.: "Small-sample speech emotion recognition method combining an overcomplete dictionary with PCA", Journal of Jiangsu University (江苏大学学报) *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015180368A1 (en) * 2014-05-27 2015-12-03 江苏大学 Variable factor decomposition method for semi-supervised speech features
CN104408470B (en) * 2014-12-01 2017-07-25 中科创达软件股份有限公司 The sex-screening method learnt in advance based on average face
CN104408470A (en) * 2014-12-01 2015-03-11 中科创达软件股份有限公司 Gender detection method based on average face preliminary learning
CN105989368A (en) * 2015-02-13 2016-10-05 展讯通信(天津)有限公司 Target detection method and apparatus, and mobile terminal
CN105070288A (en) * 2015-07-02 2015-11-18 百度在线网络技术(北京)有限公司 Vehicle-mounted voice instruction recognition method and device
WO2017000489A1 (en) * 2015-07-02 2017-01-05 百度在线网络技术(北京)有限公司 On-board voice command identification method and apparatus, and storage medium
US10446150B2 (en) 2015-07-02 2019-10-15 Baidu Online Network Technology (Beijing) Co. Ltd. In-vehicle voice command recognition method and apparatus, and storage medium
CN105070288B (en) * 2015-07-02 2018-08-07 百度在线网络技术(北京)有限公司 Vehicle-mounted voice instruction identification method and device
CN105321525B (en) * 2015-09-30 2019-02-22 北京邮电大学 A kind of system and method reducing VOIP communication resource expense
CN105321525A (en) * 2015-09-30 2016-02-10 北京邮电大学 System and method for reducing VOIP (voice over internet protocol) communication resource overhead
CN105550679A (en) * 2016-02-29 2016-05-04 深圳前海勇艺达机器人有限公司 Judgment method for robot to cyclically monitor recording
CN105550679B (en) * 2016-02-29 2019-02-15 深圳前海勇艺达机器人有限公司 A kind of judgment method of robot cycle monitoring recording
CN109564618B (en) * 2016-06-06 2023-11-24 三星电子株式会社 Method and system for facial image analysis
CN109564618A (en) * 2016-06-06 2019-04-02 三星电子株式会社 Learning model for the detection of significant facial area
CN110089135A (en) * 2016-10-19 2019-08-02 奥蒂布莱现实有限公司 System and method for generating audio image
US11516616B2 (en) 2016-10-19 2022-11-29 Audible Reality Inc. System for and method of generating an audio image
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN108461092A (en) * 2018-03-07 2018-08-28 燕山大学 A method of to Parkinson's disease speech analysis
CN108461092B (en) * 2018-03-07 2022-03-08 燕山大学 Method for analyzing Parkinson's disease voice
CN110503128A (en) * 2018-05-18 2019-11-26 百度(美国)有限责任公司 The spectrogram that confrontation network carries out Waveform composition is generated using convolution
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN110148400B (en) * 2018-07-18 2023-03-17 腾讯科技(深圳)有限公司 Pronunciation type recognition method, model training method, device and equipment
US11606663B2 (en) 2018-08-29 2023-03-14 Audible Reality Inc. System for and method of controlling a three-dimensional audio engine
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
WO2020181998A1 (en) * 2019-03-11 2020-09-17 江苏大学 Method for detecting mixed sound event on basis of factor decomposition of supervised variational encoder
CN110070895A (en) * 2019-03-11 2019-07-30 江苏大学 A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition
CN110705339A (en) * 2019-04-15 2020-01-17 中国石油大学(华东) C-C3D-based sign language identification method
CN110297928A (en) * 2019-07-02 2019-10-01 百度在线网络技术(北京)有限公司 Recommended method, device, equipment and the storage medium of expression picture
CN111009262A (en) * 2019-12-24 2020-04-14 携程计算机技术(上海)有限公司 Voice gender identification method and system
CN114037059A (en) * 2021-11-05 2022-02-11 北京百度网讯科技有限公司 Pre-training model, model generation method, data processing method and data processing device
CN115240649A (en) * 2022-07-19 2022-10-25 于振华 Voice recognition method and system based on deep learning

Also Published As

Publication number Publication date
CN104021373B (en) 2017-02-15
WO2015180368A1 (en) 2015-12-03

Similar Documents

Publication Publication Date Title
CN104021373A (en) Semi-supervised speech feature variable factor decomposition method
Hsu et al. Unsupervised learning of disentangled and interpretable representations from sequential data
US10699719B1 (en) System and method for taxonomically distinguishing unconstrained signal data segments
CN106952649A (en) Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN110164452A (en) A kind of method of Application on Voiceprint Recognition, the method for model training and server
CN106104674A (en) Mixing voice identification
CN111128209B (en) Speech enhancement method based on mixed masking learning target
Sprechmann et al. Supervised non-euclidean sparse NMF via bilevel optimization with applications to speech enhancement
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
Lee et al. Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities
Wiem et al. Unsupervised single channel speech separation based on optimized subspace separation
Li et al. A si-sdr loss function based monaural source separation
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
Das et al. Towards Transferable Speech Emotion Representation: on loss functions for cross-lingual latent representations
US20200211569A1 (en) Audio signal processing
McVicar et al. Learning to separate vocals from polyphonic mixtures via ensemble methods and structured output prediction
Yue et al. Equilibrium optimizer for emotion classification from english speech signals
CN106205636A (en) A kind of speech emotion recognition Feature fusion based on MRMR criterion
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
Roberts et al. Deep learning-based single-ended quality prediction for time-scale modified audio
Szekrényes et al. Classification of formal and informal dialogues based on turn-taking and intonation using deep neural networks
Medikonda et al. An information set-based robust text-independent speaker authentication
Bisio et al. Speaker recognition exploiting D2D communications paradigm: Performance evaluation of multiple observations approaches
US12014728B2 (en) Dynamic combination of acoustic model states
Franzoni et al. Crowd emotional sounds: spectrogram-based analysis using convolutional neural network.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant