CN108010514A

CN108010514A - A kind of method of speech classification based on deep neural network

Info

Publication number: CN108010514A
Application number: CN201711155884.8A
Authority: CN
Inventors: 毛华; 章毅; 吴雨
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2017-11-20
Filing date: 2017-11-20
Publication date: 2018-05-08
Anticipated expiration: 2037-11-20
Also published as: CN108010514B

Abstract

The invention discloses a kind of method of speech classification based on deep neural network, it is intended to by a unified algorithm model, solves the problems, such as different Classification of Speech.The present invention includes the following steps：S1：Convert speech into corresponding sound spectrograph；Piecemeal is carried out along frequency domain on complete sound spectrograph, obtains one group of local frequency domain information set.S2：Input using complete and local frequency domain information as model respectively, based on different inputs, convolutional neural networks can extract part and global characteristics.S3：With notice mechanism, amalgamation of global and local feature representation, form final feature representation.S4：Using marked data, pass through gradient decline and back-propagation algorithm training network.S5：To unlabelled voice, using trained parameter, the classification of model output maximum probability is as prediction result.The present invention realizes the unified algorithm model to different phonetic classification problem, and accuracy is improved in multiple Classification of Speech problems.

Description

A kind of method of speech classification based on deep neural network

Technical field

A kind of method of speech classification based on deep neural network, for handling the classification task of different voices, is related to The technical fields such as Speech processing, artificial intelligence.

Background technology

With the fast development of computer technology, how more preferably the mankind constantly strengthen the dependence of computer and requirement, Ground is interacted with computer has become a research hotspot.Voice as it is most universal in daily life, most natural one Kind communication way, it includes huge information content, such as the accent of speaker, the affective state of speaker etc..Computer Classification of Speech recognition capability be computer carry out speech processes important component, realize nature human-computer interaction interface Key precondition, has very big researching value and application value.Speech classification technique is a highly important research direction, it In speech recognition, voice content detection etc. all plays an important role.Classification of Speech is the base that advanced treating is carried out to audio Plinth and premise, for the section audio currently provided, the audio environment residing for voice can be determined in advance by classification, say The gender of people, accent, mood etc. are talked about, basis is provided to adjust the adaptive algorithm of speech model.Therefore, method of speech classification is It is vital.

Classification of Speech includes a variety of different tasks, such as：Speech emotion recognition, accents recognition, Speaker Identification, voice Ambient zone grades.The challenge of Classification of Speech task is the higher-dimension characteristic of voice.Traditional method of speech classification, it will usually be directed to The problem of single or database, extract specific audio frequency characteristics, so as to reduce the dimension of the data of input sorter network.So And feature extraction needs enough Speech processing knowledge, because feature extraction represents the filtering of information, information can be caused Missing.Secondly, traditional sorting algorithm is often not suitable for more classification tasks, such as support vector machines etc..These problems are all The difficult point that our need of work is captured.

Deep neural network method is one of current processing most important means of big data, especially high dimensional data.Depth The characteristics of neutral net, is to realize to sound by the training to connection weight by constructing the nonlinear mapping function of multilayer The study of the feature of frequency evidence simultaneously is used to classify.Deep neural network, can be according to output because it has the function of feedback, study etc. As a result network inherent parameters are adjusted, at present, although the upsurge of deep neural network is gradually in every subjects field Sprawling is opened, and is applied successfully to multiple fields, including machine translation, speech recognition, target identification etc..

The content of the invention

The present invention provides a kind of method of speech classification based on deep neural network for above-mentioned shortcoming, solves existing There is the intractable problem of feature extracting method, high dimensional data only for the classification of distinctive single task or data in technology.

To achieve these goals, the technical solution adopted by the present invention is：

A kind of method of speech classification based on deep neural network, it is characterised in that include the following steps：

S1：Voice data is subjected to Short Time Fourier Transform, is converted to corresponding sound spectrograph；Along frequency on complete sound spectrograph Domain carries out piecemeal, obtains one group of local frequency domain information set；

S2：The algorithm model based on convolutional neural networks and notice mechanism is established, respectively by complete sound spectrograph and part Input of the frequency domain information as model, carries out feature learning；Based on local and complete sound spectrograph information, convolutional Neural net is used Network extracts part and global characteristics；

S3：With notice mechanism, amalgamation of global and local feature representation, form final feature representation, are input to Softmax graders, so as to obtain the prediction of the classification belonging to voice；

S4：Using marked voice data, by gradient decline and back-propagation algorithm training network, and network ginseng is preserved Number；

S5：To unlabelled voice, it is predicted using trained model, model exports the affiliated classification conduct of maximum probability Final prediction result.

Further, distributed sound spectrograph transfer process specifically comprises the following steps in the S1：

Short Time Fourier Transform is carried out to original audio, given original audio is divided into M sections of short audios；To every section of short audio, Its short-time energy and modulus are calculated, finally obtains a complete sound spectrograph expression S, the S of sound spectrograph is expressed as follows：

（1）

Wherein, N is expressed as every section of short audio length scale；Formula（1）In illustrate sound spectrograph be two-dimensional matrix structure composition, Two of which dimension phonetically represents the change order of time respectively and frequency domain is changed by the section of low frequency to high frequency, often Numerical values recited on a point represents the size of amplitude.

The direction changed to complete sound spectrograph information along frequency domain carries out piecemeal, can obtain one group of local and overall situation Spectrum information set, that is, obtain one group based on different frequency domain distributions input data combination：。

Further, the feature extraction of convolutional neural networks specifically comprises the following steps in S2：

For multiple local inputs, the feature of different information is extracted using convolutional neural networks, so as to obtain one group of local expression：

（2）

In above formula, each local inputThere is corresponding deconvolution parameterWith,fIt is expressed as activation primitive；Most The one group of local feature obtained eventually is expressed as：。

For current complete global frequency domain information, the feature of the overall situation is extracted using convolutional neural networks, it is specific to calculate Formula is as follows：

（3）

Wherein,aIt is expressed as the global characteristics that convolutional neural networks extract.

Wherein formula（2）With（3）In be mainly concerned with convolutional neural networks convolution and pondization operation.Convolution it is specific Operation is as follows：

（4）

Wherein,MWithNThe size of convolution kernel is defined,m,nRepresent line number and columns, for defining pixel position,fIt is convolution Kernel function,Define current layeriOKjThe feature representation of row,Define current layeriOKjThe input data of row.wIt is fixed The justice parameter of convolution kernel,bIt is corresponding bias；

Formula（4）In convolution operation, play an important role in convolutional network.By sharing the design of weights, convolution The feature that network extracts has feature invariant shape；Change somewhat, the changing features that network proposes occur for the input inputted Less.

The concrete operations in pond are as follows：

（5）

Wherein,Pond function is represent, most common pond function there are three kinds, i.e., in receptive field（The space of convolution kernel）It is interior It is maximized, minimum value or average value.aIt is the input that represent pond layer,pRepresent the output after pondization operation；

Formula（5）Middle pond parameter greatly reduces the number of weights in network, it is therefore prevented that over-fitting occurs in network.

Further, the notice mechanism amalgamation of global in S3 specifically comprises the following steps with local feature representation：

Based on different local features, with notice mechanism, new global characteristics expression is retrieved；Global information is given first Assign each of which part one " coefficient "：

（6）

In above formula,Represent global characteristicsaA certain part, altogethermA composition information,Represent based on current Local feature,The coefficient of this part, represents its importance degree；

Formula（6）Implication be notice mechanism essence operation, based on guidance information local feature, to global characteristicsa Each part assign it is differentWeights, represent the significance level of the composition.It is intended to wish to pass through network training, looks for Go out most representational feature in composition.

Then the coefficient that represent significance level calculated is multiplied with corresponding part, form one it is new complete Office's information：

（7）

Notice mechanism is so used, is obtainednA new global information, with initial global characteristicsaContraposition is added, and is obtained most Whole feature representation：

（8）

By final feature representationA, softmax graders are input to, the classification of the probable value maximum of gained is the voice number According to prediction classification.

After using the above scheme, the beneficial effects of the present invention are：

（1）Traditional method of speech classification both for it is single the problem of passed through using different feature extraction algorithms, the present invention Deep neural network directly carries out feature learning to voice sound spectrograph, can independently learn different sounds according to the difference of task Frequency feature.

（2）The training of deep neural network generally requires big data, but presently disclosed voice data number is less.Base In the research of conventional deep neural network, present invention further proposes merging for convolutional neural networks and notice mechanism Algorithm model, further increasing discrimination in multiple tasks.

By taking two groups of Classification of Speech tasks of accents recognition and Speaker Identification as an example：

Table 1 illustrates the contrast of model and other methods in the present invention in accents recognition problem, and wherein i-Vector is classical Feature extraction algorithm, VGG and ResNets are representative convolutional neural networks models.

Table 2 illustrates the contrast of model and other methods in the present invention in Speaker Identification problem, wherein MFCC be through The feature extraction algorithm of allusion quotation, VGG and ResNets are representative convolutional neural networks models.

Above-mentioned the results show：

1）In multiple Classification of Speech problems, the feature that model learning proposed by the present invention arrives is calculated compared to traditional feature extraction Method, can obtain more preferable recognition result.

2）Compared to the method for other neural net methods, the present invention, which further increasing, applies notice mechanism In convolutional neural networks, increase model robustness and generalization ability, the accuracy rate of speech recognition is all improved in multiple problems.

Brief description of the drawings

Fig. 1 is algorithm model synoptic diagram in the present invention；

Fig. 2 is the distributed sound spectrograph based on frequency domain；

Fig. 3 is the convolution block basic block diagram for employing notice mechanism；

Fig. 4 is the overall process figure of the present invention.

Specific embodiment

Each attached drawing in the embodiment of the present invention can be combined one by one below, the technical solution in the present embodiment is carried out detailed Ground describes；But it is described here go out embodiment be only the present invention part of the embodiment, rather than whole specification Embodiment.

Referring to Fig. 1, a kind of kernel model of the Classification of Speech based on deep neural network is one and multiple employs attention The deep neural network model of the convolution block composition of power mechanism.One is convolutional neural networks, main using the non-linear of multilayer Function, to learn the mapping relations between input data and feature；Deep learning algorithm can automatically learn according to target Correlated characteristic；Another is notice mechanism, mainly by distributing local message different weights, so as to obtain a part The different expression of information proportion.The present invention is effectively improved voice point by merging deep learning algorithm and notice mechanism The accuracy of class..

The method of speech classification based on deep neural network, includes the following steps：

Step S1：Short Time Fourier Transform is carried out to original audio, given original audio is divided into M sections of short audios；To every section Short audio, calculates its short-time energy and modulus, finally obtains a complete sound spectrograph expression S, and the S of sound spectrograph is expressed as follows：

（1）

Wherein, N is expressed as every section of short audio length scale.

Referring to Fig. 2 it can be seen that the displaying of complete sound spectrograph and the distributed sound spectrograph based on frequency domain.Based on distribution Sound spectrograph be along frequency domain change section carry out piecemeal, so as to obtain the distributed intelligence in different frequency section.

Step S2：The algorithm model based on convolutional neural networks and notice mechanism is established, respectively by complete sound spectrograph With input of the local frequency domain information as model, feature learning is carried out；For multiple local inputs, convolutional neural networks are used The feature of different information is extracted, so as to obtain one group of local expression：

（2）

（3）

Step S3：On the basis of the local feature and global characteristics that step S2 is proposed, based on different local features, fortune With notice mechanism, new global characteristics expression is retrieved；Assigning each of which part one to global information first " is Number "：

（6）

In above formula,Represent global characteristicsaA certain part, altogethermA composition information,Represent based on current Local feature,The coefficient of this part, represents its importance degree.

（7）

（8）

Referring to Fig. 3, the basic block diagram of the convolution block based on notice mechanism is illustrated, is included based on local and global The characteristic extraction procedure of information, and it is last use notice thought merging again into row information, last to one is finally Feature representationA。

Step S4：Using marked voice data, by gradient decline and back-propagation algorithm training network, and protect Deposit network parameter；The initial structure of model is that the parameter in network is random initializtion, is produced by marked voice data Error, so as to update network parameter, until network becomes stable, retain optimal parameter.

Step S5：To unlabelled voice, it is predicted using trained model and parameter, model output maximum probability Affiliated classification as final prediction result.

Procedure chart of the invention complete is illustrated from step S1 to step S5 referring to Fig. 4, if in the presence of the audio for still needing to identification, Step S1 to step S5 is then continued to execute, the last highest place classification of grader output probability value is prediction result.

Claims

1. a kind of method of speech classification based on deep neural network, it is characterised in that distributed sound spectrograph and convolutional Neural net The combination of network and notice mechanism, includes the following steps：

A kind of 2. method of speech classification based on deep neural network according to claim 1, it is characterised in that：The S1 Middle distribution sound spectrograph transfer process specifically comprises the following steps：

S11：Short Time Fourier Transform is carried out to original audio, given original audio is divided into M sections of short audios；To every section of minor Frequently, its short-time energy and modulus are calculated, finally obtains a complete sound spectrograph expression S, the S of sound spectrograph is expressed as follows：

（1）

Wherein, N is expressed as every section of short audio length scale；

S12：The direction changed to complete sound spectrograph information along frequency domain carries out piecemeal, wherein some local frequency domain information Be expressed as follows：

（2）

One group of local spectrum information set with the overall situation is finally obtained, that is, obtains one group of input based on different frequency domain distributions Data combine：。

A kind of 3. method of speech classification based on deep neural network according to claim 1, it is characterised in that：The S2 The feature extraction of middle convolutional neural networks specifically comprises the following steps：

S21：For multiple local inputs, the feature of different information is extracted using convolutional neural networks, so as to obtain one group of part Expression：

（3）

In above formula, each local inputThere is corresponding deconvolution parameterWith,fIt is expressed as activation primitive；Finally One group of obtained local feature is expressed as：；

S22：For current complete global frequency domain information, the feature of the overall situation is extracted using convolutional neural networks, it is specific to calculate Formula is as follows：

（4）

A kind of 4. method of speech classification based on deep neural network according to claim 1, it is characterised in that：The step Notice mechanism amalgamation of global in rapid S3 specifically comprises the following steps with local feature representation：

（6）

In above formula,Represent global characteristicsaA certain part, altogethermA composition information,Expression is based on current office Portion's feature,The coefficient of this part, represents its importance degree；

Then the coefficient that represent significance level calculated is multiplied with corresponding part, forms a new global letter Breath：

（7）

（8）