CN107203777A - Audio scene classification method and device - Google Patents

Audio scene classification method and device

Info

Publication number
CN107203777A
CN107203777A (application CN201710257902.7A)
Authority
CN
China
Prior art keywords
voice data
audio data
layer
rbm
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710257902.7A
Other languages
Chinese (zh)
Inventor
王永滨
孙书韬
安靖
王琦
王剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Collaborative Innovation Institute
Communication University of China
Original Assignee
Beijing Collaborative Innovation Institute
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Collaborative Innovation Institute, Communication University of China filed Critical Beijing Collaborative Innovation Institute
Priority to CN201710257902.7A priority Critical patent/CN107203777A/en
Publication of CN107203777A publication Critical patent/CN107203777A/en
Pending legal-status Critical Current

Classifications

    • G06F18/24: Classification techniques (G06F: Electric digital data processing; G06F18/00: Pattern recognition; G06F18/20: Analysing)
    • G06F18/2411: Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/084: Backpropagation, e.g. using gradient descent (G06N3/02: Neural networks; G06N3/08: Learning methods)
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an audio scene classification method and device, wherein the method includes: building a deep belief network (DBN) classification model from sample audio data; and inputting audio data to be tested into the DBN classification model to obtain an audio scene classification result; wherein the sample audio data include unlabeled audio data and labeled audio data, a label indicating the scene class of the audio data. The present invention can classify audio scenes and is simple to implement, low in cost, and high in accuracy.

Description

Audio scene classification method and device
Technical field
The present invention relates to the technical fields of deep learning and audio information processing, and in particular to an audio scene classification method and device.
Background art
With the wide adoption of mobile devices such as mobile phones, applications based on scene awareness keep multiplying; letting a mobile device automatically perceive the scene around it is therefore an important and challenging task.
A scene is a high-level semantic representation of an audio signal. Audio scene classification applies a classifier, obtained through machine learning from the acoustic features of audio, to automatically and intelligently identify the current environmental scene. Current audio scene classification methods generally follow one of two schemes: first, acoustic modeling of the audio scene; second, detection of the objects and events occurring in the scene. The first scheme learns an acoustic model of a specific scene directly from audio data, regardless of which sound elements the acoustic model actually contains. The second scheme parses the audio in more detail: it judges the scene in which the audio was produced by indirectly detecting audio objects and events.
However, these existing audio scene classification methods have several shortcomings. 1) Learning an acoustic model requires hand-designing acoustic features and then training the model with supervised learning; hand-designed features demand extensive domain knowledge and considerable manpower. 2) Detecting scene objects and events is itself a hard problem; moreover, what chiefly distinguishes different scenes is the type of ambient sound, and ambient sounds are usually mixtures of many audio elements, so separating foreground sounds (events) from background sounds at the same time is difficult and requires substantial prior knowledge.
In view of this, how to provide an audio scene classification method and device that are simple to implement, low in cost, and high in accuracy is a technical problem that currently needs to be solved.
Summary of the invention
To solve the above technical problem, the present invention provides an audio scene classification method and device that can classify audio scenes, are simple to implement, low in cost, and high in accuracy.
In a first aspect, the present invention provides an audio scene classification method, including:
building a deep belief network (DBN) classification model from sample audio data; and
inputting audio data to be tested into the DBN classification model to obtain an audio scene classification result;
wherein the sample audio data include unlabeled audio data and labeled audio data, a label indicating the scene class of the audio data.
Optionally, building the DBN classification model from the sample audio data includes:
taking the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, training multiple RBMs layer by layer from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
adding, after the last hidden layer of the DBN, a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene classes; and
inputting the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tuning the parameters of every layer of the whole network layer by layer from top to bottom until convergence.
Optionally, taking the unlabeled audio data as the input of the bottom RBM and training multiple RBMs from bottom to top until the DBN reaches an equilibrium state includes:
taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using a layer-wise greedy algorithm, training multiple RBMs unsupervised, layer by layer from bottom to top, to generate the DBN, until the DBN reaches an equilibrium state.
Optionally, using the layer-wise greedy algorithm to train multiple RBMs unsupervised includes:
for each RBM composing the DBN, the visible layer serves as its input layer and the hidden layer as its output layer; the unlabeled audio data in the sample audio data serve as the input of the bottom RBM, and, starting from the bottom RBM, each RBM is trained unsupervised, layer by layer from bottom to top, the output of each RBM serving as the input of the next RBM to be trained; and
during unsupervised training, each RBM automatically generates a hidden feature h corresponding to its input data v; for their joint probability p(v, h), Gibbs sampling is used to sample v and h alternately and update the RBM parameters, until the RBM loss function stabilizes, so that p(v, h) is maximized.
Optionally, the task-driven layer is a classifier, including: a support vector machine classifier, a random forest classifier, or a softmax classifier.
Optionally, inputting the labeled audio data into the task-driven layer and fine-tuning the parameters of every layer from top to bottom until convergence includes:
inputting the labeled audio data in the sample audio data into the task-driven layer and, using the back-propagation algorithm, fine-tuning the parameters of every layer of the whole network with supervision until convergence.
Optionally, before building the DBN classification model from the sample audio data, the method further includes:
preprocessing the original audio data according to a preset format;
cutting each audio item in the preprocessed audio data according to a preset window size; and
dividing the audio data segments after cutting into two parts: each segment in one part is given a label indicating its scene class and serves as the labeled audio data in the sample audio data, while the other part serves as the unlabeled audio data in the sample audio data.
In a second aspect, the present invention provides an audio scene classification device, including:
a building module, configured to build a deep belief network (DBN) classification model from sample audio data; and
a classification module, configured to input audio data to be tested into the DBN classification model to obtain an audio scene classification result;
wherein the sample audio data include unlabeled audio data and labeled audio data, a label indicating the scene class of the audio data.
Optionally, the building module is specifically configured to:
take the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, train multiple RBMs layer by layer from bottom to top to generate a DBN, until the DBN reaches an equilibrium state;
add, after the last hidden layer of the DBN, a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene classes; and
input the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tune the parameters of every layer of the whole network layer by layer from top to bottom until convergence.
Optionally, the device further includes:
a preprocessing module, configured to preprocess the original audio data according to a preset format; cut each audio item in the preprocessed audio data according to a preset window size; and divide the audio data segments after cutting into two parts, each segment in one part being given a label indicating its scene class to serve as the labeled audio data in the sample audio data, while the other part serves as the unlabeled audio data in the sample audio data.
As the above technical solutions show, the audio scene classification method and device of the present invention build a deep belief network classification model from sample audio data and input the audio data to be tested into that model to obtain the audio scene classification result, where the sample audio data include unlabeled audio data and labeled audio data and a label indicates the scene class of the audio data. The present invention can thus classify audio scenes and is simple to implement, low in cost, and high in accuracy.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the audio scene classification method provided by an embodiment of the present invention;
Fig. 2 is a detailed flowchart of step 101 shown in Fig. 1;
Fig. 3 is a schematic structural diagram of the audio scene classification device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Fig. 1 shows a schematic flowchart of the audio scene classification method provided by an embodiment of the present invention; as shown in Fig. 1, the method of this embodiment is as follows.
101. Build a deep belief network classification model from sample audio data.
Here the sample audio data D = {D_u, D_s} include unlabeled audio data D_u and labeled audio data D_s, a label indicating the scene class of the audio data. D_u = X_u is audio recorded in real scenes, with the unlabeled audio features X_u ∈ R^{N_u × d}, where d is the dimension of the audio features, N_u is the number of audio samples, and u stands for unsupervised. D_s = {X_s, y}, where s stands for supervised, y ∈ {1, 2, ..., M} is the label, and M is the number of audio scene classes, an integer greater than 1.
In a specific application, before step 101 the method may further include steps 100a-100c (not shown in the figures).
100a. Preprocess the original audio data according to a preset format.
In a specific application, the preset format may include a preset sample rate, channel layout, pulse-code modulation (PCM) encoding format, and so on.
100b. Cut each audio item in the preprocessed audio data according to a preset window size.
In a specific application, if the window size of the last audio data segment after cutting is smaller than the preset window size, that last segment is discarded.
For example, the preset window size may be one second; specifically, each audio item in the original audio data may be cut with a one-second window, discarding any final segment shorter than one second.
100c. Divide the audio data segments after cutting into two parts: each segment in one part is given a label indicating its scene class and serves as the labeled audio data in the sample audio data, while the other part serves as the unlabeled audio data in the sample audio data.
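As an illustration only, a minimal Python sketch of steps 100a-100c follows; librosa is assumed for decoding and resampling, and the one-second window and the labeled/unlabeled split ratio are illustrative choices rather than values fixed by this embodiment.

```python
import numpy as np
import librosa  # assumed decoder; any loader yielding PCM samples works

SR = 22050            # preset sample rate (step 100a)
WIN = SR              # preset window: 1 second of samples (step 100b)

def preprocess(path):
    """Steps 100a and 100b: load as 22.05 kHz mono PCM, cut into 1 s segments,
    discarding a final segment shorter than the window."""
    signal, _ = librosa.load(path, sr=SR, mono=True)
    n = len(signal) // WIN
    if n == 0:
        return np.empty((0, WIN))
    return np.stack([signal[i * WIN:(i + 1) * WIN] for i in range(n)])

def split_sample_data(segments, scene_label, labeled_ratio=0.2):
    """Step 100c: one part gets the scene label (D_s), the rest stays unlabeled (D_u)."""
    k = int(len(segments) * labeled_ratio)   # illustrative split, not fixed by the text
    D_s = [(seg, scene_label) for seg in segments[:k]]
    D_u = list(segments[k:])
    return D_s, D_u
```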
In a specific application, as shown in Fig. 2, step 101 may include steps 101a-101c.
101a. Take the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, train multiple RBMs layer by layer from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state.
Specifically, step 101a may include:
taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using a layer-wise greedy algorithm, training multiple RBMs unsupervised, layer by layer from bottom to top, to generate the DBN until it reaches an equilibrium state, which may further include:
for each RBM composing the DBN, the visible layer serves as its input layer and the hidden layer as its output layer; the unlabeled audio data in the sample audio data serve as the input of the bottom RBM, and, starting from the bottom RBM, each RBM is trained unsupervised, layer by layer from bottom to top, the output of each RBM serving as the input of the next RBM to be trained;
during unsupervised training, each RBM automatically generates a hidden feature h corresponding to its input data v; for their joint probability p(v, h), Gibbs sampling is used to sample v and h alternately and update the RBM parameters, until the RBM loss function stabilizes, so that p(v, h) is maximized.
It can be understood that maximizing p(v, h) amounts to minimizing the error between the actual input reconstructed from the hidden feature h and the original input data v.
The joint probability p(v, h) can be expressed through the RBM energy function E(v, h | θ) in the standard form

p(v, h) = exp(-E(v, h | θ)) / Z, with E(v, h | θ) = -b^T v - c^T h - v^T w h,   (1)

where θ = {w, b, c} are the RBM parameters: w is the weight matrix between the visible and hidden layers, b is the bias of the visible layer v, c is the bias of the hidden layer h, T denotes transposition, and Z is the normalization constant.
Specifically, the above can be understood as: given D_u, find θ that makes the network fit D_u as well as possible, i.e., maximize

L(θ) = Σ_{i=1}^{N_u} log p(v_i; θ),   (2)

where i = 1, ..., N_u is the sample index.
Function (2) is optimized by alternating optimization, which may specifically include: first fixing two of the parameters in θ, differentiating with respect to the remaining parameter, and updating that parameter; then alternately fixing two parameters of θ, differentiating with respect to the remaining one and updating it, until L(θ) finally stabilizes.
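As one standard way to realize the alternating Gibbs updates, the sketch below performs a single contrastive-divergence (CD-1) step for one RBM with sigmoid units; it is an illustrative approximation of the gradient of L(θ), not the embodiment's exact procedure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, eta):
    """One CD-1 step: sample h from v, reconstruct v from h, and nudge
    theta = {W, b, c} so that p(v, h) rises on the training data."""
    ph0 = sigmoid(v0 @ W + c)                                # infer hidden feature h
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(v0.dtype)
    pv1 = sigmoid(h0 @ W.T + b)                              # reconstruct the input v
    ph1 = sigmoid(pv1 @ W + c)                               # re-infer h from the reconstruction
    n = len(v0)
    W += eta * (v0.T @ ph0 - pv1.T @ ph1) / n                # data minus model statistics
    b += eta * (v0 - pv1).mean(axis=0)
    c += eta * (ph0 - ph1).mean(axis=0)
    return ((v0 - pv1) ** 2).mean()   # reconstruction error; watch it stabilize
```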
101b. Add, after the last hidden layer of the DBN, a task-driven layer for classification; the number of outputs of the task-driven layer is the number of audio scene classes.
Here the task-driven layer is a classifier, which may include a support vector machine classifier, a random forest classifier, a softmax classifier, or the like. This embodiment imposes no restriction here; the task-driven layer may also be another classifier.
101c. Input the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tune the parameters of every layer of the whole network layer by layer from top to bottom until convergence.
It can be understood that once step 101c has been performed, the deep belief network classification model has been built.
Specifically, step 101c may include:
inputting the labeled audio data in the sample audio data into the task-driven layer and, using the back-propagation (BP) algorithm, fine-tuning the parameters of every layer of the whole network with supervision until convergence.
It can be understood that the main idea of the BP algorithm is to propagate the error backwards and continuously adjust the weights so that the network output fits the data labels y as closely as possible. Specifically, inputting the labeled audio data into the task-driven layer and fine-tuning every layer with the BP algorithm may further include:
for any given labeled audio data x, y ∈ D_s passed through the l layers of the DBN, the output value is h(z_l), where h(·) is the activation function and z_l = W^(l-1) z^(l-1) + b^(l-1) is the weighted sum from layer l-1. This embodiment does not restrict the loss function of the classifier (i.e., the task-driven layer); assuming the distance between the training outputs and the labels is measured by the Euclidean distance, it can be expressed as

L(θ)' = (1/2) Σ_{(x, y) ∈ D_s} ||h(z_l) - y||²;   (3)

differentiating L(θ)' with respect to each parameter yields that parameter's gradient; the error is propagated backwards and the value of θ is continuously updated, so that the network finally stabilizes.
During fine-tuning, this embodiment may also set aside a small fraction of the training data to validate how good the trained model is, so as to select the best parameters.
102. Input the audio data to be tested into the deep belief network classification model to obtain the audio scene classification result.
In a specific application, the method of this embodiment may further include:
selecting a softmax loss function for the task-driven layer, whose output is the predicted probability of each audio scene class.
In this way, the method of this embodiment can be extended to multi-label separation, making visible how the predicted probability is distributed over the different audio scene classes.
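As a small illustration of that output, a softmax over the task-driven layer's M logits yields one probability per scene class; the logits below are invented for the example.

```python
import numpy as np

def scene_probabilities(logits):
    """Softmax over M task-driven-layer outputs: one probability per scene class."""
    z = logits - logits.max()       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = scene_probabilities(np.array([2.1, 0.3, -1.0, 0.8]))   # toy logits, M = 4
print(int(p.argmax()), p)           # most likely scene, full distribution over classes
```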
Taking a concrete application as an example, the method of this embodiment may first preprocess all the audio in the original audio data, converting each audio item to a 22.05 kHz sample rate, mono, pulse-code modulation (PCM) format; each item is then cut with a one-second window, and any final window shorter than one second is discarded. These data are fed to the DBN, i.e., the data input layer of the DBN has 22050 neurons. To learn a compact representation of the audio, the numbers of neurons in the next several layers decrease in turn. This embodiment does not restrict the number of neurons per layer or the number of layers; assume finally that there are six hidden layers and that 1024-dimensional features are output. To prevent overfitting, this embodiment may randomly set the weights of a certain proportion of neurons in each layer to 0. The unlabeled audio data are taken as the input of the bottom restricted Boltzmann machine and, using an unsupervised learning method, multiple RBMs are trained layer by layer from bottom to top to generate the DBN, until the DBN reaches an equilibrium state; after the last hidden layer of the DBN, a task-driven layer for classification is added, the number of its outputs being the number of audio scene classes; the labeled audio data are input into the task-driven layer and, using a supervised learning method, the parameters of every layer of the whole network are fine-tuned layer by layer from top to bottom until convergence. Assuming the loss function of the task-driven layer is a softmax function, the probability distribution over every scene class can be predicted. During fine-tuning, a small fraction of the training data can again be set aside to validate the trained model, so as to select the best parameters.
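Pulling the worked example together: the text fixes only the 22050-sample input, the six hidden layers, and the 1024-dimensional output, so the intermediate layer widths below are assumptions; pretrain_dbn and finetune refer to the algorithm sketches given with the listings further below.

```python
# Fixed by the example: input width 22050, six hidden layers, 1024-dim features.
# The intermediate widths are illustrative guesses, not values from the text.
LAYER_SIZES = [22050, 16384, 8192, 4096, 2048, 1536, 1024]

# Schematic end-to-end use (see the sketches accompanying the algorithms below):
#   weights = pretrain_dbn(X_u, LAYER_SIZES, eta=0.01)            # unsupervised stage
#   weights, W_task = finetune(weights, X_s, y, n_classes=M)      # supervised stage
```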
The audio scene classification method of this embodiment builds a deep belief network classification model from sample audio data and inputs the audio data to be tested into that model to obtain the audio scene classification result, where the sample audio data include unlabeled audio data and labeled audio data and a label indicates the scene class of the audio data; the present invention can thus classify audio scenes and is simple to implement, low in cost, and high in accuracy.
The method of this embodiment can use unsupervised training to learn the features of different scenes automatically, needs little labeled data, and can greatly reduce the cost of data labeling; it is highly adaptable, since adjusting the last layer of the network yields both multi-label classification and the predicted distribution over labels; and it generalizes well, serving as a feature-learning method for other related tasks.
The training and deployment of the method also depend on the computer system. On a multi-core or cluster computer system, some of the above steps, such as steps 100a-100c, can run in parallel. Very slow parts of training can be accelerated with a GPU, and very large data volumes can be processed in mini-batches, for example when pre-training the RBMs. So that the above deep belief network classification model can run on portable mobile terminals such as mobile phones, the model can be compressed to reduce its complexity and thus its hardware requirements.
In a specific application, for example, a specific algorithm realizing step 101a may be:
Input: unlabeled audio features X_u, learning rate η, number of network layers l.
Output: weight matrices W, bias vectors b, c.
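A minimal sketch matching this input/output signature, reusing sigmoid and cd1_update from the CD-1 snippet above; the fixed epoch and batch sizes stand in, as assumptions, for the "train until the loss stabilizes" criterion.

```python
import numpy as np

def pretrain_dbn(X_u, layer_sizes, eta=0.01, epochs=10, batch=128):
    """Greedy layer-wise pretraining (step 101a): train each RBM unsupervised,
    bottom to top, feeding every RBM's output to the next one. Returns the
    per-layer weight matrices W and bias vectors b, c."""
    rng = np.random.default_rng(0)
    data = X_u
    weights = []
    for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 0.01, (n_vis, n_hid))
        b = np.zeros(n_vis)
        c = np.zeros(n_hid)
        for _ in range(epochs):                      # stand-in for "until stable"
            for i in range(0, len(data), batch):
                cd1_update(data[i:i + batch], W, b, c, eta)   # sketch from above
        weights.append((W, b, c))
        data = sigmoid(data @ W + c)                 # this layer's output feeds the next RBM
    return weights
```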
In a specific application, for example, a specific algorithm realizing step 101c may be:
Input: labeled audio features X_s and labels y, learning rate η, number of network layers l, activation function h(·), initial network weights W, biases b.
Output: weight matrices W and bias vectors b after the network stabilizes.
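A matching sketch for this signature, again reusing sigmoid from above: a forward pass through the pretrained layers, a softmax task-driven layer (the loss the embodiment itself suggests), and plain back-propagation; the learning rate and epoch count are illustrative.

```python
import numpy as np

def finetune(weights, X_s, y, n_classes, eta=0.01, epochs=50):
    """Supervised fine-tuning (step 101c): backpropagate the softmax error
    through the task-driven layer and every pretrained layer, top to bottom."""
    rng = np.random.default_rng(1)
    W_task = rng.normal(0.0, 0.01, (weights[-1][0].shape[1], n_classes))
    Y = np.eye(n_classes)[y]                         # one-hot scene labels (y: int array)
    for _ in range(epochs):                          # stand-in for "until convergence"
        acts = [X_s]                                 # forward pass through pretrained layers
        for W, _, c in weights:
            acts.append(sigmoid(acts[-1] @ W + c))
        logits = acts[-1] @ W_task
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)            # per-class probabilities
        delta = (P - Y) / len(X_s)                   # softmax cross-entropy error
        grad_task = acts[-1].T @ delta
        delta = delta @ W_task.T                     # propagate before updating W_task
        W_task -= eta * grad_task
        for i in range(len(weights) - 1, -1, -1):    # top-to-bottom layer adjustment
            W, b, c = weights[i]
            delta = delta * acts[i + 1] * (1.0 - acts[i + 1])   # sigmoid derivative
            grad_W, grad_c = acts[i].T @ delta, delta.sum(axis=0)
            delta = delta @ W.T                      # propagate with pre-update W
            weights[i] = (W - eta * grad_W, b, c - eta * grad_c)
    return weights, W_task
```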
Fig. 3 shows a schematic structural diagram of the audio scene classification device provided by an embodiment of the present invention; as shown in Fig. 3, the device of this embodiment includes a building module 31 and a classification module 32, wherein:
the building module 31 is configured to build a deep belief network classification model from sample audio data;
the classification module 32 is configured to input audio data to be tested into the deep belief network classification model to obtain an audio scene classification result;
and the sample audio data include unlabeled audio data and labeled audio data, a label indicating the scene class of the audio data.
In a specific application, the building module 31 may be specifically configured to:
take the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, train multiple RBMs layer by layer from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
add, after the last hidden layer of the DBN, a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene classes; and
input the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tune the parameters of every layer of the whole network layer by layer from top to bottom until convergence.
In a specific application, the device of this embodiment may further include (not shown in the figures):
a preprocessing module, configured to preprocess the original audio data according to a preset format; cut each audio item in the preprocessed audio data according to a preset window size; and divide the audio data segments after cutting into two parts, each segment in one part being given a label indicating its scene class to serve as the labeled audio data in the sample audio data, while the other part serves as the unlabeled audio data in the sample audio data.
In a specific application, the device of this embodiment may further include (not shown in the figures):
a loss-function selection module, configured to select a softmax loss function for the task-driven layer, whose output is the predicted probability of each audio scene class.
In this way, the device of this embodiment can be extended to multi-label separation, making visible how the predicted probability is distributed over the different audio scene classes.
It should be noted that the device/system embodiment is basically similar to the method embodiment and is therefore described relatively simply; for the relevant parts, refer to the partial explanations of the method embodiment.
The audio scene classification device of this embodiment can classify audio scenes and is simple to implement, low in cost, and high in accuracy. The device of this embodiment can use unsupervised training to learn the features of different scenes automatically, needs little labeled data, and can greatly reduce the cost of data labeling; it is highly adaptable, since adjusting the last layer of the network yields both multi-label classification and the predicted distribution over labels; and it generalizes well, serving as a feature-learning method for other related tasks.
The training and deployment of the device also depend on the computer system. On a multi-core or cluster computer system, some of the above steps, such as steps 100a-100c, can run in parallel. Very slow parts of training can be accelerated with a GPU, and very large data volumes can be processed in mini-batches, for example when pre-training the RBMs. So that the above deep belief network classification model can run on portable mobile terminals such as mobile phones, the model can be compressed to reduce its complexity and thus its hardware requirements.
The audio scene classification device of this embodiment can be used to carry out the technical solutions of the foregoing method embodiments; its principles and technical effects are similar and are not repeated here.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
It should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another and do not necessarily require or imply any such actual relation or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element. Terms indicating orientation or positional relationship, such as "on" and "under", are based on the orientations or positional relationships shown in the drawings; they are used only to ease and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, so they should not be understood as limiting the present invention. Unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" should be understood broadly: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or internal between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
Many specific details are set forth in the specification of the present invention. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description. Similarly, it should be appreciated that, to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the description of the exemplary embodiments above. The disclosed method should not, however, be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention. It should be noted that the embodiments in the present application and the features of the embodiments may be combined with one another without conflict. The invention is not limited to any single aspect, nor to any single embodiment, nor to any combination and/or permutation of these aspects and/or embodiments; each aspect and/or embodiment of the invention may be used alone or in combination with one or more other aspects and/or embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some or all of the technical features; and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention, which should all be covered by the claims and specification of the present invention.

Claims (10)

1. An audio scene classification method, characterized by including:
building a deep belief network classification model from sample audio data; and
inputting audio data to be tested into the deep belief network classification model to obtain an audio scene classification result;
wherein the sample audio data include unlabeled audio data and labeled audio data, a label indicating the scene class of the audio data.
2. The method according to claim 1, characterized in that building the deep belief network classification model from the sample audio data includes:
taking the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, training multiple RBMs layer by layer from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
adding, after the last hidden layer of the DBN, a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene classes; and
inputting the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tuning the parameters of every layer of the whole network layer by layer from top to bottom until convergence.
3. The method according to claim 2, characterized in that taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using an unsupervised learning method, training multiple RBMs layer by layer from bottom to top to generate the DBN until the DBN reaches an equilibrium state includes:
taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using a layer-wise greedy algorithm, training multiple RBMs unsupervised, layer by layer from bottom to top, to generate the DBN, until the DBN reaches an equilibrium state.
4. The method according to claim 3, characterized in that taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using the layer-wise greedy algorithm, training multiple RBMs unsupervised from bottom to top to generate the DBN until the DBN reaches an equilibrium state includes:
for each RBM composing the DBN, the visible layer serving as its input layer and the hidden layer as its output layer, taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, starting from the bottom RBM, training each RBM unsupervised, layer by layer from bottom to top, the output of each RBM serving as the input of the next RBM to be trained; and
during unsupervised training, each RBM automatically generating a hidden feature h corresponding to its input data v and, for their joint probability p(v, h), using Gibbs sampling to sample v and h alternately and update the RBM parameters, until the RBM loss function stabilizes, so that p(v, h) is maximized.
5. The method according to claim 2, characterized in that the task-driven layer is a classifier, including: a support vector machine classifier, a random forest classifier, or a softmax classifier.
6. The method according to claim 2, characterized in that inputting the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tuning the parameters of every layer of the whole network layer by layer from top to bottom until convergence includes:
inputting the labeled audio data in the sample audio data into the task-driven layer and, using the back-propagation algorithm, fine-tuning the parameters of every layer of the whole network with supervision until convergence.
7. The method according to claim 1, characterized in that, before building the deep belief network classification model from the sample audio data, the method further includes:
preprocessing the original audio data according to a preset format;
cutting each audio item in the preprocessed audio data according to a preset window size; and
dividing the audio data segments after cutting into two parts: each segment in one part is given a label indicating its scene class and serves as the labeled audio data in the sample audio data, while the other part serves as the unlabeled audio data in the sample audio data.
8. An audio scene classification device, characterized by including:
a building module, configured to build a deep belief network classification model from sample audio data; and
a classification module, configured to input audio data to be tested into the deep belief network classification model to obtain an audio scene classification result;
wherein the sample audio data include unlabeled audio data and labeled audio data, a label indicating the scene class of the audio data.
9. The device according to claim 8, characterized in that the building module is specifically configured to:
take the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, train multiple RBMs layer by layer from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
add, after the last hidden layer of the DBN, a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene classes; and
input the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tune the parameters of every layer of the whole network layer by layer from top to bottom until convergence.
10. The device according to claim 8, characterized in that the device further includes:
a preprocessing module, configured to preprocess the original audio data according to a preset format; cut each audio item in the preprocessed audio data according to a preset window size; and divide the audio data segments after cutting into two parts, each segment in one part being given a label indicating its scene class to serve as the labeled audio data in the sample audio data, while the other part serves as the unlabeled audio data in the sample audio data.
CN201710257902.7A 2017-04-19 2017-04-19 audio scene classification method and device Pending CN107203777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710257902.7A CN107203777A (en) 2017-04-19 2017-04-19 audio scene classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710257902.7A CN107203777A (en) 2017-04-19 2017-04-19 audio scene classification method and device

Publications (1)

Publication Number Publication Date
CN107203777A true CN107203777A (en) 2017-09-26

Family

ID=59905814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710257902.7A Pending CN107203777A (en) 2017-04-19 2017-04-19 audio scene classification method and device

Country Status (1)

Country Link
CN (1) CN107203777A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104182621A (en) * 2014-08-08 2014-12-03 同济大学 DBN based ADHD discriminatory analysis method
CN105894011A (en) * 2016-03-28 2016-08-24 南京邮电大学 Deep-belief-network-based cognitive decision-making method
CN106328121A (en) * 2016-08-30 2017-01-11 南京理工大学 Chinese traditional musical instrument classification method based on deep belief network

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943865A (en) * 2017-11-10 2018-04-20 阿基米德(上海)传媒有限公司 An audio classification and labeling method and system suitable for multiple scenes and multiple types
CN108615532A (en) * 2018-05-03 2018-10-02 张晓雷 A classification method and device applied to sound scenes
CN108615532B (en) * 2018-05-03 2021-12-07 张晓雷 Classification method and device applied to sound scene
CN112534500A (en) * 2018-07-26 2021-03-19 Med-El电气医疗器械有限公司 Neural network audio scene classifier for hearing implants
CN111261174B (en) * 2018-11-30 2023-02-17 杭州海康威视数字技术股份有限公司 Audio classification method and device, terminal and computer readable storage medium
CN111261174A (en) * 2018-11-30 2020-06-09 杭州海康威视数字技术股份有限公司 Audio classification method and device, terminal and computer readable storage medium
CN109545189A (en) * 2018-12-14 2019-03-29 东华大学 A machine-learning-based spoken pronunciation error detection and correction system
CN110176250B (en) * 2019-05-30 2021-05-07 哈尔滨工业大学 Robust acoustic scene recognition method based on local learning
CN110176250A (en) * 2019-05-30 2019-08-27 哈尔滨工业大学 A robust acoustic scene recognition method based on local learning
CN111341341A (en) * 2020-02-11 2020-06-26 腾讯科技(深圳)有限公司 Training method of audio separation network, audio separation method, device and medium
CN111540375A (en) * 2020-04-29 2020-08-14 全球能源互联网研究院有限公司 Training method of audio separation model, and audio signal separation method and device
CN111540375B (en) * 2020-04-29 2023-04-28 全球能源互联网研究院有限公司 Training method of audio separation model, and separation method and device of audio signals
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device
CN111883113B (en) * 2020-07-30 2024-01-30 云知声智能科技股份有限公司 Voice recognition method and device

Similar Documents

Publication Publication Date Title
CN107203777A (en) Audio scene classification method and device
CN107977356B (en) Method and device for correcting recognized text
CN103049792B (en) Deep neural network discriminative pre-training
CN108922560A (en) An urban noise recognition method based on an interactive deep neural network model
CN109785833A (en) Human-computer interaction audio recognition method and system for smart machine
CN107832835A (en) A lightweighting method and device for convolutional neural networks
WO2017183242A1 (en) Information processing device and information processing method
CN109685137A (en) A topic classification method, device, electronic equipment and storage medium
CN112634935B (en) Voice separation method and device, electronic equipment and readable storage medium
CN104572631B (en) A training method and system for a language model
CN108806694A (en) A teaching attendance method based on voice recognition
CN114443899A (en) Video classification method, device, equipment and medium
US11943460B2 (en) Variable bit rate compression using neural network models
CN110321555A (en) A power grid signal classification method based on a recurrent neural network model
CN116541755A (en) Financial behavior pattern analysis and prediction method based on time sequence diagram representation learning
CN116663540A (en) Financial event extraction method based on small sample
CN102237084A (en) Method, device and equipment for adaptively adjusting sound space benchmark model online
CN111899766A (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN109726288A (en) Text classification method and device based on artificial-intelligence processing
CN113158835A (en) Traffic accident intelligent detection method based on deep learning
CN116542783A (en) Risk assessment method, device, equipment and storage medium based on artificial intelligence
CN113539298B (en) Sound big data analysis and calculation imaging system based on cloud edge end
CN112784094B (en) Automatic audio summary generation method and device
Sattigeri et al. A scalable feature learning and tag prediction framework for natural environment sounds
Pan [Retracted] Automatic Classification Method of Music Genres Based on Deep Belief Network and Sparse Representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170926)