CN107203777A - audio scene classification method and device - Google Patents
- Publication number
- CN107203777A CN107203777A CN201710257902.7A CN201710257902A CN107203777A CN 107203777 A CN107203777 A CN 107203777A CN 201710257902 A CN201710257902 A CN 201710257902A CN 107203777 A CN107203777 A CN 107203777A
- Authority
- CN
- China
- Prior art keywords
- voice data
- audio data
- layer
- rbm
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/24 — Pattern recognition; classification techniques
- G06F18/2411 — Classification based on the proximity to a decision surface, e.g. support vector machines
- G06F18/2415 — Classification based on parametric or probabilistic models, e.g. likelihood ratio
- G06N3/084 — Neural-network learning by backpropagation, e.g. using gradient descent
- G06N3/088 — Non-supervised learning, e.g. competitive learning
- G10L25/30 — Speech or voice analysis using neural networks
- G10L25/48 — Speech or voice analysis specially adapted for particular use
Abstract
The present invention provides an audio scene classification method and device. The method includes: building a deep belief network (DBN) classification model from sample audio data; and inputting audio data to be tested into the DBN classification model to obtain an audio scene classification result. The sample audio data comprise unlabeled audio data and labeled audio data, a label indicating the scene category of the audio data. The invention can classify audio scenes, is simple to implement, and achieves relatively low cost and relatively high accuracy.
Description
Technical field
The present invention relates to the field of deep learning and audio information processing, and in particular to an audio scene classification method and device.
Background art
With the widespread adoption of mobile devices such as mobile phones, scene-aware applications are becoming increasingly common. Enabling a mobile device to automatically sense its surrounding scene is therefore an important and challenging task.
A scene is a high-level semantic representation of an audio signal. Audio scene classification applies a classifier, obtained by machine learning on the acoustic features of audio, to automatically and intelligently identify the current environmental scene. Existing audio scene classification methods generally follow one of two schemes: (1) acoustic modeling of the audio scene; (2) detection of the objects and events occurring in the scene. The first scheme learns the acoustic model of a specific scene directly from audio data, regardless of which sound elements that model actually contains. The second scheme parses the audio in more detail: it indirectly detects audio objects and events in order to infer the scene in which the audio was produced.
However, these existing methods have several shortcomings. (1) Learning an acoustic model requires hand-designed acoustic features followed by supervised training; hand-crafting features demands substantial domain knowledge and considerable labor. (2) Detecting scene objects and events is itself a hard problem. Moreover, what chiefly distinguishes different scenes is the type of ambient sound, and ambient sounds are usually mixtures of many audio elements; separating foreground sounds (events) from background sounds simultaneously is likewise difficult and requires a great deal of prior knowledge.
In view of this, how to provide an audio scene classification method and device that are simple to implement, low in cost and high in accuracy is a technical problem that currently needs to be addressed.
Summary of the invention
To solve the above technical problem, the present invention provides an audio scene classification method and device that can classify audio scenes, are simple to implement, and achieve relatively low cost and relatively high accuracy.
In a first aspect, the present invention provides an audio scene classification method, comprising:
building a deep belief network classification model from sample audio data; and
inputting audio data to be tested into the deep belief network classification model to obtain an audio scene classification result;
wherein the sample audio data comprise unlabeled audio data and labeled audio data, a label indicating the scene category of the audio data.
Optionally, building the deep belief network classification model from the sample audio data includes:
taking the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, training multiple RBM layers one by one from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
adding, after the last hidden layer of the DBN, a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene categories; and
inputting the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tuning the parameters of every layer of the whole network layer by layer from top to bottom, until convergence.
Optionally, taking the unlabeled audio data as the input of the bottom RBM and training multiple RBM layers from bottom to top to generate the DBN, until the DBN reaches an equilibrium state, includes:
taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using a layer-wise greedy algorithm, training multiple RBM layers unsupervisedly, one layer at a time from bottom to top, to generate the DBN, until the DBN reaches an equilibrium state.
Optionally, this further includes:
for each RBM constituting the DBN, the visible layer serves as its input layer and the hidden layer as its output layer; the unlabeled audio data in the sample audio data serve as the input of the bottom RBM, and starting from the bottom RBM each RBM layer is trained unsupervisedly from bottom to top, the output of each RBM serving as the input of the next RBM to be trained; and
during the unsupervised training of each RBM, for its input data v a corresponding hidden feature h is generated automatically and, for the joint probability p(v, h), v and h are sampled alternately by Gibbs sampling to update the RBM's parameters, until the RBM's loss function stabilizes, so that p(v, h) is maximized.
Optionally, the task-driven layer is a classifier, such as a support vector machine classifier, a random forest classifier or a softmax classifier.
Optionally, inputting the labeled audio data into the task-driven layer and fine-tuning the parameters of every layer of the whole network with a supervised learning method, until convergence, includes:
inputting the labeled audio data in the sample audio data into the task-driven layer and, using the back-propagation algorithm, fine-tuning the parameters of every layer of the whole network in a supervised manner, until convergence.
Optionally, before building the deep belief network classification model from the sample audio data, the method further includes:
pre-processing the original audio data according to a preset format;
cutting each audio recording in the pre-processed audio data according to a preset window size; and
dividing the audio data segments obtained by cutting into two parts, attaching to each segment in one part a label indicating its scene category to serve as the labeled audio data in the sample audio data, and using the other part as the unlabeled audio data in the sample audio data.
In a second aspect, the present invention provides an audio scene classification device, comprising:
a building module, configured to build a deep belief network classification model from sample audio data; and
a classification module, configured to input audio data to be tested into the deep belief network classification model and obtain an audio scene classification result;
wherein the sample audio data comprise unlabeled audio data and labeled audio data, a label indicating the scene category of the audio data.
Optionally, the building module is specifically configured to:
take the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, train multiple RBM layers one by one from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
add, after the last hidden layer of the DBN, a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene categories; and
input the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tune the parameters of every layer of the whole network layer by layer from top to bottom, until convergence.
Optionally, the device further includes:
a pre-processing module, configured to pre-process the original audio data according to a preset format; cut each audio recording in the pre-processed audio data according to a preset window size; divide the audio data segments obtained by cutting into two parts; attach to each segment in one part a label indicating its scene category to serve as the labeled audio data in the sample audio data; and use the other part as the unlabeled audio data in the sample audio data.
As can be seen from the above technical solutions, the audio scene classification method and device of the present invention build a deep belief network classification model from sample audio data and input audio data to be tested into that model to obtain an audio scene classification result, where the sample audio data comprise unlabeled audio data and labeled audio data and a label indicates the scene category of the audio data. The invention can thus classify audio scenes, is simple to implement, and achieves relatively low cost and relatively high accuracy.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the audio scene classification method provided by an embodiment of the present invention;
Fig. 2 is a schematic flow chart detailing step 101 shown in Fig. 1;
Fig. 3 is a schematic structural diagram of the audio scene classification device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
Fig. 1 shows a schematic flow chart of the audio scene classification method provided by an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment is as follows.
101. Build a deep belief network classification model from sample audio data.
Here the sample audio data D = {D_u, D_s} comprise unlabeled audio data D_u and labeled audio data D_s, a label indicating the scene category of the audio data. D_u = X_u is audio recorded in real scenes, where X_u ∈ R^(d×N_u) are the unlabeled audio features, d is the dimension of the audio features, N_u is the number of audio samples, and u stands for unsupervised. D_s = {X_s, y}, where s stands for supervised, y ∈ {1, 2, ..., M} is the label, and M is the number of audio scene categories, an integer greater than 1.
In a specific application, before step 101, the method may further include steps 100a-100c (not shown in the figures).
100a. Pre-process the original audio data according to a preset format.
In a specific application, the preset format may include a preset sample rate, channel type, pulse-code modulation (PCM) encoding format, and the like.
100b. Cut each audio recording in the pre-processed audio data according to a preset window size.
In a specific application, if the last audio data segment after cutting is shorter than the preset window size, that last segment is discarded. For example, the preset window size may be 1 second; each recording in the original audio data is then cut with a 1-second window, and a final segment shorter than 1 second is discarded.
100c. Divide the audio data segments obtained by cutting into two parts; attach to each segment in one part a label indicating its scene category to serve as the labeled audio data in the sample audio data, and use the other part as the unlabeled audio data in the sample audio data.
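Steps 100a-100c can be sketched as follows. This is a minimal illustration, not the patent's implementation: the window length and the labeled/unlabeled split ratio are assumptions (the embodiment only fixes a 1-second window at 22.05 kHz), and the resampling of step 100a is assumed to have been done already.

```python
import numpy as np

def cut_into_windows(samples: np.ndarray, sample_rate: int, window_sec: float = 1.0):
    """Step 100b: cut a 1-D mono signal into non-overlapping windows and
    discard the last partial window if it is shorter than the window size."""
    win = int(sample_rate * window_sec)
    n_full = len(samples) // win
    return [samples[i * win:(i + 1) * win] for i in range(n_full)]

def split_sample_pool(segments, labels, labeled_fraction: float = 0.3, seed: int = 0):
    """Step 100c: keep the scene label on one part of the segments (D_s)
    and use the rest as unlabeled data (D_u). The fraction is illustrative."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(segments))
    n_lab = int(labeled_fraction * len(segments))
    labeled = [(segments[i], labels[i]) for i in idx[:n_lab]]
    unlabeled = [segments[i] for i in idx[n_lab:]]
    return labeled, unlabeled
```

A 3.005-second recording at 22050 Hz would thus yield exactly three 22050-sample segments, with the trailing 100 samples discarded.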
In a specific application, as shown in Fig. 2, step 101 may include steps 101a-101c.
101a. Take the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, train multiple RBM layers one by one from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state.
Specifically, step 101a may include: taking the unlabeled audio data as the input of the bottom RBM and, using a layer-wise greedy algorithm, training multiple RBM layers unsupervisedly from bottom to top to generate the DBN, until the DBN reaches an equilibrium state. This may further include:
For each RBM constituting the DBN, the visible layer serves as its input layer and the hidden layer as its output layer. The unlabeled audio data serve as the input of the bottom RBM; starting from the bottom RBM, each RBM layer is trained unsupervisedly from bottom to top, the output of each RBM serving as the input of the next RBM to be trained.
During the unsupervised training of each RBM, for its input data v a corresponding hidden feature h is generated automatically and, for the joint probability p(v, h), v and h are sampled alternately by Gibbs sampling to update the RBM's parameters, until the RBM's loss function stabilizes, so that p(v, h) is maximized. It can be understood that maximizing p(v, h) amounts to minimizing the error between the actual input data v and the v reconstructed from the hidden feature h.
The joint probability p(v, h) can be expressed through the RBM energy function E(v, h | θ):
E(v, h | θ) = -b^T v - c^T h - v^T w h,  p(v, h) = exp(-E(v, h | θ)) / Z(θ)  (1)
where Z(θ) is the normalizing constant, θ = {w, b, c} are the RBM's parameters, w is the weight between the visible layer and the hidden layer, b is the bias of the visible layer v, c is the bias of the hidden layer h, and T denotes transposition.
Specifically, the above can be understood as: given D_u, find θ such that the network fits D_u as well as possible, expressed as
L(θ) = Σ_{i=1}^{N_u} log p(v_i)  (2)
where i = 1, ..., N_u is the sample index. Formula (2) is optimized by alternating optimization, which may include: first fixing two of the parameters in θ, taking the derivative with respect to the remaining parameter, and updating that parameter; then alternately fixing two parameters in θ and updating the remaining one in the same way, until L(θ) stabilizes.
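The alternating Gibbs sampling on v and h used to update an RBM's parameters is commonly approximated by one step of contrastive divergence (CD-1). The following is a minimal sketch under that assumption; layer sizes, learning rate and the monitored reconstruction error are illustrative, not taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01, rng=None):
    """One CD-1 update of theta = {W, b, c} for a batch v0 of shape
    (batch, visible). Returns the reconstruction error, which plays the
    role of the stabilising loss monitored during training."""
    rng = rng or np.random.default_rng(0)
    # Up: sample the hidden feature h given the data v.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(v0.dtype)
    # Down, then up again: reconstruct v, recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Gradient approximation: data statistics minus model statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return float(((v0 - pv1) ** 2).mean())
```

Repeating `cd1_step` until the returned error stops changing corresponds to training one RBM layer to equilibrium; the hidden probabilities then become the input of the next RBM in the stack.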
101b. After the last hidden layer of the DBN, add a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene categories.
The task-driven layer is a classifier, which may include a support vector machine classifier, a random forest classifier, a softmax classifier, or the like. This embodiment does not limit the choice; the task-driven layer may also be another classifier.
101c. Input the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tune the parameters of every layer of the whole network layer by layer from top to bottom, until convergence.
It can be understood that once step 101c has been performed, the deep belief network classification model has been built.
Specifically, step 101c may include: inputting the labeled audio data in the sample audio data into the task-driven layer and, using the back-propagation (BP) algorithm, fine-tuning the parameters of every layer of the whole network in a supervised manner, until convergence.
It can be understood that the main idea of the BP algorithm is to propagate the error backwards and continually adjust the weights so that the network's output fits the data labels y as closely as possible. Specifically, this may further include:
For any given labeled audio datum x, y ∈ D_s passed through the L layers of the DBN, the output value is h(z^L), where h(·) is the activation function and z^l is the weighted sum from layer l-1, i.e. z^l = W^(l-1) z^(l-1) + b^(l-1). This embodiment does not limit the loss function of the classifier (i.e. the task-driven layer); assuming the distance between the training output and the label is measured by the Euclidean distance, the loss is
L'(θ) = (1/2) Σ_{(x,y)∈D_s} ‖h(z^L) - y‖²  (3)
Taking the derivative of L'(θ) with respect to each parameter gives that parameter's gradient; the error is back-propagated and the value of the parameters θ is updated continually, so that the network finally tends towards stability.
During fine-tuning, this embodiment may also set aside a small fraction of the training data to validate how good the trained model is, in order to select the best parameters.
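One supervised fine-tuning step of this kind can be sketched as below: a forward pass through the stacked layers, the Euclidean loss against a one-hot label, and back-propagation of the error through every layer. The sigmoid activation and the in-place parameter update are illustrative assumptions; the patent leaves the activation and loss function open.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_step(x, y_onehot, Ws, bs, lr=0.1):
    """One BP update over all layers. Ws/bs hold the DBN weights plus the
    task-driven layer and are updated in place; returns the Euclidean loss."""
    # Forward pass, caching every layer's activation.
    acts = [x]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(acts[-1] @ W + b))
    out = acts[-1]
    n = x.shape[0]
    loss = 0.5 * float(((out - y_onehot) ** 2).sum()) / n
    # Backward pass: error delta through the sigmoid derivative a * (1 - a).
    delta = (out - y_onehot) * out * (1 - out)
    for l in range(len(Ws) - 1, -1, -1):
        grad_W = acts[l].T @ delta / n
        grad_b = delta.mean(axis=0)
        delta = (delta @ Ws[l].T) * acts[l] * (1 - acts[l])  # before the update
        Ws[l] -= lr * grad_W
        bs[l] -= lr * grad_b
    return loss
```

Calling `finetune_step` repeatedly on labeled batches drives the loss down until the network stabilizes, which is the convergence criterion of step 101c.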
102. Input the audio data to be tested into the deep belief network classification model to obtain the audio scene classification result.
In a specific application, the method of this embodiment may further include: selecting a softmax loss function for the task-driven layer, whose output is then the predicted probability of each audio scene category.
In this way, the method of this embodiment can be extended to multi-label separation, making visible how the predicted probability is distributed over the different audio scene categories.
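A minimal sketch of that output layer: softmax turns the task-driven layer's raw scores into a probability for every scene category, so the model can report a distribution over scenes rather than a single hard label.

```python
import numpy as np

def softmax(z):
    """Per-row softmax over class scores; the max is subtracted first
    for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

For a multi-label reading, every category whose probability exceeds some threshold can be reported, instead of only the argmax.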
Taking a concrete application as an example, the method of this embodiment may first pre-process all recordings in the original audio data, converting each recording to a 22.05 kHz sample rate, mono, pulse-code modulation (PCM) format; then cut each recording with a 1-second window, discarding a final window shorter than 1 second. These data are fed into the data input layer of the DBN, i.e. the DBN's input layer has 22050 neurons. In order to learn a compact feature representation of the audio, the numbers of neurons in the following layers decrease successively. This embodiment does not limit the number of neurons per layer or the number of layers; here it is assumed there are 6 hidden layers, outputting a 1024-dimensional feature. To prevent overfitting, this embodiment may randomly set the weights of a certain proportion of the neurons in each layer to 0. The unlabeled audio data serve as the input of the bottom restricted Boltzmann machine (RBM); using an unsupervised learning method, multiple RBM layers are trained one by one from bottom to top to generate the deep belief network (DBN), until the DBN reaches an equilibrium state. After the last hidden layer of the DBN, a task-driven layer for classification is added, the number of its outputs being the number of audio scene categories. The labeled audio data are input into the task-driven layer and, using a supervised learning method, the parameters of every layer of the whole network are fine-tuned layer by layer from top to bottom, until convergence. Assuming the loss function of the task-driven layer is the softmax function, the probability distribution over the scene categories can then be predicted. During fine-tuning, a small fraction of the training data may again be set aside to validate how good the trained model is, in order to select the best parameters.
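The network shape described in this example can be sketched as below. Only the 22050-neuron input, the 6 hidden layers and the 1024-dimensional final feature are fixed by the embodiment; the intermediate widths here are assumptions chosen only to shrink monotonically.

```python
def dbn_layer_sizes(n_scenes: int):
    """Layer widths for the example DBN: one second of 22.05 kHz audio in,
    six shrinking hidden layers ending at 1024 features, M scene outputs.
    The intermediate hidden widths are illustrative assumptions."""
    hidden = [8192, 6144, 4096, 3072, 2048, 1024]
    return [22050] + hidden + [n_scenes]
```

Each adjacent pair of sizes defines one RBM to pre-train (visible layer, hidden layer), and the final pair is the task-driven layer added in step 101b.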
The audio scene classification method of this embodiment builds a deep belief network classification model from sample audio data and inputs audio data to be tested into that model to obtain an audio scene classification result, where the sample audio data comprise unlabeled audio data and labeled audio data and a label indicates the scene category of the audio data. The invention can thus classify audio scenes, is simple to implement, and achieves relatively low cost and relatively high accuracy.
The method of this embodiment can use unsupervised training to learn the features of different scenes automatically, without requiring much labeled data, greatly reducing the cost of data annotation. It is highly adaptable: by adjusting the last layer of the network, it can obtain both a multi-label classification and the predicted distribution over labels. It also generalizes well and can serve as a feature learning method for other related tasks.
The training and deployment of the method of this embodiment also depend on the computer system. On a multi-core or cluster computer system, some of the above steps, such as steps 100a-100c, can be performed in parallel. Some very slow training processes can be accelerated using a GPU; when the data volume is very large, mini-batch processing can be used, for example when pre-training the RBMs. To allow the above deep belief network classification model to be used on portable mobile terminals such as mobile phones, the model can be compressed to reduce its complexity and thereby its hardware requirements.
In a specific application, for example, the specific algorithm implementing step 101a may be:
Input: unlabeled audio features X_u, learning rate η, number of network layers l.
Output: weight matrices W, bias vectors b, c.
In a specific application, for example, the specific algorithm implementing step 101c may be:
Input: labeled audio features X_s, y, learning rate η, number of network layers l, activation function h(·), initial network weights W and biases b.
Output: weight matrices W and bias vectors b after the network stabilizes.
Fig. 3 shows a schematic structural diagram of the audio scene classification device provided by an embodiment of the present invention. As shown in Fig. 3, the device of this embodiment includes a building module 31 and a classification module 32, wherein:
the building module 31 is configured to build a deep belief network classification model from sample audio data;
the classification module 32 is configured to input audio data to be tested into the deep belief network classification model and obtain an audio scene classification result;
and the sample audio data comprise unlabeled audio data and labeled audio data, a label indicating the scene category of the audio data.
In a specific application, the building module 31 may be specifically configured to:
take the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, train multiple RBM layers one by one from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
add, after the last hidden layer of the DBN, a task-driven layer for classification, the number of outputs of the task-driven layer being the number of audio scene categories; and
input the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tune the parameters of every layer of the whole network layer by layer from top to bottom, until convergence.
In a particular application, the device of this embodiment further includes (not shown in the figure):
a pre-processing module configured to pre-process the original audio data according to a preset format; cut each audio in the pre-processed audio data according to a preset window size; and divide the cut audio data segments into two parts, one part of the segments being labeled with scene categories and used as the labeled audio data in the sample audio data, the other part being used as the unlabeled audio data in the sample audio data.
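The pre-processing module's cutting-and-splitting step can be sketched as follows. The window size, split ratio, and scene label here are illustrative assumptions, not values fixed by the patent.

```python
def cut_and_split(audio, window=4, labeled_ratio=0.5, scene_label="street"):
    """Cut one audio sequence into fixed-size windows, then split the
    segments into a labeled part and an unlabeled part."""
    segments = [audio[i:i + window]
                for i in range(0, len(audio) - window + 1, window)]
    n_labeled = int(len(segments) * labeled_ratio)
    # attach the scene-category label to one part of the segments
    labeled = [(seg, scene_label) for seg in segments[:n_labeled]]
    unlabeled = segments[n_labeled:]
    return labeled, unlabeled
```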
In a particular application, the device of this embodiment further includes (not shown in the figure):
a loss function selection module configured to select the softmax loss function for the task-driven layer, whose output is the predicted probability of each audio scene category.
In this way, the device of this embodiment can be extended to multi-label classification, and the predicted probabilities reveal the distribution over the different audio scene categories.
It should be noted that, since the device/system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
The audio scene classification device of this embodiment can classify audio scenes with a simple implementation, low cost, and high accuracy. The device can use an unsupervised training method to learn the features of different scenes automatically, without requiring much labeled data, which can greatly reduce the cost of data labeling. It is highly practical: by adjusting the last layer of the network, it can both perform multi-label classification and yield the predicted distribution over labels. It also generalizes well and can serve as a feature learning method for other related tasks.
The training and application of the device of this embodiment also depend on the computer system. On a multi-core or clustered computer system, some of the above steps can be performed in parallel, such as steps 100a-100c. Some very slow training processes can be accelerated with a GPU, and when the data volume is very large, mini-batch processing can be used, for example when pre-training the RBMs. To allow the above deep belief network classification model to run on portable mobile terminals such as mobile phones, the model can be compressed to reduce its complexity and thus its hardware requirements.
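The mini-batch processing mentioned above amounts to splitting the training data into small chunks so each RBM update touches only a fraction of the data. A trivial batching helper, with an assumed batch size:

```python
def minibatches(data, batch_size=8):
    """Yield successive mini-batches of the training data, as used when
    the data set is too large to pre-train an RBM in one pass."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
```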
The audio scene classification device of this embodiment can be used to carry out the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
It should be understood by those skilled in the art that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
It should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element. Terms such as "upper" and "lower" indicate orientations or positional relationships based on those shown in the drawings; they are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and are therefore not to be construed as limiting the invention. Unless otherwise clearly specified and limited, terms such as "installed", "connected", and "coupled" are to be interpreted broadly; for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect via an intermediary; or an internal connection between two elements. For those of ordinary skill in the art, the specific meaning of the above terms in the present invention can be understood according to the specific circumstances.
In the specification of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure an understanding of this description. Similarly, it should be appreciated that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other. The invention is not limited to any single aspect or embodiment, nor to any combination and/or permutation of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of the invention may be used alone or in combination with one or more other aspects and/or embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some or all of the technical features; such modifications or substitutions do not depart the essence of the corresponding technical solutions from the scope of the technical solutions of the embodiments of the present invention, and shall all be covered by the claims and the specification of the present invention.
Claims (10)
1. An audio scene classification method, characterized in that it comprises:
building a deep belief network classification model from sample audio data; and
inputting audio data to be tested into the deep belief network classification model to obtain an audio scene classification result;
wherein the sample audio data includes unlabeled audio data and labeled audio data, the label indicating the scene category of the audio data.
2. The method according to claim 1, characterized in that building a deep belief network classification model from sample audio data comprises:
taking the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, training the multilayer RBMs layer by layer from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
adding a task-driven layer for classification after the last hidden layer of the DBN, the number of outputs of the task-driven layer being the number of audio scene categories; and
inputting the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tuning the parameters of every layer of the whole network layer by layer from top to bottom until convergence.
3. The method according to claim 2, characterized in that taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using an unsupervised learning method, training the multilayer RBMs layer by layer from bottom to top to generate the DBN until the DBN reaches an equilibrium state comprises:
taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using a layer-wise greedy algorithm, training the multilayer RBMs layer by layer from bottom to top without supervision to generate the DBN, until the DBN reaches an equilibrium state.
4. The method according to claim 3, characterized in that taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, using a layer-wise greedy algorithm, training the multilayer RBMs layer by layer from bottom to top without supervision to generate the DBN until the DBN reaches an equilibrium state comprises:
for each RBM constituting the DBN, using its visible layer as its input layer and its hidden layer as its output layer;
taking the unlabeled audio data in the sample audio data as the input of the bottom RBM and, starting from the bottom RBM, training each layer of RBM without supervision, layer by layer from bottom to top, the output of each RBM serving as the input of the next RBM to be trained; and
during the unsupervised training of each RBM, automatically generating a corresponding hidden feature h for its input data v, and, for their joint probability p(v, h), alternately sampling v and h by Gibbs sampling to update the parameters of the RBM, until the loss function of the RBM stabilizes, so that p(v, h) is maximized.
5. The method according to claim 2, characterized in that the task-driven layer is a classifier, including: a support vector machine classifier, a random forest classifier, or a softmax classifier.
6. The method according to claim 2, characterized in that inputting the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tuning the parameters of every layer of the whole network layer by layer from top to bottom until convergence comprises:
inputting the labeled audio data in the sample audio data into the task-driven layer and, using a back-propagation algorithm, fine-tuning the parameters of every layer of the whole network with supervision until convergence.
7. The method according to claim 1, characterized in that, before building the deep belief network classification model from sample audio data, the method further comprises:
pre-processing the original audio data according to a preset format;
cutting each audio in the pre-processed audio data according to a preset window size; and
dividing the cut audio data segments into two parts, one part of the segments being labeled with scene categories and used as the labeled audio data in the sample audio data, the other part being used as the unlabeled audio data in the sample audio data.
8. An audio scene classification device, characterized in that it comprises:
a building module configured to build a deep belief network classification model from sample audio data; and
a classification module configured to input audio data to be tested into the deep belief network classification model to obtain an audio scene classification result;
wherein the sample audio data includes unlabeled audio data and labeled audio data, the label indicating the scene category of the audio data.
9. The device according to claim 8, characterized in that the building module is specifically configured to:
take the unlabeled audio data in the sample audio data as the input of the bottom restricted Boltzmann machine (RBM) and, using an unsupervised learning method, train the multilayer RBMs layer by layer from bottom to top to generate a deep belief network (DBN), until the DBN reaches an equilibrium state;
add a task-driven layer for classification after the last hidden layer of the DBN, the number of outputs of the task-driven layer being the number of audio scene categories; and
input the labeled audio data in the sample audio data into the task-driven layer and, using a supervised learning method, fine-tune the parameters of every layer of the whole network layer by layer from top to bottom until convergence.
10. The device according to claim 8, characterized in that the device further comprises:
a pre-processing module configured to pre-process the original audio data according to a preset format, cut each audio in the pre-processed audio data according to a preset window size, and divide the cut audio data segments into two parts, one part of the segments being labeled with scene categories and used as the labeled audio data in the sample audio data, the other part being used as the unlabeled audio data in the sample audio data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710257902.7A CN107203777A (en) | 2017-04-19 | 2017-04-19 | audio scene classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107203777A true CN107203777A (en) | 2017-09-26 |
Family
ID=59905814
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710257902.7A Pending CN107203777A (en) | 2017-04-19 | 2017-04-19 | audio scene classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107203777A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104182621A (en) * | 2014-08-08 | 2014-12-03 | 同济大学 | DBN based ADHD discriminatory analysis method |
CN105894011A (en) * | 2016-03-28 | 2016-08-24 | 南京邮电大学 | Depth-confidence-network-based cognitive decision-making method |
CN106328121A (en) * | 2016-08-30 | 2017-01-11 | 南京理工大学 | Chinese traditional musical instrument classification method based on depth confidence network |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107943865A (en) * | 2017-11-10 | 2018-04-20 | 阿基米德(上海)传媒有限公司 | It is a kind of to be suitable for more scenes, the audio classification labels method and system of polymorphic type |
CN108615532A (en) * | 2018-05-03 | 2018-10-02 | 张晓雷 | A kind of sorting technique and device applied to sound field scape |
CN108615532B (en) * | 2018-05-03 | 2021-12-07 | 张晓雷 | Classification method and device applied to sound scene |
CN112534500A (en) * | 2018-07-26 | 2021-03-19 | Med-El电气医疗器械有限公司 | Neural network audio scene classifier for hearing implants |
CN111261174B (en) * | 2018-11-30 | 2023-02-17 | 杭州海康威视数字技术股份有限公司 | Audio classification method and device, terminal and computer readable storage medium |
CN111261174A (en) * | 2018-11-30 | 2020-06-09 | 杭州海康威视数字技术股份有限公司 | Audio classification method and device, terminal and computer readable storage medium |
CN109545189A (en) * | 2018-12-14 | 2019-03-29 | 东华大学 | A kind of spoken language pronunciation error detection and correcting system based on machine learning |
CN110176250B (en) * | 2019-05-30 | 2021-05-07 | 哈尔滨工业大学 | Robust acoustic scene recognition method based on local learning |
CN110176250A (en) * | 2019-05-30 | 2019-08-27 | 哈尔滨工业大学 | It is a kind of based on the robust acoustics scene recognition method locally learnt |
CN111341341A (en) * | 2020-02-11 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Training method of audio separation network, audio separation method, device and medium |
CN111540375A (en) * | 2020-04-29 | 2020-08-14 | 全球能源互联网研究院有限公司 | Training method of audio separation model, and audio signal separation method and device |
CN111540375B (en) * | 2020-04-29 | 2023-04-28 | 全球能源互联网研究院有限公司 | Training method of audio separation model, and separation method and device of audio signals |
CN111883113A (en) * | 2020-07-30 | 2020-11-03 | 云知声智能科技股份有限公司 | Voice recognition method and device |
CN111883113B (en) * | 2020-07-30 | 2024-01-30 | 云知声智能科技股份有限公司 | Voice recognition method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170926 |