CN106024011A - MOAS based deep layer feature extracting method - Google Patents
- Publication number
- CN106024011A CN106024011A CN201610333538.3A CN201610333538A CN106024011A CN 106024011 A CN106024011 A CN 106024011A CN 201610333538 A CN201610333538 A CN 201610333538A CN 106024011 A CN106024011 A CN 106024011A
- Authority
- CN
- China
- Prior art keywords
- rbm
- deep feature
- moas
- layer
- extracting method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G — PHYSICS; G10 — MUSICAL INSTRUMENTS; ACOUSTICS; G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
- G10L25/27 — characterised by the analysis technique
- G10L25/30 — using neural networks
- G10L25/03 — characterised by the type of extracted parameters
- G10L25/48 — specially adapted for particular use
- G10L25/51 — for comparison or discrimination
- G10L25/57 — for processing of video signals
Abstract
The invention relates to a deep feature extraction method, and more specifically to a deep feature extraction method that employs MOAS (Movie Origin Audio Sample, i.e., raw movie audio sample points) as input. The method comprises the steps of: 1. constructing an RBM (Restricted Boltzmann Machine); 2. training the RBM; 3. constructing a deep feature extractor; 4. taking MOAS as the input of the deep feature extractor and extracting deep features. By taking MOAS as the input for deep feature extraction, the invention reduces the number of layers that must be trained and extracts more valid information than methods that use shallow features as input.
Description
Technical field
The present invention relates to deep feature extraction methods, and more particularly to a method that uses MOAS (Movie Origin Audio Sample, raw movie audio sample points) as input for extracting deep features.
Background art
With the development of Internet technology, movie data on the network has grown explosively, and online movie resources have become enormous. Because movies are easy to obtain, they attract a large audience, and the main problem currently facing movie signal processing is how to analyze, index, and manage this vast body of movie data so that people can quickly retrieve the information they want. Content analysis and understanding of movies has therefore become increasingly urgent. Audio is an important source of information for understanding multimedia content (Ghoraani, 2011), and it is also an important modality in film: whether measured by quantity or by content, it forms a significant component. In recent years, audio information has been used more and more in movie content analysis and understanding (Wang, 2006; Benini, 2013).
In research on movie audio content analysis and understanding, feature extraction is a very important problem: only with well-extracted features can movie audio signals be classified accurately and semantic reasoning about movie audio scenes be studied. The quality of the extracted features directly affects the accuracy of movie audio signal classification and of movie audio scene semantic reasoning; conversely, those accuracies can in turn be used to assess the performance of the features.
In earlier research on movie audio signals, the features used were usually hand-crafted shallow features, such as Mel-frequency cepstral coefficients (MFCC) and time-frequency features (Austin, 2010; Li, 2014). Shallow features merely transform the original input signal into a particular space and therefore cannot effectively characterize the signal, so movie audio signal processing has not met expectations. By contrast, the deep features learned by a deep neural network (DNN) (Hinton, 2006) not only dispense with the tedious and complicated process of hand-crafting features but can also capture characteristics that hand-crafted features cannot (Seide, 2011). Because a DNN can learn more useful features, it ultimately improves the accuracy of classification or prediction (Yu Kai, 2013).
In recent years, deep features have been widely used in speech recognition (Mohamed, 2011; Bao, 2013). These deep features are usually obtained by training a DNN on MFCC features, i.e., with MFCC as the DNN input. However, deep features obtained by training on MFCC must discard useless information while retaining what is useful, so the first few layers are generally not very effective, and good results usually require many layers. If MOAS is used directly as the DNN input, the DNN can extract effective deep features directly from MOAS, saving training depth. Moreover, the MFCC extraction process discards some useful information contained in MOAS, and when a DNN subsequently learns from MFCC, that lost information is difficult to recover; using MOAS directly as the DNN input avoids this loss. Therefore, using MOAS directly as the DNN input should require fewer layers than using MFCC as input and should also extract more useful information.
Summary of the invention
Addressing the shortcomings of current movie audio deep feature extraction, the present invention provides a MOAS-based deep feature extraction method.
To solve the above technical problem, the technical scheme of the present invention is as follows:
A MOAS-based deep feature extraction method takes MOAS as input. First, an RBM (Restricted Boltzmann Machine) is built and trained; then, by the same method, multiple RBMs are built and stacked to obtain a deep feature extractor; finally, MOAS is fed to this deep feature extractor as input to obtain the deep features.
The above MOAS-based deep feature extraction method specifically includes the following steps (a minimal sketch of the finished extractor follows this list):
S1. Build the first RBM, a two-layer neural network model consisting of a visible layer and a hidden layer;
S2. Using MOAS as the input of this RBM, train the RBM so that the likelihood of the visible layer is maximized;
S3. On the basis of the RBM trained in step S2, add another hidden layer: take the hidden layer of the first RBM as the visible layer of a second RBM, build the second RBM, and train it;
S4. Using the same method, build a deep feature extractor composed of n layers of RBMs;
S5. Fine-tune the deep feature extractor obtained in step S4 to obtain the final deep feature extractor;
S6. Using the deep feature extractor trained in step S5, take MOAS as input and extract the deep features corresponding to MOAS.
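For concreteness, the sketch below shows in NumPy what the finished extractor of steps S1-S6 computes at extraction time: MOAS frames propagated through the stacked RBM layers by successive sigmoid transformations. This is a minimal illustration, not the patented implementation; all names (`sigmoid`, `extract_deep_features`, `weights`, `biases`) are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_deep_features(weights, biases, moas):
    """Step S6: propagate MOAS frames through the trained RBM stack.

    weights : list of n weight matrices, one per RBM layer
    biases  : list of n hidden-layer bias vectors
    moas    : array of shape (n_frames, 512), one raw-audio frame per row
    """
    features = moas
    for W, c in zip(weights, biases):
        features = sigmoid(features @ W + c)  # one layer-to-layer transformation
    return features  # 39-dimensional per frame in the preferred embodiment
```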
In the above MOAS-based deep feature extraction method, the visible layer and hidden layer of each RBM are fully connected to each other, and there are no connections within a layer.
In the above MOAS-based deep feature extraction method, the number of nodes in the visible layer of the first RBM is set to 512 and the number of nodes in its hidden layer is set to 39.
In the above MOAS-based deep feature extraction method, the number of nodes in the visible layer of the second RBM is set to 39 and the number of nodes in its hidden layer is set to 39.
In the above MOAS-based deep feature extraction method, step S5 uses back-propagation (BP) to fine-tune the weights between the layers of the deep feature extractor, finally obtaining a deep feature extractor in which the weights of every layer are suitable.
In the above MOAS-based deep feature extraction method, the layer-to-layer transformation of the deep feature extractor composed of n RBM layers is

df′_{m+1} = σ(df′_m), 1 ≤ m ≤ n,

where df′_{m+1} and df′_m denote the deep features of layers m+1 and m respectively, and σ denotes the sigmoid function σ(x) = 1/(1 + e^{−x}).
Compared with the prior art, the technical solution of the present invention has the following beneficial effects:
(1) The features extracted by the MOAS-based deep feature extraction method of the present invention are deep features, which not only dispense with the complicated and tedious process of hand-crafting features but can also capture characteristics that hand-crafted features cannot.
(2) The present invention uses MOAS as the input of the deep feature extractor. Compared with using shallow features such as MFCC as input, this not only reduces the number of layers to be trained but also avoids losing useful information during MFCC extraction; in other words, with MOAS as input, more useful information can be extracted than with shallow features as input.
Brief description of the drawings
Fig. 1 is a flowchart of MOAS-based deep feature extraction;
Fig. 2 is a schematic diagram of the construction of the first RBM;
Fig. 3 is a schematic diagram of the construction of the second RBM;
Fig. 4 is a schematic diagram of the construction of the deep feature extractor.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawings and a specific embodiment, but the embodiment does not limit the present invention in any form.
Fig. 1 shows the basic process of extracting deep features from movie original audio sample points.
The MOAS-based deep feature extraction of the present invention is realized as follows:
1. First, prepare the data for training the deep feature extractor. The data are divided into two parts: pre-training data and fine-tuning data. The pre-training data are used to pre-train the deep feature extractor, yielding a preliminary extractor; the fine-tuning data are used to fine-tune that extractor. For both parts of the data, raw sample-point data and Mel-frequency cepstral coefficients are extracted.
2. Build and train the first RBM. Fig. 2 shows the construction of the first RBM: a two-layer neural network model consisting of a visible layer and a hidden layer, where the visible layer and hidden layer are fully connected to each other and there are no connections within a layer. Let v and h denote the states of the visible and hidden layers respectively; then a joint probability distribution can be assigned to the RBM:

P(v, h) = (1/Z) · exp(b^T v + c^T h + v^T W h),

where Z is the normalizing factor (partition function), W is the weight matrix, b and c are the bias (offset) vectors of the visible and hidden layers respectively, and T denotes transposition.
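The patent does not spell out the RBM training procedure. A common choice consistent with (Hinton, 2006) is one-step contrastive divergence (CD-1); the sketch below assumes a binary RBM (real-valued audio frames would more properly use a Gaussian visible layer), and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def cd1_update(W, b, c, v0, lr=0.01):
    """One CD-1 step for a binary RBM (an assumption; the patent does not
    specify the trainer). W: (n_visible, n_hidden); b, c: visible/hidden
    bias vectors; v0: (batch, n_visible) input frames. Updates in place."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Positive phase: hidden activations driven by the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back through the visible layer.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Update: data statistics minus one-step model statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
```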
3. On the basis of the first RBM, build and train the second RBM. Fig. 3 shows its construction. The hidden layer of the first RBM serves as its visible layer and, unlike in the first RBM, its visible and hidden layers have the same number of nodes. Train this RBM with the method described above.
4. Using the same method, build a deep feature extractor composed of n RBM layers. Fig. 4 shows the structure of this deep feature extractor.
5. Fine-tune the pre-trained deep feature extractor with the fine-tuning data. The fine-tuning method uses back-propagation (BP) to adjust the weights between the layers of the deep feature extractor, finally obtaining an extractor in which the weights of every layer are suitable (a hedged sketch of such a BP step is given below).
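A minimal sketch of one BP fine-tuning step, assuming a mean-squared-error loss against MFCC targets as in steps A2 and A5 below (the patent names BP but not the loss function). The sigmoid derivative a(1 − a) is applied layer by layer; names are illustrative.

```python
import numpy as np

def forward_acts(weights, biases, x):
    """Forward pass that keeps every layer's activations for back-propagation."""
    acts = [x]
    for W, c in zip(weights, biases):
        acts.append(1.0 / (1.0 + np.exp(-(acts[-1] @ W + c))))
    return acts

def finetune_step(weights, biases, x, target, lr=0.001):
    """One BP step over the whole stack; MSE against MFCC targets is assumed."""
    acts = forward_acts(weights, biases, x)
    delta = (acts[-1] - target) * acts[-1] * (1.0 - acts[-1])  # output-layer error
    for m in reversed(range(len(weights))):
        grad_W = acts[m].T @ delta / len(x)
        grad_c = delta.mean(axis=0)
        delta = (delta @ weights[m].T) * acts[m] * (1.0 - acts[m])  # backprop error
        weights[m] -= lr * grad_W
        biases[m] -= lr * grad_c
```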
6. Input MOAS to this deep feature extractor to extract the deep features.
The method is described more concretely below, taking as an example movie original audio sample points obtained by framing and windowing (frame length 32 ms, frame shift 16 ms, Hamming window).
A1. Assume the sampling frequency is 16 kHz, so each frame contains 512 sample points. Let the resulting sample-point vectors be S, divided into three parts S1, S2, and S3, where S1 is used for pre-training, S2 for fine-tuning, and S3 for extracting deep features (a framing sketch follows).
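A minimal framing/windowing sketch under the stated parameters (16 kHz sampling, 32 ms frames, 16 ms shift, Hamming window); at 16 kHz a 32 ms frame is exactly the 512 sample points mentioned above. Function and variable names are ours.

```python
import numpy as np

def frame_signal(signal, fs=16000, frame_ms=32, shift_ms=16):
    """Split raw audio into Hamming-windowed frames (the MOAS vectors S)."""
    frame_len = fs * frame_ms // 1000   # 512 samples per frame at 16 kHz
    shift = fs * shift_ms // 1000       # 256-sample frame shift
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    return np.stack([signal[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])  # shape: (n_frames, 512)
```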
A2. For each frame of S1 and S2, extract Mel-frequency cepstral coefficient features; denote the extracted features M01 and M02 respectively. With S1 as the input of the first RBM and M01 as its output, train this RBM. After the first RBM has been trained, the nonlinear feature transformation of the first RBM maps S1 to M1.
A3. On the basis of the first RBM, build the second RBM, with M1 as its input and M01 as its output, and train this RBM. After the second RBM has been trained, the nonlinear feature transformation of the second RBM maps M1 to M2.
A4. By the same method, train a deep feature extractor composed of n RBM layers, assuming the layer-to-layer transformation is

df′_{m+1} = σ(df′_m), 1 ≤ m ≤ n,

where df′_{m+1} and df′_m denote the deep features of layers m+1 and m respectively, and σ denotes the sigmoid function.
A5. Use S2 and M02 to fine-tune this deep feature extractor, with S2 as the input of the extractor and M02 as its output. After fine-tuning, new nonlinear feature-transformation formulas between the layers are obtained, assumed to be

df_{m+1} = σ(df_m), 1 ≤ m ≤ n,

where df_{m+1} and df_m denote the deep features of layers m+1 and m respectively, and σ denotes the sigmoid function.
A6. With S3 as the input of this deep feature extractor, apply the trained layer-to-layer nonlinear feature transformations to obtain the deep features corresponding to S3 (an end-to-end sketch follows).
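Putting the sketches above together, a hedged end-to-end illustration of steps A1-A6 on stand-in data (random noise in place of real movie audio, random vectors in place of the MFCC targets M02; the 512-39-39 layer sizes follow the preferred embodiment, while epoch counts and split sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
raw_audio = rng.standard_normal(16000 * 10)               # 10 s of stand-in "audio"
frames = frame_signal(raw_audio)                          # MOAS frames, (624, 512)
s1, s2, s3 = frames[:300], frames[300:500], frames[500:]  # A1: pre-train/fine-tune/extract
m02 = rng.random((len(s2), 39))                           # stand-in 39-dim MFCC targets

# A2-A4: greedy layer-wise pre-training; each RBM's hidden layer feeds the next.
sizes = [512, 39, 39]
weights = [rng.normal(0.0, 0.01, (a, b)) for a, b in zip(sizes, sizes[1:])]
biases_h = [np.zeros(b) for b in sizes[1:]]
biases_v = [np.zeros(a) for a in sizes[:-1]]
x = s1
for m in range(len(weights)):
    for _ in range(10):                                   # epoch count arbitrary here
        cd1_update(weights[m], biases_v[m], biases_h[m], x)
    x = 1.0 / (1.0 + np.exp(-(x @ weights[m] + biases_h[m])))

# A5: BP fine-tuning against the MFCC targets.
for _ in range(10):
    finetune_step(weights, biases_h, s2, m02)

# A6: extract the deep features corresponding to S3.
deep_features = extract_deep_features(weights, biases_h, s3)  # shape (124, 39)
```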
Obviously, the above embodiment is merely an example given to illustrate the present invention clearly and is not a limitation on its embodiments. On this basis, those of ordinary skill in the art can make changes of other forms, which need not and cannot be exhaustively listed here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (7)
1. A MOAS-based deep feature extraction method, characterized in that MOAS is used as input: first an RBM is built and trained; then, by the same method, multiple RBMs are built to obtain a deep feature extractor; finally, MOAS is used as the input of this deep feature extractor to extract its deep features.
2. The MOAS-based deep feature extraction method according to claim 1, characterized by comprising the following steps:
S1. building the first RBM, a two-layer neural network model consisting of a visible layer and a hidden layer;
S2. using MOAS as the input of this RBM, training the RBM so that the likelihood of the visible layer is maximized;
S3. on the basis of the RBM trained in step S2, adding another hidden layer, taking the hidden layer of the first RBM as the visible layer of a second RBM, building the second RBM, and training it;
S4. using the same method, building a deep feature extractor composed of n layers of RBMs;
S5. fine-tuning the deep feature extractor obtained in step S4 to obtain the final deep feature extractor;
S6. using the deep feature extractor trained in step S5, taking MOAS as input, and extracting the deep features corresponding to MOAS.
3. The MOAS-based deep feature extraction method according to claim 2, characterized in that the visible layer and hidden layer of each RBM are fully connected to each other, and there are no connections within a layer.
4. The MOAS-based deep feature extraction method according to claim 2, characterized in that the number of nodes in the visible layer of the first RBM is set to 512 and the number of nodes in its hidden layer is set to 39.
5. The MOAS-based deep feature extraction method according to claim 2, characterized in that the number of nodes in the visible layer of the second RBM is set to 39 and the number of nodes in its hidden layer is set to 39.
6. The MOAS-based deep feature extraction method according to claim 2, characterized in that step S5 uses back-propagation to fine-tune the weights between the layers of the deep feature extractor, finally obtaining a deep feature extractor in which the weights of every layer are suitable.
7. The MOAS-based deep feature extraction method according to claim 2, characterized in that the layer-to-layer transformation of the deep feature extractor composed of n RBM layers is
df′_{m+1} = σ(df′_m), 1 ≤ m ≤ n,
where df′_{m+1} and df′_m denote the deep features of layers m+1 and m respectively, and σ denotes the sigmoid function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201610333538.3A (CN106024011A) | 2016-05-19 | 2016-05-19 | MOAS based deep layer feature extracting method
Publications (1)
Publication Number | Publication Date
---|---
CN106024011A | 2016-10-12
Family
ID=57098744
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201610333538.3A (pending) | MOAS based deep layer feature extracting method | 2016-05-19 | 2016-05-19
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106024011A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013149123A1 (en) * | 2012-03-30 | 2013-10-03 | The Ohio State University | Monaural speech filter |
CN103971690A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Voiceprint recognition method and device |
CN104732978A (en) * | 2015-03-12 | 2015-06-24 | 上海交通大学 | Text-dependent speaker recognition method based on joint deep learning |
CN104731913A (en) * | 2015-03-23 | 2015-06-24 | 华南理工大学 | GLR-based homologous audio advertisement retrieving method |
Non-Patent Citations (1)
Title |
---|
Ji-Chen Yang et al.: "Audio event change detection and clustering in movies", Journal of Multimedia *
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |