CN106024011A - MOAS based deep layer feature extracting method - Google Patents

MOAS based deep layer feature extracting method

Info

Publication number
CN106024011A
Authority
CN
China
Prior art keywords
rbm
deep-layer feature
moas
layer
extracting method
Prior art date
2016-05-19
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610333538.3A
Other languages
Chinese (zh)
Inventor
杨继臣
刘磊安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongkai University of Agriculture and Engineering
Original Assignee
Zhongkai University of Agriculture and Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2016-05-19
Publication date
2016-10-12
Application filed by Zhongkai University of Agriculture and Engineering filed Critical Zhongkai University of Agriculture and Engineering
Priority to CN201610333538.3A priority Critical patent/CN106024011A/en
Publication of CN106024011A publication Critical patent/CN106024011A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a deep-layer feature extraction method, and more specifically to a deep-layer feature extraction method that uses a MOAS (Movie Origin Audio Sample) as the input. The method includes the steps of: 1, constructing an RBM (Restricted Boltzmann Machine); 2, training the RBM; 3, constructing a deep-layer feature extractor; 4, taking the MOAS as the input of the deep-layer feature extractor and extracting deep-layer features. By taking the MOAS as the input for deep-layer feature extraction, the invention reduces the number of layers that must be trained and extracts more useful information than methods that use shallow-layer features as the input.

Description

A deep-layer feature extraction method based on MOAS
Technical field
The present invention relates to methods for extracting deep-layer features, and more particularly to a method that uses MOAS (Movie Origin Audio Sample, the original audio sample points of a film) as the input for extracting deep-layer features.
Background technology
With the development of Internet technology, movie data on the network has grown explosively, and online movie resources have become enormous. Because movies are easy to obtain, they have a huge audience, and the main problem currently facing movie signal processing is how to analyze, index and manage this vast amount of movie data so that people can quickly retrieve the information they want. Content analysis and understanding of movies has therefore become increasingly urgent. Audio is an important source of information for understanding multimedia content (Ghoraani, 2011), and audio is also an important modality in film: both in quantity and in content it plays an important role. In recent years audio information has been used more and more in movie content analysis and understanding (Wang, 2006; Benini, 2013).
In research on movie audio content analysis and understanding, feature extraction is a very important problem. Only when features are extracted well can movie audio signals be classified and semantic reasoning about movie audio scenes be studied; the quality of the extracted features directly affects the accuracy of movie audio signal classification and of movie audio scene semantic reasoning. Conversely, the accuracy of movie audio signal classification and of movie audio scene semantic reasoning can also be used to evaluate the performance of the features.
In previous research on movie audio signals, the features used have typically been hand-crafted shallow-layer features, such as Mel-frequency cepstral coefficients (MFCC) and time-frequency features (Austin, 2010; Li, 2014). Shallow-layer features only transform the original input signal into a particular space and therefore cannot effectively characterize the signal, so movie audio signal processing does not reach the level people would like. In contrast, the deep-layer features obtained by learning with a deep neural network (DNN) (Hinton, 2006) not only eliminate the cumbersome and complicated process of hand-crafting features, but can also extract features that cannot be constructed by hand (Seide, 2011). Because a DNN can learn more useful features, it ultimately improves the accuracy of classification or prediction (Yu Kai, 2013).
In recent years, deep-layer features have been widely used in speech recognition (Mohamed, 2011; Bao, 2013). These deep-layer features are typically obtained by learning from MFCC features with a DNN, i.e. the MFCCs are used as the input of the DNN. However, for deep-layer features obtained by training on MFCCs, useless information must be removed while useful information is retained, so the first few layers are usually not very effective, and a deep network is generally required before the effect becomes good. If the MOAS is used directly as the input of the DNN, the DNN can extract effective deep-layer features directly from the MOAS, which saves training layers. In addition, the MFCC extraction process discards some useful information in the MOAS, and this lost information is difficult to recover when the DNN subsequently learns from the MFCCs; if the MOAS is used directly as the input of the DNN, this loss does not occur. Therefore, if the MOAS is used directly as the input of the DNN, extracting deep-layer features not only requires fewer layers than using MFCCs as the input of the DNN, but should also extract more useful information.
Summary of the invention
The present invention addresses the defects of current movie audio deep-layer feature extraction and provides a deep-layer feature extraction method based on MOAS.
To solve the above technical problem, the technical scheme of the present invention is as follows:
A deep-layer feature extraction method based on MOAS takes the MOAS as input. First an RBM (Restricted Boltzmann Machine) is built; next this RBM is trained; the same method is then used to build multiple RBMs, finally yielding a deep-layer feature extractor; finally the MOAS is taken as the input of this deep-layer feature extractor to obtain deep-layer features.
The above deep-layer feature extraction method based on MOAS specifically includes the following steps:
S1, build the first RBM, a 2-layer neural network model composed of a visible layer and a hidden layer;
S2, take the MOAS as the input of this RBM and train the RBM so that the likelihood of the visible layer reaches a maximum;
S3, on the basis of the RBM trained in step S2, add another hidden layer, taking the hidden layer of the first RBM as the visible layer of the second RBM; build the second RBM and train it;
S4, use the same method to build a deep-layer feature extractor composed of n layers of RBMs;
S5, fine-tune the deep-layer feature extractor obtained in step S4 to obtain the final deep-layer feature extractor;
S6, use the deep-layer feature extractor trained in step S5, take the MOAS as input, and extract the deep-layer features corresponding to the MOAS.
In the above deep-layer feature extraction method based on MOAS, the visible layer and the hidden layer of each RBM are fully connected to each other, and there are no connections within a layer.
In the above deep-layer feature extraction method based on MOAS, the number of nodes of the visible layer of the first RBM is set to 512 and the number of nodes of its hidden layer is set to 39.
In the above deep-layer feature extraction method based on MOAS, the number of nodes of the visible layer of the second RBM is set to 39 and the number of nodes of its hidden layer is set to 39.
In the above deep-layer feature extraction method based on MOAS, step S5 uses back-propagation (BP) to fine-tune the weights between the layers of the deep-layer feature extractor, finally yielding a deep-layer feature extractor in which the weights of every layer are suitable.
In the above deep-layer feature extraction method based on MOAS, the layer-to-layer transformation relation of the deep-layer feature extractor composed of n layers of RBMs is
df'_{m+1} = σ(df'_m), 1 ≤ m ≤ n
where df'_{m+1} and df'_m denote the deep-layer features of layers m+1 and m respectively, and σ denotes the sigmoid function.
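As an illustration only (it is not part of the claimed method), the following Python/NumPy sketch shows how a stacked-RBM deep-layer feature extractor with the node counts given above (a 512-node visible layer followed by 39-node hidden layers) might be assembled and applied in a forward pass. The class and function names are hypothetical, and since the patent writes the layer transformation simply as σ(df'_m), the sketch assumes the usual weighted form σ(W·df'_m + c), which is consistent with the weight fine-tuning of step S5.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBMLayer:
    # One RBM which, once trained, acts as a sigmoid transformation layer.
    def __init__(self, n_visible, n_hidden, rng=np.random):
        self.W = 0.01 * rng.randn(n_visible, n_hidden)  # weight matrix W
        self.b = np.zeros(n_visible)                     # visible-layer offsets b
        self.c = np.zeros(n_hidden)                      # hidden-layer offsets c

    def forward(self, v):
        # The hidden activation becomes the "visible" input of the next layer.
        return sigmoid(v @ self.W + self.c)

def build_extractor(n_layers):
    # Stack n RBM layers with the node counts from the description: 512 -> 39 -> 39 -> ...
    sizes = [512] + [39] * n_layers
    return [RBMLayer(sizes[i], sizes[i + 1]) for i in range(n_layers)]

def extract_deep_features(extractor, moas_frame):
    # Forward one 512-sample MOAS frame through every layer; the top output is the deep-layer feature.
    df = moas_frame
    for layer in extractor:
        df = layer.forward(df)
    return df

Each RBMLayer stores the parameters W, b and c that appear in the joint probability given in the detailed description below.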
Compared with the prior art, the technical scheme of the present invention has the following benefits:
(1) The features extracted by the MOAS-based deep-layer feature extraction method of the present invention are deep-layer features, which not only eliminate the complicated and tedious process of hand-crafting features, but can also capture what hand-crafted features cannot.
(2) The present invention takes the MOAS as the input of the deep-layer feature extractor. Compared with using shallow-layer features such as MFCC as the input, this not only reduces the number of layers to be trained, but also avoids losing useful information during MFCC extraction; that is, taking the MOAS as input extracts more useful information than taking shallow-layer features as input.
Brief description of the drawings
Fig. 1 is a flow chart of MOAS-based deep-layer feature extraction;
Fig. 2 is a schematic diagram of the construction of the first RBM;
Fig. 3 is a schematic diagram of the construction of the second RBM;
Fig. 4 is a schematic diagram of the construction of the deep-layer feature extractor.
Detailed description of the invention
The present invention is further described below with reference to the accompanying drawings and specific embodiments, but the present invention is not limited in any form by the embodiments.
Fig. 1 shows the basic process of extracting deep-layer features based on film original audio sample points.
The MOAS-based deep-layer feature extraction of the present invention is realized as follows:
1. First, data must be prepared for training the deep-layer feature extractor. The prepared data are divided into two parts: pre-training data and fine-tuning data. The pre-training data are used to pre-train the deep-layer feature extractor to obtain a preliminary deep-layer feature extractor, and the fine-tuning data are used to fine-tune the deep-layer feature extractor thus obtained. For both parts of the data, raw sample point data and Mel-frequency cepstral coefficients are extracted.
2. Build and train the first RBM. Fig. 2 shows the construction of the first RBM, a 2-layer neural network model composed of a visible layer and a hidden layer, where the visible layer and the hidden layer are fully connected to each other and there are no connections within a layer. Let v and h denote the visible layer and the hidden layer respectively; then a joint probability can be assigned to the RBM as follows:
p(v, h) = (1/Z) · exp(b^T v + c^T h + v^T W h)
where Z denotes the normalizing constant (partition function), W denotes the weight matrix, b and c denote the offsets of the visible layer and the hidden layer respectively, and T denotes transposition.
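For illustration only, a minimal sketch of one contrastive-divergence (CD-1) update for a single RBM follows, reusing the sigmoid helper and the W, b, c parameters from the earlier sketch. The patent states only that the RBM is trained so that the likelihood of the visible layer reaches a maximum; CD-1, the learning rate and the binary hidden sampling are assumptions, and for real-valued MOAS inputs a Gaussian visible layer would normally replace the sigmoid reconstruction used here for brevity.

def cd1_update(W, b, c, v0, lr=0.01, rng=np.random):
    # One CD-1 step on a batch of visible vectors v0 (shape: batch x n_visible).
    # Positive phase: hidden probabilities and a binary sample given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.rand(*ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Approximate likelihood gradients and parameter updates.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c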
3. On the basis of the first RBM, build and train the second RBM. Fig. 3 shows the construction of the second RBM. It takes the hidden layer of the first RBM as its visible layer and, unlike the first RBM, its visible layer and hidden layer have the same number of nodes. This RBM is trained with the method described above.
4. Use the same method to build a deep-layer feature extractor composed of n layers of RBMs. Fig. 4 shows the structure of this deep-layer feature extractor.
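Continuing the illustrative sketch, greedy layer-wise pre-training of the n-layer stack could look like the following, where cd1_update is the hypothetical helper above; as in step 3, each trained layer's hidden activations serve as the visible data of the next RBM.

def pretrain_stack(layers, pretrain_frames, epochs=10):
    # Train layer k with CD-1, then feed its hidden activations upward
    # as the training data for layer k+1 (greedy layer-wise pre-training).
    v = pretrain_frames
    for layer in layers:
        for _ in range(epochs):
            layer.W, layer.b, layer.c = cd1_update(layer.W, layer.b, layer.c, v)
        v = sigmoid(v @ layer.W + layer.c)
    return layers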
5. Fine-tune the deep-layer feature extractor obtained by the pre-training above, using the fine-tuning data. The fine-tuning method is to use back-propagation (BP) to adjust the weights between the layers of the deep-layer feature extractor, finally yielding a deep-layer feature extractor in which the weights of every layer are suitable.
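A minimal sketch of this BP fine-tuning stage is shown below, assuming, as in steps A2 and A5 of the worked example that follows, that the MFCC vectors of the fine-tuning data act as regression targets for the top layer; the squared-error loss and the hand-written gradients are assumptions, not part of the patent.

def finetune_bp(layers, frames, mfcc_targets, lr=0.001, epochs=50):
    # Back-propagation fine-tuning of the stacked layers against MFCC targets.
    for _ in range(epochs):
        # Forward pass, keeping every layer's activation.
        acts = [frames]
        for layer in layers:
            acts.append(sigmoid(acts[-1] @ layer.W + layer.c))
        # Error signal at the top layer (squared-error loss, sigmoid derivative).
        delta = (acts[-1] - mfcc_targets) * acts[-1] * (1.0 - acts[-1])
        # Backward pass: update each layer's weights and hidden offsets.
        for k in range(len(layers) - 1, -1, -1):
            grad_W = acts[k].T @ delta / frames.shape[0]
            grad_c = delta.mean(axis=0)
            if k > 0:
                delta = (delta @ layers[k].W.T) * acts[k] * (1.0 - acts[k])
            layers[k].W -= lr * grad_W
            layers[k].c -= lr * grad_c
    return layers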
6. Input the MOAS into this deep-layer feature extractor to extract the deep-layer features.
The method is described more concretely below, taking as an example film original audio sample points that have been divided into frames and windowed (frame length 32 ms, frame shift 16 ms, Hamming window).
A1. Assume the sampling frequency is 16 kHz, so each frame contains 512 sample points. Let S denote the resulting sample-point vectors; S is divided into three parts, S1, S2 and S3, where S1 is used for pre-training, S2 for fine-tuning, and S3 for extracting deep-layer features.
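For illustration, the framing and windowing described above (16 kHz sampling, 32 ms frames, 16 ms shift, Hamming window, 512 samples per frame) could be carried out as sketched below; the function name frame_moas is hypothetical.

import numpy as np

def frame_moas(audio, fs=16000, frame_ms=32, shift_ms=16):
    # Split the raw sample points into Hamming-windowed frames:
    # 512 samples per frame with a 256-sample shift at 16 kHz.
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // shift
    return np.stack([audio[i * shift:i * shift + frame_len] * window
                     for i in range(n_frames)])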
A2. Extract Mel-frequency cepstral coefficient features for every frame of S1 and S2, and assume the extracted features are M01 and M02 respectively. S1 is taken as the input of the first RBM and M01 as the output of the first RBM, and this RBM is trained. After the first RBM has been trained, assume that, through the nonlinear feature transformation of the first RBM, S1 is transformed into M1.
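The patent does not state how the MFCC features M01 and M02 are computed; one common choice consistent with the 39-node hidden layers is 13 static coefficients plus their first and second derivatives, as sketched below with the librosa library (the function name mfcc39 and the frame parameters are assumptions).

import numpy as np
import librosa

def mfcc39(audio, fs=16000):
    # 39-dimensional MFCC per frame: 13 static coefficients + delta + delta-delta (composition assumed).
    m = librosa.feature.mfcc(y=audio, sr=fs, n_mfcc=13, n_fft=512, hop_length=256)
    d1 = librosa.feature.delta(m)
    d2 = librosa.feature.delta(m, order=2)
    return np.vstack([m, d1, d2]).T   # shape: (n_frames, 39)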
A3. On the basis of the first RBM, build the second RBM, where M1 is taken as the input of the second RBM and M01 as the output of the second RBM, and train this RBM. After the second RBM has been trained, assume that, through the nonlinear feature transformation of the second RBM, M1 is transformed into M2.
A4. By the same method, train a deep-layer feature extractor composed of n layers of RBMs, and assume the layer-to-layer transformation relation is
df'_{m+1} = σ(df'_m), 1 ≤ m ≤ n
where df'_{m+1} and df'_m denote the deep-layer features of layers m+1 and m respectively, and σ denotes the sigmoid function.
A5. Use S2 and M02 to fine-tune this deep-layer feature extractor, where S2 is taken as the input of the deep-layer feature extractor and M02 as its output. After fine-tuning, a new nonlinear feature transformation formula between the layers is obtained; assume it is
df_{m+1} = σ(df_m), 1 ≤ m ≤ n
where df_{m+1} and df_m denote the deep-layer features of layers m+1 and m respectively, and σ denotes the sigmoid function.
A6. Take S3 as the input of this deep-layer feature extractor and apply the trained layer-to-layer nonlinear feature transformation formulas obtained above to obtain the deep-layer features corresponding to S3.
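Tying the hypothetical helpers from the earlier sketches together, one possible reading of steps A1 to A6 is the end-to-end flow below; every function name was introduced above for illustration, the split proportions are arbitrary, and none of it is mandated by the patent.

raw_audio = np.random.randn(16000 * 60)              # placeholder: one minute of audio samples
frames = frame_moas(raw_audio)                        # A1: frame and window the MOAS
targets = mfcc39(raw_audio)[:len(frames)]             # per-frame MFCC targets (alignment assumed)
i, j = int(0.6 * len(frames)), int(0.8 * len(frames))
S1, S2, S3 = frames[:i], frames[i:j], frames[j:]      # pre-training / fine-tuning / extraction splits
M02 = targets[i:j]                                    # MFCC targets of the fine-tuning frames

extractor = build_extractor(n_layers=3)               # A2-A4: 512 -> 39 -> 39 -> 39 stack
extractor = pretrain_stack(extractor, S1)              # greedy layer-wise pre-training on S1
extractor = finetune_bp(extractor, S2, M02)            # A5: BP fine-tuning against M02
deep_features = np.stack([extract_deep_features(extractor, f) for f in S3])   # A6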
Obviously, the above embodiment of the present invention is merely an example given to clearly illustrate the present invention and is not a limitation on the embodiments of the present invention. For those of ordinary skill in the art, other changes of different forms can also be made on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement and improvement made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.

Claims (7)

1. A deep-layer feature extraction method based on MOAS, characterized in that the MOAS is taken as input; first an RBM is built and trained; then, with the same method, multiple RBMs are built to obtain a deep-layer feature extractor; finally the MOAS is taken as the input of this deep-layer feature extractor and its deep-layer features are extracted.
2. The deep-layer feature extraction method based on MOAS according to claim 1, characterized by comprising the following steps:
S1, build the first RBM, a 2-layer neural network model composed of a visible layer and a hidden layer;
S2, take the MOAS as the input of this RBM and train the RBM so that the likelihood of the visible layer reaches a maximum;
S3, on the basis of the RBM trained in step S2, add another hidden layer, taking the hidden layer of the first RBM as the visible layer of the second RBM; build the second RBM and train it;
S4, use the same method to build a deep-layer feature extractor composed of n layers of RBMs;
S5, fine-tune the deep-layer feature extractor obtained in step S4 to obtain the final deep-layer feature extractor;
S6, use the deep-layer feature extractor trained in step S5, take the MOAS as input, and extract the deep-layer features corresponding to the MOAS.
3. The deep-layer feature extraction method based on MOAS according to claim 2, characterized in that the visible layer and the hidden layer of each RBM are fully connected to each other, with no connections within a layer.
4. The deep-layer feature extraction method based on MOAS according to claim 2, characterized in that the number of nodes of the visible layer of the first RBM is set to 512 and the number of nodes of its hidden layer is set to 39.
5. The deep-layer feature extraction method based on MOAS according to claim 2, characterized in that the number of nodes of the visible layer of the second RBM is set to 39 and the number of nodes of its hidden layer is set to 39.
6. The deep-layer feature extraction method based on MOAS according to claim 2, characterized in that step S5 uses back-propagation to fine-tune the weights between the layers of the deep-layer feature extractor, finally yielding a deep-layer feature extractor in which the weights of every layer are suitable.
7. The deep-layer feature extraction method based on MOAS according to claim 2, characterized in that the layer-to-layer transformation relation of the deep-layer feature extractor composed of n layers of RBMs is
df'_{m+1} = σ(df'_m), 1 ≤ m ≤ n
where df'_{m+1} and df'_m denote the deep-layer features of layers m+1 and m respectively, and σ denotes the sigmoid function.
CN201610333538.3A 2016-05-19 2016-05-19 MOAS based deep layer feature extracting method Pending CN106024011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610333538.3A CN106024011A (en) 2016-05-19 2016-05-19 MOAS based deep layer feature extracting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610333538.3A CN106024011A (en) 2016-05-19 2016-05-19 MOAS based deep layer feature extracting method

Publications (1)

Publication Number Publication Date
CN106024011A true CN106024011A (en) 2016-10-12

Family

ID=57098744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610333538.3A Pending CN106024011A (en) 2016-05-19 2016-05-19 MOAS based deep layer feature extracting method

Country Status (1)

Country Link
CN (1) CN106024011A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013149123A1 (en) * 2012-03-30 2013-10-03 The Ohio State University Monaural speech filter
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN104731913A (en) * 2015-03-23 2015-06-24 华南理工大学 GLR-based homologous audio advertisement retrieving method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013149123A1 (en) * 2012-03-30 2013-10-03 The Ohio State University Monaural speech filter
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN104731913A (en) * 2015-03-23 2015-06-24 华南理工大学 GLR-based homologous audio advertisement retrieving method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JI-CHEN YANG et al.: "Audio event change detection and clustering in movies", Journal of Multimedia *

Similar Documents

Publication Publication Date Title
CN104732978B (en) The relevant method for distinguishing speek person of text based on combined depth study
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
Hwang et al. Ensemble of deep neural networks using acoustic environment classification for statistical model-based voice activity detection
CN108986798B (en) Processing method, device and the equipment of voice data
Ando et al. Customer satisfaction estimation in contact center calls based on a hierarchical multi-task model
CN106683666A (en) Field adaptive method based on deep neural network (DNN)
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
Geng Evaluation model of college english multimedia teaching effect based on deep convolutional neural networks
Haridas et al. A novel approach to improve the speech intelligibility using fractional delta-amplitude modulation spectrogram
Ribas et al. Wiener filter and deep neural networks: A well-balanced pair for speech enhancement
Yechuri et al. A nested U-net with efficient channel attention and D3Net for speech enhancement
Biswas et al. Admissible wavelet packet sub‐band based harmonic energy features using ANOVA fusion techniques for Hindi phoneme recognition
Tashakori et al. Designing the Intelligent System Detecting a Sense of Wonder in English Speech Signal Using Fuzzy-Nervous Inference-Adaptive system (ANFIS)
Yang Design of service robot based on user emotion recognition and environmental monitoring
Sangeetha et al. Analysis of machine learning algorithms for audio event classification using Mel-frequency cepstral coefficients
CN106024011A (en) MOAS based deep layer feature extracting method
Shareef et al. Comparison between features extraction techniques for impairments arabic speech
Alam et al. Radon transform of auditory neurograms: a robust feature set for phoneme classification
Bansod et al. Speaker Recognition using Marathi (Varhadi) Language
Muni et al. Deep learning techniques for speech emotion recognition
Wei et al. Speech emotion recognition with hybrid neural network
Satla et al. Dialect Identification in Telugu Language Speech Utterance Using Modified Features with Deep Neural Network.
Mehra et al. ERIL: An Algorithm for Emotion Recognition from Indian Languages Using Machine Learning
Soni et al. Comparing front-end enhancement techniques and multiconditioned training for robust automatic speech recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination