CN109584888A - Whistle recognition method based on machine learning - Google Patents

Whistle recognition method based on machine learning

Info

Publication number
CN109584888A
Authority
CN
China
Prior art keywords
whistle
data
classifier
training
method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910038606.7A
Other languages
Chinese (zh)
Inventor
乔天昊
徐树公
张舜卿
曹姗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN201910038606.7A
Publication of CN109584888A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04 Training, enrolment or model building
    • G10L 17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of analysis window

Abstract

A whistle recognition method based on machine learning: sample data is generated by mixing a public dataset with whistle data; a classifier is trained on the MFCC features extracted from the samples; and in the online stage the trained classifier is used to classify test data, thereby realizing whistle recognition, wherein: the classifier is implemented with the open-source lightweight gradient boosting framework; the sample data is obtained by mixing the non-whistle data in the public ESC-50 dataset with whistle data. Compared with the prior art, the present invention takes less time and recognizes more accurately.

Description

Whistle recognition method based on machine learning
Technical field
The present invention relates to a technique in the field of ambient sound recognition, and specifically to a whistle recognition method based on machine learning.
Background technique
As a component of ambient sound, vehicle horn (whistle) sound is naturally one subject of ambient sound research. In the illegal-honking capture systems that have emerged in the network era, whistle sounds must be recognized quickly and accurately so that cameras can capture the offending vehicle in time, while interference from other environmental sounds must be excluded to prevent misjudgment. The requirements on whistle recognition are therefore very high.
An existing whistle recognition method under complex noise first acquires an original training sample library with microphones and selects a training sample set, then trains an HMM model to obtain a model library, and finally classifies test samples with the model to obtain the final recognition result. This approach obtains a high-quality training dataset with little manual annotation, which mitigates the difficulty of selecting training samples caused by the complexity of vehicle sounds and thereby improves recognition accuracy. However, such techniques generally use samples from non-public datasets, their algorithms are complex, and the required time is long.
Summary of the invention
In view of the above shortcomings of the prior art, the present invention proposes a whistle recognition method based on machine learning. Using a machine learning algorithm (lightGBM), it recognizes whistles more accurately than the prior art and resolves the cases in which existing methods misjudge non-whistle data.
The present invention is achieved by the following technical solutions:
The present invention generates sample data by mixing a public dataset with whistle data, trains a classifier on the MFCC features extracted from the samples, and in the online stage uses the trained classifier to classify test data, thereby realizing whistle recognition.
The classifier is implemented with the open-source lightweight gradient boosting framework (LightGBM).
The sample data is obtained by mixing the non-whistle data in the public ESC-50 dataset with whistle data, and comprises 11,636 whistle samples and 6,359 non-whistle samples.
The extraction refers to: a Fourier transform is applied to the sample data, followed by a logarithm operation and an inverse Fourier transform, yielding the mel cepstrum, i.e. the envelope of the sound spectrum. This envelope contains the formant information of the sound and also corresponds to the low-frequency part of the signal, which is key information for distinguishing sounds; cepstral analysis is therefore highly significant.
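Stated as a formula for reference, this is the standard (real) cepstrum construction applied to each windowed frame $x[n]$:

$$c[n] = \mathcal{F}^{-1}\bigl\{\, \log \lvert \mathcal{F}\{x[n]\} \rvert \,\bigr\}$$

where $\mathcal{F}$ denotes the Fourier transform; taking the magnitude spectrum, its logarithm, and transforming back isolates the slowly varying spectral envelope.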
The extraction specifically includes: framing, windowing, discrete Fourier transform, mel-frequency conversion, log nonlinear transform, and discrete cosine transform, where the sample rate is 22050 Hz and the number of MFCCs returned is 20.
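As a minimal sketch (not the patent's own code), this pipeline can be reproduced with librosa, whose MFCC implementation chains exactly these steps; the file name and the frame-averaging pooling are assumptions:

    import librosa
    import numpy as np

    # Load and resample to the 22050 Hz rate used in the embodiment.
    y, sr = librosa.load("whistle_sample.wav", sr=22050)

    # Framing, windowing, DFT, mel conversion, log and DCT happen inside;
    # 20 coefficients are returned per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

    # One fixed-length vector per clip, e.g. by averaging over frames
    # (the patent does not state the pooling step).
    feature = np.mean(mfcc, axis=1)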
The training refers to: based on a decision-tree algorithm, recognition is first performed with regression trees; trees are then added iteratively so that each new tree focuses on the misclassifications of the previous ensemble of trees; the predictions of the multiple trees are combined to optimize the objective function, and the parameters of the added trees are adjusted by gradient descent.
The objective function is logloss (logarithmic loss).
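For a binary label $y_i \in \{0, 1\}$ and predicted whistle probability $p_i$ over $N$ samples, logloss takes the standard form:

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]$$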
Classifying the test data refers to: the MFCC features of the test data are input into the trained classifier to obtain a prediction; when the prediction is greater than 0.5 the sample is judged to be a whistle, otherwise non-whistle.
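A sketch of this online-stage decision rule, assuming the scikit-learn style interface of the trained classifier clf and a 20-dimensional MFCC feature vector:

    import numpy as np

    def classify(clf, mfcc_feature: np.ndarray) -> str:
        # Probability of the positive (whistle) class for a single sample.
        prob = clf.predict_proba(mfcc_feature.reshape(1, -1))[0, 1]
        return "whistle" if prob > 0.5 else "non-whistle"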
Technical effect
Compared with the prior art, the present invention uses the ambient sounds in a public dataset as non-whistle data to supplement the ambient-sound dataset recorded in real street environments, so that the whistle and non-whistle data are balanced in distribution; this in turn improves the recognition rate on non-whistle data and achieves high performance. Compared with the prior art, the lightGBM-based method of the present invention obtains better recognition results with shorter training time.
Detailed description of the invention
Fig. 1 is the flow chart of the whistle recognition of the present invention based on machine learning;
Fig. 2 is a schematic diagram of the relation between mel frequency and hertz.
Specific embodiment
The present embodiment specifically includes the following steps:
Step 1. Road-noise-like data is selected from ESC-50 and added to the experimental dataset, forming a whistle dataset with a relatively balanced distribution. The dataset is then divided into five folds for five-fold cross-validation, as sketched below.
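For reference, a minimal sketch of this split, assuming scikit-learn's StratifiedKFold and placeholder arrays sized to the 11,636 whistle and 6,359 non-whistle samples described above (the real features come from step 2):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.random.rand(11636 + 6359, 20)                  # placeholder MFCC features
    y = np.concatenate([np.ones(11636), np.zeros(6359)])  # 1 = whistle, 0 = non-whistle

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    folds = list(skf.split(X, y))  # five (train_idx, test_idx) pairs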
Step 2. The dataset is processed and MFCC features are extracted:
Step 2.1) Framing: to facilitate analysis, each segment of sound is divided into small fragments, each of which is a frame. Each frame covers a very short duration and can be regarded as stationary, which aids the subsequent analysis. In addition, to reduce the variation between consecutive frames, an overlap region is usually set between adjacent frames; its size depends on the circumstances.
Step 2.2) Windowing: each frame is multiplied by a window function; to eliminate signal discontinuities at the two ends of each frame, the values outside the window are set to 0.
Common window functions include the rectangular window, the Hamming window and the Hanning window. The present embodiment uses the Hamming window, because its weighting coefficients give smaller sidelobes; at the same time, the Hamming window has a smoothing effect, mitigating the sidelobe magnitude and spectral leakage after the Fourier transform.
Step 2.3) Fast Fourier transform (FFT): the time-domain representation of a signal is limited, containing only the amplitude information of the sound, while most of a signal's characteristics are hidden in its frequency-domain representation; the speech signal therefore needs to be transformed to the frequency domain for subsequent analysis.
A discrete Fourier transform is applied to each framed and windowed frame to obtain the amplitude distribution of that frame over frequency: the higher the energy, the more salient and the more important that region of the spectrogram.
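A minimal numpy sketch of steps 2.1-2.3 follows; the frame length and hop size are assumptions, since the embodiment leaves them open:

    import numpy as np

    def frame_spectra(y, frame_len=2048, hop=512):
        # Step 2.1: split the signal into overlapping frames.
        n = 1 + (len(y) - frame_len) // hop
        frames = np.stack([y[i * hop: i * hop + frame_len] for i in range(n)])
        # Step 2.2: apply a Hamming window (values outside the window are 0).
        frames = frames * np.hamming(frame_len)
        # Step 2.3: magnitude spectrum of each frame via the FFT.
        return np.abs(np.fft.rfft(frames, axis=1))

    spectra = frame_spectra(np.random.randn(22050))  # one second at 22050 Hz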
Step 2.4) Mel-frequency conversion:
The mel scale is a nonlinear frequency scale based on human auditory perception: equal steps on it are perceived as equal changes in pitch. Mel frequency and hertz are nonlinearly related: as frequency increases, mel frequency grows ever more slowly, so a mel interval of fixed length corresponds to a wider hertz range at high frequencies. Mel filters therefore need wider bandwidths in the high-frequency region.
As shown in Fig. 2, the compression is stronger in the high-frequency region. This not only matches the perceptual analysis of the mel scale above; the filter bank also compresses the frequency-domain amplitudes, so that each frequency band can be represented by a single mel-frequency value, yielding characteristic information of lower complexity and simplifying the features.
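The embodiment does not fix the exact mel-hertz formula; a commonly used mapping is

$$m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

under which 1000 Hz maps to roughly 1000 mel while 8000 Hz maps to only about 2840 mel, so equal mel intervals cover ever wider hertz ranges at high frequency.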
Step 2.5) Logarithmic nonlinear transform:
The low-frequency part of a sound often hides more information, and the human ear is also relatively sensitive to low frequencies; the main effect of the log transform is to enhance the low-frequency representation of the sound, strengthening the information hidden in the low-frequency part.
Step 2.6) Discrete cosine transform (DCT): a real-valued variant of the Fourier transform. Besides its general orthogonality, the basis vectors of the DCT matrix reflect characteristics of human perception of speech and images. The DCT has a strong energy-compaction property, so most of the energy of a sound or image concentrates in the low-frequency part after the transform; it therefore effectively performs a dimensionality reduction on each frame of sound data. By the definition of the cepstrum, an inverse Fourier transform followed by a low-pass filter would be needed to obtain the low-frequency information of the sound; the DCT yields the low-frequency information of the spectrum directly and thus replaces the inverse Fourier transform. Through the DCT we obtain a series of cepstral vectors describing the speech signal, and each vector is the MFCC feature vector of one frame.
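A sketch of this step, assuming scipy's DCT-II as used in typical MFCC implementations, with a placeholder 40-band log-mel output standing in for the result of steps 2.4-2.5:

    import numpy as np
    from scipy.fftpack import dct

    log_mel = np.log(np.random.rand(40) + 1e-10)        # placeholder log filterbank energies
    mfcc_vec = dct(log_mel, type=2, norm="ortho")[:20]  # keep the 20 low-order coefficients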
Step 3. Model training is performed with the extracted features and the model is saved; an SVM model is used for comparison in the present embodiment. The specific steps are: first standardize the data, then set the parameter configuration, where the parameters are: nthread=4; n_estimators=10000; learning_rate=0.02; num_leaves=32; colsample_bytree=0.9497036; subsample=0.8715623; max_depth=8; reg_alpha=0.04; reg_lambda=0.073; min_split_gain=0.0222415; min_child_weight=40; silent=True; verbose=-1. Training of the lightGBM model can then begin.
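A sketch of this configuration, assuming the scikit-learn style LGBMClassifier interface (the parameter names, e.g. silent, follow the older lightGBM versions in which they were accepted); X_train and y_train stand in for the standardized MFCC features and labels:

    import numpy as np
    from lightgbm import LGBMClassifier

    X_train = np.random.rand(1000, 20)       # placeholder standardized features
    y_train = np.random.randint(0, 2, 1000)  # placeholder labels

    clf = LGBMClassifier(
        nthread=4, n_estimators=10000, learning_rate=0.02, num_leaves=32,
        colsample_bytree=0.9497036, subsample=0.8715623, max_depth=8,
        reg_alpha=0.04, reg_lambda=0.073, min_split_gain=0.0222415,
        min_child_weight=40, silent=True, verbose=-1)
    clf.fit(X_train, y_train)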
The SVM model is trained as follows: first, grid search is used to find the optimal SVM parameters gamma and C; the data is then standardized, after which training can begin. The parameters used in training are gamma=0.001 and C=1000.
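A sketch of this baseline, assuming scikit-learn; the grid values are illustrative, while gamma=0.001 and C=1000 are the optima reported above:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = StandardScaler().fit_transform(np.random.rand(1000, 20))  # placeholder
    y = np.random.randint(0, 2, 1000)

    grid = GridSearchCV(SVC(), {"gamma": [1e-4, 1e-3, 1e-2],
                                "C": [10, 100, 1000]}, cv=5)
    grid.fit(X, y)                            # grid search for gamma and C
    svm = SVC(gamma=0.001, C=1000).fit(X, y)  # train with the reported optimum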
Step 4. Feature extraction is applied to the test dataset; this operation must be identical to the one applied to the training dataset. The test-set features are input into the trained model for testing, with the following results:
The five-fold cross-validation results of the SVM model are: 0.946, 0.947, 0.966, 0.949, 0.961.
The five-fold cross-validation results of the lightGBM model are: 0.968, 0.979, 0.979, 0.979, 0.978.
In summary, lightGBM outperforms SVM (its recognition accuracy is higher). Moreover, the training time of each fold of the SVM model (five folds in total) is about 1 hour 35 minutes, while training each fold of the lightGBM model takes only 25 minutes.
The specific implementation above may be locally adjusted in different ways by those skilled in the art without departing from the principle and purpose of the present invention. The protection scope of the present invention is defined by the claims and is not limited by the specific implementation above; each implementation within that scope is bound by the present invention.

Claims (7)

1. A whistle recognition method based on machine learning, characterized in that sample data is generated by mixing a public dataset with whistle data, a classifier is trained by extracting MFCC features from the samples, and in the online stage the trained classifier is used to classify test data, thereby realizing whistle recognition, wherein: the classifier is implemented with an open-source lightweight gradient boosting framework; the sample data is obtained by mixing the non-whistle data in the public ESC-50 dataset with whistle data.
2. The method according to claim 1, characterized in that the extraction refers to: a Fourier transform is applied to the sample data, followed by a logarithm operation and an inverse Fourier transform, yielding the mel cepstrum, i.e. the envelope information of the sound spectrum.
3. The method according to claim 1 or 2, characterized in that the extraction specifically includes: framing, windowing, discrete Fourier transform, mel-frequency conversion, log nonlinear transform and discrete cosine transform.
4. The method according to claim 1, characterized in that the training refers to: based on a decision-tree algorithm, recognition is first performed with regression trees; trees are then added iteratively so that each new tree focuses on the misclassifications of the previous ensemble of trees; the predictions of the multiple trees are combined to optimize the objective function, and the parameters of the added trees are adjusted by gradient descent.
5. The method according to claim 1, characterized in that classifying the test data refers to: the MFCC features of the test data are input into the trained classifier to obtain a prediction; when the prediction is greater than 0.5 the sample is judged to be a whistle, otherwise non-whistle.
6. The method according to claim 1, characterized in that the non-whistle data is the road-noise-like data selected from ESC-50.
7. The method according to claim 3, characterized in that the windowing is realized with a Hamming window, which provides smoothing and mitigates the sidelobe magnitude and spectral leakage after the Fourier transform; the values outside the window are set to 0.
CN201910038606.7A 2019-01-16 2019-01-16 Whistle recognition method based on machine learning Pending CN109584888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910038606.7A CN109584888A (en) 2019-01-16 2019-01-16 Whistle recognition method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910038606.7A CN109584888A (en) 2019-01-16 2019-01-16 Whistle recognition method based on machine learning

Publications (1)

Publication Number Publication Date
CN109584888A true CN109584888A (en) 2019-04-05

Family

ID=65915034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910038606.7A Pending CN109584888A (en) 2019-01-16 2019-01-16 Whistle recognition method based on machine learning

Country Status (1)

Country Link
CN (1) CN109584888A (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050027514A1 (en) * 2003-07-28 2005-02-03 Jian Zhang Method and apparatus for automatically recognizing audio data
CN101388688A (en) * 2008-11-05 2009-03-18 北京理工大学 Frequency scanning interference suspending method for direct sequence spread spectrum communication system
CN104916289A (en) * 2015-06-12 2015-09-16 哈尔滨工业大学 Quick acoustic event detection method under vehicle-driving noise environment
CN105447490A (en) * 2015-11-19 2016-03-30 浙江宇视科技有限公司 Vehicle key point detection method based on gradient regression tree and apparatus thereof
CN105845126A (en) * 2016-05-23 2016-08-10 渤海大学 Method for automatic English subtitle filling of English audio image data
CN106203368A (en) * 2016-07-18 2016-12-07 江苏科技大学 A kind of traffic video frequency vehicle recognition methods based on SRC and SVM assembled classifier
CN106373559A (en) * 2016-09-08 2017-02-01 河海大学 Robustness feature extraction method based on logarithmic spectrum noise-to-signal weighting
CN108182949A (en) * 2017-12-11 2018-06-19 华南理工大学 A kind of highway anomalous audio event category method based on depth conversion feature
CN108597505A (en) * 2018-04-20 2018-09-28 北京元心科技有限公司 Audio recognition method, device and terminal device
CN108614155A (en) * 2018-05-31 2018-10-02 许继集团有限公司 A kind of synchronous phasor measuring method and system that Hamming window is added
CN108898227A (en) * 2018-06-15 2018-11-27 成都四方伟业软件股份有限公司 Learning rate calculation method and device, disaggregated model calculation method and device
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
CN108922514A (en) * 2018-09-19 2018-11-30 河海大学 A kind of robust features extracting method based on low frequency logarithmic spectrum

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FENG Chending; LI Shaobo; YAO Yong; YANG Jing: "Environmental sound recognition algorithm based on improved convolutional neural network and dynamically decaying learning rate", Science Technology and Engineering *
LIU Haotian; JIANG Haiyan; SHU Xin; XU Yan; WU Yanlian; GUO Xiaoqing: "Multi-species bird sound recognition method based on feature transfer", Journal of Data Acquisition and Processing *
ZHANG Xiaoxia et al.: "Birdsong recognition in complex environments based on energy detection", Journal of Computer Applications *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047512A (en) * 2019-04-25 2019-07-23 广东工业大学 A kind of ambient sound classification method, system and relevant apparatus
US11551116B2 (en) * 2020-01-29 2023-01-10 Rohde & Schwarz Gmbh & Co. Kg Signal analysis method and signal analysis module
CN112906795A (en) * 2021-02-23 2021-06-04 江苏聆世科技有限公司 Whistle vehicle judgment method based on convolutional neural network
CN113205830A (en) * 2021-05-08 2021-08-03 南京师范大学 Automobile whistle recognition method based on subband spectral entropy method and PSO-GA-SVM

Similar Documents

Publication Publication Date Title
CN109584888A (en) Whistle recognition method based on machine learning
CN105023573B (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
WO2001016937A9 (en) System and method for classification of sound sources
EP1569200A1 (en) Identification of the presence of speech in digital audio data
CN109256138A (en) Auth method, terminal device and computer readable storage medium
Huang et al. Intelligent feature extraction and classification of anuran vocalizations
CN106997765B (en) Quantitative characterization method for human voice timbre
KR101888058B1 (en) The method and apparatus for identifying speaker based on spoken word
WO2017045429A1 (en) Audio data detection method and system and storage medium
CN110807585A (en) Student classroom learning state online evaluation method and system
CN113327626A (en) Voice noise reduction method, device, equipment and storage medium
Ramashini et al. Robust cepstral feature for bird sound classification
CN111402922B (en) Audio signal classification method, device, equipment and storage medium based on small samples
CN109997186B (en) Apparatus and method for classifying acoustic environments
Memon et al. Using information theoretic vector quantization for inverted MFCC based speaker verification
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
EP4177885A1 (en) Quantifying signal purity by means of machine learning
CN114302301B (en) Frequency response correction method and related product
Moinuddin et al. Speaker Identification based on GFCC using GMM
KR20190135916A (en) Apparatus and method for determining user stress using speech signal
CN112735477B (en) Voice emotion analysis method and device
Islam et al. Neural-Response-Based Text-Dependent speaker identification under noisy conditions
Hashemi et al. Persian music source separation in audio-visual data using deep learning
Silveira et al. Convolutive ICA-based forensic speaker identification using mel frequency cepstral coefficients and gaussian mixture models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20190405