CN107545890A

CN107545890A - A sound event recognition method

Info

Publication number: CN107545890A
Application number: CN201710776733.8A
Authority: CN
Inventors: 张文涛; 韩莹莹; 徐韶华; 黎恒
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2017-08-31
Filing date: 2017-08-31
Publication date: 2018-01-05

Abstract

The present invention relates to a sound event recognition method that mainly addresses the low recognition accuracy and poor robustness of prior-art sound recognition under strong interference. The method comprises the following steps: sound is collected and processed in an interference environment to form a digital audio signal; the digital audio signal is sub-band filtered by a filter bank to obtain a cochleagram of the audio signal; one part of the cochleagrams is used to train a convolutional neural network model, establishing a sound recognition template; the other part of the cochleagrams is fed into the convolutional neural network model to test the recognition accuracy. The method addresses the stated problem and can be used for sound event recognition in traffic environments.

Description

A sound event recognition method

Technical field

The invention belongs to the technical field of audio signal processing and relates in particular to a sound event recognition method for strong-interference environments.

Background art

In recent years, researchers have proposed many feature extraction methods and recognition systems for sound event recognition, all with some success. One such method collects a person's voice, applies an FFT to it, extracts the amplitude and frequency of each person's voice, and stores them; a new voice sample is processed in the same way and compared against the stored information to determine the person's identity. This method recognizes well in low-noise environments, but its recognition performance is generally poor under strong noise and strong interference.

Summary of the invention

The technical problem to be solved by the invention is the poor sound recognition performance of the prior art under strong noise and strong interference. The invention provides a new sound event recognition method characterized by high recognition accuracy and high robustness under strong noise and strong interference.

To solve the above technical problem, the following technical scheme is adopted:

A sound event recognition method, comprising the following steps:

A. Sound is collected and processed in an interference environment to form a digital audio signal; the collection comprises collecting sound with a sound level meter and a microphone array, and the processing comprises endpoint detection and filtering and denoising of the digital audio signal;

B. The digital audio signal is sub-band filtered by a filter bank to obtain the cochleagram of the audio signal;

C. One part of the cochleagrams is used to train a convolutional neural network model, establishing a sound event recognition template;

D. The other part of the cochleagrams is fed into the convolutional neural network model to test the accuracy of sound event recognition.

In the above scheme, as a further optimization, the extraction of the cochleagram comprises the following steps:

1) When the digital audio signal passes through the filter bank, the response of the output audio signal is expressed as follows:

$$G_m(i) = \left[\,\lvert g(i, m)\rvert\,\right]^{1/2}, \quad i = 0, 1, \ldots, N; \quad m = 0, 1, \ldots, M-1$$

where the values G_m(i) form a matrix representing how the frequency-domain distribution of the input audio signal changes, N is the number of channels of the audio signal, and M is the number of frames after sampling; this yields the original cochleagram;

2) The original cochleagram is compressed and cropped to the final cochleagram size, which serves as the input sample of the convolutional neural network.

Further, the sound event recognition template is established by the following steps:

I. The cochleagrams are used as learning samples, and each learning sample is given a class label; a portion of the learning samples that covers all classes is extracted as the training set, and the remainder serves as the test set;

II. The convolutional neural network model is built in software; the model comprises, in order, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a fully connected layer, and a classification output layer;

III. The learning samples of the training set are input into the convolutional neural network model for supervised learning, yielding the parameters of each layer of the trained model. During training, the convolution kernels and weights are randomly initialized from a probability distribution function, and the biases are initialized to all zeros; the weights and thresholds are adjusted with the standard gradient descent algorithm; the network is trained by repeatedly alternating forward propagation and backpropagation until the error of the cost function falls below the limit of 0.01, and the trained convolutional neural network model is saved;

IV. The convolutional neural network model is tested as follows: the test set samples are fed into the trained convolutional neural network model, the output of the model is compared with the sound class of each test set sample, and the recall, precision, and F-value of sound event recognition are calculated at each signal-to-noise ratio to evaluate the model.

Further, there are three fully connected layers in step II, and the classifier of the classification output layer is a softmax classifier.

Further, the training set in step I contains 3/4 of the learning samples.

Further, the filter bank is composed of multiple Gammatone filters.

Further, the sound events collected and processed in step A include one or more of vehicle collision sounds, vehicle horn sounds, cries for help, and car door closing sounds, recorded in a traffic environment under different noise conditions.

Further, endpoint detection is performed on the digital audio signal using a short-time-energy double-threshold algorithm.

Further, the digital audio signal is filtered and denoised using the LMP algorithm.

Further, the first convolutional layer has 20 convolution filters, each of size 5 × 5, with a convolution stride of 1, and uses the relu function as its activation function; the first max pooling layer and the second max pooling layer each have a 2 × 2 pooling region and a stride of 2; the second convolutional layer has 50 convolution filters, each of size 5 × 5, with a convolution stride of 1.

Compared with the prior art, the beneficial effects of the invention are as follows:

1. Endpoint detection of the audio signal extracts the useful sound event segments from strong noise or interference; filtering and denoising reduce the influence of strong noise or interference on sound event feature extraction, so that an accurate sound signal is extracted. The filter bank simulates the cochlear model, and the resulting cochleagram describes how the frequency-domain distribution of the signal changes, so sound events can not only be detected in the presence of background noise or interference but also be effectively recognized and classified.

2. Machine learning avoids manual intervention: the convolutional neural network model fully learns the cochleagram features of each class of sound event, and the generalization ability and adaptability of convolutional neural networks yield higher recognition accuracy and stronger robustness.

3. The sound event recognition method based on the convolutional neural network model has good noise immunity; under the same noise environment, the recognition rate of the invention is significantly improved.

4. When the sound event recognition method of the invention is used in traffic environments with highly complex noise and strong interference, it can achieve a high recognition rate for sound events such as vehicle collision sounds, vehicle horn sounds, cries for help, and car door closing sounds.

5. Filtering and denoising the digital audio signal with the LMP algorithm reduces the influence of urban traffic noise on sound event feature extraction.

6. Using the relu function as the activation function speeds up training of the convolutional neural network model.

7. A filter bank composed of Gammatone filters retains the original sampling frequency; once a response frequency is set along the time dimension, it can be used for short-time sound event feature extraction.

Brief description of the drawings

The present invention is further described below with reference to the accompanying drawings and embodiments.

Fig. 1 is a flowchart of sound event recognition based on the convolutional neural network model.

Fig. 2 is a structure diagram of the convolutional neural network model for sound recognition.

Description of reference numerals:

Cochleagram - 1, first convolutional layer - 2, first max pooling layer - 3, second convolutional layer - 4, second max pooling layer - 5, fully connected layer - 6, classification output layer - 7.

Embodiment

To make the purpose, technical scheme, and advantages of the present invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.

This embodiment provides a sound event recognition method, using noise in a traffic environment as the specific example. The flow is shown in Fig. 1 and comprises the following steps:

A. In the interference environment, a sound level meter is used together with a microphone array to collect sound, forming a digital audio signal;

Four kinds of sound events are collected and processed at three signal-to-noise ratios: 20 dB, 10 dB, and 0 dB. The number of samples for each kind of sound event is 4800, and the sampling frequency is 8 kHz. The four kinds of sound events are vehicle collision sounds, vehicle horn sounds, cries for help, and car door closing sounds;

The collected sound data is preprocessed in MATLAB. A short-time-energy double-threshold algorithm performs endpoint detection on the digital audio signal in order to extract the useful sound event segments from the background noise, and the LMP algorithm filters and denoises the digital audio signal in order to reduce the influence of urban traffic noise on sound event feature extraction;
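The endpoint detection step can be illustrated with a minimal Python sketch (the embodiment itself uses MATLAB). The text names the short-time-energy double-threshold algorithm but gives no frame length, hop, or threshold values, so `frame_len`, `hop`, `low`, and `high` below are hypothetical parameters, and the function names are illustrative only.

```python
def frame_energies(signal, frame_len, hop):
    """Short-time energy: sum of squared samples in each frame."""
    return [
        sum(s * s for s in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def detect_endpoints(energies, low, high):
    """Return inclusive (first_frame, last_frame) pairs for detected events.

    A frame above `high` seeds an event; the event is then extended in
    both directions for as long as neighbouring frames stay above `low`.
    Both thresholds are assumed values, with high >= low.
    """
    segments = []
    i, n = 0, len(energies)
    while i < n:
        if energies[i] > high:
            start = i
            while start > 0 and energies[start - 1] > low:
                start -= 1
            end = i
            while end + 1 < n and energies[end + 1] > low:
                end += 1
            segments.append((start, end))
            i = end + 1
        else:
            i += 1
    return segments
```

In this sketch a frame that exceeds only the low threshold is never reported on its own, which is what lets the high threshold reject weak background bursts while the low threshold recovers the quiet onset and tail of a real event.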

The cochleagram is extracted as follows:

A bank of 64-channel, 4th-order Gammatone filters simulates the cochlear model and performs the sub-band filtering, with center frequencies between 350 Hz and 4000 Hz. Gammatone filters retain the original sampling frequency, so the response frequency along the time dimension is set to 100 Hz, producing a frame shift of 10 ms, which makes the output usable for short-time sound feature extraction. When the digital audio signal passes through the Gammatone filter bank, the response of the output signal is expressed as follows:

$$G_m(i) = \left[\,\lvert g(i, m)\rvert\,\right]^{1/2}, \quad i = 0, 1, \ldots, N; \quad m = 0, 1, \ldots, M-1$$

where the values G_m(i) form a matrix representing how the frequency-domain distribution of the input audio signal changes, N is the number of channels of the audio signal, and M is the number of frames after sampling; the cochleagram describes how the frequency-domain distribution of the signal changes;
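The filter bank and the square-root response can be sketched in Python as follows. Several details are assumptions, not taken from the text: the center frequencies are spaced evenly on the ERB-rate scale, the bandwidth follows the common 1.019 × ERB rule, each impulse response is truncated to 32 ms and normalized to unit energy, and g(i, m) is read as the output of channel i sampled at the 100 Hz frame rate.

```python
import math


def erb_space(f_low, f_high, n):
    """n center frequencies evenly spaced on the ERB-rate scale (assumed spacing)."""
    def to_erb(f):
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)

    def from_erb(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    lo, hi = to_erb(f_low), to_erb(f_high)
    return [from_erb(lo + (hi - lo) * k / (n - 1)) for k in range(n)]


def gammatone_ir(fc, fs, order=4, duration=0.032):
    """Sampled impulse response t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)  # bandwidth from the ERB at fc
    ir = []
    for k in range(int(duration * fs)):
        t = k / fs
        ir.append(t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
                  * math.cos(2.0 * math.pi * fc * t))
    norm = math.sqrt(sum(v * v for v in ir))  # unit-energy scaling (assumption)
    return [v / norm for v in ir]


def cochleagram(signal, fs, n_channels=64, f_low=350.0, f_high=4000.0,
                frame_rate=100):
    """Return rows[i][m] = sqrt(|g(i, m)|) for channel i and frame m."""
    hop = fs // frame_rate  # 80 samples = 10 ms at 8 kHz
    n_frames = len(signal) // hop
    rows = []
    for fc in erb_space(f_low, f_high, n_channels):
        ir = gammatone_ir(fc, fs)
        row = []
        for m in range(n_frames):
            n = m * hop
            # Convolution output evaluated only at the frame instants.
            g = sum(ir[k] * signal[n - k] for k in range(min(len(ir), n + 1)))
            row.append(math.sqrt(abs(g)))
        rows.append(row)
    return rows
```

Evaluating the convolution only at the frame instants keeps the sketch cheap; a production implementation would more likely filter the whole signal and smooth or average within each 10 ms frame before downsampling.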

The resulting original cochleagram is compressed and cropped to a final cochleagram size of 32 × 32, which serves as the input sample of the convolutional neural network;
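The text does not say how the compression and cropping are carried out. One possible reading, sketched below purely as an assumption, averages adjacent channels to compress 64 channels to 32 and takes a centered crop of 32 frames along the time axis; `to_fixed_size` is a hypothetical helper name.

```python
def to_fixed_size(cochleagram, size=32):
    """Compress channels by averaging and center-crop frames to size x size.

    Assumed scheme: the channel count must be a multiple of `size`, and the
    cochleagram must contain at least `size` frames.
    """
    n_channels = len(cochleagram)
    n_frames = len(cochleagram[0])
    if n_channels % size != 0 or n_frames < size:
        raise ValueError("cochleagram cannot be reduced to the requested size")
    group = n_channels // size          # channels averaged into one row
    start = (n_frames - size) // 2      # first frame of the centered crop
    out = []
    for r in range(size):
        rows = cochleagram[r * group:(r + 1) * group]
        out.append([sum(row[start + c] for row in rows) / group
                    for c in range(size)])
    return out
```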

C. a part for the cochlea spectrogram 1 is trained to convolutional neural networks model, i.e. CNN network structure models, built Vertical sound event recognition template；

The method for building up of the convolutional neural networks model is as follows：

1) using the cochlea spectrogram of acquisition as learning sample, and class label is added to the learning sample；Different classes of Learning sample in extract and 3/4 be used as training set, remaining 1/4 is test set；
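A per-class 3/4 to 1/4 split can be sketched in Python as below. The text does not say how samples are chosen within each class, so the random shuffle and the `seed` parameter are assumptions, and `split_by_class` is an illustrative name.

```python
import random


def split_by_class(samples, labels, train_fraction=0.75, seed=0):
    """Split samples so that every class contributes the same fraction to training."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    train, test = [], []
    for label, items in by_class.items():
        items = items[:]            # leave the caller's data untouched
        rng.shuffle(items)          # random selection is an assumption
        cut = int(len(items) * train_fraction)
        train += [(x, label) for x in items[:cut]]
        test += [(x, label) for x in items[cut:]]
    return train, test
```

With 4800 samples per sound event, this split would give 3600 training samples and 1200 test samples for each class.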

2) A training platform is built on an NVIDIA GTX 1080 with the Pascal GP104 core: MATLAB's Parallel Computing Toolbox and Neural Network Toolbox are used to create and train the convolutional neural network model, whose structure is shown in Fig. 2;

The layers of the convolutional neural network are determined: two convolutional layers, two pooling layers, the fully connected layer 6, and the softmax classifier 7. The fully connected layer 6 comprises three fully connected layers 6-1, 6-2, and 6-3. The first convolutional layer 2 has 20 convolution filters, each of size 5 × 5, with a convolution stride of 1; to speed up training, the activation function is the relu function, i.e. the rectified linear unit. The first max pooling layer 3 has a 2 × 2 pooling region and a stride of 2. The second convolutional layer 4 has 50 convolution filters, each of size 5 × 5, with a convolution stride of 1. The second max pooling layer 5 has a 2 × 2 pooling region and a stride of 2. The softmax classifier 7 outputs four target classes: vehicle collision sound, vehicle horn sound, cry for help, and car door closing sound.
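To see what these layer settings imply for a 32 × 32 cochleagram input, the following Python sketch traces the feature-map sizes. It assumes a single input channel and no zero padding, neither of which is stated in the text; the widths of the three fully connected layers are also not given, so they are left out.

```python
# (name, window size, stride, number of filters or None for pooling)
LAYERS = [
    ("conv1", 5, 1, 20),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 50),
    ("pool2", 2, 2, None),
]


def out_size(size, window, stride):
    """Output width of a convolution or pooling window applied without padding."""
    return (size - window) // stride + 1


def feature_map_shapes(input_size=32, layers=LAYERS):
    """Trace (height, width, channels) from the input through each layer."""
    size, channels = input_size, 1
    shapes = [("input", (size, size, channels))]
    for name, window, stride, filters in layers:
        size = out_size(size, window, stride)
        if filters is not None:
            channels = filters
        shapes.append((name, (size, size, channels)))
    return shapes


def flattened_length(shapes):
    """Number of values passed to the first fully connected layer."""
    height, width, channels = shapes[-1][1]
    return height * width * channels
```

Under these assumptions the maps shrink from 32 × 32 to 28 × 28, 14 × 14, 10 × 10, and 5 × 5, so the first fully connected layer would receive 5 × 5 × 50 = 1250 values.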

3) The training samples are input into the convolutional neural network for supervised learning with labels, yielding the parameters of each layer of the trained network.

During training, the convolution kernels and weights are randomly initialized from a probability distribution function, and the biases are initialized to all zeros. To speed up training, the weights and thresholds are adjusted with the standard gradient descent algorithm. The network is trained by repeatedly alternating forward propagation and backpropagation until the error of the cost function falls below the limit of 0.01, and the trained convolutional neural network model is saved;
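The initialization and stopping rule can be illustrated on a much smaller model. The Python sketch below trains a single softmax layer, standing in for the full CNN, with plain gradient descent. The Gaussian initialization, the cross-entropy cost, the learning rate, and the iteration cap are all assumptions, since the text names none of them.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax of a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def train(samples, labels, n_classes, lr=1.0, limit=0.01, max_iters=10000, seed=0):
    """Gradient descent on one softmax layer until the cost drops below `limit`."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Random weights (Gaussian is an assumed distribution), all-zero biases.
    w = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_classes)]
    b = [0.0] * n_classes
    for iteration in range(1, max_iters + 1):
        grad_w = [[0.0] * n_in for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        cost = 0.0
        for x, y in zip(samples, labels):
            # Forward propagation.
            scores = [sum(wk[j] * x[j] for j in range(n_in)) + b[k]
                      for k, wk in enumerate(w)]
            p = softmax(scores)
            cost -= math.log(p[y]) / len(samples)  # mean cross-entropy (assumed cost)
            # Backpropagation of the cost gradient.
            for k in range(n_classes):
                d = (p[k] - (1.0 if k == y else 0.0)) / len(samples)
                grad_b[k] += d
                for j in range(n_in):
                    grad_w[k][j] += d * x[j]
        if cost < limit:
            return w, b, cost, iteration
        for k in range(n_classes):
            b[k] -= lr * grad_b[k]
            for j in range(n_in):
                w[k][j] -= lr * grad_w[k][j]
    raise RuntimeError("cost did not fall below the limit")
```

The iteration cap is a safeguard added for the sketch: on data that cannot be separated, the cost may never reach the 0.01 limit, and the loop would otherwise run forever.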

D. The other part of the cochleagrams is fed into the convolutional neural network model to test the accuracy of sound event recognition;

The cochleagrams of the test set are fed into the trained convolutional neural network model, the output of the classification model is compared with the sound class of each cochleagram in the test set, and the recall, precision, and F-value of sound event recognition are calculated at each signal-to-noise ratio to evaluate the model.

Recall = number of correctly extracted sound events / number of sound events in the sample;

Precision = number of correctly extracted sound events / number of extracted sound events;

F-value = precision × recall × 2 / (precision + recall); the F-value is the harmonic mean of precision and recall.
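The three measures can be computed as in the Python sketch below. It assumes they are evaluated separately for each sound event class, which the text does not state explicitly; the function name and the zero results for empty denominators are illustrative choices.

```python
def recall_precision_f(true_labels, predicted_labels, event):
    """Recall, precision, and F-value for one sound event class."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == event and p == event)
    in_sample = sum(1 for t, _ in pairs if t == event)   # events actually present
    extracted = sum(1 for _, p in pairs if p == event)   # events the model reported
    recall = correct / in_sample if in_sample else 0.0
    precision = correct / extracted if extracted else 0.0
    total = precision + recall
    f_value = 2.0 * precision * recall / total if total else 0.0
    return recall, precision, f_value
```

For example, if three horn sounds are present and the model reports two events as horn, both correctly, recall is 2/3, precision is 1, and the F-value is 0.8.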

Although illustrative embodiments of the invention are described above so that those skilled in the art can understand the invention, the invention is not limited to the scope of the embodiments. To those of ordinary skill in the art, as long as variations fall within the spirit and scope of the invention as defined and determined by the appended claims, all innovations that make use of the inventive concept are protected.

Claims

1. A sound event recognition method, characterized by comprising the following steps:

A. Sound is collected and processed in an interference environment to form a digital audio signal; the collection comprises collecting sound with a sound level meter and a microphone array, and the processing comprises endpoint detection and filtering and denoising of the digital audio signal;

B. The digital audio signal is sub-band filtered by a filter bank to obtain the cochleagram of the audio signal;

C. One part of the cochleagrams is used to train a convolutional neural network model, establishing a sound event recognition template;

D. The other part of the cochleagrams is fed into the convolutional neural network model to test the accuracy of sound event recognition.

2. The sound event recognition method according to claim 1, characterized in that the extraction of the cochleagram comprises the following steps:

1) When the digital audio signal passes through the filter bank, the response of the output audio signal is expressed as follows:

$$G_m(i) = \left[\,\lvert g(i, m)\rvert\,\right]^{1/2}, \quad i = 0, 1, \ldots, N; \quad m = 0, 1, \ldots, M-1$$

where the values G_m(i) form a matrix representing how the frequency-domain distribution of the input audio signal changes, N is the number of channels of the audio signal, and M is the number of frames after sampling; this yields the original cochleagram;

2) The original cochleagram is compressed and cropped to the final cochleagram size, which serves as the input sample of the convolutional neural network.

3. The sound event recognition method according to claim 1, characterized in that the sound event recognition template is established by the following steps:

I. The cochleagrams are used as learning samples, and each learning sample is given a class label; a portion of the learning samples that covers all classes is extracted as the training set, and the remainder serves as the test set;

II. The convolutional neural network model is built in software; the model comprises, in order, a first convolutional layer, a first max pooling layer, a second convolutional layer, a second max pooling layer, a fully connected layer, and a classification output layer;

III. The learning samples of the training set are input into the convolutional neural network model for supervised learning, yielding the parameters of each layer of the trained model. During training, the convolution kernels and weights are randomly initialized from a probability distribution function, and the biases are initialized to all zeros; the weights and thresholds are adjusted with the standard gradient descent algorithm; the network is trained by repeatedly alternating forward propagation and backpropagation until the error of the cost function falls below the limit of 0.01, and the trained convolutional neural network model is saved;

IV. The convolutional neural network model is tested as follows: the samples of the test set are fed into the trained convolutional neural network model, the output of the model is compared with the sound class of each test set sample, and the recall, precision, and F-value of sound event recognition are calculated at each signal-to-noise ratio to evaluate the model.

4. The sound event recognition method according to claim 3, characterized in that there are three fully connected layers in step II, and the classifier of the classification output layer is a softmax classifier.

5. The sound event recognition method according to claim 4, characterized in that the training set in step I contains 3/4 of the learning samples.

6. The sound event recognition method according to any one of claims 1 to 5, characterized in that the filter bank is composed of multi-channel Gammatone filters.

7. The sound event recognition method according to claim 6, characterized in that the sound events collected and processed in step A include one or more of vehicle collision sounds, vehicle horn sounds, cries for help, and car door closing sounds, recorded in a traffic environment under different noise conditions.

8. The sound event recognition method according to claim 7, characterized in that endpoint detection is performed on the digital audio signal using a short-time-energy double-threshold algorithm.

9. The sound event recognition method according to claim 7, characterized in that the digital audio signal is filtered and denoised using the LMP algorithm.

10. The sound event recognition method according to claim 8, characterized in that the first convolutional layer has 20 convolution filters, each of size 5 × 5, with a convolution stride of 1, and uses the relu function as its activation function; the first max pooling layer and the second max pooling layer each have a 2 × 2 pooling region and a stride of 2; the second convolutional layer has 50 convolution filters, each of size 5 × 5, with a convolution stride of 1.