CN110070856A - Acoustic scene recognition method based on harmonic-percussive source separation data augmentation - Google Patents
Acoustic scene recognition method based on harmonic-percussive source separation data augmentation
- Publication number
- CN110070856A (application CN201910233185.3A)
- Authority
- CN
- China
- Prior art keywords
- audio
- harmonic
- training set
- augmentation
- percussive source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G10L15/04 — Segmentation; Word boundary detection
- G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/083 — Recognition networks
- G10L15/26 — Speech-to-text systems
- G10L19/02 — Speech or audio signal analysis-synthesis for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
- G10L25/51 — Speech or voice analysis techniques specially adapted for comparison or discrimination
Abstract
An acoustic scene recognition method based on harmonic-percussive source separation data augmentation, comprising: separating each audio clip in the training set into a total harmonic component H and a total percussive component P, so that each clip yields two clips and the data are thereby augmented; feeding the two clips obtained by harmonic-percussive separation into the scene recognition system as the training set and extracting audio features; using the training-set audio features as input to a classifier network, training the classifier network, and recognizing acoustic scenes on the test set from the output of the classifier network. With the same classifier model, the invention substantially improves classification accuracy in acoustic scene recognition. Through data augmentation based on harmonic-percussive source separation, the system obtains a larger and more diverse training set; training the classifier network on this training set markedly improves its learning and generalization ability.
Description
Technical field
The present invention relates to acoustic scene recognition and classification methods, and more particularly to an acoustic scene recognition method based on harmonic-percussive source separation data augmentation, involving data augmentation for audio processing and pattern recognition.
Background technique
At present, the following approaches are commonly used for scene recognition.
1. Overview of acoustic scene recognition
Data for acoustic scene recognition are collected directly in real environments, so overlapping sounds necessarily occur. Humans live in complex acoustic environments yet can follow a specific sound source well while ignoring, or only casually registering, other sources; for example, we can hold a conversation against a busy background of other talkers or music. Automatic acoustic scene classification is severely limited in this task: the sound mixture contains multiple simultaneous sound events, and machine listening systems are still far from human-level performance at recognizing them. Individual sound events can be used to describe an acoustic scene: in a symbolic way they can represent, for instance, a scene on a busy street, with cars passing by, car horns, and the hurried steps of pedestrians.
The purpose of acoustic scene recognition and classification is to process the sound signal and translate it into symbolic descriptions of the sound events occurring in the scene, for applications such as automatic tagging, automatic sound analysis, and audio segmentation. Earlier research related to acoustic scene recognition considered scenes with explicitly annotated overlapping events, but detection results were presented as a sequence under the assumption that only the most prominent event is present at any time. In this setting the system can find only one event at a time, and the evaluation counts the output as correct if the detected event appears in the annotation. In multi-source environments, the performance of such systems is very limited.
2. Data augmentation methods
As research on acoustic scene recognition expands and more and more neural-network-based methods are applied to the problem, the demand for data also grows. Compared with image classification, however, the datasets currently available for developing acoustic scene recognition systems are much more limited in size, diversity, and number of event instances, although recent contributions such as AudioSet and the DCASE datasets have narrowed this gap considerably.
A good solution to this problem is data augmentation: applying one or more deformations to a set of annotated training samples to produce new, additional training data. A key concept of data augmentation is that the deformations applied to the labeled data must not change the semantics of the labels. Taking computer vision as an example, a rotated, translated, mirrored, or scaled image of a car is still a coherent image of a car, so these deformations can be used to generate additional training data while keeping the labels semantically valid. By training the network on the additional deformed data, one hopes that the network becomes invariant to these deformations and generalizes better to unseen data. In doing so, the model is exposed to a larger and more diverse set of training samples and can therefore describe the decision boundaries between classes better. Label-preserving deformations have also been proposed in the audio domain and have been shown to improve model accuracy on music classification tasks.
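As a minimal illustration of label-preserving augmentation, the sketch below expands a dataset by applying deformations while reusing the original labels. The `augment` helper and the toy data are hypothetical examples, not part of the patent:

```python
def augment(dataset, transforms):
    """Expand a list of (sample, label) pairs with label-preserving transforms.

    Each transform maps a sample to a deformed sample; because the
    deformation preserves the label semantics, the original label is
    reused for every deformed copy.
    """
    augmented = list(dataset)
    for sample, label in dataset:
        for transform in transforms:
            augmented.append((transform(sample), label))
    return augmented

# Toy usage: two deformations triple the training set.
data = [([1.0, 2.0], "street"), ([0.5, 0.5], "park")]
bigger = augment(data, [lambda x: [v * 2 for v in x],
                        lambda x: list(reversed(x))])
```

With harmonic-percussive separation as the deformation, each clip would instead yield two new clips (H and P) carrying the original scene label.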
3. The HPSS method
Harmonic/percussive source separation (HPSS) was proposed and applied in the field of music source separation. In a spectrogram, a music signal generally exhibits two kinds of structure: one distributed smoothly and continuously along the time axis, the other smoothly and continuously along the frequency axis; these two kinds of source are commonly called harmonic sources and percussive sources. HPSS is a simple method for analyzing the harmonic and percussive sources of an audio signal; as a preprocessing step it has effectively advanced research in audio signal analysis such as multi-pitch analysis and automatic music transcription. Harmonic and percussive sources in an audio signal have anisotropic spectral characteristics. Based on this characteristic, a harmonic source generally contains sustained pitches and forms a series of smooth temporal envelopes in the spectrogram, so it is smooth and continuous along the time axis and intermittent along the frequency axis; conversely, a percussive source is generally concentrated in a short time interval and forms a series of vertical broadband spectral envelopes, so it is intermittent along the time axis and smooth and continuous along the frequency axis.
Summary of the invention
The technical problem to be solved by the invention is to provide an acoustic scene recognition method based on harmonic-percussive source separation data augmentation that can substantially improve classification accuracy.
The technical scheme adopted by the invention is an acoustic scene recognition method based on harmonic-percussive source separation data augmentation, comprising the following steps:
1) separating each audio clip in the training set into a total harmonic component H and a total percussive component P, so that each clip yields two clips and the data are thereby augmented;
2) feeding the two audio clips obtained by harmonic-percussive separation into the scene recognition system as the training set and extracting audio features;
3) using the training-set audio features as input to a classifier network, training the classifier network, and recognizing acoustic scenes on the test set from the output of the classifier network.
The separation of harmonic and percussive sources from the training-set audio described in step 1) comprises:
(1) Design the audio signal model J(H, P):
J(H, P) = Σ_i [ (H_{i-1} − H_i)² / (2σ_H²) + (P_{i-1} − P_i)² / (2σ_P²) ] (1)
H_i + P_i = W_i, H_i > 0, P_i > 0 (2)
where H_i is the i-th harmonic component of the audio signal, P_i is the i-th percussive component of the audio signal, F_i is the i-th Short-Time Fourier Transform coefficient of the input signal f(t), W_i = |F_i|² is the energy spectrum, and σ_H, σ_P are weight smoothing factors;
(2) Compute the minimum of each audio signal model by iteration to obtain the total harmonic component H and the total percussive component P of each audio signal;
(2.1) Assuming the spectral gradients H_{i-1} − H_i and P_{i-1} − P_i obey independent Gaussian distributions, we have:
(H_{i-1} − H_i)² ≤ 2(H_{i-1} − U_i)² + 2(H_i − U_i)² (3)
(P_{i-1} − P_i)² ≤ 2(P_{i-1} − V_i)² + 2(P_i − V_i)² (4)
where the intermediate variables U_i and V_i, at which (3) and (4) hold with equality, are:
U_i = (H_{i-1} + H_i) / 2, V_i = (P_{i-1} + P_i) / 2
(2.2) Set the auxiliary function:
Q(H, P, U, V) = Σ_i [ (2(H_{i-1} − U_i)² + 2(H_i − U_i)²) / (2σ_H²) + (2(P_{i-1} − V_i)² + 2(P_i − V_i)²) / (2σ_P²) ] (5)
Substituting (3) and (4) into (5) gives (6) and (7):
J(H, P) ≤ Q(H, P, U, V) (6)
J(H, P) = min over U, V of Q(H, P, U, V) (7)
(2.3) After k iterations of alternately minimizing Q under constraint (2), with k ≤ 200, the separation into the total harmonic component H and the total percussive component P is realized, thereby augmenting the audio data.
Step 2) comprises:
(1) down-sampling the audio so that audio data of different sampling frequencies are uniformly converted to 44.1 kHz;
(2) framing and windowing: setting the Short-Time Fourier Transform frame length to N ms and the hop size to N/2 ms, and applying M mel filters to obtain the mel spectrogram, which constitutes the extracted audio features.
Step 3) comprises:
(1) forming input-output pairs from the extracted training-set audio features and their corresponding class labels;
(2) using the classifier network to learn the mapping between audio features and the corresponding class labels;
(3) in the test phase, recognizing the test-set audio with the trained classifier network. The estimated output of the classifier network, denoted ŷ, is expected to fall into two cases: when the class label corresponding to the audio features is absent from the estimated output, ŷ is close to 0; when it is present, ŷ is close to 1.
With the acoustic scene recognition method based on harmonic-percussive source separation data augmentation of the invention, classification accuracy in acoustic scene recognition is substantially improved using the same classifier model. Through data augmentation based on harmonic-percussive source separation, the system obtains a larger and more diverse training set; training the classifier network on this training set markedly improves its learning and generalization ability.
Description of the drawings
Fig. 1 is the mel spectrogram of the total harmonic component after separation;
Fig. 2 is the mel spectrogram of the total percussive component after separation.
Specific embodiments
An acoustic scene recognition method based on harmonic-percussive source separation data augmentation according to the invention is described in detail below with reference to embodiments and the accompanying drawings.
The acoustic scene recognition method based on harmonic-percussive source separation data augmentation of the invention comprises the following steps:
1) separating each audio clip in the training set into a total harmonic component H and a total percussive component P, so that each clip yields two clips and the data are thereby augmented. The separation of harmonic and percussive sources from the training-set audio comprises:
(1) Design the audio signal model J(H, P):
J(H, P) = Σ_i [ (H_{i-1} − H_i)² / (2σ_H²) + (P_{i-1} − P_i)² / (2σ_P²) ] (1)
H_i + P_i = W_i, H_i > 0, P_i > 0 (2)
where H_i is the i-th harmonic component of the audio signal, P_i is the i-th percussive component of the audio signal, F_i is the i-th Short-Time Fourier Transform coefficient of the input signal f(t), W_i = |F_i|² is the energy spectrum, and σ_H, σ_P are weight smoothing factors;
(2) Compute the minimum of each audio signal model by iteration to obtain the total harmonic component H and the total percussive component P of each audio signal;
(2.1) Assuming the spectral gradients H_{i-1} − H_i and P_{i-1} − P_i obey independent Gaussian distributions, we have:
(H_{i-1} − H_i)² ≤ 2(H_{i-1} − U_i)² + 2(H_i − U_i)² (3)
(P_{i-1} − P_i)² ≤ 2(P_{i-1} − V_i)² + 2(P_i − V_i)² (4)
where the intermediate variables U_i and V_i, at which (3) and (4) hold with equality, are:
U_i = (H_{i-1} + H_i) / 2, V_i = (P_{i-1} + P_i) / 2
(2.2) Set the auxiliary function:
Q(H, P, U, V) = Σ_i [ (2(H_{i-1} − U_i)² + 2(H_i − U_i)²) / (2σ_H²) + (2(P_{i-1} − V_i)² + 2(P_i − V_i)²) / (2σ_P²) ] (5)
Substituting (3) and (4) into (5) gives (6) and (7):
J(H, P) ≤ Q(H, P, U, V) (6)
J(H, P) = min over U, V of Q(H, P, U, V) (7)
(2.3) After k iterations of alternately minimizing Q under constraint (2), with k ≤ 200, the separation into the total harmonic component H and the total percussive component P is realized, thereby augmenting the audio data.
The mel spectrograms of the separated total harmonic component H and total percussive component P are shown in Fig. 1 and Fig. 2.
2) feeding the two audio clips obtained by harmonic-percussive source separation (HPSS) into the scene recognition system as the training set and performing audio feature extraction, comprising:
(1) down-sampling the audio so that audio data of different sampling frequencies are uniformly converted to 44.1 kHz;
(2) framing and windowing: setting the Short-Time Fourier Transform frame length to 46 ms and the hop size to 23 ms, and applying M mel filters to obtain the mel spectrogram, which constitutes the extracted audio features.
3) using the training-set audio features as input to the classifier network, training the classifier network, and recognizing acoustic scenes on the test set from the output of the classifier network, comprising:
(1) forming input-output pairs from the extracted training-set audio features and their corresponding class labels;
(2) using the classifier network to learn the mapping between audio features and the corresponding class labels;
(3) in the test phase, recognizing the test-set audio with the trained classifier network. The estimated output of the classifier network, denoted ŷ, is expected to fall into two cases: when the class label corresponding to the audio features is absent from the estimated output, ŷ is close to 0; when it is present, ŷ is close to 1.
An embodiment of the invention:
The acoustic scene recognition method based on harmonic-percussive source separation data augmentation of the invention comprises the following steps:
1) separating each audio clip in the training set into a total harmonic component H and a total percussive component P, so that each clip yields two clips and the data are thereby augmented. The separation of harmonic and percussive sources from the training-set audio comprises:
(1) Design the audio signal model J(H, P):
J(H, P) = Σ_i [ (H_{i-1} − H_i)² / (2σ_H²) + (P_{i-1} − P_i)² / (2σ_P²) ] (1)
H_i + P_i = W_i, H_i > 0, P_i > 0 (2)
where H_i is the i-th harmonic component of the audio signal, P_i is the i-th percussive component of the audio signal, F_i is the i-th Short-Time Fourier Transform coefficient of the input signal f(t), and W_i = |F_i|² is the energy spectrum; H_i and P_i are initialized as H_i = 0.5 W_i and P_i = 0.5 W_i, and the weight smoothing factors are assigned σ_H = 0.7 and σ_P = 1.05;
(2) Compute the minimum of each audio signal model by iteration to obtain the total harmonic component H and the total percussive component P of each audio signal;
(2.1) Assuming the spectral gradients H_{i-1} − H_i and P_{i-1} − P_i obey independent Gaussian distributions, we have:
(H_{i-1} − H_i)² ≤ 2(H_{i-1} − U_i)² + 2(H_i − U_i)² (3)
(P_{i-1} − P_i)² ≤ 2(P_{i-1} − V_i)² + 2(P_i − V_i)² (4)
where the intermediate variables U_i and V_i, at which (3) and (4) hold with equality, are:
U_i = (H_{i-1} + H_i) / 2, V_i = (P_{i-1} + P_i) / 2
(2.2) Set the auxiliary function:
Q(H, P, U, V) = Σ_i [ (2(H_{i-1} − U_i)² + 2(H_i − U_i)²) / (2σ_H²) + (2(P_{i-1} − V_i)² + 2(P_i − V_i)²) / (2σ_P²) ] (5)
Substituting (3) and (4) into (5) gives (6) and (7):
J(H, P) ≤ Q(H, P, U, V) (6)
J(H, P) = min over U, V of Q(H, P, U, V) (7)
(2.3) After k = 20 iterations of alternately minimizing Q under constraint (2), the separation into the total harmonic component H and the total percussive component P is realized, thereby augmenting the audio data.
The mel spectrograms of the separated total harmonic component H and total percussive component P are shown in Fig. 1 and Fig. 2.
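The separation step can be sketched in NumPy. The sketch below uses a complementary-diffusion style of update commonly paired with this auxiliary-function formulation; the patent's exact iteration is not fully reproduced in the text, so the `hpss` function here is an assumption, while the initialization H_i = P_i = 0.5 W_i, the weights σ_H = 0.7 and σ_P = 1.05, and k = 20 follow the embodiment:

```python
import numpy as np

def hpss(W, sigma_h=0.7, sigma_p=1.05, k=20):
    """Split a power spectrogram W (freq x time) into harmonic H and percussive P.

    H is pushed toward smoothness along the time axis, P along the
    frequency axis, while H + P = W is maintained at every step.
    """
    H = 0.5 * W
    P = 0.5 * W
    # balance between the two smoothness terms of the objective
    alpha = sigma_p ** 2 / (sigma_h ** 2 + sigma_p ** 2)
    for _ in range(k):
        # discrete second differences: along time for H, along frequency for P
        dH = alpha * (H[:, :-2] - 2 * H[:, 1:-1] + H[:, 2:]) / 4
        dP = (1 - alpha) * (P[:-2, :] - 2 * P[1:-1, :] + P[2:, :]) / 4
        delta = np.zeros_like(W)
        delta[:, 1:-1] += dH
        delta[1:-1, :] -= dP
        H = np.clip(H + delta, 0.0, W)   # keep 0 <= H <= W
        P = W - H                        # enforce the constraint H + P = W
    return H, P

# Toy spectrogram: a horizontal (harmonic) line and a vertical (percussive) line.
W = np.zeros((16, 16))
W[5, :] = 1.0   # sustained tone: smooth along the time axis
W[:, 7] = 1.0   # click: smooth along the frequency axis
H, P = hpss(W)
```

On this toy input the sustained tone ends up mostly in H and the click mostly in P, matching the anisotropy described in the background section.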
2) feeding the two audio clips obtained by harmonic-percussive source separation (HPSS) into the scene recognition system as the training set and performing audio feature extraction, comprising:
(1) down-sampling the audio so that audio data of different sampling frequencies are uniformly converted to 44.1 kHz;
(2) framing and windowing: setting the Short-Time Fourier Transform frame length to 46 ms and the hop size to 23 ms, and applying 128 mel filters to obtain the mel spectrogram, which constitutes the extracted audio features.
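The feature-extraction step can be sketched with the embodiment's parameters (44.1 kHz, 46 ms frames, 23 ms hop, 128 mel filters). The Hann window, the triangular mel filterbank construction, and the log compression are common choices assumed here, not specified in the patent:

```python
import numpy as np

SR = 44100                  # target sampling rate from the embodiment
FRAME = int(0.046 * SR)     # 46 ms frame length
HOP = int(0.023 * SR)       # 23 ms hop size
N_MELS = 128

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(x, sr=SR, frame=FRAME, hop=HOP, n_mels=N_MELS):
    """Frame the signal, window it, and map FFT power to log-mel features."""
    n_frames = 1 + (len(x) - frame) // hop
    window = np.hanning(frame)
    fb = mel_filterbank(sr, frame, n_mels)
    frames = np.stack([x[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(fb @ power.T + 1e-10)   # shape: (n_mels, n_frames)

# One second of a 440 Hz tone as a toy input.
t = np.arange(SR) / SR
mel = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

The resulting matrix (128 mel bands by time frames) is the feature representation fed to the classifier network.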
3) using the training-set audio features as input to the classifier network, training the classifier network, and recognizing acoustic scenes on the test set from the output of the classifier network, comprising:
(1) forming input-output pairs from the extracted training-set audio features and their corresponding class labels;
(2) using the classifier network to learn the mapping between audio features and the corresponding class labels, where the classifier network is a two-layer convolutional neural network with 3 × 3 convolution kernels;
(3) in the test phase, recognizing the test-set audio with the trained classifier network. The estimated output of the classifier network, denoted ŷ, is expected to fall into two cases: when the class label corresponding to the audio features is absent from the estimated output, ŷ is close to 0; when it is present, ŷ is close to 1.
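The embodiment specifies only a two-layer CNN with 3 × 3 kernels, so the following NumPy forward pass is a plausible sketch rather than the patent's architecture; the ReLU activations, global average pooling, fully connected layer, softmax, and the 10 scene classes are all assumptions:

```python
import numpy as np

def conv3x3(x, w):
    """Valid 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    c_out = w.shape[0]
    h, wd = x.shape[1] - 2, x.shape[2] - 2
    out = np.zeros((c_out, h, wd))
    for co in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[co, i, j] = np.sum(x[:, i:i + 3, j:j + 3] * w[co])
    return out

def forward(features, w1, w2, w_fc):
    """Two 3x3 conv layers with ReLU, global average pooling, softmax."""
    h1 = np.maximum(conv3x3(features, w1), 0.0)
    h2 = np.maximum(conv3x3(h1, w2), 0.0)
    pooled = h2.mean(axis=(1, 2))            # global average pooling
    logits = w_fc @ pooled
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 16, 16))         # one-channel mel-spectrogram patch
w1 = rng.standard_normal((4, 1, 3, 3)) * 0.1
w2 = rng.standard_normal((8, 4, 3, 3)) * 0.1
w_fc = rng.standard_normal((10, 8)) * 0.1    # 10 hypothetical scene classes
y_hat = forward(x, w1, w2, w_fc)
```

The output ŷ is a probability vector over scene classes; the entry for the correct class is pushed toward 1 during training and the others toward 0, matching the two cases described in step (3).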
Claims (4)
1. An acoustic scene recognition method based on harmonic-percussive source separation data augmentation, characterized by comprising the following steps:
1) separating each audio clip in the training set into a total harmonic component H and a total percussive component P, so that each clip yields two clips and the data are thereby augmented;
2) feeding the two audio clips obtained by harmonic-percussive separation into the scene recognition system as the training set and extracting audio features;
3) using the training-set audio features as input to a classifier network, training the classifier network, and recognizing acoustic scenes on the test set from the output of the classifier network.
2. The acoustic scene recognition method based on harmonic-percussive source separation data augmentation according to claim 1, characterized in that the separation of harmonic and percussive sources from the training-set audio described in step 1) comprises:
(1) designing the audio signal model J(H, P):
J(H, P) = Σ_i [ (H_{i-1} − H_i)² / (2σ_H²) + (P_{i-1} − P_i)² / (2σ_P²) ] (1)
H_i + P_i = W_i, H_i > 0, P_i > 0 (2)
where H_i is the i-th harmonic component of the audio signal, P_i is the i-th percussive component of the audio signal, F_i is the i-th Short-Time Fourier Transform coefficient of the input signal f(t), W_i = |F_i|² is the energy spectrum, and σ_H, σ_P are weight smoothing factors;
(2) computing the minimum of each audio signal model by iteration to obtain the total harmonic component H and the total percussive component P of each audio signal;
(2.1) assuming the spectral gradients H_{i-1} − H_i and P_{i-1} − P_i obey independent Gaussian distributions, we have:
(H_{i-1} − H_i)² ≤ 2(H_{i-1} − U_i)² + 2(H_i − U_i)² (3)
(P_{i-1} − P_i)² ≤ 2(P_{i-1} − V_i)² + 2(P_i − V_i)² (4)
where the intermediate variables U_i and V_i, at which (3) and (4) hold with equality, are:
U_i = (H_{i-1} + H_i) / 2, V_i = (P_{i-1} + P_i) / 2
(2.2) setting the auxiliary function:
Q(H, P, U, V) = Σ_i [ (2(H_{i-1} − U_i)² + 2(H_i − U_i)²) / (2σ_H²) + (2(P_{i-1} − V_i)² + 2(P_i − V_i)²) / (2σ_P²) ] (5)
Substituting (3) and (4) into (5) gives (6) and (7):
J(H, P) ≤ Q(H, P, U, V) (6)
J(H, P) = min over U, V of Q(H, P, U, V) (7)
(2.3) after k iterations of alternately minimizing Q under constraint (2), with k ≤ 200, the separation into the total harmonic component H and the total percussive component P is realized, thereby augmenting the audio data.
3. The acoustic scene recognition method based on harmonic-percussive source separation data augmentation according to claim 1, characterized in that step 2) comprises:
(1) down-sampling the audio so that audio data of different sampling frequencies are uniformly converted to 44.1 kHz;
(2) framing and windowing: setting the Short-Time Fourier Transform frame length to N ms and the hop size to N/2 ms, and applying M mel filters to obtain the mel spectrogram, which constitutes the extracted audio features.
4. The acoustic scene recognition method based on harmonic-percussive source separation data augmentation according to claim 1, characterized in that step 3) comprises:
(1) forming input-output pairs from the extracted training-set audio features and their corresponding class labels;
(2) using the classifier network to learn the mapping between audio features and the corresponding class labels;
(3) in the test phase, recognizing the test-set audio with the trained classifier network, the estimated output of the classifier network, denoted ŷ, being expected to fall into two cases: when the class label corresponding to the audio features is absent from the estimated output, ŷ is close to 0; when it is present, ŷ is close to 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910233185.3A CN110070856A (en) | 2019-03-26 | 2019-03-26 | Acoustic scene recognition method based on harmonic-percussive source separation data augmentation
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910233185.3A CN110070856A (en) | 2019-03-26 | 2019-03-26 | Acoustic scene recognition method based on harmonic-percussive source separation data augmentation
Publications (1)
Publication Number | Publication Date |
---|---|
CN110070856A true CN110070856A (en) | 2019-07-30 |
Family
ID=67366745
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910233185.3A Pending CN110070856A (en) | 2019-03-26 | 2019-03-26 | A kind of audio scene recognition method based on the enhancing of harmonic wave impulse source mask data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110070856A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101477798A (en) * | 2009-02-17 | 2009-07-08 | 北京邮电大学 | Method for analyzing and extracting audio data of set scene |
CN104616663A (en) * | 2014-11-25 | 2015-05-13 | 重庆邮电大学 | Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation) |
CN108335703A (en) * | 2018-03-28 | 2018-07-27 | 腾讯音乐娱乐科技(深圳)有限公司 | The method and apparatus for determining the stress position of audio data |
CN109256146A (en) * | 2018-10-30 | 2019-01-22 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio-frequency detection, device and storage medium |
Non-Patent Citations (1)
Title |
---|
Yoonchang Han et al., "Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification", Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807901A (en) * | 2019-11-08 | 2020-02-18 | 西安联丰迅声信息科技有限责任公司 | Non-contact industrial abnormal sound detection method |
CN110807901B (en) * | 2019-11-08 | 2021-08-03 | 西安联丰迅声信息科技有限责任公司 | Non-contact industrial abnormal sound detection method |
CN113497953A (en) * | 2020-04-07 | 2021-10-12 | 北京达佳互联信息技术有限公司 | Music scene recognition method, device, server and storage medium |
CN111505650A (en) * | 2020-04-28 | 2020-08-07 | 西北工业大学 | HPSS-based underwater target passive detection method |
CN111505650B (en) * | 2020-04-28 | 2022-11-01 | 西北工业大学 | HPSS-based underwater target passive detection method |
CN113241091A (en) * | 2021-05-28 | 2021-08-10 | 思必驰科技股份有限公司 | Sound separation enhancement method and system |
CN116186524A (en) * | 2023-05-04 | 2023-05-30 | 天津大学 | Self-supervision machine abnormal sound detection method |
CN116186524B (en) * | 2023-05-04 | 2023-07-18 | 天津大学 | Self-supervision machine abnormal sound detection method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190730 |