CN103474066A - Ecological voice recognition method based on multiband signal reconstruction

Info

Publication number: CN103474066A
Application number: CN2013104723429A
Authority: CN
Inventors: 李应; 欧阳桢
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2013-10-11
Filing date: 2013-10-11
Publication date: 2013-12-25
Anticipated expiration: 2033-10-11
Also published as: CN103474066B

Abstract

The invention relates to an ecological sound recognition method based on multi-band signal reconstruction. The method comprises the steps of: first, using OMP (Orthogonal Matching Pursuit) sparse decomposition as a first-stage reconstruction that retains the main structure of the foreground sound; second, dividing the residual components left by the first stage into frequency bands and adaptively compensating the reconstructed signal according to the frequency distributions of the foreground sound and the background noise, which completes the second-stage reconstruction; finally, extracting composite noise-robust features from the time-frequency information of the atoms in the support set and from frequency-domain information, and classifying ecological sounds with a deep belief network (DBN) under different environments and signal-to-noise ratios. The two-stage reconstruction suppresses noise and improves the reconstruction accuracy of the foreground sound, so the method achieves better noise robustness in natural environments.

Description

Ecological sound recognition method based on multi-band signal reconstruction

Technical field

The present invention relates to an ecological sound recognition method based on multi-band signal reconstruction.

Background art

Ecological sound recognition extracts features from the various sound signals found in natural environments and identifies them. Analyzing and recognizing the audio information contained in an environment can serve applications such as intrusion detection and species surveys. In real environments, large amounts of non-stationary noise interfere with sound recognition. Noise-robust ecological sound recognition therefore has important practical significance.

In current audio signal processing, speech and speaker recognition have been studied relatively extensively, whereas ecological environmental sounds have received relatively little attention. Commonly used features are the frequency-domain Mel-frequency cepstral coefficients (MFCCs) and time-frequency representations such as the short-time Fourier transform and the wavelet transform, classified with a Gaussian mixture model (GMM) or a hidden Markov model (HMM). Because ecological sounds are highly random and not always structured, these methods are not necessarily effective for them. To address this, several newer approaches have been proposed. For example, Khunarsal et al. recognize short environmental sounds by spectrogram pattern matching combined with a k-nearest-neighbor (KNN) classifier; Zhang et al. use improved MFCCs as features and a GMM to classify insect sounds; Lee et al. model spectral shape features to classify continuous bird calls; Raju et al. extract pitch, formant and short-time energy features and combine them with a support vector machine (SVM) to classify 19 kinds of animal sounds, including cats, dogs and lions.

A common problem of these methods is that, faced with sound signals of uncertain structure, a suitable classifier is difficult to design. Discriminative models such as the SVM and conventional neural networks model non-linearly separable classes well, but with high-dimensional features and many classes their classification performance falls short of GMM or HMM. In addition, recognition performance drops sharply in noisy environments, especially at low signal-to-noise ratios. Commonly used denoising methods include spectral subtraction and Wiener filtering. Spectral subtraction easily introduces musical noise and thereby distorts the signal. Filter-based denoising achieves optimal filtering when the statistics of the signal and the noise are available, but noise in natural environments is complex and variable, this prior information often cannot be obtained, and the range of application is therefore limited.

Denoising by signal reconstruction based on matching pursuit (MP) exploits the sparsity of sound to decompose and reconstruct the signal in an adaptive representation. It requires no prior knowledge of the statistics of the signal to be detected or of the noise, so it can be applied to many kinds of signals in different scenes. In practice, however, signal and noise overlap, and reducing noise as far as possible comes at the cost of increased signal distortion, so a denoising algorithm must trade off noise reduction against signal distortion. Plain MP sparse denoising also has limitations. During MP decomposition, searching the overcomplete dictionary for the optimal atom is computationally expensive. Existing practice is to restrict the dictionary size, or to use intelligent algorithms to obtain atoms highly correlated with the original signal while keeping the number of decompositions as small as possible. However, the residual component left after reconstruction is not entirely noise; it also contains part of the useful sound. Simply increasing the number of decompositions to improve reconstruction accuracy adds computation on the one hand and fails to suppress noise on the other, so subsequent recognition performs poorly.

Summary of the invention

In view of this, the purpose of the present invention is to provide an ecological sound recognition method based on multi-band signal reconstruction.

The present invention is realized by the following scheme: an ecological sound recognition method based on multi-band signal reconstruction, characterized in that it comprises the following steps:

S01: perform OMP sparse decomposition on the clean sound and on the noisy test sound respectively, and output the reconstructed signal and the OMP features of each;

S02: extract from the clean sound the composite features that include the OMP features, and train the DBN model;

S03: extract the power spectrum of the residual signal obtained from the OMP sparse decomposition of the noisy test sound, and apply multi-band compensation;

S04: extract the power spectrum of the reconstructed signal obtained from the OMP sparse decomposition of the noisy test sound, and perform the second reconstruction by combining it with the multi-band-compensated residual power spectrum from step S03;

S05: extract the composite features that include the OMP features from the signal obtained by the second reconstruction in step S04;

S06: classify the composite features extracted in step S05 with the DBN model trained in step S02, and output the ecological sound class to which the noisy test sound belongs.

In an embodiment of the present invention, let f be the noisy sound signal to be decomposed, of length N. Before sparse decomposition, an overcomplete atom dictionary D = (g_{γ})_{γ \in Γ} is first constructed. Each time-frequency atom g_{γ} is a Gabor atom defined by the parameter group γ = (s, u, v, w): the shift factor u defines the center of the atom g_{γ}, while the scale factor s, the frequency factor v and the phase factor w define its waveform. The time-frequency parameters are discretized as γ = (s, u, v, w) = (a^{j}, p a^{j} Δu, k a^{-j} Δv, i Δw), where 0 < j \leq \log_{2} N, 0 \leq p \leq N 2^{-j+1}, 0 \leq k < 2^{j+1}, 0 \leq i \leq 12, a = 2, Δu = 1/2, Δv = π and Δw = π/6. Step S01 comprises the following steps:

S011: initialize the signal residual R_{0} y' = f, the iteration counter k = 1, and the maximum number of iterations L;

S012: select from the overcomplete atom dictionary D the atom g_{γk} that is most correlated with the signal residual at the k-th iteration, such that

| < R_{k} y', g_{γk} > | \geq α \sup_{γ \in Γ} | < R_{k} y', g_{γ} > |, 0 < α \leq 1;

S013: test whether || R_{k} y' || < ε, ε > 0 holds, where ε is the preset residual threshold; if it holds, go to step S016 and end the decomposition; otherwise continue the decomposition;

S014: use the Gram-Schmidt method to orthogonalize g_{γk} against the set of already selected atoms g_{γp}, 0 < p \leq k, to obtain the projection P_{k}, and compute the new approximate reconstruction y' = P_{k} f and the residual R_{k+1} y' = f - y';

S015: if the maximum number of iterations has not been reached, set k = k + 1 and return to step S012 to continue iterating; otherwise go to step S016;

S016: a series of atoms having been obtained by the successive decompositions, output the L-term approximate atomic expansion

y' (t) \approx Σ_{n = 1}^{L} P_{n} g_{γn} (t)
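Steps S011 to S016 can be sketched in Python with NumPy as below. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `gabor_atom` and `omp` are hypothetical, the atom is a Gaussian window modulated by a cosine and normalized to unit energy, the dictionary is passed as a plain matrix instead of the full discretized Gabor dictionary, the residual test is placed at the start of each iteration, and the Gram-Schmidt orthogonalization of S014 is replaced by an equivalent least-squares projection onto the span of the selected atoms.

```python
import numpy as np

def gabor_atom(n, s, u, v, w):
    """Unit-norm real Gabor atom: Gaussian window of scale s centered at u,
    modulated at frequency v (radians per sample) with phase w."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - u) / s) ** 2) * np.cos(v * (t - u) + w)
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g

def omp(f, D, max_iter, eps=1e-6):
    """Orthogonal matching pursuit.
    f: signal of length N; D: (n_atoms, N) array with unit-norm atoms as rows.
    Returns the reconstruction y', the residual and the selected atom indices."""
    residual = f.astype(float).copy()
    support = []
    y = np.zeros_like(residual)
    for _ in range(max_iter):
        if np.linalg.norm(residual) < eps:       # residual threshold (S013)
            break
        corr = np.abs(D @ residual)              # correlation with residual (S012)
        if support:
            corr[support] = 0.0                  # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        A = D[support].T                         # N x k matrix of selected atoms
        coef, *_ = np.linalg.lstsq(A, f, rcond=None)
        y = A @ coef                             # orthogonal projection of f (S014)
        residual = f - y
    return y, residual, support
```

The indices in `support` identify the atoms of the support set; their (s, u, v, w) parameters are what the feature extraction described next operates on.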

In an embodiment of the present invention, the composite features that include the OMP features are extracted as follows: the composite feature consists of the OMP features, the MFCC features and the pitch feature. To extract the OMP features, each frame of the sound signal is decomposed by OMP, and the mean and standard deviation of the scale factor s and of the frequency factor v are computed over the time-frequency parameter groups of the first L atoms in the support set that represents the frame, forming a 4-dimensional OMP feature,

where λ is the frame index of the signal, i is the index of an atom representing the frame, and L is the number of atoms.
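As a small illustration of the 4-dimensional OMP feature, the sketch below computes the mean and standard deviation of the scale and frequency factors of the selected atoms of one frame. The function name `omp_features`, the ordering of the four values and the use of the population standard deviation (division by L) are assumptions made for this example; the text only states that the mean and standard deviation of s and v are used.

```python
import numpy as np

def omp_features(scales, freqs):
    """4-D OMP feature of one frame from the scale factors s and frequency
    factors v of its first L support-set atoms.
    Order (assumed): mean(s), std(s), mean(v), std(v)."""
    s = np.asarray(scales, dtype=float)
    v = np.asarray(freqs, dtype=float)
    return np.array([s.mean(), s.std(), v.mean(), v.std()])
```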

In an embodiment of the present invention, MFCCs are chosen to supplement the OMP features. A 24-channel Mel filter bank is applied to the discrete Fourier transform of the reconstructed signal to obtain 12-dimensional static MFCCs, and the log energy is appended as the 13th dimension.

In an embodiment of the present invention, pitch (PITCH) is chosen to supplement the OMP features; the circular average magnitude difference function (CAMDF) method is used to obtain a 1-dimensional PITCH feature for each frame.
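The CAMDF pitch estimate can be sketched as follows. The circular difference function D(τ) = Σ_n |x((n + τ) mod N) - x(n)| is evaluated over a range of lags and the lag with the smallest value is taken as the pitch period. The function name `camdf_pitch` and the search-range parameters `fmin` and `fmax` are hypothetical; the text does not specify a search range.

```python
import numpy as np

def camdf_pitch(frame, fs, fmin, fmax):
    """Estimate the pitch (Hz) of one frame with the circular AMDF.
    fs: sampling rate in Hz; fmin, fmax: assumed pitch search range in Hz."""
    x = np.asarray(frame, dtype=float)
    n = len(x)
    lag_min = max(1, int(fs / fmax))
    lag_max = min(int(fs / fmin), n // 2)
    lags = np.arange(lag_min, lag_max + 1)
    # np.roll(x, -tau)[k] equals x[(k + tau) mod n], i.e. the circular shift
    d = np.array([np.abs(np.roll(x, -tau) - x).sum() for tau in lags])
    return fs / lags[np.argmin(d)]
```

Restricting the lag range matters in practice: for a perfectly periodic frame every multiple of the true period also gives a near-zero difference.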

In an embodiment of the present invention, DBN model training comprises two steps. The first step is pre-training with an unsupervised greedy layer-wise strategy: the labeled ecological sound features initialize the states of the visible-layer nodes at the bottom of the DBN, so that the features become progressively more abstract from layer to layer. The second step uses the correct labels in a supervised BP network and propagates the correction information top-down to every RBM layer for fine-tuning.

In an embodiment of the present invention, each RBM uses the Contrastive Divergence criterion as its self-training strategy and consists of a visible layer V and a hidden layer H. Several RBMs are stacked bottom-up through weighted inter-layer connections, the output of the hidden units of one RBM serving as the input to the visible layer of the RBM above it, which builds the DBN architecture. An RBM has three parameters: the weights W between the visible and hidden layers, and the respective biases b and c. Training the DBN classifier is thus reduced to solving for the RBM parameters. Let the node values of the visible and hidden layers be v_{i} and h_{j} respectively. The probability that each node of the visible layer V is set to 1 is P(v_{i} = 1),

and similarly the probability that each node of the hidden layer H is set to 1 is P(h_{j} = 1),

The update rule for the weights W is Δw_{ij} ∝ <v_{i} h_{j}>_{data} - <v_{i} h_{j}>_{reconstruct}, where <v_{i} h_{j}>_{data} is the expectation, under the joint probability distribution, of the visible nodes v_{i} given by the known sample set and the unknown hidden nodes h_{j}, and <v_{i} h_{j}>_{reconstruct} is the expectation of v_{i} h_{j} under the joint probability distribution after the hidden units have been updated from the known samples and the visible units have then been reconstructed.
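The weight update above corresponds to one step of Contrastive Divergence (CD-1), which can be sketched as below. Several points are assumptions of this example rather than statements of the text: units are binary with logistic (sigmoid) activation probabilities, b is taken as the visible bias and c as the hidden bias, and the name `cd1_update` and the learning rate `lr` are hypothetical. Real-valued acoustic features would normally be scaled to [0, 1] or modeled with Gaussian visible units.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step for a binary RBM.
    v0: (batch, n_visible) data; W: (n_visible, n_hidden);
    b: visible bias; c: hidden bias (assignment of b and c is assumed)."""
    ph0 = sigmoid(v0 @ W + c)                        # P(h_j = 1 | data)
    h0 = (rng.random(ph0.shape) < ph0).astype(float) # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                      # reconstruct visible layer
    ph1 = sigmoid(pv1 @ W + c)                       # hidden probs after reconstruction
    batch = v0.shape[0]
    # <v_i h_j>_data - <v_i h_j>_reconstruct, averaged over the batch
    dW = (v0.T @ ph0 - pv1.T @ ph1) / batch
    W_new = W + lr * dW
    b_new = b + lr * (v0 - pv1).mean(axis=0)
    c_new = c + lr * (ph0 - ph1).mean(axis=0)
    return W_new, b_new, c_new
```

Stacking RBMs then amounts to training one layer this way and feeding its hidden probabilities to the next layer as visible data.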

In an embodiment of the present invention, the foreground sound is not uniformly distributed over the spectrum. To determine its dominant-frequency structure, the power spectrum |Y'(λ, j)|^{2} obtained from the first reconstruction is divided evenly into M linear sub-bands, and for each sound frame λ the energy ratio in band i is computed,

where K is the FFT order (the number of FFT coefficients) and FFT_{λ, p} is the p-th FFT coefficient of frame λ.
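The sub-band energy ratio can be sketched as follows, assuming it is the energy in band i divided by the total energy of the frame; the exact formula is not reproduced in this text, so this normalization and the name `subband_energy_ratio` are assumptions.

```python
import numpy as np

def subband_energy_ratio(power, m):
    """power: 1-D array of the K power-spectrum values |FFT|^2 of one frame.
    Returns the share of the frame energy falling in each of m linear sub-bands."""
    energy = np.array([band.sum() for band in np.array_split(power, m)])
    total = energy.sum()
    return energy / total if total > 0 else np.zeros(m)
```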

In an embodiment of the present invention, a threshold γ is set. When the energy ratio exceeds the threshold, sub-band i lies within the dominant-frequency range and the foreground sound frequency factor α(λ) is given a higher weight; outside the dominant-frequency range it is given a correspondingly lower weight, that is,

The noise frequency factor β(λ) characterizes how strongly the current sub-band is affected by noise. The signal reconstructed in the previous stage can be used as prior information to estimate the noise, and the power-spectrum signal-to-noise ratio of sub-band i in frame λ is then computed as

{SNR}_{i} (λ) = 10 \log_{10} (\frac{Σ_{p = \frac{K}{M} \cdot (i - 1)}^{\frac{K}{M} \cdot i} {| Y_{i}^{'} (λ, p) |}^{2}}{Σ_{p = \frac{K}{M} \cdot (i - 1)}^{\frac{K}{M} \cdot i} {| F_{i} (λ, p) |}^{2} - {| Y_{i}^{'} (λ, p) |}^{2}}),

The noise frequency factor of sub-band i in frame λ is

β_{i} (λ) = \{\begin{matrix} 0.1, & {SNR}_{i} (λ) < 0 \\ 0.1 + 0.04 {SNR}_{i} (λ), & 0 \leq {SNR}_{i} (λ) \leq 20 \\ 0.9, & {SNR}_{i} (λ) > 20 \end{matrix};
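The sub-band signal-to-noise ratio and the piecewise noise frequency factor can be sketched as below. The function names are hypothetical, the sub-bands are taken as non-overlapping slices of the spectrum, the denominator is read as the summed difference between the noisy and reconstructed power, and a small floor guards against division by zero; these are choices of the example.

```python
import numpy as np

def subband_snr_db(noisy_power, recon_power, m, floor=1e-12):
    """SNR_i in dB for each of m sub-bands of one frame.
    noisy_power: |F|^2 values; recon_power: |Y'|^2 values of the first reconstruction."""
    snr = np.empty(m)
    bands = zip(np.array_split(noisy_power, m), np.array_split(recon_power, m))
    for i, (f_band, y_band) in enumerate(bands):
        signal = max(y_band.sum(), floor)
        noise = max((f_band - y_band).sum(), floor)   # estimated noise energy
        snr[i] = 10.0 * np.log10(signal / noise)
    return snr

def noise_factor(snr_db):
    """beta_i: 0.1 below 0 dB, 0.1 + 0.04 * SNR between 0 and 20 dB, 0.9 above 20 dB."""
    return np.clip(0.1 + 0.04 * np.asarray(snr_db, dtype=float), 0.1, 0.9)
```

Clipping the linear term reproduces all three branches, since 0.1 + 0.04 · 20 = 0.9.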

The multi-band gain is applied using the foreground sound frequency factor α(λ) and the noise frequency factor β(λ) obtained above, giving the power spectrum of the second reconstruction |Y(λ, j)|^{2} \approx |Y'(λ, j)|^{2} + α(λ) β(λ) (|F(λ, j)|^{2} - |Y'(λ, j)|^{2}). When the reconstructed foreground power spectrum exceeds the power spectrum of the original noisy sound, it is updated accordingly.
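The multi-band gain can be sketched as below. The weights assigned to α inside and outside the dominant-frequency range are not given in this text, so `w_high`, `w_low` and `threshold` are left as parameters; the function names are hypothetical, and the update applied when the result exceeds the noisy power spectrum is not implemented because its formula is not reproduced here.

```python
import numpy as np

def foreground_factor(ratio, threshold, w_high, w_low):
    """alpha per sub-band: w_high where the energy ratio exceeds the threshold
    (dominant-frequency range), w_low elsewhere. Weight values are assumptions."""
    return np.where(np.asarray(ratio) > threshold, w_high, w_low)

def second_stage_power(noisy_power, recon_power, alpha, beta):
    """|Y|^2 ~ |Y'|^2 + alpha * beta * (|F|^2 - |Y'|^2), with the per-band
    factors alpha and beta expanded to every frequency bin of their band."""
    k, m = len(noisy_power), len(alpha)
    sizes = [len(idx) for idx in np.array_split(np.arange(k), m)]
    delta = np.repeat(np.asarray(alpha, dtype=float) * np.asarray(beta, dtype=float), sizes)
    return recon_power + delta * (noisy_power - recon_power)
```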

The two-stage reconstruction adopted by the present invention not only suppresses noise but also improves the reconstruction accuracy of the foreground sound. Compared with the commonly used combination of Mel-frequency cepstral coefficients (MFCC) and an SVM, the method has better noise robustness in natural environments.

To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

Brief description of the drawings

Fig. 1 is a flow chart of the OMP-based multi-band signal reconstruction of the present invention.

Fig. 2a is the waveform of a clean thrush call.

Fig. 2b is the spectrogram of the clean thrush call.

Fig. 2c is the waveform of Fig. 2a with running-water noise added at a signal-to-noise ratio of 10 dB.

Fig. 2d is the spectrogram of Fig. 2b with running-water noise added at a signal-to-noise ratio of 10 dB.

Fig. 2e is the reconstructed spectrogram of Fig. 2d at a sparsity of 10.

Fig. 2f is the reconstructed spectrogram of Fig. 2d at a sparsity of 30.

Fig. 2g is the waveform after the second reconstruction.

Fig. 2h is the spectrogram after the second reconstruction.

Fig. 3 is a flow chart of the DBN classification of the present invention.

Detailed description of the embodiments

The present invention proposes an ecological sound recognition method based on multi-band signal reconstruction and builds a classification framework based on deep learning. First, OMP sparse decomposition performs the first-stage reconstruction, retaining the main structure of the foreground sound. Second, the residual components of the first stage are divided into frequency bands, and the reconstructed signal is adaptively compensated according to the frequency distributions of the foreground sound and the background noise, completing the second-stage reconstruction. Finally, composite noise-robust features are extracted from the time-frequency information of the support-set atoms and from frequency-domain information, and a deep belief network (DBN) classifies the ecological sounds under different environments and signal-to-noise ratios. As shown in Fig. 1, the method comprises the following steps:

The OMP algorithm is a greedy reconstruction algorithm from compressed sensing (CS), proposed as an improvement of the matching pursuit (MP) algorithm. The improvement is that the atom selected from the dictionary at each decomposition, called the optimal atom, is first orthogonalized against the set of already selected atoms with the Gram-Schmidt method; this guarantees the optimality of each iteration and reduces the number of iterations. For the same accuracy requirement, OMP reconstructs the signal with a sparser representation and converges faster. Denoising ecological sound with OMP exploits the sparsity of the signal: the useful information is extracted as the sparse component, and the noise is treated as the residual left after the sparse component is removed. Noise has a certain randomness, and since the dictionary contains no random atoms, its correlation with the dictionary is low. According to CS theory, when the noisy sound signal is projected to a low dimension and the observation dimension is sufficient to contain the useful information, the noise is not sparse. The noise in the residual therefore cannot be recovered during reconstruction, which achieves denoising. The sound signal is mapped onto the atom dictionary and decomposed; each round of decomposition yields the atom with the largest inner product with the signal, i.e. the highest correlation. The more atoms are extracted by iteration, the smaller the signal residual, and the final weighted combination of atoms gives the best reconstruction of the original signal.

Let f be the noisy sound signal to be decomposed, of length N. Before sparse decomposition, an overcomplete atom dictionary D = (g_{γ})_{γ \in Γ} is first constructed. Each time-frequency atom g_{γ} is a Gabor atom defined by the parameter group γ = (s, u, v, w): the shift factor u defines the center of the atom g_{γ}, while the scale factor s, the frequency factor v and the phase factor w define its waveform. The time-frequency parameters are discretized as γ = (s, u, v, w) = (a^{j}, p a^{j} Δu, k a^{-j} Δv, i Δw), where 0 < j \leq \log_{2} N, 0 \leq p \leq N 2^{-j+1}, 0 \leq k < 2^{j+1}, 0 \leq i \leq 12, a = 2, Δu = 1/2, Δv = π and Δw = π/6. In the OMP sparse decomposition, the atom selected at the k-th iteration satisfies

| < R_{k} y', g_{γk} > | \geq α \sup_{γ \in Γ} | < R_{k} y', g_{γ} > |, 0 < α \leq 1;

The OMP decomposition selects the optimal atom in each iteration according to energy and degree of correlation, and the selected optimal atoms form the support set of the reconstructed signal. Noise has a certain randomness, and since the dictionary contains no random atoms, its correlation with the dictionary is low. For colored noise, the method relies on the different sparsity of the clean sound and the background noise: according to CS theory, when the noisy sound signal is projected to a low dimension and the observation dimension is sufficient to contain the useful information, the noise is not sparse. This guarantees that in the early reconstruction the noise in the residual cannot be recovered while the main structure of the useful sound is retained. The sound signal is mapped onto the atom dictionary and decomposed; each round of decomposition yields the atom with the largest inner product with the signal, i.e. the highest correlation. The more atoms are extracted by iteration, the smaller the signal residual, and the final weighted combination of atoms gives the best reconstruction of the original signal.

The composite features that include the OMP features are extracted as follows: the composite feature consists of the OMP features, the MFCC features and the pitch feature. To extract the OMP features, each frame of the sound signal is decomposed by OMP, and the mean and standard deviation of the scale factor s and of the frequency factor v are computed over the time-frequency parameter groups of the first L atoms in the support set that represents the frame, forming a 4-dimensional OMP feature.

MFCCs are chosen to supplement the OMP features. A 24-channel Mel filter bank is applied to the discrete Fourier transform of the reconstructed signal to obtain 12-dimensional static MFCCs, and the log energy is appended as the 13th dimension.

Pitch (PITCH) is chosen to supplement the OMP features; the circular average magnitude difference function (CAMDF) method is used to obtain a 1-dimensional PITCH feature for each frame.

DBN model training comprises two steps. The first step is pre-training with an unsupervised greedy layer-wise strategy: the labeled ecological sound features initialize the states of the visible-layer nodes at the bottom of the DBN, so that the features become progressively more abstract from layer to layer. The second step uses the correct labels in a supervised BP network and propagates the correction information top-down to every RBM layer for fine-tuning.

Each RBM uses the Contrastive Divergence criterion as its self-training strategy and consists of a visible layer V and a hidden layer H. Several RBMs are stacked bottom-up through weighted inter-layer connections, the output of the hidden units of one RBM serving as the input to the visible layer of the RBM above it, which builds the DBN architecture. An RBM has three parameters: the weights W between the visible and hidden layers, and the respective biases b and c. Training the DBN classifier is thus reduced to solving for the RBM parameters. Let the node values of the visible and hidden layers be v_{i} and h_{j} respectively. The probability that each node of the visible layer V is set to 1 is P(v_{i} = 1).

The update rule for the weights W is Δw_{ij} ∝ <v_{i} h_{j}>_{data} - <v_{i} h_{j}>_{reconstruct}, where <v_{i} h_{j}>_{data} is the expectation, under the joint probability distribution, of the visible nodes v_{i} given by the known sample set and the unknown hidden nodes h_{j}, and <v_{i} h_{j}>_{reconstruct} is the expectation of v_{i} h_{j} under the joint probability distribution after the hidden units have been updated from the known samples and the visible units have then been reconstructed.

Assume that the additive noise is uncorrelated with the foreground sound to be recognized. The noisy sound signal f(t) can then be written as f(t) = y(t) + n(t), where t is the time index, y(t) is the clean foreground sound and n(t) is the background noise. Applying the fast Fourier transform to f(t) gives the amplitude spectrum F(λ, j), where λ is the frame index and j is the frequency index. The power spectrum |F(λ, j)|^{2} decomposes into the foreground sound power spectrum |Y(λ, j)|^{2} and the noise power spectrum |N(λ, j)|^{2}, that is, |F(λ, j)|^{2} = |Y(λ, j)|^{2} + |N(λ, j)|^{2}. The noisy sound signal is sparsely decomposed by OMP, and the linear weighted combination of the first few atoms with the highest correlation gives the first reconstruction. Compared with the original signal, the reconstructed foreground power spectrum |Y'(λ, j)|^{2} is in fact incomplete, and the missing signal can be regarded as residing in the residual component together with the noise, so that |Y(λ, j)|^{2} \approx (1 - δ(λ)) |Y'(λ, j)|^{2} + δ(λ) |F(λ, j)|^{2}, where δ(λ) is a gain factor introduced here that characterizes the ratio between the missing amount in frame λ and the original signal. Experiments show that the distributions of the foreground sound and of the noise over the spectrum jointly affect this ratio. The residual component of the foreground sound is relatively more likely to lie within its main frequency distribution (hereinafter the dominant frequency) than elsewhere, and is less likely to lie in bands strongly affected by noise. The gain factor can therefore be split into the foreground sound frequency factor α(λ) and the noise frequency factor β(λ), that is, δ(λ) = α(λ) β(λ).
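The weighted form of the estimate and the additive form used for the second reconstruction are the same expression; expanding the first makes the role of the gain factor explicit:

```latex
\begin{aligned}
|Y(λ, j)|^{2} &\approx (1 - δ(λ))\,|Y'(λ, j)|^{2} + δ(λ)\,|F(λ, j)|^{2} \\
&= |Y'(λ, j)|^{2} + δ(λ)\,\bigl(|F(λ, j)|^{2} - |Y'(λ, j)|^{2}\bigr),
\qquad δ(λ) = α(λ)\,β(λ).
\end{aligned}
```

With δ(λ) = 0 the estimate reduces to the first-stage reconstruction |Y'(λ, j)|^{2}, and with δ(λ) = 1 it returns the noisy power spectrum |F(λ, j)|^{2}; intermediate values add back a fraction of the residual power.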

The foreground sound is not uniformly distributed over the spectrum. To determine its dominant-frequency structure, the power spectrum |Y'(λ, j)|^{2} obtained from the first reconstruction is divided evenly into M linear sub-bands, and for each sound frame λ the energy ratio in band i is computed.

A threshold γ is set. When the energy ratio exceeds the threshold, sub-band i lies within the dominant-frequency range and the foreground sound frequency factor α(λ) is given a higher weight; outside the dominant-frequency range it is given a correspondingly lower weight. The power-spectrum signal-to-noise ratio of sub-band i in frame λ is

{SNR}_{i} (λ) = 10 \log_{10} (\frac{Σ_{p = \frac{K}{M} \cdot (i - 1)}^{\frac{K}{M} \cdot i} {| Y_{i}^{'} (λ, p) |}^{2}}{Σ_{p = \frac{K}{M} \cdot (i - 1)}^{\frac{K}{M} \cdot i} {| F_{i} (λ, p) |}^{2} - {| Y_{i}^{'} (λ, p) |}^{2}}),

The noise frequency factor of sub-band i in frame λ is

β_{i} (λ) = \{\begin{matrix} 0.1, & {SNR}_{i} (λ) < 0 \\ 0.1 + 0.04 {SNR}_{i} (λ), & 0 \leq {SNR}_{i} (λ) \leq 20 \\ 0.9, & {SNR}_{i} (λ) > 20 \end{matrix};

The accuracy of ecological sound recognition depends to a large extent on the effectiveness of noise reduction. For the complex and variable non-stationary noise in ecological scenes, reconstructing the noisy sound signal by OMP sparse decomposition retains the main structure of the foreground sound. High signal reconstruction accuracy is a prerequisite for effective subsequent feature extraction. The most direct way to improve reconstruction accuracy is to increase the number of decompositions, but this increases computational complexity on the one hand and, on the other, cannot separate the noise component during reconstruction. The present method instead uses multi-band compensation to selectively extract signal components from the residual of the OMP decomposition and uses them to compensate the first-stage reconstruction, thereby performing the second reconstruction adaptively. Composite noise-robust time-frequency features are then extracted to build the DBN model, which classifies ecological sounds efficiently. The detailed procedure is as follows.

Preprocessing and first-stage OMP reconstruction:

All sound samples are normalized, smoothed with a Hamming window and divided into frames; the frame length is 23 ms (512 samples) and the frame shift is 11.6 ms (256 samples). Fig. 2a and Fig. 2b show the waveform and spectrogram of a thrush call containing three effective syllables. As an example, after running-water noise is mixed in at a signal-to-noise ratio of 10 dB, Fig. 2c and Fig. 2d show that the noise is not uniformly distributed over the spectrum and interferes considerably with the original signal. According to the formulas

| < R_{k} y', g_{γk} > | \geq α \sup_{γ \in Γ} | < R_{k} y', g_{γ} > |, 0 < α \leq 1; y' = P_{k} f, R_{k + 1} y' = f - y', y' (t) \approx Σ_{n = 1}^{L} P_{n} g_{γn} (t)

each frame is reconstructed after sparse decomposition; Fig. 2e and Fig. 2f are the spectrograms of the reconstructed signal at sparsity 10 and 30 respectively. It is clear that raising the sparsity improves the overall recovery of the noisy signal to some extent, but noise components are inevitably reconstructed as well. The reconstruction at lower sparsity still retains the main structure, while noise components with low correlation to the original signal are markedly weakened; the incomplete parts of the thrush call require the multi-band reconstruction of the next step.
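The framing used in the preprocessing step (512-sample frames, 256-sample shift, Hamming window) can be sketched as follows. The function name `frame_signal` is hypothetical, and peak normalization is an assumption, since the text only states that the samples are normalized.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Normalize a signal, split it into overlapping frames and apply a
    Hamming window. Returns an array of shape (n_frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x)) if len(x) else 0.0
    if peak > 0:
        x = x / peak                                   # peak normalization (assumed)
    n_frames = max(0, 1 + (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

Each row of the result can then be passed to the OMP decomposition frame by frame.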

Second-stage multi-band reconstruction:

According to the frequency distributions of the foreground thrush call and the background running-water noise, the spectrum is divided evenly into 8 linear sub-bands. Spectral analysis of the OMP-reconstructed signal with the sub-band energy ratio gives 2000 Hz to 3000 Hz as the dominant frequency band of the thrush call. The residual components in this band receive a higher compensation weight and can be regarded as the part that is "emphasized"; the remaining bands can conversely be regarded as the part that is "ignored". Next, still using the OMP-reconstructed signal as prior information, the power-spectrum signal-to-noise ratio of each sub-band is computed. Sub-bands with a low signal-to-noise ratio, i.e. bands with high noise energy, are further attenuated with low weights. Through the two-stage adaptive reconstruction, the noise is suppressed to a greater degree. Fig. 2g and Fig. 2h show the final waveform and spectrogram of the thrush signal after the two-stage adaptive reconstruction. Comparison with Fig. 2c and Fig. 2d shows that the multi-band adaptive reconstruction reduces noise fairly effectively.

Composite feature extraction:

The Gabor atoms chosen by the present invention consist of a modulated Gaussian window function. Because a Gaussian function is localized in both the time and frequency domains, this locality ensures that the atom time-frequency parameters describe the non-stationary, time-varying characteristics of the signal well. By OMP decomposition, the mean and standard deviation of the scale factor s and the frequency factor v are computed over the time-frequency parameter groups of the first 10 atoms representing the signal segment, forming the 4-dimensional OMP feature. Because the first reconstruction cannot fully represent the information of the original sound, recognition with the OMP time-frequency features alone is unsatisfactory. Because animal calls have different pitch-period ranges, the fundamental frequency (PITCH) used as a feature discriminates ecological sounds to some degree. After the second adaptive reconstruction, the present invention performs endpoint detection on the reconstructed signal using short-time energy and zero-crossing rate, extracts MFCCs from the non-silent frames, and combines them with the OMP features into the composite feature.

The MFCC features are obtained as follows: a 24-channel Mel filter bank is applied after the discrete Fourier transform (DFT) to obtain 12-dimensional static MFCCs, and the log energy is appended as the 13th dimension. In addition, the circular average magnitude difference function (CAMDF) method is used to obtain the 1-dimensional PITCH feature of each frame.
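A minimal sketch of the 13-dimensional MFCC vector (12 cepstral coefficients from 24 Mel filters plus log energy) is given below. The triangular filter shapes, the common 2595 · log10(1 + f/700) Mel formula, the DCT-II without normalization and the function names are assumptions of this example; the text fixes only the number of filters and coefficients.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular Mel filters, shape (n_filters, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                 # rising edge
            fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):                # falling edge
            fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mfcc_frame(frame, fs, n_filters=24, n_ceps=12):
    """12 static MFCCs plus log energy for one (already windowed) frame."""
    frame = np.asarray(frame, dtype=float)
    power = np.abs(np.fft.rfft(frame)) ** 2
    log_mel = np.log(mel_filterbank(n_filters, len(frame), fs) @ power + 1e-12)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(1, n_ceps + 1)[:, None] * (n + 0.5) / n_filters)
    log_energy = np.log(np.sum(frame ** 2) + 1e-12)
    return np.append(dct @ log_mel, log_energy)
```

Concatenating this vector with the 4-dimensional OMP feature and the 1-dimensional PITCH feature gives an 18-dimensional composite feature per frame.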

Pre-training the DBN model initializes the visible-layer node states at the bottom of the DBN with the labeled ecological sound features; the feature vectors obtained by passing layer by layer through the restricted Boltzmann machine (RBM) models trained without supervision serve as the input to the final BP network. The correct labels are then used in the supervised BP network, and the error is back-propagated to the bottom RBM models to fine-tune the whole DBN model. The detailed flow is shown in Fig. 3.

The classification ability of a DBN is affected by both the number of RBM hidden layers and the number of nodes in each layer. Increasing the number of hidden layers can improve the DBN's classification accuracy on the feature vectors, but the learning time increases accordingly. Increasing the number of nodes improves the approximation ability of the DBN network, but too many nodes reduce its generalization ability. The best number of hidden layers and node configuration must therefore be determined by experiment.

The preferred embodiments above further describe the purpose, technical solution and advantages of the present invention. It should be understood that the foregoing are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An ecological sound recognition method based on multi-band signal reconstruction, characterized in that it comprises the following steps:

2. The ecological sound recognition method based on multi-band signal reconstruction according to claim 1, characterized in that: let the noisy sound signal to be decomposed be f, of length N; before sparse decomposition, an overcomplete atom dictionary D = (g_{γ})_{γ \in Γ} is first constructed; the time-frequency atom g_{γ} is a Gabor atom defined by the parameter set γ = (s, u, v, w), where the shift factor u defines the center of the atom g_{γ}, and the scale factor s, frequency factor v and phase factor w define its waveform; the time-frequency parameters are discretized as γ = (s, u, v, w) = (a^{j}, p a^{j} Δu, k a^{-j} Δv, i Δw), where 0 < j \leq \log_{2} N, 0 \leq p \leq N 2^{-j+1}, 0 \leq k < 2^{j+1}, 0 \leq i \leq 12, a = 2, Δu = 1/2, Δv = π, Δw = π/6; the specific steps of step S01 comprise:

| \langle R_{k} y', g_{γ_{k}} \rangle | \geq α \sup_{γ \in Γ} | \langle R_{k} y', g_{γ} \rangle |, 0 < α \leq 1;

3. The ecological sound recognition method based on multi-band signal reconstruction according to claim 1, characterized in that the compound feature including the OMP feature is extracted as follows: a compound feature comprising the OMP feature, the MFCCs feature and the pitch feature is extracted; wherein the OMP feature is extracted by decomposing each frame of the sound signal with OMP and computing, over the time-frequency parameter sets of the first L atoms in the support set that represents the frame, the mean and standard deviation of the scale factor s and the frequency factor v, forming a 4-dimensional OMP feature,

4. The ecological sound recognition method based on multi-band signal reconstruction according to claim 3, characterized in that: MFCCs are chosen to supplement the OMP feature; first, using a 24-order Mel filter bank, 12-dimensional static MFCCs are obtained after applying the discrete Fourier transform to the reconstructed signal, and the log energy is added as the 13th dimension.

5. The ecological sound recognition method based on multi-band signal reconstruction according to claim 3, characterized in that: PITCH is chosen to supplement the OMP feature, and the circular AMDF (CAMDF) method is used to obtain the 1-dimensional PITCH feature of each frame.

6. The ecological sound recognition method based on multi-band signal reconstruction according to claim 1, characterized in that: the DBN model training comprises two steps; in the first step, pre-training uses an unsupervised, greedy layer-by-layer strategy, in which the labelled ecological sound features initialize the state values of the visible-layer nodes at the bottom of the DBN, so that the specific features become progressively more abstract; in the second step, the BP network is trained with supervision using the correct labels, and the correction information is propagated top-down to every RBM layer for fine-tuning.

7. The ecological sound recognition method based on multi-band signal reconstruction according to claim 6, characterized in that: the RBM network uses the Contrastive Divergence criterion as its self-training strategy; each layer consists of a visible layer V and a hidden layer H; multiple RBMs are combined through bottom-up weighted inter-layer connections, with the output of the hidden units serving as the input of the visible layer of the next RBM above, thereby building a DBN architecture; an RBM has three parameters, namely the weights W between the visible layer and the hidden layer and their respective biases b and c, so the training of the DBN classifier reduces to solving for the RBM parameters; let the node values of the visible layer and the hidden layer be v_{i} and h_{j} respectively; the probability that each node of the visible layer V is set to 1 is P(v_{i} = 1), and likewise the probability that each node of the hidden layer H is set to 1 is P(h_{j} = 1),

8. The ecological sound recognition method based on multi-band signal reconstruction according to claim 1, characterized in that: the distribution of the foreground sound over the spectrum is not uniform; to determine its dominant frequency structure, the power spectrum |Y'(λ, j)|^{2} obtained from the first-stage reconstruction is divided evenly into M linear subbands, and for each voiced frame λ the energy proportion on band i is calculated, wherein K is the order of the FFT coefficients and FFT_{λ, p} is the p-th FFT coefficient of frame λ.

9. The ecological sound recognition method based on multi-band signal reconstruction according to claim 1, characterized in that: a threshold γ is determined; when the energy proportion exceeds the threshold, subband i lies within the dominant frequency range and the foreground-sound frequency factor α(λ) is set to a higher weight, whereas outside the dominant frequency range a correspondingly lower weight is set,

{SNR}_{i}(λ) = 10 \log_{10} \left( \frac{\sum_{p = \frac{K}{M} (i - 1)}^{\frac{K}{M} i} | Y'_{i}(λ, p) |^{2}}{\sum_{p = \frac{K}{M} (i - 1)}^{\frac{K}{M} i} | F_{i}(λ, p) |^{2} - | Y'_{i}(λ, p) |^{2}} \right),

The noise frequency factor of subband i of frame λ is updated as

β_{i}(λ) = \begin{cases} 0.1, & {SNR}_{i}(λ) < 0 \\ 0.1 + 0.04 \, {SNR}_{i}(λ), & 0 \leq {SNR}_{i}(λ) \leq 20 \\ 0.9, & {SNR}_{i}(λ) > 20 \end{cases}.
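
The subband SNR and the piecewise noise frequency factor in claim 9 can be illustrated with a short sketch. The function names are hypothetical, and two points are assumptions rather than statements of the claim: the denominator of the SNR is read as the summed difference between the noisy power |F_i|^2 and the first-stage reconstructed power |Y'_i|^2, and a non-positive signal or noise power is treated as an error.

```python
import math


def subband_snr_db(noisy_power, recon_power):
    """SNR (dB) of one subband from per-bin power values.

    noisy_power: |F_i(λ, p)|^2 for each bin p in the subband
    recon_power: |Y'_i(λ, p)|^2 for the same bins
    """
    signal = sum(recon_power)
    # Assumed reading: noise power is the summed difference over the subband.
    noise = sum(f - y for f, y in zip(noisy_power, recon_power))
    if signal <= 0.0 or noise <= 0.0:
        raise ValueError("signal and noise power must both be positive")
    return 10.0 * math.log10(signal / noise)


def noise_frequency_factor(snr_db):
    """Piecewise mapping from subband SNR (dB) to the factor β_i(λ)."""
    if snr_db < 0.0:
        return 0.1
    if snr_db > 20.0:
        return 0.9
    return 0.1 + 0.04 * snr_db
```

The mapping is continuous at both breakpoints: it gives 0.1 at 0 dB and 0.9 at 20 dB.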