CN111540368A - Stable bird sound extraction method and device and computer readable storage medium - Google Patents


Info

Publication number
CN111540368A
Authority
CN
China
Prior art keywords
sub
noise
signal
sound
power spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010379824.XA
Other languages
Chinese (zh)
Other versions
CN111540368B (en)
Inventor
张承云
郑泽鸿
陈庆春
凌嘉乐
肖波
Current Assignee
Guangzhou University
Original Assignee
Guangzhou University
Priority date
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202010379824.XA
Publication of CN111540368A
Application granted
Publication of CN111540368B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L17/00: Speaker identification or verification techniques
            • G10L17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
          • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/27: characterised by the analysis technique
            • G10L25/78: Detection of presence or absence of voice signals
              • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D30/00: Reducing energy consumption in communication networks
            • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a robust bird-sound extraction method comprising the following steps: preprocess the audio signal to obtain the noisy-signal power spectrum, and obtain a noise power spectrum estimate by a minimum-value search method; using a preset HBank filter bank, convert the noisy-signal power spectrum and the noise power spectrum estimate into the H domain for analysis, and obtain the posterior signal-to-noise ratio; obtain the H-domain prior signal-to-noise ratio estimate from the posterior SNR by the guided-decision method; smooth the prior SNR and take its mean to obtain the prior probability of a voiced frame; judge whether the current frame is a voiced frame against a set threshold, and collect consecutive voiced-frame signals into voiced segments; and obtain the formant frequency and formant width by linear prediction to judge whether a voiced segment contains bird sound. The method accurately extracts voiced segments and automatically rejects noise, performs well at low signal-to-noise ratio, and has low algorithmic complexity and strong real-time performance.

Description

Stable bird sound extraction method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of ecological monitoring and acoustic signal recognition, and in particular to a robust bird-sound extraction method and device and a computer-readable storage medium.
Background
At present, China, as a country with one of the largest numbers of bird species, has always paid close attention to bird protection. Research on bird song can not only distinguish species but also analyze biological behavior, and has broad prospects in field animal monitoring, pest-bird deterrence in agriculture and forestry, and aviation bird-strike prevention. The primary task of bird-song study is to separate potential bird-sound fragments from continuously acquired audio. Early bird-sound extraction was performed manually: after repeated listening and spectrogram analysis, a bird-sound expert extracted and labeled the segments containing bird sound. Although manual extraction can obtain bird-sound fragments accurately, its detection efficiency is very low, which makes it unsuitable for processing massive recording data.
With the maturation of voice-activity-detection technology, methods for automatically extracting bird-song segments have appeared, but many problems remain, including extraction performance, complexity, universality, and real-time capability.
In the prior art, bird-sound extraction methods have the following defects:
extraction based on energy detection has a low computational load but cannot make correct judgments at low signal-to-noise ratio; extraction based on prior probability has a moderate computational load and good detection performance, but still misjudges abrupt noise; extraction based on spectrogram analysis can obtain complete bird-sound fragments, but the spectrogram must be built from many consecutive frames, the computation is heavy, and the approach suits only offline processing; extraction based on Gaussian mixture models detects stably at low signal-to-noise ratio with moderate complexity, but the model parameters need continual adjustment and abrupt noise can still be misjudged; extraction based on deep learning performs excellently when samples are plentiful, but the algorithm is complex, a large amount of sample data must be trained beforehand, and the classification result is affected by over-fitting and under-fitting.
Disclosure of Invention
The object of the invention is to provide a robust bird-sound extraction method and device and a computer-readable storage medium that accurately extract voiced segments, automatically reject some human voices and other animal sounds, pick up bird-sound signals accurately at low signal-to-noise ratio, have low algorithmic complexity and strong real-time performance, and can be applied in a bird-sound acquisition system.
To achieve the above object, the invention provides a robust bird-sound extraction method, suitable for execution in a computer device, comprising at least the following steps:
preprocessing the collected audio signal within the target range to obtain the noisy-signal power spectrum, smoothing the noisy-signal power spectrum, and obtaining a noise power spectrum estimate by a minimum-value search method;
inputting the noisy-signal power spectrum and the noise power spectrum estimate into a preset HBank filter bank to obtain the H-domain noisy-signal power spectrum and the H-domain noise power spectrum estimate, and obtaining the posterior signal-to-noise ratio from the two, wherein the associated domain of the HBank filter bank is the H domain;
obtaining the H-domain prior signal-to-noise ratio estimate from the posterior signal-to-noise ratio by the guided-decision method; smoothing the prior SNR to obtain a smoothed prior SNR estimate; and obtaining the prior probability of a voiced frame from the mean of the smoothed prior SNR;
when the probability value is greater than a set threshold, judging the current frame to be a voiced frame, and collecting consecutive voiced-frame signals to obtain voiced segments;
during a voiced segment, combining the current frame and the previous 5 frames into a sub-slice and computing the linear-prediction coefficients of the sub-slice; Fourier-transforming the linear-prediction coefficients to obtain the power-spectrum magnitude response of the sub-slice's linear-prediction model; normalizing the magnitude response, searching for the frequency bin of the spectral peak, and obtaining the formant frequency and formant width of the sub-slice;
classifying each sub-slice as bird sound or noise according to its formant frequency and formant width, and counting the bird-sound and noise sub-slices in the voiced segment; judging whether the segment contains bird sound by comparing the number of bird-sound sub-slices with the number of noise sub-slices, and if so, storing the segment containing bird sound.
Further, classifying each sub-slice as bird sound or noise according to the formant frequency and formant width, and judging whether bird sound is present by comparing the numbers of bird-sound and noise sub-slices in the segment, storing the segment if so, comprises:
when the formant frequency is greater than 1.5 kHz, the sub-slice is judged to be a bird-sound sub-slice;
when the formant frequency is less than 400 Hz, the sub-slice is judged to be a noise sub-slice;
when the formant frequency is between 400 Hz and 1.5 kHz, the sub-slice is analyzed by its formant width: if the formant width is less than 500 Hz, the sub-slice is judged to be a bird-sound sub-slice, otherwise a noise sub-slice;
the numbers of bird-sound and noise sub-slices in the voiced segment are counted; if the bird-sound sub-slices outnumber the noise sub-slices, the segment contains bird sound and is stored.
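The classification rule above maps directly onto a small decision function. The sketch below (Python, with illustrative function names; the thresholds are the ones stated in the text) classifies one sub-slice and then applies the majority test to a voiced segment:

```python
def classify_subslice(formant_freq_hz, formant_width_hz):
    """Classify one sub-slice as 'bird' or 'noise' using the
    formant-frequency / formant-width thresholds given in the text."""
    if formant_freq_hz > 1500.0:
        return "bird"
    if formant_freq_hz < 400.0:
        return "noise"
    # 400 Hz .. 1.5 kHz: fall back to the formant width
    return "bird" if formant_width_hz < 500.0 else "noise"

def segment_has_bird(subslices):
    """A voiced segment is kept when bird sub-slices outnumber noise ones.
    `subslices` is a list of (formant_freq_hz, formant_width_hz) pairs."""
    labels = [classify_subslice(f, w) for f, w in subslices]
    return labels.count("bird") > labels.count("noise")
```

Note that a tie (equal counts) does not keep the segment, since the text requires the bird-sound sub-slices to be strictly greater in number.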
Further, the HBank filter bank is parameterized as follows: a filter is built centered at frequency F_C, with M_L filters to its left and M_H filters to its right, for a total of (M_L + 1 + M_H) filters covering the linear frequency range F_L to F_H.
Further, the function expression of the HBank filter bank is:
[The filter-bank function expression appears as an image in the original; see equation (12) in the detailed description.]
the central frequency expression of the HBank filter is as follows:
[The center-frequency expression appears as an image in the original; see equation (8) in the detailed description.]
further, the expression of the smoothed a priori snr estimate is:
ζH(λ,b)=αζ×ζH(λ-1,b)+(1-αζ)×ξH(λ,b)。
further, the formula for calculating the prior probability of the voiced frame is as follows:
[The voiced-frame prior-probability formula appears as an image in the original; see equation (22) in the detailed description.]
further, the audio signal is preprocessed, specifically: and carrying out sub-band separation, frame shift, windowing and Fourier transform on the audio signal to obtain a power spectrum of the signal with noise.
An embodiment of the present invention further provides a robust apparatus for extracting a birdsound, including:
the preprocessing module is used for preprocessing the collected audio signals in the target range to obtain a power spectrum of the signal with noise, smoothing the power spectrum of the signal with noise and obtaining a noise power spectrum estimation through a minimum search method;
the HBank filter bank module is used for inputting the power spectrum of the signal with noise and the noise power spectrum estimation to a preset HBank filter bank respectively to obtain the power spectrum of the signal with noise in the H domain and the noise power spectrum estimation in the H domain; wherein, the correlation domain of the HBank filter bank is the H domain;
the voiced frame processing module is used for obtaining the prior signal-to-noise ratio estimation of an H domain according to the posterior signal-to-noise ratio and a guide decision method, and smoothing the prior signal-to-noise ratio to obtain a smooth prior signal-to-noise ratio estimation; obtaining the prior probability of the sound frame according to the average value of the smooth prior signal-to-noise ratio; when the probability value is larger than a set threshold value, the current frame is judged as a voiced frame, and continuous voiced frame signals are collected to obtain voiced segments;
a birdsound fragment screening module to: during the period of the voiced segment, combining the current frame and the previous 5 frames into a sub-slice, and calculating the linear prediction coefficient of the sub-slice; fourier transform is carried out on the linear prediction coefficient to obtain the power spectrum amplitude response of the linear prediction model of the sub-slice; normalizing the power spectrum amplitude response, searching a frequency point corresponding to a spectrum peak, and acquiring the formant frequency and the formant width of the sub-slice; according to the formant frequency and the formant width, classifying the bird sound or noise of each sub-piece, and counting the number of the bird sound sub-pieces and the noise sub-pieces in the voiced segment; and judging whether the sound segment has the bird sound or not by comparing the number of the bird sound sub-pieces with the number of the noise sub-pieces, and if so, storing the sound segment with the bird sound.
Further, the bird-sound segment screening module is specifically configured to: during a voiced segment, classify each sub-slice as bird sound or noise according to the formant frequency and formant width, judge whether bird sound is present by comparing the numbers of bird-sound and noise sub-slices in the segment, and store the segment if so, comprising:
when the formant frequency is greater than 1.5 kHz, the sub-slice is judged to be a bird-sound sub-slice;
when the formant frequency is less than 400 Hz, the sub-slice is judged to be a noise sub-slice;
when the formant frequency is between 400 Hz and 1.5 kHz, the sub-slice is analyzed by its formant width: if the formant width is less than 500 Hz, the sub-slice is judged to be a bird-sound sub-slice, otherwise a noise sub-slice;
the numbers of bird-sound and noise sub-slices in the voiced segment are counted; if the bird-sound sub-slices outnumber the noise sub-slices, the segment contains bird sound and is stored.
An embodiment of the present invention also provides a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls a device in which the computer-readable storage medium is located to perform a robust bird-sound extraction method according to any one of claims 1 to 7.
Compared with the prior art, the robust bird-sound extraction method, device, and computer-readable storage medium of the embodiments of the invention perform the steps summarized above: preprocessing and noise-power estimation, H-domain analysis with the HBank filter bank, guided-decision prior-SNR estimation, voiced-frame detection, and formant-based screening of sub-slices. The invention accurately extracts voiced segments and automatically rejects some human voices and other animal sounds, performs well at low signal-to-noise ratio, has low algorithmic complexity and strong real-time performance, and can be applied in a bird-sound acquisition system.
Drawings
Fig. 1 is a schematic flow chart of a robust birdsong extraction method according to a first embodiment of the present invention;
fig. 2 is a detailed flowchart of a robust birdsong extraction method according to a first embodiment of the present invention;
fig. 3 is a schematic frequency domain distribution diagram of an HBank filter in the robust birdsong extraction method according to the first embodiment of the present invention;
fig. 4 is a schematic structural diagram of a robust apparatus for extracting birdsound according to a second embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment of the present invention:
please refer to fig. 1-3.
As shown in fig. 1, a robust birdsong extraction method according to a preferred embodiment of the present invention is suitable for being executed in a computer device, and includes at least the following steps:
s101, preprocessing an acquired audio signal in a target range to obtain a power spectrum of a signal with noise, smoothing the power spectrum of the signal with noise, and obtaining noise power spectrum estimation by a minimum search method;
s102, inputting the power spectrum of the signal with noise and the noise power spectrum estimation to a preset HBank filter bank respectively to obtain a power spectrum of the signal with noise in an H domain and a noise power spectrum estimation in the H domain, and obtaining a posterior signal-to-noise ratio according to the power spectrum of the signal with noise in the H domain and the noise power spectrum estimation in the H domain; wherein, the correlation domain of the HBank filter bank is the H domain;
s103, obtaining prior signal-to-noise ratio estimation of an H domain according to the posterior signal-to-noise ratio and a guide decision method; smoothing the prior signal-to-noise ratio to obtain a smoothed prior signal-to-noise ratio estimate;
s104, obtaining the prior probability of the sound frame according to the average value of the smooth prior signal-to-noise ratio; when the probability value is larger than a set threshold value, the current frame is judged as a voiced frame, and continuous voiced frame signals are collected to obtain voiced segments;
s105, during the period of the voiced segment, combining the current frame and the previous 5 frames into a sub-slice, and calculating the linear prediction coefficient of the sub-slice; fourier transform is carried out on the linear prediction coefficient to obtain the power spectrum amplitude response of the linear prediction model of the sub-slice; normalizing the power spectrum amplitude response, searching a frequency point corresponding to a spectrum peak, and acquiring the formant frequency and the formant width of the sub-slice;
s106, classifying the bird sound or noise of each sub-piece according to the formant frequency and the formant width, and counting the number of the bird sound sub-pieces and the noise sub-pieces in the voiced segment; and judging whether the sound segment has the bird sound or not by comparing the number of the bird sound sub-pieces with the number of the noise sub-pieces, and if so, storing the sound segment with the bird sound.
For step S101, the collected audio signal within the target range is preprocessed to obtain the noisy-signal power spectrum, the power spectrum is smoothed, and a noise power spectrum estimate is obtained by the minimum-value search method. Specifically:
Sound signals within a certain range are collected through a microphone at a sampling rate of 32 kHz with 16-bit quantization. Each frame is 10 ms long (320 samples) and is denoted y_in(λ), where λ is the frame number.
Since most of the energy of bird song is concentrated below 8 kHz, the high-frequency sub-band carries little information for distinguishing bird sound from noise. The signal y_in(λ) is therefore split by a quadrature-mirror filter into a low-frequency sub-band y_l(λ) and a high-frequency sub-band y_h(λ). Subsequent steps analyze only the low-frequency sub-band y_l(λ), with sampling rate F_S = 16000 and N_l = 160 samples per frame.
The current low-band frame y_l(λ, n) is stacked with the previous frame to obtain y(λ, n), see equation (1), where N_F is the total length after frame stacking, usually an integer power of 2, here N_F = 256; n is the sample index, n = 0, 1, …, N_F − 1.
y(λ, n) = y_l(λ − 1, n + 2N_l − N_F), for 0 ≤ n < N_F − N_l;  y(λ, n) = y_l(λ, n − (N_F − N_l)), for N_F − N_l ≤ n < N_F  (1)
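Under the assumption that the analysis buffer keeps the last N_F − N_l samples of the previous low-band frame followed by the current frame (the exact arrangement of equation (1) is rendered as an image in the original), the frame stacking can be sketched as:

```python
N_L = 160   # new samples per frame (10 ms at 16 kHz)
N_F = 256   # analysis length after stacking (an integer power of 2)

def stack_frame(prev_frame, cur_frame):
    """Form the analysis buffer y(lambda, n): the last N_F - N_L samples
    of the previous low-band frame followed by the current frame.
    The overlap arrangement is an assumption, not reproduced from (1)."""
    assert len(prev_frame) == N_L and len(cur_frame) == N_L
    return list(prev_frame[-(N_F - N_L):]) + list(cur_frame)
```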
The signal y(λ, n) is windowed with the mixed flat-top/Hanning window function w(n) of equation (2), an N_F-point Fourier transform is performed, and the noisy-signal magnitude spectrum Y(λ, k) is obtained by taking the modulus of the transform, see equation (3), where k is the frequency-bin index, k = 0, 1, …, N_F − 1.
[Equation (2), the window function w(n), appears as an image in the original and is not reproduced here.]
Y(λ, k) = | Σ_{n=0}^{N_F−1} w(n) · y(λ, n) · e^{−j2πnk/N_F} |  (3)
Due to the symmetry of the Fourier transform, only the first N frequency bins of the spectrum are analyzed, where N = N_F/2 + 1; the default value of N is 129.
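The windowing and magnitude-spectrum step can be sketched as follows. The mixed flat-top/Hanning window of equation (2) is not reproduced in the original, so a plain Hann window stands in for it here, and a naive DFT is used for clarity rather than speed:

```python
import cmath
import math

def magnitude_spectrum(y):
    """Window y with a plain Hann window (a stand-in for the mixed
    flat-top/Hanning window of equation (2), which is not reproduced)
    and return |DFT| at the first len(y)/2 + 1 bins (equation (3))."""
    n_f = len(y)
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_f - 1)) for n in range(n_f)]
    yw = [yi * wi for yi, wi in zip(y, w)]
    n_half = n_f // 2 + 1          # symmetry: keep only the first N bins
    spec = []
    for k in range(n_half):
        acc = sum(yw[n] * cmath.exp(-2j * math.pi * n * k / n_f)
                  for n in range(n_f))
        spec.append(abs(acc))
    return spec
```

In a real implementation an FFT would replace the O(N^2) loop; the output convention (N = N_F/2 + 1 bins) matches the text.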
The noisy-signal power spectrum Y²(λ, k) is smoothed within the frame to obtain S′(λ, k), see equation (4), where W is a normalized Hanning window function with window length (2N_W + 1), 1 ≤ N_W ≤ 5; the default N_W is 1, i.e. W = [0.25, 0.5, 0.25].
S′(λ, k) = Σ_{i=−N_W}^{N_W} W(i) · Y²(λ, k − i)  (4)
Then S′(λ, k) is smoothed between frames to obtain S(λ, k), see equation (5), where α_S is the inter-frame smoothing factor, 0 < α_S < 1; the default α_S is 0.8.
S(λ,k)=αS×S(λ-1,k)+(1-αS)×S′(λ,k)k=0,1,…,N-1 (5)
Using the minimum-value search method proposed by Rainer Martin, the minimum S_min(λ, k) is updated after every R successive estimates of the smoothed power spectral density S(λ, k), see equation (6), where min{·} is the minimum operator.
S_min(λ, k) = min{ S(λ′, k) | λ′ = λ − R + 1, …, λ }  (6)
The noise power spectrum estimate D²(λ, k) is obtained by the smoothed update of equation (7), in which [α_m · S_min(λ, k)] is the noise decision threshold, 2 < α_m < 8, with default α_m = 5; α_D is a weighting factor, α_D = max{0.03, 1/(λ + 1)}. As the number of frames increases, the change in the noise power spectrum D²(λ, k) gradually stabilizes.
D²(λ, k) = (1 − α_D) · D²(λ − 1, k) + α_D · Y²(λ, k), if Y²(λ, k) < α_m · S_min(λ, k);  D²(λ, k) = D²(λ − 1, k), otherwise  (7)
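A per-bin sketch of the smoothing and minimum-statistics tracking described above. The exact update rule of equation (7) is rendered as an image in the original, so the noise update below is an assumed conventional form, and R = 8 is an illustrative window length:

```python
import collections

ALPHA_S = 0.8   # inter-frame smoothing factor (default from the text)
ALPHA_M = 5.0   # noise decision threshold factor (default from the text)
R = 8           # minimum-search window length (illustrative value)

class MinStatNoiseTracker:
    """Noise-floor tracking for one frequency bin in the spirit of Rainer
    Martin's minimum-statistics method: smooth the noisy power between
    frames, keep the minimum over the last R smoothed estimates, and
    update the noise PSD only when the current power is close to that
    minimum (a sketch; not the patent's exact equation (7))."""
    def __init__(self):
        self.s_prev = None
        self.history = collections.deque(maxlen=R)
        self.noise = None
        self.frame = 0

    def update(self, power):
        # inter-frame smoothing, equation (5)
        s = power if self.s_prev is None else (
            ALPHA_S * self.s_prev + (1 - ALPHA_S) * power)
        self.s_prev = s
        self.history.append(s)
        s_min = min(self.history)           # equation (6) over the window
        alpha_d = max(0.03, 1.0 / (self.frame + 1))
        self.frame += 1
        if self.noise is None:
            self.noise = s
        elif power < ALPHA_M * s_min:        # looks noise-like: track it
            self.noise = (1 - alpha_d) * self.noise + alpha_d * power
        return self.noise
```

A loud transient (e.g. a bird call) exceeds the threshold α_m·S_min and leaves the noise estimate untouched, which is the point of the minimum search.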
For step S102, the HBank filter bank is preset as follows:
Based on the energy-spectrum distribution of actual bird sounds, an HBank filter bank with the frequency-domain distribution shown in FIG. 3 is established; its associated domain is called the H domain for short. A filter is built centered at frequency F_C, with M_L filters to its left and M_H filters to its right, giving (M_L + 1 + M_H) filters that cover the linear frequency range F_L to F_H. These parameters must be set in advance. For sound collection that does not target a specific bird, 200 ≤ F_L < F_C < F_H ≤ 8000, 2 < M_L < 12, 2 < M_H < 12, generally set to F_C = 3500, F_L = 200, F_H = 8000, M_L = 8, M_H = 5; for sound collection targeting a specific bird, the parameters are adjusted to the actual spectral distribution of its song, which gives better pickup in complex noise environments. The relation between filter index b and center frequency f_C(b) is given by equation (8), where s adjusts the dispersion of adjacent filters, 0.7 < s < 1.5 with a default of 1.2: when s > 1 the filters crowd toward the center frequency, and when s < 1 they spread toward the two sides.
[Equation (8), the center-frequency spacing rule, appears as an image in the original and is not reproduced here.]
In the HBank filter bank, the upper limit frequency f_H(b) of filter b equals the center frequency f_C(b + 1) of filter b + 1, see equation (9); the lower limit frequency f_L(b) equals the center frequency f_C(b − 1) of filter b − 1, see equation (10).
f_H(b) = f_C(b + 1)  (9)
f_L(b) = f_C(b − 1)  (10)
The lower limit frequency f_L(b), center frequency f_C(b), and upper limit frequency f_H(b) of filter b are mapped to the corresponding frequency bins k_L(b), k_C(b), and k_H(b), see equation (11), where ⌈·⌉ denotes rounding up.
k_X(b) = ⌈ f_X(b) · N_F / F_S ⌉, X ∈ {L, C, H}  (11)
Since each HBank filter is triangular, with H(b, k_C(b)) = 1 and H(b, k_L(b)) = H(b, k_H(b)) = 0, the expression H(b, k) for the filter bank follows, see equation (12).
H(b, k) = (k − k_L(b)) / (k_C(b) − k_L(b)), for k_L(b) ≤ k ≤ k_C(b);  H(b, k) = (k_H(b) − k) / (k_H(b) − k_C(b)), for k_C(b) < k ≤ k_H(b);  H(b, k) = 0, otherwise  (12)
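A sketch of the filter-bank construction: bins are mapped by rounding up as in equation (11), and each filter is triangular with its edges at the neighbouring filters' centre frequencies, as equations (9) and (10) require. Since the centre-frequency spacing rule of equation (8) is rendered as an image in the original, the caller supplies the centre frequencies here:

```python
import math

F_S = 16000    # low-band sampling rate
N_F = 256      # FFT length

def bin_of(freq_hz):
    """Map a frequency to its DFT bin, rounding up (equation (11))."""
    return math.ceil(freq_hz * N_F / F_S)

def triangular_filter(f_lo, f_c, f_hi, n_bins):
    """One triangular HBank filter H(b, k): 1 at the centre bin, 0 at the
    neighbouring filters' centre bins."""
    k_l, k_c, k_h = bin_of(f_lo), bin_of(f_c), bin_of(f_hi)
    h = [0.0] * n_bins
    for k in range(k_l, k_h + 1):
        if k <= k_c:
            h[k] = (k - k_l) / (k_c - k_l)
        else:
            h[k] = (k_h - k) / (k_h - k_c)
    return h

def build_hbank(centres, n_bins=N_F // 2 + 1):
    """Build the bank: each filter's lower/upper edges are the centres of
    its neighbours, so `centres` must be sorted and include one extra
    frequency at each end. The spacing of the centres themselves
    (equation (8)) is left to the caller."""
    return [triangular_filter(centres[b - 1], centres[b], centres[b + 1], n_bins)
            for b in range(1, len(centres) - 1)]
```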
For step S102, the noisy-signal power spectrum and the noise power spectrum estimate are each input to the preset HBank filter bank to obtain the H-domain noisy-signal power spectrum and the H-domain noise power spectrum estimate, where the associated domain of the HBank filter bank is the H domain. Specifically, the noisy-signal power spectrum Y²(λ, k) and the noise power spectrum estimate D²(λ, k) are weighted by the filter function H(b, k) to obtain Y²_H(λ, b) and D²_H(λ, b), see equations (13) and (14).
Y²_H(λ, b) = Σ_{k=0}^{N−1} H(b, k) · Y²(λ, k)  (13)
D²_H(λ, b) = Σ_{k=0}^{N−1} H(b, k) · D²(λ, k)  (14)
In the H domain, the posterior signal-to-noise ratio γ_H(λ, b) is obtained from Y²_H(λ, b) and D²_H(λ, b), see equation (15).
γ_H(λ, b) = Y²_H(λ, b) / D²_H(λ, b)  (15)
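Projecting the power spectra into the H domain and forming the posterior SNR then reduces to a weighted sum per filter followed by a ratio per band, as sketched below (illustrative function names):

```python
def to_h_domain(power_spec, hbank):
    """Project a linear-frequency power spectrum onto the H domain:
    a weighted sum under each triangular filter (equations (13)/(14))."""
    return [sum(h_k * p_k for h_k, p_k in zip(h, power_spec)) for h in hbank]

def posterior_snr(noisy_pow_h, noise_pow_h, eps=1e-12):
    """Posterior SNR per H-domain band (equation (15)); eps guards
    against division by zero in silent bands."""
    return [y / max(d, eps) for y, d in zip(noisy_pow_h, noise_pow_h)]
```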
For step S103, the H-domain prior signal-to-noise ratio estimate is obtained from the posterior signal-to-noise ratio by the guided-decision method, and the prior SNR is smoothed to obtain a smoothed prior SNR estimate. Specifically:
In the H domain, the prior SNR estimate ξ_H(λ, b) is obtained by the guided-decision (decision-directed) method, see equation (16).
ξ_H(λ, b) = (1 − α_H(λ, b)) · X̂²_H(λ − 1, b) / D²_H(λ, b) + α_H(λ, b) · max{ γ_H(λ, b) − 1, 0 }  (16)
where X̂_H(λ − 1, b) is the previous frame's clean power-spectrum estimate in the H domain, whose magnitude-spectrum estimate is obtained from equation (18), and α_H(λ, b) is a weight-adjustment factor obtained from equation (17).
[Equation (17) appears as an image in the original and is not reproduced here.]
α_h is a constant, 0 < α_h < 1, with a default of 0.1. When the instantaneous signal-to-noise ratio is larger, the weight α_H(λ, b) of the current SNR estimate is increased, which makes the estimate of ξ_H(λ, b) more accurate.
Obtaining a u power estimator X of a pure amplitude spectrum according to a minimum mean square error estimation criterionH(lambda, b) is shown in formula (18).
Figure BDA0002481329450000124
Wherein u is a power exponent of the amplitude spectrum estimation, u is more than or equal to 0.1 and less than or equal to 2, and a default value of u is 0.5; (. cndot.) is a gamma function, and the calculation formula is shown in formula (19); phi (·) is a confluent hypergeometric function, and the calculation formula is shown in formula (20). The two functions have higher operation complexity, and the operation can be simplified by using an approximate function in practical engineering application.
Figure BDA0002481329450000125
Figure BDA0002481329450000126
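Formulas (19) and (20) themselves appear only as images in the patent, but both special functions are standard: Γ(·) is available as `math.gamma` in the Python standard library, and Kummer's confluent hypergeometric function Φ(a, b; z) can be evaluated from its power series. The stdlib-only sketch below (the fixed truncation of the series is an assumption) illustrates the kind of approximation the text alludes to:

```python
import math

def confluent_hypergeometric(a, b, z, terms=80):
    """Kummer's function Phi(a, b; z) = sum_n [(a)_n / (b)_n] * z^n / n!,
    evaluated by truncating the power series after `terms` terms.
    (a)_n denotes the rising factorial a(a+1)...(a+n-1)."""
    total, term = 1.0, 1.0
    for n in range(terms):
        # Each term is the previous one times (a+n)/(b+n) * z/(n+1).
        term *= (a + n) / (b + n) * z / (n + 1)
        total += term
    return total

# Sanity identities: Phi(a, b; 0) = 1 and Phi(1, 1; z) = exp(z).
```

For moderate |z| the truncated series converges quickly; for the small arguments that arise in the gain computation this is usually adequate, which is one way to realize the "approximate function" simplification mentioned above.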
The a priori signal-to-noise ratio ξH(λ, b) is smoothed across frames to obtain the smoothed a priori signal-to-noise ratio ζH(λ, b), see formula (21), where αζ is the inter-frame smoothing factor, 0 < αζ < 1, with a default value of 0.7.
ζH(λ,b)=αζ×ζH(λ-1,b)+(1-αζ)×ξH(λ,b) (21)
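Formula (21) is a first-order recursive filter across frames. A minimal Python sketch for one sub-band b (the function name and zero initial state are assumptions):

```python
def smooth_prior_snr(xi_frames, alpha_zeta=0.7):
    """Inter-frame smoothing of the a priori SNR per formula (21):
    zeta(l) = alpha_zeta * zeta(l-1) + (1 - alpha_zeta) * xi(l),
    applied to the sequence of per-frame values of one sub-band."""
    zeta, out = 0.0, []
    for xi in xi_frames:
        zeta = alpha_zeta * zeta + (1.0 - alpha_zeta) * xi
        out.append(zeta)
    return out
```

With the default αζ = 0.7, the smoothed estimate tracks slow trends in the SNR while suppressing frame-to-frame fluctuation.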
For step S104, the prior probability of a voiced frame is obtained from the mean of the smoothed a priori signal-to-noise ratio; when the probability value is greater than a set threshold, the current frame is judged to be a voiced frame, and consecutive voiced frame signals are collected into a voiced segment. Specifically:
The mean of the smoothed a priori signal-to-noise ratio ζH(λ, b) is substituted into formula (22) to obtain the prior probability pH(λ) of a voiced frame.
Figure BDA0002481329450000131
When the probability value pH(λ) is greater than the set threshold PH, the frame is judged to be a voiced frame; otherwise it is judged to be a noise frame. The threshold PH needs to be tuned experimentally for the best effect, with 0.2 ≤ PH ≤ 0.8 and a default value of 0.5. The input signals yin(λ) of r consecutive voiced frames are collected into a voiced segment signal, denoted V = {yin(λ-r+1), yin(λ-r+2), …, yin(λ)}.
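The voiced-frame decision and segment collection of step S104 can be sketched as follows, treating the per-frame probability pH(λ) as given and grouping runs of frames above the threshold PH (default 0.5) into segments:

```python
def collect_voiced_segments(probs, frames, p_h=0.5):
    """Group consecutive frames whose voiced-frame probability exceeds p_h
    into voiced segments; each segment is the list of its frame signals."""
    segments, current = [], []
    for p, frame in zip(probs, frames):
        if p > p_h:
            current.append(frame)     # still inside a voiced run
        elif current:
            segments.append(current)  # run ended: emit one voiced segment
            current = []
    if current:
        segments.append(current)      # segment still open at end of input
    return segments
```

Each returned segment corresponds to one set V of r consecutive voiced frames as defined above.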
For step S105, during the voiced segment, the current frame and the previous 5 frames are combined into a sub-slice, and the linear prediction coefficients of the sub-slice are calculated; a Fourier transform of the linear prediction coefficients gives the power spectrum magnitude response of the sub-slice's linear prediction model; the magnitude response is normalized and the frequency bin corresponding to the spectral peak is searched to obtain the formant frequency and formant width of the sub-slice. Specifically:
The decision in step S104 only separates voiced segments from noise segments and does not analyze whether a voiced segment actually contains bird sound, so further screening based on the formant information of the segment is required. The formant information is obtained by linear prediction, but for a longer segment (over 1 second) computing the linear prediction coefficients is expensive and unsuited to real-time processing. A bird sound can form a formant within a small number of samples (6 frames of data) that is similar to the formants of the whole bird sound fragment. The slicing method therefore reduces the computation time of the prediction coefficients, and the formant frequency and formant width of each sub-slice are computed while bird sound detection is performed frame by frame.
In the decision process of step S104, if the current frame is judged to be a voiced frame, the low-frequency sub-band signal yl(λ) of the current frame is combined with the preceding 5 frames to obtain the sub-slice v(λ) = {yl(λ-5), yl(λ-4), …, yl(λ)}; the sub-slice length is 6 × Nl = 960, and a complete voiced segment V corresponds to r sub-slices. For convenience of description, the frame number λ is omitted in the subsequent steps.
The sub-slice v is pre-emphasized, see formula (23), to boost the high-frequency components of the sound signal, where αv is the pre-emphasis coefficient, 0.9 < αv < 1, with a default value of 0.99, and n′ is the sample index of the voiced segment sequence.
v′(n′)=v(n′)-αv·v(n′-1) (23)
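Formula (23) is a standard first-order pre-emphasis filter. A direct sketch (the convention v(-1) = 0 for the first sample is an assumption):

```python
def pre_emphasis(v, alpha_v=0.99):
    """Pre-emphasis per formula (23): v'(n) = v(n) - alpha_v * v(n-1),
    boosting the high-frequency components of the sub-slice.
    The first sample is passed through unchanged (v(-1) taken as 0)."""
    return [v[0]] + [v[n] - alpha_v * v[n - 1] for n in range(1, len(v))]
```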
The pre-emphasized sub-slice v′ is modeled by a linear prediction model of order p, see formula (24), where a1, a2, …, ap are the linear prediction coefficients and e(n′) is the linear prediction error.
Figure BDA0002481329450000141
According to formula (25), the power spectrum magnitude response T(k′) of the linear prediction model, i.e. the spectral envelope of the vocal tract signal, is obtained by an NT-point fast Fourier transform, where NT is the number of Fourier transform points, typically an integer power of 2, with a default value of 256.
Figure BDA0002481329450000142
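Formulas (24)-(25) amount to fitting an order-p all-pole model and reading its spectral envelope from the transform of the coefficient vector. A stdlib-only sketch using the autocorrelation method with the Levinson-Durbin recursion; the function names are assumptions, and for clarity the envelope is computed by a direct DFT rather than a fast transform:

```python
import cmath
import math

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for the order-p prediction
    coefficients a = [1, a1, ..., ap] from autocorrelation lags r[0..p]."""
    a = [1.0] + [0.0] * p
    err = r[0]
    for m in range(1, p + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err                    # reflection coefficient
        a_prev = a[:]
        for j in range(1, m):
            a[j] = a_prev[j] + k * a_prev[m - j]
        a[m] = k
        err *= (1.0 - k * k)              # prediction error update
    return a

def lpc_envelope(frame, p=12, n_fft=256):
    """Power spectrum magnitude response of the LPC model (sketch of
    formulas (24)-(25)): the envelope is proportional to 1 / |A(e^jw)|,
    sampled at the n_fft // 2 + 1 non-negative frequency bins."""
    r = [sum(frame[n] * frame[n + k] for n in range(len(frame) - k))
         for k in range(p + 1)]
    a = levinson_durbin(r, p)
    return [1.0 / abs(sum(a[i] * cmath.exp(-2j * math.pi * kbin * i / n_fft)
                          for i in range(p + 1)))
            for kbin in range(n_fft // 2 + 1)]
```

For an AR(1)-like autocorrelation r = [1, 0.9, 0.81] the recursion recovers a1 = -0.9 exactly, and for a sinusoidal frame the envelope peaks at the bin of the sinusoid's frequency.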
The power spectrum magnitude response T(k′) of the linear prediction model is normalized to obtain T′(k′), see formula (26), where max{·} is the maximum operator.
Figure BDA0002481329450000143
The frequency bin kT corresponding to the maximum of the normalized power spectrum magnitude response T′(k′) is searched, see formula (27), and converted by formula (28) to obtain the formant frequency fT.
Figure BDA0002481329450000144
Figure BDA0002481329450000145
In the normalized power spectrum magnitude response T′(k′), it is known that T′(kT) = 1. Centered at the frequency bin kT, a bin kTL is searched to the left such that T′(kTL-1) ≤ 0.2, and a bin kTH is searched to the right such that T′(kTH+1) ≤ 0.2. The bins kTL and kTH are converted by formula (29) into the formant lower-limit frequency fTL and upper-limit frequency fTH; their difference is the formant width ΔfT, see formula (30).
Figure BDA0002481329450000146
ΔfT = fTH - fTL  (30)
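Formulas (26)-(30) reduce to a peak search plus a left/right walk down to the 0.2 level of the normalized envelope. A sketch, assuming the usual bin-to-Hz conversion f = k · fs / NT for formulas (28)-(29):

```python
def formant_from_envelope(T, n_fft, fs):
    """Formant frequency and width from an LPC spectral envelope T
    (sketch of formulas (26)-(30))."""
    m = max(T)
    Tn = [t / m for t in T]                         # normalization, formula (26)
    kT = max(range(len(Tn)), key=lambda i: Tn[i])   # spectral peak bin, formula (27)
    fT = kT * fs / n_fft                            # bin -> Hz, formula (28)
    kTL = kT
    while kTL >= 1 and Tn[kTL - 1] > 0.2:           # walk left until T'(kTL-1) <= 0.2
        kTL -= 1
    kTH = kT
    while kTH + 1 < len(Tn) and Tn[kTH + 1] > 0.2:  # walk right until T'(kTH+1) <= 0.2
        kTH += 1
    fTL, fTH = kTL * fs / n_fft, kTH * fs / n_fft   # formula (29)
    return fT, fTH - fTL                            # formant width, formula (30)
```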
For step S106, each sub-slice is classified as bird sound or noise according to its formant frequency and formant width, and the numbers of bird sound sub-slices and noise sub-slices within the voiced segment are counted; whether the voiced segment contains bird sound is judged by comparing the number of bird sound sub-slices with the number of noise sub-slices, and if so, the segment containing bird sound is stored. Specifically:
Classification counters Cb and Cn are set to record the number of bird sound sub-slices and the number of noise sub-slices respectively; both counters are cleared at the initial frame of the voiced segment and count until the segment ends. When the formant frequency fT of the sub-slice v is greater than 1.5 kHz, the sub-slice is judged to be a bird sound sub-slice and the bird sound counter Cb is incremented by one; when fT is less than 400 Hz, it is judged to be a noise sub-slice and the noise counter Cn is incremented by one. If fT lies between 400 Hz and 1.5 kHz, the formant width ΔfT is also used for the decision: when ΔfT is less than 500 Hz, the sub-slice is judged to be a bird sound sub-slice and Cb is incremented by one; otherwise it is judged to be a noise sub-slice and Cn is incremented by one.
At the end of the current voiced segment, r sub-slice classification results have been obtained in total, so that Cb + Cn = r. If the number of bird sound sub-slices Cb is greater than the number of noise sub-slices Cn, the voiced segment V is regarded as a segment containing bird sound and is stored; otherwise it contains no bird sound information and is discarded.
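The decision rules of step S106 and the majority vote over the segment can be sketched directly; the thresholds are those stated in the text (1.5 kHz, 400 Hz, 500 Hz):

```python
def classify_subslice(f_t, df_t):
    """Classify one sub-slice from its formant frequency f_t (Hz) and
    formant width df_t (Hz) per the rules of step S106."""
    if f_t > 1500.0:
        return "bird"
    if f_t < 400.0:
        return "noise"
    # 400 Hz <= f_t <= 1.5 kHz: decide by the formant width.
    return "bird" if df_t < 500.0 else "noise"

def segment_has_bird(subslices):
    """Majority vote over the (f_t, df_t) pairs of one voiced segment:
    keep the segment when Cb > Cn."""
    c_b = sum(1 for f_t, df_t in subslices
              if classify_subslice(f_t, df_t) == "bird")
    return c_b > len(subslices) - c_b
```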
The embodiment of the invention provides a robust bird sound extraction method, which comprises: preprocessing the collected audio signal within the target range to obtain the noisy-signal power spectrum, smoothing the noisy-signal power spectrum, and obtaining a noise power spectrum estimate by a minimum search method; inputting the noisy-signal power spectrum and the noise power spectrum estimate respectively into a preset HBank filter bank to obtain the H-domain noisy-signal power spectrum and the H-domain noise power spectrum estimate, and obtaining the posterior signal-to-noise ratio from them, wherein the correlation domain of the HBank filter bank is the H domain; obtaining the a priori signal-to-noise ratio estimate of the H domain from the posterior signal-to-noise ratio by the decision-directed method, and smoothing it to obtain a smoothed a priori signal-to-noise ratio estimate; obtaining the prior probability of a voiced frame from the mean of the smoothed a priori signal-to-noise ratio; when the probability value is greater than a set threshold, judging the current frame to be a voiced frame and collecting consecutive voiced frame signals into voiced segments; during a voiced segment, combining the current frame and the previous 5 frames into a sub-slice and calculating its linear prediction coefficients; Fourier-transforming the linear prediction coefficients to obtain the power spectrum magnitude response of the sub-slice's linear prediction model; normalizing the magnitude response and searching the frequency bin of the spectral peak to obtain the formant frequency and formant width of the sub-slice; classifying each sub-slice as bird sound or noise according to the formant frequency and formant width, and counting the numbers of bird sound and noise sub-slices in the voiced segment; and judging whether the voiced segment contains bird sound by comparing the two counts, storing the segment containing bird sound if so. The invention can accurately extract voiced segments and automatically reject some human voices and other animal sounds, performs well under low signal-to-noise ratio, has low algorithm complexity and strong real-time capability, and can be applied to a bird sound acquisition system.
Second embodiment of the invention:
please refer to fig. 4.
As shown in fig. 4, the present embodiment further provides a robust bird sound extraction apparatus, comprising:
the preprocessing module 201 is configured to preprocess the acquired audio signal within the target range to obtain a power spectrum of the signal with noise, smooth the power spectrum of the signal with noise, and obtain a noise power spectrum estimation by a minimum search method;
the HBank filter bank module 202 is configured to input the power spectrum of the noisy signal and the power spectrum estimation of the noise to a preset HBank filter bank, so as to obtain a power spectrum of the noisy signal in the H domain and a power spectrum estimation of the noise in the H domain; wherein, the correlation domain of the HBank filter bank is the H domain;
the voiced frame processing module 203 is configured to obtain the a priori signal-to-noise ratio estimate of the H domain from the posterior signal-to-noise ratio by the decision-directed method, and to smooth the a priori signal-to-noise ratio to obtain a smoothed a priori signal-to-noise ratio estimate; to obtain the prior probability of a voiced frame from the mean of the smoothed a priori signal-to-noise ratio; and, when the probability value is greater than a set threshold, to judge the current frame as a voiced frame and collect consecutive voiced frame signals into voiced segments;
a bird sound fragment screening module 204 to: during the period of the voiced segment, combining the current frame and the previous 5 frames into a sub-slice, and calculating the linear prediction coefficient of the sub-slice; fourier transform is carried out on the linear prediction coefficient to obtain the power spectrum amplitude response of the linear prediction model of the sub-slice; normalizing the power spectrum amplitude response, searching a frequency point corresponding to a spectrum peak, and acquiring the formant frequency and the formant width of the sub-slice; according to the formant frequency and the formant width, classifying the bird sound or noise of each sub-piece, and counting the number of the bird sound sub-pieces and the noise sub-pieces in the voiced segment; and judging whether the sound segment has the bird sound or not by comparing the number of the bird sound sub-pieces with the number of the noise sub-pieces, and if so, storing the sound segment with the bird sound.
The bird sound fragment screening module 204 is specifically configured to: classify each sub-slice as bird sound or noise according to the formant frequency and formant width, judge whether bird sound exists by comparing the number of bird sound sub-slices in the voiced segment with the number of noise sub-slices, and if so, store the segment, as follows:
when the resonance peak frequency is more than 1.5kHz, the sub-piece is judged as a bird sound sub-piece;
when the formant frequency is less than 400Hz, the sub-sheet is judged as a noise sub-sheet;
when the formant frequency is between 400 Hz and 1.5 kHz, the sub-slice needs to be analyzed according to the formant width; if the formant width is less than 500 Hz, the sub-slice is judged to be a bird sound sub-slice, otherwise it is judged to be a noise sub-slice;
and counting the number of the bird sound sub-pieces and the number of the noise sub-pieces in the sound section, if the number of the bird sound sub-pieces is greater than the number of the noise sub-pieces, the sound section contains bird sounds, and the sound section is stored.
The embodiment of the invention provides a robust bird sound extraction apparatus, which: preprocesses the collected audio signal within the target range to obtain the noisy-signal power spectrum, smooths the noisy-signal power spectrum, and obtains a noise power spectrum estimate by a minimum search method; inputs the noisy-signal power spectrum and the noise power spectrum estimate respectively into a preset HBank filter bank to obtain the H-domain noisy-signal power spectrum and the H-domain noise power spectrum estimate, and obtains the posterior signal-to-noise ratio from them, the correlation domain of the HBank filter bank being the H domain; obtains the a priori signal-to-noise ratio estimate of the H domain from the posterior signal-to-noise ratio by the decision-directed method, and smooths it to obtain a smoothed a priori signal-to-noise ratio estimate; obtains the prior probability of a voiced frame from the mean of the smoothed a priori signal-to-noise ratio; when the probability value is greater than a set threshold, judges the current frame to be a voiced frame and collects consecutive voiced frame signals into voiced segments; during a voiced segment, combines the current frame and the previous 5 frames into a sub-slice and calculates its linear prediction coefficients; Fourier-transforms the linear prediction coefficients to obtain the power spectrum magnitude response of the sub-slice's linear prediction model; normalizes the magnitude response and searches the frequency bin of the spectral peak to obtain the formant frequency and formant width of the sub-slice; classifies each sub-slice as bird sound or noise according to the formant frequency and formant width, and counts the numbers of bird sound and noise sub-slices in the voiced segment; and judges whether the voiced segment contains bird sound by comparing the two counts, storing the segment containing bird sound if so. The invention can accurately extract voiced segments and automatically reject some human voices and other animal sounds, performs well under low signal-to-noise ratio, has low algorithm complexity and strong real-time capability, and can be applied to a bird sound acquisition system.
An embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform a robust birdsound extraction method as described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims (10)

1. A robust extraction method of birdsong, characterized by comprising:
preprocessing the collected audio signal in the target range to obtain a power spectrum of the signal with noise, smoothing the power spectrum of the signal with noise, and obtaining a noise power spectrum estimation by a minimum value search method;
inputting the power spectrum of the signal with noise and the noise power spectrum estimation to a preset HBank filter bank respectively to obtain the power spectrum of the signal with noise in the H domain and the noise power spectrum estimation in the H domain, and obtaining a posterior signal-to-noise ratio according to the power spectrum of the signal with noise in the H domain and the noise power spectrum estimation in the H domain; wherein, the correlation domain of the HBank filter bank is the H domain;
obtaining the a priori signal-to-noise ratio estimate of the H domain from the posterior signal-to-noise ratio by the decision-directed method; smoothing the a priori signal-to-noise ratio to obtain a smoothed a priori signal-to-noise ratio estimate; obtaining the prior probability of a voiced frame from the mean of the smoothed a priori signal-to-noise ratio; when the probability value is greater than a set threshold, judging the current frame to be a voiced frame, and collecting consecutive voiced frame signals to obtain voiced segments;
during the period of the voiced segment, combining the current frame and the previous 5 frames into a sub-slice, and calculating the linear prediction coefficient of the sub-slice; fourier transform is carried out on the linear prediction coefficient to obtain the power spectrum amplitude response of the linear prediction model of the sub-slice; normalizing the power spectrum amplitude response, searching a frequency point corresponding to a spectrum peak, and then obtaining the formant frequency and the formant width of the sub-sheet;
according to the formant frequency and the formant width, classifying the bird sound or noise of each sub-piece, and counting the number of the bird sound sub-pieces and the noise sub-pieces in the voiced segment; and judging whether the sound segment has the bird sound or not by comparing the number of the bird sound sub-pieces with the number of the noise sub-pieces, and if so, storing the sound segment with the bird sound.
2. The robust extraction method of bird sounds according to claim 1, wherein the classifying of bird sounds or noises for each sub-slice according to the formant frequency and formant width, and the comparing of the number of bird sound sub-slices and the number of noise sub-slices in the voiced segment to determine whether bird sounds exist, if yes, the storing of the segment with bird sounds comprises:
when the resonance peak frequency is more than 1.5kHz, the sub-piece is judged as a bird sound sub-piece;
when the formant frequency is less than 400Hz, the sub-sheet is judged as a noise sub-sheet;
when the formant frequency is between 400 Hz and 1.5 kHz, the sub-slice needs to be analyzed according to the formant width; if the formant width is less than 500 Hz, the sub-slice is judged to be a bird sound sub-slice, otherwise it is judged to be a noise sub-slice;
and counting the number of the bird sound sub-pieces and the number of the noise sub-pieces in the sound section, if the number of the bird sound sub-pieces is greater than the number of the noise sub-pieces, the sound section contains bird sounds, and the sound section is stored.
3. The robust bird sound extraction method as claimed in claim 1, wherein the parameters of the HBank filter bank are set as follows: one filter is built centered at frequency FC, with ML filters built on its left side and MH filters on its right side; the (ML+1+MH) filters cover the linear frequency range FL~FH.
4. The robust bird sound extraction method as claimed in claim 1, wherein the functional expression of the HBank filter bank is:
Figure FDA0002481329440000021
the central frequency expression of the HBank filter is as follows:
Figure FDA0002481329440000031
5. The robust bird sound extraction method as claimed in claim 1, wherein the functional expression of the smoothed a priori signal-to-noise ratio is:
ζH(λ,b)=αζ×ζH(λ-1,b)+(1-αζ)×ξH(λ,b)。
6. a robust extraction method of birdsound according to claim 1, wherein the formula for calculating the prior probability of the voiced frames is:
Figure FDA0002481329440000032
7. a robust extraction method of birdsound according to claim 1, characterized in that the audio signal is preprocessed, in particular:
and carrying out sub-band separation, frame shift, windowing and Fourier transform on the audio signal to obtain a power spectrum of the signal with noise.
8. A robust birdsound extraction apparatus, comprising:
the preprocessing module is used for preprocessing the collected audio signals in the target range to obtain a power spectrum of the signal with noise, smoothing the power spectrum of the signal with noise and obtaining a noise power spectrum estimation through a minimum search method;
the HBank filter bank module is used for inputting the power spectrum of the signal with noise and the noise power spectrum estimation to a preset HBank filter bank respectively to obtain the power spectrum of the signal with noise in the H domain and the noise power spectrum estimation in the H domain; wherein, the correlation domain of the HBank filter bank is the H domain;
the voiced frame processing module is used for obtaining the a priori signal-to-noise ratio estimate of the H domain from the posterior signal-to-noise ratio by the decision-directed method, and smoothing the a priori signal-to-noise ratio to obtain a smoothed a priori signal-to-noise ratio estimate; obtaining the prior probability of a voiced frame from the mean of the smoothed a priori signal-to-noise ratio; and, when the probability value is greater than a set threshold, judging the current frame to be a voiced frame and collecting consecutive voiced frame signals into voiced segments;
a birdsound fragment screening module to: during the period of the voiced segment, combining the current frame and the previous 5 frames into a sub-slice, and calculating the linear prediction coefficient of the sub-slice; fourier transform is carried out on the linear prediction coefficient to obtain the power spectrum amplitude response of the linear prediction model of the sub-slice; normalizing the power spectrum amplitude response, searching a frequency point corresponding to a spectrum peak, and acquiring the formant frequency and the formant width of the sub-slice; according to the formant frequency and the formant width, classifying the bird sound or noise of each sub-piece, and counting the number of the bird sound sub-pieces and the noise sub-pieces in the voiced segment; and judging whether the sound segment has the bird sound or not by comparing the number of the bird sound sub-pieces with the number of the noise sub-pieces, and if so, storing the sound segment with the bird sound.
9. The robust extraction of birdsound apparatus according to claim 8, wherein the birdsound fragment filtering module is configured to: classifying the bird sound or noise of each sub-piece according to the formant frequency and the formant width, judging whether the bird sound exists or not by comparing the number of the bird sound sub-pieces in the sound section with the number of the noise sub-pieces, and if so, storing the bird sound sub-pieces, wherein the method comprises the following steps:
when the resonance peak frequency is more than 1.5kHz, the sub-piece is judged as a bird sound sub-piece;
when the formant frequency is less than 400Hz, the sub-sheet is judged as a noise sub-sheet;
when the formant frequency is between 400 Hz and 1.5 kHz, the sub-slice needs to be analyzed according to the formant width; if the formant width is less than 500 Hz, the sub-slice is judged to be a bird sound sub-slice, otherwise it is judged to be a noise sub-slice;
and counting the number of the bird sound sub-pieces and the number of the noise sub-pieces in the sound section, if the number of the bird sound sub-pieces is greater than the number of the noise sub-pieces, the sound section contains bird sounds, and the sound section is stored.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the robust bird sound extraction method as recited in any one of claims 1 to 7.
CN202010379824.XA 2020-05-07 2020-05-07 Stable bird sound extraction method and device and computer readable storage medium Active CN111540368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010379824.XA CN111540368B (en) 2020-05-07 2020-05-07 Stable bird sound extraction method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010379824.XA CN111540368B (en) 2020-05-07 2020-05-07 Stable bird sound extraction method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111540368A true CN111540368A (en) 2020-08-14
CN111540368B CN111540368B (en) 2023-03-14

Family

ID=71977496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010379824.XA Active CN111540368B (en) 2020-05-07 2020-05-07 Stable bird sound extraction method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111540368B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112908344A (en) * 2021-01-22 2021-06-04 广州大学 Intelligent recognition method, device, equipment and medium for bird song
CN113314127A (en) * 2021-04-23 2021-08-27 广州大学 Space orientation-based bird song recognition method, system, computer device and medium
CN113486964A (en) * 2021-07-13 2021-10-08 盛景智能科技(嘉兴)有限公司 Voice activity detection method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060150920A1 (en) * 2005-01-11 2006-07-13 Patton Charles M Method and apparatus for the automatic identification of birds by their vocalizations
CN102708860A (en) * 2012-06-27 2012-10-03 昆明信诺莱伯科技有限公司 Method for establishing judgment standard for identifying bird type based on sound signal
WO2016176887A1 (en) * 2015-05-06 2016-11-10 福州大学 Animal sound identification method based on double spectrogram features
CN108694953A (en) * 2017-04-07 2018-10-23 南京理工大学 A kind of chirping of birds automatic identifying method based on Mel sub-band parameter features
CN110047519A (en) * 2019-04-16 2019-07-23 广州大学 A kind of sound end detecting method, device and equipment
CN110246504A (en) * 2019-05-20 2019-09-17 平安科技(深圳)有限公司 Birds sound identification method, device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060150920A1 (en) * 2005-01-11 2006-07-13 Patton Charles M Method and apparatus for the automatic identification of birds by their vocalizations
CN102708860A (en) * 2012-06-27 2012-10-03 昆明信诺莱伯科技有限公司 Method for establishing judgment standard for identifying bird type based on sound signal
WO2016176887A1 (en) * 2015-05-06 2016-11-10 福州大学 Animal sound identification method based on double spectrogram features
CN108694953A (en) * 2017-04-07 2018-10-23 南京理工大学 A kind of chirping of birds automatic identifying method based on Mel sub-band parameter features
CN110047519A (en) * 2019-04-16 2019-07-23 广州大学 A kind of sound end detecting method, device and equipment
CN110246504A (en) * 2019-05-20 2019-09-17 平安科技(深圳)有限公司 Birds sound identification method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘钊等: "随机森林和大规模声学特征的噪声环境鸟声识别仿真", 《系统仿真技术》 *
徐淑正等: "基于MFCC和时频图等多种特征的综合鸟声识别分类器设计", 《实验室研究与探索》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112908344A (en) * 2021-01-22 2021-06-04 广州大学 Intelligent recognition method, device, equipment and medium for bird song
CN112908344B (en) * 2021-01-22 2023-08-08 广州大学 Intelligent bird song recognition method, device, equipment and medium
CN113314127A (en) * 2021-04-23 2021-08-27 广州大学 Space orientation-based bird song recognition method, system, computer device and medium
CN113314127B (en) * 2021-04-23 2023-10-10 广州大学 Bird song identification method, system, computer equipment and medium based on space orientation
CN113486964A (en) * 2021-07-13 2021-10-08 盛景智能科技(嘉兴)有限公司 Voice activity detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111540368B (en) 2023-03-14

Similar Documents

Publication Publication Date Title
CN106935248B (en) Voice similarity detection method and device
CN111540368B (en) Stable bird sound extraction method and device and computer readable storage medium
US7756700B2 (en) Perceptual harmonic cepstral coefficients as the front-end for speech recognition
KR101266894B1 (en) Apparatus and method for processing an audio signal for speech emhancement using a feature extraxtion
US9364669B2 (en) Automated method of classifying and suppressing noise in hearing devices
CN102930870B (en) Bird voice recognition method using anti-noise power normalization cepstrum coefficients (APNCC)
WO2014153800A1 (en) Voice recognition system
CN103280220A (en) Real-time recognition method for baby cry
CN103646649A (en) High-efficiency voice detecting method
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN108682432B (en) Speech emotion recognition device
CN112397074A (en) Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning
Couvreur et al. Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models
Katsir et al. Evaluation of a speech bandwidth extension algorithm based on vocal tract shape estimation
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
CN111755025B (en) State detection method, device and equipment based on audio features
KR101671305B1 (en) Apparatus for extracting feature parameter of input signal and apparatus for recognizing speaker using the same
Pham et al. Using artificial neural network for robust voice activity detection under adverse conditions
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium
CN113744725B (en) Training method of voice endpoint detection model and voice noise reduction method
CN115331678A (en) Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient
JP4537821B2 (en) Audio signal analysis method, audio signal recognition method using the method, audio signal section detection method, apparatus, program and recording medium thereof
CN111968673A (en) Audio event detection method and system
CN110689875A (en) Language identification method and device and readable storage medium
Fahmeeda et al. Voice Based Gender Recognition Using Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant