CN117724042A - Method and system for positioning bird song sound source based on acoustic bispectrum - Google Patents


Info

Publication number: CN117724042A (granted as CN117724042B)
Application number: CN202410179288.7A
Authority: CN (China)
Inventors: Shu Lu (舒璐), Qin Yefeng (覃业锋)
Assignee (current and original): Bainiao Data Technology Beijing Co ltd
Original language: Chinese (zh)
Legal status: Granted; Active

Classifications

  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to the technical field of voice processing, and provides a method and a system for positioning a bird song sound source based on an acoustic bispectrum. The method comprises the following steps: collecting the sound signal in each monitoring area together with the spatial position vector of the sound monitoring equipment; determining the mel-band energy aggregation saliency of each frequency band on each frame, based on an analysis of how concentrated the energy distribution is across the frequency bands of each frame in the mel spectrogram of the sound signal collected by each array element microphone; determining a bird information frame significance coefficient for each frame, based on the mel-band energy aggregation saliency of each frequency band and the duration of steady energy change; obtaining a plurality of sound signal segments of the sound signal collected by each array element microphone by applying a VAD algorithm to the bird information frame significance coefficients; and determining the positioning result of the sound signal from all the sound signal segments with a sound source estimation algorithm based on generalized cross-correlation time delay estimation. The method adaptively sets the thresholds of the dual-threshold endpoint detection algorithm and thereby improves the accuracy of the sound source positioning result.

Description

Method and system for positioning bird song sound source based on acoustic bispectrum
Technical Field
The application relates to the technical field of voice processing, in particular to a method and a system for positioning a bird song sound source based on acoustic bispectrum.
Background
Birds are an important part of the ecosystem, and their numbers and distribution reflect changes in the ecological environment. Positioning and tracking birds in a nature reserve helps researchers understand the birds' distribution and activity patterns, detect changes in the ecological environment in time, take corresponding protection measures, and maintain the ecological balance of the reserve.
Because quiet periods in the natural environment are relatively long, silence signals (signals in which there is no sound, or only very weak sound, over a certain period) make up a relatively large share of the sound signals collected by the sound collection equipment. The collected sound signals therefore need to have their valid parts extracted, in order to reduce the running time of the system and its resource occupancy. The dual-threshold endpoint detection algorithm, a common voice activity detection (VAD) algorithm, has the advantages of high robustness, high sensitivity and good real-time performance. However, the thresholds in the traditional dual-threshold endpoint detection algorithm are usually fixed values chosen by experience, while the natural environment contains a great deal of environmental noise such as wind noise and the friction noise of plant leaves. When a bird is far from the sound collection device and there is heavy leaf friction noise or strong wind noise near the device, the noise level in the collected sound signal is high; if the thresholds of the dual-threshold endpoint detection algorithm are not adjusted in time, valid bird sound signal segments in the collected sound signal may be falsely detected or missed, reducing the accuracy with which bird song signal segments are identified in the collected sound signal and thereby affecting the accuracy of sound source positioning.
Disclosure of Invention
The application provides a method and a system for positioning a bird song sound source based on an acoustic bispectrum, which address the false detection and missed detection of valid bird sound signal segments caused by the thresholds of the dual-threshold endpoint detection algorithm during sound source positioning. The technical scheme adopted is as follows:
in a first aspect, an embodiment of the present application provides a method for positioning a bird song sound source based on acoustic bispectrum, the method comprising the steps of:
dividing a natural protection area into a plurality of monitoring areas, and collecting sound signals in each monitoring area and space position vectors of sound monitoring equipment;
determining the mel-band energy aggregation saliency of each frequency band on each frame, based on an analysis of the concentration of the energy distribution across the different frequency bands of each frame in the mel spectrogram of the sound signal collected by each array element microphone;
determining the bird information frame significance coefficient of each frame of signal in the sound signal collected by each array element microphone, based on the mel-band energy aggregation saliency of each frequency band on each frame of the mel spectrogram and on the duration of steady energy change;
obtaining a plurality of sound signal segments of the sound signal collected by each array element microphone by applying a VAD algorithm to the bird information frame significance coefficients of all frames of that signal;
and determining the positioning result of the sound signal based on all the sound signal segments of the sound signals collected by all the array element microphones in each microphone array, using a sound source estimation algorithm based on generalized cross-correlation time delay estimation.
Preferably, the method for collecting the sound signal in each monitoring area and the spatial position vector of the sound monitoring device comprises the following steps:
respectively placing a sound monitoring device at a preset position of each monitoring area, wherein the sound monitoring device is a microphone array formed by a plurality of array element microphones;
a spatial coordinate system is established with the center point of the nature reserve as the coordinate origin and with the north-south direction, the east-west direction and the direction perpendicular to the ground as the x-axis, y-axis and z-axis respectively. The data pair consisting of the pitch angle and the azimuth of each array element microphone of each microphone array in this coordinate system is taken as the ordinal pair of that microphone, and the vector formed by the ordinal pairs of all array element microphones in each sound monitoring device is taken as the spatial position vector of that device.
Preferably, the method for determining the mel-band energy aggregation saliency of each frequency band on each frame, based on the analysis result of the energy distribution concentration characteristic between the different frequency bands on each frame in the mel spectrogram of the sound signal collected by each array element microphone, comprises the following steps:
uniformly dividing the area corresponding to each frame in the Mel spectrogram of each array element microphone for collecting sound signals into a preset number of frequency bands;
taking a sequence formed by energy values of all time-frequency units in each frequency band according to the ascending order of the frequencies as an energy sequence of each frequency band;
determining the frequency band energy concentration degree of each frequency band based on the magnitude of the energy peak value of the time-frequency unit in each frequency band and the frequency interval of each frequency band;
taking, as the mel-band energy aggregation saliency of each frequency band on each frame, the product of the band energy concentration of that frequency band and the accumulation, over all the remaining frequency bands on the frame, of a measure of the distance between the energy sequence of that frequency band and the energy sequence of each remaining frequency band, the distance measure being mapped through an exponential with the natural constant as its base.
Preferably, the method for determining the frequency band energy concentration degree of each frequency band based on the magnitude of the energy peak value of the time-frequency unit in each frequency band and the frequency interval of each frequency band comprises the following steps:
taking the energy sequence of each frequency band as input, and using a peak-trough second-order-difference recognition algorithm to obtain all peak data points in the energy sequence of each frequency band; taking the time-frequency unit corresponding to each peak data point as a peak time-frequency unit;
taking as denominator the sum of 0.1 and the product of the number of peak time-frequency units in each frequency band and the maximum difference between the frequencies of those peak time-frequency units;
and taking the ratio of the kurtosis of all elements in the energy sequence of each frequency band to this denominator as the band energy concentration of that frequency band.
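The band energy concentration step above can be sketched in Python. This is a minimal reading of the claim, not the patent's exact implementation: the "peak-trough second-order-difference" detector is taken to be a sign-change test on first differences, and the maximum frequency difference between peak units is taken as max minus min of the peak frequencies.

```python
import numpy as np

def find_peaks_second_diff(seq):
    """Peak indices where the first difference changes sign from + to -
    (a discrete second-order-difference peak test)."""
    d = np.diff(seq)
    return [i + 1 for i in range(len(d) - 1) if d[i] > 0 and d[i + 1] < 0]

def kurtosis(x):
    """Non-excess kurtosis of a sequence (mean of standardized 4th powers)."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return float(np.mean(((x - m) / s) ** 4)) if s > 0 else 0.0

def band_energy_concentration(energy_seq, freqs, eps=0.1):
    """Kur(P_k) / (N_k * max peak-frequency gap + eps), per the claim;
    eps = 0.1 keeps the denominator nonzero."""
    peaks = find_peaks_second_diff(energy_seq)
    if peaks:
        pf = np.asarray(freqs, dtype=float)[peaks]
        spread = float(pf.max() - pf.min())
    else:
        spread = 0.0
    return kurtosis(energy_seq) / (len(peaks) * spread + eps)
```

A band with a few sharp peaks yields a large kurtosis and small peak spread, hence a large concentration value, matching the qualitative behavior described below.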
Preferably, the method for determining the bird information frame significance coefficient of each frame of signal in the sound signal collected by each array element microphone, based on the mel-band energy aggregation saliency of each frequency band on each frame of the mel spectrogram and on the duration of steady energy change, comprises the following steps:
for each time-frequency unit in each frequency band on each frame of the mel spectrogram of the sound signal collected by each array element microphone, taking the sequence formed by arranging, in time order, the energy values of all time-frequency units that lie in the same frequency row as that time-frequency unit, as the energy time distribution sequence of that time-frequency unit;
taking the energy time distribution sequence of each time-frequency unit as input, and acquiring mutation points in the energy time distribution sequence of each time-frequency unit by adopting a mutation point detection algorithm;
acquiring two mutation points with the smallest time interval with each time-frequency unit, and taking a sequence formed by elements between the two mutation points in the energy time distribution sequence of each time-frequency unit as an energy time distribution subsequence of each time-frequency unit;
determining the degree of energy change correlation of each time-frequency unit in each frequency band based on the energy time distribution subsequence of different time-frequency units in each frequency band;
determining a bird signal frequency band saliency coefficient for each frequency band based on the energy variation correlation degree of each time-frequency unit in each frequency band and the mel-band energy aggregation saliency of each frequency band;
and taking the accumulated sum of the bird signal frequency band saliency coefficients of all frequency bands on each frame of the mel spectrogram as the bird information frame significance coefficient of that frame of the sound signal collected by each array element microphone.
Preferably, the method for determining the degree of energy change correlation of each time-frequency unit in each frequency band based on the energy time distribution subsequence of different time-frequency units in each frequency band comprises the following steps:
taking as a first distance value the sum of 0.1 and the measured distance between the energy time distribution subsequence of each time-frequency unit in each frequency band and that of any one of the remaining time-frequency units;
and taking the accumulation of the reciprocal of the first distance value over all the remaining time-frequency units in each frequency band as the energy change correlation degree of each time-frequency unit in that frequency band.
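The energy change correlation degree above can be sketched as follows. The claim does not name the distance measure, so Euclidean distance after truncating the two subsequences to the shorter length is an assumption here; the 0.1 offset prevents division by zero for identical subsequences.

```python
import numpy as np

def energy_change_correlation(subseqs, eps=0.1):
    """For each time-frequency unit j in a band, accumulate
    1 / (d(s_j, s_u) + eps) over all other units u; units whose energy
    evolves like their neighbours' score high."""
    R = []
    for j, sj in enumerate(subseqs):
        total = 0.0
        for u, su in enumerate(subseqs):
            if u == j:
                continue
            L = min(len(sj), len(su))  # truncate to comparable length
            d = float(np.linalg.norm(np.asarray(sj[:L], float) -
                                     np.asarray(su[:L], float)))
            total += 1.0 / (d + eps)
        R.append(total)
    return R
```

Two units with identical energy evolution contribute the maximal term 1/eps to each other, so coherent bird-song bands score far above random noise bands.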
Preferably, the method for determining the bird signal frequency band saliency coefficient of each frequency band based on the energy variation correlation degree of each time-frequency unit in each frequency band and the mel-band energy aggregation saliency of each frequency band comprises the following steps:
taking as denominator the sum of 0.1 and the number of elements in the energy time distribution subsequence of each time-frequency unit in each frequency band; taking the ratio of the energy change correlation degree of each time-frequency unit to this denominator as a first accumulation factor; and taking the product of the accumulation of the first accumulation factor over all time-frequency units in each frequency band and the mel-band energy aggregation saliency of that frequency band as the bird signal frequency band saliency coefficient of that frequency band.
Preferably, the method for obtaining a plurality of sound signal segments of each array element microphone collecting sound signals based on bird information frame significant coefficients of all frame signals in each array element microphone collecting sound signals by using a VAD algorithm includes:
taking the sequence formed by the bird information frame significance coefficients of all frames of the sound signal collected by each array element microphone, arranged in time order, as the bird information saliency sequence of that sound signal;
and taking the sound signal collected by each array element microphone and its bird information saliency sequence as input, determining the thresholds of the dual-threshold endpoint detection algorithm with a neural network model, and using the dual-threshold endpoint detection algorithm with these thresholds to divide the sound signal into a plurality of sound signal segments.
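The segmentation stage of a classic dual-threshold endpoint detector can be sketched as below. The per-frame score is assumed to be the bird information frame significance sequence, and the two thresholds are plain arguments here; in the method above they would come from the neural network model rather than being fixed by hand.

```python
def dual_threshold_vad(frame_scores, high_thr, low_thr):
    """Dual-threshold endpoint detection: a segment is triggered when the
    frame score exceeds high_thr, then extended in both directions while
    the score stays at or above low_thr. Returns (start, end) frame
    index pairs, inclusive."""
    segments, n = [], len(frame_scores)
    i = 0
    while i < n:
        if frame_scores[i] >= high_thr:
            start = i
            while start > 0 and frame_scores[start - 1] >= low_thr:
                start -= 1
            end = i
            while end + 1 < n and frame_scores[end + 1] >= low_thr:
                end += 1
            if not segments or start > segments[-1][1]:
                segments.append((start, end))
            i = end + 1
        else:
            i += 1
    return segments
```

The high threshold decides whether a segment exists at all; the low threshold recovers the quieter onset and tail of the call, which is what makes the scheme robust to brief dips in energy.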
Preferably, the method for determining the positioning result of the sound signal based on all the sound signal fragments of the sound signal collected by all the array element microphones in each microphone array by adopting the sound source estimation algorithm based on generalized cross-correlation time delay estimation comprises the following steps:
taking the mel spectrogram and the spectrogram of all sound signal segments of the sound signal collected by each array element microphone as input, and using a convolutional neural network to obtain the bird song signal segments in that sound signal;
taking the bird song signal segments in the sound signal collected by each array element microphone as input, and using an ideal binary masking algorithm to obtain the enhanced bird song sequence of each array element microphone in each sound monitoring device;
taking the enhanced bird song sequence of each array element microphone as a row vector, and taking the matrix constructed from the enhanced bird song sequences of all array element microphones in each sound monitoring device during signal collection as the bird song signal matrix of that device;
and taking the bird song signal matrix and the spatial position vector of each sound monitoring device as input, and using a sound source estimation algorithm based on generalized cross-correlation time delay estimation to output the coordinates, in the spatial coordinate system, of the bird song signal collected by each sound monitoring device.
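The time-delay core of generalized cross-correlation with PHAT weighting, on which such estimators are built, can be sketched as follows. This is a standard textbook GCC-PHAT delay estimate between one microphone pair, not the patent's full multi-microphone position solver.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of sig relative to ref using
    generalized cross-correlation with PHAT weighting."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12          # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # re-centre the circular correlation so lag 0 sits at index max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

Given the pairwise delays from all array element microphones and the array geometry (the spatial position vector), the source coordinates follow from standard hyperbolic TDOA triangulation.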
In a second aspect, embodiments of the present application further provide an acoustic bispectrum-based bird song sound source positioning system, including a memory, a processor, and a computer program stored in the memory and running on the processor, the processor implementing the steps of any one of the methods described above when the computer program is executed.
The beneficial effects of this application are as follows. Environmental noise signals and bird sound signals are analyzed, and the mel-band energy aggregation saliency is constructed from the band energy concentration and the similarity of energy distributions between frequency bands in the mel spectrogram, which raises the saliency of the regions of the mel spectrogram containing suspected bird sound signals. The energy change correlation degree and the temporal variation of the time-frequency units in each frequency band are then combined to construct the bird information frame significance coefficient, which increases the distinction between bird sound signals and environmental noise signals in the mel spectrogram. Finally, the thresholds of the dual-threshold endpoint detection algorithm are set adaptively based on the bird information frame significance coefficients, which avoids the problem that the thresholds cannot adapt when the noise level in the sound signal collected by the device is high, and improves the precision with which the dual-threshold endpoint detection algorithm extracts valid sound signal segments, so that the positioning accuracy for the bird signals collected by each sound monitoring device is higher.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a flow chart of a method for positioning a bird song sound source based on acoustic bispectrum according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a spatial coordinate system according to an embodiment of the present application;
fig. 3 is a flowchart of an implementation of a method for positioning a bird song sound source based on acoustic bispectrum according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, a flowchart of a method for positioning a bird song sound source based on acoustic bispectrum according to an embodiment of the present application is shown, and the method includes the following steps:
step S001, dividing the natural protection area into a plurality of monitoring areas, and collecting the sound signals in each monitoring area and the spatial position vector of the sound monitoring equipment.
The area in which the nature reserve is located is divided into a number of monitoring areas, and a sound monitoring device and a camera monitoring device are placed at a preset position in each monitoring area. Each sound monitoring device is a microphone array formed by M array element microphones and collects the sound signals in its monitoring area; the sound signals collected by each sound monitoring device are transmitted back to a data center in real time over the Internet of Things, and the data center performs the subsequent positioning of the bird song sound source. In the present application, the number of monitoring areas and the number M of array element microphones in each sound monitoring device are taken as the empirical values 20 and 8 respectively, and the sampling frequency, quantization length and sampling duration of the sound signals collected by the array element microphones are set to preset empirical values, with the sampling duration set to 6 s. It should be noted that the number of monitoring areas, the specification of the microphone array and the collection parameters of the sound signals may be chosen by the practitioner according to the actual situation of the area in which the nature reserve is located.
Further, a spatial coordinate system is established with the center point of the nature reserve as the coordinate origin o and with the north-south direction, the east-west direction and the direction perpendicular to the ground as the x-axis, y-axis and z-axis respectively, as shown in fig. 2. The angle between the x-axis and the line connecting the origin o with the vertical projection of each array element microphone onto the xoy plane is taken as the azimuth of that microphone, and the angle between that projection line and the line connecting the origin o with the microphone itself is taken as its pitch angle. Secondly, the data pair consisting of the pitch angle and the azimuth of each array element microphone of the microphone array in each monitoring area is taken as the ordinal pair of that microphone, and the vector formed by the ordinal pairs of the M array element microphones in each sound monitoring device is taken as the spatial position vector of that device.
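The ordinal pair construction above amounts to converting each microphone's Cartesian position into spherical angles. A minimal sketch, assuming the microphone positions are given as (x, y, z) coordinates in the scene frame (the patent does not specify how the raw positions are measured):

```python
import math

def ordinal_pair(x, y, z):
    """(pitch, azimuth) of a point in the scene frame: azimuth is the
    angle between the x-axis and the projection onto the xoy plane;
    pitch is the angle between that projection line and the line to
    the point itself."""
    azimuth = math.atan2(y, x)
    pitch = math.atan2(z, math.hypot(x, y))
    return pitch, azimuth

def spatial_position_vector(mic_coords):
    """Concatenate the ordinal pairs of all M array element microphones
    into one device-level spatial position vector."""
    return [ordinal_pair(*p) for p in mic_coords]
```

A microphone on the x-axis has pitch 0 and azimuth 0; one directly above the origin has pitch pi/2, as the geometric description requires.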
So far, the spatial position vector of each microphone array and the sound signals collected by each array element microphone are respectively determined and used for subsequently determining the sound source positioning result in each monitoring area.
Step S002, determining the energy aggregation saliency of the Mel frequency bands of each frequency band based on the analysis result of the energy distribution centralized characteristic between different frequency bands on each frame in the Mel spectrogram of each array element microphone for collecting sound signals.
The aim of the application is to extract the sound segments in the sound signal collected by each array element microphone of the microphone array in each monitoring area, and to determine the sound source positioning result of the sound signal based on the sound segments in the sound signals collected by all the array element microphones of each microphone array; the implementation flow of the whole scheme is shown in fig. 3. In the application, considering that the sound energy of a singing bird aggregates within a specific frequency range, the probability that a bird song signal is present in the different frequency intervals of each frame of the sound signal collected by each array element microphone is estimated first.
Environmental noise in the natural environment, such as wind noise and the friction noise of plant leaves, is usually produced by large-scale air flows or by many random vibration sources such as rubbing leaves, and is shielded and attenuated by various obstacles and media during propagation, so its energy is distributed relatively uniformly over different frequencies. Bird sounds, by contrast, are emitted by specific physiological structures such as the syrinx; differences in these structures cause a bird to produce only sounds of specific frequencies, and the energy of the sound signals birds produce is usually concentrated in a specific frequency range so as to adapt to the environment and to convey information, attract mates, forage, give alarms, evade attacks, and so on. Therefore, in the mel spectrogram, the energy of an environmental noise signal is distributed relatively uniformly over different frequencies, while the energy of a bird song signal is concentrated in a specific frequency range.
In this embodiment, for the sound signal collected by each array element microphone in any one microphone array, the sound signal is collected by the m-th array element microphone in the n-th microphone arrayFor example, sound signal +.>As input, obtaining sound signal via framing, windowing, fourier transform, etc>Is a Meier spectrogram->Wherein, the frame length is set to 20ms, the frame overlap is set to 5ms, the window function in windowing is hamming window, the acquisition of mel spectrogram is a known technology, and the specific process is not repeated. Meier spectrogram->The horizontal axis of (2) is time and the vertical axis is frequency, so that each time and each frequency are in the Mel spectrogram +.>The above determined point is used as a time-frequency unit for the Meier spectrogram ++>In the region on each frame, the Mel spectrogram ++>The area on each frame is uniformly divided into +.>Frequency intervals, each frequency interval being regarded as a frequency band, < >>Is taken as the checked value 20. For any frequency band, use Mel spectrogram ++>The kth frequency band +.>For example, the energy values of all time-frequency units in each frequency band are calculatedThe sequence composed in the order of the ascending frequency order serves as the energy sequence of each frequency band.
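The framing, windowing and band-splitting front end described above can be sketched as follows. This is a simplified sketch: it computes per-frame power spectra with the stated 20 ms frame length, 5 ms overlap and Hamming window, then splits each frame uniformly into K = 20 bands, and omits the mel filterbank stage for brevity.

```python
import numpy as np

def frame_power_spectra(x, fs, frame_ms=20, overlap_ms=5):
    """Split x into Hamming-windowed frames (20 ms long, 5 ms overlap,
    as in the embodiment) and return each frame's power spectrum."""
    flen = int(fs * frame_ms / 1000)
    hop = flen - int(fs * overlap_ms / 1000)
    win = np.hamming(flen)
    frames = np.stack([x[s:s + flen] * win
                       for s in range(0, len(x) - flen + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def band_energy_sequences(frame_spec, n_bands=20):
    """Uniformly split one frame's spectrum into n_bands frequency bands
    (K = 20 in the embodiment), each in ascending frequency order."""
    return np.array_split(frame_spec, n_bands)
```

Each element of a returned band is one time-frequency unit's energy, so a band's array is exactly the "energy sequence" the method operates on.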
Secondly, to evaluate the degree of aggregation of the elements in the energy sequence of each frequency band, the energy sequence of each frequency band is taken as input and a peak-trough second-order-difference recognition algorithm is used to obtain all peak data points in it; the time-frequency unit corresponding to each peak data point is taken as a peak time-frequency unit. The peak-trough second-order-difference recognition algorithm is a known technique, and the specific process is not repeated.
Based on the above analysis, a mel-band energy aggregation saliency is constructed here to characterize the degree of energy aggregation within each frequency band in the mel spectrogram of the sound signal collected by each array element microphone. The mel-band energy aggregation saliency of the k-th frequency band on the i-th frame is calculated as:

$$G_k^i = \frac{\mathrm{Kur}\!\left(P_k^i\right)}{N_k^i \cdot \Delta f_k^i + \varepsilon}, \qquad A_k^i = G_k^i \cdot \sum_{\substack{c=1\\ c \neq k}}^{K} \left(1 - e^{-d\left(P_k^i,\, P_c^i\right)}\right)$$

where $G_k^i$ is the band energy concentration of the k-th frequency band on the i-th frame; $\mathrm{Kur}(P_k^i)$ is the kurtosis of all elements in the energy sequence $P_k^i$ of that frequency band; $N_k^i$ is the number of peak time-frequency units in the frequency band; $\Delta f_k^i$ is the maximum difference between the frequencies of those peak time-frequency units; $\varepsilon$ is a parameter-adjustment factor that prevents the denominator from being 0 and is set to 0.1; the calculation of kurtosis is a known technique, and the specific process is not repeated. $A_k^i$ is the mel-band energy aggregation saliency of the k-th frequency band on the i-th frame; $K$ is the number of frequency bands on the i-th frame of the mel spectrogram; $P_k^i$ and $P_c^i$ are the energy sequences of the k-th and c-th frequency bands on the i-th frame respectively; and $d(P_k^i, P_c^i)$ is the Euclidean distance between the two sequences, mapped through an exponential with the natural constant as its base.

The more pronounced the concentrated distribution of energy in the energy sequence of the k-th frequency band on the i-th frame, the more prominent the peaks in that energy sequence and the larger the kurtosis $\mathrm{Kur}(P_k^i)$; the more the energy values in the sequence concentrate around the peaks, the smaller the frequency spacing between the peak data points, so the smaller $\Delta f_k^i$ and $N_k^i$, and the larger the value of $G_k^i$. Likewise, the more bird song signal components the k-th frequency band on the i-th frame contains, the more its energy is concentrated in a small frequency range and the more pronounced its energy distribution; the more the energy values of its time-frequency units exceed the element values in the energy sequences of the remaining frequency bands, the larger the difference between the sequences $P_k^i$ and $P_c^i$, the larger $d(P_k^i, P_c^i)$, and the larger the value of $A_k^i$.
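A minimal sketch of the mel-band energy aggregation saliency computation, assuming the distance accumulation uses the mapping 1 − e^(−d) on the Euclidean distance (the exact form behind "taking a natural constant as a base" is not fully specified in the text, so treat it as an assumption):

```python
import numpy as np

def mel_band_saliency(band_seqs, concentrations):
    """A_k = G_k * sum over other bands c of (1 - exp(-||P_k - P_c||)).
    band_seqs: one energy sequence per band on the frame;
    concentrations: the band energy concentration G_k of each band."""
    K = len(band_seqs)
    A = []
    for k in range(K):
        acc = 0.0
        for c in range(K):
            if c == k:
                continue
            L = min(len(band_seqs[k]), len(band_seqs[c]))
            d = float(np.linalg.norm(np.asarray(band_seqs[k][:L], float) -
                                     np.asarray(band_seqs[c][:L], float)))
            acc += 1.0 - np.exp(-d)    # bounded, increasing in d
        A.append(concentrations[k] * acc)
    return A
```

A band whose energy sequence differs strongly from every other band's accumulates terms near 1 and is flagged as salient; a band that matches the others contributes terms near 0.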
Thus, the mel-band energy aggregate saliency for each frequency band is obtained for subsequent evaluation of the significance of bird information in each frame of signal.
Step S003, determining the bird information frame significance coefficient of each frame signal in the sound signal collected by each array element microphone based on the energy aggregation saliency of the Mel frequency band of each frequency band on each frame in the Mel spectrogram of the sound signal collected by each array element microphone and the duration of the energy stable change.
During the collection of sound signals in the nature reserve, the sound signals may be disturbed by various environmental noises, including but not limited to wind noise, rain noise and the friction of leaves around the microphone array. Such interfering sounds affect the attenuation characteristics of energy within a frequency band and thereby the degree of energy aggregation within each frequency band. The application therefore further evaluates how much bird song information each frame of the sound signal contains by combining the correlation of energy changes over time during birds' singing with the shorter duration of steady change of the signal energy within a continuous period.
Specifically, an environmental noise signal is usually a sound signal synthesized from the sounds of many different vibration sources and therefore has strong randomness, whereas birds must regulate airflow and vibrate the vocal cords through parts such as the lungs, throat, and oral cavity when vocalizing, producing a series of intermittent sound waves that form an intermittent sound. Consequently, in the mel spectrogram the energy of an environmental noise signal shows no large change over a long period, and its energy at different frequencies is only weakly correlated over time; that is, on the same frame of the mel spectrogram, the energy values of the noise across different time-frequency units differ greatly as time passes. The energy of bird song, in contrast, changes rapidly, remaining stable only over short periods, and its energy at different frequencies is strongly correlated over time.
Further, for the mel spectrogram of the sound signal, take the k-th frequency band as an example. For each time-frequency unit in the k-th frequency band, the sequence formed, in time order, by the energy values of all time-frequency units in that unit's row is taken as the energy time distribution sequence of that time-frequency unit. With the energy time distribution sequence of each time-frequency unit in the k-th frequency band as input, the BG (Bernaola-Galvan) sequence segmentation algorithm is used to obtain the abrupt points of each sequence; the BG sequence segmentation algorithm is a known technique, and the specific process is not repeated.
In another embodiment, the energy time distribution sequence of each time-frequency unit in each frequency band of the mel spectrogram of the sound signal may instead be used as input to the Pettitt change-point detection algorithm to obtain the abrupt points of each sequence; this is likewise a known technique, and the specific process is not repeated.
Next, each abrupt point in the energy time distribution sequence of each time-frequency unit of the k-th frequency band is taken as a dividing point: u abrupt points divide the energy time distribution sequence of each time-frequency unit into u + 1 subsequences, and the subsequence in which each time-frequency unit lies is taken as the energy time distribution subsequence of that time-frequency unit.
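A minimal sketch of this change-point step: the single-change-point Pettitt statistic below is a simplified stand-in for the BG segmentation and Pettitt detection algorithms the text cites (which handle multiple change points and significance testing), and the splitting helper shows how u abrupt points yield u + 1 subsequences:

```python
def pettitt_change_point(seq):
    # Minimal Pettitt single change-point detector: the change point is
    # the index t maximizing |U_t|, where
    # U_t = sum_{i<=t} sum_{j>t} sign(seq[j] - seq[i]).
    n = len(seq)
    best_t, best_u = 0, -1
    for t in range(n - 1):
        u = sum((seq[j] > seq[i]) - (seq[j] < seq[i])
                for i in range(t + 1) for j in range(t + 1, n))
        if abs(u) > best_u:
            best_t, best_u = t, abs(u)
    return best_t  # last index of the first segment

def split_at_points(seq, points):
    # Divide an energy time distribution sequence at its abrupt points;
    # u points yield u + 1 subsequences.
    bounds = [0] + [p + 1 for p in sorted(points)] + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]
```

In practice each time-frequency unit's subsequence is simply the element of `split_at_points` that contains that unit's time index.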
Based on the above analysis, a bird information frame significance coefficient is constructed here to represent the degree to which the time-frequency units of each frame of the sound signal collected by each array element microphone conform to the energy variation characteristics of bird song. The bird information frame significance coefficient R_i of the i-th frame of the sound signal is calculated as:

ρ_q = Σ_{g=1, g≠q}^{Q} 1 / (DTW(A_q, A_g) + ε)

F_k = S_k · Σ_{q=1}^{Q} ρ_q / (L_q + ε)

R_i = Σ_{k=1}^{K} F_k

where ρ_q is the energy variation correlation degree of the q-th time-frequency unit in the k-th frequency band; Q is the number of time-frequency units in the k-th frequency band; A_q and A_g are the energy time distribution subsequences of the q-th and g-th time-frequency units; DTW(A_q, A_g) is the DTW (Dynamic Time Warping) distance between the two subsequences, whose calculation is a known technique and is not repeated; ε is a regulating factor, set to 0.1, that prevents the denominator from being 0; F_k is the bird signal frequency band significance coefficient of the k-th frequency band; S_k is the mel-band energy aggregation saliency of the k-th frequency band; L_q is the number of elements in the energy time distribution subsequence A_q; R_i is the bird information frame significance coefficient of the i-th frame of the sound signal; and K is the number of frequency bands on the i-th frame of the mel spectrogram.
Here, the more bird song information the i-th frame of the sound signal contains, the more likely the i-th frame is a bird song signal frame. The more similar the time-variation characteristics of the energy values of all time-frequency units in the k-th frequency band, the smaller the differences between the energy time distribution subsequences of different time-frequency units, the smaller the DTW distance between the subsequences A_q and A_g, the smaller the first distance value, and the larger the energy variation correlation degree ρ_q. The shorter the duration of a stable change in the energy value of the q-th time-frequency unit, the fewer the elements in its energy time distribution subsequence A_q, the smaller L_q, and the larger the first accumulation factor ρ_q / (L_q + ε). The more pronounced the concentrated energy distribution in the energy sequence of the k-th frequency band on the i-th frame of the mel spectrogram, the larger S_k, and the larger the bird signal frequency band significance coefficient F_k. That is, the larger the value of R_i, the more the i-th frame of the sound signal conforms to the time-dependent energy variation characteristics of bird song, and the more of its frequency bands contain bird song components.
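The frame coefficient above can be sketched as follows. The absolute-difference local cost inside the DTW and the exclusion of the self term (following the "rest time frequency units" wording of claim 6) are interpretive choices:

```python
def dtw(a, b):
    # Classic dynamic time warping distance with |x - y| local cost.
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def frame_coefficient(bands):
    # bands: list of (S_k, subsequences) pairs, one per frequency band,
    # where subsequences[q] is the energy time distribution subsequence
    # A_q of the q-th time-frequency unit in that band.
    r_i = 0.0
    for s_k, subseqs in bands:
        f_k = 0.0
        for q, a_q in enumerate(subseqs):
            # rho_q: energy variation correlation degree (self term excluded).
            rho_q = sum(1.0 / (dtw(a_q, a_g) + 0.1)
                        for g, a_g in enumerate(subseqs) if g != q)
            f_k += rho_q / (len(a_q) + 0.1)  # first accumulation factor
        r_i += s_k * f_k
    return r_i
```

Short, mutually similar subsequences drive ρ_q up and L_q down, so frames with brief, coherent energy bursts — the bird-song pattern described above — receive the largest coefficients.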
So far, the bird information frame significance coefficient of each frame of the sound signal collected by each array element microphone is obtained, and is used subsequently to obtain the sound signal segments of the sound signal collected by each array element microphone.
Step S004, a plurality of sound signal fragments of the sound signal are determined based on bird information frame significant coefficients of each frame of signal in the sound signal collected by each array element microphone; and determining a positioning result of the sound signal based on all sound signal fragments of the sound signal collected by all array element microphones in each microphone array by adopting a sound source estimation algorithm based on generalized cross-correlation time delay estimation.
According to the above steps, the bird information frame significance coefficients of each frame of the sound signal collected by each array element microphone in each microphone array are obtained. The sequence formed, in time order, by the bird information frame significance coefficients of all frames of the sound signal collected by each array element microphone is taken as the bird information significance sequence of that signal. The sound signal collected by each array element microphone and its bird information significance sequence are taken as input to a recurrent neural network model, with stochastic gradient descent as the optimization algorithm and the mean square error function as the loss function; the recurrent neural network outputs the thresholds of the dual-threshold endpoint detection algorithm. Training of the recurrent neural network is a known technique, and the specific process is not repeated. Next, the sound signal collected by each array element microphone is taken as input, and the dual-threshold endpoint detection algorithm, using these thresholds, yields a plurality of sound signal segments in the signal; the dual-threshold endpoint detection algorithm is a known technique, and the specific process is not repeated.
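A minimal sketch of dual-threshold endpoint detection over the bird information significance sequence (the thresholds here are supplied directly rather than produced by the recurrent network, and indices stand in for frame times):

```python
def dual_threshold_segments(scores, high, low):
    # Dual-threshold endpoint detection: a segment is triggered when the
    # per-frame score exceeds the high threshold, then extended backward
    # and forward while the score stays at or above the low threshold.
    segments, i, n = [], 0, len(scores)
    while i < n:
        if scores[i] >= high:
            start = i
            while start > 0 and scores[start - 1] >= low:
                start -= 1
            end = i
            while end + 1 < n and scores[end + 1] >= low:
                end += 1
            segments.append((start, end))  # inclusive frame range
            i = end + 1
        else:
            i += 1
    return segments
```

The two-threshold design avoids clipping the quiet onsets and tails of a call: the high threshold decides that a call exists, while the low threshold decides where it begins and ends.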
Further, for the sound signal collected by any array element microphone, the mel spectrogram and spectrogram of each of its sound signal segments are obtained; obtaining the spectrogram of an audio signal is a known technique, and the specific process is not repeated. The mel spectrogram and spectrogram of each sound signal segment are taken as input to a convolutional neural network model, with stochastic gradient descent as the optimization algorithm and the cross-entropy function as the loss function; the convolutional neural network outputs a label for each sound signal segment, either 1 or 0, where 1 and 0 respectively denote that the segment does or does not belong to bird song. Any sound signal segment labeled 1 is taken as a bird song signal segment. Training of the neural network is a known technique, and the specific process is not repeated.
Further, the bird song signal segments in the sound signal collected by each array element microphone in each microphone array are obtained. Each bird song signal segment is taken as input to the ideal binary mask IBM (Ideal Binary Mask) algorithm to obtain its enhancement result; the IBM algorithm is a known technique, and the specific process is not repeated. For the sound signal collected by any array element microphone, the sequence formed, in ascending time order, by the amplitudes of all sampling points in the enhancement results of all bird song signal segments contained in the signal is taken as the effective interval sequence of that array element microphone. Next, for every sampling point of the signal that does not belong to any bird song signal segment, the amplitude is set to 0, and these amplitudes are inserted into the effective interval sequence in sampling-time order to obtain the enhanced bird song sequence of that array element microphone.
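The ideal binary mask step can be sketched as below, assuming access to per-unit signal and noise magnitude spectrograms; the 0 dB threshold is a common default, not specified by the text:

```python
def ideal_binary_mask(signal_mag, noise_mag, snr_db=0.0):
    # Ideal binary mask over magnitude spectrograms: a time-frequency unit
    # is kept when its local signal-to-noise ratio exceeds the threshold
    # (0 dB assumed here), and zeroed otherwise.
    thr = 10.0 ** (snr_db / 20.0)
    return [[s if n == 0 or s / n > thr else 0.0
             for s, n in zip(row_s, row_n)]
            for row_s, row_n in zip(signal_mag, noise_mag)]
```

The masked spectrogram is then inverted back to a waveform to produce the enhancement result of each bird song signal segment.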
For example, suppose the sampling start and stop times of the sound signal collected by the m-th array element microphone in the n-th microphone array are t_s and t_e, and the signal contains two bird song signal segments with time intervals [t_1, t_2] and [t_3, t_4]. The sequence formed, in ascending time order, by the amplitudes of all sampling points in the enhancement results of the two bird song signal segments is taken as the effective interval sequence of the m-th array element microphone. The amplitudes of all sampling points in the intervals [t_s, t_1 - 1], [t_2 + 1, t_3 - 1], and [t_4 + 1, t_e], which belong to neither bird song signal segment, are set to 0, and all the zeroed amplitudes are inserted into the effective interval sequence in sampling-time order to obtain the enhanced bird song sequence of the m-th array element microphone.
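The zero-filling construction in this example can be sketched as:

```python
def enhanced_sequence(length, segments):
    # Build the enhanced bird song sequence: enhanced segment amplitudes
    # are placed at their sampling positions, and every sampling point
    # outside all bird song signal segments is set to 0.
    out = [0.0] * length
    for start, enhanced in segments:  # (start sample index, enhanced amplitudes)
        out[start:start + len(enhanced)] = enhanced
    return out
```

The result has exactly one amplitude per sampling instant, so the sequences from different array element microphones stay time-aligned for the delay estimation that follows.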
Next, according to the above steps, the enhanced bird song sequences of all array element microphones in each microphone array are obtained. The enhanced bird song sequence of each array element microphone is taken as a row vector, and the matrix constructed from the enhanced bird song sequences of all array element microphones in each microphone array is taken as the bird song signal matrix of that microphone array. The bird song signal matrix and the spatial position vector of each microphone array are taken as input, and a sound source estimation algorithm based on generalized cross-correlation time delay estimation outputs the coordinates, in the spatial coordinate system, of the bird song signal collected by each microphone array; sound source localization based on generalized cross-correlation time delay estimation is a known technique, and the specific process is not repeated. Finally, the coordinate information of the bird song signal collected by each microphone array in the spatial coordinate system is uploaded to the data center of the natural protection area.
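The core of the cited algorithm, the generalized cross-correlation with phase transform (GCC-PHAT) delay estimate between one microphone pair, can be sketched as follows. This is only the first step toward spatial coordinates (the pairwise delays are then combined with the array geometry), and the O(n²) DFT is used for brevity where a real system would use an FFT:

```python
import cmath

def gcc_phat_delay(sig, ref):
    # GCC-PHAT time-delay estimate between two channels: cross-spectrum
    # normalized by its magnitude (phase transform), inverse-transformed,
    # peak lag = delay in samples (positive means sig lags ref).
    n = len(sig) + len(ref)
    def dft(x, inverse=False):
        s = 1 if inverse else -1
        return [sum(v * cmath.exp(s * 2j * cmath.pi * k * t / n)
                    for t, v in enumerate(x)) for k in range(n)]
    X = dft(list(sig) + [0.0] * (n - len(sig)))
    Y = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [x * y.conjugate() for x, y in zip(X, Y)]
    phat = [c / (abs(c) + 1e-12) for c in cross]  # keep phase, drop magnitude
    cc = [v.real / n for v in dft(phat, inverse=True)]
    # Re-center so lags run from -n/2 to +n/2, then locate the peak.
    max_shift = n // 2
    cc = cc[-max_shift:] + cc[:max_shift + 1]
    return cc.index(max(cc)) - max_shift
```

Dividing each cross-spectrum bin by its magnitude whitens the signals, which is what makes the peak sharp for broadband, reverberant sources such as bird calls.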
Based on the same inventive concept as the above method, the embodiments of the present application further provide an acoustic bispectrum-based bird song sound source positioning system, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, where the processor implements the steps of any one of the above acoustic bispectrum-based bird song sound source positioning methods when executing the computer program.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. The foregoing description of the preferred embodiments of the present application is not intended to be limiting, but rather, any modifications, equivalents, improvements, etc. that fall within the principles of the present application are intended to be included within the scope of the present application.

Claims (10)

1. The method for positioning the bird song sound source based on the acoustic bispectrum is characterized by comprising the following steps of:
dividing a natural protection area into a plurality of monitoring areas, and collecting sound signals in each monitoring area and space position vectors of sound monitoring equipment;
determining the energy aggregation saliency of the mel frequency band of each frequency band on each frame based on the analysis result of the energy distribution centralized characteristic among different frequency bands on each frame in the mel spectrogram of the sound signal collected by each array element microphone;
determining a bird information frame significance coefficient of each frame signal in the sound signal collected by each array element microphone based on the energy aggregation saliency of the Mel frequency band of each frequency band on each frame in the Mel spectrogram of the sound signal collected by each array element microphone and the duration of energy stable change;
obtaining a plurality of sound signal fragments of each array element microphone for collecting sound signals based on bird information frame significant coefficients of all frame signals in the sound signals collected by each array element microphone by adopting a VAD algorithm;
and determining a positioning result of the sound signal based on all sound signal fragments of the sound signal collected by all array element microphones in each microphone array by adopting a sound source estimation algorithm based on generalized cross-correlation time delay estimation.
2. The acoustic bispectrum-based bird song sound source localization method according to claim 1, wherein the method of collecting the sound signal in each monitoring area and the spatial position vector of the sound monitoring device is as follows:
respectively placing a sound monitoring device at a preset position of each monitoring area, wherein the sound monitoring device is a microphone array formed by a plurality of array element microphones;
a space coordinate system is established by taking a center point of a natural protected area as a coordinate origin, taking a north-south direction, an east-west direction and a direction vertical to the ground as an x-axis, a y-axis and a z-axis respectively, a data pair consisting of a pitch angle and an azimuth angle of each array element microphone in the space coordinate system in each microphone array in each monitoring area is taken as an ordinal number pair of each array element microphone, and a vector consisting of ordinal number pairs of all array element microphones in each sound monitoring device is taken as a space position vector of each sound monitoring device.
3. The method for positioning a bird song sound source based on acoustic bispectrum according to claim 1, wherein the method for determining the mel-band energy aggregation prominence of each frequency band on each frame based on the analysis result of the energy distribution concentration characteristic between different frequency bands on each frame in the mel-gram of the sound signal collected by each array element microphone is as follows:
uniformly dividing the area corresponding to each frame in the Mel spectrogram of each array element microphone for collecting sound signals into a preset number of frequency bands;
taking a sequence formed by energy values of all time-frequency units in each frequency band according to the ascending order of the frequencies as an energy sequence of each frequency band;
determining the frequency band energy concentration degree of each frequency band based on the magnitude of the energy peak value of the time-frequency unit in each frequency band and the frequency interval of each frequency band;
taking a natural constant as a base, taking the product of the accumulation result of the calculation result of the measurement distance between the energy sequence of each frequency band and the energy sequence of any one of the rest frequency bands on all frequency bands on each frame and the frequency band energy aggregation degree of each frequency band as the Mel frequency band energy aggregation prominence of each frequency band on each frame.
4. The method for positioning a bird song sound source based on acoustic bispectrum according to claim 3, wherein the method for determining the frequency band energy concentration degree of each frequency band based on the size of the energy peak of the time-frequency unit in each frequency band and the frequency interval of each frequency band comprises the following steps:
taking the energy sequence of each frequency band as input, and acquiring all peak data points in the energy sequence of each frequency band by adopting a peak-trough second-order differential recognition algorithm; taking any one time-frequency unit corresponding to the peak data point as a peak time-frequency unit;
taking the sum of the product of the maximum value of the difference values between the frequencies of all the wave crest time-frequency units in each frequency band and the number of all the wave crest time-frequency units in each frequency band and 0.1 as a denominator;
the ratio of kurtosis to denominator of all elements in the energy sequence of each frequency band is taken as the band energy concentration of each frequency band.
5. The method for positioning a bird song source based on acoustic bispectrum according to claim 1, wherein the method for determining the bird information frame saliency coefficient of each frame signal in the sound signal collected by each array element microphone based on the mel frequency band energy aggregation saliency of each frequency band on each frame in the mel spectrogram of the sound signal collected by each array element microphone and the duration of the energy stable change is as follows:
taking a sequence formed by arranging energy values of all time-frequency units in a row of each time-frequency unit in each frequency band in each frame in a Mel spectrogram of each array element microphone for collecting sound signals according to a time sequence as an energy time distribution sequence of each time-frequency unit;
taking the energy time distribution sequence of each time-frequency unit as input, and acquiring mutation points in the energy time distribution sequence of each time-frequency unit by adopting a mutation point detection algorithm;
acquiring two mutation points with the smallest time interval with each time-frequency unit, and taking a sequence formed by elements between the two mutation points in the energy time distribution sequence of each time-frequency unit as an energy time distribution subsequence of each time-frequency unit;
determining the degree of energy change correlation of each time-frequency unit in each frequency band based on the energy time distribution subsequence of different time-frequency units in each frequency band;
determining a bird signal frequency band saliency coefficient for each frequency band based on the energy variation correlation degree of each time-frequency unit in each frequency band and the mel-band energy aggregation saliency of each frequency band;
and taking the accumulated sum of the bird signal frequency band significant coefficients of all frequency bands on each frame in the Mel spectrogram of each array element microphone for collecting the sound signals as the bird information frame significant coefficient of each frame of signals in each array element microphone for collecting the sound signals.
6. The method for positioning a bird song sound source based on acoustic bispectrum according to claim 5, wherein the method for determining the degree of energy variation correlation of each time-frequency unit in each frequency band based on the energy time distribution subsequence of different time-frequency units in each frequency band is as follows:
taking the sum of the measured distance between each time frequency unit in each frequency band and the energy time distribution subsequence of any one of the rest time frequency units and 0.1 as a first distance value;
and taking the accumulated result of the inverse of the first distance value on all the time-frequency units in each frequency band as the energy variation correlation degree of each time-frequency unit in each frequency band.
7. The acoustic bispectrum-based bird song sound source localization method according to claim 5, wherein the method of determining the bird signal frequency band saliency coefficient of each frequency band based on the energy variation correlation degree of each time-frequency unit in each frequency band and the mel-band energy aggregation saliency degree of each frequency band is as follows:
taking the sum of the element number and 0.1 in the energy time distribution subsequence of each time-frequency unit in each frequency band as a denominator; taking the ratio of the energy variation correlation degree of each time-frequency unit in each frequency band to the denominator as a first accumulation factor, and taking the product of the accumulation result of the first accumulation factor on all time-frequency units in each frequency band and the mel-band energy aggregation saliency of each frequency band as the bird signal frequency band saliency coefficient of each frequency band.
8. The method for positioning a bird song sound source based on acoustic bispectrum according to claim 1, wherein the method for obtaining a plurality of sound signal fragments of each array element microphone for collecting sound signals based on bird information frame significant coefficients of all frame signals in each array element microphone for collecting sound signals by using VAD algorithm comprises:
taking a sequence formed by bird information frame significant coefficients of all frame signals in the sound signals collected by each array element microphone according to time sequence as a bird information significant sequence of the sound signals collected by each array element microphone;
and taking the sound signals collected by each array element microphone and the bird information significant sequence of the sound signals collected by each array element microphone as input, determining a threshold value of a double-threshold endpoint detection algorithm by adopting a neural network model, and dividing the sound signals collected by each array element microphone into a plurality of sound signal fragments based on the threshold value by adopting the double-threshold endpoint detection algorithm.
9. The method for positioning a bird song sound source based on acoustic bispectrum according to claim 1, wherein the method for determining the positioning result of the sound signal based on all the sound signal fragments of the sound signal collected by all the array element microphones in each microphone array by adopting the sound source estimation algorithm based on generalized cross-correlation time delay estimation is as follows:
taking a Mel spectrogram and a spectrogram of all sound signal fragments corresponding to the sound signals collected by each array element microphone as input, and acquiring the bird song signal fragments in the sound signals collected by each array element microphone by adopting a convolutional neural network;
taking a bird song signal segment in the sound signal collected by each array element microphone as input, and acquiring a reinforced bird song sequence of each array element microphone in each sound monitoring device by adopting an ideal binary masking algorithm;
taking the reinforced bird song sequence of each array element microphone in each sound monitoring device as a row vector of a matrix, and taking the matrix constructed by the reinforced bird song sequence of all the array element microphones in each sound monitoring device in the signal acquisition process as a bird song signal matrix of each sound monitoring device;
and outputting coordinate information of the bird song signals acquired by each sound monitoring device in a space coordinate system by taking the bird song signal matrix of each sound monitoring device and the space position vector of each sound monitoring device as inputs and adopting a sound source estimation algorithm based on generalized cross-correlation time delay estimation.
10. An acoustic bispectrum based bird song sound source localization system comprising a memory, a processor and a computer program stored in the memory and running on the processor, characterized in that the processor implements the steps of the acoustic bispectrum based bird song sound source localization method according to any one of claims 1-9 when the computer program is executed.
CN202410179288.7A 2024-02-18 2024-02-18 Method and system for positioning bird song sound source based on acoustic bispectrum Active CN117724042B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410179288.7A CN117724042B (en) 2024-02-18 2024-02-18 Method and system for positioning bird song sound source based on acoustic bispectrum


Publications (2)

Publication Number Publication Date
CN117724042A true CN117724042A (en) 2024-03-19
CN117724042B CN117724042B (en) 2024-04-19

Family

ID=90209267



Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953914A (en) * 2024-03-27 2024-04-30 深圳市西昊智能家具有限公司 Speech data enhancement optimization method for intelligent office
CN118173105A (en) * 2024-05-15 2024-06-11 百鸟数据科技(北京)有限责任公司 Bird song recognition method based on audio signal processing
CN118173104A (en) * 2024-05-15 2024-06-11 百鸟数据科技(北京)有限责任公司 Sound source positioning-based distributed scene space sound field reproduction method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050049877A1 (en) * 2003-08-28 2005-03-03 Wildlife Acoustics, Inc. Method and apparatus for automatically identifying animal species from their vocalizations
WO2017108097A1 (en) * 2015-12-22 2017-06-29 Huawei Technologies Duesseldorf Gmbh Localization algorithm for sound sources with known statistics
CN108369811A (en) * 2015-10-12 2018-08-03 诺基亚技术有限公司 Distributed audio captures and mixing
CN109658948A (en) * 2018-12-21 2019-04-19 南京理工大学 One kind is towards the movable acoustic monitoring method of migratory bird moving
CN109741759A (en) * 2018-12-21 2019-05-10 南京理工大学 A kind of acoustics automatic testing method towards specific birds species


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG XIAOXIA; LI YING: "Bird song recognition in complex environments based on energy detection", Journal of Computer Applications (计算机应用), no. 10, 1 October 2013 (2013-10-01) *
LI JIANGLI; TIAN JIANYAN; ZHANG SUNAN: "Research on recognition and localization methods for pig cough sounds", Heilongjiang Animal Science and Veterinary Medicine (黑龙江畜牧兽医), no. 14, 20 July 2020 (2020-07-20) *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant