CN110838302B

CN110838302B - Audio frequency segmentation method based on signal energy peak identification

Info

Publication number: CN110838302B
Application number: CN201911121998.XA
Authority: CN
Inventors: 王旻轩; 鲍亭文; 金超
Original assignee: Beijing Cyberinsight Technology Co ltd
Current assignee: Beijing Cyberinsight Technology Co ltd
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2022-02-11
Anticipated expiration: 2039-11-15
Also published as: CN110838302A

Abstract

The application relates to an audio frequency segmentation method based on signal energy spike identification, which comprises the following steps: carrying out short-time Fourier transform on an input audio signal, and converting the input audio signal into a power spectrum matrix; extracting intermediate frequency energy characteristics based on a power spectrum; carrying out peak identification on the extracted intermediate frequency energy characteristics; performing error division correction on the signal subjected to peak identification; and outputting the time coordinate of the division point of the audio signal. The audio segmentation method does not need to set a threshold value and train in advance, can be used for performing analysis on the basis of the audio signal in real time, quickly and accurately, can be deployed at the edge end, does not need to access other operation parameters, and basically realizes parameter-free dynamic segmentation.

Description

Audio frequency segmentation method based on signal energy peak identification

Technical Field

The application relates to an audio segmentation method based on signal energy peak identification, which is applicable to the technical field of audio signal processing.

Background

The main implementation schemes for the simple audio segmentation algorithm are as follows:

1. a segmentation method based on endpoint detection, such as chinese patent with application number CN 200510061358.6. And detecting all mute points as potential points of possible change of the speaker by utilizing the characteristic that the speaker pauses in the speaking interval. Such methods are not accurate because the mute point is difficult to detect under different snr environments.

2. The model-based segmentation method is disclosed in chinese patent application nos. CN201710512310.5 and CN 201811581291.2. Corresponding models are established for different types of audio segments, then model maximum likelihood selection is carried out on an input audio stream in a sliding window, and an audio dividing point is considered as a position where audio categories are changed. In order to establish a generalized model, various model-based segmentation methods are successively proposed and implemented. For example, UBM is used to distinguish between speech segments and non-speech segments, while UGM is used to distinguish between male and female speakers, however, these "a priori knowledge" is generally not available. This method therefore has no detection capability for unknown acoustic features.

3. Distance-based segmentation methods compute the left and right window data "difference" for each sample point in the audio stream, represented by a distance scale. When the "difference" reaches a certain level, i.e. the distance scale exceeds a certain given threshold or a local maximum is taken, it is considered as an audio segmentation point. Although such a method does not require a priori knowledge for decision making and has high segmentation accuracy, the threshold selection depends largely on the audio characteristics, so the method lacks stability and robustness, and is computationally expensive.

Taking a fan blade scene as an example, the main implementation scheme of audio segmentation is to access the real-time rotating speed of the fan blades, and obtain the approximate position of the segmentation point between each blade after operation. This solution is simple and efficient, but has the prominent problems that:

1. the positioning of the division points is not accurate, the actual rotation process is continuously variable, if the rotation time of each blade is calculated and divided according to the average rotation speed within a time range of a certain resolution, only the uniform length can be obtained approximately, and the time used by each blade in the actual rotation process is not necessarily equal. Therefore, the method is only suitable for reference and is not suitable for being used as accurate input of other analysis algorithms;

2. the real-time rotating speed of the connected fan blade has higher requirements on installation of the sensor, acquisition of high-precision rotating speed requires additional sensor hardware of acquisition equipment, engineering implementation difficulty is high, cost is high, maintenance is not facilitated, and the acquisition of the rotating speed of the main shaft is carried out at the cabin part of the fan, the acquisition device is arranged at the tower footing, and an overlong signal transmission line causes interference in acquisition signals, so that data quality is poor, and segmentation and interpretation are seriously influenced.

Disclosure of Invention

The application provides an audio frequency segmentation method based on signal energy peak recognition, which can be used for performing analysis on an audio signal in real time, quickly and accurately without setting a threshold value and training in advance, can be deployed at an edge end, does not need to access other operation parameters, and basically realizes parameter-free dynamic segmentation.

The audio segmentation method based on signal energy spike identification comprises the following steps:

(1) carrying out short-time Fourier transform on an input audio signal, and converting the input audio signal into a power spectrum matrix;

(2) extracting intermediate frequency energy characteristics based on a power spectrum;

(3) carrying out peak identification on the extracted intermediate frequency energy characteristics;

(4) performing error division correction on the signal subjected to peak identification;

(5) and outputting the time coordinate of the division point of the audio signal.

The method for extracting the energy features comprises the following steps:

(1) carrying out short-time Fourier transform on the original audio signal, and converting the original audio signal into a time-frequency domain matrix M₀；

(2) The time-frequency domain matrix M₀Into a spectral matrix M expressed in decibels₁；

(3) Determining the frequency range of the audio signal as principal element, and aligning the spectrogram matrix M₁Performing band-pass filtering to filter out low-frequency environmental noise and high-frequency abnormal sound;

(4) pair spectrogram matrix M₁Cutting according to frequency axis, and reserving sub-power spectrum matrix M mainly based on audio signal₂；

(5) Will M₂The column-wise summation of (a) to obtain the sum of each time-domain power spectral vector.

The method for identifying the peak of the extracted intermediate frequency energy features comprises the following steps:

(1) determining the rated rotating speed rs of the rotation of the fan blade and the time length t of the input audio;

(2) calculating to obtain a conversion relation prop of the characteristic index and the time index according to the duration t and the length k of the Energy characteristic;

(3) obtaining the rated segmentation step length distance of the feature index according to the rated rotating speed rs and the prop;

(4) and searching the feature vectors by using a binary search method until the peak is not searched.

The method for performing error division correction on the signal subjected to peak identification comprises the following steps of:

(1) setting a wrong score judgment threshold value;

(2) removing the division points with the values larger than the misclassification judgment threshold value to obtain final division point coordinates m';

(3) the coordinate m' is converted back to the time index according to the conversion relation prop.

Wherein the band-pass filtering is performed by selecting the upper limit of the cut-off frequency in a matrix M₁Coordinate index of vertical axis and selected cutoff frequency lower limit in matrix M₁The vertical axis index of the vertical axis is determined by the following formula:

wherein UpperBound represents the upper limit of the selected cutoff frequency in the matrix M₁Coordinate index of vertical axis, LowerBound for selected cutoff frequency lower bound in matrix M₁Vertical axis index of vertical axis, sr is sampling frequency of audio, Freq_low，Freq_upIs the frequency range in which the audio signal is the principal element.

The method is specific to the wind sweeping sound of the fan blade, for example, a specific scene of the wind sweeping sound segmentation of the fan blade, combines the algorithm result of the voice analysis, and provides an energy feature extraction method which is specific to the wind sweeping sound of the fan blade and has strong generalization and robustness; the audio segmentation method based on the energy characteristics has no parameters, low computation amount and accurate realization of variable speed cutting, and adds a wrong division post-processing mechanism to further improve the accuracy of segmentation. The method simultaneously provides an energy characteristic extraction mode aiming at wind sweeping sound of the blades of the wind turbine generator based on a priori knowledge, and the energy characteristic extraction mode is used as preprocessing and input of segmentation and has strong robustness. Specifically, the audio segmentation method based on signal energy spike identification has the following technical advantages:

(1) the method for extracting the wind sweeping sound energy characteristics of the blades of the wind turbine generator system comprises the steps of performing band-pass filtering on a power spectrum matrix, filtering low-frequency and high-frequency environment noise by taking an energy matrix of a specific frequency section, and taking the sum of the frequency domain energy of each time domain section as the medium-frequency energy characteristics of the wind sweeping sound of the blades. The characteristic can effectively filter noise point interference caused by too many sampling points and environmental sounds, and can extract information capable of stably representing the wind sweeping regularity characteristic of the fan blade from disordered original audio signals;

(2) the method for extracting the features and searching the energy trough by utilizing the peak recognition method provides a parameter-free method, the real-time rotating speed of a fan is not required to be accessed, a threshold value is not required to be set during segmentation, the priori knowledge of the audio is not required, and meanwhile, the real-time segmentation can be carried out without the need of training in advance. A correction mechanism is added after the segmentation, and the segmentation accuracy is further adjusted; the method is rapid, stable and accurate;

(3) the requirement on deployment conditions is low, a sensor does not need to be additionally installed when the draught fan is built, and only the audio acquisition equipment needs to be installed on the periphery of the equipment, so that the engineering deployment cost is saved, and errors caused by signal interference are avoided; the operation can be unaffected when idling/not generating/stopping because real-time rotating speed information is not needed during operation.

Drawings

Fig. 1 shows a diagram of an original audio signal of an audio a in an embodiment.

Fig. 2 shows the mid-frequency energy profile of audio a after vertical summing.

Fig. 3 shows a schematic diagram of finding all valley locations in fig. 2 using a spike identification algorithm.

Fig. 4 shows an effect diagram showing the division points of the audio a in the original waveform diagram.

Fig. 5 shows an effect diagram showing the division points of the audio a in the power spectrogram.

Fig. 6 shows a diagram of the original audio signal of audio B in an embodiment.

Fig. 7 shows the mid-frequency energy profile of audio B after vertical summation.

FIG. 8 shows a schematic diagram of finding all valley locations in FIG. 7 using a spike identification algorithm.

Fig. 9 shows an effect diagram showing the division points of the audio B in the original waveform diagram.

Fig. 10 shows an effect diagram showing the division point of the audio B in the power spectrogram.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

The following explains and explains an audio signal obtained by applying a specific technology used in the audio segmentation method to blade sweeping in a wind turbine generator system as an example, and it should be noted that the method of the present application can be applied to segmentation of any periodic audio signal, and is not limited to the field of blade sweeping. The audio sensor can be mounted on the periphery of the blade of the wind turbine generator, for example, on a tower, so as to acquire an audio signal generated by the blade sweeping.

Energy feature extraction method

(1) Performing short-time Fourier transform (STFT) on the original audio signal, and converting the STFT into a time-frequency domain matrix M₀The dimension of the matrix depends on the parameter setting of the short-time Fourier transform, the setting window length is usually the same as the FFT point number n _ FFT, the range is generally between 1024-₀Frequency dimension (number of rows); the overlap length n _ overlap between windows is generally equal to one of the number of FFT pointsHalf, and the audio duration together determine M₀Time dimension (number of columns); the window function that is windowed over the signal is typically a hamming window.

For a particular input signal x_nAnd window omega_nThe short-time Fourier transform is defined as

STFTs are typically visualized using log-spectra (log-spectra) for M₀Into a spectral matrix M expressed in decibels₁The elements in the matrix are converted to decibel form, then M₁The spectrogram can be displayed in a thermodynamic diagram form; m₁Dimension and M₀Keeping consistent, and the specific dimension is jointly determined by n _ fft and n _ overlap.

The STFTs to decibel conversion is defined as

20log₁₀(|X_n|).。

(2) The sensor receives a wind sweeping sound signal of the fan blade, the wind sweeping sound signal is an approximately periodic signal, each blade is in the process of moving away from the sensor to approach the sensor and then moving away, the sound of the audio frequency is in the process of changing from small to large and then changing from small to small, and the fade-out of the previous blade is accompanied with the fade-in of the next blade. However, due to the influence of noise, the signal waveform of the original audio does not necessarily exhibit this characteristic obviously, nor does it necessarily have periodicity; secondly, the original audio has more noise points at high resolution due to higher sampling rate, and is easy to interfere with the subsequent segmentation. The method requires that features which can more obviously represent the blade wind sweeping change trend and the periodic trend, namely the intermediate frequency energy based on the power spectrum, can be extracted on the premise of certain filtering or down-sampling, and the specific operations are as follows:

(1) the sampling frequency sr of the input audio is definite and is generally between 12800 and 51200 Hz;

(2) determining the frequency range Freq taking the wind sweeping sound of the blade as a principal element through experiments_low，Freq_upTo spectrogram matrix M₁(rather than the original audio signal) to filter out the low frequency loopAmbient noise (mainly wind, thunder, speech, car, etc.) and high-frequency abnormal sounds (e.g. bird sounds, whistling sounds caused by blade damage, etc.). The main frequency range of the general wind sweeping is between 100 and 1000Hz, and the specific filtering method is as follows:

wherein UpperBound represents the upper limit of the selected cutoff frequency in the matrix M₁Coordinate index of vertical axis, LowerBound for selected cutoff frequency lower bound in matrix M₁The vertical axis index of the vertical axis, round is a well-known function and represents the result of rounding operation by a specified decimal place, and length represents a function of the matrix length. Because the processing is carried out based on the matrix subjected to short-time Fourier transform, the number of sample points is reduced in a time domain, and the influence of noise points is reduced; in the frequency domain, the vertical axis h represents frequency information, so that the vertical axis h represents M₁Cutting according to a frequency axis, and reserving a sub-power spectrum matrix M mainly based on wind sweeping sound₂：

M₂＝M₁[LowerBound：UpperBound，；]

(3) Will M₂The column direction summation of (a) to obtain the sum Energy of each time domain power spectrum vector:

Energy＝sum(M′₂)

the obtained medium-frequency Energy characteristic vector Energy can fully represent the periodic and gradual change characteristics of the wind sweeping sound signal, the characteristic pattern is observed, and if a waveform with more regular and obvious wave crests and wave troughs which alternately appear is obtained, the characteristic represents that the blade wind sweeping sound is captured by the characteristic. Each section of wave crest is a blade wind sweeping sound, the wave trough is the wind sweeping alternate position of the two blades, and the blade sound is segmented, namely all the wave trough positions in the audio section are positioned.

Peak identification method for energy characteristics

Energy is a one-dimensional signal, a segmentation point is located at each trough position, and the peak identification method positions all local minimum values by comparing the sizes of the adjacent points. In the peak identification method, a signal peak A [ m ] is defined as any sample point in a signal, the value of the direct adjacent point is higher than the point, the points are approximate to infinity at the beginning and the end of the array, and no dividing point exists:

A[m-1]≥A[m]，A[m+1]≥A[m]

the specific method for identifying the peak is as follows:

(1) defining the rated rotating speed rs (in RPM) of the rotation of the fan blade and the time length t of the input audio; (ii) a

(4) searching the feature vector by using a binary search method, firstly looking up an element in the middle of the array, if the element is a peak, directly returning, otherwise, if the element on the left side is large, recursively processing the left half array, and if the element on the right side is large, processing the right half array until the peak is not searched. Additionally, when the nominal segmentation step distance is set, i.e., the minimum horizontal distance between adjacent peaks is determined, the smaller peaks are removed first until all remaining peaks are satisfied.

Error correction method

Because the set rated segmentation step length distance is based on the rated rotating speed of the fan blade, the rated rotating speed of a higher level is difficult to reach when the blade rotates generally, so the actual number of real segmentation points is often less than the number of segmentation points output in the previous step, and a wrong segmentation post-processing mechanism is added. When wrong division occurs, the value of the corresponding division point in the feature vector is not the position of the trough of the wave; the value of the miscut point is greater than the value of the integral segmentation point. Thus, with this feature, the misclassification determination threshold is set:

mean (Energy [ m ]) + std (Energy [ m ]), where mean represents the mean and std represents the standard deviation;

and removing the segmentation points with the values larger than the misclassification judgment threshold value to obtain the final segmentation point coordinates m'. Finally, the coordinate m' is converted back to the time index according to the conversion relation prop.

Examples of embodiment

Data was collected from a wind field idling fan and two sets of audio signals a and B were used for the experiment. Under the experimental conditions, the rated rotating speed of the blade tip of the fan is 8.5RPM (revolutions per minute), and the audio sampling frequency is 51.2 kHz. In order to verify the segmentation effect, the two fans are subjected to manual audio labeling, and the accurate segmentation point position is determined.

The original audio signal image of audio signal a is shown in fig. 1. The duration of the audio A is 65.53125 seconds, and an obvious regular wind sweeping form cannot be observed in the original audio; obtaining a time-frequency spectrum matrix of A after short-time Fourier transform, obtaining a power spectrum matrix of decibel (dB) unit by conversion according to a conversion method used in the scheme, wherein the dimension of the power spectrum matrix is (4097 multiplied by 3277) under experimental parameters (n _ fft is 8192, n _ overlap is 1024); and obtaining 0.019997329874885564 a conversion relation prop according to the audio time length and the time domain dimension of the power spectrum matrix, and 117.66276753906142 a rated segmentation step distance calculated according to the rated rotating speed and the conversion relation.

And calculating to obtain the coordinates of the upper limit and the lower limit of the longitudinal axis of the power spectrum matrix as (256, 32) according to the sampling frequency of 51.2kHz and the upper limit and the lower limit of the filtering of 800Hz/100 Hz. The mid-frequency energy characteristic of a after vertical summing is shown in fig. 2. It can be clearly seen in fig. 2 that each segment of the wave peak represents the periodic wind sweeping sound of one blade of the fan. Using a spike identification algorithm, finding all the trough positions in fig. 2 based on the calculated nominal segmentation step size is shown in fig. 3. The dotted line is a threshold line for correcting misclassification, if the value of a segmentation point is above the threshold line, the segmentation point is removed, and the misclassification point above the threshold line does not appear in the segmentation of the A. The x-type icon is all the division points m of the positioning.

Converting m to a time index (in seconds) m' through a conversion relation prop, wherein the coordinate of the algorithm division point of A is as follows:

[1.3198237717424472，4.95933780897162，8.538859856576137，11.158510070186145，15.277960024412572，18.677506103143116，21.31715364662801，25.2166329722307，28.03625648458956，30.935869316447967，34.55538602380226，37.5349881751602，40.634574305767465，44.114109703997556，46.87374122673177，49.993324687213914，52.952929508696975，55.93253166005493，58.67216585291425，61.35180805614891，64.31141287763198]；

the coordinates of the manual accurate segmentation points of A are as follows:

[1.03，4.65，8.26，11.72，15.22，18.21，21.6，25.08，28.07，30.94，34.42，37.39，40.58，43.72，46.77，49.83，52.88，55.73，58.43，61.18，64.24]；

the average error of the population is within 0.1 s. The effect of showing the segmentation points in the raw waveform diagram and the power spectrum diagram is shown in fig. 4 and 5.

Similarly, the original audio signal image of audio B is shown in fig. 6. The duration of the audio B is 112.53125 seconds, and an obvious regular wind sweeping form cannot be observed in the original audio; obtaining a time-frequency spectrum matrix of B after short-time Fourier transform, obtaining a power spectrum matrix of decibel (dB) unit by conversion according to a conversion method used in the scheme, wherein the dimension of the power spectrum matrix is (4097 multiplied by 5627) under experimental parameters (n _ fft is 8192, n _ overlap is 1024); and obtaining a conversion relation prop =0.019998444997334282 according to the audio time length and the time domain dimension of the power spectrum matrix, and calculating a rated segmentation step distance 117.6562066092752 according to the rated rotating speed and the conversion relation.

And calculating to obtain the coordinates of the upper limit and the lower limit of the longitudinal axis of the power spectrum matrix as (256, 32) according to the sampling frequency of 51.2kHz and the upper limit and the lower limit of the filtering of 800Hz/100 Hz. The mid-frequency energy characteristics of B after vertical summing are shown in fig. 7. It can be clearly seen in fig. 7 that each segment of the wave peak represents the periodic wind sweeping sound of one blade of the fan. Using a spike identification algorithm, finding all the trough positions in fig. 7 based on the calculated nominal segmentation step size is shown in fig. 8. The dotted line is a threshold line for correcting misclassification, if the value of a segmentation point is above the threshold line, the segmentation point is removed, and the misclassification point above the threshold line does not appear in the segmentation of B. The x-type icon is all the division points m of the positioning.

The algorithm for converting m to a time index (in seconds) m' through a conversion relation prop has the following coordinates of the division points:

[1.3198973698240626，4.179675004442865，6.659482184112316，9.33927381375511，12.039063888395237，14.598864848054026，17.378648702683492，20.098437222320953，22.578244401990403，25.158043806646525，28.01782144126533，30.39763639594811，33.19741869557491，36.077194775191046，38.65699417984717，41.05680757952728，44.03657588413009，46.61637528878621，49.33616380842367，51.79597254309579，54.45576572774125，56.855579127421365，59.7353552070375，62.5951328416563，65.11493691132043，67.77473009596588，70.45452172560867，73.17431024524613，75.8940987648836，78.61388728452106，81.27368046916652，83.87347831881998，86.3332870534921，89.21306313310824，91.73286720277235，94.51265105740181，97.23243957703927，99.95222809667673，102.6120212813222，105.13182535098632，108.05159832059712，110.47141016527458]；

the coordinates of the manual accurate segmentation points of the B are as follows:

[1.3，3.84，6.44，9.34，12.04，14.6，17.22，19.99，22.34，25.13，27.71，30.45，33.14，35.84，38.38，41.14，43.72，46.3，49.09，51.67，54.39，57.04，59.8，62.34，65.1，67.68，70.37，73.09，75.72，78.32，81.06，83.74，86.21，89.21，91.65，94.42，97.11，99.74，102.34，105.1，107.87，110.3]；

the average error of the population is also around 0.1 s. The effect of displaying the segmentation points in the original oscillogram and the power spectrum is shown in fig. 9 and 10

Although the embodiments of the present invention have been described above, the above descriptions are only for the convenience of understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. An audio segmentation method based on signal energy spike identification, comprising the steps of:

(5) outputting the time coordinate of the division point of the audio signal;

2. The audio segmentation method according to claim 1, wherein the method of extracting energy features comprises the steps of:

3. The audio segmentation method according to claim 1 or 2, wherein the method of error correction of the spike identified signal comprises the steps of:

(1) setting a wrong score judgment threshold value;

4. The audio segmentation method of claim 1 or 2 wherein the band-pass filtering is performed by selecting the upper cut-off frequency limit in the matrix M₁Coordinate index of vertical axis and selected cutoff frequency lower limit in matrix M₁The vertical axis index of the vertical axis is determined by the following formula: