CN109903775B - Audio popping detection method and device - Google Patents

Publication number
CN109903775B
Authority
CN
China
Prior art keywords
audio
frequency domain
slice
value
slices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711283064.7A
Other languages
Chinese (zh)
Other versions
CN109903775A (en
Inventor
高超
马哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Thunderstone Technology Co ltd
Original Assignee
Beijing Thunderstone Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Thunderstone Technology Co ltd
Priority to CN201711283064.7A
Publication of CN109903775A
Application granted
Publication of CN109903775B
Active
Anticipated expiration

Landscapes

  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The embodiment of the invention provides an audio pop detection method and device. The method comprises the following steps: cutting the audio file into a plurality of audio slices with equal time length; dividing each audio slice into N small parts and performing fast Fourier transform on each small part; dividing the frequency domain energy value range into M intervals from low to high, and counting, for each audio slice, the number of frequency domain energy values distributed in each of the M intervals as the distribution number of the slice frequency domain values; calculating, through a K-nearest neighbor algorithm and according to the distribution number of the slice frequency domain values of each audio slice, the average value of the distribution number of the slice frequency domain values of K adjacent audio slices in each frequency domain energy value interval; and when the difference between the average value and the distribution number of the slice frequency domain values, in each frequency domain energy value interval, of an audio slice to be tested that is adjacent to the K adjacent audio slices meets a preset condition, judging that the audio slice to be tested is a pop. The method and the device have high accuracy and a wide application range, and save a large amount of human resources.

Description

Audio popping detection method and device
Technical Field
The invention relates to the field of audio processing, in particular to an audio plosive detection method and device.
Background
With the development of internet technology, audio files enrich people's entertainment life in modern society, but pops may exist in audio files and degrade the user experience. A pop is an abrupt discontinuity in what the listener hears. Pops have many causes; they generally originate in the sound source, for example from an error when software rips a CD audio track or from a damaged audio file. A pop may also occur when the signal is suddenly interrupted or when strong interference is introduced.
In the prior art, there are various algorithms for identifying pops in a song. In the process of implementing the invention, the applicant found at least the following problem in the prior art: errors remain after the pops in audio files are screened by an algorithm, so the real pops in the songs usually need secondary manual identification to improve accuracy, which is laborious and consumes a large amount of resources.
Disclosure of Invention
The embodiment of the invention provides an audio pop detection method and device which, based on a priority queue algorithm over frequency domain energy, automatically identify pops in karaoke songs and overcome the low accuracy of existing song pop detection and its need for secondary manual identification.
In one aspect, an embodiment of the present invention provides a method for detecting an audio pop, where the method includes:
cutting the audio file into a plurality of audio slices with equal time length;
dividing each audio slice into N parts, and performing fast Fourier transform on each part to obtain the highest value of the frequency domain energy of each part in each audio slice;
equally dividing the frequency domain energy value into M intervals from low to high, and counting the number of N frequency domain energy maximum values distributed in the M intervals corresponding to each audio slice as the distribution number of the slice frequency domain values;
calculating the average value of the distribution number of the slice frequency domain values of the K adjacent audio slices in each frequency domain energy value interval through a K-nearest neighbor algorithm according to the distribution number of the slice frequency domain values of each audio slice;
and when the difference between the average value and the distribution number of the slice frequency domain values, in each frequency domain energy value interval, of an audio slice to be tested that is adjacent to the K adjacent audio slices meets a preset condition, judging that the audio slice to be tested is a pop.
Optionally, judging that the audio slice to be tested is a pop when the difference meets the preset condition includes:
respectively calculating, in each of the M intervals, the difference between the number of frequency domain energy maximum values of the audio slice to be tested and the average value in that interval;
counting the number of intervals in which the difference exceeds a preset number threshold;
when the number of such intervals is larger than R, judging that the audio slice to be tested is a pop; wherein R ∈ (1, M).
Optionally, calculating an average value of the frequency domain energy value distribution data of the K adjacent slices includes:
randomly selecting distribution data of continuous K slices of the audio file in each interval, and adding the distribution data to obtain a first calculation result;
dividing the first calculation result by the number K of slices, and taking the result as the average value of the interval.
Optionally, equally dividing the frequency domain energy value into M intervals from low to high includes:
acquiring the highest value of the frequency domain energy values of the fractions;
setting an upper limit of the interval according to the maximum value, and setting a lower limit to be 0; it is divided equally into M intervals.
Optionally, R = M/2.
On the other hand, an embodiment of the present invention provides an audio pop detection device, including:
a slicing unit for cutting the audio file into a plurality of audio slices of equal length;
the Fourier transform unit is used for dividing each audio slice into N parts and performing fast Fourier transform on each part to obtain the highest value of the frequency domain energy of each part in each audio slice;
the distribution statistical unit is used for equally dividing the frequency domain energy value into M intervals from low to high, and counting the number of N frequency domain energy maximum values distributed in the M intervals corresponding to each audio slice as the distribution number of the slice frequency domain values;
the average value calculating unit is used for calculating the average value of the distribution number of the slice frequency domain values of the K adjacent audio slices in each frequency domain energy value interval through a K-nearest neighbor algorithm according to the distribution number of the slice frequency domain values of each audio slice;
and the pop judging unit is used for judging that the audio slice to be tested is a pop when the difference between the average value and the distribution number of the slice frequency domain values, in each frequency domain energy value interval, of an audio slice to be tested that is adjacent to the K adjacent audio slices meets a preset condition.
Optionally, the pop judging unit includes:
a difference value calculating subunit, configured to calculate, in each of the M intervals, the difference between the number of frequency domain energy maximum values of the audio slice to be tested and the average value in that interval;
the plosive interval counting subunit is used for counting the number of the intervals with the difference value exceeding a preset number threshold;
the plosive judging subunit is used for judging the audio slice to be tested to be the plosive when the number of the intervals is greater than R; wherein R ∈ (1, M).
Optionally, the average calculating unit includes:
the first calculation subunit is used for randomly selecting distribution data of the continuous K slices of the audio file in each interval and adding the distribution data to obtain a first calculation result;
and the second calculating subunit is used for dividing the first calculating result by the number K of the slices to obtain a result which is used as the average value of the interval.
Optionally, the distribution statistical unit includes:
a maximum value obtaining subunit, configured to obtain a maximum value of the frequency domain energy maximum values of the fractions;
the interval dividing subunit is used for setting an interval upper limit according to the highest value and setting a lower limit to 0; it is divided equally into M intervals.
Optionally, R is M/2.
The technical scheme has the following beneficial effects: because the loudness of an audio signal corresponds to the frequency domain energy value after Fourier transform, the sound of the audio file to be detected is described by the frequency domain energy values of each audio slice. By comparing the frequency domain energy values of a plurality of audio slices, an audio slice that differs greatly from its adjacent slices is found and can be judged to be a pop or bass. Errors caused by manual identification are avoided and accuracy is improved. Meanwhile, because the intensity distributions of different audio signals differ, the K-nearest neighbor comparison is carried out over all audio signals of the whole song, which widens the audio detection range.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a frequency domain diagram of a Fourier transform of an audio slice provided for an embodiment of the invention;
fig. 2 is a schematic flowchart of an audio pop detection method according to an embodiment of the present invention;
fig. 3 is a structural diagram of an audio pop detection device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides an audio plosive detection method. A method flowchart is shown in fig. 2, the method comprising the steps of:
step 101, cutting an audio file into a plurality of audio slices with equal time length;
fig. 1 is a frequency domain diagram of a fourier transform of an audio slice. As shown in fig. 1, an audio file is sliced for a preset duration, and the number of slices of an audio file is determined by the duration of the audio file.
Step 102, dividing each audio slice into N small parts, and performing fast Fourier transform on each small part to obtain the highest value of the frequency domain energy of each small part in each audio slice;
Step 103, equally dividing the frequency domain energy value into M intervals from low to high, and counting the number of N frequency domain energy maximum values distributed in the M intervals corresponding to each audio slice as the distribution number of the slice frequency domain values;
Step 104, calculating the average value of the distribution number of the slice frequency domain values of the K adjacent audio slices in each frequency domain energy value interval through a K-nearest neighbor algorithm according to the distribution number of the slice frequency domain values of each audio slice;
Step 105, when the difference between the average value and the distribution number of the slice frequency domain values, in each frequency domain energy value interval, of an audio slice to be tested that is adjacent to the K adjacent audio slices meets a preset condition, judging that the audio slice to be tested is a pop.
Optionally, judging that the audio slice to be tested is a pop when the difference meets the preset condition includes:
respectively calculating, in each of the M intervals, the difference between the number of frequency domain energy maximum values of the audio slice to be tested and the average value in that interval;
counting the number of intervals in which the difference exceeds a preset number threshold;
when the number of such intervals is larger than R, judging that the audio slice to be tested is a pop; wherein R ∈ (1, M).
Optionally, calculating an average value of the frequency domain energy value distribution data of the K adjacent slices includes:
randomly selecting distribution data of continuous K slices of the audio file in each interval, and adding the distribution data to obtain a first calculation result;
dividing the first calculation result by the number K of slices, and taking the result as the average value of the interval.
Optionally, equally dividing the frequency domain energy value into M intervals from low to high includes:
acquiring the highest value of the frequency domain energy values of the fractions;
setting an upper limit of the interval according to the maximum value, and setting a lower limit to be 0; it is divided equally into M intervals.
Optionally, R = M/2.
In one embodiment of the invention, the audio to be processed is divided into a plurality of slices by windowing: each slice is obtained by moving a set translation length along the audio signal, and the length of each slice is the set windowing width;
preferably, in one embodiment, the windowing width may be set to a preset duration of 0.2 s;
taking a translation length of 0.2 s (200 ms) and a windowing width of 200 ms as an example, for an audio signal to be processed with a duration of 200 s, the divided song slices are:
the first song slice is 0-200 ms;
the second song slice is 201-400 ms;
and so on;
the average length of a typical song is 300 s, so the number of song slices is 1500 on average;
200 ms is the minimum unit at which the human ear can distinguish loudness; for intervals shorter than 200 ms, the human ear cannot distinguish the original sound from its echo.
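The windowing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `slice_audio`, the 1 kHz sample rate and the choice to drop a trailing partial slice are all assumptions made for the example.

```python
def slice_audio(samples, sample_rate, window_s=0.2, hop_s=0.2):
    """Split a 1-D sequence of samples into equal-length slices.

    window_s is the windowing width and hop_s the translation length,
    both in seconds. A trailing partial slice is dropped (an assumption).
    """
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]


# A 300 s song at a hypothetical 1 kHz sample rate yields 1500 slices of 200 ms.
slices = slice_audio([0.0] * 300_000, sample_rate=1000)
print(len(slices), len(slices[0]))  # 1500 200
```

With the translation length equal to the windowing width the slices do not overlap; a shorter translation length would produce overlapping slices from the same function.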
As an embodiment of the present invention, recording the highest value of the frequency domain energy values of the audio slice in the present invention includes:
subdividing each slice into N small parts, and then performing fast Fourier transform on each small part;
preferably, in one embodiment, an audio slice is subdivided into N small parts, and then the N small parts are subjected to fast fourier transform processing, the time length of the slice is 0.2s, that is, 200ms, and the case of performing fourier transform on a slice is shown in fig. 1;
calculating the logarithm value of the amplitude of the subdivided part at each frequency point in the full frequency band after the fast Fourier transform is completed, wherein the converted amplitude can be intuitively shown in FIG. 1; taking the logarithm value as a frequency domain energy value under each frequency point;
and after the frequency domain energy value of each subdivided sub-part is obtained, recording the highest frequency domain energy value of each sub-part.
Calculating the frequency domain energy value at each frequency point after the fast Fourier transform means calculating the logarithm of the amplitude at each frequency point in the full frequency band of the frame after the fast Fourier transform; the transformed amplitude is shown intuitively in fig. 1, and the logarithm is used as the frequency domain energy value at each frequency point.
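The per-part transform and peak extraction can be sketched in pure Python as below. Several details are assumptions rather than statements from the text: the logarithm is taken as 20·log10 of the amplitude (the source only says "logarithm of the amplitude", so the resulting values will not necessarily span the 0 to 400 range mentioned here), each part is zero-padded to a power of two for the radix-2 FFT, and the names `fft`, `peak_log_energy` and `slice_peaks` are hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def peak_log_energy(part, eps=1e-12):
    """Highest log-amplitude over the non-negative frequency bins of one part."""
    size = 1
    while size < len(part):
        size *= 2
    padded = list(part) + [0.0] * (size - len(part))  # zero-pad to a power of two
    spectrum = fft(padded)
    # Assumed scale: 20*log10(|X|); eps avoids log10(0) for silent parts.
    return max(20.0 * math.log10(abs(c) + eps) for c in spectrum[: size // 2 + 1])


def slice_peaks(audio_slice, n_parts):
    """Split one slice into n_parts equal parts and return one peak value per part."""
    part_len = len(audio_slice) // n_parts
    if part_len == 0:
        raise ValueError("slice is too short for n_parts")
    return [peak_log_energy(audio_slice[i * part_len:(i + 1) * part_len])
            for i in range(n_parts)]


# A unit-amplitude sine at bin 4 of a 64-sample part has FFT magnitude 32 at that bin.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
peaks = slice_peaks(tone * 4, n_parts=4)
```

In practice a library FFT would replace the recursive function; it is written out here only to keep the sketch self-contained.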
In general, the frequency domain energy values range from 0 to 400.
That is to say, the logarithmic amplitude value in the frequency domain energy diagram after Fourier transform is the energy value of each frequency point. Interval statistics are carried out on the energy values of one slice, which are generally divided into 4 intervals: energy values 1-100, energy values 101-200, energy values 201-300 and energy values 301-400.
Based on the data in step 202, the number of N highest values distributed in the M intervals corresponding to one slice is recorded, as shown in table 1:
Audio slice    Energy value 1-100    Energy value 101-200    Energy value 201-300    Energy value 301-400
1              2                     15                      62                      21
2              1                     10                      69                      20
3              1                     11                      70                      18
A              1                     14                      64                      21
B              2                     6                       11                      81
TABLE 1
Interval statistics are carried out on the energy values of one audio slice, with the energy values divided into the intervals 1-100, 101-200, 201-300 and 301-400, and the number of energy maximum values of the subdivided parts falling into each interval is counted.
The frequency domain energy value intervals are divided, and the number of energy maximum values of the subdivided parts falling into each frequency domain energy value interval is calculated;
preferably, in one embodiment, the statistical data are k1 in the range of energy values 1-100, k2 in the range of energy values 101-200, k3 in the range of energy values 201-300, and k4 in the range of energy values 301-400.
In one embodiment, as shown in Table 1, the first slice sample is 0-200 ms and the sample number is 1; the 200 ms are further subdivided into 100 parts, of which 2 are in the range of energy values 1-100, 15 in the range of energy values 101-200, 62 in the range of energy values 201-300, and 21 in the range of energy values 301-400;
the second slice is 201-400 ms and the sample number is 2, with 1 in the range of energy values 1-100, 10 in the range of energy values 101-200, 69 in the range of energy values 201-300, and 20 in the range of energy values 301-400;
the third slice is 401-600 ms and the sample number is 3, with 1 in the range of energy values 1-100, 11 in the range of energy values 101-200, 70 in the range of energy values 201-300, and 18 in the range of energy values 301-400;
and by analogy, recording the obtained energy value data.
Preferably, in the invention, Fourier transform processing is performed on one audio slice whose duration is the preset duration of 0.2 s (200 ms); the slice is subdivided and then Fourier transformed, and the energy value distribution data are counted.
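The interval statistics can be illustrated with a small histogram routine. The sketch below assumes half-open, equal-width intervals starting at 0, with a value equal to the upper limit clamped into the last interval; the text instead lists inclusive ranges such as 1-100 and 101-200, so boundary handling here is an assumption, as is the name `interval_counts`.

```python
def interval_counts(peaks, m, upper=None):
    """Count how many peak values fall into each of m equal-width intervals on [0, upper]."""
    if upper is None:
        upper = max(peaks)  # upper limit taken from the highest value, lower limit 0
    if upper <= 0:
        raise ValueError("upper must be positive")
    width = upper / m
    counts = [0] * m
    for value in peaks:
        index = int(value / width)
        index = min(max(index, 0), m - 1)  # clamp so value == upper lands in the last interval
        counts[index] += 1
    return counts


print(interval_counts([50, 150, 250, 260, 350, 400], m=4, upper=400))  # [1, 1, 2, 2]
```

Applied to the N peak values of one slice with M = 4 and an upper limit of 400, the returned list plays the role of one row of Table 1.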
Optionally, the statistical data of K randomly selected consecutive slices of the audio file in the energy value segment 1-100 are added to obtain a first calculation result;
taking the energy value segment 1-100 as an example, in step 203 the statistical data of this energy value segment in samples 1 to K are added and divided by the number K of samples to obtain the average energy value E1 of the energy value segment.
The first calculation result is divided by the number K of slices, and the result is taken as the average energy value E1 of the frequency domain energy value segment; the average energy values E1, E2, E3 and E4 of the 4 frequency domain energy value segments of the K audio slices are calculated;
In one embodiment, by averaging the energy value distribution data, the average value for energy values 1-100 is E1, the average value for energy values 101-200 is E2, the average value for energy values 201-300 is E3, and the average value for energy values 301-400 is E4.
Optionally, taking the data in Table 1 as an example, the number K of selected slice samples is 3. The data of slice samples 1-3 in the range of energy values 1-100 are 2, 1 and 1, so the average energy value of samples 1-3 is E1 = (2+1+1)/3 ≈ 1.3. Likewise, E2 = (15+10+11)/3 = 12, E3 = (62+69+70)/3 = 67 and E4 = (21+20+18)/3 ≈ 19.7;
Therefore, the mean of the energy value distribution data of slice samples 1 to 3 is (1.3, 12, 67, 19.7);
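The averaging step can be checked against Table 1 with a few lines of code. The helper name `interval_means` is hypothetical; the data are rows 1-3 of Table 1.

```python
def interval_means(neighbor_counts):
    """Average, interval by interval, the counts of K neighboring slices."""
    k = len(neighbor_counts)
    return [sum(column) / k for column in zip(*neighbor_counts)]


# Rows 1-3 of Table 1 (K = 3, M = 4).
table_rows = [
    [2, 15, 62, 21],
    [1, 10, 69, 20],
    [1, 11, 70, 18],
]
means = interval_means(table_rows)
print([round(value, 1) for value in means])  # [1.3, 12.0, 67.0, 19.7]
```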
the comparison samples are A (a random sample) and B (sample K+1). The energy value interval distribution data of sample A are recorded as A1 (energy values 1-100), A2 (energy values 101-200), A3 (energy values 201-300) and A4 (energy values 301-400); the energy value interval distribution data of sample B are recorded as B1 (energy values 1-100), B2 (energy values 101-200), B3 (energy values 201-300) and B4 (energy values 301-400).
E1 is compared with A1 and B1, E2 with A2 and B2, E3 with A3 and B3, and E4 with A4 and B4.
The differences between the data A1, A2, A3, A4 and the data E1, E2, E3, E4 are calculated.
For slice sample A, the data are (1, 14, 64, 21) and the average values are (1.3, 12, 67, 19.7); the differences are |1.3-1| = 0.3, |12-14| = 2, |67-64| = 3 and |19.7-21| = 1.3. The differences (0.3, 2, 3, 1.3) are all less than 10, so sample A is a normal sample.
For slice sample B, the data in Table 1 are (2, 6, 11, 81), and the differences from E1, E2, E3, E4 are |1.3-2| = 0.7, |12-6| = 6, |67-11| = 56 and |19.7-81| = 61.3, of which 56 and 61.3 are greater than 10. The number of intervals with a large difference is therefore 2; with M = 4 and R = M/2 = 2, sample B is determined to be a pop.
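The comparison of samples A and B can be reproduced with the sketch below, using the rows of Table 1 and the threshold of 10 from the worked example. One point is an assumption: the optional step says the number of exceeding intervals must be larger than R, while the worked example treats a count equal to R = M/2 as a pop, so the sketch uses "greater than or equal to" in order to reproduce the worked example. The function name `is_pop` is hypothetical.

```python
def is_pop(counts, means, diff_threshold=10, r=None):
    """Judge one slice from its per-interval counts and the neighbor averages."""
    m = len(counts)
    if r is None:
        r = m / 2  # R = M/2, as in the optional embodiment
    exceeded = sum(1 for c, e in zip(counts, means) if abs(c - e) > diff_threshold)
    # Assumption: ">=" matches the worked example; the claim wording says "larger than R".
    return exceeded >= r


means = [4 / 3, 12.0, 67.0, 59 / 3]  # averages of Table 1 rows 1-3
sample_a = [1, 14, 64, 21]
sample_b = [2, 6, 11, 81]
print(is_pop(sample_a, means), is_pop(sample_b, means))  # False True
```

With a strict "larger than R" comparison and R = 2, sample B would not be flagged, so the choice of comparison matters for small M.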
The intensity distribution of each section of an audio signal is different. Within a complete song, however, the nature of song audio means that the energy value rises and falls slowly and progressively, so the energy intensity distributions of adjacent samples are similar. Comparing the energy value distribution data of K adjacent samples is therefore meaningful for detecting pops.
Based on this understanding of audio pops, the frequency domain energy value of the pop part of the audio suddenly increases or decreases with a large amplitude of change, so the sound of the pop part changes more than that of the adjacent audio, and the energy value distribution of the corresponding sample period changes relatively strongly. Fig. 1 shows that the energy within a song slice varies between high and low. A 0.2 s sample is subdivided, generally into more than 100 parts, and the energy value distribution data of the subdivided parts are counted to give the energy value distribution of the slice sample. Based on the flow shown in fig. 2, the invention uses the K-nearest neighbor algorithm to analyze whether the energy distributions of adjacent samples belong to the same class, which is a reasonable basis for determining audio pops. The pop parts in all slice samples of the audio can be found accurately, quickly and automatically without secondary manual screening, which saves a large amount of manpower compared with the prior art.
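Scanning a whole song with this comparison might look like the sketch below. It is only one possible reading of "K adjacent slices": each slice is compared with the K most recent slices that were not themselves flagged, and the first K slices are assumed to be pop-free references. Neither choice is stated in the text. The name `detect_pops` and the ordering of the Table 1 rows as consecutive slices are likewise assumptions for illustration.

```python
def detect_pops(histograms, k=3, diff_threshold=10, r=None):
    """Return indices of slices whose interval counts deviate from K earlier clean slices."""
    flagged = []
    clean = list(histograms[:k])  # assumed: the first k slices serve as pop-free references
    for i in range(k, len(histograms)):
        current = histograms[i]
        means = [sum(column) / k for column in zip(*clean[-k:])]
        limit = len(current) / 2 if r is None else r
        exceeded = sum(1 for c, e in zip(current, means) if abs(c - e) > diff_threshold)
        if exceeded >= limit:  # ">=" follows the worked example, not the claim wording
            flagged.append(i)
        else:
            clean.append(current)  # only normal slices join the reference pool
    return flagged


histograms = [
    [2, 15, 62, 21],  # Table 1, slice 1
    [1, 10, 69, 20],  # slice 2
    [1, 11, 70, 18],  # slice 3
    [1, 14, 64, 21],  # sample A
    [2, 6, 11, 81],   # sample B
]
print(detect_pops(histograms))  # [4]
```

Keeping flagged slices out of the reference pool prevents one pop from skewing the averages used for the slices that follow it.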
The method provided by the invention is described above, and the device provided by the invention is described below:
fig. 3 is a diagram illustrating an apparatus according to an embodiment of the present invention. As shown in fig. 3, an embodiment of the present invention provides an audio pop detection device, which may include:
a slicing unit 201 for cutting the audio file into a plurality of audio slices of equal length;
the fourier transform unit 202 is configured to divide each audio slice into N small parts, and perform fast fourier transform on each small part to obtain a frequency domain energy maximum value of each small part in each audio slice;
the distribution counting unit 203 is configured to equally divide the frequency domain energy value into M intervals from low to high, and count the number of N frequency domain energy maximum values distributed in the M intervals corresponding to each audio slice as the number of slice frequency domain value distributions;
the average value calculating unit 204 is configured to calculate, according to the slice frequency domain value distribution number of each audio slice, an average value of the slice frequency domain value distribution numbers of the K adjacent audio slices in each frequency domain energy value interval through a K-nearest neighbor algorithm;
and the pop determining unit 205 is configured to determine that the audio slice to be tested is a pop when the difference between the average value and the distribution number of the slice frequency domain values, in each frequency domain energy value interval, of an audio slice to be tested that is adjacent to the K adjacent audio slices satisfies a preset condition.
Optionally, the pop determining unit 205 includes:
a difference value calculating subunit, configured to calculate, in each of the M intervals, the difference between the number of frequency domain energy maximum values of the audio slice to be tested and the average value in that interval;
the plosive interval counting subunit is used for counting the number of the intervals with the difference value exceeding a preset number threshold;
the plosive judging subunit is used for judging the audio slice to be tested to be the plosive when the number of the intervals is greater than R; wherein R ∈ (1, M).
Optionally, the average calculating unit includes:
the first calculation subunit is used for randomly selecting distribution data of the continuous K slices of the audio file in each interval and adding the distribution data to obtain a first calculation result;
and the second calculating subunit is used for dividing the first calculating result by the number K of the slices to obtain a result which is used as the average value of the interval.
Optionally, the distribution statistical unit includes:
a maximum value obtaining subunit, configured to obtain a maximum value of the frequency domain energy maximum values of the fractions;
the interval dividing subunit is used for setting an interval upper limit according to the highest value and setting a lower limit to 0; it is divided equally into M intervals.
Optionally, R is M/2.
Preferably, in the present invention, the dividing module may divide the audio signal to be detected into a plurality of slice samples in a windowing manner, where a windowing width is a preset duration, and the windowing width may be set according to a minimum time unit recognizable by human ears when dividing the slices.
Preferably, each slice sample is subdivided into N parts, wherein the value of N may be greater than or equal to 100;
statistics are made of the energy value distribution data in each of the subdivided parts.
Preferably, in the present invention, the analysis module is configured to analyze the frequency domain energy values of the transformed audio slices by using a K-nearest neighbor algorithm, compare the frequency domain energy value data of the slice to be detected with those of K adjacent slices, and determine whether the slices belong to the same category and hence whether the slice to be detected is a pop;
based on this, the pop portion determined by the analysis module is a portion whose energy distribution differs greatly from that of the adjacent samples.
According to the technical scheme, the intensity distribution of each section of an audio signal is different. Within a complete song, however, the nature of song audio means that the energy value rises and falls slowly and progressively, so the energy intensity distributions of adjacent samples are similar. Comparing the energy value distribution data of K adjacent samples is therefore meaningful for detecting pops.
Furthermore, because the signal energy of a pop in each frequency band changes greatly relative to other audio, the method checks whether the audio signal to be detected differs in category from the adjacent audio, and therefore contains a pop, according to whether the multi-band energy value distribution of each audio slice is similar to the per-band average of the adjacent audio. This fully matches the characteristics of a pop, and confirms that it is reasonable to check for pops according to whether each audio slice belongs to the same category as its adjacent audio slices.
Furthermore, since the intensity distributions of different audio signals are different, the invention makes the audio detection range wider by comparing all the audio signals of the whole song through the K-nearest neighbor algorithm.
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium and is thus included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., by infrared, radio, or microwave. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An audio pop detection method, characterized in that the method comprises:
cutting the audio file into a plurality of audio slices with equal time length;
dividing each audio slice into N parts, and performing fast Fourier transform on each part to obtain the highest value of the frequency domain energy of each part in each audio slice;
equally dividing the frequency domain energy value into M intervals from low to high, and counting the number of N frequency domain energy maximum values distributed in the M intervals corresponding to each audio slice as the distribution number of the slice frequency domain values;
calculating the average value of the distribution number of the slice frequency domain values of the K adjacent audio slices in each frequency domain energy value interval through a K-nearest neighbor algorithm according to the distribution number of the slice frequency domain values of each audio slice;
and when the difference between the average value and the slice frequency domain value distribution number, in each frequency domain energy value interval, of an audio slice to be tested that is adjacent to the K adjacent audio slices meets a preset condition, determining that the audio slice to be tested is a pop.
2. The method according to claim 1, wherein determining that the audio slice to be tested is a pop when the difference between the average value and the slice frequency domain value distribution number, in each frequency domain energy value interval, of the audio slice to be tested adjacent to the K adjacent audio slices meets a preset condition comprises:
for each of the M intervals, calculating the difference between the number of frequency domain energy maximum values of the audio slice to be tested in that interval and the average value in that interval;
counting the number of intervals in which the difference exceeds a preset number threshold;
when the number of such intervals is greater than R, determining that the audio slice to be tested is a pop; wherein R ∈ (1, M).
3. The method of claim 1, wherein computing the average of the frequency domain energy value distribution data for the K adjacent slices comprises:
randomly selecting the distribution data of K consecutive slices of the audio file in each interval, and adding the distribution data to obtain a first calculation result;
dividing the first calculation result by the number K of slices, and taking the result as the average value of the interval.
4. The method of claim 1, wherein equally dividing the frequency domain energy value into M intervals from low to high comprises:
acquiring the highest value among the frequency domain energy maximum values of the parts;
setting the upper limit of the range according to the highest value and setting the lower limit to 0, and dividing the range equally into M intervals.
5. The method of claim 2, wherein R is M/2.
6. An audio pop detection device, comprising:
a slicing unit for cutting the audio file into a plurality of audio slices of equal length;
the Fourier transform unit is used for dividing each audio slice into N parts and performing fast Fourier transform on each part to obtain the highest value of the frequency domain energy of each part in each audio slice;
the distribution statistical unit is used for equally dividing the frequency domain energy value into M intervals from low to high, and counting the number of N frequency domain energy maximum values distributed in the M intervals corresponding to each audio slice as the distribution number of the slice frequency domain values;
the average value calculating unit is used for calculating the average value of the distribution number of the slice frequency domain values of the K adjacent audio slices in each frequency domain energy value interval through a K-nearest neighbor algorithm according to the distribution number of the slice frequency domain values of each audio slice;
and the pop judging unit is used for determining that an audio slice to be tested is a pop when the difference between the average value and the slice frequency domain value distribution number, in each frequency domain energy value interval, of the audio slice to be tested adjacent to the K adjacent audio slices meets a preset condition.
7. The apparatus of claim 6, wherein the pop judging unit comprises:
a difference value calculating subunit, configured to calculate, for each of the M intervals, the difference between the number of frequency domain energy maximum values of the audio slice to be tested in that interval and the average value in that interval;
the pop interval counting subunit is used for counting the number of intervals in which the difference exceeds a preset number threshold;
the pop judging subunit is used for determining that the audio slice to be tested is a pop when the number of such intervals is greater than R; wherein R ∈ (1, M).
8. The apparatus of claim 6, wherein the average calculating unit comprises:
the first calculation subunit is used for randomly selecting distribution data of the continuous K slices of the audio file in each interval and adding the distribution data to obtain a first calculation result;
and the second calculating subunit is used for dividing the first calculating result by the number K of the slices to obtain a result which is used as the average value of the interval.
9. The apparatus of claim 6, wherein the distribution statistics unit comprises:
a maximum value obtaining subunit, configured to obtain the highest value among the frequency domain energy maximum values of the parts;
the interval dividing subunit is used for setting the upper limit of the range according to the highest value and setting the lower limit to 0, and dividing the range equally into M intervals.
10. The device of claim 7, wherein R is M/2.
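The averaging and decision steps recited in claims 2, 3 and 5 can be sketched as follows. This is an illustrative sketch only: the names `neighbor_mean`, `is_pop`, `count_threshold` and `r` are hypothetical, the K neighbors are assumed to be the slices nearest in time to the slice under test (the claims do not fix this choice), the difference is taken as an absolute value, and the input is assumed to be a list of at least two per-slice count vectors of length M.

```python
def neighbor_mean(histograms, index, k):
    """Per-interval mean over up to k slices nearest in time to `index` (itself excluded)."""
    order = sorted((j for j in range(len(histograms)) if j != index),
                   key=lambda j: abs(j - index))
    neighbors = order[:k]
    m_bins = len(histograms[index])
    # Sum the counts of the neighbors in each interval, then divide by their number.
    return [sum(histograms[j][b] for j in neighbors) / len(neighbors)
            for b in range(m_bins)]


def is_pop(histograms, index, k, count_threshold, r=None):
    """Flag slice `index` when more than r intervals deviate from the neighbor mean."""
    m_bins = len(histograms[index])
    if r is None:
        r = m_bins / 2  # R = M/2, as in claims 5 and 10
    mean = neighbor_mean(histograms, index, k)
    deviating = sum(1 for b in range(m_bins)
                    if abs(histograms[index][b] - mean[b]) > count_threshold)
    return deviating > r
```

For example, with M = 4 and four neighbors, a slice counted as `[0, 0, 2, 2]` among neighbors counted as `[2, 2, 0, 0]` deviates in all four intervals when `count_threshold` is 1, which exceeds R = 2, so it would be flagged. The threshold value itself is a tuning choice that the claims leave open.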
CN201711283064.7A 2017-12-07 2017-12-07 Audio popping detection method and device Active CN109903775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711283064.7A CN109903775B (en) 2017-12-07 2017-12-07 Audio popping detection method and device

Publications (2)

Publication Number Publication Date
CN109903775A CN109903775A (en) 2019-06-18
CN109903775B true CN109903775B (en) 2020-09-25

Family

ID=66938874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711283064.7A Active CN109903775B (en) 2017-12-07 2017-12-07 Audio popping detection method and device

Country Status (1)

Country Link
CN (1) CN109903775B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110853636B (en) * 2019-10-15 2022-04-15 北京雷石天地电子技术有限公司 System and method for generating word-by-word lyric file based on K nearest neighbor algorithm
CN112435687B (en) * 2020-11-25 2024-06-25 腾讯科技(深圳)有限公司 Audio detection method, device, computer equipment and readable storage medium
CN112735481B (en) * 2020-12-18 2022-08-05 Oppo(重庆)智能科技有限公司 POP sound detection method and device, terminal equipment and storage medium
CN116486828B (en) * 2023-06-14 2023-09-08 北京觅图科技有限公司 Audio data processing method, device and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989853A (en) * 2015-02-28 2016-10-05 科大讯飞股份有限公司 Audio quality evaluation method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4811404A (en) * 1987-10-01 1989-03-07 Motorola, Inc. Noise suppression system
US5317522A (en) * 1992-01-15 1994-05-31 Motorola, Inc. Method and apparatus for noise burst detection in a signal processor
CN104143341B (en) * 2013-05-23 2015-10-21 腾讯科技(深圳)有限公司 Sonic boom detection method and device
CN105118520B (en) * 2015-07-13 2017-11-10 腾讯科技(深圳)有限公司 A kind of removing method and device of audio beginning sonic boom
CN105676268B (en) * 2016-01-15 2018-03-13 广西大学 A kind of strain type rock burst method for early warning based on sound signal waveform variation characteristic
CN106782612B (en) * 2016-12-08 2019-12-13 腾讯音乐娱乐(深圳)有限公司 reverse popping detection method and device

Also Published As

Publication number Publication date
CN109903775A (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN109903775B (en) Audio popping detection method and device
US8595009B2 (en) Method and apparatus for performing song detection on audio signal
US10665248B2 (en) Device and method for classifying an acoustic environment
CN104078051B (en) A kind of voice extracting method, system and voice audio frequency playing method and device
Pillos et al. A Real-Time Environmental Sound Recognition System for the Android OS.
EP3451697A1 (en) Method and device for howling detection
CN109903752A (en) The method and apparatus for being aligned voice
CN110019922B (en) Audio climax identification method and device
WO2016004757A1 (en) Noise detection method and apparatus
Aleinik et al. Detection of clipped fragments in speech signals
CN113674763A (en) Whistling sound identification method, system, equipment and storage medium by utilizing line spectrum characteristics
Tamatjita et al. Comparison of music genre classification using Nearest Centroid Classifier and k-Nearest Neighbours
McLoughlin et al. Early detection of continuous and partial audio events using CNN
CN111243618B (en) Method, device and electronic equipment for determining specific voice fragments in audio
CN113782051B (en) Broadcast effect classification method and system, electronic equipment and storage medium
US12093314B2 (en) Accompaniment classification method and apparatus
CN114302301B (en) Frequency response correction method and related product
CN110739006A (en) Audio processing method and device, storage medium and electronic equipment
US20240242730A1 (en) Methods and Apparatus to Fingerprint an Audio Signal
US20220165289A1 (en) Methods and systems for processing recorded audio content to enhance speech
CN115243183A (en) Audio detection method, device and storage medium
CN112233693B (en) Sound quality evaluation method, device and equipment
CN114283841B (en) Audio classification method, system, device and storage medium
JP6090371B2 (en) Audio signal identification device and program
CN118102201B (en) Bluetooth headset automatic test method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant