CN110265064B

CN110265064B - Audio pop detection method, device and storage medium

Info

Publication number: CN110265064B
Application number: CN201910506938.3A
Authority: CN
Inventors: 陈洲旋
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2019-06-12
Filing date: 2019-06-12
Publication date: 2021-10-08
Anticipated expiration: 2039-06-12
Also published as: WO2020248308A1; CN110265064A

Abstract

An embodiment of the application discloses an audio pop detection method, an audio pop detection device and a storage medium. When pop detection is performed on an audio signal, the audio signal to be detected is acquired and divided into a plurality of frame signals; the short-time energy difference between each two adjacent frame signals is calculated; frame signals meeting a preset condition interval are obtained according to the short-time energy difference to obtain an abrupt-change audio signal; the spectral flatness of the abrupt-change audio signal is then calculated, and if the spectral flatness is greater than a preset flatness value, the audio signal is determined to contain a pop. The scheme can accurately detect whether a pop exists in the audio signal.

Description

Audio pop detection method, device and storage medium

Technical Field

The application relates to the technical field of communication, and in particular to an audio pop detection method, an audio pop detection device and a storage medium.

Background

With the continuous development of internet technology, the internet hosts a great number of audio files of many kinds, such as music, speech, audiobooks and chat recordings. Because of the series of complicated steps involved (recording, processing, transmission, storage and so on), the audio may exhibit distortion such as a beginning pop, glitches or breakpoints. A beginning pop is a relatively common distortion. "Beginning pop" means that there is a short pulse at the beginning of the music waveform that sounds like a "click"; this harsh, unnatural sound gives the listener a poor experience. Statistics on a song library show that the proportion of audio with a beginning pop is as high as 10%, and the presence of the pop degrades audio quality. It is therefore important to detect beginning pops in audio accurately.

Disclosure of Invention

The embodiment of the application provides an audio pop detection method, an audio pop detection device and a storage medium, which can be used to detect whether a pop exists in an audio signal, so that audio files containing pops are screened out effectively and quickly.

The embodiment of the application provides an audio pop detection method, which comprises the following steps:

acquiring an audio signal to be detected, and dividing the audio signal into a plurality of frame signals;

calculating the short-time energy difference between two adjacent frame signals;

obtaining, according to the short-time energy difference, the frame signals meeting a preset condition interval to obtain an abrupt-change audio signal;

and calculating the spectral flatness of the abrupt-change audio signal, and if the spectral flatness is greater than a preset flatness value, determining that the audio signal has a pop.

Optionally, in some embodiments, in the audio pop detection method, the dividing the audio signal into a plurality of frame signals includes:

selecting, in the time domain, a signal of a preset time period starting from the first frame of the audio signal to obtain a beginning audio signal;

dividing the beginning audio signal into a plurality of frame signals.

Optionally, in some embodiments, in the audio pop detection method, the calculating a short-time energy difference between two adjacent frame signals includes:

calculating the short-time energy of each frame signal;

acquiring the time of each frame signal;

and sequentially calculating the difference between the short-time energies of two adjacent frame signals according to the time sequence of the frame signals to obtain the short-time energy difference of the two adjacent frame signals.

Optionally, in some embodiments, in the audio pop detection method, the obtaining, according to the short-time energy difference, the frame signals meeting a preset condition interval to obtain an abrupt-change audio signal includes:

acquiring two frame signals whose short-time energy difference is greater than a preset threshold, and determining the latter of the two frame signals in time order as a start frame signal;

acquiring, after the start frame signal, two frame signals whose short-time energy difference is smaller than the negative of the preset threshold, and determining the latter of the two frame signals in time order as an end frame signal;

and acquiring the signal between the start frame signal and the end frame signal to obtain the abrupt-change audio signal.

Optionally, in some embodiments, in the audio pop detection method, the acquiring, after the start frame signal, two frame signals whose short-time energy difference is smaller than the negative of the preset threshold, and determining the latter of the two frame signals in time order as an end frame signal includes:

sequentially judging, in time order after the start frame signal, whether the short-time energy difference is smaller than the negative of the preset threshold;

and when the short-time energy difference is detected to be smaller than the negative of the preset threshold for the first time, determining the latter, in time order, of the two corresponding frame signals as the end frame signal.

Optionally, in some embodiments, in the audio pop detection method, the calculating the spectral flatness of the abrupt-change audio signal includes:

detecting a peak position of the abrupt-change audio signal;

taking a fixed number of sampling points before and after the peak position to form a pop audio frame;

and calculating the spectral flatness of the pop audio frame.

Optionally, in some embodiments, in the audio pop detection method, the determining that the audio signal has a pop if the spectral flatness is greater than a preset flatness value includes:

judging whether the spectral flatness is greater than the preset flatness value;

if the spectral flatness is greater than the preset flatness value, determining that the audio signal has a pop;

and if the spectral flatness is smaller than the preset flatness value, determining that the audio signal does not have a pop.

Optionally, in some embodiments, in the audio pop detection method, after determining that the audio signal has a pop if the spectral flatness is greater than a preset flatness value, the method further includes:

returning to the step of obtaining, according to the short-time energy difference, the frame signals meeting the preset condition interval to obtain an abrupt-change audio signal, until detection of the audio signal to be detected is finished.

Correspondingly, an embodiment of the application further provides an audio pop detection device, which includes:

a framing module, used for acquiring an audio signal to be detected and dividing the audio signal into a plurality of frame signals;

a calculating module, used for calculating the short-time energy difference between two adjacent frame signals;

an acquisition module, used for obtaining, according to the short-time energy difference, the frame signals meeting a preset condition interval to obtain an abrupt-change audio signal;

and a judging module, used for calculating the spectral flatness of the abrupt-change audio signal and, if the spectral flatness is greater than a preset flatness value, determining that the audio signal has a pop.

Optionally, in some embodiments, in the audio pop detection device, the framing module includes:

a selection submodule, used for selecting, in the time domain, a signal of a preset time period starting from the first frame of the audio signal to obtain a beginning audio signal;

and a framing submodule, used for dividing the beginning audio signal into a plurality of frame signals.

Optionally, in some embodiments, in the audio pop detection apparatus, the calculation module includes:

the energy submodule is used for calculating the short-time energy of each frame signal;

the acquisition submodule is used for acquiring the time of each frame signal;

and the energy difference submodule is used for sequentially calculating the difference between the short-time energies of two adjacent frame signals according to the time sequence of the frame signals to obtain the short-time energy difference of the two adjacent frame signals.

Optionally, in some embodiments, in the audio pop detection device, the energy difference submodule is specifically configured to: acquire two frame signals whose short-time energy difference is greater than a preset threshold, and determine the latter of the two frame signals in time order as a start frame signal; acquire, after the start frame signal, two frame signals whose short-time energy difference is smaller than the negative of the preset threshold, and determine the latter of the two frame signals in time order as an end frame signal; and acquire the signal between the start frame signal and the end frame signal to obtain an abrupt-change audio signal.

Optionally, in some embodiments, in the audio pop detection device, the energy difference submodule is specifically configured to: sequentially judge, in time order after the start frame signal, whether the short-time energy difference is smaller than the negative of the preset threshold; and when the short-time energy difference is detected to be smaller than the negative of the preset threshold for the first time, determine the latter, in time order, of the two corresponding frame signals as the end frame signal.

Optionally, in some embodiments, in the audio pop detection device, the judging module includes:

a detection submodule, used for detecting a peak position of the abrupt-change audio signal;

a sampling submodule, used for taking a fixed number of sampling points before and after the peak position to form a pop audio frame;

and a calculating submodule, used for calculating the spectral flatness of the pop audio frame.

Optionally, in some embodiments, in the audio pop detection device, the judging module is specifically configured to: judge whether the spectral flatness is greater than a preset flatness value; if the spectral flatness is greater than the preset flatness value, determine that the audio signal has a pop; and if the spectral flatness is smaller than the preset flatness value, determine that the audio signal does not have a pop.

Optionally, in some embodiments, the audio pop detection device further includes:

a detection module, used for returning to the step of obtaining, according to the short-time energy difference, the frame signals meeting the preset condition interval to obtain an abrupt-change audio signal, until detection of the audio signal to be detected is finished.

In addition, a storage medium is further provided. The storage medium stores a plurality of instructions suitable for being loaded by a processor to perform the steps of any one of the audio pop detection methods provided in the embodiments of the present application.

When pop detection is performed on an audio signal according to the embodiments of the application, the audio signal to be detected is acquired and divided into a plurality of frame signals; the short-time energy difference between two adjacent frame signals is calculated; the frame signals meeting a preset condition interval are obtained according to the short-time energy difference to obtain an abrupt-change audio signal; the spectral flatness of the abrupt-change audio signal is then calculated, and if the spectral flatness is greater than a preset flatness value, the audio signal is determined to have a pop. In this scheme the audio signal is divided into frames, the time-domain short-time energy of each frame is calculated, the frame positions where the energy changes abruptly are located through the short-time energy difference to find the abrupt-change audio signal, and the spectral flatness of the abrupt-change audio signal is then calculated, so that audio files containing pops are accurately screened out by means of the spectral flatness.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1a is a schematic scene diagram of an audio pop detection method according to an embodiment of the present application;

Fig. 1b is a first flowchart of an audio pop detection method according to an embodiment of the present application;

Fig. 2a is a second flowchart of an audio pop detection method according to an embodiment of the present application;

Fig. 2b is a schematic diagram of an audio signal in an audio pop detection method according to an embodiment of the present application;

Fig. 3a is a schematic diagram of a first structure of an audio pop detection device according to an embodiment of the present application;

Fig. 3b is a schematic diagram of a second structure of an audio pop detection device according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a network device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first", "second", and "third", etc. in this application are used to distinguish between different objects and not to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions.

The embodiment of the application provides an audio plosive detection method, an audio plosive detection device and a storage medium.

For example, referring to Fig. 1a, when a user needs to perform beginning-pop detection on a large number of audio files, the network device may be triggered to process the audio files. The network device may acquire an audio signal to be detected, divide the audio signal into a plurality of frame signals, calculate the short-time energy difference between two adjacent frame signals, obtain the frame signals meeting a preset condition interval according to the short-time energy difference to obtain an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than a preset flatness value, the audio signal is determined to have a pop.

Each is described in detail below. The order of the following embodiments is not intended to limit their preferred order.

The present embodiment is described from the perspective of an audio pop detection device. The device may be integrated in a network device; the network device may be a terminal or a server, and the terminal may include a tablet computer, a notebook computer, a personal computer (PC), or the like.

The embodiment of the application provides an audio pop detection method, which comprises: acquiring an audio signal to be detected; dividing the audio signal into a plurality of frame signals; calculating the short-time energy difference between two adjacent frame signals; obtaining the frame signals meeting a preset condition interval according to the short-time energy difference to obtain an abrupt-change audio signal; calculating the spectral flatness of the abrupt-change audio signal; and determining that the audio signal has a pop if the spectral flatness is greater than a preset flatness value.

As shown in fig. 1b, the specific process of the audio pop detection method may be as follows:

101. Acquire an audio signal to be detected, and divide the audio signal into a plurality of frame signals.

For example, an audio file may be obtained from various sources such as a network, a mobile phone or a video and then provided to the audio pop detection device. That is, the audio pop detection device may receive audio files obtained from various sources, extract the audio signal to be detected from the audio file, and then divide the audio signal into a plurality of frame signals.

The audio file may be a sound file or a Musical Instrument Digital Interface (MIDI) file. A sound file holds original sound recorded by a recording device and directly stores binary sample data of the real sound; a MIDI file is a sequence of musical performance instructions that can be played using a sound output device or an electronic musical instrument connected to a computer. An audio signal is an information carrier for the frequency and amplitude changes of regular sound waves carrying speech, music and sound effects. According to the characteristics of the sound waves, audio information can be classified into regular audio and irregular sound. Regular audio can be further divided into speech, music and sound effects. Regular audio is a continuously varying analog signal that can be represented by a continuous curve called a sound wave.

To improve detection efficiency, a detection time period may be set at the beginning of the audio signal in the time domain, and only the audio signal within that time period is divided into frames. That is, the step "dividing the audio signal into a plurality of frame signals" may specifically be as follows:

selecting, in the time domain, a signal of a preset time period starting from the first frame of the audio signal to obtain a beginning audio signal;

dividing the beginning audio signal into a plurality of frame signals.
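Selecting the beginning segment and framing it can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_beginning_into_frames`, the 5-second default for the preset time period, the 1024-sample frame size and the use of non-overlapping frames are all assumptions, since the text does not fix any of them.

```python
from typing import List, Sequence


def split_beginning_into_frames(
    samples: Sequence[float],
    sample_rate: int,
    head_seconds: float = 5.0,  # assumed length of the preset time period
    frame_len: int = 1024,      # assumed frame size in samples
) -> List[List[float]]:
    """Keep only the first `head_seconds` of the signal and cut it into frames."""
    head = samples[: int(head_seconds * sample_rate)]
    # Non-overlapping frames (an assumption); a trailing partial frame is dropped.
    n_frames = len(head) // frame_len
    return [list(head[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


# 10 s of silence at 1 kHz; only the first 4 s are framed.
frames = split_beginning_into_frames(
    [0.0] * 10_000, sample_rate=1000, head_seconds=4.0, frame_len=512
)
print(len(frames), len(frames[0]))  # 7 512
```

Restricting the work to the head of the file keeps the cost independent of track length, which is the efficiency argument the text makes.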

102. Calculate the short-time energy difference between two adjacent frame signals.

For example, the short-time energy of each frame signal may be calculated, the time of each frame signal may then be obtained, and the difference between the short-time energies of each two adjacent frame signals may be calculated in turn according to the time order of the frame signals, giving the short-time energy difference between two adjacent frame signals.

The short-time energy represents the intensity of the signal at different moments. The short-time energy E(t) of each frame signal may be calculated as follows:

where N is the number of sampling points in each frame signal, n is the index of a sampling point within the frame signal, t is the position of the frame signal, and E(t) is the short-time energy of the t-th frame signal.
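The symbols listed above match the usual sum-of-squares definition of short-time energy. As a sketch only, assuming that x_t(n) (a symbol introduced here, not defined in the text) denotes the n-th sampling point of the t-th frame signal and that no window is applied, the energy would read:

```latex
E(t) = \sum_{n=0}^{N-1} x_t(n)^{2}
```

Whether the sum is indexed from 0 or 1, and whether it is normalized by N, does not change which frames stand out, only the scale against which the threshold Th has to be chosen.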

The short-time energy difference between two adjacent frame signals may be calculated as follows:

p_t = E(t) - E(t-1)

where t is the position of the frame and p_t is the short-time energy difference between two adjacent frame signals.
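The energy and difference calculation can be sketched in a few lines. The sum-of-squares energy used here is the standard definition and is an assumption about the formula intended by the text; the function names are hypothetical.

```python
from typing import List, Sequence


def short_time_energy(frame: Sequence[float]) -> float:
    """Sum of squared samples of one frame (assumed definition of E(t))."""
    return sum(x * x for x in frame)


def energy_differences(frames: Sequence[Sequence[float]]) -> List[float]:
    """Return p_t = E(t) - E(t-1) for t = 1 .. len(frames) - 1.

    With 0-based frame indices, element i of the result is p_(i+1).
    """
    energies = [short_time_energy(f) for f in frames]
    return [energies[t] - energies[t - 1] for t in range(1, len(energies))]


# A quiet frame, a loud frame, then quiet again: a rise followed by a fall.
diffs = energy_differences([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
print(diffs)  # [2.0, -2.0]
```

A pop shows up as a large positive difference followed shortly by a large negative one, which is exactly the pattern the next step looks for.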

103. Obtain, according to the short-time energy difference, the frame signals meeting a preset condition interval to obtain an abrupt-change audio signal.

The preset condition may be set in various ways; for example, it may be set flexibly according to the requirements of the actual application, or may be preset and stored in the network device. In addition, the preset condition may be built into the network device, or may be stored in a memory and transmitted to the network device, and so on.

For example, two frame signals whose short-time energy difference is greater than a preset threshold may be acquired, and the latter of the two frame signals in time order is determined as the start frame signal; after the start frame signal, two frame signals whose short-time energy difference is smaller than the negative of the preset threshold are acquired, and the latter of the two frame signals in time order is determined as the end frame signal; the signal between the start frame signal and the end frame signal is then acquired to obtain the abrupt-change audio signal.

The preset threshold, abbreviated as Th, may be set in various ways; for example, it may be set flexibly according to the requirements of the actual application, or may be preset and stored in the network device. In addition, the preset threshold may be built into the network device, or may be stored in a memory and transmitted to the network device, and so on.

To bring the subsequently calculated spectral flatness closer to the true value for the preset condition interval, and thereby make the detection result more accurate, the end frame signal may be taken as the latter of the first pair of adjacent frame signals after the start frame signal whose short-time energy difference is smaller than the negative of the preset threshold. That is, the step "acquiring, after the start frame signal, two frame signals whose short-time energy difference is smaller than the negative of the preset threshold, and determining the latter of the two frame signals in time order as the end frame signal" may specifically be as follows:

sequentially judging, in time order after the start frame signal, whether the short-time energy difference is smaller than the negative of the preset threshold;

when the short-time energy difference is detected to be smaller than the negative of the preset threshold for the first time, determining the latter, in time order, of the two corresponding frame signals as the end frame signal.
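The start-frame and end-frame search can be sketched as a single pass over the short-time energy differences. The function name `find_burst_interval`, the `search_from` parameter and the 0-based frame indexing are assumptions made for illustration; the rule itself (first rise above Th, then first fall below -Th) follows the text.

```python
from typing import Optional, Sequence, Tuple


def find_burst_interval(
    diffs: Sequence[float], threshold: float, search_from: int = 1
) -> Optional[Tuple[int, int]]:
    """Locate one abrupt-change interval.

    diffs[t - 1] holds p_t = E(t) - E(t-1) for 0-based frame index t >= 1.
    Returns (start_frame, end_frame), or None if no complete interval exists.
    """
    n_frames = len(diffs) + 1
    start = None
    for t in range(max(search_from, 1), n_frames):
        p_t = diffs[t - 1]
        if start is None:
            if p_t > threshold:
                start = t        # latter frame of the pair with the energy jump
        elif p_t < -threshold:
            return start, t      # first sharp drop after the start frame
    return None


# Energies 0.1, 0.1, 5.0, 0.2, 0.1 give the differences below.
interval = find_burst_interval([0.0, 4.9, -4.8, -0.1], threshold=1.0)
print(interval)  # (2, 3)
```

With 0-based indices the result (2, 3) corresponds to the third and fourth frame signals in 1-based numbering. Stopping at the first qualifying drop keeps the interval short, so the later flatness measurement is dominated by the burst rather than by the audio that follows it.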

104. Calculate the spectral flatness of the abrupt-change audio signal, and determine that the audio signal has a pop if the spectral flatness is greater than a preset flatness value.

For example, a Fourier transform may be applied to the abrupt-change audio signal to obtain a frequency-domain abrupt-change audio signal, and the spectral flatness of the frequency-domain abrupt-change audio signal is calculated. It is then judged whether the spectral flatness is greater than the preset flatness value: if the spectral flatness is greater than the preset flatness value, the audio signal is determined to have a pop; if the spectral flatness is smaller than the preset flatness value, the audio signal is determined not to have a pop.

The preset flatness value may be set in various ways; for example, it may be set flexibly according to the requirements of the actual application, or may be preset and stored in the network device. In addition, the preset flatness value may be built into the network device, or may be stored in a memory and transmitted to the network device, and so on.

Spectral flatness, also called Wiener entropy, is a measure used in digital signal processing to characterize an audio spectrum. It is measured by the ratio of the geometric mean (GM) to the arithmetic mean (AM) of the signal spectrum and is commonly referred to as the Spectral Flatness Measure (SFM). Namely:

where w(n) is a window function, k is a frequency point of the frequency-domain abrupt-change audio signal, and X is the frequency-domain abrupt-change audio signal. The window function may be a rectangular window, a triangular window, a Hanning window, or the like.

F(t) = GM(t)/AM(t)

where GM(t) is the geometric mean of the frequency-domain abrupt-change audio signal, AM(t) is the arithmetic mean of the frequency-domain abrupt-change audio signal, and F(t) is the spectral flatness.
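The quantities named above fit the conventional windowed-DFT form of the spectral flatness measure. As a sketch under stated assumptions (x_t(n) for the time-domain samples and K for the number of frequency points are symbols introduced here; the text also does not say whether the magnitude or the power spectrum is averaged, and the magnitude is assumed):

```latex
X_t(k) = \sum_{n=0}^{N-1} x_t(n)\, w(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, K-1

\mathrm{GM}(t) = \Bigl( \prod_{k=0}^{K-1} \lvert X_t(k) \rvert \Bigr)^{1/K}, \qquad
\mathrm{AM}(t) = \frac{1}{K} \sum_{k=0}^{K-1} \lvert X_t(k) \rvert

F(t) = \frac{\mathrm{GM}(t)}{\mathrm{AM}(t)}
```

By the inequality of arithmetic and geometric means, F(t) lies between 0 and 1. It approaches 1 when energy is spread evenly across frequency, as for an impulse or white noise, and approaches 0 when energy is concentrated in a few frequency points, as for a sustained tone. This is why a value above the preset flatness value is taken as evidence of a click-like pop rather than a musical onset.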

For example, to further improve detection accuracy and ensure that the audio experienced by the user is free of defects, the peak position of the abrupt-change audio signal may be detected first, and N/2 sampling points may then be taken on each side of it to form a pop audio frame centered on the peak position; that is, the pop audio frame has N sampling points in total. The step "calculating the spectral flatness of the abrupt-change audio signal" may therefore specifically be as follows:

detecting the peak position of the abrupt-change audio signal;

taking a fixed number of sampling points before and after the peak position to form a pop audio frame;

calculating the spectral flatness of the pop audio frame.
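The peak-centered frame and its flatness can be sketched with the standard library alone. Several choices here are assumptions rather than details from the text: a rectangular window, a naive DFT over the non-negative frequency points, the magnitude (not power) spectrum, a small `eps` to avoid taking the logarithm of zero, and clipping the frame when the peak lies near the edge of the burst. The function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def pop_frame_around_peak(burst: Sequence[float], n: int) -> List[float]:
    """Take n // 2 samples on each side of the largest-magnitude sample."""
    peak = max(range(len(burst)), key=lambda i: abs(burst[i]))
    lo = max(0, peak - n // 2)            # clipped at the burst boundaries
    hi = min(len(burst), peak + n // 2)
    return list(burst[lo:hi])


def spectral_flatness(frame: Sequence[float], eps: float = 1e-12) -> float:
    """Geometric mean over arithmetic mean of the DFT magnitudes."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):           # non-negative frequency points
        acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        mags.append(abs(acc) + eps)
    gm = math.exp(sum(math.log(m) for m in mags) / len(mags))
    am = sum(mags) / len(mags)
    return gm / am


impulse = [0.0] * 16 + [1.0] + [0.0] * 15                    # click-like burst
tone = [math.sin(2 * math.pi * 4 * i / 32) for i in range(32)]  # pure sinusoid

click_frame = pop_frame_around_peak(impulse, 16)
print(len(click_frame))                   # 16
print(spectral_flatness(click_frame) > 0.9, spectral_flatness(tone) < 0.1)
```

The impulse has the same magnitude at every frequency point, so its flatness is essentially 1, while the sinusoid concentrates its energy in one point and scores near 0. For real audio the naive DFT would normally be replaced by an FFT routine.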

After a pop is detected, to make subsequent repair accurate, the short-time energy differences may continue to be examined for frame signals meeting the preset condition interval until the whole audio signal to be detected has been examined. That is, after the step "if the spectral flatness is greater than a preset flatness value, determining that the audio signal has a pop", the method may further include:

returning to the step of obtaining, according to the short-time energy difference, the frame signals meeting the preset condition interval to obtain an abrupt-change audio signal, until detection of the audio signal to be detected is finished.

After detection of the audio signal is finished, a detection-result interface may be generated. The interface includes a detection interface, which can receive the detection result of the audio signal to be detected and, once detection is finished, indicates whether an audio pop was detected.

As can be seen from the above, in this embodiment, when pop detection is performed on an audio signal, the audio signal to be detected is acquired and divided into a plurality of frame signals; the short-time energy difference between two adjacent frame signals is calculated; the frame signals meeting the preset condition interval are obtained according to the short-time energy difference to obtain an abrupt-change audio signal; the spectral flatness of the abrupt-change audio signal is then calculated, and if the spectral flatness is greater than the preset flatness value, the audio signal is determined to have a pop. In this scheme the audio signal is divided into frames, the time-domain short-time energy of each frame is calculated, the frame positions where the energy changes abruptly are located through the short-time energy difference to find the abrupt-change audio signal, and the spectral flatness of the abrupt-change audio signal is then calculated, so that audio files containing pops are accurately screened out by means of the spectral flatness.

According to the method described in the foregoing embodiment, the following will be described in further detail by way of example in which the audio pop detection apparatus is specifically integrated in a network device.

As shown in fig. 2a, a specific process of an audio pop detection method may be as follows:

201. The network device acquires the audio signal to be detected.

For example, a user may obtain audio files from various sources such as a network, a mobile phone or a video and provide them to the network device; the network device may receive the audio files and extract the audio signal to be detected from them.

202. The network device divides the audio signal into frames to obtain frame signals.

For example, to improve detection efficiency, the network device may set a detection time period at the beginning of the audio signal in the time domain and divide only the audio signal within that time period into frames. That is, the step "dividing the audio signal into a plurality of frame signals" may specifically be as follows:

selecting, in the time domain, a signal of a preset time period starting from the first frame of the audio signal to obtain a beginning audio signal;

dividing the beginning audio signal into a plurality of frame signals.

203. The network device calculates the short-time energy difference of two adjacent frame signals.

For example, the network device may specifically calculate the short-time energy of each frame signal, then obtain the time of each frame signal, and sequentially calculate the difference between the short-time energies of two adjacent frame signals according to the time sequence of the frame signal, so as to obtain the short-time energy difference between two adjacent frame signals.

p_t = E(t) - E(t-1)

204. The network device obtains, according to the short-time energy difference, the frame signals meeting the preset condition interval to obtain an abrupt-change audio signal.

For example, the network device may acquire two frame signals whose short-time energy difference is greater than a preset threshold and determine the latter of the two frame signals in time order as the start frame signal; acquire, after the start frame signal, two frame signals whose short-time energy difference is smaller than the negative of the preset threshold and determine the latter of the two frame signals in time order as the end frame signal; and acquire the signal between the start frame signal and the end frame signal to obtain the abrupt-change audio signal. For example, as shown in Fig. 2b, the short-time energy difference p_3 between E(2) and E(3) is calculated. If p_3 > Th, the start frame signal is the third frame signal a. The short-time energy differences of adjacent frame signals after the third frame signal are then calculated in turn; if the short-time energy difference p_4 between E(3) and E(4) satisfies p_4 < -Th, the end frame signal is the fourth frame signal b, and the signal from the third frame signal a to the fourth frame signal b is taken as the abrupt-change audio signal of the audio signal.

The preset threshold may be set in various ways; for example, it may be set flexibly according to the requirements of the actual application, or may be preset and stored in the network device. In addition, the preset threshold may be built into the network device, or may be stored in a memory and transmitted to the network device, and so on.

205. The network device calculates the spectral flatness of the abrupt audio signal.

For example, the network device may apply a Fourier transform to the abrupt-change audio signal to obtain a frequency-domain abrupt-change audio signal, and then calculate the spectral flatness of the frequency-domain abrupt-change audio signal.

Spectral flatness, also called Wiener entropy, is a measure used in digital signal processing to characterize an audio spectrum. It is measured by the ratio of the geometric mean (GM) to the arithmetic mean (AM) of the signal spectrum and is commonly referred to as the Spectral Flatness Measure (SFM). Namely:

F(t) = GM(t)/AM(t)

For example, in order to further improve the detection accuracy and ensure that the audio experienced by the user has no flaws, the network device may first detect the peak position of the abrupt change audio signal, and then take the same plurality of sampling points to the left and right to form a plosive audio frame with the peak position as the center, that is, the peak position of the abrupt change audio signal may be specifically detected; a plurality of fixed sampling points are respectively taken before and after the peak position to form a plosive audio frame; the spectral flatness of the pop audio frame is calculated.

For example, as shown in fig. 2b, with the peak position of the abrupt change audio signal as the center, N/2 sampling points are respectively taken from the left and right to form a pop audio frame c, that is, the pop audio frame c has N sampling points in total, and then the spectral flatness of the pop audio frame c is calculated.
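A minimal sketch of forming the pop audio frame c is shown below. It assumes that the "peak" is the sample with the largest absolute amplitude inside the abrupt change audio signal, and that the frame is simply clipped at the signal boundaries; the text specifies neither detail, and the function name is hypothetical.

```python
from typing import List


def extract_pop_frame(signal: List[float], seg_start: int, seg_end: int, n: int) -> List[float]:
    """Return up to n samples centered on the peak of signal[seg_start:seg_end]."""
    # Assumption: the peak is the sample with the largest absolute amplitude.
    peak = max(range(seg_start, seg_end), key=lambda i: abs(signal[i]))
    lo = max(0, peak - n // 2)            # N/2 samples to the left of the peak
    hi = min(len(signal), peak + n // 2)  # N/2 samples to the right of the peak
    return signal[lo:hi]


samples = [0.0, 0.0, 0.0, 0.0, 0.1, 0.9, -0.2, 0.0, 0.0, 0.0]
frame_c = extract_pop_frame(samples, seg_start=3, seg_end=8, n=4)
```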

206. The network device determines whether the spectral flatness is greater than a preset flatness value; if so, it determines that the audio signal has a pop.

For example, the network device may determine whether the spectral flatness is greater than a preset flatness value. If the spectral flatness is greater than the preset flatness value, the audio signal is determined to have a pop; if the spectral flatness is smaller than the preset flatness value, the audio signal is determined not to have a pop.

207. The network device determines whether detection of the audio signal to be detected is finished; if not, it returns to step 204, obtaining the frame signal meeting the preset condition interval according to the short-time energy difference to obtain the abrupt change audio signal, until detection of the audio signal to be detected is finished.

For example, after a pop is detected, for the accuracy of subsequent repair, the network device may continue to examine the short-time energy differences to obtain further frame signals meeting the preset condition interval, that is, return to the step of obtaining the frame signal meeting the preset condition interval according to the short-time energy difference to obtain the abrupt change audio signal, until all of the audio signal to be detected has been examined. For example, after the spectral flatness of the abrupt change audio signal has been compared with the preset flatness value, and regardless of the result of that comparison, the frame signals after the fourth frame signal may continue to be examined until all frame signals have been examined and the detection result is obtained.
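The "keep scanning until every frame has been examined" behaviour of step 207 could be sketched as a generator that yields each abrupt segment in turn. The name `iter_abrupt_segments` and the 0-based frame indices are assumptions for illustration only.

```python
from typing import Iterator, List, Tuple


def iter_abrupt_segments(energies: List[float], th: float) -> Iterator[Tuple[int, int]]:
    """Yield every (start_frame, end_frame) pair, scanning all frames in time order."""
    start = None
    for t in range(1, len(energies)):
        p_t = energies[t] - energies[t - 1]  # short-time energy difference
        if start is None and p_t > th:
            start = t                        # upward jump: start frame
        elif start is not None and p_t < -th:
            yield start, t                   # first drop below -Th: end frame
            start = None                     # continue with the frames after the end frame


all_segments = list(iter_abrupt_segments([0.0, 5.0, 0.0, 0.0, 6.0, 6.0, 0.0], th=1.0))
```

Each yielded segment would then go through the spectral flatness check, whatever the outcome for the previous segment was.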

Optionally, after the audio signal is detected, an interface for the detection result may be generated. The interface includes a detection interface that can receive the detection result of the audio signal to be detected and, after detection is completed, prompts whether an audio pop has been detected.

Optionally, after the beginning pop is detected, the signal containing the pop may be repaired or replaced to ensure that the user can listen to an audio file of good quality.

As can be seen from the above, when performing pop detection on an audio signal, the network device of this embodiment may acquire the audio signal to be detected, divide the audio signal into a plurality of frame signals, calculate the short-time energy difference between two adjacent frame signals, obtain a frame signal meeting a preset condition interval according to the short-time energy difference to obtain an abrupt change audio signal, calculate the spectral flatness of the abrupt change audio signal, and determine that the audio signal has a pop if the spectral flatness is greater than a preset flatness value. In this scheme, the audio signal is framed, the time-domain short-time energy of each frame is calculated, the frame positions where the energy changes abruptly are located through the short-time energy difference to find the abrupt change audio signal, the spectral flatness of the abrupt change audio signal is then calculated, and audio files with pops are accurately screened out through the spectral flatness.
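The overall flow summarized above might look like the following end-to-end sketch. Everything here is an assumption made for illustration: non-overlapping frames, short-time energy as the sum of squared samples, the peak taken as the largest absolute sample, flatness computed on the power spectrum, and all function and parameter names.

```python
import cmath
import math
from typing import List


def _flatness(x: List[float], eps: float = 1e-12) -> float:
    """GM / AM of the power spectrum (naive DFT, bins 0..N//2); eps avoids log(0)."""
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        s = sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
        power.append(abs(s) ** 2 + eps)
    gm = math.exp(sum(math.log(v) for v in power) / len(power))
    return gm / (sum(power) / len(power))


def detect_pop(signal: List[float], frame_len: int, th: float, flat_th: float, n: int) -> bool:
    """Return True if any abrupt segment has spectral flatness above flat_th."""
    # Framing (assumed non-overlapping) and short-time energy (assumed sum of squares).
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, frame_len)]
    energy = [sum(s * s for s in f) for f in frames]
    start = None
    for t in range(1, len(energy)):
        p_t = energy[t] - energy[t - 1]
        if start is None and p_t > th:
            start = t
        elif start is not None and p_t < -th:
            # Samples covered by the start frame through the end frame (inclusive, assumed).
            lo, hi = start * frame_len, (t + 1) * frame_len
            peak = max(range(lo, hi), key=lambda i: abs(signal[i]))
            a, b = max(0, peak - n // 2), min(len(signal), peak + n // 2)
            if _flatness(signal[a:b]) > flat_th:
                return True
            start = None  # keep scanning the remaining frames
    return False


click = [0.0] * 64
click[20] = 1.0  # a single-sample click in otherwise silent audio
has_pop = detect_pop(click, frame_len=8, th=0.5, flat_th=0.5, n=16)
```

The threshold values here are arbitrary; in practice they would be the preset threshold and preset flatness value chosen for the application.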

In addition, the scheme can also repair or replace the beginning pop sound, so that the quality of the audio file can be improved, and the user experience is improved.

In order to better implement the audio pop detection method provided by the embodiments of the present application, an embodiment of the present application further provides an audio pop detection device, which may be integrated in a network device such as a mobile phone, a tablet computer, or a handheld computer. Terms have the same meanings as in the audio pop detection method above, and for specific implementation details, refer to the description in the method embodiments.

For example, as shown in fig. 3a, the audio pop detection apparatus may include a framing module 301, a calculating module 302, an obtaining module 303, and a determining module 304, as follows:

(1) a framing module 301;

the framing module 301 is configured to acquire an audio signal to be detected and divide the audio signal into a plurality of frame signals.

For example, audio files may be acquired from various sources such as a network, a mobile phone, or a video and then provided to the audio pop detection device; that is, the framing module 301 may receive the audio files acquired from these sources, extract the audio signal to be detected from the files, and then divide the audio signal into a plurality of frame signals.

In order to improve the detection efficiency, a detection time period may be set at the beginning of the time domain of the audio signal, and the audio signal in the time period may be subjected to framing processing, that is, the framing module may include a selection sub-module and a framing sub-module, as follows:

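the selection submodule is used for selecting, in the time domain, a signal of a preset time period starting from the first frame of the audio signal to obtain a beginning audio signal;

and the framing submodule is used for dividing the beginning audio signal into a plurality of frame signals.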
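Selecting the beginning period and framing it could be sketched as below. The function name, the parameters (`sample_rate`, `seconds`, `frame_len`) and the use of non-overlapping frames are assumptions; the text does not say whether frames overlap.

```python
from typing import List


def frame_beginning(signal: List[float], sample_rate: int, seconds: float, frame_len: int) -> List[List[float]]:
    """Keep the first `seconds` of audio and split it into non-overlapping frames."""
    head = signal[: int(sample_rate * seconds)]  # the beginning audio signal
    # A trailing partial frame is dropped so every frame has exactly frame_len samples.
    return [head[i:i + frame_len] for i in range(0, len(head) - frame_len + 1, frame_len)]


frames = frame_beginning([float(v) for v in range(100)], sample_rate=10, seconds=2.5, frame_len=10)
```

Restricting the analysis to the beginning period keeps the number of frames small, which is the efficiency gain the passage refers to.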

(2) A calculation module 302;

a calculating module 302, configured to calculate a short-time energy difference between two adjacent frame signals.

For example, the calculation module 302 may include an energy sub-module, an acquisition sub-module, and an energy difference sub-module, as follows:

the energy submodule is used for calculating the short-time energy of each frame signal;

the acquisition submodule is used for acquiring the time of each frame signal;

and the energy difference submodule is used for sequentially calculating the difference between the short-time energies of two adjacent frame signals according to the time sequence of the frame signal to obtain the short-time energy difference of the two adjacent frame signals.

p_t = E(t) - E(t-1)

where t is the position of the frame, and p_t is the short-time energy difference of two adjacent frame signals.
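The difference formula can be illustrated with a few lines of Python. The definition of short-time energy as the sum of squared samples is an assumption made for this sketch (this passage does not define E(t)), as are the function names.

```python
from typing import List


def short_time_energy(frame: List[float]) -> float:
    """Assumed definition: sum of squared samples in the frame."""
    return sum(s * s for s in frame)


def energy_differences(frames: List[List[float]]) -> List[float]:
    """Return the differences p_t = E(t) - E(t-1) for adjacent frames, in time order."""
    e = [short_time_energy(f) for f in frames]
    return [e[t] - e[t - 1] for t in range(1, len(e))]


diffs = energy_differences([[0.0, 0.0], [1.0, 2.0], [0.0, 1.0]])
```

Here the energies are 0, 5 and 1, so the differences are a rise of 5 followed by a drop of 4.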

(3) An acquisition module 303;

an obtaining module 303, configured to obtain a frame signal meeting a preset condition interval according to the short-time energy difference, so as to obtain a sudden-change audio signal.

For example, the obtaining module 303 may specifically obtain two frame signals with the short-time energy difference being greater than a preset threshold, determine a next frame signal of the two frame signals as a start frame signal according to a time sequence, obtain two frame signals with the short-time energy difference being smaller than a negative value of the preset threshold after the start frame signal, determine a next frame signal of the two frame signals as an end frame signal according to the time sequence, and obtain a signal between the start frame signal and the end frame signal, so as to obtain the abrupt change audio signal.

To bring the subsequently calculated spectral flatness closer to its true value and make the detection result more accurate, the later of the two frame signals whose short-time energy difference is first detected, after the start frame signal, to be smaller than the negative of the preset threshold may be taken as the end frame signal. That is, the obtaining module may sequentially judge, in time order after the start frame signal, whether the short-time energy difference is smaller than the negative of the preset threshold and, when this is first detected, determine the later of the two frame signals concerned as the end frame signal.

(4) A judging module 304;

the determining module 304 is configured to calculate a spectral flatness of the abrupt change audio signal, and determine that the audio signal has a pop sound if the spectral flatness is greater than a preset flatness value.

For example, the determining module 304 may perform a Fourier transform on the abrupt change audio signal to obtain a frequency-domain abrupt change audio signal, calculate the spectral flatness of the frequency-domain abrupt change audio signal, and then determine whether the spectral flatness is greater than a preset flatness value. If the spectral flatness is greater than the preset flatness value, the audio signal is determined to have a pop; if the spectral flatness is smaller than the preset flatness value, the audio signal is determined not to have a pop.

F(t) = GM(t)/AM(t)
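Written out for a frame with K non-negative spectral values X_1, ..., X_K (the symbols K and X_k are introduced here only for illustration, and whether X_k is a magnitude or a power value is not fixed by the text), the two means and the resulting bounds are:

```latex
\mathrm{GM} = \Bigl(\prod_{k=1}^{K} X_k\Bigr)^{1/K}, \qquad
\mathrm{AM} = \frac{1}{K}\sum_{k=1}^{K} X_k, \qquad
0 \le F = \frac{\mathrm{GM}}{\mathrm{AM}} \le 1
```

The upper bound follows from the AM-GM inequality, with F = 1 only when all X_k are equal (a perfectly flat spectrum, as produced by an ideal impulse). F approaches 0 when the energy is concentrated in a few bins, as in tonal music, which is why a large flatness value points to a click-like pop.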

For example, in order to further improve the detection accuracy and ensure that the audio experienced by the user has no defects, the peak position of the abrupt change audio signal may be detected first, and then N/2 sampling points are taken from the left and right sides to form a pop audio frame with the peak position as the center, that is, the pop audio frame has N sampling points in total. Therefore, the determining module may specifically include a detecting sub-module, a sampling sub-module and a calculating sub-module, as follows:

the detecting submodule is used for detecting the peak position of the abrupt change audio signal;

the sampling submodule is used for taking a plurality of fixed sampling points before and after the peak position, respectively, to form a pop audio frame;

and the calculating submodule is used for calculating the spectral flatness of the pop audio frame.

After detecting a pop, for accuracy of subsequent repair, the method may continue to detect the short-time energy difference to obtain the frame signal satisfying the preset condition interval until all the audio signals to be detected are detected, that is, the audio pop detection apparatus, as shown in fig. 3b, may further include a detection module 305, as follows:

the detecting module 305 is configured to return to execute the step of obtaining the frame signal meeting the preset condition interval according to the short-time energy difference to obtain the abrupt change audio signal until the detection of the audio signal to be detected is completed.

It will be appreciated by those skilled in the art that the audio pop detection device shown in fig. 3a does not constitute a limitation of the device and may include more or fewer components than shown, or some components in combination, or a different arrangement of components. In addition, it should be noted that the specific implementation of each unit may refer to the foregoing method embodiment, and is not described herein again.

As can be seen from the above, when the audio pop detection device of this embodiment performs pop detection on an audio signal, the framing module 301 may acquire the audio signal to be detected and divide the audio signal into a plurality of frame signals, the calculating module 302 then calculates the short-time energy difference between two adjacent frame signals, the obtaining module 303 obtains a frame signal meeting a preset condition interval according to the short-time energy difference to obtain an abrupt change audio signal, and the judging module 304 calculates the spectral flatness of the abrupt change audio signal and determines that the audio signal has a pop if the spectral flatness is greater than a preset flatness value. In this scheme, the audio signal is framed, the time-domain short-time energy of each frame is calculated, the frame positions where the energy changes abruptly are located through the short-time energy difference to find the abrupt change audio signal, the spectral flatness of the abrupt change audio signal is then calculated, and audio files with pops are accurately screened out through the spectral flatness.

Correspondingly, the embodiment of the invention also provides network equipment, which can be equipment such as a server or a terminal and integrates any audio plosive detection device provided by the embodiment of the invention. Fig. 4 is a schematic diagram illustrating a network device according to an embodiment of the present invention, specifically:

the network device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the network device architecture shown in fig. 4 does not constitute a limitation of network devices and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. Wherein:

the processor 401 is a control center of the network device, connects various parts of the entire network device by using various interfaces and lines, and performs various functions of the network device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the network device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the network device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.

The network device further includes a power supply 403 for supplying power to each component, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are implemented through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

The network device may also include an input unit 404, where the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the network device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the network device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:

the method comprises the steps of obtaining an audio signal to be detected, dividing the audio signal into a plurality of frame signals, calculating a short-time energy difference between two adjacent frame signals, obtaining a frame signal meeting a preset condition interval according to the short-time energy difference to obtain a sudden change audio signal, calculating the spectral flatness of the sudden change audio signal, and determining that the audio signal has a pop sound if the spectral flatness is larger than a preset flatness value.

Optionally, dividing the audio signal into a plurality of frame signals may include:

selecting a signal with a preset time period from a first frame in a time domain to obtain a beginning audio signal; the beginning audio signal is divided into a plurality of frame signals.

Optionally, calculating the short-time energy difference between two adjacent frame signals may include:

calculating the short-time energy of each frame signal; acquiring the time of each frame signal; and sequentially calculating the difference between the short-time energies of two adjacent frame signals according to the time sequence of the frame signals to obtain the short-time energy difference of the two adjacent frame signals.

Optionally, obtaining a frame signal meeting a preset condition interval according to the short-time energy difference to obtain a sudden-change audio signal, where the obtaining may include:

acquiring two frame signals of which the short-time energy difference is larger than a preset threshold value, and determining the next frame signal in the two frame signals as a starting frame signal according to a time sequence; acquiring two frame signals of which the short-time energy difference is smaller than a preset threshold negative value after the starting frame signal, and determining the latter frame signal of the two frame signals as an ending frame signal according to a time sequence; and acquiring signals between the starting frame signal and the ending frame signal to obtain the abrupt change audio signal.

Optionally, acquiring two frame signals with the short-time energy difference smaller than a preset threshold negative value after the start frame signal, and determining a next frame signal of the two frame signals as an end frame signal according to the time sequence, may include:

sequentially judging, in time order after the start frame signal, whether the short-time energy difference is smaller than the negative of the preset threshold; and when the short-time energy difference is first detected to be smaller than the negative of the preset threshold, determining, according to the time sequence, the later of the two frame signals concerned as the end frame signal.

Optionally, calculating the spectral flatness of the abrupt change audio signal may include:

detecting a peak position of the abrupt audio signal; a plurality of fixed sampling points are respectively taken before and after the peak position to form a plosive audio frame; the spectral flatness of the pop audio frame is calculated.

Optionally, if the spectral flatness is greater than a preset flatness value, determining that the audio signal has a pop sound may include:

judging whether the spectral flatness is greater than a preset flatness value; if the spectral flatness is greater than the preset flatness value, determining that the audio signal has a pop; and if the spectral flatness is smaller than the preset flatness value, determining that the audio signal does not have a pop.

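Optionally, if the spectral flatness is greater than the preset flatness value, after determining that the pop exists in the audio signal, the method may further include:

returning to the step of obtaining the frame signal meeting the preset condition interval according to the short-time energy difference to obtain the abrupt change audio signal, until detection of the audio signal to be detected is finished.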

The above operations can be referred to the previous embodiments specifically, and are not described herein again.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the audio pop detection methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:

the method comprises the steps of obtaining an audio signal to be detected, dividing the audio signal into a plurality of frame signals, calculating a short-time energy difference between two adjacent frame signals, obtaining a frame signal meeting a preset condition interval according to the short-time energy difference to obtain a sudden change audio signal, calculating the spectral flatness of the sudden change audio signal, and determining that the audio signal has a pop if the spectral flatness is greater than a preset flatness value.

For the specific implementation of the above operations, refer to the foregoing embodiments; details are not repeated here.

The storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

Since the instructions stored in the storage medium can execute the steps in any audio pop detection method provided in the embodiments of the present application, they can achieve the beneficial effects achievable by any audio pop detection method provided in the embodiments of the present application. For details, see the foregoing embodiments, which are not repeated here.

The method, the device and the storage medium for detecting audio pop provided by the embodiment of the present application are described in detail above, a specific example is applied in the description to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understanding the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. An audio plosive detection method, comprising:

calculating the short-time energy difference of two adjacent frame signals;

detecting the peak position of the abrupt change audio signal, taking a plurality of fixed sampling points before and after the peak position to form a plosive audio frame, calculating the geometric mean and the arithmetic mean of the frequency domain plosive audio frame, calculating the spectral flatness according to the geometric mean and the arithmetic mean, and determining that the audio signal has the plosive if the spectral flatness is greater than a preset flat value.

2. The audio plosive detecting method according to claim 1, wherein the dividing the audio signal into a plurality of frame signals comprises:

the beginning audio signal is divided into a plurality of frame signals.

3. The audio plosive detecting method according to claim 1, wherein the calculating the short-time energy difference between two adjacent frame signals comprises:

calculating the short-time energy of each frame signal;

acquiring the time of each frame signal;

4. The audio plosive detection method according to claim 3, wherein the obtaining a frame signal satisfying a preset condition interval according to the short-time energy difference to obtain a sudden-change audio signal includes:

5. The audio pop detection method of claim 4, wherein the obtaining two frame signals with the short-time energy difference smaller than a preset threshold negative value after the start frame signal, and determining a next frame signal of the two frame signals as an end frame signal according to a time sequence comprises:

6. The audio pop detection method of claim 1, wherein determining that the audio signal has a pop if the spectral flatness is greater than a predetermined flatness value comprises:

7. The method for detecting audio pop according to claim 1, wherein if the spectral flatness is greater than a predetermined flatness value, after determining that the audio signal has pop, further comprising:

8. An audio pop detection device, comprising:

and the judging module is used for detecting the peak position of the sudden change audio signal, taking a plurality of fixed sampling points before and after the peak position to form a plosive audio frame, calculating the geometric mean and the arithmetic mean of the frequency domain plosive audio frame, calculating the spectral flatness according to the geometric mean and the arithmetic mean, and determining that the audio signal has the plosive if the spectral flatness is greater than a preset flat value.

9. A storage medium storing instructions adapted to be loaded by a processor to perform the steps of the audio plosive detecting method according to any one of claims 1 to 7.