WO2020248308A1 - Audio pop detection method, device and storage medium - Google Patents

Audio pop detection method, device and storage medium

Info

Publication number
WO2020248308A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
frame
audio signal
audio
short
Application number
PCT/CN2019/093409
Other languages
English (en)
French (fr)
Inventor
陈洲旋
Original Assignee
腾讯音乐娱乐科技(深圳)有限公司
Application filed by 腾讯音乐娱乐科技(深圳)有限公司
Publication of WO2020248308A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • This application relates to the field of communication technology, and in particular to an audio pop detection method, device and storage medium.
  • the embodiments of the present application provide a method, device and storage medium for detecting audio pops, which can be used to detect whether there are pops in an audio signal, so as to effectively and quickly screen out audio files with pops.
  • the embodiment of the application provides an audio pop detection method, including:
  • the frequency spectrum flatness of the sudden change audio signal is calculated, and if the frequency spectrum flatness is greater than a preset flat value, it is determined that the audio signal has popping sound.
  • the dividing the audio signal into multiple frame signals includes:
  • the beginning audio signal is divided into a plurality of frame signals.
  • the calculating the short-term energy difference of two adjacent frame signals includes:
  • the difference between the short-term energy of two adjacent frame signals is sequentially calculated to obtain the short-term energy difference of the two adjacent frame signals.
  • the obtaining a frame signal that meets a preset condition interval according to the short-term energy difference to obtain a sudden change audio signal includes:
  • the acquiring, after the start frame signal, of two frame signals whose short-term energy difference is less than the negative value of a preset threshold, and determining the latter of the two frame signals in time order as the end frame signal, includes:
  • the latter, in time order, of the two frame signals whose short-term energy difference is less than the negative value of the preset threshold is determined as the end frame signal.
  • the calculating the frequency spectrum flatness of the sudden change audio signal includes:
  • determining that the audio signal has popping includes:
  • the method further includes:
  • an embodiment of the present application also provides an audio popping detection device, including:
  • the framing module is used to obtain the audio signal to be detected and divide the audio signal into multiple frame signals;
  • the calculation module is used to calculate the short-term energy difference between two adjacent frame signals;
  • the obtaining module is configured to obtain, according to the short-term energy difference, frame signals that meet a preset condition interval, so as to obtain a sudden change audio signal;
  • the judgment module is configured to calculate the spectral flatness of the sudden change audio signal and, if the spectral flatness is greater than a preset flat value, determine that the audio signal has pops.
  • the framing module includes:
  • the selection sub-module is used to select a signal of a preset time period from the first frame of the audio signal in the time domain to obtain the beginning audio signal;
  • the frame division sub-module is used to divide the beginning audio signal into multiple frame signals.
  • the calculation module includes:
  • Energy sub-module used to calculate the short-term energy of each frame signal
  • the acquisition sub-module is used to acquire the time of each frame signal
  • the energy difference sub-module is used to sequentially calculate, in the time order of the frame signals, the difference between the short-term energies of two adjacent frame signals, to obtain the short-term energy difference of the two adjacent frame signals.
  • the energy difference sub-module is specifically configured to: obtain two frame signals whose short-term energy difference is greater than a preset threshold, and determine the latter of the two frame signals in time order as the start frame signal; obtain, after the start frame signal, two frame signals whose short-term energy difference is less than the negative value of the preset threshold, and determine the latter of the two frame signals in time order as the end frame signal; and take the signal between the start frame signal and the end frame signal as the sudden change audio signal.
  • the energy difference sub-module is specifically configured to determine, in time order after the start frame signal, whether each short-term energy difference is less than the negative value of the preset threshold; when a short-term energy difference less than the negative value of the preset threshold is detected for the first time, the latter of the two frame signals concerned is determined as the end frame signal.
  • the judgment module includes:
  • the detection sub-module is used to detect the peak position of the abrupt audio signal
  • a sampling sub-module configured to take multiple fixed sampling points before and after the peak position to form a popping audio frame
  • the calculation sub-module is used to calculate the spectral flatness of the popped audio frame.
  • the judgment module is specifically configured to determine whether the spectral flatness is greater than a preset flat value; if the spectral flatness is greater than the preset flat value, it is determined that the audio signal has pops; if the spectral flatness is less than the preset flat value, it is determined that the audio signal does not have pops.
  • the audio pop detection device further includes:
  • the detection module is configured to return to perform the step of obtaining a frame signal satisfying the preset condition interval according to the short-term energy difference to obtain a sudden change audio signal, until the detection of the audio signal to be detected is completed.
  • an embodiment of the present application further provides a storage medium that stores a plurality of instructions, the instructions being suitable for loading by a processor to execute the steps of any of the audio pop detection methods provided in the embodiments of the present application.
  • this application can obtain the audio signal to be detected, divide the audio signal into multiple frame signals, and calculate the short-term energy difference between two adjacent frame signals; then, according to the short-term energy difference, it obtains the frame signals that meet a preset condition interval to obtain a sudden change audio signal, and calculates the spectral flatness of the sudden change audio signal; if the spectral flatness is greater than a preset flat value, it is determined that the audio signal has pops. This solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, locates the frames where the energy changes abruptly through the short-term energy difference to find the sudden change audio signal, and then calculates its spectral flatness, so that audio files with pops can be accurately screened out through the spectral flatness.
  • FIG. 1a is a schematic diagram of a scene of an audio pop detection method provided by an embodiment of the present application.
  • FIG. 1b is a schematic diagram of a first flow of an audio pop detection method provided by an embodiment of the present application.
  • FIG. 2a is a schematic diagram of a second flow of an audio pop detection method provided by an embodiment of the present application.
  • FIG. 2b is a schematic diagram of an audio signal of an audio pop detection method provided by an embodiment of the present application.
  • FIG. 3a is a schematic diagram of a first structure of an audio pop detection device provided by an embodiment of the present application.
  • FIG. 3b is a schematic diagram of a second structure of an audio pop detection device provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a network device provided by an embodiment of the present application.
  • the embodiments of the application provide an audio pop detection method, device and storage medium.
  • the audio pop detection device can be specifically integrated in a network device, which can be a terminal or a server. For example, referring to FIG. 1a, when a user needs to check whether a large number of audio files have pops at their beginning, the network device can be triggered to process these audio files: the network device obtains the audio signal to be detected, divides the audio signal into multiple frame signals, and calculates the short-term energy difference between two adjacent frame signals; it then obtains, according to the short-term energy difference, the frame signals that meet the preset condition interval to obtain a sudden change audio signal, and calculates the spectral flatness of the sudden change audio signal. If the spectral flatness is greater than the preset flat value, it is determined that the audio signal has pops.
  • this embodiment is described from the perspective of the audio pop detection device.
  • the audio pop detection device may be specifically integrated in a network device.
  • the network device may be a terminal or a server.
  • the terminal may include a tablet computer, a notebook computer, a personal computer (PC), etc.
  • An embodiment of the application provides an audio pop detection method, including: acquiring an audio signal to be detected; dividing the audio signal into multiple frame signals; calculating the short-term energy difference between two adjacent frame signals; obtaining, according to the short-term energy difference, the frame signals that meet the preset condition interval to obtain a sudden change audio signal; and calculating the spectral flatness of the sudden change audio signal. If the spectral flatness is greater than the preset flat value, it is determined that the audio signal has pops.
  • the specific process of the audio pop detection method can be as follows:
  • the audio file can be obtained from various channels such as the Internet, a mobile phone, or a video, and then provided to the audio pop detection device. That is, the audio pop detection device can receive audio files obtained through various channels and extract the audio signals to be detected from these files. These audio signals are then divided into multiple frame signals.
  • the audio files may be: sound files and musical instrument digital interface (Musical Instrument Digital Interface, MIDI) files.
  • the sound file is the original sound recorded by the sound recording device, which directly records the binary sampling data of the real sound;
  • the MIDI file is a sequence of musical performance instructions, which can be played using a sound output device or an electronic musical instrument connected to a computer.
  • the audio signal is the information carrier of regular sound waves with voice, music and sound effects. According to the characteristics of sound waves, audio information can be classified into regular audio and irregular sounds.
  • the regular audio can be divided into voice, music and sound effects.
  • Regular audio is a continuously changing analog signal, which can be represented by a continuous curve, called a sound wave.
  • a detection time period can be set at the beginning of the audio signal in the time domain, and framing is performed on the audio signal within that period; that is, the step "dividing the audio signal into multiple frame signals" can be specifically as follows:
  • in the time domain, a signal of a preset time period is selected from the audio signal, starting at the first frame, to obtain the beginning audio signal;
  • the beginning audio signal is divided into a plurality of frame signals.
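  • The two steps above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `frame_beginning`, the parameters `head_seconds` and `frame_len`, and the use of non-overlapping frames are all assumptions, since the text only speaks of "a preset time period" and "multiple frame signals".

```python
import numpy as np

def frame_beginning(signal, sample_rate, head_seconds=3.0, frame_len=1024):
    """Select the beginning of `signal` and split it into frame signals.

    `head_seconds` and `frame_len` are hypothetical parameters chosen for
    illustration only.
    """
    samples = np.asarray(signal, dtype=np.float64)
    # Beginning audio signal: the first `head_seconds` of the recording.
    head = samples[: int(head_seconds * sample_rate)]
    # Non-overlapping frames (an assumption); an incomplete trailing frame is dropped.
    n_frames = len(head) // frame_len
    return head[: n_frames * frame_len].reshape(n_frames, frame_len)
```

  • Each row of the returned array is one frame signal, in time order, so row index t can serve as the frame position used in the energy calculations.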
  • the short-term energy reflects the strength of the signal at different moments.
  • the short-term energy E of each frame signal is calculated from the sampling points of that frame, where:
  • N is the number of sampling points of each frame signal;
  • n is the index of a sampling point within the frame signal;
  • t represents the position of the frame signal;
  • E(t) is the short-term energy of the t-th frame signal.
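  • A conventional definition of short-term energy that is consistent with the symbols above can be written as below. The sample symbol $x_t(n)$ (the n-th sampling point of the t-th frame signal), the index range 0 to N-1, and the absence of a window weight are assumptions introduced here for illustration.

```latex
E(t) = \sum_{n=0}^{N-1} x_t(n)^2
```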
  • the short-term energy difference between two adjacent frame signals can be calculated as $p_t = E(t) - E(t-1)$, where:
  • t is the position of the frame;
  • $p_t$ is the short-term energy difference of two adjacent frame signals.
  • There are many ways to set the preset conditions. For example, they can be set flexibly according to actual application requirements, or they can be preset and stored in a network device. In addition, the preset conditions can be built into the network device, or can also be stored in the memory and sent to the network device, and so on.
  • two frame signals whose short-term energy difference is greater than a preset threshold can be obtained, and the latter of the two frame signals in time order is determined as the start frame signal; then, after the start frame signal, two frame signals whose short-term energy difference is less than the negative value of the preset threshold are obtained, the latter of the two frame signals in time order is determined as the end frame signal, and the signal between the start frame signal and the end frame signal is taken as the sudden change audio signal.
  • the preset threshold (threshold), referred to as Th, can also be set in many ways. For example, it can be set flexibly according to actual application requirements, or it can be preset and stored in a network device. In addition, the preset threshold value can be built into the network device, or can also be stored in the memory and sent to the network device, and so on.
  • the first short-term energy difference after the start frame signal that is less than the negative value of the preset threshold can be used, with the latter of the two frame signals concerned taken as the end frame signal. That is, the step "acquiring, after the start frame signal, two frame signals whose short-term energy difference is less than the negative value of the preset threshold, and determining the latter of the two frame signals in time order as the end frame signal" can be specifically as follows:
  • after the start frame signal, it is determined in time order whether each short-term energy difference is less than the negative value of the preset threshold; when a short-term energy difference less than the negative value of the preset threshold is detected for the first time, the latter in time order of the two frame signals concerned is determined as the end frame signal.
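  • The start-frame and end-frame rules can be sketched as below. This is a minimal illustration under stated assumptions: the function name `find_abrupt_segment` is hypothetical, frame positions are 0-based list indices, and the difference for frame t is taken as E(t) - E(t-1).

```python
def find_abrupt_segment(energy, threshold):
    """Return (start, end) frame indices of the first sudden change, or None.

    `energy` is a sequence of per-frame short-term energies in time order.
    """
    start = None
    for t in range(1, len(energy)):
        diff = energy[t] - energy[t - 1]  # short-term energy difference p_t
        if start is None:
            # Rise larger than Th: the latter frame becomes the start frame.
            if diff > threshold:
                start = t
        elif diff < -threshold:
            # First drop below -Th after the start: the latter frame is the end frame.
            return start, t
    return None
```

  • With energies `[1, 1, 10, 1, 1]` and a threshold of 5, the rise into index 2 marks the start frame and the drop into index 3 marks the end frame.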
  • the abrupt audio signal can be Fourier transformed to obtain a frequency-domain abrupt audio signal, the spectral flatness of the frequency-domain abrupt audio signal can be calculated, and it can then be determined whether the spectral flatness is greater than a preset flat value; if the spectral flatness is greater than the preset flat value, it is determined that the audio signal has pops; if the spectral flatness is less than the preset flat value, it is determined that the audio signal does not have pops.
  • the preset flat value can be flexibly set according to actual application requirements, or it can be preset and stored in a network device.
  • the preset flat value can be built into the network device, or can also be stored in the memory and sent to the network device, and so on.
  • spectral flatness, also known as Wiener entropy, is a metric used in digital signal processing to characterize the audio frequency spectrum.
  • the spectral flatness can be measured by the ratio of the geometric mean (GM) of the signal to the arithmetic mean (AM), which is generally called the spectral flatness measure (SFM), that is, $F(t) = GM(t) / AM(t)$, where:
  • w(n) is the window function;
  • k is the frequency point of the frequency-domain abrupt audio signal;
  • X is the frequency-domain abrupt audio signal;
  • the window function can be a rectangular window, a triangular window, a Hanning window, and so on;
  • GM(t) is the geometric mean of the frequency-domain abrupt audio signal;
  • AM(t) is the arithmetic mean of the frequency-domain abrupt audio signal;
  • F(t) is the spectral flatness.
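  • One common way to write out the quantities defined above is sketched below. The symbols $x_t(n)$ (time-domain samples of the frame) and $K$ (number of frequency points) are introduced here as assumptions, and taking the means over the magnitude spectrum rather than the power spectrum is likewise an assumption.

```latex
\begin{aligned}
X(t,k) &= \sum_{n=0}^{N-1} x_t(n)\, w(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, \dots, K-1 \\
GM(t)  &= \Bigl( \prod_{k=0}^{K-1} \lvert X(t,k) \rvert \Bigr)^{1/K} \\
AM(t)  &= \frac{1}{K} \sum_{k=0}^{K-1} \lvert X(t,k) \rvert \\
F(t)   &= \frac{GM(t)}{AM(t)}
\end{aligned}
```

  • Under this form F(t) lies between 0 and 1: a click-like signal with energy spread evenly over all frequency points gives a value near 1, while a tonal signal concentrated in a few frequency points gives a value near 0.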
  • the step "calculate the spectral flatness of the sudden change audio signal" can be specifically as follows:
  • an interface for the detection result can be generated.
  • the interface includes a detection interface that can receive the detection result of the audio signal to be detected. After the detection is completed, the interface prompts whether an audio crackle signal is detected.
  • when performing pop detection on an audio signal, this embodiment obtains the audio signal to be detected, divides the audio signal into multiple frame signals, and calculates the short-term energy difference of two adjacent frame signals; then, according to the short-term energy difference, it obtains the frame signals that meet the preset condition interval to obtain a sudden change audio signal, and calculates the spectral flatness of the sudden change audio signal; if the spectral flatness is greater than the preset flat value, it is determined that the audio signal has pops. This solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, locates the frames where the energy changes abruptly through the short-term energy difference to find the sudden change audio signal, and then calculates its spectral flatness, so that audio files with pops can be accurately screened out through the spectral flatness.
  • in the following, the case where the audio pop detection device is integrated in a network device is taken as an example for further detailed description.
  • the specific process of an audio pop detection method can be as follows:
  • a network device obtains an audio signal to be detected.
  • users can obtain audio files from various channels such as the Internet, mobile phones, or videos, and then provide them to the network device. The network device can receive audio files obtained through various channels and extract the audio signals to be detected from these files.
  • the network device divides the audio signal into frames to obtain a frame signal.
  • the network device can set the detection time period at the beginning of the audio signal in the time domain and perform framing on the audio signal within that period; that is, the step "dividing the audio signal into multiple frame signals" can be specifically as follows:
  • in the time domain, a signal of a preset time period is selected from the audio signal, starting at the first frame, to obtain the beginning audio signal;
  • the beginning audio signal is divided into a plurality of frame signals.
  • the network device calculates the short-term energy difference between two adjacent frame signals.
  • the network device can specifically calculate the short-term energy of each frame signal, obtain the time of each frame signal, and sequentially calculate, in the time order of the frame signals, the difference between the short-term energies of two adjacent frame signals to obtain the short-term energy difference of the two adjacent frame signals.
  • the short-term energy reflects the strength of the signal at different moments.
  • the short-term energy E of each frame signal is calculated from the sampling points of that frame, where:
  • N is the number of sampling points of each frame signal;
  • n is the index of a sampling point within the frame signal;
  • t represents the position of the frame signal;
  • E(t) is the short-term energy of the t-th frame signal.
  • the short-term energy difference between two adjacent frame signals can be calculated as $p_t = E(t) - E(t-1)$, where:
  • t is the position of the frame;
  • $p_t$ is the short-term energy difference of two adjacent frame signals.
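  • The energy and difference calculations can be sketched with NumPy as follows. This is a minimal illustration, assuming the short-term energy is the plain sum of squared samples of a frame (no window weighting) and that the frames are given as rows of a 2-D array in time order; the function names are hypothetical.

```python
import numpy as np

def short_term_energy(frames):
    """E(t): sum of squared samples of each frame (one row per frame)."""
    frames = np.asarray(frames, dtype=np.float64)
    return np.sum(frames ** 2, axis=1)

def energy_differences(energy):
    """Differences of adjacent energies; element i equals E(i+1) - E(i)."""
    return np.diff(np.asarray(energy, dtype=np.float64))
```

  • For three two-sample frames `[[1, 1], [2, 2], [0, 0]]` the energies are 2, 8 and 0, so the differences are 6 and -8.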
  • the network device obtains a frame signal that meets a preset condition interval according to the short-term energy difference, and obtains a sudden change audio signal.
  • There are many ways to set the preset conditions. For example, they can be set flexibly according to actual application requirements, or they can be preset and stored in a network device. In addition, the preset conditions can be built into the network device, or can also be stored in the memory and sent to the network device, and so on.
  • the network device may specifically obtain two frame signals whose short-term energy difference is greater than a preset threshold and determine the latter of the two frame signals in time order as the start frame signal; it then obtains, after the start frame signal, two frame signals whose short-term energy difference is less than the negative value of the preset threshold, determines the latter of the two frame signals in time order as the end frame signal, and takes the signal between the start frame signal and the end frame signal as the sudden change audio signal. For example, as shown in FIG. 2b, the short-term energy difference $p_3$ between E(2) and E(3) is calculated; if $p_3 > Th$, the start frame signal is the third frame signal a. The short-term energy difference $p_4$ between E(3) and E(4) is then calculated; if $p_4 < -Th$, the end frame signal is the fourth frame signal b, and the signal from the third frame signal a to the fourth frame signal b serves as a sudden change audio signal of the audio signal.
  • the preset threshold can be set in many ways. For example, it can be flexibly set according to actual application requirements, or it can be preset and stored in a network device. In addition, the preset threshold value can be built into the network device, or can also be stored in the memory and sent to the network device, and so on.
  • the first short-term energy difference after the start frame signal that is less than the negative value of the preset threshold can be used, with the latter of the two frame signals concerned taken as the end frame signal. That is, the step "acquiring, after the start frame signal, two frame signals whose short-term energy difference is less than the negative value of the preset threshold, and determining the latter of the two frame signals in time order as the end frame signal" can be specifically as follows:
  • after the start frame signal, it is determined in time order whether each short-term energy difference is less than the negative value of the preset threshold; when a short-term energy difference less than the negative value of the preset threshold is detected for the first time, the latter in time order of the two frame signals concerned is determined as the end frame signal.
  • the network device calculates the frequency spectrum flatness of the sudden change audio signal.
  • the network device may specifically perform Fourier transform on the sudden change audio signal to obtain the sudden change audio signal in the frequency domain, and then calculate the spectral flatness of the sudden change audio signal in the frequency domain.
  • the preset flat value can be flexibly set according to actual application requirements, or it can be preset and stored in a network device.
  • the preset flat value can be built into the network device, or can also be stored in the memory and sent to the network device, and so on.
  • spectral flatness, also known as Wiener entropy, is a metric used in digital signal processing to characterize the audio frequency spectrum.
  • the spectral flatness can be measured by the ratio of the geometric mean (GM) of the signal to the arithmetic mean (AM), that is, $F(t) = GM(t) / AM(t)$, where:
  • w(n) is the window function;
  • k is the frequency point of the frequency-domain abrupt audio signal;
  • X is the frequency-domain abrupt audio signal;
  • the window function can be a rectangular window, a triangular window, a Hanning window, and so on;
  • GM(t) is the geometric mean of the frequency-domain abrupt audio signal;
  • AM(t) is the arithmetic mean of the frequency-domain abrupt audio signal;
  • F(t) is the spectral flatness.
  • the network device can first detect the peak position of the sudden change audio signal and then, centered on the peak position, take the same number of sampling points to the left and to the right to form a popping audio frame. That is: the peak position of the sudden change audio signal is detected; multiple fixed sampling points are taken before and after the peak position to form the popping audio frame; and the spectral flatness of the popping audio frame is calculated.
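  • The peak-centred frame and its spectral flatness can be sketched as below. This is an illustration under stated assumptions, not the application's implementation: the function names, the `half_width` parameter, the choice of a Hanning window, the use of the magnitude spectrum, and the small `eps` added to avoid taking the logarithm of zero are all introduced here.

```python
import numpy as np

def pop_frame(segment, half_width=512):
    """Take `half_width` samples on each side of the largest absolute peak."""
    segment = np.asarray(segment, dtype=np.float64)
    peak = int(np.argmax(np.abs(segment)))
    lo = max(peak - half_width, 0)             # clipped at the segment start
    hi = min(peak + half_width, len(segment))  # clipped at the segment end
    return segment[lo:hi]

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the windowed magnitude spectrum."""
    frame = np.asarray(frame, dtype=np.float64)
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed)) + eps
    geometric_mean = np.exp(np.mean(np.log(magnitude)))
    arithmetic_mean = np.mean(magnitude)
    return geometric_mean / arithmetic_mean
```

  • A single-sample click has a nearly flat spectrum and yields a flatness close to 1, whereas a pure tone yields a value close to 0, which is why a flatness above the preset flat value is treated as evidence of a pop.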
  • the network device determines whether the frequency spectrum flatness is greater than a preset flat value, and if the frequency spectrum flatness is greater than the preset flat value, it is determined that the audio signal has pops.
  • the network device can specifically determine whether the spectral flatness is greater than the preset flat value; if the spectral flatness is greater than the preset flat value, it is determined that the audio signal has pops; if the spectral flatness is less than the preset flat value, it is determined that the audio signal has no pops.
  • the network device determines whether detection of the audio signal to be detected is complete; if not, it returns to the step of obtaining, according to the short-term energy difference, a frame signal that meets the preset condition interval to obtain a sudden change audio signal (that is, it returns to step 204), until detection of the audio signal to be detected is complete.
  • the network device can continue to examine the short-term energy differences to obtain frame signals that meet the preset condition interval until the whole audio signal to be detected has been examined; that is, it returns to the step of obtaining, according to the short-term energy difference, a frame signal that meets the preset condition interval to obtain a sudden change audio signal, until detection of the audio signal to be detected is complete. For example, after it is judged whether the spectral flatness of the sudden change audio signal is greater than the preset flat value, detection can continue with the frame signals after the fourth frame signal regardless of the result, until all frame signals have been examined and the detection result is obtained.
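  • The repeated scan can be sketched as a single pass over the frame energies. This is a minimal illustration: `scan_for_pops` and the `is_pop` callback (standing in for the spectral flatness judgment) are hypothetical names, and frame positions are 0-based indices.

```python
def scan_for_pops(energy, threshold, is_pop):
    """Collect every (start, end) segment that `is_pop` accepts.

    `energy` holds per-frame short-term energies in time order; `is_pop`
    is any callable taking (start, end) and returning True or False.
    """
    pops = []
    start = None
    for t in range(1, len(energy)):
        diff = energy[t] - energy[t - 1]
        if start is None:
            if diff > threshold:
                start = t
        elif diff < -threshold:
            if is_pop(start, t):
                pops.append((start, t))
            # Keep scanning the remaining frames whatever the verdict was.
            start = None
    return pops
```

  • Passing `lambda s, e: True` as `is_pop` returns every sudden change segment, which is convenient for checking the segmentation separately from the flatness judgment.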
  • an interface for the detection result can be generated. The interface includes a detection interface that can receive the detection result of the audio signal to be detected, and after the detection is completed the interface prompts whether an audio pop signal is detected.
  • these crackling signals can be repaired or replaced to ensure that users can listen to high-quality audio files.
  • when performing pop detection on an audio signal, the network device of this embodiment can obtain the audio signal to be detected, divide the audio signal into multiple frame signals, and calculate the short-term energy difference of two adjacent frame signals; then, according to the short-term energy difference, it obtains the frame signals that meet the preset condition interval to obtain a sudden change audio signal, and calculates the spectral flatness of the sudden change audio signal.
  • this solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, locates the frames where the energy changes abruptly through the short-term energy difference to find the sudden change audio signal, and then calculates its spectral flatness, so that audio files with pops can be accurately screened out through the spectral flatness.
  • this solution can also repair or replace the initial popping, therefore, it can improve the quality of audio files and improve user experience.
  • the embodiments of the present application also provide an audio pop detection device, which can be specifically integrated in network devices such as mobile phones, tablet computers, and palmtop computers.
  • the terms used have the same meanings as in the above audio pop detection method, and for specific implementation details reference can be made to the description in the method embodiments.
  • the audio pop detection device may include a framing module 301, a calculation module 302, an acquisition module 303, and a judgment module 304, as follows:
  • (1) Framing module 301;
  • the framing module 301 is used to obtain the audio signal to be detected and divide the audio signal into multiple frame signals.
  • audio files may first be obtained from various channels such as the Internet, mobile phones, or videos and then provided to the audio pop detection device; that is, the framing module 301 of the audio pop detection device may receive audio files obtained through various channels and extract the audio signals to be detected from these files. These audio signals are then divided into multiple frame signals.
  • the detection time period can be set at the beginning of the audio signal in the time domain, and framing is performed on the audio signal within that period; that is, the framing module can include a selection sub-module and a framing sub-module, as follows:
  • the selection sub-module is used to select, in the time domain, a signal of a preset time period from the audio signal, starting at the first frame, to obtain the beginning audio signal;
  • the framing sub-module is used to divide the beginning audio signal into multiple frame signals.
  • (2) Calculation module 302
  • The calculation module 302 is used to calculate the short-term energy difference between two adjacent frame signals.
  • For example, the calculation module 302 may include an energy sub-module, an acquisition sub-module, and an energy-difference sub-module, as follows:
  • The energy sub-module is used to calculate the short-term energy of each frame signal;
  • The acquisition sub-module is used to obtain the time of each frame signal;
  • The energy-difference sub-module is used to calculate, in the time order of the frame signals, the difference between the short-term energies of each pair of adjacent frame signals, to obtain the short-term energy difference between two adjacent frame signals.
  • Short-term energy reflects the strength of the signal at different moments.
  • The short-term energy E of each frame signal can be calculated as follows (equation image PCTCN2019093409-appb-000009 in the original filing), where:
  • N is the number of sampling points in each frame signal
  • n is the sampling point of the frame signal
  • t is the position of the frame signal
  • E(t) is the short-term energy of the t-th frame signal.
  • The short-term energy difference between two adjacent frame signals can be calculated as follows:
  • p_t = E(t) - E(t-1)
  • where t is the position of the frame and p_t is the short-term energy difference between two adjacent frame signals.
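The framing and energy steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the filing's implementation: frames are taken as non-overlapping, the short-term energy is assumed to be the sum of squared samples (the original equation image is not reproduced in this text), and the names `split_frames`, `short_term_energy`, and `energy_differences` are hypothetical.

```python
def split_frames(signal, frame_len):
    """Split samples into non-overlapping frames of frame_len samples.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    count = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def short_term_energy(frame):
    """Assumed definition: sum of the squared samples of one frame."""
    return sum(x * x for x in frame)


def energy_differences(frames):
    """Return p with p[t] = E(t) - E(t-1).

    p[0] is None because the first frame has no predecessor.
    """
    energies = [short_term_energy(f) for f in frames]
    return [None] + [energies[t] - energies[t - 1]
                     for t in range(1, len(energies))]


frames = split_frames([0, 0, 1, 1, 3, 3, 0, 0], frame_len=2)
print(energy_differences(frames))  # [None, 2, 16, -18]
```

A large positive difference followed by a large negative one, as in this toy signal, is the pattern the acquisition step looks for.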
  • (3) Acquisition module 303
  • The acquisition module 303 is used to obtain, according to the short-term energy difference, the frame signals that satisfy the preset condition interval, to get an abrupt-change audio signal.
  • The preset condition can be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset condition can be built into the network device, or saved in a memory and sent to the network device, and so on.
  • For example, the acquisition module 303 can obtain two frame signals whose short-term energy difference is greater than the preset threshold and take the later of the two, in time order, as the start frame signal. After the start frame signal, it obtains two frame signals whose short-term energy difference is less than the negative of the preset threshold and takes the later of the two, in time order, as the end frame signal. The signal from the start frame signal to the end frame signal is the abrupt-change audio signal.
  • The preset threshold can also be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset threshold can be built into the network device, or saved in a memory and sent to the network device, and so on.
  • So that the subsequent spectral flatness calculation stays closer to the true value of the preset condition interval and the detection result is more accurate, the end frame can be taken at the first point after the start frame signal where the short-term energy difference falls below the negative of the preset threshold. That is, the acquisition module can perform the following operations:
  • After the start frame signal, check in time order whether each short-term energy difference is less than the negative of the preset threshold;
  • When a short-term energy difference less than the negative of the preset threshold is detected for the first time, the later of the two frame signals concerned, in time order, is determined as the end frame signal.
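The start-frame and end-frame rule above can be sketched as a single scan over the energy differences. This is an illustrative sketch only: the function name `find_burst`, the zero-based frame indices, and the convention that `diffs[0]` is `None` are assumptions made for this example.

```python
def find_burst(diffs, threshold):
    """Locate one abrupt-change interval from short-term energy differences.

    diffs[t] is E(t) - E(t-1); diffs[0] may be None (no predecessor).
    Returns (start_frame, end_frame) as zero-based indices, or None if no
    rise above +threshold is later followed by a drop below -threshold.
    """
    start = None
    for t, p in enumerate(diffs):
        if p is None:
            continue
        if start is None:
            if p > threshold:
                start = t          # later frame of the rising pair
        elif p < -threshold:
            return start, t        # first sufficiently large drop after start
    return None


# Differences from a toy signal whose third frame is a short burst.
print(find_burst([None, 2, 16, -18], threshold=10))  # (2, 3)
```

Returning on the first qualifying drop mirrors the "first time detected" rule, which keeps the extracted interval as short as possible.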
  • (4) Judgment module 304
  • The judgment module 304 is used to calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop.
  • For example, the judgment module 304 can apply a Fourier transform to the abrupt-change audio signal to obtain the frequency-domain abrupt-change audio signal, calculate its spectral flatness, and then check whether the spectral flatness is greater than the preset flatness value. If it is greater, the audio signal is determined to contain a pop; if it is less, the audio signal is determined not to contain a pop.
  • The preset flatness value can also be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device.
  • In addition, the preset flatness value can be built into the network device, or saved in a memory and sent to the network device, and so on.
  • Spectral flatness, also known as Wiener entropy, is a metric used in digital signal processing to characterize an audio spectrum.
  • Spectral flatness can be measured as the ratio of the geometric mean (GM) to the arithmetic mean (AM) of the signal, and is also called the spectral flatness measure (SFM). That is (equation image PCTCN2019093409-appb-000010 in the original filing), where:
  • w(n) is the window function
  • k is the frequency bin of the frequency-domain abrupt-change audio signal
  • X is the frequency-domain abrupt-change audio signal.
  • The window function can be a rectangular window, a triangular window, a Hanning window, and so on.
  • The two means are defined by equation images PCTCN2019093409-appb-000011 and PCTCN2019093409-appb-000012 in the original filing, and the flatness is their ratio:
  • F(t) = GM(t)/AM(t)
  • where GM(t) is the geometric mean of the frequency-domain abrupt-change audio signal, AM(t) is its arithmetic mean, and F(t) is the spectral flatness.
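A minimal Python sketch of the flatness test follows. Several choices here are assumptions rather than details confirmed by the text: a rectangular window (no tapering), the magnitude spectrum rather than the power spectrum, a naive DFT over the non-negative frequency bins, and a small `eps` to keep the logarithm finite. The names `magnitude_spectrum`, `spectral_flatness`, and `has_pop` are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (rectangular window assumed)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        mags.append(abs(acc))
    return mags


def spectral_flatness(frame, eps=1e-12):
    """F = GM / AM of the magnitude spectrum; eps avoids log(0)."""
    mags = [m + eps for m in magnitude_spectrum(frame)]
    gm = math.exp(sum(math.log(m) for m in mags) / len(mags))
    am = sum(mags) / len(mags)
    return gm / am


def has_pop(frame, preset_flatness):
    """Decision rule from the text: flatness above the preset value."""
    return spectral_flatness(frame) > preset_flatness


impulse = [1.0] + [0.0] * 7                               # click-like frame
tone = [math.sin(2 * math.pi * i / 8) for i in range(8)]  # pure tone
print(has_pop(impulse, 0.5), has_pop(tone, 0.5))          # True False
```

An impulse spreads its energy evenly over all bins, so its flatness is close to 1, while a pure tone concentrates energy in one bin and scores close to 0; this is why a high flatness value is read as a pop.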
  • For example, to further improve detection accuracy, the peak position of the abrupt-change audio signal can be detected first, and N/2 sampling points taken on each side of that peak to form a pop audio frame with N sampling points in total. Accordingly, the judgment module may include a detection sub-module, a sampling sub-module, and a calculation sub-module, as follows:
  • The detection sub-module is used to detect the peak position of the abrupt-change audio signal;
  • The sampling sub-module is used to take a fixed number of sampling points before and after the peak position to form a pop audio frame;
  • The calculation sub-module is used to calculate the spectral flatness of the pop audio frame.
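The peak-centred framing described by these sub-modules can be sketched as follows. The sketch assumes the peak is the sample with the largest absolute amplitude and that the window is simply clipped at the signal boundaries; neither detail is stated in the text, and the name `pop_frame` is hypothetical.

```python
def pop_frame(samples, n_points):
    """Take n_points // 2 samples on each side of the peak.

    Assumptions: the peak is the sample with the largest absolute value
    (first one on ties), and the window is clipped at the signal edges.
    """
    peak = max(range(len(samples)), key=lambda i: abs(samples[i]))
    half = n_points // 2
    start = max(0, peak - half)
    end = min(len(samples), peak + half)
    return samples[start:end]


print(pop_frame([0, 0, 0, 1, 9, 1, 0, 0, 0, 0], n_points=4))  # [0, 1, 9, 1]
```

Centring the frame on the peak keeps the pulse inside the analysed window regardless of where the frame boundaries of the energy scan happened to fall.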
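  • After a pop has been detected, and for the sake of accurate subsequent repair, the device can continue checking the short-term energy differences for frame signals that satisfy the preset condition interval until the whole audio signal to be detected has been examined. That is, as shown in FIG. 3b, the audio pop detection device may also include a detection module 305, as follows:
  • The detection module 305 is used to return to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, until detection of the audio signal to be detected is complete.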
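The "return and continue until the signal is exhausted" behaviour can be sketched as one pass that collects every abrupt-change interval. The name `find_all_bursts`, the zero-based indices, and the `None` placeholder for the first difference are assumptions made for illustration.

```python
def find_all_bursts(diffs, threshold):
    """Collect every (start_frame, end_frame) interval in one scan.

    diffs[t] is E(t) - E(t-1); entries equal to None are skipped.
    After an interval is closed, scanning resumes with the next difference.
    """
    bursts = []
    start = None
    for t, p in enumerate(diffs):
        if p is None:
            continue
        if start is None:
            if p > threshold:
                start = t
        elif p < -threshold:
            bursts.append((start, t))
            start = None           # keep looking for further intervals
    return bursts


print(find_all_bursts([None, 16, -18, 1, 12, 3, -15], threshold=10))
# [(1, 2), (4, 6)]
```

Each returned interval would then be passed to the flatness check, so that every candidate in the analysed period is judged rather than only the first.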
  • Those skilled in the art will understand that the audio pop detection device shown in FIG. 3a does not limit the device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • In addition, for the specific implementation of each of the above units, refer to the preceding method embodiments; the details are not repeated here.
  • As can be seen from the above, when the audio pop detection device of this embodiment performs pop detection on an audio signal, the framing module 301 obtains the audio signal to be detected and divides it into multiple frame signals; the calculation module 302 calculates the short-term energy difference between two adjacent frame signals; the acquisition module 303 obtains, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal; and the judgment module 304 calculates the spectral flatness of the abrupt-change audio signal and, if it is greater than the preset flatness value, determines that the audio signal contains a pop. This solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, uses the short-term energy difference to locate the frames where the energy changes abruptly and thus find the abrupt-change audio signal, and then calculates its spectral flatness, which is used to accurately screen out audio files that contain pops.
  • Correspondingly, an embodiment of the present application also provides a network device, which may be a server, a terminal, or a similar device, and which integrates any of the audio pop detection devices provided in the embodiments of the present application.
  • FIG. 4 shows a schematic structural diagram of the network device involved in the embodiments of the present application. Specifically:
  • The network device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components.
  • Those skilled in the art will understand that the structure shown in FIG. 4 does not limit the network device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Specifically:
  • The processor 401 is the control center of the network device. It connects the parts of the whole network device through various interfaces and lines, and performs the functions of the network device and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the network device as a whole.
  • Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 401.
  • The memory 402 may be used to store software programs and modules.
  • The processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402.
  • The memory 402 may mainly include a program storage area and a data storage area.
  • The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through the use of the network device, and so on.
  • In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • Correspondingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
  • The network device also includes a power supply 403 that supplies power to the components.
  • Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption management are handled through the power management system.
  • The power supply 403 may also include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
  • The network device may further include an input unit 404, which can be used to receive input digits or characters and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
  • Although not shown, the network device may also include a display unit and the like, which are not described further here.
  • Specifically, in this embodiment, the processor 401 in the network device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402 to implement various functions, as follows:
  • Obtain the audio signal to be detected, divide it into multiple frame signals, calculate the short-term energy difference between two adjacent frame signals, obtain the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop.
  • Optionally, dividing the audio signal into multiple frame signals may include:
  • Selecting, in the time domain, a signal of a preset duration from the audio signal starting at the first frame, to obtain the opening audio signal; and dividing the opening audio signal into multiple frame signals.
  • Optionally, calculating the short-term energy difference between two adjacent frame signals may include:
  • Calculating the short-term energy of each frame signal; obtaining the time of each frame signal; and calculating, in the time order of the frame signals, the difference between the short-term energies of each pair of adjacent frame signals, to obtain the short-term energy difference between two adjacent frame signals.
  • Optionally, obtaining the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal may include:
  • Obtaining two frame signals whose short-term energy difference is greater than the preset threshold, and determining the later of the two, in time order, as the start frame signal; after the start frame signal, obtaining two frame signals whose short-term energy difference is less than the negative of the preset threshold, and determining the later of the two, in time order, as the end frame signal; and obtaining the signal from the start frame signal to the end frame signal to get the abrupt-change audio signal.
  • Optionally, obtaining, after the start frame signal, two frame signals whose short-term energy difference is less than the negative of the preset threshold and determining the later of the two, in time order, as the end frame signal may include:
  • After the start frame signal, checking in time order whether each short-term energy difference is less than the negative of the preset threshold; and when a short-term energy difference less than the negative of the preset threshold is detected for the first time, determining the later of the two frame signals concerned, in time order, as the end frame signal.
  • Optionally, calculating the spectral flatness of the abrupt-change audio signal may include:
  • Detecting the peak position of the abrupt-change audio signal; taking a fixed number of sampling points before and after the peak position to form a pop audio frame; and calculating the spectral flatness of the pop audio frame.
  • Optionally, determining that the audio signal contains a pop if the spectral flatness is greater than the preset flatness value may include:
  • Checking whether the spectral flatness is greater than the preset flatness value; if it is greater, determining that the audio signal contains a pop; if it is less, determining that the audio signal does not contain a pop.
  • Optionally, after determining that the audio signal contains a pop because the spectral flatness is greater than the preset flatness value, the functions may further include:
  • Returning to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, until detection of the audio signal to be detected is complete.
  • For details of each of the above operations, refer to the preceding embodiments; they are not repeated here.
  • As can be seen from the above, when performing pop detection on an audio signal, the network device of this embodiment can obtain the audio signal to be detected, divide it into multiple frame signals, and calculate the short-term energy difference between adjacent frame signals. Based on the short-term energy difference, it then obtains the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, and calculates the spectral flatness of that signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop.
  • This solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, uses the short-term energy difference to locate the frames where the energy changes abruptly and thus find the abrupt-change audio signal, and then calculates its spectral flatness, which is used to accurately screen out audio files that contain pops.
  • To this end, an embodiment of the present application provides a storage medium storing multiple instructions that can be loaded by a processor to execute the steps of any audio pop detection method provided in the embodiments of the present application.
  • For example, the instructions can perform the following steps:
  • Obtain the audio signal to be detected, divide it into multiple frame signals, calculate the short-term energy difference between two adjacent frame signals, obtain the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop.
  • The storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and so on.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Auxiliary Devices For Music (AREA)
  • Telephone Function (AREA)

Abstract

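An audio pop detection method, device, and storage medium. When performing pop detection on an audio signal, the audio signal to be detected is obtained and divided into multiple frame signals (101); the short-term energy difference between two adjacent frame signals is calculated (102); frame signals satisfying a preset condition interval are obtained according to the short-term energy difference to get an abrupt-change audio signal (103); and the spectral flatness of the abrupt-change audio signal is calculated, and if the spectral flatness is greater than a preset flatness value, the audio signal is determined to contain a pop (104). The solution can accurately detect whether an audio signal contains a pop.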

Description

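Audio pop detection method, device, and storage medium
Technical Field
The present application relates to the field of communication technology, and in particular to an audio pop detection method, device, and storage medium.
Background
With the continuous development of Internet technology, the Internet hosts a massive number of audio files of all kinds, such as music, speeches, storytelling, and chat recordings. Because audio goes through a series of complex steps such as recording, processing, transmission, and storage, distortion may occur, for example opening pops, glitches, and dropouts. The opening pop is a relatively common kind of distortion. An "opening pop" is a brief pulse at the beginning of the music waveform that sounds like a "click"; this harsh, unnatural sound gives listeners a poor experience. A statistical study of one song library showed that audio with an opening pop accounted for as much as 10% of the files, and the presence of pops degrades audio quality. It is therefore important to detect opening pops correctly.
Technical Problem
The embodiments of the present application provide an audio pop detection method, device, and storage medium that can be used to detect whether an audio signal contains a pop, so that audio files with pops can be screened out quickly and effectively.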
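Technical Solution
An embodiment of the present application provides an audio pop detection method, including:
obtaining an audio signal to be detected, and dividing the audio signal into multiple frame signals;
calculating the short-term energy difference between two adjacent frame signals;
obtaining, according to the short-term energy difference, the frame signals that satisfy a preset condition interval, to get an abrupt-change audio signal;
calculating the spectral flatness of the abrupt-change audio signal, and if the spectral flatness is greater than a preset flatness value, determining that the audio signal contains a pop.
Optionally, in some embodiments of the audio pop detection method, dividing the audio signal into multiple frame signals includes:
selecting, in the time domain, a signal of a preset duration from the audio signal starting at the first frame, to obtain an opening audio signal;
dividing the opening audio signal into multiple frame signals.
Optionally, in some embodiments of the audio pop detection method, calculating the short-term energy difference between two adjacent frame signals includes:
calculating the short-term energy of each frame signal;
obtaining the time of each frame signal;
calculating, in the time order of the frame signals, the difference between the short-term energies of each pair of adjacent frame signals, to obtain the short-term energy difference between two adjacent frame signals.
Optionally, in some embodiments of the audio pop detection method, obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal includes:
obtaining two frame signals whose short-term energy difference is greater than a preset threshold, and determining the later of the two, in time order, as the start frame signal;
after the start frame signal, obtaining two frame signals whose short-term energy difference is less than the negative of the preset threshold, and determining the later of the two, in time order, as the end frame signal;
obtaining the signal from the start frame signal to the end frame signal, to get the abrupt-change audio signal.
Optionally, in some embodiments of the audio pop detection method, obtaining, after the start frame signal, two frame signals whose short-term energy difference is less than the negative of the preset threshold and determining the later of the two, in time order, as the end frame signal includes:
after the start frame signal, checking in time order whether each short-term energy difference is less than the negative of the preset threshold;
when a short-term energy difference less than the negative of the preset threshold is detected for the first time, determining the later of the two frame signals concerned, in time order, as the end frame signal.
Optionally, in some embodiments of the audio pop detection method, calculating the spectral flatness of the abrupt-change audio signal includes:
detecting the peak position of the abrupt-change audio signal;
taking a fixed number of sampling points before and after the peak position to form a pop audio frame;
calculating the spectral flatness of the pop audio frame.
Optionally, in some embodiments of the audio pop detection method, determining that the audio signal contains a pop if the spectral flatness is greater than the preset flatness value includes:
checking whether the spectral flatness is greater than the preset flatness value;
if the spectral flatness is greater than the preset flatness value, determining that the audio signal contains a pop;
if the spectral flatness is less than the preset flatness value, determining that the audio signal does not contain a pop.
Optionally, in some embodiments of the audio pop detection method, after determining that the audio signal contains a pop because the spectral flatness is greater than the preset flatness value, the method further includes:
returning to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, until detection of the audio signal to be detected is complete.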
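Correspondingly, an embodiment of the present application also provides an audio pop detection device, including:
a framing module, used to obtain an audio signal to be detected and divide the audio signal into multiple frame signals;
a calculation module, used to calculate the short-term energy difference between two adjacent frame signals;
an acquisition module, used to obtain, according to the short-term energy difference, the frame signals that satisfy a preset condition interval, to get an abrupt-change audio signal;
a judgment module, used to calculate the spectral flatness of the abrupt-change audio signal and, if the spectral flatness is greater than a preset flatness value, determine that the audio signal contains a pop.
Optionally, in some embodiments of the audio pop detection device, the framing module includes:
a selection sub-module, used to select, in the time domain, a signal of a preset duration from the audio signal starting at the first frame, to obtain an opening audio signal;
a framing sub-module, used to divide the opening audio signal into multiple frame signals.
Optionally, in some embodiments of the audio pop detection device, the calculation module includes:
an energy sub-module, used to calculate the short-term energy of each frame signal;
an acquisition sub-module, used to obtain the time of each frame signal;
an energy-difference sub-module, used to calculate, in the time order of the frame signals, the difference between the short-term energies of each pair of adjacent frame signals, to obtain the short-term energy difference between two adjacent frame signals.
Optionally, in some embodiments of the audio pop detection device, the energy-difference sub-module is specifically used to: obtain two frame signals whose short-term energy difference is greater than a preset threshold, and determine the later of the two, in time order, as the start frame signal; after the start frame signal, obtain two frame signals whose short-term energy difference is less than the negative of the preset threshold, and determine the later of the two, in time order, as the end frame signal; and obtain the signal from the start frame signal to the end frame signal, to get the abrupt-change audio signal.
Optionally, in some embodiments of the audio pop detection device, the energy-difference sub-module is specifically used to: after the start frame signal, check in time order whether each short-term energy difference is less than the negative of the preset threshold; and when a short-term energy difference less than the negative of the preset threshold is detected for the first time, determine the later of the two frame signals concerned, in time order, as the end frame signal.
Optionally, in some embodiments of the audio pop detection device, the judgment module includes:
a detection sub-module, used to detect the peak position of the abrupt-change audio signal;
a sampling sub-module, used to take a fixed number of sampling points before and after the peak position to form a pop audio frame;
a calculation sub-module, used to calculate the spectral flatness of the pop audio frame.
Optionally, in some embodiments of the audio pop detection device, the judgment module is specifically used to: check whether the spectral flatness is greater than the preset flatness value; if the spectral flatness is greater than the preset flatness value, determine that the audio signal contains a pop; if the spectral flatness is less than the preset flatness value, determine that the audio signal does not contain a pop.
Optionally, in some embodiments, the audio pop detection device further includes:
a detection module, used to return to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, until detection of the audio signal to be detected is complete.
In addition, an embodiment of the present application also provides a storage medium storing multiple instructions, the instructions being suitable for loading by a processor to execute the steps of any audio pop detection method provided in the embodiments of the present application.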
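Beneficial Effects
When performing pop detection on an audio signal, the present application can obtain the audio signal to be detected, divide it into multiple frame signals, calculate the short-term energy difference between two adjacent frame signals, obtain the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop. The solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, uses the short-term energy difference to locate the frames where the energy changes abruptly and thus find the abrupt-change audio signal, and then calculates its spectral flatness, which is used to accurately screen out audio files that contain pops.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1a is a schematic diagram of a scenario of the audio pop detection method provided by an embodiment of the present application;
FIG. 1b is a first schematic flowchart of the audio pop detection method provided by an embodiment of the present application;
FIG. 2a is a second schematic flowchart of the audio pop detection method provided by an embodiment of the present application;
FIG. 2b is a schematic diagram of an audio signal in the audio pop detection method provided by an embodiment of the present application;
FIG. 3a is a first schematic structural diagram of the audio pop detection device provided by an embodiment of the present application;
FIG. 3b is a second schematic structural diagram of the audio pop detection device provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the network device provided by an embodiment of the present application.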
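Embodiments of the Invention
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The terms "first", "second", "third", and the like in the present application are used to distinguish different objects, not to describe a specific order. In addition, the terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion.
The embodiments of the present application provide an audio pop detection method, device, and storage medium.
The audio pop detection device may be integrated in a network device, which may be a terminal, a server, or a similar device. For example, referring to FIG. 1a, when a user needs to run opening-pop detection on a massive number of audio files, the user can trigger the network device to process these files. The network device can obtain the audio signal to be detected, divide it into multiple frame signals, calculate the short-term energy difference between two adjacent frame signals, obtain the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop.
Each is described in detail below. Note that the order of the following embodiments does not imply any preferred order among them.
This embodiment is described from the perspective of the audio pop detection device, which may be integrated in a network device. The network device may be a terminal, a server, or a similar device, where the terminal may include a tablet computer, a notebook computer, a personal computer (PC), and so on.
An embodiment of the present application provides an audio pop detection method, including: obtaining the audio signal to be detected, dividing it into multiple frame signals, calculating the short-term energy difference between two adjacent frame signals, obtaining the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculating the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop.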
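As shown in FIG. 1b, the specific flow of the audio pop detection method may be as follows:
101. Obtain the audio signal to be detected, and divide the audio signal into multiple frame signals.
For example, audio files may first be obtained through various channels such as the Internet, mobile phones, or videos, and then provided to the audio pop detection device. That is, the audio pop detection device receives the audio files obtained through these channels, extracts the audio signals to be detected from them, and then divides these audio signals into multiple frame signals.
The audio file may be a sound file or a Musical Instrument Digital Interface (MIDI) file. A sound file is the original sound recorded by a sound input device and directly stores the binary sample data of the real sound; a MIDI file is a sequence of musical performance instructions that can be played by a sound output device or by an electronic instrument connected to a computer. An audio signal is an information carrier for the frequency and amplitude variations of regular sound waves carrying speech, music, and sound effects. According to the characteristics of sound waves, audio information can be classified into regular audio and irregular sound. Regular audio can be further divided into speech, music, and sound effects. Regular audio is a continuously varying analog signal that can be represented by a continuous curve, called a sound wave.
To improve detection efficiency, a detection time period can be set at the start of the audio signal in the time domain, and only the audio signal within that period is framed. That is, the step "divide the audio signal into multiple frame signals" may be as follows:
select, in the time domain, a signal of a preset duration from the audio signal starting at the first frame, to obtain the opening audio signal;
divide the opening audio signal into multiple frame signals.
102. Calculate the short-term energy difference between two adjacent frame signals.
For example, the short-term energy of each frame signal can be calculated and the time of each frame signal obtained; the difference between the short-term energies of each pair of adjacent frame signals is then calculated in the time order of the frame signals, to obtain the short-term energy difference between two adjacent frame signals.
Short-term energy reflects the strength of the signal at different moments. The short-term energy E of each frame signal can be calculated as follows:
[Equation image: PCTCN2019093409-appb-000001]
where N is the number of sampling points in each frame signal, n is the sampling point of the frame signal, t is the position of the frame signal, and E(t) is the short-term energy of the t-th frame signal.
The short-term energy difference between two adjacent frame signals can be calculated as follows:
p_t = E(t) - E(t-1)
where t is the position of the frame and p_t is the short-term energy difference between two adjacent frame signals.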
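The equation image for the short-term energy is not reproduced in this text. A conventional form that is consistent with the symbol definitions above, offered here as an assumption rather than as the filing's exact formula, is the sum of squared samples within a frame. The symbol x_t(n), introduced only for this illustration, denotes the n-th sample of the t-th frame:

```latex
E(t) = \sum_{n=0}^{N-1} x_t(n)^2, \qquad p_t = E(t) - E(t-1)
```

Under this form, a short pulse confined to frame t produces a large positive p_t on entry and a large negative p_{t+1} on exit, which is the pair of events that step 103 looks for.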
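103. Obtain, according to the short-term energy difference, the frame signals that satisfy the preset condition interval, to get an abrupt-change audio signal.
The preset condition can be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset condition can be built into the network device, or saved in a memory and sent to the network device, and so on.
For example, two frame signals whose short-term energy difference is greater than the preset threshold can be obtained, and the later of the two, in time order, determined as the start frame signal. After the start frame signal, two frame signals whose short-term energy difference is less than the negative of the preset threshold are obtained, and the later of the two, in time order, is determined as the end frame signal. The signal from the start frame signal to the end frame signal is the abrupt-change audio signal.
The preset threshold, abbreviated Th, can also be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset threshold can be built into the network device, or saved in a memory and sent to the network device, and so on.
So that the subsequent spectral flatness calculation stays closer to the true value of the preset condition interval and the detection result is more accurate, the end frame signal can be taken as the later of the two frame signals at which, for the first time after the start frame signal, the short-term energy difference is detected to be less than the negative of the preset threshold. That is, the step "after the start frame signal, obtain two frame signals whose short-term energy difference is less than the negative of the preset threshold, and determine the later of the two, in time order, as the end frame signal" may be as follows:
after the start frame signal, check in time order whether each short-term energy difference is less than the negative of the preset threshold;
when a short-term energy difference less than the negative of the preset threshold is detected for the first time, determine the later of the two frame signals concerned, in time order, as the end frame signal.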
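104. Calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, determine that the audio signal contains a pop.
For example, a Fourier transform can be applied to the abrupt-change audio signal to obtain the frequency-domain abrupt-change audio signal, whose spectral flatness is then calculated and compared with the preset flatness value. If the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop; if it is less, the audio signal is determined not to contain a pop.
The preset flatness value can also be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset flatness value can be built into the network device, or saved in a memory and sent to the network device, and so on.
Spectral flatness, also known as Wiener entropy, is a metric used in digital signal processing to characterize an audio spectrum. It can be measured as the ratio of the geometric mean (GM) to the arithmetic mean (AM) of the signal, and is also called the Spectral Flatness Measure (SFM). That is:
[Equation image: PCTCN2019093409-appb-000002]
where w(n) is the window function, k is the frequency bin of the frequency-domain abrupt-change audio signal, and X is the frequency-domain abrupt-change audio signal. The window function can be a rectangular window, a triangular window, a Hanning window, and so on.
[Equation image: PCTCN2019093409-appb-000003]
[Equation image: PCTCN2019093409-appb-000004]
F(t) = GM(t)/AM(t)
where GM(t) is the geometric mean of the frequency-domain abrupt-change audio signal, AM(t) is its arithmetic mean, and F(t) is the spectral flatness.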
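The equation images defining the two means are not reproduced in this text. The conventional definitions, given here as an assumption about what the images show, take the means over the spectral magnitudes of the K frequency bins (K is a symbol introduced for this illustration, and whether the filing uses magnitudes or powers is not known):

```latex
\mathrm{GM}(t) = \Bigl(\prod_{k=0}^{K-1} \lvert X(t,k)\rvert\Bigr)^{1/K}, \qquad
\mathrm{AM}(t) = \frac{1}{K}\sum_{k=0}^{K-1} \lvert X(t,k)\rvert, \qquad
F(t) = \frac{\mathrm{GM}(t)}{\mathrm{AM}(t)}
```

By the inequality between the arithmetic and geometric means, F(t) lies between 0 and 1. It approaches 1 when energy is spread evenly across the bins, as for an impulse, and approaches 0 when energy is concentrated in a few bins, as for a tonal sound, which is why a flatness above the preset value is taken to indicate a pop.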
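For example, to further improve detection accuracy and ensure that the audio delivered to users is free of defects, the peak position of the abrupt-change audio signal can be detected first, and N/2 sampling points taken on each side of that peak to form a pop audio frame, so that the pop audio frame has N sampling points in total. The step "calculate the spectral flatness of the abrupt-change audio signal" may therefore be as follows:
detect the peak position of the abrupt-change audio signal;
take a fixed number of sampling points before and after the peak position to form a pop audio frame;
calculate the spectral flatness of the pop audio frame.
After a pop has been detected, and for the sake of accurate subsequent repair, the short-term energy differences can continue to be checked for frame signals that satisfy the preset condition interval until the whole audio signal to be detected has been examined. That is, after the step "if the spectral flatness is greater than the preset flatness value, determine that the audio signal contains a pop", the method may further include:
returning to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, until detection of the audio signal to be detected is complete.
After detection of the audio signal is complete, a results page can be generated. The page includes a detection interface that can receive the detection result for the audio signal, and once detection is complete the page indicates whether a pop was detected.
As can be seen from the above, when performing pop detection on an audio signal, this embodiment can obtain the audio signal to be detected, divide it into multiple frame signals, calculate the short-term energy difference between two adjacent frame signals, obtain the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop. The solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, uses the short-term energy difference to locate the frames where the energy changes abruptly and thus find the abrupt-change audio signal, and then calculates its spectral flatness, which is used to accurately screen out audio files that contain pops.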
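Based on the method described in the preceding embodiment, the following gives a more detailed example in which the audio pop detection device is integrated in a network device.
As shown in FIG. 2a, the specific flow of an audio pop detection method may be as follows:
201. The network device obtains the audio signal to be detected.
For example, a user can obtain audio files through various channels such as the Internet, mobile phones, or videos and provide them to the network device. The network device receives the audio files obtained through these channels and extracts the audio signals to be detected from them.
202. The network device divides the audio signal into frames to obtain frame signals.
For example, to improve detection efficiency, the network device can set a detection time period at the start of the audio signal in the time domain and frame only the audio signal within that period. That is, the step "divide the audio signal into multiple frame signals" may be as follows:
select, in the time domain, a signal of a preset duration from the audio signal starting at the first frame, to obtain the opening audio signal;
divide the opening audio signal into multiple frame signals.
203. The network device calculates the short-term energy difference between two adjacent frame signals.
For example, the network device can calculate the short-term energy of each frame signal and obtain the time of each frame signal; it then calculates, in the time order of the frame signals, the difference between the short-term energies of each pair of adjacent frame signals, to obtain the short-term energy difference between two adjacent frame signals.
Short-term energy reflects the strength of the signal at different moments. The short-term energy E of each frame signal can be calculated as follows:
[Equation image: PCTCN2019093409-appb-000005]
where N is the number of sampling points in each frame signal, n is the sampling point of the frame signal, t is the position of the frame signal, and E(t) is the short-term energy of the t-th frame signal.
The short-term energy difference between two adjacent frame signals can be calculated as follows:
p_t = E(t) - E(t-1)
where t is the position of the frame and p_t is the short-term energy difference between two adjacent frame signals.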
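204. The network device obtains, according to the short-term energy difference, the frame signals that satisfy the preset condition interval, to get an abrupt-change audio signal.
The preset condition can be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset condition can be built into the network device, or saved in a memory and sent to the network device, and so on.
For example, the network device can obtain two frame signals whose short-term energy difference is greater than the preset threshold and determine the later of the two, in time order, as the start frame signal. After the start frame signal, it obtains two frame signals whose short-term energy difference is less than the negative of the preset threshold and determines the later of the two, in time order, as the end frame signal. The signal from the start frame signal to the end frame signal is the abrupt-change audio signal. For instance, as shown in FIG. 2b, the short-term energy difference p_3 between E(2) and E(3) is calculated; if p_3 > Th, the start frame signal is the third frame signal a. The short-term energy differences of the adjacent frame signals after the third frame are then calculated; if the short-term energy difference between E(3) and E(4) satisfies p_4 < -Th, the end frame signal is the fourth frame signal b, and the signal from the third frame signal a to the fourth frame signal b is taken as the abrupt-change audio signal of the audio signal.
The preset threshold can also be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset threshold can be built into the network device, or saved in a memory and sent to the network device, and so on.
So that the subsequent spectral flatness calculation stays closer to the true value of the preset condition interval and the detection result is more accurate, the end frame signal can be taken as the later of the two frame signals at which, for the first time after the start frame signal, the short-term energy difference is detected to be less than the negative of the preset threshold. That is, the step "after the start frame signal, obtain two frame signals whose short-term energy difference is less than the negative of the preset threshold, and determine the later of the two, in time order, as the end frame signal" may be as follows:
after the start frame signal, check in time order whether each short-term energy difference is less than the negative of the preset threshold;
when a short-term energy difference less than the negative of the preset threshold is detected for the first time, determine the later of the two frame signals concerned, in time order, as the end frame signal.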
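205. The network device calculates the spectral flatness of the abrupt-change audio signal.
For example, the network device can apply a Fourier transform to the abrupt-change audio signal to obtain the frequency-domain abrupt-change audio signal, and then calculate its spectral flatness.
The preset flatness value can also be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset flatness value can be built into the network device, or saved in a memory and sent to the network device, and so on.
Spectral flatness, also known as Wiener entropy, is a metric used in digital signal processing to characterize an audio spectrum. It can be measured as the ratio of the geometric mean (GM) to the arithmetic mean (AM) of the signal, and is also called the spectral flatness measure. That is:
[Equation image: PCTCN2019093409-appb-000006]
where w(n) is the window function, k is the frequency bin of the frequency-domain abrupt-change audio signal, and X is the frequency-domain abrupt-change audio signal. The window function can be a rectangular window, a triangular window, a Hanning window, and so on.
[Equation image: PCTCN2019093409-appb-000007]
[Equation image: PCTCN2019093409-appb-000008]
F(t) = GM(t)/AM(t)
where GM(t) is the geometric mean of the frequency-domain abrupt-change audio signal, AM(t) is its arithmetic mean, and F(t) is the spectral flatness.
For example, to further improve detection accuracy and ensure that the audio delivered to users is free of defects, the network device can first detect the peak position of the abrupt-change audio signal and then take the same number of sampling points on each side of that peak to form a pop audio frame. That is, it can detect the peak position of the abrupt-change audio signal, take a fixed number of sampling points before and after the peak position to form a pop audio frame, and calculate the spectral flatness of the pop audio frame.
For instance, as shown in FIG. 2b, with the peak position of the abrupt-change audio signal as the center, N/2 sampling points are taken on each side to form a pop audio frame c, so that the pop audio frame c has N sampling points in total; the spectral flatness of the pop audio frame c is then calculated.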
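206. The network device checks whether the spectral flatness is greater than the preset flatness value; if it is, the audio signal is determined to contain a pop.
For example, the network device can check whether the spectral flatness is greater than the preset flatness value. If the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop; if it is less, the audio signal is determined not to contain a pop.
207. The network device checks whether detection of the audio signal to be detected is complete; if not, it returns to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal (that is, it returns to step 204), until detection of the audio signal to be detected is complete.
For example, after a pop has been detected, and for the sake of accurate subsequent repair, the network device can continue checking the short-term energy differences for frame signals that satisfy the preset condition interval until the whole audio signal to be detected has been examined; that is, it returns to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, until detection is complete. For instance, after checking whether the spectral flatness of the abrupt-change audio signal is greater than the preset flatness value, and regardless of the outcome, the network device can continue to examine the frame signals after the fourth frame signal until all frame signals have been examined and the detection result is obtained.
Optionally, after detection of the audio signal is complete, a results page can be generated. The page includes a detection interface that can receive the detection result for the audio signal, and once detection is complete the page indicates whether a pop was detected.
Optionally, after an opening pop has been detected, the pop signals can also be repaired or replaced, to ensure that users hear high-quality audio files.
As can be seen from the above, when performing pop detection on an audio signal, the network device of this embodiment can obtain the audio signal to be detected, divide it into multiple frame signals, calculate the short-term energy difference between two adjacent frame signals, obtain the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop. The solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, uses the short-term energy difference to locate the frames where the energy changes abruptly and thus find the abrupt-change audio signal, and then calculates its spectral flatness, which is used to accurately screen out audio files that contain pops.
In addition, this solution can repair or replace the opening pop, which improves the quality of audio files and the user experience.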
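To better implement the audio pop detection method provided by the embodiments of the present application, the embodiments also provide an audio pop detection device, which may be integrated in a network device such as a mobile phone, tablet computer, or palmtop computer. The terms used here have the same meanings as in the audio pop detection method above; for implementation details, refer to the description in the method embodiments.
For example, as shown in FIG. 3a, the audio pop detection device may include a framing module 301, a calculation module 302, an acquisition module 303, and a judgment module 304, as follows:
(1) Framing module 301
The framing module 301 is used to obtain the audio signal to be detected and divide it into multiple frame signals.
For example, audio files may first be obtained through various channels such as the Internet, mobile phones, or videos, and then provided to the audio pop detection device. That is, the audio pop detection device receives the audio files obtained through these channels, extracts the audio signals to be detected from them, and then divides these audio signals into multiple frame signals.
To improve detection efficiency, a detection time period can be set at the start of the audio signal in the time domain, and only the audio signal within that period is framed. That is, the framing module may include a selection sub-module and a framing sub-module, as follows:
the selection sub-module, used to select, in the time domain, a signal of a preset duration from the audio signal starting at the first frame, to obtain the opening audio signal;
the framing sub-module, used to divide the opening audio signal into multiple frame signals.
(2) Calculation module 302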
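The calculation module 302 is used to calculate the short-term energy difference between two adjacent frame signals.
For example, the calculation module 302 may include an energy sub-module, an acquisition sub-module, and an energy-difference sub-module, as follows:
the energy sub-module, used to calculate the short-term energy of each frame signal;
the acquisition sub-module, used to obtain the time of each frame signal;
the energy-difference sub-module, used to calculate, in the time order of the frame signals, the difference between the short-term energies of each pair of adjacent frame signals, to obtain the short-term energy difference between two adjacent frame signals.
Short-term energy reflects the strength of the signal at different moments. The short-term energy E of each frame signal can be calculated as follows:
[Equation image: PCTCN2019093409-appb-000009]
where N is the number of sampling points in each frame signal, n is the sampling point of the frame signal, t is the position of the frame signal, and E(t) is the short-term energy of the t-th frame signal.
The short-term energy difference between two adjacent frame signals can be calculated as follows:
p_t = E(t) - E(t-1)
where t is the position of the frame and p_t is the short-term energy difference between two adjacent frame signals.
(3) Acquisition module 303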
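The acquisition module 303 is used to obtain, according to the short-term energy difference, the frame signals that satisfy the preset condition interval, to get an abrupt-change audio signal.
The preset condition can be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset condition can be built into the network device, or saved in a memory and sent to the network device, and so on.
For example, the acquisition module 303 can obtain two frame signals whose short-term energy difference is greater than the preset threshold and determine the later of the two, in time order, as the start frame signal. After the start frame signal, it obtains two frame signals whose short-term energy difference is less than the negative of the preset threshold and determines the later of the two, in time order, as the end frame signal. The signal from the start frame signal to the end frame signal is the abrupt-change audio signal.
The preset threshold can also be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset threshold can be built into the network device, or saved in a memory and sent to the network device, and so on.
So that the subsequent spectral flatness calculation stays closer to the true value of the preset condition interval and the detection result is more accurate, the end frame signal can be taken as the later of the two frame signals at which, for the first time after the start frame signal, the short-term energy difference is detected to be less than the negative of the preset threshold. That is, the acquisition module can perform the following operations:
after the start frame signal, check in time order whether each short-term energy difference is less than the negative of the preset threshold;
when a short-term energy difference less than the negative of the preset threshold is detected for the first time, determine the later of the two frame signals concerned, in time order, as the end frame signal.
(4) Judgment module 304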
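The judgment module 304 is used to calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop.
For example, the judgment module 304 can apply a Fourier transform to the abrupt-change audio signal to obtain the frequency-domain abrupt-change audio signal, calculate its spectral flatness, and then check whether the spectral flatness is greater than the preset flatness value. If it is greater, the audio signal is determined to contain a pop; if it is less, the audio signal is determined not to contain a pop.
The preset flatness value can also be set in many ways. For example, it can be set flexibly according to the needs of the application, or set in advance and stored in the network device. In addition, the preset flatness value can be built into the network device, or saved in a memory and sent to the network device, and so on.
Spectral flatness, also known as Wiener entropy, is a metric used in digital signal processing to characterize an audio spectrum. It can be measured as the ratio of the geometric mean (GM) to the arithmetic mean (AM) of the signal, and is also called the spectral flatness measure. That is:
[Equation image: PCTCN2019093409-appb-000010]
where w(n) is the window function, k is the frequency bin of the frequency-domain abrupt-change audio signal, and X is the frequency-domain abrupt-change audio signal. The window function can be a rectangular window, a triangular window, a Hanning window, and so on.
[Equation image: PCTCN2019093409-appb-000011]
[Equation image: PCTCN2019093409-appb-000012]
F(t) = GM(t)/AM(t)
where GM(t) is the geometric mean of the frequency-domain abrupt-change audio signal, AM(t) is its arithmetic mean, and F(t) is the spectral flatness.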
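For example, to further improve detection accuracy and ensure that the audio delivered to users is free of defects, the peak position of the abrupt-change audio signal can be detected first, and N/2 sampling points taken on each side of that peak to form a pop audio frame, so that the pop audio frame has N sampling points in total. Accordingly, the judgment module may include a detection sub-module, a sampling sub-module, and a calculation sub-module, as follows:
the detection sub-module, used to detect the peak position of the abrupt-change audio signal;
the sampling sub-module, used to take a fixed number of sampling points before and after the peak position to form a pop audio frame;
the calculation sub-module, used to calculate the spectral flatness of the pop audio frame.
After a pop has been detected, and for the sake of accurate subsequent repair, the short-term energy differences can continue to be checked for frame signals that satisfy the preset condition interval until the whole audio signal to be detected has been examined. That is, as shown in FIG. 3b, the audio pop detection device may also include a detection module 305, as follows:
the detection module 305, used to return to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, until detection of the audio signal to be detected is complete.
Those skilled in the art will understand that the audio pop detection device shown in FIG. 3a does not limit the device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. In addition, for the specific implementation of each of the above units, refer to the preceding method embodiments; the details are not repeated here.
As can be seen from the above, when the audio pop detection device of this embodiment performs pop detection on an audio signal, the framing module 301 obtains the audio signal to be detected and divides it into multiple frame signals; the calculation module 302 calculates the short-term energy difference between two adjacent frame signals; the acquisition module 303 obtains, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal; and the judgment module 304 calculates the spectral flatness of the abrupt-change audio signal and, if it is greater than the preset flatness value, determines that the audio signal contains a pop. The solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, uses the short-term energy difference to locate the frames where the energy changes abruptly and thus find the abrupt-change audio signal, and then calculates its spectral flatness, which is used to accurately screen out audio files that contain pops.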
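Correspondingly, an embodiment of the present application also provides a network device, which may be a server, a terminal, or a similar device, and which integrates any of the audio pop detection devices provided in the embodiments of the present application. FIG. 4 shows a schematic structural diagram of the network device involved in the embodiments of the present application. Specifically:
The network device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will understand that the structure shown in FIG. 4 does not limit the network device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Specifically:
The processor 401 is the control center of the network device. It connects the parts of the whole network device through various interfaces and lines, and performs the functions of the network device and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the network device as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through the use of the network device. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. Correspondingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The network device also includes a power supply 403 that supplies power to the components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption management are handled through the power management system. The power supply 403 may also include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other such components.
The network device may further include an input unit 404, which can be used to receive input digits or characters and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
Although not shown, the network device may also include a display unit and the like, which are not described further here. Specifically, in this embodiment, the processor 401 in the network device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402 to implement various functions, as follows: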
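Obtain the audio signal to be detected, divide it into multiple frame signals, calculate the short-term energy difference between two adjacent frame signals, obtain the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop.
Optionally, dividing the audio signal into multiple frame signals may include:
selecting, in the time domain, a signal of a preset duration from the audio signal starting at the first frame, to obtain the opening audio signal; and dividing the opening audio signal into multiple frame signals.
Optionally, calculating the short-term energy difference between two adjacent frame signals may include:
calculating the short-term energy of each frame signal; obtaining the time of each frame signal; and calculating, in the time order of the frame signals, the difference between the short-term energies of each pair of adjacent frame signals, to obtain the short-term energy difference between two adjacent frame signals.
Optionally, obtaining the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal may include:
obtaining two frame signals whose short-term energy difference is greater than the preset threshold, and determining the later of the two, in time order, as the start frame signal; after the start frame signal, obtaining two frame signals whose short-term energy difference is less than the negative of the preset threshold, and determining the later of the two, in time order, as the end frame signal; and obtaining the signal from the start frame signal to the end frame signal, to get the abrupt-change audio signal.
Optionally, obtaining, after the start frame signal, two frame signals whose short-term energy difference is less than the negative of the preset threshold and determining the later of the two, in time order, as the end frame signal may include:
after the start frame signal, checking in time order whether each short-term energy difference is less than the negative of the preset threshold; and when a short-term energy difference less than the negative of the preset threshold is detected for the first time, determining the later of the two frame signals concerned, in time order, as the end frame signal.
Optionally, calculating the spectral flatness of the abrupt-change audio signal may include:
detecting the peak position of the abrupt-change audio signal; taking a fixed number of sampling points before and after the peak position to form a pop audio frame; and calculating the spectral flatness of the pop audio frame.
Optionally, determining that the audio signal contains a pop if the spectral flatness is greater than the preset flatness value may include:
checking whether the spectral flatness is greater than the preset flatness value; if it is greater, determining that the audio signal contains a pop; if it is less, determining that the audio signal does not contain a pop.
Optionally, after determining that the audio signal contains a pop because the spectral flatness is greater than the preset flatness value, the functions may further include:
returning to the step of obtaining, according to the short-term energy difference, the frame signals that satisfy the preset condition interval to get an abrupt-change audio signal, until detection of the audio signal to be detected is complete.
For details of each of the above operations, refer to the preceding embodiments; they are not repeated here.
As can be seen from the above, when performing pop detection on an audio signal, the network device of this embodiment can obtain the audio signal to be detected, divide it into multiple frame signals, calculate the short-term energy difference between two adjacent frame signals, obtain the frame signals that satisfy the preset condition interval according to the short-term energy difference to get an abrupt-change audio signal, and then calculate the spectral flatness of the abrupt-change audio signal; if the spectral flatness is greater than the preset flatness value, the audio signal is determined to contain a pop. The solution divides the audio signal into frames, calculates the time-domain short-term energy of each frame, uses the short-term energy difference to locate the frames where the energy changes abruptly and thus find the abrupt-change audio signal, and then calculates its spectral flatness, which is used to accurately screen out audio files that contain pops.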
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。
为此,本申请实施例提供一种存储介质,其中存储有多条指令,该指令能够被处理器进行加载,以执行本申请实施例所提供的任一种音频爆音检测方法中的步骤。例如,该指令可以执行如下步骤:
获取待检测的音频信号,将该音频信号划分为多个帧信号,接着,计算相邻两个帧信号的短时能量差,然后,根据该短时能量差获取满足预设条件区间的帧信号,得到突变音频信号,再然后,计算该突变音频信号的频谱平坦度,若该频谱平坦度大于预设平坦值,则确定该音频信号存在爆音
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
其中,该存储介质可以包括:只读存储器(Read Only Memory,ROM)、随机存取记忆体(Random Access Memory,RAM)、磁盘或光盘等。
由于该存储介质中所存储的指令,可以执行本申请实施例所提供的任一种音频爆音检测方法中的步骤,因此,可以实现本申请实施例所提供的任一种应用于音频爆音检测方法所能实现的有益效果,详见前面的实施例,在此不再赘述。
以上对本申请实施例所提供的一种音频爆音检测方法、装置和存储介质进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上该,本说明书内容不应理解为对本申请的限制。

Claims (10)

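  1. An audio pop detection method, comprising:
    acquiring an audio signal to be detected, and dividing the audio signal into multiple frame signals;
    calculating the short-time energy difference between two adjacent frame signals;
    obtaining, according to the short-time energy difference, frame signals that satisfy a preset condition interval, to obtain an abrupt-change audio signal;
    calculating the spectral flatness of the abrupt-change audio signal, and if the spectral flatness is greater than a preset flatness value, determining that the audio signal contains a pop.
  2. The audio pop detection method according to claim 1, wherein dividing the audio signal into multiple frame signals comprises:
    selecting, in the time domain and starting from the first frame, a signal of a preset time period from the audio signal to obtain an opening audio signal;
    dividing the opening audio signal into multiple frame signals.
  3. The audio pop detection method according to claim 1, wherein calculating the short-time energy difference between two adjacent frame signals comprises:
    calculating the short-time energy of each frame signal;
    obtaining the time of each frame signal;
    calculating in turn, according to the time order of the frame signals, the difference between the short-time energies of two adjacent frame signals, to obtain the short-time energy difference of the two adjacent frame signals.
  4. The audio pop detection method according to claim 3, wherein obtaining, according to the short-time energy difference, frame signals that satisfy a preset condition interval, to obtain an abrupt-change audio signal comprises:
    obtaining two frame signals whose short-time energy difference is greater than a preset threshold, and determining the later of the two frame signals in time order as a start frame signal;
    obtaining, after the start frame signal, two frame signals whose short-time energy difference is less than the negative of the preset threshold, and determining the later of the two frame signals in time order as an end frame signal;
    obtaining the signal from the start frame signal to the end frame signal, to obtain the abrupt-change audio signal.
  5. The audio pop detection method according to claim 4, wherein obtaining, after the start frame signal, two frame signals whose short-time energy difference is less than the negative of the preset threshold, and determining the later of the two frame signals in time order as an end frame signal comprises:
    judging in turn, in time order after the start frame signal, whether the short-time energy difference is less than the negative of the preset threshold;
    when a short-time energy difference less than the negative of the preset threshold is detected for the first time, determining the later, in time order, of the two frame signals whose difference is less than the negative of the preset threshold as the end frame signal.
  6. The audio pop detection method according to claim 1, wherein calculating the spectral flatness of the abrupt-change audio signal comprises:
    detecting the peak position of the abrupt-change audio signal;
    taking a fixed number of sampling points both before and after the peak position to form a pop audio frame;
    calculating the spectral flatness of the pop audio frame.
  7. The audio pop detection method according to claim 1, wherein determining that the audio signal contains a pop if the spectral flatness is greater than the preset flatness value comprises:
    judging whether the spectral flatness is greater than the preset flatness value;
    if the spectral flatness is greater than the preset flatness value, determining that the audio signal contains a pop;
    if the spectral flatness is less than the preset flatness value, determining that the audio signal does not contain a pop.
  8. The audio pop detection method according to claim 1, wherein, after determining that the audio signal contains a pop if the spectral flatness is greater than the preset flatness value, the method further comprises:
    returning to the step of obtaining, according to the short-time energy difference, frame signals that satisfy a preset condition interval, to obtain an abrupt-change audio signal, until the audio signal to be detected has been fully checked.
  9. An audio pop detection apparatus, comprising:
    a framing module, configured to acquire an audio signal to be detected and divide the audio signal into multiple frame signals;
    a calculation module, configured to calculate the short-time energy difference between two adjacent frame signals;
    an obtaining module, configured to obtain, according to the short-time energy difference, frame signals that satisfy a preset condition interval, to obtain an abrupt-change audio signal;
    a judging module, configured to calculate the spectral flatness of the abrupt-change audio signal, and if the spectral flatness is greater than a preset flatness value, determine that the audio signal contains a pop.
  10. A storage medium, wherein the storage medium stores multiple instructions, and the instructions are suitable for being loaded by a processor to perform the steps of the audio pop detection method according to any one of claims 1 to 8.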
PCT/CN2019/093409 2019-06-12 2019-06-27 Audio pop detection method, apparatus, and storage medium WO2020248308A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910506938.3A CN110265064B (zh) 2019-06-12 2019-06-12 Audio pop detection method, apparatus, and storage medium
CN201910506938.3 2019-06-12

Publications (1)

Publication Number Publication Date
WO2020248308A1 WO2020248308A1 (zh) 2020-12-17

Family

ID=67917850

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/093409 WO2020248308A1 (zh) 2019-06-12 2019-06-27 Audio pop detection method, apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN110265064B (zh)
WO (1) WO2020248308A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
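CN111312285B (zh) * 2020-01-14 2023-02-14 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for detecting a pop at the beginning of audio
CN113542863B (zh) * 2020-04-14 2023-05-23 深圳Tcl数字技术有限公司 Sound processing method, storage medium, and smart television
CN112151055B (zh) * 2020-09-25 2024-04-30 北京猿力未来科技有限公司 Audio processing method and apparatus
CN112735481B (zh) * 2020-12-18 2022-08-05 Oppo(重庆)智能科技有限公司 Pop sound detection method and apparatus, terminal device, and storage medium
CN113035223B (zh) * 2021-03-12 2023-11-14 北京字节跳动网络技术有限公司 Audio processing method, apparatus, device, and storage medium
CN114299994B (zh) * 2022-01-04 2024-06-18 中南大学 Pop detection method, device, and medium for speech captured by laser Doppler remote listening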

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050120870A1 (en) * 1998-05-15 2005-06-09 Ludwig Lester F. Envelope-controlled dynamic layering of audio signal processing and synthesis for music applications
US20110064233A1 (en) * 2003-10-09 2011-03-17 James Edwin Van Buskirk Method, apparatus and system for synthesizing an audio performance using Convolution at Multiple Sample Rates
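CN105118520A (zh) * 2015-07-13 2015-12-02 腾讯科技(深圳)有限公司 Method and apparatus for eliminating a pop at the beginning of audio
CN107346665A (zh) * 2017-06-29 2017-11-14 广州视源电子科技股份有限公司 Audio detection method, apparatus, device, and storage medium
CN109616135A (zh) * 2018-11-14 2019-04-12 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, apparatus, and storage medium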

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8433582B2 (en) * 2008-02-01 2013-04-30 Motorola Mobility Llc Method and apparatus for estimating high-band energy in a bandwidth extension system
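CN103650040B (zh) * 2011-05-16 2017-08-25 谷歌公司 Noise suppression method and apparatus using multi-feature modeling to analyze speech/noise likelihood
CN103918030B (zh) * 2011-09-29 2016-08-17 杜比国际公司 High quality detection in FM stereo radio signals
CN105989853B (zh) * 2015-02-28 2020-08-18 科大讯飞股份有限公司 Audio quality evaluation method and system
CN108198572A (zh) * 2017-12-29 2018-06-22 珠海市君天电子科技有限公司 Audio processing method and apparatus
CN108492837B (zh) * 2018-03-23 2020-10-13 腾讯音乐娱乐科技(深圳)有限公司 Method, apparatus, and storage medium for detecting burst white noise in audio
CN109658955B (zh) * 2019-01-07 2021-03-09 环鸿电子(昆山)有限公司 Pop detection method and apparatus
CN109801646B (zh) * 2019-01-31 2021-11-16 嘉楠明芯(北京)科技有限公司 Voice endpoint detection method and apparatus based on fused features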

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
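CN113761589A (zh) * 2021-04-21 2021-12-07 腾讯科技(北京)有限公司 Video detection method and apparatus, and electronic device
CN113611330A (zh) * 2021-07-29 2021-11-05 杭州网易云音乐科技有限公司 Audio detection method and apparatus, electronic device, and storage medium
CN113611330B (zh) * 2021-07-29 2024-05-03 杭州网易云音乐科技有限公司 Audio detection method and apparatus, electronic device, and storage medium
CN113744756A (zh) * 2021-08-11 2021-12-03 浙江讯飞智能科技有限公司 Device quality inspection and audio data augmentation method, and related apparatus, device, and medium
CN113613159A (zh) * 2021-08-20 2021-11-05 北京房江湖科技有限公司 Microphone blowing signal detection method, apparatus, and system
CN113613159B (zh) * 2021-08-20 2023-07-21 贝壳找房(北京)科技有限公司 Microphone blowing signal detection method, apparatus, and system
CN115243183A (zh) * 2022-06-29 2022-10-25 上海勤宽科技有限公司 Audio detection method, device, and storage medium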

Also Published As

Publication number Publication date
CN110265064B (zh) 2021-10-08
CN110265064A (zh) 2019-09-20

Similar Documents

Publication Publication Date Title
WO2020248308A1 (zh) 音频爆音检测方法、装置和存储介质
JP5053285B2 (ja) オーディオ装置品質の決定
US20200151212A1 (en) Music recommending method, device, terminal, and storage medium
WO2016180100A1 (zh) 一种音频处理的性能提升方法及装置
US20170060520A1 (en) Systems and methods for dynamically editable social media
CN110111811B (zh) 音频信号检测方法、装置和存储介质
WO2011035626A1 (zh) 音频播放方法及音频播放装置
US11990150B2 (en) Method and device for audio repair and readable storage medium
WO2023103253A1 (zh) 一种音频检测方法、装置及终端设备
CN107682802B (zh) 音频设备音效的调试方法及装置
CN101714861A (zh) 谐波产生装置及其产生方法
WO2020097824A1 (zh) 音频处理方法、装置、存储介质及电子设备
CN113259832A (zh) 麦克风阵列的检测方法、装置、电子设备及存储介质
CN111312287B (zh) 一种音频信息的检测方法、装置及存储介质
CN108829370B (zh) 有声资源播放方法、装置、计算机设备及存储介质
WO2024099348A1 (zh) 音频特效的编辑方法、装置、设备及存储介质
CN112423019B (zh) 调整音频播放速度的方法、装置、电子设备及存储介质
US8571235B2 (en) Method and device for providing a plurality of audio files with consistent loudness levels but different audio characteristics
JP5815435B2 (ja) 音源位置判定装置、音源位置判定方法、プログラム
CN112151055A (zh) 音频处理方法及装置
CN111782859A (zh) 一种音频可视化方法、装置和存储介质
CN110995914A (zh) 一种双麦克测试方法及装置
CN114678038A (zh) 音频噪声检测方法、计算机设备和计算机程序产品
CN112735481B (zh) Pop音检测方法、装置、终端设备及存储介质
CN106170113B (zh) 一种消除噪声的方法和装置以及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19932939

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.04.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19932939

Country of ref document: EP

Kind code of ref document: A1