WO2022111177A1 - Audio detection method and apparatus, computer device, and readable storage medium - Google Patents

Audio detection method and apparatus, computer device, and readable storage medium

Info

Publication number
WO2022111177A1
WO2022111177A1 (PCT/CN2021/126022)
Authority
WO
WIPO (PCT)
Prior art keywords
time point
point
target
value
energy
Prior art date
Application number
PCT/CN2021/126022
Other languages
English (en)
French (fr)
Inventor
黄正跃 (Huang Zhengyue)
史欣田 (Shi Xintian)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority: EP21896679.4A (published as EP4250291A4)
Publication of WO2022111177A1
Priority: US17/974,452 (published as US20230050565A1)

Classifications

    • G — PHYSICS › G10 — MUSICAL INSTRUMENTS; ACOUSTICS:
    • G10L 25/21 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L 25/51 — Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L 19/265 — Pre-filtering, e.g. high-frequency emphasis prior to encoding
    • G10L 21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10H 7/08 — Instruments in which the tones are synthesised from a data store, by calculating functions or polynomial approximations to evaluate amplitudes at successive sample points of a tone waveform
    • G10H 1/368 — Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems, displaying animated or moving pictures synchronized with the music or audio part
    • G10H 1/40 — Accompaniment arrangements; Rhythm
    • G10H 2210/031 — Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2210/051 — Musical analysis for extraction or detection of onsets of musical sounds or notes, i.e. note attack timings
    • G10H 2210/076 — Musical analysis for extraction of timing, tempo; Beat detection
    • G10H 2220/441 — Image sensing, i.e. capturing images or optical patterns for musical purposes or musical control purposes
    • G10H 2240/325 — Synchronizing two or more audio tracks or files according to musical features or musical timings
    • G10H 2250/235 — Fourier transform; Discrete Fourier Transform [DFT]; Fast Fourier Transform [FFT]

Definitions

  • The present application relates to the field of the Internet, in particular to multimedia technologies, and more specifically to an audio detection method, apparatus, computer device, and readable storage medium.
  • A beat-synced video mainly fills in picture cuts at the accented rhythm points of the music, synchronizing the sound and picture of the video so that the audience perceives a consistent sense of rhythm both visually and aurally, bringing a more comfortable sensory experience.
  • The accent point is therefore a key factor in video production.
  • Some important accent points need to be determined from the audio; therefore, how to obtain accent points from audio data has become a research hotspot.
  • Embodiments of the present application provide an audio detection method, apparatus, computer device, and readable storage medium, which can relatively accurately determine accent points in target audio data.
  • An embodiment of the present application provides an audio detection method, the method including:
  • acquiring a target time point and a reference point of the target time point from the target audio data, where the target audio data includes a plurality of time points and the audio amplitude value of each time point;
  • the reference point refers to a time point whose time difference from the target time point is less than a first difference threshold;
  • performing energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain the energy evaluation value of the target time point, and performing energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain the energy evaluation value of the reference point;
  • performing an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point;
  • if the target time point passes the accuracy check, adding the target time point to the set of target accent points as a target accent point.
  • An embodiment of the present application provides an audio detection device, the device including:
  • an acquisition unit for acquiring a target time point and a reference point of the target time point from the target audio data;
  • the target audio data includes a plurality of time points and the audio amplitude value of each time point;
  • the reference point refers to a time point whose time difference with the target time point is less than the first difference threshold;
  • a processing unit configured to perform energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain the energy evaluation value of the target time point, and to perform energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain the energy evaluation value of the reference point;
  • the processing unit is further configured to perform an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point;
  • the processing unit is further configured to add the target time point as a target accent point to the set of target accent points if the target time point passes the accuracy check.
  • an embodiment of the present application provides a computer device, the computer device includes an input device and an output device, and the computer device further includes:
  • a processor adapted to implement one or more instructions
  • a computer storage medium having one or more instructions stored thereon, the one or more instructions adapted to be loaded by the processor and to perform the following steps:
  • the target audio data includes a plurality of time points and the audio amplitude value of each time point;
  • the reference point refers to a time point whose time difference from the target time point is less than the first difference threshold;
  • an accuracy check is performed on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point;
  • if the target time point passes the accuracy check, the target time point is added to the set of target accent points as a target accent point.
  • an embodiment of the present application provides a computer storage medium, where the computer storage medium stores one or more instructions, and the one or more instructions are suitable for being loaded by the processor and performing the following steps:
  • the target audio data includes a plurality of time points and the audio amplitude value of each time point;
  • the reference point refers to a time point whose time difference from the target time point is less than the first difference threshold;
  • an accuracy check is performed on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point;
  • if the target time point passes the accuracy check, the target time point is added to the set of target accent points as a target accent point.
  • FIG. 1a is a schematic diagram of an audio waveform provided by an embodiment of the present application.
  • FIG. 1b is a schematic diagram of a frequency spectrum provided by an embodiment of the present application.
  • FIG. 1c is a schematic structural diagram of an audio detection system provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of an audio detection method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of determining the reference point of a target time point provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another audio detection method provided by an embodiment of the present application.
  • FIG. 5a is a schematic diagram of the generation of an initial accent point set and a supplementary time point set provided by an embodiment of the present application.
  • FIG. 5b is a schematic diagram of acquiring multiple peaks from the time points provided by an embodiment of the present application.
  • FIG. 5c is a schematic diagram of determining the starting point of a note according to a target time point provided by an embodiment of the present application.
  • FIG. 5d is another schematic diagram of determining the starting point of a note according to a target time point provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of an audio detection solution provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an audio detection apparatus provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Audio data is a type of digitized sound data, which can come from audio data in video files or audio data in pure audio files.
  • The process of digitizing sound is actually the process of sampling the continuous analog audio signal received by the terminal device at a certain frequency and converting it into audio data.
  • The audio data may include multiple time points (also called music points) and the audio amplitude value of each time point; to a certain extent, the time points and their corresponding audio amplitude values can be used to draw an audio waveform diagram that provides a visual representation of the audio data.
  • the audio amplitude values at time points such as A, B, C, D, and E in the audio data can be intuitively seen through the audio waveform diagram.
  • Each time point can also carry sound attributes such as sound frequency, energy, volume, and timbre. Sound frequency refers to the number of complete vibrations an object makes per unit of time, and the sound frequencies of the time points can form a spectrogram as shown in FIG. 1b. Volume, also called sound intensity or loudness, refers to the human ear's subjective perception of the strength of the sound heard. Timbre, also called tone quality, reflects the characteristics of the sound produced and is based on the audio amplitude value of each time point.
  • An embodiment of the present application provides an audio detection scheme; the execution body of the audio detection scheme may be a computer device, which may be a terminal device (hereinafter referred to as a terminal) or a server.
  • the embodiment of the present application also proposes an audio detection system shown in FIG. 1c; the audio detection system may include at least one terminal 101 and a server 102, that is, a computer device.
  • the terminal 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the present application.
  • The terminal mentioned above can be a smart phone, tablet computer, notebook computer, desktop computer, etc.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.
  • The general principle of the above-mentioned audio detection scheme is as follows: when accent points need to be extracted for any type of audio data (such as lyrical or rock), the computer device can extract multiple initial accent points from the audio data; the multiple initial accent points here may include time points where energy, volume, and timbre reach local maxima, and/or time points where energy, volume, and timbre change abruptly.
  • For any initial accent point, the audio amplitude value of that initial accent point and the audio amplitude values of the adjacent time points in the audio data can be comprehensively analyzed, so that an accuracy check can be further performed on that initial accent point according to the comprehensive analysis result; after the check is passed, that initial accent point is used as a target accent point of the audio data.
  • In practice, the initial accent points extracted by the computer device may be insufficient due to various external factors, so that other time points in the audio data, which may also be accent points, are omitted.
  • For this reason, the computer device can additionally extract some new supplementary points (i.e., time points other than the initial accent points) from the audio data, apply to each supplementary point the same comprehensive analysis used for the initial accent points, and, after determining from the analysis result that a supplementary point passes the accuracy check, use that supplementary point as a target accent point of the audio data.
  • In this way, initial accent points, i.e., time points where energy, volume, and timbre reach local maxima or change abruptly, are identified from the audio data, and the comprehensiveness of the set of target accent points can also be improved by supplementing points from the audio data and checking the accuracy of the new supplementary points.
  • an embodiment of the present application provides an audio detection method, and the audio detection method can be executed by the above-mentioned computer equipment.
  • the audio detection method may include the following steps S201-S204:
  • S201 Obtain a target time point and a reference point of the target time point from the target audio data.
  • The target audio data may be any type of audio data; for example, lyrical audio data, rock audio data, classical audio data, and so on.
  • the target audio data may include multiple time points and an audio amplitude value of each time point, and the target time point may be obtained by any of the following implementation methods:
  • In one implementation, the computer device may extract the initial accent point set from the target audio data according to a point extraction algorithm (e.g., the librosa.beat algorithm) in the open-source audio processing tool librosa.
  • The principle of the point extraction algorithm is: according to the main beat of the target audio data, extract as initial accent points the time points with locally larger energy, volume, and timbre, and/or the time points where energy, volume, and timbre change suddenly.
  • The main beat refers to the dominant beat pattern of the audio data.
  • The so-called beat is the basic unit of audio data in time and refers to the combination rule of strong beats and weak beats; the beat organizes the audio data into a recurring pattern of strong and weak beats.
  • In this case, step S201 may include: arbitrarily selecting an initial accent point from the initial accent point set as the target time point; that is, the target time point in this embodiment is any initial accent point in the initial accent point set.
  • Since the principle of the point extraction algorithm mentioned above is to extract accent points by considering the main beat, and there may be a small number of accent points in the target audio data that deviate from the main beat, these deviating accent points may be missed by the point extraction algorithm.
  • For example, the beats in the start/end region of the target audio data may not conform to the main beat, so accent points in that region can be considered accent points deviating from the main beat; when the point extraction algorithm is used, accent points in the start/end region are usually not extracted.
  • Therefore, the computer device can also extend the selected points outwards based on the initial accent point set in the target audio data to obtain the supplementary time point set, and use the audio detection method proposed in the embodiments of the present application to perform the accuracy check on each supplementary point in the supplementary time point set in turn.
  • In this case, the specific implementation of step S201 may include: arbitrarily selecting a supplementary point from the set of supplementary time points as the target time point; that is, the target time point in this embodiment is any supplementary point in the supplementary time point set.
  • The computer device can also obtain the time points within a certain time range near the target time point as reference points of the target time point, so that an accuracy check can subsequently be performed on the target time point in combination with the audio amplitude values of the reference points.
  • The upper limit of this time range may be equal to the target time point plus the first difference threshold, and the lower limit may be equal to the target time point minus the first difference threshold; that is, a reference point is a time point whose time difference from the target time point is less than the first difference threshold.
  • the first difference threshold may be set according to an empirical value or a business requirement.
  • For example, the time range may be 10 ms before and after the target time point.
  • As shown in FIG. 3, the computer device can calculate the time difference between each other time point in the target audio data, such as time point 1, time point 2, time point 3, and time point 4, and the target time point D. Suppose the calculation gives: the time difference D1 between time point 1 and the target time point D is 20 ms, the time difference D2 between time point 2 and D is 5 ms, the time difference D3 between time point 3 and D is 5 ms, and the time difference D4 between time point 4 and D is 20 ms.
  • It can then be judged in turn whether D1, D2, D3, and D4 are less than 10 ms; since only D2 and D3 are less than the first difference threshold, time point 2 and time point 3 are used as reference points of the target time point.
  • It should be noted that time point 1, time point 2, time point 3, and time point 4 are only used here for illustration; in the actual calculation process, the computer device calculates the difference between every other time point in the target audio data and the target time point, and takes the time points whose difference is less than the first difference threshold as reference points, i.e., the reference points include the time points within 10 ms before and after the target time point.
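The reference-point selection described above can be sketched as follows; the function name and the millisecond time base are illustrative assumptions, not taken from the patent text:

```python
# Keep every time point whose distance to the target time point is below
# the first difference threshold (10 ms in the FIG. 3 example).
def find_reference_points(time_points_ms, target_ms, first_diff_threshold_ms=10):
    """Return the time points within the threshold of the target (excluding it)."""
    return [t for t in time_points_ms
            if t != target_ms and abs(t - target_ms) < first_diff_threshold_ms]

# Example mirroring FIG. 3: the points at +/-5 ms pass, the points at +/-20 ms do not.
points = [80, 95, 100, 105, 120]   # ms; 100 plays the role of target time point D
refs = find_reference_points(points, 100)   # -> [95, 105]
```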
  • the computer device can obtain the audio energy function in the frequency domain to calculate the audio energy value of the target time point and the reference point respectively.
  • the computer device may use the audio energy function in the time domain to calculate the audio energy value for the target time point and the reference point respectively.
  • the audio energy function in this time domain is faster to calculate and has a higher temporal resolution.
  • After testing both the time-domain audio energy function and the frequency-domain audio energy function, it was found that the time-domain audio energy function gives a better checking effect on the target time point.
  • the time domain refers to analyzing the part related to time when analyzing a function or signal
  • The frequency domain refers to analyzing the part related to frequency when analyzing a function or signal.
  • Specifically, the computer device can first determine the audio energy value of the target time point according to the audio amplitude value of the target time point and the audio energy function, and determine the audio energy change value of the target time point according to the audio energy value and the audio energy change function; the computer device then performs a weighted summation of the audio energy value and the audio energy change value of the target time point to obtain the energy evaluation value of the target time point, as shown in Formula 1.1:
  • F = c0 · E + c1 · ΔE    (Formula 1.1)
  • where E represents the audio energy value of the target time point, ΔE represents the audio energy change value of the target time point, and F represents the energy evaluation value of the target time point; c0 and c1 are two constants that control the weights (proportions) of the audio energy value and the audio energy change value. The values of c0 and c1 can be set empirically, as long as c0 + c1 = 1; for example, c0 may be 0.1 and c1 may be 0.9.
  • the calculation method of the energy evaluation value of the reference point may refer to the calculation method of the energy evaluation value of the target time point, which will not be repeated here.
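A minimal sketch of the energy-evaluation step: the patent does not spell out the energy and energy-change functions here, so a common time-domain choice is assumed (frame energy as the sum of squared amplitudes, energy change as the increase over the previous frame); only the weighted sum F = c0·E + c1·ΔE with c0 + c1 = 1 is taken from Formula 1.1:

```python
def frame_energy(amplitudes):
    """Assumed time-domain audio energy of one frame: sum of squared amplitudes."""
    return sum(a * a for a in amplitudes)

def energy_evaluation(curr_frame, prev_frame, c0=0.1, c1=0.9):
    """F = c0 * E + c1 * dE, with c0 + c1 = 1 (weights set empirically)."""
    e = frame_energy(curr_frame)
    de = max(0.0, e - frame_energy(prev_frame))  # assume only energy increases count
    return c0 * e + c1 * de

prev = [0.1, 0.1, 0.1]
curr = [0.5, 0.6, 0.5]
f = energy_evaluation(curr, prev)   # ~0.833 for this toy frame pair
```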
  • S203 Perform an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point.
  • the energy evaluation value may include two parts, which are the maximum energy evaluation value and the mean value.
  • An accent point is usually a time point with high energy or a sudden change in energy, so it can be detected, according to the energy evaluation value of the target time point and the energy evaluation values of the reference points, whether there is high energy or a sudden energy change near the target time point. If so, the target time point can be considered a relatively accurate accent point, and it can be added to the target accent point set as a target accent point through step S204.
  • In one implementation, the computer device can determine the maximum energy evaluation value from the energy evaluation value of the target time point and the energy evaluation values of the reference points, and judge whether the maximum energy evaluation value is greater than the evaluation energy threshold. If it is, this indicates that there is a time point with large energy near the target time point, and the target time point passes the accuracy check; if the maximum energy evaluation value is less than or equal to the evaluation energy threshold, this indicates that there is no time point with relatively large energy near the target time point, and the target time point fails the accuracy check.
  • the evaluation energy threshold can be set according to experience.
  • In another implementation, the computer device can average the energy evaluation value of the target time point and the energy evaluation values of the reference points, and judge whether the mean value is greater than the mean evaluation threshold. If it is, the energy of the time points near the target time point is high, which indicates that a time point with large energy exists, and the target time point passes the accuracy check; if the mean value is less than or equal to the mean evaluation threshold, the energy of the time points near the target time point is low, which indicates that no time point with high energy exists, and the target time point fails the accuracy check.
  • the mean evaluation threshold can be set according to experience.
  • In yet another implementation, a comprehensive evaluation can be performed by combining the maximum energy evaluation value and the mean value.
  • the computer device can determine the maximum energy evaluation value and the average value according to the energy evaluation value of the target time point and the energy evaluation value of the reference point, and then perform the accuracy check on the target time point according to the maximum energy evaluation value and the average value.
  • the computer device can directly add the target time point as a target accent point to the set of target accent points; in one implementation, in order to increase the accuracy of screening the target time point, the embodiment of the present application can also perform secondary screening on the target time point.
  • the computer device screens the target time point according to the local maximum amplitude value of the target time point: if the local maximum amplitude value is greater than the first amplitude threshold, the target time point can be added to the set of target accent points as a target accent point.
  • the computer device can obtain a target time point and a reference point of the target time point from the target audio data, perform energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain the energy evaluation value of the target time point, and perform energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain the energy evaluation value of the reference point. The accuracy of the target time point is then checked according to the energy evaluation value of the target time point and the energy evaluation value of the reference point; if the target time point passes the accuracy check, the target time point is added to the target accent point set as a target accent point.
  • by checking the accuracy of the target time point, the accuracy of accent point extraction can be effectively improved, so that a set of target accent points accurate to the frame level (that is, the time point level) is obtained.
  • FIG. 4 is a schematic flowchart of another audio detection method provided by an embodiment of the present application.
  • the audio detection method described in this embodiment can be executed by a computer device, and it can include the following steps S401-S406:
  • S401 Obtain a target time point and a reference point of the target time point from the target audio data.
  • the computer device can first obtain the target audio data. Specifically, the computer device can obtain original audio data from a video or from other data sources (such as a network or local storage), where each time point in the original audio data has a corresponding sound frequency. The original audio data is then preprocessed to obtain the target audio data, where the preprocessing may include at least one of the following (1)-(3):
  • the original audio data is filtered using the target frequency range.
  • the target frequency range can be set according to experience; for example, the target frequency range is set to 10 Hz to 5000 Hz.
  • by adopting the target frequency range, the computer device can effectively filter out low-frequency audio and noise that cannot be heard by the human ear, and at the same time filter out high-frequency components such as breathing and friction sounds in some audio data; keeping only the time points within the target frequency range avoids noise interference and yields relatively clean target audio data, thereby reducing the difficulty of subsequently identifying accent points in the target audio data.
  • the computer device can perform volume unification according to the maximum and minimum values of the sound waveform in the original audio data. The unification keeps the volume of the audio data uniformly between a maximum value and a minimum value; for example, the volume of the audio data is normalized to between -1 and 1, so as to reduce the difficulty of subsequently screening accent points in the target audio data.
  • that is, the original audio data is filtered using the target frequency range, and the volume of the filtered audio data is unified, so as to reduce the difficulty of identifying and screening accent points in the target audio data.
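The two preprocessing operations can be sketched together as follows. The FFT-mask filtering method and the function name are illustrative assumptions (any band filter over 10 Hz to 5000 Hz would serve); only the frequency range and the [-1, 1] target range come from the description above.

```python
import numpy as np

# Illustrative preprocessing sketch: band filtering via an FFT mask and
# peak volume normalization to [-1, 1]. The filtering method itself is
# an assumption, not the patent's prescribed filter.
def preprocess(samples, sample_rate, low_hz=10.0, high_hz=5000.0):
    samples = np.asarray(samples, dtype=np.float64)
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(samples.size, d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0  # drop out-of-range bins
    filtered = np.fft.irfft(spectrum, n=samples.size)
    peak = np.max(np.abs(filtered))
    return filtered / peak if peak > 0 else filtered      # volume in [-1, 1]
```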
  • the target time point and the reference point of the target time point can be acquired from the target audio data.
  • the target time point may be any initial accent point in the initial accent point set, or the target time point may be any supplementary point in the supplementary time point set.
  • the multiple initial accent points in the initial accent point set are obtained by performing point extraction on the target audio data by using a point extraction algorithm.
  • the supplementary time point set is obtained by extending outward from the initial accent point set in the target audio data. Specifically, the multiple time points in the target audio data are arranged in chronological order, and the set of supplementary time points is obtained as follows:
  • the computer device determines a starting accent point and an ending accent point from the initial accent point set, where the starting accent point refers to the earliest accent point in the initial accent point set, and the ending accent point refers to the accent point with the latest time in the initial accent point set; the computer device then determines the start arrangement position of the starting accent point in the target audio data and the end arrangement position of the ending accent point in the target audio data, as shown in Figure 5a.
  • the computer device extends sampling, according to the sampling frequency, to the time point positions before the start arrangement position in the target audio data, and likewise to the time point positions after the end arrangement position in the target audio data; the directions of the extended sampling can be seen in Figure 5a. The time points obtained by the extended sampling are added to the set of supplementary time points as supplementary points.
  • for example, sampling is performed at a sampling interval of 10 ms, so that the 4 sampling points shown in Figure 5a can be obtained, and the time points corresponding to the 4 sampling points are added to the set of supplementary time points as supplementary points.
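The extension procedure above can be sketched as follows, assuming accent points and the audio duration are given in milliseconds; all names are illustrative.

```python
# Sketch of building the supplementary time point set: extend beyond the
# earliest and latest initial accent points at a fixed sampling interval
# (10 ms in the example above). Names are illustrative assumptions.
def supplementary_points(initial_accents_ms, duration_ms, interval_ms=10):
    start = min(initial_accents_ms)   # starting accent point
    end = max(initial_accents_ms)     # ending accent point
    points = []
    t = start - interval_ms
    while t >= 0:                     # extend before the start position
        points.append(t)
        t -= interval_ms
    t = end + interval_ms
    while t <= duration_ms:           # extend after the end position
        points.append(t)
        t += interval_ms
    return sorted(points)
```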
  • the calculation method of the energy evaluation value of the target time point is similar to the calculation method of the energy evaluation value of the reference point; for the convenience of explanation, the target time point is used as an example in the following. Specifically, energy evaluation processing is performed on the target time point according to the audio amplitude value of the target time point to obtain the energy evaluation value of the target time point.
  • the specific implementation may include the following steps s11-s15:
  • s11 Obtain multiple associated points of the target time point from the multiple time points.
  • the associated point refers to a time point whose time difference from the target time point is less than a second difference threshold
  • the second difference threshold may be set according to experience.
  • the second difference threshold can be set as ⌊k/2⌋, that is, k/2 rounded down, where k can be set according to an empirical value. For example, if k is equal to 2000 ms, then rounding down k/2 (i.e. 1000 ms) gives 1000 ms; if k is equal to 2001 ms, then rounding down k/2 (i.e. 1000.5 ms) gives 1000 ms.
  • the associated points include time points within 1000ms before and after the target time point.
  • the audio energy function can be expressed as Equation 1.2:
  • k' represents the number of associated points at the target time point, and the k' can be determined according to the value of k.
  • the embodiment of the present application uses the target time point as an example, and the calculation of the audio energy value of other time points (including the above reference point) may refer to the calculation method of the target time point.
  • step s12 may be: performing a square operation on the audio amplitude value of the target time point to obtain the initial energy value of the target time point, and performing a square operation on the audio amplitude value of each associated point to obtain the initial energy value of each associated point. Then, a mean operation is performed on the initial energy value of the target time point and the initial energy values of the associated points to obtain the audio energy value of the target time point. Specifically, the computer device performs the mean operation on the initial energy value of the target time point and the initial energy values of the associated points to obtain an intermediate energy value. Then, the intermediate energy value is directly used as the audio energy value of the target time point; or, the intermediate energy value is denoised to obtain the audio energy value of the target time point.
  • the denoising of the intermediate energy value to obtain the audio energy value of the target time point may be implemented as follows: the computer device may use the intermediate energy values of all time points to form a curve of the intermediate energy value changing with the time point, and then use Gaussian filtering or box filtering to perform a curve smoothing operation, adjusting the intermediate energy value of the target time point to obtain the audio energy value of the target time point.
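Steps s11-s12 can be sketched as a moving mean of squared amplitudes with an optional box-filter smoothing pass; the window parameters are illustrative assumptions.

```python
import numpy as np

# Sketch of the audio energy value: the mean of squared amplitudes over
# the target time point and its associated points (a window of half-width
# floor(k/2) samples), optionally denoised with a box filter. Window and
# smoothing parameters are illustrative assumptions.
def audio_energy(amplitudes, half_width, smooth=0):
    squared = np.asarray(amplitudes, dtype=np.float64) ** 2  # initial energy values
    window = np.ones(2 * half_width + 1) / (2 * half_width + 1)
    energy = np.convolve(squared, window, mode="same")       # mean over associated points
    if smooth > 0:                                           # optional box-filter smoothing
        box = np.ones(2 * smooth + 1) / (2 * smooth + 1)
        energy = np.convolve(energy, box, mode="same")
    return energy
```

A Gaussian kernel could replace the box filter in the smoothing pass, matching the two denoising options described above.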
  • s13 Obtain the precursor point of the target time point from the multiple time points.
  • the precursor points include: c time points selected in sequence forward, based on the arrangement position of the target time point among the plurality of time points, where c is a positive integer.
  • c is an adjustable parameter.
  • the audio energy change function can be shown in Equation 1.3:
  • Δ'_i represents the initial energy change value
  • Ei represents the audio energy value
  • j represents the index in the summation symbol
  • c is an adjustable parameter, which can be used to control the number of forward difference summations and the number of precursor points.
  • the function calculates the first-order mean difference of the energy function.
  • the target time point is the ith point
  • the precursor points of the target time point may include the (i-1)th point, the (i-2)th point, ..., and the (i-c)th point; E_(i-j) represents the audio energy value at the (i-j)th time point.
  • step s14 may be implemented as follows: the computer device obtains the sum of the audio energy values of the time points among the precursor points, and obtains a reference value (for example, the reference value may be 0). It then calculates the difference between c times the audio energy value of the target time point and the sum of the audio energy values, and uses the maximum of the reference value and the calculated difference as the initial energy change value of the target time point. Finally, the audio energy change value of the target time point is determined according to the initial energy change value of the target time point.
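A minimal sketch of step s14's initial energy change value follows. The direction of the difference (c times the target point's energy minus the precursor sum, so that rising energy yields a positive change) is a reading of the description rather than a verbatim formula.

```python
# Sketch of the initial energy change value: a rectified first-order
# difference between the target time point's energy and its c precursor
# points, clipped below by the reference value (0 here). The sign
# convention is an assumption based on the surrounding description.
def initial_energy_change(energies, i, c, reference=0.0):
    prec_sum = sum(energies[i - j] for j in range(1, c + 1))  # precursor energies
    diff = c * energies[i] - prec_sum
    return max(reference, diff)
```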
  • the computer device may directly use the initial energy change value of the target time point as the audio energy change value of the target time point; in another implementation manner, because the range of the initial energy change value of the target time point is very large, the initial energy change value of the target time point needs to be normalized.
  • a normalization method pk_normalize is defined in the embodiment of the present application. The method normalizes the value at the target time point by using the average of the largest n peaks among the initial energy change values of the time points in the target audio data. Compared with simple 0-1 normalization, this normalization can avoid the influence of a few abnormally large changes in audio energy; at the same time, the strategy of selecting only the largest n peaks can also avoid interference from the large number of smaller audio energy changes.
  • the computer device may acquire the initial energy change values at each time point in the target audio data, and determine a plurality of peaks from the initial energy change values at each time point.
  • the peak value refers to the initial energy change value of the peak time point in the target audio data.
  • the peak time point satisfies the following conditions: the initial energy change value of the peak time point is greater than the initial energy change value of the two time points that are located on the left and right sides of the peak time point and adjacent to the peak time point.
  • four peaks may be determined from the initial energy change values at various time points, namely, peak 1, peak 2, peak 3, and peak 4.
  • the computer equipment normalizes the initial energy change value at the target time point by using the mean value of the multiple peaks to obtain the audio energy change value at the target time point.
  • the computer device normalizes the initial energy change value of the target time point by using the mean value of multiple peaks, and obtaining the audio energy change value of the target time point includes the following two situations: (1) the computer device directly calculates the average of all the peaks, and then uses the obtained average to normalize the initial energy change value of the target time point; (2) the computer device sorts the multiple peaks, obtains the n largest peaks from the sorted peaks, calculates the average of these n peaks, and normalizes the initial energy change value of the target time point according to the calculated average.
  • the value of n can be set according to experience, for example, the value of n can be set to 1/3 of the number of peaks.
  • for example, n is 3.
  • the computer device sorts the acquired 4 peaks from large to small; that is, the sorted order of these 4 peaks is peak 1, peak 3, peak 2, peak 4.
  • the computer device can then obtain the 3 largest peaks, namely peak 1, peak 3, and peak 2.
  • the average value of multiple peaks is used to normalize the initial energy change value of the target time point
  • the specific implementation of obtaining the audio energy change value of the target time point is as follows: the computer device obtains the audio energy value of each time point, determines the minimum audio energy value from these audio energy values, and uses the mean value of the multiple peaks together with the minimum audio energy value to shrink the initial energy change value of the target time point, thereby obtaining the audio energy change value of the target time point.
  • the minimum audio energy value can be represented by min(E)
  • the mean value of the multiple peaks can be represented by mean(topn(peak(Δ')))
  • peak(Δ') represents determining all the peaks from the initial energy change values in the target audio data (corresponding to the above multiple peaks), and topn(peak(Δ')) represents selecting the n largest peaks from all the peaks. The mean value mean(topn(peak(Δ'))) of the multiple peaks and the minimum audio energy value min(E) are used to shrink the initial energy change value of the target time point to obtain the audio energy change value Δ of the target time point; the specific calculation process can be found in Equation 1.4:
  • a is an adjustable parameter, which can fine-tune and control the audio energy change value at the final target time point.
  • the value of a can be set according to experience, for example, a can be set to 1.5.
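The pk_normalize strategy can be sketched as below. Since Equation 1.4 is not reproduced here, the exact shrink expression (a times the top-n peak mean, minus min(E)) is an assumption; the peak detection and top-n selection follow the description above.

```python
import numpy as np

# Sketch of the pk_normalize strategy: detect peaks of the initial
# energy change curve (values greater than both neighbours), keep the
# largest n (a third of the peaks here), and scale each change value
# using the peak mean, min(E), and the tunable parameter a. The scale
# expression is an assumed reading of Equation 1.4.
def pk_normalize(delta, energies, a=1.5):
    delta = np.asarray(delta, dtype=np.float64)
    peaks = [delta[i] for i in range(1, delta.size - 1)
             if delta[i] > delta[i - 1] and delta[i] > delta[i + 1]]
    n = max(1, len(peaks) // 3)                 # keep the largest n peaks
    top_mean = np.mean(sorted(peaks, reverse=True)[:n])
    scale = a * top_mean - np.min(energies)     # assumed shrink factor
    return delta / scale
```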
  • weighted summation is performed on the audio energy value and the audio energy change value to obtain an energy evaluation value at the target time point.
  • S403 calculate the energy mean value of the energy evaluation value of the reference point and the energy evaluation value of the target time point.
  • S404 Determine the maximum energy evaluation value from the energy evaluation value of the target time point and the energy evaluation value of the reference point.
  • a threshold can be set as a condition for whether the target time point passes the accuracy check.
  • the threshold can also be understood as a condition for screening target time points.
  • the computer device can first calculate the difference between the maximum energy evaluation value and the energy mean value, and determine whether this difference is greater than the threshold. If the difference is greater than the threshold, it is determined that the target time point passes the accuracy check, which can be understood as the target time point being a time point with a large energy change; if the difference is less than or equal to the threshold, it is determined that the target time point fails the accuracy check, which means that the target time point is a time point with a small energy change.
  • the computer device may add the target time point that has passed the verification to the target accent point set as a target accent point.
  • the target accent point set can be represented by R_0. All accent points in the target accent point set satisfy Equation 1.5:

  R_0 = { i ∈ {beat} | Fmax[i] - Fmean[i] > s_0 }
  • the maximum energy evaluation value is Fmax[i]
  • the mean value is Fmean[i]
  • i ∈ {beat} represents the target time point.
  • the screening threshold is s_0, which can be set according to experience.
  • if the target time point is any initial accent point in the initial accent point set, the screening threshold may be set to a smaller value, for example 0.1; if the target time point is any supplementary point in the supplementary time point set, the screening threshold can be appropriately increased, for example to 0.3.
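The screening of Equation 1.5 can be sketched as a simple filter over candidate time points; the container types and names are illustrative.

```python
# Sketch of the first screening step (Equation 1.5): a candidate time
# point i is kept when Fmax[i] - Fmean[i] exceeds the screening
# threshold s0. Helper values are passed in precomputed; names are
# illustrative assumptions.
def screen_candidates(candidates, f_max, f_mean, s0=0.1):
    """candidates: time points; f_max / f_mean: dicts mapping each
    candidate to its maximum and mean energy evaluation values."""
    return [i for i in candidates if f_max[i] - f_mean[i] > s0]
```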
  • the computer device can also determine whether the target time point is an accent point according to the local maximum amplitude value in the target audio data. That is, the computer equipment can further screen the target time point according to the local maximum amplitude value of the target time point, thereby increasing the accuracy of the selection of accent points.
  • the computer device selects, from the absolute values of the audio amplitude values of the associated points and the absolute value of the audio amplitude value of the target time point, the maximum absolute value as the local maximum amplitude value of the target time point.
  • the local maximum amplitude value of the target time point can be calculated by using the waveform local maximum amplitude function, and the calculation formula can be found in Equation 1.6:
  • the associated point refers to a time point whose time difference from the target time point is less than the second difference threshold.
  • the second difference threshold can be set according to experience.
  • the computer device may determine whether the local maximum amplitude value at the target time point is greater than the first amplitude threshold, and if the local maximum amplitude value at the target time point is greater than the first amplitude threshold, Then, the target time point is added to the set of target accent points as the target accent point.
  • the first amplitude threshold can be set according to experience and can be represented by S_1. In one implementation, if the target time point is any initial accent point in the initial accent point set, the first amplitude threshold may be set to a smaller value; for example, the first amplitude threshold may be set to 0.1.
  • if the target time point is any supplementary point in the supplementary time point set, the first amplitude threshold may be appropriately increased.
  • secondary screening of the accent points in the set R_0 may be performed according to the local maximum amplitude values of the accent points in R_0 to obtain the latest target accent point set R_1; all accent points in the latest target accent point set satisfy Equation 1.7:
  • A[i] represents the i-th time point in R 0
  • S 1 is the first amplitude threshold
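Equations 1.6 and 1.7 can be sketched together: compute each point's local maximum amplitude over its associated-point window, then keep the points exceeding S_1. Window size and threshold values are illustrative assumptions.

```python
import numpy as np

# Sketch of the secondary screening: the local maximum amplitude of a
# point is the largest |amplitude| within its associated-point window
# (a reading of Equation 1.6); points in R0 whose local maximum exceeds
# the first amplitude threshold S1 are kept (a reading of Equation 1.7).
def local_max_amplitude(amplitudes, i, half_width):
    lo = max(0, i - half_width)
    hi = min(len(amplitudes), i + half_width + 1)
    return np.max(np.abs(amplitudes[lo:hi]))

def secondary_screen(r0, amplitudes, half_width, s1=0.1):
    amplitudes = np.asarray(amplitudes, dtype=np.float64)
    return [i for i in r0
            if local_max_amplitude(amplitudes, i, half_width) > s1]
```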
  • note onset points may also be screened to supplement the accent points in the target accent point set.
  • the computer device can extract the note onset point of at least one note from the target audio data according to a note onset detection algorithm (such as the librosa.onset algorithm), wherein one note corresponds to at least two time points.
  • a note start point refers to: the earliest time point among at least two time points corresponding to a note.
  • the computer device obtains the energy evaluation value of the note start point and the local maximum amplitude value of the note start point, and judges whether they satisfy the stress condition. If the energy evaluation value and the local maximum amplitude value of the note start point satisfy the stress condition, the note start point is added to the set of target accent points as a target accent point. The stress condition includes at least one of the following: the energy evaluation value of the note start point is greater than the energy evaluation threshold, and the local maximum amplitude value of the note start point is greater than the second amplitude threshold.
  • because a target accent point in the target accent point set may lie at the peak of the energy change, by the time a listener perceives the target accent point, the accent may already be about to disappear.
  • the computer device may further optimize the target accent points in the target accent point set. For any target accent point in the set, the computer device obtains the note start point of the target note to which that target accent point belongs, and replaces that target accent point in the target accent point set with the note start point of the target note. It can be understood that the note start point can also be regarded as an accent point.
  • the computer device acquires a note onset intensity evaluation curve of the target audio data, where the curve includes a plurality of time points arranged in chronological order and the note intensity value of each time point. Any target accent point is then mapped onto the note onset intensity evaluation curve to obtain the target position of that accent point on the curve; starting from the target position, at least one note intensity value is traversed in turn along the direction of decreasing time; if the note intensity value currently traversed satisfies the note intensity condition, the traversal stops, and the time point corresponding to the current note intensity value is used as the note start point of the target note to which the target accent point belongs.
  • for example, the computer device maps a certain target accent point to the note onset intensity evaluation curve and obtains the target position A1 of that accent point on the curve.
  • the computer device traverses at least one note intensity value in turn starting from A1 along the direction of decreasing time (the direction indicated by the arrow in Figure 5c). Because the note intensity value at time point A2 is greater than the next note intensity value y2, the traversal continues to y2 (corresponding to time point A3); y2 is less than the value at A2 and is also smaller than the note intensity value y3 (corresponding to time point A4), so the traversal stops, and the time point A3 corresponding to y2 is used as the note start point of the target note to which the target accent point belongs.
  • as another example, the computer device maps the target accent point to the note onset intensity evaluation curve and obtains the target position B1 of the accent point on the curve.
  • the computer device traverses at least one note intensity value in turn starting from B1 along the direction of decreasing time (the direction indicated by the arrow in Figure 5d). When the traversal reaches the note intensity value 0 (corresponding to time point B2): this value is less than the note intensity value corresponding to B1, the note intensity value at the time point immediately before B2 is equal to the current value 0, and the note intensity value at the time point immediately after B2 is greater than the current value 0; therefore the traversal stops, and the time point B2 corresponding to the note intensity value 0 is used as the note start point of the target note to which the target accent point belongs.
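The backward traversal in the two examples above can be sketched as a search for the first local minimum of the note intensity curve, a value no greater than its earlier neighbour and strictly less than its later neighbour; this stopping rule is an illustrative reading of the A3 and B2 examples, not a verbatim condition from the embodiment.

```python
# Sketch of replacing a target accent point with its note start point:
# walk back along the note onset intensity curve from the accent's
# position and stop at the first local minimum. The stopping rule is an
# assumed reading of the A3 and B2 examples above.
def find_note_start(strength, accent_index):
    for i in range(accent_index - 1, 0, -1):   # walk back along the curve
        if strength[i] <= strength[i - 1] and strength[i] < strength[i + 1]:
            return i                           # first local minimum found
    return 0                                   # fall back to the curve start
```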
  • the computer device may obtain the note onset intensity evaluation curve of the target audio data as follows: the computer device uses a short-time Fourier transform (STFT) to convert the target audio data from the time domain to the frequency domain and generate a spectrogram; then the differences between adjacent frames of the spectrogram are computed, and the note onset intensity evaluation curve is obtained by summing these inter-frame differences over frequency at each time.
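The curve construction can be sketched as a spectral-flux computation: STFT magnitude frames are differenced and the positive differences are summed over frequency. Frame and hop sizes are illustrative assumptions.

```python
import numpy as np

# Sketch of the note onset intensity evaluation curve: short-time
# Fourier transform, frame-to-frame magnitude differences, and a sum of
# the positive differences over frequency (spectral flux). Frame and
# hop sizes are illustrative assumptions.
def onset_strength_curve(samples, frame=512, hop=256):
    samples = np.asarray(samples, dtype=np.float64)
    window = np.hanning(frame)
    mags = []
    for start in range(0, len(samples) - frame + 1, hop):
        seg = samples[start:start + frame] * window
        mags.append(np.abs(np.fft.rfft(seg)))       # magnitude spectrum per frame
    mags = np.array(mags)
    flux = np.diff(mags, axis=0)                    # frame-to-frame difference
    return np.maximum(flux, 0.0).sum(axis=1)        # sum the rises over frequency
```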
  • the target accent points in the target accent point set may be converted into a format required by the application for output.
  • the application may be a player dedicated to playing music, or video software, and so on.
  • the computer device can obtain a target time point and a reference point of the target time point from the target audio data, perform energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain the energy evaluation value of the target time point, and perform energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain the energy evaluation value of the reference point. The accuracy of the target time point is then checked according to the energy evaluation value of the target time point and the energy evaluation value of the reference point; if the target time point passes the accuracy check, the target time point is added to the target accent point set as a target accent point.
  • by checking the accuracy of the target time point, the accuracy of accent point extraction can be effectively improved, so that a set of target accent points accurate to the frame level (that is, the time point level) is obtained.
  • the embodiment of the present application further provides a specific audio detection solution.
  • the specific process of the audio detection solution can be seen in FIG. 6.
  • the process of the audio detection solution is as follows: the encoding formats of different audio files can be unified first, that is, the computer device first sets a unified encoding format for the audio files.
  • the computer device processes the video according to the set encoding format, extracts audio data from the processed video, and preprocesses the audio data; the preprocessing includes frequency range filtering of the audio data and overall volume normalization of the audio data.
  • after preprocessing the audio data, the computer device extracts point information from the preprocessed audio data, where the point information extraction includes target time point extraction and note start point extraction. The target time points are evaluated according to the audio energy function, the audio energy change function and the waveform local maximum amplitude function, and are screened according to the evaluation results to obtain the target accent point set. Further, after obtaining the target accent point set, the computer device can also supplement accent points and add the supplemented accent points to the target accent point set as target accent points; the target accent points in the set are then optimized to obtain the final target accent point set, and the target accent point set is output, so that the accent points in the target audio data can be accurately determined.
  • after the accent points in the target audio data are accurately determined, the accent points can be marked in the target audio data; then, according to the marked accent points, an editing tool or a content creator can be provided with the time points for picture switching, automatically generating or assisting in the creation of beat-synchronized videos, that is, switching the picture at the accented rhythm points of the music so that the sound and picture of the video are synchronized, allowing the audience to feel a consistent sense of rhythm both visually and aurally and bringing a more comfortable sensory experience.
  • the marked accent points can be used as background music points in the secondary creation or editing of a video; or the marked accent points can also be used to trigger matching lighting or other special effects on a stage or scene, enhancing the atmosphere, and so on.
  • the embodiments of the present application further disclose an audio detection apparatus.
  • the audio detection apparatus may be a hardware component provided in the above-mentioned computer device, or may be a computer program (including program code) running on the above-mentioned computer device.
  • the audio detection apparatus may perform the method shown in FIG. 2 or FIG. 4. Referring to FIG. 7, the audio detection apparatus may operate the following units:
  • Obtaining unit 701 is used to obtain a target time point and a reference point of the target time point from the target audio data;
  • the target audio data includes a plurality of time points and the audio amplitude value of each time point;
  • the reference point refers to a time point whose time difference from the target time point is less than the first difference threshold;
  • the processing unit 702 is configured to perform energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain the energy evaluation value of the target time point; and according to the audio frequency of the reference point The amplitude value performs energy evaluation processing on the reference point to obtain the energy evaluation value of the reference point;
  • the processing unit 702 is further configured to perform an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point;
  • the processing unit 702 is further configured to add the target time point as the target accent point to the target accent point set if the target time point passes the accuracy check.
  • the obtaining unit 701 is specifically configured to: obtain multiple associated points of the target time point from the multiple time points;
  • the processing unit 702 is specifically configured to: use an audio energy function to calculate the audio energy value of the target time point according to the audio amplitude value of each associated point and the audio amplitude value of the target time point; the association Point refers to a time point whose time difference from the target time point is less than the second difference threshold;
  • the obtaining unit 701 is specifically configured to: obtain precursor points of the target time point from the plurality of time points, where the precursor points include: c time points selected in sequence moving forward, based on the arrangement position of the target time point among the plurality of time points, c being a positive integer;
  • the processing unit 702 is specifically configured to: use an audio energy change function to calculate the audio energy change value of the target time point according to the audio energy value of the target time point and the audio energy value of each of the precursor points; and perform weighted summation on the audio energy value and the audio energy change value to obtain the energy evaluation value of the target time point.
  • in an embodiment, the processing unit 702 is specifically configured to:
  • the processing unit 702 is specifically configured to: obtain the sum of the audio energy values of the respective precursor points;
  • the obtaining unit 701 is configured to obtain a reference value;
  • the processing unit 702 is specifically configured to: calculate the difference between c times the audio energy value of the target time point and the sum of the audio energy values; take the maximum of the reference value and the calculated difference as the initial energy change value of the target time point; and determine the audio energy change value of the target time point according to the initial energy change value of the target time point.
  • the obtaining unit 701 is configured to obtain the initial energy change value of each time point in the target audio data
  • the processing unit 702 is specifically configured to: determine a plurality of peaks from the initial energy change values of the respective time points, where a peak refers to the initial energy change value of a peak time point in the target audio data, and a peak time point satisfies the following condition: the initial energy change value of the peak time point is greater than the initial energy change values of the two time points located on its left and right sides and adjacent to it; and normalize the initial energy change value of the target time point by using the mean value of the plurality of peaks, to obtain the audio energy change value of the target time point.
  • the obtaining unit 701 is configured to obtain the audio energy value of each time point
  • the processing unit 702 is specifically configured to: determine the minimum audio energy value from the audio energy values of the respective time points; and shrink the initial energy change value of the target time point by using the mean value of the plurality of peaks and the minimum audio energy value, to obtain the audio energy change value of the target time point.
  • before adding the target time point as a target accent point to the target accent point set, the processing unit 702 is further configured to: obtain the local maximum amplitude value of the target time point; and if the local maximum amplitude value is greater than a first amplitude threshold, perform the step of adding the target time point as a target accent point to the target accent point set.
  • the target time point is any initial accent point in the initial accent point set, or any supplementary point in the supplementary time point set; wherein the plurality of initial accent points in the initial accent point set are obtained by performing point extraction on the target audio data by using a point extraction algorithm;
  • the multiple time points in the target audio data are arranged in chronological order, and the processing unit 702 is specifically used for:
  • a starting accent point and an ending accent point are determined from the initial accent point set, where the starting accent point refers to the earliest accent point in the initial accent point set, and the ending accent point refers to the latest accent point in the initial accent point set;
  • extended sampling is performed, according to the sampling frequency, on the time points in the target audio data located before the starting arrangement position and on the time points located after the ending arrangement position; and the time points obtained by the extended sampling are added to the supplementary time point set as supplementary points.
  • the processing unit 702 is further configured to: extract the note start point of at least one note from the target audio data, where a note is determined based on at least two time points and the audio amplitude values corresponding to the at least two time points, and the note start point refers to the earliest time point among the at least two time points corresponding to a note;
  • the obtaining unit 701 is also used to obtain the energy evaluation value of the note onset point and the local maximum amplitude value of the note onset point;
  • the processing unit 702 is further configured to: if the energy evaluation value and the local maximum amplitude value of the note start point satisfy an accent condition, add the note start point as a target accent point to the target accent point set;
  • the accent condition includes at least one of the following: the energy evaluation value of the note start point is greater than an energy evaluation threshold, and the local maximum amplitude value of the note start point is greater than a second amplitude threshold.
  • the obtaining unit 701 is further configured to, for any target accent point in the target accent point set, obtain the note start point of the target note to which the any target accent point belongs ;
  • the processing unit 702 is further configured to, in the set of target accent points, use the note start point of the target note to replace any of the target accent points.
  • the obtaining unit 701 is specifically configured to obtain a note onset intensity evaluation curve of the target audio data, where the note onset intensity evaluation curve includes the plurality of Time points and note intensity values for each time point;
  • the processing unit 702 is specifically configured to: map the any target accent point onto the note onset intensity evaluation curve to obtain the target position of the any target accent point on the note onset intensity evaluation curve; traverse at least one note intensity value in sequence, starting from the target position and in the direction of decreasing time; and if the currently traversed note intensity value satisfies a note intensity condition, stop the traversal and use the current time point corresponding to the current note intensity value as the note start point of the target note to which the any target accent point belongs;
  • the note intensity condition includes: the note intensity value of the time point located before and adjacent to the current time point is greater than or equal to the current note intensity value, and the note intensity value of the time point located after and adjacent to the current time point is greater than the current note intensity value.
  • before the target time point and the reference point of the target time point are acquired from the target audio data, the acquiring unit 701 is further configured to acquire original audio data, where each time point in the original audio data has a corresponding sound frequency;
  • the processing unit 702 is further configured to preprocess the original audio data to obtain the target audio data; the preprocessing includes at least one of the following: filtering the original audio data by using a target frequency range, and performing volume unification processing on the original audio data or on the filtered audio data.
  • each step involved in the method shown in FIG. 2 or FIG. 4 may be performed by each unit in the audio detection apparatus shown in FIG. 7 .
  • step S201 shown in FIG. 2 is performed by the acquiring unit 701 shown in FIG. 7
  • steps S202 to S204 are all performed by the processing unit 702 shown in FIG. 7 .
  • step S401 shown in FIG. 4 is performed by the acquiring unit 701 shown in FIG. 7
  • steps S402 to S406 are performed by the processing unit 702 shown in FIG. 7 .
  • each unit in the audio detection apparatus shown in FIG. 7 may be separately merged, or all merged, into one or several other units, or one (or some) of the units may be further divided into multiple functionally smaller units; this can realize the same operations without affecting the technical effects of the embodiments of the present application.
  • the above units are divided based on logical functions. In practical applications, the function of one unit may also be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present application, the audio detection apparatus may also include other units, and in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by cooperation of multiple units.
  • the audio detection method can be implemented by a device that includes processing elements and storage elements, such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM), to perform the steps or functions of the audio detection apparatus.
  • the audio detection apparatus shown in FIG. 7 may be constructed, and the audio detection method of the embodiments of the present application implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in FIG. 2 or FIG. 4 on a general-purpose computing device such as a computer.
  • the computer program described above can be recorded on, for example, a computer-readable recording medium, loaded into the above-mentioned computer apparatus via the computer-readable recording medium, and executed therein.
  • the embodiment of the present application further discloses a computer device; please refer to FIG. 8 .
  • the processor 801 , the input device 802 , the output device 803 and the computer storage medium 804 in the computer device may be connected through a bus or other means.
  • the computer storage medium 804 is a memory device in the computer device for storing programs and data. It can be understood that the computer storage medium 804 here may include both a built-in storage medium of the computer device and, certainly, an extended storage medium supported by the computer device. The computer storage medium 804 provides storage space that stores the operating system of the computer device. In addition, one or more instructions suitable for being loaded and executed by the processor 801 are also stored in the storage space, and these instructions may be one or more computer programs (including program code). It should be noted that the computer storage medium here may be a high-speed RAM memory; in an embodiment, it may also be at least one computer storage medium remote from the aforementioned processor. The processor 801 may be a central processing unit (CPU), which is the core and control center of the computer device and is suitable for implementing one or more instructions, specifically loading and executing one or more instructions to realize the corresponding method flow or function.
  • one or more first instructions stored in the computer storage medium can be loaded and executed by the processor 801 to implement the corresponding steps of the methods in the above embodiments of the audio detection method;
  • in a specific implementation, the one or more first instructions in the computer storage medium are loaded by the processor 801 to perform the following operations:
  • the target audio data includes a plurality of time points and the audio amplitude value of each time point;
  • the reference point refers to a time point whose time difference from the target time point is less than the first difference threshold;
  • the accuracy of the target time point is checked
  • the target time point passes the accuracy check, the target time point is added to the set of target accent points as a target accent point.
  • the processor 801 is specifically configured to:
  • the multiple time points are arranged in chronological order; the processor 801 is specifically configured to:
  • the associated point refers to a time point whose time difference with the target time point is less than the second difference threshold;
  • the precursor points include: c time points selected in sequence moving forward, based on the arrangement position of the target time point among the multiple time points, c being a positive integer;
  • Adopt the audio energy change function to calculate the audio energy change value of the target time point according to the audio energy value of the target time point and the audio energy value of each time point in the precursor point;
  • Weighted summation is performed on the audio energy value and the audio energy change value to obtain the energy evaluation value at the target time point.
  • the processor 801 is specifically configured to:
  • the audio energy change value of the target time point is determined.
  • the processor 801 is specifically configured to:
  • a plurality of peaks are determined from the initial energy change values of the respective time points, where a peak refers to the initial energy change value of a peak time point in the target audio data, and a peak time point satisfies the following condition: the initial energy change value of the peak time point is greater than the initial energy change values of the two time points located on its left and right sides and adjacent to it;
  • the initial energy change value of the target time point is normalized by using the mean value of the plurality of peaks to obtain the audio energy change value of the target time point.
  • the processor 801 is specifically configured to:
  • the initial energy change value of the target time point is subjected to shrinkage processing to obtain the audio energy change value of the target time point.
  • before adding the target time point as a target accent point to the target accent point set, the processor 801 is further configured to: obtain the local maximum amplitude value of the target time point; and if the local maximum amplitude value is greater than a first amplitude threshold, perform the step of adding the target time point as a target accent point to the target accent point set.
  • the target time point is any initial accent point in the initial accent point set, or any supplementary point in the supplementary time point set; wherein the plurality of initial accent points in the initial accent point set are obtained by performing point extraction on the target audio data by using a point extraction algorithm;
  • the multiple time points in the target audio data are arranged in chronological order, and the processor 801 is specifically configured to: determine a starting accent point and an ending accent point from the initial accent point set, where the starting accent point refers to the earliest accent point in the initial accent point set, and the ending accent point refers to the latest accent point in the initial accent point set;
  • extended sampling is performed, according to the sampling frequency, on the time points in the target audio data located before the starting arrangement position and on the time points located after the ending arrangement position; and the time points obtained by the extended sampling are added to the supplementary time point set as supplementary points.
  • the processor 801 is further configured to:
  • a note start point of at least one note is extracted from the target audio data, where a note is determined according to at least two time points and the audio amplitude values corresponding to the at least two time points, and the note start point refers to the earliest time point among the at least two time points corresponding to a note;
  • if the energy evaluation value and the local maximum amplitude value of the note start point satisfy an accent condition, the note start point is added to the target accent point set as a target accent point; the accent condition includes at least one of the following: the energy evaluation value of the note start point is greater than an energy evaluation threshold, and the local maximum amplitude value of the note start point is greater than a second amplitude threshold.
  • the processor 801 is further configured to:
  • the note start point of the target note is used to replace any of the target accent points.
  • the processor 801 is specifically configured to:
  • the note onset intensity evaluation curve comprising the multiple time points and the note intensity values of each time point arranged in chronological order;
  • mapping the any target accent point onto the note onset intensity evaluation curve obtains the target position of the any target accent point on the note onset intensity evaluation curve; based on the target position and in the direction of decreasing time, at least one note intensity value is traversed in sequence; if the currently traversed note intensity value satisfies the note intensity condition, the traversal is stopped, and the current time point corresponding to the current note intensity value is used as the note start point of the target note to which the any target accent point belongs;
  • the note intensity condition includes: the note intensity value of the time point located before and adjacent to the current time point is greater than or equal to the current note intensity value, and the note intensity value of the time point located after and adjacent to the current time point is greater than the current note intensity value.
  • the processor 801 before the acquisition of the target time point and the reference point of the target time point from the target audio data, the processor 801 is further configured to:
  • each time point in the original audio data has a corresponding sound frequency
  • the original audio data is preprocessed to obtain the target audio data; the preprocessing includes at least one of the following: filtering the original audio data by using a target frequency range, and performing volume unification processing on the original audio data or on the filtered audio data.
  • the embodiments of the present application also provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps performed in FIG. 2 or FIG. 4 in the above-mentioned audio detection method embodiment.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Algebra (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

Embodiments of this application provide an audio detection method and apparatus, a computer device, and a readable storage medium. The method includes: acquiring a target time point and a reference point of the target time point from target audio data, the reference point being a time point whose time difference from the target time point is less than a first difference threshold; performing energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain an energy evaluation value of the target time point, and performing energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain an energy evaluation value of the reference point; performing an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point; and, if the target time point passes the accuracy check, adding the target time point to a target accent point set as a target accent point. The accent points in the target audio data can thereby be determined relatively accurately.

Description

Audio detection method and apparatus, computer device, and readable storage medium
This application claims priority to Chinese Patent Application No. 202011336979.1, entitled "Audio detection method and apparatus, computer device, and readable storage medium", filed with the China National Intellectual Property Administration on November 25, 2020, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of the Internet, in particular to the field of multimedia technologies, and especially to an audio detection method and apparatus, a computer device, and a readable storage medium.
Background
As video gradually becomes an important medium for spreading content, beat-synced videos have become a very popular type of video creation among creators. A beat-synced video fills the picture by cutting on the accented rhythm points of the music, so that sound and picture are synchronized and the audience perceives a consistent sense of rhythm both visually and aurally, bringing a more comfortable sensory experience. Accent points are thus a key factor in video production. To make the beat-syncing effect more impactful and suitable for short-video content, some relatively important accent points need to be determined from the audio. How to obtain accent points from audio data has therefore become a research focus.
Summary
Embodiments of this application provide an audio detection method and apparatus, a computer device, and a readable storage medium, which can relatively accurately determine the accent points in target audio data.
In one aspect, an embodiment of this application provides an audio detection method, including:
acquiring a target time point and a reference point of the target time point from target audio data, where the target audio data includes multiple time points and an audio amplitude value of each time point, and the reference point is a time point whose time difference from the target time point is less than a first difference threshold;
performing energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain an energy evaluation value of the target time point, and performing energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain an energy evaluation value of the reference point;
performing an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point; and
if the target time point passes the accuracy check, adding the target time point to a target accent point set as a target accent point.
In another aspect, an embodiment of this application provides an audio detection apparatus, including:
an obtaining unit, configured to acquire a target time point and a reference point of the target time point from target audio data, where the target audio data includes multiple time points and an audio amplitude value of each time point, and the reference point is a time point whose time difference from the target time point is less than a first difference threshold; and
a processing unit, configured to perform energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain an energy evaluation value of the target time point, and perform energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain an energy evaluation value of the reference point;
the processing unit being further configured to perform an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point; and
the processing unit being further configured to, if the target time point passes the accuracy check, add the target time point to a target accent point set as a target accent point.
In still another aspect, an embodiment of this application provides a computer device, which includes an input device and an output device, and further includes:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor to perform the following steps:
acquiring a target time point and a reference point of the target time point from target audio data, where the target audio data includes multiple time points and an audio amplitude value of each time point, and the reference point is a time point whose time difference from the target time point is less than a first difference threshold;
performing energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain an energy evaluation value of the target time point, and performing energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain an energy evaluation value of the reference point;
performing an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point; and
if the target time point passes the accuracy check, adding the target time point to a target accent point set as a target accent point.
In yet another aspect, an embodiment of this application provides a computer storage medium storing one or more instructions adapted to be loaded by a processor to perform the following steps:
acquiring a target time point and a reference point of the target time point from target audio data, where the target audio data includes multiple time points and an audio amplitude value of each time point, and the reference point is a time point whose time difference from the target time point is less than a first difference threshold;
performing energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain an energy evaluation value of the target time point, and performing energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain an energy evaluation value of the reference point;
performing an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point; and
if the target time point passes the accuracy check, adding the target time point to a target accent point set as a target accent point.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or the related art more clearly, the accompanying drawings required for describing the embodiments or the related art are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1a is a schematic diagram of an audio waveform provided by an embodiment of this application;
FIG. 1b is a schematic diagram of a spectrum provided by an embodiment of this application;
FIG. 1c is a schematic structural diagram of an audio detection system provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of an audio detection method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of determining the reference points of a target time point provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of another audio detection method provided by an embodiment of this application;
FIG. 5a is a schematic diagram of generating an initial accent point set and a supplementary time point set provided by an embodiment of this application;
FIG. 5b is a schematic diagram of obtaining multiple peaks from the time points provided by an embodiment of this application;
FIG. 5c is a schematic diagram of determining a note start point according to a target time point provided by an embodiment of this application;
FIG. 5d is a schematic diagram of determining a note start point according to a target time point provided by an embodiment of this application;
FIG. 6 is a schematic flowchart of an audio detection solution provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of an audio detection apparatus provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are merely some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
Audio data is digitized sound data, which may come from the audio track of a video file or from a pure audio file. The process of digitizing sound is essentially the process of performing analog-to-digital conversion, at a certain frequency, on the continuous analog audio signal from a terminal device to obtain audio data. Specifically, audio data may include multiple time points (also called music points) and the audio amplitude value of each time point; to a certain extent, the time points and their audio amplitude values can be used to draw an audio waveform that visually represents the audio data. For example, in the audio waveform shown in FIG. 1a, the audio amplitude values at time points A, B, C, D, and E of the audio data can be read off directly. Besides the audio amplitude value, each time point also has sound attributes such as sound frequency, energy, volume, and timbre. The sound frequency refers to the number of complete vibrations an object performs at a single time point, and the sound frequencies of the time points can form a spectrum as shown in FIG. 1b. Volume, also called sound intensity or loudness, refers to the human ear's subjective perception of how loud a sound is. Timbre, also called tone quality, reflects the characteristics of the sound produced based on the audio amplitude value at each time point.
To better extract accent points from audio data, an embodiment of this application provides an audio detection solution. The solution may be executed by a computer device, which may be a terminal device (hereinafter referred to as a terminal) or a server. When the computer device is a server, an embodiment of this application further provides an audio detection system as shown in FIG. 1c, which may include at least one terminal 101 and a server 102, i.e., the computer device. In the system, the terminal 101 and the server 102 may be connected directly or indirectly in a wired or wireless manner, which is not limited here. It should be noted that the terminal mentioned above may be a smartphone, a tablet computer, a laptop, a desktop computer, or the like; the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
In a specific implementation, the general principle of the above audio detection solution is as follows. When accent points need to be extracted from audio data of any type (such as ballad or rock), the computer device may extract multiple initial accent points from the audio data; these initial accent points may include time points where the energy, volume, or timbre is locally maximal, and/or time points where the energy, volume, or timbre changes abruptly. For any initial accent point, a comprehensive analysis may be performed on its audio amplitude value together with the audio amplitude values of the time points adjacent to it in the audio data, and the initial accent point is then checked for accuracy according to the analysis result; after the check passes, the initial accent point is taken as a target accent point of the audio data. In an embodiment, various external factors may cause the extracted initial accent points to be incomplete, omitting other time points in the audio data that may also be accent points. Therefore, the computer device may additionally extract some new supplementary points from the audio data (i.e., time points other than the initial accent points), apply the comprehensive analysis described for any initial accent point to each new supplementary point, and, after determining from the analysis result that the supplementary point passes the accuracy check, take the new supplementary point as a target accent point of the audio data.
As can be seen from the above description, this audio detection solution can adaptively recognize different types of audio data. By identifying initial accent points such as time points with locally maximal energy, volume, or timbre, or abrupt-change points, and further using the correlation between adjacent time points and the initial accent points to check the accuracy of the initial accent points, the extraction accuracy of accent points can be effectively improved, yielding a target accent point set that is accurate down to the frame level (i.e., the time-point level). Moreover, supplementary sampling of the audio data and accuracy checking of the new supplementary points can improve the completeness of the target accent point set.
Based on the audio detection solution provided above, an embodiment of this application provides an audio detection method, which may be executed by the computer device mentioned above. Referring to FIG. 2, the audio detection method may include the following steps S201-S204:
S201: Acquire a target time point and a reference point of the target time point from target audio data.
The target audio data may be audio data of any type, for example ballad, rock, or classical music. The target audio data may include multiple time points and the audio amplitude value of each time point, and the target time point may be obtained in either of the following implementations:
In one specific implementation, the computer device may extract an initial accent point set from the target audio data according to a point extraction algorithm (such as the librosa.beat algorithm) in the open-source tool librosa (an audio processing tool). The principle of the point extraction algorithm is: according to the main beat of the target audio data, extract as initial accent points the time points where the energy, volume, or timbre is locally large, and/or the time points where the energy, volume, or timbre changes abruptly. The main beat refers to the dominant beat of the audio data; a beat is the basic unit of audio data in time and refers to the pattern combining strong and weak beats; beats make identical time segments of varying strength in the audio data repeat cyclically in a certain order. After the initial accent point set is obtained, the audio detection method proposed in this embodiment may be used to check the accuracy of each initial accent point in the set in turn. In this specific implementation, step S201 may include: arbitrarily selecting one initial accent point from the initial accent point set as the target time point; that is, the target time point in this implementation is any initial accent point in the initial accent point set.
In another specific implementation, since the principle of the aforementioned point extraction algorithm is to extract accent points by considering the main beat, a small number of accent points in the target audio data that deviate from the main beat may be missed by the algorithm. For example, the beats in the opening/ending region of the target audio data may not conform to the main beat; the accent points in that region can then be regarded as accent points deviating from the main beat, and the point extraction algorithm usually will not extract them. Therefore, to improve the accuracy and completeness of the accent points, the computer device may also perform extended sampling outward in the target audio data based on the initial accent point set to obtain a supplementary time point set, and use the audio detection method proposed in this embodiment to check the accuracy of each supplementary point in turn. In this specific implementation, step S201 may include: arbitrarily selecting one supplementary point from the supplementary time point set as the target time point; that is, the target time point in this implementation is any supplementary point in the supplementary time point set.
Research shows that if the target time point is a relatively accurate accent point, then among the target time point and its adjacent time points there must exist a time point whose energy, volume, etc. is locally large, or a time point where the energy, volume, etc. changes abruptly. Based on this, the computer device may also acquire the time points within a certain time range around the target time point as reference points of the target time point, so that the target time point can later be checked for accuracy in combination with the audio amplitude values of the reference points. The upper limit of this time range may equal the target time point plus a first difference threshold, and the lower limit may equal the target time point minus the first difference threshold; that is, a reference point is a time point whose time difference from the target time point is less than the first difference threshold. The first difference threshold may be set according to experience or business requirements.
For example, if the first difference threshold is 10 ms, the time range is 10 ms before and after the target time point. As shown in FIG. 3, assume that point D is the target time point. The computer device may compute the differences between the target time point and other time points in the target audio data, such as time points 1, 2, 3, and 4. The computation gives: the time difference D1 between time point 1 and target time point D is 20 ms, D2 between time point 2 and D is 5 ms, D3 between time point 3 and D is 5 ms, and D4 between time point 4 and D is 20 ms. It can then be judged in turn whether D1, D2, D3, and D4 are less than 10 ms; since only D2 and D3 are less than the first difference threshold, time points 2 and 3 are taken as the reference points of the target time point. Note that the four time points 1 to 4 here are merely an example; in actual computation, the computer device may compute the differences between the target time point and all other time points in the target audio data, and take every time point whose difference is less than the first difference threshold as a reference point, i.e., the reference points include the time points within 10 ms before and after the target time point.
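The selection rule above can be sketched in a few lines (an illustrative sketch only; the function name, millisecond timestamps, and sample values are assumptions for the example, not part of this filing):

```python
def get_reference_points(time_points, target, first_diff_threshold):
    """Return the time points whose absolute time difference from the
    target time point is strictly less than the threshold; the target
    time point itself is excluded."""
    return [t for t in time_points
            if t != target and abs(t - target) < first_diff_threshold]

# Timestamps in milliseconds; 100 plays the role of target time point D.
points = [80, 95, 100, 105, 120]   # time points 1, 2, D, 3, 4
refs = get_reference_points(points, 100, 10)
print(refs)  # [95, 105]
```

Only the two points 5 ms away survive the 10 ms threshold, matching the FIG. 3 example.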
S202: Perform energy evaluation processing on the target time point according to the audio amplitude value of the target time point to obtain the energy evaluation value of the target time point; and perform energy evaluation processing on the reference point according to the audio amplitude value of the reference point to obtain the energy evaluation value of the reference point.
In one specific implementation, the computer device may use an audio energy function in the frequency domain to compute the audio energy values of the target time point and the reference point.
In another specific implementation, the computer device may use an audio energy function in the time domain to compute the audio energy values of the target time point and the reference point. Compared with the frequency-domain audio energy function, the time-domain audio energy function is faster to compute and has a higher time resolution. In this embodiment, after testing both, the time-domain audio energy function was found to detect the target time point better. Here, the time domain means analyzing the time-related part of a function or signal, and the frequency domain means analyzing the frequency-related part.
In a specific implementation, the computer device may first determine the audio energy value of the target time point according to its audio amplitude value and the audio energy function, and determine the audio energy change value of the target time point according to its audio energy value and the audio energy change function; the computer device then performs a weighted summation of the audio energy value and the audio energy change value of the target time point to determine its energy evaluation value, as shown in Formula 1.1:
F = c0·E + c1·δ        Formula 1.1
where E denotes the audio energy value of the target time point, δ denotes its audio energy change value, and F denotes its energy evaluation value; c0 and c1 are two constants used to control the weights (proportions) of the audio energy value and the audio energy change value. The values of c0 and c1 may be set empirically, as long as c0 + c1 = 1; for example, c0 may be 0.1 and c1 may be 0.9.
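Formula 1.1 can be sketched directly (illustrative only; the function name and the sample values for E and δ are assumptions for the example):

```python
def energy_evaluation(E, delta, c0=0.1, c1=0.9):
    """Formula 1.1: weighted sum of the audio energy value E and the
    audio energy change value delta; the weights must satisfy c0 + c1 = 1."""
    assert abs(c0 + c1 - 1.0) < 1e-9
    return c0 * E + c1 * delta

# With the example weights c0 = 0.1, c1 = 0.9:
F = energy_evaluation(0.5, 0.8)  # 0.1*0.5 + 0.9*0.8
```

Because c1 is much larger than c0 in the example, the evaluation value is dominated by the energy *change*, which matches the emphasis on abrupt-change points in the description.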
Note that the energy evaluation value of the reference point may be computed in the same way as that of the target time point, which is not repeated here.
S203: Perform an accuracy check on the target time point according to the energy evaluation value of the target time point and the energy evaluation value of the reference point.
As described above, the evaluation may involve two quantities: the maximum energy evaluation value and the mean value. Since an accent point is usually a time point with large energy or an abrupt energy change, the energy evaluation values of the target time point and the reference points can be used to detect whether a point with an energy change or abrupt energy change exists near the target time point. If such a point exists, the target time point can be regarded as a relatively accurate accent point, and step S204 adds it to the target accent point set as a target accent point.
In one implementation, the computer device may determine the maximum energy evaluation value from the energy evaluation values of the target time point and the reference points, and judge whether it is greater than an evaluation energy threshold. If the maximum energy evaluation value is greater than the evaluation energy threshold, this indicates that a time point with large energy exists near the target time point, and the target time point is determined to pass the accuracy check; if the maximum energy evaluation value is less than or equal to the evaluation energy threshold, no time point with large energy exists nearby, and the target time point is determined to fail the accuracy check. The evaluation energy threshold may be set empirically.
In another implementation, the computer device may average the energy evaluation values of the target time point and the reference points, and judge whether the mean is greater than a mean evaluation threshold. If the mean is greater than the mean evaluation threshold, the energies of the time points near the target time point are all relatively high, indicating that a time point with large energy exists, and the target time point is determined to pass the accuracy check; if the mean is less than or equal to the mean evaluation threshold, the nearby energies are all relatively low, indicating that no time point with large energy exists, and the target time point is determined to fail the accuracy check. The mean evaluation threshold may be set empirically.
In yet another implementation, to accurately determine whether a time point with large energy or an abrupt energy change exists near the target time point, the maximum energy evaluation value and the mean may be evaluated jointly. Based on this, the computer device may determine the maximum energy evaluation value and the mean from the energy evaluation values of the target time point and the reference points, and then check the accuracy of the target time point according to both.
S204: If the target time point passes the accuracy check, add the target time point to the target accent point set as a target accent point.
In one implementation, if the target time point passes the accuracy check, the computer device may directly add it to the target accent point set as a target accent point. In another implementation, to improve the accuracy of screening target time points, this embodiment may further perform a secondary screening: the computer device screens the target time point according to its local maximum amplitude value, and adds it to the target accent point set as a target accent point only if the local maximum amplitude value is greater than a first amplitude threshold.
In this embodiment, the computer device acquires a target time point and its reference points from the target audio data, performs energy evaluation processing on the target time point according to its audio amplitude value to obtain the energy evaluation value of the target time point, and performs energy evaluation processing on the reference points according to their audio amplitude values to obtain the energy evaluation values of the reference points. It then checks the accuracy of the target time point according to these energy evaluation values, and adds the target time point to the target accent point set as a target accent point if the check passes. In this audio detection process, using the correlation between the adjacent reference points and the target time point to check the accuracy of the target time point can effectively improve the extraction accuracy of accent points, yielding a target accent point set accurate down to the frame level (i.e., the time-point level).
Referring to FIG. 4, FIG. 4 is a schematic flowchart of another audio detection method provided by an embodiment of this application. The audio detection method described in this embodiment may be executed by a computer device and may include the following steps S401-S406:
S401: Acquire a target time point and a reference point of the target time point from target audio data.
In a specific implementation, the computer device may first acquire the target audio data. Specifically, the computer device may acquire original audio data from a video or from other data sources such as a network or local storage, where each time point in the original audio data has a corresponding sound frequency. The original audio data is then preprocessed to obtain the target audio data, where the preprocessing may include at least one of the following items (1)-(3):
(1) Filtering the original audio data by using a target frequency range. In a specific implementation, the target frequency range may be set empirically, for example to 10 Hz-5000 Hz. Using a target frequency range, the computer device can effectively filter out low-frequency audio and noise inaudible to the human ear, as well as high-frequency components such as breathing and friction sounds present in some audio data. Only the time points within the target frequency range that are useful for obtaining accent points are kept, avoiding noise interference and producing relatively clean target audio data, thereby reducing the difficulty of subsequently identifying accent points in the target audio data.
(2) Performing volume unification processing on the original audio data. In a specific implementation, since the volumes of the acquired original audio data are inconsistent, the computer device may perform unification according to the maximum and minimum of the sound waveform in the original audio data; unification means keeping the volume of the audio data uniformly between the maximum and minimum. For example, the volume of the audio data may be normalized to between -1 and 1, to reduce the difficulty of subsequently screening accent points in the target audio data.
(3) First filtering the original audio data by using the target frequency range, and then performing volume unification processing on the filtered audio data, reducing the difficulty of subsequently identifying and screening accent points in the target audio data.
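The volume-unification step in (2) can be sketched as peak normalization (a minimal illustration; the function name and sample values are assumptions for the example, and the band-pass filtering in (1) would in practice use a dedicated filter design such as `scipy.signal.butter`, which is an assumption about tooling, not part of this filing):

```python
def unify_volume(samples):
    """Scale an amplitude sequence so its values lie in [-1, 1],
    preserving the waveform shape (peak normalization)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent input stays unchanged
    return [s / peak for s in samples]

print(unify_volume([0.5, -2.0, 1.0]))  # [0.25, -1.0, 0.5]
```

Dividing by the largest absolute amplitude keeps the waveform's shape while bounding it to the [-1, 1] range the description mentions.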
After the target audio data is obtained, the target time point and its reference points can be acquired from it. As described above, the target time point may be any initial accent point in the initial accent point set, or any supplementary point in the supplementary time point set, where the multiple initial accent points in the initial accent point set are obtained by performing point extraction on the target audio data with a point extraction algorithm. As mentioned in the embodiment shown in FIG. 2, the supplementary time point set is obtained by extended sampling outward in the target audio data based on the initial accent point set. Specifically, the multiple time points in the target audio data are arranged in chronological order, and the supplementary time point set is obtained as follows:
The computer device determines a starting accent point and an ending accent point from the initial accent point set, where the starting accent point is the earliest accent point in the initial accent point set and the ending accent point is the latest; it then determines the starting arrangement position of the starting accent point in the target audio data and the ending arrangement position of the ending accent point, as shown in FIG. 5a. Further, the computer device performs extended sampling, at the sampling frequency, over the time points in the target audio data located before the starting arrangement position, and over the time points located after the ending arrangement position (the extension directions are shown in FIG. 5a), and adds the time points obtained by the extended sampling to the supplementary time point set as supplementary points. For example, sampling at a 10 ms interval in FIG. 5a yields the four sampling points shown there, and the time points corresponding to these four sampling points are added to the supplementary time point set as supplementary points.
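The outward extension can be sketched as follows (illustrative only; the function name, millisecond units, and the sample accent positions are assumptions for the example):

```python
def supplementary_points(first_accent_ms, last_accent_ms, duration_ms,
                         sample_interval_ms):
    """Extend sampling outward from the earliest/latest initial accent
    points: backwards from the first accent towards time 0, and forwards
    from the last accent towards the end of the audio."""
    before = list(range(first_accent_ms - sample_interval_ms, -1,
                        -sample_interval_ms))
    after = list(range(last_accent_ms + sample_interval_ms,
                       duration_ms + 1, sample_interval_ms))
    return before + after

# Earliest accent at 25 ms, latest at 70 ms, audio 100 ms long, 10 ms step.
print(supplementary_points(25, 70, 100, 10))  # [15, 5, 80, 90, 100]
```

The regions before the starting arrangement position and after the ending arrangement position are exactly the ones the beat-based point extraction algorithm tends to miss.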
S402: Perform energy evaluation processing on the target time point according to its audio amplitude value to obtain the energy evaluation value of the target time point; and perform energy evaluation processing on the reference point according to its audio amplitude value to obtain the energy evaluation value of the reference point.
In a specific implementation, the energy evaluation value of the reference point is computed similarly to that of the target time point; for ease of description, the target time point is used as the example below. Specifically, performing energy evaluation processing on the target time point according to its audio amplitude value to obtain its energy evaluation value may include the following steps s11-s15:
s11: Acquire multiple associated points of the target time point from the multiple time points.
An associated point is a time point whose time difference from the target time point is less than a second difference threshold, which may be set empirically. For example, the second difference threshold may be set to ⌊k/2⌋, where ⌊·⌋ denotes rounding k/2 down to an integer and k may be set empirically. For example, if k equals 2000 ms, rounding down k/2 (i.e., 1000 ms) gives ⌊k/2⌋ = 1000 ms; if k equals 2001 ms, rounding down k/2 (i.e., 1000.5 ms) gives ⌊k/2⌋ = 1000 ms. When ⌊k/2⌋ is 1000 ms, the associated points include the time points within 1000 ms before and after the target time point.
s12: Use the audio energy function to compute the audio energy value of the target time point according to the audio amplitude values of the associated points and the audio amplitude value of the target time point.
In a specific implementation, the multiple time points are arranged in chronological order, so the target audio data can be represented as a one-dimensional array y = [y1, y2, …, yn], where yx denotes the audio amplitude value of the x-th time point in the target audio data, x ∈ [1, n]. The audio energy function may be as shown in Formula 1.2:
Ei = (1/k′) · Σ_{j=i−⌊k/2⌋}^{i+⌊k/2⌋} (yj)²        Formula 1.2
where k′ denotes the number of points in the window around the target time point, determined by the value of k: when k is odd, k′ equals k; when k is even, k′ equals k+1. j denotes the summation index, and the value of i equals the arrangement index of the target time point in the target audio data. Note that when j ≤ 0, yj is taken as 0.
Note that this embodiment uses the target time point as an example; the audio energy values of other time points (including the aforementioned reference points) are computed in the same way as for the target time point. After the audio energy values of all time points are computed, since the audio energy function can be regarded as a discrete function, the audio energy values of all time points can form an array E = [E1, E2, …, En].
Based on this, step s12 may be implemented as follows: square the audio amplitude value of the target time point to obtain the initial energy value of the target time point, and square the audio amplitude value of each associated point to obtain the initial energy value of each associated point; then average the initial energy values of the target time point and of the associated points to obtain the audio energy value of the target time point. Specifically, the computer device averages the initial energy values of the target time point and of the associated points to obtain an intermediate energy value, and then either takes the intermediate energy value directly as the audio energy value of the target time point, or denoises the intermediate energy value to obtain the audio energy value of the target time point.
The denoising may be implemented as follows: the computer device may use the intermediate energy values of all time points to form a curve of intermediate energy value over time, and then smooth the curve using Gaussian filtering or box filtering to adjust the intermediate energy value of the target time point, obtaining its audio energy value. Denoising removes noise interference and yields a relatively clean audio energy value for the target time point.
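Formula 1.2 (before the optional smoothing step) can be sketched as a windowed mean of squared amplitudes (illustrative only; 0-based indexing is used and out-of-range samples are treated as 0, mirroring the yj = 0 convention for j ≤ 0; the sample array and k value are assumptions for the example):

```python
def audio_energy(y, i, k):
    """Formula 1.2: mean of squared amplitudes over a window of
    k' = 2*floor(k/2) + 1 points centred on index i (k' = k for odd k,
    k + 1 for even k). Samples outside the array contribute 0."""
    half = k // 2            # floor(k / 2)
    kp = 2 * half + 1        # window length k'
    total = 0.0
    for j in range(i - half, i + half + 1):
        if 0 <= j < len(y):
            total += y[j] ** 2
    return total / kp

y = [0.0, 1.0, 2.0, 1.0, 0.0]
print(audio_energy(y, 2, 2))  # (1 + 4 + 1) / 3 = 2.0
```

In practice the resulting energy curve would then be smoothed (e.g. with a Gaussian or box filter) as the description notes.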
s13: Acquire the precursor points of the target time point from the multiple time points.
The precursor points include: c time points selected in sequence moving forward from the arrangement position of the target time point among the multiple time points, where c is a positive integer. Here c is a tunable parameter; for example, c may equal 15. With c = 15, the interference of local outliers can be alleviated, and the audio energy change value better reflects how abruptly a volume peak changes over a local period of time. It can be understood that setting the value of c empirically changes the precursor points obtained, and c also controls the number of backward differences to be summed.
s14: Use the audio energy change function to compute the audio energy change value of the target time point according to the audio energy value of the target time point and the audio energy values of the precursor points.
In a specific implementation, the audio energy change function may be as shown in Formula 1.3:
δ′i = max(s, Σ_{j=1}^{c} (Ei − E(i−j)))        Formula 1.3
where δ′i denotes the initial energy change value, Ei denotes the audio energy value, j denotes the summation index, s denotes the reference value (for example, 0), and c is a tunable parameter that controls the number of backward differences to be summed and the number of precursor points. For example, when c = 1, the function computes the first-order difference of the energy function. If the target time point is the i-th point, its precursor points may include the (i−1)-th, (i−2)-th, …, (i−c)-th points, and E(i−j) denotes the audio energy value of the (i−j)-th time point.
Based on this, step s14 may be implemented as follows: the computer device computes the sum of the audio energy values of the precursor points and acquires a reference value (for example, the reference value may be 0); it then computes the difference between c times the audio energy value of the target time point and this sum, and takes the maximum of the reference value and the computed difference as the initial energy change value of the target time point; finally, it determines the audio energy change value of the target time point according to the initial energy change value of the target time point.
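Formula 1.3 can be sketched as follows (illustrative only; 0-based indexing, a reference value of 0, and the sample energy array are assumptions for the example; energies before the start of the array are treated as 0):

```python
def initial_energy_change(E, i, c, reference=0.0):
    """Formula 1.3: delta'_i = max(reference, c*E_i - sum of the c
    preceding audio energy values), i.e. the clipped sum of the c
    backward differences E_i - E_{i-j}."""
    prev_sum = sum(E[i - j] if i - j >= 0 else 0.0
                   for j in range(1, c + 1))
    return max(reference, c * E[i] - prev_sum)

E = [1.0, 1.0, 4.0, 1.0]
print(initial_energy_change(E, 2, 2))  # max(0, 2*4 - (1+1)) = 6.0
print(initial_energy_change(E, 3, 2))  # max(0, 2*1 - (4+1)) = 0.0
```

A rising energy (an onset) yields a large positive value, while a falling energy is clipped to the reference value, so only attacks, not decays, register as change.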
在一种实现方式中，计算机设备可将目标时间点位的初始能量变化值直接作为目标时间点位的音频能量变化值；在另一种实现方式中，由于目标时间点位的初始能量变化值范围很大，因此需要对目标时间点位的初始能量变化值进行归一化处理。本申请实施例中定义了一种归一化方法pk_normalize，该归一化方法是指利用目标音频数据中各个时间点位的初始能量变化值中最大的n个峰值的均值，对目标时间点位的初始能量变化值进行归一化操作。相比于简单的0-1归一化，这样的归一化可以避免一些异常大的音频能量变化值的影响；同时，只选取最大的n个峰值的策略也可避免许多音频能量变化值微小的噪音峰值点导致筛选错误。在具体实现中，计算机设备可获取目标音频数据中的各个时间点位的初始能量变化值，并从各个时间点位的初始能量变化值中确定出多个峰值。峰值是指目标音频数据中的峰值时间点位的初始能量变化值。峰值时间点位满足如下条件：峰值时间点位的初始能量变化值大于位于峰值时间点位的左右两侧且与峰值时间点位相邻的两个时间点位的初始能量变化值。示例性的，在图5b中从各个时间点位的初始能量变化值中可以确定出4个峰值，分别为峰值1、峰值2、峰值3和峰值4。计算机设备采用多个峰值的均值对目标时间点位的初始能量变化值进行归一化处理，得到目标时间点位的音频能量变化值。
其中，计算机设备采用多个峰值的均值对目标时间点位的初始能量变化值进行归一化处理，得到目标时间点位的音频能量变化值包括以下两种情况：(1)计算机设备直接根据多个峰值计算均值，然后采用得到的均值对目标时间点位的初始能量变化值进行归一化处理。(2)计算机设备可将多个峰值进行排序，然后从排序完成的多个峰值中从大到小获取n个峰值，并计算这n个峰值的均值；计算机设备根据计算得到的均值对目标时间点位的初始能量变化值进行归一化处理。其中，n的值可根据经验设定，例如该n的值可设置为峰值个数的1/3。示例性的，设n的取值为3，在图5b中，计算机设备将获取到的4个峰值进行从大到小排序，即这4个峰值的顺序为峰值1、峰值3、峰值2、峰值4。计算机设备可从大到小获取3个峰值，分别为峰值1、峰值3、峰值2。
在一种实现方式中，采用多个峰值的均值，对目标时间点位的初始能量变化值进行归一化处理，得到目标时间点位的音频能量变化值的具体实现方式为：计算机设备获取各个时间点位的音频能量值，并从各个时间点位的音频能量值中确定出最小音频能量值，采用多个峰值的均值和最小音频能量值，对目标时间点位的初始能量变化值进行收缩处理，得到目标时间点位的音频能量变化值。其中，最小音频能量值可用min(E)表示，该多个峰值的均值可用mean(topn(peak(δ′)))表示；peak(δ′)表示确定目标音频数据中所有初始能量变化值的峰值（对应上述多个峰值），topn(peak(δ′))表示从所有峰值中从大到小选取n个峰值。其中，采用多个峰值的均值mean(topn(peak(δ′)))和最小音频能量值min(E)对目标时间点位的初始能量变化值进行收缩处理，得到目标时间点位的音频能量变化值δ的具体计算过程可参见式1.4：
δ_i = δ′_i / (a·(mean(topn(peak(δ′))) − min(E)))                式1.4
在式1.4中,a为一个可调参数,可以微调和控制最终目标时间点位的音频能量变化值。该a的取值可根据经验设定,例如,a可取1.5。
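pk_normalize归一化的过程可用如下Python片段示意。其中的收缩公式按式1.4的一种理解实现，a、n等参数取值以及峰值判定方式均为基于上文的示例性假设：

```python
def pk_normalize(deltas, energies, n_top, a=1.5):
    """示意性的 pk_normalize：
    1) 找出初始能量变化值序列中的峰值（严格大于左右相邻值的点位）；
    2) 取最大的 n_top 个峰值的均值；
    3) 结合最小音频能量值，对每个初始能量变化值做收缩归一化。"""
    peaks = [deltas[i] for i in range(1, len(deltas) - 1)
             if deltas[i] > deltas[i - 1] and deltas[i] > deltas[i + 1]]
    top = sorted(peaks, reverse=True)[:n_top]          # 最大的 n_top 个峰值
    scale = a * (sum(top) / len(top)) - min(energies)  # 收缩系数
    return [d / scale for d in deltas]

# 示例：峰值为2、4、6，取最大的2个峰值（6和4），均值为5
norm = pk_normalize([0.0, 2.0, 0.0, 4.0, 0.0, 6.0, 0.0], [0.0] * 7, 2)
```

与简单的0-1归一化相比，以"最大n个峰值的均值"为尺度可避免单个异常大的峰值把其余变化值压缩到接近0。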
s15,对音频能量值和音频能量变化值进行加权求和,得到目标时间点位的能量评估值。
S403,计算参考点位的能量评估值和目标时间点位的能量评估值的能量均值。
S404,从目标时间点位的能量评估值和参考点位的能量评估值中确定出最大能量评估值。
S405，若最大能量评估值与能量均值之间的差值大于阈值，则确定目标时间点位通过准确性校验；否则，确定目标时间点位未通过准确性校验。
其中，可设置阈值作为判断目标时间点位是否通过准确性校验的条件。该阈值也可理解为筛选目标时间点位的条件。在具体实现中，计算机设备可先计算最大能量评估值与能量均值之间的差值，并判断该差值是否大于阈值：若最大能量评估值与能量均值之间的差值大于阈值，则确定目标时间点位通过准确性校验，即可理解为该目标时间点位是能量变化较大的时间点位；若最大能量评估值与能量均值之间的差值小于或等于阈值，则确定目标时间点位未通过准确性校验，即可理解为目标时间点位是能量变化较小的时间点位。
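步骤S403至S405的校验逻辑可用如下Python片段示意（其中能量评估值与阈值的取值均为示例）：

```python
def passes_check(target_eval, ref_evals, threshold):
    """若最大能量评估值与（目标点位和各参考点位）能量评估值的
    均值之差大于阈值，则目标时间点位通过准确性校验。"""
    evals = ref_evals + [target_eval]
    mean_val = sum(evals) / len(evals)
    return max(evals) - mean_val > threshold

# 均值为0.4，最大值为0.9，差值0.5 > 阈值0.3，校验通过
ok = passes_check(0.9, [0.1, 0.2], 0.3)
```

当目标点位与其邻近参考点位的能量评估值接近（局部能量变化小）时，最大值与均值之差很小，校验不通过。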
S406,若目标时间点位通过准确性校验,则将目标时间点位作为目标重音点位添加到目标重音点位集合中。
在具体实现中，经过步骤S405对目标时间点位进行校验后，计算机设备可将校验通过的目标时间点位作为目标重音点位添加到目标重音点位集合中。该目标重音点位集合可用R_0表示。目标重音点位集合中的所有重音点位均满足式1.5：
R_0 = {i : F_max[i] > F_mean[i] + s_0, i ∈ {beat}}               式1.5
其中，F_max[i]为最大能量评估值，F_mean[i]为能量均值，i∈{beat}表示目标时间点位，s_0为筛选阈值，可根据经验设置。在一种实现方式中，若该目标时间点位为初始重音点位集合中的任一初始重音点位，则该筛选阈值可设置为较小的数值，例如0.1。在另一种实现方式中，若该目标时间点位为补充时间点位集合中的任一补充点位，为了避免对目标时间点位的误检，可适当提高该筛选阈值，例如该筛选阈值可设置为0.3。
在一种实现方式中,若目标时间点位通过准确性校验,计算机设备还可根据目标音频数据中的局部最大振幅值来判断该目标时间点位是否为重音点位。即计算机设备可根据目标时间点位的局部最大振幅值进一步对目标时间点位进行筛选,从而增加对重音点位筛选的准确度。在具体实现中,计算机从各个关联点位的音频振幅值的绝对值和目标时间点位的音频振幅值的绝对值中,选取最大绝对值作为目标时间点位的局部最大振幅值。其中,该目标时间点位的局部最大振幅值可采用波形局部最大振幅函数来计算,计算公式可参见式1.6:
A[i] = max_{i−⌊k/2⌋ ≤ j ≤ i+⌊k/2⌋} abs(y_j)                式1.6
其中，式1.6中abs(·)表示对变量求绝对值；i表示当前目标时间点位；j表示max运算的迭代变量，对应目标时间点位及其各个关联点位。其中关联点位是指与目标时间点位之间的时间差小于第二差值阈值的时间点位，第二差值阈值可根据经验设置。
在确定出目标时间点位的局部最大振幅值后，计算机设备可判断目标时间点位的局部最大振幅值是否大于第一振幅阈值，若目标时间点位的局部最大振幅值大于第一振幅阈值，则将目标时间点位作为目标重音点位添加到目标重音点位集合中。其中，第一振幅阈值可根据经验设置，可用s_1表示。在一种实现方式中，若该目标时间点位为初始重音点位集合中的任一初始重音点位，则该第一振幅阈值可设置为较小的数值，例如0.1。在另一种实现方式中，若该目标时间点位为补充时间点位集合中的任一补充点位，为了避免对目标时间点位的误检，可适当提高第一振幅阈值。示例性的，在上述确定出集合R_0之后，可根据集合R_0中的重音点位的局部最大振幅值对集合R_0中的重音点位进行二次筛选，得到最新的目标重音点位集合R_1，最新的目标重音点位集合中的所有重音点位均满足式1.7：
R_1 = {i : A[i] > s_1, i ∈ R_0}                           式1.7
其中，A[i]表示第i个时间点位的局部最大振幅值，s_1为第一振幅阈值。
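式1.6与式1.7所述的二次筛选可用如下Python片段示意。其中窗口半径half对应⌊k/2⌋，阈值s1与振幅序列均为示例性取值：

```python
def local_max_amplitude(y, i, half):
    """式1.6的思路：取目标点位 i 及其前后各 half 个关联点位的
    振幅绝对值中的最大值，作为局部最大振幅值（边界处自动截断）。"""
    lo, hi = max(0, i - half), min(len(y) - 1, i + half)
    return max(abs(y[j]) for j in range(lo, hi + 1))

def second_filter(points, y, half, s1):
    """式1.7的思路：仅保留局部最大振幅值大于阈值 s1 的重音点位。"""
    return [i for i in points if local_max_amplitude(y, i, half) > s1]

# 示例：点位1附近振幅较大被保留，点位4附近振幅微弱被过滤
y = [0.05, -0.5, 0.2, 0.05, 0.02]
R1 = second_filter([1, 4], y, 1, 0.1)
```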
在实际中，音频数据中存在少量偏离正常节拍的重音点位，因此本申请实施例还可对重音点位进行补充。在一种实现方式中，可对音符起始点进行筛选，以补充目标重音点位集合中的重音点位。计算机设备可按照音符起始点检测算法（如librosa.onset算法）从目标音频数据中提取至少一个音符的音符起始点，其中，一个音符是根据至少两个时间点位及至少两个时间点位对应的音频振幅值确定的，音符起始点位是指：一个音符对应的至少两个时间点位中时间最早的时间点位。进一步地，计算机设备获取音符起始点的能量评估值和音符起始点的局部最大振幅值，并判断音符起始点的能量评估值和音符起始点的局部最大振幅值是否满足重音条件。若音符起始点的能量评估值和局部最大振幅值满足重音条件，则将音符起始点作为目标重音点位添加到目标重音点位集合中；其中，重音条件包括以下至少一种：音符起始点的能量评估值大于能量评估阈值，以及音符起始点的局部最大振幅值大于第二振幅阈值。
在实施例中,由于目标重音点位集合中的目标重音点位可能处于能量变化的峰值处,这样使得当人感知到目标重音点位的时候,可能该目标重音点位就快要消失了,因此这样的目标重音点位还不够理想。基于此,计算机设备还可对目标重音点位集合中的目标重音点位进一步进行优化。针对目标重音点位集合中的任一目标重音点位,计算机设备获取任一目标重音点位所属的目标音符的音符起始点,并在目标重音点位集合中,采用目标音符的音符起始点替换任一目标重音点位。可以理解的是,该音符起始点位也可看作是一个重音点位。在具体实现中,计算机设备获取目标音频数据的音符起始点强度评估曲线,该音符起始点强度评估曲线包括按时间先后顺序依次排列的多个时间点位和每个时间点位的音符强度值。然后将任一目标重音点位映射到音符起始点强度评估曲线上,得到任一目标重音点位在音符起始点强度评估曲线上的目标位置;在音符起始点强度评估曲线上,基于目标位置并沿时间变小的方向依次遍历至少一个音符强度值;若当前遍历的当前音符强度值满足音符强度条件,则停止遍历,并将当前音符强度值所对应的当前时间点位作为任一目标重音点位所属的目标音符的音符起始点;其中,音符强度条件包括:位于当前时间点位之前且与当前时间点位相邻的时间点位的音符强度值大于或等于当前音符强度值,且位于当前时间点位之后且与当前时间点位相邻的时间点位的音符强度值大于当前音符强度值。
在一种实现方式中，示例性的，音符起始点强度评估曲线如图5c所示，计算机设备将某个目标重音点位映射到音符起始点强度评估曲线，得到该目标重音点位在音符起始点强度评估曲线上的目标位置A1。计算机设备基于A1并沿时间变小的方向（图5c中的箭头所指的方向）依次遍历至少一个音符强度值：当遍历到音符强度值为0（对应的时间点位为A2）时，由于该音符强度值大于音符强度值y2，继续遍历下一音符强度值y2（对应的时间点位为A3）；此时该音符强度值y2小于音符强度值0，且也小于音符强度值y3（对应的时间点位为A4），则停止遍历，并将音符强度值y2所对应的时间点位A3作为该目标重音点位所属的目标音符的音符起始点。
在另一种实现方式中，示例性的，音符起始点强度评估曲线如图5d所示，计算机设备将目标重音点位映射到音符起始点强度评估曲线，得到目标重音点位在音符起始点强度评估曲线上的目标位置B1。计算机设备基于B1并沿时间变小的方向（图5d中的箭头所指的方向）依次遍历至少一个音符强度值，当遍历到音符强度值为0（对应的时间点位B2）时，由于该音符强度值小于B1对应的音符强度值，且位于B2之前且与B2相邻的时间点位的音符强度值等于当前音符强度值0，且位于B2之后且与B2相邻的时间点位的音符强度值大于当前音符强度值0，因此停止遍历，并将音符强度值为0对应的时间点位B2作为目标重音点位所属的目标音符的音符起始点。
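上述沿时间变小方向遍历、寻找音符起始点的过程可用如下Python片段示意。其中曲线数据为假设的示例，边界处（序列末尾取自身强度值）的处理方式为示意性约定：

```python
def find_note_onset(strengths, pos):
    """从目标位置 pos 沿时间变小的方向逐点遍历音符强度值，
    找到满足音符强度条件的第一个点位作为音符起始点：
    前一点位（时间更早）强度 >= 当前强度，
    且后一点位（时间更晚）强度 > 当前强度。"""
    i = pos
    while i > 0:
        prev_val = strengths[i - 1]   # 时间更早的相邻点位
        next_val = strengths[i + 1] if i + 1 < len(strengths) else strengths[i]
        if prev_val >= strengths[i] and next_val > strengths[i]:
            return i
        i -= 1
    return 0

# 示例：目标位置在强度峰值处（索引3），回溯到局部最小值（索引1）
onset = find_note_onset([0.3, 0.1, 0.4, 0.8, 0.6], 3)
```

"前一点位强度大于或等于当前强度"中的等号使得图5d所示的强度为0的平台段同样可以被判定为音符起始点。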
其中，计算机设备获取目标音频数据的音符起始点强度评估曲线的具体实现方式可以是：计算机设备可根据目标音频数据，利用短时傅里叶变换（STFT）将时域信号转换到频域，生成频谱图；然后对频谱图计算前后帧之间的帧间差值，并将帧间差值按频率求和，得到随时间变化的音符起始点强度评估曲线。
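上述"分帧变换—帧间差值—求和"的谱通量思路可用如下纯Python片段示意。其中直接用离散傅里叶变换代替加窗STFT、仅累加正向差值，帧长与帧移均为示例性假设，实际实现中通常调用现成的STFT库函数：

```python
import math

def onset_strength_curve(y, frame_len, hop):
    """示意实现：对信号分帧做DFT得到幅度谱，
    对相邻帧的幅度谱做差并将正向差值按频率求和（谱通量），
    得到音符起始点强度评估曲线（首帧强度记为0）。"""
    def mag_spectrum(frame):
        n = len(frame)
        return [abs(sum(frame[t] * complex(math.cos(-2 * math.pi * k * t / n),
                                           math.sin(-2 * math.pi * k * t / n))
                        for t in range(n)))
                for k in range(n // 2 + 1)]
    frames = [y[i:i + frame_len] for i in range(0, len(y) - frame_len + 1, hop)]
    specs = [mag_spectrum(f) for f in frames]
    curve = [0.0]
    for prev, cur in zip(specs, specs[1:]):
        curve.append(sum(max(0.0, c - p) for p, c in zip(prev, cur)))
    return curve

# 示例：前半段静音、后半段出现振荡信号，能量突增处强度明显增大
curve = onset_strength_curve([0.0] * 8 + [1.0, -1.0] * 4, 8, 8)
```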
在得到目标重音点位集合后,可将该目标重音点位集合中的目标重音点位转换为应用需要的格式输出。该应用可以是专门播放音乐的播放器、或者视频软件等等。
在本申请实施例中,计算机设备可从目标音频数据中获取目标时间点位以及目标时间点位的参考点位,然后计算机设备根据目标时间点位的音频振幅值对目标时间点位进行能量评估处理,得到目标时间点位的能量评估值。然后根据参考点位的音频振幅值对参考点位进行能量评估处理,得到参考点位的能量评估值。根据目标时间点位的能量评估值和参考点位的能量评估值,对目标时间点位进行准确性校验;若目标时间点位通过准确性校验,则将目标时间点位作为目标重音点位添加到目标重音点位集合中。在上述的音频检测过程中,通过利用邻近的参考点位和目标时间点位之间的关联性,对目标时间点位进行准确性校验,可有效提升重音点位的提取准确性,从而给出精确至帧级别(即时间点位级别)的目标重音点位集合。
基于上述本申请实施例提供的音频检测方法，本申请实施例还提供一种具体的音频检测方案，该音频检测方案的具体流程可参见图6，该音频检测方案的流程如下：在提取音频数据时，可先统一不同音频文件的编码格式。计算机设备先设置统一的音频文件的编码格式。然后计算机设备对视频按照设置的编码格式进行处理，然后在处理后的视频中提取音频数据，并对该音频数据进行预处理，该预处理包括对音频数据进行频率范围滤波以及对音频数据进行整体音量规范化。在对该音频数据进行预处理之后，计算机设备从预处理后的音频数据中进行点位信息提取，该点位信息提取包括目标时间点位提取以及音符起始点位提取，并根据音频能量函数、音频能量变化函数以及波形局部最大振幅函数对目标时间点位进行评估，根据评估结果来对目标时间点位进行筛选过滤，得到目标重音点位集合。进一步地，计算机设备在得到目标重音点位集合后，还可对重音点位进行补充，并将补充的重音点位作为目标重音点位添加到目标重音点位集合中，然后对目标重音点位集合中的目标重音点位进行优化处理，得到最终的目标重音点位集合，并输出该目标重音点位集合，从而可以准确地确定出目标音频数据中的重音点位。
在具体应用中,在确定出重音点位之后,可在目标音频数据中标记重音点位,后续根据标记的重音点位可为剪辑工具或内容创作者提供画面切换的时间点位,自动生成或辅助创作踩点视频,即在卡住音乐的重音节奏点去填补画面,使视频声画同步,使观众在视觉与听觉上感受到一致的节奏感,带来更为舒适的感官体验。或该标记的重音点位可作为视频二次创作或剪辑中的背景音乐点位;或者该标记的重音点位还可起到在舞台或现场匹配灯光或其他特效,推动气氛烘托的作用等等。
基于上述音频检测方法实施例的描述,本申请实施例还公开了一种音频检测装置,该音频检测装置可以是设置于上述所提及的计算机设备中的一个硬件组件,也可以是运行于上述所提及的计算机设备中的一个计算机程序(包括程序代码)。该音频检测装置可以执行图2或图4所示的方法。请参见图7,所述音频检测装置可以运行如下单元:
获取单元701，用于从目标音频数据中获取目标时间点位以及所述目标时间点位的参考点位；所述目标音频数据包括多个时间点位以及每个时间点位的音频振幅值；所述参考点位是指与所述目标时间点位之间的时间差小于第一差值阈值的时间点位；
处理单元702,用于根据所述目标时间点位的音频振幅值对所述目标时间点位进行能量评估处理,得到所述目标时间点位的能量评估值;并根据所述参考点位的音频振幅值对所述参考点位进行能量评估处理,得到所述参考点位的能量评估值;
所述处理单元702,还用于根据所述目标时间点位的能量评估值和所述参考点位的能量评估值,对所述目标时间点位进行准确性校验;
所述处理单元702,还用于若所述目标时间点位通过所述准确性校验,则将所述目标时间点位作为目标重音点位添加到目标重音点位集合中。
在一种实现方式中,所述处理单元702,具体用于:
计算所述参考点位的能量评估值和所述目标时间点位的能量评估值的能量均值;
从所述目标时间点位的能量评估值和所述参考点位的能量评估值中确定出最大能量评估值;
若所述最大能量评估值与所述能量均值之间的差值大于阈值,则确定所述目标时间点位通过所述准确性校验;否则,则确定所述目标时间点位未通过所述准确性校验。
在一种实现方式中,所述获取单元701,具体用于:从所述多个时间点位中获取所述目标时间点位的多个关联点位;
所述处理单元702,具体用于:采用音频能量函数根据各个关联点位的音频振幅值和所述目标时间点位的音频振幅值,计算所述目标时间点位的音频能量值;所述关联点位是指与所述目标时间点位之间的时间差小于第二差值阈值的时间点位;
所述获取单元701,具体用于:从所述多个时间点位中获取所述目标时间点位的前驱点位,所述前驱点位包括:基于所述目标时间点位在所述多个时间点位中的排列位置,往前依次选取的c个时间点位,c为正整数;
所述处理单元702,具体用于:采用音频能量变化函数根据所述目标时间点位的音频能量值和所述前驱点位中各个时间点位的音频能量值,计算所述目标时间点位的音频能量变化值;对所述音频能量值和所述音频能量变化值进行加权求和,得到所述目标时间点位的能量评估值。
在一种实现方式中,所述处理单元702,具体用于:
对所述目标时间点位的音频振幅值进行平方运算,得到所述目标时间点位的初始能量值;以及对各个关联点位的音频振幅值进行平方运算,得到所述各个关联点位的初始能量值;
对所述目标时间点位的初始能量值和所述各个关联点位的初始能量值进行均值运算,得到所述目标时间点位的音频能量值。
在一种实现方式中,所述处理单元702,具体用于:
对所述目标时间点位的初始能量值和所述各个关联点位的初始能量值进行均值运算,得到中间能量值;
对所述中间能量值进行去噪处理,得到所述目标时间点位的音频能量值。
在一种实现方式中,所述处理单元702,具体用于:求取所述前驱点位中各个时间点位的音频能量值之间的音频能量值总和;
所述获取单元701,用于获取基准数值;
所述处理单元702，具体用于：计算所述音频能量值总和与c倍的所述目标时间点位的音频能量值之间的差值；将所述基准数值和计算得到的差值中的最大值，作为所述目标时间点位的初始能量变化值；根据所述目标时间点位的初始能量变化值，确定所述目标时间点位的音频能量变化值。
在一种实现方式中,所述获取单元701,用于获取所述目标音频数据中的各个时间点位的初始能量变化值;
所述处理单元702,具体用于:从所述各个时间点位的初始能量变化值中确定出多个峰值,所述峰值是指所述目标音频数据中的峰值时间点位的初始能量变化值,所述峰值时间点位满足如下条件:所述峰值时间点位的初始能量变化值均大于,位于所述峰值时间点位的左右两侧且与所述峰值时间点位相邻的两个时间点位的初始能量变化值;采用所述多个峰值的均值,对所述目标时间点位的初始能量变化值进行归一化处理,得到所述目标时间点位的音频能量变化值。
在一种实现方式中,所述获取单元701,用于获取所述各个时间点位的音频能量值;
所述处理单元702,具体用于从所述各个时间点位的音频能量值中确定出最小音频能量值;采用所述多个峰值的均值和所述最小音频能量值,对所述目标时间点位的初始能量变化值进行收缩处理,得到所述目标时间点位的音频能量变化值。
在一种实现方式中,所述将所述目标时间点位作为目标重音点位添加到目标重音点位集合中之前,所述处理单元702,还用于:
从所述各个关联点位的音频振幅值的绝对值和所述目标时间点位的音频振幅值的绝对值中,选取最大绝对值作为所述目标时间点位的局部最大振幅值;
若所述目标时间点位的局部最大振幅值大于第一振幅阈值,则执行将所述目标时间点位作为目标重音点位添加到目标重音点位集合中的步骤。
在一种实现方式中,所述目标时间点位为初始重音点位集合中的任一初始重音点位,或者补充时间点位集合中的任一补充点位;其中,所述初始重音点位集合中的多个重音点位是采用点位提取算法对目标音频数据进行点位提取所提取得到的;
所述目标音频数据中的多个时间点位按照时间先后顺序依次排列,所述处理单元702,具体用于:
从所述初始重音点位集合中确定出起始重音点位和结束重音点位,所述起始重音点位是指所述初始重音点位集合中时间最早的重音点位,所述结束重音点位是指所述初始重音点位集合中时间最晚的重音点位;
确定所述起始重音点位在所述目标音频数据中的起始排列位置,以及所述结束重音点位在所述目标音频数据中的结束排列位置;
按照采样频率对所述目标音频数据中位于所述起始排列位置之前的时间点位进行延拓采点,以及按照所述采样频率对所述目标音频数据中位于所述结束排列位置之后的时间点位进行延拓采点;
将延拓采点所得到的时间点位作为补充点位,添加到所述补充时间点位集合中。
在一种实现方式中,所述处理单元702,还用于:从所述目标音频数据中提取至少一个音符的音符起始点,一个音符是根据至少两个时间点位及所述至少两个时间点位对应的音频振幅值确定的,所述音符起始点位是指:一个音符对应的至少两个时间点位中时间最早的时间点位;
所述获取单元701，还用于获取所述音符起始点的能量评估值和所述音符起始点的局部最大振幅值；
所述处理单元702,还用于:若所述音符起始点的能量评估值和局部最大振幅值满足重音条件,则将所述音符起始点作为目标重音点位添加到所述目标重音点位集合中;所述重音条件包括以下至少一种:所述音符起始点的能量评估值大于能量评估阈值,以及所述音符起始点的局部最大振幅值大于第二振幅阈值。
在一种实施例中,所述获取单元701,还用于针对所述目标重音点位集合中的任一目标重音点位,获取所述任一目标重音点位所属的目标音符的音符起始点;
所述处理单元702,还用于在所述目标重音点位集合中,采用所述目标音符的音符起始点替换所述任一目标重音点位。
在一种实施例中,所述获取单元701,具体用于获取所述目标音频数据的音符起始点强度评估曲线,所述音符起始点强度评估曲线包括按时间先后顺序依次排列的所述多个时间点位和每个时间点位的音符强度值;
所述处理单元702,具体用于:将所述任一目标重音点位映射到所述音符起始点强度评估曲线上,得到所述任一目标重音点位在所述音符起始点强度评估曲线上的目标位置;在所述音符起始点强度评估曲线上,基于所述目标位置并沿时间变小的方向依次遍历至少一个音符强度值;若当前遍历的当前音符强度值满足音符强度条件,则停止遍历,并将所述当前音符强度值所对应的当前时间点位作为所述任一目标重音点位所属的目标音符的音符起始点;
其中,所述音符强度条件包括:位于所述当前时间点位之前且与所述当前时间点位相邻的时间点位的音符强度值大于或等于所述当前音符强度值,且位于所述当前时间点位之后且与所述当前时间点位相邻的时间点位的音符强度值大于所述当前音符强度值。
在一种实现方式中,所述从目标音频数据中获取目标时间点位以及所述目标时间点位的参考点位之前,所述获取单元701,还用于获取原始音频数据,所述原始音频数据中的各个时间点位均具有对应的声音频率;
所述处理单元702,还用于对所述原始音频数据进行预处理,得到目标音频数据;所述预处理包括以下至少一项:采用目标频率范围对所述原始音频数据进行滤波处理,对所述原始音频数据或者对滤波后的音频数据进行音量统一化处理。
根据本申请的一个实施例，图2或图4所示的方法所涉及的各个步骤均可以是由图7所示的音频检测装置中的各个单元执行的。例如，图2所示的步骤S201由图7中所示的获取单元701来执行，步骤S202至S204均由图7中所示的处理单元702来执行。又如，图4所示的步骤S401由图7中所示的获取单元701来执行，步骤S402至步骤S406由图7中所示的处理单元702来执行。
根据本申请的另一个实施例,图7所示的音频检测装置中的各个单元可以分别或者全部合并为一个或若干个另外的单元来构成,或者其中的某个(些)单元还可以再拆分为功能上更小的多个单元来构成,这可以实现同样的操作,而不影响本申请实施例的技术效果的实现。上述单元是基于逻辑功能划分的,在实际应用中,一个单元的功能也可以是由多个单元来实现,或者多个单元的功能由一个单元实现。在本申请的其他实施例中,音频检测装置也可以包括其他单元,在实际应用中,这些功能也可以由其他单元协助实现,并且可以由多个单元协作实现。
根据本申请的另一个实施例,可以通过包括中央处理单元(Central Processing Unit, CPU)、随机存取存储介质(RAM)、只读存储介质(ROM)等处理元件和存储元件来实现音频检测方法的步骤或音频检测装置的功能。例如通过在计算机的通用计算设备上运行能够执行如图2或图4中所示的相应方法所涉及的各步骤的计算机程序(包括程序代码),来构造如图7所示的音频检测装置,以及来实现本申请实施例的音频检测方法。所述的计算机程序可以记载于例如计算机可读记录介质上,并通过计算机可读记录介质装载于上述计算机设备中,并在其中运行。
基于上述音频检测方法实施例的描述,本申请实施例还公开了一种计算机设备,请参见图8,该计算机设备至少可包括处理器801、输入设备802、输出设备803以及计算机存储介质804。其中,计算机设备内的处理器801、输入设备802、输出设备803以及计算机存储介质804可通过总线或其他方式连接。
所述计算机存储介质804是计算机设备中的记忆设备，用于存放程序和数据。可以理解的是，此处的计算机存储介质804既可以包括计算机设备的内置存储介质，当然也可以包括计算机设备支持的扩展存储介质。计算机存储介质804提供存储空间，该存储空间存储了计算机设备的操作系统。并且，在该存储空间中还存放了适于被处理器801加载并执行的一条或多条指令，这些指令可以是一个或一个以上的计算机程序（包括程序代码）。需要说明的是，此处的计算机存储介质可以是高速RAM存储器；在实施例中，还可以是至少一个远离前述处理器的计算机存储介质。所述处理器可以称为中央处理单元（Central Processing Unit，CPU），是计算机设备的核心以及控制中心，适于实现一条或多条指令，具体加载并执行一条或多条指令从而实现相应的方法流程或功能。
在一种实施例中,可由处理器801加载并执行计算机存储介质中存放的一条或多条第一指令,以实现上述有关音频检测方法实施例中的方法的相应步骤;具体实现中,计算机存储介质中的一条或多条第一指令由处理器801加载并执行如下操作:
从目标音频数据中获取目标时间点位以及所述目标时间点位的参考点位;所述目标音频数据包括多个时间点位以及每个时间点位的音频振幅值;所述参考点位是指与所述目标时间点位之间的时间差小于第一差值阈值的时间点位;
根据所述目标时间点位的音频振幅值对所述目标时间点位进行能量评估处理,得到所述目标时间点位的能量评估值;并根据所述参考点位的音频振幅值对所述参考点位进行能量评估处理,得到所述参考点位的能量评估值;
根据所述目标时间点位的能量评估值和所述参考点位的能量评估值,对所述目标时间点位进行准确性校验;
若所述目标时间点位通过所述准确性校验,则将所述目标时间点位作为目标重音点位添加到目标重音点位集合中。
在一种实现方式中,所述处理器801,具体用于:
计算所述参考点位的能量评估值和所述目标时间点位的能量评估值的能量均值;
从所述目标时间点位的能量评估值和所述参考点位的能量评估值中确定出最大能量评估值;
若所述最大能量评估值与所述能量均值之间的差值大于阈值,则确定所述目标时间点位通过所述准确性校验;否则,则确定所述目标时间点位未通过所述准确性校验。
在一种实现方式中,所述多个时间点位按照时间先后顺序依次排列;所述处理器801,具体用于:
从所述多个时间点位中获取所述目标时间点位的多个关联点位,并采用音频能量函数根据各个关联点位的音频振幅值和所述目标时间点位的音频振幅值,计算所述目标时间点位的音频能量值;所述关联点位是指与所述目标时间点位之间的时间差小于第二差值阈值的时间点位;
从所述多个时间点位中获取所述目标时间点位的前驱点位,所述前驱点位包括:基于所述目标时间点位在所述多个时间点位中的排列位置,往前依次选取的c个时间点位,c为正整数;
采用音频能量变化函数根据所述目标时间点位的音频能量值和所述前驱点位中各个时间点位的音频能量值,计算所述目标时间点位的音频能量变化值;
对所述音频能量值和所述音频能量变化值进行加权求和,得到所述目标时间点位的能量评估值。
在一种实现方式中,所述处理器801,具体用于:
对所述目标时间点位的音频振幅值进行平方运算,得到所述目标时间点位的初始能量值;以及对各个关联点位的音频振幅值进行平方运算,得到所述各个关联点位的初始能量值;
对所述目标时间点位的初始能量值和所述各个关联点位的初始能量值进行均值运算,得到所述目标时间点位的音频能量值。
在一种实现方式中,所述处理器801,具体用于:
对所述目标时间点位的初始能量值和所述各个关联点位的初始能量值进行均值运算,得到中间能量值;
对所述中间能量值进行去噪处理,得到所述目标时间点位的音频能量值。
在一种实现方式中,所述处理器801,具体用于:
求取所述前驱点位中各个时间点位的音频能量值之间的音频能量值总和;
获取基准数值,并计算所述音频能量值总和与c倍的所述目标时间点位的音频能量值之间的差值;
将所述基准数值和计算得到的差值中的最大值,作为所述目标时间点位的初始能量变化值;
根据所述目标时间点位的初始能量变化值,确定所述目标时间点位的音频能量变化值。
在一种实现方式中,所述处理器801,具体用于:
获取所述目标音频数据中的各个时间点位的初始能量变化值;
从所述各个时间点位的初始能量变化值中确定出多个峰值,所述峰值是指所述目标音频数据中的峰值时间点位的初始能量变化值,所述峰值时间点位满足如下条件:所述峰值时间点位的初始能量变化值均大于,位于所述峰值时间点位的左右两侧且与所述峰值时间点位相邻的两个时间点位的初始能量变化值;
采用所述多个峰值的均值对所述目标时间点位的初始能量变化值进行归一化处理,得到所述目标时间点位的音频能量变化值。
在一种实现方式中,所述处理器801,具体用于:
获取所述各个时间点位的音频能量值,并从所述各个时间点位的音频能量值中确定出最小音频能量值;
采用所述多个峰值的均值和所述最小音频能量值，对所述目标时间点位的初始能量变化值进行收缩处理，得到所述目标时间点位的音频能量变化值。
在一种实现方式中,所述将所述目标时间点位作为目标重音点位添加到目标重音点位集合中之前,所述处理器801,还用于:
从所述各个关联点位的音频振幅值的绝对值和所述目标时间点位的音频振幅值的绝对值中,选取最大绝对值作为所述目标时间点位的局部最大振幅值;
若所述目标时间点位的局部最大振幅值大于第一振幅阈值,则执行将所述目标时间点位作为目标重音点位添加到目标重音点位集合中的步骤。
在一种实现方式中,所述目标时间点位为初始重音点位集合中的任一初始重音点位,或者补充时间点位集合中的任一补充点位;其中,所述初始重音点位集合中的多个重音点位是采用点位提取算法对目标音频数据进行点位提取所提取得到的;
所述目标音频数据中的多个时间点位按照时间先后顺序依次排列,所述处理器801,具体用于:从所述初始重音点位集合中确定出起始重音点位和结束重音点位,所述起始重音点位是指所述初始重音点位集合中时间最早的重音点位,所述结束重音点位是指所述初始重音点位集合中时间最晚的重音点位;
确定所述起始重音点位在所述目标音频数据中的起始排列位置,以及所述结束重音点位在所述目标音频数据中的结束排列位置;
按照采样频率对所述目标音频数据中位于所述起始排列位置之前的时间点位进行延拓采点,以及按照所述采样频率对所述目标音频数据中位于所述结束排列位置之后的时间点位进行延拓采点;
将延拓采点所得到的时间点位作为补充点位,添加到所述补充时间点位集合中。
在一种实现方式中,所述处理器801,还用于:
从所述目标音频数据中提取至少一个音符的音符起始点,一个音符是根据至少两个时间点位及所述至少两个时间点位对应的音频振幅值确定的,所述音符起始点位是指:一个音符对应的至少两个时间点位中时间最早的时间点位;
获取所述音符起始点的能量评估值和所述音符起始点的局部最大振幅值；
若所述音符起始点的能量评估值和局部最大振幅值满足重音条件,则将所述音符起始点作为目标重音点位添加到所述目标重音点位集合中;所述重音条件包括以下至少一种:所述音符起始点的能量评估值大于能量评估阈值,以及所述音符起始点的局部最大振幅值大于第二振幅阈值。
在一种实现方式中,所述处理器801,还用于:
针对所述目标重音点位集合中的任一目标重音点位,获取所述任一目标重音点位所属的目标音符的音符起始点;
在所述目标重音点位集合中,采用所述目标音符的音符起始点替换所述任一目标重音点位。
在一种实现方式中,所述处理器801,具体用于:
获取所述目标音频数据的音符起始点强度评估曲线,所述音符起始点强度评估曲线包括按时间先后顺序依次排列的所述多个时间点位和每个时间点位的音符强度值;
将所述任一目标重音点位映射到所述音符起始点强度评估曲线上,得到所述任一目标重音点位在所述音符起始点强度评估曲线上的目标位置;
在所述音符起始点强度评估曲线上，基于所述目标位置并沿时间变小的方向依次遍历至少一个音符强度值；
若当前遍历的当前音符强度值满足音符强度条件,则停止遍历,并将所述当前音符强度值所对应的当前时间点位作为所述任一目标重音点位所属的目标音符的音符起始点;
其中,所述音符强度条件包括:位于所述当前时间点位之前且与所述当前时间点位相邻的时间点位的音符强度值大于或等于所述当前音符强度值,且位于所述当前时间点位之后且与所述当前时间点位相邻的时间点位的音符强度值大于所述当前音符强度值。
在一种实现方式中,所述从目标音频数据中获取目标时间点位以及所述目标时间点位的参考点位之前,所述处理器801,还用于:
获取原始音频数据,所述原始音频数据中的各个时间点位均具有对应的声音频率;
对所述原始音频数据进行预处理,得到目标音频数据;所述预处理包括以下至少一项:采用目标频率范围对所述原始音频数据进行滤波处理,对所述原始音频数据或者对滤波后的音频数据进行音量统一化处理。
需要说明的是,本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述音频检测方法实施例图2或图4中所执行的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)或随机存储记忆体(Random Access Memory,RAM)等。
以上所揭露的仅为本申请一种较佳实施例而已,当然不能以此来限定本申请之权利范围,本领域普通技术人员可以理解实现上述实施例的全部或部分流程,并依本申请权利要求所作的等同变化,仍属于申请所涵盖的范围。

Claims (17)

  1. 一种音频检测方法,由计算机设备执行,包括:
    从目标音频数据中获取目标时间点位以及所述目标时间点位的参考点位;所述目标音频数据包括多个时间点位以及每个时间点位的音频振幅值;所述参考点位是指与所述目标时间点位之间的时间差小于第一差值阈值的时间点位;
    根据所述目标时间点位的音频振幅值对所述目标时间点位进行能量评估处理,得到所述目标时间点位的能量评估值;并根据所述参考点位的音频振幅值对所述参考点位进行能量评估处理,得到所述参考点位的能量评估值;
    根据所述目标时间点位的能量评估值和所述参考点位的能量评估值,对所述目标时间点位进行准确性校验;
    若所述目标时间点位通过所述准确性校验,则将所述目标时间点位作为目标重音点位添加到目标重音点位集合中。
  2. 如权利要求1所述的方法,其中,所述根据所述目标时间点位的能量评估值和所述参考点位的能量评估值,对所述目标时间点位进行准确性校验,包括:
    计算所述参考点位的能量评估值和所述目标时间点位的能量评估值的能量均值;
    从所述目标时间点位的能量评估值和所述参考点位的能量评估值中确定出最大能量评估值;
    若所述最大能量评估值与所述能量均值之间的差值大于阈值,则确定所述目标时间点位通过所述准确性校验;否则,则确定所述目标时间点位未通过所述准确性校验。
  3. 如权利要求1所述的方法,其中,所述多个时间点位按照时间先后顺序依次排列;所述根据所述目标时间点位的音频振幅值对所述目标时间点位进行能量评估处理,得到所述目标时间点位的能量评估值,包括:
    从所述多个时间点位中获取所述目标时间点位的多个关联点位,并采用音频能量函数根据各个关联点位的音频振幅值和所述目标时间点位的音频振幅值,计算所述目标时间点位的音频能量值;所述关联点位是指与所述目标时间点位之间的时间差小于第二差值阈值的时间点位;
    从所述多个时间点位中获取所述目标时间点位的前驱点位,所述前驱点位包括:基于所述目标时间点位在所述多个时间点位中的排列位置,往前依次选取的c个时间点位,c为正整数;
    采用音频能量变化函数根据所述目标时间点位的音频能量值和所述前驱点位中各个时间点位的音频能量值,计算所述目标时间点位的音频能量变化值;
    对所述音频能量值和所述音频能量变化值进行加权求和,得到所述目标时间点位的能量评估值。
  4. 如权利要求3所述的方法,其中,所述采用音频能量函数根据各个关联点位的音频振幅值和所述目标时间点位的音频振幅值,计算所述目标时间点位的音频能量值,包括:
    对所述目标时间点位的音频振幅值进行平方运算,得到所述目标时间点位的初始能量值;以及对各个关联点位的音频振幅值进行平方运算,得到所述各个关联点位的初始能量值;
    对所述目标时间点位的初始能量值和所述各个关联点位的初始能量值进行均值运算,得到所述目标时间点位的音频能量值。
  5. 如权利要求4所述的方法,其中,所述对所述目标时间点位的初始能量值和所述各个关联点位的初始能量值进行均值运算,得到所述目标时间点位的音频能量值,包括:
    对所述目标时间点位的初始能量值和所述各个关联点位的初始能量值进行均值运算,得到中间能量值;
    对所述中间能量值进行去噪处理,得到所述目标时间点位的音频能量值。
  6. 如权利要求3-5任一项所述的方法,其中,所述采用音频能量变化函数根据所述目标时间点位的音频能量值和所述前驱点位中各个时间点位的音频能量值,计算所述目标时间点位的音频能量变化值,包括:
    求取所述前驱点位中各个时间点位的音频能量值之间的音频能量值总和;
    获取基准数值,并计算所述音频能量值总和与c倍的所述目标时间点位的音频能量值之间的差值;
    将所述基准数值和计算得到的差值中的最大值,作为所述目标时间点位的初始能量变化值;
    根据所述目标时间点位的初始能量变化值,确定所述目标时间点位的音频能量变化值。
  7. 如权利要求6所述的方法,其中,所述根据所述目标时间点位的初始能量变化值,确定所述目标时间点位的音频能量变化值,包括:
    获取所述目标音频数据中的各个时间点位的初始能量变化值;
    从所述各个时间点位的初始能量变化值中确定出多个峰值,所述峰值是指所述目标音频数据中的峰值时间点位的初始能量变化值,所述峰值时间点位满足如下条件:所述峰值时间点位的初始能量变化值均大于,位于所述峰值时间点位的左右两侧且与所述峰值时间点位相邻的两个时间点位的初始能量变化值;
    采用所述多个峰值的均值,对所述目标时间点位的初始能量变化值进行归一化处理,得到所述目标时间点位的音频能量变化值。
  8. 如权利要求7所述的方法,其中,所述采用所述多个峰值的均值,对所述目标时间点位的初始能量变化值进行归一化处理,得到所述目标时间点位的音频能量变化值,包括:
    获取所述各个时间点位的音频能量值,并从所述各个时间点位的音频能量值中确定出最小音频能量值;
    采用所述多个峰值的均值和所述最小音频能量值,对所述目标时间点位的初始能量变化值进行收缩处理,得到所述目标时间点位的音频能量变化值。
  9. 如权利要求3所述的方法,其中,所述将所述目标时间点位作为目标重音点位添加到目标重音点位集合中之前,所述方法还包括:
    从所述各个关联点位的音频振幅值的绝对值和所述目标时间点位的音频振幅值的绝对值中,选取最大绝对值作为所述目标时间点位的局部最大振幅值;
    若所述目标时间点位的局部最大振幅值大于第一振幅阈值,则执行将所述目标时间点位作为目标重音点位添加到目标重音点位集合中的步骤。
  10. 如权利要求1所述的方法，其中，所述目标时间点位为初始重音点位集合中的任一初始重音点位，或者补充时间点位集合中的任一补充点位；其中，所述初始重音点位集合中的多个重音点位是采用点位提取算法对目标音频数据进行点位提取得到的；
    所述目标音频数据中的多个时间点位按照时间先后顺序依次排列,所述补充时间点位集合的获取方式如下:
    从所述初始重音点位集合中确定出起始重音点位和结束重音点位,所述起始重音点位是指所述初始重音点位集合中时间最早的重音点位,所述结束重音点位是指所述初始重音点位集合中时间最晚的重音点位;
    确定所述起始重音点位在所述目标音频数据中的起始排列位置,以及所述结束重音点位在所述目标音频数据中的结束排列位置;
    按照采样频率对所述目标音频数据中位于所述起始排列位置之前的时间点位进行延拓采点,以及按照所述采样频率对所述目标音频数据中位于所述结束排列位置之后的时间点位进行延拓采点;
    将延拓采点所得到的时间点位作为补充点位,添加到所述补充时间点位集合中。
  11. 如权利要求1所述的方法,其中,所述方法还包括:
    从所述目标音频数据中提取至少一个音符的音符起始点,一个音符是根据至少两个时间点位及所述至少两个时间点位对应的音频振幅值确定的,所述音符起始点位是指:一个音符对应的至少两个时间点位中时间最早的时间点位;
    获取所述音符起始点的能量评估值和所述音符起始点的局部最大振幅值;
    若所述音符起始点的能量评估值和局部最大振幅值满足重音条件,则将所述音符起始点作为目标重音点位添加到所述目标重音点位集合中;所述重音条件包括以下至少一种:所述音符起始点的能量评估值大于能量评估阈值,以及所述音符起始点的局部最大振幅值大于第二振幅阈值。
  12. 如权利要求11所述的方法,其中,所述方法还包括:
    针对所述目标重音点位集合中的任一目标重音点位,获取所述任一目标重音点位所属的目标音符的音符起始点;
    在所述目标重音点位集合中,采用所述目标音符的音符起始点替换所述任一目标重音点位。
  13. 如权利要求12所述的方法,其中,所述方法还包括:
    获取所述目标音频数据的音符起始点强度评估曲线,所述音符起始点强度评估曲线包括按时间先后顺序依次排列的所述多个时间点位和每个时间点位的音符强度值;
    将所述任一目标重音点位映射到所述音符起始点强度评估曲线上,得到所述任一目标重音点位在所述音符起始点强度评估曲线上的目标位置;
    在所述音符起始点强度评估曲线上,基于所述目标位置并沿时间变小的方向依次遍历至少一个音符强度值;
    若当前遍历的当前音符强度值满足音符强度条件,则停止遍历,并将所述当前音符强度值所对应的当前时间点位作为所述任一目标重音点位所属的目标音符的音符起始点;
    其中,所述音符强度条件包括:位于所述当前时间点位之前且与所述当前时间点位相邻的时间点位的音符强度值大于或等于所述当前音符强度值,且位于所述当前时间点位之后且与所述当前时间点位相邻的时间点位的音符强度值大于所述当前音符强度值。
  14. 如权利要求1所述的方法,其中,所述从目标音频数据中获取目标时间点位以及所述目标时间点位的参考点位之前,所述方法还包括:
    获取原始音频数据,所述原始音频数据中的各个时间点位均具有对应的声音频率;
    对所述原始音频数据进行预处理,得到目标音频数据;所述预处理包括以下至少一项:采用目标频率范围对所述原始音频数据进行滤波处理,对所述原始音频数据或者对滤波后的音频数据进行音量统一化处理。
  15. 一种音频检测装置,包括:
    获取单元,用于从目标音频数据中获取目标时间点位以及所述目标时间点位的参考点位;所述目标音频数据包括多个时间点位以及每个时间点位的音频振幅值;所述参考点位是指与所述目标时间点位之间的时间差小于第一差值阈值的时间点位;
    处理单元,用于根据所述目标时间点位的音频振幅值对所述目标时间点位进行能量评估处理,得到所述目标时间点位的能量评估值;并根据所述参考点位的音频振幅值对所述参考点位进行能量评估处理,得到所述参考点位的能量评估值;
    所述处理单元,还用于根据所述目标时间点位的能量评估值和所述参考点位的能量评估值,对所述目标时间点位进行准确性校验;
    所述处理单元,还用于若所述目标时间点位通过所述准确性校验,则将所述目标时间点位作为目标重音点位添加到目标重音点位集合中。
  16. 一种计算机设备,所述计算机设备包括输入设备、输出设备,所述计算机设备还包括处理器和存储介质,所述处理器用于获取存储介质中存储的一条或多条指令,以执行如权利要求1-14中任一项所述的方法。
  17. 一种计算机存储介质,所述计算机存储介质存储有一条或多条指令,所述一条或多条指令运行时执行如权利要求1-14中任一项所述的方法。
PCT/CN2021/126022 2020-11-25 2021-10-25 一种音频检测方法、装置、计算机设备和可读存储介质 WO2022111177A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21896679.4A EP4250291A4 (en) 2020-11-25 2021-10-25 AUDIO DETECTION METHOD AND APPARATUS, COMPUTER DEVICE AND READABLE STORAGE MEDIUM
US17/974,452 US20230050565A1 (en) 2020-11-25 2022-10-26 Audio detection method and apparatus, computer device, and readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011336979.1A CN112435687A (zh) 2020-11-25 2020-11-25 一种音频检测方法、装置、计算机设备和可读存储介质
CN202011336979.1 2020-11-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/974,452 Continuation US20230050565A1 (en) 2020-11-25 2022-10-26 Audio detection method and apparatus, computer device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2022111177A1 true WO2022111177A1 (zh) 2022-06-02

Family

ID=74698863

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126022 WO2022111177A1 (zh) 2020-11-25 2021-10-25 一种音频检测方法、装置、计算机设备和可读存储介质

Country Status (4)

Country Link
US (1) US20230050565A1 (zh)
EP (1) EP4250291A4 (zh)
CN (1) CN112435687A (zh)
WO (1) WO2022111177A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11615772B2 (en) * 2020-01-31 2023-03-28 Obeebo Labs Ltd. Systems, devices, and methods for musical catalog amplification services
CN112435687A (zh) * 2020-11-25 2021-03-02 腾讯科技(深圳)有限公司 一种音频检测方法、装置、计算机设备和可读存储介质
CN113674723B (zh) * 2021-08-16 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 一种音频处理方法、计算机设备及可读存储介质

Citations (6)

Publication number Priority date Publication date Assignee Title
JP2018072368A (ja) * 2016-10-24 2018-05-10 ヤマハ株式会社 音響解析方法および音響解析装置
CN108335703A (zh) * 2018-03-28 2018-07-27 腾讯音乐娱乐科技(深圳)有限公司 确定音频数据的重音位置的方法和装置
CN108877776A (zh) * 2018-06-06 2018-11-23 平安科技(深圳)有限公司 语音端点检测方法、装置、计算机设备和存储介质
CN109903775A (zh) * 2017-12-07 2019-06-18 北京雷石天地电子技术有限公司 一种音频爆音检测方法和装置
CN111833900A (zh) * 2020-06-16 2020-10-27 普联技术有限公司 音频增益控制方法、系统、设备和存储介质
CN112435687A (zh) * 2020-11-25 2021-03-02 腾讯科技(深圳)有限公司 一种音频检测方法、装置、计算机设备和可读存储介质

Family Cites Families (12)

Publication number Priority date Publication date Assignee Title
KR20080059881A (ko) * 2006-12-26 2008-07-01 삼성전자주식회사 음성 신호의 전처리 장치 및 방법
CN104599663B (zh) * 2014-12-31 2018-05-04 华为技术有限公司 歌曲伴奏音频数据处理方法和装置
GB2539875B (en) * 2015-06-22 2017-09-20 Time Machine Capital Ltd Music Context System, Audio Track Structure and method of Real-Time Synchronization of Musical Content
CN107103917B (zh) * 2017-03-17 2020-05-05 福建星网视易信息系统有限公司 音乐节奏检测方法及其系统
CN108319657B (zh) * 2018-01-04 2022-02-01 广州市百果园信息技术有限公司 检测强节奏点的方法、存储介质和终端
CN108320730B (zh) * 2018-01-09 2020-09-29 广州市百果园信息技术有限公司 音乐分类方法及节拍点检测方法、存储设备及计算机设备
CN109670074B (zh) * 2018-12-12 2020-05-15 北京字节跳动网络技术有限公司 一种节奏点识别方法、装置、电子设备及存储介质
CN110336960B (zh) * 2019-07-17 2021-12-10 广州酷狗计算机科技有限公司 视频合成的方法、装置、终端及存储介质
CN110890083B (zh) * 2019-10-31 2022-09-02 北京达佳互联信息技术有限公司 音频数据的处理方法、装置、电子设备及存储介质
CN111081271B (zh) * 2019-11-29 2022-09-06 福建星网视易信息系统有限公司 基于频域和时域的音乐节奏检测方法及存储介质
CN111128232B (zh) * 2019-12-26 2022-11-15 广州酷狗计算机科技有限公司 音乐的小节信息确定方法、装置、存储介质及设备
CN111105769B (zh) * 2019-12-26 2023-01-10 广州酷狗计算机科技有限公司 检测音频的中频节奏点的方法、装置、设备和存储介质

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
JP2018072368A (ja) * 2016-10-24 2018-05-10 ヤマハ株式会社 音響解析方法および音響解析装置
CN109903775A (zh) * 2017-12-07 2019-06-18 北京雷石天地电子技术有限公司 一种音频爆音检测方法和装置
CN108335703A (zh) * 2018-03-28 2018-07-27 腾讯音乐娱乐科技(深圳)有限公司 确定音频数据的重音位置的方法和装置
CN108877776A (zh) * 2018-06-06 2018-11-23 平安科技(深圳)有限公司 语音端点检测方法、装置、计算机设备和存储介质
CN111833900A (zh) * 2020-06-16 2020-10-27 普联技术有限公司 音频增益控制方法、系统、设备和存储介质
CN112435687A (zh) * 2020-11-25 2021-03-02 腾讯科技(深圳)有限公司 一种音频检测方法、装置、计算机设备和可读存储介质

Non-Patent Citations (1)

Title
See also references of EP4250291A4

Also Published As

Publication number Publication date
CN112435687A (zh) 2021-03-02
EP4250291A4 (en) 2024-05-01
US20230050565A1 (en) 2023-02-16
EP4250291A1 (en) 2023-09-27

Similar Documents

Publication Publication Date Title
WO2022111177A1 (zh) 一种音频检测方法、装置、计算机设备和可读存储介质
US10261965B2 (en) Audio generation method, server, and storage medium
US8411977B1 (en) Audio identification using wavelet-based signatures
CN103971689B (zh) 一种音频识别方法及装置
US9367887B1 (en) Multi-channel audio video fingerprinting
US9313593B2 (en) Ranking representative segments in media data
US20140330556A1 (en) Low complexity repetition detection in media data
JP6620241B2 (ja) ログ解析のための高速パターン発見
US10657175B2 (en) Audio fingerprint extraction and audio recognition using said fingerprints
EP2657884A2 (en) Identifying multimedia objects based on multimedia fingerprint
CN110880329A (zh) 一种音频识别方法及设备、存储介质
CN113707173B (zh) 基于音频切分的语音分离方法、装置、设备及存储介质
JP5345783B2 (ja) 音声信号用フットプリントを生成する方法
CN111428078B (zh) 音频指纹编码方法、装置、计算机设备及存储介质
CN111816170A (zh) 一种音频分类模型的训练和垃圾音频识别方法和装置
JP6462111B2 (ja) 情報信号の指紋を生成するための方法及び装置
CN110889010A (zh) 音频匹配方法、装置、介质和电子设备
WO2016185091A1 (en) Media content selection
US10776420B2 (en) Fingerprint clustering for content-based audio recognition
WO2021051559A1 (zh) 心音信号的分类方法、装置、设备及存储介质
US9215350B2 (en) Sound processing method, sound processing system, video processing method, video processing system, sound processing device, and method and program for controlling same
EP3161689A1 (en) Derivation of probabilistic score for audio sequence alignment
US20130322645A1 (en) Data recognition and separation engine
CN110517671B (zh) 一种音频信息的评估方法、装置及存储介质
CN112863548A (zh) 训练音频检测模型的方法、音频检测方法及其装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896679

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021896679

Country of ref document: EP

Effective date: 20230619