CN109256146B - Audio detection method, device and storage medium - Google Patents


Info

Publication number: CN109256146B
Application number: CN201811278955.8A
Authority: CN (China)
Prior art keywords: audio, signal, spectrogram, detected, impact
Other versions: CN109256146A
Other languages: Chinese (zh)
Inventor: 王征韬
Original and current assignee: Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Events: application filed by Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.; publication of CN109256146A; application granted; publication of CN109256146B
Legal status: Active

Classifications

    • G10L25/24: speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/06: speech or voice analysis techniques, the extracted parameters being correlation coefficients
    • G10L25/21: speech or voice analysis techniques, the extracted parameters being power information
    • G10H2210/041: musical analysis based on MFCC (Mel-frequency cepstral coefficients)
    • G10H2210/056: musical analysis for extraction or identification of individual instrumental parts
    • G10H2210/071: musical analysis for rhythm pattern analysis or rhythm style recognition
    • G10H2210/076: musical analysis for extraction of timing and tempo; beat detection


Abstract

The invention discloses an audio detection method, an audio detection device, and a storage medium. The method comprises: performing audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of that audio; acquiring a Mel frequency spectrum of the impact signal; calculating an initial (onset) envelope of the impact signal according to the Mel frequency spectrum; acquiring an autocorrelation velocity spectrogram (tempogram) of the initial envelope; and determining a rhythm intensity value of the audio to be detected according to the secondary peak value in the autocorrelation velocity spectrogram. By analysing the regularity and the strength of the strong striking or impact points in the audio, the embodiments of the invention assign a rhythm intensity value to an audio segment, so that the given value better matches the listener's auditory perception.

Description

Audio detection method, device and storage medium
Technical Field
Embodiments of the invention relate to the field of audio processing, and in particular to an audio detection method, an audio detection device, and a storage medium.
Background
Rhythm sense refers to the subjective human perception of musical rhythm; music with a strong rhythm sense has clear beat points and rich, regular percussive content. Rhythm itself is a pattern formed by the repetition of musical elements such as beat, tempo, dynamics, melody, and timbre.
Musical rhythm sense has wide applications, for example in music recommendation and emotion classification, but it is a relatively subjective feeling and lacks a reasonable numerical description.
Disclosure of Invention
The embodiment of the invention provides an audio detection method, an audio detection device, and a storage medium, which can measure the rhythm intensity of audio with an objective value.
The embodiment of the invention provides an audio detection method, which comprises the following steps:
carrying out audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected;
acquiring a Mel frequency spectrum of the impact signal;
calculating an initial envelope of the impact signal according to the Mel frequency spectrum;
acquiring an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal;
and determining the rhythm intensity value of the audio to be detected according to the secondary peak value in the autocorrelation velocity spectrogram.
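As an illustration only, and not the patented implementation, the five steps above can be sketched end to end with NumPy/SciPy. The median-filter kernel size, the log compression standing in for the Mel spectrum, and the global (rather than local, per-segment) autocorrelation are all simplifying assumptions:

```python
import numpy as np
from scipy.signal import stft, find_peaks
from scipy.ndimage import median_filter

def rhythm_intensity(y, sr=44100, n_fft=1024, hop=441):
    # Step 1: STFT of the audio, then median filtering to keep the impact part
    _, _, Z = stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    S = np.abs(Z)
    H = median_filter(S, size=(1, 17))   # smooth along time: harmonic-enhanced
    P = median_filter(S, size=(17, 1))   # smooth along frequency: impact-enhanced
    perc = np.where(P >= H, S, 0.0)      # hard mask keeping the impact signal
    # Steps 2-3: log compression (a crude stand-in for the Mel spectrum) and an
    # onset envelope as half-wave-rectified spectral flux
    flux = np.diff(np.log1p(perc), axis=1)
    onset = np.maximum(flux, 0.0).sum(axis=0)
    # Step 4: autocorrelation of the onset envelope (computed globally here; the
    # full method computes it locally per segment of roughly 8.9 s)
    onset = onset - onset.mean()
    ac = np.correlate(onset, onset, mode='full')[onset.size - 1:]
    ac = ac / (ac[0] + 1e-12)            # normalise so the lag-0 value is 1
    # Step 5: the secondary peak, i.e. the highest local maximum after lag 0
    peaks, _ = find_peaks(ac)
    return float(ac[peaks].max()) if peaks.size else 0.0

# A perfectly regular click track scores high; per the description below,
# values above roughly 0.2 correspond to a clearly audible rhythm
sr = 44100
y = np.zeros(sr * 3)
y[::sr // 4] = 1.0                       # one click every 0.25 s
score = rhythm_intensity(y, sr)
```

On the synthetic click track the onset envelope is strictly periodic, so the normalised secondary peak comes out close to 1.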
An embodiment of the present invention further provides an audio detection apparatus, where the apparatus includes:
the signal separation module is used for carrying out audio signal separation on the audio to be detected so as to obtain a harmonic signal and an impact signal of the audio to be detected;
the first acquisition module is used for acquiring a Mel frequency spectrum of the impact signal;
the computing module is used for calculating the initial envelope of the impact signal according to the Mel frequency spectrum;
the second acquisition module is used for acquiring an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal;
and the determining module is used for determining the rhythm intensity value of the audio to be detected according to the secondary peak value in the autocorrelation velocity spectrogram.
The embodiment of the present invention further provides a storage medium, where multiple instructions are stored in the storage medium, and the instructions are suitable for being loaded by a processor to perform any of the steps in the audio detection method provided in the embodiment of the present invention.
According to the embodiment of the invention, audio signal separation is performed on the audio to be detected to obtain a harmonic signal and an impact signal of that audio; a Mel frequency spectrum of the impact signal is acquired; an initial envelope of the impact signal is then calculated according to the Mel frequency spectrum; an autocorrelation velocity spectrogram of the initial envelope is acquired; and the rhythm intensity value of the audio to be detected is determined according to the secondary peak value in the autocorrelation velocity spectrogram. By analysing the regularity and the strength of the strong striking or impact points in the audio, the embodiment of the invention assigns a rhythm intensity value to the audio segment, so that the given value better matches the listener's auditory perception.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart illustrating an audio detection method according to an embodiment of the present invention.
Fig. 2 is another flow chart of an audio detection method according to an embodiment of the present invention.
Fig. 3 is another flow chart of an audio detection method according to an embodiment of the present invention.
Fig. 4 is another flow chart of an audio detection method according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a strong rhythm sense provided by an embodiment of the invention.
Fig. 6 is a schematic diagram of a weak rhythm sense according to an embodiment of the present invention.
Fig. 7 is another flowchart of an audio detection method according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of an audio detection apparatus according to an embodiment of the present invention.
Fig. 9 is another schematic structural diagram of an audio detecting apparatus according to an embodiment of the present invention.
Fig. 10 is another schematic structural diagram of an audio detecting apparatus according to an embodiment of the present invention.
Fig. 11 is another schematic structural diagram of an audio detecting apparatus according to an embodiment of the present invention.
Fig. 12 is another schematic structural diagram of an audio detecting apparatus according to an embodiment of the present invention.
Fig. 13 is a schematic structural diagram of a server according to an embodiment of the present invention.
Fig. 14 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first" and "second", etc. in the present invention are used for distinguishing different objects, not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Rhythm sense refers to the subjective human perception of musical rhythm; music with a strong rhythm sense has clear beat points and rich, regular percussive content. Rhythm itself is a pattern formed by the repetition of musical elements such as beat, tempo, dynamics, melody, and timbre.
Musical rhythm sense has wide applications, for example in music recommendation and emotion classification, but it is a relatively subjective feeling and lacks a reasonable numerical description.
Therefore, the embodiments of the invention provide an audio detection method, an audio detection device, and a storage medium, in which the rhythm intensity value of an audio segment is given by analysing the regularity and the strength of the strong striking or impact points in the audio, so that the given rhythm intensity value better matches the user's auditory perception.
The audio detection method provided by the embodiment of the invention can be realized in an audio detection device, and the audio detection device can be specifically integrated in electronic equipment or other equipment with an audio and video data processing function, wherein the electronic equipment comprises but is not limited to computers, smart televisions, smart sound boxes, mobile phones, tablet computers and other equipment.
Referring to fig. 1 to 6: fig. 1 to 4 are schematic flowcharts of an audio detection method according to an embodiment of the present invention, fig. 5 is a schematic diagram of a strong rhythm sense, and fig. 6 is a schematic diagram of a weak rhythm sense. The method comprises the following steps:
step 101, audio signal separation is performed on the audio to be tested to obtain a harmonic signal and an impact signal of the audio to be tested.
For example, Harmonic-Percussive Source Separation (HPSS) is a common audio preprocessing technique that can be used to separate the harmonic sources from the impact (percussive) sources in an audio signal. The spectrogram of a musical audio signal usually contains two types of components: one distributed continuously and smoothly along the time axis, the other distributed continuously along the frequency axis; their sound sources are usually called the harmonic source and the impact source, respectively. Correspondingly, musical instruments can be divided into stringed instruments and percussion instruments. The sound produced by a stringed instrument is generally relatively soft, with continuous transitions between tones, and appears as a smooth horizontal envelope on the spectrogram. The sound produced by a percussion instrument generally has a strong sense of rhythm, with large jumps between tones, and appears as vertical stripes on the spectrogram. Therefore, on the spectrogram, the smooth sources produced by stringed and similar instruments are generally called harmonic sources, and the strongly rhythmic sources produced by percussion and similar instruments are generally called impact sources.
In the embodiment of the invention, the audio signal separation can be carried out on the audio to be detected by using a harmonic and impact source separation method so as to obtain the harmonic signal and the impact signal of the audio to be detected.
In some embodiments, as shown in fig. 2, step 101 may be implemented by steps 1011 to 1012, specifically:
step 1011, performing short-time Fourier transform on the audio to be detected according to a preset frame length and a preset step length to obtain a spectrogram of the audio to be detected;
step 1012, performing median filtering along the time direction and the frequency direction of the spectrogram respectively to obtain a harmonic signal and an impulse signal of the spectrogram, wherein the harmonic signal is obtained by performing median filtering along the time direction, and the impulse signal is obtained by performing median filtering along the frequency direction.
For example, the audio to be detected is read at a 44100 Hz sampling rate, and a short-time Fourier transform (STFT) is performed with a frame length of 1024 and a step length of 441 to obtain the STFT spectrogram of the audio to be detected. Median filtering is then performed along the time and frequency directions of the spectrogram to obtain the harmonic part and the percussive part of the original audio signal: filtering along the time direction yields the harmonic signal, corresponding to the continuous part of the audio to be detected, while filtering along the frequency direction yields the impact (percussive) signal, corresponding to the part of the audio with a striking or impact feel.
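A minimal SciPy sketch of this separation step follows; the median kernel size of 17 and the Wiener-style soft masks are assumptions, since the text above only fixes the frame length 1024 and step length 441:

```python
import numpy as np
from scipy.signal import stft, istft
from scipy.ndimage import median_filter

def hpss(y, sr=44100, n_fft=1024, hop=441, kernel=17):
    """Split audio into a harmonic part and an impact (percussive) part by
    median filtering the STFT magnitude along time and frequency."""
    _, _, Z = stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    S = np.abs(Z)
    H = median_filter(S, size=(1, kernel))  # median along time: harmonic-enhanced
    P = median_filter(S, size=(kernel, 1))  # median along frequency: impact-enhanced
    mask_h = H ** 2 / (H ** 2 + P ** 2 + 1e-12)  # Wiener-style soft mask
    _, y_h = istft(Z * mask_h, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    _, y_p = istft(Z * (1.0 - mask_h), fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return y_h, y_p

# A steady sine lands almost entirely in the harmonic part,
# and a click train almost entirely in the impact part
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
clicks = np.zeros(sr)
clicks[::sr // 10] = 1.0
h1, p1 = hpss(tone)
h2, p2 = hpss(clicks)
```

A sustained tone forms a horizontal ridge that survives the time-direction median, while each click forms a broadband vertical stripe that survives the frequency-direction median, which is exactly the behaviour described above.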
In some embodiments, as shown in fig. 3, step 1012 may be implemented by step 10121 to step 10123, specifically:
step 10121, performing first median filtering along the time direction and the frequency direction of the spectrogram respectively to obtain a first harmonic signal and a first impact signal of the spectrogram;
step 10122, removing the first harmonic signal in the spectrogram to obtain a target spectrogram consisting of the first impact signal;
step 10123, performing second median filtering along the time direction and the frequency direction of the target spectrogram respectively to obtain a second harmonic signal and a second impact signal of the target spectrogram, wherein the second harmonic signal and the second impact signal of the target spectrogram form a harmonic signal and an impact signal of the audio to be detected.
Harmonic-percussive source separation of audio can also be written as H-P separation, and the harmonic and percussive parts it separates are denoted the H part and the P part respectively, where the H part corresponds to the harmonic signal and the P part corresponds to the impact signal.
For example, a first H-P separation is performed on the spectrogram of the audio to be detected, that is, first median filtering is performed along the time direction and the frequency direction of the spectrogram to obtain a first harmonic signal (H part) and a first impact signal (P part). The H part is then discarded and only the P part is kept, that is, the first harmonic signal is removed from the spectrogram to obtain a target spectrogram consisting of the first impact signal. A further H-P separation is then performed on this P part and the newly obtained P part is extracted again, that is, second median filtering is performed along the time direction and the frequency direction of the target spectrogram to obtain a second harmonic signal (the new H part) and a second impact signal (the new P part), which form the harmonic signal and the impact signal of the audio to be detected. After the spectrogram has undergone H-P separation twice, the resulting P part contains very few continuous sounds; most of its content consists of signals with a strong striking or impact feel, such as drum hits, keyboard strikes, and gong sounds, so the harmonic and impact signals are separated effectively.
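The two-pass separation can be sketched as repeated median-filter masking applied directly to a magnitude spectrogram (a simplified illustration; the hard mask and the kernel size are assumptions):

```python
import numpy as np
from scipy.ndimage import median_filter

def hp_split(S, kernel=17):
    """One H-P pass on a magnitude spectrogram S of shape (freq, time).
    Returns (H part, P part) using a hard median-filter mask."""
    H = median_filter(S, size=(1, kernel))  # smooth along the time direction
    P = median_filter(S, size=(kernel, 1))  # smooth along the frequency direction
    mask_p = P >= H                         # bins dominated by the impact part
    return S * ~mask_p, S * mask_p

def double_hpss(S, kernel=17):
    """Steps 10121 to 10123: separate once, keep only the P part as the target
    spectrogram, then separate that target spectrogram again."""
    _, P1 = hp_split(S, kernel)    # first separation; the H part is discarded
    return hp_split(P1, kernel)    # second separation on the target spectrogram

# A horizontal line (sustained tone) plus a vertical line (a single hit):
S = np.zeros((64, 64))
S[10, :] = 1.0   # harmonic component
S[:, 30] = 1.0   # impact component
H2, P2 = double_hpss(S)
```

On this toy spectrogram the final P part is exactly the vertical stripe, illustrating how the cascade suppresses residual continuous sound.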
Step 102, obtaining a Mel frequency spectrum of the impact signal.
In order to better conform to human auditory perception, the impact signal obtained in step 101 may be converted to a Mel-scale spectrum, for example using the Mel filter bank employed in Mel-frequency cepstral coefficient (MFCC) analysis.
The Mel scale is a frequency scale divided according to the auditory characteristics of the human ear. The relationship between Mel frequency and actual frequency can be expressed by the following formula:
Mel(f) = 2595 · log10(1 + f/700), where f denotes the actual frequency of the impact signal in Hz.
Below 1000 Hz, human pitch perception increases approximately linearly with the sound frequency; above 1000 Hz, it follows an approximately logarithmic distribution. The actual frequency axis is therefore divided according to this correspondence to obtain a series of triangular filters, called the Mel filter bank. For example, the Mel spectrum of the impact signal is calculated with a maximum frequency of 1000 Hz and 128 Mel bands.
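The Mel conversion and the resulting triangular filter-bank layout can be sketched as follows; the `mel_band_edges` helper and its default frequency range are illustrative assumptions, not parameters from the text:

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(n_mels=128, fmin=0.0, fmax=8000.0):
    """Edge frequencies of a triangular Mel filter bank with n_mels bands:
    n_mels + 2 points equally spaced on the Mel scale, mapped back to Hz."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    return mel_to_hz(mels)

# 1000 Hz maps to roughly 1000 Mel, the knee of the linear/log transition
edges = mel_band_edges(n_mels=128)
```

Each triangular filter spans three consecutive edge frequencies, so the bands are narrow at low frequencies and progressively wider above the 1000 Hz knee.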
And 103, calculating the initial envelope of the impact signal according to the Mel frequency spectrum.
For the Mel spectrum of the impact signal, the initial envelope (onset envelope), i.e. the envelope of onset points, is calculated. An onset is the starting point of an 'event' in the audio, and the onset envelope is the line connecting these starting points. For example, an envelope-demodulation unit may be used to calculate the onset envelope of the impact signal and connect the peak points of the impact signal in the Mel spectrum.
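A half-wave-rectified spectral flux over the Mel bands is one common way to realise such an onset envelope; it is an assumed stand-in for the envelope-demodulation unit mentioned above:

```python
import numpy as np

def onset_envelope(mel_spec):
    """mel_spec: (n_mels, n_frames) Mel power spectrogram of the impact signal.
    Returns one onset-strength value per frame (frame 0 is set to 0)."""
    logS = np.log1p(mel_spec)                 # compress the dynamic range
    flux = np.diff(logS, axis=1)              # frame-to-frame change per band
    env = np.maximum(flux, 0.0).mean(axis=0)  # keep only energy increases
    return np.concatenate([[0.0], env])

# A toy spectrogram with bursts at frames 10 and 20 produces peaks there
S = np.zeros((128, 30))
S[:, 10] = 5.0
S[:, 20] = 5.0
env = onset_envelope(S)
```

Keeping only the positive flux means the envelope reacts to the start of each 'event' (the energy rise) and ignores the decay that follows.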
And 104, acquiring an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal.
The autocorrelation velocity spectrogram (onset envelope tempogram) of the initial envelope of the impact signal can be obtained by calculating a local autocorrelation function of the initial envelope. Local autocorrelation is chosen because the rhythm of a song may vary considerably over the whole piece: a globally calculated autocorrelation function cannot accurately characterise the rhythm of the music, whereas a locally calculated one can.
In some embodiments, as shown in fig. 4, step 104 may be implemented through step 1041 to step 1042, specifically:
step 1041, framing the initial envelope of the impact signal according to a preset duration to obtain a plurality of local segments, and dividing each local segment into a plurality of subframes according to a preset step length;
step 1042, inputting a plurality of frames corresponding to each local segment into a local autocorrelation function for calculation, so as to obtain an autocorrelation velocity spectrogram of the initial envelope.
For example, the duration of a partial segment may be 8.9s, and the step size of a frame may be 0.01 s.
The result of framing in time and calculating the local autocorrelation function is a 2-dimensional matrix called a tempogram, which represents the autocorrelation velocity spectrogram of the initial envelope.
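The framing and local autocorrelation of steps 1041 to 1042 can be sketched as follows; the window length of 890 frames (about 8.9 s at a 0.01 s frame step) and the lag-0 normalisation are assumptions:

```python
import numpy as np

def tempogram(onset_env, win_len=890, hop=1):
    """Local autocorrelation of the onset envelope, one column per window.
    win_len = 890 frames is about 8.9 s at a 0.01 s frame step.
    Returns a 2-D matrix of shape (win_len, n_windows)."""
    onset_env = np.asarray(onset_env, dtype=float)
    n = len(onset_env)
    cols = []
    for start in range(0, max(1, n - win_len + 1), hop):
        seg = onset_env[start:start + win_len]
        seg = seg - seg.mean()
        ac = np.correlate(seg, seg, mode='full')[len(seg) - 1:]
        ac = ac / (ac[0] + 1e-12)   # normalise so the lag-0 value is 1
        cols.append(np.pad(ac, (0, win_len - len(ac))))
    return np.stack(cols, axis=1)

# An onset envelope with a pulse every 10 frames shows a strong peak at lag 10
env = np.zeros(200)
env[::10] = 1.0
tg = tempogram(env, win_len=100, hop=50)
```

Each column measures how well a local segment correlates with itself shifted right by each lag, which is the per-segment picture that figs. 5 and 6 contrast.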
And 105, determining the rhythm intensity value of the audio to be detected according to the secondary peak value in the autocorrelation velocity spectrogram.
The expression of rhythm is the combined effect of the regularity and the strength of percussive hits: percussion with a clear, definite rhythm and strong hits gives music a strong sense of rhythm, while percussion with a muddled rhythm and weak hits gives a weak one. The embodiment of the invention therefore converts the calculation of rhythm into a calculation of the regularity and strength of the hit points.
In some embodiments, the rhythm intensity value of the audio to be detected may be determined by obtaining the autocorrelation mean over the plurality of local segments in the autocorrelation velocity spectrogram and extracting the secondary peak of that mean.
For example, the autocorrelation mean of each local segment in the tempogram is taken, and its secondary peak value is used as the rhythm intensity value of the audio. The rhythm intensity value is a normalised peak value, theoretically in the range 0 to 1 and in practice generally not more than 0.8. When the value exceeds 0.2, the audio is subjectively perceived as having a fairly strong rhythm.
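Extracting the rhythm intensity value from the tempogram then reduces to averaging over the local segments and picking the highest non-zero-lag peak; `scipy.signal.find_peaks` here stands in for whatever peak picking the implementation actually uses:

```python
import numpy as np
from scipy.signal import find_peaks

def rhythm_value(tg):
    """tg: (lags, windows) tempogram whose lag-0 row is normalised to 1.
    Returns the secondary peak of the mean autocorrelation, nominally in [0, 1]."""
    ac_mean = tg.mean(axis=1)        # autocorrelation mean over local segments
    peaks, _ = find_peaks(ac_mean)   # local maxima; the lag-0 sample sits on
                                     # the boundary and is therefore never picked
    return float(ac_mean[peaks].max()) if peaks.size else 0.0

# A mean-autocorrelation column with peaks of 0.8 and 0.6 after lag 0
# yields a rhythm intensity value of 0.8
ac = np.zeros(100)
ac[0] = 1.0
ac[10] = 0.8
ac[20] = 0.6
v = rhythm_value(ac[:, None])
```

On a clearly periodic onset envelope this value approaches 1; as stated above, values over roughly 0.2 correspond to a subjectively strong rhythm.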
For example, fig. 5 shows a typical autocorrelation velocity spectrogram with a strong rhythm, and fig. 6 shows a typical one with a weak rhythm. In an autocorrelation velocity spectrogram, the abscissa is the lag by which the signal is shifted to the right, and the ordinate is the correlation between the original signal and the shifted signal; that is, the autocorrelation is calculated by shifting the signal right by a given lag and then computing its correlation with the original signal.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
According to the embodiment of the invention, audio signal separation is performed on the audio to be detected to obtain a harmonic signal and an impact signal of that audio; a Mel frequency spectrum of the impact signal is acquired; an initial envelope of the impact signal is calculated according to the Mel frequency spectrum; an autocorrelation velocity spectrogram of the initial envelope is acquired; and the rhythm intensity value of the audio to be detected is determined according to the secondary peak value in the autocorrelation velocity spectrogram. By analysing the regularity and strength of the strong striking or impact points in the audio, the embodiment gives the rhythm intensity value of the audio segment and measures the rhythm intensity with an objective value, so that the given value better matches the user's auditory perception.
Referring to fig. 7, fig. 7 is a schematic flowchart illustrating an audio detection method according to an embodiment of the invention. The method comprises the following steps:
step 201, audio signal separation is performed on the audio to be tested to obtain a harmonic signal and an impact signal of the audio to be tested.
For example, Harmonic-Percussive Source Separation (HPSS) is a common audio preprocessing technique that can be used to separate the harmonic sources from the impact (percussive) sources in an audio signal. The spectrogram of a musical audio signal usually contains two types of components: one distributed continuously and smoothly along the time axis, the other distributed continuously along the frequency axis; their sound sources are usually called the harmonic source and the impact source, respectively. Correspondingly, musical instruments can be divided into stringed instruments and percussion instruments. The sound produced by a stringed instrument is generally relatively soft, with continuous transitions between tones, and appears as a smooth horizontal envelope on the spectrogram. The sound produced by a percussion instrument generally has a strong sense of rhythm, with large jumps between tones, and appears as vertical stripes on the spectrogram. Therefore, on the spectrogram, the smooth sources produced by stringed and similar instruments are generally called harmonic sources, and the strongly rhythmic sources produced by percussion and similar instruments are generally called impact sources.
In the embodiment of the invention, the audio signal separation can be carried out on the audio to be detected by using a harmonic and impact source separation method so as to obtain the harmonic signal and the impact signal of the audio to be detected.
In some embodiments, performing audio signal separation on the audio to be detected to obtain the harmonic signal and the impact signal of the audio to be detected includes:
performing short-time Fourier transform on the audio to be detected according to a preset frame length and a preset step length to obtain a spectrogram of the audio to be detected;
performing median filtering along the time direction and the frequency direction of the spectrogram respectively to obtain a harmonic signal and an impact signal of the spectrogram, wherein the harmonic signal is obtained by median filtering along the time direction, and the impact signal is obtained by median filtering along the frequency direction.
For example, after the audio to be detected is read at a sampling rate of 44100 Hz, a short-time Fourier transform (STFT) is performed with a frame length of 1024 and a step length of 441 to obtain the STFT spectrogram of the audio to be detected. Median filtering is then performed along the time and frequency directions of the spectrogram respectively to obtain the harmonic part and the percussive part of the original audio signal to be detected. Filtering along the time direction yields the harmonic signal (Harmonic part), corresponding to the continuous portion of the audio to be detected; filtering along the frequency direction yields the impact signal (Percussive part), corresponding to the portion of the audio with a striking or impact quality.
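As an illustration of the median-filtering step just described, the following sketch splits a magnitude spectrogram into harmonic and percussive parts. The soft-mask formulation and the kernel size of 17 are assumptions for the example, not values specified by the embodiment.

```python
import numpy as np
from scipy.ndimage import median_filter

def hp_separate(S, kernel=17):
    """Split a magnitude spectrogram S (freq bins x time frames)
    into harmonic and percussive parts via median filtering."""
    # Median filtering along time suppresses vertical (percussive) lines.
    H = median_filter(S, size=(1, kernel))
    # Median filtering along frequency suppresses horizontal (harmonic) lines.
    P = median_filter(S, size=(kernel, 1))
    # Soft masks assign each bin proportionally to the two parts.
    mask_h = H**2 / (H**2 + P**2 + 1e-10)
    return S * mask_h, S * (1.0 - mask_h)

# Toy example: a sustained tone (horizontal line) plus a click (vertical line).
S = np.zeros((64, 64))
S[20, :] = 1.0   # harmonic component
S[:, 30] = 1.0   # percussive component
H, P = hp_separate(S)
```

On this toy input, the sustained tone ends up almost entirely in `H` and the click almost entirely in `P`.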
In some embodiments, the median filtering along the time direction and the frequency direction of the spectrogram respectively to obtain a harmonic signal and an impact signal of the spectrogram comprises:
respectively performing first median filtering along the time direction and the frequency direction of the spectrogram to obtain a first harmonic signal and a first impact signal of the spectrogram;
removing a first harmonic signal in the spectrogram to obtain a target spectrogram consisting of the first impact signal;
and respectively performing second median filtering along the time direction and the frequency direction of the target spectrogram to obtain a second harmonic signal and a second impact signal of the target spectrogram, wherein the second harmonic signal and the second impact signal of the target spectrogram form a harmonic signal and an impact signal of the audio to be detected.
Harmonic-Percussive Source Separation (HPSS) of audio can also be referred to as H-P separation, and the harmonic and percussive parts it produces can be denoted the H part and the P part respectively, where the H part corresponds to the harmonic signal and the P part corresponds to the impact signal.
For example, a first H-P separation is performed on the spectrogram of the audio to be detected; that is, first median filtering is performed along the time direction and the frequency direction of the spectrogram respectively to obtain a first harmonic signal (H part) and a first impact signal (P part) of the spectrogram. The H part is then discarded and only the P part is kept; that is, the first harmonic signal (H part) is removed from the spectrogram to obtain a target spectrogram consisting of the first impact signal (P part). Another H-P separation is then performed on this P part and the newly obtained P part is extracted again; that is, second median filtering is performed along the time direction and the frequency direction of the target spectrogram respectively to obtain a second harmonic signal (the new H part) and a second impact signal (the new P part) of the target spectrogram, and the second harmonic signal and the second impact signal of the target spectrogram constitute the harmonic signal and the impact signal of the audio to be detected. After the spectrogram of the audio to be detected has undergone H-P separation twice, the resulting P part contains very little sustained sound; most of its content consists of signals with a strong striking or impact quality, such as drum hits, keyboard strikes, and gong sounds, so the harmonic and impact signals are effectively separated.
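The two-pass scheme can be sketched as follows, here with a hard mask that simply keeps the bins where the frequency-direction median dominates; the kernel size and the masking rule are illustrative assumptions rather than the embodiment's exact procedure.

```python
import numpy as np
from scipy.ndimage import median_filter

def p_part(S, kernel=17):
    """One H-P pass (hard mask): keep only the percussive bins,
    i.e. those where the frequency-direction median dominates."""
    H = median_filter(S, size=(1, kernel))   # filter along time
    P = median_filter(S, size=(kernel, 1))   # filter along frequency
    return np.where(P >= H, S, 0.0)

def double_hpss(S):
    """First pass discards the H part; the second pass, applied to the
    resulting target spectrogram, isolates strongly percussive content
    (drum hits, strikes) even more cleanly."""
    return p_part(p_part(S))

# Toy example: a sustained tone (row) plus a click (column).
S = np.zeros((64, 64))
S[20, :] = 1.0
S[:, 30] = 1.0
PP = double_hpss(S)
```

After two passes the sustained row is gone and only the vertical click survives.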
Step 202, obtaining a mel frequency spectrum of the impact signal.
In order to better match human auditory perception, the impact signal obtained in step 201 can be converted to a Mel-scale frequency spectrum. For example, the Mel frequency spectrum of the impact signal may be obtained by mapping its linear frequency axis onto the Mel scale with a Mel filter bank (the same filter bank used when computing Mel-frequency cepstral coefficients, MFCC).
The Mel scale is a frequency scale divided according to the auditory characteristics of the human ear. The relationship between Mel frequency and actual frequency can be expressed by the following formula:
Mel(f) = 2595 · log10(1 + f/700), where f denotes the actual frequency of the impact signal in Hz.
Below 1000 Hz, the pitch perceived by the human ear increases approximately linearly with the frequency of the sound; above 1000 Hz, it increases approximately logarithmically. The actual frequency axis is therefore divided according to this correspondence to obtain a sequence of triangular filters, called a Mel filter bank. For example, the Mel spectrum of the impact signal may be calculated with a maximum frequency of 1000 Hz and 128 Mel bands.
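The Hz-to-Mel mapping can be checked numerically. The sketch below uses the widely used O'Shaughnessy form, Mel(f) = 2595 · log10(1 + f/700): 1000 Hz maps to roughly 1000 Mel, and higher frequencies are compressed, reflecting the logarithmic region described above.

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to the Mel scale (O'Shaughnessy form)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# The scale is anchored so that ~1000 Hz corresponds to ~1000 Mel;
# doubling the frequency above 1 kHz yields far less than double the Mels.
m1k = hz_to_mel(1000.0)   # approximately 1000
m2k = hz_to_mel(2000.0)   # well below 2000, showing the compression
```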
Step 203, calculating the initial envelope of the shock signal according to the Mel frequency spectrum.
Calculating the initial envelope (onset envelope) of the impact signal from its Mel frequency spectrum means calculating the envelope of the onset points, where an onset is the starting point of an "event" in the audio, and the onset envelope is the line connecting these starting points. For example, an envelope-extraction (envelope demodulation) unit may be used to calculate the onset envelope of the impact signal, connecting the peak points of the Mel-spectrum impact signal.
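One common way to realize such an onset envelope, shown purely as an illustrative assumption (the embodiment does not prescribe this exact computation), is half-wave-rectified spectral flux over the Mel spectrogram: frames where energy suddenly increases across bands mark event starts.

```python
import numpy as np

def onset_envelope(M):
    """Onset envelope of a Mel spectrogram M (bands x frames) as
    half-wave-rectified spectral flux, normalized to [0, 1]."""
    # Positive first difference along time: energy increases mark onsets.
    diff = np.diff(M, axis=1)
    flux = np.maximum(diff, 0.0).sum(axis=0)
    # Normalize so a relative threshold can be applied later.
    return flux / (flux.max() + 1e-10)

# Toy spectrogram: quiet, then a sudden broadband jump at frame 5.
M = np.ones((4, 10)) * 0.1
M[:, 5:] += 1.0
env = onset_envelope(M)   # peaks at the jump
```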
Step 204, filtering the initial envelope of the impact signal to filter out signal points of which the numerical values are smaller than a threshold value in the initial envelope.
The onset envelope obtained in step 203 still contains some negligible weak response points. Although these weak response points are not the main factor affecting the rhythmic strength of the music, they may disturb the subsequent calculation, so the weak response points in the initial envelope of the impact signal may be filtered out according to a threshold. For example, a threshold equal to 0.2 times the highest peak of the signal is chosen, and the weak response points in the onset envelope whose values fall below this threshold are removed.
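The thresholding itself is straightforward; a minimal sketch using the 0.2-of-peak threshold mentioned in the text (the example values are invented):

```python
import numpy as np

def threshold_envelope(env, ratio=0.2):
    """Zero out envelope points below ratio * (highest peak)."""
    thr = ratio * env.max()
    return np.where(env >= thr, env, 0.0)

env = np.array([0.05, 0.9, 0.1, 0.5, 0.15, 1.0])
clean = threshold_envelope(env)   # weak responses 0.05, 0.1, 0.15 removed
```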
Step 205, obtaining a velocity spectrogram of the initial envelope according to the initial envelope of the filtered impact signal.
In this step, the autocorrelation velocity spectrogram (tempogram) of the initial envelope can be obtained by calculating a local autocorrelation function of the initial envelope of the impact signal. Local autocorrelation is chosen because the rhythmic feel of a song is likely to vary considerably over the full piece: a globally calculated autocorrelation function cannot accurately characterize the rhythm of the music, whereas a locally calculated one can.
In some embodiments, the obtaining a velocity spectrogram of the start envelope according to the start envelope of the filtered impulse signal includes:
framing the initial envelope of the impact signal after the filtering processing according to a preset time length to obtain a plurality of local segments, and dividing each local segment into a plurality of subframes according to a preset step length;
and inputting a plurality of sub-frames corresponding to each local segment into a local autocorrelation function for calculation so as to obtain an autocorrelation velocity spectrogram of the initial envelope.
For example, the duration of a local segment may be 8.9 s, and the frame step length may be 0.01 s.
The result of framing over time and calculating the local autocorrelation function is a 2-dimensional matrix called the tempogram, which represents the autocorrelation velocity spectrogram of the initial envelope.
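A minimal sketch of the tempogram computation, using the 8.9 s segment duration and 0.01 s step from the example above (so a segment spans 890 envelope samples at a 100 Hz envelope rate); the mean-removal and normalization details are assumptions.

```python
import numpy as np

def tempogram(env, win=890, hop=1):
    """Local autocorrelation of an onset envelope: one row per local
    segment, one column per lag. Rows are normalized by the lag-0 value."""
    rows = []
    for start in range(0, len(env) - win + 1, hop):
        seg = env[start:start + win]
        seg = seg - seg.mean()
        # Full autocorrelation, positive lags only.
        ac = np.correlate(seg, seg, mode="full")[win - 1:]
        rows.append(ac / (ac[0] + 1e-10))
    return np.array(rows)   # the 2-D tempogram matrix

# Toy envelope: an impulse every 100 samples (a perfectly regular beat).
env = np.zeros(2000)
env[::100] = 1.0
T = tempogram(env, win=890, hop=500)
```

For the regular beat, every row shows a strong autocorrelation peak at the beat period (lag 100) and nothing comparable at unrelated lags.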
Step 206, determining the rhythm intensity value of the audio to be detected according to the secondary peak in the autocorrelation velocity spectrogram.
The expression of rhythm is the combined result of the regularity and the intensity of the percussion responses: percussion with a clear, well-defined beat and strong responses gives the music a strong sense of rhythm, while percussion with a muddled beat and weak responses gives a weak sense of rhythm. The embodiment of the invention therefore converts the calculation of rhythm into a calculation of the regularity and intensity of the striking points.
In some embodiments, the rhythm intensity value of the audio to be detected may be determined by obtaining the autocorrelation mean values corresponding to the plurality of local segments in the autocorrelation velocity spectrogram and extracting the secondary peak of those autocorrelation mean values.
For example, take the autocorrelation mean of each local segment in the tempogram and use its secondary peak as the rhythm intensity value of the audio. The rhythm intensity value is a normalized peak value, theoretically in the range 0-1 and in practice generally no more than 0.8. When the value is above 0.2, the audio is perceived subjectively as having a fairly strong rhythm.
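A sketch of this secondary-peak extraction: the tempogram rows are averaged, normalized by the lag-0 value (the primary peak), and the height of the strongest local maximum at a non-zero lag is returned. The peak-picking rule here is an assumption for illustration.

```python
import numpy as np

def rhythm_intensity(tg):
    """Rhythm intensity value from a tempogram (segments x lags):
    the normalized secondary peak of the mean autocorrelation."""
    mean_ac = tg.mean(axis=0)
    mean_ac = mean_ac / (mean_ac[0] + 1e-10)   # lag 0 is the primary peak
    # A secondary peak is a local maximum at a lag > 0.
    interior = mean_ac[1:-1]
    is_peak = (interior > mean_ac[:-2]) & (interior > mean_ac[2:])
    peaks = interior[is_peak]
    return float(peaks.max()) if peaks.size else 0.0

# Toy mean autocorrelation with its secondary peak at lag 3.
tg = np.array([[1.0, 0.2, 0.1, 0.6, 0.1]])
value = rhythm_intensity(tg)
```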
In some embodiments, the results of the local autocorrelation may also be aggregated in other ways, such as taking the maximum of the onset envelope, taking the minimum of the onset envelope, or using other voting strategies. The rhythm intensity value may then be derived from the resulting signal in other ways as well, for example by taking the mean of the top-N peaks, or by normalizing by the maximum peak and using the resulting secondary peak value as the rhythm intensity value of the audio to be detected. In addition, when analyzing the regularity and intensity of the strong impact points in the audio, parameters of the algorithm, such as the window length, step length, number of Mel filters, and cutoff frequency, can be fine-tuned so that the rhythm intensity value of the audio segment is given more accurately.
Step 207, performing audio classification on the audio to be detected according to the rhythm intensity value of the audio to be detected.
For example, music may be divided into a plurality of music types according to different rhythm intensity values, such as light music and DJ music, or into music for walking, music for jogging, and so on. In addition to the music type, the beat points of each piece of music can also be recorded.
Step 208, generating an audio recommendation list according to the audio classification results of the multiple audios to be detected and the current audio application scene.
For example, when the mobile terminal detects the currently running audio application scene, it may measure the user's step frequency through a motion sensor and then, from the audio classification results of the multiple audios to be detected, select the first few pieces of music whose tempo is closest to the user's step frequency as music recommendations for the user.
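A toy sketch of such cadence-based selection; the track list and the (title, tempo) layout are invented for the example.

```python
def recommend(tracks, step_bpm, k=3):
    """Return the k track titles whose stored tempo (BPM) is closest
    to the user's detected step frequency."""
    # tracks: list of (title, tempo_bpm) pairs from the classification results.
    return [t for t, _ in sorted(tracks, key=lambda x: abs(x[1] - step_bpm))[:k]]

tracks = [("A", 90), ("B", 120), ("C", 125), ("D", 170), ("E", 118)]
playlist = recommend(tracks, step_bpm=121)
```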
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
The embodiment of the invention performs audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected, obtains the Mel frequency spectrum of the impact signal, calculates the initial envelope of the impact signal according to the Mel frequency spectrum, filters the initial envelope of the impact signal to remove the signal points whose values are smaller than a threshold, obtains an autocorrelation velocity spectrogram of the initial envelope according to the filtered initial envelope of the impact signal, determines the rhythm intensity value of the audio to be detected according to the secondary peak in the autocorrelation velocity spectrogram, performs audio classification on the audio to be detected according to that rhythm intensity value, and then generates an audio recommendation list according to the audio classification results of the multiple audios to be detected and the current audio application scene. By analyzing the regularity and the intensity of the strong impact points in the audio, the embodiment of the invention gives a rhythm intensity value for the audio segment and measures the rhythm intensity of the audio with an objective value, so that the given rhythm intensity value better matches the auditory perception of the user and can serve as an important feature for music recommendation in various music applications such as a running radio station.
An embodiment of the present invention further provides an audio detection device, as shown in fig. 8 to 11, and fig. 8 to 11 are schematic structural diagrams of an audio detection device provided in an embodiment of the present invention. The audio detection device 30 may include a signal separation module 31, a first acquisition module 32, a calculation module 33, a second acquisition module 35, and a determination module 36.
The signal separation module 31 is configured to perform audio signal separation on an audio frequency to be detected to obtain a harmonic signal and an impact signal of the audio frequency to be detected;
the first obtaining module 32 is configured to obtain a mel frequency spectrum of the impulse signal;
the calculating module 33 is configured to calculate a starting envelope of the impulse signal according to the mel spectrum;
the second obtaining module 35 is configured to obtain an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal;
the determining module 36 is configured to determine a rhythmic sensation intensity value of the audio to be detected according to the secondary peak in the autocorrelation velocity spectrogram.
In some embodiments, as shown in fig. 9, the signal separation module 31 includes:
the transform submodule 311 is configured to perform short-time fourier transform on the audio to be detected according to a preset frame length and a preset step length to obtain a spectrogram of the audio to be detected;
the filtering submodule 312 is configured to perform median filtering in the time direction and the frequency direction of the spectrogram, respectively, so as to obtain a harmonic signal and an impulse signal of the spectrogram, where the harmonic signal is obtained by performing median filtering in the time direction, and the impulse signal is obtained by performing median filtering in the frequency direction.
In some embodiments, as shown in fig. 10, the filtering sub-module 312 includes:
the first filtering unit 3121 is configured to perform first median filtering in a time direction and a frequency direction of the spectrogram, respectively, so as to obtain a first harmonic signal and a first impact signal of the spectrogram;
a removing unit 3122, configured to remove a first harmonic signal in the spectrogram, so as to obtain a target spectrogram formed by the first impact signal;
the second filtering unit 3123 is configured to perform second median filtering in the time direction and the frequency direction of the target spectrogram, respectively, so as to obtain a second harmonic signal and a second impact signal of the target spectrogram, where the second harmonic signal and the second impact signal of the target spectrogram form a harmonic signal and an impact signal of the audio to be detected.
In some embodiments, as shown in fig. 11, the second obtaining module 35 includes:
the framing submodule 351 is configured to perform framing processing on the initial envelope of the impact signal according to a preset duration to obtain a plurality of local segments, and then divide each local segment into a plurality of frames according to a preset step length;
the calculating sub-module 352 is configured to input a plurality of frames corresponding to each local segment into a local autocorrelation function for calculation, so as to obtain an autocorrelation velocity spectrogram of the start envelope.
In some embodiments, the determining module 36 is further configured to determine, as the rhythm intensity value of the audio to be tested, a secondary peak value of an autocorrelation mean value of each of the local segments in the autocorrelation velocity spectrogram.
In the audio detection apparatus 30 provided in the embodiment of the present invention, the signal separation module 31 performs audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected, the first obtaining module 32 obtains the Mel frequency spectrum of the impact signal, the calculating module 33 calculates the initial envelope of the impact signal according to the Mel frequency spectrum, the second obtaining module 35 obtains an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal, and the determining module 36 determines the rhythm intensity value of the audio to be detected according to the secondary peak in the autocorrelation velocity spectrogram. The audio detection apparatus 30 provided by the embodiment of the present invention gives the rhythm intensity value of the audio segment by analyzing the regularity and intensity of the strong impact points in the audio, and measures the rhythm intensity of the audio with an objective value, so that the given rhythm intensity value better matches the auditory perception of the user.
In some embodiments, as shown in fig. 12, fig. 12 is another schematic structural diagram of an audio detecting apparatus according to an embodiment of the present invention. The audio detection device 30 may include a signal separation module 31, a first acquisition module 32, a calculation module 33, a filtering module 34, a second acquisition module 35, a determination module 36, a classification module 37, and a generation module 38.
The signal separation module 31 is configured to perform audio signal separation on an audio frequency to be detected to obtain a harmonic signal and an impact signal of the audio frequency to be detected;
the first obtaining module 32 is configured to obtain a mel frequency spectrum of the impulse signal;
the calculating module 33 is configured to calculate a starting envelope of the impulse signal according to the mel spectrum;
the filtering module 34 is configured to filter the initial envelope of the impact signal to filter out signal points in the initial envelope whose values are smaller than a threshold
The second obtaining module 35 is configured to obtain a velocity spectrogram of the initial envelope according to the initial envelope of the filtered impact signal;
the determining module 36 is configured to determine a rhythmic sensation intensity value of the audio to be detected according to a secondary peak value in the autocorrelation velocity spectrogram;
the classification module 37 is configured to perform audio classification on the audio to be detected according to the rhythm intensity value of the audio to be detected;
the generating module 38 is configured to generate an audio recommendation list according to the audio classification results of multiple audios to be tested and the current audio application scenario.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
In the audio detection apparatus 30 according to the embodiment of the present invention, the signal separation module 31 performs audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected; the first obtaining module 32 obtains the Mel frequency spectrum of the impact signal; the calculating module 33 calculates the initial envelope of the impact signal according to the Mel frequency spectrum; the filtering module 34 filters the initial envelope of the impact signal to remove the signal points whose values are smaller than a threshold; the second obtaining module 35 obtains an autocorrelation velocity spectrogram of the initial envelope according to the filtered initial envelope of the impact signal; the determining module 36 determines the rhythm intensity value of the audio to be detected according to the secondary peak in the autocorrelation velocity spectrogram; the classification module 37 classifies the audio to be detected according to its rhythm intensity value; and the generating module 38 generates an audio recommendation list according to the audio classification results of the multiple audios to be detected and the current audio application scene. The audio detection apparatus 30 according to the embodiment of the present invention measures the rhythm intensity of the audio by analyzing the regularity and intensity of the strong impact points in the audio, so that the rhythm intensity value it provides better matches the auditory perception of the user, and the rhythm intensity can serve as an important feature for music recommendation in various music applications such as a running radio station.
An embodiment of the present invention further provides a server, as shown in fig. 13, which shows a schematic structural diagram of the server according to the embodiment of the present invention, specifically:
the server may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the server architecture shown in FIG. 13 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the server. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the server, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The server further includes a power supply 403 for supplying power to each component, and preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The server may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the server may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 401 in the server loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
carrying out audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected; acquiring a Mel frequency spectrum of the impact signal; calculating a starting envelope of the shock signal according to the Mel frequency spectrum; acquiring an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal; and determining the rhythm intensity value of the audio to be detected according to the secondary peak value in the autocorrelation velocity spectrogram.
The above operations can be specifically referred to the previous embodiments, and are not described herein.
As can be seen from the above, the server provided in this embodiment performs audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected, obtains the Mel spectrum of the impact signal, then calculates the initial envelope of the impact signal according to the Mel spectrum, obtains an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal, and determines the rhythm intensity value of the audio to be detected according to the secondary peak in the autocorrelation velocity spectrogram. By analyzing the regularity and the intensity of the strong impact points in the audio, the embodiment of the invention gives a rhythm intensity value for the audio segment, so that the given rhythm intensity value better matches the auditory perception of the user.
Accordingly, an embodiment of the present invention further provides a terminal, as shown in fig. 14, the terminal may include a Radio Frequency (RF) circuit 501, a memory 502 including one or more computer-readable storage media, an input unit 503, a display unit 504, a sensor 505, an audio circuit 506, a Wireless Fidelity (WiFi) module 507, a processor 508 including one or more processing cores, and a power supply 509. Those skilled in the art will appreciate that the terminal structure shown in fig. 14 is not intended to be limiting and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. Wherein:
the RF circuit 501 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, for receiving downlink information of a base station and then sending the received downlink information to the one or more processors 508 for processing; in addition, data relating to uplink is transmitted to the base station. In general, RF circuit 501 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 501 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 502 may be used to store software programs and modules, and the processor 508 executes various functional applications and data processing by operating the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 502 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 508 and the input unit 503 access to the memory 502.
The input unit 503 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, the input unit 503 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 508, and can receive and execute commands sent by the processor 508. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 503 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 504 may be used to display information input by or provided to the user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 504 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 508 to determine the type of touch event, and then the processor 508 provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 14 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
The terminal may also include at least one sensor 505, such as a light sensor, a motion sensor, or other sensors. Specifically, the light sensor may include an ambient light sensor, which may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor, which may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), and can detect the magnitude and direction of gravity when the mobile phone is stationary; it can be used in applications that recognize the attitude of the mobile phone (such as horizontal/vertical screen switching, related games, and magnetometer attitude calibration) and in vibration-recognition-related functions (such as a pedometer or tap detection). As for other sensors that can be configured in the terminal, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, a detailed description is omitted here.
The audio circuit 506, a speaker, and a microphone may provide an audio interface between the user and the terminal. On one hand, the audio circuit 506 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit 506 and converted into audio data. After being processed by the processor 508, the audio data may be transmitted, for example, to another terminal via the RF circuit 501, or output to the memory 502 for further processing. The audio circuit 506 may also include an earbud jack to provide communication between peripheral headphones and the terminal.
WiFi is a short-range wireless transmission technology. Through the WiFi module 507, the terminal can help the user receive and send e-mails, browse web pages, access streaming media, and the like; it provides wireless broadband Internet access for the user. Although fig. 14 shows the WiFi module 507, it is understood that the module is not an essential component of the terminal and may be omitted as needed without changing the essence of the invention.
The processor 508 is the control center of the terminal. It connects various parts of the entire mobile phone by using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing software programs and/or modules stored in the memory 502 and calling data stored in the memory 502, thereby monitoring the mobile phone as a whole. Optionally, the processor 508 may include one or more processing cores; preferably, the processor 508 may integrate an application processor, which primarily handles the operating system, user interface, application programs, and the like, and a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 508.
The terminal also includes a power supply 509 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically connected to the processor 508 via a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 509 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 508 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 502 according to the following instructions, and the processor 508 runs the application programs stored in the memory 502, thereby implementing various functions:
carrying out audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected; acquiring a Mel frequency spectrum of the impact signal; calculating an initial envelope of the impact signal according to the Mel frequency spectrum; acquiring an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal; and determining the rhythm intensity value of the audio to be detected according to the second-highest peak value in the autocorrelation velocity spectrogram.
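As a rough illustration of these steps, the pipeline can be sketched in Python with NumPy and SciPy. This is a minimal sketch, not the patented implementation: the STFT parameters, the median-filter lengths, and the hard masking rule are assumptions, a plain linear-frequency spectral flux stands in for the Mel-spectrum onset envelope, and the autocorrelation is collapsed to a single segment rather than the per-segment "velocity spectrogram" of the claims.

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft

def rhythm_strength(y, sr):
    """Sketch of the claimed pipeline; all parameter values are illustrative."""
    # 1. STFT magnitude spectrogram: rows = frequency bins, cols = frames.
    _, _, Z = stft(y, fs=sr, nperseg=1024, noverlap=768)
    S = np.abs(Z)
    # 2. Harmonic/impact split by median filtering: smoothing along time
    #    keeps sustained (harmonic) energy, smoothing along frequency keeps
    #    broadband transients (the "impact" signal).
    H = median_filter(S, size=(1, 17))   # along time
    P = median_filter(S, size=(17, 1))   # along frequency
    percussive = S * (P > H)             # hard mask (one possible rule)
    # 3. Onset ("initial") envelope: positive spectral flux of the impact
    #    part, summed over frequency.
    flux = np.diff(percussive, axis=1)
    onset = np.maximum(flux, 0.0).sum(axis=0)
    # 4. Autocorrelation of the envelope, normalised by its lag-0 value.
    e = onset - onset.mean()
    ac = np.correlate(e, e, mode="full")[len(e) - 1:]
    ac = ac / (ac[0] + 1e-12)
    # 5. Second-highest peak: lag 0 is the trivial maximum, so the largest
    #    local peak at a positive lag scores the rhythm strength.
    peaks = [ac[i] for i in range(1, len(ac) - 1)
             if ac[i] > ac[i - 1] and ac[i] >= ac[i + 1]]
    return max(peaks) if peaks else 0.0
```

On a strictly periodic click train this score approaches 1, while unstructured noise yields a much smaller value, matching the intent that regular strong impacts indicate strong rhythm.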
For the specific implementation of the above operations, reference may be made to the previous embodiments; details are not described herein again.
As can be seen from the above, the terminal provided in this embodiment performs audio signal separation on an audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected, acquires a Mel spectrum of the impact signal, calculates an initial envelope of the impact signal according to the Mel spectrum, acquires an autocorrelation velocity spectrogram of the initial envelope, and determines a rhythm intensity value of the audio to be detected according to the second-highest peak value in the autocorrelation velocity spectrogram. The embodiment of the invention gives a rhythm intensity value for an audio segment by analyzing the regularity and the intensity of the strong impact points in the audio, so that the given rhythm intensity value is more consistent with the user's auditory perception.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the audio detection methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
carrying out audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected; acquiring a Mel frequency spectrum of the impact signal; calculating an initial envelope of the impact signal according to the Mel frequency spectrum; acquiring an autocorrelation velocity spectrogram of the initial envelope according to the initial envelope of the impact signal; and determining the rhythm intensity value of the audio to be detected according to the second-highest peak value in the autocorrelation velocity spectrogram.
For the specific implementation of the above operations, reference may be made to the foregoing embodiments; details are not described herein again.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any audio detection method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any audio detection method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The foregoing describes in detail an audio detection method, apparatus, and storage medium provided by the embodiments of the present invention. Specific examples are applied herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (11)

1. A method for audio detection, the method comprising:
carrying out audio signal separation on the audio to be detected to obtain a harmonic signal and an impact signal of the audio to be detected;
acquiring a Mel frequency spectrum of the impact signal;
calculating an initial envelope of the impact signal according to the Mel frequency spectrum;
framing the initial envelope of the impact signal according to a preset time length to obtain a plurality of local segments, and dividing each local segment into a plurality of subframes according to a preset step length;
inputting the plurality of subframes corresponding to each local segment into a local autocorrelation function for calculation, so as to obtain an autocorrelation velocity spectrogram of the initial envelope;
obtaining an autocorrelation mean value of each local segment in the autocorrelation velocity spectrogram;
and extracting the second-highest peak value from the autocorrelation mean value of each local segment, and determining the second-highest peak value as the rhythm intensity value of the audio to be detected.
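The windowed ("local") autocorrelation of claim 1 can be sketched as follows. The claim leaves the "preset time length" and "preset step length" open, so `seg_len`, `hop`, and `win` below are hypothetical choices, and the lag-0 normalisation is likewise an implementation assumption.

```python
import numpy as np

def segment_rhythm_strength(onset_env, seg_len=256, hop=32, win=64):
    """Per-segment rhythm score from a local autocorrelation (sketch)."""
    scores = []
    # Frame the envelope into non-overlapping local segments.
    for s in range(0, len(onset_env) - seg_len + 1, seg_len):
        seg = onset_env[s:s + seg_len]
        # Divide the segment into overlapping subframes of length `win`.
        subs = [seg[i:i + win] for i in range(0, seg_len - win + 1, hop)]
        acs = []
        for sub in subs:
            x = sub - sub.mean()
            ac = np.correlate(x, x, mode="full")[win - 1:]
            acs.append(ac / (ac[0] + 1e-12))   # local autocorrelation
        mean_ac = np.mean(acs, axis=0)          # autocorrelation mean
        # Second-highest peak: skip the trivial lag-0 maximum and take
        # the strongest local peak at a positive lag.
        peaks = [mean_ac[i] for i in range(1, win - 1)
                 if mean_ac[i] > mean_ac[i - 1] and mean_ac[i] >= mean_ac[i + 1]]
        scores.append(max(peaks) if peaks else 0.0)
    return scores
```

A periodic onset envelope (regular beats) gives scores near 1 in every segment; an irregular envelope gives scores near 0, which is the behaviour the claim relies on.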
2. The audio detection method of claim 1, wherein before the framing of the initial envelope of the impact signal according to the preset time length, the audio detection method further comprises:
and filtering the initial envelope of the impact signal to filter out signal points of which the numerical values are smaller than a threshold value in the initial envelope.
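The pre-filtering of claim 2 simply suppresses low-level envelope points. The claim does not specify the threshold; a fraction of the envelope maximum is used below as an assumed rule.

```python
import numpy as np

def denoise_envelope(onset_env, ratio=0.1):
    """Zero out envelope points below a threshold (sketch).

    `ratio` is a hypothetical choice: the threshold here is a fixed
    fraction of the envelope's peak value."""
    thr = ratio * onset_env.max()
    out = onset_env.copy()
    out[out < thr] = 0.0        # filter out sub-threshold signal points
    return out
```

Removing these weak points keeps spurious low-energy onsets from contaminating the autocorrelation in the later steps.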
3. The audio detection method of claim 1, wherein the audio signal separation of the audio to be detected to obtain the harmonic signal and the impact signal of the audio to be detected comprises:
carrying out short-time Fourier transform on the audio to be detected according to a preset frame length and a preset step length to obtain a spectrogram of the audio to be detected;
and respectively performing median filtering along the time direction and the frequency direction of the spectrogram to obtain a harmonic signal and an impact signal of the spectrogram, wherein the harmonic signal is obtained by performing median filtering along the time direction, and the impact signal is obtained by performing median filtering along the frequency direction.
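The median-filtering separation of claim 3 can be illustrated on a magnitude spectrogram: smoothing along time preserves sustained (harmonic) components, while smoothing along frequency preserves broadband transients (the impact signal). The kernel length 17 is an illustrative value, not one taken from the patent.

```python
import numpy as np
from scipy.ndimage import median_filter

def hp_separate(S, kernel=17):
    """Median-filtering separation of a magnitude spectrogram S (freq x time).

    `kernel` is an assumed filter length."""
    harmonic = median_filter(S, size=(1, kernel))   # filter along time
    impact = median_filter(S, size=(kernel, 1))     # filter along frequency
    return harmonic, impact
```

On a toy spectrogram containing one sustained tone (a horizontal line) and one percussive hit (a vertical line), the time-direction filter keeps only the tone and the frequency-direction filter keeps only the hit.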
4. The audio detection method of claim 3, wherein the performing median filtering along a time direction and a frequency direction of the spectrogram to obtain harmonic signals and impulse signals of the spectrogram respectively comprises:
respectively performing first median filtering along the time direction and the frequency direction of the spectrogram to obtain a first harmonic signal and a first impact signal of the spectrogram;
removing a first harmonic signal in the spectrogram to obtain a target spectrogram consisting of the first impact signal;
and respectively performing second median filtering along the time direction and the frequency direction of the target spectrogram to obtain a second harmonic signal and a second impact signal of the target spectrogram, wherein the second harmonic signal and the second impact signal of the target spectrogram form a harmonic signal and an impact signal of the audio to be detected.
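One plausible reading of the cascaded separation in claim 4 is sketched below: the first-pass harmonic estimate is removed with a hard mask (the patent does not fix the masking rule), and the median filtering is then repeated on the residual "target" spectrogram. Kernel length and masking rule are assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def two_stage_hpss(S, kernel=17):
    """Two-pass median-filter separation (sketch of one reading of claim 4)."""
    # First median filtering: first harmonic and first impact estimates.
    h1 = median_filter(S, size=(1, kernel))
    p1 = median_filter(S, size=(kernel, 1))
    # Remove the first harmonic part: keep S only where the impact
    # estimate dominates, giving the "target" spectrogram.
    target = np.where(p1 >= h1, S, 0.0)
    # Second median filtering on the target spectrogram.
    h2 = median_filter(target, size=(1, kernel))
    p2 = median_filter(target, size=(kernel, 1))
    return h2, p2
```

The second pass cleans up harmonic energy that leaks through the first, so the final impact signal is sharper than a single-pass split.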
5. The audio detection method of claim 1, wherein the method further comprises:
according to the rhythm intensity value of the audio to be detected, audio classification is carried out on the audio to be detected;
and generating an audio recommendation list according to the audio classification results of a plurality of audios to be detected and the current audio application scene.
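Claim 5 states only that classification uses the rhythm intensity value and that the recommendation list depends on the application scene; the class boundaries and the scene-to-class mapping below are therefore purely hypothetical.

```python
def classify_rhythm(score, strong=0.6, weak=0.3):
    """Map a rhythm intensity value to a class (hypothetical thresholds)."""
    if score >= strong:
        return "strong-rhythm"
    if score >= weak:
        return "medium-rhythm"
    return "weak-rhythm"

def recommend(tracks, scene):
    """tracks: list of (name, rhythm_score) pairs.

    The scene-to-class mapping is an assumed example; matches are
    ranked by rhythm score, strongest first."""
    prefer = {"workout": "strong-rhythm", "sleep": "weak-rhythm"}.get(
        scene, "medium-rhythm")
    hits = [(n, s) for n, s in tracks if classify_rhythm(s) == prefer]
    return [n for n, s in sorted(hits, key=lambda t: -t[1])]
```

For example, a "workout" scene would surface the strongest-rhythm tracks first, while a "sleep" scene would surface weak-rhythm tracks.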
6. An audio detection apparatus, characterized in that the apparatus comprises:
the signal separation module is used for carrying out audio signal separation on the audio to be detected so as to obtain a harmonic signal and an impact signal of the audio to be detected;
the first acquisition module is used for acquiring a Mel frequency spectrum of the impact signal;
the computing module is used for calculating the initial envelope of the impact signal according to the Mel frequency spectrum;
the framing submodule is used for framing the initial envelope of the impact signal according to a preset time length to obtain a plurality of local segments, and then dividing each local segment into a plurality of subframes according to a preset step length;
the calculation submodule is used for inputting the plurality of subframes corresponding to each local segment into a local autocorrelation function for calculation, so as to obtain an autocorrelation velocity spectrogram of the initial envelope;
the determining module is used for acquiring an autocorrelation mean value of each local segment in the autocorrelation velocity spectrogram;
and extracting the second-highest peak value from the autocorrelation mean value of each local segment, and determining the second-highest peak value as the rhythm intensity value of the audio to be detected.
7. The audio detection apparatus of claim 6, wherein the apparatus further comprises:
and the filtering module is used for filtering the initial envelope of the impact signal so as to filter out the signal points of which the numerical values are smaller than a threshold value in the initial envelope.
8. The audio detection device of claim 6, wherein the signal separation module comprises:
the conversion submodule is used for carrying out short-time Fourier conversion on the audio frequency to be detected according to a preset frame length and a preset step length so as to obtain a spectrogram of the audio frequency to be detected;
and the filtering submodule is used for performing median filtering along the time direction and the frequency direction of the spectrogram respectively to acquire a harmonic signal and an impact signal of the spectrogram, wherein the harmonic signal is a signal obtained by performing median filtering along the time direction, and the impact signal is a signal obtained by performing median filtering along the frequency direction.
9. The audio detection device of claim 8, wherein the filtering sub-module comprises:
the first filtering unit is used for respectively carrying out first median filtering along the time direction and the frequency direction of the spectrogram so as to obtain a first harmonic signal and a first impact signal of the spectrogram;
the removing unit is used for removing the first harmonic signals in the spectrogram to obtain a target spectrogram consisting of the first impact signals;
and the second filtering unit is used for respectively performing second median filtering along the time direction and the frequency direction of the target spectrogram so as to obtain a second harmonic signal and a second impact signal of the target spectrogram, wherein the second harmonic signal and the second impact signal of the target spectrogram form a harmonic signal and an impact signal of the audio to be detected.
10. The audio detection apparatus of claim 6, wherein the apparatus further comprises:
the classification module is used for carrying out audio classification on the audio to be detected according to the rhythm intensity value of the audio to be detected;
and the generating module is used for generating an audio recommendation list according to the audio classification results of a plurality of audios to be detected and the current audio application scene.
11. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the audio detection method according to any one of claims 1 to 5.
CN201811278955.8A 2018-10-30 2018-10-30 Audio detection method, device and storage medium Active CN109256146B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811278955.8A CN109256146B (en) 2018-10-30 2018-10-30 Audio detection method, device and storage medium


Publications (2)

Publication Number Publication Date
CN109256146A CN109256146A (en) 2019-01-22
CN109256146B true CN109256146B (en) 2021-07-06

Family

ID=65044080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811278955.8A Active CN109256146B (en) 2018-10-30 2018-10-30 Audio detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN109256146B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108320730B (en) * 2018-01-09 2020-09-29 广州市百果园信息技术有限公司 Music classification method, beat point detection method, storage device and computer device
CN110070884B (en) * 2019-02-28 2022-03-15 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN110070885B (en) * 2019-02-28 2021-12-24 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN110085214B (en) * 2019-02-28 2021-07-20 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN109978034B (en) * 2019-03-18 2020-12-22 华南理工大学 Sound scene identification method based on data enhancement
CN110070856A (en) * 2019-03-26 2019-07-30 天津大学 A kind of audio scene recognition method based on the enhancing of harmonic wave impulse source mask data
CN110188235A (en) * 2019-05-05 2019-08-30 平安科技(深圳)有限公司 Music style classification method, device, computer equipment and storage medium
CN110278388B (en) * 2019-06-19 2022-02-22 北京字节跳动网络技术有限公司 Display video generation method, device, equipment and storage medium
CN111639225B (en) * 2020-05-22 2023-09-08 腾讯音乐娱乐科技(深圳)有限公司 Audio information detection method, device and storage medium
CN112908289B (en) * 2021-03-10 2023-11-07 百果园技术(新加坡)有限公司 Beat determining method, device, equipment and storage medium
CN113473201A (en) * 2021-07-29 2021-10-01 腾讯音乐娱乐科技(深圳)有限公司 Audio and video alignment method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101375327A (en) * 2006-01-25 2009-02-25 索尼株式会社 Beat extraction device and beat extraction method
US20180005615A1 (en) * 2012-05-23 2018-01-04 Google Inc. Music selection and adaptation for exercising
CN107622774A (en) * 2017-08-09 2018-01-23 金陵科技学院 A kind of music-tempo spectrogram generation method based on match tracing
CN108335703A (en) * 2018-03-28 2018-07-27 腾讯音乐娱乐科技(深圳)有限公司 The method and apparatus for determining the stress position of audio data
CN108364660A (en) * 2018-02-09 2018-08-03 腾讯音乐娱乐科技(深圳)有限公司 Accent identification method, device and computer readable storage medium


Also Published As

Publication number Publication date
CN109256146A (en) 2019-01-22

Similar Documents

Publication Publication Date Title
CN109256146B (en) Audio detection method, device and storage medium
CN109166593B (en) Audio data processing method, device and storage medium
CN110544488B (en) Method and device for separating multi-person voice
CN111210021B (en) Audio signal processing method, model training method and related device
CN107705778B (en) Audio processing method, device, storage medium and terminal
CN103440862B (en) A kind of method of voice and music synthesis, device and equipment
CN106782600B (en) Scoring method and device for audio files
CN112270913B (en) Pitch adjusting method and device and computer storage medium
CN110992963B (en) Network communication method, device, computer equipment and storage medium
CN109872710B (en) Sound effect modulation method, device and storage medium
CN107731241B (en) Method, apparatus and storage medium for processing audio signal
CN107680614B (en) Audio signal processing method, apparatus and storage medium
CN107993672B (en) Frequency band expanding method and device
CN109885162B (en) Vibration method and mobile terminal
CN109616135B (en) Audio processing method, device and storage medium
CN110830368B (en) Instant messaging message sending method and electronic equipment
CN110568926B (en) Sound signal processing method and terminal equipment
CN110097895B (en) Pure music detection method, pure music detection device and storage medium
CN110796918A (en) Training method and device and mobile terminal
CN108492837B (en) Method, device and storage medium for detecting audio burst white noise
CN111522592A (en) Intelligent terminal awakening method and device based on artificial intelligence
CN110599989A (en) Audio processing method, device and storage medium
CN107452361B (en) Song sentence dividing method and device
CN111613246A (en) Audio classification prompting method and related equipment
CN107945777B (en) Audio production method, mobile terminal and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant