CN113823325A - Audio rhythm detection method, device, equipment and medium

Publication number: CN113823325A (application CN202110622026.XA; granted as CN113823325B)
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: audio, rhythm, segment, STFT, detection
Inventor: 冯鑫 (Feng Xin)
Applicant and assignee: Tencent Technology Beijing Co Ltd
Legal status: Granted; Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: G10L25/00 techniques specially adapted for particular use
    • G10L25/03: G10L25/00 techniques characterised by the type of extracted parameters
    • Y02B20/40: Control techniques providing energy savings, e.g. smart controller or presence detection (Y02B: climate change mitigation technologies related to buildings; Y02B20/00: energy efficient lighting technologies)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The application discloses an audio rhythm detection method, apparatus, device and medium. The method comprises the following steps: after an audio signal to be detected is acquired, feature processing is performed on the audio signal based on its audio characteristics to obtain at least one audio segment, and the rhythm type of each audio segment is then determined according to the rhythm information of that segment. Automatic, efficient and accurate music rhythm detection is thereby achieved without manual rhythm labeling, which greatly reduces the cost of music rhythm detection and improves both the efficiency and the accuracy of music editing.

Description

Audio rhythm detection method, device, equipment and medium
Technical Field
The present disclosure relates generally to the field of data processing technologies, more particularly to audio data processing, and specifically to a method, an apparatus, a device and a medium for detecting an audio rhythm.
Background
As demand for editing audio and video grows, so does the need to clip both pure audio and the audio within videos. In the related art, a professional is usually required to listen repeatedly to the audio to be edited and then divide it by rhythm according to his or her own judgment; this consumes a large amount of manpower, and the accuracy varies from person to person and is difficult to unify.
Disclosure of Invention
In view of the above-mentioned drawbacks and deficiencies of the prior art, it is desirable to provide an audio rhythm detection method, apparatus, device and medium, which enable automatic, efficient and accurate music rhythm detection.
In a first aspect, an embodiment of the present application provides an audio rhythm detection method, where the method includes:
acquiring an audio signal to be detected;
performing feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, wherein the rhythms of two adjacent audio segments are different;
and acquiring rhythm information of each audio segment, and determining the rhythm type of each audio segment based on the rhythm information.
In a second aspect, an embodiment of the present application provides an audio rhythm detection apparatus, including:
the acquisition module is used for acquiring an audio signal to be detected;
the segmentation module is used for carrying out feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, and the rhythms of two adjacent audio segments are different;
and the determining module is used for acquiring rhythm information of each audio segment and determining the rhythm type of the audio segment based on the rhythm information.
In a third aspect, embodiments of the present application provide an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the method described in the embodiments of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method as described in the embodiments of the present application.
After the audio signal to be detected is acquired, feature processing is performed on the audio signal based on its audio characteristics to obtain at least one audio segment, and the rhythm type of each audio segment is then determined according to the rhythm information of that segment. Automatic, efficient and accurate music rhythm detection is thereby achieved without manual rhythm labeling, which greatly reduces the cost of music rhythm detection and improves both the efficiency and the accuracy of music editing.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is an implementation environment architecture diagram illustrating an audio tempo detection method according to an embodiment of the present application;
fig. 2 is a flowchart of an audio tempo detection method according to an embodiment of the present application;
fig. 3 is a flowchart of another audio tempo detection method according to an embodiment of the present application;
fig. 4 is a schematic diagram of an STFT feature extraction performed on an audio signal according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an audio STFT energy spectrum of an embodiment of the present application;
fig. 6 is a flowchart of another audio tempo detection method according to an embodiment of the present application;
fig. 7 is a schematic diagram illustrating a principle of audio tempo detection according to an embodiment of the present application;
fig. 8 is a schematic diagram of audio tempo determination for inter-segment detection proposed in an embodiment of the present application;
fig. 9 is a schematic diagram of audio tempo determination detected in a segment as proposed in an embodiment of the present application;
fig. 10 is a flowchart of another audio tempo detection method according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating a method for detecting inter-segment audio tempo according to an embodiment of the present application;
fig. 12 is a schematic diagram illustrating a method for detecting an intra-segment audio rhythm according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an audio rhythm detection apparatus according to an embodiment of the present application;
fig. 14 shows a schematic structural diagram of a computer system suitable for implementing the electronic device or the server according to the embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see": using cameras and computers in place of human eyes to identify, track and measure targets, and further processing the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech has become one of the most promising interaction modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs and the like.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
Automatic driving technology generally includes high-precision maps, environment perception, behavior decision-making, path planning, motion control and other technologies; autonomous driving technology has broad application prospects.
With the research and progress of artificial intelligence technology, it has been explored and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service and the like.
The scheme provided by the embodiment of the application relates to the technologies such as machine learning of artificial intelligence, and the like, and is specifically explained by the following embodiment.
For a clearer description of the present application, the following are explanations of terms of related art:
Short-time Fourier transform (STFT): a Fourier-related transform used to determine the frequency and phase of local sections of a time-varying signal; it is often used as a fundamental audio feature.
Zero-crossing rate (ZCR): the number of times the audio signal crosses zero (its sample amplitude changing from positive to negative or from negative to positive) within each frame.
Chroma feature (Chroma): a tonal feature, also known as a chroma vector. A chroma vector contains 12 elements representing the energy of the 12 pitch classes within a frame, with the energy of the same pitch class in different octaves accumulated.
Mel spectrum (Mel-spectrum): an audio feature obtained by transforming the STFT spectrum onto the Mel frequency scale; its spectrum better reflects human auditory characteristics.
K-means clustering: an iteratively solved clustering analysis algorithm. The data are divided into K groups in advance, K objects are randomly selected as initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to the cluster center nearest to it.
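As a concrete illustration of these terms, the sketch below computes each feature at frame level in Python. The patent names no library or file; the open-source librosa package, the file name "song.wav" and all parameter values are assumptions made purely for illustration.

```python
import librosa
import numpy as np

# Load the audio signal (time-domain samples); "song.wav" is a placeholder.
y, sr = librosa.load("song.wav", sr=22050)

# STFT: complex spectrogram, split into magnitude and phase spectra.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
magnitude, phase = np.abs(stft), np.angle(stft)

# Zero-crossing rate: one value per frame.
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)

# Chroma vector: 12 pitch classes per frame, octaves folded together.
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)

# Mel spectrum: STFT energies warped onto the Mel scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=512)
```

With the same hop length, all of these features are assumed to share the same framing, so each frame of audio yields one column in every feature matrix.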
With the rapid development of the internet, the demand for editing audio and video keeps rising, and so do the requirements placed on the edits. In the related art, labeling is generally performed manually according to the hearing of professionals. Specifically, a large number of professional annotators are hired to listen to the rhythm changes of music at segment-level granularity, mark the boundaries of rhythm-change segments manually according to what their ears perceive, and judge the tempo of each segment by ear.
However, this method relies on manual labeling by ear, so the labeling efficiency is extremely low: a song usually lasts 3-5 minutes, an annotator has to listen to it in full at least once to produce a label, and when higher labeling accuracy is required the annotator often has to listen several more times to determine the rhythm boundaries.
Moreover, different annotators respond differently by ear to the same song, so different annotators produce different rhythm detection results for the same song, which seriously affects the accuracy of the labeling.
Based on this, the application provides an audio rhythm detection method, device, equipment and medium.
Fig. 1 is an implementation environment architecture diagram illustrating an audio tempo detection method according to an embodiment of the present application. As shown in fig. 1, the implementation environment architecture includes: a terminal 100 and a server 200.
The audio rhythm detection device may be the terminal 100 or the server 200. The terminal 100 or the server 200 acquires audio data to be annotated or video data containing the audio data to be annotated.
The audio rhythm detection process may be executed on the terminal 100 or on the server 200. The terminal 100 acquires the video or audio to be detected. When the detection process is executed on the terminal 100, the terminal performs rhythm detection directly after acquiring the video or audio and obtains the rhythm detection result. When the process is executed on the server 200, the terminal 100 sends the video or audio to be detected to the server 200, which receives it, performs rhythm detection on it, and obtains the rhythm detection result.
Optionally, the video or audio to be detected may be pure audio or a video containing audio.
In addition, the terminal 100 may display an application interface through which the video or audio to be detected uploaded by the user may be acquired or transmitted to the server 200.
The type of the terminal 100 includes, but is not limited to, a smart phone, a tablet computer, a television, a notebook computer, a desktop computer, and the like, which is not particularly limited in this embodiment.
When the audio rhythm detection process is executed in the server 200, the server 200 sends the rhythm detection result to the terminal 100, and the terminal 100 displays the rhythm detection result on the application interface. Further, the server 200 may be one server, a server cluster composed of a plurality of servers, or a cloud computing service center.
The terminal 100 establishes a communication connection with the server 200 through a wired or wireless network.
Fig. 2 is a flowchart of an audio rhythm detection method according to an embodiment of the present application. It should be noted that the execution subject of the audio rhythm detection method in this embodiment is an audio rhythm detection apparatus. The apparatus may be implemented in software and/or hardware, and may be configured in an electronic device or in a server that controls the electronic device, the server communicating with the electronic device in order to control it.
The electronic device in this embodiment may include, but is not limited to, a personal computer, a tablet computer, a smart phone, a smart speaker and the like; the electronic device is not particularly limited in this embodiment.
As shown in fig. 2, the method for detecting an audio rhythm provided in the embodiment of the present application includes the following steps:
step 101, acquiring an audio signal to be detected.
An audio signal is a carrier of the frequency and amplitude variation information of the regular sound waves of voice, music and sound effects. According to the characteristics of the sound waves, audio can be classified as regular audio or irregular sound. Regular audio is a continuously varying analog signal that can be represented by a continuous curve called a sound wave. The three elements of sound are pitch, intensity and timbre, and a sound wave (a sine wave) has three important parameters: frequency, amplitude and phase. The audio signal here is the time-domain signal of the audio.
In the embodiment of the present application, the audio signal may be a separate musical composition, or may be background music configured in a video. Further, the audio signal may have different lengths according to different detection requirements, for example, when inter-segment rhythm detection is performed, the audio signal may have a length corresponding to the entire song or a length corresponding to the entire video; when detecting the in-segment rhythm, the audio signal may be a length corresponding to a certain segment of the whole song, or a length corresponding to the cut-out segment of the whole video, or a length of an independent short video, and the like, and the length of the specific audio signal is not limited herein.
And 102, performing characteristic processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, wherein the rhythms of two adjacent audio segments are different.
The audio characteristics are characteristic rules of the audio itself, for example, the audio has loudness, frequency, amplitude, and the like.
It should be noted that the present application performs feature processing on the audio signal according to its audio characteristics and then divides the audio signal into several audio segments according to changes in the speed of the rhythm: each point where the rhythm changes serves as a segment boundary, and the audio signal is cut at every rhythm change, so that the rhythms of two adjacent audio segments are different.
And 103, acquiring rhythm information of each audio segment, and determining the rhythm type of the audio segment based on the rhythm information.
That is to say, after the audio signal is segmented, the rhythm information in the audio segment is further acquired, and the rhythm type of the audio segment is determined according to the rhythm information in the audio segment. The tempo type includes, among others, a relaxed type and a compact type.
Therefore, after the audio signal to be detected is acquired, feature processing is performed on it based on its audio characteristics to obtain at least one audio segment, and the rhythm type of each audio segment is then determined according to the rhythm information of that segment. Automatic, efficient and accurate music rhythm detection is thereby achieved without manual rhythm labeling, which greatly reduces the cost of music rhythm detection and improves both the efficiency and the accuracy of music editing.
In one or more embodiments, as shown in fig. 3, step 102, performing feature processing on an audio signal based on an audio characteristic of the audio signal to obtain at least one audio segment, includes:
step 1021, performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data.
Wherein the audio characteristics include an STFT characteristic, a mel characteristic, and a zero-crossing rate.
In one or more embodiments, performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data includes: performing feature extraction on the audio signal based on the STFT characteristic, the mel characteristic and the zero-crossing rate respectively to obtain the STFT spectrum, the mel spectrum and the zero-crossing information corresponding to the audio signal; splitting the STFT spectrum into an STFT amplitude spectrum and an STFT phase spectrum, further splitting the STFT amplitude spectrum into an STFT energy spectrum and a chroma spectrum; and performing frequency screening on the STFT energy spectrum, the STFT amplitude spectrum and the mel spectrum respectively to obtain an STFT high-frequency energy spectrum, an STFT high-frequency amplitude spectrum and a mel high-frequency spectrum.
As shown in fig. 4, when extracting and splitting features according to the STFT characteristic, the audio signal may first be framed to obtain audio frame information aligned with the timing information, and then windowed and Fourier-transformed to obtain the STFT amplitude spectrum and the STFT phase spectrum. Based on its energy characteristics, the STFT amplitude spectrum can be further split into an STFT energy spectrum and a chroma spectrum.
Further, observation of the STFT energy spectrum and the mel spectrum shows that the low-frequency differences between segments of different rhythm are much smaller than the high-frequency differences, as shown in fig. 5. The application therefore performs high-frequency filtering on the obtained energy features (the STFT energy spectrum, the STFT amplitude spectrum and the mel spectrum): the low-frequency bands, which offer little resolution, are filtered out and only the high-frequency bands are retained, highlighting the feature differences between different rhythm segments.
Optionally, for each feature that needs high-frequency filtering, the top third of the feature channels is kept as the filtered high-frequency feature. High-frequency filtering of the STFT energy spectrum yields the STFT high-frequency energy spectrum, high-frequency filtering of the STFT amplitude spectrum yields the STFT high-frequency amplitude spectrum, and high-frequency filtering of the mel spectrum yields the mel high-frequency spectrum. As a possible embodiment, a windowed filter may also be used for the high-frequency filtering, or any other manner capable of implementing the high-frequency filtering function of the present application; no specific limitation is imposed here.
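A minimal sketch of the channel-selection variant just described, keeping the highest third of the feature channels. The helper name, the (channels, frames) layout, and the use of the squared magnitude as the STFT energy spectrum are assumptions for illustration, not taken from the patent.

```python
import numpy as np

def keep_high_band(feature: np.ndarray) -> np.ndarray:
    """High-frequency screening: keep only the top third of the feature
    channels (axis 0 = frequency channels, axis 1 = frames)."""
    n_channels = feature.shape[0]
    return feature[2 * n_channels // 3:, :]

# Applied to the features from the earlier sketch:
stft_energy_hf = keep_high_band(magnitude ** 2)  # squared magnitude assumed as energy
stft_mag_hf = keep_high_band(magnitude)
mel_hf = keep_high_band(mel)
```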
All of the extracted features are frame-level audio features, and the granularity of frame-level audio features is on the order of milliseconds, so the subsequent audio segmentation can be performed at millisecond level, which greatly improves its accuracy.
And 1022, performing feature splicing processing on the audio feature data to obtain a target feature.
In the embodiment of the application, the audio rhythm detection comprises inter-segment detection and intra-segment detection, wherein the inter-segment detection is a rhythm detection process for dividing the whole song or the whole video into a plurality of audio segments according to the rhythm, and the intra-segment detection is a rhythm detection process for each audio segment which is already divided.
When performing inter-segment detection, performing feature splicing on the audio feature data to obtain the target features includes: selecting three groups of feature data from the zero-crossing information, the chroma spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, splicing each group internally, and taking the three spliced groups as the target features, wherein at least two of the groups each contain at least two kinds of feature data.
That is to say, when performing inter-segment rhythm detection, three groups of feature data can be drawn from the extracted audio feature data and spliced within each group, yielding spliced feature data that provide richer frequency-domain difference information for the subsequent clustering operations. Optionally, the three groups of feature data may be drawn at random.
For example, the STFT high-frequency energy spectrum and the zero-crossing information may be selected as the first feature group and spliced within the group, the STFT high-frequency amplitude spectrum and the chroma spectrum as the second group, and the mel high-frequency spectrum alone as the third group; the single feature in the third group may either be spliced with a zero vector or left unspliced. Alternatively, the mel high-frequency spectrum and the zero-crossing information may form the first group, the STFT high-frequency amplitude spectrum and the chroma spectrum the second group, and the STFT high-frequency energy spectrum and the STFT phase spectrum the third group. A sketch of the first grouping is given below.
It should be understood that in the embodiments of the present application, the purpose of randomly selecting feature data for splicing is to increase the frequency-domain difference information; therefore the features selected for the three groups cannot repeat, and the two features spliced within the same group cannot be the same.
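A sketch of the first grouping described above, using the feature arrays from the earlier sketches. Stacking along the channel axis so that every frame gets one joint feature vector, and the variable names, are assumptions for illustration; the arrays are taken to share the same hop length and hence the same number of frames.

```python
import numpy as np

# Intra-group "splicing": concatenate frame-level features along the channel
# axis; each column (frame) becomes one joint feature vector.
group1 = np.vstack([stft_energy_hf, zcr])   # STFT HF energy + zero-crossing info
group2 = np.vstack([stft_mag_hf, chroma])   # STFT HF amplitude + chroma spectrum
group3 = mel_hf                             # single-feature third group
target_features = [group1, group2, group3]
```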
When performing intra-segment detection, performing feature splicing on the audio feature data to obtain the target features includes: selecting two groups of feature data from the zero-crossing information, the chroma spectrum, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, splicing each group internally, and taking the two spliced groups as the target features, wherein each group contains at least two kinds of feature data.
For example, the STFT high-frequency energy spectrum and the zero-crossing information may be selected as the first feature group and the STFT high-frequency amplitude spectrum and the chroma spectrum as the second group, each spliced within the group. Alternatively, the mel high-frequency spectrum and the zero-crossing information may form the first group and the STFT high-frequency amplitude spectrum and the chroma spectrum the second group.
It should be understood that, as above, the purpose of randomly selecting feature data for splicing is to increase the frequency-domain difference information; therefore the features selected for the two groups cannot repeat, and the two features spliced within the same group cannot be the same.
And 1023, classifying the target characteristics based on a clustering algorithm to obtain a clustering label sequence corresponding to the audio signal.
Optionally, the clustering algorithm may be a binary (two-class) clustering algorithm, such as the K-means algorithm. Other machine-learning clustering algorithms capable of two-class clustering may also be selected to improve clustering precision and reduce the time taken to generate the cluster label sequence; the application is not specifically limited here.
In one or more embodiments, each target feature is classified with a binary clustering algorithm to obtain the feature label sequence corresponding to that target feature, the resulting feature label sequences are combined to obtain a transition label sequence, and the transition label sequence is binarized to obtain the cluster label sequence.
The feature label sequence produced by the binary clustering algorithm is a 0-1 sequence, where 0 indicates that the rhythm of that frame of the audio signal is below a preset threshold (a slow rhythm) and 1 indicates that it is at or above the threshold (a fast rhythm).
Specifically, during inter-segment detection three target features are obtained from the three groups of feature data and classified by the binary clustering algorithm into three 0-1 sequences. The three 0-1 sequences are then combined into a transition label sequence, a 0-3 sequence obtained by adding the label values at corresponding frame positions of the three 0-1 sequences. The transition label sequence is then binarized: frame positions with label value 0 are set to 0, and frame positions with label values 1-3 are set to 1, giving the final two-valued cluster label sequence.
In this way, rhythm detection is realized by binary clustering of multiple frame-level audio features with the K-means algorithm, which can effectively refine audio segment division to frame-level granularity, more accurate than manually marked rhythm boundaries.
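A sketch of this clustering and label-merging step; scikit-learn's KMeans is used here as one standard K-means implementation, and the (channels, frames) feature layout carries over from the earlier sketches. Note that raw K-means cluster ids are arbitrary, so a real implementation would still have to map the faster-tempo cluster to label 1.

```python
import numpy as np
from sklearn.cluster import KMeans

def binary_cluster(feature: np.ndarray) -> np.ndarray:
    """Two-class K-means over per-frame feature vectors -> one 0/1 label per frame."""
    frames = feature.T                                      # (frames, channels)
    return KMeans(n_clusters=2, n_init=10).fit_predict(frames)

label_seqs = [binary_cluster(f) for f in target_features]  # three 0-1 sequences
transition = np.sum(label_seqs, axis=0)                    # 0-3 sequence
cluster_labels = (transition >= 1).astype(int)             # binarize: 1-3 -> 1
```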
And 1024, obtaining at least one audio segment based on the continuous condition of the clustering label sequence.
Here, consecutive labels of the same category represent one audio segment. For example, when the cluster label sequence is 000000011100, "0000000" is the first audio segment, "111" is the second and "00" is the third.
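Before anomalies are considered, turning a label sequence into segments is just a matter of grouping consecutive identical labels; a sketch (the helper name is an assumption):

```python
from itertools import groupby

def label_runs(labels):
    """Split a 0/1 label sequence into (label, start_frame, end_frame) runs;
    each run corresponds to one candidate audio segment."""
    runs, start = [], 0
    for value, group in groupby(labels):
        length = len(list(group))
        runs.append((value, start, start + length))
        start += length
    return runs

# label_runs("000000011100") -> [('0', 0, 7), ('1', 7, 10), ('0', 10, 12)]
```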
However, owing to peculiarities of the musical piece itself and to bias in the clustering, a labeling error such as 000000100000 is easily produced; when a "1" in the sequence is obviously anomalous like this, it needs to be eliminated so that the whole audio segment remains continuous.
In one or more embodiments, the cluster label sequence is smoothed to obtain a binary class curve describing the audio signal, and the stretch of the audio signal continuously corresponding to either class label in the binary class curve is taken as one audio segment.
Smoothing the cluster label sequence to obtain a binary class curve describing the audio signal includes: counting the length of each run of consecutive identical labels in the cluster label sequence, determining the sliding-window length and the anomaly thresholds used to eliminate outliers according to those run lengths, sliding the window over the cluster label sequence, and correcting anomalous label values according to the anomaly thresholds to obtain the binary class curve. The anomaly thresholds comprise a first anomaly threshold on the ratio of run lengths and a second anomaly threshold on the ratio of the label-value sum to the length of the sliding window.
For example, in 000000011110000111111 the two runs of label category "0" have lengths 7 and 4 while the two runs of label category "1" have lengths 4 and 6, so the computed multiples are 1.75 and 1.5 respectively. This multiple is computed over the whole audio signal. Preferably, the first anomaly threshold is 3; that is, when the length multiple between any two runs is greater than or equal to 3, the two runs are judged anomalous. A sliding window is then slid over the cluster label sequence and the sum of the label values in the window is compared with the window length. If the sum is less than 1/10 of the length (i.e. fewer than 1/10 of the frames in the window have label value 1), the label values in the window are all set to 0; if the sum is greater than 9/10 of the length, they are all set to 1; otherwise the label values in the window are kept unchanged. This yields the binary class curve. The second anomaly threshold thus comprises the lower limit 1/10 and the upper limit 9/10.
In this way, combining and smoothing multiple binary-clustering labels provides a certain fault tolerance and improves the generalization ability of the audio segmentation detection.
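A sketch of the sliding-window correction with the 1/10 and 9/10 limits quoted above; how the window length `win` is derived from the run-length statistics is left outside the sketch, so treat it as a given parameter.

```python
import numpy as np

def smooth_labels(labels: np.ndarray, win: int) -> np.ndarray:
    """Snap near-uniform windows of a 0/1 label sequence to a single value
    (lower limit 1/10, upper limit 9/10 of the window length)."""
    out = labels.copy()
    for i in range(len(labels) - win + 1):
        s = out[i:i + win].sum()
        if s < win / 10:            # almost all 0 -> set the window to 0
            out[i:i + win] = 0
        elif s > 9 * win / 10:      # almost all 1 -> set the window to 1
            out[i:i + win] = 1
    return out

smoothed = smooth_labels(cluster_labels, win=20)  # win=20 is a placeholder
```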
Further, as shown in fig. 6 and 7, step 103, acquiring the rhythm information of each audio segment and determining the rhythm type of the audio segment based on the rhythm information, includes:
and step 1031, acquiring the number of the heavy rhythm points in each audio segment.
A heavy rhythm point is a peak of the audio signal, the point of highest energy within its local region.
And step 1032, determining the rhythm density corresponding to the audio segment according to the time length of each audio segment and the number of the heavy rhythm points.
Wherein the tempo density may be a ratio of the number of heavy tempo points to the time length of the audio segment.
And 1033, determining the rhythm type of the audio segment according to the rhythm density corresponding to the audio segment.
Optionally, a tempo-density threshold may be preset: if the tempo density of an audio segment is below the threshold, the segment is judged to be of the relaxed type; if it is at or above the threshold, the segment is judged to be of the compact type.
Because heavy rhythm points are rhythm information inherent to the audio, using them to determine the tempo density better expresses the actual rhythm of the audio and improves the accuracy of the rhythm-type judgment.
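A sketch of steps 1031-1033; the threshold value is a placeholder, since the patent fixes no number.

```python
def tempo_type(n_beats: int, duration_s: float, density_thresh: float = 2.0) -> str:
    """Classify one audio segment by tempo density = heavy-rhythm-point count
    divided by segment duration in seconds. density_thresh is a placeholder."""
    density = n_beats / duration_s
    return "compact" if density >= density_thresh else "relaxed"

# e.g. a 29.17 s segment with 48 heavy rhythm points has density ~1.65
```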
During inter-segment detection, the method further includes: for each audio segment, comparing its tempo density with that of the previous audio segment and determining the rhythm-change type of the segment from the comparison result.
That is, as shown in fig. 8, after dividing the audio into segments, the application further compares the tempo densities of adjacent segments to determine how the rhythm of the later segment changes relative to the earlier one (slowing down or speeding up), expressing the trend as "slow" or "fast".
During intra-segment detection, the method further includes: acquiring the tempo densities of two adjacent audio segments, computing the absolute value of their difference, and determining the rhythm-change trend of the two segments from that absolute difference and a preset trend threshold.
That is, as shown in fig. 9, during intra-segment detection the audio signal within a segment is divided again into audio sub-segments, and the absolute difference between the tempo densities of two adjacent sub-segments (i.e. the magnitude of the density change) is computed. If the absolute difference is below the preset threshold, the rhythm is judged unchanged and the pair is marked "norm"; if it is at or above the threshold, the densities of the two sub-segments are further compared to determine whether the rhythm of the later sub-segment slows down or speeds up relative to the earlier one.
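A sketch of this rule; `eps` stands for the preset trend threshold, whose value the patent leaves open.

```python
def tempo_trend(prev_density: float, next_density: float, eps: float) -> str:
    """Label the rhythm change from one audio sub-segment to the next:
    'norm' if the density change is below eps, else 'slow' or 'fast'."""
    if abs(next_density - prev_density) < eps:
        return "norm"
    return "fast" if next_density > prev_density else "slow"

# e.g. tempo_trend(5.72, 7.73, eps=1.0) -> "fast"
```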
Thus, by providing a two-stage rhythm detection mechanism, rhythm changes between major audio paragraphs can be detected, and rhythm changes within each major paragraph can be detected at finer granularity. The two-stage detection results provide richer, finer-grained rhythm-change information for audio clips, especially the audio in video clips, and more reference information for intelligent soundtracking in video editing.
A complete embodiment is described below with reference to figs. 10 to 12.
Step 201, acquiring a complete audio signal to be detected.
Step 202, performing a first feature extraction on the audio signal based on the audio characteristic to obtain audio feature data.
The audio characteristics include the STFT characteristic, the mel characteristic and the zero-crossing rate. As shown in fig. 11, audio feature data are extracted based on these characteristics; the STFT spectrum is further split to obtain the STFT phase spectrum, the STFT amplitude spectrum, the STFT energy spectrum and the chroma spectrum, and the STFT amplitude spectrum, the STFT energy spectrum and the mel spectrum are passed through a high-frequency filter, finally yielding the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum.
Step 203, randomly extract three feature groups from the audio feature data and splice each group internally, where two of the groups contain two kinds of audio feature data and one group contains one kind.
For example, as shown in fig. 11, the STFT high-frequency energy spectrum and the zero-crossing information are extracted as the first group of audio feature data and spliced within the group, the STFT high-frequency amplitude spectrum and the chroma spectrum are extracted as the second group and spliced within the group, and the mel high-frequency spectrum is extracted as the third group.
And 204, clustering the three groups of spliced characteristic data respectively to obtain a characteristic label sequence corresponding to each group of characteristic data.
And step 205, adding the tag values of the three groups of characteristic tag sequences according to the corresponding positions to obtain a transition tag sequence.
And step 206, carrying out binarization processing on the transition label sequence to obtain a clustering label sequence.
Step 207, counting the length of each continuous label sequence in the cluster labels.
And step 208, determining the length of a sliding window and an abnormal threshold value for rejecting abnormal values according to the length of the continuous label sequence.
The abnormal threshold comprises a first abnormal threshold used for judging the abnormal length ratio and a second abnormal threshold used for determining the value of the abnormal label in the sliding window, wherein the second abnormal threshold comprises an abnormal lower limit value and an abnormal upper limit value.
And step 209, sliding on the cluster label sequence by using a sliding window, and correcting the labels according to the accumulated value of the labels in the sliding window.
If the accumulated label value is smaller than the lower limit of the second anomaly threshold (1/10 of the sliding-window length), all labels in the current window are set to 0; if it is greater than the upper limit (9/10 of the sliding-window length), all labels in the current window are set to 1.
Step 210, take the audio signal corresponding to each run of identical label values as one audio segment.
In step 211, the number of heavy tempo points in each audio segment is obtained.
And step 212, determining the rhythm density corresponding to the audio segment according to the time length of each audio segment and the number of the heavy rhythm points.
And step 213, determining the rhythm type of the audio segment according to the rhythm density corresponding to the audio segment.
If the tempo density is below the preset threshold the audio segment is judged to be of the relaxed type, and if it is at or above the threshold the segment is judged to be of the compact type.
Step 214, for any audio segment, performing a second feature extraction on the audio signal based on the audio characteristics to obtain audio feature data.
Step 215, randomly extracting two groups of characteristics from the audio characteristic data and splicing the two groups of characteristics in a group, wherein each group of characteristics comprises two types of audio characteristic data.
For example, as shown in fig. 12, the mel high-frequency spectrum and the zero-crossing information are extracted as the first group of audio feature data and spliced within the group, and the STFT phase spectrum and the chroma spectrum are extracted as the second group and spliced within the group.
And step 216, clustering the two groups of spliced characteristic data respectively to obtain a characteristic label sequence corresponding to each group of characteristic data.
And 217, adding the tag values of the two groups of characteristic tag sequences according to corresponding positions to obtain a transition tag sequence corresponding to the audio segment.
And step 218, performing binarization processing on the transition tag sequence corresponding to the audio segment to obtain a clustering tag sequence corresponding to the audio segment.
Step 219, counting the length of each continuous label sequence in the cluster label corresponding to the audio segment.
Step 220, according to the length of the continuous label sequence, determining the length of a sliding window and an abnormal threshold value for eliminating abnormal values.
The abnormal threshold comprises a first abnormal threshold used for judging the abnormal length ratio and a second abnormal threshold used for determining the value of the abnormal label in the sliding window, wherein the second abnormal threshold comprises an abnormal lower limit value and an abnormal upper limit value.
And step 221, sliding the cluster label sequence by using a sliding window, and correcting the labels according to the accumulated value of the labels in the sliding window.
If the accumulated label value is smaller than the lower limit of the second anomaly threshold (1/10 of the sliding-window length), all labels in the current window are set to 0; if it is greater than the upper limit (9/10 of the sliding-window length), all labels in the current window are set to 1.
Step 222, take the audio signal corresponding to each run of identical label values as one audio sub-segment.
Step 223, obtaining the number of the heavy rhythm points in each audio sub-segment.
And step 224, determining the tempo density corresponding to each audio sub-segment according to its time length and its number of heavy rhythm points.
And step 225, respectively acquiring the rhythm densities of two adjacent audio subsections, and calculating the absolute value of the difference of the two rhythm densities.
Step 226, determining the rhythm variation trend of the two audio sub-segments according to the absolute value of the difference and a preset trend threshold.
If the absolute difference is smaller than the preset trend threshold, it is determined that the rhythm of the two audio sub-segments has not changed. If the absolute difference is greater than or equal to the preset trend threshold, it is determined that the rhythm has changed, and the tempo densities of the later and earlier sub-segments are further compared: if the density of the later sub-segment is smaller than that of the earlier one, the rhythm slows down, and if it is larger, the rhythm speeds up.
Detecting a certain piece of audio with the above scheme yields the following label sequences:
Inter-segment rhythm conversion: [ 'slow', 'fast', 'slow', ]
Segment time intervals: [[0.0, 29.174648526077096], [29.174648526077096, 76.87804988662131], [76.87804988662131, 89.8109977324263], [89.8109977324263, 98.46009070294785], [98.46009070294785, 104.17197278911564], [104.17197278911564, 151.87537414965988], [151.87537414965988, 163.12496598639456]]
Tempo densities of the segments: [1.645264036586295, 5.9953796132565165, 1.0825064917076215, 3.0060955626925314, 1.5756627787879056, 5.827676687011578, 0.266676341998803]
Intra-segment rhythm conversion: [ [ 'fast', 'slow', 'fast', 'slow', 'fast', 'slow', 'norm', 'slow', 'fast', 'slow', 'slow' ], 'slow' ]
Audio sub-segment time intervals: [[[0.0, 2.6585714285714284], [2.6585714285714284, 14.448004535147392], [14.448004535147392, 29.168843537414965]], [[29.174648526077096, 69.36668934240363], [69.36668934240363, 76.48331065759638], [76.48331065759638, 76.87224489795918]], [[76.87804988662131, 89.80519274376417]], [[89.8109977324263, 98.45428571428572]], [[98.46009070294785, 104.16616780045351]], [[104.17197278911564, 144.3872335600907], [144.3872335600907, 151.41097505668935], [151.41097505668935, 151.86956916099774]], [[151.87537414965988, 153.45426303854876], [153.45426303854876, 163.1191836734694]]]
Tempo densities of the audio sub-segments: [[0.7522837184309512, 1.526790969275688, 1.9020655001856164], [5.722526035716279, 7.728386486236747, 2.5711287313433573], [1.0829925958669468], [3.0081145108862453], [1.5772657547747198], [5.495426257673417, 7.830584315586568, 4.361155063291147], [0.6333567909922612, 0.20693392895268442]]
In summary, after the audio signal to be detected is acquired, feature processing is performed on it based on its audio characteristics to obtain at least one audio segment, and the rhythm type of each audio segment is then determined according to the rhythm information of that segment. Automatic, efficient and accurate music rhythm detection is thereby achieved without manual rhythm labeling, which greatly reduces the cost of music rhythm detection and improves both the efficiency and the accuracy of music editing.
It should be noted that while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results.
With further reference to fig. 13, there is shown an exemplary block diagram of an audio tempo detection apparatus according to an embodiment of the present application.
As shown, the audio tempo detection apparatus 10 includes:
the acquisition module 11 is configured to acquire an audio signal to be detected;
the segmentation module 12 is configured to perform feature processing on the audio signal based on audio characteristics of the audio signal to obtain at least one audio segment, where rhythms of two adjacent audio segments are different;
and the comparison module 13 is configured to obtain rhythm information of each audio segment, and determine a rhythm type of the audio segment based on the rhythm information.
In some embodiments, the segmentation module 12 is further configured to: performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data;
performing feature splicing processing on the audio feature data to obtain target features;
classifying the target features based on a clustering algorithm to obtain a clustering label sequence corresponding to the audio signal;
and obtaining at least one audio segment based on the continuity condition of the clustering label sequence.
In some embodiments, the audio characteristics include an STFT characteristic, a mel characteristic, and a zero-crossing rate, and the segmentation module 12 is further configured to:
respectively extracting the features of the audio signal based on the STFT characteristic, the mel characteristic and the zero crossing rate to obtain an STFT frequency spectrum, a mel frequency spectrum and zero crossing information corresponding to the audio signal;
performing characteristic resolution on the STFT spectrum to obtain an STFT energy spectrum, an STFT phase spectrum and an acoustic chromatogram;
and respectively carrying out frequency screening on the STFT energy spectrum, the STFT amplitude spectrum and the mel frequency spectrum to obtain an STFT high-frequency energy spectrum, an STFT high-frequency amplitude spectrum and a mel high-frequency spectrum.
In some embodiments, the audio tempo detection includes inter-segment detection and intra-segment detection, and in the inter-segment detection, the segmentation module 12 is further configured to: and selecting three groups of characteristic data from the zero-crossing information, the sound chromatogram, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum to perform in-group splicing respectively, and taking the three groups of spliced characteristics as the target characteristics, wherein at least two groups of characteristic data respectively comprise at least two types of characteristic data.
In some embodiments, the audio rhythm detection includes inter-segment detection and intra-segment detection, and when performing intra-segment detection, the segmentation module 12 is further configured to: select two groups of feature data from the zero-crossing information, the chromagram, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, perform in-group splicing on each group, and take the two groups of spliced features as the target features, wherein each group of feature data comprises at least two types of feature data.
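One possible grouping is sketched below; which features land in which group is an assumption made for illustration, since the concrete grouping is left open above.

```python
# Illustrative in-group splicing; the assignment of features to groups is assumed.
import numpy as np

def splice_groups(zcr, chroma, phase, hf_energy, hf_magnitude, hf_mel, inter_segment=True):
    if inter_segment:     # three groups for inter-segment detection
        groups = [(hf_energy, hf_magnitude), (chroma, phase), (hf_mel, zcr)]
    else:                 # two groups for intra-segment detection
        groups = [(hf_energy, hf_mel), (chroma, zcr)]
    # splice within each group by stacking features along the feature axis
    return [np.vstack(g).T for g in groups]   # each: (n_frames, n_dims)
```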
In some embodiments, the segmentation module 12 is further configured to: classify each target feature by using a two-class clustering algorithm to obtain a feature label sequence corresponding to each target feature;
combine the obtained feature label sequences to obtain a transition label sequence;
and perform binarization processing on the transition label sequence to obtain the cluster label sequence.
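A sketch of this step follows, using k-means with two classes as the two-class clustering algorithm; averaging the per-feature label sequences before thresholding is one plausible reading of the combining step and is labeled as an assumption.

```python
# Sketch of two-class clustering per target feature, combination and binarization.
import numpy as np
from sklearn.cluster import KMeans

def cluster_label_sequence(target_features):
    # one feature label sequence per target feature
    label_seqs = [KMeans(n_clusters=2, n_init=10).fit_predict(f) for f in target_features]
    transition = np.mean(label_seqs, axis=0)     # transition label sequence (assumed: mean)
    return (transition >= 0.5).astype(int)       # binarized cluster label sequence
```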
In some embodiments, the segmentation module 12 is further configured to: smooth the cluster label sequence to obtain a binary class curve describing the audio signal;
and take the portion of the audio signal continuously corresponding to one class label in the binary class curve as one audio segment.
In some embodiments, the segmentation module 12 is further configured to: count the length of each continuous run of identical labels in the cluster label sequence;
determine a sliding window length and an abnormality threshold for eliminating abnormal values according to the lengths of the continuous runs;
and slide a window over the cluster label sequence and correct abnormal label values according to the abnormality threshold to obtain the binary class curve.
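The sketch below illustrates one way to realize this smoothing; deriving the window length from the median run length and fixing the abnormality threshold at 0.5 are heuristic assumptions.

```python
# Sketch of sliding-window smoothing of the cluster label sequence.
import numpy as np

def smooth_labels(labels, threshold=0.5):
    labels = np.asarray(labels)
    # lengths of each maximal run of identical labels
    runs = np.diff(np.flatnonzero(np.r_[True, labels[1:] != labels[:-1], True]))
    win = max(3, int(np.median(runs)) | 1)       # odd window from the typical run length
    half = win // 2
    padded = np.pad(labels, half, mode="edge")
    # replace each label by the thresholded window mean, removing outlier labels
    return np.array([1 if padded[i:i + win].mean() > threshold else 0
                     for i in range(labels.size)])   # binary class curve
```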
In some embodiments, the comparison module 13 is further configured to: acquire the number of strong-beat points in each audio segment;
determine the rhythm density of each audio segment according to the duration of the audio segment and the number of strong-beat points;
and determine the rhythm type of the audio segment according to its rhythm density.
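A sketch of the density computation follows; librosa onset peaks stand in for the strong-beat points, and the density cut-offs that map to rhythm types are assumed placeholders.

```python
# Sketch of rhythm density and rhythm type per segment (thresholds assumed).
import librosa

def rhythm_type(y_segment, sr, fast_density=2.0, slow_density=0.8):
    beats = librosa.onset.onset_detect(y=y_segment, sr=sr, units="time")
    duration = len(y_segment) / sr
    density = len(beats) / duration              # strong-beat points per second
    if density >= fast_density:
        return "fast", density
    if density <= slow_density:
        return "slow", density
    return "medium", density
```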
In some embodiments, the audio rhythm detection includes inter-segment detection and intra-segment detection, and when performing inter-segment detection, the comparison module 13 is further configured to: compare, for each audio segment, the rhythm density of the audio segment with the rhythm density of the previous audio segment;
and determine the rhythm change type of the audio segment according to the comparison result.
In some embodiments, the audio rhythm detection includes inter-segment detection and intra-segment detection, and when performing intra-segment detection, the comparison module 13 is further configured to: respectively acquire the rhythm densities of two adjacent audio segments and calculate the absolute value of the difference between the two rhythm densities;
and determine the rhythm variation trend of the two audio segments according to the absolute value of the difference and a preset trend threshold.
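The two comparisons can be sketched as follows; the concrete trend threshold value is an assumption, since only "a preset trend threshold" is stated above.

```python
# Sketch of inter-segment comparison and intra-segment trend detection.
def rhythm_change(prev_density: float, cur_density: float) -> str:
    # inter-segment detection: compare a segment with the previous one
    if cur_density > prev_density:
        return "speeds up"
    if cur_density < prev_density:
        return "slows down"
    return "unchanged"

def rhythm_trend(density_a: float, density_b: float, trend_threshold: float = 0.3) -> str:
    # intra-segment detection: absolute density difference against a preset threshold
    return "significant change" if abs(density_a - density_b) > trend_threshold else "steady"
```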
It should be understood that the units or modules recited in the audio rhythm detection apparatus 10 correspond to the individual steps of the method described with reference to fig. 2. Thus, the operations and features described above for the method are equally applicable to the audio rhythm detection apparatus 10 and the units comprised therein and are not described in further detail here. The audio rhythm detection apparatus 10 may be implemented in a browser or other applications of the electronic device in advance, or may be loaded into such applications by downloading or the like. Corresponding units in the audio rhythm detection apparatus 10 may cooperate with units in the electronic device to implement the aspects of the embodiments of the present application.
The division into several modules or units mentioned in the above detailed description is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided among a plurality of modules or units.
Referring now to fig. 14, which illustrates a schematic diagram of a computer system suitable for implementing an electronic device or server of an embodiment of the present application.
as shown in fig. 14, the computer system 1400 includes a Central Processing Unit (CPU)1401, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)1402 or a program loaded from a storage portion 1408 into a Random Access Memory (RAM) 1403. In the RAM1403, various programs and data necessary for operation instructions of the system are also stored. The CPU1401, ROM1402, and RAM1403 are connected to each other via a bus 1404. An input/output (I/O) interface 1405 is also connected to bus 1404.
The following components are connected to the I/O interface 1405: an input portion 1406 including a keyboard, a mouse and the like; an output portion 1407 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker and the like; a storage portion 1408 including a hard disk and the like; and a communication portion 1409 including a network interface card such as a LAN card or a modem. The communication portion 1409 performs communication processing via a network such as the Internet. A drive 1410 is also connected to the I/O interface 1405 as necessary. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 1410 as necessary, so that a computer program read therefrom is installed into the storage portion 1408 as needed.
In particular, according to embodiments of the present application, the process described above with reference to the flowchart of fig. 2 may be implemented as a computer software program. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1409 and/or installed from the removable medium 1411. When the computer program is executed by the Central Processing Unit (CPU) 1401, the above-described functions defined in the system of the present application are executed.
It should be noted that the computer readable medium described in the present application may be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate or transport a program for use by or in connection with an instruction execution system, apparatus or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment or portion of code that comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks therein, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or by hardware. The described units or modules may also be provided in a processor, and may be described as: a processor comprising an acquisition module, a segmentation module and a determination module. The names of these units or modules do not in some cases constitute a limitation on the unit or module itself; for example, the acquisition module may also be described as "a module for acquiring an audio signal to be detected".
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium stores one or more programs which, when executed by one or more processors, perform the audio tempo detection methods described herein.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (14)

1. A method of audio rhythm detection, the method comprising:
acquiring an audio signal to be detected;
performing feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, wherein the rhythms of two adjacent audio segments are different;
and acquiring rhythm information of each audio segment, and determining the rhythm type of each audio segment based on the rhythm information.
2. The method of claim 1, wherein the characterizing the audio signal based on audio characteristics of the audio signal to obtain at least one audio segment comprises:
performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data;
performing feature splicing processing on the audio feature data to obtain target features;
classifying the target features based on a clustering algorithm to obtain a cluster label sequence corresponding to the audio signal;
and obtaining at least one audio segment based on the continuity of the cluster label sequence.
3. The method of claim 2, wherein the audio characteristics comprise an STFT characteristic, a mel characteristic and a zero-crossing rate, and wherein the performing feature extraction on the audio signal based on the audio characteristics to obtain audio feature data comprises:
respectively performing feature extraction on the audio signal based on the STFT characteristic, the mel characteristic and the zero-crossing rate to obtain an STFT spectrum, a mel spectrum and zero-crossing information corresponding to the audio signal;
performing feature decomposition on the STFT spectrum to obtain an STFT energy spectrum, an STFT amplitude spectrum, an STFT phase spectrum and a chromagram;
and respectively performing frequency screening on the STFT energy spectrum, the STFT amplitude spectrum and the mel spectrum to obtain an STFT high-frequency energy spectrum, an STFT high-frequency amplitude spectrum and a mel high-frequency spectrum.
4. The method according to claim 3, wherein the audio rhythm detection includes inter-segment detection and intra-segment detection, and when inter-segment detection is performed, the performing feature splicing processing on the audio feature data to obtain a target feature includes:
and selecting three groups of feature data from the zero-crossing information, the chromagram, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, performing in-group splicing on each group, and taking the three groups of spliced features as the target features, wherein at least two of the groups each comprise at least two types of feature data.
5. The method according to claim 3, wherein the audio rhythm detection includes inter-segment detection and intra-segment detection, and when performing intra-segment detection, performing feature concatenation processing on the audio feature data to obtain a target feature includes:
and selecting two groups of feature data from the zero-crossing information, the chromagram, the STFT phase spectrum, the STFT high-frequency energy spectrum, the STFT high-frequency amplitude spectrum and the mel high-frequency spectrum, performing in-group splicing on each group, and taking the two groups of spliced features as the target features, wherein each group of feature data comprises at least two types of feature data.
6. The method of claim 2, wherein the classifying the target features based on a clustering algorithm to obtain a cluster label sequence corresponding to the audio signal comprises:
classifying each target feature by using a two-class clustering algorithm to obtain a feature label sequence corresponding to each target feature;
combining the obtained feature label sequences to obtain a transition label sequence;
and performing binarization processing on the transition label sequence to obtain the cluster label sequence.
7. The method of claim 2, wherein the obtaining at least one audio segment based on the continuity of the cluster label sequence comprises:
smoothing the cluster label sequence to obtain a binary class curve describing the audio signal;
and taking the portion of the audio signal continuously corresponding to one class label in the binary class curve as one audio segment.
8. The method of claim 7, wherein the smoothing the cluster label sequence to obtain a binary class curve describing the audio signal comprises:
counting the length of each continuous run of identical labels in the cluster label sequence;
determining a sliding window length and an abnormality threshold for eliminating abnormal values according to the lengths of the continuous runs;
and sliding a window over the cluster label sequence and correcting abnormal label values according to the abnormality threshold to obtain the binary class curve.
9. The method of claim 1, wherein the obtaining rhythm information of each audio segment and determining a rhythm type of the audio segment based on the rhythm information comprises:
acquiring the number of strong-beat points in each audio segment;
determining the rhythm density of each audio segment according to the duration of the audio segment and the number of strong-beat points;
and determining the rhythm type of the audio segment according to its rhythm density.
10. The method of claim 9, wherein the audio rhythm detection comprises inter-segment detection and intra-segment detection, and when performing inter-segment detection, the method further comprises:
for each of the audio segments, comparing the rhythm density of the audio segment with the rhythm density of the previous audio segment;
and determining the rhythm change type of the audio segment according to the comparison result.
11. The method of claim 9, wherein the audio rhythm detection comprises inter-segment detection and intra-segment detection, and when performing intra-segment detection, the method further comprises:
respectively acquiring rhythm densities of two adjacent audio segments, and calculating an absolute value of a difference between the two rhythm densities;
and determining the rhythm variation trends of the two audio segments according to the absolute value of the difference and a preset trend threshold.
12. An audio rhythm detection apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an audio signal to be detected;
the segmentation module is used for carrying out feature processing on the audio signal based on the audio characteristics of the audio signal to obtain at least one audio segment, and the rhythms of two adjacent audio segments are different;
and the determining module is used for acquiring rhythm information of each audio segment and determining the rhythm type of the audio segment based on the rhythm information.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the audio rhythm detection method according to any of claims 1-11.
14. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the audio rhythm detection method according to any of claims 1-11.
CN202110622026.XA 2021-06-03 2021-06-03 Audio rhythm detection method, device, equipment and medium Active CN113823325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622026.XA CN113823325B (en) 2021-06-03 2021-06-03 Audio rhythm detection method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113823325A true CN113823325A (en) 2021-12-21
CN113823325B CN113823325B (en) 2024-08-16

Family ID: 78923804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622026.XA Active CN113823325B (en) 2021-06-03 2021-06-03 Audio rhythm detection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113823325B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115101094A (en) * 2022-06-20 2022-09-23 北京达佳互联信息技术有限公司 Audio processing method and device, electronic equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001175253A (en) * 1999-12-16 2001-06-29 Nippon Columbia Co Ltd Operating device for video editing device, and video editing device
CN102347022A (en) * 2010-08-02 2012-02-08 索尼公司 Tempo detection device, tempo detection method and program
CN104050974A (en) * 2013-03-14 2014-09-17 雅马哈株式会社 Sound signal analysis apparatus, sound signal analysis method and sound signal analysis program
CN105513583A (en) * 2015-11-25 2016-04-20 福建星网视易信息系统有限公司 Display method and system for song rhythm
CN108882015A (en) * 2018-06-27 2018-11-23 Oppo广东移动通信有限公司 Recall the broadcasting speed method of adjustment and relevant device of video
US20200043511A1 (en) * 2018-08-03 2020-02-06 Sling Media Pvt. Ltd Systems and methods for intelligent playback
CN111785237A (en) * 2020-06-09 2020-10-16 Oppo广东移动通信有限公司 Audio rhythm determination method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Geoffroy Peeters: "Spectral and Temporal Periodicity Representations of Rhythm for the Automatic Classification of Music Audio Signal", IEEE Transactions on Audio, Speech, and Language Processing, 31 December 2010, pages 1242-1252 *

Also Published As

Publication number Publication date
CN113823325B (en) 2024-08-16

Similar Documents

Publication Publication Date Title
Su Vocal melody extraction using patch-based CNN
Tzanetakis et al. Marsyas: A framework for audio analysis
Lu et al. Vocal Melody Extraction with Semantic Segmentation and Audio-symbolic Domain Transfer Learning.
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
Chaki Pattern analysis based acoustic signal processing: a survey of the state-of-art
CN109766929A (en) A kind of audio frequency classification method and system based on SVM
CN112037764B (en) Method, device, equipment and medium for determining music structure
Cai et al. Music genre classification based on auditory image, spectral and acoustic features
CN112750442B (en) Crested mill population ecological system monitoring system with wavelet transformation and method thereof
Wei et al. Research on sound classification based on SVM
Qu et al. Acoustic scene classification based on three-dimensional multi-channel feature-correlated deep learning networks
CN113823325B (en) Audio rhythm detection method, device, equipment and medium
CN114420097A (en) Voice positioning method and device, computer readable medium and electronic equipment
Felipe et al. Acoustic scene classification using spectrograms
You et al. Open set classification of sound event
Yue English spoken stress recognition based on natural language processing and endpoint detection algorithm
Narkhede et al. Music genre classification and recognition using convolutional neural network
Xie et al. Investigation of acoustic and visual features for frog call classification
CN116186323A (en) Audio matching method, device, equipment and storage medium
Yang et al. Sound event detection in real-life audio using joint spectral and temporal features
CN115034331A (en) Audio and video matching method, computer equipment and computer readable storage medium
Sarkar et al. Automatic extraction and identification of bol from tabla signal
Hama Saeed Improved speech emotion classification using deep neural network
Sunouchi et al. Diversity-Robust Acoustic Feature Signatures Based on Multiscale Fractal Dimension for Similarity Search of Environmental Sounds
Jangid et al. Sound Classification Using Residual Convolutional Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant