CN112634939A - Audio identification method, device, equipment and medium - Google Patents


Info

Publication number
CN112634939A
Authority
CN
China
Prior art keywords
song
detected
beat
spectrum
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011443460.3A
Other languages
Chinese (zh)
Inventor
孔令城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202011443460.3A priority Critical patent/CN112634939A/en
Publication of CN112634939A publication Critical patent/CN112634939A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/81: Detection of presence or absence of voice signals for discriminating voice from music
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information

Abstract

The application discloses an audio recognition method, apparatus, device, and medium. The method includes: acquiring a beat detection result and song spectrum data obtained by preprocessing a song to be detected; aligning the beat detection result with the song spectrum data, and intercepting the song spectrum data based on the beats in the beat detection result to obtain a song spectrum segment to be detected; and extracting the spectrum segment features of the song to be detected, then determining the original song corresponding to the song to be detected from a preset retrieval library based on those features. When the song to be detected is a cover of an original song, intercepting the song spectrum data by beat ensures that the intercepted spectrum segment of the song to be detected and the corresponding pre-speed-change spectrum segment of the original song contain the same song information, which improves the detection rate and accuracy of cover song recognition.

Description

Audio identification method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio recognition method, apparatus, device, and medium.
Background
In audio recognition, it is often necessary to identify whether a song is a cover of some original song. In the current cover song recognition process, an audio segment 2.048 s long is intercepted every 1.024 s from the audio of the song to be detected, a feature vector is extracted from each intercepted segment, and the feature vectors are used to determine whether the song to be detected is a cover of a reference song.
In implementing the present invention, the inventor found at least the following problem in the prior art: if audio segments 2.048 s long are intercepted at fixed 1.024 s intervals and the song to be detected is played twice as fast as the reference song, each 2.048 s segment of the song to be detected contains more musical content than the corresponding 2.048 s segment of the reference song. The feature vectors extracted from the two segments therefore differ, leading to misjudgment and a reduced detection rate for cover song recognition.
Disclosure of Invention
In view of the above, an object of the present application is to provide an audio recognition method, apparatus, device, and medium that can improve the detection rate and accuracy of cover song recognition. The specific scheme is as follows:
to achieve the above object, in a first aspect, an audio recognition method is provided, including:
acquiring beat detection results and song spectrum data obtained after preprocessing songs to be detected;
aligning the beat detection result with the song spectrum data, and then intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected;
extracting spectrum segment characteristics of songs to be detected, and determining an original song corresponding to the songs to be detected from a preset search library based on the spectrum segment characteristics of the songs to be detected, wherein the preset search library comprises original song spectrum segment characteristics of a plurality of original songs, and the original song spectrum segments are obtained by intercepting song spectrum data of the original songs based on beats in beat detection results of the original songs.
Optionally, the obtaining of the beat detection result and the song spectrum data obtained by preprocessing the song to be detected includes:
acquiring the song to be detected;
converting the song to be detected into a preset single track format to obtain single track audio data corresponding to the song to be detected;
and processing the single-channel audio data to obtain the beat detection result and the song spectrum data.
Optionally, the processing the monaural audio data to obtain the beat detection result and the song spectrum data includes:
performing beat detection on the single sound channel audio data to obtain a beat detection result;
performing harmonic pitch class profile extraction on the monaural audio data to obtain harmonic pitch class distribution data;
or, CQT transforming the single-channel audio data to obtain a CQT spectrum.
Optionally, the extracting spectral fragment features of the song to be detected, and determining the original song corresponding to the song to be detected from a preset search library based on the spectral fragment features of the song to be detected includes:
inputting the song spectrum segment to be detected into a target convolutional neural network which is trained in advance to obtain the characteristics of the song spectrum segment to be detected;
respectively determining the similarity between the spectrum fragment characteristics of the song to be detected and the spectrum fragment characteristics of the original song of each original song in a preset retrieval library, wherein the spectrum fragment characteristics of the original song are obtained by extracting the characteristics of the spectrum fragment of the original song by utilizing the target convolutional neural network;
and determining the original song corresponding to the song to be detected from the preset retrieval library according to the similarity.
Optionally, before inputting the spectral fragment of the song to be detected into the pre-trained target convolutional neural network, the method further includes:
and training a preset convolutional neural network by using a target training data set to obtain the target convolutional neural network, wherein the target training data set comprises song spectrum data of an original song and song spectrum data of a corresponding song to be sung.
Optionally, the intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected includes:
determining a target beat;
determining the length of the intercepted segment based on the beat length of the target beat;
and intercepting the song spectrum data of the song to be detected based on the beat point of the target beat and the intercepted segment length to obtain the song spectrum segment to be detected.
Optionally, the intercepting the song spectrum data of the song to be detected based on the beat point of the target beat and the length of the intercepted segment includes:
intercepting the song spectrum data of the song to be detected based on the intercepted segment length by taking the middle point of the first beat point and the second beat point of the target beat as the center;
and/or intercepting the song spectrum data of the song to be detected based on the intercepted segment length by taking the first beat point of the target beat as a starting point;
and/or intercepting the song spectrum data of the song to be detected based on the intercepted segment length by taking a second beat point of the target beat as a center.
In a second aspect, an audio recognition apparatus is provided, including:
the data acquisition module is used for acquiring beat detection results and song spectrum data which are obtained after preprocessing songs to be detected;
the data segmentation module is used for aligning the beat detection result and the song spectrum data, and then intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected;
the retrieval module is used for extracting spectrum segment characteristics of songs to be detected and determining original songs corresponding to the songs to be detected from a preset retrieval library based on the spectrum segment characteristics of the songs to be detected, wherein the preset retrieval library comprises original song spectrum segment characteristics of a plurality of original songs, and the original song spectrum segments are obtained by intercepting song spectrum data of the original songs based on beats in beat detection results of the original songs.
In a third aspect, an electronic device is provided, including:
a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the audio recognition method disclosed above.
In a fourth aspect, a computer-readable storage medium is provided for storing a computer program, wherein the computer program, when executed by a processor, implements the audio recognition method disclosed above.
According to the method, a beat detection result and song spectrum data obtained by preprocessing a song to be detected are acquired and aligned, the song spectrum data is intercepted based on the beats in the beat detection result to obtain a song spectrum segment to be detected, the spectrum segment features of the song to be detected are extracted, and the original song corresponding to the song to be detected is determined from a preset retrieval library based on those features. The preset retrieval library contains the original-song spectrum segment features of a plurality of original songs, each original-song spectrum segment being obtained by intercepting the song spectrum data of the original song based on the beats in that song's beat detection result. A beat spans the same musical content before and after a speed change, so when the song to be detected is a cover of an original song, a spectrum segment intercepted by beat contains the same song information as the corresponding pre-speed-change segment of the original song; the detection rate and accuracy of cover song recognition can therefore be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of a system framework to which the audio recognition scheme provided herein is applicable;
FIG. 2 is a flow chart of an audio recognition method disclosed herein;
FIG. 3 is a flow chart of an audio recognition method disclosed herein;
FIG. 4a is a graph of beat detection and song spectrum data of an original song disclosed in the present application;
FIG. 4b is a graph of beat detection and song spectrum data of a cover song disclosed in the present application;
FIG. 5 is a flow chart of a specific audio recognition method disclosed herein;
FIG. 6 is a partial flow diagram of a particular audio recognition method disclosed herein;
fig. 7 is a schematic structural diagram of an audio recognition apparatus disclosed in the present application;
fig. 8 is a schematic structural diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present application.
At present, in the cover song recognition process, an audio segment 2.048 s long is intercepted every 1.024 s from the audio of the song to be detected, a feature vector is extracted from each intercepted segment, and the feature vectors are used to determine whether the song to be detected is a cover of a reference song. Because the speed of a cover differs from that of the reference song, the same 2.048 s of audio contains different information, and the detection rate of cover song recognition is low. To overcome this technical problem, the present application provides an audio recognition method that can improve the detection rate and accuracy of cover song recognition.
For ease of understanding, a system framework to which the audio recognition method of the present application is applicable will be described first. It is to be understood that, in the embodiment of the present application, the number of the computer devices is not limited, and it may be that a plurality of computer devices cooperate together to perform the audio recognition function. In one possible scenario, please refer to fig. 1. As can be seen from fig. 1, the hardware composition framework may include: a first computer device 101, a second computer device 102. The first computer device 101 and the second computer device 102 are communicatively connected via a network 103.
In the embodiment of the present application, the hardware structures of the first computer device 101 and the second computer device 102 are not specifically limited, and the first computer device 101 and the second computer device 102 perform data interaction to implement an audio recognition function. Further, the form of the network 103 is not limited in this embodiment, for example, the network 103 may be a wireless network (e.g., WIFI, bluetooth, etc.), or may be a wired network.
The first computer device 101 and the second computer device 102 may be the same computer device, for example, the first computer device 101 and the second computer device 102 are both servers; or different types of computer devices, e.g., the first computer device 101 may be a terminal or an intelligent electronic device, and the second computer device 102 may be a server. In yet another possible scenario, a server with high computing power may be utilized as the second computer device 102 to improve data processing efficiency and reliability, and thus audio recognition efficiency. Meanwhile, a terminal or an intelligent electronic device with low cost and wide application range is used as the first computer device 101 to realize the interaction between the second computer device 102 and the user.
For example, referring to fig. 2, after acquiring a song to be detected, a terminal preprocesses the song to be detected to obtain a beat detection result and song spectrum data, and sends the beat detection result and the song spectrum data to a server through a network. After receiving the beat detection result and the song spectrum data, the server aligns the beat detection result and the song spectrum data, then intercepts the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected, then extracts the spectrum segment characteristic of the song to be detected, and determines the original song corresponding to the song to be detected from a preset search library based on the spectrum segment characteristic of the song to be detected, wherein the preset search library comprises original song spectrum segment characteristics of a plurality of original songs, and the original song spectrum segment is obtained by intercepting the song spectrum data of the original song based on the beat in the beat detection result of the original song. And the server returns the audio of the original song corresponding to the song to be detected to the terminal from the song library, and the terminal can play the original song corresponding to the song to be detected.
Referring to fig. 3, an embodiment of the present application discloses an audio recognition method, including:
step S11: and acquiring beat detection results and song spectrum data obtained after preprocessing the songs to be detected.
In a specific implementation process, when the original song corresponding to a song to be detected needs to be determined, the beat detection result and song spectrum data obtained by processing the song to be detected must first be acquired. The song spectrum data comprises frequency spectrum data or spectrogram data corresponding to the audio of the song to be detected.
In practical application, the song to be detected may be acquired first and converted into a preset monaural format to obtain monaural audio data corresponding to the song to be detected, and the monaural audio data may then be processed to obtain the beat detection result and the song spectrum data. That is, the song to be detected is converted into a mono format with a preset sampling frequency and a preset sample bit depth; for example, the preset sampling frequency may be 8 kHz and the preset bit depth 16 bits.
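By way of illustration only, the mono conversion described above might look like the following Python sketch; the file names are hypothetical, and librosa/soundfile are one possible tool choice rather than the patent's prescribed implementation.

```python
import librosa
import soundfile as sf

# Resample the song to be detected to the preset 8 kHz mono format.
y, sr = librosa.load("song_to_detect.mp3", sr=8000, mono=True)  # hypothetical file

# Persist as 16-bit PCM so downstream steps see the preset bit depth.
sf.write("song_to_detect_mono.wav", y, sr, subtype="PCM_16")
```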
Specifically, processing the monaural audio data to obtain the beat detection result includes performing beat detection on the monaural audio data. In practical applications, an existing beat detection technology, including but not limited to madmom, may be used to perform beat detection on the monaural audio data to obtain the beat detection result; madmom is an audio signal processing library.
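A minimal beat detection sketch with madmom, the library named above; the RNN-plus-DBN pipeline shown is madmom's standard beat tracker, used here purely for illustration.

```python
from madmom.features.beats import RNNBeatProcessor, DBNBeatTrackingProcessor

# Frame-wise beat activation function from madmom's recurrent network.
activations = RNNBeatProcessor()("song_to_detect_mono.wav")

# Decode beat times in seconds with a dynamic Bayesian network tracker.
beat_times = DBNBeatTrackingProcessor(fps=100)(activations)
print(beat_times)  # e.g. array([2.0, 4.0, 5.8, ...])
```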
In a first specific embodiment, processing the monaural audio data to obtain the song spectrum data includes: performing harmonic pitch class profile extraction on the monaural audio data to obtain harmonic pitch class distribution data. The harmonic pitch class distribution feature is specifically the HPCP (Harmonic Pitch Class Profile) feature, that is, the chroma feature.
Specifically, the monaural audio data is subjected to a short-time Fourier transform to obtain the audio spectrum of the song to be detected, local peak points are picked from the spectrum, the peak values are accumulated into 60 bins using the center frequencies of twelve-tone equal temperament, and the amplitudes are normalized over the whole frequency band to obtain the harmonic pitch class distribution data. The short-time Fourier transform may use a frame length of 512 sampling points and a frame shift of 256 sampling points.
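The sketch below is a deliberately simplified HPCP-style extraction under the stated parameters (512-sample frames, 256-sample shift, 60 bins); production HPCP implementations such as Essentia's add peak interpolation and harmonic weighting that are omitted here.

```python
import numpy as np
import librosa

def hpcp_like(y, sr=8000, n_fft=512, hop=256, bins=60, f_ref=440.0):
    """Simplified HPCP-style chroma: local spectral peaks are folded onto
    a 60-bin pitch-class axis (5 bins per semitone, twelve-tone equal
    temperament) and each frame is normalized to its maximum."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    out = np.zeros((bins, S.shape[1]))
    for t in range(S.shape[1]):
        mag = S[:, t]
        # local peaks: FFT bins larger than both neighbours
        peaks = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
        for p in peaks:
            # semitone distance from A4, folded into one octave and
            # quantized to `bins` pitch classes
            semis = 12.0 * np.log2(freqs[p] / f_ref)
            b = int(np.round((semis % 12.0) * bins / 12.0)) % bins
            out[b, t] += mag[p]
    frame_max = out.max(axis=0, keepdims=True)
    frame_max[frame_max == 0] = 1.0  # avoid division by zero on silent frames
    return out / frame_max
```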
In a second specific embodiment, processing the monaural audio data to obtain the song spectrum data includes: performing a CQT transform on the monaural audio data to obtain a CQT spectrum. That is, the monaural audio data is subjected to a constant-Q transform to obtain the CQT spectrum corresponding to the song to be detected.
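For illustration, the CQT spectrum could be computed with librosa as follows; the bin count and hop length are assumptions chosen to be consistent with the 8 kHz preset, not values fixed by the patent.

```python
import numpy as np
import librosa

y, sr = librosa.load("song_to_detect_mono.wav", sr=8000, mono=True)

# Constant-Q transform: 6 octaves x 12 semitone bins stays below the
# 4 kHz Nyquist limit of the 8 kHz preset sampling rate.
C = np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=72, bins_per_octave=12))
```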
In one specific implementation, the current computer device may first acquire the song to be detected and then preprocess it to obtain the beat detection result and the song spectrum data. Specifically, the current computer device may acquire the song to be detected, convert it into the preset monaural format to obtain the corresponding monaural audio data, and then process the monaural audio data to obtain the beat detection result and the song spectrum data. The current computer device may acquire the song to be detected through its own input unit, for example through a voice acquisition module, or it may obtain a song to be detected sent by another device over a network (wired or wireless).
In another specific implementation, the current computer device may directly obtain, over a network, a beat detection result and song spectrum data that another device produced by preprocessing the song to be detected, and then carry out the subsequent processing. In either case, the manner in which the other device (such as another computer device) obtains the song to be detected is not limited here; for example, another device (such as a terminal) may receive a song to be detected input by a user through a voice input module.
Step S12: and aligning the beat detection result with the song spectrum data, and then intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected.
After the beat detection result and the song spectrum data are obtained, they need to be aligned, and the song spectrum data is then intercepted based on the beats in the beat detection result to obtain the song spectrum segments to be detected. That is, the beat detection result and the song spectrum data are first aligned; since both are sequences over time, they can be aligned on the time axis, after which the song spectrum data can be intercepted according to the beats in the beat detection result to obtain the song spectrum segments to be detected.
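Since both sequences are indexed by time, alignment amounts to expressing beat times in spectrogram frame indices. A minimal sketch, assuming the 256-sample frame shift used earlier:

```python
import numpy as np

def beats_to_frames(beat_times, sr=8000, hop=256):
    """Map beat times in seconds onto spectrogram column indices so the
    beat detection result and the song spectrum data share one time axis."""
    return np.round(np.asarray(beat_times) * sr / hop).astype(int)

beat_frames = beats_to_frames([2.0, 4.0, 5.8])  # -> array([ 62, 125, 181])
```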
Referring to fig. 4a, a spectrogram and an HPCP map of an original song are shown: the upper part of the figure is the HPCP feature, the lower part is the spectrum representing the beat detection result, the lighter parts of which mark the beat points, and the figure shows the effect after aligning the HPCP feature with the spectrum. Referring to fig. 4b, the same layout is shown for a cover song. The cover song is played at twice the speed of the original, so its total duration is only half that of the original; nevertheless, the song information inside the box at the third-to-last beat point in fig. 4a is the same as the song information inside the box at the third-to-last beat point in fig. 4b. If the HPCP features in the box at the third-to-last beat point of fig. 4a and the HPCP features in the box at the third-to-last beat point of fig. 4b are intercepted, the similarity computed during detection between the original song and its cover is high, so the original song corresponding to the cover can be detected.
The step of intercepting the song spectrum data based on the beats in the beat detection result to obtain the song spectrum segment to be detected includes: intercepting the song spectrum data based on a single beat to obtain the song spectrum segment to be detected; or intercepting the song spectrum data based on a preset number of consecutive beats to obtain the song spectrum segment to be detected. A song spectrum segment extracted in this way therefore contains the song spectrum data within one beat, or within the preset number of consecutive beats.
Step S13: extracting spectrum segment characteristics of songs to be detected, and determining an original song corresponding to the songs to be detected from a preset search library based on the spectrum segment characteristics of the songs to be detected, wherein the preset search library comprises original song spectrum segment characteristics of a plurality of original songs, and the original song spectrum segments are obtained by intercepting song spectrum data of the original songs based on beats in beat detection results of the original songs.
It can be understood that after the spectrum segment of the song to be detected is obtained, the spectrum segment characteristic of the song to be detected can be extracted, and the original song corresponding to the song to be detected is determined from a preset search library based on the spectrum segment characteristic of the song to be detected, wherein the preset search library comprises the spectrum segment characteristics of the original songs of a plurality of original songs, and the spectrum segment of the original song is obtained by intercepting the song spectrum data of the original song based on the beat in the beat detection result of the original song.
That is, a plurality of original songs are first obtained, and each original song is processed to obtain its beat detection result and song spectrum data. After the beat detection result and song spectrum data of each original song are aligned, the song spectrum data of each original song is intercepted based on the beats in its beat detection result to obtain the original-song spectrum segments of that song. Features are then extracted from the original-song spectrum segments of each original song, and the preset retrieval library is constructed from these original-song spectrum segment features. Once the song spectrum segments of the song to be detected are obtained, the original song corresponding to the song to be detected can be determined from the preset retrieval library based on the spectrum segment features of the song to be detected. The original-song spectrum segments in the preset retrieval library are intercepted in the same way as the song spectrum segments to be detected, and the feature extraction method is also the same.
According to the method, the beat detection result and song spectrum data obtained by preprocessing the song to be detected are acquired and aligned; the song spectrum data is intercepted based on the beats in the beat detection result to obtain the song spectrum segments to be detected; features are extracted from these segments; and the original song corresponding to the song to be detected is determined from the preset retrieval library based on those features, the library containing original-song spectrum segment features obtained in the same beat-based way. Because a beat covers the same musical content regardless of playback speed, a segment intercepted by beat from a cover contains the same song information as the corresponding segment of the original song before the speed change, so the detection rate and accuracy of cover song recognition can be improved.
Referring to fig. 5, an embodiment of the present application discloses a specific audio recognition method, including:
step S21: and acquiring beat detection results and song spectrum data obtained after preprocessing the songs to be detected.
Step S22: and aligning the beat detection result with the song spectrum data, and then intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected.
The specific implementation processes of step S21 and step S22 may refer to the corresponding contents disclosed in the foregoing embodiments, and are not described herein again.
Step S23: inputting the song spectrum segment to be detected into a pre-trained target convolutional neural network to obtain the spectrum segment feature of the song to be detected.
In practical application, after the song spectrum segment to be detected is obtained, it can be input into the pre-trained target convolutional neural network to obtain the spectrum segment feature of the song to be detected. The network outputs a feature vector of lower dimension than the input segment, and this feature vector is taken as the spectrum segment feature of the song to be detected.
Before inputting the song spectrum segment to be detected into the pre-trained target convolutional neural network, the method further includes: training a preset convolutional neural network with a target training data set to obtain the target convolutional neural network, wherein the target training data set includes pre-obtained song spectrum data of original songs and song spectrum data of the corresponding cover songs.
Specifically, a preset convolutional neural network may be trained with the pre-obtained training data set to obtain the target convolutional neural network. The preset convolutional neural network may be a pre-constructed network such as an Inception V2 convolutional neural network. The target training set may include song spectrum data of a number of original songs together with song spectrum data of several cover songs of each original song. For example, the target training set may include the song spectrum data of 100 original songs and of 6 cover songs per original song, giving a training set containing the song spectrum data of 700 songs in total.
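As a hedged illustration of the feature extractor, the sketch below substitutes a small convolutional encoder for the Inception V2 network mentioned above; the layer sizes and the 128-dimensional embedding are assumptions, not the patent's specification. In practice the network would be trained on the target training data set so that segments of a cover and of its original map to nearby feature vectors.

```python
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Illustrative stand-in for the target convolutional neural network."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse to one value per channel
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))  # low-dimensional segment feature

# usage: one 60-bin HPCP segment, 50 frames long
feature = SegmentEncoder()(torch.randn(1, 1, 60, 50))
print(feature.shape)  # torch.Size([1, 128])
```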
Step S24: and respectively determining the similarity between the spectrum segment characteristics of the song to be detected and the spectrum segment characteristics of the original song of each original song in a preset retrieval library, wherein the spectrum segment characteristics of the original song are obtained by extracting the characteristics of the spectrum segment of the original song by utilizing the target convolutional neural network.
After the spectrum segment features of the song to be detected are obtained, the similarity between them and the original-song spectrum segment features of each original song in the preset retrieval library must also be determined. The original-song spectrum segment features in the preset retrieval library are obtained by extracting features from the original-song spectrum segments of each original song with the same target convolutional neural network, and each original-song spectrum segment is obtained by intercepting the song spectrum data of the corresponding original song according to the beats in that song's beat detection result. Because the original-song spectrum segment features in the preset retrieval library are produced in the same way as the spectrum segment features of the song to be detected, the similarity between a cover's segment features and the corresponding original song's segment features is high, which improves the retrieval rate and accuracy.
Specifically, determining the similarity between the spectrum segment features of the song to be detected and the original-song spectrum segment features of any original song in the preset retrieval library includes: determining the Euclidean distance between the two sets of features, or determining the Hamming distance between them, to obtain the similarity. Other similarity measures may also be used and are not specifically limited here; the only requirement is that the measure match the similarity calculation used when training the target convolutional neural network.
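A sketch of the Euclidean variant follows; the decision threshold is an assumed value (the patent leaves the preset similarity open), and the all-segments-must-match rule reflects one reading of the criterion described in the next paragraphs.

```python
import numpy as np

def euclidean_distances(query_feats, ref_feats):
    """Pairwise Euclidean distances between the segment features of the
    song to be detected (rows) and one original song's segment features
    (rows); a smaller distance means a higher similarity."""
    q = np.asarray(query_feats)[:, None, :]
    r = np.asarray(ref_feats)[None, :, :]
    return np.linalg.norm(q - r, axis=-1)

def matches_original(query_feats, ref_feats, threshold=0.5):  # assumed threshold
    """True when every query segment lies within `threshold` of some
    segment of the candidate original song."""
    d = euclidean_distances(query_feats, ref_feats)
    return bool((d.min(axis=1) <= threshold).all())
```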
When the song to be detected yields a plurality of song spectrum segments, its spectrum segment features likewise comprise a plurality of feature vectors, and the original-song spectrum segment features of an original song in the preset retrieval library also comprise a plurality of feature vectors; the similarity between each spectrum segment feature of the song to be detected and each original-song spectrum segment feature in the preset retrieval library can then be determined respectively.
Step S25: and determining the original song corresponding to the song to be detected from the preset retrieval library according to the similarity.
After the similarities are obtained, the original song corresponding to the song to be detected is determined from the preset retrieval library according to the similarities.
Specifically, when the similarity between each of the spectrum segment features of the song to be detected and the corresponding original-song spectrum segment features of an original song in the preset retrieval library is greater than or equal to a preset similarity, that original song may be determined to be the original song corresponding to the song to be detected. A similarity at or above the preset similarity indicates that the song to be detected is similar to the corresponding original song.
When the similarity is expressed as a Euclidean distance, if the Euclidean distance between each of the spectrum segment features of the song to be detected and the corresponding original-song spectrum segment features of an original song in the preset retrieval library is less than or equal to a preset distance threshold, that original song is determined to be the original song corresponding to the song to be detected.
Referring to fig. 6, intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected, includes:
step S31: a target beat is determined.
When intercepting the song spectrum data based on the beats in the beat detection result, a target beat needs to be determined first. The beat detection result of the song to be detected includes a plurality of beat points and the beats between them, so a target beat must be selected; it may be a single beat or a plurality of consecutive beats. For example, if the song to be detected has a first beat point at the 2nd second and a second beat point at the 4th second, the segment between the 2nd and 4th seconds may be taken as the target beat.
Step S32: and determining the length of the intercepted segment based on the beat length of the target beat.
After the target beat is determined, the intercepted segment length is determined based on the beat length of the target beat. The intercepted segment length may be N times the length of a single beat, where N is a positive number such as 1 or 0.8, or it may span several consecutive beats, for example 2 or 3. When the intercepted segment length is N times the beat length, the intercepted segments of a cover song and of the original song contain the same content, because a beat of the cover and the corresponding beat of the original contain the same song information. For example, if the beat length of the original song is 2 seconds and the intercepted segment length is 1.6 seconds, then for a cover played twice as fast the corresponding target beat length is 1 second and the intercepted segment length is 0.8 seconds; in both cases the segment covers 80% of the beat.
For example, if the song to be detected has a beat point at the 2nd second and a beat point at the 4th second, the segment between the 2nd and 4th seconds may be taken as the target beat, and the intercepted segment length may be determined to be 2 seconds, or alternatively 1.6 seconds.
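The arithmetic of this example as a tiny sketch, with N = 0.8 as one of the multipliers suggested above:

```python
beat_length = 4.0 - 2.0          # seconds between the beat points at 2 s and 4 s
N = 0.8                          # assumed multiplier
segment_length = N * beat_length
print(segment_length)            # 1.6 seconds, matching the example
```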
Step S33: and intercepting the song spectrum data of the song to be detected based on the beat point of the target beat and the intercepted segment length to obtain the song spectrum segment to be detected.
After the length of the intercepted segment is determined, the song spectrum data of the song to be detected can be intercepted based on the beat point of the target beat and the length of the intercepted segment, so that the song spectrum segment to be detected can be obtained.
In practical applications, the intercepting the song spectrum data based on the beat point of the target beat and the length of the intercepted segment includes: intercepting the song spectrum data based on the length of the intercepted segment by taking the middle point of the first beat point and the second beat point of the target beat as the center; and/or, taking the first beat point of the target beat as a starting point, and intercepting the song spectrum data based on the length of the intercepted segment; and/or, with a second beat point of the target beat as a center, intercepting the song spectrum data based on the intercepted segment length.
In a first specific implementation, when the target beat is a single beat, intercepting the song spectrum data based on the beat point of the target beat and the intercepted segment length includes: intercepting the song spectrum data based on the intercepted segment length with the midpoint between the first and second beat points of the target beat as the center. A single target beat usually has two corresponding beat points, so the song spectrum data may be intercepted around the midpoint of those two points. For example, if the song to be detected has a beat point at the 2nd second and a beat point at the 4th second, the segment between the 2nd and 4th seconds is the target beat, whose beat points are the 2nd and 4th seconds. With an intercepted segment length of 1.6 seconds, 0.8 seconds of song spectrum data are taken on each side of the 3rd second to obtain the corresponding song spectrum segment to be detected.
In a second specific implementation, when the target beat is a single beat, intercepting the song spectrum data based on the beat point of the target beat and the intercepted segment length includes: intercepting the song spectrum data based on the intercepted segment length with the first beat point of the target beat as the starting point. In the same example, with an intercepted segment length of 1.6 seconds, 1.6 seconds of song spectrum data are taken forward from the 2nd second to obtain the corresponding song spectrum segment to be detected.
In a third specific implementation, when the target beat is a single beat, intercepting the song spectrum data based on the beat point of the target beat and the intercepted segment length includes: intercepting the song spectrum data based on the intercepted segment length with the second beat point of the target beat as the center. In the same example, with an intercepted segment length of 1.6 seconds, 0.8 seconds of song spectrum data are taken on each side of the 4th second to obtain the corresponding song spectrum segment to be detected.
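A sketch covering the three anchoring strategies above, operating on spectrogram columns and beat frame indices; the function and argument names are illustrative, not taken from the patent.

```python
import numpy as np

def intercept_segment(spec, beat_frames, i, seg_len, mode="start"):
    """Cut one song spectrum segment (columns of `spec`) for the target
    beat lying between beat_frames[i] and beat_frames[i + 1].
    `seg_len` is the intercepted segment length in frames; the three
    modes mirror the three anchoring strategies described above."""
    b1, b2 = beat_frames[i], beat_frames[i + 1]
    if mode == "midpoint":      # centered on the midpoint of the two beat points
        start = (b1 + b2) // 2 - seg_len // 2
    elif mode == "start":       # starting at the first beat point
        start = b1
    else:                       # "second": centered on the second beat point
        start = b2 - seg_len // 2
    start = max(0, min(start, spec.shape[1] - seg_len))  # clamp to bounds
    return spec[:, start:start + seg_len]

# e.g. a 1.6 s segment at 256-sample hop and 8 kHz: 1.6 * 8000 / 256 = 50 frames
```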
In a specific implementation, a plurality of consecutive beats may be determined as the target beat. Since a single beat of a cover song and the corresponding beat of the original song contain the same content, a run of consecutive beats of the cover and the corresponding run in the original also contain the same content, so consecutive beats can serve as the target beat. For example, if the beat detection result of the song to be detected has its first beat point at the 2nd second, its second at the 4th second, and its third at the 5.8th second, the two consecutive beats between the 2nd and 5.8th seconds may be taken as the target beat.
When consecutive beats form the target beat, the song spectrum data may be intercepted based on the intercepted segment length starting from the first beat point of the target beat. In the example above, with an intercepted segment length of 3.8 seconds, 3.8 seconds are intercepted forward from the 2nd second to obtain the song spectrum segment to be detected.
Alternatively, the song spectrum data may be intercepted based on the intercepted segment length centered on the midpoint between the first and last beat points of the target beat. In the example above, with an intercepted segment length of 3.8 seconds, 1.9 seconds are intercepted on each side of the 3.9th second to obtain the song spectrum segment to be detected.
It can be seen that, in this embodiment, the song spectrum data of the song to be detected is intercepted based on the beats in its beat detection result, so that the original song and a cover contain the same song information within corresponding beats; the original song corresponding to a cover can thus be identified, improving the detection rate and accuracy of cover song recognition.
The following describes the technical solution of the present application by taking as an example the listen-and-recognize process of a music APP identifying a song played or sung by a user.
Suppose the music APP, through the voice input module of the intelligent device on which it is installed, captures an interlude of the television drama "East Palace" played at double speed. The captured interlude serves as the song to be detected, and the music APP preprocesses it to obtain the corresponding beat detection result and song spectrum data.
The music APP aligns the beat detection result with the song spectrum data and cuts the song spectrum data into song spectrum segments to be detected based on the beats in the beat detection result. The segments are then passed through the pre-trained convolutional neural network to obtain the spectrum segment features of the song to be detected. The similarity between the first segment feature and the original-song spectrum segment features of each original song in the preset retrieval library is determined, and every original song whose segment feature similarity is greater than or equal to a preset similarity threshold is taken as a preselected song. The similarities between the remaining segment features of the song to be detected and the corresponding original-song segment features of each preselected song are then determined; if these are also greater than or equal to the preset similarity threshold, the preselected song is determined to be the original song corresponding to the song to be detected (the double-speed "East Palace" interlude), yielding the original song "Love Site" by its original artist. The original song is then retrieved from the song library and displayed on the song identification interface, thereby realizing the listen-and-recognize function.
Referring to fig. 7, an embodiment of the present application discloses an audio recognition apparatus, including:
the data acquisition module 201 is configured to acquire a beat detection result and song spectrum data obtained after preprocessing a song to be detected;
the data segmentation module 202 is configured to align the beat detection result with the song spectrum data, and then intercept the song spectrum data based on a beat in the beat detection result to obtain a song spectrum segment to be detected;
the retrieval module 203 is configured to extract spectral fragment features of songs to be detected, and determine an original song corresponding to the songs to be detected from a preset retrieval library based on the spectral fragment features of the songs to be detected, where the preset retrieval library includes original song spectral fragment features of a plurality of original songs, and the original song spectral fragments are obtained by intercepting song spectral data of the original songs based on beats in beat detection results of the original songs.
With the above apparatus, the beat detection result and song spectrum data obtained by preprocessing the song to be detected are acquired and aligned, the song spectrum data is intercepted based on the beats in the beat detection result to obtain the song spectrum segments to be detected, features are extracted from these segments, and the original song corresponding to the song to be detected is determined from the preset retrieval library based on those features, the library holding original-song spectrum segment features intercepted in the same beat-based way. Since a beat contains the same song information before and after a speed change, the intercepted segments of a cover and the corresponding segments of the original song match, so the detection rate and accuracy of cover song recognition can be improved.
In a specific implementation process, the data obtaining module 201 includes:
the data acquisition unit is used for acquiring the song to be detected;
the data conversion unit is used for converting the song to be detected into a preset single track format to obtain single track audio data corresponding to the song to be detected;
and the data processing unit is used for processing the single-channel audio data to obtain the beat detection result and the song spectrum data.
Further, the data processing unit is configured to:
performing beat detection on the single sound channel audio data to obtain a beat detection result;
performing harmonic pitch class profile extraction on the monaural audio data to obtain harmonic pitch class distribution data;
or, CQT transforming the single-channel audio data to obtain a CQT spectrum.
Specifically, the retrieving module 203 is configured to:
inputting the song spectrum segment to be detected into a pre-trained target convolutional neural network to obtain the spectrum segment features of the song to be detected;
respectively determining the similarity between the spectrum segment features of the song to be detected and the original-song spectrum segment features of each original song in the preset retrieval library, where the original-song spectrum segment features are obtained by extracting features from the original-song spectrum segments with the same target convolutional neural network;
and determining the original song corresponding to the song to be detected from the preset retrieval library according to the similarity.
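As a minimal sketch of this retrieval step, assuming the network is already trained and that cosine similarity serves as the similarity measure (the patent does not fix a particular metric, so that choice and the names retrieve, model, and library below are illustrative):

import numpy as np

def retrieve(query_segment, model, library):
    # library: list of (song_id, feature) pairs, where each feature was
    # precomputed by running an original-song spectrum segment through model.
    q = model(query_segment)
    q = q / (np.linalg.norm(q) + 1e-12)  # L2-normalize the query feature
    best_id, best_sim = None, -1.0
    for song_id, feat in library:
        f = feat / (np.linalg.norm(feat) + 1e-12)
        sim = float(np.dot(q, f))        # cosine similarity
        if sim > best_sim:
            best_id, best_sim = song_id, sim
    return best_id, best_sim

In practice the linear scan would typically be replaced by an approximate nearest-neighbor index once the retrieval library grows large.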
Correspondingly, the audio recognition device further comprises:
the neural network training module is used for training a preset convolutional neural network with a target training data set to obtain the target convolutional neural network, where the target training data set includes pre-obtained song spectrum data of original songs and song spectrum data of the corresponding cover songs.
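One possible training setup is sketched below, under the assumption that PyTorch is used and that an original-song segment and the matching cover-song segment form a positive pair pulled together by a cosine loss; the network shape and the loss are illustrative assumptions, since the patent only states that original and cover song spectrum data are used for training:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentCNN(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):  # x: (batch, 1, freq_bins, frames)
        h = self.conv(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)

def train_step(model, optimizer, original_seg, cover_seg):
    # Pull the embeddings of matching original/cover segments together.
    emb_o, emb_c = model(original_seg), model(cover_seg)
    loss = (1.0 - (emb_o * emb_c).sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

A contrastive or triplet loss with explicit negative pairs would be a natural refinement of the same idea.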
Further, the data segmentation module 202 is configured to:
determining a target beat;
determining the length of the intercepted segment based on the beat length of the target beat;
and intercepting the song spectrum data of the song to be detected based on the beat point of the target beat and the intercepted segment length to obtain the song spectrum segment to be detected.
Specifically, the data segmentation module 202 is configured to:
intercepting the song spectrum data of the song to be detected, based on the intercepted segment length, with the midpoint between the first beat point and the second beat point of the target beat as the center;
and/or intercepting the song spectrum data of the song to be detected, based on the intercepted segment length, with the first beat point of the target beat as the starting point;
and/or intercepting the song spectrum data of the song to be detected, based on the intercepted segment length, with the second beat point of the target beat as the center.
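A minimal sketch of the three interception strategies follows, assuming the spectrum and beat frames come from the preprocessing sketch above and that the intercepted segment length spans a fixed number of beat intervals (the exact length rule is left open by the text, so n_beats below is an illustrative assumption):

import numpy as np

def intercept_segments(spec, beat_frames, n_beats=2):
    # spec: (freq_bins, frames) array; beat_frames: frame indices of beats.
    b1, b2 = beat_frames[0], beat_frames[1]      # first and second beat points
    seg_len = n_beats * (b2 - b1)                # length derived from the beat length

    def cut(center=None, start=None):
        s = start if start is not None else center - seg_len // 2
        s = int(max(0, min(s, spec.shape[1] - seg_len)))  # clamp to a valid range
        return spec[:, s:s + seg_len]

    return (
        cut(center=(b1 + b2) // 2),  # centered on the midpoint of beats one and two
        cut(start=b1),               # starting at the first beat point
        cut(center=b2),              # centered on the second beat point
    )

Because seg_len is measured in beat intervals rather than seconds, a tempo-shifted cover yields a segment covering the same musical span as the original, which is what makes the intercepted segments comparable.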
Further, an embodiment of the present application also provides an electronic device. Fig. 8 is a schematic structural diagram of an electronic device 30 according to an embodiment of the present application; the electronic device may specifically include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
In general, the electronic device 30 in the present embodiment includes: a processor 31 and a memory 32.
The processor 31 may include one or more processing cores, such as a four-core or eight-core processor. The processor 31 may be implemented in hardware by at least one of a DSP (digital signal processor), an FPGA (field-programmable gate array), or a PLA (programmable logic array). The processor 31 may also include a main processor and a coprocessor: the main processor, also called the central processing unit (CPU), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 31 may be integrated with a GPU (graphics processing unit), which is responsible for rendering and drawing the images to be displayed on the display screen. In some embodiments, the processor 31 may include an AI (artificial intelligence) processor for handling computing operations related to machine learning.
Memory 32 may include one or more computer-readable storage media, which may be non-transitory. Memory 32 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 32 is used at least for storing the following computer program 321, which, after being loaded and executed by the processor 31, implements the steps of the audio recognition method disclosed in any of the foregoing embodiments.
In some embodiments, the electronic device 30 may further include a display 33, an input/output interface 34, a communication interface 35, a sensor 36, a power source 37, and a communication bus 38.
Those skilled in the art will appreciate that the configuration shown in Fig. 8 does not limit the electronic device 30, which may include more or fewer components than those shown.
Further, an embodiment of the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the audio recognition method disclosed in any of the foregoing embodiments.
For the specific process of the audio recognition method, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not repeated here.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be cross-referenced. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is kept brief, and the relevant points can be found in the description of the method.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing has described in detail the audio recognition method, apparatus, device, and medium provided by the present application. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and its core ideas. Meanwhile, a person skilled in the art may, following the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An audio recognition method, comprising:
acquiring a beat detection result and song spectrum data obtained after preprocessing a song to be detected;
aligning the beat detection result with the song spectrum data, and then intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected;
extracting spectrum segment features of the song to be detected, and determining an original song corresponding to the song to be detected from a preset retrieval library based on the spectrum segment features of the song to be detected, wherein the preset retrieval library comprises original-song spectrum segment features of a plurality of original songs, and the original-song spectrum segments are obtained by intercepting the song spectrum data of the original songs based on beats in the beat detection results of the original songs.
2. The audio recognition method of claim 1, wherein the acquiring of the beat detection result and the song spectrum data obtained after preprocessing the song to be detected comprises:
acquiring the song to be detected;
converting the song to be detected into a preset mono format to obtain mono audio data corresponding to the song to be detected;
and processing the mono audio data to obtain the beat detection result and the song spectrum data.
3. The audio recognition method of claim 2, wherein the processing the mono audio data to obtain the beat detection result and the song spectrum data comprises:
performing beat detection on the mono audio data to obtain the beat detection result;
extracting harmonic pitch class profile (HPCP) features from the mono audio data to obtain pitch class distribution data;
or performing a constant-Q transform (CQT) on the mono audio data to obtain a CQT spectrum.
4. The audio recognition method of claim 1, wherein the extracting spectrum segment features of the song to be detected and determining the original song corresponding to the song to be detected from a preset retrieval library based on the spectrum segment features of the song to be detected comprises:
inputting the song spectrum segment to be detected into a pre-trained target convolutional neural network to obtain the spectrum segment features of the song to be detected;
respectively determining the similarity between the spectrum segment features of the song to be detected and the original-song spectrum segment features of each original song in the preset retrieval library, wherein the original-song spectrum segment features are obtained by extracting features from the original-song spectrum segments with the target convolutional neural network;
and determining the original song corresponding to the song to be detected from the preset retrieval library according to the similarity.
5. The audio recognition method of claim 4, wherein before inputting the song spectrum segment to be detected into the pre-trained target convolutional neural network, the method further comprises:
training a preset convolutional neural network with a target training data set to obtain the target convolutional neural network, wherein the target training data set comprises pre-obtained song spectrum data of original songs and song spectrum data of the corresponding cover songs.
6. The audio recognition method according to any one of claims 1 to 5, wherein the intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected comprises:
determining a target beat;
determining the length of the intercepted segment based on the beat length of the target beat;
and intercepting the song spectrum data of the song to be detected based on the beat point of the target beat and the intercepted segment length to obtain the song spectrum segment to be detected.
7. The audio recognition method of claim 6, wherein the intercepting the song spectrum data of the song to be detected based on the beat point of the target beat and the intercepted segment length comprises:
intercepting the song spectrum data of the song to be detected, based on the intercepted segment length, with the midpoint between the first beat point and the second beat point of the target beat as the center;
and/or intercepting the song spectrum data of the song to be detected, based on the intercepted segment length, with the first beat point of the target beat as the starting point;
and/or intercepting the song spectrum data of the song to be detected, based on the intercepted segment length, with the second beat point of the target beat as the center.
8. An audio recognition apparatus, comprising:
the data acquisition module is used for acquiring a beat detection result and song spectrum data obtained after preprocessing a song to be detected;
the data segmentation module is used for aligning the beat detection result and the song spectrum data, and then intercepting the song spectrum data based on the beat in the beat detection result to obtain a song spectrum segment to be detected;
the retrieval module is used for extracting spectrum segment features of the song to be detected and determining the original song corresponding to the song to be detected from a preset retrieval library based on the spectrum segment features of the song to be detected, wherein the preset retrieval library comprises original-song spectrum segment features of a plurality of original songs, and the original-song spectrum segments are obtained by intercepting the song spectrum data of the original songs based on beats in the beat detection results of the original songs.
9. An electronic device, comprising:
a memory and a processor;
wherein the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the audio recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the audio recognition method of any of claims 1 to 7.
CN202011443460.3A 2020-12-11 2020-12-11 Audio identification method, device, equipment and medium Pending CN112634939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011443460.3A CN112634939A (en) 2020-12-11 2020-12-11 Audio identification method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112634939A true CN112634939A (en) 2021-04-09

Family

ID=75309514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011443460.3A Pending CN112634939A (en) 2020-12-11 2020-12-11 Audio identification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112634939A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120312145A1 (en) * 2011-06-09 2012-12-13 Ujam Inc. Music composition automation including song structure
CN107093419A (en) * 2016-02-17 2017-08-25 广州酷狗计算机科技有限公司 A kind of dynamic vocal accompaniment method and apparatus
CN107799119A (en) * 2016-09-07 2018-03-13 中兴通讯股份有限公司 Audio preparation method, apparatus and system
JP2018189735A (en) * 2017-04-28 2018-11-29 株式会社第一興商 Karaoke device
CN109903779A (en) * 2019-01-08 2019-06-18 北京雷石天地电子技术有限公司 A kind of singing songs evaluation method and system
CN110675893A (en) * 2019-09-19 2020-01-10 腾讯音乐娱乐科技(深圳)有限公司 Song identification method and device, storage medium and electronic equipment
CN110808069A (en) * 2019-11-11 2020-02-18 上海瑞美锦鑫健康管理有限公司 Evaluation system and method for singing songs
CN111445923A (en) * 2020-03-26 2020-07-24 腾讯音乐娱乐科技(深圳)有限公司 Singing-turning recognition method and device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Ting et al., "A Similarity Fusion Algorithm for Cover Song Identification", Journal of East China University of Science and Technology (Natural Science Edition), vol. 42, no. 6, pages 845-850 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515662A (en) * 2021-07-09 2021-10-19 北京百度网讯科技有限公司 Similar song retrieval method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US11657798B2 (en) Methods and apparatus to segment audio and determine audio segment similarities
US9653056B2 (en) Evaluation of beats, chords and downbeats from a musical audio signal
US10776422B2 (en) Dual sound source audio data processing method and apparatus
CN103971689B (en) A kind of audio identification methods and device
EP2659481B1 (en) Scene change detection around a set of seed points in media data
US8386258B2 (en) Systems and methods for recognizing sound and music signals in high noise and distortion
EP2816550A1 (en) Audio signal analysis
CN109147796B (en) Speech recognition method, device, computer equipment and computer readable storage medium
CN107507626B (en) Mobile phone source identification method based on voice frequency spectrum fusion characteristics
CN103440330A (en) Music program information acquisition method and equipment
CN111192594B (en) Method for separating voice and accompaniment and related product
CN111640411B (en) Audio synthesis method, device and computer readable storage medium
CN113573161B (en) Multimedia data processing method, device, equipment and storage medium
WO2016189307A1 (en) Audio identification method
CN111210850B (en) Lyric alignment method and related product
CN112259123A (en) Drum point detection method and device and electronic equipment
CN112634939A (en) Audio identification method, device, equipment and medium
CN111243618B (en) Method, device and electronic equipment for determining specific voice fragments in audio
Felipe et al. Acoustic scene classification using spectrograms
CN111667803B (en) Audio processing method and related products
Zanoni et al. Efficient music identification approach based on local spectrogram image descriptors
CN103297674A (en) Signal processing apparatus, system and method, and program, electric device
JP5772957B2 (en) Sound processing apparatus, sound processing system, video processing system, control method, and control program
Gao et al. Popular song summarization using chorus section detection from audio signal
CN112542159B (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination