CN112784100A

CN112784100A - Audio fingerprint processing method and device, computer equipment and storage medium

Info

Publication number: CN112784100A
Application number: CN202110292844.8A
Authority: CN
Inventors: 李敬; 何莹男
Original assignee: Bigo Technology Singapore Pte Ltd
Current assignee: Bigo Technology Singapore Pte Ltd
Priority date: 2021-03-18
Filing date: 2021-03-18
Publication date: 2021-05-11
Also published as: WO2022194277A1

Abstract

An embodiment of the invention provides an audio fingerprint processing method and device, computer equipment, and a storage medium. The method comprises: generating target fingerprint data for target audio data; matching the target fingerprint data against reference fingerprint data in a first audio fingerprint library and in a second audio fingerprint library; if both matches fail, calling a music query service interface to query copyright information of the target audio data; if the copyright information is found, storing the target fingerprint data in the first audio fingerprint library as new reference fingerprint data and recording the copyright information of the target audio data; and if the copyright information is not found, storing the target fingerprint data in the second audio fingerprint library as new reference fingerprint data. This establishes a joint hierarchical query mechanism that reduces the number of calls to the music query service interface and thereby reduces operating cost.

Description

Audio fingerprint processing method and device, computer equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of audio processing, in particular to a method and a device for processing an audio fingerprint, computer equipment and a storage medium.

Background

With the rapid development of the internet, and especially the wide adoption of mobile terminals, users can conveniently produce multimedia data such as short videos, hummed songs, and recordings. The volume of multimedia data on the internet is therefore growing rapidly, and the volume of audio data is growing rapidly with it.

In business scenarios such as song search and voice content auditing, audio data are compared to determine whether they are the same or similar.

Because of the large amount of audio data, some music copyright parties collect audio data, record its copyright information, and expose a Music Query Service Interface (MQSI) through which they offer an independent music query service.

In scenarios such as short video, the amount of audio data uploaded to a platform by clients each day can reach tens of millions or even hundreds of millions of items. Multimedia data such as short videos are updated quickly, so new audio data that the music copyright party has not yet recorded appear easily. If the music query service interface is called to query such audio data, no relevant information can be found, and query efficiency is therefore low.

Disclosure of Invention

The embodiment of the invention provides a method and a device for processing audio fingerprints, computer equipment and a storage medium, to solve the problem that, because a large volume of multimedia data is updated quickly, calling a music query service interface to query audio data is inefficient and costly to operate.

In a first aspect, an embodiment of the present invention provides an audio fingerprint processing method, including:

generating target fingerprint data for the target audio data;

matching the target fingerprint data with reference fingerprint data in a first audio fingerprint database and a second audio fingerprint database respectively;

if the target fingerprint data fails to be matched with the reference fingerprint data in the first audio fingerprint database and the second audio fingerprint database, calling a music query service interface to query the copyright information of the target audio data;

if the copyright information of the target audio data is found, storing the target fingerprint data into the first audio fingerprint database to serve as new reference fingerprint data in the first audio fingerprint database, and recording the copyright information of the target audio data;

and if the copyright information of the target audio data is not found, storing the target fingerprint data into the second audio fingerprint database as new reference fingerprint data in the second audio fingerprint database.
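The steps of the first aspect can be sketched as follows. This is only a minimal illustration under stated assumptions, not the claimed implementation: the class FingerprintLibrary, the function process_fingerprint, the exact-lookup matching, and the query_service callable standing in for the music query service interface are all hypothetical names introduced here.

```python
from typing import Callable, Dict, Optional


class FingerprintLibrary:
    """Hypothetical in-memory stand-in for an audio fingerprint library."""

    def __init__(self) -> None:
        self._entries: Dict[int, str] = {}  # fingerprint -> audio identifier

    def match(self, fingerprint: int) -> Optional[str]:
        # Assumption: exact lookup; a real matcher would compare by similarity.
        return self._entries.get(fingerprint)

    def store(self, fingerprint: int, audio_id: str) -> None:
        self._entries[fingerprint] = audio_id


def process_fingerprint(
    fingerprint: int,
    audio_id: str,
    first_library: FingerprintLibrary,
    second_library: FingerprintLibrary,
    copyright_records: Dict[str, str],
    query_service: Callable[[str], Optional[str]],
) -> str:
    """Return which tier answered: 'first', 'second', or 'service'."""
    if first_library.match(fingerprint) is not None:
        return "first"
    if second_library.match(fingerprint) is not None:
        return "second"
    # Both local matches failed, so the external interface is called.
    info = query_service(audio_id)
    if info is not None:
        # Copyright information found: new reference data in the first library.
        first_library.store(fingerprint, audio_id)
        copyright_records[audio_id] = info
    else:
        # Nothing found: new reference data in the second library.
        second_library.store(fingerprint, audio_id)
    return "service"
```

In this sketch, a second upload of the same fingerprint is answered by one of the two libraries, so query_service is not called again for it.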

In a second aspect, an embodiment of the present invention further provides an apparatus for processing an audio fingerprint, including:

the fingerprint data generation module is used for generating target fingerprint data for the target audio data;

the fingerprint data matching module is used for matching the target fingerprint data with reference fingerprint data in a first audio fingerprint database and a second audio fingerprint database respectively;

the interface query module is used for calling a music query service interface to query the copyright information of the target audio data if the target fingerprint data fails to be matched with the reference fingerprint data in the first audio fingerprint database and the second audio fingerprint database;

the first updating module is used for storing the target fingerprint data into the first audio fingerprint database as new reference fingerprint data in the first audio fingerprint database and recording the copyright information of the target audio data if the copyright information of the target audio data is found;

and the second updating module is used for storing the target fingerprint data into the second audio fingerprint database as new reference fingerprint data in the second audio fingerprint database if the copyright information of the target audio data is not found.

In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:

one or more processors;

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of processing an audio fingerprint as described in the first aspect.

In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for processing an audio fingerprint according to the first aspect.

In this embodiment, target fingerprint data is generated for target audio data and matched against reference fingerprint data in a first audio fingerprint library and in a second audio fingerprint library. If both matches fail, a music query service interface is called to query copyright information of the target audio data. If the copyright information is found, the target fingerprint data is stored in the first audio fingerprint library as new reference fingerprint data and the copyright information is recorded; if it is not found, the target fingerprint data is stored in the second audio fingerprint library as new reference fingerprint data. With the result of the music query service interface as the basis for classification, the two libraries separate audio data that has copyright information from audio data that does not, and newly appearing audio data is recorded, which improves the search success rate. The first audio fingerprint library, the second audio fingerprint library, and the music query service interface together form a joint hierarchical query mechanism: the two libraries are searched first, and the music query service interface is called only afterwards. The fingerprint data in the two libraries is thus used effectively, the number of calls to the music query service interface is reduced, and operating cost is reduced.

Drawings

Fig. 1 is a flowchart of a method for processing an audio fingerprint according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a method for processing an audio fingerprint according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an apparatus for processing an audio fingerprint according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Embodiment one

Fig. 1 is a flowchart of an audio fingerprint processing method according to a first embodiment of the present invention. This embodiment is applicable to classifying fingerprint libraries hierarchically so as to reduce how often the music query service interface is invoked. The method may be executed by an audio fingerprint processing apparatus, which may be implemented in software and/or hardware and configured in a computer device such as a server, a workstation, or a personal computer. The method specifically includes the following steps:

Step 101, generating target fingerprint data for the target audio data.

In this embodiment, the computer device may obtain audio data in different ways: for example, a user uploads the audio data, the audio data is purchased from a copyright party, a technician records the audio data, or a crawler client crawls the audio data from a network.

The audio data may be in the form of a song released by a singer, audio data separated from video data such as short videos, movies, and dramas, a voice signal recorded by a user in a mobile terminal, and the like, and the format of the audio data may include MP3, WMA, AAC, and the like, which is not limited in this embodiment.

If the computer device is used as a multimedia platform, on one hand, audio-based services can be provided for the user, for example, live programs, short videos, voice sessions, video sessions, and the like can be provided for the user, and on the other hand, files which are uploaded by the user and carry audio, for example, live data, short videos, session information, and the like, can be received.

Different multimedia platforms can formulate video content auditing standards according to factors such as business and law. Before a file carrying audio is published, its content is audited against these standards. Files that do not comply, such as files containing pornographic, vulgar, or violent content, are filtered out, and files that comply are published.

If the requirement on the real-time performance is high, a streaming real-time system can be arranged in the multimedia platform, a user uploads a file carrying audio to the streaming real-time system in real time through a client, and the streaming real-time system can transmit the file carrying audio to computer equipment for content auditing.

If the requirement on the real-time performance is low, a database such as a distributed database can be set in the multimedia platform, a user uploads a file carrying audio to the database through a client, and computer equipment for content auditing can read the file carrying audio from the database.

In a multimedia platform, fingerprint data is calculated for the files carrying audio that users upload and for the users' audio data. The fingerprint data characterizes the audio data using information such as the peaks in its frequency spectrum and their relative positions, and is unique to each piece of audio data, so that services such as audio search and content auditing can be implemented based on audio fingerprints.

For the convenience of distinction, in this embodiment, the file carrying the audio and the audio data may be referred to as target audio data, and the fingerprint data generated for the target audio data is referred to as target fingerprint data.

In one embodiment of the present invention, step 101 may include the steps of:

Step 1011, dividing the target audio data into a plurality of frames of audio signals.

In this embodiment, the target audio data may be sliced at intervals of a preset length, thereby obtaining a multi-frame audio signal.
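A minimal sketch of this slicing is given below. The function name split_into_frames and the choice to zero-pad the last frame are assumptions made for illustration; the embodiment only states that the data is sliced at intervals of a preset length.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_length: int) -> List[List[float]]:
    """Slice samples into consecutive frames of frame_length samples.

    Assumption: the last frame is zero-padded rather than dropped.
    """
    if frame_length <= 0:
        raise ValueError("frame_length must be positive")
    frames = []
    for start in range(0, len(samples), frame_length):
        frame = list(samples[start:start + frame_length])
        frame.extend([0.0] * (frame_length - len(frame)))
        frames.append(frame)
    return frames
```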

Step 1012, converting the audio signal into a spectrogram.

In this embodiment, the fingerprint data is obtained by analyzing the frequency characteristics of the audio signal. To analyze frequency more intuitively, the time-domain audio signal is usually converted into the frequency domain to obtain a spectrogram, whose horizontal axis (X coordinate) is time and whose vertical axis (Y coordinate) is frequency.

In a specific implementation, the audio signal may be converted into a spectrogram by a discrete Fourier transform (DFT), a short-time Fourier transform (STFT), or the like. The DFT reflects the average frequency content of the audio signal and cannot reflect how the frequencies change over time. The STFT overcomes this weakness by applying a window to the audio signal, and can reflect both the frequency intensity and its change over time.

Further, time information is lost when a time-domain signal is converted into a frequency-domain signal. The short-time Fourier transform therefore divides a long time-domain audio signal into a plurality of data blocks (also called windows) and converts each data block into the frequency domain separately, thereby retaining time information to a certain extent.
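The block-wise transform can be sketched as follows. This is an illustrative sketch using a naive DFT from the Python standard library, not an efficient FFT; the function names, the periodic Hann window, and the use of non-overlapping blocks are assumptions introduced here.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum(block: Sequence[float]) -> List[float]:
    """Magnitudes of the DFT of one Hann-windowed block (bins 0..N/2)."""
    n = len(block)
    # Assumption: a periodic Hann window is applied to each block.
    windowed = [
        x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
        for i, x in enumerate(block)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(
            windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)
        )
        spectrum.append(abs(total))
    return spectrum


def spectrogram(samples: Sequence[float], block_size: int) -> List[List[float]]:
    """One magnitude spectrum per non-overlapping block; trailing samples are dropped."""
    return [
        magnitude_spectrum(samples[start:start + block_size])
        for start in range(0, len(samples) - block_size + 1, block_size)
    ]
```

Each inner list is one column of the spectrogram (one point in time), and the position within it is the frequency bin.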

For example, suppose the audio signal is two-channel with 16-bit precision and 44100 Hz sampling. The data size of 1 s is then 44100 × 2 bytes × 2 channels ≈ 176 kB. If 4 kB is chosen as the size of the data block, 44 data blocks are short-time Fourier transformed per second, and such a slicing density may suffice.
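The arithmetic in this example can be checked with a few lines. The constant names are illustrative, and 1 kB is taken as 1000 bytes, which is the convention under which the figures 176 kB and 44 blocks are obtained.

```python
SAMPLE_RATE_HZ = 44100
BYTES_PER_SAMPLE = 2   # 16-bit precision
CHANNELS = 2
BLOCK_SIZE_KB = 4

# 44100 samples/s x 2 bytes x 2 channels
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
# Assumption: 1 kB = 1000 bytes, rounded down to whole kilobytes.
kilobytes_per_second = bytes_per_second // 1000
blocks_per_second = kilobytes_per_second // BLOCK_SIZE_KB
```

Here bytes_per_second is 176400, kilobytes_per_second is 176, and blocks_per_second is 44. If a block were instead taken as 4096 bytes, the result would be roughly 43 blocks per second.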

Step 1013, traverse the data points representing the peak on the spectrogram to serve as peak points.

The frequencies at which the amplitude of the audio signal is large may span a wide range, from a bass C (32.70 Hz) to a treble C (4186.01 Hz). To avoid analyzing the entire spectrogram and to reduce the amount of computation, the spectrogram can be divided into a plurality of frequency bands (also called sub-bands).

From each sub-band, a key point whose frequency is at a peak is selected as a peak point. A peak is a point before which the amplitude rises by a sufficient amount and after which it falls by a sufficient amount. For example, the following sub-bands may be selected: the bass sub-bands 30 Hz-40 Hz, 40 Hz-80 Hz, and 80 Hz-120 Hz (the fundamental frequencies of instruments such as the bass guitar fall in the bass sub-bands), and the mid and treble sub-bands 120 Hz-180 Hz and 180 Hz-300 Hz (the fundamental frequencies of the human voice and most other instruments fall in these two sub-bands).

Since points with larger energy (i.e., larger amplitude on the spectrogram) are more resistant to noise, the peak points may be selected by energy within each sub-band. In general, the point with the largest amplitude (i.e., the largest energy) in each sub-band may be selected as the peak point.
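The selection of one peak point per sub-band can be sketched as follows, using the sub-band edges given in the example above. The function name pick_peaks, the representation of one frame as a mapping from frequency to amplitude, and the half-open band boundaries are assumptions introduced for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Sub-band edges in Hz, taken from the example in the text.
SUB_BANDS_HZ = [(30, 40), (40, 80), (80, 120), (120, 180), (180, 300)]


def pick_peaks(
    magnitudes: Dict[float, float],
) -> List[Optional[Tuple[float, float]]]:
    """For one frame, return the strongest (frequency, amplitude) point per sub-band.

    Assumption: a band covers low <= f < high; an empty band yields None.
    """
    peaks: List[Optional[Tuple[float, float]]] = []
    for low, high in SUB_BANDS_HZ:
        in_band = [(f, m) for f, m in magnitudes.items() if low <= f < high]
        peaks.append(max(in_band, key=lambda fm: fm[1]) if in_band else None)
    return peaks
```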

Step 1014, extracting the feature information of the peak points.

In the present embodiment, characteristics of the peak points themselves and characteristics between the peak points may be analyzed as the feature information.

In one example, the frequency value of the peak point may be queried as the characteristic information of the peak point.

In another example, traversing each peak point, a first distance in time between the current peak point and other peak points may be measured as characteristic information of the peak point.

In this example, since each frame of the audio signal corresponds to a distinct time, the number of frames separating the current peak point from another peak point can be counted as the first distance in time, using one frame of the audio signal as the unit of time.

Wherein, the other peak points are the peak points except the current peak point.

Further, the closer the current peak point is to other peak points in time, the higher the correlation between the current peak point and other peak points, so that, for the current peak point, other peak points located in the neighborhood of the current peak point in the time dimension are found, and the first distance between the current peak point and other peak points in time is calculated as the feature information of the peak point.

In addition, other peak points outside the neighborhood of the current peak point can be ignored, and the calculation amount is reduced while the accuracy of the feature information is maintained.

In yet another example, a second distance in frequency from the current peak point to other peak points may be measured as the characteristic information of the peak point.

Further, the closer the current peak point and other peak points are in frequency, the higher the correlation between the current peak point and other peak points, so that, for the current peak point, other peak points located in the neighborhood of the current peak point in the dimension of frequency are found, and the second distance between the current peak point and other peak points in frequency is calculated as the feature information of the peak point.

It should be noted that the frequency value, the first distance, and the second distance may each be used alone as the feature information of the peak point, or may be combined in any way, which is not limited in this embodiment. When the frequency value, the first distance, and the second distance are all used as the feature information of the peak point, the characteristics of the peak point are reflected in multiple dimensions, which improves the accuracy of the feature information.

Of course, the above feature information of the peak point is only an example. When implementing the embodiment of the present invention, other feature information of the peak point may be set according to the actual situation, or adopted by a person skilled in the art according to actual needs, which is not limited in the embodiment of the present invention.
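The three kinds of feature information can be combined as in the following sketch. The types Peak and Feature, the function extract_features, the neighborhood sizes, and the decision to pair each peak point only with later peak points are assumptions introduced for illustration, not details fixed by the embodiment.

```python
from typing import List, NamedTuple, Sequence


class Peak(NamedTuple):
    frame: int      # index of the audio frame containing the peak point
    frequency: int  # frequency value of the peak point, in Hz


class Feature(NamedTuple):
    frequency: int        # frequency value of the current peak point
    first_distance: int   # distance in time, counted in frames
    second_distance: int  # distance in frequency, in Hz


def extract_features(
    peaks: Sequence[Peak], max_frames: int = 3, max_hz: int = 100
) -> List[Feature]:
    """Pair each peak point with other peak points inside its neighborhood."""
    features = []
    for current in peaks:
        for other in peaks:
            dt = other.frame - current.frame
            df = abs(other.frequency - current.frequency)
            # Assumption: only later peak points are paired, so each pair
            # is counted once; peak points outside the neighborhood are ignored.
            if 0 < dt <= max_frames and df <= max_hz:
                features.append(Feature(current.frequency, dt, df))
    return features
```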

Step 1015, calculate a hash value for the feature information as the target fingerprint data of the target audio data.

For the feature information of the target audio data, a hash value (also called a hash) can be calculated according to a preset hash algorithm; the hash value uniquely identifies the target audio data and serves as its target fingerprint data.

In one example, the feature information of the peak point consists of the frequency value of the peak point itself, the first distance in time between the current peak point and other peak points, and the second distance in frequency between the current peak point and other peak points. In this example, the frequency value, the first distance, and the second distance may each be converted to a binary format. Once the conversion is complete, the binary frequency value, first distance, and second distance are concatenated into the target fingerprint data of the target audio data according to a preset arrangement rule, for example frequency value first, first distance in the middle, and second distance last, or second distance first, first distance in the middle, and frequency value last. Fingerprint data in binary format is easy to inspect and is easily converted back into the original frequency value, first distance, and second distance, which facilitates development and debugging and reduces development cost.

Of course, the above way of calculating the hash value is only an example. When implementing the embodiment of the present invention, other ways of calculating the hash value may be set according to the actual situation; for example, an algorithm such as MD5 (Message Digest Algorithm 5) or SHA (Secure Hash Algorithm) may be used to calculate a hash value from the frequency value, the first distance, and the second distance, which is not limited in the embodiment of the present invention. In addition, a person skilled in the art may also adopt other ways of calculating the hash value according to actual needs, and the embodiment of the present invention is not limited thereto.
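Both ways of forming the fingerprint can be sketched as follows. The bit widths, the function names, the "frequency value first" arrangement, and the choice of SHA-256 from the Python hashlib module as the SHA variant are assumptions introduced for illustration.

```python
import hashlib
from typing import Tuple

# Hypothetical field widths in bits; the text does not fix them.
FREQ_BITS, TIME_BITS, DELTA_BITS = 12, 6, 10


def pack_fingerprint(frequency: int, first_distance: int, second_distance: int) -> int:
    """Concatenate the three fields in binary: frequency, first, second."""
    fields = ((frequency, FREQ_BITS), (first_distance, TIME_BITS),
              (second_distance, DELTA_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its bit width")
    return ((frequency << (TIME_BITS + DELTA_BITS))
            | (first_distance << DELTA_BITS)
            | second_distance)


def unpack_fingerprint(fingerprint: int) -> Tuple[int, int, int]:
    """Recover (frequency, first_distance, second_distance) for debugging."""
    second = fingerprint & ((1 << DELTA_BITS) - 1)
    first = (fingerprint >> DELTA_BITS) & ((1 << TIME_BITS) - 1)
    frequency = fingerprint >> (TIME_BITS + DELTA_BITS)
    return frequency, first, second


def digest_fingerprint(frequency: int, first_distance: int, second_distance: int) -> str:
    """Alternative: hash the three fields with a SHA-family algorithm."""
    payload = f"{frequency}|{first_distance}|{second_distance}".encode("ascii")
    return hashlib.sha256(payload).hexdigest()
```

The packed form can be reversed with unpack_fingerprint, which is the debugging convenience described above; the digest form cannot be reversed.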

Step 102, matching the target fingerprint data with reference fingerprint data in a first audio fingerprint database and a second audio fingerprint database respectively.

In this embodiment, two independent databases may be constructed as the first audio fingerprint library and the second audio fingerprint library. The first audio fingerprint library stores reference fingerprint data of audio data for which copyright information was found through the music query service interface, and the second audio fingerprint library stores reference fingerprint data of audio data for which no copyright information was found through the music query service interface.

Initially, the first audio fingerprint library and the second audio fingerprint library may be empty. Alternatively, reference fingerprint data for a batch of audio data whose copyright status has been verified, for example by local manual verification or verification by another organization, may be stored in the two libraries as seeds, which is not limited in this embodiment.

In general, the reference fingerprint data also belongs to the fingerprint data of the audio data, and the reference fingerprint data is generated in the same manner as the target fingerprint data.

Once the target fingerprint data of the target audio data has been generated, it can be matched against the reference fingerprint data in the first audio fingerprint library and against the reference fingerprint data in the second audio fingerprint library, so as to determine whether it matches the reference fingerprint data in either library.

Considering that most audio data has copyright information and only a smaller share is original creation without copyright information, matching against the first audio fingerprint library may be given higher priority than matching against the second audio fingerprint library. That is, the target fingerprint data is first matched against the reference fingerprint data in the first audio fingerprint library. If that match fails, the target fingerprint data is then matched against the reference fingerprint data in the second audio fingerprint library; if the target fingerprint data successfully matches any reference fingerprint data in the first audio fingerprint library, matching against the second audio fingerprint library is skipped. Under this condition, the probability of a successful match in the first audio fingerprint library is higher and that in the second audio fingerprint library is lower, so matching the first library preferentially reduces the amount of computation spent on the second library and improves matching efficiency.

Of course, the priorities may also be reversed, with matching against the first audio fingerprint library given lower priority than matching against the second audio fingerprint library. That is, the target fingerprint data is first matched against the reference fingerprint data in the second audio fingerprint library. If that match fails, the target fingerprint data is then matched against the reference fingerprint data in the first audio fingerprint library; if the target fingerprint data successfully matches any reference fingerprint data in the second audio fingerprint library, matching against the first audio fingerprint library is skipped. This is not limited in this embodiment.
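Priority-ordered matching with early stopping can be sketched as follows. The function match_in_priority_order and the representation of each library as a named lookup callable are assumptions introduced for illustration; either priority order is obtained simply by reordering the list.

```python
from typing import Callable, Optional, Sequence, Tuple

# A matcher takes a fingerprint and returns an audio identifier, or None.
Matcher = Callable[[int], Optional[str]]


def match_in_priority_order(
    fingerprint: int, libraries: Sequence[Tuple[str, Matcher]]
) -> Tuple[Optional[str], Optional[str], int]:
    """Return (library name, matched identifier, number of libraries searched)."""
    searched = 0
    for name, matcher in libraries:
        searched += 1
        hit = matcher(fingerprint)
        if hit is not None:
            # Stop as soon as one library matches; later ones are skipped.
            return name, hit, searched
    return None, None, searched
```

The third element of the result shows the saving: a hit in the higher-priority library costs one search instead of two.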

In a specific implementation, the target audio data may be a long audio that is divided into multiple frames of audio signals for calculating target fingerprint data, so the target audio data may include a plurality of target fingerprint data. Multimedia data such as short videos often reuse part of audio data that has copyright information, such as the climax of a song. In this case, the similarity between each target fingerprint data and the reference fingerprint data in the first audio fingerprint library and the second audio fingerprint library can be calculated. If the similarities between n consecutive target fingerprint data and n reference fingerprint data (n is a positive integer) are all greater than a preset threshold, and the number of frames between every two consecutive target fingerprint data (that is, the number of other target fingerprint data between them) is the same as the number of frames between the corresponding two reference fingerprint data (that is, the number of other reference fingerprint data between them), in other words the relative positions of the target fingerprint data agree with the relative positions of the reference fingerprint data, then the target fingerprint data and the reference fingerprint data can be determined to match successfully. Comparing both similarity and relative position keeps the match stable and thus ensures the accuracy of matching the target fingerprint data with the reference fingerprint data.
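The combined check of similarity and relative position can be sketched as follows. The bit-wise similarity measure, the 32-bit fingerprint width, the threshold value, and the function names are assumptions introduced for illustration; the embodiment does not specify how similarity is computed. Equal frame spacing between consecutive pairs is detected here as a constant frame offset between the target and the reference.

```python
from typing import Dict, Sequence, Tuple

Entry = Tuple[int, int]  # (frame index, fingerprint)


def similarity(a: int, b: int, bits: int = 32) -> float:
    """Assumed measure: fraction of equal bits between two fingerprints."""
    differing = bin((a ^ b) & ((1 << bits) - 1)).count("1")
    return 1.0 - differing / bits


def is_match(
    target: Sequence[Entry],
    reference: Sequence[Entry],
    n: int,
    threshold: float = 0.9,
) -> bool:
    """True if n consecutive target fingerprints match references at equal spacing."""
    runs: Dict[int, int] = {}  # frame offset -> run length ending at previous entry
    for t_frame, t_fp in target:
        next_runs: Dict[int, int] = {}
        for r_frame, r_fp in reference:
            if similarity(t_fp, r_fp) > threshold:
                # Same offset as the previous pair means the same frame spacing.
                offset = r_frame - t_frame
                next_runs[offset] = runs.get(offset, 0) + 1
                if next_runs[offset] >= n:
                    return True
        runs = next_runs
    return False
```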

Step 103, if the target fingerprint data fails to be matched with the reference fingerprint data in the first audio fingerprint database and the second audio fingerprint database, calling a music query service interface to query the copyright information of the target audio data.

If the target fingerprint data fails to match any reference fingerprint data in the first audio fingerprint library and any reference fingerprint data in the second audio fingerprint library, no audio data identical or similar to the target audio data was found locally on the computer device, and the target audio data is more likely to be new audio data. The music query service interface can then be called to query the copyright information of the target audio data.

Step 104, if the copyright information of the target audio data is found, storing the target fingerprint data into the first audio fingerprint database to serve as new reference fingerprint data in the first audio fingerprint database, and recording the copyright information of the target audio data.

If the server of the music copyright party returns the copyright information of the target audio data through the music query service interface, the target fingerprint data can be stored in the first audio fingerprint library, where it becomes new reference fingerprint data. In addition, the copyright information of the target audio data is recorded in another table, database, or the like, and can be associated with the new reference fingerprint data in the first audio fingerprint library by using the identifier (such as an ID) of the target audio data as an index.

In one storage method, the target fingerprint data is used as the key, and the identifier (such as an ID) of the target audio data together with the serial number of the audio signal to which the target fingerprint data belongs is used as the value, generating a key-value pair (key, value). The audio signal is one frame of signal in the target audio data.

The key-value pair (key, value) is stored in the first audio fingerprint library at the position whose index value index equals the target fingerprint data (i.e., index = key), as new reference fingerprint data in the first audio fingerprint library.

For each index value index, b (b is a positive integer, e.g., 2^N) target fingerprints with the same key but different values may be stored, so that the first audio fingerprint library forms a data table with a rows and b columns (a is the length of the key, that is, the length of the target fingerprint data, and is a positive integer), which improves storage efficiency and makes lookup simpler.
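This storage method can be sketched as follows. The class FingerprintTable, the use of a dictionary as a sparse stand-in for the table of rows, and the policy of rejecting a new value once a row already holds b values are assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple

Value = Tuple[str, int]  # (audio identifier, serial number of the frame)


class FingerprintTable:
    """Sparse sketch of a table with one row per key and b columns per row."""

    def __init__(self, slots_per_key: int = 4) -> None:
        self.slots_per_key = slots_per_key  # b, e.g. 2**N
        self._rows: Dict[int, List[Value]] = {}

    def store(self, key: int, audio_id: str, frame_number: int) -> bool:
        """Store (key, value) at index == key; return False if the row is full."""
        row = self._rows.setdefault(key, [])
        value = (audio_id, frame_number)
        if value in row:
            return True
        if len(row) >= self.slots_per_key:
            # Assumption: a full row rejects further values.
            return False
        row.append(value)
        return True

    def lookup(self, key: int) -> List[Value]:
        """Return every value stored under the same key."""
        return list(self._rows.get(key, []))
```

Because the index equals the key, a lookup goes directly to one row and returns all audio identifiers and frame serial numbers that share that fingerprint.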

Of course, the above-mentioned manner of storing the target fingerprint data in the first audio fingerprint library is only an example, and when implementing the embodiment of the present invention, other manners of storing the target fingerprint data in the first audio fingerprint library may be set according to actual situations, for example, a key value pair is generated by using the identifier of the target audio data as a key and all target fingerprint data of the target audio data as values, and the key value pair is stored in the first audio fingerprint library, and so on, which is not limited in this embodiment of the present invention. In addition, in addition to the above-mentioned way of storing the target fingerprint data in the first audio fingerprint library, a person skilled in the art may also adopt other ways of storing the target fingerprint data in the first audio fingerprint library according to actual needs, and the embodiment of the present invention is not limited thereto.

Step 105, if the copyright information of the target audio data is not queried, storing the target fingerprint data in the second audio fingerprint library as new reference fingerprint data in the second audio fingerprint library.

If the server of the music copyright party returns, through the music query service interface, that the target audio data has no copyright information, the target fingerprint data may be stored in the second audio fingerprint library as new reference fingerprint data in the second audio fingerprint library.

In one storage method, a key-value pair (key, value) is generated with the target fingerprint data as the key and with the identifier of the target audio data and the sequence number of the audio signal to which the target fingerprint data belongs as the value, and the key-value pair is stored in the second audio fingerprint library at the position whose index value index is the same as the target fingerprint data (i.e., index = key), as new reference fingerprint data in the second audio fingerprint library.

For each index value index, b (b is a positive integer, e.g., 2^N) target fingerprints with the same key but different values may be stored, so that a data table with a rows (a is the length of the key, i.e., the length of the target fingerprint data, and is a positive integer) and b columns is formed in the second audio fingerprint library, which improves storage efficiency and simplifies lookup.

Of course, the above manner of storing the target fingerprint data in the second audio fingerprint library is only an example. When implementing the embodiment of the present invention, other storage manners may be set according to actual situations, for example, generating a key-value pair (key, value) with the identifier of the target audio data as the key and all target fingerprint data of the target audio data as the value, and storing the key-value pair (key, value) in the second audio fingerprint library, which is not limited in the embodiment of the present invention. In addition, those skilled in the art may adopt other manners of storing the target fingerprint data in the second audio fingerprint library according to actual needs, which is likewise not limited in the embodiment of the present invention.

It should be noted that the manner of storing the target fingerprint data in the first audio fingerprint library may be the same as or different from the manner of storing the target fingerprint data in the second audio fingerprint library, which is not limited in this embodiment.
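The overall flow of this embodiment (match against both libraries, call the music query service interface only when both matches fail, then store the fingerprints according to the result) can be sketched as follows. This is an illustration only: `process_target`, `query_service`, and the returned branch labels are hypothetical, matching is reduced to exact set membership, and checking the first library before the second is an assumption.

```python
def process_target(fingerprints, audio_id, first_lib, second_lib,
                   copyright_records, query_service):
    """Return a label naming the branch that handled the target audio.

    first_lib / second_lib: sets of reference fingerprints (simplified).
    query_service: callable standing in for the music query service
    interface; returns copyright info, or None if nothing is found.
    """
    # Local matching first; a hit avoids the external call entirely.
    if any(fp in first_lib for fp in fingerprints):
        return "matched_first"
    if any(fp in second_lib for fp in fingerprints):
        return "matched_second"

    # Both matches failed: this is the only place the interface is called.
    info = query_service(audio_id)
    if info is not None:
        first_lib.update(fingerprints)        # new reference data, first library
        copyright_records[audio_id] = info    # record the copyright information
        return "stored_first"

    second_lib.update(fingerprints)           # new reference data, second library
    return "stored_second"
```

In this sketch, audio without copyright information is remembered in the second library, so a later upload of the same audio is resolved locally instead of triggering another call to the interface.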

Example two

Fig. 2 is a flowchart of a processing method of audio fingerprints according to a second embodiment of the present invention. Based on the foregoing embodiment, this embodiment further adds the operations of clustering the target audio data, managing the reference fingerprint data with a lifetime, and transferring reference fingerprint data between libraries. The method specifically includes the following steps:

Step 201, generating target fingerprint data for the target audio data.

Step 202, matching the target fingerprint data with reference fingerprint data in the first audio fingerprint database and the second audio fingerprint database respectively.

Step 203, if the target fingerprint data fails to match with the reference fingerprint data in the first audio fingerprint database and the second audio fingerprint database, calling a music query service interface to query the copyright information of the target audio data.

Step 204, if the copyright information of the target audio data is queried, storing the target fingerprint data in the first audio fingerprint library as new reference fingerprint data in the first audio fingerprint library.

Step 205, using the target audio data as new reference audio data, and generating a new cluster for the new reference audio data.

In this embodiment, if the copyright information of the target audio data is queried through the music query service interface, this indicates that no audio data identical or similar to the target audio data has been stored locally in the computer device. In this case, in addition to recording the copyright information of the target audio data, the target audio data may be set as new reference audio data, and a new cluster may be generated for the new reference audio data, the cluster being used to group identical or similar audio data.

Step 206, if the copyright information of the target audio data is not queried, storing the target fingerprint data in the second audio fingerprint library as new reference fingerprint data in the second audio fingerprint library.

Step 207, if the target fingerprint data is successfully matched with the reference fingerprint data in the first audio fingerprint database, adding the target audio data to the cluster to which the reference audio data belongs.

In this embodiment, if the target fingerprint data is successfully matched with the reference fingerprint data in the first audio fingerprint library, this indicates that audio data identical or similar to the target audio data has already been stored locally in the computer device. For ease of distinction, this audio data is referred to as reference audio data, i.e., the reference fingerprint data in the first audio fingerprint library belongs to the reference audio data. In this case, the cluster to which the reference audio data belongs may be looked up, and the target audio data may be added to that cluster, so that identical or similar audio data is grouped into the same cluster, which facilitates subsequent cluster-based service processing such as user classification and song recommendation.

Step 208, if the target fingerprint data is successfully matched with the reference fingerprint data in the second audio fingerprint library, adding the target audio data to the cluster to which the reference audio data belongs.

In this embodiment, if the target fingerprint data is successfully matched with the reference fingerprint data in the second audio fingerprint library, this indicates that audio data identical or similar to the target audio data has already been stored locally in the computer device. For ease of distinction, this audio data is referred to as reference audio data, i.e., the reference fingerprint data in the second audio fingerprint library belongs to the reference audio data. In this case, the cluster to which the reference audio data belongs may be looked up, and the target audio data may be added to that cluster, so that identical or similar audio data is grouped into the same cluster, which facilitates subsequent cluster-based service processing such as user classification and song recommendation.
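The cluster bookkeeping of steps 205, 207, and 208 can be sketched as follows. `ClusterIndex` and its method names are hypothetical. The text does not state when a cluster is created for reference audio data that was stored in the second audio fingerprint library, so creating one lazily on the first match is an assumption of this sketch.

```python
class ClusterIndex:
    """Hypothetical bookkeeping of which audio belongs to which cluster."""

    def __init__(self):
        self.cluster_of = {}   # audio id -> cluster id
        self.members = {}      # cluster id -> list of audio ids
        self._next_id = 0

    def new_cluster(self, reference_audio_id):
        """Step 205: start a new cluster for new reference audio data."""
        cluster_id = self._next_id
        self._next_id += 1
        self.cluster_of[reference_audio_id] = cluster_id
        self.members[cluster_id] = [reference_audio_id]
        return cluster_id

    def add_to_cluster_of(self, reference_audio_id, target_audio_id):
        """Steps 207/208: add the target to the cluster of the matched reference."""
        cluster_id = self.cluster_of.get(reference_audio_id)
        if cluster_id is None:
            # Assumption: a reference without a cluster gets one on first match.
            cluster_id = self.new_cluster(reference_audio_id)
        self.cluster_of[target_audio_id] = cluster_id
        self.members[cluster_id].append(target_audio_id)
        return cluster_id
```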

Step 209, if the reference fingerprint data in the second audio fingerprint library is successfully matched with the target fingerprint data, counting an indicator of successful matching for the reference fingerprint data.

Step 210, if the indicator meets a preset library transfer condition, transferring the reference fingerprint data from the second audio fingerprint library to the first audio fingerprint library.

Considering that, in scenarios such as songs newly released online and rapidly updated short videos, new audio data is readily produced before it has been recorded by the music copyright party, a library transfer condition may be set in advance for the reference fingerprint data in the second audio fingerprint library. When the library transfer condition is met, the reference fingerprint data may be transferred to the first audio fingerprint library.

In a specific implementation, if the target fingerprint data is successfully matched with the reference fingerprint data in the second audio fingerprint library, an indicator of successful matching, such as the total number of successful matches or the frequency of successful matches, may be counted for the reference fingerprint data.

The indicator is compared with the library transfer condition of the same dimension, for example, whether the total number of successful matches is greater than or equal to a first threshold, whether the frequency of successful matches is greater than or equal to a second threshold, and so on.

If the indicator meets the preset library transfer condition, this indicates that the reference fingerprint data belongs to popular audio data, possibly a newly released song or the like. The reference fingerprint data may then be transferred from the second audio fingerprint library to the first audio fingerprint library, and prompt information may be generated to prompt an operator to add copyright information for the audio data to which the reference fingerprint data belongs.

If the indicator does not meet the preset library transfer condition, the reference fingerprint data may remain stored in the second audio fingerprint library.
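A minimal sketch of steps 209 and 210 follows, using only the total number of successful matches as the indicator. The function name, the dictionary-based libraries, and the wording of the prompt are hypothetical; a frequency-based condition would need timestamps and is not shown.

```python
def on_second_library_match(fingerprint, first_lib, second_lib,
                            match_counts, first_threshold):
    """Count a successful match and transfer the entry once the total
    reaches first_threshold.

    first_lib / second_lib: dicts mapping fingerprint -> audio id (simplified).
    Returns prompt text for the operator on transfer, otherwise None.
    """
    match_counts[fingerprint] = match_counts.get(fingerprint, 0) + 1
    if match_counts[fingerprint] < first_threshold:
        return None                          # condition not met: stay in second library

    audio_id = second_lib.pop(fingerprint)   # remove from the second library
    first_lib[fingerprint] = audio_id        # store in the first library
    del match_counts[fingerprint]            # the counter is no longer needed
    return "Please add copyright information for audio " + audio_id
```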

Step 211, setting a lifetime for the reference fingerprint data in the first audio fingerprint library and/or the second audio fingerprint library.

In scenarios such as short video, some audio data turns over quickly and is rarely used by users after circulating for a period of time. For such scenarios, a specified first value may be set as the lifetime of the reference fingerprint data in the first audio fingerprint library, and a specified second value may be set as the lifetime of the reference fingerprint data in the second audio fingerprint library.

Considering that more audio data has copyright information, while less audio data is original and has no copyright information, the probability of a successful match with the reference fingerprint data in the first audio fingerprint library is higher, and the probability of a successful match with the reference fingerprint data in the second audio fingerprint library is lower. The lifetime of the reference fingerprint data in the first audio fingerprint library may therefore be set longer than that of the reference fingerprint data in the second audio fingerprint library, i.e., the first value is greater than the second value, so as to maintain the probability of successfully matching the reference fingerprint data in the first audio fingerprint library, reduce the number of calls to the music query service interface, and reduce the operation cost.

Of course, besides being set longer than the lifetime of the reference fingerprint data in the second audio fingerprint library, the lifetime of the reference fingerprint data in the first audio fingerprint library may also be set equal to or shorter than it, i.e., the first value is equal to or less than the second value, which is not limited in this embodiment.

Step 212, decaying the lifetime.

For the lifetime of the reference fingerprint data, a timer may be started to count down in order to decay the lifetime, i.e. to continuously decrease the value of the lifetime.

In general, the decay proceeds at the normal rate of elapsed time rather than at a variable rate.

Step 213, if the reference fingerprint data in the first audio fingerprint library or the second audio fingerprint library is successfully matched with the target fingerprint data, increasing the lifetime.

If the reference fingerprint data in the first audio fingerprint library is successfully matched with the target fingerprint data, the lifetime of the reference fingerprint data may be increased, for example, the lifetime may be restored to the original first value, or a first step may be added to the current value of the lifetime, and so on.

If the reference fingerprint data in the second audio fingerprint library is successfully matched with the target fingerprint data, the lifetime of the reference fingerprint data may be increased, e.g., the lifetime may be restored to the original second value, a second step may be added based on the current value of the lifetime, etc.

Step 214, if the decay of the lifetime is finished, deleting the reference fingerprint data from the first audio fingerprint library or the second audio fingerprint library.

If the decay of the lifetime of the reference fingerprint data in the first audio fingerprint library is finished, i.e., its current value is 0, this indicates that the audio data to which the reference fingerprint data belongs is used infrequently. In this case, the reference fingerprint data may be deleted from the first audio fingerprint library. While the matching success rate of the reference fingerprint data in the first audio fingerprint library is maintained, this reduces the amount of reference fingerprint data stored in the first audio fingerprint library and frees its space, so that the need to keep storing a continuous stream of fingerprint data under limited library capacity is effectively met.

If the decay of the lifetime of the reference fingerprint data in the second audio fingerprint library is finished, i.e., its current value is 0, this indicates that the audio data to which the reference fingerprint data belongs is used infrequently. In this case, the reference fingerprint data may be deleted from the second audio fingerprint library. While the matching success rate of the reference fingerprint data in the second audio fingerprint library is maintained, this reduces the amount of reference fingerprint data stored in the second audio fingerprint library and frees its space, so that the need to keep storing a continuous stream of fingerprint data under limited library capacity is effectively met.
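Steps 211 to 214 can be sketched as a small lifetime table kept per library. `LifetimeTable` and its parameters are hypothetical, and the countdown timer is replaced by an explicit `tick(elapsed)` call so that the behaviour is deterministic. Passing `step=None` models restoring the lifetime to its original value on a match, and a numeric `step` models adding a step to the current value.

```python
class LifetimeTable:
    """Hypothetical per-library lifetime bookkeeping for reference fingerprints."""

    def __init__(self, initial_lifetime, step=None):
        self.initial = initial_lifetime   # the first or second value
        self.step = step                  # None: restore to initial on a match
        self.remaining = {}               # fingerprint -> remaining lifetime

    def add(self, fingerprint):
        """Step 211: set the lifetime when reference data is stored."""
        self.remaining[fingerprint] = self.initial

    def tick(self, elapsed):
        """Steps 212 and 214: decay every lifetime; delete and return expired entries."""
        expired = []
        for fingerprint in list(self.remaining):
            self.remaining[fingerprint] -= elapsed
            if self.remaining[fingerprint] <= 0:
                del self.remaining[fingerprint]
                expired.append(fingerprint)
        return expired

    def on_match(self, fingerprint):
        """Step 213: increase the lifetime after a successful match."""
        if self.step is None:
            self.remaining[fingerprint] = self.initial
        else:
            self.remaining[fingerprint] += self.step
```

A caller would remove every fingerprint returned by `tick` from the corresponding library; using a larger `initial_lifetime` for the first library than for the second corresponds to the first value being greater than the second value.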

It should be noted that, for simplicity of description, the method embodiments are described as a series of combined acts, but those skilled in the art will recognize that the embodiments of the present invention are not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the embodiments of the present invention.

Example three

Fig. 3 is a block diagram of a structure of an audio fingerprint processing apparatus according to a third embodiment of the present invention, which may specifically include the following modules:

a fingerprint data generating module 301, configured to generate target fingerprint data for the target audio data;

a fingerprint data matching module 302, configured to match the target fingerprint data with reference fingerprint data in a first audio fingerprint database and a second audio fingerprint database, respectively;

an interface query module 303, configured to invoke a music query service interface to query copyright information of the target audio data if the target fingerprint data fails to match with the reference fingerprint data in the first audio fingerprint library and the reference fingerprint data in the second audio fingerprint library;

a first updating module 304, configured to store the target fingerprint data in the first audio fingerprint database if the copyright information of the target audio data is queried, where the target fingerprint data is used as new reference fingerprint data in the first audio fingerprint database, and record the copyright information of the target audio data;

a second updating module 305, configured to store the target fingerprint data in the second audio fingerprint database as new reference fingerprint data in the second audio fingerprint database if the copyright information of the target audio data is not queried.

In one embodiment of the present invention, the fingerprint data generation module 301 includes:

the audio signal dividing module is used for dividing the target audio data into multi-frame audio signals;

the spectrogram conversion module is used for converting the audio signal into a spectrogram;

the peak point searching module is used for traversing data points representing peaks on the spectrogram to serve as peak points;

the characteristic information extraction module is used for extracting the characteristic information of the peak point;

and the hash value calculation module is used for calculating a hash value of the characteristic information to serve as target fingerprint data of the target audio data.
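The peak point search performed on the spectrogram can be sketched as below. The spectrogram is assumed to be already available as a nested list `spectrogram[t][f]` (frame index, then frequency bin), and "peak" is taken here to mean a point strictly larger than all of its immediate neighbours. Both the data layout and this definition of a peak are assumptions for illustration; the text does not fix either.

```python
def find_peaks(spectrogram):
    """Return [(frame, freq_bin), ...] for points larger than all 8 neighbours.

    spectrogram[t][f] is the magnitude at frame t and frequency bin f.
    """
    peaks = []
    n_frames = len(spectrogram)
    n_bins = len(spectrogram[0]) if n_frames else 0
    for t in range(n_frames):
        for f in range(n_bins):
            value = spectrogram[t][f]
            # Collect the neighbours that exist (fewer at the borders).
            neighbours = [
                spectrogram[tt][ff]
                for tt in range(max(0, t - 1), min(n_frames, t + 2))
                for ff in range(max(0, f - 1), min(n_bins, f + 2))
                if (tt, ff) != (t, f)
            ]
            if neighbours and all(value > n for n in neighbours):
                peaks.append((t, f))
    return peaks
```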

In one embodiment of the present invention, the feature information extraction module includes:

the frequency value query module is used for querying the frequency value of the peak point to serve as the characteristic information of the peak point;

the time distance measuring module is used for measuring a first distance between the current peak point and other peak points in time as characteristic information of the peak point;

and the frequency distance measuring module is used for measuring second distances of the current peak point and other peak points on the frequency, and the second distances are used as the characteristic information of the peak point.

In one embodiment of the present invention, the time distance measuring module includes:

the time neighborhood searching module is used for searching other peak points which are positioned in the neighborhood of the current peak point in the time dimension aiming at the current peak point;

and the time distance calculation module is used for calculating a first distance between the current peak point and other peak points in time as the characteristic information of the peak point.

In one embodiment of the present invention, the frequency distance measuring module includes:

the frequency neighborhood searching module is used for searching other peak points which are positioned in the neighborhood of the current peak point under the dimensionality of the frequency aiming at the current peak point;

and the frequency distance calculation module is used for calculating second distances of the current peak point and other peak points on the frequency, and the second distances serve as the characteristic information of the peak point.
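The feature information of a peak point (its frequency value, a first distance in time, and a second distance in frequency to other peak points in its neighbourhood) can be sketched as follows. The neighbourhood sizes `max_dt` and `max_df`, the choice to pair a peak only with later peaks, and the requirement that a partner lie in both neighbourhoods at once are assumptions of this sketch.

```python
def extract_features(peaks, max_dt=3, max_df=8):
    """Pair each peak with nearby later peaks.

    peaks: list of (frame, freq_bin) tuples.
    Returns a list of (frame, freq, dt, df), where dt is the first distance
    (in frames) and df is the second distance (in frequency bins, signed).
    """
    features = []
    for t, f in peaks:
        for t2, f2 in peaks:
            dt = t2 - t
            df = f2 - f
            # Keep partners later in time that fall inside both neighbourhoods.
            if 0 < dt <= max_dt and abs(df) <= max_df:
                features.append((t, f, dt, df))
    return features
```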

In one embodiment of the present invention, the hash value calculation module includes:

the binary conversion module is used for converting the frequency value, the first distance and the second distance into a binary format;

and the splicing module is used for splicing the frequency value, the first distance and the second distance into target fingerprint data of the target audio data if the conversion is finished.
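Converting the three features to binary and splicing them into one fingerprint can be sketched as bit packing. The field widths (10, 6, and 8 bits) and the offset used to make the signed frequency distance non-negative are assumptions chosen for illustration; the text only says that the values are converted to a binary format and spliced.

```python
FREQ_BITS, DT_BITS, DF_BITS = 10, 6, 8   # assumed field widths


def splice_fingerprint(freq, dt, df):
    """Pack freq, dt and df into one integer: [freq | dt | df]."""
    # Shift the signed frequency distance into the range 0 .. 2**DF_BITS - 1.
    df_unsigned = df + (1 << (DF_BITS - 1))
    for name, value, bits in (("freq", freq, FREQ_BITS),
                              ("dt", dt, DT_BITS),
                              ("df", df_unsigned, DF_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(name + " does not fit in " + str(bits) + " bits")
    return (freq << (DT_BITS + DF_BITS)) | (dt << DF_BITS) | df_unsigned
```

For example, `splice_fingerprint(10, 2, 4)` places `0000001010`, `000010`, and `10000100` side by side, giving the 24-bit value `000000101000001010000100`.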

In one embodiment of the present invention, the fingerprint data matching module 302 comprises:

the similarity calculation module is used for calculating the similarity between the target fingerprint data and reference fingerprint data in a first audio fingerprint database and a second audio fingerprint database respectively;

and the continuous matching module is used for determining that the target fingerprint data and the reference fingerprint data are successfully matched if the similarity between n pieces of target fingerprint data and n pieces of reference fingerprint data is greater than a preset threshold value, and the number of frames between every two adjacent pieces of target fingerprint data is the same as the number of frames between every two adjacent pieces of reference fingerprint data.
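The continuous matching rule can be sketched as follows. The similarity measure (fraction of equal bits), the default threshold, and the representation of fingerprints as `(frame_no, fingerprint)` pairs that are already aligned one-to-one are assumptions; the text does not specify how similarity is computed.

```python
def bit_similarity(a, b, n_bits=24):
    """Fraction of equal bits between two n_bits-wide fingerprints."""
    differing = bin((a ^ b) & ((1 << n_bits) - 1)).count("1")
    return 1.0 - differing / n_bits


def is_continuous_match(target, reference, n, threshold=0.9, n_bits=24):
    """target, reference: lists of (frame_no, fingerprint), sorted by frame.

    Succeeds when the first n pairs are each similar enough and the frame
    gaps between adjacent target entries equal those between adjacent
    reference entries.
    """
    if len(target) < n or len(reference) < n:
        return False
    for (_, t_fp), (_, r_fp) in zip(target[:n], reference[:n]):
        if bit_similarity(t_fp, r_fp, n_bits) <= threshold:
            return False
    target_gaps = [target[i + 1][0] - target[i][0] for i in range(n - 1)]
    reference_gaps = [reference[i + 1][0] - reference[i][0] for i in range(n - 1)]
    return target_gaps == reference_gaps
```

Comparing frame gaps rather than absolute frame numbers lets a clip match even when it starts at a different offset within the reference audio.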

In one embodiment of the present invention, the first update module 304 includes:

a first key-value pair generating module, configured to generate a key-value pair with the target fingerprint data as a key and an identifier of the target audio data and a sequence number of an audio signal to which the target fingerprint data belongs as values, where the audio signal belongs to one frame of signal in the target audio data;

and the first key-value pair storage module is used for storing the key-value pairs to the position, in the first audio fingerprint library, of which the index value is the same as the target fingerprint data, and using the position as new reference fingerprint data in the first audio fingerprint library.

In one embodiment of the present invention, the second update module 305 comprises:

a second key-value pair generating module, configured to generate a key-value pair with the target fingerprint data as a key and an identifier of the target audio data and a sequence number of an audio signal to which the target fingerprint data belongs as values, where the audio signal belongs to one frame of signal in the target audio data;

and the second key value pair storage module is used for storing the key value pairs to the position, in the second audio fingerprint database, of which the index value is the same as the target fingerprint data, and using the position as new reference fingerprint data in the second audio fingerprint database.

In one embodiment of the present invention, further comprising:

and the cluster generating module is used for taking the target audio data as new reference audio data and generating a new cluster for the new reference audio data.

In one embodiment of the present invention, further comprising:

a first cluster adding module, configured to add the target audio data to a cluster to which the reference audio data belongs if the target fingerprint data is successfully matched with the reference fingerprint data in the first audio fingerprint database, where the reference fingerprint data in the first audio fingerprint database belongs to the reference audio data;

and the second cluster adding module is used for adding the target audio data to the cluster to which the reference audio data belongs if the target fingerprint data is successfully matched with the reference fingerprint data in the second audio fingerprint database, wherein the reference fingerprint data in the second audio fingerprint database belongs to the reference audio data.

In one embodiment of the present invention, further comprising:

the lifetime setting module is used for setting a lifetime for the reference fingerprint data in the first audio fingerprint library and/or the second audio fingerprint library;

the lifetime decay module is used for decaying the lifetime;

a lifetime increasing module, configured to increase the lifetime if the reference fingerprint data in the first audio fingerprint database or the second audio fingerprint database is successfully matched with the target fingerprint data;

and the fingerprint data deleting module is used for deleting the reference fingerprint data from the first audio fingerprint database or the second audio fingerprint database if the decay of the lifetime is finished.

In one embodiment of the present invention, further comprising:

the indicator counting module is used for counting an indicator of successful matching for the reference fingerprint data if the reference fingerprint data in the second audio fingerprint database is successfully matched with the target fingerprint data;

and the fingerprint data library transfer module is used for transferring the reference fingerprint data from the second audio fingerprint database to the first audio fingerprint database if the indicator meets a preset library transfer condition.

The audio fingerprint processing device provided by the embodiment of the invention can execute the audio fingerprint processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example four

Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. FIG. 4 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 4 is only one example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention.

As shown in FIG. 4, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.

Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processing unit 16 executes various functional applications and data processing, such as implementing the audio fingerprint processing method provided by the embodiment of the present invention, by executing programs stored in the system memory 28.

Example five

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the audio fingerprint processing method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

A computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for processing audio fingerprints, comprising:

generating target fingerprint data for the target audio data;

2. The method of claim 1, wherein generating target fingerprint data for the target audio data comprises:

dividing the target audio data into a plurality of frames of audio signals;

converting the audio signal into a spectrogram;

traversing data points representing peaks on the spectrogram to serve as peak points;

extracting characteristic information of the peak point;

and calculating a hash value of the characteristic information to serve as target fingerprint data of the target audio data.

3. The method according to claim 2, wherein the extracting the feature information of the peak point comprises:

querying the frequency value of the peak point as the characteristic information of the peak point;

measuring a first distance between the current peak point and other peak points in time, and taking the first distance as characteristic information of the peak point;

and measuring second distances between the current peak point and other peak points on the frequency, and taking the second distances as the characteristic information of the peak point.

4. The method of claim 3, wherein

measuring the first distance in time between the current peak point and other peak points, as feature information of the peak point, comprises:

for the current peak point, searching for other peak points located in a neighborhood of the current peak point in the time dimension;

calculating the first distance in time between the current peak point and those other peak points, as feature information of the peak point;

and measuring the second distance in frequency between the current peak point and other peak points, as feature information of the peak point, comprises:

for the current peak point, searching for other peak points located in a neighborhood of the current peak point in the frequency dimension;

and calculating the second distance in frequency between the current peak point and those other peak points, as feature information of the peak point.
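The neighborhood search of claims 3 and 4 might look like the following minimal sketch. It assumes a rectangular neighborhood that extends forward in time by at most `max_dt` frames and in frequency by at most `max_df` bins; both parameters, and the choice to pair each peak only with later peaks, are illustrative assumptions rather than part of the claims.

```python
from typing import List, Tuple

Peak = Tuple[int, int]  # (frame index, frequency bin)


def peak_features(peaks: List[Peak], max_dt: int = 3, max_df: int = 10):
    """Pair each peak with later peaks inside its neighborhood.

    Emits (frame, frequency, time distance, frequency distance), where
    the time distance is the "first distance" and the frequency distance
    is the "second distance".
    """
    ordered = sorted(peaks)
    features = []
    for i, (t1, f1) in enumerate(ordered):
        for t2, f2 in ordered[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # sorted by time, so no later peak can qualify
            df = f2 - f1
            if dt > 0 and abs(df) <= max_df:
                features.append((t1, f1, dt, df))
    return features


example = peak_features([(0, 10), (1, 12), (2, 40), (5, 11)])
```

In this example only the pair (0, 10) and (1, 12) lies inside the assumed neighborhood, so a single feature tuple is produced.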

5. The method of claim 3, wherein calculating the hash value of the feature information as the target fingerprint data of the target audio data comprises:

converting the frequency value, the first distance and the second distance into a binary format;

and upon completion of the conversion, concatenating the frequency value, the first distance and the second distance into the target fingerprint data of the target audio data.
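The binary conversion and concatenation of claim 5 can be illustrated with fixed-width bit fields. The field widths below (10, 6 and 6 bits) are assumptions chosen for the sketch, and the values are assumed to be non-negative integers; the claims do not fix either point.

```python
FREQ_BITS, DT_BITS, DF_BITS = 10, 6, 6  # assumed field widths


def pack_fingerprint(freq: int, dt: int, df: int) -> int:
    """Convert each value to fixed-width binary and concatenate them."""
    for value, bits in ((freq, FREQ_BITS), (dt, DT_BITS), (df, DF_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    bit_string = (format(freq, f"0{FREQ_BITS}b")
                  + format(dt, f"0{DT_BITS}b")
                  + format(df, f"0{DF_BITS}b"))
    return int(bit_string, 2)


def unpack_fingerprint(code: int):
    """Inverse of pack_fingerprint, useful for debugging."""
    df = code & ((1 << DF_BITS) - 1)
    dt = (code >> DF_BITS) & ((1 << DT_BITS) - 1)
    freq = code >> (DF_BITS + DT_BITS)
    return freq, dt, df


code = pack_fingerprint(10, 1, 2)
```

Because the fields have fixed widths, the resulting integer can be used directly as a key and decoded again without ambiguity.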

6. The method of claim 1, wherein matching the target fingerprint data with reference fingerprint data in a first audio fingerprint library and a second audio fingerprint library respectively comprises:

calculating a similarity between the target fingerprint data and the reference fingerprint data in each of the first audio fingerprint library and the second audio fingerprint library;

and if the similarity between n pieces of target fingerprint data and n pieces of reference fingerprint data is greater than a preset threshold, and the number of frames between every two adjacent pieces of target fingerprint data is the same as the number of frames between the corresponding two adjacent pieces of reference fingerprint data, determining that the target fingerprint data and the reference fingerprint data are successfully matched.
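One way to check the two conditions of claim 6 is sketched below. It assumes that similarity is measured as one minus the normalized Hamming distance between integer fingerprints, and it relies on the observation that equal frame spacing between matched pairs is equivalent to a constant frame offset between target and reference. The similarity measure, `n`, `threshold` and `bits` are assumptions for illustration.

```python
from collections import Counter


def similarity(a: int, b: int, bits: int) -> float:
    """1.0 for identical codes, lower as more bits differ."""
    return 1.0 - bin(a ^ b).count("1") / bits


def is_match(target, reference, n=3, threshold=0.9, bits=22):
    """target/reference: lists of (frame index, fingerprint code).

    Succeeds when at least n target fingerprints are similar to
    reference fingerprints at one common frame offset, which means the
    spacing between matched targets equals the spacing between the
    matched references.
    """
    offsets = Counter()
    for t_frame, t_code in target:
        hits = {r_frame - t_frame
                for r_frame, r_code in reference
                if similarity(t_code, r_code, bits) > threshold}
        offsets.update(hits)  # each target counts once per offset
    return any(count >= n for count in offsets.values())


ref = [(10, 0b1111), (12, 0b1010), (15, 0b0110), (20, 0b0001)]
aligned = [(0, 0b1111), (2, 0b1010), (5, 0b0110)]
misaligned = [(0, 0b1111), (3, 0b1010), (5, 0b0110)]
```

With 4-bit codes, `is_match(aligned, ref, bits=4)` succeeds because all three pairs share the offset 10, while `misaligned` fails because its spacing differs from that of the reference.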

7. The method of claim 1, wherein storing the target fingerprint data in the first audio fingerprint library as new reference fingerprint data in the first audio fingerprint library comprises:

generating a key-value pair with the target fingerprint data as the key and with an identifier of the target audio data and a serial number of the audio signal to which the target fingerprint data belongs as the value, the audio signal being one frame of the target audio data;

storing the key-value pair at the position in the first audio fingerprint library whose index value is the same as the target fingerprint data, as new reference fingerprint data in the first audio fingerprint library;

and storing the target fingerprint data into the second audio fingerprint library as new reference fingerprint data in the second audio fingerprint library comprises:

storing the key-value pair at the position in the second audio fingerprint library whose index value is the same as the target fingerprint data, as new reference fingerprint data in the second audio fingerprint library.
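The key-value layout of claim 7 resembles an inverted index keyed by fingerprint. The sketch below uses an in-memory dictionary as a stand-in for either library; the class name `FingerprintLibrary` and its methods are hypothetical, and a real deployment would likely use a persistent store.

```python
from collections import defaultdict


class FingerprintLibrary:
    """Maps fingerprint -> list of (audio identifier, frame serial number)."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, fingerprint: int, audio_id: str, frame_no: int) -> None:
        # The fingerprint itself is the index position; several audio
        # items may share one fingerprint, so values are appended.
        self._index[fingerprint].append((audio_id, frame_no))

    def lookup(self, fingerprint: int):
        # Use .get so that a miss does not create an empty entry.
        return list(self._index.get(fingerprint, []))


first_library = FingerprintLibrary()
first_library.add(41026, "audio-001", 0)
first_library.add(41026, "audio-002", 7)
```

Looking up a fingerprint then returns every audio item and frame in which it occurs, which is what the frame-spacing check of the matching step needs.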

8. The method according to any one of claims 1-7, further comprising, after said storing said target fingerprint data in said first audio fingerprint library as new reference fingerprint data in said first audio fingerprint library:

taking the target audio data as new reference audio data, and generating a new cluster for the new reference audio data.

9. The method of any one of claims 1-7, further comprising:

if the target fingerprint data is successfully matched with reference fingerprint data in the first audio fingerprint library, adding the target audio data to the cluster to which reference audio data belongs, wherein the reference fingerprint data in the first audio fingerprint library belongs to the reference audio data;

and if the target fingerprint data is successfully matched with reference fingerprint data in the second audio fingerprint library, adding the target audio data to the cluster to which reference audio data belongs, wherein the reference fingerprint data in the second audio fingerprint library belongs to the reference audio data.
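Taken together with the hierarchical query described in the abstract, the cluster handling of claims 8 and 9 can be sketched as below. The sketch simplifies matching to exact equality of a single fingerprint, treats `query_copyright` as a hypothetical stand-in for the music query service interface, and also creates a cluster for audio stored in the second library, which is an assumption made so that later matches have a cluster to join.

```python
def process(audio_id, fp, first, second, clusters, query_copyright):
    """first/second: dict fingerprint -> reference audio id.
    clusters: dict reference audio id -> list of member audio ids.
    query_copyright: callable returning copyright info or None.
    """
    # Tier 1 and tier 2: local libraries, no external call needed.
    for name, library in (("first", first), ("second", second)):
        ref_id = library.get(fp)
        if ref_id is not None:
            clusters[ref_id].append(audio_id)
            return f"matched-{name}"
    # Tier 3: only unmatched audio reaches the external query service.
    info = query_copyright(audio_id)
    target = first if info is not None else second
    target[fp] = audio_id
    clusters[audio_id] = [audio_id]  # new reference audio, new cluster
    return "stored-first" if info is not None else "stored-second"


calls = []


def fake_service(audio_id):
    """Hypothetical service: only 'song-a' has copyright information."""
    calls.append(audio_id)
    return {"owner": "label"} if audio_id == "song-a" else None


first, second, clusters = {}, {}, {}
results = [
    process("song-a", 111, first, second, clusters, fake_service),
    process("copy-of-a", 111, first, second, clusters, fake_service),
    process("hum-b", 222, first, second, clusters, fake_service),
    process("copy-of-b", 222, first, second, clusters, fake_service),
]
```

In this run the service is called only twice for four audio items, which illustrates how the two libraries reduce the number of calls to the query interface.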

10. The method of any one of claims 1-7, further comprising:

setting a time-to-live for reference fingerprint data in the first audio fingerprint library and/or the second audio fingerprint library;

decaying the time-to-live;

if the reference fingerprint data in the first audio fingerprint library or the second audio fingerprint library is successfully matched with target fingerprint data, increasing the time-to-live;

and if the time-to-live has fully decayed, deleting the reference fingerprint data from the first audio fingerprint library or the second audio fingerprint library.
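A minimal sketch of the time-to-live mechanism of claim 10 follows. The initial value, the boost applied on a match, and the unit of decay (one tick per call) are illustrative assumptions; the class name `TtlTable` is hypothetical.

```python
class TtlTable:
    """Tracks a time-to-live counter per reference fingerprint."""

    def __init__(self, initial_ttl: int = 3, boost: int = 2):
        self.initial_ttl = initial_ttl
        self.boost = boost
        self.ttl = {}

    def add(self, key) -> None:
        self.ttl[key] = self.initial_ttl

    def on_match(self, key) -> None:
        # A successful match extends the life of the reference entry.
        if key in self.ttl:
            self.ttl[key] += self.boost

    def decay(self, step: int = 1):
        """Reduce every counter; delete and return the expired keys."""
        expired = []
        for key in list(self.ttl):
            self.ttl[key] -= step
            if self.ttl[key] <= 0:
                del self.ttl[key]
                expired.append(key)
        return expired


table = TtlTable(initial_ttl=2, boost=2)
table.add("a")
table.add("b")
first_round = table.decay()
table.on_match("a")
second_round = table.decay()
```

Here the entry that was matched survives the second round of decay, while the unmatched entry expires and is removed, so rarely matched fingerprints do not accumulate in the libraries.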

11. The method of any one of claims 1-7, further comprising:

if the reference fingerprint data in the second audio fingerprint library is successfully matched with target fingerprint data, collecting a statistical indicator of the successful matches of the reference fingerprint data;

and if the indicator meets a preset library-transfer condition, transferring the reference fingerprint data from the second audio fingerprint library to the first audio fingerprint library.
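The transfer of claim 11 can be illustrated with a simple hit counter. The sketch assumes that the indicator is the cumulative number of successful matches and that the transfer condition is reaching `min_hits`; both are assumptions, since the claim leaves the indicator and the condition open.

```python
def record_match(key, second, first, hit_counts, min_hits=3):
    """Count a match on a second-library entry and transfer it to the
    first library once the assumed condition is met.

    Returns True when the entry was transferred.
    """
    if key not in second:
        return False
    hit_counts[key] = hit_counts.get(key, 0) + 1
    if hit_counts[key] >= min_hits:
        first[key] = second.pop(key)  # move, do not copy
        del hit_counts[key]
        return True
    return False


first_lib = {}
second_lib = {222: ("hum-b", 0)}
hits = {}
outcomes = [record_match(222, second_lib, first_lib, hits) for _ in range(4)]
```

The third match triggers the transfer, and the fourth call returns False because the entry no longer resides in the second library.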

12. An apparatus for processing audio fingerprints, comprising:

13. A computer device, comprising:

one or more processors;

and a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for processing audio fingerprints of any one of claims 1-11.

14. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for processing audio fingerprints of any one of claims 1-11.