CN108205550B

CN108205550B - Audio fingerprint generation method and device

Info

Publication number: CN108205550B
Application number: CN201611173755.7A
Authority: CN
Inventors: 吴岩
Original assignee: Beijing Kuwo Technology Co Ltd
Current assignee: Beijing Kuwo Technology Co Ltd
Priority date: 2016-12-16
Filing date: 2016-12-16
Publication date: 2021-03-12
Anticipated expiration: 2036-12-16
Also published as: CN108205550A

Abstract

Embodiments of the invention relate to a method and a device for generating an audio fingerprint. The method comprises the following steps: intercepting a PCM-encoded second audio file from a first audio file, the second audio file being an audio file intercepted at a first time of the first audio file; obtaining a plurality of sub-fingerprints according to the second audio file; and, starting at a second time, intercepting a set number of the plurality of sub-fingerprints as the audio fingerprint of the first audio file. A string of identifiers extracted and calculated from audio files of various formats can thus serve as the audio fingerprint that identifies a song, and the audio fingerprint does not change even if information such as the singer name or the album name of the song is altered.

Description

Audio fingerprint generation method and device

Technical Field

The present invention relates to the field of audio data processing technologies, and in particular, to a method and an apparatus for generating an audio fingerprint.

Background

An audio file generally includes a data segment that stores identification information such as the singer, title, album name, year, and genre. For an audio file in the MP3 format, for example, the identification information is generally stored in the ID3 information of the file. When the audio file is played, the identification information is read from this data segment, displayed on the playing interface, and presented to the user.

However, as technology advances, the data segment storing the identification information can easily be modified or deleted, for example to evade copyright. Such audio files cannot be correctly identified during playback, which inevitably degrades the listening experience.

Disclosure of Invention

The embodiment of the invention provides a method and a device for generating an audio fingerprint. A string of identifiers is extracted and calculated from the audio file and used as its audio fingerprint, so that the song can still be correctly identified after the ID3 information or similar data has changed.

In one aspect, an embodiment of the present invention provides a method for generating an audio fingerprint, including:

intercepting a second audio file encoded based on Pulse Code Modulation (PCM) according to a first audio file, wherein the second audio file is an audio file intercepted at a first time of the first audio file;

obtaining a plurality of sub-fingerprints according to the second audio file;

intercepting a set number of the plurality of sub-fingerprints as audio fingerprints of the first audio file starting at a second time.

Optionally, the method further includes:

determining a source audio file, and converting the source audio file into the first audio file.

Optionally, the first time is 45 seconds.

Optionally, the second time is greater than 32 seconds and less than the first time.

Optionally, the set number is 512.

In another aspect, an embodiment of the present invention provides a method for adding an audio fingerprint to an audio file database. The audio file database comprises a plurality of audio files, the method comprising:

determining at least one of the plurality of audio files that does not include an audio fingerprint;

calculating a plurality of sub-fingerprints corresponding to each of the at least one audio file;

generating an audio fingerprint for each audio file, among the at least one audio file, whose duration is greater than a first time, wherein the audio fingerprint is a set number of sub-fingerprints intercepted starting from the first time of that audio file;

and generating a database statement according to the audio fingerprint, and adding the audio fingerprint into the database.

In another aspect, an embodiment of the present invention provides an apparatus for generating an audio fingerprint. The apparatus comprises:

an intercepting unit configured to intercept a second audio file based on PCM encoding from a first audio file, the second audio file being an audio file intercepted at a first time of the first audio file;

the sub-fingerprint generating unit is used for obtaining a plurality of sub-fingerprints according to the second audio file;

an audio fingerprint generating unit, configured to intercept a set number of sub-fingerprints from the plurality of sub-fingerprints as audio fingerprints of the first audio file from a second time.

Optionally, the apparatus further includes:

and the determining unit is used for determining a source audio file and converting the source audio file into the first audio file.

In another aspect, an embodiment of the present invention provides an apparatus for adding an audio fingerprint to an audio file database. The audio file database comprises a plurality of audio files, the apparatus comprising:

a determining unit configured to determine at least one of the plurality of audio files that does not include an audio fingerprint;

a sub-fingerprint generating unit, configured to calculate a plurality of sub-fingerprints corresponding to each of the at least one audio file;

the audio fingerprint generating unit is used for generating an audio fingerprint for each audio file, among the at least one audio file, whose duration is greater than a first time, wherein the audio fingerprint is a set number of sub-fingerprints intercepted starting from the first time of that audio file;

and the adding unit is used for generating a database statement according to the audio fingerprint and adding the audio fingerprint into the database.

By the embodiment of the invention, a string of identifiers can be extracted and calculated for various formats of audio files to serve as the audio fingerprints of the audio files so as to identify the songs, and even if information such as the name of a singer and the name of an album of the songs is changed, the audio fingerprints are not changed.

Drawings

Fig. 1 is a flowchart of a method for generating an audio fingerprint according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for adding audio fingerprints to an audio file database according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an apparatus for generating an audio fingerprint according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an apparatus for adding an audio fingerprint to an audio file database according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

According to the embodiment of the invention, audio files of any version are first converted into a standard format, and the fingerprint is extracted from the standard-format file. This avoids the inconsistent sampling standards that arise when an audio file exists in multiple versions.

For the purpose of facilitating understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.

Fig. 1 is a flowchart of a method for generating an audio fingerprint according to an embodiment of the present invention. As shown in fig. 1, the method specifically includes:

S110, a second audio file based on PCM encoding is intercepted according to the first audio file, the second audio file being an audio file intercepted at a first time of the first audio file.

The first audio file is an audio file with a standard format, and the format of the first audio file can be the format of a universal audio file such as WMA.

The source audio file, that is, the audio file that needs to be identified by the audio fingerprint, may have multiple versions and multiple formats, and the source audio file is first converted into the audio file with the standard format, so that sampling is unified when the audio fingerprint is generated, and the accuracy of the audio fingerprint is improved.

When an audio fingerprint is generated from the first audio file, a portion of the audio file may be intercepted, and the fingerprint of that portion may be treated as the fingerprint of the source audio file. The portion is an audio file based on PCM encoding.

Specifically, the first audio file can be played using MPlayer, and the portion from the start of playback to the 45th second of the first audio file is intercepted as the second audio file. The second audio file is a WAV file based on PCM encoding. Compared with an analog signal, it is less susceptible to noise and distortion in the transmission system, has a wide dynamic range, and delivers quite good sound quality. It should be appreciated that the longer the second audio file, the higher the recognition accuracy; 45 seconds is merely an example and is not a limitation.
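The interception of the leading portion can be sketched as follows. This is a minimal illustration, assuming the first audio file has already been decoded to a PCM WAV file by some external tool; the function names `read_leading_pcm` and `write_pcm` are hypothetical and do not appear in the source.

```python
import wave

def read_leading_pcm(path, seconds=45.0):
    """Return (params, raw_bytes) for the first `seconds` of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        # Never request more frames than the file actually holds.
        wanted = min(wav.getnframes(), int(seconds * rate))
        frames = wav.readframes(wanted)
        return wav.getparams(), frames

def write_pcm(path, params, frames):
    """Write raw PCM frames to a new WAV file with the same layout."""
    with wave.open(path, "wb") as out:
        out.setnchannels(params.nchannels)
        out.setsampwidth(params.sampwidth)
        out.setframerate(params.framerate)
        out.writeframes(frames)  # the frame count in the header is fixed up on close
```

The 45-second default mirrors the example first time given above; a shorter file simply yields all of its frames.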

S120, obtaining a plurality of sub-fingerprints according to the second audio file.

The generation process of the plurality of sub-fingerprints is specifically as follows:

The second audio file, whatever its number of channels and sampling rate, is downsampled. High-frequency interference and energy leakage are suppressed by Hanning window processing, and a Fourier transform is performed. The frequency-domain amplitudes are calculated, and from them the energy of each frequency band. The energy differences are then calculated, and a bit is recorded wherever the difference is greater than 0, which yields the sub-fingerprint.
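The downsampling step can be illustrated with a rough sketch. The source does not state the target sampling rate or the filter used, so the block-averaging below is only an assumed stand-in for a proper anti-aliasing filter followed by decimation; `to_mono` and `decimate` are hypothetical names.

```python
def to_mono(samples, channels):
    """Average interleaved samples across channels into one mono stream."""
    return [sum(samples[i:i + channels]) / channels
            for i in range(0, len(samples) - channels + 1, channels)]

def decimate(mono, factor):
    """Crude downsampling: replace each block of `factor` samples by its mean.

    Averaging acts as a simple low-pass filter before the rate is reduced.
    A production system would use a better filter (assumption, not from the source).
    """
    return [sum(mono[i:i + factor]) / factor
            for i in range(0, len(mono) - factor + 1, factor)]
```

For instance, a 44.1 kHz stereo stream passed through `to_mono(..., 2)` and `decimate(..., 4)` would come out as mono at roughly 11 kHz.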

Audio is in essence frequency information. Each sample point records the amplitude of the waveform at that point, and the distinguishing features of an audio file lie in its frequency content.
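To make the link between sample amplitudes and frequency content concrete, the sketch below windows one frame of a pure tone and locates the strongest frequency bin. The frame length of 256 samples and the 8000 Hz rate are illustrative assumptions, and the direct DFT is written out only for clarity; an FFT would normally be used.

```python
import cmath
import math

def hann(n):
    """Hanning (Hann) window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins of a real frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

rate, n, tone_hz = 8000, 256, 1000.0   # assumed values for the demonstration
frame = [math.sin(2 * math.pi * tone_hz * t / rate) for t in range(n)]
windowed = [x * w for x, w in zip(frame, hann(n))]
mags = dft_magnitudes(windowed)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * rate / n          # bin index converted back to Hz
```

The time-domain samples alone say little about the signal, whereas the spectrum concentrates the tone into a single peak at its frequency.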

In one example, the generation of the sub-fingerprint specifically comprises the following steps:

1. One frame of audio information is extracted from the down-sampled second audio file.

2. High-frequency interference and energy leakage are suppressed by Hanning window processing, and a Fourier transform is performed.

3. The amplitude information of the Fourier-transformed second audio file is converted into energy information.

4. The absolute value of the energy information is taken.

5. The frequencies from 300 Hz to 2000 Hz are mapped to 9 frequency bands, and the energy of each frequency band is calculated.

The 300-2000 Hz range is divided evenly, according to the Bark value of the frequency, into 9 frequency bands, and the energy sum of each band is calculated.

6. The sub-fingerprint is generated by comparison with the energy values of the previous frame.

This gives 9 energy values E[1..9]. Let E_[i] = E[i+1] - E[i] for i = 1..8, and let F[n, M] denote the value of E_[M] in the n-th frame.

The M-th bit of the sub-fingerprint is 1 if F[n, M] - F[n-1, M] > 0 and 0 otherwise, so that an 8-byte sub-fingerprint is obtained from the comparison of the two frames.
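The six steps above can be sketched end to end as follows. Several details are assumptions rather than statements from the source: the Bark conversion uses the Traunmueller approximation, bit 1 is packed as the most significant bit, and the function names are hypothetical. Note also that 9 bands give only 8 difference bits per frame pair, whereas the text speaks of an 8-byte sub-fingerprint and shows 8-hex-digit example values, so the exact packing used by the authors is not reproduced here.

```python
import cmath
import math

def bark(f):
    """Hz to Bark (Traunmueller approximation, an assumed choice)."""
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_inv(z):
    """Bark back to Hz (inverse of the formula above)."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def band_edges(lo=300.0, hi=2000.0, bands=9):
    """Band edges in Hz, evenly spaced on the Bark scale."""
    z_lo, z_hi = bark(lo), bark(hi)
    return [bark_inv(z_lo + (z_hi - z_lo) * i / bands) for i in range(bands + 1)]

def frame_energies(frame, rate, edges):
    """Steps 2-5: window, transform, square the amplitudes, sum per band."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * w for s, w in zip(frame, window)]
    energies = [0.0] * (len(edges) - 1)
    for k in range(1, n // 2):
        f = k * rate / n
        if not edges[0] <= f < edges[-1]:
            continue  # bin lies outside the 300-2000 Hz range
        amp = abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                      for t, v in enumerate(x)))
        band = max(b for b in range(len(energies)) if edges[b] <= f)
        energies[band] += amp * amp
    return energies

def sub_fingerprint(prev_e, cur_e):
    """Step 6: bit M is 1 when F[n, M] - F[n-1, M] > 0."""
    bits = 0
    for m in range(len(cur_e) - 1):
        f_cur = cur_e[m + 1] - cur_e[m]      # F[n, M]
        f_prev = prev_e[m + 1] - prev_e[m]   # F[n-1, M]
        bits = (bits << 1) | (1 if f_cur - f_prev > 0 else 0)
    return bits
```

Calling `frame_energies` on two consecutive frames and passing the results to `sub_fingerprint` yields one sub-fingerprint; sliding over the whole second audio file yields the plurality of sub-fingerprints.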

S130, intercepting a set number of sub-fingerprints from the plurality of sub-fingerprints, starting from the second time, as the audio fingerprint of the first audio file.

According to the foregoing S110 and S120, it may be determined that the second audio file corresponds to a plurality of sub-fingerprints, and a portion of the plurality of sub-fingerprints may be intercepted, where a combination of the portion of sub-fingerprints is an audio fingerprint of the first audio file or the source audio file.

Specifically, starting from a second time within the second audio file, a set number of the sub-fingerprints corresponding to the second audio file are intercepted as the audio fingerprint. The second time may be any time greater than 32 seconds and less than the first time; for example, when the first time is 45 seconds, the second time may be 32 seconds or 35 seconds, and so on. Starting at this point skips the intro of the song and increases the difference between the fingerprints of different songs. The set number may be 512 sub-fingerprints (corresponding to roughly 6 seconds of audio).
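As a worked example of this selection, if 512 sub-fingerprints span about 6 seconds, consecutive sub-fingerprints are roughly 6 / 512 ≈ 0.0117 seconds apart, so a second time of 32 seconds falls near sub-fingerprint index 2730. The sketch below assumes evenly spaced sub-fingerprints; the spacing parameter and the function name `select_fingerprint` are illustrative and not taken from the source.

```python
def select_fingerprint(sub_fps, hop_seconds, start_seconds=32.0, count=512):
    """Take `count` consecutive sub-fingerprints starting at `start_seconds`.

    `hop_seconds` is the assumed spacing between consecutive sub-fingerprints.
    """
    start = int(start_seconds / hop_seconds)
    chosen = sub_fps[start:start + count]
    if len(chosen) < count:
        # The audio is too short to supply a full fingerprint.
        raise ValueError("not enough sub-fingerprints after the start time")
    return chosen

HOP = 6.0 / 512   # about 11.7 ms, derived from "512 sub-fingerprints ~ 6 seconds"
```

With a 45-second second audio file this spacing gives about 3840 sub-fingerprints, so indices 2730 through 3241 fit comfortably before the end.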

Example of the generated data: 5939cd89, 5d39dd8b, 5d39dda3, ... (508 sub-fingerprints omitted), a96a76ab.
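The example suggests that the fingerprint is stored as a comma-separated string of hexadecimal words. A small sketch of converting between that textual form and integers follows; the fixed width of 8 hex digits is inferred from the example values and is an assumption, as are the function names.

```python
def to_hex(sub_fps):
    """Serialize sub-fingerprints as comma-separated 8-digit hex words."""
    return ",".join(format(v, "08x") for v in sub_fps)

def from_hex(text):
    """Parse a comma-separated hex string back into integers."""
    return [int(tok, 16) for tok in text.split(",") if tok.strip()]

sample = from_hex("5939cd89, 5d39dd8b, 5d39dda3")
```

A textual form of this kind is convenient for storing the fingerprint in a single database column.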

It should be noted that taking the second time as the starting point for intercepting the sub-fingerprints of the second audio file, with the second time being 32 seconds, is only one example provided by the embodiment of the present invention and is not a limitation.

It should be further noted that intercepting the fingerprint corresponding to 6 seconds of audio is likewise only an example and does not constitute a limitation. The longer the time span over which the fingerprint is calculated, the more accurate the fingerprint; the shorter the time span, the higher the efficiency. Calculating the fingerprint over only 6 seconds is efficient, and the recognition rate can reach 95%.

By the embodiment of the invention, a string of identifiers can be extracted and calculated from audio files of various formats to serve as the audio fingerprint of the audio file. The string corresponds to the audio file, and the probability that two audio files have the same audio fingerprint is very small, so the song can be identified, and the audio fingerprint does not change even if information such as the singer name or album name of the song is altered.

Fig. 2 is a flowchart of a method for adding an audio fingerprint to an audio file database according to an embodiment of the present invention. As shown in fig. 2, the audio file database includes a plurality of audio files, and the method specifically includes:

S210, at least one audio file which does not include an audio fingerprint is determined among the plurality of audio files.

The audio file database typically includes a plurality of audio files, some of which already have audio fingerprints and some of which do not. Each audio file is checked in turn to determine whether its audio fingerprint has already been computed, and the files without a computed fingerprint are added to a miss list.

The miss list generally includes at least one audio file, none of which has an audio fingerprint computed.
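Building the miss list can be sketched as a simple filter. The record layout, a dictionary with `id` and `fingerprint` keys, is a hypothetical stand-in for whatever the database actually returns.

```python
def build_miss_list(records):
    """Return the ids of audio files that have no fingerprint yet."""
    return [rec["id"] for rec in records if not rec.get("fingerprint")]

library = [
    {"id": 1, "fingerprint": "5939cd89,5d39dd8b"},
    {"id": 2, "fingerprint": None},
    {"id": 3},                      # field missing entirely
]
missing = build_miss_list(library)
```

Both an empty value and an absent field are treated as "not yet computed" in this sketch.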

S220, calculating a plurality of sub-fingerprints corresponding to each of the at least one audio file.

An audio fingerprint is computed separately for at least one audio file included in the miss list.

First, a plurality of sub-fingerprints corresponding to each audio file in the miss list are calculated, and the calculation manner of the sub-fingerprints may refer to the description in S120 in the embodiment shown in fig. 1, which is not described again.

S230, generating an audio fingerprint for each audio file, among the at least one audio file, whose duration is greater than the first time, wherein the audio fingerprint is a set number of sub-fingerprints intercepted starting from the first time of that audio file.

The generation of the audio fingerprint may refer to the description in S130 in the embodiment shown in fig. 1.

In the embodiment of the present invention, since the sub-fingerprints are intercepted starting from the first time when the audio fingerprint is generated, the miss list may contain both audio files shorter than the first time and audio files longer than the first time. The first time here corresponds to the second time in the embodiment shown in fig. 1 and may be, for example, 32 seconds.

An audio fingerprint is computed for the audio files that are longer than the first time.

For audio files shorter than the first time, the fingerprint calculation fails, and the identifiers of all audio files for which the calculation failed are collected together.
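The handling of short files can be sketched as follows. The `durations` mapping and the `compute` callback are hypothetical; `compute` stands for the sub-fingerprint calculation and interception described for fig. 1 and is assumed to raise `ValueError` when it cannot produce a full fingerprint.

```python
def fingerprint_batch(miss_list, durations, compute, first_time=32.0):
    """Fingerprint every file longer than `first_time`; collect the failures."""
    done, failed = {}, []
    for file_id in miss_list:
        if durations[file_id] <= first_time:
            failed.append(file_id)       # too short to start at first_time
            continue
        try:
            done[file_id] = compute(file_id)
        except ValueError:
            failed.append(file_id)       # long enough, but calculation failed
    return done, failed
```

The returned `failed` list plays the role of the merged identifiers of files whose calculation failed.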

S240, generating a database statement according to the audio fingerprint, and adding the audio fingerprint into the database.

For each audio file whose audio fingerprint was generated correctly, the audio file is identified by its audio fingerprint, and a MySQL statement is created from the audio fingerprint so that the audio file can be queried, deleted, and so on by means of that statement. The audio fingerprint is then added to the database according to the correspondence between the audio fingerprint and the audio file.
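A sketch of generating and executing such statements is given below. It uses Python's built-in `sqlite3` module purely as a stand-in so that the example runs without a MySQL server; the table name `audio_file` and its columns are hypothetical, and MySQL client libraries typically use a different placeholder style than the `?` shown here.

```python
import sqlite3

# Parameterized statements keep the fingerprint text out of the SQL string itself.
UPDATE_SQL = "UPDATE audio_file SET fingerprint = ? WHERE id = ?"
LOOKUP_SQL = "SELECT id FROM audio_file WHERE fingerprint = ?"

def store_fingerprint(conn, file_id, sub_fps):
    """Serialize the sub-fingerprints and attach them to one audio file row."""
    text = ",".join(format(v, "08x") for v in sub_fps)
    conn.execute(UPDATE_SQL, (text, file_id))
    conn.commit()
    return text

def find_by_fingerprint(conn, text):
    """Return the ids of audio files whose stored fingerprint equals `text`."""
    return [row[0] for row in conn.execute(LOOKUP_SQL, (text,))]
```

Because the fingerprint is derived from the audio content, a lookup of this kind still finds the file after its tag information has been edited.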

In this way, a fingerprint can be added for each audio file in the database; files whose fingerprints were generated successfully are added, and files for which generation failed are counted.

Fig. 3 is a schematic structural diagram of an apparatus for generating an audio fingerprint according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes:

an intercepting unit 301, configured to intercept a second audio file based on PCM encoding according to a first audio file, the second audio file being an audio file intercepted at a first time of the first audio file;

a sub-fingerprint generating unit 302, configured to obtain a plurality of sub-fingerprints from the second audio file;

an audio fingerprint generating unit 303, configured to intercept a set number of sub-fingerprints from the plurality of sub-fingerprints as audio fingerprints of the first audio file from a second time.

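Optionally, the apparatus further comprises a determining unit, configured to determine a source audio file and convert the source audio file into the first audio file.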

Optionally, the first time is 45 seconds.

Optionally, the set number is 512.
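The division of the apparatus into units 301 to 303 can be pictured as three pluggable stages chained together. The following is only a structural sketch; the class name, field names, and callable signatures are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class FingerprintApparatus:
    # Each field plays the role of one unit of fig. 3.
    intercept: Callable[[str], Sequence[float]]                     # unit 301
    make_sub_fingerprints: Callable[[Sequence[float]], List[int]]   # unit 302
    select: Callable[[List[int]], List[int]]                        # unit 303

    def generate(self, first_audio_path: str) -> List[int]:
        """Run the three units in order and return the audio fingerprint."""
        pcm = self.intercept(first_audio_path)
        return self.select(self.make_sub_fingerprints(pcm))
```

Keeping the units as separate callables reflects the point made in the description that each may be realized in hardware, software, or a combination of both.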

Fig. 4 is a schematic structural diagram of an apparatus for adding an audio fingerprint to an audio file database according to an embodiment of the present invention. The audio file database includes a plurality of audio files, and as shown in fig. 4, the apparatus includes:

a determining unit 401, configured to determine at least one audio file of the plurality of audio files that does not include an audio fingerprint;

a sub-fingerprint generating unit 402, configured to calculate a plurality of sub-fingerprints corresponding to each of the at least one audio file;

an audio fingerprint generating unit 403, configured to generate an audio fingerprint for each audio file, among the at least one audio file, whose duration is greater than a first time, where the audio fingerprint is a set number of sub-fingerprints intercepted starting from the first time of that audio file;

an adding unit 404, configured to generate a database statement according to the audio fingerprint, and add the audio fingerprint to the database.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the scope of the present invention should be included in the scope of the present invention.

Claims

1. A method for generating an audio fingerprint, comprising:

intercepting a second audio file based on PCM encoding from a first audio file, the second audio file being an audio file intercepted at a first time of the first audio file;

obtaining a plurality of sub-fingerprints according to the second audio file, including: down-sampling the second audio file for various channels and sampling rates; applying Hanning window processing to the down-sampled second audio file to eliminate high-frequency interference and energy leakage, and performing a Fourier transform; converting the amplitude information of the Fourier-transformed second audio file into energy information; setting a frequency range, mapping it to a plurality of frequency bands, and calculating the energy sum of each frequency band; calculating the difference in energy sum of each frequency band between consecutive frames, and obtaining a multi-bit sub-fingerprint according to the frequency bands and the energy-sum differences;

intercepting a set number of the plurality of sub-fingerprints as audio fingerprints of the first audio file starting at a second time;

the first time is 45 seconds;

the second time is greater than 32 seconds and less than the first time;

the set number is 512.

2. The method of claim 1, further comprising:

3. A method of adding an audio fingerprint to an audio file database, the audio file database comprising a plurality of audio files, the method comprising:

calculating a plurality of sub-fingerprints corresponding to each of the at least one audio file, including: down-sampling a second audio file for various channels and sampling rates; applying Hanning window processing to the down-sampled second audio file to eliminate high-frequency interference and energy leakage, and performing a Fourier transform; converting the amplitude information of the Fourier-transformed second audio file into energy information; setting a frequency range, mapping it to a plurality of frequency bands, and calculating the energy sum of each frequency band; calculating the difference in energy sum of each frequency band between consecutive frames, and obtaining a multi-bit sub-fingerprint according to the frequency bands and the energy-sum differences;

generating a database statement according to the audio fingerprint, and adding the audio fingerprint into the database;

the first time is 32 seconds;

the set number is 512.

4. An apparatus for generating an audio fingerprint, comprising:

an intercepting unit configured to intercept a second audio file based on PCM encoding from a first audio file, the second audio file being an audio file intercepted at a first time of the first audio file; the first time is 45 seconds;

the sub-fingerprint generating unit is used for obtaining a plurality of sub-fingerprints according to the second audio file, including: down-sampling the second audio file for various channels and sampling rates; applying Hanning window processing to the down-sampled second audio file to eliminate high-frequency interference and energy leakage, and performing a Fourier transform; converting the amplitude information of the Fourier-transformed second audio file into energy information; setting a frequency range, mapping it to a plurality of frequency bands, and calculating the energy sum of each frequency band; calculating the difference in energy sum of each frequency band between consecutive frames, and obtaining a multi-bit sub-fingerprint according to the frequency bands and the energy-sum differences;

an audio fingerprint generating unit, configured to intercept a set number of sub-fingerprints from the plurality of sub-fingerprints as audio fingerprints of the first audio file from a second time; the second time is greater than 32 seconds and less than the first time; the set number is 512.

5. The apparatus of claim 4, further comprising:

6. An apparatus for adding an audio fingerprint to an audio file database, the audio file database comprising a plurality of audio files, the apparatus comprising:

a sub-fingerprint generating unit, configured to calculate a plurality of sub-fingerprints corresponding to each of the at least one audio file, including: down-sampling a second audio file for various channels and sampling rates; applying Hanning window processing to the down-sampled second audio file to eliminate high-frequency interference and energy leakage, and performing a Fourier transform; converting the amplitude information of the Fourier-transformed second audio file into energy information; setting a frequency range, mapping it to a plurality of frequency bands, and calculating the energy sum of each frequency band; calculating the difference in energy sum of each frequency band between consecutive frames, and obtaining a multi-bit sub-fingerprint according to the frequency bands and the energy-sum differences;

the audio fingerprint generating unit is used for generating an audio fingerprint of an audio file which is greater than a first time in the at least one audio file, wherein the audio fingerprint is a set number of sub-fingerprints intercepted from the first time of the audio file which is greater than the first time; the first time is 32 seconds; the set number is 512;