CN113129855A - Audio fingerprint extraction and database building method, and audio identification and retrieval method and system

Info

Publication number: CN113129855A
Authority: CN (China)
Prior art keywords: audio, fingerprint, query, accompaniment, melody
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201911390214.3A
Other languages: Chinese (zh)
Inventors: 邓俊祺, 张文铂
Current Assignee: Alibaba Group Holding Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Alibaba Group Holding Ltd
Application filed by: Alibaba Group Holding Ltd
Priority to: CN201911390214.3A
Publication of: CN113129855A

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
                • G10H 1/00 - Details of electrophonic musical instruments
                    • G10H 1/0008 - Associated control or indicating means
                    • G10H 1/0033 - Recording/reproducing or transmission of music for electrophonic musical instruments
                • G10H 2240/00 - Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
                    • G10H 2240/121 - Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
                        • G10H 2240/131 - Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06F - ELECTRIC DIGITAL DATA PROCESSING
                • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
                    • G06F 16/60 - Information retrieval of audio data
                        • G06F 16/63 - Querying
                            • G06F 16/632 - Query formulation
                                • G06F 16/634 - Query by example, e.g. query by humming
                            • G06F 16/638 - Presentation of query results
                        • G06F 16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
                            • G06F 16/683 - Retrieval characterised by using metadata automatically derived from the content

Abstract

An audio fingerprint extraction and database building method, together with corresponding audio identification and retrieval methods and systems, are disclosed. The audio fingerprint extraction method comprises the following steps: acquiring a frequency spectrum of the audio; generating peak point pairs based on the frequency-time relationship between a reference peak point and other peak points in the frequency spectrum; and generating an audio fingerprint of the audio based on the peak point pairs. The audio fingerprint may include a conventional fingerprint, a melody fingerprint, an accompaniment fingerprint, and a melody-accompaniment joint fingerprint. The extracted audio fingerprints can be used for audio identification and for building an audio retrieval library, thereby facilitating identity determination for pending or query audio.

Description

Audio fingerprint extraction and database building method, and audio identification and retrieval method and system
Technical Field
The present disclosure relates to the field of audio processing, and in particular, to an audio fingerprint extraction and database construction method, and a corresponding audio identification and retrieval method and system.
Background
With the development of digital technology and the music market, the music library (audio library) owned by each music streaming service keeps growing. For example, a mature commercial music library may contain tens of millions of music items. This calls for a way to describe music data in a compact form, for example, as audio fingerprints. The audio fingerprint extracted for each audio can be put into an audio fingerprint library for matching and retrieving input audio, for example, to support the "listen and identify" (song recognition) function of a music playing app.
In the face of millions or even tens of millions of music items, how to describe the music data accurately and efficiently, and how to search and match audio quickly, has become a major problem in the field.
Disclosure of Invention
The technical problem to be solved by the present disclosure is to provide an improved audio fingerprint extraction scheme, and further a library building scheme based on the extracted fingerprints, where the resulting audio fingerprint library can interact with a client to implement an audio retrieval function.
According to a first aspect of the present disclosure, there is provided an audio fingerprint extraction method, including: acquiring a frequency spectrum of the audio; generating peak point pairs based on the frequency-time relationship between a reference peak point and other peak points in the frequency spectrum; and generating an audio fingerprint of the audio based on the peak point pairs.
According to a second aspect of the present disclosure, there is provided an audio fingerprint library building method, including: acquiring audio in a music library; extracting, according to the method of the first aspect of the present disclosure, an audio fingerprint of each acquired audio; and sorting the extracted audio fingerprints.
According to a third aspect of the present disclosure, there is provided an audio retrieval method, comprising: acquiring query audio; extracting a query audio fingerprint of the query audio; sending the query audio fingerprint into an audio fingerprint library established according to the second aspect of the disclosure; matching audio fingerprints based on the audio fingerprint library; and returning an audio retrieval result based on the matching of the audio fingerprints.
According to a fourth aspect of the present disclosure, there is provided an audio identification method, comprising: extracting an audio fingerprint of a target audio according to the method of the first aspect of the present disclosure; and determining an audio identity based on the audio fingerprint.
According to a fifth aspect of the present disclosure, there is provided an audio retrieval system comprising a client, a server, and an audio fingerprint library, wherein the client is configured to obtain query audio input by a user, and the server is configured to: feed a query audio fingerprint extracted from the query audio into the audio fingerprint library for matching; and return an audio retrieval result to the user based on the matching of the audio fingerprints. The audio fingerprints of the query audio and of the audio in the audio fingerprint library are generated based on the following operations: generating peak point pairs based on the frequency-time relationship between a reference peak point and other peak points in the frequency spectrum of the audio; and generating an audio fingerprint of the corresponding audio based on the peak point pairs.
Therefore, the audio fingerprint extraction scheme of the present disclosure can extract audio fingerprints that are both accurately descriptive and easy to retrieve, and the fingerprints can further be differentiated into melody and accompaniment fingerprints, providing richer fingerprint material for the audio fingerprint library built from them and more retrieval modes for subsequent fingerprint retrieval. In addition, the resulting fingerprint library can be sharded by audio popularity, enabling efficient retrieval with hot/cold separation.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows a schematic flow diagram of an audio fingerprint extraction method according to an embodiment of the invention.
Fig. 2 shows an example of hash time pair generation.
Fig. 3 shows a schematic flow diagram of an audio fingerprint repository establishment method according to an embodiment of the present invention.
Fig. 4 shows a schematic flow diagram of an audio retrieval method according to an embodiment of the invention.
Fig. 5 shows a schematic diagram of audio retrieval according to the present invention.
Fig. 6 is a schematic diagram showing the components of an audio retrieval system that can implement the audio retrieval function of the present invention.
Fig. 7 shows an example of the server-side internal configuration.
Fig. 8 shows a specific scenario example of music retrieval to which the present invention is applied.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
A mature commercial music library may contain tens of millions of music items. This calls for a compact way of describing music data, for example, using audio fingerprints. Here, an audio fingerprint refers to a very compact piece of data, computed from the audio content, that summarizes the content itself. The audio fingerprint of each audio can be extracted algorithmically from the audio files in the music library. These extracted audio fingerprints can then be placed into a unified audio fingerprint library for matching and retrieving input audio. For example, the "listen and identify" function of a music playing app can determine, for a piece of audio input by the user, which track in the music library it belongs to. In the face of millions or even tens of millions of music items, how to describe the music data accurately and efficiently, and how to search and match audio quickly, has become a major problem in the field.
To this end, the present disclosure provides an improved audio fingerprint extraction scheme, and further a library building scheme based on the extracted fingerprints; the resulting audio fingerprint library can interact with clients to implement an audio retrieval function. The extraction scheme can extract audio fingerprints that are both accurately descriptive and easy to retrieve, and the fingerprints may also be melody and/or accompaniment fingerprints extracted from the audio, providing richer fingerprint material for the audio fingerprint library and more retrieval modes for subsequent fingerprint retrieval. In addition, the resulting fingerprint library can be sharded by audio popularity, enabling efficient retrieval with hot/cold separation.
Fig. 1 shows a schematic flow diagram of an audio fingerprint extraction method according to an embodiment of the invention. In one embodiment, this audio fingerprint extraction method may be part of an audio fingerprint library building operation, e.g., fingerprint extraction performed for each audio in a music library. In other embodiments, the method may also be used in subsequent audio retrieval operations, i.e., extracting the fingerprint of a given audio segment to match against existing fingerprints in an established audio fingerprint library.
In step S110, a spectrum of the audio is acquired. In one embodiment, to create an audio fingerprint library for a music library, the music data (audio) within the library may be acquired one by one or in a prescribed order, and each audio signal is pre-processed (e.g., including a Fourier transform) to obtain its spectrum. In another embodiment, an audio segment submitted for retrieval is pre-processed in the same way to obtain its spectrum.
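As an illustration of this pre-processing step, the following is a minimal sketch in Python. The patent does not fix the transform parameters, so the Hann window, window length, and hop size below are illustrative assumptions rather than values from the disclosure.

```python
import numpy as np

def audio_spectrum(samples: np.ndarray, n_fft: int = 1024, hop: int = 512) -> np.ndarray:
    """Magnitude spectrogram via a short-time Fourier transform (step S110)."""
    assert len(samples) >= n_fft, "need at least one full analysis window"
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop
    frames = np.stack([samples[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins of each frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, time_frames)
```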
Subsequently, in step S120, peak point pairs are generated based on the frequency-time relationship between a reference peak point and other peak points in the spectrum. In step S130, an audio fingerprint of the audio is generated based on these peak point pairs. The audio is thus characterized by the frequency-time distribution between anchor (reference) peak points and other salient peak points in its spectrum.
Here, a peak point pair may be characterized by a hash-time pair comprising a hash value and the time of the reference peak point. The hash value encodes the frequency-time relationship of the reference peak point to another peak point and may be used as the key for matching peak point pairs. In subsequent audio fingerprint library construction, the library may be built as a hash table. A hash table consists of entries, each composed of a key and a value, and one table may contain many entries. Hash tables are data structures whose storage locations are directly addressable by key, which facilitates direct access and matching based on hash values. In the present invention, the hash value of a hash-time pair serves as the key of the hash table, and each key is accompanied by a value string. The value string may include pairs (ID, t) consisting of the ID of a song containing that hash value and the time t at which the hash value occurs in the song.
To this end, one entry in the hash table may have the following form:
key | value, i.e.
hash → (id1, t1), (id2, t2), ...
Here, id1, id2, ... denote the IDs of songs containing the hash value, and t1, t2, ... denote the times of the reference peak point for that hash value in the corresponding songs. In the present invention, a hash table may be constructed for each shard (described below), and each table may have a large number of entries, e.g., on the order of millions, available for retrieval.
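A minimal sketch of this entry layout, with plain Python dicts standing in for whatever storage engine a production fingerprint shard would use; the data shapes follow the description above, everything else is an assumption.

```python
from collections import defaultdict

# hash value -> list of (song ID, reference-peak time) pairs,
# i.e. the key | value entry layout described above.
FingerprintIndex = dict[int, list[tuple[int, int]]]

def build_index(songs: dict[int, list[tuple[int, int]]]) -> FingerprintIndex:
    """songs maps song_id -> its fingerprint (a list of (hash, time) pairs)."""
    index: FingerprintIndex = defaultdict(list)
    for song_id, pairs in songs.items():
        for h, t in pairs:
            index[h].append((song_id, t))
    return index

# e.g. hash 0x2A1F at frame 120 of song 7, and at frame 40 of song 9:
index = build_index({7: [(0x2A1F, 120)], 9: [(0x2A1F, 40)]})
# index[0x2A1F] == [(7, 120), (9, 40)]
```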
Fig. 2 shows an example of hash-time pair generation. As shown in the upper left diagram, the audio may be transformed (e.g., with a short-time Fourier transform) to obtain its spectrum. The spectrum is then subjected to two-dimensional peak picking to obtain the "constellation map" shown in the upper right diagram, i.e., a binary map that is 1 where there is a peak and 0 elsewhere. For each peak point in the constellation map, a region of fixed size behind it can be selected as its "effective area", which may contain other peak points, as shown in the lower right diagram; together they constitute a [reference peak point, effective area] pair. For each peak point in the effective area, a peak point pair [p1, p2] is obtained, where p1 is the reference peak point and p2 is a peak point in the effective area. The frequency-time relationship [f1, f2, dt]: t1 of the pair [p1, p2] is then computed to uniquely characterize it, denoted hash:time. Here f1 is the frequency bin of p1, f2 is the frequency bin of p2, dt is the time difference between p2 and p1, and t1 is the time of p1; "hash" denotes the datum formed by combining f1, f2 and dt, and "time" denotes t1, as shown in the lower left diagram. In practice, the peak-picking threshold and the position and size of the effective area can be tuned for best results. In other embodiments, hash-time pairs characterizing peak point pairs may be generated in other ways, e.g., the positions of f1, f2 and dt within the hash may be swapped; the invention is not limited in this respect.
A hash-time pair is generated for each of the other peak points within the effective area. Other peak points in the constellation map are then selected in turn as reference peak points, and the same effective-area selection and hash-time pair generation are performed. All hash-time pairs generated for the same audio (i.e., the same constellation map) together form the fingerprint of that audio. When an audio fingerprint library is created for the songs in a music library, these hash-time pairs may be stored in association, for example, in the same hash table keyed to the song ID. It should be appreciated that since songs are long (popular songs typically run 3-5 minutes, classical pieces often longer) and their constellation maps contain many peaks, the complete audio fingerprint of a song may comprise a great many hash-time pairs. The audio input for a matching query, by contrast, is typically only a few seconds long and yields far fewer hash-time pairs than a complete song.
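The following sketch ties the two steps together: crude two-dimensional peak picking to build the constellation map, then pairing each reference peak with the peaks in its effective area. The peak-picking rule, zone size, fan-out cap, and hash bit layout are all illustrative assumptions; as the text itself notes, these must be tuned in practice.

```python
import numpy as np

def constellation(spectrum: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Crude 2-D peak picking: a bin is a peak if it exceeds the threshold
    and is the maximum of its 3x3 neighbourhood."""
    peaks = []
    freq_bins, frames = spectrum.shape
    for f in range(1, freq_bins - 1):
        for t in range(1, frames - 1):
            v = spectrum[f, t]
            if v > threshold and v >= spectrum[f - 1:f + 2, t - 1:t + 2].max():
                peaks.append((f, t))
    return peaks

def hash_time_pairs(peaks, zone_frames=64, zone_bins=32, fan_out=8):
    """Pair each reference peak with peaks in an 'effective area' behind it
    and pack (f1, f2, dt) into one integer hash. The 18/8-bit shift layout
    is an illustrative assumption, not taken from the patent."""
    pairs = []
    for f1, t1 in peaks:
        targets = [(f2, t2) for f2, t2 in peaks
                   if 0 < t2 - t1 <= zone_frames and abs(f2 - f1) <= zone_bins]
        for f2, t2 in targets[:fan_out]:             # cap the fan-out per anchor
            h = (f1 << 18) | (f2 << 8) | (t2 - t1)   # hash combines f1, f2, dt
            pairs.append((h, t1))                     # (hash, time) pair
    return pairs
```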
In the present invention, an audio fingerprint may be built from the complete information of the audio, e.g., an entire song; alternatively, the audio may first be separated into stems (source separation), and fingerprint extraction (e.g., the peak-pair hash extraction shown in Fig. 2) performed on the separated data, which facilitates describing the audio features from different angles. In various embodiments, melody information and accompaniment information may be extracted from the audio.
As known from music theory, a "melody" is a linear succession of musical tones that the listener perceives as a single entity: a sequence of tones of different or equal pitch linked with a specific relationship of pitch and rhythm. To the human ear, a melody is the equivalent of what a line is to the eye. Literally, a melody is a combination of pitch and rhythm, though figuratively it can also include other musical elements, such as a succession of timbres. The melody can be regarded as the foreground against the background of the accompaniment. "Accompaniment" is the musical part that provides rhythmic and/or harmonic support for the melody or main theme of a song or instrumental piece. Different musical styles use many different styles and types of accompaniment. Harmonic accompaniment is used mainly in popular music to support a clear vocal melody. In popular and traditional music, the accompaniment parts typically provide the "beat" of the music and outline the chord progression of the song or instrumental piece.
In the prior art, relatively mature algorithms already exist for extracting melody information and accompaniment information from audio. In one embodiment, the audio fingerprint extraction method of the present invention may be implemented as, or include, melody fingerprint extraction. To this end, the method may further include extracting melody information of the audio and extracting an audio melody fingerprint based on it. Specifically, the melody information may be Fourier-transformed to obtain a melody spectrum; peak point pairs are generated based on the frequency-time relationship between a reference peak point and other peak points in the melody spectrum; and an audio melody fingerprint is generated from these peak point pairs. For example, the hash-time pair extraction shown in Fig. 2 may be applied to the extracted melody information, and the melody fingerprint of a song stored in a melody hash table in association with the song ID. As described in more detail below, the audio fingerprint library may be implemented as, or may include, a melody fingerprint library, serving as an alternative or a supplement to full-audio fingerprint search.
In one embodiment, the extracted melody information preferably refers to vocal information. For a typical pop song, the singer's singing or rap can be regarded as the melody of the song, i.e., the main part most readily perceived by the listener. To this end, vocal information may be extracted from the audio based on, for example, an existing vocal extraction algorithm, and Fourier-transformed to obtain a vocal spectrum. Peak point pairs may then be generated, as described above, based on the frequency-time relationship between a reference peak point and other peak points within the vocal spectrum, and an audio vocal fingerprint generated from these pairs. For example, the hash-time pair extraction shown in Fig. 2 may be applied to the extracted vocal information, and the vocal fingerprint of a song stored in a vocal hash table in association with the song ID. The vocal fingerprint may be used instead of, or in addition to, the melody fingerprint. For example, when a song is detected to consist mainly of vocal singing, e.g., a pop song, its vocal information may be selected as its melody information. Since a song usually also contains prelude, interlude, and ending melodies before, between, and after the vocals, these melodies can be merged into the vocal information so that it presents the complete melody of the song.
Alternatively or additionally, the audio fingerprint extraction method of the present invention may include accompaniment fingerprint extraction. To this end, the method may further include extracting accompaniment information of the audio and extracting an audio accompaniment fingerprint based on it. Specifically, the accompaniment information may be Fourier-transformed to obtain an accompaniment spectrum; peak point pairs are generated based on the frequency-time relationship between a reference peak point and other peak points in the accompaniment spectrum; and an audio accompaniment fingerprint is generated from these pairs. For example, the hash-time pair extraction shown in Fig. 2 may be applied to the extracted accompaniment information, and the accompaniment fingerprint of a song stored in an accompaniment hash table in association with the song ID. As described in more detail below, the accompaniment fingerprint library may be included in the audio fingerprint library as a complement to full-audio or melody fingerprint search.
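Since the melody, vocal, and accompaniment fingerprints described above all reuse the same spectrum-to-hash pipeline on a different input signal, a compact sketch (building on the helper functions above) might look as follows. Here `separate_sources` is a hypothetical stand-in for the existing melody/vocal/accompaniment extraction algorithms the text refers to.

```python
def fingerprint_all_layers(samples, separate_sources, threshold=1.0):
    """Extract one fingerprint per signal layer, reusing audio_spectrum,
    constellation, and hash_time_pairs from the sketches above.
    separate_sources is assumed to return a dict of stems, e.g.
    {"melody": ..., "accompaniment": ...}; it is not defined by the patent."""
    layers = {"full": samples}
    layers.update(separate_sources(samples))
    fingerprints = {}
    for name, signal in layers.items():
        peaks = constellation(audio_spectrum(signal), threshold)
        fingerprints[name] = hash_time_pairs(peaks)
    return fingerprints  # each entry would go into its own hash table
```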
Furthermore, the audio fingerprint extraction method of the invention may include melody-accompaniment joint fingerprint extraction. Here, a "melody-accompaniment joint fingerprint" refers to fingerprint information extracted jointly from the melody information, the accompaniment information, and their correlation. To this end, the method may further include extracting the melody information and accompaniment information of the audio and extracting a joint fingerprint from them: melody-accompaniment joint peak point groups are generated based on the frequency-time relationships between a reference melody peak point, other melody peak points in the melody spectrum, and accompaniment peak points in the accompaniment spectrum; an audio melody-accompaniment joint fingerprint is then generated from these joint peak point groups. For example, a hash-time pair extraction similar to that of Fig. 2 may be performed. Specifically, the extracted melody information and accompaniment information are each Fourier-transformed to obtain melody and accompaniment spectra; the two spectra are aligned and two-dimensional peak picking is applied to each. On the melody constellation map, a reference peak point is selected, a region behind it is chosen as the effective area, and the frequency-time relationships between the reference peak point and the other melody peak points in that area are computed. On the accompaniment constellation map, the same region is selected as the effective area, and the frequency-time relationships between the melody reference peak point and the accompaniment peak points in that region are computed. A hash-time pair constructed from 3 peaks is thus obtained. Because introducing the accompaniment constellation map greatly increases the number of hash-time pairs, the other melody and accompaniment peaks may be randomly sampled according to some rule to keep the number of resulting hash-time pairs reasonable. The audio melody-accompaniment joint fingerprint generated from these frequency-time relationships may then be stored in a melody-accompaniment joint hash table in association with the song ID. As described in more detail below, the joint fingerprint library may be included in the audio fingerprint library as a complement to full-audio or melody fingerprint search. It should be understood that in other embodiments the frequency-time relationships between melody and accompaniment may be extracted for fingerprinting in other ways; the invention is not limited in this respect.
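A sketch of the 3-peak joint hashing just described; the zone size, keep probability, and hash layout are illustrative assumptions, and the random subsampling mirrors the thinning rule mentioned above.

```python
import random

def joint_hash_time_pairs(melody_peaks, accomp_peaks, zone_frames=64,
                          keep_prob=0.25, seed=0):
    """3-peak joint hashes: a melody reference peak paired with one melody
    target peak and one accompaniment target peak from the same effective
    area. Random subsampling (keep_prob) keeps the pair count reasonable."""
    rng = random.Random(seed)
    pairs = []
    for f1, t1 in melody_peaks:
        mel = [(f, t) for f, t in melody_peaks if 0 < t - t1 <= zone_frames]
        acc = [(f, t) for f, t in accomp_peaks if 0 < t - t1 <= zone_frames]
        for f2, t2 in mel:
            for f3, t3 in acc:
                if rng.random() > keep_prob:
                    continue
                # Combine (f1, f2, f3, dt_melody, dt_accomp) into one key;
                # tuple hashing stands in for a fixed-width bit packing.
                pairs.append((hash((f1, f2, f3, t2 - t1, t3 - t1)), t1))
    return pairs
```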
After fingerprint extraction, an audio fingerprint library may be built for indexing and retrieval of audio features. Fig. 3 shows a schematic flow diagram of an audio fingerprint library building method according to an embodiment of the present invention.
In step S310, audio within the music library is acquired. In different embodiments, the audio may be read according to its storage order in the library, the song index order, the genre order, etc., preferably in parallel.
Subsequently, in step S320, an audio fingerprint of the acquired audio is extracted. The extraction of the audio fingerprint may be implemented based on the audio fingerprint extraction method of the present invention as disclosed above. In step S330, the extracted audio fingerprints are sorted.
In various implementations, the present invention may extract, for a song, a conventional audio fingerprint, a melody fingerprint (and/or vocal fingerprint), an accompaniment fingerprint, and/or a melody-accompaniment joint fingerprint, yielding a corresponding conventional audio fingerprint library, melody (and/or vocal) fingerprint library, accompaniment fingerprint library, and/or joint fingerprint library, or a single audio fingerprint library containing all of the above. In the present invention, "audio fingerprint" in the narrow sense refers only to fingerprints extracted from the spectrum of the conventional audio (i.e., including both melody and accompaniment). In the broad sense, it refers to all fingerprints extracted from the original audio, including the conventional audio fingerprint as well as the melody (and/or vocal), accompaniment, and melody-accompaniment joint fingerprints. Similarly, "audio fingerprint library" in the narrow sense comprises only the conventional audio fingerprint library, and in the broad sense also the melody (and/or vocal), accompaniment, and joint fingerprint libraries.
In one embodiment, because melody information (e.g., vocal information complemented with prelude, interlude, and ending melodies) carries most of a song's information while containing fewer features than a conventional fingerprint, the audio fingerprint library can be built from melody fingerprints instead of conventional full-audio fingerprints; at retrieval time the same melody fingerprint extraction and matching are performed on the input query audio, simplifying the computation required for audio retrieval. In one embodiment, a melody fingerprint library (or just a vocal fingerprint library) may be created as a supplement to a conventional full-audio fingerprint library. Thus, for example, when the input query audio includes a human voice, the vocal fingerprint library is searched first, and the full-audio fingerprint library is searched for supplementation or disambiguation when no result, or several results, are returned. In one embodiment, melody and accompaniment fingerprint libraries may be created instead of the full-audio fingerprint library, with melody fingerprints matched first and accompaniment fingerprints used for de-duplication when multiple results are returned. In one embodiment, a melody-accompaniment joint fingerprint library may also be built to enable more discriminative retrieval than conventional audio fingerprint retrieval. These melody (and/or vocal), accompaniment, and/or joint fingerprints may be stored in respective libraries, e.g., in respective hash tables.
In step S330, the audio fingerprints extracted from the music library may be sorted according to various rules. In one embodiment, they may be ordered by audio popularity. Popularity may include the play popularity, search popularity, and recognition popularity of the audio: when users listen to songs through a music playing app or a web page on a terminal, the system may record how often an audio is played, how often it is searched for (or otherwise queried) in the search bar, and/or how often it is identified through the song-recognition feature. The ordering of the audio in the audio fingerprint library may follow different rules using one or all of these popularity measures (e.g., weighted).
Alternatively or additionally, the audio fingerprints may be sorted according to audio attributes, wherein the audio attributes include at least one of: the language of the audio; the genre of the audio; and a scene tag of the audio. This facilitates subsequent fast, tag-based retrieval, e.g., based on a user profile.
It should be understood that the ordering of the audio in the audio fingerprint library may be carried out as the audio is read in, e.g., the whole library may first be sorted according to the chosen rules, followed by audio reading, fingerprint extraction, and storage. In other embodiments, the sorting (order adjustment) may also be performed after the fingerprints are generated. For efficiency, sorting the library first is preferred.
The audio fingerprint library obtained from this sorting can be divided into different shards, so that different retrieval support can subsequently be provided for the audio fingerprints of different shards.
For example, for a library of tens of millions of songs, the whole library may first be sorted by play popularity. The sorted library can then be sliced into shards of 1 million songs each. For each song in each shard, the audio fingerprint can be extracted as in, e.g., Fig. 2, and a hash table built with the hashes of the fingerprint as keys and the times and song IDs as the values. Once the library has been processed in this way, the original music library has been converted into an audio fingerprint library. In subsequent retrieval services, the fingerprint shard of the top 1 million songs may serve as the hot fingerprint shard, replicated across more servers providing the same service, while the other shards are served by fewer (even single) servers. Alternatively, audio with high recognition popularity can additionally be included in the hot shard; for example, the first shard may consist of the top 950,000 songs by play popularity, the top 40,000 by search popularity, and the top 10,000 by recognition popularity (with or without de-duplication). Within the first shard, songs may be stored in sub-partitions by song attributes to facilitate retrieval based on user-profile classification. In addition, several hash tables, e.g., an audio fingerprint hash table, a melody fingerprint hash table, an accompaniment fingerprint hash table, and/or a joint fingerprint hash table, may be linked to one song ID to handle different retrieval needs.
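A sketch of popularity-weighted sorting and fixed-size sharding follows. The weights and the use of a simple linear combination are illustrative assumptions; the patent only says the popularity signals may be combined, e.g., by weighting.

```python
def popularity_score(song, w_play=0.7, w_search=0.2, w_recognize=0.1):
    """Weighted combination of the three popularity signals; the weights
    are illustrative assumptions, not values from the patent."""
    return (w_play * song["plays"]
            + w_search * song["searches"]
            + w_recognize * song["recognitions"])

def shard_library(songs, shard_size=1_000_000):
    """Sort the library by popularity and cut it into fixed-size shards;
    shard 0 is the 'hot' shard that would be replicated on more servers."""
    ranked = sorted(songs, key=popularity_score, reverse=True)
    return [ranked[i:i + shard_size] for i in range(0, len(ranked), shard_size)]
```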
After the audio fingerprint library is established, the input query audio may be retrieved based on the audio fingerprint library. Fig. 4 shows a schematic flow diagram of an audio retrieval method according to an embodiment of the invention.
In step S410, query audio is acquired. Query audio may be obtained in various ways. For example, in response to the user tapping a "listen and identify" button, a music playing app may invoke the built-in microphone of the smart device to pick up audio in the environment, e.g., music playing on the street or from a television. After being picked up, the audio may first undergo noise reduction to filter out irrelevant ambient sound. It may then be dynamically compressed to reduce its dynamic range, making it fuller and easier to recognize. Either the raw audio input used for the matching query or the audio after the above noise reduction and dynamic compression may be referred to as the "query audio"; the invention is not limited in this respect.
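For concreteness, a minimal hard-knee compressor sketch of the dynamic compression step; real pipelines would operate on a smoothed envelope with attack/release times and add a proper noise-reduction stage, both omitted here.

```python
import numpy as np

def compress_dynamics(samples: np.ndarray, threshold: float = 0.5,
                      ratio: float = 4.0) -> np.ndarray:
    """Scale sample amplitudes above the threshold down by the ratio,
    shrinking the dynamic range so quiet passages of the query are easier
    to fingerprint. Threshold and ratio are illustrative values."""
    mag = np.abs(samples)
    over = mag > threshold
    out = samples.copy()
    out[over] = np.sign(samples[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out
```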
In step S420, a query audio fingerprint of the query audio is extracted. In step S430, the query audio fingerprint is fed into the audio fingerprint library established as described above. In step S440, matching of audio fingerprints is performed based on the audio fingerprint library, and in step S450, an audio retrieval result is returned based on the matching of audio fingerprints.
Here, it should be understood that the audio fingerprint of the query audio may be extracted as described above in connection with Figs. 1 and 2 and their preferred embodiments, and fed into an audio fingerprint library built as described in connection with Fig. 3 and its preferred embodiments. The fingerprint extraction method used for the query audio should be consistent with the extraction method used for the fingerprint library being queried. For example, if a fingerprint library built from 2-peak hashes is to be queried, the query audio's fingerprint should likewise be extracted as 2-peak hashes; if a 3-peak joint fingerprint library is to be queried, the query fingerprint should likewise be a 3-peak joint fingerprint. The spectral source being fingerprinted, however, need not match: an extracted conventional query audio fingerprint may be used to search the corresponding conventional audio fingerprint library, but also the melody or accompaniment fingerprint library; likewise, an extracted query melody fingerprint may be used to search the corresponding melody fingerprint library, but also the conventional audio or accompaniment fingerprint library.
Thus, in one embodiment, step S420 may include extracting a query melody fingerprint of the query audio. Step S430 may include entering the melody fingerprint of the query audio into the melody fingerprint library created as described above or into a conventional audio fingerprint library that also includes melody information. Accordingly, step S440 may include matching melody fingerprints based on the fingerprint library.
Additionally, when the melody or audio fingerprint library returns multiple results, a secondary search against other libraries (e.g., the accompaniment library) may be performed. Step S420 may then include extracting a query accompaniment fingerprint of the query audio; step S430 may include feeding the accompaniment fingerprint of the query audio into the accompaniment fingerprint library established as described above; and step S440 may include accompaniment fingerprint matching against the accompaniment fingerprint library, conditioned on the results returned from the melody fingerprint library. For example, if the melody fingerprint library returns two matching song IDs for a query audio, accompaniment fingerprint matching may be performed for those two songs, and the song whose accompaniment fingerprint also matches is returned as the recognition result.
In other cases, the accompaniment information may be used outside the secondary-search scenario. For example, in background-music identification, the accompaniment information of the query audio may be extracted and an accompaniment fingerprint generated. Since the accompaniment fingerprint carries the main information of the background music being queried, it can be fed into the conventional audio fingerprint library and the accompaniment fingerprint library, and even into the melody fingerprint library, for searching.
Specifically, a dedicated short-video soundtrack recognition function may be developed to identify the background music of short videos with soundtracks. Such a video's audio is a mixed query audio containing both foreground and background sound. The method may therefore further include source-separating the mixed query audio (vocal/accompaniment or melody/accompaniment separation) to obtain separated accompaniment data, and extracting a fingerprint of that data as the query audio fingerprint. When the foreground contains vocals (or other ambient sound from recording) and the background soundtrack contains no vocals (e.g., pure music), the separated accompaniment data contains essentially the complete background soundtrack, so the extracted fingerprint is suitable for direct search against the conventional audio fingerprint library or the melody fingerprint library. When the foreground contains vocals and the background soundtrack also contains vocals (e.g., a sung song), the separated accompaniment data generally contains only the accompaniment of the background soundtrack, since the vocals are filtered out, so the extracted fingerprint is suitable for direct search against the accompaniment fingerprint library. To cover background-sound search for input audio, the retrieval scheme of the invention may source-separate every input audio and search the subdivided fingerprint libraries, or perform source separation and those searches only when conventional fingerprint extraction and search yields no match. Alternatively or additionally, the retrieval scheme may offer the user dedicated buttons, e.g., "background sound retrieval" or "short video soundtrack retrieval" as sub-functions under the "listen and identify" function; pressing such a button triggers source separation of the query audio and retrieval of the conventional, melody, and/or accompaniment fingerprint libraries using the separated accompaniment fingerprint. It should be understood that the above scheme also applies to mixed foreground/background query audio other than short videos with soundtracks.
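A sketch of this routing logic, reusing the helper functions from the earlier sketches; `separate_sources` and `search` are hypothetical stand-ins for the separation algorithm and the per-library lookup, and the library ordering follows the reasoning above.

```python
def retrieve_background_music(query_samples, separate_sources, search):
    """Short-video soundtrack flow: separate the mixed query, fingerprint
    the accompaniment stem, then try the fingerprint libraries in turn.
    search(library_name, fingerprint) is assumed to return a song ID or None."""
    stems = separate_sources(query_samples)  # e.g. {"vocals": ..., "accompaniment": ...}
    peaks = constellation(audio_spectrum(stems["accompaniment"]), threshold=1.0)
    query_fp = hash_time_pairs(peaks)
    # A pure-music soundtrack survives separation almost whole, so the
    # conventional/melody libraries apply; a sung soundtrack keeps only its
    # accompaniment, so fall back to the accompaniment library.
    for library in ("conventional", "melody", "accompaniment"):
        result = search(library, query_fp)
        if result is not None:
            return result
    return None
```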
Joint fingerprint retrieval may be introduced as a more accurate alternative or supplement to conventional audio fingerprint retrieval, and may also be combined with the melody and/or accompaniment fingerprint retrieval described above. Thus, in one embodiment, step S420 may include extracting a melody-accompaniment joint fingerprint of the query audio; step S430 may include feeding it into the melody-accompaniment joint fingerprint library established as described above; and step S440 may include matching the joint fingerprint against the joint fingerprint library.
As previously described, the peak point pairs of an audio fingerprint are characterized by hash-time pairs comprising a hash value and a reference peak point time. The matching step S440 may therefore use the hash value as the key for matching peak point pairs, and in step S450 an audio retrieval result may be returned based on how well the query fingerprint's peak point pairs match those of audio fingerprints in the library. Specifically, for each candidate audio that shares matching hash values with the query audio, the time differences (or quotients) between the reference peak point times may be computed over all hash-time pairs with matching hash values, and the match between the query audio and the candidate audio determined from those differences (or quotients).
Fig. 5 shows a schematic diagram of audio retrieval according to the present invention. The preprocessed query audio is transformed, via fingerprint extraction, into a series of hash:time pairs; this string of hash-time pairs is called the "query fingerprint" or "query audio fingerprint". The hash:time pairs are fed into the audio fingerprint library as a query request. The library contains a hash table made up of a large number of key|value entries. The lower left of Fig. 5 shows a segment of the hash table's storage structure, in which hash0/offset0, hash1/offset1, and hash2/offset2 correspond to the key|value of three entries. The key of each entry is a hash value, and the value is one or more (id, t) pairs, each consisting of a song ID containing the hash value and the time t of the hash value within that song. When a hash value occurs in several songs, its value string contains several (id, t) pairs, e.g., (id1, t1), (id2, t2), and (id3, t3). Each hash in the query is matched against the library, returning a series of t and id values; the middle of Fig. 5 shows a returned data structure containing, e.g., id1, t1, id2, t2, and id3, t3. Each returned (t, id) pair may be 32-64 bits long. Then, grouping by id, all returned times under that id are collected and differenced against the t values of the corresponding hash values in the query fingerprint, producing a dt set for that id, as shown in the upper right of Fig. 5. If the number of equal (or very close) dt values in an id's dt set exceeds a threshold, that id is the song matched by the query. As shown in the lower right of Fig. 5, id2, which has the largest count of equal dt values and exceeds the threshold, may be taken as the song hit by the query. Instead of differencing the time values against the query fingerprint's t values, the corresponding time values may also be divided (taking their quotient), which allows the original track to be hit even when the query audio has been tempo-adjusted relative to the original.
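The matching just described reduces to a small amount of code. A sketch follows, using the index layout from the earlier sketch; `min_votes` is an assumed tuning value, not a number from the patent.

```python
from collections import Counter, defaultdict

def match_query(query_pairs, index, min_votes=5):
    """Matching as in Fig. 5: look up each query hash in the library index,
    histogram the offsets dt = t_song - t_query per song ID, and report the
    song whose most common offset clears the vote threshold."""
    votes = defaultdict(Counter)  # song_id -> Counter over dt values
    for h, t_query in query_pairs:
        for song_id, t_song in index.get(h, ()):
            votes[song_id][t_song - t_query] += 1
    best_id, best_count = None, 0
    for song_id, counter in votes.items():
        _, count = counter.most_common(1)[0]
        if count > best_count:
            best_id, best_count = song_id, count
    return best_id if best_count >= min_votes else None
```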
As noted earlier, when the audio fingerprint library is built, sharding can follow the popularity order of the audio. In an actual query, the query request may thus first be sent to the hot fingerprint shard and, if that returns no hit, then to the other shards. Alternatively or additionally, the user's music preference may be obtained and the audio fingerprint shards matching that preference retrieved preferentially during fingerprint matching. For example, when a user tag indicates that the user enjoys classical music, the classical-music fingerprints, e.g., the classical partition of the hot shard, may be searched first.
The above discloses preferred schemes for using the extracted audio fingerprint for retrieval matching such as "listen and identify". In a more general embodiment, the audio fingerprint of the present invention may also be used for determining audio identity in further scenarios. The invention can therefore also be implemented as an audio identity determination method. In this method, an audio fingerprint of the target audio is first extracted according to the method described above; the extracted fingerprint may be a conventional audio fingerprint containing melody and accompaniment information, a melody fingerprint, an accompaniment fingerprint, and/or a melody-accompaniment joint fingerprint. The audio identity is then determined based on the audio fingerprint.
Specifically, the extracted audio fingerprint of the target audio may be saved as a reference audio fingerprint. Determining an audio identity based on the audio fingerprint may then include: judging the identity of the pending audio based on a comparison between the audio fingerprint extracted from the pending audio and the reference audio fingerprint.
For example, when a new song is officially released, a fingerprint may be extracted from it as the target audio and retained as the reference audio fingerprint. For subsequently produced audio (referred to here as "pending" audio), the relationship between the pending audio and the target audio can be determined from how well the fingerprint extracted from the pending audio matches the reference fingerprint. Matching here may be based on one or more of the fingerprint types described above, and from the matching result (e.g., agreement of the conventional fingerprints, or of the melody or accompaniment fingerprints) it can be judged whether the pending audio is related to the target audio, e.g., whether it plagiarizes the target audio or is a derivative (secondary) creation based on it.
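A simple similarity proxy for this comparison might be the fraction of shared hashes, sketched below; the decision thresholds one would apply to such a score (plagiarism vs. derivative work) are policy choices not fixed by the patent.

```python
def fingerprint_overlap(pending_pairs, reference_pairs):
    """Fraction of the pending audio's hashes that also occur in the
    reference fingerprint; computed per fingerprint type (conventional,
    melody, accompaniment, or joint) as described above."""
    pending_hashes = {h for h, _ in pending_pairs}
    reference_hashes = {h for h, _ in reference_pairs}
    if not pending_hashes:
        return 0.0
    return len(pending_hashes & reference_hashes) / len(pending_hashes)
```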
In a user-generated content (UGC) scenario, a fingerprint may be extracted from an original work as its originality tag (reference audio fingerprint); as the original author or other authors continue creating, fingerprints may be extracted work by work for tracking and comparison.
To increase trustworthiness, saving the extracted audio fingerprint of the target audio as a reference audio fingerprint may include uploading it to a reference audio fingerprint repository. This repository may be maintained by an official body or an institution with high public trust, strengthening the fingerprint's standing as a "reference". The reference audio fingerprint repository may be a copyright fingerprint repository and/or an originality fingerprint repository. Thus, when the identity of a pending audio is disputed, the relationship between the pending audio and the target audio can be determined by comparing the fingerprint extracted from the pending audio against the reference fingerprints in the copyright and/or originality repository, e.g., judging from the degree of fingerprint match whether the pending audio plagiarizes, or is a derivative work of, the target audio.
Accordingly, the present invention may also be embodied as an audio retrieval system. Fig. 6 is a schematic diagram showing the components of an audio retrieval system that can implement the audio retrieval function of the present invention. The system may include a plurality of clients 610, a server 620, and an audio fingerprint repository 630.
The client 610 may be a smart terminal, such as a mobile phone, tablet, or desktop computer, on which a music playing app is installed or a corresponding music playing or song-recognition web page is open. The user can enter the song-recognition page of the music playing app, or directly tap the song-recognition button on the page, causing the app to invoke the device's microphone to pick up ambient sound.
The preprocessing and fingerprint extraction of the ambient sound (query audio) may then be performed on the client 610 or on the server 620 (cloud); the invention is not limited in this respect. The server 620 may then feed the query audio fingerprint into the audio fingerprint repository 630 for matching and return the audio retrieval result to the user based on the fingerprint match.
The audio fingerprints of the query audio and of the audio in the fingerprint repository may be generated by the operations above, e.g., the fingerprint extraction described in connection with Figs. 1 and 2. The audio fingerprint repository 630 may include at least one of: a melody fingerprint library; an accompaniment fingerprint library; and a melody-accompaniment joint fingerprint library, as alternatives or supplements to the conventional audio fingerprint library. The server 620 may feed each query fingerprint into the corresponding library for matching.
The audio fingerprint repository 630 may comprise multiple shards distinguished by audio popularity. Each shard may maintain a hash table containing a large number of entries, where each entry has a key|value structure: the key is a hash value, and the value is one or more (id, t) pairs, each consisting of a song ID containing that hash value and the time t of the hash value within the song. When a hash value occurs in several songs, its value string contains several (id, t) pairs.
The server first sends the query audio fingerprint to the shard storing high-popularity audio. The popularity of an audio includes at least one of: its play popularity; its search popularity; and its recognition popularity. Query retrieval against the high-popularity shard may be served by multiple servers.
In one embodiment, the audio fingerprint repository may also be the copyright and/or originality audio fingerprint repository described above, and returning audio retrieval results based on fingerprint matching then includes: judging the relationship between the query audio and the target audio by comparing the fingerprint extracted from the query audio against the reference fingerprints in the copyright and/or originality repository, thereby helping determine the identity of the query audio.
Fig. 7 shows an example of the server-side internal configuration. It should be understood that, to handle searches over tens of millions of songs, the server side of the present invention is a cloud service platform composed of a large number of machines, and the servers providing the music retrieval service may be part of an online music service platform.
As shown, when the server side receives a query request, a distribution interface (e.g., a management node) may first distribute the request to the hotspot servers storing the most frequently accessed data, as indicated by arrow (1). These hotspot servers store the audio fingerprint information of, for example, the top 1 million songs by popularity. The hotspot servers may store identical, or at least partially identical, fingerprint information, so as to provide an efficient retrieval service for popular tracks. If there is no hit in the hotspot servers, the distribution interface distributes the query to the partition servers storing the regular data, e.g., partition servers A, B, C, and D, each storing a different shard of audio fingerprint content, as indicated by arrow (2). The contents of the hotspot and partition servers may be updated periodically; for example, the hotspot servers may be refreshed every two weeks according to current popularity, and the partition servers on a longer cycle, e.g., monthly.
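A sketch of this dispatch flow; `match` is a hypothetical per-server lookup returning a song ID or None, and hashing the query across hot replicas is an illustrative load-balancing choice, not specified by the patent.

```python
def dispatch_query(query_fp, hot_servers, partition_servers, match):
    """Fig. 7 dispatch: try a hotspot server first (the replicas hold the
    same hot fingerprints, so any one will do), then fall back to the
    A/B/C/D partition servers."""
    hot = hot_servers[hash(tuple(query_fp)) % len(hot_servers)]  # step (1)
    result = match(hot, query_fp)
    if result is not None:
        return result
    for server in partition_servers:                             # step (2)
        result = match(server, query_fp)
        if result is not None:
            return result
    return None
```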
The audio fingerprint extraction and library building methods according to the present invention have been described in detail above with reference to the accompanying drawings. The extraction scheme of the invention can extract audio fingerprints that are accurately descriptive and easy to retrieve, and the fingerprints can further be differentiated into melody and accompaniment fingerprints, providing richer fingerprint material for the audio fingerprint library built from them and more retrieval methods for subsequent fingerprint retrieval. In addition, the resulting fingerprint library can be sharded by audio popularity, enabling efficient retrieval with hot/cold separation.
Fig. 8 shows a specific scenario example of music retrieval to which the present invention is applied. As shown on the left side of the figure, a microphone built into the smartphone can continuously capture the ambient sound; within a limited number of seconds (e.g., 3 seconds, 5 seconds, 15 seconds, etc.), the query is completed and hit at the server (cloud), the query result is returned, and the audio retrieval result shown on the right side of the figure is presented to the user. The server can use the above audio fingerprint extraction, library construction, and retrieval methods to provide an online audio retrieval service. The client can capture the ambient sound, perform preprocessing (or even fingerprint extraction) locally, and send a query request to the server, so that the server can perform the query against the fingerprint library it maintains and return the hit result.
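By way of a non-limiting sketch, the client-side flow can be expressed as follows; the preprocessing stubs and the fingerprinting and server stand-ins are illustrative assumptions, not components named by this disclosure (a fuller fingerprinting sketch is given after claim 1 below):

```python
import numpy as np

def denoise(pcm: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    # toy noise gate: mute samples whose amplitude falls below the threshold
    return np.where(np.abs(pcm) < threshold, 0.0, pcm)

def compress_dynamics(pcm: np.ndarray, drive: float = 3.0) -> np.ndarray:
    # toy dynamic range compression via tanh soft clipping
    return np.tanh(drive * pcm)

def identify_ambient_audio(pcm, extract_fingerprint, query_server):
    # preprocess locally, extract the query fingerprint, then query the server
    pcm = compress_dynamics(denoise(pcm))
    return query_server(extract_fingerprint(pcm))

# usage with trivial stand-ins for the fingerprinting and server round trip
pcm = np.random.randn(5 * 8000) * 0.1  # roughly 5 s of fake ambient audio at 8 kHz
result = identify_ambient_audio(
    pcm,
    extract_fingerprint=lambda x: [hash(x.tobytes())],
    query_server=lambda fp: {"hit": None, "queried_hashes": len(fp)},
)
print(result)
```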
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the steps defined in the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (29)

1. An audio fingerprint extraction method, comprising:
acquiring a frequency spectrum of the audio;
generating a peak point pair based on the frequency-time relationship between a reference peak point and other peak points in the frequency spectrum; and
generating an audio fingerprint of the audio based on the peak point pairs.
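By way of non-limiting illustration, the steps of claim 1 can be sketched in Python/NumPy as follows; the spectrogram parameters, peak-neighborhood size, fan-out, and target-zone bound are assumptions chosen for the sketch, not values specified by this disclosure:

```python
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def extract_fingerprint(pcm, sr=8000, fan_out=5, max_dt=64):
    # step 1: acquire the frequency spectrum of the audio
    f, t, sxx = signal.spectrogram(pcm, fs=sr, nperseg=512, noverlap=256)
    sxx = np.log1p(sxx)
    # step 2: locate spectral peaks (points equal to their neighborhood maximum)
    mask = (sxx == maximum_filter(sxx, size=(15, 15))) & (sxx > sxx.mean())
    fi, ti = np.nonzero(mask)
    order = np.argsort(ti)  # sort peaks by time
    fi, ti = fi[order], ti[order]
    # step 3: pair each reference peak with later peaks in a bounded target
    # zone and hash their frequency-time relationship (f1, f2, dt)
    fingerprint = []
    for i in range(len(ti)):
        for j in range(i + 1, min(i + 1 + fan_out, len(ti))):
            dt = int(ti[j] - ti[i])
            if 0 < dt <= max_dt:
                h = hash((int(fi[i]), int(fi[j]), dt))
                fingerprint.append((h, float(t[ti[i]])))  # (hash, ref-peak time)
    return fingerprint

# usage: fingerprint one second of a 440 Hz test tone
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
print(len(extract_fingerprint(tone)))
```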
2. The method of claim 1, further comprising:
extracting melody information of the audio,
wherein obtaining the spectrum of the audio comprises:
transforming the melody information to obtain a melody spectrum, and the method further comprises generating an audio melody fingerprint based on melody peak point pairs derived from the melody spectrum.
3. The method of claim 2, wherein the melody information includes vocal information,
wherein obtaining the spectrum of the audio comprises:
transforming the vocal information to obtain a vocal spectrum, and the method further comprises generating an audio vocal fingerprint based on vocal peak point pairs derived from the vocal spectrum.
4. The method of claim 2, further comprising:
extracting accompaniment information of the audio,
wherein obtaining the spectrum of the audio comprises:
Fourier transforming the accompaniment information to obtain an accompaniment spectrum, and
the method further comprises:
generating an audio accompaniment fingerprint based on accompaniment peak point pairs derived from the accompaniment spectrum.
5. The method of claim 2, further comprising:
generating a melody-accompaniment joint peak point pair based on the frequency-time relationship between a reference melody peak point and other melody peak points in the melody spectrum and accompaniment peak points in the accompaniment spectrum; and
generating a melody-accompaniment joint fingerprint of the audio based on the melody-accompaniment joint peak point pairs.
6. An audio fingerprint database establishment method, comprising:
acquiring audio in a music library;
extracting an audio fingerprint of the acquired audio using the method according to any one of claims 1-5; and
sorting the extracted audio fingerprints.
7. The method of claim 6, wherein sorting the audio fingerprints comprises:
sorting the audio fingerprints according to the heat of the audio, wherein the heat of the audio comprises at least one of:
the playing heat of the audio;
the search heat of the audio; and
the recognition heat of the audio.
8. The method of claim 7, wherein sorting the audio fingerprints comprises:
sorting the audio fingerprints by audio attributes, wherein the audio attributes include at least one of:
the language of the audio;
the genre of the audio; and
the scene tag of the audio.
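By way of a non-limiting sketch of claims 7 and 8, fingerprints can be sorted into shards by a heat score combined with an attribute such as language; the weights and the hot/regular threshold are illustrative assumptions, not values from this disclosure:

```python
def heat_score(play: float, search: float, recognize: float) -> float:
    # assumed weighting of the playing, search, and recognition heat
    return 0.5 * play + 0.3 * search + 0.2 * recognize

def assign_shard(song: dict, hot_threshold: float = 0.8) -> str:
    score = heat_score(song["play"], song["search"], song["recognize"])
    tier = "hot" if score >= hot_threshold else "regular"
    return f"{tier}/{song['language']}"  # shard key, e.g. "hot/zh"

# usage: a highly popular Chinese-language song lands in the hot tier
print(assign_shard({"play": 0.9, "search": 0.8, "recognize": 0.7,
                    "language": "zh"}))  # -> hot/zh
```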
9. An audio retrieval method, comprising:
acquiring query audio;
extracting a query audio fingerprint of the query audio;
feeding the query audio fingerprint into an audio fingerprint library created according to any one of claims 6-8;
matching audio fingerprints based on the audio fingerprint library; and
returning an audio retrieval result based on the matching of the audio fingerprints.
10. The method of claim 9, wherein the audio fingerprint library is sharded according to audio heat,
wherein matching audio fingerprints based on the audio fingerprint library comprises:
first retrieving the audio fingerprint library shards with high audio heat.
11. The method of claim 10, further comprising:
acquiring music preferences of a user,
wherein matching audio fingerprints based on the audio fingerprint library comprises:
preferentially retrieving the audio fingerprint library shards that match the music preferences.
12. The method of claim 9, wherein extracting the query audio fingerprint of the query audio comprises:
extracting a query melody fingerprint or a query audio fingerprint of the query audio,
wherein matching audio fingerprints based on the audio fingerprint library comprises:
matching the query melody fingerprint or the query audio fingerprint based on a melody fingerprint library.
13. The method of claim 12, wherein extracting the query audio fingerprint of the query audio comprises:
extracting a query accompaniment fingerprint of the query audio,
wherein matching audio fingerprints based on the audio fingerprint library comprises:
matching the query accompaniment fingerprint against an accompaniment fingerprint library based on the retrieval result returned by the melody fingerprint library.
14. The method of claim 12, wherein extracting the query audio fingerprint of the query audio comprises:
extracting a query accompaniment fingerprint of the query audio,
wherein matching audio fingerprints based on the audio fingerprint library comprises:
matching the query accompaniment fingerprint against a conventional audio fingerprint library or the melody fingerprint library.
15. The method of claim 9, wherein extracting the query audio fingerprint of the query audio comprises:
extracting a melody-accompaniment joint fingerprint of the query audio,
wherein matching audio fingerprints based on the audio fingerprint library comprises:
matching the melody-accompaniment joint fingerprint based on a melody-accompaniment joint fingerprint library.
16. The method of claim 9, wherein the peak point pair is characterized by a hash time pair comprising a hash value and a reference peak point time, the hash value representing the frequency-time relationship of the reference peak point to other peak points,
wherein matching audio fingerprints based on the audio fingerprint library comprises:
using the hash value as the key of the peak point pair for matching,
and wherein returning an audio retrieval result to the user based on the matching of the audio fingerprints comprises:
returning an audio retrieval result to the user based on a matching result between the peak point pairs of the query fingerprint and the peak point pairs of an audio fingerprint corresponding to a certain audio in the audio fingerprint library.
17. The method of claim 16, wherein returning an audio retrieval result to the user based on the matching result between the peak point pairs of the query fingerprint and the peak point pairs of an audio fingerprint corresponding to a certain audio in the audio fingerprint library comprises:
for all hash time pairs whose hash values match between a candidate audio and the query audio, calculating the time difference or quotient between the reference peak point times; and
determining a matching result between the query audio and the candidate audio based on the calculated time differences or quotients.
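By way of a non-limiting sketch of claims 16 and 17, matched hash values can vote for the time difference between reference peaks, with a strong cluster of votes at one offset indicating a match; the names and the vote threshold are illustrative assumptions:

```python
from collections import Counter

def match_candidate(query_fp, song_fp, min_votes=5):
    # each fingerprint is a list of (hash value, reference-peak time) pairs
    song_times = {}
    for h, t in song_fp:
        song_times.setdefault(h, []).append(t)
    offsets = Counter()
    for h, t_query in query_fp:
        for t_song in song_times.get(h, []):
            # time difference between matched reference peaks; a quotient
            # t_song / t_query could be used instead to tolerate time scaling
            offsets[round(t_song - t_query, 1)] += 1
    if not offsets:
        return None
    offset, votes = offsets.most_common(1)[0]
    return (offset, votes) if votes >= min_votes else None

# usage: every query hash recurs 30.0 s into the candidate song
song = [(1, 30.0), (2, 30.5), (3, 31.0), (4, 31.5), (5, 32.0)]
query = [(1, 0.0), (2, 0.5), (3, 1.0), (4, 1.5), (5, 2.0)]
print(match_candidate(query, song))  # -> (30.0, 5)
```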
18. The method of claim 9, wherein extracting the query audio fingerprint of the query audio comprises:
performing noise reduction and dynamic compression processing on the query audio.
19. The method of claim 9, wherein the query audio is a mixed query audio comprising foreground and background sounds, and further comprising:
source separating the mixed query audio to obtain source-separated accompaniment data,
wherein extracting the query audio fingerprint of the query audio comprises:
extracting a fingerprint of the source-separated accompaniment data as a query audio fingerprint.
20. The method of claim 19, wherein the mixed query audio is the audio of a short video with a soundtrack.
21. An audio recognition method, comprising:
extracting an audio fingerprint of a target audio using the method according to any one of claims 1-5; and
determining an audio identity based on the audio fingerprint.
22. The method of claim 21, further comprising:
saving the extracted audio fingerprint of the target audio as a reference audio fingerprint,
wherein determining an audio identity based on the audio fingerprint comprises:
determining the relationship between a pending audio and the target audio based on a comparison of the audio fingerprint extracted from the pending audio with the reference audio fingerprint.
23. The method of claim 22, wherein saving the extracted audio fingerprint of the target audio as a reference audio fingerprint comprises:
uploading the extracted audio fingerprint of the target audio to a reference audio fingerprint library.
24. The method of claim 23, wherein the reference audio fingerprint library is a copyright fingerprint library and/or an original fingerprint library,
wherein determining the relationship between the pending audio and the target audio based on the comparison of the audio fingerprint extracted from the pending audio with the reference audio fingerprint comprises:
determining the relationship between the pending audio and the target audio based on a comparison of the audio fingerprint extracted from the pending audio with the reference audio fingerprints in the copyright fingerprint library and/or the original fingerprint library.
25. An audio retrieval system, comprising a client, a server, and an audio fingerprint library, wherein
the client is configured to:
acquire query audio input by a user,
and the server is configured to:
feed a query audio fingerprint extracted from the query audio into the audio fingerprint library for matching; and
return an audio retrieval result to the user based on the matching of the audio fingerprint,
wherein the query audio fingerprint and the audio fingerprints in the audio fingerprint library are generated based on the following operations:
generating peak point pairs based on the frequency-time relationship between reference peak points and other peak points in the frequency spectrum of the audio; and
generating an audio fingerprint of the corresponding audio based on the peak point pairs.
26. The system of claim 25, wherein the audio fingerprint library includes a plurality of shards differentiated by audio heat,
and the server first sends the query audio fingerprint to the shards storing high-heat audio, wherein the heat of the audio comprises at least one of:
the playing heat of the audio;
the search heat of the audio; and
the recognition heat of the audio.
27. The system of claim 25, wherein query audio retrieval operations for the high-heat audio shards are performed using a plurality of servers.
28. The system of claim 25, wherein the audio fingerprint library comprises at least one of:
an audio fingerprint library including melody and accompaniment information;
a melody fingerprint library;
an accompaniment fingerprint library; and
a melody-accompaniment joint fingerprint library,
wherein the server feeds the corresponding query fingerprint into the fingerprint library for matching.
29. The system of claim 25, wherein the audio fingerprint library is a copyright and/or original audio fingerprint library, and
returning an audio retrieval result to the user based on the matching of the audio fingerprints comprises:
determining the relationship between the query audio and the target audio based on a comparison of the audio fingerprint extracted from the query audio with the reference audio fingerprints in the copyright and/or original audio fingerprint library.
CN201911390214.3A 2019-12-30 2019-12-30 Audio fingerprint extraction and database building method, and audio identification and retrieval method and system Pending CN113129855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911390214.3A CN113129855A (en) 2019-12-30 2019-12-30 Audio fingerprint extraction and database building method, and audio identification and retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911390214.3A CN113129855A (en) 2019-12-30 2019-12-30 Audio fingerprint extraction and database building method, and audio identification and retrieval method and system

Publications (1)

Publication Number Publication Date
CN113129855A (en) 2021-07-16

Family

ID=76767454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911390214.3A Pending CN113129855A (en) 2019-12-30 2019-12-30 Audio fingerprint extraction and database building method, and audio identification and retrieval method and system

Country Status (1)

Country Link
CN (1) CN113129855A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090277322A1 (en) * 2008-05-07 2009-11-12 Microsoft Corporation Scalable Music Recommendation by Search
CN104866604A (en) * 2015-06-01 2015-08-26 腾讯科技(北京)有限公司 Information processing method and server
CN105138541A (en) * 2015-07-08 2015-12-09 腾讯科技(深圳)有限公司 Audio fingerprint matching query method and device
CN105184610A (en) * 2015-09-02 2015-12-23 王磊 Real-time mobile advertisement synchronous putting method and device based on audio fingerprints
CN107293307A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Audio-frequency detection and device
CN107577773A (en) * 2017-09-08 2018-01-12 科大讯飞股份有限公司 A kind of audio matching method and device, electronic equipment
CN110602303A (en) * 2019-08-30 2019-12-20 厦门快商通科技股份有限公司 Method and system for preventing telecommunication fraud based on audio fingerprint technology

Similar Documents

Publication Publication Date Title
US8438168B2 (en) Scalable music recommendation by search
JP4398242B2 (en) Multi-stage identification method for recording
JP4945877B2 (en) System and method for recognizing sound / musical signal under high noise / distortion environment
US11636835B2 (en) Spoken words analyzer
KR100615522B1 (en) music contents classification method, and system and method for providing music contents using the classification method
US20100023328A1 (en) Audio Recognition System
US20100217755A1 (en) Classifying a set of content items
US9773058B2 (en) Methods and systems for arranging and searching a database of media content recordings
US9390170B2 (en) Methods and systems for arranging and searching a database of media content recordings
Celma et al. If you like radiohead, you might like this article
Guo et al. A query by humming system based on locality sensitive hashing indexes
Goto et al. Recent studies on music information processing
US11410706B2 (en) Content pushing method for display device, pushing device and display device
Schindler Multi-modal music information retrieval: augmenting audio-analysis with visual computing for improved music video analysis
Myna et al. Hybrid recommender system for music information retrieval
CN113129855A (en) Audio fingerprint extraction and database building method, and audio identification and retrieval method and system
Müller et al. Content-based audio retrieval
JP2009147775A (en) Program reproduction method, apparatus, program, and medium
KR100838208B1 (en) Multimedia Contents Providing Server and Method for Providing Metadata, and Webhard Server and Method for Managing Files using the Metadata
CN110489588B (en) Audio detection method, device, server and storage medium
Pachet Musical metadata and knowledge management
Baumann et al. A comparison of music similarity measures for a p2p application
JP6114980B2 (en) Music processing apparatus and music processing method
KR102255156B1 (en) Device and method to manage plurality of music files
Brinkman et al. Online music recognition: the Echoprint system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination