CN110047515B

CN110047515B - Audio identification method, device, equipment and storage medium

Info

Publication number: CN110047515B
Application number: CN201910270746.7A
Authority: CN
Inventors: 鲁霄
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2019-04-04
Filing date: 2019-04-04
Publication date: 2021-04-20
Anticipated expiration: 2039-04-04
Also published as: WO2020199384A1; CN110047515A

Abstract

The embodiments of the invention disclose an audio identification method, apparatus, device and storage medium. An audio fingerprint of the audio to be identified is extracted as a reference fingerprint, and the similarity between the reference fingerprint and the audio fingerprints in a preset fingerprint database is calculated; a candidate fingerprint set is screened from the fingerprint database according to that similarity; a base fingerprint (a candidate fingerprint most similar to the reference fingerprint) is selected from the candidate fingerprint set, and homophonic fingerprints of the base fingerprint are acquired; and a target audio corresponding to the audio to be identified is selected from the audio corresponding to the base fingerprint and its homophonic fingerprints. The scheme makes audio identification finer-grained, so that a more accurate target audio is identified.

Description

Audio identification method, device, equipment and storage medium

Technical Field

The present invention relates to the field of communications technologies, and in particular, to an audio recognition method, apparatus, device, and storage medium.

Background

The song recognition function gives music enthusiasts a very convenient way to find music they like: a user only needs to record music playing in the environment, or hum a segment of a song, and input it into application software to identify the song. At present, song recognition mainly searches a massive song library according to the feature information of the input song and selects the songs most similar to it.

In the course of research on and practice of the prior art, the inventor of the present invention found that an audio clip uploaded by a user may correspond to multiple versions of an audio, and that the audio identification process of current music platforms is coarse and does not take the differences between versions into account. As a result, the song that a music platform selects for a clip provided by the user may be neither the true source of the clip nor what the user actually wants. The accuracy of current audio recognition is therefore poor.

Disclosure of Invention

The embodiment of the invention provides an audio recognition method, an audio recognition device, audio recognition equipment and a storage medium, and aims to improve the accuracy of audio recognition.

The embodiment of the invention provides an audio identification method, which comprises the following steps:

extracting an audio fingerprint of an audio to be identified as a reference fingerprint, and calculating the similarity between the reference fingerprint and the audio fingerprint in a preset fingerprint database;

screening a candidate fingerprint set from a fingerprint database according to the similarity between the reference fingerprint and the audio fingerprint in the fingerprint database;

selecting a base fingerprint from the candidate fingerprint set, and acquiring homophonic fingerprints of the base fingerprint;

and selecting the target audio corresponding to the audio to be identified from the audio corresponding to the base fingerprint and its homophonic fingerprints.

In some embodiments, acquiring the homophonic fingerprints of the base fingerprint comprises:

calculating the degree of overlap between the base fingerprint and the other candidate fingerprints in the candidate fingerprint set;

selecting homophonic fingerprints of the base fingerprint from the other candidate fingerprints according to the degree of overlap.

In some embodiments, calculating the degree of overlap between the base fingerprint and the other candidate fingerprints in the candidate fingerprint set comprises:

acquiring the longest common subsequence of the base fingerprint and each other candidate fingerprint in the candidate fingerprint set, and counting the length of the longest common subsequence;

and calculating the degree of overlap between the base fingerprint and the other candidate fingerprint according to the length of the longest common subsequence.

In some embodiments, selecting homophonic fingerprints of the base fingerprint from the other candidate fingerprints according to the degree of overlap comprises:

screening out, from the other candidate fingerprints, the candidate fingerprints whose degree of overlap with the base fingerprint is greater than or equal to a preset threshold as homophonic fingerprints of the base fingerprint.

In some embodiments, the method further comprises:

if no candidate fingerprint whose degree of overlap with the base fingerprint is greater than or equal to the preset threshold is found, determining the audio corresponding to the base fingerprint as the target audio corresponding to the audio to be identified.

In some embodiments, selecting a base fingerprint from the candidate fingerprint set comprises:

determining the candidate fingerprint in the candidate fingerprint set that has the greatest similarity to the reference fingerprint as the base fingerprint.

In some embodiments, the calculating the similarity between the reference fingerprint and the audio fingerprint in the preset fingerprint library includes:

respectively counting the number of the same hash values contained in the reference fingerprint and each audio fingerprint in a preset fingerprint database;

and respectively calculating the similarity between the reference fingerprint and each audio fingerprint in a fingerprint database according to the number of the same hash values.

In some embodiments, selecting the target audio corresponding to the audio to be identified from the audio corresponding to the base fingerprint and its homophonic fingerprints comprises:

acquiring the audio corresponding to the base fingerprint and to the homophonic fingerprints of the base fingerprint as homophonic audio, and acquiring the version information of the homophonic audio;

determining the version priority of the homophonic audio according to the version information;

and taking the homophonic audio with the highest version priority as the target audio corresponding to the audio to be identified.

In addition, an embodiment of the present invention further provides an audio recognition apparatus, including:

the fingerprint unit is used for extracting an audio fingerprint of the audio to be identified as a reference fingerprint and calculating the similarity between the reference fingerprint and the audio fingerprint in a preset fingerprint database;

the candidate unit is used for screening out a candidate fingerprint set from the fingerprint database according to the similarity between the reference fingerprint and the audio fingerprint in the fingerprint database;

the homophonic unit is used for selecting a base fingerprint from the candidate fingerprint set and acquiring the homophonic fingerprints of the base fingerprint;

and the audio unit is used for selecting the target audio corresponding to the audio to be identified from the audio corresponding to the base fingerprint and its homophonic fingerprints.

In addition, an embodiment of the present invention further provides an audio recognition device, which includes a memory, a processor, and an audio identification program stored in the memory and executable on the processor, wherein the audio identification program, when executed by the processor, implements the steps of any audio identification method provided by the embodiments of the invention.

In some embodiments, the audio recognition device further comprises an audio acquisition device for acquiring audio to be recognized.

In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform steps in any audio recognition method provided in the embodiment of the present invention.

In the embodiments of the invention, an audio fingerprint of the audio to be identified is extracted as a reference fingerprint, and the similarity between the reference fingerprint and the audio fingerprints in a preset fingerprint database is calculated; a candidate fingerprint set is screened from the fingerprint database according to that similarity; a base fingerprint is selected from the candidate fingerprint set, and homophonic fingerprints of the base fingerprint are acquired; and the target audio corresponding to the audio to be identified is selected from the audio corresponding to the base fingerprint and its homophonic fingerprints. After candidate fingerprints that approximate the reference fingerprint have been retrieved, uncertainty may remain, for example because of the version of the audio to be identified, even though the candidate fingerprints match the reference fingerprint. The scheme therefore selects a base fingerprint from the candidate fingerprint set and then, by calculating the degree of overlap, selects homophonic fingerprints from the other candidate fingerprints in the set, which screens the candidate fingerprints further. The base fingerprint and homophonic fingerprints obtained through this repeated screening comprise the audio fingerprints that are most similar to the reference fingerprint of the audio to be identified and whose corresponding audio is the same, or can be regarded as the same. The target audio selected from the audio corresponding to the base fingerprint and the homophonic fingerprints is therefore the audio of the optimal version and can serve as the true origin or source of the audio to be identified, which ensures that both the content and the version of the target audio are accurate and improves the overall efficiency of audio identification and the user experience. By screening the audio fingerprints in the fingerprint database layer by layer, the scheme refines the granularity of audio identification, so that a more accurate target audio is retrieved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1a is a schematic view of a scene of an information interaction system according to an embodiment of the present invention;

FIG. 1b is a schematic flow chart of an audio recognition method according to an embodiment of the present invention;

FIG. 2a is a schematic diagram of an audio recognition scenario provided by an embodiment of the present invention;

FIG. 2b is a schematic diagram of a candidate fingerprint set according to an embodiment of the present invention;

FIG. 2c is a schematic diagram of a recognition result display interface according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an audio recognition apparatus according to an embodiment of the present invention;

FIG. 4a is a schematic structural diagram of an audio recognition device according to an embodiment of the present invention;

FIG. 4b is a schematic structural diagram of another audio recognition device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides an audio identification method, an audio identification device, audio identification equipment and a storage medium.

As shown in fig. 1a, an information interaction system according to an embodiment of the present invention includes an audio recognition apparatus according to any one of the embodiments of the present invention, where the audio recognition apparatus may be integrated in a server or other devices; in addition, the system may also include other devices, such as clients and the like. The client may be a terminal or a Personal Computer (PC) or the like, and is configured to capture the audio to be recognized and/or upload the audio to be recognized to the server.

The client sends a recording or a local audio file to the server as the audio to be identified and requests audio identification. The server receives the audio to be identified sent by the client, extracts its audio fingerprint as a reference fingerprint, and calculates the similarity between the reference fingerprint and the audio fingerprints in a preset fingerprint database; screens a candidate fingerprint set from the fingerprint database according to that similarity; then selects a base fingerprint (a candidate fingerprint most similar to the reference fingerprint) from the candidate fingerprint set and acquires homophonic fingerprints of the base fingerprint; and selects the target audio corresponding to the audio to be identified from the audio corresponding to the base fingerprint and its homophonic fingerprints.

After candidate fingerprints that approximate the reference fingerprint have been retrieved, uncertainty may remain, for example because of the version of the audio to be identified, even though the candidate fingerprints match the reference fingerprint. The scheme therefore selects a base fingerprint from the candidate fingerprint set and then, by calculating the degree of overlap, selects homophonic fingerprints from the other candidate fingerprints in the set, which screens the candidate fingerprints further. The base fingerprint and homophonic fingerprints obtained through this repeated screening comprise the audio fingerprints that are most similar to the reference fingerprint of the audio to be identified and whose corresponding audio is the same, or can be regarded as the same. The target audio selected from the audio corresponding to the base fingerprint and the homophonic fingerprints is therefore the audio of the optimal version and can serve as the true origin or source of the audio to be identified, which ensures that both the content and the version of the target audio are accurate and improves the overall efficiency of audio identification and the user experience. By screening the audio fingerprints in the fingerprint database layer by layer, the scheme refines the granularity of audio identification, so that a more accurate target audio is retrieved.

Details are given below.

The embodiment will be described from the perspective of an audio recognition apparatus, which may be specifically integrated in a network device, where the network device may be a terminal or a server, and the terminal may be a mobile phone, a tablet Computer, a notebook Computer, a Personal Computer (PC), or the like.

As shown in fig. 1b, the specific process of the audio recognition method may be as follows:

101. Acquire an audio fingerprint of the audio to be identified as a reference fingerprint, and calculate the similarity between the reference fingerprint and the audio fingerprints in a preset fingerprint database.

The preset fingerprint library stores the audio fingerprints of the audios in the audio library and the mapping relation between the audio fingerprints and the audios in the audio library. For example, the audio recognition device may extract audio fingerprints of each audio in an audio library in advance, store the extracted audio fingerprints in a fingerprint library, and record a mapping relationship between each audio and the audio fingerprints.

For example, the audio recognition device acquires the audio to be recognized, extracts its audio fingerprint, and uses that fingerprint as the reference fingerprint for querying the closest or most similar audio fingerprints.

In some embodiments, the audio recognition device may receive an audio identification request and obtain the audio to be identified; it then performs audio fingerprint extraction on the audio to be identified to obtain a hash sequence, and takes the hash sequence as the reference fingerprint.

For example, a user may use a client to input an audio recognition request, and after receiving the audio recognition request, the audio recognition device notifies the client to start audio collection, so as to record humming sound of the user or sound in the environment, and obtain an audio to be recognized, where the audio to be recognized is the audio to be recognized corresponding to the audio recognition request. Of course, the user may also upload the audio stored locally at the client or downloaded from the network to the audio recognition device, so that the audio recognition device obtains the audio recognition request and the corresponding audio to be recognized.

The client can be a recording device with an audio acquisition function or a terminal device such as a mobile phone, a tablet, a personal computer and the like.

Then, the audio recognition device performs audio fingerprint extraction on the audio signal of the audio to be recognized to obtain its audio fingerprint, which contains the audio feature information of the audio to be recognized. Audio fingerprint extraction may specifically include framing, windowing, FFT (Fast Fourier Transform) frequency domain transformation, extraction of local peaks, and conversion into a hash sequence.

Specifically, after the audio recognition device obtains the audio to be recognized, it performs framing and windowing on the audio signal. Framing cuts the whole audio signal into multiple segments according to a preset rule, each segment being one frame, so that the signal within each frame is approximately stationary and can serve as stable input for the later signal processing. The audio recognition device then applies a preset window function, such as a Hamming window, to each frame, so that the framed audio signals are more continuous and exhibit the characteristics of a periodic function.

Then, the audio recognition device performs FFT frequency domain transformation on each frame of audio signal to obtain a spectrum containing frequency domain information. And then, the audio recognition device extracts the local peak value in the frequency spectrum and converts the local peak value into a hash sequence, wherein the hash sequence is the audio fingerprint of the audio to be recognized. It should be noted that a plurality of hash values may be included in the hash sequence.
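The extraction pipeline described above (framing, windowing, frequency transform, peak extraction, hashing) can be sketched roughly as follows. This is only an illustration under stated assumptions: the frame length, hop size, the use of the single strongest bin per frame as the "local peak", and the hash that packs two consecutive peak bins into one integer are all hypothetical choices, not details given in this document. A naive DFT is used in place of an FFT to keep the sketch dependency-free; it yields the same magnitudes.

```python
import cmath
import math


def frames(signal, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..n/2 (stand-in for an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * b * k / n)
                    for k, x in enumerate(frame)))
            for b in range(n // 2 + 1)]


def fingerprint(signal, frame_len=64, hop=32):
    """Return a hash sequence with one value per pair of consecutive frames."""
    window = hamming(frame_len)
    peaks = []
    for frame in frames(signal, frame_len, hop):
        windowed = [x * w for x, w in zip(frame, window)]
        spectrum = magnitude_spectrum(windowed)
        # Assumption: the strongest bin stands in for local-peak extraction.
        peaks.append(max(range(len(spectrum)), key=spectrum.__getitem__))
    # Hypothetical hash: pack each peak bin with the next frame's peak bin.
    return [(p << 8) | q for p, q in zip(peaks, peaks[1:])]
```

For a pure tone that falls exactly on one frequency bin, every frame yields the same peak, so the resulting hash sequence is constant. Real fingerprints would use several peaks per frame and a richer hash.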

The audio recognition device takes the audio fingerprint of the audio to be recognized as a reference fingerprint to calculate the similarity between the reference fingerprint and the audio fingerprint in a preset fingerprint library, so that the retrieval or matching of the audio fingerprint is realized.

In some embodiments, the reference fingerprint and the audio fingerprint in the fingerprint database are both characterized by using a hash sequence, and the step of "calculating the similarity between the reference fingerprint and the audio fingerprint in the preset fingerprint database" may include: respectively counting the number of the same hash values contained in the reference fingerprint and each audio fingerprint in a preset fingerprint database; and respectively calculating the similarity between the reference fingerprint and each audio fingerprint in a fingerprint database according to the number of the same hash values.

Taking any audio fingerprint in the fingerprint library as an example, the audio identification device compares the hash values in the reference fingerprint hash sequence with the hash values in the audio fingerprint hash sequence one by one, and counts the number of the same hash values, and the audio identification device takes the obtained number of the same hash values as the similarity between the reference fingerprint and the audio fingerprint. Therefore, the audio recognition device respectively calculates and obtains the similarity between the reference fingerprint and each audio fingerprint in the fingerprint database.
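The similarity count described above can be sketched as follows. This is a minimal sketch, assuming fingerprints are lists of integer hash values and that repeated hash values are counted with multiplicity; the document does not say whether duplicates count once or several times. The names `similarity`, `similarities`, and `library` are hypothetical.

```python
from collections import Counter


def similarity(reference, candidate):
    """Number of hash values shared by two hash sequences (with multiplicity)."""
    shared = Counter(reference) & Counter(candidate)
    return sum(shared.values())


def similarities(reference, library):
    """Map each fingerprint id in the library to its similarity with the reference."""
    return {fp_id: similarity(reference, seq) for fp_id, seq in library.items()}
```

Note that this count ignores the order of the hash values; order is only taken into account later, when the degree of overlap is computed.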

102. Screen a candidate fingerprint set from the fingerprint database according to the similarity between the reference fingerprint and the audio fingerprints in the fingerprint database.

For example, the audio recognition device may screen out, according to a preset similarity threshold, an audio fingerprint in the fingerprint database whose similarity value with the reference fingerprint is greater than the similarity threshold, as a candidate fingerprint matching the reference fingerprint.

It should be noted that a candidate fingerprint matching the reference fingerprint can be understood as one whose corresponding audio is the same as, or can be regarded as the same as, the audio to be identified, such as the same song or the same song in a different arrangement.

And then the audio recognition device configures the candidate fingerprints obtained by screening into the same set to obtain a candidate fingerprint set. Thus, the set of candidate fingerprints includes one or more candidate fingerprints that match the reference fingerprint.
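The screening step can be sketched as a simple filter over the similarity values. This is a sketch only: the input is assumed to be a mapping from fingerprint id to similarity, and sorting the result from most to least similar is an added convenience, not something the document requires.

```python
def screen_candidates(scores, threshold):
    """Keep fingerprints whose similarity exceeds the threshold, most similar first."""
    kept = [(fp_id, s) for fp_id, s in scores.items() if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

The comparison is strict (`>`), following the "greater than the similarity threshold" wording above.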

103. Select a base fingerprint from the candidate fingerprint set, and acquire homophonic fingerprints of the base fingerprint.

The base fingerprint is a candidate fingerprint most similar to the reference fingerprint. For example, the audio recognition apparatus may determine the candidate fingerprint in the candidate fingerprint set that has the highest similarity to the reference fingerprint as the base fingerprint.
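Picking the candidate with the highest similarity can be sketched as follows. The selected candidate is called `base` here to keep it distinct from the reference fingerprint of the query; the `(id, similarity)` tuple layout is an assumption made for illustration.

```python
def pick_base(candidates):
    """Return (base, others): the most similar candidate and the remaining ones."""
    base = max(candidates, key=lambda item: item[1])
    others = [item for item in candidates if item is not base]
    return base, others
```

On ties, `max` returns the first candidate with the highest similarity, which is one possible tie-breaking rule among many.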

Then, the audio recognition apparatus selects the homophonic fingerprints of the base fingerprint. A homophonic fingerprint can be understood as a fingerprint whose corresponding audio is the same as, or can be regarded as the same as, the audio corresponding to the base fingerprint. For example, the music library of a music platform contains multiple audios that have different numbers but are actually the same song, such as different versions of the same song, versions sung by different singers, or the same song included in different albums or radio stations. Multiple audios belonging to the same song are defined as homophonic audios, and their audio fingerprints are homophonic fingerprints.

In some embodiments, the step of "acquiring homophonic fingerprints of the base fingerprint" may comprise: calculating the degree of overlap between the base fingerprint and the other candidate fingerprints in the candidate fingerprint set; and selecting homophonic fingerprints of the base fingerprint from the other candidate fingerprints according to the degree of overlap.

The degree of overlap between the base fingerprint and the other candidate fingerprints can be calculated by means such as correlation or the longest common subsequence. For correlation, the variance between the hash sequence of the base fingerprint and that of another candidate fingerprint may be calculated, and the variance value used as the degree of overlap of the two. The audio recognition device then takes the other candidate fingerprints whose variance values meet a preset requirement as homophonic fingerprints of the base fingerprint.

Taking the Longest Common Subsequence (LCS) as an example, the step of "calculating the degree of overlap between the base fingerprint and the other candidate fingerprints in the candidate fingerprint set" may comprise: acquiring the longest common subsequence of the base fingerprint and each other candidate fingerprint in the candidate fingerprint set, and counting the length of the longest common subsequence; and calculating the degree of overlap between the base fingerprint and the other candidate fingerprint according to the length of the longest common subsequence.

Here the base fingerprint and the other candidate fingerprints in the candidate fingerprint set are all represented as hash sequences.

A hash sequence is an ordered sequence, and a subsequence of it is any sequence obtained by removing zero or more elements without changing the relative order of the remaining elements. If a sequence is a subsequence of several hash sequences at the same time, it is a common subsequence of those hash sequences. The longest common subsequence of the hash sequences is the longest among their common subsequences, and its length is the number of elements it contains.

For example, the length of the longest common subsequence of the hash sequences of the base fingerprint and another candidate fingerprint can be calculated using Dynamic Programming (DP). In this embodiment, the length is calculated as follows:

nlcs = LCS(res[i].hash_seq, res[0].hash_seq)

where nlcs is the length of the longest common subsequence, LCS is the dynamic programming function that computes the length of the longest common subsequence, res[i].hash_seq is the hash sequence of the i-th candidate fingerprint, and res[0].hash_seq is the hash sequence of the base fingerprint.

For example, let the hash sequence of the base fingerprint be X = {A, B, C, B, D, A, B} and the hash sequence of any other candidate fingerprint be Y = {B, D, C, A, B, A}. Sequences such as {A, B} and {B, C, B, A} are subsequences of both X and Y, and are therefore common subsequences of X and Y. The common subsequences of X and Y are not all listed here. Among them, the sequence {B, C, B, A} contains 4 elements, so its length is 4, and it is a longest common subsequence of X and Y.
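The dynamic programming computation can be sketched as follows, using the X and Y sequences from the example above. The function name `lcs_length` is hypothetical and merely plays the role of the LCS function in the formula; this is a standard textbook formulation, not necessarily the one used in the embodiment.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


X = ["A", "B", "C", "B", "D", "A", "B"]
Y = ["B", "D", "C", "A", "B", "A"]
print(lcs_length(X, Y))  # 4
```

Keeping only two rows of the table reduces memory to the length of the shorter dimension, which matters little here but helps when hash sequences are long.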

Taking any other candidate fingerprint as an example, after obtaining the length of the longest common subsequence of the base fingerprint and that candidate fingerprint, the audio recognition device calculates the degree of overlap between them. For example, the following formula may be used:

sim = nlcs / hash_seq_cnt × 100%

where sim is the degree of overlap between the base fingerprint and the other candidate fingerprint, nlcs is the length of the longest common subsequence, and hash_seq_cnt is the hash sequence length of the reference fingerprint. In some embodiments, the formula may be coded as int sim = nlcs * 1.0 / hash_seq_cnt * 100.
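The percentage formula can be sketched as a small helper. It mirrors the truncating integer conversion of the code form given above; the guard against a non-positive `hash_seq_cnt` is an added assumption, and the function name is hypothetical.

```python
def overlap_percent(nlcs, hash_seq_cnt):
    """Degree of overlap as a truncated integer percentage."""
    if hash_seq_cnt <= 0:
        raise ValueError("hash_seq_cnt must be positive")
    return int(nlcs * 1.0 / hash_seq_cnt * 100)


print(overlap_percent(4, 7))  # 57
```

Because the result is truncated after floating-point arithmetic, it can in some cases come out one lower than the exact quotient; the integer-only expression `nlcs * 100 // hash_seq_cnt` is one way to avoid that if exact thresholds matter.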

In this way, the audio recognition device can calculate the degree of overlap between the base fingerprint and each of the other candidate fingerprints.

The audio recognition device may then select homophonic fingerprints of the base fingerprint from the other candidate fingerprints.

For example, the audio recognition device may take the other candidate fingerprint with the largest degree of overlap as the homophonic fingerprint of the base fingerprint; or it may sort the other candidate fingerprints by degree of overlap in descending order and take a preset number of top-ranked fingerprints as homophonic fingerprints of the base fingerprint.

In some embodiments, the step of "selecting homophonic fingerprints of the base fingerprint from the other candidate fingerprints according to the degree of overlap" may comprise: screening out, from the other candidate fingerprints, the candidate fingerprints whose degree of overlap with the base fingerprint is greater than or equal to a preset threshold as homophonic fingerprints of the base fingerprint.

The preset threshold can be adjusted flexibly according to actual needs, for example to 25%.

The audio recognition device thus screens the homophonic fingerprints of the base fingerprint out of the other candidate fingerprints in the candidate fingerprint set.

By identifying homophonic audio through calculation in this way, the embodiment saves the labor and time cost of marking homophonic audio in the audio library and avoids the situation where information is not entered manually in time. When audio is added to the library, no additional manual marking or classification of homophonic audio is needed, which eliminates the risk of wrong or missing entries and reduces maintenance cost. The accuracy and efficiency of identifying homophonic fingerprints and homophonic audio are thereby improved.

In some embodiments, if no candidate fingerprint whose degree of overlap with the base fingerprint is greater than or equal to the preset threshold is found, the audio corresponding to the base fingerprint is determined as the target audio corresponding to the audio to be identified.

That is, when no homophonic fingerprint of the base fingerprint can be found, the audio recognition device determines that the candidate fingerprint set contains no other candidate fingerprint that is very similar to the base fingerprint. The audio recognition device therefore determines the audio corresponding to the base fingerprint according to the mapping relation between audio fingerprints and audio in the fingerprint library, and determines that audio as the target audio corresponding to the audio to be identified.
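The threshold screening and the fallback just described can be sketched together. This is a minimal sketch, assuming the degrees of overlap have already been computed as percentages and are passed in as a mapping from candidate id to overlap; `base_id` denotes the candidate most similar to the query, and the default of 25 reuses the example threshold mentioned above.

```python
def select_homophonic(base_id, overlaps, threshold=25):
    """Return the ids whose audio is treated as the same as the base audio.

    overlaps maps each other candidate id to its degree of overlap (percent)
    with the base fingerprint.
    """
    homophonic = [fp_id for fp_id, sim in overlaps.items() if sim >= threshold]
    # With no homophonic fingerprint, only the base fingerprint remains,
    # so its audio becomes the target audio directly.
    return [base_id] + homophonic
```

For example, `select_homophonic("c", {"a": 57, "b": 10})` keeps `"a"` alongside the base, while `select_homophonic("c", {"b": 10})` returns only the base.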

104. And selecting the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof.

After obtaining the reference fingerprint and the homophonic fingerprint thereof, the audio identification device determines the reference fingerprint and the audio corresponding to the homophonic fingerprint thereof according to the mapping relation between the audio fingerprint and the audio in the fingerprint database.

Then, the audio recognition device selects a target audio from the audio corresponding to the reference fingerprint and the homophonic fingerprint. For example, the audio recognition device takes all of the audio corresponding to the reference fingerprint and the homophonic fingerprint as the target audio corresponding to the audio to be identified. This avoids missing audio that is substantially the same as the audio to be identified but differs from it in version, and improves the accuracy of audio fingerprint matching.

In some embodiments, the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof may be further filtered according to actual needs, and the step of "selecting the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof" may include: acquiring the reference fingerprint and the audio corresponding to the homophonic fingerprint of the reference fingerprint as homophonic audio, and acquiring the version information of the homophonic audio; determining the version priority of the homophonic audio according to the version information; and taking the homophonic audio with the highest version priority as the target audio corresponding to the audio to be identified.

The version information includes information such as the source of the audio, the singer, and the on-shelf and/or release time, and may be preset information carried by the audio itself. Homophonic audio may be audio whose version information indicates different sources and/or different versions.

For example, according to the source information of the homophonic audio, the audio recognition device sets the version priority of audio whose source is an album to be the highest and the version priority of audio whose source is a radio station to be the lowest. The audio recognition device thus determines the homophonic audio whose source is an album as the target audio.

As another example, according to the chronological order of the on-shelf times of the homophonic audio, the audio recognition device sets the version priority of the earliest audio to be the highest and that of the latest audio to be the lowest. The audio recognition device thus determines the homophonic audio with the earliest on-shelf time as the target audio.

The target audio is therefore the audio that is most similar to the audio to be identified and is of the most accurate version.
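A minimal sketch of version-priority selection is given below. The embodiment presents the source rule and the on-shelf-time rule as separate examples; combining them (source first, then earliest time as a tie-breaker) is an assumption of this sketch, as are the `HomophonicAudio` record, the `SOURCE_RANK` table and the source labels.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Hypothetical ranking: a smaller number means a higher version priority.
SOURCE_RANK = {"album": 0, "radio_station": 2}
DEFAULT_RANK = 1  # assumed rank for any source not listed above


@dataclass
class HomophonicAudio:
    audio_id: str
    source: str
    on_shelf: date


def pick_target(audios: List[HomophonicAudio]) -> HomophonicAudio:
    """Return the audio with the highest version priority.

    Priority is decided by source first and, among equal sources,
    by the earliest on-shelf date.
    """
    return min(audios, key=lambda a: (SOURCE_RANK.get(a.source, DEFAULT_RANK), a.on_shelf))


audios = [
    HomophonicAudio("r1", "radio_station", date(2015, 1, 1)),
    HomophonicAudio("al2", "album", date(2018, 6, 1)),
    HomophonicAudio("al1", "album", date(2016, 3, 1)),
]
print(pick_target(audios).audio_id)  # al1
```

Expressing the priority as a sort key keeps the rule in one place, so a different ordering of the criteria only requires changing the tuple.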

As can be seen from the above, the embodiment of the invention can extract the audio fingerprint of the audio to be identified as the reference fingerprint, and calculate the similarity between the reference fingerprint and the audio fingerprints in the preset fingerprint database; screen a candidate fingerprint set from the fingerprint database according to the similarity between the reference fingerprint and the audio fingerprints in the fingerprint database, wherein the candidate fingerprint set comprises the audio fingerprints similar to the reference fingerprint; then select a reference fingerprint in the candidate fingerprint set and acquire the homophonic fingerprints of that reference fingerprint; and select the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof.

After the candidate fingerprints approximating the reference fingerprint have been retrieved, a candidate fingerprint may match the reference fingerprint and yet remain uncertain because of version differences or similar problems of the audio to be identified. The scheme therefore further selects a reference fingerprint in the candidate fingerprint set and, by calculating the degree of overlap, further selects the homophonic fingerprints from the other candidate fingerprints in the candidate fingerprint set, thereby screening the candidate fingerprints further. The reference fingerprint and homophonic fingerprints obtained through this multi-stage screening comprise the audio fingerprints that are most similar to the reference fingerprint of the audio to be identified and whose corresponding audio is the same or can be regarded as the same.

Therefore, the target audio selected from the audio corresponding to the reference fingerprint and the homophonic fingerprint is the audio of the optimal version and can serve as the true origin or source of the audio to be identified; the accuracy of both the content and the version of the target audio is ensured, and the overall efficiency of audio identification and the user experience are improved. By screening the audio fingerprints in the fingerprint database layer by layer, the scheme refines the granularity of audio identification and retrieves a more accurate target audio.

The method according to the preceding embodiment is illustrated in further detail below by way of example.

For example, referring to fig. 2a, in the present embodiment the audio recognition apparatus is integrated in a server cluster. The server cluster includes a feature extraction server, a leaf server, and a root server, and may include one or more of each. This embodiment takes as an example a system comprising one feature extraction server, a plurality of leaf servers, and one root server.

(I) The client uploads the audio to be identified.

The user can upload the recorded audio or the local audio to the feature extraction server through audio identification software or music software installed in the client.

(II) Extracting the audio fingerprint.

The feature extraction server extracts the audio fingerprint of the audio to be identified as a reference fingerprint. The feature extraction server then sends the reference fingerprint to each leaf server for audio fingerprint matching.

(III) Fingerprint matching.

Each leaf server extracts a part of the audio fingerprints from the fingerprint database and matches them against the reference fingerprint. For example, each leaf server may extract its corresponding audio fingerprints from the fingerprint library according to a preset allocation rule and perform matching on them, thereby achieving distributed and parallel processing of mass data and improving the audio recognition speed.
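The division of work among leaf servers can be sketched as below. The embodiment does not specify the preset allocation rule, so the round-robin rule here is purely an assumption; threads stand in for separate leaf servers, and the scoring function is passed in as a parameter. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Fingerprint = Sequence[int]
ScoreFn = Callable[[Fingerprint, Fingerprint], int]


def shard_library(library: Dict[str, Fingerprint], num_leaves: int) -> List[Dict[str, Fingerprint]]:
    """Assumed allocation rule: round-robin over audio ids in sorted order."""
    shards: List[Dict[str, Fingerprint]] = [{} for _ in range(num_leaves)]
    for i, audio_id in enumerate(sorted(library)):
        shards[i % num_leaves][audio_id] = library[audio_id]
    return shards


def leaf_match(query: Fingerprint, shard: Dict[str, Fingerprint], score: ScoreFn) -> List[Tuple[str, int]]:
    """Work done by one leaf: score the query against every fingerprint in its shard."""
    return [(audio_id, score(query, fp)) for audio_id, fp in shard.items()]


def match_on_leaves(query: Fingerprint, library: Dict[str, Fingerprint],
                    num_leaves: int, score: ScoreFn) -> List[Tuple[str, int]]:
    """Run the leaves in parallel and concatenate their results in shard order."""
    shards = shard_library(library, num_leaves)
    with ThreadPoolExecutor(max_workers=num_leaves) as pool:
        parts = list(pool.map(lambda shard: leaf_match(query, shard, score), shards))
    return [pair for part in parts for pair in part]


library = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [1, 5, 9]}
shared = lambda q, fp: len(set(q) & set(fp))  # toy scoring function for the demo
print(match_on_leaves([1, 2, 5], library, num_leaves=2, score=shared))
# [('a', 2), ('c', 2), ('b', 1)]
```

In a real deployment each shard would live on its own machine; the thread pool only illustrates that the shards are independent and can be matched concurrently.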

Any one of the leaf servers is taken as an example below.

The leaf server calculates the similarity between the reference fingerprint and each audio fingerprint in the fingerprint database. For example, the leaf server may count, for each audio fingerprint in the fingerprint library, the number of identical hash values that the reference fingerprint and that audio fingerprint both contain, and take this number as the similarity between the reference fingerprint and that audio fingerprint.

Then, the leaf server determines each audio fingerprint whose similarity value with the reference fingerprint is larger than the preset similarity threshold as a candidate fingerprint, and sends the candidate fingerprints to the root server.
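The similarity count and the candidate screening performed by a leaf server can be sketched as follows. How repeated hash values are counted is not stated in the embodiment; this sketch assumes a multiset intersection, and the function names and the dictionary-based fingerprint library are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def hash_similarity(query: Sequence[int], stored: Sequence[int]) -> int:
    """Number of hash values shared by both sequences.

    Assumption: a hash value that occurs k times in both sequences counts k times.
    """
    return sum((Counter(query) & Counter(stored)).values())


def screen_candidates(query: Sequence[int],
                      library: Dict[str, Sequence[int]],
                      sim_threshold: int) -> List[Tuple[str, int]]:
    """Keep fingerprints whose similarity is strictly greater than the threshold,
    ordered from most to least similar."""
    scored = [(audio_id, hash_similarity(query, fp)) for audio_id, fp in library.items()]
    kept = [(audio_id, s) for audio_id, s in scored if s > sim_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


library = {"a": [1, 2, 3, 4, 9], "b": [1, 2, 8, 9, 7], "c": [5, 4, 3, 2, 1]}
print(screen_candidates([1, 2, 3, 4, 5], library, sim_threshold=3))
# [('c', 5), ('a', 4)]
```

Note that this count ignores the order of the hash values, which is why fingerprint "c" scores highest even though its sequence is reversed; the order is only taken into account later, when the degree of overlap is calculated.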

(IV) Homophonic identification.

After receiving the candidate fingerprints sent by each leaf server, the root server places all of the candidate fingerprints in a candidate fingerprint set, and then selects a reference fingerprint and its homophonic fingerprints from the candidate fingerprint set.

For example, the root server takes the candidate fingerprint with the highest similarity value with the reference fingerprint as the reference fingerprint.

The root server then calculates the degree of overlap between the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set. In one embodiment, the root server may obtain the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set and count its length; the length of each longest common subsequence is then used as the degree of overlap between the reference fingerprint and the corresponding candidate fingerprint.

Then, the root server selects the homophonic fingerprints of the reference fingerprint from the other candidate fingerprints according to the degree of overlap. In one embodiment, the root server screens out, from the other candidate fingerprints, the candidate fingerprints whose degree of overlap with the reference fingerprint is greater than or equal to a preset threshold, as homophonic fingerprints of the reference fingerprint.

Thus, the root server realizes the identification of the homophonic fingerprint.
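The longest-common-subsequence length used above as the degree of overlap can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook implementation, not code from the embodiment; it treats each fingerprint as a sequence of integer hash values, which is an assumption about the data layout.

```python
from typing import Sequence


def lcs_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common subsequence of two hash sequences.

    Uses a rolling row, so memory is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one element of a or b
        prev = curr
    return prev[-1]


print(lcs_length([1, 2, 3, 4, 5], [2, 4, 5, 9]))  # 3  (common subsequence 2, 4, 5)
print(lcs_length([1, 2, 3], [3, 2, 1]))           # 1  (order matters)
```

Unlike a plain count of shared hash values, the subsequence length rewards hash values that appear in the same order in both fingerprints.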

For example, in fig. 2b, idx is the rank of a candidate fingerprint by its similarity value with the reference fingerprint, where the audio fingerprint with idx 0 has the largest similarity value; id is the audio number corresponding to the candidate fingerprint, by which the corresponding audio can be found; score is the similarity value between the candidate fingerprint and the reference fingerprint, a larger value indicating higher similarity; lcs is the longest common subsequence length between the candidate fingerprint and the reference fingerprint selected from the candidate set, i.e. their degree of overlap.

Taking fig. 2b as an example, if the similarity threshold is 9, the candidate fingerprint set formed by the root server includes 35 candidate fingerprints in total; that is, the similarity value score between each of the 35 candidate fingerprints and the reference fingerprint is greater than 9.

The audio fingerprint with idx 0 has the largest similarity value with the reference fingerprint and is therefore taken as the reference fingerprint of the candidate set, so its lcs with itself is 100. The root server calculates the lcs length between each candidate fingerprint with idx 0 to 34 in the candidate fingerprint set and this reference fingerprint as the degree of overlap. If the preset threshold is 25, the root server takes all candidate fingerprints with a degree of overlap of 25 or above as homophonic fingerprints of the reference fingerprint.

(V) Audio screening.

After obtaining the reference fingerprint and the homophonic fingerprint thereof, the root server selects a target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof.

For example, the root server acquires the reference fingerprint and the audio corresponding to the homophonic fingerprint as homophonic audio, and acquires the version information of the homophonic audio; determining the version priority of the homophonic audio according to the version information; and taking the homophonic audio with the highest version priority as the target audio corresponding to the audio to be identified.

Taking fig. 2b above as an example, if the root server determines that the audio corresponding to the candidate fingerprint with idx 26 is the target audio, it outputs the audio id of that audio.

(VI) Outputting the result.

The root server returns the target audio obtained by screening to the client, so that the client can play it to the user.

For example, in fig. 2c, the client obtains the audio id returned by the root server, retrieves the target audio corresponding to that number from the audio library, and displays the target audio to the user on the identification result display interface. The display interface may also provide the name of the target audio, information about the singer (such as the artist's name), the source (such as the album), and a play button with which the user can play the audio.

Therefore, the user can upload the audio needing to be identified to the server cluster, and the server cluster performs parallel fingerprint matching through the leaf servers, so that the audio retrieval speed is improved. The root server further screens the matching results of the leaf servers, so that the target audio with the content closest to the audio to be identified and the version most matched with the user requirements is selected, and the audio identification efficiency and the user experience are improved.

In order to better implement the above method, an embodiment of the present invention may further provide an audio recognition apparatus, where the audio recognition apparatus may be specifically integrated in a network device, and the network device may be a terminal or a server, and the like.

For example, as shown in fig. 3, the audio recognition apparatus may include a fingerprint unit 301, a candidate unit 302, a homophone unit 303, and an audio unit 304, as follows:

(1) a fingerprint unit 301;

the fingerprint unit 301 is configured to extract an audio fingerprint of an audio to be identified as a reference fingerprint, and calculate similarity between the reference fingerprint and the audio fingerprint in a preset fingerprint database.

For example, the fingerprint unit 301 obtains the audio to be recognized, performs extraction of an audio fingerprint, and uses the audio fingerprint of the audio to be recognized as a reference fingerprint for querying an audio fingerprint that is closest or most similar thereto.

In some embodiments, the fingerprint unit 301 may receive an audio identification request, and obtain an audio to be identified; and performing audio fingerprint extraction on the audio to be identified to obtain a hash sequence, and taking the hash sequence as a reference fingerprint.

For example, a user may use a client to input an audio recognition request. After receiving the audio recognition request, the fingerprint unit 301 notifies the client to start audio collection, so as to record the user's humming or the sound in the environment and obtain the audio to be identified corresponding to the audio recognition request. Of course, the user may also upload audio stored locally at the client or downloaded from the network to the fingerprint unit 301, so that the fingerprint unit 301 obtains the audio recognition request and the corresponding audio to be identified.

Then, the fingerprint unit 301 performs audio fingerprint extraction on the audio signal of the audio to be identified to obtain an audio fingerprint of the audio to be identified, where the audio fingerprint includes audio feature information of the audio to be identified. The audio fingerprint extraction may specifically include framing, windowing, FFT (Fast Fourier Transform) frequency domain transformation, extraction of local peaks, and conversion into a hash sequence.

Specifically, after obtaining the audio to be identified, the fingerprint unit 301 performs framing and windowing on the audio signal of the audio to be identified. Framing cuts the whole audio signal into multiple segments according to a preset rule, each segment being a frame, so that the audio signal is approximately stationary within each frame and a stationary signal can be supplied to the subsequent audio signal processing. Then, the fingerprint unit 301 applies a preset window function, such as a Hamming window, to each frame of audio, so that the framed audio signal is more continuous and exhibits the characteristics of a periodic function.

Then, the fingerprint unit 301 performs FFT frequency domain transformation on each frame of the audio signal to obtain a spectrum containing frequency domain information. Further, the fingerprint unit 301 extracts the local peaks in the spectrum and converts them into a hash sequence, which is the audio fingerprint of the audio to be identified. It should be noted that the hash sequence may include a plurality of hash values.
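The extraction chain described above (framing, windowing, FFT, local peaks, hash sequence) can be illustrated with the following toy sketch in pure Python. The frame length, hop size, the choice of one strongest local peak per frame, and the hash that packs two consecutive peak bins into one integer are all assumptions made for illustration; the embodiment does not specify them.

```python
import cmath
import math
from typing import List, Sequence


def fft(x: List[complex]) -> List[complex]:
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frame_peak(frame: Sequence[float]) -> int:
    """Apply a Hamming window, transform the frame, and return the bin of the strongest local peak."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    mags = [abs(c) for c in fft([complex(v) for v in windowed])[: n // 2]]
    peaks = [k for k in range(1, len(mags) - 1)
             if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    return max(peaks, key=lambda k: mags[k]) if peaks else 0


def extract_fingerprint(signal: Sequence[float], frame_len: int = 64, hop: int = 32) -> List[int]:
    """Cut the signal into overlapping frames and hash consecutive peak bins."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    peaks = [frame_peak(signal[s:s + frame_len]) for s in starts]
    # Hypothetical hash: pack each pair of consecutive peak bins into one integer.
    return [(a << 16) | b for a, b in zip(peaks, peaks[1:])]


# A pure tone that falls exactly on bin 8 of a 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(256)]
print(extract_fingerprint(tone))  # six identical hash values, each equal to (8 << 16) | 8
```

A production system would normally keep several peaks per frame and hash peak pairs together with their time offset; the single-peak version above only shows how the stages connect.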

The fingerprint unit 301 uses the audio fingerprint of the audio to be identified as a reference fingerprint to calculate the similarity between the reference fingerprint and the audio fingerprint in the preset fingerprint database, so as to retrieve or match the audio fingerprint.

In some embodiments, where the reference fingerprint and the audio fingerprints in the fingerprint library are both represented as hash sequences, the fingerprint unit 301 may be configured to: count, for each audio fingerprint in the preset fingerprint database, the number of identical hash values that the reference fingerprint and that audio fingerprint both contain; and calculate the similarity between the reference fingerprint and each audio fingerprint in the fingerprint database according to the number of identical hash values.

Taking any audio fingerprint in the fingerprint library as an example, the fingerprint unit 301 compares the hash values in the hash sequence of the reference fingerprint with the hash values in the hash sequence of the audio fingerprint one by one, and counts the number of the same hash values, and the fingerprint unit 301 uses the obtained number of the same hash values as the similarity between the reference fingerprint and the audio fingerprint. Thus, the fingerprint unit 301 calculates the similarity between the reference fingerprint and each audio fingerprint in the fingerprint database.

(2) A candidate unit 302;

a candidate unit 302, configured to screen out a candidate fingerprint set from the fingerprint library according to a similarity between the reference fingerprint and an audio fingerprint in the fingerprint library.

For example, the candidate unit 302 may screen the audio fingerprints in the fingerprint database having similarity values greater than the similarity threshold value with the reference fingerprint according to a preset similarity threshold value, as the candidate fingerprints matching with the reference fingerprint.

Further, the candidate unit 302 places the candidate fingerprints obtained by the screening in the same set to obtain a candidate fingerprint set. Thus, the candidate fingerprint set includes one or more candidate fingerprints that match the reference fingerprint.

(3) A homophonic unit 303;

a homophonic unit 303, configured to select a reference fingerprint from the candidate fingerprint set, and obtain a homophonic fingerprint of the reference fingerprint.

The reference fingerprint selected here is the candidate fingerprint that is most similar to the reference fingerprint of the audio to be identified. For example, the homophonic unit 303 may take the candidate fingerprint in the candidate fingerprint set that has the highest similarity value with the reference fingerprint of the audio to be identified as the selected reference fingerprint.

Then, the homophonic unit 303 selects the homophonic fingerprints of the reference fingerprint. A homophonic fingerprint is one whose corresponding audio is the same as, or can be regarded as the same as, the audio corresponding to the reference fingerprint. For example, the music library of a music platform may contain multiple audios that have different numbers but are actually the same song, such as different versions of the same song, versions sung by different singers, or the same song included from different albums or radio stations. Multiple audios belonging to the same song are defined as homophonic audio, and their audio fingerprints are homophonic fingerprints.

In some embodiments, the homophonic unit 303 may be specifically configured to: calculate the degree of overlap between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set; and select the homophonic fingerprints of the reference fingerprint from the other candidate fingerprints according to the degree of overlap.

The degree of overlap between the reference fingerprint and the other candidate fingerprints can be calculated by means such as correlation or the longest common subsequence. With correlation, the variance of the hash sequences of the reference fingerprint and another candidate fingerprint may be calculated, and the variance value used as the degree of overlap between the two. The homophonic unit 303 then takes the other candidate fingerprints whose variance values satisfy a preset requirement as homophonic fingerprints of the reference fingerprint.

Taking the Longest Common Subsequence (LCS) as an example, the homophonic unit 303 may be configured to: acquire the longest common subsequence of the reference fingerprint and each other candidate fingerprint in the candidate fingerprint set, and count the length of the longest common subsequence; and calculate the degree of overlap between the reference fingerprint and the other candidate fingerprints according to the length of the longest common subsequence. The length of the longest common subsequence may be expressed as:

nlcs = LCS(res[i].hash_seq, res[0].hash_seq)

Taking any other candidate fingerprint as an example, after obtaining the longest common subsequence length of the reference fingerprint and that candidate fingerprint, the homophonic unit 303 calculates the degree of overlap between the two. For example, the following formula may be used:

sim = nlcs / hash_seq_cnt × 100%

In this way, the homophonic unit 303 can calculate the degree of overlap between the reference fingerprint and each of the other candidate fingerprints.
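As a worked illustration of the formula sim = nlcs / hash_seq_cnt × 100%, the sketch below evaluates it for given inputs. It assumes that hash_seq_cnt denotes the number of hash values in the hash sequence used as the denominator; this reading, like the function name, is an assumption of the example.

```python
def overlap_percent(nlcs: int, hash_seq_cnt: int) -> float:
    """Degree of overlap in percent: nlcs / hash_seq_cnt * 100."""
    if hash_seq_cnt <= 0:
        raise ValueError("hash_seq_cnt must be positive")
    return nlcs / hash_seq_cnt * 100.0


# A common subsequence of 25 hash values against a sequence of 100 hash values.
print(overlap_percent(25, 100))   # 25.0
# A fingerprint compared with itself shares its whole sequence.
print(overlap_percent(100, 100))  # 100.0
```

Normalising by the sequence length makes the value comparable across fingerprints of different lengths, which is what allows a single percentage threshold to be applied.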

Then, the homophonic unit 303 may select a homophonic fingerprint of the reference fingerprint among the other candidate fingerprints.

For example, the homophonic unit 303 may take the other candidate fingerprint with the largest degree of overlap as the homophonic fingerprint of the reference fingerprint; or the homophonic unit 303 may sort the other candidate fingerprints by degree of overlap from largest to smallest and select those ranked within a preset number of top positions as homophonic fingerprints of the reference fingerprint.
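Both selection strategies just described, taking the single largest value or the top preset number of values, can be sketched with one function. The function name and the dictionary of overlap values keyed by audio id are hypothetical.

```python
from typing import Dict, List


def top_n_homophonic(overlaps: Dict[str, float], n: int = 1) -> List[str]:
    """Ids of the n other candidates with the largest degree of overlap.

    n = 1 corresponds to taking only the candidate with the largest value.
    """
    ranked = sorted(overlaps, key=lambda cid: overlaps[cid], reverse=True)
    return ranked[:n]


overlaps = {"a2": 40.0, "a3": 72.5, "a4": 10.0}
print(top_n_homophonic(overlaps))       # ['a3']
print(top_n_homophonic(overlaps, n=2))  # ['a3', 'a2']
```

Unlike threshold screening, this always returns up to n candidates regardless of how small their overlap is, so it may be worth combining with a minimum value in practice.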

In some embodiments, the homophonic unit 303 may be configured to: screen out, from the other candidate fingerprints, the candidate fingerprints whose degree of overlap with the reference fingerprint is greater than or equal to a preset threshold, as homophonic fingerprints of the reference fingerprint.

In this way, the homophonic unit 303 screens the other candidate fingerprints in the candidate fingerprint set to obtain the homophonic fingerprints of the reference fingerprint.

Through the calculation of similarity, the homophonic unit 303 therefore saves the labor and time of marking homophonic audio in the audio library and avoids the situation in which such information is not entered manually in time. Because no extra manual marking or classification of homophonic audio is needed when audio is added to the library, wrongly recorded or missing information is avoided and the maintenance cost is reduced. The accuracy and efficiency of identifying homophonic fingerprints and homophonic audio are thereby improved.

In some embodiments, if no candidate fingerprint having a degree of overlap with the reference fingerprint greater than or equal to a preset threshold is found, the audio unit 304 determines the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be identified.

Thus, when a homophonic fingerprint of a reference fingerprint cannot be found, the homophonic unit 303 determines that there are no other candidate fingerprints that are very similar to the reference fingerprint in the candidate fingerprint set. Therefore, the audio unit 304 determines the audio corresponding to the reference fingerprint according to the mapping relationship between each audio fingerprint in the fingerprint library and the audio, and determines the audio as the target audio corresponding to the audio to be identified.

(4) An audio unit 304;

an audio unit 304, configured to select a target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof.

After obtaining the reference fingerprint and the homophonic fingerprint thereof, the audio unit 304 determines the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof according to the mapping relationship between each audio fingerprint and the audio in the fingerprint database.

Then, the audio unit 304 selects a target audio from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof. For example, the audio unit 304 takes the reference fingerprint and the audio corresponding to the homophonic fingerprint as the target audio corresponding to the audio to be recognized.

In some embodiments, the reference fingerprint and the audio corresponding to the homophonic fingerprint may also be filtered according to actual needs, and the audio unit 304 may specifically be configured to: acquiring the reference fingerprint and the audio corresponding to the homophonic fingerprint of the reference fingerprint as homophonic audio, and acquiring the version information of the homophonic audio; determining the version priority of the homophonic audio according to the version information; and taking the homophonic audio with the highest version priority as the target audio corresponding to the audio to be identified.

For example, according to the source information of the homophonic audio, the audio unit 304 sets the version priority of audio whose source is an album to be the highest and the version priority of audio whose source is a radio station to be the lowest. Thus, the audio unit 304 determines the homophonic audio whose source is an album as the target audio.

For example, the audio unit 304 sets the priority of the version with the earliest time to be the highest and the priority of the version with the latest time to be the lowest according to the time of putting on the shelf of the homophonic audio. Thus, the audio unit 304 determines the homophonic audio with the earliest shelf time as the target audio.

As can be seen from the above, the fingerprint unit 301 in the embodiment of the present invention may extract an audio fingerprint of an audio to be identified as a reference fingerprint, and calculate the similarity between the reference fingerprint and the audio fingerprints in a preset fingerprint library; the candidate unit 302 screens out a candidate fingerprint set from the fingerprint database according to the similarity between the reference fingerprint and the audio fingerprints in the fingerprint database, where the candidate fingerprint set includes the audio fingerprints similar to the reference fingerprint; then, the homophonic unit 303 selects a reference fingerprint from the candidate fingerprint set and acquires the homophonic fingerprints of that reference fingerprint; and the audio unit 304 selects a target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof.

After the candidate fingerprints approximating the reference fingerprint have been retrieved, a candidate fingerprint may match the reference fingerprint and yet remain uncertain because of version differences or similar problems of the audio to be identified. The scheme therefore further selects a reference fingerprint in the candidate fingerprint set and, by calculating the degree of overlap, further selects the homophonic fingerprints from the other candidate fingerprints in the candidate fingerprint set, thereby screening the candidate fingerprints further. The reference fingerprint and homophonic fingerprints obtained through this multi-stage screening comprise the audio fingerprints that are most similar to the reference fingerprint of the audio to be identified and whose corresponding audio is the same or can be regarded as the same.

Therefore, the target audio selected from the audio corresponding to the reference fingerprint and the homophonic fingerprint is the audio of the optimal version and can serve as the true origin or source of the audio to be identified; the accuracy of both the content and the version of the target audio is ensured, and the overall efficiency of audio identification and the user experience are improved. By screening the audio fingerprints in the fingerprint database layer by layer, the scheme refines the granularity of audio identification and retrieves a more accurate target audio.

An embodiment of the present invention further provides an audio identification device, as shown in fig. 4a, which shows a schematic structural diagram of the audio identification device according to the embodiment of the present invention, specifically:

the audio recognition device may include components such as a processor 401 with one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403, and an input unit 404. It will be appreciated by those skilled in the art that the structure shown in fig. 4a does not constitute a limitation of the audio recognition device, which may include more or fewer components than those shown, may combine some components, or may arrange the components differently. Wherein:

the processor 401 is a control center of the audio recognition apparatus, connects various parts of the entire audio recognition apparatus using various interfaces and lines, and performs various functions of the audio recognition apparatus and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the audio recognition apparatus. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as an audio recognition function, etc.), and the like; the storage data area may store data created according to use of the audio recognition apparatus, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.

The audio recognition device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

The audio recognition device may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

In addition, referring to fig. 4b, the audio recognition apparatus may further include an audio acquisition device 405, which is used for acquiring the audio to be recognized. For example, the audio acquisition device 405 may acquire the audio to be recognized by recording or the like.

Although not shown, the audio recognition apparatus may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 401 in the audio recognition apparatus loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, so as to implement various functions as follows:

extracting an audio fingerprint of an audio to be identified as a reference fingerprint, and calculating the similarity between the reference fingerprint and the audio fingerprint in a preset fingerprint database; screening a candidate fingerprint set from a fingerprint database according to the similarity between the reference fingerprint and the audio fingerprint in the fingerprint database; selecting a reference fingerprint in the candidate fingerprint set, and acquiring homophonic fingerprints of the reference fingerprint; and selecting the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof.
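As a minimal sketch of the first three functions above, the following Python example scores each library fingerprint against the fingerprint of the audio to be identified, screens a candidate set, and takes the candidate with the maximum similarity as the fingerprint selected from the candidate set (named `base` here only to keep it apart from the query). It assumes that a fingerprint is a sequence of integer hash values, that similarity is the number of identical hash values counted as a multiset intersection, and that screening keeps fingerprints meeting a minimum count. The names `hash_similarity`, `screen_candidates`, `pick_base`, and `min_shared` are hypothetical and do not come from the embodiment.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple


def hash_similarity(query: Sequence[int], stored: Sequence[int]) -> int:
    """Count the hash values shared by two fingerprints (multiset intersection)."""
    return sum((Counter(query) & Counter(stored)).values())


def screen_candidates(
    query: Sequence[int],
    library: Dict[str, Sequence[int]],
    min_shared: int,
) -> List[Tuple[str, int]]:
    """Keep library fingerprints sharing at least `min_shared` hash values
    with the query, sorted from most to least similar.

    The threshold rule is an assumption made for this sketch.
    """
    scored = [(fid, hash_similarity(query, fp)) for fid, fp in library.items()]
    kept = [item for item in scored if item[1] >= min_shared]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def pick_base(candidates: List[Tuple[str, int]]) -> Optional[str]:
    """Return the id of the candidate with the maximum similarity, if any."""
    return candidates[0][0] if candidates else None


if __name__ == "__main__":
    query = [1, 2, 3, 4, 5]
    library = {"a": [1, 2, 3, 9], "b": [1, 2, 3, 4, 8], "c": [7, 8, 9]}
    candidates = screen_candidates(query, library, min_shared=3)
    print(candidates, pick_base(candidates))
```

A real fingerprint database would normally be indexed by hash value rather than scanned linearly; the linear scan is used here only to keep the sketch short.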

The processor 401 may also run an application program stored in the memory 402, implementing the following functions:

calculating the coincidence degree of the reference fingerprint and other candidate fingerprints in the candidate fingerprint set; selecting homophonic fingerprints of the reference fingerprint from the other candidate fingerprints according to the coincidence degree.

acquiring the longest common subsequence of the reference fingerprint and other candidate fingerprints in the candidate fingerprint set, and counting the length of the longest common subsequence; and calculating the coincidence degree of the reference fingerprint and other candidate fingerprints according to the length of the longest common subsequence.

respectively counting the number of the same hash values contained in the reference fingerprint and each audio fingerprint in a preset fingerprint database; and respectively calculating the similarity between the reference fingerprint and each audio fingerprint in a fingerprint database according to the number of the same hash values.

acquiring the reference fingerprint and the audio corresponding to the homophonic fingerprint of the reference fingerprint as homophonic audio, and acquiring the version information of the homophonic audio; determining the version priority of the homophonic audio according to the version information; and taking the homophonic audio with the highest version priority as the target audio corresponding to the audio to be identified.
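The optional functions listed above can be sketched in Python as follows. The example computes the length of the longest common subsequence of two fingerprints, derives a degree of overlap from that length, selects homophonic fingerprints, and then chooses the homophonic audio with the highest version priority. Several details are assumptions made only for illustration: the overlap is normalized by the length of the shorter fingerprint, homophonic fingerprints are those meeting a minimum degree `min_degree`, and version priority is looked up in a table keyed by a `version` label. All function names, the record layout, and the version labels are hypothetical.

```python
from typing import Dict, List, Sequence


def lcs_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_degree(base: Sequence[int], other: Sequence[int]) -> float:
    """LCS length divided by the shorter fingerprint length (assumed normalization)."""
    shorter = min(len(base), len(other))
    return lcs_length(base, other) / shorter if shorter else 0.0


def homophonic_ids(
    base: Sequence[int],
    others: Dict[str, Sequence[int]],
    min_degree: float,
) -> List[str]:
    """Ids of the other candidates whose overlap with `base` reaches `min_degree`."""
    return [fid for fid, fp in others.items() if overlap_degree(base, fp) >= min_degree]


def pick_target(audios: List[Dict[str, str]], priority: Dict[str, int]) -> Dict[str, str]:
    """Return the audio record whose version label has the highest priority.

    Unknown version labels get priority 0 (an assumption of this sketch).
    """
    return max(audios, key=lambda audio: priority.get(audio["version"], 0))


if __name__ == "__main__":
    base = [1, 2, 3, 4, 5]
    others = {"live": [1, 3, 4, 6, 5], "unrelated": [9, 8, 7]}
    print(homophonic_ids(base, others, min_degree=0.8))
    audios = [{"id": "x", "version": "live"}, {"id": "y", "version": "studio"}]
    print(pick_target(audios, {"studio": 2, "live": 1}))
```

The dynamic program takes time proportional to the product of the two fingerprint lengths, which is acceptable here because it runs only over the screened candidate set rather than over the whole fingerprint database.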

For the specific implementation of the above operations, reference may be made to the foregoing embodiments; details are not repeated herein.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present invention provide a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the audio recognition methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:

The instructions may also perform the steps of:

The storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

Since the instructions stored in the storage medium can execute the steps in any audio recognition method provided by the embodiments of the present invention, they can achieve the beneficial effects of any such method. These effects are detailed in the foregoing embodiments and are not repeated herein.

The foregoing describes in detail the audio recognition method, apparatus, device, and storage medium provided by the embodiments of the present invention. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the foregoing embodiments is intended only to help in understanding the method and core idea of the present invention. Meanwhile, those skilled in the art may, in accordance with the idea of the present invention, make variations to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. An audio recognition method, comprising:

determining the candidate fingerprint with the maximum similarity value with the reference fingerprint in the candidate fingerprint set as a reference fingerprint, calculating the coincidence degree of the reference fingerprint and other candidate fingerprints in the candidate fingerprint set, and selecting the homophonic fingerprint of the reference fingerprint from the other candidate fingerprints according to the coincidence degree;

2. The method of claim 1, wherein the calculating the coincidence degree of the reference fingerprint and other candidate fingerprints in the candidate fingerprint set comprises:

3. The method according to claim 1, wherein the selecting the homophonic fingerprint of the reference fingerprint from the other candidate fingerprints according to the coincidence degree comprises:

4. The method of claim 3, further comprising:

5. The method of claim 1, wherein the calculating the similarity between the reference fingerprint and the audio fingerprint in the preset fingerprint library comprises:

6. The method according to any one of claims 1 to 5, wherein the selecting, from the audio corresponding to the reference fingerprint and the homophonic fingerprint thereof, the target audio corresponding to the audio to be recognized comprises:

7. An audio recognition apparatus, comprising:

the homophonic unit is used for determining the candidate fingerprint with the maximum similarity value with the reference fingerprint in the candidate fingerprint set as a reference fingerprint, calculating the coincidence degree of the reference fingerprint and other candidate fingerprints in the candidate fingerprint set, and selecting the homophonic fingerprint of the reference fingerprint from the other candidate fingerprints according to the coincidence degree;

8. An audio recognition device, characterized in that the audio recognition device comprises: memory, a processor and an audio recognition program stored on the memory and executable on the processor, the audio recognition program when executed by the processor implementing the steps of the method according to any one of claims 1 to 6.

9. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the audio recognition method of any one of claims 1 to 6.