CN112435688A - Audio recognition method, server and storage medium - Google Patents

Audio recognition method, server and storage medium

Info

Publication number
CN112435688A
CN112435688A
Authority
CN
China
Prior art keywords
fingerprint
audio
length
candidate
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011313926.8A
Other languages
Chinese (zh)
Other versions
CN112435688B (en)
Inventor
鲁霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202011313926.8A priority Critical patent/CN112435688B/en
Publication of CN112435688A publication Critical patent/CN112435688A/en
Application granted granted Critical
Publication of CN112435688B publication Critical patent/CN112435688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The embodiment of the application discloses an audio identification method, a server and a storage medium, comprising the following steps: extracting an audio fingerprint of an audio to be identified as a reference fingerprint, and matching the reference fingerprint with each audio fingerprint in a preset audio fingerprint library to obtain a candidate fingerprint set; determining, from the candidate fingerprint set, the benchmark fingerprint with the highest matching degree with the reference fingerprint; determining an LCS of the reference fingerprint and any candidate fingerprint in the candidate fingerprint set; determining a similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, a first coverage length and a second coverage length; screening out at least one homophonic fingerprint of the reference fingerprint based on the similarity between the reference fingerprint and each candidate fingerprint; and determining a target audio based on the reference fingerprint and the at least one homophonic fingerprint. By means of the method and the device, the accuracy of audio identification can be improved.

Description

Audio recognition method, server and storage medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to an audio recognition method, a server, and a storage medium.
Background
The listen-and-identify (song recognition) function provides music enthusiasts with a very convenient way to retrieve the music they like: a user only needs to record music in the environment or hum a song segment and feed it into the application software to identify the song. At present, song recognition mainly searches a massive song library according to the characteristic information of the input audio and selects the songs most similar to it.
In the course of research and practice on the prior art, the inventors found that an audio clip uploaded by a user may correspond to multiple versions of the same audio, and that the current audio identification process of music platforms is coarse and does not take the differences between versions into account. As a result, the song selected by the music platform for the clip provided by the user may not be the true source of the clip and may not be what the user actually wants. The accuracy of current audio recognition is therefore poor.
Disclosure of Invention
The embodiment of the application provides an audio identification method, a server and a storage medium, aiming to improve the accuracy of audio identification.
In a first aspect, an audio recognition method is provided for an embodiment of the present application, including:
extracting an audio fingerprint of an audio to be identified as a reference fingerprint, and matching the reference fingerprint with each audio fingerprint in a preset audio fingerprint database to obtain a candidate fingerprint set;
determining, from the candidate fingerprint set, a benchmark fingerprint with the highest matching degree with the reference fingerprint;
determining a longest common subsequence, LCS, of the reference fingerprint and any one of the candidate fingerprints based on the reference fingerprint and the any one of the candidate fingerprints in the set of candidate fingerprints;
determining similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, a first coverage length and a second coverage length, wherein the first coverage length is a shortest subsequence length of the reference fingerprint including the LCS, and the second coverage length is a shortest subsequence length of the candidate fingerprint including the LCS;
screening at least one homophonic fingerprint of the reference fingerprint from the candidate fingerprint set based on the similarity between the reference fingerprint and each candidate fingerprint;
and determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint.
Optionally, the determining the similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, the first coverage length, and the second coverage length includes:
determining a total coverage length according to the first coverage length and the second coverage length;
determining the similarity between the reference fingerprint and each of the candidate fingerprints according to a ratio between the length of the LCS and the total coverage length.
Optionally, the determining a total coverage length according to the first coverage length and the second coverage length includes:
obtaining a first weight coefficient and a second weight coefficient, and performing weighted calculation on the first coverage length and the second coverage length based on the first weight coefficient and the second weight coefficient to obtain the total coverage length.
Optionally, the determining the similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, the first coverage length, and the second coverage length further includes:
determining the similarity between the reference fingerprint and each of the candidate fingerprints according to a ratio between the length of the LCS and the first coverage length and a ratio between the length of the LCS and the second coverage length.
Optionally, the determining the similarity between the reference fingerprint and each candidate fingerprint according to the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length includes:
obtaining a first weight coefficient and a second weight coefficient, and performing weighted calculation on the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length based on the first weight coefficient and the second weight coefficient, to obtain the similarity between the reference fingerprint and each candidate fingerprint.
Optionally, the determining the longest common subsequence LCS between the reference fingerprint and any candidate fingerprint of the candidate fingerprint set based on the reference fingerprint and the any candidate fingerprint of the candidate fingerprint set includes:
determining a matching matrix of the reference fingerprint and any candidate fingerprint based on the reference fingerprint and any candidate fingerprint;
determining an optimal matching path between the reference fingerprint and any candidate fingerprint based on the matching matrix;
and determining the LCS of the reference fingerprint and any candidate fingerprint based on the optimal matching path.
Optionally, the determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint includes:
acquiring the audio corresponding to the reference fingerprint and the audio corresponding to each homophonic fingerprint as at least one homophonic audio, wherein the homophonic audio carries version information;
determining the priority of the homophonic audio according to a preset priority rule and the version information of the homophonic audio;
and determining the homophonic audio with the highest priority in the at least one homophonic audio as the target audio corresponding to the audio to be identified.
Optionally, the determining the priority of the homophonic audio according to the preset priority rule and the version information of the homophonic audio includes:
and determining the priority of the homophonic audio whose version information indicates the original singer as the highest priority.
In a second aspect, an audio recognition apparatus is provided for an embodiment of the present application, including:
the extraction module is used for extracting the audio fingerprint of the audio to be identified as a reference fingerprint;
the matching module is used for matching the reference fingerprint with each audio fingerprint in a preset audio fingerprint database to obtain a candidate fingerprint set;
a benchmark fingerprint determining module, configured to determine, from the candidate fingerprint set, a benchmark fingerprint with the highest matching degree with the reference fingerprint;
an LCS determination module configured to determine, based on the reference fingerprint and any candidate fingerprint in the candidate fingerprint set, a longest common subsequence LCS between the reference fingerprint and the any candidate fingerprint;
a similarity determining module, configured to determine a similarity between the reference fingerprint and each candidate fingerprint based on a length of the LCS, a first coverage length, and a second coverage length, where the first coverage length is a shortest subsequence length of the reference fingerprint including the LCS, and the second coverage length is a shortest subsequence length of the candidate fingerprint including the LCS;
a homophonic fingerprint screening module, configured to screen at least one homophonic fingerprint of the reference fingerprint from the candidate fingerprint set based on a similarity between the reference fingerprint and each of the candidate fingerprints;
and the target audio determining module is used for determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint.
Optionally, the similarity determining module includes:
a total length determining unit, configured to determine a total length of the coverage area according to the first coverage area length and the second coverage area length;
a first similarity determining unit, configured to determine a similarity between the reference fingerprint and each of the candidate fingerprints according to a ratio between a length of the LCS and a total length of the coverage area.
Optionally, the total length determining unit is specifically configured to:
and acquiring a first weight coefficient and a second weight coefficient, and performing weighted calculation on the first coverage range length and the second coverage range length based on the first weight coefficient and the second weight coefficient to obtain the total coverage range length.
Optionally, the similarity determining module further includes:
a second similarity determining unit, configured to determine a similarity between the reference fingerprint and each of the candidate fingerprints according to a ratio between a length of the LCS and the first coverage length, and a ratio between the length of the LCS and the second coverage length.
Optionally, the second similarity determining unit is specifically configured to:
and obtaining a first weight coefficient and a second weight coefficient, and performing weighted calculation on the ratio between the length of the LCS and the length of the first coverage area and the ratio between the length of the LCS and the length of the second coverage area based on the first weight coefficient and the second weight coefficient to obtain the similarity between the reference fingerprint and each candidate fingerprint.
Optionally, the LCS determining module includes:
a matching matrix determining unit configured to determine a matching matrix between the reference fingerprint and the candidate fingerprint based on the reference fingerprint and the candidate fingerprint;
an optimal path determining unit, configured to determine an optimal matching path between the reference fingerprint and any one of the candidate fingerprints based on the matching matrix;
an LCS determining unit, configured to determine an LCS for the reference fingerprint and the candidate fingerprint based on the best matching path.
Optionally, the target audio determining module includes:
a homophonic audio acquiring unit, configured to acquire an audio of the reference fingerprint and an audio corresponding to each homophonic fingerprint as at least one homophonic audio, where the homophonic audio carries version information;
a priority determining unit, configured to determine a priority of the homophonic audio according to a preset priority rule and version information of the homophonic audio;
and the target audio determining unit is used for determining the homophonic audio with the highest priority in the homophonic audio to be the target audio corresponding to the audio to be identified.
Optionally, the priority determining unit is configured to determine a priority of the homophonic audio in which the version information is an original song as a highest priority.
In a third aspect, a server is provided for an embodiment of the present application, including a processor, a memory and a transceiver that are connected to each other, where the memory is used to store a computer program that supports the server in executing the audio recognition method, and the computer program includes program instructions; the processor is configured to call the program instructions to perform the audio recognition method described in the first aspect of the embodiments of the present application.
In a fourth aspect, a storage medium is provided for an embodiment of the present application, where the storage medium stores a computer program, and the computer program includes program instructions; the program instructions, when executed by a processor, cause the processor to perform an audio recognition method as described above in an aspect of an embodiment of the present application.
In the embodiment of the application, an audio fingerprint of an audio to be identified is extracted as a reference fingerprint, and the reference fingerprint is matched with each audio fingerprint in a preset audio fingerprint database to obtain a candidate fingerprint set; the benchmark fingerprint with the highest matching degree with the reference fingerprint is determined from the candidate fingerprint set; an LCS of the reference fingerprint and any candidate fingerprint in the candidate fingerprint set is determined; the similarity between the reference fingerprint and the candidate fingerprint is determined based on the length of the LCS, the first coverage length and the second coverage length; at least one homophonic fingerprint of the reference fingerprint is screened out based on the similarity between the reference fingerprint and each candidate fingerprint; and the target audio corresponding to the audio to be identified is then determined based on the reference fingerprint and the at least one homophonic fingerprint. By means of the method and the device, the accuracy of audio identification can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a system architecture diagram according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an audio recognition method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a process for determining an optimal matching path according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of an audio recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an audio recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, the system architecture includes an audio recognition platform and a user terminal cluster, where the user terminal cluster may include a plurality of user terminals, specifically a user terminal 100a, a user terminal 100b, a user terminal 100c, …, and a user terminal 100n.
The audio recognition platform and each user terminal in the user terminal cluster may be a computer device, including a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a smart speaker, a mobile internet device (MID), a point-of-sale (POS) machine, a wearable device (e.g., a smart watch or a smart bracelet), and the like.
Further, as shown in fig. 1, in the audio recognition process a user may input an audio recognition request through a user terminal. After receiving the audio recognition request, the audio recognition platform notifies the user terminal to start audio collection, recording the user's humming or the sound in the environment to obtain the audio to be recognized. The audio fingerprint of the audio to be recognized is extracted as the reference fingerprint; the matching degree between the reference fingerprint and each audio fingerprint in a preset audio fingerprint library is calculated to obtain at least one matching degree; at least one audio fingerprint whose matching degree is greater than or equal to a preset matching degree threshold is screened out from the preset audio fingerprint library to obtain a candidate fingerprint set; and the candidate fingerprint with the highest matching degree in the candidate fingerprint set is determined as the benchmark fingerprint. Then, a Longest Common Subsequence (LCS) of the reference fingerprint and any candidate fingerprint in the candidate fingerprint set is calculated, and the similarity between the reference fingerprint and the candidate fingerprint is determined based on the length of the LCS, the first coverage length and the second coverage length. Based on the similarity between the reference fingerprint and each candidate fingerprint, at least one candidate fingerprint whose similarity is greater than or equal to a preset similarity threshold is screened out to obtain at least one homophonic fingerprint of the reference fingerprint. The audio corresponding to the reference fingerprint and the audio corresponding to each homophonic fingerprint are then acquired as at least one homophonic audio, where the homophonic audio carries version information; the priority of each homophonic audio is determined according to a preset priority rule and its version information; and the homophonic audio with the highest priority among the at least one homophonic audio is determined as the target audio corresponding to the audio to be recognized.
Please refer to fig. 2, which is a flowchart illustrating an audio recognition method according to an embodiment of the present disclosure. As shown in fig. 2, this method embodiment comprises the following steps:
and S101, extracting the audio fingerprint of the audio to be identified as a reference fingerprint.
Before executing step S101, the audio identification platform may perform audio fingerprint extraction on each audio in the audio library, store each extracted audio fingerprint in a preset audio fingerprint library, and record a mapping relationship between each audio and an audio fingerprint.
In some possible embodiments, the user terminal sends an audio identification request to the audio identification platform, and after receiving the audio identification request, the audio identification platform acquires an audio to be identified, performs audio fingerprint extraction on the audio to be identified, and uses the audio fingerprint of the audio to be identified as a reference fingerprint for querying an audio fingerprint that is closest to or most similar to the audio fingerprint.
For example, a user may use a user terminal to input an audio recognition request, and after receiving the audio recognition request, the audio recognition platform notifies the user terminal to start audio acquisition, so as to record humming sound of the user or sound in the environment, and obtain an audio to be recognized, where the audio to be recognized is the audio to be recognized corresponding to the audio recognition request. Of course, the user may also upload the audio stored locally in the user terminal or downloaded from the network to the audio recognition platform, and then the audio recognition platform obtains the audio recognition request and the audio to be recognized corresponding to the audio recognition request.
And then, the audio identification platform performs audio fingerprint extraction on the audio signal of the audio to be identified to obtain the audio fingerprint of the audio to be identified, where the audio fingerprint contains the audio characteristic information of the audio to be identified. The audio fingerprint extraction of the audio signal may specifically include framing, windowing, Fast Fourier Transform (FFT) frequency-domain transformation, local-peak extraction, and conversion of the local peaks into a hash sequence.
Specifically, after obtaining the audio to be recognized, the audio recognition platform performs framing and windowing on its audio signal. Framing cuts the whole audio signal into multiple sections according to a preset rule, each section being one frame, so that the audio signal is microscopically stable and stable signals can be fed into the subsequent audio signal processing. The audio recognition platform then windows each frame with a preset windowing function, which may be a Hamming window or the like, so that the framed audio signals are more continuous and exhibit the characteristics of a periodic function.
And then, the audio identification platform performs FFT frequency domain transformation on each frame of audio signal to obtain a frequency spectrum containing frequency domain information. And then, the audio identification platform extracts a local peak value in the frequency spectrum, converts the local peak value into a hash sequence, wherein the hash sequence is the audio fingerprint of the audio to be identified, and takes the audio fingerprint of the audio to be identified as a reference fingerprint. It should be noted that a plurality of hash values may be included in the hash sequence.
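The extraction pipeline described above (framing, windowing, FFT, local-peak extraction, hashing) can be sketched roughly as follows. This is a minimal illustration rather than the platform's actual implementation; the frame size, hop size and the peak-pairing hash are assumptions.

```python
import numpy as np

def extract_fingerprint(signal, frame_size=1024, hop=512):
    """Sketch of the fingerprint pipeline: frame -> window -> FFT -> peaks -> hash sequence."""
    window = np.hamming(frame_size)
    hashes = []
    prev_peak = None
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window   # framing + windowing
        spectrum = np.abs(np.fft.rfft(frame))                # FFT frequency-domain transform
        peak_bin = int(np.argmax(spectrum))                  # local peak of this frame (simplified)
        if prev_peak is not None:
            # pair consecutive peaks into one hash value (assumed pairing scheme)
            hashes.append(hash((prev_peak, peak_bin)) & 0xFFFFFFFF)
        prev_peak = peak_bin
    return hashes                                            # the hash sequence = audio fingerprint

# toy usage: a synthetic one-second signal sampled at 8 kHz
if __name__ == "__main__":
    t = np.linspace(0, 1, 8000, endpoint=False)
    audio = np.sin(2 * np.pi * 440 * t)
    reference_fingerprint = extract_fingerprint(audio)
    print(len(reference_fingerprint), "hash values")
```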
And S102, matching the reference fingerprint with each audio fingerprint in a preset audio fingerprint database to obtain a candidate fingerprint set.
And the audio identification platform calculates the matching degree of the reference fingerprint and each audio fingerprint in a preset audio fingerprint library to realize the retrieval or matching of the audio fingerprints.
In some possible embodiments, the reference fingerprint and the audio fingerprint in the preset audio fingerprint library are both represented by using a hash sequence, the audio identification platform may count the number of the same hash values contained in the reference fingerprint and each audio fingerprint in the preset audio fingerprint library respectively, and calculate the matching degree between the reference fingerprint and each audio fingerprint in the preset audio fingerprint library respectively according to the number of the same hash values.
Specifically, taking any audio fingerprint in the preset audio fingerprint library as an example, the audio identification platform compares the hash values in the hash sequence of the reference fingerprint with the hash values in the hash sequence of that audio fingerprint one by one and counts the number of identical hash values. The audio identification platform takes the obtained number of identical hash values as the matching degree between the reference fingerprint and that audio fingerprint, thereby obtaining the matching degree between the reference fingerprint and each audio fingerprint in the preset audio fingerprint library, and selects the audio fingerprints in the preset audio fingerprint library whose matching degree with the reference fingerprint is greater than or equal to a preset matching degree threshold, obtaining at least one candidate fingerprint and thus a candidate fingerprint set.
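A minimal sketch of this matching step, assuming fingerprints are plain lists of hash values and the fingerprint library is a dictionary keyed by an audio identifier (both assumptions made for illustration only):

```python
from collections import Counter

def matching_degree(reference_fp, library_fp):
    """Matching degree = number of identical hash values shared by the two fingerprints."""
    ref_counts = Counter(reference_fp)
    lib_counts = Counter(library_fp)
    return sum((ref_counts & lib_counts).values())   # size of the multiset intersection

def build_candidate_set(reference_fp, fingerprint_library, threshold):
    """Keep every library fingerprint whose matching degree reaches the preset threshold."""
    candidates = {}
    for audio_id, library_fp in fingerprint_library.items():
        degree = matching_degree(reference_fp, library_fp)
        if degree >= threshold:
            candidates[audio_id] = degree
    return candidates

# toy usage with made-up hash sequences
library = {"song_a": [1, 2, 3, 4, 5], "song_b": [9, 8, 7], "song_c": [1, 2, 9, 4]}
reference = [1, 2, 3, 9, 4]
print(build_candidate_set(reference, library, threshold=3))   # {'song_a': 4, 'song_c': 4}
```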
And S103, determining the benchmark fingerprint with the highest matching degree with the reference fingerprint from the candidate fingerprint set.
In some possible embodiments, the audio recognition platform determines, as the benchmark fingerprint, the candidate fingerprint in the candidate fingerprint set that has the highest matching degree with the reference fingerprint.
It should be noted that this candidate fingerprint can be understood to mean that its corresponding audio is the same as, or can be regarded as the same as, the audio to be identified, such as the same song or a different rendition of the same song.
S104, determining the longest common subsequence LCS for the reference fingerprint and any candidate fingerprint based on the reference fingerprint and any candidate fingerprint in the set of candidate fingerprints.
In some possible embodiments, the audio recognition platform determines a matching matrix of the reference fingerprint and any candidate fingerprint based on the reference fingerprint and any candidate fingerprint; determining an optimal matching path between the reference fingerprint and any candidate fingerprint based on the matching matrix; the LCS for the reference fingerprint and any of the candidate fingerprints is determined based on the best matching path.
The audio identification platform can determine the LCS of the reference fingerprint and any candidate fingerprint by adopting a dynamic programming mode, and the implementation mode is as follows:
calculating to obtain a matching matrix A according to the reference fingerprint, any candidate fingerprint and a preset matching value, wherein the matching matrix A comprises (j +1) × (k +1) matrix elements, j is the number of sequence elements of the reference fingerprint, and k is the number of sequence elements of any candidate fingerprint;
determining a matrix element A (j +1, k +1) in the matching matrix A as a first target element, and determining a path element of the first target element from a plurality of selectable path elements corresponding to the first target element in the matching matrix A;
determining the path element of the first target element as a second target element, and determining the path element of the second target element from a plurality of selectable path elements corresponding to the second target element in the matching matrix A until the path element of the mth target element is the matrix element A (1,1) in the matching matrix A;
determining an optimal matching path according to the first target element and at least one path element;
and screening target matrix elements from a plurality of matrix elements corresponding to the optimal matching path, and determining the LCS of the reference fingerprint and any candidate fingerprint according to the target matrix elements.
Specifically, the audio recognition platform calculates a matching matrix A according to the reference fingerprint, the first candidate fingerprint and a preset matching value, where the matching matrix A includes (j+1) × (k+1) matrix elements, j is the number of sequence elements of the reference fingerprint and k is the number of sequence elements of the first candidate fingerprint. The specific implementation process is as follows:
The audio recognition platform determines the first-row matrix elements A(1,:) and the first-column matrix elements A(:,1) of the matching matrix A based on preset values, and determines A(m+1, n+1) based on the difference between the m-th sequence element of the reference fingerprint and the n-th sequence element of the first candidate fingerprint, the preset matching value, the preset matching score, and the matrix elements A(m, n), A(m, n+1) and A(m+1, n), where m is an integer greater than or equal to 1 and less than or equal to j, and n is an integer greater than or equal to 1 and less than or equal to k. It can be understood that if the difference between the m-th sequence element of the reference fingerprint and the n-th sequence element of the first candidate fingerprint is equal to the preset matching value, then A(m+1, n+1) is the sum of A(m, n) and the preset matching score; otherwise, A(m+1, n+1) is the maximum of A(m, n+1) and A(m+1, n). The matrix element value in row (m+1) and column (n+1) of the matching matrix A can thus be determined, and the whole matching matrix A can be calculated in this way. Here, the preset matching value is 0.
The values of A(1,:), A(:,1) and the preset matching score are set manually and are not limited here.
Here, the process by which the audio recognition platform determines the matching matrix between the reference fingerprint and the first candidate fingerprint is illustrated for A(1,:) = A(:,1) = 0 and a preset matching score of 1; please refer to fig. 3, which is a schematic diagram of the process of determining the optimal matching path according to an embodiment of the present application. As shown in fig. 3, according to the reference fingerprint S1 = {ACDEFGGH}, the first candidate fingerprint S2 = {CEGDHFGHB} and the preset matrix elements A(1,:) = A(:,1) = 0, the audio recognition platform determines that the matching matrix A contains 9 × 10 matrix elements and that the elements in its first row and first column are all 0. It then determines the matrix element A(2,2) in row 2 and column 2 of the matching matrix A. The specific calculation may be: compute the difference between the first sequence element A of the reference fingerprint S1 and the first sequence element C of the first candidate fingerprint S2; since it is not equal to the preset matching value 0, the maximum of A(1,2) = A(2,1) = 0, namely 0, is determined as A(2,2). Next, the matrix element A(3,2) in row 3 and column 2 of the matching matrix A is determined. The specific calculation may be: compute the difference between the second sequence element C of the reference fingerprint S1 and the first sequence element C of the first candidate fingerprint S2; since it is equal to the preset matching value 0, the sum of A(2,1) = 0 and the preset matching score 1, namely 1, is determined as A(3,2). Each matrix element of the matching matrix between the reference fingerprint S1 and the first candidate fingerprint S2 is calculated in this way; the result is shown in the dashed-line area of fig. 3.
Then, the audio recognition platform can find the optimal matching path in the matching matrix A by backtracking.
For example, referring to fig. 3 again, the matching matrix A may contain multiple matching paths from the first matrix element A(1,1) to the last matrix element A(9,10) = 5. The audio recognition platform determines the last matrix element A(9,10) = 5 as the first target element; since the sequence element H of the reference fingerprint S1 corresponding to the first target element is not equal to the sequence element B of the first candidate fingerprint S2, the maximum of A(9,9) = 5 and A(8,10) = 4, namely A(9,9) = 5, is determined as the path element of the first target element. A(9,9) = 5 is then determined as the second target element; since the sequence element H of the reference fingerprint S1 corresponding to the second target element is equal to the sequence element H of the first candidate fingerprint S2, the selectable path element A(8,8) = 4 of the second target element is determined as its path element. The path element of each target element is obtained in this manner until the last obtained path element is A(1,1) = 0; the path elements are then connected in the order determined by the first target element and the plurality of path elements to form the optimal matching path, i.e., the path indicated by the arrows in fig. 3.
Then, the audio identification platform determines LCS of the reference fingerprint and the first candidate fingerprint according to the optimal matching path.
Specifically, the audio identification platform screens out at least one target matrix element which satisfies that sequence elements in the reference fingerprint S1 are equal to sequence elements in the first candidate fingerprint S2 from a plurality of matrix elements corresponding to the optimal matching path, sorts at least one sequence element corresponding to the at least one target matrix element in the reference fingerprint S1 or the first candidate fingerprint S2 according to a sequence number of the sequence element corresponding to the reference fingerprint S1 or the first candidate fingerprint S2 from small to large, obtains the sorted at least one sequence element, and obtains an LCS between the reference fingerprint and the first candidate fingerprint.
For example, from the matrix elements A(1,1), A(2,1), A(3,2), A(4,2), A(5,3), A(6,3), A(7,4), A(7,5), A(7,6), A(7,7), A(8,8), A(9,9) and A(9,10) corresponding to the optimal matching path shown in fig. 3, the audio recognition platform screens out the target matrix elements A(3,2), A(5,3), A(7,4), A(8,8) and A(9,9), at whose positions the sequence element of the reference fingerprint S1 is equal to the corresponding sequence element of the first candidate fingerprint S2. The sequence elements corresponding to A(3,2), A(5,3), A(7,4), A(8,8) and A(9,9) in the reference fingerprint S1 are C, E, G, G and H, respectively; sorting them in ascending order of their sequence numbers (2, 4, 6, 7 and 8) in the reference fingerprint S1 yields the LCS {C, E, G, G, H} of the reference fingerprint S1 and the first candidate fingerprint S2.
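The matching-matrix construction and backtracking described above can be sketched as follows. The tie-breaking rule used when two selectable path elements are equal is an assumption, chosen so that the sketch reproduces the LCS {C, E, G, G, H} of the example; other longest common subsequences of the same length exist.

```python
def lcs_by_matching_matrix(s1, s2, match_score=1):
    """Dynamic-programming sketch of the matching matrix and backtracking described above.
    The matrix has (len(s1)+1) x (len(s2)+1) elements; its first row and column are preset to 0."""
    j, k = len(s1), len(s2)
    A = [[0] * (k + 1) for _ in range(j + 1)]
    for m in range(1, j + 1):
        for n in range(1, k + 1):
            if s1[m - 1] == s2[n - 1]:               # difference equals the preset matching value 0
                A[m][n] = A[m - 1][n - 1] + match_score
            else:
                A[m][n] = max(A[m - 1][n], A[m][n - 1])
    # backtrack from the last matrix element towards A(1,1) to recover an optimal matching path
    lcs, m, n = [], j, k
    while m > 0 and n > 0:
        if s1[m - 1] == s2[n - 1]:                   # target matrix element: equal sequence elements
            lcs.append(s1[m - 1])
            m, n = m - 1, n - 1
        elif A[m - 1][n] > A[m][n - 1]:
            m -= 1
        else:                                        # ties broken towards the candidate fingerprint
            n -= 1
    return lcs[::-1], A

# the example used in the description
S1 = list("ACDEFGGH")
S2 = list("CEGDHFGHB")
lcs, A = lcs_by_matching_matrix(S1, S2)
print("".join(lcs), len(lcs))                        # CEGGH 5
```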
And S105, determining the similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, the length of the first coverage range and the length of the second coverage range.
The first coverage length is the length of the shortest subsequence of the reference fingerprint that contains the LCS, and the second coverage length is the length of the shortest subsequence of each candidate fingerprint that contains the LCS.
For example, assuming that the reference fingerprint S1 = {ACDEFGGH}, the first candidate fingerprint S2 = {CEGDHFGHB} and the LCS = {C, E, G, G, H}, the length of the LCS is 5. The shortest subsequence of the reference fingerprint S1 containing the LCS is {CDEFGGH}, and the length of this sequence is 7, i.e. the first coverage length is 7; the shortest subsequence of the first candidate fingerprint S2 containing the LCS is {CEGDHFGH}, and the length of this sequence is 8, i.e. the second coverage length is 8.
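A small sketch for computing a coverage length, i.e. the length of the shortest contiguous subsequence of a fingerprint that contains the LCS as a subsequence (assuming the fingerprints are plain Python lists of sequence elements):

```python
def coverage_length(sequence, lcs):
    """Length of the shortest contiguous window of `sequence` containing `lcs` as a subsequence.
    A sketch; it assumes `lcs` really is a subsequence of `sequence`."""
    best = None
    for start in range(len(sequence)):
        if sequence[start] != lcs[0]:                # a minimal window must start on the first LCS element
            continue
        i = 0
        for end in range(start, len(sequence)):
            if sequence[end] == lcs[i]:              # greedily match LCS elements left to right
                i += 1
                if i == len(lcs):                    # all LCS elements covered by this window
                    span = end - start + 1
                    best = span if best is None else min(best, span)
                    break
    return best

S1, S2, LCS = list("ACDEFGGH"), list("CEGDHFGHB"), list("CEGGH")
print(coverage_length(S1, LCS), coverage_length(S2, LCS))   # 7 8
```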
In some possible embodiments, the audio recognition platform determines the total coverage length from the first coverage length and the second coverage length; the similarity between the reference fingerprint and each candidate fingerprint is determined according to the ratio between the length of the LCS and the total length of the coverage.
Specifically, the audio recognition platform determining the total coverage length according to the first coverage length and the second coverage length may be implemented as follows: the sum of the first coverage length x1 and the second coverage length x2 is determined as the total coverage length X, i.e. X = x1 + x2; alternatively, a first weight coefficient α1 and a second weight coefficient α2 are obtained, and weighted calculation is performed on the first coverage length and the second coverage length based on the first weight coefficient and the second weight coefficient to obtain the total coverage length X, i.e. X = α1·x1 + α2·x2, where α1 and α2 are each an arbitrary number between 0 and 1 inclusive and α1 + α2 = 1.
Then, the audio identification platform calculates the similarity γ·z between the reference fingerprint and each candidate fingerprint according to a preset coefficient γ and the ratio z between the length of the LCS and the total coverage length.
For example, assuming that the reference fingerprint S1 = {A, C, D, E, F, G, G, H}, the first candidate fingerprint S2 = {C, E, G, D, H, F, G, H, B}, the LCS = {C, E, G, G, H}, the length of the LCS is 5, the first coverage length is 7, the second coverage length is 8 and the preset coefficient γ is 2, the audio recognition platform determines the sum of the first coverage length 7 and the second coverage length 8 as the total coverage length 15, and, according to the preset coefficient γ = 2 and the ratio z = 5/15 between the length of the LCS and the total coverage length, calculates the similarity between the reference fingerprint S1 and the first candidate fingerprint S2 as γ·z = 2/3.
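A minimal sketch of this similarity computation; the plain-sum form of the total coverage length is used so that the example above (γ = 2, LCS length 5, coverage lengths 7 and 8) is reproduced, and the weighted form is shown as an option:

```python
def total_coverage_length(cov1, cov2, alpha1=None, alpha2=None):
    """Total coverage length X: plain sum, or weighted sum when weights are given."""
    if alpha1 is None or alpha2 is None:
        return cov1 + cov2                       # X = x1 + x2
    return alpha1 * cov1 + alpha2 * cov2         # X = a1*x1 + a2*x2, with a1 + a2 = 1

def similarity_total_coverage(lcs_len, cov1, cov2, gamma=2, alpha1=None, alpha2=None):
    """Similarity = gamma * z, where z = LCS length / total coverage length."""
    X = total_coverage_length(cov1, cov2, alpha1, alpha2)
    return gamma * lcs_len / X

# example from the description: LCS length 5, coverage lengths 7 and 8, gamma = 2
print(similarity_total_coverage(5, 7, 8))        # 0.666..., i.e. 2/3
```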
S106, screening out at least one homophonic fingerprint of the reference fingerprint from the candidate fingerprint set based on the similarity between the reference fingerprint and each candidate fingerprint.
In some possible embodiments, the audio recognition platform may take the other candidate fingerprint with the largest similarity value as the homophonic fingerprint of the reference fingerprint; or the audio recognition platform may select, in descending order of similarity, the other candidate fingerprints ranked within a preset top number as homophonic fingerprints of the reference fingerprint; or candidate fingerprints whose similarity is greater than or equal to a preset similarity threshold may be screened out from the candidate fingerprint set as homophonic fingerprints of the reference fingerprint, where the preset similarity threshold can be flexibly adjusted according to actual needs, for example 25%.
Further, the similarity between the reference fingerprint and the other candidate fingerprints may also be calculated by correlation or the like; for example, the variance between the hash sequence of the reference fingerprint and the hash sequences of the other candidate fingerprints may be calculated and taken as the similarity. The audio identification platform then takes the other candidate fingerprints whose variance values meet a preset requirement as homophonic fingerprints of the reference fingerprint.
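The three screening strategies above (single best match, top-k, similarity threshold) can be sketched as follows; the candidate identifiers and the default 25% threshold are illustrative assumptions:

```python
def screen_homophonic(similarities, strategy="threshold", threshold=0.25, top_k=3):
    """Screen homophonic fingerprints from a dict of {candidate_id: similarity},
    using one of the three strategies described above."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if strategy == "best":                       # single candidate with the largest similarity
        return [ranked[0][0]] if ranked else []
    if strategy == "top_k":                      # candidates ranked within the preset top number
        return [fp for fp, _ in ranked[:top_k]]
    return [fp for fp, sim in ranked if sim >= threshold]   # similarity threshold, e.g. 25%

sims = {"cand_1": 0.67, "cand_2": 0.12, "cand_3": 0.31}
print(screen_homophonic(sims))                   # ['cand_1', 'cand_3']
```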
It should be noted that a homophonic fingerprint can be understood to mean that its corresponding audio is the same as, or can be regarded as the same as, the audio corresponding to the reference fingerprint. For example, in the music library of a music platform there may be multiple audios with different numbers that are actually the same song, such as different versions of the same song, renditions of the same song by different singers, or the same song collected from different albums or radio stations; multiple audios belonging to the same song are defined as homophonic audios, and their audio fingerprints are homophonic fingerprints.
S107, determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint.
In some possible embodiments, the audio identification platform may acquire the audio corresponding to the reference fingerprint and the audio corresponding to each homophonic fingerprint as at least one homophonic audio, where the homophonic audio carries version information; determine the priority of each homophonic audio according to a preset priority rule and the version information of the homophonic audio; and determine the homophonic audio with the highest priority among the at least one homophonic audio as the target audio corresponding to the audio to be identified.
The version information includes information such as the source of the audio, the singer, the play count, and the listing and/or release time, and may be preset information carried by the audio itself. Homophonic audios may come from different sources and/or have different version information.
For example, according to the source information in the version information of the homophonic audios, the audio recognition platform sets the priority of the version whose source is an album as the highest and the priority of the version whose source is a radio station as the lowest. The audio recognition platform thus determines the homophonic audio whose source is an album as the target audio.
For another example, according to the listing times of the homophonic audios, the audio recognition platform sets the priority of the earliest-listed version as the highest and the priority of the latest-listed version as the lowest. The audio recognition platform thus determines the earliest-listed homophonic audio as the target audio.
For another example, according to the play counts of the homophonic audios, from high to low, the audio recognition platform sets the priority of the version with the highest play count as the highest and the priority of the version with the lowest play count as the lowest. The audio recognition platform thus determines the homophonic audio with the highest play count as the target audio.
For another example, according to the singers of the homophonic audios, the audio recognition platform sets the priority of the version whose singer has the highest priority as the highest and the priority of the version whose singer has the lowest priority as the lowest; for example, the original singer may have a higher priority than cover singers. The audio recognition platform thus determines the original-singer audio as the target audio.
Thus, the target audio is the audio that is most similar to the audio to be identified and the most accurate version.
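A sketch of priority-based selection among homophonic audios; the version-information fields and rule names used here are illustrative assumptions, not the platform's actual data model:

```python
def pick_target_audio(homophonic_audios, rule="source"):
    """Pick the highest-priority homophonic audio according to a preset priority rule."""
    if rule == "source":            # album ranks highest, radio station lowest
        order = {"album": 0, "single": 1, "radio": 2}
        key = lambda a: order.get(a["source"], 99)
    elif rule == "listing_time":    # earliest listing time ranks highest
        key = lambda a: a["listed_at"]
    elif rule == "play_count":      # highest play count ranks highest
        key = lambda a: -a["play_count"]
    else:                           # original singer ranks above cover versions
        key = lambda a: 0 if a["is_original"] else 1
    return min(homophonic_audios, key=key)

audios = [
    {"id": "v1", "source": "album", "listed_at": 2012, "play_count": 900, "is_original": True},
    {"id": "v2", "source": "radio", "listed_at": 2010, "play_count": 1500, "is_original": False},
]
print(pick_target_audio(audios, rule="source")["id"])       # v1
```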
In the embodiment of the application, after determining the LCS between the reference fingerprint and any candidate fingerprint, the audio recognition platform calculates the similarity between the reference fingerprint and the candidate fingerprint according to the length of the LCS, the first coverage length (the length of the shortest subsequence of the reference fingerprint containing the LCS) and the second coverage length (the length of the shortest subsequence of the candidate fingerprint containing the LCS). This avoids ignoring the sequence lengths of the two fingerprints and considering only global similarity, improves the accuracy of the audio similarity, and thus improves the accuracy of audio recognition.
Please refer to fig. 4, which is a flowchart illustrating an audio recognition method according to an embodiment of the present disclosure. As shown in fig. 4, this method embodiment includes the steps of:
S201, extracting an audio fingerprint of the audio to be identified as a reference fingerprint.
S202, matching the reference fingerprint with each audio fingerprint in a preset audio fingerprint database to obtain a candidate fingerprint set.
And S203, determining the benchmark fingerprint with the highest matching degree with the reference fingerprint from the candidate fingerprint set.
S204, determining the longest common subsequence LCS for the reference fingerprint and any candidate fingerprint based on the reference fingerprint and any candidate fingerprint in the set of candidate fingerprints.
S205, determining the total coverage length according to the first coverage length and the second coverage length, and determining the similarity between the reference fingerprint and each candidate fingerprint according to the ratio between the length of the LCS and the total coverage length.
Here, the specific implementation manner of steps S201 to S205 may refer to the description of steps S101 to S105 in the embodiment corresponding to fig. 2, and is not described herein again.
S206, determining the similarity between the reference fingerprint and each candidate fingerprint according to the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length.
In some possible embodiments, the audio recognition platform may obtain a first weight coefficient γ1 and a second weight coefficient γ2, and, based on the first weight coefficient γ1 and the second weight coefficient γ2, perform weighted calculation on the ratio z1 between the length of the LCS and the first coverage length and the ratio z2 between the length of the LCS and the second coverage length, to obtain the similarity z1·γ1 + z2·γ2 between the reference fingerprint and any one of the candidate fingerprints, where the first weight coefficient γ1 and the second weight coefficient γ2 are each an arbitrary number between 0 and 1 inclusive and γ1 + γ2 = 1.
For example, assume that the reference fingerprint S1 = {A, C, D, E, F, G, G, H}, the first candidate fingerprint S2 = {C, E, G, D, H, F, G, H, B}, the LCS = {C, E, G, G, H}, the length of the LCS is 5, the first coverage length is 7, the second coverage length is 8, the first weight coefficient γ1 = 0.5 and the second weight coefficient γ2 = 0.5. The audio recognition platform performs weighted calculation with the first weight coefficient γ1 = 0.5 and the second weight coefficient γ2 = 0.5 on the ratio z1 = 5/7 between the length of the LCS and the first coverage length and the ratio z2 = 5/8 between the length of the LCS and the second coverage length, and calculates the similarity between the reference fingerprint and the first candidate fingerprint as z1·γ1 + z2·γ2 = 75/112.
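A minimal sketch of the weighted-ratio similarity, reproducing the example above (z1 = 5/7, z2 = 5/8, γ1 = γ2 = 0.5, similarity 75/112):

```python
def similarity_weighted_ratios(lcs_len, cov1, cov2, g1=0.5, g2=0.5):
    """Similarity = g1 * (LCS length / first coverage length)
                  + g2 * (LCS length / second coverage length),
    with g1, g2 in [0, 1] and g1 + g2 = 1."""
    z1 = lcs_len / cov1
    z2 = lcs_len / cov2
    return g1 * z1 + g2 * z2

# example from the description: LCS length 5, coverage lengths 7 and 8, weights 0.5 / 0.5
print(similarity_weighted_ratios(5, 7, 8))   # 0.6696..., i.e. 75/112
```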
S207, at least one homophonic fingerprint of the reference fingerprint is screened out from the candidate fingerprint set based on the similarity between the reference fingerprint and each candidate fingerprint.
Here, the specific implementation manner of step S207 may refer to the description of step S106 in the embodiment corresponding to fig. 2, and is not described herein again.
And S208, determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint.
In some possible embodiments, the audio identification platform takes the audio corresponding to the reference fingerprint and the audio corresponding to each homophonic fingerprint as the target audio corresponding to the audio to be identified. This avoids missing audio that is substantially the same as the audio to be identified because of version differences, and improves the accuracy of audio identification.
In the embodiment of the present application, after determining the LCS between the reference fingerprint and any candidate fingerprint, the audio recognition platform may determine the similarity between the reference fingerprint and the candidate fingerprint according to the ratio of the length of the LCS to the total coverage length, where the total coverage length may be determined according to the first coverage length (the length of the shortest subsequence of the reference fingerprint containing the LCS) and the second coverage length (the length of the shortest subsequence of the candidate fingerprint containing the LCS); alternatively, the similarity may be determined according to the ratio of the length of the LCS to the first coverage length and the ratio of the length of the LCS to the second coverage length. This balances the difference caused by the different sequence lengths of the two fingerprints and emphasizes local similarity, avoiding the situation in which the similarity of the original-song fingerprint is evaluated too low and the original song is filtered out while a cover version is returned instead. The accuracy of the audio similarity is thereby improved, and the accuracy of audio recognition is further improved.
Please refer to fig. 5, which provides a schematic structural diagram of an audio recognition apparatus according to an embodiment of the present application. The audio recognition apparatus is applied to an audio recognition platform, and as shown in fig. 5, the audio recognition apparatus includes an extraction module 51, a matching module 52, a benchmark fingerprint determination module 53, an LCS determination module 54, a similarity determination module 55, a homophonic fingerprint screening module 56, and a target audio determination module 57.
An extracting module 51, configured to extract an audio fingerprint of an audio to be identified as a reference fingerprint;
a matching module 52, configured to match the reference fingerprint with each audio fingerprint in a preset audio fingerprint library to obtain a candidate fingerprint set;
a benchmark fingerprint determining module 53, configured to determine, from the candidate fingerprint set, a benchmark fingerprint with the highest matching degree with the reference fingerprint;
an LCS determination module 54 configured to determine, based on the reference fingerprint and any candidate fingerprint in the candidate fingerprint set, a longest common subsequence LCS between the reference fingerprint and the any candidate fingerprint;
a similarity determining module 55, configured to determine a similarity between the reference fingerprint and each candidate fingerprint based on a length of the LCS, a first coverage length, and a second coverage length, where the first coverage length is a shortest subsequence length of the reference fingerprint including the LCS, and the second coverage length is a shortest subsequence length of the candidate fingerprint including the LCS;
a homophonic fingerprint screening module 56, configured to screen at least one homophonic fingerprint of the reference fingerprint from the candidate fingerprint set based on a similarity between the reference fingerprint and each of the candidate fingerprints;
and a target audio determining module 57, configured to determine a target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint.
Optionally, the similarity determining module 55 includes:
a total length determining unit 551, configured to determine a total length of the coverage area according to the first coverage area length and the second coverage area length;
a first similarity determining unit 552, configured to determine a similarity between the reference fingerprint and each of the candidate fingerprints according to a ratio between the length of the LCS and the total length of the coverage area.
Optionally, the total length determining unit 551 is specifically configured to:
and acquiring a first weight coefficient and a second weight coefficient, and performing weighted calculation on the first coverage range length and the second coverage range length based on the first weight coefficient and the second weight coefficient to obtain the total coverage range length.
Optionally, the similarity determining module 55 further includes:
a second similarity determination unit 553, configured to determine a similarity between the reference fingerprint and each of the candidate fingerprints according to a ratio between a length of the LCS and the first coverage length, and a ratio between the length of the LCS and the second coverage length.
Optionally, the second similarity determination unit 553 is specifically configured to:
acquire a first weight coefficient and a second weight coefficient, and perform a weighted calculation on the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length based on the first weight coefficient and the second weight coefficient to obtain the similarity between the reference fingerprint and each candidate fingerprint.
Optionally, the LCS determining module 54 includes:
a matching matrix determining unit 541, configured to determine a matching matrix between the reference fingerprint and any one of the candidate fingerprints based on the reference fingerprint and any one of the candidate fingerprints;
an optimal path determining unit 542, configured to determine an optimal matching path between the reference fingerprint and any of the candidate fingerprints based on the matching matrix;
an LCS determining unit 543, configured to determine the LCS between the reference fingerprint and the candidate fingerprint based on the optimal matching path.
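The three units above follow the classic longest-common-subsequence procedure: build a matching (dynamic-programming) matrix, trace the optimal path through it, and read off the LCS. The sketch below is a minimal illustration under stated assumptions: the matrix definition and the reading of the coverage lengths as the span between the first and last matched positions in each sequence are one plausible interpretation, not the definitive construction used by the embodiment.

```python
def longest_common_subsequence(ref, cand):
    """LCS via a dynamic-programming match matrix plus path backtracking.

    ref, cand -- fingerprint sequences (e.g. lists of hashed sub-fingerprints).
    Returns the LCS together with the first and second coverage lengths,
    read here as the span between the first and last matched positions.
    """
    m, n = len(ref), len(cand)
    # dp[i][j] = LCS length of ref[:i] and cand[:j] (the matching matrix).
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtrack the optimal matching path to recover the LCS and its positions.
    lcs, ref_idx, cand_idx = [], [], []
    i, j = m, n
    while i > 0 and j > 0:
        if ref[i - 1] == cand[j - 1]:
            lcs.append(ref[i - 1])
            ref_idx.append(i - 1)
            cand_idx.append(j - 1)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    lcs.reverse(); ref_idx.reverse(); cand_idx.reverse()

    cover_ref = (ref_idx[-1] - ref_idx[0] + 1) if ref_idx else 0
    cover_cand = (cand_idx[-1] - cand_idx[0] + 1) if cand_idx else 0
    return lcs, cover_ref, cover_cand
```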
Optionally, the target audio determining module 57 includes:
a homophonic audio acquiring unit 571, configured to acquire the audio corresponding to the reference fingerprint and the audio corresponding to each homophonic fingerprint as at least one homophonic audio, where each homophonic audio carries version information;
a priority determining unit 572, configured to determine a priority of the homophonic audio according to a preset priority rule and the version information of the homophonic audio;
the target audio determining unit 573 is configured to determine a homophonic audio with the highest priority among the at least one homophonic audio as a target audio corresponding to the audio to be identified.
Optionally, the priority determining unit 572 is configured to determine the priority of the homophonic audio whose version information indicates an original song as the highest priority.
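A small illustration of the priority rule follows; only the requirement that the original song receive the highest priority comes from the embodiment, while the other version labels, their ordering, and the dictionary-based representation are assumptions.

```python
# Illustrative priority table; only "original song ranks highest" is from the
# embodiment, the remaining labels and their order are assumptions.
VERSION_PRIORITY = {"original": 0, "live": 1, "cover": 2, "remix": 3}

def pick_target_audio(homophonic_audios):
    """Return the homophonic audio whose version has the highest priority
    (lowest rank value); unknown version labels rank last."""
    return min(homophonic_audios,
               key=lambda a: VERSION_PRIORITY.get(a.get("version"),
                                                  len(VERSION_PRIORITY)))
```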
It will be appreciated that the audio recognition apparatus 5 is configured to implement the steps performed by the audio recognition platform in the embodiments of fig. 2 and fig. 4. For the specific implementation of the functional modules included in the audio recognition apparatus 5 of fig. 5 and the corresponding advantageous effects, reference may be made to the specific descriptions of the embodiments of fig. 2 and fig. 4, which are not repeated herein.
The audio recognition apparatus 5 in the embodiment shown in fig. 5 may be implemented by the server 600 shown in fig. 6. Please refer to fig. 6, which provides a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 6, the server 600 may include one or more processors 601 and a memory 602, connected by a bus 603. The memory 602 is configured to store a computer program that includes program instructions, and the processor 601 is configured to execute the program instructions stored in the memory 602 to perform the following operations:
extracting an audio fingerprint of the audio to be identified as a reference fingerprint;
matching the reference fingerprint with each audio fingerprint in a preset audio fingerprint database to obtain a candidate fingerprint set;
determining, from the candidate fingerprint set, a reference fingerprint having the highest matching degree with the reference fingerprint;
determining a longest common subsequence (LCS) between the reference fingerprint and any candidate fingerprint in the candidate fingerprint set based on the reference fingerprint and that candidate fingerprint;
determining the similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, a first coverage length and a second coverage length, where the first coverage length is the length of the shortest subsequence of the reference fingerprint that contains the LCS, and the second coverage length is the length of the shortest subsequence of the candidate fingerprint that contains the LCS;
screening at least one homophonic fingerprint of the reference fingerprint from the candidate fingerprint set based on the similarity between the reference fingerprint and each candidate fingerprint;
and determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint.
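Chaining the operations above, and reusing the earlier sketches, the server-side flow could look roughly as follows; extract_fingerprint is a hypothetical placeholder for the fingerprint extraction step, sim_threshold is an assumed screening threshold, and the selection of the best-matching fingerprint (the third operation) is folded into the candidate loop for brevity.

```python
def identify_audio(audio, fingerprint_library, sim_threshold=0.6):
    """End-to-end sketch chaining the operations listed above."""
    reference = extract_fingerprint(audio)                                   # operation 1 (hypothetical helper)
    candidates = match_fingerprint_library(reference, fingerprint_library)   # operation 2

    homophonic = []
    for cand in candidates:                                                  # operations 3-6
        lcs, cover_ref, cover_cand = longest_common_subsequence(
            reference, cand["fingerprint"])
        _, sim = similarity_from_lcs(len(lcs), cover_ref, cover_cand)
        if sim >= sim_threshold:                                             # assumed screening threshold
            homophonic.append(cand)

    return pick_target_audio(homophonic) if homophonic else None             # operation 7
```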
Optionally, when determining the similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, the first coverage length and the second coverage length, the processor 601 specifically performs the following operations:
determining the total coverage length according to the first coverage length and the second coverage length;
determining the similarity between the reference fingerprint and each candidate fingerprint according to the ratio between the length of the LCS and the total coverage length.
Optionally, when determining the total coverage length according to the first coverage length and the second coverage length, the processor 601 specifically performs the following operations:
acquiring a first weight coefficient and a second weight coefficient, and performing a weighted calculation on the first coverage length and the second coverage length based on the first weight coefficient and the second weight coefficient to obtain the total coverage length.
Optionally, when determining the similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, the first coverage length and the second coverage length, the processor 601 further specifically performs the following operations:
determining the similarity between the reference fingerprint and each candidate fingerprint according to the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length.
Optionally, when determining the similarity between the reference fingerprint and each candidate fingerprint according to the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length, the processor 601 specifically performs the following operations:
acquiring a first weight coefficient and a second weight coefficient, and performing a weighted calculation on the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length based on the first weight coefficient and the second weight coefficient to obtain the similarity between the reference fingerprint and each candidate fingerprint.
Optionally, when determining the longest common subsequence (LCS) between the reference fingerprint and any candidate fingerprint in the candidate fingerprint set based on the reference fingerprint and that candidate fingerprint, the processor 601 specifically performs the following operations:
determining a matching matrix of the reference fingerprint and the candidate fingerprint based on the reference fingerprint and the candidate fingerprint;
determining an optimal matching path between the reference fingerprint and the candidate fingerprint based on the matching matrix;
determining the LCS of the reference fingerprint and the candidate fingerprint based on the optimal matching path.
Optionally, when determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint, the processor 601 specifically performs the following operations:
acquiring the audio corresponding to the reference fingerprint and the audio corresponding to each homophonic fingerprint as at least one homophonic audio, where each homophonic audio carries version information;
determining the priority of each homophonic audio according to a preset priority rule and the version information of the homophonic audio;
determining the homophonic audio with the highest priority among the at least one homophonic audio as the target audio corresponding to the audio to be identified.
Optionally, when determining the priority of the homophonic audio according to the preset priority rule and the version information of the homophonic audio, the processor 601 specifically performs the following operations:
determining, among the at least one homophonic audio, the priority of the homophonic audio whose version information indicates an original song as the highest priority.
In an embodiment of the present application, a computer storage medium may be provided, which may be used to store the computer software instructions used by the audio recognition apparatus in the embodiment shown in fig. 5, and which contains the program designed for that audio recognition apparatus. The storage medium includes, but is not limited to, a flash memory, a hard disk, or a solid state disk.
An embodiment of the present application also provides a computer program product comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that when the computer program product or the computer program runs on the computer device, the computer device performs the audio recognition method implemented by the audio recognition apparatus in the embodiment shown in fig. 5.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
In the present application, "A and/or B" means one of the following cases: A, B, or A and B. "At least one of ..." refers to any one or any combination of the listed items; for example, "at least one of A, B and C" refers to any one of the following seven cases: A, B, C, A and B, B and C, A and C, or A, B and C.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only a preferred embodiment of the present application and is not intended to limit the scope of the claims of the present application; equivalent variations and modifications made in accordance with the claims of the present application still fall within the scope of the present application.

Claims (10)

1. An audio recognition method, comprising:
extracting an audio fingerprint of the audio to be identified as a reference fingerprint;
matching the reference fingerprint with each audio fingerprint in a preset audio fingerprint database to obtain a candidate fingerprint set;
determining, from the candidate fingerprint set, a reference fingerprint having the highest matching degree with the reference fingerprint;
determining a longest common subsequence (LCS) between the reference fingerprint and any candidate fingerprint in the candidate fingerprint set based on the reference fingerprint and that candidate fingerprint;
determining a similarity between the reference fingerprint and each candidate fingerprint based on a length of the LCS, a first coverage length and a second coverage length, wherein the first coverage length is the length of the shortest subsequence of the reference fingerprint that contains the LCS, and the second coverage length is the length of the shortest subsequence of each candidate fingerprint that contains the LCS;
screening out at least one homophonic fingerprint of the reference fingerprint from the candidate fingerprint set based on the similarity between the reference fingerprint and each candidate fingerprint;
and determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint.
2. The method according to claim 1, wherein the determining the similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, a first coverage length, and a second coverage length comprises:
determining a total coverage length according to the first coverage length and the second coverage length;
determining a similarity between the reference fingerprint and each candidate fingerprint according to a ratio between a length of the LCS and the total coverage length.
3. The method of claim 2, wherein the determining the total coverage length according to the first coverage length and the second coverage length comprises:
acquiring a first weight coefficient and a second weight coefficient, and performing a weighted calculation on the first coverage length and the second coverage length based on the first weight coefficient and the second weight coefficient to obtain the total coverage length.
4. The method of claim 1, wherein determining the similarity between the reference fingerprint and each candidate fingerprint based on the length of the LCS, a first coverage length, and a second coverage length, further comprises:
determining similarity between the reference fingerprint and each of the candidate fingerprints according to a ratio between a length of the LCS and the first coverage length and a ratio between a length of the LCS and the second coverage length.
5. The method according to claim 4, wherein the determining the similarity between the reference fingerprint and each candidate fingerprint according to the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length comprises:
acquiring a first weight coefficient and a second weight coefficient, and performing a weighted calculation on the ratio between the length of the LCS and the first coverage length and the ratio between the length of the LCS and the second coverage length based on the first weight coefficient and the second weight coefficient to obtain the similarity between the reference fingerprint and each candidate fingerprint.
6. The method according to claim 1, wherein the determining the longest common subsequence (LCS) between the reference fingerprint and any candidate fingerprint in the candidate fingerprint set based on the reference fingerprint and that candidate fingerprint comprises:
determining a matching matrix of the reference fingerprint and the candidate fingerprint based on the reference fingerprint and the candidate fingerprint;
determining an optimal matching path between the reference fingerprint and the candidate fingerprint based on the matching matrix;
determining the LCS of the reference fingerprint and the candidate fingerprint based on the optimal matching path.
7. The method according to claim 1, wherein the determining the target audio corresponding to the audio to be identified based on the reference fingerprint and the at least one homophonic fingerprint comprises:
acquiring the audio corresponding to the reference fingerprint and the audio corresponding to each homophonic fingerprint as at least one homophonic audio, wherein each homophonic audio carries version information;
determining the priority of each homophonic audio according to a preset priority rule and the version information of the homophonic audio;
determining the homophonic audio with the highest priority among the at least one homophonic audio as the target audio corresponding to the audio to be identified.
8. The method of claim 7, wherein the determining the priority of the homophonic audio according to a preset priority rule and version information of the homophonic audio comprises:
determining, among the at least one homophonic audio, the priority of the homophonic audio whose version information indicates an original song as the highest priority.
9. A server, comprising a processor, a memory and a transceiver, the processor, the memory and the transceiver being interconnected, wherein the transceiver is configured to receive or transmit data, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the audio recognition method of any of claims 1-8.
10. A storage medium, characterized in that the storage medium stores a computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform the audio recognition method of any of claims 1-8.
CN202011313926.8A 2020-11-20 2020-11-20 Audio identification method, server and storage medium Active CN112435688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011313926.8A CN112435688B (en) 2020-11-20 2020-11-20 Audio identification method, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011313926.8A CN112435688B (en) 2020-11-20 2020-11-20 Audio identification method, server and storage medium

Publications (2)

Publication Number Publication Date
CN112435688A true CN112435688A (en) 2021-03-02
CN112435688B CN112435688B (en) 2024-06-18

Family

ID=74693324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011313926.8A Active CN112435688B (en) 2020-11-20 2020-11-20 Audio identification method, server and storage medium

Country Status (1)

Country Link
CN (1) CN112435688B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101310277A (en) * 2005-11-15 2008-11-19 皇家飞利浦电子股份有限公司 Method of obtaining a representation of a text
US8001136B1 (en) * 2007-07-10 2011-08-16 Google Inc. Longest-common-subsequence detection for common synonyms
US20090129676A1 (en) * 2007-11-20 2009-05-21 Ali Zandifar Segmenting a String Using Similarity Values
CN101976318A (en) * 2010-11-15 2011-02-16 北京理工大学 Detection method of code similarity based on digital fingerprints
WO2013130623A2 (en) * 2012-02-27 2013-09-06 Robert Bosch Gmbh Broad-coverage normalization system for social media language
US20140277641A1 (en) * 2013-03-15 2014-09-18 Facebook, Inc. Managing Silence In Audio Signal Identification
CN105224518A (en) * 2014-06-17 2016-01-06 腾讯科技(深圳)有限公司 The lookup method of the computing method of text similarity and system, Similar Text and system
US20160055262A1 (en) * 2014-08-20 2016-02-25 Oracle International Corporation Multidimensional spatial searching for identifying duplicate crash dumps
KR101676000B1 (en) * 2015-06-18 2016-11-14 주식회사 유니온커뮤니티 Method for Detecting and Security-Processing Fingerprint in Digital Documents made between Bank, Telecommunications Firm or Insurance Company and Private person
CN108664535A (en) * 2017-04-01 2018-10-16 北京京东尚科信息技术有限公司 Information output method and device
CN110046293A (en) * 2019-03-01 2019-07-23 清华大学 A kind of user identification relevancy method and device
KR102068605B1 (en) * 2019-03-25 2020-01-21 (주)시큐레이어 Method for classifying malicious code by using sequence of functions' execution and device using the same
CN110047515A (en) * 2019-04-04 2019-07-23 腾讯音乐娱乐科技(深圳)有限公司 A kind of audio identification methods, device, equipment and storage medium
CN111708740A (en) * 2020-06-16 2020-09-25 荆门汇易佳信息科技有限公司 Mass search query log calculation analysis system based on cloud platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王伟; 陈志高; 孟宪凯; 李伟: "Research and Implementation of Entropy-Based Audio Fingerprint Retrieval Technology" (基于熵的音频指纹检索技术研究与实现), Computer Science (计算机科学), no. 1, 15 June 2017 (2017-06-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241098A (en) * 2021-06-02 2021-08-10 亿览在线网络技术(北京)有限公司 Target recommendation method based on audio recording
CN113241098B (en) * 2021-06-02 2022-04-26 亿览在线网络技术(北京)有限公司 Target recommendation method based on audio recording

Also Published As

Publication number Publication date
CN112435688B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
US9899030B2 (en) Systems and methods for recognizing sound and music signals in high noise and distortion
EP2685450B1 (en) Device and method for recognizing content using audio signals
US8321456B2 (en) Generating metadata for association with a collection of content items
CN111161758B (en) Song listening and song recognition method and system based on audio fingerprint and audio equipment
US20070106405A1 (en) Method and system to provide reference data for identification of digital content
CN110047515B (en) Audio identification method, device, equipment and storage medium
JP2007525697A (en) Audio fingerprint system and audio fingerprint method
US10885107B2 (en) Music recommendation method and apparatus
EP2973034B1 (en) Methods and systems for arranging and searching a database of media content recordings
CN106951527B (en) Song recommendation method and device
CA2905385C (en) Methods and systems for arranging and searching a database of media content recordings
CN111192601A (en) Music labeling method and device, electronic equipment and medium
AU2022203290A1 (en) Systems, methods, and apparatus to improve media identification
JP2015106347A (en) Recommendation device and recommendation method
CN110851675A (en) Data extraction method, device and medium
CN112435688B (en) Audio identification method, server and storage medium
CN111460215B (en) Audio data processing method and device, computer equipment and storage medium
CN109271501A (en) A kind of management method and system of audio database
CN111462775B (en) Audio similarity determination method, device, server and medium
CN113327628A (en) Audio processing method and device, readable medium and electronic equipment
US10776420B2 (en) Fingerprint clustering for content-based audio recognition
CN108205546B (en) Song information matching system and method
CN112148941A (en) Information prompting method and device and terminal equipment
CN118170963A (en) Data processing method and device
CN114420166A (en) Processing method, device and equipment based on audio fingerprints and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant