CN115329105A - Multimedia data matching method and device, storage medium and electronic equipment

Info

Publication number: CN115329105A
Application number: CN202211251066.9A
Authority: CN
Inventors: 金强; 李宜烜; 蔡苗苗; 李鹏; 刘华平
Original assignee: Hangzhou Netease Cloud Music Technology Co Ltd
Current assignee: Hangzhou Netease Cloud Music Technology Co Ltd
Priority date: 2022-10-12
Filing date: 2022-10-12
Publication date: 2022-11-11
Anticipated expiration: 2042-10-12
Also published as: CN115329105B

Abstract

Embodiments of the disclosure relate to the field of computer technology, and more particularly to a multimedia data matching method and apparatus, a storage medium, and an electronic device. The method comprises: acquiring audio to be matched and determining the audio fingerprint feature sequence corresponding to the audio to be matched; and calculating a first fingerprint similarity between the audio fingerprint feature sequence and the music score fingerprint feature sequence of each music score in a music score database, and determining, according to the first fingerprint similarity, a target music score that matches the audio to be matched. The scheme enables fast and accurate matching between audio and music scores.

Description

Multimedia data matching method and device, storage medium and electronic equipment

Technical Field

Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a multimedia data matching method and apparatus, a storage medium, and an electronic device.

Background

This section provides background or context for the embodiments of the disclosure. Nothing described here is admitted to be prior art merely by its inclusion in this section.

In the related art, some technical solutions can identify music score data and convert the identification result of a music score into audio data for playing. In other technical schemes, the melody and the vocal information in the audio can be identified and extracted and converted into the music score. In order to realize the matching of the audio data and the music score and obtain a high-quality matching result, the digital music score can be input into professional software, and the rendering of the audio is realized by combining a sound source library, so that the audio completely corresponding to the electronic music score is obtained.

Disclosure of Invention

However, related schemes for converting between and matching audio and music scores have certain disadvantages. For example, electronic music score resources are scarce, and producing the music manually incurs high labor costs. Moreover, because music scores are highly complex, existing technical solutions cannot achieve high recognition accuracy and stability: they reach high recognition accuracy only on monophonic music and are ineffective for the large amount of polyphonic music found in practice.

For this reason, an improved multimedia data matching method and apparatus, storage medium, and electronic device are needed to provide a scheme capable of accurately matching audio with a music score.

In this context, embodiments of the present disclosure are intended to provide a multimedia data matching method and apparatus, a storage medium, and an electronic device.

According to an aspect of the present disclosure, there is provided a multimedia data matching method, including: acquiring an audio to be matched, and determining an audio fingerprint characteristic sequence corresponding to the audio to be matched; and calculating the first fingerprint similarity of the audio fingerprint feature sequence and the music score fingerprint feature sequence of each music score in the music score database, and determining a target music score matched with the audio to be matched according to the first fingerprint similarity.

In an exemplary embodiment of the present disclosure, the calculating a first fingerprint similarity between the audio fingerprint feature sequence and a score fingerprint feature sequence of each score in a score database, and determining a target score matching the audio to be matched according to the first fingerprint similarity includes: calculating fingerprint feature distances between the music score fingerprint feature sequences corresponding to the music scores in the music score database and the audio fingerprint feature sequence corresponding to the audio to be matched, and determining a target music score matched with the audio to be matched according to the fingerprint feature distances.

In an exemplary embodiment of the present disclosure, the method further comprises: generating an audio fingerprint characteristic sequence to be detected according to the audio fingerprint characteristic sequence corresponding to the audio to be matched; respectively calculating second fingerprint similarity of the audio fingerprint feature sequence to be detected and the music score fingerprint feature sequences corresponding to the music scores; determining a first score candidate set in the score database according to the second fingerprint similarity; and the music score fingerprint feature sequence corresponding to the music score in the first music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

In an exemplary embodiment of the present disclosure, after determining the first score candidate set in the score database according to the second fingerprint similarity, the method further includes: respectively calculating third fingerprint similarities between the audio fingerprint feature sequence corresponding to the audio to be matched and the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set; screening the music scores in the first music score candidate set according to the third fingerprint similarity, and determining a second music score candidate set; and the music score fingerprint feature sequence corresponding to the music score in the second music score candidate set is used for calculating the fingerprint feature distance to the audio fingerprint feature sequence corresponding to the audio to be matched.

In an exemplary embodiment of the present disclosure, the generating an audio fingerprint feature sequence to be detected according to an audio fingerprint feature sequence corresponding to the audio to be matched includes: determining the variance of each audio fingerprint feature in the audio fingerprint feature sequence, and selecting m audio fingerprint features with the largest variance; determining the similarity among the m audio fingerprint features, reserving n audio fingerprint features with the minimum similarity, and determining the n audio fingerprint features as the audio fingerprint feature sequence to be detected; wherein m and n are positive integers, and m is greater than n.
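The selection of the sequence to be detected can be sketched as follows. This is only a minimal illustration: the source does not say how similarity is measured or how the n least similar features are chosen, so cosine similarity, the greedy selection order, and the names `cosine` and `select_probe_features` are assumptions.

```python
import math
from statistics import pvariance


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_probe_features(features, m, n):
    """Return the indices of n features: the m with the largest variance,
    thinned greedily to the n that are least similar to each other."""
    if not 0 < n < m:
        raise ValueError("expected 0 < n < m")
    # Step 1: keep the m features with the largest variance.
    ranked = sorted(range(len(features)),
                    key=lambda i: pvariance(features[i]), reverse=True)
    top_m = ranked[:m]
    # Step 2 (assumed strategy): start from the highest-variance feature and
    # repeatedly add the feature least similar to everything chosen so far.
    chosen = [top_m[0]]
    remaining = top_m[1:]
    while len(chosen) < n and remaining:
        nxt = min(remaining,
                  key=lambda i: max(cosine(features[i], features[j])
                                    for j in chosen))
        chosen.append(nxt)
        remaining.remove(nxt)
    return sorted(chosen)
```

Selecting high-variance features discards flat, uninformative segments, and the second step avoids probing the database several times with near-duplicate features.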

In an exemplary embodiment of the present disclosure, the determining a first score candidate set in the score database according to the second fingerprint similarity includes: respectively calculating k music score fingerprint features with the highest similarity to each audio fingerprint feature in the n audio fingerprint features in the music score fingerprint feature sequence to obtain n x k music score fingerprint features serving as a first screening result; wherein k is a positive integer; screening the fingerprint characteristics of the music score in the first screening result according to a preset first similarity threshold value to obtain a second screening result; and determining a score according to the second screening result and constructing the first score candidate set.
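A rough sketch of this coarse screening is given below. The database layout (a dictionary from a score identifier to its list of fingerprint features), the use of cosine similarity, and the name `first_candidate_set` are assumptions made for illustration; a production system would likely use an index rather than the brute-force scan shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def first_candidate_set(probes, score_db, k, threshold):
    """probes: the n features to be detected.
    score_db: hypothetical mapping score_id -> list of score features."""
    flat = [(sid, feat) for sid, feats in score_db.items() for feat in feats]
    # First screening result: the k most similar score features per probe,
    # at most n x k hits in total.
    hits = []
    for probe in probes:
        scored = sorted(((cosine(probe, feat), sid) for sid, feat in flat),
                        reverse=True)
        hits.extend(scored[:k])
    # Second screening result: drop hits below the first similarity threshold.
    kept = [(sim, sid) for sim, sid in hits if sim >= threshold]
    # The candidate set is the set of scores that own the surviving features.
    return {sid for _, sid in kept}
```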

In an exemplary embodiment of the present disclosure, the respectively calculating third fingerprint similarities between the audio fingerprint feature sequence corresponding to the audio to be matched and the score fingerprint feature sequences corresponding to the scores in the first score candidate set includes: respectively calculating the fingerprint similarity between each audio fingerprint feature in the audio fingerprint feature sequence corresponding to the audio to be matched and each music score fingerprint feature in the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set, to obtain the third fingerprint similarity.

In an exemplary embodiment of the present disclosure, the screening scores in the first score candidate set according to the third fingerprint similarity and determining a second score candidate set includes: determining score fingerprint features with the highest similarity to the audio fingerprint features according to the third fingerprint similarity as a third screening result; and screening the fingerprint characteristics of the music scores in the third screening result according to a preset second similarity threshold, and constructing a second music score candidate set according to the music scores corresponding to the screened fingerprint characteristics of the music scores.
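The finer screening could look roughly like the sketch below, which uses the full audio fingerprint feature sequence instead of the n probe features. Cosine similarity, the dictionary layout of the candidate set, and the name `second_candidate_set` are assumptions, not details given in the source.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def second_candidate_set(audio_features, candidates, threshold):
    """candidates: hypothetical mapping score_id -> list of score features,
    restricted to the first score candidate set."""
    kept = set()
    for audio_feat in audio_features:
        # Third screening result: the single most similar score feature
        # (and the score that owns it) for this audio feature.
        best_sim, best_sid = max(
            ((cosine(audio_feat, feat), sid)
             for sid, feats in candidates.items() for feat in feats),
            default=(0.0, None),
        )
        # Keep the owning score only if the best match clears the
        # second similarity threshold.
        if best_sid is not None and best_sim >= threshold:
            kept.add(best_sid)
    return kept
```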

In an exemplary embodiment of the present disclosure, the calculating fingerprint feature distances of a score fingerprint feature sequence corresponding to each score in the score database and an audio fingerprint feature sequence corresponding to the audio to be matched, and determining a target score matching the audio to be matched according to the fingerprint feature distances includes: respectively constructing sampling point distance matrixes based on the audio fingerprint characteristic sequences and the music score fingerprint characteristic sequences of the music scores; calculating fingerprint characteristic distances between the audio fingerprint characteristics in the audio fingerprint characteristic sequence and the music score fingerprint characteristics in the music score fingerprint characteristic sequence and taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequence according to the minimum value of the sampling point distances; and acquiring a plurality of shortest paths between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequences of the music scores, and determining the target music score matched with the audio to be matched according to the minimum value of the shortest paths.
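The sampling point distance matrix and shortest path described here resemble dynamic time warping (DTW). The source does not name the algorithm, so the DTW recurrence, the Euclidean feature distance, and the names `shortest_path` and `match_score` in this sketch are assumptions.

```python
import math


def shortest_path(audio_seq, score_seq):
    """Return (accumulated distance, alignment path) between two feature
    sequences, where the path is a list of (audio_index, score_index)."""
    n, m = len(audio_seq), len(score_seq)
    inf = float("inf")
    # cost[i][j]: cheapest accumulated distance aligning the first i audio
    # features with the first j score features.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(audio_seq[i - 1], score_seq[j - 1])  # sampling point distance
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Walk back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def match_score(audio_seq, score_db):
    """Pick the score whose shortest path to the audio is the smallest.
    score_db: hypothetical mapping score_id -> score feature sequence."""
    results = {sid: shortest_path(audio_seq, seq) for sid, seq in score_db.items()}
    best = min(results, key=lambda sid: results[sid][0])
    return best, results[best][0], results[best][1]
```

Because the alignment may repeat an index on either side, the comparison tolerates differences in tempo between the recording and the score.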

In an exemplary embodiment of the present disclosure, a measure correspondence between the audio fingerprint feature sequence and the score fingerprint feature sequence of the target score is determined according to a shortest path between the score fingerprint feature sequence of the target score and the audio fingerprint feature sequence of the audio to be matched.

In an exemplary embodiment of the present disclosure, the method further comprises: in response to a playback control operation on the audio to be matched, displaying, in a graphical user interface, the music score measure corresponding to the currently played audio measure, according to the measure correspondence between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequence of the target music score.
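One way to turn a shortest path into a measure lookup during playback is sketched below. The source does not specify how features map to time or to measures, so the fixed `seconds_per_feature` and `features_per_measure` parameters, and both function names, are hypothetical.

```python
def audio_to_score_feature(path):
    """Map each audio feature index to the first score feature index
    aligned with it. path: list of (audio_index, score_index) pairs."""
    mapping = {}
    for audio_idx, score_idx in path:
        mapping.setdefault(audio_idx, score_idx)
    return mapping


def score_measure_at(playback_seconds, path, seconds_per_feature, features_per_measure):
    """Return the zero-based score measure to display at a playback position.
    Assumes the path covers every audio feature index, as an alignment
    path between two complete sequences does."""
    if not path:
        raise ValueError("empty alignment path")
    mapping = audio_to_score_feature(path)
    # Clamp to the last aligned audio feature when playback runs past the end.
    audio_idx = min(int(playback_seconds // seconds_per_feature), max(mapping))
    return mapping[audio_idx] // features_per_measure
```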

In an exemplary embodiment of the present disclosure, the determining an audio fingerprint feature sequence corresponding to the audio to be matched includes: carrying out spectrum conversion processing on the audio to be matched to obtain corresponding spectrum data; performing note identification on the frequency spectrum data to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain audio fingerprint characteristics corresponding to each group; and constructing an audio fingerprint characteristic sequence based on the audio fingerprint characteristics corresponding to each group, and configuring the audio fingerprint characteristic sequence as the audio fingerprint characteristic sequence corresponding to the audio to be matched.
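The sampling and grouping steps can be illustrated as below, starting from a pitch sequence that is assumed to have already been produced by spectrum conversion and note identification. The MIDI pitch numbering, the non-overlapping groups, and the names `hz_to_midi` and `fingerprint_sequence` are assumptions for the sketch.

```python
import math


def hz_to_midi(frequency_hz):
    """Convert a frequency to the nearest MIDI pitch number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(frequency_hz / 440.0))


def fingerprint_sequence(pitch_sequence, sample_step, group_size):
    """Sample every sample_step-th pitch, then cut the samples into
    consecutive groups of group_size; each group is one fingerprint feature.
    A trailing partial group is dropped."""
    sampled = pitch_sequence[::sample_step]
    return [sampled[i:i + group_size]
            for i in range(0, len(sampled) - group_size + 1, group_size)]
```

For example, `fingerprint_sequence([60, 60, 62, 62, 64, 64, 65, 65, 67, 67], 2, 2)` samples the pitches 60, 62, 64, 65, 67 and groups them into the two features `[60, 62]` and `[64, 65]`.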

In an exemplary embodiment of the present disclosure, the method further comprises: identifying each music score in the music score database, and acquiring a note sequence of at least one corresponding music track; converting the note sequence according to a preset music score speed to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping the sampling results to obtain the fingerprint characteristics of the music score corresponding to each group; and constructing a music score fingerprint feature sequence corresponding to the preset music score speed based on music score fingerprint features corresponding to each group, and configuring the music score fingerprint feature sequence corresponding to the music score.
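Converting a note sequence into a pitch sequence at a preset score speed might be sketched as follows. The representation of a note as a (MIDI pitch, duration in beats) pair, the fixed sampling rate, and the name `notes_to_pitch_sequence` are assumptions; the resulting pitch sequence would then be sampled and grouped in the same way as for audio.

```python
def notes_to_pitch_sequence(notes, tempo_bpm, samples_per_second):
    """notes: list of (midi_pitch, duration_in_beats) for one track.
    Returns the pitch sounding at each sampling instant when the track
    is played at tempo_bpm beats per minute."""
    seconds_per_beat = 60.0 / tempo_bpm
    # Build the (start, end, pitch) timeline at the chosen tempo.
    timeline = []
    start = 0.0
    for pitch, beats in notes:
        end = start + beats * seconds_per_beat
        timeline.append((start, end, pitch))
        start = end
    total_seconds = start
    # Read off the sounding pitch at each sampling instant.
    sequence = []
    note_idx = 0
    k = 0
    while k / samples_per_second < total_seconds:
        now = k / samples_per_second
        while timeline[note_idx][1] <= now:
            note_idx += 1
        sequence.append(timeline[note_idx][2])
        k += 1
    return sequence
```

Generating one sequence per preset speed, for instance at 60 and at 120 beats per minute, yields fingerprints that can be compared against recordings performed at different tempos.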

In an exemplary embodiment of the present disclosure, the determining an audio fingerprint feature sequence corresponding to the audio to be matched includes: extracting identification information corresponding to the audio to be matched, and matching the audio with each audio in an audio database according to the identification information, wherein the audio database comprises a plurality of audios and audio fingerprint characteristic sequences corresponding to the audios; if the identification information is successfully matched with the audio in an audio database, configuring the audio fingerprint characteristic sequence corresponding to the matched audio as the audio fingerprint characteristic sequence corresponding to the audio to be matched; or if the matching fails according to the identification information, performing feature extraction on the audio to be matched to obtain an audio fingerprint feature sequence corresponding to the audio to be matched.
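The lookup-then-extract logic amounts to a cache check, as in the sketch below. The dictionary-based audio database, the callable used for feature extraction, and the name `resolve_fingerprints` are hypothetical.

```python
def resolve_fingerprints(audio_id, audio_db, extract_features):
    """audio_db: hypothetical mapping from identification information to a
    stored audio fingerprint feature sequence.
    extract_features: zero-argument callable that computes the sequence
    from the audio itself; it is called only when the lookup fails."""
    stored = audio_db.get(audio_id)
    if stored is not None:
        # Identification matched: reuse the stored fingerprint sequence.
        return stored
    # No match: fall back to feature extraction on the audio to be matched.
    return extract_features()
```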

According to an aspect of the present disclosure, there is provided a multimedia data matching apparatus including:

the audio fingerprint feature acquisition module is used for acquiring an audio to be matched and determining an audio fingerprint feature sequence corresponding to the audio to be matched;

and the music score matching module is used for calculating the first fingerprint similarity of the audio fingerprint feature sequence and the music score fingerprint feature sequence of each music score in the music score database, and determining a target music score matched with the audio to be matched according to the first fingerprint similarity.

In an exemplary embodiment of the present disclosure, the score matching module includes: and the fingerprint characteristic distance calculation module is used for calculating the fingerprint characteristic distances of the music score fingerprint characteristic sequences corresponding to the music scores in the music score database and the audio fingerprint characteristic sequences corresponding to the audio to be matched, and determining the target music score matched with the audio to be matched according to the fingerprint characteristic distances.

In an exemplary embodiment of the present disclosure, the apparatus further includes: the first music score candidate set calculation module is used for generating an audio fingerprint characteristic sequence to be detected according to the audio fingerprint characteristic sequence corresponding to the audio to be matched; respectively calculating second fingerprint similarity of the audio fingerprint feature sequence to be detected and the music score fingerprint feature sequences corresponding to the music scores; determining a first score candidate set in the score database according to the second fingerprint similarity; and the music score fingerprint feature sequence corresponding to the music score in the first music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

In an exemplary embodiment of the present disclosure, the apparatus further includes: a second music score candidate set calculating module, configured to calculate third fingerprint similarities between the audio fingerprint feature sequence corresponding to the audio to be matched and the music score fingerprint feature sequences corresponding to the music scores in the first music score candidate set, respectively; screening the music scores in the first music score candidate set according to the third fingerprint similarity, and determining a second music score candidate set; and the music score fingerprint feature sequence corresponding to the music score in the second music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

In an exemplary embodiment of the present disclosure, the apparatus further includes: the to-be-detected audio fingerprint feature sequence calculating module is used for determining the variance of each audio fingerprint feature in the audio fingerprint feature sequence and selecting m audio fingerprint features with the largest variance; determining the similarity among the m audio fingerprint features, reserving n audio fingerprint features with the minimum similarity, and determining the n audio fingerprint features as the audio fingerprint feature sequence to be detected; wherein m and n are both positive integers, and m is greater than n.

In an exemplary embodiment of the present disclosure, the first score candidate set calculation module includes: the first music score screening module is used for respectively calculating k music score fingerprint features with the highest similarity to each audio fingerprint feature in the n audio fingerprint features in the music score fingerprint feature sequence to obtain n x k music score fingerprint features serving as a first screening result; wherein k is a positive integer; screening the fingerprint characteristics of the music score in the first screening result according to a preset first similarity threshold value to obtain a second screening result; and determining a score according to the second screening result and constructing the first score candidate set.

In an exemplary embodiment of the present disclosure, the second score candidate set calculation module includes: and the third fingerprint similarity calculation module is used for calculating the fingerprint similarity of each audio fingerprint feature in the audio fingerprint feature sequence corresponding to the audio to be matched and each music score fingerprint feature in the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set respectively to obtain the third fingerprint similarity.

In an exemplary embodiment of the present disclosure, the second score candidate set calculation module includes: the second music score screening module is used for determining music score fingerprint characteristics with the highest similarity with the audio fingerprint characteristics according to the third fingerprint similarity, and the music score fingerprint characteristics serve as a third screening result; and screening the fingerprint characteristics of the music scores in the third screening result according to a preset second similarity threshold, and constructing a second music score candidate set according to the music scores corresponding to the screened fingerprint characteristics of the music scores.

In an exemplary embodiment of the present disclosure, the fingerprint feature distance calculation module includes: the sampling point distance matrix calculation module is used for respectively constructing sampling point distance matrixes based on the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequences of the music scores; the path calculation module is used for calculating fingerprint characteristic distances between the audio fingerprint characteristics in the audio fingerprint characteristic sequence and the music score fingerprint characteristics in the music score fingerprint characteristic sequence and taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequence according to the minimum value of the sampling point distances; and the target music score determining module is used for acquiring a plurality of shortest paths between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequences of the music scores and determining the target music score matched with the audio to be matched according to the minimum value of the shortest paths.

In an exemplary embodiment of the present disclosure, the apparatus further includes: a correspondence determining module, configured to determine the measure correspondence between the audio fingerprint feature sequence and the music score fingerprint feature sequence of the target music score according to the shortest path between the music score fingerprint feature sequence of the target music score and the audio fingerprint feature sequence of the audio to be matched.

In an exemplary embodiment of the present disclosure, the apparatus further includes: a matching display module, configured to, in response to a playback control operation on the audio to be matched, display in a graphical user interface the music score measure corresponding to the currently played audio measure, according to the measure correspondence between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequence of the target music score.

In an exemplary embodiment of the present disclosure, the apparatus further includes: the audio fingerprint calculation module is used for carrying out spectrum conversion processing on the audio to be matched to acquire corresponding spectrum data; performing note identification on the frequency spectrum data to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain audio fingerprint characteristics corresponding to each group; and constructing an audio fingerprint characteristic sequence based on the audio fingerprint characteristics corresponding to each group, and configuring the audio fingerprint characteristic sequence into an audio fingerprint characteristic sequence corresponding to the audio to be matched.

In an exemplary embodiment of the present disclosure, the apparatus further includes: the music score fingerprint characteristic sequence generation module is used for identifying each music score in the music score database and acquiring a note sequence of at least one corresponding music track; converting the note sequence according to a preset music score speed to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping the sampling results to obtain the fingerprint characteristics of the music score corresponding to each group; and constructing a music score fingerprint feature sequence corresponding to the preset music score speed based on music score fingerprint features corresponding to each group, and configuring the music score fingerprint feature sequence corresponding to the music score.

In an exemplary embodiment of the present disclosure, the apparatus further includes: the identification matching module is used for extracting identification information corresponding to the audio to be matched and matching the identification information with each audio in an audio database according to the identification information, wherein the audio database comprises a plurality of audios and audio fingerprint characteristic sequences corresponding to the audios; if the identification information is successfully matched with the audio in an audio database, configuring the audio fingerprint characteristic sequence corresponding to the matched audio as the audio fingerprint characteristic sequence corresponding to the audio to be matched; or if the matching fails according to the identification information, performing feature extraction on the audio to be matched to obtain an audio fingerprint feature sequence corresponding to the audio to be matched.

According to an aspect of the present disclosure, there is provided a multimedia data matching method, including: acquiring a music score to be matched, and determining a music score fingerprint feature sequence corresponding to the music score to be matched; and calculating fourth fingerprint similarity of the music score fingerprint feature sequence and the audio fingerprint feature sequences of the audios in an audio database, and determining a target audio matched with the music score to be matched according to the fourth fingerprint similarity.

In an exemplary embodiment of the present disclosure, the calculating a fourth fingerprint similarity between the score fingerprint feature sequence and the audio fingerprint feature sequences of the audios in the audio database, and determining a target audio matched with the score to be matched according to the fourth fingerprint similarity includes: and calculating the fingerprint characteristic distance between the audio fingerprint characteristic sequence corresponding to each audio in the audio database and the music score fingerprint characteristic sequence corresponding to the music score to be matched, and determining the target audio matched with the music score to be matched according to the fingerprint characteristic distance.

In an exemplary embodiment of the present disclosure, the method further comprises: generating a music score fingerprint feature sequence to be detected according to the music score fingerprint feature sequence corresponding to the music score to be matched; respectively calculating fifth fingerprint similarities between the music score fingerprint feature sequence to be detected and the audio fingerprint feature sequence corresponding to each audio; determining a first audio candidate set in the audio database according to the fifth fingerprint similarity; and the audio fingerprint feature sequence corresponding to the audio in the first audio candidate set is used for calculating the fingerprint feature distance to the music score fingerprint feature sequence corresponding to the music score to be matched.

In an exemplary embodiment of the present disclosure, after determining the first audio candidate set in the audio database according to the fifth fingerprint similarity, the method further includes: respectively calculating sixth fingerprint similarities between the music score fingerprint feature sequence corresponding to the music score to be matched and the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set; screening each audio in the first audio candidate set according to the sixth fingerprint similarity, and determining a second audio candidate set; and the audio fingerprint feature sequence corresponding to the audio in the second audio candidate set is used for calculating the fingerprint feature distance to the music score fingerprint feature sequence corresponding to the music score to be matched.

In an exemplary embodiment of the present disclosure, the generating a score fingerprint feature sequence to be detected according to a score fingerprint feature sequence corresponding to the score to be matched includes: determining the variance of each score fingerprint feature in the score fingerprint feature sequence, and selecting m score fingerprint features with the largest variance; determining the similarity among the m music score fingerprint features, reserving n music score fingerprint features with the minimum similarity, and determining the n music score fingerprint features as the music score fingerprint feature sequence to be detected; wherein m and n are positive integers, and m is greater than n.

In an exemplary embodiment of the present disclosure, the determining, in the audio database, a first audio candidate set according to the fifth fingerprint similarity includes: respectively calculating k audio fingerprint features with the highest similarity to each music score fingerprint feature in the n music score fingerprint features in the audio fingerprint feature sequence to obtain n x k audio fingerprint features as a fourth screening result; wherein k is a positive integer; screening the audio fingerprint features in the fourth screening result according to a preset third similarity threshold value to obtain a fifth screening result; and determining audio according to the fifth screening result and constructing the first audio candidate set.

In an exemplary embodiment of the present disclosure, the separately calculating sixth fingerprint similarities of the score fingerprint feature sequences corresponding to the score to be matched and the audio fingerprint feature sequences corresponding to the audios in the first audio candidate set includes: and respectively calculating the fingerprint similarity of each music score fingerprint feature in the music score fingerprint feature sequence corresponding to the music score to be matched and each audio fingerprint feature in the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set to obtain the sixth fingerprint similarity.

In an exemplary embodiment of the present disclosure, the screening the audios in the first audio candidate set according to the sixth fingerprint similarity and determining a second audio candidate set includes: determining, according to the sixth fingerprint similarity, the audio fingerprint features with the highest similarity to the music score fingerprint features as a sixth screening result; and screening the audio fingerprint features in the sixth screening result according to a preset fourth similarity threshold, and constructing the second audio candidate set according to the audios corresponding to the screened audio fingerprint features.

In an exemplary embodiment of the present disclosure, the calculating a fingerprint feature distance between an audio fingerprint feature sequence corresponding to each audio in the audio database and a music score fingerprint feature sequence corresponding to the music score to be matched, and determining a target audio matched with the music score to be matched according to the fingerprint feature distance includes: respectively constructing sampling point distance matrixes based on the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequences of the audios; calculating fingerprint characteristic distances between each music score fingerprint characteristic in the music score fingerprint characteristic sequence and each audio fingerprint characteristic in the audio fingerprint characteristic sequence, taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequence according to the minimum value of the sampling point distances; and acquiring a plurality of shortest paths between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequences of the audios, and determining the target audio matched with the music score to be matched according to the minimum value of the shortest paths.

In an exemplary embodiment of the present disclosure, the method further comprises: determining the measure correspondence between the music score fingerprint feature sequence and the audio fingerprint feature sequence of the target audio according to the shortest path between the audio fingerprint feature sequence of the target audio and the music score fingerprint feature sequence of the music score to be matched.

In an exemplary embodiment of the present disclosure, the method further comprises: and responding to the playing control operation of the music score to be matched, and displaying the music score measure corresponding to the currently played audio measure in a graphical user interface according to the measure corresponding relation between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequence of the target audio.

In an exemplary embodiment of the present disclosure, the music score to be matched includes a music score image to be matched; the determining of the score fingerprint feature sequence corresponding to the score to be matched includes: carrying out image recognition on the music score image to be matched, and acquiring original music score data; converting the raw score data to obtain a sequence of notes of at least one audio track; converting the note sequence according to the music score speed contained in the music score data to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain music score fingerprint characteristics corresponding to each group; and constructing a music score fingerprint feature sequence corresponding to the preset music score speed based on music score fingerprint features corresponding to each group, and configuring the music score fingerprint feature sequence corresponding to the music score to be matched.

According to an aspect of the present disclosure, there is provided a multimedia data matching apparatus including: the music score fingerprint feature acquisition module is used for acquiring a music score to be matched and determining a music score fingerprint feature sequence corresponding to the music score to be matched; and the audio matching module is used for calculating the fourth fingerprint similarity of the music score fingerprint feature sequence and the audio fingerprint feature sequences of the audios in the audio database, and determining the target audio matched with the music score to be matched according to the fourth fingerprint similarity.

In an exemplary embodiment of the present disclosure, the audio matching module includes: and the fingerprint characteristic distance calculation module is used for calculating the fingerprint characteristic distances of the audio fingerprint characteristic sequences corresponding to the audios in the audio database and the music score fingerprint characteristic sequences corresponding to the music score to be matched, and determining the target audio matched with the music score to be matched according to the fingerprint characteristic distances.

In an exemplary embodiment of the present disclosure, the apparatus further includes: the first audio candidate set calculation module is used for generating a music score fingerprint feature sequence to be detected according to the music score fingerprint feature sequence corresponding to the music score to be matched; respectively calculating fifth fingerprint similarities between the music score fingerprint feature sequence to be detected and the audio fingerprint feature sequences corresponding to the audios; determining a first audio candidate set in the audio database according to the fifth fingerprint similarity; and the audio fingerprint feature sequence corresponding to the audio in the first audio candidate set is used for calculating the fingerprint feature distance to the music score fingerprint feature sequence corresponding to the music score to be matched.

In an exemplary embodiment of the present disclosure, the apparatus further includes: a second audio candidate set calculating module, configured to calculate sixth fingerprint similarities of the music score fingerprint feature sequence corresponding to the music score to be matched and the audio fingerprint feature sequences corresponding to the audios in the first audio candidate set, respectively; screening the audios in the first audio candidate set according to the sixth fingerprint similarity, and determining a second audio candidate set; and the audio fingerprint feature sequence corresponding to the audio in the second audio candidate set is used for calculating the fingerprint feature distance of the music score fingerprint feature sequence corresponding to the music score to be matched.

In an exemplary embodiment of the present disclosure, the first audio candidate set calculation module includes: the music score fingerprint feature sequence to be detected calculating module is used for determining the variance of the music score fingerprint features in the music score fingerprint feature sequence and selecting m music score fingerprint features with the largest variance; determining the similarity among the m music score fingerprint features, reserving n music score fingerprint features with the minimum similarity, and determining the n music score fingerprint features as the music score fingerprint feature sequence to be detected; wherein m and n are positive integers, and m is greater than n.

In an exemplary embodiment of the disclosure, the first audio candidate set calculation module includes: the first audio screening module is used for respectively calculating k audio fingerprint features with the highest similarity to each music score fingerprint feature in the n music score fingerprint features in the audio fingerprint feature sequence to obtain n x k audio fingerprint features serving as a fourth screening result; wherein k is a positive integer; screening the audio fingerprint features in the fourth screening result according to a preset third similarity threshold value to obtain a fifth screening result; and determining audio according to the fifth screening result and constructing the first audio candidate set.

In an exemplary embodiment of the disclosure, the second audio candidate set calculating module includes: and the sixth fingerprint similarity calculation module is used for calculating the fingerprint similarity of each music score fingerprint feature in the music score fingerprint feature sequence corresponding to the music score to be matched and each audio fingerprint feature in the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set respectively to obtain the sixth fingerprint similarity.

In an exemplary embodiment of the present disclosure, the second audio candidate set calculating module includes: the second music score screening module is used for determining, according to the sixth fingerprint similarity, the audio fingerprint feature with the highest similarity to each music score fingerprint feature as a sixth screening result; and screening the audio fingerprint features in the sixth screening result according to a preset fourth similarity threshold, and constructing the second audio candidate set according to the audios corresponding to the screened audio fingerprint features.

In an exemplary embodiment of the present disclosure, the fingerprint feature distance calculation module includes: the sampling point distance matrix calculation module is used for respectively constructing a sampling point distance matrix based on the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequence of each audio frequency; the path calculation module is used for calculating fingerprint characteristic distances between music score fingerprint characteristics in the music score fingerprint characteristic sequence and audio fingerprint characteristics of the audio fingerprint characteristic sequence and taking the fingerprint characteristic distances as sampling point distances, and determining a shortest path between the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequence according to the minimum value of the sampling point distances; and the target audio determining module is used for acquiring a plurality of shortest paths between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequences of the audios and determining the target audio matched with the music score to be matched according to the minimum value of the shortest paths.

In an exemplary embodiment of the present disclosure, the apparatus further includes: the measure correspondence determining module is used for determining the measure correspondence between the music score fingerprint feature sequence and the audio fingerprint feature sequence of the target audio according to the shortest path between the audio fingerprint feature sequence of the target audio and the music score fingerprint feature sequence of the music score to be matched.

In an exemplary embodiment of the present disclosure, the apparatus further includes: and the corresponding display control module is used for responding to the playing control operation of the music score to be matched and displaying the music score measure corresponding to the currently played audio measure in a graphical user interface according to the measure corresponding relation between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequence of the target audio.

In an exemplary embodiment of the present disclosure, the music score to be matched includes a music score image to be matched; the device further comprises: the music score image processing module is used for carrying out image recognition on the music score image to be matched and acquiring original music score data; converting the raw score data to obtain a sequence of notes of at least one audio track; converting the note sequence according to the music score speed contained in the music score data to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain music score fingerprint characteristics corresponding to each group; and constructing a music score fingerprint feature sequence corresponding to the preset music score speed based on music score fingerprint features corresponding to each group, and configuring the music score fingerprint feature sequence corresponding to the music score to be matched.

According to an aspect of the present disclosure, there is provided a storage medium having stored thereon a computer program, which when executed by a processor, performs the multimedia data matching method according to any one of the above embodiments.

According to an aspect of the present disclosure, there is provided an electronic device including:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to execute the multimedia data matching method of any one of the above embodiments via execution of the executable instructions.

According to the multimedia data matching method of the embodiment of the disclosure, a corresponding audio fingerprint feature sequence is determined for the current audio to be matched, and the fingerprint similarity between this audio fingerprint feature sequence and the music score fingerprint feature sequence of each music score in the score database is calculated, so that the target music score matching the audio to be matched is selected according to the calculated fingerprint similarity. Fingerprint feature sequences are constructed in advance for the audio and the music score respectively, and the audio and the music score data are matched using these fingerprint feature sequences, so that the corresponding music score data can be accurately matched to the audio to be matched.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

fig. 1 schematically shows a flow chart of a multimedia data matching method according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flow chart of a method of computing a sequence of audio fingerprint features according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a conversion flow diagram for audio conversion into a pitch sequence, according to an embodiment of the present disclosure;

FIG. 4 schematically shows a flow diagram for sampling a pitch sequence according to an embodiment of the present disclosure;

fig. 5 schematically shows a flowchart of a method of calculating a sequence of fingerprint features of a score according to an embodiment of the present disclosure;

FIG. 6 schematically illustrates a flow chart of a method of one-time screening of scores in a score database according to an embodiment of the present disclosure;

FIG. 7 schematically illustrates a schematic diagram of a distance matrix according to an embodiment of the present disclosure;

fig. 8 schematically illustrates a diagram for determining a measure matching mapping relationship between audio and a score according to a distance matrix according to an embodiment of the present disclosure;

fig. 9 schematically shows a flowchart of another multimedia data matching method according to an embodiment of the present disclosure;

fig. 10 schematically shows a block diagram of a multimedia data matching apparatus according to an embodiment of the present disclosure;

fig. 11 schematically shows a block diagram of another multimedia data matching apparatus according to an embodiment of the present disclosure;

FIG. 12 schematically illustrates a block diagram of an electronic device in accordance with the disclosed embodiments; and

fig. 13 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are presented merely to enable those skilled in the art to better understand and to practice the disclosure, and are not intended to limit the scope of the disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one of skill in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to an embodiment of the present disclosure, a multimedia data matching method, a multimedia data matching apparatus, a storage medium, and an electronic device are provided.

In this document, any number of elements in the drawings is intended to be illustrative and not restrictive, and any nomenclature is used for distinction only and not for any restrictive meaning.

The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments thereof.

Summary of The Invention

The inventor has found that electronic music score resources corresponding to music are scarce, and that producing music scores manually requires professionals and is time-consuming and labor-intensive. By contrast, a large number of music score images exist on the network, created either by artists or by music enthusiasts. The original electronic music scores are difficult to find, whether because of the passage of time or because of software format problems, so only the music score images remain; these include ordinary pictures as well as PDF files, book images, and the like. In the prior art, either a music score picture is converted into an electronic music score that is then rendered into audio, or an electronic music score is obtained by audio analysis and then converted into a music score picture; neither approach achieves good results because of technical limitations. If a music score picture can be matched directly with audio, a corresponding or similar audio and music score pair can be found without conversion, which provides a new matching mode and improves matching efficiency.

In view of the above, the basic idea of the present disclosure is: according to the multimedia data matching method and device, the audio data and the music score data are subjected to feature extraction in advance, and the fingerprint features are utilized to construct the corresponding fingerprint feature sequence, so that the similarity calculation can be performed on the fingerprint feature data of the audio and the fingerprint feature data of the music score, and the music score data matched with the audio is determined. Therefore, the corresponding music score data can be matched with the audio data quickly and accurately.

Having described the basic principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.

Exemplary method

A multimedia data matching method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 1.

Referring to fig. 1, the multimedia data matching method may include the steps of:

step S11, obtaining an audio to be matched, and determining an audio fingerprint feature sequence corresponding to the audio to be matched;

step S12, calculating a first fingerprint similarity between the audio fingerprint feature sequence and the music score fingerprint feature sequence of each music score in a score database, and determining a target music score matched with the audio to be matched according to the first fingerprint similarity.

In the multimedia data matching method of the embodiment of the disclosure, a corresponding audio fingerprint feature sequence is determined for the current audio to be matched, and the fingerprint similarity between this audio fingerprint feature sequence and the music score fingerprint feature sequence of each music score in the score database is calculated, so that the target music score matching the audio to be matched is selected according to the calculated fingerprint similarity. Fingerprint feature sequences are constructed in advance for the audio and the music score respectively, and the audio and the music score data are matched using these fingerprint feature sequences, so that the corresponding music score data can be accurately matched to the audio to be matched.

In step S11, an audio to be matched is acquired, and an audio fingerprint feature sequence corresponding to the audio to be matched is determined.

In an exemplary embodiment of the present disclosure, the multimedia data matching method described above may be implemented on the intelligent terminal device side. For example, the intelligent terminal device may be a mobile phone, a tablet computer, a notebook computer, or another intelligent device. Taking a mobile phone as an example, a user may select a piece of audio data from the local storage of the mobile phone and upload it to an application program as the audio to be matched.

In an exemplary embodiment of the present disclosure, the determining an audio fingerprint feature sequence corresponding to the audio to be matched includes:

extracting identification information corresponding to the audio to be matched, and matching the audio with each audio in an audio database according to the identification information, wherein the audio database comprises a plurality of audios and audio fingerprint characteristic sequences corresponding to the audios;

if the identification information is successfully matched with the audio in an audio database, configuring the audio fingerprint characteristic sequence corresponding to the matched audio as the audio fingerprint characteristic sequence corresponding to the audio to be matched; or,

and if the matching is failed according to the identification information, performing feature extraction on the audio to be matched so as to obtain an audio fingerprint feature sequence corresponding to the audio to be matched.

Specifically, the identification information of the audio data may be parameters such as a song title, a singer name, version information, and the like. After the terminal equipment acquires the audio data to be matched uploaded by the user, corresponding identification information can be extracted, and the audio database can be queried by using the identification information. The audio database may include a plurality of audios and feature information corresponding to each audio, that is, an audio fingerprint feature sequence.

If matching audio data is found in the audio database according to the identification information of the audio to be matched, the audio fingerprint feature sequence corresponding to that audio data is taken as the audio fingerprint feature sequence of the audio to be matched. Alternatively, if the query of the audio database according to the identification information returns an empty result, the features of the audio to be matched are identified and extracted, and the corresponding audio fingerprint feature sequence is calculated. For example, an audio fingerprint server may be configured, with the audio database deployed on the audio fingerprint server. The audio to be matched selected by the user on the terminal device can be uploaded to the audio fingerprint server for query and processing.
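The lookup-then-extract branch described above can be sketched as follows. This is a minimal illustration only: the key layout `(song title, singer name, version)`, the in-memory dictionary standing in for the audio database, and the names `resolve_fingerprints` and `extract` are assumptions made for the example and do not come from the disclosure.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types used only for this sketch.
AudioId = Tuple[str, str, str]          # (song title, singer name, version)
FingerprintSequence = List[List[int]]   # one pitch group per fingerprint feature


def resolve_fingerprints(
    audio_id: Optional[AudioId],
    raw_audio: bytes,
    audio_db: Dict[AudioId, FingerprintSequence],
    extract: Callable[[bytes], FingerprintSequence],
) -> FingerprintSequence:
    """Reuse the stored sequence when the identification info matches;
    otherwise fall back to feature extraction on the audio itself."""
    if audio_id is not None and audio_id in audio_db:
        return audio_db[audio_id]
    return extract(raw_audio)
```

In a deployment along the lines of the text, `audio_db` would presumably live on the audio fingerprint server and `extract` would wrap the spectrum conversion and melody extraction steps.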

In an exemplary embodiment of the present disclosure, referring to fig. 2, the determining an audio fingerprint feature sequence corresponding to the audio to be matched includes:

step S21, performing spectrum conversion processing on the audio to be matched to obtain corresponding spectrum data;

step S22, performing note identification on the spectrum data to obtain a corresponding pitch sequence;

step S23, sampling the pitch sequence, and grouping the sampling results to obtain audio fingerprint features corresponding to each group;

step S24, constructing an audio fingerprint feature sequence based on the audio fingerprint features corresponding to each group, and configuring it as the audio fingerprint feature sequence corresponding to the audio to be matched.

Specifically, an audio fingerprint feature is an audio feature used to identify a segment of melody; commonly used features include the amplitude spectrum, the CQT spectrum, features extracted by deep learning, and the like. In the present embodiment, in order to improve the efficiency of matching the music score and the audio, a coarser-grained feature based on the note sequence is used.

Specifically, referring to fig. 3, the audio 301 may first be subjected to a CQT (constant-Q transform) to obtain a frequency spectrum 302, and the pitch sequence 304 may then be extracted using the audio main melody extractor 303. The audio main melody extractor is used to extract the main melody note sequence of the audio, which may be a sequence of pitches at fixed time intervals or a sequence of pitches and durations; the pitches are typically divided into 88 standard pitches, and the durations represent the duration of each pitch. The audio main melody extractor can be trained in advance as a deep learning model, which can be built from a convolutional neural network and a long short-term memory (LSTM) model. For example, to train the audio main melody extractor, a batch of audio data can be prepared in advance as training samples, and the pitch and duration of each audio training sample are labeled; based on this labeled data, the audio is subjected to CQT conversion to obtain corresponding spectrum data; the spectrum data is input into the deep learning model to obtain a predicted pitch sequence, the prediction result is compared with the labeled data to calculate the loss, and the training of the model is completed to obtain the audio main melody extractor.

The melody sequence extracted by the main melody extractor is a pitch sequence with 10 ms as the time unit, and the audio pitch sequence is sampled in order to reduce the length of the sequence used for final matching. Referring to the flow chart shown in FIG. 4, the pitch sequence 401 is sampled and converted into a set of sampling points, denoted N₁; for example, one sample is taken every 5 points, i.e., one sample every 50 ms. The sampling points are then grouped according to a certain window and step size to obtain a plurality of pitch groups 402, for example using the window size shown by the box in FIG. 4; with a window of 100 points and a step size of 50 points, the pitch sequence is divided into a series of pitch groups of length 100, i.e., each pitch group corresponds to a duration of about 5 s. The series of pitch groups is taken as the audio fingerprint feature sequence 403 of the current audio, denoted N₂. In addition, when the audio fingerprint is designed, adjacent audio fingerprint features overlap; for example, if the sampling points are numbered 1 to 1000, every 100 points form one audio fingerprint feature, and the step size is 50, the obtained fingerprint features cover points 1 to 100, 50 to 150, 100 to 200, and so on.
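The sampling and grouping just described can be sketched in a few lines. The function names and the plain-list representation of the pitch sequence are assumptions for illustration; the default values (one sample every 5 frames, window 100, step 50) are the example values given in the text. Note that the sketch uses 0-based indices, so the groups start at samples 0, 50, 100, and so on.

```python
from typing import List


def sample_pitches(pitches: List[int], every: int = 5) -> List[int]:
    """Keep one pitch value out of every `every` frames
    (10 ms frames with every=5 give one sample per 50 ms)."""
    return pitches[::every]


def group_pitches(samples: List[int], window: int = 100, step: int = 50) -> List[List[int]]:
    """Slide a window over the samples; each full window is one
    fingerprint feature, and adjacent windows overlap by window - step."""
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, step)
    ]


# 50 s of audio at 10 ms per frame gives 5000 frames.
frames = [60 + (i // 500) for i in range(5000)]
samples = sample_pitches(frames)       # 1000 sampling points
fingerprints = group_pitches(samples)  # overlapping groups of 100 points
```

With these example values each group spans 100 x 50 ms = 5 s, and a trailing partial window is simply dropped; whether the disclosure pads or drops the tail is not stated, so that choice is an assumption of the sketch.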

For example, after acquiring the audio to be matched, the terminal device may upload the audio to the audio fingerprint server, and calculate an audio fingerprint feature sequence for the audio to be matched by using the audio fingerprint server. The audio fingerprint server may be configured to calculate audio data and obtain a corresponding audio fingerprint feature sequence.

In step S12, a first fingerprint similarity between the audio fingerprint feature sequence and a score fingerprint feature sequence of each score in a score database is calculated, and a target score matching the audio to be matched is determined according to the first fingerprint similarity.

In an exemplary embodiment of the present disclosure, the score database may contain a plurality of music scores and the music score fingerprint feature sequences of those music scores. Specifically, a music score may contain at least one audio track, and each audio track may be configured with at least one music score fingerprint feature sequence at a given music score speed; when a music score contains a plurality of audio tracks, the music score correspondingly contains a plurality of music score fingerprint feature sequences. That is, each music score may be configured with a plurality of music score fingerprint feature sequences for different audio tracks and different music score speeds.
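One possible in-memory layout for such a score database entry is sketched below. The class and field names (`Track`, `Score`, `sequences_by_bpm`) are hypothetical and chosen only to mirror the track-by-speed organization described above; the disclosure does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

FingerprintSequence = List[List[int]]  # one pitch group per fingerprint feature


@dataclass
class Track:
    name: str
    # Assumed layout: one fingerprint feature sequence per candidate speed (bpm).
    sequences_by_bpm: Dict[int, FingerprintSequence] = field(default_factory=dict)


@dataclass
class Score:
    score_id: str
    tracks: List[Track] = field(default_factory=list)

    def all_sequences(self) -> List[Tuple[str, int, FingerprintSequence]]:
        """Flatten to (track name, bpm, sequence) triples for matching."""
        return [
            (track.name, bpm, seq)
            for track in self.tracks
            for bpm, seq in track.sequences_by_bpm.items()
        ]


score = Score(
    "demo-score",
    [
        Track("melody", {90: [[60, 62]], 120: [[60, 62]]}),
        Track("bass", {90: [[36, 38]], 120: [[36, 38]]}),
    ],
)
```

A score with two tracks and two candidate speeds therefore contributes four sequences to the matching step.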

In an exemplary embodiment of the present disclosure, the above step S12 may include: and calculating fingerprint characteristic distances between the music score fingerprint characteristic sequences corresponding to the music scores in the music score database and the audio fingerprint characteristic sequences corresponding to the audio to be matched, and determining a target music score matched with the audio to be matched according to the fingerprint characteristic distances.

Specifically, when music score data is retrieved, fingerprint feature distances can be calculated between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequences of the music scores in the score database, so that measure matching is performed. Specifically, the score database can be traversed, one music score being selected from the database each time, and the fingerprint feature distance is calculated between each music score fingerprint feature sequence of the currently selected music score at each speed and the audio fingerprint feature sequence of the audio to be matched; according to the calculation results, the music score corresponding to the music score fingerprint feature sequence with the shortest distance is selected as the target music score.

Specifically, the step S12 may include:

step S121, respectively constructing sampling point distance matrixes based on the audio fingerprint characteristic sequences and the music score fingerprint characteristic sequences of the music scores;

step S122, calculating fingerprint characteristic distances between each audio fingerprint characteristic in the audio fingerprint characteristic sequence and each music score fingerprint characteristic in the music score fingerprint characteristic sequence, taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequence according to the minimum value of the sampling point distances;

step S123, obtaining a plurality of shortest paths between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequences of the music scores, and determining the target music score matched with the audio to be matched according to the minimum value of the shortest paths.

Specifically, in each calculation a candidate music score can be selected, and the music score fingerprint features of the candidate music score and the audio fingerprint features of the audio to be matched are obtained. Since the speed of the music score fingerprint features is an approximation selected from the bpm candidate set, there is a speed deviation between the two, and further alignment is required. For example, the fingerprint features may be aligned using a DTW (Dynamic Time Warping) algorithm. Specifically, the audio fingerprint feature of the audio to be matched is denoted as Q = (q₁, q₂, q₃, ..., qₙ), and the music score fingerprint feature of the candidate music score is denoted as K = (k₁, k₂, k₃, ..., kₙ). The absolute value of the difference between each point in Q and each point in K is calculated and taken as the distance between the two points, which yields a distance matrix M. Then, a recursive algorithm is used to calculate the length Lmin(i, j) of the shortest path from the lower left corner (1, 1) of the matrix to any point (i, j); the calculation formula may include:

The shortest distance matrix L is calculated using the formula, and the shortest path from the lower left corner to the upper right corner of the matrix is then derived; the points passed on the path are the matched points of the music score fingerprint and the audio fingerprint. For example, referring to the distance matrix M shown in fig. 7, the horizontal axis represents the point values in the audio fingerprint feature, the vertical axis represents the point values in the music score fingerprint feature, the values in the table represent the distances between pairs of points, the continuous line represents the shortest path obtained after recursive computation, and the points marked by arrows give the pairwise correspondence between the music score fingerprint feature and the audio fingerprint feature, such as (q₁, k₁) and (q₂, k₁). For the candidate music scores, the fingerprint features of each music score can be matched with the audio fingerprint features of the current audio to be matched to obtain the shortest path distances between the fingerprint feature sequence of each music score and the audio fingerprint feature sequence, and the average value of all shortest path distances is calculated as the similarity between the candidate music score and the audio to be matched. The smaller the distance, the higher the similarity; finally, the candidate music score with the highest similarity is retained as the final matching result.
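The alignment described above can be sketched with a textbook DTW implementation. The recurrence used here, Lmin(i, j) = M(i, j) + min(Lmin(i-1, j), Lmin(i, j-1), Lmin(i-1, j-1)), is the standard DTW form and is assumed rather than taken from the disclosure; the function names `dtw` and `best_score` are likewise hypothetical, and indices are 0-based.

```python
from typing import Dict, List, Tuple


def dtw(q: List[float], k: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the shortest path length and the matched (i, j) index pairs."""
    n, m = len(q), len(k)
    inf = float("inf")
    # cost[i][j]: length of the shortest path from (0, 0) to (i, j).
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(q[i] - k[j])  # entry M(i, j) of the distance matrix
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = d + best
    # Backtrack from the far corner to (0, 0) to recover the matched points.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            steps.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path


def best_score(distances_by_score: Dict[str, List[float]]) -> str:
    """Pick the candidate whose average shortest path distance is smallest."""
    return min(
        distances_by_score,
        key=lambda s: sum(distances_by_score[s]) / len(distances_by_score[s]),
    )


distance, path = dtw([1, 2, 3], [1, 1, 2, 3])
```

Here the audio feature `[1, 2, 3]` aligns with the slower score feature `[1, 1, 2, 3]` at zero cost, with the first audio point matched to two score points, which is the kind of speed deviation the alignment is meant to absorb.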

In some exemplary embodiments of the present disclosure, the music scores in the score database may first undergo a primary screening, and the calculation is then performed between the screened scores and the audio to be matched, so that the target score matching the audio to be matched is searched for only among the screened scores. That is, the score database is screened once before the measure matching process described above is performed. Screening the score database improves the retrieval efficiency of the scores.

Specifically, referring to fig. 6, the primary screening of the scores in the score database may include:

step S61, generating an audio fingerprint feature sequence to be detected according to the audio fingerprint feature sequence corresponding to the audio to be matched;

step S62, respectively calculating second fingerprint similarity of the audio fingerprint feature sequence to be detected and the music score fingerprint feature sequences corresponding to the music scores;

step S63, determining a first music score candidate set in the score database according to the second fingerprint similarity; wherein the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set is used for calculating the fingerprint feature distance to the audio fingerprint feature sequence corresponding to the audio to be matched.

In an exemplary embodiment of the present disclosure, specifically, the step S61 described above may include:

step S611, determining the variance of each audio fingerprint feature in the audio fingerprint feature sequence, and selecting m audio fingerprint features with the largest variance;

step S612, determining the similarity among the m audio fingerprint features, reserving n audio fingerprint features with the minimum similarity, and determining the n audio fingerprint features as the to-be-detected audio fingerprint feature sequence; wherein m and n are both positive integers, and m is greater than n.

In particular, for the audio fingerprint feature sequence N₂ of the audio to be matched, the variance of each audio fingerprint feature in the sequence is calculated, and the m audio fingerprint features with the largest variance are selected. The larger the variance of an audio fingerprint feature, the more distinctive the feature is, and the more easily the correct result is matched. The calculation formula of the variance may include:

$$\sigma^2 = \frac{1}{M} \sum_{i=1}^{M} (x_i - \mu)^2$$

where M represents the fingerprint feature sequence length, for example 100 in the above embodiment; σ² represents the variance; μ represents the mean of the fingerprint feature sequence; and xᵢ represents the i-th value in the fingerprint feature sequence.

The fingerprint similarity between the m audio fingerprint features selected by variance is then calculated pairwise. Audio fingerprint features that are close to one another are eliminated, which avoids matching repeatedly with similar fingerprint features, and n audio fingerprint features are finally retained to construct the to-be-detected audio fingerprint feature sequence used for retrieval. Alternatively, if the audio fingerprint feature sequence contains fewer than n audio fingerprint features, all of them are used to construct the to-be-detected audio fingerprint feature sequence.
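
A minimal sketch of this selection is given below, for illustration only. The disclosure does not state how the n least similar features are chosen among the m candidates; the greedy farthest-point rule used here, the function names, and the use of Euclidean distance for the pairwise comparison are all assumptions of the sketch.

```python
import math
from statistics import pvariance


def euclidean(a, b):
    """Euclidean distance between two equal-length fingerprint features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_query_fingerprints(fingerprints, m, n):
    """Pick up to n distinctive, mutually dissimilar fingerprint features.

    Hypothetical helper: keeps the m features with the largest variance,
    then greedily retains the ones farthest from those already chosen.
    """
    if len(fingerprints) <= n:
        # Too few features: use all of them, as the text describes.
        return list(fingerprints)

    # The m features with the largest (population) variance.
    top = sorted(fingerprints, key=pvariance, reverse=True)[:m]

    chosen = [top[0]]
    rest = top[1:]
    while len(chosen) < n and rest:
        # Assumption: "least similar" means largest distance to the
        # nearest feature that has already been retained.
        best = max(rest, key=lambda f: min(euclidean(f, c) for c in chosen))
        chosen.append(best)
        rest.remove(best)
    return chosen
```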

In an exemplary embodiment of the present disclosure, in step S62, each audio fingerprint feature in the to-be-detected audio fingerprint feature sequence may be compared with each score fingerprint feature in the score fingerprint feature sequence corresponding to each score in the score database, and the similarity between the fingerprint features is calculated as the second fingerprint similarity. Specifically, the similarity may be calculated using the Euclidean distance formula, where a smaller value indicates a higher similarity; the formula may include:

$$d = \sqrt{\sum_{i=1}^{M} (x_i - y_i)^2}$$

where xᵢ represents the i-th value of an audio fingerprint feature in the to-be-detected audio fingerprint feature sequence, yᵢ represents the i-th value of a score fingerprint feature in a score fingerprint feature sequence in the score database, M represents the length of the fingerprint feature sequence, and d represents the fingerprint similarity.

In an exemplary embodiment of the present disclosure, specifically, in step S63 above, the determining a first score candidate set in the score database according to the second fingerprint similarity includes:

step S631, respectively calculating k score fingerprint features in the score fingerprint feature sequence with the highest similarity to each audio fingerprint feature in the n audio fingerprint features, and obtaining n × k score fingerprint features as a first screening result; wherein k is a positive integer;

step S632, screening the music score fingerprint characteristics in the first screening result according to a preset first similarity threshold value to obtain a second screening result;

step S633, determining a score according to the second screening result, and constructing the first score candidate set.

Specifically, according to the similarity calculation result obtained in the above step, for each of the n audio fingerprint features in the to-be-detected audio fingerprint feature sequence, the k score fingerprint features with the highest similarity can be screened out, so that n × k score fingerprint features are retrieved as the first screening result. Then, a first similarity threshold T₁ may be used to filter the first screening result: the score fingerprint features whose similarity value is smaller than T₁ are eliminated, leaving the remaining candidate score fingerprint features. The scores corresponding to these score fingerprint features form the first score candidate set, and the set of all score fingerprint features of the screened scores is denoted as N₃(i, j).
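
The primary screening can be sketched as follows, for illustration only. The dictionary layout of the score database, the function names and the parameter max_distance are assumptions of the sketch. Because the disclosure measures similarity by Euclidean distance, the sketch treats the threshold as an upper bound on distance; this reading of the threshold is also an assumption.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length fingerprint features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def primary_screen(query_fps, score_db, k, max_distance):
    """Return the ids of scores that survive the primary screening.

    score_db maps a score id to its list of score fingerprint features
    (hypothetical layout). For each query feature the k nearest score
    features are retrieved, then hits that are too far away are dropped.
    """
    candidates = set()
    for q in query_fps:
        hits = [
            (euclidean(q, fp), score_id)
            for score_id, fps in score_db.items()
            for fp in fps
        ]
        # k most similar score features for this query feature.
        for dist, score_id in heapq.nsmallest(k, hits):
            # Assumption: the threshold is an upper bound on distance.
            if dist <= max_distance:
                candidates.add(score_id)
    return candidates
```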

In some exemplary embodiments of the present disclosure, after the first score candidate set is obtained, it may be subjected to a secondary screening, so that the target score is determined by calculation between the twice-screened scores and the audio to be matched. That is, the score database may be screened twice, and the measure matching process described above is performed between the candidate scores obtained by the screening and the audio to be matched. The secondary screening further narrows the retrieval range and further improves the retrieval efficiency.

Specifically, after determining the first score candidate set in the score database according to the second fingerprint similarity, the method may further include:

step S64, respectively calculating third fingerprint similarity of the audio fingerprint feature sequence corresponding to the audio to be matched and the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set;

step S65, screening the music scores in the first music score candidate set according to the third fingerprint similarity, and determining a second music score candidate set; and the music score fingerprint feature sequence corresponding to the music score in the second music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

In an exemplary embodiment of the present disclosure, the step S64 described above may include: and respectively calculating the fingerprint similarity of each audio fingerprint feature in the audio fingerprint feature sequence corresponding to the audio to be matched and each music score fingerprint feature in the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set to obtain the third fingerprint similarity.

Specifically, for the audio to be matched, the audio fingerprint feature set N₂ corresponding to the audio fingerprint feature sequence comprises n audio fingerprint features. Each time, the score fingerprint features of one score are selected from the first score candidate set for similarity calculation. Assuming that the candidate score corresponds to m score fingerprint features, the similarity between the n audio fingerprint features and the m score fingerprint features is calculated pairwise in turn to obtain the third fingerprint similarity.

In an exemplary embodiment of the present disclosure, in the step S65, specifically, the method may include:

step S651, determining score fingerprint features with highest similarity to the audio fingerprint features according to the third fingerprint similarity as a third screening result;

step S652, screening each score fingerprint feature in the third screening result according to a preset second similarity threshold, and constructing the second score candidate set according to a score corresponding to the screened score fingerprint feature.

Specifically, according to the third fingerprint similarity calculated in the above steps, the score fingerprint feature that best matches each audio fingerprint feature, together with the corresponding similarity value, can be obtained. These similarity values may be filtered using a second similarity threshold T₂: if any score fingerprint feature has a similarity value smaller than T₂, the matching is considered to have failed. The scores in the first score candidate set are traversed, and the scores for which similarity matching fails under the second similarity threshold T₂ are screened out and deleted, so as to obtain the second score candidate set. At the same time, for each score in the second score candidate set, the score fingerprint feature matched to each audio fingerprint feature is recorded.
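
For illustration only, the secondary screening can be sketched as below. The function names, the dictionary layout of the candidate set and the parameter max_distance are assumptions of the sketch; as in the Euclidean distance measure described earlier, a smaller value is taken to mean a better match, so the threshold is applied as an upper bound on distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length fingerprint features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def secondary_screen(audio_fps, candidate_scores, max_distance):
    """Keep only scores in which every audio feature finds a close match.

    candidate_scores maps a score id to its non-empty list of score
    fingerprint features (hypothetical layout). Returns a dict mapping
    each retained score id to a list of
    (audio feature index, best matching score feature index) pairs.
    """
    kept = {}
    for score_id, score_fps in candidate_scores.items():
        matches = []
        for a_idx, a in enumerate(audio_fps):
            dist, s_idx = min(
                (euclidean(a, s), s_idx) for s_idx, s in enumerate(score_fps)
            )
            if dist > max_distance:
                break  # one failed match removes the whole score
            matches.append((a_idx, s_idx))
        else:
            # Reached only when no audio feature failed to match.
            kept[score_id] = matches
    return kept
```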

By screening the score database twice with the above method and then performing matching only on the scores in the resulting second score candidate set, the retrieval efficiency of the scores can be greatly improved, which facilitates fast matching of audio and scores.

In an exemplary embodiment of the present disclosure, the method further includes: determining the measure correspondence between the audio fingerprint feature sequence and the score fingerprint feature sequence of the target score according to the shortest path between the score fingerprint feature sequence of the target score and the audio fingerprint feature sequence of the audio to be matched.

Specifically, after the target score corresponding to the audio to be matched is determined, the measure matching result is determined from the above calculation process. Each point in the score fingerprint features can be traced back to a specific measure on the corresponding score image. Further, according to the measure matching result and referring to fig. 8, the position on the score picture corresponding to each measure of the audio clip can be obtained, and the audio clip is then divided into different time segments according to the division of the score measures. In addition, because adjacent audio fingerprint features overlap when the audio fingerprint feature sequence is constructed, the parts corresponding to the incomplete measures at the two ends of an audio fingerprint feature can be directly omitted when the audio fingerprint features are matched.
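
One way to turn a shortest path into per-measure audio time segments is sketched below, for illustration only. The function name, the list score_measures that maps each score sample point to a measure index, and the assumption that each audio sample point covers 50 ms are all introduced for this sketch and are not taken from the disclosure.

```python
def measure_time_segments(path, score_measures, sample_ms=50):
    """Map each score measure to an audio time span in milliseconds.

    path is a list of (audio sample index, score sample index) pairs in
    increasing order, such as a DTW shortest path. score_measures[s] is
    the measure index of score sample s (hypothetical lookup table).
    """
    segments = {}
    for audio_idx, score_idx in path:
        measure = score_measures[score_idx]
        start = audio_idx * sample_ms
        end = (audio_idx + 1) * sample_ms
        if measure not in segments:
            # First audio sample matched to this measure.
            segments[measure] = [start, end]
        else:
            # Extend the span to cover later matched samples.
            segments[measure][1] = max(segments[measure][1], end)
    return {m: (s, e) for m, (s, e) in segments.items()}
```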

Based on the above, in an exemplary embodiment of the present disclosure, the method may further include:

and responding to the playing control operation of the audio to be matched, and displaying the music score measure corresponding to the currently played audio measure in a graphical user interface according to the measure corresponding relation between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequence of the target music score.

For example, after the target score corresponding to the current audio to be matched and the mapping relation between the fingerprint features are determined, when the user plays the audio on the terminal device, the corresponding target score can be synchronously displayed in the interactive interface of the terminal device. Based on the identified measure correspondence, the score measure corresponding to the currently played audio measure can be specially marked when displayed. For example, the current score measure may be distinguished by a different color or highlighted.
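
As a hedged illustration of the lookup behind such a display, the sketch below finds the measure to highlight for a given playback position. The function name and the list measure_starts, holding the start time in seconds of each audio measure in ascending order, are assumptions of the sketch.

```python
from bisect import bisect_right


def current_measure(measure_starts, position):
    """Return the index of the measure playing at `position` seconds.

    measure_starts must be sorted in ascending order (hypothetical data
    derived from the measure correspondence). Returns None when the
    position lies before the first measure.
    """
    idx = bisect_right(measure_starts, position) - 1
    return idx if idx >= 0 else None
```

A player would call this on each progress update and highlight the returned measure on the score image.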

In an exemplary embodiment of the present disclosure, each score in the score database may be calculated in advance, and each corresponding score fingerprint feature data may be obtained. Specifically, referring to fig. 5, the method for calculating the score fingerprint feature sequence may include:

step S51, identifying each music score in the music score database, and acquiring a note sequence of at least one corresponding music track;

step S52, converting the note sequence according to a preset music score speed to obtain a corresponding pitch sequence;

step S53, sampling the pitch sequence, and grouping the sampling results to obtain the music score fingerprint characteristics corresponding to each group;

step S54, constructing a score fingerprint feature sequence corresponding to the preset score speed based on score fingerprint features corresponding to each group, and configuring the score fingerprint feature sequence corresponding to the score.

Specifically, a score fingerprint server can be configured and utilized to process score data. For example, the score data may be score data in an image format. Score fingerprints may be used to identify audio features of a segment of a melody in a score. The score image may be in the form of a staff image, a guitar image, a numbered musical notation image, or the like.

Specifically, the score image can be analyzed and recognized by a score parser to obtain the score information corresponding to the score image. The score information may include the clef, time signature, tempo marking, note pitch, note duration, bar lines, jump marks and the like. The score parser may be a score image parsing model implemented based on optical music recognition (OMR) technology. In this embodiment, the score parser may include a preprocessing module and a recognition model, wherein the preprocessing module is used to segment the staff line positions and the measures of the score image, and the recognition model may be a Transformer-based deep learning model. In the training stage, a batch of score pictures segmented by line is prepared, the pictures are annotated with symbol category labels such as clef, time signature, bar line and note, and the labeled data are used to train the recognition model. When the score recognition model is used, a score picture is input, the preprocessing module completes the structural analysis of the score picture and crops out each line of the score, and the recognition model recognizes the symbol sequence on each line. Of course, in other exemplary embodiments of the present disclosure, the score image may also be recognized by other existing score recognition models or systems to obtain the note sequence of each track.

The durations of the notes in a score are relative; for example, the duration of a quarter note is 1/4 of the duration of a whole note, but the actual playing time must be calculated from the tempo (speed) marking of the score. Common tempo markings are text markings and numeric markings. A text marking is typically one or more Italian words such as Allegro, Andante or Moderato; a numeric marking is typically written in the form "note = number", indicating how many such notes are played per minute. However, in the present disclosure the tempo marking recognized on the score is not treated as reliable. On one hand, a text marking usually denotes only a small range of values rather than an exact value, the recognition algorithm may make errors, and some scores carry no explicit tempo marking; on the other hand, the tempo of the actual audio performance may vary. Therefore, a group of different bpm values is configured for each score in advance to simulate the score at different tempos, and a group of fingerprints at different tempos is extracted.

Specifically, a bpm candidate set is configured empirically, in which every candidate takes the quarter note as the time unit: {40, 60, 80, 100, 120, 140, 160, 180, 200}. The candidates represent, in order, 40 quarter notes per minute, 60 quarter notes per minute, and so on. Each track recognized from each score image is processed separately. Specifically, the note sequence of the j-th track is taken, and the i-th tempo in the bpm candidate set is taken as Bi. Converted to a time scale, the duration of each quarter note is t = 60/Bi seconds, so that a whole note lasts 4 × t seconds, a half note lasts 2 × t seconds, and an eighth note lasts t/2 seconds; this yields a group of notes with their durations. Further, the note sequence converted to time units is sampled every 50 ms to obtain a pitch sequence Nₛ(i, j) with a fixed time interval, which represents the pitch sequence of the j-th track of the score image played at tempo Bi. In the same manner as for the audio fingerprint, the pitch sequence Nₛ(i, j) is divided with a window length of 100 and a step length of 50 into groups of 100 points each, and the resulting groups are recorded as the fingerprint of the j-th track of the score at tempo Bi, denoted N₃(i, j).
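
The conversion described above can be sketched as follows, for illustration only. The function names, the representation of a note as a (pitch, length in quarter notes) pair and the use of numeric pitch values are assumptions of the sketch; the 50 ms sampling period, the window length of 100, the step length of 50 and the bpm candidates are taken from the text.

```python
BPM_CANDIDATES = (40, 60, 80, 100, 120, 140, 160, 180, 200)


def pitch_sequence(notes, bpm, sample_ms=50):
    """Sample a note sequence every 50 ms at the given tempo.

    notes is a list of (pitch, length in quarter notes) pairs, so a whole
    note has length 4 and an eighth note has length 0.5 (hypothetical
    representation).
    """
    quarter_ms = 60_000 / bpm  # t = 60 / Bi seconds per quarter note
    samples = []
    end_ms = 0.0
    for pitch, quarters in notes:
        end_ms += quarters * quarter_ms
        # Emit one sample for every 50 ms tick that falls inside the note.
        while len(samples) * sample_ms < end_ms - 1e-6:
            samples.append(pitch)
    return samples


def build_fingerprints(samples, window=100, step=50):
    """Group the pitch sequence into overlapping windows of 100 points."""
    return [
        samples[i:i + window]
        for i in range(0, len(samples) - window + 1, step)
    ]


def fingerprints_by_tempo(notes):
    """Fingerprints of one track at every candidate tempo."""
    return {
        bpm: build_fingerprints(pitch_sequence(notes, bpm))
        for bpm in BPM_CANDIDATES
    }
```

For example, at 120 bpm a quarter note lasts 0.5 seconds, so a whole note contributes 40 samples, a half note 20 samples and an eighth note 5 samples.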

Based on the above, for a musical score, the note sequence of at least one corresponding track can be recognized; for the note sequence of a track, a corresponding pitch sequence at each score speed can be generated according to the plurality of pre-configured score speeds. For the score database, each score can be processed separately with the above method to extract its score fingerprints and construct the corresponding score fingerprint feature sequence.

In addition, in an exemplary embodiment of the present disclosure, there is also provided a multimedia data matching method for matching corresponding audio data using score data. As described with reference to fig. 9, the multimedia data matching method may include:

step S91, obtaining a music score to be matched, and determining a music score fingerprint feature sequence corresponding to the music score to be matched;

step S92, calculating fourth fingerprint similarity of the music score fingerprint feature sequence and the audio fingerprint feature sequences of the audios in an audio database, and determining a target audio matched with the music score to be matched according to the fourth fingerprint similarity.

In an exemplary embodiment of the present disclosure, step S91 may specifically include:

step S911, carrying out image recognition on the music score image to be matched and acquiring original music score data;

step S912, converting the original score data to obtain a note sequence of at least one music track;

step S913, converting the note sequence according to the score speed contained in the score data to obtain a corresponding pitch sequence;

step S914, sampling the pitch sequence, and grouping the sampling results to obtain the music score fingerprint characteristics corresponding to each group;

step S915, constructing a music score fingerprint feature sequence corresponding to the preset music score speed based on the music score fingerprint features corresponding to each group, and configuring the music score fingerprint feature sequence corresponding to the music score to be matched.

Specifically, the score to be matched may include a score image to be matched. For example, the user may upload a music score image as a music score to be matched on the terminal device. For example, the score image may be a staff image, a numbered musical notation image, a guitar image, or other musical instrument score image, and so forth. After the terminal equipment acquires the music score image uploaded by the user, the music score image to be matched can be uploaded to a music score fingerprint server, and the music score fingerprint server performs identification and calculation to acquire corresponding music score fingerprint characteristics. Alternatively, in some exemplary embodiments, after the terminal device acquires the image of the music score to be matched, it may also perform recognition and calculation locally on the terminal device.

Specifically, for a music score image to be matched, music score image recognition may be performed first to obtain the corresponding original music score data. The original score data may include information such as the clef, time signature, tempo marking, note pitch, note duration, bar lines and jump marks contained in the image. For example, the score parser in the above embodiments may be used to recognize the score image and calculate the corresponding score fingerprint feature sequence.

In addition, for the score to be matched, if it contains only one track, the score fingerprint sequence corresponding to that track can be obtained; if it contains multiple tracks, a score fingerprint sequence can be obtained for each track. Furthermore, when the original score data is recognized to contain a tempo marking, that is, a score speed is marked in the score image, only the score fingerprint feature sequence at that score speed needs to be calculated for each track. Alternatively, if no tempo marking is recognized in the original score data, the preset bpm candidate set can be used to calculate the score fingerprint feature sequence of each track at each candidate score speed, so that matching can be performed with multiple score fingerprint feature sequences.

Alternatively, in some exemplary embodiments, when performing the music score image recognition, a configuration page of the music score speed may also be displayed in the graphical user interface for user-defined music score speed. For example, when a score speed mark is identified to be contained in a score image, the identified score speed can be displayed in an interactive interface through a floating window, after confirmation of a user, a subsequent calculation process is executed, and only a score fingerprint feature sequence corresponding to the score speed is calculated. Or when the score speed mark in the score image is not recognized, prompt information can be displayed through the floating window, and a user can customize the score speed and confirm in the interactive interface. After receiving the music score speed defined by the user, the terminal equipment can send the music score speed defined by the user to the music score fingerprint server, or locally calculate the music score fingerprint characteristic sequence according to the defined music score speed. Therefore, the number of the music score fingerprint feature sequences corresponding to the music scores to be matched can be reduced, the calculated amount is further reduced, and the target audio can be matched more accurately and more quickly.

In an exemplary embodiment of the present disclosure, the above step S92 may include: and calculating the fingerprint characteristic distance between the audio fingerprint characteristic sequence corresponding to each audio in the audio database and the music score fingerprint characteristic sequence corresponding to the music score to be matched, and determining a target audio matched with the music score to be matched according to the fingerprint characteristic distance.

Specifically, after the score fingerprint feature sequence corresponding to the score to be matched is determined, the fingerprint feature distance between the score fingerprint feature sequence and the audio fingerprint feature sequence corresponding to each audio in the audio database can be calculated, so that the target audio matched with the score fingerprint feature sequence can be screened.

In an exemplary embodiment of the present disclosure, the calculating a fingerprint feature distance between an audio fingerprint feature sequence corresponding to each audio in the audio database and a score fingerprint feature sequence corresponding to the score to be matched, and determining a target audio matched with the score to be matched according to the fingerprint feature distance may specifically include:

step S9211, respectively constructing sampling point distance matrixes based on the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequences of the audios;

step S9212, calculating fingerprint characteristic distances between the music score fingerprint characteristics in the music score fingerprint characteristic sequence and the audio fingerprint characteristics of the audio fingerprint characteristic sequence, taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequence according to the minimum value of the sampling point distances;

step S9213, a plurality of shortest paths between the music score fingerprint characteristic sequence of the music score to be matched and the audio fingerprint characteristic sequences of the audios are obtained, and the target audio matched with the music score to be matched is determined according to the minimum value of the shortest paths.

Specifically, in the same manner as in the above embodiment, a distance matrix as shown in fig. 8 may be constructed for the score fingerprint feature sequence corresponding to the score to be matched and the audio fingerprint feature sequence corresponding to each audio in the audio database: the absolute value of the difference between each point in the score fingerprint feature and each point in the audio fingerprint feature is calculated as the distance between the two points to obtain the distance matrix. Then, the length of the shortest path from the lower left corner (1, 1) of the matrix to any point (i, j) is calculated by a recursive algorithm. The points through which the resulting shortest path passes mark the correspondence between the score fingerprint and the audio fingerprint. Each audio in the audio database is calculated in turn to obtain the shortest path between the score fingerprint feature sequence and each audio fingerprint feature sequence, the average value of the shortest path distances is calculated as the similarity between the score and the audio, and finally the audio with the highest similarity is retained as the final matching result. While the final matching result is determined, the correspondence between the score measures and the audio measures is obtained, thereby realizing measure matching.

In some exemplary embodiments of the present disclosure, before measure matching is performed, the audios in the audio database may also undergo a primary screening in order to improve matching efficiency. Specifically, the method may further include:

step S101, generating a music score fingerprint feature sequence to be detected according to the music score fingerprint feature sequence corresponding to the music score to be matched;

step S102, respectively calculating fifth fingerprint similarity of the music score fingerprint feature sequence to be detected and the audio fingerprint feature sequence corresponding to each audio;

step S103, determining a first audio candidate set in the audio database according to the fifth fingerprint similarity; and the audio fingerprint feature sequence corresponding to the audio in the first audio candidate set is used for calculating the fingerprint feature distance of the music score fingerprint feature sequence corresponding to the music score to be matched.

In an exemplary embodiment of the present disclosure, the above step S101 may include:

step S1011, determining the variance of each music score fingerprint feature in the music score fingerprint feature sequence, and selecting m music score fingerprint features with the maximum variance;

step S1012, determining the similarity between the m score fingerprint features, reserving n score fingerprint features with the minimum similarity, and determining the n score fingerprint features as the sequence of score fingerprint features to be detected; wherein m and n are both positive integers, and m is greater than n.

Specifically, for at least one score fingerprint feature sequence obtained for the score to be matched, the variance of each score fingerprint feature can be calculated, and the m score fingerprint features with the largest variance are selected. The similarity between these m score fingerprint features is then calculated pairwise, and the n score fingerprint features with the lowest mutual similarity are retained to construct the to-be-detected score fingerprint feature sequence. Alternatively, if the score fingerprint feature sequence contains fewer than n score fingerprint features, all of them are used to construct the to-be-detected score fingerprint feature sequence.

Similar to the above embodiment, after the to-be-detected score fingerprint feature sequence is obtained, the fingerprint similarity between this sequence and the audio fingerprint feature sequence of each audio in the audio database can be calculated.

In an exemplary embodiment of the present disclosure, the step S103 may include:

step S1031, respectively calculating k audio fingerprint features with highest similarity to each music score fingerprint feature in the n music score fingerprint features in the audio fingerprint feature sequence to obtain n x k audio fingerprint features as a fourth screening result; wherein k is a positive integer;

step S1032, screening the audio fingerprint features in the fourth screening result according to a preset third similarity threshold to obtain a fifth screening result;

step S1033, determining an audio according to the fifth screening result, and constructing the first audio candidate set.

Specifically, after the music score fingerprint feature sequence to be detected is obtained, the fingerprint similarity between the n music score fingerprint features contained in the music score fingerprint feature sequence and the sound fingerprint features corresponding to the audios in the audio database can be respectively calculated, so that k audio fingerprint features with the highest similarity are respectively screened for each music score fingerprint in the music score fingerprint feature sequence to be detected, and the screened audio fingerprint features serve as a fourth screening result. And screening the fingerprint similarity values corresponding to the audio fingerprint features in the fourth screening result by using the third similarity threshold, and constructing a first audio candidate set by using the audio corresponding to the audio fingerprint obtained by screening. And then, a calculation process of performing summary matching between the candidate audio in the first audio candidate set and the music score to be matched is utilized, so that the target audio is determined in the first audio candidate set.

In some exemplary embodiments of the present disclosure, after the primary screening of the audio database, a secondary screening may also be performed on the result of the primary screening. Specifically, the method may further include:

step S104, calculating sixth fingerprint similarity of the music score fingerprint feature sequence corresponding to the music score to be matched and the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set respectively;

step S105, screening each audio in the first audio candidate set according to the sixth fingerprint similarity, and determining a second audio candidate set; and the audio fingerprint feature sequence corresponding to the audio in the second audio candidate set is used for calculating the fingerprint feature distance of the music score fingerprint feature sequence corresponding to the music score to be matched.

In an exemplary embodiment of the present disclosure, the step S104 may include: and respectively calculating the fingerprint similarity of each music score fingerprint feature in the music score fingerprint feature sequence corresponding to the music score to be matched and each audio fingerprint feature in the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set to obtain the sixth fingerprint similarity.

Specifically, the music score fingerprint features contained in the music score fingerprint feature sequence corresponding to the music score to be matched can be collected into a music score fingerprint set. Similarity calculation can then be performed between each music score fingerprint feature in the set and each audio fingerprint feature in the audio fingerprint feature sequence corresponding to each candidate audio in the first audio candidate set, and the similarity calculation result between each music score fingerprint feature and each audio fingerprint feature is used as the sixth fingerprint similarity.

In an exemplary embodiment of the present disclosure, the step S105 described above may include:

step S1051, determining audio fingerprint characteristics with highest similarity to the fingerprint characteristics of the music score according to the sixth fingerprint similarity as a sixth screening result;

step S1052, screening each audio fingerprint feature in the sixth screening result according to a preset fourth similarity threshold, and constructing the second audio candidate set according to the audio corresponding to the screened audio fingerprint features.

Specifically, according to the sixth fingerprint similarity, the audio fingerprint feature with the highest similarity to each music score fingerprint feature, together with the corresponding similarity value, can be obtained as the sixth screening result. Audio fingerprint features whose similarity value is lower than the fourth similarity threshold are screened out, and the second audio candidate set is constructed from the audios corresponding to the remaining audio fingerprint features. After the second audio candidate set is obtained, matching can be performed within it to determine the target audio.
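As a non-limiting illustration, the secondary screening of steps S104 and S105 may be sketched as follows. Cosine similarity is again only an assumed similarity measure, and the names `second_audio_candidate_set`, `candidate_audios` and `threshold` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Set, Tuple

Fingerprint = Sequence[float]


def cosine_similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Cosine similarity; an assumed stand-in for the fingerprint similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def second_audio_candidate_set(
    score_fps: List[Fingerprint],
    candidate_audios: Dict[str, List[Fingerprint]],
    threshold: float,
) -> Set[str]:
    """Keep the audios that own a best match at or above the threshold."""
    kept: Set[str] = set()
    for score_fp in score_fps:
        # S104 and S1051: compare with every audio fingerprint of every
        # candidate and keep only the single most similar one.
        best: Tuple[float, Optional[str]] = max(
            (
                (cosine_similarity(score_fp, audio_fp), audio_id)
                for audio_id, fps in candidate_audios.items()
                for audio_fp in fps
            ),
            default=(float("-inf"), None),
        )
        best_sim, best_audio = best
        # S1052: discard best matches below the fourth similarity threshold.
        if best_audio is not None and best_sim >= threshold:
            kept.add(best_audio)
    return kept
```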

Specifically, after the target audio and the measure correspondence between the music score to be matched and the target audio are determined, when the user plays the audio on the terminal device, the music score measure corresponding to the currently played audio measure can be highlighted according to the measure correspondence, so that the music score accompanies the audio during playback.
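As a non-limiting illustration, the lookup used for highlighting may be sketched as follows. The sketch assumes that the alignment is available as a list of (audio fingerprint index, score fingerprint index) pairs and that the measure number of each fingerprint is known; the majority vote and the names `measure_correspondence`, `audio_measure_of` and `score_measure_of` are assumptions of this example rather than part of the described method.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def measure_correspondence(
    path: List[Tuple[int, int]],
    audio_measure_of: List[int],
    score_measure_of: List[int],
) -> Dict[int, int]:
    """Map each audio measure to the score measure it is aligned with most often."""
    votes: Dict[int, Counter] = defaultdict(Counter)
    for audio_idx, score_idx in path:
        votes[audio_measure_of[audio_idx]][score_measure_of[score_idx]] += 1
    return {
        audio_measure: counter.most_common(1)[0][0]
        for audio_measure, counter in votes.items()
    }


# Hypothetical usage: decide which score measure to highlight during playback.
alignment = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)]
mapping = measure_correspondence(alignment, [1, 1, 2, 2, 2], [1, 1, 2, 2])
currently_played_audio_measure = 2
measure_to_highlight = mapping[currently_played_audio_measure]
```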

In an exemplary embodiment of the present disclosure, for each audio in the audio database, the corresponding audio fingerprint feature sequence may be calculated in advance in the manner described in the above embodiments and stored.

Based on the above, in some exemplary embodiments of the present disclosure, when the music score to be matched includes a plurality of music tracks and no music score speed is selected, the above screening method may also be performed on each music track to obtain a target audio corresponding to each music track. If the target audios corresponding to the music tracks are the same, that audio is output as the final target audio. If the target audios corresponding to the music tracks differ, a plurality of target audios can be output; alternatively, according to the similarity value determined by the shortest path, the target audio with the highest similarity is selected from the plurality of target audios as the final matching result of the music score to be matched.
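As a non-limiting illustration, combining the per-track results may be sketched as follows. The sketch assumes that each music track yields a target audio together with its shortest-path distance and that a smaller distance means a higher similarity; the name `resolve_track_matches` and the dictionary layout are hypothetical.

```python
from typing import Dict, Tuple


def resolve_track_matches(per_track: Dict[str, Tuple[str, float]]) -> str:
    """per_track maps a track name to (target audio id, shortest-path distance)."""
    targets = {audio_id for audio_id, _ in per_track.values()}
    if len(targets) == 1:
        # Every track agrees, so that audio is the final target audio.
        return next(iter(targets))
    # Tracks disagree: pick the audio with the smallest shortest-path
    # distance, treated here as the highest similarity (an assumption).
    audio_id, _ = min(per_track.values(), key=lambda match: match[1])
    return audio_id
```

This covers the single-result alternative only; outputting all differing target audios would simply return `targets` instead.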

In summary, when a target music score is screened for an audio to be matched, the multimedia data matching method provided by the present disclosure constructs the audio fingerprint feature sequence corresponding to the audio to be matched and, because the music scores are described by fingerprint features of the same form, performs similarity calculation directly against the music score fingerprint feature sequences corresponding to the music scores, thereby achieving efficient and accurate matching. In addition, the primary screening and secondary screening of the music scores effectively narrow the matching range and improve the efficiency of music score retrieval. Furthermore, the correspondence between audio fingerprints and music score fingerprints is determined during measure matching, which yields the correspondence between the audio and the music score measures, so that the music score can accompany the audio while the audio is played. When a target audio is screened for a music score to be matched, matching is carried out with a similar technical scheme, so that efficient and accurate matching from music score to audio data can likewise be achieved.

Exemplary devices

Having introduced the multimedia data matching method of the exemplary embodiment of the present disclosure, a multimedia data matching apparatus of the exemplary embodiment of the present disclosure is described next with reference to fig. 10.

Referring to fig. 10, the multimedia data matching apparatus 100 of an exemplary embodiment of the present disclosure may include: an audio fingerprint feature acquisition module 1001 and a music score matching module 1002; wherein:

The audio fingerprint feature acquisition module 1001 may be configured to acquire an audio to be matched and determine an audio fingerprint feature sequence corresponding to the audio to be matched.

The score matching module 1002 may be configured to calculate a first fingerprint similarity between the audio fingerprint feature sequence and a score fingerprint feature sequence of each score in a score database, and determine a target score matching the audio to be matched according to the first fingerprint similarity.

According to an exemplary embodiment of the present disclosure, the score matching module includes: a fingerprint feature distance calculation module, configured to calculate the fingerprint feature distances between the music score fingerprint feature sequences corresponding to the music scores in the music score database and the audio fingerprint feature sequence corresponding to the audio to be matched, and to determine the target music score matched with the audio to be matched according to the fingerprint feature distances.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: the first music score candidate set calculation module is used for generating an audio fingerprint characteristic sequence to be detected according to the audio fingerprint characteristic sequence corresponding to the audio to be matched; respectively calculating the second fingerprint similarity of the audio fingerprint characteristic sequence to be detected and the music score fingerprint characteristic sequence corresponding to each music score; determining a first score candidate set in the score database according to the second fingerprint similarity; and the music score fingerprint feature sequence corresponding to the music score in the first music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: a second music score candidate set calculating module, configured to calculate third fingerprint similarities between the audio fingerprint feature sequence corresponding to the audio to be matched and the music score fingerprint feature sequences corresponding to the music scores in the first music score candidate set, respectively; screening the music scores in the first music score candidate set according to the third fingerprint similarity, and determining a second music score candidate set; and the music score fingerprint feature sequence corresponding to the music score in the second music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: the to-be-detected audio fingerprint feature sequence calculating module is used for determining the variance of each audio fingerprint feature in the audio fingerprint feature sequence and selecting m audio fingerprint features with the largest variance; determining the similarity among the m audio fingerprint features, reserving n audio fingerprint features with the minimum similarity, and determining the n audio fingerprint features as the audio fingerprint feature sequence to be detected; wherein m and n are both positive integers, and m is greater than n.
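As a non-limiting illustration, the selection of the audio fingerprint feature sequence to be detected may be sketched as follows. Several points are assumptions of this example: the variance is taken as the population variance over the elements of each fingerprint feature, cosine similarity stands in for the similarity among features, and "minimum similarity" is interpreted as the lowest mean similarity to the other selected features. The name `select_probe_fingerprints` is hypothetical.

```python
import math
from statistics import pvariance
from typing import List, Sequence

Fingerprint = Sequence[float]


def cosine_similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Cosine similarity; an assumed stand-in for the fingerprint similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_probe_fingerprints(
    fps: List[Fingerprint], m: int, n: int
) -> List[Fingerprint]:
    """Keep the m highest-variance features, then the n least similar of them."""
    if not 0 < n < m:
        raise ValueError("m and n must be positive integers with m > n")
    # Step 1: the m fingerprint features with the largest variance.
    top_m = sorted(fps, key=pvariance, reverse=True)[:m]

    # Step 2: rank by mean similarity to the other selected features
    # (one possible reading of "minimum similarity").
    def mean_similarity(i: int) -> float:
        others = [
            cosine_similarity(top_m[i], top_m[j])
            for j in range(len(top_m))
            if j != i
        ]
        return sum(others) / len(others) if others else 0.0

    keep = sorted(range(len(top_m)), key=mean_similarity)[:n]
    return [top_m[i] for i in sorted(keep)]
```

Flat features carry little melodic information and near-duplicates add little discriminative power, which is presumably why both filters are applied before the coarse search.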

According to an exemplary embodiment of the present disclosure, the first score candidate set calculation module includes: the first music score screening module is used for respectively calculating k music score fingerprint features with the highest similarity to each audio fingerprint feature in the n audio fingerprint features in the music score fingerprint feature sequence to obtain n x k music score fingerprint features serving as a first screening result; wherein k is a positive integer; screening the fingerprint characteristics of the music score in the first screening result according to a preset first similarity threshold value to obtain a second screening result; and determining a score according to the second screening result and constructing the first score candidate set.

According to an exemplary embodiment of the present disclosure, the second score candidate set calculation module includes: and the third fingerprint similarity calculation module is used for calculating the fingerprint similarity of each audio fingerprint feature in the audio fingerprint feature sequence corresponding to the audio to be matched and each music score fingerprint feature in the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set respectively to obtain the third fingerprint similarity.

According to an exemplary embodiment of the present disclosure, the second score candidate set calculation module includes: the second music score screening module is used for determining music score fingerprint characteristics with the highest similarity with the audio fingerprint characteristics according to the third fingerprint similarity, and the music score fingerprint characteristics serve as a third screening result; and screening the fingerprint characteristics of the music scores in the third screening result according to a preset second similarity threshold, and constructing a second music score candidate set according to the music scores corresponding to the screened fingerprint characteristics of the music scores.

According to an exemplary embodiment of the present disclosure, the fingerprint feature distance calculation module includes: the sampling point distance matrix calculation module is used for respectively constructing sampling point distance matrixes based on the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequences of the music scores; the path calculation module is used for calculating fingerprint characteristic distances between the audio fingerprint characteristics in the audio fingerprint characteristic sequence and the music score fingerprint characteristics in the music score fingerprint characteristic sequence and taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequence according to the minimum value of the sampling point distances; and the target music score determining module is used for acquiring a plurality of shortest paths between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequences of the music scores and determining the target music score matched with the audio to be matched according to the minimum value of the shortest paths.
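As a non-limiting illustration, the sampling point distance matrix and the shortest path may be sketched with a dynamic-time-warping style accumulation, as follows. The Euclidean distance between fingerprint features, the three allowed step directions, and the names `distance_matrix`, `shortest_path` and `match_target_score` are assumptions of this example; the disclosure does not fix them here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Fingerprint = Sequence[float]


def distance_matrix(a: List[Fingerprint], b: List[Fingerprint]) -> List[List[float]]:
    """Sampling point distance matrix (Euclidean distance is assumed)."""
    return [[math.dist(x, y) for y in b] for x in a]


def shortest_path(dist: List[List[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Accumulated cost and index path from (0, 0) to the last cell."""
    rows, cols = len(dist), len(dist[0])
    inf = float("inf")
    acc = [[inf] * cols for _ in range(rows)]
    acc[0][0] = dist[0][0]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            # Extend the cheapest of the three neighbouring partial paths.
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = dist[i][j] + best_prev
    # Backtrack from the last cell to recover the alignment.
    i, j = rows - 1, cols - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            steps.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i, j))
    path.reverse()
    return acc[-1][-1], path


def match_target_score(
    audio_fps: List[Fingerprint], score_db: Dict[str, List[Fingerprint]]
) -> str:
    """Return the score whose shortest path to the audio is the smallest."""
    costs = {
        score_id: shortest_path(distance_matrix(audio_fps, fps))[0]
        for score_id, fps in score_db.items()
    }
    return min(costs, key=costs.get)
```

The returned path pairs audio fingerprint indices with score fingerprint indices, so the same result can also serve as the basis for a measure correspondence.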

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: a correspondence determining module, configured to determine the measure correspondence between the audio fingerprint feature sequence and the music score fingerprint feature sequence of the target music score according to the shortest path between the music score fingerprint feature sequence of the target music score and the audio fingerprint feature sequence of the audio to be matched.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: a matching display module, configured to respond to a playback control operation on the audio to be matched by displaying, in a graphical user interface, the music score measure corresponding to the currently played audio measure according to the measure correspondence between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequence of the target music score.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: the audio fingerprint calculation module is used for carrying out spectrum conversion processing on the audio to be matched to acquire corresponding spectrum data; performing note identification on the frequency spectrum data to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain audio fingerprint characteristics corresponding to each group; and constructing an audio fingerprint characteristic sequence based on the audio fingerprint characteristics corresponding to each group, and configuring the audio fingerprint characteristic sequence as the audio fingerprint characteristic sequence corresponding to the audio to be matched.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: the music score fingerprint characteristic sequence generation module is used for identifying each music score in the music score database and acquiring a note sequence of at least one corresponding music track; converting the note sequence according to a preset music score speed to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain music score fingerprint characteristics corresponding to each group; and constructing a music score fingerprint feature sequence corresponding to the preset music score speed based on music score fingerprint features corresponding to each group, and configuring the music score fingerprint feature sequence corresponding to the music score.
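As a non-limiting illustration, generating a music score fingerprint feature sequence from a note sequence at a given music score speed may be sketched as follows. The note representation (MIDI pitch and duration in beats), the sampling rate `sample_hz`, the group size `group_size`, and the dropping of a trailing incomplete group are all assumptions of this example.

```python
from typing import List, Tuple

Note = Tuple[int, float]  # (MIDI pitch, duration in beats); an assumed representation


def score_fingerprint_sequence(
    notes: List[Note],
    tempo_bpm: float,
    sample_hz: float = 4.0,  # hypothetical sampling rate of the pitch sequence
    group_size: int = 4,     # hypothetical number of samples per fingerprint
) -> List[List[int]]:
    """Convert notes to a time-based pitch sequence, sample it, and group samples."""
    seconds_per_beat = 60.0 / tempo_bpm
    # Note sequence -> end time of each note at the given score speed.
    ends: List[Tuple[float, int]] = []
    elapsed = 0.0
    for pitch, beats in notes:
        elapsed += beats * seconds_per_beat
        ends.append((elapsed, pitch))

    # Sample the resulting step-wise pitch sequence at a fixed rate.
    samples: List[int] = []
    idx = 0
    for s in range(int(elapsed * sample_hz + 1e-9)):
        t = s / sample_hz
        while idx < len(ends) - 1 and ends[idx][0] <= t:
            idx += 1
        samples.append(ends[idx][1])

    # Group consecutive samples; each full group is one fingerprint feature.
    return [
        samples[i:i + group_size]
        for i in range(0, len(samples) - group_size + 1, group_size)
    ]
```

Because the note durations are scaled by the tempo before sampling, the same score produces a different fingerprint sequence for each preset speed, which is consistent with storing one sequence per speed.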

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: the identification matching module is used for extracting identification information corresponding to the audio to be matched and matching the identification information with each audio in an audio database according to the identification information, wherein the audio database comprises a plurality of audios and audio fingerprint characteristic sequences corresponding to the audios; if the identification information is successfully matched with the audio in an audio database, configuring the audio fingerprint characteristic sequence corresponding to the matched audio as the audio fingerprint characteristic sequence corresponding to the audio to be matched; or if the matching fails according to the identification information, extracting the characteristics of the audio to be matched so as to obtain an audio fingerprint characteristic sequence corresponding to the audio to be matched.
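As a non-limiting illustration, the identification matching with fallback to feature extraction may be sketched as follows. The names `resolve_audio_fingerprints`, `audio_db` and `extract_features` are hypothetical, and the audio database is modelled as a plain dictionary keyed by identification information.

```python
from typing import Callable, Dict, List, Optional, Sequence

FingerprintSequence = List[Sequence[float]]


def resolve_audio_fingerprints(
    identifier: Optional[str],
    audio_db: Dict[str, FingerprintSequence],
    extract_features: Callable[[], FingerprintSequence],
) -> FingerprintSequence:
    """Reuse a stored fingerprint sequence if the identifier matches, else extract."""
    stored = audio_db.get(identifier) if identifier is not None else None
    if stored is not None:
        # Identification matched: reuse the precomputed sequence.
        return stored
    # Matching failed: fall back to feature extraction on the audio itself.
    return extract_features()
```

Passing the extraction step as a callable keeps the costly computation lazy, so it runs only when the lookup fails.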

Referring to fig. 11, the multimedia data matching apparatus 110 of the exemplary embodiment of the present disclosure may include: a music score fingerprint feature acquisition module 1101 and an audio matching module 1102; wherein:

The music score fingerprint feature acquisition module 1101 may be configured to acquire a music score to be matched and determine a music score fingerprint feature sequence corresponding to the music score to be matched.

The audio matching module 1102 may be configured to calculate a fourth fingerprint similarity between the score fingerprint feature sequence and an audio fingerprint feature sequence of each audio in an audio database, and determine a target audio matched with the score to be matched according to the fourth fingerprint similarity.

According to an exemplary embodiment of the present disclosure, the audio matching module includes: and the fingerprint characteristic distance calculation module is used for calculating the fingerprint characteristic distances of the audio fingerprint characteristic sequences corresponding to the audios in the audio database and the music score fingerprint characteristic sequences corresponding to the music score to be matched, and determining the target audio matched with the music score to be matched according to the fingerprint characteristic distances.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: a first audio candidate set calculation module, configured to generate a music score fingerprint feature sequence to be detected according to the music score fingerprint feature sequence corresponding to the music score to be matched; respectively calculate the fifth fingerprint similarity between the music score fingerprint feature sequence to be detected and the audio fingerprint feature sequence corresponding to each audio; and determine a first audio candidate set in the audio database according to the fifth fingerprint similarity; wherein the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set is used for calculating the fingerprint feature distance to the music score fingerprint feature sequence corresponding to the music score to be matched.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: the second audio candidate set calculating module is used for calculating sixth fingerprint similarity of the music score fingerprint feature sequence corresponding to the music score to be matched and the audio fingerprint feature sequences corresponding to the audios in the first audio candidate set respectively; screening the audios in the first audio candidate set according to the sixth fingerprint similarity, and determining a second audio candidate set; and the audio fingerprint feature sequence corresponding to the audio in the second audio candidate set is used for calculating the fingerprint feature distance of the music score fingerprint feature sequence corresponding to the music score to be matched.

According to an exemplary embodiment of the present disclosure, the first audio candidate set calculation module includes: the music score fingerprint feature sequence to be detected calculating module is used for determining the variance of the music score fingerprint features in the music score fingerprint feature sequence and selecting m music score fingerprint features with the largest variance; determining the similarity among the m music score fingerprint features, reserving n music score fingerprint features with the minimum similarity, and determining the n music score fingerprint features as the music score fingerprint feature sequence to be detected; wherein m and n are positive integers, and m is greater than n.

According to an exemplary embodiment of the present disclosure, the first audio candidate set calculation module includes: the first audio screening module is used for respectively calculating k audio fingerprint features with the highest similarity to each music score fingerprint feature in the n music score fingerprint features in the audio fingerprint feature sequence to obtain n x k audio fingerprint features serving as a fourth screening result; wherein k is a positive integer; screening the audio fingerprint features in the fourth screening result according to a preset third similarity threshold value to obtain a fifth screening result; and determining audio according to the fifth screening result and constructing the first audio candidate set.

According to an exemplary embodiment of the present disclosure, the second audio candidate set calculation module includes: and the sixth fingerprint similarity calculation module is used for calculating the fingerprint similarity of each music score fingerprint feature in the music score fingerprint feature sequence corresponding to the music score to be matched and each audio fingerprint feature in the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set respectively to obtain the sixth fingerprint similarity.

According to an exemplary embodiment of the present disclosure, the second audio candidate set calculation module includes: a second audio screening module, configured to determine, according to the sixth fingerprint similarity, the audio fingerprint features with the highest similarity to the music score fingerprint features as a sixth screening result; and to screen the audio fingerprint features in the sixth screening result according to a preset fourth similarity threshold and construct the second audio candidate set according to the audios corresponding to the screened audio fingerprint features.

According to an exemplary embodiment of the present disclosure, the fingerprint feature distance calculation module includes: the sampling point distance matrix calculation module is used for respectively constructing a sampling point distance matrix based on the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequence of each audio frequency; the path calculation module is used for calculating fingerprint characteristic distances between music score fingerprint characteristics in the music score fingerprint characteristic sequence and audio fingerprint characteristics of the audio fingerprint characteristic sequence and taking the fingerprint characteristic distances as sampling point distances, and determining a shortest path between the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequence according to the minimum value of the sampling point distances; and the target audio determining module is used for acquiring a plurality of shortest paths between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequences of the audios and determining the target audio matched with the music score to be matched according to the minimum value of the shortest paths.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: a measure correspondence determining module, configured to determine the measure correspondence between the music score fingerprint feature sequence and the audio fingerprint feature sequence of the target audio according to the shortest path between the audio fingerprint feature sequence of the target audio and the music score fingerprint feature sequence of the music score to be matched.

According to an exemplary embodiment of the present disclosure, the apparatus further comprises: and the corresponding display control module is used for responding to the playing control operation of the music score to be matched and displaying the music score measure corresponding to the currently played audio measure in a graphical user interface according to the measure corresponding relation between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequence of the target audio.

According to an exemplary embodiment of the present disclosure, the music score to be matched includes a music score image to be matched; the device further comprises: the music score image processing module is used for carrying out image recognition on the music score image to be matched and acquiring original music score data; converting the original music score data to obtain a note sequence of at least one music track; converting the note sequence according to the speed of the music score contained in the music score data to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain music score fingerprint characteristics corresponding to each group; and constructing a music score fingerprint feature sequence corresponding to the preset music score speed based on music score fingerprint features corresponding to each group, and configuring the music score fingerprint feature sequence corresponding to the music score to be matched.

Since the functional modules of the multimedia data matching apparatus according to the embodiments of the present disclosure correspond to the steps of the multimedia data matching method embodiments described above, they are not described again here.

Exemplary storage medium

Having described the multimedia data matching method and apparatus according to the exemplary embodiment of the present disclosure, a storage medium according to the exemplary embodiment of the present disclosure will be described with reference to fig. 13.

Referring to fig. 13, a program product 130 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a device, such as a personal computer. However, the program product of the present disclosure is not so limited, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Exemplary electronic device

Having described the storage medium of the exemplary embodiment of the present disclosure, next, an electronic device of the exemplary embodiment of the present disclosure will be described with reference to fig. 12.

The electronic device 800 shown in fig. 12 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.

As shown in fig. 12, the electronic device 800 takes the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 connecting different system components (including the storage unit 820 and the processing unit 810), and a display unit 840.

The storage unit 820 stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification. For example, the processing unit 810 may perform the steps as shown in fig. 1, or the processing unit 810 may perform the steps as shown in fig. 9.

The storage unit 820 may include volatile storage units such as a random access memory unit (RAM) 8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.

The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.

Bus 830 may include a data bus, an address bus, and a control bus.

The electronic device 800 may also communicate with one or more external devices 900 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.) through an input/output (I/O) interface 850. The electronic device 800 further comprises a display unit 840 connected to the input/output (I/O) interface 850 for displaying. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.

It should be noted that although several modules or sub-modules of the multimedia data matching apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.

Moreover, while the operations of the method of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be broken down into multiple steps for execution.

While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features of those aspects cannot be combined to advantage; such division is merely for convenience of presentation. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A method for multimedia data matching, comprising:

acquiring an audio to be matched, and determining an audio fingerprint feature sequence corresponding to the audio to be matched;

and calculating the first fingerprint similarity of the audio fingerprint feature sequence and the music score fingerprint feature sequence of each music score in the music score database, and determining a target music score matched with the audio to be matched according to the first fingerprint similarity.

2. The method as claimed in claim 1, wherein the calculating a first fingerprint similarity between the audio fingerprint feature sequence and a score fingerprint feature sequence of each score in a score database, and determining a target score matching the audio to be matched according to the first fingerprint similarity comprises:

and calculating fingerprint characteristic distances between the music score fingerprint characteristic sequences corresponding to the music scores in the music score database and the audio fingerprint characteristic sequences corresponding to the audio to be matched, and determining a target music score matched with the audio to be matched according to the fingerprint characteristic distances.

3. The method of claim 2, further comprising:

generating an audio fingerprint characteristic sequence to be detected according to the audio fingerprint characteristic sequence corresponding to the audio to be matched;

respectively calculating second fingerprint similarity of the audio fingerprint feature sequence to be detected and the music score fingerprint feature sequences corresponding to the music scores;

determining a first score candidate set in the score database according to the second fingerprint similarity; and the music score fingerprint feature sequence corresponding to the music score in the first music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

4. The method of claim 3, wherein after determining a first score candidate set in the score database according to the second fingerprint similarity, the method further comprises:

respectively calculating third fingerprint similarity of the audio fingerprint feature sequence corresponding to the audio to be matched and the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set;

screening the music scores in the first music score candidate set according to the third fingerprint similarity, and determining a second music score candidate set; and the music score fingerprint feature sequence corresponding to the music score in the second music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

5. The method according to claim 3, wherein the generating an audio fingerprint feature sequence to be detected according to the audio fingerprint feature sequence corresponding to the audio to be matched comprises:

determining the variance of each audio fingerprint feature in the audio fingerprint feature sequence, and selecting m audio fingerprint features with the largest variance;

determining the similarity among the m audio fingerprint features, reserving n audio fingerprint features with the minimum similarity, and determining the n audio fingerprint features as the audio fingerprint feature sequence to be detected; wherein m and n are positive integers, and m is greater than n.
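One possible reading of the selection in claim 5 can be sketched in Python. The claim does not fix how variance or similarity is measured, so this sketch makes several assumptions: each fingerprint feature is a list of numbers, variance is the population variance of its elements, similarity is cosine similarity, and the n features kept are those with the lowest mean similarity to the other shortlisted features. The names `cosine` and `select_probe_features` are hypothetical.

```python
from math import sqrt
from statistics import pvariance


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_probe_features(features, m, n):
    """Return the indices of n features chosen as the sequence to be detected.

    Step 1: shortlist the m features with the largest variance.
    Step 2: among those, keep the n features that are least similar
            (lowest mean cosine similarity) to the rest of the shortlist.
    """
    if not (0 < n < m <= len(features)):
        raise ValueError("require 0 < n < m <= len(features)")

    shortlist = sorted(
        range(len(features)),
        key=lambda i: pvariance(features[i]),
        reverse=True,
    )[:m]

    def mean_similarity(i):
        others = [j for j in shortlist if j != i]
        return sum(cosine(features[i], features[j]) for j in others) / len(others)

    kept = sorted(shortlist, key=mean_similarity)[:n]
    return sorted(kept)  # restore original sequence order
```

Picking high-variance, mutually dissimilar features keeps the probe set small while avoiding flat or redundant passages, which is presumably the intent of the two-stage selection.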

6. The method of claim 5, wherein determining a first score candidate set in the score database based on the second fingerprint similarity comprises:

respectively calculating k music score fingerprint features with the highest similarity to each audio fingerprint feature in the n audio fingerprint features in the music score fingerprint feature sequence to obtain n x k music score fingerprint features serving as a first screening result; wherein k is a positive integer;

screening the fingerprint characteristics of the music score in the first screening result according to a preset first similarity threshold value to obtain a second screening result;

and determining a music score according to the second screening result and constructing the first music score candidate set.
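The coarse screening of claim 6 can be illustrated with a short Python sketch. It assumes the score database is a mapping from a score identifier to that score's list of fingerprint features, and it uses a simple similarity of `1 / (1 + mean absolute difference)`; both the data layout and the similarity measure are assumptions, and the function names are hypothetical.

```python
import heapq


def similarity(a, b):
    """Illustrative similarity in (0, 1]; 1.0 means identical features."""
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mean_abs_diff)


def first_candidate_set(probes, score_db, k, threshold):
    """Build the first score candidate set.

    For each of the n probe features, take the k most similar score
    fingerprint features across the database (n x k hits in total, the
    first screening result), drop hits below the threshold (the second
    screening result), and collect the scores that own the survivors.
    """
    candidates = set()
    for probe in probes:
        hits = (
            (similarity(probe, feat), score_id)
            for score_id, feats in score_db.items()
            for feat in feats
        )
        for sim, score_id in heapq.nlargest(k, hits, key=lambda h: h[0]):
            if sim >= threshold:
                candidates.add(score_id)
    return candidates
```

A linear scan is used here for clarity; a large database would more plausibly use an approximate nearest-neighbour index, which the claim neither requires nor excludes.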

7. The method of claim 4, wherein the calculating the third fingerprint similarity between the audio fingerprint feature sequence corresponding to the audio to be matched and the score fingerprint feature sequences corresponding to the scores in the first score candidate set respectively comprises:

and respectively calculating the fingerprint similarity of each audio fingerprint feature in the audio fingerprint feature sequence corresponding to the audio to be matched and each music score fingerprint feature in the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set to obtain the third fingerprint similarity.

8. The method of claim 7, wherein the screening scores in the first score candidate set according to the third fingerprint similarity and determining a second score candidate set comprises:

determining score fingerprint features with the highest similarity to the audio fingerprint features according to the third fingerprint similarity as a third screening result;

and screening the fingerprint characteristics of the music scores in the third screening result according to a preset second similarity threshold, and constructing a second music score candidate set according to the music scores corresponding to the screened fingerprint characteristics of the music scores.

9. The method of claim 2, wherein the calculating fingerprint feature distances of the score fingerprint feature sequence corresponding to each score in the score database and the audio fingerprint feature sequence corresponding to the audio to be matched and determining a target score matching the audio to be matched according to the fingerprint feature distances comprises:

respectively constructing sampling point distance matrixes based on the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequences of the music scores;

calculating fingerprint characteristic distances between the audio fingerprint characteristics in the audio fingerprint characteristic sequence and the music score fingerprint characteristics in the music score fingerprint characteristic sequence and taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequence according to the minimum value of the sampling point distances;

and acquiring a plurality of shortest paths between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequences of the music scores, and determining the target music score matched with the audio to be matched according to the minimum value of the shortest paths.
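The shortest-path comparison of claim 9 resembles dynamic time warping (DTW). The sketch below assumes DTW as the path search and the mean absolute difference as the fingerprint feature distance; neither choice is stated in the claim, and the function names are hypothetical.

```python
def feature_distance(a, b):
    """Illustrative distance between two equal-length fingerprint features."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def shortest_path_cost(audio_seq, score_seq):
    """Accumulated cost of the cheapest monotonic alignment (DTW)."""
    n, m = len(audio_seq), len(score_seq)
    inf = float("inf")
    # acc[i][j]: cheapest cost of aligning the first i audio features
    # with the first j score features.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = feature_distance(audio_seq[i - 1], score_seq[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def match_target_score(audio_seq, score_db):
    """Return the id of the score whose shortest path to the audio is smallest."""
    return min(score_db, key=lambda sid: shortest_path_cost(audio_seq, score_db[sid]))
```

Because the alignment may stretch or compress either sequence, a performance that lingers on a note can still reach zero cost against the corresponding score.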

10. The method of claim 9, further comprising: determining the measure corresponding relation between the audio fingerprint feature sequence and the music score fingerprint feature sequence of the target music score according to the shortest path between the music score fingerprint feature sequence of the target music score and the audio fingerprint feature sequence of the audio to be matched.
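How a shortest path yields a measure correspondence can be sketched as follows. The example assumes a DTW-style path, scalar features for brevity, and a hypothetical list `score_measures` giving the measure number of each score fingerprint feature; none of these details are fixed by the claim.

```python
def alignment_path(audio_seq, score_seq, dist):
    """Return the shortest path as (audio_index, score_index) pairs."""
    n, m = len(audio_seq), len(score_seq)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(audio_seq[i - 1], score_seq[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])

    # Walk back from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def audio_to_measure(path, score_measures):
    """Map each audio feature index to the measure of its first aligned score feature."""
    mapping = {}
    for audio_idx, score_idx in path:
        mapping.setdefault(audio_idx, score_measures[score_idx])
    return mapping
```

With such a mapping, a player that knows which audio feature is currently sounding can look up the score measure to highlight.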

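11. The method of claim 10, further comprising:

and responding to the playing control operation of the audio to be matched, and displaying the music score measure corresponding to the currently played audio measure in a graphical user interface according to the measure corresponding relation between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequence of the target music score.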

12. The method according to claim 1, wherein the determining the audio fingerprint feature sequence corresponding to the audio to be matched comprises:

carrying out spectrum conversion processing on the audio to be matched to obtain corresponding spectrum data;

performing note identification on the frequency spectrum data to obtain a corresponding pitch sequence;

sampling the pitch sequence, and grouping sampling results to obtain audio fingerprint characteristics corresponding to each group;

and constructing an audio fingerprint characteristic sequence based on the audio fingerprint characteristics corresponding to each group, and configuring the audio fingerprint characteristic sequence as the audio fingerprint characteristic sequence corresponding to the audio to be matched.
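The sampling and grouping steps of claim 12 can be sketched as below. Spectrum conversion and note identification are assumed to have already produced a per-frame fundamental frequency track in Hz; that input format, the MIDI pitch representation, and the parameters `hop` and `group_size` are assumptions made for illustration.

```python
from math import log2


def hz_to_midi(freq_hz):
    """Convert a frequency to the nearest MIDI note number (A4 = 440 Hz = 69).

    Non-positive frequencies (unvoiced frames) are mapped to 0.
    """
    if freq_hz <= 0:
        return 0
    return round(69 + 12 * log2(freq_hz / 440.0))


def audio_fingerprint_sequence(freq_track_hz, hop, group_size):
    """Turn a frequency track into a sequence of fixed-size pitch groups.

    The track is converted to pitches, every hop-th pitch is sampled, and
    the samples are cut into consecutive groups of group_size; each group
    is one fingerprint feature. A trailing partial group is discarded.
    """
    pitches = [hz_to_midi(f) for f in freq_track_hz]
    samples = pitches[::hop]
    return [
        samples[i:i + group_size]
        for i in range(0, len(samples) - group_size + 1, group_size)
    ]
```

For instance, four frames at 440 Hz followed by four at 880 Hz, sampled with `hop=2` and grouped in pairs, give the two features `[69, 69]` and `[81, 81]`.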

13. The method of claim 1, further comprising:

identifying each music score in the music score database, and acquiring a note sequence of at least one corresponding music track;

converting the note sequence according to a preset music score speed to obtain a corresponding pitch sequence;

sampling the pitch sequence, and grouping the sampling results to obtain the fingerprint characteristics of the music score corresponding to each group;

and constructing a music score fingerprint characteristic sequence corresponding to the preset music score speed based on the music score fingerprint characteristics corresponding to each group, and configuring the music score fingerprint characteristic sequence corresponding to the music score.
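A minimal Python sketch of the score-side conversion in claim 13 follows. It assumes a note is a `(midi_pitch, duration_in_beats)` pair, the preset score speed is a tempo in beats per minute, and one fingerprint sequence is built per preset tempo; the representation and the function names are hypothetical.

```python
def score_pitch_samples(notes, bpm, samples_per_second):
    """Render a note sequence at a given tempo and sample its pitch over time."""
    seconds_per_beat = 60.0 / bpm
    samples = []
    note_end = 0.0   # end time of the current note, in seconds
    index = 0        # index of the next sample to emit
    for pitch, beats in notes:
        note_end += beats * seconds_per_beat
        # Emit every sample instant that falls before the end of this note.
        while index / samples_per_second < note_end - 1e-9:
            samples.append(pitch)
            index += 1
    return samples


def score_fingerprints_by_tempo(notes, tempos, samples_per_second, group_size):
    """Build one grouped fingerprint sequence for each preset tempo."""
    result = {}
    for bpm in tempos:
        samples = score_pitch_samples(notes, bpm, samples_per_second)
        result[bpm] = [
            samples[i:i + group_size]
            for i in range(0, len(samples) - group_size + 1, group_size)
        ]
    return result
```

Storing a sequence per tempo lets an audio recording be compared against the rendition closest to its actual playing speed.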

14. The method according to claim 1, wherein the determining the audio fingerprint feature sequence corresponding to the audio to be matched comprises:

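extracting identification information corresponding to the audio to be matched, and matching the identification information with each audio in an audio database, wherein the audio database comprises a plurality of audios and audio fingerprint feature sequences corresponding to the audios;

if the identification information is successfully matched with an audio in the audio database, configuring the audio fingerprint feature sequence corresponding to the matched audio as the audio fingerprint feature sequence corresponding to the audio to be matched; or

if the matching according to the identification information fails, performing feature extraction on the audio to be matched to obtain the audio fingerprint feature sequence corresponding to the audio to be matched.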

15. A method for multimedia data matching, comprising:

obtaining a music score to be matched, and determining a music score fingerprint feature sequence corresponding to the music score to be matched;

and calculating fourth fingerprint similarity of the music score fingerprint feature sequence and the audio fingerprint feature sequences of the audios in an audio database, and determining a target audio matched with the music score to be matched according to the fourth fingerprint similarity.

16. The method of claim 15, wherein the calculating a fourth fingerprint similarity between the score fingerprint feature sequence and the audio fingerprint feature sequences of the audios in the audio database, and determining the target audio matched with the score to be matched according to the fourth fingerprint similarity comprises:

and calculating the fingerprint characteristic distance between the audio fingerprint characteristic sequence corresponding to each audio in the audio database and the music score fingerprint characteristic sequence corresponding to the music score to be matched, and determining the target audio matched with the music score to be matched according to the fingerprint characteristic distance.

17. The method of claim 16, further comprising:

generating a music score fingerprint characteristic sequence to be detected according to the music score fingerprint characteristic sequence corresponding to the music score to be matched;

respectively calculating fifth fingerprint similarity of the music score fingerprint feature sequence to be detected and the audio fingerprint feature sequences corresponding to the audios;

determining a first audio candidate set in the audio database according to the fifth fingerprint similarity; and the audio fingerprint feature sequence corresponding to the audio in the first audio candidate set is used for calculating the fingerprint feature distance of the music score fingerprint feature sequence corresponding to the music score to be matched.

18. The method of claim 17, wherein after determining the first audio candidate set in the audio database according to the fifth fingerprint similarity, the method further comprises:

respectively calculating sixth fingerprint similarity of the music score fingerprint feature sequence corresponding to the music score to be matched and the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set;

screening each audio in the first audio candidate set according to the sixth fingerprint similarity, and determining a second audio candidate set; and the audio fingerprint feature sequence corresponding to the audio in the second audio candidate set is used for calculating the fingerprint feature distance of the music score fingerprint feature sequence corresponding to the music score to be matched.

19. The method according to claim 17, wherein the generating a score fingerprint feature sequence to be detected according to a score fingerprint feature sequence corresponding to the score to be matched comprises:

determining the variance of each score fingerprint feature in the score fingerprint feature sequence, and selecting m score fingerprint features with the largest variance;

determining the similarity among the m music score fingerprint characteristics, reserving n music score fingerprint characteristics with the minimum similarity, and determining the n music score fingerprint characteristics as the music score fingerprint characteristic sequence to be detected; wherein m and n are positive integers, and m is greater than n.

20. The method of claim 19, wherein determining the first audio candidate set in the audio database according to the fifth fingerprint similarity comprises:

respectively calculating k audio fingerprint features with the highest similarity to each music score fingerprint feature in the n music score fingerprint features in the audio fingerprint feature sequence to obtain n x k audio fingerprint features as a fourth screening result; wherein k is a positive integer;

screening the audio fingerprint features in the fourth screening result according to a preset third similarity threshold value to obtain a fifth screening result;

and determining audio according to the fifth screening result and constructing the first audio candidate set.

21. The method of claim 18, wherein the calculating sixth fingerprint similarities of the score fingerprint feature sequence corresponding to the score to be matched and the audio fingerprint feature sequences corresponding to the audios in the first audio candidate set respectively comprises:

and respectively calculating the fingerprint similarity of each music score fingerprint feature in the music score fingerprint feature sequence corresponding to the music score to be matched and each audio fingerprint feature in the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set to obtain the sixth fingerprint similarity.

22. The method of claim 21, wherein the screening each audio in the first audio candidate set according to the sixth fingerprint similarity and determining a second audio candidate set comprises:

determining audio fingerprint features with the highest similarity to the music score fingerprint features according to the sixth fingerprint similarity as a sixth screening result;

and screening the audio fingerprint features in the sixth screening result according to a preset fourth similarity threshold, and constructing the second audio candidate set according to the audios corresponding to the screened audio fingerprint features.

23. The method of claim 16, wherein the calculating fingerprint feature distances of the audio fingerprint feature sequence corresponding to each audio in the audio database and the score fingerprint feature sequence corresponding to the score to be matched and determining a target audio matching the score to be matched according to the fingerprint feature distances comprises:

respectively constructing sampling point distance matrixes based on the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequences of the audios;

calculating fingerprint characteristic distances between music score fingerprint characteristics in the music score fingerprint characteristic sequence and audio fingerprint characteristics in the audio fingerprint characteristic sequence and taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the music score fingerprint characteristic sequence and the audio fingerprint characteristic sequence according to the minimum value of the sampling point distances;

and acquiring a plurality of shortest paths between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequences of the audios, and determining the target audio matched with the music score to be matched according to the minimum value of the shortest paths.

24. The method of claim 23, further comprising: determining the measure corresponding relation between the music score fingerprint feature sequence and the audio fingerprint feature sequence of the target audio according to the shortest path between the audio fingerprint feature sequence of the target audio and the music score fingerprint feature sequence of the music score to be matched.

25. The method of claim 23, further comprising:

and responding to the playing control operation of the music score to be matched, and displaying the music score measure corresponding to the currently played audio measure in a graphical user interface according to the measure corresponding relation between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequence of the target audio.

26. The method of claim 15, wherein the score to be matched comprises a score image to be matched;

the determining of the score fingerprint feature sequence corresponding to the score to be matched includes:

carrying out image recognition on the music score image to be matched, and acquiring original music score data;

converting the original music score data to obtain a note sequence of at least one music track;

converting the note sequence according to the speed of the music score contained in the music score data to obtain a corresponding pitch sequence;

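sampling the pitch sequence, and grouping the sampling results to obtain the music score fingerprint features corresponding to each group;

and constructing a music score fingerprint feature sequence corresponding to a preset music score speed based on the music score fingerprint features corresponding to each group, and configuring it as the music score fingerprint feature sequence corresponding to the music score to be matched.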

27. A multimedia data matching apparatus, comprising:

28. The apparatus of claim 27, wherein the score matching module comprises:

and the fingerprint characteristic distance calculation module is used for calculating the fingerprint characteristic distances of the music score fingerprint characteristic sequences corresponding to the music scores in the music score database and the audio fingerprint characteristic sequences corresponding to the audio to be matched, and determining the target music score matched with the audio to be matched according to the fingerprint characteristic distances.

29. The apparatus of claim 28, further comprising:

the first music score candidate set calculation module is used for generating an audio fingerprint feature sequence to be detected according to the audio fingerprint feature sequence corresponding to the audio to be matched; respectively calculating second fingerprint similarity of the audio fingerprint feature sequence to be detected and the music score fingerprint feature sequences corresponding to the music scores; determining a first score candidate set in the score database according to the second fingerprint similarity; and the music score fingerprint feature sequence corresponding to the music score in the first music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

30. The apparatus of claim 29, further comprising:

a second music score candidate set calculating module, configured to calculate third fingerprint similarities between the audio fingerprint feature sequence corresponding to the audio to be matched and the music score fingerprint feature sequences corresponding to the music scores in the first music score candidate set, respectively; screening the music scores in the first music score candidate set according to the third fingerprint similarity, and determining a second music score candidate set; and the music score fingerprint feature sequence corresponding to the music score in the second music score candidate set is used for calculating the fingerprint feature distance of the audio fingerprint feature sequence corresponding to the audio to be matched.

31. The apparatus of claim 29, further comprising:

the to-be-detected audio fingerprint feature sequence calculating module is used for determining the variance of each audio fingerprint feature in the audio fingerprint feature sequence and selecting m audio fingerprint features with the largest variance; determining the similarity among the m audio fingerprint features, reserving n audio fingerprint features with the minimum similarity, and determining the n audio fingerprint features as the audio fingerprint feature sequence to be detected; wherein m and n are both positive integers, and m is greater than n.

32. The apparatus of claim 31, wherein the first score candidate set computing module comprises:

the first music score screening module is used for respectively calculating k music score fingerprint features with highest similarity to each of the n audio fingerprint features in the music score fingerprint feature sequence to obtain n x k music score fingerprint features serving as a first screening result; wherein k is a positive integer; screening the fingerprint characteristics of the music score in the first screening result according to a preset first similarity threshold value to obtain a second screening result; and determining a score according to the second screening result and constructing the first score candidate set.

33. The apparatus of claim 30, wherein the second score candidate set computing module comprises:

and the third fingerprint similarity calculation module is used for calculating the fingerprint similarity of each audio fingerprint feature in the audio fingerprint feature sequence corresponding to the audio to be matched and each music score fingerprint feature in the music score fingerprint feature sequence corresponding to each music score in the first music score candidate set respectively to obtain the third fingerprint similarity.

34. The apparatus of claim 30, wherein the second score candidate set calculating module comprises:

the second music score screening module is used for determining music score fingerprint characteristics with the highest similarity with the audio fingerprint characteristics according to the third fingerprint similarity as a third screening result; and screening the fingerprint characteristics of the music scores in the third screening result according to a preset second similarity threshold, and constructing a second music score candidate set according to the music scores corresponding to the screened fingerprint characteristics of the music scores.

35. The apparatus of claim 28, wherein the fingerprint feature distance calculation module comprises:

the sampling point distance matrix calculation module is used for respectively constructing sampling point distance matrixes based on the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequences of the music scores;

the path calculation module is used for calculating fingerprint characteristic distances between the audio fingerprint characteristics in the audio fingerprint characteristic sequence and the music score fingerprint characteristics in the music score fingerprint characteristic sequence and taking the fingerprint characteristic distances as sampling point distances, and determining the shortest path between the audio fingerprint characteristic sequence and the music score fingerprint characteristic sequence according to the minimum value of the sampling point distances;

and the target music score determining module is used for acquiring a plurality of shortest paths between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequences of the music scores and determining the target music score matched with the audio to be matched according to the minimum value of the shortest paths.

36. The apparatus of claim 35, further comprising:

and the corresponding relation determining module is used for determining the measure corresponding relation between the audio fingerprint feature sequence and the music score fingerprint feature sequence of the target music score according to the shortest path between the music score fingerprint feature sequence of the target music score and the audio fingerprint feature sequence of the audio to be matched.

37. The apparatus of claim 36, further comprising:

and the matching display module is used for responding to the playing control operation of the audio to be matched and displaying the music score measure corresponding to the currently played audio measure in a graphical user interface according to the measure corresponding relation between the audio fingerprint feature sequence of the audio to be matched and the music score fingerprint feature sequence of the target music score.

38. The apparatus of claim 27, further comprising:

the audio fingerprint calculation module is used for carrying out spectrum conversion processing on the audio to be matched to obtain corresponding spectrum data; performing note identification on the frequency spectrum data to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain audio fingerprint characteristics corresponding to each group; and constructing an audio fingerprint characteristic sequence based on the audio fingerprint characteristics corresponding to each group, and configuring the audio fingerprint characteristic sequence as the audio fingerprint characteristic sequence corresponding to the audio to be matched.

39. The apparatus of claim 27, further comprising:

the music score fingerprint characteristic sequence generation module is used for identifying each music score in the music score database and acquiring a note sequence of at least one corresponding music track; converting the note sequence according to a preset music score speed to obtain a corresponding pitch sequence; sampling the pitch sequence, and grouping sampling results to obtain music score fingerprint characteristics corresponding to each group; and constructing a music score fingerprint feature sequence corresponding to the preset music score speed based on music score fingerprint features corresponding to each group, and configuring the music score fingerprint feature sequence corresponding to the music score.

40. The apparatus of claim 27, further comprising:

the identification matching module is used for extracting identification information corresponding to the audio to be matched and matching the identification information with each audio in an audio database according to the identification information, wherein the audio database comprises a plurality of audios and audio fingerprint characteristic sequences corresponding to the audios; if the identification information is successfully matched with the audio in an audio database, configuring the audio fingerprint characteristic sequence corresponding to the matched audio as the audio fingerprint characteristic sequence corresponding to the audio to be matched; or if the matching fails according to the identification information, extracting the characteristics of the audio to be matched so as to obtain an audio fingerprint characteristic sequence corresponding to the audio to be matched.

41. A multimedia data matching apparatus, comprising:

the music score fingerprint feature acquisition module is used for acquiring a music score to be matched and determining a music score fingerprint feature sequence corresponding to the music score to be matched;

and the audio matching module is used for calculating the fourth fingerprint similarity of the music score fingerprint feature sequence and the audio fingerprint feature sequences of the audios in the audio database, and determining a target audio matched with the music score to be matched according to the fourth fingerprint similarity.

42. The apparatus of claim 41, wherein the audio matching module comprises:

and the fingerprint characteristic distance calculation module is used for calculating the fingerprint characteristic distances of the audio fingerprint characteristic sequences corresponding to the audios in the audio database and the music score fingerprint characteristic sequences corresponding to the music score to be matched, and determining the target audio matched with the music score to be matched according to the fingerprint characteristic distances.

43. The apparatus of claim 41, further comprising:

the first audio candidate set calculation module is used for generating a music score fingerprint feature sequence to be detected according to the music score fingerprint feature sequence corresponding to the music score to be matched; respectively calculating fifth fingerprint similarity of the music score fingerprint feature sequence to be detected and the audio fingerprint feature sequences corresponding to the audios; determining a first audio candidate set in the audio database according to the fifth fingerprint similarity; and the audio fingerprint feature sequence corresponding to the audio in the first audio candidate set is used for calculating the fingerprint feature distance of the music score fingerprint feature sequence corresponding to the music score to be matched.

44. The apparatus of claim 43, further comprising:

the second audio candidate set calculating module is used for calculating sixth fingerprint similarity of the music score fingerprint feature sequence corresponding to the music score to be matched and the audio fingerprint feature sequences corresponding to the audios in the first audio candidate set respectively; screening the audios in the first audio candidate set according to the sixth fingerprint similarity, and determining a second audio candidate set; and the audio fingerprint feature sequence corresponding to the audio in the second audio candidate set is used for calculating the fingerprint feature distance of the music score fingerprint feature sequence corresponding to the music score to be matched.

45. The apparatus of claim 43, wherein the first audio candidate set calculation module comprises:

a calculation module for the music score fingerprint feature sequence to be detected, configured to: determine the variance of each music score fingerprint feature in the music score fingerprint feature sequence and select the m music score fingerprint features with the largest variance; and determine the similarity among the m music score fingerprint features, retain the n music score fingerprint features with the lowest similarity, and take the n music score fingerprint features as the music score fingerprint feature sequence to be detected; wherein m and n are positive integers, and m is greater than n.
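
As a non-limiting illustration, the two-stage selection recited in claim 45 can be sketched in Python. The claim fixes neither the similarity measure nor how "lowest similarity" is evaluated, so this sketch assumes cosine similarity and ranks each shortlisted feature by its mean similarity to the other shortlisted features. The function names `cosine` and `select_probe_features` are hypothetical.

```python
import math
from statistics import pvariance


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_probe_features(features, m, n):
    """Return the indices of the n features kept as the sequence to be detected.

    features: list of fingerprint vectors (one per score fingerprint feature).
    """
    if not 0 < n < m <= len(features):
        raise ValueError("require 0 < n < m <= len(features)")

    # Stage 1: shortlist the m features with the largest variance.
    shortlist = sorted(
        range(len(features)),
        key=lambda i: pvariance(features[i]),
        reverse=True,
    )[:m]

    # Stage 2 (assumption): score each shortlisted feature by its mean
    # similarity to the other shortlisted features, keep the n lowest.
    def mean_similarity(i):
        others = [j for j in shortlist if j != i]
        return sum(cosine(features[i], features[j]) for j in others) / len(others)

    kept = sorted(shortlist, key=mean_similarity)[:n]
    return sorted(kept)  # restore the original temporal order
```

Returning indices rather than vectors keeps the position of each retained feature in the original sequence available to later steps.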

46. The apparatus of claim 45, wherein the first audio candidate set calculation module further comprises:

a first audio screening module, configured to: determine, for each of the n music score fingerprint features, the k audio fingerprint features in the audio fingerprint feature sequences that have the highest similarity to that music score fingerprint feature, so as to obtain n × k audio fingerprint features as a fourth screening result, wherein k is a positive integer; screen the audio fingerprint features in the fourth screening result according to a preset third similarity threshold to obtain a fifth screening result; and determine audios according to the fifth screening result and construct the first audio candidate set from them.
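
As a non-limiting illustration, the top-k retrieval and threshold screening recited in claim 46 can be sketched in Python. The sketch assumes cosine similarity and a brute-force scan over every audio fingerprint feature; a production system would more likely use an approximate nearest-neighbour index. The names `first_audio_candidates` and `audio_db`, and the dictionary layout of the database, are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_audio_candidates(probes, audio_db, k, threshold):
    """Build the first audio candidate set.

    probes:    the n score fingerprint features to be detected.
    audio_db:  dict mapping audio id -> list of audio fingerprint features.
    k:         number of nearest audio features kept per probe.
    threshold: stand-in for the preset third similarity threshold.
    """
    # Flatten the database so every feature remembers which audio owns it.
    index = [(audio_id, feat) for audio_id, seq in audio_db.items() for feat in seq]

    candidates = set()
    for probe in probes:
        scored = ((cosine(probe, feat), audio_id) for audio_id, feat in index)
        # Fourth screening result: the k most similar audio features per probe.
        top_k = heapq.nlargest(k, scored, key=lambda item: item[0])
        # Fifth screening result: keep only hits at or above the threshold.
        candidates.update(audio_id for sim, audio_id in top_k if sim >= threshold)
    return candidates
```

For example, with `audio_db = {"a": [[1, 0], [0, 1]], "b": [[1, 1]]}`, a single probe `[1, 0]`, `k = 2` and a threshold of 0.9, only audio `"a"` survives the screening.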

47. The apparatus of claim 44, wherein the second audio candidate set calculation module comprises:

a sixth fingerprint similarity calculation module, configured to calculate the fingerprint similarity between each music score fingerprint feature in the music score fingerprint feature sequence corresponding to the music score to be matched and each audio fingerprint feature in the audio fingerprint feature sequence corresponding to each audio in the first audio candidate set, so as to obtain the sixth fingerprint similarity.

48. The apparatus of claim 47, wherein the second audio candidate set calculation module further comprises:

a second music score screening module, configured to: determine, according to the sixth fingerprint similarity, the audio fingerprint features having the highest similarity to the music score fingerprint features as a sixth screening result; screen the audio fingerprint features in the sixth screening result according to a preset fourth similarity threshold; and construct the second audio candidate set from the audios corresponding to the audio fingerprint features that pass the screening.
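
As a non-limiting illustration, the screening recited in claims 47 and 48 can be sketched in Python. The claims do not state whether the best match is taken per audio or across all candidates; this sketch assumes that, for each music score fingerprint feature, the single most similar audio fingerprint feature across the whole first audio candidate set is retained and then compared with the threshold. Cosine similarity and the name `second_audio_candidates` are likewise assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def second_audio_candidates(score_seq, first_candidates, threshold):
    """Narrow the first audio candidate set using the full score sequence.

    score_seq:        all score fingerprint features of the score to be matched.
    first_candidates: dict mapping audio id -> list of audio fingerprint features.
    threshold:        stand-in for the preset fourth similarity threshold.
    """
    selected = set()
    for score_feat in score_seq:
        # Sixth fingerprint similarity: this score feature against every
        # audio feature of every audio in the first candidate set.
        scored = [
            (cosine(score_feat, audio_feat), audio_id)
            for audio_id, seq in first_candidates.items()
            for audio_feat in seq
        ]
        if not scored:
            continue
        # Sixth screening result: the best-matching audio feature.
        best_sim, best_audio = max(scored, key=lambda item: item[0])
        if best_sim >= threshold:
            selected.add(best_audio)
    return selected
```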

49. The apparatus of claim 42, wherein the fingerprint feature distance calculation module comprises:

a sampling point distance matrix calculation module, configured to construct, for each audio, a sampling point distance matrix based on the music score fingerprint feature sequence and the audio fingerprint feature sequence of that audio;

a path calculation module, configured to calculate the fingerprint feature distances between the music score fingerprint features in the music score fingerprint feature sequence and the audio fingerprint features in the audio fingerprint feature sequence, to take the fingerprint feature distances as sampling point distances, and to determine the shortest path between the music score fingerprint feature sequence and the audio fingerprint feature sequence according to the minimum values of the sampling point distances;

and a target audio determining module, configured to acquire the plurality of shortest paths between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequences of the audios, and to determine the target audio matching the music score to be matched according to the minimum of the shortest paths.
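
As a non-limiting illustration, the distance matrix, shortest path and target selection recited in claim 49 can be sketched in Python. The claim does not name a particular algorithm; this sketch assumes a dynamic-time-warping style accumulation with Euclidean distance as the fingerprint feature distance. The names `shortest_path_cost` and `match_target_audio` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shortest_path_cost(score_seq, audio_seq):
    """Cost of the cheapest monotonic path through the distance matrix."""
    rows, cols = len(score_seq), len(audio_seq)
    # Sampling point distance matrix: one entry per (score, audio) feature pair.
    dist = [[euclidean(s, a) for a in audio_seq] for s in score_seq]

    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning the first i score features
    # with the first j audio features.
    acc = [[inf] * (cols + 1) for _ in range(rows + 1)]
    acc[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            acc[i][j] = dist[i - 1][j - 1] + min(
                acc[i - 1][j],      # score advances, audio holds
                acc[i][j - 1],      # audio advances, score holds
                acc[i - 1][j - 1],  # both advance
            )
    return acc[rows][cols]


def match_target_audio(score_seq, audio_db):
    """Return (best audio id, per-audio costs) for the score to be matched."""
    costs = {
        audio_id: shortest_path_cost(score_seq, audio_seq)
        for audio_id, audio_seq in audio_db.items()
    }
    return min(costs, key=costs.get), costs
```

An audio that plays the same pitches as the score at a different pace still obtains a low cost, because the path may hold one sequence while the other advances.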

50. The apparatus of claim 49, further comprising:

a node correspondence determining module, configured to determine the node correspondence between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequence of the target audio, according to the shortest path between the audio fingerprint feature sequence of the target audio and the music score fingerprint feature sequence of the music score to be matched.
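
As a non-limiting illustration, the node correspondence recited in claim 50 can be read off by backtracking along the shortest path. The Python sketch below assumes the same dynamic-time-warping style accumulation and Euclidean distance as a plausible realization of the shortest path, and assumes both sequences are non-empty. The name `align_nodes` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def align_nodes(score_seq, audio_seq):
    """Return the shortest path as a list of (score_index, audio_index) pairs."""
    rows, cols = len(score_seq), len(audio_seq)
    inf = float("inf")
    acc = [[inf] * (cols + 1) for _ in range(rows + 1)]
    acc[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = euclidean(score_seq[i - 1], audio_seq[j - 1])
            acc[i][j] = step + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the end of both sequences to the start, always moving
    # to the cheapest predecessor (diagonal preferred on ties).
    path = []
    i, j = rows, cols
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        predecessors = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(predecessors, key=lambda p: p[0])
    path.reverse()
    return path


# Usage: map each audio node to a score node, e.g. to follow playback.
path = align_nodes([[0], [1], [2]], [[0], [1], [1], [2]])
audio_to_score = {audio_idx: score_idx for score_idx, audio_idx in path}
```

In the usage lines, `audio_to_score` keeps the last score node seen for each audio node, which is one simple way to look up the score position for a given playback position.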

51. The apparatus of claim 50, further comprising:

a corresponding display control module, configured to, in response to a playing control operation on the music score to be matched, display in a graphical user interface the music score measure corresponding to the currently played audio measure, according to the measure correspondence between the music score fingerprint feature sequence of the music score to be matched and the audio fingerprint feature sequence of the target audio.

52. The apparatus of claim 50, wherein the score to be matched comprises a score image to be matched;

the device further comprises:

a music score image processing module, configured to: perform image recognition on the music score image to be matched to acquire original music score data; convert the original music score data to obtain a note sequence of at least one track; convert the note sequence according to the music score speed contained in the music score data to obtain a corresponding pitch sequence; sample the pitch sequence and group the sampling results to obtain the music score fingerprint feature corresponding to each group; and construct, based on the music score fingerprint features corresponding to the groups, a music score fingerprint feature sequence corresponding to a preset music score speed, and configure it as the music score fingerprint feature sequence corresponding to the music score to be matched.
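
As a non-limiting illustration, the conversion from a note sequence to grouped samples recited in claim 52 can be sketched in Python. Image recognition is out of scope here, and the sketch starts from an already extracted note sequence. It assumes notes are given as (MIDI pitch, length in beats) pairs, that the music score speed is expressed in beats per minute, and that each group of raw pitch samples stands in for one fingerprint feature; the claim does not specify how a group is turned into a feature. The names `pitch_samples` and `group_samples` and all parameters are hypothetical.

```python
def pitch_samples(notes, tempo_bpm, sample_rate_hz):
    """Expand a note sequence into a uniformly sampled pitch sequence.

    notes: list of (midi_pitch, length_in_beats) pairs for one track.
    """
    seconds_per_beat = 60.0 / tempo_bpm
    samples = []
    elapsed = 0.0  # end time, in seconds, of the notes processed so far
    for pitch, beats in notes:
        elapsed += beats * seconds_per_beat
        # Emit samples up to the end time of this note; tracking the total
        # avoids rounding drift accumulating across notes.
        target = round(elapsed * sample_rate_hz)
        samples.extend([pitch] * (target - len(samples)))
    return samples


def group_samples(samples, group_size, hop):
    """Slice the samples into overlapping fixed-size groups."""
    return [
        samples[start:start + group_size]
        for start in range(0, len(samples) - group_size + 1, hop)
    ]


# Usage: three notes at 120 beats per minute, sampled at 4 Hz.
samples = pitch_samples([(60, 1), (62, 1), (64, 2)], tempo_bpm=120, sample_rate_hz=4)
groups = group_samples(samples, group_size=4, hop=2)
```

Rendering the same notes at a different `tempo_bpm` yields a sequence for another speed, which is one way to obtain a sequence for a preset music score speed.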

53. A storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the multimedia data matching method according to any one of claims 1 to 26.

54. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to execute the multimedia data matching method of any one of claims 1 to 26 via execution of the executable instructions.