CN114817620A - Song comparison method and device, equipment, medium and product thereof - Google Patents


Info

Publication number
CN114817620A
CN114817620A (application CN202111491601.3A)
Authority
CN
China
Prior art keywords
song
information
feature extraction
audio data
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111491601.3A
Other languages
Chinese (zh)
Inventor
肖纯智
张超钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202111491601.3A
Publication of CN114817620A
Legal status: Pending

Classifications

    • G06F16/61 Information retrieval of audio data: Indexing; Data structures therefor; Storage structures
    • G06F16/632 Information retrieval of audio data: Querying; Query formulation
    • G06F16/635 Information retrieval of audio data: Querying; Filtering based on additional data, e.g. user or group profiles
    • G06F16/65 Information retrieval of audio data: Clustering; Classification
    • G06F16/685 Information retrieval of audio data: Retrieval using metadata automatically derived from the content, e.g. an automatically derived transcript of audio data such as lyrics
    • G06F16/686 Information retrieval of audio data: Retrieval using manually generated metadata, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G06F18/214 Pattern recognition: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Pattern recognition: Matching criteria, e.g. proximity measures
    • G06F18/241 Pattern recognition: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 Neural networks: Combinations of networks
    • G06N3/08 Neural networks: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a song comparison method and an apparatus, device, medium and product thereof, wherein the method comprises the following steps: respectively acquiring audio data of an original song and a compared song; extracting multi-scale deep semantic information of the audio data of the original song by adopting a feature extraction model trained to a convergence state, correspondingly obtaining an original high-dimensional vector; extracting multi-scale deep semantic information of the audio data of the compared song by adopting the feature extraction model, correspondingly obtaining a compared high-dimensional vector; and calculating the similarity between the original high-dimensional vector and the compared high-dimensional vector, judging whether the corresponding similarity value is greater than a first preset threshold, and determining that the compared song and the original song constitute a cover-song relationship when the similarity value is greater than the first preset threshold. The method and apparatus can recognize the cover-song relationship between an original song and a compared song and are applicable to scenarios such as judging song infringement and identifying cover versions.

Description

Song comparison method and device, equipment, medium and product thereof
Technical Field
The present application relates to the field of music information retrieval technologies, and in particular, to a song comparison method, and a corresponding apparatus, computer device, computer-readable storage medium, and computer program product.
Background
Cover-song recognition techniques have a wide range of applications. In an exemplary application, for copyright protection, it is often necessary to compare an original song with a candidate song and identify whether sufficient similarity exists between the two, in order to determine whether the candidate song infringes the copyright of the original song.
At present, with the popularity of short video, live streaming and online radio, the volume of cover-song music keeps growing and the scenarios requiring music identification become ever more complicated. Compared with the original recording, a cover version may differ in, or even completely change, musical components such as timbre, fundamental frequency, rhythm, tempo, harmony, lyrics, singing style and overall structure. Accurately recognizing the cover relationship between songs is therefore a very challenging research problem.
Several cover-recognition techniques exist in the prior art, each with shortcomings. For example: (1) traditional Landmark-based audio-fingerprinting ("listen and identify") technology can only recognize the same recording of a song and cannot recognize covers that carry differentiating information; (2) traditional melody-matching-based humming recognition can only handle clean (a cappella) singing or humming and cannot recognize covers with background accompaniment; (3) the traditional cover-recognition approach extracts audio features such as the Pitch Class Profile (PCP) and then computes the similarity distance between songs with algorithms such as dynamic programming. Given the diversity of cover versions, this approach is only suitable for covers with minor re-arrangement; its recognition accuracy is low, its recognition speed is slow, and its compatibility is limited.
Since the prior-art cover-recognition solutions lack general adaptability, have low recognition accuracy and efficiency, and are especially weak in copyright-infringement comparison scenarios, the applicant has sought more effective technical solutions.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and provide a song comparison method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
In order to meet various purposes of the application, the following technical scheme is adopted in the application:
a song comparison method adapted to one of the purposes of the present application includes the following steps:
respectively acquiring audio data of the original song and the compared song;
extracting multi-scale deep semantic information of the audio data of the original song by adopting a feature extraction model trained to a convergence state, and correspondingly obtaining an original high-dimensional index vector;
extracting multi-scale deep semantic information of the audio data of the compared song by adopting the feature extraction model, and correspondingly obtaining a compared high-dimensional index vector;
and calculating the similarity between the original high-dimensional index vector and the compared high-dimensional index vector, judging whether the corresponding similarity value is greater than a first preset threshold, and determining that the compared song and the original song constitute a cover-song relationship when the similarity value is greater than the first preset threshold.
In an expanded embodiment, in the step of calculating the similarity between the original high-dimensional index vector and the compared high-dimensional index vector and judging whether the corresponding similarity value is greater than a first preset threshold, when the similarity value is less than the first preset threshold the following steps are performed:
determining, of the original song and the compared song, the one with the shorter audio duration as a designated song, and segmenting the other, longer song using the audio duration of the designated song as the unit of measure, to obtain a plurality of song segments of the other song;
respectively extracting the multi-scale deep semantic information of the plurality of song segments of the other song by adopting the feature extraction model, and correspondingly obtaining a plurality of segment high-dimensional index vectors;
and taking the high-dimensional index vector of the designated song as a designated high-dimensional index vector, respectively calculating the similarity between the designated high-dimensional index vector and each segment high-dimensional index vector, judging whether the maximum similarity value is higher than a second preset threshold, and determining that the compared song and the original song constitute a cover-song relationship when the maximum similarity value is higher than the second preset threshold.
In a further embodiment, the method for obtaining audio data of an original song and a compared song comprises the following steps:
acquiring audio data and lyric files of the original song;
performing word segmentation according to lyrics in a lyric file of the original song, and extracting a plurality of keywords from the words;
at least one song is obtained through online searching according to any combination of the keywords, and the song obtained through searching is used as a compared song;
audio data of the compared song is acquired.
In an optional embodiment, the step of extracting the multi-scale deep semantic information of the audio data of the original song by using the feature extraction model trained to the convergence state to correspondingly obtain the original high-dimensional index vector, or the step of extracting the multi-scale deep semantic information of the audio data of the compared song by using the feature extraction model to correspondingly obtain the compared high-dimensional index vector includes the following steps:
coding the audio data to correspondingly obtain coding information of the audio data;
and extracting a high-dimensional index vector representing the multi-scale deep semantic information of the audio data according to the coding information by adopting the feature extraction model.
In a further embodiment, when the feature extraction model is called, the following steps are performed:
sequentially performing multi-stage feature extraction on the encoding information by adopting a plurality of convolution blocks in a shared network of the feature extraction model trained to a convergence state, to obtain intermediate feature information representing deep semantic information of the encoding information;
performing feature extraction at different scales on the intermediate feature information by adopting a plurality of convolution blocks in two or more branch networks of the feature extraction model, and converting the results into output feature vectors of the corresponding scales, the deep semantic information contained in the output feature vectors of the respective branch networks being different from one another;
and outputting the output feature vector of each branch network as the high-dimensional index vector by the feature extraction model.
In a further embodiment, the step of performing feature extraction at different scales on the intermediate feature information by using a plurality of convolution blocks in two or more branch networks of the feature extraction model, and converting the results into output feature vectors of the corresponding scales, includes any two or more of the following steps:
performing feature extraction on the intermediate feature information by adopting a plurality of convolution blocks in a first branch network to obtain global feature information, and pooling the global feature information into output feature vectors of the global scale;
performing feature extraction on the intermediate feature information by adopting a plurality of convolution blocks in a second branch network, dividing the result into a plurality of parts by channel, and pooling each part to correspondingly obtain output feature vectors of the channel scale;
and performing feature extraction on the intermediate feature information by adopting a plurality of convolution blocks in a third branch network, dividing the result into a plurality of parts by frequency band, and pooling each part to correspondingly obtain output feature vectors of the frequency-band scale.
In an embodiment, the first branch network performs the pooling operation by using a mean pooling operation and/or a maximum pooling operation to obtain one or two output feature vectors of the global scale; when the second branch network performs the pooling operation, adopting a mean pooling operation for a single or a plurality of channels to correspondingly obtain one or a plurality of output feature vectors of the channel scale; and when the third branch network performs the pooling operation, adopting an average pooling operation for a single frequency band or a plurality of frequency bands to correspondingly obtain one or a plurality of output feature vectors of the frequency band scale.
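The following is a minimal PyTorch sketch of how such a shared trunk plus global / channel / frequency-band branches could be arranged. It is an illustration under stated assumptions only: the patent does not fix the number of blocks, the channel count, the number of bands, or the internals of the convolution block (the conv_block factory could, for example, be the residual convolution block sketched later in this disclosure).

```python
# Sketch of the shared network + multi-scale branch networks described above.
# Block counts, channel sizes and n_bands are illustrative assumptions.
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    def __init__(self, conv_block, channels=256, n_bands=4):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)      # project spectrogram to `channels`
        self.shared = nn.Sequential(*[conv_block() for _ in range(4)])    # shared network (multi-stage)
        self.branch_global = nn.Sequential(conv_block(), conv_block())    # first branch: global scale
        self.branch_channel = nn.Sequential(conv_block(), conv_block())   # second branch: channel scale
        self.branch_band = nn.Sequential(conv_block(), conv_block())      # third branch: frequency-band scale
        self.n_bands = n_bands

    def forward(self, x):                       # x: (batch, 1, freq, time) encoding information
        mid = self.shared(self.stem(x))         # intermediate feature information

        g = self.branch_global(mid)
        g_mean = g.mean(dim=(2, 3))             # global mean pooling  -> (batch, C)
        g_max = g.amax(dim=(2, 3))              # global max pooling   -> (batch, C)

        c = self.branch_channel(mid)
        c_vecs = c.mean(dim=(2, 3))             # per-channel mean pooling, one value per channel

        b = self.branch_band(mid)
        band_parts = b.chunk(self.n_bands, dim=2)            # split along the frequency axis into bands
        b_vecs = [p.mean(dim=(2, 3)) for p in band_parts]    # mean-pool each band separately

        # The output feature vectors of all branches jointly form the high-dimensional index vector.
        return torch.cat([g_mean, g_max, c_vecs] + b_vecs, dim=1)
```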
In an optional embodiment, the source of the encoding information is any one of the time-frequency spectrum information, Mel-spectrum information, CQT filtering information, pitch contour information and Chroma feature information of the corresponding audio data.
In a preferred embodiment, in the shared network, at least one of the convolution blocks applies an attention module for extracting key information in song audio data, and the attention module is a spatial attention module or a channel attention module.
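As one concrete possibility, a channel-attention module could take a squeeze-and-excitation-like form, as sketched below. The patent only states that spatial or channel attention is applied inside at least one convolution block of the shared network; the exact structure, reduction ratio and layer names here are assumptions for illustration.

```python
# One possible channel-attention module (squeeze-and-excitation style).
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (batch, C, freq, time)
        w = x.mean(dim=(2, 3))                # squeeze: global average per channel
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                          # re-weight channels to emphasise key information
```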
In an embodiment, when the convolution block is invoked, the following steps are performed:
performing a convolution transformation on the input information to obtain transformed feature information;
applying instance normalization and batch normalization to the transformed feature information respectively, concatenating the two results into spliced feature information, and activating and outputting the spliced feature information;
performing convolution operations and batch normalization on the activated spliced feature information several times to obtain residual information;
and superimposing the residual information onto the input information of the convolution block and activating the result for output.
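A minimal sketch of such a residual convolution block is given below, following the listed steps: convolution, parallel instance/batch normalization whose outputs are concatenated and activated, further convolution plus batch normalization to form the residual, and a skip connection back to the block input. The channel split, kernel sizes and depth are illustrative assumptions, not values prescribed by the patent.

```python
# Residual convolution block with concatenated instance/batch normalization.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        half = channels // 2
        self.conv_in = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.inorm = nn.InstanceNorm2d(half, affine=True)    # instance normalization branch
        self.bnorm = nn.BatchNorm2d(half)                     # batch normalization branch
        self.act = nn.ReLU(inplace=True)
        self.res = nn.Sequential(                             # repeated conv + batch norm -> residual info
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        t = self.conv_in(x)                                            # convolution transformation
        spliced = torch.cat([self.inorm(t), self.bnorm(t)], dim=1)     # concatenate IN and BN outputs
        spliced = self.act(spliced)                                    # activate the spliced features
        residual = self.res(spliced)
        return self.act(x + residual)                                  # add to the block input, then activate
```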
In an extended embodiment, the training process of the feature extraction model includes the following steps of iterative training:
calling a training sample from a training set and determining the encoding information of the training sample, wherein the training samples are pre-collected song audio data comprising complete songs and fragments thereof;
inputting the encoding information into the feature extraction model for training, so as to obtain the corresponding output feature vectors;
respectively carrying out classification prediction on each output characteristic vector to map corresponding classification labels;
calculating a loss value of a feature extraction model by using the supervision labels and the classification labels corresponding to the training samples, and performing gradient updating on the feature extraction model according to the loss value;
and judging whether the loss value reaches a preset threshold value, and calling a next training sample in the training set to continue to carry out iterative training on the feature extraction model when the loss value does not reach the preset threshold value until the loss value reaches the preset threshold value.
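A hedged sketch of this training loop follows. The patent does not specify the classifier head, loss function, optimizer or stopping threshold; the choices below (a single classifier over the concatenated feature vector, cross-entropy, Adam, a placeholder loss threshold) are assumptions made only to show the shape of one iteration.

```python
# Iterative training sketch: classify output feature vectors, compute a loss against
# the supervision labels, update by gradient, stop when the loss reaches a preset threshold.
import torch
import torch.nn as nn

def train_until_converged(model, classifier, train_loader, target_loss=0.05, lr=1e-3):
    params = list(model.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    for encoding, label in train_loader:          # training samples: complete songs and fragments
        feature_vec = model(encoding)             # output feature vector(s)
        logits = classifier(feature_vec)          # classification prediction -> class label
        loss = criterion(logits, label)           # compare with the supervision label
        optimizer.zero_grad()
        loss.backward()                           # gradient update of the feature extraction model
        optimizer.step()
        if loss.item() <= target_loss:            # preset threshold reached: stop iterating
            break
    return model
```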
A song comparison apparatus adapted to one of the objects of the present application includes an audio acquisition module, an original-song extraction module, a compared-song extraction module and a comprehensive judgment module. The audio acquisition module is used for respectively acquiring audio data of an original song and a compared song; the original-song extraction module is used for extracting multi-scale deep semantic information of the audio data of the original song by adopting a feature extraction model trained to a convergence state, correspondingly obtaining an original high-dimensional index vector; the compared-song extraction module is used for extracting multi-scale deep semantic information of the audio data of the compared song by adopting the feature extraction model, correspondingly obtaining a compared high-dimensional index vector; and the comprehensive judgment module is used for calculating the similarity between the original high-dimensional index vector and the compared high-dimensional index vector, judging whether the corresponding similarity value is greater than a first preset threshold, and determining that the compared song and the original song constitute a cover-song relationship when the similarity value is greater than the first preset threshold.
A computer device adapted to one of the objects of the present application comprises a central processing unit and a memory, the central processing unit being used to invoke and run a computer program stored in the memory to perform the steps of the song comparison method described above.
A computer-readable storage medium stores, in the form of computer-readable instructions, a computer program implemented according to the song comparison method, and when the computer program is invoked by a computer it performs the steps included in the method.
A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the application has the following advantages:
Firstly, by means of a feature extraction model pre-trained to a convergence state, the application obtains high-dimensional index vectors representing the deep semantic information of the style-invariant features of an original song and a compared song respectively, and determines whether the compared song constitutes a cover of the original song according to the similarity between the two high-dimensional index vectors, so that functions such as copyright-infringement comparison and similar-song identification can be realized.
Secondly, the feature extraction model adopted by the application performs multi-scale feature extraction of the deep semantic information of song audio data, so that the obtained high-dimensional index vector has stronger representation capability, for example representing the global feature information, salient feature information, channel feature information and frequency-band feature information of the song audio data. This enables more effective indexing of the corresponding song audio data; performing song comparison on this basis is more tolerant of the many kinds of variation found in cover versions, yields more accurate and efficient matching, and avoids misjudgment during recognition.
In addition, with its end-to-end representation-learning capability assisted by the comparison mechanism, the application can obtain a clear effect of scale: it can be deployed in the background of an online music service platform behind a standardized interface, meet the needs of various application scenarios, provide comprehensive and multi-purpose open services, and improve the economic advantage of the platform's music information retrieval.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart diagram of an exemplary embodiment of a song comparison method of the present application;
FIG. 2 is a schematic flowchart of an extended embodiment of a song comparison method according to the present application;
FIG. 3 is a schematic flow chart illustrating the process of obtaining a compared song from an original song according to an embodiment of the present application;
FIG. 4 is a flow chart illustrating a process of obtaining a high-dimensional index vector according to audio data according to an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating the operation of a feature extraction model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a network architecture of a feature extraction model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a network architecture of a feature extraction model according to another embodiment of the present application;
FIG. 8 is a schematic flow chart of the working process of the residual convolution block used in the feature extraction model of the present application;
FIG. 9 is a schematic flow chart of a process in which the feature extraction model of the present application is trained;
FIG. 10 is a functional block diagram of a classification model accessed by the feature extraction model of the present application during a training phase;
FIG. 11 is a functional block diagram of a song comparison apparatus of the present application;
fig. 12 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., a computer program is stored in the memory, and the central processing unit calls a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server for implementation by a client remotely invoking an online service interface provided by a capture server for access, or may be deployed directly and run on the client for access.
Unless specified in clear text, the neural network model referred to or possibly referred to in the application can be deployed in a remote server and used for remote call at a client, and can also be deployed in a client with qualified equipment capability for direct call.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.
The embodiments to be disclosed herein can be flexibly constructed by cross-linking related technical features of the embodiments unless the mutual exclusion relationship between the related technical features is stated in the clear text, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or solve the deficiencies of the prior art. Those skilled in the art will appreciate variations therefrom.
The song comparison method can be programmed into a computer program product, can be deployed in computer equipment to run and can be realized as a front-end product of client equipment and can also be realized as an online service product, so that the client can access an open interface after the computer program product runs in a webpage program or application program mode, and human-computer interaction is realized through a process of a graphical user interface and the computer program product.
Referring to fig. 1, in an exemplary embodiment, a song comparison method of the present application includes the following steps:
step S1100, acquiring audio data of the original song and the compared song respectively:
in an exemplary application scenario, the technical solution of the present application is responsible for comparing song similarity between an original song and other songs, namely, compared songs.
In one embodiment, both the master song and the compared song may be specified or provided by a user; in another embodiment, the compared song may be obtained by performing association retrieval based on information associated with the master song, such as its lyric file.
After the specific information of the original song and the compared song is determined, corresponding audio data can be obtained, and the audio data can be local data or remote data which can be pulled to the local.
When a plurality of compared songs are provided, the steps of the application can be executed for each compared song one by one, so as to achieve the purpose of comparing the original song with each compared song respectively.
The songs referred to in this application, including the original song and the compared song, are not limited to any particular file format, including but not limited to MP3, WMA, M4A and WAV, and may also be audio data obtained by separating the audio track from various types of video files. This should be understood by those skilled in the art.
For comparing the original song and the compared song with each other, it is necessary to extract the multi-scale deep semantic information through steps S1200 and S1300, which will be described in detail below.
Step S1200, extracting multi-scale deep semantic information of the audio data of the original edition song by adopting a feature extraction model trained to a convergence state, and correspondingly obtaining an original edition high-dimensional index vector:
the feature extraction model for extracting deep semantic information of the songs, which is realized based on the convolutional neural network model, is trained to a convergence state in advance, and is learned to be suitable for the capability of extracting the deep semantic information of multiple scales of the audio data of the songs according to the coding information after training, so that the representation learning of the style invariant features of the audio data of the corresponding songs is realized, and the feature extraction model can be used for the requirements of query, retrieval, matching and the like among the songs.
The feature extraction model of the present application is implemented as a feature extraction model adapted to extract deep semantic information of multiple scales of the same audio data, representing the deep semantic information as single or multiple high-dimensional index vectors, so as to implement feature representation of the audio data from multiple different aspects and/or different angles. The high-dimensional index vector is essentially a high-dimensional vector that, at a semantic level, serves as an index representative of the encoding information of the corresponding audio data. The different scales comprise global scales based on the coded information or feature extraction based on frequency band scales, channel scales and the like of the coded information, and for one song, the deep semantic information of two or more scales corresponding to the coded information is selected and represented as a high-dimensional index vector, so that the feature representation of the multi-scale deep semantic information of the corresponding song can be realized.
After the feature extraction model implemented according to the above principle is trained to converge, a service interface can be opened for the technical scheme of this embodiment to call, the encoding information of the original song is input thereto, and feature extraction is performed by the feature extraction model on the basis of the encoding information to obtain a corresponding high-dimensional index vector. As for the principle and process of converting audio data into corresponding coding information, those skilled in the art can flexibly set the principle and process to realize any one of corresponding coding information such as time-frequency spectrum information, mel spectrum information, CQT filtering information, level contour information, Chroma characteristic information and the like of the audio data. The proposed coding principles and procedures will be further disclosed in the present application by other embodiments, which are not presented here.
It should be understood that since the feature extraction model can extract deep semantic information of a song from multiple scales, there can be different organization forms when converting the deep semantic information of different scales into the high-dimensional index vector, for example, representing the high-dimensional index vector as a single high-dimensional vector, which generally represents deep semantic information of a song as a whole; or, the high-dimensional index vector is expressed into a plurality of discrete high-dimensional vectors according to the corresponding relation of the scales, and each high-dimensional vector corresponds to one scale. In any case, those skilled in the art can flexibly organize these high-dimensional vectors according to the need of the actual scale semantic information, so as to facilitate the invocation of the representation data of the overall deep semantic information of the song.
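For illustration only, the two organizational forms just mentioned could be handled as in the small sketch below; the scale names and the merge flag are assumptions, not terminology from the patent.

```python
# Keep one vector per scale, or concatenate them into a single high-dimensional index vector.
import numpy as np

def build_index_vector(scale_vectors: dict, merge: bool = True):
    """scale_vectors e.g. {"global": v0, "channel": v1, "band": v2}, each a 1-D array."""
    if merge:
        return np.concatenate([scale_vectors[k] for k in sorted(scale_vectors)])
    return scale_vectors        # keep one discrete high-dimensional vector per scale
```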
For the present step, feature extraction is performed on the encoded information of the original song through the feature extraction model, and finally, a high-dimensional index vector corresponding to the original song can be obtained and used as an original high-dimensional index vector for subsequent similarity matching.
Step S1300, extracting the multi-scale deep semantic information of the audio data of the compared songs by adopting the feature extraction model, and correspondingly obtaining the high-dimensional index vector of the compared songs:
and in parallel with the previous step, calling the feature extraction model trained to the convergence state to extract the multi-scale deep semantic information of the coding information of the audio data of the compared song, and obtaining a corresponding high-dimensional index vector as the high-dimensional index vector of the compared song. As to the encoding principles and processes related to obtaining the encoded information herein, the same will be further disclosed in the following embodiments.
It should be emphasized that steps S1200 and S1300 can be performed in parallel to increase the overall comparison speed.
Step S1400, calculating the similarity between the original high-dimensional index vector and the compared high-dimensional index vector, judging whether the corresponding similarity value is greater than a first preset threshold, and determining that the compared song and the original song constitute a cover-song relationship when the similarity value is greater than the first preset threshold:
On the basis of the original high-dimensional index vector corresponding to the original song and the compared high-dimensional index vector corresponding to the compared song, a preset similarity formula can be applied to compute a similarity value between the two songs. The similarity formula can be implemented with any algorithm suitable for computing the similarity distance between data, such as cosine similarity, the Euclidean distance, the Pearson coefficient, Jaccard similarity, or a nearest-neighbour search algorithm, and can be chosen flexibly by a person skilled in the art. After the similarity computation, a similarity value between the original high-dimensional index vector of the original song and the compared high-dimensional index vector of the compared song is obtained.
Once the similarity value is determined, a preset threshold, referred to as the first preset threshold, can be applied; it may be an empirical or experimental value. The similarity value is compared with the first preset threshold: if it is greater than (or equal to) the first preset threshold, the original song and the compared song are sufficiently similar, so the two songs can be judged to constitute a cover-song relationship, and in a copyright-infringement comparison scenario the compared song can be deemed a suspected infringement of the original song. Otherwise, if the similarity value is smaller than the first preset threshold, the original song and the compared song are not sufficiently similar, and it can be simply judged that they do not constitute a cover-song relationship; in a copyright-infringement comparison scenario, the compared song can accordingly be deemed not to constitute a suspected infringement of the original song. Of course, this does not yet cover the more complicated situations in which malicious editing of a cover increases the difficulty of recognition; more refined judgment for such cases is realized in further embodiments below and is not expanded here.
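A minimal sketch of this decision using cosine similarity, one of the measures listed above, is given below; the function name and the threshold value are placeholders, and any of the other listed similarity measures could be substituted.

```python
# Step S1400 sketch: cosine similarity against the first preset threshold.
import numpy as np

def is_cover(original_vec: np.ndarray, compared_vec: np.ndarray, threshold: float = 0.8) -> bool:
    cos_sim = float(np.dot(original_vec, compared_vec) /
                    (np.linalg.norm(original_vec) * np.linalg.norm(compared_vec) + 1e-12))
    return cos_sim >= threshold     # greater than (or equal to) the first preset threshold
```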
Thus, given an original song and a compared song, the present embodiment recognizes whether the two constitute a cover-song relationship and can output a comparison result.
In other embodiments that will be subsequently disclosed in this application, many variations based on the exemplary embodiment will be further disclosed, and not shown here for the moment. It will be understood from the description of the exemplary embodiments that the present application may be embodied with various advantages, including but not limited to the following:
Firstly, by means of a feature extraction model pre-trained to a convergence state, the application obtains high-dimensional index vectors representing the deep semantic information of the style-invariant features of an original song and a compared song respectively, and determines whether the compared song constitutes a cover of the original song according to the similarity between the two high-dimensional index vectors, so that functions such as copyright-infringement comparison and similar-song identification can be realized.
Secondly, the feature extraction model adopted by the application performs multi-scale feature extraction of the deep semantic information of song audio data, so that the obtained high-dimensional index vector has stronger representation capability, for example representing the global feature information, salient feature information, channel feature information and frequency-band feature information of the song audio data. This enables more effective indexing of the corresponding song audio data; performing song comparison on this basis is more tolerant of the many kinds of variation found in cover versions, yields more accurate and efficient matching, and avoids misjudgment during recognition.
In addition, with its end-to-end representation-learning capability assisted by the comparison mechanism, the application can obtain a clear effect of scale: it can be deployed in the background of an online music service platform behind a standardized interface, meet the needs of various application scenarios, provide comprehensive and multi-purpose open services, and improve the economic advantage of the platform's music information retrieval.
Some situations make cover-song recognition particularly difficult, for example when the cover song is spliced together from several local segments that lie far apart in time within the original song. In such a case, if the deep semantic information of the cover song is represented by only a single high-dimensional index vector and the deep semantic information of the original song likewise by a single high-dimensional index vector, the difference between the two is often large, and accurate recognition cannot be achieved by directly computing the similarity of the two vectors. The converse situation also arises, in which several short segments of the original song are scattered within a cover song of longer duration.
In view of the above situation, the technical solution of the present application is further deepened. Referring to fig. 2, in an expanded embodiment, in the step S1400 of calculating the similarity between the original high-dimensional index vector and the compared high-dimensional index vector and judging whether the corresponding similarity value is greater than the first preset threshold, when the similarity value is not greater than the first preset threshold the following steps are performed:
step S1500, determining the song with the original edition and the compared song with the shorter audio time as the appointed song, and segmenting another song with the longer audio time by taking the audio time of the appointed song as a measurement to obtain a plurality of song segments of another song:
in the case where the audio durations of the master song and the compared song do not coincide, one of the songs having a relatively shorter audio duration may be selected as the designated song (e.g., master song). Then, taking the audio time duration of the specified song as a measure, segmenting another song (for example, a compared song) in the specified song, and segmenting the other song into a plurality of song segments according to the audio time duration, wherein each song segment obtains a corresponding audio data, and similarly, corresponding encoding can be performed according to the encoding principle and process of the application to obtain corresponding encoding information for subsequent processing.
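A sketch of this segmentation step on raw waveforms is shown below; the handling of the final partial segment and the function name are assumptions made for illustration.

```python
# Step S1500 sketch: cut the longer audio into segments of the designated song's duration.
import numpy as np

def segment_by_duration(long_audio: np.ndarray, short_audio: np.ndarray):
    seg_len = len(short_audio)                       # measure: duration of the designated song
    segments = [long_audio[i:i + seg_len]
                for i in range(0, len(long_audio), seg_len)]
    if len(segments) > 1 and len(segments[-1]) < seg_len // 2:
        segments.pop()                               # drop a very short trailing remainder
    return segments                                  # each segment is encoded and embedded separately
```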
Step S1600, respectively extracting the multi-scale deep semantic information of a plurality of song segments of the other song by adopting the feature extraction model, and correspondingly obtaining a plurality of segment high-dimensional index vectors:
the high-dimensional index vector of the specified song is extracted in the previous step, so that the feature extraction is only carried out on the coding information of the audio data corresponding to the plurality of song segments of the other song in the step. In this regard, the feature extraction model of the present application is still used to extract the multi-scale deep semantic information of each song segment, and accordingly, a high-dimensional index vector of each song segment is obtained as a segment high-dimensional index vector.
Step S1700, taking the high-dimensional index vector of the designated song as the designated high-dimensional index vector, respectively calculating the similarity between the designated high-dimensional index vector and each segment high-dimensional index vector, judging whether the maximum similarity value is higher than a second preset threshold, and determining that the compared song and the original song constitute a cover-song relationship when the maximum similarity value is higher than the second preset threshold:
The high-dimensional index vector of the designated song, referred to here as the designated high-dimensional index vector, has already been determined. To compare it with the segment high-dimensional index vector of every song segment, the similarity between the designated high-dimensional index vector and each segment high-dimensional index vector is computed, giving a corresponding similarity value for each segment and thus a similarity sequence.
As before, a preset similarity formula can be applied to compute the similarity value between the designated song and each song segment; the formula can be implemented with any algorithm suitable for computing the similarity distance between data, such as cosine similarity, the Euclidean distance, the Pearson coefficient, Jaccard similarity, or a nearest-neighbour search algorithm, chosen flexibly by a person skilled in the art. After the similarity computation, a similarity value between the designated high-dimensional index vector of the designated song and the segment high-dimensional index vector of each song segment is obtained.
Once the similarity values are determined, a further preset threshold, referred to as the second preset threshold, is used; it may be an empirical or experimental value. The maximum similarity value in the similarity sequence is compared with the second preset threshold: if it is higher than (or equal to) the second preset threshold, the designated song is sufficiently similar to at least one song segment, so the original song and the compared song can be judged to constitute a cover-song relationship, and in a copyright-infringement comparison scenario the compared song can be deemed a suspected infringement of the original song. Otherwise, if the maximum similarity value is lower than the second preset threshold, the designated song is not sufficiently similar to any of the song segments, so the compared song and the original song do not constitute a cover-song relationship, and in a copyright-infringement comparison scenario the compared song does not constitute a suspected infringement of the original song.
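The decision of step S1700 could be sketched as follows; cosine similarity and the threshold value are assumptions, consistent with the earlier sketch for step S1400.

```python
# Step S1700 sketch: maximum segment similarity against the second preset threshold.
import numpy as np

def segments_contain_cover(designated_vec, segment_vecs, second_threshold: float = 0.8) -> bool:
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    similarities = [cos(designated_vec, v) for v in segment_vecs]   # similarity sequence
    return max(similarities) >= second_threshold                    # cover relationship if any segment matches
```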
The second preset threshold and the first preset threshold are generally set respectively, and in a simplified embodiment, the second preset threshold and the first preset threshold may also be equivalent data to simplify the setting.
As can be seen from this embodiment, when the original high-dimensional index vector of the original song and the compared high-dimensional index vector of the compared song are judged not to constitute a cover-song relationship, the song with the shorter audio duration is further determined as the designated song and the other song is segmented into a plurality of song segments; the similarity between the high-dimensional index vector of the designated song and the high-dimensional index vector of each song segment is then computed one by one, and whether a cover-song relationship exists is judged from these similarity values. This realizes a more refined recognition of the cover-song relationship, and even when the covered content is scattered, recognition remains compatible, improving accuracy; in particular, in song-infringement comparison, suspected infringing works can be identified more quickly and efficiently.
In order to facilitate the online service, and to implement the online monitoring of infringement behavior automatically according to the original song, please refer to fig. 3, in a further embodiment, the step S1100 of obtaining the audio data of the original song and the compared song includes the following steps:
step S1110, acquiring audio data of the original song and the lyric file:
the song original user can entrust the online service platform to search the suspected infringing product of the original song, and the original song can be provided for the online service platform in a pre-configuration mode, a real-time submitting mode and the like. In any case, the computer device according to the present application can obtain the audio data corresponding to the original song and the lyric file thereof from the obtained storage address in a controlled or timed manner.
Step S1120, performing word segmentation according to the lyrics in the lyrics file of the original song, and extracting a plurality of keywords from the words:
the content stored in the lyric file can be commonly referred to as generalized lyrics, and the generalized lyrics mainly comprise song names, composer information, writer information, multiple sentences of lyrics corresponding to the melody, time stamps corresponding to the lyrics of each sentence and the like. Therefore, the lyric file contains rich text information, and accordingly, the lyric file can be segmented by means of a keyword extraction model realized by a natural language technology, and then a plurality of keywords are extracted. The technology for realizing the keyword extraction model comprises a plurality of technical approaches based on statistical characteristics, word graphs, topic models and the like, and an exemplary algorithm comprises the following steps: TF-IDF, TextRank, etc., can also be realized by adopting a pre-trained neural network model such as Bert, Electrora, etc., for which the technical personnel in the field can flexibly call. In summary, a person skilled in the art may extract a plurality of keywords from a lyrics file of an original song by means of a number of prior art techniques.
Step S1130, at least one song is obtained through online searching according to any combination of the keywords, the song obtained through searching is used as a compared song:
When a plurality of keywords is obtained, the keywords can be combined arbitrarily to obtain a plurality of search expressions; for example, given {keyword_1, keyword_2, keyword_3}, six search expressions can be combined according to the permutation-and-combination principle. On this basis, an interface provided by an online search engine is called, any one or more of the search expressions are passed in, an online search is performed, and the corresponding search results are obtained. By analysing the result pages with various mature technical means, those skilled in the art can identify the songs that appear in the search results, and these songs can be taken as the compared songs of the present application for subsequent processing.
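The following sketch illustrates steps S1120-S1130 under stated assumptions: jieba's TF-IDF keyword extraction is used as one concrete choice of keyword-extraction technique, and ordered keyword pairs are used as the combination rule (three keywords then give six expressions, matching the example above); neither choice is prescribed by the patent, and the search call itself is left as a placeholder.

```python
# Extract keywords from lyric text and combine them into search expressions.
from itertools import permutations
import jieba.analyse

def build_search_expressions(lyric_text: str, top_k: int = 3):
    keywords = jieba.analyse.extract_tags(lyric_text, topK=top_k)      # word segmentation + TF-IDF keywords
    expressions = [" ".join(p) for p in permutations(keywords, 2)]     # e.g. 3 keywords -> 6 expressions
    return keywords, expressions

# Each expression would then be passed to an online search engine interface, or to the
# platform's own music-library query interface, to collect candidate compared songs.
```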
In a simplified improved embodiment, the online service platform can legally access a music library that it holds itself, or that is held by a third party but exposes an open interface, and can query that library through its corresponding online search engine, so that the compared songs can be obtained quickly and conveniently.
Step S1140, obtaining the audio data of the compared song:
after determining each compared song, the audio data of each compared song can be further downloaded or copied for subsequent processing in the application.
In this embodiment, the lyric file of the original song serves as the keyword source, online songs can be searched and monitored by means of these keywords to determine the range that needs comparison, and the songs within this range are further subjected to sing-turning relation identification by the method of the present application. This is particularly helpful for intelligent monitoring of song copyright infringement and helps purify the ecology of online music.
Referring to fig. 4, in an alternative embodiment, the step S1200 of extracting the multi-scale deep semantic information of the audio data of the original song by using the feature extraction model trained to the convergence state to correspondingly obtain the original high-dimensional index vector, and/or the step S1300 of extracting the multi-scale deep semantic information of the audio data of the compared song by using the feature extraction model to correspondingly obtain the compared high-dimensional index vector, includes the following steps:
step S0001, encoding the audio data, and correspondingly obtaining encoding information:
before deep semantic information is extracted from any audio data with the feature extraction model, the audio data needs to be encoded to obtain its corresponding encoded information; the deep semantic information is then extracted on the basis of this encoded information.
As described above, the audio data may be audio data in any format such as MP3, WMA, M4A, WAV, and the like, or audio data obtained by separating audio from various types of video files. Audio data is typically composed of multiple voice data packets in the time domain. On the basis, corresponding conversion processing is carried out on the voice data packet according to the specific coding information type, so that corresponding coding information can be obtained.
The encoded information mainly serves to describe style-invariant features of the audio data and may be of various types, including but not limited to time-frequency spectrum information, mel-frequency spectrum information, CQT filtering information, pitch-level contour information and Chroma feature information extracted from the voice data packets of the audio data. Such information can be encoded with the corresponding algorithm to obtain encoded information of the corresponding type, and any of these types may be used in the present application for feature extraction. In practice, the CQT filtering information, which performed best in actual measurement, is recommended as the basis of the encoded information.
Those skilled in the art will appreciate that each of the above types of encoded information can be obtained with the corresponding algorithm. In the encoding process, conventional pre-emphasis, framing and windowing are first applied to the audio data, after which time-domain or frequency-domain analysis, i.e. speech signal analysis, is performed. The purpose of pre-emphasis is to boost the high-frequency part of the speech signal and flatten its spectrum; it is typically implemented with a first-order high-pass filter. Before analysis, the speech signal is divided into frames, the length of each frame is usually set to 20 ms, and an overlap of 10 ms between adjacent frames can be used to account for frame shift. Framing is realized by applying a window to the speech signal; different window choices affect the analysis result, and a Hamming window function is commonly used for the windowing operation.
On the basis of completing the preprocessing required by the voice signal analysis of the song audio data, the time domain and the frequency domain can be further analyzed, so as to realize the coding and obtain the corresponding coding information:
For the time-frequency spectrum information, the voice data of each voice data packet is pre-emphasized, framed, windowed and transformed from the time domain to the frequency domain by a short-time Fourier transform (STFT), yielding the data of a spectrogram, which constitutes the time-frequency spectrum information.
The mel-frequency spectrum information can be obtained by filtering the time-frequency spectrum information with a mel-scale filter bank; similarly, the corresponding mel cepstrum information, obtained by taking the logarithm of the mel spectrum and applying a DCT, is also applicable. It can be appreciated that mel-frequency spectrum information and its mel cepstrum describe style-invariant characteristics of a song, such as pitch, intonation and timbre, relatively well.
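For illustration only, the time-frequency spectrum, mel spectrum and mel cepstrum described above could be computed with the librosa library as sketched below; the library choice, the input file name and the parameter values (roughly matching the 20 ms frames and 10 ms frame shift mentioned earlier) are assumptions.

```python
# Sketch only: STFT spectrogram, mel spectrogram and MFCC with librosa.
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050)           # hypothetical input file
n_fft = int(0.020 * sr)                               # ~20 ms frames
hop = int(0.010 * sr)                                 # ~10 ms frame shift

stft = librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hamming")
spectrogram = np.abs(stft) ** 2                       # time-frequency spectrum
mel = librosa.feature.melspectrogram(S=spectrogram, sr=sr, n_mels=80)
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=20)  # log + DCT
```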
For the CQT filtering information: all tones in music are drawn from the 12-tone equal temperament over several octaves, i.e. the twelve semitones of an octave on a piano, and the frequency ratio between adjacent semitones is 2^(1/12). Clearly, for the same scale degree one octave apart, the higher note has twice the frequency of the lower one, so musical pitches are distributed exponentially; the spectrum obtained by the Fourier transform, however, is linearly distributed, and the frequency bins of the two cannot be put into one-to-one correspondence, which introduces errors in the estimation of some scale frequencies. The CQT time-frequency transform can therefore replace the Fourier transform for this analysis. CQT (Constant Q Transform) refers to a filter bank whose center frequencies are exponentially distributed, whose filter bandwidths differ, and whose ratio of center frequency to bandwidth is a constant Q. Unlike the Fourier transform, the frequency axis of its spectrum is not linear but log2-based, and the filter window length can vary with spectral-line frequency for better performance. Since the CQT follows the same distribution as the scale frequencies, computing the CQT spectrum of a music signal directly yields the amplitude at each note frequency, which suits music signal processing better. Therefore, this embodiment recommends encoding this information to obtain the corresponding encoded information as the input of the neural network model of the present application.
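A minimal sketch of CQT-based encoding, again assuming librosa and an illustrative range of 7 octaves × 12 semitones, is given below.

```python
# Sketch only: CQT encoding with librosa; bin counts and hop length are illustrative.
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050)            # hypothetical input file
cqt = librosa.cqt(y, sr=sr, hop_length=512,
                  n_bins=84, bins_per_octave=12)       # 84 = 7 octaves x 12 semitones
cqt_db = librosa.amplitude_to_db(np.abs(cqt))          # (n_bins, n_frames) matrix
# Transposed, each row is the CQT encoding of one frame, matching the
# "one row vector per packet, stacked by time" organization described later.
encoded = cqt_db.T
```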
The pitch-level contour information includes PCP (Pitch Class Profile) and HPCP (Harmonic Pitch Class Profile). The aim is to extract the pitch sequence from the song audio data, convert it into a melody contour sequence after regularization, merging and segmentation, and then convert the melody contour sequence into a feature representation using the pitch differences relative to a standard pitch. Encoded information constructed from pitch-level contour information is relatively robust to environmental noise.
The Chroma feature information is a general term for the chroma vector (Chroma Vector) and the chromagram (Chromagram). A chroma vector contains 12 elements representing the energy of the 12 pitch classes over a period of time (e.g., one frame), with the energies of the same pitch class in different octaves accumulated, and a chromagram is a sequence of chroma vectors. Specifically, a voice data packet of the song audio data is converted from the time domain to the frequency domain by a short-time Fourier transform, some noise reduction is applied, and tuning is performed; absolute time is then converted into frames according to the chosen window length, and the energy of each pitch in each frame is recorded to form a pitch map; on this basis, the energies (in loudness) of notes with the same time and pitch class but different octaves are accumulated onto the element of that pitch class in the chroma vector, forming the chromagram. The data corresponding to the chromagram is the Chroma feature information.
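For illustration, the chromagram could be computed as follows, again assuming librosa.

```python
# Sketch only: chroma (chromagram) features; energies of the same pitch class
# across octaves are folded into 12 bins per frame.
import librosa

y, sr = librosa.load("song.wav", sr=22050)             # hypothetical input file
chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_chroma=12, hop_length=512)
print(chroma.shape)                                     # (12, n_frames)
```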
Any of the above specific types of encoded information can be used as input to the feature extraction model of the present application. To facilitate processing by the feature extraction model, the encoded information can be organized in a preset format: for example, the encoded information of each voice packet is organized into a row vector, and for the whole audio data the row vectors of the voice data packets are stacked by row in time order, yielding a two-dimensional matrix as the complete encoded information. Such formats can be preset to fit the feature extraction model and can be implemented flexibly by those skilled in the art.
It should be noted that the encoding principle referred to herein is not only applicable to the original song, the compared song, and the song segment of the compared song, but also applicable to the processing of the training sample by the feature extraction model in the training stage, as will be understood by those skilled in the art.
Step S0002, extracting, with the feature extraction model and according to the encoded information, a high-dimensional index vector representing the multi-scale deep semantic information of the audio data:
as mentioned above, the feature extraction model of the present application is trained to convergence in advance and is ready for use; when the high-dimensional index vector of some audio data is needed, the feature extraction model is called to process the encoded information of that audio data, which is not repeated here.
This embodiment further introduces the principles and procedures of several preferred types of encoded information, making the present application easier to implement; among them, the CQT-based information is recommended as the encoded information, which, according to actual measurements, further highlights the beneficial effects of the present application.
Referring to fig. 5, in a further embodiment, when the feature extraction model is called, the following steps are performed:
step S2100, sequentially performing multi-level feature extraction on the encoded information by using a plurality of convolution blocks in the shared network of the feature extraction model trained to a convergence state, to obtain intermediate feature information carrying the deep semantic information of the encoded information:
the feature extraction model of the present application is constructed on a multi-branch principle and can be varied flexibly according to the needs of different embodiments. In a typical embodiment, as shown in the schematic block diagram of fig. 6, the feature extraction model consists of a shared network and a plurality of branch networks: the shared network includes several convolution blocks for extracting the deep semantic information of the encoded information level by level to obtain intermediate feature information, while the branch networks each extract a different kind of deep semantic information from the intermediate feature information to obtain corresponding output feature information. Each branch network includes a structurally identical part consisting of several convolution blocks that extract deep semantic information step by step; after the output of the last convolution block, different processing is applied according to the function of each branch network.
The convolution blocks can be realized with CNN- or RNN-based convolution layers, and preferably as convolution blocks based on the residual convolution principle. In order to provide context combing so as to extract the key information in the song audio data, an attention mechanism may be applied to any of the convolution blocks by adding a corresponding attention module, specifically a Spatial Attention Module (SAM) or a Channel Attention Module (CAM). In an enhanced embodiment, instance normalization (IN) and batch normalization (BN) are applied inside a convolution block: the information input to it is divided into two parts, one of which undergoes instance normalization to learn style-invariant features, while the other undergoes batch normalization, forming a so-called IBN architecture. With this architecture, the model can learn music-attribute-invariant features such as notes, rhythm and timbre from music audio data of highly diverse styles while retaining version information.
Therefore, it is easy to understand that the feature extraction model can enable different branch networks to suit different application scenarios. By first training the feature extraction model to a convergence state with a pre-selected training set, the corresponding feature extraction capability is obtained, so that the model is suited to the task of the corresponding application scenario and extracts, from the encoded information of the song audio data fed into it, the output feature information corresponding to that song audio data. The training process of the feature extraction model is given in a later exemplary embodiment of the present application and will not be expanded here.
In this step, in the architecture shown in fig. 6, the encoded information is subjected to step-by-step feature extraction by the several convolution blocks of the shared network; in particular, after the last convolution block extracts the key information, the intermediate feature information carrying the key information of the encoded information is obtained. This intermediate feature information is duplicated into multiple paths and output to the plurality of branch networks, so that deep semantic information from different angles can be extracted in each branch network.
Step S2200, converting the intermediate feature information into output feature vectors of corresponding scales after feature extraction at different scales is performed on it by the plurality of convolution blocks in more than two branch networks of the feature extraction model, where the deep semantic information contained in the output feature vectors of the different branch networks differs from one another:
as described above, in the architecture shown in fig. 6, the respective branch networks can be flexibly combined, so that according to the specific architecture obtained by combining, it is possible to determine how many branch networks are specifically available. The intermediate feature information output by the shared network is input into each of the branch networks for further feature extraction processing.
According to the architecture shown in fig. 6, each branch network shares an identical structural part containing two convolution blocks; after the intermediate feature information passes through these two convolution blocks in sequence for feature extraction, the extracted feature information is passed on to the branch-specific structures of the different branch networks for different processing.
Specifically, different branch networks, adapted to the different deep semantic information they extract, can apply different processing in the structural parts in which they differ. For example, various kinds of processing can be applied to the feature information output by the last convolution block to obtain output feature information containing different deep semantic information, each describing the song audio data at a different scale and covering the global information and various kinds of local information of the song audio data, such as global information that abstracts the salient features of the encoded information of the song audio data, or local information that abstracts its channel or frequency-band features, and so forth. Accordingly, multiple pieces of output feature information with different expressions can be obtained, and these can be invoked independently or used in any combination as needed.
In the application, the output feature information output by each branch network is normalized to be represented by the output feature vectors, so that a plurality of branch networks can correspondingly obtain a plurality of output feature vectors, each output feature vector represents deep semantic information of the song audio data on different aspects or different scales, and the deep semantic information contained in each output feature vector is different from one another.
When in use, more than two branch networks are usually adopted to obtain more than two output feature vectors, so as to perform feature representation on song audio data by using more than two deep semantic information, for example, the output feature vector for representing the global information of the song audio data may be used in combination with the output feature vector for representing the channel information of the song audio data, the output feature vector for representing the global information of the song audio data may be used in combination with the output feature vector for representing the band information of the song audio data, or the output feature vector for representing the channel information of the song audio data may be used in combination with the output feature vector for representing the band information of the song audio data, or all the output feature vectors may be used in combination. And so on, as may be called upon by those skilled in the art.
Step S2300, outputting the output feature vector of each branch network as the high-dimensional index vector by the feature extraction model:
the output feature vectors obtained by the branch networks can finally be converted into high-dimensional index vectors for storage or direct use. A high-dimensional index vector is a high-dimensional vector used for indexing the corresponding song audio data. Since each branch network has already normalized its output feature information into an output feature vector, the high-dimensional index vectors can be handled in different ways depending on the specific use of the feature extraction model. For example, when the vectors are only to be stored and invoked separately, each output feature vector can be treated as its own high-dimensional index vector. For a specific task, such as the exemplary song copyright infringement comparison, all output feature vectors output by the enabled branch networks may instead be concatenated in sequence according to the needs of that task to obtain a single high-dimensional index vector, which can be stored or used for matching immediately. In this way, representation learning of the song audio data is realized through the high-dimensional index vector.
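As a non-limiting sketch, the splicing of per-branch output feature vectors into a single high-dimensional index vector and the subsequent similarity comparison could look as follows; the L2 normalization step and the names used are assumptions, not requirements of the present application.

```python
# Sketch only: building one index vector per song and comparing two songs.
import numpy as np

def build_index_vector(branch_vectors: list[np.ndarray]) -> np.ndarray:
    normed = [v / (np.linalg.norm(v) + 1e-12) for v in branch_vectors]
    return np.concatenate(normed)                       # spliced in sequence

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# original_vec = build_index_vector(original_branch_outputs)
# compared_vec = build_index_vector(compared_branch_outputs)
# is_cover = cosine_similarity(original_vec, compared_vec) > FIRST_THRESHOLD
```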
In addition to the various application modes disclosed in the present application, the mining and utilization based on the high-dimensional index vector obtained in the present application may have various applications, and may be flexibly applied by those skilled in the art according to the principles disclosed herein without affecting the inventive embodiments of the present application.
Through the above description of the implementation process of the feature extraction model and the network architecture thereof, it can be understood that the present embodiment includes very rich beneficial effects, including but not limited to the following aspects:
firstly, the feature extraction model works on the encoded information obtained by encoding the audio information of the song audio data, which captures its style-invariant features; it then extracts intermediate feature information from the encoded information through the shared network, extracts deep semantic information of the song audio data from different angles through the plurality of branch networks on the basis of that intermediate feature information to obtain the corresponding output feature information, and finally uses the output feature information as the high-dimensional index vector of the song audio data, completing end-to-end representation learning of the song audio data.
Secondly, the feature extraction model realizes multi-angle feature extraction of deep semantic information of the song audio data by adopting a mode of combining a sharing network and a plurality of branch networks, so that the obtained high-dimensional index vector has higher representation capability, such as representing global feature information, significant feature information, channel feature information, band feature information and the like of the song audio data, and further realizes more effective indexing of the corresponding song audio data, and the downstream processing of retrieval, query, matching and the like of the song audio data is carried out on the basis of the information, so that more accurate and efficient matching effect can be obtained, and the feature extraction model can be generally used for multiple application scenes such as singing identification, song listening identification, humming identification, song infringement judgment and the like.
In addition, the output feature vectors obtained by the multiple branch networks of the feature extraction model can be combined into a single high-dimensional index vector for use, and can also be independently used as different high-dimensional index vectors for use respectively, the output feature vectors are flexibly determined according to required deep semantic information, the application range is wide, the usage is flexible, when representation learning of massive song audio data is processed, a relatively obvious scale effect can be obtained, the output feature vectors can be deployed in a background of an online music service platform to realize a standardized interface, the requirements of various different application scenes are met, comprehensive and multipurpose open services are provided, and the economic advantage of music information retrieval of the platform is improved.
In a further embodiment, the step S2200 of converting the intermediate feature information into output feature vectors of corresponding scales after performing feature extraction at different scales with the plurality of convolution blocks in more than two branch networks of the feature extraction model includes any two or more of the following steps:
step S2210, performing feature extraction on the intermediate feature information with the plurality of convolution blocks in the first branch network to obtain global feature information, and pooling the global feature information into output feature vectors of the global scale:
in the first branch network exemplarily shown in fig. 6, after the intermediate feature information has been processed step by step by the two convolution blocks shared in structure with the other branch networks, the output of the last convolution block is split into two paths: one path undergoes an average pooling operation directly to obtain the overall feature information, while the other path passes through a Dropout layer that randomly discards part of the time-frequency region information and then extracts the globally salient feature information through a max pooling operation, so that two global output feature vectors are produced. With this design, during model training, the generalization ability of the model to audio with local time-frequency changes in the song audio data, such as segment deletion or segment insertion, is improved on the one hand, and over-fitting is prevented to a certain extent on the other. In addition, of the two global output feature vectors, one captures the overall features and the other the salient features, which improves the discrimination ability of the model.
Step S2220, performing feature extraction on the intermediate feature information with the plurality of convolution blocks in the second branch network, then dividing it by channel into several parts for pooling, so as to correspondingly obtain output feature vectors of the channel scale:
since the feature information output from each convolution block is usually expressed as "number of channels × number of frequency bands × number of frames", the division can be performed along the channel dimension. In the second branch network exemplarily shown in fig. 6, after the intermediate feature information has been processed step by step by the two convolution blocks shared in structure with the other branch networks, the output of the last convolution block is divided into several paths, for example two, each of which passes through a 1 × 1 convolution layer and is then mean-pooled, yielding the channel output feature information of the two paths. In this process the two channel branches focus on capturing local features of the audio; for audio in which a large amount of information is drowned out by strong noise or other interfering sounds, a feature representation can still be built from a few locally salient common features.
Step S2230, using a plurality of convolution blocks in the third branch network to perform feature extraction on the intermediate feature information, and then dividing the intermediate feature information into a plurality of parts according to the frequency band to perform pooling, so as to correspondingly obtain an output feature vector of the frequency band scale:
in the third branch network exemplarily shown in fig. 6, after the intermediate feature information has been processed step by step by the two convolution blocks shared in structure with the other branch networks, the output of the last convolution block is divided by frequency band into several paths, for example two, and each path is mean-pooled to obtain the band output feature information of the corresponding frequency band. In this process, each band branch is dedicated to extracting the feature information of its own frequency band, which is notably effective against band-selective attenuation caused by poor recording environments, for balancing the contributions of high-frequency and low-frequency information in the feature composition, and for resisting content additions or deletions within a fixed band (such as adding or removing a drumbeat) or strong interference within a fixed band range.
It can be understood that several output feature vectors obtained within the same branch network may be further merged into a single output feature vector by concatenation or mean pooling, which those skilled in the art can implement flexibly; a combined sketch of the three pooling schemes is given below.
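The three pooling schemes can be sketched as follows in PyTorch (an assumed framework); the feature map shape, the number of parts per split and the dropout rate are illustrative only.

```python
# Sketch only: pooling applied to the feature map output by the last
# convolution block of each branch, shaped (batch, channels, bands, frames).
import torch
import torch.nn.functional as F

def global_branch(x: torch.Tensor, p_drop: float = 0.3):
    overall = F.adaptive_avg_pool2d(x, 1).flatten(1)            # overall features
    salient = F.adaptive_max_pool2d(F.dropout(x, p_drop), 1).flatten(1)  # salient features
    return overall, salient                                      # two global vectors

def channel_branch(x: torch.Tensor, parts: int = 2):
    # Split along the channel axis, mean-pool each part separately.
    return [F.adaptive_avg_pool2d(c, 1).flatten(1) for c in x.chunk(parts, dim=1)]

def band_branch(x: torch.Tensor, parts: int = 2):
    # Split along the frequency-band axis, mean-pool each part separately.
    return [F.adaptive_avg_pool2d(b, 1).flatten(1) for b in x.chunk(parts, dim=2)]
```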
In this embodiment, multi-scale feature information is extracted from the song audio data through this rich set of branch networks, so that the obtained output feature vectors carry a rich representation of deep semantic information: the global information and salient information of the song audio data are represented, the relevant local information is represented per channel and per frequency band, and, thanks to the shared network, the key information of the song audio data captured in the preceding intermediate feature information is also taken into account. The embodiment thus realizes an indexing representation of the song audio data from multiple aspects, and when the resulting high-dimensional index vectors are used for querying, retrieval and matching, the precision in each of these aspects can be improved.
Because the embodiment can capture deep semantic information of the song audio data in multiple aspects, the method is particularly suitable for feature extraction of the song audio data with relatively large data volume, is particularly suitable for application scenes of long song processing, and can achieve a more accurate matching effect for the application scenes.
Referring to fig. 7, a network structure of the feature extraction model of the present application improved on the previous embodiment is shown. The difference between the network architecture in fig. 7 and that in fig. 6 is that, in fig. 7, the output of the last convolution block of the first branch network is directly max-pooled to obtain a single global output feature vector capturing the salient feature information of the encoded information of the song audio data; in the second branch network, the output of the last convolution block is divided equally by channel into the feature information of four channels, the feature information of each channel is mean-pooled separately and then re-concatenated into the corresponding output feature vector, which, through this division and the construction of local branches, can learn better local feature information.
The present embodiment exemplarily presents a modification based on the network architecture shown in fig. 6, which is relatively lightweight, and it is not difficult to understand that the inventive spirit of the present application focuses on the flexible combined use of a plurality of the described branch networks. Based on the principles disclosed in the present application, those skilled in the art can adapt to different specific applications according to the characteristics of the multi-scale deep semantic information of the output feature vectors obtained by each branch network, and can select feature extraction models constructed by different branch network combinations to transform various other embodiments of the present application, so as to satisfy the requirements such as humming recognition, song listening recognition, singing recognition, song infringement comparison, and the like.
Referring to FIG. 8, in an embodiment, when a convolution block is called, the following steps are performed:
step S3100, performing convolution transformation on the input information to obtain transformation characteristic information:
in any convolution block of the feature extraction model, the information fed into it, whether the encoded information or the intermediate feature information output by the previous convolution block, is first convolved with a 1 × 1 convolution kernel to obtain the corresponding transformed feature information.
Step S3200, combining the transformation characteristic information after respectively carrying out example normalization and batch normalization processing to form splicing characteristic information, and activating and outputting the splicing characteristic information:
after this first convolution, an instance-batch normalization (IBN) layer is applied to the transformed feature information: it is split into two paths, a batch normalization (BN) layer is applied to half of the channels, and an instance normalization (IN) layer is applied to the remaining channels, so that the convolution block can capture the style-invariant features of the song audio data and thus make better use of song representations with diverse styles within a single piece of data. After their different normalizations, the two paths are concatenated back into a single piece of spliced feature information, which is activated and output.
Step S3300, obtaining residual error information after carrying out convolution operation and batch normalization processing on the activated and output splicing characteristic information for multiple times:
the activated spliced feature information is further convolved by several convolution layers to extract features, each convolution layer being followed by a batch normalization layer before output; the last convolution layer uses a 1 × 1 convolution kernel so that the representation-learning capacity of the whole feature extraction model is not attenuated by the repeated instance normalizations across many convolution blocks. The feature information finally output here is the residual information of the residual convolution process.
Step S3400, overlapping the residual error information into the input information, activating and outputting:
finally, according to the residual convolution principle and with reference to the transformed feature information obtained by the first convolution, the residual information is superposed onto the block's input, and the sum is activated and output, yielding the intermediate feature information produced by the current convolution block after its residual convolution operation.
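A minimal PyTorch sketch of a convolution block following steps S3100 to S3400 is given below; the kernel sizes, channel split and placement of the residual connection are assumptions consistent with, but not mandated by, the description above.

```python
# Sketch only: residual convolution block with an IBN split.
import torch
import torch.nn as nn

class IBNResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.inorm = nn.InstanceNorm2d(half, affine=True)       # style-invariant half
        self.bnorm = nn.BatchNorm2d(channels - half)             # batch-normalized half
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = self.conv1(x)                                        # S3100: 1x1 transform
        a, b = torch.split(t, [t.size(1) // 2, t.size(1) - t.size(1) // 2], dim=1)
        t = self.relu(torch.cat([self.inorm(a), self.bnorm(b)], dim=1))   # S3200
        r = self.relu(self.bn2(self.conv2(t)))                   # S3300: conv + BN
        r = self.bn3(self.conv3(r))                              # last layer is 1x1
        return self.relu(x + r)                                  # S3400: residual add
```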
In this embodiment, the convolution blocks required by the feature extraction model are constructed on the basis of residual convolution combined with instance-batch normalization; the residual convolution network improves on the ResNet family of base models with the IBN architecture superimposed, so that the resulting feature extraction model is easier to train and achieves more accurate feature extraction, making it particularly suitable for feature extraction from song audio data.
Referring to fig. 9, in an extended embodiment, the training process of the feature extraction model includes the following steps of iterative training:
step S4100, calling a training sample from the training set, and determining the coding information of the training sample, wherein the training sample is pre-collected song audio data, and the song audio data is a complete song and a fragment thereof:
those skilled in the art will appreciate that different training sets for training the feature extraction model may be constructed, each training set containing a sufficient number of training samples, each training sample being pre-provisioned with a corresponding supervised label, to accommodate different downstream tasks.
The training samples can be collected in advance by those skilled in the art; each training sample is a piece of song audio data and, to suit different downstream tasks, may be a complete song, a MIDI melody fragment of a song, a song with accompaniment, a song containing the sung vocal part but no accompaniment, a song fragment without the melody part, a song fragment containing the melody part, and the like. Different sung versions of the same song can be grouped into the same class, i.e. associated with the same supervision label, to enhance the generalization ability of the model for each class. When the song audio data of a training sample is too long, it can be segmented into several song segments of a preset duration and used as several training samples associated with the same supervision label for training. When segmenting a song, the time stamps of its lyrics can be used as reference, so that each segment is cut on the boundary of one or more complete lyric lines.
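As an illustrative sketch, segmentation of a long song on lyric-line boundaries could be implemented as follows, assuming the line start times (in seconds) have already been parsed from the lyric file and an illustrative target duration of 30 seconds.

```python
# Sketch only: cut a long waveform into training segments at lyric-line boundaries.
import numpy as np

def segment_by_lyrics(y: np.ndarray, sr: int, line_times: list[float],
                      target_sec: float = 30.0) -> list[np.ndarray]:
    segments, start = [], line_times[0]
    for t in line_times[1:]:
        if t - start >= target_sec:                      # cut only at a line boundary
            segments.append(y[int(start * sr):int(t * sr)])
            start = t
    tail = y[int(start * sr):]
    if len(tail) > 0:                                    # keep the trailing remainder
        segments.append(tail)
    return segments
```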
Preferably, to match the fact that both whole songs and their song segments undergo feature extraction in the song comparison stage, in the training stage a training sample can be constructed for each song from the audio data of the whole song while further training samples are constructed from the audio data of several segments of that song, with both kinds of training samples associated with the supervision label of the same song.
For training samples in the training set, for the convenience of model training, the encoding information corresponding to the song audio data may be prepared in advance, or the encoding information corresponding to the song audio data may be obtained by real-time encoding when each song audio data is called for training the feature extraction model. For specific encoding principles, it is sufficient to refer to the corresponding processes disclosed in the foregoing of the present application.
Step S4200, inputting the encoding information into the feature extraction model, and performing training on the encoding information to obtain corresponding output feature vectors:
in the training process of a training sample, the coding information corresponding to the training sample is output to the feature extraction model for feature extraction, and the feature extraction principle refers to the description of the principle of the feature extraction model in the previous embodiments, which is omitted here for brevity. In the process, the feature extraction model realizes the representation learning of the training samples to obtain each corresponding output feature vector.
Step S4300, performing classification prediction on each output feature vector, so as to map a corresponding classification label:
in the present application, the training task of the feature extraction model is understood as a classification task. Training of the model can therefore be implemented by connecting each output feature vector of the feature extraction model to a corresponding prepared classification model, examining the classification result of each classification model, and supervising it with the corresponding supervision label. Based on this principle, in the training stage, when the feature extraction model of any embodiment of the present application is trained, one classification model is connected to each output-feature-vector output of each branch network.
The classification model adopts the structure shown in fig. 10, a batch normalization layer is adopted to perform batch normalization operation on the output feature vectors, then the output feature vectors are mapped to a classification space through a full connection layer, and the classification probability corresponding to each classification label is calculated through a classification function, so that the classification label with the maximum classification probability is determined as the classification label corresponding to the training sample.
The classifier in the classification model can be built as a multi-class classifier implemented with the Softmax function, or as one implemented with the AM-Softmax function, which tightens intra-class compactness and enlarges inter-class separation; the latter clearly has better classification properties.
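A minimal PyTorch sketch of such a classification head, combining batch normalization, a fully connected projection and an AM-Softmax margin, is given below; the scale s and margin m values are typical choices, not values fixed by the present application.

```python
# Sketch only: BN + FC + AM-Softmax classification head used during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, s: float = 30.0, m: float = 0.35):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.s, self.m = s, m

    def forward(self, feat: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        feat = self.bn(feat)
        cos = F.linear(F.normalize(feat), F.normalize(self.weight))   # cosine logits
        margin = F.one_hot(labels, cos.size(1)).float() * self.m      # subtract m from the
        logits = self.s * (cos - margin)                              # true-class logit
        return F.cross_entropy(logits, labels)                        # classification loss
```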
Step S4400, calculating a loss value of the feature extraction model by using the supervision labels and the classification labels corresponding to the training samples, and performing gradient updating on the feature extraction model according to the loss value:
in the classification model, the use of the batch normalization layer balances the triplet loss and the cross-entropy classification loss: the triplet loss can be computed via the batch normalization layer and the cross-entropy classification loss via the fully connected layer, and combining the two optimizes the output feature vectors.
Therefore, after the corresponding classification label is predicted for a training sample, the loss value between the supervision label and the classification label can be calculated according to the corresponding supervision label; the feature extraction model is then gradient-updated according to this loss value, the weight parameters of every part of the whole model are corrected, and the model is driven toward convergence.
Because there are several branch networks, each of which may output several output feature vectors, and there are correspondingly several classification models, the loss value can be computed by weighting: the triplet loss and the classification loss within each classification model are first weighted and summed to obtain the loss value of each output feature vector, then the loss values of all output feature vectors are weighted and summed to obtain the final loss value, with which the whole feature extraction model is gradient-updated.
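The two-level weighting can be sketched as follows; all weight values are assumptions, and the per-vector triplet and classification losses are presumed to have been computed beforehand.

```python
# Sketch only: weighted combination of per-vector triplet and classification losses.
def combined_loss(per_vector_losses, w_triplet=1.0, w_ce=1.0, branch_weights=None):
    """per_vector_losses: list of (triplet_loss, ce_loss) tensor pairs, one per output vector."""
    if branch_weights is None:
        branch_weights = [1.0] * len(per_vector_losses)
    total = None
    for w, (triplet_loss, ce_loss) in zip(branch_weights, per_vector_losses):
        term = w * (w_triplet * triplet_loss + w_ce * ce_loss)
        total = term if total is None else total + term
    return total   # backpropagated to gradient-update the whole feature extraction model
```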
Step S4500, judging whether the loss value reaches a preset threshold value, and when the loss value does not reach the preset threshold value, calling a next training sample in a training set to continue to carry out iterative training on the feature extraction model until the loss value reaches the preset threshold value:
whether the loss value computed for each training sample approaches zero, or whether it reaches the preset threshold, is checked; when the loss value satisfies this criterion, the feature extraction model is judged to have been trained to the convergence state, training is terminated accordingly, and the feature extraction model is put into the production stage, for example for extracting features from the songs of a music library or for serving other downstream tasks. If the convergence state has not been reached, the next training sample in the training set is fetched and iterative training of the feature extraction model continues until it is trained to the convergence state.
The embodiment discloses a training principle and a training process of a feature extraction model of the application, and can be seen from the embodiment that the feature extraction model is trained by adopting a prepared training set, so that the feature extraction model can learn the capability of extracting corresponding output feature vectors from the coded information of song audio data, the effective representation learning of deep semantic information of the song audio data is realized, in addition, the output feature vectors of multiple scales of the same song audio data can be jointly trained, the training efficiency is higher, the functions of the model are richer, and when the feature extraction model is put into a production stage, the deep semantic information corresponding to the multiple scales of the same song audio data can be quickly obtained.
Since the classification model of this embodiment uses batch normalization together with a multi-class classifier implemented with the AM-Softmax function, the triplet loss and the classification loss can be balanced for the gradient update of the model, the model can be trained to convergence more quickly, and the trained model performs more effective representation learning of the deep semantic information of song audio data. When the output feature vectors are later combined and used as needed, the feature information of the song audio data can be represented more effectively, achieving a more efficient matching effect.
The embodiment also shows the expandability and compatibility of the feature extraction model in the application aspect, and specifically, the embodiment allows the feature extraction model to obtain the capability of serving different downstream tasks by training the feature extraction model by adopting training samples corresponding to the different downstream tasks according to the requirement of serving the different downstream tasks, so that the embodiment belongs to a more basic improvement and has better economic utility.
Referring to fig. 11, a song comparison apparatus provided in the present application is adapted to perform functional deployment by a song comparison method of the present application, and includes: the system comprises an audio acquisition module 1100, an original edition extraction module 1200, a compared extraction module 1300 and a comprehensive judgment module 1400, wherein the audio acquisition module 1100 is used for respectively acquiring audio data of an original edition song and a compared song; the original edition extracting module 1200 is configured to extract the multi-scale deep semantic information of the audio data of the original edition song by using the feature extraction model trained to the convergence state, and correspondingly obtain an original edition high-dimensional index vector; the compared extraction module 1300 is configured to extract the multi-scale deep semantic information of the audio data of the compared song by using the feature extraction model, and correspondingly obtain a compared high-dimensional index vector; the comprehensive decision module 1400 is configured to calculate similarity between the original high-dimensional index vector and the compared high-dimensional index vector, determine whether a corresponding similarity value is greater than a first preset threshold, and determine that a song to be compared and an original song form a sing-turning relationship when the similarity value is greater than the first preset threshold.
In an extended embodiment, the song comparing apparatus includes a structure for operating when the comprehensive decision module 1400 determines that the similarity value is smaller than a first preset threshold, where the structure includes: the song segmenting module is used for determining the song with the original edition and the compared song with the shorter audio time as a specified song, segmenting another song with the longer audio time by taking the audio time of the specified song as a measurement to obtain a plurality of song segments of the other song; the segmentation extraction module is used for respectively extracting multi-scale deep semantic information of a plurality of song segments of the other song by adopting the feature extraction model, and correspondingly obtaining a plurality of segment high-dimensional index vectors; and the segmentation judging module is used for respectively calculating the similarity between the designated high-dimensional index vector and each segment high-dimensional index vector by taking the high-dimensional index vector of the designated song as the designated high-dimensional index vector, judging whether the maximum similarity value is higher than a second preset threshold value, and judging that the compared songs form a singing turning relation when the maximum similarity value is higher than the second preset threshold value.
In a further embodiment, the audio obtaining module 1100 includes: the original edition acquisition submodule is used for acquiring audio data and lyric files of the original edition of song; the word segmentation extraction submodule is used for carrying out word segmentation according to the lyrics in the lyric file of the original edition song and extracting a plurality of keywords from the words; the online searching submodule is used for searching online to obtain at least one song according to any combination of the keywords and taking the searched song as a compared song; and the compared acquisition sub-module is used for acquiring the audio data of the compared song.
In an alternative embodiment, the master extraction module 1200 and/or the scaled extraction module 1300 include: the audio coding submodule is used for coding the audio data and correspondingly obtaining coding information of the audio data; and the model calling submodule is used for extracting a high-dimensional index vector representing the multi-scale deep semantic information of the audio data according to the coding information by adopting the feature extraction model.
In a further embodiment, the feature extraction model is implemented as a structure comprising: the shared extraction module is configured to sequentially perform multi-level feature extraction on the coded information by adopting a plurality of convolution blocks in a shared network in a feature extraction model trained to a convergence state to obtain intermediate feature information of deep semantic information of the coded information; the branch extraction module is configured to extract features of different scales from the intermediate feature information by adopting a plurality of convolution blocks in more than two branch networks in the feature extraction model, and then convert the extracted features into output feature vectors of corresponding scales, wherein deep semantic information contained in the output feature vectors of the branch networks is different; and the processing output module is configured to output the output feature vector of each branch network as the high-dimensional index vector.
In a further embodiment, the branch extracting module includes any two or more of the following modules: the first extraction submodule is configured to perform feature extraction on the intermediate feature information by adopting a plurality of convolution blocks in a first branch network to obtain global feature information, and the global feature information is pooled into an output feature vector of a global scale; the second extraction submodule is configured to adopt a plurality of convolution blocks in a second branch network to extract the characteristics of the intermediate characteristic information, then divide the intermediate characteristic information into a plurality of parts according to channels and perform pooling, and correspondingly obtain output characteristic vectors of channel scales; and the third extraction submodule is configured to adopt a plurality of convolution blocks in a third branch network to extract the characteristics of the intermediate characteristic information, divide the intermediate characteristic information into a plurality of parts according to the frequency band and pool the parts, and accordingly obtain an output characteristic vector of the frequency band scale.
In an embodiment, the first branch network performs the pooling operation by using a mean pooling operation and/or a maximum pooling operation to obtain one or two output feature vectors of the global scale; when the second branch network performs the pooling operation, adopting a mean pooling operation aiming at a single or a plurality of channels so as to correspondingly obtain one or a plurality of output feature vectors of the channel scale; and when the third branch network performs the pooling operation, adopting an average pooling operation for a single frequency band or a plurality of frequency bands to correspondingly obtain one or a plurality of output feature vectors of the frequency band scale.
In an optional embodiment, the source of the coding information is any one of time-frequency spectrum information, mel-frequency spectrum information, CQT filtering information, pitch-level contour information, and Chroma feature information of corresponding audio data.
In a preferred embodiment, in the shared network, at least one of the convolution blocks applies an attention module for extracting key information in song audio data, and the attention module is a spatial attention module or a channel attention module.
In an optimized embodiment, the volume block is implemented as a structure comprising: the initial convolution unit is used for carrying out convolution transformation on the information input into the initial convolution unit to obtain transformation characteristic information; the normalization processing unit is used for combining the transformation characteristic information into splicing characteristic information after respectively carrying out example normalization and batch normalization processing, and activating and outputting the splicing characteristic information; the residual error calculation unit is used for carrying out convolution operation and batch normalization processing on the activated and output splicing characteristic information for multiple times to obtain residual error information; and the activation output unit is used for superposing the residual information into the information input into the activation output unit.
In an extended embodiment, the feature extraction model is placed in a training task implemented by a structure for implementing iterative training, wherein the structure comprises: the system comprises a sample calling module, a data processing module and a data processing module, wherein the sample calling module is used for calling a training sample from a training set and determining the coding information of the training sample, the training sample is pre-collected song audio data, and the song audio data is a complete song and a fragment thereof; the expression learning module is used for inputting the coding information into the feature extraction model to train the coding information so as to obtain corresponding output feature vectors; the classification prediction module is used for performing classification prediction on each output characteristic vector respectively to map corresponding classification labels; the loss calculation module is used for calculating a loss value of the feature extraction model by using the supervision labels and the classification labels corresponding to the training samples, and performing gradient updating on the feature extraction model according to the loss value; and the iteration decision module is used for judging whether the loss value reaches a preset threshold value, and calling the next training sample in the training set to continue to carry out iterative training on the feature extraction model when the loss value does not reach the preset threshold value until the loss value reaches the preset threshold value.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. As shown in fig. 12, the internal structure of the computer device is schematically illustrated. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions, when executed by the processor, can cause the processor to implement a song comparison method. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform the song comparison method of the present application. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module and its sub-modules shown in fig. 11, and the memory stores the program code and the various data required for executing those modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. Specifically, the memory stores the program code and data required for executing all modules and sub-modules of the song comparison apparatus of the present application, and the server can call this program code and data to execute the functions of all sub-modules.
The present application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the song comparison method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
In summary, the present application uses the feature extraction model to learn representations of the multi-scale deep semantic information of the original song and the compared song, thereby obtaining corresponding high-dimensional index vectors, performs a similarity comparison based on the respective high-dimensional index vectors, and judges whether the original song and the compared song form a cover relation. This yields a more accurate and efficient comparison, can serve various downstream tasks such as song recognition by listening, humming recognition, cover recognition, and song infringement comparison, and improves the comprehensive service capability of an online music platform.
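As a minimal illustration of the comparison step, the following sketch assumes cosine similarity between the high-dimensional index vectors; the 0.8 threshold and the function name are illustrative assumptions, not values fixed by the specification.

```python
import torch
import torch.nn.functional as F

def forms_cover_relation(original_vec: torch.Tensor,
                         compared_vec: torch.Tensor,
                         threshold: float = 0.8) -> bool:
    """Judge whether the compared song forms a cover relation with the original
    song by thresholding the similarity of their 1-D high-dimensional index vectors."""
    similarity = F.cosine_similarity(original_vec, compared_vec, dim=-1)
    return bool(similarity.item() > threshold)
```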
Those skilled in the art will appreciate that the various operations, methods, and steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or deleted. Likewise, other steps, measures, or schemes in the various operations, methods, or flows discussed in this application can be alternated, changed, rearranged, decomposed, combined, or deleted, and so can steps, measures, or schemes in the prior art that involve the various operations, methods, or flows disclosed in the present application.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims (15)

1. A song comparison method is characterized by comprising the following steps:
respectively acquiring audio data of an original song and a compared song;
extracting multi-scale deep semantic information of the audio data of the original song by adopting a feature extraction model trained to a convergence state, and correspondingly obtaining an original high-dimensional index vector;
extracting multi-scale deep semantic information of the audio data of the compared song by adopting the feature extraction model, and correspondingly obtaining a compared high-dimensional index vector;
and calculating the similarity between the original high-dimensional index vector and the compared high-dimensional index vector, judging whether the corresponding similarity value is greater than a first preset threshold value, and when the similarity value is greater than the first preset threshold value, judging that the compared song and the original song form a cover relation.
2. The song comparison method of claim 1, wherein after the step of calculating the similarity between the original high-dimensional index vector and the compared high-dimensional index vector and judging whether the corresponding similarity value is greater than the first preset threshold value, when the similarity value is less than the first preset threshold value, the following steps are performed:
determining, of the original song and the compared song, the song whose audio duration is relatively shorter as a designated song, and segmenting the other song, whose audio duration is relatively longer, using the audio duration of the designated song as the measure, to obtain a plurality of song segments of the other song;
extracting the multi-scale deep semantic information of the plurality of song segments of the other song respectively by adopting the feature extraction model, and correspondingly obtaining a plurality of segment high-dimensional index vectors;
and taking the high-dimensional index vector of the designated song as a designated high-dimensional index vector, respectively calculating the similarity between the designated high-dimensional index vector and each segment high-dimensional index vector, judging whether the maximum similarity value is higher than a second preset threshold value, and when the maximum similarity value is higher than the second preset threshold value, judging that the compared song and the original song form a cover relation.
3. The song comparison method of claim 1, wherein obtaining audio data of an original song and a compared song comprises the steps of:
acquiring audio data and lyric files of the original song;
performing word segmentation according to lyrics in a lyric file of the original song, and extracting a plurality of keywords from the words;
obtaining at least one song through an online search according to any combination of the keywords, and taking the song obtained through searching as a compared song;
and acquiring audio data of the compared song.
4. The song comparison method according to claim 1, wherein the step of extracting multi-scale deep semantic information of the audio data of the original song by adopting the feature extraction model trained to a convergence state to correspondingly obtain the original high-dimensional index vector, or the step of extracting multi-scale deep semantic information of the audio data of the compared song by adopting the feature extraction model to correspondingly obtain the compared high-dimensional index vector, comprises the following steps:
coding the audio data to correspondingly obtain coding information of the audio data;
and extracting a high-dimensional index vector representing the multi-scale deep semantic information of the audio data according to the coding information by adopting the feature extraction model.
5. The song comparison method according to any one of claims 1 to 4, wherein when the feature extraction model is invoked, the following steps are performed:
sequentially performing multi-stage feature extraction on the coding information by adopting a plurality of convolution blocks in a shared network of the feature extraction model trained to a convergence state, to obtain intermediate feature information carrying deep semantic information of the coding information;
performing feature extraction at different scales on the intermediate feature information by adopting a plurality of convolution blocks in two or more branch networks of the feature extraction model, and converting the results into output feature vectors of corresponding scales, wherein the deep semantic information contained in the output feature vectors of the respective branch networks differs;
and outputting, by the feature extraction model, the output feature vector of each branch network as the high-dimensional index vector.
6. The song comparison method according to claim 5, wherein the step of performing feature extraction at different scales on the intermediate feature information by adopting a plurality of convolution blocks in two or more branch networks of the feature extraction model and converting the results into output feature vectors of corresponding scales comprises any two or more of the following steps:
performing feature extraction on the intermediate feature information by adopting a plurality of convolution blocks in a first branch network to obtain global feature information, and pooling the global feature information into output feature vectors of a global scale;
performing feature extraction on the intermediate feature information by adopting a plurality of convolution blocks in a second branch network, dividing the result into a plurality of parts by channel, and pooling them to correspondingly obtain output feature vectors of a channel scale;
and performing feature extraction on the intermediate feature information by adopting a plurality of convolution blocks in a third branch network, dividing the result into a plurality of parts by frequency band, and pooling them to correspondingly obtain output feature vectors of a frequency band scale.
7. The song comparison method of claim 6, wherein:
when the first branch network executes the pooling operation, adopting mean pooling and/or maximum pooling operation to correspondingly obtain one or two output feature vectors of the global scale;
when the second branch network performs the pooling operation, adopting a mean pooling operation for a single or a plurality of channels to correspondingly obtain one or a plurality of output feature vectors of the channel scale;
and when the third branch network performs the pooling operation, adopting a mean pooling operation for a single frequency band or a plurality of frequency bands to correspondingly obtain one or a plurality of output feature vectors of the frequency band scale.
8. The song comparison method of claim 4, wherein the coding information is derived from any one of the time-frequency spectrum information, Mel-spectrum information, CQT filter information, level profile information, and Chroma feature information of the corresponding audio data.
9. The song comparison method according to claim 5, wherein in the shared network, at least one of the convolution blocks applies an attention module for extracting key information in song audio data, and the attention module is a spatial attention module or a channel attention module.
10. The song comparison method of claim 5, wherein when the convolution block is called, the following steps are performed:
carrying out convolution transformation on the input information to obtain transformation feature information;
performing instance normalization and batch normalization on the transformation feature information respectively, combining the results into spliced feature information, and activating and outputting the spliced feature information;
performing convolution operations and batch normalization on the activated spliced feature information multiple times to obtain residual information;
and superposing the residual information onto the input information and activating the output.
11. The song comparison method according to claim 5, wherein the training process of the feature extraction model comprises the following iterative training steps:
calling a training sample from a training set, and determining the coding information of the training sample, wherein the training samples are pre-collected song audio data comprising complete songs and fragments thereof;
inputting the coding information into the feature extraction model for training, so as to obtain corresponding output feature vectors;
respectively carrying out classification prediction on each output feature vector to map it to a corresponding classification label;
calculating a loss value of the feature extraction model by using the supervision label corresponding to the training sample and the classification labels, and performing a gradient update on the feature extraction model according to the loss value;
and judging whether the loss value reaches a preset threshold value, and when the loss value does not reach the preset threshold value, calling the next training sample in the training set to continue the iterative training of the feature extraction model, until the loss value reaches the preset threshold value.
12. A song comparison apparatus, comprising:
an audio acquisition module, configured to respectively acquire audio data of an original song and a compared song;
an original extraction module, configured to extract multi-scale deep semantic information of the audio data of the original song by adopting a feature extraction model trained to a convergence state, and correspondingly obtain an original high-dimensional index vector;
a comparison extraction module, configured to extract multi-scale deep semantic information of the audio data of the compared song by adopting the feature extraction model, and correspondingly obtain a compared high-dimensional index vector;
and a comprehensive judgment module, configured to calculate the similarity between the original high-dimensional index vector and the compared high-dimensional index vector, judge whether the corresponding similarity value is greater than a first preset threshold value, and when the similarity value is greater than the first preset threshold value, judge that the compared song and the original song form a cover relation.
13. A computer device comprising a central processor and a memory, characterized in that the central processor is configured to invoke and execute a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 11, which, when invoked by a computer, performs the steps comprised by the corresponding method.
15. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 11.
Priority Applications (1)

Application Number: CN202111491601.3A
Priority Date: 2021-12-08
Filing Date: 2021-12-08
Title: Song comparison method and device, equipment, medium and product thereof
Status: Pending

Publications (1)

Publication Number: CN114817620A
Publication Date: 2022-07-29

Family

ID=82526809

Country Status (1)

Country: CN
Publication: CN114817620A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination