CN114764452A - Song searching method and device, equipment, medium and product thereof - Google Patents

Song searching method and device, equipment, medium and product thereof

Info

Publication number
CN114764452A
Authority
CN
China
Prior art keywords
song
information
similarity
feature
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111494004.6A
Other languages
Chinese (zh)
Inventor
张超钢
肖纯智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202111494004.6A priority Critical patent/CN114764452A/en
Publication of CN114764452A publication Critical patent/CN114764452A/en
Pending legal-status Critical Current

Classifications

    • G06F16/61: Indexing; data structures therefor; storage structures (information retrieval of audio data)
    • G06F16/632: Query formulation (querying of audio data)
    • G06F16/65: Clustering; classification (of audio data)
    • G06F16/683: Retrieval characterised by metadata automatically derived from the content, e.g. an automatically derived transcript of audio data such as lyrics
    • G06F16/686: Retrieval characterised by manually generated metadata, e.g. tags, keywords, comments, title or artist information
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting (pattern recognition)
    • G06F18/22: Matching criteria, e.g. proximity measures (pattern recognition)
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (pattern recognition)
    • G06N3/045: Combinations of networks (neural networks)
    • G06N3/08: Learning methods (neural networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a song search method, together with a corresponding apparatus, device, medium and program product. The method comprises: obtaining the encoded information of a song to be searched submitted by a client; using a feature extraction model trained to convergence to extract, from the encoded information, a high-dimensional index vector representing deep semantic information of the song to be searched at multiple scales; calculating the similarity between this vector and the high-dimensional index vectors of multi-scale deep semantic information that the same feature extraction model has extracted for each candidate song in a preset song feature library, to obtain a similarity sequence; and selecting from the similarity sequence the target song whose similarity both exceeds a preset threshold and is the largest, then constructing an access link for the target song and pushing it to the client device. This process provides a fast, efficient and accurate song search service that finds, for the user, target songs similar to the song to be searched.

Description

Song searching method and device, equipment, medium and product thereof
Technical Field
The present application relates to the technical field of music information retrieval, and in particular, to a song search method and a corresponding apparatus, a computer device, a computer-readable storage medium, and a computer program product.
Background
With the popularity of short video, live streaming and online radio, the volume of cover-song material keeps growing, and the scenarios that require music identification are increasingly complex. Compared with the original recording, a cover version may differ in, or even completely change, musical components such as timbre, fundamental frequency, rhythm, tempo, harmony, lyrics, singing style and overall structure. Cover song recognition is therefore a very challenging research problem.
Several cover-song recognition techniques exist in the prior art, each with shortcomings, for example: (1) the traditional landmark-based "listen and identify" technique can only recognize the same recording and cannot recognize a cover version that contains differing information; (2) the traditional melody-matching humming recognition technique can only handle clean singing or humming and cannot recognize covers with background accompaniment; (3) the traditional cover-song recognition scheme mainly extracts audio features such as the Pitch Class Profile (PCP) and then computes inter-song similarity distances with algorithms such as dynamic programming. Given the diversity of cover versions, such schemes only suit covers with minor re-arrangement, have low recognition accuracy and slow recognition speed, and do not scale to searching massive music collections.
Common cover-song recognition scenarios include not only matching performances of the same content by different singers but also matching performances from the same singer; and not only matching between longer, complete songs but also matching a partial song fragment against a complete song. The situations are therefore diverse.
Since the prior-art song recognition solutions lack general adaptability and suffer from low recognition accuracy and low recognition efficiency, the applicant seeks a more effective technical solution.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and provide a song searching method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
To serve the various objects of the present application, the following technical solutions are adopted:
A song search method, provided in accordance with one of the objects of the present application, comprises the following steps:
obtaining the encoded information of a song to be searched submitted by a client;
extracting, from the encoded information and with a feature extraction model trained to convergence, a high-dimensional index vector representing deep semantic information of the song to be searched at multiple scales;
calculating the similarity between this high-dimensional index vector and the high-dimensional index vectors of multi-scale deep semantic information extracted by the feature extraction model for each candidate song in a preset song feature library, to obtain a similarity sequence;
and selecting from the similarity sequence the target song whose similarity exceeds a preset threshold and is the largest, constructing an access link for the target song and pushing it to the client device.
In a further embodiment, obtaining the encoded information of the song to be searched submitted by the client comprises the following steps:
receiving a song search request submitted by a client, and acquiring the audio data of the song to be searched specified by the request;
detecting whether the audio data contains vocal singing content, and terminating further execution if it does not;
and encoding the audio data of the song to be searched to obtain the corresponding encoded information.
In a further embodiment, calculating the similarity between the high-dimensional index vector and the high-dimensional index vectors of multi-scale deep semantic information extracted by the feature extraction model for each candidate song in the preset song feature library, to obtain a similarity sequence, comprises the following steps:
calling the preset song feature library to obtain the high-dimensional index vector of each candidate song, where the high-dimensional index vector is a single high-dimensional vector that jointly represents the deep semantic information of a candidate song at its different scales;
calculating the similarity between the high-dimensional index vector of the song to be searched and each high-dimensional index vector in the song feature library, obtaining a similarity sequence that stores one similarity value for each candidate song in the song feature library;
and sorting the similarity sequence in descending order of similarity value to obtain the sorted similarity sequence for output.
In another further embodiment, calculating the similarity between the high-dimensional index vector and the high-dimensional index vectors of multi-scale deep semantic information extracted by the feature extraction model for each candidate song in the preset song feature library, to obtain a similarity sequence, comprises the following steps:
calling the preset song feature library to obtain the high-dimensional index vectors of each candidate song, where the high-dimensional index vectors are several high-dimensional vectors that separately represent the deep semantic information of a candidate song at its various scales;
for each semantic scale, calculating the similarity between the high-dimensional vector of the song to be searched at that scale and the high-dimensional vector of the corresponding scale of each candidate song in the preset song feature library, obtaining one similarity sequence per semantic scale;
combining the similarity values of the per-scale similarity sequences according to the correspondence of the high-dimensional vectors across semantic scales, to obtain an aggregated similarity sequence;
and sorting the aggregated similarity sequence in descending order of similarity value to obtain the sorted similarity sequence for output.
In yet another further embodiment, obtaining the encoded information of the song to be searched submitted by the client comprises the following steps:
receiving a song search request submitted by a client, and acquiring the audio data of the song to be searched specified by the request;
detecting whether the audio data of the song to be searched exceeds a preset duration; if so, splitting it by the preset duration into several pieces of audio data corresponding to several song segments (see the sketch below), and otherwise keeping it as a single piece of audio data;
and encoding each piece of audio data separately to obtain the encoded information of each piece, so that the feature extraction model can extract a high-dimensional index vector for each song segment.
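As a concrete illustration of the segmentation step above, the following minimal sketch splits a decoded waveform into fixed-length pieces. It assumes the audio is a mono numpy array, and the 30-second preset duration is illustrative; neither detail is fixed by this description.

```python
import numpy as np

def split_into_segments(audio: np.ndarray, sr: int, max_seconds: float = 30.0):
    """Split audio into fixed-length segments when it exceeds the preset
    duration; otherwise return the whole clip as a single segment."""
    max_len = int(max_seconds * sr)
    if len(audio) <= max_len:
        return [audio]
    return [audio[start:start + max_len]
            for start in range(0, len(audio), max_len)]
```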
According to a further embodiment, calculating the similarity between the high-dimensional index vectors and the high-dimensional index vectors of multi-scale deep semantic information extracted by the feature extraction model for each candidate song in the preset song feature library, to obtain a similarity sequence, comprises the following steps:
calling the preset song feature library to obtain a high-dimensional index vector for each song segment of each candidate song, where each high-dimensional index vector is a single high-dimensional vector that jointly represents the multi-scale deep semantic information of a song segment;
calculating, for each piece of audio data, the similarity between its high-dimensional index vector and the corresponding high-dimensional index vectors of the candidate songs in the preset song feature library, obtaining one similarity sequence per piece of audio data;
combining the similarity values of all these similarity sequences according to candidate song, so that values belonging to the same candidate song are aggregated into an aggregated similarity sequence (see the sketch below);
and sorting the aggregated similarity sequence in descending order of similarity value to obtain the sorted similarity sequence for output.
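The per-segment aggregation described above can be sketched as follows. Averaging the per-segment similarities of a candidate song is an assumption, since the text only states that the values belonging to the same candidate song are combined.

```python
from collections import defaultdict

def merge_segment_similarities(per_segment_scores):
    """per_segment_scores: list of {song_id: similarity} dicts, one per segment.
    Combine the per-segment values for the same candidate song (mean here, as an
    assumed combination rule) and return the result sorted in descending order."""
    buckets = defaultdict(list)
    for scores in per_segment_scores:
        for song_id, sim in scores.items():
            buckets[song_id].append(sim)
    merged = {song_id: sum(vals) / len(vals) for song_id, vals in buckets.items()}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)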
In a preferred embodiment, the following steps are performed when the feature extraction model is invoked:
performing multi-stage feature extraction on the encoded information with a plurality of convolution blocks in a shared network of the feature extraction model trained to convergence, to obtain intermediate feature information carrying its deep semantic information;
performing feature extraction at different scales on the intermediate feature information with a plurality of convolution blocks in two or more branch networks of the feature extraction model, and converting the results into output feature vectors of the corresponding scales, the output feature vectors of different branch networks carrying different deep semantic information;
and outputting the output feature vector of each branch network as a high-dimensional index vector of the feature extraction model.
In a further embodiment, performing feature extraction at different scales on the intermediate feature information with a plurality of convolution blocks in two or more branch networks of the feature extraction model and converting the results into output feature vectors of the corresponding scales comprises any two or more of the following steps:
performing feature extraction on the intermediate feature information with a plurality of convolution blocks in a first branch network to obtain global feature information, and pooling the global feature information into an output feature vector of the global scale;
performing feature extraction on the intermediate feature information with a plurality of convolution blocks in a second branch network, then splitting the result by channel into several parts and pooling each part, to obtain output feature vectors of the channel scale;
and performing feature extraction on the intermediate feature information with a plurality of convolution blocks in a third branch network, then splitting the result by frequency band into several parts and pooling each part, to obtain output feature vectors of the frequency-band scale.
In a preferred improved embodiment, when performing the pooling operation, the first branch network performs a mean pooling operation and/or a maximum pooling operation to obtain one or two output feature vectors of the global scale; when the second branch network performs the pooling operation, adopting a mean pooling operation for a single or a plurality of channels to correspondingly obtain one or a plurality of output feature vectors of the channel scale; and when the third branch network performs the pooling operation, adopting an average pooling operation aiming at single or multiple frequency bands to correspondingly obtain one or more output characteristic vectors of the frequency band scale.
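To make the shared-trunk and multi-branch structure described above concrete, here is a minimal PyTorch sketch. The layer sizes, the number of channel groups and frequency bands, the use of plain Conv-BN-ReLU blocks, and the use of mean pooling everywhere are assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    """Shared trunk followed by global-, channel- and band-scale branches."""
    def __init__(self, channels=64, groups=4, bands=4):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.shared = nn.Sequential(block(1, channels), block(channels, channels))
        self.branch_global = block(channels, channels)
        self.branch_channel = block(channels, channels)
        self.branch_band = block(channels, channels)
        self.groups, self.bands = groups, bands

    def forward(self, x):                      # x: (batch, 1, freq, time)
        mid = self.shared(x)                   # intermediate feature information
        g = self.branch_global(mid)
        global_vec = g.mean(dim=(2, 3))        # mean pooling over freq and time
        c = self.branch_channel(mid)
        channel_vecs = [chunk.mean(dim=(2, 3))       # one vector per channel group
                        for chunk in c.chunk(self.groups, dim=1)]
        b = self.branch_band(mid)
        band_vecs = [chunk.mean(dim=(2, 3))          # one vector per frequency band
                     for chunk in b.chunk(self.bands, dim=2)]
        # each returned vector is one output feature vector / high-dimensional index vector
        return [global_vec] + channel_vecs + band_vecs
```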
In an optional embodiment, the encoded information is derived from any one of the time-frequency spectrum information, mel-frequency spectrum information, CQT filtering information, pitch-class (tone level) contour information, or Chroma feature information of the corresponding audio data.
In a preferred embodiment, in the shared network, at least one of the convolution blocks applies an attention module for extracting key information in song audio data, and the attention module is a spatial attention module or a channel attention module.
In an optimized embodiment, the following steps are performed when a convolution block is invoked (a sketch follows the list):
performing a convolution transformation on the input information to obtain transformation feature information;
applying instance normalization and batch normalization to the transformation feature information respectively, concatenating the results into spliced feature information, and activating the spliced feature information for output;
performing convolution operations and batch normalization several times on the activated spliced feature information to obtain residual information;
and adding the residual information back to the input of the block and activating the result for output.
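A hypothetical PyTorch rendering of this convolution block is given below. Splitting the transformed channels half-and-half between instance normalization and batch normalization (IBN-style), as well as the kernel sizes and channel count, are assumptions not fixed by the description above.

```python
import torch
import torch.nn as nn

class IBNResidualBlock(nn.Module):
    """Conv transform, IN/BN applied to the result and concatenated, activation,
    further conv+BN refinement, then residual addition back onto the block input."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, 3, padding=1)
        self.inorm = nn.InstanceNorm2d(channels // 2, affine=True)
        self.bnorm = nn.BatchNorm2d(channels // 2)
        self.relu = nn.ReLU(inplace=True)
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        t = self.conv_in(x)                      # transformation feature information
        half = t.size(1) // 2
        spliced = torch.cat([self.inorm(t[:, :half]), self.bnorm(t[:, half:])], dim=1)
        spliced = self.relu(spliced)             # spliced (concatenated) features, activated
        residual = self.refine(spliced)          # several conv + batch-norm operations
        return self.relu(x + residual)           # add back to the block input and activate
```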
In an extended embodiment, the training process of the feature extraction model comprises the following iterative steps (a sketch follows the list):
calling a training sample from a training set and determining its encoded information, the training sample being pre-collected song audio data consisting of a complete song or a fragment of one;
inputting the encoded information into the feature extraction model to obtain the corresponding output feature vectors;
performing classification prediction on each output feature vector to map it to a corresponding classification label;
calculating a loss value of the feature extraction model from the supervision label of the training sample and the classification labels, and updating the gradients of the feature extraction model according to the loss value;
and judging whether the loss value has reached a preset threshold; if not, calling the next training sample in the training set and continuing to iteratively train the feature extraction model until the loss value reaches the preset threshold.
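The iterative training procedure can be sketched as below. The use of one classification head per output feature vector, the Adam optimizer, the learning rate and the numeric stopping threshold are illustrative assumptions; only the overall loop structure follows the steps above.

```python
import torch
import torch.nn as nn

def train_until_converged(model, heads, loader, target_loss=0.05, lr=1e-3):
    """model: multi-branch feature extractor; heads: list of nn.Linear classifiers,
    one per output feature vector; loader yields (encoded_info, supervision_label)."""
    criterion = nn.CrossEntropyLoss()
    params = list(model.parameters()) + [p for h in heads for p in h.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr)
    while True:
        for encoded, label in loader:            # coding information + supervision label
            vectors = model(encoded)             # one output feature vector per branch
            loss = sum(criterion(head(v), label) for head, v in zip(heads, vectors))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                     # gradient update
            if loss.item() <= target_loss:       # preset threshold reached, stop training
                return model
```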
A song search apparatus, provided to serve one of the objects of the present application, comprises a song encoding module, a semantic extraction module, a similarity matching module and a screening-and-pushing module. The song encoding module obtains the encoded information of a song to be searched submitted by a client; the semantic extraction module uses a feature extraction model trained to convergence to extract, from the encoded information, a high-dimensional index vector representing deep semantic information of the song to be searched at multiple scales; the similarity matching module calculates the similarity between this high-dimensional index vector and the high-dimensional index vectors of multi-scale deep semantic information extracted by the feature extraction model for each candidate song in a preset song feature library, obtaining a similarity sequence; and the screening-and-pushing module selects from the similarity sequence the target song whose similarity exceeds a preset threshold and is the largest, constructs an access link for the target song and pushes it to the client device.
A computer device adapted for one of the purposes of the present application comprises a central processing unit and a memory, the central processing unit being adapted to invoke execution of a computer program stored in the memory to perform the steps of the song search method described herein.
A computer-readable storage medium, provided in a form of computer-readable instructions, stores a computer program implemented according to the song search method, which when invoked by a computer performs the steps included in the method.
A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the present application has the following advantages:
First, in response to a query-by-song request, the application obtains, from the encoded information of the song to be searched and a feature extraction model pre-trained to convergence, a high-dimensional index vector of deep semantic information representing the style-invariant features of the song to be searched; it then computes the similarity between this vector and the high-dimensional index vectors of the candidate songs in a preset song feature library, determines a target song similar to the song to be searched from the result, and pushes a link to the target song to the client device. Because the high-dimensional index vectors of the candidate songs in the song feature library are extracted by the same feature extraction model, the audio data of each song is given a deep semantic representation at several semantic scales and can be matched against the song to be searched at the semantic level, so that similar matching of the song to be searched is performed accurately on those semantic scales and an end-to-end model architecture quickly returns similar songs to the client device.
Second, the feature extraction model adopted here performs multi-scale extraction of the deep semantic information of song audio data, so the resulting high-dimensional index vector has stronger representational power: it can represent the global feature information, salient feature information, channel feature information, frequency-band feature information and so on of the song audio data. This yields a more effective index of the corresponding audio data; downstream processing such as retrieval, query and matching performed on this basis achieves more accurate and efficient results, and the approach serves multiple application scenarios such as cover song recognition, listening-based song identification and humming recognition.
In addition, by combining end-to-end representation learning with a retrieval-and-matching mechanism, the approach achieves clear economies of scale; it can be deployed in the back end of an online music service platform behind a standardized interface, serving the needs of many different application scenarios, providing a comprehensive, multi-purpose open service and improving the economics of the platform's music information retrieval.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart diagram of an exemplary embodiment of a song search method of the present application;
FIG. 2 is a flow chart illustrating a process of obtaining encoded information according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a process of calculating a similarity sequence according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a process for calculating a similarity sequence according to another embodiment of the present application;
FIG. 5 is a flowchart illustrating a process of obtaining encoded information according to another embodiment of the present application;
FIG. 6 is a flowchart illustrating a process of calculating a similarity sequence according to yet another embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating the operation of a feature extraction model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a network architecture of a feature extraction model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a network architecture of a feature extraction model according to another embodiment of the present application;
FIG. 10 is a schematic flow chart of the working process of a residual convolution block used in the feature extraction model of the present application;
FIG. 11 is a flow chart illustrating a process of training the feature extraction model of the present application;
FIG. 12 is a functional block diagram of a classification model accessed by the feature extraction model of the present application during a training phase;
FIG. 13 is a functional block diagram of a song search apparatus of the present application;
fig. 14 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant) that may include a radio frequency receiver, a pager, internet/intranet access, web browser, notepad, calendar, and/or GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., a computer program is stored in the memory, and the central processing unit calls a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers may be independent from each other but can be called through an interface, or may be integrated into one physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed on a server so that a client accesses them by remotely invoking the online service interface provided by the server, or may be deployed and run directly on a client for local access.
Unless specified in clear text, the neural network model referred to or possibly referred to in the application can be deployed in a remote server and performs remote invocation at a client, and can also be deployed in a client with sufficient equipment capability to perform direct invocation.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.
Unless expressly stated otherwise, the technical features of the embodiments disclosed in the present application may be cross-linked to form a new embodiment, so long as the combination does not depart from the spirit of the present application and can satisfy the requirements of the prior art or solve the disadvantages of the prior art. Those skilled in the art will appreciate variations therefrom.
The song search method of the application can be programmed into a computer program product and deployed to run on a server, so that clients can access its open interface in the form of a web page or application and interact with the process of the computer program product through a graphical user interface.
Referring to fig. 1, the song search method of the present application, in an exemplary embodiment thereof, includes the following steps:
step S1100, obtaining the coding information corresponding to the song to be searched submitted by the client:
the technical scheme of the application can be deployed in a server of an online music platform, and various services for searching songs by songs are opened for massive platform users, including but not limited to song listening and song reading, humming and song reading, and singing recognition.
In the listening-based song identification service, the user records at the client a piece of audio data, usually captured from an external source, which is submitted to the server as the song to be searched; from this audio data the server finds the original song or the corresponding cover version for the user.
In the humming recognition service, a user can record a melody formed by vocal singing at a client side of the user to obtain corresponding audio data, the corresponding audio data are used as songs to be searched and submitted to a server, and the server finds songs with the same melody for the user according to the similarity of the melodies.
In the cover song recognition service, the user designates or submits the audio data of a song at the client to serve as the song to be searched; the server then finds the same song, or other corresponding versions of it, so as to determine which songs stand in a cover relationship with it.
After the song to be searched submitted by the user arrives at the server, the server firstly carries out corresponding coding on the song to be searched so as to obtain corresponding coding information of the song, and audio data of the song to be searched can be adaptively processed according to specific situations in the coding process.
The audio data of the song to be searched submitted to the server may be audio data in any format such as MP3, WMA, M4A, WAV, etc., or audio data obtained by separating audio from various types of video files. The audio data of a song to be searched is generally composed of a plurality of voice data packets in the time domain. On the basis, corresponding conversion processing is carried out on the voice data packet according to the specific coding information type, so that corresponding coding information can be obtained.
The encoded information mainly describes the style-invariant features of the song audio data and may be of various types, including but not limited to time-frequency spectrum information, mel-frequency spectrum information, CQT filtering information, pitch-class contour information, Chroma feature information and the like extracted from the voice data packets of the audio data. Each type can be produced with the corresponding algorithm, and any of them can be used in the present application for feature extraction. In practice, the CQT filtering information, which performed best in tests, is recommended as the basis of the encoded information.
Those skilled in the art will appreciate that each of the above kinds of encoded information can be produced with the corresponding algorithm. In the encoding process, the song audio data first undergoes conventional processing such as pre-emphasis, framing and windowing, after which time-domain or frequency-domain analysis, i.e. speech signal analysis, is performed. The purpose of pre-emphasis is to boost the high-frequency part of the speech signal and flatten its spectrum; pre-emphasis is typically implemented with a first-order high-pass filter. Before analysis the speech signal is divided into frames, usually 20 ms long, with a 10 ms overlap between adjacent frames to account for frame shift. Framing is accomplished by applying a window to the speech signal; different window choices affect the analysis result, and it is common to use the window function of the Hamming window.
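A minimal numpy sketch of this preprocessing (pre-emphasis, 20 ms framing with a 10 ms hop, Hamming windowing) follows; the 0.97 pre-emphasis coefficient is a common default rather than a value given here.

```python
import numpy as np

def frame_signal(audio, sr, frame_ms=20, hop_ms=10, preemph=0.97):
    """Pre-emphasize with a first-order high-pass filter, then cut into 20 ms
    frames with 10 ms overlap and apply a Hamming window to each frame."""
    emphasized = np.append(audio[0], audio[1:] - preemph * audio[:-1])
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = [emphasized[i:i + frame_len] * window
              for i in range(0, len(emphasized) - frame_len + 1, hop)]
    return np.stack(frames)          # shape: (n_frames, frame_len)
```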
On the basis of completing the preprocessing required by the voice signal analysis of the song audio data, the time domain and the frequency domain can be further analyzed, so as to realize the coding and obtain the corresponding coding information:
and aiming at the time-frequency spectrum information, pre-emphasis, framing, windowing and short-time Fourier transform (STFT) are carried out on the voice data of each voice data packet on a time domain to transform the voice data into a frequency domain, so that data corresponding to a spectrogram is obtained, and the time-frequency spectrum information is formed.
The mel-frequency spectrum information can be obtained by filtering the time-frequency spectrum information by using a mel-scale filter bank, and in the same way, corresponding mel cepstrum information is obtained by carrying out logarithm taking and DCT (discrete cosine transformation) on the mel-frequency spectrum information, and the method is also suitable. It will be appreciated that mel-frequency spectrum information and mel-frequency cepstral information thereof can better describe style invariant features in a song, such as pitch, intonation, timbre, and the like.
As for the CQT filtering information: all tones in music are built from twelve-tone equal temperament over several octaves, i.e. the twelve semitones within one octave on a piano. The frequency ratio between adjacent semitones is 2^(1/12), and the higher of two octaves of the same scale degree has twice the frequency of the lower. Musical pitches are therefore distributed exponentially, whereas the audio spectrum obtained by the Fourier transform is linearly spaced, so the frequency bins of the two do not correspond one to one, which introduces errors into the estimates of certain scale frequencies. The CQT time-frequency transform can be used instead of the Fourier transform for this analysis. CQT stands for Constant Q Transform: a bank of filters whose center frequencies are exponentially spaced, whose bandwidths differ, and whose ratio Q of center frequency to bandwidth is constant. Unlike the Fourier transform, the frequency axis of its spectrum is not linear but log2-based, and the filter window length can vary with the spectral-line frequency to achieve better performance. Because the CQT bins follow the same distribution as musical scale frequencies, computing the CQT spectrum of a music signal directly yields the amplitude of the signal at each note frequency, which suits music signal processing better. This embodiment therefore recommends encoding with this information to obtain the encoded information used as input to the neural network model of the present application.
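For illustration, the CQT encoding can be produced with librosa as sketched below; the 84-bin layout (7 octaves x 12 semitones) and the log-magnitude step are assumptions about a reasonable configuration, not parameters taken from the patent.

```python
import librosa
import numpy as np

def cqt_encoding(path, n_bins=84, bins_per_octave=12):
    """Constant-Q transform encoding of an audio file as a (n_bins, time) matrix."""
    y, sr = librosa.load(path, sr=None, mono=True)
    cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins,
                             bins_per_octave=bins_per_octave))
    return librosa.amplitude_to_db(cqt, ref=np.max)   # log-magnitude CQT spectrum
```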
The tone level contour information comprises PCP (Pitch Class Profile) and HPCP (harmonic Pitch Class Profile), and aims to extract a corresponding Pitch sequence from song audio data, convert the Pitch sequence into a melody contour sequence after being regulated, merged and segmented, and convert the melody contour sequence into corresponding feature representation by using a standard tone difference value generated by standard tones. The coding information constructed based on the sound level contour information has better robustness to the environmental noise.
The Chroma characteristic information is a general term of a Chroma Vector (Chroma Vector) and a Chroma map (Chroma map). A chroma vector is a vector containing 12 elements that represent the energy in 12 levels over a period of time (e.g., 1 frame), with the energy of the same level being accumulated for different octaves, and a chroma map is a sequence of chroma vectors. Specifically, after a voice data packet of song audio data is subjected to short-time Fourier transform and is converted from a time domain to a frequency domain, some noise reduction processing is carried out, and then tuning is carried out; converting the absolute time into frames according to the length of the selected window, and recording the energy of each pitch in each frame to form a pitch map; on the basis of a pitch map, the energy (in a loudness meter) of notes with the same time, the same tone level and different octaves is superposed on an element of the tone level in a chrominance vector to form a chrominance map. The data corresponding to the chromaticity diagram is the Chroma characteristic information.
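A corresponding sketch for the Chroma feature information, again using librosa and assuming an STFT-based chromagram, might look like this:

```python
import librosa

def chroma_encoding(path):
    """Chroma encoding: a 12-dimensional chroma vector per frame, stacked over
    time into a chromagram of shape (12, n_frames)."""
    y, sr = librosa.load(path, sr=None, mono=True)
    return librosa.feature.chroma_stft(y=y, sr=sr)
```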
Any one of the above specific coding information can be used for inputting the feature extraction model of the present application, and in order to facilitate the processing of the feature extraction model, the coding information can be organized according to a certain preset format. For example, the coding information corresponding to each voice packet is organized into a row vector, and the row vectors of the voice data packets are organized together by row according to time sequence for the whole audio data to obtain a two-dimensional matrix as the complete coding information. And the like, can be preset for adapting the feature extraction model, and can be flexibly implemented by the technical personnel in the field.
It should be noted that the encoding principle referred to herein is not only applicable to the song to be searched, but also applicable to the processing of the training samples by the feature extraction model in the training stage, as will be understood by those skilled in the art.
Step S1200, extracting high-dimensional index vectors of deep semantic information representing the song to be searched in multiple scales according to the coding information by adopting a feature extraction model trained to a convergence state:
the feature extraction model for extracting the deep semantic information of the songs based on the convolutional neural network model is trained to be in a convergence state in advance, and is trained to acquire the capability of being suitable for extracting the deep semantic information of multiple scales of the audio data of the songs according to the coding information, so that the representation learning of the style invariant features of the audio data of the corresponding songs is realized, and the feature extraction model can be used for the requirements of query, retrieval, matching and the like among the songs.
The feature extraction model of the present application is implemented as a feature extraction model adapted to extract deep semantic information of multiple scales of the same audio data, representing the deep semantic information as single or multiple high-dimensional index vectors, so as to implement feature representation of the audio data from multiple different aspects and/or different angles. The high-dimensional index vector is essentially a high-dimensional vector that, at a semantic level, serves as an index representative of the encoding information of the corresponding audio data. The different scales comprise global scales based on the coded information or feature extraction based on frequency band scales, channel scales and the like of the coded information, and for one song, the deep semantic information of two or more scales corresponding to the coded information is selected and represented as a high-dimensional index vector, so that the feature representation of the multi-scale deep semantic information of the corresponding song can be realized.
After the feature extraction model implemented according to the above principle is trained to converge, a service interface can be opened for the technical scheme of this embodiment to call, the encoding information of the song to be searched is fed into the service interface, feature extraction is performed by the feature extraction model on the basis of the encoding information, and a high-dimensional index vector corresponding to the song to be searched is obtained.
It should be understood that, since the feature extraction model can extract deep semantic information of a song from multiple scales, there can be different forms of organization when converting these deep semantic information of different scales into the high-dimensional index vector, for example, representing the high-dimensional index vector as a single high-dimensional vector, which generally represents deep semantic information of a song as a whole; or, the high-dimensional index vector is expressed into a plurality of discrete high-dimensional vectors according to the corresponding relation of the scales, and each high-dimensional vector corresponds to one scale. In any case, those skilled in the art can flexibly organize these high-dimensional vectors according to the need of the actual scale semantic information, so as to facilitate the invocation of the representation data of the overall deep semantic information of the song.
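The two organizations described above (one overall vector versus one vector per scale) can be illustrated as follows; the dictionary keys are purely hypothetical names.

```python
import numpy as np

def build_index_vectors(scale_vectors, as_single=True):
    """scale_vectors: list of 1-D numpy arrays, one per semantic scale, as produced
    by a multi-branch extractor. Either concatenate them into one high-dimensional
    index vector or keep one discrete vector per scale."""
    if as_single:
        return np.concatenate(scale_vectors)            # single overall index vector
    return {f"scale_{i}": v for i, v in enumerate(scale_vectors)}   # one per scale
```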
For the step, the feature extraction model is used for extracting the features of the coded information of the song to be searched, and finally, the high-dimensional index vector corresponding to the song to be searched can be obtained and can be used for subsequent similarity matching.
Step S1300, calculating the similarity between the high-dimensional index vector and the high-dimensional index vector of deep semantic information which is extracted by the feature extraction model and represents multiple scales of each candidate song in a preset song feature library, and obtaining a similarity sequence:
In an exemplary online music service platform, a music library is maintained that stores the audio data of a large number of songs. To support searching songs by songs, the deep semantic information of every song in the library is extracted in advance by the feature extraction model and represented as the corresponding high-dimensional index vectors; the mapping between these high-dimensional index vectors and their songs is then stored in a preset song feature library. The high-dimensional index vectors in the song feature library can later be retrieved for similarity calculation, in order to find the songs similar to a song to be searched. For a given song to be searched, the songs referenced by the song feature library are therefore the candidate songs of the search.
The function of the feature extraction model to extract the high-dimensional index vector of the audio data of the song has been described above, and the same applies to this step. Specifically, after the feature extraction model is trained to a convergence state, the feature extraction model can be put into production for use, and the online music service platform can extract corresponding high-dimensional index vectors of each song in the song library by adopting the feature extraction model, and then organize the high-dimensional index vectors into mapping relation data to be stored in the song feature library, so that the online music service platform can serve the requirements of the application. Similarly, since the feature extraction model is suitable for extracting multi-scale features, the feature extraction model also realizes multi-scale representation of deep semantic information of each song aiming at the high-dimensional index vector extracted from the coding information of the audio data of each song in the song library, and usually has corresponding relation in an organization form with the high-dimensional index vector of the song to be searched. Of course, sometimes those skilled in the art can adjust such a correspondence relationship as needed.
Based on the high-dimensional index vector of the song to be searched and the high-dimensional index vector of each candidate song in the song feature library, a preset similarity formula can be applied to compute a similarity value between the song to be searched and every candidate song. Any algorithm suitable for measuring the similarity distance between data can be used, such as cosine similarity, Euclidean distance, the Pearson coefficient, Jaccard similarity or nearest-neighbour search, at the discretion of those skilled in the art. The similarity calculation yields a similarity sequence relating the high-dimensional index vector of the song to be searched to the high-dimensional index vector of each candidate song in the song feature library; this sequence stores the similarity value of the song to be searched with respect to every candidate song.
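As an illustration, using cosine similarity (one of the measures named above) the similarity sequence could be computed as in this sketch; the in-memory dictionary standing in for the song feature library is an assumption made for brevity.

```python
import numpy as np

def similarity_sequence(query_vec, library):
    """library: {song_id: index_vector}. Returns one cosine-similarity value
    per candidate song, i.e. the similarity sequence."""
    q = query_vec / np.linalg.norm(query_vec)
    return {song_id: float(np.dot(q, v / np.linalg.norm(v)))
            for song_id, v in library.items()}
```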
Step S1400, a target song with the similarity value exceeding a preset threshold and being the maximum similarity in the similarity sequence is screened and determined, and an access link corresponding to the target song is constructed and pushed to the client device:
Once the similarity sequence of the song to be searched has been determined, a preset threshold, which may be an empirical or experimental value, is used to filter the sequence, keeping only the elements whose similarity exceeds the threshold. If no element exceeds the preset threshold, the song feature library contains no song similar to the song to be searched. If several similarity values pass the filter, only the candidate song with the largest similarity value is taken as the song similar to the song to be searched, i.e. the target song found by the query-by-song search.
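The threshold screening and maximum selection of step S1400 can then be sketched as follows; the 0.8 threshold is illustrative, since the description leaves the value as an empirical or experimental choice.

```python
def pick_target_song(similarities, threshold=0.8):
    """Keep only candidates whose similarity exceeds the preset threshold and
    return the (song_id, similarity) pair with the largest value, or None when
    nothing passes the filter."""
    passed = {song_id: s for song_id, s in similarities.items() if s > threshold}
    if not passed:
        return None
    return max(passed.items(), key=lambda kv: kv[1])
```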
In this way, given the song to be searched submitted or designated by the client device, the target song semantically similar to it is determined; the audio data of the target song can then be fetched from the song library, an access link to that audio data obtained and packaged into a playable page or playable data containing the link, and the result pushed to the corresponding client device for the user to access. Thus, whether the user is identifying a song by listening, identifying it by humming, or checking for a cover version, submitting or designating the song to be searched is enough to obtain the corresponding result.
In other embodiments that will be subsequently disclosed in the present application, there are many variations of the process of searching for songs by songs, which will not be elaborated here. It should be understood that the present application, even described only with reference to the exemplary embodiments above, provides numerous advantages, including but not limited to the following:
firstly, in response to a query-by-song request, the present application obtains, from the coding information produced by encoding the song to be searched and from a feature extraction model pre-trained to a convergence state, a high-dimensional index vector representing deep semantic information of the style-invariant features of the song to be searched; it then carries out similarity calculation between this high-dimensional index vector and the high-dimensional index vectors corresponding to the candidate songs in a preset song feature library, determines a target song similar to the song to be searched according to the similarity calculation result, and pushes the link of the target song to the client device. Because the high-dimensional index vectors of the candidate songs in the song feature library are extracted by the same feature extraction model, deep semantic representation of the corresponding song audio data is realized on different semantic scales, so the candidate songs can be semantically matched with the song to be searched, similar matching of the song to be searched is accurately realized according to these semantic scales, and the end-to-end model architecture can quickly serve client devices with similar-song queries.
Secondly, the feature extraction model adopted by the present application realizes multi-scale extraction of the deep semantic information of song audio data, so that the obtained high-dimensional index vector has stronger representation capability, for example representing the global feature information, salient feature information, channel feature information and frequency-band feature information of the song audio data. This enables more effective indexing of the corresponding song audio data; when downstream processing such as retrieval, query and matching of song audio data is carried out on this basis, more accurate and efficient matching can be obtained, so that the method can be generally used in multiple application scenarios such as cover-song recognition, song recognition by listening, and humming recognition.
In addition, since the representation learning capability is implemented end to end and is assisted by a retrieval-and-matching mechanism, the present application can obtain an obvious scale effect, can be deployed in the background of an online music service platform to realize a standardized interface, can further serve the requirements of various application scenarios, provide comprehensive and multipurpose open services, and improve the economic advantage of the platform's music information retrieval.
Referring to fig. 2, in a further embodiment, the step S1100 of obtaining the coding information corresponding to the song to be searched, which is submitted by the client, includes the following steps:
step S1111, receiving a song search request submitted by the client, and acquiring audio data of a song to be searched, which is specified by the request:
in this embodiment, a user records a piece of audio data in a song search page displayed by the user's client device, triggers a song search request that takes the audio data as the audio data of the song to be searched, and submits the request to a server that provides the query-by-song service of the present application. The server parses the request to obtain the corresponding audio data.
Step S1112, detecting whether the audio data includes vocal singing information, and if not, terminating the following steps:
in order to improve matching accuracy, the server may pre-process the received audio data, for example by detecting, through a VAD logic module, whether the audio data includes a vocal singing melody part; if not, subsequent execution is terminated and a corresponding notification is returned directly to the client device. When it is confirmed that the audio data contains a vocal singing melody part, the subsequent steps may be continued. The VAD logic module may likewise be implemented using various existing techniques known to those skilled in the art, preferably using an end-to-end neural network model pre-trained to a convergence state.
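The application does not fix a particular VAD implementation; the following is a crude energy-based placeholder, shown only to make the control flow concrete, whereas a real deployment would preferably use the end-to-end neural VAD mentioned above.

import numpy as np

def contains_vocal_singing(samples: np.ndarray, sr: int,
                           frame_seconds: float = 0.03,
                           voiced_ratio: float = 0.1) -> bool:
    # Frame the signal and measure per-frame RMS energy; frames well above the
    # noise floor are treated as "voiced". Stand-in only, not the preferred model.
    hop = int(sr * frame_seconds)
    if len(samples) < 2 * hop:
        return False
    frames = [samples[i:i + hop] for i in range(0, len(samples) - hop, hop)]
    rms = np.array([np.sqrt(np.mean(f.astype(float) ** 2)) for f in frames])
    noise_floor = np.percentile(rms, 10) + 1e-8
    return float(np.mean(rms > 4.0 * noise_floor)) > voiced_ratio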
Step S1113, encode the audio data of the song to be searched, obtain the corresponding coded information:
for audio data that has passed the vocal detection, the audio data can be encoded according to the encoding principle described earlier in this application to obtain the corresponding coding information. As mentioned above, CQT filtering information is recommended for constructing the coding information corresponding to the audio data of the song to be searched.
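As a hedged example, CQT-based coding information could be produced along the following lines; the sampling rate, bin count and dB scaling are assumptions rather than values fixed by this application.

import librosa
import numpy as np

def encode_cqt(samples: np.ndarray, sr: int = 22050,
               n_bins: int = 84, bins_per_octave: int = 12) -> np.ndarray:
    # Constant-Q transform magnitude converted to a dB scale, giving a
    # bands-by-frames matrix usable as coding information.
    cqt = np.abs(librosa.cqt(samples, sr=sr, n_bins=n_bins,
                             bins_per_octave=bins_per_octave))
    return librosa.amplitude_to_db(cqt, ref=np.max)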
In this embodiment, the audio data of the song to be searched is pre-processed so that audio data containing no vocal part is filtered out, preventing the online service from frequently responding to invalid songs to be searched and improving the matching accuracy of the query-by-song search.
Referring to fig. 3, in a further embodiment, the step S1300 of calculating the similarity between the high-dimensional index vector and the high-dimensional index vector of the deep semantic information extracted by the feature extraction model from the preset song feature library and representing multiple scales of each candidate song to obtain a similarity sequence includes the following steps:
step S1311, calling a preset song feature library to obtain a high-dimensional index vector corresponding to each candidate song, wherein the high-dimensional index vector is a single high-dimensional vector which integrally represents deep semantic information of the candidate song with different scales:
in this embodiment, when the feature extraction model performs feature extraction and representation on the coding information of each candidate song in the song library, the deep semantic information of each scale extracted from each candidate song is finally spliced together in sequence to form a single high-dimensional vector as a single high-dimensional index vector, so that the high-dimensional index vector represents the deep semantic information of the corresponding candidate song with different scales on the whole. Correspondingly, when the feature extraction model performs feature extraction and representation on the song to be searched, a single high-dimensional vector with the same organization form is obtained as a corresponding high-dimensional index vector.
Step S1312, respectively calculating the similarity between the high-dimensional index vector of the song to be searched and each high-dimensional index vector in the song feature library to obtain a corresponding similarity sequence; the similarity sequence stores the similarity value corresponding to each candidate song in the song feature library:
because each song corresponds to only one high-dimensional index vector, the amount of computation for similarity matching between the song to be searched and the candidate songs is reduced. The similarity between the song to be searched and each candidate song is calculated directly by applying the similarity calculation formula to the high-dimensional index vectors of the songs, the resulting similarity values are assembled into a single similarity sequence, and the similarity sequence stores the similarity value corresponding to each candidate song in the song feature library.
Step S1313, the similarity sequence is reversely sorted according to the similarity value, and sorted similarity sequence output is obtained:
in order to facilitate subsequent screening and filtering, the elements in the similarity sequence can be subjected to reverse sorting, namely sorting from large to small, and then the sorted similarity sequence is output.
In this embodiment, the high-dimensional index vector output by the feature extraction model is constrained to be a single high-dimensional vector, so that similarity calculation between the song to be searched and the candidate songs can be performed by simple application of the formula, which reduces system resource overhead, improves calculation efficiency, and increases the matching speed of the query-by-song search.
Referring to fig. 4, in another further embodiment, the step S1300 of calculating the similarity between the high-dimensional index vector and the high-dimensional index vector of the deep semantic information extracted by the feature extraction model from the preset song feature library and representing multiple scales of each candidate song to obtain a similarity sequence includes the following steps:
step S1321, calling a preset song feature library to obtain a high-dimensional index vector corresponding to each candidate song, wherein the high-dimensional index vector is a plurality of high-dimensional vectors which dispersedly represent deep semantic information of various scales of one candidate song:
unlike the previous embodiment, in the present embodiment, when performing feature extraction and representation on the audio data of a song, after obtaining a single high-dimensional vector for each scale, the feature extraction model does not splice these high-dimensional vectors but directly stores them in the song feature library in association with the corresponding candidate song, so that together they constitute the high-dimensional index vector. That is, the high-dimensional index vector is composed of a plurality of scattered high-dimensional vectors, each of which represents the deep semantic information of one scale of the same song. It can therefore be understood that, when the feature extraction model is applied to feature extraction and representation of the song to be searched, the resulting high-dimensional index vector also includes a plurality of scattered high-dimensional vectors, each corresponding to the deep semantic information of one scale. Accordingly, the high-dimensional vectors of the song to be searched and those of the candidate songs are in one-to-one correspondence with respect to the semantic scales produced by the feature extraction model.
Step S1322, calculating the similarity between the high-dimensional vector of the song to be searched and the corresponding semantic scale high-dimensional vector of each candidate song in the preset song feature library according to the high-dimensional vectors corresponding to different semantic scales, and obtaining a similarity sequence corresponding to each semantic scale:
in view of the fact that the high-dimensional index vector is formed by dispersing high-dimensional vectors of multiple scales, in the embodiment, when similarity calculation between the song to be searched and the candidate song is performed, according to the corresponding relationship between the song to be searched and the candidate song on the semantic scale, the similarity between the song to be searched and each candidate song is calculated for each semantic scale respectively, and a similarity sequence of corresponding semantic scales is obtained. Therefore, it is understood that each semantic scale can obtain the corresponding similarity sequence.
Step S1323, according to the corresponding relation of the high-dimensional vector on the semantic scale, summarizing and combining the similarity numerical values in the similarity sequence corresponding to various semantic scales to obtain a summarized similarity sequence:
because there are multiple similarity sequences corresponding to different semantic scales, the results of all the similarity sequences need to be integrated. In one embodiment, for each candidate song, the similarity values of the different scales may be directly summed, weighted-summed, or averaged, so that the similarity values of different scales corresponding to each candidate song are unified into a single similarity value, thereby forming the final summarized similarity sequence.
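A possible sketch of this per-scale aggregation, assuming each scale's similarity sequence is a dictionary keyed by candidate song and that equal weights are used when none are specified:

def merge_scale_sequences(scale_sequences: list, weights=None) -> dict:
    # scale_sequences: list of {song_id: similarity} dicts, one per semantic scale.
    weights = weights or [1.0 / len(scale_sequences)] * len(scale_sequences)
    merged = {}
    for w, seq in zip(weights, scale_sequences):
        for song_id, sim in seq.items():
            merged[song_id] = merged.get(song_id, 0.0) + w * sim
    return merged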
Step S1324, reverse sorting is carried out on the summary similarity sequence according to the similarity degree value, and sorted similarity degree sequence output is obtained:
similarly, for the summarized similarity sequence, reverse sorting may be performed, and the sorted similarity sequence is output.
In this embodiment, the high-dimensional index vector output by the feature extraction model is adapted to include a plurality of scattered high-dimensional vectors in one-to-one correspondence with different semantic scales, which gives flexibility in calling the deep semantic information extracted by the feature extraction model and allows those skilled in the art to organize and use the high-dimensional vectors of different semantic scales according to the scales they select, so as to serve different business requirements. For example, in a song-recognition-by-listening business scenario, more attention may be paid to the high-dimensional vectors representing the global scale and the channel scale of the audio data, while in a cover-song-recognition business scenario, more attention may be paid to the high-dimensional vectors representing the global scale and the frequency-band scale. These options can be implemented flexibly.
Referring to fig. 5, in a further embodiment, the step S1100 of obtaining the coding information corresponding to the song to be searched submitted by the client includes the following steps:
step S1121, receiving a song search request submitted by the client, and acquiring audio data of a song to be searched specified by the request:
in this embodiment, a user records a piece of audio data in a song search page displayed by the user's client device, triggers a song search request that takes the audio data as the audio data of the song to be searched, and submits the request to a server that provides the query-by-song service of the present application. The server parses the request to obtain the corresponding audio data.
Step S1122, detecting whether the audio data of the song to be searched exceeds a preset time duration, if so, dividing the audio data into a plurality of audio data corresponding to a plurality of song segments according to the preset time duration, otherwise, retaining the audio data as the whole segment:
when the song to be searched is long, accuracy may be affected if encoding and feature extraction are performed on it directly. Therefore, in this case, the server may perform duration detection on the audio data of the song to be searched to check whether its duration exceeds a preset duration; if not, the audio data of the song to be searched is treated as a single song segment, and if so, the song to be searched is divided into a plurality of song segments according to the preset duration, yielding audio data corresponding to the plurality of song segments.
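For illustration, a simple segmentation routine under an assumed preset duration of 30 seconds might look as follows; the exact duration is not fixed by this application.

def split_by_duration(samples, sr: int, max_seconds: float = 30.0) -> list:
    # Returns a single-element list when the clip is short enough, otherwise
    # fixed-length segments (the last segment may be shorter).
    seg_len = int(sr * max_seconds)
    if len(samples) <= seg_len:
        return [samples]
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]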
Step S1123, respectively encode each segment of audio data to obtain encoding information corresponding to each segment of audio data, so that the feature extraction model respectively extracts each high-dimensional index vector corresponding to each song segment:
whether the result obtained by the processing in the previous step is single-segment audio data or multi-segment audio data, each segment of audio data is independently encoded to obtain encoding information corresponding to each segment of audio data, and then a feature extraction model is adopted to independently extract multi-scale deep semantic information of the encoding information of each segment of audio data to obtain a plurality of high-dimensional index vectors corresponding to the multi-segment audio data.
In this embodiment, the song to be searched is processed in segments, allowing the feature extraction model to perform feature extraction and representation on each piece of its audio data separately, so that the deep semantic information of the song to be searched is expected to be captured at a finer granularity, and a more accurate search matching result is expected when matching is performed with the high-dimensional index vectors corresponding to the individual pieces of audio data.
Referring to fig. 6, according to the previous embodiment, the step S1300 of calculating the similarity between the high-dimensional index vector and the high-dimensional index vector of the deep semantic information extracted by the feature extraction model from the preset song feature library and representing multiple scales of each candidate song to obtain a similarity sequence includes the following steps:
step S1331, calling a preset song feature library to obtain a high-dimensional index vector corresponding to each song segment of each candidate song, wherein the high-dimensional index vector is a single high-dimensional vector which integrally represents deep semantic information of one song segment in different scales:
in line with the previous embodiment, when constructing the song feature library, each song in the song library may be segmented according to a certain preset duration, the same song being processed into audio data corresponding to a plurality of song segments; feature extraction is then performed on each song segment by using the feature extraction model to obtain a high-dimensional index vector representing the multi-scale semantic information of that song segment. Thus one song has a plurality of high-dimensional index vectors corresponding to its plurality of song segments, each high-dimensional index vector is stored in the song feature library in association with the song, and each high-dimensional index vector can be used to represent the deep semantic information of multiple scales of one song segment.
Step S1332, calculating the similarity between the high-dimensional index vector of each piece of audio data and the corresponding high-dimensional index vector of each candidate song in the preset song feature library, respectively, to obtain a similarity sequence corresponding to each piece of audio data:
given that the high-dimensional index vector of each candidate song is composed of the high-dimensional index vectors of a plurality of song segments, and that the song to be searched may also have been divided into a plurality of song segments producing a plurality of high-dimensional index vectors, in this embodiment the similarity calculation between the song to be searched and the candidate songs is carried out by calculating the similarity between the high-dimensional index vector of each song segment of the song to be searched and all the high-dimensional index vectors in the song feature library. It is thus easy to understand that a corresponding similarity sequence can be obtained for each song segment of the song to be searched. Because the high-dimensional index vector of every segment of every candidate song is compared with the high-dimensional index vector of every segment of the song to be searched, a many-to-many calculation relationship is formed, achieving deeper interactive calculation.
Step S1333, summarizing and combining the similarity values in all the similarity sequences according to the same candidate songs to obtain summarized similarity sequences:
because similarity sequences corresponding to a plurality of song segments of the song to be searched exist, and each similarity sequence may have similarity values between the song segment and a plurality of song segments of the same candidate song, results of the similarity sequences need to be integrated.
In a preferred embodiment, the similarity values of each similarity sequence are first screened. Specifically, since a similarity sequence contains a similarity value between a particular song segment of the song to be searched and every song segment of every candidate song in the song feature library, among the multiple similarity values in the sequence that belong to the same candidate song, only the element with the maximum similarity value is retained; this retained value indicates the song segment of the candidate song to which the song segment of the song to be searched is most similar, the other segments of that candidate song being less similar. Through this process, each similarity sequence retains, for each candidate song, the similarity value of the one song segment that is most similar to the corresponding song segment of the song to be searched. Each song segment of the song to be searched corresponds to one such similarity sequence.
Furthermore, the screened similarity values are summarized by candidate song, that is, the similarity values belonging to the same candidate song in the respective similarity sequences are summed, weighted-summed, and/or averaged to form the final summarized similarity sequence, each element of which comprehensively represents the similarity between the song to be searched and one candidate song.
In a variant embodiment, based on the correspondence of the song segments of the candidate songs, the element with the largest similarity value is selected from each similarity sequence, this element indicating the song segment of the candidate song most similar to the song segment of the song to be searched, thereby obtaining an intermediate similarity sequence that contains, for each song segment of each candidate song, the similarity value with the song segment of the song to be searched that is most similar to it. On this basis, the similarity values of all the song segments belonging to the same candidate song are summed, averaged, or otherwise combined, so that the final summarized similarity sequence is obtained.
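The following sketch illustrates the preferred aggregation described above: per query segment, only the most similar segment of each candidate song is kept, and the retained values are then averaged over the query segments. The data layout is an assumption made for clarity, not one fixed by this application.

def aggregate_segment_matches(segment_sequences: list) -> dict:
    # segment_sequences: one dict per query segment, mapping
    # (candidate_song_id, candidate_segment_id) -> similarity value.
    per_segment_best = []
    for seq in segment_sequences:
        best = {}
        for (song_id, _seg_id), sim in seq.items():
            # Keep, per candidate song, only its most similar segment.
            best[song_id] = max(best.get(song_id, float("-inf")), sim)
        per_segment_best.append(best)
    # Average the retained values over the query segments.
    summary = {}
    for best in per_segment_best:
        for song_id, sim in best.items():
            summary.setdefault(song_id, []).append(sim)
    return {sid: sum(vals) / len(vals) for sid, vals in summary.items()}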
Step S1334, reversely sorting the summary similarity sequence according to the similarity value, obtaining the sorted similarity sequence and outputting:
similarly, for the summarized similarity sequence, reverse sorting may be performed, and the sorted similarity sequence may be output.
In this embodiment, similarity matching between the song to be searched and the candidate songs in the song library is performed in a song-segmented manner, which accommodates a variety of complex situations. On one hand, when the song to be searched is too long, matching accuracy can be improved through segmented identification; on the other hand, even if the song to be searched is a short segment, because each candidate song is divided into a plurality of song segments for segmented deep semantic representation, matching of the song to be searched against a specific song segment of a candidate song can be realized more finely, which facilitates positioning. In addition, under this segmented scheme, the similarity calculation between the song to be searched and each candidate song forms a many-to-many calculation relationship, so that deeper interactive calculation is realized, the correspondence between song segments is more accurate, the calculated similarity values are theoretically more accurate, and matching accuracy can be improved overall.
Referring to FIG. 7, in the preferred embodiment, when the feature extraction model is invoked, the following steps are performed:
step S2100, sequentially performing multi-level feature extraction on the coding information by using a plurality of convolution blocks in the shared network of the feature extraction model trained to a convergence state, to obtain intermediate feature information representing the deep semantic information extracted from the coding information:
the feature extraction model of the present application is constructed based on a multi-branch concept and can be flexibly varied to suit the requirements of different embodiments. In a typical embodiment, as shown in the schematic block diagram of fig. 8, the feature extraction model is composed of a shared network and a plurality of branch networks. The shared network includes a plurality of convolution blocks for extracting deep semantic information of the coding information level by level to obtain intermediate feature information, and the plurality of branch networks respectively extract different types of deep semantic information based on the intermediate feature information to obtain corresponding output feature information. Each branch network includes an identical structural part comprising a plurality of convolution blocks that extract deep semantic information stage by stage; after the output of the last convolution block, different processing can be carried out according to the different functions of each branch network.
The convolution blocks can be implemented with convolution layers based on a CNN or an RNN, and are preferably convolution blocks based on the residual convolution principle. In order to provide a context-combing function that extracts the key information in the song audio data, an attention mechanism may be applied to any one of the convolution blocks by adding a corresponding attention module, specifically a Spatial Attention Module (SAM) or a Channel Attention Module (CAM). In an enhanced embodiment, an instance normalization operation (IN) and a batch normalization operation (BN) are applied in the convolution block, dividing the information input to it into two parts: instance normalization is performed on one part to learn style-invariant features, and batch normalization is performed on the other part to achieve normalization, which constitutes the so-called IBN architecture. By applying this architecture, music-attribute-invariant features can be learned from song audio data whose styles, such as notes, rhythms and timbres, are highly diversified, while the version information is retained.
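A structural sketch of such a shared-network-plus-branches model, written with PyTorch-style modules; the block counts, channel widths and pooling heads are placeholders rather than the exact configuration of fig. 8.

import torch.nn as nn

class FeatureExtractionModel(nn.Module):
    # Shared convolution blocks followed by several branch networks, each of
    # which produces output feature information of a different semantic scale.
    def __init__(self, shared_blocks: nn.Sequential, branches: nn.ModuleList):
        super().__init__()
        self.shared = shared_blocks      # convolution blocks shared by all branches
        self.branches = branches         # e.g. global / channel / band branches

    def forward(self, encoded):          # encoded: (batch, 1, bands, frames)
        intermediate = self.shared(encoded)
        # Each branch turns the intermediate feature map into one or more
        # output feature vectors representing its own semantic scale.
        return [branch(intermediate) for branch in self.branches]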
Therefore, it is easy to understand that the feature extraction model is adapted to different application scenarios by enabling different branch networks and by first training the feature extraction model to a convergence state with a preselected training set, so that it acquires the corresponding feature extraction capability and can perform the task of the application scenario, extracting from the coding information of the song audio data input into it the corresponding output feature information. The training process of the feature extraction model will be given in later exemplary embodiments of the present application and will not be detailed here.
In this step, in the architecture shown in fig. 8, the coding information is subjected to stage-by-stage feature extraction by the plurality of convolution blocks of the shared network; in particular, after the key information is extracted by the last convolution block, intermediate feature information carrying the key information of the coding information is obtained, and this intermediate feature information is divided into multiple paths and output to the plurality of branch networks so that each branch network extracts deep semantic information from a different angle.
Step S2200, after feature extraction of different scales is performed on the intermediate feature information by a plurality of convolution blocks in more than two branch networks of the feature extraction model, converting the intermediate feature information into output feature vectors of corresponding scales, the deep semantic information contained in the output feature vectors of the respective branch networks differing from one another:
as described above, in the architecture shown in fig. 8, the respective branch networks can be flexibly combined, so that according to the specific architecture obtained by combining, it is possible to determine how many branch networks are specifically available. The intermediate feature information output by the shared network is input into each of the branch networks for further feature extraction processing.
According to the architecture shown in fig. 8, the branch networks share an identical structural part that includes two convolution blocks; after feature extraction is performed sequentially by these two convolution blocks, the extracted feature information is output to the specific structures of the different branch networks for different processing.
Specifically, different branch networks, adapted to the deep semantic information they are each meant to extract, can apply different processing in the structural parts in which they differ. For example, by applying various different processes to the feature information output by the last convolution block, output feature information containing different deep semantic information can be obtained; this output feature information describes the deep semantic information of the song audio data from different scales, including global information and various kinds of local information of the song audio data, such as global information that abstracts the salient features of the coding information of the song audio data, or local information that abstracts its channel or frequency-band features. Accordingly, a plurality of pieces of output feature information with different expressions can be obtained, and these pieces of output feature information can be called independently or used in any combination as needed.
In the application, the output feature information output by each branch network is normalized to be represented by the output feature vectors, so that a plurality of branch networks can correspondingly obtain a plurality of output feature vectors, each output feature vector represents deep semantic information of the song audio data on different aspects or different scales, and the deep semantic information contained in each output feature vector is different from one another.
When in use, more than two branch networks are usually adopted to obtain more than two output feature vectors, so as to perform feature representation on song audio data by using more than two deep semantic information, for example, the output feature vector for representing the global information of the song audio data may be used in combination with the output feature vector for representing the channel information of the song audio data, the output feature vector for representing the global information of the song audio data may be used in combination with the output feature vector for representing the band information of the song audio data, or the output feature vector for representing the channel information of the song audio data may be used in combination with the output feature vector for representing the band information of the song audio data, or all the output feature vectors may be used in combination. And so on, as may be called upon by those skilled in the art.
Step S2300, outputting the output feature vector of each branch network as the high-dimensional index vector by the feature extraction model:
the output feature vectors obtained by the branch networks can finally be converted into high-dimensional index vectors for storage or direct use, a high-dimensional index vector being a high-dimensional vector used for indexing the corresponding song audio data. Since each branch network has already normalized its output feature information into an output feature vector, the high-dimensional index vector can be handled in different ways depending on the specific use of the feature extraction model. For example, for application requirements where the vectors are only stored for standby and invoked separately, each output feature vector may be stored in the song feature library in a scattered manner as a plurality of corresponding high-dimensional index vectors, so that the high-dimensional index vectors output by different branch networks can be invoked as needed for retrieval, query and matching. As another example, for a specific task such as song recognition by listening, cover-song recognition or humming recognition, all output feature vectors output by all the constructed branch networks may be spliced in sequence according to the requirements of the task to obtain a single high-dimensional index vector, which may be stored, for example in the song feature library, or used immediately for matching. In this way, representation learning of the song audio data is achieved through the high-dimensional index vector.
According to the principles disclosed above, the song feature library may be prepared for part or all of the songs in the song library of the online music service platform: the steps of this embodiment are applied to the song audio data of each corresponding song or its song segments in the song library to obtain the high-dimensional index vectors corresponding to each piece of song audio data, these high-dimensional index vectors are stored in association with the corresponding songs to construct the song feature library, and thereafter the high-dimensional index vector corresponding to any song can be called directly from the song feature library for operations such as retrieval, query and matching.
In addition to the various application modes disclosed in the present application, the mining and utilization based on the high-dimensional index vector obtained in the present application may have various applications, and may be flexibly applied by those skilled in the art according to the principles disclosed herein without affecting the inventive embodiments of the present application.
Through the above description of the implementation process of the feature extraction model and the network architecture thereof, it can be understood that the present embodiment includes very rich beneficial effects, including but not limited to the following aspects:
firstly, the feature extraction model works on coding information encoded from the audio information of the song audio data, from which the style-invariant features of the song audio data can be obtained; it then extracts intermediate feature information from the coding information through the shared network, extracts the deep semantic information of the song audio data from different angles through the plurality of branch networks on the basis of the intermediate feature information to obtain corresponding output feature information, and finally uses the output feature information as the high-dimensional index vector corresponding to the song audio data, thereby completing end-to-end representation learning of the song audio data.
Secondly, by combining a shared network with a plurality of branch networks, the feature extraction model realizes multi-angle extraction of the deep semantic information of the song audio data, so that the obtained high-dimensional index vector has stronger representation capability, for example representing the global feature information, salient feature information, channel feature information and frequency-band feature information of the song audio data. This enables more effective indexing of the corresponding song audio data; when downstream processing such as retrieval, query and matching of song audio data is carried out on this basis, more accurate and efficient matching can be obtained, so that the feature extraction model can be generally used in multiple application scenarios such as cover-song recognition, song recognition by listening, humming recognition, and song infringement judgment.
In addition, the output feature vectors obtained by the multiple branch networks of the feature extraction model can be combined into a single high-dimensional index vector or used independently as separate high-dimensional index vectors, the choice being determined flexibly by the deep semantic information required; the application range is therefore wide and the usage flexible. When representation learning of massive song audio data is processed, a relatively obvious scale effect can be obtained; the model can be deployed in the background of an online music service platform to realize a standardized interface, meeting the requirements of various application scenarios, providing comprehensive and multipurpose open services, and improving the economic advantage of the platform's music information retrieval.
In a further embodiment, the step S2200 of converting the intermediate feature information into output feature vectors of corresponding scales, after performing feature extraction of different scales on it with the plurality of convolution blocks in more than two branch networks of the feature extraction model, includes any two or more of the following steps:
step S2210, performing feature extraction on the intermediate feature information by using the plurality of convolution blocks in the first branch network to obtain global feature information, and pooling the global feature information into output feature vectors of the global scale:
in the first branch network exemplarily shown in fig. 8, after feature extraction is performed on the intermediate feature information stage by stage by two convolution blocks with the same structure as those of the other branch networks, the output of the last convolution block is divided into two paths: one path is directly subjected to a mean pooling operation to obtain the overall feature information, and the other path passes through a Dropout layer that randomly discards part of the time-frequency region information and then extracts the globally salient feature information by a max pooling operation, so that two global output feature vectors are output correspondingly. With this structure, during model training, on one hand the generalization capability of the model to audio with local time-frequency-domain changes in the song audio data, such as segment deletion and segment insertion, is improved, and on the other hand over-fitting of the model is prevented to a certain extent. In addition, of the two global output feature vectors, one captures the overall features and the other captures the salient features, which improves the discrimination capability of the model.
Step S2220, performing feature extraction on the intermediate feature information by using the plurality of convolution blocks in the second branch network, then dividing the result into a plurality of parts by channel and pooling them, so as to obtain output feature vectors of the channel scale correspondingly:
since the feature information output by each convolution block is usually expressed as "number of channels × number of frequency bands × number of frames", the division can be performed by the number of channels. In the second branch network exemplarily shown in fig. 8, after feature extraction is performed on the intermediate feature information stage by stage by two convolution blocks with the same structure as those of the other branch networks, the output of the last convolution block is divided into multiple paths, for example two paths, each of which is passed through a 1 × 1 convolution layer and then mean-pooled, so as to obtain the channel output feature information corresponding to the two paths. In this process, the two channel branches focus on capturing local features of the audio; for audio in which a large amount of information is overwhelmed by strong noise or other interfering sounds, feature representations can still be built from a few locally salient common features.
Step S2230, using a plurality of convolution blocks in the third branch network to perform feature extraction on the intermediate feature information, and then dividing the intermediate feature information into a plurality of parts according to the frequency band to perform pooling, so as to correspondingly obtain an output feature vector of the frequency band scale:
in the third branch network exemplarily shown in fig. 8, after feature extraction is performed on the intermediate feature information stage by stage by two convolution blocks with the same structure as those of the other branch networks, the output of the last convolution block is divided by frequency band into multiple paths, for example two paths, each of which is mean-pooled to obtain the frequency-band output feature information corresponding to the two frequency bands. In this process, each frequency-band branch is dedicated to extracting the feature information of its own frequency band, which has a notable effect in resisting the band-selective attenuation of a poor pickup environment, balancing the contributions of high-frequency and low-frequency information to the feature composition, and resisting content addition and deletion in a fixed frequency band (such as adding or removing a drumbeat) or strong interference within a fixed frequency-band range.
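The three pooling heads of steps S2210 to S2230 could be sketched as follows; the dropout rate, the number of parts, and the omission of the 1 × 1 convolution in the channel head are simplifying assumptions rather than the exact configuration of fig. 8.

import torch
import torch.nn.functional as F

def global_head(feat):                     # feat: (batch, channels, bands, frames)
    avg = F.adaptive_avg_pool2d(feat, 1).flatten(1)                    # overall summary
    sal = F.adaptive_max_pool2d(F.dropout(feat, 0.3), 1).flatten(1)    # salient summary
    return avg, sal

def channel_head(feat, parts: int = 2):
    chunks = torch.chunk(feat, parts, dim=1)          # split along the channel dimension
    return [F.adaptive_avg_pool2d(c, 1).flatten(1) for c in chunks]

def band_head(feat, parts: int = 2):
    chunks = torch.chunk(feat, parts, dim=2)          # split along the frequency-band dimension
    return [F.adaptive_avg_pool2d(c, 1).flatten(1) for c in chunks]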
It can be understood that a plurality of output feature vectors obtained in the same branch network may be further processed into the same output feature vector through concatenation or mean pooling, for which, those skilled in the art may flexibly implement the method.
In this embodiment, feature information of multiple aspects and multiple scales is extracted from the song audio data through the rich set of branch networks, so that the obtained output feature vectors carry a rich representation of deep semantic information: they represent the global information and the salient information of the song audio data, and represent its relevant local information by channel and by frequency band; moreover, because of the shared network, the preceding intermediate feature information already captures the key information of the song audio data. The embodiment thus realizes an index representation of the song audio data from multiple aspects, and the precision in each aspect can be improved when the resulting high-dimensional index vectors are later used for query, retrieval and matching.
Because the embodiment can capture deep semantic information of the song audio data in multiple aspects, the method is particularly suitable for feature extraction of the song audio data with relatively large data volume, is particularly suitable for application scenes of long song processing, and can achieve a more accurate matching effect for the application scenes.
Referring to fig. 9, a network structure of the feature extraction model of the present application improved on the basis of the previous embodiment is shown. The difference between the network architecture of fig. 9 and that of fig. 8 is that, in fig. 9, the output of the last convolution block of the first branch network is directly max-pooled to obtain a global output feature vector, capturing the salient feature information of the coding information of the song audio data; in the second branch network, the output of the last convolution block is divided equally into four parts along the channel dimension, the feature information of each part is mean-pooled separately and then re-spliced into the corresponding output feature vector, and through this division and construction of local branches the obtained output feature vector can learn better local feature information.
Similarly, as an equivalent alternative embodiment of dividing the output of the last convolution block according to the channel in fig. 9, the output of the last convolution block may also be divided according to the frequency band dimension instead of the channel to obtain the feature information corresponding to four sub-bands, and then the average pooling process is performed, and the feature information is re-spliced into the corresponding output feature vector.
This embodiment exemplarily presents a relatively lightweight modification of the network architecture shown in fig. 8. It is not difficult to see that the inventive spirit of the present application lies in the flexible combined use of a plurality of the described branch networks. Based on the principles disclosed herein, those skilled in the art can, according to the characteristics of the multi-scale deep semantic information carried by the output feature vectors of each branch network, select feature extraction models built from different combinations of branch networks to derive various other embodiments of the present application adapted to specific applications, so as to satisfy requirements such as humming recognition, song recognition by listening, and cover-song recognition.
Referring to fig. 10, in an optimized embodiment, when a convolution block in the feature extraction model is called, the following steps are performed:
step S3100, performing convolution transformation on the input information to obtain transformation characteristic information:
in any convolution block of the feature extraction model, the information input to it, whether the coding information or the intermediate feature information output by the previous convolution block, is first subjected to a convolution operation with a 1 × 1 convolution kernel to obtain the corresponding transformation feature information.
Step S3200, combining the transformation feature information after instance normalization and batch normalization are respectively performed on it to form splicing feature information, and activating and outputting the splicing feature information:
after this first convolution, an instance-batch normalization (IBN) structure is applied to process the transformation feature information: the transformation feature information is divided into two paths, a batch normalization layer (BN) performs batch normalization on half of the channels, and an instance normalization layer (IN) performs instance normalization on the remaining channels, so that the corresponding convolution block can capture the style-invariant features of the song audio data and thus make better use of song representations whose styles are highly diversified within a single piece of data. After the different normalization processes, the two paths are spliced into the same splicing feature information, which is then activated and output.
Step S3300, obtaining residual information after performing convolution operations and batch normalization processing on the activated and output splicing feature information multiple times:
the activated and output splicing feature information is subjected to convolution operations by several convolution layers for further feature extraction, the output of each convolution layer is normalized by a batch normalization layer, and the last convolution layer is implemented with a 1 × 1 convolution kernel so as to avoid attenuation of the representation learning capability of the whole feature extraction model after repeated instance normalization across multiple convolution blocks. The feature information finally output in this way is the residual information of the residual convolution process.
Step S3400, superimposing the residual information onto the input information, and activating and outputting the result.
Finally, according to the residual convolution principle, and with reference to the transformation feature information obtained by the first convolution, the residual information is superimposed and the result is activated and output, yielding the intermediate feature information output by the current convolution block after its residual convolution operation.
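A hedged sketch of such a residual convolution block with the IBN split; the kernel sizes and the choice of the block input as the skip connection (step S3400 above, the preceding paragraph alternatively referring to the first-convolution output) are assumptions.

import torch
import torch.nn as nn

class IBNResidualBlock(nn.Module):
    # 1 x 1 convolution, half-instance-norm / half-batch-norm split, further
    # convolutions with batch normalization, and a residual connection.
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.inorm = nn.InstanceNorm2d(half, affine=True)
        self.bnorm = nn.BatchNorm2d(channels - half)
        self.conv_mid = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        t = self.conv_in(x)
        a, b = torch.split(t, [self.inorm.num_features, self.bnorm.num_features], dim=1)
        t = self.act(torch.cat([self.inorm(a), self.bnorm(b)], dim=1))
        residual = self.conv_mid(t)
        # Skip connection taken from the block input per step S3400.
        return self.act(x + residual)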
In this embodiment, the convolution blocks required by the feature extraction model are constructed by combining residual convolution with the instance-batch normalization operation; the residual convolution network is improved on the basis of a Resnet-series base model while the IBN structure is superimposed, so that the constructed feature extraction model is easier to train, achieves a more accurate feature extraction effect, and is particularly suitable for feature extraction of song audio data.
Referring to fig. 11, in an expanded embodiment, the training process of the feature extraction model includes the following steps of iterative training:
step S4100, calling a training sample from the training set, and determining the coding information of the training sample, wherein the training sample is pre-collected song audio data, and the song audio data is a complete song or a fragment thereof:
those skilled in the art will appreciate that different training sets for training the feature extraction model may be constructed, each training set containing a sufficient number of training samples, each training sample being pre-provisioned with a corresponding supervised label, to accommodate different downstream tasks.
The training samples can be collected in advance by a person skilled in the art; each training sample is a piece of song audio data suited to a different downstream task, and the song audio data may be a complete song, a MIDI melody fragment of a song, a song with accompaniment, a song with a vocal singing part but without an accompaniment part, a song fragment without the melody part, a song fragment with the melody part, and the like. Different sung versions of the same song can be grouped into the same category, i.e., mapped to the same supervision label, so as to enhance the generalization capability of the model across categories. When the song audio data of a training sample is too long, it can be segmented into a plurality of song segments according to a certain preset duration and used as a plurality of training samples associated with the same supervision label for training. When segmenting a song, the timestamps of the lyrics of the song may be referenced so that the song segments are split on the basis of one or more complete lyric lines.
For training samples in the training set, for the convenience of model training, the encoding information corresponding to the song audio data may be prepared in advance, or the encoding information corresponding to the song audio data may be obtained by real-time encoding when each song audio data is called for training the feature extraction model. For specific encoding principles, the processing may be performed by referring to the corresponding procedures disclosed in the foregoing of the present application.
Step S4200, inputting the encoding information into the feature extraction model, and performing training on the encoding information to obtain corresponding output feature vectors:
in the training process of a training sample, the coding information corresponding to the training sample is output to the feature extraction model for feature extraction, and the feature extraction principle refers to the description of the principle of the feature extraction model in the previous embodiments, which is omitted here for brevity. In the process, the expression learning of the training samples is realized by the feature extraction model, and each corresponding output feature vector is obtained.
Step S4300, respectively carrying out classification prediction on each output feature vector to map out a corresponding classification label:
in the present application, the training task of the feature extraction model is understood as a classification task, and therefore, the training of the model can be implemented by accessing each path of output feature vectors of the feature extraction model into a corresponding prepared classification model, investigating classification results of each classification model, and supervising by using a corresponding supervision label. Based on the principle, in the training stage, when the feature extraction model implemented in any embodiment of the present application is trained, one classification model is connected to each output feature vector output end of each branch network.
The classification model adopts the structure shown in fig. 12, a batch normalization layer is adopted to perform batch normalization operation on the output characteristic vectors, then the output characteristic vectors are mapped to a classification space through a full connection layer, and the classification probability corresponding to each classification label is calculated through a classification function, so that the classification label with the maximum classification probability is determined as the classification label corresponding to the training sample.
The classifier in the classification model can be constructed as a multi-classifier implemented with the Softmax function, or as a multi-classifier implemented with the AM-Softmax function, which enhances intra-class compactness and enlarges inter-class separation; the latter obviously has better classification advantages.
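A sketch of such a classification head, with a batch normalization neck followed by an AM-Softmax-style normalized fully connected layer; the scale s and margin m are conventional values, not ones fixed by this application.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    def __init__(self, dim: int, num_classes: int, s: float = 30.0, m: float = 0.35):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.s, self.m = s, m

    def forward(self, feat, labels=None):
        x = F.normalize(self.bn(feat), dim=1)
        w = F.normalize(self.weight, dim=1)
        cos = x @ w.t()                       # cosine similarity to each class centre
        if labels is None:
            return self.s * cos
        # Subtract the additive margin from the target-class cosine (AM-Softmax logits).
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        return self.s * (cos - margin)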
Step S4400, calculating a loss value of the feature extraction model by using the supervision labels and the classification labels corresponding to the training samples, and performing gradient updating on the feature extraction model according to the loss value:
in the classification model, by adopting the batch normalization layer, a balance between the triplet loss and the cross-entropy classification loss is realized: the triplet loss can be calculated through the batch normalization layer, the cross-entropy classification loss can be calculated through the fully connected layer, and the combination of the two optimizes the output feature vectors.
Therefore, after the corresponding classification label is predicted from the training sample, the loss value between the supervision label and the classification label can be calculated according to the corresponding supervision label, then the gradient updating is carried out on the feature extraction model according to the loss value, the weight parameter of each link of the whole model is corrected, and the model is promoted to be converged.
Because there are multiple branch networks, each of which may output multiple output feature vectors, and there are therefore multiple classification models, a weighting scheme may be adopted when calculating the loss value: the triplet loss and the classification loss in each classification model are first weighted and summed to obtain the loss value corresponding to each output feature vector, the loss values corresponding to the individual output feature vectors are then weighted and summed to obtain the final loss value, and the gradient update of the whole feature extraction model is performed with this loss value.
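A minimal sketch of this weighted combination, assuming each output feature vector already has its triplet and cross-entropy losses computed; all weights default to 1.0 as an assumption.

def total_loss(per_vector_losses: list, weights=None):
    # per_vector_losses: list of (triplet_loss, cross_entropy_loss) pairs,
    # one pair per output feature vector.
    weights = weights or [1.0] * len(per_vector_losses)
    loss = 0.0
    for w, (tri, ce) in zip(weights, per_vector_losses):
        loss = loss + w * (tri + ce)
    return loss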
Step S4500, judging whether the loss value reaches a preset threshold value, and when the loss value does not reach the preset threshold value, calling the next training sample in the training set to continue to carry out iterative training on the feature extraction model until the loss value reaches the preset threshold value:
whether the loss value computed for each training sample approaches zero, or whether it reaches a preset threshold, is judged. When this judgment condition is met, the feature extraction model is considered to have been trained to a converged state, so the training of the model is terminated and the feature extraction model is put into the production stage, for example to perform feature extraction on songs in a song library or to serve other downstream tasks. If the converged state has not been reached, the next training sample in the training set is called and iterative training of the feature extraction model continues until the model is trained to the converged state.
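Putting the previous steps together, the iterative loop might look like the following sketch, which reuses the combined_loss helper sketched above; the loss threshold, the epoch cap, the optimizer, and the assumption that the model returns one feature vector per branch output are illustrative choices rather than requirements of this application:

```python
def train_until_converged(model, heads, loader, optimizer, triplet_loss,
                          loss_threshold=1e-3, max_epochs=100):
    for _ in range(max_epochs):
        for batch, labels in loader:
            feats = model(batch)                               # list of output feature vectors
            per_output = [(f, head(f, labels)) for f, head in zip(feats, heads)]
            loss = combined_loss(per_output, labels, triplet_loss)
            optimizer.zero_grad()
            loss.backward()                                    # gradient update of the whole model
            optimizer.step()
            if loss.item() <= loss_threshold:                  # convergence criterion of step S4500
                return model
    return model
```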
This embodiment discloses the training principle and the training process of the feature extraction model of the present application. As can be seen from this embodiment, by training the feature extraction model with the prepared training set, the model learns the ability to extract the corresponding output feature vectors from the coded information of song audio data, so that effective representation learning of the deep semantic information of the song audio data is achieved. In addition, the output feature vectors of multiple scales of the same song audio data can be trained jointly, so the training efficiency is higher and the functions of the model are richer, and when the feature extraction model is put into the production stage, the deep semantic information corresponding to the multiple scales of the same song audio data can be obtained quickly.
The classification model of this embodiment adopts a multi-classifier implemented with batch normalization and the AM-Softmax function, so gradient updating of the model can be performed by balancing the triplet loss and the classification loss, the model can be trained to convergence more quickly, and the trained model can perform more effective representation learning of the deep semantic information of song audio data. When the output feature vectors are combined and used as needed, the feature information of the song audio data can be represented more effectively, achieving a more efficient matching effect.
This embodiment also shows the scalability and compatibility of the feature extraction model at the application level. Specifically, by training the feature extraction model with training samples corresponding to different downstream tasks, the model can acquire the ability to serve those different downstream tasks as required; this embodiment therefore constitutes a relatively fundamental improvement and has better economic utility.
Referring to fig. 13, a song search apparatus provided in the present application, adapted for functional deployment of the song search method of the present application, includes: a song coding module 1100, a semantic extraction module 1200, a similarity matching module 1300 and a screening and pushing module 1400. The song coding module 1100 is used for acquiring coding information corresponding to a song to be searched submitted by a client; the semantic extraction module 1200 is configured to extract, according to the coding information, a high-dimensional index vector representing deep semantic information of multiple scales of the song to be searched by using a feature extraction model trained to a convergence state; the similarity matching module 1300 is configured to calculate the similarity between that high-dimensional index vector and the high-dimensional index vectors, extracted by the feature extraction model and stored in a preset song feature library, that represent the deep semantic information of multiple scales of each candidate song, so as to obtain a similarity sequence; the screening and pushing module 1400 is configured to screen out the target song whose similarity value exceeds a preset threshold and is the maximum similarity in the similarity sequence, construct an access link corresponding to the target song, and push the access link to the client device.
In a further embodiment, the song encoding module 1100 includes: the request analysis submodule is used for receiving a song search request submitted by a client and acquiring audio data of a song to be searched, wherein the song to be searched is specified by the request; the voice detection submodule is used for detecting whether the audio data contains voice singing information or not, and if not, the subsequent execution is stopped; and the coding processing submodule is used for coding the audio data of the song to be searched to obtain corresponding coding information.
In a further embodiment, the similarity matching module 1300 includes: a candidate calling submodule, used for calling a preset song feature library to obtain the high-dimensional index vector corresponding to each candidate song, the high-dimensional index vector being a single high-dimensional vector that integrally represents the deep semantic information of different scales of one candidate song; a unified calculation submodule, used for respectively calculating the similarity between the high-dimensional index vector of the song to be searched and each high-dimensional index vector in the song feature library to obtain a corresponding similarity sequence, the similarity sequence storing a similarity value corresponding to each candidate song in the song feature library; and a sorting processing submodule, used for sorting the similarity sequence in reverse order according to the similarity values and outputting the sorted similarity sequence.
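A minimal sketch of this unified matching is shown below, assuming cosine similarity over NumPy arrays; the application does not prescribe a particular similarity measure, so the metric and the helper names are illustrative only:

```python
import numpy as np

def rank_candidates(query_vec, library_vecs, song_ids):
    """Rank candidate songs in a feature library against one query index vector."""
    q = query_vec / np.linalg.norm(query_vec)
    lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
    sims = lib @ q                                   # one similarity value per candidate song
    order = np.argsort(-sims)                        # reverse (descending) sorting
    return [(song_ids[i], float(sims[i])) for i in order]
```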
In another further embodiment, the similarity matching module 1300 includes: a candidate calling submodule, used for calling a preset song feature library to obtain the high-dimensional index vectors corresponding to each candidate song, the high-dimensional index vectors being a plurality of high-dimensional vectors that separately represent the deep semantic information of each scale of one candidate song; a scale calculation submodule, used for calculating, for the high-dimensional vectors corresponding to different semantic scales, the similarity between each high-dimensional vector of the song to be searched and the high-dimensional vector of the corresponding semantic scale of each candidate song in the preset song feature library, obtaining a similarity sequence corresponding to each semantic scale; a scale summarizing submodule, used for summarizing and merging the similarity values in the similarity sequences corresponding to the various semantic scales according to the correspondence of the high-dimensional vectors on the semantic scales, to obtain a summarized similarity sequence; and a sorting processing submodule, used for sorting the summarized similarity sequence in reverse order according to the similarity values and outputting the sorted similarity sequence.
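The per-scale variant can be sketched as follows; merging by simple averaging over scales is an assumption, since the application only requires that the values belonging to the same candidate song be summarized and merged:

```python
import numpy as np

def rank_candidates_multiscale(query_by_scale, library_by_scale, song_ids):
    """query_by_scale / library_by_scale: dicts keyed by semantic scale."""
    merged = np.zeros(len(song_ids))
    for scale, q in query_by_scale.items():
        lib = library_by_scale[scale]                # matrix: one row per candidate song
        q = q / np.linalg.norm(q)
        lib = lib / np.linalg.norm(lib, axis=1, keepdims=True)
        merged += lib @ q                            # accumulate per-song similarity for this scale
    merged /= len(query_by_scale)                    # average over semantic scales (assumed merge rule)
    order = np.argsort(-merged)
    return [(song_ids[i], float(merged[i])) for i in order]
```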
In yet another embodiment, the song encoding module 1100 includes: the request analysis submodule is used for receiving a song search request submitted by a client and acquiring audio data of a song to be searched, wherein the song to be searched is specified by the request; the detection and segmentation submodule is used for detecting whether the audio data of the song to be searched exceeds a preset time length, if so, segmenting the audio data into a plurality of audio data corresponding to a plurality of song segments according to the preset time length, and otherwise, keeping the audio data as the whole segment of audio data; and the segmented coding submodule is used for coding each segment of audio data respectively to obtain coding information corresponding to each segment of audio data so that the feature extraction model can extract each high-dimensional index vector corresponding to each song segment respectively.
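The detection and segmentation step can be sketched as follows; the 30-second preset duration is purely an assumed value for illustration:

```python
def split_audio(samples, sample_rate, preset_seconds=30):
    """Split audio exceeding a preset duration into segments of that duration."""
    limit = preset_seconds * sample_rate
    if len(samples) <= limit:
        return [samples]                             # keep the whole piece of audio
    return [samples[i:i + limit] for i in range(0, len(samples), limit)]
```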
In yet another further embodiment, the similarity matching module 1300 includes: a candidate calling submodule, used for calling a preset song feature library to obtain the high-dimensional index vector corresponding to each song segment of each candidate song, the high-dimensional index vector being a single high-dimensional vector that integrally represents the deep semantic information of different scales of one song segment; a segmentation calculation submodule, used for respectively calculating the similarity between the high-dimensional index vector of each piece of audio data and the corresponding high-dimensional index vectors of each candidate song in the preset song feature library, to obtain a similarity sequence corresponding to each piece of audio data; a segmentation and summarization submodule, used for summarizing and merging the similarity values in all the similarity sequences by the same candidate song, to obtain a summarized similarity sequence; and a sorting processing submodule, used for sorting the summarized similarity sequence in reverse order according to the similarity values and outputting the sorted similarity sequence.
In a preferred embodiment, the feature extraction model is implemented as a structure comprising: a shared extraction module, configured to sequentially perform multi-level feature extraction on the coding information by means of a plurality of convolution blocks in a shared network of the feature extraction model trained to the convergence state, to obtain intermediate feature information of the deep semantic information of the coding information; a branch extraction module, configured to extract features of different scales from the intermediate feature information by means of a plurality of convolution blocks in each of two or more branch networks of the feature extraction model, and then convert them into output feature vectors of the corresponding scales, the deep semantic information contained in the output feature vectors of the different branch networks being different; and a processing output module, configured to output the output feature vector of each branch network as the high-dimensional index vector.
In a further embodiment, the branch extraction module includes any two or more of the following modules: a first extraction submodule, configured to perform feature extraction on the intermediate feature information by means of a plurality of convolution blocks in a first branch network to obtain global feature information, the global feature information being pooled into an output feature vector of the global scale; a second extraction submodule, configured to perform feature extraction on the intermediate feature information by means of a plurality of convolution blocks in a second branch network, then divide the result into a plurality of parts by channel and pool them, correspondingly obtaining output feature vectors of the channel scale; and a third extraction submodule, configured to perform feature extraction on the intermediate feature information by means of a plurality of convolution blocks in a third branch network, then divide the result into a plurality of parts by frequency band and pool them, correspondingly obtaining output feature vectors of the frequency band scale.
In a preferred improved embodiment, when performing the pooling operation, the first branch network performs a mean pooling operation and/or a maximum pooling operation to obtain one or two output feature vectors of the global scale; when the second branch network performs the pooling operation, adopting a mean pooling operation for a single or a plurality of channels to correspondingly obtain one or a plurality of output feature vectors of the channel scale; and when the third branch network performs the pooling operation, adopting an average pooling operation for a single frequency band or a plurality of frequency bands to correspondingly obtain one or a plurality of output feature vectors of the frequency band scale.
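A minimal sketch of this multi-branch, multi-scale pooling structure is given below, assuming PyTorch and intermediate features shaped (batch, channels, frequency bands, frames); the layer sizes and the numbers of channel and band parts are assumptions, not values fixed by this application:

```python
import torch
import torch.nn as nn

class MultiScaleBranches(nn.Module):
    """Three branch networks pooling at global, channel, and frequency-band scales."""
    def __init__(self, channels=256, n_channel_parts=4, n_band_parts=4):
        super().__init__()
        self.global_branch = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.channel_branch = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.band_branch = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.n_channel_parts, self.n_band_parts = n_channel_parts, n_band_parts

    def forward(self, feat):
        out = []
        g = self.global_branch(feat)
        out.append(g.mean(dim=(2, 3)))                           # global-scale vector (mean pooling)
        out.append(g.amax(dim=(2, 3)))                           # global-scale vector (max pooling)
        c = self.channel_branch(feat)
        for part in torch.chunk(c, self.n_channel_parts, dim=1):  # split by channel
            out.append(part.mean(dim=(2, 3)))                    # one channel-scale vector per part
        b = self.band_branch(feat)
        for part in torch.chunk(b, self.n_band_parts, dim=2):     # split by frequency band
            out.append(part.mean(dim=(2, 3)))                    # one band-scale vector per part
        return out                                               # list of output feature vectors
```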
In an optional embodiment, the source of the encoded information is any one of time-frequency spectrum information, mel-frequency spectrum information, CQT filtering information, sound level profile information, and Chroma characteristic information of corresponding audio data.
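For illustration, such encoding information can be derived from raw audio along the lines of the following sketch using the librosa library; the sampling rate and the number of Mel bands are assumed parameter values:

```python
import librosa

def encode_audio(path, sr=22050):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # Mel spectrum information
    cqt = abs(librosa.cqt(y, sr=sr))                               # CQT filtering information
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)                # Chroma feature information
    return mel, cqt, chroma
```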
In a preferred embodiment, in the shared network, at least one of the convolution blocks applies an attention module for extracting key information in song audio data, and the attention module is a spatial attention module or a channel attention module.
In an optimized embodiment, the convolution block is implemented as a structure comprising: an initial convolution unit, used for performing convolution transformation on the information input into it to obtain transformation feature information; a normalization processing unit, used for combining the transformation feature information into spliced feature information after performing instance normalization and batch normalization processing respectively, and activating and outputting the spliced feature information; a residual calculation unit, used for obtaining residual information after performing multiple convolution operations and batch normalization processing on the activated and output spliced feature information; and an activation output unit, used for superimposing the residual information onto the information input into it and activating the output.
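The following is an illustrative sketch of such a convolution block in PyTorch; splitting the channels evenly between instance normalization and batch normalization, the kernel sizes, and the two-convolution residual path are assumptions consistent with the description rather than the exact patented structure:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.initial = nn.Conv2d(in_ch, out_ch, 3, padding=1)        # initial convolution unit
        self.inorm = nn.InstanceNorm2d(half, affine=True)            # instance normalization branch
        self.bnorm = nn.BatchNorm2d(out_ch - half)                   # batch normalization branch
        self.act = nn.ReLU(inplace=True)
        self.residual = nn.Sequential(                               # residual calculation unit
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        t = self.initial(x)                                          # transformation feature information
        half = self.inorm.num_features
        a, b = t[:, :half], t[:, half:]
        t = self.act(torch.cat([self.inorm(a), self.bnorm(b)], dim=1))  # spliced feature information
        r = self.residual(t)                                         # residual information
        return self.act(t + r)                                       # superimpose the residual and activate
```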
In an extended embodiment, the feature extraction model is placed in a training task implemented, for iterative training, by a structure comprising: a sample calling module, used for calling a training sample from a training set and determining the coding information of the training sample, the training sample being pre-collected song audio data, and the song audio data being a complete song or a fragment thereof; a representation learning module, used for inputting the coding information into the feature extraction model for training to obtain the corresponding output feature vectors; a classification prediction module, used for performing classification prediction on each output feature vector respectively so as to map out the corresponding classification labels; a loss calculation module, used for calculating a loss value of the feature extraction model by using the supervision label corresponding to the training sample and the classification labels, and performing gradient updating on the feature extraction model according to the loss value; and an iteration judgment module, used for judging whether the loss value reaches a preset threshold, and when the loss value does not reach the preset threshold, calling the next training sample in the training set to continue the iterative training of the feature extraction model until the loss value reaches the preset threshold.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, whose internal structure is schematically illustrated in fig. 14. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable storage medium of the computer device stores an operating system, a database and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, can cause the processor to implement a song searching method. The processor of the computer device provides computing and control capability and supports the operation of the whole computer device. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform the song search method of the present application. The network interface of the computer device is used for connecting and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 14 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module and its submodules in fig. 13, and the memory stores the program code and the various data required for executing these modules or submodules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores the program code and data required for executing all the modules/submodules of the song search apparatus of the present application, and the server can call the program code and data to execute the functions of all the submodules.
The present application also provides a storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the song search method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium; when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or another computer-readable storage medium.
In summary, the present application obtains the high-dimensional index vector through representation learning of the multi-scale deep semantic information of song audio data by means of the feature extraction model, and matches similar songs on the basis of the high-dimensional index vector. This makes the query, retrieval and matching of songs more accurate and efficient, can serve various downstream tasks such as listen-to-identify, humming recognition and cover-song recognition, and improves the comprehensive service capability of an online music platform.
Those skilled in the art will appreciate that the various operations, methods, and steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or deleted. Further, other steps, measures, or schemes in the various operations, methods, or flows discussed in this application, as well as steps, measures, or schemes in the prior art having the various operations, methods, or procedures disclosed in the present application, may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing is only a part of the embodiments of the present application. It should be noted that, for those skilled in the art, several modifications and refinements can be made without departing from the principle of the present application, and these modifications and refinements should also be regarded as falling within the protection scope of the present application.

Claims (17)

1. A song search method is characterized by comprising the following steps:
acquiring coding information corresponding to a song to be searched, which is submitted by a client;
extracting a high-dimensional index vector representing deep semantic information of the song to be searched in multiple scales according to the coding information by adopting a feature extraction model trained to a convergence state;
calculating the similarity between the high-dimensional index vector and the high-dimensional index vector of deep semantic information which is extracted by the feature extraction model and represents multiple scales of each candidate song in a preset song feature library to obtain a similarity sequence;
and screening and determining the target song with the similarity value exceeding a preset threshold and being the maximum similarity in the similarity sequence, constructing an access link corresponding to the target song and pushing the access link to the client device.
2. The song search method according to claim 1, wherein the obtaining of the encoded information corresponding to the song to be searched submitted by the client comprises the following steps:
receiving a song search request submitted by a client, and acquiring audio data of a song to be searched, which is specified by the request;
detecting whether the audio data contains voice singing information or not, and if not, terminating subsequent execution;
and coding the audio data of the song to be searched to obtain corresponding coding information.
3. The song search method according to claim 1, wherein the calculating of the similarity between the high-dimensional index vector and the high-dimensional index vector of the deep semantic information extracted by the feature extraction model from the preset song feature library and representing multiple scales of each candidate song obtains a similarity sequence, and comprises the following steps:
calling a preset song feature library to obtain a high-dimensional index vector corresponding to each candidate song, wherein the high-dimensional index vector is a single high-dimensional vector which integrally represents deep semantic information of the candidate song at different scales;
respectively calculating the similarity between the high-dimensional index vector of the song to be searched and each high-dimensional index vector in the song feature library to obtain a corresponding similarity sequence; storing a similarity numerical value corresponding to each candidate song in the song feature library in the similarity sequence;
and performing reverse sequencing on the similarity sequence according to the similarity value to obtain the sequenced similarity sequence output.
4. The song search method according to claim 1, wherein the calculating of the similarity between the high-dimensional index vector and the high-dimensional index vector of the deep semantic information extracted by the feature extraction model from the preset song feature library and representing multiple scales of each candidate song obtains a similarity sequence, and comprises the following steps:
calling a preset song feature library to obtain high-dimensional index vectors corresponding to each candidate song, wherein the high-dimensional index vectors are a plurality of high-dimensional vectors which dispersedly represent deep semantic information of various scales of one candidate song;
calculating the similarity between the high-dimensional vector of the song to be searched and the corresponding semantic scale high-dimensional vector of each candidate song in a preset song feature library according to the high-dimensional vectors corresponding to different semantic scales to obtain a similarity sequence corresponding to each semantic scale;
according to the corresponding relation of the high-dimensional vector on the semantic scale, summarizing and combining the similarity numerical values in the similarity sequence corresponding to various semantic scales to obtain a summarized similarity sequence;
and performing reverse sequencing on the summarizing similarity sequence according to the similarity value to obtain the sequenced similarity sequence for output.
5. The song search method according to claim 2, wherein the obtaining of the encoded information corresponding to the song to be searched submitted by the client comprises the following steps:
receiving a song search request submitted by a client, and acquiring audio data of a song to be searched, which is specified by the request;
detecting whether the audio data of the song to be searched exceeds a preset time length, if so, dividing the audio data into a plurality of audio data corresponding to a plurality of song segments according to the preset time length, otherwise, keeping the audio data as the whole segment of audio data;
and respectively coding each section of audio data to obtain coding information corresponding to each section of audio data so as to enable the feature extraction model to respectively extract each high-dimensional index vector corresponding to each song segment.
6. The song search method according to claim 5, wherein the calculating of the similarity between the high-dimensional index vector and the high-dimensional index vector of the deep semantic information extracted by the feature extraction model from the preset song feature library and representing multiple scales of each candidate song to obtain a similarity sequence comprises the following steps:
calling a preset song feature library to obtain a high-dimensional index vector corresponding to each song segment of each candidate song, wherein the high-dimensional index vector is a single high-dimensional vector integrally representing deep semantic information of the song segments with different scales;
respectively calculating the similarity between the high-dimensional index vector of each piece of audio data and the corresponding high-dimensional index vector of each candidate song in a preset song feature library to obtain a similarity sequence corresponding to each piece of audio data;
summarizing and combining the similarity values in all the similarity sequences according to the same candidate songs to obtain summarized similarity sequences;
and performing reverse sequencing on the summarizing similarity sequence according to the similarity value to obtain the sequenced similarity sequence for output.
7. The song search method according to any one of claims 1 to 6, wherein when the feature extraction model is invoked, the following steps are performed:
sequentially performing multi-stage feature extraction on the coding information by adopting a plurality of convolution blocks in a shared network in a feature extraction model trained to a convergence state to obtain intermediate feature information of deep semantic information of the coding information;
after feature extraction of different scales is carried out on the intermediate feature information by adopting a plurality of convolution blocks in more than two branch networks in the feature extraction model, the intermediate feature information is converted into output feature vectors of corresponding scales, and deep semantic information contained in the output feature vectors of all the branch networks is different;
and outputting the output feature vector of each branch network as the high-dimensional index vector by the feature extraction model.
8. The song search method according to claim 7, wherein the step of converting the intermediate feature information into output feature vectors of corresponding scales after performing feature extraction of different scales on the intermediate feature information by using a plurality of convolution blocks in more than two branch networks in the feature extraction model comprises any two or more of the following steps:
adopting a plurality of convolution blocks in a first branch network to perform feature extraction on the intermediate feature information to obtain global feature information, and pooling the global feature information into output feature vectors of a global scale;
adopting a plurality of convolution blocks in a second branch network to extract the characteristics of the intermediate characteristic information, and then dividing the intermediate characteristic information into a plurality of parts according to channels for pooling, so that output characteristic vectors of channel scales are obtained correspondingly;
and adopting a plurality of convolution blocks in a third branch network to extract the characteristics of the intermediate characteristic information, and then dividing the intermediate characteristic information into a plurality of parts according to the frequency band for pooling so as to correspondingly obtain an output characteristic vector of the frequency band scale.
9. The song search method of claim 8, wherein:
when the first branch network executes the pooling operation, adopting mean pooling and/or maximum pooling operation to correspondingly obtain one or two output feature vectors of the global scale;
when the second branch network performs the pooling operation, adopting a mean pooling operation for a single or a plurality of channels to correspondingly obtain one or a plurality of output feature vectors of the channel scale;
and when the third branch network performs the pooling operation, adopting an average pooling operation for a single frequency band or a plurality of frequency bands to correspondingly obtain one or a plurality of output feature vectors of the frequency band scale.
10. The song search method of claim 1, wherein the source of the encoded information is any one of time-frequency spectrum information, mel-frequency spectrum information, CQT filtering information, sound level profile information, and Chroma feature information of the corresponding audio data.
11. The song search method of claim 7, wherein at least one of the convolution blocks in the shared network employs an attention module for extracting key information in song audio data, wherein the attention module is a spatial attention module or a channel attention module.
12. The song search method of claim 7, wherein the convolution block, when invoked, performs the steps of:
carrying out convolution transformation on the input information to obtain transformation characteristic information;
combining the transformation characteristic information into splicing characteristic information after respectively carrying out instance normalization and batch normalization processing, and activating and outputting the splicing characteristic information;
carrying out convolution operation for multiple times and batch normalization processing on the activated and output splicing characteristic information to obtain residual error information;
and superimposing the residual information onto the information input thereto and activating the output.
13. The song search method of claim 7, wherein the training process of the feature extraction model comprises the following steps of iterative training:
calling a training sample from a training set, and determining the coding information of the training sample, wherein the training sample is pre-collected song audio data which is a complete song or a fragment thereof;
inputting the coding information into the feature extraction model to train the coding information so as to obtain corresponding output feature vectors;
respectively carrying out classification prediction on each output characteristic vector to map corresponding classification labels;
calculating a loss value of a feature extraction model by using the supervision label corresponding to the training sample and the classification label, and performing gradient updating on the feature extraction model according to the loss value;
and judging whether the loss value reaches a preset threshold value, and calling a next training sample in the training set to continue to carry out iterative training on the feature extraction model when the loss value does not reach the preset threshold value until the loss value reaches the preset threshold value.
14. A song search apparatus, comprising:
the song coding module is used for acquiring coding information corresponding to the song to be searched, which is submitted by the client;
the semantic extraction module is used for extracting high-dimensional index vectors of deep semantic information representing the song to be searched in multiple scales according to the coding information by adopting a feature extraction model trained to a convergence state;
the similarity matching module is used for calculating the similarity between the high-dimensional index vector and the high-dimensional index vector of deep semantic information which is extracted by the feature extraction model and represents multiple scales of each candidate song in a preset song feature library to obtain a similarity sequence;
and the screening and pushing module is used for screening and determining the target song with the similarity value exceeding a preset threshold and being the maximum similarity in the similarity sequence, constructing an access link corresponding to the target song and pushing the access link to the client device.
15. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 13.
16. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 13, which, when invoked by a computer, performs the steps comprised by the corresponding method.
17. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 13.
CN202111494004.6A 2021-12-08 2021-12-08 Song searching method and device, equipment, medium and product thereof Pending CN114764452A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111494004.6A CN114764452A (en) 2021-12-08 2021-12-08 Song searching method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111494004.6A CN114764452A (en) 2021-12-08 2021-12-08 Song searching method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN114764452A true CN114764452A (en) 2022-07-19

Family

ID=82365373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111494004.6A Pending CN114764452A (en) 2021-12-08 2021-12-08 Song searching method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN114764452A (en)

Similar Documents

Publication Publication Date Title
EP3508986B1 (en) Music cover identification for search, compliance, and licensing
Burred et al. Hierarchical automatic audio signal classification
US9313593B2 (en) Ranking representative segments in media data
CN112199548A (en) Music audio classification method based on convolution cyclic neural network
US11816151B2 (en) Music cover identification with lyrics for search, compliance, and licensing
Shen et al. Towards effective content-based music retrieval with multiple acoustic feature combination
Sarkar et al. Raga identification from Hindustani classical music signal using compositional properties
Wen et al. Parallel attention of representation global time–frequency correlation for music genre classification
Dhall et al. Music genre classification with convolutional neural networks and comparison with f, q, and mel spectrogram-based images
CN114817622A (en) Song fragment searching method and device, equipment, medium and product thereof
Ullrich et al. Music transcription with convolutional sequence-to-sequence models
CN113744759B (en) Tone color template customizing method and device, equipment, medium and product thereof
CN114764452A (en) Song searching method and device, equipment, medium and product thereof
CN114067788A (en) Guangdong opera vocal cavity classification method based on combination of CNN and LSTM
Sharma et al. Audio songs classification based on music patterns
Shirali-Shahreza et al. Fast and scalable system for automatic artist identification
CN114817620A (en) Song comparison method and device, equipment, medium and product thereof
CN114840707A (en) Song matching method and device, equipment, medium and product thereof
CN114840708A (en) Song indexing method and device, equipment, medium and product thereof
CN114817621A (en) Song semantic information indexing method and device, equipment, medium and product thereof
Kumari et al. Music Genre Classification for Indian Music Genres
Xiao et al. Application of Multilevel Local Feature Coding in Music Genre Recognition
Arenas-Garcıa et al. Optimal filtering of dynamics in short-time features for music organization
Balachandra et al. Music Genre Classification for Indian Music Genres
Burred An objective approach to content-based audio signal classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination