CN114817621A - Song semantic information indexing method and device, equipment, medium and product thereof - Google Patents

Song semantic information indexing method and device, equipment, medium and product thereof

Info

Publication number: CN114817621A
Application number: CN202111491602.8A
Authority: CN (China)
Prior art keywords: information, song, feature, audio data, vector
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 张超钢, 肖纯智
Current Assignee: Guangzhou Kugou Computer Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Guangzhou Kugou Computer Technology Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202111491602.8A
Publication of CN114817621A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval of audio data; Database structures therefor; File system structures therefor
    • G06F16/61 Indexing; Data structures therefor; Storage structures
    • G06F16/63 Querying
    • G06F16/632 Query formulation
    • G06F16/65 Clustering; Classification
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval using metadata automatically derived from the content, e.g. automatically derived transcript of audio data such as lyrics
    • G06F16/686 Retrieval using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The application discloses a song semantic information indexing method and a device, equipment, a medium and a product thereof, wherein the method comprises the following steps: encoding the audio information in song audio data to obtain corresponding encoded information; sequentially performing multi-level feature extraction on the encoded information by means of a plurality of convolution blocks in a shared network of a feature extraction model trained to a convergence state, to obtain intermediate feature information carrying deep semantic information of the song audio data; extracting global salient features from the intermediate feature information by means of a global branch network of the feature extraction model to obtain a global output feature vector; extracting semantic local features from the intermediate feature information, split into equal parts along the channel dimension, by means of a local branch network of the feature extraction model to obtain a channel output feature vector; and splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector. The method and the device realize representation learning of the deep semantic information of song audio data.

Description

Song semantic information indexing method and device, equipment, medium and product thereof
Technical Field
The present application relates to the technical field of music information retrieval, and in particular, to a song semantic information indexing method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
Background
With the popularity of short videos, live streaming and online radio, the amount of cover-song music grows larger and larger, and the scenes that require music recognition become more and more complicated. Compared with the original version, a cover version may differ in, or even completely change, musical components such as timbre, fundamental frequency, rhythm, tempo, harmony, lyrics, singing style and overall structure. Cover song recognition is therefore a very challenging research task.
There are multiple cover-song-recognition techniques in the prior art, and each has shortcomings, for example: (1) the traditional Landmark-based query-by-example ("listen and recognize") technique can only recognize the same version of a song and cannot recognize a cover that carries differentiated information; (2) the traditional melody-matching-based humming recognition technique can only recognize clean singing/humming and cannot recognize a cover with background accompaniment; (3) the traditional cover song recognition scheme mainly extracts audio features such as the Pitch Class Profile (PCP) and then uses algorithms such as dynamic programming to compute the similarity distance between songs. Because of the diversity of cover versions, this scheme is only suitable for covers with minor re-arrangement; its recognition accuracy is low, its recognition speed is slow, and it cannot be applied to retrieval over massive music collections.
Therefore, the prior-art solutions to song recognition problems lack general adaptability and suffer from low recognition accuracy and low recognition efficiency, and a more effective technical solution needs to be explored.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and provide a song semantic information indexing method and a corresponding apparatus, computer device, computer readable storage medium, and computer program product.
In order to meet the various purposes of the application, the following technical solutions are adopted in the application:
The song semantic information indexing method adapted to one of the purposes of the application comprises the following steps:
encoding the audio information in the song audio data to obtain corresponding encoded information;
sequentially performing multi-level feature extraction on the encoded information by means of a plurality of convolution blocks in a shared network of a feature extraction model trained to a convergence state, to obtain intermediate feature information carrying deep semantic information of the song audio data;
extracting global salient features from the intermediate feature information by means of a global branch network of the feature extraction model to obtain a global output feature vector;
extracting semantic local features from the intermediate feature information, split into equal parts along the channel dimension, by means of a local branch network of the feature extraction model to obtain a channel output feature vector;
and splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector.
In a further embodiment, the step of extracting global salient features from the intermediate feature information by means of the global branch network of the feature extraction model to obtain a global output feature vector comprises the following steps:
sequentially performing multi-stage feature extraction on the intermediate feature information through a plurality of convolution blocks of the global branch network to obtain deep feature information;
and performing a max-pooling operation on the deep feature information to obtain a global output feature vector representing the global salient features of the song audio data.
In a further embodiment, the step of extracting semantic local features from the intermediate feature information, split into equal parts along the channel dimension, by means of the local branch network of the feature extraction model to obtain a channel output feature vector comprises the following steps:
sequentially performing multi-stage feature extraction on the intermediate feature information through a plurality of convolution blocks of the local branch network to obtain deep feature information;
equally dividing the deep feature information along the channel direction to obtain a plurality of equally divided pieces of feature information;
and performing a mean-pooling operation on each equally divided piece of feature information to obtain a plurality of equally divided feature vectors, and splicing all the equally divided feature vectors into the channel output feature vector.
In an optional embodiment, in the step of encoding the audio information in the song audio data, the audio information is any one of the time-frequency spectrum information, Mel spectrum information, CQT filtering information, pitch class profile information, and Chroma feature information of the song audio data.
In a specific embodiment, in the shared network, at least one of the convolution blocks is configured to apply an attention module to extract key information in song audio data, where the attention module is a spatial attention module or a channel attention module.
In a further embodiment, the convolution block is configured to perform the following steps:
performing a convolution transformation on the information input into it to obtain transformation feature information;
applying instance normalization and batch normalization to the transformation feature information respectively, combining the results into spliced feature information, and activating and outputting the spliced feature information;
performing convolution operations and batch normalization on the activated spliced feature information several times to obtain residual information;
and superposing the residual information onto the information input into the block, and activating the result for output.
In an extended embodiment, the training process of the feature extraction model includes the following iterative training steps:
calling a training sample from a training set and determining the encoded information of its audio information, wherein the training sample is pre-collected song audio data consisting of a complete song or a fragment thereof;
inputting the encoded information into the feature extraction model for training so as to obtain the corresponding output feature vectors;
performing classification prediction on each output feature vector to map it to a corresponding classification label;
computing a loss value of the feature extraction model from the supervision label corresponding to the training sample and the classification label, and performing a gradient update on the feature extraction model according to the loss value;
and judging whether the loss value reaches a preset threshold, and, when it does not, calling the next training sample in the training set to continue the iterative training of the feature extraction model until the loss value reaches the preset threshold.
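Purely as an illustration of the iterative training described above, the following Python sketch shows one possible realization (the cross-entropy loss, the Adam optimizer and the separate classifier head are assumptions made for the sketch and are not prescribed by the present application):

```python
import torch
import torch.nn as nn

def train_feature_extractor(model, classifier, loader, epochs=10, lr=1e-3, loss_target=0.05):
    """model maps encoded audio information to output feature vectors;
    classifier maps those vectors to song (supervision) labels."""
    optimizer = torch.optim.Adam(list(model.parameters()) + list(classifier.parameters()), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for encoded, label in loader:          # encoded information, supervision label
            features = model(encoded)          # output feature vector(s)
            logits = classifier(features)      # classification prediction
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()                    # gradient update of the feature extraction model
            optimizer.step()
            if loss.item() <= loss_target:     # preset threshold reached: stop iterating
                return model
    return model
```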
In an extended embodiment, after the step of splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector, the method includes the following steps:
responding to a query request carrying query audio data by calling the feature extraction model to extract the corresponding high-dimensional index vector as a query vector;
computing the similarity between the query vector and a plurality of high-dimensional index vectors in a preset song feature library to obtain a similarity data sequence, wherein the high-dimensional index vectors stored in the song feature library are extracted by the feature extraction model from the corresponding song audio data in a preset song library;
and determining the song audio data corresponding to the largest similarity that exceeds a preset threshold in the similarity data sequence as a similar song of the query audio data.
In an embodiment, before the step of computing the similarity between the query vector and the plurality of high-dimensional index vectors in the preset song feature library, the method includes the following steps:
computing the similarity between the query vector and the high-dimensional index vector of each piece of song audio data in a melody-free feature library to obtain a corresponding similarity sequence, wherein the high-dimensional index vectors in the melody-free feature library are extracted by the feature extraction model from pieces of song audio data that contain no melody information;
and checking whether the similarity of each piece of song audio data in the similarity sequence is lower than a preset threshold, continuing with the subsequent steps when the similarities of all pieces of song audio data are lower than the preset threshold, and otherwise terminating the subsequent steps.
In an extended embodiment, after the step of splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector, the method includes the following steps:
obtaining the query audio data to be compared;
calling the feature extraction model to determine the high-dimensional index vector corresponding to the query audio data;
computing the similarity between the high-dimensional index vector of the query audio data and the high-dimensional index vector of the song audio data;
and judging whether the similarity exceeds a preset threshold, and, when it does, judging that the query audio data and the song audio data constitute similar songs.
The song semantic information indexing device adapted to one of the purposes of the application comprises: an encoding processing module, a shared extraction module, a global extraction module, a local extraction module and an index processing module, wherein the encoding processing module is used for encoding the audio information in song audio data to obtain corresponding encoded information; the shared extraction module is used for sequentially performing multi-level feature extraction on the encoded information by means of a plurality of convolution blocks in a shared network of a feature extraction model trained to a convergence state, to obtain intermediate feature information carrying deep semantic information of the song audio data; the global extraction module is used for extracting global salient features from the intermediate feature information by means of a global branch network of the feature extraction model to obtain a global output feature vector; the local extraction module is used for extracting semantic local features from the intermediate feature information, split into equal parts along the channel dimension, by means of a local branch network of the feature extraction model to obtain a channel output feature vector; and the index processing module is used for splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector.
The computer device comprises a central processing unit and a memory, wherein the central processing unit is used for calling and running a computer program stored in the memory to execute the steps of the song semantic information indexing method.
A computer-readable storage medium adapted to another object of the present application stores, in the form of computer-readable instructions, a computer program implementing the song semantic information indexing method; when the computer program is invoked by a computer, the steps included in the method are executed.
A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the application has the following advantages:
First, the application encodes the audio information of the song audio data into corresponding encoded information so as to capture the style-invariant features of the song audio data, then extracts intermediate feature information from the encoded information through a shared network and, on the basis of the intermediate feature information, extracts the deep semantic information of the song audio data from different angles through a plurality of branch networks to obtain corresponding output feature information; finally, the output feature information serves as the high-dimensional index vector corresponding to the song audio data, completing end-to-end representation learning of the song audio data.
Secondly, because the feature extraction model adopted by the application combines a shared network with a plurality of branch networks, multi-angle feature extraction of the deep semantic information of the song audio data is achieved, and the resulting high-dimensional index vector is therefore more expressive. In particular, the global branch network provides a deep semantic representation of the global salient features of the song audio data, while the local branch network provides a deep semantic representation of its local salient features, so that the corresponding song audio data is indexed more effectively. Downstream processing such as retrieval, querying and matching of song audio data performed on the basis of this effective index achieves a more accurate and efficient matching effect, and can be used universally in multiple application scenarios such as cover song recognition, query-by-example recognition, humming recognition and song infringement judgment.
In addition, the output feature vectors obtained by the plurality of branch networks are fused into a single high-dimensional index vector, which is widely applicable and flexible in use. When representation learning of massive song audio data is processed, a relatively obvious economy of scale can be obtained: the solution can be deployed in the back end of an online music service platform as a standardized interface, meeting the needs of various application scenarios, providing a comprehensive, multi-purpose open service, and improving the economic advantage of the platform's music information retrieval.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flowchart of an exemplary embodiment of a song semantic information indexing method according to the present application;
FIG. 2 is a schematic block diagram of a network architecture of a feature extraction model in an embodiment of the present application;
FIG. 3 is a schematic flow chart of the working process of the residual convolution block used in the feature extraction model of the present application;
FIG. 4 is a schematic flow chart of a process in which the feature extraction model of the present application is trained;
FIG. 5 is a functional block diagram of a classification model accessed by the feature extraction model of the present application during a training phase;
FIG. 6 is a schematic flow chart diagram illustrating one embodiment of a feature extraction model of the present application for performing similar song matching;
FIG. 7 is a schematic flow chart of another embodiment in which the feature extraction model of the present application performs similar song matching, incorporating filtering with a melody-free feature library;
FIG. 8 is a schematic flow chart of another embodiment in which the feature extraction model of the present application performs similar song matching, as required for a song infringement comparison;
FIG. 9 is a schematic block diagram of a song semantic information indexing device according to the present application;
fig. 10 is a schematic structural diagram of a computer device used in the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As will be appreciated by those skilled in the art, "client," "terminal," and "terminal device" as used herein include both devices that are wireless signal receivers, which are devices having only wireless signal receivers without transmit capability, and devices that are receive and transmit hardware, which have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: cellular or other communication devices such as personal computers, tablets, etc. having single or multi-line displays or cellular or other communication devices without multi-line displays; PCS (Personal Communications Service), which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "client," "terminal device" can be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "client", "terminal Device" used herein may also be a communication terminal, a web terminal, a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a Mobile phone with music/video playing function, and may also be a smart tv, a set-top box, and the like.
The hardware referred to by the names "server", "client", "service node", etc. is essentially an electronic device with the performance of a personal computer, and is a hardware device having necessary components disclosed by the von neumann principle such as a central processing unit (including an arithmetic unit and a controller), a memory, an input device, an output device, etc., a computer program is stored in the memory, and the central processing unit calls a program stored in an external memory into the internal memory to run, executes instructions in the program, and interacts with the input and output devices, thereby completing a specific function.
It should be noted that the concept of "server" as referred to in this application can be extended to the case of a server cluster. According to the network deployment principle understood by those skilled in the art, the servers should be logically divided, and in physical space, the servers can be independent of each other but can be called through an interface, or can be integrated into a physical computer or a set of computer clusters. Those skilled in the art will appreciate this variation and should not be so limited as to restrict the implementation of the network deployment of the present application.
One or more technical features of the present application, unless expressly specified otherwise, may be deployed to a server and implemented by a client remotely invoking an online service interface provided by the server for access, or may be deployed directly and run on the client for access.
Unless specified in clear text, the neural network model referred to or possibly referred to in the application can be deployed in a remote server and performs remote invocation at a client, and can also be deployed in a client with sufficient equipment capability to perform direct invocation.
Various data referred to in the present application may be stored in a server remotely or in a local terminal device unless specified in the clear text, as long as the data is suitable for being called by the technical solution of the present application.
The person skilled in the art will know this: although the various methods of the present application are described based on the same concept so as to be common to each other, they may be independently performed unless otherwise specified. In the same way, for each embodiment disclosed in the present application, it is proposed based on the same inventive concept, and therefore, concepts of the same expression and concepts of which expressions are different but are appropriately changed only for convenience should be equally understood.
The embodiments to be disclosed herein can be flexibly constructed by cross-linking related technical features of the embodiments unless the mutual exclusion relationship between the related technical features is stated in the clear text, as long as the combination does not depart from the inventive spirit of the present application and can meet the needs of the prior art or solve the deficiencies of the prior art. Those skilled in the art will appreciate variations therefrom.
The song semantic information indexing method of the present application can be programmed into a computer program product and deployed to run on a server, so that a client can access an open interface when the computer program product runs as a web page or application program, and human-computer interaction is realized through a graphical user interface of the process of the computer program product.
Referring to fig. 1, in an exemplary embodiment of a song semantic information indexing method of the present application, the method includes the following steps:
Step S1100, encoding the audio information in the song audio data to obtain corresponding encoded information:
The song audio data may be audio data in any format such as MP3, WMA, M4A or WAV, or audio data separated from various video files. Song audio data typically comprises a plurality of voice data packets in the time domain. The song audio data can come from songs pre-stored in the song library of an online music service platform, or from songs or singing audio submitted by a user in real time; the application can therefore respond flexibly according to the specific task and serve different types of requirements. For example, when a song feature library of the song audio data in the song library is constructed to serve query-by-example recognition, humming recognition, cover song recognition and similar needs, feature extraction needs to be carried out on the song audio data in the song library one by one; when song search, query and matching are performed for query-by-example recognition, humming recognition, cover recognition or song infringement judgment, the song audio data submitted by the client needs to be acquired for feature extraction, and so on. On this basis, various transformations are performed on the voice data packets to encode the audio information in the song audio data as required by the application, so as to obtain the corresponding encoded information.
The audio information is mainly information describing the style-invariant features of the song audio data and can be of various types, including but not limited to the time-frequency spectrum information, Mel spectrum information, CQT filtering information, pitch class profile information and Chroma feature information extracted from the voice data packets of the song audio data. Such information can be encoded using a corresponding algorithm to obtain the corresponding type of encoded information, and any of the above types may be used in the present application to implement feature extraction. In practice, the CQT filtering information performs best in tests, and encoding the CQT filtering information to obtain the encoded information is therefore recommended.
Those skilled in the art will appreciate that each of the above kinds of audio information may be encoded using a corresponding algorithm. In the encoding process, the song audio data first undergoes conventional processing such as pre-emphasis, framing and windowing, after which time-domain or frequency-domain analysis is performed to realize the speech signal analysis. The purpose of pre-emphasis is to boost the high-frequency part of the speech signal and flatten the spectrum; typically the pre-emphasis is implemented with a first-order high-pass filter. Before the speech signal is analyzed it is divided into frames; the length of each frame is usually set to 20 ms and, taking the frame shift into account, there is a 10 ms overlap between two adjacent frames. Framing is accomplished by windowing the speech signal. Different window choices affect the result of the analysis, and a Hamming window function is commonly used for the windowing operation.
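A minimal sketch of this preprocessing, assuming NumPy and typical illustrative values (a 0.97 pre-emphasis coefficient, 20 ms frames with a 10 ms hop at a 16 kHz sampling rate; none of these values are mandated here), might look as follows:

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=20, hop_ms=10, alpha=0.97):
    """Pre-emphasis, framing and Hamming windowing (illustrative parameter values only)."""
    # First-order high-pass pre-emphasis boosts the high-frequency part of the signal
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)   # 20 ms -> 320 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)       # 10 ms hop, i.e. 10 ms overlap between frames
    if len(emphasized) < frame_len:
        emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    return frames   # shape (n_frames, frame_len), ready for STFT/CQT analysis
```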
On the basis of this preprocessing, further time-domain and frequency-domain analysis can be carried out to realize the encoding and obtain the corresponding encoded information for each type of audio information:
For the time-frequency spectrum information, pre-emphasis, framing, windowing and a short-time Fourier transform (STFT) are applied to the voice data of each voice data packet to transform it from the time domain into the frequency domain, yielding the data of a spectrogram, which constitutes the time-frequency spectrum information.
The Mel spectrum information can be obtained by filtering the time-frequency spectrum information with a Mel-scale filter bank; likewise, the corresponding Mel cepstrum information obtained by taking the logarithm of the Mel spectrum and applying a DCT is also suitable. It will be appreciated that Mel spectrum information and its Mel cepstrum information can better describe style-invariant features in a song, such as pitch, intonation and timbre.
For the CQT filtering information: all tones in music are built from the twelve-tone equal temperament spanning several octaves, corresponding to the twelve semitones of one octave on a piano. The frequency ratio between two neighbouring semitones is 2^(1/12), so for the same scale degree the higher octave has twice the frequency of the lower octave. Musical pitches are therefore distributed exponentially, whereas the audio spectrum obtained by the Fourier transform is distributed linearly; the frequency bins of the two cannot be put into one-to-one correspondence, which introduces errors in the estimation of some scale frequencies. The CQT time-frequency transform can be used instead of the Fourier transform for the speech analysis. CQT, the Constant-Q Transform, refers to a filter bank whose center frequencies are exponentially distributed, whose filter bandwidths differ, and whose ratio of center frequency to bandwidth is a constant Q. Unlike the Fourier transform, the horizontal axis of its spectrum is not linear in frequency but based on log2, and the filter window length varies with the spectral-line frequency to achieve better performance. Since the CQT bins and the musical scale frequencies share the same distribution, the amplitude of the music signal at each note frequency can be obtained directly by computing its CQT spectrum, which is better suited to the signal processing of music. This embodiment therefore recommends encoding this information to obtain the corresponding encoded information used as the input of the neural network model of the present application.
The pitch class profile information includes the PCP (Pitch Class Profile) and HPCP (Harmonic Pitch Class Profile); the aim is to extract a pitch sequence from the song audio data, convert it, after regularization, merging and segmentation, into a melody contour sequence, and convert that sequence into a corresponding feature representation using the pitch differences relative to a standard tone. Encoded information built from pitch class profile information is more robust to environmental noise.
The Chroma feature information is a collective term for the chroma vector (Chroma Vector) and the chromagram (Chromagram). A chroma vector contains 12 elements representing the energy of the 12 pitch classes over a period of time (e.g. one frame), with the energy of the same pitch class across different octaves accumulated; the chromagram is the sequence of chroma vectors. Specifically, after a voice data packet of the song audio data is converted from the time domain to the frequency domain by a short-time Fourier transform, some noise reduction is applied and the signal is tuned; absolute time is converted into frames according to the chosen window length, and the energy of every pitch in each frame is recorded to form a pitch map; on the basis of the pitch map, the energy (measured in loudness) of notes of the same time and pitch class but different octaves is accumulated onto the element of that pitch class in the chroma vector, forming the chromagram. The data corresponding to the chromagram is the Chroma feature information.
Any of the above specific kinds of audio information can be used as the input of the feature extraction model of the present application. To facilitate processing by the feature extraction model, the audio information can be converted into corresponding encoded information according to a preset format: for example, the audio information corresponding to each voice data packet is organized into a row vector, and the row vectors of the voice data packets are stacked row by row in the temporal order of the song audio data to obtain a two-dimensional matrix as the encoded information. Such conventions can be preset to fit the feature extraction model and implemented flexibly by those skilled in the art.
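As one hedged example, if the CQT filtering information is chosen as the audio information, the encoding into a two-dimensional matrix could be sketched with the librosa library as follows (the number of CQT bins, the hop length and the decibel scaling are illustrative choices, not requirements of the method):

```python
import numpy as np
import librosa

def encode_cqt(path, sr=22050, hop_length=512, n_bins=84):
    """Encode song audio data as a (time x frequency) CQT matrix."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Constant-Q transform: log2-spaced center frequencies aligned with the musical scale
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length, n_bins=n_bins))
    cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)
    # One row vector per frame, rows ordered by time -> two-dimensional coding matrix
    return cqt_db.T.astype(np.float32)
```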
Step S1200, performing multi-level feature extraction on the encoded information sequentially through a plurality of convolution blocks in a shared network of the feature extraction model trained to a convergence state, to obtain intermediate feature information carrying the deep semantic information of the song audio data:
In order to extract features of the corresponding song audio data from the encoded information, the application provides a novel feature extraction model based on a neural-network architecture. The model is trained to a convergence state in advance so that it acquires the ability to extract the deep semantic information of the song audio data from the encoded information and produce a corresponding output feature vector, completing representation learning of the style-invariant features of the song audio data for querying, retrieval and matching.
As shown in the schematic block diagram of FIG. 2, the feature extraction model consists of a shared network and two branch networks. The shared network includes a plurality of convolution blocks for extracting the deep semantic information of the encoded information stage by stage to obtain intermediate feature information, and the two branch networks respectively extract different types of deep semantic information from the intermediate feature information to obtain corresponding output feature information. The two branch networks each contain a structurally identical part comprising a plurality of convolution blocks that extract deep semantic information stage by stage; after the last convolution block, different processing is applied according to the different functions of the branch networks.
The convolution block can be implemented with CNN- or RNN-based convolution layers, and a convolution block based on the residual convolution principle is preferred. To provide a context-combing function that extracts the key information in the song audio data, an attention mechanism may be applied to any of the convolution blocks by adding a corresponding attention module; preferably the attention module is applied to the last convolution block of the shared network and is specifically a Spatial Attention Module (SAM) or a Channel Attention Module (CAM). In an enhanced embodiment, an instance normalization operation (IN) and a batch normalization operation (BN) are applied in the convolution block, dividing the information input into it into two parts: one part undergoes instance normalization to learn style-invariant features, and the other undergoes batch normalization for regular normalization, thereby applying a so-called IBN architecture. With this architecture, the model can learn music-attribute-invariant features of song audio data whose styles (notes, rhythm, timbre and so on) are highly diverse, while preserving version information.
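For illustration only, a channel attention module of the kind mentioned above could be sketched in PyTorch in a squeeze-and-excitation style as follows (this is one possible realization; the reduction ratio of 16 is an assumed hyperparameter):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Re-weights feature channels so that salient song information is emphasized."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global context per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):                              # x: (batch, channels, bands, frames)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # key information amplified, the rest attenuated
```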
It is therefore easy to understand that the feature extraction model, with its two branch networks, can adapt to different application scenarios: once trained to a convergence state with a pre-selected training set, it obtains the corresponding feature extraction capability and is suitable for executing the tasks of those scenarios, extracting from the encoded information of the song audio data the output feature information corresponding to that song audio data. The training process of the feature extraction model is described in the exemplary embodiments of the present application and is deferred for now.
In this step, in the architecture shown in FIG. 2, the encoded information is feature-extracted stage by stage by the plurality of convolution blocks of the shared network; in particular, after the last convolution block extracts the key information, the intermediate feature information carrying the key information of the encoded information is obtained and is passed along multiple paths to the two branch networks, so that deep semantic information can be extracted from different angles in each branch network.
Step S1300, extracting global salient features from the intermediate feature information by means of the global branch network of the feature extraction model to obtain a global output feature vector:
As mentioned above, in the architecture shown in FIG. 2, the intermediate feature information output by the shared network is fed into the two branch networks for further feature extraction; in this step the intermediate feature information is received as input by the global branch network.
According to the architecture shown in FIG. 2, the structurally identical part of each branch network includes two convolution blocks; after the feature information fed to these two convolution blocks has been extracted in sequence, the extracted feature information is passed to the specific structure of each branch network for its own processing.
After the intermediate feature information has been extracted stage by stage through the two convolution blocks shared in structure with the other branch network, the global branch network extracts global salient feature information from the output of the last convolution block through a max-pooling operation, thereby outputting the global output feature vector. With this architecture, the global output feature vector captures the overall salient features of the song audio data and improves the recognition capability of the model.
In a further refined embodiment, step S1300 may be implemented with the following specific steps:
Step S1310, sequentially performing multi-level feature extraction on the intermediate feature information through the plurality of convolution blocks of the global branch network to obtain deep feature information:
Specifically, the two convolution blocks described above are used; they share exactly the same network architecture and both apply IBN blocks, so that the deep feature information obtained after feature extraction can learn the style-invariant features in the song audio data.
Step S1320, performing a max-pooling operation on the deep feature information to obtain a global output feature vector in which the global salient features of the song audio data are extracted:
On the basis of the deep feature information, a max-pooling operation is performed directly through a pooling layer so as to extract the salient features from the deep feature information from a global perspective.
In the present application, the output feature information of each branch network is normalized into an output feature vector; for example, the global output feature vector of the global branch network may be normalized into a 512-dimensional feature vector.
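A minimal sketch of this max-pooling step, assuming PyTorch and a deep feature map with 512 channels (the channel count is only an example), is given below:

```python
import torch

def global_branch_head(feat: torch.Tensor) -> torch.Tensor:
    """Global max pooling of the deep feature map of the global branch.

    feat: (batch, channels, bands, frames); if the last convolution block
    outputs 512 channels, the result is directly a 512-dimensional global
    output feature vector.
    """
    return torch.amax(feat, dim=(2, 3))  # one globally salient value per channel
```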
Step S1400, extracting semantic local features from the intermediate feature information, split into equal parts along the channel dimension, by means of the local branch network of the feature extraction model to obtain a channel output feature vector:
In the local branch network shown in FIG. 2, after the intermediate feature information has been extracted stage by stage by two convolution blocks of the same structure as in the global branch network, the output of the last convolution block (number of channels x number of frequency bands x number of frames) is divided into four equal parts along the channel dimension, mean pooling is performed on each of the four parts to obtain the channel output feature information of the four channel groups, and this channel output feature information is then re-spliced into a channel output feature vector, whose dimension is 2048 in this example. By capturing local features per channel group through separate mean pooling, this process makes it possible to build a feature representation from a few locally salient common features, even for audio whose arrangement differs greatly and in which a large amount of information is drowned out by strong noise or other interfering tones.
In a further refined embodiment, step S1400 may be implemented with the following specific steps:
Step S1410, sequentially performing multi-level feature extraction on the intermediate feature information through the plurality of convolution blocks of the local branch network to obtain deep feature information:
IBN blocks are applied in both convolution blocks, so that the deep feature information obtained after feature extraction can learn the style-invariant features in the song audio data.
Step S1420, equally dividing the deep feature information along the channel direction to obtain a plurality of equally divided pieces of feature information:
It will be understood that, in the process of extracting the feature information corresponding to the encoded information, each convolution block outputs feature information represented by a matrix of "number of channels x number of frequency bands x number of frames" and therefore comprises a plurality of channels. In this embodiment, the deep feature information output by the last convolution block of the local branch network is divided along the channel direction into four equal parts, i.e. four equally divided pieces of feature information, so that feature extraction can be performed for each group of channels.
In an alternative embodiment, the feature information output by the last convolution block may instead be divided into four equal parts along the frequency-band direction rather than the channel direction; the per-band feature information obtained in this way is notably effective at resisting the selective weakening of frequency bands in poor recording environments, balancing the contribution of high- and low-frequency information in the feature composition, and resisting content additions or deletions in a fixed band (such as adding or removing a drum track) or strong interference within a fixed frequency range.
Step S1430, performing a mean-pooling operation on each equally divided piece of feature information to obtain a plurality of equally divided feature vectors, and splicing all the equally divided feature vectors into the channel output feature vector:
To capture the salient features of each group of channels, mean pooling (Average Pooling) is applied to each equally divided piece of feature information through a pooling layer to obtain a plurality of equally divided feature vectors, which are then simply spliced into a single channel output feature vector.
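The channel-wise splitting, per-part mean pooling and splicing can be sketched as follows (PyTorch; the four equal parts and the 2048-channel feature map are taken from the example above and are not fixed by the method):

```python
import torch

def local_branch_head(feat: torch.Tensor, parts: int = 4) -> torch.Tensor:
    """Equal channel-wise split, per-part mean pooling, and splicing.

    feat: (batch, channels, bands, frames). With 2048 channels and four equal
    parts, each part mean-pools to 512 dimensions and the spliced channel
    output feature vector is 2048-dimensional, matching the example above.
    """
    chunks = torch.chunk(feat, parts, dim=1)         # split along the channel direction
    pooled = [c.mean(dim=(2, 3)) for c in chunks]    # mean pooling per channel group
    return torch.cat(pooled, dim=1)                  # channel output feature vector
```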
Step S1500, splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector:
The output feature vectors obtained from the branch networks can finally be converted into a high-dimensional index vector for storage or direct use. The high-dimensional index vector is a high-dimensional vector used to index the corresponding song audio data. Since each branch network has already normalized its output feature information into an output feature vector, the high-dimensional index vector can be handled in different ways depending on the specific use of the feature extraction model. For example, when the vectors are only stored for later, separate use, each output feature vector can be stored in the song feature library as a separate high-dimensional index vector, so that the high-dimensional index vectors produced by the different branch networks can be called as needed for retrieval, querying and matching. For specific tasks such as query-by-example recognition, cover song recognition, humming recognition and infringement comparison, all the output feature vectors produced by the branch networks can instead be spliced in sequence as required by the task to obtain a single high-dimensional index vector, which can be stored or used for matching immediately. In this embodiment it is preferred to simply splice the plurality of output feature vectors into the same high-dimensional index vector, so that the high-dimensional index vector realizes representation learning of the song audio data, representing both its global salient feature information and its local salient feature information.
It should be noted that step S1300 and step S1400 are processed in parallel; neither depends on the data output of the other, so they can be run as parallel tasks to improve operational efficiency.
According to the principle disclosed in this exemplary embodiment, a song feature library can be prepared for some or all of the songs in the song library of an online music service platform: the steps of this embodiment are applied to each song, or to song audio data of a fragment of it, to obtain the high-dimensional index vector corresponding to each piece of song audio data; the high-dimensional index vectors are stored in association with the corresponding songs to build the song feature library; afterwards, the high-dimensional index vector corresponding to any song can be called directly from the song feature library for operations such as retrieval, querying and matching.
Similarly, another mode of application can be derived: the above steps are applied to extract the corresponding high-dimensional index vectors from the song audio data of two songs, or of two song fragments; the two high-dimensional index vectors are then compared for similarity by examining the data distance between them, and a preset threshold is used to judge whether the two songs are similar; if so, the two songs are determined to be the same, otherwise they are different. This can be used for song infringement judgment or for a simple match between two songs.
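Both the library retrieval and the pairwise infringement comparison described above ultimately reduce to a similarity computation between high-dimensional index vectors. A hedged sketch is given below; the use of cosine similarity and the threshold value of 0.85 are assumptions for illustration, not values fixed by the present application:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_against_library(query_vec, library, threshold=0.85):
    """library: mapping from song id to its stored high-dimensional index vector."""
    scores = {song_id: cosine_similarity(query_vec, vec) for song_id, vec in library.items()}
    best_id, best_score = max(scores.items(), key=lambda kv: kv[1])
    return (best_id, best_score) if best_score > threshold else (None, best_score)

def is_similar_song(vec_a, vec_b, threshold=0.85):
    """Pairwise comparison, e.g. for a song infringement judgment."""
    return cosine_similarity(vec_a, vec_b) > threshold
```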
Beyond the above modes of application, the mining and use of the high-dimensional index vector obtained by the present application can take many different forms, and those skilled in the art can apply it flexibly according to the principles disclosed herein without affecting the inventive embodiments of the present application.
According to the description of the exemplary embodiments, it can be understood that the present application has various advantages, including but not limited to the following aspects:
the method comprises the steps of firstly, coding corresponding coding information by utilizing audio information of song audio data to obtain the style invariant features of the song audio data, then extracting intermediate feature information from the coding information through a shared network, extracting deep semantic information of the song audio data from different angles through a plurality of branch networks on the basis of the intermediate feature information to obtain corresponding output feature information, and finally, taking the output feature information as a high-dimensional index vector corresponding to the song audio data to finish end-to-end representation learning of the song audio data.
Secondly, because the method of combining the shared network and a plurality of branch networks is applied to the characteristic extraction model adopted by the application, the characteristic extraction of deep semantic information of the song audio data at multiple angles is realized, therefore, the obtained high-dimensional index vector can be more expressive, in particular, the deep semantic representation of the global salient features of the song audio data is realized through the global branch network, the deep semantic representation of the local salient features of the song audio data is also realized through the local branch network, thereby realizing more effective indexing of the corresponding song audio data, performing downstream processing such as retrieval, query, matching and the like of the song audio data on the basis of the effective indexing, the method can obtain more accurate and efficient matching effect, and can be universally used for multiple application scenes such as singing recognition, song listening recognition, humming recognition, song infringement judgment and the like.
In addition, the output feature vectors obtained by the plurality of branch networks are fused into a single high-dimensional index vector, which is widely applicable and flexible in use. When representation learning is performed on massive song audio data, a clear economy of scale can be obtained; the system can be deployed in the background of an online music service platform to expose a standardized interface, meet the requirements of various application scenarios, provide comprehensive and multipurpose open services, and improve the economics of music information retrieval on the platform.
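For illustration only, the following is a minimal PyTorch-style sketch of the shared-network-plus-two-branches structure summarized above. All class names, channel widths, the number of local parts, and the placeholder convolution block are assumptions chosen for readability, not values taken from this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def simple_block(cin, cout):
    # Placeholder convolution block; the IBN residual block sketched in a later
    # embodiment could be substituted here without changing the overall structure.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class SongIndexModel(nn.Module):
    """Sketch: shared trunk -> global branch (max pool) + local branch (channel split, mean pool)."""
    def __init__(self, in_ch=1, width=256, parts=4):
        super().__init__()
        self.shared = nn.Sequential(simple_block(in_ch, 64),
                                    simple_block(64, 128),
                                    simple_block(128, width))
        self.global_branch = nn.Sequential(simple_block(width, width), simple_block(width, width))
        self.local_branch = nn.Sequential(simple_block(width, width), simple_block(width, width))
        self.parts = parts

    def forward(self, spec):                                   # spec: (B, 1, F, T) encoded audio
        mid = self.shared(spec)                                # intermediate feature information
        g = self.global_branch(mid)
        g_vec = F.adaptive_max_pool2d(g, 1).flatten(1)         # global output feature vector
        l = self.local_branch(mid)
        chunks = torch.chunk(l, self.parts, dim=1)             # equal split along the channel direction
        l_vecs = [F.adaptive_avg_pool2d(c, 1).flatten(1) for c in chunks]  # mean pooling per part
        return torch.cat([g_vec] + l_vecs, dim=1)              # spliced high-dimensional index vector

model = SongIndexModel()
vec = model(torch.randn(2, 1, 84, 646))   # e.g. a small batch of CQT-like encodings
print(vec.shape)                           # (2, 512) under these assumed widths
```

Under these assumed widths, every clip yields a single 512-dimensional index vector: 256 dimensions from the global branch plus four 64-dimensional locally pooled parts.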
Referring to fig. 3, in a further embodiment, the convolution block is implemented based on residual convolution, and is configured to perform the following steps:
step S2100, carrying out convolution transformation on the input information to obtain transformation characteristic information:
Each convolution block in the feature extraction model first performs a convolution operation, through a 1 × 1 convolution kernel, on the information input to it, whether that input is the encoding information or the intermediate feature information output by the previous convolution block, to obtain the corresponding transformed feature information.
Step S2200, after performing instance normalization and batch normalization respectively on the transformed feature information, combining the results into spliced feature information, and activating and outputting the spliced feature information:
After the first convolution, an instance-batch normalization (IBN) layer is applied to process the transformed feature information. The transformed feature information is divided into two paths along the channel dimension: a batch normalization (BN) layer performs batch normalization on half of the channels, while an instance normalization (IN) layer performs instance normalization on the remaining channels. This allows the convolution block to capture the style-invariant features of the song audio data, so that song representations with diverse styles within a single item of data can be better exploited. After their different normalizations, the two paths are spliced back into the same spliced feature information, which is then activated and output.
Step S2300, performing convolution operations and batch normalization on the activated spliced feature information several times to obtain residual information:
The activated spliced feature information is further convolved through several convolution layers to extract deeper features, each convolution layer being followed by a batch normalization layer before its output. The last convolution layer uses a 1 × 1 convolution kernel, so that the representation learning capacity of the whole feature extraction model is not attenuated by the repeated instance normalization of multiple convolution blocks. The feature information finally output here constitutes the residual information of the residual convolution process.
Step S2400, superimposing the residual information onto the input information, then activating and outputting:
Finally, following the residual convolution principle, the transformed feature information obtained by the first convolution is taken as the shortcut and superimposed onto the residual information, which is then activated and output, yielding the intermediate feature information output by the current convolution block after its residual convolution operation.
In this embodiment, the convolution blocks required by the feature extraction model are constructed on the basis of residual convolution combined with instance-batch normalization: the residual convolution network is improved on the basis of a ResNet-series backbone, with an IBN structure superimposed, so that the constructed feature extraction model is easier to train and achieves a more accurate feature extraction effect, making it particularly suitable for feature extraction of song audio data.
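As a sketch of steps S2100 to S2400 only, the residual convolution block with instance-batch normalization might be written as follows in PyTorch. The half-and-half channel split, the number of intermediate convolution layers and all names are assumptions; the application itself only requires the structure described above.

```python
import torch
import torch.nn as nn

class IBN(nn.Module):
    """Instance-batch normalization: IN on half of the channels, BN on the rest (assumed 50/50 split)."""
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.inorm = nn.InstanceNorm2d(self.half, affine=True)
        self.bnorm = nn.BatchNorm2d(channels - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.inorm(a), self.bnorm(b)], dim=1)   # spliced feature information

class IBNResBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)   # S2100: 1x1 transform
        self.ibn = IBN(out_ch)                                             # S2200: IN/BN split
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)   # S2300: further convolution
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv3 = nn.Conv2d(out_ch, out_ch, kernel_size=1, bias=False)  # last layer kept 1x1
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        t = self.conv1(x)                       # transformed feature information
        y = self.relu(self.ibn(t))              # S2200: normalize, splice, activate
        y = self.relu(self.bn2(self.conv2(y)))  # S2300: conv + batch normalization
        y = self.bn3(self.conv3(y))             # residual information
        return self.relu(t + y)                 # S2400: shortcut from the 1x1 transform, activate
```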
Referring to fig. 4, in an extended embodiment, the training process of the feature extraction model includes the following steps of iterative training:
step S3100, calling a training sample from a training set, and determining coding information of audio information of the training sample, wherein the training sample is pre-collected song audio data, and the song audio data is a complete song or a fragment thereof:
Those skilled in the art will appreciate that different training sets for training the feature extraction model may be constructed to accommodate different downstream tasks, each training set containing a sufficient number of training samples, and each training sample being pre-provisioned with a corresponding supervision label.
The training samples can be collected in advance by those skilled in the art; each training sample is an item of song audio data suited to the intended downstream tasks, and may be a complete song, a MIDI melody fragment of a song, a song with accompaniment, the vocal part of a song without its accompaniment, a song segment without a melody part, a song segment with a melody part, and the like. Different cover versions of the same song may be grouped into the same category, that is, associated with the same supervision label, to enhance the generalization capability of the model across categories. When the duration of the song audio data in a training sample is too long, it may be further divided into several song segments of a certain duration, which serve as multiple training samples associated with the same supervision label. When segmenting a song, the timestamps of its lyrics may be used as references, so that each song segment is cut on the boundary of one or more complete lyric lines.
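As a hedged illustration of the lyric-timestamp segmentation mentioned above, the helper below cuts a long song into roughly fixed-length training segments on complete-lyric boundaries; the input format (a list of lyric-line start times in seconds) and the 30-second target length are hypothetical choices, not values from this application.

```python
def split_by_lyrics(duration_s, lyric_times, target_len=30.0):
    """Cut a long song into training segments that end on complete-lyric boundaries.

    lyric_times: sorted start times (seconds) of each lyric line -- an assumed input format.
    Returns a list of (start, end) segments of roughly target_len seconds each.
    """
    segments, seg_start = [], 0.0
    for t in lyric_times:
        if t - seg_start >= target_len:       # close the segment only at a lyric-line boundary
            segments.append((seg_start, t))
            seg_start = t
    if duration_s - seg_start > 0:
        segments.append((seg_start, duration_s))
    return segments

# e.g. a 200 s song with lyric lines every ~10 s -> ~30 s segments sharing one supervision label
print(split_by_lyrics(200.0, [i * 10.0 for i in range(1, 20)]))
```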
For the training samples in the training set, the encoding information corresponding to the song audio data may, for convenience of model training, be prepared in advance, or it may be obtained by encoding in real time when each item of song audio data is called for training the feature extraction model. For the specific encoding principle, reference may be made to the corresponding processes disclosed earlier in the present application.
Step S3200, inputting the encoding information into the feature extraction model for training, so as to obtain corresponding output feature vectors:
During the training of a training sample, the encoding information corresponding to the sample is fed to the feature extraction model for feature extraction; the feature extraction principle is as described for the feature extraction model in the previous embodiments and is omitted here for brevity. In this process the feature extraction model performs representation learning on the training sample and obtains the output feature vector corresponding to each branch network.
Step S3300, performing classification prediction on each output feature vector, so as to map a corresponding classification label:
In the present application, the training task of the feature extraction model is treated as a classification task. Therefore, the model can be trained by connecting each output feature vector of the feature extraction model to a correspondingly prepared classification model, examining the classification result of each classification model, and supervising it with the corresponding supervision label. On this basis, during the training stage of the feature extraction model implemented in any embodiment of the present application, one classification model is connected to the output feature vector of each branch network.
The classification model adopts the structure shown in fig. 5: a batch normalization layer performs batch normalization on the output feature vector, a fully connected layer maps it to the classification space, and a classification function computes the classification probability of each classification label, so that the classification label with the maximum probability is determined as the classification label corresponding to the training sample.
The classifier in the classification model may be a multi-class classifier implemented with the Softmax function, or a multi-class classifier implemented with the AM-Softmax function, which tightens intra-class compactness and enlarges inter-class separation; the latter clearly offers the better classification performance.
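A minimal sketch of such a per-branch classification model is given below: a batch normalization layer, followed by a fully connected mapping to the class space with an AM-Softmax margin applied to the target logit. The scale s and margin m values are assumed hyper-parameters, not figures from this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifierHead(nn.Module):
    """Per-branch classification head sketch: BN layer -> cosine FC with an AM-Softmax margin."""
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, feat, labels=None):
        x = self.bn(feat)                                               # batch-normalized feature
        cos = F.linear(F.normalize(x), F.normalize(self.weight))       # cosine to each class centre
        if labels is None:
            return self.s * cos                                          # inference: plain scaled cosines
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        return self.s * (cos - margin)                                   # AM-Softmax logits for training
```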
Step S3400, calculating a loss value of the feature extraction model by using the supervision label corresponding to the training sample and the classification label, and performing gradient updating on the feature extraction model according to the loss value:
In the classification model, the batch normalization layer is used to balance the triplet loss and the cross-entropy classification loss: the triplet loss can be calculated at the batch normalization layer, the cross-entropy classification loss can be calculated through the fully connected layer, and combining the two optimizes the output feature vectors.
Therefore, after the corresponding classification label is predicted from a training sample, the loss value between the supervision label and the classification label can be calculated according to the corresponding supervision label; the feature extraction model is then gradient-updated according to this loss value, the weight parameters of every link of the whole model are corrected, and the model is driven toward convergence.
Because there are multiple branch networks, there are multiple output feature vectors, and a corresponding number of classification models must be connected. When calculating the loss value, a weighting scheme can therefore be adopted: the triplet loss and the classification loss within each classification model are first weighted and summed to obtain a loss value for each output feature vector, the loss values of all output feature vectors are then weighted and summed to obtain the final loss value, and the whole feature extraction model is gradient-updated with this loss value.
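The weighted loss combination described here might look like the following sketch; the per-branch triplet/cross-entropy weights and the branch weights are illustrative assumptions, and the mining of triplet indices within the batch is left to the caller.

```python
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginLoss(margin=0.3)   # margin value is an assumption

def total_loss(branch_feats, branch_logits, labels, triplets, w_tri=1.0, w_ce=1.0, branch_w=None):
    """Weighted sum of (triplet + cross-entropy) per branch, then across branches.

    branch_feats / branch_logits: lists with one tensor per branch network output.
    triplets: per-branch (anchor, positive, negative) index tensors mined from the batch.
    """
    branch_w = branch_w or [1.0] * len(branch_feats)
    losses = []
    for f, logits, (a, p, n) in zip(branch_feats, branch_logits, triplets):
        l_tri = triplet(f[a], f[p], f[n])            # metric loss on the branch feature
        l_ce = F.cross_entropy(logits, labels)       # classification loss from the AM-Softmax head
        losses.append(w_tri * l_tri + w_ce * l_ce)   # loss value for this output feature vector
    return sum(w * l for w, l in zip(branch_w, losses))   # final loss used for the gradient update
```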
Step S3500, judging whether the loss value reaches a preset threshold, and when it does not, calling the next training sample in the training set to continue iterative training of the feature extraction model, until the loss value reaches the preset threshold:
Whether the loss value counted for each training sample approaches zero, or whether it reaches the preset threshold, is judged; when the judgment condition is met, the feature extraction model is deemed trained to a convergence state, so training is terminated and the model is put into the production stage, for example, for extracting features from the songs in a song library or serving other downstream tasks. If the convergence state has not been reached, the next training sample in the training set is called and iterative training of the feature extraction model continues until the model is trained to convergence.
This embodiment discloses the training principle and process of the feature extraction model of the present application. It can be seen that training the feature extraction model on a prepared training set enables it to learn to extract the corresponding output feature vectors from the encoding information of song audio data, achieving effective representation learning of the deep semantic information of the song audio data. Moreover, the multiple aspects of output feature vectors of the same song audio data are trained jointly, so training is more efficient and the model's functionality is richer; once put into production, the model can quickly obtain multi-aspect deep semantic information of the same song audio data.
Because the classification model of this embodiment adopts a batch normalization layer and a multi-class classifier implemented with the AM-Softmax function, the triplet loss and the classification loss can be balanced when gradient-updating the model, the model can be trained to convergence more quickly, and the trained model performs more effective representation learning of the deep semantic information of song audio data. When the output feature vectors are subsequently used for matching, a more efficient matching effect can be achieved.
This embodiment also shows the extensibility and compatibility of the feature extraction model in application: by training the feature extraction model with training samples corresponding to different downstream tasks, as those tasks require, the model acquires the ability to serve them. The improvement is therefore a fairly fundamental one and has good economic utility.
Referring to fig. 6, in order to meet the need of matching songs in a song library against a specific song or a song segment thereof, in an extended embodiment, the step S1500 of splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector is followed by the following steps:
step S4100, responding to the query request for querying the audio data, calling the feature extraction model to extract the corresponding high-dimensional index vector as a query vector:
In this embodiment, the feature extraction model of the present application can serve query requests submitted by users for listen-and-recognize, cover song recognition, and humming recognition. The song provided or specified by the user is determined as the query audio data according to the user's query request; whether the query audio data is similar to songs in the song feature library is then identified, and, according to the degree of similarity, the same song is recognized (listen-and-recognize, humming recognition) or it is determined whether the query belongs to a certain original song (cover song recognition).
It can be understood that, in the song feature library, a corresponding high-dimensional index vector has been extracted in advance, with the feature extraction model of the present application, for each item of song audio data in the song library, the extraction following steps S1100 to S1500. The high-dimensional index vector may be a single high-dimensional vector obtained by splicing the several output feature vectors, or the several output feature vectors may be stored separately and combined as needed. In this embodiment, a high-dimensional index vector obtained by simply splicing the global output feature vector and the local output feature vector is recommended.
It is also understood that, for song audio data cut into several song segments at the feature extraction stage, several sets of high-dimensional index vectors corresponding to those segments are generated, in which case each song segment is treated as a song. If the query audio data is too long, it can likewise be divided into several song segments of fixed duration for processing, and the query results of the individual segments are then combined for comprehensive processing.
The song feature library usually stores mapping data between each specific song and its corresponding high-dimensional index vector, so that subsequent processing can quickly determine the summary information of the song audio data corresponding to a high-dimensional index vector in order to output the result.
Whether the query audio data is a complete object or is divided into several song-segment units, each unit is input to the feature extraction model of the present application for feature extraction to obtain its corresponding high-dimensional index vector. These vectors are generated so as to match the organization of the high-dimensional index vectors extracted for the song library and stored in the song feature library: if each song's high-dimensional index vector in the song feature library is a single vector, the high-dimensional index vector of the query audio data is also a single vector; if the former is a set of several vectors obtained from the different outputs of the different branch networks, the latter is a corresponding set. In other words, the terminal output structure of the feature extraction model is the same when extracting features for the song library and when extracting features for the query audio data. For convenience of description, the high-dimensional index vector is taken here to be the single vector obtained by simply splicing the outputs of the several branch networks. Accordingly, the high-dimensional index vector extracted from the query audio data constitutes the query vector corresponding to that query audio data.
Step S4400, calculating similarities between the query vector and a plurality of high-dimensional index vectors in a preset song feature library, to obtain a similarity data sequence, where the high-dimensional index vectors stored in the song feature library are obtained by extracting, by the feature extraction model, each corresponding song audio data in the preset song library:
After the query vector is obtained, the similarity between the query vector and the high-dimensional index vector of each song in the song feature library can be calculated with a preset similarity calculation formula. The formula may be implemented with any algorithm suitable for calculating the similarity distance between data, such as cosine similarity, the Euclidean distance, the Pearson coefficient, Jaccard similarity, or a nearest-neighbour search algorithm, as those skilled in the art see fit. After the similarity calculation, a similarity data sequence of the query vector against the high-dimensional index vector of each song in the song feature library is obtained, storing one similarity value for each song in the song feature library.
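Taking cosine similarity, one of the options listed above, as an example, a sketch of computing the similarity data sequence against the whole song feature library could be:

```python
import numpy as np

def similarity_sequence(query_vec, library_mat):
    """Cosine similarity between one query vector and every high-dimensional index vector in the
    song feature library (library_mat: one row per song); returns the similarity data sequence."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    lib = library_mat / (np.linalg.norm(library_mat, axis=1, keepdims=True) + 1e-12)
    return lib @ q          # shape (num_songs,), one similarity value per song
```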
It should be noted that, if the query vector and the high-dimensional index vectors in the song feature library are both stored in dispersed form, several similarity data sequences are obtained after the corresponding calculations. For convenience of subsequent calculation, these sequences may be merged into a single similarity data sequence by averaging the similarity values corresponding to the same song, or by weighted averaging, simple addition, and the like.
Similarly, if the query audio data is divided into several song segments, feature extraction is performed on each segment to obtain several query vectors. After each query vector yields its own similarity data sequence, the final similarity data sequence of the query audio data can be obtained by aggregating the sequences of the individual query vectors, by averaging, simple summation, weighted averaging, or any similar means.
Step S4500, determining song audio data corresponding to the maximum similarity with the similarity exceeding a preset threshold in the similarity data sequence as a similar song of the query audio data:
After the final similarity data sequence of the query audio data is determined, a preset threshold, which may be an empirical threshold, can be applied to filter the similarity data sequence, retaining all elements whose similarity values exceed the threshold. If no element exceeds the preset threshold, there is no song in the song library similar to the query audio data. If several similarity values remain after filtering, only the song corresponding to the maximum similarity value may be selected as the similar song of the query audio data. The corresponding similar song is thus matched for the query audio data.
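A sketch of this threshold filtering and maximum-similarity selection, with a purely illustrative threshold value, might be:

```python
import numpy as np

def best_match(similarities, song_ids, threshold=0.8):
    """Filter the similarity data sequence by a preset (empirical) threshold and return the
    song with the maximum remaining similarity, or None if nothing passes the threshold."""
    keep = np.flatnonzero(similarities >= threshold)
    if keep.size == 0:
        return None                                     # no similar song in the song library
    best = keep[np.argmax(similarities[keep])]
    return song_ids[best], float(similarities[best])
```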
In this embodiment, the feature extraction model of the present application is used to extract a high-dimensional index vector from the query audio data as the query vector, the similarity between the query vector and the high-dimensional index vector of each song in a pre-constructed song feature library is calculated, and a similar song is matched for the query audio data according to the specific similarity values. Matching functions for downstream tasks such as listen-and-recognize, humming recognition, and cover song recognition are thereby implemented, providing corresponding services to a mass of users and finding similar songs for a song or song fragment the user specifies or submits. Because the high-dimensional index vector obtained through the feature extraction model indexes the song, and the similarity between high-dimensional index vectors is computed efficiently and quickly, song matching is very fast, efficient, and accurate.
Referring to fig. 7, on the basis of the previous embodiment, in a specific embodiment, the step S4400 of calculating the similarity between the query vector and the high-dimensional index vectors in the preset song feature library is preceded by the following steps:
Step S4200, calculating the similarity between the query vector and the high-dimensional index vector of each item of song audio data in a melody-free feature library to obtain a corresponding similarity sequence, where the high-dimensional index vectors in the melody-free feature library are obtained by the feature extraction model extracting each item of song audio data that contains no melody information:
In the previous embodiment, the vector similarity calculation for matching short song segments of less than a predetermined duration did not take into account that a segmented song segment may contain no vocal melody, whereas song matching is generally based mainly on the vocal melody. The present embodiment therefore screens the query against a melody-free feature library before searching the song feature library.
Specifically, after the query vector corresponding to the query audio data is obtained in step S4100, the similarity between the query vector and the high-dimensional index vectors of all song audio data in the melody-free feature library may be calculated in the same manner as the similarity calculation described above, so as to obtain a corresponding similarity sequence.
Step S4300, comparing whether the similarity of each item of song audio data in the similarity sequence is below a preset threshold; when the similarities of all the song audio data are below the preset threshold, continuing with the subsequent steps, otherwise terminating them:
To determine whether the query vector is similar to any high-dimensional index vector in the melody-free feature library, those skilled in the art may set a preset similarity threshold according to prior knowledge or measured empirical data, and then compare each similarity value of the similarity sequence calculated in the previous step against this threshold. If the similarity values of all elements in the sequence are below the preset threshold, no melody-free song segment similar to the query audio data exists in the melody-free feature library, and steps S4400 and S4500 may proceed to further determine similar songs. Otherwise, if the similarity value of at least one element is above the preset threshold, the query audio data is similar to at least one song segment in the melody-free feature library; that segment is evidently a melody-free segment, so execution of the subsequent steps can be stopped, saving system resources.
In this embodiment, the similarity between the query vector of the query audio data and the high-dimensional index vectors in the melody-free feature library is calculated, and a preset threshold then determines whether the query audio data matches a melody-free song segment in that library, thereby filtering query requests. This saves server resources and improves the efficiency of song query, retrieval, and matching; for short-segment recognition in particular, where segments are more likely to contain no audible melody, it noticeably improves the effectiveness of the recognition process and the processing efficiency of the corresponding downstream tasks.
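Under the same assumptions as the earlier similarity sketch, the melody-free pre-filter of steps S4200 and S4300 can be sketched as a single check deciding whether to continue with steps S4400 and S4500:

```python
import numpy as np

def passes_melody_filter(query_vec, melody_free_mat, threshold=0.8):
    """Pre-filter: compare the query vector against every vector of the melody-free feature
    library (one row per melody-free segment); continue to S4400 only if nothing matches.
    The threshold value here is purely illustrative."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    lib = melody_free_mat / (np.linalg.norm(melody_free_mat, axis=1, keepdims=True) + 1e-12)
    sims = lib @ q
    return bool((sims < threshold).all())     # True -> all below threshold -> continue the search
```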
Referring to fig. 8, in an extended embodiment adapted to the need of judging whether two songs or two song segments are similar, so as to determine whether there is a suspicion of infringement, the step S1500 of splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector is followed by the following steps:
step S5100, obtaining query audio data to be compared:
In this embodiment, the song audio data processed in the foregoing embodiments may be taken as the original audio data to be compared, for example a copyrighted master song, and, following the foregoing embodiments, its corresponding high-dimensional index vector is extracted by the feature extraction model of the present application.
Further, another song to be compared is obtained as the corresponding query audio data. The song corresponding to the query audio data obtained here may be a song submitted or released by a creator user of the online music platform, or an individual target song that the online music platform, triggered autonomously, fetches from a target song library whose infringement status needs to be determined.
Step S5200, invoking the feature extraction model to determine a high-dimensional index vector corresponding to the query audio data:
Since infringement judgment must be performed on the query audio data, the feature extraction model described in the present application is likewise used to extract its corresponding high-dimensional index vector in the same manner as in the foregoing embodiments. The high-dimensional index vector of the original song and the high-dimensional index vector of the query audio data are thus both obtained; as noted earlier regarding the correspondence of the organization forms of two high-dimensional index vectors used for similarity calculation, the two vectors here are organized identically.
Step S5300, calculating a similarity between the high-dimensional index vector of the query audio data and the high-dimensional index vector of the song audio data:
With reference to the previous embodiment, a similarity calculation formula corresponding to any similarity algorithm, such as cosine similarity, the Euclidean distance, the Pearson coefficient, Jaccard similarity, or a nearest-neighbour search algorithm, is used to calculate the similarity between the high-dimensional index vector of the query audio data (also called the query vector) and the high-dimensional index vector of the original song, yielding a corresponding similarity value.
Step S5400, judging whether the similarity exceeds a preset threshold, and judging that the query audio data and the song audio data form a similar song when the similarity exceeds the preset threshold:
After the similarity value between the two songs is obtained, a preset threshold determined by those skilled in the art from empirical data or prior knowledge is used to judge whether the two songs constitute similar songs. Specifically, whether the similarity value exceeds the preset threshold is compared; when it does, the query audio data and the song audio data serving as the original song constitute similar songs, so the query audio data can be determined to be an infringing work. Conversely, if the similarity value does not exceed the preset threshold, the two songs are determined not to be similar, and no special processing is needed, or the relevant user is simply notified.
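The pairwise infringement comparison of steps S5100 to S5400 then reduces to one similarity value and one threshold test, sketched below with cosine similarity and a purely illustrative threshold:

```python
import numpy as np

def infringement_check(original_vec, query_vec, threshold=0.85):
    """Pairwise sketch: one cosine similarity between the two high-dimensional index vectors,
    judged against an empirical threshold; the threshold value is an assumption, not a figure
    from this application."""
    sim = float(np.dot(original_vec, query_vec) /
                (np.linalg.norm(original_vec) * np.linalg.norm(query_vec) + 1e-12))
    return {"similarity": sim, "suspected_infringement": sim >= threshold}
```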
This embodiment illustrates the process of comparing two songs to determine whether there is a suspicion of infringement. It can be seen that, because the feature extraction model of the present application represents the feature information of the songs more accurately, the judgment in a song infringement comparison can be made more accurately and a corresponding infringement result obtained, which helps the online service platform quickly screen all kinds of song audio data, deter infringement in time, and guard against the platform's infringement risk.
Referring to fig. 9, a song semantic information indexing apparatus provided in the present application, adapted to the functional deployment of the song semantic information indexing method of the present application, includes: an encoding processing module 1100, a shared extraction module 1200, a global extraction module 1300, a local extraction module 1400, and an index processing module 1500. The encoding processing module 1100 is used for encoding the audio information in song audio data to obtain corresponding encoding information; the shared extraction module is used for sequentially performing multi-level feature extraction on the encoding information with a plurality of convolution blocks in the shared network of a feature extraction model trained to a convergence state, to obtain intermediate feature information of the deep semantic information of the song audio data; the global extraction module is used for extracting globally salient features from the intermediate feature information with the global branch network of the feature extraction model to obtain a global output feature vector; the local extraction module is used for extracting semantic local features from the intermediate feature information, split equally by channel, with the local branch network of the feature extraction model to obtain channel output feature vectors; and the index processing module is used for splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector.
In a further embodiment, the global extraction module 1300 includes: the global convolution branch module is used for sequentially carrying out multi-stage feature extraction on the intermediate feature information through a plurality of convolution blocks of the global branch network so as to obtain deep feature information; and the global pooling operation module is used for executing maximum pooling operation on the deep characteristic information to obtain a global output characteristic vector of the global significant characteristic of the song audio data.
In a further embodiment, the local extraction module 1400 includes: the local convolution branch module is used for sequentially carrying out multi-stage feature extraction on the intermediate feature information through a plurality of convolution blocks of the local branch network so as to obtain deep feature information; the channel segmentation output module is used for equally segmenting the deep layer characteristic information along the channel direction to obtain a plurality of equally-divided characteristic information; and the local pooling operation module is used for respectively performing mean pooling operation on each equally divided feature information to obtain a plurality of equally divided feature vectors and splicing all equally divided feature vectors into the local output feature vector.
In an optional embodiment, in the encoding processing module 1100, the audio information is any one of time-frequency spectrum information, mel-frequency spectrum information, CQT filtering information, sound level profile information, and Chroma feature information of the song audio data.
In a specific embodiment, in the shared network, at least one of the convolution blocks is configured to apply an attention module to extract key information in song audio data, where the attention module is a spatial attention module or a channel attention module.
In a further embodiment, the convolution block is configured to include the following units: a convolution transformation unit, used for performing convolution transformation on the information input to it to obtain transformed feature information; a normalization processing unit, used for performing instance normalization and batch normalization respectively on the transformed feature information, combining the results into spliced feature information, and activating and outputting the spliced feature information; an intermediate convolution unit, used for performing convolution operations and batch normalization on the activated spliced feature information several times to obtain residual information; and a residual processing unit, used for superimposing the residual information onto the information input to the block and activating the output.
In an extended embodiment, the feature extraction model is trained by a training task architecture with the following structure: a sample calling module, used for calling a training sample from the training set and determining the encoding information of the audio information of the training sample, the training sample being pre-collected song audio data which is a complete song or a segment thereof; a training execution module, used for inputting the encoding information into the feature extraction model for training, so as to obtain corresponding output feature vectors; a classification prediction module, used for performing classification prediction on each output feature vector to map the corresponding classification label; a gradient updating module, used for calculating the loss value of the feature extraction model from the supervision label and the classification label corresponding to the training sample, and gradient-updating the feature extraction model according to the loss value; and a loop iteration module, used for judging whether the loss value reaches a preset threshold, and when it does not, calling the next training sample in the training set to continue iterative training of the feature extraction model until the loss value reaches the preset threshold.
In an expanded embodiment, the song semantic information indexing apparatus further includes: the request response module is used for responding to a query request for querying the audio data and calling the feature extraction model to extract a corresponding high-dimensional index vector as a query vector; the similarity calculation module is used for calculating the similarity between the query vector and a plurality of high-dimensional index vectors in a preset song feature library to obtain a similarity data sequence, and the high-dimensional index vectors stored in the song feature library are obtained by extracting corresponding song audio data in the preset song library through the feature extraction model; and the similarity judging module is used for determining song audio data corresponding to the maximum similarity with the similarity exceeding a preset threshold in the similarity data sequence as similar songs of the query audio data.
In a specific embodiment, the song semantic information indexing apparatus further includes: a melody-free similarity calculation module, used for calculating the similarity between the query vector and the high-dimensional index vector of each item of song audio data in a melody-free feature library to obtain a corresponding similarity sequence, the high-dimensional index vectors in the melody-free feature library being obtained by the feature extraction model extracting each item of song audio data without melody information; and a melody-free filtering module, used for comparing whether the similarity of each item of song audio data in the similarity sequence is below a preset threshold, continuing to operate the similarity calculation module when the similarities of all the song audio data are below the preset threshold, and otherwise terminating the operation of the other modules.
In an extended embodiment, the song semantic information indexing device further includes: the query acquisition module is used for acquiring query audio data to be compared; the query index module is used for calling the feature extraction model to determine a high-dimensional index vector corresponding to the query audio data; the query calculation module is used for calculating the similarity between the high-dimensional index vector of the query audio data and the high-dimensional index vector of the song audio data; and the similarity comparison module is used for judging whether the similarity exceeds a preset threshold value or not, and judging that the query audio data and the song audio data form a similar song when the similarity exceeds the preset threshold value.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. As shown in fig. 10, the internal structure of the computer device is schematically illustrated. The computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions can enable a processor to realize a song semantic information indexing method when being executed by the processor. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform the song semantic information indexing method of the present application. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute specific functions of each module and its sub-module in fig. 9, and the memory stores program codes and various data required for executing the modules or the sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores program codes and data required for executing all modules/sub-modules in the song semantic information indexing device of the present application, and the server can call the program codes and data of the server to execute the functions of all sub-modules.
The present application further provides a storage medium storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the steps of the song semantic information indexing method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In summary, the representation learning capability of the deep semantic information of the song audio data is improved by improving the feature extraction model of the song audio data, the deep semantic information obtained according to the representation learning capability can achieve a more accurate and efficient effect when the deep semantic information is used for inquiring, retrieving and matching songs, and the comprehensive service capability of the online music platform can be improved by serving various downstream tasks such as song listening recognition, humming recognition, singing recognition, song infringement comparison and the like.
Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application, and it should be noted that, for those skilled in the art, several modifications and decorations can be made without departing from the principle of the present application, and these modifications and decorations should also be regarded as the protection scope of the present application.

Claims (14)

1. A song semantic information indexing method is characterized by comprising the following steps:
coding the audio information in the song audio data to obtain corresponding coding information;
sequentially performing multi-level feature extraction on the coding information by adopting a plurality of convolution blocks in a shared network of a feature extraction model trained to a convergence state to obtain intermediate feature information of deep semantic information of the song audio data;
extracting global significant features from the intermediate feature information by adopting a global branch network of the feature extraction model to obtain a global output feature vector;
adopting a local branch network of the feature extraction model to respectively extract semantic local features from the intermediate feature information according to channel equal segmentation to obtain channel output feature vectors;
and splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector.
2. The song semantic information indexing method according to claim 1, wherein the step of extracting the global significant features from the intermediate feature information by using the global branch network of the feature extraction model to obtain a global output feature vector comprises the steps of:
sequentially performing multi-stage feature extraction on the intermediate feature information through a plurality of convolution blocks of the global branch network to obtain deep feature information;
and performing maximum value pooling operation on the deep feature information to obtain a global output feature vector of the global significant features of the song audio data.
3. The song semantic information indexing method according to claim 1, wherein the step of extracting semantic local features from the intermediate feature information by the local branch network of the feature extraction model according to channel equal segmentation to obtain a channel output feature vector comprises the steps of:
sequentially performing multi-stage feature extraction on the intermediate feature information through a plurality of convolution blocks of the local branch network to obtain deep feature information;
equally dividing the deep characteristic information along the channel direction to obtain a plurality of equally divided characteristic information;
and respectively executing mean pooling operation on each equally divided feature information to obtain a plurality of equally divided feature vectors, and splicing all equally divided feature vectors into the local output feature vector.
4. The song semantic information indexing method of claim 1, wherein in the step of encoding the audio information in the song audio data, the audio information is any one of time-frequency spectrum information, mel-frequency spectrum information, CQT filtering information, level contour information, Chroma feature information of the song audio data.
5. The song semantic information indexing method according to claim 1, wherein in the shared network, at least one of the convolution blocks applies an attention module for extracting key information in song audio data, and the attention module is a spatial attention module or a channel attention module.
6. The song semantic information indexing method of claim 1, wherein the convolution block is configured to perform the following steps:
carrying out convolution transformation on the input information to obtain transformation characteristic information;
combining the transformation characteristic information into splicing characteristic information after respectively carrying out example normalization and batch normalization processing, and activating and outputting the splicing characteristic information;
performing convolution operation and batch normalization processing on the activated and output splicing characteristic information for multiple times to obtain residual error information;
and superimposing the residual information onto the information input to the convolution block, and activating the output.
7. The song semantic information indexing method according to any one of claims 1 to 6, wherein the training process of the feature extraction model comprises the following steps of iterative training:
calling a training sample from a training set, and determining the coding information of the audio information of the training sample, wherein the training sample is pre-collected song audio data which is a complete song or a fragment thereof;
inputting the coding information into the feature extraction model for training, so as to obtain corresponding output feature vectors;
respectively carrying out classification prediction on each output characteristic vector to map corresponding classification labels;
calculating a loss value of a feature extraction model by using the supervision label corresponding to the training sample and the classification label, and performing gradient updating on the feature extraction model according to the loss value;
and judging whether the loss value reaches a preset threshold value, and calling a next training sample in the training set to continue to carry out iterative training on the feature extraction model when the loss value does not reach the preset threshold value until the loss value reaches the preset threshold value.
8. The song semantic information indexing method according to any one of claims 1 to 6, characterized in that after the step of splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector, the method comprises the following steps:
responding to a query request for querying audio data, and calling the feature extraction model to extract a corresponding high-dimensional index vector as a query vector;
calculating the similarity between the query vector and a plurality of high-dimensional index vectors in a preset song feature library to obtain a similarity data sequence, wherein the high-dimensional index vectors stored in the song feature library are obtained by extracting corresponding song audio data in the preset song library through the feature extraction model;
and determining the song audio data corresponding to the maximum similarity with the similarity exceeding a preset threshold in the similarity data sequence as the similar song of the query audio data.
9. The song semantic information indexing method according to claim 8, wherein the step of calculating the similarity between the query vector and a plurality of high-dimensional index vectors in a preset song feature library is preceded by the steps of:
calculating the similarity between the query vector and the high-dimensional index vector of each song audio data in a melody-free feature library to obtain a corresponding similarity sequence, wherein the high-dimensional index vectors in the melody-free feature library are obtained by the feature extraction model extracting each song audio data without melody information;
and comparing whether the similarity of each song audio data in the similarity sequence is lower than a preset threshold, continuing the subsequent steps when the similarity of all song audio data is lower than the preset threshold, and otherwise terminating the subsequent steps.
10. The song semantic information indexing method of any one of claims 1 to 6, wherein the step of splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector is followed by the steps of:
obtaining query audio data to be compared;
calling the feature extraction model to determine a high-dimensional index vector corresponding to the query audio data;
calculating the similarity between the high-dimensional index vector of the query audio data and the high-dimensional index vector of the song audio data;
and judging whether the similarity exceeds a preset threshold value, and judging that the inquired audio data and the song audio data form a similar song when the similarity exceeds the preset threshold value.
11. An apparatus for indexing semantic information of songs, comprising:
the encoding processing module is used for encoding the audio information in the song audio data to obtain corresponding encoding information;
the shared extraction module is used for sequentially carrying out multi-stage feature extraction on the coding information by adopting a plurality of convolution blocks in a shared network of a feature extraction model trained to a convergence state to obtain intermediate feature information of deep semantic information of the song audio data;
the global extraction module is used for extracting global significant features from the intermediate feature information by adopting a global branch network of the feature extraction model to obtain a global output feature vector;
the local extraction module is used for adopting a local branch network of the feature extraction model to respectively extract semantic local features from the intermediate feature information according to channel equal segmentation to obtain channel output feature vectors;
and the index processing module is used for splicing the global output feature vector and the channel output feature vector into a high-dimensional index vector.
12. A computer device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 10, which, when invoked by a computer, performs the steps comprised by the corresponding method.
14. A computer program product comprising computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 10.
CN202111491602.8A 2021-12-08 2021-12-08 Song semantic information indexing method and device, equipment, medium and product thereof Pending CN114817621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111491602.8A CN114817621A (en) 2021-12-08 2021-12-08 Song semantic information indexing method and device, equipment, medium and product thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111491602.8A CN114817621A (en) 2021-12-08 2021-12-08 Song semantic information indexing method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN114817621A true CN114817621A (en) 2022-07-29

Family

ID=82526944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111491602.8A Pending CN114817621A (en) 2021-12-08 2021-12-08 Song semantic information indexing method and device, equipment, medium and product thereof

Country Status (1)

Country Link
CN (1) CN114817621A (en)

Similar Documents

Publication Publication Date Title
CN112199548A (en) Music audio classification method based on convolution cyclic neural network
EP3508986B1 (en) Music cover identification for search, compliance, and licensing
US11816151B2 (en) Music cover identification with lyrics for search, compliance, and licensing
CN109308912A (en) Music style recognition methods, device, computer equipment and storage medium
CN106295717A (en) A kind of western musical instrument sorting technique based on rarefaction representation and machine learning
CN105893389A (en) Voice message search method, device and server
CN113813609B (en) Game music style classification method and device, readable medium and electronic equipment
KR20170128060A (en) Melody extraction method from music signal
Chang et al. MS-SincResnet: Joint learning of 1D and 2D kernels using multi-scale SincNet and ResNet for music genre classification
Srinivasa Murthy et al. Singer identification for Indian singers using convolutional neural networks
Zhang et al. Mdcnn-sid: Multi-scale dilated convolution network for singer identification
CN114817621A (en) Song semantic information indexing method and device, equipment, medium and product thereof
Chen et al. A practical singing voice detection system based on gru-rnn
Lee et al. Automatic melody extraction algorithm using a convolutional neural network
CN114840708A (en) Song indexing method and device, equipment, medium and product thereof
US8880415B1 (en) Hierarchical encoding of time-series data features
CN115376498A (en) Speech recognition method, model training method, device, medium, and electronic apparatus
CN114817620A (en) Song comparison method and device, equipment, medium and product thereof
CN114817622A (en) Song fragment searching method and device, equipment, medium and product thereof
CN114764452A (en) Song searching method and device, equipment, medium and product thereof
CN114840707A (en) Song matching method and device, equipment, medium and product thereof
Shirali-Shahreza et al. Fast and scalable system for automatic artist identification
Fang et al. Deep learning of chroma representation for cover song identification in compression domain
Anantapadmanabhan et al. Tonic-independent stroke transcription of the mridangam
CN113744759B (en) Tone color template customizing method and device, equipment, medium and product thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination