CN113643720A - Song feature extraction model training method, song identification method and related equipment


Info

Publication number: CN113643720A
Application number: CN202110903817.XA
Authority: CN (China)
Prior art keywords: vector, target, audio, song, trained
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 谭志力, 孔令城
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110903817.XA


Classifications

    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/25: Pattern recognition; fusion techniques
    • G06F40/284: Handling natural language data; lexical analysis, e.g. tokenisation or collocates
    • G06N3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G10L15/26: Speech recognition; speech-to-text systems
    • G10L25/03: Speech or voice analysis characterised by the type of extracted parameters
    • G10L25/30: Speech or voice analysis characterised by the analysis technique using neural networks
    • G10H2210/031: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal


Abstract

The embodiment of the invention provides a song feature extraction model training method, a song identification method and related equipment, which are used for improving the accuracy of song identification. The song identification method in the embodiment of the application comprises the following steps: acquiring a target audio clip; extracting a lyric vector of the target audio clip; extracting a target embedded vector of the target audio clip; inputting the lyric vector and the target embedded vector of the target audio clip into a feature extraction model to obtain a fusion vector of the target audio clip; and identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and a plurality of fusion vectors respectively corresponding to a plurality of audio clips of each song in a database.

Description

Song feature extraction model training method, song identification method and related equipment
Technical Field
The invention relates to the technical field of audio data processing, in particular to a song feature extraction model training method, a song identification method and related equipment.
Background
Query-by-listening song identification is increasingly used on various terminals as a music identification mode. However, in the specific identification process it requires the query and the reference to be the same recording of the same song; for different sung versions of the same song, slight changes in the accompaniment or tempo, or in the singer's gender or age, can seriously affect the identification result.
How to accurately identify a cover version of an original piece of music is therefore a problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a cover song identification method and related equipment, which are used for improving the accuracy of cover song identification.
A first aspect of an embodiment of the present application provides a method for training a feature extraction model of a song, where the method includes:
acquiring a first target embedded vector of an audio segment to be trained in a training set, wherein the first target embedded vector is a frequency domain characteristic vector of the audio segment to be trained;
acquiring a first lyric vector of an audio clip to be trained in the training set;
inputting the first target embedding vector and the first lyric vector corresponding to the audio clip to be trained into an encoding layer of an initial neural network to obtain a fusion vector corresponding to the audio clip to be trained;
inputting the fusion vector of the audio clip to be trained into a decoding layer of the initial neural network to obtain a corresponding second target embedding vector and a second lyric vector;
calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector;
updating the model parameters of the initial neural network model according to the target loss value to obtain an updated neural network model;
and if the updated neural network model meets the convergence condition, outputting the coding layer in the updated neural network model as a feature extraction model.
Preferably, if the updated neural network model does not meet the convergence condition, another audio segment to be trained is obtained from the training set, and the step of obtaining the first target embedding vector of the audio segment to be trained in the training set is returned to be executed until the updated neural network model meets the convergence condition.
Preferably, the calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector comprises:
respectively calculating a first loss between the second target embedding vector and the first target embedding vector and a second loss between the second lyric vector and the first lyric vector according to a preset loss function;
calculating the target loss value according to the first loss and the second loss.
Preferably, the obtaining a first target embedding vector of an audio segment to be trained in a training set includes:
performing characteristic conversion between a time domain and a frequency domain on the audio segments to be trained in the training set to obtain frequency domain characteristics of the audio segments to be trained;
and inputting the frequency domain characteristics of the audio segment to be trained into a KDTN neural network to obtain a first target embedding vector of the audio segment to be trained.
Preferably, the obtaining a first lyric vector of an audio clip to be trained in the training set includes:
converting the audio features of the audio clips to be trained in the training set into text features to extract the lyrics of the audio clips to be trained;
and inputting the lyrics of the audio clip to be trained into a text embedding model to obtain a first lyrics vector of the audio clip to be trained.
A second aspect of the embodiments of the present application provides a song identification method, where the method includes:
acquiring a target audio clip;
extracting a lyric vector of the target audio fragment;
extracting a target embedded vector of the target audio fragment;
inputting the lyric vector and the target embedded vector of the target audio clip into the feature extraction model of the first aspect of the embodiment of the application to obtain a fusion vector of the target audio clip;
and identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and a plurality of fusion vectors respectively corresponding to a plurality of audio clips of each song in a database.
Preferably, before the identifying a song most similar to the target audio clip according to the fusion vector of the target audio clip and the fusion vectors respectively corresponding to the audio clips of each song in the database, the method further includes:
acquiring a third lyric vector and a third target embedding vector corresponding to each audio fragment in a plurality of audio fragments of each song in a database;
and inputting the third lyric vector and the third target embedding vector into the feature extraction model to obtain a plurality of fusion vectors respectively corresponding to a plurality of audio segments of each song in the database.
Preferably, the identifying a song most similar to the target audio clip according to the fusion vector of the target audio clip and the fusion vectors corresponding to the audio clips of each song in the database includes:
respectively calculating a plurality of similarity scores between the fusion vector of the target audio clip and the plurality of fusion vectors corresponding to the plurality of audio clips of each song in the database;
and identifying the song most similar to the target audio clip according to the similarity scores and a preset judgment threshold value.
Preferably, the identifying a song most similar to the target audio clip according to the multiple similarity scores and a preset judgment threshold includes:
inputting the plurality of similarity scores between the target audio clip and a plurality of audio clips of each song in the database into a deep memory neural network to obtain a target similarity score between the target audio clip and each song;
and identifying the song most similar to the target audio clip according to the target similarity score and the preset judgment threshold value.
A third aspect of the embodiments of the present application provides a device for training a feature extraction model of a song, where the device includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a first target embedded vector of an audio segment to be trained in a training set, and the first target embedded vector is a frequency domain characteristic vector of the audio segment to be trained;
the obtaining unit is further configured to obtain a first lyric vector of an audio clip to be trained in the training set;
the input unit is used for inputting the first target embedding vector and the first lyric vector corresponding to the audio segment to be trained into a coding layer of an initial neural network to obtain a fusion vector corresponding to the audio segment to be trained;
the input unit is further configured to input the fusion vector of the audio segment to be trained to a decoding layer of the initial neural network to obtain a corresponding second target embedded vector and a corresponding second lyric vector;
a calculation unit to calculate a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector;
the updating unit is used for updating the model parameters of the initial neural network model according to the target loss value to obtain an updated neural network model;
and the output unit is used for outputting the coding layer in the updated neural network model as a feature extraction model if the updated neural network model meets the convergence condition.
Preferably, the obtaining unit is further configured to:
and if the updated neural network model does not accord with the convergence condition, acquiring another audio segment to be trained from the training set, and returning to the step of acquiring the first target embedded vector of the audio segment to be trained in the training set until the updated neural network model accords with the convergence condition.
Preferably, the computing unit is specifically configured to:
respectively calculating a first loss between the second target embedding vector and the first target embedding vector and a second loss between the second lyric vector and the first lyric vector according to a preset loss function;
calculating the target loss value according to the first loss and the second loss.
Preferably, the obtaining unit is specifically configured to:
performing characteristic conversion between a time domain and a frequency domain on the audio segments to be trained in the training set to obtain frequency domain characteristics of the audio segments to be trained;
and inputting the frequency domain characteristics of the audio segment to be trained into a KDTN neural network to obtain a first target embedding vector of the audio segment to be trained.
Preferably, the obtaining unit is specifically configured to:
converting the audio features of the audio clips to be trained in the training set into text features to extract the lyrics of the audio clips to be trained;
and inputting the lyrics of the audio clip to be trained into a text embedding model to obtain a first lyrics vector of the audio clip to be trained.
A fourth aspect of the embodiments of the present application provides a song recognition apparatus, including:
an acquisition unit configured to acquire a target audio clip;
the extraction unit is used for extracting the lyric vector of the target audio clip;
the extracting unit is further configured to extract a target embedded vector of the target audio segment;
an input unit, configured to input the lyric vector and the target embedded vector of the target audio fragment into the feature extraction model according to the first aspect of the embodiment of the present application, to obtain a fusion vector of the target audio fragment;
and the identification unit is used for identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and a plurality of fusion vectors respectively corresponding to a plurality of audio clips of each song in a database.
Preferably, the obtaining unit is further configured to:
acquiring a third lyric vector and a third target embedding vector corresponding to each audio fragment in a plurality of audio fragments of each song in a database;
and inputting the third lyric vector and the third target embedding vector into the feature extraction model to obtain a plurality of fusion vectors respectively corresponding to a plurality of audio segments of each song in the database.
Preferably, the identification unit is specifically configured to:
respectively calculating a plurality of similarity scores between the fusion vector of the target audio clip and the plurality of fusion vectors corresponding to the plurality of audio clips of each song in the database;
and identifying the song most similar to the target audio clip according to the similarity scores and a preset judgment threshold value.
Preferably, the identification unit is specifically configured to:
inputting the plurality of similarity scores between the target audio clip and a plurality of audio clips of each song in the database into a deep memory neural network to obtain a target similarity score between the target audio clip and each song;
and identifying the song most similar to the target audio clip according to the target similarity score and the preset judgment threshold value.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, is configured to implement a method for training a feature extraction model of a song according to the first aspect of the embodiment of the present application or a method for identifying a song according to the second aspect of the embodiment of the present application.
An embodiment of the present application further provides an electronic device, which includes a memory, a processor, a power module, a sensor module, and an input/output module, and is characterized in that the processor is configured to implement the method for training a feature extraction model of a song according to the first aspect of the embodiment of the present application or the method for identifying a song according to the second aspect of the embodiment of the present application when executing a computer program stored in the memory.
According to the technical scheme, the embodiment of the invention has the following advantages:
acquiring a first target embedding vector of an audio segment to be trained in a training set, wherein the first target embedding vector is a frequency domain characteristic vector of the audio segment to be trained, and acquiring a first lyric vector of the audio segment to be trained in the training set; inputting the first target embedding vector and the first lyric vector corresponding to the audio clip to be trained into an encoding layer of an initial neural network to obtain a fusion vector corresponding to the audio clip to be trained; inputting the fusion vector of the audio clip to be trained into a decoding layer of the initial neural network to obtain a corresponding second target embedding vector and a second lyric vector; calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector; updating the model parameters of the initial neural model according to the target loss value to obtain an updated neural network model; and if the updated neural network model meets the convergence condition, outputting the coding layer in the updated neural network model as a feature extraction model.
In the embodiment of the application, when the feature extraction model is trained, not only the frequency domain features of the audio segment to be trained but also its lyric features are utilized, so that during training the model learns to recognize not only the melody of a song but also its lyrics, which improves the accuracy of the model in identifying songs.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a method for training a feature extraction model of a song according to an embodiment of the present application;
FIG. 2 is a schematic diagram of another embodiment of a method for training a feature extraction model of songs in an embodiment of the present application;
FIG. 3 is a detailed step of step 105 in the embodiment of FIG. 1 of the present application;
FIG. 4 is a detailed step of step 101 in the embodiment of FIG. 1 of the present application;
fig. 5 is a schematic diagram of an architecture of a KDTN neural network in an embodiment of the present application;
FIG. 6 is a detailed step of step 102 in the embodiment of FIG. 1 of the present application;
FIG. 7 is a diagram of an embodiment of a song recognition method in an embodiment of the present application;
FIG. 8 is a schematic diagram of another embodiment of a song recognition method in the embodiment of the present application;
FIG. 9 is a detailed step of step 705 in the embodiment of FIG. 7 of the present application;
FIG. 10 is a schematic diagram of another embodiment of a song recognition method in the embodiment of the present application;
FIG. 11 is a schematic diagram of an embodiment of a device for training a feature extraction model of songs in an embodiment of the present application;
FIG. 12 is a schematic diagram of an embodiment of a song recognition apparatus in an embodiment of the present application;
fig. 13 is a schematic structural diagram of an initial neural network in an embodiment of the present application.
Detailed Description
The embodiment of the invention provides a song feature extraction model training method, a song identification method and related equipment, which are used for improving the accuracy of song identification.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, the following describes a method for training a feature extraction model of a song in an embodiment of the present application, and referring to fig. 1, an embodiment of the method for training a feature extraction model of a song in an embodiment of the present application includes:
101. acquiring a first target embedded vector of an audio segment to be trained in a training set, wherein the first target embedded vector is a frequency domain characteristic vector of the audio segment to be trained;
in order to improve the accuracy of song recognition, the embodiment of the application trains a model for recognizing songs by using training samples in a training set, and in a specific training process, a first target embedding vector of an audio clip to be trained in the training set is obtained first, and then the first target embedding vector of the audio clip to be trained is input into an initial neural network for training, wherein the first target embedding vector is a frequency domain feature vector of the audio clip to be trained.
The generation process of the first target embedding vector will be described in detail in the following embodiments, and will not be described herein again.
102. Acquiring a first lyric vector of an audio clip to be trained in the training set;
in order to improve the accuracy of song recognition, in the embodiment of the application, besides the first target embedded vector is used as the input quantity of the model, the first lyric vector of the audio clip to be trained in the training set is also obtained, and the first lyric vector of the audio clip to be trained is also input into the initial neural network for training.
103. Inputting the first target embedding vector and the first lyric vector corresponding to the audio clip to be trained into an encoding layer of an initial neural network to obtain a fusion vector corresponding to the audio clip to be trained;
specifically, the initial neural network in the embodiment of the present application includes an encoding layer and a decoding layer, where the encoding layer functions to reduce the dimension of the input amount to fuse the tune information and the lyric information of the song. For ease of understanding, fig. 13 presents a schematic diagram of the initial neural network structure.
That is, the first target embedding vector and the first lyric vector of the audio clip to be trained are input into the coding layer of the initial neural network to obtain the fusion vector corresponding to the audio clip to be trained; in other words, the coding layer combines the first target embedding vector and the first lyric vector into the fusion vector of the audio clip to be trained.
The coding layer is a neural network structure including an input layer, a hidden layer and an output layer, and in particular, descriptions of the input layer, the hidden layer and the output layer are the same as those in the prior art, and are not repeated here.
104. Inputting the fusion vector of the audio clip to be trained into a decoding layer of the initial neural network to obtain a corresponding second target embedding vector and a second lyric vector;
further, in order to detect the accuracy of the neural network structure in the coding layer for fusing the first target embedding vector and the first lyric vector, the embodiment of the application inputs the fused vector of the audio segment to be trained into the decoding layer of the initial neural network to obtain a second target embedding vector and a second lyric vector corresponding to the fused vector.
And measuring the accuracy of the neural network structure in the coding layer on the fusion of the first target embedding vector and the first lyric vector based on the second target embedding vector, the second lyric vector and the difference value of the first target embedding vector and the first lyric vector.
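For ease of understanding, the following is a minimal PyTorch sketch of such an encoder-decoder pair; the layer sizes, the concatenation-based fusion, and the module name FusionAutoencoder are illustrative assumptions rather than the exact structure shown in FIG. 13.

```python
import torch
import torch.nn as nn

class FusionAutoencoder(nn.Module):
    """Encoder fuses the target embedding vector and the lyric vector into a
    lower-dimensional fusion vector; the decoder reconstructs both inputs."""

    def __init__(self, embed_dim=128, lyric_dim=128, fusion_dim=64):
        super().__init__()
        # Coding layer: input layer -> hidden layer -> output (fusion) layer.
        self.encoder = nn.Sequential(
            nn.Linear(embed_dim + lyric_dim, 128),
            nn.ReLU(),
            nn.Linear(128, fusion_dim),
        )
        # Decoding layer: maps the fusion vector back to both vectors.
        self.decoder = nn.Sequential(
            nn.Linear(fusion_dim, 128),
            nn.ReLU(),
            nn.Linear(128, embed_dim + lyric_dim),
        )
        self.embed_dim = embed_dim

    def forward(self, target_embed, lyric_vec):
        fusion = self.encoder(torch.cat([target_embed, lyric_vec], dim=-1))
        recon = self.decoder(fusion)
        # Split the reconstruction into the second target embedding vector
        # and the second lyric vector.
        recon_embed = recon[..., :self.embed_dim]
        recon_lyric = recon[..., self.embed_dim:]
        return fusion, recon_embed, recon_lyric
```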
105. Calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector;
specifically, when the accuracy of the neural network structure in the coding layer is measured by using the second target embedding vector, the second lyric vector, the first target embedding vector and the first lyric vector, the target loss value may be calculated based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector.
Further, after the target loss value is obtained, model parameters in an encoding layer and a decoding layer of the initial neural network model can be corrected based on the target loss value.
As for the specific calculation process of the target loss value, it will be described in the following embodiments, and will not be described herein again.
106. Updating the model parameters of the initial neural network model according to the target loss value to obtain an updated neural network model;
and after the target loss value is obtained, the model parameters of the initial neural network model are updated according to the target loss value to obtain an updated neural network model.
Specifically, when the model parameters of the initial neural network model are updated by using the target loss value, they may be updated based on the target loss value and a back propagation algorithm.
The update process of the model parameters of the initial neural network model (i.e., the model parameters of the coding layer and the decoding layer) based on the target loss value and the back propagation algorithm is described in detail in the prior art, and is not repeated here.
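Purely as an illustration, a minimal sketch of such an update step, assuming the FusionAutoencoder sketch above and an Adam optimizer (the optimizer, learning rate, and square loss are illustrative assumptions, not choices specified by the application):

```python
import torch

model = FusionAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(first_target_embed, first_lyric_vec):
    """One update of the coding-layer and decoding-layer parameters."""
    fusion, recon_embed, recon_lyric = model(first_target_embed, first_lyric_vec)
    # Target loss value from the two reconstruction losses (see step 105).
    loss = torch.nn.functional.mse_loss(recon_embed, first_target_embed) \
         + torch.nn.functional.mse_loss(recon_lyric, first_lyric_vec)
    optimizer.zero_grad()
    loss.backward()          # back propagation
    optimizer.step()         # update the model parameters
    return loss.item()
```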
107. And if the updated neural network model meets the convergence condition, outputting the coding layer in the updated neural network model as a feature extraction model.
Specifically, after model parameters of the initial neural network model are updated by using a target loss value and a back propagation algorithm, if the updated neural network model meets a convergence condition, a coding layer in the updated neural network model is output as a feature extraction model.
Here, the updated neural network model meeting the convergence condition means that, after the audio segments to be trained are input into the updated neural network model, the target loss value obtained in step 105 tends to a stable value.
In the embodiment of the application, a first target embedded vector of an audio clip to be trained in a training set is obtained; acquiring a first lyric vector of an audio clip to be trained in the training set; inputting the first target embedding vector and the first lyric vector corresponding to the audio clip to be trained into an encoding layer of an initial neural network to obtain a fusion vector corresponding to the audio clip to be trained; inputting the fusion vector of the audio clip to be trained into a decoding layer of the initial neural network to obtain a corresponding second target embedding vector and a second lyric vector; calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector; updating the model parameters of the initial neural model according to the target loss value to obtain an updated neural network model; and if the updated neural network model meets the convergence condition, outputting the coding layer in the updated neural network model as a feature extraction model.
According to the embodiment of the application, when the feature extraction model is trained, the first target embedding vector and the first lyric vector are used as input quantities to train the model parameters of the initial neural network model, namely when the model parameters of the initial neural network are trained, the first target embedding vector of the audio segment to be trained (namely, the frequency domain feature vector of the audio segment to be trained) and the first lyric vector of the audio segment to be trained (namely, the lyric feature vector of the audio segment to be trained) are utilized, so that the trained feature extraction model can identify songs based on a plurality of feature vectors of the songs, and the accuracy of the feature extraction model in song identification is improved.
Based on the embodiment shown in fig. 1, if the updated neural network model does not meet the convergence condition, the embodiment shown in fig. 2 is performed. Another embodiment of the training method of the song feature extraction model includes:
201. acquiring a first target embedded vector of an audio segment to be trained in a training set, wherein the first target embedded vector is a frequency domain characteristic vector of the audio segment to be trained;
202. acquiring a first lyric vector of an audio clip to be trained in the training set;
203. inputting the first target embedding vector and the first lyric vector corresponding to the audio clip to be trained into an encoding layer of an initial neural network to obtain a fusion vector corresponding to the audio clip to be trained;
204. inputting the fusion vector of the audio clip to be trained into a decoding layer of the initial neural network to obtain a corresponding second target embedding vector and a second lyric vector;
205. calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector;
206. updating the model parameters of the initial neural model according to the target loss value to obtain an updated neural network model;
it should be noted that, descriptions of step 201 to step 206 in the embodiment of the present application are similar to those described in the embodiment of fig. 1, and are not repeated here.
207. Judging whether the updated neural network model meets the convergence condition;
after obtaining the updated neural network model, it is further determined whether the updated neural network model meets the convergence condition, if yes, step 208 is executed, and if not, step 209 is executed.
It should be noted that, in the embodiment of the present application, the execution main body of step 207 may be the same as the execution main bodies of steps 201 to 206, or may be different from the execution main bodies of steps 201 to 206, and is not limited herein.
208. Outputting the coding layer in the updated neural network model as a feature extraction model;
and if the updated neural network model meets the convergence condition, outputting the coding layer in the updated neural network model as the feature extraction model.
209. Acquiring another audio segment to be trained from the training set, and returning to execute the steps 201 to 207 until the updated neural network model meets the convergence condition.
And if the updated neural network model does not accord with the convergence condition, acquiring another sample to be trained from the training set, and returning to execute the steps 201 to 207 by using the another sample to be trained until the updated neural network model accords with the convergence condition.
It should be noted that the other audio segment to be trained acquired in step 209 is different from the audio segments that have already been used for training. In a specific implementation, all the audio segments to be trained in the training set can be marked so that each has a unique identification code; when another audio segment to be trained is selected, the selected segment can then be distinguished from the segments on which training has already been performed. Alternatively, after an audio segment to be trained has been used for training, it is removed from the training set, so that any further audio segment selected from the training set is necessarily different from the segments already used. A sketch of this outer loop is given below.
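A minimal sketch of the loop, assuming the train_step and model sketches given earlier; sampling without repetition and the convergence test based on the spread of recent loss values are illustrative assumptions:

```python
import random

def train_until_convergence(training_set, tol=1e-4, window=10):
    """training_set: list of (first_target_embed, first_lyric_vec) pairs."""
    remaining = list(training_set)   # segments not yet used for training
    recent_losses = []
    while remaining:
        # Remove the selected segment so it is not trained on again.
        segment = remaining.pop(random.randrange(len(remaining)))
        loss = train_step(*segment)
        recent_losses.append(loss)
        # Convergence condition: the target loss value tends to a stable value.
        if len(recent_losses) >= window:
            span = max(recent_losses[-window:]) - min(recent_losses[-window:])
            if span < tol:
                break
    # The coding layer is output as the feature extraction model.
    return model.encoder
```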
In the embodiment of the application, when the updated neural network model does not accord with the convergence condition, the training process of the feature extraction model is completely described, and the completeness of the training process of the feature extraction model is improved.
Based on the embodiment shown in fig. 1, step 105 is described in detail below, please refer to fig. 3, where fig. 3 is a detailed step of step 105:
301. respectively calculating a first loss between the second target embedding vector and the first target embedding vector and a second loss between the second lyric vector and the first lyric vector according to a preset loss function;
specifically, when the target loss value is calculated based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector, a first loss between the second target embedding vector and the first target embedding vector and a second loss between the second lyric vector and the first lyric vector may be calculated based on a preset loss function, respectively.
The preset loss function may be a cross entropy loss function, a square loss function, an absolute value loss function, or the like, and the specific form of the loss function is not limited herein.
302. Calculating the target loss value according to the first loss and the second loss.
After the first loss and the second loss are obtained, the first loss and the second loss may be superimposed to obtain a target loss value, or a first weight of the first loss and a second weight of the second loss may be set, and then the target loss value may be calculated based on the first loss and the first weight, and the second loss and the second weight.
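As an illustration only, a weighted combination of the two losses might look as follows; the square loss and the equal weights are assumptions, since the application leaves the loss function and the weighting open:

```python
import torch.nn.functional as F

def target_loss(recon_embed, first_embed, recon_lyric, first_lyric,
                w_embed=0.5, w_lyric=0.5):
    """Weighted combination of the first and second losses (weights illustrative)."""
    first_loss = F.mse_loss(recon_embed, first_embed)    # square loss as an example
    second_loss = F.mse_loss(recon_lyric, first_lyric)
    return w_embed * first_loss + w_lyric * second_loss
```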
In the embodiment of the application, the process of calculating the target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector is described in detail, so that the reliability of the target loss value calculation process is improved.
Based on the embodiment shown in fig. 1, step 101 in the embodiment of fig. 1 is described in detail below, please refer to fig. 4, where fig. 4 is a detailed step of step 101:
401. performing characteristic conversion between a time domain and a frequency domain on the audio segments to be trained in the training set to obtain frequency domain characteristics of the audio segments to be trained;
when the first target embedded vector of the audio segment to be trained is obtained, the audio segment to be trained is first subjected to feature conversion between the time domain and the frequency domain, for example a Fourier transform or a Constant-Q Transform (CQT), so as to obtain the frequency domain features of the audio segment to be trained.
Further, the specific processes of fourier transform and constant Q transform are described in detail in the prior art, and are not described herein again.
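For illustration, a minimal sketch of extracting CQT magnitude features with the librosa library; the sampling rate, number of bins, and hop length are illustrative assumptions:

```python
import librosa
import numpy as np

def frequency_domain_features(path, sr=22050, n_bins=84, hop_length=512):
    """Load an audio segment and return its CQT magnitude spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    cqt = librosa.cqt(y, sr=sr, n_bins=n_bins, hop_length=hop_length)
    return np.abs(cqt)   # shape: (n_bins, n_frames)
```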
402. And inputting the frequency domain characteristics of the audio segment to be trained into a KDTN neural network to obtain a first target embedding vector of the audio segment to be trained.
After the frequency domain features of the audio segment to be trained are obtained, the frequency domain features are input into a KDTN neural network to obtain the first target embedding vector corresponding to the audio segment to be trained; for ease of understanding, a schematic architecture diagram of the KDTN neural network is given in FIG. 5.
Wherein K represents a key-invariant neural network, whose main function is to obtain a key-invariant sequence, that is, to process the frequency domain features of the audio segment to be trained in the embodiment of the present application;
D represents a dilated temporal pyramid network, whose main role is to exploit the local temporal context in the music;
T represents temporal self-attention, whose main role is to exploit the global temporal context in the music;
AP represents average pooling, whose main function is to down-sample the received features so as to reduce their dimensionality.
When the frequency domain features {X1} of the audio segment to be trained are input into the KDTN network, a fixed-length embedding vector Lx is obtained, i.e., the first target embedding vector of the audio segment to be trained.
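The internal structure of the KDTN network is not reproduced here; purely as a rough sketch, under the assumption that it chains a key-invariant convolution, a dilated temporal convolution, temporal self-attention, and average pooling, a module of that shape might look as follows (all layer choices and sizes are assumptions):

```python
import torch
import torch.nn as nn

class KDTNLikeEmbedder(nn.Module):
    """Rough sketch: key-invariant conv -> dilated temporal conv (local context)
    -> temporal self-attention (global context) -> average pooling -> Lx."""

    def __init__(self, n_bins=84, channels=64, embed_dim=128):
        super().__init__()
        self.key_conv = nn.Conv1d(n_bins, channels, kernel_size=3, padding=1)
        self.dilated = nn.Conv1d(channels, channels, kernel_size=3,
                                 padding=2, dilation=2)
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.proj = nn.Linear(channels, embed_dim)

    def forward(self, x):                     # x: (batch, n_bins, n_frames)
        h = torch.relu(self.key_conv(x))
        h = torch.relu(self.dilated(h))       # local temporal context
        h = h.transpose(1, 2)                 # (batch, n_frames, channels)
        h, _ = self.attn(h, h, h)             # global temporal context
        h = h.mean(dim=1)                     # average pooling over time (AP)
        return self.proj(h)                   # fixed-length embedding Lx
```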
In the embodiment of the application, the process of obtaining the first target embedded vector of the audio segment to be trained in the training set is described in detail, and the reliability of the calculation process of the first target embedded vector is improved.
Based on the embodiment described in fig. 1, step 102 in the embodiment of fig. 1 is described in detail below, please refer to fig. 6, where fig. 6 is a detailed step of step 102:
601. converting the audio features of the audio clips to be trained in the training set into text features to extract lyrics of the audio clips to be trained in the training set;
specifically, when the lyric vector of the audio clip to be trained is obtained, the audio feature of the audio clip to be trained is converted into a text feature, so as to obtain the lyric of the audio clip to be trained.
Specifically, the process of converting into the lyrics may be implemented by a speech-to-text conversion module, or a device having a speech-to-text conversion module, and is not limited in particular here.
602. And inputting the lyrics of the audio clip to be trained into a text embedding model to obtain a first lyrics vector of the audio clip to be trained in the training set.
After the lyrics of the audio clip to be trained are obtained, the lyrics of the audio clip to be trained are input into the text embedding model, so that a lyrics vector of the audio clip to be trained is obtained.
Specifically, the text embedding model may be a word embedding (Word Embedding) model, a one-hot matrix representation model, or the like, which is not specifically limited herein; the specific process of inputting the lyrics of the audio clip to be trained into the word embedding model or the one-hot matrix representation model to obtain the lyric vector of the audio clip to be trained is described in detail in the prior art, and is not repeated here.
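As an illustration only, the following sketch turns an audio clip into a first lyric vector by averaging word embeddings; the transcribe function and the word_vectors table are hypothetical placeholders for a speech-to-text module and a pre-trained word embedding model:

```python
import numpy as np

def lyric_vector(audio_path, transcribe, word_vectors, dim=128):
    """transcribe: placeholder speech-to-text function; word_vectors: a mapping
    from token to vector, e.g. a pre-trained word-embedding table."""
    lyrics = transcribe(audio_path)                  # audio features -> text
    tokens = lyrics.lower().split()
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)                     # first lyric vector
```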
In the embodiment of the application, the process of obtaining the first lyric vector of the audio clip to be trained in the training set is described in detail, so that the reliability of the calculation process of the first lyric vector is improved.
In the above, a detailed description is made on a training method of a feature extraction model of a song in the embodiment of the present application, and a description is next made on a recognition method of a song in the embodiment of the present application, referring to fig. 7, an embodiment of the song recognition method in the embodiment of the present application includes:
701. acquiring a target audio clip;
in order to identify a target audio clip on a terminal, the terminal in the embodiment of the present application needs to first acquire the target audio clip, where the target audio clip in the embodiment is an audio clip that is requested to be identified by the terminal. Specifically, the terminal in this embodiment of the application may be any one of a mobile phone, a Pad, a computer, and a wearable device, and is not limited specifically here.
Further, the acquiring action in the present application may be that the terminal actively reads the target audio clip from the other device, or that the terminal passively receives the target audio clip sent by the other device.
702. Extracting a lyric vector of the target audio fragment;
after the target audio segment is obtained, the lyric vector of the target audio segment is extracted, and specifically, the process of extracting the lyric vector of the target audio segment is similar to the process of obtaining the first lyric vector of the audio segment to be trained in the embodiment of fig. 6, and details are not repeated here.
703. Extracting a target embedded vector of the target audio fragment;
after the target audio segment is obtained, the target embedded vector of the target audio segment is extracted, and specifically, the process of extracting the target embedded vector of the target audio segment is similar to the process of obtaining the first target embedded vector of the audio segment to be trained in the embodiment of fig. 4, and details are not repeated here.
704. Inputting the lyric vector and the target embedded vector of the target audio clip into a feature extraction model to obtain a fusion vector of the target audio clip;
after the lyric vector and the target embedding vector of the target audio clip are obtained, the lyric vector and the target embedding vector of the target audio clip are input to the feature extraction model in the embodiments of fig. 1 to 6 to obtain the fusion vector of the target audio clip.
It should be noted that the feature extraction model in the embodiment of the present application is a coding layer in the updated neural network model in the embodiments of fig. 1 to fig. 6, that is, when identifying songs, only the coding layer in the updated neural network model is used, so that the dimension of the lyric vector of the target audio fragment and the target embedding vector is reduced by the coding layer, and a fusion vector fusing the lyric feature and the melody feature of the target audio fragment is obtained.
705. And identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and a plurality of fusion vectors respectively corresponding to a plurality of audio clips of each song in a database.
After the fusion vector of the target audio clip is obtained, the song most similar to the target audio clip is identified according to the fusion vector of the target audio clip and the plurality of fusion vectors respectively corresponding to the plurality of audio clips of each song in the database.
Specifically, when the song most similar to the target audio clip is identified according to the fusion vector of the target audio clip and the fusion vectors corresponding to the audio clips of each song in the database, the similarity between two vectors may be calculated according to a vector similarity measure, such as the Pearson correlation coefficient, Euclidean distance, cosine similarity, the Tanimoto coefficient, or Manhattan distance, and the song most similar to the target audio clip may be identified according to the magnitude of the similarity.
In the embodiment of the application, a target audio clip is obtained; extracting a lyric vector of the target audio fragment; extracting a target embedded vector of the target audio fragment; inputting the lyric vector and the target embedded vector of the target audio clip into a feature extraction model to obtain a fusion vector of the target audio clip; and identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and a plurality of fusion vectors respectively corresponding to a plurality of audio clips of each song in a database.
Because the song identification method in the embodiment of the application identifies the song most similar to the target audio clip through the fusion vector of the target audio clip, and the fusion vector carries not only the melody features but also the lyric features of the target audio clip, the song identification method in the embodiment of the application can improve the accuracy of identifying the song most similar to the target audio clip.
Based on the embodiment shown in fig. 7, before identifying a song most similar to the target audio segment according to the fusion vector of the target audio segment and the fusion vectors corresponding to the audio segments of each song in the database, the following steps are further performed, with reference to fig. 8, where another embodiment of the song identification method in the embodiment of the present application includes:
801. acquiring a third lyric vector and a third target embedding vector corresponding to each audio fragment in a plurality of audio fragments of each song in a database;
it is easily understood that before identifying the song most similar to the target audio segment according to the fusion vector of the target audio segment and the fusion vectors corresponding to the audio segments of each song in the database, a third lyric vector and a third target embedding vector corresponding to each audio segment of the audio segments of each song in the database should be obtained.
The process of obtaining the third lyric vector and the third target embedding vector corresponding to each audio fragment is similar to the process of obtaining the first lyric vector and the first target embedding vector of the audio fragment to be trained in fig. 4 and 6, and is not described herein again.
802. And inputting the third lyric vector and the third target embedding vector into the feature extraction model to obtain a plurality of fusion vectors corresponding to a plurality of audio segments of each song in the database.
And after the third lyric vector and the third target embedding vector corresponding to each audio fragment are obtained, inputting the third lyric vector and the third target embedding vector corresponding to each audio fragment into the feature extraction model, and thus obtaining a plurality of fusion vectors corresponding to a plurality of audio fragments of each song respectively.
It should be noted that the feature extraction model in the embodiment of the present application is an encoding layer in the neural network model updated in the embodiments of fig. 1 to 6, and the dimensionality of the third lyric vector and the third embedded vector is mainly reduced to obtain a fusion vector in which the lyric feature and the tune feature are fused.
In the embodiment of the application, the process of acquiring the plurality of fusion vectors corresponding to the plurality of audio segments of each song in the database is described in detail, so that the reliability of acquiring the plurality of fusion vectors corresponding to the plurality of audio segments of each song is improved.
Based on the embodiment described in fig. 7, step 705 is described in detail below, please refer to fig. 9, and fig. 9 is a detailed step of step 705:
901. respectively calculating a plurality of similarity scores between the fusion vector of the target audio clip and the plurality of fusion vectors corresponding to the plurality of audio clips of each song in the database;
specifically, when the song most similar to the target audio clip is identified according to the fusion vector of the target audio clip and the fusion vectors corresponding to the audio clips of each song in the database, a plurality of similarity scores between the fusion vector of the target audio clip and the fusion vectors corresponding to the audio clips of each song in the database may be calculated.
For example, a plurality of cosine distances between the fusion vector of the target audio clip and the plurality of fusion vectors corresponding to the plurality of audio clips of each song in the database may be calculated, and the cosine distances may be used as similarity scores. In particular, suppose the fusion vector of the target audio clip is Lx, and the fusion vectors corresponding to two audio clips of a song are Lc and Le; if the angle between Lx and Lc is α and the angle between Lx and Le is β, then the similarity scores between the target audio clip and the two audio clips are 1 - cos α and 1 - cos β.
It should be noted that, when calculating the similarity score, besides the cosine distance, an euclidean distance or an edit distance may also be used, and the algorithm for calculating the similarity score is not particularly limited herein.
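A minimal sketch of the 1 - cos(angle) score described above, for two fusion vectors given as NumPy arrays:

```python
import numpy as np

def similarity_score(fusion_query, fusion_ref):
    """Cosine-distance-based similarity score between two fusion vectors,
    following the 1 - cos(angle) form used in the example above."""
    cos = np.dot(fusion_query, fusion_ref) / (
        np.linalg.norm(fusion_query) * np.linalg.norm(fusion_ref))
    return 1.0 - cos
```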
902. And identifying the song most similar to the target audio clip according to the similarity scores and a preset judgment threshold value.
After obtaining a plurality of similarity scores between the target audio clip and a plurality of audio clips of each song in the database, the song most similar to the target audio clip can be identified further according to a preset judgment threshold value.
In the embodiment of the application, the process of calculating the similarity score between the target audio segment and each audio segment is described in detail, so that the reliability of the similarity score calculating process is improved.
Based on the embodiment described in fig. 9, in the process of identifying a song with a most similar target audio clip, a situation may occur that the dividing positions of the target audio clip in the target song (i.e., the start position and the end position of the target audio clip in the target song) are different from the dividing positions of the plurality of audio clips of the target song in the database, and in order to improve the accuracy of identifying the target audio clip, the following steps may also be performed, please refer to fig. 10, another embodiment of the song identification method includes:
1001. inputting the plurality of similarity scores between the target audio clip and a plurality of audio clips of each song in the database into a deep memory neural network to obtain a target similarity score between the target audio clip and each song;
the deep Memory neural network in the embodiment of the application has a Memory function, and the specific deep Memory neural network may be a neural network such as a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM).
When the same song includes a plurality of segments and the target audio segment is matched against each segment of that song in turn, the neural network is expected to memorize the matching result of the target audio segment with each segment and pass the memorized results forward, so that when the target audio segment is matched with the current segment, the matching results with the segments before the current segment are also taken into account.
It should be noted that the recurrent neural network and the long short-term memory network are only two specific forms of the deep memory neural network, and the specific form of the deep memory neural network is not limited herein.
Correspondingly, when each song includes a plurality of segments, the input of the deep memory neural network is: a plurality of similarity scores between the target audio clip and the plurality of audio clips of each song, and the output of the deep memory neural network is: a target similarity score of the target audio clip.
The data input and output process of the deep memory neural network is described as follows:
assuming that the similarity scores between the target audio clip and the audio clips of each song are X1, X2, ..., Xn, the target similarity score Xw between the target audio clip and each song in the database is obtained after X1, X2, ..., Xn are input into the deep memory neural network.
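As a rough sketch of step 1001, the following PyTorch snippet shows one way a memory-capable network (here an LSTM) could map the per-segment similarity scores X1, X2, ..., Xn of one song to a single target similarity score Xw; the layer sizes, the final linear head, and the untrained random weights are illustrative assumptions, not the concrete network of this application.

```python
import torch
import torch.nn as nn

class ScoreAggregator(nn.Module):
    """Aggregates per-segment similarity scores X1..Xn into one target score Xw."""
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        # An LSTM carries the matching results of earlier segments forward.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_segments) -> (batch, n_segments, 1)
        out, _ = self.lstm(scores.unsqueeze(-1))
        # The last hidden state has "seen" all earlier segment scores.
        return self.head(out[:, -1, :]).squeeze(-1)

# Illustrative usage: one song with six segment scores X1..X6 (untrained weights).
model = ScoreAggregator()
x = torch.rand(1, 6)
xw = model(x)   # target similarity score Xw for this song
print(xw.item())
```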
1002. And identifying the song most similar to the target audio clip according to the target similarity score and the preset judgment threshold value.
Further, after the target similarity score is obtained, the song most similar to the target audio clip can be identified according to the target similarity score and a preset judgment threshold value.
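A minimal sketch of the decision in step 1002, assuming each candidate song already has a target similarity score and that a higher score means a closer match (if the score is a distance, where smaller means closer, the comparisons would be flipped); the names and the example threshold are illustrative, not taken from this application.

```python
from typing import Dict, Optional

def most_similar_song(target_scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Pick the song whose target similarity score best clears the preset threshold.

    Assumes a higher score means a closer match; if the score is a distance
    (smaller means closer), the comparisons would be flipped.
    """
    best_song, best_score = None, float("-inf")
    for song_id, score in target_scores.items():
        if score >= threshold and score > best_score:
            best_song, best_score = song_id, score
    return best_song  # None means no song in the database is similar enough

print(most_similar_song({"song_a": 0.42, "song_b": 0.91}, threshold=0.8))  # -> song_b
```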
In the embodiment of the application, when the division positions of the target audio clip (namely the starting point and the end point of the target audio clip) are different from the division positions of the plurality of audio clips of each song, the identification process of the target audio clip is described in detail, so that the identification accuracy of the song most similar to the target audio clip is further improved.
The feature extraction model training method in the embodiment of the present application is described above, and the feature extraction model training device in the embodiment of the present application is described below with reference to fig. 11. An embodiment of the feature extraction model training device in the embodiment of the present application includes:
an obtaining unit 1101, configured to obtain a first target embedded vector of an audio segment to be trained in a training set, where the first target embedded vector is a frequency domain feature vector of the audio segment to be trained;
the obtaining unit 1101 is further configured to obtain a first lyric vector of an audio clip to be trained in the training set;
an input unit 1102, configured to input the first target embedding vector and the first lyric vector corresponding to an audio clip to be trained to an encoding layer of an initial neural network, so as to obtain a fusion vector corresponding to the audio clip to be trained;
the input unit 1102 is further configured to input the fusion vector of the audio segment to be trained to a decoding layer of the initial neural network, so as to obtain a corresponding second target embedded vector and a second lyric vector;
a calculating unit 1103 for calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector;
an updating unit 1104, configured to update the model parameters of the initial neural network model according to the target loss value, so as to obtain an updated neural network model;
an output unit 1105, configured to output, if the updated neural network model meets a convergence condition, the coding layer in the updated neural network model as a feature extraction model.
Preferably, the obtaining unit 1101 is further configured to:
and if the updated neural network model does not accord with the convergence condition, acquiring another audio segment to be trained from the training set, and returning to the step of acquiring the first target embedded vector of the audio segment to be trained in the training set until the updated neural network model accords with the convergence condition.
Preferably, the calculating unit 1103 is specifically configured to:
respectively calculating a first loss between the second target embedding vector and the first target embedding vector and a second loss between the second lyric vector and the first lyric vector according to a preset loss function;
calculating the target loss value according to the first loss and the second loss.
Preferably, the obtaining unit 1101 is specifically configured to:
performing characteristic conversion between a time domain and a frequency domain on the audio segments to be trained in the training set to obtain frequency domain characteristics of the audio segments to be trained;
and inputting the frequency domain characteristics of the audio segment to be trained into a KDTN neural network to obtain a first target embedding vector of the audio segment to be trained.
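The two steps above can be pictured with the following hedged Python sketch: an STFT-based log-mel spectrogram stands in for the time-to-frequency feature conversion, and a small placeholder network stands in for the KDTN network, whose concrete architecture is not described in this excerpt; all dimensions and names are illustrative assumptions.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def frequency_domain_features(path: str, sr: int = 16000) -> torch.Tensor:
    """Time-to-frequency feature conversion of one audio segment
    (here a log-mel spectrogram; the concrete features are an assumption)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    return torch.from_numpy(np.log(mel + 1e-6)).float()   # shape (n_mels, frames)

class EmbeddingNet(nn.Module):
    """Placeholder standing in for the KDTN network that maps frequency-domain
    features to the first target embedded vector."""
    def __init__(self, n_mels: int = 80, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Average over time frames, then project to a fixed-length embedding.
        return self.proj(feats.mean(dim=-1))
```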
Preferably, the obtaining unit 1101 is specifically configured to:
converting the audio features of the audio clips to be trained in the training set into text features to extract the lyrics of the audio clips to be trained;
and inputting the lyrics of the audio clip to be trained into a text embedding model to obtain a first lyrics vector of the audio clip to be trained.
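A hedged sketch of the same idea for the lyric branch: the speech-to-text step is left as a stub (any speech recognition system could fill it), and a toy hashed bag-of-words embedding stands in for the text embedding model; nothing here reflects the actual models used in this application.

```python
import numpy as np

def transcribe(audio_path: str) -> str:
    """Stub for the audio-to-text step; any speech recognition system could be
    plugged in here to extract the lyrics of the audio clip."""
    raise NotImplementedError("plug in a speech-to-text system of your choice")

def lyric_vector(lyrics: str, dim: int = 128) -> np.ndarray:
    """Toy hashed bag-of-words embedding standing in for the text embedding model."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in lyrics.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

print(lyric_vector("never gonna give you up")[:8])
```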
In the embodiment of the application, when the feature extraction model is trained, the input unit 1102 takes the first target embedding vector and the first lyric vector as input quantities to train the model parameters of the initial neural network model. That is, when the model parameters of the initial neural network are trained, both the first target embedded vector of the audio segment to be trained (namely, the frequency domain feature vector of the audio segment to be trained) and the first lyric vector of the audio segment to be trained are utilized, so that the trained feature extraction model can identify a song based on a plurality of feature vectors of the song, which improves the recognition rate of the feature extraction model for songs.
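To make the training flow of units 1101–1105 concrete, here is a minimal PyTorch sketch of one training iteration, under the assumptions that the encoding layer fuses the two input vectors by concatenation, that mean squared error stands in for the preset loss function, and that the target loss is the sum of the first and second losses; all dimensions and the fusion strategy are illustrative, not taken from this application.

```python
import torch
import torch.nn as nn

class FusionAutoencoder(nn.Module):
    """Encoding layer fuses the first target embedded vector and the first lyric
    vector into a fusion vector; the decoding layer reconstructs both inputs."""
    def __init__(self, audio_dim: int = 128, lyric_dim: int = 128, fusion_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(audio_dim + lyric_dim, fusion_dim), nn.ReLU())
        self.decoder = nn.Linear(fusion_dim, audio_dim + lyric_dim)
        self.audio_dim = audio_dim

    def forward(self, audio_vec, lyric_vec):
        fusion = self.encoder(torch.cat([audio_vec, lyric_vec], dim=-1))
        recon = self.decoder(fusion)
        return fusion, recon[..., :self.audio_dim], recon[..., self.audio_dim:]

model = FusionAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # stands in for the preset loss function

# One training iteration on a single (illustrative) audio segment.
first_embedding, first_lyric = torch.rand(1, 128), torch.rand(1, 128)
fusion, second_embedding, second_lyric = model(first_embedding, first_lyric)
first_loss = loss_fn(second_embedding, first_embedding)   # embedding reconstruction loss
second_loss = loss_fn(second_lyric, first_lyric)          # lyric reconstruction loss
target_loss = first_loss + second_loss                    # assumed combination of the two losses
optimizer.zero_grad()
target_loss.backward()
optimizer.step()
# After convergence, model.encoder alone would serve as the feature extraction model.
```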
Next, a description is given of a song recognition apparatus in an embodiment of the present application, referring to fig. 12, where an embodiment of the song recognition apparatus in the embodiment of the present application includes:
an obtaining unit 1201, configured to obtain a target audio clip;
an extracting unit 1202, configured to extract a lyric vector of the target audio fragment;
the extracting unit is further configured to extract a target embedded vector of the target audio segment;
an input unit 1203, configured to input the lyric vector and the target embedded vector of the target audio fragment into the feature extraction model according to the first aspect of the embodiment of the present application, to obtain a fusion vector of the target audio fragment;
an identifying unit 1204, configured to identify the song most similar to the target audio segment according to the fusion vector of the target audio segment and a plurality of fusion vectors respectively corresponding to a plurality of audio segments of each song in a database.
Preferably, the obtaining unit 1201 is further configured to:
acquiring a third lyric vector and a third target embedding vector corresponding to each audio fragment in a plurality of audio fragments of each song in a database;
and inputting the third lyric vector and the third target embedding vector into the feature extraction model to obtain a plurality of fusion vectors respectively corresponding to a plurality of audio segments of each song in the database.
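A small illustrative helper for this database-side precomputation, assuming the audio-embedding, lyric-embedding, and trained feature-extraction steps are available as callables (for example, the hypothetical helpers sketched earlier); the function and parameter names are not from this application.

```python
def build_fusion_index(songs, embed_audio, embed_lyrics, feature_extractor):
    """Precompute the database-side fusion vectors.

    songs: dict mapping song_id -> list of audio segments (in whatever form the
    two embedding callables accept). embed_audio / embed_lyrics produce the third
    target embedded vector and the third lyric vector; feature_extractor is the
    trained encoding layer. Returns song_id -> list of fusion vectors.
    """
    index = {}
    for song_id, segments in songs.items():
        index[song_id] = [
            feature_extractor(embed_audio(seg), embed_lyrics(seg)) for seg in segments
        ]
    return index
```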
Preferably, the identifying unit 1204 is specifically configured to:
respectively calculating a plurality of similarity scores between the fusion vector of the target audio clip and the plurality of fusion vectors corresponding to the plurality of audio clips of each song in the database;
and identifying the song most similar to the target audio clip according to the similarity scores and a preset judgment threshold value.
Preferably, the identifying unit 1204 is specifically configured to:
inputting the plurality of similarity scores between the target audio clip and a plurality of audio clips of each song in the database into a deep memory neural network to obtain a target similarity score between the target audio clip and each song;
and identifying the song most similar to the target audio clip according to the target similarity score and the preset judgment threshold value.
In this embodiment of the present application, the identifying unit 1204 may identify the song most similar to the target audio segment through the fusion vector of the target audio segment, and because the fusion vector includes not only the tune feature of the target audio segment but also the lyric feature of the target audio segment, the accuracy of identification is improved.
The electronic device in the embodiment of the present invention is described above in terms of modular functional entities; the electronic device in the embodiment of the present invention is described below in terms of hardware processing:
The electronic equipment is used for realizing the functions of the cover song recognition device, and one embodiment of the electronic equipment in the embodiment of the invention comprises the following components:
the device comprises a memory, a processor, a power supply module, a sensor module and an input/output module;
the memory is used for storing the computer program, and the processor is used for realizing the following steps when executing the computer program stored in the memory:
acquiring a first target embedded vector of an audio segment to be trained in a training set, wherein the first target embedded vector is a frequency domain characteristic vector of the audio segment to be trained;
acquiring a first lyric vector of an audio clip to be trained in the training set;
inputting the first target embedding vector and the first lyric vector corresponding to the audio clip to be trained into an encoding layer of an initial neural network to obtain a fusion vector corresponding to the audio clip to be trained;
inputting the fusion vector of the audio clip to be trained into a decoding layer of the initial neural network to obtain a corresponding second target embedding vector and a second lyric vector;
calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector;
updating the model parameters of the initial neural network model according to the target loss value to obtain an updated neural network model;
and if the updated neural network model meets the convergence condition, outputting the coding layer in the updated neural network model as a feature extraction model.
In some embodiments of the present invention, the processor may be further configured to:
and if the updated neural network model does not accord with the convergence condition, acquiring another audio segment to be trained from the training set, and returning to the step of acquiring the first target embedded vector of the audio segment to be trained in the training set until the updated neural network model accords with the convergence condition.
In some embodiments of the present invention, the processor may be further configured to:
respectively calculating a first loss between the second target embedding vector and the first target embedding vector and a second loss between the second lyric vector and the first lyric vector according to a preset loss function;
calculating the target loss value according to the first loss and the second loss.
In some embodiments of the present invention, the processor may be further configured to:
performing characteristic conversion between a time domain and a frequency domain on the audio segments to be trained in the training set to obtain frequency domain characteristics of the audio segments to be trained;
and inputting the frequency domain characteristics of the audio segment to be trained into a KDTN neural network to obtain a first target embedding vector of the audio segment to be trained.
In some embodiments of the present invention, the processor may be further configured to:
converting the audio features of the audio clips to be trained in the training set into text features to extract the lyrics of the audio clips to be trained;
and inputting the lyrics of the audio clip to be trained into a text embedding model to obtain a first lyrics vector of the audio clip to be trained.
An embodiment of the present invention further provides an electronic device, where a processor of the electronic device, when executing a computer program stored in a memory, may implement the following steps:
acquiring a target audio clip;
extracting a lyric vector of the target audio fragment;
extracting a target embedded vector of the target audio fragment;
inputting the lyric vector and the target embedded vector of the target audio clip into the feature extraction model of the first aspect of the embodiment of the application to obtain a fusion vector of the target audio clip;
and identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and a plurality of fusion vectors respectively corresponding to a plurality of audio clips of each song in a database.
In some embodiments of the present invention, before identifying a song most similar to the target audio segment according to the fusion vector of the target audio segment and the fusion vectors corresponding to the audio segments of each song in the database, the processor may be further configured to implement the following steps:
acquiring a third lyric vector and a third target embedding vector corresponding to each audio fragment in a plurality of audio fragments of each song in a database;
and inputting the third lyric vector and the third target embedding vector into the feature extraction model to obtain a plurality of fusion vectors respectively corresponding to a plurality of audio segments of each song in the database.
In some embodiments of the present invention, the processor may be further configured to:
respectively calculating a plurality of similarity scores between the fusion vector of the target audio clip and the plurality of fusion vectors corresponding to the plurality of audio clips of each song in the database;
and identifying the song most similar to the target audio clip according to the similarity scores and a preset judgment threshold value.
In some embodiments of the present invention, the processor may be further configured to:
inputting the plurality of similarity scores between the target audio clip and a plurality of audio clips of each song in the database into a deep memory neural network to obtain a target similarity score between the target audio clip and each song;
and identifying the song most similar to the target audio clip according to the target similarity score and the preset judgment threshold value.
It is to be understood that, when the processor in the electronic device described above executes the computer program, the functions of the units in the corresponding device embodiments may also be implemented, and are not described herein again. Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, used for describing the execution of the computer program in the cover song recognition device. For example, the computer program may be divided into the units of the above-described cover song recognition device, and the units may implement the specific functions described above for the corresponding cover song recognition device.
The electronic device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or other computing device. The electronic device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the processor and the memory are merely examples of a computer apparatus and do not constitute a limitation of the computer apparatus, which may comprise more or fewer components, combine some components, or have different components; for example, the electronic device may further comprise an input/output device, a network access device, a bus, etc.
The Processor may be a Central Processing Unit (CPU), another general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the terminal, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid state storage device.
The present invention also provides a computer-readable storage medium for implementing the functions of a cover song recognition apparatus, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the following steps:
acquiring a first target embedded vector of an audio segment to be trained in a training set, wherein the first target embedded vector is a frequency domain characteristic vector of the audio segment to be trained;
acquiring a first lyric vector of an audio clip to be trained in the training set;
inputting the first target embedding vector and the first lyric vector corresponding to the audio clip to be trained into an encoding layer of an initial neural network to obtain a fusion vector corresponding to the audio clip to be trained;
inputting the fusion vector of the audio clip to be trained into a decoding layer of the initial neural network to obtain a corresponding second target embedding vector and a second lyric vector;
calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector;
updating the model parameters of the initial neural network model according to the target loss value to obtain an updated neural network model;
and if the updated neural network model meets the convergence condition, outputting the coding layer in the updated neural network model as a feature extraction model.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
and if the updated neural network model does not accord with the convergence condition, acquiring another audio segment to be trained from the training set, and returning to the step of acquiring the first target embedded vector of the audio segment to be trained in the training set until the updated neural network model accords with the convergence condition.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
respectively calculating a first loss between the second target embedding vector and the first target embedding vector and a second loss between the second lyric vector and the first lyric vector according to a preset loss function;
calculating the target loss value according to the first loss and the second loss.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
performing characteristic conversion between a time domain and a frequency domain on the audio segments to be trained in the training set to obtain frequency domain characteristics of the audio segments to be trained;
and inputting the frequency domain characteristics of the audio segment to be trained into a KDTN neural network to obtain a first target embedding vector of the audio segment to be trained.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
converting the audio features of the audio clips to be trained in the training set into text features to extract the lyrics of the audio clips to be trained;
and inputting the lyrics of the audio clip to be trained into a text embedding model to obtain a first lyrics vector of the audio clip to be trained.
An embodiment of the present invention further provides another computer-readable storage medium, where when a computer program stored in the computer-readable storage medium is executed by a processor, the processor may be specifically configured to execute the following steps:
acquiring a target audio clip;
extracting a lyric vector of the target audio fragment;
extracting a target embedded vector of the target audio fragment;
inputting the lyric vector and the target embedded vector of the target audio clip into the feature extraction model of the first aspect of the embodiment of the application to obtain a fusion vector of the target audio clip;
and identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and a plurality of fusion vectors respectively corresponding to a plurality of audio clips of each song in a database.
In some embodiments of the present invention, before identifying a song most similar to the target audio clip according to the fusion vector of the target audio clip and the fusion vectors corresponding to the audio clips of each song in the database, the processor may be specifically configured to perform the following steps:
acquiring a third lyric vector and a third target embedding vector corresponding to each audio fragment in a plurality of audio fragments of each song in a database;
and inputting the third lyric vector and the third target embedding vector into the feature extraction model to obtain a plurality of fusion vectors respectively corresponding to a plurality of audio segments of each song in the database.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
respectively calculating a plurality of similarity scores between the fusion vector of the target audio clip and the plurality of fusion vectors corresponding to the plurality of audio clips of each song in the database;
and identifying the song most similar to the target audio clip according to the similarity scores and a preset judgment threshold value.
In some embodiments of the invention, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the steps of:
inputting the plurality of similarity scores between the target audio clip and a plurality of audio clips of each song in the database into a deep memory neural network to obtain a target similarity score between the target audio clip and each song;
and identifying the song most similar to the target audio clip according to the target similarity score and the preset judgment threshold value.
It will be appreciated that the integrated units, if implemented as software functional units and sold or used as a stand-alone product, may be stored in a corresponding one of the computer readable storage media. Based on such understanding, all or part of the flow of the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and used by a processor to implement the steps of the above embodiments of the method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (11)

1. A method for training a feature extraction model of a song, the method comprising:
acquiring a first target embedded vector of an audio segment to be trained in a training set, wherein the first target embedded vector is a frequency domain characteristic vector of the audio segment to be trained;
acquiring a first lyric vector of an audio clip to be trained in the training set;
inputting the first target embedding vector and the first lyric vector corresponding to the audio clip to be trained into an encoding layer of an initial neural network to obtain a fusion vector corresponding to the audio clip to be trained;
inputting the fusion vector of the audio clip to be trained into a decoding layer of the initial neural network to obtain a corresponding second target embedding vector and a second lyric vector;
calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector and the first lyric vector;
updating the model parameters of the initial neural network model according to the target loss value to obtain an updated neural network model;
and if the updated neural network model meets the convergence condition, outputting the coding layer in the updated neural network model as a feature extraction model.
2. The training method according to claim 1, wherein if the updated neural network model does not meet the convergence condition, another audio segment to be trained is obtained from the training set, and the method returns to the step of obtaining the first target embedded vector of the audio segment to be trained in the training set until the updated neural network model meets the convergence condition.
3. The training method of claim 1, wherein the calculating a target loss value based on the second target embedding vector, the first target embedding vector, the second lyric vector, and the first lyric vector comprises:
respectively calculating a first loss between the second target embedding vector and the first target embedding vector and a second loss between the second lyric vector and the first lyric vector according to a preset loss function;
calculating the target loss value according to the first loss and the second loss.
4. The training method according to claim 1, wherein the obtaining a first target embedding vector of the audio segments to be trained in the training set comprises:
performing characteristic conversion between a time domain and a frequency domain on the audio segments to be trained in the training set to obtain frequency domain characteristics of the audio segments to be trained;
and inputting the frequency domain characteristics of the audio segment to be trained into a KDTN neural network to obtain a first target embedding vector of the audio segment to be trained.
5. The training method of claim 1, wherein the obtaining a first lyrics vector of an audio piece to be trained in the training set comprises:
converting the audio features of the audio clips to be trained in the training set into text features to extract the lyrics of the audio clips to be trained;
and inputting the lyrics of the audio clip to be trained into a text embedding model to obtain a first lyrics vector of the audio clip to be trained.
6. A song identification method, the method comprising:
acquiring a target audio clip;
extracting a lyric vector of the target audio fragment;
extracting a target embedded vector of the target audio fragment;
inputting the lyric vector and the target embedding vector of the target audio fragment into the feature extraction model of any one of claims 1 to 5 to obtain a fusion vector of the target audio fragment;
and identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and a plurality of fusion vectors respectively corresponding to a plurality of audio clips of each song in a database.
7. The song identification method of claim 6, wherein before identifying the song that is most similar to the target audio clip according to the fusion vector of the target audio clip and the plurality of fusion vectors respectively corresponding to the plurality of audio clips of each song in the database, the method further comprises:
acquiring a third lyric vector and a third target embedding vector corresponding to each audio fragment in a plurality of audio fragments of each song in a database;
and inputting the third lyric vector and the third target embedding vector into the feature extraction model to obtain a plurality of fusion vectors respectively corresponding to a plurality of audio segments of each song in the database.
8. The song identification method of claim 6, wherein identifying the song most similar to the target audio clip according to the fusion vector of the target audio clip and the fusion vectors corresponding to the audio clips of each song in the database comprises:
respectively calculating a plurality of similarity scores between the fusion vector of the target audio clip and the plurality of fusion vectors corresponding to the plurality of audio clips of each song in the database;
and identifying the song most similar to the target audio clip according to the similarity scores and a preset judgment threshold value.
9. The song identification method of claim 8, wherein the identifying a song most similar to the target audio clip according to the similarity scores and a preset judgment threshold comprises:
inputting the plurality of similarity scores between the target audio clip and a plurality of audio clips of each song in the database into a deep memory neural network to obtain a target similarity score between the target audio clip and each song;
and identifying the song most similar to the target audio clip according to the target similarity score and the preset judgment threshold value.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, is adapted to carry out a method of training a feature extraction model for a song according to any one of claims 1 to 5, or a method of song recognition according to any one of claims 6 to 9.
11. An electronic device, comprising: memory, processor, power module, sensor module, input/output module, characterized in that the processor, when executing a computer program stored on the memory, is adapted to implement a method of training a feature extraction model of a song according to any one of claims 1 to 5, or a method of song recognition according to any one of claims 6 to 9.