CN115083397A - Training method of lyric acoustic model, lyric recognition method, equipment and product - Google Patents

Training method of lyric acoustic model, lyric recognition method, equipment and product

Info

Publication number
CN115083397A
Authority
CN
China
Prior art keywords
song
lyric
accompaniment
frame
spectrum information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210605845.8A
Other languages
Chinese (zh)
Inventor
王武城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202210605845.8A priority Critical patent/CN115083397A/en
Publication of CN115083397A publication Critical patent/CN115083397A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units

Abstract

The application relates to the technical field of audio processing and provides a training method for a lyric acoustic model, a lyric recognition method, a computer device, and a computer program product, which can effectively improve the recognition performance of the lyric acoustic model. The training method of the lyric acoustic model comprises the following steps: acquiring a song sample and a phoneme label of the lyrics in the song sample; acquiring song spectrum information of the song sample and accompaniment spectrum information of the song sample; inputting the song spectrum information and the accompaniment spectrum information into a lyric acoustic model to be trained, which extracts song features from the song spectrum information and accompaniment features from the accompaniment spectrum information and outputs predicted lyric phonemes according to the song features and the accompaniment features; determining a model loss from the predicted lyric phonemes and the phoneme label; and adjusting the model parameters of the lyric acoustic model to be trained according to the model loss until a training end condition is met, thereby obtaining a trained lyric acoustic model.

Description

Training method of lyric acoustic model, lyric recognition method, equipment and product
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method for training a lyric acoustic model, a lyric recognition method, a computer device, and a computer program product.
Background
With the development of computer technology, lyric text can be quickly extracted from song audio through singing voice recognition. In the traditional approach, the vocals are extracted by eliminating the accompaniment from the song audio, and an acoustic model is trained on the extracted vocals; in subsequent singing voice recognition, the acoustic model is used to obtain the lyric phonemes in the song audio, from which the corresponding lyric text is determined.
However, part of the vocals is easily removed along with the accompaniment, which affects the reliability of the vocals used for training the acoustic model and limits the performance of the acoustic model.
Disclosure of Invention
In view of the above, it is desirable to provide a training method for a lyric acoustic model, a lyric recognition method, a computer device and a computer program product.
In a first aspect, the present application provides a method for training a lyric acoustic model. The method comprises the following steps:
acquiring a song sample and a phoneme label of lyrics in the song sample;
acquiring song spectrum information of the song sample and accompaniment spectrum information of the song sample;
inputting the song spectrum information and the accompaniment spectrum information into a lyric acoustic model to be trained, extracting song characteristics of the song spectrum information and accompaniment characteristics of the accompaniment spectrum information by the lyric acoustic model to be trained, and outputting predicted lyric phonemes according to the song characteristics and the accompaniment characteristics;
determining a model loss according to the predicted lyric phoneme and the phoneme label;
and adjusting the model parameters of the acoustic model of the lyrics to be trained according to the model loss until the training end condition is met, thereby obtaining the trained acoustic model of the lyrics.
In a second aspect, the application further provides a lyric recognition method. The method comprises the following steps:
acquiring song spectrum information of a song with lyrics to be identified and accompaniment spectrum information of the song;
inputting the song spectrum information of the song and the accompaniment spectrum information of the song into a lyric acoustic model to obtain a predicted lyric phoneme of the song output by the lyric acoustic model; the lyric acoustic model is obtained by training according to the training method of the lyric acoustic model;
and inputting the predicted lyric phoneme of the song into a language model to obtain the lyric of the song output by the language model.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring a song sample and a phoneme label of lyrics in the song sample;
acquiring song spectrum information of the song sample and accompaniment spectrum information of the song sample;
inputting the song spectrum information and the accompaniment spectrum information into a lyric acoustic model to be trained, extracting song characteristics of the song spectrum information and accompaniment characteristics of the accompaniment spectrum information by the lyric acoustic model to be trained, and outputting predicted lyric phonemes according to the song characteristics and the accompaniment characteristics;
determining a model loss according to the predicted lyric phoneme and the phoneme label;
and adjusting the model parameters of the acoustic model of the lyrics to be trained according to the model loss until the training end condition is met, thereby obtaining the trained acoustic model of the lyrics.
In a fourth aspect, the present application further provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
acquiring song spectrum information of a song with lyrics to be identified and accompaniment spectrum information of the song;
inputting the song spectrum information of the song and the accompaniment spectrum information of the song into a lyric acoustic model to obtain a predicted lyric phoneme of the song output by the lyric acoustic model; the lyric acoustic model is obtained by training according to the training method of the lyric acoustic model;
and inputting the predicted lyric phoneme of the song into a language model to obtain the lyric of the song output by the language model.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring a song sample and a phoneme label of lyrics in the song sample;
acquiring song spectrum information of the song sample and accompaniment spectrum information of the song sample;
inputting the song spectrum information and the accompaniment spectrum information into a lyric acoustic model to be trained, extracting song characteristics of the song spectrum information and accompaniment characteristics of the accompaniment spectrum information by the lyric acoustic model to be trained, and outputting predicted lyric phonemes according to the song characteristics and the accompaniment characteristics;
determining a model loss according to the predicted lyric phoneme and the phoneme label;
and adjusting the model parameters of the acoustic model of the lyrics to be trained according to the model loss until the training end condition is met, thereby obtaining the trained acoustic model of the lyrics.
In a sixth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
acquiring song spectrum information of a song with lyrics to be identified and accompaniment spectrum information of the song;
inputting the song spectrum information of the song and the accompaniment spectrum information of the song into a lyric acoustic model to obtain a predicted lyric phoneme of the song output by the lyric acoustic model; the lyric acoustic model is obtained by training according to the training method of the lyric acoustic model;
and inputting the predicted lyric phoneme of the song into a language model to obtain the lyric of the song output by the language model.
With the training method of the lyric acoustic model, the lyric recognition method, the computer device, and the computer program product described above, a song sample and a phoneme label of the lyrics in the song sample are obtained, together with the song spectrum information and the accompaniment spectrum information of the song sample; the song spectrum information and the accompaniment spectrum information are input into a lyric acoustic model to be trained, which extracts song features from the song spectrum information and accompaniment features from the accompaniment spectrum information and outputs predicted lyric phonemes according to these features; a model loss is determined from the predicted lyric phonemes and the phoneme label, and the model parameters of the lyric acoustic model to be trained are adjusted according to the model loss until a training end condition is met, yielding a trained lyric acoustic model. By acquiring the song features and the accompaniment features of the sample song separately, the lyric acoustic model can fully learn the song features while also capturing the accompaniment features of the accompaniment, so that the interference caused by the accompaniment during lyric phoneme prediction is effectively eliminated and the recognition performance of the lyric acoustic model is improved.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for training a lyric acoustic model in one embodiment;
FIG. 2 is a flow diagram illustrating the steps of predicting a lyric phoneme, in one embodiment;
FIG. 3 is a flow diagram illustrating an example of outputting a probability distribution of phonemes in one embodiment;
FIG. 4 is a diagram of an exemplary implementation of a lyric recognition method;
FIG. 5 is a flow diagram illustrating a method for lyric recognition, according to one embodiment;
FIG. 6 is a schematic diagram of a process for obtaining a singing voice recognition model in one embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and do not limit it.
With the development of computer technology, lyric text can be quickly extracted from song audio through singing voice recognition. In the traditional approach, the vocals are extracted by eliminating the accompaniment from the song, and an acoustic model is trained on the extracted vocals; in subsequent singing voice recognition, the acoustic model is used to obtain the lyric phonemes in the song, from which the corresponding lyric text is determined.
However, although this approach removes the interference of the accompaniment on the model, part of the vocals is easily removed along with the accompaniment, and it is difficult to extract the complete vocals from the song; for example, vocal fry, breaths, and long vowel tails may be mistakenly eliminated. This affects the reliability of the vocals used for training the acoustic model, so an acoustic model trained on the separated vocals alone has limited robustness and performance. To address at least this technical problem, the present application provides a training method for a lyric acoustic model.
In an embodiment, as shown in fig. 1, a method for training a lyric acoustic model is provided. This embodiment is illustrated by applying the method to a terminal; it is to be understood that the method may also be applied to a server, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
s101, obtaining a song sample and a phoneme label of lyrics in the song sample.
A phoneme is the smallest unit of speech; in practice, phonemes can be identified from the articulatory actions within a syllable, with one articulatory action constituting one phoneme. As an example, the phoneme label may identify the phonemes of the lyrics in the song sample, and the pitch of a phoneme may be the vocal pitch of the lyrics in the song sample.
In practical application, a song sample for training an acoustic model of the lyrics may be obtained, and a phoneme label of the lyrics in the song sample may be obtained.
And S102, acquiring song spectrum information of the song sample and accompaniment spectrum information of the song sample.
The frequency spectrum information may be information reflecting frequency spectrum characteristics, and for convenience of distinguishing, the frequency spectrum information of the song sample is called song frequency spectrum information, and the frequency spectrum information of the accompaniment of the song sample is called accompaniment frequency spectrum information.
After the song sample is acquired, song spectrum information of the song sample and accompaniment spectrum information of the song sample can be acquired. Specifically, for an acquired song sample, the accompaniment may be extracted from the song sample through a preset separation model, for example a Spleeter separation model or another separation model, to obtain the accompaniment audio; the corresponding Mel-Frequency Cepstral Coefficients (MFCCs) may then be computed from the audio of the song sample and the accompaniment audio respectively, as the song spectrum information and the accompaniment spectrum information of the song sample.
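As an illustrative, non-authoritative sketch of this step, the separation and feature extraction could be scripted as follows; the Spleeter Python API, librosa, the 2-stem model name, the sampling rate, and the output-path layout are assumptions rather than requirements of the method described here.

```python
# Hypothetical sketch of S102: separate the accompaniment, then compute MFCCs
# for the whole song and for the accompaniment stem (paths/parameters illustrative).
from pathlib import Path
import librosa
from spleeter.separator import Separator  # assumes the Spleeter package is installed

def song_and_accompaniment_mfcc(song_path, out_dir="stems", sr=16000, n_mfcc=40):
    # Split the song into vocals + accompaniment with a 2-stem Spleeter model.
    Separator("spleeter:2stems").separate_to_file(song_path, out_dir)

    # MFCCs of the original song audio (song spectrum information).
    song, _ = librosa.load(song_path, sr=sr)
    song_mfcc = librosa.feature.mfcc(y=song, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)

    # MFCCs of the separated accompaniment (accompaniment spectrum information).
    accomp_file = Path(out_dir) / Path(song_path).stem / "accompaniment.wav"
    accomp, _ = librosa.load(str(accomp_file), sr=sr)
    accomp_mfcc = librosa.feature.mfcc(y=accomp, sr=sr, n_mfcc=n_mfcc)

    return song_mfcc, accomp_mfcc
```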
And S103, inputting the song spectrum information and the accompaniment spectrum information into a lyric acoustic model to be trained, extracting song characteristics of the song spectrum information and accompaniment characteristics of the accompaniment spectrum information by the lyric acoustic model to be trained, and outputting predicted lyric phonemes according to the song characteristics and the accompaniment characteristics.
After the song spectrum information and the accompaniment spectrum information are obtained, they can be input into the lyric acoustic model to be trained. Specifically, the song spectrum information may contain rich semantic information, such as the lyric information of the song, singer information, and accompaniment information. In this embodiment, on the one hand, the lyric acoustic model to be trained can learn these types of semantic information from the song spectrum information to obtain the song features; on the other hand, it can specifically learn the characteristics of the accompaniment spectrum information to obtain the accompaniment features, so that the song features and the accompaniment features are learned separately and do not interfere with each other.
By learning the accompaniment features of the accompaniment spectrum information, the lyric acoustic model learns the interference information present in the lyric phoneme prediction process. The accompaniment interference contained in the song features can then be eliminated in a targeted way according to the accompaniment features, without affecting the song features related to the vocals, so that the vocal information is fully retained; predicting the lyric phonemes on this basis improves the accuracy of the finally output lyric phonemes.
In some acoustic model training schemes, in order to keep the complete vocal information, the accompaniment and the vocals in the song are not separated; the complete song is used directly to train a deep neural network, and the network is expected to strip away information irrelevant to the target (namely, the lyric phonemes) through training, eliminating the influence of the accompaniment on lyric phoneme recognition. Although a deep neural network has a certain noise immunity and its label-driven learning can strip away target-irrelevant information, the more irrelevant information there is, the harder the model is to train, and leaving the accompaniment entirely unprocessed noticeably increases the training difficulty.
In contrast, the lyric acoustic model training method of this embodiment additionally performs branched learning on the accompaniment while extracting song features from the song spectrum information containing the complete semantic information of the song. The model thus learns both the song features, which contain the various semantic information of the song (including its accompaniment information), and the accompaniment features corresponding to the interfering accompaniment, providing a basis for subsequently eliminating accompaniment interference unrelated to the vocals. Moreover, since the accompaniments of different types of songs differ markedly, specifically learning the features of the interference factors in the lyric phoneme prediction process (namely, the accompaniment features) can significantly reduce the difficulty of network training.
And S104, determining model loss according to the predicted lyric phoneme and phoneme label.
And S105, adjusting the model parameters of the acoustic model of the lyrics to be trained according to the model loss until the training end condition is met, and obtaining the trained acoustic model of the lyrics.
Specifically, after the predicted lyric phonemes output by the lyric acoustic model to be trained are obtained, the model loss may be determined from the predicted lyric phonemes and the phoneme label, and the model parameters of the lyric acoustic model to be trained may be adjusted according to the model loss; for example, the model loss may be determined from the difference between the predicted lyric phonemes and the phoneme label, and the model parameters may be adjusted by a back-propagation algorithm. This is iterated multiple times until a training end condition is met, yielding the trained lyric acoustic model.
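A compressed sketch of this loss-and-update loop, assuming a PyTorch model that maps the two spectrum inputs to per-frame phoneme logits; the model, data loader, and hyperparameters are placeholders, and a fixed epoch budget stands in for the training end condition.

```python
import torch
import torch.nn.functional as F

def train_lyric_acoustic_model(model, loader, num_epochs=10, lr=1e-4):
    """model(song_mfcc, accomp_mfcc) is assumed to return per-frame phoneme logits
    of shape (batch, frames, n_phonemes); loader yields
    (song_mfcc, accomp_mfcc, frame_phoneme_labels) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):                     # stand-in for the training end condition
        for song_mfcc, accomp_mfcc, labels in loader:
            logits = model(song_mfcc, accomp_mfcc)
            # Model loss: per-frame difference between predicted phonemes and labels.
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()                         # back-propagate the model loss
            optimizer.step()                        # adjust the model parameters
    return model
```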
In this embodiment, a song sample and a phoneme label of the lyrics in the song sample can be obtained, along with the song spectrum information and the accompaniment spectrum information of the song sample; the song spectrum information and the accompaniment spectrum information are input into the lyric acoustic model to be trained, which obtains the song features of the song spectrum information and the accompaniment features of the accompaniment spectrum information and outputs predicted lyric phonemes according to the song features and the accompaniment features; the model loss is determined from the predicted lyric phonemes and the phoneme label, and the model parameters of the lyric acoustic model to be trained are adjusted according to the model loss until a training end condition is met, yielding the trained lyric acoustic model. By acquiring the song features and the accompaniment features of the sample song separately, the lyric acoustic model can fully learn the song features while also acquiring the accompaniment features of the accompaniment, so that the interference of the accompaniment on lyric phoneme prediction is effectively eliminated and the recognition performance of the lyric acoustic model is improved.
In one embodiment, the song spectrum information may include song frame spectrum information for each song frame of the song sample. For example, the audio of a song sample may be pre-processed by pre-emphasis, framing, windowing and the like to improve the speech signal-to-noise ratio; the time-domain audio data of each song frame obtained after the pre-processing is then converted into frequency-domain data by a Short-Time Fourier Transform (STFT), the frequency-domain data is filtered by a mel filter bank to obtain a mel spectrum reflecting the frequency perception of human hearing, and a logarithm of the mel spectrum is taken followed by a Discrete Cosine Transform (DCT) to obtain the song frame spectrum information of each frame.
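The pre-emphasis / framing / STFT / mel filtering / log-DCT chain described here could look roughly as follows; this is a minimal numpy-and-librosa sketch with illustrative parameter values, not the exact configuration used in the application.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def song_frame_spectrum_info(audio, sr=16000, n_fft=512, hop=160, n_mels=40, n_mfcc=13):
    # Pre-emphasis to improve the speech signal-to-noise ratio.
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # Framing + windowing + short-time Fourier transform (time domain -> frequency domain).
    power_spec = np.abs(librosa.stft(emphasized, n_fft=n_fft, hop_length=hop,
                                     window="hann")) ** 2
    # Mel filter bank reflecting the frequency resolution of human hearing.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_fb @ power_spec
    # Logarithm followed by a discrete cosine transform gives per-frame coefficients.
    mfcc = dct(np.log(mel_spec + 1e-10), axis=0, norm="ortho")[:n_mfcc]
    return mfcc.T   # one row of song frame spectrum information per song frame
```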
The lyric acoustic model to be trained may include a first network for extracting the song features, and the first network may include a plurality of hidden layers. Illustratively, the first network may be a time-delay neural network (TDNN).
In S103, extracting the song features of the song spectrum information by the lyric acoustic model to be trained may include the following steps:
inputting the song frame frequency spectrum information of each song frame into a first network, and fusing the input features in the input feature sequence by the first network according to the input feature sequence and the time delay value provided by the previous hidden layer to obtain the input feature sequence of the current hidden layer; and obtaining the song characteristics of the song frame spectrum information of each song frame according to the output result of the last hidden layer.
The input characteristic sequence of the first hidden layer is song frame frequency spectrum information of each song frame.
In practical application, after the song frame spectrum information of each song frame is input to the first network, the first network can fuse the input features in the input feature sequence according to the input features of the previous hidden layer and the time delay value of the previous hidden layer.
Specifically, for each input feature in the input feature sequence provided by the previous hidden layer, a number of adjacent input features may be determined according to the time delay value, and feature fusion may be performed on the input feature and those adjacent input features; for example, the input feature may be fused with the n input features before it and the m input features after it, giving a fused feature associated with the input feature and its neighbours. Processing each input feature in the input feature sequence provided by the previous hidden layer in turn yields the fused feature corresponding to each input feature, and the input feature sequence of the current hidden layer can then be obtained from these fused features.
And for the first hidden layer in the first network, the song spectrum information of each song frame can be used as an input feature sequence, so that in the first hidden layer, for each song frame spectrum information, feature fusion can be performed on the song frame spectrum information and the song frame spectrum information of a plurality of song frames before and after the song frame spectrum information according to the time delay value.
And processing is sequentially performed on each hidden layer of the first network, and then the song characteristics of the song frame frequency spectrum information of each song frame can be obtained according to the output result of the last hidden layer.
In this embodiment, the first network can fuse the input features in the input feature sequence according to the input feature sequence and the time delay value provided by the previous hidden layer, associating the input features at several adjacent moments within each hidden layer, so that the correlations within the vocal information of the song are effectively captured and more reliable and accurate song features are extracted.
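One common way to realise this frame-fusion behaviour is a stack of dilated 1-D convolutions, where the dilation plays the role of the time delay value; the PyTorch sketch below is an assumption about how such a first network could be built, with illustrative layer sizes.

```python
import torch.nn as nn

class TDNNLayer(nn.Module):
    """Fuses each input feature with its neighbours spaced `delay` frames apart."""
    def __init__(self, in_dim, out_dim, context=1, delay=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=2 * context + 1,
                              dilation=delay, padding=context * delay)
        self.act = nn.ReLU()

    def forward(self, x):                      # x: (batch, frames, in_dim)
        return self.act(self.conv(x.transpose(1, 2))).transpose(1, 2)

class SongFeatureNet(nn.Module):
    """First network: each hidden layer fuses and forwards the feature sequence."""
    def __init__(self, in_dim=40, hidden=256, delays=(1, 2, 3)):
        super().__init__()
        dims = [in_dim] + [hidden] * len(delays)
        self.layers = nn.ModuleList(
            [TDNNLayer(dims[i], dims[i + 1], delay=d) for i, d in enumerate(delays)])

    def forward(self, song_frames):            # song frame spectrum information
        for layer in self.layers:              # output of one layer feeds the next
            song_frames = layer(song_frames)
        return song_frames                     # per-frame song features (last hidden layer)
```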
In an embodiment, the accompaniment spectrum information may include the accompaniment frame spectrum information of each accompaniment frame of the song sample. For example, after the accompaniment audio is extracted from the audio of the song sample, pre-emphasis, framing, windowing and other pre-processing may be applied to the accompaniment audio; the time-domain audio data of each accompaniment frame obtained after the pre-processing is then converted into frequency-domain data by a short-time Fourier transform, the frequency-domain data is filtered by a mel filter bank to obtain a mel spectrum, and a logarithm of the mel spectrum is taken followed by a discrete cosine transform to obtain the accompaniment frame spectrum information of each accompaniment frame.
The lyric acoustic model to be trained may include a second network for extracting the accompaniment features, and the second network may include a plurality of residual blocks. Illustratively, the second network may be a residual neural network (ResNet), that is, a convolutional neural network with residual blocks added. Specifically, the second network may comprise a plurality of residual blocks, where each residual block consists of a direct-mapping part and a residual part, and the residual part contains at least one convolutional layer. For a residual block with input x_l, the input x_l is fed into both the direct-mapping part and the residual part: the residual part performs convolution operations on x_l to obtain F(x_l), while the direct-mapping part outputs x_l unchanged, and the convolution result F(x_l) of the residual part is added to the output x_l of the direct-mapping part to yield F(x_l) + x_l. This residual-block structure introduces the input of an earlier layer into the current layer, which avoids the problem of vanishing or exploding gradients as the network deepens and gives it a stronger learning capability than an ordinary convolutional neural network.
In S103, extracting the accompaniment features of the accompaniment spectrum information by the lyric acoustic model to be trained may include:
inputting the accompaniment frequency spectrum information of each accompaniment frame into a second network, and sequentially carrying out feature transmission from a first residual block to a last residual block on the accompaniment frame frequency spectrum information of each accompaniment frame by the second network to obtain the accompaniment features of the accompaniment frame frequency spectrum information of each accompaniment frame.
In this step, after obtaining the accompaniment spectrum information of each accompaniment frame, the accompaniment spectrum information of each accompaniment frame may be input to the second network, and feature transmission from the first residual block to the last residual block may be performed on the accompaniment frame spectrum information of each accompaniment frame sequentially by the plurality of residual blocks in the second network, so as to obtain the accompaniment features of the accompaniment frame spectrum information of each accompaniment frame according to the output of the second network.
In this embodiment, the accompaniment frame spectrum information is processed by the plurality of residual blocks in the second network, so that the deep neural network can effectively capture the detail information in the accompaniment spectrum information while avoiding vanishing or exploding gradients, providing a basis for the model to identify the interference information within the song features more accurately.
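A minimal PyTorch sketch of a residual block and a small second network stacked from such blocks, following the direct-mapping-plus-residual description above; channel counts and depth are illustrative assumptions, and the final projection to per-frame accompaniment vectors is omitted.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Output = F(x_l) + x_l: residual branch plus direct mapping."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(                     # residual part F(.)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.residual(x) + x)              # direct mapping eases gradient flow

class AccompanimentNet(nn.Module):
    """Second network: accompaniment-frame spectra flow through the residual blocks in turn."""
    def __init__(self, n_blocks=4, channels=32):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])

    def forward(self, accomp_spec):                        # (batch, 1, frames, n_coeffs)
        return self.blocks(self.stem(accomp_spec))
```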
In one embodiment, the song spectrum information includes the song frame spectrum information of each song frame of the song sample, and the accompaniment spectrum information includes the accompaniment frame spectrum information of each accompaniment frame of the song sample. As shown in fig. 2, the step in S103 of outputting the predicted lyric phonemes according to the song features and the accompaniment features may include the following steps:
s201, acquiring song characteristics of the song frame spectrum information of each song frame and accompaniment characteristics of the accompaniment frame spectrum information of each accompaniment frame.
After the song frame spectrum information and the accompaniment frame spectrum information are respectively input to the first network and the second network, the song characteristics output by the first network for each song frame spectrum information and the accompaniment characteristics output by the second network for each accompaniment frame spectrum information can be acquired.
And S202, splicing the song characteristics of each song frame and the accompaniment characteristics of each corresponding accompaniment frame, and inputting the spliced song characteristics and the spliced accompaniment characteristics into a classifier of the acoustic model of the lyrics to be trained.
In practical application, the accompaniment is extracted from the song sample, so the song frames and the accompaniment frames have a correspondence. The song feature of each song frame output by the first network can be fused with the accompaniment feature of the corresponding accompaniment frame by splicing them, and each spliced song feature and accompaniment feature is input into the classifier of the lyric acoustic model to be trained.
In some examples, the fusion of the song feature and the accompaniment feature may be performed by summing (add), concatenating (concat), or taking a difference. For example, the song feature and its corresponding accompaniment feature may be summed; in some related feature-fusion schemes, the two feature vectors may also be combined into a complex vector that serves as the fused feature (for example, for input features x and y, z = x + iy, where i is the imaginary unit); alternatively, the difference between the song feature and its corresponding accompaniment feature may be taken and the differenced feature input to the classifier.
S203, obtaining the predicted lyric phoneme of each song frame according to the output result of the classifier.
After the spliced song characteristics and the accompaniment characteristics are input into a classifier of a lyric acoustic model to be trained, the classifier can perform multi-classification processing on each spliced song characteristic and each accompaniment characteristic, and then the predicted lyric phoneme of each song frame can be obtained according to the output result of the classifier. Illustratively, the classifier in the acoustic model of lyrics to be trained may be a Softmax layer, and the predicted phoneme of the lyrics for each frame may be a phoneme probability distribution, or alternatively, may be a specific phoneme.
In this embodiment, the song features of the song frame spectrum information of each song frame and the accompaniment features of the accompaniment frame spectrum information of each accompaniment frame can be acquired, the song feature of each song frame is spliced with the accompaniment feature of the corresponding accompaniment frame, and each spliced song feature and accompaniment feature is input into the classifier of the lyric acoustic model to be trained, so that the predicted lyric phoneme of each song frame can be obtained from the output result of the classifier and the lyric phonemes of the song can be accurately identified at the frame level.
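Tying the two branches together, a sketch of the splicing-and-classification head might look as follows, assuming both networks have already been reduced to per-frame feature vectors of known size; applying a softmax to the returned logits gives the per-frame phoneme probability distribution.

```python
import torch
import torch.nn as nn

class LyricAcousticModel(nn.Module):
    """Splices per-frame song features with the matching accompaniment features
    and classifies each spliced vector into one of n_phonemes."""
    def __init__(self, song_net, accomp_net, song_dim, accomp_dim, n_phonemes):
        super().__init__()
        self.song_net = song_net
        self.accomp_net = accomp_net
        self.classifier = nn.Linear(song_dim + accomp_dim, n_phonemes)

    def forward(self, song_frames, accomp_frames):
        song_feat = self.song_net(song_frames)          # (batch, frames, song_dim)
        accomp_feat = self.accomp_net(accomp_frames)    # (batch, frames, accomp_dim)
        spliced = torch.cat([song_feat, accomp_feat], dim=-1)   # concatenation fusion
        return self.classifier(spliced)                 # per-frame phoneme logits
```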
In one embodiment, deriving the predicted lyric phoneme for each song frame according to the output of the classifier may comprise the steps of: acquiring phoneme probability distribution of each song frame output by the classifier; and determining the predicted lyric phoneme of each song frame according to the phoneme probability distribution of each song frame.
As an example, the phoneme probability distribution may indicate, for each phoneme provided in advance, the probability that the lyric phoneme of a song frame belongs to that phoneme. In a specific implementation, after each spliced song feature and accompaniment feature is input into the classifier, the classifier can identify the part of the song feature that acts as interference, namely the accompaniment component contained in the song feature. Specifically, during lyric phoneme recognition the accompaniment in a song often has a great influence on recognition accuracy; in this embodiment the song features extracted by the lyric acoustic model contain the various semantic information of the song and accordingly also cover the components associated with the accompaniment. After obtaining the spliced song feature and accompaniment feature, the classifier can use the input accompaniment feature as a reference to determine the accompaniment-related components within the song feature, thereby recognizing the interference information in the song feature. Once the interference information is identified, the classifier can make its prediction from the remaining information in the song feature to obtain the phoneme probability distribution of the song frame. For example, as shown in fig. 3, n phonemes may be provided in advance, and the classifier may determine, from the spliced one-dimensional feature (i.e., the song feature and the accompaniment feature), the posterior probabilities p1, p2, ..., pn that the phoneme of the song frame belongs to each phoneme, giving the phoneme probability distribution.
Furthermore, the lyric phoneme of each song frame may be predicted according to the phoneme probability distribution of each song frame, and specifically, for example, after obtaining the phoneme probability distribution, the phoneme corresponding to the maximum probability may be determined as the lyric phoneme of the song frame.
In this embodiment, the phoneme probability distribution of each song frame output by the classifier can be obtained, so that the predicted lyric phoneme of each song frame can be determined from that distribution; by comparing and screening the probabilities that each song frame belongs to the different phonemes, reliable predicted lyric phonemes are obtained.
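The selection of the maximum-probability phoneme per frame is then straightforward; the phoneme table below is a hypothetical lookup from class index to phoneme symbol.

```python
import torch

def frames_to_phonemes(logits, phoneme_table):
    """logits: (frames, n_phonemes) classifier output for one song sample."""
    probs = logits.softmax(dim=-1)        # phoneme probability distribution per song frame
    best = probs.argmax(dim=-1)           # index of the phoneme with the maximum probability
    return [phoneme_table[i] for i in best.tolist()]
```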
In one embodiment, determining the model loss from the predicted lyric phonemes and the phoneme label in S104 may include: acquiring the phoneme label of the lyrics of each song frame from the phoneme label of the lyrics in the song sample; and determining the model loss from the predicted lyric phoneme of each song frame and the phoneme label of the lyrics of that song frame.
In practical applications, the phoneme label of the lyrics in the song sample, besides indicating the phoneme of the lyrics, may also indicate the time when the phoneme of the lyrics appears in the song sample, i.e. the song frame corresponding to the phoneme, for example, the phoneme label of the lyrics "o" may indicate that the phoneme of the song frame at time t1 is ā. In a specific implementation, the lyric pronunciation of the song sample is composed of a plurality of phonemes, and when the phoneme label is obtained, the plurality of phonemes of the song sample can be further time-divided to obtain the phoneme label corresponding to each song frame.
After the predicted lyric phonemes output by the lyric acoustic model are obtained, frame-level phoneme labels can be derived from the phoneme label of the lyrics in the song sample, that is, the phoneme label of the lyrics of each song frame can be obtained, and the model loss can then be determined from the predicted lyric phoneme of each song frame and the phoneme label of the lyrics of that song frame. Specifically, for example, the difference between the predicted lyric phoneme of each song frame and the phoneme label of the corresponding song frame may be computed, and the model loss obtained by combining the differences over the song frames. In other words, when the phoneme label indicates the lyric phoneme at each moment of the song sample (i.e., the moment corresponding to a song frame), the difference between the actual lyric phoneme of the song frame at that moment and the predicted lyric phoneme of the same song frame can be determined, and the model loss obtained by combining the differences of the plurality of song frames.
In this embodiment, the phoneme label of the lyrics of each song frame can be obtained from the phoneme label of the lyrics in the song sample, and the model loss determined from the predicted lyric phoneme of each song frame and the phoneme label of the lyrics of that song frame, so that a detailed frame-level comparison of the differences can be performed and the recognition performance of the lyric acoustic model effectively improved.
The present application further provides a lyric recognition method, which may be applied to an application environment as shown in fig. 4, where the application environment may include a terminal and a server, and the terminal may communicate with the server through a network.
The server can be realized by an independent server or a server cluster consisting of a plurality of servers, the server can be provided with a data storage system, the data storage system can store songs of lyrics to be recognized, and the data storage system can be integrated on the server and also can be placed on the cloud or other network servers. The terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
In one embodiment, as shown in fig. 5, a lyric recognition method is provided, which is exemplified by the application of the method to the server in fig. 4, and includes the following steps:
s501, acquiring song spectrum information of a song with lyrics to be identified and accompaniment spectrum information of the song.
In practical application, the terminal can send the song whose lyrics are to be identified to the server; after receiving the song, the server can acquire the song spectrum information of the song and the accompaniment spectrum information of the song. Specifically, after the song is acquired, its accompaniment can be extracted through a preset separation model, and the spectrum information of the song and of its accompaniment can be acquired respectively.
Illustratively, the song whose lyrics are to be identified may include at least one of: a song in a song database that is not yet associated with lyrics, a song hummed by a user that is to be recognized, a cover song, and audio to be audited.
Specifically, for a song in the song database that has no associated lyrics, the audio of the song can be sent to the server, which can identify the lyrics efficiently and quickly and reduce the cost of manual transcription. For a song hummed by a user, predicting the lyrics in the humming and matching them against the song database can effectively improve the accuracy and speed of recognizing the hummed song. For a cover song, identifying the lyric content and the positions at which the lyrics occur can help determine the similarity between the cover and the reference song, in addition to recognition based on the melody. The audio to be audited may be audio collected for music content monitoring, for example audio uploaded by a user through a terminal or the audio track of a video; performing lyric recognition on it can help determine whether prohibited sensitive words exist in the audio and where they occur, enabling a check of the audio's compliance.
S502, inputting the song spectrum information of the song and the accompaniment spectrum information of the song into a lyric acoustic model to obtain a predicted lyric phoneme of the song output by the lyric acoustic model; the lyric acoustic model is obtained by training according to the training method of any lyric acoustic model.
After the song spectrum information of the song and the accompaniment spectrum information of the song are obtained, the song spectrum information of the song and the accompaniment spectrum information of the song can be input into the lyric acoustic model, and the predicted lyric phoneme of the song output by the lyric acoustic model is obtained.
S503, inputting the predicted lyric phoneme of the song into the language model to obtain the lyric of the song output by the language model.
As an example, the language model may predict a corresponding text sequence from an input phoneme sequence, where the phoneme sequence may include at least one predicted lyric phoneme.
After the predicted lyric phonemes of the song are obtained, they can be input into the language model to obtain the lyrics of the song output by the language model. In practical applications, the final lyric recognition result can be obtained by a Viterbi decoding algorithm.
In this embodiment, song spectrum information of a song with lyrics to be identified and accompaniment spectrum information of the song may be obtained, and the song spectrum information of the song and the accompaniment spectrum information of the song are input to a lyrics acoustic model to obtain predicted lyrics phonemes of the song output by the lyrics acoustic model, wherein the lyrics acoustic model is obtained by training according to any one of the above training methods of the lyrics acoustic model, and further the predicted lyrics phonemes may be input to a language model to obtain lyrics of the song output by the language model. According to the embodiment, the influence of accompaniment can be effectively eliminated through the lyric acoustic model, the lyric phonemes in the song can be accurately identified, and then the lyrics of the song can be efficiently and accurately identified.
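The decoding step can be sketched as a standard Viterbi search over the per-frame phoneme posteriors, with transition probabilities supplied by the language model; the inputs here are illustrative log-probability matrices, not the application's actual decoder.

```python
import numpy as np

def viterbi_decode(log_post, log_trans, log_prior):
    """log_post: (frames, n_phonemes) frame-level log-posteriors from the acoustic model.
    log_trans: (n_phonemes, n_phonemes) transition log-probabilities (language model).
    log_prior: (n_phonemes,) initial log-probabilities. Returns the best phoneme path."""
    n_frames, _ = log_post.shape
    score = log_prior + log_post[0]
    back = np.zeros_like(log_post, dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans        # score of prev state i moving to state j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))      # trace the best predecessors backwards
    return path[::-1]
```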
In one embodiment, the method may further comprise the steps of:
s601, a plurality of lyric texts are obtained from a preset song database, and the lyric phonemes of each lyric text character in the lyric texts are obtained.
In a specific implementation, a plurality of lyric texts may be obtained from a preset song database, where the language of the lyric texts may be determined according to actual conditions. After the lyric texts are obtained, word segmentation and character segmentation can be performed on each lyric text to obtain the lyric text characters, and the lyric phonemes of the lyric text characters are obtained to form a lyric dictionary.
S602, based on a plurality of lyrics texts, determining an associated character of each lyrics text character and the occurrence probability of the associated character.
As an example, the associated character of the lyric text character may be a character that occurs within a preset range of the lyric text character, for example, for each lyric text character in the lyric text, m characters before and after the lyric text character may be taken as the associated character of the lyric text character.
In practical application, after a plurality of lyric texts are obtained, the lyric text characters can be counted, the associated character of each lyric text character is determined, and the occurrence probability of the associated character is determined by combining the plurality of lyric texts.
S603, obtaining a lyric phoneme of the associated character from the lyric phoneme of each lyric text character in the lyric text.
Furthermore, since the associated characters of the lyric text characters are also recorded in the lyric text, the lyric phonemes of the associated characters can be acquired from the lyric phonemes of the acquired respective lyric text characters.
S604, a language model is constructed and obtained based on the lyric phoneme of each lyric text character, the lyric phoneme of the associated character of each lyric text character and the occurrence probability of the associated character. The language model may be a statistical model that computes a probability distribution of a sequence of characters (e.g., phrases, sentences, paragraphs).
In this embodiment, a plurality of lyric texts can be obtained from a preset song database and the lyric phoneme of each lyric text character in the lyric texts obtained; the associated characters of each lyric text character and their occurrence probabilities are determined based on the plurality of lyric texts; the lyric phonemes of the associated characters are then obtained from the lyric phonemes of the lyric text characters; and a language model is constructed based on the lyric phoneme of each lyric text character, the lyric phonemes of its associated characters, and the occurrence probabilities of those associated characters, providing a basis for identifying the lyric text corresponding to the predicted lyric phonemes.
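A toy sketch of the statistics this embodiment collects: characters within a window of each lyric text character are treated as its associated characters, and their occurrence probabilities are estimated by counting; the window size and the purely character-level treatment are simplifying assumptions.

```python
from collections import Counter, defaultdict

def associated_character_probs(lyric_texts, window=2):
    """lyric_texts: list of lyric strings; returns P(associated char | char)."""
    counts = defaultdict(Counter)
    for text in lyric_texts:
        chars = [c for c in text if not c.isspace()]
        for i, ch in enumerate(chars):
            lo, hi = max(0, i - window), min(len(chars), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[ch][chars[j]] += 1      # associated character within the window
    return {ch: {assoc: n / sum(cnt.values()) for assoc, n in cnt.items()}
            for ch, cnt in counts.items()}
```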
In order to enable those skilled in the art to better understand the above steps, the following is an example to illustrate the embodiments of the present application, but it should be understood that the embodiments of the present application are not limited thereto.
As shown in fig. 6, song samples and lyric texts may be obtained from a song database to train the lyric acoustic model and the language model respectively, and a singing voice recognition model for converting song audio into the corresponding lyric text is constructed by combining the lyric acoustic model and the language model.
For the language model, each lyric text is segmented into words and characters to obtain the lyric text characters, the lyric phonemes of each lyric text character are obtained, the occurrence probabilities of the m associated characters before and after each lyric text character are counted, and an n-gram language model is constructed from the lyric phoneme of each lyric text character, the lyric phonemes of its associated characters, and their occurrence probabilities.
For the training of the lyric acoustic model, the background accompaniment is extracted from the song audio through a preset separation model (a Spleeter network), and the MFCC features of the song audio and of the background accompaniment are acquired respectively. The MFCC features of the song audio can be input into a TDNN (Time-Delay Neural Network) serving as the first network to obtain the corresponding song features, and the MFCC features of the background accompaniment can be input into a ResNet (residual neural network) serving as the second network to obtain the corresponding accompaniment features. After the song features and the accompaniment features are spliced, they can be input into the softmax layer of the lyric acoustic model for classification, and the softmax layer outputs the posterior probabilities p1, p2, ..., pn that each song frame belongs to each phoneme.
After the acoustic model of the lyrics is trained, a singing voice recognition model can be constructed by combining the n-gram language model and the acoustic model of the lyrics.
It should be understood that, although the steps in the flowcharts of the embodiments described above are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict order restriction on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which need not be completed at the same time but may be performed at different times; their execution order need not be sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing song data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of training an acoustic model of lyrics or a method of lyrics recognition.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of part of the structure related to the present solution and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, carries out the steps in the method embodiments described above.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like, without limitation.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to be within the scope of this specification.
The above embodiments only express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of training a lyric acoustic model, the method comprising:
acquiring a song sample and a phoneme label of lyrics in the song sample;
acquiring song spectrum information of the song sample and accompaniment spectrum information of the song sample;
inputting the song spectrum information and the accompaniment spectrum information into a lyric acoustic model to be trained, extracting song characteristics of the song spectrum information and accompaniment characteristics of the accompaniment spectrum information by the lyric acoustic model to be trained, and outputting predicted lyric phonemes according to the song characteristics and the accompaniment characteristics;
determining a model loss according to the predicted lyric phoneme and the phoneme label;
and adjusting the model parameters of the acoustic model of the lyrics to be trained according to the model loss until the training end condition is met, thereby obtaining the trained acoustic model of the lyrics.
2. The method of claim 1, wherein the song spectral information comprises song frame spectral information for each song frame of the song sample, and wherein the lyrics acoustic model to be trained comprises a first network comprising a plurality of hidden layers;
the extracting the song characteristics of the song spectrum information by the acoustic model of the lyrics to be trained comprises the following steps:
inputting the song frame spectrum information of each song frame into the first network, and fusing, by each hidden layer of the first network, the input features in the input feature sequence provided by the previous hidden layer according to a time delay value, so as to obtain the input feature sequence of the current hidden layer; wherein the input feature sequence of the first hidden layer is the song frame spectrum information of each song frame;
and obtaining the song characteristics of the song frame spectrum information of each song frame according to the output result of the last hidden layer.
3. The method of claim 1, wherein the accompaniment spectral information comprises accompaniment frame spectral information for each accompaniment frame of the song sample, and wherein the lyrics acoustic model to be trained comprises a second network comprising a plurality of residual blocks;
extracting the accompaniment features of the accompaniment frequency spectrum information by the lyric acoustic model to be trained, wherein the accompaniment features comprise:
inputting the accompaniment frequency spectrum information of each accompaniment frame into a second network, and carrying out feature transmission from a first residual block to a last residual block on the accompaniment frame frequency spectrum information of each accompaniment frame by the second network to obtain the accompaniment features of the accompaniment frame frequency spectrum information of each accompaniment frame.
4. The method according to any one of claims 1 to 3, wherein the song spectrum information comprises song frame spectrum information of each song frame of the song sample, and the accompaniment spectrum information comprises accompaniment frame spectrum information of each accompaniment frame of the song sample; the outputting the predicted lyric phoneme according to the song characteristics and the accompaniment characteristics comprises:
acquiring song characteristics of the song frame spectrum information of each song frame and accompaniment characteristics of the accompaniment frame spectrum information of each accompaniment frame;
splicing the song characteristics of each song frame with the accompaniment characteristics of the corresponding accompaniment frame, and inputting each spliced characteristic into a classifier of the lyric acoustic model to be trained;
and obtaining the predicted lyric phoneme of each song frame according to the output result of the classifier.
5. The method of claim 4, wherein deriving the predicted lyric phoneme for each song frame based on the output of the classifier comprises:
acquiring phoneme probability distribution of each song frame output by the classifier;
and determining the predicted lyric phoneme of each song frame according to the phoneme probability distribution of each song frame.
6. The method of claim 1, wherein determining a model loss based on the predicted lyric phonemes and the phoneme label comprises:
acquiring a phoneme label of the lyrics of each song frame according to the phoneme label of the lyrics in the song sample;
model losses are determined based on the predicted phoneme of the lyrics for each song frame and the phoneme label of the lyrics for each song frame.
7. A method for lyric recognition, the method comprising:
acquiring song spectrum information of a song with lyrics to be identified and accompaniment spectrum information of the song;
inputting the song spectrum information of the song and the accompaniment spectrum information of the song into a lyric acoustic model to obtain a predicted lyric phoneme of the song output by the lyric acoustic model; the lyrics acoustic model is trained according to the method of any one of claims 1-6;
and inputting the predicted lyric phoneme of the song into a language model to obtain the lyric of the song output by the language model.
8. The method of claim 7, further comprising:
acquiring a plurality of lyric texts from a preset song database, and acquiring the lyric phonemes of each lyric text character in the lyric texts;
determining an associated character for each lyric text character and an occurrence probability of the associated character based on the plurality of lyric texts; acquiring lyric phonemes of the associated characters from lyric phonemes of each lyric text character in the lyric text;
and constructing and obtaining the language model based on the lyric phoneme of each lyric text character, the lyric phoneme of the associated character of each lyric text character and the occurrence probability of the associated character.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6 or 7 to 8.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6 or 7 to 8.
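For illustration only, a minimal training loop matching the flow recited in claims 1 and 6 (output predicted lyric phonemes, determine a model loss against the frame-level phoneme labels, and adjust the model parameters until a training end condition is met) might look as follows; the LyricAcousticModel class, the optimizer settings and the data loader are assumptions, not part of the claimed method.

import torch
import torch.nn as nn

def train(model, data_loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()                        # the model outputs log-probabilities
    for _ in range(epochs):                         # "until the training end condition is met"
        for song_mfcc, accomp_mfcc, phoneme_labels in data_loader:
            log_probs = model(song_mfcc, accomp_mfcc)            # (batch, frames, phonemes)
            # Frame-level loss between predicted phonemes and phoneme labels.
            loss = criterion(log_probs.flatten(0, 1), phoneme_labels.flatten())
            opt.zero_grad()
            loss.backward()                         # adjust model parameters according to the loss
            opt.step()
    return model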
CN202210605845.8A 2022-05-31 2022-05-31 Training method of lyric acoustic model, lyric recognition method, equipment and product Pending CN115083397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210605845.8A CN115083397A (en) 2022-05-31 2022-05-31 Training method of lyric acoustic model, lyric recognition method, equipment and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210605845.8A CN115083397A (en) 2022-05-31 2022-05-31 Training method of lyric acoustic model, lyric recognition method, equipment and product

Publications (1)

Publication Number Publication Date
CN115083397A true CN115083397A (en) 2022-09-20

Family

ID=83248715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210605845.8A Pending CN115083397A (en) 2022-05-31 2022-05-31 Training method of lyric acoustic model, lyric recognition method, equipment and product

Country Status (1)

Country Link
CN (1) CN115083397A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862603A (en) * 2022-11-09 2023-03-28 北京数美时代科技有限公司 Song voice recognition method, system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
US20190266998A1 (en) Speech recognition method and device, computer device and storage medium
US11955119B2 (en) Speech recognition method and apparatus
CN111798840B (en) Voice keyword recognition method and device
Gemmeke et al. Sparse imputation for large vocabulary noise robust ASR
CN107480152A (en) A kind of audio analysis and search method and system
Gholamdokht Firooz et al. Spoken language recognition using a new conditional cascade method to combine acoustic and phonetic results
Mary et al. Searching speech databases: features, techniques and evaluation measures
CN115083397A (en) Training method of lyric acoustic model, lyric recognition method, equipment and product
US8639510B1 (en) Acoustic scoring unit implemented on a single FPGA or ASIC
Birla A robust unsupervised pattern discovery and clustering of speech signals
CN112116181B (en) Classroom quality model training method, classroom quality evaluation method and classroom quality evaluation device
CN117116292A (en) Audio detection method, device, electronic equipment and storage medium
Kolesau et al. Voice activation systems for embedded devices: Systematic literature review
CN115240656A (en) Training of audio recognition model, audio recognition method and device and computer equipment
Chung et al. Unsupervised discovery of structured acoustic tokens with applications to spoken term detection
Tarján et al. A bilingual study on the prediction of morph-based improvement.
CN114121018A (en) Voice document classification method, system, device and storage medium
Pradeep et al. Incorporation of manner of articulation constraint in LSTM for speech recognition
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
Chen et al. Popular song and lyrics synchronization and its application to music information retrieval
CN113053409A (en) Audio evaluation method and device
Balpande et al. Speaker recognition based on mel-frequency cepstral coefficients and vector quantization
Paulose et al. Marathi Speech Recognition.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination