CN113936627A - Model training method and component, phoneme pronunciation duration labeling method and component

Info

Publication number: CN113936627A
Authority: CN (China)
Prior art keywords: syllable, pronunciation, consonant, vowel, model
Legal status: Pending
Application number: CN202111179340.1A
Other languages: Chinese (zh)
Inventor: 庄晓滨
Current Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202111179340.1A
Publication of CN113936627A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/076 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal, for extraction of timing, tempo; Beat detection
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The scheme takes syllables in a song syllable set as training samples and can train a phoneme pronunciation duration recognition model capable of constraining phoneme pronunciation duration.

Description

Model training method and component, phoneme pronunciation duration labeling method and component
Technical Field
The application relates to the technical field of computers, in particular to a model training method and a component, and a phoneme pronunciation duration labeling method and a component.
Background
In the prior art, phoneme pronunciation duration is predicted with text as the object, but text carries no pronunciation-duration constraint. Such unconstrained methods are applicable only to speech synthesis scenarios, that is, the pronunciation duration of each phoneme is neither limited nor subject to a strict criterion.
In speech synthesis, the accuracy of pronunciation-duration prediction only affects naturalness, and its impact on practical use is small. For singing voice synthesis, however, the accuracy of pronunciation-duration prediction directly affects the quality of song synthesis. Because the tempo information of a song defines the pronunciation duration of each word, the synthesized pronunciation will not fit the song melody if phoneme pronunciation durations are not constrained.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a model training method and component, and a phoneme pronunciation duration labeling method and component, so as to obtain a phoneme pronunciation duration recognition model suitable for singing voice synthesis scenarios. The specific scheme is as follows:
to achieve the above object, in one aspect, the present application provides a model training method, including:
obtaining a target syllable and a label of the target syllable from a song syllable set; the label includes: the consonant pronunciation frame count N0 of the consonant phoneme in the target syllable and the vowel pronunciation frame count N1 of the vowel phoneme in the target syllable;
constructing, according to the consonant pronunciation frame count N0 and the vowel pronunciation frame count N1, a target sequence that includes N0 consonant marks and N1 vowel marks;
repeating the target syllable N times to obtain a syllable group, where N = N0 + N1;
inputting the syllable group into a neural network model so that the neural network model outputs a prediction sequence, where the prediction sequence comprises N3 consonant marks and N4 vowel marks, and N3 + N4 = N;
Calculating a loss value between the predicted sequence and the target sequence;
adjusting parameters of the neural network model based on the loss value to obtain an updated model;
and if the updated model is converged, taking the updated model as a phoneme pronunciation duration recognition model.
Preferably, the method further comprises the following steps:
if the updated model is not converged, another syllable and the label of the syllable are obtained again from the song syllable set so as to carry out iterative training on the updated model until the model is converged, and the updated model is used as the phoneme pronunciation duration recognition model.
Preferably, the inputting the syllable group into a neural network model so that the neural network model outputs a prediction sequence comprises:
the neural network model determines the total frame number N of pronunciation of the target syllable based on the number of repetitions of the target syllable in the syllable group and predicts the phone type of each frame of pronunciation in the target syllable;
if the phoneme type of any frame pronunciation is consonant, recording a consonant mark, and if the phoneme type of any frame pronunciation is vowel, recording a vowel mark to obtain the prediction sequence.
Preferably, the predicting the phone type of each frame of pronunciation in the target syllable comprises:
predicting a phone type of each frame pronunciation in the target syllable by querying a phone dictionary.
Preferably, the song syllable set is obtained by converting Chinese lyric text according to the corresponding song audio.
Preferably, if the target syllable is a single-phoneme vowel syllable, after the target sequence including N0 consonant marks and N1 vowel marks is constructed according to the consonant pronunciation frame count N0 and the vowel pronunciation frame count N1, the method further comprises:
filling a consonant mark at the head of the target sequence;
accordingly, before the calculating the loss value between the predicted sequence and the target sequence, the method further comprises:
filling a consonant mark at the head of the prediction sequence.
Preferably, before calculating the loss value between the predicted sequence and the target sequence, the method further comprises:
noise in the predicted sequence is detected and eliminated.
Preferably, the calculating a loss value between the predicted sequence and the target sequence comprises:
comparing the predicted sequence with the target sequence position by position to obtain N comparison results;
determining the loss value based on the N comparison results.
In another aspect, the present application further provides a phoneme pronunciation duration labeling method, including:
acquiring a text to be marked;
converting the text to be labeled into syllables;
inputting the syllable into the phoneme pronunciation duration recognition model, so that the phoneme pronunciation duration recognition model outputs a labeling sequence;
determining the vowel pronunciation frame count of the vowel phoneme and the consonant pronunciation frame count of the consonant phoneme in the syllable based on the labeling sequence;
recording the product of the vowel pronunciation frame number and the unit frame length as vowel pronunciation duration, and recording the product of the consonant pronunciation frame number and the unit frame length as consonant pronunciation duration;
and marking the vowel pronunciation time length and the consonant pronunciation time length on the syllable.
Preferably, the method further comprises the following steps:
and constructing a syllable set labeled with the vowel pronunciation duration and the consonant pronunciation duration, so that a song synthesis model can be trained with the set.
In yet another aspect, the present application further provides an electronic device comprising a processor and a memory; wherein the memory is used for storing a computer program which is loaded and executed by the processor to implement the aforementioned model training method.
In yet another aspect, the present application further provides a storage medium, in which computer-executable instructions are stored, and when being loaded and executed by a processor, the computer-executable instructions implement the aforementioned model training method.
The syllables in the song syllable set are used as training samples, and the target syllable, the consonant pronunciation frame count N0 of the consonant phoneme in the target syllable, and the vowel pronunciation frame count N1 of the vowel phoneme in the target syllable can be obtained; a target sequence including N0 consonant marks and N1 vowel marks is then constructed according to the consonant pronunciation frame count N0 and the vowel pronunciation frame count N1; the target syllable is repeated N times to obtain a syllable group, and the syllable group is input into the neural network model so that the neural network model knows that the target syllable needs to be pronounced for N frames and outputs a prediction sequence including N3 consonant marks and N4 vowel marks; a loss value between the prediction sequence and the target sequence is calculated; parameters of the neural network model are adjusted based on the loss value to obtain an updated model; and if the updated model converges, the updated model is taken as the phoneme pronunciation duration recognition model. It can be seen that this scheme uses syllables in a song syllable set as training samples to train a phoneme pronunciation duration recognition model capable of constraining phoneme pronunciation duration; for any syllable, the model can determine how long the vowel and how long the consonant are pronounced when the syllable is sung in the song, so the model can be used to label the pronunciation phoneme durations of syllables in songs, and songs can then be synthesized based on the labeling information.
Accordingly, the model training component, the phoneme pronunciation duration labeling method and the phoneme pronunciation duration labeling component (the component comprises a device, equipment and a storage medium) also have the technical effects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a diagram illustrating a physical architecture suitable for use in the present application;
FIG. 2 is a flow chart of a model training method provided herein;
FIG. 3 is a schematic syllable decomposition diagram of the present application;
FIG. 4 is a flowchart of a phoneme pronunciation duration labeling method provided by the present application;
FIG. 5 is a flowchart of another phoneme pronunciation duration labeling method provided in the present application;
FIG. 6 is a diagram illustrating a phoneme pronunciation duration recognition model provided in the present application;
FIG. 7 is a schematic diagram of a model training apparatus provided herein;
FIG. 8 is a schematic diagram of a phoneme pronunciation duration labeling apparatus provided in the present application;
FIG. 9 is a diagram of a server architecture provided by the present application;
fig. 10 is a diagram of a terminal structure provided in the present application.
Detailed Description
In the prior art, the pronunciation duration of a phoneme is predicted by taking a text as an object, but the text has no constraint of pronunciation duration. Since the tempo information of the song defines the pronunciation duration of each word, the synthesized song pronunciation will not fit the song melody if the phoneme pronunciation duration is not constrained.
In view of the above problems, the present application proposes a model training scheme that uses the syllables in a song syllable set as training samples to train a phoneme pronunciation duration recognition model capable of constraining phoneme pronunciation duration. For any syllable, the model can determine how long the vowel and how long the consonant are pronounced when the syllable is sung in a song, so the durations of the pronunciation phonemes of syllables in songs can be labeled with the model, and singing voice can then be synthesized based on the labeling information.
Terms of art involved in this application:
Phoneme: the smallest speech unit, divided according to the natural attributes of speech. In terms of acoustic properties, a phoneme is the smallest speech unit divided from the perspective of sound quality. In terms of articulation, one pronunciation action forms one phoneme. For example, [ma] contains the two pronunciation actions [m] and [a], which are two phonemes. Sounds produced by the same pronunciation action are the same phoneme, and sounds produced by different pronunciation actions are different phonemes. For example, in [ma-mi], the two [m] sounds are produced by the same action and are the same phoneme, while [a] and [i] are produced by different actions and are different phonemes. Phonemes are generally analyzed and described in terms of pronunciation actions.
Syllable: the most natural structural unit of speech. A syllable is the smallest structural unit of speech formed by a combination of phonemes; it is composed of a head, a belly, and a tail, so there are clearly perceptible boundaries between syllables. In Chinese, the pronunciation of one Chinese character is one syllable. Mandarin has about 400 basic toneless syllables and about 1,300 tonal syllables (excluding the neutral tone).
For ease of understanding, a physical framework to which the present application applies will be described.
It should be understood that the model training method provided in the present application can be applied to a system or a program having a model training function. Of course, the phoneme pronunciation time labeling method provided by the present application can also be applied to a system or a program with a phoneme pronunciation time labeling function. Specifically, the system or program having the model training function and the system or program having the phoneme pronunciation duration labeling function may be run in a server, a personal computer, or the like.
As shown in fig. 1, fig. 1 is a schematic diagram of a physical architecture applicable to the present application. In fig. 1, a system or program having a model training function may run on a server. The server acquires a song syllable set from other terminal devices via a network and obtains a target syllable and a label of the target syllable from the song syllable set; the label includes the consonant pronunciation frame count N0 of the consonant phoneme in the target syllable and the vowel pronunciation frame count N1 of the vowel phoneme in the target syllable. A target sequence including N0 consonant marks and N1 vowel marks is constructed according to the consonant pronunciation frame count N0 and the vowel pronunciation frame count N1; the target syllable is repeated N times to obtain a syllable group; the syllable group is input into the neural network model so that the neural network model outputs a prediction sequence including N3 consonant marks and N4 vowel marks; a loss value between the prediction sequence and the target sequence is calculated; parameters of the neural network model are adjusted based on the loss value to obtain an updated model; and if the updated model converges, the updated model is taken as the phoneme pronunciation duration recognition model.
In fig. 1, a system or a program with a phoneme pronunciation duration labeling function may also be run on a server, and the server obtains a text to be labeled from other terminal devices through a network; converting the text to be labeled into syllables; inputting syllables into the phoneme pronunciation duration recognition model so that the phoneme pronunciation duration recognition model outputs a labeling sequence; determining a vowel pronunciation frame number of a vowel phone and a consonant pronunciation frame number of a consonant phone in the syllable based on the tagging sequence; recording the product of the number of vowel pronunciation frames and the length of the unit frame as the vowel pronunciation time length, and recording the product of the number of consonant pronunciation frames and the length of the unit frame as the consonant pronunciation time length; the vowel pronunciation time length and consonant pronunciation time length are marked on syllables.
As can be seen, the server can establish communication connection with a plurality of devices, and the server acquires syllables meeting training conditions or texts needing to mark pronunciation duration of phonemes from the devices. The server trains the neural network model by collecting syllables uploaded by the devices to obtain a phoneme pronunciation duration recognition model.
Fig. 1 shows various terminal devices, in an actual scene, more or fewer types of terminal devices may participate in the process of model training or phoneme pronunciation duration labeling, the specific number and type are determined according to the actual scene, and this is not limited herein, and in addition, fig. 1 shows one server, but in an actual scene, there may also be participation of multiple servers, and the specific number of servers is determined according to the actual scene.
It should be noted that the model training method provided by this embodiment may be performed offline, that is, the server stores the song syllable set locally and can directly use the scheme provided by this application to train the desired model. Likewise, the phoneme pronunciation duration labeling method provided by this embodiment may be performed offline, that is, the server stores the text to be labeled locally and can directly use the scheme provided by this application to label the syllables corresponding to the text.
It can be understood that the system and the program with the model training function or the system and the program with the phoneme pronunciation duration tagging function may also be operated in a personal mobile terminal, and may also be used as one of cloud service programs, and a specific operation mode is determined according to an actual scene, and is not limited herein.
Specifically, after model training is completed, the obtained model can label syllables for a large amount of text, so that a syllable set with labeling information is obtained, and this set can be used to train a singing voice synthesis model.
With reference to fig. 2, fig. 2 is a flowchart of a model training method according to an embodiment of the present disclosure. As shown in fig. 2, the model training method may include the steps of:
S201, obtaining, from the song syllable set, a target syllable, the consonant pronunciation frame count N0 of the consonant phoneme in the target syllable, and the vowel pronunciation frame count N1 of the vowel phoneme in the target syllable.
In this embodiment, the set of song syllables is converted from the lyric text in accordance with the corresponding song audio. Namely: the total frame number of pronunciation of each syllable in a syllable set of a song, the number of consonant pronunciation frames of a consonantal phone in the syllable, and the number of vowel pronunciation frames of a vowel phone in the syllable are determined based on the number of pronunciation frames of the syllable in the corresponding song, so that the pronunciation time of each syllable in the syllable set of a song is known, the pronunciation time of the consonantal phone in the syllable is known, and the pronunciation time of the vowel phone in the syllable is also known.
In general, a Chinese character syllable is composed of two phonemes, one consonant phoneme and one vowel phoneme, so the total pronunciation frame count of the target syllable is N = N0 + N1.
For example: the syllable of the pronunciation of "I" is [ wo ], which can be divided into two phonemes [ (u ] and [ uo ]. Assuming that "i" pronounces 203 ms in a certain song, and the preset unit frame length is 10 ms (i.e. the time length corresponding to one frame of audio), 203/10 ≈ 20.3 ≈ 21 frames (rounding), where the total frame number of pronunciations of "i" in a certain song can be considered as 21. If it is determined that [ (u ] takes 112 msec after the detection, it can be regarded as emitting 11 frames. If it is determined that [ uo ] takes 96 milliseconds after detection, then [ uo ] can be considered to be 10 frames of pronunciation.
The syllable corresponding to any Chinese character can be decomposed with a speech labeling tool, so that information such as the pronunciation duration of the different phonemes of the syllable in a certain song and the total pronunciation duration of the syllable can be obtained. However, these tools often decompose inaccurately, and the decomposition result can be adjusted manually when necessary; the speech labeling tool supports manual adjustment.
As shown in fig. 3, the speech labeling tool can visualize the song audio and further decompose the individual syllables, forming the visualization in fig. 3. The "|"-shaped black lines in fig. 3 support mouse dragging, so the decomposition result can be adjusted manually. In this way, a number of songs can be decomposed to obtain the song syllable set.
Since the time length is not easily labeled, the present embodiment uses the pronunciation frame number as the label of the syllable. Since the unit frame length is always constant, it is only necessary to determine the number of utterance frames of a phoneme, which corresponds to determining the utterance duration of the phoneme.
Of course, for songs and lyrics in other languages (such as english, etc.), conversion can be performed according to the above to obtain corresponding syllable sets of songs.
S202, constructing, according to the consonant pronunciation frame count N0 and the vowel pronunciation frame count N1, a target sequence including N0 consonant marks and N1 vowel marks.
Taking the above "i" word as an example, assuming that "0" is used as the consonant label and "1" is used as the vowel label, the target sequence is: [000000000001111111111], consisted of 11 "0 s" and 10 "1 s". At this time, one "0" or one "1" in the target sequence corresponds to 1 frame, and thus the target sequence can be considered to correspond to 21 frames.
And S203, repeating the target syllable for N times to obtain a syllable group.
It should be noted that, in this embodiment, in order to let the model know how long the currently input syllable needs to be pronounced (i.e., how many frames it spans), the syllable is repeated as many times as its total frame count, thereby obtaining a syllable group containing multiple identical syllables. That is: however many frames a syllable is pronounced for in the song, it is repeated that many times here.
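A minimal sketch of steps S202 and S203, using "0" as the consonant mark and "1" as the vowel mark as in the example above (the helper names are illustrative, not from this application):

```python
def build_target_sequence(n_consonant: int, n_vowel: int) -> list[int]:
    """Target sequence: N0 consonant marks (0) followed by N1 vowel marks (1)."""
    return [0] * n_consonant + [1] * n_vowel

def build_syllable_group(syllable: str, total_frames: int) -> list[str]:
    """Repeat the syllable once per pronunciation frame so the model can infer its total frame count."""
    return [syllable] * total_frames

target = build_target_sequence(11, 10)       # "wo": 11 zeros then 10 ones, 21 marks in total
group = build_syllable_group("wo", 11 + 10)  # ["wo", "wo", ..., "wo"], repeated 21 times
```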
And S204, inputting the syllable group into the neural network model so that the neural network model outputs a prediction sequence.
The prediction sequence comprises N3 consonant marks and N4 vowel marks; that is, the total number of consonant marks and vowel marks in the prediction sequence is also N. Ideally, N3 = N0 and N4 = N1; that is, the prediction sequence also includes N0 consonant marks and N1 vowel marks.
Taking the above "I" word as an example, ideally, the predicted sequence and the target sequence [000000000001111111111] are identical, and the predicted sequence output by the model always has some deviation, but the total number of tags in the predicted sequence must be N. As for the prediction sequence, there are several consonant labels and several vowel labels, which depend on the phoneme prediction type of each frame pronunciation by the neural network model.
The neural network model can predict whether each frame of the target syllable is a vowel or a consonant and record the corresponding mark for each frame, so that a prediction sequence is obtained.
And S205, calculating a loss value between the predicted sequence and the target sequence.
In one embodiment, calculating the loss value between the prediction sequence and the target sequence comprises: comparing the prediction sequence with the target sequence position by position to obtain N comparison results, and determining the loss value based on the N comparison results.
For example: "[wo]" corresponds to the target sequence [000000000001111111111]. If the prediction sequence output by the model is [000010000001111111111], the two "0"s at the first position of the two sequences are compared and the comparison result is recorded as consistent; then the elements at the second position, the third position, and so on up to the 21st position are compared position by position, finally yielding 21 comparison results, from which the loss value can be calculated. For example: the ratio of the number of inconsistent comparison results (here, only the result at the fifth position) to the total number of comparison results may be taken as the loss value, giving a loss value of 1/21. Of course, this approach is merely an exemplary calculation and does not constitute a limitation of the present application.
Essentially, based on the above listed loss value calculation idea, any loss function can be used to calculate the loss value, such as: cross entropy loss functions, etc.
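A sketch of the position-by-position comparison described above; the mismatch-ratio loss shown here is only the exemplary calculation from the text, and any other loss such as cross entropy could be substituted:

```python
def mismatch_ratio_loss(pred: list[int], target: list[int]) -> float:
    """Exemplary loss: the fraction of positions where prediction and target disagree."""
    assert len(pred) == len(target)
    mismatches = sum(p != t for p, t in zip(pred, target))
    return mismatches / len(target)

target = [0] * 11 + [1] * 10                 # target sequence for "wo"
pred = [0] * 4 + [1] + [0] * 6 + [1] * 10    # one wrong mark at the 5th position
loss = mismatch_ratio_loss(pred, target)     # 1/21
```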
S206, adjusting parameters of the neural network model based on the loss values to obtain an updated model.
S207, judging whether the updated model is converged; if yes, go to step S208; if not, S209 is executed.
Judging whether the updated model converges includes: judging whether the loss value is small enough; or judging whether the loss value remains unchanged, or changes only very slightly, over several consecutive iterations.
And S208, taking the updated model as a phoneme pronunciation duration recognition model.
S209, replacing the neural network model with the updated model, and executing S201 to obtain another syllable and the label of the syllable from the song syllable set again, and performing iterative training on the updated model until the model converges.
In the present embodiment, the neural network model may be of any structure.
It can be seen that, in this embodiment, the syllables in the song syllable set are used as training samples, and the target syllable, the consonant pronunciation frame count N0 of the consonant phoneme in the target syllable, and the vowel pronunciation frame count N1 of the vowel phoneme in the target syllable can be obtained; a target sequence including N0 consonant marks and N1 vowel marks is then constructed according to the consonant pronunciation frame count N0 and the vowel pronunciation frame count N1; the target syllable is repeated N times to obtain a syllable group, and the syllable group is input into the neural network model so that the neural network model knows that the target syllable needs to be pronounced for N frames, thereby outputting a prediction sequence comprising N marks; a loss value between the prediction sequence and the target sequence is calculated; parameters of the neural network model are adjusted based on the loss value to obtain an updated model; and if the updated model converges, the updated model is taken as the phoneme pronunciation duration recognition model. In other words, this scheme uses syllables in a song syllable set as training samples to train a phoneme pronunciation duration recognition model capable of constraining phoneme pronunciation duration; for any syllable, the model can determine how long the vowel and how long the consonant are pronounced when the syllable is sung in the song, so the model can be used to label the pronunciation phoneme durations of syllables in songs, and songs can then be synthesized based on the labeling information.
Based on the above embodiments, it should be noted that inputting the syllable group into the neural network model to make the neural network model output the prediction sequence includes: the neural network model determines the total frame number N of the pronunciation of the target syllable based on the repetition times of the target syllable in the syllable group, and predicts the phoneme type of each frame of pronunciation in the target syllable; if the phoneme type of any frame pronunciation is consonant, recording a consonant mark, and if the phoneme type of any frame pronunciation is vowel, recording a vowel mark to obtain a prediction sequence.
Predicting the phoneme type of each frame of pronunciation in the target syllable comprises: predicting the phoneme type of each frame of pronunciation in the target syllable by querying a phoneme dictionary. The various vowel phonemes and consonant phonemes of Chinese are recorded in the phoneme dictionary.
Predicting the phoneme type of each frame of pronunciation in the target syllable means predicting whether each frame of the target syllable is a vowel or a consonant.
Based on the above embodiments, it should be noted that Chinese also has single-phoneme syllables, such as "wu" and "o". These single-phoneme syllables have only a vowel phoneme and no consonant phoneme.
In one embodiment, if the target syllable is a single-phoneme vowel syllable, then after the target sequence including N0 consonant marks and N1 vowel marks is constructed according to the consonant pronunciation frame count N0 and the vowel pronunciation frame count N1, the method further comprises: filling a consonant mark at the head of the target sequence; correspondingly, before the loss value between the prediction sequence and the target sequence is calculated, the method further comprises: filling a consonant mark at the head of the prediction sequence.
According to the method, the single-phone vowel syllables can be distinguished, and the model can output more accurate prediction sequences.
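A sketch of this head-padding rule for single-phoneme vowel syllables (for which N0 = 0), applied to a sequence before the loss is computed; the helper name is illustrative:

```python
def pad_single_vowel(sequence: list[int], is_single_vowel_syllable: bool) -> list[int]:
    """Prepend one consonant mark (0) when the syllable has no consonant phoneme."""
    return [0] + sequence if is_single_vowel_syllable else sequence

# e.g. a vowel-only syllable pronounced for 15 frames: the target sequence gets a leading 0,
# and the same padding is applied to the prediction sequence before the loss is computed.
target = pad_single_vowel([1] * 15, is_single_vowel_syllable=True)   # [0, 1, 1, ..., 1]
```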
Of course, to make the prediction sequence more accurate, noise elimination may be performed on the prediction sequence. For example: "[wo]" corresponds to the target sequence [000000000001111111111]. Assuming the prediction sequence output by the model is [000010000001111111111], the model has obviously mispredicted the phoneme type of the 5th frame of pronunciation, and the prediction sequence can be considered to contain a noise point. If this noise can be detected, the "1" at the 5th position of the prediction sequence can be changed to "0" to eliminate it. Of course, because such noise is random and not easy to detect, noise detection and elimination may also be skipped; instead, the model can be iteratively trained many times, which also improves the accuracy of the prediction sequence.
The noise detection mode may be: if a small number (e.g., 1 or 2 or other) of vowel markers occur by chance in a succession of consonant markers in the prediction sequence, these small numbers of vowel markers may be considered to be noise. Conversely, if a small number (e.g., 1 or 2 or other) of consonant tokens accidentally appear in the consecutively arranged vowel tokens in the prediction sequence, then these small number of consonant tokens can be considered as noise.
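A sketch of the noise-elimination idea above: short isolated runs of one mark surrounded by the opposite mark are flipped. The run-length threshold is an assumption for illustration, not a value given in this application:

```python
def denoise(sequence: list[int], max_noise_run: int = 2) -> list[int]:
    """Flip short runs (at most max_noise_run marks) that are surrounded by the opposite mark."""
    cleaned = list(sequence)
    i = 0
    while i < len(cleaned):
        j = i
        while j < len(cleaned) and cleaned[j] == cleaned[i]:
            j += 1                                     # [i, j) is a run of identical marks
        surrounded = i > 0 and j < len(cleaned) and cleaned[i - 1] == cleaned[j]
        if surrounded and (j - i) <= max_noise_run:
            for k in range(i, j):
                cleaned[k] = cleaned[i - 1]            # flip the noisy run
        i = j
    return cleaned

noisy = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(denoise(noisy))                                  # the stray "1" at the 5th position becomes "0"
```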
Referring to fig. 4, fig. 4 is a flowchart of a phoneme pronunciation duration labeling method according to an embodiment of the present application. As shown in fig. 4, the labeling method may include the following steps:
s401, obtaining a text to be annotated.
S402, converting the text to be annotated into syllables.
And S403, inputting syllables into the phoneme pronunciation duration recognition model, so that the phoneme pronunciation duration recognition model outputs a labeling sequence.
S404, determining the vowel pronunciation frame number of the vowel phoneme in the syllable and the consonant pronunciation frame number of the consonant phoneme based on the labeling sequence.
S405, recording the product of the vowel pronunciation frame number and the unit frame length as vowel pronunciation duration, and recording the product of the consonant pronunciation frame number and the unit frame length as consonant pronunciation duration.
And S406, marking the vowel pronunciation time length and the consonant pronunciation time length on the syllable.
In a specific embodiment, the method further comprises: constructing a syllable set labeled with the vowel pronunciation durations and consonant pronunciation durations, so that a song synthesis model can be trained with this set.
The phoneme pronunciation duration recognition model in this embodiment can determine how many vowel frames and how many consonant frames a syllable contains, can then determine the corresponding vowel pronunciation duration and consonant pronunciation duration from the vowel and consonant pronunciation frame counts, and labels the syllable accordingly, obtaining a syllable carrying the vowel pronunciation duration and consonant pronunciation duration. The set of these syllables can be used as training samples to train a singing voice synthesis model.
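A minimal sketch of steps S404 to S406, counting consonant and vowel frames in a labeling sequence and converting them into durations; a 10 ms unit frame length is assumed, as in the earlier examples, and the helper name is illustrative:

```python
FRAME_MS = 10  # assumed unit frame length

def durations_from_labels(label_seq: list[int], frame_ms: int = FRAME_MS) -> tuple[int, int]:
    """Return (consonant_duration_ms, vowel_duration_ms) from a 0/1 labeling sequence."""
    consonant_frames = label_seq.count(0)
    vowel_frames = label_seq.count(1)
    return consonant_frames * frame_ms, vowel_frames * frame_ms

labels = [0] * 11 + [1] * 10                              # labeling sequence for one syllable
consonant_ms, vowel_ms = durations_from_labels(labels)    # (110, 100), used to label the syllable
```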
The scheme provided by this application is described below with a specific application scenario example, namely a concrete scheme for labeling syllables with the phoneme pronunciation duration recognition model. This scheme can convert any text into syllables carrying phoneme pronunciation durations.
A phoneme pronunciation duration recognition model is deployed in the server. Referring to fig. 5, the specific process includes:
s501, a terminal requests a server;
s502, the server feeds back a response message to the terminal;
s503, after receiving the response message, the terminal transmits a text to the server;
s504, the server converts characters in the text into syllables;
s505, the server respectively inputs each syllable into a phoneme pronunciation duration recognition model to obtain each labeling sequence;
s506, the server processes the labeling sequences to determine the vowel pronunciation frame number of the vowel phoneme and the consonant pronunciation frame number of the consonant phoneme in each syllable;
s507, the server calculates the vowel pronunciation time length and the consonant pronunciation time length of each syllable.
S508, the server marks all syllables and forms a set;
s509, the server returns the set and the corresponding notification message to the terminal.
The terminal can be a smart phone, a tablet computer, a notebook computer or a desktop computer.
The structure of the phoneme pronunciation duration recognition model can be seen in fig. 6. In fig. 6, the phoneme pronunciation duration recognition model can be regarded as a phoneme classifier that determines whether each frame of a syllable is a consonant or a vowel, generally using 0.5 as the threshold for deciding whether a frame belongs to a consonant or a vowel.
After a syllable is repeated and input into the model shown in fig. 6, it passes through: phoneme embedding (390 × 64) → [conv(3,3,64) → ReLU → pooling(2,2)] × 5 → GRU → [conv(3,3,64) → ReLU → upsample(2,2)] × 5 → sigmoid, and finally a sequence of 0s and 1s is output.
Referring to fig. 6, from left to right, part 1 is the phoneme embedding, parts 2 to 6 are conv, ReLU, and pooling, part 7 is the GRU (Gated Recurrent Unit), parts 8 to 12 are conv, ReLU, and upsample, and part 13 is the sigmoid. The part before the GRU is the encoder, and the part after the GRU is the decoder.
The phoneme embedding embeds syllables into a high-dimensional space; Chinese has 390 syllables, and the embedding dimension is 64. conv(3,3,64) indicates a convolution kernel of width 3 and height 3 with 64 channels; pooling(2,2) indicates a pooling layer of size 2 with stride 2; upsample uses bilinear interpolation. Except for the last layer, which uses a sigmoid activation function, the convolutional layers all use the ReLU activation function. As shown in fig. 6, the conv-and-pooling part is repeated 5 times and the conv-and-upsample part is also repeated 5 times, residual connections of conv(3,3,64) are used between the encoder and the decoder, and the GRU output unit size is 128.
It should be noted that the parameters such as the size of the convolution kernel and the number of convolution layers are experimentally preferred values here. Of course, these parameters may be set to other values, if desired.
The multi-scale skip convolutions used in fig. 6 provide more and larger receptive fields, i.e., more features can be extracted. The GRU can integrate temporal information (i.e., the multi-scale features), and using the two together lets the model better take context into account, improving the accuracy of phoneme duration estimation and thereby the expressiveness and naturalness of singing voice synthesis and song-cover synthesis.
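A hedged PyTorch sketch of the encoder-GRU-decoder pipeline around fig. 6. The text describes 3×3 2D convolutions with 2×2 pooling/upsampling and conv skip connections; this sketch collapses them to 1D blocks along the time axis and omits the skip connections, so it illustrates the overall data flow rather than the exact layer geometry. Apart from the 390 × 64 embedding, the 64-channel convolutions, the five blocks on each side, and the 128-unit GRU, all choices here are assumptions:

```python
import torch
import torch.nn as nn

class PhonemeDurationModel(nn.Module):
    """Simplified 1D sketch of the encoder-GRU-decoder in fig. 6 (skip connections omitted)."""

    def __init__(self, n_syllables: int = 390, embed_dim: int = 64,
                 channels: int = 64, gru_units: int = 128, n_blocks: int = 5):
        super().__init__()
        self.embedding = nn.Embedding(n_syllables, embed_dim)        # syllable embedding (390 x 64)
        enc, in_ch = [], embed_dim
        for _ in range(n_blocks):                                    # [conv -> ReLU -> pool(2)] x 5
            enc.append(nn.Sequential(nn.Conv1d(in_ch, channels, 3, padding=1),
                                     nn.ReLU(),
                                     nn.MaxPool1d(2)))
            in_ch = channels
        self.encoder = nn.ModuleList(enc)
        self.gru = nn.GRU(channels, gru_units, batch_first=True)     # integrates temporal context
        dec, in_ch = [], gru_units
        for _ in range(n_blocks):                                    # [conv -> ReLU -> upsample(2)] x 5
            dec.append(nn.Sequential(nn.Conv1d(in_ch, channels, 3, padding=1),
                                     nn.ReLU(),
                                     nn.Upsample(scale_factor=2, mode="linear", align_corners=False)))
            in_ch = channels
        self.decoder = nn.ModuleList(dec)
        self.out = nn.Conv1d(channels, 1, 1)                         # per-frame logit, then sigmoid

    def forward(self, syllable_ids: torch.Tensor) -> torch.Tensor:
        # syllable_ids: (batch, T) frame-level syllable indices; T should be divisible by 2**5
        x = self.embedding(syllable_ids).transpose(1, 2)             # (batch, embed_dim, T)
        for block in self.encoder:
            x = block(x)                                             # time axis halved per block
        x, _ = self.gru(x.transpose(1, 2).contiguous())              # (batch, T/32, gru_units)
        x = x.transpose(1, 2)
        for block in self.decoder:
            x = block(x)                                             # time axis doubled per block
        return torch.sigmoid(self.out(x)).squeeze(1)                 # (batch, T) vowel probability per frame

model = PhonemeDurationModel()
frames = torch.randint(0, 390, (1, 64))     # a toy 64-frame syllable-group input
probs = model(frames)                       # values > 0.5 read as vowel frames, otherwise consonant
```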
The training set of the model can be constructed as follows.
Taking Chinese lyrics as an example, each Chinese character and its pronunciation duration information are extracted from a lyrics file. For example, "I" is pronounced for 208 milliseconds and "hear" for 416 milliseconds. The text is then converted into syllables (the syllable representation commonly used for Chinese characters is pinyin), so that "I" is wo and "hear" is ting. The syllables are expanded at the frame level according to the duration information, counting one frame per 10 milliseconds and rounding, so the 208-millisecond [wo] is rounded to 21 frames and the 416-millisecond [ting] is rounded to 42 frames.
Using a speech labeling tool, such as Praat shown in fig. 3, each syllable is further decomposed into multiple phonemes and the duration of each phoneme is labeled. For example: [wo] can be split into [w] and [uo], and [ting] can be split into [t] and [ing]. [w] lasts 112 milliseconds, i.e., 11 frames, and [uo] lasts 96 milliseconds, i.e., 10 frames. Through these two steps, syllables and phonemes with duration information are finally obtained.
Each Chinese character is decomposed according to the above steps to obtain the song syllable set, namely the training set. It should be noted that a Chinese character can generally be split into two phonemes, where the first phoneme is generally a consonant and the second is generally a vowel.
Assuming the input of the model is the three syllables of "I heard" (wo ting dao), the three syllables are repeated according to their own total frame counts, giving the sequence [wo, wo, wo, …, wo, ting, ting, …, ting, dao, dao, …, dao]. The repetition counts of these three syllables are 21, 42, and 42, respectively, since the total frame count of "I" is 21, that of "hear" is 42, and that of "dao" is 42. Correspondingly, the prediction sequence output by the model may be [0,0, …,0,1,1, …,1,0,0, …,0,1,1, …,1,0,0, …,0,1,1, …,1], in which the first 21 positions (11 "0"s followed by 10 "1"s) correspond to the word "I", the middle 42 positions (9 "0"s followed by 33 "1"s) correspond to "hear", and the last 42 positions correspond to "dao".
It can be seen that multiple syllables can be input at one time for the model to process, both during training and in use. The purpose of repeating a syllable is to tell the model, via the repetition count, the total frame count of that syllable. The training process can use the Adam algorithm commonly used for deep neural networks, and the loss function can be the cross-entropy loss commonly used for binary classification.
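A hedged sketch of one training step as described above, using the PhonemeDurationModel sketch from the previous block together with the Adam optimizer and a cross-entropy loss over the frame-level 0/1 target sequence (batching, padding, and convergence checks are omitted; names are illustrative):

```python
import torch
import torch.nn as nn

model = PhonemeDurationModel()                              # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, as mentioned in the text
criterion = nn.BCELoss()                                    # binary cross-entropy over 0/1 frame labels

def train_step(syllable_ids: torch.Tensor, target: torch.Tensor) -> float:
    optimizer.zero_grad()
    pred = model(syllable_ids)                 # (batch, T) per-frame vowel probabilities
    loss = criterion(pred, target.float())     # compare against the frame-level target sequence
    loss.backward()
    optimizer.step()                           # update the network parameters
    return loss.item()

# toy example: a 64-frame syllable group with 30 consonant frames and 34 vowel frames
ids = torch.randint(0, 390, (1, 64))
tgt = torch.cat([torch.zeros(1, 30), torch.ones(1, 34)], dim=1)
print(train_step(ids, tgt))
```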
In this embodiment, information such as the total frame count of each syllable and the frame count of each phoneme in the syllable can be parsed from the lyric text, so the pronunciation frames occupied by the different phonemes in each syllable can be labeled: consonant frames are marked 0 and vowel frames are marked 1, which yields the label sequence of a syllable. The convolutional neural network (CNN) used above is modeled with frame-level syllables as the model input and frame-level phoneme types as the output, generating a sequence of 0s and 1s, and can therefore play an important role in scenarios such as singing voice synthesis and song-cover synthesis.
Therefore, the phoneme pronunciation duration recognition model provided by this embodiment can accurately determine the consonant pronunciation frame count and vowel pronunciation frame count in each syllable, so the corresponding pronunciation durations can be predicted; this markedly improves the naturalness and pronunciation accuracy of singing voice synthesis and offers advantages in both the precision and the degree of automation of phoneme duration estimation.
Referring to fig. 7, fig. 7 is a schematic diagram of a model training apparatus according to an embodiment of the present application, including:
a syllable obtaining module 701, configured to obtain a target syllable and a label of the target syllable from a song syllable set; the label includes: the consonant pronunciation frame count N0 of the consonant phoneme in the target syllable and the vowel pronunciation frame count N1 of the vowel phoneme in the target syllable;
a constructing module 702, configured to construct, according to the consonant pronunciation frame count N0 and the vowel pronunciation frame count N1, a target sequence including N0 consonant marks and N1 vowel marks;
a repeating module 703, configured to repeat the target syllable N times to obtain a syllable group, where N = N0 + N1;
a prediction module 704, configured to input the syllable group into the neural network model so that the neural network model outputs a prediction sequence, where the prediction sequence includes N3 consonant marks and N4 vowel marks, and N3 + N4 = N;
A calculation module 705 for calculating a loss value between the predicted sequence and the target sequence;
an updating module 706, configured to adjust parameters of the neural network model based on the loss value to obtain an updated model;
and an output module 707, configured to use the updated model as a phoneme pronunciation duration recognition model if the updated model converges.
In a specific embodiment, the method further comprises the following steps:
and the iteration module is used for re-acquiring another syllable and the label of the syllable from the song syllable set if the updated model is not converged so as to carry out iterative training on the updated model until the model is converged, and taking the updated model as a phoneme pronunciation duration recognition model.
In one embodiment, the prediction module comprises:
the neural network model determines the total frame number N of the pronunciation of the target syllable based on the repetition times of the target syllable in the syllable group, and predicts the phoneme type of each frame of pronunciation in the target syllable;
if the phoneme type of any frame pronunciation is consonant, recording a consonant mark, and if the phoneme type of any frame pronunciation is vowel, recording a vowel mark to obtain a prediction sequence.
In one embodiment, the neural network model predicts the phone type of each frame pronunciation in the target syllable by querying the phone dictionary.
In one embodiment, the set of song syllables is converted from Chinese lyric text according to the corresponding song audio.
In a specific embodiment, the method further comprises the following steps:
a filling module for pronouncing the frame number N according to the consonant if the target syllable is a monosyllabic vowel syllable0Number of frames N of pronunciation of harmony vowel1Construction includes N0A consonant mark and N1Filling a consonant mark at the head of a target sequence after the target sequence of the vowel mark; accordingly, before calculating the loss value between the predicted sequence and the target sequence, the method further comprises: a consonant tag is filled in the first place of the predicted sequence.
In a specific embodiment, the method further comprises the following steps:
and the optimization module is used for detecting and eliminating noise points in the prediction sequence.
In one embodiment, the calculation module is specifically configured to:
comparing the prediction sequence with the target sequence position by position to obtain N comparison results; and determining the loss value based on the N comparison results.
For more specific working processes of each module and unit in this embodiment, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not described here again.
It can be seen that the present embodiment provides a model training apparatus, which uses the syllables in the syllable set of the song as training samples, and can train to obtain a phoneme pronunciation duration recognition model capable of restricting the pronunciation duration of the phoneme, and the model can determine, for any one syllable, how long the vowel pronounces and how long the consonant pronounces when the syllable is pronounced in the song, so that the duration of the pronunciation phoneme of the syllable in the song can be labeled by using the model, and the model can be used for synthesizing the singing voice based on the labeling information.
Referring to fig. 8, fig. 8 is a schematic diagram of a phoneme pronunciation duration labeling device according to an embodiment of the present application, including:
a text obtaining module 801, configured to obtain a text to be labeled;
a conversion module 802, configured to convert a text to be annotated into a syllable;
a processing module 803, configured to input syllables into the phoneme pronunciation duration recognition model described above, so that the phoneme pronunciation duration recognition model outputs a labeling sequence;
a determining module 804, configured to determine a vowel pronunciation frame number of a vowel phone and a consonant pronunciation frame number of a consonant phone in the syllable based on the tagging sequence;
a calculating module 805, configured to record a product of a vowel sound frame number and a unit frame length as a vowel sound duration, and record a product of a consonant sound frame number and the unit frame length as a consonant sound duration;
and a labeling module 806, configured to label the vowel pronunciation time and the consonant pronunciation time to the syllable.
In a specific embodiment, the method further comprises the following steps:
and the singing voice synthesis model training module is used for constructing a set marked with vowel pronunciation time and consonant pronunciation time so as to train the singing voice synthesis model by utilizing the song syllable set.
For more specific working processes of each module and unit in this embodiment, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not described here again.
As can be seen, the present embodiment provides a phoneme pronunciation time labeling apparatus in which a phoneme pronunciation time identification model is capable of determining, for any syllable, how long the syllable is pronounced in a song, how long the vowel is pronounced, and how long the consonant is pronounced, and therefore, the time length of a pronunciation phoneme of the syllable in the song can be labeled using the model, so that it can be used to synthesize a song voice based on the labeling information.
Further, the embodiment of the application also provides electronic equipment. The electronic device may be the server 50 shown in fig. 9 or the terminal 60 shown in fig. 10. Fig. 9 and 10 are each a block diagram of an electronic device according to an exemplary embodiment, and the contents of the diagrams should not be construed as any limitation to the scope of use of the present application.
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application. The server 50 may specifically include: at least one processor 51, at least one memory 52, a power supply 53, a communication interface 54, an input output interface 55, and a communication bus 56. The memory 52 is used for storing a computer program, which is loaded and executed by the processor 51 to implement the relevant steps in the model training method and the phoneme pronunciation duration labeling method disclosed in any of the foregoing embodiments.
In this embodiment, the power supply 53 is used to provide operating voltage for each hardware device on the server 50; the communication interface 54 can create a data transmission channel between the server 50 and an external device, and the communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 55 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
The memory 52 may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like as a carrier for storing resources, the resources stored thereon include an operating system 521, a computer program 522, data 523, and the like, and the storage manner may be a transient storage or a permanent storage.
The operating system 521 is used for managing and controlling hardware devices and computer programs 522 on the Server 50 to realize the operation and processing of the processor 51 on the data 523 in the memory 52, and may be a Windows Server, Netware, Unix, Linux, or the like. The computer program 522 may further include a computer program that can be used to perform other specific tasks in addition to the computer program that can be used to perform the model training method and the phoneme pronunciation duration labeling method disclosed in any of the foregoing embodiments. The data 523 may include data such as developer information of the application program in addition to data such as update information of the application program, song, text, syllable, and the like.
Fig. 10 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure, where the terminal 60 may specifically include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
In general, the terminal 60 in the present embodiment includes: a processor 61 and a memory 62.
The processor 61 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 61 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 61 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 61 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 61 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 62 may include one or more computer-readable storage media, which may be non-transitory. The memory 62 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 62 is at least used for storing a computer program 621, wherein after being loaded and executed by the processor 61, the computer program is capable of implementing relevant steps in the model training method and the phoneme pronunciation duration tagging method executed by the terminal side disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 62 may also include an operating system 622 and data 623, etc., which may be stored in a transient or persistent manner. The operating system 622 may include Windows, Unix, Linux, etc. Data 623 may include, but is not limited to, update information for applications.
In some embodiments, the terminal 60 may also include a display 63, an input/output interface 64, a communication interface 65, a sensor 66, a power supply 67, and a communication bus 68.
Those skilled in the art will appreciate that the configuration shown in fig. 10 is not intended to be limiting of terminal 60 and may include more or fewer components than those shown.
Further, an embodiment of the present application also discloses a storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement the model training method disclosed in any of the foregoing embodiments. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
It should be noted that the above embodiments are only preferred embodiments of the present application and are not intended to limit it; any modifications, equivalent replacements, improvements, and the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts among the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and relevant points can be found in the description of the method.
Specific examples are used herein to explain the principle and implementation of the present application; the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (11)

1. A method of model training, comprising:
obtaining a target syllable and a label of the target syllable from a song syllable set; wherein the label includes: the consonant pronunciation frame number N0 of the consonant phoneme in the target syllable and the vowel pronunciation frame number N1 of the vowel phoneme in the target syllable;
constructing, according to the consonant pronunciation frame number N0 and the vowel pronunciation frame number N1, a target sequence comprising N0 consonant markers and N1 vowel markers;
repeating the target syllable N times to obtain a syllable group, wherein N = N0 + N1;
inputting the syllable group into a neural network model so that the neural network model outputs a prediction sequence, wherein the prediction sequence comprises N3 consonant markers and N4 vowel markers, and N3 + N4 = N;
calculating a loss value between the prediction sequence and the target sequence;
adjusting parameters of the neural network model based on the loss value to obtain an updated model;
and if the updated model has converged, taking the updated model as a phoneme pronunciation duration recognition model.
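For readers wanting a concrete picture of the training step in claim 1, the following minimal PyTorch-style sketch shows one possible implementation. The GRU backbone, the per-frame cross-entropy loss, the ordering of consonant markers before vowel markers in the target sequence, and names such as FramePhonemeTagger and train_step are illustrative assumptions, not details specified by this application.

import torch
import torch.nn as nn

CONSONANT, VOWEL = 0, 1  # marker ids for the two phoneme types

class FramePhonemeTagger(nn.Module):
    """Maps a syllable repeated N times to N per-frame consonant/vowel logits."""
    def __init__(self, vocab_size: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # consonant vs. vowel per frame

    def forward(self, syllable_group: torch.Tensor) -> torch.Tensor:
        # syllable_group: (1, N) — the same syllable id repeated N times
        h, _ = self.rnn(self.embed(syllable_group))
        return self.head(h)  # (1, N, 2) per-frame logits

def train_step(model, optimizer, syllable_id: int, n0: int, n1: int) -> float:
    """One step: build the target sequence, repeat the syllable, compute the loss."""
    n = n0 + n1
    # Target sequence: N0 consonant markers followed by N1 vowel markers (assumed order).
    target = torch.tensor([[CONSONANT] * n0 + [VOWEL] * n1])            # (1, N)
    syllable_group = torch.full((1, n), syllable_id, dtype=torch.long)  # repeat N times
    logits = model(syllable_group)                                      # (1, N, 2)
    loss = nn.functional.cross_entropy(logits.view(-1, 2), target.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (hypothetical vocabulary size and syllable id):
# model = FramePhonemeTagger(vocab_size=1000)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = train_step(model, optimizer, syllable_id=42, n0=3, n1=17)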
2. The method of claim 1, further comprising:
if the updated model is not converged, obtaining another syllable and the label of that syllable from the song syllable set again, so as to iteratively train the updated model until the model converges, and taking the converged model as the phoneme pronunciation duration recognition model.
3. The method of claim 1, wherein the inputting the syllable group into the neural network model so that the neural network model outputs a prediction sequence comprises:
the neural network model determines the total frame number N of pronunciation of the target syllable based on the number of repetitions of the target syllable in the syllable group and predicts the phone type of each frame of pronunciation in the target syllable;
if the phoneme type of any frame pronunciation is consonant, recording a consonant mark, and if the phoneme type of any frame pronunciation is vowel, recording a vowel mark to obtain the prediction sequence.
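A minimal sketch of the decoding step described in claim 3, assuming the per-frame logits from the model above and simple argmax decoding (both assumptions; the application does not fix how per-frame phoneme types are derived):

import torch

def decode_prediction_sequence(logits: torch.Tensor) -> list[str]:
    """Turn (1, N, 2) per-frame logits into a prediction sequence of 'C'/'V' markers."""
    CONSONANT = 0  # same marker id convention as the training sketch
    frame_types = logits.argmax(dim=-1).squeeze(0).tolist()  # N predicted classes
    return ['C' if t == CONSONANT else 'V' for t in frame_types]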
4. The method of claim 3, wherein the predicting the phoneme type of each frame of pronunciation in the target syllable comprises:
predicting the phoneme type of each frame of pronunciation in the target syllable by querying a phoneme dictionary.
5. The method of any one of claims 1 to 4, wherein, if the target syllable is a vowel-only syllable, after the constructing, according to the consonant pronunciation frame number N0 and the vowel pronunciation frame number N1, the target sequence comprising N0 consonant markers and N1 vowel markers, the method further comprises:
padding a consonant marker at the head of the target sequence;
correspondingly, before the calculating the loss value between the prediction sequence and the target sequence, the method further comprises:
padding a consonant marker at the head of the prediction sequence.
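A one-line illustration of the padding in claim 5, keeping the two sequences aligned when the syllable has no consonant frames (pure Python; the 'C'/'V' string markers are an assumed encoding):

def pad_leading_consonant(sequence: list[str]) -> list[str]:
    """Prepend a single consonant marker to a marker sequence."""
    return ['C'] + sequence

# e.g. pad_leading_consonant(['V'] * 12) -> ['C', 'V', 'V', ..., 'V']  (13 markers)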
6. The method according to any one of claims 1 to 4, wherein, before the calculating the loss value between the prediction sequence and the target sequence, the method further comprises:
detecting and eliminating noise in the prediction sequence.
7. The method of any one of claims 1 to 4, wherein the calculating the loss value between the prediction sequence and the target sequence comprises:
comparing the prediction sequence with the target sequence position by position to obtain N comparison results;
determining the loss value based on the N comparison results.
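One simple reading of claim 7 is sketched below: compare the two sequences position by position, collect N binary comparison results, and derive a loss from them. The mismatch-rate formulation is an assumption (a differentiable loss such as per-frame cross-entropy would typically be used during training), not something stated by the application.

def alignment_loss(predicted: list[str], target: list[str]) -> float:
    """Position-wise comparison of two equal-length marker sequences."""
    assert len(predicted) == len(target)                        # both have length N
    comparisons = [p != t for p, t in zip(predicted, target)]   # N comparison results
    return sum(comparisons) / len(comparisons)                  # fraction of mismatches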
8. A phoneme pronunciation duration labeling method, characterized by comprising:
acquiring a text to be labeled;
converting the text to be labeled into syllables;
inputting the syllable into the phoneme pronunciation duration recognition model obtained by the method of any one of claims 1 to 7, so that the phoneme pronunciation duration recognition model outputs an annotation sequence;
determining the vowel pronunciation frame number of the vowel phoneme and the consonant pronunciation frame number of the consonant phoneme in the syllable based on the annotation sequence;
recording the product of the vowel pronunciation frame number and the unit frame length as the vowel pronunciation duration, and the product of the consonant pronunciation frame number and the unit frame length as the consonant pronunciation duration;
and labeling the syllable with the vowel pronunciation duration and the consonant pronunciation duration.
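The post-processing in claim 8 reduces to counting markers and multiplying by the unit frame length, as in the sketch below. The 10 ms unit frame length and the function name are illustrative assumptions; the application does not specify a frame size.

UNIT_FRAME_SECONDS = 0.010  # assumed unit frame length

def durations_from_annotation(annotation: list[str]) -> tuple[float, float]:
    """Return (consonant_duration, vowel_duration) in seconds from a 'C'/'V' annotation sequence."""
    consonant_frames = annotation.count('C')
    vowel_frames = annotation.count('V')
    return (consonant_frames * UNIT_FRAME_SECONDS,
            vowel_frames * UNIT_FRAME_SECONDS)

# e.g. durations_from_annotation(['C'] * 3 + ['V'] * 17) -> (0.03, 0.17)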
9. The method of claim 8, further comprising:
and constructing a song syllable set labeled with the vowel pronunciation duration and the consonant pronunciation duration, so as to train a song synthesis model by using the song syllable set.
10. An electronic device, comprising a processor and a memory; wherein the memory is configured to store a computer program, and the computer program is loaded and executed by the processor to implement the method according to any one of claims 1 to 9.
11. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, carry out a method according to any one of claims 1 to 9.
CN202111179340.1A 2021-10-09 2021-10-09 Model training method and component, phoneme pronunciation duration labeling method and component Pending CN113936627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111179340.1A CN113936627A (en) 2021-10-09 2021-10-09 Model training method and component, phoneme pronunciation duration labeling method and component

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111179340.1A CN113936627A (en) 2021-10-09 2021-10-09 Model training method and component, phoneme pronunciation duration labeling method and component

Publications (1)

Publication Number Publication Date
CN113936627A true CN113936627A (en) 2022-01-14

Family

ID=79278044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111179340.1A Pending CN113936627A (en) 2021-10-09 2021-10-09 Model training method and component, phoneme pronunciation duration labeling method and component

Country Status (1)

Country Link
CN (1) CN113936627A (en)

Similar Documents

Publication Publication Date Title
EP3879525B1 (en) Training model for speech synthesis
CN106601228B (en) Sample labeling method and device based on artificial intelligence rhythm prediction
US11488577B2 (en) Training method and apparatus for a speech synthesis model, and storage medium
KR102327614B1 (en) Clockwork Hierarchical Transition Encoder
WO2017067206A1 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
CN110782870A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
KR20230003056A (en) Speech recognition using non-speech text and speech synthesis
CN112309366B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN110264991A (en) Training method, phoneme synthesizing method, device, equipment and the storage medium of speech synthesis model
JP2022531414A (en) End-to-end automatic speech recognition of digit strings
CN105244020A (en) Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
KR102646229B1 (en) Attention-based clockwork hierarchical variant encoder
CN110010136B (en) Training and text analysis method, device, medium and equipment for prosody prediction model
JP2008152260A (en) Rhythm word grouping method and device
CN113808571B (en) Speech synthesis method, speech synthesis device, electronic device and storage medium
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
CN112309367A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
JP2023546930A (en) Using speech recognition to improve interlingual speech synthesis
EP4035085A1 (en) Training neural networks to generate structured embeddings
KR102450936B1 (en) Method for performing synthesis voice generation work for text
CN113936627A (en) Model training method and component, phoneme pronunciation duration labeling method and component
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN114155829A (en) Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment
CN113223486B (en) Information processing method, information processing device, electronic equipment and storage medium
CN112786020B (en) Lyric timestamp generation method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination