CN118230712A - Training method and device of speech synthesis model, medium and electronic equipment - Google Patents

Training method and device of speech synthesis model, medium and electronic equipment

Info

Publication number
CN118230712A
Authority
CN
China
Prior art keywords
audio
original
speech synthesis
synthesis model
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410248274.6A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Thread Intelligent Technology Chengdu Co ltd
Original Assignee
Moore Thread Intelligent Technology Chengdu Co ltd
Filing date
Publication date
Application filed by Moore Thread Intelligent Technology Chengdu Co ltd filed Critical Moore Thread Intelligent Technology Chengdu Co ltd
Publication of CN118230712A


Abstract

The specification discloses a training method, apparatus, medium and electronic device for a speech synthesis model. First, an original text is acquired, together with the original audio corresponding to the original text. Then, the characters in the original audio are recognized, the characters in the original text are replaced with the characters recognized from the original audio to obtain a sample text, and the noise segments in the original audio are marked. The sample text is input into the speech synthesis model to be trained to obtain synthesized audio. Finally, the speech synthesis model to be trained is trained according to the obtained sample text and the marked noise segments, to obtain a trained speech synthesis model. The method can train the speech synthesis model on low-quality audio containing extra (over-read), omitted (under-read) or misread characters and burst noise, which lowers the quality requirements on the audio used for training and reduces the cost of collecting training data for the speech synthesis model.

Description

Training method and device of speech synthesis model, medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a training method and apparatus for a speech synthesis model, a medium, and an electronic device.
Background
With the continuous development of technology, speech synthesis has received widespread attention. Speech synthesis is the technology of converting text into speech. Because speech synthesis models convert text into speech well, they are widely applied in various voice interaction scenarios, such as reading books aloud and announcing queue numbers in restaurants.
Generally, the audio in the training data of a speech synthesis model is obtained by giving a text to a speaker and having the speaker read it aloud. However, to obtain high-quality audio, the requirements on the training data are very strict: disfluencies such as stumbling or coughing are not allowed while the speaker records the audio, so as to guarantee the fluency, accuracy and so on of the audio. This increases the difficulty of training a speech synthesis model.
Based on this, the present specification provides a training method of a speech synthesis model.
Disclosure of Invention
The present disclosure provides a method, apparatus, medium and electronic device for training a speech synthesis model, so as to at least partially solve the foregoing problems in the prior art.
The technical scheme adopted in the specification is as follows:
The specification provides a training method of a speech synthesis model, the method comprising:
acquiring an original text and acquiring original audio corresponding to the original text;
identifying characters in the original audio;
replacing characters in the original text with characters recognized from the original audio to obtain a sample text; and marking noise segments in the original audio;
training the speech synthesis model to be trained according to the obtained sample text and the marked noise segments, to obtain a trained speech synthesis model.
Optionally, replacing the characters in the original text with the characters identified from the original audio specifically includes:
if the characters recognized from the original audio contain extra characters beyond those in the original text (over-reading), adding the extra recognized characters to the original text; and/or
if it is determined, according to the characters recognized from the original audio, that some characters in the original text were omitted (under-reading), deleting the omitted characters from the original text; and/or
if the characters recognized from the original audio result from misreading characters in the original text, modifying the characters in the original text into the recognized characters of the original audio.
Optionally, marking the noise section in the original audio specifically includes:
taking the audio frames in the original audio that satisfy a specified condition as a noise segment, and marking them; the specified condition at least includes that an audio frame is not fluent relative to the original text.
Optionally, training the speech synthesis model to be trained according to the obtained sample text and the marked noise segment, which specifically includes:
Inputting the sample text into a speech synthesis model to be trained to obtain synthesized audio output by the speech synthesis model to be trained;
determining a loss according to a difference between the synthesized audio and an audio segment other than the noise segment in the original audio;
And training the speech synthesis model to be trained according to the loss to obtain a trained speech synthesis model.
Optionally, the speech synthesis model includes a mel-frequency spectrum conversion network and an audio synthesis network.
Optionally, inputting the sample text into the to-be-trained speech synthesis model to obtain the synthesized audio output by the to-be-trained speech synthesis model, which specifically includes:
determining a phoneme sequence of the sample text;
inputting the phoneme sequence into a Mel frequency spectrum conversion network in the speech synthesis model to be trained to obtain a first Mel frequency spectrum output by the Mel frequency spectrum conversion network;
And inputting the first Mel frequency spectrum into an audio synthesis network in the speech synthesis model to be trained to obtain synthesized audio output by the audio synthesis network.
Optionally, determining the loss according to the difference between the synthesized audio and the audio segment except the noise segment in the original audio specifically includes:
determining a second mel frequency spectrum of the audio segment other than the noise segment;
determining a first loss from the second mel-frequency spectrum and the first mel-frequency spectrum;
and determining a second loss according to the audio segments except the noise segment and the synthesized audio.
Optionally, determining the first loss according to the second mel spectrum and the first mel spectrum specifically includes:
Determining a first loss based on at least one of a difference between the second mel frequency spectrum and the first mel frequency spectrum, a difference between a fundamental frequency of a phoneme in the second mel frequency spectrum and a fundamental frequency of a phoneme in the first mel frequency spectrum, a difference between an energy of a phoneme in the second mel frequency spectrum and an energy of a phoneme in the first mel frequency spectrum, and a difference between a frame length of a phoneme in the second mel frequency spectrum and a frame length of a phoneme in the first mel frequency spectrum.
Optionally, training the speech synthesis model to be trained according to the loss, specifically including:
Training the mel-frequency spectrum conversion network according to the first loss;
And training the audio synthesis network according to the second loss.
The present specification provides a training device of a speech synthesis model, comprising:
the acquisition module is used for acquiring an original text and acquiring original audio corresponding to the original text;
the recognition module is used for recognizing characters in the original audio;
The processing module is used for replacing characters in the original text with characters identified from the original audio to obtain a sample text; and, marking noise segments in the original audio;
And the training module is used for training the speech synthesis model to be trained according to the obtained sample text and the marked noise section to obtain a trained speech synthesis model.
Optionally, the processing module is specifically configured to: add the recognized extra characters of the original audio to the original text if the characters recognized from the original audio contain extra characters beyond those in the original text; and/or delete the omitted characters from the original text if it is determined, according to the characters recognized from the original audio, that some characters in the original text were omitted; and/or modify the characters in the original text into the recognized characters of the original audio if the recognized characters result from misreading characters in the original text.
Optionally, the processing module is specifically configured to take the audio frames in the original audio that satisfy a specified condition as a noise segment and mark them; the specified condition at least includes that an audio frame is not fluent relative to the original text.
Optionally, the training module is specifically configured to input the sample text into a speech synthesis model to be trained, so as to obtain a synthesized audio output by the speech synthesis model to be trained; determining a loss according to a difference between the synthesized audio and an audio segment other than the noise segment in the original audio; and training the speech synthesis model to be trained according to the loss to obtain a trained speech synthesis model.
Optionally, the speech synthesis model includes a mel-frequency spectrum conversion network and an audio synthesis network.
Optionally, the training module is specifically configured to determine a phoneme sequence of the sample text; inputting the phoneme sequence into a Mel frequency spectrum conversion network in the speech synthesis model to be trained to obtain a first Mel frequency spectrum output by the Mel frequency spectrum conversion network; and inputting the first Mel frequency spectrum into an audio synthesis network in the speech synthesis model to be trained to obtain synthesized audio output by the audio synthesis network.
Optionally, the training module is specifically configured to determine a second mel spectrum of the audio segment other than the noise segment; determining a first loss from the second mel-frequency spectrum and the first mel-frequency spectrum; and determining a second loss according to the audio segments except the noise segment and the synthesized audio.
Optionally, the training module is specifically configured to determine the first loss according to at least one of a difference between the second mel spectrum and the first mel spectrum, a difference between a fundamental frequency of a phoneme in the second mel spectrum and a fundamental frequency of a phoneme in the first mel spectrum, a difference between energy of a phoneme in the second mel spectrum and energy of a phoneme in the first mel spectrum, and a difference between a frame length of a phoneme in the second mel spectrum and a frame length of a phoneme in the first mel spectrum.
Optionally, the training module is specifically configured to train the mel spectrum conversion network according to the first loss; and training the audio synthesis network according to the second loss.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the above-described training method of a speech synthesis model.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the training method of the speech synthesis model described above when executing the program.
At least one of the technical solutions adopted in this specification can achieve the following beneficial effects:
In the training method of a speech synthesis model provided by this specification, an original text is first acquired, together with the original audio corresponding to it. Then, the characters in the original audio are recognized, the characters in the original text are replaced with the characters recognized from the original audio to obtain a sample text, and the noise segments in the original audio are marked. The sample text is input into the speech synthesis model to be trained to obtain synthesized audio. Finally, the speech synthesis model to be trained is trained according to the obtained sample text and the marked noise segments, to obtain a trained speech synthesis model.
As can be seen from the above, the method can train a speech synthesis model even with low-quality audio containing extra, omitted or misread characters and burst noise, which lowers the quality requirements on the original audio used for training and thereby reduces the cost of collecting training data for the speech synthesis model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate the exemplary embodiments of the present specification and, together with their description, serve to explain the specification without unduly limiting it. In the drawings:
FIG. 1 is a flow chart of a training method of a speech synthesis model in the present specification;
FIG. 2 is a schematic diagram of noise marking of an original audio provided in the present specification;
FIG. 3 is a training schematic diagram of a speech synthesis model provided in the present specification;
FIG. 4 is a training schematic diagram of a speech synthesis model provided in the present specification;
FIG. 5 is a training schematic diagram of a speech synthesis model provided in the present specification;
FIG. 6 is a schematic diagram of a training device for a speech synthesis model provided in the present specification;
Fig. 7 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flow chart of a training method of a speech synthesis model provided in the present specification, which specifically includes the following steps:
S100: and acquiring an original text, and acquiring original audio corresponding to the original text.
Typically, when training data is collected for a speech synthesis model, the speaker must read from a given text and re-record the speech whenever a misreading occurs. These strict requirements increase the training cost of the speech synthesis model. This specification therefore provides a training method for a speech synthesis model that lowers the quality requirements on the audio and reduces the cost of collecting training data for the speech synthesis model.
The execution subject of the embodiments of this specification may be any computing device with computing capability, such as a terminal or a server. The following description takes the server as the execution subject.
The server may acquire an original text and the original audio corresponding to it. In one or more embodiments of this specification, the original audio may be audio recorded while a speaker reads the original text aloud, or it may be obtained through other channels, for example generated from the original text by another trained intelligent model. This specification does not limit how the original audio corresponding to the original text is acquired.
S102: characters in the original audio are identified.
S104: replacing characters in the original text with characters identified from the original audio to obtain a sample text; and, marking noise segments in the original audio.
After the server obtains the original text and the original audio, the server may identify characters in the original audio.
It should be noted that, in one or more embodiments of this specification, when recognizing the characters in the original audio, the Praat software ("doing phonetics by computer") may be used to annotate the original audio and obtain the corresponding text content; the characters in that text content are the recognized characters of the original audio. Of course, other methods may also be used to recognize the characters in the original audio, for example manual labeling of the characters, and this specification is not limited in this respect.
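Purely as an illustration (the patent names Praat annotation or manual labeling rather than any particular recognition toolkit), the character recognition of step S102 could be sketched with an off-the-shelf speech recognizer; OpenAI Whisper is an assumption here, not part of the patent:

```python
# Hedged sketch of recognizing the characters spoken in the original audio.
# Whisper is an illustrative assumption; any recognizer (or Praat-based
# annotation) that yields the spoken characters would serve this step.
import whisper

def recognize_characters(audio_path: str) -> str:
    model = whisper.load_model("base")                # pretrained ASR model
    result = model.transcribe(audio_path, language="zh")
    return result["text"]                             # recognized characters
```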
Further, the server may replace the characters in the original text with the characters recognized from the original audio to obtain the sample text. That is, the server may modify the original text according to the recognized characters so that the modified text is consistent with the characters in the original audio, and use the modified original text as the sample text.
Specifically, if the characters recognized from the original audio contain extra characters beyond those in the original text, the extra recognized characters are added to the original text; and/or, if it is determined from the recognized characters of the original audio that some characters in the original text were omitted, the omitted characters are deleted from the original text; and/or, if the recognized characters of the original audio result from misreading characters in the original text, the characters in the original text are modified into the recognized characters.
For example, suppose the original text is "今天空气真清新" ("the air is so fresh today"). When the speaker reads it, one character is misread, so the character recognized from the original audio at that position differs from the corresponding character in the original text. To keep the original text consistent with the original audio, the server changes that character in the original text into the recognized character. Similarly, when the recognized characters of the original audio include extra characters beyond the original text, i.e., more characters are recognized than the original text contains, the extra characters can be added to the original text. When some characters of the original text were not read, the unread characters can be deleted from the original text.
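The three replacement rules amount to aligning the recognized characters with the original text and applying insert, delete and substitute edits. A minimal sketch follows, assuming Python's difflib for the alignment (the patent does not prescribe an alignment method, and the example strings are illustrative):

```python
# Build the sample text: make the text consistent with what was actually
# spoken, recording which replacement rule fired for each difference.
from difflib import SequenceMatcher

def build_sample_text(original_text: str, recognized: str):
    out, edits = [], []
    for op, i1, i2, j1, j2 in SequenceMatcher(
            a=original_text, b=recognized).get_opcodes():
        if op == "equal":
            out.append(original_text[i1:i2])
        elif op == "insert":                 # extra (over-read) characters
            out.append(recognized[j1:j2])
            edits.append(("add", recognized[j1:j2]))
        elif op == "delete":                 # omitted (under-read) characters
            edits.append(("delete", original_text[i1:i2]))
        else:                                # "replace": misread characters
            out.append(recognized[j1:j2])
            edits.append(("modify", original_text[i1:i2], recognized[j1:j2]))
    return "".join(out), edits

sample, edits = build_sample_text("今天空气真清新", "今天天空气真清新")
print(sample, edits)                         # one character was read twice
```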
In general, the original audio corresponding to the acquired original text may contain some noise, such as a person coughing, a car horn, or noise produced by device resonance; that is, environmental factors, device factors and the like introduce noise into the original audio. To reduce the influence of this noise, to keep the characters in the original audio consistent with those in the original text, and to improve the accuracy of the trained speech synthesis model, in one or more embodiments of this specification the server may further mark the noise segments in the original audio. Specifically, the server may divide the original audio into audio frames; when an audio frame in the original audio satisfies a specified condition, that frame may be marked in the original audio, the mark indicating that noise exists in the frame. In short, the server may take the audio frames in the original audio that satisfy the specified condition as a noise segment and mark them.
In one or more embodiments of the present disclosure, assuming that the original text is "today's air is fresh", and that there is noise in an audio segment corresponding to a "sky" character in the corresponding original audio, an audio frame corresponding to "sky" may be determined, and the audio frame corresponding to "sky" may be marked. Further assuming that one character corresponds to one audio frame and that the frame length corresponding to one character is 50 ms, the original text has 7 audio frames (7 characters), and the total frame length is 350 ms, the second audio frame and the third audio frame may be used as noise segments ("sky" corresponding audio frames), or the 51 th to 150 th audio segment in the original audio may be marked as noise segments.
Fig. 2 is a schematic diagram of noise marking of an original audio according to the present application. The original text is "this you want to write the answer", but the characters recognized from the original audio contain one extra "you"; that is, one character was read twice. The audio frame corresponding to the extra "you" therefore needs to be marked as a noise frame, i.e., marked as <noise>, as shown by the shaded portion in Fig. 2, the audio segment at 0.413596 s.
The specified condition at least includes that an audio frame of the original audio is not fluent relative to the original text, where "not fluent" at least includes: burst noise exists in the original audio (such as a person coughing, a car horn, or noise from device resonance), or the characters recognized from the original audio are inconsistent with the characters of the original text in the corresponding audio frames.
It should be noted that, a noise segment in the original audio refers to a set of audio frames having noise in the original audio.
In one or more embodiments of this specification, when marking a noise segment, the audio segment between the position where the noise starts and the position where it ends may be marked with <noise>, indicating that the audio between the two positions contains noise. Alternatively, <noise> may be marked separately at the start position and the end position of the noise, and other marking methods may also be used; this specification is not limited in this respect.
S106: and training the speech synthesis model to be trained according to the obtained sample text and the marked noise section to obtain a trained speech synthesis model.
The server may train the speech synthesis model to be trained according to the obtained sample text and the marked noise segments, to obtain a trained speech synthesis model. Specifically, after obtaining the sample text and the original audio with its noise segments marked, the server may input the sample text into the speech synthesis model to be trained and obtain the synthesized audio output by the model, so that in subsequent steps the model can be trained from the marked original audio and the synthesized audio.
In one or more embodiments of this specification, the phoneme sequence of the sample text may be determined first, and the speech synthesis model to be trained may then be trained according to that phoneme sequence and the marked noise segments, to obtain a trained speech synthesis model. Fig. 3 is a schematic diagram of training a speech synthesis model according to this specification. The sample text is "我们今天去郊游" ("we are going on an outing today"), and the corresponding phoneme sequence is "wo3 men5 jin1 tian1 qu4 jiao1 you2". The noisy characters are "今天去", so the noise segment corresponding to those characters can be marked in the original audio corresponding to the original text. When training the speech synthesis model, the phoneme sequence of the sample text is input into the model to obtain the synthesized audio output by the model to be trained; the synthesized audio contains a segment aligned with the noise segment marked in the original audio, and the speech synthesis model is then trained based on the synthesized audio and the marked noise segment of the original audio.
Specifically, in this specification, the server may determine the loss according to the difference between the synthesized audio and the audio segments other than the noise segments, and then train the speech synthesis model to be trained according to the loss, to obtain a trained speech synthesis model.
It should be noted that, in one or more embodiments of this specification, the loss is calculated from the difference between the synthesized audio output by the speech synthesis model to be trained and the audio segments other than the noise segments; that is, the noise segments of the audio are excluded when calculating the loss. However, the sample text input into the speech synthesis model to be trained still contains the text corresponding to the noise segments, so that the model can learn the context information of the audio in the text. The text corresponding to a noise segment may be marked in the original text, for example with [noise] or with another symbol; the specific marking method is not limited in this specification.
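A minimal sketch of this masked loss follows, assuming PyTorch and a mean-squared error (the patent fixes neither the framework nor the error form); the mask simply removes the marked noise frames from the comparison, while the corresponding text remains in the model input:

```python
# Loss over non-noise frames only: pred_mel/target_mel are [T, n_mels]
# aligned spectrogram frames; noise_mask is a bool [T] tensor that is
# True inside marked noise segments.
import torch

def masked_mel_loss(pred_mel: torch.Tensor, target_mel: torch.Tensor,
                    noise_mask: torch.Tensor) -> torch.Tensor:
    keep = ~noise_mask                       # frames outside noise segments
    return torch.mean((pred_mel[keep] - target_mel[keep]) ** 2)
```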
In the training method of a speech synthesis model shown in Fig. 1, the original text is modified according to the omitted, extra and misread characters in the original audio to guarantee consistency between the original text and the original audio; the modified original text serves as the sample text, the noise segments in the original audio are marked, and the audio segments other than the noise segments serve as labels. The speech synthesis model to be trained is then trained from the sample text and the labels, to obtain a trained speech synthesis model. With this method, a speech synthesis model can be trained even when the original audio corresponding to the original text contains extra, omitted or misread characters, or contains other burst noise; high-quality audio whose recognized characters exactly match the original text is not required. This lowers the quality requirements on the audio. When the original audio is recorded from a speaker reading the text, the audio need not be re-recorded, which greatly improves recording efficiency; compared with the prior art, the method provided by the embodiments of this specification can improve recording efficiency by more than a factor of two, thereby reducing the cost of collecting training data for the speech synthesis model.
Further, in order to reduce the noise in the original audio and obtain a better trained speech synthesis model, in one or more embodiments of this specification, before the noise segments are marked in step S104, each audio frame of the original audio may be determined, and the audio frames whose signal-to-noise ratio is not greater than a preset threshold may be deleted, so as to remove the audio frames in which no speaker voice exists. For example, if the original text is "今天空气真清新" ("the air is so fresh today") and a long segment of noise exists between the characters "天" and "空" in the original audio, the audio frames corresponding to that long segment of noise are deleted.
The server may also perform noise reduction on the audio frames whose signal-to-noise ratio is greater than the preset threshold, so as to suppress the influence of noise.
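As a sketch of this screening step (numpy; the threshold, frame length and noise-power estimate are illustrative assumptions, not values from the patent):

```python
# Drop audio frames whose SNR is not greater than a preset threshold,
# keeping only frames that plausibly contain the speaker's voice.
import numpy as np

def drop_low_snr_frames(audio: np.ndarray, sr: int, noise_power: float,
                        snr_db_threshold: float = 0.0,
                        frame_ms: int = 50) -> np.ndarray:
    frame_len = int(sr * frame_ms / 1000)
    kept = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        power = float(np.mean(frame ** 2)) + 1e-12
        snr_db = 10.0 * np.log10(power / (noise_power + 1e-12))
        if snr_db > snr_db_threshold:        # frame carries speech energy
            kept.append(frame)
    return np.concatenate(kept) if kept else np.empty(0, dtype=audio.dtype)
```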
After the noise reduction, the audio segments corresponding to noise that was not eliminated from the original audio are the noise segments described in step S104 above. That is, the specified condition in step S104 further includes: audio frames corresponding to noise that remains after the original audio has been denoised.
In one or more embodiments of this specification, a deep complex convolution recurrent network (DCCRN) may be used to denoise the audio frames whose signal-to-noise ratio is greater than the preset threshold, and software such as Adobe Audition may be used to enhance the human voice, i.e., the voice reading the characters, in the denoised audio.
In one or more embodiments of the present specification, the speech synthesis model to be trained includes at least: a mel-frequency spectrum conversion network and an audio synthesis network.
In one or more embodiments of this specification, the mel spectrum conversion network may be FastSpeech 2 and the audio synthesis network may be HiFi-GAN. For FastSpeech 2, the training samples are "phoneme sequence-mel spectrum" pairs; for HiFi-GAN, the training samples are "mel spectrum-audio" pairs.
Then, in step S106, when the sample text is input into the speech synthesis model to be trained, the server may determine the phoneme sequence of the sample text (if the sample text is Chinese, the phoneme sequence is the tone-annotated pinyin sequence). For example, for the sample text "我们今天去郊游" ("we are going on an outing today"), the phoneme sequence is "wo3 men5 jin1 tian1 qu4 jiao1 you2". The phoneme sequence is then input into the mel spectrum conversion network of the speech synthesis model to be trained to obtain the first mel spectrum output by that network. Finally, the first mel spectrum is input into the audio synthesis network of the speech synthesis model to be trained to obtain the synthesized audio output by the audio synthesis network.
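For the phoneme-sequence step, a Chinese grapheme-to-pinyin sketch is shown below; pypinyin is an assumption for illustration, as the patent does not name a conversion tool:

```python
# Tone-numbered pinyin sequence for a Chinese sample text, with the neutral
# tone written as 5 (matching the "men5" convention in the example above).
from pypinyin import lazy_pinyin, Style

def to_phoneme_sequence(sample_text: str) -> list[str]:
    return lazy_pinyin(sample_text, style=Style.TONE3,
                       neutral_tone_with_five=True)

print(to_phoneme_sequence("我们今天去郊游"))
# ['wo3', 'men5', 'jin1', 'tian1', 'qu4', 'jiao1', 'you2']
```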
Then, when determining the loss according to the difference between the synthesized audio and the audio segments of the original audio other than the noise segments in step S106, the server may determine the second mel spectrum of the audio segments other than the noise segments, determine the first loss from the second mel spectrum and the first mel spectrum, and determine the second loss from the audio segments other than the noise segments and the synthesized audio.
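Computing the second (label) mel spectrum from the non-noise audio segments might look as follows; librosa and the 80-band setting are illustrative assumptions:

```python
# Second mel spectrum: concatenate the audio segments other than the noise
# segments and convert them to a log-mel spectrogram used as the label.
import numpy as np
import librosa

def second_mel_spectrum(segments: list[np.ndarray], sr: int,
                        n_mels: int = 80) -> np.ndarray:
    audio = np.concatenate(segments)         # audio excluding noise segments
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)          # [n_mels, T] label spectrogram
```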
Further, the server may train the mel-frequency spectrum conversion network according to the first loss and train the audio synthesis network according to the second loss.
Further, in one or more embodiments of this specification, when determining the first loss from the second mel spectrum and the first mel spectrum, the first loss may be determined based on differences in multiple dimensions between the two spectra. In one or more embodiments, these dimensions include the difference in fundamental frequency, the difference in frame length, the difference in energy, and the overall difference between the first and second mel spectra themselves.
Fig. 4 is a schematic diagram of training a speech synthesis model according to the present application. As can be seen, when training the mel spectrum conversion network in the speech synthesis model, i.e., FastSpeech 2, the phoneme sequence corresponding to the sample text may be input into the mel spectrum conversion network, which outputs the first mel spectrum. The first mel spectrum includes frames at the positions of the noise segment, so the server may use the mel spectrum corresponding to the audio segments other than the noise segment as the second mel spectrum, i.e., the label mel spectrum, and train the mel spectrum conversion network, i.e., FastSpeech 2, based on the second mel spectrum and the first mel spectrum. Specifically, the server may determine the first loss according to at least one of: the difference between the second mel spectrum and the first mel spectrum, the difference between the fundamental frequencies of the phonemes in the two spectra, the difference between the energies of the phonemes in the two spectra, and the difference between the frame lengths of the phonemes in the two spectra.
That is, when training the mel spectrum conversion network, the first loss between the first mel spectrum and the second mel spectrum may be calculated from the differences in these dimensions.
Specifically, the server may use the mel spectrum loss function

$$L_{mel}=\frac{1}{T}\sum_{t=1}^{T}\left(\hat{M}_t-M_t\right)^2$$

to determine the difference between the second mel spectrum and the first mel spectrum, i.e., the difference between the waveforms corresponding to the two spectra. Here T is the number of audio frames of the original audio excluding the noise segments, $\hat{M}_t$ is the t-th frame of the first mel spectrum predicted by the speech synthesis model, and $M_t$ is the t-th frame of the second mel spectrum of the original audio excluding the noise segments.

The server may use the energy loss function

$$L_e=\frac{1}{P}\sum_{p=1}^{P}\left(\hat{e}_p-e_p\right)^2$$

to determine the difference between the energy of the phonemes in the second mel spectrum and the energy of the phonemes in the first mel spectrum. Here P is the number of all phonemes corresponding to the audio segments of the original audio other than the noise segments, $\hat{e}_p$ is the energy of phoneme p in the first mel spectrum predicted by the speech synthesis model, and $e_p$ is the energy of phoneme p in the second mel spectrum of the original audio excluding the noise segments.

The server may use the frame length loss function

$$L_d=\frac{1}{P}\sum_{p=1}^{P}\left(\hat{d}_p-d_p\right)^2$$

to determine the difference between the frame lengths of the phonemes in the second mel spectrum and those in the first mel spectrum, where $\hat{d}_p$ is the duration frame length of phoneme p in the first mel spectrum predicted by the speech synthesis model, and $d_p$ is the duration frame length of phoneme p in the second mel spectrum of the original audio excluding the noise segments.

The server may use the fundamental frequency loss function

$$L_p=\frac{1}{P}\sum_{p=1}^{P}\left(\hat{f}_p-f_p\right)^2$$

to determine the difference between the fundamental frequencies of the phonemes in the second mel spectrum and those in the first mel spectrum, where $\hat{f}_p$ is the fundamental frequency of phoneme p in the first mel spectrum predicted by the speech synthesis model, and $f_p$ is the fundamental frequency of phoneme p in the second mel spectrum of the original audio excluding the noise segments.

The first loss may then be the sum of these losses, i.e., $L = L_{mel} + L_e + L_d + L_p$, where L is the first loss. Of course, the individual losses may also be weighted and summed to obtain the first loss.
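Taken together, the four terms can be combined as below; the use of PyTorch and plain mean-squared errors is an assumption consistent with FastSpeech 2-style training, not a formula fixed by the patent:

```python
# First loss L = L_mel + L_e + L_d + L_p, evaluated only on frames and
# phonemes outside the marked noise segments (masking as sketched earlier).
import torch
import torch.nn.functional as F

def first_loss(pred: dict, target: dict) -> torch.Tensor:
    l_mel = F.mse_loss(pred["mel"], target["mel"])          # L_mel
    l_e = F.mse_loss(pred["energy"], target["energy"])      # L_e
    l_d = F.mse_loss(pred["duration"], target["duration"])  # L_d
    l_p = F.mse_loss(pred["f0"], target["f0"])              # L_p
    return l_mel + l_e + l_d + l_p                          # first loss L
```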
The server can obtain the fundamental frequency and energy corresponding to each phoneme in the original text through the WORLD algorithm.
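A sketch of the fundamental frequency extraction with the WORLD tools follows; the pyworld binding is one common choice and is assumed here for illustration, with per-phoneme values then averaged over each phoneme's frames:

```python
# Per-frame fundamental frequency via WORLD (dio + stonemask refinement).
import numpy as np
import pyworld as pw

def fundamental_frequency(wav: np.ndarray, sr: int) -> np.ndarray:
    wav = wav.astype(np.float64)             # pyworld expects float64 input
    f0, times = pw.dio(wav, sr)              # coarse F0 track
    return pw.stonemask(wav, f0, times, sr)  # refined per-frame F0
```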
It should be noted that, in one or more embodiments of this specification, the multi-dimensional loss functions used above in determining the loss are based on FastSpeech 2. The loss functions used may therefore change when other networks are used; the specific network and loss functions are not limited in this specification.
Fig. 5 is a schematic diagram of training a speech synthesis model according to the present application. As can be seen, the server may input the mel spectrum output by the mel spectrum conversion network, which still covers the noise-segment positions, into the audio synthesis network, which then outputs the synthesized audio. The server may then determine the second loss according to the audio segments of the original audio other than the noise segments and the synthesized audio, and train the audio synthesis network based on the second loss.
In addition, the speech synthesis model to be trained need not include a mel spectrum conversion network and an audio synthesis network; that is, it may directly be an end-to-end machine learning model, where the sample text is input into the speech synthesis model and the output is audio generated from the sample text. In that case, the speech synthesis model to be trained may still be trained according to the method shown in Fig. 1: the sample text obtained in step S104 is input into the speech synthesis model to be trained, and the model is trained with the objective of minimizing the difference between its output audio and the audio segments other than the noise segments.
It should be noted that when the speech synthesis model includes a mel spectrum conversion network, it runs fast and stably, whereas an end-to-end speech synthesis model synthesizes audio of higher quality. Whether to choose an end-to-end model or a model containing a mel spectrum conversion network can therefore be decided according to the actual situation and requirements, and this specification is not limited in this respect.
Based on the foregoing description of the training method of the speech synthesis model, the embodiment of the present disclosure further correspondingly provides a schematic diagram of a training device for the speech synthesis model, as shown in fig. 6.
Fig. 6 is a schematic diagram of a training device for a speech synthesis model according to an embodiment of the present disclosure, where the device includes:
the acquiring module 600 is configured to acquire an original text, and acquire an original audio corresponding to the original text;
An identification module 602, configured to identify characters in the original audio;
A processing module 604, configured to replace characters in the original text with characters identified from the original audio, so as to obtain a sample text; and, marking noise segments in the original audio;
The training module 606 is configured to train the speech synthesis model to be trained according to the obtained sample text and the marked noise segment, so as to obtain a trained speech synthesis model.
Optionally, the processing module 604 is specifically configured to: add the recognized extra characters of the original audio to the original text if the characters recognized from the original audio contain extra characters beyond those in the original text; and/or delete the omitted characters from the original text if it is determined, according to the characters recognized from the original audio, that some characters in the original text were omitted; and/or modify the characters in the original text into the recognized characters of the original audio if the recognized characters result from misreading characters in the original text.
Optionally, the processing module 604 is specifically configured to take the audio frames in the original audio that satisfy a specified condition as a noise segment and mark them; the specified condition at least includes that an audio frame is not fluent relative to the original text.
Optionally, the training module 606 is specifically configured to input the sample text into a speech synthesis model to be trained, and obtain a synthesized audio output by the speech synthesis model to be trained; determining a loss according to a difference between the synthesized audio and an audio segment other than the noise segment in the original audio; and training the speech synthesis model to be trained according to the loss to obtain a trained speech synthesis model.
Optionally, the speech synthesis model includes a mel-frequency spectrum conversion network and an audio synthesis network.
Optionally, the training module 606 is specifically configured to determine a phoneme sequence of the sample text; inputting the phoneme sequence into a Mel frequency spectrum conversion network in the speech synthesis model to be trained to obtain a first Mel frequency spectrum output by the Mel frequency spectrum conversion network; and inputting the first Mel frequency spectrum into an audio synthesis network in the speech synthesis model to be trained to obtain synthesized audio output by the audio synthesis network.
Optionally, the training module 606 is specifically configured to determine a second mel spectrum of the audio segment other than the noise segment; determining a first loss from the second mel-frequency spectrum and the first mel-frequency spectrum; and determining a second loss according to the audio segments except the noise segment and the synthesized audio.
Optionally, the training module 606 is specifically configured to determine the first loss according to at least one of a difference between the second mel spectrum and the first mel spectrum, a difference between a fundamental frequency of a phoneme in the second mel spectrum and a fundamental frequency of a phoneme in the first mel spectrum, a difference between energy of a phoneme in the second mel spectrum and energy of a phoneme in the first mel spectrum, and a difference between a frame length of a phoneme in the second mel spectrum and a frame length of a phoneme in the first mel spectrum.
Optionally, the training module 606 is specifically configured to train the mel spectrum conversion network according to the first loss; and training the audio synthesis network according to the second loss.
The embodiments of the present specification also provide a computer readable storage medium storing a computer program, where the computer program is configured to perform the method for training a speech synthesis model as described above.
Based on the training method of the speech synthesis model described above, the embodiment of the present disclosure further proposes a schematic block diagram of the electronic device shown in fig. 7. At the hardware level, as in fig. 7, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, although it may include hardware required for other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to realize the training method of the speech synthesis model.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained merely by slightly logically programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component, or even as both software modules implementing the methods and structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present application.

Claims (12)

1. A method of training a speech synthesis model, the method comprising:
acquiring an original text and acquiring original audio corresponding to the original text;
recognizing characters in the original audio;
replacing characters in the original text with characters recognized from the original audio to obtain a sample text, and marking noise segments in the original audio;
and training the speech synthesis model to be trained according to the obtained sample text and the marked noise segments to obtain a trained speech synthesis model.
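Illustrative only (not part of the claims): the data-preparation half of claim 1 can be sketched as a short orchestration routine. The sketch below is a minimal Python outline under assumed interfaces; `recognize` and `mark_noise` are hypothetical stand-ins for any ASR system and any noise detector, and claim 2's refinement of the replacement step is deferred to the sketch after claim 2.

```python
from typing import Callable, Sequence

def prepare_training_pair(original_text: str,
                          original_audio: bytes,
                          recognize: Callable[[bytes], str],
                          mark_noise: Callable[[bytes], Sequence[bool]]):
    """Data-preparation flow of claim 1 with hypothetical components."""
    recognized = recognize(original_audio)   # characters actually spoken
    # Claim 2 refines this step; here we simply trust the ASR output.
    sample_text = recognized
    noise_mask = mark_noise(original_audio)  # marked noise segments
    return sample_text, noise_mask

# Toy usage with stub components:
text, mask = prepare_training_pair(
    "今天天气很好",
    b"\x00" * 16000,
    recognize=lambda audio: "今天天气真好",
    mark_noise=lambda audio: [False] * 100,
)
```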
2. The method of claim 1, wherein replacing characters in the original text with characters recognized from the original audio specifically comprises:
if the characters recognized from the original audio result from multi-reading of the characters in the original text, adding the recognized characters of the original audio to the original text; and/or
if it is determined, according to the characters recognized from the original audio, that few-read characters exist in the original text, deleting the few-read characters from the original text; and/or
if the characters recognized from the original audio result from misreading of the characters in the original text, modifying the characters in the original text into the characters recognized from the original audio.
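Illustrative only (not part of the claims): one plausible way to realize the three cases of claim 2 is to align the ASR transcript with the original text and keep the audio-side characters wherever the two disagree; insertions then correspond to multi-read characters, deletions to few-read characters, and substitutions to misread characters. A minimal sketch using Python's standard difflib; the function and variable names are hypothetical.

```python
from difflib import SequenceMatcher

def build_sample_text(original_chars: list, asr_chars: list) -> str:
    """Apply the three cases of claim 2 by aligning text with ASR output.

    'insert'  -> multi-read characters: add what the speaker actually said;
    'delete'  -> few-read characters: drop characters the speaker skipped;
    'replace' -> misread characters: substitute the recognized characters.
    """
    matcher = SequenceMatcher(a=original_chars, b=asr_chars, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(original_chars[i1:i2])
        else:  # trust the audio: keep the ASR-side characters
            out.extend(asr_chars[j1:j2])
    return "".join(out)

# Speaker misreads "很" as "真" and adds an extra "呀" (multi-read).
print(build_sample_text(list("今天天气很好"), list("今天天气真好呀")))  # 今天天气真好呀
```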
3. The method of claim 1, wherein marking noise segments in the original audio specifically comprises:
taking audio frames in the original audio that meet a specified condition as noise segments and marking them, wherein the specified condition at least comprises that an audio frame of the original audio corresponding to the original text is not fluent.
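Illustrative only (not part of the claims): the patent does not fix how non-fluent frames are detected. One simple heuristic, sketched below, flags frames whose short-time energy is a statistical outlier, which catches burst noise such as a cough or a table knock; the frame sizes and threshold are assumptions.

```python
import numpy as np

def mark_noise_frames(wave: np.ndarray, sr: int,
                      frame_ms: int = 25, hop_ms: int = 10,
                      z_thresh: float = 3.0) -> np.ndarray:
    """Flag frames whose short-time energy is a statistical outlier.

    Returns a boolean mask, one entry per frame; True marks a candidate
    noise segment. Assumes the waveform is longer than one frame.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(wave) - frame) // hop)
    energy = np.array([
        np.sqrt(np.mean(wave[i * hop:i * hop + frame] ** 2))
        for i in range(n_frames)
    ])
    z = (energy - energy.mean()) / (energy.std() + 1e-8)
    return z > z_thresh

# Usage: a 1-second sine with a loud click injected at 0.5 s.
sr = 16000
t = np.arange(sr) / sr
wave = 0.1 * np.sin(2 * np.pi * 220 * t)
wave[sr // 2: sr // 2 + 160] += 0.9
print(np.where(mark_noise_frames(wave, sr))[0])  # frames around the click
```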
4. The method of claim 1, wherein training the speech synthesis model to be trained according to the obtained sample text and the marked noise segments specifically comprises:
inputting the sample text into the speech synthesis model to be trained to obtain synthesized audio output by the speech synthesis model to be trained;
determining a loss according to a difference between the synthesized audio and the audio segments other than the noise segments in the original audio;
and training the speech synthesis model to be trained according to the loss to obtain the trained speech synthesis model.
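Illustrative only (not part of the claims): the key point of claim 4 is that the loss ignores the marked noise segments, so the model is never pushed to reproduce coughs or other bursts. A minimal PyTorch sketch, assuming sample-aligned waveforms and a boolean noise mask:

```python
import torch

def masked_audio_loss(synth: torch.Tensor, original: torch.Tensor,
                      noise_mask: torch.Tensor) -> torch.Tensor:
    """L1 distance between synthesized and original audio, excluding noise.

    synth, original: (batch, time) waveforms aligned to the same length.
    noise_mask: (batch, time) boolean tensor, True inside marked noise
    segments; those samples contribute nothing to the loss.
    """
    keep = (~noise_mask).float()
    diff = (synth - original).abs() * keep
    return diff.sum() / keep.sum().clamp(min=1.0)
```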
5. The method of claim 4, wherein the speech synthesis model comprises a mel frequency spectrum conversion network and an audio synthesis network.
6. The method of claim 5, wherein inputting the sample text into the speech synthesis model to be trained to obtain the synthesized audio output by the speech synthesis model to be trained specifically comprises:
determining a phoneme sequence of the sample text;
inputting the phoneme sequence into the mel frequency spectrum conversion network in the speech synthesis model to be trained to obtain a first mel frequency spectrum output by the mel frequency spectrum conversion network;
and inputting the first mel frequency spectrum into the audio synthesis network in the speech synthesis model to be trained to obtain the synthesized audio output by the audio synthesis network.
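Illustrative only (not part of the claims): a minimal PyTorch skeleton of the two-stage structure of claims 5-6. Layer sizes and types are placeholders, not the patent's architecture; in practice the mel network would be an acoustic model and the audio network a vocoder.

```python
import torch
import torch.nn as nn

class TwoStageTTS(nn.Module):
    """Phonemes -> mel spectrum -> waveform, as in claims 5-6 (shapes illustrative)."""

    def __init__(self, n_phonemes: int = 100, n_mels: int = 80, hop: int = 256):
        super().__init__()
        # Mel frequency spectrum conversion network: phoneme ids -> mel frames.
        self.embed = nn.Embedding(n_phonemes, 256)
        self.mel_net = nn.GRU(256, n_mels, batch_first=True)
        # Audio synthesis network: `hop` waveform samples per mel frame.
        self.audio_net = nn.Linear(n_mels, hop)

    def forward(self, phoneme_ids: torch.Tensor):
        mel, _ = self.mel_net(self.embed(phoneme_ids))  # (B, T, n_mels): first mel spectrum
        audio = self.audio_net(mel).flatten(1)          # (B, T * hop): synthesized audio
        return mel, audio

# Usage: a batch of two 12-phoneme sequences.
ids = torch.randint(0, 100, (2, 12))
mel, audio = TwoStageTTS()(ids)
print(mel.shape, audio.shape)  # torch.Size([2, 12, 80]) torch.Size([2, 3072])
```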
7. The method of claim 6, wherein determining the loss according to the difference between the synthesized audio and the audio segments other than the noise segments in the original audio specifically comprises:
determining a second mel frequency spectrum of the audio segments other than the noise segments;
determining a first loss according to the second mel frequency spectrum and the first mel frequency spectrum;
and determining a second loss according to the audio segments other than the noise segments and the synthesized audio.
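Illustrative only (not part of the claims): the second mel frequency spectrum can be computed from the original audio after dropping the marked noise samples. A sketch assuming the librosa library; cutting the samples out before the transform is a simplification noted in the comments.

```python
import numpy as np
import librosa

def second_mel_spectrum(original: np.ndarray, sr: int,
                        noise_mask: np.ndarray, n_mels: int = 80) -> np.ndarray:
    """Mel spectrum of the original audio with marked noise samples removed.

    noise_mask: boolean array, one entry per sample, True inside noise
    segments. Concatenating the kept samples is a simplification; a
    production system would more likely mask frames after the STFT.
    """
    clean = original[~noise_mask]
    return librosa.feature.melspectrogram(y=clean, sr=sr, n_mels=n_mels)
```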
8. The method of claim 7, wherein determining the first loss according to the second mel frequency spectrum and the first mel frequency spectrum specifically comprises:
determining the first loss according to at least one of: a difference between the second mel frequency spectrum and the first mel frequency spectrum; a difference between the fundamental frequency of a phoneme in the second mel frequency spectrum and the fundamental frequency of the phoneme in the first mel frequency spectrum; a difference between the energy of a phoneme in the second mel frequency spectrum and the energy of the phoneme in the first mel frequency spectrum; and a difference between the frame length of a phoneme in the second mel frequency spectrum and the frame length of the phoneme in the first mel frequency spectrum.
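Illustrative only (not part of the claims): combining all four terms of claim 8 resembles the variance losses of duration-based acoustic models. A sketch with equal weighting, which is an assumption; the claim only requires at least one of the terms.

```python
import torch
import torch.nn.functional as F

def first_loss(mel_pred: torch.Tensor, mel_ref: torch.Tensor,
               f0_pred: torch.Tensor, f0_ref: torch.Tensor,
               energy_pred: torch.Tensor, energy_ref: torch.Tensor,
               dur_pred: torch.Tensor, dur_ref: torch.Tensor) -> torch.Tensor:
    """Sum of the four per-phoneme differences listed in claim 8.

    mel_*: (T, n_mels) spectra; f0_*/energy_*/dur_*: (n_phonemes,) values.
    """
    return (F.l1_loss(mel_pred, mel_ref)            # spectrum difference
            + F.mse_loss(f0_pred, f0_ref)           # fundamental frequency
            + F.mse_loss(energy_pred, energy_ref)   # energy
            + F.mse_loss(dur_pred, dur_ref))        # frame length (duration)
```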
9. The method of claim 7, wherein training the speech synthesis model to be trained according to the loss specifically comprises:
training the mel frequency spectrum conversion network according to the first loss;
and training the audio synthesis network according to the second loss.
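Illustrative only (not part of the claims): one way to realize claim 9 is one optimizer per sub-network, with the mel output detached before the second stage so that each loss updates only its own network. A self-contained sketch with placeholder layers:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two sub-networks of claim 5.
mel_net = nn.Linear(64, 80)     # phoneme features -> mel frames
audio_net = nn.Linear(80, 256)  # mel frames -> waveform chunks

# One optimizer per sub-network, so each loss updates only its own network.
opt_mel = torch.optim.Adam(mel_net.parameters(), lr=1e-4)
opt_audio = torch.optim.Adam(audio_net.parameters(), lr=1e-4)

features = torch.randn(8, 100, 64)
mel_ref = torch.randn(8, 100, 80)
wave_ref = torch.randn(8, 100, 256)

mel = mel_net(features)
first_loss = nn.functional.l1_loss(mel, mel_ref)     # trains mel_net only
first_loss.backward()
opt_mel.step(); opt_mel.zero_grad()

wave = audio_net(mel.detach())  # detach: second loss must not reach mel_net
second_loss = nn.functional.l1_loss(wave, wave_ref)  # trains audio_net only
second_loss.backward()
opt_audio.step(); opt_audio.zero_grad()
```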
10. A training device for a speech synthesis model, the device comprising:
the acquisition module is used for acquiring an original text and acquiring original audio corresponding to the original text;
the recognition module is used for recognizing characters in the original audio;
the processing module is used for replacing characters in the original text with characters recognized from the original audio to obtain a sample text, and marking noise segments in the original audio;
and the training module is used for training the speech synthesis model to be trained according to the obtained sample text and the marked noise segments to obtain a trained speech synthesis model.
11. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 9.
12. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 9 when executing the program.
CN202410248274.6A 2024-03-04 Training method and device of speech synthesis model, medium and electronic equipment Pending CN118230712A (en)

Publications (1)

Publication Number Publication Date
CN118230712A (en) 2024-06-21

Similar Documents

Publication Number Title
WO2021002967A1 (en) Multilingual neural text-to-speech synthesis
CN110599998B (en) Voice data generation method and device
CN115952272B (en) Method, device and equipment for generating dialogue information and readable storage medium
CN112365878B (en) Speech synthesis method, device, equipment and computer readable storage medium
CN116502176A (en) Pre-training method and device of language model, medium and electronic equipment
CN109545194A (en) Wake up word pre-training method, apparatus, equipment and storage medium
CN111292734B (en) Voice interaction method and device
CN116343314B (en) Expression recognition method and device, storage medium and electronic equipment
CN115563366A (en) Model training and data analysis method, device, storage medium and equipment
CN113887227A (en) Model training and entity recognition method and device
CN116312480A (en) Voice recognition method, device, equipment and readable storage medium
CN112597301A (en) Voice intention recognition method and device
CN116543264A (en) Training method of image classification model, image classification method and device
CN118230712A (en) Training method and device of speech synthesis model, medium and electronic equipment
CN116186231A (en) Method and device for generating reply text, storage medium and electronic equipment
CN112397073B (en) Audio data processing method and device
CN116129856A (en) Training method of speech synthesis model, speech synthesis method and related equipment
CN115620706A (en) Model training method, device, equipment and storage medium
CN111048065B (en) Text error correction data generation method and related device
CN111652165A (en) Mouth shape evaluating method, mouth shape evaluating equipment and computer storage medium
CN117079646B (en) Training method, device, equipment and storage medium of voice recognition model
CN117746863A (en) Sample audio acquisition method and device, storage medium and electronic equipment
CN116434787B (en) Voice emotion recognition method and device, storage medium and electronic equipment
CN117219055A (en) Voice generation method, device, medium and equipment based on tone separation
CN118098266A (en) Voice data processing method and device based on multi-model selection

Legal Events

Code Title
PB01 Publication