CN111429878B - Self-adaptive voice synthesis method and device

Self-adaptive voice synthesis method and device

Info

Publication number
CN111429878B
Authority
CN
China
Prior art keywords
recording
voice
text
current
module
Prior art date
Legal status
Active
Application number
CN202010167018.6A
Other languages
Chinese (zh)
Other versions
CN111429878A (en)
Inventor
贺来朋
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010167018.6A
Publication of CN111429878A
Application granted
Publication of CN111429878B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses an adaptive speech synthesis method and device, comprising the following steps: training a preset neural network model with a preset recording and the text annotation data corresponding to that recording, to obtain a trained preset neural network model; designing a recording text library from which the user selects a target recording text to record, obtaining a current recording; performing secondary training of the trained preset neural network model with the current recording and the target recording text; and extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model, and inputting the static speech parameters into a synthesizer to obtain synthesized speech. This effectively solves the prior-art problem that small, typically low-quality training data and insufficient model prediction accuracy lead to low quality and low accuracy of synthesized speech, and improves the user experience.

Description

Self-adaptive voice synthesis method and device
Technical Field
The present invention relates to the field of speech synthesis technology, and in particular, to a method and apparatus for adaptive speech synthesis.
Background
In recent years, as speech technology has matured, speech synthesis has gradually been applied in speech signal processing systems such as voice interaction, voice broadcasting, and personalized speech production. In the social and commercial fields, synthesized speech serves as an audible persona, brings convenience and richness to social life, and has broad potential value. Existing speech synthesis technology trains duration and acoustic models on a large amount of high-quality recordings and text annotation data from a target speaker, and can then synthesize speech in that speaker's timbre. Because training demands a large amount of high-quality speech, adaptive speech synthesis systems have been proposed: a synthesis system is built quickly from a small amount of a target speaker's recordings and text data, and synthetic speech in the target speaker's timbre is generated. This approach, however, has the following drawback: because the training data is small and typically of low quality, model prediction accuracy is insufficient, so the quality and accuracy of the synthesized speech are low, degrading the user experience.
Disclosure of Invention
To address the problems described above, the present invention performs secondary training of a trained preset neural network model on the user's current recording data, and finally synthesizes speech for a text to be synthesized with the secondarily trained preset neural network model.
An adaptive speech synthesis method comprising the steps of:
training a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
designing a recording text library from which the user selects a target recording text to record, obtaining a current recording;
performing secondary training of the trained preset neural network model with the current recording and the target recording text;
and extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model, and inputting the static speech parameters into a synthesizer to obtain synthesized speech.
Preferably, designing the recording text library from which the user selects the target recording text to record, obtaining the current recording, comprises:
establishing a blank recording text library in advance;
acquiring N recording texts and entering them into the blank recording text library to form the recording text library;
upon receiving a user's recording request, pushing M first recording texts for selection, where a first recording text is any recording text in the recording text library;
determining the first recording text that the user selects from the M first recording texts as the target recording text;
and receiving the user's current recording based on the target recording text.
Preferably, before the secondary training of the trained preset neural network model with the current recording and the target recording text, the method further comprises:
acquiring each sentence of speech in the current recording;
removing silence segments exceeding a preset duration from each sentence of speech;
performing denoising and dereverberation preprocessing on each sentence of speech;
detecting whether the preprocessed current speech is complete;
if so, using the label corresponding to the target recording text;
otherwise, reminding the user that the preprocessed current speech does not meet the requirement.
Preferably, the secondary training of the trained preset neural network model with the current recording and the target recording text comprises:
extracting acoustic feature parameters from the preprocessed current speech;
extracting context-related first linguistic information from the target recording text;
generating training data from the acoustic feature parameters and the first linguistic information;
and performing secondary training of the trained preset neural network model with the training data.
Preferably, extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model and inputting them into a synthesizer to obtain synthesized speech comprises:
acquiring second linguistic information for the text to be synthesized;
inputting the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
deriving static speech parameters from the speech feature parameters;
inputting the static speech parameters into a synthesizer for synthesis;
and outputting the synthesized speech when synthesis finishes.
An adaptive speech synthesis apparatus, the apparatus comprising:
a first training module, configured to train a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
a recording module, configured to design a recording text library from which the user selects a target recording text to record, obtaining a current recording;
a second training module, configured to perform secondary training of the trained preset neural network model with the current recording and the target recording text;
and a synthesis module, configured to extract static speech parameters for the text to be synthesized with the secondarily trained preset neural network model, and input the static speech parameters into a synthesizer to obtain synthesized speech.
Preferably, the recording module comprises:
an establishing sub-module, configured to pre-establish a blank recording text library;
a first acquisition sub-module, configured to acquire N recording texts and enter them into the blank recording text library to form the recording text library;
a pushing sub-module, configured to push M first recording texts for selection upon receiving a user's recording request, where a first recording text is any recording text in the recording text library;
a determining sub-module, configured to determine the first recording text that the user selects from the M first recording texts as the target recording text;
and a receiving sub-module, configured to receive the user's current recording based on the target recording text.
Preferably, the apparatus further comprises:
an acquisition module, configured to acquire each sentence of speech in the current recording;
a removal module, configured to remove silence segments exceeding a preset duration from each sentence of speech;
a preprocessing module, configured to perform denoising and dereverberation preprocessing on each sentence of speech;
a detection module, configured to detect whether the preprocessed current speech is complete;
a determining module, configured to use the label corresponding to the target recording text when the detection module detects that the preprocessed current speech is complete;
and a reminding module, configured to remind the user that the preprocessed current speech does not meet the requirement when the detection module detects that the preprocessed current speech is incomplete.
Preferably, the second training module comprises:
a first extraction sub-module, configured to extract acoustic feature parameters from the preprocessed current speech;
a second extraction sub-module, configured to extract context-related first linguistic information from the target recording text;
a generation sub-module, configured to generate training data from the acoustic feature parameters and the first linguistic information;
and a training sub-module, configured to perform secondary training of the trained preset neural network model with the training data.
Preferably, the synthesis module comprises:
a second acquisition sub-module, configured to acquire second linguistic information for the text to be synthesized;
an obtaining sub-module, configured to input the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
a third acquisition sub-module, configured to derive static speech parameters from the speech feature parameters;
a synthesis sub-module, configured to input the static speech parameters into a synthesizer for synthesis;
and an output sub-module, configured to output the synthesized speech when synthesis finishes.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a workflow diagram of an adaptive speech synthesis method provided by the present invention;
FIG. 2 is another workflow diagram of an adaptive speech synthesis method according to the present invention;
FIG. 3 is a block diagram of an adaptive speech synthesis apparatus according to the present invention;
FIG. 4 is another block diagram of an adaptive speech synthesis apparatus according to the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
In recent years, as speech technology has matured, speech synthesis has gradually been applied in speech signal processing systems such as voice interaction, voice broadcasting, and personalized speech production. In the social and commercial fields, synthesized speech serves as an audible persona, brings convenience and richness to social life, and has broad potential value. Existing speech synthesis technology trains duration and acoustic models on a large amount of high-quality recordings and text annotation data from a target speaker, and can then synthesize speech in that speaker's timbre. Because training demands a large amount of high-quality speech, adaptive speech synthesis systems have been proposed: a synthesis system is built quickly from a small amount of a target speaker's recordings and text data, and synthetic speech in the target speaker's timbre is generated. This approach, however, has the following drawbacks: 1. Because the training data is small and typically of low quality, model prediction accuracy is insufficient, so the quality and accuracy of the synthesized speech are low, degrading the user experience. 2. Recording environments differ greatly between users, and recorded speech data may contain interference such as overlong silence segments, noise, and reverberation that harms model training. 3. Users' actual pronunciations may not match the recording text; missing words, extra words, repetitions, misreadings, overlong pauses, and the like all cause mismatches between audio data and text labels, again harming model training. To solve these problems, the present embodiment discloses a method that performs secondary training of a trained preset neural network model on the user's current recording data, and finally synthesizes speech for a text to be synthesized with the secondarily trained preset neural network model.
An adaptive speech synthesis method, as shown in FIG. 1, comprises the following steps:
Step S101: train a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
Step S102: design a recording text library from which the user selects a target recording text to record, obtaining a current recording;
Step S103: perform secondary training of the trained preset neural network model with the current recording and the target recording text;
Step S104: extract static speech parameters for a text to be synthesized with the secondarily trained preset neural network model, and input the static speech parameters into a synthesizer to obtain synthesized speech.
In this embodiment, a preset neural network model is first trained on a large amount of high-quality preset recordings and their corresponding text annotation data, yielding a trained preset neural network model. The user then selects a suitable target recording text according to his or her preferences and needs and records it, yielding a small amount of current recordings. The trained preset neural network model is then trained a second time on this small recording set, yielding a new model able to synthesize the user's own voice; static speech parameters for any text to be synthesized can be extracted with the new model and input into a synthesizer to obtain synthesized speech in the user's voice. Notably, the large amount of preset speech and its corresponding text annotation data may be recordings of any person; the user's own recordings are not needed at this stage. The preset neural network model comprises a duration model and an acoustic model, i.e., it attends both to the duration of the user's synthesized speech and to synthesizing different voices according to users' different timbres.
The working principle of this technical scheme is as follows: a preset neural network model is trained with a preset recording and its corresponding text annotation data to obtain a trained preset neural network model; a recording text library is designed from which the user selects a target recording text to record, obtaining a current recording; the trained preset neural network model undergoes secondary training with the current recording and the target recording text; and static speech parameters for the text to be synthesized are extracted with the secondarily trained preset neural network model and input into a synthesizer to obtain synthesized speech.
The beneficial effects of this technical scheme are as follows: the user can rely on a preset neural network model trained on preset recordings, retrain it on the current recording, and finally synthesize his or her own voice with the secondarily trained model. Because the first training uses a large amount of high-quality preset recordings, the quality and accuracy of speech synthesized by the model are very high; secondary training on the current recording then yields a model that synthesizes the user's own voice while retaining that high quality and accuracy. This effectively solves the prior-art problems that the training data is small and typically of low quality and that model prediction accuracy is insufficient, which lead to low quality and accuracy of synthesized speech, and it improves the user experience. Moreover, the user can choose the target text to record from the recording text library, diversifying the selection and mitigating the prior-art problem that widely varying user recording environments leave recorded speech data with interference such as overlong silence segments, noise, and reverberation that harms model training.
In one embodiment, designing a recording text library from which the user selects a target recording text to record, obtaining a current recording, comprises:
establishing a blank recording text library in advance;
acquiring N recording texts and entering them into the blank recording text library to form the recording text library;
upon receiving a user's recording request, pushing M first recording texts for selection, where a first recording text is any recording text in the recording text library;
determining the first recording text that the user selects from the M first recording texts as the target recording text;
and receiving the user's current recording based on the target recording text, as sketched below.
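A minimal sketch of this selection flow, with hypothetical names throughout (RecordingTextLibrary, load_n_texts, capture_audio); the patent does not prescribe an implementation:

```python
import random

class RecordingTextLibrary:
    """Illustrative sketch of the recording-text library flow above."""
    def __init__(self):
        self.texts = []                       # pre-established blank library

    def add_texts(self, texts):
        self.texts.extend(texts)              # enter the N recording texts

    def push_candidates(self, m):
        # On a user's recording request, push M first recording texts,
        # each drawn from anywhere in the library.
        return random.sample(self.texts, min(m, len(self.texts)))

# library = RecordingTextLibrary()
# library.add_texts(load_n_texts())               # hypothetical loader
# candidates = library.push_candidates(m=5)
# target_text = candidates[user_choice]           # the user's selection
# current_recording = capture_audio(target_text)  # hypothetical recorder
```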
The beneficial effects of this technical scheme are as follows: providing selectable first recording texts lets the user pick a target recording text suited to his or her age, education level, region, and usage scenario, giving the user many different choices and further improving the user experience.
In one embodiment, before the secondary training of the trained preset neural network model with the current recording and the target recording text, the method further comprises:
acquiring each sentence of speech in the current recording;
removing silence segments exceeding a preset duration from each sentence of speech;
performing denoising and dereverberation preprocessing on each sentence of speech;
detecting whether the preprocessed current speech is complete;
if so, using the label corresponding to the target recording text;
otherwise, reminding the user that the preprocessed current speech does not meet the requirement.
In this embodiment, detecting whether the preprocessed current speech is complete proceeds as follows: if an insertion error or a deletion error is found in the processed current speech, the user is prompted that the recording quality does not meet the requirement, and the user may choose to repeat the current text or switch to a new text and record again. If there are no insertion or deletion errors but there is a substitution error, the speech is accepted, and the original recording text is replaced with the text recognized by the recognizer to generate the label. If no recognition errors occur at all, the label corresponding to the original recording text is used.
The beneficial effects of this technical scheme are as follows: denoising and dereverberating the input speech and removing redundant silence segments improve speech quality and provide good samples for subsequent speech synthesis. Detecting whether the preprocessed current speech is complete, and choosing re-recording or a corrected text label according to the current speech quality, keeps the recording and its text label consistent and improves data quality.
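For concreteness, a minimal preprocessing sketch under assumed tooling (librosa for silence trimming, the noisereduce package for spectral-gating denoising); the patent names no specific algorithms, and dereverberation is left as a placeholder.

```python
import numpy as np
import librosa
import noisereduce as nr

def preprocess_sentence(y, sr, max_silence_s=0.5, top_db=40):
    """Cap overlong internal silences and denoise one sentence of speech."""
    intervals = librosa.effects.split(y, top_db=top_db)  # non-silent spans
    gap = np.zeros(int(max_silence_s * sr), dtype=y.dtype)
    pieces = []
    for k, (s, e) in enumerate(intervals):
        if k:
            pieces.append(gap)            # shortened silence between segments
        pieces.append(y[s:e])
    y = np.concatenate(pieces) if pieces else y
    y = nr.reduce_noise(y=y, sr=sr)       # spectral-gating noise reduction
    # Dereverberation would go here (e.g., a WPE implementation); omitted
    # because the patent does not name a method.
    return y
```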
In one embodiment, as shown in FIG. 2, the secondary training of the trained preset neural network model with the current recording and the target recording text comprises the following steps:
Step S201: extract acoustic feature parameters from the preprocessed current speech;
Step S202: extract context-related first linguistic information from the target recording text;
Step S203: generate training data from the acoustic feature parameters and the first linguistic information;
Step S204: perform secondary training of the trained preset neural network model with the training data.
The beneficial effects of this technical scheme are as follows: secondary training of the preset neural network model with the acoustic feature parameters and the first linguistic information improves the accuracy of speech-feature-parameter modeling and provides a good model from which to extract the static speech parameters required for synthesized speech.
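A feature-extraction sketch follows, assuming the WORLD vocoder (pyworld) plus pysptk, which can produce the LF0, MCEP, and BAP parameters the embodiment names in step 5 below; the analysis order and warping coefficient are illustrative choices, not the patent's settings.

```python
import numpy as np
import pyworld as pw
import pysptk

def extract_acoustic_features(x, fs, mcep_order=59, alpha=0.58):
    """WORLD-based sketch of LF0/MCEP/BAP extraction from one utterance."""
    x = x.astype(np.float64)
    f0, t = pw.dio(x, fs)                  # coarse F0 track
    f0 = pw.stonemask(x, f0, t, fs)        # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)       # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)              # aperiodicity
    lf0 = np.log(np.maximum(f0, 1e-10))    # LF0; unvoiced frames are floored
    mcep = pysptk.sp2mc(sp, order=mcep_order, alpha=alpha)  # MCEP
    bap = pw.code_aperiodicity(ap, fs)     # BAP (band aperiodicity)
    return lf0, mcep, bap
```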
In one embodiment, extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model and inputting them into a synthesizer to obtain synthesized speech comprises:
acquiring second linguistic information for the text to be synthesized;
inputting the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
deriving static speech parameters from the speech feature parameters;
inputting the static speech parameters into a synthesizer for synthesis;
and outputting the synthesized speech when synthesis finishes.
In this embodiment, the speech feature parameters include dynamic speech parameters, and the dynamic speech parameters are converted into static speech parameters according to the established model.
The beneficial effects of this technical scheme are as follows: synthesizing speech from static speech parameters is more stable than using dynamic parameters; removing the unstable factors yields synthesized speech of higher quality.
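The conversion from dynamic to static parameters is conventionally done with maximum-likelihood parameter generation (MLPG), which step 6 below invokes through a parameter generation module. A simplified per-dimension numpy sketch, assuming the model predicts a static and a delta mean with variances for each frame:

```python
import numpy as np

def mlpg(mean, var, delta_win=(-0.5, 0.0, 0.5)):
    """mean, var: (T, 2) arrays of [static, delta] means and variances.
    Solves W^T P W c = W^T P mu for the smooth static trajectory c."""
    T = mean.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row
        for k, w in enumerate(delta_win):       # delta row, edges clipped
            tau = min(max(t + k - 1, 0), T - 1)
            W[2 * t + 1, tau] += w
    P = np.diag(1.0 / var.reshape(-1))          # inverse variances
    mu = mean.reshape(-1)                       # [s0, d0, s1, d1, ...]
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)

# T = 50
# c = mlpg(np.random.randn(T, 2), np.ones((T, 2)))  # static trajectory (T,)
```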
In one embodiment, the method comprises:
Step 1: train a multi-speaker hybrid base neural network model (a feed-forward neural network plus an RNN-LSTM structure) with high-quality multi-speaker recordings and text annotation data, adding speaker embedding information to the neural network input to improve the stability of timbre modeling;
Step 2: design a recording text library under the principle of guaranteed phoneme coverage, containing far more texts than the N sentences actually required to be recorded; select N recording texts at random for each user, and for each recording text let the user choose to skip it and switch to a new one;
Step 3: each time the user records a sentence of speech, pass the recorded audio through an audio preprocessing module that removes overlong silence segments from the recording and denoises and dereverberates the input audio;
Step 4: send the processed audio to an audio quality evaluation module for speech recognition checking; if an insertion error or a deletion error is found, prompt the user that the recording quality does not meet the requirement, and let the user choose to repeat the current text or switch to a new text and record again; if there are no insertion or deletion errors but there is a substitution error, accept the speech and replace the original recording text with the text recognized by the recognizer to generate the label; if no recognition errors occur, use the label corresponding to the original recording text;
Step 5: extract speech feature parameters (including LF0, MCEP, and BAP parameters) from the preprocessed audio, and pass the recording text confirmed by speech recognition through a front-end analysis module to extract context-related linguistic information; generate neural network training data from the speech feature parameters and the linguistic information, and retrain the duration and acoustic neural network models with an adaptation technique, taking the base model of step 1 as the source model;
Step 6: in the synthesis stage, obtain context-related linguistic information for the input text to be synthesized through the front-end model, run inference with the duration and acoustic neural network models trained in step 5 to obtain speech feature parameters (including dynamic feature parameters), obtain smooth static speech feature parameters through a parameter generation module, and send the feature parameters to the synthesizer to obtain the target speaker's synthesized speech, as sketched below.
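Tying steps 5 and 6 together, a synthesis-stage sketch under the same tooling assumptions as above; front_end, model, and split_params are hypothetical stand-ins for the front-end analysis module, the adapted duration-plus-acoustic model, and the parameter-generation output.

```python
import numpy as np
import pyworld as pw
import pysptk

def synthesize(text, front_end, model, split_params,
               fs=16000, fft_size=1024, alpha=0.58):
    """Illustrative synthesis chain; frame period must match the analysis."""
    ling = front_end(text)                 # context-dependent linguistic features
    params = model(ling)                   # duration + acoustic model inference
    lf0, mcep, bap = split_params(params)  # static parameters after MLPG
    f0 = np.exp(lf0).astype(np.float64)    # back to linear F0
    sp = pysptk.mc2sp(mcep, alpha=alpha, fftlen=fft_size)  # spectral envelope
    ap = pw.decode_aperiodicity(np.ascontiguousarray(bap, dtype=np.float64),
                                fs, fft_size)
    return pw.synthesize(f0, sp, ap, fs)   # target-speaker waveform
```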
The beneficial effects of this technical scheme are as follows: 1. It removes the limitation of existing systems whose recording text is fixed and unchangeable, giving the user room to choose the recording text freely within a framed range and improving pronunciation accuracy and fluency. 2. Denoising and dereverberating the recording and removing redundant silence segments improve speech quality. 3. Quality evaluation of the recorded file, with re-recording or text-label correction chosen according to recording quality, keeps the recording and its text label consistent, improves data quality, and makes full use of the small amount of training data in the adaptive system. 4. A neural network model (a feed-forward network plus an RNN-LSTM structure), combined with dynamic parameter modeling and a maximum-likelihood parameter generation algorithm, improves the accuracy of speech-feature-parameter modeling.
This embodiment also discloses an adaptive speech synthesis apparatus; as shown in FIG. 3, the apparatus comprises:
a first training module 301, configured to train a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
a recording module 302, configured to design a recording text library from which the user selects a target recording text to record, obtaining a current recording;
a second training module 303, configured to perform secondary training of the trained preset neural network model with the current recording and the target recording text;
and a synthesis module 304, configured to extract static speech parameters for the text to be synthesized with the secondarily trained preset neural network model, and input the static speech parameters into a synthesizer to obtain synthesized speech.
In one embodiment, the recording module comprises:
an establishing sub-module, configured to pre-establish a blank recording text library;
a first acquisition sub-module, configured to acquire N recording texts and enter them into the blank recording text library to form the recording text library;
a pushing sub-module, configured to push M first recording texts for selection upon receiving a user's recording request, where a first recording text is any recording text in the recording text library;
a determining sub-module, configured to determine the first recording text that the user selects from the M first recording texts as the target recording text;
and a receiving sub-module, configured to receive the user's current recording based on the target recording text.
In one embodiment, the apparatus further comprises:
an acquisition module, configured to acquire each sentence of speech in the current recording;
a removal module, configured to remove silence segments exceeding a preset duration from each sentence of speech;
a preprocessing module, configured to perform denoising and dereverberation preprocessing on each sentence of speech;
a detection module, configured to detect whether the preprocessed current speech is complete;
a determining module, configured to use the label corresponding to the target recording text when the detection module detects that the preprocessed current speech is complete;
and a reminding module, configured to remind the user that the preprocessed current speech does not meet the requirement when the detection module detects that the preprocessed current speech is incomplete.
In one embodiment, as shown in FIG. 4, the second training module comprises:
a first extraction sub-module 3031, configured to extract acoustic feature parameters from the preprocessed current speech;
a second extraction sub-module 3032, configured to extract context-related first linguistic information from the target recording text;
a generation sub-module 3033, configured to generate training data from the acoustic feature parameters and the first linguistic information;
and a training sub-module 3034, configured to perform secondary training of the trained preset neural network model with the training data.
In one embodiment, the synthesis module comprises:
a second acquisition sub-module, configured to acquire second linguistic information for the text to be synthesized;
an obtaining sub-module, configured to input the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
a third acquisition sub-module, configured to derive static speech parameters from the speech feature parameters;
a synthesis sub-module, configured to input the static speech parameters into a synthesizer for synthesis;
and an output sub-module, configured to output the synthesized speech when synthesis finishes.
It will be appreciated by those skilled in the art that the qualifiers "first" and "second" in the present invention merely distinguish different phases of application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (6)

1. An adaptive speech synthesis method, comprising the steps of:
training a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
designing a recording text library from which the user selects a target recording text to record, obtaining a current recording;
performing secondary training of the trained preset neural network model with the current recording and the target recording text;
extracting static speech parameters for a text to be synthesized with the secondarily trained preset neural network model, and inputting the static speech parameters into a synthesizer to obtain synthesized speech;
wherein the secondary training of the trained preset neural network model with the current recording and the target recording text comprises the following steps:
extracting acoustic feature parameters from the preprocessed current speech;
extracting context-related first linguistic information from the target recording text;
generating training data from the acoustic feature parameters and the first linguistic information;
performing secondary training of the trained preset neural network model with the training data;
wherein, before the secondary training of the trained preset neural network model with the current recording and the target recording text, the method further comprises:
acquiring each sentence of speech in the current recording;
removing silence segments exceeding a preset duration from each sentence of speech;
performing denoising and dereverberation preprocessing on each sentence of speech;
detecting whether the preprocessed current speech is complete;
if so, using the label corresponding to the target recording text;
otherwise, reminding the user that the preprocessed current speech does not meet the requirement;
wherein detecting whether the preprocessed current speech is complete proceeds as follows: if an insertion error or a deletion error is found in the processed current speech, the user is prompted that the recording quality does not meet the requirement, and the user may choose to repeat the current text or switch to a new text and record again; if there are no insertion or deletion errors but there is a substitution error, the speech is accepted and the original recording text is replaced with the text recognized by the recognizer to generate the label; if neither insertion or deletion errors nor substitution errors occur, the label corresponding to the original recording text is used.
2. The adaptive speech synthesis method according to claim 1, wherein designing the recording text library from which the user selects the target recording text to record, obtaining the current recording, comprises:
establishing a blank recording text library in advance;
acquiring N recording texts and entering them into the blank recording text library to form the recording text library;
upon receiving a user's recording request, pushing M first recording texts for selection, where a first recording text is any recording text in the recording text library;
determining the first recording text that the user selects from the M first recording texts as the target recording text;
and receiving the user's current recording based on the target recording text.
3. The adaptive speech synthesis method according to claim 1, wherein extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model and inputting them into a synthesizer to obtain synthesized speech comprises:
acquiring second linguistic information for the text to be synthesized;
inputting the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
deriving static speech parameters from the speech feature parameters;
inputting the static speech parameters into a synthesizer for synthesis;
and outputting the synthesized speech when synthesis finishes.
4. An adaptive speech synthesis apparatus, the apparatus comprising:
a first training module, configured to train a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
a recording module, configured to design a recording text library from which the user selects a target recording text to record, obtaining a current recording;
a second training module, configured to perform secondary training of the trained preset neural network model with the current recording and the target recording text;
a synthesis module, configured to extract static speech parameters for a text to be synthesized with the secondarily trained preset neural network model, and input the static speech parameters into the synthesizer to obtain synthesized speech;
wherein the second training module comprises:
a first extraction sub-module, configured to extract acoustic feature parameters from the preprocessed current speech;
a second extraction sub-module, configured to extract context-related first linguistic information from the target recording text;
a generation sub-module, configured to generate training data from the acoustic feature parameters and the first linguistic information;
a training sub-module, configured to perform secondary training of the trained preset neural network model with the training data;
wherein the apparatus further comprises:
an acquisition module, configured to acquire each sentence of speech in the current recording;
a removal module, configured to remove silence segments exceeding a preset duration from each sentence of speech;
a preprocessing module, configured to perform denoising and dereverberation preprocessing on each sentence of speech;
a detection module, configured to detect whether the preprocessed current speech is complete;
a determining module, configured to use the label corresponding to the target recording text when the detection module detects that the preprocessed current speech is complete;
a reminding module, configured to remind the user that the preprocessed current speech does not meet the requirement when the detection module detects that the preprocessed current speech is incomplete;
wherein detecting whether the preprocessed current speech is complete proceeds as follows: if an insertion error or a deletion error is found in the processed current speech, the user is prompted that the recording quality does not meet the requirement, and the user may choose to repeat the current text or switch to a new text and record again; if there are no insertion or deletion errors but there is a substitution error, the speech is accepted and the original recording text is replaced with the text recognized by the recognizer to generate the label; if neither insertion or deletion errors nor substitution errors occur, the label corresponding to the original recording text is used.
5. The adaptive speech synthesis apparatus according to claim 4, wherein the recording module comprises:
an establishing sub-module, configured to pre-establish a blank recording text library;
a first acquisition sub-module, configured to acquire N recording texts and enter them into the blank recording text library to form the recording text library;
a pushing sub-module, configured to push M first recording texts for selection upon receiving a user's recording request, where a first recording text is any recording text in the recording text library;
a determining sub-module, configured to determine the first recording text that the user selects from the M first recording texts as the target recording text;
and a receiving sub-module, configured to receive the user's current recording based on the target recording text.
6. The adaptive speech synthesis apparatus according to claim 4, wherein the synthesis module comprises:
a second acquisition sub-module, configured to acquire second linguistic information for the text to be synthesized;
an obtaining sub-module, configured to input the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
a third acquisition sub-module, configured to derive static speech parameters from the speech feature parameters;
a synthesis sub-module, configured to input the static speech parameters into a synthesizer for synthesis;
and an output sub-module, configured to output the synthesized speech when synthesis finishes.
CN202010167018.6A (priority date 2020-03-11, filing date 2020-03-11) - Self-adaptive voice synthesis method and device - granted as CN111429878B (en), legal status Active

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010167018.6A CN111429878B (en) 2020-03-11 2020-03-11 Self-adaptive voice synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010167018.6A CN111429878B (en) 2020-03-11 2020-03-11 Self-adaptive voice synthesis method and device

Publications (2)

Publication Number Publication Date
CN111429878A CN111429878A (en) 2020-07-17
CN111429878B (en) 2023-05-26

Family

ID=71546451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010167018.6A Active CN111429878B (en) 2020-03-11 2020-03-11 Self-adaptive voice synthesis method and device

Country Status (1)

Country Link
CN (1) CN111429878B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634856B (en) * 2020-12-10 2022-09-02 思必驰科技股份有限公司 Speech synthesis model training method and speech synthesis method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000214876A (en) * 1999-01-25 2000-08-04 Sanyo Electric Co Ltd Japanese speech synthesizing method
US6212501B1 (en) * 1997-07-14 2001-04-03 Kabushiki Kaisha Toshiba Speech synthesis apparatus and method
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
CN110473547A (en) * 2019-07-12 2019-11-19 云知声智能科技股份有限公司 A kind of audio recognition method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE449399T1 (en) * 2005-05-31 2009-12-15 Telecom Italia Spa PROVIDING SPEECH SYNTHESIS ON USER TERMINALS OVER A COMMUNICATIONS NETWORK
CN102568472A (en) * 2010-12-15 2012-07-11 盛乐信息技术(上海)有限公司 Voice synthesis system with speaker selection and realization method thereof
CN105047192B (en) * 2015-05-25 2018-08-17 上海交通大学 Statistics phoneme synthesizing method based on Hidden Markov Model and device
CN104934028B (en) * 2015-06-17 2017-11-17 百度在线网络技术(北京)有限公司 Training method and device for the deep neural network model of phonetic synthesis
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
CN105118498B (en) * 2015-09-06 2018-07-31 百度在线网络技术(北京)有限公司 The training method and device of phonetic synthesis model
CN109545194A (en) * 2018-12-26 2019-03-29 出门问问信息科技有限公司 Wake up word pre-training method, apparatus, equipment and storage medium
CN110556129B (en) * 2019-09-09 2022-04-19 北京大学深圳研究生院 Bimodal emotion recognition model training method and bimodal emotion recognition method


Also Published As

Publication number Publication date
CN111429878A (en) 2020-07-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant