CN111429878B - Self-adaptive voice synthesis method and device

Self-adaptive voice synthesis method and device

Info

Publication number
CN111429878B
Authority
CN
China
Prior art keywords
recording
voice
text
current
module
Prior art date
Legal status
Active
Application number
CN202010167018.6A
Other languages
Chinese (zh)
Other versions
CN111429878A (en)
Inventor
贺来朋
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010167018.6A
Publication of CN111429878A
Application granted
Publication of CN111429878B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses an adaptive speech synthesis method and device, comprising the following steps: training a preset neural network model with a preset recording and the text annotation data corresponding to that recording, to obtain a trained preset neural network model; designing a recording text library from which the user selects a target recording text to record, obtaining a current recording; performing secondary training of the trained preset neural network model with the current recording and the target recording text; and extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model, and inputting the static speech parameters into a synthesizer to obtain synthesized speech. This effectively solves the prior-art problem that small, typically low-quality training data and insufficient model prediction accuracy lead to low quality and low accuracy of synthesized speech, and improves the user experience.

Description

Self-adaptive voice synthesis method and device
Technical Field
The present invention relates to the field of speech synthesis technology, and in particular, to a method and apparatus for adaptive speech synthesis.
Background
In recent years, as speech technology has matured, speech synthesis has gradually been applied in speech signal processing systems such as voice interaction, voice broadcasting, and personalized speech production. In the social and commercial fields, synthesized speech serves as an audible persona, brings convenience and richness to social life, and has broad potential value. Existing speech synthesis technology trains duration and acoustic models on a large amount of high-quality recordings and text annotation data from a target speaker, and can then synthesize speech in that speaker's timbre. Because training demands a large amount of high-quality speech, adaptive speech synthesis systems have been proposed: a synthesis system is built quickly from a small amount of a target speaker's recordings and text data, and synthetic speech in the target speaker's timbre is generated. This approach, however, has the following drawback: because the training data is small and typically of low quality, model prediction accuracy is insufficient, so the quality and accuracy of the synthesized speech are low, degrading the user experience.
Disclosure of Invention
To address the problems described above, the present invention performs secondary training of a trained preset neural network model on the user's current recording data, and finally synthesizes speech for a text to be synthesized with the secondarily trained preset neural network model.
An adaptive speech synthesis method comprising the steps of:
training a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
designing a recording text library from which the user selects a target recording text to record, obtaining a current recording;
performing secondary training of the trained preset neural network model with the current recording and the target recording text;
and extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model, and inputting the static speech parameters into a synthesizer to obtain synthesized speech.
Preferably, designing the recording text library from which the user selects the target recording text to record, obtaining the current recording, comprises:
establishing a blank recording text library in advance;
acquiring N recording texts and entering them into the blank recording text library to form the recording text library;
upon receiving a user's recording request, pushing M first recording texts for selection, where a first recording text is any recording text in the recording text library;
determining the first recording text that the user selects from the M first recording texts as the target recording text;
and receiving the user's current recording based on the target recording text.
Preferably, before the secondary training of the trained preset neural network model with the current recording and the target recording text, the method further comprises:
acquiring each sentence of speech in the current recording;
removing silence segments exceeding a preset duration from each sentence of speech;
performing denoising and dereverberation preprocessing on each sentence of speech;
detecting whether the preprocessed current speech is complete;
if so, using the label corresponding to the target recording text;
otherwise, reminding the user that the preprocessed current speech does not meet the requirement.
Preferably, the secondary training of the trained preset neural network model with the current recording and the target recording text comprises:
extracting acoustic feature parameters from the preprocessed current speech;
extracting context-related first linguistic information from the target recording text;
generating training data from the acoustic feature parameters and the first linguistic information;
and performing secondary training of the trained preset neural network model with the training data.
Preferably, extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model and inputting them into a synthesizer to obtain synthesized speech comprises:
acquiring second linguistic information for the text to be synthesized;
inputting the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
deriving static speech parameters from the speech feature parameters;
inputting the static speech parameters into a synthesizer for synthesis;
and outputting the synthesized speech when synthesis finishes.
An adaptive speech synthesis apparatus, the apparatus comprising:
a first training module, configured to train a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
a recording module, configured to design a recording text library from which the user selects a target recording text to record, obtaining a current recording;
a second training module, configured to perform secondary training of the trained preset neural network model with the current recording and the target recording text;
and a synthesis module, configured to extract static speech parameters for the text to be synthesized with the secondarily trained preset neural network model, and input the static speech parameters into a synthesizer to obtain synthesized speech.
Preferably, the recording module comprises:
an establishing sub-module, configured to pre-establish a blank recording text library;
a first acquisition sub-module, configured to acquire N recording texts and enter them into the blank recording text library to form the recording text library;
a pushing sub-module, configured to push M first recording texts for selection upon receiving a user's recording request, where a first recording text is any recording text in the recording text library;
a determining sub-module, configured to determine the first recording text that the user selects from the M first recording texts as the target recording text;
and a receiving sub-module, configured to receive the user's current recording based on the target recording text.
Preferably, the apparatus further comprises:
an acquisition module, configured to acquire each sentence of speech in the current recording;
a removal module, configured to remove silence segments exceeding a preset duration from each sentence of speech;
a preprocessing module, configured to perform denoising and dereverberation preprocessing on each sentence of speech;
a detection module, configured to detect whether the preprocessed current speech is complete;
a determining module, configured to use the label corresponding to the target recording text when the detection module detects that the preprocessed current speech is complete;
and a reminding module, configured to remind the user that the preprocessed current speech does not meet the requirement when the detection module detects that the preprocessed current speech is incomplete.
Preferably, the second training module comprises:
a first extraction sub-module, configured to extract acoustic feature parameters from the preprocessed current speech;
a second extraction sub-module, configured to extract context-related first linguistic information from the target recording text;
a generation sub-module, configured to generate training data from the acoustic feature parameters and the first linguistic information;
and a training sub-module, configured to perform secondary training of the trained preset neural network model with the training data.
Preferably, the synthesis module comprises:
a second acquisition sub-module, configured to acquire second linguistic information for the text to be synthesized;
an obtaining sub-module, configured to input the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
a third acquisition sub-module, configured to derive static speech parameters from the speech feature parameters;
a synthesis sub-module, configured to input the static speech parameters into a synthesizer for synthesis;
and an output sub-module, configured to output the synthesized speech when synthesis finishes.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a workflow diagram of an adaptive speech synthesis method provided by the present invention;
FIG. 2 is another workflow diagram of an adaptive speech synthesis method according to the present invention;
FIG. 3 is a block diagram of an adaptive speech synthesis apparatus according to the present invention;
FIG. 4 is another block diagram of an adaptive speech synthesis apparatus according to the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
In recent years, as speech technology has matured, speech synthesis has gradually been applied in speech signal processing systems such as voice interaction, voice broadcasting, and personalized speech production. In the social and commercial fields, synthesized speech serves as an audible persona, brings convenience and richness to social life, and has broad potential value. Existing speech synthesis technology trains duration and acoustic models on a large amount of high-quality recordings and text annotation data from a target speaker, and can then synthesize speech in that speaker's timbre. Because training demands a large amount of high-quality speech, adaptive speech synthesis systems have been proposed: a synthesis system is built quickly from a small amount of a target speaker's recordings and text data, and synthetic speech in the target speaker's timbre is generated. This approach, however, has the following drawbacks: 1. Because the training data is small and typically of low quality, model prediction accuracy is insufficient, so the quality and accuracy of the synthesized speech are low, degrading the user experience. 2. Recording environments differ greatly between users, and recorded speech data may contain interference such as overlong silence segments, noise, and reverberation that harms model training. 3. Users' actual pronunciations may not match the recording text; missing words, extra words, repetitions, misreadings, overlong pauses, and the like all cause mismatches between audio data and text labels, again harming model training. To solve these problems, the present embodiment discloses a method that performs secondary training of a trained preset neural network model on the user's current recording data, and finally synthesizes speech for a text to be synthesized with the secondarily trained preset neural network model.
An adaptive speech synthesis method, as shown in FIG. 1, comprises the following steps:
Step S101: train a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
Step S102: design a recording text library from which the user selects a target recording text to record, obtaining a current recording;
Step S103: perform secondary training of the trained preset neural network model with the current recording and the target recording text;
Step S104: extract static speech parameters for a text to be synthesized with the secondarily trained preset neural network model, and input the static speech parameters into a synthesizer to obtain synthesized speech.
In this embodiment, a preset neural network model is first trained on a large amount of high-quality preset recordings and their corresponding text annotation data, yielding a trained preset neural network model. The user then selects a suitable target recording text according to his or her preferences and needs and records it, yielding a small amount of current recordings. The trained preset neural network model is then trained a second time on this small recording set, yielding a new model able to synthesize the user's own voice; static speech parameters for any text to be synthesized can be extracted with the new model and input into a synthesizer to obtain synthesized speech in the user's voice. Notably, the large amount of preset speech and its corresponding text annotation data may be recordings of any person; the user's own recordings are not needed at this stage. The preset neural network model comprises a duration model and an acoustic model, i.e., it attends both to the duration of the user's synthesized speech and to synthesizing different voices according to users' different timbres.
The working principle of this technical scheme is as follows: a preset neural network model is trained with a preset recording and its corresponding text annotation data to obtain a trained preset neural network model; a recording text library is designed from which the user selects a target recording text to record, obtaining a current recording; the trained preset neural network model undergoes secondary training with the current recording and the target recording text; and static speech parameters for the text to be synthesized are extracted with the secondarily trained preset neural network model and input into a synthesizer to obtain synthesized speech.
The beneficial effects of this technical scheme are as follows: the user can rely on a preset neural network model trained on preset recordings, retrain it on the current recording, and finally synthesize his or her own voice with the secondarily trained model. Because the first training uses a large amount of high-quality preset recordings, the quality and accuracy of speech synthesized by the model are very high; secondary training on the current recording then yields a model that synthesizes the user's own voice while retaining that high quality and accuracy. This effectively solves the prior-art problems that the training data is small and typically of low quality and that model prediction accuracy is insufficient, which lead to low quality and accuracy of synthesized speech, and it improves the user experience. Moreover, the user can choose the target text to record from the recording text library, diversifying the selection and mitigating the prior-art problem that widely varying user recording environments leave recorded speech data with interference such as overlong silence segments, noise, and reverberation that harms model training.
In one embodiment, designing a recording text library from which the user selects a target recording text to record, obtaining a current recording, comprises:
establishing a blank recording text library in advance;
acquiring N recording texts and entering them into the blank recording text library to form the recording text library;
upon receiving a user's recording request, pushing M first recording texts for selection, where a first recording text is any recording text in the recording text library;
determining the first recording text that the user selects from the M first recording texts as the target recording text;
and receiving the user's current recording based on the target recording text, as sketched below.
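A minimal sketch of this selection flow, with hypothetical names throughout (RecordingTextLibrary, load_n_texts, capture_audio); the patent does not prescribe an implementation:

```python
import random

class RecordingTextLibrary:
    """Illustrative sketch of the recording-text library flow above."""
    def __init__(self):
        self.texts = []                       # pre-established blank library

    def add_texts(self, texts):
        self.texts.extend(texts)              # enter the N recording texts

    def push_candidates(self, m):
        # On a user's recording request, push M first recording texts,
        # each drawn from anywhere in the library.
        return random.sample(self.texts, min(m, len(self.texts)))

# library = RecordingTextLibrary()
# library.add_texts(load_n_texts())               # hypothetical loader
# candidates = library.push_candidates(m=5)
# target_text = candidates[user_choice]           # the user's selection
# current_recording = capture_audio(target_text)  # hypothetical recorder
```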
The beneficial effects of this technical scheme are as follows: providing selectable first recording texts lets the user pick a target recording text suited to his or her age, education level, region, and usage scenario, giving the user many different choices and further improving the user experience.
In one embodiment, before the secondary training of the trained preset neural network model with the current recording and the target recording text, the method further comprises:
acquiring each sentence of speech in the current recording;
removing silence segments exceeding a preset duration from each sentence of speech;
performing denoising and dereverberation preprocessing on each sentence of speech;
detecting whether the preprocessed current speech is complete;
if so, using the label corresponding to the target recording text;
otherwise, reminding the user that the preprocessed current speech does not meet the requirement.
In this embodiment, detecting whether the preprocessed current speech is complete proceeds as follows: if an insertion error or a deletion error is found in the processed current speech, the user is prompted that the recording quality does not meet the requirement, and the user may choose to repeat the current text or switch to a new text and record again. If there are no insertion or deletion errors but there is a substitution error, the speech is accepted, and the original recording text is replaced with the text recognized by the recognizer to generate the label. If no recognition errors occur at all, the label corresponding to the original recording text is used.
The beneficial effects of this technical scheme are as follows: denoising and dereverberating the input speech and removing redundant silence segments improve speech quality and provide good samples for subsequent speech synthesis. Detecting whether the preprocessed current speech is complete, and choosing re-recording or a corrected text label according to the current speech quality, keeps the recording and its text label consistent and improves data quality.
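For concreteness, a minimal preprocessing sketch under assumed tooling (librosa for silence trimming, the noisereduce package for spectral-gating denoising); the patent names no specific algorithms, and dereverberation is left as a placeholder.

```python
import numpy as np
import librosa
import noisereduce as nr

def preprocess_sentence(y, sr, max_silence_s=0.5, top_db=40):
    """Cap overlong internal silences and denoise one sentence of speech."""
    intervals = librosa.effects.split(y, top_db=top_db)  # non-silent spans
    gap = np.zeros(int(max_silence_s * sr), dtype=y.dtype)
    pieces = []
    for k, (s, e) in enumerate(intervals):
        if k:
            pieces.append(gap)            # shortened silence between segments
        pieces.append(y[s:e])
    y = np.concatenate(pieces) if pieces else y
    y = nr.reduce_noise(y=y, sr=sr)       # spectral-gating noise reduction
    # Dereverberation would go here (e.g., a WPE implementation); omitted
    # because the patent does not name a method.
    return y
```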
In one embodiment, as shown in FIG. 2, the secondary training of the trained preset neural network model with the current recording and the target recording text comprises the following steps:
Step S201: extract acoustic feature parameters from the preprocessed current speech;
Step S202: extract context-related first linguistic information from the target recording text;
Step S203: generate training data from the acoustic feature parameters and the first linguistic information;
Step S204: perform secondary training of the trained preset neural network model with the training data.
The beneficial effects of this technical scheme are as follows: secondary training of the preset neural network model with the acoustic feature parameters and the first linguistic information improves the accuracy of speech-feature-parameter modeling and provides a good model from which to extract the static speech parameters required for synthesized speech.
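A feature-extraction sketch follows, assuming the WORLD vocoder (pyworld) plus pysptk, which can produce the LF0, MCEP, and BAP parameters the embodiment names in step 5 below; the analysis order and warping coefficient are illustrative choices, not the patent's settings.

```python
import numpy as np
import pyworld as pw
import pysptk

def extract_acoustic_features(x, fs, mcep_order=59, alpha=0.58):
    """WORLD-based sketch of LF0/MCEP/BAP extraction from one utterance."""
    x = x.astype(np.float64)
    f0, t = pw.dio(x, fs)                  # coarse F0 track
    f0 = pw.stonemask(x, f0, t, fs)        # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)       # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)              # aperiodicity
    lf0 = np.log(np.maximum(f0, 1e-10))    # LF0; unvoiced frames are floored
    mcep = pysptk.sp2mc(sp, order=mcep_order, alpha=alpha)  # MCEP
    bap = pw.code_aperiodicity(ap, fs)     # BAP (band aperiodicity)
    return lf0, mcep, bap
```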
In one embodiment, extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model and inputting them into a synthesizer to obtain synthesized speech comprises:
acquiring second linguistic information for the text to be synthesized;
inputting the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
deriving static speech parameters from the speech feature parameters;
inputting the static speech parameters into a synthesizer for synthesis;
and outputting the synthesized speech when synthesis finishes.
In this embodiment, the speech feature parameters include dynamic speech parameters, and the dynamic speech parameters are converted into static speech parameters according to the established model.
The beneficial effects of this technical scheme are as follows: synthesizing speech from static speech parameters is more stable than using dynamic parameters; removing the unstable factors yields synthesized speech of higher quality.
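The conversion from dynamic to static parameters is conventionally done with maximum-likelihood parameter generation (MLPG), which step 6 below invokes through a parameter generation module. A simplified per-dimension numpy sketch, assuming the model predicts a static and a delta mean with variances for each frame:

```python
import numpy as np

def mlpg(mean, var, delta_win=(-0.5, 0.0, 0.5)):
    """mean, var: (T, 2) arrays of [static, delta] means and variances.
    Solves W^T P W c = W^T P mu for the smooth static trajectory c."""
    T = mean.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row
        for k, w in enumerate(delta_win):       # delta row, edges clipped
            tau = min(max(t + k - 1, 0), T - 1)
            W[2 * t + 1, tau] += w
    P = np.diag(1.0 / var.reshape(-1))          # inverse variances
    mu = mean.reshape(-1)                       # [s0, d0, s1, d1, ...]
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)

# T = 50
# c = mlpg(np.random.randn(T, 2), np.ones((T, 2)))  # static trajectory (T,)
```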
In one embodiment, the method comprises:
Step 1: train a multi-speaker hybrid base neural network model (a feed-forward neural network plus an RNN-LSTM structure) with high-quality multi-speaker recordings and text annotation data, adding speaker embedding information to the neural network input to improve the stability of timbre modeling;
Step 2: design a recording text library under the principle of guaranteed phoneme coverage, containing far more texts than the N sentences actually required to be recorded; select N recording texts at random for each user, and for each recording text let the user choose to skip it and switch to a new one;
Step 3: each time the user records a sentence of speech, pass the recorded audio through an audio preprocessing module that removes overlong silence segments from the recording and denoises and dereverberates the input audio;
Step 4: send the processed audio to an audio quality evaluation module for speech recognition checking; if an insertion error or a deletion error is found, prompt the user that the recording quality does not meet the requirement, and let the user choose to repeat the current text or switch to a new text and record again; if there are no insertion or deletion errors but there is a substitution error, accept the speech and replace the original recording text with the text recognized by the recognizer to generate the label; if no recognition errors occur, use the label corresponding to the original recording text;
Step 5: extract speech feature parameters (including LF0, MCEP, and BAP parameters) from the preprocessed audio, and pass the recording text confirmed by speech recognition through a front-end analysis module to extract context-related linguistic information; generate neural network training data from the speech feature parameters and the linguistic information, and retrain the duration and acoustic neural network models with an adaptation technique, taking the base model of step 1 as the source model;
Step 6: in the synthesis stage, obtain context-related linguistic information for the input text to be synthesized through the front-end model, run inference with the duration and acoustic neural network models trained in step 5 to obtain speech feature parameters (including dynamic feature parameters), obtain smooth static speech feature parameters through a parameter generation module, and send the feature parameters to the synthesizer to obtain the target speaker's synthesized speech, as sketched below.
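Tying steps 5 and 6 together, a synthesis-stage sketch under the same tooling assumptions as above; front_end, model, and split_params are hypothetical stand-ins for the front-end analysis module, the adapted duration-plus-acoustic model, and the parameter-generation output.

```python
import numpy as np
import pyworld as pw
import pysptk

def synthesize(text, front_end, model, split_params,
               fs=16000, fft_size=1024, alpha=0.58):
    """Illustrative synthesis chain; frame period must match the analysis."""
    ling = front_end(text)                 # context-dependent linguistic features
    params = model(ling)                   # duration + acoustic model inference
    lf0, mcep, bap = split_params(params)  # static parameters after MLPG
    f0 = np.exp(lf0).astype(np.float64)    # back to linear F0
    sp = pysptk.mc2sp(mcep, alpha=alpha, fftlen=fft_size)  # spectral envelope
    ap = pw.decode_aperiodicity(np.ascontiguousarray(bap, dtype=np.float64),
                                fs, fft_size)
    return pw.synthesize(f0, sp, ap, fs)   # target-speaker waveform
```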
The beneficial effects of this technical scheme are as follows: 1. It removes the limitation of existing systems whose recording text is fixed and unchangeable, giving the user room to choose the recording text freely within a framed range and improving pronunciation accuracy and fluency. 2. Denoising and dereverberating the recording and removing redundant silence segments improve speech quality. 3. Quality evaluation of the recorded file, with re-recording or text-label correction chosen according to recording quality, keeps the recording and its text label consistent, improves data quality, and makes full use of the small amount of training data in the adaptive system. 4. A neural network model (a feed-forward network plus an RNN-LSTM structure), combined with dynamic parameter modeling and a maximum-likelihood parameter generation algorithm, improves the accuracy of speech-feature-parameter modeling.
This embodiment also discloses an adaptive speech synthesis apparatus; as shown in FIG. 3, the apparatus comprises:
a first training module 301, configured to train a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
a recording module 302, configured to design a recording text library from which the user selects a target recording text to record, obtaining a current recording;
a second training module 303, configured to perform secondary training of the trained preset neural network model with the current recording and the target recording text;
and a synthesis module 304, configured to extract static speech parameters for the text to be synthesized with the secondarily trained preset neural network model, and input the static speech parameters into a synthesizer to obtain synthesized speech.
In one embodiment, the recording module comprises:
an establishing sub-module, configured to pre-establish a blank recording text library;
a first acquisition sub-module, configured to acquire N recording texts and enter them into the blank recording text library to form the recording text library;
a pushing sub-module, configured to push M first recording texts for selection upon receiving a user's recording request, where a first recording text is any recording text in the recording text library;
a determining sub-module, configured to determine the first recording text that the user selects from the M first recording texts as the target recording text;
and a receiving sub-module, configured to receive the user's current recording based on the target recording text.
In one embodiment, the apparatus further comprises:
an acquisition module, configured to acquire each sentence of speech in the current recording;
a removal module, configured to remove silence segments exceeding a preset duration from each sentence of speech;
a preprocessing module, configured to perform denoising and dereverberation preprocessing on each sentence of speech;
a detection module, configured to detect whether the preprocessed current speech is complete;
a determining module, configured to use the label corresponding to the target recording text when the detection module detects that the preprocessed current speech is complete;
and a reminding module, configured to remind the user that the preprocessed current speech does not meet the requirement when the detection module detects that the preprocessed current speech is incomplete.
In one embodiment, as shown in FIG. 4, the second training module comprises:
a first extraction sub-module 3031, configured to extract acoustic feature parameters from the preprocessed current speech;
a second extraction sub-module 3032, configured to extract context-related first linguistic information from the target recording text;
a generation sub-module 3033, configured to generate training data from the acoustic feature parameters and the first linguistic information;
and a training sub-module 3034, configured to perform secondary training of the trained preset neural network model with the training data.
In one embodiment, the synthesis module comprises:
a second acquisition sub-module, configured to acquire second linguistic information for the text to be synthesized;
an obtaining sub-module, configured to input the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
a third acquisition sub-module, configured to derive static speech parameters from the speech feature parameters;
a synthesis sub-module, configured to input the static speech parameters into a synthesizer for synthesis;
and an output sub-module, configured to output the synthesized speech when synthesis finishes.
It will be appreciated by those skilled in the art that the qualifiers "first" and "second" in the present invention merely distinguish different phases of application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (6)

1. An adaptive speech synthesis method, comprising the steps of:
training a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
designing a recording text library from which the user selects a target recording text to record, obtaining a current recording;
performing secondary training of the trained preset neural network model with the current recording and the target recording text;
extracting static speech parameters for a text to be synthesized with the secondarily trained preset neural network model, and inputting the static speech parameters into a synthesizer to obtain synthesized speech;
wherein the secondary training of the trained preset neural network model with the current recording and the target recording text comprises the following steps:
extracting acoustic feature parameters from the preprocessed current speech;
extracting context-related first linguistic information from the target recording text;
generating training data from the acoustic feature parameters and the first linguistic information;
performing secondary training of the trained preset neural network model with the training data;
wherein, before the secondary training of the trained preset neural network model with the current recording and the target recording text, the method further comprises:
acquiring each sentence of speech in the current recording;
removing silence segments exceeding a preset duration from each sentence of speech;
performing denoising and dereverberation preprocessing on each sentence of speech;
detecting whether the preprocessed current speech is complete;
if so, using the label corresponding to the target recording text;
otherwise, reminding the user that the preprocessed current speech does not meet the requirement;
wherein detecting whether the preprocessed current speech is complete proceeds as follows: if an insertion error or a deletion error is found in the processed current speech, the user is prompted that the recording quality does not meet the requirement, and the user may choose to repeat the current text or switch to a new text and record again; if there are no insertion or deletion errors but there is a substitution error, the speech is accepted and the original recording text is replaced with the text recognized by the recognizer to generate the label; if neither insertion or deletion errors nor substitution errors occur, the label corresponding to the original recording text is used.
2. The adaptive speech synthesis method according to claim 1, wherein designing the recording text library from which the user selects the target recording text to record, obtaining the current recording, comprises:
establishing a blank recording text library in advance;
acquiring N recording texts and entering them into the blank recording text library to form the recording text library;
upon receiving a user's recording request, pushing M first recording texts for selection, where a first recording text is any recording text in the recording text library;
determining the first recording text that the user selects from the M first recording texts as the target recording text;
and receiving the user's current recording based on the target recording text.
3. The adaptive speech synthesis method according to claim 1, wherein extracting static speech parameters for the text to be synthesized with the secondarily trained preset neural network model and inputting them into a synthesizer to obtain synthesized speech comprises:
acquiring second linguistic information for the text to be synthesized;
inputting the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
deriving static speech parameters from the speech feature parameters;
inputting the static speech parameters into a synthesizer for synthesis;
and outputting the synthesized speech when synthesis finishes.
4. An adaptive speech synthesis apparatus, the apparatus comprising:
a first training module, configured to train a preset neural network model with a preset recording and the text annotation data corresponding to the preset recording, to obtain a trained preset neural network model;
a recording module, configured to design a recording text library from which the user selects a target recording text to record, obtaining a current recording;
a second training module, configured to perform secondary training of the trained preset neural network model with the current recording and the target recording text;
a synthesis module, configured to extract static speech parameters for a text to be synthesized with the secondarily trained preset neural network model, and input the static speech parameters into the synthesizer to obtain synthesized speech;
wherein the second training module comprises:
a first extraction sub-module, configured to extract acoustic feature parameters from the preprocessed current speech;
a second extraction sub-module, configured to extract context-related first linguistic information from the target recording text;
a generation sub-module, configured to generate training data from the acoustic feature parameters and the first linguistic information;
a training sub-module, configured to perform secondary training of the trained preset neural network model with the training data;
wherein the apparatus further comprises:
an acquisition module, configured to acquire each sentence of speech in the current recording;
a removal module, configured to remove silence segments exceeding a preset duration from each sentence of speech;
a preprocessing module, configured to perform denoising and dereverberation preprocessing on each sentence of speech;
a detection module, configured to detect whether the preprocessed current speech is complete;
a determining module, configured to use the label corresponding to the target recording text when the detection module detects that the preprocessed current speech is complete;
a reminding module, configured to remind the user that the preprocessed current speech does not meet the requirement when the detection module detects that the preprocessed current speech is incomplete;
wherein detecting whether the preprocessed current speech is complete proceeds as follows: if an insertion error or a deletion error is found in the processed current speech, the user is prompted that the recording quality does not meet the requirement, and the user may choose to repeat the current text or switch to a new text and record again; if there are no insertion or deletion errors but there is a substitution error, the speech is accepted and the original recording text is replaced with the text recognized by the recognizer to generate the label; if neither insertion or deletion errors nor substitution errors occur, the label corresponding to the original recording text is used.
5. The adaptive speech synthesis apparatus according to claim 4, wherein the recording module comprises:
an establishing sub-module, configured to pre-establish a blank recording text library;
a first acquisition sub-module, configured to acquire N recording texts and enter them into the blank recording text library to form the recording text library;
a pushing sub-module, configured to push M first recording texts for selection upon receiving a user's recording request, where a first recording text is any recording text in the recording text library;
a determining sub-module, configured to determine the first recording text that the user selects from the M first recording texts as the target recording text;
and a receiving sub-module, configured to receive the user's current recording based on the target recording text.
6. The adaptive speech synthesis apparatus according to claim 4, wherein the synthesis module comprises:
a second acquisition sub-module, configured to acquire second linguistic information for the text to be synthesized;
an obtaining sub-module, configured to input the second linguistic information into the secondarily trained preset neural network model to obtain speech feature parameters;
a third acquisition sub-module, configured to derive static speech parameters from the speech feature parameters;
a synthesis sub-module, configured to input the static speech parameters into a synthesizer for synthesis;
and an output sub-module, configured to output the synthesized speech when synthesis finishes.
CN202010167018.6A (priority date 2020-03-11, filing date 2020-03-11) - Self-adaptive voice synthesis method and device - granted as CN111429878B (en), legal status Active

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010167018.6A CN111429878B (en) 2020-03-11 2020-03-11 Self-adaptive voice synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010167018.6A CN111429878B (en) 2020-03-11 2020-03-11 Self-adaptive voice synthesis method and device

Publications (2)

Publication Number Publication Date
CN111429878A CN111429878A (en) 2020-07-17
CN111429878B (en) 2023-05-26

Family

ID=71546451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010167018.6A Active CN111429878B (en) 2020-03-11 2020-03-11 Self-adaptive voice synthesis method and device

Country Status (1)

Country Link
CN (1) CN111429878B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634856B (en) * 2020-12-10 2022-09-02 思必驰科技股份有限公司 Speech synthesis model training method and speech synthesis method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000214876A (en) * 1999-01-25 2000-08-04 Sanyo Electric Co Ltd Japanese speech synthesizing method
US6212501B1 (en) * 1997-07-14 2001-04-03 Kabushiki Kaisha Toshiba Speech synthesis apparatus and method
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
CN110473547A (en) * 2019-07-12 2019-11-19 云知声智能科技股份有限公司 A kind of audio recognition method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE449399T1 (en) * 2005-05-31 2009-12-15 Telecom Italia Spa PROVIDING SPEECH SYNTHESIS ON USER TERMINALS OVER A COMMUNICATIONS NETWORK
CN102568472A (en) * 2010-12-15 2012-07-11 盛乐信息技术(上海)有限公司 Voice synthesis system with speaker selection and realization method thereof
CN105047192B (en) * 2015-05-25 2018-08-17 上海交通大学 Statistics phoneme synthesizing method based on Hidden Markov Model and device
CN104934028B (en) * 2015-06-17 2017-11-17 百度在线网络技术(北京)有限公司 Training method and device for the deep neural network model of phonetic synthesis
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
CN105118498B (en) * 2015-09-06 2018-07-31 百度在线网络技术(北京)有限公司 The training method and device of phonetic synthesis model
CN109545194A (en) * 2018-12-26 2019-03-29 出门问问信息科技有限公司 Wake up word pre-training method, apparatus, equipment and storage medium
CN110556129B (en) * 2019-09-09 2022-04-19 北京大学深圳研究生院 Bimodal emotion recognition model training method and bimodal emotion recognition method


Also Published As

Publication number Publication date
CN111429878A (en) 2020-07-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant