CN111128119B - Voice synthesis method and device - Google Patents

Voice synthesis method and device

Info

Publication number
CN111128119B
CN111128119B
Authority
CN
China
Prior art keywords
voice
pieces
voice information
information
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911420316.5A
Other languages
Chinese (zh)
Other versions
CN111128119A (en)
Inventor
孙见青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN201911420316.5A priority Critical patent/CN111128119B/en
Publication of CN111128119A publication Critical patent/CN111128119A/en
Application granted granted Critical
Publication of CN111128119B publication Critical patent/CN111128119B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech synthesis method and apparatus, wherein the method comprises the following steps: sequentially recording N pieces of voice information from a user through a preset device; when the preset device finishes recording the first N/2 pieces of voice information, sending them to a server side; training a preset baseline model on the server side with the first N/2 pieces to obtain a first speech synthesis model; when the preset device finishes recording the last N/2 pieces of voice information, sending them to the server side; and training the first speech synthesis model with the last N/2 pieces to obtain a second speech synthesis model. With this technical scheme, a speech synthesis model carrying the user's own speaking style or emotion can be built according to the user's needs; the synthesized result has high naturalness, that is, high similarity to the user's speaking style, emotion, and timbre; the time spent on model construction is short; and the user experience is greatly improved.

Description

Voice synthesis method and device
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus.
Background
Speech synthesis, also known as text-to-speech (TTS) technology, converts arbitrary text information into standard, fluent speech in real time, in effect mounting an artificial mouth on a machine. It draws on several disciplines, including acoustics, linguistics, digital signal processing, and computer science, and is a leading-edge technology in the field of Chinese information processing; the main problem it solves is how to convert textual information into audible sound, that is, how to make a machine speak like a person. Personalized speech synthesis records some speech segments of a specific person through recording equipment so that a TTS system can synthesize speech with that person's voice, speaking style, and speaking emotion.
At present, personalized speech synthesis uses a single speech synthesis model for users of all ages and genders, and synthesis can only follow the specific speaking style and emotion embedded in that model. The similarity between the synthesis results and the user's speaking style, emotion, and timbre is therefore low, which greatly harms the user experience.
Disclosure of Invention
The invention provides a voice synthesis method and a voice synthesis device. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a speech synthesis method, including:
sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
when the preset equipment finishes recording the first N/2 pieces of voice information, sending the first N/2 pieces of voice information to a server side;
training a preset baseline model of the server side through the first N/2 pieces of voice information to obtain a first voice synthesis model;
when the preset equipment finishes recording the last N/2 pieces of voice information, sending the last N/2 pieces of voice information to the server side;
and training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
In an embodiment, the training the preset baseline model of the server side through the first N/2 pieces of speech information to obtain a first speech synthesis model includes:
when the number of the N pieces of voice information is smaller than the preset number, determining that the first voice synthesis model reaches a convergence state;
acquiring a preset number of models generated in the process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
and when the number of the N pieces of voice information is greater than or equal to the preset number, selecting a model meeting a preset standard from among the preset number of models as the first voice synthesis model.
In one embodiment, the speech synthesis method further comprises:
before the voice information is sent to the server side, noise reduction processing and screening processing are carried out on the voice information, and the voice information after the noise reduction processing and the screening processing are completed is sent to the server side.
In one embodiment, the screening processing of the voice information includes:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
In one embodiment, the second speech synthesis model has reached a converged state.
According to a second aspect of the embodiments of the present invention, there is provided a speech synthesis apparatus including:
the recording module is used for sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
the first sending module is used for sending the first N/2 pieces of voice information to a server side when the preset equipment finishes recording the first N/2 pieces of voice information;
the first training module is used for training a preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model;
the second sending module is used for sending the last N/2 pieces of voice information to the server side when the preset equipment finishes recording the last N/2 pieces of voice information;
and the second training module is used for training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
In one embodiment, the first training module comprises:
the determining submodule is used for determining that the first voice synthesis model reaches a convergence state when the number of the N pieces of voice information is smaller than the preset number;
the first obtaining submodule is used for obtaining a preset number of models generated in the process of training the preset baseline model of the server side through the first N/2 pieces of voice information;
and the selecting submodule is used for selecting a model which meets a preset standard from the preset number of models as the first voice synthesis model when the number of the N pieces of voice information is greater than or equal to the preset number.
In one embodiment, the speech synthesis apparatus further includes:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and sending the voice information after the noise reduction processing and the screening processing are finished to the server side.
In one embodiment, the processing module includes:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information in the voice information so as to judge whether the first voiceprint information is matched with the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
In one embodiment, the second speech synthesis model has reached a converged state.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the method comprises the steps of recording N pieces of voice information for a user in sequence through preset equipment, sending the front N/2 pieces of voice information to a server end when the preset equipment records the front N/2 pieces of voice information, then training a preset baseline model of the server end through the front N/2 pieces of voice information to obtain a first voice synthesis model, sending the back N/2 pieces of voice information to the server end when the preset equipment records the back N/2 pieces of voice information, further training the first voice synthesis model through the back N/2 pieces of voice information to obtain a second voice synthesis model, and then performing personalized voice synthesis through the second voice synthesis model, wherein compared with the method that a single voice synthesis model is adopted for users of different ages and different sexes, in the technical scheme of the invention, the method can synthesize the voice synthesis model with the user speaking mode or emotion according to the requirements of the user, namely, when the user needs to perform personalized voice synthesis through the voice synthesis model, N pieces of voice information of the user can be recorded in sequence, a preset baseline model is trained through the front N \2 pieces of voice information to obtain a first voice synthesis model, when the user urgently needs to perform personalized voice synthesis, the first voice synthesis model can be directly synthesized through the first voice synthesis model, then the first voice model is trained through the rear N \2 pieces of voice information on the basis of the first voice synthesis model to obtain a second voice synthesis model meeting the requirements of the user, the result naturalness of the synthesis of the second voice synthesis model is high, namely, the result has high similarity with the user speaking mode, emotion and the tone in the user voice, because the number of the 
required voices is small during model training, the time spent on model construction is short, and the user experience is greatly improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of another speech synthesis apparatus according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present invention, and as shown in fig. 1, the method can be implemented as the following steps S11-S15:
in step S11, sequentially recording N pieces of voice information to a user through a preset device, where N is a positive integer; the value of N may be, but is not limited to, tens, for example thirty.
In step S12, when the preset device finishes recording the first N/2 pieces of voice information, they are sent to the server. When N is even, the first N/2 pieces are the first half of the N recordings and the last N/2 pieces are the second half; when N is odd, the extra recording is counted in the first N/2 pieces, so that, for example, with N = 31 the first 16 pieces belong to the first N/2 and the remaining 15 pieces belong to the last N/2;
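The even/odd split rule above can be written as a small helper; this is only an illustrative sketch, as the patent does not prescribe an implementation:

```python
def split_counts(n):
    """Return (first, last): how many of n recordings go into each batch.

    For even n the two batches are equal; for odd n the extra recording
    joins the first batch, matching the example of n = 31 -> (16, 15).
    """
    first = (n + 1) // 2  # ceiling of n/2
    return first, n - first
```

The ceiling split lets server-side training start on the larger first batch as early as possible.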
in step S13, a preset baseline model on the server side is trained with the first N/2 pieces of speech information to obtain a first speech synthesis model. The preset baseline model, obtained by training on a large amount of speech data, can be used for speech synthesis but not for personalized speech synthesis; the first speech synthesis model can be used for personalized synthesis, but its synthesis quality has not yet reached the optimal state.
In step S14, when the preset device finishes recording the last N/2 pieces of voice information, sending the last N/2 pieces of voice information to the server;
in step S15, the first speech synthesis model is trained by the last N/2 pieces of speech information to obtain a second speech synthesis model, where the second speech synthesis model is used for speech synthesis, that is, the second speech synthesis model can perform personalized speech synthesis, and the speech synthesis effect is optimal.
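The two-stage training of steps S13 and S15 can be sketched as follows. `ToyModel` and `train` are illustrative stand-ins, not from the patent (which specifies neither a model architecture nor a training algorithm); training is reduced here to tracking which utterances each model has seen:

```python
class ToyModel:
    """Stand-in for a speech synthesis model; only tracks training history."""
    def __init__(self, name, history=()):
        self.name = name
        self.history = list(history)

def train(model, utterances, name):
    """Placeholder for adaptation training: returns a new model that has
    seen its parent's data plus the new utterances."""
    return ToyModel(name, model.history + list(utterances))

def two_stage_pipeline(baseline, recordings):
    first = (len(recordings) + 1) // 2
    # Stage 1 (step S13): adapt the pretrained baseline on the first batch.
    model1 = train(baseline, recordings[:first], "first_synthesis_model")
    # model1 is already usable for personalized synthesis at this point.
    # Stage 2 (step S15): continue from model1 on the second batch.
    model2 = train(model1, recordings[first:], "second_synthesis_model")
    return model1, model2
```

The key property the sketch preserves is that the second model starts from the first model's state rather than from the baseline, so nothing learned from the first batch is discarded.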
As shown in fig. 2, in one embodiment, the above step S13 can be implemented as the following steps S131-S133:
in step S131, when the number of the N pieces of speech information is smaller than the preset number, it is determined that the first speech synthesis model has reached a convergence state; the convergence state here is the state in which the mean squared error is minimal.
In step S132, obtaining a preset number of models generated in a process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
in step S133, when the number of the N pieces of speech information is greater than or equal to the preset number, a model meeting a preset standard is selected from the preset number of models as the first speech synthesis model. The preset standard uses a validation set as test samples, where the data in the validation set are voice recordings made by the user: each of the preset number of models is evaluated on the validation set by computing its predicted mean squared error, the errors are compared, and the fitted model with the smallest predicted mean squared error is selected as the first speech synthesis model.
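Selecting the best intermediate checkpoint by validation mean squared error might look like the sketch below; the checkpoints are represented abstractly as prediction functions, which is an assumption for illustration rather than the patent's actual model interface:

```python
def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def select_checkpoint(checkpoints, val_inputs, val_targets):
    """Pick the checkpoint with the smallest validation MSE.

    `checkpoints` maps a checkpoint name to a prediction function; the
    validation set holds utterances recorded by the same user.
    """
    scores = {
        name: mse([predict(x) for x in val_inputs], val_targets)
        for name, predict in checkpoints.items()
    }
    return min(scores, key=scores.get)
```

Keeping the last few checkpoints and scoring them on held-out user recordings is a standard way to realize "select a model meeting a preset standard".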
A preset number of models generated while training the server's preset baseline model with the first N/2 pieces of voice information are obtained, and when the number of the N pieces of voice information is greater than or equal to the preset number, the model meeting the preset standard is selected from among them as the first voice synthesis model. This ensures that the first speech synthesis model is the best-performing model produced during training and provides the best possible foundation for generating the second speech synthesis model.
In one embodiment, the speech synthesis method further comprises:
before the voice information is sent to the server side, noise reduction processing and screening processing are performed on it, and the processed voice information is sent to the server side. For example, the screening processing can remove overly long silent sections from the voice information.
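One common form of this screening step, removing overly long silent stretches, can be sketched with a simple frame-energy threshold. The frame size, energy floor, and allowed silence length below are illustrative assumptions, not values from the patent:

```python
def trim_long_silence(samples, frame=160, energy_floor=1e-4, max_silent_frames=5):
    """Drop consecutive silent frames beyond `max_silent_frames`, so short
    natural pauses survive but long stretches of dead air are removed."""
    out, run = [], 0
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        energy = sum(s * s for s in chunk) / len(chunk)  # mean frame energy
        if energy < energy_floor:
            run += 1
            if run > max_silent_frames:
                continue  # silence has gone on too long: skip this frame
        else:
            run = 0
        out.extend(chunk)
    return out
```

Production systems would typically use a proper voice activity detector, but the energy-threshold idea is the same.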
Performing noise reduction processing and screening processing on the voice information yields processed speech; training the model on this processed speech gives the finally obtained model a higher accuracy rate during speech synthesis.
In one embodiment, the screening processing of the voice information includes:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
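Voiceprint matching is typically done by comparing embedding vectors; a cosine-similarity check against the pre-stored enrollment voiceprint is one plausible realization. The embedding representation and the 0.75 threshold are assumptions for illustration, as the patent does not specify a matching method:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def voiceprints_match(enrolled, extracted, threshold=0.75):
    """True when the voiceprint extracted from a new recording is close
    enough to the user's pre-stored (first) voiceprint."""
    return cosine_similarity(enrolled, extracted) >= threshold
```

Recordings whose extracted voiceprint fails this check would be excluded before training.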
The second voiceprint information is extracted from the voice information recorded by the user and then matched against the pre-stored first voiceprint information, so that recordings that were not made by the user can be excluded before training.
In one embodiment, the second speech synthesis model has reached a converged state.
When the second speech synthesis model reaches the convergence state, the speech synthesis effect is optimal.
As to the above speech synthesis method provided by the embodiment of the present invention, an embodiment of the present invention further provides a speech synthesis apparatus, as shown in fig. 3, the apparatus includes:
the recording module 31 is configured to record N pieces of voice information from a user in sequence through a preset device, where N is a positive integer;
The first sending module 32 is configured to send the first N/2 pieces of voice information to a server side when the preset device finishes recording the first N/2 pieces of voice information;
the first training module 33 is configured to train a preset baseline model of the server through the first N/2 pieces of speech information to obtain a first speech synthesis model;
the second sending module 34 is configured to send the last N/2 pieces of voice information to the server when the preset device finishes recording them;
a second training module 35, configured to train the first speech synthesis model through the last N/2 pieces of speech information to obtain a second speech synthesis model, where the second speech synthesis model is used for speech synthesis.
As shown in fig. 4, in one embodiment, the first training module 33 includes:
the determining submodule 331 is configured to determine that the first speech synthesis model reaches a convergence state when the number of the N pieces of speech information is smaller than a preset number;
a first obtaining submodule 332, configured to obtain a preset number of models generated in a process of training a preset baseline model of the server through the first N/2 pieces of voice information;
the selecting submodule 333 is configured to select, when the number of the N pieces of speech information is greater than or equal to a preset number, a model meeting a preset standard from among the preset number of models as the first speech synthesis model.
In one embodiment, the speech synthesis apparatus further includes:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and sending the voice information after the noise reduction processing and the screening processing are finished to the server side.
In one embodiment, the processing module includes:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information in the voice information so as to judge whether the first voiceprint information is matched with the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
In one embodiment, the second speech synthesis model has reached a converged state.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A method of speech synthesis, comprising:
sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
when the preset equipment finishes recording the first N/2 pieces of voice information, sending the first N/2 pieces of voice information to a server side;
training the preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model, which specifically comprises the following steps:
when the number of the N pieces of voice information is smaller than the preset number, determining that the first voice synthesis model reaches a convergence state;
acquiring a preset number of models generated in the process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
when the number of the N pieces of voice information is larger than or equal to a preset number, selecting a model which meets a preset standard from the preset number of models as the first voice synthesis model;
when the preset equipment finishes recording the last N/2 pieces of voice information, sending the last N/2 pieces of voice information to the server side;
and training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
2. The method of claim 1, further comprising:
before the voice information is sent to the server side, noise reduction processing and screening processing are carried out on the voice information, and the voice information after the noise reduction processing and the screening processing are completed is sent to the server side.
3. The method of claim 2, wherein the screening processing of the voice information comprises:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
4. The method of claim 1, wherein the second speech synthesis model has reached a converged state.
5. A speech synthesis apparatus, comprising:
the recording module is used for sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
the first sending module is used for sending the first N/2 pieces of voice information to a server side when the preset equipment finishes recording the first N/2 pieces of voice information;
the first training module is used for training the preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model, and comprises:
the determining submodule is used for determining that the first voice synthesis model reaches a convergence state when the number of the N pieces of voice information is smaller than the preset number;
the first obtaining submodule is used for obtaining a preset number of models generated in the process of training the preset baseline model of the server side through the first N/2 pieces of voice information;
the selecting submodule is used for selecting, from the preset number of models, a model meeting a preset standard as the first voice synthesis model when the number of the N pieces of voice information is greater than or equal to the preset number;
the second sending module is used for sending the last N/2 pieces of voice information to the server side when the preset equipment finishes recording the last N/2 pieces of voice information;
and the second training module is used for training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
6. The apparatus of claim 5, further comprising:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and for sending the voice information on which the noise reduction processing and the screening processing have been completed to the server side.
7. The apparatus of claim 6, wherein the processing module comprises:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information from the voice information, so as to determine whether the first voiceprint information matches the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
8. The apparatus of claim 5, wherein the second voice synthesis model has reached a converged state.
CN201911420316.5A 2019-12-31 2019-12-31 Voice synthesis method and device Active CN111128119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911420316.5A CN111128119B (en) 2019-12-31 2019-12-31 Voice synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911420316.5A CN111128119B (en) 2019-12-31 2019-12-31 Voice synthesis method and device

Publications (2)

Publication Number Publication Date
CN111128119A CN111128119A (en) 2020-05-08
CN111128119B (en) 2022-04-22

Family

ID=70506935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911420316.5A Active CN111128119B (en) 2019-12-31 2019-12-31 Voice synthesis method and device

Country Status (1)

Country Link
CN (1) CN111128119B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862933A (en) * 2020-07-20 2020-10-30 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating synthesized speech

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8175873B2 (en) * 2008-12-12 2012-05-08 At&T Intellectual Property I, L.P. System and method for referring to entities in a discourse domain
CN105185372B (en) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
US11238843B2 (en) * 2018-02-09 2022-02-01 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
CN110148398A (en) * 2019-05-16 2019-08-20 平安科技(深圳)有限公司 Training method, device, equipment and the storage medium of speech synthesis model

Also Published As

Publication number Publication date
CN111128119A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN108847215B (en) Method and device for voice synthesis based on user timbre
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
JP2013539558A (en) Parameter speech synthesis method and system
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
US20210327446A1 (en) Method and apparatus for reconstructing voice conversation
JP2021110943A (en) Cross-lingual voice conversion system and method
CN112735371B (en) Method and device for generating speaker video based on text information
CN111916054B (en) Lip-based voice generation method, device and system and storage medium
WO2019119279A1 (en) Method and apparatus for emotion recognition from speech
CN108091323A (en) For identifying the method and apparatus of emotion from voice
CN111128119B (en) Voice synthesis method and device
CN114283783A (en) Speech synthesis method, model training method, device and storage medium
CN112580669B (en) Training method and device for voice information
CN117275498A (en) Voice conversion method, training method of voice conversion model, electronic device and storage medium
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
CN111383627B (en) Voice data processing method, device, equipment and medium
US20230252971A1 (en) System and method for speech processing
Yanagisawa et al. Noise robustness in HMM-TTS speaker adaptation
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
JP6000326B2 (en) Speech synthesis model learning device, speech synthesis device, speech synthesis model learning method, speech synthesis method, and program
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
Nthite et al. End-to-End Text-To-Speech synthesis for under resourced South African languages
CN111429878A (en) Self-adaptive speech synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant